## Introduction:
   We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better.
   As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

## Goal:
   Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

## Resources:
   Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we first look for relevant existing data that is available at no cost. Luckily, there are two data sets that seem suitable for our goals:
- A data set containing ~10,000 Android apps from Google Play. The data was collected in August 2018. Link: https://dq-content.s3.amazonaws.com/350/googleplaystore.csv
- A data set containing ~7,000 iOS apps from the App Store. The data was collected in July 2017. Link: https://dq-content.s3.amazonaws.com/350/AppleStore.csv

## Exploring the Data

Let's create an `explore_data` function to look into a data set.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    Print a slice of a data set and, optionally, its dimensions.

    dataset: list of lists (one inner list per row, no header row)
    start: integer, index where the slice starts
    end: integer, index where the slice ends (exclusive)
    rows_and_columns: boolean, False by default; when True,
        also prints the number of rows and columns
    '''
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
from csv import reader
### The Google Play data set 
open_file = open('googleplaystore.csv',encoding="utf8")
read_file = reader(open_file)
android = list(read_file)
android_header = android[0] # Separate the header row
android = android[1:]

### The App Store data set
open_file = open('AppleStore.csv',encoding="utf8")
read_file = reader(open_file)
apple = list(read_file)
apple_header = apple[0]    # Separate the header row
apple = apple[1:]
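
The cell above leaves both files open after reading them. As a minimal sketch, the same loading step can be wrapped in a helper that closes the file automatically. The name `load_csv` is hypothetical and not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file.

    `load_csv` is a hypothetical helper introduced for illustration.
    """
    # The with-statement closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as csv_file:
        rows = list(reader(csv_file))
    # The first row is the header; everything after it is data.
    return rows[0], rows[1:]
```

Assuming both files sit in the working directory, `android_header, android = load_csv('googleplaystore.csv')` and `apple_header, apple = load_csv('AppleStore.csv')` would produce the same four variables as the cell above.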

In [3]:
explore_data(android,0,5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
explore_data(apple, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


In [5]:
print(android_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [6]:
print(apple_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


## Data Cleaning

#### Detect **inaccurate data**, and correct or remove it.

The Google Play data set has a dedicated discussion section (https://www.kaggle.com/lava18/google-play-store-apps/discussion) where an error in one row is described. Row 10472 (counting without the header row) is missing its 'Category' value, which shifts every following column one position to the left: the rating '1.9' sits where the category should be.
The output below confirms this: the row has 12 fields, while the header has 13.

In [7]:
print(android[10472])
print('\n')
print(android_header)
print(len(android[10472]))
print(len(android_header))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
12
13


In [8]:
del android[10472]
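
The faulty row was found through the discussion section, but a column shift like this can also be detected programmatically by comparing each row's length with the header's length. The sketch below uses a small made-up sample, and `find_malformed_rows` is a hypothetical helper name.

```python
def find_malformed_rows(dataset, header):
    """Return (index, row_length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny made-up sample: the second row is missing one field.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '3.8'],
]
print(find_malformed_rows(sample_rows, sample_header))  # [(1, 2)]
```

A check like this has to run before the deletion, because after `del android[10472]` that index refers to a different row.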

#### Detect **duplicate data**, and remove the duplicates.

In [9]:
# Let's check our data for duplicate entries with a for loop

duplicate_entries = [] # a new list for duplicate entries
unique_entries = [] # a list to contain unique entries

for row in android: # iterate through the rows
    name = row[0]   # within each row, go for the name.
    if name in unique_entries: # if name already exists in unique_entries,
        duplicate_entries.append(name) # append it to duplicate_entries
    else: # or if it never existed in unique_entries, it must be unique
        unique_entries.append(name) # add it into the list of unique_entries
        
print('Number of Duplicate entries:\n{}'.format(len(
    duplicate_entries)))
print('Starting 5 entries:\n{}'.format(duplicate_entries[:5]))

Number of Duplicate entries:
1181
Starting 5 entries:
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
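
The loop above tests `name in unique_entries` against a list, which scans the whole list on every iteration. As an alternative sketch, `collections.Counter` from the standard library can count the names in one pass. The helper name `count_duplicates` and the sample rows are made up for illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Count how many rows repeat a name that already appeared earlier."""
    counts = Counter(row[name_index] for row in dataset)
    # Every occurrence beyond the first counts as one duplicate entry.
    return sum(n - 1 for n in counts.values())

sample = [['Box'], ['Slack'], ['Box'], ['Box']]
print(count_duplicates(sample))  # 2
```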


In [10]:
for row in android: # iterate through every row
    name = row[0] # go for the name inside a row, the first item,
    if name == 'Quick PDF Scanner + OCR FREE': # if name=='X'
        print(row) # print the entire row with name 'X'

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


The dictionary built below maps each unique app name to its highest number of reviews.

In [11]:
# We want to keep the row with the maximum number of reviews,
# so let's build a dictionary that records that maximum per app
reviews_max = {} # make a new dictionary
for row in android:  # iterate through each row of the dataset
    name = row[0] # the first element of each row is the app name
    n_reviews = float(row[3]) # the element at index 3 is the number of reviews; convert it to a float
    if name in reviews_max and reviews_max[name] < n_reviews: # name already seen, but this row has more reviews
        reviews_max[name] = n_reviews
    elif name not in reviews_max: # first time we see this name
        reviews_max[name] = n_reviews
        

In [12]:
android_clean = []
already_added = []
for row in android:
    name = row[0]
    n_reviews = float(row[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))
explore_data(android_clean,0,10)

9659
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']


['Smoke Effect Photo Maker - Smoke Editor', 'ART_AND_DESIGN', '3.8', '178', '19M', '50

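Here we added a row to android_clean only when its review count matches the value stored in reviews_max and its name had not been added already, so each app appears exactly once.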
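
The two steps (build the dictionary, then filter the rows) can also be merged into one pass by storing the whole row instead of only the review count. This is a sketch of an alternative, not the approach used in this notebook. `keep_most_reviewed` is a hypothetical name, and the sample rows are made up.

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # On a tie the earlier row is kept, because the comparison is strict.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

sample = [
    ['A', 'x', '4.2', '80805'],
    ['B', 'x', '4.0', '10'],
    ['A', 'x', '4.2', '80804'],
    ['A', 'x', '4.2', '80900'],
]
print(keep_most_reviewed(sample))
```

Dictionaries preserve insertion order in Python 3.7 and later, so the apps come out in the order in which their names first appear.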

In [13]:
duplicate_entries = []
unique_entries = []

for row in apple:
    if row[0] in unique_entries:
        duplicate_entries.append(row[0])
    else :
        unique_entries.append(row[0])
        
print('Number of unique apps: ' ,len(unique_entries))
print('Number of duplicate apps: ' , len(duplicate_entries))
print('\n')
print('Examples of duplicate apps:', duplicate_entries[:15])

Number of unique apps:  7197
Number of duplicate apps:  0


Examples of duplicate apps: []


There are no duplicate entries in the iOS data set.

### Removing Non-English Apps

The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this range, we can build a function that detects whether a character belongs to the set of common English characters: if its number is less than or equal to 127, it does.

If an app name contains a character whose number is greater than 127, then the app probably has a non-English name.
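
To make the number range concrete, the built-in `ord()` function returns the number (Unicode code point) of a single character. The sample characters below were picked for illustration.

```python
# ord() returns the Unicode code point of a single character.
samples = ['a', 'Z', '9', '™', '爱', '😜']
code_points = {character: ord(character) for character in samples}

for character, code_point in code_points.items():
    # True means the character is within the ASCII range 0-127.
    print(character, code_point, code_point <= 127)
```

The letters and the digit fall inside the range, while ™, 爱 and 😜 all have numbers far above 127.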

In [14]:
def is_english(string):
    for element in string:
        if ord(element) > 127:
            return False
    return True


Let's check whether the function works correctly:


In [15]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
False
False


The function does what we asked of it, but if we use it as is, we'll lose useful data, since many English apps are incorrectly labeled as non-English. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127, as the last two results show.

Let's change the function to make it a bit more lenient: a name is only rejected when it contains more than three non-ASCII characters.

In [16]:
def is_english(string):
    non_ascii = 0
    for element in string:
        if ord(element) > 127:
            non_ascii += 1
    if non_ascii > 3:
        return False
    else:
        return True

Let's test the newer version of the function we just made:

In [17]:
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))

True
True
False


Next, we apply the function above to our data sets to remove apps whose names contain more than three non-ASCII characters.

In [18]:
english_android = []
english_apple = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        english_android.append(app)
        
for app in apple:
    name = app[1]
    if is_english(name):
        english_apple.append(app)
        
print('After applying the function, we are left with:\n {} android apps\n'.format(len(english_android)))        
print('After applying the function, we are left with:\n {} apple apps\n'.format(len(english_apple)))        



After applying the function, we are left with:
 9614 android apps

After applying the function, we are left with:
 6183 apple apps



### Isolating Free English Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [19]:
free_english_android = []
free_english_apple = []

for row in english_android:
    name = row[0]
    price = row[6] # index 6 is the 'Type' column, which reads 'Free' for free apps
    
    if price == "Free":
        free_english_android.append(row)


for row in english_apple:
    name = row[1]
    price = row[4]
    
    if price == "0.0":
        free_english_apple.append(row)

print('Android apps that are in English and free:\n', len(free_english_android))

print('Apple apps that are in English and free:\n', len(free_english_apple))

Android apps that are in English and free:
 8863
Apple apps that are in English and free:
 3222
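
The cell above compares strings: 'Free' in the Android 'Type' column and '0.0' in the Apple `price` column. A sketch of a stricter alternative is to parse the price text as a number. It assumes that paid Android prices carry a leading dollar sign (for example '$4.99'), which is not shown in the rows printed in this notebook. `is_free` is a hypothetical helper name.

```python
def is_free(price_text):
    """Return True when a price string parses to zero.

    Assumption: prices look like '0', '0.0' or '$4.99'.
    """
    cleaned = price_text.replace('$', '').strip()
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```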


## Data Analysis

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. 

Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.

Let's build two functions to analyze the frequencies:

- `freq_table` to generate frequency tables that show percentages
- `display_table` to display the percentages in descending order

In [20]:
def freq_table(dataset, index):
    freq_apps = {}
    total = 0
    
    for row in dataset:
        total += 1
        val = row[index]
        if val in freq_apps:
            freq_apps[val] += 1
        else:
            freq_apps[val] = 1
    
    # Convert frequencies to percentages
    table_percentages = {}
    for key in freq_apps:
        percentage = (freq_apps[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [21]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
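
For comparison, the same percentage table can be sketched with `collections.Counter` from the standard library, whose `most_common()` method already returns entries in descending order of frequency. The name `freq_table_counter` and the sample rows are made up for illustration.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage}, ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.most_common()}

sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, percentage in freq_table_counter(sample, 1).items():
    print(genre, ':', percentage)
# Games : 75.0
# Music : 25.0
```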

Let's check both functions on our data sets.

**Free English Android Apps:**

In [22]:
display_table(free_english_android, 9)

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.580841701455489
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.542818458761142
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2494640640866526
Action : 3.102786866749408
Health & Fitness : 3.068938282748505
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8616721200496444
Video Players & Editors : 1.7826920907142052
Casual : 1.7488435067133026
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
St

**Free English Apple Apps:**

In [23]:
display_table(free_english_apple, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


##### Displaying percentages of the Category column for the Android data set

In [24]:
display_table(free_english_android, 1)

FAMILY : 19.21471285117906
GAME : 9.511452104253639
TOOLS : 8.462146000225657
BUSINESS : 4.580841701455489
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.542818458761142
SPORTS : 3.4187069840911652
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2494640640866526
HEALTH_AND_FITNESS : 3.068938282748505
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7826920907142052
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.128286133363421
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
ENTERTAINMENT : 0.8800631840234684
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0.64

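Among free English apps, the App Store is dominated by apps designed for fun: Games alone make up about 58% of them, followed by Entertainment at about 8%. Google Play shows a more balanced landscape, with Family (about 19%), Game (about 9.5%) and Tools (about 8.5%) leading and practical categories spread fairly evenly behind them. Note that these tables show how many apps each genre has, not how many users it attracts.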

## Most Popular Apps by Genre 


#### on Apple's App Store

In [25]:
genres_apple = freq_table(free_english_apple, -5)
genres_table = []
count = 0

for genre in genres_apple:
    total = 0
    len_genre = 0
    
    for row in free_english_apple:
        genre_app = row[-5]
        
        if genre_app == genre:            
            n_ratings = float(row[5])
            total += n_ratings
            len_genre += 1
            
    avg_n_ratings = total / len_genre
    key_val_tuple1 = (avg_n_ratings, genre)
    genres_table.append(key_val_tuple1)
    count += avg_n_ratings
    
#Sorting in Descending order    
sorted_table1 = sorted(genres_table, reverse = True)

for element in sorted_table1:
    print(element[1], ':', round(element[0]/count*100,2), '%' )

Navigation : 12.12 %
Reference : 10.55 %
Social Networking : 10.08 %
Music : 8.07 %
Weather : 7.36 %
Book : 5.6 %
Food & Drink : 4.69 %
Finance : 4.43 %
Photo & Video : 4.01 %
Travel : 3.98 %
Shopping : 3.79 %
Health & Fitness : 3.28 %
Sports : 3.24 %
Games : 3.21 %
News : 2.99 %
Productivity : 2.96 %
Utilities : 2.63 %
Lifestyle : 2.32 %
Entertainment : 1.98 %
Business : 1.06 %
Education : 0.99 %
Catalogs : 0.56 %
Medical : 0.09 %
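
The cell above loops over the whole data set once per genre. As a sketch of an alternative, the per-genre averages can be collected in a single pass with `collections.defaultdict`. `average_by_group` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column within that group}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

sample = [['Games', '100'], ['Games', '300'], ['Music', '50']]
print(average_by_group(sample, 0, 1))  # {'Games': 200.0, 'Music': 50.0}
```

With the data from this notebook, the call would be `average_by_group(free_english_apple, -5, 5)`, which gives the average number of ratings per genre before it is converted into the percentage shares printed above.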


**Observations**

Top 5 genres on the App Store, ranked by average number of user ratings per app:
1. Navigation 
2. Reference
3. Social Networking
4. Music
5. Weather 

#### on Google Play

In [26]:
genres_android = freq_table(free_english_android, 1)
genres_table = []
count2 = 0

for genre in genres_android:
    total = 0
    len_genre = 0
    
    for row in free_english_android:
        genre_app = row[1]
        if genre_app == genre:            
            n_installs = row[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_genre += 1
            
    avg_n_installs = total / len_genre
    key_val_tuple2 = (avg_n_installs, genre)
    genres_table.append(key_val_tuple2)
    count2 += avg_n_installs
    
sorted_table2 = sorted(genres_table, reverse = True)

for element in sorted_table2:
    print(element[1], ':', round(element[0]/count2*100,2), '%')

COMMUNICATION : 16.17 %
VIDEO_PLAYERS : 10.46 %
SOCIAL : 9.81 %
PHOTOGRAPHY : 7.53 %
PRODUCTIVITY : 7.07 %
TRAVEL_AND_LOCAL : 5.9 %
GAME : 5.45 %
TOOLS : 4.56 %
NEWS_AND_MAGAZINES : 4.03 %
ENTERTAINMENT : 3.86 %
BOOKS_AND_REFERENCE : 3.7 %
SHOPPING : 2.97 %
PERSONALIZATION : 2.19 %
FAMILY : 2.19 %
WEATHER : 2.14 %
SPORTS : 1.8 %
HEALTH_AND_FITNESS : 1.76 %
MAPS_AND_NAVIGATION : 1.71 %
ART_AND_DESIGN : 0.84 %
FOOD_AND_DRINK : 0.81 %
EDUCATION : 0.75 %
BUSINESS : 0.72 %
LIFESTYLE : 0.61 %
FINANCE : 0.59 %
HOUSE_AND_HOME : 0.56 %
DATING : 0.36 %
COMICS : 0.34 %
AUTO_AND_VEHICLES : 0.27 %
LIBRARIES_AND_DEMO : 0.27 %
PARENTING : 0.23 %
BEAUTY : 0.22 %
EVENTS : 0.11 %
MEDICAL : 0.05 %
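
The install counts are stored as text such as '10,000+', which is why the cell above strips the commas and the plus sign before converting. That cleaning step can be isolated in a small helper, sketched below under the hypothetical name `parse_installs`.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))      # 10000
print(parse_installs('50,000,000+'))  # 50000000
```

Keep in mind that these values are lower bounds of install buckets, so '10,000+' means at least 10,000 installs rather than exactly 10,000.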


**Observations:**
Top 5 categories on Google Play, ranked by average number of installs per app:
1. Communication
2. Video Players
3. Social
4. Photography
5. Productivity

## Conclusions:
- Based on the above analysis, the Social (Networking) genre is popular among free English apps on both iOS and Google Play: it accounts for 10.08% and 9.81% of the popularity shares computed above, respectively, and ranks third on both stores.
- Another encouraging factor is that the social genre is quite underrepresented: only 3.29% of the free English apps on iOS and only 2.66% on Android belong to it.
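
One rough way to express the underrepresentation argument is to divide a genre's popularity share by its share of apps. The sketch below takes the rounded percentages from the tables in this notebook (10.08 and 9.81 for popularity, 3.29 and 2.66 for app share). The ratio itself is an illustrative measure, not one defined in the analysis above.

```python
# (popularity share in %, share of apps in %), rounded from the tables above
shares = {
    'App Store (Social Networking)': (10.08, 3.29),
    'Google Play (SOCIAL)': (9.81, 2.66),
}

ratios = {}
for label, (popularity_share, app_share) in shares.items():
    # A ratio above 1 means the genre draws more attention than its app count suggests.
    ratios[label] = popularity_share / app_share
    print(label, ':', round(ratios[label], 2))
```

Both ratios come out well above 1, which is consistent with the reading that the social genre attracts more users than its number of apps would suggest.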