# Analysis of Free Apps on Google Play and the App Store
This project looks at the apps available on Google Play and the App Store. The goal is to analyze the data to help our developers understand what types of apps are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

Import the App Store data and the Google Play data, and define a helper function for exploring them.

In [1]:
from csv import reader

# Read the App Store data; the first row is the header
with open('AppleStore.csv', encoding='utf8') as openfileapple:
    dataapple=list(reader(openfileapple))
dataappleheader=dataapple[0]
dataapple=dataapple[1:]

# Read the Google Play data; the first row is the header
with open('googleplaystore.csv', encoding='utf8') as openfilegoogle:
    datagoogle=list(reader(openfilegoogle))
datagoogleheader=datagoogle[0]
datagoogle=datagoogle[1:]


def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
explore_data(dataapple,0,2)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7197
Number of columns: 17


In [2]:
explore_data(datagoogle,0,4)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [3]:
print(dataappleheader)
print('\n')
print(datagoogleheader)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


## Data Cleaning
1. Removed inaccurate data
2. Removed duplicate app entries
3. Removed non-English apps
4. Isolated the free apps



* Detect inaccurate data, and correct or remove it.


In [4]:
print(datagoogle[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


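The row at index 10472 is missing its 'Category' value, so every later field is shifted one column to the left (the rating '1.9' sits in the Category position). The row needs to be deleted.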

In [5]:
del datagoogle[10472]
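
A shifted row like this one can also be found without knowing its index in advance, by comparing the length of each row with the length of the header. The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper that is not part of this notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data for illustration only (not taken from googleplaystore.csv)
toy_header = ['App', 'Category', 'Rating']
toy_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # 'Category' is missing, so the row is short
    ['App C', 'GAME', '4.5'],
]

bad = find_malformed_rows(toy_header, toy_rows)
print(bad)
```

Called as `find_malformed_rows(datagoogleheader, datagoogle)` before the deletion, this should include the row at index 10472, which has 12 fields instead of 13.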

* Detect duplicate data, and remove the duplicates.

In [6]:
duplicateapple=[]
uniqueapple=[]
for item in dataapple:
    app_id=item[1]  # unique App Store id; item[0] is only a row counter
    if app_id in uniqueapple:
        duplicateapple.append(app_id)
    else:
        uniqueapple.append(app_id)
print('The number of duplicate records for Apple apps:',len(duplicateapple))

The number of duplicate records for Apple apps: 0


In [7]:
duplicategoogle=[]
uniquegoogle=[]
for item in datagoogle:
    name=item[0]
    if name in uniquegoogle:
        duplicategoogle.append(name)
    else:
        uniquegoogle.append(name)
print('The number of duplicate records for Android:', len(duplicategoogle))

The number of duplicate records for Android: 1181
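
Checking a name against a list rescans the list for every row. The same count can be obtained with a set, which has constant-time membership tests on average. A sketch on toy data, where `count_duplicates` is a hypothetical helper:

```python
def count_duplicates(names):
    """Count entries that repeat a name already seen earlier in the list."""
    seen = set()
    duplicates = 0
    for name in names:
        if name in seen:
            duplicates += 1
        else:
            seen.add(name)
    return duplicates


# Toy names for illustration only
toy_names = ['Instagram', 'Box', 'Instagram', 'Instagram', 'Box', 'Slack']
print(count_duplicates(toy_names))  # 3 repeated entries
```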


Show some of the duplicate records:

In [8]:
print(duplicategoogle[:5])

['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


Check one app as an example:

In [9]:
for item in datagoogle:
    if item[0]=='Instagram':
        print(item)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


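The number of reviews can be used as the criterion for removing duplicates. The rows differ mainly in their review count, so for each app we keep the row with the highest number of reviews, which is presumably the most recent one.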

Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the highest number of reviews of that app.


In [10]:
reviews_max={}
for item in datagoogle:
    name=item[0]
    num_reviews=float(item[3])
    if name in reviews_max and reviews_max[name]<num_reviews:
        reviews_max[name]=num_reviews
    elif name not in reviews_max:
        reviews_max[name]=num_reviews       

In [11]:
print('Expected Records',len(datagoogle)-len(duplicategoogle))
print('Cleaned Records',len(reviews_max))


Expected Records 9659
Cleaned Records 9659


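Generate the cleaned data set for the Google Play apps. The `actualadd` list makes sure each app is added only once, even when several of its rows share the same maximum number of reviews.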

In [12]:
googleclean=[]
actualadd=[]
for item in datagoogle:
    name=item[0]
    num_reviews=float(item[3])
    if name not in actualadd and num_reviews ==reviews_max[name]:
        googleclean.append(item)
        actualadd.append(name)
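
The two passes above (first `reviews_max`, then `googleclean`) can also be folded into one pass that keeps, for each name, the row with the most reviews seen so far. This is an alternative sketch rather than the approach used in this notebook; `keep_most_reviewed` is a hypothetical name, and the toy rows only mimic the first four columns of the Google Play data. It returns rows in order of each app's first appearance, which may differ from the order of `googleclean`.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = float(row[reviews_idx])
        # Strict '>' keeps the first row when several share the maximum
        if name not in best or reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Toy rows: [App, Category, Rating, Reviews]
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Toy App', 'TOOLS', '4.0', '10'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

result = keep_most_reviewed(toy_rows)
for row in result:
    print(row)
```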
    

Confirm that the cleaned data has the expected number of rows:

In [13]:
explore_data(googleclean,0,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13


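* Remove the non-English apps.

The numbers corresponding to the characters commonly used in English text are all in the range 0 to 127 (the ASCII range). The built-in `ord()` function returns the number for a character, so a first version of the check rejects any name that contains a character above 127.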
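
For reference, here is what `ord()` returns for a few characters, including three non-ASCII ones that appear in the app names used as examples in this section.

```python
# ord() returns the Unicode code point of a single character
samples = ['a', 'Z', ' ', '™', '爱', '😜']
codes = {character: ord(character) for character in samples}

for character, code in codes.items():
    print(repr(character), code, 'ASCII' if code <= 127 else 'non-ASCII')
```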

In [14]:
def englishcheck(string):
    for character in string:
        if ord(character) >127:
            return False
    # only reached when no character is outside the ASCII range
    return True

In [15]:
print(englishcheck('Instagram'))
print(englishcheck('爱奇艺PPS'))

True
False


To minimize the impact of data loss, we'll only remove an app if its name has two or more characters falling outside the ASCII range. This means English apps with a single emoji or other special character will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.

In [16]:
def englishcheck(string):
    nonenglish_num=0
    for character in string:
        if ord(character) >127:
            nonenglish_num+=1
    if nonenglish_num>=2:
        return False
    else:
        return True

print(englishcheck('Docs To Go™ Free Office Suite'))
print(englishcheck('Instachat 😜'))

True
True
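
Because the cut-off is a judgment call, it can be convenient to make it a parameter and compare a few values. The sketch below is a hypothetical variant, not the function used for the results in this notebook: `count_non_ascii`, `is_english`, and `max_non_ascii` are illustrative names.

```python
def count_non_ascii(text):
    """Number of characters whose code point lies outside 0..127."""
    return sum(1 for character in text if ord(character) > 127)


def is_english(text, max_non_ascii=1):
    """Treat a name as English if it has at most max_non_ascii such characters."""
    return count_non_ascii(text) <= max_non_ascii


names = ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜', '爱奇艺PPS']
for name in names:
    print(name, count_non_ascii(name), is_english(name), is_english(name, max_non_ascii=3))
```

With the default of 1, `is_english` gives the same answers as the second `englishcheck` above, which rejects a name once it has two or more non-ASCII characters.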


Use the new function to filter out non-English apps from both data sets. Loop through each data set. If an app name is identified as English, append the whole row to a separate list.

In [17]:
datagoogle_englishonly=[]
dataapple_englishonly=[]
for item in googleclean:
    name=item[0]
    if englishcheck(name):
        datagoogle_englishonly.append(item)
        
for item in dataapple:
    name=item[2]
    if englishcheck(name):
        dataapple_englishonly.append(item) 


Explore the data sets and check how many rows remain:

In [18]:
explore_data(dataapple_englishonly,0,2,True)
print('\n')
explore_data(datagoogle_englishonly,0,2,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 6100
Number of columns: 17


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9523
Number of columns: 13


After keeping only the English apps, 9523 Google Play apps and 6100 App Store apps remain.

In [19]:
print(dataappleheader)
print('\n')
print(datagoogleheader)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


* Isolate the free apps in separate lists.

In [20]:
dataapplefree=[]
datagooglefree=[]

for app in dataapple_englishonly:
    price=app[5]
    if price=='0':
        dataapplefree.append(app)
        
for app in datagoogle_englishonly:
    price=app[7]
    if price=='0':
        datagooglefree.append(app)
        
print(len(dataapplefree))
print(len(datagooglefree))

3169
8781


There are 8781 free Android apps and 3169 free iOS apps.
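
Comparing the price with the string `'0'` works for the rows shown above, but it would miss a free app whose price was written differently, for example `'0.0'` or `'$0'`. A more tolerant check, sketched here under the assumption that prices are plain numbers with an optional leading `$`, converts the text to a number first. `is_free` is a hypothetical helper.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number (for example an empty field): do not count it as free
        return False


for price in ['0', '0.0', '$0', '3.99', '$4.99', '']:
    print(repr(price), is_free(price))
```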

## Data Analysis

### Find which genres have the most apps in the App Store and Google Play

Define a function that generates a frequency table, so we can find the most common genres in each market. The function converts the counts to percentages and sorts the dictionary by value, so the percentages are displayed in descending order.

In [21]:
import collections


def freq_table(dataset,index):
    table={}
    total=0
    for raw in dataset:
        total+=1
        value=raw[index]
        if value in table:
            table[value]+=1
        else:
            table[value]=1
    
    percentage={}
    for app in table:
        percent=round((table[app]/total)*100,2)
        percentage[app]=percent

    # sort dict by value
    sorted_percentage = sorted(percentage.items(), key=lambda kv:kv[1],reverse = True)

    # convert list to dict
    sorted_dict = collections.OrderedDict(sorted_percentage)
    return sorted_dict
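
The standard library's `collections.Counter` can do both the counting and the sorting. The following sketch is an alternative to `freq_table`, not a replacement used elsewhere in this notebook; `freq_table_counter` is a hypothetical name and the rows are toy data.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]


# Toy rows: [name, genre]
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
table = freq_table_counter(toy_rows, 1)
print(table)
```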
    


iOS apps by 'prime_genre':

In [22]:
freq_table(dataapplefree,-5)

OrderedDict([('Games', 58.54),
             ('Entertainment', 7.83),
             ('Photo & Video', 5.05),
             ('Education', 3.72),
             ('Social Networking', 3.28),
             ('Shopping', 2.52),
             ('Utilities', 2.4),
             ('Sports', 2.18),
             ('Music', 2.05),
             ('Health & Fitness', 1.99),
             ('Productivity', 1.7),
             ('Lifestyle', 1.55),
             ('News', 1.33),
             ('Travel', 1.14),
             ('Finance', 1.1),
             ('Weather', 0.85),
             ('Food & Drink', 0.82),
             ('Reference', 0.54),
             ('Business', 0.54),
             ('Book', 0.38),
             ('Navigation', 0.19),
             ('Medical', 0.19),
             ('Catalogs', 0.13)])

Google Play apps by 'Category':

In [23]:
freq_table(datagooglefree,1)

OrderedDict([('FAMILY', 18.95),
             ('GAME', 9.66),
             ('TOOLS', 8.46),
             ('BUSINESS', 4.64),
             ('PRODUCTIVITY', 3.93),
             ('LIFESTYLE', 3.91),
             ('FINANCE', 3.71),
             ('MEDICAL', 3.54),
             ('SPORTS', 3.33),
             ('PERSONALIZATION', 3.3),
             ('COMMUNICATION', 3.26),
             ('HEALTH_AND_FITNESS', 3.1),
             ('PHOTOGRAPHY', 2.97),
             ('NEWS_AND_MAGAZINES', 2.8),
             ('SOCIAL', 2.69),
             ('TRAVEL_AND_LOCAL', 2.33),
             ('SHOPPING', 2.25),
             ('BOOKS_AND_REFERENCE', 2.15),
             ('DATING', 1.86),
             ('VIDEO_PLAYERS', 1.8),
             ('MAPS_AND_NAVIGATION', 1.38),
             ('FOOD_AND_DRINK', 1.23),
             ('EDUCATION', 1.17),
             ('ENTERTAINMENT', 0.96),
             ('LIBRARIES_AND_DEMO', 0.93),
             ('AUTO_AND_VEHICLES', 0.92),
             ('HOUSE_AND_HOME', 0.8),
             ('WEA

Google Play apps by 'Genres'. 'Category' is the better column to work with, since 'Genres' mixes broad categories with finer-grained ones:

In [24]:
freq_table(datagooglefree,-4)

OrderedDict([('Tools', 8.45),
             ('Entertainment', 6.07),
             ('Education', 5.38),
             ('Business', 4.64),
             ('Productivity', 3.93),
             ('Lifestyle', 3.89),
             ('Finance', 3.71),
             ('Medical', 3.54),
             ('Sports', 3.39),
             ('Personalization', 3.3),
             ('Communication', 3.26),
             ('Health & Fitness', 3.1),
             ('Action', 3.1),
             ('Photography', 2.97),
             ('News & Magazines', 2.8),
             ('Social', 2.69),
             ('Travel & Local', 2.32),
             ('Shopping', 2.25),
             ('Books & Reference', 2.15),
             ('Simulation', 2.05),
             ('Dating', 1.86),
             ('Arcade', 1.83),
             ('Video Players & Editors', 1.78),
             ('Casual', 1.74),
             ('Maps & Navigation', 1.38),
             ('Food & Drink', 1.23),
             ('Puzzle', 1.14),
             ('Racing', 1.0),
             ('

### Calculate which genres have the most users

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Use a nested loop to calculate the average:

In [25]:
genre_ios= freq_table(dataapplefree,-5)
avggenreios={}
for genre in genre_ios:
    total=0
    totalapp=0
    for app in dataapplefree:
        if app[-5]==genre:
            total+=float(app[6])
            totalapp+=1
    avggenreios[genre]=round(total/totalapp,2)

# sort dict by value (once, after all genre averages are computed)
sorted_iosgenre = sorted(avggenreios.items(), key=lambda kv:kv[1],reverse = True)

# convert list to dict
sorted_dict_iosgenre  = collections.OrderedDict(sorted_iosgenre )
print(sorted_dict_iosgenre)
    


OrderedDict([('Navigation', 86090.33), ('Reference', 79350.47), ('Social Networking', 72916.55), ('Music', 58205.03), ('Weather', 54215.3), ('Book', 46384.92), ('Food & Drink', 33333.92), ('Finance', 32367.03), ('Travel', 31358.5), ('Photo & Video', 28441.54), ('Shopping', 27816.2), ('Health & Fitness', 24037.63), ('Sports', 23008.9), ('Games', 22985.21), ('Productivity', 21799.15), ('News', 21750.07), ('Utilities', 19900.47), ('Lifestyle', 16739.35), ('Entertainment', 14364.77), ('Business', 7491.12), ('Education', 7003.98), ('Catalogs', 4004.0), ('Medical', 612.0)])
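
The nested loop scans the whole data set once per genre. The same averages can be computed in a single pass by accumulating a running total and a count for each genre. A sketch on toy rows; `average_by_key` is a hypothetical helper, and the column positions are passed in as arguments.

```python
from collections import defaultdict


def average_by_key(rows, key_idx, value_idx):
    """Average of row[value_idx] per distinct row[key_idx], highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[key_idx]] += float(row[value_idx])
        counts[row[key_idx]] += 1
    averages = {key: round(totals[key] / counts[key], 2) for key in totals}
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Toy rows: [name, genre, rating count]
toy_rows = [
    ['a', 'Navigation', '100'],
    ['b', 'Games', '30'],
    ['c', 'Navigation', '50'],
    ['d', 'Games', '10'],
]
averages = average_by_key(toy_rows, 1, 2)
print(averages)
```

For the App Store data the equivalent call would be `average_by_key(dataapplefree, -5, 6)`.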


In the App Store, 'Navigation' has the highest average number of ratings. Looking at the apps in this genre, Waze and Google Maps account for about 97% of all its ratings:

In [26]:
for app in dataapplefree:
    if app[-5]=='Navigation':
        print(app[2],':',app[6])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911
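
The share of the two largest apps can be checked directly from the rating counts printed above:

```python
# Rating counts copied from the Navigation listing above
navigation_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Geocaching®': 12811,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'Google Maps - Navigation & Transit': 154911,
}

total = sum(navigation_ratings.values())
# The two largest counts belong to Waze and Google Maps
top_two = sum(sorted(navigation_ratings.values(), reverse=True)[:2])
share = top_two / total
print(total, top_two, round(share * 100, 1))
```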


In [27]:
for app in dataapplefree:
    if app[-5]=='Social Networking':
        print(app[2],':',app[6])

Facebook : 2974676
LinkedIn : 71856
Skype for iPhone : 373519
Tumblr : 334293
Match™ - #1 Dating App. : 60659
WhatsApp Messenger : 287589
TextNow - Unlimited Text + Calls : 164963
Grindr - Gay and same sex guys chat, meet and date : 23201
imo video calls and chat : 18841
Ameba : 269
Weibo : 7265
Badoo - Meet New People, Chat, Socialize. : 34428
Kik : 260965
Qzone : 1649
Fake-A-Location Free ™ : 354
Tango - Free Video Call, Voice and Chat : 75412
MeetMe - Chat and Meet New People : 97072
SimSimi : 23530
Viber Messenger – Text & Call : 164249
Find My Family, Friends & iPhone - Life360 Locator : 43877
Weibo HD : 16772
POF - Best Dating App for Conversations : 52642
GroupMe : 28260
Lobi : 36
WeChat : 34584
ooVoo – Free Video Call, Text and Voice : 177501
Pinterest : 1061624
Qzone HD : 458
Skype for iPad : 60163
LINE : 11437
QQ : 9109
LOVOO - Dating Chat : 1985
QQ HD : 5058
Messenger : 351466
eHarmony™ Dating App - Meet Singles : 11124
YouNow: Live Stream Video Chat : 12079
Cougar Dating & 

### Most Popular Apps by Genre on Google Play
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
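
The conversion described above, from a string such as `'1,000,000+'` to a number, can be sketched as a small helper. `parse_installs` is a hypothetical name; the cell below performs the same two `replace` calls inline.

```python
def parse_installs(text):
    """Convert an Installs value such as '1,000,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


for value in ['100+', '10,000+', '1,000,000,000+']:
    print(value, '->', parse_installs(value))
```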

In [36]:
googlecat=freq_table(datagooglefree,1)
avgcatgoogle={}
for cat in googlecat:
    totalapp=0
    totaluser=0
    for app in datagooglefree:
        if app[1]==cat:
            n_user=app[5]
            n_user=n_user.replace(',','')
            n_user=n_user.replace('+','')
            n_user=float(n_user)
            totalapp+=1
            totaluser+=n_user
    avgcatgoogle[cat]=round(totaluser/totalapp,2)

#sort dict by value     
sorted_googlecat = sorted(avgcatgoogle.items(), key=lambda kv:kv[1],reverse = True)

#convert list to dict
sorted_dict_googleca  = collections.OrderedDict(sorted_googlecat )
print(sorted_dict_googleca)
    


OrderedDict([('COMMUNICATION', 38590581.09), ('VIDEO_PLAYERS', 24878048.86), ('SOCIAL', 23253652.13), ('PHOTOGRAPHY', 17840110.4), ('PRODUCTIVITY', 16787331.34), ('GAME', 15593824.69), ('TRAVEL_AND_LOCAL', 14120454.08), ('ENTERTAINMENT', 11767380.95), ('TOOLS', 10902378.83), ('NEWS_AND_MAGAZINES', 9626407.36), ('BOOKS_AND_REFERENCE', 8814199.79), ('SHOPPING', 7072366.59), ('PERSONALIZATION', 5273184.1), ('WEATHER', 5212877.1), ('HEALTH_AND_FITNESS', 4204220.23), ('MAPS_AND_NAVIGATION', 4115374.21), ('SPORTS', 3750580.64), ('FAMILY', 3717297.58), ('ART_AND_DESIGN', 1986335.09), ('FOOD_AND_DRINK', 1951283.81), ('EDUCATION', 1833495.15), ('BUSINESS', 1712290.15), ('LIFESTYLE', 1447458.98), ('HOUSE_AND_HOME', 1380033.73), ('FINANCE', 1365500.4), ('DATING', 861409.55), ('COMICS', 859042.16), ('AUTO_AND_VEHICLES', 654074.83), ('LIBRARIES_AND_DEMO', 645070.85), ('PARENTING', 552875.18), ('BEAUTY', 513151.89), ('EVENTS', 253542.22), ('MEDICAL', 121161.88)])



On average, communication apps have the most installs: 38,590,581. This number is heavily influenced by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts), and a few others with over 100 and 500 million installs.


In [38]:
for app in datagooglefree:
    if app[1]=='COMMUNICATION':
        print(app[0], ':', app[5])

WhatsApp Messenger : 1,000,000,000+
Messenger for SMS : 10,000,000+
My Tele2 : 5,000,000+
imo beta free calls and text : 100,000,000+
Contacts : 50,000,000+
Call Free – Free Call : 5,000,000+
Web Browser & Explorer : 5,000,000+
Browser 4G : 10,000,000+
MegaFon Dashboard : 10,000,000+
ZenUI Dialer & Contacts : 10,000,000+
Cricket Visual Voicemail : 10,000,000+
TracFone My Account : 1,000,000+
Xperia Link™ : 10,000,000+
TouchPal Keyboard - Fun Emoji & Android Keyboard : 10,000,000+
Skype Lite - Free Video Call & Chat : 5,000,000+
My magenta : 1,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Seznam.cz : 1,000,000+
Antillean Gold Telegram (original version) : 100,000+
AT&T Visual Voicemail : 10,000,000+
GMX Mail : 10,000,000+
Omlet Chat : 10,000,000+
My Vodacom SA : 5,000,000+
Microsoft Edge : 5,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Calls & Text by Mo+ : 5,000,000+
free 

If we removed all the communication apps that have 100 million or more installs, the average would be reduced roughly ten times:

In [39]:
under_100m=[]
for app in datagooglefree:
    n_user = app[5]
    n_user = n_user.replace(',', '')
    n_user = n_user.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_user) < 100000000):
        under_100m.append(float(n_user))
avg_under_100m=sum(under_100m)/len(under_100m)
avg_under_100m

3617398.420849421
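
Another way to see how much a few giants pull the average up is to compare the mean with the median, a statistic that is not used elsewhere in this analysis. The sketch below uses the install counts of the first eight communication apps in the listing above, with `+` and commas already stripped.

```python
from statistics import mean, median

# Installs of the first eight apps in the COMMUNICATION listing
installs = [1_000_000_000, 10_000_000, 5_000_000, 100_000_000,
            50_000_000, 5_000_000, 5_000_000, 10_000_000]

mean_installs = mean(installs)
median_installs = median(installs)
print('mean:  ', mean_installs)
print('median:', median_installs)
```

In this small sample the single app with one billion installs is enough to push the mean far above the median.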

We see the same pattern for the video players category, which is the runner-up with 24,878,049 installs. The market is dominated by apps like YouTube, Google Play Movies & TV, or MX Player. The pattern is repeated for social apps (where we have giants like Facebook, Instagram, Google+, etc.), photography apps (Google Photos and other popular photo editors), or productivity apps (Microsoft Word, Dropbox, Google Calendar, Evernote, etc.).