# Analysing Google Play Store Data and Apple Play store Data

The aim of this project is to filter out data from Google Play store and Apple iOS store, to extract information useful for making a free app and analyze what kind of apps are more preferred by users, have higher ratings, generate more revenue through in-app advertising etc. 

#### Google Play store data: https://www.kaggle.com/lava18/google-play-store-app
#### Apple iOS store data: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home


In [2]:
from csv import reader

### The Google Play data set ###
opened_file = open('E:\Kaggle Datasets\Google Play\googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### The App Store data set ###
opened_file = open('E:\Kaggle Datasets\Google Play\AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]

The following function is used to print any dataset i.e print the sliced dataset and also display the number or rows and columns if the user wants

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice= dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') 
    if rows_and_columns:
        print('Number of rows:',len(dataset))
        print('Number of columns:',len(dataset[0]))

In [4]:
explore_data(ios,0,5)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']




In [5]:
explore_data(ios,0,5,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


('Number of rows:', 7197)
('Number of columns:', 17)


### Data Cleaning

#### 1) Corrupt Data
We observe that in our Google Play store dataset some of the data is corrupt. For example data entry 10472 has Category as 1.9. Which is not a string. Further we need to clean this data for removing such corrupt data

In [6]:
print(android[10472])  
print('\n')
print(android_header)  
print('\n')
print(android[0])  

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


In [7]:

print(len(android))
del android[10472]  # don't run this more than once
print(len(android))

10841
10840


#### 2) Redundant Data
We observe that some of the data is redundant. The example is given below. We keep the data with the maximum number of reviews and delete the rest.

In [8]:
for app in android:
    name= app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


 We are removing duplicate rows. The criterion used here is that the apps with the most number of reviews are kept and the others are removed as the ones with more reviews would be the latest entries

In [9]:
duplicate_apps =[]
unique_apps =[]

for app in android:
    name=app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps', len(duplicate_apps))
print('\n')
print('Fifteen examples of duplicate apps', duplicate_apps[:15])


('Number of duplicate apps', 1181)


('Fifteen examples of duplicate apps', ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software'])


In [10]:
print('Expected length', len(android)-1181)

('Expected length', 9659)


#### We are removing duplicate rows and keeping the apps with maximum reviews

In [11]:
reviews_max = {}

for app in android:
    name=app[0]
    n_reviews= float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
         reviews_max[name] = n_reviews
            
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        

In [12]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

('Expected length:', 9659)
('Actual length:', 9659)


#### We are cleaning the data and segregating the clean data and the data already added

In [13]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [14]:
explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite \xe2\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


('Number of rows:', 9659)
('Number of columns:', 13)


#### 3) Non-essential Data
We remove the Non-English Apps because our aim is to develop a free app in English language

In [20]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False


In [23]:
is_english('Instachat 😜')

False

In [24]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[2]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite \xe2\x80\x93 FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


('Number of rows:', 9500)
('Number of columns:', 13)


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '1005240

In [25]:
print(android_header)    

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [26]:
print(ios_header)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


In [27]:

android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[5]
    if price == '0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8760
3169


### Generating Frequency table 

In [28]:

def freq_table(dataset, index):
    table = {}
    total = 0.0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = float((table[key] / total)) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Let us see which Genre of apps is most preferred by users in iOS Store

In [29]:
display_table(ios_final, 12)

('Games', ':', 58.53581571473651)
('Entertainment', ':', 7.82581255916693)
('Photo & Video', ':', 5.0489113284947935)
('Education', ':', 3.72357210476491)
('Social Networking', ':', 3.2817923635216157)
('Shopping', ':', 2.5244556642473968)
('Utilities', ':', 2.398232881035027)
('Sports', ':', 2.1773430104133795)
('Music', ':', 2.0511202272010096)
('Health & Fitness', ':', 1.9880088355948247)
('Productivity', ':', 1.7040075733669928)
('Lifestyle', ':', 1.5462290943515304)
('News', ':', 1.3253392237298833)
('Travel', ':', 1.1360050489113285)
('Finance', ':', 1.1044493531082362)
('Weather', ':', 0.8520037866834964)
('Food & Drink', ':', 0.8204480908804039)
('Reference', ':', 0.5364468286525718)
('Business', ':', 0.5364468286525718)
('Book', ':', 0.3786683496371095)
('Navigation', ':', 0.18933417481855475)
('Medical', ':', 0.18933417481855475)
('Catalogs', ':', 0.12622278321236985)


#### As we see, Games are the most popular genre accounting for about 58% of the apps 

In [30]:
display_table(android_final, -4)

('Tools', ':', 8.470319634703197)
('Entertainment', ':', 6.084474885844749)
('Education', ':', 5.3881278538812785)
('Business', ':', 4.646118721461187)
('Productivity', ':', 3.9383561643835616)
('Lifestyle', ':', 3.904109589041096)
('Finance', ':', 3.721461187214612)
('Medical', ':', 3.550228310502283)
('Sports', ':', 3.4018264840182644)
('Personalization', ':', 3.287671232876712)
('Communication', ':', 3.2534246575342465)
('Action', ':', 3.105022831050228)
('Health & Fitness', ':', 3.093607305936073)
('Photography', ':', 2.9794520547945202)
('News & Magazines', ':', 2.808219178082192)
('Social', ':', 2.6484018264840183)
('Travel & Local', ':', 2.328767123287671)
('Shopping', ':', 2.2488584474885847)
('Books & Reference', ':', 2.146118721461187)
('Simulation', ':', 2.054794520547945)
('Dating', ':', 1.860730593607306)
('Arcade', ':', 1.82648401826484)
('Video Players & Editors', ':', 1.7808219178082192)
('Casual', ':', 1.7351598173515983)
('Maps & Navigation', ':', 1.3812785388127853)


#### As we see, Tools account for the top genre accounting for about 8.4% followed by Entertainment accounting for 6% of the total genres in Google Play store

### .......................................................................................................................................................................................

In [31]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0.0
    len_genre = 0.0
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = float(total / len_genre)
    print(genre, ':', avg_n_ratings)

('Productivity', ':', 0.0)
('Photo & Video', ':', 0.0)
('Entertainment', ':', 0.0)
('Travel', ':', 0.0)
('Sports', ':', 0.0)
('Food & Drink', ':', 0.0)
('Book', ':', 0.0)
('Music', ':', 0.0)
('Shopping', ':', 0.0)
('Catalogs', ':', 0.0)
('Finance', ':', 0.0)
('Business', ':', 0.0)
('Social Networking', ':', 0.0)
('Utilities', ':', 0.0)
('News', ':', 0.0)
('Lifestyle', ':', 0.0)
('Medical', ':', 0.0)
('Games', ':', 0.0)
('Health & Fitness', ':', 0.0)
('Navigation', ':', 0.0)
('Reference', ':', 0.0)
('Weather', ':', 0.0)
('Education', ':', 0.0)


In [27]:
display_table(android_final, 5)

('1,000,000+', ':', 15.74200913242009)
('100,000+', ':', 11.518264840182649)
('10,000,000+', ':', 10.60502283105023)
('10,000+', ':', 10.205479452054794)
('1,000+', ':', 8.367579908675799)
('100+', ':', 6.952054794520548)
('5,000,000+', ':', 6.872146118721462)
('500,000+', ':', 5.5479452054794525)
('50,000+', ':', 4.7716894977168955)
('5,000+', ':', 4.486301369863014)
('10+', ':', 3.515981735159817)
('500+', ':', 3.2077625570776256)
('50,000,000+', ':', 2.28310502283105)
('100,000,000+', ':', 2.134703196347032)
('50+', ':', 1.9292237442922375)
('5+', ':', 0.7876712328767124)
('1+', ':', 0.5136986301369862)
('500,000,000+', ':', 0.273972602739726)
('1,000,000,000+', ':', 0.228310502283105)
('0+', ':', 0.045662100456621)
('0', ':', 0.01141552511415525)


In [28]:
categories_android = freq_table(android_final, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)

('LIBRARIES_AND_DEMO', ':', 649314.0506329114)
('SHOPPING', ':', 7103190.78680203)
('BUSINESS', ':', 1712290.1474201474)
('ENTERTAINMENT', ':', 11767380.952380951)
('MEDICAL', ':', 121161.87781350482)
('MAPS_AND_NAVIGATION', ':', 4115374.214876033)
('LIFESTYLE', ':', 1447458.976676385)
('GAME', ':', 15571586.690307328)
('BOOKS_AND_REFERENCE', ':', 8329168.936170213)
('AUTO_AND_VEHICLES', ':', 654074.8271604938)
('HOUSE_AND_HOME', ':', 1385541.463768116)
('BEAUTY', ':', 513151.88679245283)
('COMICS', ':', 859042.1568627451)
('PHOTOGRAPHY', ':', 17840110.40229885)
('PARENTING', ':', 552875.1785714285)
('WEATHER', ':', 5212877.101449275)
('ART_AND_DESIGN', ':', 1986335.0877192982)
('PERSONALIZATION', ':', 5240358.986111111)
('DATING', ':', 861409.5521472392)
('EVENTS', ':', 253542.22222222222)
('FAMILY', ':', 3716053.755274262)
('HEALTH_AND_FITNESS', ':', 4219697.055350553)
('VIDEO_PLAYERS', ':', 24878048.860759493)
('FINANCE', ':', 1365500.4049079753)
('PRODUCTIVITY', ':', 16787331.34492

In [29]:
for app in android_final:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+'):
        print(app[0], ':', app[5])

('WhatsApp Messenger', ':', '1,000,000,000+')
('imo beta free calls and text', ':', '100,000,000+')
('Android Messages', ':', '100,000,000+')
('Google Duo - High Quality Video Calls', ':', '500,000,000+')
('Messenger \xe2\x80\x93 Text and Video Chat for Free', ':', '1,000,000,000+')
('imo free video calls and chat', ':', '500,000,000+')
('Skype - free IM & video calls', ':', '1,000,000,000+')
('Who', ':', '100,000,000+')
('GO SMS Pro - Messenger, Free Themes, Emoji', ':', '100,000,000+')
('LINE: Free Calls & Messages', ':', '500,000,000+')
('Google Chrome: Fast & Secure', ':', '1,000,000,000+')
('Firefox Browser fast & private', ':', '100,000,000+')
('UC Browser - Fast Download Private & Secure', ':', '500,000,000+')
('Gmail', ':', '1,000,000,000+')
('Hangouts', ':', '1,000,000,000+')
('Messenger Lite: Free Calls & Messages', ':', '100,000,000+')
('Kik', ':', '100,000,000+')
('KakaoTalk: Free Calls & Text', ':', '100,000,000+')
('Opera Mini - fast web browser', ':', '100,000,000+')
('O

In [30]:
under_100_m = []

for app in android_final:
    n_installs = app[5]
    n_installs = n_installs.replace(',', '')
    n_installs = n_installs.replace('+', '')
    if (app[1] == 'COMMUNICATION') and (float(n_installs) < 100000000):
        under_100_m.append(float(n_installs))
        
sum(under_100_m) / len(under_100_m)

3437620.895348837

In [31]:

for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ':', app[5])

('E-Book Read - Read Book for free', ':', '50,000+')
('Download free book with green book', ':', '100,000+')
('Wikipedia', ':', '10,000,000+')
('Cool Reader', ':', '10,000,000+')
('Free Panda Radio Music', ':', '100,000+')
('Book store', ':', '1,000,000+')
('FBReader: Favorite Book Reader', ':', '10,000,000+')
('English Grammar Complete Handbook', ':', '500,000+')
('Free Books - Spirit Fanfiction and Stories', ':', '1,000,000+')
('Google Play Books', ':', '1,000,000,000+')
('AlReader -any text book reader', ':', '5,000,000+')
('Offline English Dictionary', ':', '100,000+')
('Offline: English to Tagalog Dictionary', ':', '500,000+')
('FamilySearch Tree', ':', '1,000,000+')
('Cloud of Books', ':', '1,000,000+')
('Recipes of Prophetic Medicine for free', ':', '500,000+')
('ReadEra \xe2\x80\x93 free ebook reader', ':', '1,000,000+')
('Anonymous caller detection', ':', '10,000+')
('Ebook Reader', ':', '5,000,000+')
('Litnet - E-books', ':', '100,000+')
('Read books online', ':', '5,000,000+

### Conclusions
In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.