# Profitable App Profiles for the App Store and Google Play Markets
We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.
We will analyze the two datasets below to achieve this:

- [googleplaystore.csv](https://www.kaggle.com/lava18/google-play-store-apps/home) - containing data about approximately 10,000 Android apps from Google Play
- [AppleStore.csv](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) - containing data about approximately 7,000 iOS apps from the App Store



In [1]:
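from csv import reader

# Specify the encoding explicitly; the app names include non-ASCII characters
opened_apple = open('AppleStore.csv', encoding='utf8')
read_apple = reader(opened_apple)
apple_data = list(read_apple)
opened_apple.close()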

In [2]:
opened_android = open('googleplaystore.csv', encoding='utf8')
read_android = reader(opened_android)
android_data = list(read_android)
opened_android.close()
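
The two loading cells above follow the same pattern, so they could be folded into one helper. The following is only a sketch: the name `load_csv` and its `(header, rows)` return shape are assumptions made for illustration, not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # The context manager closes the file even if parsing raises.
    with open(path, encoding=encoding, newline='') as f:
        data = list(reader(f))
    # First row is the header; everything after it is data.
    return data[0], data[1:]

# Possible usage with the files from this project:
# apple_header, apple_rows = load_csv('AppleStore.csv')
# android_header, android_rows = load_csv('googleplaystore.csv')
```

Returning the header separately avoids having to remember to slice with `[1:]` in every later step.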

The function below can be reused to print rows in a readable way.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
# Header of Apple Apps
apple_data[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [5]:
# Exploring Data 
print(apple_data[0])
print('\n')
explore_data(apple_data[1:],1,5,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16


The important columns for Apple apps are `track_name`, `price`, `user_rating`, and `prime_genre`.

In [6]:
# printing header of android data & some sample rows

In [7]:
print(android_data[0])
print('\n')
explore_data(android_data[1:],1,4,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


The important columns for Android apps are `App`, `Category`, `Rating`, `Price`, and `Genres`.

In [8]:
# checking reported incorrect data
print(android_data[0])
print(android_data[10473])


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [9]:
# Deleting the row above because its Category value is missing,
# which shifts every later field one position to the left
del android_data[10473]
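
The row removed above has 12 fields while the header has 13, so rows like it can also be found programmatically instead of relying on a reported index. The following is a minimal sketch; `find_malformed_rows` is a hypothetical helper and the sample rows are invented for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Invented sample: a header, one good row, and one row that is a field short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Some App', 'TOOLS', '4.1'],
    ['Broken App', '1.9'],
]
malformed = find_malformed_rows(sample)
print(malformed)
```

This check only catches rows with the wrong number of fields; a row with the right length but wrong values would still need a separate check.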

In [10]:
# Exploring if there are duplicate apps in google play store data 

unique_apps = []
duplicate_apps = []


for row in android_data[1:]:
    name = row[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print("Total Number of duplicate apps are" ,len(duplicate_apps))
print("\n")
print("Examples of duplicate apps are ", duplicate_apps[:15])


Total Number of duplicate apps are 1181


Examples of duplicate apps are  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


In [11]:
for app in android_data[1:]:
    name = app[0]
    if name =='Slack':
        print(app)

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


As seen above, the number of reviews differs between duplicate entries, which suggests the data was collected at different times. For each app, we will keep the row with the highest number of reviews and remove the rest.

In [12]:
reviews_max = {}
for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
        

In [13]:
# Length of reviews max dictionary
print(len(reviews_max))

9659


In [14]:
android_clean =[]
already_added = []
for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)
        

In [15]:
# Let's check the android_clean data set
print(len(android_clean))

9659
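
The two-step deduplication above (build `reviews_max`, then filter) can also be done in one pass by storing the best row seen so far for each name. The following is a sketch under stated assumptions: `dedupe_by_max_reviews` is a hypothetical name and the sample rows are invented. The result is ordered by each name's first appearance, which may differ from the order produced by the cells above.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Invented sample rows: name, category, rating, reviews.
sample_rows = [
    ['App A', 'BUSINESS', '4.4', '51507'],
    ['App B', 'BUSINESS', '4.2', '100'],
    ['App A', 'BUSINESS', '4.4', '51510'],
    ['App A', 'BUSINESS', '4.4', '51507'],
]
deduped = dedupe_by_max_reviews(sample_rows)
print(len(deduped))
```

A dictionary lookup is constant time on average, whereas the `name not in already_added` test on a list scans the whole list for every row.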


## Removing Non-English Apps

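The function below takes an app name and returns `True` if the name is likely English, and `False` otherwise. It counts the characters outside the ASCII range (code points above 127) and treats a name as non-English only if it contains more than three of them, so names with an occasional emoji or a symbol such as ™ are still kept.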

In [16]:
def is_english_app(a_string):
    i=0
    for a_chr in a_string:
        if ord(a_chr) > 127:
            i = i+1
            
    if i>3:
        return False 
    else:
        return True

In [17]:
# Testing a few apps
print(is_english_app('Instagram'))
print(is_english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_app('Docs To Go™ Free Office Suite'))
print(is_english_app('Instachat 😜'))


True
False
True
True
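
The same check can be written more compactly, with the threshold exposed as a parameter. This is only a sketch: the name `is_probably_english` and the parameter `max_non_ascii` are hypothetical, and the default of 3 mirrors the `i>3` test above.

```python
def is_probably_english(a_string, max_non_ascii=3):
    """Return True if at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for ch in a_string if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Instagram'))
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_probably_english('Docs To Go™ Free Office Suite'))
# A stricter threshold rejects any name containing a non-ASCII character.
print(is_probably_english('Docs To Go™ Free Office Suite', max_non_ascii=0))
```

The heuristic is deliberately loose: it can still let through non-English names written mostly in ASCII and reject English names with several emoji.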


We will use the function above to remove non-English apps.

In [18]:
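apple_eng = []
android_eng = []
# Note: this loop starts at the header row of apple_data, so apple_eng[0]
# is the header and len(apple_eng) counts it as well.
for row_apple in apple_data:
    name = row_apple[1]
    if is_english_app(name):
        apple_eng.append(row_apple)
# android_clean was built from android_data[1:], so it has no header row.
for row_android in android_clean:
    name = row_android[0]
    if is_english_app(name):
        android_eng.append(row_android)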
        
        

In [19]:
# Explore English Apple , Android Data sets
# length of apple apps in english
print(len(apple_eng))
# length of android apps in english
print(len(android_eng))
# explore data for android english apps
explore_data(android_eng, 0, 4, rows_and_columns=True)
# explore data for apple english apps
explore_data(apple_eng, 0, 4, rows_and_columns=True)



6184
9614
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9614
Number of columns: 13
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num

In [20]:
apple_eng[1]  # price is at index 4 (the fifth column)

['284882215',
 'Facebook',
 '389879808',
 'USD',
 '0.0',
 '2974676',
 '212',
 '3.5',
 '3.5',
 '95.0',
 '4+',
 'Social Networking',
 '37',
 '1',
 '29',
 '1']

In [21]:
android_data[2]  # price is at index 7 (the eighth column)

['Coloring book moana',
 'ART_AND_DESIGN',
 '3.9',
 '967',
 '14M',
 '500,000+',
 'Free',
 '0',
 'Everyone',
 'Art & Design;Pretend Play',
 'January 15, 2018',
 '2.0.0',
 '4.0.3 and up']

In [22]:
android_free_eng = []
apple_free_eng = []
for row in android_eng:
    if row[7] == '0':
        android_free_eng.append(row)
for row in apple_eng:
    if row[4] == '0.0':
        apple_free_eng.append(row)
        

In [23]:
# Check lengths of free apps
print("Android free apps count is " ,len(android_free_eng))
print("Apple free apps count is " ,len(apple_free_eng))

Android free apps count is  8864
Apple free apps count is  3222


Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


In [24]:
android_data[0]

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [25]:
# We will use Genres, Installs, Current Ver, and Android Ver
apple_data[0]

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [26]:
# ver and prime_genre of the Apple data can be used

In [31]:
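def freq_table(dataset, index):
    """Return a dict mapping each value in column `index` to its percentage of rows."""
    freq_dict = {}
    total = 0

    # NOTE: dataset[1:] skips the first row. The cleaned lists passed in below
    # (android_free_eng, apple_free_eng) contain no header row, so one app is
    # left out of the counts.
    for row in dataset[1:]:
        total += 1
        name = row[index]
        if name in freq_dict:
            freq_dict[name] += 1
        else:
            freq_dict[name] = 1

    # Convert raw counts to percentages
    for key in freq_dict:
        freq_dict[key] = (freq_dict[key]/total)*100
    return freq_dict
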

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
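
As an alternative to the hand-rolled counting, the standard library's `collections.Counter` can build the same kind of percentage table. The following is a sketch; the names `freq_table_pct` and `sorted_table` are hypothetical and the sample rows are invented. Unlike `freq_table` above, it counts every row it is given and does not skip the first one.

```python
from collections import Counter

def freq_table_pct(rows, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

def sorted_table(rows, index):
    """Return (value, percent) pairs, largest share first."""
    table = freq_table_pct(rows, index)
    return sorted(table.items(), key=lambda kv: kv[1], reverse=True)

# Invented sample: app name and category only.
sample_rows = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'GAME']]
for value, pct in sorted_table(sample_rows, 1):
    print(value, ':', pct)
```

Sorting the `(value, percent)` pairs with a `key` function avoids building the intermediate list of swapped tuples.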

Check the frequency by:
- Android data: `Category` (index 1) and `Genres` (index 9)
- Apple data: `prime_genre`

In [32]:
# Android Data Analysis
display_table(android_free_eng,1)  # By Category


FAMILY : 18.910075595170937
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

In [33]:
display_table(android_free_eng,9)  # By Genres

Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto & Vehicles : 0.9251946293580051
S

In [34]:
# Apple App Analysis
display_table(apple_free_eng,-5)  # Prime_genre

Games : 58.180689226948154
Entertainment : 7.885749767153058
Photo & Video : 4.967401428127911
Education : 3.6634585532443342
Social Networking : 3.2598571872089415
Shopping : 2.607885749767153
Utilities : 2.5147469729897547
Sports : 2.1421918658801617
Music : 2.049053089102763
Health & Fitness : 2.018006830176964
Productivity : 1.7385904998447685
Lifestyle : 1.5833592052157717
News : 1.334989133809376
Travel : 1.2418503570319777
Finance : 1.11766532132878
Weather : 0.8692952499223843
Food & Drink : 0.8072027320707855
Reference : 0.55883266066439
Business : 0.5277864017385905
Book : 0.43464762496119214
Navigation : 0.18627755355479667
Medical : 0.18627755355479667
Catalogs : 0.12418503570319776


In [35]:
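# For each genre, average rating_count_tot (index 5), i.e. the average number
# of user ratings per app, not the average rating score.
for genre in freq_table(apple_free_eng,-5):
    total = 0
    len_genre = 0
    for row in apple_free_eng:
        genre_app = row[-5]
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
    print(" App Genre", genre)
    print("Average User Ratings", (total/len_genre) )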

 App Genre Social Networking
Average User Ratings 71548.34905660378
 App Genre Music
Average User Ratings 57326.530303030304
 App Genre Health & Fitness
Average User Ratings 23298.015384615384
 App Genre Utilities
Average User Ratings 18684.456790123455
 App Genre Finance
Average User Ratings 31467.944444444445
 App Genre Weather
Average User Ratings 52279.892857142855
 App Genre Productivity
Average User Ratings 21028.410714285714
 App Genre Reference
Average User Ratings 74942.11111111111
 App Genre Education
Average User Ratings 7003.983050847458
 App Genre Navigation
Average User Ratings 86090.33333333333
 App Genre Travel
Average User Ratings 28243.8
 App Genre Food & Drink
Average User Ratings 33333.92307692308
 App Genre Business
Average User Ratings 7491.117647058823
 App Genre Medical
Average User Ratings 612.0
 App Genre Lifestyle
Average User Ratings 16485.764705882353
 App Genre News
Average User Ratings 21248.023255813954
 App Genre Shopping
Average User Ratings 26919.6904
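
The loop above rescans the whole list once per genre. The same averages can be collected in a single pass by keeping a running total and count per group. The following is a sketch; `average_by_group` is a hypothetical helper and the sample rows are invented.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column `value_index`} in one pass."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Invented sample: app name, genre, number of ratings.
sample_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]
averages = average_by_group(sample_rows, 1, 2)
print(averages)

# Possible usage with this notebook's data (rating_count_tot is at index 5):
# average_by_group(apple_free_eng, -5, 5)
```

With the notebook's data, the values returned would be the average number of ratings per app in each genre, since index 5 is `rating_count_tot`.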

In [39]:
for category in freq_table(android_free_eng,1):
    total = 0
    len_category = 0
    for row in android_free_eng:
        category_app = row[1]
        if category_app == category:
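            # Installs values look like '500,000+'; strip '+' and ','
            # so the string can be converted to a number
            installs = row[5]
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            total += float(installs)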
            len_category += 1
    print(category,":",(total/len_category))
 

HOUSE_AND_HOME : 1331540.5616438356
COMMUNICATION : 38456119.167247385
HEALTH_AND_FITNESS : 4188821.9853479853
BOOKS_AND_REFERENCE : 8767811.894736841
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
EVENTS : 253542.22222222222
DATING : 854028.8303030303
PRODUCTIVITY : 16787331.344927534
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
TRAVEL_AND_LOCAL : 13984077.710144928
MAPS_AND_NAVIGATION : 4056941.7741935486
MEDICAL : 120550.61980830671
AUTO_AND_VEHICLES : 647317.8170731707
GAME : 15588015.603248259
PERSONALIZATION : 5201482.6122448975
SHOPPING : 7036877.311557789
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
ENTERTAINMENT : 11640705.88235294
FAMILY : 3695641.8198090694
PHOTOGRAPHY : 17840110.40229885
VIDEO_PLAYERS : 24727872.452830188
WEATHER : 5074486.197183099
NEWS_AND_MAGAZINES : 9549178.467741935
BEAUTY : 513151.88679245283
TOOLS : 10801391.298666667
SOCIAL : 23253652.127118643
PARENTING : 542603.6206896552
SPORTS : 363