# Guided Project: Profitable App Profiles for the App Store and Google Play Markets

**This project investigates which types of apps are likely to attract more users on the App Store and Google Play.**

**Our goal is to analyze the data to help our developers understand what makes an app popular.**

In [1]:
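from csv import reader

# Open both files as UTF-8, since app names contain non-ASCII characters
google_open = open('googleplaystore.csv', encoding='utf8')
applestore_open = open('AppleStore.csv', encoding='utf8')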

In [2]:
read_google = reader(google_open)
read_apple = reader(applestore_open)

In [3]:
google_list = list(read_google)
apple_list = list(read_apple)
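
The three cells above leave both files open. A minimal sketch of an alternative, assuming the same CSV layout, wraps the steps in a helper that closes the file when it is done. The name load_csv is hypothetical and not part of the original notebook.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Read a CSV file and return all of its rows as a list of lists."""
    # The with-block closes the file even if reading fails.
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding=encoding, newline='') as f:
        return list(reader(f))
```

With this helper, a call such as load_csv('googleplaystore.csv') should return the same kind of list as google_list, header row included.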

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(apple_list,0,2,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In our analysis, we will use:
* track_name
* size_bytes
* prime_genre
* cont_rating
* price
* currency
* rating_count_tot

In [6]:
explore_data(google_list, 0, 2, False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']




The columns we will use in our analysis are:
* App
* Category
* Reviews
* Size
* Installs
* Type
* Genres

In [7]:
explore_data(google_list,10473,10474)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']




While cleaning the data, we found a row whose Category value is missing, so every value after it is shifted one column to the left (the row has 12 values instead of 13). We identified this row and deleted it.

In [8]:
del google_list[10473]
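
The index 10473 is hard-coded above. As a hedged alternative, rows like this one could be located by comparing each row's length with the header's length. The sketch below uses a made-up three-column sample; find_malformed_rows is a hypothetical helper name.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # the first row is assumed to be the header
    return [
        (i, row)
        for i, row in enumerate(dataset[1:], start=1)
        if len(row) != expected
    ]

# Made-up sample: the last row is missing its Category value
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],
]

for index, row in find_malformed_rows(sample):
    print(index, row)
```

The returned indexes refer to positions in the full list, header included, so they can be passed to del directly (in reverse order if there is more than one).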

Next, we detect and remove duplicates with the function below. An entry counts as a duplicate if its name matches the name of an entry we have already seen. Only the first entry for each name is kept.

In [9]:
def clean_duplicates(dataset,indexOfname):
    count = 0
    names = []
    newset = []
    for x in dataset[1:]:
        name = x[indexOfname]
        if name in names:
            count += 1
        else:
            names.append(name)
            newset.append(x)
    print("Duplicates are omitted")
    print("Number of duplicates found: " + str(count))
    return newset

In [10]:
googleRemovedDuplicates = clean_duplicates(google_list,0)

Duplicates are omitted
Number of duplicates found: 1181


In [11]:
appleRemovedDuplicates = clean_duplicates(apple_list,1)

Duplicates are omitted
Number of duplicates found: 2


The number of Google Play apps remaining after removing duplicates:

In [12]:
len(googleRemovedDuplicates)

9659

The number of App Store apps remaining after removing duplicates:

In [13]:
len(appleRemovedDuplicates)

7195

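Now we build a dictionary that maps each app name to its highest number of reviews. Because duplicates were already removed, each name appears only once here, so the value is simply the review count of the row that was kept: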

In [14]:
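def review_dict(dataset,indexOfname,indexOfmaxReviews):
    reviews = {}  # not named `set`, which would shadow the built-in type
    for x in dataset:
        name = x[indexOfname]
        n_reviews = float(x[indexOfmaxReviews])
        # Keep the largest review count seen for each name
        if name not in reviews or n_reviews > reviews[name]:
            reviews[name] = n_reviews
    return reviews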

In [15]:
reviews_max = review_dict(googleRemovedDuplicates, 0, 3)

In [16]:
len(reviews_max)

9659
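
clean_duplicates keeps whichever row appears first for each name. A possible refinement, sketched here under the assumption that a higher review count indicates a more recent snapshot of the same app, keeps the row with the most reviews instead. The function name and the sample rows are made up for illustration.

```python
def keep_most_reviewed(rows, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    # First pass: highest review count seen for each name
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or n_reviews > reviews_max[name]:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches that maximum
    cleaned = []
    already_added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            cleaned.append(row)
            already_added.add(name)
    return cleaned

# Made-up sample: 'App A' appears three times with different review counts
sample = [
    ['App A', '100'],
    ['App A', '120'],
    ['App A', '120'],
    ['App B', '10'],
]
print(keep_most_reviewed(sample, 0, 1))
```

The already_added set is needed because two duplicate rows can share the same maximum review count, as the second and third sample rows do.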

We want to reach English-speaking users, so we remove apps whose names suggest they are not in English. We treat a character as non-English if it falls outside the ASCII range (code point above 127). Because some English app names contain emojis or symbols such as ™, we allow up to three such characters per name. The function below applies this rule:

In [17]:
def isinEnglish(string):
    isEnglish = True
    count = 0
    for x in string:
        if ord(x) > 127:
            count += 1
            if count > 3:
                isEnglish = False
                break
    return isEnglish

Now check if the function works correctly by examining a few samples:

In [18]:
isinEnglish('Instagram')

True

In [19]:
isinEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播')

False

In [20]:
isinEnglish('Docs To Go™ Free Office Suite')

True

In [21]:
isinEnglish('Instachat 😜')

True
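
To see why these samples pass or fail, it can help to look at how many characters in each name fall outside the ASCII range. This is a small illustrative sketch. count_non_ascii is a hypothetical helper that mirrors the ord(x) > 127 test used in isinEnglish.

```python
def count_non_ascii(text):
    """Count the characters whose code point is above 127."""
    return sum(1 for ch in text if ord(ch) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in names:
    # A name is kept by isinEnglish when this count is at most 3
    print(name, '->', count_non_ascii(name))
```

The first three names contain at most one non-ASCII character each, while the last one contains far more than three, which matches the True and False results above.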

In [22]:
def filterDataEnglish(dataset,indexOfname):
    newset = []
    count = 0
    for x in dataset:
        name = x[indexOfname]
        if isinEnglish(name):
            newset.append(x)
        else:
            count +=1
    print(str(count) + ' of apps were not in English, deleted.')
    print('There are ' + str(len(newset)) + ' apps remaining')
    return newset

In [23]:
googleEngFilt = filterDataEnglish(googleRemovedDuplicates,0)

45 of apps were not in English, deleted.
There are 9614 apps remaining


In [24]:
appleEngFilt = filterDataEnglish(appleRemovedDuplicates,1)

1014 of apps were not in English, deleted.
There are 6181 apps remaining


We only want to analyze apps that are free to download and install, so we need to remove the non-free apps. Below, we define a function that keeps only the free apps:

In [25]:
def collectFree(dataset,indexOfPrice,stringFree):
    newset = []
    count = 0
    for x in dataset:
        price = x[indexOfPrice]
        if price == stringFree:
            newset.append(x)
        else:
            count += 1
    print(str(count) + " non-free apps are eliminated")
    print("There are " + str(len(newset)) + " apps remaining")
    return newset

In [26]:
google_free = collectFree(googleEngFilt,7,'0')

752 non-free apps are eliminated
There are 8862 apps remaining


In [27]:
apple_free = collectFree(appleEngFilt, 4, '0.0')

2961 non-free apps are eliminated
There are 3220 apps remaining
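
collectFree compares price strings exactly, which is why the two calls need different markers ('0' and '0.0'). A hedged sketch of a numeric comparison follows. It assumes that paid prices may carry a leading dollar sign (for example '$4.99'), which is an assumption about the data and not something shown in the output above. is_free is a hypothetical helper name.

```python
def is_free(price):
    """Return True if a price string such as '0', '0.0' or '$0.00' means zero."""
    # Assumption: the only decoration on a price is surrounding whitespace
    # and an optional leading dollar sign.
    cleaned = price.strip().lstrip('$')
    return float(cleaned) == 0.0

for price in ['0', '0.0', '$4.99', '1.99']:
    print(repr(price), '->', is_free(price))
```

With a check like this, the same function could serve both data sets without passing a store-specific marker string.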


We want to find the properties of apps that attract the most users, because we plan to build free apps whose revenue comes from in-app ads. To reach as many users as possible, we need app profiles that can succeed on both the App Store and Google Play.
Our validation strategy for an app idea comprises three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We will begin by getting a sense of the most common genres in each market. To do so, we build frequency tables for a few columns in our data sets.

In [28]:
def freq_table(dataset,index): # A function to extract freq. table
    table = {}
    for x in dataset:
        if x[index] in table:
            table[x[index]] += 1
        else:
            table[x[index]] = 1
    return table

In [29]:
def display_table(dataset,index):
    table = freq_table(dataset,index)
    table_display = []  # not named `list`, which would shadow the built-in type
    for key in table:
        table_display.append((table[key], key))
    sorted_table = sorted(table_display, reverse = True)
    for entry in sorted_table:
        print(entry[1] + ' : ' + str(entry[0]))

In [30]:
display_table(google_free,9) # For genres

Tools : 747
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 312
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 156
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 74
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 40
Casino : 38
Trivia : 37
Educational;Education : 35
Educational : 33
Board : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [31]:
display_table(google_free,1) # Category

FAMILY : 1635
GAME : 875
TOOLS : 748
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 312
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 158
MAPS_AND_NAVIGATION : 124
EDUCATION : 114
FOOD_AND_DRINK : 110
ENTERTAINMENT : 100
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 74
WEATHER : 71
EVENTS : 63
ART_AND_DESIGN : 60
PARENTING : 58
COMICS : 55
BEAUTY : 53


In [32]:
display_table(apple_free,-5) # prime_genre

Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


Next, we compute the average number of user ratings per genre to get a more precise picture of which genres are the most popular:

In [33]:
prime_genre = freq_table(apple_free, -5)

In [34]:
tuples = []
for genre in prime_genre: # For every genre in the App Store
    total = 0
    len_genre = 0
    for x in apple_free: # Sum the rating counts of the apps in this genre
        genre_app = x[-5]
        if genre_app == genre:
            total += float(x[5]) # rating_count_tot
            len_genre += 1
    add_to_tuples = (total/len_genre,genre) # (average number of ratings, genre)
    tuples.append(add_to_tuples)
sorted_tuples = sorted(tuples, reverse = True) # Sort in descending order
for x in sorted_tuples:
    print(x[1] + ' : ' + str(x[0]))

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22812.92467948718
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


Here we can see that apps in the Navigation genre have the highest average number of ratings. However, this genre contains only 6 apps. If we had to come up with at least one app profile recommendation, it would be the Social Networking or Music genres, which rank well in both average number of ratings and number of apps.
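
The averages above are computed by scanning the whole data set once per genre. A minimal sketch of a single-pass version, using made-up rows and a hypothetical function name, accumulates a running total and a count for each group:

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the numeric column value_index} in one pass."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample: name, genre, number of ratings
sample = [
    ['x', 'Games', '10'],
    ['y', 'Games', '30'],
    ['z', 'Music', '5'],
]
print(average_by_group(sample, 1, 2))
```

For data sets of this size the nested loop is fast enough, so this is mainly a matter of readability.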

In [35]:
display_table(google_list, 5)

1,000,000+ : 1579
10,000,000+ : 1252
100,000+ : 1169
10,000+ : 1054
1,000+ : 907
5,000,000+ : 752
100+ : 719
500,000+ : 539
50,000+ : 479
5,000+ : 477
100,000,000+ : 409
10+ : 386
500+ : 330
50,000,000+ : 289
50+ : 205
5+ : 82
500,000,000+ : 72
1+ : 67
1,000,000,000+ : 58
0+ : 14
Installs : 1
0 : 1


The entry Installs : 1 shows that the header row is still present in google_list. We will now detect and remove it:

In [47]:
# Rebuild the list without the header row
google_list = [x for x in google_list if x[5] != 'Installs']

Now check whether the header entry still appears:

In [48]:
display_table(google_list, 5)

1,000,000+ : 1579
10,000,000+ : 1252
100,000+ : 1169
10,000+ : 1054
1,000+ : 907
5,000,000+ : 752
100+ : 719
500,000+ : 539
50,000+ : 479
5,000+ : 477
100,000,000+ : 409
10+ : 386
500+ : 330
50,000,000+ : 289
50+ : 205
5+ : 82
500,000,000+ : 72
1+ : 67
1,000,000,000+ : 58
0+ : 14
0 : 1
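
Removing rows while looping over a list is easy to get wrong: deleting the loop variable does not change the list itself. The sketch below, which uses made-up rows, contrasts that with building a filtered copy.

```python
rows = [['App', 'Installs'], ['a', '10+'], ['b', '100+']]  # made-up sample

# Attempt 1: `del row` only unbinds the name `row` for that iteration.
for row in rows:
    if row[1] == 'Installs':
        del row
length_after_del = len(rows)  # the header is still in the list

# Attempt 2: build a new list that leaves the header out.
rows_without_header = [row for row in rows if row[1] != 'Installs']

print(length_after_del, len(rows_without_header))  # prints: 3 2
```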


In [38]:
cat_freq = freq_table(google_free,1)

In [39]:
google_genre = freq_table(google_free,9)

In [46]:
tuples = []
for genres in google_genre: # For every genre on Google Play
    total = 0
    len_category = 0
    for x in google_free:
        category_app = x[9] # Genres
        if category_app == genres:
            installs = x[5] # Installs, e.g. '1,000,000+'
            installs = installs.replace('+','')
            installs = installs.replace(',','')
            total += float(installs)
            len_category += 1
    add_to_tuples = (total/len_category,genres) # (average installs, genre)
    tuples.append(add_to_tuples)
sortedtuples = sorted(tuples, reverse = True) # Sort in descending order
for x in sortedtuples:
    print(x[1] + " : " + str(x[0]))

Communication : 38456119.167247385
Adventure;Action & Adventure : 35333333.333333336
Video Players & Editors : 24947335.796178345
Social : 23253652.127118643
Arcade : 22888365.48780488
Casual : 19569221.602564104
Puzzle;Action & Adventure : 18366666.666666668
Photography : 17805627.643678162
Educational;Action & Adventure : 17016666.666666668
Productivity : 16787331.344927534
Racing : 15910645.681818182
Travel & Local : 14051476.145631067
Casual;Action & Adventure : 12916666.666666666
Action : 12603588.872727273
Strategy : 11199902.530864198
Tools : 10696176.002677375
Tools;Education : 10000000.0
Role Playing;Brain Games : 10000000.0
Lifestyle;Pretend Play : 10000000.0
Casual;Music & Video : 10000000.0
Card;Action & Adventure : 10000000.0
Adventure;Education : 10000000.0
News & Magazines : 9549178.467741935
Music : 9445583.333333334
Educational;Pretend Play : 9375000.0
Puzzle;Brain Games : 9280666.666666666
Racing;Action & Adventure : 8816666.666666666
Books & Reference : 8767811.89473
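
The Installs values are open-ended strings such as 1,000,000+, so the loop above strips the plus sign and the commas before converting them. A small sketch of that conversion as a reusable helper follows. parse_installs is a hypothetical name, and the result should be read as a lower bound, not an exact install count.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to the integer 1000000."""
    return int(text.replace('+', '').replace(',', ''))

for bucket in ['0', '5+', '1,000+', '1,000,000,000+']:
    print(bucket, '->', parse_installs(bucket))
```

Because every app in a bucket is counted at the bucket's lower edge, the per-genre averages are rough estimates.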

Taking both markets together, the app profiles that show the most potential for being profitable on both the App Store and Google Play are *Communication* and *Social* on Google Play and *Social Networking* on the App Store, which largely describe the same kind of app. We should also note that video apps, such as those in the *Video Players & Editors* genre on Google Play, show potential, too.