# Profitable App Analysis for Google Play Markets and App Store

The aim of this analysis is to find out which types of apps are likely to attract more users and hence generate more revenue through in-app ads. Assume we are working as data analysts for a company that builds only free mobile apps for Google Play and the App Store. This analysis is intended to help our developer team decide what to build by showing which types of apps are most successful at attracting users.

# Exploration

There is a vast number of apps across the two stores (over 4 million in total), so we will work with a sample of this data. We will use a data set containing data about approximately 10,000 Android apps from Google Play, collected in 2018, and a data set containing approximately 7,000 iOS apps from the App Store, collected in 2017.

First, we have to open the data and adjust it for our needs.


In [15]:
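from csv import reader

def open_dataset(file_name):
    # Read the whole CSV file into a list of rows (the first row is the header).
    # The explicit encoding avoids platform-dependent errors on non-ASCII app names.
    with open(file_name, encoding='utf8') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)
    return data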


apple_data=open_dataset('AppleStore.csv')
print(apple_data[:4])

google_data=open_dataset('googleplaystore.csv')
print(google_data[:4])

[['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']]
[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['C

By printing out some of the data, we can see its structure, but the raw output is difficult to read. We will create a function that prints each row on its own line, and then count the number of rows in each data set.



In [16]:
def exdata(data_set):
    for row in data_set:
        print(row)
        print('\n')

exdata(apple_data[:4])
exdata(google_data[:4])

print(len(apple_data[1:]))
print(len(google_data[1:]))

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']



The Apple Store data has 16 columns and the Google Play data has 13 columns. Only some of these columns are useful for interpreting how attractive an app is to users.

The columns that are most relevant to our analysis from the Apple Store data are: 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

For the Google Play data, they are: 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
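
Because the later cells refer to these columns by position (for example `row[3]` or `app[-5]`), it can help to look positions up from the header row instead of counting by hand. The following is only a sketch: `column_index` is a hypothetical helper that is not part of the original notebook, and the header lists are copied from the output above.

```python
# Hypothetical helper (not in the original notebook): find a column's
# position from the header row instead of hard-coding an index.
def column_index(header, column_name):
    """Return the position of column_name in the header row."""
    return header.index(column_name)

# Header rows copied from the printed output of the two data sets.
apple_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                'rating_count_tot', 'rating_count_ver', 'user_rating',
                'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

print(column_index(apple_header, 'prime_genre'))
print(column_index(google_header, 'Installs'))
```

With the real data, `apple_data[0]` and `google_data[0]` could be passed in place of the copied header lists.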

# Data Cleaning

Before beginning our analysis, we need to make sure the data we analyse is accurate; otherwise, our conclusions could be wrong. We have 7197 Apple Store apps and 10841 Google Play apps.

Since our developers build apps for an English-speaking audience, we have to remove non-English apps. We also need to remove apps that are not free.


In [17]:
print(apple_data[6748][1])#6748

你我贷理财-P2P理财管家


This app has Chinese characters in its name. To remove apps with non-English names, we will take advantage of the ASCII standard, which assigns each of its characters a number between 0 and 127. However, some English apps include emojis in their names, and these fall outside the ASCII range. To handle this, we will keep an app in our new data set as long as its name contains no more than 3 characters outside the ASCII range.

In [18]:
def english(name):
    not_ascii=0
    for character in name:
        if ord(character)>127:
            not_ascii+=1
    if not_ascii>3:
        return False
    else:
        return True
    
print(english(apple_data[6748][1]))
print(english(apple_data[1][1]))

False
True


Here, we have created a function that returns False if the number of characters outside the ASCII range is greater than 3. Some non-English apps may still pass through our filter and some English apps may be rejected, but there should be very few such cases. Next, we will apply this function to our data sets.
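
Before doing so, the threshold rule can be sanity-checked on a few names. This is a minimal sketch rather than the notebook's own cell: `is_english` and its `max_non_ascii` parameter are hypothetical names, and the sample names are taken from elsewhere in this notebook.

```python
# Hypothetical variant of the filter with the threshold exposed as a parameter.
def is_english(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters above code point 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

samples = [
    'Instagram',                        # plain ASCII
    'Kids Animals Jigsaw Puzzles 😄',    # one emoji
    '你我贷理财-P2P理财管家',              # mostly Chinese characters
]
for name in samples:
    print(name, ':', is_english(name))
```

Note that this is a character-range heuristic, not language detection: a name in another language written entirely with ASCII letters would still pass.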

In [19]:
google_english = []
apple_english = []

for row in google_data:
    name = row[0]
    if english(name):
        google_english.append(row)
        
for row in apple_data:
    name = row[1]
    if english(name):
        apple_english.append(row)
        
print(len(apple_english[1:]))
print(len(google_english[1:]))

6183
10796


Comparing these row counts with the originals, we see that some apps have been removed. Another issue is that some apps appear more than once.


In [20]:
repeated=[]
unique=[]
for row in google_english:
    if row[0] in unique:
        repeated.append(row[0])
    else:
        unique.append(row[0])
    
print(len(repeated))


1181


The Google Play data contains 1181 repeated entries.

In [21]:
for row in google_english:
    if row[0]=='Facebook':
        print(row)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']


The only difference between these rows is the number of reviews (the 4th column). Rather than removing repeated entries at random, we will keep the entry with the highest number of reviews, since a rating based on more reviews is more reliable.

In [22]:
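def review_count(row):
    # The header row and any malformed row hold a non-numeric value here; count them as 0.
    try:
        return int(row[3])
    except ValueError:
        return 0

# First pass: find the highest review count recorded for each app name.
# Counts are compared as integers, because comparing strings would rank '999' above '10000'.
max_reviews = {}
for row in google_english:
    name = row[0]
    num_reviews = review_count(row)
    if name not in max_reviews or max_reviews[name] < num_reviews:
        max_reviews[name] = num_reviews

# Second pass: keep one row per app, the one whose review count equals the maximum.
already_added = set()
google_clean = []
for row in google_english:
    name = row[0]
    if review_count(row) == max_reviews[name] and name not in already_added:
        google_clean.append(row)
        already_added.add(name)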
        
print(len(max_reviews))
print(len(google_clean))

9616
9616
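
The rule of keeping the most-reviewed entry per app can be checked on a toy data set. This is only a sketch under stated assumptions: `keep_most_reviewed` is a hypothetical helper, the two Facebook rows are shortened from the output above, and the 'Toy App' rows are made up for illustration.

```python
# Hypothetical helper: keep one row per app name, the one with the most reviews.
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Return one row per app name, keeping the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = int(row[reviews_index])
        # Keep the first row seen, or replace it when a later row has more reviews.
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy_rows = [
    ['Facebook', 'SOCIAL', '4.1', '78128208'],
    ['Facebook', 'SOCIAL', '4.1', '78158306'],
    ['Toy App', 'GAME', '4.0', '999'],      # made-up row
    ['Toy App', 'GAME', '4.0', '10000'],    # made-up row
]
for row in keep_most_reviewed(toy_rows):
    print(row[0], ':', row[3])
```

The 'Toy App' rows show why the counts are converted with `int`: compared as strings, `'999'` would rank above `'10000'`.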


Now that we have kept a single entry per app (the one with the most reviews), we can isolate the free apps.
    


In [23]:

google_final = []
apple_final = []

for row in google_clean:
    price = row[7]
    if price == '0':
        google_final.append(row)
        
for row in apple_english:
    price = row[4]
    if price == '0.0':
        apple_final.append(row)
        
print(len(google_final))
print(len(apple_final))

8862
3222


# Most Common Apps Sorted by Category

One way to find out which types of apps are most popular is to determine which categories dominate each store. We can do this with a frequency table, so we first write a function to generate one.

In [24]:

def freq_table(data_set, index):
    table = {}
    count = 0
    
    for row in data_set:
        count += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / count) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(data_set, index):
    table = freq_table(data_set, index)
    table_display = []
    for key in table:
        key_val = (table[key], key)
        table_display.append(key_val)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print('Google Play App Percentages')
print('\n')
display_table(google_final, 1)
print('\n')
print('Apple Store App Percentages')
print('\n')
display_table(apple_final, -5)

Google Play App Percentages


FAMILY : 18.449559918754233
GAME : 9.873617693522906
TOOLS : 8.440532611148726
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5206499661475967
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.782893252087565
MAPS_AND_NAVIGATION : 1.399232678853532
EDUCATION : 1.2863913337846988
FOOD_AND_DRINK : 1.2412547957571656
ENTERTAINMENT : 1.128413450688332
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8350259535093659
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
ART_AND_DESIGN : 0.677
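
As a cross-check on the frequency table, the same percentages can be computed with `collections.Counter` from the standard library. This is a sketch, not the notebook's method: `category_percentages` is a hypothetical name and the rows are toy data.

```python
from collections import Counter

# Hypothetical alternative to freq_table/display_table, built on Counter.
def category_percentages(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, 100 * count / total) for value, count in counts.most_common()]

# Toy rows: app name and category only.
toy_rows = [
    ['App A', 'FAMILY'],
    ['App B', 'FAMILY'],
    ['App C', 'GAME'],
    ['App D', 'TOOLS'],
]
for value, percentage in category_percentages(toy_rows, 1):
    print(f'{value} : {percentage:.2f}')
```

On the real data, `category_percentages(google_final, 1)` should give the same ranking as the table above, with the percentages left unrounded until they are printed.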

In the Google Play data, 'FAMILY' is the most common category (about 18% of the free English apps), followed by 'GAME' (about 10%). In the Apple Store data, the 'Games' genre dominates. We can look in more detail at the apps in the 'Games' category.

In [25]:
for app in google_final:
    if app[1] == 'GAME' and (app[5]=='500,000,000+'
                        or app[5]=='1,000,000,000+'):
        print(app[0], ':', app[5])

Subway Surfers : 1,000,000,000+
Candy Crush Saga : 500,000,000+
Temple Run 2 : 500,000,000+
Pou : 500,000,000+
My Talking Tom : 500,000,000+
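
The cell above matches install buckets by exact string. A sketch of an alternative is to convert the bucket to a number so that a threshold can be used instead. `parse_installs` is a hypothetical helper, and the app names and buckets are copied from the output above.

```python
# Hypothetical helper: turn an install bucket into a number for comparisons.
def parse_installs(installs):
    """Convert a string such as '1,000,000,000+' to the integer 1000000000."""
    return int(installs.replace(',', '').replace('+', ''))

# App names and install buckets copied from the output above.
games = [
    ('Subway Surfers', '1,000,000,000+'),
    ('Candy Crush Saga', '500,000,000+'),
    ('Temple Run 2', '500,000,000+'),
]
threshold = 500_000_000
for name, installs in games:
    if parse_installs(installs) >= threshold:
        print(name, ':', parse_installs(installs))
```

The parsed value is only a lower bound, since '500,000,000+' means at least that many installs.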


Games such as 'Subway Surfers', 'Candy Crush Saga' and 'Temple Run 2' have the highest install counts in the Google Play data. It may be worth considering a game of this kind, where the player has unlimited attempts and collects rewards such as tokens, which makes the game addictive to play. We can also look at the total number of user ratings in the Apple Store data.

In [29]:
for app in apple_final:
    if app[-5] == 'Games' and float(app[5])>600000:
        print(app[1], ':', app[5])

Clash of Clans : 2130805
Temple Run : 1724546
Candy Crush Saga : 961794
Angry Birds : 824451
Subway Surfers : 706110
Solitaire : 679055
CSR Racing : 677247
Crossy Road - Endless Arcade Hopper : 669079
Injustice: Gods Among Us : 612532


Three of the most-installed Google Play games (Subway Surfers, Candy Crush Saga and the Temple Run series) also appear among the five most-rated free games in the Apple Store data. We may also want to explore the 'FAMILY' category in the Google Play store.

In [30]:
for app in google_final:
    if app[1] == 'FAMILY' and (app[5]=='10,000,000+'
                        or app[5]=='50,000,000+'):
        print(app[0], ':', app[5])

Baby Panda Care : 10,000,000+
Toca Kitchen 2 : 50,000,000+
PJ Masks: Moonlight Heroes : 10,000,000+
No. Color - Color by Number, Number Coloring : 10,000,000+
ABC Kids - Tracing & Phonics : 10,000,000+
Barbie Magical Fashion : 10,000,000+
Piano Kids - Music & Songs : 10,000,000+
Farming Simulator 14 : 10,000,000+
Hot Wheels: Race Off : 10,000,000+
School of Dragons : 10,000,000+
Cars: Lightning League : 10,000,000+
LEGO® Juniors Create & Cruise : 50,000,000+
Thomas & Friends: Go Go Thomas : 10,000,000+
Plants vs. Zombies™ Heroes : 10,000,000+
Ice Cream Jump : 10,000,000+
Disney Magic Kingdoms: Build Your Own Magical Park : 10,000,000+
Turbo FAST : 50,000,000+
My Little Pony: Harmony Quest : 10,000,000+
Equestria Girls : 10,000,000+
Disney Crossy Road : 10,000,000+
Supermarket – Game for Kids : 10,000,000+
Kids Animals Jigsaw Puzzles 😄 : 10,000,000+
Inside Out Thought Bubbles : 10,000,000+
Frozen Free Fall : 50,000,000+
Shopkins World! : 10,000,000+
Masha and the Bear Child Games : 10,0

Looking at the names of the apps in the 'FAMILY' category, the majority are aimed at children. Apps based on puzzles, learning or animals appear frequently. This suggests combining the mechanics that make the games successful with the themes that make the family apps successful.

# Conclusion 

In this project, we analyzed data about Apple App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable in both markets.

We concluded that a family-oriented game is the most promising candidate for both stores. By incorporating a token-based or daily-rewards system into a learning or puzzle game, the app could become a fun and addictive way for children to learn creatively. This should drive installs and reviews, and hence ad revenue, on both the App Store and Google Play.