# App Store and Google Play Profitable Apps Profile 

**Introduction:** We want to find out which kinds of apps are likely to be profitable in both mobile app markets, the App Store and Google Play. We also want them to be widely used and downloaded. To do this we will look at the apps with the highest ratings, since a high rating suggests that users are happy with the app. Next we'll see which genres of app are most common.

Finally we will propose a few options for apps that could compete in an already crowded market.

https://www.kaggle.com/lava18/google-play-store-apps<br>
https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

**Data**<br>
Variables:

In [1]:
from csv import reader

def openafile(filename):
    # Read a CSV file and return its rows as a list of lists (header row included)
    with open(filename, encoding='utf8') as opened_file:
        return list(reader(opened_file))

google_data = openafile('googleplaystore.csv')
app_datata = openafile('AppleStore.csv')

In [2]:
print(google_data[:3])

[['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'], ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']]


In [3]:
print(app_datata[:3])

[['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'], ['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1'], ['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']]


# Let's explore the data
First we will print the first few rows of each dataset to get a sense of what they look like. Then we'll look at the column headers to see which columns we can use for our goal.

In [4]:
def explore_data(dataset, start, end, rows_and_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [5]:
explore_data(google_data,1,4,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [6]:
explore_data(app_datata,4,7,True)

['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


['5', '282935706', 'Bible', '92774400', 'USD', '0', '985920', '5320', '4.5', '5', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']


['6', '283619399', 'Shanghai Mahjong', '10485713', 'USD', '0.99', '8253', '5516', '4', '4', '1.8', '4+', 'Games', '47', '5', '1', '1']


Number of rows: 7198
Number of columns: 17


### Columns
Let's look at the columns to see which ones serve our purpose. We want to know whether an app is free and which genre it belongs to. We also want its rating, to see whether users enjoy it.

In [7]:
print('Google Play column headers', google_data[0])

Google Play column headers ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [8]:
print('App Store column headers', app_datata[0])

App Store column headers ['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


From the Google Play Store data we will use the following columns:<br>
- App - Application name
- Category - Category the app belongs to
- Rating - Overall user rating of the app
- Installs - Number of user downloads/installs for the app
- Type - Whether the app is Free or Paid
- Genres - An app can belong to multiple genres (apart from its main category)
<br><br>
From the App Store data we will use the following columns:<br><br>
- track_name - Application name
- price - Price of the app (0 for free apps)
- rating_count_tot - Number of user ratings (for all versions)
- user_rating - Average user rating (for all versions)
- prime_genre - Primary genre

The column descriptions come from the dataset pages linked in the introduction.
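Rather than memorising column positions, the indexes can be looked up from the header row. The following is a minimal sketch; `column_indexes` is a hypothetical helper that is not part of the notebook, and the header is copied from the Google Play output above.

```python
def column_indexes(header, wanted):
    """Map each wanted column name to its position in the header row."""
    return {name: header.index(name) for name in wanted}

# Header row copied from the Google Play output above
google_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                 'Current Ver', 'Android Ver']

google_cols = column_indexes(
    google_header, ['App', 'Category', 'Rating', 'Installs', 'Type', 'Genres'])
print(google_cols)
```

The same helper would work for the App Store header, for example with `['track_name', 'price', 'rating_count_tot', 'user_rating', 'prime_genre']`.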

# Data Cleaning
Now we will look for incorrect data: duplicates, and entries that do not fit our criteria (such as paid apps). We will also check the discussion sections of the data webpages for errors reported by other users.
<br><br>
The first step is to check the discussion section of each data source. For the Google Play data, users report that row 10472 (not counting the header) has a missing value. Because our list still includes the header row, that entry sits at index 10473.
The discussion section of the App Store data does not report any errors.

In [9]:
print(google_data[10473])
del google_data[10473]

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


We found that, in fact, the entry at index 10473 is missing its 'Category' value, so every later value is shifted one column to the left. The row has 12 values instead of 13, which is why we deleted it.
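A shifted row like this one can also be found without relying on the discussion page, because it has fewer values than the header. Below is a minimal sketch of such a check; `find_malformed_rows` is a hypothetical helper and the sample data is made up for illustration.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(dataset[0])
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Tiny hypothetical dataset with a header row
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '1.9'],  # 'Category' is missing, so the row is too short
]

malformed = find_malformed_rows(sample)
print(malformed)
```

Running the same function on `google_data` before the deletion should report the row discussed above, assuming it is the only row with the wrong length.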

### Next we will move on to find duplicates.
It was mentioned in the discussion of the Google Play data that there are multiple duplicates.

In [10]:
unique_apps = []
duplicate_apps =[]

for app in google_data[1:]:
    appname = app[0]
    if appname in unique_apps:
        duplicate_apps.append(appname)
    else:
        unique_apps.append(appname)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


We found quite a few duplicates, so we need a rule for choosing which entry to keep. It makes sense to keep the entry with the most recent information. We could use the date of the last update, but an app can go a long time without an update, so that date does not tell us which entry is newest. A more reliable signal is the number of reviews, which only grows over time.<br><br>
We need to find the highest review count for each app and keep the entry that has it.
To do this we will create a dictionary in which the key is the application name and the value is the highest review count.

In [11]:
reviews_max = {}

for app in google_data[1:]:
    appname = app[0]
    n_reviews = float(app[3])
    
    if appname in reviews_max and reviews_max[appname] < n_reviews:
        reviews_max[appname] = n_reviews
        
    elif appname not in reviews_max:
        reviews_max[appname] = n_reviews

Previously we calculated that the number of duplicate apps is 1181, so we subtract this from the number of apps in our dataset to find the expected length. Since the dataset still contains the header row, we count from row 1 instead of row 0.

In [12]:
print('Expected length:', len(google_data[1:])-1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


### Duplicate deletion
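Now we will delete the duplicates to get a clean dataset. To do this we iterate through the apps and keep, for each name, the entry whose review count matches the maximum stored in reviews_max. The already_added list guards against the case where several duplicates share the same maximum review count.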

In [13]:
google_data_clean= [] # Where our new cleaned dataset will be stored
already_added= [] # Criteria to check if we already added an item to the clean dataset

for app in google_data[1:]:
    appname = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[appname] == n_reviews) and (appname not in already_added):
        google_data_clean.append(app)
        already_added.append(appname)
        
explore_data(google_data_clean,10,13,True)

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


['Tattoo Name On My Photo Editor', 'ART_AND_DESIGN', '4.2', '44829', '20M', '10,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'April 2, 2018', '3.8', '4.1 and up']


['Mandala Coloring Book', 'ART_AND_DESIGN', '4.6', '4326', '21M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design', 'June 26, 2018', '1.0.4', '4.4 and up']


Number of rows: 9659
Number of columns: 13
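The two passes above (building `reviews_max`, then filtering) could also be folded into a single pass that stores the best row per name. This is only a sketch of an alternative, not the method used in this notebook; `keep_most_reviewed` and the sample rows are hypothetical, and it relies on dictionaries preserving insertion order (Python 3.7+).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Hypothetical header-less rows: [name, category, rating, reviews]
sample = [
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Zoom', 'BUSINESS', '4.4', '31614'],
    ['Box', 'BUSINESS', '4.2', '159900'],
]

deduped = keep_most_reviewed(sample)
print(deduped)
```

One difference to keep in mind: the result is ordered by the first appearance of each name, so the row order may differ from `google_data_clean`.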


In [14]:
unique_apps = []
duplicate_apps =[]

for app in app_datata[1:]:
    app_id = app[1]  # unique App Store id (column 0 is only a row counter)
    if app_id in unique_apps:
        duplicate_apps.append(app_id)
    else:
        unique_apps.append(app_id)

print('Number of duplicate apps:', len(duplicate_apps))
print('Examples of duplicate apps:', duplicate_apps[:5])

Number of duplicate apps: 0
Examples of duplicate apps: []


We can see that app_datata does not contain any duplicates, so we will move on to the next step. While exploring the data, we noticed that some apps have names in other languages.

### Apps for English users

Since we are targeting an English-speaking audience, we will remove the apps that are aimed at other languages. There is no column that gives an app's language, so we will infer it from the application names.
<br>
English text is written with the letters A-Z and a-z, the digits 0-9, and common punctuation, all of which belong to the ASCII character set. Each character has an ordinal number (its code point), and every ASCII character has an ordinal between 0 and 127. We can therefore flag an application name as non-English when it contains characters whose ordinal is above 127.
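To make the ordinal idea concrete, here is a small illustration using Python's built-in `ord`. The sample characters are chosen for this example only; three of them also appear in app names tested further down.

```python
# A few characters, from plain ASCII to symbols, CJK and emoji
samples = ['A', 'z', '7', '+', '™', '爱', '😜']

ordinals = {ch: ord(ch) for ch in samples}
for ch, code in ordinals.items():
    label = 'ASCII' if code <= 127 else 'non-ASCII'
    print(repr(ch), code, label)
```

Letters, digits and basic punctuation stay at or below 127, while the trademark sign, the Chinese character and the emoji are all well above it.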

In [15]:
def eng_checker(string):
    for character in string:
        if ord(character) > 127:
            return False
    return True

In [16]:
str1 ='Instagram'
str2 ='爱奇艺PPS -《欢乐颂2》电视剧热播'
str3 ='Docs To Go™ Free Office Suite'
str4 ='Instachat 😜'

print(eng_checker(str1))
print(eng_checker(str2))
print(eng_checker(str3))
print(eng_checker(str4))

True
False
False
False


Unfortunately this marks the last two strings as non-English. The cause is the trademark symbol (™) and the emoji: both fall outside the ASCII range even though they appear in English app names. To accommodate this, we will count the characters with an ordinal above 127 and treat a name as non-English only when it contains more than 3 of them.

In [17]:
def eng_checker(string):
    odd_count = 0

    for character in string:
        if ord(character) > 127:
            odd_count += 1

    # Allow up to 3 non-ASCII characters (emoji, ™ and similar symbols)
    return odd_count <= 3

print(eng_checker(str1))
print(eng_checker(str2))
print(eng_checker(str3))
print(eng_checker(str4))

True
False
True
True


Now that we have a working function we will go through each application name on both datasets and create new lists with only the English apps.

In [18]:
google_data_eng = []
app_data_eng = []

for app in google_data_clean:
    appname = app[0]
    if eng_checker(appname):
        google_data_eng.append(app)
for app in app_datata[1:]:
    appname = app[2]
    if eng_checker(appname):
        app_data_eng.append(app)
        
# Then let's explore the new lists.
explore_data(google_data_eng,10,12,True)
explore_data(app_data_eng,10,12,True)

['Name Art Photo Editor - Focus n Filters', 'ART_AND_DESIGN', '4.4', '8788', '12M', '1,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'July 31, 2018', '1.0.15', '4.0 and up']


['Tattoo Name On My Photo Editor', 'ART_AND_DESIGN', '4.2', '44829', '20M', '10,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'April 2, 2018', '3.8', '4.1 and up']


Number of rows: 9614
Number of columns: 13
['11', '284791396', 'Solitaire by MobilityWare', '49618944', 'USD', '4.99', '76720', '4017', '4.5', '4.5', '4.10.1', '4+', 'Games', '38', '4', '11', '1']


['12', '284815117', 'SCRABBLE Premium', '227547136', 'USD', '7.99', '105776', '166', '3.5', '2.5', '5.19.0', '4+', 'Games', '37', '0', '6', '1']


Number of rows: 6183
Number of columns: 17


We have 9614 app entries in the Google Play data and 6183 in the App Store data. Now that we have kept only the English apps, we can continue cleaning. In this last cleaning step, we will remove the apps that are not free.

### Removing non-free apps
We will again create new lists to store only the free apps. For Google Play this information is in the **Type column (index 6)** and for the App Store it is in the **price column (index 5)**.

In [19]:
google_data_free = []
app_data_free = []

for app in google_data_eng[1:]:
    free_checker = app[6]
    if free_checker == 'Free':
        google_data_free.append(app)
        
for apps in app_data_eng[1:]:
    price = apps[5]
    if price == '0':
        app_data_free.append(apps)
        
print(len(google_data_free))
print(len(app_data_free))

8862
3222


Now we have 8862 entries for Google Play and 3222 for the App Store.
<br>
Next we will begin analysing the data.
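One caveat about the free-app filter above: it compares the price to the exact string `'0'`. That matches the rows printed in this notebook, but it would miss a free app whose price was written as `'0.0'` or `'$0.00'`. The sketch below shows a more tolerant check under that assumption; `is_free` is a hypothetical helper, and whether such variants actually occur in either dataset has not been verified here.

```python
def is_free(price_text):
    """Return True when a price string represents zero, e.g. '0', '0.0' or '$0.00'."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Unparseable values (for example a shifted column) count as not free
        return False

prices = ['0', '0.0', '$0.00', '3.99', 'Everyone']
flags = [is_free(p) for p in prices]
print(flags)
```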

# Data Analysis
With the data cleaned, we can now explore it to find a suitable kind of app to create.
### Most profitable app
Since we want an app that can be offered in both markets, we need to find what works well in both. The plan has three steps: first release the app on the Android market and see how it does; if it shows good growth, keep developing it; if it is profitable after six months, port it to iOS.

<br>
We will look at three variables for this:

1. Category, Genres / prime_genre - to find which genres are popular
2. Installs / rating_count_tot - to find the apps that are downloaded the most
3. Rating / user_rating - to find the apps that users rate well

## 1. Genres

First we will begin by creating a frequency table of the genres for both datasets.

In [20]:
def freq_table(dataset,index):
    frequency_table={}
    total = 0
    for row in dataset[1:]:
        total += 1
        colname = row[index]
        if colname in frequency_table:
            frequency_table[colname] += 1
        else:
            frequency_table[colname] = 1
            
    table_percentages = {}
    for entry in frequency_table:
        percentage = (frequency_table[entry]/total) * 100
        table_percentages[entry] = percentage
    return table_percentages

def display_table(dataset,index):
    table = freq_table(dataset,index)
    table_display=[]
    for key in table:
        key_val_as_tuple = (table[key],key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
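For comparison, the same kind of percentage table could be built with `collections.Counter` from the standard library. This is a sketch of an alternative rather than a replacement for `freq_table`; `freq_table_counter` and the sample rows are hypothetical. Unlike `freq_table`, it counts every row it is given, so it expects a list without a header row.

```python
from collections import Counter

def freq_table_counter(rows, index):
    """Return {value: percentage} for one column of header-less rows."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}

# Hypothetical rows: [name, genre]
sample_rows = [
    ['App A', 'Games'],
    ['App B', 'Games'],
    ['App C', 'Tools'],
    ['App D', 'Games'],
]

percentages = freq_table_counter(sample_rows, 1)
# Print from the largest share to the smallest
for genre, share in sorted(percentages.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', share)
```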

In [21]:
# Category Google Play
display_table(google_data_free, 1)

FAMILY : 18.90305834555919
GAME : 9.728021667983297
TOOLS : 8.46405597562352
BUSINESS : 4.593161042771697
LIFESTYLE : 3.9047511567543167
PRODUCTIVITY : 3.8934657487868187
FINANCE : 3.7016138133393524
MEDICAL : 3.532332693826882
SPORTS : 3.3969077982169056
PERSONALIZATION : 3.317909942444419
COMMUNICATION : 3.2389120866719328
HEALTH_AND_FITNESS : 3.080916375126961
PHOTOGRAPHY : 2.9454914795169844
NEWS_AND_MAGAZINES : 2.7987811759395105
SOCIAL : 2.663356280329534
TRAVEL_AND_LOCAL : 2.3360794492720913
SHOPPING : 2.245796185532107
BOOKS_AND_REFERENCE : 2.144227513824625
DATING : 1.8620923146371742
VIDEO_PLAYERS : 1.794379866832186
MAPS_AND_NAVIGATION : 1.3993905879697552
FOOD_AND_DRINK : 1.2413948764247829
EDUCATION : 1.1623970206522967
ENTERTAINMENT : 0.9592596772373322
LIBRARIES_AND_DEMO : 0.9366888613023362
AUTO_AND_VEHICLES : 0.9254034533348381
HOUSE_AND_HOME : 0.8238347816273558
WEATHER : 0.8012639656923597
EVENTS : 0.7109807019523756
PARENTING : 0.6545536621148855
COMICS : 0.62069743

In [22]:
# Genre Google Play
display_table(google_data_free, 9)

Tools : 8.452770567656021
Entertainment : 6.071549486513937
Education : 5.349283376594064
Business : 4.593161042771697
Productivity : 3.8934657487868187
Lifestyle : 3.8934657487868187
Finance : 3.7016138133393524
Medical : 3.532332693826882
Sports : 3.464620246021894
Personalization : 3.317909942444419
Communication : 3.2389120866719328
Action : 3.103487191061957
Health & Fitness : 3.080916375126961
Photography : 2.9454914795169844
News & Magazines : 2.7987811759395105
Social : 2.663356280329534
Travel & Local : 2.3247940413045933
Shopping : 2.245796185532107
Books & Reference : 2.144227513824625
Simulation : 2.0426588421171425
Dating : 1.8620923146371742
Arcade : 1.8508069066696762
Video Players & Editors : 1.7718090508971898
Casual : 1.760523642929692
Maps & Navigation : 1.3993905879697552
Food & Drink : 1.2413948764247829
Puzzle : 1.1285407967498025
Racing : 0.9931159011398263
Role Playing : 0.9366888613023362
Libraries & Demo : 0.9366888613023362
Auto & Vehicles : 0.925403453334838

In [23]:
# Prime_genre Apps Store
display_table(app_data_free, 12)

Games : 58.180689226948154
Entertainment : 7.885749767153058
Photo & Video : 4.967401428127911
Education : 3.6634585532443342
Social Networking : 3.290903446134741
Shopping : 2.607885749767153
Utilities : 2.5147469729897547
Sports : 2.1421918658801617
Music : 2.049053089102763
Health & Fitness : 2.018006830176964
Productivity : 1.7075442409189692
Lifestyle : 1.5833592052157717
News : 1.334989133809376
Travel : 1.2418503570319777
Finance : 1.11766532132878
Weather : 0.8692952499223843
Food & Drink : 0.8072027320707855
Reference : 0.55883266066439
Business : 0.5277864017385905
Book : 0.43464762496119214
Navigation : 0.18627755355479667
Medical : 0.18627755355479667
Catalogs : 0.12418503570319776


### Category (Google)
Here we can see that the most common category is **family**, followed by **game** and then **tools**. Most of the remaining categories are practical ones such as business, lifestyle, productivity and finance, so the free apps on Google Play seem to lean towards function.
### Genre (Google)
The Genres column in the Google Play data is much more granular than Category, and many genres hold only a small share of the total. The most prominent are Tools, Entertainment, Education, and Business. Apart from Entertainment, these are practical rather than recreational. Since the list is so long, we will use Category for the rest of the analysis of the Google data.
### Prime_genre (App Store)
Here we can see that 58% of the apps are Games, far more than on Google Play, where GAME accounts for about 10% of the free apps and even the largest category, FAMILY, reaches only about 19%. The App Store apps lean towards entertainment rather than productivity.

### Category Conclusion
We cannot yet recommend a type of app that would work for both markets because, at this point, the two stores seem to serve different audiences.
Next we will look at the second variable, total downloads, to see which types of apps people download the most.

## 2. Application Installs
Now we will look at the number of installs per genre rather than per app. We don't want to copy an existing app; we want to see which genres attract the most users. This gives us a direction for the type of app that could succeed in the market.

For Google Play we have the variable **Installs**, which gives the number of times an application has been downloaded as an open-ended bracket (for example 10,000+). The App Store data has no such variable, so we will use **rating_count_tot** as a rough proxy for the number of downloads.
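The proxy for the App Store could be computed along these lines: average the rating counts per genre. This is a minimal sketch with made-up rows; `avg_by_genre` is a hypothetical helper. In the real App Store rows, `prime_genre` is at index 12 and `rating_count_tot` at index 6.

```python
def avg_by_genre(rows, genre_index, count_index):
    """Average a numeric column per genre for header-less rows."""
    totals = {}
    counts = {}
    for row in rows:
        genre = row[genre_index]
        totals[genre] = totals.get(genre, 0.0) + float(row[count_index])
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}

# Hypothetical rows: [track_name, rating_count_tot, prime_genre]
sample = [
    ['Game A', '100', 'Games'],
    ['Game B', '300', 'Games'],
    ['Notes', '50', 'Productivity'],
]

averages = avg_by_genre(sample, genre_index=2, count_index=1)
print(averages)
```

Rating counts only approximate downloads, since not every user who installs an app leaves a rating.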

First we'll use the frequency function we created earlier to get the list of genres, and then compute the average number of installs per genre.

In [24]:
google_ft = freq_table(google_data_free,1)
app_ft = freq_table(app_data_free,12)

In [25]:
for category in google_ft:
    total = 0
    len_category = 0
    for app in google_data_free[1:]:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            # remove the + and , and convert to a float
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            total += n_installs
            len_category += 1
    avg_installs = round(total/len_category)
    print(category,':',avg_installs)

ART_AND_DESIGN : 1967475
AUTO_AND_VEHICLES : 647318
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8767812
BUSINESS : 1712290
COMICS : 817657
COMMUNICATION : 38456119
DATING : 854029
EDUCATION : 1833495
ENTERTAINMENT : 11640706
EVENTS : 253542
FINANCE : 1387692
FOOD_AND_DRINK : 1924898
HEALTH_AND_FITNESS : 4188822
HOUSE_AND_HOME : 1331541
LIBRARIES_AND_DEMO : 638504
LIFESTYLE : 1437816
GAME : 15588016
FAMILY : 3697848
MEDICAL : 120551
SOCIAL : 23253652
SHOPPING : 7036877
PHOTOGRAPHY : 17840110
SPORTS : 3638640
TRAVEL_AND_LOCAL : 13984078
TOOLS : 10801391
PERSONALIZATION : 5201483
PRODUCTIVITY : 16787331
PARENTING : 542604
WEATHER : 5074486
VIDEO_PLAYERS : 24727872
NEWS_AND_MAGAZINES : 9549178
MAPS_AND_NAVIGATION : 4056942


For the Google Play Store, the categories with the highest average installs are Communication, Video Players, Social, Photography, Productivity, Game, and Travel and Local.
A high average does not necessarily mean that the whole category is popular: a few giant apps may account for most of the installs. Let's check.
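One way to see how much a few giant apps can distort an average is to compare the mean with the median. The sketch below uses a made-up category with one very large app; `parse_installs` is a hypothetical helper that mirrors the string cleaning used in this notebook.

```python
from statistics import mean, median

def parse_installs(text):
    """Convert an install bracket such as '1,000,000+' to an int."""
    return int(text.replace('+', '').replace(',', ''))

# Hypothetical category: one giant app and four small ones
installs = [parse_installs(s) for s in
            ['1,000,000,000+', '100,000+', '50,000+', '10,000+', '10,000+']]

avg = mean(installs)
mid = median(installs)
top_share = max(installs) / sum(installs)  # fraction of installs from the largest app

print('mean:', avg)
print('median:', mid)
print('share of top app:', round(top_share, 4))
```

When the mean is far above the median, as it is here, the average mostly describes the largest apps rather than a typical app in the category.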

In [58]:
def top_10apps(var_name):
    list_name = []
    for app in google_data_free[1:]:
        if app[1] == var_name:
            name = app[0]
            # remove the + and , and convert to a float
            installs = float(app[5].replace('+', '').replace(',', ''))
            list_name.append([installs, name])
    # sort once, after all apps of the category have been collected
    list_name = sorted(list_name, reverse=True)
    print(var_name)
    for element in list_name[:10]:
        print(element)

In [59]:
top_10apps('COMMUNICATION')
top_10apps('VIDEO_PLAYERS')
top_10apps('SOCIAL')
top_10apps('PHOTOGRAPHY')
top_10apps('PRODUCTIVITY')
top_10apps('GAME')
top_10apps('TRAVEL_AND_LOCAL')

COMMUNICATION
[1000000000.0, 'WhatsApp Messenger']
[1000000000.0, 'Skype - free IM & video calls']
[1000000000.0, 'Messenger – Text and Video Chat for Free']
[1000000000.0, 'Hangouts']
[1000000000.0, 'Google Chrome: Fast & Secure']
[1000000000.0, 'Gmail']
[500000000.0, 'imo free video calls and chat']
[500000000.0, 'Viber Messenger']
[500000000.0, 'UC Browser - Fast Download Private & Secure']
[500000000.0, 'LINE: Free Calls & Messages']
VIDEO_PLAYERS
[1000000000.0, 'YouTube']
[1000000000.0, 'Google Play Movies & TV']
[500000000.0, 'MX Player']
[100000000.0, 'VivaVideo - Video Editor & Photo Movie']
[100000000.0, 'VideoShow-Video Editor, Video Maker, Beauty Camera']
[100000000.0, 'VLC for Android']
[100000000.0, 'Motorola Gallery']
[100000000.0, 'Motorola FM Radio']
[100000000.0, 'Dubsmash']
[50000000.0, 'Vote for']
SOCIAL
[1000000000.0, 'Instagram']
[1000000000.0, 'Google+']
[1000000000.0, 'Facebook']
[500000000.0, 'Snapchat']
[500000000.0, 'Facebook Lite']
[100000000.0, 'VK']
[100000