# Exploring Profitable App Profiles

This project explores in-app ads to find out which types of apps are likely to attract more users. Our fictional company builds apps that are free to download, and our main source of revenue is in-app ads. My goal is to analyze the data, see which types of apps attract more users, and thereby help developers decide which new apps to build for Google Play and the App Store.

To avoid spending time and resources collecting new data ourselves, we should first check whether any relevant existing data is available. Two data sets seem suitable for our purpose:

* A [data set](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately ten thousand Android apps from Google Play.
* A [data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

In [2]:
# Import the csv module and read each file into a list of lists.
import csv
# `with` closes each file once it has been read.
with open('AppleStore.csv', encoding='utf8') as opened_file_apple:
    apple_data = list(csv.reader(opened_file_apple))
with open('googleplaystore.csv', encoding='utf8') as opened_file_google:
    google_data = list(csv.reader(opened_file_google))


### Exploring the data sets

In [3]:
#function to explore the data
def explore_data(dataset, start, end, rows_and_columns=False, header = True):
    if header: #delete header
        dataset = dataset[1:]
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row, end = '\n\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [4]:
print('{}\nFew examples of apple_data:\n'.format('*'*79))
explore_data(apple_data,5,10,rows_and_columns=True)
print('{}\nFew examples of google_data:\n'.format('*'*79))
explore_data(google_data,5,10,rows_and_columns=True)

*******************************************************************************
Few examples of apple_data:

['429047995', 'Pinterest', '74778624', 'USD', '0.0', '1061624', '1814', '4.5', '4.0', '6.26', '12+', 'Social Networking', '37', '5', '27', '1']

['282935706', 'Bible', '92774400', 'USD', '0.0', '985920', '5320', '4.5', '5.0', '7.5.1', '4+', 'Reference', '37', '5', '45', '1']

['553834731', 'Candy Crush Saga', '222846976', 'USD', '0.0', '961794', '2453', '4.5', '4.5', '1.101.0', '4+', 'Games', '43', '5', '24', '1']

['324684580', 'Spotify Music', '132510720', 'USD', '0.0', '878563', '8253', '4.5', '4.5', '8.4.3', '12+', 'Music', '37', '5', '18', '1']

['343200656', 'Angry Birds', '175966208', 'USD', '0.0', '824451', '107', '4.5', '3.0', '7.4.0', '4+', 'Games', '38', '0', '10', '1']

Number of rows: 7197
Number of columns: 16
*******************************************************************************
Few examples of google_data:

['Paper flowers instructions', 'ART_AND_DESIGN'

Now let's look at the column names:

In [5]:
print('header for apple apps:\n{}'.format(';  '.join(apple_data[0])), end='\n\n')
print('header for google apps:\n{}'.format(';  '.join(google_data[0])), end='\n\n')

header for apple apps:
id;  track_name;  size_bytes;  currency;  price;  rating_count_tot;  rating_count_ver;  user_rating;  user_rating_ver;  ver;  cont_rating;  prime_genre;  sup_devices.num;  ipadSc_urls.num;  lang.num;  vpp_lic

header for google apps:
App;  Category;  Rating;  Reviews;  Size;  Installs;  Type;  Price;  Content Rating;  Genres;  Last Updated;  Current Ver;  Android Ver



#### Column descriptions for the Google apps:

* App - Application name
    
* Category - Category the app belongs to
    
* Rating - Overall user rating of the app 
* Reviews - Number of user reviews for the app
    
* Size - Size of the app
    
*  Installs - Number of user downloads/installs for the app 
    
* Type - Paid or Free
    
* Price - Price of the app (as when scraped)
    
* Content Rating - Age group the app is targeted at - Children / Mature 21+ / Adult
    
* Genres - An app can belong to multiple genres (apart from its main category). For example, a musical family game will belong to the Music, Game, and Family genres.
    
* Last Updated - Date when the app was last updated on Play Store
    
* Current Ver - Current version of the app available on Play Store
    
* Android Ver - Min required Android version   


#### Column descriptions for the Apple apps:
* "id" : App ID

* "track_name": App Name

* "size_bytes": Size (in Bytes)

* "currency": Currency Type

* "price": Price amount

* "rating_count_tot": User Rating counts (for all versions)

* "rating_count_ver": User Rating counts (for current version)

* "user_rating" : Average User Rating value (for all versions)

* "user_rating_ver": Average User Rating value (for current version)

* "ver" : Latest version code

* "cont_rating": Content Rating

* "prime_genre": Primary Genre

* "sup_devices.num": Number of supporting devices

* "ipadSc_urls.num": Number of screenshots shown for display

* "lang.num": Number of supported languages

* "vpp_lic": Vpp Device Based Licensing Enabled

# Data cleaning

Now we need to check whether the data contains errors. First, let's find rows with missing values.

In [6]:
# function to report rows that contain at least one empty field
def find_missing_data(dataset,header = True):
    if header:
        dataset = dataset[1:]
    # `'' in row` adds each row once, even when several of its fields are empty
    wrong_data = [row for row in dataset if '' in row]
    if wrong_data:
        print('There are some wrong data here\n', wrong_data)
    else:
        print('There are no wrong data here')

In [7]:
print('Google data set:', end = '\n\n')
find_missing_data(google_data)
print('\nApple data set:', end = '\n\n')
find_missing_data(apple_data)

Google data set:

There are some wrong data here
 [['Market Update Helper', 'LIBRARIES_AND_DEMO', '4.1', '20145', '11k', '1,000,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'February 12, 2013', '', '1.5 and up'], ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]

Apple data set:

There are no wrong data here


Let's see what is missing in the Google data set. The second app is missing its "Category" value, which shifts every later column one position to the left. In this case it is better to remove the entire row from the data. The first app is missing its current version, which does not matter for this analysis, so I'll leave that row untouched.

In [8]:
#removing Life Made WI-Fi Touchscreen Photo Frame from data set
#(index 10473 counts the header row, which sits at index 0)
del google_data[10473]
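
A hard-coded index works only for this exact file. As a hedged alternative, a shifted row can be located by comparing each row's length with the header's length. The helper name `find_malformed_rows` and the toy rows below are made up for illustration and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(dataset[0])  # the first row is assumed to be the header
    return [(i, row) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data (made up): the last row is missing its category,
# so it has one field fewer than the header.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Shifted App', '4.5'],
]

malformed = find_malformed_rows(sample)
print(malformed)  # [(2, ['Shifted App', '4.5'])]
```

Run against `google_data`, a check like this should report the index of any row with a missing column, which can then be passed to `del`.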

## Removing duplicate data

Next, let's check whether the data sets contain duplicate entries. First I'll write a function that finds the duplicates, and then decide which entries to keep and which to remove.

In [9]:
# function to find duplicates. One list stores unique names, the other stores duplicate names.
def duplicate(dataset,header=True,ind_name = 0): # ind_name is the index of the app-name column: 0 for google, 1 for apple
    if header:
        dataset = dataset[1:]
    unique_name = []
    duplicates = []
    for app in dataset:
        name = app[ind_name]
        if name in unique_name:
            duplicates.append(name)
        else:
            unique_name.append(name)
    #print the length of these lists
    print('Number of duplicate apps:{}'.format(len(duplicates)))
    print('Examples of duplicate apps:{}'.format(duplicates[:10]))

In [10]:
duplicate(google_data)
print('\n')
duplicate(apple_data, ind_name = 1)

Number of duplicate apps:1181
Examples of duplicate apps:['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']


Number of duplicate apps:2
Examples of duplicate apps:['Mannequin Challenge', 'VR Roller Coaster']


Okay, let's look at some of these duplicates to confirm.

In [11]:
# for google:
print('\nfor google:\n')
for app in google_data:
    name = app[0]
    if name == 'Slack':
        print(app)
# for apple:
print('\nfor apple:\n')
for app in apple_data:
    name = app[1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(app)


for google:

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']

for apple:

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608',

---
The duplicate Google entries are nearly identical; where they differ, it is in the number of reviews (the `Reviews` column). The different numbers show that the data was collected at different times. So it makes sense to keep only the most recent entry for each app, which is the one with the highest number of reviews.

To do that, I will:

* Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
* Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)

For the duplicate Apple apps the picture is a little different: the entries differ in most fields, including the app ID and the version. Here I'll keep the entry with the latest version (the `ver` column, index 9).
***

In [12]:
# function to remove duplicates from our google data set
def clean_data(dataset,header = True):
    if header:
        dataset = dataset[1:]
    reviews_max = {}
    already_added = []
    google_clean = []
    for app in dataset: # create the dictionary described above
        name = app[0]
        n_reviews = float(app[3])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
    for app in dataset: # create a new clean data set
        name = app[0]
        n_reviews = float(app[3])
        # keep the first entry that has the maximum review count for this name
        if n_reviews == reviews_max[name] and name not in already_added:
            google_clean.append(app)
            already_added.append(name)
    print('Length of the clean data:', len(google_clean))
    return google_clean

In [13]:
clean_google_dataset = clean_data(google_data)

Length of the clean data: 9659


For the Apple data, I'll manually keep the latest version (the `ver` column, the 10th column) of each duplicate and remove the older one. With only two duplicated names, there is no need to write a function.

In [14]:
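for app in enumerate(apple_data): #to get index of the row
    name = app[1][1]
    if name == 'Mannequin Challenge' or name == 'VR Roller Coaster':
        print(app)
# delete the older versions; go from the highest index down so that
# the first deletion does not shift the position of the second row
del apple_data[4832]
del apple_data[4464]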

(2949, ['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1'])
(4443, ['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1'])
(4464, ['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1'])
(4832, ['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1'])
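
Deleting several rows by index needs some care, because each `del` moves every later row up by one position. The sketch below uses a made-up toy list (not the real data) to show the effect and one common way around it: deleting from the highest index down.

```python
rows = ['header', 'a', 'b', 'c', 'd']
to_delete = [1, 3]  # intended targets: 'a' and 'c'

# Ascending order: after removing index 1, 'c' has moved to index 2,
# so index 3 now points at 'd'.
naive = list(rows)
for i in to_delete:
    del naive[i]

# Descending order: earlier positions are unaffected by later deletions.
safe = list(rows)
for i in sorted(to_delete, reverse=True):
    del safe[i]

print(naive)  # ['header', 'b', 'c']
print(safe)   # ['header', 'b', 'd']
```

An alternative that avoids index arithmetic altogether is to build a new list with a comprehension that skips the unwanted rows.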


## Removing non-English apps

***
We're considering the English-language market only, so we need to remove non-English apps from our data sets.
Let's build a function that flags a string as non-English if it contains more than three characters whose code points fall outside the ASCII range. The characters commonly used in English text all fall in the range 0 to 127 of the ASCII (American Standard Code for Information Interchange) system. So I'll use the built-in function `ord` to get each character's code point. Allowing up to three non-ASCII characters keeps English names that contain an emoji or a symbol such as ™.
***
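
To make the ASCII rule concrete, here is a small illustration of what `ord` returns for a few characters. The sample characters are chosen only for demonstration.

```python
samples = ['a', 'Z', '™', '爱', '😜']

# Map each character to its Unicode code point.
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    label = 'ASCII' if cp <= 127 else 'non-ASCII'
    print(repr(ch), cp, label)
# 'a' 97 ASCII
# 'Z' 90 ASCII
# '™' 8482 non-ASCII
# '爱' 29233 non-ASCII
# '😜' 128540 non-ASCII
```

Letters, digits, and common punctuation stay at or below 127, while symbols, emoji, and non-Latin scripts land well above it.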

In [15]:
def detect_english(string):
    count = 0
    for ch in string:
        if ord(ch) > 127:
            count += 1 
    if count > 3:
        return False
    else:
        return True

In [16]:
# Checking our function
print(detect_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(detect_english('Instachat 😜'))
print(detect_english('Docs To Go™ Free Office Suite'))

False
True
True


Everything seems to be okay. Now let's loop over our clean data and remove the non-English apps.

In [17]:
english_clean_apple_data = [app for app in apple_data if detect_english(app[1])]
english_clean_google_data = [app for app in clean_google_dataset if detect_english(app[0])]

In [18]:
#see how much data left
print('number of the new english apple data:', len(english_clean_apple_data))
print('number of the new english google data:', len(english_clean_google_data))

number of the new english apple data: 6182
number of the new english google data: 9614


The function is not perfect, and a few non-English apps might pass our filter, but this seems good enough at this point in our analysis — we shouldn't spend too much time on optimization at this point.

## Isolating the free apps

____
As I mentioned in the introduction, we only build apps that are free to download since our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for analysis.
***

In [19]:
new_google_data = [app for app in english_clean_google_data if app[6] == 'Free']
new_apple_data = [app for app in english_clean_apple_data if app[4] == '0.0']
print(len(new_google_data))
print(len(new_apple_data))

8863
3221


# Data Analysis

Our aim is to determine the kinds of apps that are likely to attract more users because revenue is highly influenced by the number of people using apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

****
Let's build two functions we can use to analyze the data:

* One function to generate frequency tables that show percentages
* Another function we can use to display the percentages in descending order for convenience

In [20]:
# build frequency table
def freq_table(dataset, index):
    frequency_table = {}
    total = 0
    for app in dataset:
        total += 1
        if app[index] in frequency_table:
            frequency_table[app[index]] += 1
        else:
            frequency_table[app[index]] = 1
    # use a dict comprehension to convert the counts to percentages
    table_percentages = {key : (frequency_table[key] / total) * 100 for key in frequency_table}
    
    return table_percentages


In [21]:
# convenient displaying table
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key,value in table.items(): # create a tuple of key-value to sort the dictionary
        dic_as_tuple = (value, key) 
        table_display.append(dic_as_tuple)
    #for nice printing and to sort data:
    table_sorted = ['{}:{}'.format(var[1],var[0]) for var in sorted(table_display, reverse = True)] 
    return table_sorted

In [22]:
# table for google_data:
Genres_table = display_table(new_google_data, -4) #Genres
Category_table = display_table(new_google_data, 1) #Category
# table for apple_data:
Prime_Genre_table = display_table(new_apple_data, -5) #Prime Genre

**Let's do some data analysis.** 

First, look at the Apple store:

In [23]:
print(*Prime_Genre_table, sep='\n')

Games:58.149642968022356
Entertainment:7.885749767153058
Photo & Video:4.967401428127911
Education:3.6634585532443342
Social Networking:3.290903446134741
Shopping:2.607885749767153
Utilities:2.5147469729897547
Sports:2.1421918658801617
Music:2.049053089102763
Health & Fitness:2.018006830176964
Productivity:1.7385904998447685
Lifestyle:1.5833592052157717
News:1.334989133809376
Travel:1.2418503570319777
Finance:1.11766532132878
Weather:0.8692952499223843
Food & Drink:0.8072027320707855
Reference:0.55883266066439
Business:0.5277864017385905
Book:0.43464762496119214
Navigation:0.18627755355479667
Medical:0.18627755355479667
Catalogs:0.12418503570319776


As we can see, the most common genre among free English apps is `Games` (58%). The runner-up is `Entertainment` (about 8%), followed by `Photo & Video` (about 5%). The gap between games plus entertainment and, for instance, `Education` is large `(~66% vs ~4%)`.

We can conclude that apps in the free segment are designed mostly for entertainment, whereas apps with practical purposes (education, shopping, some utilities) are rarer. However, this is not enough to recommend an app profile. The frequency table alone does not tell us that the most common genre attracts the most users, or that we should focus on entertainment. The fact that these apps are the most numerous does not mean they also have the greatest number of users: demand might not match supply. Still, it is essential to survey the market and see which types of apps are most frequent in the segment we are considering.

**Next, let's look at the Google Play market:**

In [24]:
print(*Category_table, sep='\n')

FAMILY:18.898792733837304
GAME:9.725826469592688
TOOLS:8.462146000225657
BUSINESS:4.592124562789123
LIFESTYLE:3.9038700214374367
PRODUCTIVITY:3.8925871601038025
FINANCE:3.7007785174320205
MEDICAL:3.5315355974275078
SPORTS:3.396141261423897
PERSONALIZATION:3.317161232088458
COMMUNICATION:3.2381812027530184
HEALTH_AND_FITNESS:3.0802211440821394
PHOTOGRAPHY:2.944826808078529
NEWS_AND_MAGAZINES:2.798149610741284
SOCIAL:2.6627552747376737
TRAVEL_AND_LOCAL:2.335552296062281
SHOPPING:2.245289405393208
BOOKS_AND_REFERENCE:2.1437436533904997
DATING:1.8616721200496444
VIDEO_PLAYERS:1.7939749520478394
MAPS_AND_NAVIGATION:1.399074805370642
FOOD_AND_DRINK:1.241114746699763
EDUCATION:1.1621347173643235
ENTERTAINMENT:0.9590432133589079
LIBRARIES_AND_DEMO:0.9364774906916393
AUTO_AND_VEHICLES:0.9251946293580051
HOUSE_AND_HOME:0.8236488773552973
WEATHER:0.8010831546880289
EVENTS:0.7108202640189552
PARENTING:0.6544059573507841
ART_AND_DESIGN:0.6431230960171499
COMICS:0.6205573733498815
BEAUTY:0.597991650

The picture on Google Play looks different. Here the market is led by apps in the Family category (19%), with games as the runner-up (about 10%). It seems that a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate further, we can see that the Family category consists mostly of games for kids.

In [25]:
for app in new_google_data:
    if app[1] == 'FAMILY':
        print(app[0])

Jewels Crush- Match 3 Puzzle
Coloring & Learn
Mahjong
Super ABC! Learning games for kids! Preschool apps
Toy Pop Cubes
Educational Games 4 Kids
Candy Pop Story
Princess Coloring Book
Hello Kitty Nail Salon
Candy Smash
Happy Fruits Bomb - Cube Blast
Princess Adventures Puzzles
Kids Educational Game 3 Free
Puzzle Kids - Animals Shapes and Jigsaw Puzzles
Coloring book moana
Baby Panda Care
Kids Educational :All in One
Number Counting games for toddler preschool kids
Learn To Draw Glow Flower
No. Color - Color by Number, Number Coloring
Draw.ly - Color by Number Pixel Art Coloring
Baby puzzles
Garden Fruit Legend
Barbie™ Fashion Closet
Candy Day
Learn To Draw Glow Princess
ABC Kids - Tracing & Phonics
Barbie Magical Fashion
Minion Rush: Despicable Me Official Game
Piano Kids - Music & Songs
Educational Games for Kids
No.Draw - Colors by Number 2018
Fruit Boom
Baby Tiger Care - My Cute Virtual Pet Friend
Rhythm Patrol
Kiddopia - Preschool Learning Games
Papumba Academy - Fun Learning For Ki

Even so, practical apps seem to be better represented on Google Play than on the App Store. This picture is also confirmed by the frequency table for the Genres column:

In [26]:
print(*Genres_table, sep='\n', end='\n\n')

Tools:8.450863138892023
Entertainment:6.070179397495204
Education:5.348076272142616
Business:4.592124562789123
Productivity:3.8925871601038025
Lifestyle:3.8925871601038025
Finance:3.7007785174320205
Medical:3.5315355974275078
Sports:3.463838429425702
Personalization:3.317161232088458
Communication:3.2381812027530184
Action:3.102786866749408
Health & Fitness:3.0802211440821394
Photography:2.944826808078529
News & Magazines:2.798149610741284
Social:2.6627552747376737
Travel & Local:2.324269434728647
Shopping:2.245289405393208
Books & Reference:2.1437436533904997
Simulation:2.042197901387792
Dating:1.8616721200496444
Arcade:1.8503892587160102
Video Players & Editors:1.771409229380571
Casual:1.7601263680469368
Maps & Navigation:1.399074805370642
Food & Drink:1.241114746699763
Puzzle:1.128286133363421
Racing:0.9928917973598104
Role Playing:0.9364774906916393
Libraries & Demo:0.9364774906916393
Auto & Vehicles:0.9251946293580051
Strategy:0.9026289066907368
House & Home:0.8236488773552973
Wea

***
So tools, education, and productivity apps are far from the bottom in popularity. The difference between the `Genres` and `Category` columns is a little unclear, but the `Genres` column is much more granular (it has more distinct values). Since we only need the bigger picture for now, I'll work with the `Category` column from here on.

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play seems to be more balanced between practical and for-fun apps.

To answer the main question of this project and find out which kind of app is most profitable for our business plan, we need to look at the apps that have the most users.
***

One way to find out which genres have the most users is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but the App Store data set has no such column. As a workaround, I'll use the total number of user ratings, found in the `rating_count_tot` column, as a rough proxy for the number of installs.

Below, we calculate the average number of user ratings per app genre on the App Store:

In [27]:
for genre in freq_table(new_apple_data,-5):
    count = 0
    tot_rating = 0
    for app in new_apple_data:
        genre_in_data = app[-5]
        app_ratting = float(app[5])
        if genre_in_data == genre:
            count += 1
            tot_rating += app_ratting
    avr_rating = tot_rating/count
    print(genre,avr_rating)

Shopping 26919.690476190477
Lifestyle 16485.764705882353
Medical 612.0
Weather 52279.892857142855
Food & Drink 33333.92307692308
Travel 28243.8
Sports 23008.898550724636
Entertainment 14029.830708661417
Utilities 18684.456790123455
Business 7491.117647058823
Productivity 21028.410714285714
Reference 74942.11111111111
Book 39758.5
Education 7003.983050847458
Social Networking 71548.34905660378
Health & Fitness 23298.015384615384
Music 57326.530303030304
Games 22800.780565937
Navigation 86090.33333333333
Photo & Video 28441.54375
Finance 31467.944444444445
Catalogs 4004.0
News 21248.023255813954


It is tempting to say that the most profitable genres on the App Store are Navigation or Social Networking, but if we look deeper we can see how a few apps skew the overall picture.

In [46]:
[str(app[1])+' : '+str(app[5]) for app in new_apple_data 
                                    if app[-5] == 'Navigation']

['Waze - GPS Navigation, Maps & Real-time Traffic : 345046',
 'Google Maps - Navigation & Transit : 154911',
 'Geocaching® : 12811',
 'CoPilot GPS – Car Navigation & Offline Maps : 3582',
 'ImmobilienScout24: Real Estate Search in Germany : 187',
 'Railway Route Search : 5']

This figure is heavily influenced by Waze and Google Maps. The same pattern applies to social networking apps, where the average is heavily influenced by a few giants like Facebook, Pinterest, and Skype, and to music apps, where Pandora, Spotify, and Shazam dominate the average.

In [44]:
print('Social Networking:\n',[str(app[1]) +  ' : ' + str(app[5]) 
                              for app in new_apple_data 
                              if app[-5] == 'Social Networking'][:5])

print('\n\nMusic:\n', [str(app[1]) +  ' : ' + str(app[5]) 
                       for app in new_apple_data 
                       if app[-5] == 'Music'][:5])

Social Networking:
 ['Facebook : 2974676', 'Pinterest : 1061624', 'Skype for iPhone : 373519', 'Messenger : 351466', 'Tumblr : 334293']


Music:
 ['Pandora - Music & Radio : 1126879', 'Spotify Music : 878563', 'Shazam - Discover music, artists, videos & lyrics : 402925', 'iHeartRadio – Free Music & Radio Stations : 293228', 'SoundCloud - Music & Audio : 135744']
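
One way to see how strongly a few giants pull the average is to compare the mean with the median, which is much less sensitive to outliers. The sketch below uses the six Navigation rating counts printed above; using the median is an illustrative addition here, not part of the original analysis.

```python
from statistics import mean, median

# rating_count_tot values for the six free English Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)

print(round(nav_mean, 2))  # 86090.33
print(nav_median)          # 8196.5
```

The mean is roughly ten times the median, which suggests that a typical Navigation app receives far fewer ratings than the genre average implies.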


___
Reference apps have 74,942 user ratings on average, but it's actually the Bible and Dictionary.com that skew the average:

In [30]:
for app in new_apple_data:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


However, even if we set aside 'Bible' and 'Dictionary.com', the genre still does not look bad, so this may be the niche we are looking for. The App Store is dominated by entertainment apps; there are many similar apps, and the market may be somewhat saturated. That could help a practical app in the Reference category stand out among the huge number of apps on the App Store. One thing we could do is take a popular book and turn it into an app with features beyond the raw text of the book, such as daily quotes, an audio version, and quizzes about the book.

### Let's look at the Google data:

If we look at the `Installs` column, we see that install counts are reported in open-ended buckets such as 10,000+, 100,000+, and 1,000,000+.

One problem with this data is that it is not precise enough. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users.

I'm going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.

In [37]:
sorted_list = [] #create a list to hold result in each loop and sort it then.
for genre in freq_table(new_google_data,1):
    count = 0
    tot_installs = 0
    for app in new_google_data:
        genre_in_data = app[1]
        if genre_in_data == genre:
            app_installs = app[5]
            app_installs = app_installs.replace('+' , '')
            app_installs = app_installs.replace(',' , '')
            count += 1
            tot_installs += float(app_installs)
    avr_installs = round(tot_installs/count)
    sorted_list.append((avr_installs, genre))
print(*sorted(sorted_list, reverse = True), sep = '\n')

(38456119, 'COMMUNICATION')
(24727872, 'VIDEO_PLAYERS')
(23253652, 'SOCIAL')
(17840110, 'PHOTOGRAPHY')
(16787331, 'PRODUCTIVITY')
(15588016, 'GAME')
(13984078, 'TRAVEL_AND_LOCAL')
(11640706, 'ENTERTAINMENT')
(10801391, 'TOOLS')
(9549178, 'NEWS_AND_MAGAZINES')
(8767812, 'BOOKS_AND_REFERENCE')
(7036877, 'SHOPPING')
(5201483, 'PERSONALIZATION')
(5074486, 'WEATHER')
(4188822, 'HEALTH_AND_FITNESS')
(4056942, 'MAPS_AND_NAVIGATION')
(3697848, 'FAMILY')
(3638640, 'SPORTS')
(1986335, 'ART_AND_DESIGN')
(1924898, 'FOOD_AND_DRINK')
(1833495, 'EDUCATION')
(1712290, 'BUSINESS')
(1437816, 'LIFESTYLE')
(1387692, 'FINANCE')
(1331541, 'HOUSE_AND_HOME')
(854029, 'DATING')
(817657, 'COMICS')
(647318, 'AUTO_AND_VEHICLES')
(638504, 'LIBRARIES_AND_DEMO')
(542604, 'PARENTING')
(513152, 'BEAUTY')
(253542, 'EVENTS')
(120551, 'MEDICAL')


At first glance, it might seem that communication apps are the most profitable, because they have the highest average number of installs: 38,456,119. But this number is heavily skewed by a few apps that have over one billion installs (WhatsApp, Facebook Messenger, Skype, Google Chrome, Gmail, and Hangouts).

In [38]:
[app[0] + ':' + app[5] for app in new_google_data 
                                      if app[1] == 'COMMUNICATION' 
                                      and (app[5] == '1,000,000,000+'
                                      or app[5] == '500,000,000+'
                                      or app[5] == '100,000,000+')]


['WhatsApp Messenger:1,000,000,000+',
 'imo beta free calls and text:100,000,000+',
 'Android Messages:100,000,000+',
 'Google Duo - High Quality Video Calls:500,000,000+',
 'Messenger – Text and Video Chat for Free:1,000,000,000+',
 'imo free video calls and chat:500,000,000+',
 'Skype - free IM & video calls:1,000,000,000+',
 'Who:100,000,000+',
 'GO SMS Pro - Messenger, Free Themes, Emoji:100,000,000+',
 'LINE: Free Calls & Messages:500,000,000+',
 'Google Chrome: Fast & Secure:1,000,000,000+',
 'Firefox Browser fast & private:100,000,000+',
 'UC Browser - Fast Download Private & Secure:500,000,000+',
 'Gmail:1,000,000,000+',
 'Hangouts:1,000,000,000+',
 'Messenger Lite: Free Calls & Messages:100,000,000+',
 'Kik:100,000,000+',
 'KakaoTalk: Free Calls & Text:100,000,000+',
 'Opera Mini - fast web browser:100,000,000+',
 'Opera Browser: Fast and Secure:100,000,000+',
 'Telegram:100,000,000+',
 'Truecaller: Caller ID, SMS spam blocking & Dialer:100,000,000+',
 'UC Browser Mini -Tiny F

This niche seems to be dominated by a few giants that are hard to compete against. The same can be said of the video players category, the runner-up with 24,727,872 installs, and of social apps. These markets are dominated by apps like YouTube, Google Play Movies & TV, MX Player, Facebook, Instagram, and Google+.
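
To gauge how much the giants inflate a category average, one option is to recompute it while leaving out apps above some install threshold. The sketch below is only an illustration: the helper names `parse_installs` and `average_installs`, the 100,000,000 cutoff, and the sample rows are all assumptions made up for this example, with the rows laid out so that the category sits at index 1 and the installs at index 5, as in the Google data.

```python
def parse_installs(text):
    """Turn an install bucket such as '1,000,000+' into the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))

def average_installs(rows, category, cap=None, cat_idx=1, installs_idx=5):
    """Average installs for one category, optionally ignoring apps at or above `cap`."""
    values = [parse_installs(r[installs_idx]) for r in rows if r[cat_idx] == category]
    if cap is not None:
        values = [v for v in values if v < cap]
    return sum(values) / len(values) if values else 0.0

# Made-up rows: only the name, category, and installs columns are filled in.
sample = [
    ['Giant Chat', 'COMMUNICATION', '', '', '', '1,000,000,000+'],
    ['Small Chat', 'COMMUNICATION', '', '', '', '100,000+'],
    ['Tiny Chat', 'COMMUNICATION', '', '', '', '50,000+'],
]

with_giants = average_installs(sample, 'COMMUNICATION')
without_giants = average_installs(sample, 'COMMUNICATION', cap=100_000_000)

print(round(with_giants))  # 333383333
print(without_giants)      # 75000.0
```

With the real data, calling `average_installs(new_google_data, 'COMMUNICATION', cap=100_000_000)` would give an average for the apps below the cutoff, which may be a better guide to what a newcomer can expect.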

The game category seems popular, as it was on the App Store, and it could be profitable with in-app ads as the source of income. But we found earlier that this part of the market seems quite saturated, so I'd like to come up with a different app recommendation if possible.

Next is productivity. This is an interesting niche: in my experience with apps such as to-do lists and task managers, there is a lack of good free ones. Let's explore our potential competitors.

The books and reference category looks fairly popular as well, with an average of 8,767,812 installs. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [39]:
[app[0] + ':' + app[5] for app in new_google_data if app[1] == 'PRODUCTIVITY'] 

['Microsoft Word:500,000,000+',
 'All-In-One Toolbox: Cleaner, Booster, App Manager:10,000,000+',
 'AVG Cleaner – Speed, Battery & Memory Booster:10,000,000+',
 'QR Scanner & Barcode Scanner 2018:10,000,000+',
 'Chrome Beta:10,000,000+',
 'Microsoft Outlook:100,000,000+',
 'Google PDF Viewer:10,000,000+',
 'My Claro Peru:5,000,000+',
 'Power Booster - Junk Cleaner & CPU Cooler & Boost:1,000,000+',
 'Google Assistant:10,000,000+',
 'Microsoft OneDrive:100,000,000+',
 'Calculator - unit converter:50,000,000+',
 'Microsoft OneNote:100,000,000+',
 'Metro name iD:10,000,000+',
 'Google Keep:100,000,000+',
 'Archos File Manager:5,000,000+',
 'ES File Explorer File Manager:100,000,000+',
 'ASUS SuperNote:10,000,000+',
 'HTC File Manager:10,000,000+',
 'MyMTN:1,000,000+',
 'Dropbox:500,000,000+',
 'ASUS Quick Memo:10,000,000+',
 'HTC Calendar:10,000,000+',
 'Google Docs:100,000,000+',
 'ASUS Calling Screen:10,000,000+',
 'lifebox:5,000,000+',
 'Yandex.Disk:5,000,000+',
 'Content Transfer:5,000

We see that Microsoft Word, All-In-One Toolbox: Cleaner, Booster, App Manager, AVG Cleaner – Speed, Battery & Memory Booster and so on have a large number of installs. But these apps are about data, resource, and battery management. If we look at the kind of app we are particularly interested in, we can see: `Wunderlist: To-Do List & Tasks:10,000,000+`, `Todoist: To-do lists for task management & errands:10,000,000+`, `Any.do: To-do list, Calendar, Reminders & Planner:10,000,000+`. They are serious competitors, but some of them offer only a free trial or split their features between a free and a pro version. Anyway, we can create an app with tips, resources, task lists, etc. for fitness, workouts, or particular sports. There is a lack of such specialized free apps and, as we saw, the genre is fairly popular on the App Store as well (`Health & Fitness - 23298` average number of user ratings, `Productivity - 21028`).
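
Rather than scanning the full output by eye, the to-do and task apps could be pulled out with a keyword search on the app name. This is a minimal sketch under assumptions: `find_by_keywords` is a hypothetical helper, and the sample is a handful of `(name, installs)` pairs copied from the output and the paragraph above instead of the real `new_google_data` rows.

```python
def find_by_keywords(rows, keywords):
    """Return 'name:installs' strings for apps whose name contains
    any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [
        name + ':' + installs
        for name, installs in rows
        if any(keyword in name.lower() for keyword in lowered)
    ]


# Small sample taken from the productivity apps mentioned above.
sample = [
    ('Microsoft Word', '500,000,000+'),
    ('Google Keep', '100,000,000+'),
    ('Wunderlist: To-Do List & Tasks', '10,000,000+'),
    ('Todoist: To-do lists for task management & errands', '10,000,000+'),
    ('Any.do: To-do list, Calendar, Reminders & Planner', '10,000,000+'),
]

matches = find_by_keywords(sample, ['to-do', 'task'])
for match in matches:
    print(match)
```

A plain substring match like this misses apps that describe themselves differently (for example "planner" or "checklist"), so the keyword list would need tuning against the real data.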

The next niche is `Books and reference`. It's interesting to explore this in more depth, since we found this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.

In [43]:
[app[0] + ':' + app[5] for app in new_google_data 
                                    if app[1] == 'BOOKS_AND_REFERENCE'] 

['E-Book Read - Read Book for free:50,000+',
 'Download free book with green book:100,000+',
 'Wikipedia:10,000,000+',
 'Cool Reader:10,000,000+',
 'Free Panda Radio Music:100,000+',
 'Book store:1,000,000+',
 'FBReader: Favorite Book Reader:10,000,000+',
 'English Grammar Complete Handbook:500,000+',
 'Free Books - Spirit Fanfiction and Stories:1,000,000+',
 'Google Play Books:1,000,000,000+',
 'AlReader -any text book reader:5,000,000+',
 'Offline English Dictionary:100,000+',
 'Offline: English to Tagalog Dictionary:500,000+',
 'FamilySearch Tree:1,000,000+',
 'Cloud of Books:1,000,000+',
 'Recipes of Prophetic Medicine for free:500,000+',
 'ReadEra – free ebook reader:1,000,000+',
 'Anonymous caller detection:10,000+',
 'Ebook Reader:5,000,000+',
 'Litnet - E-books:100,000+',
 'Read books online:5,000,000+',
 'English to Urdu Dictionary:500,000+',
 'eBoox: book reader fb2 epub zip:1,000,000+',
 'English Persian Dictionary:500,000+',
 'Flybook:500,000+',
 'All Maths Formulas:1,000,0

This genre includes a variety of apps: software for processing and reading ebooks, various collections of libraries, dictionaries, tutorials on programming or languages, etc.

Let's look at the apps that skew the average:

In [41]:
[app[0] + ':' + app[5] for app in new_google_data
 if app[1] == 'BOOKS_AND_REFERENCE'
 and app[5] in ('1,000,000,000+', '500,000,000+', '100,000,000+')]


['Google Play Books:1,000,000,000+',
 'Bible:100,000,000+',
 'Amazon Kindle:100,000,000+',
 'Wattpad 📖 Free Books:100,000,000+',
 'Audiobooks from Audible:100,000,000+']
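
A quick way to see how strongly a handful of very large apps pulls the average is to compare the mean with the median. The sketch below is only an illustration: `installs_to_int` is a hypothetical helper, and the sample is seven apps copied from the outputs above, not the whole category.

```python
from statistics import mean, median


def installs_to_int(installs):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


# Small sample copied from the BOOKS_AND_REFERENCE outputs above.
sample = [
    ('Google Play Books', '1,000,000,000+'),
    ('Bible', '100,000,000+'),
    ('Wikipedia', '10,000,000+'),
    ('Cool Reader', '10,000,000+'),
    ('Book store', '1,000,000+'),
    ('English Grammar Complete Handbook', '500,000+'),
    ('Offline English Dictionary', '100,000+'),
]

counts = [installs_to_int(installs) for _, installs in sample]
mean_installs = mean(counts)
median_installs = median(counts)

print('mean:  ', round(mean_installs))
print('median:', median_installs)
```

When the mean sits far above the median, as it does in this sample, most apps in the list are much smaller than the average suggests.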

Only a few really popular apps skew our data upward, so this market shows potential. Now we need to work out what kind of apps we would compete with.

In [42]:
# List the apps with between 1,000,000+ and 50,000,000+ installs
[app[0] + ':' + app[5] for app in new_google_data
 if app[1] == 'BOOKS_AND_REFERENCE'
 and app[5] in ('1,000,000+', '5,000,000+', '10,000,000+', '50,000,000+')]

['Wikipedia:10,000,000+',
 'Cool Reader:10,000,000+',
 'Book store:1,000,000+',
 'FBReader: Favorite Book Reader:10,000,000+',
 'Free Books - Spirit Fanfiction and Stories:1,000,000+',
 'AlReader -any text book reader:5,000,000+',
 'FamilySearch Tree:1,000,000+',
 'Cloud of Books:1,000,000+',
 'ReadEra – free ebook reader:1,000,000+',
 'Ebook Reader:5,000,000+',
 'Read books online:5,000,000+',
 'eBoox: book reader fb2 epub zip:1,000,000+',
 'All Maths Formulas:1,000,000+',
 'Ancestry:5,000,000+',
 'HTC Help:10,000,000+',
 'Moon+ Reader:10,000,000+',
 'English-Myanmar Dictionary:1,000,000+',
 'Golden Dictionary (EN-AR):1,000,000+',
 'All Language Translator Free:1,000,000+',
 'Aldiko Book Reader:10,000,000+',
 'Dictionary - WordWeb:5,000,000+',
 '50000 Free eBooks & Free AudioBooks:5,000,000+',
 'Al-Quran (Free):10,000,000+',
 'Al Quran Indonesia:10,000,000+',
 "Al'Quran Bahasa Indonesia:10,000,000+",
 'Al Quran Al karim:1,000,000+',
 'Al Quran : EAlim - Translations & MP3 Offline:5,00

There are a lot of apps for processing and reading ebooks, as well as various collections of libraries and dictionaries, so it's not a good idea to build a similar app, such as another reader, since the competition would be severe.
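
The mid-range cell above selects apps by matching the install labels as exact strings, which means every label has to be listed by hand. An alternative is to convert the label to a number and filter on a numeric range. This is a minimal sketch under assumptions: `installs_to_int` and `in_install_range` are hypothetical helpers, and the sample holds a few `(name, installs)` pairs copied from the outputs above.

```python
def installs_to_int(installs):
    """Turn a string such as '1,000,000+' into the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))


def in_install_range(rows, low, high):
    """Return 'name:installs' strings for apps whose install count
    lies between `low` and `high`, both inclusive."""
    return [
        name + ':' + installs
        for name, installs in rows
        if low <= installs_to_int(installs) <= high
    ]


# Small sample copied from the BOOKS_AND_REFERENCE outputs above.
sample = [
    ('Google Play Books', '1,000,000,000+'),
    ('Wikipedia', '10,000,000+'),
    ('Book store', '1,000,000+'),
    ('Ebook Reader', '5,000,000+'),
    ('Flybook', '500,000+'),
    ('Anonymous caller detection', '10,000+'),
]

mid_tier = in_install_range(sample, 1_000_000, 50_000_000)
for app in mid_tier:
    print(app)
```

The numeric version makes it easy to try other bounds without editing a list of labels.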

As you may notice, in the middle of this list there are quite a few apps built around the Quran. Taking a popular book (perhaps a more recent one) and turning it into an app might be a good solution. It seems that it could be profitable for both the Google Play and the App Store markets. But we need to add some special features besides the raw version of the book to compete with other readers on the market. This might include daily quotes from the book, an audio version of the book, quizzes, and a really impressive design.

# Conclusion

In this project, I explored data about the App Store and Google Play markets to come up with a recommendation for an app profile that can be profitable on both.

The most interesting niches, as we concluded, are `Books and reference` and a combination of `Productivity` and `Health and Fitness`. We could take a popular book and turn it into an app with extra features besides the raw version of the book. We could also design a task management app with sections dedicated to fitness needs: exercise plans, timetables, diet recommendations, and daily to-dos for staying in shape. Finally, we could create a game app with in-app ads, but both markets are really saturated with games. It would be hard to design something fresh and gripping and to compete with the huge video game corporations.