For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and in the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means that the number of users of our apps determines our revenue for any given app — the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

# Profitable Apps for Apple Store and Google Play Store Market

Our objective is to determine what kinds of applications are profitable on the Apple Store and Google Play Store.

At our company we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

## 1. Collect and Explore the Data

First, we need to collect the data from the respective sources.

1. [Apple Store Apps](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
2. [Google Play Store Apps](https://www.kaggle.com/lava18/google-play-store-apps/version/6)

In [1]:
from csv import reader

# Apple Store Dataset
data_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(data_file)
apple_data = list(read_file)
apple_header = apple_data[0]
apple_data = apple_data[1:]
data_file.close()

# Google Play Store Dataset
data_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(data_file)
google_data = list(read_file)
google_header = google_data[0]
google_data = google_data[1:]
data_file.close()
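The two loading blocks above repeat the same steps, so they could be folded into one helper. The sketch below is one possible version, assuming a hypothetical `load_dataset()` function that is not part of the original notebook. It uses a `with` block so the file is closed even if reading fails.

```python
from csv import reader

def load_dataset(path, encoding='utf8'):
    """Return (header, rows) for a CSV file.

    Hypothetical helper for illustration; the notebook itself opens
    and closes each file by hand.
    """
    # newline='' is the setting the csv module recommends for file objects
    with open(path, encoding=encoding, newline='') as data_file:
        rows = list(reader(data_file))
    return rows[0], rows[1:]
```

With this helper, loading would reduce to `apple_header, apple_data = load_dataset('AppleStore.csv')` and the equivalent call for `googleplaystore.csv`.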

Next, we will create a function called `explore_data()` to make it easier to inspect our datasets. The `start` and `end` arguments select which rows are printed, and `rows_and_columns` optionally prints the size of the dataset.

In [2]:
def explore_data(dataset: list, start: int, end: int, rows_and_columns = False):
    data_sliced = dataset[start:end]
    for row in data_sliced:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of columns: ',len(dataset[0]))
        print('Number of rows: ',len(dataset))

Let's check a preview of our datasets:

In [3]:
# Apple Dataset
print('-------------------------------------------------------------------------')
print('------------------------------Apple Dataset------------------------------')
print('-------------------------------------------------------------------------')
print(apple_header)
print('\n')
explore_data(apple_data,0,4,True)
print('\n')

# Google Dataset
print('--------------------------------------------------------------------------')
print('------------------------------Google Dataset------------------------------')
print('--------------------------------------------------------------------------')
print(google_header)
print('\n')
explore_data(google_data,0,4,True)

-------------------------------------------------------------------------
------------------------------Apple Dataset------------------------------
-------------------------------------------------------------------------
['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping'

The Apple dataset has 17 columns and 7197 rows. The useful columns for our analysis could be `track_name`, `currency`, `price`, `rating_count_tot`, `rating_count_ver` and `prime_genre`. You can find a more detailed column description [here](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps).

The Google dataset has 13 columns and 10841 rows. The useful columns for our analysis could be `App`, `Category`, `Rating`, `Installs`, `Type`, `Price` and `Genres`.

## 2. Data Cleaning

### 2.1 Deleting wrong data

The Google Play dataset has a dedicated discussion section, and we can see that one of the discussions describes an error in row 10472. Let's compare with the header and another row.

In [4]:
print(google_header) # Google dataset header
print(google_data[2]) # App info
print(google_data[10472]) # Row with error

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The row with the error refers to the app "Life Made WI-Fi Touchscreen Photo Frame", and we can see that its *Category* value is **1.9**. This means that the real *Category* value is missing and the remaining values have shifted one column to the left. We can also check the row length and compare it with the header.

In [5]:
print("Google header length: ", len(google_header))
print("Google row with error length: ", len(google_data[10472]))

Google header length:  13
Google row with error length:  12


The row with the error is one field shorter than the header, so we will delete it.

In [6]:
print("Length before delete: ", len(google_data))
del google_data[10472]
print("Length after delete: ", len(google_data))

Length before delete:  10841
Length after delete:  10840
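Instead of relying on the discussion forum to learn about a broken row, a length check over the whole dataset could flag such rows directly. This is a minimal sketch, assuming a hypothetical `find_malformed_rows()` helper and a tiny made-up sample rather than the real data.

```python
def find_malformed_rows(header, dataset):
    """Return (index, row_length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Tiny illustrative sample (made-up rows, not taken from the dataset)
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one field is missing
    ['App C', 'GAME', '3.8'],
]
print(find_malformed_rows(sample_header, sample_rows))  # [(1, 2)]
```

If several rows were flagged, deleting them from the highest index to the lowest would keep the remaining indices valid.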


### 2.2 Deleting Duplicates

The [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section also reports that the Google Play dataset contains duplicate apps. These duplicates should be removed from the dataset.

For instance, let's find out how many Instagram entries there are in the dataset.

In [7]:
app_name = 'Instagram'
app_duplicate = []
for row in google_data:
    if row[0] == app_name:
        app_duplicate.append(row)

print('There are {} apps with the name {}.'.format(len(app_duplicate),app_name))

There are 4 apps with the name Instagram.


Let's verify how many duplicate apps there are in our dataset.

In [8]:
unique_apps = []
non_unique_apps = []

for row in google_data:
    name = row[0]
    if name in unique_apps:
        non_unique_apps.append(name)
    else:
        unique_apps.append(name)

print('Total of unique apps: ',len(unique_apps))
print('Total of non-unique apps: ',len(non_unique_apps))

Total of unique apps:  9659
Total of non-unique apps:  1181
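The loop above checks membership in a list, which rescans the list for every row. As an alternative sketch, `collections.Counter` can produce the same two totals in a single pass. The `count_duplicates()` name and the sample rows are hypothetical and only serve as illustration.

```python
from collections import Counter

def count_duplicates(dataset, name_index=0):
    """Return (number of unique names, number of extra duplicate rows)."""
    counts = Counter(row[name_index] for row in dataset)
    unique = len(counts)
    extra = sum(counts.values()) - unique
    return unique, extra

# Made-up sample: three 'Instagram' rows and one 'Slack' row
sample = [['Instagram'], ['Instagram'], ['Slack'], ['Instagram']]
print(count_duplicates(sample))  # (2, 2)
```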


Going back to our Instagram example, let's inspect whether the entries differ from each other.

In [9]:
print(google_header)
print('\n')
explore_data(app_duplicate,0,len(app_duplicate),False)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




The 'Reviews' column (index 3) is the only field that differs between these entries. Since the number of reviews only grows over time, we can assume that the entry with the most reviews is the most recent one.

To keep only the most recent entry for each app, without any duplicates, we will:

- Create an empty dictionary
- Iterate over the Google Play dataset and save in the dictionary the entry with the highest number of reviews for each app
- Create a new list containing only the unique apps


In [10]:
app_dict = {} #new dictionary

#Iterate over Google Play dataset and save the app in the dictionary
for row in google_data:
    name = row[0]
    row[3] = float(row[3])
    nbr_reviews = row[3]
    #print(nbr_reviews)
    if name not in app_dict:
        app_dict[name] = row
    elif nbr_reviews > app_dict[name][3]:
        app_dict[name] = row

print(len(app_dict))
print(app_dict['Instagram'])


9659
['Instagram', 'SOCIAL', '4.5', 66577446.0, 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Taking the Instagram app as an example, let's confirm that the most recent entry is in our dictionary and that we only have unique apps.

In [11]:
print(app_dict['Instagram']) # Instagram app data
print('Total apps with duplicates: ', len(google_data)) # Length of the original dataset
print('Total unique apps: ',len(app_dict))
print('Total of duplicate apps: ',len(non_unique_apps))
print('Total apps after remove duplicates: ', str(len(google_data)-len(non_unique_apps)))

['Instagram', 'SOCIAL', '4.5', 66577446.0, 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Total apps with duplicates:  10840
Total unique apps:  9659
Total of duplicate apps:  1181
Total apps after remove duplicates:  9659


Now let's create a new list based on our dictionary.

In [12]:
android_apps = list(app_dict.values())
explore_data(android_apps,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', 159.0, '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'FAMILY', '3.9', 974.0, '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', 87510.0, '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of columns:  13
Number of rows:  9659


### 2.3 Deleting Non-English apps

Since we only design apps for an English-speaking audience, we'll check whether there are non-English apps in our datasets. The only way to identify non-English apps in our datasets is by their names.

According to the ASCII system, the numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127. Let's create a function that checks whether an app name contains any character outside that range.

In [13]:
def english_app(name: str):
    for char in name:
        if ord(char) > 127:
            return False
    return True

Let's try some examples.

In [14]:
print(english_app('Instagram'))
print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suite'))
print(english_app('Instachat 😜'))
print('\n')
print(ord('™'))
print(ord('😜'))

True
False
False
False


8482
128540


We can see that emojis and the trademark symbol fall outside the ASCII range (0 - 127) used for English text. To keep such apps in the datasets we analyze, we will allow a maximum of 3 characters that fall outside the ASCII range (0 - 127).

In [15]:
def english_app(name: str):
    count = 0
    for char in name:
        if ord(char) > 127:
            count += 1
    if count > 3:
        return False
    return True

print(english_app('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_app('Docs To Go™ Free Office Suite'))
print(english_app('Instachat 😜'))

False
True
True
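The threshold of 3 is a judgment call, so it can help to expose it as a parameter and try other values. The sketch below is a compact variant of the same heuristic; the name `is_probably_english()` and the `max_non_ascii` parameter are assumptions introduced here for illustration.

```python
def is_probably_english(name, max_non_ascii=3):
    """True when `name` has at most `max_non_ascii` characters with a code point above 127."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_probably_english('Docs To Go™ Free Office Suite'))   # True
print(is_probably_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False
print(is_probably_english('Instachat 😜', max_non_ascii=0))    # False
```

This remains a heuristic: a non-English name written with mostly ASCII letters still passes, and an English name with more than three emojis is rejected.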


Now let's filter the non-English apps out of both datasets and keep only the English apps.

In [16]:
english_android = []
english_ios = []

# For Android apps
for row in android_apps:
    name = row[0]
    if english_app(name):
        english_android.append(row)

# For Apple apps
for row in apple_data:
    name = row[2]
    if english_app(name):
        english_ios.append(row)

In [17]:
print('Total English apps in Android: ', len(english_android))
print('Total English apps in iOS: ', len(english_ios))

Total English apps in Android:  9614
Total English apps in iOS:  6183


We can verify that we have a total of **9614 Android** and **6183 iOS** English applications.

### 2.4 Filter Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our datasets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [18]:
print(google_header)
explore_data(english_android, 0, 2,False)
print('\n')
print(apple_header)
explore_data(english_ios, 0, 2,False)


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', 159.0, '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'FAMILY', '3.9', 974.0, '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', '

In our Google Play Store dataset, we should filter on the 'Price' column (index = 7) to get the free apps (price = 0 dollars). In the iOS dataset, we should filter on the 'price' column (index = 5).

In [19]:
free_android_apps = [] # Only free Android apps
free_ios_apps = [] # Only free iOS apps

# Filtering Android free apps
for row in english_android:
    price = row[7]
    if price == '0':
        free_android_apps.append(row)

# Filtering iOS free Apps
for row in english_ios:
    price = float(row[5])
    if price == 0.0:
        free_ios_apps.append(row)

In [20]:
print('Total Free apps in Android: ', len(free_android_apps))
print('Total Free apps in iOS: ', len(free_ios_apps))

Total Free apps in Android:  8864
Total Free apps in iOS:  3222
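The two loops above compare prices in different ways (a string comparison for Android, a float comparison for iOS). A single check could handle both. This sketch assumes that paid prices may carry a leading `$` sign, and `is_free()` is a hypothetical helper, not part of the original notebook.

```python
def is_free(price):
    """Treat '0', '0.0', '$0.00' and similar strings as free."""
    cleaned = price.strip().lstrip('$')  # assumption: a price may start with '$'
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparsable values are not counted as free

print(is_free('0'))      # True
print(is_free('$4.99'))  # False
print(is_free('0.0'))    # True
```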


## 3. Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.
To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:
- Build a minimal Android version of the app, and add it to Google Play.
- If the app has a good response from users, we develop it further.
- If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres in each market. For this, we'll need to build frequency tables for the `Category` and `Genres` columns of the Google Play dataset and the `prime_genre` column of the Apple Store dataset.

First, we will build a function to calculate the frequency table. Then we will build a function to display the percentages in descending order.

In [21]:
# Frequency table function
def frequency_table(dataset: list, index: int):
    freq_dict = {}
    for row in dataset:
        value = row[index]
        if value in freq_dict:
            freq_dict[value] += 1
        else:
            freq_dict[value] = 1
    
    # calculate the percentage for each genre
    for row in freq_dict:
        freq_dict[row] = round((freq_dict[row]/len(dataset))*100,2)
        #print(row,': ',freq_dict[row])
    return freq_dict
        
    

In [22]:
teste = frequency_table(free_ios_apps, 12)
teste

{'Productivity': 1.74,
 'Weather': 0.87,
 'Shopping': 2.61,
 'Reference': 0.56,
 'Finance': 1.12,
 'Music': 2.05,
 'Utilities': 2.51,
 'Travel': 1.24,
 'Social Networking': 3.29,
 'Sports': 2.14,
 'Health & Fitness': 2.02,
 'Games': 58.16,
 'Food & Drink': 0.81,
 'News': 1.33,
 'Book': 0.43,
 'Photo & Video': 4.97,
 'Entertainment': 7.88,
 'Business': 0.53,
 'Lifestyle': 1.58,
 'Education': 3.66,
 'Navigation': 0.19,
 'Medical': 0.19,
 'Catalogs': 0.12}

In [23]:
# Display results function
def display_results(dataset: list, index: int):
    freq_dict = frequency_table(dataset, index)
    table = []
    for row in freq_dict:
        entry_as_tuple = (freq_dict[row], row)
        table.append(entry_as_tuple)
    # Sort the table descending
    table_sorted = sorted(table, reverse = True)
    for row in table_sorted:
        print(row[1],':',row[0],'%')
    #print(table)
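The two helpers above can also be expressed with `collections.Counter`, whose `most_common()` method already returns entries from most to least frequent. The following is a sketch under that approach; `frequency_table_sorted()` and the sample rows are hypothetical.

```python
from collections import Counter

def frequency_table_sorted(dataset, index):
    """Return [(value, percentage)] ordered from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Made-up sample rows: [name, genre]
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for genre, pct in frequency_table_sorted(sample, 1):
    print(genre, ':', pct, '%')
```

For the sample this prints `Games : 75.0 %` followed by `Music : 25.0 %`.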


We will start by analyzing the frequency table for the `prime_genre` column of the Apple Store dataset.

In [24]:
display_results(free_ios_apps, 12)

Games : 58.16 %
Entertainment : 7.88 %
Photo & Video : 4.97 %
Education : 3.66 %
Social Networking : 3.29 %
Shopping : 2.61 %
Utilities : 2.51 %
Sports : 2.14 %
Music : 2.05 %
Health & Fitness : 2.02 %
Productivity : 1.74 %
Lifestyle : 1.58 %
News : 1.33 %
Travel : 1.24 %
Finance : 1.12 %
Weather : 0.87 %
Food & Drink : 0.81 %
Reference : 0.56 %
Business : 0.53 %
Book : 0.43 %
Navigation : 0.19 %
Medical : 0.19 %
Catalogs : 0.12 %


Among the free English apps, more than half (**58.16%**) are games. Entertainment apps follow with 7.88%, then Photo & Video with 4.97%, Education with 3.66% and Social Networking with 3.29%, which completes our top 5 prime genres.

In general, most of the apps available in the Apple Store are designed for fun (Games, Entertainment, Photo and Video, Social Networks, Sports, etc). However, that does not mean those apps have a high number of users.

Now it's time to look at the frequency tables for the `Category` and `Genres` columns of the Google Play dataset. Both columns seem to share some similarities.

In [25]:
# Google Play store dataset
category_index = 1
genre_index = 9

In [26]:
display_results(free_android_apps,category_index)

FAMILY : 18.91 %
GAME : 9.72 %
TOOLS : 8.46 %
BUSINESS : 4.59 %
LIFESTYLE : 3.9 %
PRODUCTIVITY : 3.89 %
FINANCE : 3.7 %
MEDICAL : 3.53 %
SPORTS : 3.4 %
PERSONALIZATION : 3.32 %
COMMUNICATION : 3.24 %
HEALTH_AND_FITNESS : 3.08 %
PHOTOGRAPHY : 2.94 %
NEWS_AND_MAGAZINES : 2.8 %
SOCIAL : 2.66 %
TRAVEL_AND_LOCAL : 2.34 %
SHOPPING : 2.25 %
BOOKS_AND_REFERENCE : 2.14 %
DATING : 1.86 %
VIDEO_PLAYERS : 1.79 %
MAPS_AND_NAVIGATION : 1.4 %
FOOD_AND_DRINK : 1.24 %
EDUCATION : 1.16 %
ENTERTAINMENT : 0.96 %
LIBRARIES_AND_DEMO : 0.94 %
AUTO_AND_VEHICLES : 0.93 %
HOUSE_AND_HOME : 0.82 %
WEATHER : 0.8 %
EVENTS : 0.71 %
PARENTING : 0.65 %
ART_AND_DESIGN : 0.64 %
COMICS : 0.62 %
BEAUTY : 0.6 %


The picture here is different from the Apple Store frequency table. The apps are spread more evenly across categories, and most of the top categories are designed for practical purposes (Family, Tools, Business, Lifestyle, Productivity and so on).

In [27]:
display_results(free_android_apps,genre_index)

Tools : 8.45 %
Entertainment : 6.07 %
Education : 5.35 %
Business : 4.59 %
Productivity : 3.89 %
Lifestyle : 3.89 %
Finance : 3.7 %
Medical : 3.53 %
Sports : 3.46 %
Personalization : 3.32 %
Communication : 3.24 %
Action : 3.1 %
Health & Fitness : 3.08 %
Photography : 2.94 %
News & Magazines : 2.8 %
Social : 2.66 %
Travel & Local : 2.32 %
Shopping : 2.25 %
Books & Reference : 2.14 %
Simulation : 2.04 %
Dating : 1.86 %
Arcade : 1.85 %
Video Players & Editors : 1.77 %
Casual : 1.76 %
Maps & Navigation : 1.4 %
Food & Drink : 1.24 %
Puzzle : 1.13 %
Racing : 0.99 %
Role Playing : 0.94 %
Libraries & Demo : 0.94 %
Auto & Vehicles : 0.93 %
Strategy : 0.91 %
House & Home : 0.82 %
Weather : 0.8 %
Events : 0.71 %
Adventure : 0.68 %
Comics : 0.61 %
Beauty : 0.6 %
Art & Design : 0.6 %
Parenting : 0.5 %
Card : 0.45 %
Casino : 0.43 %
Trivia : 0.42 %
Educational;Education : 0.39 %
Board : 0.38 %
Educational : 0.37 %
Education;Education : 0.34 %
Word : 0.26 %
Casual;Pretend Play : 0.24 %
Music : 0.2 %
R

The difference between the `Category` and `Genres` columns in the Google Play dataset is not very clear. The `Genres` column seems much more granular, like a set of subcategories of the `Category` column.

Now that we have checked the frequency tables for both datasets, we can conclude that the Apple dataset has a significant percentage of apps designed for fun, while the Google Play dataset shows a wider variety of apps, most of them practical.

### 3.1 Popular Apps by Genre in Apple Store

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play dataset, we can find this information in the `Installs` column, but this information is missing for the App Store dataset. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.

In [28]:
prime_genre_index = 12
genres_apple = frequency_table(free_ios_apps, prime_genre_index)
popular_apple_apps = []

for genre in genres_apple:
    #print(genre)
    total = 0
    len_genre = 0
    for row in free_ios_apps:
        genre_app = row[prime_genre_index]
        if genre == genre_app:
            usr_ratings = float(row[6])
            #print(usr_ratings)
            total += usr_ratings
            len_genre += 1
    avg_nbr_installs = total/len_genre
    popular_apple_apps.append((avg_nbr_installs,genre))
    
popular_apple_apps_sorted = sorted(popular_apple_apps, reverse = True)
for row in popular_apple_apps_sorted: 
    print(row[1],':',row[0])
            

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0
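The nested loop above rescans the whole dataset once per genre. The same averages could be computed in a single pass by keeping running totals per group. This is only a sketch: `average_by_group()` and the sample rows are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

def average_by_group(dataset, group_index, value_index):
    """Return [(group, average)] sorted by average, highest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows: [genre, rating count]
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
print(average_by_group(sample, 0, 1))  # [('Games', 200.0), ('Weather', 50.0)]
```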


On average, Navigation apps have the highest number of user ratings, followed by Reference, Social Networking, Music, Weather and Book.

The high average for Navigation apps is heavily influenced by Waze and Google Maps (popular apps).


In [29]:
for app in free_ios_apps:
    if app[12] == 'Navigation':
        print(app[2],':',app[6])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911
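A quick way to see how strongly a few apps pull the average up is to compare the mean with the median. The sketch below uses the six rating counts printed above; using the median as a comparison is an addition for illustration, not a step from the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free Navigation apps listed above
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

print('mean  :', mean(navigation_ratings))
print('median:', median(navigation_ratings))
```

The median is 8196.5, roughly a tenth of the mean, which confirms that Waze and Google Maps account for most of the genre's average.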


It's also possible to observe the same pattern in Music and Social Networking.
- **Music**: Pandora / Shazam / Spotify
- **Social Networking**: Facebook / Skype / Pinterest / Whatsapp

In [30]:
for app in free_ios_apps:
    if app[12] == 'Reference':
        print(app[2],':',app[6])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
教えて!goo : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8



The high average for Reference apps is heavily influenced by Bible and Dictionary.com.

This genre could be a good choice for developing a new app. Since our main source of revenue consists of in-app ads, and people usually spend some time in these types of apps, this can be an excellent option: pick a famous book and turn it into an app.

Let's do a quick analysis of the other genres with high numbers of user ratings.
- Weather: People usually don't spend much time in these kinds of apps.
- Book: It seems to overlap with the Reference genre, so we can ignore it here.
- Food & Drink: Some of the famous fast food companies dominate this genre in the Apple Store, so it will be hard to gain market share.

### 3.2 Popular Apps by Genre in Google Play

In Google Play dataset we have the number of user installs. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.). For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes — we only want to find out which app genres attract the most users.

In [32]:
display_results(free_android_apps,5)

1,000,000+ : 15.73 %
100,000+ : 11.55 %
10,000,000+ : 10.55 %
10,000+ : 10.2 %
1,000+ : 8.39 %
100+ : 6.92 %
5,000,000+ : 6.83 %
500,000+ : 5.56 %
50,000+ : 4.77 %
5,000+ : 4.51 %
10+ : 3.54 %
500+ : 3.25 %
50,000,000+ : 2.3 %
100,000,000+ : 2.13 %
50+ : 1.92 %
5+ : 0.79 %
1+ : 0.51 %
500,000,000+ : 0.27 %
1,000,000,000+ : 0.23 %
0+ : 0.05 %
0 : 0.01 %


We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on. To perform computations, however, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and cause an error.
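The conversion described above fits in a small helper. This is a minimal sketch; `parse_installs()` is a hypothetical name introduced here.

```python
def parse_installs(installs):
    """Convert an install string such as '1,000,000+' to a float."""
    return float(installs.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000+'))  # 1000000.0
print(parse_installs('0'))           # 0.0
```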

In [33]:
category_index = 1
category_android = frequency_table(free_android_apps, category_index)

In [34]:
installs_index = 5 # index of number of installs
popular_android_apps = []
for category in category_android:
    #print(category)
    total = 0
    len_category = 0
    for app in free_android_apps:
        if category == app[1]:
            nbr_installs = float(app[installs_index].replace('+','').replace(',',''))
            total += nbr_installs
            len_category +=1
    avg_category_installs = total/len_category
    popular_android_apps.append((avg_category_installs,category))

category_android_sorted = sorted(popular_android_apps, reverse = True)
for cat in category_android_sorted:
    print(cat[1],':',cat[0])
    

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3695641.8198090694
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

On average, the top 5 categories of free Android apps are Communication first, followed by Video Players, Social, Photography and Productivity.

In [35]:
for app in free_android_apps:
    if app[1] == 'COMMUNICATION' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+'):
        print(app[0],':',app[5])

Messenger – Text and Video Chat for Free : 1,000,000,000+
WhatsApp Messenger : 1,000,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Viber Messenger : 500,000,000+
imo free video calls and chat : 500,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
LINE: Free Calls & Messages : 500,000,000+


As we can see above, this category is dominated by the big players in the industry, such as Messenger, WhatsApp and the Google apps. The same happens in the VIDEO_PLAYERS category, led by YouTube, Google Play Movies and MX Player. The same pattern applies to the rest of the top five categories: SOCIAL (Facebook, Instagram, Snapchat), PHOTOGRAPHY (Google Photos and other apps with over 100,000,000 installs) and PRODUCTIVITY (Microsoft apps, Google apps...).

In [36]:
# Remaining top-5 categories (COMMUNICATION was inspected in the previous cell)
top_5 = ['VIDEO_PLAYERS','SOCIAL', 'PHOTOGRAPHY', 'PRODUCTIVITY']
for cat in top_5:
    print(cat,'-------------------')
    for app in free_android_apps:
        if app[1] == cat and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
            print(app[0],':',app[5])
    print('\n')

VIDEO_PLAYERS -------------------
YouTube : 1,000,000,000+
Motorola FM Radio : 100,000,000+
Motorola Gallery : 100,000,000+
VLC for Android : 100,000,000+
Google Play Movies & TV : 1,000,000,000+
MX Player : 500,000,000+
Dubsmash : 100,000,000+
VivaVideo - Video Editor & Photo Movie : 100,000,000+
VideoShow-Video Editor, Video Maker, Beauty Camera : 100,000,000+


SOCIAL -------------------
Facebook : 1,000,000,000+
Instagram : 1,000,000,000+
Facebook Lite : 500,000,000+
Tumblr : 100,000,000+
Snapchat : 500,000,000+
Pinterest : 100,000,000+
Google+ : 1,000,000,000+
LinkedIn : 100,000,000+
Badoo - Free Chat & Dating App : 100,000,000+
Tango - Live Video Broadcast : 100,000,000+
Tik Tok - including musical.ly : 100,000,000+
BIGO LIVE - Live Stream : 100,000,000+
VK : 100,000,000+


PHOTOGRAPHY -------------------
Google Photos : 1,000,000,000+
B612 - Beauty & Filter Camera : 100,000,000+
YouCam Makeup - Magic Selfie Makeovers : 100,000,000+
BeautyPlus - Easy Photo Editor & Selfie Camera 

As we can see, the top 5 categories are dominated by the biggest players in the industry, so it will be hard to compete against them. The GAME category could be a good option because it has a large number of apps but, as we know, this market is saturated with free games that rely on in-app ads. It is unlikely to be profitable for us.
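One way to gauge how much the giants inflate a category average is to recompute it while leaving out apps above an install threshold. The sketch below assumes the same column positions as the Google Play dataset (category at index 1, installs at index 5). The `average_installs()` helper, the threshold value and the sample rows are all hypothetical.

```python
def average_installs(dataset, category, max_installs=None,
                     category_index=1, installs_index=5):
    """Average installs for one category, optionally ignoring apps at or above max_installs."""
    values = []
    for app in dataset:
        if app[category_index] != category:
            continue
        installs = float(app[installs_index].replace(',', '').replace('+', ''))
        if max_installs is None or installs < max_installs:
            values.append(installs)
    return sum(values) / len(values) if values else 0.0

# Made-up rows that only fill the columns the function reads
sample = [
    ['Giant',   'SOCIAL', '', '', '', '1,000,000,000+'],
    ['Small A', 'SOCIAL', '', '', '', '1,000,000+'],
    ['Small B', 'SOCIAL', '', '', '', '3,000,000+'],
]
print(average_installs(sample, 'SOCIAL'))
print(average_installs(sample, 'SOCIAL', max_installs=100_000_000))  # 2000000.0
```

In the sample, dropping the single giant brings the average down from roughly 335 million to 2 million installs.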

We can do the same analysis that we did for Apple Store. Let's check if BOOKS_AND_REFERENCE category has potential.

In [37]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

Wattpad 📖 Free Books : 100,000,000+
Amazon Kindle : 100,000,000+
Google Play Books : 1,000,000,000+
Bible : 100,000,000+
Audiobooks from Audible : 100,000,000+


In [38]:
for app in free_android_apps:
    if app[1] == 'BOOKS_AND_REFERENCE' and not (app[5] == '1,000,000,000+' or app[5] == '500,000,000+' or app[5] == '100,000,000+'):
        print(app[0],':',app[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Dictionary - Merriam-Webster : 10,000,000+
NOOK: Read eBooks & Magazines : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Oxford Dictionary of English : Free : 10,000,000+
Offline: English to Tagalog Dictionary : 500,000+
Spanish English Translator : 10,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
NOOK App for NOOK Devices : 500,000+
Anonymous caller detection : 10,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+


There is a wide variety of apps: e-readers, bibles, dictionaries, guides, programming tutorials and so on. Apart from Wattpad, Amazon Kindle, Google Play Books, Bible and Audiobooks from Audible (each with 100,000,000+ installs), this market is dominated by programming books, dictionaries and bibles.

We can suggest the same as in the Apple Store analysis: find a famous book and turn it into an app with some extra features, or build a tutorial app for a new, emerging programming language. However, since there are already many offerings in the market, it will be somewhat hard to find our niche.

## 4. Conclusions

In this project we analyzed the Apple Store and Google Play datasets from Kaggle with the objective of finding an app profile that can be profitable in both markets.

We saw that the categories with the most downloads are dominated by big tech companies, which will be hard to compete against. However, we could suggest picking a famous book and turning it into an app with some extra features, such as support for more than one language within the app.