# Profitable App Profiles for Google Play and App Store

Our aim in this project is to find mobile app profiles that are profitable in the Google Play and App Store markets. We work as data analysts for a company that builds mobile apps, and our job is to enable the team of developers to make data-driven decisions about the kinds of apps they build.

At our company, we only build apps that are free to use, so the main source of revenue is in-app advertisements. This means that revenue is driven largely by the number of people using an app. The goal of this project is to analyse the app data and give the development team insight into what kinds of apps are likely to attract more users.

# 1. Opening and exploring the dataset

As of the third quarter of 2023, there were a total of 3.55 million apps available on Google Play and the App Store. Because collecting data on all of these apps is impractical, for the purpose of our project we are going to use two sample data sets, one for each store:
- A data set containing data about approximately ten thousand Android apps from Google Play.
- A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from this link.

Let's start by opening the two datasets and then exploring them.


In [1]:
from csv import reader

#Opening the google play dataset.
opened_file = open(r"Downloads\googleplaystore.csv", encoding='utf8')  # raw string: "\g" is not a valid escape
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

#Opening the app store dataset.
opened_file = open(r"Downloads\AppleStore.csv", encoding='utf8')  # raw string: "\A" is not a valid escape
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]


Now we define a function that helps us examine the basic characteristics of a dataset, and apply it to both of our datasets:

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

#android dataset
print(android_header)
print('\n')
explore_data(android, 0, 3, True)
print('\n')
print('\n')
#ios dataset
print(ios_header)
print('\n')
explore_data(ios, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13




['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['28488

- The Google Play data set has 10841 apps and 13 columns. At a quick glance, the columns that might be useful for our analysis are 'App', 'Category', 'Reviews', 'Installs', 'Type', 'Price', and 'Genres'.
- The iOS data set has 7197 apps, and the columns that seem interesting are 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'.

# 2. Data Cleaning

## 2.1. Dealing with incorrect data 

From the [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/164101) for the Google Play data set, we learn that row 10472 has incorrect data. Let us print this row and compare it with the header and a correct row.

In [3]:
print(android_header)
print('\n')
print(android[10472])
print('\n')
print(android[10473])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


Here we can see that in row 10472 the value under Category is '1.9' and the value under Rating is '19', which is inconsistent with the other rows. The row also has only 12 values instead of 13: the Category value is missing, so every later value is shifted one column to the left.
We therefore delete this row.

In [4]:
print(len(android))
del android[10472]
print(len(android))

10841
10840
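
Row 10472 stands out because it has fewer values than the header, so the same kind of problem can also be searched for programmatically instead of relying only on the discussion thread. The following is a minimal sketch on a tiny made-up sample rather than the real `android` list; `find_malformed_rows` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose column count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Illustrative sample (not the real data set): the second row lacks a category.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
for index, row in malformed:
    print(index, row)
```

On the real data, the equivalent call would be `find_malformed_rows(android_header, android)`, run before the deletion.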


## 2.2. Removing duplicate entries

From the discussion section we can also gather that the dataset has duplicate entries for certain apps. Instagram is one such app:

In [5]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


As we can see, there are four entries for the same app. Let us find out how many duplicate entries exist in the dataset.

In [6]:
duplicate_apps = []
unique_apps = []
for app in android :
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else :
        unique_apps.append(name)
        
print('There are a total of ',len(duplicate_apps),' duplicate apps in the googleplaystore dataset.')
print('\n')
print('Examples of duplicate apps: ')
print(duplicate_apps[:5])

There are a total of  1181  duplicate apps in the googleplaystore dataset.


Examples of duplicate apps: 
['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']


We aim to refine our data analysis process by eliminating duplicate entries for apps. Instead of randomly removing duplicate rows, we can adopt a more systematic approach based on review counts.

Here's how we'll proceed:

- Construct a dictionary where each app name serves as a key, and the associated value represents the highest number of reviews recorded for that app.
- Utilize this dictionary to form a new dataset containing only one entry per app, selecting the entries with the highest review counts.

In essence, our criterion for retaining rows involves prioritizing entries with the highest review counts, as they typically offer the most reliable ratings.

In [7]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [8]:
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


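Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entry with the highest number of reviews. In the code cell below, we add a row to android_clean only if its review count matches the maximum for that app and the app's name has not already been added. The second condition matters when several duplicate rows share the same maximum review count.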

In [9]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)
    

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have **9659** rows just as expected.
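
The two-step procedure above can also be written as a single pass that keeps, for each name, the row with the most reviews. This is only a sketch: it assumes the order of rows in the result does not matter (rows come out in order of each name's first appearance, which can differ from the cell above), `dedupe_by_max_reviews` is a hypothetical name, and the sample rows are abbreviated, with made-up values for the Box row.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Abbreviated sample rows: name, category, rating, reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # made-up values for illustration
]

deduped = dedupe_by_max_reviews(sample)
for row in deduped:
    print(row)
```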

## 2.3. Removing Non-English Apps

By exploring the dataset, we notice that the names of some of the apps suggest they are not directed toward an English-speaking audience. Below, we see a couple of examples from both data sets:

In [10]:
print(ios[813][1])
print(ios[6731][1])

print(android_clean[4412][0])
print(android_clean[7940][0])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜
中国語 AQリスニング
لعبة تقدر تربح DZ


We're not interested in keeping these kinds of apps, so we'll remove them. One way to go about this is to remove each app whose name contains a symbol that is not commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;, etc.), and other symbols (+, *, /, etc.).

All these characters that are specific to English texts are encoded using the ASCII standard. Each ASCII character has a corresponding number between 0 and 127 associated with it, and we can take advantage of that to build a function that checks an app name and tells us whether it contains non-ASCII characters.
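
As a quick illustration of the 0 to 127 range, the sketch below prints the code point that Python's built-in `ord()` returns for a few characters. The sample characters are chosen only for illustration.

```python
# Map a few sample characters to their code points; ASCII covers 0 through 127.
code_points = {character: ord(character) for character in ['a', 'A', '5', '+', '™', '爱']}

for character, code_point in code_points.items():
    label = 'ASCII' if code_point <= 127 else 'non-ASCII'
    print(repr(character), code_point, label)
```

The letters, the digit, and the plus sign fall inside the ASCII range, while the trademark sign and the Chinese character fall well outside it.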

We built this function below, and we use the built-in ord() function to find out the corresponding encoding number of each character.

In [11]:
def is_english(string):
    for character in string:
        if ord(character) > 127:
            return False
    # Only report True after every character has been checked.
    return True
    
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


The function seems to work fine, but some English app names use emojis or other symbols (™, — (em dash), – (en dash), etc.) that fall outside of the ASCII range. Because of this, we'll remove useful apps if we use the function in its current form.

To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:

In [12]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
        
    if non_ascii > 3:
        return False
    else:
        return True


The function is still not perfect, and a few non-English apps might get past our filter, but this is good enough at this stage of the analysis, so we won't spend more time optimizing it.
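
To see how the three-character threshold behaves, the sketch below re-implements the same rule in a compact form and applies it to app names that appear in the outputs above. The `max_non_ascii` parameter is an addition made for this sketch and is not part of the notebook's function.

```python
def is_english(name, max_non_ascii=3):
    """Return True unless the name has more than max_non_ascii non-ASCII characters."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


samples = [
    'Instagram',
    'U Launcher Lite – FREE Live Cool Themes, Hide Apps',  # contains one en dash
    '中国語 AQリスニング',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

results = {name: is_english(name) for name in samples}
for name, verdict in results.items():
    print(verdict, name)
```

The first two names are kept, because a single en dash stays under the threshold, while the last two are rejected.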

Below, we use the is_english() function to filter out the non-English apps for both data sets:

In [13]:
android_english = []
ios_english = []

for app in android_clean:
    if is_english(app[0]):
        android_english.append(app)
        
for app in ios:
    if is_english(app[1]):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

We can see that we're left with **9614 Android apps** and **6183 iOS apps**.

## 2.4. Isolating the Free Apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [14]:
android_final = []
ios_final = []

for app in android_english:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios_english:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

8864
3222


Finally, we're left with **8864 Android apps** and **3222 iOS apps**, which should be enough for our analysis.
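
The string comparisons above depend on the exact formatting of each price column ('0' on Google Play, '0.0' in the App Store). A slightly more defensive variant converts the price to a number first. This is only a sketch: the helper name `is_free` is hypothetical, and it assumes that paid prices may carry a leading `$`, which is not shown in the outputs above.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')  # assumption: an optional leading '$'
    return float(cleaned) == 0.0


for price in ['0', '0.0', '$4.99', '1.99']:
    print(price, '->', is_free(price))
```

With such a helper, both loops could use the same test, for example `if is_free(app[7])` for Google Play and `if is_free(app[4])` for the App Store.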

# 3. Data Analysis

## 3.1. Most Common Apps by Genre

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both the App Store and Google Play, we need to find app profiles that are successful on both markets. For instance, a profile that might work well for both markets might be a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the **prime_genre** column of the App Store data set, and the **Genres and Category columns** of the Google Play data set.


In [15]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
            
    table_percentages = {}
    for key in table:
        percentage = (table[key]/total)*100
        table_percentages[key] = percentage
    
    return table_percentages

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_as_tuple = (table[key], key)
        table_display.append(key_value_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

Here, we have built two functions we can use to analyze the frequency tables:

- One function to generate frequency tables that show percentages
- Another function that we can use to display the percentages in a descending order
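
For comparison, the same percentages can be produced with `collections.Counter` from the standard library. This is a minimal sketch on a made-up sample; `freq_table_counter` is a hypothetical name and is not used elsewhere in this notebook.

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

percentages = freq_table_counter(sample, 1)
for genre, share in sorted(percentages.items(), key=lambda item: item[1], reverse=True):
    print(genre, ':', share)
```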

We start by examining the frequency table for the prime_genre column of the App Store data set.

In [16]:
display_table(ios_final, -5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Among the free English apps in the App Store, over half (58.16%) are games. Entertainment apps are a distant second at approximately 8%, followed by photo and video apps at around 5%. Education apps make up only 3.66% of the total, with social networking apps slightly behind at 3.29%.

The data suggests that the App Store, particularly its free English apps section, is predominantly occupied by apps geared towards entertainment and leisure (including games, entertainment, photo and video, social networking, sports, and music). Meanwhile, apps with practical functionalities (such as education, shopping, utilities, productivity, and lifestyle) are less common.

However, the large number of entertainment apps does not necessarily mean that they also have the most users. Supply is not the same as demand, and user engagement may vary significantly across genres.

Next, we'll explore the Genres and Category columns within the Google Play dataset, as they appear to be interrelated.

In [17]:
display_table(android_final, 1) #Category

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

In [18]:
display_table(android_final, -4) #Genres

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

The distribution of apps on Google Play presents a contrasting landscape compared to the App Store. Google Play showcases a notable presence of practical apps, with a considerable number designed for purposes such as family, tools, business, lifestyle, and productivity. However, delving deeper, we discover that the family category, comprising nearly 19% of the apps, predominantly consists of games targeting children.

Despite this, practical apps enjoy a more significant representation on Google Play in comparison to the App Store. This observation is reinforced by the frequency table generated for the Genres column, which underscores the prevalence of practical app categories.

## 3.2. Most Popular Apps by Genre on the App Store

To determine the popularity of genres based on user engagement, we calculate the average number of installs for each app genre in the Google Play dataset. This information is readily available in the Installs column. However, for the App Store dataset, we lack explicit install data. Instead, we employ the total number of user ratings (available in the rating_count_tot column) as a proxy for popularity.

Here's the process:

- For the Google Play dataset, we directly analyze the Installs column to calculate average installs per genre.
- For the App Store dataset, lacking install data, we use the total number of user ratings (rating_count_tot column) as a substitute for app popularity.
- We compute the average number of user ratings per app genre in the App Store dataset.

This methodology allows us to approximate the popularity of genres across both platforms, facilitating comparative analysis despite the differing data available in each dataset.
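
The per-genre average described above amounts to grouping rows by genre and taking a mean within each group. One way to sketch this is a single pass over the rows with two running dictionaries. The rows below are made up, and `average_by_key` is a hypothetical helper name.

```python
from collections import defaultdict


def average_by_key(rows, key_index, value_index):
    """Return {key: mean of the numeric column} computed in one pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up rows: app name, genre, number of ratings.
sample_rows = [
    ['App A', 'Games', '100'],
    ['App B', 'Games', '300'],
    ['App C', 'Music', '50'],
]

averages = average_by_key(sample_rows, 1, 2)
print(averages)
```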

In [19]:
genres_ios = freq_table(ios_final, -5)

for genre in genres_ios:
    total = 0 
    len_genre = 0 
    for app in ios_final:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0


On average, navigation apps have the highest number of user ratings, but this figure is heavily influenced by Waze and Google Maps, which have close to half a million user ratings together:

In [20]:
for app in ios_final:
    if app[-5] == 'Navigation':
        print(app[1], ':', app[5]) #print name and number of ratings

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


The observed pattern extends to social networking and music apps, where the average number of ratings is disproportionately influenced by a handful of industry giants like Facebook, Pinterest, Skype, Pandora, Spotify, and Shazam. These dominant players heavily skew the average number of ratings, creating a misleading perception of the popularity of their respective genres.

Our primary objective is to identify genuinely popular genres; however, navigation, social networking, or music apps may appear more popular than they actually are. This discrepancy arises because the average number of ratings is heavily skewed by a few apps with hundreds of thousands of user ratings, while the majority of apps struggle to surpass the 10,000 threshold.
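
The effect of a few very large apps can be made concrete by comparing the mean with the median, which is much less sensitive to outliers. The sketch below uses the six Navigation rating counts printed above. The median is offered only as an illustrative alternative and is not part of the original analysis.

```python
from statistics import mean, median

# Rating counts of the six free English Navigation apps listed above.
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

nav_mean = mean(navigation_ratings)
nav_median = median(navigation_ratings)

print('mean:', nav_mean)
print('median:', nav_median)
```

The mean reproduces the Navigation average of roughly 86,090 ratings, whereas the median is 8196.5, about a tenth of that.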

For the purpose of our analysis we will exclude these genres and look at the next most popular genre which is **Reference**. 

In [21]:
for app in ios_final:
    if app[-5] == 'Reference':
        print(app[1], ':', app[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


The ratings in this genre seem somewhat more evenly distributed among its apps, although Bible and Dictionary.com still account for a large share of the total.

An intriguing venture involves transforming another popular book into an app, enhancing it with various features beyond the raw text. These features might encompass daily quotes sourced from the book, an audio rendition, quizzes related to its content, and more. Additionally, embedding a dictionary within the app could streamline users' word lookups without necessitating external app navigation.

This concept aligns well with the dominance of entertainment-focused apps in the App Store. The market saturation of entertainment apps suggests an opportunity for practical applications to distinguish themselves amid the vast array of offerings.

While genres like weather, book, food and drink, and finance are popular, they may not align with our current interests or capabilities:

- Weather apps typically don't engage users extensively, and revenue opportunities through in-app ads are limited. Moreover, accessing reliable live weather data may entail integration with non-free APIs.
- Food and drink apps, exemplified by brands like Starbucks and McDonald's, often involve cooking and delivery services, which fall beyond our company's scope.
- Finance apps entail complexities related to banking, bill payments, and money transfers, necessitating domain expertise we may not possess or wish to acquire.

Thus, the proposed book app concept presents an appealing opportunity to explore practical functionalities within the context of the App Store's predominantly entertainment-oriented ecosystem.

## 3.3. Most Popular Apps by Genre on Google Play

For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [22]:
display_table(android_final, 5)

1,000,000+ : 15.726534296028879
100,000+ : 11.552346570397113
10,000,000+ : 10.548285198555957
10,000+ : 10.198555956678701
1,000+ : 8.393501805054152
100+ : 6.915613718411552
5,000,000+ : 6.825361010830325
500,000+ : 5.561823104693141
50,000+ : 4.7721119133574
5,000+ : 4.512635379061372
10+ : 3.5424187725631766
500+ : 3.2490974729241873
50,000,000+ : 2.3014440433213
100,000,000+ : 2.1322202166064983
50+ : 1.917870036101083
5+ : 0.78971119133574
1+ : 0.5076714801444043
500,000,000+ : 0.2707581227436823
1,000,000,000+ : 0.22563176895306858
0+ : 0.04512635379061372
0 : 0.01128158844765343


The data lacks precision: we don't know whether an app labeled "100,000+" has 100,000 installs or several hundred thousand. However, we don't need pinpoint accuracy for our analysis. Our goal is to discern which app genres tend to attract the most users, not to obtain exact user counts.

In our approach, we'll maintain the install numbers as they are, treating an app labeled with "100,000+" installs as having 100,000 installs, and similarly for other install ranges. However, for computational purposes, we'll need to convert each install number to a float data type. To achieve this, we'll remove the commas and plus characters from the install counts to facilitate accurate conversion to floats.

This conversion will occur within the loop where we compute the average number of installs for each genre (or category), enabling us to conduct meaningful analysis despite the inherent imprecision in the data.
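
The cleaning step described above can be isolated in a small helper and checked on a few labels before it is used inside the loop. `parse_installs` is a hypothetical name introduced only for this sketch.

```python
def parse_installs(text):
    """Convert an install label such as '1,000,000+' to a float (its lower bound)."""
    return float(text.replace(',', '').replace('+', ''))


# Labels taken from the frequency table of the Installs column.
for label in ['0', '0+', '100,000+', '1,000,000,000+']:
    print(label, '->', parse_installs(label))
```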

In [23]:
category_android = freq_table(android_final, 1)
category_avg_installs = {}

for category in category_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = (total/len_category)
    category_avg_installs[category] = avg_n_installs

sorted_categories = sorted(category_avg_installs.items(), key=lambda x: x[1], reverse=True)
for category, avg_installs in sorted_categories[:15]:
    print(category, ':', avg_installs)


COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853


In our exploration of app categories across both the Google Play Store and the App Store, we've noticed intriguing similarities in the top categories. However, we've also uncovered a common challenge: certain categories, such as Communication, Video Players, Social, Photography, and Productivity, are heavily influenced by industry giants, leading to skewed figures that may not accurately represent the potential for new entrants.

While the gaming genre remains popular, our analysis suggests that this segment might be oversaturated, making it more challenging for new apps to gain traction. As a result, we've set our sights on identifying alternative app recommendations with promising profit potential.

One genre that stands out amidst our findings is "Books and Reference." With an average number of installs exceeding 8.7 million, this genre piques our interest for several reasons. Not only does it demonstrate considerable popularity across both the App Store and Google Play Store, but our previous research has also hinted at its potential for profitability, especially on the App Store platform.

Given these insights, we believe that delving deeper into the Books and Reference genre presents an exciting opportunity. By exploring this genre more comprehensively, we aim to uncover specific niches or app concepts that hold promise for success across both app platforms. Our ultimate goal is to recommend an app genre that not only shows potential for profitability but also aligns with our strategic objectives for market penetration and user engagement.

In [24]:
for app in android_final:
    if app[1] == 'BOOKS_AND_REFERENCE':
        print(app[0], ' : ', app[5])


E-Book Read - Read Book for free  :  50,000+
Download free book with green book  :  100,000+
Wikipedia  :  10,000,000+
Cool Reader  :  10,000,000+
Free Panda Radio Music  :  100,000+
Book store  :  1,000,000+
FBReader: Favorite Book Reader  :  10,000,000+
English Grammar Complete Handbook  :  500,000+
Free Books - Spirit Fanfiction and Stories  :  1,000,000+
Google Play Books  :  1,000,000,000+
AlReader -any text book reader  :  5,000,000+
Offline English Dictionary  :  100,000+
Offline: English to Tagalog Dictionary  :  500,000+
FamilySearch Tree  :  1,000,000+
Cloud of Books  :  1,000,000+
Recipes of Prophetic Medicine for free  :  500,000+
ReadEra – free ebook reader  :  1,000,000+
Anonymous caller detection  :  10,000+
Ebook Reader  :  5,000,000+
Litnet - E-books  :  100,000+
Read books online  :  5,000,000+
English to Urdu Dictionary  :  500,000+
eBoox: book reader fb2 epub zip  :  1,000,000+
English Persian Dictionary  :  500,000+
Flybook  :  500,000+
All Maths Formulas  :  1,000

The Books and Reference genre includes a variety of apps: software for processing and reading ebooks, collections of libraries, dictionaries, tutorials on programming or languages, and so on. However, only a few apps appear to be extremely popular, so this market still shows potential. Let's try to get some app ideas from the apps that sit in the middle in terms of popularity (between 1,000,000 and 100,000,000 downloads):

In [25]:
for app in android_final:
    # Keep Books and Reference apps in the 1,000,000+ to 50,000,000+ install buckets.
    if app[1] == 'BOOKS_AND_REFERENCE' and app[5] in ('1,000,000+', '5,000,000+',
                                                      '10,000,000+', '50,000,000+'):
        print(app[0], ':', app[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+
H
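
The filter above compares install strings one by one. The same selection can be written as a numeric range check, sketched below on the assumption that `app[5]` always holds strings such as `'1,000,000+'`. The helpers `installs_to_int` and `in_install_range` are hypothetical names, and the sample rows reuse app names and install figures from the listings above, with unused columns set to `None`.

```python
def installs_to_int(installs):
    """Convert an install string such as '1,000,000+' to the integer 1000000."""
    return int(installs.replace(',', '').replace('+', ''))

def in_install_range(app, low=1_000_000, high=100_000_000):
    """Return True if the app's installs fall in [low, high)."""
    return low <= installs_to_int(app[5]) < high

# Sample rows shaped like android_final: name at index 0, category at 1, installs at 5.
sample = [
    ['Wikipedia', 'BOOKS_AND_REFERENCE', None, None, None, '10,000,000+'],
    ['Google Play Books', 'BOOKS_AND_REFERENCE', None, None, None, '1,000,000,000+'],
    ['Litnet - E-books', 'BOOKS_AND_REFERENCE', None, None, None, '100,000+'],
]

mid_popularity = [app[0] for app in sample
                  if app[1] == 'BOOKS_AND_REFERENCE' and in_install_range(app)]
print(mid_popularity)
```

With a numeric check the bounds can be adjusted without listing every install bucket by hand.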

As we delve deeper into the Books and Reference genre, we uncover a landscape dominated by software tailored for ebook processing and reading, along with an array of libraries and dictionaries. While this indicates a vibrant market, it also signals significant competition in these areas.

However, amidst our exploration, we notice a notable trend: a cluster of apps centered around the Quran, hinting at the profitability of building apps around popular literary works. This observation sparks an intriguing idea: leveraging the popularity of a well-known book, potentially a more recent title, to create a profitable venture in both the Google Play and the App Store markets.
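
One way to check a cluster like this is to count the titles that contain a given keyword. The sketch below is only an illustration: `titles_matching` is a hypothetical helper, the keyword list is an assumption, and the titles are copied from the output above.

```python
def titles_matching(titles, keywords):
    """Return the titles containing any of the keywords, ignoring case."""
    lowered = [keyword.lower() for keyword in keywords]
    return [title for title in titles
            if any(keyword in title.lower() for keyword in lowered)]

# A few titles taken from the Books and Reference listing above.
titles = [
    'Wikipedia',
    'Moon+ Reader',
    'Al-Quran (Free)',
    'Al Quran Indonesia',
    "Al'Quran Bahasa Indonesia",
    'Al Quran Al karim',
    'Al Quran : EAlim - Translations & MP3 Offline',
    'Koran Read &MP3 30 Juz Offline',
]

quran_apps = titles_matching(titles, ['quran', 'koran'])
print(len(quran_apps), 'of', len(titles), 'sample titles mention the Quran')
```

In the notebook, the same helper could be applied to `[app[0] for app in android_final if app[1] == 'BOOKS_AND_REFERENCE']` to measure the cluster across the full genre.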

Yet, as we analyze further, we realize that simply offering the raw version of a book may not suffice in a market already saturated with libraries. To truly stand out, we recognize the need for innovation and added value. Hence, our strategy pivots towards infusing unique features into our app concept.

Imagine an app built around a beloved book, enriched with daily quotes from its pages, an immersive audio version for on-the-go listening, engaging quizzes to test readers' comprehension, and a vibrant forum where enthusiasts can dive deep into discussions about the book's themes and characters.

This approach not only capitalizes on the popularity of a well-loved literary work but also offers a fresh, interactive experience that transcends traditional reading. By infusing technology with the timeless allure of literature, we aim to carve out a distinct space in the market, captivating readers and fostering a community united by their passion for storytelling.

# 4. Conclusion

In this project, we analysed the Google Play and App Store datasets with the goal of recommending an app profile that can be profitable in both markets.

We concluded that taking a popular book (perhaps a more recent book) and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.