# User Rating Data Analysis
We only build apps that are free to download and install, and our main source of revenue is in-app ads. This means that our revenue for any given app is mostly determined by its number of users: the more users who see and engage with the ads, the better.\
Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.


## Opening and Exploring Dataset

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/)

The infographic shows that there are more than 4 million apps across Google Play and the App Store combined. Collecting data on that many apps is impractical, so we will work with a sample instead.
Luckily, we have a [dataset](https://www.kaggle.com/datasets/lava18/google-play-store-apps) containing approximately 10,000 Android apps from Google Play and another [dataset](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps) containing approximately 7,000 iOS apps from the App Store.

We will begin by opening the two datasets to explore the data contained therein.

In [1]:
from csv import reader
### Google Play data set ###
opened_file = open('googleplaystore.csv')
read_file = reader(opened_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

### App Store data set ###

opened_file = open('AppleStore.csv')
read_file = reader(opened_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
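
The cell above leaves both files open and relies on the platform's default text encoding. As a minimal sketch (the helper name `load_csv` and the `utf8` default are assumptions, not part of the original notebook), the same loading step can be wrapped in a function that closes the file and returns the header and the rows separately:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file. Hypothetical helper for illustration."""
    # The with-block closes the file even if reading fails part-way.
    # newline='' is what the csv module documentation recommends for files passed to reader().
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Usage, assuming the two files sit next to the notebook:
# android_header, android = load_csv('googleplaystore.csv')
# ios_header, ios = load_csv('AppleStore.csv')
```

The function raises an `IndexError` on an empty file, which is acceptable for a sketch but would need handling in reusable code.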

We will then define an `explore_data()` function and use it to explore the datasets in a more readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')
        
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android_header)
print('\n')
explore_data(android, 0, 4, True)
print('\n')
print(ios_header)
print('\n')
explore_data(ios, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_c

## Deleting Wrong Data

According to the Google Play dataset's discussion section, the row at index 10472 has a missing entry. To confirm the error, we will print that row (10472), then the Android dataset header, and finally an arbitrary correct row for comparison.

In [3]:
print(android[10472])
print('\n')
print(android_header)
print('\n')
print(android[871])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['StarTimes - Live International Champions Cup', 'ENTERTAINMENT', '4.4', '17682', '9.7M', '1,000,000+', 'Free', '0', 'Everyone', 'Entertainment', 'August 3, 2018', '5.2', '4.0.3 and up']


Above, we see that the app at row index 10472 appears to have a rating of 19, whereas the maximum rating on Google Play is 5. The error results from a missing value in the 'Category' column, which shifts every subsequent value one column to the left (as per the [discussion section](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015)). We therefore need to delete that row for an accurate analysis.
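
The faulty row printed above has 12 values while the header has 13, so a length comparison is one way to spot this kind of problem without reading the discussion section. The following is a minimal sketch on toy data; `find_malformed_rows` is a hypothetical helper, not something the notebook defines:

```python
def find_malformed_rows(dataset, header):
    """Return (index, row length) for every row whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(dataset) if len(row) != expected]

# Toy data: the second row is missing its 'Category' value.
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]
print(find_malformed_rows(sample_rows, sample_header))  # [(1, 2)]
```

Calling `find_malformed_rows(android, android_header)` on the real data should list index 10472, assuming no rows have been deleted yet.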

In [4]:
print(android[10472])
del android[10472]
print(len(android))

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
10840


## Removing Duplicates
The Google Play dataset has duplicate entries that may distort our analysis, so we need to remove them and keep only unique entries.

### Part One

When we explore the Google Play dataset, we find that some apps have more than one entry, for example **Instagram**.

In [5]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)
        
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])


['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accountin

### Part Two

In [6]:
reviews_max = {}

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
        
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print('Expected length:', len(android) - 1181)
print('Actual length:', len(reviews_max))

Expected length: 9659
Actual length: 9659


We want to use the `reviews_max` dictionary created above to remove duplicates.
* Step 1
    - We create two empty lists:
        * `android_clean`, for storing the rows that remain after removing duplicates
        * `already_added`, for storing the names of the apps we have already kept
* Step 2
    - We loop through the `android` dataset and, in each iteration, isolate the name and the number of reviews of the app.
    - We then use a conditional statement to check that the number of reviews of the current app matches the number stored for that app in the `reviews_max` dictionary.
    - In the same conditional statement, we also check that the name of the app is not already in `already_added`. This check is needed because duplicate rows can share the same maximum number of reviews, and we want to keep only one of them.

* Step 3
    - We explore the `android_clean` dataset using the `explore_data()` function to ensure everything went as expected.

In [7]:
android_clean = []
already_added = []

for app in android:
    name = app[0]
    n_reviews = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
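
To see the deduplication rule at a small scale, here is a minimal sketch that wraps both passes in one function and runs it on toy rows. The function name `keep_most_reviewed` and the toy review counts for 'Box' are assumptions made for illustration; a set is used for `already_added` instead of a list only because membership tests on a set are faster.

```python
def keep_most_reviewed(dataset, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row that has the highest review count."""
    # Pass 1: highest review count seen for each name.
    reviews_max = {}
    for row in dataset:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews > reviews_max.get(name, -1.0):
            reviews_max[name] = n_reviews

    # Pass 2: keep the first row that matches that maximum.
    clean, already_added = [], set()
    for row in dataset:
        name, n_reviews = row[name_idx], float(row[reviews_idx])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Box', 'BUSINESS', '4.2', '159872'],  # exact duplicate with the same maximum
]
deduped = keep_most_reviewed(toy)
print([row[3] for row in deduped])  # ['66577446', '159872']
```

The two 'Box' rows show why the name check matters: both match the maximum, but only the first is kept.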


## Removing Non-English Apps

### Part One
When we explore the data further, we notice that some apps are not aimed at an English-speaking audience.
We will remove such apps because our objective is to reach an English-speaking audience.

In [8]:
def is_english(string):
    
    for character in string:
        if ord(character) > 127:
            return False
    
    return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
False
False
False


### Part Two

The function we wrote above also flags English app names that contain special characters or emojis as non-English, because those characters fall outside the ASCII range. This may lead to loss of useful data.
In order to mitigate that risk, we will only remove an app if its name has more than three non-ASCII characters.

In [9]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    
    else:
        return True
    
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))


True
True
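
The threshold of three is a heuristic, so it helps to look at the actual counts behind each decision. This is a minimal sketch; `count_non_ascii` is a hypothetical helper that isolates the counting step of `is_english()`:

```python
def count_non_ascii(string):
    """Number of characters whose code point falls outside the ASCII range (0-127)."""
    return sum(1 for character in string if ord(character) > 127)

names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]
# One count per name; a name is kept when its count is at most 3.
print([count_non_ascii(name) for name in names])
```

A short non-English name with three or fewer non-ASCII characters would still pass this filter, so the rule trades a few false positives for keeping names such as 'Docs To Go™ Free Office Suite'.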


In [10]:
android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if is_english(name):
        android_english.append(app)
        
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

## Isolating the Free Apps

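We only build apps that are free to download and install, but both datasets contain free and paid apps. We will therefore isolate the free apps, i.e. those whose price is zero. Note that the cell below filters the `android` and `ios` lists rather than `android_english` and `ios_english`, so the counts it prints still include the duplicate and non-English rows identified earlier.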

In [11]:
android_final = []
ios_final = []

for app in android:
    price = app[7]
    if price == '0':
        android_final.append(app)
        
for app in ios:
    price = app[4]
    if price == '0.0':
        ios_final.append(app)
        
print(len(android_final))
print(len(ios_final))

10040
4056
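
The cell above compares price strings for exact equality (`'0'` on Google Play, `'0.0'` on the App Store). A slightly more tolerant check converts the string to a number first. This is a minimal sketch; `is_free` is a hypothetical helper, and the leading `$` on paid prices is an assumption about how the Google Play file formats them:

```python
def is_free(price):
    """True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    # Assumption: paid prices may carry a leading '$' (e.g. '$4.99'); strip it before converting.
    return float(price.strip().lstrip('$')) == 0.0

print(is_free('0'), is_free('0.0'), is_free('$4.99'))  # True True False
```

The function raises a `ValueError` for a non-numeric value, which would also surface a shifted row like the one deleted earlier.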


## Most Common Apps by Genre

### Part One

As we mentioned earlier, our objective is to identify the types of apps that attract the most users, since our revenue depends directly on the number of users. To minimize risks and overhead, our validation strategy for an app idea has three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, develop it further.
* If the app is profitable after six months, build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both Google Play and the App Store, we need to find app profiles that are successful in both markets.

### Part Two

In [12]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages


def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

print(freq_table(ios_final, 11))
print('\n')
print(freq_table(android_final, 9))
print('\n')
print(freq_table(android_final, 1))

    

{'Social Networking': 3.5256410256410255, 'Photo & Video': 4.117357001972387, 'Games': 55.64595660749507, 'Music': 1.6518737672583828, 'Reference': 0.4930966469428008, 'Health & Fitness': 1.8737672583826428, 'Weather': 0.7642998027613412, 'Utilities': 2.687376725838264, 'Travel': 1.3806706114398422, 'Shopping': 2.983234714003945, 'News': 1.4299802761341223, 'Navigation': 0.4930966469428008, 'Lifestyle': 2.3175542406311638, 'Entertainment': 8.234714003944774, 'Food & Drink': 1.0601577909270217, 'Sports': 1.947731755424063, 'Book': 1.6272189349112427, 'Finance': 2.0710059171597637, 'Education': 3.2544378698224854, 'Productivity': 1.5285996055226825, 'Business': 0.4930966469428008, 'Catalogs': 0.22189349112426035, 'Medical': 0.19723865877712032}


{'Art & Design': 0.5478087649402391, 'Art & Design;Pretend Play': 0.0199203187250996, 'Art & Design;Creativity': 0.06972111553784861, 'Art & Design;Action & Adventure': 0.0199203187250996, 'Auto & Vehicles': 0.8167330677290837, 'Beauty': 0.52788

### Part Three

In [13]:
display_table(ios_final, 11) # Prime genre


Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032
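
A sorted percentage table like the one above can also be built with `collections.Counter` from the standard library. This is a minimal sketch on toy rows rather than a replacement for `freq_table()` and `display_table()`; the name `freq_table_sorted` is an assumption:

```python
from collections import Counter

def freq_table_sorted(dataset, index):
    """Return [(value, percentage)] sorted from most to least frequent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    # most_common() yields (value, count) pairs in descending order of count.
    return [(value, count / total * 100) for value, count in counts.most_common()]

toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
for value, pct in freq_table_sorted(toy, 1):
    print(f'{value} : {pct:.2f}')
# Games : 75.00
# Music : 25.00
```

Formatting the percentages to two decimals at print time keeps the full precision in the returned data while making the table easier to read.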


Among the free apps in our data set, about 55.6% are games (more than half). Entertainment apps account for about 8.2%, followed by photo and video apps at about 4.1%. Social networking apps make up 3.53% of the apps, followed by education apps at 3.25%.

The general conclusion from our data set is that the App Store is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer.

However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users; the demand might not match the supply.


In [14]:
display_table(android_final, 1) # Category


FAMILY : 17.739043824701195
GAME : 10.56772908366534
TOOLS : 7.6195219123505975
BUSINESS : 4.442231075697211
PRODUCTIVITY : 3.944223107569721
LIFESTYLE : 3.6155378486055776
SPORTS : 3.5856573705179287
COMMUNICATION : 3.5856573705179287
MEDICAL : 3.5258964143426295
FINANCE : 3.4760956175298805
HEALTH_AND_FITNESS : 3.237051792828685
PHOTOGRAPHY : 3.117529880478088
PERSONALIZATION : 3.0776892430278884
SOCIAL : 2.908366533864542
NEWS_AND_MAGAZINES : 2.7988047808764938
SHOPPING : 2.5697211155378485
TRAVEL_AND_LOCAL : 2.450199203187251
DATING : 2.2609561752988045
BOOKS_AND_REFERENCE : 2.0219123505976095
VIDEO_PLAYERS : 1.7031872509960162
EDUCATION : 1.5139442231075697
ENTERTAINMENT : 1.4641434262948207
MAPS_AND_NAVIGATION : 1.3147410358565739
FOOD_AND_DRINK : 1.245019920318725
HOUSE_AND_HOME : 0.8764940239043826
LIBRARIES_AND_DEMO : 0.8366533864541833
AUTO_AND_VEHICLES : 0.8167330677290837
WEATHER : 0.7370517928286853
EVENTS : 0.6274900398406374
ART_AND_DESIGN : 0.6175298804780877
COMICS : 0

The most common category is family at 17.74%, followed by game at about 10.6% and tools at about 7.6%. Business is at 4.4%, followed closely by productivity at about 3.9%. The landscape looks significantly different on Google Play compared to the App Store.

Apps designed for fun are less dominant here, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.).



In [15]:
display_table(android_final, 9) # Genres


Tools : 7.609561752988048
Entertainment : 6.01593625498008
Education : 5.169322709163347
Business : 4.442231075697211
Productivity : 3.944223107569721
Sports : 3.7250996015936253
Lifestyle : 3.6055776892430282
Communication : 3.5856573705179287
Medical : 3.5258964143426295
Finance : 3.4760956175298805
Action : 3.396414342629482
Health & Fitness : 3.237051792828685
Photography : 3.117529880478088
Personalization : 3.0776892430278884
Social : 2.908366533864542
News & Magazines : 2.7988047808764938
Shopping : 2.5697211155378485
Travel & Local : 2.4402390438247012
Dating : 2.2609561752988045
Books & Reference : 2.0219123505976095
Arcade : 1.9920318725099602
Simulation : 1.902390438247012
Casual : 1.8326693227091633
Video Players & Editors : 1.6832669322709164
Maps & Navigation : 1.3147410358565739
Food & Drink : 1.245019920318725
Puzzle : 1.205179282868526
Racing : 0.9462151394422311
Strategy : 0.9362549800796812
House & Home : 0.8764940239043826
Role Playing : 0.8665338645418327
Libraries

The difference between the Genres and Category columns is not clear-cut, but one observable feature is that the Genres column is more granular: it has more distinct values than the Category column.
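
The comparison between the two columns can be checked by counting distinct values. A minimal sketch on toy rows follows; `n_unique` is a hypothetical helper and the toy rows are made up for illustration:

```python
def n_unique(dataset, index):
    """Number of distinct values found in one column."""
    return len({row[index] for row in dataset})

# Toy rows laid out as [name, category, genre].
toy = [
    ['App A', 'GAME', 'Action'],
    ['App B', 'GAME', 'Arcade'],
    ['App C', 'TOOLS', 'Tools'],
]
print(n_unique(toy, 1), n_unique(toy, 2))  # 2 3
```

On the real data the equivalent calls would be `n_unique(android_final, 1)` for Category and `n_unique(android_final, 9)` for Genres.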

Up to this point, we found that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and for-fun apps.

## Most Popular Apps by Genre on the App Store

In [16]:
genres_ios = freq_table(ios_final, 11)

for genre in genres_ios:
    total = 0
    len_genre = 0
    for app in ios_final:
        genre_app = app[11]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total/len_genre
    print(genre, ':', avg_n_ratings)
        

Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75


On average, reference apps have the highest number of user ratings. This is perhaps heavily influenced by a few big players like Bible and Dictionary.com.
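
One way to check whether a few very large apps are pulling an average up is to compare the mean with the median for each genre. This is a minimal sketch with made-up toy numbers; `summarize_by_group` is a hypothetical helper that also groups the rows in a single pass instead of re-scanning the dataset once per genre:

```python
from collections import defaultdict
from statistics import mean, median

def summarize_by_group(dataset, group_idx, value_idx):
    """Map each group to (mean, median) of a numeric column."""
    values = defaultdict(list)
    for row in dataset:
        values[row[group_idx]].append(float(row[value_idx]))
    return {group: (mean(v), median(v)) for group, v in values.items()}

# Toy rows laid out as [name, genre, number of ratings].
toy = [
    ['Big app', 'Reference', '1000000'],
    ['Small app 1', 'Reference', '100'],
    ['Small app 2', 'Reference', '200'],
    ['Game 1', 'Games', '5000'],
    ['Game 2', 'Games', '7000'],
]
for genre, (avg, med) in summarize_by_group(toy, 1, 2).items():
    print(genre, ':', round(avg), '(mean)', round(med), '(median)')
```

In the toy 'Reference' group the single large app lifts the mean far above the median, which is the pattern to look for in the real genres.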

## Most Popular Apps by Genre on Google Play

In [17]:
category_android = freq_table(android_final, 1)

for category in category_android:
    total = 0
    len_category = 0
    for app in android_final:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
            
    avg_n_installs = total / len_category
    print(category, ':', avg_n_installs)
        


ART_AND_DESIGN : 2005195.1612903227
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 9465252.512315271
BUSINESS : 2245520.3811659194
COMICS : 934769.1666666666
COMMUNICATION : 90683100.55833334
DATING : 1164270.7356828193
EDUCATION : 5729276.315789473
ENTERTAINMENT : 19516734.69387755
EVENTS : 253542.22222222222
FINANCE : 2511355.6790830945
FOOD_AND_DRINK : 2190710.008
HEALTH_AND_FITNESS : 4869225.852307692
HOUSE_AND_HOME : 1917187.0568181819
LIBRARIES_AND_DEMO : 749950.119047619
LIFESTYLE : 1477863.44077135
GAME : 33048939.16116871
FAMILY : 5742274.952835485
MEDICAL : 147563.28813559323
SOCIAL : 48184458.56849315
SHOPPING : 12588522.03488372
PHOTOGRAPHY : 32218111.54952077
SPORTS : 4860918.563888889
TRAVEL_AND_LOCAL : 27921561.32520325
TOOLS : 14968685.586928105
PERSONALIZATION : 7508854.330097088
PRODUCTIVITY : 35794644.73232323
PARENTING : 542603.6206896552
WEATHER : 5747142.162162162
VIDEO_PLAYERS : 36385565.614035085
NEWS_AND_MAGAZINES : 2667

On average, communication apps in the Google Play data set have the highest number of installs. This is perhaps heavily influenced by a few apps that have a huge number of installs (Facebook, WhatsApp, Gmail, Google Chrome, etc.).

Social apps follow, though by a wide margin, with close to 50,000,000 installs on average.

The books and reference category looks fairly popular as well, with an average of 9,465,252 installs. This is interesting because we found that this genre has some potential to work well on the App Store, and our aim is to recommend an app genre that shows potential for being profitable on both the App Store and Google Play.
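
The install figures behind these averages are stored as open-ended buckets such as '10,000+', so every average is computed from lower bounds rather than exact counts. A minimal sketch of the conversion step, with `parse_installs` as a hypothetical helper name:

```python
def parse_installs(installs):
    """Convert an install bucket such as '1,000,000+' to its lower bound as an integer."""
    return int(installs.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))     # 10000
print(parse_installs('1,000,000+'))  # 1000000
```

Because an app with 1,000,000+ installs may have far more than one million, the averages are best read as rough indicators of relative popularity.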

## Conclusion

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that taking a popular book and turning it into an app could be profitable for both the Google Play and the App Store markets. The markets are already full of libraries, so we need to add some special features besides the raw version of the book. This might include daily quotes from the book, an audio version of the book, quizzes on the book, a forum where people can discuss the book, etc.