# Analysis of Google Play Store & App Store markets to identify popular free English apps
This project aims to help companies that build free apps identify the types (or genres) of apps that are most popular on the Google Play Store (Android) and the Apple App Store (iOS). It focuses on apps built for English-speaking users.

## How can the results be interpreted? And who can use them most effectively?
For many app developers, 'In-App Advertising' has been, and remains, one of the major app monetization tools and a proven revenue model. According to a survey of apps on the Google Play Store by Sweetpricing.com (https://bit.ly/3dBnCFM), about 65% of the apps on the Google Play Store use In-App Advertising to generate revenue.

IMAGE 1

Thus, for an In-App Advertising revenue model to work most effectively, app developers should target an app that will have a large user base. This makes advertising through the app an enticing prospect for the clients of the app developing company. This project identifies the most popular 'type' or 'genre' of app (the one with the highest number of users) across both markets (Google Play Store and App Store). The results are most applicable to app developing companies that are looking to launch an app across <b> both the Google Play Store and the App Store </b> and whose domain or market space is <b> not restricted by a particular genre/type of app </b>.

## Source of data:
There are over 2 million apps in each of the Google Play Store and the App Store markets. For this project, however, a sample dataset containing a few thousand apps for each market has been used.
For Google Play Store Apps, a sample dataset containing over 10K apps can be found here: https://go.aws/2Y8Ebm2
For App Store Apps, a sample dataset containing over 7K apps can be found here: https://go.aws/3dHgIPh

For more details of the source for the Google Play Store Apps, please refer to this link: https://www.kaggle.com/lava18/google-play-store-apps
For more details of the source for the App Store Apps, please refer to this link: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

## Opening the datasets
The first step is to open the datasets and explore them to understand what sort of data is being processed. The first row of each dataset is the header row, which names the type of data in each column. In the following cell, the datasets are opened, the first two apps of each dataset are displayed to give a feel for the data that will be processed in the project, and the number of apps and the header row are printed for both datasets.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

#code to open files
from csv import reader

# 'with' closes each file once its rows have been read into a list
with open(r'C:\Users\Official\Desktop\Dataset\googleplaystore.csv',encoding='utf8',errors='ignore') as open_gp:
    data_gp=list(reader(open_gp))
# explore_data only prints, so its return value is not stored
explore_data(dataset=data_gp,start=1,end=3,rows_and_columns=False)

with open(r'C:\Users\Official\Desktop\Dataset\AppleStore.csv',encoding='utf8',errors='ignore') as open_ios:
    data_ios=list(reader(open_ios))
explore_data(dataset=data_ios,start=1,end=3,rows_and_columns=False)

#code to find out the number of datapoints/rows (number of apps)
def data_no_cal(dataset):
    data_cal=0
    for row in dataset[1:]:
        data_cal+=1
    return data_cal

dp_gp=data_no_cal(dataset=data_gp)
print("The number of apps in the Google Play Store Dataset are: ", dp_gp)
dp_ios=data_no_cal(dataset=data_ios)
print("The number of apps in the App Store Dataset are: ",dp_ios)
print('\n')
#code to print first row to understand the data header
print("The header row of the Google Play Store Dataset is:\n",data_gp[0])
print('\n')
print("The header row of the App Store Dataset is:\n",data_ios[0])


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


The number of apps in the Google Play Store Dataset are:  10841
The number of apps in the App Store Dataset are:  7197


The header row of the Google Play Store Dataset is:
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


The header row of th

## Cleaning Data (Removing dirty data & duplicate entries)
Before the data is analyzed, apps with erroneous or incomplete values in columns such as rating and price must be excluded from the dataset. As per the data source (Kaggle), entry 10472 in the Google Play Store dataset has incorrect/incomplete data elements; in 'data_gp', which also holds the header row, this entry is at index 10473. On further investigation, this entry has no value in the 'Category' column, so every subsequent value is shifted one position to the left in the row. This entry is therefore removed below and not considered in the analysis.
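
A shifted row of this kind can also be located without knowing its index in advance, by comparing each row's length with the header's. The following is a minimal sketch on a small made-up list standing in for the real dataset; `find_malformed_rows` is a hypothetical helper and is not part of the notebook's cells.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(dataset[0])  # dataset[0] is assumed to be the header row
    return [(i, row) for i, row in enumerate(dataset[1:], start=1)
            if len(row) != expected]

# Toy stand-in for the real data (values are illustrative only)
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],  # 'Category' is missing, so later values shift left
]
bad_rows = find_malformed_rows(sample)
print(bad_rows)
```

A length check only catches rows with a missing or extra field; a row with the right number of fields but wrong values would need a separate check.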

It is also desirable to remove duplicate entries to make the results more accurate. This is done in a structured way: for each app, the entry with the highest number of reviews (assumed to be the most recent) is retained as the correct entry for that app.
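
The "keep the most-reviewed entry" rule can be sketched in a single pass with a dictionary keyed by app name. This is an illustrative alternative, not the notebook's own function: `keep_most_reviewed` is a hypothetical name and the review counts below are made up.

```python
def keep_most_reviewed(rows, name_index, review_index):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[review_index])
        if name not in best or reviews > float(best[name][review_index]):
            best[name] = row
    return list(best.values())

# Illustrative rows: [name, reviews]
rows = [
    ['Box', '100'],
    ['Box', '250'],
    ['Zoom', '40'],
]
deduped = keep_most_reviewed(rows, name_index=0, review_index=1)
print(deduped)
```

When two entries tie on review count, this sketch keeps the first one it encounters. Rows are returned in the order in which each app name first appears.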

In [2]:
# cleaning data; delete dirty data
print(data_gp[10473])
del data_gp[10473]
dp_gp=data_no_cal(dataset=data_gp)
print("The number of apps in the Google Play Dataset are now: ",dp_gp)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
The number of apps in the Google Play Dataset are now:  10840


In [3]:
# remove duplicate entries
def remove_dup(dataset,name_index,review_index):
    duplicate_apps=[]
    unique_apps=[]
    for app in dataset[1:]:
        name=app[name_index]
        if name in unique_apps:
            duplicate_apps.append(name)
        else:
            unique_apps.append(name)
    
    print("Some examples of duplicate apps are: ", duplicate_apps[:5])
    print("The number of duplicate apps are: ",len(duplicate_apps))
    print("The expected number of unique apps are: ",len(unique_apps))
    
    reviews_max={}
    for app in dataset[1:]:
        name=app[name_index]
        review=float(app[review_index])
        
        if name in reviews_max and review>reviews_max[name]:
            reviews_max[name]=review
        elif name not in reviews_max:
            reviews_max[name]=review  
    
    clean_database=[]
    dummy_database=[]
    
    for app in dataset[1:]:
        name=app[name_index]
        review=float(app[review_index])
        if (reviews_max[name]==review) and (name not in dummy_database):
            clean_database.append(app)
            dummy_database.append(name)
    print("The actual number of unique apps are: ",len(clean_database),'\n')
    return(clean_database)

print("Removing duplicate apps in the Google Play Store Dataset:\n")
gp_cleandb=remove_dup(dataset=data_gp,name_index=0,review_index=3)
print("Removing duplicate apps in the App Store Dataset:\n")
ios_cleandb=remove_dup(dataset=data_ios,name_index=1,review_index=5)

Removing duplicate apps in the Google Play Store Dataset:

Some examples of duplicate apps are:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings']
The number of duplicate apps are:  1181
The expected number of unique apps are:  9659
The actual number of unique apps are:  9659 

Removing duplicate apps in the App Store Dataset:

Some examples of duplicate apps are:  ['Mannequin Challenge', 'VR Roller Coaster']
The number of duplicate apps are:  2
The expected number of unique apps are:  7195
The actual number of unique apps are:  7195 



## Remove Non-English Apps
As mentioned earlier, the scope of the project includes only English apps, so this section removes the non-English apps from the datasets. The app 'name' is used to identify apps that are built in non-English languages or for non-English users. Standard English letters, digits and punctuation all fall within the ASCII range, whose codes run from 0 to 127. A loop therefore checks, for each character of the app name, whether its code (as returned by ord()) lies above the ASCII range.

However, certain English apps have special characters such as emoticons/emojis or superscript characters whose codes lie above the ASCII range. 'Instachat 😜' and 'Docs To Go™ Free Office Suite' are two examples. These apps should not be excluded from the dataset on which the analysis will be done. To retain them, a check variable (strike_counter) excludes an app only if its name contains more than three (03) such non-standard characters.
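
The rule can be tried on individual names before it is applied to a whole dataset. The sketch below is a compact illustration of the same idea; `looks_english` and its `max_non_ascii` parameter are hypothetical names, and the test strings are taken from the examples and outputs in this notebook.

```python
def looks_english(name, max_non_ascii=3):
    """True if at most `max_non_ascii` characters fall outside ASCII (0-127)."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

names = [
    'Instachat 😜',                      # one emoji
    'Docs To Go™ Free Office Suite',     # one trademark sign
    '爱奇艺PPS -《欢乐颂2》电视剧热播',    # many non-ASCII characters
]
results = [looks_english(n) for n in names]
print(results)
```

The threshold is a heuristic: a short non-English name with three or fewer non-ASCII characters would still be kept, and an English name with four emojis would be dropped.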

In [4]:
#Remove Non-English Apps
def remove_noneng(dataset,name_index):
    non_eng=[]
    clean_database=[]
    for app in dataset:
        name=app[name_index]
        # count the characters that fall outside the ASCII range (0-127)
        strike_counter=0
        for character in name:
            if ord(character)>127:
                strike_counter+=1
        # tolerate up to three such characters (emojis, trademark signs, etc.)
        if strike_counter>3:
            non_eng.append(name)
        else:
            clean_database.append(app)
    print("The number of English apps are: ",len(clean_database))
    print("The number of non-English apps are: ",len(non_eng))
    print("Examples of few non-English apps are: ",non_eng[:3])
    return(clean_database)

print("Removing non-English apps in the Google Play Store Dataset:\n")
gp_cleandb=remove_noneng(dataset=gp_cleandb,name_index=0)
print('\n')
print("Removing non-English apps in the App Store Dataset:\n")
ios_cleandb=remove_noneng(dataset=ios_cleandb,name_index=1)

Removing non-English apps in the Google Play Store Dataset:

The number of English apps are:  9614
The number of non-English apps are:  45
Examples of few non-English apps are:  ['Flame - درب عقلك يوميا', 'သိင်္ Astrology - Min Thein Kha BayDin', 'РИА Новости']


Removing non-English apps in the App Store Dataset:

The number of English apps are:  6181
The number of non-English apps are:  1014
Examples of few non-English apps are:  ['爱奇艺PPS -《欢乐颂2》电视剧热播', '聚力视频HD-人民的名义,跨界歌王全网热播', '优酷视频']


### Isolating Free Apps
The project objective is the analysis of free apps in both markets. In this section, paid (non-free) apps are identified using the 'price' column in both datasets and excluded from the final dataset to be analyzed.
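
Comparing the price as a number rather than as text avoids having to list every spelling of zero ('0', '0.0', and so on). The sketch below is one hedged way to do this; `is_free` is a hypothetical helper, and the leading '$' on paid prices is an assumption about the data format rather than something shown in the outputs here.

```python
def is_free(price_text):
    """Treat a price as free when it parses to zero; tolerate a leading '$'."""
    cleaned = price_text.strip().lstrip('$')  # '$' prefix is an assumed format
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False  # unparseable prices are treated as not free

prices = ['0', '0.0', '$4.99', 'Everyone']
flags = [is_free(p) for p in prices]
print(flags)
```

Treating unparseable values as "not free" is a conservative choice: a row whose columns have shifted is dropped instead of slipping into the free-apps dataset.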

In [5]:
# Isolating Free Apps
def remove_paidapps(dataset,name_index,price_index):
    freeapps=[]
    for app in dataset:
        name=app[name_index]
        price=app[price_index]
        if price=='0' or price=='0.0':
            freeapps.append(app)
    print("The final length of the dataset is: ",len(freeapps))
    print("The first few apps of the final dataset are: ",freeapps[:3])
    return(freeapps)

print("Isolating free apps in the Google Play Store Dataset:\n")
gp_freeapps=remove_paidapps(dataset=gp_cleandb,name_index=0,price_index=7)
print('\n')
print("Isolating free apps in the App Store Dataset:\n")
ios_freeapps=remove_paidapps(dataset=ios_cleandb,name_index=1,price_index=4)

Isolating free apps in the Google Play Store Dataset:

The final length of the dataset is:  8864
The first few apps of the final dataset are:  [['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']]


Isolating free apps in the App Store Dataset:

The final length of the dataset is:  3220
The first few apps of the final dataset are:  [['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'In

### Analysing apps based on genre
As per the defined objectives, it is important to recognize free apps that are widely used across <b>both</b> platforms (Google Play Store and iOS App Store). The more popular an app is, i.e., the more users it has, the more ads will be viewed and thus the more revenue can be generated by advertising via the app. From the datasets, the following data can therefore be useful for this analysis: number of users, category, rating, installs, and genre.

In the code below, the 'Category' column (index 1) of the Google Play dataset and the 'prime_genre' column of the App Store dataset are analyzed as the genre of each app. A frequency table is built from these values, which sheds light on the distribution of apps across genres such as 'Games', 'Entertainment', etc. The table is printed as a list of tuples, sorted in descending order, in which each tuple contains the percentage of apps in a genre (out of the total number of apps) and the genre name. Please note that the categorization of genres is different in the Google Play and App Store datasets.

Result will be represented as:
[(% of genre_1 apps, genre_1 name), (% of genre_2 apps, genre_2 name), (% of genre_3 apps, genre_3 name), .....]
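
The same kind of table can be sketched with `collections.Counter` from the standard library. This is an illustrative alternative on made-up rows, not the notebook's own function; `percentage_table` is a hypothetical name.

```python
from collections import Counter

def percentage_table(rows, column_index):
    """Return [(percentage, value), ...] sorted from most to least frequent."""
    counts = Counter(row[column_index] for row in rows)
    total = len(rows)
    table = [(round(count / total * 100, 2), value)
             for value, count in counts.items()]
    return sorted(table, reverse=True)

# Illustrative rows: [name, genre]
rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Games'], ['d', 'Weather']]
table = percentage_table(rows, column_index=1)
print(table)
```

Putting the percentage first in each tuple is what lets a plain `sorted(..., reverse=True)` order the genres by share.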

Image 2

In [6]:
def freq_table(dataset,genre_index):
    freq={}
    for app in dataset:
        value=app[genre_index]
        if value in freq:
            freq[value]+=1
        else:
            freq[value]=1
    total_apps=len(dataset)
    for key in freq:
        freq[key]=round(freq[key]/total_apps*100,2)
    table_display=[]
    for key in freq:
        key_val_as_tuple=(freq[key],key)
        table_display.append(key_val_as_tuple)
        
    table_sorted=sorted(table_display,reverse=True)
    print(table_sorted)

print("Distribution (in %) of the number of apps based on genre for Google Play Store Dataset:")    
freq_table(dataset=gp_freeapps,genre_index=1)
print('\n')
print("Distribution (in %) of the number of apps based on genre for App Store Dataset:") 
freq_table(dataset=ios_freeapps,genre_index=11)

Distribution (in %) of the number of apps based on genre for Google Play Store Dataset:
[(18.91, 'FAMILY'), (9.72, 'GAME'), (8.46, 'TOOLS'), (4.59, 'BUSINESS'), (3.9, 'LIFESTYLE'), (3.89, 'PRODUCTIVITY'), (3.7, 'FINANCE'), (3.53, 'MEDICAL'), (3.4, 'SPORTS'), (3.32, 'PERSONALIZATION'), (3.24, 'COMMUNICATION'), (3.08, 'HEALTH_AND_FITNESS'), (2.94, 'PHOTOGRAPHY'), (2.8, 'NEWS_AND_MAGAZINES'), (2.66, 'SOCIAL'), (2.34, 'TRAVEL_AND_LOCAL'), (2.25, 'SHOPPING'), (2.14, 'BOOKS_AND_REFERENCE'), (1.86, 'DATING'), (1.79, 'VIDEO_PLAYERS'), (1.4, 'MAPS_AND_NAVIGATION'), (1.24, 'FOOD_AND_DRINK'), (1.16, 'EDUCATION'), (0.96, 'ENTERTAINMENT'), (0.94, 'LIBRARIES_AND_DEMO'), (0.93, 'AUTO_AND_VEHICLES'), (0.82, 'HOUSE_AND_HOME'), (0.8, 'WEATHER'), (0.71, 'EVENTS'), (0.65, 'PARENTING'), (0.64, 'ART_AND_DESIGN'), (0.62, 'COMICS'), (0.6, 'BEAUTY')]


Distribution (in %) of the number of apps based on genre for App Store Dataset:
[(58.14, 'Games'), (7.89, 'Entertainment'), (4.97, 'Photo & Video'), (3.66, 'Edu

The frequency tables calculated above show that the App Store dataset is dominated by 'Games' apps, with over 58% of the total apps belonging to this category. The Google Play Store dataset, on the other hand, has a more evenly distributed profile, with 'Family' apps constituting about 19% of all apps in the dataset. However, this result is not sufficient to conclude that building an app in a certain 'genre' guarantees that it will be installed and used by many users. Analysing the number of users gives a more complete picture, which is explored in the code below.

### Analysis of apps based on the number of users
It is important to scrutinize which genres of apps have more users than others, i.e., which fare better in terms of popularity. One way of looking at this is to find the total number of users for each genre. However, this approach is short-sighted: a genre may contain a high number of apps and therefore have a higher total user base than other genres, without any single app of that genre having a large user base. The parameter that provides more reliable insight is the average number of users per app in a particular genre. The code below calculates exactly that.

Please note that in the Google Play dataset, the data in the 'Installs' column can be used to estimate the total number of users. These values are open-ended buckets such as '10,000+', so the code strips the ',' and '+' characters and treats each value as a number. No such parameter indicating the number of installations is available in the App Store dataset, so the data in the 'rating_count_tot' column has been used as an approximate figure for the number of users. Because the two datasets rely on different proxies, the average number of users per app in a genre is best read as a ranking of genres within each market, not as an absolute figure to be compared across markets.

<b>Calculations used below:</b><br>
genre_total_users : Total number of users for a particular 'genre'<br>
genre_total_apps : Total apps for a particular 'genre'<br>
tuple_avg_users_per_app_genre : Average number of users per app for a particular 'genre' ('genre' is the key in tuple)<br>
tuple_avg_users_per_app_genre = genre_total_users / genre_total_apps
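
The difference between total users and average users per app can be seen on a few made-up rows. The sketch below computes the averages in a single pass over the data; `parse_count` and `average_users_by_genre` are hypothetical helper names, and the numbers are illustrative only.

```python
from collections import defaultdict

def parse_count(text):
    """Convert strings such as '10,000+' or '2974676' to an int."""
    return int(text.replace(',', '').replace('+', ''))

def average_users_by_genre(rows, genre_index, users_index):
    """Return {genre: average users per app}, using one pass over the rows."""
    totals = defaultdict(int)  # genre -> summed users
    counts = defaultdict(int)  # genre -> number of apps
    for row in rows:
        genre = row[genre_index]
        totals[genre] += parse_count(row[users_index])
        counts[genre] += 1
    return {genre: round(totals[genre] / counts[genre], 2) for genre in totals}

# Illustrative rows: [genre, installs]
rows = [
    ['GAME', '10,000+'],
    ['GAME', '1,000+'],
    ['GAME', '1,000+'],
    ['SOCIAL', '5,000+'],
]
averages = average_users_by_genre(rows, genre_index=0, users_index=1)
print(averages)
```

In this toy data 'GAME' has the larger total (12,000 against 5,000) but the smaller average per app (4,000 against 5,000), which is the effect the per-app average is meant to expose.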

In [7]:
def users_by_genre(dataset,genre_index,users_index):
    all_genres_list=[]
    total_users_by_genre=[]
    avg_users_per_app_by_genre=[]
    for app in dataset:
        genre=app[genre_index]
        if genre not in all_genres_list:
            all_genres_list.append(genre)
    
    for genre in all_genres_list:
        genre_total_users=0
        genre_total_apps=0
        for app in dataset:
            app_genre=app[genre_index]
            app_users_str=app[users_index]
            app_users=''
            for element in app_users_str:
                if (element!=',') and (element!='+'):
                    app_users+=element
            app_users=int(app_users)
            if app_genre==genre:
                genre_total_users+=app_users
                genre_total_apps+=1
        tuple_total_users=(genre_total_users,genre)
        total_users_by_genre.append(tuple_total_users)
        
        tuple_avg_users_per_app_genre=((round(genre_total_users/genre_total_apps,2)),genre)
        avg_users_per_app_by_genre.append(tuple_avg_users_per_app_genre)

    total_users_by_genre=sorted(total_users_by_genre,reverse=True)
    avg_users_per_app_by_genre=sorted(avg_users_per_app_by_genre,reverse=True)
    
    print("The total number of users for the genres are:\n",total_users_by_genre,'\n')
    print("The average number of users per app for the genres are:\n",avg_users_per_app_by_genre)

print("For the Google Play Store dataset:\n")
users_by_genre(dataset=gp_freeapps,genre_index=1,users_index=5)
print('\n')
print("For the App Store dataset:\n")
users_by_genre(dataset=ios_freeapps,genre_index=11,users_index=5)

For the Google Play Store dataset:

The total number of users for the genres are:
 [(13436869450, 'GAME'), (11036906201, 'COMMUNICATION'), (8101043474, 'TOOLS'), (6193895690, 'FAMILY'), (5791629314, 'PRODUCTIVITY'), (5487861902, 'SOCIAL'), (4656268815, 'PHOTOGRAPHY'), (3931731720, 'VIDEO_PLAYERS'), (2894704086, 'TRAVEL_AND_LOCAL'), (2368196260, 'NEWS_AND_MAGAZINES'), (1665884260, 'BOOKS_AND_REFERENCE'), (1529235888, 'PERSONALIZATION'), (1400338585, 'SHOPPING'), (1143548402, 'HEALTH_AND_FITNESS'), (1095230683, 'SPORTS'), (989460000, 'ENTERTAINMENT'), (696902090, 'BUSINESS'), (503060780, 'MAPS_AND_NAVIGATION'), (497484429, 'LIFESTYLE'), (455163132, 'FINANCE'), (360288520, 'WEATHER'), (211738751, 'FOOD_AND_DRINK'), (188850000, 'EDUCATION'), (140914757, 'DATING'), (113221100, 'ART_AND_DESIGN'), (97202461, 'HOUSE_AND_HOME'), (53080061, 'AUTO_AND_VEHICLES'), (52995810, 'LIBRARIES_AND_DEMO'), (44971150, 'COMICS'), (37732344, 'MEDICAL'), (31471010, 'PARENTING'), (27197050, 'BEAUTY'), (15973160

### Interpreting the results
The analysis above shows how raw totals can be misleading. For example, although the 'Games' genre dominates the App Store market with the highest number of apps and consequently the highest number of users, its average number of users per app is only the 14th highest in a list of 23 genres. Conversely, 'Navigation' apps in the App Store dataset have the 4th-lowest total number of users, yet the same genre has the highest number of users per app.

Since the objective of this project is to identify a popular and common type/pattern across <b>both</b> markets, the genre that stands out in both the Google Play Store market and the App Store market is 'Social' (Google Play Store) or 'Social Networking' (App Store). In both markets, it has the 3rd-highest average number of users per app. Furthermore, in the Google Play Store dataset this genre has the 6th-highest total number of users, and in the App Store dataset it has the 2nd-highest. Together, these factors support the finding that 'Social Networking' apps are a reliable bet for companies looking to develop apps based on the 'In-App Advertising' revenue model. Further analysis of the user ratings of apps and other parameters such as the size of apps can be carried out to gain more insights (or identify any deeper-lying patterns) on the type of apps that are popular across both platforms.