# App Profiles for the App Store and Google Play Markets

This project examines the most popular apps in the App Store and Google Play to find the best target demographic for each store.

The code below opens the two data sets used in this project. The first block opens the Google Play Store data set, and the second opens the App Store data set. Each block opens the file, reads it into a list of lists, and separates the header row from the data rows.

In [1]:
from csv import reader

google_file = open('googleplaystore.csv', encoding='utf8')
google_read = reader(google_file)
g_list = list(google_read)
g_header = g_list[0]
g_data = g_list[1:]

apple_file = open('AppleStore.csv', encoding='utf8')
apple_read = reader(apple_file)
apple_list = list(apple_read)
apple_header = apple_list[0]
apple_data = apple_list[1:]
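
The cell above opens both files but never closes them. A minimal sketch of the same loading step with a `with` block, which closes the file automatically, could look like this. The helper name `load_csv` is hypothetical and not part of the original notebook:

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, data) for a CSV file; load_csv is a hypothetical helper."""
    # newline='' is what the csv module documentation recommends for file objects
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Usage, assuming the same file names as above:
# g_header, g_data = load_csv('googleplaystore.csv')
# apple_header, apple_data = load_csv('AppleStore.csv')
```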

The cell below defines a function for inspecting the imported data.
Four parameters are expected:
1. dataset: a list of lists containing all of the data
2. start: an integer to represent the starting index to slice the data set
3. end: an integer to represent the ending index to slice the data set
4. rows_and_columns: a boolean with False as the default argument; this parameter will display the number of rows and columns if True

The function slices the data set using the start and end integers; choosing start accordingly skips the header row.
Then the code loops through the slice, printing an empty line after each row for readability.

If rows_and_columns is True, then the function will print the number of rows and columns in the data set.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Now, we'll explore the Play Store data:

In [3]:
explore_data(g_list, 0, 4, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


The Play Store list has 10842 rows, including the header row, so 10841 apps are listed, each with 13 columns.
The following are the columns of the data:

|Number| Column name | Description|
| --- | --- | --- | 
| 1 | App |Name of the app|
| 2 |Category| The category the app belongs to|
| 3 |Rating| The rating of the app|
| 4 |Reviews| The number of reviews|
| 5 |Size| The amount of space the app takes up|
| 6 |Installs| The number of installations the app has|
| 7 |Type| If the app is free or paid for|
| 8 |Price| The cost of the app|
| 9 |Content Rating| The allowed age demographic for the app|
| 10 |Genres| The genre of the app|
| 11 |Last Updated| The date the app was last updated|
| 12 |Current Ver| The current version of the app|
| 13 |Android Ver| The Android version the app is compatible with|

Similarly, we'll explore the App Store data:

In [4]:
explore_data(apple_list, 0, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7198
Number of columns: 16


The App Store list has 7198 rows, including the header row, so 7197 apps are listed, each with 16 columns. The following are the columns of the data:

| Column name | Description|
| --- | --- | 
| id |ID number of the app|
|track_name| Name of the app|
|size_bytes| The size of the app|
|currency| The currency the price is listed in|
|price| The cost of the app|
|rating_count_tot| The total number of reviews for the app|
|rating_count_ver| The number of reviews for the current version of the app|
|user_rating| The rating for the app overall|
|user_rating_ver| The rating for the current version of the app|
|ver| The current version of the app|
|cont_rating| The minimum age recommendation for the app|
|prime_genre| The kind of app it is| 
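|sup_devices.num| The number of supported devices|
|ipadSc_urls.num| The number of screenshots shown for display|
|lang.num| The number of supported languages|
|vpp_lic| Whether VPP device-based licensing is enabled|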




# Deleting Incorrect Data
A [discussion thread on Kaggle](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015) reports an error in the Google Play Store data.
To find it, we will compare the length of each row against the header and then fix any rows that do not match.

In [5]:
head_len = len(g_header)
for index, row in enumerate(g_data):
    # a well-formed row has as many fields as the header
    if len(row) != head_len:
        print(index)
        print(len(row))

10472
12


Running the code shows that the row at index 10472 of the Google Play data is malformed: it has only 12 columns instead of 13.
To remedy this, we will delete that row from the data set.

In [6]:
del g_data[10472]
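
Deleting by a hard-coded index works once, but running the cell a second time would remove a valid row. A hedged alternative, shown here on made-up rows, keeps only the rows whose length matches the header. The helper name `drop_malformed` is hypothetical:

```python
def drop_malformed(data, header):
    """Return only the rows that have as many fields as the header."""
    expected = len(header)
    return [row for row in data if len(row) == expected]

# Toy data: the second row is missing one field
toy_header = ['App', 'Category', 'Rating']
toy_data = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '3.9'],
    ['App C', 'GAME', '4.7'],
]
toy_clean = drop_malformed(toy_data, toy_header)
print(len(toy_clean))  # 2
```

Because the filter depends only on row length, applying it again to already clean data changes nothing.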

Now, we'll run the above code again to check and see if the issue has been resolved.

In [7]:
head_len = len(g_header)
for index, row in enumerate(g_data):
    # a well-formed row has as many fields as the header
    if len(row) != head_len:
        print(index)
        print(len(row))

print('Done!')

Done!


As you can see, there are no issues with the row lengths being mismatched now that we've deleted that one row. 

# Deleting Duplicate Data
Looking through the data source threads on Kaggle, users report that there are duplicate entries for certain apps in the Google Play data set (there are no duplicate entries in the Apple data set). To remedy this, we will need to:
1. Find the duplicate entries
2. Determine a criterion for deleting entries
3. Delete the duplicates

We will start with the first step in the below cell.

In [26]:
def count_dups(data_source, app_name_index):
    # build the lists inside the function so repeated calls start fresh
    dup_apps = []
    unique_apps = []
    for row in data_source:
        app_name = row[app_name_index]
        if app_name in unique_apps:
            dup_apps.append(app_name)
        else:
            unique_apps.append(app_name)
    return dup_apps, unique_apps

# Google data set
g_dup_apps, g_unique_apps = count_dups(g_data, 0)
print('For Google data')
print('The number of duplicate entries is ' + str(len(g_dup_apps)))
print('The number of unique entries is ' + str(len(g_unique_apps)))
print('\n')

For Google data
The number of duplicate entries is 1181
The number of unique entries is 9659
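
Checking membership in a list rescans the list for every row. As a sketch of an equivalent count using `collections.Counter` from the standard library, here is a version run on made-up rows. The helper name `count_dups_fast` and the toy data are illustrative only:

```python
from collections import Counter

def count_dups_fast(data_source, app_name_index):
    """Return (number of duplicate rows, number of unique app names)."""
    counts = Counter(row[app_name_index] for row in data_source)
    n_unique = len(counts)
    n_dups = sum(counts.values()) - n_unique  # every extra occurrence is a duplicate
    return n_dups, n_unique

toy = [['Instagram', '100'], ['Instagram', '120'], ['Slack', '50'], ['Instagram', '90']]
print(count_dups_fast(toy, 0))  # (2, 2)
```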




Now we need to define a criterion for deciding which duplicate entries to delete, and then delete them based on that criterion.

For this project, I am keeping the entry with the largest number of reviews for each app and deleting the rest. To do this, the following steps are needed:
1. Create a dictionary containing the highest number of reviews for each app
2. Use the dictionary to create a new data set with only unique entries

The function below works by iterating through each app within the data source. If the app is not already included in the dictionary, then it is added with the number of reviews it has. If the app is already included, the function compares the two review counts and keeps the higher one as the value of the entry.

In [25]:
def max_rev(data_source, name_index, rev_index):
    rev_max = {}
    for row in data_source:
        app_name = row[name_index]
        n_rev = float(row[rev_index])
        # store the count for a new app, or replace it only when this row has more reviews
        if app_name not in rev_max or n_rev > rev_max[app_name]:
            rev_max[app_name] = n_rev
    return rev_max
    
g_revs = max_rev(g_data, 0, 3)
print(len(g_revs))

9659
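
To make the rule concrete, here is a small self-contained sketch on made-up rows (the review counts are invented for illustration). It keeps the highest review count per app name. The function name `max_reviews_by_app` is hypothetical:

```python
def max_reviews_by_app(rows, name_index, rev_index):
    """Map each app name to the largest review count seen for it."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_rev = float(row[rev_index])
        # Keep the stored value unless this row has more reviews
        if name not in best or n_rev > best[name]:
            best[name] = n_rev
    return best

toy = [['Instagram', '100'], ['Instagram', '120'], ['Slack', '50'], ['Instagram', '90']]
print(max_reviews_by_app(toy, 0, 1))  # {'Instagram': 120.0, 'Slack': 50.0}
```

Note that the last Instagram row (90 reviews) does not overwrite the earlier maximum of 120.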


Now that we have a dictionary holding the maximum number of reviews for each app in the Google Play data, we need to build a new data set with this information.

The function del_dups will take in the dictionary with the max number of reviews for each app, compare it to the number of reviews within the data source, and then add the row to the new cleaned data set if it is the entry with the highest number of reviews. 

We will initialize two lists, an empty one that will contain the new cleaned data and another list that will contain the names of the apps that have already been added to avoid adding duplicate entries.


In [32]:
def del_dups(data_source, revs_data, name_index, rev_index):
    cleaned_data = []
    done_apps = []
    for row in data_source:
        app_name = row[name_index]
        num_revs = float(row[rev_index])
        if num_revs == revs_data[app_name] and app_name not in done_apps:
            cleaned_data.append(row)
            done_apps.append(app_name)
    return(cleaned_data)

g_clean = del_dups(g_data, g_revs, 0, 3)
print(len(g_clean))
print(len(g_clean[0]))

9659
13


After cleaning the data, we have 9659 rows, which is exactly the number of unique entries we found earlier.

# Removing Non-English Apps

Now that we have a clean data set, we are only interested in looking at the apps that are in English because this analysis is for an English-speaking audience. 

To do this, we will remove any entries that are not in English by defining a function that checks the characters in the name of the app. Standard English text uses ASCII characters, whose code points are at most 127. Since some English app names contain special characters, our function will count the characters with a code point above 127, and if that number exceeds 3, it will consider the name to be non-English.

In [38]:
def is_eng(app_name):
    count = 0
    for char in app_name:
        if ord(char) > 127:
            count += 1
            if count > 3:
                return(False)
    return(True)

print(is_eng('Instagram'))
print(is_eng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_eng('Docs To Go™ Free Office Suite'))
print(is_eng('Instachat 😜'))

True
False
True
True
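
The threshold of 127 is the top of the ASCII range. A short sketch that lists the non-ASCII characters in a name shows why a small tolerance is needed for names containing a symbol such as ™. The helper name `non_ascii_chars` is hypothetical:

```python
def non_ascii_chars(app_name):
    """Return (character, code point) pairs for characters outside ASCII."""
    return [(char, ord(char)) for char in app_name if ord(char) > 127]

print(non_ascii_chars('Instagram'))                      # []
print(non_ascii_chars('Docs To Go™ Free Office Suite'))  # [('™', 8482)]
# A name written mostly in Chinese characters has many more than 3
print(len(non_ascii_chars('爱奇艺PPS -《欢乐颂2》电视剧热播')))
```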


Now that we have a function to check whether an app name is English, we will clean our data further by separating the English apps from the non-English apps in both data sets.

In [48]:
def keep_eng(data_set, name_index):
    eng_apps = []
    non_eng_apps = []
    for row in data_set:
        app_name = row[name_index]
        eng = is_eng(app_name)
        if eng:
            eng_apps.append(row)
        else:
            non_eng_apps.append(row)
    return(eng_apps, non_eng_apps)

g_eng_apps, g_non_eng_apps = keep_eng(g_clean, 0)
print(g_eng_apps[:5])
print('\n')
print(g_non_eng_apps[:5])
print('\n')

a_eng_apps, a_non_eng_apps = keep_eng(apple_data, 1)
print(a_eng_apps[:5])
print('\n')
print(a_non_eng_apps[:5])


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


[['Flame - درب عقلك يوميا', 'EDUCATION', '4.6', '56065', '37M', '1,000,000+', 'Free', '0', 'E

# Finding Free Apps

Now that we have separated the English apps from the non-English apps for both stores, we are going to look at just the free English apps.
To do this, we create a function that iterates over a data set and appends each entry to a new list of free apps if its price is equal to zero.

In [56]:
def free_apps(data_set, price_index):
    free = []
    for row in data_set: 
        price = str(row[price_index])
        if price == '0' or price == '0.0':
            free.append(row)
    return(free)

g_free = free_apps(g_eng_apps, 7)
print('Number of free apps in Google Play store is ' + str(len(g_free)))
print('\n')
print(g_free[:5])
print('\n')

a_free = free_apps(a_eng_apps, 4)
print('Number of free apps in Apple app store is ' + str(len(a_free)))
print('\n')
print(a_free[:5])
print('\n')
            

Number of free apps in Google Play store is 8864


[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up'], ['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']]


Number of free apps in Apple app store is 
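
The comparison against `'0'` and `'0.0'` relies on the exact strings used in each file. A more general sketch converts the price to a number first. It assumes that paid Play Store prices carry a leading dollar sign (for example `'$4.99'`), which is not shown in the output above, and the helper names `parse_price` and `is_free` are hypothetical:

```python
def parse_price(price):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is a leading dollar sign
    return float(str(price).strip().lstrip('$'))

def is_free(row, price_index):
    """Return True when the row's price parses to zero."""
    return parse_price(row[price_index]) == 0.0

print(is_free(['App A', '0'], 1))      # True
print(is_free(['App B', '0.0'], 1))    # True
print(is_free(['App C', '$4.99'], 1))  # False
```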

# Most Popular Genre of App
Now that we have a data set of free apps for each store, we will look at which kinds of free apps are the most popular.
It is important to find a genre that is popular across both stores to reach the largest audience, allowing us to bring in the most revenue possible via in-app ads.

To find the most popular genre of free apps in each app store, we can use the Category field (index 1) in the Google data set and the prime_genre field (index 11) in the Apple data set to create a frequency table showing the percentage of apps in each category.

In [72]:
def freq_table(data_set, genre_index):
    table = {}
    count = 0
    
    for row in data_set:
        count += 1
        genre = row[genre_index]
        if genre in table:
            table[genre] += 1
        else:
            table[genre] = 1
    
    p_table = {}
    for key in table:
        percent = (table[key]/count) * 100
        p_table[key] = percent
        
    return(p_table)
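
As a quick check of the arithmetic, here is a sketch of the same percentage table built with `collections.Counter` on made-up genres. The toy data and the name `freq_table_counter` are illustrative only:

```python
from collections import Counter

def freq_table_counter(data_set, genre_index):
    """Return {genre: percentage of rows} for the given column."""
    counts = Counter(row[genre_index] for row in data_set)
    total = sum(counts.values())
    return {genre: n / total * 100 for genre, n in counts.items()}

toy = [['a', 'GAME'], ['b', 'GAME'], ['c', 'TOOLS'], ['d', 'FAMILY']]
print(freq_table_counter(toy, 1))  # {'GAME': 50.0, 'TOOLS': 25.0, 'FAMILY': 25.0}
```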

In [73]:
def display_table(data_set, index):
    table = freq_table(data_set, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
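
The function above swaps each key and value into a tuple so that `sorted` orders by percentage. An alternative sketch sorts the dictionary items directly with a `key` function. The name `sorted_by_value` and the toy table are hypothetical:

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from largest to smallest value."""
    return sorted(table.items(), key=lambda item: item[1], reverse=True)

toy_table = {'TOOLS': 25.0, 'GAME': 50.0, 'FAMILY': 10.0}
for genre, percent in sorted_by_value(toy_table):
    print(genre, ':', percent)
```

One difference in behavior: when two genres have the same percentage, this version keeps their original order, whereas sorting tuples falls back to comparing the genre names.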

In [74]:
display_table(g_free, 1)
print('\n')

display_table(a_free, 11)

FAMILY : 19.223826714801444
GAME : 9.510379061371841
TOOLS : 8.461191335740072
BUSINESS : 4.580324909747293
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.5424187725631766
SPORTS : 3.4183212996389893
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2490974729241873
HEALTH_AND_FITNESS : 3.068592057761733
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.782490974729242
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.128158844765343
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
ENTERTAINMENT : 0.8799638989169676
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0

It looks as though the Play Store has mostly Family apps (with Games coming in second), while the App Store has Games as its most popular free category (with Entertainment as the runner-up).

For the Play Store, there is a wider variety of app genres, with many apps serving practical purposes. Based on this, a family or game app would probably be a good idea to create.

As for the App Store, many of the apps are for fun, with more than half of them being Games. This is a bit different from the Play Store, but since Games is a very popular category in both stores, a game app would probably do very well in both.


In [None]:
g_genres = freq_table(g_free, 1)

def avg_rate(data_set, genre_index):
    total = 0
    len_genre = 0
    for row in data_set:
        genre_app = row[genre_index]
    