# Analysis of Mobile Apps

We analyse mobile app data from the Apple App Store and the Google Play Store to understand which types of apps are likely to yield the most revenue on each platform.

In [1]:
# We start with reading the data from the files
appstore_file = open('AppleStore.csv', encoding='utf8')
googlestore_file = open('googleplaystore.csv', encoding='utf8')

In [2]:
# Now we must read the data in these files
from csv import reader

apps_data = list(reader(appstore_file))
google_data = list(reader(googlestore_file))
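
The two cells above leave both file handles open for the rest of the session. A minimal sketch of the same read using a `with` block, which closes the file as soon as the rows are loaded; `load_csv` is a hypothetical helper name and not part of the original notebook.

```python
from csv import reader

def load_csv(path):
    """Read a CSV file into a list of rows and close the file afterwards."""
    # newline='' is the setting the csv module documentation recommends
    # when passing a file object to csv.reader
    with open(path, encoding='utf8', newline='') as csv_file:
        return list(reader(csv_file))
```

With this helper, `apps_data = load_csv('AppleStore.csv')` would replace the separate `open` and `reader` calls.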

Now that we have opened and read both files, we can explore the data using the `explore_data` helper function provided by the dataquest.io team.

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

The `explore_data` function allows us to generate various slices of the data and quickly review specific records.

When `rows_and_columns` is `True`, the function also prints the number of rows and columns in the data set.

In [4]:
# Exploring the Google Dataset
explore_data(google_data, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
# Exploring the Apple Dataset
explore_data(apps_data, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


## Cleaning the Data

With the data sets loaded, we now explore them, identify any inaccuracies, and clean them up.

### Get Rid of the Headers

In [6]:
apple_header = apps_data[0]
google_header = google_data[0]

apps_data = apps_data[1:]
google_data = google_data[1:]

### Incomplete Records
We first need to ensure that all records in the given data set have complete data. If any records have data missing for any column, that record can be discarded.

In [7]:
# The check_data function checks that every record has data
# for all the columns
#
# If a record is found with a column count that differs from
# the column count of the header row, that record is treated
# as having a problem
def check_data(dataset, col_count):
    problem_indices = []
    for idx, record in enumerate(dataset):
        if len(record) != col_count:
            print(idx, record)
            problem_indices.append(idx)

    return problem_indices

In [8]:
apps_data_col_count = len(apple_header)
google_data_col_count = len(google_header)

problem_apple = check_data(apps_data, apps_data_col_count)
problem_google = check_data(google_data, google_data_col_count)

print('Found problems in Apple:', problem_apple)
print('Found problems in Google:', problem_google)

10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
Found problems in Apple: []
Found problems in Google: [10472]


Now that we have identified the problematic record, we have to remove it from the set.

In [9]:
def remove_problem_records(data_set, problem_recs):
    for idx in reversed(problem_recs):
        print('Deleting row ', idx, data_set[idx])
        del data_set[idx]

In [10]:
remove_problem_records(apps_data, problem_apple)
remove_problem_records(google_data, problem_google)

Deleting row  10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


### Duplicate Records

We can also check if there are records that occur twice and try to eliminate all duplicates while retaining only a single copy of that record with the latest data in it.

#### Finding the duplicates
We first have to check the records and identify the duplicate records. For this, we will have to examine the App name in each record and compare it with the other records in the dataset.

In [11]:
# The find_duplicates method will allow us to quickly scan the 
# dataset and identify all the duplicate apps in that dataset
def find_duplicates(dataset, check_col):
    duplicate_apps = []
    unique_apps = []
    for record in dataset:
        if record[check_col] in unique_apps:
            duplicate_apps.append(record[check_col])
        else:
            unique_apps.append(record[check_col])
    return len(duplicate_apps), duplicate_apps


In [12]:
google_dup_count, google_dups = find_duplicates(google_data, 0)
app_dup_count, app_dups = find_duplicates(apps_data, 1)

print('Number of Duplicates in Google Data: ', google_dup_count)
print('Number of Duplicates in Apple Data: ', app_dup_count)

Number of Duplicates in Google Data:  1181
Number of Duplicates in Apple Data:  2


Now that we have identified the numbers of duplicate apps, we will need to devise a way to eliminate these duplicates and come up with a dataset with only unique apps.

#### Approach for eliminating duplicates
While we have identified the duplicates based on the App name, we need to decide which of those duplicate records should be retained for each App.

For the Google dataset, this could be done based on the following data points in each record:

1. **Number of Reviews**: Among the duplicates, the record with the highest number of reviews can be considered the one with the latest data. That record is retained and the remaining duplicates are deleted.

1. **Last Updated Date**: The record with the latest value for the Last Updated Date can be considered the latest record and retained, while the others are deleted.

1. **Version Number**: The record with the highest version number for the app can be considered the latest record and retained, while the others are deleted.
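
The second criterion needs the `Last Updated` strings to be compared as dates rather than as text. A minimal sketch, assuming every value follows the `'January 7, 2018'` format seen in the sample rows and that month names are parsed under an English locale; `parse_last_updated` and `latest_by_date` are hypothetical names, and the toy rows are made up for illustration.

```python
from datetime import datetime

def parse_last_updated(text):
    # '%B %d, %Y' matches strings such as 'January 7, 2018'
    return datetime.strptime(text, '%B %d, %Y')

def latest_by_date(dataset, name_col, date_col):
    latest = {}
    for record in dataset:
        name = record[name_col]
        # Keep the record if the app is new or this record is more recent
        if (name not in latest
                or parse_last_updated(record[date_col])
                > parse_last_updated(latest[name][date_col])):
            latest[name] = record
    return list(latest.values())

# Toy rows (made up): [app name, last updated]
toy = [
    ['App A', 'January 7, 2018'],
    ['App A', 'March 2, 2018'],
    ['App B', 'December 31, 2017'],
]
print(latest_by_date(toy, 0, 1))
```

The notebook itself deduplicates on the number of reviews, so this is only an illustration of how the date-based alternative could look.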


#### Removing the Duplicates
To remove the duplicates, we will have to build logic to identify the records that will need to be retained.

The logic is implemented within the `dedup_float` method. The name suggests that this method handles deduplication of records based on a float value comparison between records. This method can be used to de-duplicate the records based on the values in any column that contains a float value. This can be used to deduplicate based on the number of reviews for each app. 

The aim of the logic is to avoid multiple iterations through the large data-set. The idea is to not only identify the records that should be retained or discarded, but also to collect the records to be retained into a list during the same iteration.

To achieve this, we use two dictionaries, both keyed by app name. The first dictionary, `max_values`, stores the highest review count seen so far for each app. The second dictionary, `deduped_records`, holds the record that carries that count.

As we iterate through the dataset, we compare the reviews value of each record with the value stored for that app in the `max_values` dictionary. If the app has no entry yet, or if the stored value is lower than the current value, the entries in both dictionaries are replaced with the values from the current record.

In [13]:
# Deduplicate records, keeping for each app the record with the highest value
def dedup_float(dataset, name_col, value_col):
    # We create two dictionaries: one to store the max values
    max_values = {}
    # and the other to store the record with max value
    deduped_records = {}
    
    # Then loop through the data set
    for record in dataset:
        # extract the value to be compared (reviews in our case)
        dedup_val = float(record[value_col])
        
        # if the current app name already exists in the dictionary
        # then we have already seen this app before; which means
        # the current record is a duplicate
        if record[name_col] in max_values:
            # Since the current record is a duplicate, we will
            # have to check the value for the current records
            # against the max value stored in the dictionary
            if dedup_val > max_values[record[name_col]]:
                # If the current value is greater than that in 
                # the dictionary, then we replace the value in
                # the dictionary with the current value
                max_values[record[name_col]] = dedup_val
                # We also store the current record
                deduped_records[record[name_col]] = record
        else:
            # If the record is not in the dictionary, then it
            # has not been seen before. Store it in the dictionary
            max_values[record[name_col]] = dedup_val
            deduped_records[record[name_col]] = record
            
    # In the end, return the values list from the records 
    # dictionary
    return list(deduped_records.values())
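
To see the keep-the-maximum logic on a small input, here is a compact sketch of the same idea using a single dictionary that stores the best record per app. `dedup_keep_max` is a hypothetical name and the toy rows are made up; it is a variant for illustration, not a replacement for `dedup_float`.

```python
def dedup_keep_max(dataset, name_col, value_col):
    best = {}
    for record in dataset:
        name = record[name_col]
        # Replace the stored record only when the new value is strictly higher
        if name not in best or float(record[value_col]) > float(best[name][value_col]):
            best[name] = record
    return list(best.values())

# Toy rows (made up): [app name, number of reviews]
toy = [['Chat', '120'], ['Maps', '45'], ['Chat', '150'], ['Chat', '130']]
print(dedup_keep_max(toy, 0, 1))
```

The single-dictionary form converts the stored value again on each comparison, which is the cost that the separate `max_values` dictionary avoids.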

In [14]:
android_deduped = dedup_float(google_data, 0, 3)
#apple_deduped = dedup_float(apps_data, 1, 5)
apple_deduped = apps_data  # Because the solution notebook does not dedup apple for some reason

print(len(android_deduped))
print(len(apple_deduped))

9659
7197


### Removing Apps with Non-English Names

We also want to eliminate apps whose names are not in English. For this, we loop through each name and keep only the names that look English.

We do this by checking the code point of each character in the name: characters in the ASCII range (0 to 127) are treated as English. However, some English names contain characters outside that range, such as ™ or an emoji, so we only filter out an app if its name has more than 3 non-ASCII characters.

In [15]:
def has_more3_nonenglish_characters(string):
    count = 0
    for char in string:
        # Characters in the ASCII range have code points 0 - 127
        if ord(char) > 127:
            count += 1
    # Tolerate up to 3 non-ASCII characters (e.g. ™ or an emoji)
    return count > 3

def remove_non_english_app_names(dataset, name_col):
    clean_list = []
    for record in dataset:
        if not has_more3_nonenglish_characters(record[name_col]):
            clean_list.append(record)
    return clean_list

In [16]:
[(x, not has_more3_nonenglish_characters(x)) for x in [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜'
]]

[('Instagram', True),
 ('爱奇艺PPS -《欢乐颂2》电视剧热播', False),
 ('Docs To Go™ Free Office Suite', True),
 ('Instachat 😜', True)]

In [17]:
android_english = remove_non_english_app_names(android_deduped, 0)
apple_english = remove_non_english_app_names(apple_deduped, 1)

print(len(android_english))
print(len(apple_english))

9614
6183


### Isolating Free Apps
Since we are interested only in in-app ad revenue, we exclude apps that earn revenue by charging for the app itself.

Hence, we keep only the apps that are offered free of cost.

In [18]:
def extract_digits(in_string):
    return ''.join(i for i in in_string if i.isdigit() or i == '.')

def isolate_free_apps(dataset, price_col):
    free_apps = []
    for record in dataset:
        if float(extract_digits(record[price_col])) == 0.0:
            free_apps.append(record)
            
    return free_apps
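
The price columns hold text: the sample rows show `'0.0'` for Apple and `'0'` for Google. A small sketch of how stripping a price down to digits and the decimal point behaves, assuming paid prices carry a currency symbol such as `'$4.99'` (an assumption, since no paid row appears in the samples above).

```python
def extract_digits(in_string):
    # Keep only digits and the decimal point, dropping symbols such as '$'
    return ''.join(ch for ch in in_string if ch.isdigit() or ch == '.')

for raw in ['0', '0.0', '$4.99']:
    print(repr(raw), '->', float(extract_digits(raw)))
```

A value with no digits at all would reduce to an empty string, and `float('')` raises a `ValueError`, so this approach relies on every price containing a number.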


In [19]:
android_free = isolate_free_apps(android_english, 7)
apple_free = isolate_free_apps(apple_english, 4)

print(len(android_free))
print(len(apple_free))

8864
3222


## Analysing the Data
Now that we have cleaned the data sufficiently, we can start analyzing the types of apps that would yield maximum revenue in both the Google Play Store and the App Store.

The idea is to find the apps that are most popular among users in both segments. We first look at how the available apps are distributed across genres, that is, how frequently each published genre occurs in each market.

### Building a Frequency Table

In [20]:
# Method to build frequency table
def build_freq_table(dataset, freq_col):
    freq_count = {}
    for record in dataset:
        if record[freq_col] in freq_count:
            freq_count[record[freq_col]] += 1
        else:
            freq_count[record[freq_col]] = 1
    return freq_count

# Method to convert frequency table into percentages
def build_freq_percentages(freq_table, total):
    freq_percents = {}
    for key in freq_table:
        freq_percents[key] = (freq_table[key] / total) * 100
    return freq_percents
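
The same frequency table can also be sketched with `collections.Counter` from the standard library. This is an alternative formulation under the same inputs, not the notebook's own code; `freq_percentages` is a hypothetical name and the toy rows are made up.

```python
from collections import Counter

def freq_percentages(dataset, freq_col):
    # Count how often each value occurs in the chosen column
    counts = Counter(record[freq_col] for record in dataset)
    total = sum(counts.values())
    # most_common() yields the values ordered from most to least frequent
    return {key: count / total * 100 for key, count in counts.most_common()}

# Toy rows (made up): [app name, category]
toy = [['a', 'GAME'], ['b', 'TOOLS'], ['c', 'GAME'], ['d', 'FAMILY']]
print(freq_percentages(toy, 1))
```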

In [21]:
def print_nice(dataset, percent=True):
    # Pick the format once: percentages get a trailing '%' sign
    str_format = '{:<40}:{:>10.2f}%' if percent else '{:<40}:{:>10.2f}'
    # Sort the (key, value) pairs by value, largest first
    for key, value in sorted(dataset.items(), key=lambda item: item[1], reverse=True):
        print(str_format.format(key, value))

def print_freqs(dataset, freq_col):
    freq_table = build_freq_table(dataset, freq_col)
    print_nice(freq_table, False)

def print_percents(dataset, freq_col):
    freq_percents = build_freq_percentages(
        build_freq_table(dataset, freq_col),
        len(dataset)
    )
    print_nice(freq_percents)


#### Android Apps -- _Category_
The frequency table for the Android apps based on the _Category_ field is as below:

In [22]:
print_percents(android_free, 1)

FAMILY                                  :     18.91%
GAME                                    :      9.72%
TOOLS                                   :      8.46%
BUSINESS                                :      4.59%
LIFESTYLE                               :      3.90%
PRODUCTIVITY                            :      3.89%
FINANCE                                 :      3.70%
MEDICAL                                 :      3.53%
SPORTS                                  :      3.40%
PERSONALIZATION                         :      3.32%
COMMUNICATION                           :      3.24%
HEALTH_AND_FITNESS                      :      3.08%
PHOTOGRAPHY                             :      2.94%
NEWS_AND_MAGAZINES                      :      2.80%
SOCIAL                                  :      2.66%
TRAVEL_AND_LOCAL                        :      2.34%
SHOPPING                                :      2.25%
BOOKS_AND_REFERENCE                     :      2.14%
DATING                                  :     

#### Android Apps -- _Genres_

The frequency table for the Android apps based on the _Genres_ field is as below:

In [23]:
print_percents(android_free, 9)

Tools                                   :      8.45%
Entertainment                           :      6.07%
Education                               :      5.35%
Business                                :      4.59%
Productivity                            :      3.89%
Lifestyle                               :      3.89%
Finance                                 :      3.70%
Medical                                 :      3.53%
Sports                                  :      3.46%
Personalization                         :      3.32%
Communication                           :      3.24%
Action                                  :      3.10%
Health & Fitness                        :      3.08%
Photography                             :      2.94%
News & Magazines                        :      2.80%
Social                                  :      2.66%
Travel & Local                          :      2.32%
Shopping                                :      2.25%
Books & Reference                       :     

#### Apple Apps

In [24]:
print_percents(apple_free, 11)

Games                                   :     58.16%
Entertainment                           :      7.88%
Photo & Video                           :      4.97%
Education                               :      3.66%
Social Networking                       :      3.29%
Shopping                                :      2.61%
Utilities                               :      2.51%
Sports                                  :      2.14%
Music                                   :      2.05%
Health & Fitness                        :      2.02%
Productivity                            :      1.74%
Lifestyle                               :      1.58%
News                                    :      1.33%
Travel                                  :      1.24%
Finance                                 :      1.12%
Weather                                 :      0.87%
Food & Drink                            :      0.81%
Reference                               :      0.56%
Business                                :     

From the above analysis it is evident that, while Games and Entertainment apps dominate the Apple AppStore, the mix of apps on offer in the Google Play Store is much more varied.


### Finding the most popular Genres

While the above frequency tables give us a glimpse of the number of apps in each Genre/Category, they don't necessarily reflect the popularity of those apps. For instance, a Genre might have numerous apps, yet none of those apps may have any users. We can't say that a particular genre is popular just because many apps fall into it. Popularity is a function of how many users there are.

To analyze the number of users, we derive these numbers based on the data available.
- For Google Play, the number of users may be reflected by the number of Installs for each app
- For the Apple AppStore, though the number of installs is unavailable, a good proxy may be the number of users who have chosen to leave a rating for each app
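
The `Installs` column holds strings such as `'10,000+'`, so it has to be converted before it can be averaged. A minimal sketch of that conversion; `parse_installs` is a hypothetical helper name. The trailing `+` suggests each value is a lower bound rather than an exact count, so averages built on it are rough estimates.

```python
def parse_installs(text):
    # Drop the trailing '+' and the thousands separators, then convert
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('10,000+'))
```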

#### Apple AppStore

We will calculate the average number of user ratings for each Genre in the App Store. 

In [34]:
def calculate_agg_rating_for_genre(dataset, genre_col, rating_col, transform = lambda x: x):
    genre_ratings = {}
    genre_counts = {}
    genre_highest = {}
    genre_agg_ratings = {}
    for record in dataset:
        record_rating = float(transform(record[rating_col]))
        if (record[genre_col] in genre_ratings):
            genre_ratings[record[genre_col]] += record_rating
            genre_counts[record[genre_col]] += 1
            
            if (genre_highest[record[genre_col]] < record_rating):
                genre_highest[record[genre_col]] = record_rating
        else:
            genre_ratings[record[genre_col]] = record_rating
            genre_counts[record[genre_col]] = 1
            genre_highest[record[genre_col]] = record_rating
    
    for genre in genre_ratings:
        # print(genre, genre_ratings[genre], genre_counts[genre])
        genre_agg_ratings[genre] = (
            genre_ratings[genre],
            genre_counts[genre],
            genre_highest[genre]
        )
    
    return genre_agg_ratings

def print_avg_ratings(dataset, genre_col, rating_col, transform = lambda x: x):
    print('{:^20}:  {:^14}  :  {:^4}  :  {:^14} : {:^6}'.format('Genre', 'Inst./Rat.', '#', 'Highest', '%age'))
    agg_ratings = calculate_agg_rating_for_genre(dataset, genre_col, rating_col, transform)
    for rec in sorted(
        [(
            x, 
            agg_ratings[x][0] / agg_ratings[x][1], 
            agg_ratings[x][1], 
            agg_ratings[x][2], 
            agg_ratings[x][2] * 100 / agg_ratings[x][0]
        ) for x in agg_ratings],
        key=lambda x: x[1],
        reverse=True
    ):
        print('{:<20}:  {:>14,.2f}  :  {:>4.0f}  :  {:>14,.0f} : {:>5.2f}%'.format(rec[0], rec[1], rec[2], rec[3], rec[4]))

In [35]:
print_avg_ratings(apple_free, 11, 5)

       Genre        :    Inst./Rat.    :   #    :     Highest     :  %age 
Navigation          :       86,090.33  :     6  :         345,046 : 66.80%
Reference           :       74,942.11  :    18  :         985,920 : 73.09%
Social Networking   :       71,548.35  :   106  :       2,974,676 : 39.22%
Music               :       57,326.53  :    66  :       1,126,879 : 29.78%
Weather             :       52,279.89  :    28  :         495,626 : 33.86%
Book                :       39,758.50  :    14  :         252,076 : 45.29%
Food & Drink        :       33,333.92  :    26  :         303,856 : 35.06%
Finance             :       31,467.94  :    36  :         233,270 : 20.59%
Photo & Video       :       28,441.54  :   160  :       2,161,558 : 47.50%
Travel              :       28,243.80  :    40  :         446,185 : 39.49%
Shopping            :       26,919.69  :    84  :         417,779 : 18.48%
Health & Fitness    :       23,298.02  :    65  :         507,706 : 33.53%
Sports              :    

Analyzing the data above, we can make some observations:
- **Navigation** apps seem to draw the most user ratings per app. However, there are only 6 apps on offer, and the single most-rated app accounts for about 67% of those ratings.

- Similarly, the **Reference** genre has one very popular application that accounts for 73% of all ratings.

- In **Social Networking**, the top app accounts for roughly 39% of the user ratings; however, there are over 100 apps in this space. Social Networking also seems to be the genre with the largest number of users who care to leave ratings.

The other Genres that stand out are **Photo & Video** and **Games**, each of which seems to have a large user base that leaves ratings. While there seems to be a dominant player in the **Photo & Video** space, the **Games** space seems to be very hotly contested, with the most-rated app contributing only about 5% of the total ratings.
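
The `%age` column in the table above is the top app's rating count divided by the genre's total rating count. A quick check for the **Navigation** row, using only the figures printed in the table; the total is recovered as average times app count, so it carries the table's rounding.

```python
# Figures copied from the Navigation row of the table above
avg_ratings = 86090.33   # average number of ratings per app
app_count = 6            # number of free apps in the genre
highest = 345046         # rating count of the most-rated app

# Total ratings in the genre, then the share held by the top app
total_ratings = avg_ratings * app_count
top_app_share = highest * 100 / total_ratings
print('{:.2f}%'.format(top_app_share))
```

This reproduces the 66.80% shown for Navigation.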

#### Recommendation - Apple AppStore

A competitive app may be able to do well in one of the Genres where there is heavy user interest: **Navigation**, **Reference**, **Social Networking**, **Music**, **Photo & Video**, and **Games**. However, the entry barrier in these genres can be expected to be high, given that popular players are already entrenched there.

Genres such as **Medical** and **Catalogs** would have a low entry barrier since there are not many apps on offer. However, the level of user interest is also low.

This data can be used for further discussions with the team. Depending on the core competency available within the team building the app, we can decide either to compete in a popular genre, or to deploy an app that can potentially trigger wide user interest and capture a genre that has until now been ignored by other players.


#### Google Play Store
We build the same table for the Google Play Store data, using the number of installs in place of the rating count.

In [36]:
print_avg_ratings(android_free, 1, 5, lambda x: x.replace('+','').replace(',',''))

       Genre        :    Inst./Rat.    :   #    :     Highest     :  %age 
COMMUNICATION       :   38,456,119.17  :   287  :   1,000,000,000 :  9.06%
VIDEO_PLAYERS       :   24,727,872.45  :   159  :   1,000,000,000 : 25.43%
SOCIAL              :   23,253,652.13  :   236  :   1,000,000,000 : 18.22%
PHOTOGRAPHY         :   17,840,110.40  :   261  :   1,000,000,000 : 21.48%
PRODUCTIVITY        :   16,787,331.34  :   345  :   1,000,000,000 : 17.27%
GAME                :   15,588,015.60  :   862  :   1,000,000,000 :  7.44%
TRAVEL_AND_LOCAL    :   13,984,077.71  :   207  :   1,000,000,000 : 34.55%
ENTERTAINMENT       :   11,640,705.88  :    85  :     100,000,000 : 10.11%
TOOLS               :   10,801,391.30  :   750  :   1,000,000,000 : 12.34%
NEWS_AND_MAGAZINES  :    9,549,178.47  :   248  :   1,000,000,000 : 42.23%
BOOKS_AND_REFERENCE :    8,767,811.89  :   190  :   1,000,000,000 : 60.03%
SHOPPING            :    7,036,877.31  :   199  :     100,000,000 :  7.14%
PERSONALIZATION     :    

From the table above, one thing is immediately apparent: the Google Play Store is much more competitive, with very few Genres dominated by a single app. Only the **Books & Reference** Genre has a single app that accounts for more than 50% of the user installs.

All genres also have many more players in comparison with the AppStore.

The most popular Genres appear to be **Communication** and **Video Players**, which have the highest average number of installs per app. Both are also crowded, with 287 players in the **Communication** Genre alone.
