# Choosing the Right Niche for Future Mobile App Development 
## Exploring the Apple App Store and Google Play Markets

The goal of this project is to give our software development team a data-driven recommendation on which kind of app they should invest resources in building.

Our company relies on in-app ads geared towards English-speaking users, which means that the revenue generated by an app will be proportional to the number of people using it. The data analysis that follows will try to determine which free, English-language apps tend to perform best in both the iOS and Android markets.

## 1. Opening and Exploring the Data

To perform our analysis, we will use free, readily available samples of apps from the Apple App Store and the Google Play Store, both found on Kaggle.com. Although each sample is only a fraction of the total number of apps available, it will serve us well given the resources it would take to gather data on all 2 million apps on the App Store and 2.1 million apps on the Play Store (as of September 2018).

* [Link to dataset for Apple App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
* [Link to dataset for Google Play Store](https://www.kaggle.com/lava18/google-play-store-apps)

In [1]:
from csv import reader

# Apple App Store data
with open('AppleStore.csv', encoding='utf8') as f:
    ios = list(reader(f))
ios_header = ios[0]
ios_apps = ios[1:]

# Google Play data
with open('googleplaystore.csv', encoding='utf8') as f:
    android = list(reader(f))
android_header = android[0]
android_apps = android[1:]

In [2]:
# Print the rows in dataset[start:end]; optionally return [# of rows, # of columns]
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    # Modified explore_data to return a list for the (# rows, # cols)
    if rows_and_columns:
        return [len(dataset), len(dataset[0])]

In [3]:
# Print the first five rows from the ios_apps dataset
ios_apps_size = explore_data(ios_apps, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [4]:
# Print the first five rows from the android_apps dataset
android_apps_size = explore_data(android_apps, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']




In [5]:
# List out the column names for the ios_apps dataset
ios_header

['id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

In [6]:
# The size (#rows, #cols) of the (uncleaned) ios_apps dataset
ios_apps_size

[7197, 16]

In [7]:
# List out the column names for the android_apps dataset
android_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

In [8]:
# The size (#rows, #cols) of the (uncleaned) android_apps dataset
android_apps_size

[10841, 13]

## 2. Cleaning the Data

### 2.1 Fixing Missing Values

#### Google Play Store
From the discussion on Kaggle.com, the app located at index 10472 ('Life Made WI-Fi Touchscreen Photo Frame') is missing a value for the `'Category'` column, which shifts every later value one column to the left. With a little research, we were able to deduce that the `'Category'` is 'LIFESTYLE' and the `'Genres'` is 'Lifestyle'.  
* See [discussion](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015).

Rather than deleting this datapoint, we will insert the respective values.
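
A row like this can also be spotted without relying on the Kaggle discussion, by comparing each row's length with the header's. The sketch below is only an illustration on made-up rows; the helper name `find_malformed_rows` is hypothetical and not part of the original analysis.

```python
def find_malformed_rows(header, rows):
    """Return (index, length) pairs for rows whose length differs from the header's."""
    expected = len(header)
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# Made-up sample data: the second row is missing its 'Category' value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],
    ['App C', 'GAME', '4.5'],
]

malformed = find_malformed_rows(sample_header, sample_rows)
print(malformed)  # [(1, 2)]
```

On the real data, the equivalent call would presumably be `find_malformed_rows(android_header, android_apps)`.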

In [9]:
android_apps[10472]

['Life Made WI-Fi Touchscreen Photo Frame',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 '',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

In [10]:
# Insert 'LIFESTYLE' where the 'Category' column should be 
# and insert 'Lifestyle' into the 'Genres' column 
android_apps[10472].insert(1,'LIFESTYLE')
android_apps[10472].pop(9)
android_apps[10472].insert(9, 'Lifestyle')

In [11]:
android_apps[10472]

['Life Made WI-Fi Touchscreen Photo Frame',
 'LIFESTYLE',
 '1.9',
 '19',
 '3.0M',
 '1,000+',
 'Free',
 '0',
 'Everyone',
 'Lifestyle',
 'February 11, 2018',
 '1.0.19',
 '4.0 and up']

### 2.2 Remove Duplicate Apps

#### Google Play Store
It appears that there are a lot of duplicate apps in the `android_apps` dataset. We will remove all duplicates that share the same name. To determine which duplicate to keep, we will choose the row with the highest number of reviews, since that row should reflect the most up-to-date information for the app.

#### Apple App Store
After reading through some of the discussion found on Kaggle.com, it appears that the two pairs of apps that share a name (VR Roller Coaster and Mannequin Challenge) are in fact unique apps.
* See [discussion](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409).

Rather than remove these datapoints, we will keep all of them.
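
To make the Google Play rule concrete (keep the row with the most reviews for each app name), here is a minimal sketch on made-up rows. The helper name `keep_most_reviewed` and the sample data are assumptions for illustration only. Unlike a two-pass approach, it returns apps in the order their names were first seen.

```python
def keep_most_reviewed(rows, name_index, n_reviews_index):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[n_reviews_index])
        if name not in best or n_reviews > float(best[name][n_reviews_index]):
            best[name] = row
    # dicts preserve insertion order (Python 3.7+), so names stay in first-seen order
    return list(best.values())


# Made-up sample data: 'Chat App' appears twice with different review counts
sample_rows = [
    ['Chat App', 'COMMUNICATION', '4.0', '100'],
    ['Notes', 'PRODUCTIVITY', '4.2', '50'],
    ['Chat App', 'COMMUNICATION', '4.0', '250'],
]

deduped = keep_most_reviewed(sample_rows, 0, 3)
for row in deduped:
    print(row)
```

Only the 'Chat App' row with 250 reviews survives, alongside the single 'Notes' row.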

In [12]:
# function takes a 2D array and the index of the name and
# returns a list of duplicate apps based on the name of the app
def duplicate_apps(data, name_index=0):
    dup_apps = []
    unique_apps = []
    
    for app in data:
        name = app[name_index]
        if name in unique_apps:
            dup_apps.append(name)
        else:
            unique_apps.append(name)
  
    return dup_apps

In [13]:
dup_apps_android = duplicate_apps(android_apps)
print('Google Play Store')
print('Number of duplicate apps: ', len(dup_apps_android))

Google Play Store
Number of duplicate apps:  1181


In [14]:
dup_apps_ios = duplicate_apps(ios_apps, 1)
print('Apple App Store')
print('Number of duplicate apps: ', len(dup_apps_ios))

Apple App Store
Number of duplicate apps:  2


#### Apple App Store
Since two duplicate names were found in the `ios_apps` dataset, these must be the two pairs of applications with the same names (as mentioned above). We will not remove them, as they appear to be unique apps.

In [15]:
# function to remove duplicates by selecting the row with the 
# highest number of reviews (i.e. the most up-to-date row) 
# data is a 2D array (a list of lists)
# name_index and n_reviews_index are both integers 
# returns a 2d array without any duplicates
def remove_duplicates(data, name_index, n_reviews_index):
    # create a dictionary for the max number of reviews per app 
    reviews_max = {}
    
    for row in data:
        name = row[name_index]
        n_reviews = float(row[n_reviews_index])
        if name in reviews_max and reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews
        elif name not in reviews_max:
            reviews_max[name] = n_reviews
        
    clean_data = []
    already_added = []

    for row in data:
        name = row[name_index]
        n_reviews = float(row[n_reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean_data.append(row)
            already_added.append(name)
    
    return clean_data

In [16]:
android_clean = remove_duplicates(android_apps, 0, 3)
print('The new android dataset (w/o duplicates) has', 
      len(android_clean), 'rows.')

print('\nTo confirm the cleaned dataset has the correct number of rows...')
print('(# of rows w/ duplicates) - (# of duplicates) = (# of rows w/o duplicates:)')
print(android_apps_size[0] - len(dup_apps_android))

The new android dataset (w/o duplicates) has 9660 rows.

To confirm the cleaned dataset has the correct number of rows...
(# of rows w/ duplicates) - (# of duplicates) = (# of rows w/o duplicates:)
9660


In [17]:
ios_clean = ios_apps
print('This new ios_clean dataset is the same as ios_apps dataset and has', 
      len(ios_clean), 'rows.')

print('\nTo confirm the copied dataset has the correct number of rows...')
print(len(ios_clean), '=', ios_apps_size[0])

This new ios_clean dataset is the same as ios_apps dataset and has 7197 rows.

To confirm the copied dataset has the correct number of rows...
7197 = 7197


### 2.3  Remove Non-English Apps

We have chosen to remove non-English apps with a simple rule: we count the characters in each app name that fall outside the standard ASCII range (codes 0-127), and if a name has more than 3 such characters, we remove the app from our datasets.

This rule is not perfect: we may remove English apps with many emojis in their names or, conversely, keep apps in other languages that use mostly ASCII characters (e.g., German or French).

However, this rule should be good enough for our analysis.

In [18]:
# function which takes a string and returns False if more than 3 
# characters in the string fall outside the ASCII range (0-127)
def english_detector(string):
    
    non_english_char_count = 0
    
    for char in string:
        if ord(char) > 127:
            non_english_char_count += 1
    
    if non_english_char_count > 3:
        return False
        
    return True

In [19]:
android_clean_english = []

for row in android_clean:
    english_name = english_detector(row[0])
    if english_name:
        android_clean_english.append(row)

print('''The new android dataset (w/o duplicates and non-English names) 
has''', len(android_clean_english), 'rows.')

The new android dataset (w/o duplicates and non-English names) 
has 9615 rows.


In [20]:
ios_clean_english = []

for row in ios_clean:
    english_name = english_detector(row[1])
    if english_name:
        ios_clean_english.append(row)

print('''The new ios dataset (w/o duplicates and non-English names) 
has''', len(ios_clean_english), 'rows.')

The new ios dataset (w/o duplicates and non-English names) 
has 6183 rows.


### 2.4 Remove Non-Free Apps

We are only concerned with free apps, so we will remove all apps that require payment to download.
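
The two datasets record price differently, so a single price-based check needs a little normalization. The sketch below assumes that paid Android prices carry a leading `$` (for example `'$4.99'`), which is an assumption about the raw data rather than something shown above. The helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    # Assumption: a currency symbol, if present, is a leading '$'
    return float(price_text.strip().lstrip('$')) == 0.0


prices = ['0', '0.0', '$4.99', '2.99']
free_flags = [is_free(p) for p in prices]
print(free_flags)  # [True, True, False, False]
```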

In [21]:
android_clean_english_free = []

for row in android_clean_english:
    if row[6] == 'Free':
        android_clean_english_free.append(row)

print('''The new android dataset (w/o duplicates, non-English names
and non-Free apps) has''', len(android_clean_english_free), 'rows.')

The new android dataset (w/o duplicates, non-English names
and non-Free apps) has 8864 rows.


In [22]:
ios_clean_english_free = []

for row in ios_clean_english:
    if float(row[4]) == 0:
        ios_clean_english_free.append(row)

print('''The new ios dataset (w/o duplicates and non-English names
and non-Free apps) has''', len(ios_clean_english_free), 'rows.')

The new ios dataset (w/o duplicates and non-English names
and non-Free apps) has 3222 rows.


## 3. Find Application Profiles that are Successful in Both Markets

In order to focus the time and attention of our software developers, we will try to narrow our focus to types of apps that have been successful in both markets (Google Play Store and Apple App Store).

Our validation strategy is as follows:
1. Build a basic Android version of the app and launch it in the Google Play Store.
2. If the app has a good response from users, develop it further.
3. If the app is profitable after six months, build an iOS version of the app and launch it in the Apple App Store.

The first step will be to use our newly cleaned datasets to determine which types of apps are the most common and the most popular, since for an ad-supported app popularity is our proxy for profitability.

In [23]:
and_size = explore_data(android_clean_english_free, 0, 5, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


['Paper flowers instructions', 'ART_AND_DESIGN', '4.4', '167', '5.6M', '50,000+', 'Free', '0', 'Everyone', 'Art & Design', 'March 26, 2017', '1.0', '2.3 and up']




In [24]:
and_size

[8864, 13]

In [25]:
ios_size = explore_data(ios_clean_english_free, 0, 5, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']




In [26]:
ios_size

[3222, 16]

#### Google Play Store
For Android apps, we can use the `'Genres'` column (index=9) or the `'Category'` column (index=1) to determine which genres are most common. Note that an app can have more than one genre in the `'Genres'` column: as seen above, 'Pixel Draw - Number Art Coloring Book' falls into two genres, 'Art & Design' and 'Creativity'.

#### Apple App Store
For iOS apps, we can use the `'prime_genre'` column (index=11) to determine which genres are most common.
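
As a compact illustration of a percentage frequency table over one column, here is a sketch built on `collections.Counter` from the standard library. The helper name `percentage_table` and the sample rows are made up for this example.

```python
from collections import Counter


def percentage_table(rows, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields values from most to least frequent
    return {key: count / total * 100 for key, count in counts.most_common()}


# Made-up sample data: three 'Games' rows and one 'Music' row
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]

table = percentage_table(sample_rows, 1)
print(table)  # {'Games': 75.0, 'Music': 25.0}
```

Because the dictionary is built in `most_common()` order, iterating over it already gives the values in descending order of frequency.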

In [27]:
# function takes a dataset (2D array) and an index (integer) and returns
# a dictionary mapping each value in that column to its percentage of rows
def freq_table(dataset, index):
    freq_dict = {}
    total = 0
    
    for row in dataset:
        key = row[index]
        total += 1
        
        if key in freq_dict:
            freq_dict[key] += 1
        else:
            freq_dict[key] = 1
    
    percentage_dict = {}
    
    for key in freq_dict:
        percentage = (freq_dict[key] / total) * 100
        percentage_dict[key] = percentage
    
    return percentage_dict   

In [28]:
# function takes a dataset (2D array) and an index (integer)
# and prints the entries found by the freq_table function (above) in descending order
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [29]:
# Android 'Category' 
display_table(android_clean_english_free, 1)

FAMILY : 18.896660649819495
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.91471119133574
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 0.

In [30]:
# Android 'Genres'
display_table(android_clean_english_free, 9)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Lifestyle : 3.9034296028880866
Productivity : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.925090252707

In [31]:
# iOS 'prime_genre' 
display_table(ios_clean_english_free, 11)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


#### Apple App Store (for free, English apps)
* Most common genre: Games (58.2%)
* Second most common genre: Entertainment (7.9%)
* Based on the data above, most apps appear to be used for entertainment purposes.

#### Google Play Store (for free, English apps)
* Most common category: Family (18.9%)
* Second most common category: Games (9.7%)
* After the top two categories, many of the remaining apps appear to be business or productivity related.
* Given that the games category is high on the list (as on iOS), it makes sense for our developers to build a game app.
* The `'Genres'` column allows multiple genres to be listed for one app. Because our function does not separate them and instead counts each combination as a new genre, it is hard to get a realistic sense of how well individual genres are represented.
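
One possible way around the combined-genre problem is to split each `'Genres'` value before counting. The sketch below assumes that `;` is the separator, as in 'Art & Design;Creativity'. The helper name `genre_percentages` and the sample rows are hypothetical. Since one app can count towards several genres, the percentages may sum to more than 100.

```python
from collections import Counter


def genre_percentages(rows, genres_index, sep=';'):
    """Share of apps (in percent) that carry each individual genre label."""
    counts = Counter()
    for row in rows:
        # A set ensures a label repeated within one row is counted once per app
        labels = {label.strip() for label in row[genres_index].split(sep)}
        for label in labels:
            counts[label] += 1
    n_apps = len(rows)
    return {label: count / n_apps * 100 for label, count in counts.items()}


# Made-up sample data modelled on the 'Genres' values shown earlier
sample_rows = [
    ['App A', 'Art & Design'],
    ['App B', 'Art & Design;Creativity'],
    ['App C', 'Art & Design;Pretend Play'],
    ['App D', 'Tools'],
]

shares = genre_percentages(sample_rows, 1)
for label, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(label, ':', share)
```

Here 'Art & Design' is credited with 75% of the apps, whereas counting whole strings would credit it with only 25%.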


In [34]:
ios_genre_freq = freq_table(ios_clean_english_free, 11)

for genre in ios_genre_freq:
    total = 0 # store the number of user ratings
    len_genre = 0 # store the number of apps specific to each genre
    
    for row in ios_clean_english_free:
        genre_app = row[11]
        
        if genre_app == genre:
            total += float(row[5])
            len_genre += 1
            
    ave_n_rating = total / len_genre
    print(genre, ':', int(ave_n_rating))   

Games : 22788
Productivity : 21028
Food & Drink : 33333
Finance : 31467
Education : 7003
Sports : 23008
Travel : 28243
Business : 7491
Lifestyle : 16485
Shopping : 26919
Navigation : 86090
Book : 39758
Music : 57326
Social Networking : 71548
Reference : 74942
Photo & Video : 28441
News : 21248
Catalogs : 4004
Health & Fitness : 23298
Weather : 52279
Medical : 612
Utilities : 18684
Entertainment : 14029


In [35]:
android_cat_freq = freq_table(android_clean_english_free, 1)

for cat in android_cat_freq:
    total = 0 # store the total number of installs
    len_cat = 0 # store the number of apps specific to each category
    
    for row in android_clean_english_free:
        cat_app = row[1]
        
        if cat_app == cat:
            n_installs = row[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            total += int(n_installs)
            len_cat += 1
            
    ave_n_installs = total / len_cat
    print(cat, ':', int(ave_n_installs)) 

SOCIAL : 23253652
BOOKS_AND_REFERENCE : 8767811
PARENTING : 542603
HOUSE_AND_HOME : 1331540
EVENTS : 253542
LIFESTYLE : 1433675
FINANCE : 1387692
LIBRARIES_AND_DEMO : 638503
WEATHER : 5074486
PRODUCTIVITY : 16787331
EDUCATION : 1833495
MEDICAL : 120550
AUTO_AND_VEHICLES : 647317
PHOTOGRAPHY : 17840110
TRAVEL_AND_LOCAL : 13984077
FAMILY : 3697848
HEALTH_AND_FITNESS : 4188821
ART_AND_DESIGN : 1986335
FOOD_AND_DRINK : 1924897
COMICS : 817657
COMMUNICATION : 38456119
SPORTS : 3638640
NEWS_AND_MAGAZINES : 9549178
PERSONALIZATION : 5201482
DATING : 854028
BEAUTY : 513151
BUSINESS : 1712290
VIDEO_PLAYERS : 24727872
MAPS_AND_NAVIGATION : 4056941
TOOLS : 10801391
ENTERTAINMENT : 11640705
SHOPPING : 7036877
GAME : 15588015


## Conclusion and Recommendation
Based on our analysis, we recommend building a Productivity or Book/Reference application. These may not be among the most popular genres, but creating a popular gaming or social media app would be extremely challenging. Choosing a less crowded genre gives us the opportunity to grow more quickly and