# App Popularity Analysis

In this project, our goal is to identify profitable mobile app profiles for the Apple App Store and Google Play. We play the role of a data analyst at a company that builds iOS and Android apps, and our job is to help the development team make data-driven decisions about which kinds of apps to build.

Our company builds only free apps, and its main source of revenue is in-app advertising. The aim of this project is therefore to analyze the data and understand which types of apps are likely to attract the most users, since more users means more people seeing and interacting with ads. These findings will guide our development strategy.

## 1. Exploration

In [1]:
from csv import reader

# open and read Apple and Android app data from
# https://dq-content.s3.amazonaws.com/350/AppleStore.csv
# and https://dq-content.s3.amazonaws.com/350/googleplaystore.csv

# app names contain non-ASCII characters, so set the encoding explicitly
with open('AppleStore.csv', encoding='utf8') as f:
    read_file = reader(f)
    apple = list(read_file)
    apple_header = apple[0]
    apple = apple[1:]
    
with open('googleplaystore.csv', encoding='utf8') as f:
    read_file = reader(f)
    android = list(read_file)
    android_header = android[0]
    android = android[1:]

In [2]:
# define data exploration function

def explore_data(dataset, header, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') 
    
    print(header)
    print('\n')
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
# explore Apple data

explore_data(apple, apple_header, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows: 7197
Number of columns: 16


In [4]:
# explore Android data

explore_data(android, android_header, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows: 10841
Number of columns: 13


## 2. Cleaning

First, check for records with missing values and delete them. We flag a row as incomplete when its number of columns differs from the header's.

In [5]:
def delete_missing_data(dataset, header, dataset_name):
    # keep only rows whose column count matches the header;
    # a mismatch means a value is missing and the row cannot be trusted
    dataset_clean = [row for row in dataset if len(row) == len(header)]
        
    print(f'Number of records in dataset {dataset_name}: {len(dataset)}')    
    print(f'Number of records in clean dataset (without missing values): {len(dataset_clean)}')
    
    return dataset_clean
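
To illustrate what the length check catches, here is a minimal sketch on made-up rows. The `header` and `rows` values are hypothetical and are not taken from either dataset; they only show how a row that lost a field ends up shorter than the header.

```python
# Hypothetical toy data: the second row is missing its 'Category' value,
# so it has fewer columns than the header.
header = ['App', 'Category', 'Rating']
rows = [
    ['Sketch Pad', 'ART_AND_DESIGN', '4.1'],
    ['Broken Row', '1.9'],
    ['Chat Now', 'COMMUNICATION', '4.4'],
]

# Collect (index, row) pairs whose column count differs from the header
malformed = [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

# Keep only the rows that match the header length
clean_rows = [row for row in rows if len(row) == len(header)]

print(malformed)
print(len(clean_rows))
```

Printing the malformed rows before deleting them makes it easy to confirm that the check removes only broken records.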

Next, check for duplicate app records and keep only the record with the highest number of reviews.

#### Check for duplicated records

In [6]:
# check for duplicate records in each dataset

# create empty lists for unique and duplicate app records
# android
app_list = []
dup_app_list = []

for row in android:
    name = row[0]

    if name in app_list:
        dup_app_list.append(name)
    else:
        app_list.append(name)
    
print(f'Number of Android unique apps: {len(app_list)}')
print(f'Number of Android duplicate apps: {len(dup_app_list)}')

# apple (column 0 holds the app id, which identifies each record)
app_list = []
dup_app_list = []

for row in apple:
    app_id = row[0]
    
    if app_id in app_list:
        dup_app_list.append(app_id)
    else:
        app_list.append(app_id)

print(f'Number of Apple unique apps: {len(app_list)}')
print(f'Number of Apple duplicate apps: {len(dup_app_list)}')


Number of Android unique apps: 9660
Number of Android duplicate apps: 1181
Number of Apple unique apps: 7197
Number of Apple duplicate apps: 0


#### Define function to remove duplicate records, keeping one record per app

In [7]:
def remove_dup_records(dataset, dataset_name):
    # since apple has no duplicates, we will not modify this dataset
    if dataset_name == 'apple':
        return dataset

    # store max number of reviews per app name in dict
    reviews_max = {}
    for row in dataset:
        name = row[0]
        n_reviews = float(row[3])

        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # keep one row per app: the first row whose review count equals the max
    dataset_clean = []
    already_added = set()

    for row in dataset:
        name = row[0]
        n_reviews = float(row[3])

        if reviews_max[name] == n_reviews and name not in already_added:
            dataset_clean.append(row)
            already_added.add(name)

    print(f'Number of duplicate records deleted {dataset_name}: {len(dataset) - len(dataset_clean)}')
    print(f'Number of records in clean dataset (unique) {dataset_name}: {len(dataset_clean)}')

    return dataset_clean
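
The "keep the most-reviewed record" rule can also be written as a single pass. The sketch below is an alternative illustration, not the function used in this notebook: `keep_most_reviewed` and the toy rows are hypothetical, and the column positions are passed in as parameters.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=1):
    """Return one row per app name, keeping the row with the most reviews."""
    best = {}
    for row in rows:
        name = row[name_idx]
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical toy rows: [app name, review count]
toy = [
    ['Chat Now', '100'],
    ['Photo Fun', '50'],
    ['Chat Now', '250'],
]

deduped = keep_most_reviewed(toy)
print(deduped)
```

Here the second `Chat Now` row wins because it has 250 reviews, so a presumably more recent record replaces the stale one.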

After removing duplicates, remove apps whose names are not in English.

#### Define function to check if the name of an app is in English

In [8]:
def check_language(string):
    # initialize counter
    count = 0
    
    # count number of non-ASCII characters
    for char in string:
        # 127 is the highest code point in ASCII, which covers standard English text
        if ord(char) > 127:
            count += 1
            
    # allow up to 3 non-ASCII characters (emoji, ™, etc.);
    # more than 3 means the name is treated as non-English
    return count <= 3
            

#### Define function to filter dataset to only English apps

In [9]:
def filter_language(dataset, dataset_name):
    # create empty list to hold only English apps
    dataset_clean = []
    
    # loop through each app in android, if app is English add to cleaned dataset list
    if dataset_name == 'android':
        for row in dataset:
            name = row[0]
            check = check_language(name)
            if check:
                dataset_clean.append(row)
            else:
                pass
            
    # loop through each app in apple, if app is English add to cleaned dataset list
    elif dataset_name == 'apple':
        for row in dataset:
            name = row[1]
            check = check_language(name)
            if check:
                dataset_clean.append(row)
            else:
                pass
        
    print(f'Number of non-English records deleted {dataset_name}: {len(dataset) - len(dataset_clean)}')
    print(f'Number of records in clean dataset (English) {dataset_name}: {len(dataset_clean)}')   
    
    return dataset_clean

Finally, after removing non-English apps, remove non-free apps

#### Define function to filter only free apps

In [10]:
def filter_free(dataset, dataset_name):
    # create empty list to hold only free apps
    dataset_clean = []
    
    # loop through each app in android, if app is free add to cleaned dataset list
    if dataset_name == 'android':
        for row in dataset:
            price = row[7]
            if price == '0':
                dataset_clean.append(row)
                
    # loop through each app in apple, if app is free add to cleaned dataset list         
    if dataset_name == 'apple':
        for row in dataset:
            price = row[4]
            if price == '0.0':
                dataset_clean.append(row)
                
                
    print(f'Number of non-free records deleted {dataset_name}: {len(dataset) - len(dataset_clean)}')
    print(f'Number of records in clean dataset (free) {dataset_name}: {len(dataset_clean)}')   
    
    return dataset_clean

#### Combine all cleaning steps in one cleaning function and call the function on the raw datasets

In [11]:
def all_cleaning(dataset, header, dataset_name):
    dataset = delete_missing_data(dataset, header, dataset_name)
    dataset = remove_dup_records(dataset, dataset_name)
    dataset = filter_language(dataset, dataset_name)
    dataset = filter_free(dataset, dataset_name)
    return dataset

android = all_cleaning(android, android_header, 'android')
apple = all_cleaning(apple, apple_header, 'apple')

Number of records in dataset android: 10841
Number of records in clean dataset (without missing values): 10840
Number of duplicate records deleted android: 1181
Number of records in clean dataset (unique) android: 9659
Number of non-English records deleted android: 45
Number of records in clean dataset (English) android: 9614
Number of non-free records deleted android: 752
Number of records in clean dataset (free) android: 8862
Number of records in dataset apple: 7197
Number of records in clean dataset (without missing values): 7197
Number of non-English records deleted apple: 1014
Number of records in clean dataset (English) apple: 6183
Number of non-free records deleted apple: 2961
Number of records in clean dataset (free) apple: 3222


In [12]:
android

[['Photo Editor & Candy Camera & Grid & ScrapBook',
  'ART_AND_DESIGN',
  '4.1',
  '159',
  '19M',
  '10,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'January 7, 2018',
  '1.0.0',
  '4.0.3 and up'],
 ['Coloring book moana',
  'ART_AND_DESIGN',
  '3.9',
  '967',
  '14M',
  '500,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design;Pretend Play',
  'January 15, 2018',
  '2.0.0',
  '4.0.3 and up'],
 ['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
  'ART_AND_DESIGN',
  '4.7',
  '87510',
  '8.7M',
  '5,000,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design',
  'August 1, 2018',
  '1.2.4',
  '4.0.3 and up'],
 ['Sketch - Draw & Paint',
  'ART_AND_DESIGN',
  '4.5',
  '215644',
  '25M',
  '50,000,000+',
  'Free',
  '0',
  'Teen',
  'Art & Design',
  'June 8, 2018',
  'Varies with device',
  '4.2 and up'],
 ['Pixel Draw - Number Art Coloring Book',
  'ART_AND_DESIGN',
  '4.3',
  '967',
  '2.8M',
  '100,000+',
  'Free',
  '0',
  'Everyone',
  'Art & Design;Creativity',
  'J

## 3. Analysis

As mentioned, the goal is to determine the types of apps which are likely to attract more users, and therefore generate more revenue from in-app advertisements.

To minimize risk and overhead, the validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to publish the app on both Google Play and the App Store, we need to find app profiles that are successful in both markets.

The first step will be to identify the most common genres for each market.

#### Identify columns of interest and indexes - Category and Genres for Android, prime_genre for Apple

In [13]:
print(android_header)
print('\n')
print(apple_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


#### Define function to create freq tables

In [14]:
def freq_table(dataset, index):
    # create empty dict to hold freqs
    freqs = {}
    
    # loop through apps and populate dictionary with freqs
    for row in dataset:
        if row[index] in freqs:
            freqs[row[index]] += 1
        else:
            freqs[row[index]] = 1
            
    # convert raw counts to proportions of the dataset, rounded to 2 decimals
    for key in freqs:
        freqs[key] = round(freqs[key] / len(dataset), 2)
    
    return freqs

#### Define function to display sorted freq tables

In [15]:
def display_table(dataset, index):
    
    #call freq_table() function
    table = freq_table(dataset, index)
    
    # create empty list to display table
    table_display = []
    
    # add tuple of key val pairs to display table
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    # sort the display table from highest to lowest freqs
    table_sorted = sorted(table_display, reverse = True)
    
    #print the keys and freqs from the sorted display table
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
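
For comparison, the same frequency table can be built with `collections.Counter` from the standard library. This is only a sketch of an alternative: `freq_table_counter` and the toy rows are hypothetical names, and the rest of the notebook keeps using the functions defined above.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: share of rows}, ordered from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    # most_common() yields (value, count) pairs sorted by descending count
    return {value: round(count / total, 2) for value, count in counts.most_common()}

# Hypothetical toy rows: [app name, genre]
toy = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]

table = freq_table_counter(toy, 1)
print(table)
```

Because `most_common()` already sorts by count, the separate tuple-and-sort step in `display_table()` is not needed with this approach.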

What are the most common app genres/categories? 

In [16]:
display_table(apple, 11)

Games : 0.58
Entertainment : 0.08
Photo & Video : 0.05
Education : 0.04
Utilities : 0.03
Social Networking : 0.03
Shopping : 0.03
Sports : 0.02
Productivity : 0.02
Music : 0.02
Lifestyle : 0.02
Health & Fitness : 0.02
Weather : 0.01
Travel : 0.01
Reference : 0.01
News : 0.01
Food & Drink : 0.01
Finance : 0.01
Business : 0.01
Navigation : 0.0
Medical : 0.0
Catalogs : 0.0
Book : 0.0


Among the free English apps in the App Store, Games dominate with 58% of all apps. After a large gap, Entertainment ranks second at 8%, and Photo & Video is third at 5%. Note that this measures how many apps exist in each genre, not how many users they have.

In [17]:
display_table(android, 1)

FAMILY : 0.18
GAME : 0.1
TOOLS : 0.08
BUSINESS : 0.05
PRODUCTIVITY : 0.04
MEDICAL : 0.04
LIFESTYLE : 0.04
FINANCE : 0.04
SPORTS : 0.03
SOCIAL : 0.03
PHOTOGRAPHY : 0.03
PERSONALIZATION : 0.03
NEWS_AND_MAGAZINES : 0.03
HEALTH_AND_FITNESS : 0.03
COMMUNICATION : 0.03
VIDEO_PLAYERS : 0.02
TRAVEL_AND_LOCAL : 0.02
SHOPPING : 0.02
DATING : 0.02
BOOKS_AND_REFERENCE : 0.02
WEATHER : 0.01
PARENTING : 0.01
MAPS_AND_NAVIGATION : 0.01
LIBRARIES_AND_DEMO : 0.01
HOUSE_AND_HOME : 0.01
FOOD_AND_DRINK : 0.01
EVENTS : 0.01
ENTERTAINMENT : 0.01
EDUCATION : 0.01
COMICS : 0.01
BEAUTY : 0.01
AUTO_AND_VEHICLES : 0.01
ART_AND_DESIGN : 0.01


In [18]:
display_table(android, 9)

Tools : 0.08
Entertainment : 0.06
Education : 0.05
Business : 0.05
Productivity : 0.04
Medical : 0.04
Lifestyle : 0.04
Finance : 0.04
Sports : 0.03
Social : 0.03
Photography : 0.03
Personalization : 0.03
News & Magazines : 0.03
Health & Fitness : 0.03
Communication : 0.03
Action : 0.03
Video Players & Editors : 0.02
Travel & Local : 0.02
Simulation : 0.02
Shopping : 0.02
Dating : 0.02
Casual : 0.02
Books & Reference : 0.02
Arcade : 0.02
Weather : 0.01
Strategy : 0.01
Role Playing : 0.01
Racing : 0.01
Puzzle : 0.01
Maps & Navigation : 0.01
Libraries & Demo : 0.01
House & Home : 0.01
Food & Drink : 0.01
Events : 0.01
Comics : 0.01
Beauty : 0.01
Auto & Vehicles : 0.01
Art & Design : 0.01
Adventure : 0.01
Word : 0.0
Video Players & Editors;Music & Video : 0.0
Video Players & Editors;Creativity : 0.0
Trivia;Education : 0.0
Trivia : 0.0
Travel & Local;Action & Adventure : 0.0
Tools;Education : 0.0
Strategy;Education : 0.0
Strategy;Creativity : 0.0
Strategy;Action & Adventure : 0.0
Sports;Act

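On Google Play, the picture is more balanced. By `Category`, Family (18%), Game (10%), and Tools (8%) are the most common; by `Genres`, Tools (8%) and Entertainment (6%) lead. No single category leads by a large margin, and there is a fairly even mix of entertainment and productivity apps.

Next, we'll estimate how popular each genre is with users. Frequency only tells us how many apps exist in a genre, not how many people use them. The App Store data has no install counts, so we use the average number of user ratings per app (`rating_count_tot`) as a proxy.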

In [19]:
apple_freq_table = freq_table(apple, 11)

apple_app_popularity = []

for genre in apple_freq_table:
    total = 0
    len_genre = 0
    for row in apple:
        genre_app = row[11]
        if genre_app == genre:
            rating_count = float(row[5])
            total += rating_count
            len_genre += 1
    average_ratings = int(total / len_genre)
    apple_app_popularity.append((genre, average_ratings))

sorted(apple_app_popularity, key = lambda x: x[1], reverse = True)

[('Navigation', 86090),
 ('Reference', 74942),
 ('Social Networking', 71548),
 ('Music', 57326),
 ('Weather', 52279),
 ('Book', 39758),
 ('Food & Drink', 33333),
 ('Finance', 31467),
 ('Photo & Video', 28441),
 ('Travel', 28243),
 ('Shopping', 26919),
 ('Health & Fitness', 23298),
 ('Sports', 23008),
 ('Games', 22788),
 ('News', 21248),
 ('Productivity', 21028),
 ('Utilities', 18684),
 ('Lifestyle', 16485),
 ('Entertainment', 14029),
 ('Business', 7491),
 ('Education', 7003),
 ('Catalogs', 4004),
 ('Medical', 612)]

We can now see the most popular App Store genres by average number of user ratings per app. This doesn't tell the full story, however. The averages for genres such as Navigation and Social Networking are likely inflated by a few very popular apps, and it may not make sense to try to compete with those.

In [20]:
nav_dict = {}

for row in apple:
    if row[11] == 'Navigation':
        nav_dict[row[1]] = float(row[5])
        
social_network_dict = {}

for row in apple:
    if row[11] == 'Social Networking':
        social_network_dict[row[1]] = float(row[5])
    
reference_dict = {}

for row in apple:
    if row[11] == 'Reference':
        reference_dict[row[1]] = float(row[5])    

    
print(nav_dict)
print('\n')
print(social_network_dict)
print('\n')
print(reference_dict)

{'Waze - GPS Navigation, Maps & Real-time Traffic': 345046.0, 'Google Maps - Navigation & Transit': 154911.0, 'Geocaching®': 12811.0, 'CoPilot GPS – Car Navigation & Offline Maps': 3582.0, 'ImmobilienScout24: Real Estate Search in Germany': 187.0, 'Railway Route Search': 5.0}


{'Facebook': 2974676.0, 'Pinterest': 1061624.0, 'Skype for iPhone': 373519.0, 'Messenger': 351466.0, 'Tumblr': 334293.0, 'WhatsApp Messenger': 287589.0, 'Kik': 260965.0, 'ooVoo – Free Video Call, Text and Voice': 177501.0, 'TextNow - Unlimited Text + Calls': 164963.0, 'Viber Messenger – Text & Call': 164249.0, 'Followers - Social Analytics For Instagram': 112778.0, 'MeetMe - Chat and Meet New People': 97072.0, 'We Heart It - Fashion, wallpapers, quotes, tattoos': 90414.0, 'InsTrack for Instagram - Analytics Plus More': 85535.0, 'Tango - Free Video Call, Voice and Chat': 75412.0, 'LinkedIn': 71856.0, 'Match™ - #1 Dating App.': 60659.0, 'Skype for iPad': 60163.0, 'POF - Best Dating App for Conversations': 52642.0,

Navigation does not look promising: Waze and Google Maps account for nearly all of its ratings, so the market leaders are clearly established. Social Networking is also dominated by a few apps, but it includes some niche audiences, so it may be a viable option. The Reference genre also has some room for apps designed for niche audiences.
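
One quick way to check how skewed a genre is: compare the mean with the median and look at the share held by the top apps. The sketch below uses the Navigation rating counts printed above; the variable names are illustrative.

```python
from statistics import mean, median

# Rating counts for the Navigation genre, copied from the output above
nav_ratings = {
    'Waze - GPS Navigation, Maps & Real-time Traffic': 345046,
    'Google Maps - Navigation & Transit': 154911,
    'Geocaching®': 12811,
    'CoPilot GPS – Car Navigation & Offline Maps': 3582,
    'ImmobilienScout24: Real Estate Search in Germany': 187,
    'Railway Route Search': 5,
}

counts = sorted(nav_ratings.values(), reverse=True)

avg = mean(counts)                               # pulled up by the two largest apps
mid = median(counts)                             # what a typical app in the genre gets
top_two_share = sum(counts[:2]) / sum(counts)    # share of ratings held by the top two

print(int(avg), mid, round(top_two_share, 2))
```

The mean matches the 86090 reported for Navigation, while the median is about 8,200 and the top two apps hold roughly 97% of all ratings. A gap that large suggests the genre average mostly describes Waze and Google Maps rather than a typical navigation app.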

Finally, we look at the average number of installs per category for Android apps.

In [21]:
android_freq_table = freq_table(android, 1)

android_app_popularity = []

for category in android_freq_table:
    total = 0
    len_category = 0
    for row in android:
        category_app = row[1]
        if category_app == category:
            installs = float(row[5].replace('+', '').replace(',',''))
            total += installs
            len_category += 1
    avg_install = int(total / len_category)
    android_app_popularity.append((category, avg_install))
    
sorted(android_app_popularity, key = lambda x: x[1], reverse = True)


[('COMMUNICATION', 38456119),
 ('VIDEO_PLAYERS', 24852732),
 ('SOCIAL', 23253652),
 ('ENTERTAINMENT', 21134600),
 ('PHOTOGRAPHY', 17805627),
 ('PRODUCTIVITY', 16787331),
 ('GAME', 15837565),
 ('TRAVEL_AND_LOCAL', 13984077),
 ('TOOLS', 10695245),
 ('NEWS_AND_MAGAZINES', 9549178),
 ('BOOKS_AND_REFERENCE', 8767811),
 ('SHOPPING', 7036877),
 ('PERSONALIZATION', 5201482),
 ('WEATHER', 5074486),
 ('HEALTH_AND_FITNESS', 4188821),
 ('MAPS_AND_NAVIGATION', 4056941),
 ('SPORTS', 3638640),
 ('EDUCATION', 3082017),
 ('FAMILY', 2691618),
 ('FOOD_AND_DRINK', 1924897),
 ('ART_AND_DESIGN', 1905351),
 ('BUSINESS', 1712290),
 ('LIFESTYLE', 1437816),
 ('FINANCE', 1387692),
 ('HOUSE_AND_HOME', 1313681),
 ('DATING', 854028),
 ('COMICS', 817657),
 ('AUTO_AND_VEHICLES', 647317),
 ('LIBRARIES_AND_DEMO', 638503),
 ('PARENTING', 542603),
 ('BEAUTY', 513151),
 ('EVENTS', 253542),
 ('MEDICAL', 120616)]
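
The install counts in the Google Play data are open-ended buckets such as `10,000+`, so the averages above are best read as lower-bound estimates rather than exact figures. The conversion used in the loop can be pulled out into a small helper; `parse_installs` is a hypothetical name introduced here for illustration.

```python
def parse_installs(text):
    """Convert an Installs string such as '10,000+' to a float lower bound."""
    return float(text.replace(',', '').replace('+', ''))

# Sample values taken from the first three Android rows shown earlier
samples = ['10,000+', '500,000+', '5,000,000+']

parsed = [parse_installs(s) for s in samples]
print(parsed)
```

Keeping the parsing in one place avoids repeating the `replace()` chain every time the `Installs` column is read.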

Analyzing apps in a few of these categories in the same manner:

In [22]:
comm_dict = {}

for row in android:
    if row[1] == 'COMMUNICATION':
        comm_dict[row[0]] = float(row[5].replace('+', '').replace(',',''))
 
entertainment_dict = {}

for row in android:
    if row[1] == 'ENTERTAINMENT':
        entertainment_dict[row[0]] = float(row[5].replace('+', '').replace(',',''))
        
books_dict = {}

for row in android:
    if row[1] == 'BOOKS_AND_REFERENCE':
        books_dict[row[0]] = float(row[5].replace('+', '').replace(',',''))
    
print(comm_dict)
print('\n')
print(entertainment_dict)
print('\n')
print(books_dict)


{'Messenger – Text and Video Chat for Free': 1000000000.0, 'WhatsApp Messenger': 1000000000.0, 'Messenger for SMS': 10000000.0, 'Google Chrome: Fast & Secure': 1000000000.0, 'Messenger Lite: Free Calls & Messages': 100000000.0, 'Gmail': 1000000000.0, 'Hangouts': 1000000000.0, 'Viber Messenger': 500000000.0, 'My Tele2': 5000000.0, 'Firefox Browser fast & private': 100000000.0, 'Yahoo Mail – Stay Organized': 100000000.0, 'imo beta free calls and text': 100000000.0, 'imo free video calls and chat': 500000000.0, 'Contacts': 50000000.0, 'Call Free – Free Call': 5000000.0, 'Web Browser & Explorer': 5000000.0, 'Opera Mini - fast web browser': 100000000.0, 'Browser 4G': 10000000.0, 'MegaFon Dashboard': 10000000.0, 'ZenUI Dialer & Contacts': 10000000.0, 'Cricket Visual Voicemail': 10000000.0, 'Opera Browser: Fast and Secure': 100000000.0, 'TracFone My Account': 1000000.0, 'Firefox Focus: The privacy browser': 1000000.0, 'Google Voice': 10000000.0, 'Chrome Dev': 5000000.0, 'Xperia Link™': 100000

There may be room for new entertainment apps, since the category covers a very wide range of interests. We could find a niche here.

Book and reference apps also do well on the App Store, where Reference has the second-highest average number of ratings. This could be our starting point. Is there a market for a reference guide on a specific topic that has not yet been saturated with alternatives?

## 4. Conclusion

We cleaned app data from both Google Play and the App Store, analyzed it to identify which app categories are most popular with users, and dug a bit deeper to examine which categories are already dominated by a few key players and which may give us an opportunity to build a distinctive app that serves a specific audience or covers a specific topic.