# Profitable App Profiles for the App Store and Google Play Markets

When working in a company that designs free mobile applications, it is essential to target the right kind of application and the right market.

The aim of this project is to analyse the mobile applications available on the market and extract key information to support the marketing strategy.


## Opening and exploring data

We gathered two data sets containing information about App Store and Google Play applications:
- `googleplaystore.csv`
- `AppleStore.csv`

Let's explore them both:

In [1]:
from csv import reader

# Open a CSV file and return its content as a list of rows
def open_data_set(filename):
    # 'with' closes the file once it has been read
    with open(filename, encoding='utf8') as file:
        return list(reader(file))

# Quickly explore a data set (given as a list of lists)
def explore_data_set(data_set, start, end, rows_and_column=False):
    dataset_slice=data_set[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_and_column:
        print("# Columns :", len(data_set[0]))
        print("# Rows :", len(data_set))  
        print("header :", data_set[0])

### Apple store data

In [2]:
# Apple store data
apple_apps_data=open_data_set('AppleStore.csv')

explore_data_set(apple_apps_data, 1, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


# Columns : 16
# Rows : 7198
header : ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


### Google play data

In [4]:
# Google play data
gplay_apps_data=open_data_set('googleplaystore.csv')

explore_data_set(gplay_apps_data, 1, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


# Columns : 13
# Rows : 10842
header : ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


As we are marketing free applications to an English-speaking audience, we will remove the non-free and non-English applications.

## Cleaning the data

As in every data science project, we need to make sure our data are clean, and the first step is to check whether or not there are duplicate rows in our data set.

### Removing duplicates

In [5]:
explore_data_set(gplay_apps_data, 10472, 10475)    # Row 10473 is missing its category, so we will remove it

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [6]:
del gplay_apps_data[10473]
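
A faulty row like this one can also be located programmatically rather than by a hard-coded index. The sketch below runs on a small hypothetical sample, not on the real file, and flags every row whose number of fields differs from the header's. `find_malformed_rows` is an illustrative helper name that is not part of the original notebook.

```python
def find_malformed_rows(data_set):
    """Return (index, row) pairs whose length differs from the header row."""
    expected = len(data_set[0])
    return [(i, row) for i, row in enumerate(data_set) if len(row) != expected]

# Hypothetical miniature data set: a header plus three rows, one of them short.
sample = [
    ['App', 'Category', 'Rating'],
    ['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9'],  # 'Category' is missing
    ['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2'],
]

malformed = find_malformed_rows(sample)
for index, row in malformed:
    print(index, row)
```

Comparing each row against the header length only catches missing or extra fields; it would not detect a value placed in the wrong column.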

The data set appears to contain duplicates; we need to examine them to decide which rows to remove.

In [7]:
unique_names=[]
duplicate_names=[]
for row in gplay_apps_data[1:]:
    name = row[0]
    if name in unique_names:
        duplicate_names.append(name)
    else:
        unique_names.append(name)
        
print("Example of duplicates :\n", duplicate_names[:15])
print("\nExample of unique names :\n", unique_names[:15])
print("\n\nNumber of duplicates :", len(duplicate_names))
print("Number of unique applications :", len(unique_names))

Example of duplicates :
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']

Example of unique names :
 ['Photo Editor & Candy Camera & Grid & ScrapBook', 'Coloring book moana', 'U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'Sketch - Draw & Paint', 'Pixel Draw - Number Art Coloring Book', 'Paper flowers instructions', 'Smoke Effect Photo Maker - Smoke Editor', 'Infinite Painter', 'Garden Coloring Book', 'Kids Paint Free - Drawing Fun', 'Text on Photo - Fonteee', 'Name Art Photo Editor - Focus n Filters', 'Tattoo Name On My Photo Editor', 'Mandala Coloring Book', '3D Color Pixel by Number - Sandbox Art Coloring']


Number of duplicates : 1181
Number of unique applications : 9659


We can see that around 10% of our data consists of duplicates. To analyse the data properly, we need to clean it up. But first, let's take a closer look at one duplicated application to determine which entry we should keep.
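
The "around 10%" figure can be checked from the two counts printed by the previous cell. This is plain arithmetic on those reported numbers; the variable names are only illustrative.

```python
# Counts reported by the previous cell.
n_duplicates = 1181
n_unique = 9659

n_rows = n_duplicates + n_unique          # data rows, header excluded
duplicate_share = n_duplicates / n_rows

print(f"{n_rows} rows, {duplicate_share:.1%} duplicates")
# prints: 10840 rows, 10.9% duplicates
```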

In [8]:
n_show=5
instagram_rows=[]
for row in gplay_apps_data:
    if row[0] == "Instagram":
        instagram_rows.append(row)
        # Print at most n_show+1 matching rows
        if n_show>=0:
            print(row)
            n_show-=1

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


After examination, we can see that the duplicate rows differ in only one field: the number of reviews.
As this number can only increase over time, we keep, for each application, the row with the highest number of reviews.

In [9]:
apps_and_reviews={}
for row in gplay_apps_data[1:]:
    name = row[0]
    review=float(row[3])
    if name not in apps_and_reviews or review > apps_and_reviews[name]:
        apps_and_reviews[name]=review
        
print(len(apps_and_reviews))

9659


In [10]:
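android_clean=[]
for row in gplay_apps_data[1:]:
    name = row[0]
    review=float(row[3])
    # Keep the first row that reaches the maximum review count for this app
    if review == apps_and_reviews[name]:
        android_clean.append(row)
        # Mark the app as added so that rows tied at the maximum are skipped
        apps_and_reviews[name]=-1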
        
print(len(android_clean))

9659
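
The two passes above can also be condensed into one by storing the best row seen so far for each name. The sketch below is an alternative under that assumption, run on hypothetical toy rows: the Instagram review counts come from the rows printed earlier, while the `Box` values are made up. `keep_most_reviewed` is an illustrative name, not part of the original notebook.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when counts tie.
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy rows (name, category, rating, reviews); the 'Box' values are invented.
toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
    ['Box', 'BUSINESS', '4.2', '159872'],
]

for row in keep_most_reviewed(toy_rows):
    print(row)
```

One difference from the two-pass version: the result here is ordered by the first appearance of each name, not by the position of the retained row.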


We have now removed the duplicate rows from the Android data set. Does the same issue arise in the Apple data set?

In [12]:
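# Return True when column 0 holds no repeated value, False otherwise.
# Column 0 is the app name in the Google Play data and the app id in the Apple data.
# The first row is skipped as a header (android_clean has none, so its first app is not checked).
def has_no_duplicate(data_set):
    seen=set()
    for row in data_set[1:]:
        name = row[0]
        if name in seen:
            return False
        seen.add(name)
    return True

print(has_no_duplicate(android_clean))
print(has_no_duplicate(gplay_apps_data))
print(has_no_duplicate(apple_apps_data))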

True
False
True


The function defined above returns `True` when a data set contains no duplicate. It shows that neither `android_clean` nor the Apple data set (where the check is made on the `id` column) has duplicates, whereas the raw Google Play data does, as we saw above. Let us move on to removing non-English applications.

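### Non-English applications

As our market study focuses on English-language applications, we will remove the others. First, we need to detect whether an application name is in English. The function below treats a name as English unless it contains four or more non-ASCII characters, so that names with an occasional symbol or emoji are kept.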

In [13]:
def is_english_string(string, n=4):
    for char in string:
        if ord(char) > 127:
            n-=1
        if n == 0:
            return False
    return True

print(is_english_string("hello there !"))
print(is_english_string("Здоровье"))
print(is_english_string('Docs To Go™ Free Office Suite'))
print(is_english_string('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english_string('Instachat 😜'))

True
False
True
False
True
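
The tolerance built into `is_english_string` matters in practice. As a rough illustration, the sketch below counts non-ASCII characters in three of the names tested above and compares a strict rule (no non-ASCII character allowed) with the tolerant one (at most three). `count_non_ascii` is a hypothetical helper introduced only for this comparison.

```python
def count_non_ascii(string):
    """Number of characters outside the 7-bit ASCII range (code point > 127)."""
    return sum(1 for char in string if ord(char) > 127)

names = [
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    'Здоровье',
]

for name in names:
    n = count_non_ascii(name)
    strict = n == 0      # reject any non-ASCII character
    tolerant = n <= 3    # same outcome as is_english_string with its default n=4
    print(name, '| non-ASCII:', n, '| strict:', strict, '| tolerant:', tolerant)
```

Under the strict rule, the first two names would be discarded even though they are English, which is the reason for allowing a few non-ASCII characters. The heuristic remains imperfect: a non-English name written with ASCII letters only would still pass.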


In [14]:
def remove_non_english_apps(data_set, name_index):
    clean_dataset=[]
    # data_set[1:] skips the header row; android_clean has no header, so its first app is dropped here
    for row in data_set[1:]:
        name = row[name_index]
        if is_english_string(name):
            clean_dataset.append(row)
    print("Initial data set # row :", len(data_set))
    print("Cleaned data set # row :", len(clean_dataset))
    print("")
    return clean_dataset

clean_android=remove_non_english_apps(android_clean,0)
clean_apple=remove_non_english_apps(apple_apps_data,1)

Initial data set # row : 9659
Cleaned data set # row : 9613

Initial data set # row : 7198
Cleaned data set # row : 6183



In [15]:
clean_android[2300]

['United Airlines',
 'TRAVEL_AND_LOCAL',
 '3.5',
 '30447',
 '80M',
 '5,000,000+',
 'Free',
 '0',
 'Everyone',
 'Travel & Local',
 'July 20, 2018',
 '2.1.56',
 '5.0 and up']

### Removing non-free applications

As our company focuses only on free applications, we will now remove every paid application from each data set.

In [16]:
def remove_non_free_app(data_set, price_index):
    clean_set=[]
    for row in data_set:
        s_price = row[price_index]
        price = float(s_price.replace("$",""))
        if price == 0:
            clean_set.append(row)
    return clean_set

clean_android=remove_non_free_app(clean_android,7)
clean_apple=remove_non_free_app(clean_apple,4)

In [17]:
print(len(clean_android))
print(len(clean_apple))

8863
3222
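
The two stores do not write prices the same way: the Apple rows shown earlier hold `'0.0'`, the Google Play rows `'0'`, and `remove_non_free_app` strips a `$` sign, which suggests paid Google Play entries carry one. The sketch below isolates that conversion. `parse_price` is an illustrative name and `'$4.99'` is an assumed example value, not taken from the data.

```python
def parse_price(price_string):
    """Convert a price such as '0', '0.0' or '$4.99' to a float."""
    return float(price_string.replace('$', ''))

for raw in ['0', '0.0', '$4.99']:   # '$4.99' is an illustrative value
    price = parse_price(raw)
    label = 'free' if price == 0 else 'paid'
    print(repr(raw), '->', price, label)
```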


We have now removed the non-English and paid applications.

## Data analysis

Now that both data sets are clean, we can analyse them to inform our strategy, which consists of finding the kind of application that attracts users on both the App Store and Google Play markets. Our validation strategy then has three steps:
- Build a minimal Android version of the app
- If the app is successful, develop it further
- If the application is profitable after 6 months, develop an Apple version and add it to the App Store

We will now examine which genres of application are the most common on each market.

In [18]:
# Apple: prime_genre is column index 11
# Android: Category is column index 1, Genres is column index 9
def freq_table(data_set, index):
    ft={}
    for row in data_set:
        key=row[index]
        if key in ft:
            ft[key]+=1
        else:
            ft[key]=1
    return ft

def display_table(dataset, index, is_dataset_table=False):
    if not is_dataset_table:
        table = freq_table(dataset, index)
    else:
        table=dataset
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        


print("Apple store prime genre frequency table")
display_table(clean_apple, 11)

print("\nGPlay category frequency table")
display_table(clean_android, 1)

print("\nGPlay genre frequency table")
display_table(clean_android, 9)


Apple store prime genre frequency table
Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4

GPlay category frequency table
FAMILY : 1676
GAME : 862
TOOLS : 750
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 313
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 103
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 56
COMICS : 55
BEAUTY : 53

GPlay genre frequency table
[...]

We can see that the App Store is dominated by games and by entertainment in general (games, photo and video, social networking...).

Meanwhile, the largest Google Play category is `FAMILY`, and the most common genre is `Tools`. After some investigation, we found that family applications are mainly games for kids. With that in mind, it appears that games dominate both markets.
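
Because the two cleaned data sets differ in size (3222 Apple rows against 8863 Google Play rows), shares are easier to compare than raw counts. The sketch below converts a few of the counts listed above into percentages. `to_percentages` is a hypothetical helper, and only the top three entries of each table are copied here for brevity.

```python
def to_percentages(freq, total):
    """Convert a {key: count} table into {key: percentage of total}."""
    return {key: count / total * 100 for key, count in freq.items()}

# A few counts copied from the tables above; totals are the cleaned row counts.
apple_counts = {'Games': 1874, 'Entertainment': 254, 'Photo & Video': 160}
gplay_counts = {'FAMILY': 1676, 'GAME': 862, 'TOOLS': 750}

for label, counts, total in [('Apple', apple_counts, 3222),
                             ('Google Play', gplay_counts, 8863)]:
    print(label)
    for key, pct in to_percentages(counts, total).items():
        print(f"  {key} : {pct:.1f}%")
```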

Now that we have identified the main genres, we would like to know which kind of application attracts the most users. In the Android data set, this information is available in the `Installs` column. In the App Store data, however, it is missing, so we will use the `rating_count_tot` column, which tells us how many ratings an application has received, as a proxy.

In [19]:
prime_genre_ft=freq_table(clean_apple, 11)

# Average number of ratings (rating_count_tot, column index 5) per prime genre
genre_avg_review_dic={}
for key in prime_genre_ft:
    total=0
    genre_avg_review_dic[key]=0
    for row in clean_apple:
        genre = row[11]
        review = float(row[5])
        if genre == key:
            genre_avg_review_dic[key]+=review
            total+=1
    genre_avg_review_dic[key]/=total
    
display_table(genre_avg_review_dic,0, True)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0
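
On the Google Play side, the `Installs` values seen in the rows printed earlier are strings such as `'10,000+'` or `'1,000,000,000+'`, so they need converting before any average can be computed. The sketch below is one possible approach on three toy rows built from values shown earlier. It assumes that treating `'N+'` as exactly N is acceptable, which understates the true figures. `parse_installs` and `average_installs` are hypothetical names.

```python
def parse_installs(installs):
    """Turn '10,000+' into 10000.0, treating the open-ended bucket as its lower bound."""
    return float(installs.replace(',', '').replace('+', ''))

def average_installs(rows, category_index=1, installs_index=5):
    """Average parsed installs per category, computed in a single pass."""
    totals, counts = {}, {}
    for row in rows:
        category = row[category_index]
        totals[category] = totals.get(category, 0.0) + parse_installs(row[installs_index])
        counts[category] = counts.get(category, 0) + 1
    return {category: totals[category] / counts[category] for category in totals}

# Toy rows truncated to the first six columns (values from rows shown earlier).
toy_rows = [
    ['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+'],
    ['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+'],
    ['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+'],
]

for category, avg in average_installs(toy_rows).items():
    print(category, ':', avg)
```

Keeping running totals and counts in two dictionaries avoids rescanning the whole data set once per category, unlike the nested loop used for the Apple genres.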
