# What types of apps are likely to attract more users

This project analyzes data about Android and iOS apps.
The goal is to suggest to app developers what types of apps are likely to attract more users.
We restrict the analysis to free apps aimed at English-speaking users, which is the use case we have in mind.

## Opening and Exploring the Data
First, we open the two datasets, one about Google Play apps and one about App Store apps. Both are openly available on Kaggle:

- [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home) containing data about approximately ten thousand Android apps from Google Play
- [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home) containing data about approximately seven thousand iOS apps from the App Store

We open both files and separate the header row (the column names) from the data rows.
Then, to explore the app data, we define a function `explore_data`.

In [1]:
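from csv import reader

# Google Play data: the first row is the header, the rest are app rows
opened_file = open('googleplaystore.csv', encoding='utf8')
read_file = reader(opened_file)
android = list(read_file)
opened_file.close()
android_header = android[0]
android = android[1:]

# App Store data, loaded the same way
opened_file = open('AppleStore.csv', encoding='utf8')
read_file = reader(opened_file)
ios = list(read_file)
opened_file.close()
ios_header = ios[0]
ios = ios[1:]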
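
The loading steps are the same for both files, so they can be wrapped in one helper that also closes the file automatically. This is only a minimal sketch: `load_csv` is a hypothetical name that does not appear elsewhere in this project, the `utf8` encoding is an assumption about the files, and the demo runs on a tiny made-up file instead of the real datasets.

```python
import csv
import os
import tempfile

def load_csv(path, encoding="utf8"):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for reader input
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]

# Demo on a throwaway file with made-up content
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nSketch,ART_AND_DESIGN\n")
    header, rows = load_csv(demo_path)

print(header)
print(rows)
```

With the real files, the call would look like `android_header, android = load_csv('googleplaystore.csv')`.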

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

As a test, let's print the header and the first few rows of each dataset, together with its number of rows and columns.

In [3]:
print(android_header, '\n')
explore_data(android, 0, 4, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Number of columns: 13


In [4]:
print(ios_header, '\n')
explore_data(ios, 0, 4, rows_and_columns=True)

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


['4', '282614216', 'eBay: Best App to Buy, Sell, Save! Online Shopping', '128512000', 'USD', '0', '262241', '649', '4', '4.5', '5.10.0', '12+', 'Shopping', '37', '5', '9', '1']


Number of rows: 7197
Number of columns: 17


## Cleaning the Data
Before analyzing the data, we should clean it. This is always tough work, but it has to be done for a good data analysis.
For the data used in this project, cleaning takes the four steps below.

1. Remove inaccurate data
2. Remove duplicate app entries
3. Remove non-English apps
4. Isolate the free apps

### 1:Remove inaccurate data
According to [one of the discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015), one row of the Google Play data is missing its 'Category' value. The header has 13 columns, but the row at index 10472 of `android` has only 12, as shown below.

In [5]:
print(len(android_header))
explore_data(android, 10471, 10473)
print(len(android[10472]))

13
['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


12


Therefore, we delete the row with the `del` statement, as shown below. This cell should be run only once, because every run removes whatever row is currently at index 10472.
After deleting it, the row with the missing 'Category' value no longer appears.

In [6]:
del android[10472]
explore_data(android, 10471, 10473)

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
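
Rows like this one can also be found without knowing their index in advance, by comparing each row's length with the header's. The following is a minimal sketch on made-up data; `find_bad_rows` is a hypothetical helper, not part of the project code.

```python
def find_bad_rows(rows, header):
    """Return (index, row) pairs whose length differs from the header's."""
    return [(i, row) for i, row in enumerate(rows) if len(row) != len(header)]

# Made-up miniature dataset: the second row is missing its category
header = ["App", "Category", "Rating"]
rows = [
    ["Sketch", "ART_AND_DESIGN", "4.5"],
    ["Photo Frame", "1.9"],
    ["Box", "BUSINESS", "4.2"],
]

bad = find_bad_rows(rows, header)
print(bad)

# Delete from the end so that the earlier indexes stay valid
for i, _ in reversed(bad):
    del rows[i]
print(len(rows))
```

Deleting in reverse order matters only when several rows are removed, but it keeps the loop safe in that case.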




### 2:Remove duplicate app entries
Some apps appear more than once under the same name, so we have to find all duplicate app names and remove the extra entries from the dataset.

First, let's see how many repeated entries there are, using a function `check_duplicate`.

In [7]:
def check_duplicate(dataset, os):
    duplicate_apps_name = []
    unique_apps_name = []
    duplicate_apps = []

    for app in dataset:
        name = app[0]
        if name in unique_apps_name:
            duplicate_apps_name.append(name)
            duplicate_apps.append(app)
        else:
            unique_apps_name.append(name)
    print('Number of duplicate apps from', os, ":",  len(duplicate_apps_name))
    print('Number of unique apps from', os, ":", len(unique_apps_name))
    print("\n")
    return duplicate_apps

duplicate_android = check_duplicate(android, "android")
duplicate_ios = check_duplicate(ios, "ios")

Number of duplicate apps from android : 1181
Number of unique apps from android : 9659


Number of duplicate apps from ios : 0
Number of unique apps from ios : 7197




These results show that the iOS data has no duplicate apps, so we focus on the Android data and remove the duplicates there.

Before removing them, let's see what the duplicates look like by displaying the first few rows of `duplicate_android` and then the rows that share the same app name.

In [8]:
explore_data(duplicate_android, 0, 10)

def display_duplicate_apps(dataset, app_name):
    for num_row, app in enumerate(dataset):
        name = app[0]
        if name==app_name:
            print(num_row,":",  app)

display_duplicate_apps(duplicate_android, "Box")
display_duplicate_apps(duplicate_android, "Slack")

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


['Google My Business', 'BUSINESS', '4.4', '70991', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 24, 2018', '2.19.0.204537701', '4.4 and up']


['ZOOM Cloud Meetings', 'BUSINESS', '4.4', '31614', '37M', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 20, 2018', '4.1.28165.0716', '4.0 and up']


['join.me - Simple Meetings', 'BUSINESS', '4.0', '6989', 'Varies with device', '1,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 16, 2018', '4.3.0.508', '4.4 and up']


['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July

After checking the duplicate rows, we are going to remove them.
We use the 'Reviews' column, which holds the number of reviews an app had when the row was collected. We assume that, among the rows for the same app, the one with the most reviews is the most recent. Therefore, we keep only that row for each app and build a new list, `android_clean`.

In [9]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])  # convert so counts compare as numbers, not as text
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    # keep one row per app: the first row that carries the highest review count
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
        

explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13
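
Values read by `csv.reader` are strings, so review counts have to be converted before they are compared as numbers. A small sketch with made-up values shows why this matters when picking the row with the most reviews:

```python
# Review counts as they come out of the CSV reader: strings
a, b = "967", "1000"

text_comparison = a < b                  # compares character by character
number_comparison = float(a) < float(b)  # compares the numeric values

print(text_comparison)
print(number_comparison)

# Made-up duplicate entries for one app
entries = [["Demo App", "967"], ["Demo App", "1000"]]
latest = max(entries, key=lambda row: float(row[1]))
print(latest)
```

As text, `"967"` sorts after `"1000"` because `"9"` is greater than `"1"`, so only the converted values give the intended order.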


### 3:Remove non-English apps
In this project, we keep only apps aimed at English-speaking users.
In this step, we look for apps that are possibly non-English with the functions `is_english` and `check_english`, and count the apps that pass the check. To judge whether a name is English, we use the built-in function `ord()`, which returns the Unicode code point of a character. The characters commonly used in English text (letters, digits, punctuation) are all in the ASCII range, so their code points are 127 or lower.
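
To see what `ord()` returns, here is a small sketch on a handful of characters chosen for illustration. The last three are outside the ASCII range.

```python
# Map each sample character to its code point
codes = {char: ord(char) for char in ["a", "Z", "~", "™", "爱", "😜"]}

for char, code in codes.items():
    print(repr(char), code, "ASCII" if code <= 127 else "non-ASCII")
```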

In [10]:
def is_english(string):
    for char in string:
        if ord(char) > 127:
            # print(ord(char), string)
            return False
    return True

def check_english(os, which_os):
    print("Numbers of all", which_os, ":", len(os))
    apps_english = []
    for app in os:
        if which_os == "android":
            app_name = app[0]
        elif which_os == "ios":
            app_name = app[2]
        if is_english(app_name):
            apps_english.append(app)
    print("Numbers of english", which_os, ":", len(apps_english))

check_english(android_clean, which_os="android")

Numbers of all android : 9659
Numbers of english android : 9117


In [11]:
check_english(ios, which_os="ios")

Numbers of all ios : 7197
Numbers of english ios : 5707


As shown above, we can find possibly non-English apps. However, some English apps are also classified as non-English because their names include special characters such as 😜 or ™.
To solve this problem, we change the rule a little and classify an app as non-English only if its name includes more than three characters whose code point is greater than 127. We use two improved functions, `better_is_english` and `better_check_english`. This time `better_check_english` also returns the list of English apps.

In [12]:
def better_is_english(string):
    count = 0
    for char in string:
        if ord(char) > 127:
            count += 1
            if count > 3:
                # print(ord(char), string)
                return False    
    return True

def better_check_english(os, which_os):
    print("Numbers of all", which_os, ":", len(os))
    apps_english = []
    for app in os:
        if which_os == "android":
            app_name = app[0]
        elif which_os == "ios":
            app_name = app[2]
        if better_is_english(app_name):
            apps_english.append(app)
    print("Numbers of english", which_os, ":", len(apps_english))
    return apps_english

In [13]:
android_english = better_check_english(android_clean, which_os="android")

Numbers of all android : 9659
Numbers of english android : 9614


In [14]:
ios_english = better_check_english(ios, which_os="ios")

Numbers of all ios : 7197
Numbers of english ios : 6183


With that, we have better lists of English apps.

### 4:Isolate the free apps
In this project, we focus on free apps. We keep only the rows whose price is 0, using a function `is_free`.

In [15]:
def is_free(os, which_os):
    free_app = []
    if which_os=='android':
        column_num = 7
    elif which_os=='ios':
        column_num = 5
    for app in os:
        if app[column_num]=='0':
            free_app.append(app)
    print("Numbers of free", which_os, "is", len(free_app))
    return free_app

android_final = is_free(android_english, "android")
ios_final = is_free(ios_english, "ios")


Numbers of free android is 8862
Numbers of free ios is 3222
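
The check above compares the price with the string `'0'`, which only matches that exact spelling. A more tolerant variant converts the price to a number first. This is a sketch under assumptions: `parse_price` is a hypothetical helper, and the sample strings (including the `$` prefix) are assumed formats, not values verified against the datasets.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(text.replace("$", ""))

# Assumed spellings of prices, for illustration only
prices = ["0", "0.0", "$4.99", "3.99"]
free_flags = [parse_price(p) == 0 for p in prices]
print(free_flags)
```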


So far, we have spent a good amount of time cleaning the data. Congrats!

## Analyzing Data
We have finally cleaned up the data, so let's analyze it.
This should be the main part of a data analysis, although many data analysts say that preparing the data is the most complicated and toughest task.

At this point, let's recall our goal: to find out what types of apps are likely to attract more users.
To get at this, we check how many apps are published in each genre and how each genre scores on other indicators of popularity, such as rating counts and installs. Note that the cells below are run on the `android` and `ios` lists, so the figures they print cover every remaining row, not only the free English apps in `android_final` and `ios_final`.

In [16]:
print(android_header, "\n")
print(ios_header, "\n")
explore_data(android, 0, 3, rows_and_columns=True)

def popular_genres(os, which_os):
    genres_dict = {}
    for app in os:
        if which_os=="android":
            genres = app[9]
        elif which_os=="ios":
            genres = app[12]
        if genres in genres_dict:
            genres_dict[genres] += 1
        else:
            genres_dict[genres] = 1
    genres_sorted = sorted(genres_dict.items(), key=lambda x:x[1], reverse=True)
    for g_s in genres_sorted[:30]:
        print(g_s)

print("\n")
print("android apps:")
popular_genres(android, "android")
print("\n")
print("ios apps:")
popular_genres(ios, "ios")


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


andro

The results above are striking.

Tools is the most common genre among Android apps, while Games is the most common among iOS apps. Moreover, iOS has more than six times as many Games apps as Entertainment apps, the second most common genre on iOS.

However, Entertainment ranks high in both lists, which suggests that it is a genre worth considering if we want to publish an app on both platforms.
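
Counting apps per genre and turning the counts into percentages can also be written with `collections.Counter` from the standard library. The following is a minimal sketch on made-up rows; `genre_percentages` is a hypothetical name.

```python
from collections import Counter

def genre_percentages(rows, index):
    """Return (genre, percentage) pairs, most common genre first."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(genre, round(100 * n / total, 2)) for genre, n in counts.most_common()]

# Made-up rows: app name, genre
rows = [["A", "Games"], ["B", "Games"], ["C", "Tools"], ["D", "Games"]]
table = genre_percentages(rows, 1)
for genre, share in table:
    print(genre, ":", share)
```

With the project data, the call would take one of the cleaned lists and the genre column index, for example `genre_percentages(ios_final, 12)`.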

In [17]:
def freq_table(dataset, index):
    freq_data = {}
    num_apps = len(dataset)
    for app in dataset:
        data = app[index]
        if data in freq_data:
            freq_data[data] += 1
        else:
            freq_data[data] = 1
    for data in freq_data:
        freq_data[data] *=  (100/num_apps)
        freq_data[data] = round(freq_data[data], 2)
    return freq_data

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
    print("\n")

display_table(android, 1)
display_table(android, 9)
display_table(ios, 12)

FAMILY : 18.19
GAME : 10.55
TOOLS : 7.78
MEDICAL : 4.27
BUSINESS : 4.24
PRODUCTIVITY : 3.91
PERSONALIZATION : 3.62
COMMUNICATION : 3.57
SPORTS : 3.54
LIFESTYLE : 3.52
FINANCE : 3.38
HEALTH_AND_FITNESS : 3.15
PHOTOGRAPHY : 3.09
SOCIAL : 2.72
NEWS_AND_MAGAZINES : 2.61
SHOPPING : 2.4
TRAVEL_AND_LOCAL : 2.38
DATING : 2.16
BOOKS_AND_REFERENCE : 2.13
VIDEO_PLAYERS : 1.61
EDUCATION : 1.44
ENTERTAINMENT : 1.37
MAPS_AND_NAVIGATION : 1.26
FOOD_AND_DRINK : 1.17
HOUSE_AND_HOME : 0.81
LIBRARIES_AND_DEMO : 0.78
AUTO_AND_VEHICLES : 0.78
WEATHER : 0.76
ART_AND_DESIGN : 0.6
EVENTS : 0.59
PARENTING : 0.55
COMICS : 0.55
BEAUTY : 0.49


Tools : 7.77
Entertainment : 5.75
Education : 5.06
Medical : 4.27
Business : 4.24
Productivity : 3.91
Sports : 3.67
Personalization : 3.62
Communication : 3.57
Lifestyle : 3.51
Finance : 3.38
Action : 3.37
Health & Fitness : 3.15
Photography : 3.09
Social : 2.72
News & Magazines : 2.61
Shopping : 2.4
Travel & Local : 2.37
Dating : 2.16
Books & Reference : 2.13
Arcade : 2.0

In [18]:
prime_genre = freq_table(ios, 12)
print(prime_genre)

for genre in prime_genre:
    total = 0
    len_genre = 0
    for app in ios:
        if genre==app[12]:
            total += int(app[6])
            len_genre += 1
    avg_ratings = total / len_genre
    print(genre, ":", round(avg_ratings) )   

{'Games': 53.66, 'Productivity': 2.47, 'Weather': 1.0, 'Shopping': 1.7, 'Reference': 0.89, 'Finance': 1.45, 'Music': 1.92, 'Utilities': 3.45, 'Travel': 1.13, 'Social Networking': 2.32, 'Sports': 1.58, 'Business': 0.79, 'Health & Fitness': 2.5, 'Entertainment': 7.43, 'Photo & Video': 4.85, 'Navigation': 0.64, 'Education': 6.29, 'Lifestyle': 2.0, 'Food & Drink': 0.88, 'News': 1.04, 'Book': 1.56, 'Medical': 0.32, 'Catalogs': 0.14}
Games : 13692
Productivity : 8051
Weather : 22181
Shopping : 18615
Reference : 22411
Finance : 11048
Music : 28842
Utilities : 6864
Travel : 14129
Social Networking : 45499
Sports : 14027
Business : 4788
Health & Fitness : 9913
Entertainment : 7534
Photo & Video : 14352
Navigation : 11854
Education : 2239
Lifestyle : 6162
Food & Drink : 13939
News : 13015
Book : 5125
Medical : 593
Catalogs : 1732


In [19]:
category_dict = freq_table(android, 1)

for category in category_dict:
    total = 0
    len_category= 0
    for app in android:
        category_app = app[1]
        if category==category_app:
            total += float(app[5].replace("+", "").replace(",", ""))
            len_category += 1
    avg_installs = total / len_category
    print(category, ":", round(avg_installs) )   

ART_AND_DESIGN : 1912894
AUTO_AND_VEHICLES : 625061
BEAUTY : 513152
BOOKS_AND_REFERENCE : 8318050
BUSINESS : 2178076
COMICS : 934769
COMMUNICATION : 84359887
DATING : 1129533
EDUCATION : 5586231
ENTERTAINMENT : 19256107
EVENTS : 249581
FINANCE : 2395215
FOOD_AND_DRINK : 2156683
HEALTH_AND_FITNESS : 4642441
HOUSE_AND_HOME : 1917187
LIBRARIES_AND_DEMO : 741128
LIFESTYLE : 1407444
GAME : 30669602
FAMILY : 5201959
MEDICAL : 115027
SOCIAL : 47694467
SHOPPING : 12491726
PHOTOGRAPHY : 30114172
SPORTS : 4560350
TRAVEL_AND_LOCAL : 26623594
TOOLS : 13585732
PERSONALIZATION : 5932385
PRODUCTIVITY : 33434178
PARENTING : 525352
WEATHER : 5196348
VIDEO_PLAYERS : 35554301
NEWS_AND_MAGAZINES : 26488755
MAPS_AND_NAVIGATION : 5286729
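
The averages above loop over the whole dataset once per category. The same result can be computed in a single pass by accumulating totals and counts per category. This is a minimal sketch on made-up rows; `parse_installs` and `average_by_group` are hypothetical helpers.

```python
from collections import defaultdict

def parse_installs(text):
    """Convert an install bucket such as '10,000+' to the integer 10000."""
    return int(text.replace(",", "").replace("+", ""))

def average_by_group(rows, group_index, value_index):
    """Average the parsed install counts for each group in one pass."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += parse_installs(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: app name, category, installs
rows = [
    ["A", "TOOLS", "10,000+"],
    ["B", "TOOLS", "1,000,000+"],
    ["C", "GAME", "500+"],
]
averages = average_by_group(rows, 1, 2)
for category, value in averages.items():
    print(category, ":", round(value))
```

Because values such as `10,000+` are open-ended buckets, these averages are based on the lower bound of each bucket and are not exact install counts.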
