# App Store and Google Play Free Apps Analysis

## Exploring the Data

- In the code below we open `AppleStore.csv` and `googleplaystore.csv`
- Both file objects are then converted into lists using the csv `reader()` function and the built-in `list()` function
- Apple App Store data set: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
- Google Play data set: https://www.kaggle.com/lava18/google-play-store-apps

In [1]:
from csv import reader

app_store = open('AppleStore.csv', encoding='utf8')
appstore_read = reader(app_store)
app_store_list = list(appstore_read)
app_store_header = app_store_list[0]
appstore_data = app_store_list[1:]


In [2]:
google_store = open('googleplaystore.csv', encoding='utf8')
google_read = reader(google_store)
google_list = list(google_read)
google_header = google_list[0]
google_data = google_list[1:]
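
The two cells above leave the file objects open after reading. As a minimal sketch of the same loading step, the helper below uses a context manager so the file is closed automatically. `load_csv` is a hypothetical name that does not appear in the original notebook, and the UTF-8 encoding is an assumption about the two CSV files.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    # newline='' is the setting the csv module documentation recommends
    # for file objects that are passed to reader()
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

With this helper, `app_store_header, appstore_data = load_csv('AppleStore.csv')` would produce the same header and data lists as the first cell.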

- The `explore_data()` function takes a list of rows and prints the slice between the `start` and `end` indexes
- Setting the fourth parameter, `rows_and_columns`, to `True` also prints the number of rows and columns in the list

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n') # adds a new (empty) line after the row and column counts

In [4]:
explore_data(google_data,0,1, True)
explore_data(appstore_data,0,1,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16




## Data Cleaning

- Row 10472 (excluding the header) is missing its 'Category' value, which shifts every later value one column to the left. We remove this row from the Google data set.

In [5]:
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del google_data[10472]

In [7]:
print(google_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [8]:
def duplicate_apps(dataset):
    duplicate_set=[]
    unique_apps = []
    for app in dataset:
        # The data sets passed in no longer contain a header row,
        # so every row has to be checked.
        # First column identifies the app (name for Google Play, id for the App Store):
        name = app[0]
        if name in unique_apps:
            duplicate_set.append(name)
        else:
            unique_apps.append(name)

    return duplicate_set

google_duplicates = duplicate_apps(google_data)
apple_duplicates = duplicate_apps(appstore_data)
print('Google duplicates: '+str(len(google_duplicates)))
print('Apple duplicates: '+str(len(apple_duplicates)))

Google duplicates: 1181
Apple duplicates: 0


- Duplicate entries will be removed using the number of reviews as the criterion, since the entry with the most reviews should be the most recent one. Only the duplicate with the highest number of reviews will be kept; the rest will be discarded.


In [9]:
def remove_duplicates(dataset):
    android_clean = []
    already_added=[]
    
    reviews_max = {}
    for app in dataset:
        # First column (index 0) contains the name
        # Fourth column (index 3) contains the review count
        app_name = app[0]
        reviews = float(app[3])

        # Do not use an else statement here: if reviews > reviews_max[app_name] is False,
        # an else branch would overwrite the stored maximum with a smaller value.
        if app_name in reviews_max and reviews>reviews_max[app_name]:
            reviews_max[app_name] = reviews
        if app_name not in reviews_max:
            reviews_max[app_name] = reviews
            
    for app in dataset:
        app_name = app[0]
        reviews = float(app[3])
        
        # include 'app_name not in already_added', otherwise duplicates that share
        # the same maximum review count would all be added.
        if reviews == reviews_max[app_name] and app_name not in already_added:
            android_clean.append(app)
            already_added.append(app_name)
            
    return android_clean
            
    #print(len(reviews_max))
    #print(len(android_clean))
    
            
android_no_duplicates= remove_duplicates(google_data)
print(len(android_no_duplicates))

9659
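
The rule described above (keep only the entry with the most reviews for each app name) can also be sketched in a single pass over the rows. This is an illustrative alternative under stated assumptions, not the notebook's method: `keep_most_reviewed` and the sample rows are hypothetical, and the result is ordered by first appearance of each name, which may differ from the order produced by `remove_duplicates()`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}  # app name -> row with the most reviews seen so far
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' keeps the earlier row when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    # dicts preserve insertion order (Python 3.7+)
    return list(best.values())

# Made-up sample rows: name, category, rating, reviews
sample = [
    ['App A', 'TOOLS', '4.0', '100'],
    ['App B', 'GAME', '4.5', '50'],
    ['App A', 'TOOLS', '4.1', '250'],
]
deduplicated = keep_most_reviewed(sample)
print(deduplicated)
```

Storing the best row per name in a dictionary avoids the second loop and the `already_added` list lookup.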


- In the cell below we remove apps whose names contain more than three non-ASCII characters, which we treat as apps not aimed at English speakers
- The data that will be analyzed needs to be targeted at English-speaking audiences
- We test our first function for checking characters as an exercise from DQ
- It is important to note that app names in the Apple data set are in column index 1 (`track_name`), so our function takes the name index of each data set as a parameter

In [10]:
def get_latin_char_apps(dataset,name_index):
    android_cleaned_eng=[]
    
    for app in dataset:
        name = app[name_index]
        if check_characters(name):
            android_cleaned_eng.append(app)
            #print(name)
    #print(android_cleaned_eng)
    return android_cleaned_eng

    

def check_characters(string):
    max_non_latin_char = 0
    for character in string:
        if ord(character) > 127:  # ASCII covers code points 0 to 127
            #print("non-Latin characters detected"+str(max_non_latin_char))
            max_non_latin_char +=1
            if max_non_latin_char>3:
                #print(string)
                return False
    return True
    
            
android_latin_char = get_latin_char_apps(android_no_duplicates,0)
app_store_latin_char = get_latin_char_apps(appstore_data,1)

print(len(app_store_latin_char))

#check_characters('爱奇艺PPS -《欢乐颂2》电视剧热播')
    

6183
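
To make the threshold in the character check concrete, here is a small sketch of the same idea applied to a few sample names. `is_mostly_ascii` is a hypothetical name and the sample strings (other than the commented-out one in the cell above) are assumptions chosen for illustration.

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if name has at most max_non_ascii characters outside ASCII."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))                      # True: no non-ASCII characters
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))  # True: one non-ASCII character
print(is_mostly_ascii('Instachat 😜'))                    # True: one non-ASCII character
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))      # False: more than three
```

Allowing up to three non-ASCII characters keeps English apps whose names include an emoji or a trademark symbol.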


- Our current cleaned datasets are `android_latin_char` and `app_store_latin_char`.
- These contain no duplicates and only apps with mostly Latin-character names
- In the next cell we isolate all free apps found in the Apple and Android stores.

In [11]:
# Printing the headers for column reference:
print(app_store_header)
print('\n')
print(google_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [15]:
def get_free_apps(dataset, index):
    free_apps_list = []
    
    for row in dataset:
        free_apps = row[index]
        #print(free_apps)
        if free_apps == '0' or free_apps == '0.0':
            free_apps_list.append(row)
    #print(free_apps_list)
    
    return free_apps_list

#print(android_latin_char)

appstore_free = get_free_apps(app_store_latin_char,4)
android_free = get_free_apps(android_latin_char,7)

print(len(android_free))
print(len(appstore_free))

8864
3222


## Frequency Table of most popular apps on Android and iOS

- In this section we will create frequency tables of the most popular app genres on both the iOS and Android platforms.
- A common app profile that fits both the App Store and Google Play will have a broader audience
- Using our corresponding datasets we will create frequency tables for the "prime_genre" (App Store, index 11), "Genres" (Google, index 9) and "Category" (Google, index 1) columns.
- Our current, cleaned datasets are called `appstore_free` and `android_free`
- `freq_table(dataset, index, is_percent)` returns a frequency table; with `is_percent=True` the counts are converted to percentages

In [65]:
def freq_table(dataset, index: int, is_percent=False):
    freq_dict={}
    
    for row in dataset:
        column = row[index]
        if column in freq_dict:
            freq_dict[column] +=1
        else:
            freq_dict[column] = 1
    
    #return a percent value:
    if is_percent:
        for key in freq_dict:
            freq_dict[key]= (freq_dict[key]/len(dataset))*100
    
    return freq_dict
     

def display_table(dataset_dict,index,is_percent):
    # Build a frequency table from the input dataset and column index
    table = freq_table(dataset_dict,index,is_percent)
    table_display = []
    #print(table)
    
    for key in table:
        # Store (value, key) tuples so that sorting orders the entries by value,
        # and add each tuple to the table_display list
        key_val_as_tuple = (table[key],key)
        table_display.append(key_val_as_tuple)
    
    # Create a new list with the sorted values
    table_sorted = sorted(table_display, reverse=True)
    for category in table_sorted:
        #print the category in reverse with the name being first then the value
        print(category[1], ' : ', category[0])

#explore_data(appstore_free,0,3222)
print("App Store Prime Genres:")
display_table(appstore_free,11,True)
print("\n")
print("Google Genres:")
display_table(android_free,9,True)
print("\n")
print("Google Categories:")
display_table(android_free,1,True)

App Store Prime Genres:
Games  :  58.16263190564867
Entertainment  :  7.883302296710118
Photo & Video  :  4.9658597144630665
Education  :  3.662321539416512
Social Networking  :  3.2898820608317814
Shopping  :  2.60707635009311
Utilities  :  2.5139664804469275
Sports  :  2.1415270018621975
Music  :  2.0484171322160147
Health & Fitness  :  2.0173805090006205
Productivity  :  1.7380509000620732
Lifestyle  :  1.5828677839851024
News  :  1.3345747982619491
Travel  :  1.2414649286157666
Finance  :  1.1173184357541899
Weather  :  0.8690254500310366
Food & Drink  :  0.8069522036002483
Reference  :  0.5586592178770949
Business  :  0.5276225946617008
Book  :  0.4345127250155183
Navigation  :  0.186219739292365
Medical  :  0.186219739292365
Catalogs  :  0.12414649286157665


Google Genres:
Tools  :  8.449909747292418
Entertainment  :  6.069494584837545
Education  :  5.347472924187725
Business  :  4.591606498194946
Productivity  :  3.892148014440433
Lifestyle  :  3.892148014440433
Finance  :  3.7

- The most common genre on the Apple App Store is "Games" with approximately 58.2%. Entertainment and Photo & Video are in second and third place, respectively.
- The majority of the top 10 genres are designed for entertainment purposes
- The frequency table alone cannot give a clear recommendation about an optimal app profile, since a genre with many apps might still have a smaller user base than a niche genre.


- The most common genres for the Google Play Store are Family, Games, Tools and Entertainment
- Google genres and categories are more diverse compared to the App Store
- Based on the frequency tables, the genres that Google and Apple most clearly have in common are Entertainment and Education. These could be top-tier niche categories worth exploring for a new app profile.
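
The percentage frequency tables used in this section can also be built with `collections.Counter` from the standard library. The sketch below is an illustrative alternative, not the notebook's implementation; `freq_table_percent` and the sample rows are hypothetical.

```python
from collections import Counter

def freq_table_percent(rows, index):
    """Return (value, percentage) pairs sorted from most to least frequent."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() already yields the values in descending order of count
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample rows: name, genre
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Tools'], ['d', 'Games']]
table = freq_table_percent(sample, 1)
for genre, percent in table:
    print(genre, ':', round(percent, 2))
```

`Counter.most_common()` handles both the counting and the sorting, so the separate tuple list in `display_table()` is not needed in this variant.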

## Calculating the Most Popular Genres

- For Google, the popularity of each genre can be measured with the "Installs" column (index 5)
- In the App Store we will calculate popularity based on the average number of user ratings for each genre.
- For this we will use the "rating_count_tot" (index 5) column.


In [85]:
prime_genre= freq_table(appstore_free,11,False)
#print(prime_genre)

def avg_usr_rating_category(dataset,index_column_genre):
    prime_genre= freq_table(dataset,index_column_genre,False)
    
    #Loop through the unique genre list
    for genre in prime_genre:
        #app_genre = app[-5]
        
        total_sum_usr_rting = 0
        len_genre = 0   
        # Compare each app's genre in the data set to the current unique genre,
        # add up its rating count and track how many apps the genre has
        for sub_search in dataset:
                data_genre = sub_search[index_column_genre]
                if data_genre == genre:
                    # Add up the user ratings
                    total_sum_usr_rting += float(sub_search[5])
                    len_genre+=1
        average = total_sum_usr_rting/len_genre
        
        #replace the frequency table value with the average user rating
        prime_genre[genre] = average

    return prime_genre

average_rating = avg_usr_rating_category(appstore_free,11)        
print(average_rating)
    
    

{'Social Networking': 71548.34905660378, 'Photo & Video': 28441.54375, 'Games': 22788.6696905016, 'Music': 57326.530303030304, 'Reference': 74942.11111111111, 'Health & Fitness': 23298.015384615384, 'Weather': 52279.892857142855, 'Utilities': 18684.456790123455, 'Travel': 28243.8, 'Shopping': 26919.690476190477, 'News': 21248.023255813954, 'Navigation': 86090.33333333333, 'Lifestyle': 16485.764705882353, 'Entertainment': 14029.830708661417, 'Food & Drink': 33333.92307692308, 'Sports': 23008.898550724636, 'Book': 39758.5, 'Finance': 31467.944444444445, 'Education': 7003.983050847458, 'Productivity': 21028.410714285714, 'Business': 7491.117647058823, 'Catalogs': 4004.0, 'Medical': 612.0}
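
The function above rescans the whole data set once per genre. As a sketch of the same average under stated assumptions, the version below accumulates totals and counts in one pass. `average_by_group` and the sample rows are hypothetical and not part of the original notebook.

```python
def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using one pass over rows."""
    totals = {}
    counts = {}
    for row in rows:
        group = row[group_index]
        totals[group] = totals.get(group, 0.0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: genre, rating count
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)
```

For the App Store data this would be called as `average_by_group(appstore_free, 11, 5)`.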


- In the next section we will calculate the average number of installs per app genre for the Google Play data set. The "Installs" column only gives broad ranges rather than exact install numbers. For this project we will treat the lower bound of each range as the absolute number of installs, i.e. 100,000+ installs will become 100,000 installs.
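
The conversion described above can be sketched on its own as follows. `parse_installs` is a hypothetical helper name used only for illustration.

```python
def parse_installs(text):
    """Convert an install range such as '100,000+' to its lower bound as a float."""
    return float(text.replace('+', '').replace(',', ''))

print(parse_installs('100,000+'))     # 100000.0
print(parse_installs('10,000,000+'))  # 10000000.0
```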

In [111]:
def google_avg_installs_genre(dataset, index_column_genre):
    categories_dict = freq_table(dataset,index_column_genre)
    # For the android data set the Category column index is 1
    #print(categories_dict)
    for category in categories_dict:
        
        total = 0
        len_category = 0
        for app in dataset:
            data_category = app[index_column_genre]
            if category == data_category:
                len_category += 1
                string = app[5]
                string = string.replace('+','')
                string = string.replace(',','')
                installs = float(string)
                #print(installs)
                total += installs
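
        # Compute the average once per category, after all apps have been scanned.
        # len_category is never 0 because every category comes from this dataset.
        categories_dict[category] = total/len_category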
    return categories_dict

print(google_avg_installs_genre(android_free,1))
    

{'ART_AND_DESIGN': 1986335.0877192982, 'AUTO_AND_VEHICLES': 647317.8170731707, 'BEAUTY': 513151.88679245283, 'BOOKS_AND_REFERENCE': 8767811.894736841, 'BUSINESS': 1712290.1474201474, 'COMICS': 817657.2727272727, 'COMMUNICATION': 38456119.167247385, 'DATING': 854028.8303030303, 'EDUCATION': 1833495.145631068, 'ENTERTAINMENT': 11640705.88235294, 'EVENTS': 253542.22222222222, 'FINANCE': 1387692.475609756, 'FOOD_AND_DRINK': 1924897.7363636363, 'HEALTH_AND_FITNESS': 4188821.9853479853, 'HOUSE_AND_HOME': 1331540.5616438356, 'LIBRARIES_AND_DEMO': 638503.734939759, 'LIFESTYLE': 1437816.2687861272, 'GAME': 15588015.603248259, 'FAMILY': 3695641.8198090694, 'MEDICAL': 120550.61980830671, 'SOCIAL': 23253652.127118643, 'SHOPPING': 7036877.311557789, 'PHOTOGRAPHY': 17840110.40229885, 'SPORTS': 3638640.1428571427, 'TRAVEL_AND_LOCAL': 13984077.710144928, 'TOOLS': 10801391.298666667, 'PERSONALIZATION': 5201482.6122448975, 'PRODUCTIVITY': 16787331.344927534, 'PARENTING': 542603.6206896552, 'WEATHER': 50