# App Store and Google Play Free Apps Analysis

## Exploring the Data

- In the code below we open `AppleStore.csv` and `googleplaystore.csv`
- Both file objects are then converted into lists of rows using `reader()` from the `csv` module and the built-in `list()` function
- Apple store data set: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps
- Google Play data set: https://www.kaggle.com/lava18/google-play-store-apps

In [1]:
from csv import reader

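# utf8 is an assumption about the file encoding; passing it explicitly avoids
# platform-dependent defaults, and `with` closes the file after reading
with open('AppleStore.csv', encoding='utf8') as app_store:
    app_store_list = list(reader(app_store))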
app_store_header = app_store_list[0]
appstore_data = app_store_list[1:]


In [2]:
# utf8 is an assumption about the file encoding; `with` closes the file after reading
with open('googleplaystore.csv', encoding='utf8') as google_store:
    google_list = list(reader(google_store))
google_header = google_list[0]
google_data=google_list[1:]
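
The loading steps used for both files can be folded into one helper. The sketch below is only an illustration: `load_csv` is a hypothetical name that does not appear in the notebook, and the `utf8` default is an assumption about how the files are encoded.

```python
from csv import reader

def load_csv(path, encoding='utf8'):
    """Return (header, rows) for the CSV file at `path`.

    Hypothetical helper; the encoding default is an assumption.
    """
    # newline='' is what the csv module documentation recommends for reader input
    with open(path, encoding=encoding, newline='') as f:
        rows = list(reader(f))
    return rows[0], rows[1:]

# Usage, assuming both files sit in the working directory:
# app_store_header, appstore_data = load_csv('AppleStore.csv')
# google_header, google_data = load_csv('googleplaystore.csv')
```

Returning the header and the data rows separately mirrors the `*_header` / `*_data` variables used in the rest of the analysis.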

- The `explore_data()` function takes a list and prints the slice of rows between the `start` and `end` indexes
- Setting the fourth parameter, `rows_and_columns`, to `True` also prints the number of rows and columns in the list

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        print('\n') # adds a new (empty) line after the summary

In [4]:
explore_data(google_data,0,1, True)
explore_data(appstore_data,0,1,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7197
Number of columns: 16




## Data Cleaning

- Row 10472 (excluding the header) is missing its 'Category' value, so every later value in that row sits one column too far to the left. We remove this row from the Google data set.

In [5]:
print(google_data[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [6]:
del google_data[10472]

In [7]:
print(google_data[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
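
The malformed row above was addressed by its index. As a rough sketch of how such rows could be located without knowing the index in advance, each row's length can be compared with the header's length. `find_malformed_rows` is a hypothetical helper, and the rows below are toy data, not values from the real files.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose length differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]

# Toy data for illustration only (not taken from the real files)
header = ['App', 'Category', 'Rating']
rows = [
    ['A', 'TOOLS', '4.2'],
    ['B', '1.9'],            # 'Category' missing, so the row is one column short
    ['C', 'GAME', '3.9'],
]
print(find_malformed_rows(header, rows))  # [(1, ['B', '1.9'])]
```

With the notebook's variables this would be called as `find_malformed_rows(google_header, google_data)`. A length check only catches rows with a missing or extra value; it cannot detect a value that is present but wrong.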


- The Google data set contains 1181 duplicate entries. We check this by running the `duplicate_apps()` function on both data sets.

In [8]:
def duplicate_apps(dataset):
    duplicate_set = []
    unique_apps = []
    # The header row was already removed, so iterate over every row
    for app in dataset:
        # First column: app name (Google) or app id (Apple)
        name = app[0]
        if name in unique_apps:
            duplicate_set.append(name)
        else:
            unique_apps.append(name)

    return duplicate_set

google_duplicates = duplicate_apps(google_data)
apple_duplicates = duplicate_apps(appstore_data)
print('Google duplicates: '+str(len(google_duplicates)))
print('Apple duplicates: '+str(len(apple_duplicates)))

Google duplicates: 1181
Apple duplicates: 0
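
The same count can be obtained more compactly with `collections.Counter`. This is a sketch rather than the notebook's method: `count_duplicates` and its `name_index` parameter are hypothetical, and the rows are toy data.

```python
from collections import Counter

def count_duplicates(rows, name_index=0):
    """Count the entries that repeat an earlier value in column `name_index`."""
    counts = Counter(row[name_index] for row in rows)
    # An app listed n times contributes n - 1 duplicate entries
    return sum(n - 1 for n in counts.values())

# Toy data for illustration only
rows = [['Instagram'], ['Slack'], ['Instagram'], ['Instagram']]
print(count_duplicates(rows))  # 2
```

Because `Counter` is backed by a dictionary, each lookup takes roughly constant time, whereas testing membership in a list grows slower as the list of unique names gets longer.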


- Duplicate entries will be removed using the number of reviews as the criterion, since a higher review count suggests the entry was collected more recently. For each app, only the entry with the highest number of reviews will be kept; the rest will be discarded.


In [9]:
def remove_duplicates(dataset):
    android_clean = []
    already_added = []

    reviews_max = {}
    for app in dataset:
        # Index 0 holds the app name, index 3 the review count
        app_name = app[0]
        reviews = float(app[3])

        # Record the count for a new name, or raise it when a larger one appears
        if app_name not in reviews_max or reviews > reviews_max[app_name]:
            reviews_max[app_name] = reviews

    for app in dataset:
        app_name = app[0]
        reviews = float(app[3])

        # 'app_name not in already_added' keeps a single row when several
        # duplicates share the same maximum review count
        if reviews == reviews_max[app_name] and app_name not in already_added:
            android_clean.append(app)
            already_added.append(app_name)

    return android_clean


android_no_duplicates = remove_duplicates(google_data)
print(len(android_no_duplicates))


9659
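
The keep-the-most-reviewed rule can also be written as a single pass that stores the best row seen so far for each name. This is a sketch under stated assumptions: `keep_most_reviewed` and its index parameters are hypothetical, the rows are toy data, and the result is ordered by each name's first appearance, which may differ from the order produced by `remove_duplicates()`.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means the first row wins when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Toy data for illustration only
rows = [
    ['A', 'x', '4.0', '10'],
    ['B', 'x', '4.0', '5'],
    ['A', 'x', '4.1', '25'],
    ['A', 'x', '4.1', '25'],
]
print(keep_most_reviewed(rows))
# [['A', 'x', '4.1', '25'], ['B', 'x', '4.0', '5']]
```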


- In the cell below we remove apps whose names contain more than three non-ASCII characters (code point 127 or above), which we treat as non-Latin names
- The data that will be analyzed needs to be targeted at English-speaking audiences
- We test our first function for removing characters as an exercise from DQ
- It is important to note that app names in the Apple data set are in column index 1 (`track_name`), so our function takes the name index as a parameter for each data set

In [51]:
def get_latin_char_apps(dataset, name_index):
    android_cleaned_eng = []

    for app in dataset:
        name = app[name_index]
        if check_characters(name):
            android_cleaned_eng.append(app)
    return android_cleaned_eng


def check_characters(string):
    # Counts characters outside the ASCII range; more than 3 rejects the name
    non_latin_count = 0
    for character in string:
        if ord(character) >= 127:
            non_latin_count += 1
            if non_latin_count > 3:
                return False
    return True


android_latin_char = get_latin_char_apps(android_no_duplicates, 0)
app_store_latin_char = get_latin_char_apps(appstore_data, 1)

print(len(app_store_latin_char))

#check_characters('爱奇艺PPS -《欢乐颂2》电视剧热播')

6183
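
To make the three-character tolerance concrete, the sketch below applies the same idea to a few sample names. `is_mostly_ascii` is a hypothetical stand-in for the notebook's check, the sample names are illustrative, and it uses `> 127` where the notebook uses `>= 127`, so the two differ only for the DEL control character (code point 127).

```python
def is_mostly_ascii(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` non-ASCII characters."""
    non_ascii = sum(1 for ch in name if ord(ch) > 127)
    return non_ascii <= max_non_ascii

print(is_mostly_ascii('Instagram'))                      # True
print(is_mostly_ascii('Docs To Go™ Free Office Suite'))  # True (one symbol)
print(is_mostly_ascii('Instachat 😜'))                    # True (one emoji)
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # False
```

The tolerance keeps English names that include an emoji or a trademark symbol, at the cost of letting through short names written mostly in another script.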


- Our current cleaned data sets are `android_latin_char` and `app_store_latin_char`
- These contain no duplicates and only apps whose names use mostly Latin characters
- In the next cell we isolate all free apps found in the Apple and Android stores


In [59]:
# Printing the headers for column reference:
print(app_store_header)
print('\n')
print(google_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [102]:
def get_free_apps(dataset, index):
    free_apps_list = []

    for row in dataset:
        price = row[index]
        # Free apps are priced '0' (Google Play) or '0.0' (App Store)
        if price == '0' or price == '0.0':
            free_apps_list.append(row)

    return free_apps_list


appstore_free = get_free_apps(app_store_latin_char, 4)
android_free = get_free_apps(android_latin_char, 7)

print(len(android_free))
print(len(appstore_free))
print("")

8864
3222
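
Comparing against the exact strings `'0'` and `'0.0'` works for the values seen above, but it would miss a zero price written any other way. A more tolerant sketch parses the price as a number first. `is_free` is a hypothetical helper, and it assumes that paid prices may carry a leading `$` and that every price is otherwise numeric; a non-numeric value would raise `ValueError`.

```python
def is_free(price_text):
    """Return True if the price string represents zero."""
    # Assumption: prices are numeric, optionally prefixed with '$'
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```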

