## Description and goal
This work was carried out in the context of the online Dataquest Python courses.
Hypothetically, as a member of the data science team, my goal for this project is to analyze the given data in order to help our developers understand what types of apps are likely to attract more users. We build apps that are free to download and install, so our main source of revenue is in-app ads. This means that the more users engage with the ads, the better.
The results of the analysis will help our developers understand the needs of the market and guide them toward building apps that users find useful.

## Datasets
The analysis is based on the following data:
1) A data set containing data about approximately 10,000 Android apps from Google Play, collected in August 2018.
2) A data set containing data about approximately 7,000 iOS apps from the App Store, collected in July 2017.

## Personal Statement
In terms of personal development, this project was a good opportunity for me to apply the Python tools I've learned so far. Given the datasets, I had to access and explore the data, apply preprocessing techniques to clean it, and overcome the challenges of missing or unedited data. This required writing specific functions to handle the data and using frequency tables to organize it.

I realized that producing results requires a good understanding of the data I'm working with. I experienced the challenges of working with large, unclear datasets and learned ways to overcome them. Finally, communicating the results to other readers requires clear and comprehensive descriptions of my work.

**Opening and exploring the datasets**

In [None]:
from csv import reader

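def open_dataset(file_name):
    # Read the whole CSV into a list of rows; the file is closed when the block exits.
    # utf8 is specified because some app names contain non-ASCII characters.
    with open(file_name, encoding='utf8', newline='') as opened_file:
        read_file = reader(opened_file)
        data = list(read_file)
    return data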

In [None]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [None]:
app_store_apps = open_dataset("AppleStore.csv")
google_apps = open_dataset('googleplaystore.csv')
app_store_header = app_store_apps[0]
app_store_data = app_store_apps[1:]
google_apps_header = google_apps[0]
google_apps_data = google_apps[1:]

print(app_store_header)
print(google_apps_header)

In [None]:
explore_data(app_store_data, 0, 3, True)

In [None]:
print(google_apps_header)
# Use the header-free list so the reported row count excludes the header row
explore_data(google_apps_data, 0, 3, True)

In [None]:
for app in google_apps_data:
    name = app[0]
    if name == "Facebook":
        print(app)

**Calculating duplicated entries**

In [None]:
duplicate_google_apps = []
unique_google_apps = []

for app in google_apps_data:
    name = app[0]
    if name in unique_google_apps:
        duplicate_google_apps.append(name)
    else:
        unique_google_apps.append(name)
print(len(duplicate_google_apps))
print(len(unique_google_apps))

In [None]:
duplicate_app_store = []
unique_app_store = []

for app in app_store_data:
    name = app[1]
    if name in unique_app_store:
        duplicate_app_store.append(name)
    else:
        unique_app_store.append(name)

print(len(duplicate_app_store))
print(len(unique_app_store))

**Removing duplicate apps, keeping the row with the most recent information on each app (the highest number of user reviews)**

In [None]:
print("Expected length:", len(google_apps_data) - 1181)
#Expected length = 9659

**Google apps with maximum reviews**

In [None]:
google_reviews_max= {}

for app in google_apps_data:
    name = app[0]
    n_reviews_str = app[3]
    
    if "M" in n_reviews_str:
        n_reviews_str = n_reviews_str.replace("M", "")
        n_reviews = float(n_reviews_str) * 1000000
    else:    
        n_reviews = float(app[3])
    
    if name in google_reviews_max and google_reviews_max[name] < n_reviews :
        google_reviews_max[name] = n_reviews
    elif name not in google_reviews_max:
        google_reviews_max[name] = n_reviews
    
    

**iOS apps with maximum reviews**

In [None]:
app_store_reviews_max = {} 

for app in app_store_data :
    name = app[1]
    n_reviews_str = app[5]
    
    if "M" in n_reviews_str:
        n_reviews_str = n_reviews_str.replace("M", "")
        n_reviews = float(n_reviews_str) * 1000000
    else:    
        n_reviews = float(app[5])
    
    if name in app_store_reviews_max and app_store_reviews_max[name] < n_reviews :
        app_store_reviews_max[name] = n_reviews
    elif name not in app_store_reviews_max:
        app_store_reviews_max[name] = n_reviews
        

**Cleaning Google apps**

In [None]:
google_apps_data_clean = []
google_already_added = []

for app in google_apps_data:
    name = app[0]
    n_reviews_str = app[3]
    
    if "M" in n_reviews_str:
        n_reviews_str = n_reviews_str.replace("M", "")
        n_reviews = float(n_reviews_str) * 1000000
    else:    
        n_reviews = float(app[3])
    
    if (n_reviews == google_reviews_max[name]) and (name not in google_already_added):
        google_apps_data_clean.append(app)
        google_already_added.append(name)  
    

**Cleaning iOS apps**

In [None]:
app_store_data_clean = []
app_store_already_added = []

for app in app_store_data:
    name = app[1]
    n_reviews_str = app[5]
    
    if "M" in n_reviews_str:
        n_reviews_str = n_reviews_str.replace("M", "")
        n_reviews = float(n_reviews_str) * 1000000
    else:    
        n_reviews = float(app[5])
    
    if (n_reviews == app_store_reviews_max[name]) and (name not in app_store_already_added):
        app_store_data_clean.append(app)
        app_store_already_added.append(name) 
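
The four cells above repeat the same review-count parsing and maximum lookup. As a minimal sketch of how that logic could be folded into two helpers, the block below uses `parse_review_count` and `keep_most_reviewed`, which are hypothetical names that are not part of the notebook, and a few made-up toy rows rather than the real datasets:

```python
def parse_review_count(raw):
    """Convert a review-count string such as '1234' or '3.0M' to a float."""
    if "M" in raw:
        return float(raw.replace("M", "")) * 1_000_000
    return float(raw)


def keep_most_reviewed(rows, name_index, reviews_index):
    """Return one row per app name: the first row that holds the name's highest review count."""
    best = {}  # name -> (highest review count seen so far, row holding it)
    for row in rows:
        name = row[name_index]
        n_reviews = parse_review_count(row[reviews_index])
        # A strict comparison keeps the first row when several rows tie for the maximum
        if name not in best or n_reviews > best[name][0]:
            best[name] = (n_reviews, row)
    return [row for _, row in best.values()]


# Toy rows (invented for illustration): name, category, rating, reviews
toy_rows = [
    ["App A", "GAME", "4.1", "10"],
    ["App B", "TOOLS", "4.0", "3.0M"],
    ["App A", "GAME", "4.1", "25"],
    ["App A", "GAME", "4.1", "25"],
]

deduped = keep_most_reviewed(toy_rows, 0, 3)
print(deduped)
# [['App A', 'GAME', '4.1', '25'], ['App B', 'TOOLS', '4.0', '3.0M']]
```

With these helpers, the Google Play cleaning would be a call such as `keep_most_reviewed(google_apps_data, 0, 3)` and the App Store cleaning `keep_most_reviewed(app_store_data, 1, 5)`. The sketch returns rows ordered by the first appearance of each name rather than by the position of the kept row, but it keeps the same rows.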


**Distinguishing the English apps**

In [None]:
def set_finder(a_string):
    counter = 0    
    for character in a_string:        
        if ord(character) > 127:
            counter += 1 
        if counter >= 3 :
            return False
        
    return True

In [None]:
english_google_apps = []

for app in google_apps_data_clean:
    name = app[0]
    if set_finder(name):
        english_google_apps.append(app)    

In [None]:
english_app_store = []

for app in app_store_data_clean:
    name = app[1]
    if set_finder(name):
        english_app_store.append(app)

In [None]:
free_google_apps = []
non_free_google_apps = []
for app in english_google_apps:
    name = app[0]
    price = app[7]
    if price == "0":
        free_google_apps.append(app)
    else:
        non_free_google_apps.append(name)
    
print(len(free_google_apps))
print(len(non_free_google_apps))

In [None]:
free_app_store = []
non_free_app_store = []

for app in english_app_store:
    name = app[1]
    price = app[4]
    
    if price == "0.0":
        free_app_store.append(app)
    else:
        non_free_app_store.append(name)
print(len(free_app_store))
        

**Building frequency tables for the "prime_genre" column of the App Store dataset and "Genres and Category" columns of the Google Play dataset**


**Generating frequency tables that show percentages**

In [None]:
def freq_table(dataset, index):
    table = {}
    
    for row in dataset:
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
        
    table_percent = {}
    for value in table:
        table_percent[value] = table[value] / len(dataset) * 100
    
    return table_percent
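
As a quick sanity check of the idea, here is a small sketch that computes the same percentages with `collections.Counter` from the standard library. The function name `freq_table_counter` and the toy rows are assumptions made for this example and do not appear elsewhere in the notebook:

```python
from collections import Counter


def freq_table_counter(dataset, index):
    """Return {value: percentage of rows holding that value in column `index`}."""
    counts = Counter(row[index] for row in dataset)
    n_rows = len(dataset)
    return {value: count / n_rows * 100 for value, count in counts.items()}


# Toy rows (invented for illustration): name, genre
toy_rows = [
    ["App A", "Games"],
    ["App B", "Games"],
    ["App C", "Education"],
    ["App D", "Games"],
]

toy_table = freq_table_counter(toy_rows, 1)
print(toy_table)
# {'Games': 75.0, 'Education': 25.0}
```

Three of the four toy rows are games, so "Games" gets 75% and "Education" 25%, and the percentages of a table add up to 100.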

**Displaying the percentages in a descending order**

In [None]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [None]:
display_table(free_app_store, 11)

**App Store apps - percentage table analysis**

Looking at the percentage table created for the prime_genre column, we can see that "Games" is the dominant category among the free English apps (58%). "Entertainment" follows with 7% of the dataset, and "Photo and Video" apps account for almost 5%. "Education" and "Social Networking" have similar shares, around 3.5% each. The percentages of the remaining categories range between 0.1% and 3%.

The general impression is that the App Store is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.), while apps with practical purposes (education, shopping, utilities, productivity, lifestyle, etc.) are rarer. However, the fact that fun apps are the most numerous doesn't imply that they also have the greatest number of users: demand might not match supply.

In [None]:
display_table(free_google_apps, 1) #Category

The landscape seems significantly different on Google Play: there are not that many apps designed for fun, and a good number of apps are designed for practical purposes (family, tools, business, lifestyle, productivity, etc.). However, if we investigate this further, we can see that the family category (which accounts for almost 19% of the apps) consists mostly of games for kids.
Even so, practical apps seem to be better represented on Google Play than on the App Store. This picture is also confirmed by the frequency table for the Genres column:

In [None]:
display_table(free_google_apps, 9) #Genres

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but it is missing from the App Store data set. As a workaround, we'll use the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

**Calculating the average number of user ratings per app genre on the App Store**

In [None]:
genres_ios = freq_table(free_app_store, -5)

for genre in genres_ios:
    total = 0  # sum of user ratings for this genre
    len_genre = 0  # number of apps in this genre
    for app in free_app_store:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1  # count one more app in this genre
    avg_ratings = total / len_genre
    print(genre, ":", avg_ratings)

Looking at the results, "Games" has by far the largest total number of user ratings (42,705,789). Since "Games" is also the genre with the most apps, that total mainly reflects how many games there are. The per-genre averages printed above are the figures to compare when judging how popular a typical app in each genre is.

In [None]:
google_installs = freq_table(free_google_apps, 1)
for category in google_installs:
    total = 0
    len_category = 0
    for app in free_google_apps:
        category_app = app[1]
        if category_app == category:
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_n_installs = total / len_category
    print(category, ":", avg_n_installs)
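
The nested loop above scans the whole dataset once per category. As a minimal sketch of a single-pass alternative, the block below accumulates totals and counts in dictionaries and prints the averages in descending order. The names `parse_installs` and `average_installs_by_category` are hypothetical, and the toy rows are invented for illustration:

```python
from collections import defaultdict


def parse_installs(raw):
    """Convert an installs string such as '1,000+' to a float."""
    return float(raw.replace(",", "").replace("+", ""))


def average_installs_by_category(rows, category_index, installs_index):
    """Return {category: mean number of installs} using one pass over the rows."""
    totals = defaultdict(float)  # category -> sum of installs
    counts = defaultdict(int)    # category -> number of apps
    for row in rows:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}


# Toy rows (invented for illustration): name, category, installs
toy_rows = [
    ["App A", "GAME", "1,000+"],
    ["App B", "GAME", "3,000+"],
    ["App C", "TOOLS", "500+"],
]

toy_averages = average_installs_by_category(toy_rows, 1, 2)
for category, average in sorted(toy_averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ":", average)
# GAME : 2000.0
# TOOLS : 500.0
```

On the real data the call would be `average_installs_by_category(free_google_apps, 1, 5)`. Install counts such as "1,000+" are lower bounds, so the averages are approximate either way.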