# Analysis of Mobile Apps

**Functional Goal**: We are interested in apps with a maximal number of users. We assume that all apps are free to download and generate income based on the size of their user base.

In this analysis, we use datasets from the App Store and Google Play to better understand which genres attract the largest number of users.

**Datasets**: 
* A sample of approximately 10,000 Android apps from Google Play, collected in August 2018. [Link to download .csv file](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
* A sample of approximately 7,000 iOS apps from the App Store, collected in July 2017. [Link to download .csv file](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).

The entirety of this analysis was done **sans Pandas**. Instead, the datasets were analyzed using pure Python and bespoke functions, with a touch of help from very common packages. Let's begin by reading in the datasets and importing some basic libraries.

In [7]:
import csv
import os
import requests

Define a reusable function that loads a dataset and converts it to a list of lists for easier, iterative analysis.

In [18]:
def open_data(dataset, headers=True):
    '''
    A function to convert .csv files to a list-of-lists.

    Args:
        dataset (str): Relative path to a dataset in .csv format.
        headers (bool): If True (the default), the header row is included as the first row. Set to False to return only the rows with index 1 and higher.
        
    Dependencies:
        import csv

    Returns:
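        apps_data (list): A list of lists, one inner list per row.
    '''
    # Use a context manager so the file is closed after reading.
    with open(dataset, encoding='utf8', newline='') as opened_file:
        read_file = csv.reader(opened_file)
        apps_data = list(read_file)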

    if headers:
        return apps_data
    else:
        return apps_data[1:]


Load in the data and save the list-of-lists to separate variables.

In [23]:
android = open_data('./../googleplaystore.csv')
apple = open_data('./../AppleStore.csv')

Define a function for creating a data overview.

In [24]:
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    A function to look at a subset of rows in a list of lists.

    Args:
        dataset (list): A list of lists containing data. 
        start (int): The index of the first row to include in the slice.
        end (int): The index at which the slice stops; the row at this index is not included.
        rows_and_columns (bool): If True, also print the number of rows and columns in the dataset.

    Returns:
        None
    '''
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print(f"Number of rows: {len(dataset)}")
        print(f"Number of columns: {len(dataset[0])}")

In [21]:
explore_data(android, 0, 3, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [25]:
explore_data(apple, 0, 3, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


### Detecting Duplicates

Goals: 
* Remove the malformed row and duplicate entries.
* Remove non-English apps.
* Remove paid apps.

One of the rows in the Android dataset is missing its `Category` value, which shifts every later value one column to the left (a frameshift), so the `Rating` column ends up holding the review count. See this [Kaggle post](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) for more details.

In [26]:
explore_data(android, 10470, 10475)

['TownWiFi | Wi-Fi Everywhere', 'COMMUNICATION', '3.9', '2372', '58M', '500,000+', 'Free', '0', 'Everyone', 'Communication', 'August 2, 2018', '4.2.1', '4.2 and up']


['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49', '4.0M', '10,000+', 'Free', '0', 'Everyone', 'Communication', 'February 10, 2017', '0.1', '2.3 and up']


['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




The index of the row with the error in the `android` dataset is 10473.
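
A row with a missing value has fewer fields than the header, so the malformed row can also be located programmatically rather than by inspection. The following is a minimal sketch under that assumption; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample rows are abbreviated from the output above.

```python
def find_malformed_rows(dataset):
    """Return the indices of rows whose length differs from the header's.

    Hypothetical helper for illustration; assumes dataset[0] is the header row.
    """
    expected_length = len(dataset[0])
    return [
        index
        for index, row in enumerate(dataset)
        if len(row) != expected_length
    ]


# Tiny stand-in for the Android dataset: a header plus two rows,
# the second of which is missing its 'Category' value.
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19'],
]

malformed = find_malformed_rows(sample)
print(malformed)
```

Run against the full dataset, `find_malformed_rows(android)` should include index 10473 among its results, provided the header row is still at index 0.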

In [29]:
if 'Life Made WI-Fi Touchscreen Photo Frame' in android[10473]:
    del android[10473]
    print(len(android))

Some apps are also duplicated; take Instagram as an example.

In [30]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


Looking at the duplicated Instagram entries, we notice something important: the column with index 3 is the only value that changes. This is the `Reviews` column (the number of reviews), and we can assume that the highest value here corresponds to the most recent data. In a future step, we will remove all duplicate entries and keep only the row with the highest number of reviews.

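Instead of searching for each duplicated app by hand, though, we can write a loop that iterates through the list, checks whether each app name has already been seen, and records the duplicates. Because `android` still includes its header row, the header label `App` is counted as one of the unique names below.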

In [31]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

In [32]:
print(f"There are {len(duplicate_apps)} duplicated apps.")
print(f"There are {len(unique_apps)} unique apps in the Android dataset.")
print(f"Examples of duplicate apps: {duplicate_apps[:10]}")

There are 1181 duplicated apps.
There are 9660 unique apps in the Android dataset.
Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack']
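
Membership tests on a list scan the whole list, so the loop above slows down as `unique_apps` grows. As an alternative sketch, `collections.Counter` from the standard library tallies every name in a single pass. The helper `count_duplicates` is a hypothetical name introduced here for illustration, and the sample rows are abbreviated from rows shown earlier.

```python
from collections import Counter


def count_duplicates(dataset, name_column=0):
    """Return (number of duplicate rows, number of unique names).

    Hypothetical helper for illustration; every row in `dataset` is counted,
    so pass the data without its header row if the header should be ignored.
    """
    name_counts = Counter(row[name_column] for row in dataset)
    unique_total = len(name_counts)
    # Every occurrence beyond the first of each name is a duplicate.
    duplicate_total = sum(name_counts.values()) - unique_total
    return duplicate_total, unique_total


# Abbreviated rows: app name, category, rating, number of reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

duplicates, uniques = count_duplicates(sample)
print(f"There are {duplicates} duplicated apps.")
print(f"There are {uniques} unique apps.")
```

A `Counter` stores names in a hash table, so each lookup takes roughly constant time regardless of how many names have already been seen.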


### Removing Duplicates

Define a function that checks whether an app already appears in the dataset and, for each duplicated app, keeps only the row with the highest number of reviews.

In [33]:
# Strategy: 
# Loop through the dataset. Populate a dictionary in which each key is an app's name and each value is that app's number of reviews. 
# If a row's app name is already in the dictionary, compare its number of reviews to the dictionary value. 
# If the new count is higher, replace the item in the dictionary. If it is lower, discard the row. 

def remove_duplicates(dataset, name_column, value_column, print_results=False):
    '''
    A function that removes duplicate rows from a list of lists, keeping only the row with the highest value in a specified column.

    Args:
        dataset (list): A list of lists containing data. 
        name_column (int): The index of the column in which duplicates may appear (e.g. the app names in our example).
        value_column (int): The index of the column whose values are stored in the comparator dictionary (e.g. the number of reviews for each app in our example).
        print_results (bool): If True, print sentences displaying the before-and-after results.

    Returns:
        dataset (list): A deduplicated dataset. 
    '''
    pass

    # if print_results:
    #     print()
    #     print()
    #     print()
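
The function body above is still a stub. The following is one possible way to fill it in, offered as a sketch rather than as the notebook's final implementation. It follows the dictionary strategy outlined in the comments above, but stores the whole row for each name so that no rows need to be deleted from the list while iterating over it (which can cause elements to be skipped). The name `remove_duplicates_sketch` is hypothetical, and the sketch assumes that the header row has been excluded and that the value column holds numeric strings.

```python
def remove_duplicates_sketch(dataset, name_column, value_column, print_results=False):
    """Keep, for each name, only the row with the highest value.

    Illustrative assumptions: `dataset` has no header row, and
    `value_column` holds numeric strings such as '66577446'.
    """
    best_rows = {}  # name -> row with the highest value seen so far
    for row in dataset:
        name = row[name_column]
        value = float(row[value_column])
        if name not in best_rows or value > float(best_rows[name][value_column]):
            best_rows[name] = row

    # Dictionaries preserve insertion order (Python 3.7+), so rows come back
    # in the order in which each name first appeared.
    deduplicated = list(best_rows.values())

    if print_results:
        print(f"Rows before deduplication: {len(dataset)}")
        print(f"Rows after deduplication: {len(deduplicated)}")
    return deduplicated


# Abbreviated rows: app name, category, rating, number of reviews.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Jazz Wi-Fi', 'COMMUNICATION', '3.4', '49'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean = remove_duplicates_sketch(sample, name_column=0, value_column=3)
for row in clean:
    print(row)
```

For the Android data, a call might look like `remove_duplicates_sketch(android[1:], 0, 3, print_results=True)`, where the slice skips the header row. The input list is left unmodified, so the original data remains available for comparison.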