# Data Analyst Project 01
## Profitable App Profiles for the App Store and Google Play Markets

**Synopsis**:
Pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.

In [2]:
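def explore_data(dataset, start=0, end=None, rows_and_cols=False):
    # Print rows start..end (end excluded) and, optionally, the dataset's dimensions.
    # end=None slices through the final row; a default of -1 would leave it out.
    dset_slice = dataset[start:end]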
    for row in dset_slice:
        print(row)
        print('\n')
        
    if rows_and_cols:
        print('Number of rows: ', len(dataset))
        print('Number of cols: ', len(dataset[0]))
        
    return None

We will work with two datasets from Kaggle published within the last two years:

- Apple Store data (2018): https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps

- Google Play Store data (2019): https://www.kaggle.com/lava18/google-play-store-apps

In [3]:
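import csv

# Apple Store data
# An explicit encoding avoids platform-dependent decoding errors on app names;
# newline='' is what the csv module recommends for files passed to csv.reader.
with open('AppleStore.csv', encoding='utf8', newline='') as file:
    apple_store_data = list(csv.reader(file))

# Google Play Store data
with open('googleplaystore.csv', encoding='utf8', newline='') as file:
    gplay_store_data = list(csv.reader(file))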

# Separate headers and data
apple_store_header = apple_store_data[0]
apple_store = apple_store_data[1:]

gplay_store_header = gplay_store_data[0]
gplay_store = gplay_store_data[1:]

In [4]:
print(apple_store_header, '\n')
_ = explore_data(apple_store, end=3, rows_and_cols=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows:  7197
Number of cols:  16


### First look at the data:

The Apple store data lists 7,197 apps described by 16 columns. The column names are a little cryptic, but the descriptions on the Kaggle page give us an idea of a few columns of initial interest:

1. `track_name`: App name in store.
2. `currency` and `price`: Currency and price of the app. Since we are interested in free apps, we will want to filter out paid apps.
3. `rating_count_tot` and `rating_count_ver`: Number of ratings for the app in total and for the most recent version, respectively. These give an idea of how popular the app is overall and with its most recent version.
4. `user_rating` and `user_rating_ver`: Average user review scores (using a scale from 0.0-5.0) overall and for the most recent version. These indicate the quality/reception of an app.
5. `cont_rating` and `prime_genre`: Recommended age restrictions and the main category for an app. These help describe (generally) what the app does, and what audience the app is targeted towards.
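
Filtering out paid apps, as mentioned for the `price` column, could look roughly like the sketch below. The function name `split_free_paid` and the toy rows are made up for illustration; the only thing taken from the data is that `price` sits at index 4 of the Apple header and is stored as a string such as `'0.0'`.

```python
PRICE_INDEX = 4  # position of `price` in the Apple header printed above

def split_free_paid(rows, price_index):
    """Split rows into (free, paid) by parsing the price column as a float."""
    free, paid = [], []
    for row in rows:
        if float(row[price_index]) == 0.0:
            free.append(row)
        else:
            paid.append(row)
    return free, paid

# Toy rows (hypothetical, not from the dataset) with only the first five Apple columns.
sample = [
    ['1', 'Free App', '100', 'USD', '0.0'],
    ['2', 'Paid App', '200', 'USD', '1.99'],
]
free, paid = split_free_paid(sample, PRICE_INDEX)
print(len(free), len(paid))  # 1 1
```

Parsing the price as a number rather than comparing strings avoids missing a free app whose price is written as `'0'` instead of `'0.0'`.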

In [5]:
print(gplay_store_header, '\n')
_ = explore_data(gplay_store, end=3, rows_and_cols=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows:  10841
Number of cols:  13


The Google Play store data lists 10,841 apps described by 13 columns. The column names are a little more descriptive, so we can decide directly which columns might be of interest:

1. `App`: App name in store.
2. `Type` and `Price`: Whether the app is 'Free' or 'Paid', and the price of the app (in USD).
3. `Reviews`: The number of ratings for the app in total.
4. `Rating`: The average user review score (using a scale from 0.0-5.0) overall.
5. `Content Rating`, `Category`, and `Genres`: Recommended age restrictions, primary category, and full list of relevant categories for the app. Note that an app belongs to only one `Category`, but possibly many `Genres`.
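
Because one `Genres` value can hold several genres, it may help to split it before counting apps per genre. The sketch below assumes the separator is `;`, which is inferred only from the sample row `'Art & Design;Pretend Play'` printed above; `split_genres` is a hypothetical helper name.

```python
def split_genres(genres_field):
    """Return the individual genres from a `Genres` value, assuming ';' separates them."""
    return [g.strip() for g in genres_field.split(';') if g.strip()]

print(split_genres('Art & Design;Pretend Play'))  # ['Art & Design', 'Pretend Play']
print(split_genres('Art & Design'))               # ['Art & Design']
```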

If we compare the two datasets, we can see that they contain mostly the same types of data we're interested in. One notable difference is that the Apple store provides information about the most recent version of an app as well as its overall history, while the Google Play store provides a finer-grained classification of app types.

## Data Validation

Let's do a quick smoke check for incorrect/missing data. Two of the most common data errors are missing values and duplicated data that should be unique.

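The simplest check for missing data is to look for rows whose length differs from the header's, which indicates that at least one column is missing (or that a value has shifted into the wrong column).

To check for duplicate entries, we group row indices by the value in a key column (here, the app name) and report every value that appears in more than one row.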

In [6]:
def data_smoke_test_missing(dataset, n_cols):
    # Simple smoke test of data looking at how many columns are present for each row    
    out = []
    for idx, row in enumerate(dataset):
        if len(row) != n_cols: # row is missing at least one column's worth of data
            out.append(idx)
    
    return out

def data_smoke_test_duplicate(dataset, key_index):
    # Another simple test looking for uniqueness of values for a given key column.
    # seen_values is keyed by the values present in the key column.
    # For a given key k, seen_values[k] is a list of each row index idx such that dataset[idx][key_index] == k
    # Thus a key k is unique if and only if len(seen_values[k]) == 1
    seen_values = {}
    for idx, row in enumerate(dataset):
        val = row[key_index]
        if val not in seen_values:
            seen_values[val] = [idx]
        else:
            seen_values[val].append(idx)
            
    # filter unique keys from seen_values
    non_unique_keys = {}
    for key,value in seen_values.items():
        if len(value) > 1:
            non_unique_keys[key] = value
    
    return non_unique_keys
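
To see what these two checks report, here is a small self-contained sketch run on made-up rows. It is not the code from the cell above: `find_duplicates` is a hypothetical, more compact variant of the duplicate check built on `collections.defaultdict`, and the toy data exists only for illustration.

```python
from collections import defaultdict

def find_duplicates(dataset, key_index):
    """Map each repeated key value to the list of row indices where it occurs."""
    seen = defaultdict(list)
    for idx, row in enumerate(dataset):
        seen[row[key_index]].append(idx)
    # Keep only the keys that occur in more than one row.
    return {key: idxs for key, idxs in seen.items() if len(idxs) > 1}

# Toy data (hypothetical): 'App A' appears twice, and the last row is missing a column.
toy = [
    ['App A', '4.1'],
    ['App B', '3.9'],
    ['App A', '4.2'],
    ['App C'],
]
n_cols = 2
bad_rows = [idx for idx, row in enumerate(toy) if len(row) != n_cols]

print(bad_rows)                           # [3]
print(find_duplicates(toy, key_index=0))  # {'App A': [0, 2]}
```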

In [27]:
idx_missing_apple = data_smoke_test_missing(apple_store, len(apple_store_header))
idx_missing_gplay = data_smoke_test_missing(gplay_store, len(gplay_store_header))

print("Apple store rows w. missing columns = {}".format(len(idx_missing_apple)))
print("Gplay store rows w. missing columns = {}".format(len(idx_missing_gplay)))

print([apple_store[idx] for idx in idx_missing_apple])
print([gplay_store[idx] for idx in idx_missing_gplay])
    
dup_key_apple = data_smoke_test_duplicate(apple_store, key_index=1)
dup_key_gplay = data_smoke_test_duplicate(gplay_store, key_index=0)

if dup_key_apple:
    print("\nDuplicated apps in apple store:\n")
    for key, idx_dup_apple in dup_key_apple.items():
        print(*[apple_store[idx] for idx in idx_dup_apple], sep='\n')

if dup_key_gplay:
    print("\nA selection of duplicated apps in gplay store:\n")
    # There are many more collisions in the gplay data; we'll show just one as an example
    print(*[gplay_store[idx] for idx in dup_key_gplay['Instagram']], sep='\n')

Apple store rows w. missing columns = 0
Gplay store rows w. missing columns = 0
[]
[]

Duplicated apps in apple store:

['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

A selection of duplicated apps in gplay store:

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'Ju

In [None]:
# Set to True to drop the rows flagged by data_smoke_test_missing
delete = False
if delete:
    apple_store = [row for idx, row in enumerate(apple_store) if idx not in idx_missing_apple]
    gplay_store = [row for idx, row in enumerate(gplay_store) if idx not in idx_missing_gplay]
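
The cell above handles only rows with missing columns; the duplicated rows still need a decision. One possible rule, which is an assumption here and not something the analysis has settled on, is to keep the row with the highest review count for each app name, on the reasoning that it is likely the most recent snapshot. The sketch below uses the first four columns of the two Instagram rows printed earlier plus one made-up row. `keep_most_reviewed` is a hypothetical helper, and it assumes every `Reviews` value parses as an integer, which should be verified on the real data first.

```python
def keep_most_reviewed(dataset, name_index, reviews_index):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        reviews = int(row[reviews_index])
        if name not in best or reviews > int(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

toy = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Other App', 'TOOLS', '4.0', '10'],  # made-up row for illustration
]
cleaned = keep_most_reviewed(toy, name_index=0, reviews_index=3)
for row in cleaned:
    print(row)
```

The same rule would not suit the Apple duplicates shown earlier: those rows have different `id` values and appear to be distinct apps that share a name, so `id` would be the safer key there.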
