# Profitable App Profiles for the App Store and Google Play Markets

Our _goal_ for this project is to analyze data to help our developers understand what type of apps are likely to attract more users on Google Play and the App Store. To do this, we'll need to collect and analyze data about mobile apps available on Google Play and the App Store.

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

## Datasets
- A data set containing data about approximately ten thousand Android apps from Google Play. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- A data set containing data about approximately seven thousand iOS apps from the App Store. You can download the data set directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).


In [1]:
from csv import reader

# read Apple Store
opened_ios = open('AppleStore.csv')
read_ios = reader(opened_ios)
ios = list(read_ios)
header_ios = ios[0]
ios = ios[1:]
# read Google Play Store
opened_android = open('googleplaystore.csv')
read_android = reader(opened_android)
android = list(read_android)
header_android = android[0]
android = android[1:]

## Opening and exploring the data
- To make them easier for you to explore, we created a function named explore_data() that you can repeatedly use to print rows in a readable way.
- Print length of both datasets
- Print column names (aka headers) for both datasets
- Print 3 first rows of both datasets

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print() # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        
#exploring ios dataset       
print('Apple Store dataset includes', len(ios), 'rows', '\n')
print('Column names are: ', header_ios, '\n')
explore_data(ios, 0, 3, False)

#exploring android dataset
print('Google Play Store dataset includes', len(android), 'rows','\n')
print('Column names are: ', header_android)
explore_data(android, 0, 3, False)

Apple Store dataset includes 7197 rows 

Column names are:  ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']

['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']

['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']

Google Play Store dataset includes 10841 rows 

Column names are:  ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIG

### Data cleaning
To make sure the data we analyze is accurate we have to:

- Detect inaccurate data, and correct or remove it.
- Detect duplicate data, and remove the duplicates.

The first step is to tailor the data for our aim:
- Remove apps that aren't free.
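
As a minimal sketch of this filter on a few made-up rows: parsing the price as a number, instead of comparing against one exact string, is an assumption added here so that values such as `'0'` or `'$0.00'` are also treated as free. The helper name `is_free` and the sample rows are hypothetical and not part of the datasets.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' means free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Non-numeric values (for example a shifted column) are treated as not free.
        return False

# Hypothetical rows: [app name, price]
sample_rows = [
    ['App A', '0.0'],
    ['App B', '$4.99'],
    ['App C', '0'],
    ['App D', 'Everyone'],
]
free_rows = [row for row in sample_rows if is_free(row[1])]
print([row[0] for row in free_rows])
```

A numeric comparison is a little more tolerant of formatting differences between the two stores than a string comparison, at the cost of silently skipping rows whose price cannot be parsed.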

In [3]:
#finding free apps for Apple Store (the English-name filter is applied later)
ios_free = []
for row in ios:
    if row[4] == '0.0':
        ios_free.append(row)
print('Apple Store free dataset includes', len(ios_free), 'rows', '\n')

#finding free apps for Google Play Store (the English-name filter is applied later)
android_free = []
for row in android:
    if row[6] == 'Free':
        android_free.append(row)
print('Google Play Store free dataset includes', len(android_free), 'rows', '\n')

Apple Store free dataset includes 4056 rows 

Google Play Store free dataset includes 10039 rows 



## Removing duplicates from the datasets (part 1)
As a first step, we check both datasets for duplicates and print some of them to see how closely the duplicate rows match each other.
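
The duplicate check can also be sketched with `collections.Counter` from the standard library, which avoids repeated list membership tests. The rows below are made up for illustration, and counting each extra occurrence as one duplicate is the same convention the cell below uses.

```python
from collections import Counter

# Hypothetical rows: [app name, number of reviews]
sample_rows = [
    ['Quick Scan', '80805'],
    ['Quick Scan', '80804'],
    ['Quick Scan', '80805'],
    ['Notes', '120'],
]

name_counts = Counter(row[0] for row in sample_rows)
# Each occurrence beyond the first counts as one duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
duplicated_names = [name for name, count in name_counts.items() if count > 1]
print(n_duplicates, duplicated_names)
```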

In [4]:
#checking for duplicates for Apple Store
unique_names_ios = []
duplicate_names_ios = []
for row in ios_free:
    name = row[1]
    if name in unique_names_ios:
        duplicate_names_ios.append(name)
    else: 
        unique_names_ios.append(name)
print('There are', len(duplicate_names_ios), 'duplicates in the Apple Store dataset.', '\n')
for row in ios_free:
    name = row[1]
    if name in duplicate_names_ios[:1]:
        print(row)

#checking for duplicates for Google Play Store
unique_names_android = []
duplicate_names_android = []
for row in android_free:
    name = row[0]
    if name in unique_names_android:
        duplicate_names_android.append(name)
    else: 
        unique_names_android.append(name)
print('There are', len(duplicate_names_android), 'duplicates in the Google Play Store dataset.', '\n')
print('Here are duplicate rows for one application. We can see how they differ from each other.', '\n')
for row in android_free:
    name = row[0]
    if name in duplicate_names_android[:1]:
        print(row)

There are 2 duplicates in the Apple Store dataset. 

['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
There are 1135 duplicates in the Google Play Store dataset. 

Here are duplicate rows for one application. We can see how they differ from each other. 

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'Fe

## Removing duplicates from the datasets (part 2)
If you examine the duplicate rows, the main difference is in the number of user ratings or reviews (column 6 in the Apple Store dataset and column 4 in the Google Play Store dataset). The different numbers suggest the data was collected at different times.
So we keep only the row with the maximum number of ratings or reviews for each app and add it to the cleaned dataset.
Finally, we check the number of rows in the cleaned datasets by computing it in several different ways and confirming that the counts agree.
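
The idea of keeping one row per app, the one with the most reviews, can be sketched on made-up rows with a single dictionary keyed by app name. The sample data is hypothetical; when two rows tie, this sketch keeps the first one it sees.

```python
# Hypothetical rows: [app name, number of reviews]
sample_rows = [
    ['Quick Scan', '80805'],
    ['Quick Scan', '80804'],
    ['Notes', '120'],
    ['Quick Scan', '80805'],
]

best_rows = {}
for row in sample_rows:
    name, n_reviews = row[0], int(row[1])
    # Replace the stored row only when this one has strictly more reviews.
    if name not in best_rows or n_reviews > int(best_rows[name][1]):
        best_rows[name] = row

deduplicated = list(best_rows.values())
print(deduplicated)
```

The length of `deduplicated` should equal the number of distinct app names, which is the consistency check described above.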

In [5]:
#iterate over Apple Store dataset
clean_data_ios_dict = {}

for row in ios_free:
    name = row[1]
    n_ratings = float(row[5])
    
    if name not in clean_data_ios_dict or n_ratings > float(clean_data_ios_dict[name][5]):
        clean_data_ios_dict[name] = row

clean_data_ios = list(clean_data_ios_dict.values())

# the three counts below should agree if the duplicates were removed correctly
print('Expected length of cleaned Apple Store dataset is', len(clean_data_ios_dict))
print('Test length of cleaned Apple Store dataset is', len(ios_free)-len(duplicate_names_ios))
print('Length of final clean Apple Store dataset is:', len(clean_data_ios), '\n')

#iterate over Google Play Store dataset
reviews_max = {}
clean_data_android = []
already_added_android = []

# first pass: find the highest number of reviews for each app name
for row in android_free:
    name = row[0]
    n_reviews = float(row[3])

    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews

    elif name not in reviews_max:
        reviews_max[name] = n_reviews

# second pass: keep one row per app, the one with the highest number of reviews
for row in android_free:
    name = row[0]
    n_reviews = float(row[3])

    if name not in already_added_android and reviews_max[name] == n_reviews:
        clean_data_android.append(row)
        already_added_android.append(name)

print('Expected length of cleaned Google Play Store dataset is', len(reviews_max))
print('Test length of cleaned Google Play Store dataset is', len(android_free)-len(duplicate_names_android))
print('Length of final clean Google Play Store dataset is:', len(clean_data_android), '\n')

The next step is to remove applications with non-English names from both datasets, so that we analyze only applications aimed at English-speaking markets.
To do that, we write a function that checks whether an application's name looks English and apply it to both datasets.
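
A minimal sketch of such a check, assuming that a name with at most three characters outside the ASCII range (code points above 127) counts as English. The function name `looks_english` and the parameter `max_non_ascii` are hypothetical; the tolerance lets names containing an emoji or a trademark sign pass.

```python
def looks_english(name, max_non_ascii=3):
    """Return True when at most `max_non_ascii` characters fall outside ASCII."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

print(looks_english('Instagram'))
print(looks_english('Docs To Go™ Free Office Suite'))
print(looks_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
```

This is only a heuristic: it keeps some non-English names written in Latin script and can drop English names decorated with several emoji.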

In [None]:
#define a function to check the name
def isEnglish(app):
    number_of_false = 0
    
    for letter in app:
        if ord(letter) > 127:
            number_of_false += 1
            
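    # allow up to three non-ASCII characters (emoji, trademark signs, etc.)
    # and always return a boolean instead of an implicit None
    return number_of_false < 4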
        
#iterate over Apple Store dataset
AppleStore = []

for row in clean_data_ios:
    name = row[1]
    if isEnglish(name):
        AppleStore.append(row)
print("Final list of Apple Store apps has {} rows".format(len(AppleStore)))

#iterate over Google Play Store
GooglePlayStore = []

for row in clean_data_android:
    name = row[0]
    if isEnglish(name):
        GooglePlayStore.append(row)
print("Final list of Google Play Store apps has {} rows".format(len(GooglePlayStore)))


### Most common apps by genre
Our end goal now is to find an idea for an application that could be successful in both markets, the Apple Store and the Google Play Store.
We will first develop an app for the Google Play Store and, if it is successful, roll it out to the Apple Store.
The first step is to identify the most common genres. We will use the column 'prime_genre' for the Apple Store (12th position) and the column 'Genres' for the Google Play Store (10th position) to count the most common genres.
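
As a rough sketch of the counting step, a percentage table can be built with `collections.Counter`. The helper name `percentage_table` and the sample rows are hypothetical; the column index is passed in the same way as in the cell below.

```python
from collections import Counter

def percentage_table(rows, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    return [(value, round(count / total * 100, 2))
            for value, count in counts.most_common()]

# Hypothetical rows: [app name, genre]
sample_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Education'], ['d', 'Games']]
for genre, percentage in percentage_table(sample_rows, 1):
    print(genre, ':', percentage)
```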

In [None]:
#refresh the names of the columns
print('Column names for Apple Store dataset are: ', header_ios, '\n')
print('Column names for Google Play Store are: ', header_android, '\n')

#define function to create a frequency table
def freq_table(dataset, index):
    freq_table = {}
    total = len(dataset)
    
    for row in dataset:
        token = row[index]
        if token in freq_table:
            freq_table[token] += 1
        else:
            freq_table[token] = 1
    
    freq_percentages = {}
    for key in freq_table:
        percentage = (freq_table[key]/total)*100
        freq_percentages[key] = round(percentage, 2)
        
    return freq_percentages

#define function to present frequency table as a list of tuples
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

#iterate over Apple Store dataset
prime_genres = freq_table(AppleStore, 11)
print('The column prime genre includes {} genres total.'.format(len(prime_genres)), '\n')
display_table(AppleStore, 11)
print('\n')

#iterate over Google Play Store dataset
genres = freq_table(GooglePlayStore, 9)
print('The column Genres includes {} genres total.'.format(len(genres)), '\n')
display_table(GooglePlayStore, 9)
print('\n')
category = freq_table(GooglePlayStore, 1)
print('The column Category includes {} categories total.'.format(len(category)), '\n')
display_table(GooglePlayStore, 1)

## What observations can we make based on this data?
**Apple Store dataset**
- Data from Apple Store shows less variety in genres. 
- The most common genre at Apple Store is Games (more than half of all apps).
- 4 of the top 5 genres are meant for entertainment rather than for practical purposes.

**Google Play Store**

- Data from the Google Play Store is harder to analyze because the column Genres can include multiple genres per app, so the large number of distinct values (114 total) makes the data noisy. Parsing the combined values into separate genres might help.
- The top 5 genres, as well as the top 5 categories, show that there are more practically focused apps there.

**Comparison**

- A large number of apps in a certain genre still doesn't mean that all of them are commercially successful. We should also explore the number of users as well as the user retention rate (although for that we would need historical data).

One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
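
The averaging step can be sketched in a single pass over the rows, without a nested loop. The helpers `parse_installs` and `average_by_group` and the sample rows are hypothetical. Note that install values such as `'1,000,000+'` are open-ended buckets, so stripping the `+` gives only a lower bound and the averages are approximate.

```python
from collections import defaultdict

def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to the integer 1000000."""
    return int(text.replace(',', '').replace('+', ''))

def average_by_group(rows, group_index, value_index, parse=float):
    """Average a numeric column per group in a single pass over the rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += parse(row[value_index])
        counts[group] += 1
    return {group: round(totals[group] / counts[group]) for group in totals}

# Hypothetical rows: [app name, category, installs]
sample_rows = [
    ['App A', 'GAME', '1,000,000+'],
    ['App B', 'GAME', '500,000+'],
    ['App C', 'TOOLS', '10,000+'],
]
averages = average_by_group(sample_rows, 1, 2, parse=parse_installs)
print(averages)
```

The same helper works for the ratings proxy by leaving `parse` at its default of `float` and pointing `value_index` at the ratings column.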

In [None]:
#compute the average number of user ratings (column rating_count_tot) per genre for Apple Store dataset
ratings_dict = {}
# FIXME: can be simplified without the need for nested loop
for genre in prime_genres:
    total = 0
    len_genre = 0
    for row in AppleStore:
        genre_app = row[11]
        if genre_app == genre:
            ratings = float(row[5])
            total += ratings
            len_genre += 1
    avg_ratings = round(total/len_genre)
    #create dictionary
    ratings_dict[genre] = avg_ratings
#sort and print resulting dictionary
for genre in sorted(ratings_dict, key=ratings_dict.get, reverse=True):
    print(genre, ratings_dict[genre])   
print()

###
# FIXME: remove the older version
# {genre: [app_1_rating, app_2_rating, ...], ...}
ratings_dict_2 = {}
for row in AppleStore:
    genre_app = row[11]
    rating = float(row[5])
    ratings_dict_2.setdefault(genre_app, []).append(rating)
# calculate
avg_ratings = {genre: round(sum(ratings)/len(ratings)) for genre, ratings in ratings_dict_2.items()}
assert ratings_dict == avg_ratings

for genre in sorted(ratings_dict, key=ratings_dict.get, reverse=True)[:10]:
    print(genre, ratings_dict[genre])   

for genre in sorted(avg_ratings, key=avg_ratings.get, reverse=True)[:10]:
    print(genre, avg_ratings[genre])   

###

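#compute the average number of installs per category for Google Play Store dataset
installs_dict = {}
# FIXME: can be simplified without the need for nested loop
for cat in category:
    total = 0
    len_genre = 0  # counter name must match the one incremented and divided by below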
    for row in GooglePlayStore:
        category_app = row[1]
        
        if category_app == cat:
            installs = (row[5])
            #format number of installs to get rid of "+" and "," and convert to float
            n_installs = installs.replace('+', '')
            n_installs = n_installs.replace(',', '')
            
            total += float(n_installs)
            len_genre += 1
            
    avg_installs = round(total/len_genre)
    #create dictionary
    installs_dict[cat] = avg_installs
#sort and print resulting dictionary
for cat in sorted(installs_dict, key=installs_dict.get, reverse=True):
    print(cat, installs_dict[cat])

To check the hypothesis that the most downloaded categories are dominated by a few industry leaders, we show the top apps in the top 3 categories: communication, books_and_reference and game.
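
One possible way to pick the top categories and their leading apps programmatically, instead of hard-coding category names and install strings, is sketched below. All names, rows and average values here are hypothetical, and the threshold of 100,000,000 installs is an assumption chosen to mirror the buckets checked in the cell below.

```python
def parse_installs(text):
    """Convert an install bucket such as '1,000,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

def top_apps_by_category(rows, categories, min_installs):
    """Map each category to its (app, installs) pairs at or above a threshold."""
    result = {category: [] for category in categories}
    for name, category, installs_text in rows:
        installs = parse_installs(installs_text)
        if category in result and installs >= min_installs:
            result[category].append((name, installs))
    return result

# Hypothetical rows: [app name, category, installs]
sample_rows = [
    ['Chat One', 'COMMUNICATION', '1,000,000,000+'],
    ['Chat Two', 'COMMUNICATION', '10,000+'],
    ['Puzzle', 'GAME', '100,000,000+'],
    ['Flashlight', 'TOOLS', '500,000,000+'],
]
# Hypothetical average installs per category (in millions).
average_installs = {'COMMUNICATION': 38, 'GAME': 15, 'TOOLS': 10}

top_categories = sorted(average_installs, key=average_installs.get, reverse=True)[:2]
leaders = top_apps_by_category(sample_rows, top_categories, 100_000_000)
print(leaders)
```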

In [None]:
#check apps mostly installed at Google Play Store
#communication
print('The most popular apps in communication category are:', '\n')
# FIXME: parse installations into integers?
# FIXME: can the top categories be extracted programmatically?
for row in GooglePlayStore:
    if row[1] == 'COMMUNICATION' and row[5] == '1,000,000,000+':
        print(row[0], ':', row[5])
print('\n')
#books_and_reference
print('The most popular apps in books_and_reference category are:', '\n')
for row in GooglePlayStore:
    if row[1] == 'BOOKS_AND_REFERENCE' and (row[5] == '100,000,000+'
                                            or row[5] == '500,000,000+' 
                                            or row[5] == '1,000,000,000+'):
        print(row[0], ':', row[5])
print('\n')
#game
print('The most popular apps in game category are:', '\n')
for row in GooglePlayStore:
    if row[1] == 'GAME' and (row[5] == '100,000,000+'
                                            or row[5] == '500,000,000+' 
                                            or row[5] == '1,000,000,000+'):
        print(row[0], ':', row[5])

Games looks like a very diverse category with many players.
Next, we check the top prime_genre values in the Apple Store dataset.

In [None]:
#check apps mostly rated at Apple Store
# FIXME: can the top genres be extracted programmatically?
#navigation
print('There are few apps in the Navigation genre:')
for row in AppleStore:
    if row[11] == 'Navigation':
        print(row[1], ':', row[5])
print('\n')
#reference
print('There are more apps in the Reference genre, which is also dominated by dictionaries and religious texts:')
for row in AppleStore:
    if row[11] == 'Reference':
        print(row[1], ':', row[5])
print('\n')
#social networking
print('There are plenty of apps in the Social Networking genre, but it is dominated by a few of them:')
for row in AppleStore:
    if row[11] == 'Social Networking':
        print(row[1], ':', row[5])

### Final recommendation:
- either create a game
- or build an app based on some religious text
- or mix both (for example, a quest game based on the Satanic Bible)