# Profitable Apps for the App Store and Google Play

## This project is about:
We'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app: the more users that see and engage with the ads, the better.
### The goal of this project is:
To analyze data to help our developers understand what types of apps are likely to attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

# Opening the Data Files and Storing Them as Lists

In [2]:
from csv import reader

# Pass the encoding explicitly. With the platform default (for example cp1252 on
# Windows) reading fails with:
# UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2693: character maps to <undefined>
with open('Applestore.csv', encoding='utf8') as opened_file:
    ios_data = list(reader(opened_file))

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android_data = list(reader(opened_file))
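
A `'charmap' codec can't decode` error typically appears when a UTF-8 file is read with a different default encoding. The sketch below writes a tiny CSV file with non-ASCII app names and reads it back with an explicit `encoding='utf8'`. The file name `sample.csv` and its rows are hypothetical stand-ins for the real data files.

```python
import csv
import os
import tempfile

# Hypothetical sample rows; the real data files are much larger.
rows = [
    ['id', 'track_name'],
    ['1', 'Docs To Go™ Free Office Suite'],
    ['2', '爱奇艺PPS'],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample.csv')
    # Write and read with the same explicit encoding.
    with open(path, 'w', encoding='utf8', newline='') as csv_file:
        csv.writer(csv_file).writerows(rows)
    with open(path, encoding='utf8', newline='') as csv_file:
        loaded = list(csv.reader(csv_file))

print(loaded == rows)
```

Using `with` also closes each file as soon as the block ends.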

# Deleting the Row with Incorrect Data

In [None]:
print(android_data[10473])  # the row with incorrect data
print(android_data[10474])
print(len(android_data))
del android_data[10473]  # run this cell only once, otherwise a correct row is deleted too
print(android_data[10473])
print(len(android_data))

# Removing Duplicate Entries
## Part One
If we explore the Google Play data set long enough, we'll find that some apps have more than one entry. For instance, the application Instagram has four entries:

In [None]:
# Find duplicate entries: the first occurrence of a name counts as unique, later ones as duplicates
def find_duplicates(dataset):
    duplicate_apps = []
    unique_apps = []
    for row in dataset[1:]:
        if row[0] in unique_apps:
            duplicate_apps.append(row[0])
        else:
            unique_apps.append(row[0])
    return duplicate_apps, unique_apps

In [None]:
dup_apps, uni_apps = find_duplicates(android_data)
print(len(uni_apps))
print(len(dup_apps))
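
To look at the duplicate rows of a single app, such as the Instagram entries mentioned above, we can filter the data set by name. This is a minimal sketch: `rows_for_app` and the small `sample` data set are hypothetical and only illustrate the idea.

```python
def rows_for_app(dataset, app_name, name_index=0):
    """Return every row whose name column equals app_name (header row skipped)."""
    return [row for row in dataset[1:] if row[name_index] == app_name]

# Hypothetical miniature data set: a header row, then name / category / rating / reviews.
sample = [
    ['App', 'Category', 'Rating', 'Reviews'],
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Notes', 'PRODUCTIVITY', '4.0', '7'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]

for row in rows_for_app(sample, 'Instagram'):
    print(row)
```

With the real data, the call would be `rows_for_app(android_data, 'Instagram')`.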

We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

The main difference between the duplicate rows of an app is in the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

1. Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app.
2. Use the dictionary to create a new data set, which will have only one entry per app (for each app, we keep only the entry with the highest number of reviews).

## Part Two
Let's start by building the dictionary.

In [None]:
reviews_max = {}
for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] <= n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

Now, using the dictionary created above, we remove the duplicate entries from the Android data set.

In [None]:
android_clean = []
already_added = []
for row in android_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    # Keep the first row that matches the app's highest review count.
    if reviews_max[name] == n_reviews and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

In [None]:
explore_data(android_clean, 0, 5, True)
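
The two steps can also be merged into a single pass by storing the best row itself instead of only its review count. This is an alternative sketch, not the approach used above: `keep_most_reviewed` and `sample_rows` are hypothetical names, and the column positions are assumed to match the Google Play layout (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row seen when two rows tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    # Dictionaries preserve insertion order in Python 3.7+.
    return list(best.values())

# Hypothetical rows without a header.
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '100'],
    ['Notes', 'PRODUCTIVITY', '4.0', '7'],
    ['Instagram', 'SOCIAL', '4.5', '120'],
]
deduped = keep_most_reviewed(sample_rows)
print(deduped)
```

One difference in behavior: the result is ordered by the first appearance of each app name, not by the position of the row that was kept.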

# A Function to Find Non-English Characters in the App Name
## Part One

Write a function that takes in a string and returns *False* if there's any character in the string that doesn't belong to the set of common English characters (ASCII, code points 0 to 127); otherwise it returns *True*.

In [None]:
def find_ascii(app_name):
    for character in app_name:
        # ASCII covers code points 0 through 127.
        if ord(character) > 127:
            return False
    return True

In [None]:
print(find_ascii('Instagram'))
print(find_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(find_ascii('Docs To Go™ Free Office Suite'))
print(find_ascii('Instachat 😜'))
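
The check relies on `ord`, which returns the Unicode code point of a character; ASCII covers code points 0 through 127. To see which characters cause a name to be rejected, a small helper can list them. `non_ascii_characters` is a hypothetical name used only for this sketch.

```python
def non_ascii_characters(text):
    """Return the characters of text whose code point lies outside 0-127."""
    return [character for character in text if ord(character) > 127]

for name in ['Instagram', 'Docs To Go™ Free Office Suite', 'Instachat 😜']:
    offenders = non_ascii_characters(name)
    # Show each offending character together with its code point.
    print(name, '->', [(character, ord(character)) for character in offenders])
```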

## Part Two
To minimize the impact of data loss, we'll only remove an app if its name has more than three non-ASCII characters:


In [None]:
def is_english(app_name):
    counter = 0  # number of non-ASCII characters seen so far
    for character in app_name:
        if ord(character) > 127:
            counter += 1
    # Tolerate up to three non-ASCII characters (emoji, ™, and so on).
    return counter <= 3

In [None]:
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

Now we use the is_english function to keep only the English apps in the Android and iOS data sets.

In [None]:
android_english = []
ios_english = []
for row in android_clean:
    name = row[0]
    if is_english(name):
        android_english.append(row)
for row in ios_data[1:]:  # skip the header row
    name = row[1]  # the app name is in the second column
    if is_english(name):
        ios_english.append(row)

explore_data(android_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

We can see that we're left with 9614 Android apps and 6183 iOS apps.
# Isolating the Free Apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps, and we'll need to isolate only the free apps for our analysis. Below, we isolate the free apps for both our data sets.

In [None]:
android_final = []
ios_final = []

for row in android_english:
    price = row[7]
    if price == '0':
        android_final.append(row)
for row in ios_english:
    price = row[4]
    if price == '0.0':
        ios_final.append(row)
        
explore_data(android_final, 0, 4, True)
explore_data(ios_final, 0, 4, True)

We're left with 8864 Android apps and 3222 iOS apps, which should be enough for our analysis.
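
The filtering above compares price strings literally (`'0'` and `'0.0'`). A more tolerant check could parse the price as a number instead. This is a minimal sketch, assuming that paid prices may carry a leading dollar sign; `is_free` is a hypothetical helper and not part of the analysis above.

```python
def is_free(price_text):
    """Return True when a price string represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all (for example a malformed row).
        return False

for price in ['0', '0.0', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

The same function would then work for both data sets, regardless of whether zero is written as `'0'` or `'0.0'`.
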
# Most Common Apps by Genre
## Part One
As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:
1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we then develop it further.
3. If the app is profitable after six months, we also build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app to both the App Store and Google Play, we need to find app profiles that are successful in both markets. For instance, a profile that might work well in both markets is a productivity app that makes use of gamification.

Let's begin the analysis by getting a sense of the most common genres for each market. For this, we'll build a frequency table for the prime_genre column of the App Store data set, and the Genres and Category columns of the Google Play data set.
## Part Two
We'll build two functions we can use to analyze the frequency tables:
1. One function to generate frequency tables that show percentages
2. Another function that we can use to display the percentages in a descending order

In [None]:
def freq_table(dataset, index):
    fre_table = {}
    total_apps = 0
    for row in dataset:
        value = row[index]
        total_apps += 1
        if value in fre_table:
            fre_table[value] += 1      
        else:
            fre_table[value] = 1
    #find percentage
    app_percentage = {}
    for key in fre_table:
        app_percentage[key] = (fre_table[key]/total_apps) * 100
    
    return app_percentage

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_value_as_tuple = (table[key], key)
        table_display.append(key_value_as_tuple)
    #now sort
    final_table = sorted(table_display, reverse = True)
    
    for entry in final_table:
        print(entry[1], ':', entry[0])
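
The same kind of table can be sketched with `collections.Counter` from the standard library, which counts values and sorts them by frequency. This is an alternative for comparison, not the implementation used in this project; `freq_table_counter` and `sample_rows` are hypothetical names.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return (value, percentage) pairs sorted from most to least common."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Hypothetical rows: app name, category.
sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'GAME'],
    ['App C', 'TOOLS'],
    ['App D', 'GAME'],
]
for value, percentage in freq_table_counter(sample_rows, 1):
    print(value, ':', percentage)
```

One difference: `most_common` lists values with equal counts in the order they were first seen, while `display_table` orders ties by the value itself in descending order.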

In [None]:
display_table(android_final, 1)  # Category column

In [None]:
display_table(android_final, 9)  # Genres column

In [None]:
display_table(ios_final, -5)  # prime_genre column

# Most popular apps by genre

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but for the App Store data set this information is missing. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.

Below, we calculate the average number of user ratings per app genre on the App Store:


In [None]:
genre_ios = freq_table(ios_final, -5)

for genre in genre_ios:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[-5]
        if genre == genre_app:
            total = total + float(row[5])
            len_genre += 1
    # average number of user ratings per app in this genre
    genre_tot = total/len_genre
    genre_ios[genre] = genre_tot
    print(genre, ':', genre_tot)
print(genre_ios)
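
The loop above scans the whole data set once per genre. The same averages can be computed in a single pass by accumulating a running total and a count for each genre. This is a sketch with hypothetical names (`average_by_group`, `sample_rows`), not the code used in this project.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return {group: mean of the value column} using a single pass over rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Hypothetical rows: app name, number of ratings, genre.
sample_rows = [
    ['App A', '100', 'Games'],
    ['App B', '300', 'Games'],
    ['App C', '50', 'Weather'],
]
averages = average_by_group(sample_rows, group_index=2, value_index=1)
print(averages)
```

For the App Store data, the equivalent call would be `average_by_group(ios_final, -5, 5)`.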

Now let's analyze the Google Play market a bit.
# Most Popular Apps by Genre on Google Play
For the Google Play market, we actually have data about the number of installs, so we should be able to get a clearer picture about genre popularity. However, the install numbers don't seem precise enough — we can see that most values are open-ended (100+, 1,000+, 5,000+, etc.):

In [None]:
display_table(android_final, 5)

One problem with this data is that it is not precise. For instance, we don't know whether an app with 100,000+ installs has 100,000 installs, 200,000, or 350,000. However, we don't need very precise data for our purposes: we only want to get an idea of which app genres attract the most users, and we don't need perfect precision with respect to the number of users.

We're going to leave the numbers as they are, which means that we'll consider that an app with 100,000+ installs has 100,000 installs, and an app with 1,000,000+ installs has 1,000,000 installs, and so on.
To perform computations, however, we'll need to convert each install number to float — this means that we need to remove the commas and the plus characters, otherwise the conversion will fail and raise an error. We'll do this directly in the loop below, where we also compute the average number of installs for each genre (category).
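
The conversion on its own can be sketched as a small helper. `parse_installs` is a hypothetical name used only for illustration; it is not part of the notebook.

```python
def parse_installs(text):
    """Convert an install string such as '1,000,000+' to a float."""
    # Remove the thousands separators and the trailing plus sign before converting.
    return float(text.replace(',', '').replace('+', ''))

for raw in ['100+', '5,000+', '1,000,000+']:
    print(raw, '->', parse_installs(raw))
```

Calling `float('1,000,000+')` directly would raise a `ValueError`, which is why the characters are removed first.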

In [None]:
#using high level categories column for android apps instead of genre column

category_android = freq_table(android_final, 1)

for category in category_android:
    total = 0
    len_category = 0
    for row in android_final:
        category_app = row[1]
        if category == category_app:
            install_ct = row[5]
            install_ct = install_ct.replace('+','')
            install_ct = install_ct.replace(',','')
            total += float(install_ct)
            len_category += 1
    # average number of installs per app in this category
    category_tot = total/len_category
    category_android[category] = category_tot
    print(category, ':', category_tot)
print(category_android)