**Profitable Apps Data Analysis Project**

The goal of this project is to analyze data about mobile apps available on Google Play and the App Store, to help our developers understand which types of apps are likely to attract more users.

This is a guided project.

In [1]:
from csv import reader

# open the App Store data (the with block closes the file afterwards)
with open('AppleStore.csv', encoding='utf8') as apple_file:
    ios_data = list(reader(apple_file))
ios_header = ios_data[0]
ios = ios_data[1:]

# open the Google Play data
with open('googleplaystore.csv', encoding='utf8') as android_file:
    android_data = list(reader(android_file))
android_header = android_data[0]
android = android_data[1:]
    

The `explore_data()` function does the following:

- Takes in four parameters:
  - `dataset`, which will be a list of lists.
  - `start` and `end`, which will both be integers and represent the starting and ending indices of a slice of `dataset`.
  - `rows_and_columns`, which is expected to be a Boolean and has `False` as its default value.
- Slices the `dataset` using `dataset[start:end]`.
- Loops through the slice and, for each iteration, prints a `row` and adds blank space after it using `print('\n')`.
  - The `\n` in `print('\n')` is the newline character. It isn't displayed as text; it starts a new line, so `print('\n')` adds some blank space between rows.
- Prints the number of rows and columns if `rows_and_columns` is `True`.
  - `dataset` shouldn't include a header row, or the function will report one more row than the data actually contains.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Explore both datasets using the `explore_data()` function and print the first few rows. To find the number of rows and columns in each dataset, pass `True` as the `rows_and_columns` argument.

In [3]:
#explore data 
print(ios_header)
print('\n')
explore_data(ios, 0, 5, True)

print(android_header)
print('\n')
explore_data(android, 1, 4, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


['284035177', 'Pandora - Music & Radio', '130242560', 'USD', '0.0', '1126879', '3594', '4.0', '4.5', '8.4.1', '12+', 'Music', '37', '4', '1', '1']


Number of rows: 7197
Number of columns: 16
['App', 'Catego

**Data cleaning**

In [4]:
print(android_header)  # print the header row
print('\n')
print(android[10472])  # the row with the error

# Row 10472 is missing its Category value, so it has 12 fields instead of 13
del android[10472]  # delete the faulty row (run this cell only once)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
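
A row like this can also be found without knowing its index in advance, by comparing each row's length with the header's. The following is a minimal sketch on a small made-up dataset; the `header` and `rows` values are illustrative and are not taken from the CSV files.

```python
# Illustrative data only: the second row is missing its Category value.
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.0'],
]

# Collect the index of every row whose length differs from the header's.
bad_indices = [i for i, row in enumerate(rows) if len(row) != len(header)]
print(bad_indices)

# Delete from the end so that the earlier indices stay valid.
for i in reversed(bad_indices):
    del rows[i]
print(len(rows))
```

A length check only catches rows with a missing or extra field; it would not detect a value that is merely wrong.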


In [5]:
print(android[9148])
print('\n')
print(android[9147])

['Command & Conquer: Rivals', 'FAMILY', 'NaN', '0', 'Varies with device', '0', 'NaN', '0', 'Everyone 10+', 'Strategy', 'June 28, 2018', 'Varies with device', 'Varies with device']


['Plants vs. Zombies™ 2', 'FAMILY', '4.4', '567632', '15M', '10,000,000+', 'Free', '0', 'Everyone 10+', 'Casual', 'June 12, 2018', '6.8.1', '4.1 and up']


Some apps have duplicate entries. The next step is to remove the duplicates and keep only one entry per app. To find them, we use a `for` loop combined with a conditional statement.

In [6]:
app_name = []
duplicate_names = []
for app in android:
    name = app[0]
    if name in app_name:
        duplicate_names.append(name)
        
    else:
        app_name.append(name)

#to find the number of duplicate apps
print('Number of duplicate apps: ', len(duplicate_names))
print('\n')

#examine a few duplicate apps
print('Examples of duplicates apps:', duplicate_names[:15])

Number of duplicate apps:  1181


Examples of duplicates apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
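
The duplicate count can be cross-checked with `collections.Counter` from the standard library. This is a sketch on a made-up list of names (the `names` list is an assumption for illustration), not the cell that produced the output above.

```python
from collections import Counter

# Illustrative app names; 'Box' appears three times and 'Slack' twice.
names = ['Box', 'Slack', 'Box', 'Zoom', 'Box', 'Slack']

counts = Counter(names)

# Each app contributes (occurrences - 1) duplicate entries.
n_duplicates = sum(count - 1 for count in counts.values())
n_unique = len(counts)

print('Number of duplicate entries:', n_duplicates)
print('Number of unique apps:', n_unique)
```

With this definition, the number of duplicate entries plus the number of unique apps always equals the total number of rows.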


Examine some of the duplicates to see how the data looks across the duplicate entries.

In [7]:
for app in android:
    name = app[0]
    if name == 'Facebook' or name == 'Instagram':
        print(app)

['Facebook', 'SOCIAL', '4.1', '78158306', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Facebook', 'SOCIAL', '4.1', '78128208', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'August 3, 2

Because some duplicated apps have different numbers of reviews, keep the entry with the highest number of reviews and remove the others. The higher the number of reviews, the more recent the data should be.

To do this, create an empty dictionary and loop through the Android data (which excludes the header row), assigning the app name to `name` and the number of reviews, converted to a float, to `n_reviews`. If `name` is already a key in `reviews_max` and `reviews_max[name] < n_reviews`, update the number of reviews for that entry. If `name` is not yet in the `reviews_max` dictionary, create a new entry where the key is the app name and the value is the number of reviews.

In [8]:
reviews_max = {}
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews
        
print(len(reviews_max))
print(len(android) - 1181) #minus duplicate apps

9659
9659


In [9]:
#Use the dictionary created above to remove the duplicate rows
android_clean = []
already_added = []
for app in android:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)
#Explore the android_clean dataset
explore_data(android_clean, 0, 4, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Number of columns: 13
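
As an alternative sketch, the two steps (finding the maximum, then filtering) can be combined into a single pass by storing the best row for each app name in a dictionary. The `rows` list below is made up for illustration, and `best_row` is a hypothetical name. Note that the resulting order follows each app's first appearance, which may differ slightly from the order produced by the cells above.

```python
# Illustrative rows: name, category, rating, number of reviews.
rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

best_row = {}  # app name -> row with the most reviews seen so far
for row in rows:
    name = row[0]
    n_reviews = float(row[3])
    # The strict '>' keeps the first row when review counts tie.
    if name not in best_row or n_reviews > float(best_row[name][3]):
        best_row[name] = row

deduplicated = list(best_row.values())
for row in deduplicated:
    print(row)
```

Looking up a name in a dictionary is also faster than checking membership in a list such as `already_added`, which matters as the dataset grows.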


To analyze only the apps that are designed for an English-speaking audience, remove the rows corresponding to non-English apps. One way to do this is to remove each app whose name contains a symbol that isn't commonly used in English text. English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Each character we use in a string has a corresponding number associated with it. The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII system.

We can get the number corresponding to each character using the built-in `ord()` function.
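
For instance, a few characters and their numbers can be inspected as follows. The sample characters are chosen only for illustration.

```python
# Map a handful of sample characters to their code points.
samples = ['a', 'A', '5', '+', '™', '爱']
codes = {character: ord(character) for character in samples}

for character, code in codes.items():
    # Characters with a code point above 127 fall outside the ASCII range.
    print(character, code, 'ASCII' if code <= 127 else 'non-ASCII')
```

The letters, the digit and the plus sign all fall within 0 to 127, while ™ and 爱 do not.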

In [10]:
# Despite its name, the function returns True when the string looks English:
# it returns False as soon as it finds a character outside the ASCII range
def remove_notenglish(a_string): 
    for character in a_string:
        if ord(character) > 127:
            return False
        
    return True

In [11]:
#check if the function works
print(remove_notenglish('Instagram'))
print(remove_notenglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))

True
False


Emojis and characters like ™ fall outside the ASCII range. To minimize data loss, only remove an app if its name has more than three characters whose corresponding numbers fall outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English.

In [12]:
# The function takes a string and counts the characters that fall outside
# the ASCII range (0 - 127). It returns False if there are more than three
# such characters, and True otherwise.

def remove_notenglish(a_string):
    non_ascii = 0

    for character in a_string:
        if ord(character) > 127:
            non_ascii += 1

    return non_ascii <= 3

In [13]:
#check if the function works
print(remove_notenglish('Instagram'))
print(remove_notenglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(remove_notenglish('Docs To Go™ Free Office Suite'))
print(remove_notenglish('Instachat 😜'))

True
False
True
True
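
To see why each test string lands on its side of the threshold, it can help to print the number of non-ASCII characters it contains. The helper `count_non_ascii()` below is a hypothetical name introduced for this sketch; it is not part of the notebook's cells.

```python
def count_non_ascii(a_string):
    """Count the characters whose code point is above 127."""
    return sum(1 for character in a_string if ord(character) > 127)

test_names = [
    'Instagram',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in test_names:
    n = count_non_ascii(name)
    # Names with more than three non-ASCII characters are treated as non-English.
    print(name, '->', n, 'kept' if n <= 3 else 'removed')
```

The threshold of three is a heuristic: some non-English names will still pass, and an English name with four or more emoji would be removed.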


Use the new function to filter out non-English apps from both datasets. Loop through each dataset. If an app name is identified as English, append the whole row to a separate list.

In [None]:
#for android
android_english = []
for app in android_clean:
    name = app[0]
    if remove_notenglish(name):
        android_english.append(app)
        
#for ios (append to ios_english, not to the list being looped over)
ios_english = []
for app in ios:
    name = app[1]
    if remove_notenglish(name):
        ios_english.append(app)

In [None]:
explore_data(ios_english, 0, 5, True)
print('\n')
explore_data(android_english, 0, 5, True)

In [None]:
#Isolating the free apps
free_android = []
for app in android_english:
    price = app[7]
    if price == '0':
        free_android.append(app)
        
#for ios
free_ios = []
for app in ios_english:
    price = app[4]
    if price == '0.0':
        free_ios.append(app)

#check how many apps remain
print(len(free_android))
print(len(free_ios))   

Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

In [None]:
print(ios_header)
print('\n')

print(android_header)

**Find the most common apps by genre**

Build a frequency table for the `prime_genre` column of the App Store data set, and for the `Genres` and `Category` columns of the Google Play data set.

Define two functions we can use to analyze the frequency tables:
- One function to generate frequency tables that show percentages.
- Another function to display the percentages in descending order.

Dictionaries aren't sorted by value, which makes the frequency tables difficult to read as they are. To order them, use the built-in `sorted()` function. This function takes in an iterable (such as a list, dictionary, or tuple) and returns a list of its elements sorted in ascending or descending order (the `reverse` parameter controls which). Note that sorting a dictionary directly sorts only its keys, so the display function first converts the table into a list of `(value, key)` tuples.
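
The behavior of `sorted()` on a dictionary versus a list of tuples can be sketched as follows. The percentages in `genre_share` are invented for illustration and are not results from the datasets.

```python
# Illustrative percentages only.
genre_share = {'Games': 58.2, 'Education': 3.7, 'Entertainment': 7.9}

# Sorting a dictionary directly returns its keys in sorted order.
keys_sorted = sorted(genre_share)
print(keys_sorted)

# Sorting (value, key) tuples orders the entries by value instead.
pairs = [(share, genre) for genre, share in genre_share.items()]
pairs_sorted = sorted(pairs, reverse=True)
print(pairs_sorted)

# An equivalent approach: sort the items with a key function.
items_sorted = sorted(genre_share.items(), key=lambda item: item[1], reverse=True)
print(items_sorted)
```

Tuples are compared element by element, which is why placing the percentage first makes `sorted()` order the entries by percentage.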

In [None]:
#generate frequency tables to find out what are the most 
#common genres in each market.

#def function to generate frequency tables
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage  
    
    return table_percentages
        

#function to display frequency tables
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


The `display_table()` function does the following:
- Takes in two parameters: `dataset`, which will be a list of lists, and `index`, which will be an integer.
- Generates a frequency table using the `freq_table()` function, transforms it into a list of tuples, then sorts the list in descending order.
- Prints the entries of the frequency table in descending order.
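
For comparison, a similar percentage table can be sketched with `collections.Counter` from the standard library. This is an alternative to the two functions above, not a replacement for them, and the `genres` list is made up for illustration.

```python
from collections import Counter

# Illustrative genre column.
genres = ['Games', 'Games', 'Education', 'Games', 'Music']

counts = Counter(genres)
total = len(genres)

# Convert each count into a percentage of all rows.
percentages = {genre: count / total * 100 for genre, count in counts.items()}

# most_common() yields (value, count) pairs from most to least frequent.
for genre, count in counts.most_common():
    print(genre, ':', percentages[genre])
```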

In [None]:
#display the frequency table of the columns prime_genre for ios
#Genres, and Category for android
display_table(free_ios, -5) #prime_genre
print('\n')

display_table(free_android, 1) #category
print('\n')
display_table(free_android, -4) #genres

**Find the most popular apps by genre**

To do this, determine which kinds of apps have the most users by calculating the average number of installs for each app genre. The App Store data set has no installs column, so use the total number of user ratings (`rating_count_tot`) as a proxy. To calculate the average number of user ratings for each genre, use a `for` loop inside another `for` loop (a nested loop).

In [None]:
genres_ios = freq_table(free_ios, -5)

for genre in genres_ios:
    total = 0      # sum of rating counts for this genre
    len_genre = 0  # number of apps in this genre
    for app in free_ios:
        genre_app = app[-5]
        if genre_app == genre:
            n_ratings = float(app[5])  # rating_count_tot
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

For Google Play, we'll need to convert each install number from a string to a float. This means we need to remove the commas and the plus characters, or the conversion will fail and raise an error.

To remove characters from strings, we can use the `str.replace(old, new)` method (like `list.append()` or `list.copy()`, `str.replace()` is a special kind of function called a method). `str.replace()` takes in two parameters, `old` and `new`, and returns a new string in which all occurrences of `old` are replaced with `new`.
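
A short sketch of the cleaning step on a single value, using an install string in the same format as the `Installs` column:

```python
n_installs = '10,000,000+'

# Strings are immutable, so each call returns a new string.
n_installs = n_installs.replace(',', '')
n_installs = n_installs.replace('+', '')
print(n_installs)  # 10000000

n_installs_float = float(n_installs)
print(n_installs_float)  # 10000000.0
```

Because each call returns a new string, the result has to be assigned back to a variable; calling `n_installs.replace(',', '')` on its own would leave `n_installs` unchanged.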

In [None]:
categories_android = freq_table(free_android, 1)

for category in categories_android:
    total = 0
    len_category = 0
    for app in free_android:
        category_app = app[1]
        if category_app == category:            
            n_installs = app[5]
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total += float(n_installs)
            len_category += 1
    avg_installs = total / len_category
    print(category, ':', avg_installs)
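
The nested loop rescans the whole dataset once per category. As an alternative sketch, the totals and counts can be accumulated in dictionaries during a single pass. The `rows` list below is made up for illustration, and `totals`, `counts` and `averages` are hypothetical names.

```python
# Illustrative rows: name, category, installs.
rows = [
    ['App A', 'GAME', '1,000+'],
    ['App B', 'GAME', '5,000+'],
    ['App C', 'TOOLS', '100+'],
]

totals = {}  # category -> sum of installs
counts = {}  # category -> number of apps
for row in rows:
    category = row[1]
    n_installs = float(row[2].replace(',', '').replace('+', ''))
    totals[category] = totals.get(category, 0) + n_installs
    counts[category] = counts.get(category, 0) + 1

averages = {category: totals[category] / counts[category] for category in totals}
print(averages)  # {'GAME': 3000.0, 'TOOLS': 100.0}
```

Both approaches produce the same averages; the single pass simply avoids looping over every app once for each category.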

Author: Frida