# App Data Analysis for Appstore and Playstore

This notebook analyzes app data from the Google Playstore and the Apple Appstore. The table below summarizes the size of each dataset.

<table width=100% align=left>
    <tr>
        <th>Dataset</th>
        <th>Rows</th>
        <th>Columns</th>
    </tr>
    <tr>
        <td>Appstore</td>
        <td>7,198</td>
        <td>16</td>
    </tr>
    <tr>
        <td>Playstore</td>
        <td>10,843</td>
        <td>13</td>
    </tr>
</table>



**Playstore Headers**
```
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
```

**Appstore Headers**
```
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
```

## Data Preparation

In [1]:
from csv import reader

def explore_data(dataset, start, end, rows_and_columns=False):
    """
    Print a slice of rows and, optionally, the size of the dataset.

    Parameters:
    - dataset (list): The dataset to explore, as a list of rows.
    - start (int): The index of the first row to print.
    - end (int): The index at which to stop printing (exclusive).
    - rows_and_columns (bool): If True, also print the number of rows and columns.

    Returns: None
    """
    dataset_slice = dataset[start:end]

    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows: ', len(dataset))
        print('Number of columns: ', len(dataset[0]))

def open_file(filename, include_header=False, encoding='utf8'):
    """
    Read a CSV file with 'csv.reader' and return its rows as a list.

    Parameters:
    - filename (str): The name of the CSV file to open.
    - include_header (bool): If True, includes the header row in the result.
    - encoding (str): The encoding of the file. Default is 'utf8'.

    Returns:
    list: The rows of the CSV file, each row being a list of strings.
    """
    # The 'with' block closes the file once all rows have been read
    with open(filename, encoding=encoding) as opened_file:
        rows = list(reader(opened_file))

    if not include_header:
        return rows[1:]

    return rows

Load both datasets with their header rows included. Then call ```explore_data()``` on each one: the *2nd* and *3rd* parameters (```start``` and ```end```) select the rows to print, here only the header row, and the last parameter also prints the number of rows and columns.

In [2]:
google_play_dataset = open_file('googleplaystore.csv', True) # True keeps the header row
apple_appstore_dataset = open_file('AppleStore.csv', True) # True keeps the header row

explore_data(google_play_dataset, 0, 1, True) # True also prints the number of rows and columns
print('\n')
explore_data(apple_appstore_dataset, 0, 1, True) # True also prints the number of rows and columns

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Number of rows:  10842
Number of columns:  13


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Number of rows:  7198
Number of columns:  17
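
The Appstore header printed above starts with an unnamed column (`''`), so every named field sits one position later than its place in the header list at the top of the notebook. Below is a minimal sketch of looking up column positions by name instead of hard-coding them. `column_index` is a hypothetical helper, not part of the notebook, and the header is copied from the output above.

```python
def column_index(header):
    """Map each column name to its position in the header row."""
    return {name: position for position, name in enumerate(header)}

# Header row as printed by explore_data() for the Appstore dataset
appstore_header = ['', 'id', 'track_name', 'size_bytes', 'currency', 'price',
                   'rating_count_tot', 'rating_count_ver', 'user_rating',
                   'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                   'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

appstore_columns = column_index(appstore_header)
print(appstore_columns['prime_genre'])  # 12
print(appstore_columns['user_rating'])  # 8
```

Looking positions up this way keeps later cells correct even if a column is added to or removed from the file.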


## Data Sanitation

In [3]:
def delete_row(data, index):
    """
    Delete a row from the given data at the specified index.

    Parameters:
    - data (list): The dataset from which to delete the row.
    - index (int): The index of the row to be deleted.

    Returns:
    None
    """
    del data[index]

Find all incomplete rows (rows whose length differs from the header's) in both the Playstore and Appstore datasets, then remove them.

In [4]:
# Playstore dataset
android_incomplete_rows = []
android_headers = google_play_dataset[0]

for app_data in google_play_dataset[1:]:
    if len(android_headers) != len(app_data):
        android_incomplete_rows.append(app_data)
        delete_row(google_play_dataset, google_play_dataset.index(app_data))
        
print("INCOMPLETE ROWS IN ANDROID: ", android_incomplete_rows)

# Appstore dataset
appstore_incomplete_rows = []
appstore_headers = apple_appstore_dataset[0]

for app_data in apple_appstore_dataset[1:]:
    if len(appstore_headers) != len(app_data):
        appstore_incomplete_rows.append(app_data)
        delete_row(apple_appstore_dataset, apple_appstore_dataset.index(app_data))
        
print("INCOMPLETE ROWS IN APPSTORE: ", appstore_incomplete_rows)

INCOMPLETE ROWS IN ANDROID:  [['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']]
INCOMPLETE ROWS IN APPSTORE:  []
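
As an alternative to deleting rows in place, the same check can build two new lists in one pass, which avoids searching the dataset with `index()` for every incomplete row. This is only a sketch: `split_by_length` is a hypothetical helper, and the sample rows are made up for illustration.

```python
def split_by_length(rows, expected_length):
    """Separate rows that have the expected number of fields from those that do not."""
    complete, incomplete = [], []
    for row in rows:
        if len(row) == expected_length:
            complete.append(row)
        else:
            incomplete.append(row)
    return complete, incomplete

# Made-up sample data, not taken from the real datasets
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '1.9'],            # one field is missing
    ['App C', 'GAME', '4.5'],
]

complete, incomplete = split_by_length(sample_rows, len(sample_header))
print(len(complete), len(incomplete))  # 2 1
```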


With the incomplete rows removed, find all app names that appear more than once in the Playstore dataset.

In [5]:
duplicate_apps = []
unique_apps = []

for app_data in google_play_dataset[1:]:
    app_name = app_data[0] #index for app name
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)

print('Number of duplicate apps: ', len(duplicate_apps))
print('\n')
print('15 examples of duplicate applications:\n', duplicate_apps[:15])
print('Expected length: ', len(google_play_dataset[1:]) - len(duplicate_apps))

Number of duplicate apps:  1181


15 examples of duplicate applications:
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
Expected length:  9659


Loop through the Playstore dataset and build a dictionary in ```{<app_name> : <review_count>}``` format that keeps the highest review count recorded for each app.

In [6]:
reviews_max = {}

for app_data in google_play_dataset[1:]:
    name = app_data[0] #index for app name
    review = float(app_data[3]) #index for app review
    
    # Keep only the highest review count seen for each app name
    if name not in reviews_max or review > reviews_max[name]:
        reviews_max[name] = review
    
print('Length of Playstore app name and max number of reviews: ', len(reviews_max))

Length of Playstore app name and max number of reviews:  9659


Sanitize the Playstore dataset to remove duplicates. The ```already_added``` dictionary tracks the apps that have already been added, so an app is kept only once even when several of its rows share the same maximum review count.

In [7]:
android_clean = {}
already_added = {}

for app_data in google_play_dataset[1:]:
    name = app_data[0] #index for app name
    review = float(app_data[3]) #index for app review
    
    if reviews_max[name] == review and name not in already_added:
        android_clean[name] = review
        already_added[name] = review

print('Length of Playstore unique app name along with max number of reviews ', len(android_clean))

Length of Playstore unique app name along with max number of reviews  9659
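
The two steps above (finding the maximum review count, then keeping one entry per app) could also be combined into a single pass that keeps whole rows. The sketch below assumes the app name is at index 0 and the review count at index 3, as in the Playstore header. `dedupe_by_max_reviews` is a hypothetical helper, and the review counts in the sample are illustrative values, not taken from the dataset.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Replace the stored row only when this one has strictly more reviews
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())

# Illustrative rows: (App, Category, Rating, Reviews)
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '100'],
    ['Slack', 'BUSINESS', '4.4', '50'],
    ['Box', 'BUSINESS', '4.2', '120'],
]

unique_rows = dedupe_by_max_reviews(sample_rows)
for row in unique_rows:
    print(row[0], row[3])
```

Because a row replaces the stored one only on a strictly higher count, ties are resolved in favor of the first row seen.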


Check whether a given app name from the Playstore dataset contains non-English characters. A name is flagged only once it contains three characters outside the ASCII range.

In [8]:
def has_non_english_character(name):
    """
    Check if a given string contains non-English characters based on their code points.

    A character counts as non-English when its code point is above 127 (outside ASCII).

    Parameters:
    - name (str): The string to be checked.

    Returns:
    bool: True if the string contains at least three non-English characters, otherwise False.
    """
    non_english_counter = 0

    for letter in name:
        if ord(letter) > 127:
            non_english_counter += 1

            if non_english_counter == 3:
                return True

    return False

Loop through the Playstore dataset and collect every app whose name does not contain non-English characters.

In [9]:
apps_without_non_english_names = []

for app_data in google_play_dataset[1:]:
    name = app_data[0]
    
    if not has_non_english_character(name):
        apps_without_non_english_names.append(app_data)

Get all free apps from the Playstore dataset, then display the number of free apps and the total number of apps.

In [10]:
free_apps = []

for app_data in google_play_dataset[1:]:
    app_type = app_data[6] # index for the 'Type' column
    
    if app_type == 'Free':
        free_apps.append(app_data)

print('Free apps: ', len(free_apps))
print('Total apps: ', len(google_play_dataset[1:]))

Free apps:  10039
Total apps:  10840


Build and display the frequency table of a given column of a dataset.

In [11]:
def freq_table(dataset, index):
    data = {}
    
    for datum in dataset:
        tmp = datum[index]
        
        if tmp in data:
            data[tmp] += 1
        else:
            data[tmp] = 1
            
    return data
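
For comparison, the standard library's `collections.Counter` builds the same kind of table and can return the entries sorted by count. The sketch below is an equivalent under that assumption. `freq_table_counter` is a hypothetical name, and the sample rows are made up.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Count how often each value appears in the given column."""
    return Counter(row[index] for row in dataset)

# Made-up sample rows: (App, Category)
sample_rows = [
    ['App A', 'GAME'],
    ['App B', 'TOOLS'],
    ['App C', 'GAME'],
]

table = freq_table_counter(sample_rows, 1)
print(table.most_common())  # [('GAME', 2), ('TOOLS', 1)]
```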

In [12]:
def display_table(dataset, index):
    """
    Generate a frequency table from a dataset using a specified index and
    print it sorted by count, from highest to lowest.

    Parameters:
    - dataset (list): The dataset from which the frequency table is generated.
    - index (int): The index of the column used to create the frequency table.

    Returns: None
    """
    table = freq_table(dataset, index)
    table_display = []
    
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

### Analysis

For ***Appstore*** applications, determine which genre is the most popular.

### Most Popular Apps by Genre on the App Store

In [13]:
appstore_apps = freq_table(apple_appstore_dataset[1:], 12) # index for prime_genre

print('Average rating per genre on Appstore: \n-----------------')

for genre in appstore_apps:
    total = 0
    len_genre = 0
    
    for app_data in apple_appstore_dataset[1:]:
        genre_app = app_data[12] # index for prime_genre
        
        # Compare strings with ==, since 'is' only matches the very same object
        if genre_app == genre:
            user_rating = float(app_data[8]) # index for user_rating
            total = total + user_rating
            len_genre += 1
    
    average = total / len_genre
    
    print(genre, ' : ', average)
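
A per-group average can also be computed in a single pass over the rows, rather than scanning the whole dataset once per group. The following is a minimal sketch of that idea. `average_by_group` is a hypothetical helper, and the sample rows are made up.

```python
from collections import defaultdict

def average_by_group(rows, group_index, value_index):
    """Return the mean of the value column for each distinct group value."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_index]
        totals[group] += float(row[value_index])
        counts[group] += 1
    return {group: totals[group] / counts[group] for group in totals}

# Made-up sample rows: (genre, user rating)
sample_rows = [
    ['Games', '4.0'],
    ['Games', '5.0'],
    ['Music', '3.5'],
]

averages = average_by_group(sample_rows, 0, 1)
print(averages)  # {'Games': 4.5, 'Music': 3.5}
```

Groups are matched through dictionary keys, which compare strings by value, so rows with equal genre strings always land in the same group.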


In [14]:
# TODO: Interpretation of data above

### Most Popular Apps by Genre on Google Play

In [32]:
playstore_apps = freq_table(google_play_dataset[1:], 1) # index for Category

category_w_avg = {}
for category in playstore_apps:
    total = 0
    len_category = 0
    
    for app_data in google_play_dataset[1:]:
        category_app = app_data[1]
        
        # Compare strings with ==, since 'is' only matches the very same object
        if category_app == category:
            n_installs = app_data[5]
            n_installs = n_installs.replace('+','')
            n_installs = n_installs.replace(',','')
            n_installs = float(n_installs)
            
            total = total + n_installs
            len_category = len_category + 1
    
    category_w_avg[category] = total / len_category

# Sort by average installs (ascending) and format the numbers only when printing
sorted_data = dict(sorted(category_w_avg.items(), key=lambda item: item[1]))
for category in sorted_data:
    print(category, ':', '{:,}'.format(sorted_data[category]))
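
The install counts are stored as strings such as `'1,000+'`, so they must be cleaned before any arithmetic. A small sketch of that conversion as a reusable function follows. `parse_installs` is a hypothetical helper, and it assumes every value consists of digits, commas, and an optional trailing `+`.

```python
def parse_installs(installs):
    """Convert an install string such as '1,000+' into an integer."""
    return int(installs.replace(',', '').replace('+', ''))

print(parse_installs('1,000+'))      # 1000
print(parse_installs('1,000,000+'))  # 1000000
```

Any value outside that pattern raises a `ValueError`, which surfaces malformed rows instead of silently skipping them.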


In [16]:
# TODO: Interpretation of data above