# Profitable App Profiles for the App Store and Google Play Markets 

**Scenario**: You are a data analyst for a company that builds Android and iOS mobile apps, which are published on Google Play and the App Store. The company only builds apps that are free to download and install, so its main source of revenue is in-app ads. The number of users is therefore the primary revenue consideration for any given app: the more users who see and engage with the ads, the better.

**Goal**: Analyze the data to help the developers understand which types of apps are likely to attract more users.

**Table of Contents**:
- Load app data 
- Remove erroneous data
- Remove duplicate apps
- Remove non-English apps
- Remove paid apps


## Load app data from Apple Store and Google Play Store .csv files

In [1]:
from csv import reader

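# Read each file inside a `with` block so the file handle is closed automatically.
# The explicit encoding avoids decode errors on platforms whose default is not UTF-8.
with open("applestore.csv", encoding="utf8") as opened_ios:
    ios = list(reader(opened_ios))
ios_header = ios[0]
ios = ios[1:]

with open("googleplaystore.csv", encoding="utf8") as opened_gplay:
    gplay = list(reader(opened_gplay))
gplay_header = gplay[0]
gplay = gplay[1:]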

In [2]:
print(f"Total* number of Apple Store apps: {len(ios)}")
print(f"Total* number of Google Play Store apps: {len(gplay)}")
print("*As provided")

Total* number of Apple Store apps: 7197
Total* number of Google Play Store apps: 10841
*As provided


### Course-provided `explore_data` function serves to print rows from the data in a readable way

In [3]:
def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_and_columns:
        print("Number of rows:", len(dataset))
        print("Number of columns:", len(dataset[0]))

#### Data snippets from Apple Store

In [4]:
explore_data(ios, 0, 3, True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns: 16


#### Data snippets from Google Play Store

In [5]:
explore_data(gplay, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


## Remove erroneous data

According to a [Kaggle discussion](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015), entry `10472` of the Google Play Store data, the `Life Made WI-Fi Touchscreen Photo Frame` app, is missing its category, so every subsequent value sits one column to the left of where it belongs. Consequently, this entry has 12 columns instead of 13.

The erroneous entry is found on row `10472` in our data (without the header).

In [6]:
print(gplay[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
print(len(gplay[10472]))

12
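One way to see the shift is to pair the malformed row with the column names. The sketch below is illustrative only: `example_header` and `bad_row` are copies of values printed in this notebook, and `aligned` is a name introduced here for the example.

```python
# Column names and the malformed row, copied from the notebook output.
example_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                  'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                  'Current Ver', 'Android Ver']
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

# zip stops at the shorter sequence, so the last column name is left unpaired.
aligned = dict(zip(example_header, bad_row))

for column, value in aligned.items():
    print(f"{column}: {value!r}")
```

Here `Category` is paired with `'1.9'`, which is really the rating, and `Android Ver` receives no value at all.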


As one step in the data cleaning process, the erroneous row `10472` is removed.

In [8]:
del gplay[10472]

In [9]:
print(gplay[10472])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
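Note that `del gplay[10472]` removes whatever row currently sits at that index, so running the cell a second time would delete a valid app. A minimal sketch of a guarded version follows; `drop_row_if_malformed` is a hypothetical helper, not part of the course material, and `rows` is made-up sample data.

```python
def drop_row_if_malformed(dataset, index, expected_columns):
    """Delete dataset[index] only when its column count is wrong.

    Returns True if a row was deleted, False otherwise.
    """
    if index < len(dataset) and len(dataset[index]) != expected_columns:
        del dataset[index]
        return True
    return False


# Sample data: the middle row is missing one column.
rows = [["a", "X", "1"], ["b", "2"], ["c", "Y", "3"]]

print(drop_row_if_malformed(rows, 1, 3))  # True: the short row is removed
print(drop_row_if_malformed(rows, 1, 3))  # False: the row now at index 1 is valid
print(rows)
```

Because the length check runs before the deletion, the call is safe to repeat.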


For completeness, the entire Google Play Store data is checked to see if the number of columns for any given row is not 13.

In [10]:
gplay_erroneous_length = []
for app in gplay:
    name = app[0]
    if len(app) != 13:
        gplay_erroneous_length.append(name)
print(gplay_erroneous_length)

[]


We then apply this method to the Apple Store data to isolate rows that do not have 16 columns. 

In [11]:
ios_erroneous_length = []
for app in ios:
    name = app[1]
    if len(app) != 16:
        ios_erroneous_length.append(name)
print(ios_erroneous_length)

[]
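The two loops above differ only in the dataset and the expected column count, so the check could be expressed once. This is a sketch under that assumption; `find_malformed_rows` is a hypothetical name and `sample` is made-up data.

```python
def find_malformed_rows(dataset, expected_columns):
    """Return (index, column_count) for every row with the wrong length."""
    return [
        (index, len(row))
        for index, row in enumerate(dataset)
        if len(row) != expected_columns
    ]


# Sample data with one short row at index 1.
sample = [["a", "X", "1"], ["b", "2"], ["c", "Y", "3"]]
print(find_malformed_rows(sample, 3))  # [(1, 2)]
```

Returning the index rather than the app name makes it straightforward to inspect or delete the offending row afterwards.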


In [12]:
print(f"Total* number of Apple Store apps: {len(ios)}")
print(f"Total* number of Google Play Store apps: {len(gplay)}")
print("*After check for data errors")

Total* number of Apple Store apps: 7197
Total* number of Google Play Store apps: 10840
*After check for data errors


## Remove duplicate apps

Another [Kaggle discussion](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/90409) suggests that the Apple Store data may contain duplicates.

A Kaggler provided the [code](https://www.kaggle.com/datasets/ramamet4/app-store-apple-data-set-10k-apps/discussion/106176#812842) below to list apps with duplicate names.

In [13]:
ios_unique_apps = [] 
ios_duplicate_apps = [] 

for app in ios: 
    name = app[1] 

    if name not in ios_unique_apps:
        ios_unique_apps.append(name)
    else:
        ios_duplicate_apps.append(app)
        
print(ios_duplicate_apps)

[['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1'], ['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']]


In [14]:
len(ios_duplicate_apps)

2

Upon closer inspection, the `Mannequin Challenge` and `VR Roller Coaster` entries are not duplicates: in each case, two distinct apps share the same name. This is evident from the differences between the rows, most notably in the app size (`size_bytes`) and total rating count (`rating_count_tot`).

Therefore, the Apple Store data does not contain any duplicates. 

There are 7,197 unique apps on the Apple Store.

In [15]:
print(ios[0])
print("\n")

for app in ios:
    if app[1] == "Mannequin Challenge" or app[1] == "VR Roller Coaster":
        print(app)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668', '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
['952877179', 'VR Roller Coaster', '169523200', 'USD', '0.0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105', '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']
['1089824278', 'VR Roller Coaster', '240964608', 'USD', '0.0', '67', '44', '3.5', '4.0', '0.81', '4+', 'Games', '38', '0', '1', '1']
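To make the comparison explicit, the columns in which two same-named rows differ can be listed. The sketch below copies the header and the two `Mannequin Challenge` rows from the output above; `diff_rows` is a hypothetical helper introduced for illustration.

```python
def diff_rows(header, row_a, row_b):
    """Map each column name to the pair of values that differ between two rows."""
    return {
        column: (a, b)
        for column, a, b in zip(header, row_a, row_b)
        if a != b
    }


example_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
                  'rating_count_tot', 'rating_count_ver', 'user_rating',
                  'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
                  'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
first = ['1173990889', 'Mannequin Challenge', '109705216', 'USD', '0.0', '668',
         '87', '3.0', '3.0', '1.4', '9+', 'Games', '37', '4', '1', '1']
second = ['1178454060', 'Mannequin Challenge', '59572224', 'USD', '0.0', '105',
          '58', '4.0', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']

differences = diff_rows(example_header, first, second)
for column, (a, b) in differences.items():
    print(f"{column}: {a} vs {b}")
```

The two rows differ in `id`, `size_bytes`, the rating columns, `ver`, and several other fields, which supports treating them as separate apps.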


In [16]:
print(f"Total* number of Apple Store apps: {len(ios)}")
print("*After checks for errors and duplicates")

Total* number of Apple Store apps: 7197
*After checks for errors and duplicates


We now apply the same duplicate-finding method to the Google Play Store data.

In [17]:
gplay_unique_apps = [] 
gplay_duplicate_apps = [] 

for app in gplay: 
    name = app[0] 

    if name not in gplay_unique_apps:
        gplay_unique_apps.append(name)
    else:
        gplay_duplicate_apps.append(app)

# Print a single example row (the entry at index 5)
print(gplay_duplicate_apps[5])

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [18]:
len(gplay_duplicate_apps)

1181

It appears that the Google Play Store data has `1,181` lines associated with duplicates. 

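Subtracting these from the `10,840` rows that remain after the error check (the header was already removed when the data was loaded), we expect the total number of unique Google Play apps to be `9,659`.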

In [19]:
len(gplay)-len(gplay_duplicate_apps)

9659
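The `name not in list` test used above rescans the list for every row, which gets slow as the data grows. As a cross-check, the same counts can be obtained with `collections.Counter` from the standard library. This is a sketch on made-up sample rows; `count_duplicate_rows` is a hypothetical helper name.

```python
from collections import Counter


def count_duplicate_rows(dataset, name_index):
    """Return names that occur more than once and the number of surplus rows."""
    counts = Counter(row[name_index] for row in dataset)
    duplicates = {name: n for name, n in counts.items() if n > 1}
    # Each duplicated name contributes (occurrences - 1) rows to remove.
    extra_rows = sum(n - 1 for n in duplicates.values())
    return duplicates, extra_rows


sample = [["Box", "1"], ["Slack", "2"], ["Box", "3"], ["Box", "4"]]
duplicates, extra_rows = count_duplicate_rows(sample, 0)
print(duplicates)  # {'Box': 3}
print(extra_rows)  # 2
```

Applied to `gplay` with `name_index=0`, `extra_rows` should correspond to the length of `gplay_duplicate_apps`.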

Isolating the first app in the comprehensive list of duplicates above, we see the app `Quick PDF Scanner + OCR FREE` listed three times. There is a slight difference in the 3rd listing with `80804` reviews instead of `80805` as shown in the first two listings.

In [20]:
print(gplay_header)
print("\n")

for app in gplay:
    if app[0] == "Quick PDF Scanner + OCR FREE":
        print(app)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


We assume the listing with the highest number of reviews is the most recent record of an app, since review counts accumulate over time. Therefore, we use the maximum number of reviews as the criterion for deciding which listing to keep. We create a dictionary called `reviews_max` to hold the maximum number of reviews for each app. As noted above, we should end up with `9,659` unique Google Play listings after removing duplicates.

In [21]:
reviews_max = {}
for app in gplay:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max)

9659

Rather than deleting rows from `gplay` in place, we build a new list called `android_clean` that holds one row per unique Google Play Store app. The header was already separated into `gplay_header` when the data was loaded, so the loop runs over app rows only.

If an app has two or more listings with the same maximum number of reviews, the `already_added` list ensures that only the first of them is added to the clean list.

The total number of apps is then simply the length of `android_clean`.

In [22]:
android_clean = []
already_added = []

for app in gplay:
    name = app[0]
    n_reviews = float(app[3])
    if (n_reviews == reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

print(f"Total* number of Google Play apps: {len(android_clean)}")
print("*After checks for errors and duplicates")

Total* number of Google Play apps: 9659
*After checks for errors and duplicates
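The two passes above (first `reviews_max`, then `android_clean`) can also be folded into a single pass that keeps the best row per name in a dictionary. This is an alternative sketch, not the course's method; `dedupe_by_max_reviews` is a hypothetical name and `sample` is shortened, made-up data.

```python
def dedupe_by_max_reviews(dataset, name_index, reviews_index):
    """Keep one row per name: the first row that has the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # A strict comparison keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


sample = [
    ["Quick PDF", "80805"],
    ["Box", "159872"],
    ["Quick PDF", "80805"],
    ["Quick PDF", "80804"],
]
clean = dedupe_by_max_reviews(sample, 0, 1)
print(clean)  # [['Quick PDF', '80805'], ['Box', '159872']]
```

One difference to be aware of: the result is ordered by the first appearance of each name, whereas `android_clean` is ordered by the position of the row that was kept.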


## Remove non-English apps

From the Dataquest lesson:
`The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.`

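We create a function that takes an app's name and counts the characters whose corresponding number falls outside the range 0 to 127. Because many English app names contain an emoji or a symbol such as `™`, the function tolerates up to three such characters: it returns `False` only if more than three characters are outside the range, and `True` otherwise.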

In [23]:
def is_english(string):
    not_ascii = 0
    for char in string:
        if ord(char) > 127:
            not_ascii += 1
    if not_ascii > 3:
        return False
    else:
        return True
        
# tests
print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True
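To see why a small tolerance is useful, it helps to count the non-ASCII characters in each test name. The sketch below restates the same check with the threshold as a parameter; `count_non_ascii` and `is_mostly_ascii` are hypothetical names, and the default of three mirrors the limit used in `is_english`.

```python
def count_non_ascii(text):
    """Count characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for char in text if ord(char) > 127)


def is_mostly_ascii(text, max_non_ascii=3):
    """Return True when the text has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


names = [
    'Instagram',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Instachat 😜',
]
for name in names:
    print(name, count_non_ascii(name), is_mostly_ascii(name))

# Code points behind the examples: a plain letter, the trademark sign, an emoji.
print(ord('a'), ord('™'), ord('😜'))  # 97 8482 128540
```

With a threshold of zero, `Docs To Go™ Free Office Suite` and `Instachat 😜` would be rejected even though both names are English.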


We apply this function to both the Apple Store and Google Play Store data sets to keep only the apps designed for English speakers.

In [24]:
ios_english = []
for app in ios:
    name = app[1]
    if is_english(name):
        ios_english.append(app)

gplay_english = []
for app in android_clean:
    name = app[0]
    if is_english(name):
        gplay_english.append(app)
        
print(f"Total* number of Apple Store apps: {len(ios_english)}")
print(f"Total* number of Google Play Store apps: {len(gplay_english)}")
print("*After removal of errors, duplicates, and non-ASCII app names")

Total* number of Apple Store apps: 6183
Total* number of Google Play Store apps: 9614
*After removal of errors, duplicates, and non-ASCII app names


## Remove paid apps

We print the app headers again to view indices.\
The price is at index `4` (`price`) for Apple and index `7` (`Price`) for Google Play.

In [25]:
print(ios_header)
print("\n")
print(gplay_header)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


We create new lists that contain only the apps whose price is `0`. Both the Apple and Google Play prices are stored as strings, so they are converted to `float` values. A Google Play price may have a leading `$`, so this symbol is stripped using the string method `lstrip`.

In [26]:
ios_free = []
gplay_free = []

for app in ios_english:
    price = float(app[4])
    if price == 0:
        ios_free.append(app)

for app in gplay_english:
    price = float(app[7].lstrip("$"))
    if price == 0:
        gplay_free.append(app)
        
print(f"Total* number of English Apple Store apps: {len(ios_free)}")
print(f"Total* number of English Google Play Store apps: {len(gplay_free)}")
print("*After removal of errors, duplicates, non-ASCII app names, and paid apps")

Total* number of English Apple Store apps: 3222
Total* number of English Google Play Store apps: 8864
*After removal of errors, duplicates, non-ASCII app names, and paid apps
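The price conversion can be isolated in a small helper so that both stores share the same logic. This is a minimal sketch; `parse_price` is a hypothetical name, and the sample rows are made up rather than taken from the data.

```python
def parse_price(raw):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    return float(raw.strip().lstrip("$"))


print(parse_price("0"))      # 0.0
print(parse_price("0.0"))    # 0.0
print(parse_price("$4.99"))  # 4.99

# Made-up rows in the form [name, price]; keep only the free ones.
sample = [["Free app", "0"], ["Paid app", "$4.99"], ["Also free", "0.0"]]
free_apps = [row for row in sample if parse_price(row[1]) == 0]
print(free_apps)  # [['Free app', '0'], ['Also free', '0.0']]
```

`lstrip("$")` leaves a string without a leading `$` unchanged, so the same helper works for the Apple Store prices as well.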


## Most Common Apps by Genre