# Play store profitable app recommendation

The goal of this project is to analyze data about downloaded apps in order to help developers understand which types of apps attract more users.

The project analyzes data from the Google Play Store and the Apple App Store using basic Python data analysis tools.

In [2]:
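from csv import reader

# Google Play Store data (utf8 is assumed, since app names contain non-ASCII characters)
with open('googleplaystore.csv', encoding='utf8') as google_file:
    google_data = list(reader(google_file))

# Apple App Store (iOS) data
with open('AppleStore.csv', encoding='utf8') as ios_file:
    ios_data = list(reader(ios_file))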

In [3]:
def explore_data(dataset, start, end, rows_and_columns=False):
    '''
    To explore data and make the data more readable.
    
    Takes in four parameters:
    * dataset, which is expected to be a list of lists.
    
    * start and end, which are both expected to be integers
    and represent the starting and the ending indices of a slice
    from the data set.
    
    * rows_and_columns, which is expected to be a Boolean
    and has False as a default argument.
    '''
    dataset_slice = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [4]:
explore_data(google_data, 0, 2, True) #first list is the header

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
explore_data(ios_data, 0, 3, True) #first list is the header

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


## Preparing the data

We need to clean the data before analyzing it, because some of it is inaccurate or irrelevant to our goal:

* Detect inaccurate data, and correct or remove it.
* Detect duplicate data, and remove the duplicates.

Our target audience is English speakers, so we only need the data for English apps.

In [6]:
print(google_data[10473], '\n')  # incorrect row
print(google_data[0], '\n')
print(google_data[0][2])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

Rating


First, we need to separate the data from the header.

In [7]:
ios_data_header = ios_data[0] #header
google_data_header = google_data[0] #header
ios_data = ios_data[1:] #actual data
google_data = google_data[1:] #actual data

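The row printed earlier (index 10473 while the header was still part of the list) is wrong: its 'Category' value is missing, so the remaining values shift one column to the left and the rating reads 19, although the maximum rating for a Google Play app is 5. We want to remove every such row.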
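A shifted row like this can also be detected structurally, since it has fewer fields than the header. The sketch below assumes the data is a list of lists with a separate header row; `find_malformed_rows` is a hypothetical helper that is not part of the original notebook, and the sample rows are abbreviated to four columns for illustration.

```python
def find_malformed_rows(rows, header):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Abbreviated toy data for illustration only (four columns instead of 13).
sample_header = ['App', 'Category', 'Rating', 'Reviews']
sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19'],  # Category missing
]

malformed = find_malformed_rows(sample_rows, sample_header)
print(malformed)
```

A length check like this does not depend on the shifted value happening to exceed 5, so it may catch rows that the rating test would miss.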

In [8]:
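# Remove rows with an impossible rating (greater than 5)
def remove_wrong_rating(data: list) -> None:
    # Iterate over a copy: removing items from a list while looping over it skips elements
    for row in data[:]:
        if float(row[2]) > 5:
            data.remove(row)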

In [9]:
remove_wrong_rating(google_data)

In [10]:
print(google_data_header, '\n')
explore_data(google_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10840
Number of columns: 13


The Google Play data also contains duplicate entries, because some apps were recorded more than once at different times.

In [11]:
count = 0
for app in google_data:
    if app[0] == 'Instagram':
        count += 1
        print(app, '\n')
print('number of duplicates for instagram: ', count)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device'] 

number of duplicates for instagram:  4


In [12]:
duplicate_apps = []
unique_apps = []

for app in google_data:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('\n')
print('Examples of duplicates apps:', duplicate_apps[:15])

Number of duplicate apps: 1181




Examples of duplicates apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
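Checking `name in unique_apps` scans a list on every iteration, which gets slower as the list grows. As an alternative sketch, `collections.Counter` from the standard library can tally the names in one pass. The `names` list below is made-up sample data, not the real dataset.

```python
from collections import Counter

# Made-up sample of app names, for illustration only.
names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']

counts = Counter(names)

# Every occurrence beyond the first counts as a duplicate entry,
# which matches how duplicate_apps is filled in the loop above.
n_duplicates = sum(c - 1 for c in counts.values())
repeated_names = [name for name, c in counts.items() if c > 1]

print('Number of duplicate entries:', n_duplicates)
print('Names that appear more than once:', repeated_names)
```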


In [13]:
max_reviews = {}
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    if name in max_reviews and max_reviews[name] < n_reviews:
        max_reviews[name] = n_reviews
    elif name not in max_reviews:
        max_reviews[name] = n_reviews

In [14]:
print('Expected length:', len(google_data) - 1181)
print('Actual length:', len(max_reviews))

Expected length: 9659
Actual length: 9659


Since we have 1181 duplicate entries and want one row per app, we created a dictionary where each key is a unique app name and each value is the highest number of reviews recorded for that app. We then keep only the entry with the most reviews for each app.
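The same selection rule can also be sketched in a single pass by storing the whole row, rather than only the review count, under each app name. This is an illustrative alternative, not the notebook's method: `keep_most_reviewed` is a hypothetical helper, the rows are abbreviated and partly made up, and the name and review count are assumed to sit in columns 0 and 3 as in the Google Play data.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Abbreviated sample rows: [App, Category, Rating, Reviews]; the Box row is made up.
sample = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

deduped = keep_most_reviewed(sample)
for row in deduped:
    print(row)
```

Because dictionaries preserve insertion order, the result keeps each app at the position where its name first appeared.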

In [15]:
clean_google = [] #clean data
added_google = [] #stores app names

In [16]:
for app in google_data:
    name = app[0]
    n_reviews = float(app[3])
    
    if (max_reviews[name] == n_reviews) and (name not in added_google):
        clean_google.append(app)
        added_google.append(name) # make sure this is inside the if block
        

In [17]:
#Explore the data
explore_data(clean_google, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13


We have the number of rows and columns as expected.

### Removing non-English apps

We are only concerned with English apps because our developers only speak English. However, if we explore the data enough, we notice that it contains some non-English apps.

In [21]:
print(clean_google[4412][0])

中国語 AQリスニング


In [25]:
# Print names whose first character is not an ASCII letter or a digit, to spot some of the non-English apps.
for app in clean_google:
    if app[0][0].lower() not in 'abcdefghijklmnopqrstuvwxyz'\
    and not app[0][0].isdigit():
        print(app[0])

漫咖 Comics - Manga,Novel and Stories
【Ranobbe complete free】 Novelba - Free app that you can read and write novels
- Free Comics - Comic Apps
🔥 Football Wallpapers 4K | Full HD Backgrounds 😍
İşCep
乐屋网: Buying a house, selling a house, renting a house
သိင်္ Astrology - Min Thein Kha BayDin
💘 WhatsLov: Smileys of love, stickers and GIF
РИА Новости
乗換NAVITIME　Timetable & Route Search in Japan Tokyo
ÖBB Scotty
► MultiCraft ― Free Miner! 👍
صور حرف H
[Substratum] K-Manager for K-Klock
💎 I'm rich
[substratum] Vacuum: P
[Sub/EMUI] P Pro - EMUI 8.1/8.0/5.X Theme
.R
/u/app
[verify-U] VideoIdent
[ROOT] X Privacy Installer
中国語 AQリスニング
【Miku AR Camera】Mikuture
日本AV历史
¡Ay Caramba!
¡Ay Metro!
বাংলা টিভি প্রো BD Bangla TV
뽕티비 - 개인방송, 인터넷방송, BJ방송
あなカレ【BL】無料ゲーム
감성학원 BL 첫사랑
[BN] Blitz
[Substratum] M5 Theme
Билеты ПДД CD 2019 PRO
РееI Smart Remote MP3 CD Player
📏 Smart Ruler ↔️ cm/inch measuring for homework!
[adult swim]
الفاتحون Conquerors
Аim Training for CS
Šmelina .cz inzeráty inzerce
+Download 4 Inst

In [35]:
def is_english(string_val):
    # Takes a string and returns False if any character falls outside the ASCII range.
    for val in string_val:
        if ord(val) > 127:  # ASCII covers code points 0 to 127
            return False
    return True

print(is_english('英漢字典 EC Dictionary'))
print(is_english('Instagram'))

False
True


In [36]:
print(is_english('Docs To Go™ Free Office Suite'), ord('™'))

False 8482


In [37]:
print(is_english('Instachat 😜'), ord('😜'))

False 128540


The first version of is_english rejects some English app names that contain symbols such as '™' and '😜'.

To avoid discarding those apps, we treat a name as English if it contains at most 3 non-ASCII characters.

In [38]:
def is_english(string):
    non_ascii = 0
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1
    
    if non_ascii > 3:
        return False
    else:
        return True

print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
True
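The three-character allowance is a heuristic, so a few non-English names can still pass, and it is worth spot-checking edge cases. The sketch below repeats the function in compact form so that it runs on its own, then tries it on names copied from the listing printed earlier.

```python
def is_english(string):
    """Treat a name as English if it has at most 3 non-ASCII characters."""
    non_ascii = sum(1 for character in string if ord(character) > 127)
    return non_ascii <= 3


# Names copied from the listing of names that do not start with an ASCII letter or digit.
checks = {name: is_english(name) for name in ['РИА Новости', 'İşCep', 'ÖBB Scotty']}
for name, result in checks.items():
    print(name, '->', result)
```

Names written mostly in Latin letters, such as 'İşCep', contain few non-ASCII characters and slip through, which is a trade-off of this filter.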


In [41]:
google_english = []
ios_english = []

for app in clean_google:
    name = app[0]
    if is_english(name):
        google_english.append(app)
        
for app in ios_data:
    name = app[1]
    if is_english(name):
        ios_english.append(app)
        
explore_data(google_english, 0, 3, True)
print('\n')
explore_data(ios_english, 0, 3, True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9614
Number of columns: 13


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 

So far we have:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

## Isolating Free Apps