## Profitable App Profiles for the App Store and Google Play Markets

Revenue for any given app is driven mostly by the number of people who use it: the more users who see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what types of apps are likely to attract more users.
This notebook is the write-up for a first project in Python, mostly using basic data structures and constructs such as lists, dictionaries, tuples, and functions.


Let's read the data and begin the adventure.


**Reading Data**

In [126]:
from csv import reader

def open_csv(file_name="AppleStore",file_path='C:/Users/cccherukuri/AnacondaProjects/Python101',enco="utf8",headerflag=True):
    # Build the full path and read every row; `with` closes the file afterwards
    with open(file_path + "/" + file_name + ".csv", encoding=enco, newline="") as file_open:
        file_list=list(reader(file_open))
    if headerflag:
        # Return the header row and the data rows separately
        return file_list[0], file_list[1:]
    return file_list

app_store_header,app_store_data = open_csv("AppleStore",file_path='C:/Users/cccherukuri/AnacondaProjects/Python101')
play_store_header,play_store_data = open_csv("googleplaystore",file_path='C:/Users/cccherukuri/AnacondaProjects/Python101')
#play_store_data[0:2]

**Basic Data Exploration**

In [127]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
explore_data(app_store_data,1,2,rows_and_columns=True)
explore_data(play_store_data,1,2,rows_and_columns=True)

['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


Number of rows: 7197
Number of columns: 17
['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


Columns in App Store Data

In [128]:
#explore_data(app_store_header,1,17,False)
app_store_header


['',
 'id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

Columns in Play Store Data


In [129]:
#explore_data(play_store_header,1,13,False)
play_store_header

['App',
 'Category',
 'Rating',
 'Reviews',
 'Size',
 'Installs',
 'Type',
 'Price',
 'Content Rating',
 'Genres',
 'Last Updated',
 'Current Ver',
 'Android Ver']

For more information about each column, please refer to the links below.

[App Store](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home)

[Play Store](https://www.kaggle.com/lava18/google-play-store-apps/home)

Let's now work on data cleaning. A group discussion flagged a possible problem with row 10472 of the Play Store data. The exact row index depends on how your data was read (i.e., whether the header row is included). Printing that row shows 12 values instead of 13: the `Category` value is missing, so every later column is shifted one position to the left.

In [130]:
explore_data(play_store_data,10472,10474)

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']




In [131]:
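# Use with caution: without the length check, re-running this cell would delete a different row each time
if len(play_store_data[10472]) != len(play_store_header):
    del play_store_data[10472]
explore_data(play_store_data,10471,10473)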

['Xposed Wi-Fi-Pwd', 'PERSONALIZATION', '3.5', '1042', '404k', '100,000+', 'Free', '0', 'Everyone', 'Personalization', 'August 5, 2014', '3.0.0', '4.0.3 and up']


['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']
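
Instead of relying on a row number taken from a discussion thread, rows with the wrong number of fields can be located by comparing each row's length with the header's length. The following is a minimal sketch on made-up rows; `find_malformed_rows` is a hypothetical helper, not part of the original notebook.

```python
def find_malformed_rows(header, rows):
    """Return (index, row) pairs whose field count differs from the header's."""
    expected = len(header)
    return [(i, row) for i, row in enumerate(rows) if len(row) != expected]


# Toy data (an assumption for illustration): the second row lacks its Category value
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.2'],
    ['App B', '1.9'],
    ['App C', 'SOCIAL', '4.5'],
]

bad_rows = find_malformed_rows(sample_header, sample_rows)
print(bad_rows)

# Delete from the end so that earlier indexes stay valid while removing
for index, _ in reversed(bad_rows):
    del sample_rows[index]
print(len(sample_rows))
```

Because the check looks at the row itself rather than at a fixed position, it gives the same answer whether or not the header row was counted when the file was read.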




In [132]:
# Fixing the App store Header with blank column
app_store_header[0]="SNo"
# Just Preview header after fixing
app_store_header


['SNo',
 'id',
 'track_name',
 'size_bytes',
 'currency',
 'price',
 'rating_count_tot',
 'rating_count_ver',
 'user_rating',
 'user_rating_ver',
 'ver',
 'cont_rating',
 'prime_genre',
 'sup_devices.num',
 'ipadSc_urls.num',
 'lang.num',
 'vpp_lic']

According to the group discussion, there seem to be some duplicate rows. Let's fix that now, starting with Android (Play Store data).

**Google Play Store (Android)**

In [133]:
def find_dup_apps(dataset, name_index=0):
    # name_index: column that holds the app name (0 for the Play Store data)
    unique_apps=[]
    duplicate_apps=[]
    for apps in dataset:
        app_name=apps[name_index]
        if app_name in unique_apps:
            duplicate_apps.append(app_name)
        else:
            unique_apps.append(app_name)
    print('Number of duplicate apps',len(duplicate_apps))
    print("\n")
    if len(duplicate_apps)>0:
        print('Some of duplicate apps',duplicate_apps[:15])
    else:
        print('There seem to be no duplicates. Good news!')
    return unique_apps, duplicate_apps

unq_ps_apps,dup_ps_apps=find_dup_apps(play_store_data)

Number of duplicate apps 1181


Some of duplicate apps ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']


**App Store (iOS)**

Similarly, let's find how many such duplicates there are in the App Store (iOS) data.

In [134]:
# Column 0 of the App Store data is the serial number (SNo), so this call compares SNo values;
# pass name_index=2 (track_name) to compare app names instead
unq_ios_apps,dup_ios_apps=find_dup_apps(app_store_data)

Number of duplicate apps 0


There seem to be no duplicates. Good news!


If there are duplicates, should we just delete them at random? For example, the `Instagram` entries are duplicated, but the number of reviews differs from row to row. A higher review count likely means a more recent snapshot of the app, so a better criterion than random deletion is to keep the row with the highest number of reviews and delete the others.


In [135]:
for app in play_store_data:
    name=app[0]
    if name=='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
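
To see how many copies each name has without scanning a list over and over, the names can be tallied with `collections.Counter` from the standard library. This is only a sketch on toy rows: `count_duplicates` is a hypothetical helper, the review counts for `Box` and `Slack` are made up, and the name column index is an assumption that must match the dataset (0 for the Play Store data).

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Map each app name that appears more than once to its number of occurrences."""
    counts = Counter(row[name_index] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}


# Toy rows (made up for illustration): name, review count
toy_rows = [
    ['Instagram', '66577313'],
    ['Box', '159872'],
    ['Instagram', '66577446'],
    ['Slack', '51507'],
    ['Box', '159872'],
    ['Instagram', '66509917'],
]

duplicates = count_duplicates(toy_rows)
print(duplicates)

# Extra rows that deduplication would remove: occurrences minus one per name
extra_rows = sum(n - 1 for n in duplicates.values())
print(extra_rows)
```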


**Remove Duplicates if any**

We will use dictionaries to help us remove duplicate entries. Remember that we have a function that returns the duplicate and unique apps.
Below we remove the duplicate rows, keeping for each app only the row with the highest number of reviews. The logic is:

- Create an empty dictionary `reviews_max`.  
- Loop through the Play Store data and store, for each app name, the maximum number of reviews seen so far. Be careful with the `if` condition here.  
- Loop through the data again and compare each row's review count with `reviews_max` to pick the row to keep (somewhat similar to a VLOOKUP inside an IF).  
- Put the clean rows into `android_clean`.

In [141]:
reviews_max={}
for app in play_store_data:
    name=app[0]
    n_reviews=float(app[3])
    # Store the review count if the app is new or this row has more reviews
    if name not in reviews_max or reviews_max[name]<n_reviews:
        reviews_max[name]=n_reviews


print('Expected length:', len(play_store_data) - len(dup_ps_apps))
print('Actual length:', len(reviews_max))   


Expected length: 9659
Actual length: 9659


In [154]:
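android_clean=[]
already_added=set()  # names already kept; a set makes the membership test fast

for app in play_store_data:
    name=app[0]
    n_reviews=float(app[3])
    # The second condition handles duplicates that tie on the maximum review count
    if (n_reviews==reviews_max[name]) and (name not in already_added):
        android_clean.append(app)
        already_added.add(name)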
#print(android_clean[:10])
print("Length of clean android dataset:", len(android_clean))

Length of clean android dataset: 9659
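
The two loops above can also be folded into a single pass by storing, for each name, the best row seen so far. A minimal sketch on toy rows follows; `keep_most_reviewed` is a hypothetical helper, the `Box` values are made up, and the column positions are assumptions that match the Play Store layout (name at index 0, reviews at index 3).

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the first row that has the highest review count."""
    best = {}  # name -> row with the most reviews so far (dicts keep insertion order)
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means that on a tie the earlier row is kept
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


toy_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]

clean = keep_most_reviewed(toy_rows)
for row in clean:
    print(row)
```

One difference from the two-loop version: the result here is ordered by each name's first appearance rather than by the position of the row that was kept.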


**Removing Non-English Apps**

Use the built-in `ord` function to find characters whose code point falls outside the ASCII range 0-127.

In [167]:
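def find_non_eng_chars(input_string="Instagram"):
    # Despite the name, this version returns True when every character is ASCII
    # and False as soon as one non-ASCII character is found
    for letter in input_string:
        if ord(letter)>127:
            return False
    return True

find_non_eng_chars('Instachat 😜')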


False
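
For reference, `ord` returns the Unicode code point of a single character, and ASCII covers code points 0 through 127. The short sketch below prints the code points of a few sample characters and shows that `str.isascii` (available from Python 3.7) performs the same whole-string test.

```python
# Code points of a few characters: ASCII letters are below 128, the others are not
for ch in ['a', 'Z', '™', '爱', '😜']:
    print(repr(ch), ord(ch), ord(ch) <= 127)

# Whole-string check, equivalent to testing every character with ord
name = 'Instachat 😜'
all_ascii = all(ord(ch) <= 127 for ch in name)
print(all_ascii, name.isascii())
```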

Let's modify the function above a bit so that a string is treated as non-English only if it contains more than 3 non-ASCII characters. Note that the return value is flipped: this version returns `True` for non-English strings.


In [173]:
def find_non_eng_chars(input_string="Instagram"):
        counter=0
        for letter in input_string:
            if ord(letter)>127:
                counter+=1            
        if counter>3:
            return(True)
        return(False)

find_non_eng_chars('爱奇艺PPS -《欢乐颂2》电视剧热播')


True
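
The threshold matters because some English app names contain an emoji or a trademark sign. The sketch below separates the counting from the decision; `count_non_ascii` and `is_probably_english` are hypothetical names, the `Docs To Go™` title is only an illustrative example, and the limit of 3 mirrors the rule described above.

```python
def count_non_ascii(text):
    """Count the characters whose code point lies outside the ASCII range 0-127."""
    return sum(1 for ch in text if ord(ch) > 127)


def is_probably_english(text, max_non_ascii=3):
    """Treat a name as English when it has at most max_non_ascii non-ASCII characters."""
    return count_non_ascii(text) <= max_non_ascii


sample_names = [
    'Instagram',
    'Instachat 😜',
    'Docs To Go™ Free Office Suite',
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
]

for name in sample_names:
    print(name, count_non_ascii(name), is_probably_english(name))
```

This is a heuristic: a short non-English name with three or fewer non-ASCII characters would still pass the check.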

In [185]:
# App Store data
# The app name is in the "track_name" column; column 0 holds the serial number (SNo)
name_index=app_store_header.index("track_name")
ios_english=[]
for app in app_store_data:
    name=app[name_index]
    if not(find_non_eng_chars(name)):
        ios_english.append(app)
print("Remaining records after removing non-english strings:",len(ios_english))
# Testing app[0] instead would check the serial number and keep all 7197 rows


In [186]:
# Play Store data
android_english=[]
for app in android_clean:
    name=app[0]
    if not(find_non_eng_chars(name)):
        android_english.append(app)
print("Remaining records after removing non-english strings:",len(android_english))

Remaining records after removing non-english strings: 9614


**Isolate Free Apps**

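Let's quickly isolate the free apps from both data sets. Note that the relevant columns differ: the Play Store data has a `Type` column with the value `Free`, while the App Store data only has `price`. Also note that the code below reads from `android_clean` and `app_store_data`, so the non-English filter from the previous step is not applied here.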

In [200]:
free_android_apps=[]
for app in android_clean:
    if app[play_store_header.index("Type")]=='Free':
        free_android_apps.append(app)
print('Total Android free apps are: ',len(free_android_apps))

free_ios_apps=[]
for app in app_store_data:
    if app[app_store_header.index("price")]=='0':
        free_ios_apps.append(app)
print('Total iOS free apps are: ',len(free_ios_apps))


Total Android free apps are:  8904
Total iOS free apps are:  4056
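
Comparing the price with the exact string `'0'` only works if every free app is written that way. A more tolerant check converts the text to a number first. This is a sketch under assumptions: `is_free` is a hypothetical helper, and the `'0.0'`, `'$0.00'` and `'$4.99'` formats are illustrative guesses at how prices might appear, not values confirmed from the files.

```python
def is_free(price_text):
    """Return True when a price string such as '0', '0.0' or '$0.00' represents zero."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number (for example a misplaced text value): do not count it as free
        return False


for price in ['0', '0.0', '$0.00', '$4.99', 'Everyone']:
    print(repr(price), is_free(price))
```

If the real data contains any of these alternative spellings, switching to such a check could change the counts printed above.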


**Most Common Apps by Genre**

We may have to use the rating counts to identify the frequency by genre. Note that `rating_count_tot` exists only in the App Store data; the closest Play Store column is `Reviews`.
Alternatively we could also use ''

In [203]:
print(play_store_header)
android_clean[1]

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps',
 'ART_AND_DESIGN',
 '4.7',
 '87510',
 '8.7M',
 '5,000,000+',
 'Free',
 '0',
 'Everyone',
 'Art & Design',
 'August 1, 2018',
 '1.2.4',
 '4.0.3 and up']
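
One way to find the most common genres is a frequency table expressed in percentages. The sketch below runs on toy rows; `freq_table` and `display_table` are hypothetical helpers, and the column index passed in is an assumption that has to match the dataset (for example `play_store_header.index('Category')` for the Play Store data).

```python
def freq_table(rows, index):
    """Return {value: percentage of rows} for the column at the given index."""
    counts = {}
    for row in rows:
        value = row[index]
        counts[value] = counts.get(value, 0) + 1
    total = len(rows)
    return {value: count / total * 100 for value, count in counts.items()}


def display_table(rows, index):
    """Print the frequency table sorted from most to least common."""
    table = freq_table(rows, index)
    for value, pct in sorted(table.items(), key=lambda item: item[1], reverse=True):
        print(value, ':', round(pct, 2))


# Toy rows (made up for illustration): app name, category
toy_rows = [
    ['App A', 'TOOLS'],
    ['App B', 'SOCIAL'],
    ['App C', 'TOOLS'],
    ['App D', 'ART_AND_DESIGN'],
]

display_table(toy_rows, 1)
```

A table like this only counts how many apps fall in each genre; weighting by reviews or installs would be a separate step.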