# Data Analysis: Apps that Attract More Users
On this program we analyze the data of free apps in Android and iOS mobile apps to determine what kind of apps that attracts more users. We generate revenues from the in-app ads which doesn't cost much compared to in-app premium subscriptions, so by knowing this we can target which app to engage more at--the more users, the better.

## Data Extraction

To make things more efficient, we're adding several functions:

**Function 1:** `extract_data(filename, header=True)`
-- Extract the data from a file
-- If there's not a header, put an additional argument `False`
   to clean the first row. e.g. `extract_data(iOS, False)`
   
^ This case would be great for opening a customized file.
But Because we're only dealing with 2, I'll make one for each instead.
Also, we need the header to fact-check for data cleaning.
   
**Function 2:** `explore_data(dataset, start, end, rows_and_columns = False)`
 -- Slice a segment of the data on a file from index `start` to `end`.
 -- If you need to know how many rows and columns are there,
    add an additional argument `True` to print both lengths.
    e.g.`explore_data(fileAndroid, 0, 3, True)`

In [1]:
## FUNCTION 1 ## def extract_data(filename, header=True):

from csv import reader

open_android = open('googleplaystore.csv')

read_android = reader(open_android)
data_android = list(read_android)

header_android = data_android[0]
data_android = data_android[1:]

open_ios = open('AppleStore.csv')

read_ios = reader(open_ios)
data_ios = list(read_ios)

header_ios = data_ios[0]
data_ios   = data_ios[1:]

## FUNCTION 2

def explore_data(dataset, start, end, rows_and_columns = False):
    dataset_slice    = dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

Extract both data `googleplaystore.csv` and `AppleStore.csv`, and then extract several rows of them. You're gonna notice that each file have different formats. For instance, the `genre` in `googleplaystore.csv` is in index [1], meanwhile `AppleStore.csv` is in index [11]. Here's the link of both files: [googleplaystore.csv](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv) & [AppleStore.csv](https://dq-content.s3.amazonaws.com/350/AppleStore.csv)

In [2]:
explore_data(data_android, 0, 3)
explore_data(data_ios, 0, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']




## Data Cleaning (80%)

In both data there may be several wrong data that's not relevant, such as non-English apps or apps that aren't free. And by that, here we're doing **data cleaning**, a process that mostly takes 80% of the analysis.

On the [discussion], there's a missing rating on index 10472. This makes the data invalid, and we can delete it with `del`.

In [3]:
## PROCESS: Cleaning data

print(data_android[10472], '\n')
print(header_android, '\n')
print(data_android[0], '\n')

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up'] 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'] 



And by that we can delete the entry, and then check the length (`len`) to make sure it's deleted.

In [4]:
print(len(data_android))

del data_android[10472]

print(len(data_android))

10841
10840


And then, if you notice, there are several duplicates of app data. For instance Instagram (look at the left screen), which have the different number of reviews implying it's collected at different times. And by that we need to separate the `unique_apps` and `duplicate_apps`. 

But we won't delete the duplicates easily. We'll go onto it in a moment. We'll need to collect the duplicates first. 

In [5]:
duplicate_apps = []
unique_apps    = []

for app in data_android:
    name    = app[0]
    if name in unique_apps:    # Yes, you can type it this way.
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
        
print('Number of duplicate apps:', len(duplicate_apps), '\n')
print('Examples of duplicate apps:', duplicate_apps[:10], '\n')

Number of duplicate apps: 1181 

Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack'] 



We have a criteria to choose the relevant duplicate. In this case, the apps are taken in different times with different number of reviews. And by that, we choose the highest one since it implies the most recent data.

In order to do that, we can do procedures below.

* Create a dictionary with each key is a unique app name and the value is the highest number of that app's reviews.
* Define the name `name` and number of reviews `n_reviews`, and compare the data.
    * If the name is in `dict reviews_max` and the number of reviews is the largest (meaning smaller than current data in the dictionary), save the largest value in dict.
    * If the name is not in `dict reviews_max`, add it.

In [6]:
reviews_max = {}

for app in data_android:
    name       = app[0]
    n_reviews  = float(app[3])
    
    if (name in reviews_max) and (reviews_max[name] < n_reviews):
        reviews_max[name] = n_reviews
        
    elif (name not in reviews_max):
        reviews_max[name] = n_reviews

* Make two empty list:
    * `android_clean` to collect the clean data, and
    * `already_added` as a backtrack if the data has been added.
* Iterate each app, define the `name` and `n_reviews`, and put the data in both lists if the `n_reviews` is the same as in `dict reviews` and the app hasn't been added yet.

In [7]:
android_clean = []
already_added = []

for app in data_android:
    name       = app[0]
    n_reviews  = float(app[3])
    
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)      # Add the APP,
        already_added.append(name)     # and put the NAME. 

Let's compared the expected length and the clean data, shall we?

In [8]:
print('Expected length:', len(data_android) - len(duplicate_apps))
print('Data clean length:', len(android_clean))

Expected length: 9659
Data clean length: 9659


^ It's the same! *yeaayy!* 🙌

Now we've done the first step. But there's still non-English apps laying in the data. How do we clean those? All we need to know, fortunately, each character in the app are named based on ASCII system. And the alphabet we're writing at in English is categorized at a value 0-127, meanwhile non-English characters are categorized at a value more than that.

In [9]:
def is_english(string):
    non_ascii = 0      # Define the number of non-ASCII characters,
    
    for character in string:
        if ord(character) > 127:
            non_ascii += 1       # Add one if there's one in the string,

    if non_ascii > 3:            # And return the `bool`  
        return False             # based on the requirements
    else:
        return True

print(is_english('Instagram'))
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(is_english('Docs To Go™ Free Office Suite'))
print(is_english('Instachat 😜'))

True
False
True
True


In the above program, you may notice the last two data are having an ASCII character more than 127, which is the 'TM' and the emoji. Some apps may have that, and if we eliminate all data with at least 1 >127 ASCII value, it'll have a lot of data loss. Therefore, we made a rule to eliminate the app IF there are >3 non-ASCII characters, as corrected starting with the comment (#) above.

It's not a prefect procedure to bash all the non-English apps, but hey it's pretty good 😁 Now we clean both data one more time to all English apps.

In [10]:
android_english = []
ios_english     = []

for app in android_clean:
    name    = app[0]
    
    if is_english(name):
        android_english.append(app)
        # If you want to add all the data,
        # add the app. Not just the name,
        
for app in data_ios:
    name    = app[0]
    
    if is_english(name):
        ios_english.append(app)
        # idem.
        
print(len(android_english))
print(len(ios_english))
    

9614
7197


Last but not least, we're isolating the free apps since that's the data we want to analyze. We do that by typing the program below.

In [11]:
android_free   = []
ios_free       = []

for app in android_english:
    price    = app[7]         # Define the price
    
    if price == '0':          # If it's free..
        android_free.append(app)   # add the *app*, not the name.

for app in ios_english:
    price    = app[4]
    
    if price == '0.0':
        ios_free.append(app)

print(len(android_free))
print(len(ios_free))
    

8864
4056


Now we've isolated the apps, we can analyze the data further!

## Data Analysis

Our goal is to determine the kinds of apps that are likely to attract more users becuase the number of people using our apps affect our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:
1. Build a minimal Android version of the app, and add it to Google Play,
2. If the app has a good response from users, we develop it further,
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

We can make two functions to make things easier:
* One to make a frequency table in percentages
* Another to display sorted descending result of app genres

In [12]:
## FUNCTION 1
def freq_table(dataset, index):
    
    table    = {}
    total    = 0
    
    for row in dataset:
        total  += 1
        data    = row[index]
        
        if data in table:
            table[data] += 1
        else:
            table[data] = 1
            
    table_percent = {}
    
    for key in table:
        percentage = (table[key] / total) * 100
        table_percent[key] = percentage
        
    return table_percent

In [13]:
def display_table(dataset, index):
    
    ## Use the previous function:
    table         = freq_table(dataset, index)
    table_display = []
    
    for key in table: # Table is `dict`
        # We want to change the `dict` type to list of tuples:
        key_val_as_tuple    = (table[key], key)
        # And then add it to the list of `table_display`:
        table_display.append(key_val_as_tuple)
        
    # After that, we sort it by the key value in descending order
    # with a built-in function called `sorted()`:
    table_sorted   = sorted(table_display, reverse=True)
    # We do this since the sorted function can be used only for list (?)
    
    # QUESTION: Why does key_val_as_tuple formatted as (table[key], key) ?
    #   I think it's because we wanna sort the first value, and
    #   by that we prioritize the value `table[key]` first. ✅
    
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [14]:
display_table(ios_free, -5)

Games : 55.64595660749507
Entertainment : 8.234714003944774
Photo & Video : 4.117357001972387
Social Networking : 3.5256410256410255
Education : 3.2544378698224854
Shopping : 2.983234714003945
Utilities : 2.687376725838264
Lifestyle : 2.3175542406311638
Finance : 2.0710059171597637
Sports : 1.947731755424063
Health & Fitness : 1.8737672583826428
Music : 1.6518737672583828
Book : 1.6272189349112427
Productivity : 1.5285996055226825
News : 1.4299802761341223
Travel : 1.3806706114398422
Food & Drink : 1.0601577909270217
Weather : 0.7642998027613412
Reference : 0.4930966469428008
Navigation : 0.4930966469428008
Business : 0.4930966469428008
Catalogs : 0.22189349112426035
Medical : 0.19723865877712032


In [15]:
display_table(android_free, -4)

Tools : 8.449909747292418
Entertainment : 6.069494584837545
Education : 5.347472924187725
Business : 4.591606498194946
Productivity : 3.892148014440433
Lifestyle : 3.892148014440433
Finance : 3.7003610108303246
Medical : 3.531137184115524
Sports : 3.463447653429603
Personalization : 3.3167870036101084
Communication : 3.2378158844765346
Action : 3.1024368231046933
Health & Fitness : 3.0798736462093865
Photography : 2.944494584837545
News & Magazines : 2.7978339350180503
Social : 2.6624548736462095
Travel & Local : 2.3240072202166067
Shopping : 2.2450361010830324
Books & Reference : 2.1435018050541514
Simulation : 2.0419675090252705
Dating : 1.861462093862816
Arcade : 1.8501805054151623
Video Players & Editors : 1.7712093862815883
Casual : 1.7599277978339352
Maps & Navigation : 1.3989169675090252
Food & Drink : 1.2409747292418771
Puzzle : 1.128158844765343
Racing : 0.9927797833935018
Role Playing : 0.9363718411552346
Libraries & Demo : 0.9363718411552346
Auto & Vehicles : 0.9250902527075

After proceeding with the `display_table` function, we can see what genre has the most app built in both stores:

* In `ios_free`, it's `Games`: 55,65%, `Entertainment` : 8.24%, `Photo & Video` : 4.12%, and `Social Networking` : 3.53%.
* In `android_free`, it's `Tools` with 8,45%.

This implies that most genre in iOS tend to be for entertainment, and most genres in Android 

In [17]:
display_table(android_free, 1)

FAMILY : 18.907942238267147
GAME : 9.724729241877256
TOOLS : 8.461191335740072
BUSINESS : 4.591606498194946
LIFESTYLE : 3.9034296028880866
PRODUCTIVITY : 3.892148014440433
FINANCE : 3.7003610108303246
MEDICAL : 3.531137184115524
SPORTS : 3.395758122743682
PERSONALIZATION : 3.3167870036101084
COMMUNICATION : 3.2378158844765346
HEALTH_AND_FITNESS : 3.0798736462093865
PHOTOGRAPHY : 2.944494584837545
NEWS_AND_MAGAZINES : 2.7978339350180503
SOCIAL : 2.6624548736462095
TRAVEL_AND_LOCAL : 2.33528880866426
SHOPPING : 2.2450361010830324
BOOKS_AND_REFERENCE : 2.1435018050541514
DATING : 1.861462093862816
VIDEO_PLAYERS : 1.7937725631768955
MAPS_AND_NAVIGATION : 1.3989169675090252
FOOD_AND_DRINK : 1.2409747292418771
EDUCATION : 1.1620036101083033
ENTERTAINMENT : 0.9589350180505415
LIBRARIES_AND_DEMO : 0.9363718411552346
AUTO_AND_VEHICLES : 0.9250902527075812
HOUSE_AND_HOME : 0.8235559566787004
WEATHER : 0.8009927797833934
EVENTS : 0.7107400722021661
PARENTING : 0.6543321299638989
ART_AND_DESIGN : 

An app profile that would be recommended for in-ad revenue for free apps are a **FAMILY** category with genre of **GAMES** and **TOOLS** for productivity.

In [21]:
genres_ios = freq_table(ios_free, -5)

for genre in genres_ios:        # We call the parent loop `genre`
    total_rc      = 0             # so we can compare it with the data
    len_genre     = 0               # and then we collect the `rating_count`
    
    for app in ios_free:
        genre_app     = app[-5]
        
        if (genre_app == genre): # `genre` = The iteration variable
                                           # of the main loop
            rating_count  = float(app[5])
            total_rc     += rating_count
            len_genre    += 1
    
    # Keep in mind, we're still on the parent loop with 'genre'.
    # And by that, the value of `total_rc` and `len_genre`
    # for each `genre` is different.
    
    avg_rating_count = total_rc / len_genre    
    print(genre, ':', avg_rating_count)

Social Networking : 53078.195804195806
Photo & Video : 27249.892215568863
Games : 18924.68896765618
Music : 56482.02985074627
Reference : 67447.9
Health & Fitness : 19952.315789473683
Weather : 47220.93548387097
Utilities : 14010.100917431193
Travel : 20216.01785714286
Shopping : 18746.677685950413
News : 15892.724137931034
Navigation : 25972.05
Lifestyle : 8978.308510638299
Entertainment : 10822.961077844311
Food & Drink : 20179.093023255813
Sports : 20128.974683544304
Book : 8498.333333333334
Finance : 13522.261904761905
Education : 6266.333333333333
Productivity : 19053.887096774193
Business : 6367.8
Catalogs : 1779.5555555555557
Medical : 459.75


Based on the number of user in App Store, we can recommend that the profile that needs more in-ads engagement for more revenue would be:
* `Reference`, avg_users : 67447.9 users,
* `Music`, avg_users : 56482.03 users, and
* `Social Networking`, avg_users : 53078.2 users.

But.. what kind of `Reference` are these? We can look at the apps (will look for the solutions rq)

...

Well, what about the Google Play Store? We can utilize the data of `Installs` on index `5`.

In [25]:
ctg_android = freq_table(android_free, 1)

for ctg in ctg_android:
    total_ctg = 0
    len_ctg   = 0
    
    for app in android_free:
        category_app = app[1]
        
        if category_app == ctg:
            
            n_installs = app[5]
            
            # Since the data in `Installs` are in `str`,
            # we can replace the strings and make it into `float`
            # by using built-in function `replace`.
            n_installs = n_installs.replace(',', '')
            n_installs = n_installs.replace('+', '')
            total_ctg += float(n_installs)
            len_ctg   += 1
            
    avg_n_installs = total_ctg / len_ctg
    print(ctg, ':', avg_n_installs)

ART_AND_DESIGN : 1986335.0877192982
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 1833495.145631068
ENTERTAINMENT : 11640705.88235294
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1331540.5616438356
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15588015.603248259
FAMILY : 3695641.8198090694
MEDICAL : 120550.61980830671
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17840110.40229885
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10801391.298666667
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24727872.452830188
NEWS_AND_

As we can see, the top 5 apps installed by users are:
* `VIDEO_PLAYERS`, avg_users = 24727872.45 users, 
* `SOCIAL`, avg_users = 23253652.13 users,
* `PHOTOGRAPHY`, avg_users = 17840110.40 users,
* `PRODUCTIVITY`, avg_users = 16787331.35 users, and
* `GAME`, avg_users = 15588015.60 users

Here are a few next steps you could take:

* Analyze the frequency table for the Genre column of the Google Play dataset, and see if you can find useful patterns.
* Assume we could also make revenue via in-app purchases and subscriptions, and try to determine which genres seem to be liked the most by users — you could examine app ratings here.
* Refine your project using our data science project style guide.

If you're going to work on the next steps above independently, you'll almost inevitably face some problems like not knowing how to fix an error, or not knowing what code to write to perform a certain task. In situations like these, the best thing to do is to start with a Google search (or any other search engine). In most situations, there will always be people who already ran into the same kind of problem, and you'll be able to use the solution they came up with.

As you search for solutions to your problems, you'll notice that one particular site will constantly show up in the first few results of your query — Stack Overflow. The community on Stack Overflow is very active, and the answers you'll find there are almost always accurate and up-to-date. One important tip when you're searching on Google is to start with the word "python". For instance, if you want to find out how to remove the characters from a string, search for "python how to remove a character from a string" (not just "how to remove a character from a string") — otherwise you'll most likely get results for other programming languages.