# FREE APP DATA ANALYSIS

## About the project: 

This project simulates working as a data analyst for a company that builds Android and iOS mobile apps. The company makes its apps available on Google Play and in the App Store. It only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means that the number of users determines the revenue for any given app: the more users who see and engage with the ads, the better. The goal of this project is to analyze data to help the developers understand what types of apps are likely to attract more users. 

As of September 2018, there were approximately 2 million iOS apps available on the App Store, and 2.1 million Android apps on Google Play.

- **STYLE/FORMATTING NOTE: The content below is formatted as a block of code first, followed by the printed result, followed by a markdown block explaining the code above and why it was run. The markdown block also provides any insights or notable discoveries in the processed data.**

- The process below is broken down into two main sections with numerous sub-sections. The two main sections are:
    1. Data Cleaning Process: I clean the data to eliminate all paid apps, non-English apps, incomplete rows, and duplicate entries. Each sub-section is a "step" in the cleaning process. 
    2. Data Analysis Process: after the data is clean, I run various functions to extract useful information from the data, such as which genres are most common and which genres have the most users. Each sub-section is a "step" in the process.
    

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Section 1: Data Cleaning Process:

## Step 1: start with a function that will make it easy to print readable rows

In [2]:
def explore_data(data_set, start, end, rows_and_columns=False):
    data_slice = data_set[start:end] 
    for row in data_slice: 
        print(row)
        print('\n') #starts a new line
    
    if rows_and_columns: 
        print('Number of rows', len(data_set))
        print('Number of columns', len(data_set[0]))

The function above takes in a data set plus the start and end indexes of the slice of the list I want. 
It stores that slice in a new variable called `data_slice`, then iterates over the slice and prints each row followed by a blank line. If `rows_and_columns` is set to `True`, it also prints the number of rows and columns in the original data set. 

## Step 2: import both csv files (google and app store) and save as list of lists

In [3]:
import csv

#open each csv file in a `with` block so the file is closed once it has been read
with open('/home/hunter/Jupyter Projects/1.free_app_proj/googleplaystore.csv') as google_data:
    read_google = csv.reader(google_data)
    gdl = list(read_google)

with open('/home/hunter/Jupyter Projects/1.free_app_proj/AppleStore (1).csv') as apple_data:
    read_apple = csv.reader(apple_data)
    adl = list(read_apple)

Above, I imported the `csv` module so I could use its `reader` function after opening each csv file. Once a file was opened, the program read it and turned it into a list of lists (one inner list per row), which will be used for all subsequent operations. 

## Step 3: take a look at the data to become familiar with how it is presented. 

In [4]:
explore_data(gdl, 0, 2, rows_and_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows 10842
Number of columns 13


In [5]:
explore_data(adl, 0, 2, rows_and_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows 7198
Number of columns 16



Above we found that both lists have similar data but are not the same. This is important to pay attention to as we go forward in the project to make sure we are calling the correct indexes for each list. 

It is also important to note that we did not remove the header when we created the lists. Therefore, whenever we use an original list directly or pass it to a function, we need to iterate over all rows except the header. 

One difference is that the Google store list has both a category and a genre column, whereas the Apple store list only has a genre column. This is important because the breakdown into category and genre gives more detail about which category/genre the company should build their test app in. Also, the Google store tracks the number of installations, which is important because it indicates how many users have installed the app on their phone. The Apple store data does not have this. 


## Step 4: Clean the data. We need to remove any apps that are not free or are not English language apps. 

In [6]:
#create a function to determine if any rows in the lists are incomplete:
def incomplete_row(a_list):
    for row in a_list[1:]:
        if len(row) != len(a_list[0]):
            print('Incomplete row:', a_list.index(row))
            print(row)


The function `incomplete_row` iterates over the given list, measures the length of each row, and compares it to the length of the header. If there is a mismatch, it lets us know the row has incomplete data in it. When we print both the header and the incomplete row, we can determine which piece of information is missing. Additionally, the index of the incomplete row is printed so I can delete it from the list in a later code block. 
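
One caveat with `a_list.index(row)`: it returns the position of the first row equal to `row`, so if an identical row appeared earlier in the list, the printed index would point at that earlier copy. A minimal sketch of an alternative using `enumerate` follows. The function name `find_incomplete_rows` and the sample data are made up for illustration and are not part of the notebook.

```python
def find_incomplete_rows(a_list):
    """Return (index, row) pairs whose length differs from the header's."""
    header_length = len(a_list[0])
    incomplete = []
    # start=1 keeps the indexes aligned with the original list, header included
    for index, row in enumerate(a_list[1:], start=1):
        if len(row) != header_length:
            incomplete.append((index, row))
    return incomplete


# hypothetical sample data: a header plus two rows, one of them too short
sample = [
    ['App', 'Category', 'Rating'],
    ['App A', 'GAME', '4.1'],
    ['App B', '4.5'],  # 'Category' is missing
]
incomplete = find_incomplete_rows(sample)
print(incomplete)
```

Returning the pairs instead of printing them makes it possible to delete the rows programmatically rather than copying the index by hand.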


In [7]:
#look for incomplete data rows by comparing the length of each row to the header:
incomplete_row(gdl)
print('\n')
print(gdl[0])


Incomplete row: 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


In [8]:
incomplete_row(adl)

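By running both `adl` and `gdl` through the `incomplete_row(a_list)` function we found:
- **Google has 1 row** of incomplete data (comparing it to the header shows the 'Category' value is missing, so every later value is shifted one column to the left).
- **App store has 0 rows** of incomplete data.

In the next code block I delete the row from `gdl`.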

In [9]:
#deleting the row with missing data from `gdl`:
del gdl[10473]

The two functions below, `dup_dict` and `duplicates`, do the same thing: they count the duplicate entries in a list. Both functions returned the same total of 1181 duplicates for the `gdl` list. Creating two functions was redundant, but I left them in for my own practice and reference. 

In [10]:
#creating a function that will make a dictionary based on app name to look for duplicates:
def dup_dict(a_list, a):
    test_dict = {}
    len_duplicates = 0
    for row in a_list[1:]:
        item = row[a]
        if item in test_dict:
            test_dict[item] += 1
            len_duplicates += 1
        else:
            test_dict[item] = 1
    print('There is a total of:', len_duplicates, 'duplicates in the list')

In [11]:
dup_dict(adl, 0)

There is a total of: 0 duplicates in the list


In [12]:
dup_dict(gdl, 0)

There is a total of: 1181 duplicates in the list


Above I created a function that looks for duplicate data. The function works in the following way:
1. initialize an empty dictionary and a duplicate counter `len_duplicates`
2. iterate over each row in the given list (skipping the header)
3. check to see if the value at the given index (the app name for `gdl`, the app id for `adl`) is already in the dictionary
4. if the item is in the dictionary, increase the key's value by 1 and add 1 to `len_duplicates`
5. if the item is not in the dictionary, set the key's value to 1
6. print `len_duplicates` as the total number of duplicates

The function can be altered to return the dictionary of duplicates or print out all the duplicates and how many there are per app.
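
As a rough illustration of that last point, the counting could also be done with `collections.Counter` from the standard library, returning the result rather than printing it. The name `duplicate_counts` and the sample rows below are hypothetical.

```python
from collections import Counter


def duplicate_counts(a_list, index):
    """Return {value: occurrences} for values that appear more than once."""
    counts = Counter(row[index] for row in a_list[1:])  # skip the header
    return {value: n for value, n in counts.items() if n > 1}


# hypothetical sample data with one app listed three times
sample = [
    ['App', 'Reviews'],
    ['App A', '10'],
    ['App B', '7'],
    ['App A', '25'],
    ['App A', '25'],
]
dupes = duplicate_counts(sample, 0)
print(dupes)

# total duplicates = extra copies beyond the first occurrence of each value
total_duplicates = sum(n - 1 for n in dupes.values())
print(total_duplicates)
```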

#### Another way to determine duplicates from lists:

In [13]:
#Creating a function where we pull data from an index and assign it to a singular entry list or duplicates list:
def duplicates(a_list, x): #x = index to pull data from
    sing_entries = []
    dup_entries = []
    for row in a_list:
        item = row[x]
        if item in sing_entries: 
            dup_entries.append(item)
        else: 
            sing_entries.append(item)
    print('This list has', len(dup_entries), 'duplicate entries') 

The function `duplicates()` listed above is a redundant function that demonstrates another way to look for duplicates in the list. 
1. create two empty lists: one for single entries, the other for duplicates
2. iterate over the given list
3. assign the data at index `x` to the variable `item`
4. check to see if `item` is in the `sing_entries` list; if it is, the item is a duplicate, so append it to the `dup_entries` list
5. otherwise, append the item to the `sing_entries` list
6. print out the length of the `dup_entries` list to get the total number of duplicates
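
Checking `item in sing_entries` scans the whole list each time, which gets slow as the list grows. A sketch of the same idea with a `set` for the membership test is shown below. `count_duplicates` and the sample values are illustrative names, not part of the notebook.

```python
def count_duplicates(a_list, x):
    """Count entries at index x that repeat an earlier entry."""
    seen = set()        # values encountered so far (fast membership test)
    dup_entries = []    # every repeat of an already-seen value
    for row in a_list:
        item = row[x]
        if item in seen:
            dup_entries.append(item)
        else:
            seen.add(item)
    return len(dup_entries)


# hypothetical sample data: 'App A' appears three times
sample = [['App A'], ['App B'], ['App A'], ['App A']]
n_dupes = count_duplicates(sample, 0)
print('This list has', n_dupes, 'duplicate entries')
```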

In [14]:
duplicates(gdl, 0)

This list has 1181 duplicate entries


In [15]:
duplicates(adl, 0)

This list has 0 duplicate entries


#### Now we delete the duplicate data. For each duplicated app, we keep the row with the highest number of reviews, on the assumption that it is the most recent entry, and delete all others. 

In [16]:
def del_dupes(a_list, x, y):
    #first pass: record the highest review count seen for each app name
    reviews_max = {}
    for row in a_list[1:]:
        app = row[x]
        n_reviews = int(row[y])
        if app not in reviews_max or reviews_max[app] < n_reviews:
            reviews_max[app] = n_reviews

    #second pass: keep one row per app, the one with the highest review count
    clean_list = []
    already_added = set()
    for row in a_list[1:]:
        app = row[x]
        n_reviews = int(row[y])
        #`already_added` prevents rows with equal review counts creating duplicates in clean list
        if n_reviews == reviews_max[app] and app not in already_added:
            clean_list.append(row)
            already_added.add(app)

    return clean_list

The function above builds a dictionary and a clean list in two passes and returns `clean_list`, which is the list without duplicates. 
1. first pass: iterate over the given list (skipping the header) and store, for each app name, the highest review count seen so far in `reviews_max`. 
2. second pass: iterate over the list again and check whether the row's review count equals the maximum stored for that app AND that the app is not in `already_added`. 
-- comparing against the stored maximum ensures we save the duplicate row with the highest number of reviews. 
-- checking `already_added` ensures that no duplicate data enters `clean_list` in the event that there are duplicate rows with identical review counts. 
3. if both checks pass, append the row to `clean_list` and record the app name in `already_added`. 
4. return the `clean_list`
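
As a small check of the intent described above (keep, for each app, the row with the most reviews), the sketch below applies a single-pass variant of the idea to a made-up list. The function name `keep_most_reviewed` and the sample rows are hypothetical. It stores the best row per app in a dictionary instead of tracking counts and names separately.

```python
def keep_most_reviewed(rows, name_idx, reviews_idx):
    """Return one row per app name: the one with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = int(row[reviews_idx])
        # replace the stored row only when this one has strictly more reviews,
        # so ties keep the first row encountered
        if name not in best or reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# hypothetical sample data (no header row)
sample = [
    ['App A', '10'],
    ['App B', '7'],
    ['App A', '25'],
    ['App B', '7'],
]
result = keep_most_reviewed(sample, 0, 1)
print(result)
```

Here `App A` is kept with 25 reviews even though the row with 10 reviews comes first in the list.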

In [17]:
#Run the function, then compare the expected length (original list without header, minus no. of duplicates) against the length of the clean list. 
gdl_clean = del_dupes(gdl, 0, 3)

expected_length_gdl = len(gdl[1:]) - 1181
print(expected_length_gdl)

length_gdl_clean = len(gdl_clean)
print(length_gdl_clean)

9659
9659


Above we created a duplicate-free list for `gdl` (`adl` did not have duplicates). 
Going forward we will only use the **`gdl_clean` list, NOT `gdl`**. Note that `gdl_clean` does not contain the header row. 

To determine if our function properly deleted the rows, we took the length of the original `gdl` list (without the header) and subtracted the number of duplicates that we found earlier. The result gave us an expected length that we can check our new clean list against. 

Both lengths matched, showing us we successfully deleted the duplicates. 

## Step 5: Remove all non-English entries.
### Recall from the initial criteria that we are analyzing free apps that are in the English language.

In [18]:
#keep only apps whose names are (mostly) English
#ASCII covers code points 0-127, which include all standard English characters
#count the characters in the name that fall outside that range and allow up to 3 of them

def is_english(string):
    non_ascii = 0
    
    for character in string: 
        if ord(character) > 127:
            non_ascii += 1
      
    #more than 3 non-ASCII characters: treat the name as non-English
    return non_ascii <= 3

In the code above, we defined a function that will do the following: 
1. set a variable `non_ascii` to 0 to be used for counting characters outside the ASCII range. 
2. iterate over each character in the app name. 
3. check whether the character's code point (from `ord`) is greater than 127, which means it is not a standard English character. 
4. if true, add 1 to `non_ascii`. 
5. after the iteration is complete, check whether `non_ascii` is greater than 3. 
    - we want to minimize data loss, so we only eliminate app names that have more than 3 non-ASCII characters, to allow for things like emojis and special dashes. 
6. return False if there are more than 3 such characters, otherwise True. 

In [19]:
gdl_clean_eng = []

for app in gdl_clean[1:]:    
    name = app[0]
    ed = is_english(name)
    if ed == True:
        gdl_clean_eng.append(app)

print(len(gdl_clean_eng))
print(len(gdl_clean_eng) - len(gdl_clean))
   

9613
-46


In [20]:
adl_clean_eng = []

for app in adl[1:]:    
    name = app[1]
    ed = is_english(name)
    if ed == True:
        adl_clean_eng.append(app)

print(len(adl_clean_eng))
print(len(adl_clean_eng) - len(adl))


6183
-1015


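In the two code blocks above we iterated over our previously cleaned data (no duplicates and no incomplete rows). 
1. initialize a new list for the English-only data. 
2. iterate over the clean data list and assign the data at the index which holds the app name to `name`
3. run the function `is_english(name)` and assign the result to `ed`
4. remember the `is_english` function returns True or False: True if the name has at most 3 non-ASCII characters, False if not. 
5. if the return is True, append the row from the original list to the newly created list. 
6. the result is the newest version of the cleaned list. 

Two details affect the printed differences. `gdl_clean` no longer contains a header row (`del_dupes` already skipped it), so the `gdl_clean[1:]` slice also leaves out the first app in the list. For `adl`, `len(adl)` still counts the header row. In both cases the printed difference is one larger than the number of apps actually rejected by `is_english`. 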


## Step 6: Remove all non-free apps

In [21]:
#make a function to create two list of lists for free and non_free data
def free_app(a_list, a):
    free = []
    not_free = []
    
    for row in a_list:
        price = row[a]
        dollar = float(price)
        if dollar == 0.0: 
            free.append(row)
        else: 
            not_free.append(row)
            
    return free

In the cell above we created a function that separates the data from the main list into two separate lists: one for free apps and one for non-free apps. Only the free list is returned. 
1. we define the function `free_app()`
2. initialize 2 lists: `free` and `not_free`
3. iterate over the list and read the price from the column index passed in as `a`
4. convert the price string to a float
5. if the price equals 0, the app is free
    - append the row to the `free` list
    - else append it to the `not_free` list
6. return the `free` list

In [22]:
adl_total_clean = free_app(adl_clean_eng, 4)


In [23]:
#remove the '$' from the price in each row of the gdl list to allow float conversion
for row in gdl_clean_eng:
    money = row[7]
    dol = money.replace('$', '')
    row[7] = dol

#now run `gdl_clean_eng` through the `free_app` function
gdl_total_clean = free_app(gdl_clean_eng, 7)


In the two code cells above we ran our clean, English-only data through the function that sorts rows into free and paid lists. We checked each list for accuracy to ensure the sum of the lengths of the two lists was equal to the total length of the original list. 
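
The length check mentioned above is not shown in the notebook cells, so here is a minimal sketch of how it might look. It assumes a variant of the split that returns both lists. The name `split_by_price` and the sample rows are hypothetical.

```python
def split_by_price(rows, price_idx):
    """Split rows into (free, not_free) based on the price column."""
    free = []
    not_free = []
    for row in rows:
        # strip a leading '$' so values such as '$4.99' convert to float
        price = float(row[price_idx].replace('$', ''))
        if price == 0.0:
            free.append(row)
        else:
            not_free.append(row)
    return free, not_free


# hypothetical sample data (no header row)
sample = [['App A', '0'], ['App B', '$4.99'], ['App C', '0.0']]
free, not_free = split_by_price(sample, 1)

# every row must land in exactly one of the two lists
assert len(free) + len(not_free) == len(sample)
print(len(free), 'free,', len(not_free), 'paid')
```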

**The clean lists going forward will be `adl_total_clean` and `gdl_total_clean`**

### Conclusion to data cleaning process:    

So far, to clean the data, we did the following:
- found and deleted incomplete rows
- removed duplicate data rows
- filtered out the majority of non-English apps (this needs to be optimized because we lost some good data while keeping some bad data in the lists)
- filtered out apps that are not free 

# Section 2: Data Analysis Process:

From the company: Our goal is to determine the kinds of apps that are likely to attract more users, because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In order to achieve these goals we need to figure out which type of app will be successful in both the Google and Apple store. 

## Step 1: Determine the most common genres between the two stores

In [24]:
#We will set up a function that makes a dictionary used as a frequency table, and a second function that prints the genres from most to least common for both the Google and Apple app stores

def pop_genres(dataset, index): #this function returns a freq table (in percent) for genres
    total = len(dataset)
    genre_dict = {}
    
    for row in dataset: #create a dictionary with values as integers (number of apps per genre)
        genre = row[index]
        if genre in genre_dict:
            genre_dict[genre] += 1
        else:
            genre_dict[genre] = 1
    
    gen_dict_percent = {}
    for genre in genre_dict: #transform the dictionary values to percentages of total 
        percent = (genre_dict[genre]/total)*100
        rnd = round(percent, 2)
        gen_dict_percent[genre] = rnd
    
    return gen_dict_percent
 
def sort_ft(dataset, index): #this function prints the freq table sorted high to low (it returns None)
    table = pop_genres(dataset, index)
    sort_list = []
    for key in table: #make a list of (percentage, genre) tuples from the dictionary
        tup_data = (table[key], key)
        sort_list.append(tup_data) 

    table_sorted = sorted(sort_list, reverse=True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

    


In the code above, I wanted to determine which genre is most popular by creating a frequency dictionary. 
The function `pop_genres` does the following: 
1. creates a dictionary frequency table from the specified list and index
2. after the frequency table is established, it converts the values in the dictionary to percentages by dividing by `len(dataset)` and multiplying by 100, then rounding to 2 decimal places

The next function `sort_ft` prints the percentage frequency table in sorted order:
1. take the percentage freq table returned by `pop_genres` and iterate over it to convert each key and value to a (value, key) tuple
2. collect the tuples in a list
3. use the `sorted` function to sort high to low based on the percentage value
4. last, print each entry in the format `genre : percentage`
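
The steps above can be tried on a tiny made-up list to see the percentages and the sort order. The sketch below is a condensed version that returns the sorted list of tuples instead of printing inside the function. `percent_table` and the `genres` sample are hypothetical names.

```python
def percent_table(values):
    """Return (percentage, value) tuples sorted from highest to lowest."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1

    total = len(values)
    # putting the percentage first makes sorted() order by percentage
    table = [(round(count / total * 100, 2), value)
             for value, count in counts.items()]
    return sorted(table, reverse=True)


# hypothetical sample: 8 apps across 3 genres
genres = ['Games', 'Games', 'Games', 'Education',
          'Music', 'Games', 'Music', 'Games']
table = percent_table(genres)
for percent, genre in table:
    print(genre, ':', percent)
```

Because the tuples are compared element by element, genres with equal percentages are ordered by name (in reverse alphabetical order here).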

In [25]:
adl_genres = sort_ft(adl_total_clean, 11)
print(adl_genres)

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12
None


In [26]:
gdl_genres = sort_ft(gdl_total_clean, 1)
print(gdl_genres)

FAMILY : 18.45
GAME : 9.87
TOOLS : 8.44
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.52
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.95
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.78
MAPS_AND_NAVIGATION : 1.4
EDUCATION : 1.29
FOOD_AND_DRINK : 1.24
ENTERTAINMENT : 1.13
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.84
WEATHER : 0.8
EVENTS : 0.71
ART_AND_DESIGN : 0.67
PARENTING : 0.65
COMICS : 0.62
BEAUTY : 0.6
None


In [27]:
gdl_genres = sort_ft(gdl_total_clean, 9)
print(gdl_genres)

Tools : 8.43
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.52
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.95
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Reference : 2.14
Simulation : 2.04
Dating : 1.86
Arcade : 1.85
Video Players & Editors : 1.77
Casual : 1.76
Maps & Navigation : 1.4
Food & Drink : 1.24
Puzzle : 1.13
Racing : 0.99
Role Playing : 0.94
Libraries & Demo : 0.94
Auto & Vehicles : 0.93
Strategy : 0.91
House & Home : 0.84
Weather : 0.8
Events : 0.71
Adventure : 0.68
Comics : 0.61
Beauty : 0.6
Art & Design : 0.59
Parenting : 0.5
Card : 0.45
Casino : 0.43
Trivia : 0.42
Educational;Education : 0.39
Educational : 0.37
Board : 0.37
Education;Education : 0.34
Word : 0.26
Casual;Pretend Play : 0.24
Music : 0.2
Racing;Action & Adventure : 0.17
Puzzle;Brain Games : 0.17
Entertainment;Music & Video : 0.17
Casual

## Step 2: Genre Data Analysis:

In the three code cells above we ran each list through the `sort_ft` function. Because `sort_ft` prints the table and does not return anything, the extra `print` call in each cell is why `None` appears at the end of the printed output.

The results show: 
- **Games** is the most popular genre (by number of apps) for the Apple store and **Entertainment** is the second most popular in the 'prime_genre' column
- **Family** is the most popular and **Game** is the second most popular for the Google store in the 'Category' column
- **Tools, Entertainment, and Education** were the three most popular genres, in that order, in the Google store 'Genres' column

Keep in mind that these results are for **free**, **English-language** apps in each store only. 

One recommendation could be to create an app that fits, or could be included in, all 4 categories that are #1 and #2 most popular across both stores. For example, a family-friendly game that also fits into the entertainment category. 

It would be very interesting to see a dataset that breaks down the sub-categories under games, especially in the Apple store. Since such a large percentage of the apps in the Apple store are in the Games genre, a breakdown could provide more insight into the most popular types of games, e.g. puzzle, action, racing, etc. 

The Google store dataset does provide this breakdown: it has data for both 'Genres' and 'Category'. Comparing the two columns suggests that categories are the primary grouping and genres are sub-groupings under a category. For example, an app could be in the Game category with the genre Puzzle. 

The important thing to keep in mind is that the company needs to create an app that will be profitable. For the app genre, there are two choices. One is to make an app in a popular genre. The benefit here is that popular genres probably have more users and a greater chance of making a lot of money if the app becomes popular. The downside is that the most popular genres have a lot of competition, so getting noticed is like being one fish in the ocean; the app would have to be vastly superior to all the other free apps out there. The other choice is to go with a less popular genre in the hope of finding a niche where many people need an app and there aren't many options on the market. 



## Step 3: Determine which type of apps have the most users. 

I will determine which app types have the most users by genre. For the Apple Store we will use the 'rating_count_tot' column (the total number of user ratings) as a proxy for the number of users. For the Google store, we will use the 'Installs' column as the number of users. 

In order to determine which genre is most popular among users, we will find the average number of ratings or installs per app. To find the average we need the sum of the ratings or installs for all apps in a genre. Then we must determine the number of apps in each genre. 

To find the sum of all ratings/installs in a genre we must use a dictionary of genres and our cleaned list. We will have to iterate over the dictionary with a nested iteration over the list to filter the data to ensure only apps of the same genre are being added up. 
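
The nested iteration described above visits the whole list once per genre. As a sketch of an alternative, the same averages can be accumulated in a single pass with two dictionaries, one for the running totals and one for the app counts. The function name `average_by_genre` and the sample rows are made up for illustration.

```python
def average_by_genre(rows, genre_idx, count_idx):
    """Return {genre: average count} computed in one pass over rows."""
    totals = {}   # sum of ratings/installs per genre
    counts = {}   # number of apps per genre
    for row in rows:
        genre = row[genre_idx]
        # strip the ',' and '+' used in values such as '10,000+'
        value = int(row[count_idx].replace(',', '').replace('+', ''))
        totals[genre] = totals.get(genre, 0) + value
        counts[genre] = counts.get(genre, 0) + 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


# hypothetical sample data (no header row)
sample = [
    ['App A', 'GAME', '10,000+'],
    ['App B', 'GAME', '1,000+'],
    ['App C', 'TOOLS', '500+'],
]
averages = average_by_genre(sample, 1, 2)
print(averages)
```

Values such as `'10,000+'` are lower bounds rather than exact install counts, so the averages are best read as rough comparisons between genres.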

## Apple Store

In [28]:
#start with a previously defined function for the dictionary: 
adl_inst = pop_genres(adl_total_clean, 11)

#iterate over the keys in `adl_inst`:
for genre in adl_inst: 
    total = 0
    len_total = 0
    for row in adl_total_clean: #iterate over the list defining rating numbers and genres
        rat = row[5]
        gen = row[11]
        if gen == genre: 
            rat = rat.replace(',', '')
            rat = rat.replace('+', '')
            rat = int(rat)
            total += rat
            len_total += 1
    #find the average of the genre prior to iterating over the next key: 
    avg_user = total / len_total
    print(genre, ':', avg_user)
    

Social Networking : 71548.34905660378
Photo & Video : 28441.54375
Games : 22788.6696905016
Music : 57326.530303030304
Reference : 74942.11111111111
Health & Fitness : 23298.015384615384
Weather : 52279.892857142855
Utilities : 18684.456790123455
Travel : 28243.8
Shopping : 26919.690476190477
News : 21248.023255813954
Navigation : 86090.33333333333
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Food & Drink : 33333.92307692308
Sports : 23008.898550724636
Book : 39758.5
Finance : 31467.944444444445
Education : 7003.983050847458
Productivity : 21028.410714285714
Business : 7491.117647058823
Catalogs : 4004.0
Medical : 612.0

## Google Store


In [29]:
#start with a previously defined function for the dictionary: 
gdl_inst = pop_genres(gdl_total_clean, 1)

#iterate over the keys in `gdl_inst`:
for genre in gdl_inst: 
    total = 0
    len_total = 0
    for row in gdl_total_clean: #iterate over the list defining install numbers and categories
        rat = row[5]
        gen = row[1]
        if gen == genre: 
            rat = rat.replace(',', '')
            rat = rat.replace('+', '')
            rat = int(rat)
            total += rat
            len_total += 1
    #find the average of the genre prior to iterating over the next key: 
    avg_user = total / len_total
    print(genre, ':', avg_user)

ART_AND_DESIGN : 1937476.2711864407
AUTO_AND_VEHICLES : 647317.8170731707
BEAUTY : 513151.88679245283
BOOKS_AND_REFERENCE : 8767811.894736841
BUSINESS : 1712290.1474201474
COMICS : 817657.2727272727
COMMUNICATION : 38456119.167247385
DATING : 854028.8303030303
EDUCATION : 3082017.543859649
ENTERTAINMENT : 21134600.0
EVENTS : 253542.22222222222
FINANCE : 1387692.475609756
FOOD_AND_DRINK : 1924897.7363636363
HEALTH_AND_FITNESS : 4188821.9853479853
HOUSE_AND_HOME : 1313681.9054054054
LIBRARIES_AND_DEMO : 638503.734939759
LIFESTYLE : 1437816.2687861272
GAME : 15837565.085714286
FAMILY : 2691618.159021407
MEDICAL : 120616.48717948717
SOCIAL : 23253652.127118643
SHOPPING : 7036877.311557789
PHOTOGRAPHY : 17805627.643678162
SPORTS : 3638640.1428571427
TRAVEL_AND_LOCAL : 13984077.710144928
TOOLS : 10695245.286096256
PERSONALIZATION : 5201482.6122448975
PRODUCTIVITY : 16787331.344927534
PARENTING : 542603.6206896552
WEATHER : 5074486.197183099
VIDEO_PLAYERS : 24852732.40506329
NEWS_AND_MAGAZINE

## Analysis Conclusions:

Based on the results obtained above in Step 1, we can see that the genres that contain the most apps are: 
- Games as the most popular genre for the Apple store and entertainment as second most popular in the 'prime_genre' column
- Family is most popular and Game is second most popular genre for the Google store in the 'Category' column
- Tools, entertainment, and education were the three most popular genres in the Google store 'Genre' column respectively

The results from Step 3 indicate that in the Apple Store: 
- Navigation, Reference, and Social Networking were the genres whose apps had the most user ratings on average. Since the number of ratings was the proxy for number of users, these three genres contain, on average, the most used apps in the store. 

And in the Google Store: 
- Among the categories shown in the output above, Communication, Video Players, Social, and Entertainment had the highest average number of installs. 

Combining the information from Steps 1 and 3 of the analysis section, a good strategy may be to avoid the genres that have the most apps in them, because there is more competition and less likelihood that the test app would be found amongst all the other apps in the genre. We might want to find a genre that is dominated by only a few apps in terms of users. The reason is that we can mimic what the extremely popular apps are doing without facing a lot of competition in the space. 

Reviewing and comparing both results from Step 3, a good strategy might be to pick an app that could be found in multiple genres that are less popular but not the least popular, and that can be applied in a unique way to give the user a new experience rather than a modified experience. For example, the books/reference genre in the Google store, the book genre in the Apple store, and the travel genre in both stores are relatively less popular. A specific idea might be an app that interfaces with your calendar to determine whether you have any upcoming trips and where you are going. It could then give you book recommendations based on your destination and your reading preferences. 

Recommendations for further analysis/research:
Further analysis should be conducted to determine the most popular apps in the genres that the company selects to build their app in, and to determine why those apps are so popular.

The data cleaning process could be optimized to eliminate all non-English apps and retain those that are English but were removed. 