# Introduction
Our aim is to help our developers understand which types of apps are likely to attract more users on Google Play and the Apple App Store. To do that, we'll collect and analyze data about mobile apps available in both stores.

---
### Given: Data Sets
- Data from the Apple App Store and the dataset's documentation [are found here.](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)
- Data from the Google Play Store and the dataset's documentation [are found here.](https://www.kaggle.com/lava18/google-play-store-apps)

### Given: Function explore_data

This function takes in four parameters:
- `dataset`, which is expected to be a list of lists.
- `start` and `end`, which are both expected to be integers and represent the starting and the ending indices of a slice from the data set.
- `rows_and_columns`, which is expected to be a Boolean and has `False` as a default argument.

It:
- Slices the data set using `dataset[start:end]`.
- Loops through the slice, and for each iteration, prints a row and adds a new line after that row using `print('\n')`.
  - The `\n` in `print('\n')` is a special character and won't be printed. Instead, the `\n` character adds a new line, and we use `print('\n')` to add some blank space between rows.
- Prints the number of rows and columns if `rows_and_columns` is `True`.
  - `dataset` shouldn't have a header row, otherwise the function will print the wrong number of rows (one more row compared to the actual length).


In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

--------
## Step 1: Open the datasets

We'll open both data sets now and convert each one to a list of lists so it can be passed to the `explore_data` function.

In [2]:
# global imports
from csv import reader

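# open Apple file (the encoding is set explicitly because app names contain non-ASCII characters)
apple_file = open('AppleStore.csv', encoding='utf8')
read_apple_file = reader(apple_file)
apple_apps_data = list(read_apple_file)
apple_file.close()

# open Google file
google_file = open('googleplaystore.csv', encoding='utf8')
read_goog_file = reader(google_file)
goog_apps_data = list(read_goog_file)
google_file.close()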
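
Because several later steps depend on whether a list still contains its header row, it can help to separate the header from the data as soon as a file is loaded. The sketch below is one way to do that; `load_csv` is a hypothetical helper that is not part of the original notebook, and the demo writes a small throwaway file so that it runs without the Kaggle downloads.

```python
import os
import tempfile
from csv import reader


def load_csv(path, encoding="utf8"):
    """Hypothetical helper: read a CSV file and return (header, rows)."""
    # The context manager closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        data = list(reader(f))
    return data[0], data[1:]


# Demo on a throwaway file (made-up contents, not the real data sets).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "w", encoding="utf8", newline="") as f:
        f.write("App,Category\nExample App,TOOLS\n")
    header, rows = load_csv(demo_path)

print(header)
print(len(rows))
```

With the header held in its own variable, functions that loop over `rows` don't need to slice with `[1:]`.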

----------------
## Step 2: Explore the data
Let's print out a few lines of the datasets to see what they look like!

In [3]:
# Explore the Apple dataset
explore_data(apple_apps_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
# Explore the Google dataset
explore_data(goog_apps_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


---
## Step 3: Clean the data
Now that we've opened and explored our two datasets, we'll need to scrub and clean the data.

This process of preparing our data for analysis is called `data cleaning`. Data cleaning is done before the analysis; it includes removing or correcting wrong data, removing duplicate data, and modifying the data to fit the purpose of our analysis.

### Remove rows with missing data

First, in the Google Apps dataset, we know from the Kaggle discussion that one row is missing its category value. Let's delete that row, and then run the `explore_data` function again to confirm that we have one fewer row.

In [5]:
len(goog_apps_data[10473])
del goog_apps_data[10473]

In [6]:
explore_data(goog_apps_data, 0, 0, True)

Number of rows: 10841
Number of columns: 13


### Remove rows with duplicate data

Second, we know from the Kaggle discussion that the Google Apps dataset contains duplicate rows. Let's count the entries whose app name has already appeared, along with the number of unique app names.

In [7]:
def dup_data(dataset):
    unique_list=[]
    dup_list=[]

    for apps in dataset[1:]:
        app_name=apps[0]
        if app_name in unique_list:
            dup_list.append(app_name)
        else:
            unique_list.append(app_name)

    print(len(dup_list))
    print(len(unique_list))

In [8]:
dup_data(goog_apps_data)

1181
9659


In [9]:
dup_data(apple_apps_data)

0
7197


So, it looks like we've got 1,181 duplicate entries in the Google set and 0 in the Apple set. Wowza!

Now, let's delete the duplicate entries. Rather than removing them at random, we'll use the "Reviews" column: when an app appears in more than one row, we'll keep the row with the highest number of reviews.

First, let's create a dictionary object that contains app names, and when we run across a duplicate app name, the number of reviews is updated in our new dictionary object if it's larger than the previous reviews number.

In [10]:
reviews_max={}

for apps in goog_apps_data[1:]:
    name=apps[0]
    n_reviews=float(apps[3])
    
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
    if name not in reviews_max:
        reviews_max[name]=n_reviews

print(len(reviews_max))

9659


Next, let's add each app's entire row to a new list, but only if its review count matches the maximum stored in `reviews_max` and the app hasn't already been added.

We'll print the length of the new list, and it should match the length of the reviews_max dictionary object.

In [11]:
android_clean=[]
already_added=[]

for apps in goog_apps_data[1:]:
    name=apps[0]
    n_reviews=float(apps[3])
    
    if n_reviews==reviews_max[name] and name not in already_added:
        android_clean.append(apps)
        already_added.append(name)
        
print(len(android_clean))

9659


Success!
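
The two passes above can also be folded into one. The following is a minimal sketch under the same rule (keep the first row that has the highest review count for each name); `keep_most_reviewed` and the sample rows are made up for illustration and are not part of the original notebook.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per app name, preferring more reviews."""
    best = {}  # app name -> best row seen so far
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' means the first row with the maximum count wins on ties.
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up sample rows in the Google Play column order (name, category, rating, reviews).
sample = [
    ["App A", "TOOLS", "4.1", "100"],
    ["App B", "GAME", "4.5", "50"],
    ["App A", "TOOLS", "4.1", "250"],
    ["App A", "TOOLS", "4.1", "250"],
]
deduped = keep_most_reviewed(sample)
print(len(deduped))  # 2
```

A dictionary lookup is faster than the `name not in already_added` list check, which matters little at this size but avoids rescanning the list for every row. Note that the rows come back ordered by each name's first appearance, which may differ from the order in `android_clean`.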

### Remove rows with non-English audiences
Since we only use English for the apps that we develop at our company, we need to remove any rows of data that have non-English names.

We can use the built-in `ord()` function to get the character number of each character in an app's name. The characters we commonly use in English have character numbers in the range of 0 to 127 (the ASCII range).

In Python, since strings are iterable, this shouldn't be too much of a hassle to check! If an app name has more than 3 characters outside that range, we'll count it as non-English. Allowing up to 3 keeps names that contain an emoji or a symbol such as ™.

In [12]:
def english_check(a_string):
    count=0
    for letters in a_string:
        letter_num=ord(letters)
        if letter_num>127:
            count+=1
    if count>3:
        return False
    else:
        return True

In [13]:
def eng_apps(dataset, name_index):
    english_apps=[]
    non_eng_apps=[]

    # dataset[1:] skips the first row, so this assumes dataset[0] is a header row
    for apps in dataset[1:]:
        name=apps[name_index]
        if english_check(name):
            english_apps.append(apps)
        else:
            non_eng_apps.append(apps)

    print(len(non_eng_apps))
    print(len(english_apps))
    return english_apps

In [14]:
goog_eng_apps=eng_apps(android_clean, 0)

45
9613


In [15]:
apple_eng_apps=eng_apps(apple_apps_data,1)

1014
6183


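For the Google dataset, it looks like 45 apps are non-English-geared, while 9,613 apps are English-focused. Those two numbers add up to 9,658, one fewer than the 9,659 rows in `android_clean`, because `eng_apps` skips the first row as if it were a header and `android_clean` doesn't have one.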

In the Apple dataset, 1,014 of the apps are non-English-geared, while 6,183 apps are English-focused.

### Remove rows with paid apps

We can do this by checking the price column (`Price` in the Google dataset, `price` in the Apple dataset), grabbing every row whose price is zero, and putting those rows in a new list named `free_apps`.

In [16]:
def free_list(dataset, price_index):
    free_apps=[]

    for apps in dataset:
        price=apps[price_index]
        if price=="0" or price=="0.0" or price==0:
            free_apps.append(apps)

    print(len(free_apps))
    return free_apps

In [17]:
free_goog=free_list(goog_eng_apps, 7)

8863


In [18]:
free_apple=free_list(apple_eng_apps, 4)

3222


So far, we've cleaned our data in the following ways:
- Removed missing data
- Removed duplicate entries
- Removed non-English apps
- Isolated the free apps

---
## Step 4: Analyze app genres

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. 

Our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

To analyze genres further, we'll first build a frequency table that shows the percentage of apps in each genre. We'll use the `prime_genre` column of the App Store data set, and the `Genres` and `Category` columns of the Google Play data set.

In [19]:
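def freq_table(dataset, index):
    new_dict={}
    # dataset[0] is treated as a header row, so the first row is excluded from the counts
    col_name=dataset[0][index]
    for data in dataset[1:]:
        value=data[index]
        if value in new_dict:
            new_dict[value]+=1
        else:
            new_dict[value]=1
    # convert each count to a percentage of the counted rows
    for vals in new_dict:
        new_dict[vals]=(new_dict[vals]/len(dataset[1:])*100)
    return new_dict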
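
As an alternative sketch, the standard library's `collections.Counter` can do the counting. `percent_table` below is a hypothetical name, and unlike `freq_table` it assumes that `rows` contains data rows only, with no header.

```python
from collections import Counter


def percent_table(rows, index):
    """Hypothetical helper: percentage of rows for each value in column `index`."""
    counts = Counter(row[index] for row in rows)
    total = sum(counts.values())
    # most_common() yields the values from most to least frequent.
    return {value: count / total * 100 for value, count in counts.most_common()}


# Made-up sample rows: (name, category).
sample_rows = [["a", "GAME"], ["b", "GAME"], ["c", "TOOLS"], ["d", "FAMILY"]]
table = percent_table(sample_rows, 1)
for value, percentage in table.items():
    print(value, ":", percentage)
```

Because dictionaries preserve insertion order, the result is already sorted from most to least common and needs no separate sorting step before printing.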

### Given: display_table function

- Takes in two parameters: `dataset` and `index`. `dataset` is expected to be a list of lists, and `index` is expected to be an integer.
- Generates a frequency table using the `freq_table()` function.
- Transforms the frequency table into a list of tuples, then sorts the list in a descending order.
- Prints the entries of the frequency table in descending order.


In [20]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [21]:
# get the frequency percentage table for the 'category' column in the Google dataset

display_table(free_goog, 1)

FAMILY : 18.912209433536447
GAME : 9.726923944933423
TOOLS : 8.463100880162491
BUSINESS : 4.5926427443015125
LIFESTYLE : 3.9043105393816293
PRODUCTIVITY : 3.8930264048747465
FINANCE : 3.7011961182577298
MEDICAL : 3.5319341006544795
SPORTS : 3.39652448657188
PERSONALIZATION : 3.3175355450236967
COMMUNICATION : 3.238546603475513
HEALTH_AND_FITNESS : 3.080568720379147
PHOTOGRAPHY : 2.945159106296547
NEWS_AND_MAGAZINES : 2.798465357707064
SOCIAL : 2.663055743624464
TRAVEL_AND_LOCAL : 2.335815842924848
SHOPPING : 2.2455427668697814
BOOKS_AND_REFERENCE : 2.143985556307831
DATING : 1.8618821936357481
VIDEO_PLAYERS : 1.7941773865944481
MAPS_AND_NAVIGATION : 1.399232678853532
FOOD_AND_DRINK : 1.2412547957571656
EDUCATION : 1.162265854208982
ENTERTAINMENT : 0.9591514330850823
LIBRARIES_AND_DEMO : 0.9365831640713158
AUTO_AND_VEHICLES : 0.9252990295644324
HOUSE_AND_HOME : 0.8237418190024826
WEATHER : 0.8011735499887158
EVENTS : 0.7109004739336493
PARENTING : 0.6544798013992327
COMICS : 0.620627397

In [22]:
# get the frequency percentage table for the 'genre' column in the Google dataset

display_table(free_goog, 9)

Tools : 8.451816745655607
Entertainment : 6.070864364703228
Education : 5.348679756262695
Business : 4.5926427443015125
Productivity : 3.8930264048747465
Lifestyle : 3.8930264048747465
Finance : 3.7011961182577298
Medical : 3.5319341006544795
Sports : 3.4642292936131795
Personalization : 3.3175355450236967
Communication : 3.238546603475513
Action : 3.1031369893929135
Health & Fitness : 3.080568720379147
Photography : 2.945159106296547
News & Magazines : 2.798465357707064
Social : 2.663055743624464
Travel & Local : 2.324531708417964
Shopping : 2.2455427668697814
Books & Reference : 2.143985556307831
Simulation : 2.0424283457458814
Dating : 1.8618821936357481
Arcade : 1.8505980591288649
Video Players & Editors : 1.7716091175806816
Casual : 1.7603249830737984
Maps & Navigation : 1.399232678853532
Food & Drink : 1.2412547957571656
Puzzle : 1.128413450688332
Racing : 0.9930038366057323
Role Playing : 0.9365831640713158
Libraries & Demo : 0.9365831640713158
Auto & Vehicles : 0.92529902956443

In [23]:
# get the frequency percentage table for the 'prime_genre' column in the Apple dataset

display_table(free_apple, 11)

Games : 58.180689226948154
Entertainment : 7.885749767153058
Photo & Video : 4.967401428127911
Education : 3.6634585532443342
Social Networking : 3.2598571872089415
Shopping : 2.607885749767153
Utilities : 2.5147469729897547
Sports : 2.1421918658801617
Music : 2.049053089102763
Health & Fitness : 2.018006830176964
Productivity : 1.7385904998447685
Lifestyle : 1.5833592052157717
News : 1.334989133809376
Travel : 1.2418503570319777
Finance : 1.11766532132878
Weather : 0.8692952499223843
Food & Drink : 0.8072027320707855
Reference : 0.55883266066439
Business : 0.5277864017385905
Book : 0.43464762496119214
Navigation : 0.18627755355479667
Medical : 0.18627755355479667
Catalogs : 0.12418503570319776


The frequency tables we analyzed showed us that the App Store is dominated by apps designed for fun, while Google Play shows a more balanced landscape of both practical and fun apps.

Now, we'd like to get an idea about the kind of apps with the most users.

---
## Step 5: Analyze app user numbers

One way to find out what genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the `Installs` column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
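
The per-genre average can be sketched as a single pass over the rows. In the example below, `average_by_key` and `parse_installs` are hypothetical helpers and the sample rows are made up. `parse_installs` assumes that install counts look like `10,000+`, as in the `Installs` column shown earlier.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn a string such as '10,000+' into the float 10000.0."""
    return float(text.replace(",", "").replace("+", ""))


def average_by_key(rows, key_index, value_index, parse=float):
    """Hypothetical helper: average the parsed value column for each key."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += parse(row[value_index])
        counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Made-up sample rows: (name, category, installs).
sample_rows = [
    ["App A", "TOOLS", "10,000+"],
    ["App B", "TOOLS", "1,000+"],
    ["App C", "GAME", "500+"],
]
averages = average_by_key(sample_rows, 1, 2, parse=parse_installs)
for category, avg in sorted(averages.items(), key=lambda item: item[1], reverse=True):
    print(category, ":", avg)
```

Sorting the averages before printing makes the highest and lowest genres easier to spot than reading them in dictionary order. Keep in mind that install counts are lower bounds (`10,000+` means at least 10,000), so the averages are approximate.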


### App Store User Numbers

In [38]:
apple_table=freq_table(free_apple, 11)

for genre in apple_table:
    total=0
    len_genre=0
    for app in free_apple:
        genre_app=app[11]
        if genre_app==genre:
            user_ratings=float(app[5])
            total+=user_ratings
            len_genre+=1
    avg_num=total/len_genre
    print(genre,': ',avg_num)

Photo & Video :  28441.54375
Games :  22788.6696905016
Music :  57326.530303030304
Social Networking :  71548.34905660378
Reference :  74942.11111111111
Health & Fitness :  23298.015384615384
Weather :  52279.892857142855
Utilities :  18684.456790123455
Travel :  28243.8
Shopping :  26919.690476190477
News :  21248.023255813954
Navigation :  86090.33333333333
Lifestyle :  16485.764705882353
Entertainment :  14029.830708661417
Food & Drink :  33333.92307692308
Sports :  23008.898550724636
Book :  39758.5
Finance :  31467.944444444445
Education :  7003.983050847458
Productivity :  21028.410714285714
Business :  7491.117647058823
Catalogs :  4004.0
Medical :  612.0


The Apple data shows that the free App Store genres with the most user ratings per app, on average, are Navigation, Reference, Social Networking, and Music. If we were to succeed in garnering revenue from our app eventually, these would be some of the most competitive categories to develop around.

As they say, "niche down!" Some categories worth looking at might be Medical, Education, Business, and Catalogs.

### Google Play Store User Numbers

In [47]:
goog_table=freq_table(free_goog, 1)

for category in goog_table:
    total=0
    len_category=0
    for app in free_goog:
        category_app=app[1]
        if category_app==category:
            installs=app[5]
            installs=installs.replace(",","")
            installs=installs.replace("+","")
            total+=float(installs)
            len_category+=1
    avg_num=total/len_category
    print(category,': ',avg_num)

ART_AND_DESIGN :  2021626.7857142857
AUTO_AND_VEHICLES :  647317.8170731707
BEAUTY :  513151.88679245283
BOOKS_AND_REFERENCE :  8767811.894736841
BUSINESS :  1712290.1474201474
COMICS :  817657.2727272727
COMMUNICATION :  38456119.167247385
DATING :  854028.8303030303
EDUCATION :  1833495.145631068
ENTERTAINMENT :  11640705.88235294
EVENTS :  253542.22222222222
FINANCE :  1387692.475609756
FOOD_AND_DRINK :  1924897.7363636363
HEALTH_AND_FITNESS :  4188821.9853479853
HOUSE_AND_HOME :  1331540.5616438356
LIBRARIES_AND_DEMO :  638503.734939759
LIFESTYLE :  1437816.2687861272
GAME :  15588015.603248259
FAMILY :  3695641.8198090694
MEDICAL :  120550.61980830671
SOCIAL :  23253652.127118643
SHOPPING :  7036877.311557789
PHOTOGRAPHY :  17840110.40229885
SPORTS :  3638640.1428571427
TRAVEL_AND_LOCAL :  13984077.710144928
TOOLS :  10801391.298666667
PERSONALIZATION :  5201482.6122448975
PRODUCTIVITY :  16787331.344927534
PARENTING :  542603.6206896552
WEATHER :  5074486.197183099
VIDEO_PLAYERS 

The Google data shows that the most-installed free apps on Google Play are in the Communication category, followed by categories such as Social, Video Players, Photography, and Game.

Again, when we "niche down", we find that Medical, Parenting, Comics, Beauty, Events, and Libraries & Demo are less competitive genres to develop an app around.

---
# Conclusion
From our analysis, it is clear that our company should try to develop an app in one of several niches: Medical, Books and Libraries, and Education (Parenting and Beauty might work here to further niche down our app).