# Analysis of Apple Store and Google Play Data

## Project Description
>Our goal for this project is to analyze data from the App Store and Google Play to help developers understand what types of apps are likely to attract more users.

## Opening and Exploring the Data

### Description of 'appleStore.csv' table
Name | Description
:---:|:---:
"id" | App ID
"track_name" | App Name
"size_bytes" | Size (in Bytes)
"currency" | Currency Type
"price" | Price amount
"rating_count_tot" | User Rating counts (for all versions)
"rating_count_ver" | User Rating counts (for current version)
"user_rating" | Average User Rating value (for all versions)
"user_rating_ver" | Average User Rating value (for current version)
"ver" | Latest version code
"cont_rating" | Content Rating
"prime_genre" | Primary Genre
"sup_devices.num" | Number of supporting devices
"ipadSc_urls.num" | Number of screenshots shown for display
"lang.num" | Number of supported languages
"vpp_lic" | VPP Device Based Licensing Enabled

>Open the two CSV files and create Python objects for both.

In [1]:
from csv import reader

# Read each CSV file into a list of rows; 'with' closes the file afterwards
with open('AppleStore.csv', encoding='utf8') as apple_open:
    apple_list = list(reader(apple_open))
with open('googleplaystore.csv', encoding='utf8') as google_open:
    google_list = list(reader(google_open))

print('Apple data has ',len(apple_list), ' number of rows')
print('Google data has ',len(google_list), ' number of rows')

Apple data has  7198  number of rows
Google data has  10842  number of rows


>Create a simple function for easy display of data set portions.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row
    
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [3]:
explore_data(google_list,0,3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




>Display the Google data set header names and their column indices, for convenience when we reference them later in the code.

In [4]:
header = google_list[0]
for i in range(len(header)):
    print(i, ' = ', header[i])

0  =  App
1  =  Category
2  =  Rating
3  =  Reviews
4  =  Size
5  =  Installs
6  =  Type
7  =  Price
8  =  Content Rating
9  =  Genres
10  =  Last Updated
11  =  Current Ver
12  =  Android Ver


>From the headers of the two data sets we can identify the
following columns of interest for further analysis:

* **Apple data:** 
___
Name | Description | Col Idx No
:---:|:---:|:---:
"track_name" | App Name | 1
"price" | Price amount | 4
"rating_count_tot" | User Rating counts | 5
"prime_genre" | Primary Genre | 11

* **Google data:** 
___
Name | Description | Col Idx No
:---:|:---:|:---:
"App" | Application Name | 0
"Category" | Summarized Genre | 1
"Installs" | Number of Downloads | 5
"Type" | Free or not | 6
"Genres" | Granular Genre | 9
"Rating" | App rating | 2
"Content Rating" | Content rating | 8

In [5]:
### Removing duplicate entries from Google data

In [6]:
duplicate = []
unique = []

for row in google_list[1:]:
    if row[0] in unique:
        duplicate.append(row[0])
    else:
        unique.append(row[0])
print('Duplicate = ' , len(duplicate))
print('Unique = ', len(unique), '\n')
# the following code just finds a section of the repeating
# data with 5 repeating names and prints a slice
# around that section.
dupl_sorted = sorted(duplicate)
count = 0
for i in range(len(dupl_sorted) - 1): # stop one short so that i+1 stays in range
    if dupl_sorted[i] == dupl_sorted[i+1]:
        count +=1
    if count ==5:
        index = i+1
        break
print(dupl_sorted[index - 7: index+2])

Duplicate =  1181
Unique =  9660 

['365Scores - Live Scores', '420 BZ Budeze Delivery', '8 Ball Pool', '8 Ball Pool', '8 Ball Pool', '8 Ball Pool', '8 Ball Pool', '8 Ball Pool', '8fit Workouts & Meal Planner']


* The number of duplicate entries found in the Google Play data is 1181.
* Duplicates will be removed based on the number of reviews. For each app we keep the entry with the highest number of reviews, since it is likely to be the most recent data, and delete the entries with fewer reviews.
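
The rule above can be sketched on a toy list. The rows and the `keep_most_reviewed` helper below are hypothetical illustrations, not part of the data set or of the notebook's own code; the sketch only assumes that the app name sits at index 0 and the review count at index 3, as in the Google data.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Return one row per app name: the row with the most reviews."""
    best = {}  # app name -> row with the highest review count seen so far
    for row in rows:
        name = row[name_idx]
        if name not in best or float(row[reviews_idx]) > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Hypothetical rows: [name, category, rating, reviews]
toy_rows = [
    ['App A', 'GAME', '4.5', '100'],
    ['App B', 'TOOLS', '4.0', '50'],
    ['App A', 'GAME', '4.5', '250'],
    ['App A', 'GAME', '4.5', '250'],
]
deduplicated = keep_most_reviewed(toy_rows)
print(deduplicated)
```

When two entries tie on the review count, this sketch keeps the first one it encounters.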

In [7]:
# This code deletes a row containing incorrect formatting: its 'Category'
# value is missing, so the columns are shifted and the 'Reviews' field
# (index 3) holds a size such as '3.0M' instead of a number.
# We iterate backwards so that deleting a row does not shift
# the indices of the rows still to be checked.
for index in range(len(google_list) - 1, 0, -1):
    n_reviews = google_list[index][3]
    if 'M' in n_reviews:
        print('THE FOLLOWING ROW IS TO BE DELETED: \n',google_list[index])
        del google_list[index]
        

THE FOLLOWING ROW IS TO BE DELETED: 
 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [8]:
# Here we populate the dictionary 'reviews_max' with unique
# app names and the highest number of reviews for each app
reviews_max = {}
for row in google_list[1:]:
    name = row[0]
    n_reviews = float(row[3])
        
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews
print(len(reviews_max))

9659


### Removing Duplicate Entries Part II
>Next we create a new list, `android_clean`, which we populate with the unique entries from `google_list`. We use the dictionary `reviews_max` from the code above to keep, for each repeated name, the entry with the highest number of reviews.

In [9]:
android_clean = []
already_added = []
for row in google_list[1:]: # we don't loop through the header row
    name = row[0]
    # the malformed row (with 'M' in its reviews field) was deleted earlier,
    # so every remaining value converts cleanly to a number
    n_reviews = float(row[3])
    
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(row)
        already_added.append(name)

print(len(already_added))
print(len(android_clean))
print(android_clean[:4])

9659
9659
[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up'], ['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up'], ['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']]


>Next we run the same check for duplicates in the Apple data.

In [10]:
duplicate = []
unique = []

for row in apple_list[1:]:
    if row[0] in unique:
        duplicate.append(row[0])
    else:
        unique.append(row[0])
print('Duplicate = ' , len(duplicate))
print('Unique = ', len(unique), '\n')

Duplicate =  0
Unique =  7197 



>Next we'll use the `ord`(*letter*) function to exclude non-ASCII characters, i.e. characters with an `ord` code > 127. In this way we'll approximate the English language and exclude non-Latin languages (Mandarin, Urdu, Arabic, Cyrillic etc.).

In [11]:
print(ord('z'))
print(ord('Z'))
print(ord('Щ'), '> 127 so will be removed from the data set')

122
90
1065 > 127 so will be removed from the data set


### Removing Non-English Apps

* Create a function that checks whether an app name consists of ASCII characters. Up to 3 non-ASCII characters are permissible so as to allow for logos, emoticons etc.

In [12]:
def english_check(strin):
    counter = 0
    for i in strin:
        if ord(i) >127:
            counter += 1
            if counter>3:
                return False
    return True

print(english_check('Instagram'))
print(english_check('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(english_check('Docs To Go™ Free Office Suite'))
print(english_check('Instachat 😜'))

True
False
True
True


### Removing Non-English Apps: Part II
* Remove all non-English names from the two data sets by creating new blank lists and populating them only with English names.

In [13]:
eng_apple_set = []
eng_google_set = []
# first clean the Apple data set
for i in apple_list[1:]:
    name = i[1]
    if english_check(name):
        eng_apple_set.append(i)
# next clean the google set
for i in android_clean[:]:
    name = i[0]
    if english_check(name):
        eng_google_set.append(i)
        
print('Apple data length after removing Non-English: ',len(eng_apple_set))
print('Google data length after removing Non-English: ',len(eng_google_set), '\n')

print(eng_apple_set[:4],"\n")
print(eng_google_set[:4])

Apple data length after removing Non-English:  6183
Google data length after removing Non-English:  9614 

[['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1'], ['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1'], ['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1'], ['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']] 

[['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up'], ['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & 

### Isolating Only the Free Apps

In [14]:
apple_free = []
google_free = []

# we use the cleaned, unique sets in English that we arrived at
# in the previous code segment.
# From them we select only apps with price == 0

for i in eng_apple_set:
    if float(i[4]) == 0:
        apple_free.append(i)

for i in eng_google_set:
    if i[6] == 'Free':
        google_free.append(i)
print('Apple number of free apps: ',len(apple_free), '\n')
print('Google number of free apps: ', len(google_free))


Apple number of free apps:  3222 

Google number of free apps:  8863


### Recap so far:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps
* Isolated the free apps

# Analysis

>Our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

>Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets.

>Let's begin the analysis by getting a sense of the most common genres in each market.

### Most Common Apps by Genre

Before we continue, here is a short function, which we'll need later, for displaying a dictionary sorted by its values.

In [15]:
def sort_dic(dic, descending = True): # returns a list of (value, key) tuples,
    list_of_tuples = []               # sorted by the dict values.
    for key in dic:
        list_of_tuples.append((dic[key], key))
    # 'descending' now controls the sort order (it was previously ignored)
    return sorted(list_of_tuples, reverse = descending)
def print_sorted_dic(list_of_tuples): # prints 'key : value' for each
    for i in list_of_tuples:          # (value, key) tuple in the list.
        print(i[1], ' : ', i[0])
so_dic = sort_dic({'q':23, 'd':78, 'g':990, 'e':55, 'j':92})
print(so_dic)
print_sorted_dic(so_dic)

[(990, 'g'), (92, 'j'), (78, 'd'), (55, 'e'), (23, 'q')]
g  :  990
j  :  92
d  :  78
e  :  55
q  :  23
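
As a side note, the same ordering can be produced directly with Python's built-in `sorted` and a `key` function. This is a minimal alternative sketch, not the approach used in the rest of this notebook; `sample` and `sorted_pairs` are illustrative names.

```python
sample = {'q': 23, 'd': 78, 'g': 990, 'e': 55, 'j': 92}

# Sort the (key, value) pairs by value, largest first
sorted_pairs = sorted(sample.items(), key=lambda pair: pair[1], reverse=True)

for key, value in sorted_pairs:
    print(key, ' : ', value)
```

One difference worth knowing: when two values are equal, `sorted` with a `key` keeps the pairs in their original order, whereas sorting `(value, key)` tuples breaks the tie by comparing the keys.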


In [16]:
# a function to display results in a more readable way
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [17]:

# this function returns a dictionary containing a frequency
# distribution for data in a column (given as an argument), 
# presented in percents
def freq_table(dataset, col_index):
    freq_dic = {}

    for i in dataset:
        cell_data = i[col_index]
        if cell_data in freq_dic:
            freq_dic[cell_data] += 1
        else:
            freq_dic[cell_data] = 1
    #transform into percentage
    sum_dic = 0 # this is the total or 100%
    for i in freq_dic:
        sum_dic += freq_dic[i]
    
    freq_dic_perc = {}
    for i in freq_dic:
        perc = round((freq_dic[i]/sum_dic)*100,2)
        freq_dic_perc[i] = perc
    return freq_dic_perc
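
For comparison, the counting step can also be written with `collections.Counter` from the standard library. The sketch below is only an alternative under stated assumptions: `freq_table_counter` and `toy_rows` are hypothetical names, and the function is intended to return the same percentages as `freq_table` for the same input.

```python
from collections import Counter

def freq_table_counter(dataset, col_index):
    """Percentage share of each distinct value in one column."""
    counts = Counter(row[col_index] for row in dataset)
    total = sum(counts.values())
    return {value: round(count / total * 100, 2) for value, count in counts.items()}

# Hypothetical rows: [name, genre]
toy_rows = [['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
shares = freq_table_counter(toy_rows, 1)
print(shares)
```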


>Let's test the `freq_table` function and visualise the composition of categories and subcategories in the Google data set.

In [30]:
google_cat = freq_table(google_free, 1)
google_genre_detailed = freq_table(google_free, 9)

# for the sake of readability we want to display the results in descending
# order. Since we cannot sort a dictionary directly, we'll transform the
# dictionaries into lists of tuples. Then we can sort these lists by putting
# the data we need sorted into the '0' index of the tuple elements.


sorted_google_cat = sort_dic(google_cat)
sorted_detailed_genre = sort_dic(google_genre_detailed)

# create a dictionary {key = 'detailed genre name': value = 'categ name'}
cat_genre_dic = {}
for i in google_free:
    detailed_genre = i[9]
    categ = i[1]
    if detailed_genre not in cat_genre_dic:
        cat_genre_dic[detailed_genre] = categ
# now we have isolated the unique sub categories genre and we can continue
# to assigning their shares

for i in sorted_google_cat:
    categ = i[1]
    categ_share = i[0]
    print(categ, ' : ', categ_share)
    sub_categ = []
    for dic_sub_ca in cat_genre_dic:
        if cat_genre_dic[dic_sub_ca] == categ:
            for j in sorted_detailed_genre:
                sub_category = j[1]
                share_sub_category = j[0]
                if sub_category == dic_sub_ca:
                    sub_categ.append([share_sub_category, sub_category])
    sorted_sub_categ = sorted(sub_categ, reverse = True)
    for sub in sorted_sub_categ: # 'sub' avoids shadowing the outer loop variable
        print('   ',sub[1],' : ', sub[0])


FAMILY  :  18.9
    Educational;Education  :  0.39
    Educational  :  0.37
    Casual;Pretend Play  :  0.24
    Racing;Action & Adventure  :  0.17
    Puzzle;Brain Games  :  0.17
    Casual;Action & Adventure  :  0.14
    Arcade;Action & Adventure  :  0.12
    Educational;Pretend Play  :  0.09
    Simulation;Action & Adventure  :  0.08
    Board;Brain Games  :  0.08
    Educational;Brain Games  :  0.07
    Casual;Creativity  :  0.07
    Role Playing;Pretend Play  :  0.05
    Role Playing;Action & Adventure  :  0.03
    Puzzle;Action & Adventure  :  0.03
    Entertainment;Action & Adventure  :  0.03
    Educational;Creativity  :  0.03
    Educational;Action & Adventure  :  0.03
    Education;Music & Video  :  0.03
    Education;Action & Adventure  :  0.03
    Adventure;Action & Adventure  :  0.03
    Sports;Action & Adventure  :  0.02
    Simulation;Pretend Play  :  0.02
    Puzzle;Creativity  :  0.02
    Music;Music & Video  :  0.02
    Entertainment;Pretend Play  :  0.02
    Casual;E

In [19]:
category_dic = freq_table(google_free, 1)
subcat_dic = freq_table(google_free, 9)
# sort the shares of 'category_dic' in a list of tuples
sorted_cat_list_tuples = sort_dic(category_dic)

subcat_cat_dic = {}
for i in google_free:
    subcat = i[9]
    cat = i[1]
    if subcat not in subcat_cat_dic:
        subcat_cat_dic[subcat] = cat

for i in sorted_cat_list_tuples:
    print(i[1], ' : ', i[0])
    list_subcat_share = []
    for j in subcat_cat_dic:
        if subcat_cat_dic[j] == i[1]:
            list_subcat_share.append([subcat_dic[j], j])
    sorted_list = sorted(list_subcat_share, reverse = True)
    for z in sorted_list:
        print('   ', z[1], ' : ', z[0])


FAMILY  :  18.9
    Educational;Education  :  0.39
    Educational  :  0.37
    Casual;Pretend Play  :  0.24
    Racing;Action & Adventure  :  0.17
    Puzzle;Brain Games  :  0.17
    Casual;Action & Adventure  :  0.14
    Arcade;Action & Adventure  :  0.12
    Educational;Pretend Play  :  0.09
    Simulation;Action & Adventure  :  0.08
    Board;Brain Games  :  0.08
    Educational;Brain Games  :  0.07
    Casual;Creativity  :  0.07
    Role Playing;Pretend Play  :  0.05
    Role Playing;Action & Adventure  :  0.03
    Puzzle;Action & Adventure  :  0.03
    Entertainment;Action & Adventure  :  0.03
    Educational;Creativity  :  0.03
    Educational;Action & Adventure  :  0.03
    Education;Music & Video  :  0.03
    Education;Action & Adventure  :  0.03
    Adventure;Action & Adventure  :  0.03
    Sports;Action & Adventure  :  0.02
    Simulation;Pretend Play  :  0.02
    Puzzle;Creativity  :  0.02
    Music;Music & Video  :  0.02
    Entertainment;Pretend Play  :  0.02
    Casual;E

>The last two code cells perform the same task in two slightly different ways. Both results are the same, and both are flawed: in some of the bigger categories the sum total of the subcategory shares is larger than the share of their respective category. Further investigation is necessary to find out whether this is an error in our coding or a data inconsistency.

In [20]:
for i in google_free:
    if i[9] == 'Education':
        print(i[1], ' : ', i[9])

EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Education
EDUCATION  :  Ed

>After a quick check we see that the sub-category 'Education' is contained in both the 'FAMILY' and 'EDUCATION' categories. The `freq_table` function we used for obtaining frequency distribution data does not account for this. Below we'll tackle this in a different way.
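
One way to confirm this kind of overlap is to collect, for every genre, the set of categories it appears under. The sketch below uses hypothetical rows and a hypothetical helper name (`categories_per_genre`); its defaults assume the category is at index 1 and the genre at index 9, as in the Google data.

```python
def categories_per_genre(rows, cat_idx=1, genre_idx=9):
    """Map each genre to the set of categories it occurs in."""
    mapping = {}
    for row in rows:
        mapping.setdefault(row[genre_idx], set()).add(row[cat_idx])
    return mapping

# Hypothetical rows: [name, category, genre]
toy_rows = [
    ['App A', 'EDUCATION', 'Education'],
    ['App B', 'FAMILY', 'Education'],
    ['App C', 'TOOLS', 'Tools'],
]
mapping = categories_per_genre(toy_rows, cat_idx=1, genre_idx=2)

# Genres that belong to more than one category
shared = {genre: sorted(cats) for genre, cats in mapping.items() if len(cats) > 1}
print(shared)
```

Any genre that shows up in `shared` cannot be assigned to a single category, which is why a plain genre-to-category dictionary gives misleading subtotals.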

In [21]:
main_categ = freq_table(google_free, 1)

# sort the shares of 'main_categ' in a list of tuples
main_categ_sorted_list_tuples = sort_dic(main_categ)
# loop through sorted main_categ and for each categ, loop the Google data set
# and populate a dictionary of the subcategories belonging to that categ.
super_total = 0 # ----------just for testing purposes
for i in main_categ_sorted_list_tuples:
    categ = i[1]
    share_categ = i[0]
    print('   ',categ, ' : ', share_categ)
    subcateg_dic = {}
    for j in google_free:
        if j[1] == categ:
            subcateg = j[9]
            if subcateg in subcateg_dic:
                subcateg_dic[subcateg] += 1
            else:
                subcateg_dic[subcateg] = 1
    # transform the counts in subcateg_dic into percentages
    # of the total number of free Google apps
    
    for k in subcateg_dic:
        perc = round((subcateg_dic[k]/len(google_free))*100, 2)
        subcateg_dic[k] = perc
    sorted_subcateg_dic = sort_dic(subcateg_dic)
    print_sorted_dic(sorted_subcateg_dic)

    FAMILY  :  18.9
Entertainment  :  5.17
Education  :  4.31
Simulation  :  1.96
Casual  :  1.51
Puzzle  :  0.88
Role Playing  :  0.81
Strategy  :  0.73
Educational;Education  :  0.39
Educational  :  0.37
Education;Education  :  0.27
Casual;Pretend Play  :  0.24
Racing;Action & Adventure  :  0.17
Puzzle;Brain Games  :  0.17
Entertainment;Music & Video  :  0.14
Casual;Action & Adventure  :  0.14
Casual;Brain Games  :  0.12
Arcade;Action & Adventure  :  0.12
Educational;Pretend Play  :  0.09
Action;Action & Adventure  :  0.09
Simulation;Action & Adventure  :  0.08
Board;Brain Games  :  0.08
Entertainment;Brain Games  :  0.07
Educational;Brain Games  :  0.07
Casual;Creativity  :  0.07
Role Playing;Pretend Play  :  0.05
Education;Pretend Play  :  0.05
Role Playing;Action & Adventure  :  0.03
Puzzle;Action & Adventure  :  0.03
Entertainment;Action & Adventure  :  0.03
Educational;Creativity  :  0.03
Educational;Action & Adventure  :  0.03
Education;Music & Video  :  0.03
Education;Action &

#### Number of Apps per Genre 

> Shown as Percentage of Total Apps, Ordered Descending

**Results for Apple**

In [22]:
print('APPLE GENRE \n')
display_table(apple_free, 11)


APPLE GENRE 

Games : 58.16
Entertainment : 7.88
Photo & Video : 4.97
Education : 3.66
Social Networking : 3.29
Shopping : 2.61
Utilities : 2.51
Sports : 2.14
Music : 2.05
Health & Fitness : 2.02
Productivity : 1.74
Lifestyle : 1.58
News : 1.33
Travel : 1.24
Finance : 1.12
Weather : 0.87
Food & Drink : 0.81
Reference : 0.56
Business : 0.53
Book : 0.43
Navigation : 0.19
Medical : 0.19
Catalogs : 0.12



> Clearly, gaming, entertainment, and photo and video apps dominate the free segment of the Apple store (at least in terms of number of apps offered). More than 70% of the free apps on offer fall into these three categories, with more than half in gaming alone.

**Results for Google**

In [23]:
print('\n GOOGLE CATEGORY \n')
display_table(google_free, 1)
print('\n GOOGLE GENRE DETAILED \n')
display_table(google_free, 9)


 GOOGLE CATEGORY 

FAMILY : 18.9
GAME : 9.73
TOOLS : 8.46
BUSINESS : 4.59
LIFESTYLE : 3.9
PRODUCTIVITY : 3.89
FINANCE : 3.7
MEDICAL : 3.53
SPORTS : 3.4
PERSONALIZATION : 3.32
COMMUNICATION : 3.24
HEALTH_AND_FITNESS : 3.08
PHOTOGRAPHY : 2.94
NEWS_AND_MAGAZINES : 2.8
SOCIAL : 2.66
TRAVEL_AND_LOCAL : 2.34
SHOPPING : 2.25
BOOKS_AND_REFERENCE : 2.14
DATING : 1.86
VIDEO_PLAYERS : 1.79
MAPS_AND_NAVIGATION : 1.4
FOOD_AND_DRINK : 1.24
EDUCATION : 1.16
ENTERTAINMENT : 0.96
LIBRARIES_AND_DEMO : 0.94
AUTO_AND_VEHICLES : 0.93
HOUSE_AND_HOME : 0.82
WEATHER : 0.8
EVENTS : 0.71
PARENTING : 0.65
ART_AND_DESIGN : 0.64
COMICS : 0.62
BEAUTY : 0.6

 GOOGLE GENRE DETAILED 

Tools : 8.45
Entertainment : 6.07
Education : 5.35
Business : 4.59
Productivity : 3.89
Lifestyle : 3.89
Finance : 3.7
Medical : 3.53
Sports : 3.46
Personalization : 3.32
Communication : 3.24
Action : 3.1
Health & Fitness : 3.08
Photography : 2.94
News & Magazines : 2.8
Social : 2.66
Travel & Local : 2.32
Shopping : 2.25
Books & Referenc


> The categorization of app genres in the Play Store is very different from the one in the Apple market, which makes comparison a challenge. For instance, the "FAMILY" category contains approx. 19% of the free Google apps. However, this category contains more than 50 sub-genres of a wide variety, ranging from games to education to entertainment and so on. In any case, the Google market is much less concentrated in the gaming genres than the Apple market.

> Besides the number of apps in each genre, we can also gauge popularity and the potential for advertising income by looking at the number of user ratings and the number of downloads. The more popular the genre in terms of average ratings or downloads per app, the greater the potential for advertising profit.



#### Apple Data Set - Average Number of User Ratings per App for Every Genre


In [34]:
apple_prime_genre_freq = freq_table(apple_free, 11)
dic_ratings_avg = {} # this will contain the results of avg num of ratings
dic_total_ratings = {} # for the total amount of ratings
for genre in apple_prime_genre_freq:
    total_ratings = 0
    num_apps_genre = 0
    
    for row in apple_free:
        if genre == row[11]:
            num_apps_genre += 1
            total_ratings += float(row[5])
    avg_num_ratings = round(total_ratings / num_apps_genre)
    dic_total_ratings[genre] = total_ratings
    #print(genre, ' : ', avg_num_ratings)
    dic_ratings_avg[genre] = avg_num_ratings
d = sort_dic(dic_ratings_avg)
print_sorted_dic(d)

Navigation  :  86090
Reference  :  74942
Social Networking  :  71548
Music  :  57327
Weather  :  52280
Book  :  39758
Food & Drink  :  33334
Finance  :  31468
Photo & Video  :  28442
Travel  :  28244
Shopping  :  26920
Health & Fitness  :  23298
Sports  :  23009
Games  :  22789
News  :  21248
Productivity  :  21028
Utilities  :  18684
Lifestyle  :  16486
Entertainment  :  14030
Business  :  7491
Education  :  7004
Catalogs  :  4004
Medical  :  612


#### Google Data Set - Average Number of Downloads
First let's take a look at the 'Installs' column (index 5).

In [25]:
for row in range(20): 
    print(google_free[row][5])

10,000+
5,000,000+
50,000,000+
100,000+
50,000+
50,000+
1,000,000+
1,000,000+
10,000+
1,000,000+
1,000,000+
10,000,000+
100,000+
100,000+
5,000+
500,000+
10,000+
5,000,000+
10,000,000+
100,000+


* Before proceeding with calculating the average number of downloads for each genre let's convert data in the 'Installs' column to numbers.

In [26]:
list_downloads = []
# Following loop extracts three columns - indexes: 0, 1 and 5.
for row in google_free:
    name = row[0]
    category = row[1]
    downloads = row[5]
    list_downloads.append([name, category, downloads])

# loop to convert data in "Installs" column from text to a number
for i in range(len(list_downloads)):
    list_downloads[i][2] = list_downloads[i][2].replace(',','')
    list_downloads[i][2] = list_downloads[i][2].replace('+','')
    list_downloads[i][2] = round(float(list_downloads[i][2]))

# print several lines of the table extract with
# converted 'installs' column
for row in list_downloads[:5]:
    print(row[0], " | ", row[1], " | ", row[2])

Photo Editor & Candy Camera & Grid & ScrapBook  |  ART_AND_DESIGN  |  10000
U Launcher Lite – FREE Live Cool Themes, Hide Apps  |  ART_AND_DESIGN  |  5000000
Sketch - Draw & Paint  |  ART_AND_DESIGN  |  50000000
Pixel Draw - Number Art Coloring Book  |  ART_AND_DESIGN  |  100000
Paper flowers instructions  |  ART_AND_DESIGN  |  50000




>Next we find the average number of downloads per app in each 
genre, sorted from high to low.
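
A single pass over the rows is enough to compute such averages. The following is a minimal sketch under stated assumptions: `average_by_category` and the toy rows are hypothetical, and each row is assumed to look like the entries of `list_downloads` (`[name, category, downloads]`).

```python
from collections import defaultdict

def average_by_category(rows):
    """Average downloads per app for each category, highest first."""
    totals = defaultdict(int)   # category -> summed downloads
    counts = defaultdict(int)   # category -> number of apps
    for _name, category, downloads in rows:
        totals[category] += downloads
        counts[category] += 1
    averages = [(round(totals[c] / counts[c]), c) for c in totals]
    return sorted(averages, reverse=True)

# Hypothetical rows: [name, category, downloads]
toy_rows = [
    ['App A', 'GAME', 1000],
    ['App B', 'GAME', 3000],
    ['App C', 'TOOLS', 500],
]
ranking = average_by_category(toy_rows)
print(ranking)
```

Accumulating totals and counts in dictionaries avoids re-scanning the whole data set once per category, which matters little at this size but keeps the logic compact.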



In [37]:
categ_dic = freq_table(google_free, 1) # here we get unique 
                                       # category data
categ_sum_downl = {} #total downloads per category, used further
                     # to calculate concentrations

for i in categ_dic:
    sum_downl = 0
    num_apps = 0
    for row in list_downloads:
        if i == row[1]:
            sum_downl += row[2]
            num_apps += 1
    categ_dic[i] = round(sum_downl/num_apps) # average num downl per categ
    categ_sum_downl[i] = sum_downl

f = sort_dic(categ_dic)
print_sorted_dic(f)

COMMUNICATION  :  38456119
VIDEO_PLAYERS  :  24727872
SOCIAL  :  23253652
PHOTOGRAPHY  :  17840110
PRODUCTIVITY  :  16787331
GAME  :  15588016
TRAVEL_AND_LOCAL  :  13984078
ENTERTAINMENT  :  11640706
TOOLS  :  10801391
NEWS_AND_MAGAZINES  :  9549178
BOOKS_AND_REFERENCE  :  8767812
SHOPPING  :  7036877
PERSONALIZATION  :  5201483
WEATHER  :  5074486
HEALTH_AND_FITNESS  :  4188822
MAPS_AND_NAVIGATION  :  4056942
FAMILY  :  3697848
SPORTS  :  3638640
ART_AND_DESIGN  :  1986335
FOOD_AND_DRINK  :  1924898
EDUCATION  :  1833495
BUSINESS  :  1712290
LIFESTYLE  :  1437816
FINANCE  :  1387692
HOUSE_AND_HOME  :  1331541
DATING  :  854029
COMICS  :  817657
AUTO_AND_VEHICLES  :  647318
LIBRARIES_AND_DEMO  :  638504
PARENTING  :  542604
BEAUTY  :  513152
EVENTS  :  253542
MEDICAL  :  120551


#### Calculating Level of Concentration for Each Category
We want to see the Herfindahl-Hirschman Index (HHI) for each category to find out whether the popularity of downloads is due to a dominant app or a few very popular apps. 

The HHI is calculated as the sum of the squared percentage shares of each app in the category. 

High values mean high concentration. 
For instance, if a single app holds a 90% share, the HHI is at least 90**2 = 8100, while if the highest share is 10% then HHI <= 1000.

We want to try to avoid entering sectors with HHI > 1000, although the final review below uses a more lenient cut-off of 2000.
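
To make the definition concrete, here is a small sketch of the calculation on made-up download counts. The function name `hhi` and the numbers are illustrative assumptions; shares are expressed in percent, so the index can range from close to 0 up to 10000.

```python
def hhi(values):
    """Herfindahl-Hirschman Index: sum of squared percentage shares."""
    total = sum(values)
    return sum((value / total * 100) ** 2 for value in values)

# One dominant app holding 90% of downloads, two small apps with 5% each
concentrated = hhi([90, 5, 5])
# Ten apps with equal shares of 10% each
even = hhi([10] * 10)

print(round(concentrated), round(even))
```

The first case lands just above the 8100 contributed by the dominant app alone, while the evenly split case sits exactly at the 1000 bound mentioned above.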

##### Google Concentration

In [28]:
categ_hhi = {}
for categ in categ_sum_downl:
    sum_cat = categ_sum_downl[categ]
    sum_squares = 0
    for row in list_downloads: # list_downloads contains: name, categ, num downl
        if categ == row[1]:
            app_share_sq = ((row[2]/sum_cat)*100)**2
            sum_squares += app_share_sq
    categ_hhi[categ] = round(sum_squares)
s = sort_dic(categ_hhi)
print_sorted_dic(s)

BOOKS_AND_REFERENCE  :  3755
NEWS_AND_MAGAZINES  :  2683
TRAVEL_AND_LOCAL  :  2438
ART_AND_DESIGN  :  2325
EVENTS  :  2131
HEALTH_AND_FITNESS  :  2036
BEAUTY  :  1820
VIDEO_PLAYERS  :  1512
PARENTING  :  1403
LIBRARIES_AND_DEMO  :  1281
SOCIAL  :  1194
COMICS  :  1050
MAPS_AND_NAVIGATION  :  1043
AUTO_AND_VEHICLES  :  1027
WEATHER  :  863
BUSINESS  :  704
FINANCE  :  694
HOUSE_AND_HOME  :  660
PRODUCTIVITY  :  656
ENTERTAINMENT  :  636
COMMUNICATION  :  610
LIFESTYLE  :  608
PHOTOGRAPHY  :  573
DATING  :  490
MEDICAL  :  488
PERSONALIZATION  :  412
TOOLS  :  388
SHOPPING  :  368
FOOD_AND_DRINK  :  352
SPORTS  :  347
EDUCATION  :  321
FAMILY  :  319
GAME  :  149


##### Apple Concentration

In [29]:
categ_hhi_apple = {}
for categ in dic_total_ratings:
    sum_cat_apple = dic_total_ratings[categ]
    sum_squares_apple = 0
    for row in apple_free: 
        if categ == row[11]:
            apple_app_share_sq = ((float(row[5])/sum_cat_apple)*100)**2
            sum_squares_apple += apple_app_share_sq
    categ_hhi_apple[categ] = round(sum_squares_apple)
k = sort_dic(categ_hhi_apple)
print_sorted_dic(k)

Catalogs  :  7112
Reference  :  5588
Navigation  :  5368
Medical  :  3303
Book  :  2849
Food & Drink  :  2468
Photo & Video  :  2393
Travel  :  2209
Lifestyle  :  2207
News  :  2026
Health & Fitness  :  1926
Social Networking  :  1854
Weather  :  1807
Business  :  1769
Music  :  1662
Utilities  :  1332
Finance  :  1237
Education  :  896
Sports  :  844
Shopping  :  703
Productivity  :  676
Entertainment  :  342
Games  :  98


### Final Review of Results

**Apple**
* Sectors with HHI concentration below 2000:

Sector | HHI Index
:---: | :---:
Health & Fitness  |  1926
Social Networking  |  1854
Weather  |  1807
Business  |  1769
Music  |  1662
Utilities  |  1332
Finance  |  1237
Education  |  896
Sports  |  844
Shopping  |  703
Productivity  |  676
Entertainment  |  342
Games  |  98

* Top 10 Apple sectors as measured by the average number of ratings per app:

Sector | Avg. Num. Ratings
:---: | :---:
Navigation  |  86090
Reference  |  74942
Social Networking  |  71548
Music  |  57327
Weather  |  52280
Book  |  39758
Food & Drink  |  33334
Finance  |  31468
Photo & Video  | 28442
Travel  |  28244

The following is the intersection of the above two tables, i.e. the most popular app genres on the Apple store with a low to moderate level of concentration:

In [55]:
# dic_ratings_avg ------ {Apple sectors : avg number of ratings}

# categ_dic ------------ {Google sectors : avg number of downl}

# categ_hhi -------------{Google sector : HHI sector index}

# categ_hhi_apple ------ {Apple sector : HHI sector index}
apple_intersec = []
for i in dic_ratings_avg:
    if dic_ratings_avg[i] >28000:
        if categ_hhi_apple[i] <=2000:
            print(i)
            apple_intersec.append(i)
# print(apple_intersec)

Music
Finance
Social Networking
Weather


**Google**

Applying the same criteria as for Apple, we get the Google sectors that combine an HHI below 2000 with high popularity, measured here as an average of more than 1,000,000 downloads per app:

In [56]:
google_intersec = []
for i in categ_dic:
    if categ_dic[i] >1000000:
        if categ_hhi[i] <=2000:
            print(i)
            google_intersec.append(i)


COMMUNICATION
SPORTS
PERSONALIZATION
FINANCE
ENTERTAINMENT
PHOTOGRAPHY
SHOPPING
FOOD_AND_DRINK
EDUCATION
GAME
TOOLS
LIFESTYLE
VIDEO_PLAYERS
HOUSE_AND_HOME
PRODUCTIVITY
FAMILY
BUSINESS
WEATHER
MAPS_AND_NAVIGATION
SOCIAL


>**Finance** and **Weather** appear to be the common genres for the Apple and Google stores that combine high popularity with low levels of concentration.
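
Because the two stores label genres differently (`Finance` versus `FINANCE`, for example), comparing the two lists requires some name normalisation. The sketch below is one possible approach, not necessarily how the result above was obtained: the `normalise` helper and its rules are assumptions, and the two lists simply restate the outputs of the two intersection cells.

```python
def normalise(name):
    """Convert a genre label to the upper-case, underscore style."""
    return name.upper().replace(' & ', '_AND_').replace(' ', '_')

# Genres printed by the two intersection cells above
apple_selected = ['Music', 'Finance', 'Social Networking', 'Weather']
google_selected = [
    'COMMUNICATION', 'SPORTS', 'PERSONALIZATION', 'FINANCE', 'ENTERTAINMENT',
    'PHOTOGRAPHY', 'SHOPPING', 'FOOD_AND_DRINK', 'EDUCATION', 'GAME', 'TOOLS',
    'LIFESTYLE', 'VIDEO_PLAYERS', 'HOUSE_AND_HOME', 'PRODUCTIVITY', 'FAMILY',
    'BUSINESS', 'WEATHER', 'MAPS_AND_NAVIGATION', 'SOCIAL',
]

# Keep only the genres whose normalised names appear in both lists
common = sorted({normalise(g) for g in apple_selected} & set(google_selected))
print(common)
```

An exact-name match like this misses near-equivalents such as `Social Networking` and `SOCIAL`, so any mapping between the two taxonomies remains a judgment call.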

## Conclusion
>The **Gaming** and **Entertainment** genres dominate both marketplaces in terms of number of apps. Consequently their concentration indices are very low. These genres, however, turn out to be not so popular, at least by our measures of popularity: number of downloads for Google and number of ratings for Apple.

>The **Finance** and **Weather** categories appear to be the common denominator between the Apple and Google markets in terms of popularity and concentration. The nature of these types of apps, however, would present a number of challenges for developing a successful income-generating strategy, which are beyond the scope of this analysis.

>A possible solution to the constraints above could be specialisation in only one of the two marketplaces.