# Mobile App Analysis:
#### aka: Profitable App Profiles for AppStore and Google Play Markets
--------------

* This is a Data Analysis project where I analyze two sources of data:

  1. The Apple Appstore dataset available at [Kaggle](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/version/2), and
  2. The Google Play store dataset, also available at [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps).


* The goal is to understand and identify the types of *free* mobile apps that are most likely to attract more users over time.


### First, general Exploration:

In [1]:
# This is a csv to list of lists func.:

def csv2lol(csvFile,trim_headers=0):
    
    '''
    Description:
        A "CSV" to "list of lists" Function,
        expects 2 arguments:
            1. Dataset CSV File Path+Name. (String)
            2. Should the headers be removed.
               (Boolean, Optional, Defaults to False)
    
    Usage:
        csv2lol("googleplaystore.csv")
        csv2lol("googleplaystore.csv",1)
    '''
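    from csv import reader
    # "with" closes the file once the rows are read;
    # newline="" is what the csv module recommends for file objects
    with open(csvFile, encoding="utf8", newline="") as opFile:
        lol = list(reader(opFile))
    if trim_headers:
        return lol[1:]
    else:
        return lol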

In [2]:
gStoreLol = csv2lol("googleplaystore.csv")

In [3]:
aStoreLol = csv2lol("AppleStore.csv")

#### Explore these in the Python shell: `globals()`, `locals()`, `dir()`, `keys()`

In [112]:
# This is a data exploration func.:

def dataScan(dataset,start=0,end=0,general_info=1):
    
    '''
    Description:
        Data exploration Function that prints some data
        and some general info,
        expects 4 arguments:
            1. Dataset Name.
               (List of Lists)
            2. Where to start.
               (int, Optional)
            3. Where to end, should be greater than start.
               (int, Optional)
            4. Print General info?
               (Boolean, Optional, Defaults to True)
    
    Usage:
        dataScan(dataset)        Prints General Info Only
        dataScan(dataset,1,5,0)  Prints the first 4 rows only
    '''
    print('========================================')

    def namestr(obj, namespace=globals()):
        return [i for i in namespace if namespace[i] is obj]
    
    if general_info:
        print('Dataset General Info :\n====================')
        print('Dataset =',namestr(dataset))
        print('Columns = '+str(len(dataset[0])))
        print('Rows    = '+str(len(dataset))+'   (including headers if present)')
        print('\n')
        print('Dataset Header : ')
        print(dataset[0])
        print('\n')

    if start<end:
        print("Requested Data (%s Rows) :\n======================="%(end-start))
        for i in dataset[start:end]:
            print(i)
    else:
        print('No Data Requested.')
        print('to get some data set the start and end arguments, eg:')
        print('dataScan(dataset,1,5,0)  Prints the first 4 rows only')
    print('========================================')

In [5]:
# help(csv2lol)
# print('\n')
# help(dataScan)

# dataScan(csv2lol("googleplaystore.csv"))
# dataScan(csv2lol("AppleStore.csv"),0,0,0)

In [113]:
dataScan(gStoreLol,1,2)

Dataset General Info :
Dataset = ['gStoreLol']
Columns = 13
Rows    = 9660   (including headers if present)


Dataset Header : 
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


Requested Data (1 Rows) :
['Chakra Cleansing', 'LIBRARIES_AND_DEMO', '4.6', '539', '99M', '50,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'August 2, 2018', '7.0', '4.0.3 and up']


In [7]:
dataScan(aStoreLol,1,2)

Dataset General Info :
Dataset = ['aStoreLol']
Columns = 16
Rows    = 7198   (including headers if present)


Dataset Header : 
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


Requested Data (1 Rows) :
['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

### Next, Error Analysis:
- Find Apps w/ missing data points.
- Find Apps w/ empty data points.
- Find non-English apps.
- Find non-free apps.
- Find Duplicates.

In [114]:
# this Func. finds MISSING data points:

def missFinder(lol):
        
    '''
    Description:
        Prints rows with missing data along
        with their index, by comparing every
        row's length to header row's length,
        expects 1 argument:
            Dataset Name. (List of Lists)
               
    Usage:
        missFinder(Dataset_as_List_of_Lists)
    '''
    
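    headerlen = len(lol[0])
    # enumerate gives each row's real position; lol.index(i) would
    # report the first equal row, which is wrong for duplicated rows
    for idx, i in enumerate(lol):
        if len(i) != headerlen:
            print('Found Row with index number : ',idx)
            print(i)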

In [115]:
missFinder(gStoreLol)

In [10]:
del gStoreLol[10473]

In [11]:
missFinder(gStoreLol)

In [12]:
missFinder(aStoreLol)

In [13]:
# this Func. finds EMPTY data points:

def empFinder(lol):
        
    '''
    Description:
        Prints rows with empty data points along
        with their index,
        expects 1 argument:
            Dataset Name. (List of Lists)
               
    Usage:
        empFinder(Dataset_as_List_of_Lists)
    '''
    
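    # enumerate is used instead of lol.index(i), which only
    # finds the first of several identical rows
    for idx, i in enumerate(lol):
        for j in i:
            if not len(j):
                print('Found Row with index number : ',end='')
                print(idx)
                print(i)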

In [14]:
empFinder(csv2lol("googleplaystore.csv"))

Found Row with index number : 1554
['Market Update Helper', 'LIBRARIES_AND_DEMO', '4.1', '20145', '11k', '1,000,000+', 'Free', '0', 'Everyone', 'Libraries & Demo', 'February 12, 2013', '', '1.5 and up']
Found Row with index number : 10473
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [15]:
empFinder(csv2lol("AppleStore.csv"))

--------------------------------------------

In [16]:
# repeated App data Example:

apps_data = csv2lol("googleplaystore.csv")

counter=0
for x in apps_data:
    if x[0] == 'Subway Surfers':
#         print(apps_data.index(x),end=" ")
#         print('should be ',end='')
        print(counter)
    counter+=1

1655
1701
1751
1873
1918
3897


An earlier version of this cell had a bug (fixed above):

It printed `apps_data.index(x)` inside the loop and reported the same index twice.
The reason is the nature of the `.index` method:
it searches for and reports the index of the `FIRST`
occurrence of its input, and the whole row at index 1873
is repeated exactly at index 1918, so both rows were reported as 1873.
The explicit counter reports the real position of every row.


### So `.index` is not dependable in this context!
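
A minimal sketch of the pitfall, using a small made-up list rather than the real dataset. It assumes nothing beyond built-in lists: `list.index` returns the position of the first equal element, while `enumerate` pairs every row with its own position.

```python
# Toy rows (made up for illustration); rows 1 and 3 are exact duplicates.
rows = [
    ['App A', '4.1'],
    ['Subway Surfers', '4.5'],
    ['App B', '3.9'],
    ['Subway Surfers', '4.5'],
]

# list.index always reports the FIRST equal row, so the duplicate
# at position 3 is reported as position 1.
via_index = [rows.index(r) for r in rows if r[0] == 'Subway Surfers']

# enumerate yields the real position of each row.
via_enumerate = [i for i, r in enumerate(rows) if r[0] == 'Subway Surfers']

print(via_index)      # [1, 1]
print(via_enumerate)  # [1, 3]
```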

--------------------------------------------


In [17]:
# search for and count repeated App names

apps_data = csv2lol("googleplaystore.csv")

namesList = [i[0] for i in apps_data[1:]]

namesList.sort()
# .sort() returns None; it just sorts the list in place!

# print(namesList[0:5])
repList=[]
for i in range(len(namesList)-1):
    if namesList[i+1] == namesList[i]:
#         print(namesList[i+1])
        repList.append(namesList[i+1])
len(repList)

1181

In [18]:
# search for and count repeated App data

apps_data = csv2lol("googleplaystore.csv")

apps_data.sort()

dupList=[]

for i in range(len(apps_data)-1):
    if apps_data[i+1] == apps_data[i]:
#         print(apps_data[i+1])
        dupList.append(apps_data[i+1])
len(dupList)

483

## Here I'll search for and count repeated app data (whole rows):

I think those are safe to remove directly!

In [19]:
# search for and count repeated App data (Whole Rows)
# Each row is checked if it was seen before

xunique = []
xduplicates = []

for row in gStoreLol[1:]:
    if row in xunique : xduplicates.append(row)
    else : xunique.append(row)
        
print(len(xunique),end=' '); print('Unique Rows, may have duplicate apps with changed ratings')
print(len(xduplicates),end='   '); print('Duplicate Apps (Rows), those are exact row duplicates')

gStoreLol[1:] = xunique

print('\nNow gstore is completely free of exact dups')
print(len(gStoreLol))


10357 Unique Rows, may have duplicate apps with changed ratings
483   Duplicate Apps (Rows), those are exact row duplicates

Now gstore is completely free of exact dups
10358
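
The `row in xunique` test above rescans the list for every row. A possible alternative, sketched here on made-up rows, tracks the rows already seen in a set of tuples. The helper name `drop_exact_duplicates` is hypothetical and not part of the notebook.

```python
def drop_exact_duplicates(rows):
    """Return (unique_rows, duplicate_rows), keeping first occurrences in order."""
    seen = set()
    unique, duplicates = [], []
    for row in rows:
        key = tuple(row)  # lists are unhashable, tuples are hashable
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
            unique.append(row)
    return unique, duplicates

# Made-up sample: the third row repeats the first one exactly.
sample = [['A', '1'], ['B', '2'], ['A', '1'], ['A', '3']]
unique, duplicates = drop_exact_duplicates(sample)
print(len(unique), len(duplicates))  # 3 1
```

Set membership is a hash lookup, so this should scale better than the list scan on the roughly 10,000 Google Play rows, while giving the same split into unique and duplicate rows.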


## This is a demonstration of the other duplication present in the app data:

In [20]:
# search for and count repeated App names

unique = []
duplicates = []

for i in gStoreLol[1:]:
    appName = i[0]
    if appName in unique : duplicates.append(appName)
    else : unique.append(appName)

print(len(unique),end=' '); print('Unique App Names')
print(len(duplicates),end='  '); print('Duplicate App Names')

9659 Unique App Names
698  Duplicate App Names


In [21]:
for i in gStoreLol[1:]:
    if i[0] == duplicates[0]: print(i)

['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80805', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']
['Quick PDF Scanner + OCR FREE', 'BUSINESS', '4.2', '80804', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'February 26, 2018', 'Varies with device', '4.0.3 and up']


## Criteria for removing duplicate names:

- We could keep the row with the higher version number.
- We can keep the row with the higher number of reviews (the `Reviews` column), since a higher count suggests the data was collected more recently.

## How we remove rows that have duplicate app names with different review counts:

1. Create a new dict whose keys are the names of all apps => namesDict.keys.

2. For each app, find the maximum number of reviews => namesDict.values.

3. Create a new list, myList.

4. For each app (key) and max review count (value) in namesDict, find the rows in gStoreLol with the same name and review count, and append them to myList.
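
The steps above can also be sketched as a single pass that stores the best row per app name. This is an illustration on made-up, shortened rows, not the notebook's own code; the helper name `keep_most_reviewed` and its parameters are assumptions.

```python
def keep_most_reviewed(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the one with the highest review count.

    On a tie the first row seen is kept.
    """
    best = {}
    for row in rows:
        name = row[name_idx]
        reviews = int(row[reviews_idx])
        if name not in best or reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Made-up rows with only four columns: name, category, rating, reviews.
sample = [
    ['Quick PDF', 'BUSINESS', '4.2', '80805'],
    ['Quick PDF', 'BUSINESS', '4.2', '80804'],
    ['Candy Bomb', 'GAME', '4.4', '42145'],
    ['Candy Bomb', 'FAMILY', '4.4', '42145'],
]
cleaned = keep_most_reviewed(sample)
print(cleaned)
```

Unlike the list-based approach, this sketch resolves ties automatically by keeping the first row, so rows that differ only in category or update date would not need manual inspection. Whether that is desirable depends on which of the tied rows should survive.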

In [22]:
# deleting repeated App names with lower review counts

unique = []
duplicates = []

namesDict = {}
myList = []


# Fill unique,duplicates:

for i in gStoreLol[1:]:
    appName = i[0]
    if appName in unique : duplicates.append(appName)
    else : unique.append(appName)

# Fill namesDict key with unique app names, and 0 values:

for i in gStoreLol[1:]:
    if not i[0] in namesDict: namesDict[i[0]] = 0

# Fill namesDict values with the max number of reviews for each app.
# The comparison reads the stored value on every row, so the
# maximum is kept rather than the last non-zero count:

for row in gStoreLol[1:]:
    if namesDict[row[0]] < int(row[3]):
        namesDict[row[0]] = int(row[3])

# print(namesDict)
# namesDict : {( AppName : MaxReviews ),...}
            
# Fill a new list myList with rows of max value:

for k,v in namesDict.items():
    for row in gStoreLol[1:]:
        if k == row[0] and v == int(row[3]):
            myList.append(row)


gStoreLol[1:] = myList



# this leaves 6 rows with various diffs other than the review count:

unique = []
duplicates = []

for i in gStoreLol[1:]:
    appName = i[0]
    if appName in unique : duplicates.append(appName)
    else : unique.append(appName)

print(len(unique),end=' '); print('Unique App Names')
print(len(duplicates),end='  '); print('Duplicate App Names')

# print('Dups:')
# print(duplicates)

print('\nAll Dups:')

for i in gStoreLol[1:]:
    for j in duplicates:
        if j == i[0]:
            print(i)
            print('\n\n')

9659 Unique App Names
6  Duplicate App Names

All Dups:
['Candy Bomb', 'GAME', '4.4', '42145', '20M', '10,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '2.9.3181', '4.0.3 and up']



['Candy Bomb', 'FAMILY', '4.4', '42145', '20M', '10,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '2.9.3181', '4.0.3 and up']



['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 6, 2018', '6.06.14', '4.4 and up']



['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']



['Learn C++', 'EDUCATION', '4.6', '73404', '5.3M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'December 25, 2017', '4.5.2', '4.0 and up']



['Learn C++', 'FAMILY', '4.6', '73404', '5.3M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'December 25, 2017', '4.5.2', '4.0 and up']



['Target - now with C

In [23]:
# will del the 6 repeated rows after visual inspection:

del(gStoreLol[gStoreLol.index(['Fuzzy Numbers: Pre-K Number Foundation', 'FAMILY', '4.7', '21', '44M', '1,000+', 'Paid', '$5.99', 'Everyone', 'Education;Education', 'July 21, 2017', '1.3', '4.1 and up'])])
del(gStoreLol[gStoreLol.index(['Candy Bomb', 'FAMILY', '4.4', '42145', '20M', '10,000,000+', 'Free', '0', 'Everyone', 'Casual;Brain Games', 'July 4, 2018', '2.9.3181', '4.0.3 and up'])])
del(gStoreLol[gStoreLol.index(['Learn C++', 'FAMILY', '4.6', '73404', '5.3M', '1,000,000+', 'Free', '0', 'Everyone', 'Education', 'December 25, 2017', '4.5.2', '4.0 and up'])])
del(gStoreLol[gStoreLol.index(['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 6, 2018', '6.06.14', '4.4 and up'])])
del(gStoreLol[gStoreLol.index(['YouTube Gaming', 'FAMILY', '4.2', '130549', 'Varies with device', '5,000,000+', 'Free', '0', 'Teen', 'Entertainment', 'June 27, 2018', '2.08.78.2', '4.1 and up'])])
del(gStoreLol[gStoreLol.index(['Target - now with Cartwheel', 'SHOPPING', '4.1', '68406', '24M', '10,000,000+', 'Free', '0', 'Everyone', 'Shopping', 'July 25, 2018', '6.25.0+1906001476', '5.0 and up'])])


print(len(gStoreLol))
# print(gStoreLol[0:5])

clean_android = gStoreLol[1:]

print(len(clean_android))
# print(clean_android[0:5])

9660
9659


In [24]:
# Function that returns False if there is any character
# in the input string that doesn't belong to the set
# of common English characters, otherwise it returns True.

def isEng(inStr):
    for i in inStr:
        if ord(i) > 127:
            return False
    return True

print(isEng('Instagram'))
print(isEng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEng('Docs To Go™ Free Office Suite'))
print(isEng('Instachat 😜'))


True
False
False
False


In [25]:
# Same Function, but allows up to three characters 
# that fall outside the ASCII range (0 - 127)  

def isEng(inStr):
    z = 0
    for i in inStr:
        if ord(i) > 127:
            z+=1
            if z > 3 :
                return False 
    return True

print(isEng('Instagram'))
print(isEng('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEng('Docs To Go™ Free Office Suite'))
print(isEng('Instachat 😜'))


True
False
True
True


### Now we use the new isEng function to filter out non-English apps from both data sets.

- We loop through each data set. If an app name is identified as English, we append the whole row to a new list.

- Then we explore the data sets and see how many rows remain in each.


In [26]:
engGoogle = []

for i in clean_android:
    if isEng(i[0]) : engGoogle.append(i)
#     else: print(i[0],end=''); print('  is Excluded')


print('Removed:',end=''); print(len(clean_android)-len(engGoogle))
print('Remaining:',end=''); print(len(engGoogle))

Removed:45
Remaining:9614


In [27]:
engApple  = []

for i in aStoreLol[1:]:
    if isEng(i[1]) : engApple.append(i)
#     else: print(i[1],end=''); print('  is Excluded')

print('Removed:',end='')
print(len(aStoreLol[1:])-len(engApple))
print('Remaining:',end='')
print(len(engApple))

Removed:1014
Remaining:6183


### Now we remove Non Free Apps:

In [28]:
engFreeAndroid = []

for i in engGoogle:
    if i[7] == '0':
        engFreeAndroid.append(i)

        
print('Removed:',end='')
print(len(engGoogle)-len(engFreeAndroid))
print('Remaining:',end='')
print(len(engFreeAndroid))

Removed:750
Remaining:8864


In [29]:
engFreeApple = []

for i in engApple:
#     if i[4] not in engFreeApple:
#         engFreeApple.append(i[4])
    if i[4] == '0.0':
        engFreeApple.append(i)

# print(engFreeApple)        
print('Removed:',end='')
print(len(engApple)-len(engFreeApple))
print('Remaining:',end='')
print(len(engFreeApple))

Removed:2961
Remaining:3222


--------
So far in the data cleaning process, we:

- Removed inaccurate data
- Removed duplicate app entries
- Removed non-English apps
- Removed Non-Free Apps


Our aim is to determine the kinds of apps that are likely to attract more users, because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea consists of three steps:

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

---------

## Data Analysis

### What are the most common genres for each market?

For this, we'll need to build frequency tables for a few columns in our data sets.

----------

#### Android Columns (engFreeAndroid Dataset Headers) : 

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']

> We can use column indexes (1,2,6,8,9) for frequency tables.
> Let's start with Category and Genres @ indexes (1,9).

-----------

#### iOS Columns (engFreeApple Dataset Headers) :

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']

> We can use column indexes (7,8,10,11) for frequency tables.
> Let's start with prime_genre @ index (11).

In [30]:
# Generates A Frequency Table from given dataset
# for the given column index.

# takes in two inputs:
# dataset (which is expected to be a list of lists)
# and index (which is expected to be an integer).

# I use the dictionary .get method 

# We can use get to write our histogram loop more concisely.
# Because the get method automatically handles the case
# where a key is not in a dictionary, we can reduce four
# lines down to one and eliminate the if statement.

# @ https://books.trinket.io/pfe/09-dictionaries.html#dictionaries-and-files


# and list comprehension - not needed here

def freqGen(dataset, idx):
    freq = {}
    for i in dataset:
        j = i[idx]
        freq[j] = freq.get(j,0) + 1

#         if j in freq :
#             freq[j] += 1
#         else:
#             freq[j] = 1

#     calc Percentages:
#     for x in freq: freq[x] = freq[x] / len(dataset) * 100
    return freq

In [31]:
# to sort a dictionary:
# Takes in two parameters: dataset and index. dataset is expected to be a list of lists, and index is expected to be an integer.
# Generates a frequency table using the freqGen() function.
# Transforms the frequency table into a list of tuples, then sorts the list in descending order.
# Prints the entries of the frequency table in descending order.


def display_table(dataset, index):
    table = freqGen(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])

In [32]:
# Histogram/Frequency Table for the 'Category' Column on Android:

display_table(engFreeAndroid, 1)

FAMILY : 1704
GAME : 843
TOOLS : 750
BUSINESS : 406
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 314
SPORTS : 303
PERSONALIZATION : 294
COMMUNICATION : 288
HEALTH_AND_FITNESS : 272
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 158
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 100
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
ENTERTAINMENT : 78
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


In [33]:
# Histogram/Frequency Table for the 'Genres' Column on Android:

display_table(engFreeAndroid, 9)

Tools : 749
Entertainment : 538
Education : 474
Business : 406
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 314
Sports : 307
Personalization : 294
Communication : 288
Action : 275
Health & Fitness : 272
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 165
Video Players & Editors : 158
Casual : 155
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 39
Casino : 38
Trivia : 37
Educational;Education : 35
Board : 34
Educational : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Racing;Action & Adventure : 15
Puzzle;Brain Games : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

In [34]:
# Histogram/Frequency Table for the 'prime_genre' Column on Apple:

display_table(engFreeApple, 11)

Games : 1874
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


Remember our data set only contains free English apps, so you should be careful not to extend your conclusions beyond that scope. If you find that gaming apps are the most numerous among the free English apps on Google Play, it doesn't mean we'll see the same pattern on Google Play as a whole.

Analyze the frequency table you generated for the prime_genre column of the App Store data set.
- What is the most common genre? What is the runner-up?
- What other patterns do you see?
- What is the general impression — are most of the apps designed for practical purposes (education, shopping, utilities, productivity, lifestyle) or more for entertainment (games, photo and video, social networking, sports, music)?
- Can you recommend an app profile for the App Store market based on this frequency table alone? If there's a large number of apps for a particular genre, does that also imply that apps of that genre generally have a large number of users?


Analyze the frequency tables you generated for the Category and Genres columns of the Google Play data set.
- What are the most common genres?
- What other patterns do you see?
- Compare the patterns you see for the Google Play market with those you saw for the App Store market.
- Can you recommend an app profile based on what you found so far? Do the frequency tables you generated reveal the most frequent app genres or what genres have the most users?


One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre. For the Google Play data set, we can find this information in the Installs column, but this information is missing for the App Store data set. As a workaround, we'll take the total number of user ratings as a proxy, which we can find in the rating_count_tot column.
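
The per-genre average described here can be sketched in one pass by accumulating a total and a count for each group. The rows below are made up, and `average_by_group` is a hypothetical helper, not a function from this notebook.

```python
from collections import defaultdict

def average_by_group(rows, group_idx, value_idx):
    """Average the integer column value_idx for each distinct value of group_idx."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        group = row[group_idx]
        totals[group] += int(row[value_idx])
        counts[group] += 1
    # Divide by the number of apps in the group, not by the total number of apps.
    return {group: totals[group] / counts[group] for group in totals}

# Made-up rows: (genre, rating count).
sample = [['News', '100'], ['News', '300'], ['Games', '50']]
averages = average_by_group(sample, 0, 1)
print(averages)  # {'News': 200.0, 'Games': 50.0}
```

For the App Store rows, the call would presumably be `average_by_group(engFreeApple, 11, 5)`, since `prime_genre` is at index 11 and `rating_count_tot` at index 5. Note that this sketch returns floats, whereas integer division (`//`) truncates the averages.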

In [74]:
# Histogram/Frequency Table for the 'rating_count_tot' Column on Apple:
# this is USELESS!!!

# display_table(engFreeApple, 5)

In [71]:
# Calculating the average number of user ratings per app genre on the App Store.
# as an indicator for what genres are the most popular.



# First make a dictionary with the genres:

genreDict = {}

for i in engFreeApple:
    if i[11] not in genreDict :
        genreDict[i[11]] = []

print(genreDict)


#  'Social Networking',
#  'Photo & Video',
#  'Games',
#  'Music',
#  'Reference',
#  'Health & Fitness',
#  'Weather',
#  'Utilities',
#  'Travel',
#  'Shopping',
#  'News',
#  'Navigation',
#  'Lifestyle',
#  'Entertainment',
#  'Food & Drink',
#  'Sports',
#  'Book',
#  'Finance',
#  'Education',
#  'Productivity',
#  'Business',
#  'Catalogs',
#  'Medical'



# Then we cluster the apps in our dictionary
# (each app goes straight into its own genre's list):

for i in engFreeApple:
    genreDict[i[11]].append(i)

print(genreDict['News'][5])


# Then we sum up the user ratings for the apps in each genre in a new Dict:

sumsDict = {}

for i in genreDict:
    sumsDict[i] = 0

for i,j in genreDict.items():
    for app in j:
        sumsDict[i] += int(app[5])

sumsDict

{'Social Networking': [], 'Photo & Video': [], 'Utilities': [], 'Navigation': [], 'News': [], 'Productivity': [], 'Reference': [], 'Lifestyle': [], 'Shopping': [], 'Sports': [], 'Book': [], 'Finance': [], 'Food & Drink': [], 'Catalogs': [], 'Travel': [], 'Education': [], 'Games': [], 'Entertainment': [], 'Health & Fitness': [], 'Music': [], 'Medical': [], 'Weather': [], 'Business': []}
['300255638', 'ABC News - US & World News + Live Video', '98108416', 'USD', '0.0', '48407', '20', '3.0', '3.5', '5.16', '12+', 'News', '37', '0', '1', '0']


{'Book': 556619,
 'Business': 127349,
 'Catalogs': 16016,
 'Education': 826470,
 'Entertainment': 3563577,
 'Finance': 1132846,
 'Food & Drink': 866682,
 'Games': 42705967,
 'Health & Fitness': 1514371,
 'Lifestyle': 840774,
 'Medical': 3672,
 'Music': 3783551,
 'Navigation': 516542,
 'News': 913665,
 'Photo & Video': 4550647,
 'Productivity': 1177591,
 'Reference': 1348958,
 'Shopping': 2261254,
 'Social Networking': 7584125,
 'Sports': 1587614,
 'Travel': 1129752,
 'Utilities': 1513441,
 'Weather': 1463837}

In [72]:
# Divide the sum by the number of apps belonging to that genre,
# not by the total number of apps, then sort the dict as a list of tuples:


appCounts = freqGen(engFreeApple, 11)

appCounts


# suum = 0
# for i in appCounts:
#     suum += appCounts[i]
# suum==len(engFreeApple)


for k in sumsDict:
    sumsDict[k] = sumsDict[k] // appCounts[k]

sumsDict



# we sort!

table_display = []

for key in sumsDict:
    key_val_as_tuple = (sumsDict[key], key)
    table_display.append(key_val_as_tuple)

table_sorted = sorted(table_display, reverse = True)

for entry in table_sorted:
    print(entry[0], ':', entry[1])

86090 : Navigation
74942 : Reference
71548 : Social Networking
57326 : Music
52279 : Weather
39758 : Book
33333 : Food & Drink
31467 : Finance
28441 : Photo & Video
28243 : Travel
26919 : Shopping
23298 : Health & Fitness
23008 : Sports
22788 : Games
21248 : News
21028 : Productivity
18684 : Utilities
16485 : Lifestyle
14029 : Entertainment
7491 : Business
7003 : Education
4004 : Catalogs
612 : Medical


In [81]:
# Histogram/Frequency Table for the 'Installs' Column on Android:

display_table(engFreeAndroid, 5)

1,000,000+ : 1396
100,000+ : 1025
10,000,000+ : 931
10,000+ : 905
1,000+ : 745
100+ : 613
5,000,000+ : 604
500,000+ : 493
50,000+ : 423
5,000+ : 400
10+ : 314
500+ : 288
50,000,000+ : 204
100,000,000+ : 189
50+ : 170
5+ : 70
1+ : 45
500,000,000+ : 24
1,000,000,000+ : 20
0+ : 4
0 : 1


In [104]:
# Histogram/Frequency Table for the 'Category' Column on Android:

# display_table(engFreeAndroid, 1)


# let's clean the keys a little:

andGenreDict = freqGen(engFreeAndroid,1)

andInstallsDict = freqGen(engFreeAndroid,5)

# engFreeAndroid[4][5]

# andInstallsDict


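for key in list(andInstallsDict):
    newKey = key.replace('+','').replace(',','')
    if newKey != key:
        # pop() deletes the old key and returns its count; adding it to any
        # count already stored under newKey keeps both '0+' and '0'
        andInstallsDict[newKey] = andInstallsDict.get(newKey,0) + andInstallsDict.pop(key)
    # this renames the key by creating a new one and deleting the old one!
    # same as:
        # dict[newKey] = dict[oldkey]
        # del dict[oldKey]


andInstallsDict    

{'0': 5,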
 '1': 45,
 '10': 314,
 '100': 613,
 '1000': 745,
 '10000': 905,
 '100000': 1025,
 '1000000': 1396,
 '10000000': 931,
 '100000000': 189,
 '1000000000': 20,
 '5': 70,
 '50': 170,
 '500': 288,
 '5000': 400,
 '50000': 423,
 '500000': 493,
 '5000000': 604,
 '50000000': 204,
 '500000000': 24}

In [121]:
# Here we do it all in one nested loop:

andGenreDict = freqGen(engFreeAndroid,1)
myDict={}
for category in andGenreDict:
    sumInstalls = 0
    appCount    = 0
    for app in engFreeAndroid:
        if app[1] == category:
            sumInstalls += int(app[5].replace('+','').replace(',',''))
            appCount += 1
    myDict[sumInstalls//appCount] = category
    print((sumInstalls//appCount) , ' : ' , category)
myDict

817657  :  COMICS
1331540  :  HOUSE_AND_HOME
5201482  :  PERSONALIZATION
647317  :  AUTO_AND_VEHICLES
1704192  :  BUSINESS
253542  :  EVENTS
638503  :  LIBRARIES_AND_DEMO
107144  :  MEDICAL
10801391  :  TOOLS
4056941  :  MAPS_AND_NAVIGATION
8767811  :  BOOKS_AND_REFERENCE
13984077  :  TRAVEL_AND_LOCAL
12914435  :  GAME
9549178  :  NEWS_AND_MAGAZINES
1924897  :  FOOD_AND_DRINK
7036877  :  SHOPPING
1768500  :  EDUCATION
1437816  :  LIFESTYLE
5074486  :  WEATHER
1986335  :  ART_AND_DESIGN
4274688  :  SPORTS
16772838  :  PRODUCTIVITY
542603  :  PARENTING
38326063  :  COMMUNICATION
4167457  :  HEALTH_AND_FITNESS
513151  :  BEAUTY
854028  :  DATING
1387692  :  FINANCE
17840110  :  PHOTOGRAPHY
23253652  :  SOCIAL
24790074  :  VIDEO_PLAYERS
9146923  :  ENTERTAINMENT
5180161  :  FAMILY


{107144: 'MEDICAL',
 253542: 'EVENTS',
 513151: 'BEAUTY',
 542603: 'PARENTING',
 638503: 'LIBRARIES_AND_DEMO',
 647317: 'AUTO_AND_VEHICLES',
 817657: 'COMICS',
 854028: 'DATING',
 1331540: 'HOUSE_AND_HOME',
 1387692: 'FINANCE',
 1437816: 'LIFESTYLE',
 1704192: 'BUSINESS',
 1768500: 'EDUCATION',
 1924897: 'FOOD_AND_DRINK',
 1986335: 'ART_AND_DESIGN',
 4056941: 'MAPS_AND_NAVIGATION',
 4167457: 'HEALTH_AND_FITNESS',
 4274688: 'SPORTS',
 5074486: 'WEATHER',
 5180161: 'FAMILY',
 5201482: 'PERSONALIZATION',
 7036877: 'SHOPPING',
 8767811: 'BOOKS_AND_REFERENCE',
 9146923: 'ENTERTAINMENT',
 9549178: 'NEWS_AND_MAGAZINES',
 10801391: 'TOOLS',
 12914435: 'GAME',
 13984077: 'TRAVEL_AND_LOCAL',
 16772838: 'PRODUCTIVITY',
 17840110: 'PHOTOGRAPHY',
 23253652: 'SOCIAL',
 24790074: 'VIDEO_PLAYERS',
 38326063: 'COMMUNICATION'}
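
The dictionary above is keyed by the average, so two categories that happen to share the same average would overwrite each other. A possible variant, sketched here on made-up rows, keys the result by category and sorts by value afterwards. The helper names `parse_installs` and `average_installs` are hypothetical.

```python
def parse_installs(text):
    """Turn an Installs string such as '1,000,000+' into the integer 1000000."""
    return int(text.replace('+', '').replace(',', ''))

def average_installs(rows, category_idx=1, installs_idx=5):
    """Return {category: average installs}, using integer division."""
    totals, counts = {}, {}
    for row in rows:
        cat = row[category_idx]
        totals[cat] = totals.get(cat, 0) + parse_installs(row[installs_idx])
        counts[cat] = counts.get(cat, 0) + 1
    return {cat: totals[cat] // counts[cat] for cat in totals}

# Made-up rows (name, category, installs); both categories average 500.
sample = [
    ['x', 'TOOLS', '100+'],
    ['y', 'TOOLS', '900+'],
    ['z', 'GAME', '500+'],
]
averages = average_installs(sample, category_idx=1, installs_idx=2)

# Sort by average, highest first; both categories survive the tie.
for cat, avg in sorted(averages.items(), key=lambda kv: kv[1], reverse=True):
    print(avg, ' : ', cat)
```

This also reads each row once instead of rescanning the whole dataset for every category.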

In [123]:
def test_args_kwargs(arg1, arg2, arg3):
    print("arg1:", arg1)
    print("arg2:", arg2)
    print("arg3:", arg3)
args = ("two", 3, 5)
test_args_kwargs(*args)

arg1: two
arg2: 3
arg3: 5
