# What types of apps attract more users?

## 1. Introduction:

This project aims to identify the most profitable app profiles for the App Store and Google Play markets.

We are data analysts at a company that builds Android and iOS mobile apps and makes them available on Google Play and the App Store.

Our developers only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

To reach this objective, we need to collect and analyze data about mobile apps available on Google Play and the App Store. As of 2018, the two stores together offered more than 4 million apps (approximately 2 million iOS apps and 2.1 million Android apps), so we will work with a sample of that data rather than with the full catalogue.


## 2. Exploring the dataset content:
We will use existing data sets to save time and cost.
The data sets we will use are:
- Google Play: data about approximately 10,000 Android apps from Google Play, collected in August 2018. The data set can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/googleplaystore.csv).
- App Store: data about approximately 7,000 iOS apps from the App Store, collected in July 2017. The data set can be downloaded directly from [this link](https://dq-content.s3.amazonaws.com/350/AppleStore.csv).




First, we will open both files and convert each one into a list of rows.

In [112]:
from csv import reader

# open file referring to Android apps
openfile=open("/Users/midl/Documents/_Dataquest/mydatasets/project/googleplaystore.csv")
readfile=reader(openfile)
playstore_tt =list(readfile)
playstore_header=playstore_tt[0]
play_data=playstore_tt[1:]

#open file referring to iOS apps
openfile=open("/Users/midl/Documents/_Dataquest/mydatasets/project/AppleStore.csv")
readfile=reader(openfile)
istore_tt=list(readfile)
istore_header=istore_tt[0]
i_data=istore_tt[1:]
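
As a hedged alternative to the cell above, the file handling could be wrapped in a small helper that closes the file automatically. This is only a minimal sketch, not the notebook's own method: `load_dataset` is a hypothetical name, and the `utf8` encoding is an assumption about how the CSV files are stored.

```python
from csv import reader


def load_dataset(path):
    """Return (header, rows) for the CSV file at `path` (hypothetical helper)."""
    # `with` closes the file even if parsing fails. The explicit encoding is an
    # assumption; it avoids platform-dependent defaults for non-ASCII app names.
    with open(path, encoding="utf8", newline="") as csv_file:
        rows = list(reader(csv_file))
    return rows[0], rows[1:]


# Possible usage (the path is a placeholder):
# playstore_header, play_data = load_dataset("googleplaystore.csv")
```

Splitting the header from the data in one place keeps the two data sets loaded in exactly the same way.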


Now, we will start to explore data contents, using the following function:

In [113]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print('\n')
    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Numner of columnns', len(dataset[0]))

In [114]:
#checking first 4 rows of data of each dataset
print("\n" + "---Android apps: --- "+'\n')
explore_data(play_data,0, 4, True)
print("---iOS apps: --- "+'\n')
explore_data(i_data, 0, 4, True)


---Android apps: --- 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10841
Numner of columnns 13
---iOS apps: --- 

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954

- **From this output, we can conclude that we have a total of 10841 Android apps and 7197 iOS apps.**

In [115]:
#checking header of each dataset to identify the columns that could help us with our analysis.
print("Android columns names: ", "\n", playstore_header, "\n")

print("iOS columns names: ", "\n", istore_header, "\n")


Android columns names:  
 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver'] 

iOS columns names:  
 ['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic'] 



- The Android column names are self-explanatory. The iOS column names, however, are less clear.
- The meaning of each iOS column is explained [in this link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
- **Based on this information, the most interesting columns for our analysis are:**
    - Android: "App", "Rating", "Installs", "Price", "Genres" and "Category".
    - iOS: "track_name", "price", "rating_count_tot", "rating_count_ver", "user_rating", "prime_genre".
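
Since the rows are plain lists, each of these columns has to be addressed by position. The sketch below shows one possible way to look positions up by name instead of hard-coding them; the `columns` dictionary is a hypothetical helper, and the header is copied from the output above.

```python
playstore_header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs',
                    'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated',
                    'Current Ver', 'Android Ver']

# Map every column name to its position in a row (hypothetical helper).
columns = {name: position for position, name in enumerate(playstore_header)}

print(columns['Installs'])  # 5
print(columns['Genres'])    # 9
```

With such a mapping, `row[columns['Genres']]` reads the same value as `row[9]` or `row[-4]`.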

In [116]:
#checking price columns:
#android:
print(play_data[1][7])
#iOS:
print(i_data[1][4])
#checking that the app name is in English:
#android:
print(play_data[1][0])
#iOS:
print(i_data[1][1])



0
0.0
Coloring book moana
Instagram


## 3. Data Cleaning
Before we start analyzing the data, we need to check the rows for wrong information or mistakes, to avoid drawing wrong conclusions from our analysis.

The tasks we will carry out are:
 - remove inaccurate data / rows with a different number of columns
 - remove duplicate data
 - remove apps with non-English names
 - remove the non-free apps


## 3.1. Remove the rows with different length

**First Step:** The Google Play data set has a dedicated discussion section [here](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that one of the discussions describes an error for a certain row. That row is the row number 10472. 
Let us check what is happening in that row:


In [117]:
print(playstore_header)
print(play_data[10472])

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


-> We know that Rating must be between 1 and 5. In row 10472, however, the value in the 'Rating' position is 19.
-> Why is this happening?
-> As mentioned in the discussion section linked above, the problem is caused by a missing value in the 'Category' column, which shifts every following value one position to the left. Below we check whether the row is missing one column, and **we conclude that yes, it is one column short.**
-> For this reason, **we will delete this row.**
Note: After running the instruction that deletes the wrong row, we will comment it out, to prevent it from running more than once and deleting more than one row.

In [118]:
print(len(playstore_header))
print(len(play_data[10472]))

13
12
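
To see which column is missing, the header can be paired with the values of the malformed row. This is a small illustrative sketch using the header and the row printed above; `paired` is a hypothetical name.

```python
header = ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type',
          'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver',
          'Android Ver']
bad_row = ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M',
           '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018',
           '1.0.19', '4.0 and up']

# zip() stops at the shorter sequence, so the last header name gets no value.
paired = list(zip(header, bad_row))
for column, value in paired:
    print(column, '->', repr(value))
```

The pairing shows 'Category' next to '1.9' and 'Rating' next to '19', which is consistent with the category value being absent and everything after it shifted left.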


In [119]:
#to delete the row directly we would run:
#del play_data[10472]
#however, this instruction must run only once, so to be safe we leave it commented out.
#as an alternative, the deletion is done inside the guarded check in the next cell

Now let us check whether either data set has more cases like the previous one, that is, rows with fewer or more columns than expected.

In [120]:
#inspect whether more rows have a different length than expected:
#android
i=[]
j=0
m=[]
for each in play_data:
    if len(each) != 13:#expected length, from the result of explore_data()
        i.append(each)
        m.append(j)
    j+=1

print(i, m) # if i and m are empty lists, no row is affected by this problem
print(len(play_data)) #double check before deleting: the initial number of rows was 10841


if len(m)==1:
    del play_data[m[0]]
print(len(play_data)) #double check that we deleted no more than one row

#iOS
i=[]
j=0
m=[]
for each in i_data:
    if len(each) != 16:#expected length, from the result of explore_data()
        i.append(each)
        m.append(j)
    j+=1

print(i, m)# if i and m are empty lists, no row is affected by this problem
print(len(i_data)) #double check before deleting: the initial number of rows was 7197
if len(m)==1:
    del i_data[m[0]] #delete from the iOS data set, not from play_data
print(len(i_data)) #double check that we deleted no more than one row


[['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']] [10472]
10841
10840
[] []
7197
7197
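
The row-length check is written twice above, once per data set. One possible way to avoid the repetition is a small helper such as the one sketched here; `rows_with_wrong_length` is a hypothetical name and the sample data is made up for illustration.

```python
def rows_with_wrong_length(dataset, expected_len):
    """Return the indexes of the rows whose length differs from expected_len."""
    return [index for index, row in enumerate(dataset) if len(row) != expected_len]


# Made-up sample: the second row is one value short.
sample = [['a', 'b', 'c'], ['a', 'b'], ['a', 'b', 'c']]
bad_indexes = rows_with_wrong_length(sample, 3)
print(bad_indexes)  # [1]

# Deleting from the highest index down keeps the remaining indexes valid.
for index in reversed(bad_indexes):
    del sample[index]
print(len(sample))  # 2
```

Because the helper only reports indexes, running it a second time on the cleaned data returns an empty list and nothing else is deleted.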


In [26]:
#how many non-free apps does each data set have?
play_nonfree=[]
j=0
for each in play_data:
    app_type=each[6] #'Type' column: 'Free' or 'Paid'
    if app_type!="Free":
        play_nonfree.append(j)
    j+=1
print(len(play_nonfree)) #total number of Android rows that are not free
#print(play_nonfree) # these are the indexes that we will need to remove

i_nonfree=[]
j=0
for each in i_data:
    price=float(each[4])
    if price!=0.0: #a non-free app has a price different from zero
        i_nonfree.append(j)
    j+=1
print(len(i_nonfree)) #total number of iOS rows that are not free (7197 rows minus the 4056 rows with price 0.0)
#print(i_nonfree) # these are the indexes that we will need to remove


801
3141


## 3.2. Remove the duplicate rows
**Second step:** If we explore the Google Play data set long enough, or look at the [discussions](https://www.kaggle.com/lava18/google-play-store-apps/discussion) section, we find that some apps have duplicate entries. For instance, Instagram has four entries.

For each data set, we need to identify how many rows are in this situation and then find the best way to remove the duplicates (we don't want to remove rows at random).

Looking at the Instagram example, the four entries have the same number of installs but different numbers of reviews. This suggests that the row with the most reviews is the most recent one.

So, in the end, we want to keep **the row with the most reviews** and eliminate the others.
To reach this goal we will:
- First goal - *identify duplicates*: we build a dictionary `play_name_freq` that records, for each app in the Android data set, its frequency (the number of times it appears in the data set). Each time an app name appears again, we save the name in the list `duplicate_app`.
- Second goal - *identify the duplicate with the most reviews*: we build another dictionary, `reviews_max`, that maps each app name to its maximum number of reviews.
- Third goal - *eliminate the duplicate rows with fewer reviews, or with the same number of reviews as the maximum*: we build a list `android_cleaned` with the Android data without duplicate rows. However, if we only use the condition `reviews_max[name] == n_reviews` to decide which rows to append, we end up with more entries than the length of `reviews_max`. This means some duplicates remain: the entries of an app that share the same maximum number of reviews (example: 'ZEDGE™ Ringtones & Wallpapers'). To avoid this, we create a second list `already_added` where we save the app names already in `android_cleaned`, which ensures we keep only one entry per app.
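
The three goals above can also be sketched as a single pass over the data. This is an illustrative variant under stated assumptions, not the approach used in the cells below: `keep_most_reviewed` is a hypothetical name, and the sample rows are made up, with the app name at index 0 and the review count at index 3 as in the Android data set.

```python
def keep_most_reviewed(dataset, name_idx=0, reviews_idx=3):
    """Keep, for each app name, the first row that has the highest review count."""
    best = {}
    for row in dataset:
        name = row[name_idx]
        n_reviews = int(row[reviews_idx])
        # A strictly greater count replaces the stored row, so among rows that
        # tie on the maximum the first one encountered is kept.
        if name not in best or n_reviews > int(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())


# Made-up sample rows: [name, category, rating, reviews]
rows = [
    ['App A', 'TOOLS', '4.0', '10'],
    ['App B', 'TOOLS', '4.1', '5'],
    ['App A', 'TOOLS', '4.0', '12'],
    ['App A', 'TOOLS', '4.0', '12'],
]
cleaned = keep_most_reviewed(rows)
print(len(cleaned))  # 2
```

A dictionary lookup also avoids the repeated `name not in already_added` list search, which can become slow on large data sets.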


For the iOS data we stop at the first dictionary: every row has a different ID, which means there are no duplicate entries.


NOTES:
Other criteria considered for the case "more than one duplicate row of an app has the same number of reviews":
1. We tried a second criterion: installs.
We would select the row with the most installs. However, after some checks, we understood that the installs are not exact numbers of installations but only open-ended ranges such as "5,000+" or "10,000+". We therefore think this criterion is not relevant for this analysis and leave it aside. If we did want to use installs, we would need to remove "," and "+" and convert the value with `int()` to be able to compare which number is bigger.

2. We also checked whether the last update date could resolve these cases.
If several rows have the same number of reviews, we would prefer the most recent one.
However, after inspecting the data set, we verified that some rows are completely identical (example: 'ZEDGE™ Ringtones & Wallpapers'), so in those cases we would still have to pick one of them without any extra criterion.

Let us move on without any extra criterion.
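
For reference, the conversion mentioned in note 1 could look like the sketch below. `installs_to_int` is a hypothetical helper; it assumes every value has the "10,000+" shape seen in the data shown so far.

```python
def installs_to_int(installs):
    """Convert an installs label such as '10,000+' to the integer 10000."""
    # Assumes the label only contains digits, ',' and '+'.
    return int(installs.replace(',', '').replace('+', ''))


print(installs_to_int('10,000+'))  # 10000
print(installs_to_int('5,000+') < installs_to_int('10,000+'))  # True
```

The result is still only the lower bound of a range, which is why it was not used as a tie-breaker here.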


--- EXAMPLE: Instagram ---

In [121]:
i=0
for each in play_data:
    name=each[0]
    review=int(each[3])
    if name== "Instagram":
        print(each)
        if review>i:
            i=review
max_rev_instagram=i
print("\n", "Max reviews for Instagram: ", max_rev_instagram)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']

 Max reviews for Instagram:  66577446


--- EXAMPLE: ZEDGE™ Ringtones & Wallpapers ---

In [122]:
for each in play_data:
    name=each[0]
    review=int(each[3])
    if name== "ZEDGE™ Ringtones & Wallpapers":
        print(each)

print("\n", "We have 3 entries with max reviews of this app, and those entries are completely equals")

['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6466641', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']
['ZEDGE™ Ringtones & Wallpapers', 'PERSONALIZATION', '4.6', '6459626', 'Varies with device', '100,000,000+', 'Free', '0', 'Teen', 'Personalization', 'July 19, 2018', 'Varies with device', 'Varies with device']

 We have 3 entries with max reviews of this app, and those entries are completely equals


--- **Abstracting :** Checking duplicate rows ---


In [123]:
#Android
play_name_freq={}
duplicate_app=[]
reviews_max={}

for each in play_data:
    name=str(each[0])
    n_reviews=int(each[3])
    if name in play_name_freq:
        play_name_freq[name]+=1
        duplicate_app.append(name)
        if n_reviews>reviews_max[name]:
            reviews_max[name]=n_reviews
    else:
        play_name_freq[name]=1
        reviews_max[name]=n_reviews

        
print("\n", "ANDROID: ", "\n")

print("-> Examples of apps with dupplicate rows: ", duplicate_app[:5], "\n")

print("-> Total of duplicate apps' rows: ", len(duplicate_app), "\n")

print("Does it works for Instagram example?", max_rev_instagram==reviews_max["Instagram"], "\n")#checking that the dictionary gives the reviews number we saw before for Instagram. True means we get the expected result

print("-> Expected total of apps without dupplicate rows: ", len(play_data)-len(duplicate_app))

print("-> Current total of apps without dupplicate rows: ", len(reviews_max), "\n")

print("Is the previsous expected number same as current number?", len(reviews_max)== len(play_data)-len(duplicate_app))

#print(play_name_freq)


#iOS

i_name_freq={}
duplicate_app=[]

for each in i_data:
    name=str(each[0])
    if name in i_name_freq:
        i_name_freq[name]+=1
        duplicate_app.append(name)
    else:
        i_name_freq[name]=1


print("\n", "iOS: ", "\n")

print("Examples of apps with dupplicate rows: ", duplicate_app[:5], "\n")

print("Total of duplicate apps' rows: ", len(duplicate_app), "\n")

#print(i_name_freq)



 ANDROID:  

-> Examples of apps with dupplicate rows:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings'] 

-> Total of duplicate apps' rows:  1181 

Does it works for Instagram example? True 

-> Expected total of apps without dupplicate rows:  9659
-> Current total of apps without dupplicate rows:  9659 

Is the previsous expected number same as current number? True

 iOS:  

Examples of apps with dupplicate rows:  [] 

Total of duplicate apps' rows:  0 



--- Deleting duplicates rows ---

In [124]:
android_cleaned=[]
already_added=[]

for each in play_data:
    name=str(each[0])
    n_reviews=int(each[3])
    if n_reviews== reviews_max[name] and name not in already_added:#some duplicates share the same maximum number of reviews, so we keep only the first one
            android_cleaned.append(each)
            already_added.append(name)
# print(android_cleaned[:5])
print(len(android_cleaned))

#checking first 4 rows of data of each dataset
print("\n" + "---Android apps: --- "+'\n')
explore_data(android_cleaned,0, 4, True)


9659

---Android apps: --- 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 9659
Numner of columnns 13


## 3.3. Remove non-English apps

We use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets have apps with names that suggest they are not directed toward an English-speaking audience.

To do this, we use the function `ord()` to check the code point of each character in the app name. The characters commonly used in English text all have code points in the **range 0 to 127**, according to the ASCII (American Standard Code for Information Interchange) system, so we can easily create a function that checks whether an app name includes non-English characters.

We create two versions:
- first version `english_v1`: we exclude every app whose name has at least 1 non-ASCII character. We think this version is too strict: we would lose many usable entries just because the name contains, for example, an emoji or a trademark symbol.

- second version `english_v2`: we exclude every app whose name has more than 3 non-ASCII characters.
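
Both versions can be seen as one check with a different tolerance. The sketch below is a hedged generalization, not the notebook's own function: `is_english` and its `max_non_ascii` parameter are hypothetical names, and the example app names are taken from the data sets.

```python
def is_english(name, max_non_ascii=3):
    """Return True if `name` has at most `max_non_ascii` characters above ASCII 127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii


print(is_english("Instagram"))                                       # True
print(is_english("Docs To Go™ Free Office Suite"))                   # True
print(is_english("爱奇艺PPS -《欢乐颂2》电视剧热播"))                    # False
print(is_english("Docs To Go™ Free Office Suite", max_non_ascii=0))  # False
```

Setting `max_non_ascii=0` reproduces the strict first version, while the default of 3 matches the tolerant second version.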


Examples of non-English entries in our data sets:

In [125]:
print("Android dataset")
print(android_cleaned[4412][0])
print(android_cleaned[7940][0], "\n")

print("iOS dataset")
print(i_data[813][1])
print(i_data[6731][1])

Android dataset
中国語 AQリスニング
لعبة تقدر تربح DZ 

iOS dataset
爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


Checking whether the app name has non-English characters:

In [126]:
print("---First vesion of function to check if we have an english name:---", "\n")
def english_v1(string):
    for each in string:
        if ord(each)>127:
            return False
    return True
print(english_v1("Instagram"), "-> This result is beacause all characters are english and range number is less than 127")
print(english_v1("爱奇艺PPS -《欢乐颂2》电视剧热播"), "-> This result is beacause name contains foreign characters")
print(english_v1("Docs To Go™ Free Office Suite"), "-> This result is beacause '™'', which has range number: ", ord("™"))
print(english_v1("Instachat 😜"), "-> This result is beacause emoji which has range number: ", ord("😜"),"\n")

print("---Second vesion of function to check if we have an english name:---", "\n")


def english_v2(string):
    tolerance=0
    for each in string:
        if ord(each)>127:
            tolerance+=1
            if tolerance>3:
                return False
    return True
print(english_v2("Instagram"), "-> This result is beacause all characters are english and range number is less than 127")
print(english_v2("爱奇艺PPS -《欢乐颂2》电视剧热播"), "-> This result is beacause name contains more than 3 foreign characters")
print(english_v2("Docs To Go™ Free Office Suite"), "-> This result is beacause we don't have more than 3 foreign characters")
print(english_v2("Instachat 😜"), "-> This result is beacause we don't have more than 3 foreign characters")


print("\n")
print("-> RESULTS: Android dataset")
play_en_cleaned=[]
play_nonenglish=[]

for each in android_cleaned:
    name=each[0]
    if english_v2(name):
        play_en_cleaned.append(each)
    else:
        play_nonenglish.append(each)

print ("Apps total with foreign name: ",len(play_nonenglish), "\n")
print ("Apps total removing the ones with foreign name: " , len(play_en_cleaned), "\n")


#checking results for android dataset:
for each in play_en_cleaned:
    name=each[0]
    #print (name)
for each in play_nonenglish:
    name=each[0]
    #print (name)
    
print("\n")
print("-> RESULTS: iOS dataset")
i_en_cleaned=[]
i_nonenglish=[]

for each in i_data:
    name=each[1]
    if english_v2(name):
        i_en_cleaned.append(each)
    else:
        i_nonenglish.append(each)

print ("Apps total with foreign name: ",len(i_nonenglish), "\n")
print ("Apps total removing the ones with foreign name: " , len(i_en_cleaned), "\n")

#checking results for ios dataset:
for each in i_en_cleaned:
    name=each[1]
    #print (name)
for each in i_nonenglish:
    name=each[1]
    #print (name)

print("-> EXPLORING DATASETS")
#checking first 4 rows of data of each dataset
print("\n" + "---Android apps: --- "+'\n')
explore_data(play_en_cleaned,0, 4, True)
print("---iOS apps: --- "+'\n')
explore_data(i_en_cleaned, 0, 4, True)

---First vesion of function to check if we have an english name:--- 

True -> This result is beacause all characters are english and range number is less than 127
False -> This result is beacause name contains foreign characters
False -> This result is beacause '™'', which has range number:  8482
False -> This result is beacause emoji which has range number:  128540 

---Second vesion of function to check if we have an english name:--- 

True -> This result is beacause all characters are english and range number is less than 127
False -> This result is beacause name contains more than 3 foreign characters
True -> This result is beacause we don't have more than 3 foreign characters
True -> This result is beacause we don't have more than 3 foreign characters


-> RESULTS: Android dataset
Apps total with foreign name:  45 

Apps total removing the ones with foreign name:  9614 



-> RESULTS: iOS dataset
Apps total with foreign name:  1014 

Apps total removing the ones with foreign name:

## 3.4. Remove non-free apps

Our company only builds free apps, so we want to remove the non-free apps from our data sets.

In [127]:
print("Android dataset")
play_en_free_cleaned=[]
play_en_nonfree_cleaned=[]
for each in play_en_cleaned:
    price=each[6]
    if price== "Free":
        play_en_free_cleaned.append(each)
    else:
        play_en_nonfree_cleaned.append(each)

print("\n")
print("\n")
print("iOS dataset")

i_en_free_cleaned=[]
i_en_nonfree_cleaned=[]
for each in i_en_cleaned:
    price=float(each[4])
    if price==0:
        i_en_free_cleaned.append(each)
    else:
        i_en_nonfree_cleaned.append(each)
        
print ("Free apps total in Android dataset: ",len(play_en_free_cleaned), "\n")
print ("Free apps total in iOS dataset: ", len(i_en_free_cleaned), "\n")

#checking results for android dataset:
for each in play_en_free_cleaned:
    price=each[6]
    #print (price)
for each in play_en_nonfree_cleaned:
    price=each[6]
    #print (price)
#checking results for iOS dataset:
for each in i_en_free_cleaned:
    price=each[4]
    #print (price)
for each in i_en_nonfree_cleaned:
    price=each[4]
    #print (price)

print("-> EXPLORING DATASETS")
#checking first 4 rows of data of each dataset
print("\n" + "---Android apps: --- "+'\n')
explore_data(play_en_free_cleaned,0, 4, True)
print("---iOS apps: --- "+'\n')
explore_data(i_en_free_cleaned, 0, 4, True)

Android dataset




iOS dataset
Free apps total in Android dataset:  8863 

Free apps total in iOS dataset:  3222 

-> EXPLORING DATASETS

---Android apps: --- 

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


['Pixel Draw - Number Art Coloring Book', 'ART_AND_DESIGN', '4.3', '967', '2.8M', '100,000+', 'Free', '0', 'Everyone', 'Art & Design;Creativity', 'June 20, 2018', '1.1', '4.4 and up']


Number of rows: 8863
Numner of columnns 13
---iOS apps: --- 

['284882215', 'Facebook', '3898

## 4. Analyzing datasets

## 4.1. Most common app by Genre

*As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.*

*To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:*

1. Build a minimal Android version of the app, and add it to Google Play.
2. If the app has a good response from users, we develop it further.
3. If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

*Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.*

*Let's begin the analysis by getting a sense of what are the most common genres for each market. For this, we'll need to build frequency tables for a few columns in our data sets.*
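
A frequency table of this kind can be sketched with `collections.Counter` from the standard library. This is only an illustrative alternative to the hand-written loops in the next cell; `genre_percentages` is a hypothetical name and the sample rows are made up.

```python
from collections import Counter


def genre_percentages(dataset, index):
    """Return {value: percentage of rows} for the column at `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}


# Made-up sample rows: [name, genre]
sample = [['App A', 'Games'], ['App B', 'Games'], ['App C', 'Tools'], ['App D', 'Games']]
table = genre_percentages(sample, 1)
print(table)  # {'Games': 75.0, 'Tools': 25.0}
```

`Counter` does the counting in one expression, so only the conversion to percentages is left to write by hand.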

In [128]:
print("MOST COMMON GENRES for each dataset:")
#first we create both frequency tables by hand
play_app_genre={}
for each in play_en_free_cleaned:
    genre=each[-4]
    if genre in play_app_genre:
        play_app_genre[genre]+=1
    else:
        play_app_genre[genre]=1
#print(play_app_genre)
#print("\n")
i_app_genre={}
for each in i_en_free_cleaned:
    genre=each[-5]
    if genre in i_app_genre:
        i_app_genre[genre]+=1
    else:
        i_app_genre[genre]=1
#print(i_app_genre)



print("-> Android dataset - most common genre")
freq_play=0
most_common=0
most_common_list=[]
for each in play_app_genre:
    if play_app_genre[each]>freq_play:
        freq_play=play_app_genre[each]
        most_common=each
        most_common_list=[] #a new maximum makes earlier ties irrelevant
    elif play_app_genre[each]==freq_play:
        most_common_list.append(each)
if len(most_common_list)==0:
    print("most common genre:", most_common," -> with frequence ", freq_play, " and percentage: ", (freq_play/len(play_en_free_cleaned)*100))
else:
    print("check! genres tied with", most_common, "for the top frequency: ",most_common_list)

print("\n")
print("-> iOS dataset - most common genre")
freq_i=0
most_common=0
most_common_list=[]
for each in i_app_genre:
    if i_app_genre[each]>freq_i:
        freq_i=i_app_genre[each]
        most_common=each
        most_common_list=[] #a new maximum makes earlier ties irrelevant
    elif i_app_genre[each]==freq_i:
        most_common_list.append(each)
if len(most_common_list)==0:
    print("most common genre:", most_common," -> with frequence: ", freq_i, "and percentage: ", (freq_i/len(i_en_free_cleaned)*100))
else:
    print("check! genres tied with", most_common, "for the top frequency: ",most_common_list)

MOST COMMON GENRES for each dataset:
-> Android dataset - most common genre
most common genre: Tools  -> with frequence  749  and percentage:  8.450863138892023


-> iOS dataset - most common genre
most common genre: Games  -> with frequence:  1874 and percentage:  58.16263190564867


In [129]:
print("CREATING FREQENCY TABLES FOR ANY COLUMN AND IN PERCENTAGES:")
#now we will create a function to avoid the previous "manual work"
def freq_table(dataset, index):
    app_parameter={}
    for each in dataset:
        parameter=each[index]
        if parameter in app_parameter:
            app_parameter[parameter]+=1
        else:
            app_parameter[parameter]=1
    app_parameter_percentage={}
    for each in app_parameter:
        app_parameter_percentage[each]=app_parameter[each]/len(dataset)*100
    return app_parameter_percentage

#the following function prints the most common parameter and its percentage
def most_common(dataset, index):
    app_parameter_percentage=freq_table(dataset, index)
    freq=0
    top=None #named `top` to avoid shadowing the function name
    tied=[]
    for each in app_parameter_percentage:
        if app_parameter_percentage[each]>freq:
            freq=app_parameter_percentage[each]
            top=each
            tied=[] #a new maximum makes earlier ties irrelevant
        elif app_parameter_percentage[each]==freq:
            tied.append(each)
    if len(tied)==0:
        print(top, freq)
    else:
        print("check! parameters tied with", top, "for the top percentage: ", tied)

    
    
#freq_table(play_en_free_cleaned, -4)
#freq_table(i_en_free_cleaned, -5)
most_common(play_en_free_cleaned, -4)
most_common(i_en_free_cleaned, -5)

CREATING FREQENCY TABLES FOR ANY COLUMN AND IN PERCENTAGES:
Tools 8.450863138892023
Games 58.16263190564867
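
The `freq_table` function above builds its counts by hand. As a minimal sketch of the same idea, assuming the standard-library `collections.Counter` is acceptable here, the percentages could also be computed as follows. The name `freq_table_counter` and the toy rows are illustrative only and are not part of the original notebook.

```python
from collections import Counter

def freq_table_counter(dataset, index):
    """Return {value: percentage} for the column at position `index`."""
    counts = Counter(row[index] for row in dataset)
    total = len(dataset)
    return {value: count / total * 100 for value, count in counts.items()}

# toy rows standing in for the cleaned data set (hypothetical values)
sample_rows = [["a", "Games"], ["b", "Games"], ["c", "Tools"], ["d", "Music"]]
sample_table = freq_table_counter(sample_rows, -1)
print(sample_table)
```
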


When analyzing the data, we may want to check the whole frequency table instead of only the most common item. However, the entries of the dictionary do not follow any particular order, so it would be very hard to read the results by just printing the dictionary. We therefore need to see the data sorted by percentage, and we will use the following function to sort it:

In [132]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
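
The `display_table` function above sorts `(value, key)` tuples. A possible alternative, shown here only as a sketch, is to sort the dictionary items directly with a `key` function. The name `sorted_by_value` and the sample table are hypothetical. One difference to be aware of: entries with equal percentages keep their original dictionary order here, whereas the tuple approach orders them by name.

```python
def sorted_by_value(table):
    """Return (key, value) pairs ordered from the largest value to the smallest."""
    return sorted(table.items(), key=lambda pair: pair[1], reverse=True)

# hypothetical frequency table, used only to illustrate the ordering
sample_table = {"Tools": 25.0, "Games": 50.0, "Music": 12.5, "Books": 12.5}
for name, percentage in sorted_by_value(sample_table):
    print(name, ':', percentage)
```
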


In [134]:

print("**iOS data set: prime_genre**")
display_table(i_en_free_cleaned,-5)

**iOS data set: prime_genre**
Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Let us start by analyzing the frequency table generated for the `prime_genre` column of the App Store data set. From this data, we can draw the following conclusions:

- The most common genre is **Games**, with more than half of the entries in this data set (58.16%).
- After Games, the most common genres are Entertainment (7.88%), Photo & Video (4.97%), Education (3.66%) and then Social Networking (3.29%).
- The general impression is that most of the apps are designed for entertainment (games, photo and video, social networking, sports, music) rather than for practical purposes (education, shopping, utilities, productivity, lifestyle).
- Based on these results we still cannot recommend an app profile for the App Store market: the frequency table shows a large number of **Games** apps, but we have not yet checked whether this genre also has the largest number of users.



In [135]:
print("**1- Android data set: Genres**")
display_table(play_en_free_cleaned,-4)  
print("\n")
print("-----")
print("\n")
print("**2- Android data set: Category**")
display_table(play_en_free_cleaned,1)


**1- Android data set: Genres**
Tools : 8.450863138892023
Entertainment : 6.070179397495204
Education : 5.348076272142616
Business : 4.592124562789123
Productivity : 3.8925871601038025
Lifestyle : 3.8925871601038025
Finance : 3.7007785174320205
Medical : 3.5315355974275078
Sports : 3.463838429425702
Personalization : 3.317161232088458
Communication : 3.2381812027530184
Action : 3.102786866749408
Health & Fitness : 3.0802211440821394
Photography : 2.944826808078529
News & Magazines : 2.798149610741284
Social : 2.6627552747376737
Travel & Local : 2.324269434728647
Shopping : 2.245289405393208
Books & Reference : 2.1437436533904997
Simulation : 2.042197901387792
Dating : 1.8616721200496444
Arcade : 1.8503892587160102
Video Players & Editors : 1.771409229380571
Casual : 1.7601263680469368
Maps & Navigation : 1.399074805370642
Food & Drink : 1.241114746699763
Puzzle : 1.128286133363421
Racing : 0.9928917973598104
Role Playing : 0.9364774906916393
Libraries & Demo : 0.9364774906916393
Auto &


Now, looking at the frequency tables generated for the `Category` and `Genres` columns of the Google Play data set, we can draw the following conclusions:

- The most common genres are **Tools** (8.45%), **Entertainment** (6.07%) and **Education** (5.35%), and the most common categories are **FAMILY** (18.90%), **GAME** (9.72%) and **TOOLS** (8.46%).
- Besides the top 3, the other categories and genres near the top of the tables relate more to practical purposes than to entertainment.
- Comparing the patterns for the Google Play market with what we saw for the App Store market, we see the opposite trend. For Google Play, the general impression is that most of the apps are designed for practical purposes (education, shopping, utilities, productivity, lifestyle) rather than for entertainment (games, photo and video, social networking, sports, music).
- We cannot recommend an app profile based on what we have found so far. We need to compare these numbers with the number of installs/reviews in order to understand which apps have more users.



## 4.2. Most popular apps by genre, based on number of users

### 4.2.1. Most popular apps by genre, based on number of users - iOS dataset

*Now, we'd like to get an idea about the kind of apps with the most users.*

*One way to find out which genres are the most popular (have the most users) is to calculate the average number of installs for each app genre.*
For the Google Play data set, we can find this information in the `Installs` column.
For the App Store data set, this information is missing, so we'll take the total number of user ratings as a proxy, which we can find in the `rating_count_tot` column.
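
The averaging described above can be sketched in a single pass over the rows by keeping a running total and a running count per genre. This is only an illustration: the function name `average_by_group`, its parameters and the toy rows are assumptions, not part of the original notebook.

```python
def average_by_group(dataset, group_index, value_index):
    """Return {group: average of the numeric column} using one pass over the rows."""
    totals = {}
    counts = {}
    for row in dataset:
        group = row[group_index]
        totals[group] = totals.get(group, 0) + float(row[value_index])
        counts[group] = counts.get(group, 0) + 1
    return {group: totals[group] / counts[group] for group in totals}

# toy rows: [name, rating count, genre] (hypothetical values)
sample_rows = [
    ["App A", "100", "Games"],
    ["App B", "300", "Games"],
    ["App C", "50", "Music"],
]
sample_averages = average_by_group(sample_rows, 2, 1)
print(sample_averages)
```
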

In [181]:
print("iOS data set: prime_genre", "\n")

i_genre_dic= freq_table(i_en_free_cleaned,-5)

table_display=[]
for genre in i_genre_dic:
    total=0
    len_genre=0
    for each in i_en_free_cleaned:
        installers= float(each[5])
        genre_app=each[-5]
        if genre_app==genre:
            total +=installers
            len_genre+=1
    average=total/len_genre
    genre_tupples=(average, genre)
    table_display.append(genre_tupples)
    #print(genre, ": ", average)
    #print(genre_tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])
    

iOS data set: prime_genre 

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22788.6696905016
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


--> On average, navigation apps have the highest number of user ratings.
Let us take a closer look at the Navigation genre and see which apps it contains and how many ratings each one has:

In [154]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Navigation':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered


Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


--> We can see that, inside Navigation, two apps account for most of the users: Waze and Google Maps. This market is dominated by these giants, so it should not be what we are looking for.
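
To put a number on this kind of dominance, one option is to compute the share of all ratings held by the top few apps. The sketch below uses the Navigation rating counts printed above. The helper name `top_share` and the choice of `k=2` are assumptions made for illustration.

```python
def top_share(counts, k=2):
    """Return the fraction of the total that belongs to the k largest values."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0.0
    return sum(ordered[:k]) / total

# rating counts of the Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]
share = top_share(navigation_ratings, k=2)
print(round(share * 100, 1))
```

With these figures, the two largest apps hold roughly 97% of all ratings in the genre, which is consistent with the reading above.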

Let us check what happens inside each of the other genres:

In [155]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Reference':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Google Translate : 26786
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
Merriam-Webster Dictionary : 16849
Night Sky : 12122
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
WWDC : 762
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
VPN Express : 14
Real Bike Traffic Rider Virtual Reality Glasses : 8
教えて!goo : 0
Jishokun-Japanese English Dictionary & Translator : 0


--> Inside Reference we also have two apps with most of the users: Bible and Dictionary.com, with Bible having many more users. This looks interesting.

In [156]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Social Networking':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

Facebook : 2974676
Pinterest : 1061624
Skype for iPhone : 373519
Messenger : 351466
Tumblr : 334293
WhatsApp Messenger : 287589
Kik : 260965
ooVoo – Free Video Call, Text and Voice : 177501
TextNow - Unlimited Text + Calls : 164963
Viber Messenger – Text & Call : 164249
Followers - Social Analytics For Instagram : 112778
MeetMe - Chat and Meet New People : 97072
We Heart It - Fashion, wallpapers, quotes, tattoos : 90414
InsTrack for Instagram - Analytics Plus More : 85535
Tango - Free Video Call, Voice and Chat : 75412
LinkedIn : 71856
Match™ - #1 Dating App. : 60659
Skype for iPad : 60163
POF - Best Dating App for Conversations : 52642
Timehop : 49510
Find My Family, Friends & iPhone - Life360 Locator : 43877
Whisper - Share, Express, Meet : 39819
Hangouts : 36404
LINE PLAY - Your Avatar World : 34677
WeChat : 34584
Badoo - Meet New People, Chat, Socialize. : 34428
Followers + for Instagram - Follower Analytics : 28633
GroupMe : 28260
Marco Polo Video Walkie Talkie : 27662
Miitomo : 2

--> Inside Social Networking we also have a few apps with most of the users: Facebook, Pinterest and Skype. This market also looks dominated by these giants, so it should not be what we are looking for.

In [157]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Music':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

Pandora - Music & Radio : 1126879
Spotify Music : 878563
Shazam - Discover music, artists, videos & lyrics : 402925
iHeartRadio – Free Music & Radio Stations : 293228
SoundCloud - Music & Audio : 135744
Magic Piano by Smule : 131695
Smule Sing! : 119316
TuneIn Radio - MLB NBA Audiobooks Podcasts Music : 110420
Amazon Music : 106235
SoundHound Song Search & Music Player : 82602
Sonos Controller : 48905
Bandsintown Concerts : 30845
Karaoke - Sing Karaoke, Unlimited Songs! : 28606
My Mixtapez Music : 26286
Sing Karaoke Songs Unlimited with StarMaker : 26227
Ringtones for iPhone & Ringtone Maker : 25403
Musi - Unlimited Music For YouTube : 25193
AutoRap by Smule : 18202
Spinrilla - Mixtapes For Free : 15053
Napster - Top Music & Radio : 14268
edjing Mix:DJ turntable to remix and scratch music : 13580
Free Music - MP3 Streamer & Playlist Manager Pro : 13443
Free Piano app by Yokee : 13016
Google Play Music : 10118
Certified Mixtapes - Hip Hop Albums & Mixtapes : 9975
TIDAL : 7398
YouTube Mu

--> Inside Music we also have a few apps with most of the users: Pandora, Spotify and Shazam. This market also looks dominated by these giants, so it should not be what we are looking for.

In [163]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Weather':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

The Weather Channel: Forecast, Radar & Alerts : 495626
The Weather Channel App for iPad – best local forecast, radar map, and storm tracking : 208648
WeatherBug - Local Weather, Radar, Maps, Alerts : 188583
MyRadar NOAA Weather Radar Forecast : 150158
AccuWeather - Weather for Life : 144214
Yahoo Weather : 112603
Weather Underground: Custom Forecast & Local Radar : 49192
NOAA Weather Radar - Weather Forecast & HD Radar : 45696
Weather Live Free - Weather Forecast & Alerts : 35702
Storm Radar : 22792
QuakeFeed Earthquake Map, Alerts, and News : 6081
Moji Weather - Free Weather Forecast : 2333
Hurricane by American Red Cross : 1158
Forecast Bar : 375
Hurricane Tracker WESH 2 Orlando, Central Florida : 203
FEMA : 128
iWeather - World weather forecast : 80
Weather - Radar - Storm with Morecast App : 78
Yurekuru Call : 53
Weather & Radar : 37
WRAL Weather Alert : 25
Météo-France : 24
JaxReady : 22
Freddy the Frogcaster's Weather Station : 14
Almanac Long-Range Weather Forecast : 12
wetter.c

--> For Weather, we have some giants, and we should keep in mind that there is not much we could add to what they already offer; this kind of app is also usually opened only for quick use.

In [161]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Book':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

Kindle – Read eBooks, Magazines & Textbooks : 252076
Audible – audio books, original series & podcasts : 105274
Color Therapy Adult Coloring Book for Adults : 84062
OverDrive – Library eBooks and Audiobooks : 65450
HOOKED - Chat Stories : 47829
BookShout: Read eBooks & Track Your Reading Goals : 879
Dr. Seuss Treasury — 50 best kids books : 451
Green Riding Hood : 392
Weirdwood Manor : 197
MangaZERO - comic reader : 9
謎解き2016 : 0
謎解き : 0
ikouhoushi : 0
MangaTiara - love comic reader : 0


--> Book looks interesting. There are not many apps, so we may have room to enter.

In [162]:
table_display=[]
for each in i_en_free_cleaned:
    if each[-5] == 'Food & Drink':
        tupples=(int(each[5]), each[1])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of ratings, ordered

Starbucks : 303856
Domino's Pizza USA : 258624
OpenTable - Restaurant Reservations : 113936
Allrecipes Dinner Spinner : 109349
DoorDash - Food Delivery : 25947
UberEATS: Uber for Food Delivery : 17865
Postmates - Food Delivery, Faster : 9519
Dunkin' Donuts - Get Offers, Coupons & Rewards : 9068
Chick-fil-A : 5665
McDonald's : 4050
Deliveroo: Restaurant Delivery - Order Food Nearby : 1702
SONIC Drive-In : 1645
Nowait Guest : 1625
7-Eleven, Inc. : 1356
Outback : 805
Bon Appetit : 750
Starbucks Keyboard : 457
Whataburger : 197
Delish Eatmoji Keyboard : 154
Lieferheld - Delicious food delivery service : 29
Lieferando.de : 29
McDo France : 22
Chefkoch - Rezepte, Kochen, Backen & Kochbuch : 20
Youmiam : 9
Marmiton Twist : 2
Open Food Facts : 1


--> For Food & Drink, if we exclude the first two apps, because they belong to specific restaurant chains, we find apps for restaurant reservations, apps for deliveries and apps for recipes. This is interesting and we should have room to enter.

Looking at these genres and the conclusions we drew for each one, the app profiles that we suggest for the iOS market are:
- Reference: turn a well-known book into an app
- Book: think about a new reader
- Food & Drink: there are different ways to go

### 4.2.2. Most popular apps by genre, based on number of users - Android dataset

As we had already verified, the install numbers in the Android data set are not precise. In addition, they are stored as strings because they contain "," and "+". Using the `str.replace()` method, we can easily convert the install numbers to integers.
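
As a small illustration of that conversion, assuming the values look like `"1,000,000+"`, the two `str.replace()` calls can be chained before calling `int()`. The helper name `parse_installs` is hypothetical.

```python
def parse_installs(text):
    """Convert an install string such as "1,000,000+" into an integer."""
    return int(text.replace(",", "").replace("+", ""))

print(parse_installs("1,000,000+"))
print(parse_installs("500+"))
```
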

In [186]:
play_categ_dic= freq_table(play_en_free_cleaned,1)

table_display=[]
for categ in play_categ_dic:
    total=0
    len_categ=0
    for each in play_en_free_cleaned:
        # installs are stored as strings such as "1,000,000+"
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        categ_app=each[1]
        if categ_app==categ:
            total +=installers
            len_categ+=1
    average=total/len_categ
    tupples=(average, categ)
    table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17840110.40229885
PRODUCTIVITY : 16787331.344927534
GAME : 15588015.603248259
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10801391.298666667
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3697848.1731343283
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1833495.145631068
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315

--> On average, communication apps have the highest number of installs. Let us take a closer look at the COMMUNICATION category and see which apps it contains and how many installs each one has:

In [190]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'COMMUNICATION':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

WhatsApp Messenger : 1000000000
Skype - free IM & video calls : 1000000000
Messenger – Text and Video Chat for Free : 1000000000
Hangouts : 1000000000
Google Chrome: Fast & Secure : 1000000000
Gmail : 1000000000
imo free video calls and chat : 500000000
Viber Messenger : 500000000
UC Browser - Fast Download Private & Secure : 500000000
LINE: Free Calls & Messages : 500000000
Google Duo - High Quality Video Calls : 500000000
imo beta free calls and text : 100000000
Yahoo Mail – Stay Organized : 100000000
Who : 100000000
WeChat : 100000000
UC Browser Mini -Tiny Fast Private & Secure : 100000000
Truecaller: Caller ID, SMS spam blocking & Dialer : 100000000
Telegram : 100000000
Opera Mini - fast web browser : 100000000
Opera Browser: Fast and Secure : 100000000
Messenger Lite: Free Calls & Messages : 100000000
Kik : 100000000
KakaoTalk: Free Calls & Text : 100000000
GO SMS Pro - Messenger, Free Themes, Emoji : 100000000
Firefox Browser fast & private : 100000000
BBM - Free Calls & Messages

---> We see that the giants are WhatsApp, Skype, Messenger, Hangouts, Google Chrome and Gmail; there should not be room to enter this market.
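
Since the same listing is repeated for each category, it could be wrapped in a helper. The sketch below assumes the column positions used in the cells of this notebook (app name at index 0, category at index 1, installs at index 5). The function name `apps_by_installs` and the toy rows are hypothetical.

```python
def apps_by_installs(dataset, category, category_index=1, name_index=0, installs_index=5):
    """Return (installs, name) tuples for one category, largest first."""
    rows = []
    for row in dataset:
        if row[category_index] == category:
            installs = int(row[installs_index].replace("+", "").replace(",", ""))
            rows.append((installs, row[name_index]))
    return sorted(rows, reverse=True)

# toy rows with the same column layout (hypothetical values)
sample_rows = [
    ["App A", "COMMUNICATION", "", "", "", "1,000+"],
    ["App B", "SOCIAL", "", "", "", "5,000,000+"],
    ["App C", "COMMUNICATION", "", "", "", "10,000+"],
]
for installs, name in apps_by_installs(sample_rows, "COMMUNICATION"):
    print(name, ':', installs)
```
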

In [191]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'VIDEO_PLAYERS':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

YouTube : 1000000000
Google Play Movies & TV : 1000000000
MX Player : 500000000
VivaVideo - Video Editor & Photo Movie : 100000000
VideoShow-Video Editor, Video Maker, Beauty Camera : 100000000
VLC for Android : 100000000
Motorola Gallery : 100000000
Motorola FM Radio : 100000000
Dubsmash : 100000000
Vote for : 50000000
Vigo Video : 50000000
VMate : 50000000
Samsung Video Library : 50000000
Ringdroid : 50000000
MiniMovie - Free Video and Slideshow Editor : 50000000
LIKE – Magic Video Maker & Community : 50000000
KineMaster – Pro Video Editor : 50000000
HD Video Downloader : 2018 Best video mate : 50000000
DU Recorder – Screen Recorder, Video Editor, Live : 50000000
video player for android : 10000000
iMediaShare – Photos & Music : 10000000
YouTube Studio : 10000000
Video Player All Format : 10000000
Video Downloader - for Instagram Repost App : 10000000
Video Downloader : 10000000
Ustream : 10000000
Quik – Free Video Editor for photos, clips, music : 10000000
PowerDirector Video Editor

---> We see that the giants are YouTube, Google Play Movies & TV and then MX Player; there should not be room to enter this market.

In [192]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'SOCIAL':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

Instagram : 1000000000
Google+ : 1000000000
Facebook : 1000000000
Snapchat : 500000000
Facebook Lite : 500000000
VK : 100000000
Tumblr : 100000000
Tik Tok - including musical.ly : 100000000
Tango - Live Video Broadcast : 100000000
Pinterest : 100000000
LinkedIn : 100000000
Badoo - Free Chat & Dating App : 100000000
BIGO LIVE - Live Stream : 100000000
ooVoo Video Calls, Messaging & Stories : 50000000
Zello PTT Walkie Talkie : 50000000
SKOUT - Meet, Chat, Go Live : 50000000
POF Free Dating App : 50000000
MeetMe: Chat & Meet New People : 50000000
textPlus: Free Text & Calls : 10000000
magicApp Calling & Messaging : 10000000
YouNow: Live Stream Video Chat : 10000000
We Heart It : 10000000
Waplog - Free Chat, Dating App, Meet Singles : 10000000
TextNow - free text + calls : 10000000
Text free - Free Text + Call : 10000000
Text Me: Text Free, Call Free, Second Phone Number : 10000000
Tapatalk - 100,000+ Forums : 10000000
Tagged - Meet, Chat & Dating : 10000000
SayHi Chat, Meet New People : 1

---> We see that the giants are Instagram, Google+, Facebook and Snapchat; there should not be room to enter this market.
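
One way to check how strongly such giants inflate a category average is to recompute the average after leaving out apps above some install ceiling. This is only a sketch of that idea: the function name `average_installs`, the ceiling of 100,000,000 and the sample counts are assumptions chosen for illustration, not results from the data set.

```python
def average_installs(install_counts, ceiling=None):
    """Average the counts, optionally ignoring values at or above `ceiling`."""
    kept = [n for n in install_counts if ceiling is None or n < ceiling]
    if not kept:
        return 0.0
    return sum(kept) / len(kept)

# hypothetical install counts for one category
sample_installs = [1_000_000_000, 500_000_000, 10_000_000, 1_000_000]
print(average_installs(sample_installs))
print(average_installs(sample_installs, ceiling=100_000_000))
```

In this toy example the average drops from 377,750,000 to 5,500,000 once the two largest apps are excluded, which shows how a few very large apps can dominate the mean.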


In [195]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'PHOTOGRAPHY':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

Google Photos : 1000000000
Z Camera - Photo Editor, Beauty Selfie, Collage : 100000000
YouCam Perfect - Selfie Photo Editor : 100000000
YouCam Makeup - Magic Selfie Makeovers : 100000000
Sweet Selfie - selfie camera, beauty cam, photo edit : 100000000
S Photo Editor - Collage Maker , Photo Collage : 100000000
Retrica : 100000000
PicsArt Photo Studio: Collage Maker & Pic Editor : 100000000
PhotoGrid: Video & Pic Collage Maker, Photo Editor : 100000000
Photo Editor Pro : 100000000
Photo Editor Collage Maker Pro : 100000000
Photo Collage Editor : 100000000
LINE Camera - Photo editor : 100000000
Cymera Camera- Photo Editor, Filter,Collage,Layout : 100000000
Candy Camera - selfie, beauty camera, photo editor : 100000000
Camera360: Selfie Photo Editor with Funny Sticker : 100000000
BeautyPlus - Easy Photo Editor & Selfie Camera : 100000000
B612 - Beauty & Filter Camera : 100000000
AR effect : 100000000
Video Editor Music,Cut,No Crop : 50000000
VSCO : 50000000
Square InPic - Photo Editor & Co

--> We have one big giant, Google Photos, but then the market is split among many apps, which is not so interesting for us.

In [197]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'PRODUCTIVITY':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

Google Drive : 1000000000
Microsoft Word : 500000000
Google Calendar : 500000000
Dropbox : 500000000
Cloud Print : 500000000
WPS Office - Word, Docs, PDF, Note, Slide & Sheet : 100000000
SwiftKey Keyboard : 100000000
Samsung Notes : 100000000
Microsoft PowerPoint : 100000000
Microsoft Outlook : 100000000
Microsoft OneNote : 100000000
Microsoft OneDrive : 100000000
Microsoft Excel : 100000000
Google Slides : 100000000
Google Sheets : 100000000
Google Keep : 100000000
Google Docs : 100000000
Evernote – Organizer, Planner for Notes & Memos : 100000000
ES File Explorer File Manager : 100000000
ColorNote Notepad Notes : 100000000
CamScanner - Phone PDF Creator : 100000000
Adobe Acrobat Reader : 100000000
myAT&T : 50000000
Verizon Cloud : 50000000
QR Droid : 50000000
My Airtel-Online Recharge, Pay Bill, Wallet, UPI : 50000000
Mobizen Screen Recorder - Record, Capture, Edit : 50000000
MEGA : 50000000
File Browser by Astro (File Manager) : 50000000
Do It Later: Tasks & To-Dos : 50000000
Calcul

--> The top places are dominated by apps from Google and Microsoft, so it could be hard to enter...

In [196]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'GAME':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

Subway Surfers : 1000000000
Temple Run 2 : 500000000
Pou : 500000000
My Talking Tom : 500000000
Candy Crush Saga : 500000000
slither.io : 100000000
Zombie Tsunami : 100000000
Yes day : 100000000
Vector : 100000000
Trivia Crack : 100000000
Traffic Racer : 100000000
Temple Run : 100000000
Talking Tom Gold Run : 100000000
Super Mario Run : 100000000
Sonic Dash : 100000000
Sniper 3D Gun Shooter: Free Shooting Games - FPS : 100000000
Smash Hit : 100000000
Skater Boy : 100000000
Shadow Fight 2 : 100000000
Score! Hero : 100000000
Roll the Ball® - slide puzzle : 100000000
Pokémon GO : 100000000
Plants vs. Zombies FREE : 100000000
Piano Tiles 2™ : 100000000
PAC-MAN : 100000000
My Talking Angela : 100000000
Modern Combat 5: eSports FPS : 100000000
Mobile Legends: Bang Bang : 100000000
Lep's World 2 🍀🍀 : 100000000
Jetpack Joyride : 100000000
Hungry Shark Evolution : 100000000
Hill Climb Racing 2 : 100000000
Hill Climb Racing : 100000000
Helix Jump : 100000000
Glow Hockey : 100000000
Geometry Dash

Water Surfer Racing In Moto : 1000000
WDAMAGE: Car Crash Engine : 1000000
Voxel - 3D Color by Number & Pixel Coloring Book : 1000000
Vikings: an Archer's Journey : 1000000
V for Voodoo : 1000000
True Skateboarding Ride Skateboard Game Freestyle : 1000000
Traffic Sniper Counter Attack : 1000000
Tokyo Ghoul: Dark War : 1000000
The Walking Zombie: Dead City : 1000000
The Walking Dead: Our World : 1000000
The Visitor: Ep.2 - Sleepover Slaughter : 1000000
The Vikings : 1000000
The Grand Wars: San Andreas : 1000000
The Fish Master! : 1000000
Thai Sic Bo : 1000000
Texas Hold'em Poker : 1000000
TerraGenesis - Space Colony : 1000000
TAMAGO Monsters Returns : 1000000
Super Jim Jump - pixel 3d : 1000000
Super Dancer : 1000000
Sudoku Master : 1000000
Stickman Warriors Heroes 2 : 1000000
Speed Racing Ultimate 2 : 1000000
Space X: Sky Wars of Air Force : 1000000
Sonic CD Classic : 1000000
Soccer Clubs Logo Quiz : 1000000
SnowMobile Parking Adventure : 1000000
Slendrina X : 1000000
Skip-Bo™ Free : 10

Texas HoldEm Poker Deluxe (BR) : 10000
Super DK vs Kong Brother Advanced Free Classic : 10000
Sic Bo Rave : 10000
Sic Bo (Tai Xiu) - Multiplayer Casino : 10000
Sermon on Proverbs CH Spurgeon : 10000
Santa Panda Bubble Christmas : 10000
Saiyan Z: Super SSJ Ultimate Combat : 10000
Rock n Roll Music Quiz Game : 10000
Puzzle for CS:GO : 10000
Modern Sniper Strike: Best Commando Action 2k18 : 10000
Lust in Terror Manor - The Truth Unveiled | Otome : 10000
Korean Dungeon: K-Word 1000 : 10000
J-Stars Victory VS Guide : 10000
GALAK-Z: Variant Mobile : 10000
Five Nights at Flappy's : 10000
E.G. Chess Free : 10000
Dr.Slender Ep 1 Guide (Eng) : 10000
Dr Driving Racer : 10000
Dr Dre - Beatmaker : 10000
Dead Zombie Evil Killer:Axe : 10000
DM EVOLUTION : 10000
Counter Terrorist Gun Strike CS: Special Forces : 10000
Cards Casino:Video Poker & BJ : 10000
Cardi B Piano Game : 10000
Car Crash III Beam DH Real Damage Simulator 2018 : 10000
Bullshite! : 10000
Bullshit! (Free) : 10000
Blood Domination - BL

--> A very diverse category with a great many apps. It could mean that the market is a little bit saturated. Maybe we should be interested in other kinds of apps.

In [200]:
table_display=[]
for each in play_en_free_cleaned:
    if each[1] == 'BOOKS_AND_REFERENCE':
        installers=each[5]
        installers=installers.replace("+","")
        installers=installers.replace(",","")
        installers=int(installers)
        tupples=(installers, each[0])
        table_display.append(tupples)
table_sorted = sorted(table_display, reverse = True)
for entry in table_sorted:
    print(entry[1], ':', entry[0]) # print name and number of installs, in descending order

Google Play Books : 1000000000
Wattpad 📖 Free Books : 100000000
Bible : 100000000
Audiobooks from Audible : 100000000
Amazon Kindle : 100000000
Wikipedia : 10000000
Spanish English Translator : 10000000
Quran for Android : 10000000
Oxford Dictionary of English : Free : 10000000
NOOK: Read eBooks & Magazines : 10000000
Moon+ Reader : 10000000
JW Library : 10000000
HTC Help : 10000000
FBReader: Favorite Book Reader : 10000000
English Hindi Dictionary : 10000000
English Dictionary - Offline : 10000000
Dictionary.com: Find Definitions for English Words : 10000000
Dictionary - Merriam-Webster : 10000000
Dictionary : 10000000
Cool Reader : 10000000
Aldiko Book Reader : 10000000
Al-Quran (Free) : 10000000
Al'Quran Bahasa Indonesia : 10000000
Al Quran Indonesia : 10000000
Read books online : 5000000
English to Hindi Dictionary : 5000000
Ebook Reader : 5000000
Dictionary - WordWeb : 5000000
Bible KJV : 5000000
Ancestry : 5000000
AlReader -any text book reader : 5000000
Al Quran : EAlim - Transl

In [201]:
# Collect (installs, app name) pairs for the Food and Drink category
table_display = []
for each in play_en_free_cleaned:
    if each[1] == 'FOOD_AND_DRINK':
        # Convert an install string such as "1,000,000+" to an integer
        installers = each[5]
        installers = installers.replace("+", "")
        installers = installers.replace(",", "")
        installers = int(installers)
        tupples = (installers, each[0])
        table_display.append(tupples)
# Sort by number of installs, highest first
table_sorted = sorted(table_display, reverse=True)
for entry in table_sorted:
    print(entry[1], ':', entry[0])  # print app name and number of installs

foodpanda - Local Food Delivery : 10000000
Zomato - Restaurant Finder and Food Delivery App : 10000000
Uber Eats: Local Food Delivery : 10000000
Tastely : 10000000
Starbucks : 10000000
Pizza Hut : 10000000
McDonald's - McDonald's Japan : 10000000
McDonald's : 10000000
Foursquare City Guide : 10000000
Domino's Pizza USA : 10000000
Delivery yogi. : 10000000
Cookpad - FREE recipe search makes fun cooking · musical making! : 10000000
Cookpad : 10000000
TheFork - Restaurants booking and special offers : 5000000
Talabat: Food Delivery : 5000000
OpenTable: Restaurants Near Me : 5000000
Grubhub: Food Delivery : 5000000
Delivery trough - delivery trough delivery trough : 5000000
Delivery Club-food delivery: pizza, sushi, burger, salad : 5000000
Cookbook Recipes : 5000000
Chick-fil-A : 5000000
Chef - Recipes & Cooking : 5000000
Allrecipes Dinner Spinner : 5000000
hellofood - Food Delivery : 1000000
Yummly Recipes & Shopping List : 1000000
Wendy’s – Food and Offers : 1000000
Seamless Food Deliver

---> Checking Food and Drink and Books and Reference, we reach conclusions similar to those we drew from the iOS data set. These categories have demand, and each one offers a variety of directions we could pursue.
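The two cells above differ only in the category name, so the lookup could be wrapped in a small helper. This is a minimal sketch, not part of the original analysis: `installs_by_category` and `sample_rows` are hypothetical, and the column positions (name at index 0, category at index 1, installs at index 5) are assumed from the cells above.

```python
def installs_by_category(rows, category, name_idx=0, cat_idx=1, installs_idx=5):
    """Return (installs, name) pairs for one category, highest installs first."""
    pairs = []
    for row in rows:
        if row[cat_idx] == category:
            # Turn a string such as "1,000,000+" into the integer 1000000
            installs = int(row[installs_idx].replace("+", "").replace(",", ""))
            pairs.append((installs, row[name_idx]))
    return sorted(pairs, reverse=True)


# Hypothetical rows, only here to make the sketch runnable
sample_rows = [
    ["App A", "BOOKS_AND_REFERENCE", "", "", "", "1,000,000+"],
    ["App B", "FOOD_AND_DRINK", "", "", "", "5,000,000+"],
    ["App C", "BOOKS_AND_REFERENCE", "", "", "", "10,000,000+"],
]

for installs, name in installs_by_category(sample_rows, "BOOKS_AND_REFERENCE"):
    print(name, ":", installs)
```

Against the real data, the call would look like `installs_by_category(play_en_free_cleaned, 'FOOD_AND_DRINK')`, which avoids copying the loop for every category we want to inspect.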

## Conclusions:

In this project, we analyzed data about the App Store and Google Play mobile apps with the goal of recommending an app profile that can be profitable for both markets.

We concluded that there are two interesting areas to work in: 
- Books and reference: for example, taking a popular book (perhaps a more recent one) and turning it into an app could be profitable for both the Google Play and the App Store markets. We would need to introduce some kind of novelty, such as an audio version of the book, quizzes on the book, or a forum where people can discuss it.
- Food and drink: for example, apps for recipes, restaurant reservations, etc. Likewise, we would need to introduce some novelty, such as interactive features adapted to how people use apps today.