## Dataset
> * [A data set](https://www.kaggle.com/lava18/google-play-store-apps/home): Approx. ten thousand apps from Google Play
> * [A data set](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home): Approx. seven thousand iOS apps from App Store

In [1]:
from csv import reader

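### Google Play ###

# NOTE: 'mbcs' is the Windows ANSI code page and only exists on Windows.
# If the CSV files are UTF-8 encoded, encoding='utf-8' is the portable choice
# and avoids mis-decoded (garbled) characters in app names.
open_file = open('googleplaystore.csv', encoding='mbcs')
read_file = reader(open_file)
google = list(read_file)
open_file.close()
google_header = google[0]   # first row holds the column names
google = google[1:]


### App Store ###
open_file = open('AppleStore.csv', encoding='mbcs')
read_file = reader(open_file)
apple = list(read_file)
open_file.close()
apple_header = apple[0]     # first row holds the column names
apple = apple[1:]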
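
The two loading blocks above repeat the same steps, so they could be folded into one helper. The sketch below is a minimal version under the assumption that the files are UTF-8 encoded; `load_csv` is a hypothetical name that is not part of the original notebook. Reading with a different encoding can change how many non-ASCII characters each app name contains, so the counts reported later in this notebook may shift if it is used.

```python
from csv import reader


def load_csv(path, encoding="utf-8"):
    """Return (header, rows) for a CSV file. Hypothetical helper."""
    # newline="" is the setting the csv module documentation recommends;
    # the with-block closes the file even if reading fails.
    with open(path, encoding=encoding, newline="") as f:
        rows = list(reader(f))
    return rows[0], rows[1:]
```

Usage would look like `google_header, google = load_csv('googleplaystore.csv')`.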


## Data Exploring

---
* **Google Play**

In [2]:
print(google_header)
print(len(google_header))
print(len(google))

*The Google Play data has 10841 apps and 13 columns.*

---

* **App Store**

In [3]:
print(apple_header)
print(len(apple_header))
print(len(apple))

*The App Store data has 7197 apps and 16 columns.*

---

## Data Cleaning


### Rating


---
* **Google Play**

In [4]:
google_correct_rating=[]
google_incorrect_rating=[]

print(len(google))

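for x in google:
    # x[2] is the Rating column; valid ratings do not exceed 5.
    # Note that float('NaN') <= 5.0 is False, so any row whose rating is
    # 'NaN' (no rating) is also moved to the incorrect list.
    if float(x[2])<=5.0: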
        google_correct_rating.append(x)
    else:
        google_incorrect_rating.append(x)
        
google=google_correct_rating
print(len(google))
print(len(google_incorrect_rating))
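
One subtlety of the comparison above is that a missing rating and an out-of-range rating end up in the same list. The sketch below separates the two cases. It assumes missing ratings are stored as the text `NaN` or as an empty string; `split_by_rating` is a hypothetical helper, not part of the original notebook.

```python
import math


def split_by_rating(rows, rating_index, max_rating=5.0):
    """Split rows into (valid, out_of_range, missing) by their rating value."""
    valid, out_of_range, missing = [], [], []
    for row in rows:
        try:
            rating = float(row[rating_index])
        except ValueError:
            # Non-numeric text such as an empty string
            missing.append(row)
            continue
        if math.isnan(rating):
            # float('NaN') parses fine but compares False with everything
            missing.append(row)
        elif 0.0 <= rating <= max_rating:
            valid.append(row)
        else:
            out_of_range.append(row)
    return valid, out_of_range, missing
```

Whether unrated apps should be dropped or kept is an analysis decision; separating them only makes that decision explicit.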

---
* **App Store**


>   *column 'user_rating' = average user rating across all versions.*  
>   *column 'user_rating_ver' = average user rating for the current version.*
>    
>   **'user_rating_ver' is used to filter out rows with invalid ratings.**

In [5]:
apple_correct_rating=[]
apple_incorrect_rating=[]

print(len(apple))

for x in apple:
    if float(x[8])<=5.0:
        apple_correct_rating.append(x)
    else:
        apple_incorrect_rating.append(x)
        
apple=apple_correct_rating
print(len(apple))
print(len(apple_incorrect_rating))

---


### Removing Duplicate Entries


> Duplicate entries were found in the Google Play data.
>
> When the duplicated rows were examined, they differed mainly in the number of reviews. To resolve the duplicates, only the row with the highest number of reviews is kept for each app.

In [6]:
# figure out the max review count for each unique app name

google_review_max={}

for x in google:
    name=x[0]
    review_count=float(x[3])
    if name in google_review_max and google_review_max[name]<review_count:
        google_review_max[name]=review_count
    elif name not in google_review_max:
        google_review_max[name]=review_count
        


In [7]:
# removing the duplicate rows with the lower number of reviews
google_clean=[]
google_already_added=set()  # a set gives fast membership checks

for x in google:
    # keep the first row whose review count equals the maximum for that app
    if google_review_max[x[0]]==float(x[3]) and x[0] not in google_already_added:
        google_clean.append(x)
        google_already_added.add(x[0])
        
google=google_clean       
print(len(google))


8196
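
The two-step de-duplication above (find the maximum, then filter) can also be written as a single pass. This is a sketch, not the notebook's method: `keep_most_reviewed` is a hypothetical name, and it relies on dictionaries preserving insertion order (Python 3.7+). The kept rows are the same, but they come out ordered by the first appearance of each app name rather than by the position of the kept row.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the row with the highest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        reviews = float(row[reviews_index])
        # Strict '>' means the first row seen wins when review counts tie
        if name not in best or reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())
```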



### Removing Non-English Apps
> Remove apps with non-English names.
>
> The characters commonly used in English text belong to the ASCII standard, in which each character corresponds to a number between 0 and 127.
>
> *To keep English-named apps that contain a few non-ASCII characters (emoji, symbols), an app is filtered out only if its name has more than 3 non-ASCII characters.*

In [8]:
def is_english(string):
    non_ascii=0
    
    for x in string:
        if ord(x)>127:
            non_ascii+=1
            
    if non_ascii>3:
        return False
    
    else:
        return True
    

    
google_english=[]
apple_english=[]



for x in google:
    if is_english(x[0]):
        google_english.append(x)

for x in apple:
    if is_english(x[1]):
        apple_english.append(x)
        
google=google_english
apple=apple_english

print(len(google))
print(len(apple))

8080
6100


### Isolating the Free Apps

In [9]:
google_free=[]
apple_free=[]


for x in google:
    if x[7]=='0':
        google_free.append(x)
        
for x in apple:
    if float(x[4])==0.0:
        apple_free.append(x)
        
google=google_free
apple=apple_free

print(len(google))
print(len(apple))

7488
3169
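
The Google Play check above compares the price text with the exact string `'0'`, while the App Store check converts the price to a number. A sketch of a numeric check that works for both is shown below. It assumes paid Google Play prices are written like `$4.99`; `parse_price` and `is_free` are hypothetical helpers, not part of the original notebook.

```python
def parse_price(text):
    """Convert a price string such as '0', '0.0' or '$4.99' to a float."""
    # Assumption: the only non-numeric decoration is a leading dollar sign
    return float(text.strip().lstrip("$"))


def is_free(text):
    """Return True when the parsed price is zero."""
    return parse_price(text) == 0.0
```

With these helpers, both loops could use the same condition, for example `is_free(x[7])` for Google Play and `is_free(x[4])` for the App Store.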


## Data Analysis

### Most Common App Genre

In [28]:
# generate frequency table with % and display the percentages in a descending order
def freq (dataset, index):
    table= {}
    total = 0
    
    for x in dataset:
        total+=1
        if x[index] in table:
            table[x[index]]+=1
        else:
            table[x[index]]=1
            
    percentage={}
    for x in table:
        percentage[x]=(table[x]/total)*100
    
    return percentage

#use sorted function to sort the tuple
def sort_freq(dataset,index):
    display=[]
    percentage_table=freq(dataset,index)
    for x in percentage_table:
        tuple_formed=(percentage_table[x],x)
        display.append(tuple_formed)
        
    table_sorted=sorted(display, reverse=True)
    
    for x in table_sorted:
        print(x[1],":",x[0])
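
The same frequency table can be sketched with `collections.Counter` from the standard library. The names `freq_table` and `sorted_freq` are hypothetical and chosen to avoid clashing with the functions above. One difference from the original: ties in percentage keep their first-seen order here, whereas the tuple sort above breaks ties by name in reverse alphabetical order.

```python
from collections import Counter


def freq_table(dataset, index):
    """Map each value in column `index` to its share of rows, in percent."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    return {key: count / total * 100 for key, count in counts.items()}


def sorted_freq(dataset, index):
    """Return (value, percentage) pairs, highest percentage first."""
    table = freq_table(dataset, index)
    return sorted(table.items(), key=lambda item: item[1], reverse=True)
```

Returning the pairs instead of printing them makes the result reusable, for example `for name, pct in sorted_freq(apple, -5): print(name, ":", pct)`.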



---
* **App Store**

In [30]:
sort_freq(apple,-5)

Games : 58.53581571473651
Entertainment : 7.82581255916693
Photo & Video : 5.0489113284947935
Education : 3.72357210476491
Social Networking : 3.2817923635216157
Shopping : 2.5244556642473968
Utilities : 2.398232881035027
Sports : 2.1773430104133795
Music : 2.0511202272010096
Health & Fitness : 1.9880088355948247
Productivity : 1.7040075733669928
Lifestyle : 1.5462290943515304
News : 1.3253392237298833
Travel : 1.1360050489113285
Finance : 1.1044493531082362
Weather : 0.8520037866834964
Food & Drink : 0.8204480908804039
Reference : 0.5364468286525718
Business : 0.5364468286525718
Book : 0.3786683496371095
Navigation : 0.18933417481855475
Medical : 0.18933417481855475
Catalogs : 0.12622278321236985


The App Store's free English app market is dominated by apps designed for fun (games, entertainment, photo and video, social networking, sports, music, etc.). However, the fact that fun apps are the most numerous does not imply that they also have the greatest number of users. What it does suggest is that **the for-fun app market in the App Store might be somewhat saturated.**

---
* **Google Play**

In [31]:
# google play data had both genre and category columns


#Category 
sort_freq(google,1)

FAMILY : 19.618055555555554
GAME : 10.790598290598291
TOOLS : 8.693910256410255
FINANCE : 3.8327991452991457
PRODUCTIVITY : 3.766025641025641
LIFESTYLE : 3.6992521367521367
BUSINESS : 3.378739316239316
PHOTOGRAPHY : 3.311965811965812
SPORTS : 3.1517094017094016
HEALTH_AND_FITNESS : 3.111645299145299
COMMUNICATION : 3.111645299145299
PERSONALIZATION : 3.0715811965811963
MEDICAL : 3.018162393162393
SOCIAL : 2.6308760683760686
NEWS_AND_MAGAZINES : 2.6308760683760686
TRAVEL_AND_LOCAL : 2.363782051282051
SHOPPING : 2.3504273504273505
BOOKS_AND_REFERENCE : 2.110042735042735
VIDEO_PLAYERS : 1.9230769230769231
DATING : 1.7227564102564104
MAPS_AND_NAVIGATION : 1.469017094017094
EDUCATION : 1.3621794871794872
FOOD_AND_DRINK : 1.215277777777778
ENTERTAINMENT : 1.1217948717948718
AUTO_AND_VEHICLES : 0.9481837606837606
WEATHER : 0.8413461538461539
LIBRARIES_AND_DEMO : 0.8012820512820512
HOUSE_AND_HOME : 0.7745726495726496
ART_AND_DESIGN : 0.734508547008547
COMICS : 0.6677350427350427
PARENTING : 0.

In [32]:
#Genre 
sort_freq(google,-4)

Tools : 8.680555555555555
Entertainment : 6.022970085470085
Education : 5.435363247863248
Finance : 3.8327991452991457
Productivity : 3.766025641025641
Lifestyle : 3.685897435897436
Action : 3.5389957264957266
Business : 3.378739316239316
Photography : 3.311965811965812
Sports : 3.2318376068376073
Health & Fitness : 3.111645299145299
Communication : 3.111645299145299
Personalization : 3.0715811965811963
Medical : 3.018162393162393
Social : 2.6308760683760686
News & Magazines : 2.6308760683760686
Travel & Local : 2.3504273504273505
Shopping : 2.3504273504273505
Simulation : 2.323717948717949
Books & Reference : 2.110042735042735
Arcade : 2.0165598290598292
Casual : 1.9497863247863247
Video Players & Editors : 1.8963675213675213
Dating : 1.7227564102564104
Maps & Navigation : 1.469017094017094
Food & Drink : 1.215277777777778
Racing : 1.1217948717948718
Puzzle : 1.108440170940171
Role Playing : 1.0683760683760684
Strategy : 1.0283119658119657
Auto & Vehicles : 0.9481837606837606
Weather 

Practical apps (family, tools, business, lifestyle, productivity, etc.) seem to be better represented on Google Play than on the App Store.

The difference between the *Genre* and *Category* columns is not clear. However, the *Category* column is more general (it gives the bigger picture), so it is used for the analysis.

### Most Popular App Genre

> Popularity is determined by the average number of installs per app in each genre.

---
* **App Store**
> This data set does not include the number of installs.
>
> As a workaround, the total number of user ratings is used as a proxy, which should be sufficient to identify the most popular genres.

In [33]:
genre_apple = freq(apple, -5)


for y in genre_apple:
    total_review_for_the_genre = 0
    genre_app_count = 0 
    for x in apple:
        if x[-5]==y:
            total_review_for_the_genre+=float(x[5])
            genre_app_count+=1
    print(y,":", total_review_for_the_genre/genre_app_count)

Productivity : 21799.14814814815
Weather : 54215.2962962963
Shopping : 27816.2
Reference : 79350.4705882353
Finance : 32367.02857142857
Music : 58205.03076923077
Utilities : 19900.473684210527
Travel : 31358.5
Social Networking : 72916.54807692308
Sports : 23008.898550724636
Health & Fitness : 24037.634920634922
Games : 22985.211320754715
Food & Drink : 33333.92307692308
News : 21750.071428571428
Book : 46384.916666666664
Photo & Video : 28441.54375
Entertainment : 14364.774193548386
Business : 7491.117647058823
Lifestyle : 16739.34693877551
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


**Genres that are popular but do not fit our purpose or resources should be eliminated:**
* Weather apps - people do not spend much time in them, so they are unlikely to be profitable through in-app advertising. It is also hard to provide reliable live weather data.
* Food and drink - the popular apps in this genre come from big brands such as Starbucks and Dunkin' Donuts.
* Finance apps - the popular apps in this genre involve highly technical functions such as banking, bill payment, and money transfers.


In [36]:
for x in apple:
    if x[-5]=="Navigation":
        print(x[1],":",x[5])
       

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Geocaching® : 12811
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5
CoPilot GPS – Car Navigation & Offline Maps : 3582
Google Maps - Navigation & Transit : 154911


The average number of ratings seems to be skewed by a very small number of apps, for example Waze and Google Maps in Navigation, or Facebook, Pinterest, and Skype in Social Networking. When the aim is to find popular genres, a genre with a few giants can appear more popular than it really is.
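
One way to see how strongly a few giants pull the average is to compare the mean with the median, which is far less sensitive to extreme values. The sketch below uses the six Navigation rating counts printed above. The median is offered here as an illustration, not as the measure the notebook uses.

```python
from statistics import mean, median

# Rating counts of the six Navigation apps listed above
navigation_ratings = [345046, 12811, 187, 5, 3582, 154911]

navigation_mean = mean(navigation_ratings)      # about 86090.33, as reported above
navigation_median = median(navigation_ratings)  # 8196.5

print(navigation_mean)
print(navigation_median)
```

The mean is roughly ten times the median, which reflects how much Waze and Google Maps dominate the genre.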

----
* **Google Play**

> The Google Play data has a number-of-installs column. The values are not precise counts but open-ended ranges such as 1,000,000,000+, 500,000,000+, and 10,000,000+. However, for the purpose of identifying popular genres, taking the numbers as they are (treating 100,000+ as 100,000) should be sufficient.


In [41]:
genre_google = freq(google,1)

for y in genre_google:
    total_install_for_the_genre=0
    genre_app_count=0
    
    for x in google:
        if x[1]==y:
            n_installs=x[5]
            n_installs= n_installs.replace(',','')
            n_installs= n_installs.replace('+','')
            total_install_for_the_genre+=float(n_installs)
            genre_app_count+=1
    
    print(y,":", total_install_for_the_genre/genre_app_count)

ART_AND_DESIGN : 2058474.5454545454
AUTO_AND_VEHICLES : 746194.3661971831
BEAUTY : 640861.9047619047
BOOKS_AND_REFERENCE : 9909550.664556962
BUSINESS : 2753974.1501976284
COMICS : 876222.0
COMMUNICATION : 47153997.98283262
DATING : 1088343.488372093
EDUCATION : 1850490.1960784313
ENTERTAINMENT : 11767380.952380951
EVENTS : 354431.3333333333
FINANCE : 1550929.6167247386
FOOD_AND_DRINK : 2314480.769230769
HEALTH_AND_FITNESS : 4907867.896995708
HOUSE_AND_HOME : 1646068.9655172413
LIBRARIES_AND_DEMO : 839716.6666666666
LIFESTYLE : 1792061.4079422383
GAME : 16303715.909653466
FAMILY : 4196324.206943499
MEDICAL : 166567.15044247787
SOCIAL : 27826545.558375634
SHOPPING : 7950633.181818182
PHOTOGRAPHY : 18775260.524193548
SPORTS : 4640371.461864407
TRAVEL_AND_LOCAL : 16354046.892655367
TOOLS : 12439973.394777266
PERSONALIZATION : 6561236.565217392
PRODUCTIVITY : 20537621.879432622
PARENTING : 664261.0869565217
WEATHER : 5709285.714285715
VIDEO_PLAYERS : 27296709.722222224
NEWS_AND_MAGAZINES : 

The same skew appears in Google Play as in the App Store. A few very popular apps inflate the apparent popularity of some categories, such as communication, video players, social, photography, and productivity.
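
The install-count cleaning used in the loop above can be factored into a small helper, and the per-category average can then be computed in one pass over the data instead of one pass per category. This is a sketch: `parse_installs` and `average_installs_by_category` are hypothetical names, and the default column positions (1 for Category, 5 for Installs) follow the indices used in this notebook.

```python
from collections import defaultdict


def parse_installs(text):
    """Turn an install range such as '1,000,000+' into its lower bound."""
    return int(text.replace(",", "").replace("+", ""))


def average_installs_by_category(rows, category_index=1, installs_index=5):
    """Return {category: average installs per app} in a single pass."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        category = row[category_index]
        totals[category] += parse_installs(row[installs_index])
        counts[category] += 1
    return {category: totals[category] / counts[category] for category in totals}
```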

In [42]:
#Understand the skewness of the Communication category


# apps with 100 million or more installs
for x in google:
    if x[1]=="COMMUNICATION" and (x[5]=="1,000,000,000+" or x[5]=='500,000,000+' or x[5]=="100,000,000+"):
        print(x[0],":",x[5])

WhatsApp Messenger : 1,000,000,000+
imo beta free calls and text : 100,000,000+
Android Messages : 100,000,000+
Google Duo - High Quality Video Calls : 500,000,000+
Messenger – Text and Video Chat for Free : 1,000,000,000+
imo free video calls and chat : 500,000,000+
Skype - free IM & video calls : 1,000,000,000+
Who : 100,000,000+
GO SMS Pro - Messenger, Free Themes, Emoji : 100,000,000+
LINE: Free Calls & Messages : 500,000,000+
Google Chrome: Fast & Secure : 1,000,000,000+
Firefox Browser fast & private : 100,000,000+
UC Browser - Fast Download Private & Secure : 500,000,000+
Gmail : 1,000,000,000+
Hangouts : 1,000,000,000+
Messenger Lite: Free Calls & Messages : 100,000,000+
Kik : 100,000,000+
KakaoTalk: Free Calls & Text : 100,000,000+
Opera Mini - fast web browser : 100,000,000+
Opera Browser: Fast and Secure : 100,000,000+
Telegram : 100,000,000+
Truecaller: Caller ID, SMS spam blocking & Dialer : 100,000,000+
UC Browser Mini -Tiny Fast Private & Secure : 100,000,000+
Viber Me

In [46]:
# removing these few overly popular apps brings the average number of installs down considerably


under_100m=[]
for x in google:
    n_installs=x[5]
    n_installs= n_installs.replace(',','')
    n_installs= n_installs.replace('+','')
    if(x[1]=="COMMUNICATION") and (float(n_installs)<100000000):
        under_100m.append(float(n_installs))
    

        
sum(under_100m)/len(under_100m)

4305250.145631068

----


**After eliminating some of the popular genres (overly saturated genres, overly technical genres, etc.), one fairly popular genre remains in both the App Store and Google Play that seems penetrable.**

### Books and Reference Genre



---
* **App Store**

In [51]:
for x in apple:
    if x[-5]=="Reference":
        print(x[1],":",x[5])

Bible : 985920
Dictionary.com Dictionary & Thesaurus : 200047
Dictionary.com Dictionary & Thesaurus for iPad : 54175
Muslim Pro: Ramadan 2017 Prayer Times, Azan, Quran : 18418
Merriam-Webster Dictionary : 16849
Google Translate : 26786
Night Sky : 12122
WWDC : 762
Jishokun-Japanese English Dictionary & Translator : 0
VPN Express : 14
New Furniture Mods - Pocket Wiki & Game Tools for Minecraft PC Edition : 17588
LUCKY BLOCK MOD ™ for Minecraft PC Edition - The Best Pocket Wiki & Mods Installer Tools : 4693
Guides for Pokémon GO - Pokemon GO News and Cheats : 826
Horror Maps for Minecraft PE - Download The Scariest Maps for Minecraft Pocket Edition (MCPE) Free : 718
City Maps for Minecraft PE - The Best Maps for Minecraft Pocket Edition (MCPE) : 8535
GUNS MODS for Minecraft PC Edition - Mods Tools : 1497
Real Bike Traffic Rider Virtual Reality Glasses : 8


 ----
* **Google Play**

In [52]:
for x in google:
    if x[1]=="BOOKS_AND_REFERENCE":
        print(x[0],":",x[5])

E-Book Read - Read Book for free : 50,000+
Download free book with green book : 100,000+
Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Free Panda Radio Music : 100,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
English Grammar Complete Handbook : 500,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
Google Play Books : 1,000,000,000+
AlReader -any text book reader : 5,000,000+
Offline English Dictionary : 100,000+
Offline: English to Tagalog Dictionary : 500,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
Recipes of Prophetic Medicine for free : 500,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Litnet - E-books : 100,000+
Read books online : 5,000,000+
English to Urdu Dictionary : 500,000+
eBoox: book reader fb2 epub zip : 1,000,000+
English Persian Dictionary : 500,000+
Flybook : 500,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
English translation from Bengali : 1

----

Both app markets are skewed by a small number of extremely popular apps. However, let's try to get some app ideas from the apps that sit somewhere in the middle in terms of popularity.



In [55]:
for x in google:
    if x[1]=="BOOKS_AND_REFERENCE" and (x[5]=="1,000,000+" or x[5]=="5,000,000+" or x[5] == "10,000,000+" or x[5]=="50,000,000+"):
        print(x[0],":",x[5])

Wikipedia : 10,000,000+
Cool Reader : 10,000,000+
Book store : 1,000,000+
FBReader: Favorite Book Reader : 10,000,000+
Free Books - Spirit Fanfiction and Stories : 1,000,000+
AlReader -any text book reader : 5,000,000+
FamilySearch Tree : 1,000,000+
Cloud of Books : 1,000,000+
ReadEra – free ebook reader : 1,000,000+
Ebook Reader : 5,000,000+
Read books online : 5,000,000+
eBoox: book reader fb2 epub zip : 1,000,000+
All Maths Formulas : 1,000,000+
Ancestry : 5,000,000+
HTC Help : 10,000,000+
Moon+ Reader : 10,000,000+
English-Myanmar Dictionary : 1,000,000+
Golden Dictionary (EN-AR) : 1,000,000+
All Language Translator Free : 1,000,000+
Aldiko Book Reader : 10,000,000+
Dictionary - WordWeb : 5,000,000+
50000 Free eBooks & Free AudioBooks : 5,000,000+
Al-Quran (Free) : 10,000,000+
Al Quran Indonesia : 10,000,000+
Al'Quran Bahasa Indonesia : 10,000,000+
Al Quran Al karim : 1,000,000+
Al Quran : EAlim - Translations & MP3 Offline : 5,000,000+
Koran Read &MP3 30 Juz Offline : 1,000,000+

In this middle market, there seem to be three kinds of apps: 1. software for reading ebooks, 2. dictionaries and libraries, and 3. popular books (the Quran).
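
The middle-market filter above lists each install bucket as a string. A sketch of the same selection as a numeric range is shown below, which avoids missing a bucket by accident. `parse_installs` and `apps_in_install_range` are hypothetical helpers, and the bounds of 1,000,000 and 50,000,000 mirror the buckets chosen above.

```python
def parse_installs(text):
    """Turn an install range such as '1,000,000+' into its lower bound."""
    return int(text.replace(",", "").replace("+", ""))


def apps_in_install_range(rows, category, low, high,
                          category_index=1, installs_index=5):
    """Return rows of `category` whose install count lies in [low, high]."""
    return [
        row for row in rows
        if row[category_index] == category
        and low <= parse_installs(row[installs_index]) <= high
    ]
```

For example, `apps_in_install_range(google, "BOOKS_AND_REFERENCE", 1_000_000, 50_000_000)` would select the same buckets as the string comparison above.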


### Conclusion


In this project, App Store and Google Play apps were analyzed to find an app profile that can be profitable through in-app advertising.

The conclusion is that, among the popular app genres, one should avoid the saturated and highly technical ones and pick a penetrable genre such as Books and Reference, then implement new features that differentiate the app from the others. For example, take a popular free book and add special features such as daily quotes, quizzes on the book, or a discussion forum.