## Profitable App Profiles for the App Store and Google Play Markets

Our aim in this project is to find mobile app profiles that are profitable for the App Store and Google Play markets. We are working as data analysts for a company, and our job is to help our team of developers make data-driven decisions.

Our apps are free, and our main source of revenue is in-app ads. This means that the number of users of an app largely determines the revenue it brings in. Our goal in this project is to analyze data to help our developers understand which kinds of apps are more attractive to users.

## Opening and Exploring the Data


As of September 2018, there were approximately 2 million iOS apps available on the App Store and 2.1 million Android apps on Google Play.

![img](https://s3.amazonaws.com/dq-content/350/py1m8_statista.png) Source: [Statista](https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/) 

Collecting data for over 4 million apps requires a significant amount of time and money, so we'll analyze a sample of the data instead. To avoid spending resources on collecting new data ourselves, we first check whether relevant existing data is available at no cost. Two datasets fit our purpose:


- [dataset](https://www.kaggle.com/lava18/google-play-store-apps) containing data about approximately 10,000 Android apps from Google Play; the data was collected in August 2018. 

- [dataset](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps) containing data about approximately 7,000 iOS apps from the App Store; the data was collected in July 2017.

Let's start by opening the two datasets.

In [1]:

from csv import reader

# Google Play dataset
open_file = open("googleplaystore.csv", encoding="utf8")
read_file = reader(open_file)
android = list(read_file)
android_header = android[0]
android = android[1:]

# App Store dataset
open_file = open("AppleStore.csv", encoding="utf8")
read_file = reader(open_file)
ios = list(read_file)
ios_header = ios[0]
ios = ios[1:]
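
Since both files are loaded the same way, the two blocks above could be folded into one helper. The following is a minimal sketch, not part of the original notebook: the function name `open_dataset` is hypothetical, and it assumes both CSV files are UTF-8 encoded.

```python
from csv import reader

def open_dataset(file_name, has_header=True):
    """Hypothetical helper: read a CSV file and return (header, rows)."""
    # encoding='utf8' is an assumption about how the files are stored
    with open(file_name, encoding='utf8', newline='') as opened_file:
        data = list(reader(opened_file))
    if has_header:
        return data[0], data[1:]
    return [], data

# Usage, assuming the two files sit next to the notebook:
# android_header, android = open_dataset('googleplaystore.csv')
# ios_header, ios = open_dataset('AppleStore.csv')
```

Using `with` closes each file as soon as it has been read, which the manual `open()` calls do not do.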

To make the data easier to explore, we created a function named <mark>explore_data()</mark> that we can use repeatedly to print rows in a readable way.

In [2]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice=dataset[start:end]
    for row in dataset_slice:
        print(row)
        print("\n")
    if rows_and_columns:
        print("Number of rows:",len(dataset))
        print('Number of columns',len(dataset[0]))
    

In [3]:
print(android_header)
print('\n')
explore_data(android,0,3,True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 10841
Number of columns 13


We see that the Google Play dataset has 10,841 apps and 13 columns. At a quick glance, the columns that might be useful for our analysis are App, Reviews, Type, Genres, Category, and Installs.

Now let's look at the iOS data.

In [4]:
print(ios_header)
print('\n')
explore_data(ios,0,3,True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 7197
Number of columns 16


We can see that we have 7,197 iOS apps, and the columns that look interesting are track_name, price, currency, rating_count_ver, and cont_rating.

## Deleting Wrong Data

The Google Play dataset has a dedicated [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion), and we can see that [one of the discussions](https://www.kaggle.com/datasets/lava18/google-play-store-apps/discussion/66015) describes an error for a certain row.

In [5]:
print(android[10472])#incorrect row
print(android_header)#header
print(android[0])# correct row

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


The row is missing its Category value, so every later column is shifted one place to the left: the Rating column holds 19, while the maximum rating on Google Play is 5. We will delete this row.

In [6]:
print(len(android))
del(android[10472])
print(len(android))

10841
10840
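
A shifted row like this one can also be found programmatically instead of through the discussion forum, because a row with a missing field is shorter than the header. The following is a minimal sketch on made-up sample rows; the helper name `find_malformed_rows` is hypothetical.

```python
def find_malformed_rows(dataset, header):
    """Return the indexes of rows whose length differs from the header's."""
    return [index for index, row in enumerate(dataset) if len(row) != len(header)]

# Made-up sample data, for illustration only
sample_header = ['App', 'Category', 'Rating']
sample_rows = [
    ['App A', 'TOOLS', '4.1'],
    ['App B', '4.5'],            # 'Category' is missing, so the row is too short
    ['App C', 'GAME', '3.9'],
]
print(find_malformed_rows(sample_rows, sample_header))  # [1]
```

When several rows need to be removed, deleting from the highest index down avoids shifting the indexes of the rows still to be deleted.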


##  Removing Duplicate Entries

### Part one

If you explore the Google Play data set long enough or look at the discussions section, you'll notice some apps have duplicate entries. For instance, Instagram has four entries:

In [7]:
for app in android:
    if app[0] =='Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']




In total, there are 1,181 cases where an app occurs more than once:


In [8]:
duplicate_apps=[]
unique_apps=[]

for app in android:
    name= app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])

Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
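
Checking `name in unique_apps` against a list rescans the list for every row. The same count can be obtained in a single pass with `collections.Counter` from the standard library. The following is a minimal sketch on made-up names; the helper name `count_duplicates` is hypothetical.

```python
from collections import Counter

def count_duplicates(names):
    """Count the extra occurrences of each name beyond its first."""
    counts = Counter(names)
    return sum(count - 1 for count in counts.values())

# Made-up sample: 'Box' appears 3 times, 'Slack' twice, 'Instagram' once
sample_names = ['Box', 'Slack', 'Box', 'Instagram', 'Box', 'Slack']
print(count_duplicates(sample_names))  # 3
```

Calling `count_duplicates(app[0] for app in android)` counts repeats the same way as the loop above, so it should report the same number.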


We don't want to count certain apps more than once when we analyze data, so we need to remove the duplicate entries and keep only one entry per app. One thing we could do is remove the duplicate rows randomly, but we could probably find a better way.

If you examine the rows we printed two cells above for the Instagram app, the main difference happens on the fourth position of each row, which corresponds to the number of reviews. The different numbers show that the data was collected at different times. We can use this to build a criterion for keeping rows. We won't remove rows randomly, but rather we'll keep the rows that have the highest number of reviews because the higher the number of reviews, the more reliable the ratings.

To do that, we will:

- Create a dictionary where each key is a unique app name, and the value is the highest number of reviews of that app
- Use the dictionary to create a new data set, which will have only one entry per app (and we only select the apps with the highest number of reviews)



### Part Two

Let's start by building the dictionary.

In [9]:
reviews_max={}
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    # keep the largest review count seen so far for each app name
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name]=n_reviews
    elif name not in reviews_max:
        reviews_max[name]=n_reviews
        
        

In a previous code cell, we found that there are 1,181 cases where an app occurs more than once, so the length of our dictionary (of unique apps) should be equal to the difference between the length of our data set and 1,181.

In [10]:
print("Expected length",len(android)-1181)
print("Actual length",len(reviews_max))

Expected length 9659
Actual length 9659


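Now, let's use the reviews_max dictionary to remove the duplicates. For the duplicate cases, we'll only keep the entries with the highest number of reviews. In the code cell below, we loop through the data set and append a row to android_clean only when its review count matches the maximum stored for that name. We also record each added name in already_added, so that when two rows tie on the maximum only the first one is kept.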

In [11]:
android_clean=[]
already_added=[]
for app in android:
    name=app[0]
    n_reviews=float(app[3])
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(app)
        already_added.append(name)

In [12]:
explore_data(android_clean, 0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns 13
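
The dictionary and the filtering loop can also be combined into a single pass that stores the best row seen so far for each name. The following is a minimal sketch, not the notebook's method: the helper name `keep_most_reviewed` is hypothetical, and the sample rows are shortened for illustration (the Instagram review counts come from the output shown earlier, and the 'Box' row is made up).

```python
def keep_most_reviewed(dataset, name_index=0, reviews_index=3):
    """Hypothetical helper: keep one row per app name, the one with most reviews."""
    best_rows = {}
    for row in dataset:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        # Strict '>' keeps the first row when two rows tie on review count
        if name not in best_rows or n_reviews > float(best_rows[name][reviews_index]):
            best_rows[name] = row
    return list(best_rows.values())

sample_rows = [
    ['Instagram', 'SOCIAL', '4.5', '66577313'],
    ['Box', 'BUSINESS', '4.0', '100'],   # made-up row
    ['Instagram', 'SOCIAL', '4.5', '66577446'],
    ['Instagram', 'SOCIAL', '4.5', '66509917'],
]
for row in keep_most_reviewed(sample_rows):
    print(row)
```

Rows come back in the order in which each name first appears, which can differ from the row order produced by the two-step version.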


## Removing Non-English Apps

### Part One

If we explore the data long enough, we'll find that both datasets include apps whose names suggest they are not aimed at an English-speaking audience. Here are two examples from the App Store dataset:

In [13]:
print(ios[813][1])
print(ios[6731][1])

爱奇艺PPS -《欢乐颂2》电视剧热播
【脱出ゲーム】絶対に最後までプレイしないで 〜謎解き＆ブロックパズル〜


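We're not interested in keeping these apps, so we'll remove them. The numbers corresponding to the characters we commonly use in English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. We can get the number for a character with the built-in ord() function: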

In [14]:
print(ord('a'))

97


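A first version of the function rejects any name that contains a character outside this range. That turns out to be too strict, because English app names can contain emojis or symbols such as ™, so the edited version further down only rejects names with more than three such characters.

def check_name(name): #before editing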
    for i in name:
        if(ord(i)>127):
            return False
    return True

In [15]:
def check_name(name): #after editing
    x=0
    for i in name:
        if(ord(i)>127):
            x+=1
    if x>3:
        return False
    else:
        return True

In [16]:
print(check_name('Instagram'))
print(check_name('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(check_name('Docs To Go™ Free Office Suite'))
print(check_name('Instachat 😜'))

True
False
True
True


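The check_name() function returns True when a name contains at most three characters outside the ASCII range and False otherwise. It is not a perfect test of whether a name is English: a few non-English apps will still pass the filter, and English names with more than three emojis or symbols will be dropped, so there is room for improvement.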
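
One possible refinement, which is our assumption rather than something this notebook uses, is to look at the share of non-ASCII characters in a name instead of a fixed count. In the sketch below, the function name `is_mostly_ascii` and the 0.3 cutoff are both hypothetical, and the cutoff is arbitrary.

```python
def is_mostly_ascii(name, max_share=0.3):
    """Hypothetical alternative to check_name(): True when at most
    max_share of the characters fall outside the ASCII range (0-127)."""
    if not name:
        return True
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii / len(name) <= max_share

print(is_mostly_ascii('Instachat 😜'))                  # True: 1 of 11 characters
print(is_mostly_ascii('爱奇艺PPS -《欢乐颂2》电视剧热播'))  # False
print(is_mostly_ascii('中文'))                          # False: 2 of 2 characters
```

The last example shows the difference from a fixed count: a short non-English name with only two non-ASCII characters passes the more-than-three rule but fails the share-based one.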

Next, we use check_name() to filter out non-English apps from both datasets.

In [17]:
english_andorid=[]
for app in android_clean:
    if  check_name(app[0]):
        english_andorid.append(app)
explore_data(english_andorid,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9614
Number of columns 13


In [18]:
english_ios=[]
for app in ios:
    if  check_name(app[1]):
        english_ios.append(app)
explore_data(english_ios,0,3,True)

['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


Number of rows: 6183
Number of columns 16


## Isolating the Free Apps

In [19]:
free_android_apps=[]
for app in english_andorid:
    if app[6]=="Free":
        free_android_apps.append(app)
print(len(free_android_apps))

8861


In [20]:
free_ios_apps=[]
for app in english_ios:
    if float(app[4]) ==0.0:
        free_ios_apps.append(app)
print(len(free_ios_apps))

3222


## Most Common Apps by Genre

### Part One

As we mentioned in the introduction, our goal is to determine the kinds of apps that are likely to attract more users because the number of people using our apps affects our revenue.

To minimize risks and overhead, our validation strategy for an app idea has three steps:

-    Build a minimal Android version of the app, and add it to Google Play.
-    If the app has a good response from users, we develop it further.
-    If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.


In [21]:
print(android_header)
print(ios_header)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


We will use the prime_genre column from the iOS dataset and the Category and Genres columns from the Android dataset to build frequency tables.

### Part Two

In [22]:
def freq_table(dataset, index):
    table = {}
    total = 0
    
    for row in dataset:
        total += 1
        value = row[index]
        if value in table:
            table[value] += 1
        else:
            table[value] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages

In [23]:
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)
        
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])


### Part Three

We start with the prime_genre column of the App Store dataset.

In [24]:
display_table(free_ios_apps,-5)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


We can see that the most common genre among the free English iOS apps is Games (about 58%), followed by Entertainment (about 8%).

In [25]:
display_table(free_android_apps, 1) # Category

FAMILY : 18.77891885791671
GAME : 9.637738404243313
TOOLS : 8.441485159688522
BUSINESS : 4.581875634804198
LIFESTYLE : 3.9047511567543167
PRODUCTIVITY : 3.8934657487868187
FINANCE : 3.7016138133393524
MEDICAL : 3.532332693826882
SPORTS : 3.4194786141519016
PERSONALIZATION : 3.317909942444419
COMMUNICATION : 3.250197494639431
HEALTH_AND_FITNESS : 3.069630967159463
PHOTOGRAPHY : 2.9454914795169844
NEWS_AND_MAGAZINES : 2.7987811759395105
SOCIAL : 2.663356280329534
TRAVEL_AND_LOCAL : 2.3360794492720913
SHOPPING : 2.245796185532107
BOOKS_AND_REFERENCE : 2.144227513824625
DATING : 1.8620923146371742
VIDEO_PLAYERS : 1.783094458864688
MAPS_AND_NAVIGATION : 1.3993905879697552
EDUCATION : 1.2526802843922809
FOOD_AND_DRINK : 1.2413948764247829
ENTERTAINMENT : 1.0382575330098183
LIBRARIES_AND_DEMO : 0.9366888613023362
AUTO_AND_VEHICLES : 0.9254034533348381
HOUSE_AND_HOME : 0.8351201895948538
WEATHER : 0.8012639656923597
EVENTS : 0.7109807019523756
ART_AND_DESIGN : 0.6771244780498815
PARENTING : 0.

Among the free English Android apps, the most common category is FAMILY (about 19%), followed by GAME (about 10%) and TOOLS (about 8%).

In [26]:
display_table(free_android_apps, 9) 

Tools : 8.430199751721025
Entertainment : 6.071549486513937
Education : 5.349283376594064
Business : 4.581875634804198
Productivity : 3.8934657487868187
Lifestyle : 3.8934657487868187
Finance : 3.7016138133393524
Medical : 3.532332693826882
Sports : 3.464620246021894
Personalization : 3.317909942444419
Communication : 3.250197494639431
Action : 3.103487191061957
Health & Fitness : 3.069630967159463
Photography : 2.9454914795169844
News & Magazines : 2.7987811759395105
Social : 2.663356280329534
Travel & Local : 2.3247940413045933
Shopping : 2.245796185532107
Books & Reference : 2.144227513824625
Simulation : 2.0426588421171425
Dating : 1.8620923146371742
Arcade : 1.8620923146371742
Video Players & Editors : 1.783094458864688
Casual : 1.749238234962194
Maps & Navigation : 1.3993905879697552
Food & Drink : 1.2413948764247829
Puzzle : 1.1285407967498025
Racing : 0.9931159011398263
Role Playing : 0.9366888613023362
Libraries & Demo : 0.9366888613023362
Auto & Vehicles : 0.9254034533348381


By genre, Tools is the most common (about 8%), followed by Entertainment (about 6%).

The frequency tables above show that apps designed for fun dominate the App Store, while Google Play shows a more balanced landscape of both practical and fun apps.

## Most Popular Apps by Genre on the App Store

In [27]:
table=freq_table(free_ios_apps,-5)

for genre in table:
    total=0
    len_genre=0
    for i in free_ios_apps:
        genra_app=i[-5]
        if genra_app == genre:
            total+=float(i[5])
            len_genre+=1
    avg=total/len_genre
    print(genre,": average number of user rating:",avg)


Social Networking : average number of user rating: 71548.34905660378
Photo & Video : average number of user rating: 28441.54375
Games : average number of user rating: 22788.6696905016
Music : average number of user rating: 57326.530303030304
Reference : average number of user rating: 74942.11111111111
Health & Fitness : average number of user rating: 23298.015384615384
Weather : average number of user rating: 52279.892857142855
Utilities : average number of user rating: 18684.456790123455
Travel : average number of user rating: 28243.8
Shopping : average number of user rating: 26919.690476190477
News : average number of user rating: 21248.023255813954
Navigation : average number of user rating: 86090.33333333333
Lifestyle : average number of user rating: 16485.764705882353
Entertainment : average number of user rating: 14029.830708661417
Food & Drink : average number of user rating: 33333.92307692308
Sports : average number of user rating: 23008.898550724636
Book : average number of us

We can see that Navigation has the highest average number of user ratings.

In [28]:
for app in free_ios_apps:
    if app[-5] =='Navigation':
        print(app[1],":",app[5])

Waze - GPS Navigation, Maps & Real-time Traffic : 345046
Google Maps - Navigation & Transit : 154911
Geocaching® : 12811
CoPilot GPS – Car Navigation & Offline Maps : 3582
ImmobilienScout24: Real Estate Search in Germany : 187
Railway Route Search : 5


This average is misleading because it is driven mostly by Waze and Google Maps, which together account for about 97% of the ratings in the genre.

Our aim is to find popular genres, but navigation, social networking or music apps might seem more popular than they really are. The average number of ratings seems to be skewed by very few apps which have hundreds of thousands of user ratings, while the other apps may struggle to get past the 10,000 threshold. We could get a better picture by removing these extremely popular apps for each genre and then recomputing the averages, but we'll leave this level of detail for later.
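
As a rough illustration of how much a couple of giants can skew the mean, the Navigation rating counts printed above can be compared against their median and against a mean with the largest apps removed. This is only a sketch on the six values shown, using `statistics` from the standard library; the 100,000 cutoff is an arbitrary assumption.

```python
from statistics import mean, median

# Rating counts for the free Navigation apps listed above
navigation_ratings = [345046, 154911, 12811, 3582, 187, 5]

print(mean(navigation_ratings))    # about 86090.33, pulled up by Waze and Google Maps
print(median(navigation_ratings))  # 8196.5

# Mean after dropping apps above an arbitrary cutoff (assumption: 100,000)
cutoff = 100_000
trimmed = [count for count in navigation_ratings if count <= cutoff]
print(mean(trimmed))               # 4146.25
```

Both the median and the trimmed mean are an order of magnitude below the plain average, which is consistent with the skew described above.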

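## Most Popular Apps by Genre on Google Play

For Google Play, we use the Installs column as the popularity measure. Its values are open-ended strings such as '10,000+', so we strip the plus sign and the commas before converting them to float. Note that the label printed below says "user rating", but the figures are average install counts per category.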

In [39]:
table_android=freq_table(free_android_apps,1)
for category in table_android:
    total=0
    len_category=0
    for i in free_android_apps:
        category_app=i[1]
        if category_app == category:
            numb=i[5]
            numb=numb.replace("+","")
            numb=numb.replace(",","")
            total+=float(numb)
            len_category+=1
    avg=total/len_category
    print(category,": average number of user rating:",avg)

ART_AND_DESIGN : average number of user rating: 1905351.6666666667
AUTO_AND_VEHICLES : average number of user rating: 647317.8170731707
BEAUTY : average number of user rating: 513151.88679245283
BOOKS_AND_REFERENCE : average number of user rating: 8767811.894736841
BUSINESS : average number of user rating: 1704192.3399014778
COMICS : average number of user rating: 817657.2727272727
COMMUNICATION : average number of user rating: 38326063.197916664
DATING : average number of user rating: 854028.8303030303
EDUCATION : average number of user rating: 3057207.207207207
ENTERTAINMENT : average number of user rating: 19428913.04347826
EVENTS : average number of user rating: 253542.22222222222
FINANCE : average number of user rating: 1387692.475609756
FOOD_AND_DRINK : average number of user rating: 1924897.7363636363
HEALTH_AND_FITNESS : average number of user rating: 4167457.3602941176
HOUSE_AND_HOME : average number of user rating: 1313681.9054054054
LIBRARIES_AND_DEMO : average number of use