# iOS & Android App Analysis


We only develop free apps, so our business model relies solely on in-app advertising revenue. To generate that revenue, we need to build apps that attract high traffic.

Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.
    

In [3]:
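from csv import reader

def open_data(file):
    # Read a CSV file from the local data folder and return its rows as a list of lists.
    path = '/Users/nstanzione/Documents/EDU/DataQuest/Data/' + file
    with open(path, encoding='utf8') as opened_file:
        return list(reader(opened_file))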


In [4]:
ios = open_data('AppleStore.csv')
android = open_data('googleplaystore.csv')

## Data Location

iOS Data: [Link](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps)

Android Data: [Link](https://www.kaggle.com/lava18/google-play-store-apps)

In [5]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [6]:
explore_data(android,0,1)
explore_data(ios,0,1)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['', 'id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']




In [7]:
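# Remove a known-bad row (inaccurate data). Run this cell only once: re-running it deletes a different row.
del android[10473]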

In [8]:
print(android[10473])

['osmino Wi-Fi: free WiFi', 'TOOLS', '4.2', '134203', '4.1M', '10,000,000+', 'Free', '0', 'Everyone', 'Tools', 'August 7, 2018', '6.06.14', '4.4 and up']


In [9]:
freq_dist = {}
for row in ios:
    name = row[2]
    if name in freq_dist:
        freq_dist[name] += 1
    else:
        freq_dist[name] = 1

duplicates = []
for x in freq_dist:
    if freq_dist[x] > 1:
        duplicates.append(x)
        
print(duplicates)


['VR Roller Coaster', 'Mannequin Challenge']


In [10]:
for app in android:
    name = app[0]
    if name == 'Instagram':
        print(app)

['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577446', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66577313', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Instagram', 'SOCIAL', '4.5', '66509917', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Teen', 'Social', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [11]:
duplicate_apps = []
unique_apps = []

for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)

print('Number of duplicate apps:', len(duplicate_apps))
print('\n')
print('Examples of duplicate apps:', duplicate_apps[:15])


Number of duplicate apps: 1181


Examples of duplicate apps: ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
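
The duplicate count above relies on list membership checks, which get slow as the dataset grows. As a rough sketch of the same idea, `collections.Counter` can tally names in one pass. The helper name `count_duplicates` and the one-column sample rows are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def count_duplicates(rows, name_idx=0):
    """Map each app name that appears more than once to its number of occurrences."""
    counts = Counter(row[name_idx] for row in rows)
    return {name: n for name, n in counts.items() if n > 1}

# Illustrative sample rows (name column only), not the real dataset.
sample = [['Slack'], ['Box'], ['Slack'], ['Slack'], ['Box'], ['Zenefits']]
dupes = count_duplicates(sample)

# Rows beyond the first occurrence of each name: the figure reported as "duplicate apps".
extra_rows = sum(n - 1 for n in dupes.values())
print(dupes)
print('Number of duplicate rows:', extra_rows)
```

On the real data this would be called as `count_duplicates(android[1:])` to skip the header row.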


## Duplicate Analysis

As noted above, the Google Play dataset contains 1,181 duplicate records. Each duplicated app in the list above already has one entry in the unique list, but that entry is simply the first occurrence, which is not necessarily the best one to keep. Rather than accept the current unique list, we want to keep the best record for each app. One way to pick it is to keep the record with the highest number of reviews, since it is likely the most recent. For instance, let's review the records for the Slack app.

In [12]:
for app in android:
    name = app[0]
    if name == 'Slack':
        print(app)
    

['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51507', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']
['Slack', 'BUSINESS', '4.4', '51510', 'Varies with device', '5,000,000+', 'Free', '0', 'Everyone', 'Business', 'August 2, 2018', 'Varies with device', 'Varies with device']


As we can see above, there are three Slack records. The last one has the most reviews, so it appears to hold the latest information and is the best record of the three to keep. Removing the 1,181 duplicates from the dataset (excluding the header row) should leave 9,659 records in the clean dataset, which the calculation below confirms.

In [13]:
print('Expected length:',len(android[1:])-1181)

Expected length: 9659


## Resolution for Duplicates

To build a dataset that keeps only the maximum-review record for each app, we first create a dictionary that stores the highest review count seen for each app name. We then loop through the original dataset and keep each record whose review count matches the stored maximum. We also maintain a list of the app names we have already added, so that an app with multiple records sharing the same maximum review count is added only once.

In [14]:
reviews_max = {}
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if name in reviews_max and n_reviews > reviews_max[name]:
        reviews_max[name] = n_reviews
    if name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))    

9659


In [15]:
android_clean = []
already_added = []
for app in android[1:]:
    name = app[0]
    n_reviews = float(app[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        android_clean.append(app)
        already_added.append(name)

explore_data(android_clean,0,3,True)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 9659
Number of columns: 13
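
The two-pass approach above could also be sketched as a single pass that stores the best row per app directly. This is a minimal sketch, not the notebook's method: the function name `dedupe_by_max_reviews` is hypothetical, the sample rows are shortened to four columns, and the Box row's values are made up for illustration.

```python
def dedupe_by_max_reviews(rows, name_idx=0, reviews_idx=3):
    """Keep one row per app name: the first row holding the highest review count."""
    best = {}  # app name -> best row seen so far (dicts preserve insertion order)
    for row in rows:
        name = row[name_idx]
        n_reviews = float(row[reviews_idx])
        # Strict '>' keeps the earliest row when review counts tie.
        if name not in best or n_reviews > float(best[name][reviews_idx]):
            best[name] = row
    return list(best.values())

# Shortened, partly made-up sample rows: name, category, rating, reviews.
sample = [
    ['Slack', 'BUSINESS', '4.4', '51507'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
    ['Slack', 'BUSINESS', '4.4', '51510'],
]
deduped = dedupe_by_max_reviews(sample)
print(deduped)
```

One difference to keep in mind: the output here follows the order in which each name first appears, which may not match the row order produced by the two-pass version.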


In [16]:
freq_dist_ios_id = {}
for row in ios:
    app_id = row[1]
    if app_id in freq_dist_ios_id:
        freq_dist_ios_id[app_id] += 1
    else:
        freq_dist_ios_id[app_id] = 1

duplicates_ios_id = []
for x in freq_dist_ios_id:
    if freq_dist_ios_id[x] > 1:
        duplicates_ios_id.append(x)
        
print(duplicates_ios_id)


[]


In [17]:
freq_dist_ios_name = {}
for row in ios:
    name = row[2]
    if name in freq_dist_ios_name:
        freq_dist_ios_name[name] += 1
    else:
        freq_dist_ios_name[name] = 1

duplicates_ios_name = []
for x in freq_dist_ios_name:
    if freq_dist_ios_name[x] > 1:
        duplicates_ios_name.append(x)
        
print(duplicates_ios_name)

['VR Roller Coaster', 'Mannequin Challenge']


In [18]:
for app in ios:
    name = app[2]
    if name in duplicates_ios_name:
        print(app)

['4000', '952877179', 'VR Roller Coaster', '169523200', 'USD', '0', '107', '102', '3.5', '3.5', '2.0.0', '4+', 'Games', '37', '5', '1', '1']
['7579', '1089824278', 'VR Roller Coaster', '240964608', 'USD', '0', '67', '44', '3.5', '4', '0.81', '4+', 'Games', '38', '0', '1', '1']
['10751', '1173990889', 'Mannequin Challenge', '109705216', 'USD', '0', '668', '87', '3', '3', '1.4', '9+', 'Games', '37', '4', '1', '1']
['10885', '1178454060', 'Mannequin Challenge', '59572224', 'USD', '0', '105', '58', '4', '4.5', '1.0.1', '4+', 'Games', '38', '5', '1', '1']


## Summary of Duplicates

Google Play data contained 1,181 duplicate records based on the app name column. We were able to clean the dataset in order to ensure we had the "best" unique records.

The iOS data from the App Store has only two app names that appear more than once, and there are no duplicates based on the app ID. We consider the ID to be a better unique identifier, so we treat all of these records as unique even though two pairs of records share a name.

## Consideration of Non-English Apps

Apple and Google are global companies, so it is understandable that the apps each company offers come in a variety of languages. Analyzing apps in languages we do not understand would not be of much use to us, so a further cleaning step is to reduce the datasets to English-only apps. Below we create two functions:
* english: returns False if a given string contains more than three non-ASCII characters (our proxy for non-English text), and True otherwise
* create_english: uses the prior function to loop through both datasets and collect the English-only records in new lists.

In [19]:
def english(string):
    non_english = []
    for char in string:
        if ord(char) > 127:
            non_english.append(char)
    if len(non_english) > 3:
        return False
    else:
        return True
        
print(english('爱奇艺PPS -《欢乐颂2》电视剧热播'))

False
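
For comparison, the same check can be sketched more compactly by counting the non-ASCII characters directly. The name `is_english` and the `max_non_ascii` parameter are assumptions introduced here for illustration; the threshold of three mirrors the function above.

```python
def is_english(name, max_non_ascii=3):
    """Return True when `name` has at most `max_non_ascii` characters outside ASCII."""
    non_ascii = sum(1 for char in name if ord(char) > 127)
    return non_ascii <= max_non_ascii

print(is_english('Instagram'))                      # plain ASCII name
print(is_english('爱奇艺PPS -《欢乐颂2》电视剧热播'))    # many non-ASCII characters
print(is_english('Docs To Go™ Free Office Suite'))  # a single non-ASCII symbol
```

Allowing a few non-ASCII characters keeps English names that contain symbols such as ™ or an emoji, at the cost of occasionally letting a short non-English name through.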


In [20]:
ios_eng = []
android_eng = []
def create_english(dataset1=ios,dataset2=android_clean):
    for app in dataset1[1:]:
        name = app[2]
        if english(name):
            ios_eng.append(app)
    for app in dataset2:
        name = app[0]
        if english(name):
            android_eng.append(app)

create_english()
            
explore_data(ios_eng,0,3,True)
print('\n')
print('\n')
explore_data(android_eng,0,3,True)

['1', '281656475', 'PAC-MAN Premium', '100788224', 'USD', '3.99', '21292', '26', '4', '4.5', '6.3.5', '4+', 'Games', '38', '5', '10', '1']


['2', '281796108', 'Evernote - stay organized', '158578688', 'USD', '0', '161065', '26', '4', '3.5', '8.2.2', '4+', 'Productivity', '37', '5', '23', '1']


['3', '281940292', 'WeatherBug - Local Weather, Radar, Maps, Alerts', '100524032', 'USD', '0', '188583', '2822', '3.5', '4.5', '5.0.0', '4+', 'Weather', '37', '5', '3', '1']


Number of rows: 6183
Number of columns: 17




['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0',

So far in the data cleaning process, we:
* Removed inaccurate data
* Removed duplicate app entries
* Removed non-English apps

As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

Isolating the free apps will be our last step in the data cleaning process. After that, we will start analyzing the data.

In [21]:
ios_final = []
android_final = []
for app in ios_eng:
    price = float(app[5])
    if price == 0:
        ios_final.append(app)
for app in android_eng:
    app_type = app[6]  # avoid shadowing the built-in type()
    if app_type == 'Free':
        android_final.append(app)

print(len(ios_final))
print(len(android_final))
    

3222
8863


## Profitable App Profile

As we mentioned in the introduction, our aim is to determine the kinds of apps that are likely to attract more users because our revenue is highly influenced by the number of people using our apps.

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

* Build a minimal Android version of the app, and add it to Google Play.
* If the app has a good response from users, we develop it further.
* If the app is profitable after six months, we build an iOS version of the app and add it to the App Store. 

Because our end goal is to add the app on both Google Play and the App Store, we need to find app profiles that are successful on both markets. For instance, a profile that works well for both markets might be a productivity app that makes use of gamification.

We will start this research by analyzing genre frequency in both the App Store and the Google Play Store. For the iOS dataset we will use *prime_genre*, and for Android we will use *Category*. To analyze these columns, we will build frequency tables that show which categories have the most existing apps. The code below contains two functions: freq_table returns each value's share of the dataset as a percentage for a given column index, and display_table prints those percentages sorted in descending order.

In [30]:
def freq_table(dataset,index):
    table = {}
    total = 0
    for row in dataset:
        total += 1
        item = row[index]
        if item in table:
            table[item] += 1
        else:
            table[item] = 1
    
    table_percentages = {}
    for key in table:
        percentage = (table[key] / total) * 100
        table_percentages[key] = percentage 
    
    return table_percentages  

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
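
The two helpers above can also be approximated with `collections.Counter`, which handles both the counting and the sorting. This is a sketch under assumptions: `freq_table_pct` is a hypothetical name, and the sample rows are made up to keep the example self-contained.

```python
from collections import Counter

def freq_table_pct(dataset, index):
    """Return (value, percentage) pairs for one column, most frequent first."""
    counts = Counter(row[index] for row in dataset)
    total = sum(counts.values())
    if total == 0:
        return []  # empty dataset: nothing to report
    return [(value, count / total * 100) for value, count in counts.most_common()]

# Made-up sample rows: app name, genre.
sample = [['a', 'Games'], ['b', 'Games'], ['c', 'Book'], ['d', 'Games']]
table = freq_table_pct(sample, 1)
for value, pct in table:
    print(value, ':', round(pct, 2))
```

Ties are ordered by first appearance here, whereas `display_table` breaks ties by the key name, so the two may list equal-percentage entries in a different order.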



Now, we will generate the actual frequencies for iOS and then Android.

In [31]:
display_table(ios_final,12)

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


Games account for about 58% of the free English apps available to iOS users. Entertainment comes next with nearly 8% of apps.

Education, at about 3.7%, is surprising to see. It could be a growing segment, with room for more people to use their iOS devices for education as we move to a more connected world. Navigation, at less than 1%, is not surprising: even though these are most likely high-use apps, users are loyal to a few dominant players. A similar observation can be made for Music at 2%.

The data clearly shows that entertainment-centric apps (such as Games and Entertainment) form an extremely saturated part of the iOS market, with a large number of competitors.

It is difficult to recommend a genre for our app based on this data alone. It would help to see the number of user downloads to learn which genres are downloaded most. Games seems likely to generate the most traffic, but it also has the most competition, so it may be wise to find another high-download genre with a limited number of competitors.

In [32]:
display_table(android_final,1)

FAMILY : 18.898792733837304
GAME : 9.725826469592688
TOOLS : 8.462146000225657
BUSINESS : 4.592124562789123
LIFESTYLE : 3.9038700214374367
PRODUCTIVITY : 3.8925871601038025
FINANCE : 3.7007785174320205
MEDICAL : 3.5315355974275078
SPORTS : 3.396141261423897
PERSONALIZATION : 3.317161232088458
COMMUNICATION : 3.2381812027530184
HEALTH_AND_FITNESS : 3.0802211440821394
PHOTOGRAPHY : 2.944826808078529
NEWS_AND_MAGAZINES : 2.798149610741284
SOCIAL : 2.6627552747376737
TRAVEL_AND_LOCAL : 2.335552296062281
SHOPPING : 2.245289405393208
BOOKS_AND_REFERENCE : 2.1437436533904997
DATING : 1.8616721200496444
VIDEO_PLAYERS : 1.7939749520478394
MAPS_AND_NAVIGATION : 1.399074805370642
FOOD_AND_DRINK : 1.241114746699763
EDUCATION : 1.1621347173643235
ENTERTAINMENT : 0.9590432133589079
LIBRARIES_AND_DEMO : 0.9364774906916393
AUTO_AND_VEHICLES : 0.9251946293580051
HOUSE_AND_HOME : 0.8236488773552973
WEATHER : 0.8010831546880289
EVENTS : 0.7108202640189552
PARENTING : 0.6544059573507841
ART_AND_DESIGN : 0

Family accounts for about 19% of the free English apps available to Android users. Games comes next with about 10% of apps.

Business, at 4.6%, is surprising to see. It could be a growing segment, with room for more people to use their Android devices for business as we move to a more connected world. This stands out because Android devices may make it easier to use G-Suite, which is common business software. On the other hand, as we saw earlier, Apple products seem to be the preference for education. Art & Design is not surprising to see at the bottom, as iOS products have always been geared toward "creative" workers. Tools near the top is surprising to see, which suggests Android users may have more options and situations in which to use their devices.


It is difficult to recommend a genre for our app based on this data alone. It would help to see the number of user downloads to learn which genres are downloaded most. Also, we are not quite sure what "Family" even means in terms of an app: are they games? tools? businesses? parks?

In [38]:
prime_genre = freq_table(ios_final,12)

for genre in prime_genre:
    total = 0
    len_genre = 0
    for row in ios_final:
        genre_app = row[12]
        if genre_app == genre:
            users = float(row[6])
            total += users
            len_genre += 1
    avg_users = total / len_genre
    print(genre, ':', avg_users)



Productivity : 21028.410714285714
Weather : 52279.892857142855
Shopping : 26919.690476190477
Reference : 74942.11111111111
Finance : 31467.944444444445
Music : 57326.530303030304
Utilities : 18684.456790123455
Travel : 28243.8
Social Networking : 71548.34905660378
Sports : 23008.898550724636
Health & Fitness : 23298.015384615384
Games : 22788.6696905016
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Photo & Video : 28441.54375
Entertainment : 14029.830708661417
Business : 7491.117647058823
Lifestyle : 16485.764705882353
Education : 7003.983050847458
Navigation : 86090.33333333333
Medical : 612.0
Catalogs : 4004.0


Reviewing the average number of users per app (using the total rating count as a proxy for users), Games apps have an average user base of about 22K, which is fairly average. The six genres with the most users per app are Navigation, Reference, Social Networking, Music, Weather, and Book.

Navigation is dominated by a few players like Waze. Social Networking apps are complex to develop, and their users are concentrated in a few apps like Facebook and Twitter. Music is a similar story with Spotify. That leaves Reference, Weather, and Book as genres to consider for an app.

Cross-referencing with the frequency table, I would recommend a Book app. These apps have a high user base per app, and few apps in the App Store address the free, English market: Book apps account for less than 1% of these apps. There looks to be a market opportunity in this space.
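
The per-genre averages discussed above rescan the whole dataset once per genre. As a hedged alternative, the averages can be accumulated in one pass and returned already ranked. The function name `average_by_group` and the sample values are assumptions made for illustration only.

```python
from collections import defaultdict

def average_by_group(dataset, group_idx, value_idx):
    """Average a numeric column per group in one pass, highest average first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset:
        group = row[group_idx]
        totals[group] += float(row[value_idx])
        counts[group] += 1
    averages = {group: totals[group] / counts[group] for group in totals}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

# Made-up sample rows: genre, rating count.
sample = [
    ['Games', '100'],
    ['Games', '300'],
    ['Book', '500'],
]
ranked = average_by_group(sample, 0, 1)
for genre, avg in ranked:
    print(genre, ':', avg)
```

On the cleaned iOS data this would be called as `average_by_group(ios_final, 12, 6)`, using the genre and total rating count columns.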

*Moving onto the Android data...*

In [42]:
category = freq_table(android_final,1)

for cat in category:
    total = 0
    len_cat = 0
    for row in android_final:
        cat_app = row[1]
        if cat_app == cat:
            est_installs = row[5]
            installs_1 = est_installs.replace('+','')
            installs_2 = installs_1.replace(',','')
            installs = float(installs_2)
            total += installs
            len_cat += 1
    avg_a_users = total / len_cat
    print(cat, ':', '{:,.0f}'.format(avg_a_users))

ART_AND_DESIGN : 1,986,335
AUTO_AND_VEHICLES : 647,318
BEAUTY : 513,152
BOOKS_AND_REFERENCE : 8,767,812
BUSINESS : 1,712,290
COMICS : 817,657
COMMUNICATION : 38,456,119
DATING : 854,029
EDUCATION : 1,833,495
ENTERTAINMENT : 11,640,706
EVENTS : 253,542
FINANCE : 1,387,692
FOOD_AND_DRINK : 1,924,898
HEALTH_AND_FITNESS : 4,188,822
HOUSE_AND_HOME : 1,331,541
LIBRARIES_AND_DEMO : 638,504
LIFESTYLE : 1,437,816
GAME : 15,588,016
FAMILY : 3,697,848
MEDICAL : 120,551
SOCIAL : 23,253,652
SHOPPING : 7,036,877
PHOTOGRAPHY : 17,840,110
SPORTS : 3,638,640
TRAVEL_AND_LOCAL : 13,984,078
TOOLS : 10,801,391
PERSONALIZATION : 5,201,483
PRODUCTIVITY : 16,787,331
PARENTING : 542,604
WEATHER : 5,074,486
VIDEO_PLAYERS : 24,727,872
NEWS_AND_MAGAZINES : 9,549,178
MAPS_AND_NAVIGATION : 4,056,942


Reviewing the number of installs, Family and Game have about 3.7M and 15.6M users per app, respectively. These are high totals; however, several categories exceed 10M users per app: Communication, Video Players, Social, Photography, Productivity, Game, Travel & Local, Entertainment, and Tools. Many of these categories will be difficult to break into, as they are dominated by a select few players.

Books & Reference stands out as a potential candidate in the Android data as well. With about 9M users per app and only 2% of the apps in the store, this category seems to be a potential market to exploit. Another category with similar attributes is Shopping, which has 7M users per app and accounts for 2% of the apps in the store (similar metrics for iOS).
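
The install figures above come from strings such as `'10,000+'`, which have to be cleaned before they can be averaged. A small helper makes that step explicit; `parse_installs` is a hypothetical name, and the sketch assumes every value is either plain digits or digits with commas and a trailing plus sign.

```python
def parse_installs(text):
    """Convert an install bucket such as '10,000+' into an integer lower bound."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('1,000,000,000+'))
print(parse_installs('5,000,000+'))
```

Because each value is a bucket rather than an exact count, the resulting averages are best read as approximate lower bounds.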


# Conclusion

Book and Reference apps stood out in the iOS data as a clear winner. The Android data is much less concentrated in particular genres, but the Books & Reference category still had strong data points there (higher-than-average user downloads and limited competition).