# Apps

In this project we analyze the mobile app market. The goal is to understand which types of free apps are more attractive to users and more profitable.

In [9]:
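from csv import reader

# Read both data sets; each becomes a list of rows with the header at index 0.
with open('googleplaystore.csv', encoding='utf8') as google:
    android = list(reader(google))
with open('AppleStore.csv', encoding='utf8') as apple:
    ios = list(reader(apple))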

def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

print(android[0])
print(ios[0])


['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
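Because the header row stays at index 0 of each list, every later loop has to remember to skip it. A minimal sketch of an alternative loader that returns the header and the data rows separately is shown here; the helper name `load_csv` and the tiny stand-in file are assumptions for illustration, not part of the original notebook.

```python
import csv
import os
import tempfile


def load_csv(path, encoding="utf8"):
    """Return (header, rows) for a CSV file and close the file afterwards."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[0], rows[1:]


# Tiny stand-in file so the sketch runs without the real data sets.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf8", newline="") as tmp:
    tmp.write("App,Category,Rating\nDemo App,TOOLS,4.1\n")
    sample_path = tmp.name

sample_header, sample_rows = load_csv(sample_path)
os.remove(sample_path)

print(sample_header)
print(sample_rows)
```

With the real files, the call would look like `android_header, android = load_csv('googleplaystore.csv')`.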


In [10]:
explore_data(android, 1, 3)

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']




In [11]:
def data_len(x):
    n_rows = len(x) - 1  # exclude the header row
    n_columns = len(x[1])
    return n_rows, n_columns

print('Android data set rows and columns', data_len(android))
print('iOS data set rows and columns', data_len(ios))

Android data set rows and columns (10841, 13)
iOS data set rows and columns (7197, 16)


In [12]:
del android[10472]
print(android[10472])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']
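The row printed above has 12 fields, while the header has 13 (the category is missing), so it can be safer to look for rows by their length than to rely on a fixed index. A minimal sketch, assuming a hypothetical helper `find_malformed_rows` and made-up sample rows:

```python
def find_malformed_rows(dataset):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(dataset[0])
    return [i for i, row in enumerate(dataset) if len(row) != expected]


# Made-up rows: the last one is missing a field.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.1'],
    ['Bad App', '1.9'],
]

bad = find_malformed_rows(sample)
print(bad)

# Delete from the end so that the remaining indexes stay valid.
for i in reversed(bad):
    del sample[i]
print(len(sample))
```

Running the same check again after the deletion should return an empty list.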


In [38]:
duplicate_apps = []
unique_apps = []
for app in android:
    name = app[0]
    if name in unique_apps:
        duplicate_apps.append(name)
    else:
        unique_apps.append(name)
print('Number of duplicate apps', len(duplicate_apps))
print('Number of non duplicate apps', len(unique_apps))
print('\n')
print('Examples of duplicate apps','\n', duplicate_apps[:15])

Number of duplicate apps 1181
Number of non duplicate apps 9659


Examples of duplicate apps 
 ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings', 'join.me - Simple Meetings', 'Box', 'Zenefits', 'Google Ads', 'Google My Business', 'Slack', 'FreshBooks Classic', 'Insightly CRM', 'QuickBooks Accounting: Invoicing & Expenses', 'HipChat - Chat Built for Teams', 'Xero Accounting Software']
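Checking membership in a list gets slow as the list grows, because every lookup scans the whole list. As a sketch of an alternative, the same two counts can be obtained with `collections.Counter`; the helper name `count_duplicates` and the sample rows are assumptions for illustration.

```python
from collections import Counter


def count_duplicates(rows, name_index=0):
    """Return (number of unique names, number of extra duplicate entries)."""
    counts = Counter(row[name_index] for row in rows)
    n_unique = len(counts)
    n_duplicates = sum(counts.values()) - n_unique
    return n_unique, n_duplicates


# Made-up rows: 'Box' appears three times, so it adds two duplicates.
sample_rows = [['Box'], ['Slack'], ['Box'], ['Box'], ['Zenefits']]
n_unique, n_duplicates = count_duplicates(sample_rows)
print(n_unique, n_duplicates)
```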


Some apps appear more than once in the Android data set:

Number of duplicate apps is 1181
Number of non duplicate apps is 9659

We need to remove the duplicates before continuing the analysis. For each app we keep only the entry with the highest number of reviews, since more reviews suggests a more recent record.


In [13]:
reviews_max = {}
for row in android:
    name = row[0]
    # NOTE: row[3] is a string, so <= below compares text, not numbers
    n_reviews = row[3]
    if name in reviews_max and reviews_max[name] <= n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

print(len(reviews_max))

9660


In [14]:

android_clean = []
already_added = []

for row in android:
    name = row[0]
    n_reviews = row[3]
    if (reviews_max[name] == n_reviews) and (name not in already_added):
        android_clean.append(row)
        already_added.append(name)
print(len(android_clean))

9660
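The review counts are read from the CSV as strings, and strings are compared character by character, so `'967' <= '1000'` is false. A minimal sketch of the same "keep the entry with the most reviews" idea with a numeric comparison is shown here; `dedupe_by_max_reviews` and the sample rows are hypothetical, and the function expects data rows only, without the header.

```python
def dedupe_by_max_reviews(rows, name_index=0, reviews_index=3):
    """Keep, for each app name, the row with the largest review count."""
    best = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in best or n_reviews > float(best[name][reviews_index]):
            best[name] = row
    return list(best.values())


# Made-up rows; the second 'Box' entry has more reviews.
sample_rows = [
    ['Box', 'BUSINESS', '4.2', '967'],
    ['Box', 'BUSINESS', '4.2', '1000'],
    ['Slack', 'BUSINESS', '4.4', '51507'],
]

deduped = dedupe_by_max_reviews(sample_rows)
print(deduped)
print('967' <= '1000')  # False: text comparison, '9' sorts after '1'
```

Storing the whole row in the dictionary also removes the need for a second pass over the data.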


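We now have a list of unique apps. We do not want to analyze non-English apps, so we remove them as well.

First, we write a function that decides whether an app has an English name. A name counts as English if it contains at most three characters outside the ASCII range, which leaves room for the occasional emoji or symbol.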

In [15]:
def english(x):
    # Count characters outside the ASCII range (code point above 127).
    non_ascii = 0
    for i in x:
        if ord(i) > 127:
            non_ascii += 1
            if non_ascii > 3:
                return False
    return True

android_english = []
ios_english = []

for app in android_clean:
    name = app[0]
    if english(name):
        android_english.append(app)
print(len(android_english))

for app in ios[1:]:
    name = app[1]
    if english(name):
        ios_english.append(app)
print(len(ios_english))

9615
6183


Now we need to isolate free apps:

In [16]:
ios_free = []
android_free = []
for app in ios_english:
    price = app[4]  # 'price' column
    if price == '0.0':
        ios_free.append(app)
for app in android_english:
    price = app[7]  # 'Price' column
    if price == '0':
        android_free.append(app)
print(len(android_free))
print(len(ios_free))


8861
3222
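The filter above compares price strings exactly, so it depends on free apps being written as `'0.0'` in one file and `'0'` in the other. A sketch of a numeric check that works for both is shown here; the helper name `is_free` is hypothetical, and the `'$4.99'` style for paid Android apps is an assumption about the file format.

```python
def is_free(price):
    """True if a price string such as '0', '0.0' or '$0.00' equals zero."""
    cleaned = price.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        # Not a number at all, e.g. a header cell or a shifted column.
        return False


for text in ['0', '0.0', '$4.99', 'Price']:
    print(repr(text), is_free(text))
```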


This leaves 8861 free Android apps and 3222 free iOS apps.
We spent a good amount of time cleaning the data. In the process we:

Removed inaccurate data

Removed duplicate app entries

Removed non-English apps

Isolated the free apps

To minimize risks and overhead, our validation strategy for an app idea is comprised of three steps:

1) Build a minimal Android version of the app, and add it to Google Play.

2) If the app has a good response from users, we develop it further.

3) If the app is profitable after six months, we build an iOS version of the app and add it to the App Store.

In [17]:
def freq_table(dataset, index):
    dic = {}
    total = 0
    for app in dataset:
        total += 1
        name = app[index]
        if name in dic:
            dic[name] +=1
        else:
            dic[name] = 1
    dic_freq = {}
    for key in dic:
        percent = dic[key] / total * 100
        dic_freq[key] = percent
    
    return dic_freq

def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
        

 



In [18]:
display_table(ios_free, -5)  # prime_genre

Games : 58.16263190564867
Entertainment : 7.883302296710118
Photo & Video : 4.9658597144630665
Education : 3.662321539416512
Social Networking : 3.2898820608317814
Shopping : 2.60707635009311
Utilities : 2.5139664804469275
Sports : 2.1415270018621975
Music : 2.0484171322160147
Health & Fitness : 2.0173805090006205
Productivity : 1.7380509000620732
Lifestyle : 1.5828677839851024
News : 1.3345747982619491
Travel : 1.2414649286157666
Finance : 1.1173184357541899
Weather : 0.8690254500310366
Food & Drink : 0.8069522036002483
Reference : 0.5586592178770949
Business : 0.5276225946617008
Book : 0.4345127250155183
Navigation : 0.186219739292365
Medical : 0.186219739292365
Catalogs : 0.12414649286157665


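Among the free English iOS apps, Games are by far the most common genre (58.2%), followed by Entertainment (7.9%), Photo & Video (5.0%) and Education (3.7%).
These shares only show how many apps each genre has, not how many users those apps attract.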

In [19]:
display_table(android_free, 1) #(Category)

FAMILY : 18.936914569461685
GAME : 9.694165444080802
TOOLS : 8.452770567656021
BUSINESS : 4.593161042771697
LIFESTYLE : 3.9047511567543167
PRODUCTIVITY : 3.8934657487868187
FINANCE : 3.7016138133393524
MEDICAL : 3.521047285859384
SPORTS : 3.3969077982169056
PERSONALIZATION : 3.3066245344769216
COMMUNICATION : 3.2389120866719328
HEALTH_AND_FITNESS : 3.080916375126961
PHOTOGRAPHY : 2.9454914795169844
NEWS_AND_MAGAZINES : 2.7987811759395105
SOCIAL : 2.663356280329534
TRAVEL_AND_LOCAL : 2.3360794492720913
SHOPPING : 2.245796185532107
BOOKS_AND_REFERENCE : 2.144227513824625
DATING : 1.8620923146371742
VIDEO_PLAYERS : 1.794379866832186
MAPS_AND_NAVIGATION : 1.3993905879697552
FOOD_AND_DRINK : 1.2413948764247829
EDUCATION : 1.1736824286197944
ENTERTAINMENT : 0.9592596772373322
LIBRARIES_AND_DEMO : 0.9366888613023362
AUTO_AND_VEHICLES : 0.9254034533348381
HOUSE_AND_HOME : 0.8238347816273558
WEATHER : 0.8012639656923597
EVENTS : 0.7109807019523756
PARENTING : 0.6545536621148855
ART_AND_DESIGN :

In [20]:
#Android Genres
display_table(android_free, -4)

Tools : 8.441485159688522
Entertainment : 6.071549486513937
Education : 5.349283376594064
Business : 4.593161042771697
Productivity : 3.8934657487868187
Lifestyle : 3.8934657487868187
Finance : 3.7016138133393524
Medical : 3.521047285859384
Sports : 3.464620246021894
Personalization : 3.3066245344769216
Communication : 3.2389120866719328
Action : 3.103487191061957
Health & Fitness : 3.080916375126961
Photography : 2.9454914795169844
News & Magazines : 2.7987811759395105
Social : 2.663356280329534
Travel & Local : 2.3247940413045933
Shopping : 2.245796185532107
Books & Reference : 2.144227513824625
Simulation : 2.0426588421171425
Dating : 1.8620923146371742
Arcade : 1.8508069066696762
Video Players & Editors : 1.7718090508971898
Casual : 1.749238234962194
Maps & Navigation : 1.3993905879697552
Food & Drink : 1.2413948764247829
Puzzle : 1.1285407967498025
Racing : 0.9931159011398263
Role Playing : 0.9366888613023362
Libraries & Demo : 0.9366888613023362
Auto & Vehicles : 0.92540345333483

In [23]:
ifreq = freq_table(ios_free, -5)

# Average number of user ratings (rating_count_tot) per genre
for genre in ifreq:
    total = 0
    len_genre = 0
    for app in ios_free:
        app_genre = app[-5]
        if app_genre == genre:
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    average_number = total / len_genre
    print(genre, average_number)
    

Business 7491.117647058823
Travel 28243.8
Games 22788.6696905016
Navigation 86090.33333333333
Medical 612.0
Catalogs 4004.0
News 21248.023255813954
Weather 52279.892857142855
Lifestyle 16485.764705882353
Book 39758.5
Finance 31467.944444444445
Shopping 26919.690476190477
Education 7003.983050847458
Social Networking 71548.34905660378
Health & Fitness 23298.015384615384
Music 57326.530303030304
Reference 74942.11111111111
Utilities 18684.456790123455
Sports 23008.898550724636
Productivity 21028.410714285714
Food & Drink 33333.92307692308
Photo & Video 28441.54375
Entertainment 14029.830708661417
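The nested loops above scan the whole data set once per genre. As a sketch of an alternative, the averages can be accumulated in a single pass and sorted so the largest values come first; the helper name `average_by_key` and the sample rows are assumptions for illustration.

```python
from collections import defaultdict


def average_by_key(rows, key_index, value_index):
    """Return (key, average value) pairs, sorted by average, largest first."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += float(row[value_index])
        counts[key] += 1
    averages = {key: totals[key] / counts[key] for key in totals}
    return sorted(averages.items(), key=lambda pair: pair[1], reverse=True)


# Made-up rows: [genre, number of ratings]
sample = [['Games', '100'], ['Games', '300'], ['Weather', '50']]
ranked = average_by_key(sample, 0, 1)
for genre, average in ranked:
    print(genre, average)
```

For the iOS data the call would be `average_by_key(ios_free, -5, 5)`.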


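The figures above are the average number of user ratings per genre on the App Store, which we use as a rough measure of popularity: the more ratings, the more users a genre's apps are likely to have. Next we look at a comparable measure for Google Play.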

In [24]:
#GPLAY INSTALLS
afreq = freq_table(android_free, 1)

for category in afreq:
    total = 0
    len_category = 0
    for app in android_free:
        app_category = app[1]
        if app_category == category:
            # Installs look like '10,000+'; strip ',' and '+' before converting
            n_installs = float(app[5].replace(',', '').replace('+', ''))
            total += n_installs
            len_category += 1
    average_number = total / len_category
    print(category, average_number)
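In the Google Play data, column 5 is `Installs`, and its values are strings such as `'10,000+'` (see the sample rows printed near the top), which `float()` cannot convert directly. A minimal sketch of the cleaning step, using a hypothetical helper `parse_installs` and the two sample rows shown earlier, reduced to the two columns that matter here:

```python
def parse_installs(text):
    """Convert an installs string such as '10,000+' to a float."""
    return float(text.replace(',', '').replace('+', ''))


# Category and Installs taken from the two sample rows printed earlier.
sample = [
    ['ART_AND_DESIGN', '10,000+'],
    ['ART_AND_DESIGN', '500,000+'],
]

installs = [parse_installs(row[1]) for row in sample]
average_installs = sum(installs) / len(installs)
print(installs)
print(average_installs)
```

The `+` means "at least this many", so averages computed this way are lower bounds rather than exact install counts.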