# App Market Analysis Using Data Sets

In this Python notebook we use Google Play and Apple Store data sets to decide which app genres are most likely to make an app company popular in both markets. After the data cleaning process, we have two data sets containing only free English apps. Using a few helper functions to analyze selected columns, we draw some conclusions about which genres are good candidates for making money on Google Play and the Apple Store.

**Authors:**
- Matheus de Andrade Silva
- Pedro Henrique Alves Cardoso

# 2 Opening and Exploring the Data

To start, let's open both data sets and look at some information about them.

In [0]:
# importing reader function to read csv files
from csv import reader

# opening AppleStore.csv data set (the with block closes the file afterwards)
with open("AppleStore.csv", encoding='utf8') as apple_file:
    appleStore_data = list(reader(apple_file))

# opening googleplaystore.csv data set
with open("googleplaystore.csv", encoding='utf8') as google_file:
    googlePlay_data = list(reader(google_file))
del googlePlay_data[10473] # removing row with wrong data
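
The cell above removes one row with wrong data by its index. As a rough sketch of how such a row might be spotted, assuming the problem is a missing or extra field, the snippet below flags rows whose length differs from the header's. The helper name `find_malformed_rows` and the in-memory sample are hypothetical and are not part of the original data sets.

```python
import csv
import io

def find_malformed_rows(rows):
    """Return the indexes of rows whose length differs from the header's."""
    expected = len(rows[0])
    return [i for i, row in enumerate(rows) if len(row) != expected]

# tiny in-memory stand-in for a csv file (not the real data)
sample_csv = io.StringIO(
    'App,Category,Rating\n'
    'Foo,TOOLS,4.1\n'
    'Bar,1.9\n'        # one field is missing in this row
    'Baz,GAME,4.5\n'
)
rows = list(csv.reader(sample_csv))
print(find_malformed_rows(rows))  # [2]
```

A check like this only catches rows with the wrong number of fields; a row with the right length but wrong values would need a separate test.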

In [0]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))
        

In [3]:
# Apple Store data set information
explore_data(appleStore_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
# Google Play data set information
explore_data(googlePlay_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10841
Number of columns: 13


# 3 Data Cleaning

During data cleaning we remove the applications that could distort the analysis: duplicated apps, non-English apps and non-free apps. This makes the analysis more accurate and gives a clearer picture of the current app market.

In [0]:
# Find duplicated apps and return a dict with their names and number of occurrences
def duplicates_counter(dataset, index=0):
  duplicates_count = {}
  unique_apps = []
  
  for row in dataset[1:]:
    if row[index] in unique_apps:
      if row[index] in duplicates_count:
        duplicates_count[row[index]] += 1
      else:
        duplicates_count[row[index]] = 2
    else:
      unique_apps.append(row[index])
  
  return duplicates_count
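
The function above checks membership in a list, which gets slower as the list grows. A minimal alternative sketch, assuming the same row layout (header in the first row), uses `collections.Counter` from the standard library. The name `duplicates_counter_fast` is hypothetical and the sample rows are made up for illustration.

```python
from collections import Counter

def duplicates_counter_fast(dataset, index=0):
    """Map each repeated name to its number of occurrences (header skipped)."""
    counts = Counter(row[index] for row in dataset[1:])
    return {name: n for name, n in counts.items() if n > 1}

sample = [['App'], ['Slack'], ['Box'], ['Slack'], ['Box'], ['Box'], ['Zoom']]
print(duplicates_counter_fast(sample))  # {'Slack': 2, 'Box': 3}
```

Names that appear only once are left out, as in the original function.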

In [6]:
# showing how many times each duplicate app shows in data set
googleplay_duplicates = duplicates_counter(googlePlay_data, 0)
applestore_duplicates = duplicates_counter(appleStore_data, 1)

print('Google Play apps repetitions: ')
print('Total: ', len(googleplay_duplicates), 'apps')
print(googleplay_duplicates)
print('\n')
print('Apple Store apps repetitions: ')
print('Total: ', len(applestore_duplicates), 'apps')
print(applestore_duplicates)

Google Play apps repetitions: 
Total:  798 apps
{'Quick PDF Scanner + OCR FREE': 3, 'Box': 3, 'Google My Business': 3, 'ZOOM Cloud Meetings': 2, 'join.me - Simple Meetings': 3, 'Zenefits': 2, 'Google Ads': 3, 'Slack': 3, 'FreshBooks Classic': 2, 'Insightly CRM': 2, 'QuickBooks Accounting: Invoicing & Expenses': 3, 'HipChat - Chat Built for Teams': 2, 'Xero Accounting Software': 2, 'MailChimp - Email, Marketing Automation': 2, 'Crew - Free Messaging and Scheduling': 2, 'Asana: organize team projects': 2, 'Google Analytics': 2, 'AdWords Express': 2, 'Accounting App - Zoho Books': 2, 'Invoice & Time Tracking - Zoho': 2, 'Invoice 2go — Professional Invoices and Estimates': 2, 'SignEasy | Sign and Fill PDF and other Documents': 2, 'Genius Scan - PDF Scanner': 2, 'Tiny Scanner - PDF Scanner App': 2, 'Fast Scanner : Free PDF Scan': 2, 'Mobile Doc Scanner (MDScan) Lite': 2, 'TurboScan: scan documents and receipts in PDF': 2, 'Tiny Scanner Pro: PDF Doc Scan': 2, 'Docs To Go™ Free Office Suite':

In [0]:
# create a new data set with no duplicated apps,
# keeping the row with the most reviews for each app
def remove_duplicates(dataset, nameindex=0, reviewindex=3):
  reviews_max = {}
  newdataset = []
  newdataset.append(dataset[0])
  
  for row in dataset[1:]:
    name = row[nameindex]
    # review counts are strings in the csv: compare them as numbers
    review = float(row[reviewindex])
    if name not in reviews_max:
      reviews_max[name] = row
    elif review > float(reviews_max[name][reviewindex]):
      reviews_max[name] = row
         
  for app in reviews_max:
    newdataset.append(reviews_max[app])
        
  return newdataset
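
Review counts are read from the csv files as strings, and Python compares strings character by character, so `'99' > '100'` is true. The sketch below, which uses a hypothetical helper `keep_most_reviewed` and made-up rows, shows the effect of converting the counts to numbers before comparing them.

```python
def keep_most_reviewed(dataset, name_index, review_index):
    """Keep one row per name: the one with the highest numeric review count."""
    best = {}
    for row in dataset[1:]:
        name = row[name_index]
        reviews = float(row[review_index])
        if name not in best or reviews > float(best[name][review_index]):
            best[name] = row
    return [dataset[0]] + list(best.values())

sample = [
    ['App', 'Reviews'],
    ['Box', '99'],
    ['Box', '100'],
    ['Zoom', '5'],
]
print('99' > '100')  # True: strings are compared character by character
print(keep_most_reviewed(sample, 0, 1))
# [['App', 'Reviews'], ['Box', '100'], ['Zoom', '5']]
```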

In [8]:
# deleting all duplicated values
googlePlay_data = remove_duplicates(googlePlay_data, 0, 3)
appleStore_data = remove_duplicates(appleStore_data, 1, 5)

# verifying that no duplicates remain
googleplay_duplicates = duplicates_counter(googlePlay_data, 0)
applestore_duplicates = duplicates_counter(appleStore_data, 1)

print('Google Play apps repetitions: ')
print('Total: ', len(googleplay_duplicates), 'apps')
print(googleplay_duplicates)
print('\n')
print('Apple Store apps repetitions: ')
print('Total: ', len(applestore_duplicates), 'apps')
print(applestore_duplicates)

print('----------------------------------------')
# showing current number of rows of data sets
print('CURRENT DATA SET SIZE (rows):')
print('Google Play:', len(googlePlay_data))
print('Apple Store:',len(appleStore_data))

Google Play apps repetitions: 
Total:  0 apps
{}


Apple Store apps repetitions: 
Total:  0 apps
{}
----------------------------------------
CURRENT DATA SET SIZE (rows):
Google Play: 9660
Apple Store: 7196


In [0]:

# return True if the name looks English (at most 3 non-ASCII characters)
def check_name(name):
  non_ascii = 0
  for letter in name:
    if ord(letter) > 127:
      non_ascii += 1
  return non_ascii <= 3


# putting all English apps in a new list
def remove_nonenglish(dataset, index):
  newdataset = []
  newdataset.append(dataset[0])
  
  for row in dataset[1:]:
    if check_name(row[index]):
      newdataset.append(row)
  return newdataset


In [10]:
# removing non-english apps
googlePlay_data = remove_nonenglish(googlePlay_data, 0)
appleStore_data = remove_nonenglish(appleStore_data, 1)

# showing current number of rows of data sets
print('CURRENT DATA SET SIZE (rows):')
print('Google Play:', len(googlePlay_data))
print('Apple Store:',len(appleStore_data))

CURRENT DATA SET SIZE (rows):
Google Play: 9615
Apple Store: 6182


In [0]:
# create a new data set with only the free apps
def remove_nonfree(dataset, priceindex):
  newdataset = []
  newdataset.append(dataset[0])
  freeNames = [0.0, '0.0', 0, '0']
  
  for row in dataset[1:]:
    if row[priceindex] in freeNames:
      newdataset.append(row)
  
  return newdataset
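
The function above matches the price against a fixed list of spellings. A hedged alternative sketch parses the price text as a number instead, assuming paid prices may carry a leading `$` (for example `'$4.99'`), which is an assumption about the data rather than something shown in this notebook. The helper name `is_free` is hypothetical.

```python
def is_free(price_text):
    """True if the price text parses to zero; unparseable text counts as not free."""
    cleaned = price_text.strip().lstrip('$')
    try:
        return float(cleaned) == 0.0
    except ValueError:
        return False

print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('0.00'))   # True, although it is not in a fixed list of spellings
print(is_free('$4.99'))  # False
```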

In [12]:
#removing non-free apps
googlePlay_data = remove_nonfree(googlePlay_data, 7)
appleStore_data = remove_nonfree(appleStore_data, 4)

# showing current number of rows of data sets
print('CURRENT DATA SET SIZE (rows):')
print('Google Play:', len(googlePlay_data))
print('Apple Store:',len(appleStore_data))

CURRENT DATA SET SIZE (rows):
Google Play: 8863
Apple Store: 3221


# 4 Data Analysis

In this section we carry out the actual analysis, showing the number of apps in each genre and which genre is the most popular in each data set.

In [0]:
# generate a frequency table of any column
def freq_table(dataset, index):
  freq_dict = {}
  
  for row in dataset[1:]:
    key = row[index]
    if key in freq_dict:
      freq_dict[key] += 1
    else:
      freq_dict[key] = 1
  
  return freq_dict


# show a frequency table of any column
def display_table(dataset, index):
    table = freq_table(dataset, index)
    table_display = []
    for key in table:
        key_val_as_tuple = (table[key], key)
        table_display.append(key_val_as_tuple)

    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted:
        print(entry[1], ':', entry[0])
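
The two data sets have different sizes, so raw counts are hard to compare between stores. As a sketch under that motivation, the variant below returns each value's share of the data set as a percentage. The name `freq_table_percent` and the sample rows are hypothetical, and the data set is assumed to contain at least one row after the header.

```python
def freq_table_percent(dataset, index):
    """Return {value: percentage of rows}, skipping the header row."""
    counts = {}
    for row in dataset[1:]:
        counts[row[index]] = counts.get(row[index], 0) + 1
    total = len(dataset) - 1  # number of rows without the header
    return {key: 100 * n / total for key, n in counts.items()}

sample = [['name', 'genre'],
          ['a', 'Games'], ['b', 'Games'], ['c', 'Music'], ['d', 'Games']]
shares = freq_table_percent(sample, 1)
for genre, pct in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(genre, ':', pct)
# Games : 75.0
# Music : 25.0
```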

## Analyzing Apple Store Data Set

**What do we have here?**

We can see that Games is the most common genre among free English Apple Store apps, with 1872 of the 3220 apps (about 58%). The runner-up is Entertainment. Overall, for-fun apps dominate, followed by photo, education and social networking apps.
We can recommend building for-fun apps for the App Store, since this genre has a large number of apps, although that also means a competitive market.

Let's see a frequency table of the prime_genre column:

In [15]:
display_table(appleStore_data, 11)

Games : 1872
Entertainment : 254
Photo & Video : 160
Education : 118
Social Networking : 106
Shopping : 84
Utilities : 81
Sports : 69
Music : 66
Health & Fitness : 65
Productivity : 56
Lifestyle : 51
News : 43
Travel : 40
Finance : 36
Weather : 28
Food & Drink : 26
Reference : 18
Business : 17
Book : 14
Navigation : 6
Medical : 6
Catalogs : 4


## Analyzing Google Play Data Set

**What do we have here?**

Analyzing the genre and category columns of the Google Play data set, we see Tools as the most common genre and Family as the most common category. Overall, tools and entertainment apps make up the biggest part of Google Play. Compared with the Apple Store, Google Play has a much bigger market for tools apps: Tools holds about 8% of the Google Play apps, roughly 3x the share of the comparable Utilities genre in the Apple Store (about 2.5%). The Apple Store, however, is still the better market for Games.

Let's see the Category column:

In [16]:
display_table(googlePlay_data, 1)

FAMILY : 1678
GAME : 859
TOOLS : 749
BUSINESS : 407
LIFESTYLE : 346
PRODUCTIVITY : 345
FINANCE : 328
MEDICAL : 312
SPORTS : 301
PERSONALIZATION : 294
COMMUNICATION : 287
HEALTH_AND_FITNESS : 273
PHOTOGRAPHY : 261
NEWS_AND_MAGAZINES : 248
SOCIAL : 236
TRAVEL_AND_LOCAL : 207
SHOPPING : 199
BOOKS_AND_REFERENCE : 190
DATING : 165
VIDEO_PLAYERS : 159
MAPS_AND_NAVIGATION : 124
FOOD_AND_DRINK : 110
EDUCATION : 104
ENTERTAINMENT : 85
LIBRARIES_AND_DEMO : 83
AUTO_AND_VEHICLES : 82
HOUSE_AND_HOME : 73
WEATHER : 71
EVENTS : 63
PARENTING : 58
ART_AND_DESIGN : 57
COMICS : 55
BEAUTY : 53


Let's see the Genres column:

In [17]:
display_table(googlePlay_data, 9)

Tools : 748
Entertainment : 538
Education : 474
Business : 407
Productivity : 345
Lifestyle : 345
Finance : 328
Medical : 312
Sports : 307
Personalization : 294
Communication : 287
Action : 275
Health & Fitness : 273
Photography : 261
News & Magazines : 248
Social : 236
Travel & Local : 206
Shopping : 199
Books & Reference : 190
Simulation : 181
Dating : 165
Arcade : 164
Video Players & Editors : 157
Casual : 155
Maps & Navigation : 124
Food & Drink : 110
Puzzle : 100
Racing : 88
Role Playing : 83
Libraries & Demo : 83
Auto & Vehicles : 82
Strategy : 81
House & Home : 73
Weather : 71
Events : 63
Adventure : 60
Comics : 54
Beauty : 53
Art & Design : 53
Parenting : 44
Card : 39
Casino : 38
Trivia : 37
Educational;Education : 35
Educational : 33
Board : 33
Education;Education : 30
Word : 23
Casual;Pretend Play : 21
Music : 18
Puzzle;Brain Games : 16
Racing;Action & Adventure : 15
Entertainment;Music & Video : 15
Casual;Brain Games : 12
Casual;Action & Adventure : 12
Arcade;Action & Advent

## Analyzing Apple Most Popular Apps

In [0]:
# print each genre with its average number of user ratings, highest first
def most_popular_apple(dataset, index):
  genre_counts = freq_table(dataset, index)
  averages = []

  for genre in genre_counts:
    total = 0

    for row in dataset[1:]:
      if row[index] == genre:
        total += float(row[5])  # rating_count_tot column

    average = total / genre_counts[genre]
    averages.append((average, genre))

  for average, genre in sorted(averages, reverse=True):
    print(genre, ':', average)
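
The function above walks through the whole data set once per genre. A minimal sketch of a single-pass alternative, assuming the same row layout, accumulates a running total and a count for each key. The helper name `average_by_key` and the sample rows are hypothetical.

```python
from collections import defaultdict

def average_by_key(dataset, key_index, value_index):
    """Average the numeric column value_index for each value of key_index."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in dataset[1:]:  # skip the header row
        totals[row[key_index]] += float(row[value_index])
        counts[row[key_index]] += 1
    return {key: totals[key] / counts[key] for key in totals}

sample = [['genre', 'rating_count_tot'],
          ['Games', '100'], ['Music', '30'], ['Games', '300']]
averages = average_by_key(sample, 0, 1)
print(averages)  # {'Games': 200.0, 'Music': 30.0}
```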

**What do we have here?**

This analysis shows that Navigation, Reference and Social Networking are the best Apple Store genres for what developers are looking for, since they have the highest average number of user ratings per app, which we use as a proxy for the number of users.

Below is the list of Apple Store genres, ordered by average number of user ratings per app:

In [19]:
most_popular_apple(appleStore_data,11)

Navigation : 86090.33333333333
Reference : 74942.11111111111
Social Networking : 71548.34905660378
Music : 57326.530303030304
Weather : 52279.892857142855
Book : 39758.5
Food & Drink : 33333.92307692308
Finance : 31467.944444444445
Photo & Video : 28441.54375
Travel : 28243.8
Shopping : 26919.690476190477
Health & Fitness : 23298.015384615384
Sports : 23008.898550724636
Games : 22812.903311965812
News : 21248.023255813954
Productivity : 21028.410714285714
Utilities : 18684.456790123455
Lifestyle : 16485.764705882353
Entertainment : 14029.830708661417
Business : 7491.117647058823
Education : 7003.983050847458
Catalogs : 4004.0
Medical : 612.0


## Analyzing Google Play Most Popular Apps

In [0]:
# print each category with its average number of installs, highest first
def most_popular_google(dataset, index):
  category_counts = freq_table(dataset, index)
  averages = []

  for category in category_counts:
    total = 0

    for row in dataset[1:]:
      if row[index] == category:
        # Installs looks like '10,000+': drop '+' and ',' before converting
        installs = row[5]
        installs = installs.replace('+', '')
        installs = installs.replace(',', '')
        total += float(installs)

    average = total / category_counts[category]
    averages.append((average, category))

  for average, category in sorted(averages, reverse=True):
    print(category, ':', average)
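
The Installs column holds text such as `'10,000+'`, as seen in the sample row printed earlier. The small sketch below isolates the conversion step; the helper name `parse_installs` is hypothetical. Because the `+` suggests an open-ended bucket, the parsed number is probably best read as a lower bound rather than an exact install count.

```python
def parse_installs(text):
    """Convert an Installs value such as '10,000+' to an integer."""
    return int(text.replace(',', '').replace('+', ''))

print(parse_installs('10,000+'))         # 10000
print(parse_installs('1,000,000,000+'))  # 1000000000
print(parse_installs('0'))               # 0
```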
     

**What do we have here?**

This analysis shows the most popular categories on Google Play, measured by the average number of installs per app. Our recommendation for developers is to build Communication, Video Players and Social apps.

The table with the most popular categories on Google Play:

In [21]:
most_popular_google(googlePlay_data,1)

COMMUNICATION : 38456119.167247385
VIDEO_PLAYERS : 24727872.452830188
SOCIAL : 23253652.127118643
PHOTOGRAPHY : 17805627.643678162
PRODUCTIVITY : 16787331.344927534
GAME : 15560965.599534342
TRAVEL_AND_LOCAL : 13984077.710144928
ENTERTAINMENT : 11640705.88235294
TOOLS : 10682301.033377837
NEWS_AND_MAGAZINES : 9549178.467741935
BOOKS_AND_REFERENCE : 8767811.894736841
SHOPPING : 7036877.311557789
PERSONALIZATION : 5201482.6122448975
WEATHER : 5074486.197183099
HEALTH_AND_FITNESS : 4188821.9853479853
MAPS_AND_NAVIGATION : 4056941.7741935486
FAMILY : 3694276.334922527
SPORTS : 3638640.1428571427
ART_AND_DESIGN : 1986335.0877192982
FOOD_AND_DRINK : 1924897.7363636363
EDUCATION : 1820673.076923077
BUSINESS : 1712290.1474201474
LIFESTYLE : 1437816.2687861272
FINANCE : 1387692.475609756
HOUSE_AND_HOME : 1331540.5616438356
DATING : 854028.8303030303
COMICS : 817657.2727272727
AUTO_AND_VEHICLES : 647317.8170731707
LIBRARIES_AND_DEMO : 638503.734939759
PARENTING : 542603.6206896552
BEAUTY : 51315