<h1> Exporing Data From the Google Play and App Store</h1>

<p> This project is to expore data from the Google Play Store and the App store and look for patterns.  Data are originally taken from Kaggle, I would include the information below. <p>

<br> Links:
<br> Google Play: https://www.kaggle.com/lava18/google-play-store-apps
<br> App Store: https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps



<p><b>Exploring and Cleaning Data</b></p>

In [2]:
opened_file = open("googleplaystore.csv")
from csv import reader
read_file = reader(opened_file)
play_store_data = list(read_file)

opened_file = open("AppleStore.csv")
from csv import reader
read_file = reader(opened_file)
app_store_data = list(read_file)

#Function to show an overview of the datasets
def explore_data(dataset, start, end, rows_and_columns=True):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))


In [3]:
# Exploring data from the Play Store
explore_data(play_store_data,0,3)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [4]:
# Exploring data from the App Store
explore_data(app_store_data,0,3)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


Number of rows: 7198
Number of columns: 16


It looks like the following columns may be helpful from the Play Store:
'App', 'Category', 'Rating', 'Reviews', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated'

And From the App Store:
'id', 'track_name', 'price', 'rating_count_tot', 'user_rating', 'prime_genre'

<p><b> Cleaning the Dataset </b></p>

In [5]:
for i, row in enumerate(play_store_data[1:]):
    if float(row[2])>5 or float(row[2])<0:
        print ("Deleting row at index:", i)
        del play_store_data[i]
    else:
        row[2]=float(row[2])

Deleting row at index: 10472


<p>One row was deleted because the rating was incorrect.  Now we will check if there are duplicate names in the dataset, I will remove the one with less ratings (meaning that this data isn't as recent):</p>

In [6]:
# Making sure that the number of ratings data is in int form
for i, row in enumerate(play_store_data):
    if row[3].isdigit()==False:
        print (i, row)

0 ['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']
10472 ['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [7]:
#Changing this index position to have 3,000,000 ratings
play_store_data[10472][3]='3000000'
print(play_store_data[10472])

# Sorting by the largest number of ratings
play_store_data[1:].sort(key=lambda x: int(x[3]))

#Checked that data were sorted- but don't actually want to print them
# print(play_store_data[1])
# print(play_store_data.pop())

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3000000', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


The following checks if an app name is in an app name set.  If it isn't, it adds
the name to the set.  If it is in the set, it skips this entry.  This works 
because the data were sorted above. Only the largest number of ratings will 
remain. While iterating, I also add the name of the app to a dictionary with the maximum number of ratings.

In [8]:
#Used for the maximum number of reviews
reviews_max={}
app_name_set=set(play_store_data[0][0])
play_removed_duplicate_list=[play_store_data[0]]
length_before_removal= len(play_store_data)

while len(play_store_data)>1:
    app= play_store_data.pop()
    if app[0] in reviews_max:
        next
    
    else:
        app_name_set.add(app[0])
        play_removed_duplicate_list.append(app)
        reviews_max[app[0]]= app[3]
        
num_duplicates_removed= length_before_removal-len(play_removed_duplicate_list)
print("The follwing number of duplicates were removed:", num_duplicates_removed)
print("The length of the reviews_max dictionary is:", len(reviews_max))

The follwing number of duplicates were removed: 1181
The length of the reviews_max dictionary is: 9659


The next step is to check for non-english apps.  I write a function to check if an app has any characters that are non-englist.  I will try a few tests below as well to make sure my function works as I expect.

In [9]:
def check_all_eng_char(check_string):
    i=0
    for letter in check_string:
        if ord(letter)>127:
            i+=1
            if i>3:
                return False
    return True

#'Instagram' Expect True:
print(check_all_eng_char('Instagram'))

# '爱奇艺PPS -《欢乐颂2》电视剧热播': Expect False
print(check_all_eng_char('爱奇艺PPS -《欢乐颂2》电视剧热播'))

# 'Docs To Go™ Free Office Suite': Expect True- only one character
print(check_all_eng_char('Docs To Go™ Free Office Suite'))

# 'Instachat 😜': Expect True- only one character
print(check_all_eng_char('Instachat 😜'))


True
False
True
True


In the google play data, how many rows remain if we remove the non-english apps?

In [10]:
eng_apps_play=[]
eng_apps_appstore=[]
num_eng_apps=0
for app in play_removed_duplicate_list:
    if check_all_eng_char(app[0])==True:
        eng_apps_play.append(app)
        num_eng_apps+=1
print("There are this number of apps with Eng titles in the google play store:", num_eng_apps)

appstore_num_eng_apps=0
for app in app_store_data:
    if check_all_eng_char(app[1])==True:
        eng_apps_appstore.append(app)
        appstore_num_eng_apps+=1
        
print("There are this number of apps with Eng titles in the app store:", appstore_num_eng_apps)


There are this number of apps with Eng titles in the google play store: 9615
There are this number of apps with Eng titles in the app store: 6184


Removing Non-free apps: We only wish to explore free apps, the next step will be to remove all non-free apps from each dataset:
    

In [11]:
free_eng_apps_play=[eng_apps_play[1]]
free_eng_apps_appstore=[eng_apps_appstore[1]]

for app in eng_apps_play[1:]:
    if app[7]=='0':
        free_eng_apps_play.append(app)
print("Play: there are this num of apps that are in english and free:", (len(free_eng_apps_play)-1))

for app in eng_apps_appstore[1:]:
      if app[4]=='0.0':
        free_eng_apps_appstore.append(app)
print("Appstore: there are this num of apps that are in english and free:", (len(free_eng_apps_appstore)-1))



Play: there are this num of apps that are in english and free: 8863
Appstore: there are this num of apps that are in english and free: 3222


<h2> Data Analysis</h2>

We wish to explore which apps are most popular, and will start by generating a frequency table by genera.

In [12]:
def generate_freq_tbl(a_data_set, gen_idx):
    frequency_table = {}
    table_display = []
    
    for row in a_data_set[1:]:
        a_data_point = row[gen_idx]
        if a_data_point in frequency_table:
            frequency_table[a_data_point] += 1
        else:
            frequency_table[a_data_point] = 1
    for key in frequency_table.keys():
        frequency_table[key]= frequency_table[key]/(len(a_data_set)-1)
        key_val_as_tuple = (frequency_table[key], key)
        table_display.append(key_val_as_tuple)
    
    table_sorted = sorted(table_display, reverse = True)
    for entry in table_sorted[:6]:
        print(entry[1], ':', entry[0])
        
    return frequency_table

print("Play Store Catagories")
play_freq_tbl_cat= generate_freq_tbl(free_eng_apps_play, 1)

print("")
print("Play Store Genera")
play_freq_tbl_gen= generate_freq_tbl(free_eng_apps_play, 9)

print("")
print("AppStore Genera")
app_freq_tbl= generate_freq_tbl(free_eng_apps_appstore, 11)

Play Store Catagories
FAMILY : 0.19259844296513595
GAME : 0.09500169242920005
TOOLS : 0.08462146000225657
BUSINESS : 0.04580841701455489
LIFESTYLE : 0.03903870021437437
PRODUCTIVITY : 0.038925871601038026

Play Store Genera
Tools : 0.08450863138892023
Entertainment : 0.06070179397495205
Education : 0.053480762721426156
Business : 0.04580841701455489
Productivity : 0.038925871601038026
Lifestyle : 0.038925871601038026

AppStore Genera
Games : 0.5816263190564867
Entertainment : 0.07883302296710118
Photo & Video : 0.04965859714463067
Education : 0.03662321539416512
Social Networking : 0.032898820608317815
Shopping : 0.0260707635009311


<h2> Analysis of Frequency Tables</h2>

<h4>AppStore Analysis</h4>
The most common genera in the AppStore were "Games"(58%), "Entertainment" (8%) and "Photo and Video" (5%).  Catagories that were more entertaining seemed to dominate in populated in the AppStore.  We can't tell how much these apps were used, and if they were highly reviewed.

<h4>PlayStore Analysis</h4>
The PlayStore has two different catagories that could be considered "generas" the "Catagories" and "Genera" options.  "Family" apps seem to dominate the PlayStore
The most common genera in the AppStore were "Games"(58%), "Entertainment" (8%) and "Photo and Video" (5%).  Catagories that were more entertaining seemed to dominate in populated in the AppStore.  We can't tell how much these apps were used, and if they were highly reviewed.

In [13]:
for genre in app_freq_tbl:
    total = 0
    len_genre = 0
    for app in free_eng_apps_appstore:
        genre_app = app[-5]
        if genre_app == genre:            
            n_ratings = float(app[5])
            total += n_ratings
            len_genre += 1
    avg_n_ratings = total / len_genre
    print(genre, ':', avg_n_ratings)

Photo & Video : 28441.54375
Travel : 28243.8
Navigation : 86090.33333333333
Productivity : 21028.410714285714
Medical : 612.0
Music : 57326.530303030304
Games : 22788.6696905016
Health & Fitness : 23298.015384615384
Entertainment : 14029.830708661417
Business : 7491.117647058823
Food & Drink : 33333.92307692308
News : 21248.023255813954
Book : 39758.5
Catalogs : 4004.0
Lifestyle : 16485.764705882353
Sports : 23008.898550724636
Finance : 31467.944444444445
Social Networking : 98680.3831775701
Education : 7003.983050847458
Weather : 52279.892857142855
Reference : 74942.11111111111
Utilities : 18684.456790123455
Shopping : 26919.690476190477


Navigation has the highest number of reviews, followed by reference apps.  More popular app types could have an oversaturation of the market, and it may be helpful to produce an app in a genere that is growing.

In [14]:
for catagory in play_freq_tbl_cat:
    total = 0
    len_catagory = 0
    for app in free_eng_apps_play:
        catagory_app = app[1]
        if catagory_app==catagory:
            n_ratings = app[5].replace("+","")
            n_ratings=n_ratings.replace(",","")
            total += float(n_ratings)
            len_catagory += 1
    avg_n_ratings = total / len_catagory
    print(catagory, ':', avg_n_ratings)
            

ART_AND_DESIGN : 1986335.0877192982
LIBRARIES_AND_DEMO : 638503.734939759
TRAVEL_AND_LOCAL : 13984077.710144928
DATING : 854028.8303030303
PHOTOGRAPHY : 17840110.40229885
BEAUTY : 513151.88679245283
COMMUNICATION : 38326063.197916664
FAMILY : 5180430.984182777
ENTERTAINMENT : 9200779.220779222
PERSONALIZATION : 5218893.815699658
SOCIAL : 23253652.127118643
EDUCATION : 1776262.6262626264
WEATHER : 5074486.197183099
BOOKS_AND_REFERENCE : 8767811.894736841
EVENTS : 253542.22222222222
VIDEO_PLAYERS : 24790074.17721519
NEWS_AND_MAGAZINES : 9549178.467741935
MEDICAL : 123064.7898089172
HOUSE_AND_HOME : 1331540.5616438356
LIFESTYLE : 1462491.1498559078
MAPS_AND_NAVIGATION : 4056941.7741935486
BUSINESS : 1704192.3399014778
PRODUCTIVITY : 16772838.591304347
HEALTH_AND_FITNESS : 4167457.3602941176
SHOPPING : 7036877.311557789
SPORTS : 4274688.722772277
AUTO_AND_VEHICLES : 647317.8170731707
PARENTING : 542603.6206896552
FOOD_AND_DRINK : 1924897.7363636363
TOOLS : 10801391.298666667
COMICS : 81765

Communications apps have the largest number of installations.  It looks like the data may be heavily skewed towards apps that have over 1M installations such as WhatsApp.  We haven't analized ratings of apps.  It would be helpful in the future to understand if free apps are more highly rated, and how there may be room in the market of less popular catagories for apps.