Our aim is to help our developers understand which types of apps are likely to attract more users on Google Play and the App Store.

### Import Libraries:

In [189]:
import pandas as pd
import numpy as np

### Open and Explore Data: 

In [20]:
google_store_data = pd.read_csv('/Users/khevnaparikh/Desktop/Apps/googleplaystore.csv')
apple_store_data = pd.read_csv('/Users/khevnaparikh/Desktop/Apps/AppleStore.csv')

In [21]:
apple_store_data.head()

Unnamed: 0,id,track_name,size_bytes,currency,price,rating_count_tot,rating_count_ver,user_rating,user_rating_ver,ver,cont_rating,prime_genre,sup_devices.num,ipadSc_urls.num,lang.num,vpp_lic
0,284882215,Facebook,389879808,USD,0.0,2974676,212,3.5,3.5,95.0,4+,Social Networking,37,1,29,1
1,389801252,Instagram,113954816,USD,0.0,2161558,1289,4.5,4.0,10.23,12+,Photo & Video,37,0,29,1
2,529479190,Clash of Clans,116476928,USD,0.0,2130805,579,4.5,4.5,9.24.12,9+,Games,38,5,18,1
3,420009108,Temple Run,65921024,USD,0.0,1724546,3842,4.5,4.0,1.6.2,9+,Games,40,5,1,1
4,284035177,Pandora - Music & Radio,130242560,USD,0.0,1126879,3594,4.0,4.5,8.4.1,12+,Music,37,4,1,1


In [22]:
google_store_data.head()

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
2,"U Launcher Lite – FREE Live Cool Themes, Hide ...",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
3,Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
4,Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up


### Data Cleaning: 
##### Detect inaccurate data, and correct or remove it.
##### Detect duplicate data, and remove the duplicates.

For the purpose of this project, we will focus only on free apps aimed at an English-speaking audience.

In [23]:
print('Number of rows or apps:', len(google_store_data))
google_store_data.nunique()

Number of rows or apps: 10841


App               9660
Category            34
Rating              40
Reviews           6002
Size               462
Installs            22
Type                 3
Price               93
Content Rating       6
Genres             120
Last Updated      1378
Current Ver       2832
Android Ver         33
dtype: int64

In [24]:
print('Number of rows or apps:', len(apple_store_data))
apple_store_data.nunique()

Number of rows or apps: 7197


id                  7197
track_name          7195
size_bytes          7107
currency               1
price                 36
rating_count_tot    3185
rating_count_ver    1138
user_rating           10
user_rating_ver       10
ver                 1590
cont_rating            4
prime_genre           23
sup_devices.num       20
ipadSc_urls.num        6
lang.num              57
vpp_lic                2
dtype: int64

#### Row 10472 corresponds to the app Life Made WI-Fi Touchscreen Photo Frame, and its rating is 19. This is clearly wrong, because the maximum rating for a Google Play app is 5. As mentioned in the discussions section, the problem is caused by a missing value in the 'Category' column, which shifts every later value in the row one column to the left. As a consequence, we'll delete this row.

In [25]:
google_store_data.loc[[0,10472]]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
0,Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
10472,Life Made WI-Fi Touchscreen Photo Frame,1.9,19.0,3.0M,"1,000+",Free,0,Everyone,,"February 11, 2018",1.0.19,4.0 and up,


In [26]:
google_store_data = google_store_data.drop(labels=10472, axis=0)
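
The out-of-range rating can also be located without knowing the row label in advance. The sketch below is only an illustration: the `sample` frame and its values are made up, not taken from the dataset. It flags any row whose `Rating` exceeds the maximum of 5.

```python
import pandas as pd

# Hypothetical miniature of the Google Play table; the values are made up.
sample = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.1, 19.0, None],
})

# Comparisons with a missing rating evaluate to False, so NaN rows are not flagged.
out_of_range = sample["Rating"] > 5
print(sample[out_of_range])

# Keep everything that was not flagged.
cleaned = sample[~out_of_range]
```

Filtering with a mask like this would also catch any other malformed rows, whereas dropping by label only removes the one we already know about.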

#### For the Google Play Store dataset, there are apps with multiple entries. We will go ahead and delete duplicates. For instance, Facebook has 2 entries:

In [55]:
print(google_store_data.shape)
google_store_data[google_store_data.App == "Facebook"]

(10840, 13)


Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Content Rating,Genres,Last Updated,Current Ver,Android Ver
2544,Facebook,SOCIAL,4.1,78158306,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device
3943,Facebook,SOCIAL,4.1,78128208,Varies with device,"1,000,000,000+",Free,0,Teen,Social,"August 3, 2018",Varies with device,Varies with device


#### The difference between the two rows is the number of reviews. We can use this information to build a criterion for removing the duplicates. The higher the number of reviews, the more recent the data should be. Rather than removing duplicates randomly, we'll only keep the row with the highest number of reviews and remove the other entries for any given app.

In [216]:
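# Keep one row per app: the one with the most reviews.
# 'Reviews' may be stored as text, so sort on its numeric value (needs pandas >= 1.1 for `key`).
google = google_store_data.sort_values('Reviews', key=pd.to_numeric).drop_duplicates(subset = ["App"], keep='last').sort_index()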
print('Number of Apps:', len(google))

apple = apple_store_data.sort_values('rating_count_tot').drop_duplicates(subset = ["track_name"], keep='last').sort_index()
print('Number of Tracks:', len(apple))

Number of Apps: 9659
Number of Tracks: 7195
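
One caveat worth checking when sorting on review counts: if the column is stored as text, the values are ordered alphabetically, so '999' would rank above '1000'. The sketch below uses made-up rows (the `toy` frame is hypothetical) and converts the counts with `pd.to_numeric` before keeping the most-reviewed entry per app.

```python
import pandas as pd

# Made-up rows; the review counts are strings, as they would be in a text column.
toy = pd.DataFrame({
    "App": ["Chat", "Chat", "Maps"],
    "Reviews": ["999", "1000", "50"],
})

# Convert to numbers so that 1000 sorts above 999.
toy["Reviews"] = pd.to_numeric(toy["Reviews"])

# Sort ascending, keep the last (largest) row per app, then restore row order.
deduped = (
    toy.sort_values("Reviews")
       .drop_duplicates(subset=["App"], keep="last")
       .sort_index()
)
print(deduped)
```

Checking `google_store_data['Reviews'].dtype` is a quick way to see whether this conversion matters for the real data.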


#### Remember we use English for the apps we develop at our company, and we'd like to analyze only the apps that are designed for an English-speaking audience. However, if we explore the data long enough, we'll find that both datasets have apps with names that suggest they are not designed for an English-speaking audience.

#### Each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.

#### The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII (American Standard Code for Information Interchange) system. 

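#### Based on this number range, if an app name contains a character whose number is greater than 127, the app probably has a non-English name. Some English app names, however, contain emojis or symbols such as ™ that also fall outside the ASCII range, so to limit data loss the function below only rejects a name when it has more than three such characters. Our app names are stored as strings, so how can we take each individual character of a string and check its corresponding number?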

In [76]:
print(ord('a'))
print(ord("A"))

97
65
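
A string can be iterated over one character at a time, so `ord()` can be applied to each character in turn. A minimal sketch, using a two-character string chosen purely for illustration:

```python
name = "A爱"  # illustrative string: one ASCII and one non-ASCII character

# Looping over a string yields its characters one by one.
for character in name:
    print(character, ord(character))

# Collect the numbers and test whether any falls outside the ASCII range.
code_points = [ord(character) for character in name]
has_non_ascii = any(code > 127 for code in code_points)
```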


In [211]:
def is_english(app_name):
    """Return True if the name has at most three non-ASCII characters."""
    non_ascii = 0

    for character in app_name:
        if ord(character) > 127:
            non_ascii += 1

    return non_ascii <= 3

# Filter with a boolean mask instead of rebuilding each frame row by row.
# This keeps the original column dtypes and avoids hard-coding the result shape
# (9614 x 13 for Google Play and 6181 x 16 for the App Store in the original run).
google_english = google[google["App"].apply(is_english)].reset_index(drop=True)
apple_english = apple[apple["track_name"].apply(is_english)].reset_index(drop=True)
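
To see how the threshold of three non-ASCII characters behaves, here is a small self-contained check. It restates the function in compact form so the block runs on its own, and the sample names are illustrative rather than a claim about which rows the datasets contain.

```python
def is_english(app_name):
    """Return True when the name has at most three non-ASCII characters."""
    non_ascii = sum(1 for character in app_name if ord(character) > 127)
    return non_ascii <= 3

# Illustrative names: plain ASCII, one symbol, one emoji, mostly Chinese.
samples = [
    "Instagram",
    "Docs To Go™ Free Office Suite",
    "Instachat 😜",
    "爱奇艺PPS -《欢乐颂2》电视剧热播",
]

results = {name: is_english(name) for name in samples}
for name, verdict in results.items():
    print(verdict, name)
```

The threshold is a heuristic: a non-English name written with three or fewer non-ASCII characters would still pass, and an English name with four or more emojis would be rejected.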