# Analyzing Mobile App Data

In this *guided* project, I'll pretend I'm working as a data analyst for a company that builds Android and iOS mobile apps. The company makes its apps available on Google Play and the App Store.

The company only builds apps that are free to download and install, and its main source of revenue is in-app ads. This means the revenue for any given app is mostly influenced by the number of users who use it: the more users who see and engage with the ads, the better.

The goal of this project is to analyze data to help the developers understand what types of apps are likely to attract more users.

## Load Data

In [1]:
# Load a CSV file and return its rows as a list of lists
from csv import reader

def load_data(data_csv):
    # The with-block closes the file once all rows have been read
    with open(data_csv, encoding="utf8") as opened_file:
        return list(reader(opened_file))

apple_data = load_data("AppleStore.csv")
google_data = load_data("googleplaystore.csv")

## Explore Data

In [2]:
#data explore function
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n') # adds a new (empty) line after each row

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [7]:
explore_data(google_data, 0, 2, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


Number of rows: 10842
Number of columns: 13


In [13]:
explore_data(apple_data, 0, 2, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


Number of rows: 7198
Number of columns: 16


## Clean Data: google_data

Dataset discussion: https://www.kaggle.com/lava18/google-play-store-apps/discussion/66015

In google_data:
1. The discussion linked above reports that the entry printed below (index 10473) is incomplete: it is missing its 'Category' value, so I deleted it.
2. Some apps, such as Instagram, have duplicate entries. I removed the duplicates and kept only the most recent row.

In [10]:
print(google_data[10473])

['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


In [11]:
del(google_data[10473])
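
A row like this can also be spotted without relying on the discussion thread, by comparing each row's length with the header's. The following is a minimal sketch on made-up rows; `find_malformed_rows` and the sample data are hypothetical names introduced for illustration and are not part of the original notebook.

```python
def find_malformed_rows(dataset):
    """Return (index, row) pairs whose length differs from the header row's."""
    header_length = len(dataset[0])
    return [
        (index, row)
        for index, row in enumerate(dataset)
        if len(row) != header_length
    ]

# Hypothetical miniature data set: a header plus two rows, one missing a column.
sample = [
    ['App', 'Category', 'Rating'],
    ['Good App', 'TOOLS', '4.5'],
    ['Broken App', '1.9'],  # 'Category' is missing, so the row is one column short
]

malformed = find_malformed_rows(sample)
print(malformed)
```

Run against `google_data` before the deletion, a check of this kind would be expected to flag index 10473, since the row printed above has 12 values while the header has 13.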

### Separating Duplicates from the Main Data Set

In [18]:
duplicate_apps = []
unique_apps = []

for row in google_data[1:]:
    app_name = row[0]
    if app_name in unique_apps:
        duplicate_apps.append(app_name)
    else:
        unique_apps.append(app_name)
        
print('Number of duplicates: ', len(duplicate_apps))
print('\n Number of rows left: ', len(unique_apps))
print('\n example: ', duplicate_apps[0:4])



Number of duplicates:  1181

 Number of rows left:  9659

 example:  ['Quick PDF Scanner + OCR FREE', 'Box', 'Google My Business', 'ZOOM Cloud Meetings']
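
The loop above tests membership in a list, which gets slower as the list grows. As an alternative sketch (not the method used above), `collections.Counter` from the standard library can count how often each name appears. The sample names below are made up for illustration.

```python
from collections import Counter

# Hypothetical app names; 'Box' appears three times and 'Maps' twice.
names = ['Box', 'Maps', 'Box', 'Notes', 'Box', 'Maps']

name_counts = Counter(names)

# Every appearance after the first one counts as a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())
n_unique = len(name_counts)

print('Number of duplicates:', n_duplicates)
print('Number of unique apps:', n_unique)
```

With the real data, the list of names could be built as `[row[0] for row in google_data[1:]]`.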


#### Checking How Many Duplicates of 'Box' We Have in the Data Set

In [17]:
for row in google_data[1:]:
    app_name = row[0]
    if app_name == 'Box':
        print(row)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


The output above shows that the 'Box' app has three entries. Which of them should be removed?

The three 'Box' rows are identical, so it doesn't matter which one is kept. Where duplicate rows do differ, the main difference is in the fourth position of each row, which corresponds to the number of reviews. Different numbers show that the data was collected at different times.

We can use this information to build a criterion for removing the duplicates: the higher the number of reviews, the more recent the data should be, so for each app we keep the row with the highest number of reviews.

In [22]:
reviews_max ={}
for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if name in reviews_max and reviews_max[name] < n_reviews:
        reviews_max[name] = n_reviews
    elif name not in reviews_max:
        reviews_max[name] = n_reviews

len(reviews_max) #9659 expected

9659

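To make sure we keep exactly one row per app, the loop below builds the cleaned data set: a row is added only if its review count matches the maximum recorded in `reviews_max` and the app has not already been added.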

In [24]:
google_data_clean = []
already_added = [] # serves as a checklist 

for row in google_data[1:]:
    name = row[0]
    n_reviews = float(row[3])
    if n_reviews == reviews_max[name] and name not in already_added:
        google_data_clean.append(row)
        already_added.append(name)
        
len(reviews_max) == len(google_data_clean) # True means all 9659 apps were collected

True
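
The two passes above can be checked on a small made-up data set. The sketch below wraps the same logic in a hypothetical helper, `keep_most_reviewed`, and uses a set for the checklist so that the membership test stays fast. The rows and column positions are invented for illustration.

```python
def keep_most_reviewed(rows, name_index=0, reviews_index=3):
    """Keep one row per app name: the first row holding the highest review count."""
    # First pass: record the highest review count seen for each name.
    reviews_max = {}
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if name not in reviews_max or reviews_max[name] < n_reviews:
            reviews_max[name] = n_reviews

    # Second pass: keep the first row that matches that maximum.
    clean = []
    already_added = set()
    for row in rows:
        name = row[name_index]
        n_reviews = float(row[reviews_index])
        if n_reviews == reviews_max[name] and name not in already_added:
            clean.append(row)
            already_added.add(name)
    return clean

# Hypothetical rows: name, category, rating, reviews.
rows = [
    ['Chat', 'SOCIAL', '4.5', '100'],
    ['Chat', 'SOCIAL', '4.5', '250'],
    ['Box', 'BUSINESS', '4.2', '80'],
    ['Box', 'BUSINESS', '4.2', '80'],  # identical duplicate
]

clean_rows = keep_most_reviewed(rows)
print(clean_rows)
```

The `already_added` check matters for identical duplicates such as 'Box': without it, every copy sharing the maximum review count would be kept.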

In [26]:
explore_data(google_data_clean, 0, 2, True)
# the clean data doesn't include the header row

['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


Number of rows: 9659
Number of columns: 13


### Removing Non-English Apps

Characters commonly used in English text have code points in the ASCII range, 0 to 127.

In [48]:
def english_only(string):
    for i in string:
        if ord(i) > 127:  # ord() returns the character's Unicode code point
            return False
    return True
        
english_only('Docs To Go™ Free Office Suite')
#english_only('我爱中国')
#english_only('Instachat 😜')

False

Emojis and symbols such as ™ fall outside the ASCII range (their code points are greater than 127), so this function would discard many English apps whose names contain such characters. We can reduce this effect by removing an app only if its name has more than 3 characters outside the ASCII range. The `english_only()` function is rewritten below accordingly.

In [55]:
def english_only(string):
    count = 0
    for i in string:
        if ord(i) > 127:  # ord() returns the character's Unicode code point
            count += 1
            if count > 3:  # more than 3 non-ASCII characters
                return False
    return True

#english_only('Docs To Go™ Free Office Suite')
#english_only('我爱中国')
english_only('Instachat 😜')

True
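
Once the check behaves as intended, it can be applied to every row. The sketch below is one way to do that, under the assumption that a name is dropped only when it has more than three characters outside the ASCII range. `is_probably_english`, `filter_english`, and the sample rows are hypothetical names introduced here so the block runs on its own.

```python
def is_probably_english(name, max_non_ascii=3):
    """Return True unless more than max_non_ascii characters fall outside 0-127."""
    non_ascii = sum(1 for character in name if ord(character) > 127)
    return non_ascii <= max_non_ascii

def filter_english(rows, name_index):
    """Keep the rows whose app name passes is_probably_english()."""
    return [row for row in rows if is_probably_english(row[name_index])]

# Hypothetical rows with the app name in column 0.
sample_rows = [
    ['Docs To Go™ Free Office Suite', 'BUSINESS'],
    ['Instachat 😜', 'SOCIAL'],
    ['我爱中国软件', 'TOOLS'],
]

english_rows = filter_english(sample_rows, name_index=0)
print(len(english_rows))
```

In the data sets loaded earlier, the headers suggest the name sits at index 0 for Google Play (`App`) and at index 1 for the App Store (`track_name`), so `name_index` would need to be set accordingly.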

## Clean Data: apple_data

https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/discussion
In apple_data:
1. No issues found so far.