# Profitable App Profiles for the App Store and Google Play Markets
For this project, we'll pretend we're working as data analysts for a company that builds Android and iOS mobile apps. We make our apps available on Google Play and the App Store.

We only build apps that are free to download and install, and our main source of revenue consists of in-app ads. This means our revenue for any given app is mostly influenced by the number of users who use our app — the more users that see and engage with the ads, the better. Our goal for this project is to analyze data to help our developers understand what type of apps are likely to attract more users.

In [1]:
def explore_data(dataset, start, end, rows_and_columns=False):
    dataset_slice = dataset[start:end]    
    for row in dataset_slice:
        print(row)
        print('\n')

    if rows_and_columns:
        print('Number of rows:', len(dataset))
        print('Number of columns:', len(dataset[0]))

In [2]:
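from csv import reader

# Open each file in a `with` block so it is closed once the rows are read.
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios_data = ios[1: ]

with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]
android_body = android[1: ]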

In [3]:
explore_data(ios, 0, 5, True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0', '2130805', '579', '4.5', '4.5', '9.24.12', '9+', 'Games', '38', '5', '18', '1']


['420009108', 'Temple Run', '65921024', 'USD', '0.0', '1724546', '3842', '4.5', '4.0', '1.6.2', '9+', 'Games', '40', '5', '1', '1']


Number of rows: 7198
Number of columns: 16


In [4]:
explore_data(android, 0, 5, True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,000+', 'Free', '0', 'Everyone', 'Art & Design', 'August 1, 2018', '1.2.4', '4.0.3 and up']


['Sketch - Draw & Paint', 'ART_AND_DESIGN', '4.5', '215644', '25M', '50,000,000+', 'Free', '0', 'Teen', 'Art & Design', 'June 8, 2018', 'Varies with device', '4.2 and up']


Number of rows: 10842
Number of columns: 13


In [5]:
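# Delete the row with missing data (index 10473 counts the header row at index 0).
# Run this cell only once: running it again would delete a different row.
del android[10473]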

In [6]:
len(android)

10841

The Google Play data set contains duplicate entries, so we need to remove them before the analysis.
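As a rough sketch of what counting duplicates means, the example below uses `collections.Counter` on a handful of made-up rows (`sample_rows` is hypothetical, not part of our data). Each name that appears `n` times contributes `n - 1` duplicates.

```python
from collections import Counter

# Made-up sample rows; only the app name (index 0) matters here.
sample_rows = [['Box'], ['Hangouts'], ['Box'], ['Twitter'], ['Box'], ['Hangouts']]

# Count how many times each app name occurs.
name_counts = Counter(row[0] for row in sample_rows)

n_unique = len(name_counts)
# Every occurrence beyond the first one is a duplicate.
n_duplicates = sum(count - 1 for count in name_counts.values())

print(n_unique)      # 3
print(n_duplicates)  # 3
```

The two numbers always add up to the total number of rows, which is a handy sanity check.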

In [7]:
unique_set = []
duplicate_set = []

for app in android[1:]:
    name = app[0]
    if name in unique_set:
        duplicate_set.append(name)
    else:
        unique_set.append(name)

In [8]:
print(len(android[1:]))
print(len(duplicate_set))
print(len(unique_set))

10840
1181
9659


As you can see, there are 1,181 duplicate entries in the Google Play data set. We should remove them so they don't distort our analysis, but which entry should we keep for each app? Should the choice be random, or should it follow some criterion?
Let's look at a couple of apps that have duplicates.

In [9]:
for app in android:
    if app[0] == duplicate_set[5]:
        print(app)

['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']
['Box', 'BUSINESS', '4.2', '159872', 'Varies with device', '10,000,000+', 'Free', '0', 'Everyone', 'Business', 'July 31, 2018', 'Varies with device', 'Varies with device']


In [10]:
for app in android:
    if app[0] == duplicate_set[40]:
        print(app)

['Hangouts', 'COMMUNICATION', '4.0', '3419249', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 21, 2018', 'Varies with device', 'Varies with device']
['Hangouts', 'COMMUNICATION', '4.0', '3419433', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 21, 2018', 'Varies with device', 'Varies with device']
['Hangouts', 'COMMUNICATION', '4.0', '3419513', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 21, 2018', 'Varies with device', 'Varies with device']
['Hangouts', 'COMMUNICATION', '4.0', '3419464', 'Varies with device', '1,000,000,000+', 'Free', '0', 'Everyone', 'Communication', 'July 21, 2018', 'Varies with device', 'Varies with device']


Looking at these duplicates, we can see that some entries are exactly identical, while others differ only in the number of reviews (index 3).
In the first case, we can keep any one of the entries.
In the second case, it makes sense to keep the entry with the largest number of reviews:
the differing counts suggest the data was collected at different times, so the largest count most likely belongs to the most recent snapshot.

### To remove the duplicates, we will:

- Create a dictionary, where each dictionary key is a unique app name and the corresponding dictionary value is the entry (row) with the highest number of reviews for that app.

- Use the information stored in the dictionary and create a new data set, which will have only one entry per app (and for each app, we'll only select the entry with the highest number of reviews).
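The two steps above can be sketched on a few made-up rows before applying them to the real data. This is only an illustration: `sample`, `max_reviews`, `seen`, and `deduplicated` are hypothetical names, and this version stores just the review count in the dictionary.

```python
# Made-up rows shaped like the Google Play data: name at index 0, reviews at index 3.
sample = [
    ['Hangouts', 'COMMUNICATION', '4.0', '3419249'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Hangouts', 'COMMUNICATION', '4.0', '3419513'],
    ['Box', 'BUSINESS', '4.2', '159872'],
    ['Hangouts', 'COMMUNICATION', '4.0', '3419464'],
]

# Step 1: map each app name to its highest review count.
max_reviews = {}
for row in sample:
    name, n_reviews = row[0], float(row[3])
    if name not in max_reviews or n_reviews > max_reviews[name]:
        max_reviews[name] = n_reviews

# Step 2: keep one row per app, the first one that matches the maximum.
deduplicated = []
seen = set()
for row in sample:
    name, n_reviews = row[0], float(row[3])
    if name not in seen and n_reviews == max_reviews[name]:
        deduplicated.append(row)
        seen.add(name)

for row in deduplicated:
    print(row)
```

The `seen` check matters for apps like Box, where several identical rows all match the maximum: without it, every copy would be kept.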

In [11]:
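reviews_max = {}
for app in android[1: ]:
    name = app[0]
    n_reviews = float(app[3])
    # Keep the whole row with the most reviews seen so far for this name.
    # Storing the row, rather than overwriting its reviews field, leaves
    # the rows in `android` unmodified.
    if name not in reviews_max or float(reviews_max[name][3]) < n_reviews:
        reviews_max[name] = app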


print(len(reviews_max))

9659


In [12]:
print(reviews_max['Hangouts'][3])
print(reviews_max['Instagram'][3])

3419513
66577446


In [13]:
android_clean = []
already_added = []
for app in android[1: ]:
    name = app[0]
    n_reviews = float(app[3])
    if name not in already_added and n_reviews == float(reviews_max[name][3]):
        android_clean.append(app)
        already_added.append(name)


In [14]:
print(len(android_clean))

9659


## Removing Non-English apps
Remember that we use English for the apps we develop at our company, and we'd like to analyze only the apps that are directed toward an English-speaking audience. However, if we explore the data long enough, we'll find that both data sets contain apps whose names suggest they are not directed toward an English-speaking audience.

In [21]:
print(ios[2746][1])

大众点评-发现品质生活


We're not interested in keeping these apps, so we'll remove them. One way to go about this is to remove each app with a name containing a symbol that is not commonly used in English text — English text usually includes letters from the English alphabet, numbers composed of digits from 0 to 9, punctuation marks (., !, ?, ;), and other symbols (+, *, /).

Behind the scenes, each character we use in a string has a corresponding number associated with it. For instance, the corresponding number for character 'a' is 97, character 'A' is 65, and character '爱' is 29,233. We can get the corresponding number of each character using the ord() built-in function.

In [22]:
print(ord('A'))
print(ord('z'))
print(ord('爱'))

65
122
29233


The numbers corresponding to the characters we commonly use in an English text are all in the range 0 to 127, according to the ASCII system. 
Based on this number range, we can build a function that detects whether a character belongs to the set of common English characters or not. If the number is equal to or less than 127, then the character belongs to the set of common English characters.

If an app name contains a character whose corresponding number is greater than 127, then it probably means that the app has a non-English name.

In [25]:
def isEnglish(word):
    for letter in word:
        if ord(letter) > 127:
            return False
    return True

In [30]:
print(isEnglish('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglish('Docs To Go™ Free Office Suite'))
print(isEnglish('Twitter'))
print(isEnglish('Instachat 😜'))

False
False
True
False
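As a cross-check, Python 3.7 and later provide the built-in string method `str.isascii()`, which applies the same rule (every character in the range 0 to 127). The sketch below is only a comparison, not a replacement for the function above; `names` and `ascii_flags` are hypothetical names.

```python
# The same four names used in the cell above.
names = [
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Twitter',
    'Instachat 😜',
]

# str.isascii() is True only when every character is in the ASCII range.
ascii_flags = [name.isascii() for name in names]

print(ascii_flags)  # [False, False, True, False]
```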


Above, we wrote a function that detects non-English app names, but we saw that it couldn't correctly identify certain English app names like 'Docs To Go™ Free Office Suite' and 'Instachat 😜'. This is because emojis and characters like ™ fall outside the ASCII range and have corresponding numbers over 127.


If we're going to use the function we've created, we'll lose useful data since many English apps will be incorrectly labeled as non-English. To minimize the impact of data loss, we'll only remove an app if its name has more than three characters with corresponding numbers falling outside the ASCII range. This means all English apps with up to three emoji or other special characters will still be labeled as English. Our filter function is still not perfect, but it should be fairly effective.
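To see what the threshold acts on, here is a small sketch that counts the characters outside the ASCII range in each name. `count_non_ascii` is a hypothetical helper used only for illustration.

```python
def count_non_ascii(text):
    """Return how many characters in `text` have a code point above 127."""
    return sum(1 for character in text if ord(character) > 127)


names = [
    '爱奇艺PPS -《欢乐颂2》电视剧热播',
    'Docs To Go™ Free Office Suite',
    'Twitter',
    'Instachat 😜',
]

# Names with a count above three would be treated as non-English.
for name in names:
    print(count_non_ascii(name), name)
```

The two English names with a special character have a count of 1 each, well under the limit, while the Chinese name is far above it.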

In [31]:
def isEnglishOptimized(word):
    counter = 0
    for letter in word:
        if ord(letter) > 127:
            counter += 1
            if counter > 3:
                return False
    return True

In [32]:
print(isEnglishOptimized('爱奇艺PPS -《欢乐颂2》电视剧热播'))
print(isEnglishOptimized('Docs To Go™ Free Office Suite'))
print(isEnglishOptimized('Twitter'))
print(isEnglishOptimized('Instachat 😜'))

False
True
True
True


Much better!
Now let's remove the non-English apps from the iOS and Google Play data sets.

In [33]:
android_english = []

for app in android_clean:
    name = app[0]
    if isEnglishOptimized(name):
        android_english.append(app)

print(len(android_english))

9614


In [34]:
ios_english = []

# Skip the header row (index 0) so that it isn't counted as an app.
for app in ios[1: ]:
    name = app[1]
    if isEnglishOptimized(name):
        ios_english.append(app)

print(len(ios_english))

6183


## Isolating the free apps
As we mentioned in the introduction, we only build apps that are free to download and install, and our main source of revenue consists of in-app ads. Our data sets contain both free and non-free apps; we'll need to isolate only the free apps for our analysis.

In [40]:
android_free = []

for app in android_english:
    a_type = app[6]
    if a_type == 'Free':
        android_free.append(app)

print(len(android_free))

8861


In [None]:
ios_free = []

for app in ios_english:
    price = app[4]  # 'price' column; free apps are stored as '0.0'
    if price == '0.0':
        ios_free.append(app)

print(len(ios_free))
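The two data sets write a zero price differently: the Google Play rows shown earlier use '0' in the Price column, while the App Store rows use '0.0'. A single check could compare prices as numbers instead of strings. The sketch below is a hypothetical helper (`is_free` is not part of the analysis above), and it assumes that paid prices may carry a leading '$'.

```python
def is_free(price_text):
    """Return True when a price string represents a price of zero.

    Assumption: paid prices may be written with a leading '$', e.g. '$4.99'.
    """
    cleaned = price_text.strip().lstrip('$')
    return float(cleaned) == 0.0


print(is_free('0'))      # True
print(is_free('0.0'))    # True
print(is_free('$4.99'))  # False
```

A numeric check like this would raise a `ValueError` on a header value such as 'price', so it should only be applied to data rows.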