# Analysis of Profitable Apps for the App Store and Google Play

As of 2018, there were about 2 million iOS apps available in the App Store, and about 2.1 million Android apps in the Google Play store. This app analysis is intended to find the most profitable types of apps to develop in terms of ad revenue, as the apps themselves will be free to download. Naturally, this means that we must find the kind of apps which are most installed, since more users means more people seeing and potentially engaging with the ads. Therefore, the goal is to help our hypothetical developers understand what kinds of apps are most likely to attract the most users on both Android and iOS.

We will accomplish this goal with a few fairly straightforward steps:

1. Open data for the App Store and Google Play apps, and create lists to more easily process the data
2. Clean the data by removing any duplicate or irrelevant apps (non-free, non-English)
3. Categorize and sort the apps so we can make informed statements about which ones users prefer

### Data Exploration and Sorting

Collecting data for over 4 million apps would not only be very time-consuming, it would also be expensive. Realistically, it would not even be necessary, as a smaller sample should be suitable to give us the information we need. Our data sets for the Android and iOS app stores will include about 10,000 and 7,000 apps, respectively.

Opening and exploring the data sets is the logical first step. We'll use a function to allow for repeated printing of rows in a readable way.

In [14]:
def explore_data(dataset, start, end, rows_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')

    if rows_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns: {len(dataset[0])}')

The explore_data() function takes four parameters: dataset, which is expected as a list of lists; start and end, which should be integers and represent the starting/ending indices of a slice of the dataset; rows_columns, expected to be a boolean with False set as default.

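The function slices the dataset from start up to, but not including, end. It then loops through the slice, printing each row followed by blank lines for readability. If rows_columns is True, it also prints the number of rows and columns in the full dataset (not just the slice).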
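To see this behavior without loading the real files, here is a small sketch on a made-up three-row dataset. The `toy` rows are hypothetical and are not taken from either data set; the function definition is repeated only so the example runs on its own.

```python
def explore_data(dataset, start, end, rows_columns=False):
    data_slice = dataset[start:end]
    for row in data_slice:
        print(row)
        print('\n')

    if rows_columns:
        print(f'Number of rows: {len(dataset)}')
        print(f'Number of columns: {len(dataset[0])}')


# Hypothetical rows, for illustration only
toy = [
    ['App A', 'GAME', '4.5'],
    ['App B', 'TOOLS', '3.9'],
    ['App C', 'GAME', '4.1'],
]

# Prints the rows at indices 0 and 1, then the size of the whole dataset
explore_data(toy, 0, 2, rows_columns=True)
```

Note that the row and column counts describe the full list passed in, so here they report 3 rows even though only 2 are printed.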

Let's open the two app data sets and create lists from them. We'll also save each header row into its own variable.

In [15]:
from csv import reader

### The Google Play data set ###
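# An explicit encoding avoids platform-dependent decoding errors,
# and the with statement closes each file once it has been read.
with open('googleplaystore.csv', encoding='utf8') as opened_file:
    android = list(reader(opened_file))
android_header = android[0]   # first row holds the column names
android = android[1:]         # remaining rows are the apps

### The App Store data set ###
with open('AppleStore.csv', encoding='utf8') as opened_file:
    ios = list(reader(opened_file))
ios_header = ios[0]
ios = ios[1:]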

Simple enough! Next, we'll use our explore_data() function to examine the first few rows of the Google Play data set.

In [16]:
print(android_header)
print('\n')
explore_data(android, 0, 4, rows_columns=True)

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


['Coloring book moana', 'ART_AND_DESIGN', '3.9', '967', '14M', '500,000+', 'Free', '0', 'Everyone', 'Art & Design;Pretend Play', 'January 15, 2018', '2.0.0', '4.0.3 and up']


['U Launcher Lite – FREE Live Cool Themes, Hide Apps', 'ART_AND_DESIGN', '4.7', '87510', '8.7M', '5,000,00

There are 10,841 apps, sorted into 13 columns. Many of the column headings seem like they will be useful to our analysis. The most relevant for now are probably 'App', 'Category', 'Rating', 'Installs', 'Type', 'Price', and 'Genres'.
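One detail worth noting from the sample rows: every value is read as a string, and 'Installs' entries such as '10,000+' contain commas and a plus sign, so they cannot be compared numerically as-is. A minimal sketch of one way to convert them follows; `installs_to_int` is a hypothetical helper name, and it assumes every entry follows the same digits, commas, and trailing '+' pattern seen above.

```python
def installs_to_int(installs):
    """Convert a string such as '10,000+' to the integer 10000.

    Assumes the value contains only digits, commas, and a '+' sign.
    """
    cleaned = installs.replace(',', '').replace('+', '')
    return int(cleaned)


print(installs_to_int('10,000+'))   # 10000
print(installs_to_int('500,000+'))  # 500000
```

The trailing '+' suggests these figures are lower bounds rather than exact counts, so the converted numbers are best treated as approximate.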

Now let's do the same for the iOS store data.

In [17]:
print(ios_header)
print('\n')
explore_data(ios, 0, 4, rows_columns=True)

['id', 'track_name', 'size_bytes', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', 'user_rating', 'user_rating_ver', 'ver', 'cont_rating', 'prime_genre', 'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']


['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676', '212', '3.5', '3.5', '95.0', '4+', 'Social Networking', '37', '1', '29', '1']


['389801252', 'Instagram', '113954816', 'USD', '0.0', '2161558', '1289', '4.5', '4.0', '10.23', '12+', 'Photo & Video', '37', '0', '29', '1']


['529479190', 'Clash of Clans', '116476928', 'USD', '0.0

The iOS data set includes 7,197 apps, each described by 16 columns. Some of the column names in the Apple data are a bit cryptic, but most are easy enough to figure out by looking at the entries. For example, track_name corresponds to the app's name. The most useful seem to be 'track_name', 'currency', 'price', 'rating_count_tot', 'rating_count_ver', and 'prime_genre'. For additional help with the columns, see the [documentation](https://www.kaggle.com/ramamet4/app-store-apple-data-set-10k-apps/home).
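Since each row is a plain list, columns have to be accessed by position. One way to avoid hard-coding those positions is to build a name-to-index lookup from the header row. The sketch below uses the header and first row shown in the output above; `col_index` is a hypothetical name introduced here for illustration.

```python
# Header and first row copied from the iOS output above
ios_header = ['id', 'track_name', 'size_bytes', 'currency', 'price',
              'rating_count_tot', 'rating_count_ver', 'user_rating',
              'user_rating_ver', 'ver', 'cont_rating', 'prime_genre',
              'sup_devices.num', 'ipadSc_urls.num', 'lang.num', 'vpp_lic']
first_row = ['284882215', 'Facebook', '389879808', 'USD', '0.0', '2974676',
             '212', '3.5', '3.5', '95.0', '4+', 'Social Networking',
             '37', '1', '29', '1']

# Map each column name to its position in a row
col_index = {name: i for i, name in enumerate(ios_header)}

print(col_index['prime_genre'])             # 11
print(first_row[col_index['track_name']])   # Facebook
print(first_row[col_index['prime_genre']])  # Social Networking
```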

### Data Cleaning

One of the most critical parts of data analysis is cleaning the data, i.e. removing irrelevant, inaccurate, or duplicate data which would interfere with drawing accurate conclusions.

From the [discussion section](https://www.kaggle.com/lava18/google-play-store-apps/discussion) of the Google Play data, we see that the row at index 10472 (counting from 0, with the header excluded) is incorrect: it lists the app's rating as 19, while the maximum rating for an app in the Google Play Store should be 5.

In [18]:
print(android_header)  # header
print('\n')
print(android[10472])  # incorrect row
print('\n')
print(android[0])      # example correct row

['App', 'Category', 'Rating', 'Reviews', 'Size', 'Installs', 'Type', 'Price', 'Content Rating', 'Genres', 'Last Updated', 'Current Ver', 'Android Ver']


['Life Made WI-Fi Touchscreen Photo Frame', '1.9', '19', '3.0M', '1,000+', 'Free', '0', 'Everyone', '', 'February 11, 2018', '1.0.19', '4.0 and up']


['Photo Editor & Candy Camera & Grid & ScrapBook', 'ART_AND_DESIGN', '4.1', '159', '19M', '10,000+', 'Free', '0', 'Everyone', 'Art & Design', 'January 7, 2018', '1.0.0', '4.0.3 and up']


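According to the discussion, this issue is caused by a missing value for the 'Category' column. Because that value is missing, every later value in the row shifts one column to the left: '1.9' lands under 'Category', '19' lands under 'Rating', and the row has only 12 entries instead of 13. To fix this, we'll simply delete the row so it doesn't interfere with the rest of the data.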

In [19]:
del android[10472]  # run this cell only once; rerunning it deletes a different, valid row
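Deleting by a hard-coded index works here, but it depends on knowing the index in advance. Because the faulty row is shorter than the header, a length check can locate such rows automatically. The following is a minimal sketch on made-up data, not the approach taken in the discussion thread; `find_bad_rows` and the sample rows are hypothetical.

```python
def find_bad_rows(dataset, header):
    """Return the indices of rows whose length differs from the header's."""
    return [i for i, row in enumerate(dataset) if len(row) != len(header)]


# Hypothetical data: the second row is missing its 'Category' value
header = ['App', 'Category', 'Rating']
rows = [
    ['App A', 'GAME', '4.5'],
    ['App B', '1.9'],
    ['App C', 'TOOLS', '4.1'],
]

bad = find_bad_rows(rows, header)
print(bad)  # [1]

# Delete from the end so the remaining indices stay valid
for i in reversed(bad):
    del rows[i]

print(len(rows))  # 2
```

Unlike a fixed index, this check can be rerun safely: once the short rows are gone, it finds nothing further to delete.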