## 1. Introduction
<p><img src="https://assets.datacamp.com/production/project_1197/img/google_play_store.png" alt="Google Play logo"></p>
<p>Mobile apps are everywhere. They are easy to create and can be very lucrative from the business standpoint. Specifically, Android is expanding as an operating system and has captured more than 74% of the total market<sup><a href="https://www.statista.com/statistics/272698/global-market-share-held-by-mobile-operating-systems-since-2009">[1]</a></sup>. </p>
<p>The Google Play Store apps data has enormous potential to facilitate data-driven decisions and insights for businesses. In this notebook, we will analyze the Android app market by comparing ~10k apps in Google Play across different categories. We will also use the user reviews to draw a qualitative comparison between the apps.</p>
<p>The dataset used here was scraped from Google Play Store in September 2018 and was published on <a href="https://www.kaggle.com/lava18/google-play-store-apps">Kaggle</a>. Here are the details: <br>
<br></p>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/apps.csv</b></div>
This file contains all the details of the apps on Google Play. There are 9 features that describe a given app.
<ul>
    <li><b>App:</b> Name of the app</li>
    <li><b>Category:</b> Category of the app. Some examples are: ART_AND_DESIGN, FINANCE, COMICS, BEAUTY etc.</li>
    <li><b>Rating:</b> The current average rating (out of 5) of the app on Google Play</li>
    <li><b>Reviews:</b> Number of user reviews given on the app</li>
    <li><b>Size:</b> Size of the app in MB (megabytes)</li>
    <li><b>Installs:</b> Number of times the app was downloaded from Google Play</li>
    <li><b>Type:</b> Whether the app is paid or free</li>
    <li><b>Price:</b> Price of the app in US$</li>
    <li><b>Last Updated:</b> Date on which the app was last updated on Google Play </li>

</ul>
</div>
<div style="background-color: #efebe4; color: #05192d; text-align:left; vertical-align: middle; padding: 15px 25px 15px 25px; line-height: 1.6;">
    <div style="font-size:20px"><b>datasets/user_reviews.csv</b></div>
This file contains a random sample of 100 <a href="https://www.androidpolice.com/2019/01/21/google-play-stores-redesigned-ratings-and-reviews-section-lets-you-easily-filter-by-star-rating/"><i>most helpful first</i></a> user reviews for each app. The text in each review has been pre-processed and passed through a sentiment analyzer.
<ul>
    <li><b>App:</b> Name of the app on which the user review was provided. Matches the <code>App</code> column of the <code>apps.csv</code> file</li>
    <li><b>Review:</b> The pre-processed user review text</li>
    <li><b>Sentiment Category:</b> Sentiment category of the user review - Positive, Negative or Neutral</li>
    <li><b>Sentiment Score:</b> Sentiment score of the user review. It lies in the range [-1, 1]. A higher score denotes a more positive sentiment.</li>

</ul>
</div>
<p>From here on, I will explore and manipulate the data to figure out our best performing competitors in the finance apps category.<br></p>

In [14]:
import numpy as np
import pandas as pd
import re

In [15]:
apps = pd.read_csv('datasets/apps.csv')
print(apps.head())
apps.isna().sum()


                                                 App        Category  Rating  \
0     Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN     4.1   
1                                Coloring book moana  ART_AND_DESIGN     3.9   
2  U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN     4.7   
3                              Sketch - Draw & Paint  ART_AND_DESIGN     4.5   
4              Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN     4.3   

   Reviews  Size     Installs  Type  Price      Last Updated  
0      159  19.0      10,000+  Free    0.0   January 7, 2018  
1      967  14.0     500,000+  Free    0.0  January 15, 2018  
2    87510   8.7   5,000,000+  Free    0.0    August 1, 2018  
3   215644  25.0  50,000,000+  Free    0.0      June 8, 2018  
4      967   2.8     100,000+  Free    0.0     June 20, 2018  


App                0
Category           0
Rating          1463
Reviews            0
Size            1227
Installs           0
Type               0
Price              0
Last Updated       0
dtype: int64

Notice that the values in the **'Installs'** column are currently strings. To analyze this column, we have to remove the non-numeric characters and then convert the values to integers.

I will use a for loop to iterate over each value in the column and apply 
**re.sub("[,+]","",s)**, which replaces every **","** and **"+"** character with an empty string, effectively deleting them. Here "s" is the loop variable, i.e. each string value in the column of interest.
I will then convert the cleaned string (now free of non-numeric characters) to an integer and append it to a list created before the for loop.

For the missing values, I will just ignore them, since the columns that contain them (**Rating** and **Size**) are not cleaned in this analysis. I do later calculate the mean rating for each category, but note that the pandas **.mean()** method skips NaNs by default.
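As a small illustration of that default, here is a sketch on made-up ratings (the values are hypothetical, not taken from `apps.csv`). It compares the default behaviour of `Series.mean()` with `skipna=False`:

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, one of them missing
ratings = pd.Series([4.0, np.nan, 5.0])

# skipna=True is the default, so the NaN is left out of the average
mean_default = ratings.mean()

# With skipna=False the NaN propagates into the result
mean_strict = ratings.mean(skipna=False)

print(mean_default)
print(mean_strict)
```

The same default applies to the mean taken after a `groupby`, so categories with some unrated apps still get an average over the rated ones.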

In [16]:
#cleaning the Installs column

Installs = []

for s in apps['Installs']:
    new = int(re.sub("[,+]","",s)) #'re' has been imported above.
    Installs.append(new)
    
print(Installs[:5])

[10000, 500000, 5000000, 50000000, 100000]
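The loop above works. As an alternative, the same cleaning can be sketched with pandas string methods, which avoids the intermediate list. The sample values below are hypothetical strings in the same format as the **'Installs'** column:

```python
import pandas as pd

# Hypothetical sample in the same format as the 'Installs' column
installs_raw = pd.Series(["10,000+", "500,000+", "5,000,000+"])

# Remove every "," and "+" with a regular expression, then cast to integers
installs_clean = installs_raw.str.replace(r"[,+]", "", regex=True).astype(int)

print(installs_clean.tolist())
```

On the real data this would amount to applying the same chain to `apps['Installs']`, assuming every value in the column follows this pattern.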


In [17]:
#Now, to replace the old "Installs" column with the new "Installs" list

apps['Installs'] = Installs
print(apps['Installs'].head())

0       10000
1      500000
2     5000000
3    50000000
4      100000
Name: Installs, dtype: int64


In [18]:
app_category_info = apps.groupby('Category')[['Price', 'Rating']].mean()
print(app_category_info.head())

                        Price    Rating
Category                               
ART_AND_DESIGN       0.093281  4.357377
AUTO_AND_VEHICLES    0.158471  4.190411
BEAUTY               0.000000  4.278571
BOOKS_AND_REFERENCE  0.539505  4.344970
BUSINESS             0.417357  4.098479


In [19]:
#renaming the "Price" and "Rating" columns appropriately

app_category_info.rename(columns= \
                         {"Price":"Average price", "Rating":"Average rating"}, \
                         inplace=True)


In [20]:
#count number of apps per Category group
cat_grouping = apps.groupby('Category').count()['App']
print(cat_grouping[:5])


#turning group count into a list
Number_of_apps = list(cat_grouping)


#inserting the list "Number_of_apps" into the "app_category_info" dataframe 
app_category_info.insert(0, "Number of apps", Number_of_apps , True)
print(app_category_info.head())

Category
ART_AND_DESIGN          64
AUTO_AND_VEHICLES       85
BEAUTY                  53
BOOKS_AND_REFERENCE    222
BUSINESS               420
Name: App, dtype: int64
                     Number of apps  Average price  Average rating
Category                                                          
ART_AND_DESIGN                   64       0.093281        4.357377
AUTO_AND_VEHICLES                85       0.158471        4.190411
BEAUTY                           53       0.000000        4.278571
BOOKS_AND_REFERENCE             222       0.539505        4.344970
BUSINESS                        420       0.417357        4.098479


Now I have a compact table that summarizes my **Apps** dataset by **Category**.
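For comparison, a table of this shape can also be built in a single step with pandas named aggregation. This is only a sketch on a hypothetical miniature of the apps table (`toy_apps` and its values are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the apps table
toy_apps = pd.DataFrame({
    "App": ["a", "b", "c"],
    "Category": ["FINANCE", "FINANCE", "BEAUTY"],
    "Price": [0.0, 2.0, 0.0],
    "Rating": [4.0, np.nan, 3.5],
})

# Named aggregation: output column name -> (input column, aggregation)
summary = toy_apps.groupby("Category").agg(**{
    "Number of apps": ("App", "count"),
    "Average price": ("Price", "mean"),
    "Average rating": ("Rating", "mean"),
})

print(summary)
```

Because the counts and the averages come from the same `groupby`, they are aligned by category label rather than by list position.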


In [21]:
reviews = pd.read_csv('datasets/user_reviews.csv')
reviews.head()


Unnamed: 0,App,Review,Sentiment Category,Sentiment Score
0,10 Best Foods for You,I like eat delicious food. That's I'm cooking ...,Positive,1.0
1,10 Best Foods for You,This help eating healthy exercise regular basis,Positive,0.25
2,10 Best Foods for You,,,
3,10 Best Foods for You,Works great especially going grocery store,Positive,0.4
4,10 Best Foods for You,Best idea us,Positive,1.0


This new dataframe holds the individual user reviews of the apps in my **Apps** dataframe. I'm going to perform a merge so that I can see the full information of each app alongside its reviews. This is a one-to-many merge because each **App** appears once in the **Apps** table but can occur more than once in the **Reviews** table. 
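To make the one-to-many idea concrete, here is a minimal sketch on two hypothetical tables (`toy_apps` and `toy_reviews` are made up for illustration). It also shows what the `validate` argument of `merge` does when the left table unexpectedly contains a duplicated key:

```python
import pandas as pd

# Hypothetical tables: one row per app on the left, several reviews per app on the right
toy_apps = pd.DataFrame({"App": ["A", "B"], "Category": ["FINANCE", "BEAUTY"]})
toy_reviews = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment Score": [0.5, -0.2, 1.0],
})

# Each app row is repeated once per matching review
merged = toy_apps.merge(toy_reviews, on="App", how="inner", validate="one_to_many")
print(len(merged))

# Duplicate app "A" on the left: the key is no longer unique there
dup_apps = pd.concat([toy_apps, toy_apps.iloc[[0]]], ignore_index=True)
try:
    dup_apps.merge(toy_reviews, on="App", how="inner", validate="one_to_many")
    raised = False
except pd.errors.MergeError:
    # validate refuses the merge instead of silently multiplying rows
    raised = True
print(raised)
```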

In [22]:
apps_reviews = apps.merge(reviews, on = 'App', how = 'inner', validate = 'one_to_many')
apps_reviews[:3]

Unnamed: 0,App,Category,Rating,Reviews,Size,Installs,Type,Price,Last Updated,Review,Sentiment Category,Sentiment Score
0,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",A kid's excessive ads. The types ads allowed a...,Negative,-0.25
1,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",It bad >:(,Negative,-0.725
2,Coloring book moana,ART_AND_DESIGN,3.9,967,14.0,500000,Free,0.0,"January 15, 2018",like,Neutral,0.0


I am only interested in **FINANCE** apps for this analysis, so I will subset my merged dataframe by **Category == "FINANCE"**.

In [23]:
finance_apps = apps_reviews.query('Category == "FINANCE"')
print(finance_apps.head())

                     App Category  Rating  Reviews  Size  Installs  Type  \
14112  Citibanamex Movil  FINANCE     3.6    52306  42.0   5000000  Free   
14113  Citibanamex Movil  FINANCE     3.6    52306  42.0   5000000  Free   
14114  Citibanamex Movil  FINANCE     3.6    52306  42.0   5000000  Free   
14115  Citibanamex Movil  FINANCE     3.6    52306  42.0   5000000  Free   
14116  Citibanamex Movil  FINANCE     3.6    52306  42.0   5000000  Free   

       Price   Last Updated  \
14112    0.0  July 27, 2018   
14113    0.0  July 27, 2018   
14114    0.0  July 27, 2018   
14115    0.0  July 27, 2018   
14116    0.0  July 27, 2018   

                                                  Review Sentiment Category  \
14112  Forget paying app, designed make fail payments...           Negative   
14113  It's working expected, talking best bank Mexic...           Positive   
14114  It has many problems with Android 8.1. You can...           Positive   
14115  I changed my phone to a Xiaomi Re

I do not need to see every individual review of each app, so I will create a pivot table for the finance apps with the average **Sentiment Score** as the summary statistic.
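`pivot_table` aggregates with the mean by default, which is why no aggregation function needs to be passed. The sketch below, on hypothetical scores, checks that this matches a plain `groupby` mean:

```python
import pandas as pd

# Hypothetical review scores for two apps
toy = pd.DataFrame({
    "App": ["A", "A", "B"],
    "Sentiment Score": [0.5, 0.1, -0.2],
})

# Default aggfunc is the mean, one row per app
pivot = toy.pivot_table(values="Sentiment Score", index="App")

# Same statistic via groupby, returned as a Series
grouped = toy.groupby("App")["Sentiment Score"].mean()

print(pivot)
```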

In [24]:
user_feedback = finance_apps.pivot_table(values = 'Sentiment Score', index = 'App').sort_values("Sentiment Score", ascending = False)
#note: [:9] keeps the first 9 apps (shown below); [:10] would give a full top 10
top_user_feedback = user_feedback[:9]
print(top_user_feedback)

                                            Sentiment Score
App                                                        
BBVA Spain                                         0.515086
Associated Credit Union Mobile                     0.388093
BankMobile Vibe App                                0.353455
A+ Mobile                                          0.329592
Current debit card and app made for teens          0.327258
BZWBK24 mobile                                     0.326883
Even - organize your money, get paid early         0.283929
Credit Karma                                       0.270052
Fortune City - A Finance App                       0.266966


Now I know the best-performing finance apps based on average sentiment score. I can submit these findings to the relevant teams so they can investigate what these apps are doing right and how we can beat the existing competition. 