# Google Play Store Twitter Sentiment Analysis
Data provided by [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps). In addition, Twitter will be searched for further data about each app.

A couple of questions come to mind:
1. Is there really a correlation between the number of times an app was downloaded and its Twitter sentiment?
2. For the top 10000 apps (based on installs), how do installs correlate with the positive and negative sentiments on Twitter?
3. Which genre of apps brings out the most positive and negative sentiment among the top 10 downloads from the Play Store?


For the purpose of the second task **Syuzhet** will be used. **Syuzhet** scores text on eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and two sentiments (negative and positive).
The notebook will examine only the positive and negative sentiments for the first 100 tweets of each app.
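
As a minimal sketch of that scoring step, the snippet below sums the positive and negative scores over at most the first 100 texts. The helper name `score_tweets` and the two sample sentences are hypothetical; `get_nrc_sentiment` is the Syuzhet function that returns one column per emotion and sentiment.

```r
library(syuzhet)

# Hypothetical helper: total positive and negative NRC scores
# for up to the first n texts of a character vector.
score_tweets <- function(texts, n = 100) {
  texts <- head(as.character(texts), n)
  scores <- get_nrc_sentiment(texts)
  c(positive = sum(scores$positive), negative = sum(scores$negative))
}

# Made-up sample texts standing in for tweets about one app.
sample_scores <- score_tweets(c("I love this app, it is wonderful",
                                "Terrible update, the app keeps crashing"))
sample_scores
```

Summing keeps one pair of numbers per app, which is convenient for a later comparison against installs. Averaging per tweet would be an equally reasonable choice.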

But first let's start with loading the libraries:

In [1]:
library(ggplot2)
library(Amelia)
library(dplyr)
library(rjson)
library(twitteR)
library(syuzhet)

set.seed(42)

Loading required package: Rcpp
## 
## Amelia II: Multiple Imputation
## (Version 1.7.4, built: 2015-12-05)
## Copyright (C) 2005-2018 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
## 

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union


Attaching package: ‘twitteR’

The following objects are masked from ‘package:dplyr’:

    id, location



## 1. Loading and preprocessing the data.

### 1.1 Loading the dataset.

In [2]:
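# read.csv already returns a data frame; stringsAsFactors = TRUE keeps the
# factor columns on R >= 4.0, where the default changed to FALSE.
df <- read.csv('data/googleplaystore.csv', stringsAsFactors = TRUE)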

In [3]:
# Displaying the head
head(df)

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content.Rating,Genres,Last.Updated,Current.Ver,Android.Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up


### 1.2 Configuring the twitter connection

In [4]:
twitterCreds <- fromJSON(file = "data/twitter_access.json")

In [5]:
setup_twitter_oauth(twitterCreds$consumer_key, 
                    twitterCreds$consumer_secret, 
                    access_token=twitterCreds$access_token, 
                    access_secret=twitterCreds$access_token_secret)

[1] "Using direct authentication"
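
Before calling `setup_twitter_oauth` it may help to confirm that the credentials list holds every field the call reads. The sketch below assumes the JSON file is a flat object with the four keys used above. The helper name `check_twitter_creds` and the placeholder values are hypothetical, and a plain list stands in for the result of `fromJSON`.

```r
# Hypothetical helper: stop early if a credential field is missing or empty.
check_twitter_creds <- function(creds) {
  required <- c("consumer_key", "consumer_secret",
                "access_token", "access_token_secret")
  present <- vapply(required, function(k) {
    v <- creds[[k]]
    is.character(v) && length(v) == 1 && nzchar(v)
  }, logical(1))
  missing <- required[!present]
  if (length(missing) > 0) {
    stop("Missing credential fields: ", paste(missing, collapse = ", "))
  }
  invisible(TRUE)
}

# Placeholder values only; real keys should stay out of the notebook.
example_creds <- list(consumer_key = "xxx", consumer_secret = "xxx",
                      access_token = "xxx", access_token_secret = "xxx")
creds_ok <- check_twitter_creds(example_creds)
```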


In [6]:
# code snippets to remind me how it's done.
# soccer.tweets <- searchTwitter("soccer", n=2000, lang="en")
# soccer.tweets.df <- twListToDF(soccer.tweets)
# get_nrc_sentiment(soccer.tweets.df$text)

### 1.3 Getting the bearings of the datasets.
First - let's get the dimensions and structure of the dataset.

In [7]:
dim(df)

In [8]:
str(df)

'data.frame':	10841 obs. of  13 variables:
 $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7206 2551 8970 8089 7272 7103 8149 5568 4926 5806 ...
 $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
 $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
 $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
 $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
 $ Type          : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
 $ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
 $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 

**Structure summary**:  
From the structure it seems that some columns one would expect to be *numerical* are actually *categorical* (factors). The clearest case is *Reviews*, as the column holds the number of reviews an app has received. In addition, *Size* could also be translated to a numerical value, as it has 462 levels. On the other hand, the column *Installs* is aggregated into 22 levels and is indeed a categorical value. The same goes for *Price* with 93 levels. The *Last.Updated* column looks like a date.  
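
One possible way to translate *Size* into a number is sketched below. The helper name `size_to_mb` is hypothetical, and the handling of a `k` suffix as kilobytes is an assumption, since only `M` values appear in the output above. Anything that does not match either pattern is left as `NA` for manual inspection.

```r
# Hypothetical helper: convert sizes such as "19M" to megabytes.
size_to_mb <- function(size) {
  size <- as.character(size)
  out <- rep(NA_real_, length(size))
  is_mb <- grepl("^[0-9.]+M$", size)
  is_kb <- grepl("^[0-9.]+k$", size)  # assumption: "k" means kilobytes
  out[is_mb] <- as.numeric(sub("M$", "", size[is_mb]))
  out[is_kb] <- as.numeric(sub("k$", "", size[is_kb])) / 1024
  out
}

# The last two inputs are deliberately non-numeric and come back as NA.
sizes_mb <- size_to_mb(c("19M", "8.7M", "512k", "Varies with device", "1,000+"))
sizes_mb
```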

But first let's clean the dataset of any potential errors. This means checking each column for values that are out of context for that column. Let's make a small plan covering each of the columns.

1. App - nothing to do here, as names can vary quite a lot.
2. Category - as there aren't many levels, we can list the unique values and check whether something is out of context.
3. Rating - nothing to do here either, as the structure of the data frame shows it is numerical, which is what we expect.
4. Reviews - check if the column is entirely composed of integers. A non-integer will indicate a problem with the data.
5. Size - looking at the head, a value that begins with a digit is valid. Anything else has to be handled on a per-item basis.
6. Installs - again like the reviews: if the first character is a digit it's okay, else process it manually.
7. Type - only 4 levels, thus we can examine them one by one.
8. Price - again a simple check: the first character should be either a 0, an F (for Free) or a dollar sign.
9. Content.Rating - only 7 levels, so again we can treat them case by case.
10. Genres - this is a tricky one, as the values can vary as much as in the App column. We can leave it for now and, if outliers show up later, go back and fix them.
11. Last.Updated - try to convert to a date; if it fails, examine it.
12. Current.Ver & Android.Ver - leave as is, as they are not intended to be used.
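
The pattern-based checks in this plan can be sketched with base R as follows. The toy data frame and its values are hypothetical, with the last row imitating a record whose values slipped into the wrong columns. The date format string is an assumption based on the values shown in the head, and it relies on English month names in the current locale.

```r
# Toy rows imitating the dataset's columns (hypothetical values).
toy <- data.frame(
  Reviews      = c("159", "967", "3.0M"),
  Size         = c("19M", "Varies with device", "1,000+"),
  Installs     = c("10,000+", "500,000+", "Free"),
  Price        = c("0", "$4.99", "Everyone"),
  Last.Updated = c("January 7, 2018", "June 20, 2018", "1.0.19"),
  stringsAsFactors = FALSE
)

# One regular expression per column, following the plan.
checks <- list(
  Reviews  = "^[0-9]+$",  # integers only
  Size     = "^[0-9]",    # starts with a digit
  Installs = "^[0-9]",    # starts with a digit
  Price    = "^[0F$]"     # starts with 0, F or a dollar sign
)

# Row indices that fail each pattern.
suspicious <- lapply(names(checks), function(col) {
  which(!grepl(checks[[col]], as.character(toy[[col]])))
})
names(suspicious) <- names(checks)

# Dates that fail to parse come back as NA.
bad_dates <- which(is.na(as.Date(toy$Last.Updated, format = "%B %d, %Y")))
suspicious
```

Rows flagged by several checks at once are good candidates for a shifted or otherwise broken record.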

#### *1 Category*

In [19]:
unique(df$Category)

Right off the bat we encounter an error: 1.9 should not be there. Since the data is a CSV, this is likely caused by a missing comma that shifted the row's values into the wrong columns. Let's find the number of affected rows.

In [30]:
sum(df$Category == "1.9")

Just one row - we can drop it.

In [34]:
df <- df[df$Category != "1.9", ]

Transforming the columns to the appropriate types.

In [9]:
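# Reviews is a factor, so convert via character; as.numeric() alone would return the level codes
gplay_df <- transform(df, Reviews = as.numeric(as.character(Reviews))) # start with column reviews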
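
A caveat when converting factor columns: `as.numeric` applied directly to a factor returns the internal level codes rather than the numbers that are printed. The toy factor below (hypothetical values) shows the difference between the two conversions.

```r
# Toy factor standing in for a column such as Reviews (hypothetical values).
reviews <- factor(c("159", "967", "87510", "967"))

codes  <- as.numeric(reviews)                # internal level codes
values <- as.numeric(as.character(reviews))  # the numbers as printed

codes
values
```

The levels are sorted as text, so the codes bear no relation to the size of the original numbers.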

**Plan**:  
1. *Preprocessing Data:*  
    * Check for outliers, as some of the factor levels are unexpected given the context of their columns.
    * Finish transforming the other columns.  
2. *Initial Dataviz:*  
    * Treat `Installs` as a target variable and get the relation between the Category, Rating, number of Reviews, Type, Genres and Price.
    * Sort the dataset, based on Installs and start scraping Twitter.
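
The sorting step could look like the sketch below. It assumes the `Installs` labels follow the `"10,000+"` pattern seen in the head, and it uses a hypothetical toy data frame and a hypothetical helper column `installs_num`.

```r
# Toy data with made-up app names; Installs uses the label format of the dataset.
toy <- data.frame(
  App      = c("A", "B", "C"),
  Installs = c("10,000+", "5,000,000+", "500,000+"),
  stringsAsFactors = FALSE
)

# Strip the thousands separators and the trailing plus sign, then sort descending.
toy$installs_num <- as.numeric(gsub("[+,]", "", as.character(toy$Installs)))
top_apps <- toy[order(toy$installs_num, decreasing = TRUE), ]
top_apps
```

Because each label is a lower bound on the true install count, the resulting order is only meaningful between install brackets, not within one.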

In [16]:
sort(unique(gplay_df$Size))