# Google Play Store Twitter Sentiment Analysis
The data comes from [Kaggle](https://www.kaggle.com/lava18/google-play-store-apps). In addition, Twitter will be searched for further data on each app.

A couple of questions come to mind:
1. Is there really a correlation between the number of times an app was downloaded and its Twitter sentiment?
2. For the top 10000 apps (based on installs), how do installs correlate with the positive and negative sentiments on Twitter?
3. Which genre of apps brings out the most positive and negative sentiment among the top 10 downloads from the Play Store?


For the second task, **Syuzhet** will be used. **Syuzhet** scores text along 10 dimensions: eight emotions (anger, anticipation, disgust, fear, joy, sadness, surprise, trust) and two sentiments (negative and positive).
The notebook will examine only the positive and negative sentiments, using the first 100 tweets for each app.

But first let's start with loading the libraries:

In [7]:
library(ggplot2)
library(Amelia)
library(dplyr)
library(rjson)
library(twitteR)
library(syuzhet)

set.seed(42)

## 1. Loading and preprocessing the data.

### 1.1 Loading the dataset.

In [8]:
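# read.csv() already returns a data.frame; stringsAsFactors = TRUE keeps the factor columns shown below on R >= 4.0
df <- read.csv('data/googleplaystore.csv', stringsAsFactors = TRUE)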

In [18]:
# Displaying the head
head(df)

App,Category,Rating,Reviews,Size,Installs,Type,Price,Content.Rating,Genres,Last.Updated,Current.Ver,Android.Ver
Photo Editor & Candy Camera & Grid & ScrapBook,ART_AND_DESIGN,4.1,159,19M,"10,000+",Free,0,Everyone,Art & Design,"January 7, 2018",1.0.0,4.0.3 and up
Coloring book moana,ART_AND_DESIGN,3.9,967,14M,"500,000+",Free,0,Everyone,Art & Design;Pretend Play,"January 15, 2018",2.0.0,4.0.3 and up
"U Launcher Lite – FREE Live Cool Themes, Hide Apps",ART_AND_DESIGN,4.7,87510,8.7M,"5,000,000+",Free,0,Everyone,Art & Design,"August 1, 2018",1.2.4,4.0.3 and up
Sketch - Draw & Paint,ART_AND_DESIGN,4.5,215644,25M,"50,000,000+",Free,0,Teen,Art & Design,"June 8, 2018",Varies with device,4.2 and up
Pixel Draw - Number Art Coloring Book,ART_AND_DESIGN,4.3,967,2.8M,"100,000+",Free,0,Everyone,Art & Design;Creativity,"June 20, 2018",1.1,4.4 and up
Paper flowers instructions,ART_AND_DESIGN,4.4,167,5.6M,"50,000+",Free,0,Everyone,Art & Design,"March 26, 2017",1.0,2.3 and up


### 1.2 Configuring the Twitter connection

In [10]:
twitterCreds <- fromJSON(file = "data/twitter_access.json")

In [11]:
setup_twitter_oauth(twitterCreds$consumer_key, 
                    twitterCreds$consumer_secret, 
                    access_token=twitterCreds$access_token, 
                    access_secret=twitterCreds$access_token_secret)

[1] "Using direct authentication"


In [17]:
# Code snippets to remind me how it's done.
# soccer.tweets <- searchTwitter("soccer", n=2000, lang="en")
# soccer.tweets.df <- twListToDF(soccer.tweets)
# get_nrc_sentiment(soccer.tweets.df$text)
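As a rough illustration of where the reminder above leads, the sketch below scores two made-up sentences with `get_nrc_sentiment()` and keeps only the `negative` and `positive` columns. The sample texts are hypothetical stand-ins for tweet text; in the notebook the input would be the `text` column of the data frame returned by `twListToDF()`.

```r
library(syuzhet)

# Hypothetical stand-ins for tweet texts (not real data).
tweets <- c("I love this app, it is wonderful",
            "Terrible update, the app keeps crashing")

# One row per text, one column per NRC emotion or sentiment.
scores <- get_nrc_sentiment(tweets)

# Keep only the two polarity columns examined in this notebook.
polarity <- scores[, c("negative", "positive")]

# Total negative and positive word counts across all texts.
polarity_totals <- colSums(polarity)
polarity_totals
```

Summing per app, rather than per tweet, would give one negative and one positive score for each set of 100 tweets.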

### 1.3 Getting the bearings of the datasets.
First, let's get the dimensions and structure of the dataset.

In [28]:
dim(df)

In [19]:
str(df)

'data.frame':	10841 obs. of  13 variables:
 $ App           : Factor w/ 9660 levels "- Free Comics - Comic Apps",..: 7206 2551 8970 8089 7272 7103 8149 5568 4926 5806 ...
 $ Category      : Factor w/ 34 levels "1.9","ART_AND_DESIGN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Rating        : num  4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
 $ Reviews       : Factor w/ 6002 levels "0","1","10","100",..: 1183 5924 5681 1947 5924 1310 1464 3385 816 485 ...
 $ Size          : Factor w/ 462 levels "1,000+","1.0M",..: 55 30 368 102 64 222 55 118 146 120 ...
 $ Installs      : Factor w/ 22 levels "0","0+","1,000,000,000+",..: 8 20 13 16 11 17 17 4 4 8 ...
 $ Type          : Factor w/ 4 levels "0","Free","NaN",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Price         : Factor w/ 93 levels "$0.99","$1.00",..: 92 92 92 92 92 92 92 92 92 92 ...
 $ Content.Rating: Factor w/ 7 levels "","Adults only 18+",..: 3 3 3 6 3 3 3 3 3 3 ...
 $ Genres        : Factor w/ 120 levels "Action","Action;Action & Adventure",..: 10 13 10 10 

**Structure summary**:  
From the structure it seems that some columns one would expect to be *numerical* are actually *categorical*. The clearest case is *Reviews*, as the column holds the number of reviews an app has received. *Size* could also be translated to a numerical value, as it has 462 levels. On the other hand, *Installs* is aggregated into 22 levels and is indeed a categorical value. The same goes for *Price*. The *Last.Updated* column looks like a date-time.  
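One possible way to turn *Size* into a number is sketched below. The helper name `parse_size_mb` is hypothetical, and both the `k` suffix and the non-numeric placeholder are assumptions about what the 462 levels contain; only values such as `19M` and `8.7M` are visible in the output above.

```r
# Hypothetical helper: convert size strings such as "19M" to megabytes.
# Assumes an optional trailing "M" (megabytes) or "k" (kilobytes);
# anything else that is not a plain number becomes NA.
parse_size_mb <- function(x) {
  x <- as.character(x)  # works for factors and character vectors alike
  value <- suppressWarnings(as.numeric(sub("[Mk]$", "", x)))
  ifelse(grepl("k$", x), value / 1024, value)
}

sizes <- factor(c("19M", "8.7M", "201k", "Varies with device"))
size_mb <- parse_size_mb(sizes)
size_mb
```

Values that cannot be parsed stay `NA`, so they can be counted and inspected before deciding how to treat them.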

Let's start with the transformation of the columns.

In [32]:
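# Start with Reviews. Convert via as.character(): as.numeric() on a factor returns the level codes, not the values.
gplay_df <- transform(df, Reviews = as.numeric(as.character(Reviews)))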
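One thing to watch when converting a factor column such as *Reviews*: `as.numeric()` applied directly to a factor returns the internal level codes. The `str()` output above hints at this, since the first app has 159 reviews but its stored code is 1183. A minimal sketch with three review counts taken from the head of the dataset:

```r
# Three review counts from the first rows, stored as a factor.
reviews <- factor(c("159", "967", "87510"))

# Direct conversion yields the level codes (levels are sorted as strings).
codes <- as.numeric(reviews)
codes   # 1 3 2

# Going through as.character() recovers the actual values.
values <- as.numeric(as.character(reviews))
values  # 159 967 87510
```

Any entry that is not a valid number becomes `NA` with a warning, which is a convenient way to spot malformed rows.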

**Plan**:  
1. *Preprocessing Data*  
    * Check for outliers, as some of the factor levels are unexpected given the column's context.
    * Finish transforming the other columns.  
2. *Initial Dataviz*  
    * Treat `Installs` as a target variable and examine its relation to Category, Rating, number of reviews, Type, Genres and Price.
    * Sort the dataset by Installs and start scraping Twitter.
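Sorting by `Installs` needs a numeric key, because the factor levels sort as strings. The sketch below is one possible approach, using three rows from the head of the dataset; the column name `InstallsNum` is an assumption made for illustration.

```r
# Three apps taken from the first rows of the dataset.
apps <- data.frame(
  App = c("Paper flowers instructions", "Sketch - Draw & Paint", "Coloring book moana"),
  Installs = factor(c("50,000+", "50,000,000+", "500,000+")),
  stringsAsFactors = FALSE
)

# Strip the thousands separators and the trailing "+", then convert.
# Non-numeric levels, if any, become NA.
apps$InstallsNum <- suppressWarnings(
  as.numeric(gsub("[,+]", "", as.character(apps$Installs)))
)

# Order from most to least installed.
apps_sorted <- apps[order(apps$InstallsNum, decreasing = TRUE), ]
apps_sorted$App
```

The resulting number is a lower bound rather than an exact count, since a value like `50,000+` only says the app has at least that many installs.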