# Sentiment Analysis Project - Robinhood Case Study

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media.

- How this library works
- Sentiment analysis on Robinhood app reviews

Documentation: https://github.com/cjhutto/vaderSentiment

# Cleaning and Pre-Processing Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('20210320_0047ios_app.csv')
df.shape

(27139, 5)

In [3]:
df.head()

| | Country | Date | Rating | Review | Version |
|---|---|---|---|---|---|
| 0 | US | 2021-03-18 | 1 | Knowing my shares aren’t real, means RH IS MAN... | - |
| 1 | US | 2021-03-18 | 1 | This company is currently under investigation ... | - |
| 2 | US | 2021-03-18 | 1 | They sell your data to MM, halt trading when i... | - |
| 3 | US | 2021-03-18 | 1 | Easy and simple to use but for the love of god... | - |
| 4 | US | 2021-03-18 | 5 | Easy to learn & Use | - |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27139 entries, 0 to 27138
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Country  27139 non-null  object
 1   Date     27139 non-null  object
 2   Rating   27139 non-null  int64 
 3   Review   27137 non-null  object
 4   Version  27139 non-null  object
dtypes: int64(1), object(4)
memory usage: 1.0+ MB


### Cleaning Missing Values

In [5]:
df.isnull().sum()

Country    0
Date       0
Rating     0
Review     2
Version    0
dtype: int64

In [6]:
df[df.isnull().any(axis=1)]

| | Country | Date | Rating | Review | Version |
|---|---|---|---|---|---|
| 1019 | US | 2021-03-11 | 1 | NaN | 9.4.0 |
| 21480 | US | 2021-01-22 | 5 | NaN | 8.63.0 |


In [7]:
df.dropna(inplace=True)

In [8]:
df[df.isnull().any(axis=1)]

| | Country | Date | Rating | Review | Version |
|---|---|---|---|---|---|


### Dropping Unnecessary Data

In [9]:
df['Country'].describe()

count     27137
unique        1
top          US
freq      27137
Name: Country, dtype: object

In [10]:
df['Version'].describe()

count     27137
unique       28
top           -
freq      10111
Name: Version, dtype: object

In [11]:
df.drop(columns=['Country', 'Version'], inplace=True)
df.head()

| | Date | Rating | Review |
|---|---|---|---|
| 0 | 2021-03-18 | 1 | Knowing my shares aren’t real, means RH IS MAN... |
| 1 | 2021-03-18 | 1 | This company is currently under investigation ... |
| 2 | 2021-03-18 | 1 | They sell your data to MM, halt trading when i... |
| 3 | 2021-03-18 | 1 | Easy and simple to use but for the love of god... |
| 4 | 2021-03-18 | 5 | Easy to learn & Use |


### Change Column Data Types

In [12]:
df['Date'] = pd.to_datetime(df['Date'])

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 27137 entries, 0 to 27138
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Date    27137 non-null  datetime64[ns]
 1   Rating  27137 non-null  int64         
 2   Review  27137 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 848.0+ KB


### Verify Data Consistency

#### Looking for Whitespace-Only Strings

In [14]:
blanks = []

# str.isspace() returns True only when the string is non-empty and consists
# entirely of whitespace characters (spaces, tabs, newlines); otherwise False.
for ID, DATE, RATING, REVIEW in df.itertuples():
    if isinstance(REVIEW, str) and REVIEW.isspace():
        blanks.append(ID)

In [15]:
blanks

[]

In [None]:
# If any reviews contained only whitespace, we would drop them:
# df.drop(blanks, inplace=True)
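
The same check can also be written without an explicit loop. The following is a minimal sketch, assuming a small hypothetical frame `demo` in place of the real review data (the names `demo`, `is_blank` and `cleaned` are illustrative only). Note that the pattern `\s*` also flags empty strings, which `isspace()` would not, and `na=False` treats missing reviews as non-matches.

```python
import pandas as pd

# Hypothetical stand-in for the review data, including one whitespace-only
# review and one missing review.
demo = pd.DataFrame({
    "Rating": [5, 1, 3, 4],
    "Review": ["Easy to learn & Use", "   \t\n", None, "ok"],
})

# True where the review is empty or made up only of whitespace characters.
# na=False makes missing values count as "not blank" instead of propagating NaN.
is_blank = demo["Review"].str.fullmatch(r"\s*", na=False)

# Collect the index labels of blank reviews and drop those rows.
blank_ids = demo[is_blank].index.tolist()
cleaned = demo.drop(index=blank_ids)
print(blank_ids)
```

Missing values are left in place here on purpose, since they are handled separately by `dropna` earlier in the notebook.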

#### Rating Values

In [16]:
df['Rating'].describe()

count    27137.000000
mean         2.553967
std          1.834600
min          1.000000
25%          1.000000
50%          1.000000
75%          5.000000
max          5.000000
Name: Rating, dtype: float64

In [17]:
df.Rating.unique()

array([1, 5, 4, 2, 3], dtype=int64)

In [18]:
df

| | Date | Rating | Review |
|---|---|---|---|
| 0 | 2021-03-18 | 1 | Knowing my shares aren’t real, means RH IS MAN... |
| 1 | 2021-03-18 | 1 | This company is currently under investigation ... |
| 2 | 2021-03-18 | 1 | They sell your data to MM, halt trading when i... |
| 3 | 2021-03-18 | 1 | Easy and simple to use but for the love of god... |
| 4 | 2021-03-18 | 5 | Easy to learn & Use |
| ... | ... | ... | ... |
| 27134 | 2020-12-20 | 5 | I love this app and it’s the best out their fo... |
| 27135 | 2020-12-20 | 5 | The app is simple, organized and has good cont... |
| 27136 | 2020-12-20 | 4 | I really like the ease of using this app, but ... |
| 27137 | 2020-12-20 | 1 | I’ve been reaching out to them for over 3 week... |


In [19]:
df.to_csv("ios_app_clean.csv", index=False)

# Sentiment Analysis

In [1]:
import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

## How VADER works

In [2]:
analyzer = SentimentIntensityAnalyzer()

In [3]:
print('Baseline')
print(analyzer.polarity_scores("I'm happy"))

print('Capitalization')
print(analyzer.polarity_scores("I'm HAPPY"))

Baseline
{'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719}
Capitalization
{'neg': 0.0, 'neu': 0.184, 'pos': 0.816, 'compound': 0.6633}


In [4]:
print('Baseline')
print(analyzer.polarity_scores("I'm happy"))

print('Degree modifier')
print(analyzer.polarity_scores("I'm a little happy"))
print(analyzer.polarity_scores("I'm a extremely happy"))

Baseline
{'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719}
Degree modifier
{'neg': 0.0, 'neu': 0.468, 'pos': 0.532, 'compound': 0.5279}
{'neg': 0.0, 'neu': 0.429, 'pos': 0.571, 'compound': 0.6115}


In [5]:
print('Baseline')
print(analyzer.polarity_scores("I'm happy"))

print('Punctuation')
print(analyzer.polarity_scores("I'm happy !"))
print(analyzer.polarity_scores("I'm happy !!!"))

Baseline
{'neg': 0.0, 'neu': 0.213, 'pos': 0.787, 'compound': 0.5719}
Punctuation
{'neg': 0.0, 'neu': 0.334, 'pos': 0.666, 'compound': 0.6114}
{'neg': 0.0, 'neu': 0.304, 'pos': 0.696, 'compound': 0.6784}
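
The compound values in the three cells above appear to follow from a single normalization step applied to an adjusted valence. The sketch below is an illustration, not VADER's actual code: it assumes the constants I recall from the vaderSentiment source (a lexicon valence of 2.7 for "happy", a normalization constant of 15, and the booster increments listed in the comments), none of which are stated in this notebook. Under those assumptions it closely reproduces the compound scores printed above.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# Assumed constants (recalled from the vaderSentiment source, not from this notebook).
HAPPY = 2.7           # lexicon valence of "happy"
CAPS_BOOST = 0.733    # added when a sentiment word is written in ALL CAPS
DEGREE_BOOST = 0.293  # added by intensifiers ("extremely"), subtracted by dampeners ("little")
EXCL_BOOST = 0.292    # added per exclamation mark

# Adjusted valence for each example sentence used in the cells above.
examples = {
    "I'm happy": HAPPY,
    "I'm HAPPY": HAPPY + CAPS_BOOST,
    "I'm a little happy": HAPPY - DEGREE_BOOST,
    "I'm a extremely happy": HAPPY + DEGREE_BOOST,
    "I'm happy !": HAPPY + 1 * EXCL_BOOST,
    "I'm happy !!!": HAPPY + 3 * EXCL_BOOST,
}

compounds = {text: normalize(valence) for text, valence in examples.items()}
for text, value in compounds.items():
    print(f"{text!r}: {value:.4f}")
```

Because the normalization flattens as the valence sum grows, each additional booster moves the compound score by less than the previous one.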


In [6]:
print('Baseline')
print(analyzer.polarity_scores("I'm sad"))

print('Conjunctions')
print(analyzer.polarity_scores("I'm sad, but i like it"))

Baseline
{'neg': 0.756, 'neu': 0.244, 'pos': 0.0, 'compound': -0.4767}
Conjunctions
{'neg': 0.22, 'neu': 0.43, 'pos': 0.349, 'compound': 0.296}
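
The sign flip above is consistent with how VADER treats a contrastive "but": sentiment before the conjunction is down-weighted and sentiment after it is up-weighted. The sketch below is a simplified illustration, assuming weights of 0.5 and 1.5 and lexicon valences of -2.1 for "sad" and 1.5 for "like" (values recalled from the vaderSentiment source, not given in this notebook). The helper `apply_but_rule` is hypothetical and is not part of the library's API.

```python
import math

def normalize(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1)."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

def apply_but_rule(tokens, valences):
    """Scale valences around the first "but": 0.5 before it, 1.5 after it."""
    lowered = [token.lower() for token in tokens]
    if "but" not in lowered:
        return list(valences)
    pivot = lowered.index("but")
    adjusted = []
    for i, valence in enumerate(valences):
        if i < pivot:
            adjusted.append(valence * 0.5)
        elif i > pivot:
            adjusted.append(valence * 1.5)
        else:
            adjusted.append(valence)
    return adjusted

tokens = ["I'm", "sad", "but", "i", "like", "it"]
# Assumed lexicon valences: "sad" = -2.1, "like" = 1.5, everything else neutral.
valences = [0.0, -2.1, 0.0, 0.0, 1.5, 0.0]

adjusted = apply_but_rule(tokens, valences)
compound = normalize(sum(adjusted))
print(round(compound, 4))
```

With these assumptions the weighted sum is positive even though "sad" has the larger raw magnitude, which matches the positive compound score reported for the sentence.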


In [7]:
# Emotions expressed with emojis and emoticons

print(analyzer.polarity_scores("I'm happy !! 😥"))
print(analyzer.polarity_scores("I'm happy !! 😄"))
print(analyzer.polarity_scores("I'm happy !! 😊"))

print(analyzer.polarity_scores("This is bad ... :("))
print(analyzer.polarity_scores("This is bad ... :'("))

{'neg': 0.166, 'neu': 0.323, 'pos': 0.511, 'compound': 0.6467}
{'neg': 0.0, 'neu': 0.338, 'pos': 0.662, 'compound': 0.8684}
{'neg': 0.0, 'neu': 0.327, 'pos': 0.673, 'compound': 0.8829}
{'neg': 0.681, 'neu': 0.319, 'pos': 0.0, 'compound': -0.7506}
{'neg': 0.691, 'neu': 0.309, 'pos': 0.0, 'compound': -0.7717}


In [8]:
sentimentScores = analyzer.polarity_scores('This is bad ... :(')
sentimentScores

{'neg': 0.681, 'neu': 0.319, 'pos': 0.0, 'compound': -0.7506}

In [11]:
sentimentScores['compound']

-0.7506

## Sentiment Analysis on Robinhood App Reviews

In [12]:
def sentimentScores(text):
    # Map VADER's compound score to a label:
    # 1 (positive), -1 (negative) or 0 (neutral)
    scores = analyzer.polarity_scores(str(text))
    compound = scores['compound']

    if compound >= 0.05:
        sentiment = 1
    elif compound <= -0.05:
        sentiment = -1
    else:
        sentiment = 0

    return sentiment

In [13]:
df = pd.read_csv('ios_app_clean.csv')

In [14]:
df.head()

| | Date | Rating | Review |
|---|---|---|---|
| 0 | 2021-03-18 | 1 | Knowing my shares aren’t real, means RH IS MAN... |
| 1 | 2021-03-18 | 1 | This company is currently under investigation ... |
| 2 | 2021-03-18 | 1 | They sell your data to MM, halt trading when i... |
| 3 | 2021-03-18 | 1 | Easy and simple to use but for the love of god... |
| 4 | 2021-03-18 | 5 | Easy to learn & Use |


In [15]:
df['Sentiment'] = [sentimentScores(x) for x in df['Review']]

In [16]:
df.describe()

| | Rating | Sentiment |
|---|---|---|
| count | 27137.0 | 27137.0 |
| mean | 2.553967 | 0.291042 |
| std | 1.8346 | 0.880151 |
| min | 1.0 | -1.0 |
| 25% | 1.0 | -1.0 |
| 50% | 1.0 | 1.0 |
| 75% | 5.0 | 1.0 |
| max | 5.0 | 1.0 |


In [17]:
df.Sentiment.value_counts()

 1    15609
-1     7711
 0     3817
Name: Sentiment, dtype: int64

In [18]:
df.head(5)

| | Date | Rating | Review | Sentiment |
|---|---|---|---|---|
| 0 | 2021-03-18 | 1 | Knowing my shares aren’t real, means RH IS MAN... | -1 |
| 1 | 2021-03-18 | 1 | This company is currently under investigation ... | -1 |
| 2 | 2021-03-18 | 1 | They sell your data to MM, halt trading when i... | -1 |
| 3 | 2021-03-18 | 1 | Easy and simple to use but for the love of god... | 1 |
| 4 | 2021-03-18 | 5 | Easy to learn & Use | 1 |
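
Row 3 above is a 1-star review that VADER labels as positive. One way to see how often the star rating and the sentiment label disagree is a cross-tabulation. The following is a minimal sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration and are not taken from the review file).

```python
import pandas as pd

# Hypothetical toy data mirroring the Rating and Sentiment columns above.
toy = pd.DataFrame({
    "Rating":    [1, 1, 1, 1, 5, 5, 4, 2],
    "Sentiment": [-1, -1, -1, 1, 1, 1, 1, 0],
})

# Rows: star rating; columns: VADER label; cells: number of reviews.
table = pd.crosstab(toy["Rating"], toy["Sentiment"])
print(table)

# Low-star reviews that were nevertheless labelled positive.
mismatches = toy[(toy["Rating"] <= 2) & (toy["Sentiment"] == 1)]
print(len(mismatches))
```

On the real data, the same two calls could be applied to `df` directly, and the mismatched rows would be natural candidates for manual inspection.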


In [19]:
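# Select the numeric columns explicitly: recent pandas versions no longer
# drop non-numeric columns (Date, Review) automatically in corr().
corr = df[['Rating', 'Sentiment']].corr(method='pearson')
corr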

| | Rating | Sentiment |
|---|---|---|
| Rating | 1.0 | 0.509658 |
| Sentiment | 0.509658 | 1.0 |
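
For reference, the Pearson coefficient reported above is the covariance of the two columns divided by the product of their standard deviations: r = Σ(x − x̄)(y − ȳ) / sqrt(Σ(x − x̄)² · Σ(y − ȳ)²). The sketch below computes it by hand for the five rows shown by `df.head(5)` earlier; the helper name `pearson` is illustrative. Since the sentiment label takes only three values, the coefficient is best read as a rough summary of agreement rather than a precise measure.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Rating and Sentiment values of the first five reviews shown earlier.
ratings = [1, 1, 1, 1, 5]
sentiments = [-1, -1, -1, 1, 1]

r = pearson(ratings, sentiments)
print(round(r, 4))
```

The value for these five rows differs from the full-data coefficient, as expected for such a small sample.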


In [20]:
df.to_csv("ios_app_final.csv", index=False)