# Sentiment Analysis of Tweets using spaCy, VADER, and Scikit-Learn

This notebook will showcase two approaches to conducting sentiment analysis in a natural language processing (NLP) pipeline.  

Both approaches will attempt to resolve tweets to a continuum of sentiment based on the content of each tweet.  The sentiment values produced could then be further classified into positive vs. negative groups, which would be as simple as choosing an arbitrary threshold sentiment value (e.g. 0) and sorting each tweet into its respective bin.
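
To make the thresholding step concrete, here is a minimal sketch. The function name `to_label`, the toy scores, and the choice to send a score exactly equal to the threshold into the negative bin are all assumptions made for illustration, not part of the notebook's pipeline.

```python
def to_label(score, threshold=0.0):
    """Sort one sentiment score into a bin (hypothetical helper)."""
    # Assumption: scores exactly at the threshold fall into the negative bin.
    return "positive" if score > threshold else "negative"


# Invented sentiment scores, one per imaginary tweet.
toy_scores = [0.8, -0.4, 0.0, 0.1]
toy_labels = [to_label(s) for s in toy_scores]
print(toy_labels)
```

A neutral band around the threshold could be added in the same way if a third bin is wanted.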

1. Lexicon Approach

    - This approach will cross-reference the salient words of each tweet against a lexicon that prescribes a numeric sentiment valence for each word.
    - These valence values will be summed for each tweet to compute its compound sentiment.
    
<br>
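
The lexicon idea can be sketched in a few lines of plain Python. The tiny lexicon, its valence values, and the helper name `lexicon_score` below are invented for illustration. A real lexicon such as VADER's is far larger, and VADER also applies extra rules (negation, intensifiers, punctuation) that this sketch ignores.

```python
# Hypothetical toy lexicon: word -> sentiment valence (values are made up).
TOY_LEXICON = {"love": 3, "great": 2, "hate": -3, "awful": -2}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every word in `text` that appears in `lexicon`."""
    tokens = text.lower().split()
    # Strip simple trailing/leading punctuation; unknown words contribute 0.
    return sum(lexicon.get(tok.strip(".,!?"), 0) for tok in tokens)


print(lexicon_score("I love this great phone!"))   # 3 + 2 = 5
print(lexicon_score("Awful service, I hate it."))  # -2 + -3 = -5
```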

2. Machine Learning Approach

    - The second approach will train a machine learning regression model after representing each tweet as a vector in some n-dimensional space.
    - Each tweet will be transformed into a sparse vector in a high-dimensional space (one row of a sparse matrix) so that it can be presented to our regression models as numbers.

<br>
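
As a rough sketch of the second approach, the snippet below turns three invented tweets into a sparse bag-of-words matrix with `CountVectorizer` and fits a `LinearSVR` on made-up ratings. The toy texts, the ratings, and the default hyperparameters are assumptions for illustration only. The predictions from such a tiny training set are not meaningful.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVR

# Invented tweets and sentiment ratings (not real data).
toy_tweets = ["I love this phone", "I hate this phone", "this phone is okay"]
toy_ratings = [2.5, -2.5, 0.3]

# Each tweet becomes one row of a sparse document-term matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(toy_tweets)
print(X.shape, issparse(X))
print(sorted(vectorizer.vocabulary_))

# Fit a linear support vector regressor on the word counts.
model = LinearSVR(random_state=0)
model.fit(X, toy_ratings)

# New text must go through the same fitted vectorizer before prediction.
toy_prediction = model.predict(vectorizer.transform(["love this okay phone"]))
print(toy_prediction)
```

Note that the default tokenizer in `CountVectorizer` lowercases the text and drops single-character tokens such as "I", which is why the vocabulary has six entries here.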

All the code for this notebook will be in Python.

## 1. Setup

We're going to write a small number of functions ourselves, but mostly leverage functions from open-source packages.

This part is really <font color=green>**import**</font>ant.

In [2]:
import spacy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from matplotlib.colors import LinearSegmentedColormap, Normalize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVR
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import SGDRegressor

## 2. Read Datasets into Memory

The target dataset, i.e. the tweets whose sentiment we want to discern, comes from the *Sentiment140_test* set, available at the links below.
([GitHub](https://github.com/rmaestre/Sentiwordnet-BC/blob/master/test/testdata.manual.2009.06.14.csv), [Kaggle](https://www.kaggle.com/pirateshadow/sentiment140-test))

The machine learning training dataset we're going to use comes from the VADER package.  It is a list of tweet-like texts, each labelled with a sentiment value. ([GitHub](https://github.com/cjhutto/vaderSentiment#python-code-example))

In [4]:
# Target tweets: read without a header row, so columns get integer labels.
target_data_fpath = '~/Documents/datasets/nlp/testdata.manual.2009.06.14.csv'
target_df = pd.read_csv(target_data_fpath, header=None)

# Training data: tab-delimited, read without a header row, so we name the columns ourselves.
mltrain_fpath = '~/Documents/datasets/nlp/tweets_GroundTruth.txt'
mltrain_df = pd.read_csv(mltrain_fpath, delimiter='\t',
                         engine='python', header=None,
                         names=['id', 'mean_sentiment_rating', 'tweet_text'])
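
To see what the `header=None` and `names` arguments do without needing the files on disk, here is a small sketch that parses two invented rows from an in-memory buffer. The rows and the `toy_` names are made up for illustration and only mimic the tab-delimited layout assumed for the training file.

```python
import io

import pandas as pd

# Two invented rows in a tab-delimited, headerless layout (not real data).
toy_file = io.StringIO(
    "1\t2.7\tgreat day at the beach :)\n"
    "2\t-1.9\tworst customer service ever\n"
)

# header=None keeps the first row as data; names supplies the column labels.
toy_df = pd.read_csv(toy_file, delimiter='\t', engine='python', header=None,
                     names=['id', 'mean_sentiment_rating', 'tweet_text'])

print(toy_df.dtypes)
print(toy_df.head())
```

Calling `.head()` and `.dtypes` on the real `target_df` and `mltrain_df` is a quick way to confirm that the files parsed as expected.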