# Sentiment Analysis
## An Introduction
### What is it?
Natural Language Processing (NLP): A sub-discipline of computer science, artificial intelligence and linguistics that builds computational models to process and understand natural language. Examples include language translation, auto-correct, language generation, topic summarization, sentiment analysis, and many more.  

Sentiment Analysis: Identification of opinions/emotions (positive, negative and neutral) within text data.  It can be used to gauge public attitude toward particular words or topics by analyzing the text in which they appear.

In this notebook, we'll look at some Twitter data, learn how to convert text to numerical features, and develop a basic sentiment analysis model to categorize the tone of a tweet.  In the process, we will uncover some issues with text data, discuss cleaning and ways of converting text to features, and use a pre-trained model.  By the end of this lecture, you should understand what sentiment analysis is, how it is used, and how to build such a model.


#### Who Uses It?
Businesses today depend on data.  Whether it comes from online Yelp reviews, customer surveys, social media, chats, or emails, much of this data is unstructured text, which is difficult to use en masse.  Abbreviated text (slang, short forms, memes and emoticons) adds to the difficulty.  Still, it is important to be able to summarize and understand trends even in this unstructured data.  

* Track negative/positive sentiment in response to advertising on social media
* Aggregate free-form text from online surveys
* Identify rising trends (popular songs, foods, brands)


#### Further Reading
* https://www.kaggle.com/kazanova/sentiment140
* https://www.nltk.org/data.html
* https://medium.com/@randerson112358/stock-market-sentiment-analysis-using-python-machine-learning-5b644f151a3e
* https://regex101.com/
* https://algotrading101.com/learn/sentiment-analysis-python-guide/
* https://www.kdnuggets.com/2016/06/politics-analytics-trump-clinton-sanders-twitter-sentiment.html
* https://amiham-singh.github.io/
* https://icwsm.org/papers/3--Godbole-Srinivasaiah-Skiena.pdf
* https://scikit-learn.org/stable/modules/feature_extraction.html

# Example

In [None]:
import numpy as np 
import pandas as pd 
import re
import nltk 
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.corpus import twitter_samples
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment import SentimentAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

%matplotlib inline
nltk.download('twitter_samples')
nltk.download('stopwords')
nltk.download('vader_lexicon')

## Getting the Data

In [None]:
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')

The NLTK `twitter_samples` corpus provides 5,000 positive and 5,000 negative tweets, labeled by the emoticons they contain.  We will use these for our model building and analysis.  What does a positive tweet look like?

In [None]:
print('Positive Tweets:')
# Show the first six positive tweets with their index
for i in range(6):
    print(f'{i}: {positive_tweets[i]}')

And negative?

In [None]:
print('Negative Tweets:')
# Show the first six negative tweets with their index
for i in range(6):
    print(f'{i}: {negative_tweets[i]}')

### Discussion


## Pretrained model

As part of the NLTK package, there is an already-trained sentiment analysis model called VADER (Valence Aware Dictionary and sEntiment Reasoner).  We will treat it as a "black box" to see how sentiment analysis can be used, without going into the specifics of how this type of model works.  To load and use it, call "SentimentIntensityAnalyzer" as follows:

In [None]:
sid = SentimentIntensityAnalyzer()
# Score one tweet; returns a dict with 'neg', 'neu', 'pos', and 'compound' keys
sid.polarity_scores(negative_tweets[0])

In [None]:
#sid.polarity_scores('this is a tweet')
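
The dictionary returned by `polarity_scores` holds four values: `neg`, `neu`, and `pos` (the proportions of the text that fall in each category) and `compound` (a normalized overall score between -1 and 1). A minimal sketch of turning that output into a label is shown here. The ±0.05 cutoffs are the thresholds conventionally suggested by the VADER authors, `label_from_scores` is a hypothetical helper name, and the `example_scores` dictionary is made up for illustration rather than real model output.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative', or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Hand-written example in the same shape as sid.polarity_scores(...) output
# (illustrative values, not real model output).
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.7}
print(label_from_scores(example_scores))
```

In the notebook, the same helper could be applied to `sid.polarity_scores(tweet)` for each tweet to compare VADER's labels with the emoticon-based ones.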

## Creating Text Features

Before we build a model, look at a few tweets and consider how issues in the raw text could affect our results:

In [None]:
print(positive_tweets[0])
print(negative_tweets[0])
print(negative_tweets[16])
print(negative_tweets[11])

Possible steps in cleaning text: 
* Remove punctuation
* Remove numbers/symbols
* Remove links/URLs/usernames/hashtags
* Remove case sensitivity (e.g., lowercase everything)
* Remove stop words
* Normalize whitespace
* Correct misspellings
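
Several of the steps above can be sketched with the standard `re` module. This is one possible cleaning function, not a definitive pipeline: `clean_tweet` is a hypothetical helper, the regular expressions are simple approximations, and `STOP_WORDS` is a tiny hand-picked set standing in for the much larger list in `stopwords.words('english')`. Spelling correction is left out.

```python
import re

# Tiny illustrative stop-word set (an assumption; NLTK's English list is far larger).
STOP_WORDS = {"a", "an", "the", "is", "it", "to", "and", "i", "this", "of"}

def clean_tweet(text):
    """Lowercase a tweet and strip links, usernames, symbols, and stop words."""
    text = text.lower()                         # remove case sensitivity
    text = re.sub(r"https?://\S+", " ", text)   # remove links/URLs
    text = re.sub(r"@\w+", " ", text)           # remove usernames
    text = text.replace("#", " ")               # drop '#' but keep the hashtag word
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation, numbers, symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]  # also collapses whitespace
    return " ".join(tokens)

print(clean_tweet("@Bob I LOVE this song!!! http://t.co/abc123 #Happy 100%"))
```

Note that this also strips emoticons such as `:)`. Because the tweets in this corpus were labeled by their emoticons, whether to keep or remove them is a modeling decision worth discussing.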


In [None]:
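# Combine both lists into one DataFrame (DataFrame.append was removed in pandas 2.0).
# Positive tweets are labeled 1 and negative tweets 0.
df = pd.DataFrame({
    'text': positive_tweets + negative_tweets,
    'sentiment': [1]*len(positive_tweets) + [0]*len(negative_tweets),
})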

In [None]:
df.head()

In [None]:
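# Bag-of-words counts: one row per tweet, one column per vocabulary word.
# The result is kept sparse; a dense array of this size uses far more memory.
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(df.text)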
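
The `CountVectorizer` cell above builds a bag-of-words matrix: each tweet becomes a row of word counts over a shared vocabulary. To make the idea concrete, here is a pure-Python sketch on two made-up sentences. It mimics only the counting step; the real `CountVectorizer` also lowercases text and, by default, keeps only tokens of two or more alphanumeric characters.

```python
from collections import Counter

# Two made-up "documents" (illustrative only).
docs = ["good good day", "bad day"]

# Vocabulary: every distinct word, sorted so each word gets a fixed column.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word, holding the word count.
matrix = [[Counter(doc.split())[word] for word in vocab] for doc in docs]

print(vocab)   # ['bad', 'day', 'good']
print(matrix)  # [[0, 1, 2], [1, 1, 0]]
```

Word order is discarded in this representation; only how often each word occurs is kept.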


In [None]:
#vectorizer.vocabulary_

In [None]:

X_train, X_test, y_train, y_test = train_test_split(bow, df.sentiment, test_size=0.2, random_state=0)

In [None]:
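# max_iter is raised from the default so the solver has room to converge on the wide count matrix
model = LogisticRegression(C=1., max_iter=1000)

#model = MultinomialNB()
model.fit(X_train, y_train)

# AUC should be computed from predicted probabilities of the positive class, not hard 0/1 labels
print("auc (test data):", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))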
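
ROC AUC measures how well a model ranks positive examples above negative ones, so it is sensitive to whether it receives continuous scores or thresholded 0/1 predictions. The sketch below computes AUC directly from that definition, as the fraction of (positive, negative) pairs ranked correctly, on made-up labels and probabilities. `pairwise_auc` is a hypothetical helper written for illustration; it is not how `roc_auc_score` is implemented internally, though the two should agree on inputs like these.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities (illustrative only).
y_true = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.6, 0.7, 0.55, 0.1]
hard_labels = [int(p >= 0.5) for p in probs]  # thresholded 0/1 predictions

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(auc_from_probs, auc_from_labels)
```

With these numbers the probability-based AUC is 8/9 while the label-based one drops to 2/3, because thresholding turns many distinct scores into ties.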