# Sentiment Detection in Tweets
We will classify tweets into two classes:

* Positive
* Negative

Link to download the dataset:
(https://drive.google.com/u/0/uc?export=download&confirm=vN3O&id=1xnIo9sCIQ4ETvG8v3_-1wPhMl9mHCa5i) 


## Loading the dataset

In [None]:
import pandas as pd
data = pd.read_csv("twitter_sentiments.csv",encoding = "ISO-8859-1",header=None, names = [ 'sentiment', 'id', 'date', 'flag', 'user', 'text'])

In [None]:
data.head()

| | sentiment | id | date | flag | user | text |
|---|---|---|---|---|---|---|
| 0 | 0 | 1467810369 | Mon Apr 06 22:19:45 PDT 2009 | NO_QUERY | \_TheSpecialOne\_ | @switchfoot http://twitpic.com/2y1zl - Awww, t... |
| 1 | 0 | 1467810672 | Mon Apr 06 22:19:49 PDT 2009 | NO_QUERY | scotthamilton | is upset that he can't update his Facebook by ... |
| 2 | 0 | 1467810917 | Mon Apr 06 22:19:53 PDT 2009 | NO_QUERY | mattycus | @Kenichan I dived many times for the ball. Man... |
| 3 | 0 | 1467811184 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | ElleCTF | my whole body feels itchy and like its on fire |
| 4 | 0 | 1467811193 | Mon Apr 06 22:19:57 PDT 2009 | NO_QUERY | Karoli | @nationwideclass no, it's not behaving at all.... |


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1600000 entries, 0 to 1599999
Data columns (total 6 columns):
sentiment    1600000 non-null int64
id           1600000 non-null int64
date         1600000 non-null object
flag         1600000 non-null object
user         1600000 non-null object
text         1600000 non-null object
dtypes: int64(2), object(4)
memory usage: 73.2+ MB


In [None]:
data["sentiment"].value_counts()

4    800000
0    800000
Name: sentiment, dtype: int64

**4**: Positive, **0**: Negative

There are 1,600,000 tweets, split evenly between the two classes.

For initial training, let us take only 10,000 tweets: the first 5,000 positive and the first 5,000 negative.

In [None]:
pos = data[data["sentiment"] == 4][:5000]
neg = data[data["sentiment"] == 0][:5000]

data = pd.concat([pos,neg])

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 800000 to 4999
Data columns (total 6 columns):
sentiment    10000 non-null int64
id           10000 non-null int64
date         10000 non-null object
flag         10000 non-null object
user         10000 non-null object
text         10000 non-null object
dtypes: int64(2), object(4)
memory usage: 546.9+ KB


## Preprocessing

In [None]:
import re
def clean_text(text):
  # Remove all the special characters
  processed_feature = re.sub(r'\W', ' ', text)

  # Remove single letters surrounded by whitespace
  processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

  # Remove a single letter at the start of the text
  # (the caret must be unescaped to anchor at the start of the string)
  processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

  # Collapse runs of whitespace into a single space
  processed_feature = re.sub(r'\s+', ' ', processed_feature)

  # Convert to lowercase
  processed_feature = processed_feature.lower()

  return processed_feature


In [None]:
data["text"] = data["text"].apply(clean_text)

In [None]:
import nltk
nltk.download('stopwords')


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Training

A computer cannot work with raw text directly, so we need to represent text as numbers. Let us go step by step through how to do that.

---


### 1. Creating Vocabulary
The set of unique words used in the text corpus is referred to as the vocabulary. When processing raw text for NLP, everything is done around the vocabulary.


```
texts = ['bob ate apples', 'fred ate apples', 'bob ate pears']
convert_to_vocab(texts)

>> Output: ['bob', 'fred', 'ate', 'apples', 'pears']
```
We can create all the sentences in **texts** using the vocab list.
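
`convert_to_vocab` above is an illustrative name, not a library function. A minimal sketch of what it might look like, assuming words are separated by whitespace and kept in order of first appearance (so the order can differ from the list shown above, although the set of words is the same):

```python
def convert_to_vocab(texts):
    """Return the unique words in `texts`, in order of first appearance."""
    vocab = []
    seen = set()
    for text in texts:
        for word in text.split():
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab

texts = ['bob ate apples', 'fred ate apples', 'bob ate pears']
vocab = convert_to_vocab(texts)
print(vocab)  # ['bob', 'ate', 'apples', 'fred', 'pears']
```

A vocabulary is a set of words, so any fixed ordering works as long as it is used consistently afterwards.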

### 2. Tokenization
We can use the vocabulary to find the number of times each word appears in the corpus, figure out which words are the most common or uncommon, and filter each text document based on the words that appear in it. However, the most important part of the vocabulary is that it allows us to represent each piece of text by the specific words that appear in it.

Rather than being represented as one long string, a piece of text can be represented as a vector/list of its vocabulary words. This process is known as tokenization, where each individual vocabulary word in a piece of text is a token.


```
texts = ['bob ate apples, pears', 'fred ate apples!']
tokenize(texts)
>> Output: [['bob', 'ate', 'apples', 'pears'], ['fred', 'ate', 'apples']]
```
**Note that the punctuation is gone.**
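
The `tokenize` call above is likewise a placeholder. One simple way to get the same result, assuming a token is any run of letters, digits, or underscores, is a regular expression; real tokenizers (for example those in NLTK) handle contractions, emoticons, and similar cases more carefully.

```python
import re

def tokenize(texts):
    """Lowercase each text and split it into runs of word characters."""
    return [re.findall(r"\w+", text.lower()) for text in texts]

texts = ['bob ate apples, pears', 'fred ate apples!']
tokens = tokenize(texts)
print(tokens)  # [['bob', 'ate', 'apples', 'pears'], ['fred', 'ate', 'apples']]
```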

### 3. Embeddings
We need to represent the above as numbers for our machine to understand. A simple way to do it is to use each word's list index in the vocabulary.
```
[['bob', 'ate', 'apples', 'pears'], ['fred', 'ate', 'apples']]
can be represented as
[[0, 2, 3, 4], [1, 2, 3]]
using this key value pair
{'bob': 0, 'fred': 1, 'ate': 2, 'apples': 3, 'pears': 4}
```
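
The mapping from tokens to integers can be sketched in a few lines. This assumes the vocabulary list from step 1; the exact IDs depend only on the order of that list, so a different ordering gives different (but equally valid) IDs.

```python
vocab = ['bob', 'fred', 'ate', 'apples', 'pears']
tokenized = [['bob', 'ate', 'apples', 'pears'], ['fred', 'ate', 'apples']]

# Map each word to its position in the vocabulary list
word_to_id = {word: index for index, word in enumerate(vocab)}

# Replace every token with its integer ID
encoded = [[word_to_id[token] for token in tokens] for tokens in tokenized]
print(word_to_id)
print(encoded)  # [[0, 2, 3, 4], [1, 2, 3]]
```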

We have now represented vocabulary words with unique integer IDs. However, these integer IDs don't give a sense of how different words may be related.

The solution to this problem is to convert each word into an embedding vector. An embedding vector is a higher-dimensional vector representation of a vocabulary word. Since vectors have a sense of distance (as they are just points in a higher-dimensional space), embedding vectors give us a word representation that captures relationships between words.

Head over to https://projector.tensorflow.org/ to visualize word vectors.
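
To make the idea of distance between embedding vectors concrete, here is a small sketch using cosine similarity. The three vectors are made up by hand purely for illustration; real embeddings are learned from data and usually have hundreds of dimensions.

```python
import math

# Made-up 3-dimensional vectors, chosen by hand for illustration only
embeddings = {
    'apples': [0.9, 0.1, 0.0],
    'pears':  [0.8, 0.2, 0.0],
    'bob':    [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

fruit_similarity = cosine_similarity(embeddings['apples'], embeddings['pears'])
mixed_similarity = cosine_similarity(embeddings['apples'], embeddings['bob'])
print(fruit_similarity > mixed_similarity)  # True
```

With plain integer IDs no such comparison is meaningful, which is why a vector representation is preferred.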

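We will use TF-IDF to convert our tweets into vectors. Strictly speaking, TF-IDF is not a learned embedding: it gives one vector per tweet, in which each vocabulary word is weighted by how often it occurs in the tweet and how rare it is across the corpus.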
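
As a rough illustration of what TF-IDF computes, the sketch below works it out by hand on the toy corpus. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default; the helper names are hypothetical and no stop words are removed here.

```python
import math
from collections import Counter

docs = [['bob', 'ate', 'apples'], ['fred', 'ate', 'apples'], ['bob', 'ate', 'pears']]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df = Counter(word for doc in docs for word in set(doc))

# Smoothed inverse document frequency: rarer words get a larger weight
idf = {word: math.log((1 + n_docs) / (1 + df[word])) + 1 for word in vocab}

def tfidf_vector(doc):
    """Return the L2-normalized TF-IDF vector of one tokenized document."""
    counts = Counter(doc)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

vectors = [tfidf_vector(doc) for doc in docs]
print(vocab)
print([round(value, 3) for value in vectors[0]])
```

The word `ate` appears in every document and so gets the smallest weight, while `fred` and `pears` appear only once and get the largest.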



In [None]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stopwords.words('english'))
# fit builds the vocabulary and IDF weights; transform returns one TF-IDF row per tweet
tfidf = vectorizer.fit_transform(data["text"]).toarray()

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tfidf, data["sentiment"], test_size=0.2, random_state=0)


We will use a **Random Forest** classifier to train our model. More information on random forests [here](https://www.youtube.com/watch?v=eM4uJ6XGnSM).


In [None]:
from sklearn.ensemble import RandomForestClassifier

text_classifier = RandomForestClassifier(n_estimators=200, random_state=0)
text_classifier.fit(X_train, y_train)


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=200,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

## Testing

In [None]:
predictions = text_classifier.predict(X_test)

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))


0.6775


In [None]:
sample_tweets = ["I did not like the last weeks episode","Their customer support is doing a good job"]
tfidf2 = vectorizer.transform(sample_tweets).toarray()
predictions = text_classifier.predict(tfidf2)

In [None]:
predictions

array([0, 4])