## Machine learning models cannot work directly with raw text, so we need to convert it into numerical vectors. As part of this practice exercise you will implement different techniques to do that.

### In this notebook we will look at techniques for encoding text data. We are going to learn about:

1. **Techniques for Encoding** - These are the popular techniques that are used for encoding:
    * **Bag of Words**
    * **TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)
2. **Sentiment Analysis** - Sentiment analysis is the process of computationally determining whether a piece of writing is positive, negative or neutral. The following tools can be used for sentiment analysis:
    * **TextBlob**
    * **VADER Sentiment**

In [1]:
import re
import numpy as np                                  #for large and multi-dimensional arrays
import pandas as pd                                 #for data manipulation and analysis
import nltk                                         #Natural language processing tool-kit

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer
from nltk.tokenize import word_tokenize 


from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PradeepSingh\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PradeepSingh\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
d1 = 'I enjoy this program.'
d2 = 'This program is great.'
d3 = 'This product is not great.'
d4 = 'I really love this brand.'

## Basic Pre-processing Steps:

* Conversion to lowercase.
* Removal of punctuation.
* Tokenization.
* Stopwords removal except the word 'not'.

In [3]:
stopwords = stopwords.words('english')
stopwords.remove('not')

In [4]:
stopwords

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [5]:
ps = PorterStemmer()
def preprocess(text):
  '''Pre-processing steps as described above.'''
  text = text.lower()                      # Convert to lowercase
  text = re.sub(r'[.,()|/]', ' ', text)    # Replace punctuation (. , ( ) | /) with a space
  word_tokens = word_tokenize(text)        # Tokenize

  # Remove stopwords ('not' was taken out of the stopword list, so it is kept)
  words = [word for word in word_tokens if word not in stopwords]
  return words

In [6]:
d1_new = preprocess(d1)
d2_new = preprocess(d2)
d3_new = preprocess(d3)
d4_new = preprocess(d4)

sent = d1_new + d2_new + d3_new + d4_new
print(sent)

['enjoy', 'program', 'program', 'great', 'product', 'not', 'great', 'really', 'love', 'brand']


## BAG OF WORDS
In BoW we build a dictionary (vocabulary) containing all the unique words in our text review dataset and count how often each word occurs. If there are d unique words in the vocabulary, then every sentence or review is represented by a vector of length d, where each position stores the count of the corresponding word in that review. Because a single review uses only a few of the d words, these vectors are highly sparse.
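The description above can be sketched in plain Python. This is a minimal illustration, not the notebook's own method: it assumes the four reviews have already been preprocessed into the token lists shown, and the names `docs`, `vocab` and `bow_vector` are made up for this example.

```python
from collections import Counter

# Token lists for d1..d4 after preprocessing (assumed, based on the steps above)
docs = [
    ['enjoy', 'program'],
    ['program', 'great'],
    ['product', 'not', 'great'],
    ['really', 'love', 'brand'],
]

# Vocabulary: all unique words, sorted so that every word has a fixed position
vocab = sorted({word for doc in docs for word in doc})

def bow_vector(tokens):
    """Return a count vector of length len(vocab) for one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocab]

vectors = [bow_vector(doc) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each vector has length d = 8 here, and most entries are zero, which is the sparsity mentioned above.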

### Using scikit-learn's CountVectorizer we can get the BoW representation and inspect the learned vocabulary and the resulting matrix.

In [7]:
cv = CountVectorizer()  
X = cv.fit_transform(sent)
print(cv.vocabulary_)
print(X.shape)
print(type(X))
print(X.toarray())

{'enjoy': 1, 'program': 6, 'great': 2, 'product': 5, 'not': 4, 'really': 7, 'love': 3, 'brand': 0}
(10, 8)
<class 'scipy.sparse.csr.csr_matrix'>
[[0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 1 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 0 1 0 0]
 [0 0 0 0 1 0 0 0]
 [0 0 1 0 0 0 0 0]
 [0 0 0 0 0 0 0 1]
 [0 0 0 1 0 0 0 0]
 [1 0 0 0 0 0 0 0]]
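Note that `sent` is a flat list of tokens, so `CountVectorizer` treats every token as its own document, which is why the matrix above has 10 rows. To get one BoW vector per review, one option is to pass one string per review. The sketch below assumes the preprocessed tokens of each review are joined back into a string; the variable `docs` is introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One string per review, built from the preprocessed tokens (assumed)
docs = [
    'enjoy program',
    'program great',
    'product not great',
    'really love brand',
]

cv = CountVectorizer()
X = cv.fit_transform(docs)

# Column order follows the indices stored in cv.vocabulary_
columns = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(columns)
print(X.shape)      # one row per review, one column per vocabulary word
print(X.toarray())
```

With this input the matrix has 4 rows, one per review, and the row for the third review counts 'product', 'not' and 'great' together.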


In [8]:
myvocabulary = list(set(sent))
myvocabulary

['product', 'really', 'not', 'love', 'brand', 'program', 'enjoy', 'great']

## TF-IDF

**Term Frequency - Inverse Document Frequency** gives less weight to words that occur in most documents and more weight to words that are rare across the corpus.

**Term Frequency** is the number of times a particular word (W) occurs in a review divided by the total number of words (Wr) in the review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs which contain the particular word (n))**. Here, docs refers to reviews.

**TF-IDF** is **TF × IDF**, that is, **(W / Wr) × log(N / n)**.
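As a worked example, the formula can be computed directly in plain Python. This sketch makes two assumptions: the reviews are the preprocessed token lists shown in the code, and the logarithm is the natural log (the text does not fix a base). The helper names `tf`, `idf` and `tf_idf` are illustrative only.

```python
import math

# Preprocessed token lists for d1..d4 (assumed)
docs = [
    ['enjoy', 'program'],
    ['program', 'great'],
    ['product', 'not', 'great'],
    ['really', 'love', 'brand'],
]

def tf(word, doc):
    """W / Wr: occurrences of the word divided by the number of words in the review."""
    return doc.count(word) / len(doc)

def idf(word, docs):
    """log(N / n): N reviews in total, n of them contain the word."""
    n = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / n)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# 'program' appears in 2 of 4 reviews, 'enjoy' in only 1, so 'enjoy' scores higher in d1
print(tf_idf('program', docs[0], docs))
print(tf_idf('enjoy', docs[0], docs))
```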

Using scikit-learn's TfidfVectorizer we can get the TF-IDF.

Even with TF-IDF, every word gets its own weight independently of its neighbours, so reviews with different meanings can look similar, especially after stopword removal. To overcome this we can use bigrams or, more generally, n-grams.
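To illustrate why n-grams help, here is a small sketch that builds bigrams from a token list. The `ngrams` helper is hypothetical and written only for this example; the token lists are assumed to be the preprocessed forms of `d2` and `d3`.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

d2_tokens = ['program', 'great']
d3_tokens = ['product', 'not', 'great']

print(ngrams(d2_tokens, 2))
print(ngrams(d3_tokens, 2))
```

As unigrams, both reviews share the feature 'great'. As bigrams, the third review contributes 'not great', which the second review does not contain, so the negation is preserved as a separate feature.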

In [9]:
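corpus = {1: d1, 2: d2, 3: d3, 4: d4}
# The fixed vocabulary contains only unigrams, so ngram_range=(1, 2) adds no bigram features here
tfidf = TfidfVectorizer(vocabulary=myvocabulary, ngram_range=(1, 2))
tfs = tfidf.fit_transform(corpus.values())
# get_feature_names() was removed in newer scikit-learn releases; use get_feature_names_out()
feature_names = tfidf.get_feature_names_out()
corpus_index = list(corpus)
# Rows are terms, columns are document ids
df = pd.DataFrame(tfs.T.toarray(), index=feature_names, columns=corpus_index)
print(tfidf.vocabulary_)
print(tfidf.idf_)
print(tfs.shape)
print(df)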

{'product': 0, 'really': 1, 'not': 2, 'love': 3, 'brand': 4, 'program': 5, 'enjoy': 6, 'great': 7}
[1.91629073 1.91629073 1.91629073 1.91629073 1.91629073 1.51082562
 1.91629073 1.51082562]
(4, 8)
                1         2         3        4
product  0.000000  0.000000  0.617614  0.00000
really   0.000000  0.000000  0.000000  0.57735
not      0.000000  0.000000  0.617614  0.00000
love     0.000000  0.000000  0.000000  0.57735
brand    0.000000  0.000000  0.000000  0.57735
program  0.619130  0.707107  0.000000  0.00000
enjoy    0.785288  0.000000  0.000000  0.00000
great    0.000000  0.707107  0.486934  0.00000
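The values printed above differ from the plain (W/Wr) × log(N/n) formula because `TfidfVectorizer`, with its default settings, uses raw counts as term frequency, a smoothed IDF of ln((1 + N) / (1 + n)) + 1, and then scales each document vector to unit L2 length. The sketch below reproduces the printed `idf_` values and the column for review 3 by hand. The document frequencies are read off the four reviews, and the variable names are only illustrative.

```python
import math

N = 4  # number of reviews

# Number of reviews containing each vocabulary word
doc_freq = {'product': 1, 'really': 1, 'not': 1, 'love': 1,
            'brand': 1, 'program': 2, 'enjoy': 1, 'great': 2}

# Smoothed IDF (default smooth_idf=True behaviour)
idf = {w: math.log((1 + N) / (1 + n)) + 1 for w, n in doc_freq.items()}

# Review 3 contains 'product', 'not' and 'great' once each
d3_counts = {'product': 1, 'not': 1, 'great': 1}
raw = {w: c * idf[w] for w, c in d3_counts.items()}

# L2 normalisation (default norm='l2')
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / norm for w, v in raw.items()}

print(round(idf['product'], 8), round(idf['program'], 8))
print({w: round(v, 6) for w, v in weights.items()})
```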


**VADER (Valence Aware Dictionary and sEntiment Reasoner)** is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER relies on a sentiment lexicon: a list of lexical features (e.g., words) which are generally labeled according to their semantic orientation as either positive or negative. VADER not only reports positivity and negativity scores but also tells us how strongly positive or negative a sentiment is.

In [10]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer 
sentiment = SentimentIntensityAnalyzer()
sentiment_dict1 = sentiment.polarity_scores(d1)
sentiment_dict2 = sentiment.polarity_scores(d2)
sentiment_dict3 = sentiment.polarity_scores(d3)
sentiment_dict4 = sentiment.polarity_scores(d4)

print(f'{d1}: {sentiment_dict1}')
print(f'{d2}: {sentiment_dict2}')
print(f'{d3}: {sentiment_dict3}')
print(f'{d4}: {sentiment_dict4}')

I enjoy this program.: {'neg': 0.0, 'neu': 0.484, 'pos': 0.516, 'compound': 0.4939}
This program is great.: {'neg': 0.0, 'neu': 0.423, 'pos': 0.577, 'compound': 0.6249}
This product is not great.: {'neg': 0.452, 'neu': 0.548, 'pos': 0.0, 'compound': -0.5096}
I really love this brand.: {'neg': 0.0, 'neu': 0.471, 'pos': 0.529, 'compound': 0.6697}
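The `compound` score is a single normalised value between -1 and 1. A common convention, suggested in the VADER documentation, is to call a text positive when `compound >= 0.05`, negative when `compound <= -0.05`, and neutral otherwise. The helper below is a sketch of that convention; `label_from_compound` is a made-up name, and the threshold is a tunable assumption rather than a fixed rule. The scores are copied from the output above.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Compound scores copied from the output above
compounds = {
    'I enjoy this program.': 0.4939,
    'This program is great.': 0.6249,
    'This product is not great.': -0.5096,
    'I really love this brand.': 0.6697,
}

labels = {text: label_from_compound(score) for text, score in compounds.items()}
for text, label in labels.items():
    print(f'{text}: {label}')
```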


In [11]:
from textblob import TextBlob
print(f'{d1}: {TextBlob(d1).sentiment}')
print(f'{d2}: {TextBlob(d2).sentiment}')
print(f'{d3}: {TextBlob(d3).sentiment}')
print(f'{d4}: {TextBlob(d4).sentiment}')

I enjoy this program.: Sentiment(polarity=0.4, subjectivity=0.5)
This program is great.: Sentiment(polarity=0.8, subjectivity=0.75)
This product is not great.: Sentiment(polarity=-0.4, subjectivity=0.75)
I really love this brand.: Sentiment(polarity=0.5, subjectivity=0.6)
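TextBlob reports a polarity in the range [-1, 1] and a subjectivity in the range [0, 1]. As a rough cross-check, the sketch below compares the sign of the TextBlob polarity with the sign of the VADER compound score for the same four sentences. The numbers are copied from the two outputs above, and the `sign` helper is only illustrative.

```python
# Scores copied from the outputs above, in the order d1..d4
vader_compound = [0.4939, 0.6249, -0.5096, 0.6697]
textblob_polarity = [0.4, 0.8, -0.4, 0.5]

def sign(x):
    """Return 1 for positive, -1 for negative and 0 for zero."""
    return (x > 0) - (x < 0)

# True where both tools agree on the direction of the sentiment
agree = [sign(v) == sign(t) for v, t in zip(vader_compound, textblob_polarity)]
print(agree)
```

The two tools agree on direction for all four sentences, although the magnitudes differ because each uses its own lexicon and scoring rules.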
