### Machine learning models cannot work with raw text directly, so we need to convert it into numerical vectors. In this practice exercise you will implement different techniques to do so.

### In this notebook we are going to look at techniques for encoding text data. We are going to learn about:

1. **Techniques for Encoding** - These are popular techniques for encoding text:
    * **Bag of Words**
    * **TF-IDF** (**T**erm **F**requency - **I**nverse **D**ocument **F**requency)
2. **Sentiment Analysis** - Sentiment analysis is the process of 'computationally' determining whether a piece of writing is positive, negative or neutral. The following tools can be used for sentiment analysis:
    * **TextBlob**
    * **VADER Sentiment**

In [1]:
import re
import numpy as np                                  #for large and multi-dimensional arrays
import pandas as pd                                 #for data manipulation and analysis
import nltk                                         #Natural language processing tool-kit

nltk.download('stopwords')
nltk.download('punkt')

from nltk.corpus import stopwords                   #Stopwords corpus
from nltk.stem import PorterStemmer                 # Stemmer
from nltk.tokenize import word_tokenize 


from sklearn.feature_extraction.text import CountVectorizer          #For Bag of words
from sklearn.feature_extraction.text import TfidfVectorizer          #For TF-IDF

!pip install vaderSentiment

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [0]:
d1 = 'I enjoy this program.'
d2 = 'This program is great.'
d3 = 'This product is not great.'
d4 = 'I really love this brand.'

In [0]:
# Write the code to build the list of stopwords, removing "not" from it.


### Basic Pre-processing Steps:

- Conversion to lowercase.
- Removal of punctuation.
- Tokenization.
- Stopwords removal except the word 'not'.

In [0]:
# Write the function to perform the above four steps.


In [0]:
# Write the code to preprocess all four text strings and save the results into new variables. Join all four resulting lists.


### **BAG OF WORDS**
      
In BoW we construct a dictionary (vocabulary) that contains all unique words from our text review dataset, and we count how often each word occurs. If there are **d** unique words in the dictionary, then every sentence or review is represented by a vector of length **d**, and the count of each word in the review is stored at that word's position in the vector. Because a single review uses only a few of the **d** words, the vectors are highly sparse.
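To make the idea concrete, here is a minimal from-scratch sketch of a bag-of-words matrix for the four example sentences. It is not how `CountVectorizer` is implemented; the regex tokenizer and the names `tokenize`, `vocabulary` and `bow_matrix` are assumptions made for illustration.

```python
import re
from collections import Counter

docs = [
    "I enjoy this program.",
    "This program is great.",
    "This product is not great.",
    "I really love this brand.",
]


def tokenize(text):
    # Simplifying assumption: lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in docs]

# The dictionary: every unique word, sorted so each word has a fixed position.
vocabulary = sorted({word for tokens in tokenized for word in tokens})


def bow_vector(tokens):
    # One count per vocabulary word; most entries stay 0 (sparse).
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


bow_matrix = [bow_vector(tokens) for tokens in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

With these four sentences the vocabulary has d = 11 words, so every row has length 11. Note that `CountVectorizer`'s default token pattern ignores single-character tokens such as "i", so its vocabulary for the same input will be slightly smaller.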
      
      
#### Using scikit-learn's CountVectorizer we can build the BoW representation and inspect the parameters it offers.

In [0]:
# Write the code to perform Bag of words representation using CountVectorizer().
# Print the following: vocabulary, shape of the matrix, type of the vector, the vector as array.


In [0]:
# Write the code to set the unique words as vocabulary. Print the vocabulary.


### **TF-IDF**

**Term Frequency - Inverse Document Frequency** gives less weight to words that occur in most documents and more weight to words that are rare across the dataset.

**Term Frequency** is the number of times a **particular word (W)** occurs in a review divided by the total number of words **(Wr)** in that review. The term frequency value ranges from 0 to 1.

**Inverse Document Frequency** is calculated as **log(Total Number of Docs (N) / Number of Docs which contain the particular word (n))**. Here, Docs refers to reviews.

**TF-IDF** is **TF * IDF**, that is, **(W/Wr) * log(N/n)**
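The formula above can be checked by hand with a short sketch. It uses the natural logarithm (an assumption, since the text does not fix the base) and hypothetical names such as `doc_freq` and `tf_idf`. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math
import re

docs = [
    "I enjoy this program.",
    "This program is great.",
    "This product is not great.",
    "I really love this brand.",
]


def tokenize(text):
    # Simplifying assumption: lowercase and keep runs of letters only.
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in docs]
N = len(tokenized)  # total number of documents

# n: number of documents that contain each word
doc_freq = {}
for tokens in tokenized:
    for word in set(tokens):
        doc_freq[word] = doc_freq.get(word, 0) + 1


def tf_idf(tokens):
    Wr = len(tokens)  # total words in this review
    scores = {}
    for word in set(tokens):
        W = tokens.count(word)  # occurrences of this word in the review
        scores[word] = (W / Wr) * math.log(N / doc_freq[word])
    return scores


tfidf = [tf_idf(tokens) for tokens in tokenized]
for doc, scores in zip(docs, tfidf):
    print(doc, {w: round(s, 3) for w, s in sorted(scores.items())})
```

Because "this" appears in all four sentences, log(N/n) = log(1) = 0 and its TF-IDF is 0 everywhere, while a word that occurs in only one sentence, such as "enjoy", gets the largest weight.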


Using scikit-learn's TfidfVectorizer we can compute TF-IDF.

Even here we get one TF-IDF value per word, so in some cases reviews with different meanings may be treated as similar after stopword removal. To overcome this we can use bigrams or, more generally, n-grams.
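As a small illustration of why n-grams help, the sketch below builds bigrams from already-preprocessed tokens. The helper `ngrams` is a hypothetical name, and the token lists are assumed to be the result of stopword removal that keeps "not". In scikit-learn the same effect is available through the vectorizers' `ngram_range` parameter, for example `ngram_range=(1, 2)` for unigrams plus bigrams.

```python
def ngrams(tokens, n):
    # Slide a window of size n over the tokens and join each window into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Assumed output of preprocessing for two of the example sentences.
tokens_positive = ["program", "great"]          # "This program is great."
tokens_negative = ["product", "not", "great"]   # "This product is not great."

print(ngrams(tokens_positive, 2))  # ['program great']
print(ngrams(tokens_negative, 2))  # ['product not', 'not great']
```

With single words both sentences share the feature "great"; with bigrams the second sentence contributes "not great" instead, which keeps the negation attached to the word it modifies.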

In [0]:
# Write the code to perform Tfidf Vectorization. Print: vocabulary, the value of idfs, shape of the result, the resulting matrix.


**VADER (Valence Aware Dictionary and sEntiment Reasoner)** is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media. VADER uses a sentiment lexicon, which is a list of lexical features (e.g., words) that are generally labeled according to their semantic orientation as either positive or negative. VADER not only reports positivity and negativity scores but also tells us how positive or negative a sentiment is.

In [0]:
# Write the code to print polarity scores using vaderSentiment for all four texts (d1, d2, d3 and d4) separately.


#### With TextBlob's sentiment property, we can get the sentiment (polarity and subjectivity) of a sentence.

In [0]:
# Write the code to print sentiment scores using TextBlob for all four text strings (d1, d2, d3 and d4) separately.
