# Intro

The goal of this project is to get familiar with classification by solving a natural language processing problem: **sentiment analysis** of text. We are given texts in which users wrote their opinions about a movie. Each text expresses the writer's sentiment, that is, whether they liked the movie or not. We want to process the given texts and decide which comments are positive and which are negative.

The data comes in three sets: test, training, and validation. We want to build a model (here, a __classifier__), train it using the _Test_ and _Training_ datasets, and then use it to label the texts in the _Validation_ dataset. Finally, we will compute the accuracy of our model.

### Datasets

The dataset we are using here is the IMDB dataset (sentiment analysis) in CSV format, which you can download from [kaggle.com](https://kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format).

# Round 1, Read the data!
First things first: as I said before, we have to read our datasets from the __.csv__ files that we downloaded from the __Kaggle__ website (there are actually three datasets to read).
Python has an external library called __pandas__ for reading __csv__ and several other dataset formats. (I love pandas, I mean the animal!)
Full documentation on how to install and use pandas is available on their website; check [here](https://pandas.pydata.org/) to find out how to get started.

In [12]:
import pandas as pd

trainDataset = pd.read_csv("./data-sets/Train.csv")
testDataset = pd.read_csv("./data-sets/Test.csv")
validDataset = pd.read_csv("./data-sets/Valid.csv")

db = {'train' : trainDataset, 'test': testDataset, 'valid': validDataset}

db['train'].head()

Unnamed: 0,text,label
0,It's been about 14 years since Sharon Stone aw...,0
1,someone needed to make a car payment... this i...,0
2,The Guidelines state that a comment must conta...,0
3,This movie is a muddled mish-mash of clichés f...,0
4,Before Stan Laurel became the smaller half of ...,0


# Round 2, Clean it!

## First step
The first question that comes to my mind is: which words or characters are more important, and which are less?
For example, let's look at this sentence:

_"I grew up (b. 1965) watching and loving Thunderbirds, I hate them!"_

Which part represents the writer's feelings? Can you say which parts are more important?
It can be hard to say which parts of a text matter most, because that may depend on the writer's literary or psychological background. Here, though, we can safely say that the last part, _"I hate them!"_, expresses the writer's exact feeling about _Thunderbirds_: he *"hates them!"*. On the other hand, no one can read any feeling from punctuation such as '(', ''', '.', or from numbers like '1965'. We can drop them to get a more minimal text with fewer extraneous features.

In other words, we have to clean our data. Some common data cleaning methods are listed below:

**Common data cleaning steps:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (`\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

To do so, we can get some help from a useful module in Python's standard library called __re__ (the regular expression module). (Documentation is available [here](https://docs.python.org/3/library/re.html).)

**Note**: In this particular case, our data contains some extra garbage that we want to delete: _HTML tags_. Yes, in the datasets we are using, there are HTML tags embedded inside the comments, and we have to delete them all first.

In [13]:
import re
import string

def cleanHtmlTags(text):
    mask = re.compile("<.*?>")
    text = re.sub(mask, "", text)
    return text
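
The `?` in `<.*?>` makes the match non-greedy, which is what keeps the text between two tags intact. The following is a small sketch of that difference on a made-up review string (the sample text and variable names are only for illustration, not taken from the dataset):

```python
import re

# Hypothetical review text with embedded HTML, similar in spirit to the dataset
sample = "Great cast.<br /><br />Terrible <b>script</b>, though."

# Non-greedy: each match stops at the first '>', so only the tags are removed
non_greedy = re.sub(r"<.*?>", "", sample)

# Greedy: a single match runs from the first '<' to the last '>',
# swallowing the words in between
greedy = re.sub(r"<.*>", "", sample)

# Replacing tags with a space instead of "" avoids gluing neighbours together
spaced = re.sub(r"<.*?>", " ", sample)

print(non_greedy)  # Great cast.Terrible script, though.
print(greedy)      # Great cast., though.
```

Note how the non-greedy version joins "cast." and "Terrible" into one chunk. Substituting a space, as in `spaced`, may be the safer choice if punctuation is stripped afterwards.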

## Second step
A typical paragraph may contain numbers, dates, and punctuation. They don't give us any sentiment information, right? Let's clean them too.

In [14]:
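def cleanNumbers(text):
    # "+" matches runs of one or more digits; "*" would also match the
    # empty string at every position, which does needless work
    mask = re.compile("[0-9]+")
    text = re.sub(mask, "", text)
    return text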

def cleanPunctuations(text):
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub("[‘’“”…]", "", text)
    return text
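
To see what these two steps do together, here is a short sketch that applies the same regular expressions to a sentence modelled on the Thunderbirds example. The final whitespace-collapsing step is an addition of this sketch, not part of the functions above:

```python
import re
import string

text = "I grew up (b. 1965) watching and loving Thunderbirds, I hate them!"

# Step 1: drop digits
no_digits = re.sub(r"[0-9]+", "", text)

# Step 2: drop ASCII punctuation
no_punct = re.sub("[%s]" % re.escape(string.punctuation), "", no_digits)

# Extra step (assumption): removing "(b. 1965)" leaves two adjacent spaces,
# so collapse runs of whitespace into one space
collapsed = re.sub(r"\s+", " ", no_punct).strip()

print(collapsed)  # I grew up b watching and loving Thunderbirds I hate them
```

The stray "b" that survives is harmless here, but it shows that removing characters can leave fragments behind that carry no meaning on their own.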
    

## Third step
As said before, some words in a text carry no information. In English, **stop words** are very common in sentences, but they are not as informative as other parts, such as *verbs*. A common technique in NLP text preprocessing is to remove these uninformative **stop words**.
There is a Python library that comes in handy for this, **NLTK**. (Documentation is available [here](https://www.nltk.org/).)

In [37]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

# Quick check of the cleaning functions on a single sentence
example_sent = "This is a sample sentence, (b, 1992) can't do this shit showing off the stop words filtration."
example_sent = cleanHtmlTags(example_sent)
example_sent = cleanNumbers(example_sent)
example_sent = cleanPunctuations(example_sent)
word_tokens = word_tokenize(example_sent)

# Clean every dataset and store the result next to the original files
for name, dataset in db.items():
    cleaned_dataset = {'text': [], 'label': []}
    for text, label in zip(dataset['text'], dataset['label']):
        text = cleanHtmlTags(text)
        text = cleanNumbers(text)
        text = cleanPunctuations(text)
        text_tokens = word_tokenize(text)
        # Keep only the tokens that are not stop words
        cleaned_text = " ".join(w for w in text_tokens if w not in stop_words)
        cleaned_dataset['text'].append(cleaned_text)
        cleaned_dataset['label'].append(label)
    cleaned_dataset = pd.DataFrame(cleaned_dataset, columns=['text', 'label'])
    cleaned_dataset.to_csv('./data-sets/cleaned-' + name + '.csv', index=False, header=True)
print("Done!")

Done!
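
One detail worth knowing: NLTK's English stop-word list is all lowercase, and the pipeline above never lowercases the text, so capitalised stop words such as "I" or "This" survive the filter. The sketch below illustrates the effect with a tiny hand-written stop-word set (a hypothetical stand-in for the real list, so that the example runs without any downloads):

```python
# Hypothetical mini stop-word set standing in for NLTK's much longer list
STOP_WORDS = {"i", "a", "the", "this", "is", "it", "and"}

def remove_stop_words(tokens, lowercase=False):
    """Filter out stop words, optionally lowercasing the tokens first."""
    if lowercase:
        tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in STOP_WORDS]

tokens = "I think This movie is a mess and I hate it".split()

as_is = remove_stop_words(tokens)
lowered = remove_stop_words(tokens, lowercase=True)

print(as_is)    # ['I', 'think', 'This', 'movie', 'mess', 'I', 'hate']
print(lowered)  # ['think', 'movie', 'mess', 'hate']
```

Lowercasing first is the "make text all lower case" step from the list of common cleaning steps. Whether it helps the final classifier is something to check on the data rather than assume.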


# Extracting feature vectors

Now we have cleaned, tokenized text, but we still have some work to do!
Let's assume we want to extract a feature vector from this text. How many features do you think BOW will extract from it?

We will introduce BOW later, but for now consider that every word in the text can be a feature, and that to get feature vectors of the same size for all examples we have to treat each distinct word as a feature. How many features could we have here? For example, based on our training dataset, each sample could be a 190663-dimensional vector!
That is not good at all, so we have to apply some methods that lower the dimension of our feature vectors and reduce computation time.
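
To make the dimension argument concrete, here is a minimal sketch on a three-sentence toy corpus (invented for illustration). The feature count is simply the number of distinct words across all texts:

```python
# Toy corpus, made up for this example
corpus = [
    "loved this movie",
    "hated this movie",
    "loved the cast hated the script",
]

# Every distinct word across the whole corpus becomes one feature
vocabulary = sorted({word for doc in corpus for word in doc.split()})
n_features = len(vocabulary)

print(vocabulary)  # ['cast', 'hated', 'loved', 'movie', 'script', 'the', 'this']
print(n_features)  # 7
```

Each new review tends to bring a few words nobody used before, which is how a real corpus ends up with a vocabulary in the hundreds of thousands.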

Stemming and lemmatization are good methods for lowering the dimension of our feature vectors, so let's introduce them first.

Let's read the data we just cleaned in the previous stage:


In [2]:
import pandas as pd
trainDataset = pd.read_csv("./data-sets/cleaned-train.csv")
testDataset = pd.read_csv("./data-sets/cleaned-test.csv")
validDataset = pd.read_csv("./data-sets/cleaned-valid.csv")

db = {'train' : trainDataset, 'test': testDataset, 'valid': validDataset}

db['train'].head()


Unnamed: 0,text,label
0,I grew b watching loving Thunderbirds All mate...,0
1,When I put movie DVD player sat coke chips I e...,0
2,Why people know particular time past like feel...,0
3,Even though I great interest Biblical movies I...,0
4,Im die hard Dads Army fan nothing ever change ...,1


## Stemming
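
Stemming maps different inflected forms of a word onto one shared stem, so several vocabulary entries collapse into a single feature. The sketch below uses a deliberately crude suffix stripper to show the idea. It is not the Porter algorithm used in the next cell, and the suffix list and length threshold are arbitrary choices for this example:

```python
def crude_stem(word):
    """Strip a few common English suffixes (far simpler than Porter)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["watch", "watches", "watched", "watching", "film", "films"]
stems = [crude_stem(w) for w in words]

print(stems)            # ['watch', 'watch', 'watch', 'watch', 'film', 'film']
print(len(set(words)))  # 6
print(len(set(stems)))  # 2
```

Six distinct words shrink to two features. A real stemmer covers many more cases and may return stems that are not dictionary words, which is fine as long as the mapping is applied consistently to every dataset.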



In [8]:
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize

# One stemmer instance is enough; it is reused for every word
porterStemmer = PorterStemmer()

for name, dataset in db.items():
    stemmed_dataset = {'text': [], 'label': []}
    for text, label in zip(dataset['text'], dataset['label']):
        tokens = word_tokenize(text)
        # Replace every token with its stem and join them back into one string
        stemmed_text = ' '.join(porterStemmer.stem(word) for word in tokens)
        stemmed_dataset['text'].append(stemmed_text)
        stemmed_dataset['label'].append(label)
    stemmed_dataset = pd.DataFrame(stemmed_dataset, columns=['text', 'label'])
    stemmed_dataset.to_csv('./data-sets/stemmed-' + name + '.csv', index=False, header=True)
print("Done!")

Done!


## Extracting feature vectors

After the previous rounds we have cleaned datasets, and now we want to build feature vectors from them. To do so, we can use any of the methods below.

* Bag Of Words (BOW)
* BERT embedding
* word2Vec
* TF-IDF

In this section we are going to use the first two methods and compare the results on two classifiers, **naive Bayes** and the **SVM**.

> **note**: As you saw, we stored the cleaned datasets in files; here we read them back from storage to save time.

## Bag Of Words method

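The Bag of Words (BOW) method represents each text as a vector of word counts over a fixed vocabulary: one dimension per distinct word, with word order ignored.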
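
As a rough illustration of the idea, here is a hand-rolled sketch in plain Python. It is not how `CountVectorizer` is implemented, and the function names `fit_vocabulary` and `transform` are invented for this example. The vocabulary is learned from the training texts only and then reused for the other texts:

```python
from collections import Counter

def fit_vocabulary(texts):
    """Map each distinct training word to a column index."""
    words = sorted({w for t in texts for w in t.split()})
    return {w: i for i, w in enumerate(words)}

def transform(texts, vocabulary):
    """Turn each text into a count vector over the fixed vocabulary."""
    vectors = []
    for t in texts:
        counts = Counter(t.split())
        # Words that are not in the vocabulary are simply ignored
        vectors.append([counts.get(w, 0) for w in vocabulary])
    return vectors

train_texts = ["good good film", "bad film"]
valid_texts = ["good plot"]

vocabulary = fit_vocabulary(train_texts)      # {'bad': 0, 'film': 1, 'good': 2}
X_train = transform(train_texts, vocabulary)  # [[0, 1, 2], [1, 1, 0]]
X_valid = transform(valid_texts, vocabulary)  # [[0, 0, 1]]
```

Because the validation vector is built over the training vocabulary, it has the same three columns as the training vectors, and the unseen word "plot" is dropped. Fitting a fresh vocabulary on each dataset would give matrices whose columns mean different things, and a classifier trained on one could not be applied to the other.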


In [14]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

trainDataset = pd.read_csv("./data-sets/stemmed-train.csv")
testDataset = pd.read_csv("./data-sets/stemmed-test.csv")
validDataset = pd.read_csv("./data-sets/stemmed-valid.csv")

db = {'train' : trainDataset, 'test': testDataset, 'valid': validDataset}

vectorizer = CountVectorizer()
# Learn the vocabulary from the training texts only...
X_train = vectorizer.fit_transform(db['train']['text'])
# ...and reuse it for the other sets so that all matrices share the same columns
X_valid = vectorizer.transform(db['valid']['text'])
X_test = vectorizer.transform(db['test']['text'])