# Intro

The goal of this project is to get familiar with classification by solving a natural language processing problem: **sentiment analysis** of text. We are given texts in which users wrote their opinions about a movie. Each text expresses the writer's sentiment, that is, whether they liked the movie or not. We want to process the given texts and find out which comments are positive and which are negative.

Three datasets are provided: testing, training, and validation. We want to build a model (here a __classifier__), train it using the _Test_ and _Training_ datasets, and then use it to label the texts in the _Validation_ dataset. Finally, we will compute the accuracy of our model.
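
The last step, computing accuracy, is just the fraction of texts whose predicted label matches the true label. Here is a minimal sketch of that idea. The function name `accuracy` and the label lists are made up for illustration, and it assumes that 1 marks a positive review and 0 a negative one.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy of an empty list")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical labels (assumption: 1 = positive review, 0 = negative review).
true_labels = [0, 0, 1, 1, 0]
predicted_labels = [0, 1, 1, 1, 0]

# Four of the five predictions agree with the true labels.
print(accuracy(true_labels, predicted_labels))
```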

### Datasets

The dataset we are using is the IMDB dataset (sentiment analysis) in CSV format, which you can download from [kaggle.com](https://kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format).

# Round 1, Read the data, Fight!
First things first: as I said before, we have to read our datasets from the __.csv__ files that we downloaded from the __kaggle__ website (actually, we have 3 datasets to read).
Python has an external library called __pandas__ for reading dataset formats such as __csv__. (I love pandas, I mean the animal!)
Full documentation on how to install and use pandas is on their website; you can check it out [here](https://pandas.pydata.org/).

In [17]:
import pandas as pd

db = {}
trainDataset = pd.read_csv("./data-sets/Train.csv")
testDataset = pd.read_csv("./data-sets/Test.csv")
validDataset = pd.read_csv("./data-sets/Valid.csv")

trainDataset.head()

Unnamed: 0,text,label
0,I grew up (b. 1965) watching and loving the Th...,0
1,"When I put this movie in my DVD player, and sa...",0
2,Why do people who do not know what a particula...,0
3,Even though I have great interest in Biblical ...,0
4,Im a die hard Dads Army fan and nothing will e...,1


# Round 2, Clean it!

## First step
The first question that comes to my mind is: which words or characters are more important, and which are less?
For example, let's look at this sentence:

_"I grew up (b. 1965) watching and loving Thunderbirds, I hate them!"_

Which part represents the writer's feelings? Can you say which parts are more important?
It might be a little hard for us to say which parts of a text are more important, because that may depend on the writer's literary or psychological background. Here, though, we can surely say that the last part, _"I hate them!"_, represents the writer's exact feeling about _Thunderbirds_: he *"hates them!"*. On the other hand, no one will read any feeling from punctuation like '(', "'", '.', or from numbers like '1965'. We can omit them to get a smaller text with fewer extra features.

In other words, we have to clean our data. Some common data cleaning methods are listed below:

**Common data cleaning steps:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (`\n`)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
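
A few of the steps listed above (lower-casing, removing `\n`, tokenizing, and building bi-grams or tri-grams) can be sketched with plain Python. This is only an illustration: the helper names `basic_clean`, `tokenize`, and `ngrams` are hypothetical, and splitting on whitespace is a much simpler stand-in for a real tokenizer.

```python
def basic_clean(text):
    """Lower-case the text and replace newline characters with spaces."""
    return text.lower().replace("\n", " ")


def tokenize(text):
    """Split on whitespace (a deliberately simple stand-in for a real tokenizer)."""
    return text.split()


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text with an embedded newline.
sample = "I grew up watching Thunderbirds\nI hate them"

tokens = tokenize(basic_clean(sample))
print(tokens)             # the cleaned tokens
print(ngrams(tokens, 2))  # bi-grams
print(ngrams(tokens, 3))  # tri-grams
```

Bi-grams keep a little word order, so a pair like `("hate", "them")` survives as one feature instead of two unrelated words.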

To do so, we can get some help from Python's regular expression module, __re__. (Documentation is available [here](https://docs.python.org/3/library/re.html).)

**Note**: In this particular case, our data contains some extra garbage characters that we want to delete: _HTML tags_. Yes, in the datasets we are using, there are HTML tags embedded inside the comments, and we have to delete them all first.
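
To see what tag removal does to a single comment, here is a small sketch on a made-up string. The sample text, the function name `strip_html_tags`, and the `replacement` parameter are assumptions for illustration. Replacing a tag with a space instead of an empty string is one possible design choice, not necessarily what this notebook does.

```python
import re

# Non-greedy pattern: ".*?" stops at the first ">" so each tag is matched separately.
TAG_PATTERN = re.compile(r"<.*?>")


def strip_html_tags(text, replacement=" "):
    """Replace every HTML tag, then collapse repeated whitespace."""
    no_tags = TAG_PATTERN.sub(replacement, text)
    return re.sub(r"\s+", " ", no_tags).strip()


# Hypothetical comment containing line-break tags.
sample = "Great movie!<br /><br />I loved it."

print(strip_html_tags(sample))                  # tags become a single space
print(strip_html_tags(sample, replacement=""))  # tags vanish, words get glued together
```

With an empty replacement, the words on both sides of a tag are joined ("movie!I"), which may matter later when the text is split into tokens.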

In [18]:
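import re
import string

# Matches any HTML tag, e.g. <br />; the non-greedy ".*?" stops at the first ">".
HTML_TAG_MASK = re.compile(r"<.*?>")

def cleanHtmlTags(text):
    """Remove HTML tags from a single review."""
    return HTML_TAG_MASK.sub("", text)

size = trainDataset.shape

# Clean the whole column at once. This avoids chained assignment
# (trainDataset["text"][i] = ...), which pandas may not write back.
trainDataset["text"] = trainDataset["text"].apply(cleanHtmlTags)
print("Done!")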

Done!


## Second step
A typical paragraph may contain numbers, dates, and punctuation. They don't give us any sentiment information, right? Let's clean them too.
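
Before the regex version, here is an alternative sketch of the same idea using `str.translate` from the standard library. It is not the approach used in this notebook, and the name `strip_digits_and_punctuation` is hypothetical. The sample sentence is modeled on the example from the first step.

```python
import string

# Translation table that deletes every ASCII digit and punctuation character.
REMOVE_TABLE = str.maketrans("", "", string.digits + string.punctuation)


def strip_digits_and_punctuation(text):
    """Delete ASCII digits and punctuation, leaving letters and whitespace."""
    return text.translate(REMOVE_TABLE)


sample = "I grew up (b. 1965) watching and loving Thunderbirds, I hate them!"
cleaned = strip_digits_and_punctuation(sample)

print(cleaned)
print(cleaned.split())  # splitting also absorbs the doubled spaces left behind
```

Note that this table only covers ASCII characters, so typographic quotes such as ‘ ’ “ ” would still need separate handling.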

In [19]:
def cleanNumbers(text):
    """Remove every run of digits."""
    # "+" (not "*") so the pattern never matches the empty string.
    return re.sub(r"[0-9]+", "", text)

def cleanPunctuations(text):
    """Remove ASCII punctuation and common typographic quotes and ellipses."""
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub("[‘’“”…]", "", text)
    return text

# Clean the whole column at once instead of assigning row by row.
trainDataset["text"] = trainDataset["text"].apply(cleanNumbers).apply(cleanPunctuations)
print("Done!")

Done!


## Third step
As said before, some words in a text carry little information. In English, **stop words** are very common in sentences, but they are not as informative as other parts, such as *verbs*. A common technique in NLP text preprocessing is to remove these uninformative **stop words**.
There is a Python library that comes in handy here, **sklearn** (documentation available [here](https://scikit-learn.org/stable/)); the cell below uses **nltk** instead.

In [32]:
import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # word lists used by stopwords.words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)
print(word_tokens)  # print the tokens, not the word_tokenize function
print("Done!")

# Keep only the tokens that are not stop words. The stop-word list is
# lower-case, so compare the lower-cased token.
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

print(filtered_sentence)

LookupError: 
**********************************************************************
  Resource punkt not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt/PY3/english.pickle

  Searched in:
    - '/home/ozma/nltk_data'
    - '/home/ozma/anaconda3/nltk_data'
    - '/home/ozma/anaconda3/share/nltk_data'
    - '/home/ozma/anaconda3/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - ''
**********************************************************************
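
The traceback above is what you get when the NLTK data is not available locally. One way to keep going in that situation, sketched here under assumptions, is to fall back to a small hand-written stop-word set and a regex tokenizer. `FALLBACK_STOP_WORDS`, `load_stop_words`, `simple_tokenize`, and `remove_stop_words` are hypothetical names, and the fallback list is far smaller than NLTK's English list.

```python
import re

# Hypothetical fallback list; much smaller than NLTK's English stop-word list.
FALLBACK_STOP_WORDS = {
    "a", "an", "the", "is", "are", "this", "that",
    "of", "off", "and", "to", "in",
}


def load_stop_words():
    """Use NLTK's list when nltk and its data are installed, else the fallback."""
    try:
        from nltk.corpus import stopwords
        return set(stopwords.words("english"))
    except (ImportError, LookupError):
        return set(FALLBACK_STOP_WORDS)


def simple_tokenize(text):
    """Lower-case and keep runs of letters (a rough stand-in for word_tokenize)."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(text, stop_words):
    """Tokenize the text and drop every token found in stop_words."""
    return [w for w in simple_tokenize(text) if w not in stop_words]


stop_words = load_stop_words()
example_sent = "This is a sample sentence, showing off the stop words filtration."
filtered = remove_stop_words(example_sent, stop_words)
print(filtered)
```

The regex tokenizer also drops punctuation and digits as a side effect, so it overlaps with the second cleaning step.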
