# Intro

The goal of this project is to get familiar with 'classification' by solving a natural language processing problem: a kind of **sentiment analysis**. We are given texts in which users wrote their opinions about a movie. Each text expresses the writer's sentiment, that is, whether they liked the movie or not. We want to process these texts and find out which comments are positive and which ones express a negative opinion about the movie.

The data comes in three sets: testing, training, and validation. We want to build a model (here, a __classifier__), train it on the _Test_ and _Training_ datasets, and then use it to label the texts in the _Validation_ dataset. Finally, we will compute the accuracy of our model.

### Datasets

The dataset we are using here is the IMDB dataset (sentiment analysis) in CSV format, which you can download from [kaggle.com](https://kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format).

# Round 1, Reading the data, Fight!

First things first: as I said before, we have to read our datasets from the __.csv__ files that we downloaded from the __kaggle__ website. (Actually, there are 3 datasets to read.)
Python has an external library called __pandas__ for reading dataset formats such as __csv__. (I love pandas, I mean the animal!)
Full documentation on pandas is available on their website; check it out [here](https://pandas.org) to find out how to use it and how to install it with __pip__.
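
As a minimal sketch of the reading step, the snippet below loads a tiny in-memory CSV with __pandas__. The column names (text, label) and the variable names sampleCsv and trainDf are assumptions made for illustration; with the real files you would pass the path of each downloaded file to __read_csv__ instead.

```python
import io
import pandas as pd

# Stand-in for one downloaded file; the column names are an assumption.
sampleCsv = io.StringIO(
    "text,label\n"
    '"I loved this movie, great acting!",1\n'
    '"Boring plot and terrible pacing.",0\n'
)

# read_csv accepts a file path or any file-like object.
trainDf = pd.read_csv(sampleCsv)

print(trainDf.shape)   # (number of rows, number of columns)
print(trainDf.head())  # first rows, for a quick sanity check
```

Repeating the same call once per file would give one DataFrame for each of the three datasets.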

# Round 2, Data cleaning

The first question that comes to my mind is: which words or characters are more important, and which ones are less?
For example, let's look at this sentence:

_" I grew up (n. 1965) watching and loving Thunderbirds, I hate them!"_

Which part represents the writer's feelings? Can you say which parts are more important?
It can be a little hard to say which parts of a text matter most, but here we can surely say that the last part, _"I hate them!"_, expresses the writer's exact feeling about _Thunderbirds_: he *hates them!* On the other hand, no one can read any feeling from punctuation marks like '(', ''', '.', or from numbers like '1965'. We can omit them to get a smaller text with fewer extra features.

In other words, we have to clean our data. Some common data cleaning methods are listed below.

**Common data cleaning steps:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...

To do so, Python has a useful built-in module called __re__ (the regular expression library). (Documentation is available [here](https://docs.python.org/3/library/re.html).)
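
To make the list above more concrete, here is a rough sketch of the first few steps (lower case, punctuation, numbers, tokenizing, stop words) plus bi-grams. The function name basicClean and the tiny stopWords set are hypothetical and only for illustration; a real project would likely use a fuller stop-word list and a proper tokenizer.

```python
import re
import string

# Tiny illustrative stop-word set (an assumption, not a standard list).
stopWords = {"i", "and", "the", "a", "up", "them"}

def basicClean(text):
    """Lowercase, drop punctuation and digits, tokenize, remove stop words."""
    text = text.lower()
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub("[0-9]+", "", text)
    tokens = text.split()  # naive whitespace tokenization
    return [t for t in tokens if t not in stopWords]

tokens = basicClean("I grew up (n. 1965) watching and loving Thunderbirds, I hate them!")
bigrams = list(zip(tokens, tokens[1:]))  # pairs of neighbouring tokens

print(tokens)
print(bigrams)
```

Note that a leftover like the single letter "n" survives this pass, which is one reason the later steps in the list (and a better stop-word list) exist.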

In this particular case, our data contains some extra garbage that we want to delete: _HTML tags_. Yes, in the datasets we are using, there are HTML tags embedded inside the comments, and we have to delete them all first.

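```python
import pandas as pd
import re
import string

def cleanHtmlTags(text):
    # Non-greedy match, so each tag (e.g. "<br />") is removed on its own.
    mask = re.compile("<.*?>")
    text = re.sub(mask, "", text)
    return text

def cleanNumbers(text):
    # "+" matches runs of digits; "*" would also match empty strings.
    mask = re.compile("[0-9]+")
    text = re.sub(mask, "", text)
    return text

def cleanPunctuations(text):
    # Remove ASCII punctuation, then typographic quotes and ellipses.
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub("[‘’“”…]", "", text)
    return text
```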
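
A quick way to check these helpers is to chain them over a sample comment. The sketch below repeats compact versions of the three functions so that it runs on its own; the wrapper cleanText and the sample string are hypothetical and not part of the original notebook.

```python
import re
import string

# Compact copies of the helpers above so this snippet runs standalone.
def cleanHtmlTags(text):
    return re.sub("<.*?>", "", text)

def cleanNumbers(text):
    return re.sub("[0-9]+", "", text)

def cleanPunctuations(text):
    text = re.sub("[%s]" % re.escape(string.punctuation), "", text)
    return re.sub("[‘’“”…]", "", text)

def cleanText(text):
    """Hypothetical wrapper: strip tags first, then digits, then punctuation."""
    return cleanPunctuations(cleanNumbers(cleanHtmlTags(text)))

sample = "I grew up (n. 1965) watching Thunderbirds. <br /><br />I hate them!"
cleaned = cleanText(sample)
print(cleaned)
```

The order matters here: if punctuation were removed before the tags, the angle brackets and slash would disappear and leave a stray "br" in the text that the tag pattern could no longer match.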





