## Data Cleaning

**Note:** The code used in this step can be found in the script notebook.

---

### Dataset: 
The dataset used for this project is from the Kaggle competition [Sentiment Analysis on Movie Reviews](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews). It is a corpus of movie reviews broken into phrases of various lengths, each accompanied by a sentiment rating from 0 to 4, where zero (0) is negative and four (4) is positive. The raw training set contains 156060 samples with four attributes (`PhraseId`, `SentenceId`, `Phrase`, and `Sentiment`). The raw test set contains 66292 samples and omits the `Sentiment` column.

### Cleaning: 
We first checked the dataset for duplicated and null values using the `initial_check` function, which also reports the feature types in case any need correcting. We then used a second function, `text_processing`, which applies the regex-based `text_cleaning` function to the `Phrase` column, removing websites, punctuation, and stopwords. It also normalizes the phrases to contain only lowercase letters (the regex code is taken from this [link](https://www.kaggle.com/ankitkumarsaini/distilbert)). Because the regex cleaning leaves some rows blank or duplicated, `text_processing` also removes those rows.

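```
import re
from nltk.corpus import stopwords

def text_cleaning(text):
    # Requires the NLTK stopword corpus: nltk.download('stopwords')
    forbidden_words = set(stopwords.words('english'))
    if not text:
        return ''
    # Treat periods and slashes as word separators
    text = ' '.join(text.split('.'))
    text = re.sub(r'/', ' ', text)
    text = re.sub(r'\\', ' ', text)
    # Remove URLs
    text = re.sub(r'((http)\S+)', '', text)
    # Lowercase, keep letters only, and collapse whitespace
    text = re.sub(r'\s+', ' ', re.sub('[^A-Za-z]', ' ', text.strip().lower())).strip()
    text = re.sub(r'\W+', ' ', text.strip().lower()).strip()
    # Drop English stopwords
    words = [word for word in text.split() if word not in forbidden_words]
    return ' '.join(words)
```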
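To see why cleaning produces blank and duplicated rows, the following minimal sketch applies a simplified version of the steps above to a few made-up phrases. The names `SAMPLE_STOPWORDS` and `clean_phrase` are hypothetical, and the tiny stopword set is an assumption standing in for NLTK's English list so that the example runs without downloading the corpus.

```python
import re

# Assumption: a tiny stand-in for NLTK's English stopword list
SAMPLE_STOPWORDS = {"a", "an", "the", "of", "is", "this", "it"}

def clean_phrase(text, stopwords=SAMPLE_STOPWORDS):
    """Simplified cleaning: strip URLs, keep letters only, lowercase,
    collapse whitespace, and drop stopwords."""
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"[^A-Za-z]", " ", text).lower()
    return " ".join(w for w in text.split() if w not in stopwords)

# Made-up phrases that differ only in case, punctuation, or stopwords
phrases = ["A good movie .", "the good movie", "of the", "Good movie!"]
cleaned = [clean_phrase(p) for p in phrases]

# Keep non-blank phrases once each, preserving first-seen order
unique_nonblank = list(dict.fromkeys(c for c in cleaned if c))

print(cleaned)
print(unique_nonblank)
```

Three of the four phrases collapse to the same string and one becomes empty, which is the situation the blank-row filter and `drop_duplicates` call are meant to handle.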

```
def text_processing(dataframe):
    df = dataframe.copy()
    #Cleans the dataframe using the text_cleaning function (regex)
    df['Phrase'] = df['Phrase'].map(text_cleaning)
    #Removes rows that have blank in Phrase
    df = df.loc[df['Phrase'].map(lambda x: len(x.split())) > 0]
    #Remove duplicates
    df = df.drop_duplicates(subset=['Phrase'])
    return df
```
The cleaned datasets are saved as CSV files in the data directory under the names `clean_train.csv` and `clean_test.csv`. After cleaning, the final sample count is 83917 for the training data and 36856 for the test data.
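The `initial_check` helper is described in this section but not listed. The sketch below is one plausible shape for it, assuming it simply loads the file with pandas, prints the number of duplicated rows, and prints the frame summary. The body is a guess for illustration, not the project's actual implementation, and the in-memory sample data is made up.

```python
import io
import pandas as pd

def initial_check(path_or_buffer, delimiter=","):
    """Hypothetical sketch: load a delimited file and print basic diagnostics."""
    df = pd.read_csv(path_or_buffer, delimiter=delimiter)
    # Count rows that exactly repeat an earlier row
    print("Duplicated values:", df.duplicated().sum())
    # info() prints column dtypes and non-null counts
    df.info()
    return df

# Made-up tab-separated sample standing in for train.tsv
sample = io.StringIO(
    "PhraseId\tSentenceId\tPhrase\tSentiment\n"
    "1\t1\tgood movie\t3\n"
    "2\t1\tgood\t3\n"
)
df = initial_check(sample, delimiter="\t")
```

Comparing the non-null counts with the total number of entries in the summary is a quick way to spot missing values before any text cleaning is applied.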

```
import pandas as pd
from script import functions as func
%load_ext autoreload
%autoreload 2
```


```
# Import data
train = func.initial_check('../data/train.tsv', delimiter='\t')
test = func.initial_check('../data/test.tsv', delimiter='\t')

# Process data (please see the description above)
train = func.text_processing(train)
test = func.text_processing(test)

# Save clean data
train.to_csv('../data/clean_train.csv')
test.to_csv('../data/clean_test.csv')
```

```
Duplicated values: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 156060 entries, 0 to 156059
Data columns (total 4 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   PhraseId    156060 non-null  int64 
 1   SentenceId  156060 non-null  int64 
 2   Phrase      156060 non-null  object
 3   Sentiment   156060 non-null  int64 
dtypes: int64(3), object(1)
memory usage: 4.8+ MB
None
Duplicated values: 0
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66292 entries, 0 to 66291
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   PhraseId    66292 non-null  int64 
 1   SentenceId  66292 non-null  int64 
 2   Phrase      66292 non-null  object
dtypes: int64(2), object(1)
memory usage: 1.5+ MB
None
```
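Because `to_csv` is called with its default settings, the DataFrame index is written as an unnamed first column of each cleaned file. The sketch below uses a made-up two-row frame and an in-memory buffer (both assumptions for illustration) to show what that looks like when the file is read back, and how `index_col=0` restores the original index.

```python
import io
import pandas as pd

# Toy frame whose index has gaps, as happens after rows are dropped
df = pd.DataFrame(
    {"Phrase": ["good movie", "bad plot"], "Sentiment": [3, 1]},
    index=[0, 5],
)

buffer = io.StringIO()
df.to_csv(buffer)  # default index=True writes the index as the first column

# Reading back without index_col keeps the old index as a data column
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading back with index_col=0 restores it as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.index))
```

An alternative is to pass `index=False` to `to_csv` when the original row positions are not needed downstream.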
