# **3. Data Preparation**

In this section I'll create the functions that preprocess the training data in `artifacts/train.csv` (created at the project root by running `data_ingestion.py`).

Based on `2. Exploratory Data Analysis`, I need to apply the following transformations to the data:

1. **Tokenize text**

2. **Remove irrelevant information (can be anything that is not a letter)**

3. **Count typos**

4. **Correct typos**

5. **Count English contractions**

6. **Remove stopwords**

7. **Remove nouns**

8. **Texts to vectors**


**Obs:**
* Before any text manipulation, the text needs to be tokenized. That's why step `1.` is there;

* Typos must be corrected before the words are used in the later steps. That's why step `4.` is there;

* I need a way to transform the texts into vectors, and I'll do that with word2vec. That's why step `8.` is there;

* I'll use the spaCy library for all of these steps, except counting and correcting typos, for which I'll use SymSpell.
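
For step `8.`, the function later in this notebook relies on spaCy's `Doc.vector`, which by default is the average of the vectors of the tokens in the doc. Below is a minimal pure-Python sketch of that averaging, using made-up 3-dimensional vectors (the real `en_core_web_md` vectors have 300 dimensions; `mean_vector` and the toy values are hypothetical and only for illustration):

```python
from typing import List


def mean_vector(token_vectors: List[List[float]], dim: int = 3) -> List[float]:
    """Average token vectors component-wise; an empty input gives a zero vector."""
    if not token_vectors:
        # Assumption of this sketch: a text with no remaining tokens maps to zeros.
        return [0.0] * dim
    n = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]


# Toy vectors standing in for the embeddings of three tokens.
toy_vectors = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
    [2.0, 4.0, 1.0],
]

doc_vector = mean_vector(toy_vectors)
print(doc_vector)  # [2.0, 2.0, 1.0]
```

Because the result is an average, removing stopwords and nouns before building the doc changes which words contribute to the final vector.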

## **Importing libraries**

In [33]:
import re
import pandas as pd
import numpy as np

import spacy
from spacy.tokens import Doc

from symspellpy import SymSpell, Verbosity
import importlib.resources

from sklearn.preprocessing import MinMaxScaler

## **Setting up libraries**

In [2]:
# setting up spacy and disabling unused pipeline components
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner', 'lemmatizer', 'textcat', 'custom'])

In [3]:
# setting up SymSpell for typo counting and correction
sym_spell = SymSpell(max_dictionary_edit_distance=1, prefix_length=7)

with importlib.resources.open_text('symspellpy', 'frequency_dictionary_en_82_765.txt') as file:
    dictionary_path = file.name
    sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)
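
To make the idea of a lookup with a maximum edit distance of 1 more concrete, here is a rough pure-Python sketch. It is not how SymSpell works internally (SymSpell precomputes deletes of the dictionary terms to be fast), and it only considers single insertions, deletions and substitutions. The tiny `toy_dictionary` and the helper names are hypothetical. The return value mirrors how the lookup result is used later in `prepare_model_data`: a word with no close match counts as a typo and is kept, a word with a different best match counts as a typo and is replaced, and a known word is left alone.

```python
from typing import Dict, Tuple


def at_most_one_edit(a: str, b: str) -> bool:
    """True if a and b are equal or differ by one insertion, deletion or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make `a` the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]  # one extra character in `b` at position i


def check_word(word: str, dictionary: Dict[str, int]) -> Tuple[str, bool]:
    """Return (word to keep, whether it was counted as a typo)."""
    if word in dictionary:
        return word, False
    candidates = [term for term in dictionary if at_most_one_edit(word, term)]
    if not candidates:
        return word, True  # unknown word with no close match
    # Among the close matches, prefer the most frequent term.
    return max(candidates, key=dictionary.get), True


# Hypothetical word -> frequency dictionary, only for this example.
toy_dictionary = {'hello': 100, 'help': 80, 'world': 90}

print(check_word('helo', toy_dictionary))   # ('hello', True)
print(check_word('world', toy_dictionary))  # ('world', False)
print(check_word('xyzzy', toy_dictionary))  # ('xyzzy', True)
```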

## **Loading training dataset**

In [4]:
data = pd.read_csv('../artifacts/train.csv')

In [5]:
# show first rows
data.head()

| Unnamed: 0 | text | generated |
|---|---|---|
| 0 | In this essay I will talk about if the use of ... | 0 |
| 1 | The importance of self-care should never be un... | 1 |
| 2 | Do you think this "face" was created by aliens... | 0 |
| 3 | \nThe potential impact of the use of social me... | 1 |
| 4 | Many people believe that self-esteem comes fro... | 0 |


## **Data Transformation**

I'll put all the required transformations into a single function, which also helps performance because each text goes through the spaCy pipeline only once.

In [17]:
def prepare_model_data(data):
    '''
    Helper function to prepare the data for the ML model.
    This function applies:
        * text tokenization
        * removal of irrelevant information
        * typo correction
        * stopword removal
        * noun removal
        * word2vec with spaCy

    This function also counts:
        * contractions
        * typos

    Parameters
    ---
    * data: iterable of texts (e.g. a pandas Series of strings) to prepare

    Returns
    ---
    * a numpy array with one row per text and the columns: text vector components, contractions count, typos count

    '''

    # transform texts into spaCy docs (which applies tokenization)
    docs = list(nlp.pipe(data))

    text_vectors = []
    contractions_counts = []
    typos_counts = []

    contractions_patterns = [
        r'\b(\w+)\'(\w+)\b',
        r'\'(\w+)\b'
    ]

    for doc in docs:

        # count contractions
        contractions_count = 0
        for pattern in contractions_patterns:
            matches = re.findall(pattern, doc.text)
            contractions_count += len(matches)

        contractions_counts.append(contractions_count)

        # count typos
        typos_count = 0

        words = []
        spaces = []
        for token in doc:
            # skip tokens that don't start with a letter (or an apostrophe followed by a letter)
            if not re.match(r'[a-zA-Z]|\'[a-zA-Z]', token.text):
                continue

            if token.is_stop: # ignore stopwords
                continue

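            # look up the token in the SymSpell dictionary
            suggestions = sym_spell.lookup(token.text, Verbosity.CLOSEST, max_edit_distance=1)

            if not suggestions:
                # no dictionary term within edit distance 1: count as typo, keep the word
                correct_word = token.text
                typos_count += 1

            elif suggestions[0].term != token.text:
                # best suggestion differs from the token: count as typo and correct it
                correct_word = suggestions[0].term
                typos_count += 1

            else:
                # the token is a known word
                correct_word = token.text

            if token.pos_ in ['NOUN', 'PROPN']: # drop nouns (their typos are still counted above)
                continue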

            words.append(correct_word)
            spaces.append(token.whitespace_)

        typos_counts.append(typos_count)

        clean_doc = Doc(vocab=nlp.vocab, words=words, spaces=spaces)
        text_vectors.append(clean_doc.vector)

    # return numpy array
    return np.column_stack([text_vectors, contractions_counts, typos_counts])
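
A quick check of the two contraction patterns on a made-up sentence may help in reading the contraction counts. The sketch below uses only `re`; the sample sentence is hypothetical. It suggests that every `word'word` form is matched by both patterns, so it adds 2 to the summed count, and that possessives such as `John's` are matched as well. If that is not the intended behaviour, a single pattern is one possible alternative (shown at the end as a hypothetical option, not what the function above uses).

```python
import re

# The two patterns used in `prepare_model_data`.
contractions_patterns = [
    r'\b(\w+)\'(\w+)\b',
    r'\'(\w+)\b',
]

sample = "I don't think it's John's fault."

# Show what each pattern matches on its own.
for pattern in contractions_patterns:
    print(pattern, re.findall(pattern, sample))

# Same summation as in the function: matches of both patterns are added up.
total = sum(len(re.findall(p, sample)) for p in contractions_patterns)
print('summed count:', total)  # 6

# Hypothetical alternative: one pattern, each apostrophe form counted once.
single_count = len(re.findall(r"\b\w+'\w+\b", sample))
print('single-pattern count:', single_count)  # 3
```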

In [18]:
# prepare the data
prepared_data = prepare_model_data(data.text)

In [30]:
print('Prepared data shape:', prepared_data.shape)

Prepared data shape: (23316, 302)


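Each row has 302 columns: the 300 dimensions of the `en_core_web_md` document vector plus the contraction count and the typo count.

One last thing to do with the data is to apply scaling. For simplicity I'll apply min-max normalization, which rescales every column to the [0, 1] range.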

In [34]:
scaler = MinMaxScaler()
prepared_data_scaled = scaler.fit_transform(prepared_data)
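
With its default settings, `MinMaxScaler` rescales each column independently using `x' = (x - min) / (max - min)`. Here is a minimal pure-Python sketch of that formula on toy data (the `min_max_scale` helper and the values are hypothetical; the notebook itself uses scikit-learn):

```python
from typing import List


def min_max_scale(rows: List[List[float]]) -> List[List[float]]:
    """Rescale each column to [0, 1] with x' = (x - min) / (max - min)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled_row = []
        for value, lo, hi in zip(row, mins, maxs):
            span = hi - lo
            # A constant column has no range; map it to 0.0 to avoid dividing by zero.
            scaled_row.append((value - lo) / span if span else 0.0)
        scaled.append(scaled_row)
    return scaled


# Toy data: the columns could stand for a vector component,
# a contraction count and a typo count.
toy = [
    [0.5, 0.0, 2.0],
    [1.5, 4.0, 2.0],
    [1.0, 1.0, 2.0],
]

print(min_max_scale(toy))
# [[0.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.25, 0.0]]
```

Note that `fit_transform` learns the minimum and maximum of each column from the training data, so the same fitted `scaler` should be reused with `transform` on any new data.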

In [37]:
# show result
prepared_data_scaled

array([[0.35434973, 0.50383901, 0.33725433, ..., 0.3965236 , 0.01666667,
        0.01581722],
       [0.35296487, 0.58696288, 0.25450668, ..., 0.54237329, 0.01666667,
        0.01054482],
       [0.23955984, 0.47497949, 0.38730878, ..., 0.32879322, 0.06666667,
        0.03690685],
       ...,
       [0.35158229, 0.66196401, 0.28397271, ..., 0.48455537, 0.        ,
        0.00527241],
       [0.48947814, 0.65405018, 0.21185979, ..., 0.44857336, 0.06666667,
        0.03690685],
       [0.33425374, 0.5884441 , 0.29810287, ..., 0.32490683, 0.08333333,
        0.13884007]])