# Data Processing 

In this notebook, we load the combined data generated in the previous step by the `data_extraction.ipynb` notebook. The goal is to produce a clean dataset containing `title` (the feature of interest) and `title_token`, its tokenized form. The result is an intermediate file that later steps can refine and reuse.

#### Installing necessary libraries

```bash
pip install spacy
python -m spacy download en_core_web_sm
```

In [12]:
# Imports for data processing
import pandas as pd
import re
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

## Loading combined data

In [13]:
# Retailer A data
retA_data = pd.read_csv('../data/raw/retailerA.csv')
# Retailer B data
retB_data = pd.read_csv('../data/raw/retailerB.csv')
# Combined data
combined_data = pd.read_csv('../data/raw/combined_data_raw.csv')

## Tokenizer Configuration

This tokenizer uses the spaCy small English model (`en_core_web_sm`). Preprocessing consists of lowercasing, a regex substitution that replaces special characters with spaces, optional lemmatization controlled by the `lemmas` parameter, and stop-word removal. The tokenizer is later plugged into the vectorization step.

This function is also saved as a module in `..scripts/text_tokenization.py` so it can be imported and reused.

In [14]:
spacy_nlp = spacy.load('en_core_web_sm') # Load the spacy small English model 
re_pattern = re.compile(r'[\W_]+') # Compile the regular expression to use

def tokenizeText(text: str, lemmas: bool) -> str:
    """
    Tokenizes the input text and returns a processed string.

    Parameters:
        text (str): The text to be tokenized.
        lemmas (bool): If True, the function returns lemmatized tokens; if False, it returns regular tokens.

    Returns:
        str: A processed string containing tokens from the input text.

    Example:
    >>> tokenizeText("This is an example sentence.", True)
    'example sentence'
    """

    text = re_pattern.sub(' ', text) # Use the compiled regex pattern

    # Tokenization
    doc = spacy_nlp(text)
    if lemmas:
        tokens = [token.lemma_.lower() for token in doc if token.lemma_.lower() not in STOP_WORDS]
        # Rejoin tokens into a single string
        return ' '.join(tokens)
    else:
        tokens = [token.text.lower() for token in doc if token.text.lower() not in STOP_WORDS]
        # Rejoin tokens so this branch also returns a str, as the signature and docstring state
        return ' '.join(tokens)
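
To see what the regex cleanup and stop-word filtering do without loading the spaCy model, here is a minimal, dependency-free sketch. It is only an approximation of the `lemmas=False` branch: `simple_tokenize` and `DEMO_STOP_WORDS` are hypothetical names introduced for illustration, a plain whitespace split stands in for spaCy's tokenizer, and the stop-word set is far smaller than spaCy's `STOP_WORDS`.

```python
import re

# Same pattern as the notebook: runs of non-word characters or underscores
re_pattern = re.compile(r'[\W_]+')

# Hypothetical, tiny stop-word set used only for this illustration;
# the notebook relies on spaCy's much larger STOP_WORDS set instead.
DEMO_STOP_WORDS = {"the", "a", "an", "for", "with", "and", "of"}


def simple_tokenize(text: str) -> str:
    """Dependency-free approximation of the cleanup done with lemmas=False."""
    cleaned = re_pattern.sub(' ', text)   # punctuation and underscores -> one space
    words = cleaned.lower().split()       # whitespace split stands in for spaCy
    kept = [w for w in words if w not in DEMO_STOP_WORDS]
    return ' '.join(kept)


print(simple_tokenize("Wireless Mouse_2.4GHz (Black) for the Office"))
# -> wireless mouse 2 4ghz black office
```

Note that the pattern also splits on the decimal point, so `2.4GHz` becomes `2 4ghz`. Whether that matters depends on how the titles are used downstream.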

For analysis and record-keeping, we create and save three intermediate datasets, one per source file. Each saved file keeps the original `title` column and adds a `title_token` column holding the tokenization result.

**Note:** The `title` values rarely contain verbs or other words that lemmatization would change, so we set the `lemmas` parameter to `False` for this application.

In [23]:
# Create column in dataset with tokenized 'title'
retA_data['title_token'] = retA_data['title'].apply(tokenizeText, args=(False,))
retB_data['title_token'] = retB_data['title'].apply(tokenizeText, args=(False,))
combined_data['title_token'] = combined_data['title'].apply(tokenizeText, args=(False,))

# Saving intermediate datasets
retA_data[['title', 'title_token']].to_csv('../data/processed/retailerA_tokens.csv')
retB_data[['title', 'title_token']].to_csv('../data/processed/retailerB_tokens.csv')
combined_data[['title', 'title_token']].to_csv('../data/processed/combined_data_tokens.csv')
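
Because `to_csv` is called here without `index=False`, each saved file also carries the DataFrame index as a first, unnamed column. The sketch below shows one way a later step could read such a file back. It is an assumption about downstream usage, not part of this notebook: the two-row `demo` frame is made up, and an in-memory buffer stands in for the files under `../data/processed/`.

```python
import io

import pandas as pd

# Hypothetical two-row frame standing in for retA_data[['title', 'title_token']]
demo = pd.DataFrame({
    "title": ["Wireless Mouse (Black)", "USB-C Cable 2m"],
    "title_token": ["wireless mouse black", "usb c cable 2m"],
})

buffer = io.StringIO()
demo.to_csv(buffer)   # same call style as above: the index is written as well
buffer.seek(0)

# index_col=0 tells pandas that the first, unnamed column is the saved index
restored = pd.read_csv(buffer, index_col=0)
print(list(restored.columns))
# -> ['title', 'title_token']
```

Without `index_col=0`, the index would come back as an extra column named `Unnamed: 0`.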

----

### **AI tool usage for this notebook**

#### ChatGPT 3.5
* Improving markdown annotations and function docstrings

#### ChatGPT 4
* Providing regular expressions for different purposes
* Help with the apply method for the tokenizeText function call
* Improving modularity in repository structure