<div style="color:white; background-color: black; padding: 20px; border-radius:8px; font-size:26px"><b style="font-weight: 700;"><center>LEARNING NLP </center></b></div>


<div style="color:white; background-color: black; padding: 20px; border-radius:8px; font-size:20px"><b style="font-weight: 700;"><center> Creating a Pipeline </center></b></div>

<div style="background-color:  #eddcd2; padding: 10px;">

### Experimental Data

</div>

The dataset is *Natural Language Processing with Disaster Tweets* from Kaggle, available [here](https://www.kaggle.com/code/nkitgupta/text-representations/input).

In [14]:
# Set the encoding to UTF-8
# -*- coding: utf-8 -*-

## CLASS

In Python, **a class is a blueprint for creating objects**. It defines a **set of attributes (data members)** and **methods (functions)** that the objects created from the class will have.

Illustrative example of the structure of a CLASS:

In [15]:
class MyClass:                                                        # Defines a class named 'MyClass'
    # Class variable (shared among all instances of this class)
    class_variable = 10                                               # This is a class variable. It's shared among all instances of the class

    # Constructor method. It initializes the object with the provided values of 'x' and 'y'. 'self' refers to the instance itself:
    def __init__(self, x, y):
        # Instance variables (unique to each instance)
        self.x = x                                                    # The instance variables hold values unique to each instance of the class.
        self.y = y

    # Instance methods. They can access and modify the attributes of the instance they are called on.
    def add(self):
        return self.x + self.y

    def multiply(self):
        return self.x * self.y

    # Class method. It can access and modify class variables.
    @classmethod                                                # This is a decorator used to define a class method.
    def class_method(cls):
        return cls.class_variable

    # Static method. It doesn't take 'self' or 'cls' as its first argument, making it independent of class or instance variables.
    @staticmethod                                               # This is a decorator used to define a static method.
    def static_method():
        return "This is a static method"


Simple usage example of a class. In the following example, we create two instances (`obj1` and `obj2`) of `MyClass`. We then *call instance methods* (which read the instance variables), *access the class variable*, *call the class method*, and *call the static method*:

In [16]:
# Creating instances of the class
obj1 = MyClass(5, 3)                                                      # x = 5 and y = 3
obj2 = MyClass(10, 2)                                                     # x = 10 and y = 2

# Accessing instance variables and calling instance methods
print(obj1.add())                                                         # Output: 8
print(obj2.multiply())                                                    # Output: 20

# Accessing class variable and calling class method
print(MyClass.class_variable)                                             # Output: 10
print(MyClass.class_method())                                             # Output: 10

# Calling static method
print(MyClass.static_method())                                            # Output: "This is a static method"


8
20
10
10
This is a static method
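
One detail that the example above does not show is what happens when a class variable is changed. The following is a minimal sketch, not part of the original notebook; the `Counter` class and its attribute names are hypothetical. It illustrates that a change made on the class is seen by every instance, while an assignment made on one instance creates an instance attribute that shadows the class variable for that object only.

```python
class Counter:
    # Class variable: a single value shared by every instance
    step = 1

    def __init__(self, start):
        # Instance variable: each object keeps its own value
        self.value = start

    def advance(self):
        # 'self.step' is looked up on the instance first, then on the class
        self.value += self.step
        return self.value


a = Counter(0)
b = Counter(100)

Counter.step = 5            # changing the class variable affects both instances
a_after = a.advance()       # 0 + 5
b_after = b.advance()       # 100 + 5

a.step = 1                  # creates an instance attribute on 'a' that shadows Counter.step
a_shadowed = a.advance()    # 5 + 1
b_unchanged = b.advance()   # 105 + 5, 'b' still uses the class variable

print(a_after, b_after, a_shadowed, b_unchanged)   # 5 105 6 110
```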


### **BaseEstimator** and **TransformerMixin**

By using **BaseEstimator** and **TransformerMixin**, you can create custom transformers and models that integrate seamlessly with the scikit-learn ecosystem.

**BaseEstimator**:
- BaseEstimator is a **base class for all estimators (models) in scikit-learn**. It provides **default implementations for common methods** like get_params() and set_params().
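- By inheriting from BaseEstimator, a custom estimator gets parameter handling for free: get_params() reads the arguments of `__init__`, and set_params() updates them. This is what lets utilities such as clone() and GridSearchCV copy and tune the estimator. For this to work, `__init__` should only store its arguments as attributes with the same names.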
- Custom estimators can override these methods to provide custom behavior.
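
As a minimal sketch of what BaseEstimator provides, the snippet below defines a hypothetical `ThresholdModel` (the class and its parameters are made up for illustration) and uses the inherited `get_params()` and `set_params()`, plus `sklearn.base.clone`, which relies on them.

```python
from sklearn.base import BaseEstimator, clone


class ThresholdModel(BaseEstimator):
    # Hypothetical estimator: __init__ only stores its arguments under the same names
    def __init__(self, threshold=0.5, invert=False):
        self.threshold = threshold
        self.invert = invert


model = ThresholdModel(threshold=0.3)

# get_params() is inherited: it inspects the __init__ signature
params = model.get_params()
print(params)                       # {'invert': False, 'threshold': 0.3}

# set_params() is inherited too and returns the estimator itself
model.set_params(threshold=0.8)

# clone() builds a new, unfitted estimator with the same parameters
model_copy = clone(model)
print(model_copy.threshold)         # 0.8
```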

---

**TransformerMixin**
- TransformerMixin is **a mixin class provided by scikit-learn. It does not extend BaseEstimator; the two are combined through multiple inheritance, and the mixin adds behaviour specific to transformers**.
- **Transformers are estimators that have a transform method, which takes data and returns transformed data**. For example, data preprocessing steps like standardization or one-hot encoding are typically implemented as transformers.
- By inheriting from TransformerMixin, a custom transformer that defines fit() and transform() gains a default fit_transform() method, which a Pipeline uses when fitting its intermediate steps.
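
To make the role of the mixin concrete, here is a small sketch with a hypothetical `MeanCenterer` transformer (not part of the original notebook). It defines only `fit` and `transform`; `fit_transform` comes from TransformerMixin.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# Hypothetical transformer; the mixin is listed first, the order that
# scikit-learn's developer guide recommends
class MeanCenterer(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        # Learn the column means from the data passed to fit
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)
        return self

    def transform(self, X):
        # Subtract the learned means
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[1.0, 10.0],
              [3.0, 30.0]])

# fit_transform is not defined in MeanCenterer: it is inherited from TransformerMixin
centered = MeanCenterer().fit_transform(X)
print(centered)
```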

**Usage Example of BaseEstimator and TransformerMixin (do not run the cell!!):**

The class **CustomTransformer** defines **fit** and **transform** methods, which are required for any transformer in scikit-learn.

In [17]:
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTransformer(BaseEstimator, TransformerMixin):     # 'CustomTransformer' is a custom estimator/transformer that inherits from both 'BaseEstimator' and 'TransformerMixin'

    # Constructor method
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def fit(self, X, y=None):
        # Custom fit logic
        return self

    def transform(self, X):
        # Custom transform logic
        return X

### PIPELINE

A **Pipeline in machine learning** is a way to streamline a lot of routine processes by **putting together a sequence of data processing steps**. These steps can include **data cleaning, feature extraction, normalization, and model training, among others**.

The pipeline **ensures that the data flows in a sequential manner through each step**, and **the output of one step becomes the input for the next**.

- <u>Sequence of Steps</u>:
    A pipeline is essentially a list of tuples where each tuple contains a name (string) and an instance of an estimator (transformer or model). The steps are applied sequentially.

- <u>Fit and Transform</u>:
    During the training phase, the pipeline's fit() method is called. This fits each transformer in turn and passes its transformed output to the next step (fit_transform() is used when the transformer provides it). For transformers, fitting typically involves learning parameters from the data. The final estimator (usually a model) is then trained on the fully transformed data.

- <u>Predict or Score</u>:
    Once the model is trained, you can use the pipeline's predict() or score() methods. For prediction, the data is transformed through each step and then passed to the final estimator.

- <u>Simplifies Workflow</u>:
    Pipelines simplify the workflow and help ensure that the same sequence of steps is applied to both the training data and any new data that you want to make predictions on.

- <u>Avoids Data Leakage</u>:
    Pipelines help avoid data leakage when you use techniques like cross-validation. When the whole pipeline is passed to the cross-validation routine, every transformer is refitted on the training folds only, so nothing learned from the held-out fold reaches the model.

- <u>Code Reproducibility</u>:
    Using pipelines ensures that all the preprocessing steps and model training are encapsulated in one object. This makes it easy to reproduce the entire workflow later.
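
The points above can be sketched on a toy example. The corpus and labels below are made up for illustration (they are not taken from the dataset), and the step names are arbitrary. The sketch shows the list-of-tuples structure, a single `fit`/`predict` call, and cross-validation on the whole pipeline, in which case the vectorizer is refitted on the training folds of each split.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus (illustrative only): 1 = disaster, 0 = not a disaster
texts = [
    "forest fire near the town",
    "earthquake destroys the bridge",
    "flood water rising fast",
    "wildfire evacuation ordered now",
    "i love this sunny day",
    "great pizza for dinner",
    "watching a movie tonight",
    "my cat is sleeping",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A pipeline is a list of (name, estimator) tuples applied in order
pipe = Pipeline([
    ("vectorizer", CountVectorizer()),   # transformer: text -> word counts
    ("classifier", MultinomialNB()),     # final estimator: the model
])

# fit() fits the vectorizer, transforms the text, then trains the classifier
pipe.fit(texts, labels)

# predict() applies the fitted vectorizer before calling the classifier
prediction = pipe.predict(["forest fire near town"])
print(prediction)

# Cross-validating the whole pipeline refits the vectorizer inside each split
scores = cross_val_score(pipe, texts, labels, cv=2)
print(len(scores))
```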

<div style="color:black; background-color:  #c9e5b5; padding: 10px;">

### 1. **PIPELINE including a CLASS for CLEANING and BASIC PREPROCESSING in an NLP project**

</div>

In [18]:
# necessary libraries
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import re
import string
import spacy

### 1.1. **<u>Clean text</u>**

Create a **custom transformer CLASS** that inherits from **BaseEstimator** and **TransformerMixin**, indicating that it can be used as a part of a scikit-learn pipeline as an estimator, and has access to methods like **fit()** and **transform()**. When fit_transform is called on this transformer, it applies the cleaning processes to the input data and returns the modified data:

In [19]:
# Class for cleaning text:
class text_cleaning(BaseEstimator, TransformerMixin):

    def fit(self, X, y=None):                             # The 'fit' method is required for any estimator in scikit-learn. In this case it does nothing (it only returns 'self'), as the cleaning process doesn't require any fitting or training.
        return self

    def transform(self, X):                               # The 'transform' method is also required for transformers in scikit-learn. It takes an input 'X' (here, a DataFrame with a 'text' column) and applies a series of text-cleaning functions to it.
        # Define the helper functions used for text cleaning. Each one takes a string of text and applies a specific cleaning operation using regular expressions or string manipulation:
        # Remove url text
        def remove_URL(text):
            url = re.compile(r'https?://\S+|www\.\S+')       # 's?' makes the 's' optional, so both http:// and https:// URLs are matched
            return url.sub(r'',text)

        # Remove html text
        def remove_html(text):
            html = re.compile(r'<.*?>')
            return html.sub(r'',text)

        # Remove emojis
        def remove_emoji(text):
            emoji_pattern = re.compile("["
                                   u"\U0001F600-\U0001F64F"
                                   u"\U0001F300-\U0001F5FF"
                                   u"\U0001F680-\U0001F6FF"
                                   u"\U0001F1E0-\U0001F1FF"
                                   u"\U00002702-\U000027B0"
                                   u"\U000024C2-\U0001F251"
                                   "]+", flags=re.UNICODE)
            return emoji_pattern.sub(r'',text)

        # Remove punctuations
        def remove_punct(text):
            table = str.maketrans('','',string.punctuation)
            return text.translate(table)

        # Remove newlines (replaced with a space so the words on either side are not glued together)
        def remove_newline(text):
            newline = re.compile(r'\n')
            return newline.sub(r' ',text)

        # # Remove extra spaces ('\s+' matches runs of whitespace; '\S+' would replace every word with a space)
        # def remove_extra_space(text):
        #     return re.sub(r'\s+',' ', text)

        # Work on a copy so the caller's DataFrame is not modified in place, then apply each of the cleaning functions defined above to the 'text' column. This means that each cleaning operation is performed on every text sample in the dataset:
        X = X.copy()
        X['text'] = X['text'].apply(remove_URL)
        X['text'] = X['text'].apply(remove_html)
        X['text'] = X['text'].apply(remove_emoji)
        X['text'] = X['text'].apply(remove_punct)
        X['text'] = X['text'].apply(remove_newline)
        # X['text'] = X['text'].apply(remove_extra_space)
        X['text'] = X['text'].apply(lambda x: x.lower())       # Lower casing each word.

        # Return the cleaned 'X':
        return X
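
Regex-based cleaning is easy to get subtly wrong, so it helps to test the patterns on a sample string. The sketch below uses a made-up tweet and only the standard library. It shows a common slip in URL patterns: in `http?` the `?` makes only the preceding `p` optional, so `https://` links are missed, while `https?` covers both schemes. It also shows that collapsing whitespace needs `\s+` (whitespace), not `\S+` (non-whitespace).

```python
import re

# Made-up sample text for illustration
tweet = "Flood alert https://t.co/abc123 and http://example.com via www.example.org now"

# 'http?' means 'htt' followed by an optional 'p': https:// links are NOT matched
loose_pattern = re.compile(r'http?://\S+|www\.\S+')
# 'https?' means 'http' followed by an optional 's': both schemes are matched
strict_pattern = re.compile(r'https?://\S+|www\.\S+')

loose_result = loose_pattern.sub('', tweet)
strict_result = strict_pattern.sub('', tweet)

# Collapse the runs of spaces left behind by the removals
strict_clean = re.sub(r'\s+', ' ', strict_result).strip()

print(loose_result)    # the https link is still there
print(strict_clean)    # Flood alert and via now
```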


### 1.2. **<u>Tokenize and remove stop words in text</u>**

Create a **custom transformer CLASS** that inherits from **BaseEstimator** and **TransformerMixin**, indicating that it can be used as a part of a scikit-learn pipeline as an estimator, and has access to methods like **fit()** and **transform()**. When fit_transform is called on this transformer, it applies the stop words removal process to the input data and returns the modified data:

In [20]:
class stop_word_removal(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        nlp = spacy.load('en_core_web_lg')
        def remove_stop_words(text):
            processed_text = []
            doc = nlp(text)                                 # This applies spaCy's NLP pipeline to the input text, tokenizing it and performing various linguistic analyses.
            for token in doc:
                if token.is_stop:                           # This checks if the token is a stop word. If the token is not a stop word, its text (the original word) is appended to processed_text.
                    continue
                processed_text.append(token.text)

            return ' '.join(processed_text)                # The function returns a string which is the result of joining the processed tokens with spaces in between.

        X['text'] = X['text'].apply(remove_stop_words)

        return X
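
The filtering logic can be tried without downloading a trained model. The sketch below is an assumption-laden variant, not the notebook's method: it uses `spacy.blank("en")`, which provides only the tokenizer and the default English stop-word list, and `nlp.pipe` to process the texts as a batch. The sample sentences are made up, and the exact tokens removed depend on spaCy's stop-word list.

```python
import spacy

# Blank English pipeline: tokenizer + default stop-word list, no trained model required
nlp = spacy.blank("en")


def remove_stop_words(texts):
    cleaned = []
    # nlp.pipe processes an iterable of texts and yields one Doc per text
    for doc in nlp.pipe(texts):
        kept = [token.text for token in doc if not token.is_stop]
        cleaned.append(" ".join(kept))
    return cleaned


result = remove_stop_words(["the forest is on fire", "all residents asked to leave"])
print(result)
```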

### 1.3. **<u>Tokenize and Lemmatize text</u>**

Create a **custom transformer CLASS** that inherits from **BaseEstimator** and **TransformerMixin**, indicating that it can be used as a part of a scikit-learn pipeline as an estimator, and has access to methods like **fit()** and **transform()**. When fit_transform is called on this transformer, it applies the lemmatization process to the input data and returns the modified data:


In [21]:
class lemmatization(BaseEstimator, TransformerMixin):

    def fit(self, X, y = None):
        return self

    def transform(self, X):
        nlp = spacy.load('en_core_web_lg')
        def lemma_func(text):
            processed_text = []
            doc = nlp(text)                         # This applies spaCy's natural language processing pipeline to the input text. It tokenizes the text, performs part-of-speech tagging, dependency parsing, and more.
            for token in doc:
                processed_text.append(token.lemma_)

            return ' '.join(processed_text)         # The function returns a string which is the result of joining the processed tokens with spaces in between.

        X['text'] = X['text'].apply(lemma_func)

        return X

### 1.4. **<u>Define Pipeline</u>**

In [22]:
pipe_cleanprep = Pipeline([('clean_text', text_cleaning()),
                           ('removing_stop_words', stop_word_removal()),
                           ('lemma_text', lemmatization()),
                          ])

In [23]:
pipe_temp = Pipeline([('clean_text', text_cleaning())
                          ])

In [24]:
df_train = pd.read_csv('D:/git/Laboratory/NLP/Learning_NLP/data/train.csv')
df_train = df_train.drop(['id', 'keyword', 'location'],axis=1)

In [25]:
cleaned_df_train = pipe_temp.fit_transform(df_train)

cleaned_df_train.head()

| | text | target |
|---|---|---|
| 0 | our deeds are the reason of this earthquake ma... | 1 |
| 1 | forest fire near la ronge sask canada | 1 |
| 2 | all residents asked to shelter in place are be... | 1 |
| 3 | 13000 people receive wildfires evacuation orde... | 1 |
| 4 | just got sent this photo from ruby alaska as s... | 1 |


**Apply pipe_cleanprep to training and test data**:

**<u> Split data</u>**

<span style="color:#028553"> **fit** and **transform** can be applied to both the training and the test data, because the cleaning and basic preprocessing steps are stateless (their fit() learns nothing from the data), so they do not introduce a data leakage problem. </span>

In [40]:
# Load data (the data is provided already split)

df_train = pd.read_csv('D:/git/Laboratory/NLP/Learning_NLP/data/train.csv')
df_train = df_train.drop(['id', 'keyword', 'location'],axis=1)

df_test = pd.read_csv('D:/git/Laboratory/NLP/Learning_NLP/data/test.csv')
df_test = df_test.drop(['id', 'keyword', 'location'],axis=1)

In [42]:
df_train.head()

| | text | target |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [44]:
# Apply cleaning and basic preprocessing pipeline

cleaned_df_train = pipe_cleanprep.fit_transform(df_train)
cleaned_df_test = pipe_cleanprep.fit_transform(df_test)
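
Calling `fit_transform` on the test data is harmless only when the steps learn nothing in `fit`. The sketch below contrasts a hypothetical stateless `Lowercaser` with `CountVectorizer`, which learns a vocabulary; the class name and sample texts are made up for illustration. For a stateful step, the usual practice is to fit on the training data and call only `transform` on the test data.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer


# Hypothetical stateless transformer: fit() learns nothing
class Lowercaser(TransformerMixin, BaseEstimator):

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


train = ["Forest FIRE near town", "Sunny day"]
test = ["FLOOD warning issued"]

# Stateless step: refitting on the test set changes nothing
refit_on_test = Lowercaser().fit_transform(test)
fit_on_train = Lowercaser().fit(train).transform(test)
print(refit_on_test == fit_on_train)      # True

# Stateful step: the learned vocabulary depends on the data passed to fit
train_vocab = sorted(CountVectorizer().fit(train).vocabulary_)
test_vocab = sorted(CountVectorizer().fit(test).vocabulary_)
print(train_vocab)
print(test_vocab)
```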

In [45]:
cleaned_df_train.head()

| | text | target |
|---|---|---|
| 0 | | 1 |
| 1 | | 1 |
| 2 | | 1 |
| 3 | | 1 |
| 4 | | 1 |


**<u> Unsplit data</u>**

In [None]:
# If data is not split
# cleaned_df = pipe_cleanprep.fit_transform(df)

### 2. **<u>Split data into training and test datasets</u>**

In [21]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(cleaned_df_train.text,
                                                    cleaned_df_train.target,
                                                    stratify=cleaned_df_train.target,            # stratify keeps the proportion of each target class the same in the train and test sets
                                                    test_size=0.2,                               # 20% of data will be used as test data
                                                    random_state=42)                             # setting random_state makes the split identical every time you run this


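
The effect of `stratify` can be checked on synthetic labels. In this sketch the arrays are made up (30% positives out of 100 samples) and the `toy_` variable names are chosen only to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 30 positives and 70 negatives
toy_y = np.array([1] * 30 + [0] * 70)
toy_X = np.arange(100).reshape(-1, 1)

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    stratify=toy_y,      # preserve the class proportions in both subsets
    test_size=0.2,       # 20 of the 100 samples go to the test set
    random_state=42,     # reproducible split
)

train_rate = toy_y_train.mean()
test_rate = toy_y_test.mean()
print(train_rate, test_rate)   # both subsets keep 30% positives
```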

<div style="color:black; background-color:  #c9e5b5; padding: 10px;">

### 3. **PIPELINE including TEXT REPRESENTATION and ML MODEL in an NLP project**

</div>

A **Pipeline** can also be used for the **text representation** stage. In the following cells, I define and apply text representation (word embedding) methods. Each one can be evaluated by fitting an ML model on the resulting features, which gives some insight into which representation yields the best **accuracy**.

<span style="color:#028553"> The text is assumed to be already cleaned (apply pipe_cleanprep)</span>

<span style="color:#028553"> The following cases are examples. In principle, any word embedding method can be used</span>

In [18]:
# Necessary Pkgs
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer

### **Experimental Case: Bag of Words + Multinomial Naive Bayes**

In [19]:
# Define the pipeline
clf_bog_nb = Pipeline([('bag-of-2-grams', CountVectorizer(ngram_range=(1,2))),     # ngram_range=(1,2) counts both single words (unigrams) and pairs of consecutive words (bigrams)
                       ('multiNB', MultinomialNB())
                      ])

# fit pipeline to training data
clf_bog_nb.fit(X_train, y_train)

# create predictions with the fitted pipeline on the test data
y_pred = clf_bog_nb.predict(X_test)


In [None]:
# Checking predictions with the Confusion matrix
ConfusionMatrixDisplay.from_estimator(clf_bog_nb,
                                      X_test,
                                      y_test,
                                      values_format='d',
                                      display_labels=['Not disaster','Disaster'])       # target 0 = not a disaster, target 1 = real disaster
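
The `classification_report` imported earlier complements the confusion matrix with per-class precision, recall and F1. The sketch below uses made-up label vectors in place of `y_test` and `y_pred`, so the numbers say nothing about the actual model; the label names assume that target 1 marks a real disaster.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Made-up true labels and predictions, standing in for y_test and y_pred
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]

accuracy = accuracy_score(toy_true, toy_pred)
print(accuracy)                      # 0.75 (6 of 8 predictions are correct)

# Rows are true classes, columns are predicted classes, in the order [0, 1]
matrix = confusion_matrix(toy_true, toy_pred)
print(matrix)

# Per-class precision, recall and F1 as a formatted string
report = classification_report(toy_true, toy_pred,
                               target_names=['Not disaster', 'Disaster'])
print(report)
```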