## Relation Extraction using StackingClassifier Model

### Importing necessary libraries for data manipulation, visualization, and natural language processing

In [58]:
# NLTK provides the tokenizer, part-of-speech tagger, and lemmatizer used during preprocessing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
!pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer



### Libraries for extracting features from text for machine learning algorithms and for encoding categorical labels

In [22]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

### Downloading the necessary NLTK data packages for tokenization, part-of-speech tagging, lemmatization, and stopwords.

In [23]:
nltk.download('punkt',  quiet=True)
nltk.download('averaged_perceptron_tagger',  quiet=True)
nltk.download('wordnet',  quiet=True)
nltk.download('stopwords',  quiet=True)

True

### Loading the "sem_eval_2010_task_8" dataset from the Hugging Face `datasets` library for NLP tasks

In [57]:
!pip install datasets



In [24]:
from datasets import load_dataset
dataset = load_dataset("sem_eval_2010_task_8")

### Converting the dataset into pandas DataFrames for training and testing, facilitating easier data manipulation

In [25]:
train_df = pd.DataFrame(dataset["train"])
test_df = pd.DataFrame(dataset["test"])

### The function "preprocess_data" preprocesses the text data by performing several NLP tasks: 
1. Extracting entities : Extracts the two marked entities (the text inside the `<e1>` and `<e2>` tags) from a sentence and records the character offset at which each tag starts (-1 if a tag is missing). Relation classification examines the relationship between these two entities.
2. Cleaning text : Cleans the sentence by removing the `<e1>`/`<e2>` entity tags, converting to lowercase, and removing punctuation. This standardizes the text before tokenization and vectorization.
3. Tokenizing : Tokenizes the sentence into individual words, transforming a string (sentence) into a list of tokens (words).
4. Part-of-Speech Tagging : Applies part-of-speech tagging to the tokens and stores the result in `pos_tags`. The tags describe the grammatical structure of the sentence and can inform later steps such as lemmatization.
5. Lemmatizing : Lemmatizes the tokens, converting them to their base or dictionary form. Unlike stemming, lemmatization returns a valid word. Note that `lemmatize` is called here without a part-of-speech argument, so every token is treated as a noun and the tags from step 4 are not used.
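
As a quick check of steps 1 and 2, the same regular expressions can be tried on a single sentence. This is a minimal sketch using only the standard library; the sentence is an illustrative example in the dataset's tagged format, not a row taken from the notebook's output.

```python
import re

# Illustrative sentence in the <e1>/<e2> tagged format.
sentence = ("The <e1>child</e1> was carefully wrapped and bound into "
            "the <e2>cradle</e2> by means of a cord.")

e1_match = re.search(r"<e1>(.*?)</e1>", sentence)
e2_match = re.search(r"<e2>(.*?)</e2>", sentence)

e1, e2 = e1_match.group(1), e2_match.group(1)
# .start() is the character offset of the opening tag in the raw sentence,
# not a token position.
e1_index, e2_index = e1_match.start(), e2_match.start()

# Same cleaning steps as clean_data: drop tags, lowercase, strip punctuation.
cleaned = re.sub(r"</?e[12]>", "", sentence).lower()
cleaned = re.sub(r"[^a-z0-9\s]", "", cleaned)

print(e1, e2, e1_index, e2_index)
print(cleaned)
```

Because the offsets are measured on the raw, tagged sentence, they do not line up with positions in `cleaned_sentence` or with token indices.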

In [38]:
def preprocess_data(dataframe):
    
    # Extracting entities and their indices
    def extract_entities(sentence):
        e1_match = re.search(r'<e1>(.*?)<\/e1>', sentence)
        e2_match = re.search(r'<e2>(.*?)<\/e2>', sentence)
        
        e1 = e1_match.group(1) if e1_match else ''
        e2 = e2_match.group(1) if e2_match else ''
        
        e1_index = e1_match.start() if e1_match else -1
        e2_index = e2_match.start() if e2_match else -1
        
        return e1, e2, e1_index, e2_index

    dataframe['e1'], dataframe['e2'], dataframe['e1_index'], dataframe['e2_index'] = zip(*dataframe['sentence'].apply(extract_entities))

    # Cleaning the sentence
    def clean_data(sentence):
        sentence = re.sub(r'<\/?e[12]>', '', sentence)
        sentence = sentence.lower()
        sentence = re.sub(r'[^a-zA-Z0-9\s]', '', sentence)
        return sentence

    dataframe['cleaned_sentence'] = dataframe['sentence'].apply(clean_data)

    # Tokenize, POS tag, and lemmatize
    def tokenize_text(sentence):
        return nltk.word_tokenize(sentence)
    
    def pos_tagging(tokens):
        return nltk.pos_tag(tokens)
    
    def lemmatization(tokens):
        lemmatizer = WordNetLemmatizer()
        return [lemmatizer.lemmatize(token) for token in tokens]

    dataframe['tokens'] = dataframe['cleaned_sentence'].apply(tokenize_text)
    dataframe['pos_tags'] = dataframe['tokens'].apply(pos_tagging)
    dataframe['lemmatized_tokens'] = dataframe['tokens'].apply(lemmatization)

    return dataframe
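
The `lemmatization` helper above calls `lemmatize(token)` with its default part of speech, which is noun, so verbs such as "running" are left unchanged. One possible refinement, sketched below under the assumption that the Penn Treebank tags from `nltk.pos_tag` are available, is to map each tag to the single-letter WordNet code that `lemmatize` accepts through its `pos` argument. The helper name `penn_to_wordnet` and the hand-written `tagged` list are hypothetical and only for illustration.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code ('a', 'v', 'r', or 'n')."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun, also used as the fallback

# Hand-written (token, tag) pairs in the shape returned by nltk.pos_tag.
tagged = [("the", "DT"), ("children", "NNS"), ("were", "VBD"),
          ("running", "VBG"), ("quickly", "RB")]

pos_codes = [(token, penn_to_wordnet(tag)) for token, tag in tagged]
print(pos_codes)

# With the WordNet data downloaded, each pair could then be passed as
# WordNetLemmatizer().lemmatize(token, pos=code).
```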

### Preprocessing training and testing datasets

In [39]:
train_df = preprocess_data(train_df)
test_df = preprocess_data(test_df)

### Vectorizing the text data using TF-IDF to convert text to a matrix of TF-IDF features

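Here we transform the text into numerical vectors. Each term is weighted by its frequency in the document (TF), adjusted by its rarity across all documents (IDF). The TF-IDF value increases with the number of times a word appears in the document and decreases with the number of documents that contain the word. The formulas below are the textbook definition; `TfidfVectorizer` with default settings uses raw counts for TF, a smoothed IDF, and L2-normalizes each row, so its values differ from these.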

1. TF(t,d) = (Number of times term t appears in a document d) / (Total number of terms in the document d)
2. IDF(t,D) = log_e(Total number of documents in the corpus D / Number of documents with term t in them)
3. TF-IDF(t, d, D) = TF(t, d) × IDF(t, D)
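
The three formulas above can be worked through by hand on a toy corpus. The sketch below implements them literally in plain Python; the three-document corpus is made up for illustration, and the numbers will not match `TfidfVectorizer`, which applies smoothing and normalization by default.

```python
import math

# Toy corpus: each document is a list of tokens.
corpus = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]

def tf(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Natural log of (number of documents / documents containing `term`).
    # Assumes the term occurs in at least one document.
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ["the", "cat", "dog"]:
    print(term, round(tf_idf(term, corpus[1], corpus), 4))
```

A term that appears in every document ("the") gets an IDF of zero and therefore a TF-IDF of zero, while a term unique to one document ("dog") gets the highest weight.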



In [40]:
vectorizer = TfidfVectorizer()
x_train = vectorizer.fit_transform(train_df['cleaned_sentence'])
x_test = vectorizer.transform(test_df['cleaned_sentence'])

### Encoding the labels into a format suitable for classification models and transforming the labels for test data.

In [41]:
le = LabelEncoder()
y_train = le.fit_transform(train_df['relation'])
y_test = le.transform(test_df['relation']) 
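
To see what the encoder does, here is a small sketch with hypothetical string labels (the names are placeholders, not read from the dataset). `LabelEncoder` sorts the distinct labels seen during `fit` and assigns each an integer; `transform` reuses that mapping and `inverse_transform` reverses it.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical relation labels, for illustration only.
toy_train = ["Cause-Effect", "Other", "Component-Whole", "Other"]
toy_test = ["Other", "Cause-Effect"]

toy_le = LabelEncoder()
encoded_train = toy_le.fit_transform(toy_train)  # learns the mapping
encoded_test = toy_le.transform(toy_test)        # reuses the same mapping

print(list(toy_le.classes_))
print(list(encoded_train), list(encoded_test))
print(list(toy_le.inverse_transform(encoded_test)))
```

Calling `transform` on a label that was not seen during `fit` raises an error, which is why the encoder is fitted on the training labels and only applied to the test labels.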

### Importing necessary modules from scikit-learn for the SVC model and evaluation metrics

In [42]:
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

#### SVC Model Training

#### Instantiate the SVC model with specific hyperparameters:
1. kernel = 'linear' : Specifies the use of a linear kernel. This is suitable for text classification tasks where the feature space is high-dimensional. A linear kernel helps in finding a linear decision boundary in this space.

2. C = 10 : The regularization parameter; the strength of regularization is inversely proportional to C. A larger C penalizes training misclassifications more heavily, which gives a narrower margin and a higher risk of overfitting. Class imbalance is handled separately by `class_weight` (item 4).

3. gamma = 0.0001 : Kernel coefficient for 'rbf', 'poly', and 'sigmoid'. The linear kernel ignores it, so it has no effect here and is specified only for completeness.

4. class_weight = 'balanced': Adjusts weights inversely proportional to class frequencies in the input data. This is important for handling imbalanced datasets, ensuring that the model does not bias towards the majority class.
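
The 'balanced' heuristic in item 4 weights each class by `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that arithmetic in plain Python on a made-up label list; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Made-up imbalanced labels: eight samples of class 0, two of class 1.
toy_labels = [0] * 8 + [1] * 2
print(balanced_class_weights(toy_labels))
```

The rare class receives a weight four times larger than the frequent one, so each of its misclassifications costs proportionally more during training.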

In [43]:
svc = SVC(kernel='linear', C=10, gamma=0.0001, class_weight='balanced')

#### Fit the model on the training data.
1. x_train: Feature vectors of the training data.
2. y_train: Target values (class labels) for the training samples.

In [44]:
svc.fit(x_train, y_train)

#### Model Prediction and Evaluation

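1. Predict the class labels for the test set from `x_test`, the feature vectors of the test data.
2. Compare the predictions with `y_test` using accuracy, a per-class classification report, and a confusion matrix.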

In [46]:
y_pred = svc.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the performance metrics.
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred, zero_division=0))
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.548767022451233
              precision    recall  f1-score   support

           0       0.89      0.72      0.79       134
           1       0.78      0.82      0.80       194
           2       0.44      0.54      0.48       162
           3       0.43      0.31      0.36       150
           4       0.68      0.71      0.70       153
           5       0.63      0.62      0.62        39
           6       0.74      0.83      0.79       291
           7       0.00      0.00      0.00         1
           8       0.62      0.73      0.67       211
           9       0.90      0.38      0.54        47
          10       0.43      0.27      0.33        22
          11       0.56      0.43      0.49       134
          12       0.75      0.19      0.30        32
          13       0.53      0.58      0.55       201
          14       0.62      0.50      0.55       210
          15       0.65      0.39      0.49        51
          16       0.44      0.40      0.42       108

#### Calculate and print the F1 score:
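1. The F1 score is the harmonic mean of precision and recall, providing a balance between them.
2. It is particularly useful when the class distribution is imbalanced. With `average='micro'` over all classes of a single-label multi-class problem such as this one, it equals accuracy, which is why the value printed below matches the accuracy above.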
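
The relationship between per-class F1, micro-averaged F1, and accuracy can be checked by hand. The following is a minimal pure-Python sketch on made-up labels; the helper names are hypothetical and it is not how scikit-learn implements `f1_score` internally.

```python
from collections import Counter

def count_errors(y_true, y_pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    return tp, fp, fn

def per_class_f1(y_true, y_pred):
    tp, fp, fn = count_errors(y_true, y_pred)
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[cls] + fp[cls] + fn[cls]
        scores[cls] = 2 * tp[cls] / denom if denom else 0.0
    return scores

def micro_f1(y_true, y_pred):
    tp, fp, fn = count_errors(y_true, y_pred)
    total_tp, total_fp, total_fn = sum(tp.values()), sum(fp.values()), sum(fn.values())
    return 2 * total_tp / (2 * total_tp + total_fp + total_fn)

# Made-up labels for three classes.
toy_true = [0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 2, 0, 2]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(per_class_f1(toy_true, toy_pred))
print(micro_f1(toy_true, toy_pred), toy_accuracy)
```

Every wrong prediction adds exactly one false positive and one false negative, so the pooled counts reduce micro-F1 to the fraction of correct predictions. A macro average (the unweighted mean of the per-class scores) would instead expose weak performance on rare classes.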

In [47]:
f1score = f1_score(y_test, y_pred, average='micro', labels=np.unique(y_train))
print(f1score)

0.548767022451233


#### Calculate F1 score for each relation:
1. This step calculates the F1 score for each class individually, providing insights into the model's performance on a per-class basis. This can highlight which classes are well-predicted by the model and which are not.

In [48]:
f1_score_per_relation = f1_score(y_test, y_pred, average=None, labels=np.unique(y_train))
# Print F1 score for each relation
relations = le.inverse_transform(range(len(le.classes_)))
for relation, score in zip(relations, f1_score_per_relation):
    print(f"Relation: {relation}, F1 Score: {score}")

Relation: 0, F1 Score: 0.7933884297520661
Relation: 1, F1 Score: 0.8040201005025126
Relation: 2, F1 Score: 0.481994459833795
Relation: 3, F1 Score: 0.36293436293436293
Relation: 4, F1 Score: 0.6964856230031949
Relation: 5, F1 Score: 0.6233766233766234
Relation: 6, F1 Score: 0.7857142857142857
Relation: 7, F1 Score: 0.0
Relation: 8, F1 Score: 0.6666666666666666
Relation: 9, F1 Score: 0.5373134328358209
Relation: 10, F1 Score: 0.3333333333333333
Relation: 11, F1 Score: 0.4851063829787234
Relation: 12, F1 Score: 0.3
Relation: 13, F1 Score: 0.5510688836104513
Relation: 14, F1 Score: 0.5488126649076517
Relation: 15, F1 Score: 0.4878048780487805
Relation: 16, F1 Score: 0.4195121951219512
Relation: 17, F1 Score: 0.3128491620111732
Relation: 18, F1 Score: 0.32894736842105265


#### Import the necessary modules from scikit-learn for creating ensemble models

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

#### Stacking Classifier Setup

1. The Stacking Classifier is an ensemble learning technique that combines multiple classification models via a final estimator. Here, we use RandomForest and SVC as base models and another SVC as the final estimator.
2. Define base models for the stacking classifier.
3. Each model is defined as a tuple consisting of a unique name and the model instance.

In [50]:
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=151)),
    ('svc', SVC(probability=True, kernel='linear', random_state=151))
]

### Defining the final model.

1. The final model (meta-learner) takes in the outputs of the base models as input and makes the final prediction.
2. Here, we are using another SVC, but with a radial basis function (rbf) kernel.

In [51]:
final_model = SVC(kernel='rbf', probability=True, C=1.0, random_state=42)
stacking_model = StackingClassifier(estimators=base_models, final_estimator=final_model, cv=5)
stacking_model.fit(x_train, y_train)
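
To make the data flow concrete, the sketch below fits a smaller stack of the same shape on synthetic data and inspects what the final estimator receives. The dataset from `make_classification`, the reduced `n_estimators`, and the `toy_` names are assumptions chosen to keep the example fast; they are not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem standing in for the TF-IDF features.
X_toy, y_toy = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0
)

toy_stack = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
        ("svc", SVC(kernel="linear", probability=True, random_state=0)),
    ],
    final_estimator=SVC(kernel="rbf", random_state=0),
    cv=5,  # out-of-fold base predictions are used to train the final estimator
)
toy_stack.fit(X_toy, y_toy)

# transform() returns the base-model outputs that feed the final estimator.
meta_features = toy_stack.transform(X_toy)
print(meta_features.shape)
print(toy_stack.predict(X_toy[:5]))
```

With the default `stack_method='auto'`, each base model contributes its class probabilities, so the final estimator sees one column per base model per class rather than the original features.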



#### Model Prediction and Evaluation

1. Predict the class labels for the test set using the trained stacking model.
2. The predictions are based on the combined strategy of the base models followed by the final model's decision.

In [52]:
y_pred_stack = stacking_model.predict(x_test)

In [53]:
accuracy = accuracy_score(y_test, y_pred_stack)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred_stack, zero_division=0))
print(confusion_matrix(y_test, y_pred_stack))

Accuracy: 0.5793154214206846
              precision    recall  f1-score   support

           0       0.92      0.73      0.82       134
           1       0.77      0.84      0.80       194
           2       0.45      0.54      0.49       162
           3       0.59      0.31      0.40       150
           4       0.68      0.78      0.73       153
           5       0.65      0.72      0.68        39
           6       0.73      0.88      0.80       291
           7       0.00      0.00      0.00         1
           8       0.59      0.79      0.68       211
           9       0.74      0.66      0.70        47
          10       0.37      0.45      0.41        22
          11       0.62      0.49      0.55       134
          12       0.36      0.25      0.30        32
          13       0.55      0.66      0.60       201
          14       0.64      0.58      0.61       210
          15       0.68      0.49      0.57        51
          16       0.42      0.47      0.45       10

In [54]:
# Micro - F1 score calculation.
f1scores = f1_score(y_test, y_pred_stack, average='micro', labels=np.unique(y_train))
print(f1scores)

0.5793154214206846


In [55]:
# Print F1 score for each relation
f1_scores_per_relation = f1_score(y_test, y_pred_stack, average=None, labels=np.unique(y_train))
relations = le.inverse_transform(range(len(le.classes_)))
for relation, score in zip(relations, f1_scores_per_relation):
    print(f"Relation: {relation}, F1 Score: {score}")

Relation: 0, F1 Score: 0.8166666666666667
Relation: 1, F1 Score: 0.8049382716049382
Relation: 2, F1 Score: 0.49162011173184356
Relation: 3, F1 Score: 0.40350877192982454
Relation: 4, F1 Score: 0.725609756097561
Relation: 5, F1 Score: 0.6829268292682927
Relation: 6, F1 Score: 0.7956318252730109
Relation: 7, F1 Score: 0.0
Relation: 8, F1 Score: 0.6761710794297352
Relation: 9, F1 Score: 0.6966292134831461
Relation: 10, F1 Score: 0.40816326530612246
Relation: 11, F1 Score: 0.55
Relation: 12, F1 Score: 0.2962962962962963
Relation: 13, F1 Score: 0.5972850678733032
Relation: 14, F1 Score: 0.61
Relation: 15, F1 Score: 0.5681818181818182
Relation: 16, F1 Score: 0.4473684210526316
Relation: 17, F1 Score: 0.32608695652173914
Relation: 18, F1 Score: 0.30699774266365687


#### Hyperparameter Tuning for Stacking Classifier

1. Define the parameter grid for RandomForestClassifier. This dictionary contains the hyperparameters for the random forest and the values that GridSearchCV will explore.
2. 'rf__n_estimators' refers to the number of trees in the forest, and 'rf__max_depth' refers to the maximum depth of each tree.

In [42]:
# Define parameter grid for RandomForestClassifier
param_grid_rf = {
    'rf__n_estimators': [100, 200],
    'rf__max_depth': [None, 10, 20],
}

In [43]:
# Define parameter grid for SVC
param_grid_svc = {
    'svc__C': [0.1, 1, 10],
    'svc__gamma': ['scale', 0.1, 0.01],
}

In [44]:
# Combine both parameter grids
param_grid = {**param_grid_rf, **param_grid_svc}
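
Before launching a search it helps to count how many models the combined grid implies. The sketch below enumerates the grid with the standard library only; the 5 outer folds are an assumption matching a `cv=5` grid search.

```python
from itertools import product

# Same keys and values as the combined grid above.
param_grid = {
    "rf__n_estimators": [100, 200],
    "rf__max_depth": [None, 10, 20],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 0.01],
}

# Every combination of one value per hyperparameter.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_candidates = len(candidates)
outer_folds = 5  # assumed number of cross-validation folds in the search
n_fits = n_candidates * outer_folds

print(n_candidates, n_fits)
print(candidates[0])
```

Each of those fits trains a full stacking model, which itself cross-validates its base models internally, so the total cost grows quickly as values are added to the grid.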

#### Initialize the Stacking Classifier with the base models and the final estimator. The base models are Random Forest and SVC, while the final model is an SVC with a radial basis function kernel.

In [45]:
base_models = [
    ('rf', RandomForestClassifier(random_state=151)),
    ('svc', SVC(probability=True, random_state=151))
]

In [46]:
# Final model
final_model = SVC(kernel='rbf', probability=True, random_state=42)
# Stacking Classifier
stacking_model = StackingClassifier(estimators=base_models, final_estimator=final_model, cv=5)

#### Initialize the GridSearchCV object for hyperparameter tuning
1. The GridSearchCV object will evaluate all combinations of parameters defined in the param_grid.
2. It will perform cross-validation with 5 folds for each combination and use 'f1_micro' scoring to assess the models.

In [None]:
from sklearn.model_selection import GridSearchCV
grid_search_stacking = GridSearchCV(estimator=stacking_model, param_grid=param_grid, cv=5, verbose=3, scoring='f1_micro')

#### Fit GridSearchCV
1. This line will start the hyperparameter tuning process, which can be time-consuming depending on the size of the parameter grid and the cost of fitting the individual models.

In [None]:
grid_search_stacking.fit(x_train, y_train)

#### Best parameters found
1. After fitting GridSearchCV, we can find the combination of parameters that gave the best results on the cross-validated sets.

In [48]:
# View the best parameters found by GridSearchCV
print("Best Parameters:", grid_search_stacking.best_params_)

# Use the best parameters to make predictions
y_pred = grid_search_stacking.best_estimator_.predict(x_test)

Best Parameters: {'rf__max_depth': None, 'rf__n_estimators': 200, 'svc__C': 10, 'svc__gamma': 0.1}


#### Evaluate the model with the optimized hyperparameters
1. After making predictions, evaluate the model using standard metrics like accuracy and F1 score to see if the hyperparameter tuning has led to an improvement in performance.

In [50]:
accuracy = accuracy_score(y_test, y_pred)
f1score = f1_score(y_test, y_pred, average='micro')
print("Optimized Stacking Classifier Accuracy:", accuracy)
print("Optimized Stacking Classifier F1 Score:", f1score)
print(classification_report(y_test, y_pred))

Optimized Stacking Classifier Accuracy: 0.5844681634155319
Optimized Stacking Classifier F1 Score: 0.5844681634155319
              precision    recall  f1-score   support

           0       0.93      0.75      0.83       134
           1       0.78      0.85      0.81       194
           2       0.46      0.54      0.50       162
           3       0.60      0.33      0.42       150
           4       0.66      0.78      0.71       153
           5       0.66      0.74      0.70        39
           6       0.72      0.85      0.78       291
           7       0.00      0.00      0.00         1
           8       0.61      0.80      0.69       211
           9       0.74      0.68      0.71        47
          10       0.36      0.45      0.40        22
          11       0.62      0.50      0.55       134
          12       0.35      0.25      0.29        32
          13       0.56      0.67      0.61       201
          14       0.64      0.58      0.61       210
          15     

