# COMP1804 - Applied Machine Learning (Coursework)
*This coursework is built around a consulting scenario for a non-profit organization, **NotTheRealWikipedia**, which wants to know whether ML can help analyze new content on its site. It comprises two main tasks:*

1. **Topic Classification**: Utilizing ML to categorize paragraphs of text into one of five topics based on the content and the presence of references to a person, organization, and/or product. Success criteria include outperforming a trivial baseline, avoiding overfitting, and ensuring low misclassification rates for each class.
2. **Text Clarity Classification Prototype**: Developing a prototype to automatically assess if a paragraph is written clearly enough, employing a subset of the data for training. This task also requires addressing ethical implications and suggesting improvements based on the prototype's performance.

The dataset has the following structure:

| FEATURE NAME       | BRIEF DESCRIPTION                                                                          |
|--------------------|---------------------------------------------------------------------------------------------|
| `par_id`             | Unique identifier for each paragraph to classify.                                           |
| `paragraph`          | Text to classify.                                                                           |
| `has_entity`         | Whether the text contains a reference to a product (yes/no), an organisation (yes/no), or a person (yes/no). |
| `lexicon_count`      | The number of words in the text.                                                            |
| `difficult_words`    | The number of difficult words in the text.                                                  |
| `last_editor_gender` | The gender of the latest person to edit the text.                                           |
| `category`           | The category into which the text should be classified.                                      |
| `text_clarity`       | The clarity level of the text. Very few data points are labelled at first.                 |

## Setup and Initial Imports

### Prerequisite Packages
Before running any code, install the required Python packages. The notebook does this with a `pip install` command that reads [`requirements.txt`](../Docs/requirements.txt), which must sit at `../Docs/requirements.txt` relative to the notebook. The `--user` flag installs into the user site-packages directory, and `--quiet` suppresses most of pip's output.

#### Usage:
```python
%pip install --user -r ../Docs/requirements.txt --quiet
```

In [None]:
%pip install --user -r ../Docs/requirements.txt --quiet

### Importing Libraries
The following code block imports the libraries used for data manipulation, visualization, natural language processing (NLP), and machine learning. The imports are wrapped in a single `try`/`except` block that prints the first error encountered. Note that an import failure stops the remaining imports in the block from running, so a reported error should be fixed before continuing.

#### Key Libraries and Their Roles:
- **Data Manipulation and Linear Algebra**: `pandas`, `numpy`
- **NLP**: `spacy`, `nltk` (Natural Language Toolkit)
- **Fast Function Application**: `swifter`
- **Visualization**: `seaborn`, `matplotlib`
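- **Text Processing**: Regular expressions (`re`), `string`, `textblob`, and `langdetect` for language detection
- **Machine Learning**: `sklearn` for model training, feature extraction, and evaluation; `imblearn` for handling imbalanced data; `xgboost` for gradient-boosted trees.
- **Deep Learning Utilities**: `tensorflow` and `keras` (`pad_sequences`).
- **Progress Monitoring**: `tqdm` for progress bars during lengthy operations.
- **Custom Transformers**: `FunctionTransformer`, `ColumnTransformer`, `BaseEstimator`, and `TransformerMixin` for pipelines that include custom preprocessing steps.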

Additionally, the seaborn library's theme is set to `"whitegrid"` for better visual aesthetics in plots, and `tqdm` is enabled for progress applications on pandas series, enhancing feedback during data processing tasks.
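
Because a single `try` block stops at the first failing import, it can help to list every missing package up front. The following is a minimal sketch using only the standard library; the helper name `find_missing` and the example module names are illustrative assumptions, not part of the original notebook.

```python
import importlib.util

def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    # find_spec returns None when a top-level module is not installed
    return [name for name in modules if importlib.util.find_spec(name) is None]

# Hypothetical check: two standard-library modules and one that does not exist
required = ["json", "re", "definitely_not_a_real_module_xyz"]
print(find_missing(required))
```

Only top-level names (for example `sklearn`, not `sklearn.svm`) should be passed, since `find_spec` imports parent packages when given a dotted name.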

In [10]:
# Import necessary libraries
try:
    import json
    import os
    import time
    import pandas as pd
    import numpy as np
    import spacy
    import swifter  # For applying functions in a fast and efficient way
    import pickle
    import tensorflow as tf
    
    # Visualization libraries
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Text processing libraries
    import re
    import string
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from tqdm import tqdm  # For progress bars
    from textblob import TextBlob
    from langdetect import detect
    
    # Machine learning libraries
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, TfidfTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC
    from sklearn.ensemble import RandomForestClassifier
    from xgboost import XGBClassifier
    from sklearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline as IMBPipeline
    from imblearn.under_sampling import RandomUnderSampler
    from sklearn.model_selection import learning_curve, cross_val_score
    
    # Custom transformers for pipeline
    from sklearn.preprocessing import FunctionTransformer
    from sklearn.compose import ColumnTransformer
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.pipeline import FeatureUnion
    from sklearn.feature_extraction import DictVectorizer
    from keras.preprocessing.sequence import pad_sequences

    # Evaluation metrics
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

    # Setting seaborn theme for better visuals
    sns.set_theme(style="whitegrid")

    # Enabling progress_apply for pandas series via tqdm
    tqdm.pandas()

# Catching and printing any exception that occurs during the import process
except Exception as e:
    print(f"Error : {e}")

### Loading `NLTK` libraries
The `download_nltk_resources()` function below downloads the required NLTK resources only if they are not already present locally, a pattern the author calls **lazy loading** and reuses from earlier projects. The resources are:

- `punkt` for tokenization.
- `stopwords` for stopword removal.
- `averaged_perceptron_tagger` for POS tagging.
- `wordnet` for lemmatization.

The function checks for each resource before downloading it and passes `quiet=True` to `nltk.download` to keep the log clean.

#### Usage:
```python
download_nltk_resources()
```

In [11]:
# Download and verify the NLTK resources needed by this notebook
def download_nltk_resources():
    # Map each resource to the path that nltk.data.find() expects.
    # Needed for tokenization, stopwords removal, POS tagging, and lemmatization respectively
    resources = {
        'punkt': 'tokenizers/punkt',
        'stopwords': 'corpora/stopwords',
        'averaged_perceptron_tagger': 'taggers/averaged_perceptron_tagger',
        'wordnet': 'corpora/wordnet',
    }

    # Iterating over each resource to check its presence, and download if missing
    for resource, path in resources.items():
        try:
            # Check if the resource is already downloaded to avoid re-downloading
            nltk.data.find(path)
        except LookupError:
            # Resource not found; proceed to download
            nltk.download(resource, quiet=True)  # quiet=True suppresses the output to keep the log cleaner
            print(f'NLTK resource "{resource}" downloaded.')

    # Notification upon successful verification or download of all resources
    print('All required NLTK resources are ready.')
'''
	- This code is copied from my previous college project. It is called Lazy Loading NLTK Resources. 
	- The idea behind it was that we don't need to download all the resources at once.
	- We can download them as and when needed.
	- This way, we can save time and space by downloading only the necessary resources.
'''

# Initiating the download and verification of NLTK resources
download_nltk_resources()

NLTK resource "punkt" downloaded.
NLTK resource "stopwords" downloaded.
NLTK resource "averaged_perceptron_tagger" downloaded.
NLTK resource "wordnet" downloaded.
All required NLTK resources are ready.


---
# Task 1: Text Topic Classification
*This task contains a machine learning approach for classifying text into five categories: **artificial intelligence**, **movies about artificial intelligence**, **programming**, **philosophy**, and **biographies**.*

**Objectives**:
1. Predict the topic `category` of a paragraph using `paragraph` and `has_entity` as features.
2. Meet the success criteria: **better than a trivial baseline**, **avoids overfitting**, and **misclassification below 10%** for unrelated categories.
3. Identify the **most informative scalar performance metric**.
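
To make the "trivial baseline" criterion concrete, one common choice is a classifier that always predicts the most frequent training label. The sketch below computes that baseline's accuracy with the standard library only; the function name and the toy labels are assumptions for illustration, and the coursework may define its baseline differently.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Toy labels, not taken from the real dataset
train = ["biographies", "biographies", "biographies", "programming"]
test = ["biographies", "programming", "philosophy", "biographies"]
print(majority_baseline_accuracy(train, test))
```

In scikit-learn, `DummyClassifier(strategy="most_frequent")` from `sklearn.dummy` provides the same kind of baseline and can be dropped into a pipeline in place of the real classifier.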

## Loading the dataset

The two functions below, `load_t1_df()` and `load_t2_df()`, load the dataset for Task 1 and Task 2 respectively. Both add an `'index'` column starting from 1 and report the dataset's dimensions.

### Task 1: `load_t1_df`
Loads a dataset specifically for Task 1 with a selected set of columns.

#### Features:
- **Columns Loaded**: `'paragraph'`, `'has_entity'`, `'category'`, `'lexicon_count'`, `'difficult_words'`.
- **Index Column**: Adds an `'index'` column starting from 1 for easier reference.
- **Empty Check**: Verifies if the dataset is empty and prints the dataset's dimensions or an empty dataset message.

#### Usage:
```python
df1 = load_t1_df('filename_with_path')
```

In [12]:
# Function to load and preprocess the dataset for Task 1
def load_t1_df(filename):
    # Define the columns to be loaded from the file
    columns = ['paragraph', 'has_entity', 'category', 'lexicon_count', 'difficult_words']
    # Load the dataset with specified columns
    df = pd.read_csv(filename, usecols=columns)
    # Add an 'index' column that starts from 1
    df['index'] = df.index + 1
    # Check if the DataFrame is empty and print a message accordingly
    if not df.empty:
        task_name = 'Task-1'
        print(f"{df.shape[0]} rows and {df.shape[1]} columns (without 'text_clarity') loaded successfully for {task_name}, including newly added 'index' column.")
    else:
        print("The dataset is empty.")
    return df

# Function to load and preprocess the dataset for Task 2
def load_t2_df(filename):
    df = pd.read_csv(filename)
    # Add an 'index' column that starts from 1
    df['index'] = df.index + 1
    # Check if the DataFrame is empty and print a message accordingly
    if not df.empty:
        task_name = 'Task-2'
        print(f"{df.shape[0]} rows and {df.shape[1]} columns loaded successfully for {task_name}, including newly added 'index' column.")
    else:
        print("The dataset is empty.")
    return df

'''
	- Function usage: df = load_t1_df('filename_with_path')
	- Replace `filename_with_path` with your original value.
'''

df1 = load_t1_df('../Datasets/dataset.csv')
task_name = 'Task-1'

9347 rows and 6 columns (without 'text_clarity') loaded successfully for Task-1, including newly added 'index' column.


## Data Cleaning Function: `clean_df`

This function is designed for preprocessing a DataFrame by removing rows with any missing values, except those in the `'text_clarity'` column. It provides insights into the DataFrame's state before and after cleaning by printing the shape and the count of missing values.

### Steps Performed:
1. **Initial Check**: Prints the initial shape of the DataFrame and a count of missing values for each column.
2. **Removing Missing Values**: Excludes rows with missing values across all columns except `'text_clarity'`. It also prints the columns from which rows are being removed due to missing values.
3. **Verification**: After removal, prints the updated shape of the DataFrame and verifies that no missing values remain, except possibly in `'text_clarity'`.

#### Usage:
To clean your DataFrame `df`, call the function as follows:
```python
df1 = clean_df(df)
```

In [13]:
def clean_df(df):
    # Print initial shape and missing values count
    print("-" * 55)
    print(f'Initial shape: {df.shape}. Checking for missing values...')
    missing_values_initial = df.isnull().sum()
    print(missing_values_initial)
    print("-" * 55)

    # Remove rows with missing values
    df_cleaned = df.dropna(subset=df.columns.difference(['text_clarity']))
    print('Removing rows with missing values...')
    print(missing_values_initial[missing_values_initial > 0])  # Print only columns with missing values

    # Print shape after removal and verify no missing values
    print("-" * 55)
    print(f'Updated shape: {df_cleaned.shape}. Verifying no missing values remain...')
    missing_values_final = df_cleaned.isnull().sum()
    print(missing_values_final)
    print("-" * 55)

    return df_cleaned

'''
	- Function usage: df = clean_df(df)
	- Replace `df` with your original DataFrame.
'''

df1 = clean_df(df1)

-------------------------------------------------------
Initial shape: (9347, 6). Checking for missing values...
paragraph           0
has_entity          0
lexicon_count       0
difficult_words    18
category           61
index               0
dtype: int64
-------------------------------------------------------
Removing rows with missing values...
difficult_words    18
category           61
dtype: int64
-------------------------------------------------------
Updated shape: (9268, 6). Verifying no missing values remain...
paragraph          0
has_entity         0
lexicon_count      0
difficult_words    0
category           0
index              0
dtype: int64
-------------------------------------------------------


## Dataframe Processing Function: `process_df`

This function performs specific preprocessing tasks on a DataFrame to ensure data consistency and integrity, focusing on the 'category' and 'has_entity' columns.

### Steps Performed:
1. **Standardize 'category' Column**:
    - Initially prints unique values in the 'category' column.
    - Converts all text in the 'category' column to lowercase to standardize the data.
    - Prints the unique values in the 'category' column post-conversion for verification.

2. **Clean 'has_entity' Column**:
    - Prints unique values in the 'has_entity' column to show initial data state.
    - Removes rows where the 'has_entity' column has the value 'data missing', ensuring data quality.
    - Prints the unique values in the 'has_entity' column after removal for verification.

#### Usage:
To preprocess your DataFrame `df`, use the function as follows:
```python
df1 = process_df(df)
```

In [14]:
def process_df(df):
    # Standardize 'category' column to lowercase
    print("-" * 55)
    print('Checking for unique values in the "category" column:')
    print(df['category'].unique())
    df['category'] = df['category'].str.lower()
    print('\nFixed the case of the "category" column, unique values now:')
    print(df['category'].unique())
    print("-" * 55)

    # Remove rows where 'has_entity' column has 'data missing'
    print("-" * 55)
    print('Checking for unique values in the "has_entity" column:')
    print(df['has_entity'].unique())
    df = df[df['has_entity'] != 'data missing']
    print('\nRemoved rows with "data missing" in the "has_entity" column, unique values now:')
    print(df['has_entity'].unique())
    print("-" * 55)

    return df

'''
	- Function usage: df = process_df(df)
	- Replace `df` with your original DataFrame.
'''

df1 = process_df(df1)

-------------------------------------------------------
Checking for unique values in the "category" column:
['biographies' 'artificial intelligence' 'programming' 'philosophy'
 'movies about artificial intelligence' 'Philosophy' 'Programming'
 'Artificial intelligence' 'Biographies'
 'Movies about artificial intelligence']

Fixed the case of the "category" column, unique values now:
['biographies' 'artificial intelligence' 'programming' 'philosophy'
 'movies about artificial intelligence']
-------------------------------------------------------
-------------------------------------------------------
Checking for unique values in the "has_entity" column:
['ORG_YES_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_NO_PERSON_NO_'
 'ORG_NO_PRODUCT_YES_PERSON_NO_' 'ORG_YES_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_NO_' 'ORG_NO_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_YES_PERSON_NO_'
 'data missing']

Removed rows with "data missing" in the "has_entity" co

## Entity Column Transformation Function: `split_entity_column`

This function enhances data representation by splitting the `'has_entity'` column of a DataFrame into three separate binary columns. Each new column represents the presence (1) or absence (0) of specific entity types: organizations, products, and persons.

### Transformation Details:
- **ORG Column**: Indicates the presence of organizations with `ORG_YES`.
- **PRODUCT Column**: Flags the presence of products with `PRODUCT_YES`.
- **PERSON Column**: Marks the presence of persons with `PERSON_YES`.
  
The function applies a lambda function to check for the existence of each entity type within the `'has_entity'` column, then converts these checks into binary columns.

After the transformation, the function prints a preview of the DataFrame showing the original `'has_entity'` column alongside the newly created `'ORG'`, `'PRODUCT'`, and `'PERSON'` binary columns for verification.
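
The substring check behind the three binary columns can be illustrated on a single value without pandas. This is a minimal sketch; the helper name `parse_entity_flags` is hypothetical and not part of the notebook.

```python
def parse_entity_flags(has_entity):
    """Turn a string such as 'ORG_YES_PRODUCT_NO_PERSON_YES_' into 0/1 flags."""
    # Each flag is 1 when the '<NAME>_YES' marker appears in the string
    return {name: int(f"{name}_YES" in has_entity)
            for name in ("ORG", "PRODUCT", "PERSON")}

print(parse_entity_flags("ORG_YES_PRODUCT_NO_PERSON_YES_"))
# {'ORG': 1, 'PRODUCT': 0, 'PERSON': 1}
```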

#### Usage:
To apply this transformation to your DataFrame, use the function as follows:
```python
df1 = split_entity_column(df)
```

In [15]:
# Transform the "has_entity" column into three binary columns for entity types
def split_entity_column(df):
    # Create binary columns based on the presence of specific entity types in the 'has_entity' column
    df['ORG'] = df['has_entity'].apply(lambda x: 'ORG_YES' in x).astype(int)
    df['PRODUCT'] = df['has_entity'].apply(lambda x: 'PRODUCT_YES' in x).astype(int)
    df['PERSON'] = df['has_entity'].apply(lambda x: 'PERSON_YES' in x).astype(int)

    # Display the first few rows to verify the newly added binary columns alongside the 'has_entity' column
    print("-" * 55)
    print("Preview of the 'has_entity' column and the new binary columns:")
    print(df[['has_entity', 'ORG', 'PRODUCT', 'PERSON']].head())
    print("-" * 55)
    print("-" * 55)
    print("Distribution of the 'has_entity' column:")
    print(df['has_entity'].value_counts())
    print("-" * 55)

    return df

'''
	- Function usage: df = split_entity_column(df)
	- Replace `df` with your original DataFrame.
'''

df1 = split_entity_column(df1)

-------------------------------------------------------
Preview of the 'has_entity' column and the new binary columns:
                        has_entity  ORG  PRODUCT  PERSON
0   ORG_YES_PRODUCT_NO_PERSON_YES_    1        0       1
1    ORG_YES_PRODUCT_NO_PERSON_NO_    1        0       0
2    ORG_YES_PRODUCT_NO_PERSON_NO_    1        0       0
3    ORG_NO_PRODUCT_YES_PERSON_NO_    0        1       0
4  ORG_YES_PRODUCT_YES_PERSON_YES_    1        1       1
-------------------------------------------------------
-------------------------------------------------------
Distribution of the 'has_entity' column:
has_entity
ORG_YES_PRODUCT_NO_PERSON_YES_     3029
ORG_NO_PRODUCT_NO_PERSON_NO_       2851
ORG_YES_PRODUCT_NO_PERSON_NO_      1462
ORG_NO_PRODUCT_NO_PERSON_YES_      1373
ORG_YES_PRODUCT_YES_PERSON_YES_     298
ORG_YES_PRODUCT_YES_PERSON_NO_      125
ORG_NO_PRODUCT_YES_PERSON_YES_       64
ORG_NO_PRODUCT_YES_PERSON_NO_        42
Name: count, dtype: int64
-----------------------------

## Text Cleaning Functions: `clean_text` and `clean_column`

These functions are designed to preprocess textual data within a DataFrame, removing unnecessary characters and whitespace, thereby standardizing the text for analysis or machine learning tasks.

#### Text Cleaning Utility: `clean_text` 
- **Purpose**: Cleans a given text string by removing punctuation, numbers, and extra spaces.
- **Implementation**:
  - Uses regular expressions (`re.sub`) to strip out non-alphabetic characters and numbers.
  - Condenses multiple whitespace characters down to a single space and trims leading/trailing whitespace.

#### Apply Text Cleaning to DataFrame Column: `clean_column`
- **Purpose**: Applies the `clean_text` function to a specified column in a DataFrame, creating a new "_cleaned" version of the column.
- **Verification**: Prints the original and cleaned version of the first text entry in the specified column for a quick comparison and verification of the cleaning process.
- **Error Handling**: Checks if the specified column exists in the DataFrame, printing an error message if it does not.
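
One consequence of the `[^a-zA-Z\s]` pattern is worth seeing on a small input: it removes accented letters as well as digits and punctuation. The sketch below mirrors the cleaning logic described for `clean_text` on a made-up sentence (the sample text is an assumption, not a row from the dataset).

```python
import re

def clean_text(text):
    # Keep only ASCII letters and whitespace
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', text).strip()

print(clean_text("Born on 2 October 1852, in  Tübingen!"))
# Born on October in Tbingen
```

If names with diacritics matter for classification, a Unicode-aware pattern or prior transliteration may be preferable; whether that helps here is untested.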

#### Usage:
To clean a text column in your DataFrame, use the `clean_column` function as follows:
```python
df1 = clean_column(df, 'column_name')
```

In [16]:
def clean_text(text):
    # Keep only ASCII letters and whitespace; this drops punctuation, digits,
    # and also accented letters (e.g. 'ü')
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def clean_column(df, column_name):
    # Ensure the column exists in the DataFrame
    if column_name in df.columns:
        unclean = df[column_name]
        df[column_name + '_cleaned'] = df[column_name].apply(clean_text)
        clean = df[column_name + '_cleaned']
        print("-"*55)
        print("Original paragraph:\n", unclean.iloc[0])
        print("\nCleaned paragraph:\n", clean.iloc[0])
        print("-"*55)
    else:
        print(f"The column '{column_name}' does not exist in the DataFrame.")
    return df

'''
	- Function usage: df = clean_column(df, 'paragraph')
	- Replace `paragraph` with your original value.
'''

df1 = clean_column(df1, 'paragraph')

-------------------------------------------------------
Original paragraph:
 Ramsay was born in Glasgow on 2 October 1852. He was a nephew of the geologist Sir Andrew Ramsay. His father, William, Sr., was a civil engineer. His mother was Catherine Robertson. He studied at Glasgow Academy, at the University of Glasgow and at University of Tubingen in Germany. 

Cleaned paragraph:
 Ramsay was born in Glasgow on October He was a nephew of the geologist Sir Andrew Ramsay His father William Sr was a civil engineer His mother was Catherine Robertson He studied at Glasgow Academy at the University of Glasgow and at University of Tubingen in Germany
-------------------------------------------------------


In [None]:
def is_english(text):
    # langdetect raises an exception on empty or undetectable text; treat that as non-English
    try:
        return detect(text) == 'en'
    except Exception:
        return False

# Apply the function to the 'paragraph_cleaned' column and store the result in a new column 'is_english'
df1['is_english'] = df1['paragraph_cleaned'].swifter.apply(is_english)
# Keep only the rows detected as English
df1 = df1[df1['is_english']].copy()
print("-" * 55)
print("Number of rows after removing non-English paragraphs:\n")
print(df1['is_english'].value_counts())
print("-" * 55)

print("Present columns:\n")
column_names = df1.columns.tolist()
for name in column_names:
    print(name)
print("-" * 55)

## Text Tokenization Function: `tokenize_column`

This function is designed to tokenize textual data within a specified column of a DataFrame, converting text into a list of tokens (words) and storing the result in a new column.

<u>Summary</u>: A **token** refers to a single unit of linguistic data. It's the result of taking a text or set of text and breaking it up into pieces such as words, keywords, phrases, symbols and other elements, which are then used as input for further processing.

### Features:
- **Column Verification**: Checks if the specified column exists in the DataFrame. If it does not, prints an error message.
- **Tokenization**: Utilizes the `word_tokenize` function from the NLTK library to break down text into individual words or tokens.
- **New Column Creation**: Appends a "_tokenized" suffix to the original column name and stores the tokenized lists in this new column.
- **Preview**: Prints the first entry from both the original (cleaned) column and the new (tokenized) column to demonstrate the tokenization effect.
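
Because the cleaned text no longer contains punctuation, a plain whitespace split gives a rough idea of what tokenization produces at this stage. The sketch below uses only the standard library and is not a replacement for `word_tokenize`, which applies additional rules for contractions and punctuation; the helper name is an assumption.

```python
def whitespace_tokenize(text):
    # str.split() with no arguments splits on any run of whitespace
    return text.split()

sample = "Ramsay was born in Glasgow on October"
print(whitespace_tokenize(sample))
# ['Ramsay', 'was', 'born', 'in', 'Glasgow', 'on', 'October']
```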

#### Usage:
To apply text tokenization to a column in your DataFrame, execute the function as follows:
```python
df1 = tokenize_column(df, 'column_name_cleaned')
```

In [17]:
# Tokenize each entry of a text column and store the tokens in a new '<column_name>_tokenized' column
def tokenize_column(df, column_name):
    # Ensure the column exists in the DataFrame
    if column_name in df.columns:
        clean = df[column_name]
        df[column_name + '_tokenized'] = df[column_name].apply(word_tokenize)
        tokenized = df[column_name + '_tokenized']
        print("-"*55)
        print("Cleaned paragraph:\n", clean.iloc[0])
        print("\nTokenized paragraph:\n", tokenized.iloc[0])
        print("-"*55)
    else:
        print(f"The column '{column_name}' does not exist in the DataFrame.")
    return df

'''
	- Function usage: df = tokenize_column(df, 'paragraph_cleaned')
	- Replace `paragraph_cleaned` with the name of your cleaned text column.
'''

df1 = tokenize_column(df1, 'paragraph_cleaned')

-------------------------------------------------------
Cleaned paragraph:
 Ramsay was born in Glasgow on October He was a nephew of the geologist Sir Andrew Ramsay His father William Sr was a civil engineer His mother was Catherine Robertson He studied at Glasgow Academy at the University of Glasgow and at University of Tubingen in Germany

Tokenized paragraph:
 ['Ramsay', 'was', 'born', 'in', 'Glasgow', 'on', 'October', 'He', 'was', 'a', 'nephew', 'of', 'the', 'geologist', 'Sir', 'Andrew', 'Ramsay', 'His', 'father', 'William', 'Sr', 'was', 'a', 'civil', 'engineer', 'His', 'mother', 'was', 'Catherine', 'Robertson', 'He', 'studied', 'at', 'Glasgow', 'Academy', 'at', 'the', 'University', 'of', 'Glasgow', 'and', 'at', 'University', 'of', 'Tubingen', 'in', 'Germany']
-------------------------------------------------------


In [9]:
label_encoder = LabelEncoder()
vectorizer = TfidfVectorizer(stop_words='english', max_features=3000)
ohecoder = OneHotEncoder(handle_unknown='ignore')  # Ignore entity combinations not seen during fitting

# Prepare the data
X = df1[['paragraph_cleaned', 'has_entity']]
y = df1['category']

# Label encoding for the target variable
y_encoded = label_encoder.fit_transform(y)

# Split the dataset into training and testing sets (fixed seed for a reproducible split)
X_train, X_test, y_train, y_test = train_test_split(X, y_encoded, test_size=0.2, stratify=y_encoded, random_state=42)

# Define a column transformer for preprocessing the features
preprocessor = ColumnTransformer(
    transformers=[
        ('tfidf', vectorizer, 'paragraph_cleaned'),
        ('ohe', ohecoder, ['has_entity'])
    ]
)

# TODO: Extended classifier choice
chosen_clf = 2  # Update this to choose a classifier

# Define the classifiers in a dictionary for easier selection
classifiers = {
    1: {'model': SVC(kernel='linear', probability=True, random_state=42), 'name': 'SVM'},
    2: {'model': MultinomialNB(), 'name': 'MultinomialNB'},
    3: {'model': LogisticRegression(random_state=42), 'name': 'LogisticRegression'},
    4: {'model': RandomForestClassifier(random_state=42), 'name': 'RandomForest'},
    5: {'model': XGBClassifier(eval_metric='mlogloss', random_state=42), 'name': 'XGBClassifier'},
    # Add more classifiers here
}

# Select the classifier based on chosen_clf
if chosen_clf in classifiers:
    classifier = classifiers[chosen_clf]['model']
    clf_name = classifiers[chosen_clf]['name']
else:
    raise ValueError("Invalid classifier choice.")

# Create a pipeline that includes preprocessing, resampling, and classifier
model_pipeline = IMBPipeline(steps=[
    ('preprocessing', preprocessor),
    ('resample', SMOTE(sampling_strategy='auto', random_state=42)), 
    ('classify', classifier)
])

# The pipeline is trained (and timed) in the next cell

In [None]:
print('Training {}...'.format(clf_name))
print("-" * 55)
start_time = time.time()

# Train the model on the training set
model_pipeline.fit(X_train, y_train)

elapsed = time.time() - start_time
print("{} classifier trained in {:.2f} seconds.".format(clf_name, elapsed))

# Predictions and Evaluation
y_pred = model_pipeline.predict(X_test)

# Decode the predictions back to original category names for interpretability
y_pred_decoded = label_encoder.inverse_transform(y_pred)
y_test_decoded = label_encoder.inverse_transform(y_test)

# Classification Report
print(classification_report(y_test_decoded, y_pred_decoded))
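
Objective 3 asks for a single scalar metric. One candidate, assumed here rather than stated by the author, is macro-averaged F1, which weights every class equally and so is less flattering than accuracy when classes are imbalanced. The sketch below computes it from scratch on toy labels to show what the number means.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example, not real model output
print(round(macro_f1(["a", "a", "b", "b"], ["a", "b", "b", "b"]), 4))
```

On the real predictions, `f1_score(y_test, y_pred, average='macro')` from `sklearn.metrics` should give the same value.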

In [None]:
# Just use X_test which already contains the required columns
X_TEST = X_test  # X_test should be a DataFrame with the necessary columns

# Use the model_pipeline to predict
predicted = model_pipeline.predict(X_TEST)

# Decoding the predicted labels back to original category names for interpretability
predicted_decoded = label_encoder.inverse_transform(predicted)

# Print the first two predictions
for doc, category in zip(X_TEST.head(2).itertuples(index=False), predicted_decoded):
    print("-"*55)
    print(f"Document: {doc.paragraph_cleaned}")
    print(f"Entity Presence: [{doc.has_entity}]\n")
    print(f"Predicted Category: '{category}'")
    print("-"*55)

# Calculate and print the accuracy
accuracy = np.mean(predicted == y_test)
print(f'Accuracy: {accuracy * 100:.2f}%')


In [None]:
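# Use the encoded labels so the curve is computed on the same targets the pipeline is trained on
train_sizes, train_scores, test_scores = learning_curve(model_pipeline, X, y_encoded, cv=5)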

# Calculate mean and standard deviation for training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Calculate mean and standard deviation for test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

# Plot learning curves
plt.plot(train_sizes, train_mean, '--', color="#111111",  label="Training score")
plt.plot(train_sizes, test_mean, color="#111111", label="Cross-validation score")

# Draw bands
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="#DDDDDD")
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="#DDDDDD")

# Create plot
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.tight_layout()
plt.show()

In [None]:
# Cross-validate the `model_pipeline` on the encoded labels
scores = cross_val_score(model_pipeline, X, y_encoded, cv=5)

print("-" * 55)
print("Cross-validation scores:")
print(f'{", ".join([f"{score:.2f}%" for score in scores * 100])}\n')
print(f'Mean cross-validation score: {np.mean(scores) * 100:.2f}%')
print("-" * 55)

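# Refit on the training split and predict on the held-out test split
model_pipeline.fit(X_train, y_train)
y_pred = model_pipeline.predict(X_test)

# Compute the confusion matrix
cm = confusion_matrix(y_test, y_pred)

# Create a DataFrame from the confusion matrix, labelled with the original category names
cm_df = pd.DataFrame(cm, index=label_encoder.classes_, columns=label_encoder.classes_)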

# Plot the confusion matrix using Seaborn
plt.figure(figsize=(10, 7))
sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
plt.title('Confusion Matrix')
plt.ylabel('Actual Labels')
plt.xlabel('Predicted Labels')
plt.show()
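
The "misclassification below 10%" criterion can be checked numerically from a confusion matrix by dividing each off-diagonal cell by its row total. The sketch below does this for a plain list-of-lists matrix; the function name and the toy matrix are assumptions, and which category pairs count as "unrelated" is left to the reader because the source does not define it.

```python
def pairwise_misclassification(cm, labels):
    """Fraction of each actual class that was predicted as each other class."""
    rates = {}
    for i, actual in enumerate(labels):
        row_total = sum(cm[i])
        for j, predicted in enumerate(labels):
            if i != j and row_total:
                rates[(actual, predicted)] = cm[i][j] / row_total
    return rates

# Toy 2x2 matrix (rows = actual, columns = predicted), not real results
toy_cm = [[8, 2],
          [1, 9]]
for pair, rate in pairwise_misclassification(toy_cm, ["a", "b"]).items():
    print(pair, f"{rate:.0%}")
```

With the notebook's objects, `cm.tolist()` and `list(label_encoder.classes_)` could be passed in directly.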

In [18]:
# Save classifiers, transformers and encoders in a single object
model_to_save = {
    'model': model_pipeline,
    'encoder': label_encoder,
    'ohecoder': ohecoder,
    'vectorizer': vectorizer
}

# Define the name and path for the model file
model_extn = '.pkl' # Extension for the model file
model_name = f'{task_name}{clf_name}{model_extn}' # Name of the model file
dir_name = '../Models/' # Directory to save the model
path_name = os.path.join(dir_name, model_name) # Complete path to save the model

# Ensure that the 'Models' directory exists before trying to save the file
os.makedirs(dir_name, exist_ok=True)

# Save the model to the specified path
with open(path_name, 'wb') as file:
    pickle.dump(model_to_save, file)

# Notify the user about the successful saving of the model
print("-" * 55)
print(f"Model saved successfully to '{path_name}' file.")
print("-" * 55)

-------------------------------------------------------
Model saved successfully to '../Models/Task-1MultinomialNB.pkl' file.
-------------------------------------------------------
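
Because the saved file bundles four objects in one dictionary, it can help to check that all expected keys are present when loading it back. The following is a minimal sketch of such a check. The helper `load_bundle`, the constant `REQUIRED_KEYS` and the placeholder strings are hypothetical; the demo pickles plain strings into a temporary directory rather than the real model objects.

```python
import os
import pickle
import tempfile

# Keys used in the dictionary saved above
REQUIRED_KEYS = ("model", "encoder", "ohecoder", "vectorizer")


def load_bundle(path, required_keys=REQUIRED_KEYS):
    """Load a pickled dictionary and fail early if an expected key is missing."""
    with open(path, "rb") as fh:
        bundle = pickle.load(fh)
    missing = [key for key in required_keys if key not in bundle]
    if missing:
        raise KeyError(f"Saved bundle is missing keys: {missing}")
    return bundle


# Round-trip demo with placeholder values instead of fitted estimators
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_bundle.pkl")
    with open(demo_path, "wb") as fh:
        pickle.dump({key: f"placeholder-{key}" for key in REQUIRED_KEYS}, fh)
    demo_bundle = load_bundle(demo_path)

print(sorted(demo_bundle))
```

Note that unpickling executes code from the file, so this kind of loader should only be pointed at files produced by a trusted source, such as this notebook.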


In [None]:
# Load the saved model from the file
loaded_model_path = path_name
with open(loaded_model_path, 'rb') as file:
    loaded_file = pickle.load(file)

loaded_model = loaded_file['model']
loaded_encoder = loaded_file['encoder']
loaded_ohecoder = loaded_file['ohecoder']
loaded_vectorizer = loaded_file['vectorizer']

# Load the dataset
df = load_t2_df('../Datasets/dataset.csv')

# Clean the dataset
df = clean_column(df, 'paragraph')

# Select rows whose 'category' is missing but whose 'has_entity' value is available
missing_category_mask = df['category'].isnull() & (df['has_entity'] != 'data missing')

# Prepare data for prediction (only 'paragraph' and 'has_entity' columns)
data_to_predict = df.loc[missing_category_mask, ['paragraph_cleaned', 'has_entity']]

# Predict missing categories
# Use the model restored from disk, not the in-memory pipeline
predicted_categories = loaded_model.predict(data_to_predict)

decoded_predictions = loaded_encoder.inverse_transform(predicted_categories)

# Fill missing values in the original dataframe
df.loc[missing_category_mask, 'category'] = decoded_predictions
df.drop(columns=['paragraph_cleaned', 'index'], inplace=True)

df.to_csv('../Datasets/task1_dataset.csv', index=False)

# Count how many categories were predicted and filled
predicted_count = df.loc[missing_category_mask, 'category'].notnull().sum()

print(f"Total categories predicted and filled: {predicted_count}")

In [19]:
# Load the saved model from the file
with open('../Models/Task-1MultinomialNB.pkl', 'rb') as file:
    loaded_file  = pickle.load(file)

loaded_model = loaded_file['model']
loaded_encoder = loaded_file['encoder']
loaded_ohecoder = loaded_file['ohecoder']
loaded_vectorizer = loaded_file['vectorizer']

# Load the dataset
df = load_t2_df('../Datasets/dataset.csv')

# Clean the dataset
df = clean_column(df, 'paragraph')

# Select rows whose 'category' is missing but whose 'has_entity' value is available
missing_category_mask = df['category'].isnull() & (df['has_entity'] != 'data missing')

# Prepare data for prediction (only 'paragraph' and 'has_entity' columns)
data_to_predict = df.loc[missing_category_mask, ['paragraph_cleaned', 'has_entity']]

# Predict missing categories
predicted_categories = loaded_model.predict(data_to_predict)

decoded_predictions = loaded_encoder.inverse_transform(predicted_categories)

# Fill missing values in the original dataframe
df.loc[missing_category_mask, 'category'] = decoded_predictions
df.drop(columns=['paragraph_cleaned', 'index'], inplace=True)

df.to_csv('../Datasets/task1_dataset.csv', index=False)

# Count how many categories were predicted and filled
predicted_count = df.loc[missing_category_mask, 'category'].notnull().sum()

print(f"Total categories predicted and filled: {predicted_count}")

9347 rows and 9 columns loaded successfully for Task-2, including newly added 'index' column.
-------------------------------------------------------
Original paragraph:
 Ramsay was born in Glasgow on 2 October 1852. He was a nephew of the geologist Sir Andrew Ramsay. His father, William, Sr., was a civil engineer. His mother was Catherine Robertson. He studied at Glasgow Academy, at the University of Glasgow and at University of Tubingen in Germany. 

Cleaned paragraph:
 Ramsay was born in Glasgow on October He was a nephew of the geologist Sir Andrew Ramsay His father William Sr was a civil engineer His mother was Catherine Robertson He studied at Glasgow Academy at the University of Glasgow and at University of Tubingen in Germany
-------------------------------------------------------
Total categories predicted and filled: 61


---
# Task 2: Text Clarity Classification