# Text Classification with NLP and Machine Learning Libraries

This Jupyter notebook walks through a text classification project built on Natural Language Processing (NLP) and Machine Learning (ML) libraries. The code imports the necessary libraries, cleans and preprocesses the data, and then trains, evaluates, and saves a classification model. The key components are described below:

## Importing Libraries and Initializing NLTK Resources

### Overview
This section focuses on setting up the environment by importing necessary libraries for NLP and machine learning tasks, specifically for text classification. It also includes a custom function to download NLTK resources if they're not already present, ensuring all dependencies are satisfied before proceeding with data processing and modeling.

### Key Points
- Libraries such as `numpy`, `pandas`, `matplotlib`, `re` (regular expressions), `string`, `nltk` (Natural Language Toolkit), `sklearn` (scikit-learn), and `imblearn` (imbalanced-learn) are imported for data manipulation, visualization, text processing, and handling class imbalance.
- The custom function `download_nltk_resources` automates the downloading of essential NLTK resources like 'punkt' (for tokenizing), 'stopwords', and 'wordnet'.
- Execution of `download_nltk_resources` ensures necessary resources are available for text processing.

In [3]:
# Imports necessary NLP and ML libraries for text classification
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re
import string
import nltk

# Tokenization, resampling, pipelines, feature extraction, label encoding, and the classifier
from nltk.tokenize import word_tokenize
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline as IMBPipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB

# Function to download necessary NLTK resources
def download_nltk_resources():
    # Map each resource to the nltk_data subdirectory where it is stored;
    # only 'punkt' lives under 'tokenizers', the other two are corpora
    resources = {
        'punkt': 'tokenizers/punkt',
        'stopwords': 'corpora/stopwords',
        'wordnet': 'corpora/wordnet',
    }
    for resource, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            nltk.download(resource)
# This code is reused from my previous college project, where it was called Lazy Loading NLTK Resources.

# Call the function to download resources if not already present
download_nltk_resources()

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/codespace/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## Loading and Initial Data Check

### Overview
This segment deals with loading the dataset from a CSV file, followed by a basic check to confirm successful loading. It employs pandas for reading data and provides a quick overview of the dataset's size.

### Key Points
- Data is loaded from 'dataset.csv' using specific columns of interest.
- Immediate feedback on the dataset's dimension is provided to confirm successful data loading.

In [4]:
df = pd.read_csv(
    'dataset.csv',
    usecols=['par_id', 'paragraph', 'has_entity', 'lexicon_count',
             'difficult_words', 'last_editor_gender', 'category'],
)
size = df.shape
if not df.empty:
    print(size, "rows and columns loaded successfully.")


(9347, 7) rows and columns loaded successfully.


## Data Cleaning: Handling Missing Values

### Overview
Focusing on data quality, this part involves checking for and removing any rows with missing values, ensuring the dataset's integrity for subsequent analysis.

### Key Points
- Calculates and prints the count of missing values per column.
- Rows with missing values are removed to maintain data cleanliness.

In [None]:
size = df.shape
print('Checking for missing values from', size)
missing_values = df.isnull().sum()
print(missing_values, '\n')

# Remove rows with missing values
df = df.dropna()

size = df.shape
print('Removed missing values, Now it is', size)
missing_values = df.isnull().sum()
print(missing_values)

## Standardizing Text Data

### Overview
Standardization targets the 'category' and 'has_entity' columns. Category names are converted to lowercase so that differently cased spellings collapse into one label, and rows whose 'has_entity' value is the placeholder 'data missing' are removed.

### Key Points
- Unique values in 'category' and 'has_entity' columns are inspected.
- Text in the 'category' column is converted to lowercase to unify case usage.
- Rows with 'data missing' in the 'has_entity' column are identified and excluded.

In [6]:
print('Checking for unique values in the category column')
category_counts = df['category'].unique()
print(category_counts)
# Convert the 'category' column to lowercase
df['category'] = df['category'].str.lower()

print('\n' + 'Fixed the case of the category column')
category_counts = df['category'].unique()
print(category_counts)

Checking for unique values in the category column
['biographies' 'artificial intelligence' 'programming' 'philosophy'
 'movies about artificial intelligence' 'Philosophy' nan 'Programming'
 'Artificial intelligence' 'Biographies'
 'Movies about artificial intelligence']

Fixed the case of the category column
['biographies' 'artificial intelligence' 'programming' 'philosophy'
 'movies about artificial intelligence' nan]


In [7]:
print('Checking for unique values in the has_entity column')
entity_counts = df['has_entity'].unique()
print(entity_counts)

# Removing rows with 'data missing' in the 'has_entity' column
df = df[df['has_entity'] != 'data missing']

print('\n' + 'Removed rows with "data missing" in the has_entity column')
entity_counts = df['has_entity'].unique()
print(entity_counts)

Checking for unique values in the has_entity column
['ORG_YES_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_NO_PERSON_NO_'
 'ORG_NO_PRODUCT_YES_PERSON_NO_' 'ORG_YES_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_NO_' 'ORG_NO_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_YES_PERSON_NO_'
 'data missing']

Removed rows with "data missing" in the has_entity column
['ORG_YES_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_NO_PERSON_NO_'
 'ORG_NO_PRODUCT_YES_PERSON_NO_' 'ORG_YES_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_NO_' 'ORG_NO_PRODUCT_YES_PERSON_YES_'
 'ORG_NO_PRODUCT_NO_PERSON_YES_' 'ORG_YES_PRODUCT_YES_PERSON_NO_']


## Text Preprocessing for NLP

### Overview
Text data is cleaned by removing punctuation, numbers, and extra whitespace to prepare it for NLP tasks. The cleaned text is stored in a new column, so the original paragraphs remain untouched.

### Key Points
- The `clean_text` function is defined and applied to the 'paragraph' column, performing text cleaning operations.
- Cleaned text is stored in a new column 'cleaned_paragraph', with an example shown for comparison.

In [4]:
def clean_text(text):
    # Keep only ASCII letters and whitespace (drops punctuation, digits, and symbols)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Remove any remaining digits (a no-op after the step above, kept as a safeguard)
    text = re.sub(r'\d+', '', text)
    # Collapse runs of whitespace and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply the 'clean_text' function to the 'paragraph' column
unclean = df['paragraph']
df['cleaned_paragraph'] = df['paragraph'].apply(clean_text)
clean = df['cleaned_paragraph']
# Use positional indexing: the row labelled 0 may be gone after rows were dropped
print("Original paragraph:\n", unclean.iloc[0])
print("\nCleaned paragraph:\n", clean.iloc[0])

Original paragraph:
 Ramsay was born in Glasgow on 2 October 1852. He was a nephew of the geologist Sir Andrew Ramsay. His father, William, Sr., was a civil engineer. His mother was Catherine Robertson. He studied at Glasgow Academy, at the University of Glasgow and at University of Tubingen in Germany. 

Cleaned paragraph:
 Ramsay was born in Glasgow on October He was a nephew of the geologist Sir Andrew Ramsay His father William Sr was a civil engineer His mother was Catherine Robertson He studied at Glasgow Academy at the University of Glasgow and at University of Tubingen in Germany
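
Because the first pattern in `clean_text` already strips every character that is not an ASCII letter or whitespace, digits included, the cleaning can be condensed to two substitutions. The following is a minimal sketch under that observation; the name `clean_text_compact` and the precompiled pattern names are assumptions introduced here, not part of the notebook.

```python
import re

# Hypothetical compact variant of clean_text (names are assumptions)
_NON_LETTERS = re.compile(r'[^a-zA-Z\s]')  # punctuation, digits, symbols
_WHITESPACE = re.compile(r'\s+')


def clean_text_compact(text):
    # Drop everything that is not an ASCII letter or whitespace
    text = _NON_LETTERS.sub('', text)
    # Collapse runs of whitespace and trim the ends
    return _WHITESPACE.sub(' ', text).strip()


sample = "Ramsay was born in Glasgow on 2 October 1852."
print(clean_text_compact(sample))
```

Note that the ASCII-only character class also removes accented letters, so a word such as "Tübingen" would lose its "ü". Whether that matters depends on the dataset.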


## Balancing Data and Feature Engineering

### Overview
This section addresses class imbalance through resampling and transforms the text into numerical features using TF-IDF vectorization. It also visualizes the category distribution before and after resampling.

### Key Points
- Imbalance in the category distribution is identified, and resampling (SMOTE for oversampling and RandomUnderSampler for undersampling) is applied within a pipeline.
- The TF-IDF vectorizer is used to convert text data into a matrix of TF-IDF features.
- Category distributions are compared visually using pie charts, showcasing the effect of resampling.

In [None]:
# Displaying the imbalanced category distribution
print("Imbalanced Category Distribution:")
print(df['category'].value_counts())

# Encode the labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(df['category'])
X_text = df['cleaned_paragraph']  # Cleaned text produced by clean_text in the preprocessing step

# Vectorizing the text data
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X_vectorized = tfidf_vectorizer.fit_transform(X_text)

# Setting up SMOTE and under-sampling within a pipeline
resampling_pipeline = IMBPipeline([
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('under', RandomUnderSampler(random_state=42))
])

X_resampled, y_resampled = resampling_pipeline.fit_resample(X_vectorized, y_encoded)

# Decode the labels back to original category names
y_resampled_decoded = encoder.inverse_transform(y_resampled)

# Create a DataFrame to display the before and after category distribution
resampled_df = pd.DataFrame(y_resampled_decoded, columns=['balanced_category'])

print("\nBalanced Category Distribution:")
print(resampled_df['balanced_category'].value_counts())

# Count the occurrences of each category in the original and resampled data
original_category_counts = df['category'].value_counts()
resampled_category_counts = resampled_df['balanced_category'].value_counts()

# Create a figure with two subplots
plt.figure(figsize=(16, 8))

# Create the first pie chart for original data
plt.subplot(1, 2, 1)
plt.pie(original_category_counts.values, labels=original_category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Categories in Original Data')

# Create the second pie chart for resampled data
plt.subplot(1, 2, 2)
plt.pie(resampled_category_counts.values, labels=resampled_category_counts.index, autopct='%1.1f%%')
plt.title('Distribution of Categories in Resampled Data')

plt.show()
df.head(3)
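
To make the oversampling step less of a black box, the sketch below illustrates the core idea behind SMOTE in plain Python: a synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified illustration, not imbalanced-learn's implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import random


def smote_like_sample(minority, k=2, seed=42):
    """Create one synthetic point by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to base
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]


# Toy 2-D feature vectors standing in for rows of the TF-IDF matrix
minority_points = [[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]]
synthetic = smote_like_sample(minority_points)
print(synthetic)
```

With `sampling_strategy='auto'`, SMOTE is expected to raise every class except the largest to the size of the largest, so the `RandomUnderSampler` step that follows should have little or nothing left to remove.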

## Tokenization and Feature Expansion

### Overview
Tokenization converts cleaned paragraphs into lists of tokens. Additionally, binary features are derived from the 'has_entity' column to indicate the presence of specific entity types.

### Key Points
- The 'tokenized_paragraph' column is created by applying word tokenization to the cleaned text.
- Binary columns ('ORG', 'PRODUCT', 'PERSON') are introduced based on the 'has_entity' attribute, expanding the feature set.

In [None]:
# Tokenize each paragraph and store the tokens in a new column 'tokenized_paragraph'
df['tokenized_paragraph'] = df['cleaned_paragraph'].apply(word_tokenize)

# Display the first few rows to verify the tokenization
print(df[['cleaned_paragraph', 'tokenized_paragraph']].head(2))

The goal is to classify text paragraphs into specific topics based on their content and on any mentioned entities such as persons, organizations, or products. The model should consider not just the textual content but also whether the paragraph mentions specific entities, which may improve its prediction accuracy.

In [5]:
# Splitting the "has_entity" column into three separate binary columns
df['ORG'] = df['has_entity'].str.contains('ORG_YES').astype(int)
df['PRODUCT'] = df['has_entity'].str.contains('PRODUCT_YES').astype(int)
df['PERSON'] = df['has_entity'].str.contains('PERSON_YES').astype(int)

# Displaying the first few rows to verify the changes
print(df[['has_entity', 'ORG', 'PRODUCT', 'PERSON']].head(5))

                        has_entity  ORG  PRODUCT  PERSON
0   ORG_YES_PRODUCT_NO_PERSON_YES_    1        0       1
1    ORG_YES_PRODUCT_NO_PERSON_NO_    1        0       0
2    ORG_YES_PRODUCT_NO_PERSON_NO_    1        0       0
3    ORG_NO_PRODUCT_YES_PERSON_NO_    0        1       0
4  ORG_YES_PRODUCT_YES_PERSON_YES_    1        1       1
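
The substring checks above work because every 'has_entity' value follows the same `NAME_YES_` / `NAME_NO_` layout. As an alternative view of that layout, here is a small sketch that parses the string generically; `parse_entity_flags` is a hypothetical helper, not part of the notebook.

```python
import re


def parse_entity_flags(value):
    """Turn 'ORG_YES_PRODUCT_NO_PERSON_YES_' into a dict of 0/1 flags."""
    # Each flag is an upper-case name followed by _YES_ or _NO_
    pairs = re.findall(r'([A-Z]+)_(YES|NO)_', value)
    return {name: int(state == 'YES') for name, state in pairs}


print(parse_entity_flags('ORG_YES_PRODUCT_NO_PERSON_YES_'))
# A malformed value such as the 'data missing' placeholder yields an empty dict
print(parse_entity_flags('data missing'))
```

Parsing this way makes malformed values visible as empty dictionaries, whereas a substring check silently maps them to 0.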


## Data Preparation for Modeling

### Overview
Data is prepared for the modeling stage, involving label encoding and splitting into training and testing sets. This setup is crucial for training and evaluating the model's performance.

### Key Points
- The 'category' labels are encoded into numerical format using LabelEncoder.
- The dataset is split into training and test sets, ensuring stratification based on the category labels.

In [None]:
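from sklearn.model_selection import train_test_split

# Candidate features; the pipeline below only uses 'tokenized_paragraph'
X = df[['tokenized_paragraph', 'has_entity', 'ORG', 'PRODUCT', 'PERSON']]

# Encode category names as integer labels
encoder = LabelEncoder()
y = encoder.fit_transform(df['category'])

# Hold out 20% of the rows, preserving class proportions in both splits
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

# Map each integer label back to its category name
v = dict(enumerate(encoder.classes_))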
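
Stratification means each class is split in the same ratio, so rare categories are not lost from either split. The sketch below shows the idea in plain Python; `stratified_split` and the toy labels are hypothetical, and scikit-learn's own allocation may differ in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    rng = random.Random(seed)
    # Group row positions by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels: a 2:1 class ratio that should survive the split
toy_labels = ['philosophy'] * 10 + ['programming'] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

The fixed seed plays the same role as the `random_state` argument of `train_test_split`: passing one makes the split reproducible across runs.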

## Model Training: Naive Bayes Classifier

### Overview
A text classification pipeline is constructed using a Naive Bayes classifier. This pipeline integrates CountVectorizer and TfidfTransformer for text feature extraction and transformation before model fitting.

### Key Points
- The pipeline (`text_clf`) includes steps for converting text into a matrix of token counts, transforming these counts with TF-IDF, and classifying using MultinomialNB.

In [None]:
text_clf = Pipeline([
    ('vect', CountVectorizer(analyzer="word", stop_words="english")),
    ('tfidf', TfidfTransformer(use_idf=True)), 
    ('clf', MultinomialNB(alpha=.01)),
])
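
The first two pipeline steps turn raw token counts into TF-IDF weights. The sketch below reproduces that computation in plain Python, assuming the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which to my understanding matches scikit-learn's defaults. `tfidf_rows` and the toy documents are hypothetical, and stop-word removal is left out.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return the vocabulary and one L2-normalised TF-IDF row per document."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenised for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(tok for toks in tokenised for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenised:
        counts = Counter(toks)
        row = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return vocab, rows


vocab, rows = tfidf_rows(["robots learn", "robots think", "films entertain"])
print(vocab)
print([round(w, 3) for w in rows[0]])
```

In the first toy document, "learn" appears in one document and "robots" in two, so "learn" receives the larger weight: terms that are rarer across the corpus count for more.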

In [None]:
# Convert the list of tokens back into a single string per document
x_train_processed = x_train['tokenized_paragraph'].apply(' '.join)

# Now, fit your model
text_clf.fit(x_train_processed, y_train)

In [None]:
X_TEST = x_test['tokenized_paragraph'].to_list()
Y_TEST = list(y_test)

In [None]:
# Convert the list to a pandas Series
X_TEST_series = pd.Series(X_TEST)

# Apply the lambda function to process the data
X_TEST_processed = X_TEST_series.apply(lambda tokens: ' '.join(tokens))

# Now, use this processed data for prediction
predicted = text_clf.predict(X_TEST_processed)

### Checking the test predictions
This code prints the first two documents from the test set (`X_TEST`) along with their predicted categories (`predicted`), mapped back to category names through `v`.

In [None]:
# Print the first two test documents with their predicted category names
for doc, category in list(zip(X_TEST, predicted))[:2]:
    print("-" * 55)
    print(doc)
    print(v[category])
    print("-" * 55)

In [None]:
# Accuracy: fraction of test documents whose predicted label matches the true label
np.mean(predicted == y_test)
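
Overall accuracy can hide weak performance on small categories, which is relevant here because the original class distribution is imbalanced. The sketch below computes per-class precision and recall in plain Python; `per_class_report` and the toy labels are hypothetical stand-ins for `y_test` and `predicted`.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall)} computed from two label sequences."""
    tp, pred_count, true_count = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        true_count[truth] += 1
        pred_count[guess] += 1
        if truth == guess:
            tp[truth] += 1
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        report[label] = (precision, recall)
    return report


# Toy labels standing in for y_test and predicted
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]
accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print(accuracy)
print(per_class_report(toy_true, toy_pred))
```

In this toy case class 2 is never predicted correctly even though two thirds of all predictions are right. For the real data, `sklearn.metrics.classification_report` provides a comparable per-class summary.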

## Model Evaluation and Persistence

### Overview
This final section follows the accuracy check on the test set by saving the trained model to disk with pickle, reloading it, and using the reloaded model to classify a new document.

### Key Points
- Test data is preprocessed to match the training format before making predictions.
- Model's performance is evaluated by comparing predictions against the true test labels.
- The trained model is saved to a file (`model_cat_t1.pkl`) for future use or deployment.

In [None]:
import pickle

model_extn = '.pkl'
model_name = 'model_cat_t1' + model_extn

# Serialize the fitted pipeline to disk
with open(model_name, 'wb') as f:
    pickle.dump(text_clf, f)

In [None]:
# Load the saved pipeline back from disk
with open(model_name, 'rb') as f:
    clf2 = pickle.load(f)

In [None]:
docs_new = ["History will place an asterisk next to A.I. as the film Stanley Kubrick might have directed. But let the record also show that Kubrick--after developing this project for some 15 years--wanted Steven Spielberg to helm this astonishing sci-fi rendition of Pinocchio, claiming (with good reason) that it veered closer to Spielberg's kinder, gentler sensibilities."]
predicted = clf2.predict(docs_new)

In [None]:
v[predicted[0]]
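
The lookup above relies on `v`, which exists only in the current session and is not stored in the pickle file. One possible design, sketched below, is to save the model and the label mapping together so that a fresh session can decode predictions. The stand-in objects and the bundle keys are assumptions for illustration, and an in-memory buffer replaces the `.pkl` file.

```python
import io
import pickle

# Stand-ins for the fitted pipeline and the id-to-name mapping (hypothetical values)
stand_in_model = {"alpha": 0.01}
label_names = {0: "artificial intelligence", 1: "biographies"}

# Bundle everything needed at prediction time into one object
bundle = {"model": stand_in_model, "labels": label_names}

# An in-memory buffer replaces the .pkl file to keep the sketch self-contained
buffer = io.BytesIO()
pickle.dump(bundle, buffer)
buffer.seek(0)

restored = pickle.load(buffer)
print(restored["labels"][1])
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.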