**Mateo Alexander**

**Homework 4**

**Natural Language Processing: QMSS 5067**

**Professor Patrick Houlihan**

**Homework Due 12/02/2024**

In [2]:
"""
Document Classification Case Study: TechCorp Document Management System

Background
TechCorp is a growing technology company with thousands of documents scattered across their systems. These documents include legal contracts, marketing materials, and engineering specifications. Currently, employees manually categorize these documents, which is time-consuming and prone to errors. The company wants to implement an automated solution using machine learning to classify documents into three categories: Legal, Marketing, and Engineering.

Available Data
TechCorp has provided you with:
- A training dataset, hw4.pk, of pre-labeled documents, all in .txt form:
    - legal documents (contracts, agreements, compliance reports)
    - marketing documents (brochures, campaign materials, social media content)
    - engineering documents (technical specifications, code documentation, design docs)
- Each document is provided in text format
- Labels for each document in the training set

As a data science student, your task is to:
- Provide code that implements an automated end-to-end solution to this problem.
"""

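One way to read "end-to-end" in the task above is a single object that accepts raw text and returns a category. The following is a minimal sketch of that idea, assuming scikit-learn is available; the toy corpus and the choice of logistic regression are illustrative assumptions made here, not part of the assignment data or the models used later in this notebook. Because the vectorizer sits inside the pipeline, it is refit on each training fold during cross-validation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; the real documents live in hw4.pk.
docs = [
    "this agreement is governed by the laws of the state",
    "the parties agree to the terms of this contract",
    "the licensee shall indemnify the licensor against claims",
    "termination clause and liability of the contracting parties",
    "boost your brand with our new social media campaign",
    "limited time offer on our latest product launch",
    "engage customers with targeted email marketing content",
    "our brochure highlights the campaign and brand story",
    "the api endpoint returns a json response with status codes",
    "system architecture and database schema design document",
    "functional requirements and technical specification for the module",
    "the service exposes a rest api documented in the design spec",
]
labels = ["Legal"] * 4 + ["Marketing"] * 4 + ["Engineering"] * 4

# Raw text in, category out: vectorizer and classifier travel together.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep every class present in each training split.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, docs, labels, cv=cv)
print("Fold accuracies:", scores)

# Fit on everything, then classify a new, unseen document.
pipeline.fit(docs, labels)
predictions = pipeline.predict(["the contract terms bind both parties"])
print("Predicted category:", predictions[0])
```

With a corpus this small the fold accuracies are not meaningful; the point is only the shape of the workflow.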

In [199]:
import os
import pickle
import re

import nltk
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.utils import resample
from xgboost import XGBClassifier

In [None]:
nltk.download('wordnet')
nltk.download('stopwords')

In [11]:
file_path_data = os.path.abspath('hw4.pk')
print(file_path_data)

/home/37b4d573-02a2-4c75-b69b-8f21f8c5d212/Natural Language Processing/hw4.pk
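
The absolute path printed above is specific to one machine. A minimal sketch of a loader that resolves the path at run time and fails early when the file is missing; `load_pickle` and the throwaway demo file are hypothetical names introduced here for illustration, not part of the assignment.

```python
import pickle
import tempfile
from pathlib import Path


def load_pickle(path):
    """Load a pickled object, raising a clear error if the file is missing."""
    path = Path(path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found: {path}")
    with path.open("rb") as handle:
        # Only unpickle files from a trusted source: pickle can run code on load.
        return pickle.load(handle)


# Round-trip demo with a temporary file standing in for the dataset.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.pk"
    demo_payload = {"body": ["example text"], "label": ["legal_contract_examples"]}
    demo_path.write_bytes(pickle.dumps(demo_payload))
    loaded = load_pickle(demo_path)

print(loaded["label"])
```

In this notebook the equivalent call would be `load_pickle('hw4.pk')`, run from the directory that holds the file.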


In [51]:
# Load the data from the .pk file using the absolute path
file_path = '/home/37b4d573-02a2-4c75-b69b-8f21f8c5d212/Natural Language Processing/hw4.pk'
with open(file_path, 'rb') as file:
    data = pickle.load(file)

# Displaying the columns and the first few rows to inspect the DataFrame
print("Columns:", data.columns)
display(data.head(200))

Columns: Index(['body', 'label'], dtype='object')


| | body | label |
|---|---|---|
| 0 | We use essential cookies to make Venngage wor... | legal_contract_examples |
| 1 | A legal contract is a written document that is... | legal_contract_examples |
| 2 | November 27 2023 14 min Author Olga Asheychik... | legal_contract_examples |
| 3 | Accelerate contracts with AI native workflows ... | legal_contract_examples |
| 4 | Create smarter agreements commit to them more ... | legal_contract_examples |
| ... | ... | ... |
| 195 | In software development almost everyone you wo... | engineering_specification_examples |
| 196 | You have been blocked If you believe this in e... | engineering_specification_examples |
| 197 | Denis Sheremetov CTO at Onix Mila Slesar Write... | engineering_specification_examples |
| 198 | Engineering Team June 6 2024 18min read The b... | engineering_specification_examples |
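
The `label` column holds raw names such as `legal_contract_examples`, which the next cell maps to the three target categories with `Series.map`. Since `map` silently produces `NaN` for any label missing from the dictionary, it can help to check the mapping and the class counts first. A dependency-free sketch of that check; `summarize_labels` and the sample label list (including the unmapped `press_release_examples`) are hypothetical and exist only for illustration.

```python
from collections import Counter

# Same raw-label-to-category mapping that the classification cell uses.
LABEL_TO_CATEGORY = {
    "legal_contract_examples": "Legal",
    "marketing_material_examples": "Marketing",
    "engineering_specification_examples": "Engineering",
}


def summarize_labels(raw_labels):
    """Return (documents per category, set of raw labels with no mapping)."""
    raw_labels = list(raw_labels)
    unmapped = {label for label in raw_labels if label not in LABEL_TO_CATEGORY}
    counts = Counter(
        LABEL_TO_CATEGORY[label] for label in raw_labels if label in LABEL_TO_CATEGORY
    )
    return counts, unmapped


# Hypothetical labels; the last one is deliberately absent from the mapping.
sample_labels = [
    "legal_contract_examples",
    "legal_contract_examples",
    "marketing_material_examples",
    "engineering_specification_examples",
    "press_release_examples",
]
counts, unmapped = summarize_labels(sample_labels)
print("Documents per category:", dict(counts))
print("Labels without a mapping:", unmapped)
```

On the real data the same function could be called as `summarize_labels(data['label'])`, because iterating over a pandas Series yields its values.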


In [195]:
# FINAL CODE: Classification Tool

# Load the dataset
file_path = '/home/37b4d573-02a2-4c75-b69b-8f21f8c5d212/Natural Language Processing/hw4.pk'
with open(file_path, 'rb') as file:
    data = pickle.load(file)

# Adding a new column 'category' based on the value of the 'label' column
data['category'] = data['label'].map({
    'legal_contract_examples': 'Legal',
    'marketing_material_examples': 'Marketing',
    'engineering_specification_examples': 'Engineering'
})

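# Create a new DataFrame with only 'category' and 'body' columns for data analysis
# (.copy() makes it independent of `data`, avoiding SettingWithCopyWarning when 'body' is overwritten below)
analysis_df = data[['category', 'body']].copy()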

# Build the stopword set and the lemmatizer once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
LEMMATIZER = WordNetLemmatizer()

# Text Preprocessing Function
def preprocess_text(text):
    # Replace every non-letter character (digits, punctuation) with a space
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize on whitespace and remove stopwords
    words = [word for word in text.split() if word not in STOP_WORDS]
    # Lemmatize words
    words = [LEMMATIZER.lemmatize(word) for word in words]
    return ' '.join(words)

# Apply preprocessing to the 'body' column
analysis_df.loc[:, 'body'] = analysis_df['body'].apply(preprocess_text)

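# Balance the dataset by drawing the same number of documents from each category, without replacement
# (n_samples is 20% of the full dataset size, not 20% of each category)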
legal_sample = resample(analysis_df[analysis_df['category'] == 'Legal'],
                        replace=False, n_samples=int(len(analysis_df) * 0.2), random_state=42)
marketing_sample = resample(analysis_df[analysis_df['category'] == 'Marketing'],
                            replace=False, n_samples=int(len(analysis_df) * 0.2), random_state=42)
engineering_sample = resample(analysis_df[analysis_df['category'] == 'Engineering'],
                              replace=False, n_samples=int(len(analysis_df) * 0.2), random_state=42)

# Concatenate the samples to create a balanced dataset
balanced_df = pd.concat([legal_sample, marketing_sample, engineering_sample])

# Splitting the balanced dataset into training and test sets
X = balanced_df['body']
y = balanced_df['category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Convert text data to numerical data using TF-IDF Vectorizer with n-grams
tfidf = TfidfVectorizer(stop_words='english', max_df=0.85, ngram_range=(1, 3), max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

# Apply SMOTE to the training split only, to even out class counts left uneven by the random split
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train_tfidf, y_train)

# Encode labels for XGBoost
label_encoder = LabelEncoder()
y_train_smote_encoded = label_encoder.fit_transform(y_train_smote)
y_test_encoded = label_encoder.transform(y_test)

# Feature Engineering: document length (word count) as an extra feature
X_train_len = X_train.apply(lambda x: len(x.split())).values.reshape(-1, 1)
X_test_len = X_test.apply(lambda x: len(x.split())).values.reshape(-1, 1)

# Combine TF-IDF features with document length for training and testing sets
# (note: these combined matrices are not used by the models below, which are fit on the PCA features)
X_train_with_features = np.hstack((X_train_tfidf.toarray(), X_train_len))
X_test_with_features = np.hstack((X_test_tfidf.toarray(), X_test_len))

# Apply PCA for dimensionality reduction, keeping enough components to explain 95% of the variance
pca = PCA(n_components=0.95, random_state=42)
X_train_pca = pca.fit_transform(X_train_smote.toarray())
X_test_pca = pca.transform(X_test_tfidf.toarray())

# Hyperparameter search space for each classifier to evaluate
param_grid = {
    'RandomForestClassifier': {
        'n_estimators': [100, 200, 500],
        'max_depth': [10, 20, None],
        'max_features': ['sqrt', 'log2', None]
    },
    'GradientBoostingClassifier': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1, 0.5],
        'max_depth': [3, 5, 7]
    },
    'XGBClassifier': {
        'n_estimators': [100, 200],
        'learning_rate': [0.01, 0.1, 0.5],
        'max_depth': [3, 5, 7]
    }
}

classifiers = {
    'RandomForestClassifier': RandomForestClassifier(),
    'GradientBoostingClassifier': GradientBoostingClassifier(),
    'XGBClassifier': XGBClassifier(eval_metric='mlogloss')
}

# Evaluating performance for RandomForest, GradientBoosting, and XGBoost using RandomizedSearchCV
for name, clf in classifiers.items():
    print(f"\nEvaluating {name} with Hyperparameter Tuning...")
    randomized_search = RandomizedSearchCV(clf, param_grid[name], n_iter=10, cv=3, n_jobs=-1, verbose=2, random_state=42)
    if name == 'XGBClassifier':
        randomized_search.fit(X_train_pca, y_train_smote_encoded)
    else:
        randomized_search.fit(X_train_pca, y_train_smote)
    best_clf = randomized_search.best_estimator_
    y_pred = best_clf.predict(X_test_pca)
    y_pred = label_encoder.inverse_transform(y_pred) if name == 'XGBClassifier' else y_pred
    print("Best Parameters:", randomized_search.best_params_)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    print("Accuracy Score:", accuracy_score(y_test, y_pred))

# Ensemble Voting Classifier using the top-performing models
print("\nEvaluating Voting Classifier with RandomForest, GradientBoosting, and XGBoost...")
voting_clf = VotingClassifier(estimators=[
    ('rf', RandomForestClassifier(n_estimators=200, max_depth=None, max_features='log2', random_state=42)),
    ('gb', GradientBoostingClassifier(n_estimators=200, learning_rate=0.5, max_depth=3, random_state=42)),
    ('xgb', XGBClassifier(n_estimators=200, learning_rate=0.5, max_depth=3, eval_metric='mlogloss', random_state=42))
], voting='soft', weights=[2, 1, 1])

voting_clf.fit(X_train_pca, y_train_smote_encoded)
y_pred_voting = voting_clf.predict(X_test_pca)
y_pred_voting = label_encoder.inverse_transform(y_pred_voting)
print("Classification Report for Voting Classifier:")
print(classification_report(y_test, y_pred_voting))
print("Accuracy Score:", accuracy_score(y_test, y_pred_voting))

# Stacking Classifier to combine the models
print("\nEvaluating Stacking Classifier with RandomForest, GradientBoosting, and XGBoost...")
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, max_depth=None, max_features='log2', random_state=42)),
        ('gb', GradientBoostingClassifier(n_estimators=200, learning_rate=0.5, max_depth=3, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=200, learning_rate=0.5, max_depth=3, eval_metric='mlogloss', random_state=42))
    ],
    final_estimator=LogisticRegression(max_iter=1000)
)

stacking_clf.fit(X_train_pca, y_train_smote_encoded)
y_pred_stacking = stacking_clf.predict(X_test_pca)
y_pred_stacking = label_encoder.inverse_transform(y_pred_stacking)
print("Classification Report for Stacking Classifier:")
print(classification_report(y_test, y_pred_stacking))
print("Accuracy Score:", accuracy_score(y_test, y_pred_stacking))

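# 5-Fold Cross-Validation for the Stacking Classifier
# Note: the folds are drawn from SMOTE-resampled data with PCA fit on the full training set,
# so these scores are likely optimistic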
print("\nEvaluating Stacking Classifier with 5-Fold Cross-Validation...")
cross_val_scores = cross_val_score(stacking_clf, X_train_pca, y_train_smote_encoded, cv=5, n_jobs=-1)
print("Cross-Validation Scores:", cross_val_scores)
print("Mean Cross-Validation Score:", np.mean(cross_val_scores))

# Creating a new DataFrame with predicted categories and evaluation of correctness
results_df = pd.DataFrame({
    'body_text': X_test,
    'actual_category': y_test,
    'category_expected': y_pred_stacking
})
results_df['correct_classification'] = results_df['actual_category'] == results_df['category_expected']

# Displaying the results DataFrame
print("\nResults DataFrame:")
print(results_df.head())

# Export the results DataFrame to a CSV file
results_df.to_csv('/home/37b4d573-02a2-4c75-b69b-8f21f8c5d212/Natural Language Processing/classification_results.csv', index=False)


Evaluating RandomForestClassifier with Hyperparameter Tuning...
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'n_estimators': 100, 'max_features': 'log2', 'max_depth': None}
Classification Report:
              precision    recall  f1-score   support

 Engineering       0.86      1.00      0.92         6
       Legal       1.00      0.82      0.90        11
   Marketing       0.91      1.00      0.95        10

    accuracy                           0.93        27
   macro avg       0.92      0.94      0.93        27
weighted avg       0.93      0.93      0.92        27

Accuracy Score: 0.9259259259259259

Evaluating GradientBoostingClassifier with Hyperparameter Tuning...
Fitting 3 folds for each of 10 candidates, totalling 30 fits
Best Parameters: {'n_estimators': 200, 'max_depth': 3, 'learning_rate': 0.5}
Classification Report:
              precision    recall  f1-score   support

 Engineering       0.75      1.00      0.86         6
       Legal  