# Text Classification and Analysis using Word2Vec and RandomForest

This notebook performs text classification on a dataset using Word2Vec for word embeddings and RandomForest for classification. The process includes the following steps:

1. **Data Loading and Preprocessing**: Load the cleaned textual data, fill missing values, and filter out categories with fewer than 3 samples.
2. **Data Filtering and Tokenization**: Tokenize the text data for Word2Vec model training using the filtered dataframe.
3. **Word2Vec Model Training**: Train a Word2Vec model on the tokenized text to create word embeddings.
4. **Feature Extraction with TF-IDF and Combining Features**: Extract features using TF-IDF and combine them with Word2Vec features for a more robust feature set.
5. **Handling Class Imbalance with SMOTE**: Use SMOTE to handle class imbalance in the dataset, creating synthetic samples for under-represented classes.
6. **Data Splitting and Classifier Training**: Split the data into training and testing sets and train a RandomForest classifier.
7. **Enhanced Cross-Validation**: Run repeated k-fold cross-validation to estimate the mean and spread of the model's accuracy.
8. **Model Evaluation**: Evaluate the performance of the classifier using the test set.
9. **Confusion Matrix and ROC Curve**: Visualize the confusion matrix and ROC curve for a comprehensive evaluation.
10. **Additional Metrics**: Calculate additional metrics like the Matthews Correlation Coefficient and Area Under Precision-Recall Curve for further analysis.
11. **Learning Curve Analysis**: Analyze the learning curves to understand the model's behavior with varying training set sizes.
12. **Feature Importance Analysis**: Examine the feature importances to understand the contribution of different features to the model.
13. **Saving Models and Results**: Save the trained models and processed data for future use.

### Import Libraries and Resources

In [None]:
# Import necessary libraries
import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split, cross_val_score, RepeatedKFold, learning_curve
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, auc, precision_recall_curve, matthews_corrcoef
from sklearn.preprocessing import label_binarize
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
import numpy as np
import joblib
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
from itertools import cycle

# Download the punkt tokenizer for word tokenization
nltk.download('punkt')

### Define Script Functions

In [None]:
# Function to create a feature vector for a document by averaging its word vectors
def document_vector(word2vec_model, doc):
    # Filter out words not in the model's vocabulary
    doc = [word for word in doc if word in word2vec_model.wv]
    return np.mean(word2vec_model.wv[doc], axis=0) if doc else np.zeros(word2vec_model.vector_size)
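
To make the averaging concrete, the sketch below applies the same logic to a hypothetical three-word vocabulary stored in a plain dictionary rather than a trained Word2Vec model. The names `toy_vectors` and `average_vector` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
import numpy as np

# Hypothetical 2-dimensional "embeddings"; a real Word2Vec model would supply these
toy_vectors = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.0, 1.0]),
    "pet": np.array([1.0, 1.0]),
}

def average_vector(vectors, doc, size=2):
    # Keep only in-vocabulary tokens, mirroring document_vector above
    known = [vectors[w] for w in doc if w in vectors]
    # Fall back to a zero vector when no token is known
    return np.mean(known, axis=0) if known else np.zeros(size)

doc_vec = average_vector(toy_vectors, ["cat", "dog", "unicorn"])
empty_vec = average_vector(toy_vectors, ["unicorn"])
```

Unknown tokens are dropped before averaging, so a document made up entirely of unknown tokens maps to the zero vector.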

### Data Loading and Preprocessing

Load the cleaned textual data and preprocess it for classification. This involves filling missing text with empty strings and dropping categories that have fewer than 3 samples.

In [None]:
# Load the cleaned data from the CSV file
df = pd.read_csv('cleaned_section_data_with_categories.csv')
df['Cleaned Text'] = df['Cleaned Text'].fillna('')  # Fill missing text with empty strings

# Filter out categories with fewer than 3 samples
min_samples_threshold = 3  
category_counts = df['Category'].value_counts()
categories_to_keep = category_counts[category_counts >= min_samples_threshold].index
df_filtered = df[df['Category'].isin(categories_to_keep)]
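
The filtering rule can be illustrated without pandas. The sketch below uses `collections.Counter` on a hypothetical list of labels (the label values are made up for illustration) and keeps only categories that reach the threshold.

```python
from collections import Counter

# Hypothetical category labels, one per document
labels = ["intro", "intro", "intro", "methods", "methods", "results", "intro"]
min_samples_threshold = 3

# Count documents per category and keep categories that reach the threshold
counts = Counter(labels)
categories_to_keep = {cat for cat, n in counts.items() if n >= min_samples_threshold}

# Keep only the rows whose category survived
kept_labels = [lab for lab in labels if lab in categories_to_keep]
```

Here only `intro` (4 documents) survives; `methods` (2) and `results` (1) are dropped along with their rows.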

### Data Filtering and Tokenization

Tokenize the text of the filtered dataframe (categories with insufficient samples were removed in the previous step) so it can be used to train the Word2Vec model.

In [None]:
# Tokenize the text data for Word2Vec model training using the filtered dataframe
tokenized_text_filtered = [word_tokenize(text) for text in df_filtered['Cleaned Text']]

### Word2Vec Model Training

Train a Word2Vec model on the tokenized text to create word embeddings.

In [None]:
# Initialize and train a Word2Vec model
word2vec_model = Word2Vec(tokenized_text_filtered, vector_size=200, window=10, min_count=1, workers=4)

### Feature Extraction with TF-IDF and Combining Features

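Extract up to 5,000 unigram and bigram TF-IDF features and concatenate them with the 200-dimensional averaged Word2Vec document vectors, so each document is described by both term weights and embedding information.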

In [None]:
# Initialize and fit a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_feature_vectors_filtered = tfidf_vectorizer.fit_transform(df_filtered['Cleaned Text']).toarray()

# Create feature vectors for each document using the Word2Vec model on the filtered text
w2v_feature_vectors_filtered = np.array([document_vector(word2vec_model, doc) for doc in tokenized_text_filtered])

# Combine Word2Vec and TF-IDF features for the filtered text
combined_features_filtered = np.hstack((w2v_feature_vectors_filtered, tfidf_feature_vectors_filtered))
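
As a quick check on the layout of the combined matrix, the sketch below stacks two placeholder arrays with the same column counts as the settings above. It assumes the TF-IDF vocabulary reaches the full `max_features=5000`; with a smaller corpus vocabulary there would be fewer TF-IDF columns.

```python
import numpy as np

n_docs = 4
# Stand-ins for the real matrices: 200 Word2Vec dimensions and 5000 TF-IDF columns
w2v_part = np.zeros((n_docs, 200))
tfidf_part = np.ones((n_docs, 5000))

combined = np.hstack((w2v_part, tfidf_part))

# Columns 0-199 come from Word2Vec, columns 200 onward from TF-IDF
w2v_slice = combined[:, :200]
tfidf_slice = combined[:, 200:]
```

Knowing this column order matters later, when feature importances are reported by column index.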

### Handling Class Imbalance with SMOTE

Use SMOTE to handle class imbalance by creating synthetic samples for under-represented classes. Setting `k_neighbors=1` keeps SMOTE usable for categories that have as few as 3 samples.

In [None]:
# Initialize SMOTE and apply it to the combined feature set
smote = SMOTE(random_state=42, k_neighbors=1)
X_resampled, y_resampled = smote.fit_resample(combined_features_filtered, df_filtered['Category'])
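
SMOTE builds each synthetic sample by interpolating between a minority-class sample and one of its nearest neighbours from the same class. The NumPy sketch below illustrates that idea for `k_neighbors=1`. It is a simplified stand-in for illustration, not the `imblearn` implementation, and `smote_like_sample` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples in a 2-D feature space
minority = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])

def smote_like_sample(points, rng):
    # Pick a random sample and find its single nearest neighbour (k_neighbors=1)
    i = rng.integers(len(points))
    dists = np.linalg.norm(points - points[i], axis=1)
    dists[i] = np.inf  # exclude the point itself
    j = int(np.argmin(dists))
    # Place the synthetic point at a random position on the segment between them
    lam = rng.random()
    return points[i] + lam * (points[j] - points[i])

synthetic = smote_like_sample(minority, rng)
```

Because every synthetic point lies on a segment between two real points of the same class, it always stays inside the region spanned by that class.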

### Data Splitting and Classifier Training

Split the data into training and testing sets and train a RandomForest classifier.

In [None]:
# Split the data into training and testing sets, preserving class proportions in both
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42, stratify=y_resampled)

# Initialize and train a RandomForest classifier
classifier = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
classifier.fit(X_train, y_train)


### Enhanced Cross-Validation

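Run repeated 5-fold cross-validation (3 repeats, 15 fits in total) on the resampled data and report the mean and standard deviation of the accuracy. Because SMOTE was applied before the folds were drawn, synthetic samples can land in validation folds next to the originals they were interpolated from, so these scores are likely optimistic.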

In [None]:
# Initialize RepeatedKFold for robust cross-validation
rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
cv_scores = cross_val_score(classifier, X_resampled, y_resampled, cv=rkf, scoring='accuracy')

# Print cross-validation results
print("Cross-Validation Scores:", cv_scores)
print("Average Cross-Validation Score:", np.mean(cv_scores))
print("Standard Deviation of Cross-Validation Scores:", np.std(cv_scores))
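
The sketch below, on a small hypothetical dataset, shows how many train/validation splits `RepeatedKFold` produces with these settings. It also shows `RepeatedStratifiedKFold`, a scikit-learn alternative (not used in this notebook) that keeps class proportions equal in every fold.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold, RepeatedStratifiedKFold

# Hypothetical data: 20 samples, 2 balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

rkf = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
splits = list(rkf.split(X))
n_fits = len(splits)  # 5 folds x 3 repeats

# Stratified variant: each validation fold keeps the class proportions of y
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
fold_class_counts = [np.bincount(y[val_idx], minlength=2) for _, val_idx in rskf.split(X, y)]
```

Stratification may be worth considering for classification tasks, since plain k-fold can leave a small class under-represented in some folds.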

### Model Evaluation

In [None]:
# Evaluate the performance of the classifier using the test set.
predictions = classifier.predict(X_test)
print(classification_report(y_test, predictions, zero_division=0))

### Confusion Matrix and ROC Curve

Visualize the confusion matrix and ROC curve for a comprehensive evaluation.

In [None]:
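# Compute the confusion matrix with rows and columns in the classifier's class order
cm = confusion_matrix(y_test, predictions, labels=classifier.classes_)
sns.heatmap(cm, annot=True, fmt='d', xticklabels=classifier.classes_, yticklabels=classifier.classes_)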
plt.title('Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# Binarize the test labels in the same class order the classifier uses for predict_proba
y_bin = label_binarize(y_test, classes=classifier.classes_)
n_classes = y_bin.shape[1]

# Compute the class probabilities once, then the one-vs-rest ROC curve and AUC for each class
y_score = classifier.predict_proba(X_test)
fpr, tpr, roc_auc = {}, {}, {}
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Plot ROC curve for each class
colors = cycle(['blue', 'red', 'green', 'yellow', 'orange', 'purple', 'cyan', 'magenta', 'brown', 'black', 'grey'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2, label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))

plt.plot([0, 1], [0, 1], 'k--', lw=2)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Multi-class ROC')
plt.legend(loc="lower right")
plt.show()
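
The one-vs-rest binarization behind the per-class ROC curves can be reproduced in plain NumPy. The sketch below uses hypothetical class names and labels. For three or more classes it yields the same indicator matrix as `label_binarize`; with exactly two classes `label_binarize` returns a single column instead.

```python
import numpy as np

# Hypothetical class names and true labels
classes = np.array(["a", "b", "c"])
y_true = np.array(["b", "a", "c", "b"])

# One column per class: 1 where the sample belongs to that class, else 0
y_onehot = (y_true[:, None] == classes[None, :]).astype(int)
```

Column `i` of this matrix is the binary target that is compared against column `i` of `predict_proba` when drawing the ROC curve for class `i`.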

### Additional Metrics

Calculate additional metrics like the Matthews Correlation Coefficient and Area Under Precision-Recall Curve for further analysis.

In [None]:
# Matthews Correlation Coefficient
mcc = matthews_corrcoef(y_test, predictions)
print("Matthews Correlation Coefficient:", mcc)

# Area Under Precision-Recall Curve, micro-averaged by flattening the one-vs-rest labels and scores
precision, recall, _ = precision_recall_curve(y_bin.ravel(), classifier.predict_proba(X_test).ravel())
auprc = auc(recall, precision)
print("Area Under Precision-Recall Curve:", auprc)
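
For intuition about the Matthews Correlation Coefficient, the sketch below computes it by hand for the binary case from confusion-matrix counts. `matthews_corrcoef` generalizes this to multiple classes. The helper name `binary_mcc` and the counts are hypothetical.

```python
import math

def binary_mcc(tp, tn, fp, fn):
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Return 0 when a row or column of the confusion matrix is empty (a common convention)
    return (tp * tn - fp * fn) / denom if denom else 0.0

mcc_example = binary_mcc(tp=6, tn=3, fp=1, fn=2)
mcc_perfect = binary_mcc(tp=5, tn=5, fp=0, fn=0)
mcc_inverted = binary_mcc(tp=0, tn=0, fp=5, fn=5)
```

The score ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (every prediction right), and it uses all four cells of the confusion matrix, which makes it informative when classes are imbalanced.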

### Learning Curve Analysis

Analyze the learning curves to understand the model's behavior with varying training set sizes.

In [None]:
# Compute learning curves with 5-fold cross-validation at 5 training-set sizes (10% to 100% of each training fold)
train_sizes, train_scores, test_scores = learning_curve(classifier, X_resampled, y_resampled, n_jobs=-1, cv=5, train_sizes=np.linspace(.1, 1.0, 5), verbose=0)
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, color="r", alpha=0.1)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, color="g", alpha=0.1)
plt.plot(train_sizes, train_mean, 'o-', color="r", label="Training score")
plt.plot(train_sizes, test_mean, 'o-', color="g", label="Cross-validation score")
plt.title("Learning Curve")
plt.xlabel("Training Set Size")
plt.ylabel("Accuracy Score")
plt.legend(loc="best")
plt.show()

### Feature Importance Analysis

Examine the feature importances to understand the contribution of different features to the model.

In [None]:
# Rank features by importance; columns 0-199 are Word2Vec dimensions, the rest are TF-IDF terms
importances = classifier.feature_importances_
indices = np.argsort(importances)[::-1]
top_n = min(20, len(indices))  # Print only the top features; the full list has thousands of entries
print("Feature ranking:")
for f in range(top_n):
    print(f"{f + 1}. feature {indices[f]} ({importances[indices[f]]})")
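
Raw column indices are hard to interpret. A minimal sketch for turning an index into a readable name is shown below, assuming the Word2Vec columns come first as in the combined feature matrix. The helper `feature_name` and the toy sizes are hypothetical. In the notebook, the dimension would come from `word2vec_model.vector_size` and the terms from `tfidf_vectorizer.get_feature_names_out()`.

```python
# Hypothetical sizes and terms standing in for the notebook's real objects
w2v_dim = 3
tfidf_terms = ["data", "model", "data model"]

def feature_name(index, w2v_dim, tfidf_terms):
    # The first w2v_dim columns are embedding dimensions, the rest are TF-IDF terms
    if index < w2v_dim:
        return f"w2v_dim_{index}"
    return f"tfidf:{tfidf_terms[index - w2v_dim]}"

names = [feature_name(i, w2v_dim, tfidf_terms) for i in range(w2v_dim + len(tfidf_terms))]
```

TF-IDF columns map back to actual terms, whereas individual Word2Vec dimensions have no direct interpretation and are labelled only by position.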

### Saving Models and Results

Save the trained models and processed data for future use.

In [None]:
# Save the trained models and processed data for future use.
joblib.dump(tfidf_vectorizer, "tfidf_model.pkl")
word2vec_model.save('word2vec_model.bin')
np.save('combined_features.npy', combined_features_filtered)
joblib.dump(classifier, "random_forest_classifier.pkl")
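# Note: this overwrites the input CSV, so categories with fewer than 3 samples are removed from the file
df_filtered.to_csv('cleaned_section_data_with_categories.csv', index=False)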