# Text Classification and Analysis using Word2Vec and RandomForest

In this notebook, we classify text sections using averaged Word2Vec embeddings combined with TF-IDF features, with a RandomForest as the classifier. The process includes keyword-based labeling, tokenization, feature extraction, handling class imbalance with SMOTE, training the classifier, and evaluating its performance.

In [None]:
# Import necessary libraries
import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE
import numpy as np
import joblib
import nltk

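# Download the Punkt tokenizer data used by word_tokenize
nltk.download('punkt')
# Recent NLTK releases load 'punkt_tab' instead of 'punkt', so fetch both
nltk.download('punkt_tab')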

### Data Loading and Preprocessing

Load the cleaned text, fill missing values with empty strings, and label each document with a category using keyword rules.

In [None]:
# Load the cleaned data from the CSV file
df = pd.read_csv('cleaned_section_data.csv')
df['Cleaned Text'] = df['Cleaned Text'].fillna('')  # Fill missing text with empty strings

# Define categories and corresponding keywords for classification
categories = {
    'Installation & Setup': [
        'install', 'setup', 'implementation', 'deployment', 'configure', 'initialization', 
        'installing', 'deploy', 'configuration', 'set-up', 'initiate', 'launch', 'activate',
        'how to install', 'setting up', 'installation guide', 'deploying', 'configuring'
    ],
    'Maintenance & Management': [
        'maintain', 'maintenance', 'servicing', 'management', 'optimization', 'service', 
        'manage', 'routine check', 'system upkeep', 'system care', 'upkeep', 'tune-up',
        'maintaining', 'managing', 'service routine', 'optimizing', 'how to maintain'
    ],
    'Troubleshooting & Support': [
        'troubleshoot', 'error', 'issue', 'problem', 'diagnosis', 'resolution', 'fix', 
        'solve', 'rectify', 'repair', 'resolve', 'correct', 'debug', 'fault finding',
        'troubleshooting', 'solving issues', 'fixing errors', 'diagnosing problems', 'resolving'
    ],
    'Upgrades & Updates': [
        'upgrade', 'update', 'new version', 'patch', 'release', 'enhancement', 'updating', 
        'upgrading', 'version upgrade', 'system update', 'software update', 'patching',
        'how to upgrade', 'applying updates', 'version updating', 'software enhancement'
    ],
    'General Information & Overview': [
        'overview', 'introduction', 'info', 'summary', 'guide', 'documentation', 
        'information', 'details', 'background', 'basics', 'general data', 'key points',
        'what is', 'explain', 'description of', 'details about'
    ],
    'Security & Monitoring': [
        'surveillance', 'log management', 'event tracking', 'real-time analysis', 
        'security watch', 'monitoring', 'security check', 'system monitoring', 'network watch',
        'security overview', 'monitoring setup', 'event tracking system'
    ],
    'Threat Detection & Analysis': [
        'threat detection', 'anomaly detection', 'intrusion detection', 'threat intelligence', 
        'security alerts', 'risk detection', 'threat identification', 'vulnerability detection', 
        'security threat detection', 'analyzing threats', 'identifying risks', 'detecting anomalies'
    ],
    'Incident Response & Management': [
        'incident response', 'incident management', 'forensics', 'mitigation', 'recovery', 
        'incident handling', 'crisis management', 'incident analysis', 'emergency response',
        'responding to incidents', 'managing incidents', 'incident recovery'
    ],
    'Compliance & Auditing': [
        'compliance', 'regulatory compliance', 'audit', 'reporting', 'policy enforcement', 
        'regulation management', 'compliance tracking', 'legal compliance', 'audit management',
        'compliance policies', 'auditing processes', 'regulatory reporting'
    ],
    'Integration & Compatibility': [
        'integration', 'compatibility', 'third-party integration', 'API', 'interoperability', 
        'system merging', 'software integration', 'data integration', 'platform integration',
        'integrating systems', 'API usage', 'compatibility issues'
    ],
    'Network Security & Protection': [
        'network security', 'firewall', 'traffic analysis', 'intrusion prevention', 
        'network protection', 'cybersecurity', 'network defense', 'network safeguard',
        'protecting networks', 'network firewalls', 'cybersecurity measures'
    ]
}

# Function to assign a category to a text based on keyword counts
# Note: str.count matches substrings, so 'install' is also counted inside 'installing'
def assign_category(text, categories):
    text = str(text).lower()
    category_scores = {
        # Lowercase each keyword so entries such as 'API' can match the lowercased text
        category: sum(text.count(keyword.lower()) for keyword in keywords)
        for category, keywords in categories.items()
    }
    # max() returns the first category in dictionary order when scores tie
    assigned_category = max(category_scores, key=category_scores.get)
    return 'Other' if category_scores[assigned_category] == 0 else assigned_category

# Apply the function to assign categories to each document
df['Category'] = df['Cleaned Text'].apply(lambda text: assign_category(text, categories))
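
To see how the keyword scoring behaves, the sketch below reruns the same counting logic on a deliberately tiny rule set. The two-category `toy_rules` dictionary, the helper names, and the sample sentences are made up for illustration. The point is that `str.count` matches substrings, so `'install'` also fires inside `'installing'`, and a document with no hits falls back to `'Other'`.

```python
# Minimal sketch of substring-based keyword scoring (toy rules, not the notebook's full dictionary)
toy_rules = {
    'Installation & Setup': ['install', 'installing', 'setup'],
    'Troubleshooting & Support': ['error', 'fix'],
}

def score_text(text, rules):
    """Return the number of keyword hits per category for one document."""
    text = str(text).lower()
    return {
        category: sum(text.count(keyword.lower()) for keyword in keywords)
        for category, keywords in rules.items()
    }

def pick_category(text, rules):
    """Pick the highest-scoring category, or 'Other' when nothing matches."""
    scores = score_text(text, rules)
    best = max(scores, key=scores.get)
    return 'Other' if scores[best] == 0 else best

sample = "Installing the agent: run setup, then fix any error."
scores = score_text(sample, toy_rules)
print(scores)                                              # 'installing' is counted twice
print(pick_category(sample, toy_rules))
print(pick_category("Quarterly sales figures", toy_rules))  # no keyword hits
```

Because overlapping keywords such as `'install'` and `'installing'` both count, categories with many near-duplicate keywords tend to accumulate higher scores than categories with few.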

### Data Filtering and Tokenization

Filter out categories with fewer than `min_samples_threshold` documents and tokenize the remaining text for the Word2Vec model.

In [None]:
# Define a threshold for the minimum number of samples required in a category
min_samples_threshold = 3  

# Count the number of samples in each category and identify categories to exclude
category_counts = df['Category'].value_counts()
categories_to_exclude = category_counts[category_counts < min_samples_threshold].index.tolist()

# Filter out categories with insufficient samples
df_filtered = df[~df['Category'].isin(categories_to_exclude)]

# Tokenize the text data for Word2Vec model training
tokenized_text = [word_tokenize(text) for text in df_filtered['Cleaned Text']]
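
The filtering step can be tried on a toy frame. The rows below are invented, the threshold is lowered to 2 to suit six rows, and whitespace splitting stands in for `word_tokenize` so the sketch runs without NLTK data.

```python
import pandas as pd

# Toy data (invented): 'Rare' appears fewer times than the threshold
toy_df = pd.DataFrame({
    'Cleaned Text': ['install the agent', 'fix the error', 'setup guide',
                     'error on boot', 'audit report', 'install steps'],
    'Category': ['Install', 'Troubleshoot', 'Install',
                 'Troubleshoot', 'Rare', 'Install'],
})
min_samples_threshold = 2

# Count documents per category and collect the under-populated ones
counts = toy_df['Category'].value_counts()
rare = counts[counts < min_samples_threshold].index.tolist()

# Keep only rows whose category has enough samples
kept = toy_df[~toy_df['Category'].isin(rare)]

# Whitespace tokenization as a stand-in for nltk's word_tokenize
tokens = [text.split() for text in kept['Cleaned Text']]

print(rare)
print(len(kept))
print(tokens[0])
```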

### Word2Vec Model Training

Train a Word2Vec model on the tokenized text to create word embeddings, then average each document's word vectors into a single 200-dimensional feature vector.

In [None]:
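# Initialize and train a Word2Vec model (seed fixed for consistency with the later steps;
# exact reproducibility would also need workers=1 and a fixed PYTHONHASHSEED)
word2vec_model = Word2Vec(tokenized_text, vector_size=200, window=10, min_count=1, workers=4, seed=42)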

# Function to create a feature vector for a document by averaging its word vectors
def document_vector(word2vec_model, doc):
    # Filter out words not in the model's vocabulary
    doc = [word for word in doc if word in word2vec_model.wv]
    # Return the mean of the word vectors if the document is not empty, else return a zero vector
    return np.mean(word2vec_model.wv[doc], axis=0) if doc else np.zeros(word2vec_model.vector_size)

# Create feature vectors for each document using the Word2Vec model
w2v_feature_vectors = np.array([document_vector(word2vec_model, doc) for doc in tokenized_text])
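
The averaging performed by `document_vector` can be sketched without gensim. The three-dimensional `toy_vectors` dictionary and the helper name `average_vector` are assumptions made for illustration; a plain dictionary stands in for `word2vec_model.wv`.

```python
import numpy as np

# Toy 3-dimensional embeddings (invented values) standing in for word2vec_model.wv
toy_vectors = {
    'install': np.array([1.0, 0.0, 2.0]),
    'agent':   np.array([3.0, 2.0, 0.0]),
}
vector_size = 3

def average_vector(tokens, vectors, size):
    """Mean of the known tokens' vectors; a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(size)

doc_a = average_vector(['install', 'the', 'agent'], toy_vectors, vector_size)
doc_b = average_vector(['unknown', 'words'], toy_vectors, vector_size)
print(doc_a)  # mean of the two known vectors; 'the' is skipped
print(doc_b)  # all-zero fallback
```

Because the notebook trains Word2Vec with `min_count=1` on the same documents it later vectorizes, every token is in the vocabulary, so the zero-vector branch is only reached for empty texts.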

### Feature Extraction with TF-IDF and Combining Features

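Extract TF-IDF features (unigrams and bigrams, capped at 5,000 terms) and concatenate them with the averaged Word2Vec vectors, so that each document is described by both dense embedding features and sparse lexical features.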

In [None]:
# Initialize and fit a TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
tfidf_feature_vectors = tfidf_vectorizer.fit_transform(df_filtered['Cleaned Text']).toarray()

# Combine Word2Vec and TF-IDF features
combined_features = np.hstack((w2v_feature_vectors, tfidf_feature_vectors))
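
The concatenation is a column-wise stack, so both matrices must have one row per document in the same order. A small sketch with invented numbers (3 dense dimensions and 4 TF-IDF terms instead of the notebook's 200 and up to 5,000):

```python
import numpy as np

# Toy stand-ins (invented values): 2 documents, 3 dense dims, 4 TF-IDF terms
dense_part = np.array([[0.2, -0.1, 0.4],
                       [0.0,  0.3, -0.2]])
tfidf_part = np.array([[0.0, 0.7, 0.0, 0.7],
                       [1.0, 0.0, 0.0, 0.0]])

# Column-wise concatenation: rows must line up document by document
combined = np.hstack((dense_part, tfidf_part))

print(combined.shape)  # one row per document, 3 + 4 columns
print(combined[0])
```

In the notebook, the first 200 columns of `combined_features` are therefore the Word2Vec averages and the remaining columns are the TF-IDF weights.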

### Handling Class Imbalance with SMOTE

Use SMOTE to handle class imbalance in the dataset, creating synthetic samples for under-represented classes.
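
SMOTE builds each synthetic sample by interpolating between a minority-class point and one of its nearest neighbours in the same class. The NumPy sketch below illustrates that idea for a single neighbour. It is not imblearn's implementation, and the points and the helper name `smote_like_sample` are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority class (invented points) in 2-D
minority = np.array([[1.0, 1.0],
                     [2.0, 1.5],
                     [1.5, 3.0]])

def smote_like_sample(points, index, rng):
    """Create one synthetic point between points[index] and its nearest neighbour."""
    base = points[index]
    # Euclidean distance from the base point to every point in the class
    distances = np.linalg.norm(points - base, axis=1)
    distances[index] = np.inf          # exclude the point itself
    neighbour = points[np.argmin(distances)]
    lam = rng.random()                 # interpolation factor in [0, 1)
    return base + lam * (neighbour - base)

synthetic = smote_like_sample(minority, 0, rng)
print(synthetic)  # lies on the segment between [1.0, 1.0] and [2.0, 1.5]
```

With only one neighbour per point, every synthetic sample lies on a segment between two existing documents, so very small classes gain volume but little new variety.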

In [None]:
# Split first, so the test set contains only real (non-synthetic) documents;
# stratify keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    combined_features, df_filtered['Category'],
    test_size=0.2, random_state=42, stratify=df_filtered['Category']
)

# Apply SMOTE to the training split only; k_neighbors=1 because the smallest
# classes may have as few as two training samples
smote = SMOTE(random_state=42, k_neighbors=1)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

### Classifier Training

Train a RandomForest classifier on the resampled training set.

In [None]:
# Initialize and train a RandomForest classifier
classifier = RandomForestClassifier(n_estimators=100, class_weight='balanced', random_state=42)
classifier.fit(X_train_resampled, y_train_resampled)

### Model Evaluation

Evaluate the performance of the classifier using the test set.

In [None]:
# Make predictions on the test set
predictions = classifier.predict(X_test)

# Print a classification report to evaluate the classifier
print(classification_report(y_test, predictions, zero_division=0))
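
The per-class columns of the report are precision, recall, and F1. The sketch below computes them by hand on invented labels, returning 0.0 when a denominator is zero, which is what `zero_division=0` requests. The helper name `per_class_metrics` is an assumption for illustration.

```python
# Toy labels (invented) to show what the per-class columns of the report mean
y_true = ['A', 'A', 'A', 'B', 'B', 'C']
y_pred = ['A', 'A', 'B', 'B', 'B', 'A']

def per_class_metrics(y_true, y_pred, label):
    """Precision, recall and F1 for one class, with 0.0 when a denominator is zero."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for label in ['A', 'B', 'C']:
    print(label, per_class_metrics(y_true, y_pred, label))
```

Since the labels in this notebook come from the keyword rules, these scores measure agreement with the keyword-derived categories rather than with human annotation.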

### Saving Models and Results

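Save the fitted TF-IDF vectorizer, the Word2Vec model, the combined feature matrix, and the labeled dataset for future use.

In [None]:
# Save the fitted TF-IDF vectorizer (reload with joblib.load)
joblib.dump(tfidf_vectorizer, "tfidf_model.pkl")
# Word2Vec.save writes gensim's native format despite the .bin extension (reload with Word2Vec.load)
word2vec_model.save('word2vec_model.bin')
# Save the combined feature matrix and the filtered, labeled dataset
np.save('combined_features.npy', combined_features)
df_filtered.to_csv('cleaned_section_data_with_categories.csv', index=False)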
