<a href="https://colab.research.google.com/github/mattdani21/Jup/blob/main/Final_word2vec.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This code cell is the initial setup for your Spark-based text classification project.

Here's a breakdown:

*   **Comments:** The initial comments describe the purpose of the notebook: building an improved multi-class text classification model using word embeddings and addressing issues from a previous version.
*   **Import Statements:** This section imports all the necessary libraries and modules for the project:
    *   `pyspark.sql`: For working with Spark DataFrames and SQL functions.
    *   `pyspark.ml.feature`: For feature engineering transformers like `StopWordsRemover` and `Word2Vec`.
    *   `pyspark.ml.classification`: For classification algorithms like `LogisticRegression` and `RandomForestClassifier`.
    *   `pyspark.ml.evaluation`: For evaluating the performance of the classification models (`MulticlassClassificationEvaluator`).
    *   `pyspark.ml.linalg`: For Spark ML's local vector types (`Vectors`, `VectorUDT`), which hold the per-document feature vectors.
    *   `pyspark.sql.functions.udf`: To define User Defined Functions (UDFs) for custom transformations.
    *   `numpy`: For numerical operations, particularly for handling word vectors.
    *   `sklearn.datasets.fetch_20newsgroups`: To fetch the 20newsgroups dataset, a common dataset for text classification.
    *   `pandas`: For initial data handling and manipulation before converting to a Spark DataFrame.
    *   `matplotlib.pyplot` and `seaborn`: Although imported, they are not currently used in the visible code. They are typically used for data visualization.
*   **Spark Initialization:** The line `spark = SparkSession.builder.master("local[*]").appName("ImprovedClassification").getOrCreate()` initializes a Spark session.
    *   `.master("local[*]")`: Configures Spark to run locally using all available cores.
    *   `.appName("ImprovedClassification")`: Sets a name for the Spark application.
    *   `.getOrCreate()`: Gets an existing Spark session or creates a new one if none exists.

In essence, this cell prepares the environment by importing the required tools and starting a Spark session, making it ready for the subsequent data loading, preprocessing, model training, and evaluation steps.


In [3]:
# Improved Multi-Class Text Classification with Word Embeddings
# This enhanced version addresses key issues in the original model

from pyspark.sql import SparkSession
from pyspark.sql.functions import lower, regexp_replace, split, col, size, when, isnan, isnull
from pyspark.ml.feature import StopWordsRemover, Word2Vec, StringIndexer
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.functions import udf
import numpy as np
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize Spark
spark = SparkSession.builder.master("local[*]").appName("ImprovedClassification").getOrCreate()


In [4]:
# PART 1: IMPROVED DATA PREPROCESSING
def load_and_preprocess_data():
    """Load and preprocess the 20newsgroups dataset with better cleaning"""

    # Fetch dataset with better category selection (fewer, more distinct categories)
    categories = [
        'alt.atheism',
        'comp.graphics',
        'rec.motorcycles',
        'sci.space',
        'talk.politics.guns'
    ]

    newsgroups = fetch_20newsgroups(
        subset='all',
        categories=categories,
        remove=('headers', 'footers', 'quotes'),
        shuffle=True,
        random_state=42
    )

    # Create pandas DataFrame
    df_pandas = pd.DataFrame({
        'text': newsgroups.data,
        'label': newsgroups.target
    })

    # Filter out very short documents (fewer than 10 whitespace-separated words)
    df_pandas = df_pandas[df_pandas['text'].str.split().str.len() >= 10]

    # Convert to Spark DataFrame
    df_labeled = spark.createDataFrame(df_pandas)

    print(f"Dataset loaded with {df_labeled.count()} documents")
    print("Target classes:")
    for i, name in enumerate(newsgroups.target_names):
        print(f"{i}: {name}")

    return df_labeled, newsgroups.target_names

def advanced_text_preprocessing(df):
    """Enhanced text preprocessing with better cleaning"""

    # More comprehensive text cleaning
    df_clean = df.select(
        'label',
        # Remove URLs, email addresses, numbers, and special characters
        regexp_replace(
            regexp_replace(
                regexp_replace(
                    regexp_replace('text', r'http\S+|www\S+', ''),  # URLs
                    r'\S+@\S+', ''  # Email addresses
                ),
                r'\d+', ''  # Numbers
            ),
            r'[^\w\s]', ' '  # Special characters
        ).alias('cleaned_text')
    )

    # Tokenize and convert to lowercase
    df_tokens = df_clean.select(
        'label',
        split(lower(col('cleaned_text')), r'\s+').alias('tokens')
    )

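    # Rename the token column for StopWordsRemover.
    # NOTE: no token filtering happens here; empty strings and tokens
    # shorter than 3 characters are passed through unchanged.
    df_filtered = df_tokens.select(
        'label',
        col('tokens').alias('raw_tokens')
    )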

    # Remove stop words
    remover = StopWordsRemover(inputCol="raw_tokens", outputCol="tokens")
    df_no_stopwords = remover.transform(df_filtered)

    # Filter out documents with too few tokens after cleaning
    df_final = df_no_stopwords.filter(size(col('tokens')) >= 5)

    print(f"After preprocessing: {df_final.count()} documents remain")
    return df_final.select('label', 'tokens')

print('ready')

ready
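
As a rough illustration of what the cleaning chain in `advanced_text_preprocessing` does to a single string, the sketch below reproduces the four `regexp_replace` steps, the lowercasing, and the whitespace split with Python's `re` module. It approximates the Spark code path rather than reproducing it: the `re.ASCII` flag is an assumption made to stay close to Java's default character classes, the sample sentence is invented, and `keep_long_tokens` is a hypothetical helper showing what a length filter on tokens could look like.

```python
import re

def clean_and_tokenize(text):
    """Approximate the Spark cleaning steps for one document (illustrative only)."""
    text = re.sub(r'http\S+|www\S+', '', text, flags=re.ASCII)  # URLs
    text = re.sub(r'\S+@\S+', '', text, flags=re.ASCII)         # email addresses
    text = re.sub(r'\d+', '', text, flags=re.ASCII)             # numbers
    text = re.sub(r'[^\w\s]', ' ', text, flags=re.ASCII)        # special characters
    # Lowercase, then split on runs of whitespace (empty tokens can survive)
    return re.split(r'\s+', text.lower(), flags=re.ASCII)

def keep_long_tokens(tokens, min_length=3):
    """Hypothetical helper: drop empty strings and very short tokens."""
    return [t for t in tokens if len(t) >= min_length]

sample = "Visit http://example.com or mail bob@example.com: 3 bikes!"
tokens = clean_and_tokenize(sample)
filtered = keep_long_tokens(tokens)
print(tokens)    # ['visit', 'or', 'mail', 'bikes', '']
print(filtered)  # ['visit', 'mail', 'bikes']
```

The trailing empty string in the first result comes from the punctuation being replaced by a space before the split. Tokens like this reach `StopWordsRemover` in the notebook, because the cell above only renames the token column.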


In [5]:
# PART 2: IMPROVED WORD2VEC TRAINING

def train_improved_word2vec(df_processed):
    """Train Word2Vec with better parameters"""

    # Use better parameters for Word2Vec
    word2vec = Word2Vec(
        vectorSize=200,        # Increased vector size
        minCount=3,           # Lower minimum count to capture more words
        numPartitions=4,      # Better parallelization
        stepSize=0.05,        # Learning rate
        maxIter=5,            # More iterations
        windowSize=7,         # Larger context window
        inputCol="tokens",
        outputCol="word_vectors"
    )

    print("Training Word2Vec model...")
    model = word2vec.fit(df_processed)

    # Get vocabulary statistics
    vocab_size = model.getVectors().count()
    print(f"Word2Vec vocabulary size: {vocab_size}")

    return model
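
To make the effect of `minCount` concrete, here is a minimal pure-Python sketch of vocabulary selection by corpus-wide token count. It only illustrates the idea of a minimum-count threshold and is not Spark's implementation; `build_vocabulary` and the toy documents are invented for the example.

```python
from collections import Counter

def build_vocabulary(documents, min_count=3):
    """Keep tokens that occur at least `min_count` times across all documents."""
    counts = Counter(token for doc in documents for token in doc)
    return {word for word, n in counts.items() if n >= min_count}

toy_documents = [
    ['space', 'orbit', 'space'],
    ['space', 'bike'],
    ['bike', 'orbit', 'bike'],
]
vocabulary = build_vocabulary(toy_documents, min_count=3)
print(sorted(vocabulary))  # ['bike', 'space']
```

Here `orbit` appears only twice, so it gets no vector. In the notebook, any token excluded this way is simply skipped when document vectors are built.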

In [6]:
# PART 3: IMPROVED DOCUMENT VECTORIZATION

def create_document_vectors_improved(df_processed, word2vec_model):
    """Create document vectors as TF-weighted averages of Word2Vec word vectors"""

    # Get word vectors
    word_vectors = word2vec_model.getVectors()
    word_vectors_dict = {row['word']: row['vector'] for row in word_vectors.collect()}
    word_vectors_broadcast = spark.sparkContext.broadcast(word_vectors_dict)

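    # Weight each in-vocabulary word by its term frequency (TF) in the document
    def calculate_tf_idf_weighted_average(tokens):
        """Calculate the TF-weighted average of word vectors (no IDF term is applied, despite the name)"""
        if not tokens:
            return Vectors.dense([0.0] * 200)  # 200 must match Word2Vec vectorSize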

        # Count word frequencies in document
        word_counts = {}
        for word in tokens:
            word_counts[word] = word_counts.get(word, 0) + 1

        doc_length = len(tokens)
        vectors = []
        weights = []

        for word, count in word_counts.items():
            if word in word_vectors_broadcast.value:
                tf = count / doc_length  # Term frequency
                vectors.append(word_vectors_broadcast.value[word])
                weights.append(tf)

        if not vectors:
            return Vectors.dense([0.0] * 200)

        # Weighted average
        vectors = np.array(vectors)
        weights = np.array(weights)
        weights = weights / np.sum(weights)  # Normalize weights

        weighted_avg = np.average(vectors, axis=0, weights=weights)
        return Vectors.dense(weighted_avg)

    # Register UDF
    vectorize_udf = udf(calculate_tf_idf_weighted_average, VectorUDT())

    # Apply vectorization
    df_vectors = df_processed.withColumn("features", vectorize_udf(col("tokens")))

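    # Drop rows whose feature vector is null. The UDF returns an all-zero
    # vector (never null) for empty or out-of-vocabulary documents, so
    # zero vectors are NOT removed by this filter.
    df_vectors = df_vectors.filter(
        ~(col("features").isNull())
    )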

    print(f"Document vectors created for {df_vectors.count()} documents")
    return df_vectors.select('label', 'features')
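
The weighting inside `calculate_tf_idf_weighted_average` uses term frequency only. Because the weights are renormalized to sum to 1, the division by `doc_length` cancels out and each in-vocabulary word ends up weighted by its raw count. The sketch below shows this with plain Python lists and a two-dimensional toy embedding table. The `idf` argument is a hypothetical extension, not part of the notebook, showing where an IDF factor could enter if one were computed separately.

```python
from collections import Counter

def tf_weighted_average(tokens, embeddings, idf=None):
    """TF-weighted (optionally TF-IDF-weighted) average of word vectors.

    `embeddings` maps word -> list of floats. `idf` is an optional,
    hypothetical word -> IDF weight mapping (missing words default to 1.0).
    Returns None when no token is in the vocabulary.
    """
    counts = Counter(t for t in tokens if t in embeddings)
    if not counts:
        return None
    weights = {w: c * (idf.get(w, 1.0) if idf else 1.0) for w, c in counts.items()}
    total = sum(weights.values())
    if total == 0:
        return None
    dim = len(next(iter(embeddings.values())))
    doc_vector = [0.0] * dim
    for word, weight in weights.items():
        for i, value in enumerate(embeddings[word]):
            doc_vector[i] += (weight / total) * value
    return doc_vector

toy_embeddings = {'space': [1.0, 0.0], 'orbit': [0.0, 1.0]}
toy_tokens = ['space', 'space', 'orbit', 'unknownword']
doc_vector = tf_weighted_average(toy_tokens, toy_embeddings)
print(doc_vector)  # roughly [0.667, 0.333]
```

With TF weights alone, the result equals the plain mean of the vectors of all in-vocabulary token occurrences. An IDF factor would be needed to down-weight words that are common across the whole corpus.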


In [7]:
# PART 4: IMPROVED CLASSIFICATION
# ============================================================================

def train_improved_classifier(df_features):
    """Train multiple classifiers and compare performance"""

    # Random 80/20 split (randomSplit is not stratified, and the resulting sizes are approximate)
    train_data, test_data = df_features.randomSplit([0.8, 0.2], seed=42)

    print(f"Training set size: {train_data.count()}")
    print(f"Test set size: {test_data.count()}")

    # Train Logistic Regression with better parameters
    lr = LogisticRegression(
        featuresCol='features',
        labelCol='label',
        maxIter=100,           # More iterations
        regParam=0.01,         # Overall regularization strength
        elasticNetParam=0.1,   # L1/L2 mix: 0 = pure L2, 1 = pure L1
        standardization=True   # Feature standardization
    )

    print("Training Logistic Regression...")
    lr_model = lr.fit(train_data)

    # Train Random Forest as alternative
    rf = RandomForestClassifier(
        featuresCol='features',
        labelCol='label',
        numTrees=50,          # More trees
        maxDepth=10,          # Deeper trees
        seed=42
    )

    print("Training Random Forest...")
    rf_model = rf.fit(train_data)

    return lr_model, rf_model, train_data, test_data

def evaluate_models(models, test_data, target_names):
    """Comprehensive model evaluation"""

    evaluator_acc = MulticlassClassificationEvaluator(
        labelCol="label",
        predictionCol="prediction",
        metricName="accuracy"
    )

    evaluator_f1 = MulticlassClassificationEvaluator(
        labelCol="label",
        predictionCol="prediction",
        metricName="f1"
    )

    evaluator_precision = MulticlassClassificationEvaluator(
        labelCol="label",
        predictionCol="prediction",
        metricName="weightedPrecision"
    )

    evaluator_recall = MulticlassClassificationEvaluator(
        labelCol="label",
        predictionCol="prediction",
        metricName="weightedRecall"
    )

    results = {}

    for name, model in models.items():
        print(f"\n=== {name} Results ===")

        # Make predictions
        predictions = model.transform(test_data)

        # Calculate metrics
        accuracy = evaluator_acc.evaluate(predictions)
        f1 = evaluator_f1.evaluate(predictions)
        precision = evaluator_precision.evaluate(predictions)
        recall = evaluator_recall.evaluate(predictions)

        results[name] = {
            'accuracy': accuracy,
            'f1': f1,
            'precision': precision,
            'recall': recall,
            'predictions': predictions
        }

        print(f"Accuracy: {accuracy:.4f}")
        print(f"F1 Score: {f1:.4f}")
        print(f"Precision: {precision:.4f}")
        print(f"Recall: {recall:.4f}")

        # Show some example predictions
        print("\nSample Predictions:")
        predictions.select('label', 'prediction').show(10)

    return results
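
The `randomSplit` call in `train_improved_classifier` assigns rows to the two sets without looking at their labels, so class proportions can drift between train and test. The following pure-Python sketch shows what a stratified split means at the index level. It illustrates the idea under simple assumptions (labels held in a list, a hypothetical `stratified_split` helper) and is not a drop-in replacement for the Spark code. On the Spark side, `DataFrame.sampleBy` is one way to draw per-label samples.

```python
import random
from collections import defaultdict

def stratified_split(labels, train_fraction=0.8, seed=42):
    """Split row indices so each label keeps roughly the same train/test ratio."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)

toy_labels = [0] * 10 + [1] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))  # 12 3
```

Each label contributes 80% of its rows to the training set, so a rare class cannot end up missing from the test set by chance.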


In [8]:
# PART 5: MAIN EXECUTION
# ============================================================================

def main():
    """Main execution function"""

    print("=== IMPROVED TEXT CLASSIFICATION MODEL ===\n")

    # Load and preprocess data
    print("1. Loading and preprocessing data...")
    df_labeled, target_names = load_and_preprocess_data()
    df_processed = advanced_text_preprocessing(df_labeled)

    # Train Word2Vec
    print("\n2. Training improved Word2Vec model...")
    word2vec_model = train_improved_word2vec(df_processed)

    # Create document vectors
    print("\n3. Creating document vectors...")
    df_features = create_document_vectors_improved(df_processed, word2vec_model)

    # Train classifiers
    print("\n4. Training classifiers...")
    lr_model, rf_model, train_data, test_data = train_improved_classifier(df_features)

    # Evaluate models
    print("\n5. Evaluating models...")
    models = {
        'Logistic Regression': lr_model,
        'Random Forest': rf_model
    }

    results = evaluate_models(models, test_data, target_names)

    # Summary comparison
    print("\n=== PERFORMANCE COMPARISON ===")
    print(f"{'Model':<20} {'Accuracy':<10} {'F1 Score':<10} {'Precision':<12} {'Recall':<10}")
    print("-" * 62)
    for name, metrics in results.items():
        print(f"{name:<20} {metrics['accuracy']:<10.4f} {metrics['f1']:<10.4f} "
              f"{metrics['precision']:<12.4f} {metrics['recall']:<10.4f}")

    return results, word2vec_model

# Run the improved model
if __name__ == "__main__":
    results, model = main()


=== IMPROVED TEXT CLASSIFICATION MODEL ===

1. Loading and preprocessing data...
Dataset loaded with 4405 documents
Target classes:
0: alt.atheism
1: comp.graphics
2: rec.motorcycles
3: sci.space
4: talk.politics.guns
After preprocessing: 4401 documents remain

2. Training improved Word2Vec model...
Training Word2Vec model...
Word2Vec vocabulary size: 14601

3. Creating document vectors...
Document vectors created for 4401 documents

4. Training classifiers...
Training set size: 3575
Test set size: 826
Training Logistic Regression...
Training Random Forest...

5. Evaluating models...

=== Logistic Regression Results ===
Accuracy: 0.8838
F1 Score: 0.8840
Precision: 0.8844
Recall: 0.8838

Sample Predictions:
+-----+----------+
|label|prediction|
+-----+----------+
|    0|       2.0|
|    0|       4.0|
|    0|       0.0|
|    0|       0.0|
|    0|       2.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
|    0|       0.0|
+-----+----------+
only showing top 1
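
In the Logistic Regression results above, Recall (0.8838) equals Accuracy (0.8838). This is expected for a support-weighted recall: each class's recall is weighted by that class's share of the true labels, and the weighted sum reduces to the overall fraction of correct predictions. The sketch below checks this on a small set of `(label, prediction)` pairs and also builds a confusion matrix for per-class metrics. The first ten pairs are the sample predictions printed above, the remaining five are invented for illustration, and the helper names are hypothetical.

```python
def confusion_matrix(pairs, num_classes):
    """Rows are true labels, columns are predicted labels."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for label, prediction in pairs:
        matrix[int(label)][int(prediction)] += 1
    return matrix

def per_class_precision_recall(matrix):
    """Return one (precision, recall) tuple per class; 0.0 when undefined."""
    metrics = []
    for c in range(len(matrix)):
        true_positive = matrix[c][c]
        actual = sum(matrix[c])
        predicted = sum(row[c] for row in matrix)
        precision = true_positive / predicted if predicted else 0.0
        recall = true_positive / actual if actual else 0.0
        metrics.append((precision, recall))
    return metrics

def accuracy(matrix):
    total = sum(sum(row) for row in matrix)
    return sum(matrix[c][c] for c in range(len(matrix))) / total

def weighted_recall(matrix):
    """Per-class recall weighted by each class's share of the true labels."""
    total = sum(sum(row) for row in matrix)
    metrics = per_class_precision_recall(matrix)
    return sum((sum(row) / total) * recall
               for row, (_, recall) in zip(matrix, metrics))

# First ten pairs: the sample predictions shown in the output above.
shown_pairs = [(0, 2.0), (0, 4.0), (0, 0.0), (0, 0.0), (0, 2.0),
               (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0), (0, 0.0)]
# Invented pairs, added only so that more than one class has support.
invented_pairs = [(1, 1.0), (1, 1.0), (1, 3.0), (3, 3.0), (3, 3.0)]

matrix = confusion_matrix(shown_pairs + invented_pairs, num_classes=5)
class_metrics = per_class_precision_recall(matrix)
print(matrix[0])                                   # [7, 0, 2, 0, 1]
print(accuracy(matrix), weighted_recall(matrix))   # both are 11/15
```

The first row of the matrix shows that, among the ten displayed `alt.atheism` documents, two were predicted as class 2 and one as class 4. An aggregate score hides this kind of per-class detail.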

"""
Further improvements you can implement:

1. FEATURE ENGINEERING:
   - Use TF-IDF vectors combined with Word2Vec
   - Add document length features
   - Use n-gram features
   - Implement Doc2Vec instead of averaging Word2Vec

2. ADVANCED MODELS:
   - Gradient Boosting (XGBoost)
   - Neural Networks
   - Ensemble methods

3. HYPERPARAMETER TUNING:
   - Use CrossValidator for parameter optimization
   - Grid search for best parameters

4. DATA AUGMENTATION:
   - Text augmentation techniques
   - Balancing classes with SMOTE

5. ADVANCED PREPROCESSING:
   - Lemmatization instead of just lowercasing
   - Named Entity Recognition
   - Part-of-speech filtering

6. EVALUATION IMPROVEMENTS:
   - Confusion matrix analysis
   - Per-class metrics
   - Cross-validation
"""

### PART 6: PREDICTING ON NEW TEXT

This section adds functionality to predict the category of a new, unseen text document using the trained models.