## Summary

✅ **Feature Extraction Complete!**

**What was generated:**
1. **Sentence Embeddings** (384-dimensional)
   - Using all-MiniLM-L6-v2 model
   - Fast & efficient pre-trained model
   - Saved for train/test/val sets

2. **TF-IDF Features** (5000 features)
   - Unigrams + Bigrams
   - English stopwords removed
   - Sublinear TF scaling

3. **Text Features** (7 features)
   - Claim length, word count
   - Unique words, word diversity
   - Uppercase letters, digits, sentences

4. **Combined Feature Datasets**
   - About 5,391 features per sample (384 + up to 5,000 + 7), plus the `claim_id`, `label`, and `claim` columns
   - Ready for ML model training

**Saved Files:**
- `models/sentence_embedding_model/` - Embedding model
- `models/train_embeddings.npy` - Training embeddings
- `models/test_embeddings.npy` - Test embeddings
- `models/val_embeddings.npy` - Validation embeddings
- `models/tfidf_vectorizer.pkl` - TF-IDF vectorizer
- `preprocessed_data/train_features.csv` - Combined features
- `preprocessed_data/test_features.csv` - Combined features
- `preprocessed_data/val_features.csv` - Combined features
- `models/feature_metadata.json` - Feature metadata
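
Before moving on to the next notebook, it can help to confirm that every file listed above actually exists on disk. The sketch below is one way to do that with the standard library. The helper name `find_missing_artifacts` and the assumption that it runs from the project root are illustrative, not part of the original notebook.

```python
from pathlib import Path

# Paths copied from the "Saved Files" list; adjust if your layout differs.
EXPECTED_ARTIFACTS = [
    "models/sentence_embedding_model",
    "models/train_embeddings.npy",
    "models/test_embeddings.npy",
    "models/val_embeddings.npy",
    "models/tfidf_vectorizer.pkl",
    "preprocessed_data/train_features.csv",
    "preprocessed_data/test_features.csv",
    "preprocessed_data/val_features.csv",
    "models/feature_metadata.json",
]


def find_missing_artifacts(root="."):
    """Return the expected artifact paths that do not exist under `root`.

    Hypothetical helper for illustration; not defined in the notebook.
    """
    root_path = Path(root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root_path / rel).exists()]


if __name__ == "__main__":
    missing = find_missing_artifacts(".")
    if missing:
        print("Missing artifacts:")
        for rel in missing:
            print(f"   - {rel}")
    else:
        print("All expected artifacts are present.")
```

An empty result means the later notebooks should find everything they need.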

**Next Steps:**
- Notebook 04: Semantic Similarity Model
- Notebook 05: NLI Model Training 🚀

In [None]:
# Save embedding model and metadata
import json

print("💾 Saving models and metadata...")

# Save embedding model
embedding_model.save('./models/sentence_embedding_model')
print("   ✅ Embedding model saved")

# Save metadata
metadata = {
    'embedding_model': model_name,
    'embedding_dimension': embedding_model.get_sentence_embedding_dimension(),
    'tfidf_max_features': 5000,
    'tfidf_ngram_range': [1, 2],
    'train_samples': len(train_embeddings),
    'test_samples': len(test_embeddings),
    'val_samples': len(val_embeddings),
    'feature_description': {
        'embeddings': 'Pre-trained sentence embeddings (384-dim)',
        'tfidf': 'TF-IDF vectorization (5000 features)',
        'claim_length': 'Length of claim in characters',
        'claim_word_count': 'Number of words in claim',
        'unique_words': 'Unique word count',
        'word_diversity': 'Ratio of unique words to total words',
        'num_uppercase': 'Count of uppercase letters',
        'num_digits': 'Count of digits',
        'num_sentences': 'Estimated number of sentences'
    }
}

with open('./models/feature_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=4)

print("   ✅ Metadata saved")

# Summary
print("\n" + "="*60)
print("FEATURE EXTRACTION SUMMARY")
print("="*60)
print(f"\n✅ Embeddings Generated:")
print(f"   - Model: {model_name}")
print(f"   - Dimension: {embedding_model.get_sentence_embedding_dimension()}")
print(f"   - Train: {train_embeddings.shape}")
print(f"   - Test: {test_embeddings.shape}")
print(f"   - Val: {val_embeddings.shape}")

print(f"\n✅ Additional Features:")
print(f"   - TF-IDF vectors (5000 features)")
print(f"   - Text statistics (7 features)")
print(f"   - Total feature columns: {train_feature_df.shape[1]}")

print(f"\n✅ Models Saved:")
print(f"   - Sentence Transformers model")
print(f"   - TF-IDF vectorizer")
print(f"   - Embedding arrays (.npy files)")
print(f"   - Feature datasets (.csv files)")
print(f"   - Metadata (.json file)")

💾 Saving models and metadata...


NameError: name 'embedding_model' is not defined
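
The `NameError` above indicates that the cell ran in a kernel where `embedding_model` had not been created, most likely because the Step 4 cell was skipped or the kernel was restarted. A small guard at the top of the save cell can report this up front. The following is a sketch under that assumption; `require_names` is a hypothetical helper, not something the notebook defines.

```python
def require_names(namespace, names):
    """Raise a RuntimeError listing every name in `names` missing from `namespace`.

    Hypothetical helper; in a notebook, pass `globals()` as the namespace.
    """
    missing = [name for name in names if name not in namespace]
    if missing:
        raise RuntimeError(
            "Run the earlier cells first; undefined: " + ", ".join(missing)
        )


# Names the save cell relies on (taken from the cell above).
REQUIRED = [
    "embedding_model",
    "model_name",
    "train_embeddings",
    "test_embeddings",
    "val_embeddings",
    "train_feature_df",
]

# Demonstration with a namespace that mimics a freshly restarted kernel.
try:
    require_names({"model_name": "all-MiniLM-L6-v2"}, REQUIRED)
except RuntimeError as err:
    print(err)
```

In the notebook itself the call would be `require_names(globals(), REQUIRED)`.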

## Step 8: Save Model & Metadata

In [None]:
# Analyze feature correlations
print("📊 Feature Correlation Analysis...")

# Select non-embedding features for analysis
text_features_cols = ['claim_length', 'claim_word_count', 'unique_words', 'word_diversity',
                      'num_uppercase', 'num_digits', 'num_sentences']

text_features_train = train_feature_df[text_features_cols + ['label']].copy()

# Convert labels to numeric for correlation
label_to_numeric = {'REAL': 0, 'FAKE': 1, 'NOT_ENOUGH_INFO': 2}
text_features_train['label_numeric'] = text_features_train['label'].map(label_to_numeric)

# Calculate correlation
correlation_matrix = text_features_train[text_features_cols + ['label_numeric']].corr()

# Plot correlation
fig, ax = plt.subplots(figsize=(10, 8))
im = ax.imshow(correlation_matrix, cmap='coolwarm', aspect='auto', vmin=-1, vmax=1)

ax.set_xticks(range(len(correlation_matrix.columns)))
ax.set_yticks(range(len(correlation_matrix.columns)))
ax.set_xticklabels(correlation_matrix.columns, rotation=45, ha='right')
ax.set_yticklabels(correlation_matrix.columns)

# Add values to heatmap
for i in range(len(correlation_matrix)):
    for j in range(len(correlation_matrix)):
        text = ax.text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                      ha="center", va="center", color="black", fontsize=8)

plt.colorbar(im, ax=ax, label='Correlation')
ax.set_title('Feature Correlation Matrix')
plt.tight_layout()
plt.show()

print("✅ Feature correlation analyzed!")
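
Mapping `REAL`, `FAKE`, and `NOT_ENOUGH_INFO` to 0, 1, and 2 treats the labels as if they were ordered, so the `label_numeric` row of the matrix is hard to interpret. One alternative is to correlate each feature with a 0/1 indicator per class. The sketch below shows the idea in plain Python on made-up numbers; the function names and the toy data are assumptions for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant input
    return cov / sqrt(var_x * var_y)


def one_vs_rest_correlations(feature_values, labels):
    """Correlate a feature with a 0/1 indicator for each class label."""
    result = {}
    for cls in sorted(set(labels)):
        indicator = [1.0 if lab == cls else 0.0 for lab in labels]
        result[cls] = pearson(feature_values, indicator)
    return result


# Toy data, invented for illustration (not taken from the dataset).
claim_length = [40, 55, 120, 130, 80, 85]
labels = ["REAL", "REAL", "FAKE", "FAKE", "NOT_ENOUGH_INFO", "NOT_ENOUGH_INFO"]

for cls, r in one_vs_rest_correlations(claim_length, labels).items():
    print(f"{cls}: {r:+.2f}")
```

In the notebook, the same indicators could be built with `pd.get_dummies` on the `label` column and appended to the frame before calling `.corr()`.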

## Step 7: Feature Correlation & Importance

In [None]:
# Create feature matrices with additional features
def create_feature_dataset(df, tfidf_matrix, embeddings, label_col='label'):
    """Combine all features into one dataset"""
    
    # Convert TF-IDF to dense array (memory-heavy for large datasets)
    tfidf_dense = tfidf_matrix.toarray()

    # Build the embedding and TF-IDF blocks separately, then join them once;
    # adding thousands of columns one by one fragments the DataFrame
    embedding_df = pd.DataFrame(embeddings, columns=[f'embedding_{i}' for i in range(embeddings.shape[1])])
    tfidf_df = pd.DataFrame(tfidf_dense, columns=[f'tfidf_{i}' for i in range(tfidf_dense.shape[1])])
    feature_df = pd.concat([embedding_df, tfidf_df], axis=1)
    
    # Add basic text features
    text_features = add_text_features(df)
    feature_cols = ['claim_length', 'claim_word_count', 'unique_words', 'word_diversity',
                    'num_uppercase', 'num_digits', 'num_sentences']
    
    for col in feature_cols:
        feature_df[col] = text_features[col].values
    
    # Add label and metadata
    feature_df['claim_id'] = df['claim_id'].values
    feature_df['label'] = df['label'].values
    feature_df['claim'] = df['claim'].values
    
    return feature_df

print("📦 Creating combined feature datasets...")
train_feature_df = create_feature_dataset(train_df, train_tfidf, train_embeddings)
test_feature_df = create_feature_dataset(test_df, test_tfidf, test_embeddings)
val_feature_df = create_feature_dataset(val_df, val_tfidf, val_embeddings)

print(f"✅ Feature datasets created!")
print(f"   Train shape: {train_feature_df.shape}")
print(f"   Test shape: {test_feature_df.shape}")
print(f"   Val shape: {val_feature_df.shape}")

# Save feature datasets
train_feature_df.to_csv('./preprocessed_data/train_features.csv', index=False)
test_feature_df.to_csv('./preprocessed_data/test_features.csv', index=False)
val_feature_df.to_csv('./preprocessed_data/val_features.csv', index=False)

print("\n✅ Feature datasets saved to ./preprocessed_data/")
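
Calling `toarray()` on the TF-IDF matrix stores every zero explicitly, so memory grows with rows × columns. A rough estimate is easy to compute up front. The sketch below assumes 8-byte `float64` values for every column and uses a made-up row count; `dense_size_mb` is a hypothetical helper.

```python
def dense_size_mb(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in MiB of a dense n_rows x n_cols numeric array."""
    return n_rows * n_cols * bytes_per_value / (1024 ** 2)


# 5000 TF-IDF columns + 384 embedding dimensions + 7 text statistics.
n_feature_cols = 5000 + 384 + 7

# The row count here is an assumed example, not the real dataset size.
example_rows = 10_000
print(f"~{dense_size_mb(example_rows, n_feature_cols):.0f} MiB for {example_rows} rows")
```

If the estimate is uncomfortably large, keeping the TF-IDF block sparse and saving it separately is one option to consider.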

## Step 6: Create Combined Feature Datasets

In [None]:
# Analyze embeddings
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

print("📊 Embedding Statistics (Training set):")
print(f"   Mean norm: {np.linalg.norm(train_embeddings, axis=1).mean():.4f}")
print(f"   Std norm: {np.linalg.norm(train_embeddings, axis=1).std():.4f}")
print(f"   Min mean value: {train_embeddings.mean(axis=0).min():.4f}")
print(f"   Max mean value: {train_embeddings.mean(axis=0).max():.4f}")

# Dimensionality reduction for visualization
print("\n🎨 Reducing dimensions for visualization...")
pca = PCA(n_components=2)
train_embeddings_2d = pca.fit_transform(train_embeddings[:5000])  # Use subset for speed

print(f"   Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# Plot embeddings by label
fig, ax = plt.subplots(figsize=(12, 8))

labels_unique = train_df['label'].unique()
colors = {'REAL': 'green', 'FAKE': 'red', 'NOT_ENOUGH_INFO': 'orange'}

for label in labels_unique:
    indices = (train_df['label'].iloc[:5000] == label).to_numpy()  # plain boolean mask for NumPy indexing
    ax.scatter(
        train_embeddings_2d[indices, 0],
        train_embeddings_2d[indices, 1],
        c=colors.get(label, 'blue'),
        label=label,
        alpha=0.6,
        s=30
    )

ax.set_xlabel('PCA Component 1')
ax.set_ylabel('PCA Component 2')
ax.set_title('Embedding Space Visualization (PCA 2D)')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("✅ Visualization complete!")

## Step 5: Embedding Analysis & Visualization

In [None]:
# Sentence Transformers for semantic embeddings
from sentence_transformers import SentenceTransformer

print("🧠 Loading pre-trained sentence transformer model...")
# Using a lightweight model (better for inference speed)
model_name = 'all-MiniLM-L6-v2'  # Fast and efficient (~22MB)
# Alternative: 'all-mpnet-base-v2' (better quality, larger)
embedding_model = SentenceTransformer(model_name)

print(f"   Model: {model_name}")
print(f"   Embedding dimension: {embedding_model.get_sentence_embedding_dimension()}")

# Generate embeddings
print("\n📥 Generating embeddings for training set...")
train_embeddings = embedding_model.encode(
    train_df['claim'].tolist(),
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=64
)

print("📥 Generating embeddings for test set...")
test_embeddings = embedding_model.encode(
    test_df['claim'].tolist(),
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=64
)

print("📥 Generating embeddings for val set...")
val_embeddings = embedding_model.encode(
    val_df['claim'].tolist(),
    show_progress_bar=True,
    convert_to_numpy=True,
    batch_size=64
)

print(f"\n✅ Embeddings generated!")
print(f"   Train embeddings shape: {train_embeddings.shape}")
print(f"   Test embeddings shape: {test_embeddings.shape}")
print(f"   Val embeddings shape: {val_embeddings.shape}")

# Save embeddings
np.save('./models/train_embeddings.npy', train_embeddings)
np.save('./models/test_embeddings.npy', test_embeddings)
np.save('./models/val_embeddings.npy', val_embeddings)

print("\n✅ Embeddings saved!")
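
The saved embeddings feed the semantic similarity work in the next notebook, where vectors are typically compared with cosine similarity. As a minimal sketch of that comparison, written in plain Python so it runs without the model, the example below uses short made-up vectors in place of real 384-dimensional embeddings.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 4-dimensional vectors standing in for sentence embeddings.
claim_vec = [0.2, 0.1, 0.7, 0.0]
evidence_vec = [0.25, 0.05, 0.65, 0.1]
unrelated_vec = [-0.6, 0.3, 0.0, 0.7]

print(f"claim vs evidence:  {cosine_similarity(claim_vec, evidence_vec):.3f}")
print(f"claim vs unrelated: {cosine_similarity(claim_vec, unrelated_vec):.3f}")
```

On real embeddings the same computation would normally be vectorised with NumPy rather than looped in Python.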

## Step 4: Sentence Embeddings (Pre-trained Models)

In [None]:
# TF-IDF Vectorization
print("🔤 Creating TF-IDF vectorizer...")
tfidf_vectorizer = TfidfVectorizer(
    max_features=5000,           # Limit to top 5000 features
    min_df=2,                    # Minimum document frequency
    max_df=0.8,                  # Maximum document frequency
    ngram_range=(1, 2),          # Unigrams and bigrams
    stop_words='english',
    sublinear_tf=True
)

# Fit on training data
print("  Fitting on training data...")
train_tfidf = tfidf_vectorizer.fit_transform(train_df['claim'])

# Transform test and val data
print("  Transforming test and val data...")
test_tfidf = tfidf_vectorizer.transform(test_df['claim'])
val_tfidf = tfidf_vectorizer.transform(val_df['claim'])

print(f"\n✅ TF-IDF vectorization complete!")
print(f"   Train TF-IDF shape: {train_tfidf.shape}")
print(f"   Test TF-IDF shape: {test_tfidf.shape}")
print(f"   Val TF-IDF shape: {val_tfidf.shape}")
print(f"   Vocabulary size: {len(tfidf_vectorizer.get_feature_names_out())}")

# Save TF-IDF vectorizer
import pickle
os.makedirs('./models', exist_ok=True)
with open('./models/tfidf_vectorizer.pkl', 'wb') as f:  # context manager closes the file
    pickle.dump(tfidf_vectorizer, f)
print("\n✅ TF-IDF vectorizer saved!")
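
With `sublinear_tf=True`, scikit-learn replaces each raw term count `tf` with `1 + ln(tf)` before applying IDF weighting, which dampens the effect of a term repeated many times in one claim. The short sketch below only illustrates that scaling step on invented counts; it does not reproduce the full vectorizer (IDF weighting and L2 normalisation are left out).

```python
from math import log


def sublinear_tf(count):
    """Scale a raw term count as 1 + ln(count); zero counts stay zero."""
    return 1.0 + log(count) if count > 0 else 0.0


# Invented term counts for a single document.
raw_counts = {"vaccine": 1, "study": 3, "claims": 10}

for term, count in raw_counts.items():
    print(f"{term:8s} raw={count:2d}  scaled={sublinear_tf(count):.2f}")
```

A term that appears ten times ends up weighted only a few times higher than a term that appears once, rather than ten times higher.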

## Step 3: TF-IDF Vectorization

In [None]:
# Feature engineering functions
def add_text_features(df):
    """Add basic text features"""
    df_features = df.copy()
    
    # Length features
    df_features['claim_length'] = df_features['claim'].str.len()
    df_features['claim_word_count'] = df_features['claim'].str.split().str.len()
    
    # Vocabulary features
    df_features['unique_words'] = df_features['claim'].apply(lambda x: len(set(x.split())))
    df_features['word_diversity'] = df_features['unique_words'] / (df_features['claim_word_count'] + 1)
    
    # Punctuation & special characters
    df_features['num_uppercase'] = df_features['claim'].apply(lambda x: sum(1 for c in x if c.isupper()))
    df_features['num_digits'] = df_features['claim'].apply(lambda x: sum(1 for c in x if c.isdigit()))
    df_features['num_sentences'] = df_features['claim'].apply(lambda x: x.count('.') + x.count('!') + x.count('?'))
    
    return df_features

# Apply feature engineering
print("🔧 Adding text features...")
train_features = add_text_features(train_df)
test_features = add_text_features(test_df)
val_features = add_text_features(val_df)

print("✅ Text features added!")
print("\n📊 Feature statistics (Training set):")
print(train_features[['claim_length', 'claim_word_count', 'unique_words', 'word_diversity']].describe())
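
To make the definitions in `add_text_features` concrete, the sketch below applies the same formulas to a single string without pandas. The function name `text_stats` and the sample sentence are invented for illustration. Note that `word_diversity` divides by the word count plus one, so it stays slightly below the plain unique-to-total ratio.

```python
def text_stats(claim):
    """Compute the seven text features for one claim string."""
    words = claim.split()
    unique_words = len(set(words))
    return {
        "claim_length": len(claim),
        "claim_word_count": len(words),
        "unique_words": unique_words,
        # +1 in the denominator mirrors add_text_features
        "word_diversity": unique_words / (len(words) + 1),
        "num_uppercase": sum(1 for c in claim if c.isupper()),
        "num_digits": sum(1 for c in claim if c.isdigit()),
        # Counts ., ! and ? as sentence terminators (a rough estimate)
        "num_sentences": claim.count(".") + claim.count("!") + claim.count("?"),
    }


sample = "The Moon landing happened in 1969. Really?"
for name, value in text_stats(sample).items():
    print(f"{name}: {value}")
```

Because `num_sentences` simply counts punctuation, abbreviations and decimal numbers will inflate it, which is why the metadata describes it as an estimate.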

## Step 2: Text Feature Engineering

In [None]:
# Load preprocessed data
train_df = pd.read_csv('./preprocessed_data/train_data.csv')
test_df = pd.read_csv('./preprocessed_data/test_data.csv')
val_df = pd.read_csv('./preprocessed_data/val_data.csv')

print("✅ Data loaded!")
print(f"Train: {train_df.shape}")
print(f"Test: {test_df.shape}")
print(f"Val: {val_df.shape}")

print("\n📊 Sample data:")
print(train_df.head(2))

## Step 1: Load Preprocessed Data

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import warnings
warnings.filterwarnings('ignore')

# Download NLTK data
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

print("✅ All libraries imported successfully!")

In [None]:
# Install required packages for embeddings
import subprocess
import sys

# Map pip package names to their import names (they differ for some packages)
packages = {
    'pandas': 'pandas',
    'numpy': 'numpy',
    'scikit-learn': 'sklearn',
    'nltk': 'nltk',
    'gensim': 'gensim',
    'sentence-transformers': 'sentence_transformers',
}

for package, module_name in packages.items():
    try:
        __import__(module_name)
        print(f"✅ {package} already installed")
    except ImportError:
        print(f"📦 Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])
        print(f"✅ {package} installed")

# 03 - Feature Extraction & Text Embeddings
## Generate Embeddings for Claim Verification

This notebook creates text embeddings and features from preprocessed claims for semantic similarity and NLI models.