# Phase 2: Text Preprocessing + Embeddings

This notebook walks through the text preprocessing and embedding extraction pipeline described in the research paper.

## What we'll do:
1. **Text Preprocessing**: Clean, tokenize, remove stopwords, apply stemming
2. **spaCy Embeddings**: 300-dimensional word2vec-based features
3. **BERT Embeddings**: 768-dimensional contextual features
4. **Save/Load**: Cache embeddings for reuse
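
To make step 1 concrete, here is a minimal, standard-library-only sketch of the clean, tokenize, stopword-removal, and stemming sequence. It is not the project's `TextPreprocessor`: the stopword list is a tiny illustrative set, and `naive_stem` is a hypothetical stand-in for a real stemmer such as Porter's, so its output will differ from a proper implementation.

```python
import re
from typing import List

# Tiny illustrative stopword set (assumption); real pipelines use a fuller list.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "were",
             "in", "on", "of", "to", "and", "that"}


def naive_stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (hypothetical helper)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters of the stem remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> List[str]:
    """Clean, tokenize, drop stopwords, then stem each remaining token."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)      # keep letters and whitespace only
    tokens = text.split()                      # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


print(preprocess("The senators were voting on the bills! http://example.com"))
# → ['senator', 'vot', 'bill']
```

The truncated stem `vot` shows why a crude stripper is only useful for illustration; a real stemmer applies more careful rules.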

## Paper Reference:
- Text preprocessing follows the methodology in Section 6.2.1
- spaCy features: 300-dimensional word2vec vectors
- BERT features: 768-dimensional contextual embeddings
- The spaCy and BERT features serve as the F1 and F2 feature sets in the paper
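
A fixed 300-dimensional feature per article means per-token vectors have to be pooled into one vector. One common way is to average them, which is also what spaCy's `Doc.vector` does by default. Whether the paper pools exactly this way is an assumption. The sketch below uses a hypothetical three-dimensional toy vocabulary in place of real 300-dimensional vectors and, as a choice made here, skips tokens that have no vector.

```python
from typing import Dict, List, Sequence


def mean_pool(tokens: Sequence[str],
              vectors: Dict[str, List[float]],
              dim: int) -> List[float]:
    """Average the vectors of known tokens; return a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    # zip(*known) walks the vectors dimension by dimension.
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical toy vocabulary; real spaCy vectors would have 300 dimensions.
toy_vectors = {
    "senator": [1.0, 0.0, 2.0],
    "bill": [3.0, 2.0, 0.0],
}

doc_vector = mean_pool(["senator", "vot", "bill"], toy_vectors, dim=3)
print(doc_vector)
# → [2.0, 1.0, 1.0]
```

The same idea applies to BERT, where the 768-dimensional per-token outputs are typically reduced to a single vector, for example by averaging or by taking the `[CLS]` position.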


In [None]:
import sys
import os

# Make the project's src/ directory importable from the notebook
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from fake_news.features.text_preprocessing import TextPreprocessor, preprocess_news_data
from fake_news.features.embeddings import EmbeddingExtractor, create_synthetic_data
from fake_news.utils.logging import get_logger
from fake_news.utils.paths import PROCESSED_DIR

logger = get_logger('notebook')
logger.info('Phase 2: Text Preprocessing + Embeddings')
