## Vectorization And Data Processing for Modeling

Now that the text data has undergone basic cleaning and exploratory analysis, it's time to move on to preparing the data for modeling. 

### Library imports

In [1]:
#-------- Core 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re
import joblib

#-------- Text Feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.model_selection import train_test_split
import spacy

#-------- Dimensionality reduction / topic modeling
from sklearn.decomposition import TruncatedSVD

#-------- Preprocessing
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

### Import Cleaned Train and Test Data

In [2]:
train_df = pd.read_csv('../data/interim/clean_train_df.csv')
test_df = pd.read_csv('../data/interim/clean_test_df.csv')

## Data Processing 

### Text Normalization 
Now that the text has had some basic cleaning, the next step is text normalization. With TF-IDF, normalization matters because a word and its capitalized form are treated as different tokens (e.g. "Dog" and "dog" count as two separate words). Since TF-IDF depends on word frequency, the same word must map to the same token regardless of capitalization, so the text is converted to lowercase.
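
To make the capitalization point concrete, here is a minimal sketch on two made-up sentences (the sample documents are illustrative assumptions, not rows from the dataset). It compares the vocabulary `CountVectorizer` learns with and without lowercasing.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents that differ only in the capitalization of "dog"
docs = ["Dog bites man", "dog bites dog"]

# With lowercasing disabled, "Dog" and "dog" become separate features
cased = CountVectorizer(lowercase=False).fit(docs)
cased_vocab = sorted(cased.vocabulary_)

# With lowercasing enabled (the default), they share a single feature
lowered = CountVectorizer(lowercase=True).fit(docs)
lowered_vocab = sorted(lowered.vocabulary_)

print(cased_vocab)
print(lowered_vocab)
```

The cased vocabulary holds one more entry than the lowercased one, which is exactly the frequency split that lowercasing avoids.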

#### Punctuation
It is also worth restricting which punctuation is kept (e.g. only allowing . , - % \$), since some characters carry meaning: business text will almost certainly contain \$ and %, and in words like "mother-in-law" the '-' distinguishes the compound from the separate words "mother", "in" and "law". Other characters such as # ( ) or * say little about the content of the text but still affect tokenization, and so can affect the model's ability to classify. Note that `normalize_text` below keeps . , \$ % and the <> token markers but does not keep the hyphen, so "ex-boss" becomes "exboss".

Words containing an apostrophe, such as possessives and contractions, also need attention ("man's" vs "man s", "it's" vs "it s"). For TF-IDF, losing the apostrophe does not matter much, because the vectorizer only counts tokens and has no notion of a word's meaning. However, a lone 's' left behind many times over would inflate the vectorization because of its high frequency, so it is removed along with the stop words. The same goes for the fragments left by other contractions ('ve 'll 're 'd 'm 't): most of them would be stop words anyway if expanded (have, will, are, etc.).



In [3]:
def normalize_text(text):
    """Normalize a single text string.

    Steps:
      1. Lowercase.
      2. Remove every character except a-z, 0-9, . , $ % space, and the
         <> markers used by the unique tokens <NUM> and <URL>.
      3. Collapse multiple spaces into a single one. This was partially done
         in EDA, but removing characters can introduce extra spaces.
      4. Strip leading/trailing whitespace.
    """
    # Lowercase
    text = text.lower()

    # Remove unwanted characters; <> is kept so the <NUM> and <URL> tokens survive
    text = re.sub(r'[^a-z0-9\.,$%<> ]+', '', text)

    # Collapse spaces and strip leading/trailing space
    text = re.sub(r'\s+', ' ', text).strip()

    return text

In [4]:
train_df['Normalized_text'] = train_df['Text'].apply(normalize_text)

test_df['Normalized_text'] = test_df['Text'].apply(normalize_text)

In [5]:
#Manually checking one example to compare text and normalized text result
temp1 = train_df['Text'][0]
temp2 = train_df['Normalized_text'][0]
print(temp1)
print(temp2)
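
As a quick sanity check on the normalization rules, here is a minimal sketch that applies the same two regular expressions to a made-up sentence (the sample string and the name `normalize_text_demo` are illustrative assumptions, not part of the dataset or the notebook). It shows that hyphens and apostrophes are deleted outright rather than replaced by a space, while `%`, `$` and the `<NUM>` marker survive.

```python
import re

def normalize_text_demo(text):
    # Same steps as normalize_text: lowercase, drop unwanted characters,
    # then collapse whitespace and strip the ends
    text = text.lower()
    text = re.sub(r'[^a-z0-9\.,$%<> ]+', '', text)
    return re.sub(r'\s+', ' ', text).strip()

sample = "Ex-boss's pay rose 5% to $<NUM>m (again)!"
result = normalize_text_demo(sample)
print(result)  # exbosss pay rose 5% to $<num>m again
```

Because the apostrophe is deleted, "boss's" collapses to "bosss" here instead of leaving a separate "s" token.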




### Tokenization
Unigrams and bigrams are the token types extracted here (every single word and every pair of adjacent words), which is standard practice.

The primary purpose of `sublinear_tf` is to prevent terms that appear an extremely high number of times in a single document from dominating the vector representation. It does this by replacing the raw term count tf with 1 + log(tf). This is a different issue from the bias caused by the overall length of the document.
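
Both settings can be illustrated on toy input. The sketch below is a minimal example under stated assumptions: the sample strings are made up, and IDF weighting and length normalization are switched off (`use_idf=False`, `norm=None`) purely to isolate the term-frequency part.

```python
import math
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# ngram_range=(1, 2) emits every single word plus every adjacent word pair
ngram_vec = CountVectorizer(ngram_range=(1, 2)).fit(["stock market falls"])
ngrams = sorted(ngram_vec.vocabulary_)
print(ngrams)

# One toy document in which "goal" appears 8 times
doc = ["goal goal goal goal goal goal goal goal match"]

# Raw term frequency vs. sublinear term frequency (1 + ln(tf))
raw = TfidfVectorizer(use_idf=False, norm=None).fit(doc)
sub = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True).fit(doc)

raw_goal = raw.transform(doc)[0, raw.vocabulary_["goal"]]
sub_goal = sub.transform(doc)[0, sub.vocabulary_["goal"]]
print(raw_goal, sub_goal, 1 + math.log(8))
```

The raw weight grows linearly with the count, while the sublinear weight grows only logarithmically, so a heavily repeated term has far less pull on the document vector.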

#### Lemmatization
Lemmatization usually adds value because:
* It reduces different word forms to their base form (e.g., running → run, better → good).
* That helps models generalize better since the same concept isn’t split into multiple features.
* Especially important for smaller corpora like BBC articles, where data sparsity can otherwise hurt clustering or classification.

One caveat: lemmatization adds processing time, whether we use spaCy (faster but heavier) or NLTK (lighter but slower). spaCy is widely used in industry, but it is worth disabling its NER and parser components, which are not needed in this project and increase computation.

If our goal is document classification (like BBC news topic classification), NER is usually not needed because:
* Entities like “Google” or “New York” will already appear as tokens in the text.
* TF–IDF, CountVectorizer, or embeddings capture them directly.
* NER would just add computational overhead without giving extra features unless we explicitly extract the entities and use them as features.

If our goal were an entity-aware task (like information extraction, knowledge graphs, search engines, or summarization), keeping NER would be very useful, because we’d want to know which tokens are entities and potentially treat them differently.

Standard practice in industry text classification pipelines:
* Don’t run NER or the parser unless they directly contribute to your features.
* For classification, lemmatization + stop word removal + vectorization is usually enough.

So, for our current BBC classification project, the lean pipeline (`disable=["parser","ner"]`) is the sensible choice.

#### Stopword removal
A custom list is needed in addition to the English stop words, to account for the contraction fragments left behind when apostrophes were removed during punctuation normalization. The English stop words can be pulled from `ENGLISH_STOP_WORDS` in `sklearn.feature_extraction.text`.


In [6]:
#Lemmatization
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])

def lemmatize_text(text):
    doc = nlp(text)
    return " ".join([token.lemma_.lower() for token in doc if token.is_alpha])


In [7]:
train_df['Norm_Lemma_text'] = train_df['Normalized_text'].apply(lemmatize_text)
test_df['Norm_Lemma_text'] = test_df['Normalized_text'].apply(lemmatize_text)

In [8]:
#Take a look
print(train_df['Norm_Lemma_text'][0])



In [9]:
# Defining Stop words
english_stopWords = list(ENGLISH_STOP_WORDS)
contractions = ["s", "t", "ve","re", "d", "ll", "m"]
custom_stopWords = english_stopWords + contractions
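
One detail worth knowing here: scikit-learn's default `token_pattern` only keeps tokens of two or more characters, so single-letter fragments such as "s" or "t" never reach the vocabulary even without a stop list, and it is the two-letter fragments ("ve", "re", "ll") that the custom list actually filters. The minimal sketch below checks this on a made-up sentence; the sample text and the variable names are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Toy sentence containing contraction fragments left after apostrophe removal
sample = ["the dog s bone we ve found they ll run"]

# Default settings: no stop list, default token_pattern (2+ character tokens)
default_vec = CountVectorizer().fit(sample)

# Same stop list construction as in the cell above
demo_stop_words = list(ENGLISH_STOP_WORDS) + ["s", "t", "ve", "re", "d", "ll", "m"]
custom_vec = CountVectorizer(stop_words=demo_stop_words).fit(sample)

print(sorted(default_vec.vocabulary_))
print(sorted(custom_vec.vocabulary_))
```

Keeping the single letters in the list is harmless, and it still matters if a custom `token_pattern` or tokenizer is ever used.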

#### Splitting Data 

Since I only have labels for the training dataset and not the testing dataset, I split the training dataset 80-20 (as is convention): the model is trained on 80% of the training data and evaluated on the remaining 20% with its corresponding labels. After training, I can predict labels for the full, unlabeled testing dataset.

This split needs to happen once the data is cleaned, normalized, and lemmatized, but before vectorizing and reducing dimensionality. If the split came after vectorization, the vocabulary, the IDF weights, and the SVD components would all have been fitted on the full labeled set, including the 20% reserved for evaluation. The validation score would then be optimistically biased, because the preprocessing has already "seen" the validation data.
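
The split-then-fit order can be sketched on a toy corpus. Everything in this example (the sentences, the labels, the 75/25 split, and the variable names) is an illustrative assumption rather than the project data. The point is only that a vectorizer fitted on the training split learns nothing from the validation split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus: four "business" and four "tech" sentences
texts = [
    "shares rise as market rallies",
    "bank profits fall on weak market",
    "oil prices lift energy shares",
    "central bank holds interest rates",
    "new phone launches with faster chip",
    "software update fixes security bug",
    "chip maker unveils laptop processor",
    "search engine adds privacy tools",
]
labels = ["business"] * 4 + ["tech"] * 4

# Split first, so the validation texts stay unseen during fitting
demo_train, demo_val, demo_y_train, demo_y_val = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Fit on the training split only; the validation split is only transformed
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(demo_train)
val_matrix = vectorizer.transform(demo_val)

vocab = set(vectorizer.vocabulary_)
train_words = {word for text in demo_train for word in text.split()}
unseen_val_words = {word for text in demo_val for word in text.split()} - train_words

print(len(vocab), val_matrix.shape)
print(unseen_val_words)
```

Words that occur only in the validation texts are simply ignored at transform time, which is the behavior we want for an honest evaluation.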

In [10]:
#review train_df
train_df.head(5)

| Unnamed: 0.1 | Unnamed: 0 | ArticleId | Text | Category | Word_count | text_length | Normalized_text | Norm_Lemma_text |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 1833 | worldcom ex-boss launches defence lawyers defe... | business | 301 | 1856 | worldcom exboss launches defence lawyers defen... | worldcom exboss launch defence lawyer defend f... |
| 1 | 1 | 154 | german business confidence slides german busin... | business | 325 | 2009 | german business confidence slides german busin... | german business confidence slide german busine... |
| 2 | 2 | 1101 | bbc poll indicates economic gloom citizens in ... | business | 514 | 3132 | bbc poll indicates economic gloom citizens in ... | bbc poll indicate economic gloom citizen in a ... |
| 3 | 3 | 1976 | lifestyle governs mobile choice faster better ... | tech | 634 | 3583 | lifestyle governs mobile choice faster better ... | lifestyle govern mobile choice fast well or fu... |
| 4 | 4 | 917 | enron bosses in $<NUM>m payout eighteen former... | business | 355 | 2189 | enron bosses in $<num>m payout eighteen former... | enron boss in num m payout eighteen former enr... |


In [11]:
train_text = train_df['Norm_Lemma_text']
train_labels = train_df['Category']

test_text = test_df['Norm_Lemma_text']

In [12]:
train_text.head(5)

0    worldcom exboss launch defence lawyer defend f...
1    german business confidence slide german busine...
2    bbc poll indicate economic gloom citizen in a ...
3    lifestyle govern mobile choice fast well or fu...
4    enron boss in num m payout eighteen former enr...
Name: Norm_Lemma_text, dtype: object

In [13]:
train_labels.head(5)

0    business
1    business
2    business
3        tech
4    business
Name: Category, dtype: object

In [14]:
# Split data
# Stratify ensures label balance across train and validation sets

X_train, X_val, y_train, y_val = train_test_split(
    train_text, train_labels, test_size=0.2, random_state=88, stratify=train_labels
)

#### Vectorization with TfidfVectorizer

In [15]:
# Constructing TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer(
    lowercase=False,     # already done in normalize_text
    max_features=10000,  # keep only the 10,000 most frequent terms to reduce computation cost and potential noise
    stop_words=custom_stopWords,
    ngram_range=(1, 2),  # unigrams and bigrams
    min_df=2,            # drop terms that appear in fewer than 2 documents
    max_df=0.95,         # drop terms that appear in more than 95% of documents
    sublinear_tf=True    # use 1 + log(tf) to dampen extremely frequent terms
)

#fit and transform training and validation sets
tfidf_X_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_X_val = tfidf_vectorizer.transform(X_val)


#transform test set //test set must use the same vocabulary and IDF weights learned from the training set.
tfidf_test = tfidf_vectorizer.transform(test_text) #Don't use fit_transform, just transform

#For reproducibility, export the vectorizer
joblib.dump(tfidf_vectorizer, "../models/tfidf_vectorizer.joblib")

['../models/tfidf_vectorizer.joblib']

#### Vectorization with CountVectorizer

In [16]:
# Constructing the CountVectorizer 
count_vectorizer = CountVectorizer(
    lowercase=False,    #already done
    stop_words=custom_stopWords,
    ngram_range=(1,2),
    min_df=2,
    max_df=0.95, 
    max_features=10000
)
# fit and transform training and validation sets
count_X_train = count_vectorizer.fit_transform(X_train)
count_X_val = count_vectorizer.transform(X_val)

count_test = count_vectorizer.transform(test_text)

#For reproducibility
joblib.dump(count_vectorizer, "../models/count_vectorizer.joblib")

['../models/count_vectorizer.joblib']

#### Dimensionality Reduction and Normalization

Dimensionality is reduced with Truncated SVD rather than PCA: PCA centers the data, which would turn the sparse TF-IDF and count matrices into dense ones, whereas `TruncatedSVD` works on sparse matrices directly. A `Normalizer` with the L2 norm then gives each document vector unit length, which is beneficial because the dot product of two unit-length vectors equals their cosine similarity, a standard metric for measuring document similarity in information retrieval.
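
The claim about unit-length vectors can be verified numerically. This is a minimal sketch on two hand-picked vectors (the numbers are arbitrary assumptions chosen so the norms come out to 5 and 3).

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two toy "document vectors" with norms 5 and 3
vectors = np.array([[3.0, 4.0, 0.0],
                    [1.0, 2.0, 2.0]])

# Rescale each row to unit L2 length
unit = Normalizer(norm="l2").fit_transform(vectors)

# Dot product of the normalized rows
dot_of_unit = float(unit[0] @ unit[1])

# Cosine similarity computed directly from the original rows
cosine = float(
    vectors[0] @ vectors[1]
    / (np.linalg.norm(vectors[0]) * np.linalg.norm(vectors[1]))
)

print(dot_of_unit, cosine)  # both equal 11/15
```

Once the rows are normalized, any model or metric built on dot products is effectively comparing documents by angle rather than by magnitude.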

A subtle but important issue is reusing the same `lsa_pipeline` object for both the TF-IDF and the count vectors. Calling `fit_transform` on the pipeline with the TF-IDF data and then again with the count data overwrites the first fitted transform. So two pipelines are created, and each holds the fitted SVD for its own vectorized data.
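
The overwrite can be seen in a small experiment. The sketch below uses random matrices as stand-ins for the two vectorized datasets; the shapes, the seed, and the name `shared_lsa` are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
matrix_a = rng.random((20, 6))  # stand-in for the TF-IDF features
matrix_b = rng.random((20, 9))  # stand-in for the count features

# One shared pipeline, fitted twice in a row
shared_lsa = make_pipeline(TruncatedSVD(n_components=3, random_state=0), Normalizer())

shared_lsa.fit(matrix_a)
width_after_a = shared_lsa[0].components_.shape[1]

shared_lsa.fit(matrix_b)  # replaces the components learned from matrix_a
width_after_b = shared_lsa[0].components_.shape[1]

# The pipeline now only understands matrix_b-shaped input
try:
    shared_lsa.transform(matrix_a)
    reuse_failed = False
except ValueError:
    reuse_failed = True

print(width_after_a, width_after_b, reuse_failed)
```

In the real pipeline both matrices can have the same number of columns, so the second fit would not raise an error; it would silently apply the wrong projection, which is harder to notice.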



In [22]:
# Create LSA (latent semantic analysis) pipelines to preprocess data uniformly
# Each uses TruncatedSVD followed by Normalizer (L2 norm by default)

tfidf_lsa = make_pipeline(
    TruncatedSVD(n_components=100, random_state=88),
    Normalizer(copy = False)
)

count_lsa = make_pipeline(
    TruncatedSVD(n_components=100, random_state=88),
    Normalizer(copy = False)
)

In [18]:
# Reduce the TF-IDF matrices; the pipeline is fitted on the training split only
tfidf_X_train_reduced = tfidf_lsa.fit_transform(tfidf_X_train)
tfidf_X_val_reduced = tfidf_lsa.transform(tfidf_X_val)
tfidf_test_reduced = tfidf_lsa.transform(tfidf_test)

In [19]:
count_X_train_reduced = count_lsa.fit_transform(count_X_train)
count_X_val_reduced = count_lsa.transform(count_X_val)
count_test_reduced = count_lsa.transform(count_test)

# Save the pipelines only after fitting, so the saved objects hold the learned SVD components
joblib.dump(tfidf_lsa, "../models/tfidf_lsa.pkl")
joblib.dump(count_lsa, "../models/count_lsa.pkl")

['../models/count_lsa.pkl']

In [23]:
# Export data for modeling

#TFIDF reduced matrices
joblib.dump(tfidf_X_train_reduced,"../data/processed/tfidf_X_train_reduced.pkl")
joblib.dump(tfidf_X_val_reduced, "../data/processed/tfidf_X_val_reduced.pkl")
joblib.dump(tfidf_test_reduced, "../data/processed/tfidf_test_reduced.pkl") 

#Count reduced matrices
joblib.dump(count_X_train_reduced, "../data/processed/count_X_train_reduced.pkl")
joblib.dump(count_X_val_reduced, "../data/processed/count_X_val_reduced.pkl")
joblib.dump(count_test_reduced, "../data/processed/count_test_reduced.pkl")

['../data/processed/count_test_reduced.pkl']