# Data Mining with Professor Sloan Week 7

## Maggie Boles

##### 10/23/2025

##### From Blackboard: You need to get the data into a usable format and perform at least three different aspects of feature engineering (depending on what kind of model you plan to build). If the data is a PDF, you’ll use the technique you’ve learned so far to extract the text (I suggest this approach if you can find suitable PDF data). If the data is a CSV, it’ll be easier to ingest the data. Once you have the data, you’ll use at least three different feature engineering techniques to understand your corpus.

In [5]:
# Import Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import re

# Load the BBC news dataset (columns: 'data' = article text, 'labels' = category)
df = pd.read_csv('bbc_data.csv')

print("Loaded Data Sample:")
print(df.head())
print("\nLabel Distribution:")
print(df['labels'].value_counts())

# Stopwords
stopwords = set(['the', 'is', 'are', 'to', 'of', 'in', 'a', 'and', 'for', 'on', 'that', 'by', 
                 'this', 'with', 'i', 'you', 'it', 'not', 'or', 'be', 'from', 'at', 'as', 
                 'your', 'all', 'have', 'new', 'more', 'an', 'was', 'we', 'will', 'home', 
                 'can', 'us', 'about', 'if', 'page', 'my', 'has', 'but', 'our', 'one', 
                 'other', 'do', 'no', 'they', 'he', 'up', 'may', 'what', 'which', 'their', 
                 'news', 'out', 'use', 'any', 'there', 'see', 'only', 'so', 'his', 'when', 
                 'here', 'who', 'web', 'also', 'now', 'help', 'get', 'view', 'some', 'like', 
                 'site', 'go', 'back', 'good', 'had', 'how', 'way', 'even', 'did'])

# Aspect 1: Text Cleaning
def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'\d+', '', text)
    text = ' '.join([word for word in text.split() if word not in stopwords])
    return text

df['cleaned_text'] = df['data'].apply(clean_text)

print("\nAfter Text Cleaning (Aspect 1):")
print(df[['data', 'cleaned_text']].head())

# Aspect 2: TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
tfidf_features = tfidf.fit_transform(df['cleaned_text'])

# Prefix each TF-IDF column so it cannot collide with the meta-feature names.
# get_feature_names_out() already returns unique terms, so no numeric suffix is needed.
tfidf_cols = tfidf.get_feature_names_out()
unique_cols = [f'tfidf_{col}' for col in tfidf_cols]

tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=unique_cols, index=df.index)

print(f"\nTF-IDF Features Created (Aspect 2): {tfidf_df.shape[1]} features")

# Aspect 3: Meta-Feature Extraction
df['text_length'] = df['data'].str.len()
df['word_count'] = df['data'].str.split().str.len()
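# Characters per word (text_length counts whitespace too, so this slightly overestimates);
# replace(0, 1) guards against division by zero for empty articles
df['avg_word_length'] = df['text_length'] / df['word_count'].replace(0, 1)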

print("\nMeta Features Sample (Aspect 3):")
print(df[['labels', 'text_length', 'word_count', 'avg_word_length']].head())

# Combine ALL features (TF-IDF + Meta-features)
final_features = pd.concat([df[['text_length', 'word_count', 'avg_word_length']], tfidf_df], axis=1)

# Final dataset ready for modeling
modeling_df = pd.concat([df[['labels']], final_features], axis=1)

print(f"\nFINAL DATASET SHAPE: {modeling_df.shape}")
print(f"Features: {list(final_features.columns)}")
print("\nSample of final features:")
print(final_features.head())

# Save for modeling
modeling_df.to_csv('bbc_engineered_final.csv', index=False)
print("\nSaved to 'bbc_engineered_final.csv'")

Loaded Data Sample:
                                                data         labels
0  Musicians to tackle US red tape  Musicians gro...  entertainment
1  U2s desire to be number one  U2, who have won ...  entertainment
2  Rocker Doherty in on-stage fight  Rock singer ...  entertainment
3  Snicket tops US box office chart  The film ada...  entertainment
4  Oceans Twelve raids box office  Oceans Twelve,...  entertainment

Label Distribution:
labels
sport            511
business         510
politics         417
tech             401
entertainment    386
Name: count, dtype: int64

After Text Cleaning (Aspect 1):
                                                data  \
0  Musicians to tackle US red tape  Musicians gro...   
1  U2s desire to be number one  U2, who have won ...   
2  Rocker Doherty in on-stage fight  Rock singer ...   
3  Snicket tops US box office chart  The film ada...   
4  Oceans Twelve raids box office  Oceans Twelve,...   

                                        cle

##### For our three feature engineering techniques, I applied text cleaning first (Aspect 1 in the code): lowercasing the text and removing punctuation, stop words, and digits. The second technique described here is a set of meta-features (Aspect 3 in the code): text_length, word_count, and avg_word_length, which describe the size of each article rather than its vocabulary. These numeric columns can help us as we continue to apply these features mathematically later. (Possibly cosine similarity or Euclidean distance?) Our final feature engineering technique is TF-IDF (Term Frequency-Inverse Document Frequency: https://www.geeksforgeeks.org/machine-learning/understanding-tf-idf-term-frequency-inverse-document-frequency/), which is Aspect 2 in the code and provides the bag-of-words style features: the 500 most frequent unigrams and bigrams, each weighted by how distinctive it is across the corpus. TF-IDF can also be applied for clustering, which I think I would like to use in my next milestone, since the corpus is made of news articles and I believe these features will be good for training our model.
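To make the TF-IDF and cosine similarity idea concrete, here is a minimal sketch in plain Python. The three-document corpus is hypothetical (it is not taken from bbc_data.csv), and the weighting follows the smoothed IDF and L2 normalisation that scikit-learn's `TfidfVectorizer` applies by default; the helper names `tfidf_vector` and `cosine_similarity` are made up for this illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus (assumption: not real rows from bbc_data.csv)
docs = [
    "film tops box office chart",
    "film raids box office",
    "team wins cup final",
]

tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})
n_docs = len(tokenized)

# Document frequency: how many documents contain each term
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}


def tfidf_vector(tokens):
    """Return an L2-normalised TF-IDF vector over the shared vocabulary."""
    counts = Counter(tokens)
    raw = [counts[word] * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = [tfidf_vector(tokens) for tokens in tokenized]
sim_film_film = cosine_similarity(vectors[0], vectors[1])
sim_film_sport = cosine_similarity(vectors[0], vectors[2])
print(f"film article vs film article:  {sim_film_film:.3f}")
print(f"film article vs sport article: {sim_film_sport:.3f}")
```

The two film articles share several terms, so their similarity is well above zero, while the film and sport articles share none and score exactly zero. That gap is what a clustering step would rely on.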

##### Our initial corpus has been cleaned and is ready for a model that predicts the label from the news article. We could process it further by dropping some of the TF-IDF columns to find which ones actually predict the labels column. I think this is a great starting point, and we will have to tinker with the features to see which are the main predictors of the labels.

In [None]:
# For future work: split the engineered dataset into features (X) and target (y)
X = modeling_df.drop('labels', axis=1)
y = modeling_df['labels']
print("\nReady for classification modeling!")
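
A minimal sketch of what that classification step could look like, assuming a logistic regression baseline (the notebook does not commit to a model yet). The small `toy_modeling_df` below is a hypothetical stand-in for `modeling_df`, with made-up values, so the example runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for modeling_df: a label column plus numeric features
toy_modeling_df = pd.DataFrame({
    "labels": ["sport"] * 4 + ["business"] * 4,
    "tfidf_match": [0.8, 0.6, 0.7, 0.9, 0.0, 0.1, 0.0, 0.0],
    "tfidf_shares": [0.0, 0.0, 0.1, 0.0, 0.7, 0.9, 0.6, 0.8],
})

X = toy_modeling_df.drop("labels", axis=1)
y = toy_modeling_df["labels"]

# Hold out 25% of the rows, keeping the class balance the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Baseline classifier (an assumption; any scikit-learn classifier fits this pattern)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)
print("Held-out accuracy:", model.score(X_test, y_test))
```

With the real data, the meta-features (`text_length`, `word_count`) sit on a much larger scale than the TF-IDF columns, so scaling them before fitting is probably worth trying.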