# Q2: News Categorization with Naive Bayes

Building three Multinomial Naive Bayes classifiers to categorize news articles.

Assignment parts:
- Parse RDF/XML ontology
- Train models using headlines, descriptions, and combined features
- Evaluate and compare performance
- Save best model

In [2]:
import rdflib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

## Load and Parse RDF Data

In [5]:
graph = rdflib.Graph()
graph.parse('News_Categorizer_RDF.xml')

<Graph identifier=N15a2b9779f4a4f7e986ab897203f513e (<class 'rdflib.graph.Graph'>)>

In [8]:
headlines = []
descriptions = []
categories = []
locations = []

# Extract article data from RDF triples
for subject, predicate, obj in graph:
    pred_str = str(predicate) 

    if 'headline' in pred_str:
        headlines.append(str(obj))
    elif 'short_description' in pred_str:
        descriptions.append(str(obj))
    elif 'category' in pred_str:
        categories.append(str(obj))
    elif 'place' in pred_str:
        locations.append(str(obj))

print(f"Extracted {len(headlines)} articles")
print(f"Sample headline: {headlines[0] if headlines else 'None'}")
print(f"Sample category: {categories[0] if categories else 'None'}")

Extracted 9999 articles
Sample headline: The Real Snack Food Story
Sample category: PARENTING


In [9]:
df = pd.DataFrame({
    'headline': headlines,
    'description': descriptions,
    'category': categories,
    'location': locations
})

print(f"Total articles: {len(df)}")
print(f"\nCategory distribution:")
print(df['category'].value_counts())
print(f"\nFirst 5 rows:")
df.head()

Total articles: 9999

Category distribution:
category
SPORTS            1002
WORLD NEWS        1001
TRAVEL            1001
STYLE & BEAUTY    1000
FOOD & DRINK      1000
PARENTING          999
ENTERTAINMENT      999
POLITICS           999
BUSINESS           999
WELLNESS           999
Name: count, dtype: int64

First 5 rows:


Unnamed: 0,headline,description,category,location
0,The Real Snack Food Story,The hottest fast food chain in the country has...,PARENTING,Torrance
1,"Indian Americans Have Always Faced Racism, But...",Pence would also be called on to cast a tie-br...,PARENTING,Santa Monica
2,Opening Day -- But It Doesn't Count,"An empire, more often than not, doesn't erupt ...",SPORTS,Santa Monica
3,Yemen’s Calamity Is Of Damning Proportions,Welsh-born actor Roger Rees died at his home i...,PARENTING,Pasadena
4,Jeb Faces Bush Family Ghosts In Key State,Red carpet beauty can be pretty predictable. L...,PARENTING,Long Beach


## Model 1: Headlines Only

In [None]:
vectorizer1 = CountVectorizer()
X1 = vectorizer1.fit_transform(df['headline'])
y = df['category']

X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, test_size=0.3, random_state=42)

In [None]:
model1 = MultinomialNB()
model1.fit(X1_train, y1_train)

y1_pred = model1.predict(X1_test)
accuracy1 = accuracy_score(y1_test, y1_pred)

## Model 2: Descriptions Only

In [None]:
vectorizer2 = CountVectorizer()
X2 = vectorizer2.fit_transform(df['description'])

X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.3, random_state=42)

In [None]:
model2 = MultinomialNB()
model2.fit(X2_train, y2_train)

y2_pred = model2.predict(X2_test)
accuracy2 = accuracy_score(y2_test, y2_pred)

## Model 3: Combined Features

In [None]:
df['combined'] = df['headline'] + " " + df['description']

vectorizer3 = CountVectorizer()
X3 = vectorizer3.fit_transform(df['combined'])

X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y, test_size=0.3, random_state=42)

In [None]:
model3 = MultinomialNB()
model3.fit(X3_train, y3_train)

y3_pred = model3.predict(X3_test)
accuracy3 = accuracy_score(y3_test, y3_pred)

## Model Comparison

In [None]:
# Compare accuracy scores

In [None]:
# Plot confusion matrices

## Save Best Model

In [None]:
# Determine which model performed best and save it