# Q2: News Categorization with Naive Bayes

This notebook builds three Multinomial Naive Bayes classifiers that categorize news articles.

Assignment parts:
- Parse RDF/XML ontology
- Train models using headlines, descriptions, and combined features
- Evaluate and compare performance
- Save best model

In [39]:
import rdflib
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

## Load and Parse RDF Data

In [5]:
graph = rdflib.Graph()
graph.parse('News_Categorizer_RDF.xml')

<Graph identifier=N15a2b9779f4a4f7e986ab897203f513e (<class 'rdflib.graph.Graph'>)>

In [25]:
# Collect each article's properties, keyed by its subject URI
articles = {}

# Walk every triple in the graph
for subject, predicate, obj in graph:
    pred_str = str(predicate)
    subj_str = str(subject)

    # Group all properties by article subject 
    if subj_str not in articles:
        articles[subj_str] = {}

    # Match predicates by the fragment at the end of their URI
    if pred_str.endswith('#headline'):
        articles[subj_str]['headline'] = str(obj)
    elif pred_str.endswith('#short_description'):
        articles[subj_str]['description'] = str(obj)
    elif pred_str.endswith('#category'):
        articles[subj_str]['category'] = str(obj)
    elif pred_str.endswith('#place'):
        articles[subj_str]['location'] = str(obj)

# Convert dictionary to lists
headlines = []
descriptions = []
categories = []
locations = []

for article_id, article_data in articles.items():
    # Only include complete articles with all four fields
    if all(key in article_data for key in ['headline', 'description', 'category', 'location']):
        headlines.append(article_data['headline'])
        descriptions.append(article_data['description'])
        categories.append(article_data['category'])
        locations.append(article_data['location'])

print(f"Extracted {len(headlines)} articles")
print(f"Sample headline: {headlines[0] if headlines else 'None'}")
print(f"Sample category: {categories[0] if categories else 'None'}")

Extracted 9999 articles
Sample headline: Mariano Rivera Sand Sculpture: Rays Give Yankees Closer Present (VIDEO)
Sample category: SPORTS
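
The grouping step above does not depend on rdflib itself, so it can be sketched with plain tuples standing in for triples. The namespace `http://example.org/news#`, the helper `group_articles`, and the sample values are hypothetical; only the predicate fragments come from the notebook. This version compares full predicate URIs rather than suffixes, which avoids matching an unrelated predicate that happens to end in the same fragment.

```python
from collections import defaultdict

# Hypothetical namespace; the real ontology's base URI is not shown in the notebook.
NS = "http://example.org/news#"

# Map full predicate URIs to the field names used downstream.
FIELD_BY_PREDICATE = {
    NS + "headline": "headline",
    NS + "short_description": "description",
    NS + "category": "category",
    NS + "place": "location",
}
REQUIRED = set(FIELD_BY_PREDICATE.values())


def group_articles(triples):
    """Group (subject, predicate, object) triples into one dict per subject."""
    articles = defaultdict(dict)
    for subject, predicate, obj in triples:
        field = FIELD_BY_PREDICATE.get(str(predicate))
        if field is not None:
            articles[str(subject)][field] = str(obj)
    # Keep only subjects that carry all four fields.
    return {s: a for s, a in articles.items() if REQUIRED.issubset(a)}


# Toy triples (made up for illustration).
triples = [
    (NS + "a1", NS + "headline", "Team wins final"),
    (NS + "a1", NS + "short_description", "A close game."),
    (NS + "a1", NS + "category", "SPORTS"),
    (NS + "a1", NS + "place", "Torrance"),
    (NS + "a2", NS + "headline", "Incomplete article"),
]
complete = group_articles(triples)
print(len(complete))
```

Because only recognized predicates create an entry, subjects that are not articles (for example ontology class definitions) never enter the dictionary.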


In [26]:
df = pd.DataFrame({
    'headline': headlines,
    'description': descriptions,
    'category': categories,
    'location': locations
})

print(f"Total articles: {len(df)}")
print(f"\nCategory distribution:")
print(df['category'].value_counts())
print(f"\nFirst 5 rows:")
df.head()

Total articles: 9999

Category distribution:
category
SPORTS            1002
WORLD NEWS        1001
TRAVEL            1001
FOOD & DRINK      1000
STYLE & BEAUTY    1000
WELLNESS           999
PARENTING          999
BUSINESS           999
POLITICS           999
ENTERTAINMENT      999
Name: count, dtype: int64

First 5 rows:


| | headline | description | category | location |
|---|---|---|---|---|
| 0 | Mariano Rivera Sand Sculpture: Rays Give Yanke... | Yankees closer Mariano Rivera received another... | SPORTS | Torrance |
| 1 | The Real Snack Food Story | Youth Radio/Youth Media International (YMI) is... | WELLNESS | Santa Monica |
| 2 | Twitter Employee, Claire Diaz-Ortiz, Live Twee... | Yet, it seemed like the appropriate thing for ... | PARENTING | Inglewood |
| 3 | Are you Parenting Like Your Parent? | Have you ever had one of those moments where y... | PARENTING | Torrance |
| 4 | Chipotle Doesn't Know When Carnitas Shortage W... | The hottest fast food chain in the country has... | BUSINESS | Pasadena |


## Model 1: Headlines Only

In [40]:
print("=" * 60)
print("Model 1: Headlines Only")
print("=" * 60)

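# Convert headlines to TF-IDF features, keeping the 1,000 most frequent terms
# Note: fitting before the split lets test-set text influence the vocabulary and IDF weights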
vectorizer1 = TfidfVectorizer(max_features=1000)
X1 = vectorizer1.fit_transform(df['headline'])
y = df['category']

# Split into training (70%) and testing (30%) sets
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y, test_size=0.3, random_state=42)

Model 1: Headlines Only
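
One way to keep test text out of the vocabulary is to split the raw strings first and let a scikit-learn pipeline fit the vectorizer on the training rows only. This is a minimal sketch on made-up toy data, not the notebook's actual run; in the notebook the inputs would be `df['headline']` and `df['category']`, and the reported accuracy would likely change slightly.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (made up for illustration).
texts = [
    "team wins the cup", "striker scores twice", "coach praises defense",
    "marathon record broken", "club signs new keeper",
    "senate passes the bill", "president signs law", "governor vetoes budget",
    "court blocks new policy", "mayor announces campaign",
]
labels = ["SPORTS"] * 5 + ["POLITICS"] * 5

# Split the raw text first so the vectorizer never sees the test rows.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

# The pipeline fits TF-IDF on X_train only, then trains the classifier.
pipeline = make_pipeline(TfidfVectorizer(max_features=1000), MultinomialNB())
pipeline.fit(X_train, y_train)

# At prediction time the same fitted vocabulary is applied to the test text.
predictions = pipeline.predict(X_test)
print(len(predictions))
```

A side benefit is that the fitted pipeline is a single object, so the vectorizer and classifier cannot drift apart when the model is saved later.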


In [41]:
# Train Multinomial Naive Bayes classifier
model1 = MultinomialNB()
model1.fit(X1_train, y1_train)

# Make predictions on test set
y1_pred = model1.predict(X1_test)
accuracy1 = accuracy_score(y1_test, y1_pred)

# Display results
print(f"\nAccuracy: {accuracy1:.4f} ({accuracy1*100:.2f}%)")
print(f"\nClassification Report:")
print(classification_report(y1_test, y1_pred))


Accuracy: 0.5873 (58.73%)

Classification Report:
                precision    recall  f1-score   support

      BUSINESS       0.53      0.46      0.49       301
 ENTERTAINMENT       0.58      0.47      0.52       321
  FOOD & DRINK       0.62      0.65      0.63       302
     PARENTING       0.59      0.66      0.62       310
      POLITICS       0.46      0.56      0.50       260
        SPORTS       0.71      0.71      0.71       296
STYLE & BEAUTY       0.71      0.73      0.72       302
        TRAVEL       0.52      0.58      0.55       288
      WELLNESS       0.53      0.50      0.51       324
    WORLD NEWS       0.63      0.57      0.60       296

      accuracy                           0.59      3000
     macro avg       0.59      0.59      0.59      3000
  weighted avg       0.59      0.59      0.59      3000
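
To read the report, recall that each f1-score is the harmonic mean of that row's precision and recall. The sketch below recomputes it for two rows of the Model 1 report; the helper name `f1_from` is hypothetical, and because the inputs are already rounded to two decimals, a recomputed value can differ from the printed one by about 0.01 in other rows.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded (precision, recall) pairs copied from the Model 1 report.
rows = {"BUSINESS": (0.53, 0.46), "SPORTS": (0.71, 0.71)}

for name, (p, r) in rows.items():
    print(f"{name}: f1 = {f1_from(p, r):.2f}")
```

The harmonic mean sits closer to the smaller of the two values, which is why a class with high precision but low recall still gets a modest f1-score.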



## Model 2: Descriptions Only

In [44]:
print("=" * 60)
print("Model 2: Descriptions Only")
print("=" * 60)

# Convert descriptions to TF-IDF features (same settings as Model 1)
vectorizer2 = TfidfVectorizer(max_features=1000)
X2 = vectorizer2.fit_transform(df['description'])

# Split into training and testing sets
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.3, random_state=42)

Model 2: Descriptions Only


In [45]:
# Train Multinomial Naive Bayes classifier
model2 = MultinomialNB()
model2.fit(X2_train, y2_train)

# Make predictions on test set
y2_pred = model2.predict(X2_test)
accuracy2 = accuracy_score(y2_test, y2_pred)

# Display results
print(f"\nAccuracy: {accuracy2:.4f} ({accuracy2*100:.2f}%)")
print(f"\nClassification Report:")
print(classification_report(y2_test, y2_pred))


Accuracy: 0.4983 (49.83%)

Classification Report:
                precision    recall  f1-score   support

      BUSINESS       0.51      0.46      0.49       301
 ENTERTAINMENT       0.52      0.31      0.39       321
  FOOD & DRINK       0.48      0.61      0.53       302
     PARENTING       0.49      0.55      0.52       310
      POLITICS       0.41      0.44      0.42       260
        SPORTS       0.57      0.60      0.58       296
STYLE & BEAUTY       0.58      0.50      0.53       302
        TRAVEL       0.47      0.49      0.48       288
      WELLNESS       0.49      0.48      0.48       324
    WORLD NEWS       0.49      0.55      0.52       296

      accuracy                           0.50      3000
     macro avg       0.50      0.50      0.50      3000
  weighted avg       0.50      0.50      0.50      3000



## Model 3: Combined Features

In [None]:
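print("=" * 60)
print("Model 3: Combined Features")
print("=" * 60)

# Combine headline and description into a single text field
df['combined'] = df['headline'] + " " + df['description']

# Convert the combined text to raw term counts over the full vocabulary
# (unlike Models 1 and 2, which use TF-IDF capped at 1,000 features)
vectorizer3 = CountVectorizer()
X3 = vectorizer3.fit_transform(df['combined'])

# Split into training and testing sets
X3_train, X3_test, y3_train, y3_test = train_test_split(X3, y, test_size=0.3, random_state=42)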

In [None]:
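# Train Multinomial Naive Bayes classifier
model3 = MultinomialNB()
model3.fit(X3_train, y3_train)

# Make predictions on test set
y3_pred = model3.predict(X3_test)
accuracy3 = accuracy_score(y3_test, y3_pred)

# Display results
print(f"\nAccuracy: {accuracy3:.4f} ({accuracy3*100:.2f}%)")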

## Model Comparison

In [None]:
# Compare accuracy scores
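
A comparison could be as simple as a sorted table. The sketch below uses only the two accuracies recorded in the outputs above; Model 3 has no recorded output, so it is left out rather than guessed. In the notebook the dictionary values would be the variables `accuracy1`, `accuracy2`, and `accuracy3`.

```python
# Accuracies copied from the recorded outputs of Models 1 and 2.
accuracies = {
    "Model 1: Headlines Only": 0.5873,
    "Model 2: Descriptions Only": 0.4983,
}

# Print the models from best to worst.
ranked = sorted(accuracies.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranked:
    print(f"{name:<28} {acc:.4f} ({acc * 100:.2f}%)")

# Pick the model with the highest accuracy.
best_name = max(accuracies, key=accuracies.get)
print(f"Best so far: {best_name}")
```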

In [None]:
# Plot confusion matrices
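
One possible shape for this cell is a small helper that draws a single confusion matrix on a given axis, called once per model. The sketch below runs on made-up toy labels and uses plain matplotlib; the helper name `plot_confusion` is hypothetical, and in the notebook the inputs would be pairs such as `y1_test` and `y1_pred`.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix


def plot_confusion(y_true, y_pred, labels, title, ax):
    """Draw one confusion matrix on `ax` and return the underlying counts."""
    # Rows are actual classes, columns are predicted classes, in `labels` order.
    cm = confusion_matrix(y_true, y_pred, labels=labels)
    ax.imshow(cm, cmap="Blues")
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_yticks(range(len(labels)))
    ax.set_yticklabels(labels)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("Actual")
    ax.set_title(title)
    # Write the count into each cell.
    for i in range(len(labels)):
        for j in range(len(labels)):
            ax.text(j, i, str(cm[i, j]), ha="center", va="center")
    return cm


# Toy labels (made up for illustration).
labels = ["POLITICS", "SPORTS", "TRAVEL"]
y_true = ["SPORTS", "SPORTS", "TRAVEL", "POLITICS", "TRAVEL", "POLITICS"]
y_pred = ["SPORTS", "TRAVEL", "TRAVEL", "POLITICS", "TRAVEL", "SPORTS"]

fig, ax = plt.subplots(figsize=(4, 4))
cm = plot_confusion(y_true, y_pred, labels, "Toy example", ax)
fig.tight_layout()
```

For the three models, `plt.subplots(1, 3)` would give one axis per model so the matrices can be compared side by side.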

## Save Best Model

In [None]:
# Determine which model performed best and save it
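
When saving, the fitted vectorizer needs to travel with the classifier, since the model alone cannot turn raw text into the feature columns it was trained on. The sketch below pickles both together and reloads them; the toy data, the `bundle` layout, and the file name `best_model.pkl` are assumptions for illustration. In the notebook the pair would be whichever vectorizer and model scored best, for example `vectorizer1` and `model1`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-ins (made up) for the best-scoring vectorizer/model pair.
texts = [
    "team wins the cup", "senate passes the bill",
    "striker scores twice", "president signs law",
]
labels = ["SPORTS", "POLITICS", "SPORTS", "POLITICS"]
vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

# Save the vectorizer together with the model in one file.
bundle = {"vectorizer": vectorizer, "model": model}
path = Path(tempfile.mkdtemp()) / "best_model.pkl"  # hypothetical file name
with open(path, "wb") as f:
    pickle.dump(bundle, f)

# Reload the bundle and classify a new headline.
with open(path, "rb") as f:
    restored = pickle.load(f)
new_text = ["striker wins the cup"]
features = restored["vectorizer"].transform(new_text)
prediction = restored["model"].predict(features)
print(prediction[0])
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best reloaded with the same library version that produced it.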