# Introduction to News Category Classification

This project tackles a fundamental task in Natural Language Processing (NLP): **text classification**. The primary goal is to build an effective machine learning model that can automatically categorize news articles into one of four labels: **Business, Sports, Crime, or Science.**

Throughout this notebook, we will walk through the essential steps of a complete machine learning pipeline. This includes:

- Loading and exploring the dataset with `pandas`.
- Cleaning the text with `spaCy` by removing stop words and punctuation.
- Handling class imbalance by down-sampling so the model is not biased toward the larger classes.
- Training and evaluating multiple models from `scikit-learn` to find the best performer.

In [57]:
import pandas as pd

df = pd.read_json("/Users/hasancan/Downloads/news_dataset.json")

df.head()

Unnamed: 0,text,category
0,Watching Schrödinger's Cat Die University of C...,SCIENCE
1,WATCH: Freaky Vortex Opens Up In Flooded Lake,SCIENCE
2,Entrepreneurs Today Don't Need a Big Budget to...,BUSINESS
3,These Roads Could Recharge Your Electric Car A...,BUSINESS
4,Civilian 'Guard' Fires Gun While 'Protecting' ...,CRIME


In [58]:
# Check the class distribution for imbalance.

df.category.value_counts()  # BUSINESS and SPORTS have far more articles than CRIME and SCIENCE.

category
BUSINESS    4254
SPORTS      4167
CRIME       2893
SCIENCE     1381
Name: count, dtype: int64
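
To put a number on the imbalance shown above, the class counts can be turned into proportions and a largest-to-smallest ratio. This is a minimal sketch that re-enters the counts by hand from the output above; the variable names `counts`, `proportions`, and `imbalance_ratio` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series(
    {"BUSINESS": 4254, "SPORTS": 4167, "CRIME": 2893, "SCIENCE": 1381},
    name="count",
)

# Share of each class in the full dataset.
proportions = (counts / counts.sum()).round(3)

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print(f"largest / smallest class: {imbalance_ratio:.2f}")
```

On the real DataFrame the same figures should be obtainable with `df.category.value_counts(normalize=True)`.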

In [59]:
# Class imbalance can be handled in several ways (over-sampling, class weights, etc.).
# Here we under-sample: each class is cut down to the size of the smallest one (SCIENCE, 1381 rows).

min_samples = 1381
df_business = df[df.category=="BUSINESS"].sample(min_samples, random_state=6)
df_business

Unnamed: 0,text,category
9661,Truly Human Leadership-Part 2: Management and ...,BUSINESS
1615,5 Things Customers May Not Tell You Before The...,BUSINESS
4387,"4 Ways Your For-Profit Business Can Do ""Good"" ...",BUSINESS
6054,Here's What Companies Need To Do To Get Wellne...,BUSINESS
7762,Governor Scott Walker vs. Unions It is hard to...,BUSINESS
...,...,...
9744,How To Deal With These 3 Difficult Managers At...,BUSINESS
6967,What's Your Meeting Brand? Many leaders operat...,BUSINESS
7415,Here's What Big Banks Won't Say About Their An...,BUSINESS
4686,The Need for Private-Public Partnerships Again...,BUSINESS


In [60]:
# let's do it for the other categories, as well
min_samples = 1381

df_sports = df[df.category=="SPORTS"].sample(min_samples, random_state=6)
df_crime = df[df.category=="CRIME"].sample(min_samples, random_state=6)
df_science = df[df.category=="SCIENCE"].sample(min_samples, random_state=6)

In [61]:
df_balanced = pd.concat([df_business, df_sports, df_crime, df_science], axis=0)
df_balanced.category.value_counts() # now we have a balanced dataset!

category
BUSINESS    1381
SPORTS      1381
CRIME       1381
SCIENCE     1381
Name: count, dtype: int64
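
The per-category sampling above can also be written once for all classes. The following is a sketch of one possible generalization, not the notebook's own code: the helper name `undersample` and the toy DataFrame are hypothetical, and the minimum class size is computed rather than hard-coded.

```python
import pandas as pd


def undersample(df, label_col, random_state=6):
    """Down-sample every class to the size of the smallest class."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts, axis=0)


# Toy data (hypothetical) with class sizes 5, 3 and 2.
toy = pd.DataFrame({
    "text": [f"doc {i}" for i in range(10)],
    "category": ["BUSINESS"] * 5 + ["SPORTS"] * 3 + ["SCIENCE"] * 2,
})

balanced = undersample(toy, "category")
print(balanced["category"].value_counts())
```

Computing `n_min` from the data avoids having to update a constant such as `1381` if the dataset changes.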

In [62]:
# Encode the target: map each category name to an integer label.
df_balanced['category_num'] = df_balanced.category.map({'BUSINESS':0,
                                                        'SPORTS':1,
                                                        'CRIME':2,
                                                        'SCIENCE':3
                                                       })

df_balanced.head()  # the categories are now encoded as numeric target labels

Unnamed: 0,text,category,category_num
9661,Truly Human Leadership-Part 2: Management and ...,BUSINESS,0
1615,5 Things Customers May Not Tell You Before The...,BUSINESS,0
4387,"4 Ways Your For-Profit Business Can Do ""Good"" ...",BUSINESS,0
6054,Here's What Companies Need To Do To Get Wellne...,BUSINESS,0
7762,Governor Scott Walker vs. Unions It is hard to...,BUSINESS,0


In [63]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.text,
    df_balanced.category_num,
    test_size=0.2,
    random_state=6,
    stratify=df_balanced.category_num)

print(X_train.shape)
X_train.head()

(4419,)


5000    Police: Man Killed By Officers Was Holding Pho...
4631    This Enlightened CEO Takes Every Friday Off An...
9868    Scientists Crack Mystery Of Tiny Traveling Pla...
6461    Russian Long Jumper Darya Klishina To Compete ...
3592    The Power of Doing Something Meaningful with V...
Name: text, dtype: object

In [64]:
y_train.value_counts()  # the split is stratified, so every category is (almost) equally represented

category_num
0    1105
3    1105
1    1105
2    1104
Name: count, dtype: int64

In [65]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import CountVectorizer

clf = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.71      0.91      0.79       276
           1       0.96      0.82      0.88       276
           2       0.88      0.86      0.87       277
           3       0.89      0.79      0.84       276

    accuracy                           0.84      1105
   macro avg       0.86      0.84      0.85      1105
weighted avg       0.86      0.84      0.85      1105
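
To make the `ngram_range=(1, 2)` setting used above concrete, here is a minimal sketch on a two-document toy corpus. The documents are made up for illustration; only `CountVectorizer` and its `ngram_range` parameter come from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from the news dataset.
docs = ["stocks fall sharply", "team wins final"]

# (1, 2) means: extract single words (unigrams) and word pairs (bigrams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)

# List the features in column order.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(features)
print(X.toarray())
```

Each three-word document contributes three unigrams and two bigrams, so the vocabulary grows quickly with corpus size. This is the high-dimensional, sparse representation the classifiers are trained on.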



In [69]:
# We remove stop words and punctuation here (tokens keep their original form; no lemmatization is applied):

import spacy
nlp = spacy.load("en_core_web_sm")

def preprocess(text):
    doc = nlp(text)

    # Filter out stop words and punctuation...
    filtered_tokens = [token.text for token in doc if not token.is_stop and not token.is_punct]
    
    # ...and join the remaining tokens back into a single string.
    return " ".join(filtered_tokens)

# Apply the function to every article.
df_balanced['preprocessed_text'] = df_balanced.text.apply(preprocess)
df_balanced.head()

Unnamed: 0,text,category,category_num,preprocessed_text,preprocessed_text_fixed
9661,Truly Human Leadership-Part 2: Management and ...,BUSINESS,0,Truly Human Leadership 2 Management Leadership...,Truly Human Leadership 2 Management Leadership...
1615,5 Things Customers May Not Tell You Before The...,BUSINESS,0,5 Things Customers Tell Leave Good hired secre...,5 Things Customers Tell Leave Good hired secre...
4387,"4 Ways Your For-Profit Business Can Do ""Good"" ...",BUSINESS,0,4 Ways Profit Business Good businesses concern...,4 Ways Profit Business Good businesses concern...
6054,Here's What Companies Need To Do To Get Wellne...,BUSINESS,0,Companies Need Wellness Programs Right,Companies Need Wellness Programs Right
7762,Governor Scott Walker vs. Unions It is hard to...,BUSINESS,0,Governor Scott Walker vs. Unions hard believe ...,Governor Scott Walker vs. Unions hard believe ...


In [70]:
# Time to train the model now:

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.preprocessed_text,
    df_balanced.category_num,
    test_size=0.2,
    random_state=6,
    stratify=df_balanced.category_num)

print(X_train.shape)
X_train.head()


clf = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('Multi NB', MultinomialNB())
])

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))

(4419,)
              precision    recall  f1-score   support

           0       0.83      0.91      0.87       276
           1       0.93      0.89      0.91       276
           2       0.88      0.93      0.90       277
           3       0.90      0.80      0.85       276

    accuracy                           0.88      1105
   macro avg       0.88      0.88      0.88      1105
weighted avg       0.88      0.88      0.88      1105
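
The reports above identify classes only by their numeric codes. As a small sketch, the mapping defined earlier can be inverted and passed to `classification_report` through its `target_names` parameter so the rows show category names. The `y_true` and `y_pred` lists below are made-up values for illustration, not model output.

```python
from sklearn.metrics import classification_report

# Same mapping as used for category_num earlier in the notebook.
label_map = {"BUSINESS": 0, "SPORTS": 1, "CRIME": 2, "SCIENCE": 3}

# Category names ordered by their numeric code.
target_names = sorted(label_map, key=label_map.get)

# Hypothetical labels and predictions, for illustration only.
y_true = [0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 3, 1, 1]

report = classification_report(y_true, y_pred, target_names=target_names)
print(report)
```

In the notebook this would amount to adding `target_names=target_names` to the existing `classification_report(y_test, y_pred)` calls.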



In [71]:
# Now that the text is preprocessed, let's try some new models on it.
# We redo the train-test split on the cleaned text (same parameters as before, so the split is identical).

X_train, X_test, y_train, y_test = train_test_split(
    df_balanced.preprocessed_text,  # the cleaned column created by preprocess() above
    df_balanced.category_num,
    test_size=0.2,
    random_state=6,
    stratify=df_balanced.category_num
)

In [72]:
# Let's try some different models and see if we can beat the Naive Bayes score!
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# --- Model 2: Logistic Regression ---
# This is another great model for text classification. Let's see how it does.
print("--- Training Logistic Regression Model ---")

model_logreg = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('Logistic Regression', LogisticRegression(max_iter=1000)) # Increased max_iter to ensure convergence
])

# Train the model
model_logreg.fit(X_train, y_train)

# Make predictions and check the report
y_pred_logreg = model_logreg.predict(X_test)

print("\nClassification Report (Logistic Regression):")
print(classification_report(y_test, y_pred_logreg))


# --- Model 3: Support Vector Machine (SVM) ---
# Now for the famous SVM! Let's see if it can handle this.
print("\n--- Training Support Vector Machine (SVM) Model ---")

model_svm = Pipeline([
    ('vectorizer_bow', CountVectorizer(ngram_range=(1,2))),
    ('Linear SVM', LinearSVC())
])

# Train the model
model_svm.fit(X_train, y_train)

# Make predictions and check the report
y_pred_svm = model_svm.predict(X_test)

print("\nClassification Report (Linear SVM):")
print(classification_report(y_test, y_pred_svm))

--- Training Logistic Regression Model ---

Classification Report (Logistic Regression):
              precision    recall  f1-score   support

           0       0.84      0.85      0.84       276
           1       0.91      0.86      0.88       276
           2       0.90      0.89      0.89       277
           3       0.81      0.86      0.83       276

    accuracy                           0.86      1105
   macro avg       0.86      0.86      0.86      1105
weighted avg       0.86      0.86      0.86      1105


--- Training Support Vector Machine (SVM) Model ---

Classification Report (Linear SVM):
              precision    recall  f1-score   support

           0       0.82      0.84      0.83       276
           1       0.91      0.83      0.87       276
           2       0.88      0.88      0.88       277
           3       0.81      0.86      0.83       276

    accuracy                           0.85      1105
   macro avg       0.86      0.85      0.85      1105
weight

# Conclusion & Model Comparison
This project successfully built and evaluated several models for news category classification. After addressing a significant class imbalance by down-sampling, we compared the performance of three different machine learning algorithms.

On the preprocessed text, the models reached the following test accuracies: Multinomial Naive Bayes 88%, Logistic Regression 86%, and Linear SVM 85%. The main takeaways:

- **Multinomial Naive Bayes (MNB) was the top performer.** This is not surprising for a text classification task like this one. MNB is designed for count features (such as the word and bigram counts produced by `CountVectorizer`) and often works well with high-dimensional, sparse data like a large text vocabulary, even when the dataset is relatively small.

- **Text preprocessing was crucial.** Our first Naive Bayes model, trained on raw text, achieved an accuracy of 84%. After removing stop words and punctuation, the accuracy rose to 88%. This suggests that dropping common, non-descriptive words let the model focus on more meaningful terms, giving a four-point gain on this test split.

- **Logistic Regression and SVM were strong contenders but fell slightly short.** Both are powerful and versatile classifiers, but they were run here with largely default settings and may need hyperparameter tuning (e.g., adjusting the regularization parameter `C`) or more data to outperform Naive Bayes. In this case, Naive Bayes' assumptions appear to suit the data well.
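
As a rough illustration of the tuning mentioned in the last point, the regularization parameter `C` of Logistic Regression could be searched with `GridSearchCV`. This is a sketch under stated assumptions: the toy corpus, the step name `logreg`, the candidate values for `C`, and `cv=2` are all chosen for illustration and were not part of the original experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 0 = business, 1 = sports.
texts = [
    "stocks rally as markets open",
    "bank profits rise this quarter",
    "startup raises new funding",
    "company shares fall sharply",
    "team wins the championship final",
    "striker scores twice in derby",
    "coach praises young players",
    "club signs star goalkeeper",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([
    ("vectorizer_bow", CountVectorizer(ngram_range=(1, 2))),
    ("logreg", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
param_grid = {"logreg__C": [0.01, 0.1, 1, 10]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, search.best_score_)
```

On the real data, `texts` and `labels` would be replaced by `X_train` and `y_train`, and a larger `cv` (for example 5) would give more stable estimates.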

In summary, for this dataset a **Multinomial Naive Bayes classifier combined with stop-word and punctuation removal** was the most effective of the approaches tested.
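
All accuracies in this notebook come from a single 80/20 split, so differences of a few points between models may partly reflect which articles happened to land in the test set. One way to check the ranking is k-fold cross-validation. The sketch below runs on a made-up toy corpus with `n_splits=2`; the corpus, the dictionary keys, and the fold count are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 0 = business, 1 = sports.
texts = [
    "stocks rally as markets open",
    "bank profits rise this quarter",
    "startup raises new funding",
    "company shares fall sharply",
    "team wins the championship final",
    "striker scores twice in derby",
    "coach praises young players",
    "club signs star goalkeeper",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

models = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
}

# Stratified folds keep the class balance in every fold.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=6)

scores = {}
for name, estimator in models.items():
    # The vectorizer is inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(CountVectorizer(ngram_range=(1, 2)), estimator)
    fold_scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="accuracy")
    scores[name] = fold_scores.mean()

for name, mean_acc in scores.items():
    print(f"{name}: mean accuracy {mean_acc:.2f}")
```

Keeping the vectorizer inside the pipeline matters here: fitting it on the full corpus before cross-validation would leak vocabulary from the held-out fold into training.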