![logo_ironhack_blue 7](https://user-images.githubusercontent.com/23629340/40541063-a07a0a8a-601a-11e8-91b5-2f13e4e6b441.png)

# PROJECT | Natural Language Processing Challenge

## Introduction

Processing text is a core skill for Data Scientists and AI Engineers.

In this project, you will put this skill into practice by classifying news headlines as real or fake.

## Project Overview

In the file `dataset/data.csv`, you will find a dataset containing news articles with the following columns:

- **`label`**: 0 if the news is fake, 1 if the news is real.
- **`title`**: The headline of the news article.
- **`text`**: The full content of the article.
- **`subject`**: The category or topic of the news.
- **`date`**: The publication date of the article.

Your goal is to build a classifier that is able to distinguish between the two.

Once you have built a classifier, use it to predict the labels for `dataset/validation_data.csv`. Generate a new file
in which the placeholder label `2` is replaced by `0` (fake) or `1` (real) according to your model. Keep the original file format:
do not add columns, and keep the same column separator.
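
The prediction step described above could be sketched as follows. This is a minimal illustration on toy data, not the required solution: the helper `fill_labels`, the toy rows, and the TF-IDF plus logistic regression model are all assumptions made for the example, and the comma separator is a guess that should be checked against the real file.

```python
import io

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def fill_labels(validation: pd.DataFrame, model) -> pd.DataFrame:
    """Hypothetical helper: replace the placeholder label with model predictions."""
    out = validation.copy()
    out["label"] = model.predict(out["text"])
    return out  # same columns, same order as the input


# Toy training data (made up for illustration only)
train = pd.DataFrame({
    "label": [0, 1, 0, 1],
    "text": [
        "shocking secret exposed",
        "senate passes budget bill",
        "you will not believe this",
        "central bank holds rates",
    ],
})
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train["text"], train["label"])

# Toy validation frame with the same five columns and the placeholder label 2
validation = pd.DataFrame({
    "label": [2, 2],
    "title": ["secret exposed", "budget bill"],
    "text": ["shocking secret", "budget bill passes"],
    "subject": ["news", "politics"],
    "date": ["2017-01-01", "2017-01-02"],
})
filled = fill_labels(validation, model)

# Write without the index so no extra column appears.
# Pass sep=... here if the original file does not use commas.
buffer = io.StringIO()
filled.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

Writing to a real path instead of `io.StringIO` works the same way; `index=False` is what keeps pandas from adding an unnamed index column.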

Make sure to split `data.csv` into **training** and **test** datasets before using it for model training or evaluation.
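
One common way to make such a split is scikit-learn's `train_test_split`. The sketch below uses a made-up ten-row frame and assumes a random, stratified 80/20 split is acceptable; a date-ordered split is another reasonable choice for news data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for data.csv (illustrative only)
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": [0] * 5 + [1] * 5,
})

# stratify keeps the fake/real ratio the same in both parts
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy["text"],
    toy["label"],
    test_size=0.2,
    stratify=toy["label"],
    random_state=42,
)

print(len(toy_X_train), len(toy_X_test))
```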

## Guidance

As in a real-life scenario, the choice of text treatment and modelling approach is yours.
Use the techniques you have learned and common packages to process the data and classify the text.

## Deliverables

1. **Python Code:** Provide well-documented Python code that conducts the analysis.
2. **Predictions:** A CSV file in the same format as `validation_data.csv` but with the predicted labels (0 or 1).
3. **Accuracy estimation:** Provide the teacher with your estimation of how your model will perform.
4. **Presentation:** You will present your model in a 10-minute presentation. Your teacher will provide further instructions.
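
For the accuracy estimation deliverable, one possible approach is k-fold cross-validation on the training data, reporting the mean and spread of the fold scores. The sketch below is an assumption about how this could be done, using a made-up corpus and a TF-IDF plus logistic regression pipeline; random folds ignore publication dates, so the estimate may be optimistic if the validation data is from a later period.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up corpus: 6 "fake" (0) and 6 "real" (1) headlines
texts = [
    "shocking secret they hide", "senate passes budget bill",
    "you will not believe this", "central bank holds rates",
    "miracle cure doctors hate", "court rules on trade case",
    "shocking truth exposed now", "senate votes on tax bill",
    "secret plot finally exposed", "bank reports quarterly results",
    "unbelievable miracle shocks town", "court hears budget appeal",
]
labels = [0, 1] * 6

pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# The vectorizer is refit inside each fold, so no test-fold vocabulary leaks into training
scores = cross_val_score(pipeline, texts, labels, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```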

In [95]:
# A transformer model can be used, but a second model is also needed
# Other models can be tried as well: Logistic Regression, Random Forest, etc.
# Additional training data can be used

# Set Up the Environment

In [96]:

import nltk
import re
import string
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [97]:
## Read Data
data = pd.read_csv("train_data.csv",encoding='latin-1')

print(data.shape)

print(data.head())

(40399, 5)
   label                                              title  \
0      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
1      0  WATCH DIRTY HARRY REID ON HIS LIE ABOUT ROMNEY...   
2      0  HILLARY RODHAM NIXON: A CANDIDATE WITH MORE BA...   
3      0  FLASHBACK: KING OBAMA COMMUTES SENTENCES OF 22...   
4      0  BENGHAZI PANEL CALLS HILLARY TO TESTIFY UNDER ...   

                                                text    subject        date  
0  The irony here isn t lost on us. Hillary is be...   politics  2015-03-31  
1  In case you missed it Sen. Harry Reid (R-NV), ...  left-news  2015-03-31  
2  The irony here isn t lost on us. Hillary is be...  left-news  2015-03-31  
3  Just making room for Hillary President Obama t...   politics  2015-03-31  
4  Does anyone really think Hillary Clinton will ...   politics  2015-03-31  


In [98]:
# Reduce the training set to speed up development. 

# data = data.head(1000)

print(data.shape)

(40399, 5)


# Text processing

In [99]:
data.head()

In [100]:
from nltk.tokenize import word_tokenize

# Tokenize the 'title' and 'text' columns in place (the raw strings are replaced by token lists)
data['title'] = data['title'].apply(lambda x: word_tokenize(str(x)))
data['text'] = data['text'].apply(lambda x: word_tokenize(str(x)))

# Check the result
print(data.head())

   label                                              title  \
0      0  [HILLARY, RODHAM, NIXON, :, A, CANDIDATE, WITH...   
1      0  [WATCH, DIRTY, HARRY, REID, ON, HIS, LIE, ABOU...   
2      0  [HILLARY, RODHAM, NIXON, :, A, CANDIDATE, WITH...   
3      0  [FLASHBACK, :, KING, OBAMA, COMMUTES, SENTENCE...   
4      0  [BENGHAZI, PANEL, CALLS, HILLARY, TO, TESTIFY,...   

                                                text    subject        date  
0  [The, irony, here, isn, t, lost, on, us, ., Hi...   politics  2015-03-31  
1  [In, case, you, missed, it, Sen., Harry, Reid,...  left-news  2015-03-31  
2  [The, irony, here, isn, t, lost, on, us, ., Hi...  left-news  2015-03-31  
3  [Just, making, room, for, Hillary, President, ...   politics  2015-03-31  
4  [Does, anyone, really, think, Hillary, Clinton...   politics  2015-03-31  


In [101]:
title = data['title']
text = data['text']

In [102]:
def clean_text(text: str) -> str:
    # Convert the input to a string (a token list becomes its string form, e.g. "['a', 'b']")
    text = str(text)

    # Remove all special characters (keep only letters, numbers, and spaces)
    text = re.sub(r"[^A-Za-z0-9\s]", "", text)

    # Remove all single characters (like "a", "b", "c" standing alone)
    text = re.sub(r"\b[A-Za-z]\b", "", text)

    # Remove single characters from the start of the text
    text = re.sub(r"^[A-Za-z]\s+", "", text)

    # Replace multiple spaces with a single space
    text = re.sub(r"\s+", " ", text)

    # Convert everything to lowercase
    text = text.lower()

    return text


# clean text
data['text'] = data['text'].apply(clean_text)

# clean title
data['title'] = data['title'].apply(clean_text)

data.head() 


|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillary rodham nixon candidate with more bagga... | the irony here isn lost on us hillary is being... | politics | 2015-03-31 |
| 1 | 0 | watch dirty harry reid on his lie about romney... | in case you missed it sen harry reid rnv who a... | left-news | 2015-03-31 |
| 2 | 0 | hillary rodham nixon candidate with more bagga... | the irony here isn lost on us hillary is being... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commutes sentences of 22 ... | just making room for hillary president obama t... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel calls hillary to testify under ... | does anyone really think hillary clinton will ... | politics | 2015-03-31 |
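
To see what each regex in `clean_text` contributes, the steps can be traced on a single string. The helper `clean_steps` below is hypothetical: it repeats the same five operations and records the intermediate results. The sample sentence and the hand-written token list are made up for illustration.

```python
import re


def clean_steps(raw: str) -> list:
    """Hypothetical helper: apply the clean_text steps and keep each intermediate result."""
    steps = []
    text = re.sub(r"[^A-Za-z0-9\s]", "", str(raw))  # drop special characters
    steps.append(text)
    text = re.sub(r"\b[A-Za-z]\b", "", text)        # drop standalone single letters
    steps.append(text)
    text = re.sub(r"^[A-Za-z]\s+", "", text)        # drop a single letter at the start
    steps.append(text)
    text = re.sub(r"\s+", " ", text)                # collapse whitespace
    steps.append(text)
    text = text.lower()                             # lowercase
    steps.append(text)
    return steps


steps = clean_steps("FLASHBACK: A candidate isn t lost")
for number, result in enumerate(steps, start=1):
    print(number, repr(result))

# A token list is first turned into its string form, brackets and quotes included;
# the first regex then strips that punctuation again.
from_tokens = clean_steps(["FLASHBACK", ":", "A", "candidate", "isn", "t", "lost"])[-1]
print(repr(from_tokens))
```

On this sample the raw string and the token list end up identical after cleaning, which suggests the regex steps do most of the work here. Note also that the stray `t` from `isn t` is removed as a single character, which is why the cleaned rows above read `isn lost`.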


In [103]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    tokens = word_tokenize(str(text))
    filtered = [word for word in tokens if word.lower() not in stop_words and word.isalpha()]
    return ' '.join(filtered)

# Make a copy to preserve the original data
data_nostop = data.copy()

# Replace the columns with stopword-removed text
data_nostop['title'] = data_nostop['title'].apply(remove_stopwords)
data_nostop['text'] = data_nostop['text'].apply(remove_stopwords)


In [104]:
data_nostop.head()


|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillary rodham nixon candidate baggage samsoni... | irony lost us hillary compared president wante... | politics | 2015-03-31 |
| 1 | 0 | watch dirty harry reid lie taxes win | case missed sen harry reid rnv announced last ... | left-news | 2015-03-31 |
| 2 | 0 | hillary rodham nixon candidate baggage samsoni... | irony lost us hillary compared president wante... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commutes sentences drug d... | making room hillary president obama today anno... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel calls hillary testify oath whit... | anyone really think hillary clinton come clean... | politics | 2015-03-31 |
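
Row 3 above shows a side effect of the `word.isalpha()` check in `remove_stopwords`: the number `22` disappears along with the stopwords. The sketch below reproduces that behaviour with a tiny hand-picked stopword set and a made-up sentence (both are assumptions, not the NLTK list or a row from the data), and adds a hypothetical `keep_numbers` switch to compare the two outcomes.

```python
# Tiny illustrative stopword set (NOT the full NLTK English list)
STOP_WORDS = {"of", "on", "his", "about", "to", "the"}


def remove_stopwords_sketch(text: str, keep_numbers: bool) -> str:
    """Drop stopwords; optionally also drop tokens that are not purely alphabetic."""
    kept = []
    for word in text.split():
        if word.lower() in STOP_WORDS:
            continue
        if not keep_numbers and not word.isalpha():
            continue  # this is the check that removes "22"
        kept.append(word)
    return " ".join(kept)


sample = "court commutes sentences of 22 drug offenders"
without_numbers = remove_stopwords_sketch(sample, keep_numbers=False)
with_numbers = remove_stopwords_sketch(sample, keep_numbers=True)
print(without_numbers)
print(with_numbers)
```

Whether numbers help the classifier is an empirical question; the point is only that the filter makes that choice implicitly.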


In [105]:
# compare

data.head()

|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillary rodham nixon candidate with more bagga... | the irony here isn lost on us hillary is being... | politics | 2015-03-31 |
| 1 | 0 | watch dirty harry reid on his lie about romney... | in case you missed it sen harry reid rnv who a... | left-news | 2015-03-31 |
| 2 | 0 | hillary rodham nixon candidate with more bagga... | the irony here isn lost on us hillary is being... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commutes sentences of 22 ... | just making room for hillary president obama t... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel calls hillary to testify under ... | does anyone really think hillary clinton will ... | politics | 2015-03-31 |


## Stemming and Lemmatization

In [106]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [107]:
data_nostop.head()

|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillary rodham nixon candidate baggage samsoni... | irony lost us hillary compared president wante... | politics | 2015-03-31 |
| 1 | 0 | watch dirty harry reid lie taxes win | case missed sen harry reid rnv announced last ... | left-news | 2015-03-31 |
| 2 | 0 | hillary rodham nixon candidate baggage samsoni... | irony lost us hillary compared president wante... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commutes sentences drug d... | making room hillary president obama today anno... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel calls hillary testify oath whit... | anyone really think hillary clinton come clean... | politics | 2015-03-31 |


In [108]:
def stem_text(text):
    tokens = word_tokenize(str(text))
    return ' '.join([stemmer.stem(word) for word in tokens])

def lemmatize_text(text):
    tokens = word_tokenize(str(text))
    return ' '.join([lemmatizer.lemmatize(word) for word in tokens])


In [109]:
# stem the data
data_norm = data_nostop.copy()

data_norm['title'] = data_norm['title'].apply(stem_text)
data_norm['text'] = data_norm['text'].apply(stem_text)

data_norm.head()

|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillari rodham nixon candid baggag samsonit fa... | ironi lost us hillari compar presid want take ... | politics | 2015-03-31 |
| 1 | 0 | watch dirti harri reid lie tax win | case miss sen harri reid rnv announc last week... | left-news | 2015-03-31 |
| 2 | 0 | hillari rodham nixon candid baggag samsonit fa... | ironi lost us hillari compar presid want take ... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commut sentenc drug dealer | make room hillari presid obama today announc d... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel call hillari testifi oath white... | anyon realli think hillari clinton come clean ... | politics | 2015-03-31 |


In [110]:
# lemmatize the data (applied on top of the already-stemmed text in data_norm)
data_norm['title'] = data_norm['title'].apply(lemmatize_text)
data_norm['text'] = data_norm['text'].apply(lemmatize_text)

data_norm.head()

|   | label | title | text | subject | date |
|---|---|---|---|---|---|
| 0 | 0 | hillari rodham nixon candid baggag samsonit fa... | ironi lost u hillari compar presid want take n... | politics | 2015-03-31 |
| 1 | 0 | watch dirti harri reid lie tax win | case miss sen harri reid rnv announc last week... | left-news | 2015-03-31 |
| 2 | 0 | hillari rodham nixon candid baggag samsonit fa... | ironi lost u hillari compar presid want take n... | left-news | 2015-03-31 |
| 3 | 0 | flashback king obama commut sentenc drug dealer | make room hillari presid obama today announc d... | politics | 2015-03-31 |
| 4 | 0 | benghazi panel call hillari testifi oath white... | anyon realli think hillari clinton come clean ... | politics | 2015-03-31 |


# Split the data

TimeSeriesSplit: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.TimeSeriesSplit.html

scikit-learn's `TimeSeriesSplit` is meant for cross-validation on time-ordered data. For a single train/test split, as used in this notebook, the data is sorted by date and split manually.

In [111]:
# Ensure 'date' is a datetime column
data_norm['date'] = pd.to_datetime(data_norm['date'])

# Sort by date
data_norm = data_norm.sort_values('date')

# Define the split index
split_idx = int(len(data_norm) * 0.8)

# Split chronologically
X = data_norm.drop(['label'], axis=1)
y = data_norm['label']

X_train = X.iloc[:split_idx]
X_test = X.iloc[split_idx:]
y_train = y.iloc[:split_idx]
y_test = y.iloc[split_idx:]

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)

Training set shape: (32319, 4)
Test set shape: (8080, 4)
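
The manual chronological split and the `TimeSeriesSplit` alternative linked earlier can be compared on toy data. In the sketch below the frame, the helper `chronological_split`, and the label pattern are all made up for illustration. It also prints the label share in each part, since a date-ordered split can leave the two parts with quite different class balances (in this notebook, the classification reports show a test set of 1302 fake and 6778 real articles, roughly 84% real).

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit


def chronological_split(df: pd.DataFrame, date_col: str = "date", train_frac: float = 0.8):
    """Hypothetical helper: oldest rows go to train, newest rows go to test."""
    ordered = df.sort_values(date_col, kind="stable")
    cut = int(len(ordered) * train_frac)
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy frame: ten consecutive days, fake news (0) early and real news (1) late
toy = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=10, freq="D"),
    "label": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1],
})

toy_train, toy_test = chronological_split(toy)
print("train share of real news:", toy_train["label"].mean())
print("test share of real news:", toy_test["label"].mean())

# TimeSeriesSplit produces several expanding-window folds on the same ordered data
ordered = toy.sort_values("date", kind="stable")
splits = list(TimeSeriesSplit(n_splits=3).split(ordered))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(fold, train_idx, test_idx)
```

In every fold the training indices come strictly before the test indices, so the model is never evaluated on articles older than the ones it was trained on.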


In [112]:
X_train.head()


|   | title | text | subject | date |
|---|---|---|---|---|
| 0 | hillari rodham nixon candid baggag samsonit fa... | ironi lost u hillari compar presid want take n... | politics | 2015-03-31 |
| 1 | watch dirti harri reid lie tax win | case miss sen harri reid rnv announc last week... | left-news | 2015-03-31 |
| 2 | hillari rodham nixon candid baggag samsonit fa... | ironi lost u hillari compar presid want take n... | left-news | 2015-03-31 |
| 3 | flashback king obama commut sentenc drug dealer | make room hillari presid obama today announc d... | politics | 2015-03-31 |
| 4 | benghazi panel call hillari testifi oath white... | anyon realli think hillari clinton come clean ... | politics | 2015-03-31 |


In [113]:
y_train.head()

0    0
1    0
2    0
3    0
4    0
Name: label, dtype: int64

# Feature Extraction

## TF-IDF

In [120]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize the vectorizer
tfidf_vectorizer_text = TfidfVectorizer()

# TF-IDF for 'text'
X_train_text_tfidf = tfidf_vectorizer_text.fit_transform(X_train['text'])
X_test_text_tfidf = tfidf_vectorizer_text.transform(X_test['text'])

# TF-IDF for 'title'
tfidf_vectorizer_title = TfidfVectorizer()
X_train_title_tfidf = tfidf_vectorizer_title.fit_transform(X_train['title'])
X_test_title_tfidf = tfidf_vectorizer_title.transform(X_test['title'])

# Print shapes
print("Text TF-IDF shapes:", X_train_text_tfidf.shape, X_test_text_tfidf.shape)
print("Title TF-IDF shapes:", X_train_title_tfidf.shape, X_test_title_tfidf.shape)  

# Print feature names and first few rows for 'text'
print("Text TF-IDF feature names:", tfidf_vectorizer_text.get_feature_names_out())
print("First 5 rows of text TF-IDF:\n", X_train_text_tfidf[:5].toarray())

# Print feature names and first few rows for 'title'
print("Title TF-IDF feature names:", tfidf_vectorizer_title.get_feature_names_out())
print("First 5 rows of title TF-IDF:\n", X_train_title_tfidf[:5].toarray())


Text TF-IDF shapes: (32319, 137141) (8080, 137141)
Title TF-IDF shapes: (32319, 12157) (8080, 12157)
Text TF-IDF feature names: ['aa' 'aaa' 'aaaaackkk' ... 'zzzzaaaacccchhh' 'zzzzzzzz' 'zzzzzzzzzzzzz']
First 5 rows of text TF-IDF:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
Title TF-IDF feature names: ['aa' 'aar' 'aarp' ... 'zuckerberg' 'zuma' 'zurich']
First 5 rows of title TF-IDF:
 [[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]


## Bag of Words

May be used later for comparison with TF-IDF.

# Train the Classifier

## Random Forest

In [128]:
# With title

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_title_tfidf, y_train)

predictions_title = clf.predict(X_test_title_tfidf)

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("Accuracy:", accuracy_score(y_test, predictions_title))
print("Classification Report:\n", classification_report(y_test, predictions_title))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_title))


Accuracy: 0.8535891089108911
Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.93      0.67      1302
           1       0.98      0.84      0.91      6778

    accuracy                           0.85      8080
   macro avg       0.76      0.88      0.79      8080
weighted avg       0.91      0.85      0.87      8080

Confusion Matrix:
 [[1211   91]
 [1092 5686]]
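
The figures in the report can be recomputed from the confusion matrix, which helps when reading it: in scikit-learn's layout, rows are true labels and columns are predicted labels. The arithmetic below uses the title-only Random Forest matrix printed above; the variable names are chosen for this example.

```python
import numpy as np

# Rows: true label (0 = fake, 1 = real); columns: predicted label
cm = np.array([[1211, 91],
               [1092, 5686]])

accuracy = np.trace(cm) / cm.sum()          # correct predictions / all predictions
precision_fake = cm[0, 0] / cm[:, 0].sum()  # of everything predicted fake, how much is fake
recall_fake = cm[0, 0] / cm[0, :].sum()     # of all fake articles, how many were found

print(f"accuracy={accuracy:.4f}")
print(f"precision(fake)={precision_fake:.2f}")
print(f"recall(fake)={recall_fake:.2f}")
```

The low precision for class 0 comes from the 1092 real articles predicted as fake, which outnumber the 91 errors in the other direction by a wide margin.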


In [127]:
# With text

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_text_tfidf, y_train)

predictions_text = clf.predict(X_test_text_tfidf)

print("Accuracy:", accuracy_score(y_test, predictions_text))
print("Classification Report:\n", classification_report(y_test, predictions_text))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_text))

Accuracy: 0.9627475247524753
Classification Report:
               precision    recall  f1-score   support

           0       0.82      0.99      0.90      1302
           1       1.00      0.96      0.98      6778

    accuracy                           0.96      8080
   macro avg       0.91      0.98      0.94      8080
weighted avg       0.97      0.96      0.96      8080

Confusion Matrix:
 [[1293    9]
 [ 292 6486]]


In [129]:
# Combined

from scipy.sparse import hstack

# Combine the TF-IDF features for text and title
X_train_combined = hstack([X_train_text_tfidf, X_train_title_tfidf])
X_test_combined = hstack([X_test_text_tfidf, X_test_title_tfidf])

# Train the classifier on the combined features
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train_combined, y_train)

# Predict and evaluate
predictions_combined = clf.predict(X_test_combined)
print("Accuracy:", accuracy_score(y_test, predictions_combined))
print("Classification Report:\n", classification_report(y_test, predictions_combined))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_combined))

Accuracy: 0.9608910891089109
Classification Report:
               precision    recall  f1-score   support

           0       0.81      0.99      0.89      1302
           1       1.00      0.95      0.98      6778

    accuracy                           0.96      8080
   macro avg       0.90      0.97      0.93      8080
weighted avg       0.97      0.96      0.96      8080

Confusion Matrix:
 [[1292   10]
 [ 306 6472]]


## Logistic Regression

In [134]:
# With title

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=42)
clf.fit(X_train_title_tfidf, y_train)

predictions_title = clf.predict(X_test_title_tfidf)

print("Accuracy:", accuracy_score(y_test, predictions_title))
print("Classification Report:\n", classification_report(y_test, predictions_title))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_title))


Accuracy: 0.8601485148514851
Classification Report:
               precision    recall  f1-score   support

           0       0.54      0.96      0.69      1302
           1       0.99      0.84      0.91      6778

    accuracy                           0.86      8080
   macro avg       0.76      0.90      0.80      8080
weighted avg       0.92      0.86      0.87      8080

Confusion Matrix:
 [[1253   49]
 [1081 5697]]


In [135]:
# With text

clf = LogisticRegression(random_state=42)
clf.fit(X_train_text_tfidf, y_train)

predictions_text = clf.predict(X_test_text_tfidf)

print("Accuracy:", accuracy_score(y_test, predictions_text))
print("Classification Report:\n", classification_report(y_test, predictions_text))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_text))

Accuracy: 0.9709158415841584
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.99      0.92      1302
           1       1.00      0.97      0.98      6778

    accuracy                           0.97      8080
   macro avg       0.93      0.98      0.95      8080
weighted avg       0.97      0.97      0.97      8080

Confusion Matrix:
 [[1292   10]
 [ 225 6553]]


In [None]:
# Combined

from scipy.sparse import hstack

# Combine the TF-IDF features for text and title
X_train_combined = hstack([X_train_text_tfidf, X_train_title_tfidf])
X_test_combined = hstack([X_test_text_tfidf, X_test_title_tfidf])

# Train the classifier on the combined features
clf = LogisticRegression(random_state=42)
clf.fit(X_train_combined, y_train)

# Predict and evaluate
predictions_combined = clf.predict(X_test_combined)
print("Accuracy:", accuracy_score(y_test, predictions_combined))
print("Classification Report:\n", classification_report(y_test, predictions_combined))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions_combined))

Accuracy: 0.9804455445544554
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.99      0.94      1302
           1       1.00      0.98      0.99      6778

    accuracy                           0.98      8080
   macro avg       0.95      0.99      0.97      8080
weighted avg       0.98      0.98      0.98      8080

Confusion Matrix:
 [[1293    9]
 [ 149 6629]]
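
The combined title and text features can also be expressed as a single scikit-learn pipeline, which keeps the two vectorizers and the classifier together so they can be fitted and applied in one call. This is an alternative sketch on made-up data, assuming a `ColumnTransformer` is acceptable in place of the manual `hstack`; the step names are arbitrary.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One TF-IDF vocabulary per column; the outputs are stacked side by side
features = ColumnTransformer([
    ("title_tfidf", TfidfVectorizer(), "title"),
    ("text_tfidf", TfidfVectorizer(), "text"),
])
combined_model = Pipeline([
    ("features", features),
    ("clf", LogisticRegression(random_state=42)),
])

# Made-up rows for illustration only
toy_train = pd.DataFrame({
    "title": ["shocking secret", "budget bill", "miracle cure", "rates held"],
    "text": [
        "you will not believe this shocking secret",
        "the senate passed the budget bill today",
        "doctors hate this miracle cure",
        "the central bank held interest rates",
    ],
    "label": [0, 1, 0, 1],
})
toy_new = pd.DataFrame({
    "title": ["secret cure", "senate bill"],
    "text": ["shocking miracle secret", "senate passed the bill"],
})

combined_model.fit(toy_train[["title", "text"]], toy_train["label"])
toy_pred = combined_model.predict(toy_new)
print(toy_pred)
```

Passing a column name as a plain string (not a list) hands the vectorizer a one-dimensional column of strings, which is what `TfidfVectorizer` expects.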


## Transformer Model: DeBERTa v3 

In [141]:
! pip install transformers torch

Collecting transformers
  Downloading transformers-4.55.2-py3-none-any.whl.metadata (41 kB)
Collecting torch
  Downloading torch-2.8.0-cp313-none-macosx_11_0_arm64.whl.metadata (30 kB)
Collecting huggingface-hub<1.0,>=0.34.0 (from transformers)
  Downloading huggingface_hub-0.34.4-py3-none-any.whl.metadata (14 kB)
Collecting tokenizers<0.22,>=0.21 (from transformers)
  Downloading tokenizers-0.21.4-cp39-abi3-macosx_11_0_arm64.whl.metadata (6.7 kB)
Collecting safetensors>=0.4.3 (from transformers)
  Downloading safetensors-0.6.2-cp38-abi3-macosx_11_0_arm64.whl.metadata (4.1 kB)
Collecting hf-xet<2.0.0,>=1.1.3 (from huggingface-hub<1.0,>=0.34.0->transformers)
  Downloading hf_xet-1.1.7-cp37-abi3-macosx_11_0_arm64.whl.metadata (703 bytes)
Downloading transformers-4.55.2-py3-none-any.whl (11.3 MB)
Downloading huggingface_hub-0.34.4-py3-none-any.whl (

In [142]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Choose model: "microsoft/deberta-v3-base"
model_name = "microsoft/deberta-v3-base"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Note: this checkpoint ships without a trained classification head. The 2-label head
# created here is randomly initialized, so the model needs fine-tuning on the training
# data before the predictions below carry any meaning.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Create pipeline on the first CUDA GPU if one is available, otherwise on the CPU
device = 0 if torch.cuda.is_available() else -1
nlp = pipeline("text-classification", model=model, tokenizer=tokenizer, device=device)

# Example: predict on test set (using 'text' column); long articles are truncated
preds = [nlp(text, truncation=True)[0]['label'] for text in X_test['text']]

# Convert labels (e.g., 'LABEL_0' -> 0, 'LABEL_1' -> 1)
preds = [int(label.split('_')[-1]) for label in preds]

print(preds[:10])
