# 1. Loading and Preparing the Dataset

In this step, two separate datasets — `Fake.csv` and `True.csv` — are loaded using the pandas library. A new column named `label` is added to each dataset to indicate whether the news is fake or true:
- `label = 1` for fake news,
- `label = 0` for true news.

Then, both datasets are concatenated into a single DataFrame. Only the `title`, `text`, and `label` columns are kept. To simplify further processing, the `title` and `text` columns are merged into a new `content` column, and the final dataset includes only `content` and `label` columns.


In [None]:
import pandas as pd

fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

fake_df["label"] = 1
true_df["label"] = 0

df = pd.concat([fake_df, true_df], ignore_index=True)

df = df[["title", "text", "label"]]

df["content"] = df["title"] + " " + df["text"]
df = df[["content", "label"]]

df.head()


| | content | label |
|---|---|---|
| 0 | Donald Trump Sends Out Embarrassing New Year’... | 1 |
| 1 | Drunk Bragging Trump Staffer Started Russian ... | 1 |
| 2 | Sheriff David Clarke Becomes An Internet Joke... | 1 |
| 3 | Trump Is So Obsessed He Even Has Obama’s Name... | 1 |
| 4 | Pope Francis Just Called Out Donald Trump Dur... | 1 |


In [None]:
true_df.shape

(21417, 5)

In [None]:
fake_df.shape

(23481, 5)

In [None]:
true_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    21417 non-null  object
 1   text     21417 non-null  object
 2   subject  21417 non-null  object
 3   date     21417 non-null  object
 4   label    21417 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 836.7+ KB


In [None]:
fake_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   title    23481 non-null  object
 1   text     23481 non-null  object
 2   subject  23481 non-null  object
 3   date     23481 non-null  object
 4   label    23481 non-null  int64 
dtypes: int64(1), object(4)
memory usage: 917.4+ KB


In [None]:
true_df.isnull().sum()

| Column | Null count |
|---|---|
| title | 0 |
| text | 0 |
| subject | 0 |
| date | 0 |
| label | 0 |


In [None]:
fake_df.isnull().sum()

| Column | Null count |
|---|---|
| title | 0 |
| text | 0 |
| subject | 0 |
| date | 0 |
| label | 0 |


# 2. Text Cleaning and Preprocessing

In this step, the raw text data is cleaned to prepare it for vectorization and modeling. The following operations are performed:
- Convert all text to lowercase.
- Remove URLs.
- Remove mentions and hashtags.
- Remove all non-letter characters.
- Remove English stopwords.
- Lemmatize each word (convert to base form).

A new column `cleaned_content` is created, which contains the cleaned version of the original `content` column.


In [None]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

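def clean_text(text):
    """Lowercase, strip URLs, mentions and non-letters, drop stopwords, lemmatize."""
    text = text.lower()
    # Remove URLs ("http\S+" also covers "https...").
    text = re.sub(r"http\S+|www\S+", "", text)
    # Remove @mentions and the "#" symbol of hashtags.
    text = re.sub(r"@\w+|#", "", text)
    # Replace every remaining non-letter character with a space.
    text = re.sub(r"[^a-zA-Z]", " ", text)
    tokens = text.split()
    # Drop English stopwords and lemmatize what is left.
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return " ".join(tokens)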

df["cleaned_content"] = df["content"].apply(clean_text)

df[["cleaned_content", "label"]].head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


| | cleaned_content | label |
|---|---|---|
| 0 | donald trump sends embarrassing new year eve m... | 1 |
| 1 | drunk bragging trump staffer started russian c... | 1 |
| 2 | sheriff david clarke becomes internet joke thr... | 1 |
| 3 | trump obsessed even obama name coded website i... | 1 |
| 4 | pope francis called donald trump christmas spe... | 1 |
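
The regex steps listed above can be traced on a single string. The following is a minimal sketch, not the notebook's own function: it skips lemmatization so that it runs without NLTK, and `SAMPLE_STOPWORDS`, `clean_text_sketch`, and the sample sentence are hypothetical stand-ins chosen for illustration.

```python
import re

# Hypothetical mini stopword list, standing in for NLTK's English list.
SAMPLE_STOPWORDS = {"the", "a", "is", "at", "this", "out", "via"}

def clean_text_sketch(text):
    """Regex-only version of the cleaning steps (no lemmatization)."""
    text = text.lower()
    text = re.sub(r"http\S+|www\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+|#", "", text)          # drop @mentions and '#' symbols
    text = re.sub(r"[^a-z]", " ", text)         # keep letters only
    tokens = [t for t in text.split() if t not in SAMPLE_STOPWORDS]
    return " ".join(tokens)

sample = "BREAKING: @user1 says this is FAKE! Read https://example.com/x #news 2017"
cleaned = clean_text_sketch(sample)
print(cleaned)  # breaking says fake read news
```

Note that the order matters: URLs and mentions have to be removed before the non-letter filter, otherwise their letters would survive as ordinary tokens.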


# 3. Data Splitting and Text Vectorization

In this step, the cleaned text data is split into training (75%) and testing (25%) sets with a fixed random seed. Two vectorization techniques, each fitted on the training set only, convert the text into numerical features:

- **CountVectorizer**: Converts text documents to a matrix of token counts.
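- **TF-IDF Vectorizer**: Converts text documents to a matrix of TF-IDF features, which down-weight terms that appear in many documents and so emphasize terms that are distinctive for a given document.

The resulting feature matrices are used as input for the classical machine learning models in Sections 4 to 7. The deep learning models in Sections 8 and 9 use tokenized, padded sequences instead.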


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

X = df["cleaned_content"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

count_vectorizer = CountVectorizer()
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
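
To make the difference between the two representations concrete, here is a small pure-Python sketch on a hypothetical three-document corpus. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 row normalization, which is what scikit-learn documents as the default behaviour of `TfidfVectorizer`. The corpus and all variable names are illustrative only.

```python
import math
from collections import Counter

# Hypothetical three-document corpus (already cleaned and lowercased).
docs = ["senate passes bill", "senate vote delayed", "shocking senate secret"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({term for doc in tokenized for term in doc})

# Bag-of-words: one row per document, one column per vocabulary term.
count_matrix = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
n_docs = len(docs)
doc_freq = {term: sum(term in doc for doc in tokenized) for term in vocab}
idf = {term: math.log((1 + n_docs) / (1 + doc_freq[term])) + 1 for term in vocab}

def tfidf_row(counts):
    """Weight raw counts by IDF, then scale the row to unit L2 length."""
    weights = [count * idf[term] for count, term in zip(counts, vocab)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

tfidf_matrix = [tfidf_row(row) for row in count_matrix]

# Show the non-zero entries of the first document.
for term, weight in zip(vocab, tfidf_matrix[0]):
    if weight > 0:
        print(f"{term:8s} idf={idf[term]:.3f} tfidf={weight:.3f}")
```

In the count matrix, "senate" and "bill" both have a count of 1 in the first document. After TF-IDF weighting, "senate" (present in every document) receives a lower weight than "bill" (present in only one).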


# 4. Logistic Regression Model Training and Evaluation

In this section, a **Logistic Regression** model is trained using the TF-IDF features extracted from the cleaned text.


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score

lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

y_pred = lr_model.predict(X_test_tfidf)

print("Logistic Regression Model Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Logistic Regression Model Results:
Accuracy: 0.9865

Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      5330
           1       0.99      0.99      0.99      5895

    accuracy                           0.99     11225
   macro avg       0.99      0.99      0.99     11225
weighted avg       0.99      0.99      0.99     11225
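
For readers unfamiliar with the columns of the classification report, the sketch below computes precision, recall, and F1 for one class by hand. The labels and predictions are hypothetical toy values, not results from this notebook, and `binary_report` is an illustrative helper rather than a scikit-learn function.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical toy labels (1 = fake, 0 = true).
y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 1, 0, 0, 0, 1, 1, 0]

precision, recall, f1 = binary_report(y_true_toy, y_pred_toy)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The `support` column of the report is simply the number of true samples of each class in the test set.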



# 5. Naive Bayes Model Training and Evaluation (CountVectorizer)

In this section, a **Multinomial Naive Bayes** model is trained using the **CountVectorizer** features.




In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

nb_model = MultinomialNB()
nb_model.fit(X_train_count, y_train)
y_pred_nb = nb_model.predict(X_test_count)
print("Naive Bayes (CountVectorizer):\n", classification_report(y_test, y_pred_nb))



Naive Bayes (CountVectorizer):
               precision    recall  f1-score   support

           0       0.95      0.95      0.95      5330
           1       0.95      0.96      0.95      5895

    accuracy                           0.95     11225
   macro avg       0.95      0.95      0.95     11225
weighted avg       0.95      0.95      0.95     11225
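
The idea behind Multinomial Naive Bayes on word counts can be sketched in a few lines of pure Python. This is a simplified illustration on hypothetical headlines, assuming Laplace smoothing with `alpha = 1.0` (the default of `MultinomialNB`). It is not the scikit-learn implementation, and the names `log_score` and `predict` are made up for this example.

```python
import math
from collections import Counter

# Hypothetical labelled headlines (1 = fake, 0 = true), already cleaned.
train = [
    ("shocking secret exposed", 1),
    ("believe shocking secret video", 1),
    ("senate passes budget bill", 0),
    ("court reviews budget ruling", 0),
]
ALPHA = 1.0  # Laplace smoothing strength

# Count words per class and documents per class.
word_counts = {0: Counter(), 1: Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())
vocab = set(word_counts[0]) | set(word_counts[1])

def log_score(text, label):
    """Log prior plus the sum of smoothed log word likelihoods."""
    total = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / len(train))
    for word in text.split():
        if word in vocab:  # words never seen in training are ignored
            likelihood = (word_counts[label][word] + ALPHA) / (total + ALPHA * len(vocab))
            score += math.log(likelihood)
    return score

def predict(text):
    """Pick the class with the higher log score."""
    return max((0, 1), key=lambda label: log_score(text, label))

print(predict("shocking secret budget"))  # 1
print(predict("senate budget ruling"))    # 0
```

Smoothing keeps a single unseen word from driving a class probability to zero, which matters for sparse bag-of-words features.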



# 6. Logistic Regression with Cross-Validation (TF-IDF)

In this section, a **Logistic Regression** model is trained using **TF-IDF vectorized features**, with added **5-fold cross-validation** for more reliable model evaluation.




In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

lr_model = LogisticRegression(C=1.0, penalty='l2', solver='lbfgs')

cv_scores = cross_val_score(lr_model, X_train_tfidf, y_train, cv=5, scoring='accuracy')

print(f"Logistic Regression Cross-validation Scores: {cv_scores}")
print(f"Mean Cross-validation Score: {cv_scores.mean()}")

lr_model.fit(X_train_tfidf, y_train)

y_pred_lr = lr_model.predict(X_test_tfidf)
print("\n Logistic Regression (Test Set):\n", classification_report(y_test, y_pred_lr))


Logistic Regression Cross-validation Scores: [0.98559762 0.9844098  0.98648849 0.98663499 0.98307098]
Mean Cross-validation Score: 0.9852403773116467

 Logistic Regression (Test Set):
               precision    recall  f1-score   support

           0       0.98      0.99      0.99      5330
           1       0.99      0.99      0.99      5895

    accuracy                           0.99     11225
   macro avg       0.99      0.99      0.99     11225
weighted avg       0.99      0.99      0.99     11225
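
How a 5-fold split partitions the training data can be illustrated without any library. The sketch below uses plain, unshuffled, contiguous folds on a hypothetical dataset of 12 samples. When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds, which this simplified version does not reproduce.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # The first (n_samples % k) folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(n_samples=12, k=5))
for i, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"fold {i}: validate on {val_idx}, train on {len(train_idx)} samples")
```

Each sample is used for validation exactly once, so the mean of the five scores uses every training sample as held-out data once.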



# 7. Random Forest with Hyperparameter Tuning (TF-IDF)

In this section, a **Random Forest Classifier** is trained using **TF-IDF vectorized data**, and **hyperparameter tuning** is performed via **GridSearchCV**.



In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    'max_features': ['sqrt', 'log2']
}

rf_model = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

grid_search.fit(X_train_tfidf, y_train)

print(f"Best Parameters: {grid_search.best_params_}")

y_pred_rf = grid_search.best_estimator_.predict(X_test_tfidf)
print("\n Random Forest (Test Set):\n", classification_report(y_test, y_pred_rf))


Best Parameters: {'max_depth': None, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200}

 Random Forest (Test Set):
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      5330
           1       0.99      0.99      0.99      5895

    accuracy                           0.99     11225
   macro avg       0.99      0.99      0.99     11225
weighted avg       0.99      0.99      0.99     11225
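
To get a feel for the cost of this search, the grid can be enumerated directly. The sketch below reuses the same `param_grid` values as the cell above and counts the candidate combinations. The variable names are illustrative, and the total of fits assumes 5 folds per candidate (one further fit follows if the best model is refitted on the full training set, which is the `GridSearchCV` default).

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 20, None],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "max_features": ["sqrt", "log2"],
}

# Cartesian product of all value lists: one dict per candidate setting.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
cv_fits = len(combinations) * n_folds
print(f"{len(combinations)} candidates x {n_folds} folds = {cv_fits} fits")
```

Every value added to any list multiplies the number of candidates, which is why exhaustive grids grow expensive quickly.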



# 8. Deep Learning Model (Embedding + Global Average Pooling)

In this section, a **Deep Learning model** is trained using an **Embedding Layer** and **Global Average Pooling** to process the text data for binary classification.



In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

tokenizer = Tokenizer(num_words=10000, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

max_len = 200
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len, padding='post', truncating='post')
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len, padding='post', truncating='post')

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))  # 10,000-word vocabulary, 64-dimensional vectors
model.add(GlobalAveragePooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

history = model.fit(X_train_pad, y_train, epochs=5, validation_data=(X_test_pad, y_test), batch_size=64)

loss, accuracy = model.evaluate(X_test_pad, y_test)
print(f"Deep Learning model accuracy: {accuracy:.4f}")


Epoch 1/5
527/527 ━━━━━━━━━━━━━━━━━━━━ 11s 16ms/step - accuracy: 0.8672 - loss: 0.3592 - val_accuracy: 0.9868 - val_loss: 0.0523
Epoch 2/5
527/527 ━━━━━━━━━━━━━━━━━━━━ 7s 12ms/step - accuracy: 0.9927 - loss: 0.0318 - val_accuracy: 0.9916 - val_loss: 0.0320
Epoch 3/5
527/527 ━━━━━━━━━━━━━━━━━━━━ 8s 15ms/step - accuracy: 0.9971 - loss: 0.0144 - val_accuracy: 0.9936 - val_loss: 0.0238
Epoch 4/5
527/527 ━━━━━━━━━━━━━━━━━━━━ 9s 12ms/step - accuracy: 0.9991 - loss: 0.0058 - val_accuracy: 0.9946 - val_loss: 0.0234
Epoch 5/5
527/527 ━━━━━━━━━━━━━━━━━━━━ 8s 15ms/step - accuracy: 0.9994 - loss: 0.0037 - val_accuracy: 0.9951 - val_loss: 0.0223
351/351 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9947 - loss: 0.0229
Deep Learning model accuracy: 0.9951
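
The two preprocessing ideas in this model, post-padding/truncation and global average pooling, can be sketched in pure Python. This is a simplified illustration: `pad_post`, `global_average_pool`, and the 2-dimensional `embedding_table` are hypothetical. The padding vector is set to zeros here for readability, whereas in the Keras model above the vector for index 0 is learned like any other and is included in the average.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate from the end, then right-pad with `value` up to maxlen."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

def global_average_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical 2-dimensional embedding table; index 0 is the padding id.
embedding_table = {0: [0.0, 0.0], 1: [1.0, 3.0], 2: [3.0, 1.0], 3: [2.0, 2.0]}

short_seq = pad_post([1, 2], maxlen=4)          # padded:    [1, 2, 0, 0]
long_seq = pad_post([3, 1, 2, 3, 1], maxlen=4)  # truncated: [3, 1, 2, 3]

pooled_short = global_average_pool([embedding_table[i] for i in short_seq])
pooled_long = global_average_pool([embedding_table[i] for i in long_seq])
print(short_seq, pooled_short)  # [1, 2, 0, 0] [1.0, 1.0]
print(long_seq, pooled_long)    # [3, 1, 2, 3] [2.0, 2.0]
```

Pooling collapses each article into one fixed-size vector regardless of word order, which keeps the model small but discards sequence information.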


# 9. Deep Learning Model (Embedding + Global Average Pooling with Early Stopping)

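In this section, the same **Deep Learning model** architecture is retrained for up to 10 epochs with **Early Stopping**: training halts once `val_loss` has not improved for 3 consecutive epochs, and the weights from the best epoch are restored. As in Section 8, the test set is passed as the validation data.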
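
The stopping rule itself is simple enough to sketch without Keras. The function below is a minimal illustration of patience-based early stopping on a list of validation losses. `early_stopping_epoch` and the loss values are hypothetical and are not taken from the training log of this notebook.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), both 1-based.

    Training stops once the loss has failed to improve on the best value
    for `patience` consecutive epochs; the best epoch's weights are kept.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses for 8 planned epochs.
losses = [0.50, 0.40, 0.45, 0.35, 0.36, 0.37, 0.38, 0.30]
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch, best_epoch)  # 7 4
```

In this toy run the eighth epoch would have improved on the best loss, but training has already stopped at epoch 7. A larger patience trades longer training for a lower risk of stopping on a temporary plateau.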




In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=64))
model.add(GlobalAveragePooling1D())
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

model.compile(loss='binary_crossentropy', optimizer=Adam(learning_rate=0.001), metrics=['accuracy'])

history = model.fit(X_train_pad, y_train, epochs=10, validation_data=(X_test_pad, y_test), batch_size=64, callbacks=[early_stopping])

loss, accuracy = model.evaluate(X_test_pad, y_test)
print(f"\nDeep Learning model accuracy: {accuracy:.4f}")


Epoch 1/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 9s 15ms/step - accuracy: 0.8720 - loss: 0.3450 - val_accuracy: 0.9893 - val_loss: 0.0445
Epoch 2/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 10s 15ms/step - accuracy: 0.9925 - loss: 0.0302 - val_accuracy: 0.9933 - val_loss: 0.0305
Epoch 3/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 7s 13ms/step - accuracy: 0.9976 - loss: 0.0126 - val_accuracy: 0.9870 - val_loss: 0.0426
Epoch 4/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 8s 15ms/step - accuracy: 0.9990 - loss: 0.0064 - val_accuracy: 0.9954 - val_loss: 0.0212
Epoch 5/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 11s 15ms/step - accuracy: 0.9999 - loss: 0.0021 - val_accuracy: 0.9600 - val_loss: 0.1042
Epoch 6/10
527/527 ━━━━━━━━━━━━━━━━━━━━ 7s 12ms/step - accuracy: 0.9943 - loss: 0.0171 - val_accuracy: 0.9946 - val_loss: 0.0234
Epoch 7/10
[1m527/5