
About the Dataset:


1. Serial Number: a unique identifier for each news article, starting from 0
2. Title: the headline of a news article, summarizing its main topic
3. Text: the main content of a news article, providing detailed information
4. Label: a binary classification indicating whether the news is fake (0) or real (1)





Importing the Dependencies

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
# Mount Google Drive to access dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Data Pre-processing

In [None]:
# loading the dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/WELFake_Dataset.csv')

In [None]:
news_dataset.shape

(72134, 4)

In [None]:
# print the first 5 rows of the dataframe
news_dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1


In [None]:
# counting the number of missing values in the dataset
news_dataset.isnull().sum()

Unnamed: 0      0
title         558
text           39
label           0
dtype: int64

In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# merging the news title and the article text into a single 'content' column
news_dataset['content'] = news_dataset['title']+' '+news_dataset['text']

In [None]:
print(news_dataset['content'])

0        LAW ENFORCEMENT ON HIGH ALERT Following Threat...
1           Did they post their votes for Hillary already?
2        UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...
3        Bobby Jindal, raised Hindu, uses story of Chri...
4        SATAN 2: Russia unvelis an image of its terrif...
                               ...                        
72129    Russians steal research on Trump in hack of U....
72130     WATCH: Giuliani Demands That Democrats Apolog...
72131    Migrants Refuse To Leave Train At Refugee Camp...
72132    Trump tussle gives unpopular Mexican leader mu...
72133    Goldman Sachs Endorses Hillary Clinton For Pre...
Name: content, Length: 72134, dtype: object


In [None]:
# Drop unnecessary column
news_dataset.drop(["Unnamed: 0"], axis=1, inplace=True)

In [None]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

                                                   title  \
0      LAW ENFORCEMENT ON HIGH ALERT Following Threat...   
1                                                          
2      UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...   
3      Bobby Jindal, raised Hindu, uses story of Chri...   
4      SATAN 2: Russia unvelis an image of its terrif...   
...                                                  ...   
72129  Russians steal research on Trump in hack of U....   
72130   WATCH: Giuliani Demands That Democrats Apolog...   
72131  Migrants Refuse To Leave Train At Refugee Camp...   
72132  Trump tussle gives unpopular Mexican leader mu...   
72133  Goldman Sachs Endorses Hillary Clinton For Pre...   

                                                    text  \
0      No comment is expected from Barack Obama Membe...   
1         Did they post their votes for Hillary already?   
2       Now, most of the demonstrators gathered last ...   
3      A dozen politically active pasto

Stemming:

Stemming is the process of reducing a word to its root form by stripping suffixes.

Example:
acting, acted, acts --> act

In [None]:
# Perform text preprocessing using stemming
port_stem = PorterStemmer()

In [None]:
# Build the stopword set once; calling stopwords.words() for every word is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
    # Replace every non-letter character with a space, then lowercase and tokenize
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # Drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content
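
To see what the cleaning steps inside `stemming` do before any stemming happens, here is a minimal sketch using only the standard library. The tiny `SAMPLE_STOPWORDS` set and the `clean_and_tokenize` helper are hypothetical stand-ins for illustration; the notebook itself uses NLTK's full English stopword list and then applies `PorterStemmer` to each surviving token.

```python
import re

# Hypothetical, tiny stopword set for illustration only;
# the notebook uses nltk's stopwords.words('english') instead.
SAMPLE_STOPWORDS = {"an", "of", "its", "s", "the", "is"}

def clean_and_tokenize(text):
    """Mirror the pre-stemming steps: strip non-letters, lowercase, split, filter."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()           # lowercase, then split on whitespace
    return [tok for tok in tokens if tok not in SAMPLE_STOPWORDS]

print(clean_and_tokenize("SATAN 2: Russia unveils an image of its missile"))
# ['satan', 'russia', 'unveils', 'image', 'missile']
```

Digits and punctuation disappear entirely, which is why a headline such as "SATAN 2: ..." starts with just "satan" in the processed content.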

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

0        law enforc high alert follow threat cop white ...
1                                post vote hillari alreadi
2        unbeliev obama attorney gener say charlott rio...
3        bobbi jindal rais hindu use stori christian co...
4        satan russia unv imag terrifi new supernuk wes...
                               ...                        
72129    russian steal research trump hack u democrat p...
72130    watch giuliani demand democrat apolog trump ra...
72131    migrant refus leav train refuge camp hungari m...
72132    trump tussl give unpopular mexican leader much...
72133    goldman sach endors hillari clinton presid gol...
Name: content, Length: 72134, dtype: object


In [None]:
# Prepare feature matrix (X) and target vector (Y)
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X)

['law enforc high alert follow threat cop white blacklivesmatt fyf terrorist video comment expect barack obama member fyf fukyoflag blacklivesmatt movement call lynch hang white peopl cop encourag other radio show tuesday night turn tide kill white peopl cop send messag kill black peopl america one f yoflag organ call sunshin radio blog show host texa call sunshin f ing opinion radio show snapshot fyf lolatwhitefear twitter page p show urg support call fyf tonight continu dismantl illus white snapshot twitter radio call invit fyf radio show air p eastern standard time show caller clearli call lynch kill white peopl minut clip radio show heard provid breitbart texa someon would like refer hannib alreadi receiv death threat result interrupt fyf confer call unidentifi black man said mother f ker start f ing like us bunch ni er takin one us roll said caus alreadi roll gang anyway six seven black mother f cker see white person lynch ass let turn tabl conspir cop start lose peopl state emerg

In [None]:
print(Y)

[1 1 1 ... 0 0 1]


In [None]:
Y.shape

(72134,)

Splitting the dataset into training & test data

In [None]:
# Split the dataset into training and testing sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.1, stratify=Y, random_state=2)
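
`stratify=Y` asks `train_test_split` to keep the fake/real ratio the same in both splits. The sketch below illustrates that idea by hand on made-up labels. `stratified_indices` is a hypothetical helper, not scikit-learn's implementation, and rounding each class's test share to the nearest integer is an assumption of this sketch.

```python
import random
from collections import defaultdict

def stratified_indices(labels, test_size=0.1, seed=2):
    """Return (train_idx, test_idx) with every class sampled at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                        # shuffle within each class
        n_test = round(len(indices) * test_size)    # same fraction from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Made-up labels: 40% class 0, 60% class 1
labels = [0] * 40 + [1] * 60
train_idx, test_idx = stratified_indices(labels)
test_share_of_ones = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_share_of_ones)
# 90 10 0.6
```

The test split keeps the same 60% share of class 1 as the full label list, which is the property stratification is meant to preserve.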

In [None]:
# Initialize the Count Vectorizer, keeping only the 1000 most frequent terms
count_vectorizer = CountVectorizer(max_features=1000)
# Fit the vocabulary on the training text only, then transform both splits
X_train_count = count_vectorizer.fit_transform(X_train)
X_test_count = count_vectorizer.transform(X_test)

In [None]:
tfidf_vectorizer = TfidfVectorizer(max_features=1000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
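
The two vectorizers differ in how they weight a term: one stores raw counts, the other down-weights terms that appear in many documents. The sketch below works the arithmetic by hand on a made-up toy corpus, using the formula scikit-learn documents for the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`): idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The helper names are hypothetical.

```python
import math
from collections import Counter

# Made-up, already-stemmed toy corpus
docs = ["trump vote vote", "trump missil", "trump vote"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})  # ['missil', 'trump', 'vote']
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
df = {term: sum(term in doc for doc in tokenized) for term in vocab}
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}

def count_row(doc):
    """Raw term counts, in vocabulary order (what a count vectorizer stores)."""
    counts = Counter(doc)
    return [counts[term] for term in vocab]

def tfidf_row(doc):
    """Counts scaled by idf, then L2-normalized."""
    raw = [count * idf[term] for count, term in zip(count_row(doc), vocab)]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

for doc in tokenized:
    print(count_row(doc), [round(value, 3) for value in tfidf_row(doc)])
```

In the second document both terms occur once, so their counts are equal, but the TF-IDF weight of "missil" is higher than that of "trump" because "trump" appears in every document.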

Naive Bayes Model

In [None]:
# Initialize Naive Bayes classifier
naive_bayes_model_count = MultinomialNB()
naive_bayes_model_tfidf = MultinomialNB()

# Train Naive Bayes classifiers with Count Vectorizer and TF-IDF Vectorizer
naive_bayes_model_count.fit(X_train_count, Y_train)
naive_bayes_model_tfidf.fit(X_train_tfidf, Y_train)

# Make predictions and evaluate
predictions_count = naive_bayes_model_count.predict(X_test_count)
accuracy_count = accuracy_score(Y_test, predictions_count)

predictions_tfidf = naive_bayes_model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(Y_test, predictions_tfidf)

print(f"Naive bayes accuracy with Count Vectorizer: {accuracy_count}")
print(f"Naive bayes accuracy with TF-IDF Vectorizer: {accuracy_tfidf}")

Naive bayes accuracy with Count Vectorizer: 0.8735791516495702
Naive bayes accuracy with TF-IDF Vectorizer: 0.8490435264762961
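
`accuracy_score` is the fraction of test articles whose predicted label matches the true label. As a rough worked example using the numbers printed in this notebook, and assuming `train_test_split` rounds the test-set size up for a fractional `test_size` (an assumption about scikit-learn's behavior, not something the notebook shows), the Count Vectorizer accuracy can be converted back into counts:

```python
import math

n_samples = 72134                         # rows in the dataset (news_dataset.shape)
test_size = 0.1                           # fraction passed to train_test_split
reported_accuracy = 0.8735791516495702    # Naive Bayes + Count Vectorizer, printed above

# Assumption: the test-set size is rounded up
n_test = math.ceil(n_samples * test_size)
n_correct = round(reported_accuracy * n_test)
print(n_test, n_correct, n_test - n_correct)

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

# Tiny made-up example: 3 of 4 predictions are right
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
# 0.75
```

Under that assumption, the reported accuracy corresponds to 6302 correct predictions out of 7214 test articles.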


Decision Tree Model

In [None]:
# Initialize Decision Tree classifier
decision_tree_model_count = DecisionTreeClassifier()
decision_tree_model_tfidf = DecisionTreeClassifier()

# Train Decision Tree classifiers with Count Vectorizer and TF-IDF Vectorizer
decision_tree_model_count.fit(X_train_count, Y_train)
decision_tree_model_tfidf.fit(X_train_tfidf, Y_train)

# Make predictions and evaluate
predictions_count = decision_tree_model_count.predict(X_test_count)
accuracy_count = accuracy_score(Y_test, predictions_count)

predictions_tfidf = decision_tree_model_tfidf.predict(X_test_tfidf)
accuracy_tfidf = accuracy_score(Y_test, predictions_tfidf)

print(f"Decision tree accuracy with Count Vectorizer: {accuracy_count}")
print(f"Decision tree accuracy with TF-IDF Vectorizer: {accuracy_tfidf}")

Decision tree accuracy with Count Vectorizer: 0.9329082339894649
Decision tree accuracy with TF-IDF Vectorizer: 0.9317992791793734


In [None]:


# Define hyperparameters distribution for random search
param_dist = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initialize Decision Tree classifier with RandomizedSearchCV for Count Vectorizer
decision_tree_model_tuned_count = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions=param_dist, n_iter=10, cv=5)

# Train Decision Tree classifier with Count Vectorizer using tuned hyperparameters
decision_tree_model_tuned_count.fit(X_train_count, Y_train)

# Make predictions and evaluate
predictions_tuned_count = decision_tree_model_tuned_count.predict(X_test_count)
accuracy_tuned_count = accuracy_score(Y_test, predictions_tuned_count)

# Print best hyperparameters and accuracy
print("Best Hyperparameters (Count Vectorizer):", decision_tree_model_tuned_count.best_params_)
print(f"Accuracy with Tuned Decision Tree and Count Vectorizer: {accuracy_tuned_count}")

# Initialize Decision Tree classifier with RandomizedSearchCV for TF-IDF Vectorizer
decision_tree_model_tuned_tfidf = RandomizedSearchCV(DecisionTreeClassifier(), param_distributions=param_dist, n_iter=10, cv=5)

# Train Decision Tree classifier with TF-IDF Vectorizer using tuned hyperparameters
decision_tree_model_tuned_tfidf.fit(X_train_tfidf, Y_train)

# Make predictions and evaluate
predictions_tuned_tfidf = decision_tree_model_tuned_tfidf.predict(X_test_tfidf)
accuracy_tuned_tfidf = accuracy_score(Y_test, predictions_tuned_tfidf)

# Print best hyperparameters and accuracy
print("Best Hyperparameters (TF-IDF Vectorizer):", decision_tree_model_tuned_tfidf.best_params_)
print(f"Accuracy with Tuned Decision Tree and TF-IDF Vectorizer: {accuracy_tuned_tfidf}")


Best Hyperparameters (Count Vectorizer): {'min_samples_split': 2, 'min_samples_leaf': 1, 'max_depth': 20}
Accuracy with Tuned Decision Tree and Count Vectorizer: 0.9381757693373995
Best Hyperparameters (TF-IDF Vectorizer): {'min_samples_split': 5, 'min_samples_leaf': 2, 'max_depth': 20}
Accuracy with Tuned Decision Tree and TF-IDF Vectorizer: 0.9349875242583865


In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

# Define hyperparameters distribution for random search
param_dist = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'fit_prior': [True, False]
}

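# Initialize Naive Bayes classifier with RandomizedSearchCV for Count Vectorizer
# The grid has only 4 x 2 = 8 combinations, so n_iter is set to 8 (a larger value triggers a warning)
naive_bayes_model_tuned_count = RandomizedSearchCV(MultinomialNB(), param_distributions=param_dist, n_iter=8, cv=5)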

# Train Naive Bayes classifier with Count Vectorizer using tuned hyperparameters
naive_bayes_model_tuned_count.fit(X_train_count, Y_train)

# Make predictions and evaluate
predictions_tuned_nb_count = naive_bayes_model_tuned_count.predict(X_test_count)
accuracy_tuned_nb_count = accuracy_score(Y_test, predictions_tuned_nb_count)

# Print best hyperparameters and accuracy
print("Best Hyperparameters (Naive Bayes - Count Vectorizer):", naive_bayes_model_tuned_count.best_params_)
print(f"Accuracy with Tuned Naive Bayes and Count Vectorizer: {accuracy_tuned_nb_count}")

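# Initialize Naive Bayes classifier with RandomizedSearchCV for TF-IDF Vectorizer
# The grid has only 4 x 2 = 8 combinations, so n_iter is set to 8 (a larger value triggers a warning)
naive_bayes_model_tuned_tfidf = RandomizedSearchCV(MultinomialNB(), param_distributions=param_dist, n_iter=8, cv=5)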

# Train Naive Bayes classifier with TF-IDF Vectorizer using tuned hyperparameters
naive_bayes_model_tuned_tfidf.fit(X_train_tfidf, Y_train)

# Make predictions and evaluate
predictions_tuned_nb_tfidf = naive_bayes_model_tuned_tfidf.predict(X_test_tfidf)
accuracy_tuned_nb_tfidf = accuracy_score(Y_test, predictions_tuned_nb_tfidf)

# Print best hyperparameters and accuracy
print("Best Hyperparameters (Naive Bayes - TF-IDF Vectorizer):", naive_bayes_model_tuned_tfidf.best_params_)
print(f"Accuracy with Tuned Naive Bayes and TF-IDF Vectorizer: {accuracy_tuned_nb_tfidf}")




Best Hyperparameters (Naive Bayes - Count Vectorizer): {'fit_prior': True, 'alpha': 0.1}
Accuracy with Tuned Naive Bayes and Count Vectorizer: 0.8735791516495702




Best Hyperparameters (Naive Bayes - TF-IDF Vectorizer): {'fit_prior': True, 'alpha': 0.5}
Accuracy with Tuned Naive Bayes and TF-IDF Vectorizer: 0.8490435264762961
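
`RandomizedSearchCV` does not try every combination in `param_dist`; it samples a limited number of them. The standard-library sketch below only counts and samples combinations to illustrate this, and does not reproduce scikit-learn's sampler. `all_combinations` and `sample_candidates` are hypothetical helpers. The decision tree grid holds 36 combinations, so a search over 10 of them can pick a different best setting from run to run; the Naive Bayes grid holds only 8, so a random search over it is effectively exhaustive.

```python
import itertools
import random

# The two grids used in this notebook
tree_grid = {
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
nb_grid = {
    'alpha': [0.1, 0.5, 1.0, 2.0],
    'fit_prior': [True, False],
}

def all_combinations(grid):
    """Enumerate every hyperparameter combination in a grid of lists."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[key] for key in keys))]

def sample_candidates(grid, n_iter, seed=0):
    """Draw up to n_iter distinct combinations without replacement."""
    combos = all_combinations(grid)
    rng = random.Random(seed)
    return rng.sample(combos, min(n_iter, len(combos)))

print(len(all_combinations(tree_grid)), len(sample_candidates(tree_grid, 10)))
# 36 10
print(len(all_combinations(nb_grid)), len(sample_candidates(nb_grid, 10)))
# 8 8
```

Passing a fixed `random_state` to `RandomizedSearchCV` makes the sampled subset, and therefore the reported best hyperparameters, reproducible across runs.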

