<a href="https://www.kaggle.com/code/manishkr1754/fake-news-prediction?scriptVersionId=142780988" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

---
<center><h1>Fake News Prediction</h1></center>
<center><h3>Part of 30 Days 30 ML Projects Challenge</h3></center>

---

## 1) Understanding Problem Statement
---

**Fake news** (intentionally false information) spread through traditional media and online social networks is causing harm in society. It is on the rise, leading to deception and division. To combat this, we need a dependable system that can tell real news from fake news. Such a system is essential to rebuild trust in media and protect the truth in our online and offline world.

The goal of this project is to use machine learning **to classify news articles as either genuine or fake based on their content and characteristics**. This **classification task** is fundamental to addressing the challenge of fake news and to promoting information integrity and an informed society. It also involves **Natural Language Processing (NLP) techniques** for handling textual data.

## 2) Understanding Data
---
In this project, we work with a dataset referred to as **Fake News Data**. This dataset comprises various independent variables and one dependent variable for each individual news article.


### Dataset Description:

The dataset consists of news articles and the goal of this project is to utilize machine learning techniques to predict the reliability of these articles based on their content and associated attributes.

It includes the following attributes:

1. **id:** A unique identifier for each news article.
2. **title:** The title of the news article.
3. **author:** The author of the news article (if available).
4. **text:** The content of the article, which may be incomplete.
5. **label:** A categorical label indicating the potential reliability of the article:
   - 1: Denotes articles that are potentially unreliable or fake.
   - 0: Represents articles considered reliable or genuine.

## 3) Getting System Ready
---
Importing required libraries

In [None]:
import numpy as np
import pandas as pd

# for text data preprocessing
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# for model building
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

### Downloading stop words for text preprocessing

In [None]:
import nltk
nltk.download('stopwords')

In [None]:
# printing the stopwords in English
print(stopwords.words('english'))

## 4) Data Eyeballing
---

### Loading Data

In [None]:
fake_news_data = pd.read_csv('Datasets/Day4_Fake_News_Data.csv') 

In [None]:
fake_news_data

In [None]:
print('The size of Dataframe is: ', fake_news_data.shape)
print('-'*100)
print('The Column Name, Record Count and Data Types are as follows: ')
fake_news_data.info()
print('-'*100)

In [None]:
# Defining numerical & categorical columns
numeric_features = [feature for feature in fake_news_data.columns if fake_news_data[feature].dtype != 'O']
categorical_features = [feature for feature in fake_news_data.columns if fake_news_data[feature].dtype == 'O']

# print columns
print('We have {} numerical features : {}'.format(len(numeric_features), numeric_features))
print('\nWe have {} categorical features : {}'.format(len(categorical_features), categorical_features))

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=fake_news_data.isnull().sum().sort_values(ascending=False)
percent=(fake_news_data.isnull().sum()/fake_news_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

In [None]:
fake_news_data['label'].value_counts()

## 5) Data Cleaning and Preprocessing
---

### Replacing the null values with empty string

In [None]:
fake_news_data = fake_news_data.fillna('')

### Merging the author name and news title

In [None]:
fake_news_data['content'] = fake_news_data['author']+' '+fake_news_data['title']
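
The null values are filled before merging because string concatenation with a missing value yields a missing result. A minimal sketch of that behaviour, using a made-up two-row frame (the `demo` data is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame; the second article has no author.
demo = pd.DataFrame({
    "author": ["Jane Doe", None],
    "title": ["Headline A", "Headline B"],
})

# Without filling, the missing author makes the merged value missing too.
raw_content = demo["author"] + " " + demo["title"]

# After fillna(''), the title survives (with a leading space).
filled = demo.fillna("")
filled_content = filled["author"] + " " + filled["title"]

print(raw_content.isna().tolist())
print(filled_content.tolist())
```

If the merge were done first, every article without an author would lose its title as well.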

In [None]:
fake_news_data.head()

### Checking Missing Value Presence

In [None]:
print('Missing Value Presence in different columns of DataFrame are as follows : ')
print('-'*100)
total=fake_news_data.isnull().sum().sort_values(ascending=False)
percent=(fake_news_data.isnull().sum()/fake_news_data.isnull().count()*100).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])

### Stemming

- Stemming is the process of reducing a word to its root form by stripping common suffixes

`For example:` connected, connecting, connection --> connect
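
To see what the Porter stemmer does in practice, here is a small sketch on an illustrative word list (the words are chosen for demonstration and are not from the dataset). The stemmer is rule-based, so related words are not always mapped to the same stem.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word family; each word is reduced by suffix-stripping rules.
words = ["connected", "connecting", "connection", "connections"]
stems = [stemmer.stem(word) for word in words]

print(dict(zip(words, stems)))
```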

In [None]:
porter_stemmer = PorterStemmer()

In [None]:
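# build the stop-word set once; calling stopwords.words() for every word is very slow
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into tokens
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower().split()
    # drop stop words and reduce the remaining tokens to their stems
    stemmed_content = [porter_stemmer.stem(word) for word in stemmed_content if word not in english_stopwords]
    return ' '.join(stemmed_content)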

In [None]:
fake_news_data['content'] = fake_news_data['content'].apply(stemming)

In [None]:
fake_news_data['content']

## 6) Model Building
---

### Creating Feature Matrix (Independent Variables) & Target Variable (Dependent Variable)

In [None]:
# separating the data and labels
X = fake_news_data['content'] # Feature matrix
y = fake_news_data['label'] # Target variable

In [None]:
X

In [None]:
y

### Converting the textual data to numerical data

In [None]:
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
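
To get a feel for what the vectorizer produces, the sketch below runs `TfidfVectorizer` on three made-up, already-stemmed snippets (the `corpus` is hypothetical). Each document becomes one row, each distinct term one column, and with the default settings every row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-stemmed snippets standing in for the 'content' column.
corpus = [
    "senat vote budget",
    "budget vote delay",
    "celebr rumor spread",
]

tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(corpus)  # sparse matrix: one row per document

print(sorted(tfidf.vocabulary_))      # learned vocabulary, one column per term
print(matrix.shape)                   # (documents, vocabulary size)

# With the default norm='l2', every non-empty row has unit length.
row_norms = np.asarray(np.sqrt(matrix.multiply(matrix).sum(axis=1))).ravel()
print(row_norms)
```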

In [None]:
X

In [None]:
print(X)

### Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=45)
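
The `stratify=y` argument keeps the share of each label the same in both subsets. A small sketch with made-up labels (60 genuine, 40 fake; the `labels` and `features` arrays are placeholders) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 genuine (0) and 40 fake (1).
labels = np.array([0] * 60 + [1] * 40)
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder features

_, _, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=45
)

# Both subsets keep the 40% share of label 1.
print(y_tr.mean(), y_te.mean())
```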

In [None]:
print(X.shape, X_train.shape, X_test.shape)

In [None]:
print(y.shape, y_train.shape, y_test.shape)
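
One thing to be aware of: the vectorizer above is fitted on all articles before the split, so its vocabulary and IDF weights also reflect the test articles. A common alternative, sketched here on a tiny made-up corpus (the `texts` and `labels` are hypothetical, and logistic regression is just a placeholder classifier), is to split the raw text first and let a scikit-learn pipeline fit TF-IDF on the training part only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical stemmed snippets and labels (1 = fake, 0 = genuine).
texts = [
    "senat vote budget", "court rule appeal", "bank report profit",
    "mayor open bridg", "alien control senat", "miracl pill cure all",
    "secret plot expos", "celebr clone rumor",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Split the raw text first, so the test articles stay unseen.
text_train, text_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=45
)

# fit() learns TF-IDF statistics from the training text only, then the classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(text_train, lab_train)

# predict() only applies the already-fitted TF-IDF to the test text.
predictions = clf.predict(text_test)
print(len(predictions))
```

On a dataset of this kind the difference in scores may well be small, but the pipeline version gives an estimate that does not depend on information from the test set.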

### Model Comparison : Training & Evaluation

In [None]:
models = [LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier]
accuracy_scores = []
precision_scores = []
recall_scores = []
f1_scores = []

for model in models:
    # fix the seed so that results are reproducible across runs
    classifier = model(random_state=45).fit(X_train, y_train)
    y_pred = classifier.predict(X_test)
    
    accuracy_scores.append(accuracy_score(y_test, y_pred))
    precision_scores.append(precision_score(y_test, y_pred))
    recall_scores.append(recall_score(y_test, y_pred))
    f1_scores.append(f1_score(y_test, y_pred))

In [None]:
classification_metrics_df = pd.DataFrame({
    "Model": ["Logistic Regression", "SVM", "Decision Tree", "Random Forest"],
    "Accuracy": accuracy_scores,
    "Precision": precision_scores,
    "Recall": recall_scores,
    "F1 Score": f1_scores
})

classification_metrics_df.set_index('Model', inplace=True)
classification_metrics_df
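
For reference, the four metrics in the table can all be derived from the confusion-matrix counts. The sketch below uses made-up counts (not results from this notebook) with label 1 (fake) as the positive class:

```python
# Hypothetical confusion-matrix counts, with label 1 (fake) as the positive class.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy = (tp + tn) / (tp + fp + fn + tn)           # share of correct predictions
precision = tp / (tp + fp)                           # flagged articles that are truly fake
recall = tp / (tp + fn)                              # fake articles that were caught
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```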

### Inference

- All four models (Logistic Regression, SVM, Decision Tree, and Random Forest) classify fake news with high accuracy, precision, recall, and F1 score. Among them, **Random Forest** stands out as the top performer, with a good balance across these metrics.

**`Note:`** In real-life projects, the best model is not selected on accuracy alone; we also need to take into account other evaluation metrics, the business context, and model interpretability.