## **Problem Statement**

### **Introduction**
Fake news has emerged as one of the most significant challenges of our time, severely affecting both online and offline discourse. Its proliferation poses a direct threat to democratic processes and societal stability, particularly in the Western world. The ability to accurately identify fake news and reduce its spread is essential to maintaining informed public discourse and safeguarding democratic institutions.

### **Problem Statement**
The primary challenge addressed by this project is the automatic detection of fake news articles using machine learning and natural language processing (NLP) techniques. By developing a reliable model to classify news articles as either fake or real, we aim to contribute to the efforts to curb the spread of misinformation and enhance the quality of information available to the public.

### **Aim of the Project**

The aim of this project is to build a robust and accurate fake news detection system.

### **How Does the Solution Solve the Problem?**

The proposed solution is a machine learning model that uses NLP techniques to classify news articles as fake or real. Users can input a news article and receive a classification, which provides a practical tool for combating misinformation.


### **About the Dataset**

The dataset used in this project contains labeled news articles, categorized as either fake or real. This dataset is essential for training and evaluating the machine learning models developed to detect fake news.

### **Content**
The dataset contains 6,335 rows, one per news article. The columns used in this project are `title`, `text`, and `label`, where `label` marks each article as `FAKE` or `REAL`. The dataset also includes information on how it was acquired and the time period it represents, providing valuable context for the analysis.




IMPORTING LIBRARIES RELEVANT FOR THE PROJECT

In [79]:
import pandas as pd
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,classification_report
import numpy as np
from sklearn.model_selection import train_test_split

LOADING AND READING THE DATA

In [3]:
#skip the leftover index column ('Unnamed: 0'); the usecols callable should return True for columns to keep
news_data = pd.read_csv('fake_or_real_news.csv', usecols=lambda col: 'Unnamed' not in col)


In [60]:
news_data.head()

|   | title | text | label | input_text |
|---|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 0 | smell hillary fear daniel greenfield , shillma... |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 0 | watch exact moment paul ryan committed politic... |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 1 | kerry paris gesture sympathy u.s. secretary st... |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 0 | bernie supporter twitter erupt anger dnc : ' t... |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 1 | battle new york : primary matter primary day n... |


In [63]:
#Create a new column called 'input_text' by joining the title and the article text
news_data['input_text'] = news_data['title'] + ' ' + news_data['text']

In [62]:
news_data.head()

|   | title | text | label | input_text |
|---|---|---|---|---|
| 0 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | 0 | You Can Smell Hillary’s Fear Daniel Greenfield... |
| 1 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | 0 | Watch The Exact Moment Paul Ryan Committed Pol... |
| 2 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | 1 | Kerry to go to Paris in gesture of sympathy U.... |
| 3 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | 0 | Bernie supporters on Twitter erupt in anger ag... |
| 4 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | 1 | The Battle of New York: Why This Primary Matte... |


## **Machine Learning Task Instructions**

In this task, you will work with the provided `input_text` variable. Your objective is to train a machine learning model on this text that predicts whether an article is fake or real, and to evaluate how well it performs.

### **Steps to Follow:**

1. **Preprocess the Data**: Clean and preprocess the `input_text` data as necessary. This might include actions such as tokenization, removing stop words, and lemmatization.

2. **Extract Features**: Transform the text data into numerical features suitable for machine learning algorithms. Consider using techniques like `TfidfVectorizer` or `CountVectorizer`.

3. **Select a Machine Learning Algorithm**: Choose a classification algorithm suited to the task, such as Logistic Regression, SVM, or Random Forest.

4. **Train Your Model**: Split your data into training and testing sets, then train your chosen model on the preprocessed data.

5. **Evaluate Your Model**: Measure the performance of your model using suitable metrics (e.g., accuracy, precision, recall, F1-score).

Good luck!
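The five steps above can be sketched end to end with a scikit-learn `Pipeline`. This is only a minimal illustration under stated assumptions: the toy corpus, its labels, and every `toy_*` name are hypothetical stand-ins for `input_text` and `label`, and `TfidfVectorizer` with `LogisticRegression` is just one of the combinations the steps allow.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for `input_text`; 1 = real, 0 = fake.
toy_texts = [
    "senate passes budget bill after long debate",
    "secretary of state travels to paris for talks",
    "city council approves new transit funding",
    "court rules on state election law",
    "shocking secret cure doctors do not want you to know",
    "you will not believe what this celebrity did",
    "aliens endorse candidate in shocking secret video",
    "miracle pill melts fat overnight shocking results",
] * 5  # repeated so that both splits contain both classes
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# Step 4: hold out part of the data for testing.
toy_train_texts, toy_test_texts, toy_train_labels, toy_test_labels = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Steps 1-3: light preprocessing (stop-word removal), feature extraction, classifier.
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(toy_train_texts, toy_train_labels)

# Step 5: evaluate on the held-out split.
toy_predictions = toy_pipeline.predict(toy_test_texts)
toy_accuracy = accuracy_score(toy_test_labels, toy_predictions)
print(f"toy accuracy: {toy_accuracy:.2f}")
```

Because the toy corpus repeats the same sentences, the score it prints says nothing about real-world performance; the point is only the order of the steps.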


DATA PREPROCESSING

In [38]:
#check for null values
news_data.isnull().sum()

title         0
text          0
label         0
input_text    0
dtype: int64

In [5]:
news_data.label.value_counts()

label
REAL    3171
FAKE    3164
Name: count, dtype: int64

In [6]:
#spacy model
nlp = spacy.load('en_core_web_sm',disable=['parser','ner'])

In [59]:
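def spacy_tokenizer(sentence):
    """Lemmatize, lowercase, and remove stop words; return a space-joined string."""
    tokens = nlp(sentence)
    #lemmatization; '-PRON-' is the placeholder lemma that spaCy v2 assigns to pronouns,
    #so those tokens fall back to their lowercased surface form
    tokens = [token.lemma_.lower().strip() if token.lemma_ != '-PRON-' else token.lower_ for token in tokens]
    #removal of stop words (punctuation tokens are kept)
    tokens = [token for token in tokens if token not in nlp.Defaults.stop_words]
    return " ".join(tokens)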

In [13]:
#preprocessing the input
#Verify the spacy_tokenizer function on a single string before running it on the full dataset
sample_text = news_data['input_text'].iloc[0]
print(spacy_tokenizer(sample_text))



In [15]:
#Use tqdm to track progress
from tqdm import tqdm
tqdm.pandas()
news_data['input_text'] = news_data['input_text'].progress_apply(spacy_tokenizer)

100%|██████████| 6335/6335 [10:59<00:00,  9.60it/s]
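The run above took about eleven minutes because `progress_apply` calls `nlp` once per article. A possible speed-up, sketched here under assumptions, is to stream the texts through `nlp.pipe`, which processes documents in batches. The helper name `batch_tokenize` and the `batch_size` value are hypothetical, and the demo uses `spacy.blank("en")` only so that it runs without a downloaded model; a blank pipeline has no lemmatizer, so the sketch falls back to the raw token text when the lemma is empty. Unlike `spacy_tokenizer`, it also drops empty tokens.

```python
import spacy


def batch_tokenize(texts, nlp, batch_size=64):
    """Clean many texts in batches: lemmatize if possible, lowercase, drop stop words."""
    stop_words = nlp.Defaults.stop_words
    cleaned = []
    # nlp.pipe yields one Doc per input text, processing them in batches.
    for doc in nlp.pipe(texts, batch_size=batch_size):
        # Fall back to the surface text when no lemma is available (blank pipeline).
        tokens = [(tok.lemma_ or tok.text).lower().strip() for tok in doc]
        # Drop stop words and any tokens that are empty after stripping.
        tokens = [t for t in tokens if t and t not in stop_words]
        cleaned.append(" ".join(tokens))
    return cleaned


# Demo with a blank English pipeline (tokenizer only, no model download needed).
demo_nlp = spacy.blank("en")
demo_cleaned = batch_tokenize(["The cats are running", "This is a headline"], demo_nlp)
print(demo_cleaned)
```

With the notebook's own pipeline, the call would presumably look like `news_data['input_text'] = batch_tokenize(news_data['input_text'], nlp)`; the actual time saved depends on the machine and the batch size.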


In [19]:
news_data['label'] = news_data['label'].apply(lambda x: 1 if x == 'REAL' else 0)

In [20]:
news_data.label.value_counts()

label
1    3171
0    3164
Name: count, dtype: int64
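Since the two classes are almost perfectly balanced, a classifier that always predicts the majority class gives a useful lower bound for the accuracies reported later. The arithmetic can be sketched as follows, using only the counts printed above; the variable names are illustrative.

```python
# Class counts reported by value_counts() above (1 = REAL, 0 = FAKE).
label_counts = {1: 3171, 0: 3164}

total_articles = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# Accuracy of always predicting the most frequent class.
baseline_accuracy = label_counts[majority_label] / total_articles

print(f"articles: {total_articles}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should score well above this roughly 50% baseline.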

In [48]:
#assign input and target
X = news_data['input_text']
y = news_data['label']

In [49]:
#count vectorizer
count_vectorizer = CountVectorizer(max_features=5000)
X = count_vectorizer.fit_transform(X)

In [51]:
#splitting the data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.15,random_state=42)
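The cells above fit `CountVectorizer` on the full dataset before splitting, so the vocabulary is chosen with knowledge of the test articles. A common alternative, sketched below on hypothetical toy data, is to split the raw strings first and learn the vocabulary from the training split only. All `demo_*`, `texts_*`, `labels_*`, and `features_*` names are illustrative and not part of the notebook, and the reported accuracies were not produced this way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the cleaned `input_text` and `label`.
demo_texts = [
    "kerry paris gesture sympathy",
    "smell hillary fear",
    "battle new york primary",
    "watch exact moment",
    "primary day new york",
    "bernie supporter twitter erupt",
]
demo_labels = [1, 0, 1, 0, 1, 0]

# Split the raw strings first ...
texts_train, texts_test, labels_train, labels_test = train_test_split(
    demo_texts, demo_labels, test_size=0.33, random_state=42, stratify=demo_labels
)

# ... then learn the vocabulary from the training split only.
demo_vectorizer = CountVectorizer(max_features=5000)
features_train = demo_vectorizer.fit_transform(texts_train)
features_test = demo_vectorizer.transform(texts_test)  # reuses the fitted vocabulary

print(features_train.shape, features_test.shape)
```

Words that appear only in the test split are ignored by `transform`, which mirrors what happens when the model later sees unseen articles.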

USING MULTINOMIALNB MODEL

In [52]:
#Machine learning algorithm
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train,y_train)



In [53]:
#make predictions
y_pred = model.predict(X_test)

In [54]:
#check the accuracy 
accuracy = accuracy_score(y_test, y_pred)

In [55]:
print(accuracy)

0.8832807570977917


USING RANDOMFOREST CLASSIFIER

In [56]:
#RandomForestClassifier

model = RandomForestClassifier()
model.fit(X_train,y_train)


In [57]:
#make predictions
y_pred = model.predict(X_test)

In [58]:
#check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

0.9242902208201893


USING LOGISTIC REGRESSION

In [64]:
#using logistic regression

model = LogisticRegression()
model.fit(X_train,y_train)

In [65]:
#make predictions
y_pred = model.predict(X_test)

In [69]:
#check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

#classification report 
print(classification_report(y_test, y_pred))

0.9211356466876972
              precision    recall  f1-score   support

           0       0.92      0.92      0.92       459
           1       0.92      0.92      0.92       492

    accuracy                           0.92       951
   macro avg       0.92      0.92      0.92       951
weighted avg       0.92      0.92      0.92       951
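The precision, recall, and F1 columns in the report can be reproduced by hand from counts of true positives, false positives, and false negatives. The sketch below does this in plain Python; the helper `binary_metrics` and the toy label lists are hypothetical and are not the notebook's predictions.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Precision: of everything predicted positive, how much was right.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much was found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Hypothetical labels: 3 true positives, 2 false positives, 1 false negative.
demo_true = [1, 1, 1, 1, 0, 0, 0, 0]
demo_pred = [1, 1, 1, 0, 0, 0, 1, 1]

demo_precision, demo_recall, demo_f1 = binary_metrics(demo_true, demo_pred)
print(f"precision={demo_precision:.2f} recall={demo_recall:.2f} f1={demo_f1:.2f}")
```

Calling the helper with `positive=0` gives the row for the other class, which is how the report arrives at one line per label.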



USING DECISION TREE 

In [70]:
#use decision tree
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(X_train,y_train)

In [71]:
#make prediction
y_pred = model.predict(X_test)

In [74]:
#check the accuracy
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)

0.8191377497371188


SUPPORT VECTOR MACHINE

In [75]:
#using a support vector machine (SVC)
from sklearn.svm import SVC

model = SVC()
model.fit(X_train,y_train)

In [76]:
#make predictions
y_pred = model.predict(X_test)

In [78]:
#check accuracy
accuracy = accuracy_score(y_test, y_pred)

print(accuracy)

0.8748685594111462


## **Comparison of Model Performance**

- `MultinomialNB`: accuracy 0.883 (88.3%)
- `RandomForestClassifier`: accuracy 0.924 (92.4%)
- `DecisionTreeClassifier`: accuracy 0.819 (81.9%)
- `LogisticRegression`: accuracy 0.921 (92.1%)
- `SVC`: accuracy 0.875 (87.5%)

The best-performing model is the `RandomForestClassifier`, with an accuracy of 92.4%, narrowly ahead of `LogisticRegression` at 92.1%.
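How much weight the ranking can bear depends on the size of the test set. As a rough check, the sketch below estimates the sampling uncertainty of the two highest accuracies with the normal approximation to the binomial. It assumes the 951 test articles are independent draws; the helper `accuracy_standard_error` is illustrative.

```python
import math

n_test = 951  # test-set size shown in the classification report
top_accuracies = {
    "RandomForestClassifier": 0.9242902208201893,
    "LogisticRegression": 0.9211356466876972,
}


def accuracy_standard_error(accuracy, n):
    """Standard error of an accuracy estimated from n independent test examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)


standard_errors = {
    name: accuracy_standard_error(acc, n_test) for name, acc in top_accuracies.items()
}
for name, acc in top_accuracies.items():
    margin = 1.96 * standard_errors[name]  # approximate 95% interval half-width
    print(f"{name}: {acc:.3f} +/- {margin:.3f}")

accuracy_gap = top_accuracies["RandomForestClassifier"] - top_accuracies["LogisticRegression"]
print(f"gap between the two models: {accuracy_gap:.4f}")
```

The gap between the two models is smaller than one standard error, so on this single split the Random Forest and Logistic Regression results are hard to tell apart. Both models were scored on the same test articles, so a paired comparison or cross-validation would give a firmer answer.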

**NB:** The scores above come from a single train/test split (15% held out) with each model's default hyperparameters. Model performance can vary with the dataset, the split, and the chosen algorithm, so always evaluate with appropriate metrics and consider tuning hyperparameters where necessary.
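One way to adjust hyperparameters is a cross-validated grid search over a pipeline. The sketch below is a minimal illustration on a hypothetical toy corpus. The parameter grid, the `tuning_*` names, and the choice of `LogisticRegression` are assumptions made for the example, not settings used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = real, 0 = fake.
tuning_texts = [
    "senate passes budget bill after long debate",
    "secretary of state travels to paris for talks",
    "city council approves new transit funding",
    "court rules on state election law",
    "shocking secret cure doctors do not want you to know",
    "you will not believe what this celebrity did",
    "aliens endorse candidate in shocking secret video",
    "miracle pill melts fat overnight shocking results",
] * 3
tuning_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 3

tuning_pipeline = Pipeline([
    ("vec", CountVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Keys use the "<step name>__<parameter>" convention to reach into the pipeline.
tuning_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "clf__C": [0.1, 1.0, 10.0],
}

# 3-fold cross-validation refits the vectorizer inside every fold.
tuning_search = GridSearchCV(tuning_pipeline, tuning_grid, cv=3, scoring="accuracy")
tuning_search.fit(tuning_texts, tuning_labels)

print(tuning_search.best_params_)
print(f"best cross-validated accuracy: {tuning_search.best_score_:.2f}")
```

Searching over the whole pipeline keeps feature extraction inside each fold, so the cross-validated score is not inflated by vocabulary learned from held-out articles.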
