<a href="https://colab.research.google.com/github/jubaer-404/Random-ML-Projects/blob/main/FakeNewsPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Fake News Prediction**

I worked on a simple project where I classified fake and real news using logistic regression. The dataset was collected from Kaggle.

The first step was to load and preprocess the data. All textual data was converted into numerical values. Then, the dataset was split into training and testing sets.

After that, the training data was fed into a logistic regression model. Finally, the accuracy score of the model was calculated to evaluate its performance.

In [1]:
#importing libraries
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [6]:
data = pd.read_csv('/content/FakeNewsNet.csv')
data.head()


Unnamed: 0,title,news_url,source_domain,tweet_num,real
0,Kandi Burruss Explodes Over Rape Accusation on...,http://toofab.com/2017/05/08/real-housewives-a...,toofab.com,42,1
1,People's Choice Awards 2018: The best red carp...,https://www.today.com/style/see-people-s-choic...,www.today.com,0,1
2,Sophia Bush Sends Sweet Birthday Message to 'O...,https://www.etonline.com/news/220806_sophia_bu...,www.etonline.com,63,1
3,Colombian singer Maluma sparks rumours of inap...,https://www.dailymail.co.uk/news/article-33655...,www.dailymail.co.uk,20,1
4,Gossip Girl 10 Years Later: How Upper East Sid...,https://www.zerchoo.com/entertainment/gossip-g...,www.zerchoo.com,38,1



0: Real,
1: Fake

In [7]:
data.shape

(23196, 5)

In [8]:
#finding missing values of dataset
data.isnull().sum()

Unnamed: 0,0
title,0
news_url,330
source_domain,330
tweet_num,0
real,0


In [9]:
#replace missing values
data = data.fillna('')
data.isnull().sum()

Unnamed: 0,0
title,0
news_url,0
source_domain,0
tweet_num,0
real,0


In [11]:
#here merging
data['content'] = data['title']+' '+str(data['tweet_num'])
data['content']

Unnamed: 0,content
0,Kandi Burruss Explodes Over Rape Accusation on...
1,People's Choice Awards 2018: The best red carp...
2,Sophia Bush Sends Sweet Birthday Message to 'O...
3,Colombian singer Maluma sparks rumours of inap...
4,Gossip Girl 10 Years Later: How Upper East Sid...
...,...
23191,Pippa Middleton wedding: In case you missed it...
23192,Zayn Malik & Gigi Hadid’s Shocking Split: Why ...
23193,Jessica Chastain Recalls the Moment Her Mother...
23194,"Tristan Thompson Feels ""Dumped"" After Khloé Ka..."


In [12]:
#separating data
x = data.drop(columns = 'real', axis = 1)
y = data['real']
print(x)
print(y)

                                                   title  \
0      Kandi Burruss Explodes Over Rape Accusation on...   
1      People's Choice Awards 2018: The best red carp...   
2      Sophia Bush Sends Sweet Birthday Message to 'O...   
3      Colombian singer Maluma sparks rumours of inap...   
4      Gossip Girl 10 Years Later: How Upper East Sid...   
...                                                  ...   
23191  Pippa Middleton wedding: In case you missed it...   
23192  Zayn Malik & Gigi Hadid’s Shocking Split: Why ...   
23193  Jessica Chastain Recalls the Moment Her Mother...   
23194  Tristan Thompson Feels "Dumped" After Khloé Ka...   
23195  Kelly Clarkson Performs a Medley of Kendrick L...   

                                                news_url  \
0      http://toofab.com/2017/05/08/real-housewives-a...   
1      https://www.today.com/style/see-people-s-choic...   
2      https://www.etonline.com/news/220806_sophia_bu...   
3      https://www.dailymail.co.uk/news

Now we have to stemme all the data in content. It means it reduces every word into its root word.

In [26]:
!pip install nltk
import nltk
nltk.download('stopwords')
from nltk.stem import PorterStemmer
port_stem = PorterStemmer()



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [27]:
#stemming the dataset
def stemming(content):
  stemming_content = re.sub('[^a-zA-Z]',' ',content)
  stemming_content = stemming_content.lower()
  stemming_content = stemming_content.split()
  stemming_content = [port_stem.stem(word) for word in stemming_content if not word in stopwords.words('english')]
  stemming_content = ' '.join(stemming_content)
  return stemming_content

In [28]:
data['content'] = data['content'].apply(stemming)
data['content']

Unnamed: 0,content
0,kandi burruss explod rape accus real housew at...
1,peopl choic award best red carpet look name tw...
2,sophia bush send sweet birthday messag one tre...
3,colombian singer maluma spark rumour inappropr...
4,gossip girl year later upper east sider shock ...
...,...
23191,pippa middleton wed case miss pippa marri lace...
23192,zayn malik gigi hadid shock split chanc reunit...
23193,jessica chastain recal moment mother boyfriend...
23194,tristan thompson feel dump khlo kardashian ref...


In [40]:
#converting x and y into numpy array
x = data['content'].values
y = data['real'].values

#converting textual data into numeric data
vectorizer = TfidfVectorizer()
vectorizer.fit(x)
x = vectorizer.transform(x)
print(x)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 321872 stored elements and shape (23196, 12549)>
  Coords	Values
  (0, 53)	0.27050643730255725
  (0, 576)	0.3201538815396584
  (0, 1524)	0.40360290512990127
  (0, 3271)	0.044323385881010036
  (0, 3740)	0.36982227750765356
  (0, 5247)	0.2721067333555454
  (0, 5520)	0.044323385881010036
  (0, 5897)	0.40360290512990127
  (0, 6380)	0.044323385881010036
  (0, 7536)	0.044323385881010036
  (0, 7811)	0.044323385881010036
  (0, 8978)	0.3147197315076521
  (0, 9023)	0.24105938849727213
  (0, 9290)	0.26490334413044686
  (0, 11593)	0.044323385881010036
  (0, 11962)	0.22047054915756695
  (1, 643)	0.3123289779521153
  (1, 986)	0.3426054842484946
  (1, 1710)	0.39209759093450064
  (1, 1975)	0.42751728045947607
  (1, 3271)	0.06913008204090813
  (1, 5520)	0.06913008204090813
  (1, 6380)	0.06913008204090813
  (1, 6575)	0.33707212530024655
  (1, 7536)	0.06913008204090813
  :	:
  (23194, 6403)	0.32158148100847245
  (23194, 7434)	0.293987224169623

Separate data into test and train and train the model

In [43]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 2)

In [58]:
print(x_test)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 64267 stored elements and shape (4640, 12549)>
  Coords	Values
  (0, 3271)	0.062060361256406966
  (0, 5094)	0.48133652348841405
  (0, 5520)	0.062060361256406966
  (0, 5699)	0.4999611229105888
  (0, 6274)	0.3787555234667408
  (0, 6380)	0.062060361256406966
  (0, 7536)	0.062060361256406966
  (0, 7811)	0.062060361256406966
  (0, 10598)	0.25338797801307716
  (0, 11435)	0.3787555234667408
  (0, 11593)	0.062060361256406966
  (0, 12102)	0.37964211513441276
  (1, 356)	0.20109108815848625
  (1, 359)	0.32784481820422773
  (1, 467)	0.3878761600074069
  (1, 1303)	0.3548568531746669
  (1, 1307)	0.28597084039903126
  (1, 1650)	0.4036771875332823
  (1, 3271)	0.03897012889617187
  (1, 3829)	0.20663463676813038
  (1, 4555)	0.25665220748129197
  (1, 5520)	0.03897012889617187
  (1, 5789)	0.20240481240834238
  (1, 6380)	0.03897012889617187
  (1, 7053)	0.2307764083588629
  :	:
  (4638, 6145)	0.3005634558834724
  (4638, 6192)	0.31642179614465765


In [44]:
model = LogisticRegression()
model.fit(x_train, y_train)

Calculating accuracy

In [56]:
x_train_predict = model.predict(x_train)
accuracy = accuracy_score(x_train_predict, y_train)
print(accuracy)

0.8589135589566717


In [57]:
x_test_predict = model.predict(x_test)
accuracy = accuracy_score(x_test_predict, y_test)
print(accuracy)

0.8349137931034483


Making a predictive system

In [62]:
x_new = x_test[0]
prediction = model.predict(x_new)
print(prediction)
if (prediction[0]==0):
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Fake
