# FAKE NEWS DETECTOR
### Fake News Detector using a Logistic Regression model.

Google Colab link: https://colab.research.google.com/drive/1YxD2SRlRn9YfG5Lak7Pr9b8Y7bg6maOZ?authuser=0

< Importing the Dataset Directly from Kaggle >

Dataset link: https://www.kaggle.com/competitions/fake-news/data

!pip install kaggle <br>
!kaggle competitions download -c fake-news <br>
!unzip fake-news.zip <br>

In [1]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [5]:
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [WinError 10061]
[nltk_data]     No connection could be made because the target machine
[nltk_data]     actively refused it>


False

In natural language processing (NLP), stopwords are very common words, such as "the", "is", and "and", that are filtered out before further text processing. They occur frequently and contribute little to the meaning of a document or sentence, so removing them reduces the dimensionality of the text data and can improve the efficiency and quality of downstream models.

In [None]:
print(stopwords.words('english'))
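To make the filtering step concrete, here is a minimal sketch that removes stopwords from one sentence. The set `SAMPLE_STOPWORDS` and the helper `remove_stopwords` are illustrative assumptions; the notebook itself relies on NLTK's full English list.

```python
# A tiny, hand-picked stopword set for illustration only (an assumption);
# the notebook uses nltk.corpus.stopwords.words('english') instead.
SAMPLE_STOPWORDS = {"the", "is", "a", "of", "and", "to", "in"}


def remove_stopwords(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [word for word in text.lower().split() if word not in SAMPLE_STOPWORDS]


tokens = remove_stopwords("The senate is voting in a session of Congress")
print(tokens)
```

Only the content-bearing words survive, which is what keeps the later TF-IDF vocabulary smaller.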

< Pre-Processing the Data >

Import the dataset into a DataFrame and handle missing values.

In [None]:
news_dataset = pd.read_csv('train.csv')

In [None]:
# shape as (rows, columns), then the first five rows
print(news_dataset.shape)
news_dataset.head()

In [None]:
# the number of missing values in the dataset
news_dataset.isnull().sum()

In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [None]:
print(news_dataset['content'])

In [None]:
# separating the data & label
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

< Stemming Process >

Stemming is an NLP technique that reduces words to their root or base form, called the "stem". It normalizes words that share a root but differ in their endings or suffixes, so that they are treated as the same term during text processing and analysis.
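The idea can be sketched with a crude suffix stripper. This is only an illustration under simplifying assumptions: `SUFFIXES` and `crude_stem` are hypothetical names, and the rules are far simpler than the Porter algorithm that the notebook actually uses through `PorterStemmer`.

```python
# Crude suffix stripping for illustration only (an assumption); this is NOT
# the Porter algorithm, which applies a much larger, ordered rule set.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["report", "reports", "reported", "reporting"]
stems = [crude_stem(w) for w in words]
print(stems)
```

All four word forms collapse to the same stem, so they would count as a single feature later on.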

In [None]:
port_stem = PorterStemmer()

In [None]:
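# build the stopword set once; set lookups avoid reloading the list for every word
stop_words = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content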

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])

In [None]:
#separating the data and label
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X)
print(X.shape)
print(Y)
print(Y.shape)

In [None]:
# converting the textual data to numerical data (TF-IDF features)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X)

In [None]:
print(X)
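`TfidfVectorizer` turns each document into a vector of term weights. As a rough sketch of that weighting, assuming scikit-learn's default settings (raw term counts, smoothed IDF, L2 normalization) and a plain whitespace tokenizer in place of the library's own, the hypothetical helper `tfidf` below computes the weights by hand on three made-up stemmed headlines.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Uses raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1,
    and L2 normalization of each document vector.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: tf * idf[term] for term, tf in counts.items()}
        # Fall back to 1.0 so an empty document does not divide by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


# Made-up stemmed headlines, for illustration only.
docs = ["senat vote budget", "senat reject budget", "alien land senat"]
vectors = tfidf(docs)
print(vectors[2])
```

In the third document, "senat" appears in every headline and so receives a lower weight than "alien" or "land", which appear only once in the corpus.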

< Splitting training and testing data >

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
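The `stratify=Y` argument asks for a split that keeps the share of each label roughly equal in the training and test sets. The sketch below shows that idea in plain Python; `stratified_split` is a hypothetical helper, not scikit-learn's implementation, and the label list is made up.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx), sampling the test share within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Made-up labels: 60 real (0) and 40 fake (1) articles.
labels = [0] * 60 + [1] * 40
train_idx, test_idx = stratified_split(labels)
test_counts = Counter(labels[i] for i in test_idx)
print(test_counts)
```

The test set keeps the same 60/40 ratio as the full data, so accuracy on it is not skewed by an unlucky draw of one class.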

< Training the model >

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
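For intuition about what the fitted model does at prediction time, here is a minimal sketch of a binary logistic regression decision: a weighted sum of the features is passed through the sigmoid function and compared with a threshold. The weights, bias, and feature values are hypothetical numbers chosen for illustration, not values learned from this dataset.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 (fake) if the estimated probability reaches the threshold, else 0 (real)."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if sigmoid(score) >= threshold else 0


# Hypothetical weights and feature values, for illustration only.
weights = [1.5, -2.0, 0.7]
bias = -0.1
label_a = predict_label([0.8, 0.1, 0.3], weights, bias)
label_b = predict_label([0.1, 0.9, 0.0], weights, bias)
print(label_a, label_b)
```

Training with `model.fit` is what chooses the weights and bias; this sketch only covers the prediction step.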

< Model Accuracy >

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy of Training data : ', training_data_accuracy)

In [None]:
# accuracy score on the testing data
X_test_prediction = model.predict(X_test)
# accuracy_score expects the true labels first, then the predictions
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy of Test data : ', test_data_accuracy)
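Accuracy is simply the fraction of predictions that match the true labels. A minimal sketch of that calculation follows; the helper `accuracy` and the two label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Made-up labels: four of the five predictions are correct.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
score = accuracy(y_true, y_pred)
print(score)
```

A training accuracy much higher than the test accuracy would suggest overfitting, which is why both scores are printed.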

< Prediction >

0 = Real News <br>
1 = Fake News

In [None]:
X_new = X_test[3]

prediction = model.predict(X_new)
print(prediction)

if prediction[0] == 0:
    print('The news is Real')
else:
    print('The news is Fake')

In [None]:
# true label of the same test sample, for comparison with the prediction
res = Y_test[3]

if res == 0:
    print('The news is Real - 0')
else:
    print('The news is Fake - 1')