Team SAO presents a high-accuracy model that predicts whether a news article is real or fake.

**Dataset Description**

1. id: Unique serial number of the news article
2. title: Title of the news article
3. author: Author/editor of the news article
4. content: The text of the article
5. label: Marks whether the news article is real or fake:
   1: fake news
   0: real news





Importing the Dependencies

In [9]:
import numpy as np
import pandas as pd
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [10]:
import nltk
nltk.download('stopwords')  # downloading the stopwords package

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
# Printing stopwords in English
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

Data Pre-processing

In [13]:
# loading our train.csv dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/train.csv')

ParserError: ignored

In [14]:
news_dataset.shape

NameError: ignored

In [None]:
# To see starting few data entries
news_dataset.head()

In [None]:
# Counting the number of missing values in each column of our dataset
news_dataset.isnull().sum()

In [None]:
# Replacing the null values with empty strings so that combining text columns does not produce missing values
news_dataset = news_dataset.fillna('')

In [None]:
# merging the columns, author name and title present in the dataset
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']
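
To illustrate why the missing values are filled before the columns are merged, here is a minimal sketch on a hypothetical two-row frame (the sample authors and titles are made up, not taken from train.csv). Concatenating a string with a missing value yields a missing value, so a single absent author would wipe out the whole merged field:

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the second article has no author.
sample = pd.DataFrame({
    'author': ['Jane Doe', np.nan],
    'title': ['Title A', 'Title B'],
})

# Without filling, a missing author makes the merged value missing too.
merged_raw = sample['author'] + ' ' + sample['title']
print(merged_raw.isna().tolist())

# After filling with empty strings, every row keeps at least its title.
filled = sample.fillna('')
merged = filled['author'] + ' ' + filled['title']
print(merged.tolist())
```

The filled version keeps the title of the second row, at the cost of a leading space that the later text cleaning removes.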

In [None]:
print(news_dataset['content'])  # let's see what is in the content column

In [None]:
# separating the data & label and assigning variables X and Y to them
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

In [None]:
print(X)
print(Y)

What is Stemming?

Stemming is a method of reducing a word to its root form.

For example:
coding, codes --> code

Let's use this feature.
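
As a quick check of the idea, the following sketch runs NLTK's `PorterStemmer` on a few inflected forms. The word list is only an illustration; other stemmers may produce different stems for the same words.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Inflected forms of the same word are mapped to a shared stem.
words = ['coding', 'codes', 'code']
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Note that a stem is not always a dictionary word; the stemmer applies suffix rules and does not look words up.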

In [None]:
port_stem = PorterStemmer()

In [None]:
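# build the stopword set once instead of reloading the list for every word
english_stopwords = set(stopwords.words('english'))

def stemming(content):
    # keep letters only, lowercase, and split into words
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)
    stemmed_content = stemmed_content.lower()
    stemmed_content = stemmed_content.split()
    # drop stopwords and stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    stemmed_content = ' '.join(stemmed_content)
    return stemmed_content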

Now apply stemming to the content column of our dataset

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)

In [None]:
print(news_dataset['content'])  # to see the effect of stemming on our dataset, i.e. each word is reduced to its root form

In [None]:
#separating the data and label columns
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [None]:
print(X) #to see what went in variable X

In [None]:
print(Y) #to see what went in variable Y

In [None]:
Y.shape

In [None]:
# converting the textual data (strings) to numerical TF-IDF feature vectors, since the model needs numeric input
vectorizer = TfidfVectorizer()
vectorizer.fit(X)

X = vectorizer.transform(X)
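
To make the result of the vectorizer more concrete, here is a small sketch on three hypothetical, already-stemmed texts (they are illustrative and not taken from train.csv). It shows the shape of the resulting sparse matrix and that, with the default settings, each row is normalized to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, already-stemmed sample texts (not from train.csv).
sample_texts = [
    'senat vote budget',
    'senat reject budget plan',
    'alien land white hous',
]

sample_vectorizer = TfidfVectorizer()
sample_matrix = sample_vectorizer.fit_transform(sample_texts)

# One row per text, one column per distinct term in the vocabulary.
print(sample_matrix.shape)
print(sorted(sample_vectorizer.vocabulary_))

# With the default settings each row is scaled to unit (L2) length.
row_norms = np.sqrt(sample_matrix.multiply(sample_matrix).sum(axis=1))
print(np.asarray(row_norms).ravel())
```

Terms that appear in many texts (here `senat` and `budget`) receive a lower weight than terms that appear in only one.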

In [None]:
print(X)

Splitting the dataset into training & test data

1. 80% for training
2. 20% for testing
3. Stratifying on Y so that the training and test sets keep the same proportion of real and fake labels





In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, stratify=Y, random_state=2)
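
The effect of `stratify` can be seen on a small synthetic example. The labels and placeholder features below are hypothetical (70 real, 30 fake) and only serve to show that both splits keep the original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 70 real (0) and 30 fake (1) articles.
labels = np.array([0] * 70 + [1] * 30)
features = np.arange(100).reshape(-1, 1)  # placeholder features

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=2
)

# Both splits keep the original 30% share of fake labels.
print(l_train.mean(), l_test.mean())
```

Without `stratify`, the share of each class in the test set would vary with the random seed.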

Training our Model: Bringing in Logistic Regression

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, Y_train)
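
In this notebook the vectorizer is fitted on the full dataset before the split, so the test articles influence the vocabulary and term weights. One possible alternative, sketched here and not the approach used above, is to chain the vectorizer and the classifier in a scikit-learn pipeline that is fitted on training text only. The texts and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already-stemmed training texts and labels (1 = fake, 0 = real).
train_texts = ['senat vote budget', 'alien land white hous',
               'court uphold tax law', 'miracl pill cure all diseas']
train_labels = [0, 1, 0, 1]

# The pipeline learns the vocabulary and the classifier from training data only.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(train_texts, train_labels)

# Unseen text goes through the same fitted vectorizer before prediction.
sample_prediction = pipeline.predict(['senat pass tax law'])
print(sample_prediction)
```

A pipeline like this also makes it straightforward to classify a new headline, provided it is stemmed the same way as the training text.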

Result time for the Model

**Checking Accuracy Score**

In [None]:
# accuracy score on the training data (true labels first, then predictions)
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)

In [None]:
print('Accuracy score of the training data : ', training_data_accuracy*100, "%") # printing accuracy score of our training data

In [None]:
# accuracy score on the test data (true labels first, then predictions)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)

In [None]:
print('Accuracy score of the test data : ', test_data_accuracy*100, "%") # printing accuracy score of our test data
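
Accuracy alone does not show which kind of mistake the model makes. As a complementary check, a confusion matrix separates real articles flagged as fake from fake articles that slip through. The sketch below uses hypothetical labels and predictions rather than the notebook's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (1 = fake, 0 = real).
true_labels = [0, 0, 1, 1, 1, 0]
predicted_labels = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(true_labels, predicted_labels)
print(matrix)

sample_accuracy = accuracy_score(true_labels, predicted_labels)
print(sample_accuracy)
```

The same call could be applied to `Y_test` and `X_test_prediction` once the cells above have run.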

Now that we have the accuracy scores, let's make a prediction

In [None]:
# k is a row position in the test set, not the 'id' column of the dataset
k = int(input("\nEnter a test-set row index (0 to " + str(X_test.shape[0] - 1) + "): "))

X_new = X_test[k]

prediction = model.predict(X_new)  # predicting the label for the chosen row
print(prediction)

if prediction[0] == 0:
    print('\t\tThe news is Real\n')
else:
    print('\t\tThe news is Fake\n')

print('Actual label:', Y_test[k])