About the Dataset:

1. id: unique id for a news article
2. title: the title of a news article
3. author: author of the news article
4. text: the text of the article; could be incomplete
5. label: a label that marks whether the news article is real or fake:
           1: fake news
           0: real news





Importing the Dependencies

In [None]:
!pip install git+https://github.com/codelucas/newspaper.git

Collecting git+https://github.com/codelucas/newspaper.git
  Cloning https://github.com/codelucas/newspaper.git to /tmp/pip-req-build-56n884r2
  Running command git clone -q https://github.com/codelucas/newspaper.git /tmp/pip-req-build-56n884r2


In [None]:
import re

import numpy as np
import pandas as pd
import nltk
from newspaper import Article
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

nltk.download('stopwords')  # fetch the NLTK stopword lists used during stemming
corpus = []  # filled by stemming() with every processed text

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


Data Pre-processing

In [None]:
# load dataset to a pandas DataFrame
news_dataset = pd.read_csv('/content/train.csv')

In [None]:
# replace null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# merging author name and news title
news_dataset['content'] = news_dataset['author'] + ' ' + news_dataset['title']

In [None]:
# separate the features from the label
# (drop(columns=...) already targets columns, so axis=1 is not needed)
X = news_dataset.drop(columns='label')
Y = news_dataset['label']

Stemming:
The process of reducing a word to its root form (stem), so that inflected variants such as "tweets" and "tweeting" map to the same token

In [None]:
port_stem = PorterStemmer()

In [None]:
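english_stopwords = set(stopwords.words('english'))  # build the lookup set once

def stemming(content):
  # keep letters only, lowercase, and split into words
  words = re.sub('[^a-zA-Z]', ' ', content).lower().split()
  # drop stopwords and reduce each remaining word to its stem
  stemmed_content = ' '.join(port_stem.stem(word) for word in words if word not in english_stopwords)
  corpus.append(stemmed_content)  # side effect: collects every processed text
  return stemmed_content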

In [None]:
news_dataset['content'] = news_dataset['content'].apply(stemming)
print(news_dataset['content'][0])

darrel lucu hou dem aid even see comey letter jason chaffetz tweet
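
The cleaning that happens before stemming (strip non-letters, lowercase, drop stopwords) can be tried in isolation. The sketch below is a minimal, NLTK-free illustration: the helper name clean_tokens and the SAMPLE_STOPWORDS set are hypothetical, the set is only a small subset of NLTK's English stopword list, and the stemming step itself is left out.

```python
import re

# Hypothetical, tiny subset of NLTK's English stopword list (illustration only).
SAMPLE_STOPWORDS = {"a", "an", "the", "of", "to", "in", "on", "and", "is", "down"}

def clean_tokens(text, stop_words=SAMPLE_STOPWORDS):
    """Replace non-letters with spaces, lowercase, split, and drop stopwords."""
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    return [word for word in letters_only.lower().split() if word not in stop_words]

print(clean_tokens("On Venezuelan roads, old cars prevail, break down everywhere"))
# ['venezuelan', 'roads', 'old', 'cars', 'prevail', 'break', 'everywhere']
```

Note that digits and punctuation vanish entirely, so a title such as "Top 10 of 2016!" is reduced to the single token "top".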


In [None]:
# separating the data and label
Y = news_dataset['label'].values

In [None]:
# converting text data to numeric TF-IDF features
# (the matrix stays sparse; LogisticRegression accepts sparse input)
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news_dataset['content'])
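
TF-IDF weights each term by how often it occurs in a document and how rare it is across the corpus. The sketch below is a pure-Python approximation of what TfidfVectorizer computes with its default settings: raw term counts, smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2-normalised rows. It assumes plain whitespace tokenisation, whereas scikit-learn's default tokenizer also lowercases and drops one-character tokens. The function name tfidf and the sample documents are illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) where each row is an L2-normalised TF-IDF vector."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0  # avoid dividing by zero
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf(["old car break", "old road"])
print(vocab)
print(rows[0])
```

In the first sample document, "old" receives a lower weight than "car" because it appears in both documents.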

Training the Model: Logistic Regression

In [None]:
# hold out 20% of the rows for testing and fit on the training split only
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_train, Y_train)
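
For intuition, a fitted logistic regression model scores a feature vector with a weighted sum, squashes that score into a probability with the sigmoid function, and predicts class 1 when the probability reaches 0.5. The sketch below shows only that decision step. The weights, bias, and function names are made up for illustration and are not taken from the trained model.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_label(features, weights, bias, threshold=0.5):
    """Return 1 if the modelled probability of class 1 reaches the threshold, else 0."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if sigmoid(score) >= threshold else 0

# hypothetical two-feature model
weights, bias = [2.0, -3.0], -0.5
print(predict_label([1.0, 0.0], weights, bias))  # score 1.5, so class 1
print(predict_label([0.0, 1.0], weights, bias))  # score -3.5, so class 0
```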

Evaluation: Accuracy score

In [None]:
# accuracy score on the training data
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print('Accuracy of the training data : ', training_data_accuracy)

Accuracy of the training data :  0.9886418269230769
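
The cell above scores the model on the training rows only; X_test and Y_test from the split are never scored. Accuracy is the fraction of predictions that match the true labels, so the same calculation applies to the held-out split. A minimal pure-Python sketch (the helper name accuracy and the toy labels are illustrative, not from the notebook):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of an empty label list")
    matches = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return matches / len(y_true)

print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

In the notebook itself, the equivalent check on unseen data would be accuracy_score(Y_test, model.predict(X_test)); a score well below the training accuracy would suggest overfitting.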


Pre-process Article Data

In [None]:
# extract data from article
url = 'https://apnews.com/article/caribbean-caracas-venezuela-5a113f7f603f4d449e926ac9b981d4d5'
article = Article(url)
article.download()  # fetch the page HTML
article.parse()     # extract authors, title, and text
print(article.authors)
print(article.title)

['Regina Garcia Cano']
On Venezuelan roads, old cars prevail, break down everywhere


In [None]:
# join all author names, then append the title (same layout as the training 'content' column)
article_authors = ' '.join(article.authors)
article_data = article_authors + ' ' + article.title
print(article_data)

Regina Garcia Cano On Venezuelan roads, old cars prevail, break down everywhere


In [None]:
X_new = [stemming(article_data)]
print(X_new)

['regina garcia cano venezuelan road old car prevail break everywher']


In [None]:
X_new = vectorizer.transform(X_new).toarray()

Make a Prediction

In [None]:
prediction = model.predict(X_new)

# predict() returns an array with one label per input row
if prediction[0] == 0:
  print('The news is real')
else:
  print('The news is fake')

The news is fake
