# Fake News Prediction Project
# Logistic Regression Model (Binary Classification)
### **YouTube Link**: https://youtu.be/nacLBdyG6jE?si=JSufJlsbLzUUVIFx
### **Practiced by**: Mariah Noelle Cornelio
### **Date**: October 7, 2024
This project predicts whether an online news article is real or fake using logistic regression, a model for binary classification. I am using this project to practice machine learning techniques.
- All credit goes to @Siddhardhan on YouTube, amazing guy!
- The link to his video is here: https://youtu.be/nacLBdyG6jE?si=JSufJlsbLzUUVIFx
- The dataset used can be found here: https://www.kaggle.com/competitions/fake-news/data

## Importing and Preprocessing the Data

In [3]:
import numpy as np
import pandas as pd
import re # Regular expressions - search text for patterns
from nltk.corpus import stopwords # Stopwords are common words that add little meaning, like "the"
from nltk.stem.porter import PorterStemmer # Stemmer strips suffixes and returns the root word
from sklearn.feature_extraction.text import TfidfVectorizer # Converts text to feature vectors (numbers)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Download stopwords from nltk library
import nltk
nltk.download("stopwords")

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/marielle/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
print(stopwords.words("english")) # Prints the English stopwords, which are removed during the stemming step

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [7]:
news_dataset=pd.read_csv("train.csv")
news_dataset.head(5)

,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [6]:
news_dataset.shape

(20800, 5)

In [8]:
# Check for missing values
news_dataset.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [9]:
# Replace the null values with empty string (null string)
news_dataset=news_dataset.fillna("")

# Use only the title and author for processing because the full text column would take a long time to process
# The text column can still be tried later if time permits

In [10]:
# Combining/merging the title and author name
news_dataset["content"]=news_dataset["author"]+" "+news_dataset["title"]
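Why the missing values are filled before merging can be seen in a small sketch. The two-row DataFrame below is hypothetical (it is not read from train.csv); it only illustrates that joining a string to a missing value gives a missing value, so rows without an author would lose their whole content if `fillna("")` were skipped.

```python
import pandas as pd

# Hypothetical two-row frame; the second row has no author
toy = pd.DataFrame({
    "author": ["David Swanson", None],
    "title": ["What Keeps the F-35 Alive", "Some untitled example"],
})

# Without fillna: the missing author makes the whole merged value missing
raw_content = toy["author"] + " " + toy["title"]
print(raw_content.isna().tolist())

# With fillna(""): every row keeps at least its title
filled = toy.fillna("")
filled_content = filled["author"] + " " + filled["title"]
print(filled_content.tolist())
```

The second merged value starts with a space because the author is empty; the later `split()` step discards that extra whitespace.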

In [11]:
print(news_dataset["content"]) # Use this content data to make the prediction

0        Darrell Lucus House Dem Aide: We Didn’t Even S...
1        Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2        Consortiumnews.com Why the Truth Might Get You...
3        Jessica Purkiss 15 Civilians Killed In Single ...
4        Howard Portnoy Iranian woman jailed for fictio...
                               ...                        
20795    Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796    Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797    Michael J. de la Merced and Rachel Abrams Macy...
20798    Alex Ansary NATO, Russia To Hold Parallel Exer...
20799              David Swanson What Keeps the F-35 Alive
Name: content, Length: 20800, dtype: object


## Separating the data and label

In [12]:
X=news_dataset.drop(columns="label") # columns= already names the axis, so axis=1 is not needed
y=news_dataset["label"]

In [13]:
print(X)

          id  ...                                            content
0          0  ...  Darrell Lucus House Dem Aide: We Didn’t Even S...
1          1  ...  Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2          2  ...  Consortiumnews.com Why the Truth Might Get You...
3          3  ...  Jessica Purkiss 15 Civilians Killed In Single ...
4          4  ...  Howard Portnoy Iranian woman jailed for fictio...
...      ...  ...                                                ...
20795  20795  ...  Jerome Hudson Rapper T.I.: Trump a ’Poster Chi...
20796  20796  ...  Benjamin Hoffman N.F.L. Playoffs: Schedule, Ma...
20797  20797  ...  Michael J. de la Merced and Rachel Abrams Macy...
20798  20798  ...  Alex Ansary NATO, Russia To Hold Parallel Exer...
20799  20799  ...            David Swanson What Keeps the F-35 Alive

[20800 rows x 5 columns]


In [14]:
print(y)

0        1
1        0
2        1
3        1
4        1
        ..
20795    0
20796    0
20797    0
20798    1
20799    1
Name: label, Length: 20800, dtype: int64


## What is stemming? 
Stemming is the process of reducing a word to its root word.

Example: 
- Actor
- Actress
- Acting

All three share the root word "act".

Reducing words to their roots shrinks the vocabulary, so related words count as the same feature and the model has less to learn.
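A quick way to see what the stemmer actually returns is to run it on a few words. This is a minimal sketch using the words from the example above plus three words that appear in the dataset titles. The Porter stemmer is rule-based, so its output is not always a dictionary word and it may not collapse every related word to the same root.

```python
from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Words from the example above plus a few from the dataset titles
words = ["actor", "actress", "acting", "killed", "house", "alive"]
stems = {word: port_stem.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Stems such as "hous" and "aliv" look odd, but that is fine here: the model only needs the same word to map to the same token every time.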

In [15]:
port_stem=PorterStemmer()
stop_words=set(stopwords.words("english")) # Build the stopword set once instead of reloading the list for every word

def stemming(content):
    stemmed_content=re.sub("[^a-zA-Z]", " ", content)
    stemmed_content=stemmed_content.lower()
    stemmed_content=stemmed_content.split()
    stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stop_words]
    stemmed_content=" ".join(stemmed_content)
    return stemmed_content

# re.sub replaces every character that is not a letter (the ^ inside [^a-zA-Z] means "not")
# with a space, so only the words are kept

# Converts all letters to lowercase so that "The" and "the" are treated as the same word

# Splits the text into a list of words

# Stems each word and skips the stopwords along the way

# Joins the words back into a single string so it is not a list anymore

# Returns the stemmed content
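The first three steps of the function can be traced by hand on one title from the dataset. This sketch uses only the standard library and stops before stopword removal and stemming; the variable names are made up for illustration.

```python
import re

title = "David Swanson What Keeps the F-35 Alive"

letters_only = re.sub("[^a-zA-Z]", " ", title)  # digits and punctuation become spaces
lowered = letters_only.lower()
tokens = lowered.split()  # split() with no argument collapses repeated spaces

print(tokens)
# ['david', 'swanson', 'what', 'keeps', 'the', 'f', 'alive']
```

Removing the stopwords "what" and "the" and stemming the rest gives "david swanson keep f aliv", which is the last row of the stemmed output.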

In [16]:
news_dataset["content"]=news_dataset["content"].apply(stemming)

In [17]:
print(news_dataset["content"])

0        darrel lucu hous dem aid even see comey letter...
1        daniel j flynn flynn hillari clinton big woman...
2                   consortiumnew com truth might get fire
3        jessica purkiss civilian kill singl us airstri...
4        howard portnoy iranian woman jail fiction unpu...
                               ...                        
20795    jerom hudson rapper trump poster child white s...
20796    benjamin hoffman n f l playoff schedul matchup...
20797    michael j de la merc rachel abram maci said re...
20798    alex ansari nato russia hold parallel exercis ...
20799                            david swanson keep f aliv
Name: content, Length: 20800, dtype: object


## Separating the data and label again after stemming

In [18]:
X=news_dataset["content"].values
y=news_dataset["label"].values

In [19]:
print(X)

['darrel lucu hous dem aid even see comey letter jason chaffetz tweet'
 'daniel j flynn flynn hillari clinton big woman campu breitbart'
 'consortiumnew com truth might get fire' ...
 'michael j de la merc rachel abram maci said receiv takeov approach hudson bay new york time'
 'alex ansari nato russia hold parallel exercis balkan'
 'david swanson keep f aliv']


In [20]:
print(y)

[1 0 1 ... 0 1 1]


In [21]:
y.shape

(20800,)

## Converting text into numerical values

In [22]:
vectorizer=TfidfVectorizer() # Term frequency-inverse document frequency
vectorizer.fit(X)
X=vectorizer.transform(X)

# Term frequency (TF) counts how often a word appears in a document
# Inverse document frequency (IDF) lowers the weight of words that appear in many documents
# Example: in a set of reviews of the Avengers movie, the word "Avengers" shows up in almost
# every review, so it says little about whether a review is positive or negative
# TF-IDF combines the two, so words that are frequent in one document but rare overall get the highest values

# y does not need to be vectorized because the labels are already numbers
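The Avengers idea can be checked on a tiny corpus. The three reviews below are hypothetical and the names `toy_vectorizer` and `toy_matrix` are made up for this sketch; it assumes the default `TfidfVectorizer` settings, as in the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus: "avengers" appears in every review
reviews = [
    "avengers great movie",
    "avengers bad movie",
    "avengers boring plot",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(reviews)  # fit and transform in one call

# Look up the weight of each vocabulary word in the first review
first_row = toy_matrix[0].toarray().ravel()
weights = {word: first_row[index] for word, index in toy_vectorizer.vocabulary_.items()}

for word in ["avengers", "movie", "great"]:
    print(word, round(weights[word], 3))
```

In the first review, "avengers" (in all three reviews) should get the lowest weight, "movie" (in two) a middle weight, and "great" (in one) the highest.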

In [23]:
print(X)

# This numeric data is what gets fed into the model; ML models work with numbers, not strings

  (0, 267)	0.2701012497770876
  (0, 2483)	0.36765196867972083
  (0, 2959)	0.24684501285337127
  (0, 3600)	0.3598939188262558
  (0, 3792)	0.27053324808454915
  (0, 4973)	0.23331696690935097
  (0, 7005)	0.2187416908935914
  (0, 7692)	0.24785219520671598
  (0, 8630)	0.2921251408704368
  (0, 8909)	0.36359638063260746
  (0, 13473)	0.2565896679337956
  (0, 15686)	0.2848506356272864
  (1, 1497)	0.2939891562094648
  (1, 1894)	0.15521974226349364
  (1, 2223)	0.3827320386859759
  (1, 2813)	0.19094574062359204
  (1, 3568)	0.26373768806048464
  (1, 5503)	0.7143299355715573
  (1, 6816)	0.1904660198296849
  (1, 16799)	0.30071745655510157
  (2, 2943)	0.3179886800654691
  (2, 3103)	0.46097489583229645
  (2, 5389)	0.3866530551182615
  (2, 5968)	0.3474613386728292
  (2, 9620)	0.49351492943649944
  :	:
  (20797, 3643)	0.2115550061362374
  (20797, 7042)	0.21799048897828685
  (20797, 8364)	0.22322585870464115
  (20797, 8988)	0.36160868928090795
  (20797, 9518)	0.29542040034203126
  (20797, 9588)	0.17455348

## Splitting the dataset into training and test data

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=2)
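The `stratify=y` argument keeps the share of real and fake labels the same in both splits. A minimal sketch with hypothetical, deliberately imbalanced labels (80 zeros and 20 ones) makes the effect visible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 zeros and 20 ones
toy_y = np.array([0] * 80 + [1] * 20)
toy_X = np.arange(100).reshape(-1, 1)

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=2
)

print("Share of label 1 in train:", toy_y_train.mean())
print("Share of label 1 in test: ", toy_y_test.mean())
```

Both shares come out at 0.2. Without `stratify`, the test set could end up with noticeably more or fewer ones by chance.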

## Training the model - logistic regression

In [25]:
model=LogisticRegression() # If the predicted probability is greater than 0.5, the label is 1; otherwise it is 0
# The probability comes from the sigmoid function applied to a weighted sum of the features - a big weight means an important feature
model.fit(X_train, y_train)
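The sigmoid and the 0.5 threshold can be written out in a few lines. This is a simplified sketch, not scikit-learn's internal code: `z` stands for the weighted sum of the TF-IDF features plus a bias, and the function names are made up for illustration.

```python
import math

def sigmoid(z):
    """Map any real-valued score z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(z, threshold=0.5):
    """Return 1 when the probability exceeds the threshold, else 0."""
    return 1 if sigmoid(z) > threshold else 0

for score in [-2.0, 0.0, 2.0]:
    print(score, round(sigmoid(score), 3), predict_label(score))
```

A negative score gives a probability below 0.5 (label 0), and a positive score gives a probability above 0.5 (label 1).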

## Evaluation

In [26]:
# Training accuracy score
X_train_prediction=model.predict(X_train)
training_data_accuracy=accuracy_score(y_train, X_train_prediction) # True labels first, then predictions
print("Accuracy score of training data: ", training_data_accuracy)

# About 98.6% is really good!
# The accuracy score on the training data is not that important; the score on the test data matters more

Accuracy score of training data:  0.9863581730769231


In [28]:
# Testing accuracy score
X_test_prediction=model.predict(X_test)
test_data_accuracy=accuracy_score(y_test, X_test_prediction) # True labels first, then predictions
print("Accuracy score of test data: ", test_data_accuracy)

# 97.9% (~98%) accuracy on data the model has not seen, close to the training score. Very good.

Accuracy score of test data:  0.9790865384615385
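Accuracy is simply the fraction of predictions that match the true labels. The sketch below uses ten hypothetical labels (not taken from the dataset) to show that counting matches by hand agrees with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for ten articles
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])

manual_accuracy = (y_true == y_pred).mean()  # fraction of matching labels
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy, library_accuracy)
```

By the same arithmetic, the test score above corresponds to 4073 correct predictions out of the 4160 test articles (20% of 20800).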


## Making a predictive system

In [31]:
X_new=X_test[69]
prediction=model.predict(X_new)
print(prediction)

if prediction[0]==0:
    print("This is real news.")
else:
    print("This is fake news.")

[0]
This is real news.


In [32]:
print(y_test[69])

0


In [None]:
# The prediction matches the true label, so our model is correct for this sample
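The prediction step could be wrapped in a small helper so that any new string can be checked. The sketch below is self-contained and therefore trains on four hypothetical, already-stemmed snippets; `toy_vectorizer`, `toy_model` and `describe_prediction` are made-up names. In this notebook the fitted `vectorizer` and `model` would take their place, and raw text would first need to go through `stemming`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical, already-cleaned training snippets (0 = real, 1 = fake)
toy_texts = [
    "senat pass budget bill",
    "senat pass tax law",
    "shock secret cure reveal",
    "shock secret alien truth",
]
toy_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
toy_model = LogisticRegression()
toy_model.fit(toy_vectorizer.fit_transform(toy_texts), toy_labels)

def describe_prediction(cleaned_text):
    """Vectorize one cleaned string and translate the label into words."""
    features = toy_vectorizer.transform([cleaned_text])  # transform only, never refit
    label = toy_model.predict(features)[0]
    return "This is real news." if label == 0 else "This is fake news."

print(describe_prediction("secret shock cure"))
```

The important detail is that new text goes through `transform` only, so it is mapped onto the vocabulary learned during training.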