In [44]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

import re
import string

In [45]:
data_fake = pd.read_csv("Large_Fake_News.csv")
data_true = pd.read_csv("Large_True_News.csv")

In [46]:
data_fake.head()

Unnamed: 0,title,text,subject,date
0,Internet to Be Shut Down for 48 Hours Due to G...,A group of rogue scientists claim they have di...,politicsNews,"January 23, 2023"
1,Aliens Confirmed to Have Landed in the Amazon ...,A group of rogue scientists claim they have di...,scienceNews,"December 15, 2023"
2,Aliens Confirmed to Have Landed in the Amazon ...,Reports suggest that the global internet will ...,scienceNews,"October 11, 2021"
3,Ancient Civilization Found in the Depths of th...,"In a shocking turn of events, Tesla CEO Elon M...",worldNews,"June 25, 2023"
4,Bill Gates Purchases Entire Country of New Zea...,Russian scientists have allegedly discovered t...,worldNews,"October 09, 2022"


In [47]:
data_true.tail()

Unnamed: 0,title,text,subject,date
19995,World Health Organization Declares End of COVI...,International coalitions have intensified thei...,scienceNews,"March 08, 2021"
19996,United Nations Calls for Global Action on Plas...,International coalitions have intensified thei...,techNews,"May 20, 2023"
19997,International Efforts to Combat Cybercrime Int...,The European Parliament has passed the AI Regu...,economyNews,"January 26, 2023"
19998,Global Climate Summit 2023 Concludes with Hist...,Scientists have announced a breakthrough in re...,economyNews,"November 07, 2023"
19999,Historic Peace Agreement Reached in Middle Eas...,The WHO has officially declared the COVID-19 p...,techNews,"October 09, 2020"


In [48]:
data_fake["class"] = 0
data_true["class"] = 1

In [49]:
data_fake.shape, data_true.shape

((20000, 5), (20000, 5))

In [50]:
# Hold out the last 10 rows of each dataset for manual testing,
# then drop those rows from the data used for training.
data_fake_manual_testing = data_fake.tail(10)
data_fake.drop(data_fake_manual_testing.index, axis=0, inplace=True)

data_true_manual_testing = data_true.tail(10)
data_true.drop(data_true_manual_testing.index, axis=0, inplace=True)

In [51]:
data_fake.shape, data_true.shape

((19990, 5), (19990, 5))

In [52]:
# tail() returns a slice of the original frame, so assigning a column to it
# directly triggers SettingWithCopyWarning. Take explicit copies first.
data_fake_manual_testing = data_fake_manual_testing.copy()
data_true_manual_testing = data_true_manual_testing.copy()
data_fake_manual_testing["class"] = 0
data_true_manual_testing["class"] = 1

In [53]:
data_fake_manual_testing

Unnamed: 0,title,text,subject,date,class
19990,Ancient Civilization Found in the Depths of th...,A group of rogue scientists claim they have di...,techNews,"September 15, 2021",0
19991,UFO Sightings Increase Dramatically as Governm...,"In a shocking turn of events, Tesla CEO Elon M...",politicsNews,"January 15, 2021",0
19992,Elon Musk to Announce Candidacy for U.S. Presi...,Marine archaeologists claim to have found an a...,scienceNews,"March 15, 2023",0
19993,Scientists Discover Cure for Aging in Secret L...,Russian scientists have allegedly discovered t...,scienceNews,"December 06, 2020",0
19994,UFO Sightings Increase Dramatically as Governm...,Leaked documents reveal that global leaders ha...,worldNews,"March 08, 2020",0
19995,UFO Sightings Increase Dramatically as Governm...,Leaked documents reveal that global leaders ha...,politicsNews,"February 21, 2020",0
19996,Global Leaders Hold Secret Meeting on Alien In...,A group of rogue scientists claim they have di...,techNews,"November 04, 2020",0
19997,Elon Musk to Announce Candidacy for U.S. Presi...,A sudden increase in UFO sightings worldwide h...,politicsNews,"March 09, 2020",0
19998,UFO Sightings Increase Dramatically as Governm...,Marine archaeologists claim to have found an a...,techNews,"October 18, 2022",0
19999,Scientists Discover Cure for Aging in Secret L...,Reports suggest that the global internet will ...,politicsNews,"May 12, 2022",0


In [54]:
data_true_manual_testing

Unnamed: 0,title,text,subject,date,class
19990,Breakthrough in Renewable Energy Achieved in 2023,NASA's Artemis II mission successfully launche...,techNews,"December 27, 2022",1
19991,Breakthrough in Renewable Energy Achieved in 2023,NASA's Artemis II mission successfully launche...,healthNews,"April 18, 2022",1
19992,2023 Marks Record High for Space Exploration A...,The European Parliament has passed the AI Regu...,healthNews,"February 01, 2021",1
19993,World Health Organization Declares End of COVI...,The European Parliament has passed the AI Regu...,healthNews,"April 28, 2022",1
19994,AI Regulation Bill Passed in European Parliament,International coalitions have intensified thei...,worldNews,"March 28, 2021",1
19995,World Health Organization Declares End of COVI...,International coalitions have intensified thei...,scienceNews,"March 08, 2021",1
19996,United Nations Calls for Global Action on Plas...,International coalitions have intensified thei...,techNews,"May 20, 2023",1
19997,International Efforts to Combat Cybercrime Int...,The European Parliament has passed the AI Regu...,economyNews,"January 26, 2023",1
19998,Global Climate Summit 2023 Concludes with Hist...,Scientists have announced a breakthrough in re...,economyNews,"November 07, 2023",1
19999,Historic Peace Agreement Reached in Middle Eas...,The WHO has officially declared the COVID-19 p...,techNews,"October 09, 2020",1


In [55]:
data_merge = pd.concat([data_fake, data_true], axis=0)
data_merge.head(10)

Unnamed: 0,title,text,subject,date,class
0,Internet to Be Shut Down for 48 Hours Due to G...,A group of rogue scientists claim they have di...,politicsNews,"January 23, 2023",0
1,Aliens Confirmed to Have Landed in the Amazon ...,A group of rogue scientists claim they have di...,scienceNews,"December 15, 2023",0
2,Aliens Confirmed to Have Landed in the Amazon ...,Reports suggest that the global internet will ...,scienceNews,"October 11, 2021",0
3,Ancient Civilization Found in the Depths of th...,"In a shocking turn of events, Tesla CEO Elon M...",worldNews,"June 25, 2023",0
4,Bill Gates Purchases Entire Country of New Zea...,Russian scientists have allegedly discovered t...,worldNews,"October 09, 2022",0
5,Moon Found to Be Hollow According to New Study,A group of rogue scientists claim they have di...,scienceNews,"April 24, 2023",0
6,Scientists Discover Cure for Aging in Secret L...,Rumors have surfaced that billionaire Bill Gat...,techNews,"October 14, 2023",0
7,Elon Musk to Announce Candidacy for U.S. Presi...,Rumors have surfaced that billionaire Bill Gat...,scienceNews,"February 04, 2020",0
8,Moon Found to Be Hollow According to New Study,A sudden increase in UFO sightings worldwide h...,worldNews,"October 11, 2022",0
9,Ancient Civilization Found in the Depths of th...,Several reports from indigenous tribes and exp...,techNews,"September 14, 2022",0


In [56]:
data_merge.columns

Index(['title', 'text', 'subject', 'date', 'class'], dtype='object')

In [57]:
data = data_merge.drop(["title", "subject", "date"], axis=1)

In [58]:
data.isnull().sum()

text     0
class    0
dtype: int64

In [59]:
data = data.sample(frac = 1)  # shuffle

In [60]:
data.head()

Unnamed: 0,text,class
8843,International coalitions have intensified thei...,1
5147,Space exploration in 2023 has reached new heig...,1
8785,"The 2023 Global Climate Summit, held in Geneva...",1
14232,Scientists have announced a breakthrough in re...,1
8724,"In a shocking turn of events, Tesla CEO Elon M...",0


In [61]:
# Renumber rows 0..n-1 after shuffling and discard the old index
data.reset_index(drop=True, inplace=True)

In [62]:
data.columns

Index(['text', 'class'], dtype='object')

In [92]:
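# Preprocessing that keeps some punctuation, which may itself be a useful signal
def wordopt(text):
    text = text.lower()
    # After lowercasing, keep only letters, digits, spaces, and . , ! ?
    text = re.sub(r'[^a-z0-9.,!? ]', '', text)
    return text

# Vectorizer settings to try.
# NOTE: this creates a new, unfitted vectorizer. In [67] rebinds `vectorization`
# to a default TfidfVectorizer(), so these settings only take effect if that
# line is removed. The vectorizer must be fitted, and the models retrained on
# its output, before manual_testing() is called.
vectorization = TfidfVectorizer(
    max_features=5000,
    min_df=3,
    max_df=0.9,
    ngram_range=(1, 2),
    strip_accents='unicode'
)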



""" def wordopt(text):
    text = text.lower()
    # Remove more stopwords and add stemming/lemmatization
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer
    
    stop_words = set(stopwords.words('english'))
    ps = PorterStemmer()
    
    # Tokenize
    words = word_tokenize(text)
    # Remove stopwords and stem
    words = [ps.stem(word) for word in words if word.isalnum() and word not in stop_words]
    
    return ' '.join(words) """

""" def wordopt(text):
    text = text.lower()
    text = re.sub(r"\[.*?\]", "", text)
    text = re.sub(r"\\W", "", text)
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = re.sub(r"<.*?>+", "", text)
    text = re.sub(r"[%s]" % re.escape(string.punctuation), '', text)
    text = re.sub(r"\n", "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    return text """

In [64]:
data["text"] = data["text"].apply(wordopt)

In [65]:
x = data["text"]
y = data["class"]

In [66]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

In [67]:
vectorization = TfidfVectorizer()
xv_train = vectorization.fit_transform(x_train)
xv_test = vectorization.transform(x_test)

In [68]:
lr = LogisticRegression(
    C=1.0,               # Inverse of regularization strength
    class_weight='balanced',
    random_state=42
)

lr.fit(xv_train, y_train)

In [69]:
pred_lr = lr.predict(xv_test)

In [70]:
lr.score(xv_test, y_test)

1.0

In [71]:
print(classification_report(y_test, pred_lr))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4984
           1       1.00      1.00      1.00      5011

    accuracy                           1.00      9995
   macro avg       1.00      1.00      1.00      9995
weighted avg       1.00      1.00      1.00      9995
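
A perfect score on every metric is unusual for text classification and is worth checking before trusting it. One possible explanation, not confirmed by the truncated output above, is that the same article texts repeat many times, so the test split contains texts the model has already seen. The sketch below runs that check on a small made-up frame; `toy`, `toy_train`, and `toy_test` are hypothetical stand-ins for `data`, `x_train`, and `x_test`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame (not real rows).
toy = pd.DataFrame({
    "text": ["template A", "template A", "template B",
             "template C", "template A", "template C"],
    "class": [0, 0, 1, 1, 0, 1],
})

# How many distinct texts are there compared with the number of rows?
n_rows = len(toy)
n_unique = toy["text"].nunique()

# Simple positional split, only for illustration.
toy_train, toy_test = toy.iloc[:4], toy.iloc[4:]

# Fraction of test texts that also appear verbatim in the training split.
seen_in_train = set(toy_train["text"])
overlap = toy_test["text"].isin(seen_in_train).mean()

# Does every distinct text map to exactly one label?
labels_per_text = toy.groupby("text")["class"].nunique()
one_label_each = bool((labels_per_text == 1).all())

print(f"{n_unique} unique texts in {n_rows} rows, test overlap = {overlap:.2f}")
```

If the overlap on the real data is close to 1.0, the reported accuracy mostly measures memorization of repeated texts rather than generalization.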



In [72]:
dt = DecisionTreeClassifier(
    max_depth=10,        # Limit tree depth
    min_samples_split=5,
    random_state=42
)
dt.fit(xv_train, y_train)

In [73]:
pred_dt = dt.predict(xv_test)

In [74]:
dt.score(xv_test, y_test)

1.0

In [75]:
print(classification_report(y_test, pred_dt))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4984
           1       1.00      1.00      1.00      5011

    accuracy                           1.00      9995
   macro avg       1.00      1.00      1.00      9995
weighted avg       1.00      1.00      1.00      9995



In [76]:
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42
)
gb.fit(xv_train, y_train)

In [77]:
pred_gb = gb.predict(xv_test)

In [78]:
gb.score(xv_test, y_test)

1.0

In [79]:
print(classification_report(y_test, pred_gb))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4984
           1       1.00      1.00      1.00      5011

    accuracy                           1.00      9995
   macro avg       1.00      1.00      1.00      9995
weighted avg       1.00      1.00      1.00      9995



In [80]:
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=5,
    random_state=42
)
rf.fit(xv_train, y_train)

In [81]:
pred_rf = rf.predict(xv_test)

In [82]:
rf.score(xv_test, y_test)

1.0

In [83]:
print(classification_report(y_test, pred_rf))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4984
           1       1.00      1.00      1.00      5011

    accuracy                           1.00      9995
   macro avg       1.00      1.00      1.00      9995
weighted avg       1.00      1.00      1.00      9995



In [84]:
def output_label(n):
    # Map the numeric class back to a readable label
    if n == 0:
        return "Fake news"
    elif n == 1:
        return "Not fake news"

def manual_testing(news):
    testing_news = {"text": [news]}
    new_def_test = pd.DataFrame(testing_news)
    new_def_test["text"] = new_def_test["text"].apply(wordopt)
    new_x_test = new_def_test["text"]
    new_xv_test = vectorization.transform(new_x_test)
    pred_lr = lr.predict(new_xv_test)
    pred_dt = dt.predict(new_xv_test)
    pred_gb = gb.predict(new_xv_test)
    pred_rf = rf.predict(new_xv_test)

    result = {
        "Logistic Regression": output_label(pred_lr[0]),
        "Decision Tree": output_label(pred_dt[0]),
        "Gradient Boosting": output_label(pred_gb[0]),
        "Random Forest": output_label(pred_rf[0])
    }

    return result

A Fargo, North Dakota, man was arrested for clearing snow with a flamethrower.

Fred Rogers served as a sniper during the Vietnam War and had a large number of confirmed kills.
Fred Rogers wore his iconic sweaters to conceal the extensive tattoos on his arms that were acquired while serving in the military. 

Share a certain post of Bill Gates on Facebook and he will send you money.
"Hey Facebook, As some of you may know, I'm Bill Gates. If you click that share link, I will give you $5,000. I always deliver, I mean, I brought you Windows XP, right?"

An article from January 2020, claiming to be from a local news station in North Dakota (Valley News Live), circulated on Facebook.
The article claimed that a coronavirus case has been confirmed in Fargo.

In September 2017, Hillary Clinton released her memoir What Happened.
A month before in August 2017, a clip of the audiobook was released.
Hillary is also the narrator.
MSNBC and Fox News both picked up the story.

In August 2017, Hurricane Harvey hit and devastated areas of Texas and Louisiana.
Around this time, articles began circulating on social media saying Black Lives Matter protesters blocked rescue efforts.
The title of one such article reads: "'Black Lives Matter' Thugs Block Texas Rescue Efforts to Protest Trump... IMMEDIATELY REGRET IT." This article was shared by Freedom Daily and others. Their post included the article and the text "CHARGE THEM WITH FELONIES! Do you agree??"

The Pentagon is considering a Boeing proposal to supply Ukraine with cheap, small precision bombs fitted on to abundantly available rockets, allowing Kyiv to strike far behind Russian lines, according to a Reuters report. US and allied military inventories are shrinking, and Ukraine faces an increasing need for more sophisticated weapons as the war drags on. Boeing's proposed system, dubbed Ground-Launched Small Diameter Bomb (GLSDB), is one of about a half-dozen plans for getting new munitions into production for Ukraine and America's eastern European allies, industry sources told the news agency. GLSDB could be delivered as early as spring 2023, according to a document reviewed by Reuters and three people familiar with the plan. It combines the GBU-39 Small Diameter Bomb (SDB) with the M26 rocket motor, both of which are common in US inventories. Although a handful of GLSDB units have already been made, there are many logistical obstacles to formal procurement. The Boeing plan requires a price discovery waiver, exempting the contractor from an in-depth review that ensures the Pentagon is getting the best deal possible. Any arrangement would also require at least six suppliers to expedite shipments of their parts and services to produce the weapon quickly. Although the US has rebuffed requests for the 185-mile (297km) range Atacms missile, the GLSDB's 94-mile (150km) range would allow Ukraine to hit valuable military targets that have been out of reach and help it continue pressing its counterattacks by disrupting Russian rear areas. GLSDB is made jointly by Saab AB and Boeing Co and has been in development since 2019, well before the invasion, which Russia calls a "special operation". In October, SAAB chief executive Micael Johansson said of the GLSDB: "We are imminently shortly expecting contracts on that." According to the document - a Boeing proposal to US European- has small, folding wings that allow it to glide more than 100km if dropped from an aircraft and targets as small as 3ft in diameter.

In [94]:
news = input()
manual_testing(news)

NotFittedError: The TF-IDF vectorizer is not fitted
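
The error above is what scikit-learn raises when `transform` is called on a `TfidfVectorizer` that has never been fitted. A likely cause here, judging by the cell numbers, is that In [92] rebound `vectorization` to a fresh instance after the models were trained. The sketch below reproduces the situation on made-up sentences; `fitted_vec`, `rebound_vec`, and `safe_transform` are hypothetical names used only for illustration.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up training sentences (assumption: not taken from the dataset).
toy_train = [
    "rogue scientists claim a secret cure",
    "parliament passed the regulation bill",
]

# A vectorizer that has learned its vocabulary from the training text.
fitted_vec = TfidfVectorizer()
fitted_vec.fit(toy_train)

# Creating a new instance under any name starts from scratch:
# it has no vocabulary until fit() or fit_transform() is called.
rebound_vec = TfidfVectorizer(max_features=5000)

def safe_transform(vec, texts):
    """Return the TF-IDF matrix, or None if the vectorizer is unfitted."""
    try:
        return vec.transform(texts)
    except NotFittedError:
        return None

ok = safe_transform(fitted_vec, ["scientists passed a bill"])
broken = safe_transform(rebound_vec, ["scientists passed a bill"])

print("fitted vectorizer works:", ok is not None)
print("rebound vectorizer works:", broken is not None)
```

In the notebook, re-running In [67] and the model-fitting cells after any cell that reassigns `vectorization` should restore a consistent state, since the models expect features from the same fitted vectorizer they were trained on.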

In [86]:
from sklearn.model_selection import cross_val_score

# Add after each model fitting
scores = cross_val_score(lr, xv_train, y_train, cv=5)
print(f"Cross-validation scores: {scores}")
print(f"Average CV score: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")

Cross-validation scores: [1. 1. 1. 1. 1.]
Average CV score: 1.000 (+/- 0.000)
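
Note that the cross-validation above runs on `xv_train`, which was produced by a vectorizer fitted on the whole training split, so each validation fold has already influenced the vocabulary and IDF weights. A minimal sketch of one alternative, assuming raw text is passed in, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` so that TF-IDF is refitted inside every fold. The sentences and the names `toy_texts`, `toy_labels`, and `pipe` are invented for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented sentences standing in for x_train / y_train (assumption).
toy_texts = [
    "rogue scientists claim secret cure for aging",
    "aliens landed in the amazon rainforest",
    "leaked documents reveal secret alien meeting",
    "moon found to be hollow says rogue study",
    "parliament passed the regulation bill",
    "summit concludes with climate agreement",
    "agency declares end of the pandemic emergency",
    "mission successfully launched this week",
]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# The pipeline fits TF-IDF on the training part of each fold only,
# then applies that fitted vectorizer to the held-out part.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(random_state=42))

# cv=2 keeps two examples of each class in every fold of this tiny sample.
cv_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(f"Cross-validation scores: {cv_scores}")
```

A fitted pipeline also keeps the vectorizer and the model together as one object, which avoids the two drifting out of sync between cells.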
