id: unique id for a news article

title: the title of a news article

author: author of the news article

text: the text of the article; could be incomplete

label: a label that marks the article as potentially unreliable

1: fake

0: real

In [None]:
import numpy as np
import pandas as pd
import csv
import re
import nltk.corpus
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [None]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
print(stopwords.words('english'))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [None]:
import pandas as pd
import csv # Import the csv module

#loading the dataset to pandas dataframe
news_dataset = pd.read_csv('/content/train.csv', quoting=csv.QUOTE_NONE, escapechar="\\", on_bad_lines='skip')
# quoting=csv.QUOTE_NONE: quote characters are treated as ordinary text, so a comma inside a quoted field still splits that field into separate columns.
# escapechar="\\": a backslash escapes the character that follows it (for example, a delimiter inside a field).
# The default, quoting=csv.QUOTE_MINIMAL, keeps commas inside quoted fields together; it is the usual choice when text columns contain commas.
# on_bad_lines='skip': replaces the deprecated 'error_bad_lines' and drops rows that have too many fields. 'warn' or a callable (supported with engine='python') are alternatives.
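To see how the `quoting` setting changes parsing, here is a minimal sketch on a made-up three-line CSV (the sample text and variable names are assumptions for illustration, not part of the dataset). It compares the default quoting with `csv.QUOTE_NONE` combined with `on_bad_lines='skip'`.

```python
import csv
import io

import pandas as pd

# Hypothetical sample: the second data row has a quoted title containing a comma
sample = 'id,title,label\n0,Plain title,0\n1,"Hello, world",1\n'

# Default quoting (csv.QUOTE_MINIMAL): the quoted comma stays inside the title
default_df = pd.read_csv(io.StringIO(sample))

# csv.QUOTE_NONE: quotes are ordinary characters, so the comma splits the title,
# the row ends up with one field too many, and on_bad_lines='skip' drops it
raw_df = pd.read_csv(io.StringIO(sample), quoting=csv.QUOTE_NONE, on_bad_lines='skip')

print(default_df.shape, raw_df.shape)
```

With the default quoting both rows survive and the title is kept whole; with `csv.QUOTE_NONE` the row containing the quoted comma is silently dropped.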

In [None]:
news_dataset.shape

(8261, 5)

In [None]:
news_dataset.head()

Unnamed: 0,Unnamed: 1,id,title,author,text,label
0,House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It,Darrell Lucus,"""House Dem Aide: We Didn’t Even See Comey’s Le...",2016 Subscribe Jason Chaffetz on the stump in...,Utah ( image courtesy Michael Jolley,available under a Creative Commons-BY license)
With apologies to Keith Olbermann,there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide,it looks like we also know who the second-wor...,the ranking Democrats on the relevant committ...,,,
As we now know,Comey notified the Republican chairmen and Democratic ranking members of the House Intelligence,Judiciary,and Oversight committees that his agency was ...,Oversight Committee Chairman Jason Chaffetz s...,"""""The FBI has learned of the existence of ema...",
— Jason Chaffetz (@jasoninthehouse) October 28,2016,,,,,
Of course,we now know that this was not the case . Comey was actually saying that it was reviewing the emails in light of “an unrelated case”–which we now know to be Anthony Weiner’s sexting with a teenager. But apparently such little things as facts didn’t matter to Chaffetz. The Utah Republican had already vowed to initiate a raft of investigations if Hillary wins–at least two years’ worth,and possibly an entire term’s worth of them. ...,,,,


In [None]:
news_dataset.isnull().sum()

Unnamed: 0,0
id,4167
title,5460
author,6618
text,7416
label,7936


In [None]:
# replacing the null values with empty string
news_dataset = news_dataset.fillna('')

In [None]:
# merging the author name and news title
news_dataset['content'] = news_dataset['author']+' '+news_dataset['title']

In [None]:
print(news_dataset['content'])

0                                               House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                     2016 Subscribe Jason Chaffetz on the stump in...
With apologies to Keith Olbermann                there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                                                                                                                                                   the ranking Democrats on the relevant commit...
As we now know            

In [None]:
# separating the data and label
x=news_dataset.drop(columns='label')
y=news_dataset['label']

In [None]:
print(x)
print(y)

                                                                                                                                                  id  \
0                                              House Dem Aide: We Didn’t Even See Comey’s Lett...                                      Darrell Lucus   
With apologies to Keith Olbermann               there is no doubt who the Worst Person in The ...   it looks like we also know who the second-wor...   
As we now know                                  Comey notified the Republican chairmen and Dem...                                          Judiciary   
— Jason Chaffetz (@jasoninthehouse) October 28  2016                                                                                                   
Of course                                       we now know that this was not the case . Comey...   and possibly an entire term’s worth of them. ...   
...                                                                                     

In [None]:
port_stem=PorterStemmer()

In [None]:
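# build the stopword set once; calling stopwords.words('english') for every word is slow
stop_words=set(stopwords.words('english'))

def stemming(content):
  # keep letters only, lowercase, and split into words
  stemmed_content=re.sub('[^a-zA-Z]',' ',content)
  stemmed_content=stemmed_content.lower()
  stemmed_content=stemmed_content.split()
  # drop stopwords and reduce each remaining word to its Porter stem
  stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stop_words]
  stemmed_content=' '.join(stemmed_content)
  return stemmed_content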
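As a quick check of what this preprocessing does to a single headline, the sketch below applies the same steps to one string. The name `stem_headline` and the tiny hand-written stopword set are assumptions made so the example runs without downloading the NLTK corpus; the notebook itself uses NLTK's full English list.

```python
import re

from nltk.stem.porter import PorterStemmer

port_stem = PorterStemmer()

# Assumption: a small stand-in for NLTK's English stopword list
demo_stop_words = {'we', 'it', 'the', 'a'}

def stem_headline(text):
    # Keep letters only, lowercase, split on whitespace
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    # Drop stopwords and stem what remains
    return ' '.join(port_stem.stem(w) for w in words if w not in demo_stop_words)

demo_result = stem_headline('House Dem Aide: We Tweeted It')
print(demo_result)
```

The punctuation and the two stopwords disappear, and the remaining words are cut to stems such as `hous`, `aid`, and `tweet`, matching the stems visible in the notebook's printed output.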


In [None]:
news_dataset['content']=news_dataset['content'].apply(stemming)


In [None]:
print(news_dataset['content'])

0                                               House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It                                                                                                                                                                                                                                                                                                                    subscrib jason chaffetz stump american fork ho...
With apologies to Keith Olbermann                there is no doubt who the Worst Person in The World is this week–FBI Director James Comey. But according to a House Democratic aide                                                                                                                                                                                                                                                                 rank democrat relev committe hear comey found ...
As we now know            

In [None]:
X=news_dataset['content'].values
Y=news_dataset['label'].values

In [None]:
print(X)


['subscrib jason chaffetz stump american fork hous dem aid even see comey letter jason chaffetz tweet darrel lucu octob'
 'rank democrat relev committe hear comey found via tweet one republican committe chairmen'
 'oversight committe chairman jason chaffetz set polit world ablaz tweet fbi dir inform oversight committe agenc review email recent discov order see contain classifi inform long letter went'
 ... '' '' 'someth need ever']


In [None]:
print(Y)

[' available under a Creative Commons-BY license) ' '' '' ... '' '' '']
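The printed `Y` contains free text and empty strings rather than 0 and 1, which suggests the label column was shifted during parsing. One possible way to keep only rows with a valid binary label is sketched below on a made-up frame; the frame contents and the names `numeric`, `valid`, and `clean` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame mimicking the mix of valid and shifted labels
df = pd.DataFrame({
    'content': ['first article', 'second article', 'third article', 'fourth article'],
    'label': ['1', ' available under a Creative Commons-BY license) ', '', '0'],
})

# Non-numeric labels (free text, empty strings) become NaN
numeric = pd.to_numeric(df['label'], errors='coerce')

# Keep only rows whose label is exactly 0 or 1
valid = numeric.isin([0, 1])
clean = df[valid].assign(label=numeric[valid].astype(int))

print(clean)
```

Filtering like this only hides the symptom; if most rows fail the check, the loading step is the more likely place to fix.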


In [None]:
Y.shape

(8261,)

In [None]:
vectorizer=TfidfVectorizer()
# fit the vocabulary and transform the documents in one step
X=vectorizer.fit_transform(X)

In [None]:
print(X)

  (0, 144)	0.20343753160484926
  (0, 217)	0.15581932461667572
  (0, 964)	0.4359816780408594
  (0, 1160)	0.20127631881484984
  (0, 1492)	0.23310088256498993
  (0, 1582)	0.2518163732763751
  (0, 2091)	0.149033647663725
  (0, 2390)	0.2518163732763751
  (0, 2890)	0.1778116489583359
  (0, 3202)	0.4115738073249634
  (0, 3473)	0.20836034958400634
  (0, 3581)	0.23310088256498993
  (0, 4190)	0.15740806199095583
  (0, 5354)	0.16184436688627407
  (0, 5834)	0.24086851302761964
  (0, 5850)	0.23310088256498993
  (0, 6290)	0.1956699011422195
  (1, 968)	0.32504626734257513
  (1, 1160)	0.25980882531183297
  (1, 1171)	0.5051440473869784
  (1, 1588)	0.22242448947158638
  (1, 2412)	0.2295204717308101
  (1, 2777)	0.28138404034871595
  (1, 4223)	0.17480308912700843
  (1, 4894)	0.27673010490985117
  :	:
  (8237, 3962)	0.26413259322350546
  (8237, 4175)	0.1440475643896338
  (8237, 4189)	0.13915010006690862
  (8237, 4577)	0.14064423001605053
  (8237, 4684)	0.1364788853810804
  (8237, 5461)	0.48345495375102193


In [None]:
# Y should contain only the binary labels 0 and 1; counting each class verifies that
# Get the counts of each class in Y
unique_classes, class_counts = np.unique(Y, return_counts=True)

# Print the class counts to identify the problematic class
print("Class Counts:", dict(zip(unique_classes, class_counts)))

# Check if any class has less than 2 samples
if any(count < 2 for count in class_counts):
    # Handle the issue:
    # 1. Gather more data for the under-represented class(es).
    # 2. Consider removing the 'stratify' parameter from train_test_split.
    # 3. If using 'stratify' is important, adjust the 'test_size' to ensure
    #    at least 2 samples are present for each class in both training and testing sets.
    print("Warning: One or more classes have less than 2 samples. Stratified splitting might not be possible.")
    # Here, we choose to remove the 'stratify' parameter as an example:
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
else:
    # Proceed with stratified split if all classes have enough samples
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

Class Counts: {'': 7936, ' & a trust fund kid former Republican (FL) ': 1, ' 127 people were detained in the biggest mass arrest to date.': 1, ' 2008. REUTERS/Ali Jarekji/File Photo ': 1, ' 2011 ( David Shankbone / CC BY 3.0 ) Michael Moore’s “Trumpland” is a textbook illustration of how the mindset of voting for “the lesser evil” just results in self-delusion—and ever more evil. ': 1, ' 2016 ': 8, ' 2016 / POLITICS / A focus group of 23 people put together by CBS News revealed a frightening look at what America has become – a divided nation. ': 1, ' 2016 8:53 AM EST ': 1, ' 2016 Tweet ': 1, ' 2016 Universal Child Hosted by Joanna L Ross KCOR Digital Radio Network Season 1 Episode 24 ': 1, ' 2016 at 6:09 pm ': 1, ' 2016 by Robert Rich in Politics Share This ': 1, ' 2016. ': 1, ' 33 and 32 percent. ': 1, ' 5+1 talks have begun already for an agreement on trade and economic cooperation between all participants in the process. ': 1, ' 700 light years from Earth. ': 1, ' Afghanistan. (Naji
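For reference, this is what a stratified split does when every class has enough samples: it keeps the class proportions the same in the training and test sets. The sketch uses made-up labels (80 of class 0, 20 of class 1); the array contents and names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo keeps the 80/20 ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=2
)

print(np.bincount(y_tr), np.bincount(y_te))
```

Stratification raises an error when a class has a single member, which is why the cell above falls back to an unstratified split for the mis-parsed labels.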

In [None]:

model=LogisticRegression()

In [None]:
model.fit(X_train,Y_train)

In [None]:
X_train_prediction=model.predict(X_train)
# accuracy_score expects the true labels first, then the predictions
training_data_accuracy=accuracy_score(Y_train,X_train_prediction)

In [None]:
print('Accuracy score of the training data:',training_data_accuracy)

Accuracy score of the training data: 0.9603510895883777
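The class counts printed earlier show that 7,936 of the 8,261 labels are the empty string, so a model that always predicts that one class would already score about 0.96. The sketch below computes that majority-class baseline with scikit-learn's `DummyClassifier`; the label array is a stand-in built from those two counts (the `'other'` class name is an assumption that lumps all remaining labels together).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Stand-in labels: 7936 empty strings, the remaining 325 rows lumped as 'other'
y_demo = np.array([''] * 7936 + ['other'] * 325)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this baseline

# Always predict the most frequent class
baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

print('Majority-class baseline accuracy:', baseline_accuracy)
```

Because the reported training accuracy is close to this baseline, it may reflect the label imbalance rather than anything learned about fake versus real articles. Scoring the held-out `X_test` and `Y_test` with the same `accuracy_score` call would give a fairer picture once the labels are valid.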
