FAKE NEWS DETECTOR With Python and Machine Learning 

What is a TfidfVectorizer?
TF (Term Frequency): The number of times a word appears in a document is its term frequency. A higher value means the term appears more often than others, so the document is a good match when the term is part of the search terms.

TF(t) = Number of times term t appears in a document / Total number of terms in the document

 


IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
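
As a small worked illustration of these two quantities, the sketch below computes TF and IDF by hand on a three-document toy corpus. TF is taken as the count of a term divided by the total number of terms in the document. The corpus, the helper names term_frequency and inverse_document_frequency, and the plain log(N / df) form of IDF are assumptions chosen for illustration; scikit-learn's TfidfVectorizer uses a smoothed variant by default, so its numbers will differ.

```python
import math

# Hypothetical toy corpus: each document is already split into tokens.
docs = [
    ["fake", "news", "spreads", "fast"],
    ["real", "news", "is", "verified"],
    ["fake", "claims", "fake", "sources"],
]

def term_frequency(term, doc):
    # Share of the document's tokens that are `term`.
    return doc.count(term) / len(doc)

def inverse_document_frequency(term, corpus):
    # Unsmoothed textbook form: log(N / number of documents containing the term).
    containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / containing)

tf_fake = term_frequency("fake", docs[2])                     # 2 of 4 tokens
idf_fake = inverse_document_frequency("fake", docs)           # in 2 of 3 documents
idf_verified = inverse_document_frequency("verified", docs)   # in 1 of 3 documents

print(tf_fake, idf_fake, idf_verified, tf_fake * idf_fake)
```

The rarer term ("verified") gets the larger IDF, which is the behaviour the definition above describes.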

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.

In [210]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import LabelEncoder

1. Loading the Dataset:


The WELFAKE_Dataset.csv file has 4 columns: an index number, the news title, the news text, and a label with values 0 and 1 (0 means fake and 1 means real).
WELFake is a dataset of 72,134 news articles, with 35,028 real and 37,106 fake. Its authors merged four popular news datasets (Kaggle, McIntire, Reuters, and BuzzFeed Political) to prevent over-fitting of classifiers and to provide more text data for better ML training.
Note that the cells below load a different file, fake_or_real_news.csv, which has 6,335 rows and string labels (FAKE and REAL).

In [211]:


# Read the data
df = pd.read_csv(r'C:\Users\pariksheeth\fake_or_real_news.csv\fake_or_real_news.csv')
df.head()


Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [212]:
#DataFlair - Get the labels
labels=df.label
labels.head()
labels.shape

(6335,)

2. Preparing the Data for Training:


We need to split the data into training and testing sets. 

In [213]:
#DataFlair - Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)


Here, x_train and x_test represent the text data, while y_train and y_test are the corresponding labels. We set a random_state to ensure the split is reproducible.
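
The effect of random_state can be checked on a small stand-in dataset. This is only a sketch: the ten placeholder texts and the names texts and labels_toy are hypothetical, not part of the notebook's data.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: ten texts with alternating labels.
texts = [f"article {i}" for i in range(10)]
labels_toy = ["FAKE" if i % 2 == 0 else "REAL" for i in range(10)]

# Two calls with the same random_state produce the same split.
first = train_test_split(texts, labels_toy, test_size=0.2, random_state=7)
second = train_test_split(texts, labels_toy, test_size=0.2, random_state=7)

x_train_toy, x_test_toy, y_train_toy, y_test_toy = first
print(len(x_train_toy), len(x_test_toy))  # 8 training items, 2 test items
print(first[1] == second[1])              # the two test sets are identical
```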

In [214]:
print(x_train.isna().sum())  # Check how many NaN values are present



0


3. Text Preprocessing with TF-IDF Vectorizer:


Next, we need to convert the raw text data into a format that can be understood by the machine learning model. We use a TF-IDF Vectorizer to transform the text data into numerical features. This method considers both the term frequency (how often a word appears in a document) and the inverse document frequency (how rare or unique a word is across all documents).

In [215]:
#DataFlair - Initialize a TfidfVectorizer and fit/transform the train data
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

#DataFlair - Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)


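We first initialize the TfidfVectorizer, removing common stop words (like "the," "is," etc.) and ignoring words that appear in more than 70% of the documents (max_df=0.7). We then fit the vectorizer on the training data only and reuse the learned vocabulary and IDF weights to transform the test data, so no information from the test set leaks into the features.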



4. Training the Model:


We use a Passive-Aggressive Classifier for this task. This model is efficient for large datasets and performs well in text classification tasks like detecting fake news. The classifier is trained using the transformed training data.
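
To give a rough idea of where the name comes from, the sketch below implements one step of the basic passive-aggressive update for a binary linear classifier. It is a simplified illustration, not scikit-learn's implementation: it assumes labels encoded as +1 and -1, a non-zero feature vector, and it omits the regularization parameter C that PassiveAggressiveClassifier exposes. The function name passive_aggressive_step is hypothetical.

```python
import numpy as np

def passive_aggressive_step(weights, x, y):
    """One update of the basic passive-aggressive rule; y must be +1 or -1."""
    margin = y * np.dot(weights, x)
    loss = max(0.0, 1.0 - margin)   # hinge loss on this example
    if loss == 0.0:
        return weights              # passive: the margin is already at least 1
    step = loss / np.dot(x, x)      # aggressive: smallest change that removes the loss
    return weights + step * y * x   # assumes x is not the zero vector

w = np.zeros(3)
x = np.array([1.0, 0.0, 2.0])       # hypothetical feature vector
w = passive_aggressive_step(w, x, y=1)
print(np.dot(w, x))                 # margin on this example after the update

# A second pass over the same example leaves the weights alone.
unchanged = passive_aggressive_step(w, x, y=1)
print(np.allclose(unchanged, w))
```

The model stays passive when an example is classified with enough margin and updates just enough to correct it otherwise, which is why it suits streams of text where each document is seen once.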

In [273]:
#DataFlair - Initialize a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=100)
pac.fit(tfidf_train,y_train)



5. Evaluating the Model:


After training the model, we evaluate its performance using the test data. The model predicts whether the news articles are real or fake, and we compare these predictions with the true labels. The accuracy score and confusion matrix help us understand how well the model is performing.

In [217]:
#DataFlair - Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.58%


In [218]:
#DataFlair - Build confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[587,  51],
       [ 43, 586]], dtype=int64)
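
The reported accuracy can be cross-checked against the confusion matrix above. In scikit-learn's convention, rows are true labels and columns are predicted labels, here in the order ['FAKE', 'REAL']. The variable names below are hypothetical; the four counts are copied from the output.

```python
# Counts copied from the confusion matrix output above.
fake_as_fake, fake_as_real = 587, 51   # true FAKE articles
real_as_fake, real_as_real = 43, 586   # true REAL articles

total = fake_as_fake + fake_as_real + real_as_fake + real_as_real
accuracy = (fake_as_fake + real_as_real) / total

# Treating 'FAKE' as the class of interest:
fake_recall = fake_as_fake / (fake_as_fake + fake_as_real)
fake_precision = fake_as_fake / (fake_as_fake + real_as_fake)

print(f"Test articles: {total}")
print(f"Accuracy: {round(accuracy * 100, 2)}%")
print(f"FAKE recall: {fake_recall:.3f}, FAKE precision: {fake_precision:.3f}")
```

The diagonal entries are the correct predictions, so the model mislabels 51 fake articles as real and 43 real articles as fake.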

6. Making Predictions on New Data:


Now that we have a trained model, we can use it to check whether a new piece of news is real or fake. Below is an example news article for testing.

In [264]:
# Example single news article for testing
news_title ="Scientists Have Discovered a Fountain of Youth in the Amazon Rainforest"
news_text = """In a groundbreaking discovery, scientists claim to have found the mythical Fountain of Youth deep within the Amazon Rainforest. The research team, led by Dr. John Doe, reports that the fountain has the ability to reverse the aging process, making it possible for humans to live for centuries. The team suggests that the water from the fountain could be the key to eternal life and has begun seeking funding to bring the discovery to the public. “This is the most important discovery in human history,” Dr. Doe said during a press conference. However, many experts have raised concerns, calling the findings a hoax and warning that the claims lack credible evidence. No scientific peer-reviewed studies have yet confirmed the discovery."""








In [265]:
# Preprocessing function. The training cells above pass the raw text to the
# vectorizer, and TfidfVectorizer lowercases by default, so this step is optional.
def preprocess_text(text):
    # Example preprocessing: lowercase only
    return text.lower()



In [266]:
# Combine title and text and preprocess
processed_news = preprocess_text(news_title + " " + news_text)



In [267]:
# Transform the text into the same feature representation using the trained TF-IDF vectorizer
news_tfidf = tfidf_vectorizer.transform([processed_news])  # Use the fitted tfidf_vectorizer from training




In [268]:
# Get the predicted class directly; the model was trained on string labels, so this is 'FAKE' or 'REAL'
prediction = pac.predict(news_tfidf)  # Array holding one predicted label


In [269]:
# Extract the single predicted label from the prediction array
prediction_int = prediction[0]  # A string label here ('FAKE' or 'REAL'); converted to an integer in the next cell


In [270]:
# If prediction is string (like 'FAKE', 'REAL'), convert it to numeric labels (0, 1)
if isinstance(prediction_int, str):
    if prediction_int == 'FAKE':
        prediction_int = 0
    elif prediction_int == 'REAL':
        prediction_int = 1

In [271]:
# Decode the numeric label back to 'FAKE' or 'REAL' with a LabelEncoder.
# No encoder was fitted earlier in this notebook, so fit one on the label column here.
# LabelEncoder sorts classes alphabetically, so 'FAKE' maps to 0 and 'REAL' maps to 1.
encoder = LabelEncoder().fit(labels)
prediction_decoded = encoder.inverse_transform([prediction_int])  # Pass a list instead of a single value

In [272]:
# Output the prediction
print(f"News Title: {news_title}")
print(f"News Text: {news_text}")
print(f"Prediction: {prediction_decoded[0]}")  # This will print 'REAL' or 'FAKE'



News Title: Scientists Have Discovered a Fountain of Youth in the Amazon Rainforest
Prediction: FAKE
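
The prediction steps above can also be wrapped in a single helper. The sketch below is one possible shape for it, not the notebook's own code: the function name predict_news and the four-sentence training set are hypothetical, and the toy model is far too small to say anything meaningful about real articles.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

def predict_news(title, text, vectorizer, model):
    """Return the model's label for one article (hypothetical helper)."""
    features = vectorizer.transform([title + " " + text])
    return model.predict(features)[0]

# Tiny hypothetical training set so the sketch runs on its own.
train_texts = [
    "shocking secret miracle cure they hide from you",
    "city council approves annual budget after public vote",
    "aliens secretly control the government insiders claim",
    "central bank holds interest rates steady this quarter",
]
train_labels = ["FAKE", "REAL", "FAKE", "REAL"]

toy_vectorizer = TfidfVectorizer(stop_words="english")
toy_model = PassiveAggressiveClassifier(max_iter=100, random_state=7)
toy_model.fit(toy_vectorizer.fit_transform(train_texts), train_labels)

label = predict_news(
    "Miracle cure found",
    "secret miracle water reverses aging",
    toy_vectorizer,
    toy_model,
)
print(f"Prediction: {label}")
```

Because the model is trained on string labels, predict already returns 'FAKE' or 'REAL', so the helper needs no numeric conversion or LabelEncoder round trip.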
