# TruthLens Prototype

### Objective of the Prototype

The prototype should serve as a minimal viable product demonstrating:
- A basic pipeline for text preprocessing, feature extraction, and classification.
- The binary classification task (real vs. fake news) using a simple machine learning model.
- The potential for explainability by outputting key features or tokens influencing the classification.

This prototype is not expected to achieve high accuracy but should showcase the project’s core structure and provide a baseline for future development.

In [2]:
#import relevant libraries
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from textblob import TextBlob
import spacy
import nltk
from nltk.corpus import stopwords
import string
import csv
import random
import re
import requests
from bs4 import BeautifulSoup

#get english stopwords
nltk.download('stopwords')
nltk.download('punkt')

pd.set_option('display.max_colwidth', 100)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\hazel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hazel\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Phase 1: Binary Classification
The ISOT database has been downloaded from https://onlineacademiccommunity.uvic.ca/isot/2022/11/27/fake-news-detection-datasets/ and the dataset description is available here: https://onlineacademiccommunity.uvic.ca/isot/wp-content/uploads/sites/7295/2023/02/ISOT_Fake_News_Dataset_ReadMe.pdf
### Generate subset of ISOT dataset
For the prototype, a subset of 1000 articles (500 from each class) is used. The articles are randomly selected and loaded into a dataframe with a class label (0 for true news, 1 for fake news). The title and text fields are concatenated into a single column called "content".

In [3]:
#Set the input and output files for the true news articles
ISOT_True_Full = 'ISOT-True.csv'
ISOT_True_Subset = 'ISOT-True-Subset.csv'

#Set the input and output files for the fake news articles
ISOT_Fake_Full = 'ISOT-Fake.csv'
ISOT_Fake_Subset = 'ISOT-Fake-Subset.csv'

#Set the sample amount
sample_amount = 500


def sample_csv(input_file, output_file, sample_amount):
    """
    Reads a CSV file, randomly selects a specified number of rows, and writes them to a new CSV file.

    Args:
        input_file (str): Path to the input CSV file.
        output_file (str): Path to the output CSV file where the sampled rows will be saved.
        sample_amount (int): Number of rows to randomly sample from the input file.

    This function ensures that the CSV is read and written using UTF-8 encoding and handles encoding errors gracefully.
    """
    with open(input_file, 'r', encoding='utf-8', errors='replace') as csvfile:
        reader = list(csv.reader(csvfile))
        headers = reader[0]
        data_rows = reader[1:]
        sampled_rows = random.sample(data_rows, min(sample_amount, len(data_rows)))

    with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile)
        writer.writerow(headers)
        writer.writerows(sampled_rows)

    #print(f"Random sample of {len(sampled_rows)} rows written to {output_file}.")

# Generate a random sample from each class
sample_csv(ISOT_True_Full, ISOT_True_Subset, sample_amount)
sample_csv(ISOT_Fake_Full, ISOT_Fake_Subset, sample_amount)

#load each subset to a dataframe
true_df = pd.read_csv(ISOT_True_Subset)
fake_df = pd.read_csv(ISOT_Fake_Subset)
#add a label for each class - 0 for true news and 1 for fake news
true_df['label'] = 0
fake_df['label'] = 1
#combine the dataframes
data = pd.concat([true_df, fake_df], ignore_index=True)
#shuffle the dataset
data = data.sample(frac=1, random_state=999).reset_index(drop=True)
#combine the title and text columns into a single column
data['content'] = data['title'] + " " + data['text']
#drop unnecessary columns
data = data[['content', 'label']] 
#check we have 500 of each, and print the head
print(data['label'].value_counts())
print(data.head())

1    500
0    500
Name: label, dtype: int64
                                                                                               content  \
0  WOW! BARBARA BUSH Will Be Keynote Speaker For Baby-Killing Planned Parenthood Fundraiser Her fat...   
1  Bid to 'fix' Iran nuclear deal faces uphill climb in U.S. Congress WASHINGTON (Reuters) - Presid...   
2  Suicide attack targets area southeast of Baghdad BAGHDAD (Reuters) - Two attackers shot several ...   
3  Congressman Jim Jordan stops CNN gatekeeper Chris Cuomo on Benghazi cover-up  21st Century Wire ...   
4  Islamic State claims responsibility for blast in Afghan capital, Kabul CAIRO (Reuters) - Islamic...   

   label  
0      1  
1      0  
2      0  
3      1  
4      0  
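
The sampling above draws from the module-level `random` state, so each run selects a different subset. A minimal sketch of a seeded variant follows; the `sample_rows` helper, its `seed` parameter and the in-memory toy CSV are assumptions made for illustration and are not part of the notebook.

```python
import csv
import io
import random

def sample_rows(csv_text, sample_amount, seed):
    """Return the header and a reproducible random sample of data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    headers, data_rows = rows[0], rows[1:]
    # A dedicated Random instance keeps the draw independent of global state
    rng = random.Random(seed)
    sampled = rng.sample(data_rows, min(sample_amount, len(data_rows)))
    return headers, sampled

# Toy stand-in for ISOT-True.csv (hypothetical content)
demo_csv = "title,text\n" + "\n".join(f"title {i},text {i}" for i in range(10))
headers, first = sample_rows(demo_csv, 3, seed=999)
_, second = sample_rows(demo_csv, 3, seed=999)
print(headers, first == second)  # ['title', 'text'] True
```

Using the same seed for every run would make the reported accuracy figures repeatable, at the cost of always evaluating on the same 1000 articles.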


### Replace censored words
I've noticed that the data contains a lot of censored words - for example "f*ck" instead of "fuck", so I am replacing those with the uncensored versions before continuing with preprocessing.

In [4]:
def find_censored_words(dataframe, column_name):
    """
    Finds censored words in the specified column of a DataFrame. 
    Censored words typically include one or more asterisks (*) surrounded by other letters.

    Parameters:
    ----------
    dataframe : pd.DataFrame
        The DataFrame containing the text data.
    column_name : str
        The name of the column to search for censored words.

    Returns:
    -------
    list
        A list of unique censored words found in the specified column.
    """
    # Define the regex pattern for censored words (e.g., words with * surrounded by other letters)
    pattern = r'\b\w*\*\w*\b'
    
    # Combine all text in the specified column
    combined_text = ' '.join(dataframe[column_name].astype(str))
    
    # Find all matches of the pattern
    censored_words = re.findall(pattern, combined_text)
    
    # Return unique matches
    return list(set(censored_words))

# Example usage:
data['content'] = data['content'].str.lower()
censored_words = find_censored_words(data, 'content')
print("Censored Words Found:", censored_words)

Censored Words Found: ['batsh*t', 'p*ssing', 'f*ck', 'c*ck', 'apesh*t', 'n*gger', 'n*ggers', 'f*cking', 'sh*t']


In [5]:
substitutions = {
    'b*tch': 'bitch',
    'f*cks': 'fucks',
    'd*mn': 'damn',
    'sh*t': 'shit',
    'f*ck': 'fuck',
    'n*ggers': 'niggers',
    'sh*tter': 'shitter',
    'clusterf*ck': 'clusterfuck',
    'f*cking': 'fucking',
    'p*ssy':'pussy',
    'sh*tty': 'shitty',
    'motherf*cker': 'motherfucker',
    'scr*wed': 'screwed',
    'a*s': 'ass',
    'h*ll': 'hell',
    'sh*tshow': 'shitshow',
    'f*ckin': 'fucking',
    'bullsh*t': 'bullshit',
    'p*ss': 'piss',
    'p*ssies': 'pussies',
    'f*cker': 'fucker',
    'p*ssygrabber': 'pussygrabber',
    'g*d': 'god',
    'dumbf*ckery': 'dumbfuckery',
    'sh*tfest': 'shitfest',
    'f*cked': 'fucked',
    'p*rn': 'porn',
    'batsh*t': 'batshit',
    'motherf*cking': 'motherfucking',
}

def replace_censored_words(text, substitutions):
    """
    Replaces censored words in the given text based on the substitutions dictionary.

    Parameters:
    ----------
    text : str
        The input text to process.
    substitutions : dict
        A dictionary where keys are censored words and values are their uncensored equivalents.

    Returns:
    -------
    str
        The text with censored words replaced.
    """
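    # Replace longer censored words first so that a short key (e.g. 'p*ss')
    # cannot rewrite part of a longer one (e.g. 'p*ssies') before it is matched
    for censored in sorted(substitutions, key=len, reverse=True):
        text = text.replace(censored, substitutions[censored])
    return text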

#replace the censored words
data['content'] = data['content'].apply(lambda x: replace_censored_words(x, substitutions))
print(data.head())

                                                                                               content  \
0  wow! barbara bush will be keynote speaker for baby-killing planned parenthood fundraiser her fat...   
1  bid to 'fix' iran nuclear deal faces uphill climb in u.s. congress washington (reuters) - presid...   
2  suicide attack targets area southeast of baghdad baghdad (reuters) - two attackers shot several ...   
3  congressman jim jordan stops cnn gatekeeper chris cuomo on benghazi cover-up  21st century wire ...   
4  islamic state claims responsibility for blast in afghan capital, kabul cairo (reuters) - islamic...   

   label  
0      1  
1      0  
2      0  
3      1  
4      0  
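
Some of the censored words found earlier (for example 'c*ck') have no entry in `substitutions`, while others (such as 'apesh*t') are only covered indirectly through a shorter key. The sketch below is one way to list the words that would still contain an asterisk after replacement. The helper name `uncovered_censored_words` and the small example inputs are hypothetical.

```python
def uncovered_censored_words(found_words, substitutions):
    """Return found words that still contain '*' after all substitutions are applied."""
    remaining = []
    for word in found_words:
        cleaned = word
        # Apply longer keys first so short keys cannot pre-empt longer ones
        for censored in sorted(substitutions, key=len, reverse=True):
            cleaned = cleaned.replace(censored, substitutions[censored])
        if '*' in cleaned:
            remaining.append(word)
    return sorted(remaining)

# Small illustrative inputs, not the full lists from the notebook
example_found = ['batsh*t', 'f*ck', 'c*ck', 'apesh*t', 'f*cking', 'sh*t']
example_subs = {'sh*t': 'shit', 'f*ck': 'fuck', 'f*cking': 'fucking', 'batsh*t': 'batshit'}
print(uncovered_censored_words(example_found, example_subs))  # ['c*ck']
```

Running a check like this after each new sample would show which entries need adding to the dictionary.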


### Preprocess text
This step applies basic NLP preprocessing: the text is converted to lowercase, punctuation is removed, the text is split into tokens, and stopwords are dropped. The remaining tokens are joined back into a single string and stored in a new column, "clean_content".

In [6]:
def preprocess_text(text):
    """
        Preprocesses a given text string by applying the following steps:
        1. Converts the text to lowercase.
        2. Removes punctuation marks.
        3. Tokenizes the text into individual words.
        4. Removes stopwords (common words that add little value to classification tasks).

        Parameters:
        ----------
        text : str
            The input text string to preprocess.

        Returns:
        -------
        str
            The cleaned and preprocessed text, with tokens joined back into a single string.
    """
    stop_words = set(stopwords.words('english'))
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    tokens = [word for word in text.split() if word not in stop_words]
    return ' '.join(tokens)

# Apply preprocessing to the combined title and text column
data['clean_content'] = data['content'].apply(preprocess_text)
print(data[['content', 'clean_content']].head())

                                                                                               content  \
0  wow! barbara bush will be keynote speaker for baby-killing planned parenthood fundraiser her fat...   
1  bid to 'fix' iran nuclear deal faces uphill climb in u.s. congress washington (reuters) - presid...   
2  suicide attack targets area southeast of baghdad baghdad (reuters) - two attackers shot several ...   
3  congressman jim jordan stops cnn gatekeeper chris cuomo on benghazi cover-up  21st century wire ...   
4  islamic state claims responsibility for blast in afghan capital, kabul cairo (reuters) - islamic...   

                                                                                         clean_content  
0  wow barbara bush keynote speaker babykilling planned parenthood fundraiser father staunch suppor...  
1  bid fix iran nuclear deal faces uphill climb us congress washington reuters president donald tru...  
2  suicide attack targets area southeast baghdad
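
One side effect of removing punctuation before filtering stopwords is that contractions lose their apostrophe ("don't" becomes "dont") and then no longer match a stopword list that stores them with the apostrophe, as NLTK's English list does. The sketch below compares the two orderings. `STOP_WORDS` is a small hand-picked set for illustration, not the NLTK list, and both function names are hypothetical.

```python
import string

# Small hand-picked stopword set for illustration only (not the full NLTK list)
STOP_WORDS = {"the", "is", "don't", "i"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_then_filter(text):
    """Order used in the notebook: remove punctuation, then drop stopwords."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens if t not in STOP_WORDS]

def filter_then_strip(text):
    """Alternative order: drop stopwords first, then remove punctuation."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    stripped = [t.translate(PUNCT_TABLE) for t in tokens]
    return [t for t in stripped if t]

sample = "I don't think the report is accurate."
print(strip_then_filter(sample))  # ['dont', 'think', 'report', 'accurate']
print(filter_then_strip(sample))  # ['think', 'report', 'accurate']
```

Neither order is strictly better: filtering first misses stopwords that are attached to punctuation (for example "the,"), so a fuller solution would tokenise properly before doing either step.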

### Feature Extraction Using TF-IDF and n-grams

In [7]:
vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X = vectorizer.fit_transform(data['clean_content'])

y = data['label']
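
With `ngram_range=(1, 3)` the vectorizer considers single words, word pairs and word triples as candidate features, and `max_features=5000` keeps only the most frequent of them across the corpus. The plain-Python sketch below shows which candidate terms a short string produces. `word_ngrams` is a hypothetical helper that splits on whitespace only, so it approximates scikit-learn's tokenisation rather than reproducing it.

```python
def word_ngrams(text, ngram_range=(1, 3)):
    """List all word n-grams of the text for n between the given bounds (inclusive)."""
    tokens = text.split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

print(word_ngrams("iran nuclear deal"))
# ['iran', 'nuclear', 'deal', 'iran nuclear', 'nuclear deal', 'iran nuclear deal']
```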

### Split dataset

In [8]:
#retain the indices as we need these for looking up explanations later
train_indices, test_indices = train_test_split(data.index, test_size=0.2, random_state=42)
# Split X and y using the train/test indices
X_train = X[train_indices]
X_test = X[test_indices]
y_train = y.iloc[train_indices]
y_test = y.iloc[test_indices]
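
Note that the vectorizer is fitted on all 1000 articles before the split, so document frequencies from the test articles influence the training features. A common way to avoid this is to fit on the training portion only. The sketch below illustrates the idea in plain Python, using the smoothed IDF formula that scikit-learn documents for `smooth_idf=True`. `fit_idf` and the toy documents are hypothetical.

```python
import math

def fit_idf(train_docs):
    """Compute smoothed IDF weights from training documents only."""
    n_docs = len(train_docs)
    doc_freq = {}
    for doc in train_docs:
        # Count each term once per document
        for term in set(doc.split()):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    # idf(t) = ln((1 + n) / (1 + df(t))) + 1
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

train_docs = ["reuters president trump", "reuters senate vote", "wow video trump"]
test_docs = ["reuters exclusive video"]

idf = fit_idf(train_docs)
# Terms that appear only in the test documents receive no weight at all
unseen = [t for t in test_docs[0].split() if t not in idf]
print(unseen)  # ['exclusive']
```

In scikit-learn the equivalent change would be to split the raw text first, call `fit_transform` on the training text and `transform` on the test text.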

### Logistic Regression

In [9]:
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

Accuracy: 0.965
Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.96      0.97       103
           1       0.96      0.97      0.96        97

    accuracy                           0.96       200
   macro avg       0.96      0.97      0.96       200
weighted avg       0.97      0.96      0.97       200



In [10]:
def explain_prediction(text, top_n=5):
    """
    Explains a prediction by listing the terms that contribute most to it.

    Uses the module-level `model` (the trained logistic regression) and
    `vectorizer` (the fitted TF-IDF vectorizer).

    Parameters:
    ----------
    text : str
        The preprocessed input text to analyze.
    top_n : int
        The maximum number of contributing terms to return.

    Returns:
    -------
    dict
        A dictionary containing the prediction ('label') and the top contributing terms ('features').
    """
    # Transform the text using the vectorizer
    tfidf_text = vectorizer.transform([text])
    # Predict the label
    prediction = model.predict(tfidf_text)[0]
    # Contribution of each term to the decision function: TF-IDF weight x coefficient
    contributions = tfidf_text.toarray()[0] * model.coef_[0]
    # Positive values push towards label 1 (fake) and negative values towards
    # label 0 (real), so flip the sign when the prediction is label 0
    if prediction == 0:
        contributions = -contributions
    feature_names = vectorizer.get_feature_names_out()
    # Take the terms that support the predicted label most strongly
    top_indices = contributions.argsort()[::-1][:top_n]
    top_features = [feature_names[i] for i in top_indices if contributions[i] > 0]

    return {
        "label": prediction,
        "features": top_features
    }

# Create a DataFrame for test data
test_df = pd.DataFrame({
    'text': data.loc[test_indices, 'clean_content'].reset_index(drop=True),
    'true_label': y_test.reset_index(drop=True),
    'predicted_label': y_pred
})

# Row predicted as Real (0)
real_example = test_df[test_df['predicted_label'] == 0].iloc[0]

# Row predicted as Fake (1)
fake_example = test_df[test_df['predicted_label'] == 1].iloc[0]

In [11]:
real_explanation = explain_prediction(real_example['text'])
fake_explanation = explain_prediction(fake_example['text'])

print("Real Example Prediction:")
print("Text:", real_example['text'])
print("Predicted Label:", real_explanation['label'])
print("Top Features:", real_explanation['features'])

print("\nFake Example Prediction:")
print("Text:", fake_example['text'])
print("Predicted Label:", fake_explanation['label'])
print("Top Features:", fake_explanation['features'])


Real Example Prediction:
Text: factbox new zealand 2017 election main parties policies reuters new zealand two main parties neck neck opinion polls appointment charismatic leader boosted opposition labour party threatening governing national party decadelong hold power election sept 23 main parties positions key issues economynew zealand booming economy facing capacity constraints unemployment eightyear low labor shortage noticeably construction threatens curb growth main parties plan fiscally prudent maintain budget surplus would differ monetary policy national plans cut net debt 1015 percent gdp 2025 labour green party plan cut 20 percent gdp within five years taking office labour proposes adding full employment existing inflation mandate reserve bank new zealand economists say could lead easier monetary policy fuel longerterm inflation form government national may dependent new zealand first favors greater currency intervention something new zealand reluctant past national planning 
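
For a linear model such as logistic regression, one common way to attribute a single prediction to individual terms is to multiply each term's TF-IDF weight in the document by the model coefficient for that term. The sketch below shows the arithmetic on made-up numbers. `term_contributions`, the weights and the coefficients are all hypothetical and do not come from the trained model.

```python
def term_contributions(tfidf_weights, coefficients, top_n=3):
    """Rank terms by TF-IDF weight x coefficient (signed contribution to the logit)."""
    contributions = {
        term: weight * coefficients.get(term, 0.0)
        for term, weight in tfidf_weights.items()
    }
    # Largest absolute contribution first
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_n]

# Hypothetical values for one document; positive coefficients push towards label 1 (fake)
tfidf_weights = {"reuters": 0.40, "video": 0.10, "said": 0.30, "watch": 0.20}
coefficients = {"reuters": -2.0, "video": 1.5, "said": -1.0, "watch": 1.0}

for term, value in term_contributions(tfidf_weights, coefficients):
    print(term, round(value, 2))
```

In this toy case the strongest contributions are negative, meaning those terms push the document towards label 0 (real).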

### Support Vector Machines (SVM)

In [12]:
# Train SVM
svm_model = SVC()
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("SVM Accuracy:", accuracy_score(y_test, svm_pred))
print("SVM Classification Report:\n", classification_report(y_test, svm_pred))

SVM Accuracy: 0.975
SVM Classification Report:
               precision    recall  f1-score   support

           0       0.96      0.99      0.98       103
           1       0.99      0.96      0.97        97

    accuracy                           0.97       200
   macro avg       0.98      0.97      0.97       200
weighted avg       0.98      0.97      0.97       200



### Random Forest

In [13]:
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_pred = rf_model.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_pred))
print("Random Forest Classification Report:\n", classification_report(y_test, rf_pred))

Random Forest Accuracy: 0.995
Random Forest Classification Report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00       103
           1       1.00      0.99      0.99        97

    accuracy                           0.99       200
   macro avg       1.00      0.99      0.99       200
weighted avg       1.00      0.99      0.99       200
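
With a test set of 200 articles, the three accuracies differ by only a handful of articles. The sketch below converts each reported accuracy into an error count and an approximate 95% Wilson score interval. The choice of interval and the `wilson_interval` helper are assumptions made for illustration and are not part of the notebook.

```python
import math

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half

# Accuracies reported above on the 200-article test set
reported = {"Logistic Regression": 0.965, "SVM": 0.975, "Random Forest": 0.995}
n_test = 200
for name, acc in reported.items():
    correct = round(acc * n_test)
    low, high = wilson_interval(correct, n_test)
    print(f"{name}: {n_test - correct} errors, 95% interval {low:.3f} to {high:.3f}")
```

The error counts are 7, 5 and 1, and the intervals overlap, so a larger test set or cross-validation would be needed before ranking the models with confidence.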



## Phase 2: Multi-class classification
One of the key challenges in this phase is building the custom dataset. The aim is to scrape 1200 articles, 200 for each of the 7 categories, and clean that data. For the prototype we will manually scrape some articles from The Onion, a well-known satire site. The articles scraped are the ones featured in the 2024 "Annual Year" post found here: https://theonion.com/our-annual-year-2024/. The top 5 from each month have been chosen, giving a total of 55 articles, as December stats are not yet available.

In [14]:
def scrape_onion_article(url):
    """
    Scrapes an article from a given URL on theonion.com and extracts relevant information.

    Parameters:
    ----------
    url : str
        The URL of the article to scrape.

    Returns:
    -------
    dict
        A dictionary containing the extracted article data.
    """
    article_data = {
        "title": "",
        "text": "",
        "site": "",
        "date": "",
        "category": "",
        "class": "Satire", #satire is hardcoded here as we know TheOnion is a satire site
        "url": url
    }

    # Use a timeout so that an unresponsive server cannot hang the scrape
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract the title from the meta property "og:title"
        title_meta = soup.find('meta', property='og:title')
        article_data["title"] = title_meta['content'] if title_meta else "Title not found"
        
        # Use the meta property "og:description" as the article text (this may be only an excerpt of the full article)
        description_meta = soup.find('meta', property='og:description')
        article_data["text"] = description_meta['content'] if description_meta else "Description not found"
        
        # Extract the URL from the meta property "og:url"
        url_meta = soup.find('meta', property='og:url')
        article_data["url"] = url_meta['content'] if url_meta else url  # Fallback to input URL
        
        # Extract the site name from the meta property "og:site_name"
        site_name_meta = soup.find('meta', property='og:site_name')
        article_data["site"] = site_name_meta['content'] if site_name_meta else "Site name not found"
        
        # Extract the published date from the meta property "article:published_time"
        published_date_meta = soup.find('meta', property='article:published_time')
        article_data["date"] = published_date_meta['content'] if published_date_meta else "Published date not found"
        
        # Extract the category (e.g., "Politics")
        category_element = soup.find('div', class_='taxonomy-category')
        category_link = category_element.find('a') if category_element else None
        article_data["category"] = category_link.text.strip() if category_link else "Category not found"
    
    
    else:
        print(f"Failed to fetch the webpage: {url}. Status code: {response.status_code}")
    
    return article_data


def scrape_multiple_onion_articles(urls):
    """
    Scrapes multiple articles from a list of URLs and stores the data in a DataFrame.

    Parameters:
    ----------
    urls : list
        A list of article URLs to scrape.

    Returns:
    -------
    pd.DataFrame
        A DataFrame containing the scraped data from all URLs.
    """
    articles = []
    for url in urls:
        article = scrape_onion_article(url)
        articles.append(article)
    return pd.DataFrame(articles)


# List of URLs to scrape
urls = [
    #January
    "https://theonion.com/biden-addresses-nation-while-hanging-from-branch-on-sid-1851106795/",
    "https://theonion.com/marriage-counselor-sides-with-hotter-spouse-1851143488/",
    "https://theonion.com/wealthy-dad-surprises-child-with-tree-house-he-can-airb-1851112919/",
    "https://theonion.com/glowing-pulsating-hair-product-takes-control-of-gavin-1851160421/",
    "https://theonion.com/gen-z-announces-julie-andrews-is-problematic-but-refuse-1851180352/",
    #February
    "https://theonion.com/mrbeast-announces-he-has-resurrected-everyone-buried-at-1851217565/",
    "https://theonion.com/introverted-cowboy-struggling-to-round-up-posse-1851226175/",
    "https://theonion.com/country-stations-refuse-to-play-beyonce-s-music-after-a-1851261135/",
    "https://theonion.com/stab-him-stab-him-you-cowards-says-terrified-kamal-1851243467/",
    "https://theonion.com/emerging-filmmaker-malia-obama-changes-surname-to-scors-1851278946/",
    #March
    "https://theonion.com/u-s-airdrops-rubble-into-gaza-1851305713/",
    "https://theonion.com/ozempic-maker-triumphantly-announces-new-drug-that-make-1851320436/",
    "https://theonion.com/study-millennial-women-forgoing-dating-apps-in-favor-o-1851338275/",
    "https://theonion.com/beyonce-reveals-new-country-album-cover-featuring-tooth-1851355991/",
    "https://theonion.com/but-dog-likes-fighting-for-money-1851352386/",
    #April
    "https://theonion.com/finance-whiz-has-over-300-in-bank-account-1851375065/",
    "https://theonion.com/sotheby-s-announces-auction-of-napkin-on-which-jeffrey-1851375213/",
    "https://theonion.com/o-j-simpson-allowed-to-remain-living-after-coffin-does-1851403804/",
    "https://theonion.com/travis-kelce-impresses-coachella-crowd-by-tossing-taylo-1851410856/",
    "https://theonion.com/biden-carried-away-by-ants-1851422363/",
    #May
    "https://theonion.com/tesla-lays-off-entire-team-behind-brakes-1851449223/",
    "https://theonion.com/drake-drops-new-track-inviting-kendrick-lamar-out-to-co-1851458534/",
    "https://theonion.com/perdue-announces-initiative-to-even-the-playing-field-b-1851423157/",
    "https://theonion.com/new-florida-law-requires-all-women-to-produce-3-healthy-1851482288/",
    "https://theonion.com/everyone-in-er-bit-off-finger-while-holding-sandwich-1851488798/",
    #June
    "https://theonion.com/cult-leader-not-even-charismatic-1851512851/",
    "https://theonion.com/embarrassed-david-attenborough-realizes-he-spent-10-min-1851512951/",
    "https://theonion.com/newest-u-s-aid-mission-just-single-powerbar-labeled-f-1851540802/",
    "https://theonion.com/report-every-place-on-earth-has-wrong-amount-of-water-1851544516/",
    "https://theonion.com/nasa-warns-space-hawk-has-swooped-in-and-picked-up-eart-1851544578/",
    #July
    "https://theonion.com/clarence-thomas-torn-over-case-where-both-sides-offer-c-1851566812/",
    "https://theonion.com/democrats-panic-after-kamala-harris-ages-40-years-in-si-1851601473/",
    "https://theonion.com/congress-bans-roofs-1851592883/",
    "https://theonion.com/news-happening-faster-than-man-can-generate-uninformed-1851601466/",
    "https://theonion.com/god-forced-to-shave-head-after-contracting-plague-of-li-1851580149/",
    #August
    "https://theonion.com/environmentalists-warn-u-s-running-out-of-small-wooded-1851609190/",
    "https://theonion.com/r-kelly-petitions-supreme-court-to-watch-him-pee-1851619802rev1723482404693/",
    "https://theonion.com/federated-union-of-bear-cub-carcass-dumpers-endorses-rf-1851613425/",
    "https://theonion.com/glen-powell-opens-up-about-dangerous-stunt-work-filming-with-sydney-sweeneys-breasts/",
    "https://theonion.com/j-d-vance-accuses-tim-walz-of-stolen-valor-for-wearing-1851621120/",
    #September
    "https://theonion.com/everyone-in-restaurant-jealous-of-toddler-who-gets-to-wear-pajamas-and-watch-ipad/",
    "https://theonion.com/horrified-taylor-swift-realizes-football-happens-every-year/",
    "https://theonion.com/trump-avoids-answering-hard-questions-by-pretending-he-shot-in-ear-again/",
    "https://theonion.com/man-replies-stop-to-political-fundraiser-text-like-powerful-wizard-casting-spell-to-ward-off-mythical-beast/",
    "https://theonion.com/scarecrow-has-double-ds/",
    #October
    "https://theonion.com/the-onion-officially-endorses-joe-biden-for-president/",
    "https://theonion.com/texas-sex-ed-class-teaches-boys-how-to-cheat-on-pregnant-wife/",
    "https://theonion.com/sabrina-carpenter-completes-mandatory-service-in-south-korean-military/",
    "https://theonion.com/north-carolina-family-informed-their-insurance-policy-voided-once-house-gets-wet/",
    "https://theonion.com/grandma-who-survived-great-depression-casually-drops-that-she-once-killed-man-for-mayonnaise/",
    #November
    "https://theonion.com/piss-soaked-tucker-carlson-claims-demon-urinated-on-him-while-he-slept/",
    "https://theonion.com/trump-calls-harris-to-congratulate-himself-on-winning/",
    "https://theonion.com/america-defeats-america/",
    "https://theonion.com/man-forgetting-difference-between-meteoroid-meteorite-struggles-to-describe-what-just-killed-his-dog/",
    "https://theonion.com/every-movement-in-mans-burrito-eating-technique-informed-by-past-burrito-tragedies/"
]

# Scrape articles and create a DataFrame
custom_data_df = scrape_multiple_onion_articles(urls)
# Store to CSV
custom_data_df.to_csv("onion_scraped_articles.csv", index=False)
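
The meta-tag extraction can be checked offline, without sending requests to the site, by running it against a static HTML string. The sketch below does this with the standard library's `html.parser` instead of BeautifulSoup, so it illustrates the same lookup-with-fallback logic rather than the notebook's exact code. `MetaPropertyParser` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser

class MetaPropertyParser(HTMLParser):
    """Collects <meta property="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.properties = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = dict(attrs)
        prop, content = attr_map.get("property"), attr_map.get("content")
        if prop and content is not None:
            # Keep the first occurrence of each property
            self.properties.setdefault(prop, content)

# Hypothetical, hand-written HTML standing in for a downloaded page
sample_html = """
<html><head>
<meta property="og:title" content="Example Headline">
<meta property="og:site_name" content="Example Site">
</head><body></body></html>
"""

parser = MetaPropertyParser()
parser.feed(sample_html)
print(parser.properties.get("og:title", "Title not found"))
print(parser.properties.get("article:published_time", "Published date not found"))
```

The first lookup prints the headline and the second falls back to the placeholder string, mirroring the fallbacks in `scrape_onion_article`.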

### Preprocess data

In [15]:
#combine the title and text columns into a single column
custom_data_df['content'] = custom_data_df['title'] + " " + custom_data_df['text']
#drop unnecessary columns
custom_data_df = custom_data_df[['content', 'class','category']] 
#check the breakdown of categories
print(custom_data_df['category'].value_counts())
print(custom_data_df.head())

News             19
Local            12
Entertainment    12
Politics         11
Editorials        1
Name: category, dtype: int64
                                                                                               content  \
0  Biden Addresses Nation While Hanging From Branch On Side Of Cliff WASHINGTON—Using his platform ...   
1  Marriage Counselor Sides With Hotter Spouse ANCHORAGE, AK—Stating that she had heard both perspe...   
2  Wealthy Dad Surprises Child With Tree House He Can Airbnb For Passive Income WILMETTE, IL—Tellin...   
3  Glowing, Pulsating Hair Product Takes Control Of Gavin Newsom’s Thoughts SACRAMENTO, CA—As an ot...   
4  Gen Z Announces Julie Andrews Is Problematic But Refuses To Explain Why ​​NEW YORK—Standing befo...   

    class       category  
0  Satire       Politics  
1  Satire          Local  
2  Satire          Local  
3  Satire       Politics  
4  Satire  Entertainment  


In [16]:
#apply preprocessing
custom_data_df['clean_content'] = custom_data_df['content'].apply(preprocess_text)
print(custom_data_df[['content', 'clean_content']].head())

                                                                                               content  \
0  Biden Addresses Nation While Hanging From Branch On Side Of Cliff WASHINGTON—Using his platform ...   
1  Marriage Counselor Sides With Hotter Spouse ANCHORAGE, AK—Stating that she had heard both perspe...   
2  Wealthy Dad Surprises Child With Tree House He Can Airbnb For Passive Income WILMETTE, IL—Tellin...   
3  Glowing, Pulsating Hair Product Takes Control Of Gavin Newsom’s Thoughts SACRAMENTO, CA—As an ot...   
4  Gen Z Announces Julie Andrews Is Problematic But Refuses To Explain Why ​​NEW YORK—Standing befo...   

                                                                                         clean_content  
0  biden addresses nation hanging branch side cliff washington—using platform plead americans lend ...  
1  marriage counselor sides hotter spouse anchorage ak—stating heard perspectives could understand ...  
2  wealthy dad surprises child tree house airbnb
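
The cleaned text above still contains em dashes and curly apostrophes ("washington—using", "newsom’s") because `string.punctuation` covers ASCII punctuation only. One possible fix, sketched below, is to treat every character in a Unicode punctuation category as a separator. `strip_unicode_punctuation` is a hypothetical helper and is not part of the notebook.

```python
import unicodedata

def strip_unicode_punctuation(text):
    """Replace every Unicode punctuation character with a space, then collapse whitespace."""
    cleaned = ''.join(
        # Categories starting with 'P' cover dashes, quotes and other punctuation;
        # symbols such as '$' belong to the 'S' categories and are left untouched
        ' ' if unicodedata.category(ch).startswith('P') else ch
        for ch in text
    )
    return ' '.join(cleaned.split())

# \u2014 is an em dash and \u2019 is a right single quotation mark
print(strip_unicode_punctuation("washington\u2014using newsom\u2019s platform"))
# washington using newsom s platform
```

Replacing punctuation with a space (rather than deleting it, as `preprocess_text` does) keeps words on either side of a dash separate, but it also splits possessives into two tokens, so the stray "s" would need to be filtered afterwards.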

### Sentiment analysis

In [21]:
def get_sentiment(text):
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

custom_data_df[['polarity', 'subjectivity']] = custom_data_df['clean_content'].apply(lambda x: pd.Series(get_sentiment(x)))
print(custom_data_df[['clean_content','polarity', 'subjectivity']].head())

                                                                                         clean_content  \
0  biden addresses nation hanging branch side cliff washington—using platform plead americans lend ...   
1  marriage counselor sides hotter spouse anchorage ak—stating heard perspectives could understand ...   
2  wealthy dad surprises child tree house airbnb passive income wilmette il—telling child peek walk...   
3  glowing pulsating hair product takes control gavin newsom’s thoughts sacramento ca—as otherworld...   
4  gen z announces julie andrews problematic refuses explain ​​new york—standing crowd millennials ...   

   polarity  subjectivity  
0  0.010714      0.592857  
1  0.190476      0.433333  
2  0.156618      0.476471  
3  0.105000      0.577500  
4  0.400000      0.400000  


### Named Entity Recognition

In [20]:
nlp = spacy.load("en_core_web_sm")

def extract_entities(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents]

custom_data_df['entities'] = custom_data_df['content'].apply(extract_entities)
print(custom_data_df[['clean_content','entities']].head())

                                                                                         clean_content  \
0  biden addresses nation hanging branch side cliff washington—using platform plead americans lend ...   
1  marriage counselor sides hotter spouse anchorage ak—stating heard perspectives could understand ...   
2  wealthy dad surprises child tree house airbnb passive income wilmette il—telling child peek walk...   
3  glowing pulsating hair product takes control gavin newsom’s thoughts sacramento ca—as otherworld...   
4  gen z announces julie andrews problematic refuses explain ​​new york—standing crowd millennials ...   

                                                                                              entities  
0  [Biden Addresses Nation While Hanging From Branch On Side Of Cliff WASHINGTON, Americans, Joe Bi...  
1  [Spouse ANCHORAGE, AK, Laurie Hartford, David, Julia Carter, David, at least two, half, six, her...  
2                            [IL, Kenneth Schwei