## Gender Bias Headline Classification Modeling as a Foundation for News Objectivity

Madeline F. Birch | November 2023 | Flatiron School Data Science Program | Final Capstone Project

# Contents
1. [Project Overview](#Project_Overview)
2. [The Dataset](#The_Dataset)
3. [Exploratory Data Analysis](#EDA)
4. [Feature Engineering (and more EDA)](#Feature_Engineering)
5. [Preprocessing](#Preprocessing)
6. [Train Test Split](#Train_Test_Split)




# Project Overview<a id='Project_Overview'></a>

In an era where information shapes perspectives, the media plays a pivotal role in influencing societal narratives. Understanding the subtle nuances and potential biases embedded in headlines is crucial, and this project aims to shed light on the degree of bias present in headlines. Focusing on data classification through machine learning, the project seeks to classify headlines into three classes: No Bias, Low Bias, and High Bias. The objective is not to scrutinize sensationalism or analyze sentiment polarity but to leverage textual and numerical features to predict the bias level accurately.

<img src="Images/pbs_logo.png" alt="PBS Logo" width="400"/>

### Our Stakeholder: PBS News
The significance of this project lies in its potential impact on journalism's objectivity, particularly for PBS News, a revered American news source known for its impartiality and lack of apparent agenda. With 15% of its funding coming from public sources, the organization continuously faces accusations of general bias and [threats of defunding from various actors](https://www.nytimes.com/2011/02/28/business/media/28cpb.html). By adopting insights gained from our efforts, PBS News can better counter these accusations and reinforce its value as a trusted, objective news source.

### Our Vision: Gender Bias in Headlines as a Framework
We chose gender bias as the focus of this project because it is one of the most prevalent forms of bias in published news content. A 2021 topic modeling [study](https://www.frontiersin.org/articles/10.3389/frai.2021.664737/full) found that women are, unfortunately but not unexpectedly, mentioned "more frequently in topics related to lifestyle, healthcare, and crimes and sexual assault." Another 2021 natural language processing [study](https://doi.org/10.1371/journal.pone.0245533) concluded, "although we see a certain tokenism in having female voices present in the news, their voices are drowned out by the overwhelming number of times that we hear from men, often from just a handful of men." There's no denying that gender bias is present in news articles themselves, but what about the content of headlines? Couldn't headlines *about* women be biased, too?

At the heart of this initiative is the recognition of headlines as powerful agents that shape our perceptions. These succinct phrases captivate our attention and mold our subconscious understanding of entire articles. The urgency to prove and maintain neutrality, especially on a platform as eminent as PBS, underscores the relevance of our undertaking.

While our primary focus is on demonstrating which ML algorithms are most adept at detecting gender bias in headlines, this project could serve as a framework for ongoing assessments of headlines across diverse bias types, including political, racial, LGBTQ+, socioeconomic class, and beyond. The broader vision is to contribute to a media landscape characterized by transparency, objectivity, and accountability, fostering a public discourse grounded in fair and unbiased reporting.

# The Dataset<a id='The_Dataset'></a>

The dataset used in this project originates from a comprehensive collection of data scraped for ["When Women Make Headlines,"](https://pudding.cool/2022/02/women-in-headlines/) a visual essay published by [*The Pudding*](https://pudding.cool/) in June 2022. Released by [Amber Thomas](https://data.world/amberthomas) on [data.world](https://data.world/the-pudding/women-in-headlines), this dataset is one of the few open source datasets we could find that investigates gender bias specifically in headlines. It encompasses a diverse array of headlines about women, each annotated with corresponding bias scores. Bias scores were calculated following the methodology outlined in ["Proposed Taxonomy for Gender Bias in Text; A Filtering Methodology for the Gender Generalization Subtype."](https://aclanthology.org/W19-3802.pdf)

The dataset's origin in a visual essay adds an element of real-world applicability, grounding the project in the practical considerations of media consumption and perception. Its richness lies in its combination of textual and numerical features associated with each headline. The text data provides the linguistic context of the headlines, while numerical features offer additional dimensions for analysis. This holistic approach enables the development of a machine learning model that can discern patterns beyond linguistic constructs, contributing to a nuanced understanding of what contributes to bias.

### Our Working File
The dataset contains a multitude of `.csv` files; for the sake of simplicity, we will be working exclusively with `headlines.csv` and engineering additional numerical features to bolster our models.

### Our Target
Our target variable will be `bias`.

# Exploratory Data Analysis<a id='EDA'></a>

Let's begin our EDA by:
1) importing the libraries, packages, and modules needed for exploring and visualizing our data,
2) loading and inspecting our `headlines.csv` file as a DataFrame,
3) seeing if any irrelevant columns need dropping,
4) getting the shape of our DataFrame,
5) getting primary statistics, and
6) inspecting the distribution of our target variable `bias`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
import datetime
from nltk.corpus import stopwords
import gensim
import gensim.downloader
from gensim.utils import simple_preprocess

In [None]:
# Loading 'headlines.csv' into a Pandas DataFrame
headlines = pd.read_csv('Data/headlines.csv')

# Showing first 5 rows of the DataFrame
headlines.head()

In [None]:
# Dropping irrelevant columns 'url', 'Unnamed: 0', and 'index'
headlines = headlines.drop(columns=['url', 'Unnamed: 0', 'index'])

In [None]:
# Getting the shape of our DataFrame
headlines.shape

In [None]:
# Getting primary statistics
headlines.describe()

In [None]:
# Getting value counts for bias feature
headlines['bias'].value_counts()

We see above that our target, though numerical in nature, is ultimately categorical, as there are only 6 distinct bias scores. We'll deal with that further along in data processing. We also see that it is a highly imbalanced feature. Let's plot to confirm:

In [None]:
# Plotting bias distribution

# Setting Seaborn style
sns.set(style="whitegrid")

# Plotting histogram of 'bias'
plt.figure(figsize=(10, 6))
sns.histplot(headlines['bias'], bins=20, kde=True, color='blue')

# Styling the plot
plt.title('Distribution of Bias Scores', fontsize=16)
plt.xlabel('Bias Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Showing the plot
plt.tight_layout()
plt.show()

# Feature Engineering (and more EDA) <a id='Feature_Engineering'></a>

Since our DataFrame contains mostly text features, let's introduce a numerical feature indicating the sentiment polarity of each headline. To accomplish this, we employ a sentiment analyzer using the SentimentIntensityAnalyzer from the Natural Language Toolkit (nltk) library. The goal is to capture the overall sentiment expressed in each headline.

- We create a sentiment analyzer object, `sid`, using the SentimentIntensityAnalyzer.
- The analyzer is then applied to each headline in the `headline_no_site` column.
- The resulting compound score, indicative of the overall sentiment, is stored in the new `sentiment_polarity` feature.
- We then get value counts and min/max values, and plot the distribution of the feature.

*Please note that the sentiment analysis process may take some time to run, as we have a very large dataset.*

In [None]:
# Creating feature 'sentiment_polarity'

# Creating a sentiment analyzer
sid = SentimentIntensityAnalyzer()

# Applying the sentiment analyzer to each headline and storing the compound score - this takes a while to run
headlines['sentiment_polarity'] = headlines['headline_no_site'].apply(lambda x: sid.polarity_scores(x)['compound'])

In [None]:
# Getting distribution of sentiment_polarity
headlines['sentiment_polarity'].value_counts()

In [None]:
# Printing min and max values for sentiment polarity
print(headlines['sentiment_polarity'].min())
print(headlines['sentiment_polarity'].max())

In [None]:
# Plotting dist of sentiment polarity

# Setting Seaborn style
sns.set(style="whitegrid")

# Plotting histogram of 'sentiment_polarity'
plt.figure(figsize=(10, 6))
sns.histplot(headlines['sentiment_polarity'], bins=20, kde=True, color='green')

# Styling the plot
plt.title('Distribution of Sentiment Polarity Scores', fontsize=16)
plt.xlabel('Sentiment Polarity Score', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Showing the plot
plt.tight_layout()
plt.show()

Another highly skewed feature. Because `sentiment_polarity` is an input feature rather than our target, its skew is less of a concern than the class imbalance in `bias`, and we leave it untransformed. Class imbalance in the target is a separate issue, which we address with class weights before modeling.

Let's move on with some more feature engineering, this time extracting temporal data:

- **Day of the Week and Month:**
  - We create two new features, `Day_of_Week` and `Month`, by extracting information from the `time` column. `Day_of_Week` indicates the day on which the headline was published, while `Month` represents the numerical month.

- **Hour of the Day:**
  - Another temporal feature, `Hour_of_Day`, is engineered to capture the specific hour of publication. This fine-grained detail could reveal patterns related to the time of day.

- **Publication Year:**
  - The year is extracted from the `time` column to create a new feature, `Publication_Year`. This allows us to analyze trends and biases over different years.

The `time` column is converted to datetime format with `errors='coerce'`, so any value that cannot be parsed becomes `NaT` instead of raising an exception.

In [None]:
# Converting 'time' column to datetime format once; unparseable values become NaT
headlines['time'] = pd.to_datetime(headlines['time'], errors='coerce')

# Engineering 'Day of the Week' and 'Month' features
headlines['Day_of_Week'] = headlines['time'].dt.day_name()
headlines['Month'] = headlines['time'].dt.month

# Engineering 'Hour of Day' feature
headlines['Hour_of_Day'] = headlines['time'].dt.hour

# Extracting the year and creating a new 'Publication Year' feature
headlines['Publication_Year'] = headlines['time'].dt.year

In [None]:
# Dropping time column
headlines = headlines.drop(columns=['time'])
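
To make the effect of `errors='coerce'` concrete, here is a minimal sketch on a small made-up Series. The timestamps are illustrative assumptions and are not taken from `headlines.csv`. An unparseable entry becomes `NaT`, and the features derived from it come out as missing values rather than raising an error.

```python
import pandas as pd

# Hypothetical timestamps; the second entry is deliberately malformed
raw_times = pd.Series(["2021-03-15 14:30:00", "not a date", "2019-12-01 08:05:00"])

# errors='coerce' replaces values that cannot be parsed with NaT
parsed = pd.to_datetime(raw_times, errors="coerce")

# The .dt accessor propagates NaT as a missing value in each derived feature
temporal_features = pd.DataFrame({
    "Day_of_Week": parsed.dt.day_name(),
    "Month": parsed.dt.month,
    "Hour_of_Day": parsed.dt.hour,
    "Publication_Year": parsed.dt.year,
})

print(temporal_features)
```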

Now, to get word count and text length values for each headline:

- **Word Count:**
  - We create a `Word_Count` feature by applying a lambda function to calculate the number of words in each headline. This feature provides insight into the lexical richness and complexity of the headlines.

- **Text Length:**
  - A `Text_Length` feature is engineered by computing the length of each headline using the `len` function. This feature captures the overall character count, offering an additional perspective on the headline's brevity or verbosity.

In [None]:
# Creating word count feature
headlines['Word_Count'] = headlines['headline_no_site'].apply(lambda x: len(x.split()))

# Creating text length feature
headlines['Text_Length'] = headlines['headline_no_site'].apply(len)

In [None]:
# Inspecting new head
headlines.head()

Moving on to `site` and `country`. Let's check value counts for distribution:

In [None]:
# Getting value counts for site
headlines['site'].value_counts()

In [None]:
# Getting value counts for country
headlines['country'].value_counts()

To manage the size of our dataset, we apply a threshold to consider only news sites with a substantial volume of headlines.

- We set a threshold of at least 5000 headlines for inclusion.
- We identify the top news sites meeting or exceeding this threshold using the `site` column.
- A new DataFrame, `headlines_filtered`, is created to contain only the headlines from the selected news sites.

In [None]:
# Setting a threshold for news sites with at least 5000 headlines
min_headlines_threshold = 5000
top_sites = headlines['site'].value_counts()
top_sites = top_sites[top_sites >= min_headlines_threshold].index

# Creating a new dataframe with only the sites with at least 5000 headlines
headlines_filtered = headlines[headlines['site'].isin(top_sites)].copy()

In [None]:
# Getting shape of new DataFrame
headlines_filtered.shape

We'll now create a Series, `top_10_sites`, by applying `.nlargest` to the site value counts to get the 10 sites with the most headlines. Then, we plot the distribution of these top 10 news sources in a pie chart:

In [None]:
# Getting top 10 sites with most headlines
top_10_sites = headlines_filtered['site'].value_counts().nlargest(10)

# Creating a pie chart
plt.figure(figsize=(8, 8))
plt.pie(top_10_sites, labels=top_10_sites.index, autopct='%1.1f%%', colors=sns.color_palette('viridis', n_colors=len(top_10_sites)), startangle=90)
plt.title('Top 10 News Sources Distribution')
plt.tight_layout()

# Showing the plot
plt.show()

Now, let's plot the distribution of countries:

In [None]:
# Getting distribution of countries
plt.figure(figsize=(8, 6))
sns.countplot(x='country', data=headlines_filtered, palette='viridis')
plt.title('Distribution of Countries')
plt.xlabel('Country')
plt.ylabel('Number of Headlines')
plt.tight_layout()

# Showing the plot
plt.show()

Let's continue our EDA of engineered features like `Word_Count` and `Text_Length` by plotting histograms of each:

In [None]:
# Set the style for seaborn
sns.set(style="whitegrid")

# Plotting the distribution
plt.figure(figsize=(10, 6))
sns.histplot(headlines_filtered['Word_Count'], bins=30, color='skyblue', kde=False)
plt.title('Distribution of Word Count in Headlines')
plt.xlabel('Word_Count')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [None]:
# Set the style for seaborn
sns.set(style="whitegrid")

# Plotting the distribution
plt.figure(figsize=(10, 6))
sns.histplot(headlines_filtered['Text_Length'], bins=30, color='skyblue', kde=False)
plt.title('Distribution of Text Length in Headlines')
plt.xlabel('Text_Length')
plt.ylabel('Frequency')

# Show the plot
plt.show()

In [None]:
# Getting distribution of bias in headlines_filtered
headlines_filtered['bias'].value_counts()

It looks like there are few instances of headlines with bias scores of 0.666667 and 0.833333. We will drop them:

In [None]:
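# Setting values to drop (as displayed by value_counts, i.e. rounded to six decimals)
values_to_drop = [0.666667, 0.833333]

# Dropping rows whose 'bias' matches one of these values; np.isclose tolerates the
# rounding, whereas an exact comparison could miss unrounded values such as 2/3
drop_mask = headlines_filtered['bias'].apply(lambda v: np.isclose(v, values_to_drop, atol=1e-5).any())
headlines_filtered = headlines_filtered[~drop_mask]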
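
One caveat when matching float scores against the values shown by `value_counts()`: pandas displays floats rounded to six decimals, so a stored score of 2/3 prints as 0.666667 without being equal to it. The sketch below uses made-up scores, under the unverified assumption that the bias scores are stored as unrounded multiples of 1/6, and contrasts exact matching with a tolerant comparison via `np.isclose`.

```python
import numpy as np
import pandas as pd

# Hypothetical scores stored as unrounded fractions (assumption: multiples of 1/6)
scores = pd.Series([0.0, 1/6, 1/3, 0.5, 2/3, 5/6])
rounded_targets = [0.666667, 0.833333]

# Exact matching: 2/3 is not equal to 0.666667, so no row is flagged
exact_match = scores.isin(rounded_targets)

# Tolerant matching: values within 1e-5 of a target are flagged
close_match = scores.apply(lambda v: np.isclose(v, rounded_targets, atol=1e-5).any())

print("Rows flagged by exact match:", exact_match.sum())
print("Rows flagged by tolerant match:", close_match.sum())
```

If the scores in the file are already stored in rounded form, both approaches flag the same rows.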

We will now introduce a new feature, `bias_category`, by categorizing the `bias` values into distinct levels. The categorization is based on predefined conditions:

- **No Bias:**
  - Values equal to 0.0 are categorized as having no bias.

- **Low Bias:**
  - Values between 0.1 and 0.2 (inclusive) are considered to indicate low bias.

- **High Bias:**
  - Values between 0.3 and 0.5 (inclusive) fall into the high bias category.

The `np.select` function efficiently applies these conditions, allowing us to create a categorical feature.


In [None]:
# Defining conditions
conditions = [
    headlines_filtered['bias'].between(0.000000, 0.000000, inclusive='both'),
    headlines_filtered['bias'].between(0.1, 0.2, inclusive='both'),
    headlines_filtered['bias'].between(0.3, 0.5, inclusive='both'),
]

# Setting category labels
labels = ['No Bias', 'Low Bias', 'High Bias']

# Applying conditions
headlines_filtered['bias_category'] = np.select(conditions, labels, default=None)

In [None]:
# Getting unique bias category values
headlines_filtered['bias_category'].unique()
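
The behavior of `np.select` with `default=None` can be seen on a small made-up Series. The scores below are illustrative assumptions, not values read from the dataset. Any score that satisfies none of the conditions falls through to the default and is labeled `None`.

```python
import numpy as np
import pandas as pd

# Hypothetical bias scores; the last one lies outside all three ranges
bias = pd.Series([0.0, 1/6, 1/3, 0.5, 2/3])

# Same conditions and labels as in the categorization described above
conditions = [
    bias.between(0.0, 0.0, inclusive='both'),
    bias.between(0.1, 0.2, inclusive='both'),
    bias.between(0.3, 0.5, inclusive='both'),
]
labels = ['No Bias', 'Low Bias', 'High Bias']

# Rows matching no condition receive the default value, None
categories = np.select(conditions, labels, default=None)

print(list(categories))
```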

Any score that falls outside the three ranges is labeled `None` by the `default` argument; if such values appear above, we will deal with them shortly. First, let's inspect our new head and drop the original `bias` column:

In [None]:
# Showing new head
headlines_filtered.head()

In [None]:
# Dropping original bias column
headlines_filtered = headlines_filtered.drop(columns=['bias'])

We continue our EDA by plotting some key feature interactions that might be interesting to see:

In [None]:
# Plotting the distribution of bias_category by Publication_Year

# Setting the style for seaborn
sns.set(style="whitegrid")

# Setting fig size, plotting feature interaction, setting axes titles and labels
plt.figure(figsize=(12, 8))
sns.countplot(x="Publication_Year", hue="bias_category", data=headlines_filtered)
plt.title('Distribution of Bias Category by Publication_Year')
plt.xlabel('Publication Year')
plt.ylabel('Count')

# Showing the plot
plt.show()

In [None]:
# Plotting a scatter plot of sentiment_polarity vs. bias_category, colored by category

# Setting a Seaborn style
sns.set(style="whitegrid")

# Setting fig size, creating plot of feature interaction
plt.figure(figsize=(12, 6))
scatter = sns.scatterplot(x='sentiment_polarity', y='bias_category', data=headlines_filtered, hue='bias_category', palette='viridis', size=3)

# Styling the plot
plt.title('Distribution of Sentiment Polarity for Different Bias Categories', fontsize=16)
plt.xlabel('Sentiment Polarity', fontsize=12)
plt.ylabel('Bias Category', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
plt.grid(axis='both', linestyle='--', alpha=0.7)

# Showing the plot
plt.tight_layout()
plt.show()


# Preprocessing<a id='Preprocessing'></a>

To enhance the effectiveness of our machine learning model, we employ one-hot encoding for our many categorical features. This process involves:

- **Selecting Categorical Columns:**
  - We identify the columns eligible for one-hot encoding: `site`, `country`, `Day_of_Week`, `Month`, `Hour_of_Day`, and `Publication_Year`.

- **Creating One-Hot Encoded Columns:**
  - Using the `pd.get_dummies` function, we convert the selected categorical columns into binary-encoded columns with 1s and 0s, dropping the first category to avoid multicollinearity.

- **Concatenating with the Original DataFrame:**
  - The one-hot encoded columns are concatenated with the original DataFrame, `headlines_filtered`, creating a more feature-rich dataset.

- **Dropping Original Categorical Columns:**
  - The original categorical columns are dropped from the DataFrame, leaving behind the one-hot encoded representations.

This process enhances the model's ability to interpret and learn from categorical information, contributing to a more robust analysis.

In [None]:
# Selecting the categorical columns to one-hot encode
categorical_columns = ['site', 'country', 'Day_of_Week', 'Month', 'Hour_of_Day', 'Publication_Year']

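# Creating one-hot encoded columns with 1s and 0s; passing `columns` explicitly
# ensures that numeric columns such as 'Month' are encoded too
one_hot_encoded = pd.get_dummies(headlines_filtered[categorical_columns], columns=categorical_columns, drop_first=True, dtype=int)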

# Concatenating the one-hot encoded columns with the original DataFrame
headlines_filtered_encoded = pd.concat([headlines_filtered, one_hot_encoded], axis=1)

# Dropping the original categorical columns
headlines_filtered_encoded.drop(categorical_columns, axis=1, inplace=True)

# Displaying first row of resulting DataFrame
headlines_filtered_encoded.head(1)
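
One detail of `pd.get_dummies` is worth illustrating: when `columns` is not specified, only object, string, and category columns are encoded, and numeric columns pass through unchanged. The toy DataFrame below is a made-up assumption, not part of the dataset, and shows the difference for a numeric column like `Month`.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column
toy = pd.DataFrame({"site": ["a", "b", "a"], "Month": [1, 2, 3]})

# Default behavior: only the object column 'site' is one-hot encoded
default_encoding = pd.get_dummies(toy, drop_first=True, dtype=int)

# Explicit columns: the numeric 'Month' column is one-hot encoded as well
explicit_encoding = pd.get_dummies(toy, columns=["site", "Month"], drop_first=True, dtype=int)

print(default_encoding.columns.tolist())
print(explicit_encoding.columns.tolist())
```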

In [None]:
# Renaming the 'headline_no_site' column to 'headlines'
headlines_filtered_encoded.rename(columns={'headline_no_site': 'headlines'}, inplace=True)

headlines_filtered_encoded['headlines'].head()

To preprocess our text feature of note, `headlines`, we will need to import the following packages and modules:

In [None]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
from nltk.corpus import stopwords
import gensim
import gensim.downloader
from gensim.utils import simple_preprocess

We now perform essential text preprocessing steps to enhance the quality of our textual data for machine learning analysis. The process includes:

- **Tokenization:**
  - Headline text is tokenized using the `word_tokenize` function, breaking down sentences into individual words.

- **Cleaning Text:**
  - Non-alphanumeric characters are removed, empty strings are handled, and extra spaces are addressed. This ensures a cleaner representation of the text.

- **Converting to Lowercase:**
  - All tokens are converted to lowercase, providing a standardized format for analysis.

- **Lemmatization:**
  - Lemmatization is the process of reducing words in the headline text to their base or root form.
  - It is applied using the WordNet lemmatizer from the Natural Language Toolkit (nltk), aiding in feature extraction and improving the model's ability to discern patterns.

Please note that the lemmatization process may take a minute or two to run, considering its comprehensive nature.

In [None]:
# Tokenize, Clean, and Lemmatize Text

# Tokenizing the headline text
headlines_filtered_encoded['tokenized_text'] = headlines_filtered_encoded['headlines'].apply(word_tokenize)

# Removing non-alphanumeric characters, handling empty strings and extra spaces
headlines_filtered_encoded['cleaned_text'] = headlines_filtered_encoded['tokenized_text'].apply(lambda tokens: [re.sub(r'[^a-zA-Z0-9]', '', token).strip() for token in tokens if re.sub(r'[^a-zA-Z0-9]', '', token).strip()])

# Converting to lowercase
headlines_filtered_encoded['cleaned_text'] = headlines_filtered_encoded['cleaned_text'].apply(lambda tokens: [token.lower() for token in tokens])

# Lemmatization - this takes a minute or two to run
lemmatizer = WordNetLemmatizer()
headlines_filtered_encoded['lemmatized_text'] = headlines_filtered_encoded['cleaned_text'].apply(lambda tokens: [lemmatizer.lemmatize(token) for token in tokens])
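
The cleaning and lowercasing steps can be sketched on a single hand-written token list. The helper name `clean_tokens` and the sample tokens are hypothetical; the regular expression is the one used above. Tokenization and lemmatization are left out here because they need nltk resources to be downloaded first.

```python
import re

def clean_tokens(tokens):
    """Strip non-alphanumeric characters, drop empty tokens, and lowercase."""
    cleaned = []
    for token in tokens:
        # Keep only letters and digits, as in the notebook's regular expression
        stripped = re.sub(r'[^a-zA-Z0-9]', '', token).strip()
        # Tokens that consisted solely of punctuation are now empty and are skipped
        if stripped:
            cleaned.append(stripped.lower())
    return cleaned

# Hypothetical output of a tokenizer for one headline
sample_tokens = ["Women", "'s", "rights", ":", "2021", "march", "!"]
cleaned_sample = clean_tokens(sample_tokens)

print(cleaned_sample)
```

Note that digits survive the cleaning step, and that fragments such as `'s` are reduced to a single letter rather than removed.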

In [None]:
# Storing a copy of headlines_filtered_encoded as lemmatized_df
lemmatized_df = headlines_filtered_encoded.copy()

In [None]:
# Dropping the 'headlines' column from lemmatized_df
lemmatized_df.drop('headlines', axis=1, inplace=True)

# Displaying the first row of lemmatized_df after dropping the column
lemmatized_df.head(1)

We now take a crucial step in refining our textual data for analysis by removing common English stop words. The process involves:

- **Acquiring Stop Words:**
  - We obtain a set of English stop words using the `stopwords.words('english')` function from the Natural Language Toolkit (nltk).

- **Filtering Stop Words:**
  - Stop words are then removed from the 'lemmatized_text' column using a lambda function. This helps eliminate frequently occurring words that typically don't contribute significant meaning to the analysis.

By eliminating stop words, we focus on the terms that carry the most meaning.

We will then drop `tokenized_text`, `cleaned_text`, and `lemmatized_text`.

In [None]:
# Getting stop words
stop_words = set(stopwords.words('english'))

# Creating no stopwords feature, removing stop words from the lemmatized_text column
lemmatized_df['lemmatized_text_no_stopwords'] = lemmatized_df['lemmatized_text'].apply(lambda tokens: [token for token in tokens if token not in stop_words])

In [None]:
# Inspecting head of lemmatized_text_no_stopwords feature
lemmatized_df['lemmatized_text_no_stopwords'].head()

In [None]:
# Storing a copy of lemmatized_df as df_to_vectorize
df_to_vectorize = lemmatized_df.copy()

In [None]:
# Setting list of columns to drop
columns_to_drop = ['tokenized_text', 'cleaned_text', 'lemmatized_text']

# Dropping the specified columns
df_to_vectorize.drop(columns=columns_to_drop, inplace=True)

In [None]:
# Inspecting head 
df_to_vectorize.head(1)

We'll now explore available pre-trained word embedding models in the `gensim-data` repository and download a specific pre-trained word embedding model for our analysis. The process involves:

- **Listing Available Models:**
  - We print the list of available models in the gensim-data repository using `gensim.downloader.info()['models'].keys()`.

- **Downloading the Model:**
  - We select the 'fasttext-wiki-news-subwords-300' model and download it using `gensim.downloader.load('fasttext-wiki-news-subwords-300')`. This pre-trained word embedding model is known for its representation of subword information and is well-suited for various natural language processing tasks.

Please note that downloading the model may take a while due to its size.

In [None]:
# Printing available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

In [None]:
# Downloading a pre-trained word embedding model and assigning it to 'model'- this will take a while! 
model = gensim.downloader.load('fasttext-wiki-news-subwords-300')

Let's define a function that generates an embedding for each headline by averaging the pre-trained model's 300-dimensional word vectors. For out-of-vocabulary words, we use a zero vector as a replacement. Stop words have already been removed from the tokens at this point.
- Input: a list of tokens (str) from the headline to be embedded.
- Output: an averaged embedding vector (np.array) in a 300-dimensional space.

In [None]:
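# Defining function
def text2vec(tokens):
    """Return the averaged 300-dimensional embedding for a list of tokens."""

    # Seeding the list with a zero vector keeps np.stack from failing on an
    # empty token list; note that this extra vector also counts toward the mean
    word_embeddings = [np.zeros(300)]
    for word in tokens:
        if word in model:
            vector = model[word]
        else:
            # Out-of-vocabulary words are replaced by a zero vector
            vector = np.zeros(300)

        word_embeddings.append(vector)

    # Averaging all vectors into a single headline embedding
    text_embedding = np.stack(word_embeddings).mean(axis=0)

    return text_embedding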
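
To see how the averaging behaves, here is a minimal sketch that swaps the pre-trained model for a hypothetical three-dimensional lookup table. The names `toy_model` and `toy_text2vec` and all vectors are assumptions for illustration; the logic mirrors `text2vec`, including the zero vector that seeds the list.

```python
import numpy as np

# Hypothetical 3-dimensional "embedding model" standing in for the pre-trained vectors
toy_model = {
    "woman": np.array([3.0, 0.0, 0.0]),
    "wins": np.array([0.0, 3.0, 0.0]),
}

def toy_text2vec(tokens, model=toy_model, dim=3):
    # Same logic as text2vec: seed with a zero vector, then add one vector per token
    word_embeddings = [np.zeros(dim)]
    for word in tokens:
        word_embeddings.append(model[word] if word in model else np.zeros(dim))
    return np.stack(word_embeddings).mean(axis=0)

known = toy_text2vec(["woman", "wins"])    # mean over 3 vectors: the seed plus 2 tokens
with_oov = toy_text2vec(["woman", "zzz"])  # the unknown word contributes zeros
empty = toy_text2vec([])                   # only the seed vector remains

print(known, with_oov, empty)
```

Because the seed vector is included in the mean, every embedding is scaled by n/(n+1) for a headline of n tokens, so shorter headlines are shrunk toward zero more strongly than longer ones.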

In [None]:
# Applying the function over the lemmatized text column and assigning the results to new columns
df_to_vectorize['headline_vectors'] = df_to_vectorize['lemmatized_text_no_stopwords'].apply(lambda x: text2vec(x))

In [None]:
# Inspecting headline vectors
df_to_vectorize['headline_vectors'].head()

In [None]:
# Making copy of df_to_vectorize
final_df = df_to_vectorize.copy()

In [None]:
# Dropping the 'lemmatized_text_no_stopwords' column
final_df = final_df.drop('lemmatized_text_no_stopwords', axis=1)

In [None]:
# Checking headline vectors of first row in filtered dataset (by position rather than by index label)
final_df['headline_vectors'].iloc[0]

Let's check for null values and drop them.

In [None]:
# Checking final_df for null values
final_df.isnull().sum()

In [None]:
# Dropping all rows containing null values
final_df = final_df.dropna()

In [None]:
# Ensuring there are no null values 
final_df.isnull().sum()

To mitigate class imbalance, we will now manually calculate class weights for the 'bias_category' column in our dataset. The process involves:

- **Extracting Unique Classes and Counts:**
  - We use `np.unique` to extract the unique classes and their corresponding counts from the 'bias_category' column in the DataFrame.

- **Calculating Class Weights:**
  - The class weights are computed by dividing the total number of samples by the product of the number of classes and the count of each class. This ensures that classes with fewer samples receive higher weights.

- **Creating a Class Weight Dictionary:**
  - A dictionary, `class_weight_dict`, is generated by mapping class labels to their respective weights using the `zip` function.

- **Displaying Class Weights:**
  - The resulting class weights are printed to provide insight into the distribution and importance assigned to each class.

In [None]:
# Manually calculating class weights

# Extract the unique classes and their counts from the 'bias_category' column in final_df
classes, counts = np.unique(final_df['bias_category'], return_counts=True)

# Calculate class weights for the 'bias_category' column in final_df
total_samples = len(final_df['bias_category'])
class_weights = total_samples / (len(classes) * counts)

# Create a dictionary mapping class labels to their respective weights
class_weight_dict = dict(zip(classes, class_weights))

# Print the class weights
print('Class Weights:', class_weight_dict)
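
Written as a formula, the weight of class $c$ is $w_c = N / (K \cdot n_c)$, where $N$ is the total number of samples, $K$ the number of classes, and $n_c$ the number of samples in class $c$. This is the same heuristic that scikit-learn applies for `class_weight='balanced'`. As a sanity check, the sketch below compares the manual calculation with `compute_class_weight` on a made-up label array (the class counts are assumptions, not those of our dataset).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical, deliberately imbalanced labels
y_toy = np.array(["No Bias"] * 6 + ["Low Bias"] * 3 + ["High Bias"] * 1)

# Manual calculation: total samples / (number of classes * count per class)
toy_classes, toy_counts = np.unique(y_toy, return_counts=True)
manual_weights = len(y_toy) / (len(toy_classes) * toy_counts)

# scikit-learn's 'balanced' heuristic on the same labels
sklearn_weights = compute_class_weight(class_weight="balanced", classes=toy_classes, y=y_toy)

print(dict(zip(toy_classes, manual_weights)))
print(np.allclose(manual_weights, sklearn_weights))
```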

# Train Test Split<a id='Train_Test_Split'></a>

We are now ready to perform a train-test split to prepare our data for modeling. To do this, we import `train_test_split` from scikit-learn (`sklearn`). We will then set our X and y variables, split the training and test data, and print the shape for each:

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Setting X and y variables
X = final_df.drop('bias_category', axis=1)
y = final_df['bias_category']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape)
print("y_test shape:", y_test.shape)

In [None]:
# Ensuring there are no NaN values in y_train
y_train.unique()

In [None]:
# Ensuring there are no NaN values in y_test
y_test.unique()

Following the train-test split, we now must convert headline vectors for both the training and testing sets into arrays. The process includes:

- **For Training Set (`X_train`):**
  - The 'headline_vectors' column in the training set, representing the vectorized form of headlines, is converted to a NumPy array using `np.array` and `tolist()`.

- **For Testing Set (`X_test`):**
  - Similarly, the 'headline_vectors' column in the testing set undergoes the same conversion to a NumPy array.

This conversion facilitates the compatibility of the headline vectors with machine learning models that expect array-like input. The resulting arrays, 'X_train_array' and 'X_test_array,' are now ready for use in model training and evaluation.

In [None]:
# Converting X_train vectors to arrays
X_train_array = np.array(X_train['headline_vectors'].tolist())

# Converting X_test vectors to arrays
X_test_array = np.array(X_test['headline_vectors'].tolist())
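
Why the `tolist()` round trip is needed can be seen on a toy column. The DataFrame and vectors below are made-up assumptions. A column that holds one NumPy vector per row is an object array of shape `(n,)`, whereas stacking the vectors yields the two-dimensional numeric matrix that scikit-learn estimators expect.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one 4-dimensional vector per row
toy_frame = pd.DataFrame({
    "headline_vectors": [np.zeros(4), np.ones(4), np.full(4, 0.5)]
})

# The raw column is a 1-D object array whose elements are arrays
as_objects = toy_frame["headline_vectors"].to_numpy()

# Converting via tolist() stacks the vectors into a 2-D float matrix
as_matrix = np.array(toy_frame["headline_vectors"].tolist())

print(as_objects.shape, as_objects.dtype)
print(as_matrix.shape, as_matrix.dtype)
```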

# Modeling

### Model Evaluation Metric
The **macro average F1 score** is often more appropriate for highly imbalanced datasets such as ours. The macro average calculates the F1 score (a score that considers both precision and recall across different classes) for each class independently, and then takes the unweighted average across all classes. This means that each class contributes equally to the final macro average, regardless of its size.

This metric is appropriate for our data, as our target has an imbalanced class distribution. We will use this metric across all models for ease of comparison.
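
As a small illustration of how the macro average differs from a support-weighted average, the sketch below computes both with scikit-learn's `f1_score` on made-up labels and predictions. These values are assumptions for demonstration only and are unrelated to our model results.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true labels and predictions for an imbalanced three-class problem
y_true_toy = ["No Bias"] * 6 + ["Low Bias"] * 3 + ["High Bias"] * 1
y_pred_toy = ["No Bias"] * 5 + ["Low Bias"] + ["Low Bias"] * 2 + ["No Bias"] + ["High Bias"]
class_labels = ["High Bias", "Low Bias", "No Bias"]

# F1 score of each class, computed independently
per_class_f1 = f1_score(y_true_toy, y_pred_toy, labels=class_labels, average=None)

# Macro average: unweighted mean of the per-class scores
macro_f1 = f1_score(y_true_toy, y_pred_toy, labels=class_labels, average="macro")

# Weighted average: per-class scores weighted by class size, for comparison
weighted_f1 = f1_score(y_true_toy, y_pred_toy, labels=class_labels, average="weighted")

print(dict(zip(class_labels, per_class_f1)))
print("Macro F1:", macro_f1)
print("Weighted F1:", weighted_f1)
```

In this toy case the single high bias example counts as much toward the macro average as the six no bias examples, which is the property we want for our imbalanced target.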

## Simple Models

### Logistic Regression
Logistic Regression (LogReg) serves as an excellent baseline simple model due to its interpretability, computational efficiency, and ease of implementation. Its linear decision boundary makes it suitable for binary and multiclass classification tasks, providing a straightforward comparison point for more complex models while offering a clear understanding of feature importance in the context of the dataset.

We will import all necessary packages and modules:

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, f1_score, accuracy_score
from sklearn.model_selection import GridSearchCV

Next, we instantiate a Logistic Regression model (`LogisticRegression`) configured with class weights derived from our previously calculated `class_weight_dict`. The training process involves:

- **Instantiating the Model:**
  - We create an instance of Logistic Regression with specified parameters, including a maximum iteration limit (`max_iter=1000`), a random state for reproducibility (`random_state=42`), and class weights determined earlier.

- **Model Fitting:**
  - The model is fitted to the training data (`X_train_array` and `y_train`) using the `fit` method. During training, the model adjusts its parameters to learn the underlying patterns and relationships.

- **Making Predictions:**
  - Predictions are generated on the test data (`X_test_array`) using the trained Logistic Regression model. The resulting predictions, stored in `y_pred_lr`, can be evaluated to assess the model's performance on unseen data.

Logistic Regression serves as a baseline model, providing a benchmark for more sophisticated algorithms.

In [None]:
# Create an instance of LogisticRegression with class weights
logreg_model = LogisticRegression(max_iter=1000, random_state=42, class_weight=class_weight_dict)

# Fit the model on the training data
logreg_model.fit(X_train_array, y_train)

# Make predictions on the test data
y_pred_lr = logreg_model.predict(X_test_array)

To assess the model, we import `classification_report` and get our evaluation scores:

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred_lr))

**LogReg Model Evaluation:** *Our LogReg model performs reasonably well in identifying low bias headlines but struggles with high and no bias categories.*

**High Bias Class:**
- **Precision (0.37):** Out of all predicted high bias headlines, 37% were correctly classified.
- **Recall (0.63):** The model identified 63% of the actual high bias headlines.

**Low Bias Class:**
- **Precision (0.80):** Of the headlines the model predicted as low bias, 80% were actually low bias.
- **Recall (0.44):** Only 44% of the actual low bias headlines were correctly identified by the model.

**No Bias Class:**
- **Precision (0.27):** The model's precision in predicting no bias headlines was relatively low (27%).
- **Recall (0.59):** The model captured 59% of the actual no bias headlines.

Our LogReg Macro Average F1 score is **0.47**, indicating a fair balance between precision and recall across different bias categories without being overly influenced by class size.
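
As a quick sanity check, the macro average can be approximately reproduced from the rounded precision and recall values reported above. This is only a sketch: the helper name `f1_from_pr` is hypothetical, and because the inputs are already rounded to two decimals the result is approximate.

```python
def f1_from_pr(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs as reported for the LogReg model above
reported = {
    "High Bias": (0.37, 0.63),
    "Low Bias": (0.80, 0.44),
    "No Bias": (0.27, 0.59),
}

per_class = {name: f1_from_pr(p, r) for name, (p, r) in reported.items()}
macro = sum(per_class.values()) / len(per_class)

for name, score in per_class.items():
    print(f"{name}: {score:.2f}")
print(f"Macro average F1: {macro:.2f}")  # Macro average F1: 0.47
```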

In [None]:
from sklearn.metrics import confusion_matrix

# Get unique classes from y_test
unique_classes = np.unique(y_test)

# Calculate the confusion matrix
logreg_cm = confusion_matrix(y_test, y_pred_lr, labels=unique_classes)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(logreg_cm, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=unique_classes, yticklabels=unique_classes)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

In [None]:
logreg_cm

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

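# Assumed reconstruction: the original fitting code is missing from this cell.
# The hyperparameters below are placeholders, not the author's confirmed settings.
rf_model = RandomForestClassifier(random_state=42, class_weight=class_weight_dict)
rf_model.fit(X_train_array, y_train)
y_pred_rf = rf_model.predict(X_test_array)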

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred_rf))

In [None]:
# Create the confusion matrix
rf_cm = confusion_matrix(y_test, y_pred_rf)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(rf_cm, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

## Complex Models

### Neural Network

In [None]:
from sklearn.utils.class_weight import compute_class_weight
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report
from tensorflow.keras.layers import BatchNormalization, Dropout, LeakyReLU
from tensorflow.keras.optimizers import Adam

In [None]:
# Assuming y_train is a pandas Series with string labels
label_encoder = LabelEncoder()

# Encoding y_test and y_train
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

# One-hot encoding y_train_encoded and y_test_encoded
y_train_one_hot = to_categorical(y_train_encoded)
y_test_one_hot = to_categorical(y_test_encoded)

# Assuming y_train_encoded is an array of class labels
class_labels = np.unique(y_train_encoded)

# Compute class weights for the neural network
class_weights = compute_class_weight(class_weight='balanced', classes=class_labels, y=y_train_encoded)
class_weight_dict = dict(enumerate(class_weights))
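
For intuition, the `'balanced'` option weights each class by `n_samples / (n_classes * class_count)`, so rarer classes receive larger weights. The following is a minimal, dependency-free sketch of that formula on made-up labels; the helper name `balanced_class_weights` is hypothetical and is not part of scikit-learn.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}

# Made-up encoded labels: class 0 is common, class 2 is rare
toy_labels = [0] * 6 + [1] * 3 + [2] * 1
toy_weights = balanced_class_weights(toy_labels)

for cls, weight in toy_weights.items():
    print(cls, round(weight, 3))
```

The rare class ends up with a weight several times larger than the common class, which is what lets the loss function pay more attention to it during training.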

In [None]:
# Define the neural network model
nn_model = Sequential()
nn_model.add(Dense(64, input_dim=X_train_array.shape[1], activation='relu'))
nn_model.add(Dense(32, activation='relu'))
nn_model.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
nn_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model.fit(X_train_array, y_train_one_hot, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)

In [None]:
# Inspect the mapping between original class labels and encoded numbers
class_labels_mapping = dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)))
print("Class Labels Mapping:", class_labels_mapping)

In [None]:
# Make predictions on the test data
nn_model_y_pred_one_hot = nn_model.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_y_pred_classes = np.argmax(nn_model_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_y_pred_classes, target_names=label_encoder.classes_))
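
The softmax layer returns one probability per class for each headline, and taking the argmax picks the most likely class index. A small, dependency-free sketch of that decoding step is shown below; the probabilities and the `classes` list are made up for illustration (in the notebook, `label_encoder.classes_` plays that role).

```python
def argmax(row):
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(row)), key=lambda i: row[i])

# Hypothetical class names, standing in for label_encoder.classes_
classes = ["high", "low", "none"]

# Made-up softmax outputs for three headlines (each row sums to 1)
probs = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
]

pred_idx = [argmax(row) for row in probs]
pred_labels = [classes[i] for i in pred_idx]

print(pred_idx)     # [1, 0, 2]
print(pred_labels)  # ['low', 'high', 'none']
```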

Next, we add dropout (rate 0.5) after each hidden layer to reduce overfitting.
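
As a rough illustration of what a dropout layer does during training, the sketch below zeroes each activation with probability `rate` and scales the survivors by `1 / (1 - rate)` so the expected sum is unchanged. The function name `dropout` and the toy activations are assumptions for illustration only; Keras handles this internally and disables dropout at prediction time.

```python
import random

def dropout(activations, rate, rng):
    """Zero each value with probability `rate`; scale the rest by 1 / (1 - rate)."""
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale for a in activations]

rng = random.Random(42)  # seeded only to make the sketch repeatable
hidden = [1.0] * 10      # made-up hidden-layer activations
dropped = dropout(hidden, rate=0.5, rng=rng)

print(dropped)
```

With a rate of 0.5, every surviving activation is doubled, so each value in the output is either 0.0 or 2.0.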

In [None]:
# Define the neural network model
nn_model_2 = Sequential()
nn_model_2.add(Dense(64, input_dim=X_train_array.shape[1], activation='relu'))
nn_model_2.add(Dropout(0.5))  # Adding dropout rate of 0.5
nn_model_2.add(Dense(32, activation='relu'))
nn_model_2.add(Dropout(0.5))  # Adding dropout rate of 0.5
nn_model_2.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
nn_model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model_2.fit(X_train_array, y_train_one_hot, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)

In [None]:
# Make predictions on the test data
nn_model_2_y_pred_one_hot = nn_model_2.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_2_y_pred_classes = np.argmax(nn_model_2_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_2_y_pred_classes, target_names=label_encoder.classes_))

Next, we try a neural network with L2 regularization on the hidden layers.
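
L2 regularization adds a penalty proportional to the sum of squared weights to the training loss, which discourages large weights. A minimal sketch of that penalty term follows; the helper name `l2_penalty` and the toy kernel are made up, and the factor of 0.01 mirrors the `l2(0.01)` setting used in the cell below.

```python
def l2_penalty(weights, factor=0.01):
    """Penalty added to the loss: factor * sum of squared weights."""
    return factor * sum(w * w for row in weights for w in row)

# Made-up 2x2 weight matrix for illustration
kernel = [
    [0.5, -1.0],
    [2.0, 0.0],
]

penalty = l2_penalty(kernel)
print(round(penalty, 4))  # 0.0525
```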

In [None]:
from tensorflow.keras.regularizers import l2

# Define the simple neural network model
nn_model_3 = Sequential()
nn_model_3.add(Dense(64, input_dim=X_train_array.shape[1], activation='relu', kernel_regularizer=l2(0.01)))
nn_model_3.add(Dense(32, activation='relu', kernel_regularizer=l2(0.01)))
nn_model_3.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
nn_model_3.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model_3.fit(X_train_array, y_train_one_hot, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)

In [None]:
# Make predictions on the test data
nn_model_3_y_pred_one_hot = nn_model_3.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_3_y_pred_classes = np.argmax(nn_model_3_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_3_y_pred_classes, target_names=label_encoder.classes_))

TODO: add a confusion matrix for this model here.

Next, we try a neural network with batch normalization after each hidden layer.
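
During training, batch normalization standardizes each feature across the current batch and then applies a learned scale and shift. The sketch below shows that computation for a single feature column; the function name `batch_norm`, the toy batch, and the fixed `gamma`, `beta`, and `eps` values are assumptions for illustration (in Keras, `gamma` and `beta` are learned, and moving averages are used at prediction time).

```python
import math

def batch_norm(column, gamma=1.0, beta=0.0, eps=1e-3):
    """Standardize one feature over a batch, then scale by gamma and shift by beta."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in column]

batch = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one unit across a batch
normed = batch_norm(batch)

normed_mean = sum(normed) / len(normed)
normed_var = sum((x - normed_mean) ** 2 for x in normed) / len(normed)
print(round(normed_mean, 6), round(normed_var, 3))
```

After normalization the column has a mean of roughly zero and a variance close to one, which tends to keep the inputs to the next layer on a stable scale.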

In [None]:
# Define the neural network model
nn_model_4 = Sequential()
nn_model_4.add(Dense(64, input_dim=X_train_array.shape[1], activation='relu'))
nn_model_4.add(BatchNormalization())
nn_model_4.add(Dense(32, activation='relu'))
nn_model_4.add(BatchNormalization())
nn_model_4.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
nn_model_4.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model_4.fit(X_train_array, y_train_one_hot, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)


In [None]:
# Make predictions on the test data
nn_model_4_y_pred_one_hot = nn_model_4.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_4_y_pred_classes = np.argmax(nn_model_4_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_4_y_pred_classes, target_names=label_encoder.classes_))

In [None]:
# Define the neural network model
nn_model_5 = Sequential()
nn_model_5.add(Dense(64, input_dim=X_train_array.shape[1]))
nn_model_5.add(LeakyReLU(alpha=0.1))  # Leaky ReLU activation
nn_model_5.add(BatchNormalization())
nn_model_5.add(Dense(32))
nn_model_5.add(LeakyReLU(alpha=0.1))  # Leaky ReLU activation
nn_model_5.add(BatchNormalization())
nn_model_5.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
nn_model_5.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model_5.fit(X_train_array, y_train_one_hot, epochs=10, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)


In [None]:
# Make predictions on the test data
nn_model_5_y_pred_one_hot = nn_model_5.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_5_y_pred_classes = np.argmax(nn_model_5_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_5_y_pred_classes, target_names=label_encoder.classes_))

Next, we raise the Adam learning rate to 0.01 (the Keras default is 0.001) and increase training to 50 epochs.

In [None]:
# Define the neural network model
nn_model_6 = Sequential()
nn_model_6.add(Dense(64, input_dim=X_train_array.shape[1]))
nn_model_6.add(LeakyReLU(alpha=0.1))  # Leaky ReLU activation
nn_model_6.add(BatchNormalization())
nn_model_6.add(Dense(32))
nn_model_6.add(LeakyReLU(alpha=0.1))  # Leaky ReLU activation
nn_model_6.add(BatchNormalization())
nn_model_6.add(Dense(len(class_labels), activation='softmax'))

# Compile the model with a faster learning rate
nn_model_6.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the model with class weights
nn_model_6.fit(X_train_array, y_train_one_hot, epochs=50, batch_size=32, validation_split=0.2, class_weight=class_weight_dict)


In [None]:
# Make predictions on the test data
nn_model_6_y_pred_one_hot = nn_model_6.predict(X_test_array)

# Convert the predicted probabilities to class labels
nn_model_6_y_pred_classes = np.argmax(nn_model_6_y_pred_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_encoded, nn_model_6_y_pred_classes, target_names=label_encoder.classes_))

## Recurrent Neural Networks

### GRU

In [None]:
from tensorflow.keras.layers import GRU
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.callbacks import LearningRateScheduler

In [None]:
# Reshape the input data for GRU (assuming X_train_array has shape (samples, features))
X_train_reshaped = X_train_array.reshape((X_train_array.shape[0], X_train_array.shape[1], 1))
X_test_reshaped = X_test_array.reshape((X_test_array.shape[0], X_test_array.shape[1], 1))

# Define the GRU model with Batch Normalization and Dropout
gru_model = Sequential()
gru_model.add(GRU(64, input_shape=(X_train_array.shape[1], 1), activation='relu'))
gru_model.add(BatchNormalization())
gru_model.add(Dropout(0.5))
gru_model.add(Dense(32, activation='relu'))
gru_model.add(BatchNormalization())
gru_model.add(Dropout(0.5))
gru_model.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
gru_model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True) #setting patience to 1

def lr_schedule(epoch):
    return 0.001 * 0.9 ** epoch

learning_rate_scheduler = LearningRateScheduler(lr_schedule)

# Train the GRU model with class weights, lowering epochs, increasing batch size
gru_model.fit(X_train_reshaped, y_train_one_hot, epochs=5, batch_size=250, validation_split=0.2, class_weight=class_weight_dict, callbacks=[early_stopping, learning_rate_scheduler])
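
To see what the scheduler does, the sketch below evaluates the same exponential decay rule for the five epochs used above. The `initial_lr` and `decay` keyword arguments are added here only for readability; the notebook's `lr_schedule` hard-codes 0.001 and 0.9.

```python
def lr_schedule(epoch, initial_lr=0.001, decay=0.9):
    """Exponential decay: multiply the learning rate by `decay` every epoch."""
    return initial_lr * decay ** epoch

# Learning rate the scheduler would set at the start of each of the 5 epochs
rates = [lr_schedule(epoch) for epoch in range(5)]

for epoch, rate in enumerate(rates):
    print(f"epoch {epoch}: lr = {rate:.7f}")
```

The rate falls by 10% per epoch, from 0.001 at epoch 0 to about 0.00066 at epoch 4, so later updates take smaller steps.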

In [None]:
# Make predictions on the test data
y_pred_one_hot_gru = gru_model.predict(X_test_reshaped)

# Convert the predicted probabilities to class labels
y_pred_classes_gru = np.argmax(y_pred_one_hot_gru, axis=1)

# Convert true labels to class labels
y_test_classes_gru = np.argmax(y_test_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_classes_gru, y_pred_classes_gru, target_names=label_encoder.classes_))


In [None]:
# Generate confusion matrix
gru_cm = confusion_matrix(y_test_classes_gru, y_pred_classes_gru)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(gru_cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('GRU Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

### LSTM

In [None]:
from tensorflow.keras.layers import LSTM

# Define the neural network model with LSTM
lstm_model = Sequential()
lstm_model.add(LSTM(64, input_shape=(X_train_array.shape[1], 1), activation='relu'))
lstm_model.add(BatchNormalization())
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(32, activation='relu'))
lstm_model.add(BatchNormalization())
lstm_model.add(Dropout(0.5))
lstm_model.add(Dense(len(class_labels), activation='softmax'))

# Compile the model with Adam at learning_rate=0.01
# (note: the LearningRateScheduler callback passed to fit resets the rate each epoch)
lstm_model.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

# Reshape the input data for RNN (assuming X_train_array has shape (samples, features))
X_train_reshaped = X_train_array.reshape((X_train_array.shape[0], X_train_array.shape[1], 1))

# Train the RNN model with class weights
lstm_model.fit(X_train_reshaped, y_train_one_hot, epochs=5, batch_size=250, validation_split=0.2, class_weight=class_weight_dict, callbacks=[early_stopping, learning_rate_scheduler])

In [None]:


# Make predictions on the test data
lstm_model_y_pred_one_hot = lstm_model.predict(X_test_reshaped)

# Convert the predicted probabilities to class labels
lstm_model_y_pred_classes = np.argmax(lstm_model_y_pred_one_hot, axis=1)

# Convert true labels to class labels
lstm_model_y_test_classes = np.argmax(y_test_one_hot, axis=1)

# Print classification report
print(classification_report(lstm_model_y_test_classes, lstm_model_y_pred_classes, target_names=label_encoder.classes_))


In [None]:
# Calculate confusion matrix
lstm_cm = confusion_matrix(lstm_model_y_test_classes, lstm_model_y_pred_classes)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(lstm_cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('LSTM Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

### Bidirectional LSTM

In [None]:
from tensorflow.keras.layers import Bidirectional

# Define the Bidirectional LSTM model
blstm_model = Sequential()
blstm_model.add(Bidirectional(LSTM(64, activation='relu'), input_shape=(X_train_array.shape[1], 1)))
blstm_model.add(BatchNormalization())
blstm_model.add(Dropout(0.5))
blstm_model.add(Dense(32, activation='relu'))
blstm_model.add(BatchNormalization())
blstm_model.add(Dropout(0.5))
blstm_model.add(Dense(len(class_labels), activation='softmax'))

# Compile the model with a learning rate of 0.01
blstm_model.compile(optimizer=Adam(learning_rate=0.01), loss='categorical_crossentropy', metrics=['accuracy'])

# Train the Bidirectional LSTM model with class weights
blstm_model.fit(X_train_reshaped, y_train_one_hot, epochs=5, batch_size=250, validation_split=0.2, class_weight=class_weight_dict, callbacks=[early_stopping, learning_rate_scheduler])


In [None]:
# Make predictions on the test data
blstm_model_y_pred_one_hot = blstm_model.predict(X_test_reshaped)

# Convert the predicted probabilities to class labels
blstm_y_pred_classes = np.argmax(blstm_model_y_pred_one_hot, axis=1)

# Convert true labels to class labels
blstm_model_y_test_classes = np.argmax(y_test_one_hot, axis=1)

# Print classification report
print(classification_report(blstm_model_y_test_classes, blstm_y_pred_classes, target_names=label_encoder.classes_))

In [None]:
# Calculate the confusion matrix
blstm_cm = confusion_matrix(blstm_model_y_test_classes, blstm_y_pred_classes)

# Plot the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(blstm_cm, annot=True, fmt='d', cmap='Blues', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)
plt.title('Bidirectional LSTM Confusion Matrix')
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.show()

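Next, we apply SMOTE (Synthetic Minority Over-sampling Technique) to the training set and retrain the GRU model on the resampled data.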
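
SMOTE balances the classes by creating synthetic minority samples: it picks a minority sample, picks one of its nearest minority-class neighbors, and places a new point at a random position on the line segment between them. The sketch below shows only that interpolation step, with made-up two-dimensional points and a hypothetical helper name `interpolate`; the nearest-neighbor search that `imblearn` performs is omitted.

```python
import random

def interpolate(sample, neighbor, rng):
    """Create a synthetic point at a random position between two minority samples."""
    gap = rng.random()  # random fraction in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(42)  # seeded only to make the sketch repeatable
minority_a = [1.0, 2.0]  # made-up minority-class vector
minority_b = [3.0, 6.0]  # assumed to be one of its nearest minority neighbors

synthetic = interpolate(minority_a, minority_b, rng)
print(synthetic)
```

Because the same random fraction is applied to every dimension, the synthetic point always lies on the segment joining the two original samples.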

In [None]:
#Import SMOTE
from imblearn.over_sampling import SMOTE

In [None]:
#Initialize SMOTE
smote = SMOTE(random_state=42)

#Apply SMOTE to the training set
X_train_smote, y_train_smote = smote.fit_resample(X_train_array, y_train_encoded)

In [None]:
#Convert y_train_smote to a DataFrame
y_train_smote_df = pd.DataFrame({'bias_category': y_train_smote})

In [None]:
y_train_smote_df.value_counts()

In [None]:
# One-hot encode y_train_smote
y_train_smote_one_hot = to_categorical(y_train_smote_df['bias_category'])

In [None]:
# Reshape the input data for GRU (assuming X_train_array has shape (samples, features))
X_train_reshaped = X_train_smote.reshape((X_train_smote.shape[0], X_train_smote.shape[1], 1))
X_test_reshaped = X_test_array.reshape((X_test_array.shape[0], X_test_array.shape[1], 1))

# Define the GRU model with Batch Normalization and Dropout
gru_model_smote = Sequential()
gru_model_smote.add(GRU(64, input_shape=(X_train_smote.shape[1], 1), activation='relu'))
gru_model_smote.add(BatchNormalization())
gru_model_smote.add(Dropout(0.5))
gru_model_smote.add(Dense(32, activation='relu'))
gru_model_smote.add(BatchNormalization())
gru_model_smote.add(Dropout(0.5))
gru_model_smote.add(Dense(len(class_labels), activation='softmax'))

# Compile the model (class weights are passed to fit below)
gru_model_smote.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=1, restore_best_weights=True)  # setting patience to 1

def lr_schedule(epoch):
    return 0.001 * 0.9 ** epoch

learning_rate_scheduler = LearningRateScheduler(lr_schedule)

# Train the GRU model with class weights, lowering epochs, increasing batch size
gru_model_smote.fit(X_train_reshaped, y_train_smote_one_hot, epochs=5, batch_size=250, validation_split=0.2, class_weight=class_weight_dict, callbacks=[early_stopping, learning_rate_scheduler])


In [None]:
# Make predictions on the test data
y_pred_one_hot_gru_smote = gru_model_smote.predict(X_test_reshaped)

# Convert the predicted probabilities to class labels
y_pred_classes_gru_smote = np.argmax(y_pred_one_hot_gru_smote, axis=1)

# Convert true labels to class labels
y_test_classes_gru_smote = np.argmax(y_test_one_hot, axis=1)

# Print classification report
print(classification_report(y_test_classes_gru_smote, y_pred_classes_gru_smote, target_names=label_encoder.classes_))


Next steps:
- Print the macro-averaged F1 scores for the best-performing models (Random Forest, neural network, and Logistic Regression).
- Print the feature importances for each of these models.

### Future Aspirations: A Tool for Writers

Looking ahead, a pivotal goal is to create a model or web app empowering writers to assess their headlines' predicted bias scores before publication. This not only emphasizes the commitment to unbiased reporting but also provides a practical solution for writers to navigate the nuanced landscape of headline neutrality.