# Sentiment Analysis of Financial News Headlines

Author: Mohamed Oussama NAJI

Date: March 27, 2024

## Table of Contents
1. [Introduction](#introduction)
2. [Dataset](#dataset)
3. [Data Loading](#data-loading)
4. [Data Exploration](#data-exploration)
5. [Data Cleaning](#data-cleaning)
6. [SMOTE (Imbalanced Dataset)](#smote)
7. [Bag-of-Words (BoW) Model](#bow-model)
8. [TF-IDF Model](#tfidf-model)
9. [Train-Test Split](#train-test-split)
10. [Classification Algorithms](#classification-algorithms)
    - [LightGBM](#lightgbm)
    - [Logistic Regression](#logistic-regression)
11. [Confusion Matrices](#confusion-matrices)
12. [Conclusion](#conclusion)

## Introduction <a id="introduction"></a>

Sentiment analysis determines the sentiment or emotional tone of a piece of text. In this notebook, we classify financial news headlines as positive, negative, or neutral.

We will explore various steps involved in the sentiment analysis pipeline, including data loading, data cleaning, feature extraction using Bag-of-Words (BoW) and TF-IDF models, and classification using LightGBM and Logistic Regression algorithms. We will also handle the imbalanced dataset using the SMOTE technique and evaluate the performance of the models using confusion matrices.


## Dataset <a id="dataset"></a>

The dataset used in this notebook contains financial news headlines along with their sentiment labels. It can be downloaded from the following URL:
https://raw.githubusercontent.com/subashgandyer/datasets/main/financial_news_headlines_sentiment.csv


In [None]:
import requests

url = 'https://raw.githubusercontent.com/subashgandyer/datasets/main/financial_news_headlines_sentiment.csv'
response = requests.get(url)

if response.status_code == 200:
    with open('financial_news_headlines_sentiment.csv', 'wb') as f:
        f.write(response.content)
    print("Dataset downloaded successfully.")
else:
    print("Failed to download the dataset. Status code:", response.status_code)


## Data Loading <a id="data-loading"></a>

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

h_cols = ['sentiment', 'headline']
news_headlines_df = pd.read_csv('financial_news_headlines_sentiment.csv', sep=',', names=h_cols, encoding='latin-1')


## Data Exploration <a id="data-exploration"></a>

In [None]:
print("shape : ", news_headlines_df.shape)
print(news_headlines_df.head())
print(news_headlines_df['sentiment'].value_counts())

## Data Cleaning <a id="data-cleaning"></a>

In [None]:
# Drop rows with missing values, then exact duplicate rows
news_headlines_df = news_headlines_df.dropna().drop_duplicates()

# Strip punctuation, then lowercase. regex=True is required because
# pandas >= 2.0 treats the pattern as a literal string by default.
news_headlines_df['headline'] = news_headlines_df['headline'].str.replace(r'[^\w\s]', '', regex=True)
news_headlines_df['headline'] = news_headlines_df['headline'].str.lower()

print("shape : ", news_headlines_df.shape)
print(news_headlines_df.head())
print(news_headlines_df['sentiment'].value_counts())
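
To make the effect of the punctuation pattern concrete, here is a minimal sketch that applies the same regular expression to a single string with the standard `re` module. The function name `clean_headline` and the sample headline are hypothetical and are not taken from the dataset.

```python
import re

def clean_headline(text: str) -> str:
    """Remove punctuation and lowercase, mirroring the pandas cleaning step."""
    # [^\w\s] matches any character that is neither a word character nor whitespace
    return re.sub(r"[^\w\s]", "", text).lower()

# Hypothetical headline used only for illustration
sample = "Profit rose 20%, says CEO!"
cleaned = clean_headline(sample)
print(cleaned)

# Hyphens are removed too, which joins the surrounding words
joined = clean_headline("Year-on-year sales up")
print(joined)
```

Because the pattern deletes characters rather than replacing them with a space, hyphenated terms collapse into one token. If that is undesirable, replacing matches with a space is one possible alternative.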

## SMOTE (Imbalanced Dataset) <a id="smote"></a>

In [None]:
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer

X = news_headlines_df.headline
y = news_headlines_df.sentiment

train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
train_data_vectorizer = vectorizer.fit_transform(train_data)
test_data_vectorizer = vectorizer.transform(test_data)

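# Oversample only the training features; the test set keeps its original class distribution
sm = SMOTE(random_state=42)
train_data_res, train_labels_res = sm.fit_resample(train_data_vectorizer, train_labels)

# max_iter is raised from the default of 100 to reduce the chance of convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(train_data_res, train_labels_res)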

predicted_labels = model.predict(test_data_vectorizer)
print(classification_report(test_labels, predicted_labels))
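
The core idea behind SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified, pure-Python illustration of that idea and is not the `imblearn` implementation. The function name `smote_like_oversample` and the 2-D points are hypothetical.

```python
import random

def smote_like_oversample(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (simplified SMOTE)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical 2-D features: 6 majority points and 3 minority points
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.2)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

new_points = smote_like_oversample(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

Every synthetic point lies on a line segment between two existing minority points, so the oversampled class stays inside the region the real minority samples already occupy.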

## Bag-of-Words (BoW) Model <a id="bow-model"></a>

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

text_data = news_headlines_df['headline'].values

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(text_data)

print(vectorizer.get_feature_names_out())
print(X.toarray())
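
As a rough illustration of what the Bag-of-Words matrix contains, the following sketch builds the same kind of count matrix by hand with `collections.Counter`. The two headlines are hypothetical, and tokenization here is a plain whitespace split, which is an assumption: `CountVectorizer` uses its own token pattern and by default drops single-character tokens.

```python
from collections import Counter

# Hypothetical, already-cleaned headlines
headlines = [
    "profit rose as sales rose",
    "sales fell",
]

# Vocabulary: every distinct token, sorted alphabetically
vocabulary = sorted({token for h in headlines for token in h.split()})

# One row per headline, one column per vocabulary word, holding raw counts
bow_matrix = []
for h in headlines:
    counts = Counter(h.split())
    bow_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
print(bow_matrix)
```

Each column corresponds to one vocabulary word, and word order within a headline is discarded, which is why the representation is called a "bag" of words.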

## TF-IDF Model <a id="tfidf-model"></a>

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

text_data = news_headlines_df['headline'].values

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(text_data)

print(vectorizer.get_feature_names_out())
print(X.toarray())
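
To show how TF-IDF reweights raw counts, here is a minimal pure-Python sketch. It assumes the smoothed IDF formula that scikit-learn documents for its default settings, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The headlines and the whitespace tokenization are hypothetical simplifications.

```python
import math
from collections import Counter

# Hypothetical, already-cleaned headlines
docs = [
    "profit rose as sales rose",
    "sales fell",
]
tokenized = [d.split() for d in docs]
vocabulary = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: number of documents containing each term
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed inverse document frequency (assumed to match smooth_idf=True)
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

tfidf_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    row = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in row))  # L2-normalise each row
    tfidf_matrix.append([v / norm for v in row])

for term, row_value in zip(vocabulary, tfidf_matrix[1]):
    print(f"{term}: {row_value:.3f}")
```

In the second headline, "fell" appears in only one document while "sales" appears in both, so "fell" receives the larger weight even though both words occur once.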

## Train-Test Split <a id="train-test-split"></a>

In [None]:
X = news_headlines_df.headline
y = news_headlines_df.sentiment

train_data, test_data, train_labels, test_labels = train_test_split(X, y, test_size=0.2, random_state=42)
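
The split above is reproducible because `random_state` fixes the shuffle. The sketch below illustrates that idea with a hand-written, seeded index split. It is not scikit-learn's actual shuffling algorithm, and the helper name `split_indices` is hypothetical.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and reserve a test_size share for testing."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)
    # The first n_test shuffled indices form the test set, the rest the training set
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))

# The same seed and sample count always produce the same partition
repeat_train_idx, repeat_test_idx = split_indices(10)
print(test_idx == repeat_test_idx)
```

The partition depends only on the seed and the number of samples, not on the values being split. This is why splitting different arrays of the same length with the same `random_state` keeps their rows aligned.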


## Classification Algorithms <a id="classification-algorithms"></a>

### LightGBM <a id="lightgbm"></a>

In [None]:
import numpy as np
import lightgbm as lgb
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

text_data = news_headlines_df['headline'].values
target_labels = news_headlines_df['sentiment'].values

label_encoder = LabelEncoder()
encoded_target_labels = label_encoder.fit_transform(target_labels)

vectorizers = {
    'BoW': CountVectorizer(),
    'TF-IDF': TfidfVectorizer()
}

predicted_labels_lgb = {
    'BoW': {},
    'TF-IDF': {}
}

for vectorizer_name, vectorizer in vectorizers.items():
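    # Split the raw text first, then fit the vectorizer on the training portion only,
    # so the vocabulary (and IDF weights) are not learned from the test set
    train_text, test_text, train_labels, test_labels = train_test_split(text_data, encoded_target_labels, test_size=0.2, random_state=42)
    train_data = vectorizer.fit_transform(train_text).astype(np.float32)
    test_data = vectorizer.transform(test_text).astype(np.float32)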

    train_data_lgb = lgb.Dataset(train_data, label=train_labels)
    params = {
        'objective': 'multiclass',
        'num_class': len(np.unique(encoded_target_labels)),
        'metric': 'multi_logloss',
        'verbose': -1
    }
    classifier_lgb = lgb.train(params, train_data_lgb, 100)
    predicted_labels_lgb[vectorizer_name] = np.argmax(classifier_lgb.predict(test_data), axis=1)
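
With the `multiclass` objective, the model returns one probability per class for each headline, and the predicted class is the index of the largest value. The sketch below walks through that step on hypothetical probabilities. It assumes the three labels named in the introduction and relies on the fact that `LabelEncoder` assigns integer ids in sorted label order.

```python
# LabelEncoder sorts its classes, so ids follow alphabetical order
classes = sorted(["positive", "negative", "neutral"])
encode = {label: i for i, label in enumerate(classes)}

# Hypothetical per-class probabilities, one row per headline
probabilities = [
    [0.10, 0.70, 0.20],
    [0.05, 0.15, 0.80],
    [0.60, 0.30, 0.10],
]

# Index of the largest probability in each row (what np.argmax(..., axis=1) computes)
predicted_ids = [max(range(len(row)), key=row.__getitem__) for row in probabilities]

# Map integer ids back to readable sentiment names
predicted_names = [classes[i] for i in predicted_ids]
print(encode)
print(predicted_ids)
print(predicted_names)
```

In the notebook itself, `label_encoder.inverse_transform` performs the same mapping from integer ids back to the original sentiment strings.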


### Logistic Regression <a id="logistic-regression"></a>

In [None]:
from sklearn.linear_model import LogisticRegression

predicted_labels_lr = {
    'BoW': {},
    'TF-IDF': {}
}

for vectorizer_name, vectorizer in vectorizers.items():
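    # Split the raw text first, then fit the vectorizer on the training portion only,
    # so the vocabulary (and IDF weights) are not learned from the test set
    train_text, test_text, train_labels, test_labels = train_test_split(text_data, encoded_target_labels, test_size=0.2, random_state=42)
    train_data = vectorizer.fit_transform(train_text)
    test_data = vectorizer.transform(test_text)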

    classifier_lr = LogisticRegression(max_iter=1000)
    classifier_lr.fit(train_data, train_labels)
    predicted_labels_lr[vectorizer_name] = classifier_lr.predict(test_data)

## Confusion Matrices <a id="confusion-matrices"></a>

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import confusion_matrix
import seaborn as sns

label_encoder = LabelEncoder()
encoded_target_labels = label_encoder.fit_transform(target_labels)

_, encoded_test_labels = train_test_split(encoded_target_labels, test_size=0.2, random_state=42)

for vectorizer_name in ['BoW', 'TF-IDF']:
    for classifier_name in ['Logistic Regression', 'LightGBM']:
        if classifier_name == 'Logistic Regression':
            predicted_labels = predicted_labels_lr[vectorizer_name]
        else:
            predicted_labels = predicted_labels_lgb[vectorizer_name]

        encoded_test_labels = np.array(encoded_test_labels)
        predicted_labels = np.array(predicted_labels)

        conf_matrix = confusion_matrix(encoded_test_labels, predicted_labels)
        plt.figure(figsize=(10,7))
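        # Label the axes with the original sentiment names instead of integer ids
        sns.heatmap(conf_matrix, annot=True, fmt='d', xticklabels=label_encoder.classes_, yticklabels=label_encoder.classes_)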
        plt.title(f'Confusion Matrix for {classifier_name} with {vectorizer_name}')
        plt.xlabel('Predicted')
        plt.ylabel('Actual')
        plt.show()
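
To clarify how a multi-class confusion matrix is read, here is a minimal sketch that builds one by hand and derives per-class precision and recall from it. The label arrays are hypothetical, and the layout follows the convention used above: rows are actual classes, columns are predicted classes.

```python
def confusion_matrix_counts(actual, predicted, n_classes):
    """Count how often each actual class (row) was predicted as each class (column)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical encoded labels for eight headlines (0, 1, 2 are class ids)
actual = [0, 0, 1, 1, 1, 2, 2, 1]
predicted = [0, 1, 1, 1, 2, 2, 1, 1]

matrix = confusion_matrix_counts(actual, predicted, n_classes=3)

# Recall: correct predictions divided by the row total (all actual members of the class)
recall = [matrix[c][c] / sum(matrix[c]) for c in range(3)]

# Precision: correct predictions divided by the column total (all predictions of the class)
precision = [matrix[c][c] / sum(row[c] for row in matrix) for c in range(3)]

# Accuracy: sum of the diagonal divided by the number of samples
accuracy = sum(matrix[c][c] for c in range(3)) / len(actual)

print(matrix)
print(recall, precision, accuracy)
```

Reading along a row shows where the headlines of one actual class ended up, while reading down a column shows which actual classes were mistaken for that prediction.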

## Conclusion <a id="conclusion"></a>

In this notebook, we performed sentiment analysis on financial news headlines using various techniques and algorithms. We started by downloading and loading the dataset, followed by data exploration and cleaning.

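To address the class imbalance, we applied SMOTE to the TF-IDF training features of a Logistic Regression baseline, oversampling the minority classes. The later LightGBM and Logistic Regression comparisons were trained on the original, imbalanced data. We also extracted features from the text using Bag-of-Words (BoW) and TF-IDF models.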

Next, we trained and evaluated two classification algorithms: LightGBM and Logistic Regression. We used the BoW and TF-IDF features as input to these algorithms and made predictions on the test set.

Finally, we generated confusion matrices to assess the performance of each classifier-vectorizer combination. For this three-class problem, each matrix shows how many headlines of each actual class were assigned to each predicted class: correct predictions lie on the diagonal, and off-diagonal cells show which sentiments the model confuses with one another.

Sentiment analysis of financial news headlines can be valuable for various applications, such as market trend analysis, investment decision-making, and risk assessment. The techniques and algorithms demonstrated in this notebook can be further improved and customized based on specific requirements and domain knowledge.

For future work, we can explore other feature extraction techniques, experiment with different classification algorithms, and fine-tune the hyperparameters to enhance the performance of the sentiment analysis models.