## **Problem Statement**

### Business Context

The prices of the stocks of companies listed under a global exchange are influenced by a variety of factors, with the company's financial performance, innovations and collaborations, and market sentiment being factors that play a significant role. News and media reports can rapidly affect investor perceptions and, consequently, stock prices in the highly competitive financial industry. With the sheer volume of news and opinions from a wide variety of sources, investors and financial analysts often struggle to stay updated and accurately interpret its impact on the market. As a result, investment firms need sophisticated tools to analyze market sentiment and integrate this information into their investment strategies.

### Problem Definition

With an ever-rising number of news articles and opinions, an investment startup aims to leverage artificial intelligence to address the challenge of interpreting stock-related news and its impact on stock prices. They have collected historical daily news for a specific company listed under NASDAQ, along with data on its daily stock price and trade volumes.

As a member of the Data Science and AI team in the startup, you have been tasked with analyzing the data, developing an AI-driven sentiment analysis system that will automatically process and analyze news articles to gauge market sentiment, and summarizing the news at a weekly level to enhance the accuracy of their stock price predictions and optimize investment strategies. This will empower their financial analysts with actionable insights, leading to more informed investment decisions and improved client outcomes.

### Data Dictionary

* `Date` : The date the news was released
* `News` : The content of news articles that could potentially affect the company's stock price
* `Open` : The stock price (in \$) at the beginning of the day
* `High` : The highest stock price (in \$) reached during the day
* `Low` :  The lowest stock price (in \$) reached during the day
* `Close` : The adjusted stock price (in \$) at the end of the day
* `Volume` : The number of shares traded during the day
* `Label` : The sentiment polarity of the news content
    * 1: positive
    * 0: neutral
    * -1: negative

## **Installing and Importing Necessary Libraries**

In [None]:
# installing the sentence-transformers and gensim libraries for word embeddings
!pip install -U sentence-transformers gensim transformers tqdm -q

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

In [None]:
from sklearn.model_selection import train_test_split


## **Loading the dataset**

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
path='/content/drive/MyDrive/courses_AI_ML/LLMs and prompt engineering /NLP_project/stock_news.csv'

In [None]:
stock_news=pd.read_csv(path)

In [None]:
stock_news.head()

## **Data Overview**

In [None]:
df=stock_news.copy()

In [None]:
df.isnull().sum()

There are no missing values in the data.

In [None]:
df.duplicated().sum()

There are no duplicate rows in the dataset.

In [None]:
df.info()

In [None]:
df.columns

The `Date` and `News` columns have the `object` dtype.

In [None]:
df.shape

The data has 349 rows and 8 columns, so the dataset can be considered small.

In [None]:
df.describe().T

## **Exploratory Data Analysis**

**Utility Functions**

In [None]:
# function to create labeled barplots

def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [None]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot; a triangle marker indicates the mean value of the column
    # histogram, passing bins only when a value was supplied
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram

### Univariate Analysis

* Distribution of individual variables
* Compute and check the distribution of the length of news content

In [None]:
histogram_boxplot(df, 'Open',kde=True)

- The distribution of the opening prices is not normal.
- There are no opening prices around 60 in the data.
- The box plot reveals four outliers, all of them high opening prices above 60.
- Lower opening prices dominate: the majority fall within the range of 20 to 50.
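The outliers mentioned above are the points the box plot draws beyond its whiskers, which by default sit 1.5 times the interquartile range away from the quartiles. A minimal sketch of that rule, using made-up prices rather than the real `Open` column:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]

# Hypothetical opening prices (not taken from the dataset)
toy_open = [38, 40, 41, 42, 43, 44, 45, 46, 47, 66]
outliers = iqr_outliers(toy_open)
print(outliers)
```

Applied to the notebook's data, the same function could be called as `iqr_outliers(df['Open'])` to list the flagged prices instead of reading them off the plot.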

In [None]:
df.columns

In [None]:
histogram_boxplot(df, 'High',kde=True)

- The distribution is largely similar to that of the opening prices.
- There are still no observations around 60.

In [None]:
histogram_boxplot(df, 'Low',kde=True)

- Like the open and high features, the low prices exhibit the same patterns, with slightly different counts.
- The figure shows no prices close to 53, and prices around 45 have a low count.

In [None]:
histogram_boxplot(df, 'Close',kde=True)

The distribution of closing prices closely mirrors that of opening prices, which suggests that closing prices tend to stay near the opening levels.

In [None]:
histogram_boxplot(df, 'Volume',kde=True)

- The volume distribution is approximately normal with a right-skewed tail.
- One data point stands out as an outlier, at approximately 20 on the plot's volume axis.

In [None]:
labeled_barplot(df,'Label',perc=True)

The target variable is imbalanced. Class 0 comprises 48.7% of the data, class 1 comprises 28.4%, and class -1 comprises 22.9%
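To make the imbalance concrete, the sketch below computes the class shares and the "balanced" class weights, using the formula `n_samples / (n_classes * count)` that scikit-learn applies for `class_weight='balanced'`. The class counts are an assumption: they are reconstructed from the percentages above and the 349 rows, not read from the data.

```python
# Class counts reconstructed from the reported percentages of 349 rows
# (an assumption: 170 neutral, 99 positive, 80 negative).
counts = {0: 170, 1: 99, -1: 80}

n_samples = sum(counts.values())
n_classes = len(counts)

# Same heuristic as scikit-learn's class_weight='balanced':
# weight_c = n_samples / (n_classes * count_c)
weights = {label: n_samples / (n_classes * c) for label, c in counts.items()}
shares = {label: round(100 * c / n_samples, 1) for label, c in counts.items()}

print(shares)
print({label: round(w, 2) for label, w in weights.items()})
```

The rarest class receives the largest weight, so its misclassifications cost more during training.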

### Bivariate Analysis

* Correlation
* Sentiment Polarity vs Price
* Date vs Price

**Note**: The above points are listed to provide guidance on how to approach bivariate analysis. Analysis has to be done beyond the above listed points to get maximum scores.

In [None]:
plt.figure(figsize=(15,8))
sns.heatmap(df.select_dtypes(include=np.number).corr(),annot=True, vmin=-1, vmax=1)

- As suggested by the univariate analysis, the open, high, low, and close features are very strongly correlated, with correlation coefficients of approximately 1.
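Rather than reading the heatmap by eye, the strongly correlated pairs can be listed programmatically. The helper below is a sketch on toy data; `highly_correlated_pairs` and the 0.95 threshold are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.95):
    """List column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = frame.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Toy data (hypothetical): the prices move together, the volume does not
rng = np.random.default_rng(0)
base = np.linspace(40, 60, 50)
toy = pd.DataFrame({
    "Open": base,
    "High": base + 1.0,
    "Low": base - 1.0,
    "Close": base + rng.normal(0, 0.1, 50),
    "Volume": rng.normal(1e8, 1e7, 50),
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

On the real data this would be called as `highly_correlated_pairs(df.select_dtypes(include=np.number))`.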

In [None]:
sns.pairplot(df,hue='Label')

- The trading volume of stocks at opening and closing is generally similar  
- No consistent relationship was found between the trading **volume** of stocks and **their price** fluctuations throughout the day  

In [None]:
sns.boxplot(data=df,x='Label',y='Open')

- The negative label has an opening price median of around 43, with three outlier points.
- The median opening price and minimum opening price for labels 0 and 1 are the same, but the maximum opening price for the stocks differs.

## **Data Preprocessing**

**FEATURE ENGINEERING**

We have four stock prices: **open**, **high**, **low** and **close**. After the univariate analysis and an examination of the correlation coefficients, we observe that these features are nearly redundant. We therefore take **the mean** of these four features.

In [None]:
df1=df.copy()

In [None]:
df1['Mean_Price'] = df1[['Open', 'High','Low', 'Close']].mean(axis=1)

In [None]:
df1.head()

In [None]:
histogram_boxplot(df1, 'Mean_Price',kde=True)

The distribution of the mean price is quite similar to the distributions of the open, high, low and close features.

In [None]:
# Drop the original price features (open, high, low, and close), now summarized by Mean_Price
df1.drop(['Open', 'High','Low','Close'], axis=1, inplace=True)

In [None]:
df1.columns

For **feature engineering** of the '**Date**' feature, I will use **Cyclic Features**. This approach will help capture the periodic nature of days and months in a way that is meaningful for machine learning algorithms, especially when operating within the same year (2019)
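The reason cyclic encoding helps can be checked numerically: on the raw month numbers, December (12) and January (1) look far apart, while on the sine/cosine circle they are neighbours. A small sketch, where `encode_cyclic` is an illustrative helper name:

```python
import numpy as np

def encode_cyclic(value, period):
    """Map a periodic value to a point on the unit circle."""
    angle = 2 * np.pi * value / period
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jun = (encode_cyclic(m, 12) for m in (12, 1, 6))

# Distance in the encoded space vs. distance on the raw month numbers
dist_dec_jan = np.linalg.norm(dec - jan)
dist_dec_jun = np.linalg.norm(dec - jun)
print(f"encoded Dec-Jan: {dist_dec_jan:.3f}, encoded Dec-Jun: {dist_dec_jun:.3f}")
print(f"raw Dec-Jan: {abs(12 - 1)}, raw Dec-Jun: {abs(12 - 6)}")
```

Note that dividing the day by 31 is an approximation, since months have different lengths.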

In [None]:
# Convert the date column to datetime
df1['Date'] = pd.to_datetime(df1['Date'])

In [None]:
# Extract day and month
df1['day'] = df1['Date'].dt.day
df1['month'] = df1['Date'].dt.month

In [None]:
# Encode day as cyclic features
df1['day_sin'] = np.sin(2 * np.pi * df1['day'] / 31)
df1['day_cos'] = np.cos(2 * np.pi * df1['day'] / 31)

# Encode month as cyclic features
df1['month_sin'] = np.sin(2 * np.pi * df1['month'] / 12)
df1['month_cos'] = np.cos(2 * np.pi * df1['month'] / 12)

In [None]:
df1.drop(['Date','day','month'],axis=1,inplace=True)

In [None]:
df1.columns

In [None]:
df2=df1.copy()

In [None]:
X=df2.drop('Label',axis=1)
y=df2['Label']

In [None]:
x_temp,X_test,y_temp,y_test=train_test_split(X,y,test_size=0.2,random_state=1)

`X_test` and `y_test` are reserved for evaluating the final model. To avoid data leakage, I will leave them untouched and preprocess only `x_temp` and `y_temp`.
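Because the classes are imbalanced, it may also be worth stratifying the first split so that the test set keeps the same class mix; the split above does not pass `stratify`. The sketch below shows a stratified train/validation/test split on stand-in data. The label counts are an assumption reconstructed from the reported percentages.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 349 rows with the reported class mix
# (counts are an assumption reconstructed from the percentages)
y_toy = np.array([0] * 170 + [1] * 99 + [-1] * 80)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# Hold out the test set first; stratify keeps the class mix in each part
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=1, stratify=y_toy
)
# Then carve a validation set out of the remaining rows
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=1, stratify=y_tmp
)
print(len(y_tr), len(y_va), len(y_te))
```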

In [None]:
df2.tail()

## **Word Embeddings**

**APPROACH**:

I will adopt two approaches for word embedding:

- Use **Word2Vec** and **GloVe**.
- Use encoder transformers from **Sentence Transformers**.


At the end, I will evaluate each method.

#### ***Word2Vec*** approach

**preprocessing** the text data

In [None]:
x_temp1=x_temp.copy()

In [None]:
#Removing special characters

In [None]:
import re

In [None]:
# defining a function to remove special characters
def remove_special_characters(text):
    # Defining the regex pattern to match non-alphanumeric characters
    pattern = '[^A-Za-z0-9]+'

    # Replacing each run of non-alphanumeric characters with a single space
    new_text = re.sub(pattern, ' ', text)

    return new_text

In [None]:
# Applying the function to remove special characters
x_temp1['cleaned_news'] = x_temp1['News'].apply(remove_special_characters)

In [None]:
# checking a couple of instances of cleaned data
x_temp1[['News','cleaned_news']].head(4)  # the index is shuffled after the split, so select by position

In [None]:
#Lowercasing

In [None]:
# changing the case of the text data to lower case
x_temp1['cleaned_news'] = x_temp1['cleaned_news'].str.lower()

In [None]:
#Removing extra whitespace

In [None]:
# removing leading and trailing whitespace from the text
x_temp1['cleaned_news'] = x_temp1['cleaned_news'].str.strip()

In [None]:
#Removing stopwords

In [None]:
import nltk
nltk.download('stopwords')    # loading the stopwords
# nltk.download('punkt')    # loading the punkt module used in tokenization
# nltk.download('omw-1.4')    # dependency for tokenization
nltk.download('wordnet')    # WordNet data is used for lemmatization; the Porter stemmer below does not need it

In [None]:
# to remove common stop words
from nltk.corpus import stopwords

In [None]:
# defining a function to remove stop words using the NLTK library
# Build the stop-word set once; set membership tests are fast
english_stopwords = set(stopwords.words('english'))

def remove_stopwords(text):
    # Split text into separate words
    words = text.split()

    # Removing English language stopwords
    new_text = ' '.join([word for word in words if word not in english_stopwords])

    return new_text

In [None]:
# Applying the function to remove stop words using the NLTK library
x_temp1['cleaned_news_w_stpds'] = x_temp1['cleaned_news'].apply(remove_stopwords)

In [None]:
# checking a couple of instances of cleaned data
x_temp1[['cleaned_news','cleaned_news_w_stpds']].head(4)  # select by position, since the index is shuffled
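
The cleaning steps so far (special characters, lowercasing, trimming, stop words) can be summarized in one function. This is a sketch only: `TOY_STOPWORDS` is a tiny illustrative list, whereas the notebook uses NLTK's English list, and the sample sentence is invented.

```python
import re

# A tiny illustrative stop-word list (an assumption; the notebook uses NLTK's list)
TOY_STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "its"}

def clean_text(text, stop_words=TOY_STOPWORDS):
    """Strip non-alphanumerics, lowercase, trim, and drop stop words."""
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)   # special characters -> space
    text = text.lower().strip()                   # lowercase, trim the ends
    return " ".join(w for w in text.split() if w not in stop_words)

sample = "  Apple's Q1 revenue fell 5%, the company said in its report!  "
cleaned = clean_text(sample)
print(cleaned)
```

One side effect worth noticing: the possessive in "Apple's" leaves a stray token `s`, and "5%" loses its percent sign, so some financial meaning is discarded by this kind of cleaning.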

In [None]:
#Stemming

In [None]:
# to perform stemming
from nltk.stem.porter import PorterStemmer

In [None]:
# Loading the Porter Stemmer
ps = PorterStemmer()

In [None]:
# defining a function to perform stemming
def apply_porter_stemmer(text):
    # Split text into separate words
    words = text.split()

    # Applying the Porter Stemmer on every word of a message and joining the stemmed words back into a single string
    new_text = ' '.join([ps.stem(word) for word in words])

    return new_text

In [None]:
# Applying the function to perform stemming
x_temp1['final_cleaned_news'] = x_temp1['cleaned_news_w_stpds'].apply(apply_porter_stemmer)

In [None]:
x_temp1[['cleaned_news','final_cleaned_news']].head(4)  # select by position, since the index is shuffled

###### **Text Vectorization**

#### Word2Vec

In [None]:
# installing libraries to remove accented characters and use word embeddings
!pip install unidecode gensim -q
#!pip install --user unidecode gensim -q

In [None]:
# To import Word2Vec
from gensim.models import Word2Vec

In [None]:
# Creating a list of all words in our data
words_list = [item.split(" ") for item in x_temp1['final_cleaned_news'].values]

In [None]:
# Creating an instance of Word2Vec
vec_size = 300
model_W2V = Word2Vec(words_list, vector_size = vec_size, min_count = 1, window=5, workers = 6)

In [None]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(list(model_W2V.wv.key_to_index)))

In [None]:
# Retrieving the words present in the Word2Vec model's vocabulary
words = list(model_W2V.wv.key_to_index.keys())

# Retrieving word vectors for all the words present in the model's vocabulary
wvs = model_W2V.wv[words].tolist()

# Creating a dictionary of words and their corresponding vectors
word_vector_dict = dict(zip(words, wvs))

In [None]:
def average_vectorizer_Word2Vec(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in word_vector_dict]  # dict lookup is O(1)

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector
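
To see what the averaging does, here is a self-contained sketch with hypothetical 3-dimensional vectors instead of the trained Word2Vec model. It shows two properties of this representation: unknown words are skipped, and a document with no known words maps to the zero vector.

```python
import numpy as np

def average_vector(doc, vectors, dim):
    """Average the vectors of in-vocabulary tokens; zeros if none are known."""
    known = [vectors[w] for w in doc.split() if w in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

# Hypothetical 3-dimensional embeddings, for illustration only
toy_vectors = {
    "stock": np.array([1.0, 0.0, 2.0]),
    "rise": np.array([3.0, 2.0, 0.0]),
}

doc_vec = average_vector("stock rise unknownword", toy_vectors, dim=3)
empty_vec = average_vector("unknownword", toy_vectors, dim=3)
print(doc_vec, empty_vec)
```

Averaging also discards word order, so "stock rise" and "rise stock" receive the same vector.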

In [None]:
# creating a dataframe of the vectorized documents
df_Word2Vec = pd.DataFrame(x_temp1['final_cleaned_news'].apply(average_vectorizer_Word2Vec).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Word2Vec

In [None]:
df_Word2Vec.shape

In [None]:
x_temp1.shape

In [None]:
df_Word2Vec.columns


In [None]:
x_temp1.head()

In [None]:
# Ensure indexes are aligned
df_Word2Vec.index = x_temp1.index

In [None]:
x_final = pd.concat([x_temp1, df_Word2Vec], axis=1)
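
The index alignment step matters because `pd.concat(..., axis=1)` joins on index labels, not on row position. The toy sketch below (made-up values and labels) shows what happens when a shuffled index meets a fresh `0..n-1` index.

```python
import pandas as pd

# After a shuffled split, the original row labels are kept (toy labels here)
left = pd.DataFrame({"Volume": [10, 20, 30]}, index=[7, 12, 5])
# A frame built from a plain list gets a fresh 0..n-1 index
right = pd.DataFrame({"Feature 0": [0.1, 0.2, 0.3]})

misaligned = pd.concat([left, right], axis=1)   # outer join on index labels
right.index = left.index                         # align positions explicitly
aligned = pd.concat([left, right], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

Without the alignment, the result has extra rows filled with `NaN`; with it, each row keeps its own features.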

In [None]:
x_final.shape

In [None]:
x_final.head()

In [None]:
x_final.columns

In [None]:
x_final1=x_final.copy()

In [None]:
x_final1.drop(['News','cleaned_news','cleaned_news_w_stpds','final_cleaned_news'],axis=1,inplace=True)

In [None]:
x_final1.columns

#### GloVe

In [None]:
# Converting the Stanford GloVe model vector format to word2vec
from gensim.scripts.glove2word2vec import glove2word2vec
glove_input_file = '/content/drive/MyDrive/courses_AI_ML/LLMs and prompt engineering /NLP_project/glove.6B.100d.txt'
word2vec_output_file = 'glove.6B.100d.txt.word2vec'
glove2word2vec(glove_input_file, word2vec_output_file)

In [None]:
from gensim.models import KeyedVectors
# load the Stanford GloVe model
filename = 'glove.6B.100d.txt.word2vec'
glove_model = KeyedVectors.load_word2vec_format(filename, binary=False)

In [None]:
# Checking the size of the vocabulary
print("Length of the vocabulary is", len(glove_model.index_to_key))

In [None]:
vec_size=100 # The GloVe model we used (glove.6B.100d.txt) provides word vectors with 100 dimensions

In [None]:
glove_words = glove_model.index_to_key

In [None]:
glove_word_vector_dict = dict(zip(glove_model.index_to_key,list(glove_model.vectors)))

In [None]:
def average_vectorizer_GloVe(doc):
    # Initializing a feature vector for the sentence
    feature_vector = np.zeros((vec_size,), dtype="float64")

    # Creating a list of words in the sentence that are present in the model vocabulary
    words_in_vocab = [word for word in doc.split() if word in glove_word_vector_dict]  # dict lookup instead of scanning a 400k-word list

    # adding the vector representations of the words
    for word in words_in_vocab:
        feature_vector += np.array(glove_word_vector_dict[word])

    # Dividing by the number of words to get the average vector
    if len(words_in_vocab) != 0:
        feature_vector /= len(words_in_vocab)

    return feature_vector

In [None]:
# creating a dataframe of the vectorized documents
df_Glove = pd.DataFrame(x_temp1['final_cleaned_news'].apply(average_vectorizer_GloVe).tolist(), columns=['Feature '+str(i) for i in range(vec_size)])
df_Glove.head()
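
One caveat with this step: pretrained GloVe vectors are keyed by surface word forms, while `final_cleaned_news` contains Porter stems, so stems such as `compani` may be missing from the vocabulary. A sketch for measuring this, with a hypothetical toy vocabulary standing in for `glove_model.key_to_index`:

```python
def vocab_coverage(docs, vocab):
    """Fraction of tokens across docs that appear in vocab."""
    tokens = [tok for doc in docs for tok in doc.split()]
    if not tokens:
        return 0.0
    return sum(tok in vocab for tok in tokens) / len(tokens)

# Hypothetical vocabulary of surface forms, standing in for glove_model.key_to_index
toy_vocab = {"company", "companies", "shares", "share", "trading", "rise"}

raw_docs = ["company shares rise", "trading companies"]
stemmed_docs = ["compani share rise", "trade compani"]   # Porter-style stems

raw_cov = vocab_coverage(raw_docs, toy_vocab)
stem_cov = vocab_coverage(stemmed_docs, toy_vocab)
print(f"raw: {raw_cov:.2f}, stemmed: {stem_cov:.2f}")
```

If the coverage on the real data turns out to be low for the stemmed column, vectorizing `cleaned_news_w_stpds` (the unstemmed text) would be a reasonable thing to try.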

In [None]:
df_Glove.shape

In [None]:
# align the two datasets
df_Glove.index = x_temp1.index

In [None]:
# concatenate the two datasets
df_Glove_final = pd.concat([x_temp1, df_Glove], axis=1)

In [None]:
df_Glove_final.shape

In [None]:
df_Glove_final.head()

In [None]:
df_Glove_final.columns

In [None]:
df_G_final2=df_Glove_final.copy()

In [None]:
# drop the unnecessary columns
df_G_final2.drop(['News','cleaned_news','cleaned_news_w_stpds','final_cleaned_news'],axis=1,inplace=True)

In [None]:
df_G_final2.shape

## **Sentiment Analysis**

In [None]:
from sklearn.metrics import confusion_matrix

In [None]:
# Creating a function to plot the confusion matrix
def plot_confusion_matrix(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    plt.figure(figsize=(5, 4))
    label_list = ['negative', 'neutral', 'positive']
    sns.heatmap(cm, annot=True, fmt='.0f', xticklabels=label_list, yticklabels=label_list)
    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()


#### using word2vec vectors

In [None]:
x_train,x_val,y_train,y_val=train_test_split(x_final1,y_temp,test_size=0.2,random_state=1, stratify=y_temp)

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Building the model
rf_word2vec = RandomForestClassifier(n_estimators = 100, max_depth = 7, random_state = 1)

# Fitting on train data
rf_word2vec.fit(x_train, y_train)

In [None]:
# Predicting on train data
y_pred_train = rf_word2vec.predict(x_train)

# Predicting on validation data
y_pred_val = rf_word2vec.predict(x_val)

In [None]:
plot_confusion_matrix(y_train, y_pred_train)  # arguments are (actual, predicted)

The model predicted all of our labels correctly for the training data. Let's look at the validation data

In [None]:
plot_confusion_matrix(y_val, y_pred_val)

The model fails to classify any of the positive samples correctly on the validation set. Combined with the perfect training score, this indicates that the model is overfitting.

Let's try tuning the hyperparameters.
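A confusion matrix can be condensed into per-class recall, which makes a "never predicted" class obvious even when overall accuracy looks acceptable. The matrix below is hypothetical, chosen only to mimic the situation described above; it is not the notebook's output.

```python
import numpy as np

def per_class_recall(cm):
    """Recall per class for a confusion matrix with actual classes on rows."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)

# Hypothetical validation confusion matrix (rows/cols: negative, neutral, positive)
toy_cm = [[6, 6, 0],
          [3, 24, 0],
          [2, 11, 0]]

recall = per_class_recall(toy_cm)
accuracy = np.trace(np.asarray(toy_cm)) / np.sum(toy_cm)
print(recall.round(2), round(float(accuracy), 2))
```

In this toy case accuracy is above 50% although the recall of the positive class is zero, which is why accuracy alone is a weak guide here.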

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.class_weight import compute_class_weight


# Balance class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weights_dict = dict(zip(np.unique(y_train), class_weights))

# Define the parameter grid
param_grid = {
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=1000, num=10)],
    'max_depth': [int(x) for x in np.linspace(10, 110, num=11)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],  # 'auto' (an alias of 'sqrt' for classifiers) was removed in recent scikit-learn versions
    'bootstrap': [True, False],
}
# Initialize the Random Forest model with class weights
rf = RandomForestClassifier(random_state=1, class_weight=class_weights_dict)

# Initialize the Randomized Search
random_search = RandomizedSearchCV(estimator=rf, param_distributions=param_grid, n_iter=100, cv=5, verbose=2, random_state=1, n_jobs=-1)

# Fit the Randomized Search to the data
random_search.fit(x_train, y_train)

# Get the best parameters
best_params = random_search.best_params_
print("Best parameters found: ", best_params)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


In [None]:
# Evaluate on validation set
y_pred_v = random_search.predict(x_val)
plot_confusion_matrix(y_val, y_pred_v)

In [None]:
y_val.value_counts()

Even with hyperparameter tuning, the model still struggles to predict the positive class. This is likely due to the small dataset and the small positive class, which has only 13 samples. Let's try different ensemble models.

But first, let's use the **SMOTE** technique to augment the training data.

In [None]:
from imblearn.over_sampling import SMOTE

# Initialize SMOTE
smote = SMOTE(sampling_strategy='minority', random_state=1)

# Fit and apply SMOTE to your training data
X_res, y_res = smote.fit_resample(x_train, y_train)

In [None]:
# Check the distribution of the resampled training data
from collections import Counter
print("Original training set distribution:", Counter(y_train))
print("Resampled training set distribution:", Counter(y_res))

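With `sampling_strategy='minority'`, SMOTE oversamples only the smallest class until it matches the majority class; the printed counts show the distribution before and after resampling.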
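The core idea behind SMOTE is interpolation: a synthetic sample is placed at a random position on the segment between a minority-class sample and one of its minority-class neighbours. The sketch below illustrates only that step with toy values. It is a simplification, not imblearn's implementation, which also selects the neighbour among the k nearest ones.

```python
import numpy as np

def smote_like_sample(x, neighbor, rng):
    """Create one synthetic point on the segment between x and a neighbor."""
    lam = rng.uniform(0.0, 1.0)            # random position along the segment
    return x + lam * (neighbor - x), lam

rng = np.random.default_rng(1)
x = np.array([1.0, 2.0])                    # a minority-class sample (toy values)
neighbor = np.array([3.0, 6.0])             # one of its minority-class neighbours

synthetic, lam = smote_like_sample(x, neighbor, rng)
print(synthetic, round(lam, 3))
```

Because the synthetic points are built from existing samples, resampling should be fitted on the training split only, as done above, so that no validation information leaks into training.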




In [None]:
# Train your RandomForest model on the resampled data
rf = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=1)
rf.fit(X_res, y_res)

# Predict on the validation set
y_pred = rf.predict(x_val)

In [None]:
plot_confusion_matrix(y_val, y_pred)

With data augmentation, the model improved, but it still did not reach the goal.

###### **Ensemble Technique models**

In [None]:
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix

# Initialize individual models
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=1)
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
xgb = XGBClassifier(n_estimators=100, max_depth=3, use_label_encoder=False, eval_metric='mlogloss', random_state=1)

# Combine them using a voting classifier
ensemble = VotingClassifier(estimators=[
    ('gb', gb),
    ('ada', ada),
    ('xgb', xgb)
], voting='soft')  # 'hard' = majority vote on predicted labels, 'soft' = argmax of the averaged predicted probabilities

# Fit the ensemble model
ensemble.fit(x_train, y_train)

# Predict on the validation set
y_pred1 = ensemble.predict(x_val)


In [None]:
# Evaluate the model
plot_confusion_matrix(y_val, y_pred1)

These models still struggle.

#### using GloVe vectors

We used Word2Vec vectors, but the models struggled with the classification task, especially for the positive class. Let's try GloVe vectors and see whether the models can better capture the patterns in the news text.

In [None]:
df_G_final2.columns

In [None]:
x_traing,x_valg,y_traing,y_valg=train_test_split(df_G_final2,y_temp,test_size=0.2,random_state=1, stratify=y_temp)

In [None]:
# Initialize individual models
rf = RandomForestClassifier(n_estimators=100, max_depth=7, random_state=1, class_weight='balanced')
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=1)
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
xgb = XGBClassifier(n_estimators=100, max_depth=3, use_label_encoder=False, eval_metric='mlogloss', random_state=1)

# Combine them using a voting classifier
ensembleg = VotingClassifier(estimators=[
    ('rf', rf),
    ('gb', gb),
    ('ada', ada),
    ('xgb', xgb)
], voting='hard')  # 'hard' = majority vote on predicted labels, 'soft' = argmax of the averaged predicted probabilities

# Fit the ensemble model
ensembleg.fit(x_traing, y_traing)

# Predict on the validation set
y_predg = ensembleg.predict(x_valg)


In [None]:
plot_confusion_matrix(y_valg, y_predg)

In [None]:
# Combine them using a voting classifier
ensemblegs = VotingClassifier(estimators=[
    ('rf', rf),
    ('gb', gb),
    ('ada', ada),
    ('xgb', xgb)
], voting='soft')  # 'hard' = majority vote on predicted labels, 'soft' = argmax of the averaged predicted probabilities

# Fit the ensemble model
ensemblegs.fit(x_traing, y_traing)

# Predict on the validation set
y_predg = ensemblegs.predict(x_valg)


In [None]:
plot_confusion_matrix(y_valg, y_predg)

**The voting classifier** with **hard** voting performs better than with **soft** voting on this validation set. However, the classification is still not satisfactory: the same kinds of misclassification occur. Nonetheless, the **GloVe vectors** appear to capture **the patterns** in this news text better than the **Word2Vec** vectors.
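Hard and soft voting can disagree on the same sample, which is one way the two ensembles end up with different confusion matrices. A toy illustration with made-up probabilities from three hypothetical models:

```python
import numpy as np

# Hypothetical class probabilities from three classifiers for one sample
# (columns: negative, neutral, positive)
probas = np.array([
    [0.10, 0.50, 0.40],   # model A votes neutral
    [0.05, 0.50, 0.45],   # model B votes neutral
    [0.05, 0.05, 0.90],   # model C votes positive, with high confidence
])

# Hard voting: each model casts one vote for its top class
votes = probas.argmax(axis=1)
hard_winner = np.bincount(votes, minlength=probas.shape[1]).argmax()

# Soft voting: average the probabilities, then take the top class
soft_winner = probas.mean(axis=0).argmax()

print(hard_winner, soft_winner)
```

Here hard voting picks neutral (two votes to one), while soft voting picks positive because one confident model outweighs two hesitant ones.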

In [None]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

# Fit the scaler on the training data and transform both training and validation sets
X_train_scaled = scaler.fit_transform(x_traing)
X_val_scaled = scaler.transform(x_valg)

# Now, refit the earlier soft-voting ensemble (gb, ada, xgb) on the scaled GloVe features
ensemble.fit(X_train_scaled, y_traing)

# Predict on the validation set
y_pred_scaled = ensemble.predict(X_val_scaled)

In [None]:
plot_confusion_matrix(y_valg, y_pred_scaled)

Even with standardization, the confusion matrix shows that the model still does not perform as expected. This is unsurprising for tree-based models, whose splits depend only on the ordering of the feature values. After **trying standardization, different techniques, and ensembles**, the remaining suspect is the method used to convert the text to vectors. At this point, I will **use** an **encoder transformer**, a state-of-the-art method for capturing more of the meaning in the text.
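The lack of improvement from standardization can be checked directly: a tree split of the form `feature <= t` depends only on how the samples are ordered, and standardization preserves that order. A small sketch on random toy values:

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(loc=50.0, scale=10.0, size=20)    # toy feature values
scaled = (feature - feature.mean()) / feature.std()     # standardized copy

# A tree split "feature <= t" partitions samples by rank order only.
# Standardization is a monotonic map, so the ordering is unchanged.
same_order = np.array_equal(np.argsort(feature), np.argsort(scaled))

# Any threshold on the raw scale has an equivalent one on the scaled scale
t_raw = np.median(feature)
t_scaled = (t_raw - feature.mean()) / feature.std()
same_partition = np.array_equal(feature <= t_raw, scaled <= t_scaled)
print(same_order, same_partition)
```

Scaling matters for distance-based or gradient-based models, such as the neural network used later, but not for the tree ensembles above.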

#### LSTM model classifier

In [None]:
df3=df1.copy()

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Concatenate, Dropout
from sklearn.preprocessing import StandardScaler

In [None]:
# Tokenize and pad the news column with the BERT WordPiece tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
max_length = 100
encoded = tokenizer(
    df3['News'].tolist(),
    padding='max_length',
    truncation=True,
    max_length=max_length,
    return_tensors='np',
)
text_padded = encoded['input_ids']

# Standardize numerical features
scaler = StandardScaler()
numerical_features = df3[['Volume', 'Mean_Price']]
numerical_scaled = scaler.fit_transform(numerical_features)

# One-hot encode the labels (columns ordered -1, 0, 1)
labels = pd.get_dummies(df3['Label']).values.astype('float32')

In [None]:


# Define model inputs
text_input = Input(shape=(max_length,))
num_input = Input(shape=(numerical_scaled.shape[1],))

# Text processing with LSTM
embedding = Embedding(input_dim=30522, output_dim=128, input_length=max_length)(text_input)
lstm = LSTM(units=128, return_sequences=True)(embedding)
lstm = LSTM(units=64)(lstm)
lstm = Dropout(0.5)(lstm)

# Combine text and numerical features
combined = Concatenate()([lstm, num_input])
combined = Dense(64, activation='relu')(combined)
combined = Dropout(0.5)(combined)
output = Dense(3, activation='softmax')(combined)

# Define the model
model = Model(inputs=[text_input, num_input], outputs=output)

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()

from sklearn.model_selection import train_test_split

# Split data into training and validation sets
X_train_text, X_val_text, X_train_num, X_val_num, y_train, y_val = train_test_split(
    text_padded, numerical_scaled, labels, test_size=0.2, random_state=42
)

# Train the model
history = model.fit(
    [X_train_text, X_train_num], y_train,
    epochs=10, batch_size=32,
    validation_data=([X_val_text, X_val_num], y_val)
)

# Evaluate the model
loss, accuracy = model.evaluate([X_val_text, X_val_num], y_val)
print(f'Validation Accuracy: {accuracy}')


In [None]:
df3.columns

In [None]:



text_data = df3['News']
num_data = df3.drop(columns=['News', 'Label'])
y = df3['Label']

# Tokenize the text data
tokenizer = Tokenizer(num_words=5000)  # Adjust num_words based on your dataset size
tokenizer.fit_on_texts(text_data)
text_sequences = tokenizer.texts_to_sequences(text_data)

# Pad the sequences to ensure equal length
max_length = 100  # Based on your data analysis
text_padded = pad_sequences(text_sequences, maxlen=max_length)

# Scale the numerical features
scaler = StandardScaler()
num_scaled = scaler.fit_transform(num_data)

# Combine text and numerical data
X_combined = np.hstack((text_padded, num_scaled))

# One-Hot Encode the target variable
y_one_hot = to_categorical(y + 1)  # Shift labels to 0, 1, 2 for one-hot encoding

# Ensure the target variable has the same number of samples as the input data
assert len(X_combined) == len(y_one_hot), "Mismatch between input data and target variable samples."

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_combined, y_one_hot, test_size=0.2, random_state=1)
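
For readers unfamiliar with the Keras utilities used above, the pure-Python sketch below mimics their default behaviour in simplified form: words are indexed by frequency starting at 1, index 0 is reserved for padding, and sequences are padded and truncated at the front. It skips the punctuation filtering and the `num_words` cap of the real `Tokenizer`; `fit_word_index` and `to_padded` are illustrative names.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is padding)."""
    counts = Counter(w for t in texts for w in t.lower().split())
    ranked = [w for w, _ in counts.most_common()]
    return {w: i + 1 for i, w in enumerate(ranked)}

def to_padded(texts, word_index, maxlen):
    """Map words to ids, drop unknown words, then left-pad / left-truncate."""
    rows = []
    for t in texts:
        ids = [word_index[w] for w in t.lower().split() if w in word_index]
        ids = ids[-maxlen:]                           # keep the last maxlen ids
        rows.append([0] * (maxlen - len(ids)) + ids)  # zeros go in front
    return rows

toy_texts = ["stock rises", "stock falls after weak report"]
index = fit_word_index(toy_texts)
padded = to_padded(toy_texts, index, maxlen=4)
print(index)
print(padded)
```

Note that the longer text loses its first word: with `max_length = 100`, any article longer than 100 tokens is cut from the beginning.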


In [None]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Dropout, Concatenate
from tensorflow.keras.optimizers import Adam

# Define inputs
text_input = Input(shape=(max_length,))
num_input = Input(shape=(num_scaled.shape[1],))

# Text processing with LSTM
x = Embedding(input_dim=5000, output_dim=128, input_length=max_length)(text_input)
x = LSTM(units=128, return_sequences=True)(x)
x = Dropout(0.5)(x)
x = LSTM(units=64)(x)
x = Dropout(0.5)(x)

# Combine text and numerical features
combined = Concatenate()([x, num_input])

# Add dense layers
combined = Dense(64, activation='relu')(combined)
combined = Dropout(0.5)(combined)
output = Dense(3, activation='softmax')(combined)  # Three output units for three classes

# Define the model
model = Model(inputs=[text_input, num_input], outputs=output)

# Compile the model with categorical crossentropy loss and accuracy as the metric
model.compile(optimizer=Adam(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])

# Print model summary
model.summary()


In [None]:
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback

class F1ScoreCallback(Callback):
    def on_epoch_end(self, epoch, logs=None):
        val_predict = np.argmax(self.model.predict([X_val[:, :max_length], X_val[:, max_length:]]), axis=1)
        val_targ = np.argmax(y_val, axis=1)
        _val_f1 = f1_score(val_targ, val_predict, average='weighted')
        print(f' — val_f1: {_val_f1:.4f}')

# Train the model with the F1-Score callback
history = model.fit(
    [X_train[:, :max_length], X_train[:, max_length:]],
    y_train,
    epochs=10,
    batch_size=32,
    validation_data=(
        [X_val[:, :max_length], X_val[:, max_length:]],
        y_val
    ),
    callbacks=[F1ScoreCallback()]
)
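
The weighted F1 score reported by the callback is the average of the per-class F1 scores, weighted by how many true samples each class has. The hand-rolled sketch below follows that definition on toy labels, to show how a class that is never predicted affects the score.

```python
def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    labels = sorted(set(y_true))
    total = 0.0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        support = tp + fn
        total += f1 * support
    return total / len(y_true)

# Toy labels (hypothetical): class 2 is never predicted
y_true_toy = [0, 0, 1, 1, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 1, 1, 1, 0]
score = weighted_f1(y_true_toy, y_pred_toy)
print(round(score, 4))
```

Because the weighting favours large classes, a macro-averaged F1 (`average='macro'`) penalizes a missing minority class more strongly and may be the more informative metric for this dataset.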


In [None]:
import numpy as np
from sklearn.metrics import f1_score
from tensorflow.keras.callbacks import Callback
import matplotlib.pyplot as plt

class F1ScoreCallback(Callback):
    def __init__(self):
        super().__init__()
        self.f1_scores = []

    def on_epoch_end(self, epoch, logs=None):
        val_predict = np.argmax(self.model.predict([X_val[:, :max_length], X_val[:, max_length:]]), axis=1)
        val_targ = np.argmax(y_val, axis=1)
        _val_f1 = f1_score(val_targ, val_predict, average='weighted')
        self.f1_scores.append(_val_f1)
        print(f' — val_f1: {_val_f1:.4f}')

# Initialize the callback
f1_callback = F1ScoreCallback()

# Train the model with the F1-Score callback
history = model.fit(
    [X_train[:, :max_length], X_train[:, max_length:]],
    y_train,
    epochs=10,
    batch_size=32,
    validation_data=(
        [X_val[:, :max_length], X_val[:, max_length:]],
        y_val
    ),
    callbacks=[f1_callback]
)

# Plotting the F1 scores
plt.plot(f1_callback.f1_scores)
plt.title('F1 Score per Epoch')
plt.xlabel('Epoch')
plt.ylabel('F1 Score')
plt.show()


## **Weekly News Summarization**

**Important Note**: It is recommended to run this section of the project independently from the previous sections in order to avoid runtime crashes due to RAM overload.

#### Installing and Importing the necessary libraries

In [None]:
# Installation for GPU llama-cpp-python
# run the following line when a GPU is being used (comment it out otherwise)
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.45 --force-reinstall --no-cache-dir -q

# Installation for CPU llama-cpp-python
# uncomment and run the following code in case GPU is not being used
# !CMAKE_ARGS="-DLLAMA_CUBLAS=off" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.45 --force-reinstall --no-cache-dir -q

In [None]:
# For downloading the models from HF Hub
!pip install huggingface_hub==0.20.3 -q

In [None]:
# Function to download the model from the Hugging Face model hub
from huggingface_hub import hf_hub_download

# Importing the Llama class from the llama_cpp module
from llama_cpp import Llama

# pandas is needed below when this section is run independently
import pandas as pd

# Progress bars, including progress_apply on pandas objects
from tqdm import tqdm
tqdm.pandas()

In [None]:
import torch

In [None]:
# setting the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

#### Loading the model

#### Llama

In [None]:
model_name_or_path = "TheBloke/Llama-2-13B-chat-GGUF"
model_basename = "llama-2-13b-chat.Q5_K_M.gguf" # the model is in gguf format

In [None]:
# Using hf_hub_download to download a model from the Hugging Face model hub
# The repo_id parameter specifies the model name or path in the Hugging Face repository
# The filename parameter specifies the name of the file to download
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

In [None]:
lcpp_llm = Llama(
    model_path=model_path,
    n_threads=2,  # CPU cores
    n_batch=512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
    # n_gpu_layers=43,  # uncomment and change this value based on GPU VRAM pool.
    n_ctx=4096,  # Context window
)

#### Mistral

In [None]:
model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GGUF"
model_basename = "mistral-7b-instruct-v0.2.Q6_K.gguf"

In [None]:
model_path = hf_hub_download(
    repo_id=model_name_or_path,
    filename=model_basename
)

In [None]:
llm = Llama(
    model_path=model_path,
    n_ctx=4024,
)

In [None]:
reviws.head()

In [None]:
data=reviws.copy()

#### Aggregating the data weekly

In [None]:
data["Date"] = pd.to_datetime(data['Date'])  # Convert the 'Date' column to datetime format.

In [None]:
# Group the data by week using the 'Date' column.
weekly_grouped = data.groupby(pd.Grouper(key='Date', freq='W'))

In [None]:
weekly_grouped = weekly_grouped.agg(
    {
        'News': lambda x: ' || '.join(x)  # Join the news values with ' || ' separator.
    }
).reset_index()

print(weekly_grouped.shape)

In [None]:
weekly_grouped

In [None]:
weekly_grouped.News[1]

In [None]:
# creating a copy of the data
data_1=weekly_grouped.copy()

#### Summarization

**Note**:

- The model is expected to summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.

- As an output, the model is expected to return a JSON containing two keys, one for Positive Events and one for Negative Events.
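
The expected output shape can be sketched as below. This is only an illustration: the key names `Positive Events` and `Negative Events`, the helper `validate_summary`, and the sample reply are assumptions made up for this example, since the note does not fix the exact key names.

```python
import json

# Hypothetical key names; the note only asks for one key for positive
# events and one for negative events.
EXPECTED_KEYS = ("Positive Events", "Negative Events")

def validate_summary(raw_text, max_events=3):
    """Parse a model reply and keep at most `max_events` items per key.

    Returns a dict with both expected keys, or None if the reply is not
    a JSON object that contains them.
    """
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(parsed, dict):
        return None
    if any(key not in parsed for key in EXPECTED_KEYS):
        return None

    summary = {}
    for key in EXPECTED_KEYS:
        events = parsed[key]
        if not isinstance(events, list):
            events = [events]  # tolerate a single string instead of a list
        summary[key] = events[:max_events]
    return summary

# Made-up reply, used only to show the shape
sample_reply = (
    '{"Positive Events": ["Record quarterly revenue"], '
    '"Negative Events": ["Supplier cuts forecast", "Regulatory probe opened"]}'
)
print(validate_summary(sample_reply))
```

A reply that fails this check can be logged and retried rather than silently stored.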

For the project, we need to define the prompt to be fed to the LLM to help it understand the task to perform. The following should be the components of the prompt:

1. **Role**: Specifies the role the LLM will be taking up to perform the specified task, along with any specific details regarding the role

  - **Example**: `You are an expert data analyst specializing in news content analysis.`

2. **Task**: Specifies the task to be performed and outlines what needs to be accomplished, clearly defining the objective

  - **Example**: `Analyze the provided news headline and return the main topics contained within it.`

3. **Instructions**: Provides detailed guidelines on how to perform the task, which includes steps, rules, and criteria to ensure the task is executed correctly

  - **Example**:

```
Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.
```

4. **Output Format**: Specifies the format in which the final response should be structured, ensuring consistency and clarity in the generated output

  - **Example**: `Return the output in JSON format with keys as the topic number and values as the actual topic.`

**Full Prompt Example**:

```
You are an expert data analyst specializing in news content analysis.

Task: Analyze the provided news headline and return the main topics contained within it.

Instructions:
1. Read the news headline carefully.
2. Identify the main subjects or entities mentioned in the headline.
3. Determine the key events or actions described in the headline.
4. Extract relevant keywords that represent the topics.
5. List the topics in a concise manner.

Return the output in JSON format with keys as the topic number and values as the actual topic.
```

**Sample Output**:

`{"1": "Politics", "2": "Economy", "3": "Health" }`
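
As a minimal sketch, the four components can be assembled programmatically so that each one can be changed on its own while iterating on the prompt. The helper name `build_prompt` and its parameters are hypothetical; the strings are taken from the example above.

```python
def build_prompt(role, task, instructions, output_format):
    """Join the four prompt components into a single prompt string."""
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(instructions, start=1)
    )
    return (
        f"{role}\n\n"
        f"Task: {task}\n\n"
        f"Instructions:\n{numbered}\n\n"
        f"{output_format}"
    )

example_prompt = build_prompt(
    role="You are an expert data analyst specializing in news content analysis.",
    task="Analyze the provided news headline and return the main topics contained within it.",
    instructions=[
        "Read the news headline carefully.",
        "Identify the main subjects or entities mentioned in the headline.",
        "Determine the key events or actions described in the headline.",
        "Extract relevant keywords that represent the topics.",
        "List the topics in a concise manner.",
    ],
    output_format=(
        "Return the output in JSON format with keys as the topic number "
        "and values as the actual topic."
    ),
)
print(example_prompt)
```

The same helper could be reused for the weekly summarization prompt by swapping in a different role, task, instruction list, and output format.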

##### Utility Functions

In [None]:
# defining a function to parse the JSON output from the model
def extract_json_data(json_str):
    import json
    try:
        # Find the indices of the opening and closing curly braces
        json_start = json_str.find('{')
        json_end = json_str.rfind('}')

        if json_start != -1 and json_end != -1:
            extracted_category = json_str[json_start:json_end + 1]  # Extract the JSON object
            data_dict = json.loads(extracted_category)
            return data_dict
        else:
            print(f"Warning: JSON object not found in response: {json_str}")
            return {}
    except json.JSONDecodeError as e:
        print(f"Error parsing JSON: {e}")
        return {}

##### Defining the response function

In [None]:
#Defining the response function
def response_mistral_1(prompt, news):
    model_output = llm(
      f"""
      [INST]
      {prompt}
      News Articles: {news}
      [/INST]
      """,
      max_tokens=1024,  # maximum number of tokens the model may generate
      temperature=0.1,  # low temperature for more deterministic output
      top_p=0.95,  # nucleus sampling threshold
      top_k=50,  # sample only from the 50 most likely tokens
      echo=False,
    )

    final_output = model_output["choices"][0]["text"]

    return final_output

In [None]:
# Defining the response function for Task 1 (short sentiment label).
def response_1(prompt,review):
    model_output = llm(
      f"""
      Q: {prompt}
      Review: {review}
      A:
      """,
      max_tokens=32,
      stop=["Q:", "\n"],
      temperature=0.01,
      echo=False,
    )

    temp_output = model_output["choices"][0]["text"]

    return temp_output

##### Checking the model output on a sample

**Note**: Use this section to test out the prompt with one instance before using it for the entire weekly data.

In [None]:
# Define the prompt
prompt = """
    You are an AI analyzing news articles about stock prices. Classify the sentiment of the provided news article into the following categories:
    - Positive
    - Negative
    - Neutral
"""

# Use the first row of the news column
first_news_article = data_1['News'].iloc[0]

# Get the response
response = response_1(prompt, first_news_article)
print("First news article sentiment:", response)

In [None]:
instruction_1 = """
    You are an AI analyzing news articles about stock prices. Classify the sentiment of the provided news article into one of the following categories:
    - Positive
    - Negative
    - Neutral
    """


In [None]:
# Store the sentiment responses in 'model_response', the column the following cells read
data_1['model_response'] = data_1['News'].apply(lambda x: response_mistral_1(instruction_1, x))

In [None]:
data_1.head()

In [None]:
print(data_1.loc[4, 'model_response'])

In [None]:
data_1['model_response'][0]

In [None]:

# Function to extract and group sentiments into separate columns
def parse_and_group_sentiments(model_response):
    sentiment_dict = {'Positive': 0, 'Negative': 0, 'Neutral': 0}

    # Split the response into individual parts
    parts = model_response.split('      ')
    for part in parts:
        # Detect sentiment and corresponding text
        if 'Positive:' in part:
            sentiment_dict['Positive'] += 1
        elif 'Negative:' in part:
            sentiment_dict['Negative'] += 1
        elif 'Neutral:' in part:
            sentiment_dict['Neutral'] += 1

    return pd.Series([sentiment_dict['Positive'], sentiment_dict['Negative'], sentiment_dict['Neutral']], index=['Positive', 'Negative', 'Neutral'])




In [None]:
# Improved function to parse and group sentiments into separate columns
def parse_and_group_sentiments(model_response):
    sentiment_dict = {'Positive': 0, 'Negative': 0, 'Neutral': 0}

    # Split the response into individual parts
    parts = model_response.split('\n')

    for part in parts:
        # Detect sentiment and corresponding text
        part_lower = part.lower()
        if 'positive' in part_lower:
            sentiment_dict['Positive'] += 1
        elif 'negative' in part_lower:
            sentiment_dict['Negative'] += 1
        elif 'neutral' in part_lower:
            sentiment_dict['Neutral'] += 1

    return pd.Series([sentiment_dict['Positive'], sentiment_dict['Negative'], sentiment_dict['Neutral']], index=['Positive', 'Negative', 'Neutral'])


In [None]:
# Function to extract top three positive and negative events
def extract_top_events(model_response):
    positive_events = []
    negative_events = []

    # Split the response into individual sentences
    sentences = model_response.split('\n')

    for sentence in sentences:
        if 'positive' in sentence.lower():
            positive_events.append(sentence)
        elif 'negative' in sentence.lower():
            negative_events.append(sentence)

    # Keep the first three events of each kind, in the order the model listed them (no ranking is applied)
    top_positive = positive_events[:3]
    top_negative = negative_events[:3]

    return pd.Series([top_positive, top_negative], index=['Top_Positive', 'Top_Negative'])



In [None]:
# Apply the function to extract sentiments and create separate columns
data_1[['Positive', 'Negative', 'Neutral']] = data_1['model_response'].apply(parse_and_group_sentiments)


In [None]:
data_1.head()

In [None]:
# Determine the dominant sentiment
def dominant_sentiment(row):
    sentiments = row[['Positive', 'Negative', 'Neutral']]
    max_sentiment = sentiments.idxmax()
    return max_sentiment

In [None]:
data_1['Dominant_Sentiment'] = data_1.apply(dominant_sentiment, axis=1)

In [None]:
data_1.head()

In [None]:
data_1.value_counts('Dominant_Sentiment')

#### Finding the top three positive and negative events most likely to impact the stock price

In [None]:
instruction_2 = """
    Summarize the news from the week by identifying the top three positive and negative events that are most likely to impact the price of the stock.
"""

In [None]:
data_1['model_response'] = data_1['News'].apply(lambda x: response_mistral_1(instruction_2, x))

In [None]:
# Apply the function to extract top events and create separate columns
data_1[['Top_Positive', 'Top_Negative']] = data_1['model_response'].apply(extract_top_events)

In [None]:
data_1.head()

In [None]:
# Display the DataFrame with top positive and negative events
data_1[['Date', 'Top_Positive', 'Top_Negative']]


##### Formatting the model output

In [None]:
import json

# Function to return JSON output
def get_json_output(df):
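    df = df.copy()  # Work on a copy so the caller's DataFrame is not mutated and the cell can be re-run
    df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')  # Convert Timestamps to string format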
    json_output = df[['Date', 'Top_Positive', 'Top_Negative']].to_dict(orient='records')
    return json.dumps(json_output, indent=4)



In [None]:
# Get JSON output
json_output = get_json_output(data_1)
print(json_output)

## **Conclusions and Recommendations**

-



