# **Neural Network-Based Stock Recommendation System**


- Combines financial data and news sentiment to rank S&P 500 stocks.

- Uses a Bidirectional LSTM model to evaluate financial performance over time.

- Uses FinBERT (Transformer) to classify sentiment from real news articles.

- Final recommendation score = 80% financial_score + 20% sentiment_score.

- Risk is assessed using debt ratios and downside volatility.

- Interactive Gradio UI filters top stocks by sector and risk level.

- Designed for scalability and real-world adaptability with minimal changes.

All code runs end-to-end inside this notebook using local files. The notebook includes visualizations, metrics, and a user-friendly recommender UI.


**Note: The "Restart and run all" command causes the session to time out. Please run each cell individually.**



**Install Dependencies:**  
Installs required libraries like NumPy, pandas, TensorFlow, etc.

Note: After running this cell, restart your runtime so that the newly installed versions take effect.


In [None]:


!pip install -U  \
    numpy==1.26.4 \
    pandas==2.2.2 \
    scikit-learn==1.6.1 \
    matplotlib==3.8.4 \
    seaborn==0.13.2 \
    tensorflow==2.16.1 \
    tensorflow-text==2.16.1 \
    transformers==4.39.3 \
    torch==2.1.2 \
    optuna==4.2.0 \
    yfinance==0.2.55 \
    datasets \
    rapidfuzz \
    gradio






**Unzip and Load Dataset Files**

This cell extracts the financial and sentiment dataset ZIP files and lists their contents. Ensure the files are named financial.zip and senti.zip before uploading.

Note: If you get the error "ValueError: numpy.dtype size changed, may indicate binary incompatibility", please restart your runtime.

In [None]:

import os
import zipfile
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from sklearn.preprocessing import StandardScaler

from google.colab import files


financial_zip_path = "/content/financial.zip"
sentiment_zip_path = "/content/senti.zip"




def extract_zip(file_path, extract_to):
    """Extracts a zip file to the specified directory."""
    if os.path.exists(file_path):
        with zipfile.ZipFile(file_path, 'r') as zip_ref:
            zip_ref.extractall(extract_to)
        print(f"Extracted: {file_path} → {extract_to}")
    else:
        raise FileNotFoundError(f"{file_path} not found. Upload it to the Colab session storage first.")


extract_zip(financial_zip_path, "financial_data")
extract_zip(sentiment_zip_path, "sentiment_data")


print("\nFinancial Data Files:", os.listdir("financial_data"))
print("\nSentiment Data Files:", os.listdir("sentiment_data"))



**Load and Combine Yearly Financial CSVs**

This cell reads all yearly financial CSV files, adds a Year column extracted from the filenames, standardizes the Ticker column, and concatenates everything into a single DataFrame called combined_df.
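
The year extraction assumes that every filename starts with a four-digit year followed by an underscore. A minimal sketch of a stricter parser is shown here; the helper name parse_year and the example filename 2014_Financial_Data.csv are hypothetical and only illustrate the assumed naming pattern:

```python
import re

# Assumed filename shape: "<4-digit year>_<anything>.csv" (hypothetical example names).
YEAR_PREFIX = re.compile(r"^(\d{4})_.*\.csv$")

def parse_year(filename: str) -> int:
    """Return the leading year of a yearly CSV filename, or raise a clear error."""
    match = YEAR_PREFIX.match(filename)
    if match is None:
        raise ValueError(f"Cannot read a year from filename: {filename!r}")
    return int(match.group(1))

print(parse_year("2014_Financial_Data.csv"))
```

Failing loudly on an unexpected filename makes a stray CSV in the folder easier to spot than a generic int() conversion error.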

In [None]:
import pandas as pd

folder_path = "financial_data"
all_files = sorted([f for f in os.listdir(folder_path) if f.endswith(".csv")])

def extract_year(filename):
    return int(filename.split("_")[0])

all_dfs = []

for file in all_files:
    year = extract_year(file)
    df = pd.read_csv(os.path.join(folder_path, file))


    if 'Unnamed: 0' in df.columns:
        df = df.rename(columns={'Unnamed: 0': 'Ticker'})
    elif df.index.name == 'Ticker':
        df.reset_index(inplace=True)
    else:
        df['Ticker'] = df.index

    df['Year'] = year
    all_dfs.append(df)

combined_df = pd.concat(all_dfs, ignore_index=True)
print(f"Combined shape: {combined_df.shape}")
combined_df.head()



**Preprocess Financial Data for LSTM Model**

This cell filters the combined financial dataset to retain only rows with all required features (Class + 16 financial indicators). It then selects only those tickers that have a full 5-year time series of data, resulting in a clean df_lstm dataset ready for LSTM training.
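
The completeness filter can be sketched without pandas. The version below is a variant under stated assumptions: it counts distinct years per ticker rather than rows, so a duplicated year would not pass as a full series. The function name complete_tickers and the toy tickers are illustrative only:

```python
from collections import defaultdict

def complete_tickers(rows, sequence_length=5):
    """Return tickers that have exactly `sequence_length` distinct years.

    `rows` is an iterable of (ticker, year) pairs, standing in for the
    Ticker and Year columns of the filtered DataFrame.
    """
    years_by_ticker = defaultdict(set)
    for ticker, year in rows:
        years_by_ticker[ticker].add(year)
    return sorted(t for t, years in years_by_ticker.items()
                  if len(years) == sequence_length)

# Toy data: AAA has all five years, BBB is missing 2016.
rows = [("AAA", y) for y in range(2014, 2019)] + \
       [("BBB", y) for y in (2014, 2015, 2017, 2018)]
print(complete_tickers(rows))
```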



In [None]:

combined_df = combined_df.sort_values(by=["Ticker", "Year"])


features = [
    'Revenue', 'Revenue Growth', 'Gross Profit', 'Operating Income',
    'Net Income', 'EPS', 'EBIT Margin', 'Profit Margin', 'EBITDA',
    'Operating Cash Flow', 'Market Cap', 'Enterprise Value',
    'Debt to Equity', 'Interest Coverage', 'ROIC', 'Net Profit Margin'
]


required_columns = ['Ticker', 'Year', 'Class'] + features
df_lstm = combined_df[required_columns].dropna()


grouped = df_lstm.groupby('Ticker')


sequence_length = 5
valid_tickers = [name for name, group in grouped if len(group) == sequence_length]


# .copy() gives an independent DataFrame, so later in-place scaling does not act on a slice
df_lstm = df_lstm[df_lstm['Ticker'].isin(valid_tickers)].copy()

print(f"LSTM dataset shape (after filtering): {df_lstm.shape}")
print(f"Number of valid time series (tickers with 5 years): {len(valid_tickers)}")


**Normalize & Format Time Series Data for LSTM Training**

This cell prepares the filtered dataset for LSTM model training:

- The selected financial features are normalized using StandardScaler to ensure uniform input scale across all features.

- For each ticker with 5 years of data, we extract a sequence of shape (5, num_features) and assign the final year's class label as the target.

- The result is a 3D input array X for LSTM training and a 1D label array y, suitable for supervised classification.
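
For intuition, standardization maps each value to z = (x - mean) / std, and StandardScaler uses the population standard deviation. The sketch below shows that arithmetic on made-up numbers. It also fits the statistics on a training split only and reuses them for held-out data, which is a common variant and an assumption of this sketch; the cell below fits the scaler on the full filtered dataset. The helper names are illustrative:

```python
from statistics import fmean, pstdev

def fit_stats(values):
    """Mean and population standard deviation (ddof=0), as StandardScaler uses."""
    return fmean(values), pstdev(values)

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with statistics fitted elsewhere."""
    return [(v - mean) / std for v in values]

# Toy revenue figures (made up) for a training and a held-out split.
train_revenue = [10.0, 20.0, 30.0, 40.0]
test_revenue = [25.0, 50.0]

mean, std = fit_stats(train_revenue)                 # fit on training data only
train_scaled = standardize(train_revenue, mean, std)
test_scaled = standardize(test_revenue, mean, std)   # reuse the training statistics
print(train_scaled, test_scaled)
```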

In [None]:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split


sequences = []
labels = []


scaler = StandardScaler()
df_lstm.loc[:, features] = scaler.fit_transform(df_lstm[features])



for ticker, group in df_lstm.groupby("Ticker"):
    group = group.sort_values("Year")
    if len(group) == sequence_length:
        seq = group[features].values
        label = group["Class"].values[-1]
        sequences.append(seq)
        labels.append(label)


X = np.array(sequences)
y = np.array(labels)

print(f"Final LSTM Input Shape: {X.shape}")
print(f"Final Labels Shape: {y.shape}")


**Financial LSTM Model Training with Bidirectional Memory and Attention**

This cell trains a custom neural network to predict long-term stock potential using time-series financial data:

- We use a Bidirectional LSTM, which reads each stock's 5-year sequence in both chronological and reverse order, so every year is encoded with context from the years before and after it.

- An Attention layer is applied to dynamically focus on the most informative years in the sequence, improving interpretability and prediction quality.

- Class imbalance (due to fewer failing companies) is corrected using manual class weights ({0: 1.25, 1: 0.83}), boosting sensitivity to underrepresented outcomes.

- EarlyStopping halts training when validation loss plateaus, reducing overfitting.

- After training, we report final test accuracy and a classification report showing performance on unseen stocks.
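
The manual class weights can be related to the widely used "balanced" heuristic, weight_c = n_samples / (n_classes * count_c). The sketch below applies it to a hypothetical 40/60 label split; that split is an assumption chosen because it reproduces weights close to {0: 1.25, 1: 0.83}, not a figure reported in this notebook:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical label vector with a 40/60 split between class 0 and class 1.
labels = [0] * 40 + [1] * 60
weights = balanced_class_weights(labels)
print({cls: round(w, 2) for cls, w in sorted(weights.items())})  # {0: 1.25, 1: 0.83}
```

Deriving the weights from the training labels, rather than hard-coding them, keeps them consistent if the dataset changes.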

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Dropout, Bidirectional, Attention, Flatten
from tensorflow.keras.callbacks import EarlyStopping
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import numpy as np


X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)

print(f"Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")


class_weights = {0: 1.25, 1: 0.83}


input_layer = Input(shape=(X.shape[1], X.shape[2]), name="input_layer")
x = Bidirectional(LSTM(128, return_sequences=True))(input_layer)
x = Dropout(0.3)(x)
x = Attention()([x, x])
x = Flatten()(x)
x = Dropout(0.3)(x)
output = Dense(1, activation='sigmoid', name="dense_output")(x)

finmodel = Model(inputs=input_layer, outputs=output)
finmodel.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
finmodel.summary()


early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = finmodel.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=50,
    batch_size=32,
    class_weight=class_weights,
    callbacks=[early_stop],
    verbose=1
)


test_loss, test_acc = finmodel.evaluate(X_test, y_test)
print(f"\n Final Test Accuracy: {test_acc:.4f}")

y_pred = (finmodel.predict(X_test) > 0.5).astype("int32")
print(classification_report(y_test, y_pred))



**LSTM Financial Model: Performance Visualization**

This section provides key visualizations to evaluate the performance of the LSTM-based financial classifier:

- Training Curves: Accuracy and loss trends over epochs to verify convergence and spot overfitting.

- Confusion Matrix: Shows how well the model distinguishes between high and low performers.

- ROC Curve: Illustrates the model's ability to balance true positives vs. false positives, with AUC as a summary metric.

These plots offer a holistic view of the model’s behavior and predictive performance on the test set.
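
As a reading aid for the AUC value, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of that pairwise definition, using made-up labels and scores (the function name roc_auc is illustrative; the notebook itself uses scikit-learn):

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy labels and predicted probabilities (made up for illustration).
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```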

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np


history_df = pd.DataFrame(history.history)

plt.figure(figsize=(14, 5))


plt.subplot(1, 2, 1)
plt.plot(history_df['accuracy'], label='Train Accuracy')
plt.plot(history_df['val_accuracy'], label='Validation Accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.title("Training vs Validation Accuracy (Financial Model)")
plt.legend()
plt.grid(True)


plt.subplot(1, 2, 2)
plt.plot(history_df['loss'], label='Train Loss')
plt.plot(history_df['val_loss'], label='Validation Loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss (Financial Model)")
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Not High Performer", "High Performer"])
disp.plot(cmap="Blues")
plt.title(" Confusion Matrix – LSTM Financial Model")
plt.grid(False)
plt.show()


y_probs = finmodel.predict(X_test).ravel()
fpr, tpr, _ = roc_curve(y_test, y_probs)
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(" ROC Curve – LSTM Financial Model")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()




**Load and Preview Sentiment Training & Validation Data**

This cell loads the pre-labeled sentiment datasets for training and validation. Each row contains a financial news sentence and its associated sentiment label (0 = negative, 1 = neutral, 2 = positive).

We preview the structure of the data and display the distribution of sentiment classes to understand any class imbalance.

In [None]:

train_df = pd.read_csv("sentiment_data/sent_train.csv")
valid_df = pd.read_csv("sentiment_data/sent_valid.csv")


print("Training Sample:")
print(train_df.head())

print("\nValidation Sample:")
print(valid_df.head())


print("\nColumns in train set:", train_df.columns)
print("\nLabel Distribution (Train):")
print(train_df['label'].value_counts())


**Load Sentiment Data into Hugging Face Format**

We use the datasets library from Hugging Face to load the sentiment training and validation CSVs into a structured DatasetDict.

This allows for cleaner integration with Transformers pipelines and batch processing for tokenization and training.

In [None]:
from datasets import load_dataset


data_files = {
    "train": "sentiment_data/sent_train.csv",
    "validation": "sentiment_data/sent_valid.csv"
}

raw_datasets = load_dataset("csv", data_files=data_files)


**Sentiment Classification Using FinBERT (Transformer-Based Model)**

In this section, we adapt the ProsusAI/finbert model to classify financial sentiment into three classes. The FinBERT encoder stays frozen, so only the new classification head is trained.
Steps performed:

- Loaded the FinBERT tokenizer and model (frozen to speed up training).

- Tokenized sentiment text from the dataset (padding and truncation to 128 tokens).

- Built a Keras model that extracts the [CLS] embedding for sentiment representation.

- Added dense layers for classification with dropout for regularization.

- Trained the model using labeled sentiment data for 3 epochs.

- Evaluated the model on a held-out test set and reported accuracy.

Note: If you get a ValueError during training, restart your runtime to clear TensorFlow graph conflicts.
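
To make the padding and truncation step concrete, the toy function below turns a variable-length list of token ids into fixed-length input_ids plus an attention_mask. It is a simplified sketch, not the Hugging Face tokenizer: the ids are made up, pad_id=0 is an assumption, and a real tokenizer also keeps the closing special token when it truncates:

```python
def pad_or_truncate(token_ids, max_len=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly `max_len` long.

    pad_id=0 is an assumption for this toy example; a real tokenizer
    exposes its own padding id.
    """
    kept = list(token_ids[:max_len])                # truncation
    n_pad = max_len - len(kept)
    input_ids = kept + [pad_id] * n_pad             # padding
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Made-up token ids, not real FinBERT vocabulary entries.
ids, mask = pad_or_truncate([101, 2054, 2003, 102], max_len=8)
print(ids)   # [101, 2054, 2003, 102, 0, 0, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```

The attention mask is what lets the encoder ignore the padded positions, which is why both tensors are passed to the model together.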




In [None]:
import os
import tensorflow as tf
from transformers import AutoTokenizer, TFBertModel
from sklearn.model_selection import train_test_split
from keras.saving import register_keras_serializable
from tensorflow.keras.layers import Input, Lambda, Dense, Dropout
from tensorflow.keras.models import Model


checkpoint = "ProsusAI/finbert"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
bert_model = TFBertModel.from_pretrained(checkpoint)
bert_model.trainable = False

def tokenize_data(df, tokenizer, max_len=128):
    return tokenizer(
        list(df['text']),
        padding='max_length',
        truncation=True,
        max_length=max_len,
        return_tensors='tf'
    )


train_full, test_df = train_test_split(train_df, test_size=0.10, stratify=train_df['label'], random_state=42)
train_df_small, val_df = train_test_split(train_full, test_size=0.10, stratify=train_full['label'], random_state=42)

X_train = tokenize_data(train_df_small, tokenizer)
X_val = tokenize_data(val_df, tokenizer)
X_test = tokenize_data(test_df, tokenizer)

y_train = tf.convert_to_tensor(train_df_small['label'].values)
y_val = tf.convert_to_tensor(val_df['label'].values)
y_test = tf.convert_to_tensor(test_df['label'].values)


@register_keras_serializable()
def extract_cls(inputs):
    ids, mask = inputs
    return bert_model(ids, attention_mask=mask).last_hidden_state[:, 0]


input_ids = Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = Input(shape=(128,), dtype=tf.int32, name="attention_mask")
cls_token = Lambda(extract_cls, output_shape=(768,), name="cls_extraction")([input_ids, attention_mask])


x = Dense(128, activation='relu')(cls_token)
x = Dropout(0.2)(x)
output = Dense(3, activation='softmax')(x)

model = Model(inputs=[input_ids, attention_mask], outputs=output)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])


history = model.fit(
    {"input_ids": X_train['input_ids'], "attention_mask": X_train['attention_mask']},
    y_train,
    validation_data=(
        {"input_ids": X_val['input_ids'], "attention_mask": X_val['attention_mask']}, y_val
    ),
    epochs=3,
    batch_size=32
)


loss, acc = model.evaluate(
    {"input_ids": X_test['input_ids'], "attention_mask": X_test['attention_mask']},
    y_test
)
print(f"\nFinal Test Accuracy: {acc:.4f} (this sentiment model is used later in portfolio scoring).")








**Sentiment Model: Performance Visualization**

This section visualizes the training and evaluation results of the sentiment classification model (FinBERT-based):

- Training Curves: Accuracy and loss trends while the classification head is trained, to assess learning progress and potential overfitting.

- Confusion Matrix: Displays prediction accuracy across all sentiment classes: Negative, Neutral, and Positive.

- ROC Curve (Micro-Averaged): Aggregates all class performance to show the model’s overall discriminative ability.

- Classification Report: Includes precision, recall, and F1-scores for each class to give a detailed breakdown of performance.

These diagnostics demonstrate how well the model classifies financial sentiment based on textual news and updates.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay, roc_curve, auc
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np


history_df = pd.DataFrame(history.history)

plt.figure(figsize=(14, 5))


plt.subplot(1, 2, 1)
plt.plot(history_df['accuracy'], label='Train Accuracy')
plt.plot(history_df['val_accuracy'], label='Validation Accuracy')
plt.xlabel("Epochs")
plt.ylabel("Accuracy")
plt.title("Training vs Validation Accuracy (Sentiment Model)")
plt.legend()
plt.grid(True)


plt.subplot(1, 2, 2)
plt.plot(history_df['loss'], label='Train Loss')
plt.plot(history_df['val_loss'], label='Validation Loss')
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.title("Training vs Validation Loss (Sentiment Model)")
plt.legend()
plt.grid(True)

plt.tight_layout()
plt.show()


y_pred = model.predict(
    {"input_ids": X_test['input_ids'], "attention_mask": X_test['attention_mask']}
).argmax(axis=1)

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=["Negative", "Neutral", "Positive"])
disp.plot(cmap="Blues")
plt.title(" Confusion Matrix – Sentiment Model")
plt.grid(False)
plt.show()


y_probs = model.predict({"input_ids": X_test['input_ids'], "attention_mask": X_test['attention_mask']})
y_test_onehot = tf.one_hot(y_test, depth=3).numpy()

fpr, tpr, _ = roc_curve(y_test_onehot.ravel(), y_probs.ravel())
roc_auc = auc(fpr, tpr)

plt.figure()
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title(" ROC Curve – Sentiment Model (Micro-Average)")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()


print("\n Classification Report:")
print(classification_report(y_test, y_pred, target_names=["Negative", "Neutral", "Positive"]))


**Mapping Ticker Symbols to Company Names (Using yfinance)**

This cell retrieves official company names corresponding to the stock tickers used in our LSTM dataset.

- Extracted unique tickers from df_lstm.

- Queried Yahoo Finance API via yfinance to fetch each company's shortName.

- Successfully mapped most tickers to names and stored them in a dictionary (ticker_to_name).

- Logged and counted any failed requests (e.g., due to missing or malformed data).

- Added a slight delay between requests to avoid rate-limiting.

In [None]:
import yfinance as yf
import time
import logging
import json


logging.getLogger("yfinance").setLevel(logging.ERROR)


tickers = df_lstm['Ticker'].dropna().unique()

ticker_to_name = {}
failed = []


print(f" Fetching company names for {len(tickers)} LSTM tickers...\n")

for i, ticker in enumerate(tickers):
    try:
        info = yf.Ticker(ticker).info
        name = info.get('shortName', '')
        if name:
            ticker_to_name[ticker] = name
            print(f" [{i+1}/{len(tickers)}] {ticker}: {name}")
        else:
            print(f"  [{i+1}/{len(tickers)}] {ticker}: No name found.")
        time.sleep(0.25)
    except Exception as e:
        failed.append(ticker)
        print(f" [{i+1}/{len(tickers)}] {ticker}: Failed ({str(e)[:50]})")


print(f" Done. Mapped {len(ticker_to_name)} tickers. Skipped {len(failed)} tickers.")






**Matching Sentiment Texts to Stock Tickers**

The sentiment dataset we use contains financial news and social media posts, but the company references inside these texts are inconsistent — sometimes they appear as ticker symbols (e.g., $AAPL), and other times as full company names (e.g., Apple Inc). To properly align this dataset with our financial LSTM dataset (which is organized by ticker), we need a robust matching system.



- Reverse Mapping: We create a dictionary that maps lowercase company names to tickers.

- Strict Matching: Detects tickers in the form $TICKER using regex, then looks for exact occurrences of company names inside the text.

- Fuzzy Matching: If no exact match is found, uses RapidFuzz (token_sort_ratio) to compare the text against every company name and accepts the best candidate when its similarity score is at least 90.

- Result: A clean ticker column is created, aligning sentiment texts to company tickers.

- Final Step: Unmatched rows are dropped to ensure downstream consistency.

This process is essential for integrating the sentiment signal with our financial predictions, enabling meaningful scoring and recommendation.
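
The $TICKER detection step can be sketched as a single regex pass followed by a set lookup. This is a simplified variant, not the cell's exact logic: it assumes tickers are one to five ASCII letters, does not handle symbols containing dots, and does not skip one-letter tickers. The names CASHTAG, find_cashtag, and the ticker set are hypothetical:

```python
import re

# Assumption: tickers are 1-5 ASCII letters; symbols with dots are not handled here.
CASHTAG = re.compile(r"\$([A-Za-z]{1,5})\b")

def find_cashtag(text, known_tickers):
    """Return the first $TICKER mention whose symbol is in `known_tickers`, else None."""
    for match in CASHTAG.finditer(text):
        symbol = match.group(1).upper()
        if symbol in known_tickers:
            return symbol
    return None

known = {"AAPL", "MSFT"}  # hypothetical ticker universe
print(find_cashtag("Analysts raise targets on $aapl after earnings", known))  # AAPL
print(find_cashtag("$AAPLX is a different symbol", known))                    # None
```

Extracting candidate symbols once per text avoids building and running a separate pattern for every ticker.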

In [None]:
from rapidfuzz import fuzz, process
import pandas as pd
import json
import re



name_to_ticker = {v.lower(): k for k, v in ticker_to_name.items()}


sentiment_df = pd.read_csv("sentiment_data/sent_train.csv")


def strict_match_ticker(text):
    text_lower = text.lower()


    for ticker in ticker_to_name:
        if len(ticker) < 2:
            continue
        pattern = rf"\${re.escape(ticker.lower())}(\b|[^a-zA-Z])"
        if re.search(pattern, text_lower):
            print(f" Matched by $ticker: {ticker} in → {text[:90]}...")
            return ticker


    for name, ticker in name_to_ticker.items():
        if f" {name} " in f" {text_lower} ":
            print(f" Matched by exact company name: {name} → {text[:90]}...")
            return ticker


    best_match = process.extractOne(text_lower, name_to_ticker.keys(), scorer=fuzz.token_sort_ratio)
    if best_match and best_match[1] >= 90:
        matched_name = best_match[0]
        ticker = name_to_ticker[matched_name]
        print(f" Fuzzy matched '{matched_name}' ({best_match[1]}%) → {ticker} in → {text[:90]}...")
        return ticker

    return None


sentiment_df['ticker'] = sentiment_df['text'].apply(strict_match_ticker)


sentiment_df = sentiment_df.dropna(subset=['ticker'])

print(f"\n Ticker matching complete: {len(sentiment_df)} entries matched.")






**Final Prediction Pipeline: Merging Financial and Sentiment Insights**

This cell performs the final stage of our stock scoring system, combining predictions from the LSTM financial model and the FinBERT sentiment model. Here’s a breakdown of what this does:

- Filter Financial Data: We limit the LSTM dataset to tickers that also appear in the sentiment dataset to ensure overlap.

- Predict Financial Scores: The pre-trained LSTM (finmodel) processes each stock's 5-year time series of financial metrics and outputs a confidence score (between 0 and 1) indicating expected future performance.

- Predict Sentiment Scores: We tokenize the sentiment texts using FinBERT's tokenizer and run them through the trained sentiment model to predict sentiment labels (0, 1, or 2). We then average these labels per ticker to form a sentiment score between 0 and 2.

- Merge Scores: For each stock, we combine the two signals:
    - final_score = 0.8 * financial_score + 0.2 * sentiment_score

    - This weighted average prioritizes financial health while still incorporating sentiment.

- Diagnostics: We print summary statistics to show which tickers had complete data from both pipelines and identify any mismatches.

- Output: Displays the Top 10 Recommended Stocks based on the final score, that is, those with strong financial fundamentals and supportive sentiment.

This step is the culmination of our pipeline and forms the basis of our portfolio recommendation system.
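
The score combination is simple arithmetic, sketched below on made-up numbers. Because the per-ticker sentiment score is a mean of labels 0, 1, and 2, it ranges from 0 to 2, while the financial score ranges from 0 to 1. The optional rescaling to [0, 1] shown here is an assumption of this sketch and not a step the notebook performs; the function name blend_scores is illustrative:

```python
def blend_scores(financial_score, mean_sentiment_label,
                 w_financial=0.8, w_sentiment=0.2, rescale_sentiment=True):
    """Weighted combination of the financial and sentiment signals.

    mean_sentiment_label is the per-ticker mean of labels 0, 1, 2, so it lies
    in [0, 2]. With rescale_sentiment=True it is divided by 2 first; this
    rescaling is an assumption of the sketch, not a step in the notebook.
    """
    sentiment = mean_sentiment_label / 2.0 if rescale_sentiment else mean_sentiment_label
    return w_financial * financial_score + w_sentiment * sentiment

# Made-up scores for one ticker.
print(blend_scores(0.9, 2.0))                           # rescaled sentiment
print(blend_scores(0.9, 2.0, rescale_sentiment=False))  # formula as written in this notebook
```

With rescaling, both inputs share the same range and the 80/20 weights describe their relative influence directly.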

In [None]:
from transformers import AutoTokenizer
import pandas as pd
import numpy as np
import tensorflow as tf

matched_tickers = sentiment_df['ticker'].unique()
filtered_lstm_df = df_lstm[df_lstm['Ticker'].isin(matched_tickers)].copy()
print(f"Filtered financial dataset to {filtered_lstm_df['Ticker'].nunique()} tickers ({len(filtered_lstm_df)} rows)")


sequence_length = 5
features = [col for col in filtered_lstm_df.columns if col not in ['Ticker', 'Year', 'Class']]

X = []
tickers_seq = []

for ticker in filtered_lstm_df['Ticker'].unique():
    df_ticker = filtered_lstm_df[filtered_lstm_df['Ticker'] == ticker].sort_values('Year')
    if len(df_ticker) == sequence_length:
        X.append(df_ticker[features].values)
        tickers_seq.append(ticker)
    else:
        print(f" Skipping {ticker} (has only {len(df_ticker)} rows)")

X = np.array(X)
print(f" Input shape for LSTM: {X.shape}")


financial_scores = finmodel.predict(X).squeeze()
print(f" Sample financial predictions: {financial_scores[:5]}")

lstm_results_df = pd.DataFrame({
    'Ticker': tickers_seq,
    'financial_score': financial_scores
})


tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
sentiment_texts = list(sentiment_df['text'])
encoded = tokenizer(
    sentiment_texts,
    padding='max_length',
    truncation=True,
    max_length=128,
    return_tensors='tf'
)


sentiment_preds = model.predict({
    "input_ids": encoded["input_ids"],
    "attention_mask": encoded["attention_mask"]
}, batch_size=32).argmax(axis=1)


sentiment_df['sentiment_score'] = sentiment_preds


sentiment_scores_df = sentiment_df.groupby('ticker')['sentiment_score'].mean().reset_index()
sentiment_scores_df.rename(columns={'ticker': 'Ticker'}, inplace=True)



final_df = lstm_results_df.merge(sentiment_scores_df, on='Ticker')
final_df['final_score'] = 0.8 * final_df['financial_score'] + 0.2 * final_df['sentiment_score']



tickers_in_sentiment = set(sentiment_scores_df['Ticker'])
tickers_in_lstm = set(lstm_results_df['Ticker'])

common_tickers = tickers_in_sentiment & tickers_in_lstm
only_in_sentiment = tickers_in_sentiment - tickers_in_lstm
only_in_lstm = tickers_in_lstm - tickers_in_sentiment

print(f"\n Tickers in both sentiment & LSTM: {len(common_tickers)}")
print(f" Tickers only in sentiment data: {sorted(list(only_in_sentiment))[:10]}{'...' if len(only_in_sentiment) > 10 else ''}")
print(f" Tickers only in financial data: {sorted(list(only_in_lstm))[:10]}{'...' if len(only_in_lstm) > 10 else ''}")



top_stocks = final_df.sort_values(by='final_score', ascending=False).head(10)
print("\n Top 10 Recommended Stocks:")
print(top_stocks[['Ticker', 'financial_score', 'sentiment_score', 'final_score']])





**Calculating Risk Scores and Final Dataset Construction**

This cell computes a custom risk score for each stock and integrates it into the final dataset alongside financial and sentiment scores. It also adds sector information to enable sector-based filtering later.

- Downside Volatility: We extract all the "PRICE VAR" columns and compute downside volatility — the standard deviation of only the negative annual price changes — for each ticker.

- Financial Risk: We calculate the average Debt to Equity and Interest Coverage ratios per ticker. The financial risk score is computed as:
   - financial_risk_score = Debt to Equity / (1 + Interest Coverage)



- Overall Risk Score: Adds downside volatility and the financial risk score to give a unified risk_score per stock; a missing component is counted as 0.

- Sector Merge: Adds sector information from the original dataset.

- Final DataFrame: Merges all components — financial score, sentiment score, risk score, and sector — into final_df.

Finally, it displays the top 10 highest-scoring stocks, now with complete metadata for risk-aware portfolio construction or sector-specific filtering.
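
The risk arithmetic for a single ticker can be sketched in plain Python. The numbers are made up, and the function names are illustrative. The sample standard deviation is used, which matches the pandas default, and a ticker with fewer than two negative price changes gets a downside volatility of 0:

```python
from statistics import stdev

def downside_volatility(price_changes):
    """Sample standard deviation of the negative price changes only.

    Returns 0.0 when fewer than two negative values exist, mirroring the
    fill-with-zero step applied to missing volatility values.
    """
    losses = [c for c in price_changes if c < 0]
    return stdev(losses) if len(losses) >= 2 else 0.0

def financial_risk_score(debt_to_equity, interest_coverage):
    """Debt to Equity divided by (1 + Interest Coverage)."""
    return debt_to_equity / (1.0 + interest_coverage)

# Made-up annual price changes (%) and ratios for one ticker.
vol = downside_volatility([-10.0, 5.0, -20.0, 3.0])
fin = financial_risk_score(1.5, 4.0)
risk_score = vol + fin
print(round(vol, 4), round(fin, 4), round(risk_score, 4))
```

Note that the two components are on different scales (percentage points versus a ratio), so the volatility term tends to dominate the sum in this toy example.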

In [None]:
import pandas as pd
import numpy as np


price_var_cols = [col for col in combined_df.columns if 'PRICE VAR' in col]


price_var_long = combined_df[['Ticker'] + price_var_cols].melt(
    id_vars='Ticker',
    value_vars=price_var_cols,
    var_name='Year',
    value_name='Price_Var'
)


price_var_long = price_var_long.dropna()
price_var_long = price_var_long[price_var_long['Price_Var'] < 0]


downside_volatility = price_var_long.groupby('Ticker')['Price_Var'].std().reset_index()
downside_volatility.rename(columns={'Price_Var': 'downside_volatility'}, inplace=True)

combined_df['Debt to Equity'] = combined_df['Debt to Equity'].fillna(combined_df['Debt to Equity'].median())
combined_df['Interest Coverage'] = combined_df['Interest Coverage'].fillna(combined_df['Interest Coverage'].median())

financial_risk = combined_df.groupby('Ticker').agg({
    'Debt to Equity': 'mean',
    'Interest Coverage': 'mean'
}).reset_index()

financial_risk['financial_risk_score'] = financial_risk['Debt to Equity'] / (1 + financial_risk['Interest Coverage'])


risk_df = downside_volatility.merge(financial_risk[['Ticker', 'financial_risk_score']], on='Ticker', how='outer')
risk_df['risk_score'] = risk_df['downside_volatility'].fillna(0) + risk_df['financial_risk_score'].fillna(0)


sector_info = combined_df[['Ticker', 'Sector']].drop_duplicates()
final_df['Company'] = final_df['Ticker'].map(ticker_to_name)




final_df = final_df.merge(sector_info, on='Ticker', how='left')
final_df = final_df.merge(risk_df[['Ticker', 'risk_score']], on='Ticker', how='left')


final_df = final_df[['Ticker','Company', 'Sector', 'financial_score', 'sentiment_score', 'final_score', 'risk_score']]


print(" Final comprehensive dataset ready:")
display(final_df.sort_values(by='final_score', ascending=False).head(10))



**Filter Function for Sector & Risk-Based Stock Selection**

This function powers the interactive UI for selecting top stocks based on chosen sectors and risk levels:

- Quantile-Based Risk Categorization: The risk_score is split into three categories — Low, Medium, and High — using the 33rd and 66th percentiles.

- Dynamic Filtering: Filters the final_df based on:

   - selected_sectors: list of sector names

   - selected_risk: one of "Low", "Medium", or "High"

- Ranking: Returns the top 10 highest final_score stocks after applying the filters.

Used in the frontend to let users choose their investment preferences interactively.
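
The quantile-based bucketing can be illustrated without pandas. The sketch below uses linear interpolation between neighbouring order statistics, which is the pandas default for quantile, and applies it to made-up risk scores. The helper names are illustrative only:

```python
def quantile(sorted_values, q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])

def risk_categories(risk_scores):
    """Label each score Low, Medium, or High using the 33% and 66% quantiles."""
    ordered = sorted(risk_scores)
    q_low, q_high = quantile(ordered, 0.33), quantile(ordered, 0.66)
    return ["Low" if r <= q_low else "Medium" if r <= q_high else "High"
            for r in risk_scores]

# Made-up risk scores 1..10.
scores = [float(v) for v in range(1, 11)]
print(risk_categories(scores))
```

Because the thresholds are computed from whatever rows are present, the same risk_score can fall into a different category if the underlying set of stocks changes.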

In [None]:
def filter_stocks_ui(selected_sectors, selected_risk):
    df = final_df.copy()


    df = df.dropna(subset=['risk_score'])


    q_low = df['risk_score'].quantile(0.33)
    q_high = df['risk_score'].quantile(0.66)

    def categorize_risk(r):
        if r <= q_low:
            return 'Low'
        elif r <= q_high:
            return 'Medium'
        else:
            return 'High'


    df['risk_category'] = df['risk_score'].apply(categorize_risk)


    if selected_sectors:
        df = df[df['Sector'].isin(selected_sectors)]


    if selected_risk:
        df = df[df['risk_category'] == selected_risk]


    top_df = df.sort_values(by='final_score', ascending=False).head(10)


    print(f" Matches found: {len(top_df)} for sectors {selected_sectors} and risk '{selected_risk}'")

    return top_df[['Ticker','Company', 'Sector', 'financial_score', 'sentiment_score', 'final_score', 'risk_score']]




**Interactive Stock Recommender UI with Gradio**

This cell builds an interactive user interface using Gradio:

- CheckboxGroup: Lets users select one or more sectors.

- Radio Button: Lets users choose a desired risk level — Low, Medium, or High.

- Output: Displays the top 10 recommended stocks based on final_score after filtering by sector and risk.

This makes the model's recommendations visually accessible and easy to explore without needing to rerun any code.

In [None]:
import gradio as gr
sector_list = sorted(final_df['Sector'].dropna().unique().tolist())

gr.Interface(
    fn=filter_stocks_ui,
    inputs=[
        gr.CheckboxGroup(choices=sector_list, label="Select Sector(s)"),
        gr.Radio(choices=['Low', 'Medium', 'High'], label="Select Risk Level"),
    ],
    outputs=gr.Dataframe(label="Top 10 Matching Stocks"),
    title="📊 Sector + Risk-Aware Stock Recommender",
    description="Choose one or more sectors and a risk level to see high-potential stocks."
).launch(debug=True)
