# Project: Sentiment Analysis to Predict Stock Prices

## Learning Objectives:

- Learn how to retrieve historical stock prices with Python.

- Plot the stock prices.

- Learn how to scrape financial news headlines from a financial news website.

- Apply sentiment analysis to news headlines.

- Plot stock prices and sentiment scores to see the relationship between them.

In [None]:
# Install yfinance
!pip install yfinance

In [12]:
# Import libraries
import pandas as pd
from datetime import date
import numpy as np
import matplotlib.pyplot as plt
import yfinance as yf    # library to access Yahoo! Finance
import os

In [None]:
# Specify paths
print(os.getcwd())  # print current working directory
output_path = './data/stockmarket/'  # path to save data
os.makedirs(output_path, exist_ok = True)  # create the folder if it does not exist yet

## 1. Pull stock prices of the top 30 companies in Germany

The list of companies is available here: https://companiesmarketcap.com/germany/largest-companies-in-germany-by-market-cap/

We will retrieve the stock price data with the `yfinance` library.

In [15]:
# Create a list of companies in Germany
tickers = ['SAP.DE', 'SIE.DE', 'DTE.DE', 'P911.DE', 'ALV.DE', 'VOW3.DE', 'MBG.DE', 'BMW.DE', 'MRK.DE', 'SHL.DE',
             'BAYZF', 'DPW.DE', 'RAA.F', 'MUV2.DE', 'IFX.DE', 'BAS.DE', 'HLAG.DE', 'UN01.DE', 'DB1.DE', 'EOAN.DE', 
             'ADS.DE', 'HEN3.DE', 'BEI.DE', 'RWE.DE', 'EBK.DE', 'DTG.F', 'BNTX', 'HNR1.DE', 'SRT.DE', 'DB']

In [16]:
# Define function to get data
def get_yahoo_data(tickers_list, period):
    # Collect one DataFrame per ticker, then concatenate once at the end
    frames = []

    for ticker in tickers_list:
        # Access ticker data
        company_tkr = yf.Ticker(ticker)
        # Download the price history for the requested period (e.g. '5y')
        company_hist = company_tkr.history(period = period)
        # Turn the date index into a regular 'Date' column
        company_hist = company_hist.reset_index().rename(columns = {'index': 'Date'})
        # Put the ticker symbol in the first column
        company_hist.insert(0, 'Ticker', ticker)
        frames.append(company_hist)

    # Return a single DataFrame with all tickers stacked
    return pd.concat(frames, ignore_index = True)

In [None]:
# Get stock prices of top 30 German companies with period of 5 years
stock_prices_df = get_yahoo_data(tickers, '5y')

# View the data
stock_prices_df

In [21]:
# Format column of Date
stock_prices_df['Date'] = stock_prices_df['Date'].apply(lambda x: x.date())

# View data
stock_prices_df
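
The long-format table can be plotted once it is reshaped so that each ticker has its own column. The following is a minimal sketch on made-up prices (the numbers are invented and only mimic the `Ticker`, `Date` and `Close` columns of `stock_prices_df`):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up long-format prices with the same column names as stock_prices_df
sample_prices = pd.DataFrame({
    'Ticker': ['SAP.DE', 'SAP.DE', 'SAP.DE', 'BMW.DE', 'BMW.DE', 'BMW.DE'],
    'Date': pd.to_datetime(['2023-01-02', '2023-01-03', '2023-01-04'] * 2).date,
    'Close': [100.0, 101.5, 99.8, 85.0, 86.2, 87.1],
})

# Reshape to wide format: one row per date, one column per ticker
wide = sample_prices.pivot(index = 'Date', columns = 'Ticker', values = 'Close')

# Draw one line per ticker
fig, ax = plt.subplots(figsize = (10, 5))
for ticker in wide.columns:
    ax.plot(wide.index, wide[ticker], marker = '.', label = ticker)
ax.set_xlabel('Date')
ax.set_ylabel('Closing price')
ax.set_title('Closing prices by ticker')
ax.legend()
plt.show()
```

With the real data, replacing `sample_prices` by `stock_prices_df` (possibly filtered to a few tickers) should give the same kind of chart.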

In [23]:
# Save data as csv file
stock_prices_df.to_csv(output_path + 'StockPrices_30GermanCompanies_YFinance.csv', header = True, sep = ';', index = False)
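
When the file is loaded again later, the same `;` separator has to be passed to `pd.read_csv`, and the `Date` column has to be parsed explicitly because a CSV file stores it as text. A minimal sketch using an in-memory buffer instead of a file, with made-up prices:

```python
import io
import pandas as pd

# Illustrative rows only; not real market data
sample_df = pd.DataFrame({
    'Ticker': ['SAP.DE', 'SAP.DE'],
    'Date': [pd.Timestamp('2023-01-02').date(), pd.Timestamp('2023-01-03').date()],
    'Close': [100.0, 101.5],
})

# Write with the same options as above, but to an in-memory buffer
buffer = io.StringIO()
sample_df.to_csv(buffer, header = True, sep = ';', index = False)
buffer.seek(0)

# Read it back: same separator, and parse 'Date' as datetime
reloaded_df = pd.read_csv(buffer, sep = ';', parse_dates = ['Date'])
# Convert to plain dates again, as done for stock_prices_df
reloaded_df['Date'] = reloaded_df['Date'].dt.date

print(reloaded_df)
```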

## 2. Scrape financial news headlines from FinViz

FinViz is a useful source of stock market information, including a table of recent news headlines for each ticker.

You can access the website at https://finviz.com/

In [7]:
# Import libraries
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import os
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
# Scrape data and save it in a dictionary
finviz_url = 'https://finviz.com/quote.ashx?t='
news_tables = {}
tickers = ['SAP', 'BNTX', 'DB']
for ticker in tickers:
    url = finviz_url + ticker
    req = Request(url = url, headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/112.0.0.0 Safari/537.36'}) 
    response = urlopen(req)    
    # Parse the HTML response into the BeautifulSoup object 'html'
    html = BeautifulSoup(response, 'html.parser')
    # Find 'news-table' in the Soup and load it into 'news_table'
    news_table = html.find(id = 'news-table')
    # Add the table to our dictionary
    news_tables[ticker] = news_table

In [None]:
# Let's extract one example to see the structure of the created dictionary
# Read several headlines for 'SAP'
sap = news_tables['SAP']
# Get all the table rows tagged in HTML with <tr> into 'sap_tr'
sap_tr = sap.find_all('tr')
for i, table_row in enumerate(sap_tr):
    # Read the text of the element 'a' (the headline) into 'a_text'
    a_text = table_row.a.text
    # Read the text of the element 'td' (the timestamp) into 'td_text'
    td_text = table_row.td.text
    # Print the contents of 'a_text' and 'td_text'
    print(a_text)
    print(td_text)
    # Exit after printing 6 rows of data
    if i == 5:
        break

In [None]:
# Parse the data into a list
parsed_news = []
# Iterate through the news tables, which are keyed by ticker
for ticker, news_table in news_tables.items():
    # Iterate through all tr tags in 'news_table'
    for x in news_table.find_all('tr'):
        # Skip rows without a headline link or a timestamp cell, if any
        if x.a is None or x.td is None:
            continue
        # Get the headline text from the 'a' tag only
        text = x.a.get_text()
        # Split the text in the td tag into a list
        date_scrape = x.td.text.split()
        # If the length of 'date_scrape' is 1, it holds only the time;
        # the date is carried over from the previous row
        if len(date_scrape) == 1:
            news_time = date_scrape[0]
        # Otherwise the 1st element is the date and the 2nd is the time
        else:
            news_date = date_scrape[0]
            news_time = date_scrape[1]

        # Append ticker, date, time and headline as a list to the 'parsed_news' list
        parsed_news.append([ticker, news_date, news_time, text])

# Print first 5 rows of news
parsed_news[:5]
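
The date handling in the loop relies on the layout of the table: the first cell of a row holds either a date and a time, or a time only, in which case the row belongs to the most recent date seen above it. A small sketch of that carry-forward rule on made-up cell strings (the helper name `carry_forward_dates` and the sample strings are illustrative assumptions, not actual FinViz output):

```python
def carry_forward_dates(cells):
    """Return (date, time) pairs, reusing the last seen date for time-only cells."""
    current_date = None
    pairs = []
    for cell in cells:
        parts = cell.split()
        if len(parts) == 1:
            # Time only: keep the date from an earlier row
            news_time = parts[0]
        else:
            # Date and time: update the current date
            current_date, news_time = parts[0], parts[1]
        pairs.append((current_date, news_time))
    return pairs

# Made-up timestamp cells in the "date time" / "time" pattern
sample_cells = ['May-05-23 09:30AM', '08:15AM', 'May-04-23 04:45PM']
print(carry_forward_dates(sample_cells))
```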

In [10]:
# Set column names
columns = ['Ticker', 'Date', 'Time', 'Headline']

# Convert parsed_news list into Dataframe
stock_news_df = pd.DataFrame(parsed_news, columns = columns)

# Convert date column into datetime
stock_news_df['Date'] = pd.to_datetime(stock_news_df['Date']).dt.date

# View data
stock_news_df

## 3. Construct Sentiment Classifications

From the headlines in `stock_news_df`, we create several new columns: sentiment labels, emotions and sentiment scores.

### 3.1. Sentiment classifiers with `pipeline`

In [24]:
# Import transformer model
from transformers import pipeline

#### Sentiment classification with values `POS` (positive), `NEU` (neutral) and `NEG` (negative)

In [None]:
# Sentiment classifier
sentiment_classifier = pipeline(model = 'finiteautomata/bertweet-base-sentiment-analysis')

In [26]:
# Define function to get sentiment
def get_sentiment(text):
    # Get the predicted sentiment label; fall back if the classifier fails
    try:
        sentiment = sentiment_classifier(text)[0]['label']
    except Exception:
        sentiment = 'Not classified'
    return sentiment

In [None]:
# Create new column namely 'Sentiment'
stock_news_df['Sentiment'] = stock_news_df['Headline'].astype(str).apply(lambda x: get_sentiment(x))

# View the data
stock_news_df

#### Emotion classification with categorical values such as `joy`, `anger`, `fear`, `sadness`, `love`, ...

In [None]:
# Emotion classifier
emotion_classifier = pipeline("text-classification", model = 'bhadresh-savani/distilbert-base-uncased-emotion', return_all_scores = True)

In [29]:
# Define function to get emotion
def get_emotion(text):
    # Get emotion prediction scores
    pred_scores = emotion_classifier(text)
    
    # Get emotion with highest prediction score
    emotion = max(pred_scores[0], key = lambda x: x['score'])['label']
    
    return emotion
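
As used in `get_emotion`, the classifier output for one text is a list holding one list of label/score dictionaries, and `max` with a `key` function picks the entry with the highest score. A sketch on a hand-written output in that shape (the scores are invented for illustration and no model is loaded):

```python
# Hand-written stand-in for emotion_classifier(text); the scores are invented
mock_pred_scores = [[
    {'label': 'sadness', 'score': 0.05},
    {'label': 'joy', 'score': 0.80},
    {'label': 'fear', 'score': 0.15},
]]

# Pick the dictionary with the largest 'score' and read its label
top = max(mock_pred_scores[0], key = lambda x: x['score'])
print(top['label'])  # joy
```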

In [None]:
# Create new column of 'Emotion'
stock_news_df['Emotion'] = stock_news_df['Headline'].astype(str).apply(lambda x: get_emotion(x))

# View the data
stock_news_df

### 3.2. Sentiment Analysis with `nltk` (Natural Language Toolkit)

In [None]:
# Import libraries
import nltk
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

In [32]:
# Sentiment score classifier
sentiment_score_classifier = SentimentIntensityAnalyzer()

In [34]:
# Iterate through the headlines and get the polarity scores using vader
sentiment_scores = stock_news_df['Headline'].apply(sentiment_score_classifier.polarity_scores).tolist()

# Convert the 'sentiment_scores' list of dicts into a DataFrame
sentiment_scores_df = pd.DataFrame(sentiment_scores)

In [None]:
# Join the news DataFrame with the DataFrame of sentiment scores
stock_news_df = stock_news_df.join(sentiment_scores_df, rsuffix = '_right')

# View the data
stock_news_df
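
`polarity_scores` returns four values per headline: `neg`, `neu`, `pos` and `compound`, where `compound` is a normalized overall score between -1 and 1. If a categorical label is needed, a common convention from the VADER documentation treats scores of at least 0.05 as positive and scores of at most -0.05 as negative. A sketch of that rule (the function name is an illustrative choice, and the threshold can be tuned):

```python
def label_from_compound(compound, threshold = 0.05):
    """Map a VADER compound score to 'positive', 'neutral' or 'negative'."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'

# Made-up compound scores
for score in [0.6, 0.0, -0.3]:
    print(score, label_from_compound(score))
```

With the real data, the rule could be applied as `stock_news_df['compound'].apply(label_from_compound)`.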

## 4. Perform Sentiment Analysis

In [None]:
# Rename the Yahoo ticker 'SAP.DE' to 'SAP' so that it matches the FinViz ticker
stock_prices_df['Ticker'] = stock_prices_df['Ticker'].replace('SAP.DE', 'SAP')

# Combine stock_prices_df and stock_news_df
stocks_df = pd.merge(stock_news_df, stock_prices_df, how = 'inner', on = ['Ticker', 'Date'])

# View the data
stocks_df
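
Because this is an inner join, headlines published on days without a price row (for example weekends) are dropped. One way to see how many rows are affected is a left join with `indicator = True`. A sketch on made-up data with the same key columns:

```python
import pandas as pd
from datetime import date

# Made-up headlines; 2023-05-06 is a Saturday, so it has no price row
news = pd.DataFrame({
    'Ticker': ['SAP', 'SAP', 'DB'],
    'Date': [date(2023, 5, 5), date(2023, 5, 6), date(2023, 5, 5)],
    'Headline': ['headline a', 'headline b', 'headline c'],
})
# Made-up prices
prices = pd.DataFrame({
    'Ticker': ['SAP', 'DB'],
    'Date': [date(2023, 5, 5), date(2023, 5, 5)],
    'Close': [120.0, 10.0],
})

# Keep every headline and mark whether a matching price row was found
check_df = pd.merge(news, prices, how = 'left', on = ['Ticker', 'Date'], indicator = True)
unmatched = check_df[check_df['_merge'] == 'left_only']
print(unmatched[['Ticker', 'Date', 'Headline']])
```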

### 4.1. Pie plots of sentiment analysis 

In [None]:
# Plot sentiment about three companies
fig, axs = plt.subplots(1, 3, figsize = (15,5))

for i, ticker in enumerate(stocks_df.Ticker.unique()):
    # Select each company
    temp_df = stocks_df[stocks_df.Ticker == ticker]
    # Compute the proportion of each sentiment label (in percent)
    counts = temp_df['Sentiment'].value_counts()
    proportions = counts / counts.sum() * 100

    # Create pie chart
    ## Define one fixed color per label so that the three charts are comparable
    color_map = {'POS': 'lightblue', 'NEG': 'salmon', 'NEU': 'navajowhite'}
    colors = [color_map.get(label, 'lightgray') for label in proportions.index]

    ## Plot
    axs[i].pie(proportions.values, labels = proportions.index, wedgeprops = {'width': 0.8}, colors = colors, autopct = '%1.2f%%')

    ## Add a title
    axs[i].set_title(f'Sentiment of News about the company {ticker}')
    
plt.show();
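
The percentage shares behind the charts can also be computed in one step with `value_counts(normalize = True)`, which is handy for printing the numbers. A sketch on made-up labels:

```python
import pandas as pd

# Made-up sentiment labels
sample_sentiment = pd.Series(['POS', 'NEU', 'NEU', 'NEG', 'NEU'], name = 'Sentiment')

# Relative frequencies, scaled to percent
shares = sample_sentiment.value_counts(normalize = True) * 100
print(shares.round(2))
```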


### 4.2. Bar plots of emotion distribution

In [None]:
# Plot the emotion distribution
fig, axs = plt.subplots(1, 3, figsize = (20,5))

for i, ticker in enumerate(stocks_df.Ticker.unique()):
    # Select each company
    temp_df = stocks_df[stocks_df.Ticker == ticker]

    counts = temp_df['Emotion'].value_counts()

    # Create the bar chart with the defined labels and values
    axs[i].bar(counts.index, counts.values)

    # Add a title to the bar chart
    axs[i].set_title(f'Emotions of News about the company {ticker}')

    # Add labels to the x and y axes
    axs[i].set_xlabel('Emotion')
    axs[i].set_ylabel('Count')

# Show the bar chart
plt.show();
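
To compare the companies in a single table rather than three separate charts, the counts can be cross-tabulated with `pd.crosstab`. A sketch on made-up rows using the same `Ticker` and `Emotion` column names:

```python
import pandas as pd

# Made-up emotion labels per ticker
sample_df = pd.DataFrame({
    'Ticker': ['SAP', 'SAP', 'DB', 'DB', 'BNTX'],
    'Emotion': ['joy', 'anger', 'joy', 'joy', 'fear'],
})

# Rows are tickers, columns are emotions, cells are headline counts
emotion_table = pd.crosstab(sample_df['Ticker'], sample_df['Emotion'])
print(emotion_table)
```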

### 4.3. Examine the relationship between sentiment scores and stock prices

In [104]:
# Set plot style
plt.style.use('fivethirtyeight')

In [None]:
# Plot Price and Sentiment Scores
for ticker in stocks_df.Ticker.unique():
    # Subset of data for each company
    stocks_df_sub = stocks_df[stocks_df.Ticker == ticker] 
    stocks_df_sub = stocks_df_sub[['Date', 'compound', 'Open', 'Close']]
    # Convert the 'Date' column to a pandas datetime object
    stocks_df_sub['Date'] = pd.to_datetime(stocks_df_sub['Date'])

    # Group the data by month and average the values
    stocks_df_sub = stocks_df_sub.groupby(pd.Grouper(key = 'Date', freq = 'M')).mean()

    # Reset index
    stocks_df_sub.reset_index(inplace = True)

    # Line plot
    fig, ax = plt.subplots(figsize = (10, 6))
    ax.plot(stocks_df_sub.Date, stocks_df_sub.Close, marker = '.' , color = 'darkgreen', label = 'Closing price')
    ax.set_xlabel('Year-Month', fontsize = 14)
    ax.set_ylabel('Closing Price (listing currency)', fontsize = 14)
    ax.set_title(f'{ticker}: Closing Price and Sentiment Score Over Time')

    ax2 = ax.twinx()
    # make a plot with different y-axis using second axis object
    ax2.plot(stocks_df_sub.Date, stocks_df_sub.compound, marker = '.', color = 'darkorange', label = 'Sentiment score')
    ax2.set_ylabel("Sentiment Score", fontsize = 14)

    ax.legend(loc = "lower right", bbox_to_anchor = (0.5, -0.25), fancybox = True, ncol = 5, fontsize = 15, shadow = True,)
    ax2.legend(loc = "lower left", bbox_to_anchor = (0.5, -0.25), fancybox = True, ncol = 5, fontsize = 15, shadow = True,)
    plt.show();
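
The plots give a visual impression only. To put a number on the relationship, one option is the Pearson correlation between the average daily sentiment score and the daily return. A sketch on made-up values (the column names follow the ones used above; the numbers are invented and carry no meaning):

```python
import pandas as pd

# Made-up rows: several headlines can share one trading day
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2023-05-01', '2023-05-01', '2023-05-02', '2023-05-03', '2023-05-04']),
    'compound': [0.2, 0.4, -0.1, 0.5, 0.0],
    'Close': [100.0, 100.0, 99.0, 102.0, 101.0],
})

# One row per day: mean sentiment and the closing price of that day
daily = sample.groupby('Date').agg(compound = ('compound', 'mean'), Close = ('Close', 'last'))

# Daily return in relative terms; the first day has no previous close
daily['Return'] = daily['Close'].pct_change()

# Pearson correlation; rows with a missing return are ignored
corr = daily['compound'].corr(daily['Return'])
print(round(corr, 3))
```

A correlation on a handful of days is not reliable evidence, so with the real data it is worth checking how many days actually enter the calculation.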

## References

1. Ran Aroussi (2023), *"Download market data from Yahoo! Finance's API"*, visit link: https://pypi.org/project/yfinance/

2. Damian Boh (2020), *"Sentiment Analysis of Stocks from Financial News using Python"*, Medium, visit link: https://medium.datadriveninvestor.com/sentiment-analysis-of-stocks-from-financial-news-using-python-82ebdcefb638

3. YouTube Channel CodeXplore (2021), *"Hướng Dẫn Làm Data Visualisation Project với Matlplotlib và Python"* [Guide to Building a Data Visualisation Project with Matplotlib and Python], visit link: https://www.youtube.com/watch?v=N_7A3KPZIQw

4. Youtube Channel Thu Vu Data Analytics (2023), *"Building a Chatbot with ChatGPT API and Reddit Data"*, visit link: https://www.youtube.com/watch?v=EE1Y2enHrcU