The goal of this notebook is to show how financial outcomes can be predicted from textual data with the DistilBERT transformer model. 

This notebook builds on the following files: 

*   `company_des.csv` with company descriptions, see `2. Business Descriptions.ipynb` 
*   `ta.csv` and `rev.csv` with total assets and revenue data, see `3. YFinance Data.ipynb`

This notebook goes through the following steps: 

1.   Mount your Google Drive and establish the working directory. 
2.   Install and load the necessary libraries. 
3.   Load, merge, clean the data. 
4.   Create the label variable. 
5.   Prepare the predictor.  
6.   Run the DistilBERT model. 
7.   Train logistic regression and evaluate its accuracy.





**Note**. Save this Colab notebook to your Drive via File > Save a copy in Drive to be able to edit it. 

# 1. Mount your Google Drive and establish the working directory

Mounting gives the notebook access to the files on your Google Drive. You'll need to allow Google Drive for desktop to access your Google Account and then copy the sign-in code into the authorization code field. 

In [3]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive/')

Mounted at /content/gdrive/


Set the working directory to a folder on your Google Drive by changing `root_dir` to that folder's path: 

`root_dir = "/content/gdrive/MyDrive/..."`

In [4]:
import os

# Set your working directory to a folder in your Google Drive. This way, if your notebook times out,
# your files will be saved in your Google Drive!

# the base Google Drive directory
root_dir = "/content/gdrive/MyDrive/BU/Year1/Summer/"

# choose where you want your project files to be saved
project_folder = "capstone/"

def create_and_set_working_directory(project_folder):
  # check if your project folder exists. if not, it will be created.
  if not os.path.isdir(root_dir + project_folder):
    os.mkdir(root_dir + project_folder)
    print(root_dir + project_folder + ' did not exist but was created.')

  # change the OS to use your project folder as the working directory
  os.chdir(root_dir + project_folder)

  # create a test file to make sure it shows up in the right place
  !touch 'new_file_in_working_directory.txt'
  print('\nYour working directory was changed to ' + root_dir + project_folder + \
        "\n\nAn empty text file was created there. You can also run !pwd to confirm the current working directory." )

create_and_set_working_directory(project_folder)


Your working directory was changed to /content/gdrive/MyDrive/BU/Year1/Summer/capstone/

An empty text file was created there. You can also run !pwd to confirm the current working directory.
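
As a minimal sketch of the same folder-creation idea outside Colab, the snippet below uses `pathlib` and a temporary directory in place of the Google Drive root. The helper name `ensure_project_dir` is hypothetical and not part of the notebook; unlike the function above, it does not change the working directory or create a test file.

```python
import tempfile
from pathlib import Path

def ensure_project_dir(root_dir, project_folder):
    """Create root_dir/project_folder if needed and return it as a Path."""
    project_path = Path(root_dir) / project_folder
    # parents=True creates missing parent folders; exist_ok=True makes reruns safe
    project_path.mkdir(parents=True, exist_ok=True)
    return project_path

# A temporary directory stands in for the Google Drive root in this sketch
with tempfile.TemporaryDirectory() as tmp_root:
    project_path = ensure_project_dir(tmp_root, "capstone")
    created = project_path.is_dir()
    print(project_path.name, created)
```

Because `mkdir` is called with `exist_ok=True`, rerunning the cell after a timeout should not raise an error when the folder already exists.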


Check if the function worked by listing the files in the project folder. During the very first run it should contain only `new_file_in_working_directory.txt`. If you upload other files to your `project_folder`, other files will be listed too. 

In [5]:
!ls

 companyname.csv		    'First 26stock_des.csv'
 Environmental_Impact_dataset1.csv   new_file_in_working_directory.txt


# 2. Install and load the libraries

In [6]:
!pip install transformers

In [7]:
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# 3. Load, merge, and clean the data

In [8]:
# load the csv files
stock_des = pd.read_csv('First 26stock_des.csv')
df = pd.read_csv('Environmental_Impact_dataset1.csv')
df1=pd.read_csv('companyname.csv') 

In [15]:
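# drop the leftover index column; recent pandas versions no longer accept a positional axis argument
df = df.drop(columns='Unnamed: 0')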

In [16]:
df.head()

Unnamed: 0,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),Env_intensity,industry_avg,Industry_indicator,Environmental_Growth,Ind_Yearavg
0,2016,1&1 DRILLISCH AG,Germany,Post and telecommunications (64),-0.0007,-0.0007,-0.018382,1,,-0.01164
1,2016,AFK SISTEMA PAO,Russia,Post and telecommunications (64),0.154,0.154,-0.018382,1,6.722107,-0.01164
2,2016,AMERICA MOVIL S.A.B. DE C.V.,Mexico,Post and telecommunications (64),-0.0128,-0.0128,-0.018382,1,13.274336,-0.01164
3,2016,AT&T INC.,United States,Post and telecommunications (64),-0.017,-0.017,-0.018382,1,-8.108108,-0.01164
4,2016,CHORUS LIMITED,New Zealand,Post and telecommunications (64),-0.0114,-0.0114,-0.018382,1,-12.977099,-0.01164


In [17]:
# merge the dataframes into one
df2 = pd.merge(df1, stock_des, on='ticker')
df2.head()

Unnamed: 0,fyear,ticker,CompanyName,description
0,2010,AEP,AMERICAN ELECTRIC POWER CO,"American Electric Power Company, Inc., an elec..."
1,2011,AEP,AMERICAN ELECTRIC POWER CO,"American Electric Power Company, Inc., an elec..."
2,2012,AEP,AMERICAN ELECTRIC POWER CO,"American Electric Power Company, Inc., an elec..."
3,2013,AEP,AMERICAN ELECTRIC POWER CO,"American Electric Power Company, Inc., an elec..."
4,2014,AEP,AMERICAN ELECTRIC POWER CO,"American Electric Power Company, Inc., an elec..."


In [18]:
df2=pd.merge(df,df2,on='CompanyName')
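
An inner merge on `CompanyName` silently drops every row whose name does not match exactly in both tables. The sketch below, built on made-up company names and values, shows one way to inspect such mismatches with the `indicator` option of `pd.merge`. The frames `env` and `desc` are illustrative stand-ins, not the notebook's data.

```python
import pandas as pd

# Toy stand-ins for the environmental data and the description data
env = pd.DataFrame({
    "CompanyName": ["ALPHA CO", "BETA LTD", "GAMMA INC"],
    "Env_intensity": [-0.01, 0.02, 0.03],
})
desc = pd.DataFrame({
    "CompanyName": ["ALPHA CO", "Beta Ltd"],  # note the different capitalization
    "description": ["Toy description A", "Toy description B"],
})

# The inner merge keeps only names that match exactly in both frames
inner = pd.merge(env, desc, on="CompanyName")

# An outer merge with indicator=True labels each row's origin in a "_merge" column
outer = pd.merge(env, desc, on="CompanyName", how="outer", indicator=True)
unmatched = outer[outer["_merge"] != "both"]

print(len(inner), "matched row(s);", len(unmatched), "unmatched row(s)")
```

If many rows turn out to be unmatched, normalizing the names (for example, upper-casing and stripping whitespace) before merging may recover some of them.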

Due to Colab's RAM limitations, limit the description size. We allow 350 characters, which is approximately 50 words or 3+ sentences. If you still run into RAM issues, try terminating other Colab sessions (Runtime > Manage sessions) or reducing the limit to 300 characters. 

In [19]:
df2['description'] = df2['description'].str.slice(0,350)

Create a binary variable that is 1 if the environmental intensity (`Env_intensity`) is above its median and 0 otherwise. 

This is the **dependent variable** (label) that we'll try to predict. 

In [29]:
df2['HIGH_EI'] = (df2['Env_intensity'].gt(df2['Env_intensity'].median())).astype(int)
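
To make the median split concrete, here is a small sketch on the five `Env_intensity` values shown in the `df.head()` output earlier. The variable names are illustrative only.

```python
import pandas as pd

# The five Env_intensity values from the df.head() output above
env_intensity = pd.Series([-0.0007, 0.154, -0.0128, -0.017, -0.0114])

median = env_intensity.median()

# gt() is a strict comparison, so a value equal to the median is labeled 0
high_ei = env_intensity.gt(median).astype(int)

print(median)
print(high_ei.tolist())
```

Because the comparison is strict, rows exactly at the median fall into class 0, and missing values compare as `False` and therefore also end up in class 0.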

# Preparing the predictor and DistilBERT model

**Note**. Please enable GPU in Edit > Notebook settings > Hardware accelerator. 

Load the pre-trained DistilBERT model and its tokenizer.

In [22]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Tokenize the textual data for DistilBERT. 

In [23]:
tokenized = df2['description'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

Pad all lists of tokenized values to the same size. 

In [24]:
# length of the longest tokenized description
max_len = max(len(i) for i in tokenized.values)

# right-pad shorter sequences with 0 so that all rows have the same length
padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [25]:
np.array(padded).shape

(120, 75)
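
The padding step can be sketched on a few short lists of token ids. The ids below are illustrative values chosen for this example rather than output of the tokenizer above.

```python
import numpy as np

# Three toy "tokenized" descriptions of different lengths (illustrative ids)
token_lists = [
    [101, 7592, 102],
    [101, 102],
    [101, 7592, 2088, 999, 102],
]

# Every row is extended with zeros up to the length of the longest row
max_len = max(len(ids) for ids in token_lists)
padded_toy = np.array([ids + [0] * (max_len - len(ids)) for ids in token_lists])

print(padded_toy.shape)
print(padded_toy)
```

Without padding, the lists have different lengths and cannot be stacked into a single rectangular array for the model.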

Create an attention mask so that DistilBERT ignores (masks) the padding when it processes its input.

In [26]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(120, 75)
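
A small sketch of what the mask contains, assuming (as in the padding step) that 0 appears only as the padding value. The array `padded_toy` is a made-up example.

```python
import numpy as np

# A toy padded array: zeros mark padding positions
padded_toy = np.array([
    [101, 7592, 102, 0, 0],
    [101, 102, 0, 0, 0],
    [101, 7592, 2088, 999, 102],
])

# 1 where there is a real token, 0 where there is padding
mask_toy = np.where(padded_toy != 0, 1, 0)

# Summing each row of the mask recovers the unpadded sequence lengths
real_lengths = mask_toy.sum(axis=1)

print(mask_toy)
print(real_lengths.tolist())
```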

# DistilBERT model

We run the pretrained DistilBERT model on the prepared predictor and keep the result in the `last_hidden_states` variable. 

In [27]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)
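
If a single forward pass over all rows exhausts memory, one possible workaround is to process the rows in batches and concatenate the results. The sketch below shows only the batching logic with NumPy. Both `run_in_batches` and `fake_encode` are hypothetical names, and `fake_encode` is a stand-in for the model call rather than DistilBERT itself.

```python
import numpy as np

def run_in_batches(input_ids, attention_mask, encode_batch, batch_size=32):
    """Apply encode_batch to consecutive row slices and stack the results."""
    outputs = []
    for start in range(0, len(input_ids), batch_size):
        stop = start + batch_size
        outputs.append(encode_batch(input_ids[start:stop], attention_mask[start:stop]))
    return np.concatenate(outputs, axis=0)

# Hypothetical stand-in for the model: one 4-dimensional vector per row,
# filled with the number of unmasked tokens in that row
def fake_encode(ids, mask):
    return np.ones((len(ids), 4)) * mask.sum(axis=1, keepdims=True)

ids = np.ones((10, 6), dtype=int)
mask = np.ones((10, 6), dtype=int)

# 10 rows with batch_size=4 are processed as batches of 4, 4, and 2 rows
features_toy = run_in_batches(ids, mask, fake_encode, batch_size=4)
print(features_toy.shape)
```

With the real model, the function passed as `encode_batch` would need to convert each slice to tensors, call the model under `torch.no_grad()`, and return the per-row vectors as a NumPy array.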

# Logistic regression model 



Keep the hidden state of the first token (the `[CLS]` position) of each description as its feature vector, and assign the outcome variable to `labels`. 

In [31]:
features = last_hidden_states[0][:,0,:].numpy()
labels = df2['HIGH_EI']
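
The slice `[:,0,:]` selects, for every row, the vector at sequence position 0. The sketch below reproduces that indexing on a small random array. The shape `(4, 6, 8)` is an arbitrary stand-in for (number of descriptions, padded length, hidden size).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy hidden states with shape (batch, sequence length, hidden size)
hidden = rng.normal(size=(4, 6, 8))

# Keep all rows, only sequence position 0, and all hidden dimensions
cls_features = hidden[:, 0, :]

print(cls_features.shape)
```

The result is a 2D array with one row per description, which is the shape scikit-learn estimators expect for their feature matrix.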

Split the data into train and test subsets, train the logistic regression on the train set, and evaluate its accuracy on the test set. 

In [32]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)
lr_clf = LogisticRegression(max_iter=5000)
lr_clf.fit(train_features, train_labels)
print(lr_clf.score(test_features, test_labels))

0.9333333333333333
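
Because `train_test_split` shuffles randomly, the accuracy above can change from run to run. The sketch below, on synthetic data rather than the notebook's features, shows one way to make the split reproducible with `random_state` and to keep class proportions equal in both subsets with `stratify`. The seed and the 25% test share are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: 120 rows, 5 features, two balanced classes
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 5)) + y[:, None]  # class 1 is shifted by one unit

# random_state fixes the shuffle; stratify keeps the class balance in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
acc = clf.score(X_test, y_test)
print(len(y_test), int(y_test.sum()), acc)
```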


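Check whether this approach works better than a baseline that ignores the features by comparing the test accuracy above with the cross-validated score of a dummy classifier (about 0.93 vs. about 0.56). 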

In [33]:
from sklearn.dummy import DummyClassifier
clf = DummyClassifier()

scores = cross_val_score(clf, train_features, train_labels)
print("Dummy classifier score: %0.3f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Dummy classifier score: 0.556 (+/- 0.20)
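
The dummy classifier's behavior depends on its `strategy`, and the default has changed between scikit-learn versions, so naming it explicitly can make the baseline easier to interpret. The sketch below uses `strategy="most_frequent"` on made-up labels; the arrays are illustrative only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels: class 1 is the majority in the training set
y_train_toy = np.array([1] * 7 + [0] * 3)
y_test_toy = np.array([1, 1, 0, 0, 1])

# The features are ignored by the dummy classifier, so zeros are enough here
X_train_toy = np.zeros((10, 2))
X_test_toy = np.zeros((5, 2))

# Always predicts the most frequent training class (here: 1)
baseline = DummyClassifier(strategy="most_frequent").fit(X_train_toy, y_train_toy)
baseline_acc = baseline.score(X_test_toy, y_test_toy)
print(baseline_acc)
```

With this strategy, the baseline accuracy equals the share of the majority training class among the test labels, which gives a simple reference point for the logistic regression score.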


