## Text Analysis - Company Descriptions

In this notebook, we analyze text descriptions of the companies that appear in the WRDS data and in the Harvard Business School Impact-Weighted Accounts Project report.

We will apply Natural Language Processing (NLP) using the pre-trained DistilBERT model. First, we need to merge the datasets and obtain the company descriptions from Yahoo Finance.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as msno
import warnings
from sklearn import linear_model
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import requests
from bs4 import BeautifulSoup
warnings.filterwarnings('ignore')

In [None]:
df_ishares = pd.read_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/iShares-merged.csv')
df_ei = pd.read_csv('/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/Environmental_impact_cleaned.csv')
stocks = pd.read_csv("/Users/maralinetorres/Documents/GitHub/Predicting-Environmental-and-Social-Actions/Datasets/pilot_stocks.csv")

For the iShares holdings, we don't need the rows in the 'Cash and/or Derivatives' sector, and we keep only the columns we need.

In [None]:
df_ishares = df_ishares.loc[df_ishares.Sector != 'Cash and/or Derivatives',['Ticker','Name','Sector','CUSIP','ISIN','Location']]
df_ishares.drop_duplicates(inplace=True)
df_ishares.head()

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location
0,AAPL,APPLE INC,Information Technology,37833100,US0378331005,United States
1,MSFT,MICROSOFT CORP,Information Technology,594918104,US5949181045,United States
2,AMZN,AMAZON COM INC,Consumer Discretionary,23135106,US0231351067,United States
3,TSLA,TESLA INC,Consumer Discretionary,88160R101,US88160R1014,United States
4,FB,FACEBOOK CLASS A INC,Communication,30303M102,US30303M1027,United States


In [None]:
len(df_ishares.Ticker.unique())

855

In [None]:
df_ei2 = df_ei.copy()
df_ei2 = df_ei2.loc[:,['ISIN','CompanyName','Country']]
df_ei2.drop_duplicates(inplace=True)
df_ei2.head()

Unnamed: 0,ISIN,CompanyName,Country
0,GB00BMX64W89,Saga plc,United Kingdom
1,MYL1818OO003,BURSA MALAYSIA BHD,Malaysia
2,GB0031638363,INTERTEK GROUP PLC,United Kingdom
3,ZAE000079711,JSE LIMITED,South Africa
4,FR0006174348,BUREAU VERITAS SA,France


In [None]:
len(df_ei2.ISIN.unique())

2628

In [None]:
stocks = stocks.iloc[:,0:2]
stocks.head()

Unnamed: 0,Year,Ticker
0,2005,AEE
1,2006,AEE
2,2007,AEE
3,2008,AEE
4,2009,AEE


Let's start merging the datasets. First, we are going to merge iShares and Environmental impact by ISIN. 

In [None]:
df = pd.merge(df_ishares, df_ei2, on='ISIN', how='outer')
df.head()

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location,CompanyName,Country
0,AAPL,APPLE INC,Information Technology,37833100,US0378331005,United States,,
1,MSFT,MICROSOFT CORP,Information Technology,594918104,US5949181045,United States,MICROSOFT CORPORATION,United States
2,AMZN,AMAZON COM INC,Consumer Discretionary,23135106,US0231351067,United States,"AMAZON.COM, INC.",United States
3,TSLA,TESLA INC,Consumer Discretionary,88160R101,US88160R1014,United States,,
4,FB,FACEBOOK CLASS A INC,Communication,30303M102,US30303M1027,United States,FACEBOOK INCORPORATION,United States


In [None]:
print(df.shape)
print(len(df.Ticker.unique()))

(3005, 8)
856


In [None]:
missing_ticker = df.loc[df.Ticker.isna(),'ISIN']
df_missing = df_ei.loc[df_ei.ISIN.isin(missing_ticker), ['ISIN','CompanyName','Country']]
df_missing.drop_duplicates(inplace=True)
print(f'We are unable to match {df_missing.shape[0]} ISIN')
df_missing.head()

We are unable to match 2136 ISIN


Unnamed: 0,ISIN,CompanyName,Country
0,GB00BMX64W89,Saga plc,United Kingdom
1,MYL1818OO003,BURSA MALAYSIA BHD,Malaysia
3,ZAE000079711,JSE LIMITED,South Africa
5,GB0007370074,RICARDO PLC,United Kingdom
6,AU000000ASX7,ASX LIMITED,Australia


With this merge, we found tickers for 492 of the 2,628 companies in the Environmental Impact dataset. We still need to figure out how to map the remaining ISINs to tickers.
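
One possible way to count matches directly, rather than inferring them from missing tickers, is the `indicator` option of `pd.merge`. The sketch below uses two toy frames (hypothetical stand-ins for `df_ishares` and `df_ei2`, with ISINs copied from the rows shown above) and is not the notebook's own method.

```python
import pandas as pd

# Toy stand-ins for df_ishares and df_ei2 (assumed, for illustration only).
holdings = pd.DataFrame({
    "Ticker": ["AAPL", "MSFT"],
    "ISIN": ["US0378331005", "US5949181045"],
})
impact = pd.DataFrame({
    "ISIN": ["US5949181045", "GB00BMX64W89"],
    "CompanyName": ["MICROSOFT CORPORATION", "Saga plc"],
})

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only".
merged = pd.merge(holdings, impact, on="ISIN", how="outer", indicator=True)
counts = merged["_merge"].value_counts()
print(counts)
```

Rows marked `right_only` would correspond to Environmental Impact companies without a ticker.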

In [None]:
df_missing.to_csv('Companies_missing_Tickers.csv', index=False)

In the meantime, we export the companies that we were unable to match and continue working with the ones that have a ticker.

The pilot stocks contain 52 unique tickers. Let's check whether all of them appear in the merged dataframe.

In [None]:
tickers = stocks.Ticker.unique()
x = df.loc[(df.Ticker.isin(tickers)), ['Ticker']]
print(x.shape)
print(f'{len(x.Ticker.unique())} Tickers')
x[x.duplicated()]

(55, 1)
52 Tickers


Unnamed: 0,Ticker
540,DTE
726,MRO
802,ATO


These three tickers are duplicated. Let's see why.

In [None]:
df_ishares[df_ishares.Ticker == 'ATO']

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location
417,ATO,ATMOS ENERGY CORP,Utilities,49560105,US0495601058,United States
809,ATO,ATOS,Information Technology,S56547813,FR0000051732,France


It seems the ticker is not globally unique. We filter by United States because our pilot stocks are US stocks from the S&P 500.

In [None]:
df_us = df.loc[(df.Ticker.isin(tickers)) & (df.Location == 'United States'), ['Ticker']]
print(f'Now, we have {len(df_us.Ticker.unique())} tickers.')

Now, we have 52 tickers.


We know that we couldn't match the ISIN for every company in the Environmental Impact dataset. However, let's look at the ones we did match and validate that they matched correctly.

In [None]:
found = df.loc[(df.Ticker.notna()) & (df.CompanyName.notna()),]
print(len(found.Ticker.unique()))
print(found.shape)

487
(495, 8)


Could there be duplicate values? Let's subset the records that share a ticker.

In [None]:
ticker_agg = found.groupby('Ticker')[['Name']].count().sort_values(by='Name', ascending=False).reset_index()
tickers2 = ticker_agg[ticker_agg.Name >= 2]['Ticker']
df3 = found[found.Ticker.isin(tickers2)].sort_values(by='Ticker')
df3.head(6)

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location,CompanyName,Country
428,AAL,AMERICAN AIRLINES GROUP INC,Industrials,02376R102,US02376R1023,United States,AMERICAN AIRLINES GROUP INC,United States
557,AAL,ANGLO AMERICAN PLC,Materials,-,GB00B1XZS820,United Kingdom,ANGLO AMERICAN PLC,United Kingdom
99,ADP,AUTOMATIC DATA PROCESSING INC,Information Technology,53015103,US0530151036,United States,"AUTOMATIC DATA PROCESSING, INC.",United States
866,ADP,AEROPORTS DE PARIS SA,Industrials,-,FR0010340141,France,AEROPORTS DE PARIS,France
271,DTE,DTE ENERGY,Utilities,233331107,US2333311072,United States,DTE ENERGY COMPANY,United States
540,DTE,DEUTSCHE TELEKOM N AG,Communication,S58423591,DE0005557508,Germany,DEUTSCHE TELEKOM AG,Germany


The same thing happened as before: the ticker is not globally unique, although the rows do match correctly by ISIN. A little research showed that the UK company AAL is listed as AAL.L on Yahoo Finance.

Similarly, the Yahoo Finance ticker for DTE (Deutsche Telekom) is DTE.DE. We decided to reformat the ticker for the non-US companies by appending a suffix. For some of them the suffix is '.' plus the first two letters of the ISIN (DE0005557508 becomes DTE.DE); for others it differs (AAL, with a GB ISIN, becomes AAL.L).

Let's see. 

In [None]:
ticker_agg = df.groupby('Ticker')[['Name']].count().sort_values(by='Name', ascending=False).reset_index()
tickers2 = ticker_agg[ticker_agg.Name >= 2]['Ticker']
df3 = df[df.Ticker.isin(tickers2)].sort_values(by='Ticker')
df3.head(6)

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location,CompanyName,Country
428,AAL,AMERICAN AIRLINES GROUP INC,Industrials,02376R102,US02376R1023,United States,AMERICAN AIRLINES GROUP INC,United States
557,AAL,ANGLO AMERICAN PLC,Materials,-,GB00B1XZS820,United Kingdom,ANGLO AMERICAN PLC,United Kingdom
735,ADM,ADMIRAL GROUP PLC,Financials,-,GB00B02J6398,United Kingdom,ADMIRAL GROUP PLC,United Kingdom
234,ADM,ARCHER DANIELS MIDLAND,Consumer Staples,39483102,US0394831020,United States,,
866,ADP,AEROPORTS DE PARIS SA,Industrials,-,FR0010340141,France,AEROPORTS DE PARIS,France
99,ADP,AUTOMATIC DATA PROCESSING INC,Information Technology,53015103,US0530151036,United States,"AUTOMATIC DATA PROCESSING, INC.",United States


In [None]:
df4 = df3[df3.Location != 'United States']
df4

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location,CompanyName,Country
557,AAL,ANGLO AMERICAN PLC,Materials,-,GB00B1XZS820,United Kingdom,ANGLO AMERICAN PLC,United Kingdom
735,ADM,ADMIRAL GROUP PLC,Financials,-,GB00B02J6398,United Kingdom,ADMIRAL GROUP PLC,United Kingdom
866,ADP,AEROPORTS DE PARIS SA,Industrials,-,FR0010340141,France,AEROPORTS DE PARIS,France
802,ATO,ATOS,Information Technology,S56547813,FR0000051732,France,ATOS SE,France
553,DG,VINCI SA,Industrials,-,FR0000125486,France,VINCI,France
540,DTE,DEUTSCHE TELEKOM N AG,Communication,S58423591,DE0005557508,Germany,DEUTSCHE TELEKOM AG,Germany
560,EL,ESSILORLUXOTTICA SA,Consumer Discretionary,S72124779,FR0000121667,France,ESSILORLUXOTTICA SA,France
731,LEG,LEG IMMOBILIEN N AG,Real Estate,-,DE000LEG1110,Germany,LEG IMMOBILIEN AG,Germany
631,MRK,MERCK,Health Care,S47418447,DE0006599905,Germany,MERCK KGAA,Germany
726,MRO,MELROSE INDUSTRIES PLC,Industrials,-,GB00BZ1G4322,United Kingdom,MELROSE INDUSTRIES PLC,United Kingdom


In [None]:
ticker = df4.Ticker.tolist()
yahoo_ticker = ['AAL.L','ADM.L','ADP.PA','ATO.PA','DG.PA','DTE.DE','EL.PA','LEG.DE','MRK.DE','MRO.L','PRU.L','SAN.PA','SAN.MC','TEL.OL','TSCO.L']

In [None]:
df = df.sort_values(by='Ticker')
df.loc[(df.Ticker.isin(ticker)) & (df.Location != 'United States'), 'Ticker'] = yahoo_ticker
df.head()

Unnamed: 0,Ticker,Name,Sector,CUSIP,ISIN,Location,CompanyName,Country
867,-,IBERDROLA SA,Utilities,-,ES0144583236,Spain,,
711,1COV,COVESTRO AG,Materials,-,DE0006062144,Germany,COVESTRO AG,Germany
170,A,AGILENT TECHNOLOGIES INC,Health Care,00846U101,US00846U1016,United States,"AGILENT TECHNOLOGIES, INC.",United States
428,AAL,AMERICAN AIRLINES GROUP INC,Industrials,02376R102,US02376R1023,United States,AMERICAN AIRLINES GROUP INC,United States
557,AAL.L,ANGLO AMERICAN PLC,Materials,-,GB00B1XZS820,United Kingdom,ANGLO AMERICAN PLC,United Kingdom
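
The assignment above relies on the sorted rows lining up, one for one, with the hand-written `yahoo_ticker` list. A lookup keyed on `Location` would not depend on row order. The following is a minimal sketch under that assumption: the `YAHOO_SUFFIX` table and the `to_yahoo_ticker` helper are hypothetical names, and the suffixes are only the ones that can be read off the list above. Unlike the notebook, which renames only tickers that collide, this helper would apply to any non-US row.

```python
# Suffixes inferred from the hand-written yahoo_ticker list (assumption:
# one suffix per location; other locations are deliberately left out).
YAHOO_SUFFIX = {
    "United Kingdom": ".L",
    "France": ".PA",
    "Germany": ".DE",
    "Spain": ".MC",
}

def to_yahoo_ticker(ticker, location):
    """Return a Yahoo-style symbol; unknown locations keep the plain ticker."""
    if location == "United States":
        return ticker
    return ticker + YAHOO_SUFFIX.get(location, "")

rows = [
    ("AAL", "United States"),
    ("AAL", "United Kingdom"),
    ("DTE", "Germany"),
    ("ATO", "France"),
]
converted = [to_yahoo_ticker(t, loc) for t, loc in rows]
print(converted)
```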


In [None]:
print(len(df.Ticker.unique()))

870


Now we have 870 unique tickers that we can send to Yahoo Finance to get the company descriptions. Below, we create a list of tickers. We also export the merged dataset for later analysis.

In [None]:
tickers = df.Ticker.unique().tolist()

In [None]:
df.to_csv('df_merged.csv', index=False)

### Collecting company data from the Reuters website

Reuters, the news and media division of Thomson Reuters, is the world's largest multimedia news provider, reaching billions of people worldwide every day.

First, we add '.OQ' to each US ticker. '.OQ' is the suffix Reuters uses for Nasdaq listings, so tickers listed on other exchanges may not resolve.

In [None]:
tickers_US = df.loc[(df.Location == 'United States') & (df.Ticker.notna()),'Ticker'].unique().tolist()
tickers_nonUS = df.loc[(df.Location != 'United States') & (df.Ticker.notna()),'Ticker'].unique().tolist()

In [None]:
# reuter_ticker_nonUS = []
# for ticker in tickers_nonUS:
#     data = df.loc[df.Ticker == ticker, 'Location'].tolist()
#     new_ticker = str(ticker) + '.' + str(data[0][0])
#     reuter_ticker_nonUS.append(new_ticker)
# reuter_ticker_nonUS[:5]

['-.S', '1COV.G', 'AAL.L.U', 'ABBN.S', 'ABF.U']

In [None]:
# Append the Reuters '.OQ' suffix to every US ticker
reuter_ticker = [str(ticker) + '.OQ' for ticker in tickers_US]

In [None]:
# Create a loop to store URLs of all stocks' description page
URL = [] 
DES = [] 
comp_desc = {}
for i in reuter_ticker:
    url ='https://www.reuters.com/companies/'+i 
    URL.append(url)
    page = requests.get(url) # visits the URL
    htmldata = BeautifulSoup(page.content, 'html.parser')
    Business_Description = htmldata.find('p',{'class':'TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__serif___3lOpX Profile-body-2Aarn'}) # finds the business description part in the HTML code
    if Business_Description is not None:
        DES.append(Business_Description.text)
        comp_desc[i] = [Business_Description.text]
    else:
        comp_desc[i] = 'Not exists'  

In [None]:
exists = 0 
do_not = 0
for val in comp_desc.values():
    if val == 'Not exists':
        do_not+=1
    else:
        exists+=1

print(do_not)
print(exists)

362
143


### Collect companies descriptions from Yahoo Finance

In this section, we grab the company description directly from the Yahoo Finance website for the 870 tickers.

In [None]:
# Create a loop to store URLs of all stocks' description page
URL = [] 
DES = [] 
comp_desc = {}
for i in tickers[:2]: 
    i = str(i)
    url ='https://finance.yahoo.com/quote/'+i+'/profile' 
    URL.append(url)
    page = requests.get(url) # visits the URL 
    htmldata = BeautifulSoup(page.content, 'html.parser')
    Business_Description = htmldata.find('p',{'class':'Mt(15px) Lh(1.6)'}) # finds the business description part in the HTML code
    DES.append(Business_Description.text)
    comp_desc[i] = [Business_Description.text]

AttributeError: 'NoneType' object has no attribute 'text'
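
The error above arises because `find` returns `None` when no tag matches, and `None` has no `.text`. The Reuters loop earlier guards against this with an `is not None` check; the same guard can be factored into a small helper. This is a sketch only: `description_or_default` is a hypothetical name, and `SimpleNamespace` stands in for the tag object that BeautifulSoup would return.

```python
from types import SimpleNamespace

def description_or_default(node, default="Not exists"):
    """Return node.text when the lookup found a tag, else a placeholder."""
    return node.text if node is not None else default

# Stand-ins for the two possible results of htmldata.find(...).
found = SimpleNamespace(text="Example description.")
missing = None

comp_desc = {
    "A": description_or_default(found),
    "-": description_or_default(missing),
}
print(comp_desc)
```

With a guard like this, a ticker without a profile page is recorded as missing and the loop continues.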

In [None]:
df_comp_desc = pd.DataFrame.from_dict(comp_desc, orient='index', columns = ['Description'])
df_comp_desc.reset_index(inplace = True)
df_comp_desc.rename(columns = {'index':'Ticker'}, inplace=True)
df_comp_desc.head()

Unnamed: 0,Ticker,Description
0,-,
1,1COV,
2,A,
3,AAL,
4,AAL.L,


In [None]:
df_comp_desc.to_csv('Companydescription.csv',index=False)

### Start working with DistilBERT

If you don't have the package installed, you might need to run: !pip install transformers

If you do, just import the package and continue working.

# DistilBERT

Load, merge and clean the data

In [None]:
# load the csv files
stock_des=pd.read_csv('202 companies_des.csv')
df = pd.read_csv('IEV_holdings-1.csv')
df1=pd.read_csv('EE-ISIN_merged.csv') 

In [None]:
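# Pass the column by keyword; the positional axis argument is not accepted by recent pandas versions
df1 = df1.drop(columns='Unnamed: 0')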

In [None]:
df1=df1[df1['Year']==2019]
df1

Unnamed: 0,ISIN,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,CropProductionCapacity,MeatProductionCapacity,Biodiversity,AbioticResources,Waterproductioncapacity(Drinkingwater&IrrigationWater),WoodProductionCapacity,SDG1.5,SDG2.1,SDG2.2,SDG2.3,SDG2.4,SDG3.3,SDG3.4,SDG3.9,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,gvkey,fyear,datadate,at,isin,conm,fic,sic
0,MYL1818OO003,2019,BURSA MALAYSIA BHD,Malaysia,Activities auxiliary to financial intermediati...,-0.017,-3.47%,-1968379,-1924910,-451,-25349,-5938,-81,-168,-11502,20,-852646,-502708,-502460,-6337,-6337,-81118,-4791,-27,-11502,-168,-1,-1,-222,-2,10,10,-79,4%,272691,2019,20191231,2321.040,MYL1818OO003,BURSA MALAYSIA BHD,MYS,6200.0
1,GB0031638363,2019,INTERTEK GROUP PLC,United Kingdom,Activities auxiliary to financial intermediati...,-0.015,-9.49%,-60599272,-59281663,-13774,-788289,-184802,-2487,-3804,-324960,508,-26533166,-15557810,-15550827,-197072,-197072,-2509207,284215,-703,-324960,-3804,-17,-4,-6861,-20,254,254,-2470,1%,252384,2019,20191231,2818.400,GB0031638363,INTERTEK GROUP PLC,GBR,8700.0
2,ZAE000079711,2019,JSE LIMITED,South Africa,Activities auxiliary to financial intermediati...,-0.015,,-2290124,-2239814,-510,-29662,-6938,-93,-901,-12200,-6,-995881,-576811,-576488,-7415,-7415,-92910,-19470,-277,-12200,-901,0,-1,-253,0,-3,-3,-93,2%,278391,2019,20191231,40227.215,ZAE000079711,JSE LIMITED,ZAF,6211.0
3,FR0006174348,2019,BUREAU VERITAS SA,France,Activities auxiliary to financial intermediati...,-0.007,-5.10%,-39978650,-39107612,-9330,-520701,-121953,-1671,-4116,-214438,1172,-17514837,-10430409,-10425281,-130175,-130175,-1684676,561195,-577,-214438,-4116,-38,-9,-4607,-45,586,586,-1633,3%,286961,2019,20191231,7049.100,FR0006174348,BUREAU VERITAS SA,FRA,8700.0
4,GB0007370074,2019,RICARDO PLC,United Kingdom,Activities auxiliary to financial intermediati...,-0.007,-7.27%,-3247235,-3176408,-753,-42228,-9899,-135,-468,-17406,63,-1421576,-842731,-842343,-10557,-10557,-136060,34998,-87,-17406,-468,-2,0,-373,-2,31,31,-133,3%,221859,2019,20190630,371.900,GB0007370074,RICARDO PLC,GBR,8711.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1057,ZAE000216537,2019,BID CORPORATION LTD,South Africa,"Wholesale trade and commission trade, except o...",-0.014,-28.02%,-124047261,-121344740,-28962,-1603616,-374519,-5177,-35468,-658523,3743,-53787360,-32057343,-32036952,-400904,-400904,-5177518,533888,-10313,-658523,-35468,-167,-103,-14129,-200,1871,1871,-5007,6%,321670,2019,20190630,64931.978,ZAE000216537,BID CORP(NEW),ZAF,6799.0
1058,ZAE000058517,2019,SPAR GROUP LIMITED,South Africa,"Wholesale trade and commission trade, except o...",-0.003,-49.56%,-24127362,-23594642,-6350,-305744,-70543,-1082,-27940,-124010,2948,-10145768,-6518174,-6510343,-76436,-76436,-1059431,420567,-8124,-124010,-27940,-132,-81,-2898,-157,1474,1474,-948,20%,271087,2019,20190930,34052.900,ZAE000058517,SPAR GROUP LTD,ZAF,5140.0
1059,CNE100000FN7,2019,SINOPHARM GROUP CO LTD,China,"Wholesale trade and commission trade, except o...",-0.001,-1.85%,-46263408,-42158085,-10358,-562862,-131780,-1839,-16404,-3384365,2285,-18933076,-11481810,-11476273,-140715,-140715,-1857658,1173028,-694,-3384365,-16404,-77,-13,-5068,-92,1142,1142,-1761,1%,292783,2019,20191231,269888.371,CNE100000FN7,SINOPHARM GROUP CO,CHN,5122.0
1060,TW0005902001,2019,"Tait Marketing & Distribution Co., Ltd.",Taiwan,"Wholesale trade and commission trade, except o...",-0.001,-1.61%,-35015,-34239,-11,-456,-105,-2,-25,-185,8,-15168,-10607,-10595,-114,-114,-1736,3531,-3,-185,-25,0,0,-5,0,4,4,-1,22%,279003,2019,20191231,1076.614,TW0005902001,TAIT MARKETING & DIST CO LTD,TWN,5140.0


In [None]:
df2=pd.merge(df,df1,on='ISIN')
df2

Unnamed: 0,Ticker,Name,Sector,Asset Class,Market Value,Weight (%),Notional Value,Shares,CUSIP,ISIN,SEDOL,Price,Location,Exchange,Currency,FX Rate,Market Currency,Accrual Date,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,CropProductionCapacity,MeatProductionCapacity,Biodiversity,AbioticResources,Waterproductioncapacity(Drinkingwater&IrrigationWater),WoodProductionCapacity,SDG1.5,SDG2.1,SDG2.2,SDG2.3,SDG2.4,SDG3.3,SDG3.4,SDG3.9,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,gvkey,fyear,datadate,at,isin,conm,fic,sic
0,NESN,NESTLE SA,Consumer Staples,Equity,64181630.45,3.32,64181630.45,505570.00,S71238703,CH0038863350,7123870,126.95,Switzerland,SIX Swiss Exchange,USD,0.90,CHF,-,2019,NESTLE S.A.,Switzerland,Processing of Food products nec,-0.016,-9.75%,-1527139399,-1405885897,-416410,-18578077,-4368093,-65196,-113252,-97865932,153457,-628976294,-420810968,-420563568,-4644519,-4644519,-68675754,119281774,-6492,-97865932,-113252,-5213,-135,-201854,-6213,76729,76729,-59918,0%,16603,2019,20191231,127940.000,CH0038863350,NESTLE SA/AG,CHE,2000.0
1,ROG,ROCHE HOLDING PAR AG,Health Care,Equity,46859542.94,2.42,46859542.94,123210.00,S71103881,CH0012032048,7110388,380.32,Switzerland,SIX Swiss Exchange,USD,0.90,CHF,-,2019,ROCHE HOLDING AKTIENGESELLSCHAFT,Switzerland,"Manufacture of medical, precision and optical ...",-0.003,-0.84%,-160446737,-152869527,-35079,-1974118,-461147,-6303,-1149997,-3950618,52,-66193394,-38441169,-38420015,-493529,-493529,-6193341,-4364957,-722554,-3950618,-1149997,-139,-317,-16901,-165,26,26,-6163,6%,25648,2019,20191231,83091.000,CH0012032048,ROCHE HOLDING AG,CHE,2834.0
2,NOVN,NOVARTIS AG,Health Care,Equity,40415079.41,2.09,40415079.41,433340.00,S71030654,CH0012005267,7103065,93.26,Switzerland,SIX Swiss Exchange,USD,0.90,CHF,-,2019,NOVARTIS AG,Switzerland,"Manufacture of medical, precision and optical ...",-0.007,-3.18%,-348286772,-314529861,-74014,-4143676,-967262,-13198,-903525,-27659340,4103,-138890514,-81935724,-81878242,-1035919,-1035919,-13219836,-1114082,-567694,-27659340,-903525,-246,-249,-36343,-294,2051,2051,-12948,3%,101310,2019,20191231,118370.000,CH0012005267,NOVARTIS AG,CHE,2834.0
3,AZN,ASTRAZENECA PLC,Health Care,Equity,26955662.18,1.39,26955662.18,230130.00,S09895293,GB0009895292,989529,117.13,United Kingdom,London Stock Exchange,USD,0.71,GBP,-,2019,ASTRAZENECA PLC,United Kingdom,"Manufacture of medical, precision and optical ...",-0.005,-3.98%,-134985600,-129784288,-39655,-1688455,-393446,-6368,-573484,-2526476,26573,-56694491,-39535401,-39506085,-422114,-422114,-6472569,11948529,-781929,-2526476,-573484,-1050,-312,-18122,-1251,13286,13286,-5305,22%,28272,2019,20191231,61377.000,GB0009895292,ASTRAZENECA PLC,GBR,2834.0
4,SIE,SIEMENS N AG,Industrials,Equity,23306443.99,1.21,23306443.99,141729.00,S57279739,DE0007236101,5727973,164.44,Germany,Xetra,USD,0.82,EUR,-,2019,SIEMENS AG,Germany,Activities of membership organisation n.e.c. (91),-0.005,-5.49%,-431933738,-405755192,-97827,-5290306,-1228970,-16859,-600004,-18945339,758,-176454055,-102930381,-102840656,-1322577,-1322577,-16586392,-10700011,-167303,-18945339,-600004,-206,-1140,-46960,-246,379,379,-16650,7%,19349,2019,20190930,150248.000,DE0007236101,SIEMENS AG,DEU,9997.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149,GALP,GALP ENERGIA SGPS SA,Energy,Equity,939164.58,0.05,939164.58,80006.00,-,PTGAL0AM0009,B1FW751,11.74,Portugal,Nyse Euronext - Euronext Lisbon,USD,0.82,EUR,-,2019,"GALP ENERGIA SGPS, S.A.",Portugal,Extraction of crude petroleum and services rel...,-0.057,-79.38%,-1089402909,-975645972,-256476,-12901913,-3033548,-42690,-,-97567302,44991,-436137236,-272357396,-272229158,-3225478,-3225478,-44186427,39651504,-,-97567302,-,-1522,-,-126443,-1814,22496,22496,-41149,0%,279448,2019,20191231,13770.000,PTGAL0AM0009,GALP ENERGIA SGPS SA,PRT,2911.0
150,DLG,DIRECT LINE INSURANCE PLC,Financials,Equity,919839.00,0.05,919839.00,223040.00,-,GB00BY9D0Y18,BY9D0Y1,4.12,United Kingdom,London Stock Exchange,USD,0.71,GBP,-,2019,DIRECT LINE INSURANCE GROUP PLC,United Kingdom,"Insurance and pension funding, except compulso...",-0.001,-0.55%,-3981590,-3892996,-1032,-52067,-12193,-177,-2084,-21437,395,-1753332,-1111209,-1110653,-13017,-13017,-180503,224126,-168,-21437,-2084,-13,-1,-498,-16,197,197,-164,10%,268159,2019,20191231,9434.200,GB00BY9D0Y18,DIRECT LINE INSURANCE GRP,GBR,6331.0
151,SECU B,SECURITAS B,Industrials,Equity,911513.62,0.05,911513.62,56773.00,S55540413,SE0000163594,5554041,16.06,Sweden,Nasdaq Omx Nordic,USD,8.26,SEK,-,2019,SECURITAS AB,Sweden,Other service activities (93),-0.004,-9.11%,-50877031,-49705407,-11746,-660208,-154717,-2105,-7830,-335967,949,-22217949,-13159217,-13152988,-165052,-165052,-2124359,460741,-2332,-335967,-7830,-35,-9,-5817,-42,475,475,-2069,3%,104981,2019,20191231,62190.000,SE0000163594,SECURITAS AB,SWE,7381.0
152,CLN,CLARIANT AG,Materials,Equity,862542.17,0.04,862542.17,39804.00,S71139901,CH0012142631,7113990,21.67,Switzerland,SIX Swiss Exchange,USD,0.90,CHF,-,2019,CLARIANT AG,Switzerland,Chemicals nec,-0.067,-72.91%,-303725335,-244189831,-61797,-3279867,-755897,-10775,-4931,-55439989,17754,-108643559,-67205147,-67103952,-819967,-819967,-10884662,7221119,-400,-55439989,-4931,-639,-81,-30023,-762,8877,8877,-10128,2%,206489,2019,20191231,7979.000,CH0012142631,CLARIANT AG,CHE,2860.0


In [None]:
df2=pd.merge(df,df1,on='ISIN')
df2 = pd.merge(df2, stock_des, on='Ticker',how='inner')
df2

Unnamed: 0,Ticker,Name,Sector,Asset Class,Market Value,Weight (%),Notional Value,Shares,CUSIP,ISIN,SEDOL,Price,Location,Exchange,Currency,FX Rate,Market Currency,Accrual Date,Year,CompanyName,Country,Industry(Exiobase),EnvironmentalIntensity(Sales),EnvironmentalIntensity(OpInc),TotalEnvironmentalCost,WorkingCapacity,FishProductionCapacity,CropProductionCapacity,MeatProductionCapacity,Biodiversity,AbioticResources,Waterproductioncapacity(Drinkingwater&IrrigationWater),WoodProductionCapacity,SDG1.5,SDG2.1,SDG2.2,SDG2.3,SDG2.4,SDG3.3,SDG3.4,SDG3.9,SDG6,SDG12.2,SDG14.1,SDG14.2,SDG14.3,SDG14.c,SDG15.1,SDG15.2,SDG15.5,%Imputed,gvkey,fyear,datadate,at,isin,conm,fic,sic,description
0,ROG,ROCHE HOLDING PAR AG,Health Care,Equity,46859542.94,2.42,46859542.94,123210.0,S71103881,CH0012032048,7110388,380.32,Switzerland,SIX Swiss Exchange,USD,0.9,CHF,-,2019,ROCHE HOLDING AKTIENGESELLSCHAFT,Switzerland,"Manufacture of medical, precision and optical ...",-0.003,-0.84%,-160446737,-152869527,-35079,-1974118,-461147,-6303,-1149997,-3950618,52,-66193394,-38441169,-38420015,-493529,-493529,-6193341,-4364957,-722554,-3950618,-1149997,-139,-317,-16901,-165,26,26,-6163,6%,25648,2019,20191231,83091.0,CH0012032048,ROCHE HOLDING AG,CHE,2834.0,"Rogers Corporation designs, develops, manufact..."
1,NOVN,NOVARTIS AG,Health Care,Equity,40415079.41,2.09,40415079.41,433340.0,S71030654,CH0012005267,7103065,93.26,Switzerland,SIX Swiss Exchange,USD,0.9,CHF,-,2019,NOVARTIS AG,Switzerland,"Manufacture of medical, precision and optical ...",-0.007,-3.18%,-348286772,-314529861,-74014,-4143676,-967262,-13198,-903525,-27659340,4103,-138890514,-81935724,-81878242,-1035919,-1035919,-13219836,-1114082,-567694,-27659340,-903525,-246,-249,-36343,-294,2051,2051,-12948,3%,101310,2019,20191231,118370.0,CH0012005267,NOVARTIS AG,CHE,2834.0,"Novan, Inc., a clinical development-stage biot..."
2,AZN,ASTRAZENECA PLC,Health Care,Equity,26955662.18,1.39,26955662.18,230130.0,S09895293,GB0009895292,989529,117.13,United Kingdom,London Stock Exchange,USD,0.71,GBP,-,2019,ASTRAZENECA PLC,United Kingdom,"Manufacture of medical, precision and optical ...",-0.005,-3.98%,-134985600,-129784288,-39655,-1688455,-393446,-6368,-573484,-2526476,26573,-56694491,-39535401,-39506085,-422114,-422114,-6472569,11948529,-781929,-2526476,-573484,-1050,-312,-18122,-1251,13286,13286,-5305,22%,28272,2019,20191231,61377.0,GB0009895292,ASTRAZENECA PLC,GBR,2834.0,"AstraZeneca PLC discovers, develops, manufactu..."
3,SAN,SANOFI SA,Health Care,Equity,21623837.19,1.12,21623837.19,201397.0,S56717358,FR0000120578,5671735,107.37,France,Nyse Euronext - Euronext Paris,USD,0.82,EUR,-,2019,SANOFI S.A.,France,"Manufacture of medical, precision and optical ...",-0.01,-6.09%,-434457155,-393211644,-77490,-4399917,-1003233,-13907,-451590,-35245079,-54294,-143654775,-75667077,-75484826,-1099979,-1099979,-12050491,-89489470,-108652,-35245079,-451590,-461,-3419,-33072,-549,-27147,-27147,-13441,25%,101204,2019,20191231,112736.0,FR0000120578,SANOFI,FRA,2834.0,"Banco Santander, S.A., together with its subsi..."
4,SAN,BANCO SANTANDER SA,Financials,Equity,12744703.6,0.66,12744703.6,3038453.0,S57059461,ES0113900J37,5705946,4.19,Spain,Bolsa De Madrid,USD,0.82,EUR,-,2019,BANCO SANTANDER SA,Spain,"Financial intermediation, except insurance and...",-0.001,-0.70%,-99333328,-92801728,-25942,-1243952,-290468,-4367,-52104,-4928087,13320,-41800811,-27420552,-27402198,-310988,-310988,-4466925,7367610,-4289,-4928087,-52104,-461,-47,-12357,-549,6660,6660,-3901,13%,14140,2019,20191231,1522695.0,ES0113900J37,BANCO SANTANDER SA,ESP,6020.0,"Banco Santander, S.A., together with its subsi..."
5,TTE,TOTALENERGIES,Energy,Equity,21247071.1,1.1,21247071.1,438365.0,-,FR0000120271,B15C557,48.47,France,Nyse Euronext - Euronext Paris,USD,0.82,EUR,-,2019,TOTAL SA,France,Extraction of crude petroleum and services rel...,-0.084,-86.39%,-14772813804,-14478449101,-3769054,-169082300,-38943242,-606970,-13813139,-68456247,306249,-5597249540,-3534954350,-3529245416,-42270575,-42270575,-573515105,-1365534096,-3370723,-68456247,-13813139,-75065,-102420,-1642356,-89472,153125,153125,-530974,16%,24625,2019,20191231,273294.0,FR0000120271,TOTAL SE,FRA,2911.0,TotalEnergies SE operates as an integrated oil...
6,GSK,GLAXOSMITHKLINE PLC,Health Care,Equity,17365446.14,0.9,17365446.14,879169.0,S09252883,GB0009252882,925288,19.75,United Kingdom,London Stock Exchange,USD,0.71,GBP,-,2019,GLAXOSMITHKLINE PLC,United Kingdom,"Manufacture of medical, precision and optical ...",-0.009,-3.33%,-385312247,-370745733,-102404,-4865743,-1135847,-17137,-1015460,-7476974,47052,-163432220,-106636776,-106565263,-1216436,-1216436,-17363406,21016285,-1384550,-7476974,-1015460,-1859,-552,-48182,-2215,23526,23526,-15256,16%,5180,2019,20191231,79692.0,GB0009252882,GLAXOSMITHKLINE PLC,GBR,2834.0,"GlaxoSmithKline plc, together with its subsidi..."
7,RIO,RIO TINTO PLC,Materials,Equity,16138658.65,0.83,16138658.65,189144.0,S07188758,GB0007188757,718875,85.33,United Kingdom,London Stock Exchange,USD,0.71,GBP,-,2019,RIO TINTO PLC,United Kingdom,Quarrying of sand and clay,-0.144,-43.41%,-6462466711,-5904798809,-2407048,-73707062,-17463797,-317956,-30954637,-434498443,1681042,-2528247789,-2094113630,-2092789063,-18426766,-18426766,-347251425,1104934479,-2859512,-434498443,-30954637,-67037,-14558,-1102616,-79904,840521,840521,-250087,4%,19565,2019,20191231,87802.0,GB0007188757,RIO TINTO GROUP,GBR,1000.0,"Rio Tinto Group engages in exploring, mining, ..."
8,AI,LAIR LIQUIDE SOCIETE ANONYME POUR,Materials,Equity,14292701.76,0.74,14292701.76,83090.0,-,FR0000120073,B1YXBJ7,172.01,France,Nyse Euronext - Euronext Paris,USD,0.82,EUR,-,2019,AIR LIQUIDE,France,Chemicals nec,-0.366,-224.97%,-9004051487,-8426185675,-2002880,-111825588,-26225271,-356440,-1022461,-436571184,138011,-3766240396,-2234999726,-2233985494,-27956397,-27956397,-360882027,86867711,-85778,-436571184,-1022461,-5255,-1060,-993652,-6263,69006,69006,-351120,1%,101202,2019,20191231,43666.5,FR0000120073,L'AIR LIQUIDE SA,FRA,2810.0,"C3.ai, Inc. operates as an enterprise artifici..."
9,DTE,DEUTSCHE TELEKOM N AG,Communication,Equity,12240959.85,0.63,12240959.85,568301.0,S58423591,DE0005557508,5842359,21.54,Germany,Xetra,USD,0.82,EUR,-,2019,DEUTSCHE TELEKOM AG,Germany,Post and telecommunications (64),-0.008,-6.45%,-678082340,-662403185,-157402,-8812866,-2064549,-28179,-105742,-4526646,16228,-296493439,-176025217,-175942301,-2203216,-2203216,-28422975,7948136,-17155,-4526646,-105742,-574,-168,-77772,-684,8114,8114,-27598,3%,221616,2019,20191231,170672.0,DE0005557508,DEUTSCHE TELEKOM,DEU,4813.0,DTE Energy Company engages in the utility oper...
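
Several rows above pair a European holding with a description that appears to belong to a different company sharing the same ticker (for example ROCHE with a Rogers Corporation description, and DEUTSCHE TELEKOM with a DTE Energy description). A rough check like the one below could flag such rows before modeling. It is a heuristic sketch, not part of the original analysis: `description_matches` is a hypothetical helper, and the strings are copied from the truncated output above.

```python
def description_matches(name, description):
    """Heuristic: does the description start with the first word of the name?"""
    first_word = name.split()[0].lower()
    return description.lower().startswith(first_word)

pairs = [
    ("ROCHE HOLDING PAR AG", "Rogers Corporation designs, develops, manufact..."),
    ("ASTRAZENECA PLC", "AstraZeneca PLC discovers, develops, manufactu..."),
    ("SANOFI SA", "Banco Santander, S.A., together with its subsi..."),
    ("DEUTSCHE TELEKOM N AG", "DTE Energy Company engages in the utility oper..."),
]
flags = [description_matches(name, desc) for name, desc in pairs]
print(flags)
```

A `False` flag only marks a row for manual review; name variants can trigger false alarms.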


Create a binary variable that is 1 if the company's environmental intensity (scaled by sales) is above the sample median and 0 otherwise.

This is the dependent variable (label) that we'll try to predict.

In [None]:
df2['HIGH_EI'] = (df2['EnvironmentalIntensity(Sales)'].gt(df2['EnvironmentalIntensity(Sales)'].median())).astype(int)
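
The intensities in this dataset are reported as negative numbers, so "above the median" means closer to zero. The plain-Python sketch below applies the same rule to five values copied from the rows shown earlier, to make the direction of the label visible. Whether the label should instead be based on absolute values is a question about intent that the source does not settle.

```python
from statistics import median

# EnvironmentalIntensity(Sales) values copied from rows shown earlier.
intensity = [-0.003, -0.007, -0.005, -0.010, -0.084]

cutoff = median(intensity)
high_ei = [int(value > cutoff) for value in intensity]
print(cutoff, high_ei)
```

Here the value furthest from zero (-0.084) receives label 0 under this rule.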

In [None]:
!pip install transformers
import numpy as np
import pandas as pd
import torch
import transformers as ppb # pytorch transformers
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split


Load a pre-trained DistilBERT model.

In [None]:
model_class, tokenizer_class, pretrained_weights = (ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased')

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_transform.weight', 'vocab_projector.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Tokenize the textual data for DistilBERT.

In [None]:
tokenized = df2['description'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

Pad all lists of token IDs to the same length.

In [None]:
max_len = 0
for i in tokenized.values:
    if len(i) > max_len:
        max_len = len(i)

padded = np.array([i + [0]*(max_len-len(i)) for i in tokenized.values])

In [None]:
np.array(padded).shape

(41, 463)

Create an attention mask so that DistilBERT ignores (masks) the padding when processing its input.

In [None]:
attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(41, 463)

We run the pretrained DistilBERT model on the padded inputs and keep the result in the last_hidden_states variable.

In [None]:
input_ids = torch.tensor(padded)  
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
    last_hidden_states = model(input_ids, attention_mask=attention_mask)

Keep the final-layer hidden state of the first token ([CLS]) of each description as the feature vector, and assign the outcome variable to labels.

In [None]:
features = last_hidden_states[0][:,0,:].numpy()
labels = df2['HIGH_EI']
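
In `last_hidden_states[0][:,0,:]`, the `[0]` selects the model's output tensor of shape (batch, sequence length, hidden size), and `[:,0,:]` keeps position 0 of every sequence, which is where the tokenizer places `[CLS]`. The NumPy sketch below shows the same slice on a toy array; the sizes are made up for illustration and much smaller than the real tensor.

```python
import numpy as np

# Toy sizes (assumed); the real tensor here would be (41, 463, hidden_size).
batch, seq_len, hidden = 4, 6, 8
last_layer = np.arange(batch * seq_len * hidden).reshape(batch, seq_len, hidden)

# Position 0 along the sequence axis is the [CLS] token of each description.
cls_vectors = last_layer[:, 0, :]
print(cls_vectors.shape)
```

The result has one row per description and one column per hidden unit, which is the shape a scikit-learn classifier expects.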

Split the data into train and test subsets, train a logistic regression on the training set, and evaluate its accuracy on the test set.

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels,test_size=0.2,random_state=42)
lr_clf = LogisticRegression(max_iter=5000)
lr_clf.fit(train_features, train_labels)
print(lr_clf.score(test_features, test_labels))

0.6666666666666666


Summary: the model correctly classifies about 67% of the test companies (6 of 9) as high or low environmental intensity. With such a small test set, this estimate is very uncertain.

In [None]:
test_labels

24    0
13    0
8     0
25    1
4     1
40    0
19    1
39    0
29    1
Name: HIGH_EI, dtype: int64
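
The test set above holds nine companies, so the 0.667 score corresponds to 6 correct predictions out of 9. To get a feel for how much such a small sample can tell us, one option is a Wilson score interval for the accuracy. This is a rough sketch added for illustration, not part of the original analysis; `wilson_interval` is a hypothetical helper and 1.96 is the usual normal quantile for an approximate 95% interval.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# 6 of 9 correct reproduces the reported score of 0.667.
low, high = wilson_interval(6, 9)
print(f"{low:.2f} to {high:.2f}")
```

The interval is wide, which suggests comparing models on a larger sample or with cross-validation before drawing conclusions.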