# Methodology and Analysis Notebook
#### In this notebook I note my methods and ideas as I go

#### Experiment-setup:

I investigate whether greenwashing can be a profitable strategy for firms. To do so, I construct a firm-level measure of greenwashing following *(Lagasio, 2024)* and use a natural experiment in a difference-in-differences (DiD) design to look for a causal effect.

<br>

#### Experiment notes:
* Use a DiD with continuous treatment to test whether firms that greenwash more earn higher profits
* For treatment intensity, use each firm's level of greenwashing in 2014/2015
    * The justification is that the previous IPCC report was released in 2013/2014
    * Since that created a bit of buzz, it may have prompted firms to start greenwashing
    <div>
    <img src="../img/ipcc_trends_2004-2025.png" width="600"/>
    </div>

* For the shock, use September 2019, when there was a lot of buzz around another IPCC report and the Fridays For Future movement
    <div>
    <img src="../img/climate_trends_2004_2022.png" width="600"/>
    </div>

* Use cumulative abnormal returns as the outcome variable
* Use sales as a secondary outcome variable
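
A minimal sketch of the continuous-treatment DiD from the notes above, run on simulated data only. Everything here is an assumption for illustration: the function name `did_continuous`, the panel dimensions, the simulated effect size, and the choice of firm and period dummies as fixed effects. It is not the final specification.

```python
import numpy as np

def did_continuous(y, intensity, post, firm, period):
    """OLS of y on intensity*post with firm and period dummies.

    Returns the coefficient on the interaction term.
    """
    y = np.asarray(y, dtype=float)
    firm = np.asarray(firm)
    period = np.asarray(period)
    interaction = np.asarray(intensity, dtype=float) * np.asarray(post, dtype=float)
    firm_dummies = (firm[:, None] == np.unique(firm)[None, :]).astype(float)
    period_dummies = (period[:, None] == np.unique(period)[None, :]).astype(float)
    # Drop one period dummy to avoid perfect collinearity with the firm dummies
    X = np.column_stack([interaction, firm_dummies, period_dummies[:, 1:]])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[0]

# Simulated panel (hypothetical numbers): 200 firms, 8 periods, shock at period 4
rng = np.random.default_rng(0)
n_firms, n_periods, shock = 200, 8, 4
firm = np.repeat(np.arange(n_firms), n_periods)
period = np.tile(np.arange(n_periods), n_firms)
greenwash = rng.uniform(0, 1, n_firms)      # pre-period greenwashing level per firm
intensity = greenwash[firm]
post = (period >= shock).astype(float)
true_effect = 0.5
y = (rng.normal(size=n_firms)[firm]          # firm fixed effect
     + 0.1 * period                          # common time trend
     + true_effect * intensity * post        # treatment effect scaled by intensity
     + rng.normal(scale=0.1, size=firm.size))

est = did_continuous(y, intensity, post, firm, period)
print(round(est, 3))
```

The firm dummies absorb the level of greenwashing itself and the period dummies absorb the common shock, so only the interaction identifies the effect.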

----
#### Some links:
* [IPCC Wikipedia Page](https://en.wikipedia.org/wiki/Intergovernmental_Panel_on_Climate_Change)
* [6th IPCC Assessment](https://en.wikipedia.org/wiki/IPCC_Sixth_Assessment_Report)
* [IPCC Special Report from 2019](https://en.wikipedia.org/wiki/Special_Report_on_Climate_Change_and_Land)
* [Fridays For Future movement Wiki](https://en.wikipedia.org/wiki/Fridays_for_Future)


#### Total Return Index from Datastream:

RI - Total Return Index

The Return Index shows the growth in value of a security over a specified period, assuming that dividends are re-invested to purchase additional shares of the security at the closing price (P) on the dividend ex-date.

For all markets except Canada and the US, dividend payment data (DDE) is only available from 1988; for securities listed before this point, the return index is calculated using the dividend yield (DY). This method adds an increment of 1/260th part of the dividend yield to the price each weekday. There are assumed to be 260 weekdays in a year; market holidays are ignored.

Method 1 (using annualised dividend yield)

RI on the BDATE = 100, then:

$$RI_t = RI_{t-1} \times \frac{PI_t}{PI_{t-1}} \times \left(1 + \frac{DY_t}{100} \times \frac{1}{N}\right)$$

Where:

* $RI_t$ = return index on day t
* $RI_{t-1}$ = return index on previous day
* $PI_t$ = price index on day t
* $PI_{t-1}$ = price index on previous day
* $DY_t$ = dividend yield % on day t
* $N$ = number of working days in the year (taken to be 260)
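
To check my understanding of the dividend-yield method, here is a small sketch. The formula in the docstring is my reading of the definitions above, and the function name `return_index_dy` and the input numbers are made up for illustration.

```python
def return_index_dy(price_index, dividend_yield_pct, n_days=260, base=100.0):
    """Method 1 sketch: RI_t = RI_{t-1} * (PI_t / PI_{t-1}) * (1 + DY_t / (100 * N)).

    price_index and dividend_yield_pct are equally long sequences;
    the first element corresponds to the base date, where RI = base.
    """
    ri = [base]
    for t in range(1, len(price_index)):
        price_growth = price_index[t] / price_index[t - 1]
        daily_yield = dividend_yield_pct[t] / (100 * n_days)
        ri.append(ri[-1] * price_growth * (1 + daily_yield))
    return ri

# Hypothetical input: flat price index and a constant 2.6% annual dividend yield
ri_dy = return_index_dy([100.0, 100.0, 100.0], [2.6, 2.6, 2.6])
print(ri_dy)
```

With a flat price index, the return index grows only by the daily slice of the dividend yield, which is the point of this method.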

For securities listed before 1988, the RI calculation switches from the dividend yield method to the dividend payment data once the latter is available. For example, the first recorded dividend for BP is 15/08/88; on this date the RI calculation switches from the dividend yield method to using the dividend payment data. This method is a more accurate measure of the security's growth, because the discrete quantity of dividend paid is added to the price on the ex-date of the payment.

Method 2 (using dividend payment data)

RI on the BDATE = 100, then:

$$RI_t = RI_{t-1} \times \frac{P_t}{P_{t-1}}$$

except when t = ex-date of the dividend payment $D_t$, then:

$$RI_t = RI_{t-1} \times \frac{P_t + D_t}{P_{t-1}}$$

Where:

* $P_t$ = price on ex-date
* $P_{t-1}$ = price on previous day
* $D_t$ = dividend payment associated with ex-date t

The calculation ignores tax and re-investment charges.
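
The dividend-payment method can be sketched the same way. Again, the formula in the docstring is my reading of the description above, and the function name `return_index_dividends` and the prices are hypothetical. Setting the dividend to 0 on non-ex-dates covers both cases with one expression.

```python
def return_index_dividends(price, dividend, base=100.0):
    """Method 2 sketch: RI_t = RI_{t-1} * (P_t + D_t) / P_{t-1}.

    dividend[t] is the payment with ex-date t and 0 on all other days,
    so on non-ex-dates this reduces to RI_{t-1} * P_t / P_{t-1}.
    Tax and re-investment charges are ignored.
    """
    ri = [base]
    for t in range(1, len(price)):
        ri.append(ri[-1] * (price[t] + dividend[t]) / price[t - 1])
    return ri

# Hypothetical input: the price drops by 1 on the ex-date of a dividend of 1
ri_div = return_index_dividends([50.0, 52.0, 51.0], [0.0, 0.0, 1.0])
print(ri_div)
```

In this example the price drop on the ex-date is exactly offset by the dividend, so the return index stays flat on that day.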

Canadian and US securities are not affected by the dual calculation method; dividend payment data is available from 1973, the Datastream market inception date for both markets.

Securities listed after 1988 automatically use RI calculation method 2, dividend payment data.

Adjusted closing prices are used throughout to determine price index and hence return index.

Note: where the dividend payment data history contains a mixture of gross and net dividends the annualised dividend yield (method 1 above) is used in order to achieve a consistent growth measure. The net and gross markers can be identified using the datatype DTAX (tax marker). To display the total return using the dividend payment data (method 2) in these cases, the alternative total return datatype RZ (Return Index - As Paid) may be used. RZ uses the dividend payment data calculation method irrespective of the tax markers.

For UK companies the RI includes a tax credit on the dividend until it was abolished in April 2004. Prior to that time, dividends as announced by the company are grossed up to include the credit in the RI calculation. The rate used varies over time, the last rate being 10% in the period April 1999 to April 2004. To see a return index measure which does not include tax credits, the datatype RN (Return Index – Net) should be used. The RN calculation is the same as RI except the dividends used are as announced by the company and not grossed up for the tax credits. It follows that since April 2004 the performance of RI and RN for UK securities is the same. RN is available for UK securities only.



#### Greenwashing Indicator

First, read in the pre-processed data

In [1]:
# libs
import pandas as pd
import numpy as np
import pymupdf

In [4]:
# data
data_greenwashing = pd.read_excel('../data/LSEG data/matched_final.xlsx', sheet_name='companies')
data_returns = pd.read_excel('../data/LSEG data/matched_final.xlsx', sheet_name='indicators_1')
data_indices = pd.read_excel('../data/LSEG data/matched_final.xlsx', sheet_name='INDICES')
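
Since cumulative abnormal returns are the planned outcome variable, here is a rough sketch of how they could be computed from a firm's return index and a market index. It assumes a market model estimated by OLS and log returns; both are my assumptions, not a settled choice. The function name, the window lengths, and the simulated series are hypothetical and do not use the sheets loaded above.

```python
import numpy as np

def cumulative_abnormal_return(stock_ri, market_ri, est_window, event_window):
    """CAR under a market model.

    stock_ri, market_ri: return index levels (same length).
    est_window, event_window: slices into the return series (length - 1).
    """
    stock_ret = np.diff(np.log(stock_ri))
    market_ret = np.diff(np.log(market_ri))
    # Market model on the estimation window: r_stock = alpha + beta * r_market
    beta, alpha = np.polyfit(market_ret[est_window], stock_ret[est_window], 1)
    expected = alpha + beta * market_ret[event_window]
    abnormal = stock_ret[event_window] - expected
    return abnormal.sum()

# Simulated series: the stock moves 1.2x the market, plus 1% extra on 5 event days
rng = np.random.default_rng(1)
market_ret = rng.normal(0.0, 0.01, 105)
stock_ret = 1.2 * market_ret
stock_ret[100:105] += 0.01
market_ri = 100 * np.exp(np.concatenate([[0.0], np.cumsum(market_ret)]))
stock_ri = 100 * np.exp(np.concatenate([[0.0], np.cumsum(stock_ret)]))

car = cumulative_abnormal_return(stock_ri, market_ri, slice(0, 100), slice(100, 105))
print(round(car, 4))
```

The estimation window ends right before the event window here; in practice a gap between the two may be preferable so the event does not contaminate the market-model estimates.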


Adding some more pre-processing

In [None]:
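# /////////////////////////////////////
#       expanding
# /////////////////////////////////////

import ast

# ASSUMPTION: `data` is the companies sheet loaded above; adjust if it comes from elsewhere
data = data_greenwashing.copy()

# `year_lists` is stored as the string representation of a list;
# ast.literal_eval parses it without executing arbitrary code (unlike eval)
data['year_lists'] = data['year_lists'].apply(ast.literal_eval)

# Expand each list entry into its own row
data = data.explode('year_lists', ignore_index=True)

# Pull the first run of digits out of each entry and store it as a numeric year
data['year'] = data['year_lists'].str.extract(r'(\d+)', expand=False).astype(float)

# Number of observations per year
print(data.value_counts('year'))

obs_2014 = data[data['year'] == 2014]

# ////////////////////////////////////////////////////////
# NOTE HOW MANY FIRMS HAVE REPORTS IN THE YEARS I NEED
# Count non-missing entries per firm for 2017-2021, then tabulate firms by that count
g = data[(data["year"] > 2016) & (data["year"] < 2022)].groupby("name").count()
print(g.value_counts("industry"))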


Text extraction

In [None]:
# ////////////////////////////////////////////////////////////
#       EXTRACTING TEXT FROM PDFS
# ////////////////////////////////////////////////////////////

import os

# Root folder holding one sub-folder of reports per company
reports_root = "C:/Users/Jakub/OneDrive - Tilburg University/thesis data/responsibility reports"

# One row per company; `data` holds one row per report year after the explode,
# so duplicates are dropped to avoid reading the same folder several times
pdf_information = data[["name"]].drop_duplicates()

rows = []

# Only the first 11 companies for now
for _, row in pdf_information[0:11].iterrows():

    # Folder names are lower-case, use underscores, and drop "?" and "|"
    company_name = row["name"].lower().replace(" ", "_")
    company_name = company_name.replace("?", "").replace("|", "")

    company_dir = f"{reports_root}/{company_name}"
    list_of_paths = [f"{company_dir}/{file}" for file in os.listdir(company_dir)]

    for file in list_of_paths:
        with pymupdf.open(file) as doc:
            # Guard against reports with fewer than two pages
            title_page = doc[0].get_text() if doc.page_count > 0 else None
            second_page = doc[1].get_text() if doc.page_count > 1 else None

        rows.append({"name": row["name"],
                     "title_page": title_page,
                     "second_page": second_page})

# Build the result once instead of concatenating inside the loop
result = pd.DataFrame(rows, columns=["name", "title_page", "second_page"])
            
            
    


