
# Greenwashing

### Definition: Selective Disclosure

Selective disclosure means withholding negative information about a company's environmental or social performance while publicizing positive information about that performance.

### Signs of Greenwashing (possible $X$)

**Leveraging self-disclosure and company commitments**

1. Divestitures: “The second possibility is that divestitures of pollutive assets respond to external environmental pressures by transferring ownership from firms that face stronger pressures to firms that face weaker pressures (or are better at addressing those pressures)” (Duchin et al., 2022, p. 3)

2. Using vague and ambitious words without practical targets and actions: specific targets vs. vague descriptive words (Policies vs. Target)

3. No proof: an environmental claim that cannot be substantiated by easily accessible supporting information or by a reliable third-party certification

### Identify Greenwashing (possible $Y$)

**Need more objective measures (e.g., total pollution, violations and misconduct, employee gender diversity)**

1. Involvement in greenwashing controversies (related news contains keywords), e.g., the list of keywords consists of ‘litigation’, ‘lawsuit’, ‘fine(d)’, ‘sue(d)’, ‘attorney’, ‘judge’, ‘lawyers’, ‘barrister’, ‘trial’, ‘court’, ‘legal’, ‘prosecuted/tion’

2. Companies that violate regulations or laws (news, Twitter, regulatory records)
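
The keyword screen in item 1 could be sketched roughly as follows. This is a minimal illustration, not a method taken from the cited literature: the expansion of ‘fine(d)’, ‘sue(d)’ and ‘prosecuted/tion’ into explicit word forms, the function name `flag_controversy`, and the sample headlines are all assumptions made for the example.

```python
import re

# Keyword list quoted from the text above. "fine(d)", "sue(d)" and
# "prosecuted/tion" are expanded into explicit word forms, which is an
# assumption about the intended variants.
LEGAL_KEYWORDS = [
    "litigation", "lawsuit", "fine", "fined", "sue", "sued", "attorney",
    "judge", "lawyers", "barrister", "trial", "court", "legal",
    "prosecuted", "prosecution",
]

# Whole-word, case-insensitive pattern so that "fine" does not match "refinery".
KEYWORD_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, LEGAL_KEYWORDS)) + r")\b",
    re.IGNORECASE,
)


def flag_controversy(headline):
    """Return the sorted, de-duplicated legal keywords found in a headline."""
    return sorted({match.lower() for match in KEYWORD_PATTERN.findall(headline)})


# Hypothetical headlines, for illustration only.
headlines = [
    "Regulator fined Acme Corp over misleading green claims",
    "Acme Corp opens new refinery",
]
flags = {headline: flag_controversy(headline) for headline in headlines}
```

A plain keyword match will still produce false positives (for example, "fine" used as an adjective), so flagged items would likely need manual review before being used as a dependent variable.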



### Related literature

1. Greenwash: Corporate Environmental Disclosure under Threat of Audit [link:https://doi-org.proxy.uchicago.edu/10.1111/j.1530-9134.2010.00282.x]

2. Diversity Washing (Baker et al., 2022):

* Measures diversity washing as the within-year distance between the **amount of DEI commitment discussion** and **actual diversity**.

* Calculates these deviations as the difference between a firm’s within-year DEI commitment disclosure percentile and its diversity percentile.

* Constructs a binary variable that equals 1 if a firm’s disclosure percentile is higher than its diversity percentile, and 0 otherwise, and labels the resulting variable Diversity Washer.

$$
Diversity\_Washer = \begin{cases}
      1 & \text{if } DEI\_commitment\_disclosure\_percentile > diversity\_percentile \\
      0 & \text{otherwise}
   \end{cases}
$$

3. Sustainability or Greenwashing: Evidence from the Asset Market for Industrial Pollution (Duchin et al., 2022)

    Poisson DiD regression:

    $Y_{i,t} = \beta Divested_i \times Post_{i,t} + \alpha_i + \tau_t + \epsilon_{i,t} \quad (1)$

    The outcomes $Y_{i,t}$ include total pollution, pollution intensity, and pollution abatement activities such as source reduction and the percentage of waste recycled, recovered, or treated.
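
The Diversity Washer indicator from item 2 can be sketched in a few lines. This is only an illustration under stated assumptions: the percentile here is the share of firms in the same year with a value less than or equal to the firm's own (the paper's exact percentile definition may differ), and the field names and firm-year records are hypothetical.

```python
from collections import defaultdict


def percentile_ranks(values):
    """Map each value to the share of observations less than or equal to it."""
    n = len(values)
    return [sum(v <= x for v in values) / n for x in values]


def diversity_washer(records):
    """Add within-year percentiles and the Diversity Washer flag to each record."""
    by_year = defaultdict(list)
    for rec in records:
        by_year[rec["year"]].append(rec)

    out = []
    for group in by_year.values():
        disclosure_pct = percentile_ranks([r["dei_disclosure"] for r in group])
        diversity_pct = percentile_ranks([r["diversity"] for r in group])
        for rec, p_disc, p_div in zip(group, disclosure_pct, diversity_pct):
            out.append({
                **rec,
                "disclosure_pct": p_disc,
                "diversity_pct": p_div,
                # 1 if the firm talks about DEI more than its diversity rank warrants
                "diversity_washer": int(p_disc > p_div),
            })
    return out


# Hypothetical firm-year data, for illustration only.
records = [
    {"firm": "A", "year": 2022, "dei_disclosure": 0.9, "diversity": 0.2},
    {"firm": "B", "year": 2022, "dei_disclosure": 0.1, "diversity": 0.5},
    {"firm": "C", "year": 2022, "dei_disclosure": 0.5, "diversity": 0.4},
]
result = {r["firm"]: r["diversity_washer"] for r in diversity_washer(records)}
```

In this toy sample only firm A is flagged: it ranks highest on disclosure but lowest on diversity, while firm C has equal percentiles and is therefore not flagged under the strict inequality.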



### Investigating Greenwashing: An Analysis Snapshot


**Motivation:** In the context of diversity washing, firms might also use less precise discussions to avoid litigation risk (e.g., Skinner, 1994).

**Vagueness:** Following Baker et al. (2022), the terms identified as vague are: believe, can, commonly, could, help, leading, like, many, may, maybe, might, often, possibly, probably, rarely, seem, some, up to, virtually, and widely.

**Strategy:** The code below explores the relationship between vague, ambiguous ESG discussion in corporate annual reports and corporate misconduct, using the 2022 10-K annual reports of 830 public firms.

However, after reviewing the related literature and searching for datasets online, I did not find a comprehensive dataset of firms' ESG-related misconduct in 2022, so I lack a good $Y$ for the regression analysis. (Data sources such as Good Jobs First's Violation Tracker do cover firm misconduct, but the information must be looked up firm by firm and hand coded. Due to time limits, I did not finish this task.)


**Conclusions:** Lacking a suitable $Y$ (dependent variable) for the regression analysis, I instead explore the links between vagueness scores and key financial metrics: return on equity (ROE), solvency ratio, asset scores, and Total ESG Sentiment scores. The OLS regression below shows that firms with higher ROE have significantly lower vagueness scores, while firms with higher solvency ratios tend to have higher vagueness scores. ESG sentiment is also positively related to vagueness in ESG discussions, which may indicate that firms adopt a vaguer and more positive tone in their ESG discussions when their financial performance is worse and they face higher risks.

**Limitations:** This analysis offers only indicative evidence that firms may use a "greenwashing strategy" in their disclosures to conceal poor financial performance. Given the data limitations, I believe the key to investigating whether firms engage in greenwashing is objective evidence on whether they truly walk the talk.

In [None]:
# Prep: import the required packages
import csv
import os
import os.path
import string
from collections import Counter

import nltk
import pandas as pd
import regex as re  # third-party `regex` package, used here in place of the stdlib `re`
import spacy
from nltk.corpus import stopwords  # list of stopwords
from nltk.stem.snowball import SnowballStemmer  # stemmer module
from nltk.tokenize import MWETokenizer, word_tokenize  # tokenizers

nltk.download('punkt')
nltk.download('stopwords')

tokenizer = MWETokenizer()
stemmer = SnowballStemmer('english')

[nltk_data] Downloading package punkt to /Users/charlotte/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/charlotte/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
regex_digits = re.compile(r'\d+')
regex_whitespace = re.compile(r'\s+')
english_stopwords = set(stopwords.words('english'))
stemmer = SnowballStemmer('english')

def clean_tokenize(text):
    """
    Cleans and tokenizes the input text. This includes removing numbers, certain punctuation marks,
    converting to lowercase, and eliminating stopwords. The remaining words are then stemmed.

    Parameters:
    - text (str): The text to be processed.

    Returns:
    - list: A list of stemmed tokens from the input text, excluding stopwords and punctuation.

    """
    # Remove numbers and specific characters
    text_cleaned = regex_digits.sub('', text)
    text_cleaned = text_cleaned.replace('”', '').replace('“', '').replace('—', ' ')

    # Remove punctuations and convert characters to lower case, then trim whitespace
    text_cleaned = "".join([char.lower() for char in text_cleaned if char not in string.punctuation])
    text_cleaned = regex_whitespace.sub(' ', text_cleaned).strip()

    # Tokenize, remove stopwords, and stem
    tokens = word_tokenize(text_cleaned)
    filtered_tokens = [stemmer.stem(word) for word in tokens if word not in english_stopwords]

    return filtered_tokens

In [None]:

def evaluate_vagueness(sentence, vague_words):
    """
    Evaluate the vagueness of a sentence based on a set of predefined vague words.

    Args:
    - sentence (str): The sentence to evaluate.
    - vague_words (list): A list of predefined vague words.

    Returns:
    - vagueness_score (float): The percentage of tokens in the sentence that are
      vague words (0 for an empty sentence).
    """
    # Tokenize the sentence into words
    words = word_tokenize(sentence.lower())

    # Count occurrences of vague words in the sentence
    vague_word_count = sum(1 for word in words if word in vague_words)

    # Calculate vagueness score as a percentage of total words in the sentence
    total_words = len(words)
    if total_words == 0:
        return 0
    else:
        return (vague_word_count / total_words) * 100

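# Define the list of vague words (Baker et al., 2022)
# Note: word_tokenize splits 'up to' into two tokens, so this phrase is never matched as written.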
vague_words = ['believe', 'can', 'commonly', 'could', 'help', 'leading', 'like', 'many', 'may',
               'maybe', 'might', 'often', 'possibly', 'probably', 'rarely', 'seem', 'some',
               'up to', 'virtually', 'widely']



In [None]:
def calculate_vagueness(text, vague_words):
    """
    Calculate the vagueness score for a given text based on the frequency of occurrence of vague words.
    """
    # Tokenize the text into words
    tokens = word_tokenize(text.lower())

    # Count occurrences of vague words in the text
    vague_word_count = sum(1 for word in tokens if word in vague_words)

    # Calculate the total number of words in the text for normalization
    total_words = len(tokens)

    # Normalize the vagueness score
    vagueness_score = vague_word_count / total_words if total_words > 0 else 0

    return vagueness_score

def append_vagueness_scores_to_df(data, vague_words):
    """
    Appends normalized vagueness scores as a new column to the DataFrame.
    """
    # Score each row's ESG discussion text; apply() produces a float column directly,
    # avoiding float assignments into an integer-initialized column
    data['Vagueness Score'] = data['Content'].apply(
        lambda text: calculate_vagueness(text, vague_words)
    )

    return data
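
One caveat with the token-based functions above: `word_tokenize` returns single-word tokens, so the two-word term 'up to' can never equal a token and is silently skipped. The sketch below shows one possible way to count multi-word terms, using a standard-library regular expression instead of NLTK. It is an alternative illustration, not the method used for the results in this notebook: the name `vagueness_share` is hypothetical, and its denominator counts only alphabetic words, whereas `word_tokenize` also counts punctuation and numbers as tokens, so the scores are not directly comparable.

```python
import re

VAGUE_TERMS = [
    'believe', 'can', 'commonly', 'could', 'help', 'leading', 'like', 'many', 'may',
    'maybe', 'might', 'often', 'possibly', 'probably', 'rarely', 'seem', 'some',
    'up to', 'virtually', 'widely',
]

# Whole-word pattern; longer terms are listed first so phrases are tried
# before shorter alternatives.
VAGUE_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(VAGUE_TERMS, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)
WORD_PATTERN = re.compile(r"[A-Za-z']+")


def vagueness_share(text):
    """Share of alphabetic words accounted for by vague terms (a phrase counts once)."""
    n_words = len(WORD_PATTERN.findall(text))
    if n_words == 0:
        return 0.0
    return len(VAGUE_PATTERN.findall(text)) / n_words


# Hypothetical sentence: 'believe', 'may' and 'up to' are matched (3 of 9 words).
example = "We believe emissions may fall by up to 10 percent."
score = vagueness_share(example)
```

Because the pattern uses word boundaries, 'may' is not counted inside 'maybe', and each term is counted only where it appears as a whole word.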

In [None]:

# This dataset already contains each firm's financial performance, its ESG sentiment score,
# and the ESG-related discussion extracted from its annual report.
data = pd.read_csv("data_2022_cleaned.csv", encoding="latin1")

# Define the list of vague words
vague_words = ['believe', 'can', 'commonly', 'could', 'help', 'leading', 'like', 'many', 'may',
               'maybe', 'might', 'often', 'possibly', 'probably', 'rarely', 'seem', 'some',
               'up to', 'virtually', 'widely']

# Append vagueness scores to the DataFrame
data = append_vagueness_scores_to_df(data, vague_words)

In [None]:

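# Note: `describe` without parentheses only displays the bound method (as in the output below);
# call data['Vagueness Score'].describe() to get summary statistics.
data['Vagueness Score'].describe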

<bound method NDFrame.describe of 0       0.009156
2       0.012191
4       0.011236
6       0.006420
8       0.007405
          ...   
1652    0.011661
1654    0.011869
1656    0.007255
1658    0.011295
1660    0.003279
Name: Vagueness Score, Length: 830, dtype: float64>

In [None]:
# main regression
X = data[["Total ESG Sentiment","roe", "solvency", "asset"]]
y = data['Vagueness Score']

# Add a constant to the independent variable for the intercept term
X = sm.add_constant(X)

# Fit the linear regression model
model = sm.OLS(y, X).fit()

# Print the regression summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:        Vagueness Score   R-squared:                       0.047
Model:                            OLS   Adj. R-squared:                  0.042
Method:                 Least Squares   F-statistic:                     10.06
Date:                Thu, 11 Apr 2024   Prob (F-statistic):           5.91e-08
Time:                        22:41:26   Log-Likelihood:                 3586.0
No. Observations:                 830   AIC:                            -7162.
Df Residuals:                     825   BIC:                            -7138.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                   0.0062    