# Part 2: ESG Sentiment Analysis

This script performs ESG (Environmental, Social, Governance) sentiment analysis on a dataset of company-related content. It preprocesses the text (cleaning, tokenizing, removing stopwords, and stemming), matches it against a predefined ESG keyword dictionary, and computes sentiment scores from counts of positive and negative words taken from the **Loughran-McDonald Sentiment Word Lists**. A pre-trained **Word2Vec model** is used to expand the ESG keyword list with similar terms, which broadens the vocabulary the dictionary can match. The final output is a DataFrame augmented with a score for each ESG category and a total ESG score for each piece of content.

## 2.1: Importing Packages

In [1]:
### ESG Sentiment Analysis

#prep: import the following packages
from collections import Counter   
import os
import os.path
import string
import nltk
import csv
nltk.download('punkt')
from nltk.tokenize import MWETokenizer  #import tokenizer
from nltk.tokenize import word_tokenize
tokenizer = MWETokenizer()
nltk.download('stopwords')  
from nltk.corpus import stopwords  #import the list of stopwords
from nltk.stem.snowball import SnowballStemmer  #import stemmer module
stemmer = SnowballStemmer('english')
import pandas as pd
import spacy   
nlp = spacy.load('en_core_web_sm') # Load in language package

[nltk_data] Downloading package punkt to /Users/charlotte/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/charlotte/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2.2: Function clean_tokenize

In [15]:
import re  # needed for the compiled patterns below

regex_digits = re.compile(r'\d+')
regex_whitespace = re.compile(r'\s+')
english_stopwords = set(stopwords.words('english')) 
stemmer = SnowballStemmer('english')

def clean_tokenize(text):
    """
    Cleans and tokenizes the input text. This includes removing numbers, certain punctuation marks,
    converting to lowercase, and eliminating stopwords. The remaining words are then stemmed.

    Parameters:
    - text (str): The text to be processed.

    Returns:
    - list: A list of stemmed tokens from the input text, excluding stopwords and punctuation.

    """
    # Remove numbers and specific characters
    text_cleaned = regex_digits.sub('', text)
    text_cleaned = text_cleaned.replace('”', '').replace('“', '').replace('—', ' ')

    # Remove punctuations and convert characters to lower case, then trim whitespace
    text_cleaned = "".join([char.lower() for char in text_cleaned if char not in string.punctuation])
    text_cleaned = regex_whitespace.sub(' ', text_cleaned).strip()

    # Tokenize, remove stopwords, and stem
    tokens = word_tokenize(text_cleaned)
    filtered_tokens = [stemmer.stem(word) for word in tokens if word not in english_stopwords]

    return filtered_tokens
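
To make the cleaning steps easier to follow, here is a dependency-free sketch of the same sequence. It is a simplified stand-in, not the function above: it splits on whitespace instead of calling `word_tokenize`, uses a tiny hypothetical stopword set (`TOY_STOPWORDS`) instead of the NLTK list, and skips stemming.

```python
import re
import string

# Hypothetical, deliberately tiny stopword set used only for this illustration
TOY_STOPWORDS = {"the", "and", "of", "in", "a", "to", "by"}

def toy_clean_tokenize(text):
    """Simplified cleaning: drop digits and punctuation, lowercase, drop stopwords."""
    text = re.sub(r"\d+", "", text)  # remove numbers
    text = text.replace("“", "").replace("”", "").replace("—", " ")  # typographic characters
    text = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    tokens = text.split()  # whitespace split stands in for word_tokenize
    return [tok for tok in tokens if tok not in TOY_STOPWORDS]

sample = "In 2023, the Board reduced carbon emissions by 12% — a “record” cut."
toy_tokens = toy_clean_tokenize(sample)
print(toy_tokens)
# ['board', 'reduced', 'carbon', 'emissions', 'record', 'cut']
```

In the real function each surviving token is additionally passed through the Snowball stemmer, so anything compared against its output has to be stemmed the same way.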


## 2.3: Function read_and_process_word_list

In [16]:
import pandas as pd
import os

def read_and_process_word_list(file_path, sheet_name):
    """
    Reads a word list from a specified sheet in an Excel file, converts words to lowercase,
    and formats them with leading and trailing spaces.

    Parameters:
    - file_path (str): Path to the Excel file.
    - sheet_name (str): Name of the sheet to read.

    Returns:
    - list: A list of processed words.
    """
    df = pd.read_excel(file_path, sheet_name=sheet_name)
    words = df['WORD'].str.lower().tolist()
    return [f' {word} ' for word in words]
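
The space-padded format returned here matters for how the words are matched later. The sketch below uses made-up words and tokens to compare three lookups: padded words against token counts, stripped words against token counts, and padded words as substrings of a space-padded text.

```python
from collections import Counter

# Same format as the function's return value: lowercase, padded with spaces
padded_words = [" gain ", " loss "]

# Hypothetical tokens, as a tokenizer would produce them (no surrounding spaces)
tokens = ["gain", "offset", "loss", "gain"]
token_counts = Counter(tokens)

# 1. Padded words are never keys of the token counter
padded_hits = sum(token_counts.get(word, 0) for word in padded_words)

# 2. Stripping the padding makes the lookup work
stripped_hits = sum(token_counts.get(word.strip(), 0) for word in padded_words)

# 3. The padded form is suited to substring search in a space-padded string
padded_text = " " + " ".join(tokens) + " "
substring_hits = sum(padded_text.count(word) for word in padded_words)

print(padded_hits, stripped_hits, substring_hits)
# 0 3 3
```

If the words are looked up in a token `Counter`, the padding has to be stripped first. If the tokens were also stemmed, the words need the same stemming.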



In [17]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [18]:
# Read and process the positive and negative word lists (LM dictionary)
LM_file_path = '/content/drive/My Drive/Final_Project/Dataset/LM Sentiment Dictionary.xlsx'
positive = read_and_process_word_list(LM_file_path, 'Positive')
negative = read_and_process_word_list(LM_file_path, 'Negative')

In [19]:
# Read the company data containing the ESG-related content
data_file_path = '/content/drive/My Drive/Final_Project/Dataset/Full_data_new.xlsx'
data = pd.read_excel(data_file_path, sheet_name="Sheet1")
data.head()


Unnamed: 0.1,Unnamed: 0,Company name Latin alphabet,Adjust-Company Name,Inactive,Quoted,Branch,OwnData,Woco,Country ISO code,"NACE Rev. 2, core code (4 digits)",...,"US SIC, primary code(s)",Unnamed: 21,"US SIC, primary code(s).1","US SIC, secondary code(s)",National ID,National ID type,National ID label,Ticker symbol,CIK,Content
0,1,WALMART INC.,WALMART INC.,No,Yes,No,No,Yes,US,4719,...,5331,5331,5331,5411.0,71-0415188,VAT/Tax number,EIN,WMT,104169,"['Directors, Executive Officers and Corporate ..."
1,2,"AMAZON.COM, INC.","AMAZON.COM, INC.",No,Yes,No,No,Yes,US,4791,...,5961,5961,5961,5999.0,91-1646860,VAT/Tax number,EIN,AMZN,1018724,"['Directors, Executive Officers, and Corporate..."
2,4,EXXON MOBIL CORP,EXXON MOBIL CORP,No,Yes,No,No,Yes,US,1920,...,2911,2911,2911,1311.0,13-5409005,VAT/Tax number,EIN,XOM,34088,"['Directors, Executive Officers and Corporate ..."
3,5,CVS HEALTH CORPORATION,CVS HEALTH CORPORATION,No,Yes,No,No,Yes,US,4773,...,5912,5912,5912,,05-0494040,VAT/Tax number,EIN,CVS,64803,"['Directors, Executive Officers and Corporate ..."
4,7,MCKESSON CORPORATION,MCKESSON CORPORATION,No,Yes,No,No,Yes,US,4645,...,5122,5122,5122,5047.0,94-3207296,VAT/Tax number,EIN,MCK,927653,"['Directors, Executive Officers, and Corporate..."


In [20]:
contents = data["Content"]
contents.head()

0    ['Directors, Executive Officers and Corporate ...
1    ['Directors, Executive Officers, and Corporate...
2    ['Directors, Executive Officers and Corporate ...
3    ['Directors, Executive Officers and Corporate ...
4    ['Directors, Executive Officers, and Corporate...
Name: Content, dtype: object

## 2.4: Building ESG Dictionary
### Extracting Similar Words for ESG Keywords Using Word2Vec

This part of the script loads a pre-trained Word2Vec model and uses it to find words similar to a set of seed ESG (Environmental, Social, Governance) keywords.

The model is Google's pre-trained Word2Vec model, distributed as the file 'GoogleNews-vectors-negative300.bin'.

For each keyword, the script retrieves the 10 most similar words in the model's vocabulary, ranked by cosine similarity between word vectors. Keywords missing from the vocabulary are reported and skipped.
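
To illustrate what a "most similar" query does, the following sketch ranks words by cosine similarity using plain Python. The 3-dimensional vectors are invented for the example (the real model uses 300 dimensions), and `toy_most_similar` is a hypothetical helper, not part of gensim.

```python
import math

# Invented 3-dimensional vectors, for illustration only
toy_vectors = {
    "climate": [0.9, 0.1, 0.0],
    "warming": [0.8, 0.2, 0.1],
    "board": [0.0, 0.9, 0.3],
    "directors": [0.1, 0.8, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_most_similar(word, vectors, topn=2):
    """Return the topn other words ranked by cosine similarity to `word`."""
    query = vectors[word]
    scored = [(other, cosine(query, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = toy_most_similar("climate", toy_vectors)
for other, similarity in neighbours:
    print(f"{other} (similarity: {similarity:.3f})")
```

With these toy vectors, "warming" ranks first for "climate" because its vector points in nearly the same direction, while the governance-like words score much lower.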

In [21]:
from gensim.models import KeyedVectors

# Load a pre-trained Word2Vec model (this example uses Google's pre-trained model)
# Correct path to the .bin file in your Google Drive
model_path = '/content/drive/My Drive/Final_Project/Dataset/GoogleNews-vectors-negative300.bin'

word_vectors = KeyedVectors.load_word2vec_format(model_path, binary=True)

# Seed keywords generated with ChatGPT-4
esg_keywords = [
    'environment',
    'climate', 'pollution', 'resource', 'biodiversity', 'waste',
    'carbon', 'renewable', 'water', 'deforestation', 'greenhouse',
    'social',
    'rights', 'labor', 'employee', 'diversity', 'community',
    'safety', 'development', 'consumer', 'trade', 'justice',
    'governance',
    'board', 'pay', 'corruption', 'shareholder', 'transparency',
    'ethics', 'risk', 'privacy', 'investment', 'corporate'
]



similar_words = {}
for keyword in esg_keywords:
    # Get the top 10 similar words for each keyword
    try:
        similar_words[keyword] = word_vectors.most_similar(keyword, topn=10)
    except KeyError:
        # This handles cases where the keyword is not in the model's vocabulary
        print(f"{keyword} not found in Word2Vec vocabulary.")

# Display the similar words found
for keyword, words in similar_words.items():
    print(f"\nWords similar to '{keyword}':")
    for word, similarity in words:
        print(f"  {word} (similarity: {similarity})")


Words similar to 'environment':
  environments (similarity: 0.6948072910308838)
  enviroment (similarity: 0.6488352417945862)
  environ_ment (similarity: 0.6331034898757935)
  enviornment (similarity: 0.6283590793609619)
  evironment (similarity: 0.6025119423866272)
  climate (similarity: 0.6009922027587891)
  envrionment (similarity: 0.5722888708114624)
  envi_ronment (similarity: 0.5690474510192871)
  environment.â_€ (similarity: 0.5512977242469788)
  environement (similarity: 0.5422994494438171)

Words similar to 'climate':
  climate_change (similarity: 0.6569507122039795)
  Climate (similarity: 0.6230838298797607)
  climates (similarity: 0.6195024251937866)
  global_warming (similarity: 0.6047458648681641)
  environment (similarity: 0.6009922027587891)
  climatic (similarity: 0.5555011630058289)
  climatic_conditions (similarity: 0.5207005143165588)
  ambassador_Brice_Lalonde (similarity: 0.5172268152236938)
  Global_warming (similarity: 0.5048916339874268)
  Climate_Change (simil

### ESG Dictionary
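
The raw Word2Vec neighbours use underscores for phrases and include case variants and misspellings, whereas the dictionary entries are lowercase, space-separated phrases. The source does not say how the list was curated. The sketch below shows one possible normalization step, with a hypothetical helper name and a hypothetical similarity threshold, applied to a few neighbours from the output above (similarities rounded).

```python
def normalize_neighbours(neighbours, min_similarity=0.5):
    """Turn raw (word, similarity) pairs into unique lowercase candidate phrases."""
    seen = set()
    candidates = []
    for word, similarity in neighbours:
        if similarity < min_similarity:
            continue  # below the (hypothetical) similarity threshold
        phrase = word.replace("_", " ").lower()
        if not all(part.isalpha() for part in phrase.split()):
            continue  # drop entries containing punctuation or encoding debris
        if phrase not in seen:
            seen.add(phrase)
            candidates.append(phrase)
    return candidates

raw_neighbours = [
    ("climate_change", 0.657),
    ("Climate", 0.623),
    ("climates", 0.620),
    ("global_warming", 0.605),
    ("environment.â_€", 0.551),
    ("Global_warming", 0.505),
]
candidates = normalize_neighbours(raw_neighbours)
print(candidates)
# ['climate change', 'climate', 'climates', 'global warming']
```

A filter like this does not catch misspellings such as "enviroment", so the candidates would still need a manual review.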

In [22]:
esg_dictionary = {
    "Environmental": [
        "climate", "climate change", "climates", "global warming", "climatic", "climatic conditions",
        "pollution", "air pollution", "pollutants", "pollutant", "emissions", "pollutions", "pollutant emissions", "polluting", "mercury pollution",
        "resource", "resources", "mineral resources",
        "biodiversity", "bio diversity", "biodiversity conservation", "ecosystems", "marine biodiversity", "deforestation", "ecological", "ecology", "habitats", "fauna",
        "waste", "wastes", "waste disposal", "hazardous waste", "garbage", "recyclable waste", "landfills", "recycling", "landfill", "landfilling",
        "carbon", "carbon emissions", "carbon emission", "CO2", "greenhouse gas", "emission", "carbon dioxide emissions", "greenhouse gases", "greenhouse gas emissions", "carbon dioxide",
        "renewable", "renewable energy", "renewables", "biomass", "renewable fuels", "biofuels", "renewable energies", "fossil fuels",
        "water", "potable water", "sewage", "groundwater", "freshwater", "potable", "wastewater", "brackish groundwater",
        "deforestation", "tropical deforestation", "Amazon deforestation", "rampant deforestation", "rainforest destruction", "biodiversity", "tropical forests", "desertification", "rainforests",
        "greenhouse", "greenhouses", "hydroponic garden", "glasshouse", "unheated greenhouse", "hydroponically", "garden", "hydroponic greenhouse", "glasshouses"
    ],
    "Social": [
        "rights", "freedoms", "inalienable rights", "constitutional protections",
        "labor", "wages", "union", "labor unions", "wage",
        "employee", "employees", "worker", "employer", "coworker", "workers", "staffer",
        "diversity", "cultural diversity", "diverse", "inclusiveness", "multicultural", "culturally diverse", "geographic diversity", "linguistic diversity", "inclusivity",
        "community", "communities",
        "safety",
        "development", "revitalization",
        "consumer", "consumers", "retail", "consumer electronics",
        "trade", "trading", "trades", "traded",
        "justice", "judicial", "criminal justice", "equality", "injustice"
    ],
    "Governance": [
        "board", "directors", "trustees", "boards",
        "pay", "paying", "paid", "pays", "reimburse", "payment", "repay",
        "corruption", "rampant corruption", "graft", "bribery", "endemic corruption", "corrupt", "cronyism", "rampant graft", "endemic graft", "anticorruption",
        "shareholder", "shareholders", "stockholder", "controlling shareholder", "stockholders", "shareowner", "investor", "shareholding", "unitholder",
        "transparency", "accountability", "openness", "transparent", "clarity", "objectivity",
        "ethics", "ethical", "ethical lapses",
        "risk", "risks", "probability", "danger", "likelihood", "risky", "hazard", "peril",
        "privacy", "confidentiality",
        "investment", "investments", "investing", "investment", "investor", "invest", "investors", "equity", "investement",
        "corporate", "corporations", "multinational corporations"
    ]
}
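
Many entries in this dictionary are multi-word phrases such as "greenhouse gas emissions". A minimal sketch of how such phrases could be counted is given below, assuming unstemmed, lowercase tokens. `count_phrases` is a hypothetical helper and is not used elsewhere in the script.

```python
from collections import Counter

def count_phrases(tokens, phrases, max_len=3):
    """Count occurrences of each phrase by building n-grams of up to max_len tokens."""
    ngram_counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            ngram_counts[" ".join(tokens[i:i + n])] += 1
    # Keep only the phrases that occur at least once
    return {phrase: ngram_counts[phrase] for phrase in phrases if ngram_counts[phrase] > 0}

example_tokens = "the company cut greenhouse gas emissions and expanded renewable energy".split()
phrase_hits = count_phrases(
    example_tokens,
    ["greenhouse gas emissions", "renewable energy", "waste"],
)
print(phrase_hits)
# {'greenhouse gas emissions': 1, 'renewable energy': 1}
```

Matching whole phrases avoids counting a generic first word, such as "air" from "air pollution", as an ESG hit on its own.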



## 2.5: ESG Score Calculation and Appending

In [23]:

def calculate_esg_scores(text, esg_dictionary, positive_list, negative_list):
    """
    Calculate length-normalized ESG scores for one document.

    For each ESG category, count the tokens that match a dictionary keyword, add the net
    sentiment (positive minus negative word count), and divide by the total number of tokens.

    Returns:
    - dict: Normalized score per ESG category.
    - float: Total ESG score (sum of the normalized category scores).
    """
    # Clean and tokenize the text
    tokens = clean_tokenize(text)
    token_counts = Counter(tokens)

    # Total number of tokens (words) in the content, used for normalization
    total_tokens = len(tokens)

    scores = {}
    for category, keywords in esg_dictionary.items():
        # Only the first stemmed token of each keyword is matched, so a multi-word phrase is
        # reduced to its first word. A set keeps keywords that share a stem (for example
        # "climate" and "climates") from being counted more than once.
        processed_keywords = set()
        for kw in keywords:
            kw_tokens = clean_tokenize(kw)
            if kw_tokens:  # skip keywords that clean to an empty token list
                processed_keywords.add(kw_tokens[0])
        scores[category] = sum(token_counts[word] for word in processed_keywords)

    # The sentiment words are space-padded and unstemmed, whereas the tokens are stemmed,
    # so strip and stem each sentiment word before looking it up
    positive_stems = {stemmer.stem(word.strip()) for word in positive_list}
    negative_stems = {stemmer.stem(word.strip()) for word in negative_list}
    positive_score = sum(token_counts[word] for word in positive_stems)
    negative_score = sum(token_counts[word] for word in negative_stems)

    # Adjust each category score by the net sentiment
    adjusted_scores = {category: score + positive_score - negative_score for category, score in scores.items()}

    # Normalize the adjusted scores by the total number of tokens
    normalized_scores = {category: score / total_tokens if total_tokens > 0 else 0 for category, score in adjusted_scores.items()}

    # The total ESG score is the sum of the normalized category scores
    total_esg_score = sum(normalized_scores.values())

    return normalized_scores, total_esg_score

def append_esg_scores_to_df(data, esg_dictionary, positive_list, negative_list):
    """
    Appends normalized ESG scores as new columns to the DataFrame.
    """
    # Initialize score columns as floats, since the normalized scores are fractional
    for category in esg_dictionary.keys():
        data[f'{category} Score'] = 0.0
    data['Total ESG Score'] = 0.0

    # Calculate, normalize, and append scores for each row
    for index, row in data.iterrows():
        scores, total_esg_score = calculate_esg_scores(row['Content'], esg_dictionary, positive_list, negative_list)
        for category, score in scores.items():
            data.at[index, f'{category} Score'] = score
        data.at[index, 'Total ESG Score'] = total_esg_score

    return data
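
As a worked example of the arithmetic in `calculate_esg_scores`, the sketch below applies the same formula, (keyword hits + positive hits - negative hits) / total tokens, to made-up counts. The helper name and all numbers are hypothetical.

```python
def normalized_esg_scores(keyword_hits, positive_hits, negative_hits, total_tokens):
    """Apply the scoring formula to pre-computed counts."""
    net_sentiment = positive_hits - negative_hits
    if total_tokens == 0:
        return {category: 0 for category in keyword_hits}, 0
    scores = {
        category: (hits + net_sentiment) / total_tokens
        for category, hits in keyword_hits.items()
    }
    return scores, sum(scores.values())

# Made-up counts for a 200-token document
toy_hits = {"Environmental": 12, "Social": 5, "Governance": 8}
toy_scores, toy_total = normalized_esg_scores(
    toy_hits, positive_hits=6, negative_hits=4, total_tokens=200
)
for category, score in toy_scores.items():
    print(f"{category}: {score:.3f}")
print(f"Total: {toy_total:.3f}")
```

Here the net sentiment is 2, so the category scores are 14/200, 7/200, and 10/200. Because the same net sentiment is added to every category, the total of about 0.155 includes it three times: (25 + 3 × 2) / 200.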


In [24]:

# Append ESG scores to the DataFrame
data_with_esg_scores = append_esg_scores_to_df(data, esg_dictionary, positive, negative)

# Display the first few rows to verify the new columns
print(data_with_esg_scores.head())

   Unnamed: 0 Company name Latin alphabet     Adjust-Company Name Inactive  \
0           1                WALMART INC.            WALMART INC.       No   
1           2            AMAZON.COM, INC.        AMAZON.COM, INC.       No   
2           4            EXXON MOBIL CORP        EXXON MOBIL CORP       No   
3           5      CVS HEALTH CORPORATION  CVS HEALTH CORPORATION       No   
4           7        MCKESSON CORPORATION    MCKESSON CORPORATION       No   

  Quoted Branch OwnData Woco Country ISO code  \
0    Yes     No      No  Yes               US   
1    Yes     No      No  Yes               US   
2    Yes     No      No  Yes               US   
3    Yes     No      No  Yes               US   
4    Yes     No      No  Yes               US   

   NACE Rev. 2, core code (4 digits)  ... National ID  National ID type  \
0                               4719  ...  71-0415188    VAT/Tax number   
1                               4791  ...  91-1646860    VAT/Tax number   
2          

In [None]:
# Export DataFrame to a CSV file
data_with_esg_scores.to_csv('data_with_esg_scores.csv', index=False)