Fork of a code repository for measuring corporate culture, modified to accommodate corporations' CSR and ESG initiatives, including DEI values.

Originally forked from MS20190155/Measuring-Corporate-Culture-Using-Machine-Learning

Find the original README below

Measuring corporations' CSR and ESG initiatives by constructing an index using word-embeddings and machine learning.

Python code based on a pipeline to measure corporate culture, modified to accommodate corporations' CSR and ESG initiatives, including DEI values. This repository builds on the work in the original repository; the code has been modified to handle CSR and ESG reports, as well as transcripts of earnings calls. Major additions so far: (1) a preprocessor module to handle pdf files (earnings call transcripts and reports) and xml metadata files.

Setup from scratch, on Windows (tl;dr)

  • Install Anaconda

  • Install an IDE (e.g. Visual Studio Code)

  • Install Git

  • Install Java (Windows Offline 64-bit)

  • Install Stanford CoreNLP v3.9.2 by manually placing the uncompressed folder somewhere, e.g., "C:\Users\user\AppData\Local\stanford-corenlp-full-2018-10-05"

  • Clone code repository to your working directory: git clone

  • Create an environment called "index": conda create -n index python=3.9

  • To activate this environment: conda activate index

  • Add Anaconda to the Windows Path environment variable so that the VSCode terminal recognizes and uses the Anaconda prompt (add these two entries: "C:\Users\user\AppData\Local\Anaconda3\Scripts" and "C:\Users\user\AppData\Local\Anaconda3")

  • Add a Python interpreter to VSCode from this new conda environment. If it is not offered, you can probably find it at: C:\Users\user\AppData\Local\anaconda3\envs\index\python.exe

  • Install required python packages: pip install -r requirements.txt

  • Set the Stanford CoreNLP path: os.environ["CORENLP_HOME"] = "C:/Users/user/AppData/Local/stanford-corenlp-full-2018-10-05/"

  • Test with the command python -m culture.preprocess; on success, you should see:

    Starting server with command: java -Xmx16G -cp C:/Users/gparti/AppData/Local/stanford-corenlp-full-2018-10-05//* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 60000 -threads 1 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-8922256f13d24a46.props -preload tokenize, ssplit, pos, lemma, ner, depparse
    (['when[pos:WRB] I[pos:PRP] be[pos:VBD] a[pos:DT] child[pos:NN] in[pos:IN] [NER:LOCATION]Ohio[pos:NNP] ,[pos:,] I[pos:PRP] always[pos:RB] want[pos:VBD] to[pos:TO] go[pos:VB] to[pos:TO] [NER:ORGANIZATION]Stanford[pos:NNP]_University[pos:NNP] with[pos:IN]_respect[pos:NN]_to[pos:TO] higher[pos:JJR] education[pos:NN] .[pos:.] ', 'but[pos:CC] I[pos:PRP] go[pos:VBD] along[pos:IN] with[pos:IN] my[pos:PRP$] parent[pos:NNS] .[pos:.] '], ['None_0', 'None_1'])

  • Tweak the settings according to your machine (e.g. RAM, CPU cores, etc.)

  • Setup is complete. To train on your own data, define DATA_FOLDER in the settings file and adjust the dimensions DIMS and SEED_WORDS to your own needs.

For prerequisites and system requirements, you can also follow the instructions of the original repository (see details below). In short, you need Python, Java, and Stanford CoreNLP 3.9.2.
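The prerequisites above can be checked before running the pipeline. The following is a minimal sketch, not part of this repository: the function name check_prerequisites is hypothetical, and it assumes that the CoreNLP folder contains its .jar files at the top level, as the server command's classpath suggests.

```python
import os
import shutil
from pathlib import Path


def check_prerequisites(corenlp_home=None):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    # Java must be discoverable on PATH for the CoreNLP server to start.
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")

    # Fall back to the CORENLP_HOME environment variable if no path is given.
    home = corenlp_home or os.environ.get("CORENLP_HOME")
    if not home:
        problems.append("CORENLP_HOME is not set")
    else:
        home_path = Path(home)
        if not home_path.is_dir():
            problems.append(f"CoreNLP folder does not exist: {home_path}")
        elif not any(home_path.glob("*.jar")):
            # Assumption: the jars sit directly inside the CoreNLP folder.
            problems.append(f"no .jar files found in: {home_path}")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("PROBLEM:", problem)
```

If the script prints nothing, Java and the CoreNLP folder were both found; it does not verify the Java or CoreNLP version.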

Usage notes

1. Create your data folder

Create a data folder in the project directory, e.g. data, and place your pdf (and xml) files in a folder named raw inside this data folder. In the settings file, at line 14, define the DATA_FOLDER constant as your folder, e.g.:

    DATA_FOLDER: str = "data/"

Your folder structure will look like this:

├── data
│   ├── raw
│   ├── input
│   └── processed
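
The layout above can also be created programmatically. This is a small sketch under the assumption that nothing else needs to exist in the data folder beforehand; create_data_folders is a hypothetical helper, not a function of this repository, and the pipeline may create some of these folders itself.

```python
from pathlib import Path


def create_data_folders(data_folder="data/"):
    """Create the raw, input, and processed subfolders under data_folder."""
    root = Path(data_folder)
    created = []
    for name in ("raw", "input", "processed"):
        sub = root / name
        # exist_ok=True makes the call safe to repeat on an existing layout.
        sub.mkdir(parents=True, exist_ok=True)
        created.append(sub)
    return created


if __name__ == "__main__":
    for folder in create_data_folders("data/"):
        print("ready:", folder)
```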

Currently, data-test has already been set up for you with 150 sample documents. Training the model on data-test should take about 1 hour on an ordinary office machine.

2. Run to run everything, or run the modules one by one as below.

The two variants - and - accommodate different datasets. The "Transcripts" dataset consists of pdf files with accompanying xml files for metadata, while the "CSR" dataset contains only pdfs; the preprocessor differs between the two, while the rest of the modules are the same.

Files and folders appended with -transcripts or -csr hold code, data, or results for these two datasets.

1. python

This module takes in pdf files (and accompanying xml metadata files in case of transcripts) from a dataset and processes the documents to be suitable for training, extracting their content and creating input files documents.txt and document_ids.txt. See explanations on the rest of the modules in the original README (below).

2. python

3. python

4. python

5. python

6. python

If you encounter problems, you most likely need to:

  • Check if the packages' versions are compatible (you would get an error message)
  • Check for correct settings of paths and parameters in
  • Check for documents with missing data (xml tree errors)
  • Check for documents that are not in UTF-8 character encoding.
  • Check for documents that are too large.
  • Deprecation warnings can be ignored for now.

3. Find results in the outputs/scores folder.

For example, the file firm_scores_TFIDF.csv shows the aggregated TFIDF scores for all searched dimensions (e.g., "diversity", "equity", "inclusion"), for every document. The file mean_firm_scores_TFIDF.csv groups the previous scores by firm, and returns their mean values.
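
The relationship between the two files can be illustrated with a short sketch. It assumes a document-level table with a firm_id column and one column per dimension; these column names and the sample numbers are hypothetical and may not match the actual CSV headers.

```python
import csv
import io
from collections import defaultdict


def mean_scores_by_firm(rows, firm_column="firm_id",
                        dimensions=("diversity", "equity", "inclusion")):
    """Average each dimension's score over all documents of the same firm."""
    totals = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for row in rows:
        firm = row[firm_column]
        counts[firm] += 1
        for dim in dimensions:
            totals[firm][dim] += float(row[dim])
    return {
        firm: {dim: totals[firm][dim] / counts[firm] for dim in dimensions}
        for firm in counts
    }


# Hypothetical document-level scores standing in for firm_scores_TFIDF.csv.
sample_csv = """firm_id,diversity,equity,inclusion
A,1.0,2.0,3.0
A,3.0,4.0,5.0
B,0.5,0.5,0.5
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
firm_means = mean_scores_by_firm(rows)
print(firm_means["A"])  # firm A's two documents averaged per dimension
```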

Original README

Measuring Corporate Culture Using Machine Learning


The repository implements the method described in the paper

Kai Li, Feng Mai, Rui Shen, Xinyan Yan, Measuring Corporate Culture Using Machine Learning, The Review of Financial Studies, 2020; DOI:10.1093/rfs/hhaa079 [Available at SSRN]

The code is tested on Ubuntu 18.04 and macOS Catalina, with limited testing on Windows 10.


The code requires

os.environ["CORENLP_HOME"] = "/home/user/stanford-corenlp-full-2018-10-05/"

  • If you are using Windows, use "/" instead of "\" to separate directories.

  • Make sure requirements for CoreNLP are met. For example, you need to have Java installed (if you are using Windows, install Windows Offline (64-bit) version). To check if CoreNLP is set up correctly, use command line (terminal) to navigate to the project root folder and run python -m culture.preprocess. You should see parsed outputs from a single sentence printed after a moment:

    (['when[pos:WRB] I[pos:PRP] be[pos:VBD] a[pos:DT]....


We included some example data in the data/input/ folder. The three files are

  • documents.txt: Each line is a document (e.g., each earnings call). Each document needs to have line breaks removed. The file has no header row.
  • document_ids.txt: Each line is a document ID (e.g., unique identifier for each earnings call). A document ID cannot contain _ or whitespace. The file has no header row.
  • (Optional) id2firms.csv: A csv file with three columns (document_id:str, firm_id:str, time:int). The file has a header row.
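
The constraints on the input files listed above can be checked with a short script. This is a sketch only: validate_input_files, read_lines, and flatten_document are hypothetical helpers, not functions of this repository, and the uniqueness check assumes that every document ID must appear once.

```python
import re
from pathlib import Path


def read_lines(path):
    """Read a UTF-8 text file and split it on newlines only."""
    text = Path(path).read_text(encoding="utf-8")
    if not text:
        return []
    # str.splitlines() would also split on form feeds, which extracted
    # pdf text can contain, so split on "\n" explicitly.
    lines = text.split("\n")
    return lines[:-1] if text.endswith("\n") else lines


def validate_input_files(input_folder):
    """Return a list of problems found in documents.txt and document_ids.txt."""
    folder = Path(input_folder)
    docs = read_lines(folder / "documents.txt")
    ids = read_lines(folder / "document_ids.txt")
    problems = []
    if len(docs) != len(ids):
        problems.append(f"{len(docs)} documents but {len(ids)} IDs")
    for line_no, doc_id in enumerate(ids, start=1):
        # An ID must be non-empty and free of underscores and whitespace.
        if not doc_id or re.search(r"[_\s]", doc_id):
            problems.append(f"line {line_no}: invalid ID {doc_id!r}")
    if len(set(ids)) != len(ids):
        problems.append("document IDs are not unique")
    return problems


def flatten_document(text):
    """Collapse line breaks and repeated whitespace so a document fits on one line."""
    return " ".join(text.split())
```

Calling validate_input_files("data/input") returns an empty list when both files are consistent; flatten_document can be applied to each document before it is written to documents.txt.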

Before running the code

You can configure global options in the settings file. The most important options are perhaps:

  • The RAM allocated for CoreNLP
  • The number of CPU cores for CoreNLP parsing and model training
  • The seed words
  • The max number of words to include in each dimension. Note that after filtering and deduplication (each word can only be loaded under a single dimension), the number of words will be smaller.
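
To see why the final word count per dimension can fall below the configured maximum, consider the following sketch. It assumes one plausible deduplication rule, keeping each word only in the dimension where its similarity is highest; the repository's actual rule may differ, and build_dimension_words and the sample similarities are hypothetical.

```python
def build_dimension_words(candidates, max_words):
    """Map {dimension: [(word, similarity), ...]} to {dimension: [word, ...]}."""
    # Step 1: keep the max_words most similar candidates per dimension.
    top = {
        dim: sorted(pairs, key=lambda p: p[1], reverse=True)[:max_words]
        for dim, pairs in candidates.items()
    }
    # Step 2: record, for each word, the dimension where it scores highest.
    best = {}
    for dim, pairs in top.items():
        for word, sim in pairs:
            if word not in best or sim > best[word][1]:
                best[word] = (dim, sim)
    # Step 3: drop the word from every other dimension.
    return {
        dim: [word for word, _ in pairs if best[word][0] == dim]
        for dim, pairs in top.items()
    }


# Hypothetical similarities of candidate words to each dimension's seed words.
candidates = {
    "diversity": [("inclusive", 0.9), ("gender", 0.8), ("fair", 0.4)],
    "equity": [("fair", 0.7), ("inclusive", 0.6), ("pay_gap", 0.5)],
}
expanded = build_dimension_words(candidates, max_words=3)
print(expanded)
```

In this toy example each dimension starts with three candidates but ends with two, because "inclusive" and "fair" are each kept under a single dimension.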

Running the code

  1. Use python to use Stanford CoreNLP to parse the raw documents. This step is relatively slow, so using multiple CPU cores is recommended. The parsed files are output in the data/processed/parsed/ folder:

    • documents.txt: Each line is a sentence.
    • document_sent_ids.txt: Each line is an ID in the format of docID_sentenceID (e.g. doc0_0, doc0_1, ..., doc1_0, doc1_1, doc1_2, ...). Each line in the file corresponds to a line in documents.txt.

    Note about performance: This step is time-consuming (~10 min for 100 calls). Using python can speed up the process considerably (~2 min with 8 cores for 100 calls) but it is not well-tested on all platforms. To not break things, the two implementations are separated.

  2. Use python to clean the parsed documents.txt and remove stopwords and named entities. The program then learns corpus-specific phrases using gensim and concatenates them. Finally, the program trains the word2vec model.

    The options can be configured in the file. The program outputs the following 3 output files:

    • data/processed/unigram/documents_cleaned.txt: Each line is a sentence. NERs are replaced by tags. Stopwords, 1-letter words, punctuation marks, and pure numeric tokens are removed. MWEs and compound words are concatenated.
    • data/processed/bigram/documents_cleaned.txt: Each line is a sentence. 2-word phrases are concatenated.
    • data/processed/trigram/documents_cleaned.txt: Each line is a sentence. 3-word phrases are concatenated. This is the final corpus for training the word2vec model and scoring.

    The program also saves the following gensim models:

    • models/phrases/bigram.mod: phrase model for 2-word phrases
    • models/phrases/trigram.mod: phrase model for 3-word phrases
    • models/w2v/w2v.mod: word2vec model
  3. Use python to create the expanded dictionary. The program outputs the following files:

    • outputs/dict/expanded_dict.csv: A csv file with the number of columns equal to the number of dimensions in the dictionary (five in the paper). The row headers are the dimension names.

    (Optional): It is possible to manually remove or add items to the expanded_dict.csv before scoring the documents.

  4. Use python to score the documents. Note that the output scores for the documents are not adjusted by the document length. The program outputs three sets of scores:

    • outputs/scores/scores_TF.csv: using raw term counts or term frequency (TF),
    • outputs/scores/scores_TFIDF.csv: using TF-IDF weights,
    • outputs/scores/scores_WFIDF.csv: TF-IDF with Log normalization (WFIDF).

    (Optional): It is possible to use additional weights on the words (see score.score_tf_idf() for detail).

  5. (Optional): Use python to aggregate the scores to the firm-time level. The final scores are adjusted by the document lengths.
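
The three weighting schemes from step 4 can be sketched as follows. This uses common textbook definitions (idf = log(N / df), and 1 + log(tf) for the log-normalized variant), which are an assumption here; the repository's exact formulas may differ, and score_documents is a hypothetical function, not the repository's score.score_tf_idf().

```python
import math
from collections import Counter


def score_documents(documents, dictionary, method="TF"):
    """Score tokenized documents against {dimension: set_of_words}.

    method is "TF", "TFIDF", or "WFIDF"; scores are not length-adjusted.
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Document frequency: number of documents containing each word.
    df = Counter(word for c in counts for word in c)

    def weight(tf, word):
        if method == "TF":
            return tf
        idf = math.log(n_docs / df[word])
        if method == "TFIDF":
            return tf * idf
        if method == "WFIDF":
            return (1 + math.log(tf)) * idf
        raise ValueError(f"unknown method: {method}")

    return [
        {
            dim: sum(weight(c[w], w) for w in words if c[w] > 0)
            for dim, words in dictionary.items()
        }
        for c in counts
    ]


# Hypothetical tokenized documents and expanded dictionary.
docs = [
    ["diversity", "diversity", "profit"],
    ["profit", "growth"],
    ["inclusion", "diversity"],
]
dims = {"diversity": {"diversity"}, "inclusion": {"inclusion"}}
tf_scores = score_documents(docs, dims, method="TF")
tfidf_scores = score_documents(docs, dims, method="TFIDF")
wfidf_scores = score_documents(docs, dims, method="WFIDF")
```

Dividing each score by the number of tokens in the document is one simple way to adjust for document length, as the aggregation step does in some form.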


