How to run the NLP code?
- First, we train the model using `\train_models\LDA_model_estimation.py`, or `LDAMarius.slurm` if using a multi-core system.
  - The model files are saved under `.\pretrained_models`
  - Estimation logs are saved under `.\train_models\Logs`
  - Perplexity scores are saved under `.\train_models\Output`
- Second, we run `\train_models\study_topics.py` to plot perplexity against the number of topics and select the optimal number of topics.
  - Output is a graph, `perplexity_topics.pdf`.
  - The file also outputs the top ten words for each topic, given a (manually) prespecified number of topics X: `topics_terms_n=X.pdf`.
- The file `industry_gettopics.py` generates a quarter-industry panel of topic loadings, saved in the file `.\IndustryAnalysis\topic_loadings_by_industryquarter.csv`.
- The code `.\IndustryAnalysis\industry_toptopics.py` generates the top 2 topics (with list of words) for each GIC code and saves them in `TopTopics_Industries.csv`.
- The code `build_shapley.py` (together with `ShapleyMarius.slurm`) generates panels of Shapley values by analyst-ticker-quarter (including information diversity, contribution), saved in the `OutputShapley` folder.
- Use `merge_shapley.py` in the `OutputShapley` folder to generate a `DataShapley.csv` file.
- Run `get_technicaldummy.py` to get a file with analyst-level topic loadings on technical analysis topics (`DataShapley_TechnicalTopicWeights.csv`).
- The complete merged file (`DataShapley.csv` + `DataShapley_TechnicalTopicWeights.csv`) is saved as `Data_InfoContributionAnalyst.csv`.
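
The topic-selection step (plotting perplexity against the number of topics) can be illustrated with a minimal sketch. It assumes the perplexity scores have already been loaded into a dictionary keyed by the number of topics, and it uses "lowest perplexity wins" as the selection rule. Both the function name `select_num_topics` and that rule are assumptions for illustration: in this repository the choice is made by inspecting `perplexity_topics.pdf`, and the example scores are made up.

```python
def select_num_topics(perplexity_by_k):
    """Return the candidate number of topics with the lowest perplexity.

    perplexity_by_k maps a candidate number of topics (int) to its
    perplexity score (float). Hypothetical helper, not part of the repo.
    """
    if not perplexity_by_k:
        raise ValueError("no perplexity scores supplied")
    # Lower perplexity indicates a better fit; ties go to the smaller model.
    return min(perplexity_by_k, key=lambda k: (perplexity_by_k[k], k))


# Made-up scores for illustration only, not results from this project.
example_scores = {10: 1520.4, 20: 1388.9, 30: 1351.2, 40: 1367.5}
chosen = select_num_topics(example_scores)
print(chosen)  # 30
```

The chosen value would then play the role of X in `topics_terms_n=X.pdf`. In practice the curve may flatten rather than reach a clear minimum, which is one reason to keep the manual check against the plot.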