# This demo is adapted from [MilaNLProc](https://colab.research.google.com/drive/1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing#scrollTo=HoKrSIkxaNBt)

## Installing Contextualized Topic Models

1. Enable the GPU (Runtime -> Change runtime type -> GPU).
2. Install the contextualized topic models library.
3. Restart the runtime so the changes take effect (Runtime -> Restart runtime).

In [None]:
%%capture
!pip install contextualized-topic-models==2.3.0
!pip install pyldavis
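
After the runtime restart, it can help to confirm that a GPU is actually visible before training. This is a minimal sketch, assuming PyTorch is present (it is installed as a dependency of the library). `gpu_available` is a hypothetical helper name, not part of `contextualized-topic-models`.

```python
def gpu_available() -> bool:
    """Return True if PyTorch can see a CUDA device, False otherwise."""
    try:
        import torch  # installed as a dependency of contextualized-topic-models
    except ImportError:
        # PyTorch is missing, so no GPU can be used from this runtime
        return False
    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the runtime type is probably still set to CPU.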

# Data

You can upload your own scraped company filings to your GitHub repo. For this demo, we will read in the file I uploaded, which covers 5 companies in the Energy sector, with three filings per company (for 2018, 2019, and 2020).

In [None]:
%%capture
!wget https://raw.githubusercontent.com/huiyinz/CourseProject/main/ProcessedFiles/Filings_Text.txt

In [None]:
!head -n 2 Filings_Text.txt

Item 1A. Risk Factors Chevron is a global energy company and its operating and financial results are subject to a variety of risks inherent in the global oil, gas, and petrochemical businesses. Many of these risks are not within the companys control and could materially impact the companys results of operations and financial condition. Chevron is exposed to the effects of changing commodity prices Chevron is primarily in a commodities business that has a history of price volatility. The single largest variable that affects the companys results of operations is the price of crude oil, which can be influenced by general economic conditions, industry production and inventory levels, technology advancements, production quotas or other actions that might be imposed by the Organization of Petroleum Exporting Countries (OPEC) or other producers, weather-related damage and disruptions due to other natural or human causes beyond our control, competing fuel prices, and geopolitical risks. Chevro

In [None]:
text_file = "Filings_Text.txt" # EDIT THIS WITH THE FILE YOU UPLOAD

# Import necessary libraries

In [None]:
from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords
from nltk.corpus import stopwords as stop_words
import nltk

## Preprocessing

In [None]:
nltk.download('stopwords')
stopwords = list(stop_words.words("english"))

# Split each line on '.' and keep fragments with at least 3 space-separated tokens
with open(text_file, encoding="utf-8") as f:
    documents = [t for line in f
                 for t in line.split('.') if len(t.split(' ')) >= 3]

sp = WhiteSpacePreprocessingStopwords(documents, stopwords_list=stopwords)
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
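
To make the two preprocessing steps concrete, here is a rough pure-Python sketch of what happens to a single filing: it is split on periods into sentence-like fragments, then each fragment is lowercased, stripped of punctuation, and filtered against a stopword list. This is only an illustration under simplifying assumptions, not the library's implementation. `split_into_fragments`, `simple_preprocess`, and the tiny stopword set are hypothetical.

```python
import string


def split_into_fragments(text, min_tokens=3):
    """Split on '.' and keep fragments with at least `min_tokens` space-separated tokens."""
    return [t for t in text.split('.') if len(t.split(' ')) >= min_tokens]


def simple_preprocess(fragment, stopwords):
    """Lowercase, replace punctuation with spaces, and drop stopwords."""
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    tokens = fragment.lower().translate(table).split()
    return ' '.join(w for w in tokens if w not in stopwords)


# Tiny illustrative stopword set; the notebook uses NLTK's English list instead
demo_stopwords = {"is", "a", "and", "its"}

text = "Item 1A. Risk Factors Chevron is a global energy company. Ok."
fragments = split_into_fragments(text)
cleaned = [simple_preprocess(f, demo_stopwords) for f in fragments]
print(cleaned)
```

Short fragments such as `Item 1A` fall below the three-token threshold and are dropped, which is why the preprocessed documents start at `risk factors chevron ...`.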


In [None]:
preprocessed_documents[:10]

['risk factors chevron global energy company operating financial results subject variety risks inherent global oil gas petrochemical businesses',
 'many risks within companys control could materially impact companys results operations financial condition',
 'chevron exposed effects changing commodity prices chevron primarily commodities business history price volatility',
 'single largest variable affects companys results operations price crude oil influenced general economic conditions industry production inventory levels technology advancements production quotas actions might imposed organization petroleum exporting countries opec producers weather related damage disruptions due natural human causes beyond control competing fuel prices geopolitical risks',
 'chevron evaluates risk changing commodity prices core part business planning process',
 'investment company carries significant exposure fluctuations global crude oil prices',
 'extended periods low prices crude oil material adve

In [None]:
tp = TopicModelDataPreparation("all-mpnet-base-v2")

training_dataset = tp.fit(text_for_contextual=unpreprocessed_corpus, text_for_bow=preprocessed_documents)


## Training the Combined TM

I chose to extract 15 topics to limit duplicated themes; change `n_components` as you see fit. Note that `contextual_size=768` matches the embedding dimension of `all-mpnet-base-v2`.

In [None]:
ctm = CombinedTM(bow_size=len(tp.vocab), contextual_size=768, n_components=15, num_epochs=10)
ctm.fit(training_dataset) # run the model

Epoch: [10/10]	 Seen Samples: [35040/35040]	Train Loss: 136.71724446614584	Time: 0:00:00.966419: : 10it [00:10,  1.01s/it]
Sampling: [20/20]: : 20it [00:16,  1.20it/s]


# Topics


In [None]:
ctm.get_topic_lists(5)

[['demand', 'products', 'emissions', 'use', 'regulations'],
 ['legal', 'laws', 'companys', 'impact', 'regulations'],
 ['business', 'systems', 'condition', 'cybersecurity', 'effect'],
 ['could', 'business', 'result', 'loss', 'cybersecurity'],
 ['condition', 'flows', 'results', 'adversely', 'materially'],
 ['prices', 'reserves', 'commodity', 'bitumen', 'flows'],
 ['uncompetitive', 'sea', 'restructuring', 'prevention', 'ranges'],
 ['technologies', 'research', 'greenhouse', 'hydraulic', 'carbon'],
 ['condition', 'results', 'adversely', 'materially', 'financial'],
 ['operations', 'adversely', 'condition', 'materially', 'could'],
 ['installation', 'hostilities', 'robust', 'internally', 'pose'],
 ['oil', 'gas', 'demand', 'production', 'crude'],
 ['restructuring', 'predictable', 'able', 'robust', 'considerations'],
 ['installation', 'robust', 'serve', 'uncompetitive', 'hostilities'],
 ['crude', 'oil', 'gas', 'natural', 'equipment']]
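
Several of the topics above share most of their top words (for example, the ones built around `condition`, `adversely`, and `materially`). One rough way to quantify such duplicated themes is to compute the share of unique words across all topics and the Jaccard overlap between pairs of topics. The sketch below is pure Python, and both helper names are hypothetical. The sample copies three of the topic lists printed above.

```python
from itertools import combinations


def topic_diversity(topics):
    """Fraction of unique words among all top words (1.0 means no word is shared)."""
    words = [w for topic in topics for w in topic]
    return len(set(words)) / len(words) if words else 0.0


def overlapping_pairs(topics, threshold=0.5):
    """Return (i, j, jaccard) for topic pairs whose Jaccard overlap is >= threshold."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(topics), 2):
        sa, sb = set(a), set(b)
        union = sa | sb
        jaccard = len(sa & sb) / len(union) if union else 0.0
        if jaccard >= threshold:
            pairs.append((i, j, jaccard))
    return pairs


# Three of the topics printed above, used as a small sample
sample_topics = [
    ['condition', 'flows', 'results', 'adversely', 'materially'],
    ['condition', 'results', 'adversely', 'materially', 'financial'],
    ['oil', 'gas', 'demand', 'production', 'crude'],
]

print(topic_diversity(sample_topics))
print(overlapping_pairs(sample_topics))
```

In the notebook, the same functions could be applied to the full output of `ctm.get_topic_lists(5)`. A low diversity score or many overlapping pairs would suggest trying a smaller `n_components`.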