### Inscope Take Home

In this assignment you will build a prototype of a cluster analysis tool to navigate financial statements.

Each company has a unique CIK (Central Index Key) that identifies it in the data. The CIK is a 10-digit number and is the prefix of the file name for each company's 10-K report.

There are three main tasks in this assignment:

- To construct a CIK -> Company Name mapping from the provided data.

- To summarize each company's report into a few sentences.

- To cluster the companies into similar groups based on their financial statements.

The goal of this assignment is to demonstrate your ability to build a data pipeline to process unstructured data, and to use that data to build a simple clustering and summarizing tool whose output could be built into a more complex application. What we expect you to build are proofs of concept, and not production-ready models.

**Specification** 

- [x] You will filter down the dataset to cluster companies that are in the S&P 500 index. You can find a recent list of CIKs for companies in the S&P 500 in the SP500.txt file.


- [x] You will create a script that given a directory with report files can produce a CIK -> Company Name mapping in the shape of a CSV file with two columns: CIK and Company Name. Each row in this file will represent each file in the provided data. (hint: you don't need to throw an LLM at this problem)


- [x] You will run your mapping script on the provided data, and include it in your response.


- [x] You will write a data pipeline to process the provided HTML into an intermediate representation that can be used for clustering. One of the features in your intermediate representation should be a 1-paragraph summary of the report. You can use any pre-trained language model you like to generate the summary.
 
 
- [x] You will use your pipeline to assign every company in the dataset into similar groups based on their financial statements.


- [x] You will provide a Jupyter Notebook, a Streamlit app, or equivalent for users to inspect and interact with the results of your clustering and summarization. The visualization should allow the user to select a company and show other similar companies in the same cluster.

#### Task #1 - Creating a CIK - Company Name mapping script

- You will filter down the dataset to cluster companies that are in the S&P 500 index. You can find a recent list of CIKs for companies in the S&P 500 in the SP500.txt file.

- You will create a script that given a directory with report files can produce a CIK -> Company Name mapping in the shape of a CSV file with two columns: CIK and Company Name. Each row in this file will represent each file in the provided data. (hint: you don't need to throw an LLM at this problem)

- You will run your mapping script on the provided data, and include it in your response.

In [247]:
#!pip install langchain
#!pip install torch
#!pip install transformers
#!pip install bs4
#!pip install sentence_transformers
#!pip install openai
#!pip install faiss-gpu
#!pip install streamlit

In [212]:
import os
import re
import requests
import json
import pandas as pd
import numpy as np
import langchain
import torch
import transformers
from bs4 import BeautifulSoup
from transformers import BertTokenizer, BertModel, pipeline
import sentencepiece
from langchain.chat_models import ChatOpenAI
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import TextLoader

import faiss
from sentence_transformers import SentenceTransformer

In [140]:
os.environ['OPENAI_API_KEY'] = "xxx"

Read files from the directory and filter to keep only S&P500

In [142]:
# specify directory path
dir_path = os.getcwd() + '/data'

In [143]:
# list the files from the directory
orig_files = [f for f in os.listdir(dir_path)]
len(orig_files)

3096

In [145]:
orig_files[:5]

['0001722482.html',
 '0001317839.html',
 '0001141234.html',
 '0000320193.html',
 '0001408210.html']

In [146]:
# strip the .html extension so the file names match the CIK format
files = [f[:-len(".html")] for f in orig_files if f.endswith(".html")]
files[:5]

['0001722482', '0001317839', '0001141234', '0000320193', '0001408210']

In [147]:
# save a pandas dataframe
#df = pd.DataFrame({'cos_idx': files})
#df

In [148]:
# Read S&P 500 companies data 
sp500 = pd.read_csv('SP500.txt', delimiter="\t", header=None, names = ['col1'], dtype=str) # read as string, so that leading zeros are not truncated
sp500.sample(5)

| | col1 |
|---|---|
| 131 | 909832 |
| 109 | 723254 |
| 41 | 1013462 |
| 289 | 60667 |
| 83 | 906345 |


In [149]:
sp500.shape

(502, 1)

In [150]:
# Drop duplicates
sp500 = sp500.drop_duplicates(subset="col1")
sp500.shape

(499, 1)

In [151]:
# To List
sp500_lst = list(sp500['col1'])
sp500_lst[:5]

['0000091142', '0000001800', '0001551152', '0000815094', '0001467373']

In [152]:
# Filter the 3,000+ report files down to companies in the S&P 500
filtered_list = [f for f in files if f in sp500_lst]
len(filtered_list)

497
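
The match above depends on both sides using the same zero-padded, 10-character CIK format. A minimal sketch of making that explicit is shown below. The helper names `normalize_cik` and `filter_to_index` are hypothetical and not part of the notebook; the set is only there to keep membership checks fast.

```python
def normalize_cik(value):
    """Return a CIK as a zero-padded 10-character string."""
    return str(value).strip().zfill(10)


def filter_to_index(file_names, index_ciks):
    """Keep the CIKs of .html report files whose CIK appears in index_ciks."""
    wanted = {normalize_cik(c) for c in index_ciks}  # set gives O(1) lookups
    ciks = (name[:-len(".html")] for name in file_names if name.endswith(".html"))
    return [cik for cik in ciks if normalize_cik(cik) in wanted]


# Hypothetical inputs: two report files, one unrelated file, two index CIKs
print(filter_to_index(
    ["0000320193.html", "0001722482.html", "notes.txt"],
    ["320193", "0000001800"],
))  # ['0000320193']
```

With this normalization, a CIK read without leading zeros (for example `320193`) still matches its file name.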

In [17]:
# Read CIK - Company ticker mapping file from sec.gov
with open('sec_co.json', 'r') as file:
    data = file.read()

sec_co = json.loads(data)

In [18]:
# Create a dataframe with filtered list of companies in S&P500
df_cik_co = pd.DataFrame({'cik': filtered_list})
df_cik_co.head(5)

| | cik |
|---|---|
| 0 | 320193 |
| 1 | 1048286 |
| 2 | 1048695 |
| 3 | 79879 |
| 4 | 100517 |


In [19]:
# Function of lookup company name 
def cik_company(cik):

    # match cik format, remove leading 0s
    cik = int(cik)

    for co in sec_co.values():

        if co['cik_str'] == cik:
            return co['title']
            
    return np.nan
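
The function above scans every record in `sec_co` for each row. A possible alternative, sketched here under the assumption that `sec_co` maps string keys to records with `cik_str` and `title` fields (as the function above implies), is to build a dictionary once and look names up directly. The names `build_cik_lookup` and `company_name`, the `ticker` field, and the sample records are illustrative only.

```python
def build_cik_lookup(sec_records):
    """Map integer CIK -> company title from SEC-style records."""
    return {int(rec["cik_str"]): rec["title"] for rec in sec_records.values()}


def company_name(cik, lookup):
    """Return the title for a (possibly zero-padded) CIK, or None if unknown."""
    return lookup.get(int(cik))


# Hypothetical two-record sample in the same shape as sec_co
sample = {
    "0": {"cik_str": 320193, "ticker": "AAPL", "title": "Apple Inc."},
    "1": {"cik_str": 19617, "ticker": "JPM", "title": "JPMORGAN CHASE & CO"},
}
lookup = build_cik_lookup(sample)
print(company_name("0000320193", lookup))  # Apple Inc.
print(company_name("0000000001", lookup))  # None
```

If a CIK appears under several tickers, the dictionary keeps the last record seen, whereas the loop above returns the first; this only matters if the titles differ.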

In [20]:
# Apply function on each row of pandas dataframe
df_cik_co['company_name'] = df_cik_co.apply(lambda x: cik_company(x['cik']), axis=1)
df_cik_co

| | cik | company_name |
|---|---|---|
| 0 | 0000320193 | Apple Inc. |
| 1 | 0001048286 | MARRIOTT INTERNATIONAL INC /MD/ |
| 2 | 0001048695 | F5, INC. |
| 3 | 0000079879 | PPG INDUSTRIES INC |
| 4 | 0000100517 | United Airlines Holdings, Inc. |
| ... | ... | ... |
| 492 | 0000945841 | POOL CORP |
| 493 | 0000019617 | JPMORGAN CHASE & CO |
| 494 | 0001410636 | American Water Works Company, Inc. |
| 495 | 0001725057 | Ceridian HCM Holding Inc. |


In [21]:
# Save the mapping table in a CSV (index=False keeps only the cik and company_name columns)
df_cik_co.to_csv(os.getcwd() + "/data/df_cik_co.csv", index=False)

#### Task #2 - Clustering + Summarization

- You will write a data pipeline to process the provided HTML into an intermediate representation that can be used for clustering. One of the features in your intermediate representation should be a 1-paragraph summary of the report. You can use any pre-trained language model you like to generate the summary.

- You will use your pipeline to assign every company in the dataset into similar groups based on their financial statements.
  
- You will provide a Jupyter Notebook, a Streamlit app, or equivalent for users to inspect and interact with the results of your clustering and summarization. The visualization should allow the user to select a company and show other similar companies in the same cluster.


In [19]:
# test code to parse html text and clean up
file_name = "0001051470.html"

# use a separate handle name so the file name variable is not shadowed
with open(os.path.join(os.getcwd(), 'data', file_name), 'r') as f:
    content = f.read()

# parse html and extract the visible text
soup = BeautifulSoup(content, 'html.parser')
para_text = soup.get_text()

In [20]:
# skip the initial pages - table of contents
text_to_summarize = para_text
ts = text_to_summarize[25000:]
#ts

Using `t5-small` model:

In [45]:
summarizer = pipeline(
                        task="summarization",
                        model="t5-small",
                        min_length=50,
                        max_length=500,
                        truncation=True
                     )

In [153]:
summarizer(ts)

[{'summary_text': 'FORM 10-K __________________________ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the transition period from to Commission File Number 001-16441 . Yes  No Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15d of the Act .'}]

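The summary is not meaningful or cohesive: it reads like an extract of the filing's cover page rather than an abstractive summary. With `truncation=True`, the input is cut to the model's maximum input length, so only the beginning of the text is ever summarized.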

Using `langchain` document loaders + LLMs

In [22]:
# test code to parse html text and clean up
file_name = "0001051470.html"

# use a separate handle name so the file name variable is not shadowed
with open(os.path.join(os.getcwd(), 'data', file_name), 'r') as f:
    content = f.read()

# parse html and extract the visible text
soup = BeautifulSoup(content, 'html.parser')
para_text = soup.get_text()

# skip initial pages - table of contents
ts = para_text[23000:]
print('text', ts[0:5000])

text  FORM 10-K  __________________________☒ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the fiscal year ended December 31, 2022 or ☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934For the transition period from              to             Commission File Number 001-16441  __________________________CROWN CASTLE INC. (Exact name of registrant as specified in its charter) __________________________ Delaware 76-0470458(State or other jurisdictionof incorporation or organization) (I.R.S. EmployerIdentification No.)8020 Katy Freeway, Houston, Texas 77024-1908(Address of principal executive offices) (Zip Code)(713) 570-3000 (Registrant's telephone number, including area code) Securities Registered Pursuant toSection 12(b) of the ActTrading SymbolsName of Each Exchangeon Which RegisteredCommon Stock, $0.01 par valueCCINew York Stock ExchangeSecurities Registered Pursuant to Section 12(g) of the Act: NONE. _____

In [174]:
llm = ChatOpenAI(
    model='gpt-3.5-turbo-16k', 
    request_timeout=120
)

In [104]:
llm_summary = llm.invoke(f'Summarize: "{ts[0:50000]}"')
print(llm_summary.content)

Crown Castle Inc. is a company that owns, operates, and leases shared communications infrastructure in the United States. This includes towers, small cells, and fiber assets. The company's core business is providing access to its communications infrastructure through long-term tenant contracts. They aim to grow cash flows from their existing infrastructure, return cash to stockholders in the form of dividends, and invest capital to grow cash flows and dividends per share. The company faces competition from other infrastructure providers and is subject to federal, state, and local regulations. The demand for their infrastructure is driven by the increasing demand for data and the growth of wireless networks. However, any slowdown in demand or reduction in network investment by their tenants could negatively impact the company.


Bringing it all together

In [193]:
"""
1. parse_html: read a company's .html report and extract its text

2. summarize: generate a summary of the parsed text with an LLM

Embeddings and clusters are generated in later cells.

"""

class doc_function:
    
    def __init__(self, model):
        self.model = model
    
    def parse_html(self, cik):

        print('cik', cik)
        
        with open(os.getcwd() + '/data/{}.html'.format(cik), 'r') as file:
            content = file.read()
        
            # parse html
            soup = BeautifulSoup(content, 'html.parser')
        
            all_texts = soup.get_text()

            # skip initial pages - table of contents
            texts = all_texts[23000:]

            return texts

    def summarize(self, parsed_text):
        
        try:
        
            llm = ChatOpenAI(
                             model=self.model, 
                             request_timeout=120
                            )

            # pass a snippet to produce the summary; the slice below caps the input at 16,384 characters (not tokens) to stay within the model's context window
            llm_summary = llm.invoke(f'Summarize: "{parsed_text[0:16384]}"')

            print('Summary length:', len(llm_summary.content))

            return llm_summary.content
        
        except Exception as e:
            print('Exception', e)
            return np.nan

In [194]:
# Initialize the class
model = 'gpt-3.5-turbo-16k'
doc_func = doc_function(model)

In [195]:
# Get parsed html text
df_cik_co['html_text'] = df_cik_co.apply(lambda x: doc_func.parse_html(x['cik']), axis=1)
df_cik_co.sample(5)

In [129]:
# save parsed html texts
#df_cik_co.to_csv(os.getcwd() + '/data/df_cik_html_parsed.csv')

In [200]:
# Generate summaries
df_cik_co['summary'] = df_cik_co.apply(lambda x: doc_func.summarize(x['html_text']), axis=1)
df_cik_co.sample(5)

In [201]:
# save summaries
#df_cik_co.to_csv(os.getcwd() + '/data/df_cik_summary.csv')

#### Clustering using FAISS: https://github.com/facebookresearch/faiss

Steps involved: 

1. Vectors / embeddings 
2. Create a FAISS index for storage and search
3. Implement k-means or other clustering algorithms
4. Assign clusters
5. Predict cluster of unseen text
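
Step 5 is not implemented in the cells that follow. A minimal sketch of it, assuming the cluster centroids are available as a NumPy array (with a trained `faiss.Kmeans` object they would presumably come from its `centroids` attribute), is to assign each new embedding to its nearest centroid by squared L2 distance. The function name `assign_to_nearest_centroid` and the toy vectors are hypothetical.

```python
import numpy as np


def assign_to_nearest_centroid(embeddings, centroids):
    """Return, for each embedding, the index of the closest centroid (L2)."""
    embeddings = np.atleast_2d(np.asarray(embeddings, dtype=np.float32))
    centroids = np.asarray(centroids, dtype=np.float32)
    # Pairwise differences via broadcasting: shape (n_points, n_centroids, dim)
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)


# Toy 2-dimensional example with two centroids and two unseen points
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
new_points = np.array([[1.0, -1.0], [9.0, 8.5]])
print(assign_to_nearest_centroid(new_points, centroids))  # [0 1]
```

For an unseen report, the embedding would need to come from the same embedding model used to train the clusters, otherwise the distances are not comparable.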

In [217]:
# Load a pre-trained sentence-transformers model (MiniLM) for text embeddings
bert_model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Embed the parsed report text; inputs longer than the model's maximum sequence length are truncated
text_data = df_cik_co['html_text'].tolist()
embeddings = bert_model.encode(text_data, show_progress_bar=True)

Batches:   0%|          | 0/16 [00:00<?, ?it/s]

In [218]:
# Create a FAISS index
d = embeddings.shape[1]  # Dimension of the embeddings
index = faiss.IndexFlatL2(d) 
index.add(embeddings)

In [222]:
index.ntotal

497

In [228]:
# clustering using k-means
nclusters = 10
kmeans = faiss.Kmeans(d, nclusters)
kmeans.train(embeddings)
cluster_assign = kmeans.assign(embeddings)

In [232]:
# Add cluster assignments to the DataFrame (kmeans.assign returns distances and labels; index 1 holds the labels)
df_cik_co["cluster"] = cluster_assign[1]

In [251]:
# save final dataframe with clusters
df_cik_co.to_csv(os.getcwd() + "/App/df_cik_final.csv", encoding="utf-8")
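
The saved file is what an app or notebook would read to show similar companies. As a rough sketch of that lookup, assuming only the `cik`, `company_name` and `cluster` columns are needed, the following uses the standard `csv` module. The function name `companies_in_same_cluster` and the sample rows are hypothetical stand-ins for the real file.

```python
import csv
import io


def companies_in_same_cluster(rows, cik):
    """Return names of the other companies that share a cluster with `cik`."""
    target = next((r for r in rows if r["cik"] == cik), None)
    if target is None:
        return []  # unknown CIK: nothing to show
    return [
        r["company_name"]
        for r in rows
        if r["cluster"] == target["cluster"] and r["cik"] != cik
    ]


# Hypothetical rows standing in for df_cik_final.csv
sample_csv = io.StringIO(
    "cik,company_name,cluster\n"
    "0000000001,Company A,3\n"
    "0000000002,Company B,1\n"
    "0000000003,Company C,3\n"
)
rows = list(csv.DictReader(sample_csv))
print(companies_in_same_cluster(rows, "0000000001"))  # ['Company C']
```

In a Streamlit app or notebook widget, the selected company's CIK would be passed in place of the hard-coded value.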

### Follow-up questions 

- Describe which task you found most difficult in the implementation, and why.

> Parsing text from the `.html` files and then generating summaries of the document text was the most challenging part. The reports vary in layout and structure and contain disclaimers and tables of contents, so it was tricky to extract clean text containing only the information we are interested in.

> Once parsed, the document text is fairly lengthy and exceeds the context window that LLMs allow for summarization. We cannot pass the complete text in a single call, so we need to experiment with different chunking strategies and with extractive and abstractive summarization approaches.
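
One way to reduce the front-matter problem, instead of skipping a fixed number of characters, is to search for the start of the business section. The sketch below assumes that the heading "Item 1. Business" appears once in the table of contents and again where the section begins, which will not hold for every filing; it falls back to a fixed offset when no heading is found. The names `ITEM1_PATTERN` and `skip_front_matter` are hypothetical.

```python
import re

# Assumed heading format; real filings vary in spacing and punctuation
ITEM1_PATTERN = re.compile(r"item\s*1\.?\s*business", re.IGNORECASE)


def skip_front_matter(text, fallback_offset=23000):
    """Return text starting at the business section, or after a fixed offset."""
    matches = list(ITEM1_PATTERN.finditer(text))
    if len(matches) >= 2:
        # First hit is assumed to be the table of contents entry
        return text[matches[1].start():]
    if matches:
        return text[matches[0].start():]
    return text[fallback_offset:]


sample = (
    "Table of Contents Item 1. Business 3 Item 1A. Risk Factors 10 "
    "Item 1. Business\nWe make widgets."
)
print(skip_front_matter(sample))
```

The fallback keeps the behaviour of the fixed-offset approach for reports where the heading cannot be located.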

- What led you to choose the libraries or frameworks you used in your implementation?

> I chose `langchain`'s `ChatOpenAI` module for its simplicity in working with OpenAI models, `AWS SageMaker` for compute and processing power, `gpt-3.5-turbo-16k` for its larger context window, and `FAISS` for vector storage and search as well as its built-in support for clustering.

- How did you evaluate whether the clusters and summaries created by your system were good or not?

> In the absence of ground-truth data, one way to evaluate the summaries is to randomly sample companies from the list and compare the system-generated summaries against human-written ones.

> For clustering, I compared the companies within each cluster by industry / sector, financial performance, regulation and so on. For example, cluster #3 includes financial companies such as Morgan Stanley, BlackRock, PayPal, Bank of America, Goldman Sachs and JPMorgan Chase.
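
The sector comparison described above could be made quantitative with a purity score: for each cluster, count the companies in its most common sector, then divide the total by the number of companies. This is a sketch only, and it assumes sector labels are available from an external source; the function name `cluster_purity` and the labels below are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, sector_labels):
    """Fraction of items that belong to the majority sector of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, sector in zip(cluster_ids, sector_labels):
        by_cluster[cluster].append(sector)
    # Size of the largest sector group within each cluster
    majority_total = sum(
        Counter(sectors).most_common(1)[0][1] for sectors in by_cluster.values()
    )
    return majority_total / len(cluster_ids)


# Hypothetical assignments: cluster 3 mixes two sectors, cluster 1 is uniform
clusters = [3, 3, 3, 1, 1]
sectors = ["Financials", "Financials", "Technology", "Technology", "Technology"]
print(cluster_purity(clusters, sectors))  # 0.8
```

A purity of 1.0 means every cluster contains a single sector. The score tends to rise as the number of clusters grows, so it is most useful for comparing runs with the same number of clusters.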

- If there were no time or budget restrictions for this exercise, what improvements would you make in the following areas:

    > Implementation
    
       I'd spend more time dealing with unstructured data and extracting the relevant text from the 10-Ks. I'd experiment with different chunking strategies and summarization methods depending on the use case, for example chunking by report section or by paragraph, generating a summary of each chunk and then combining them into a more cohesive summary. I'd also try a few different models with larger context windows.
        
        I'd try a domain-specific model, or fine-tune one on financial data, and use it for generating summaries. For example, https://huggingface.co/ProsusAI/finbert.   
    
    > User Experience
    
       I'd work on generating more meaningful clusters. The improvements here would come from extracting more relevant text, generating better embeddings, and trying out different clustering / similarity algorithms.
    
    > Data Quality
    
       I'd improve how we process unstructured text: using Beautiful Soup to parse the text, identifying and cleaning up irrelevant passages, and piping in factual data such as financials, sector and market cap, either as engineered features for classical ML or as additional input to the embeddings for LLMs.
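
The chunk-then-combine idea from the implementation notes above can be sketched as follows. This is an illustration under stated assumptions, not the method used in this notebook: chunks are cut by character count with a small overlap, and `summarize` is any callable that maps text to a summary (for example, a wrapper around an LLM call). The names `chunk_text` and `map_reduce_summary` and the default sizes are hypothetical.

```python
def chunk_text(text, chunk_size=12000, overlap=500):
    """Split text into overlapping character chunks that cover the whole text."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    # Stop once the remaining tail is already covered by the previous chunk
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


def map_reduce_summary(text, summarize, chunk_size=12000, overlap=500):
    """Summarize each chunk, then summarize the concatenated partial summaries."""
    chunks = chunk_text(text, chunk_size, overlap)
    if len(chunks) <= 1:
        return summarize(text)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))


# Stand-in summarizer for demonstration: keeps the first 20 characters
demo = map_reduce_summary("lorem ipsum " * 50, lambda t: t[:20], chunk_size=200, overlap=20)
print(demo)
```

Chunking by report section or paragraph, as suggested above, would replace `chunk_text` while leaving the combine step unchanged.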

- If you built this using classic ML models, how would you approach it if you had to build it with LLMs? Similarly, if you used LLMs, what are some things you would try if you had to build it with classic ML models?

> For a classical ML model, I'd attempt to turn the unstructured text into structured data by extracting relevant data points for each company for feature engineering. Data points could include financial information such as revenue, profits, expenses, operational costs, assets, liabilities, employees, market / sector, competition and so on. I'd then train an unsupervised clustering algorithm on those features.

- If you had to build this as part of a production system, providing an inference API for previously unseen reports, how would you do it? What would you change in your implementation?

> For a production system, I'd implement some of the improvements around data quality, UX and evaluation outlined above. Once the results are satisfactory, I'd use `AWS SageMaker` to deploy the model and create an inference endpoint. I'd weigh the different inference options (batch, real-time, async) against cost, latency, throughput and other metrics relevant to running a production system. I'd also configure logging and uptime / downtime monitoring to track usage.