# Lab Instructions

In the lab, you're presented with a task, such as building a dataset, training a model, or writing a training loop. We provide the code structured so that you can fill in the blanks using the knowledge you acquired in the chapters that precede the lab. You should be able to find appropriate snippets of code in the course content that work well in the lab with minor or no adjustments.

The blanks in the code are indicated by an ellipsis (`...`) and a comment (`# write your code here`).

In some cases, we'll provide partial code to ensure the right variables are populated and the code that follows runs as expected:

```python
# write your code here
x = ...
```

The solution should be a single statement that replaces the ellipsis, such as:

```python
# write your code here
x = [0, 1, 2]
```

In other cases, when no new variable is being created, the blanks are shown as in the example below:

```python
# write your code here
...
```

Although we're showing you only a single ellipsis (`...`), you may have to write more than one line of code to complete the step, such as:

```python
# write your code here
for i, xi in enumerate(x):
    x[i] = xi * 2
```

## Installation Notes

To run this notebook on Google Colab, you will need to install the following libraries: transformers, sentence-transformers, and sec-edgar-downloader.

In Google Colab, you can run the following command to install these libraries:

```
!pip install transformers sentence-transformers sec-edgar-downloader
```

## 17.4 Lab 7: Document Q&A

In this lab, we'll put together several tools we've already used to extract information from a set of documents, a task also known as "Document Q&A". We'll retrieve the latest 10-K forms filed by the top S&P 500 companies and use natural language to search for information about their reported risks.

Here are the tickers for the top 25 companies, as of June 2023. Just run the code below as is:

```python
tickers = ['AAPL', 'MSFT', 'AMZN', 'NVDA', 'GOOGL', 'GOOG', 'META', 'TSLA', 'UNH', 'XOM', 'JPM',
           'JNJ', 'V', 'LLY', 'PG', 'AVGO', 'MA', 'HD', 'MRK', 'CVX', 'PEP', 'ABBV', 'KO', 'COST']
```

### 17.4.1 EDGAR

EDGAR is the Securities and Exchange Commission's (SEC) Electronic Data Gathering, Analysis, and Retrieval system.

"_[it] performs automated collection, validation, indexing, acceptance, and forwarding of submissions by companies and others who are required by law to file forms with the U.S. Securities and Exchange Commission (SEC). Its primary purpose is to increase the efficiency and fairness of the securities market for the benefit of investors, corporations, and the economy by accelerating the receipt, acceptance, dissemination, and analysis of time-sensitive corporate information filed with the agency._"

Source: [Important Information About EDGAR](https://www.sec.gov/edgar/searchedgar/aboutedgar.htm)

### 17.4.2 Form 10-K

In this lab, we'll be retrieving the latest 10-K form filed by the companies previously listed.

"_A Form 10-K is an annual report required by the U.S. Securities and Exchange Commission (SEC), that gives a comprehensive summary of a company's financial performance. Although similarly named, the annual report on Form 10-K is distinct from the often glossy "annual report to shareholders," which a company must send to its shareholders when it holds an annual meeting to elect directors (though some companies combine the annual report and the 10-K into one document). The 10-K includes information such as company history, organizational structure, executive compensation, equity, subsidiaries, and audited financial statements, among other information._"

Source: [Wikipedia](https://en.wikipedia.org/wiki/Form_10-K)

We'll be paying special attention to the section ["Item 1A - Risk Factors"](https://en.wikipedia.org/wiki/Form_10-K#Item_1A_%E2%80%93_Risk_Factors), where "_...the company lays anything that could go wrong, likely external effects, possible future failures to meet obligations, and other risks disclosed to adequately warn investors and potential investors._"

### 17.4.3 Downloader

While it's possible to retrieve public information directly from EDGAR, you'd have to find the proper identification numbers of companies and filings to download the reports. It is more convenient to use a Python package that handles the nitty-gritty details for us and retrieves as many reports as we want by simply specifying the company's ticker (e.g. MSFT, GOOGL) and the type of report (e.g. 10-K). The package [`sec-edgar-downloader`](https://github.com/jadchaar/sec-edgar-downloader) does exactly that.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step1.png)

We can easily download the forms by creating an instance of a `Downloader` that points to the destination folder where files will be stored, and calling its `get()` method repeatedly, once for each ticker. Just run the code below as is to get the latest version of all reports.

The first time you run the code in this lab, though, we recommend you download the dataset we prepared for you, so you can more easily follow along and double-check your answers later on.

```python
from sec_edgar_downloader import Downloader

dest_folder = "./edgar10k_sp500_top25"
dl = Downloader("MyCompanyName", "my.email@domain.com", dest_folder)

form = '10-K'
# Download the latest filing of the given form type for each ticker
for ticker in tickers:
    dl.get(form, ticker, limit=1, download_details=True)
```

Alternatively, in order to get the same results as shown in this lab, you can download the compressed folder containing all forms (as of June 2023) from the following link:

```
https://github.com/dvgodoy/assets/releases/download/dataset/edgar10k_sp500_top25.tar.gz
```

You should uncompress the file, and rename the `filings` folder to `edgar10k_sp500_top25`.

If you're using Google Colab, you may run the following commands to accomplish that:

```
!wget https://github.com/dvgodoy/assets/releases/download/dataset/edgar10k_sp500_top25.tar.gz
!tar -xvzf edgar10k_sp500_top25.tar.gz
!mv filings edgar10k_sp500_top25
```

The downloader creates a subfolder for each ticker, each containing a folder for the downloaded form type (10-K), which in turn contains a folder named after the filing's ID number.

In the compressed dataset above, the inner folder for Microsoft (MSFT) is named `0001564590-22-026876` and it has two files: `filing-details.html` and `full-submission.txt`. If you're using Google Colab, you can run the command below to list the files inside that folder:

```
!ls -l edgar10k_sp500_top25/sec-edgar-filings/MSFT/10-K/0001564590-22-026876
```

We'll be using the details file.
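Given the folder layout described above, a short sketch with `pathlib` can locate the details file of every ticker. The helper name `find_details_files` is hypothetical (it is not part of `sec-edgar-downloader`), and it assumes the `sec-edgar-filings/<ticker>/<form>/<filing ID>/` layout shown in the listing above:

```python
from pathlib import Path

def find_details_files(root, form="10-K"):
    """Map each ticker to the filing-details.html files found under root."""
    found = {}
    # Assumed layout: <root>/sec-edgar-filings/<ticker>/<form>/<filing ID>/filing-details.html
    pattern = f"sec-edgar-filings/*/{form}/*/filing-details.html"
    for path in sorted(Path(root).glob(pattern)):
        ticker = path.parts[-4]  # the folder three levels above the file
        found.setdefault(ticker, []).append(path)
    return found

root = Path("./edgar10k_sp500_top25")
if root.is_dir():  # only list files if the dataset has been downloaded
    for ticker, paths in find_details_files(root).items():
        print(ticker, paths[0])
```

This is handy if you later want to repeat the steps below for tickers other than MSFT without typing each filing ID by hand.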

### 17.4.4 Parser

The details file is a mix of HTML and XML tags, and it would be very cumbersome to parse them ourselves. Fortunately, we can easily adapt a parser function, [`parse_10k_filing()`](https://github.com/rsljr/edgarParser/blob/master/parse_10K.py) from the [edgarParser](https://github.com/rsljr/edgarParser) repository, to parse our downloaded files.

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step2.png)

Its original docstring states:

"_The function *parse_10k_filing()* parses 10-K forms to extract the following sections: business description, business risk, and management discussioin and analysis. The function takes two arguments, a link and a number indicating the section, and returns a list with the requested sections. Current options are **0(All), 1(Business), 2(Risk), 4(MDA).**_"

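We'll be using option number two to retrieve only the text related to section "Item 1A - Risk Factors". Note that, in the adapted version below, the management discussion and analysis section is option 3 rather than 4. Just run the code below as is to define the function we'll be using to parse the forms.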

```python
# Adapted from https://github.com/rsljr/edgarParser/blob/master/parse_10K.py
import re
import sys
import unicodedata
from bs4 import BeautifulSoup as bs
import requests

def parse_10k_filing(content, section):
    # section: 0 (all), 1 (business), 2 (risk), 3 (MD&A)
    if section not in [0, 1, 2, 3]:
        print("Not a valid section")
        sys.exit()

    def get_text(content):
        # Strip the markup, normalize to ASCII, and flatten line breaks into spaces
        html = bs(content, "html.parser")
        text = html.get_text()
        text = unicodedata.normalize("NFKD", text).encode('ascii', 'ignore').decode('utf8')
        text = text.split("\n")
        text = " ".join(text)
        return(text)

    def extract_text(text, item_start, item_end):
        item_start = item_start
        item_end = item_end
        starts = [i.start() for i in item_start.finditer(text)]
        ends = [i.start() for i in item_end.finditer(text)]
        # Pair each start with the first end that follows it
        positions = list()
        for s in starts:
            control = 0
            for e in ends:
                if control == 0:
                    if s < e:
                        control = 1
                        positions.append([s,e])
        # Keep the longest span (shorter ones are usually table-of-contents entries)
        item_length = 0
        item_position = list()
        for p in positions:
            if (p[1]-p[0]) > item_length:
                item_length = p[1]-p[0]
                item_position = p

        item_text = text[item_position[0]:item_position[1]]

        return(item_text)

    text = get_text(content)

    if section == 1 or section == 0:
        try:
            item1_start = re.compile(r"item\s*[1][\.\;\:\-\_]*\s*\b", re.IGNORECASE)
            item1_end = re.compile(r"item\s*1a[\.\;\:\-\_]\s*Risk|item\s*2[\.\,\;\:\-\_]\s*Prop", re.IGNORECASE)
            businessText = extract_text(text, item1_start, item1_end)
        except:
            businessText = "Something went wrong!"

    if section == 2 or section == 0:
        try:
            item1a_start = re.compile(r"(?<!,\s)item\s*1a[\.\;\:\-\_]\s*Risk", re.IGNORECASE)
            item1a_end = re.compile(r"item\s*2[\.\;\:\-\_]\s*Prop|item\s*[1][\.\;\:\-\_]*\s*\b", re.IGNORECASE)
            riskText = extract_text(text, item1a_start, item1a_end)
        except:
            riskText = "Something went wrong!"

    if section == 3 or section == 0:
        try:
            item7_start = re.compile(r"item\s*[7][\.\;\:\-\_]*\s*\bM", re.IGNORECASE)
            item7_end = re.compile(r"item\s*7a[\.\;\:\-\_]\sQuanti|item\s*8[\.\,\;\:\-\_]\s*", re.IGNORECASE)
            mdaText = extract_text(text, item7_start, item7_end)
        except:
            mdaText = "Something went wrong!"

    if section == 0:
        data = [businessText, riskText, mdaText]
    elif section == 1:
        data = [businessText]
    elif section == 2:
        data = [riskText]
    elif section == 3:
        data = [mdaText]
    return(data)
```

Let's parse the latest 10-K form filed by Microsoft (as of June 2023). Just run the code below as is:

```python
with open('./edgar10k_sp500_top25/sec-edgar-filings/MSFT/10-K/0001564590-22-026876/filing-details.html', 'r',
          encoding='utf-8') as f:
    html = f.read()

res = parse_10k_filing(html, 2)[0]
len(res)
```

That's about 70,000 characters. We need to split it into more manageable chunks.

In Chapter 14, we briefly discussed chunking strategies. Sometimes, as in the case of our 10-K form filed by Microsoft, there's some other indication of the text's structure: it looks like paragraphs are separated by a sequence of two or more spaces.

Let's try it out. Just run the code below as is to visualize the first three chunks:

```python
docs = res.split('  ')
docs[:3]
```

Looks good, these are definitely paragraphs. Unfortunately, this may not be the case for every document: in some 10-K forms, there's no clear indication of a paragraph, and you'll need to rely on a different chunking strategy to move forward. For now, we're sticking with this particular 10-K form, and we'll proceed using paragraphs as chunks.
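The split above matches exactly two spaces, so a run of three spaces leaves a stray leading space on the next chunk. As a minimal sketch of a variation, assuming paragraphs are separated by runs of two or more whitespace characters, a regular expression can consume the whole run at once. The helper name `split_on_whitespace_runs` and the sample string are made up for illustration:

```python
import re

def split_on_whitespace_runs(text):
    """Split text wherever two or more whitespace characters appear in a row."""
    chunks = re.split(r"\s{2,}", text)
    return [c for c in chunks if c]  # drop empty strings at the edges

sample = "First paragraph.   Second paragraph.  Third paragraph."

naive = sample.split("  ")                 # exact two-space separator
chunks = split_on_whitespace_runs(sample)  # any run of 2+ whitespace characters

print(naive)
print(chunks)
```

Either approach works for the Microsoft filing once the chunks are stripped; the regular expression simply makes the "two or more" rule explicit.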

If we look at the full list, though, we'll see that there are many empty lines as well as really short ones that are likely section headers. We can discard these chunks that are too short (say, less than 10 characters long). Just run the code below as is to split the whole text into paragraphs:

```python
paragraphs = list(map(lambda s: s.strip(), filter(lambda s: len(s) > 10, res.split('  '))))
len(paragraphs)
```

We got 88 paragraphs. Let's take a look at one of them. Just run the code below as is to visualize the paragraph at index 1:

```python
text = paragraphs[1]
text
```

How many words is that? Just run the code below as is to find out:

```python
len(text.split())
```

Perhaps we can make it shorter?

### 17.4.6 Summarization

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Load a pretrained summarization pipeline from HuggingFace and use it to summarize the text above. Try different minimum and maximum lengths and observe the resulting summaries. How do they compare to the original text?

```python
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# write your code here
summarizer = ...
```

Once you've created your summarizer, just run the code below as is to summarize the paragraph selected above:

```python
summarizer(text, max_length=50, min_length=20)
```

Summarizing text is great, but we may be doing it prematurely at this point. Instead of summarizing individual paragraphs (or other chunks of text), it may be more interesting to find (full) paragraphs of interest first, and only then summarize them as a whole.

If we're doing document Q&A, we need to query our documents (paragraphs) and find those that are more likely to contain the answer, that is, those more closely related to the topic of our query.

How can we search for similar documents?

### 17.4.7 Embeddings

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/data_step3.png)

You already know how to search for similar documents. You need to embed them first!

Use the `sentence-transformers` package to load a pretrained model for sentence embeddings (e.g. `all-MiniLM-L12-v2`) and embed every paragraph of text from Microsoft's 10-K form.

```python
from sentence_transformers import SentenceTransformer

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# write your code here
model = ...
```

```python
# write your code here
embeddings = ...
embeddings.shape
```

### 17.4.8 Searching

There are two alternatives to search for similar embeddings, and we have tried them both already: PyTorch's own cosine similarity, and vector databases such as ChromaDB.

At this point, let's keep it as simple as possible and stick with cosine similarity. Create an instance of the cosine similarity layer and use it to find the five paragraphs that are most similar to the query below (don't forget to embed the query as well):

```python
import torch.nn as nn

# write your code here
similarity = ...
```

```python
# Embed the query and make it a tensor

query = "what are the sources of uncertainties?"
# write your code here
q = ...
content = torch.as_tensor(embeddings)
```

```python
# Compute the cosine similarity between query and content
# and get the top 5 results
# write your code here
similarities = ...
most = ...
```

You should get a list of five indices corresponding to the paragraphs that are most relevant to our query.
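To make the ranking step concrete without giving away the exercise, here is a plain-Python sketch of what cosine similarity and top-k selection compute. The three-dimensional vectors are made up (real sentence embeddings have far more dimensions), and the helper names `cosine_similarity` and `top_k` are hypothetical; in the lab you'd use PyTorch's layer on the actual embeddings instead:

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, vectors, k):
    """Return the indices of the k vectors most similar to the query, plus all scores."""
    scores = [cosine_similarity(query, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Made-up "embeddings" standing in for four paragraphs
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.2, 0.0]

best, scores = top_k(toy_query, toy_embeddings, k=2)
print(best)
```

Notice that the result is ordered from most to least similar, not by position in the list.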

### 17.4.9 Context

Now, join the five paragraphs together into a single piece of text. This is what is referred to as the "context". Notice that the indices may be ordered according to their similarity to the query. However, it's probably a good idea to order them as they appear in the text instead.

```python
# write your code here
context = ...
print(context)
```

The context should contain the relevant information to answer our query, and it is one of the arguments you need to pass to a question answering pipeline.
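As a small illustration of the ordering idea, here is a sketch that uses made-up paragraphs and hypothetical indices rather than the lab's actual results. Sorting the indices restores document order before the paragraphs are joined:

```python
# Made-up stand-ins for the parsed paragraphs
toy_paragraphs = ["Paragraph zero.", "Paragraph one.", "Paragraph two.", "Paragraph three."]

# Hypothetical search result: indices ordered from most to least similar
ranked_by_similarity = [2, 0, 3]

# Sort the indices so the paragraphs appear in their original order
in_document_order = sorted(ranked_by_similarity)
toy_context = " ".join(toy_paragraphs[i] for i in in_document_order)
print(toy_context)
```

Keeping the original order tends to preserve whatever flow the document had, which matters when the joined text is later summarized.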

### 17.4.10 Question Answering

We have a question, "_what are the sources of uncertainties?_" and we have a context, five paragraphs from our text that are the most similar to the question. That's everything you need to try a question answering pipeline!

![](https://raw.githubusercontent.com/dvgodoy/assets/main/PyTorchInPractice/images/ch0/model_step5.png)

Create an instance of a Q&A pipeline, and call it using its `question` and `context` arguments:

```python
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1

# write your code here
qa_model = ...
```

Once you've created your Q&A pipeline, just run the code below as is to answer the query:

```python
query = "what are the sources of uncertainties?"

qa_model(question=query, context=context)
```

The Q&A model is good at answering questions that are extractive in nature and can be easily pinpointed in the text. It gives you back the start and end positions in the text that contain the answer to your question.
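To see how the start and end positions relate to the context, here is a sketch built around a hand-made result. The dictionary mimics the keys returned by the question answering pipeline (`score`, `start`, `end`, `answer`), but its values, including the score, are invented for illustration and do not come from a model:

```python
toy_context = "Our revenue depends on cloud adoption. Exchange rates may change."
answer_text = "cloud adoption"

# Hand-made result with the same keys as the pipeline's output (values are made up)
start = toy_context.index(answer_text)
toy_result = {
    "score": 0.42,
    "start": start,
    "end": start + len(answer_text),
    "answer": answer_text,
}

# Slicing the context with the returned positions recovers the answer
span = toy_context[toy_result["start"]:toy_result["end"]]
print(span)
```

Since the answer is always a slice of the context, an extractive model can't rephrase or combine information from different paragraphs.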

The answer may be technically correct, but perhaps it's a bit too short, right?

In theory, the context should contain the information relevant to our query. However, it is too verbose and doesn't read well: after all, it is just a sequence of paragraphs patched together. One way to make it look more like an answer is to summarize it.

Use the summarization pipeline you already created to summarize the context above. Make sure the minimum and maximum length are appropriate given the original length of the context.

```python
# write your code here
summary = ...
summary
```

How do you like the summary? Does it look like an answer to our question? Could it have been better? Pause and ponder for a while: what could you do to get a better answer or, better yet, a better context?