# Data

Now let's take a closer look at the data we'll be using for this tutorial.

In [2]:
!pipenv install bs4
!pipenv install tqdm
!pipenv install langchain


## Explore Text with Beautiful Soup

In [3]:
from bs4 import BeautifulSoup

# example HTML file
file_path="data/html/1.103.html"

# read in HTML file
with open(file_path) as fp:
    soup = BeautifulSoup(fp, "html.parser")

In [4]:
print(soup.prettify())

<!DOCTYPE html
  SYSTEM "about:legacy-compat">
<html lang="en">
 <head>
  <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
  <meta charset="utf-8"/>
  <meta content="(C) Copyright 2023" name="copyright"/>
  <meta content="(C) Copyright 2023" name="DC.rights.owner"/>
  <meta content="concept" name="DC.Type"/>
  <meta content="Subpart_1.1.html" name="DC.Relation" scheme="URI"/>
  <meta content="XHTML" name="DC.Format"/>
  <meta content="FAR_1_103" name="DC.Identifier"/>
  <link href="commonltr.css" rel="stylesheet" type="text/css"/>
  <link href="gsa-base.css" rel="stylesheet" type="text/css"/>
  <title>
   1.103 Authority.
  </title>
 </head>
 <body>
  <main role="main">
   <article aria-labelledby="ariaid-title1" role="article">
    <article aria-labelledby="ariaid-title1" class="nested0" id="FAR_1_103">
     <h1 class="title topictitle1" id="ariaid-title1">
      <span class="ph autonumber">
       1.103
      </span>
      Authority.
     </h1>
     <div class="bo

In [5]:
# inspect title
soup.title.string

'1.103 Authority.'

In [6]:
# inspect parent reference
soup.nav.a.string

'Subpart 1.1 - Purpose, Authority, Issuance'

In [8]:
# inspect content
print(soup.p.get_text())


(a)
                  
						            The development of the FAR System is in accordance with the requirements of 41 U.S.C. chapter 13, Acquisition Councils.


In [37]:
# use regex to extract FAR citation
import re

file_citation = re.search(r"\d{1,2}[.]\d+(\-\d)*", file_path)  # raw string avoids invalid-escape warnings
print(file_citation.group())

1.103


In [10]:
# parse only relevant text 

from bs4 import SoupStrainer

# load HTML file
with open(file_path,'r') as file:
    html_content = file.read()

only_relevant_text = SoupStrainer("p")
print(BeautifulSoup(html_content, "html.parser", parse_only=only_relevant_text).prettify())

<p class="ListL1" id="FAR_1_103__d16e10">
 <span class="ph autonumber">
  (a)
 </span>
 The development of the FAR System is in accordance with the requirements of
 <a class="xref" href="http://uscode.house.gov/browse.xhtml;jsessionid=114A3287C7B3359E597506A31FC855B3" target="_blank">
  41 U.S.C. chapter 13
 </a>
 , Acquisition Councils.
</p>
<p class="ListL1" id="FAR_1_103__d16e18">
 <span class="ph autonumber">
  (b)
 </span>
 The FAR is prepared, issued, and maintained, and the FAR System is prescribed jointly by the Secretary of Defense, the Administrator of General Services, and the Administrator, National Aeronautics and Space Administration, under their several statutory authorities.
</p>



## Load single document with Beautiful Soup

In [13]:
from langchain.document_loaders import BSHTMLLoader
from bs4 import SoupStrainer

# example HTML file
file_path="data/html/1.103.html"

In [14]:
# define Beautiful Soup key word args
bs_kwargs = {
    "features": "html.parser", 
    "parse_only": SoupStrainer("p") # only include relevant text
}

loader = BSHTMLLoader(file_path, open_encoding="utf-8", bs_kwargs=bs_kwargs)
data = loader.load()
data

[Document(page_content='\n(a)\n                  \n\t\t\t\t\t\t            The development of the FAR System is in accordance with the requirements of 41 U.S.C. chapter 13, Acquisition Councils.\n(b)\n                  \n\t\t\t\t\t\t            The FAR is prepared, issued, and maintained, and the FAR System is prescribed jointly by the Secretary of Defense, the Administrator of General Services, and the Administrator, National Aeronautics and Space Administration, under their several statutory authorities.', metadata={'source': 'data/html/1.103.html', 'title': ''})]

I don't think it makes sense to split the text into chunks smaller than the section it appears in, since that would make it harder to return a meaningful citation. I'm also not sure how to split on HTML elements below the header level, which would be the only way to split further because each HTML file contains a single header.
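For reference, here is one possible way to get below the header level. It is only a sketch: it uses the standard library's `html.parser` instead of a LangChain splitter, the `ParagraphCollector` class is a hypothetical helper, and the sample HTML is a trimmed copy of the two `<p>` elements shown earlier. Each paragraph keeps its `id` attribute, which could be carried into the metadata so a citation is still recoverable.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Hypothetical helper: collect the text of each <p> with its id attribute."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []      # list of (id, text) tuples
        self._current_id = None
        self._buffer = None       # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._current_id = dict(attrs).get("id")
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # collapse newlines, tabs and repeated spaces into single spaces
            text = " ".join("".join(self._buffer).split())
            self.paragraphs.append((self._current_id, text))
            self._buffer = None


# trimmed sample modeled on data/html/1.103.html (second paragraph shortened)
sample_html = """
<p class="ListL1" id="FAR_1_103__d16e10">
 <span class="ph autonumber">(a)</span>
 The development of the FAR System is in accordance with the requirements of
 41 U.S.C. chapter 13, Acquisition Councils.
</p>
<p class="ListL1" id="FAR_1_103__d16e18">
 <span class="ph autonumber">(b)</span>
 The FAR is prepared, issued, and maintained.
</p>
"""

collector = ParagraphCollector()
collector.feed(sample_html)
for paragraph_id, text in collector.paragraphs:
    print(paragraph_id, "->", text)
```

Whether paragraph-level chunks are worth it is still an open question, since a lone "(b)" paragraph often only makes sense next to its section title.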

In [16]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# define headers to split on
headers_to_split_on = [
    ("h1", "Header 1")
    #("span", "Span")
    #("p", "Paragraph")
]

# create HTML splitter instance
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Split the HTML text
split_documents = html_splitter.split_text(html_content)

# Print the split documents
for document in split_documents:
    print(f"Page Content: {document.page_content}")
    print(f"Metadata: {document.metadata}")
    print()

Page Content: (a) The development of the FAR System is in accordance with the requirements of 41 U.S.C. chapter 13, Acquisition Councils.  
(b) The FAR is prepared, issued, and maintained, and the FAR System is prescribed jointly by the Secretary of Defense, the Administrator of General Services, and the Administrator, National Aeronautics and Space Administration, under their several statutory authorities.  
Parent topic: Subpart 1.1 - Purpose, Authority, Issuance
Metadata: {}



We can manually modify the page content and metadata.

In [19]:
loader = BSHTMLLoader(file_path, open_encoding="utf-8", bs_kwargs=bs_kwargs)
data = loader.load()
# replace newlines with spaces (tabs and repeated spaces are left as-is)
data[0].page_content = data[0].page_content.replace("\n", " ")
print(data[0].page_content)

 (a)                    						            The development of the FAR System is in accordance with the requirements of 41 U.S.C. chapter 13, Acquisition Councils. (b)                    						            The FAR is prepared, issued, and maintained, and the FAR System is prescribed jointly by the Secretary of Defense, the Administrator of General Services, and the Administrator, National Aeronautics and Space Administration, under their several statutory authorities.
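Replacing only `"\n"` still leaves the tabs and runs of spaces visible in the output above. A minimal sketch of a fuller cleanup, assuming all whitespace can safely be collapsed to single spaces; `normalize_whitespace` is a hypothetical helper and `raw` is a shortened stand-in for `page_content`:

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    return " ".join(text.split())


# shortened stand-in for data[0].page_content
raw = (
    "\n(a)\n                  \n\t\t\t\t\t\t            "
    "The development of the FAR System.\n(b)\n                  \n"
    "\t\t\t\t\t\t            The FAR is prepared, issued, and maintained."
)

clean = normalize_whitespace(raw)
print(clean)
```

In the notebook this would be applied as `data[0].page_content = normalize_whitespace(data[0].page_content)`.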


## Load multiple documents

In [21]:
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import BSHTMLLoader

# define file path
file_path="data/html/1.103.html"

In [24]:
loader = DirectoryLoader('data/html', glob="*.html", use_multithreading=True, show_progress=True, loader_cls=BSHTMLLoader)
docs = loader.load()
len(docs)

100%|██████████| 3487/3487 [00:06<00:00, 533.78it/s]


3487

In [28]:
docs[0]

Document(page_content='46.708 Warranties of data.\n\nWarranties of data shall be developed and used in accordance with agency regulations.\n\nSubpart 46.7 - Warranties', metadata={'source': 'data/html/46.708.html'})

In [25]:
doc_sources = [doc.metadata['source']  for doc in docs]
doc_sources

['data/html/46.708.html',
 'data/html/11.106.html',
 'data/html/16.204.html',
 'data/html/9.405.html',
 'data/html/7.201.html',
 'data/html/8.405.html',
 'data/html/6.201.html',
 'data/html/1.103.html',
 'data/html/17.204.html',
 'data/html/31.205-43.html',
 'data/html/4.2104.html',
 'data/html/22.103-5.html',
 'data/html/49.305-2.html',
 'data/html/47.303-11.html',
 'data/html/42.403.html',
 'data/html/9.110.html',
 'data/html/18.115.html',
 'data/html/19.816.html',
 'data/html/47.303-7.html',
 'data/html/52.219-27.html',
 'data/html/52.242-9.html',
 'data/html/32.111.html',
 'data/html/22.1503.html',
 'data/html/52.216-14.html',
 'data/html/25.504-2.html',
 'data/html/52.214-29.html',
 'data/html/52.219-8.html',
 'data/html/22.1016.html',
 'data/html/4.1401.html',
 'data/html/32.404.html',
 'data/html/31.205-14.html',
 'data/html/53.223.html',
 'data/html/8.002.html',
 'data/html/52.204-29.html',
 'data/html/28.106-1.html',
 'data/html/29.402.html',
 'data/html/17.603.html',
 'data/h

In [33]:
doc_content = [doc.page_content  for doc in docs]
doc_content

['46.708 Warranties of data.\n\nWarranties of data shall be developed and used in accordance with agency regulations.\n\nSubpart 46.7 - Warranties',
 '16.204 Fixed-price incentive contracts.\n\nA fixed-price incentive contract is a fixed-price contract that provides for adjusting profit and establishing the final contract price by a formula based on the relationship of final negotiated total cost to total target cost. Fixed-price incentive contracts are covered in subpart\xa0 16.4, Incentive Contracts. See 16.403 for more complete descriptions, application, and limitations for these contracts. Prescribed clauses are found at 16.406.\n\nSubpart 16.2 - Fixed-Price Contracts',
 '11.106 Purchase descriptions for service contracts.\n\nIn drafting purchase descriptions for service contracts, agency requiring activities shall ensure that inherently governmental functions (see subpart\xa0 7.5) are not assigned to a contractor. These purchase descriptions shall-\n\n(a)\n                  \n\t\t

In [36]:
doc_sources = [doc.metadata["source"]  for doc in docs]
doc_sources

['data/html/46.708.html',
 'data/html/16.204.html',
 'data/html/11.106.html',
 'data/html/7.201.html',
 'data/html/8.405.html',
 'data/html/9.405.html',
 'data/html/6.201.html',
 'data/html/1.103.html',
 'data/html/17.204.html',
 'data/html/4.2104.html',
 'data/html/22.103-5.html',
 'data/html/31.205-43.html',
 'data/html/42.403.html',
 'data/html/49.305-2.html',
 'data/html/9.110.html',
 'data/html/47.303-11.html',
 'data/html/18.115.html',
 'data/html/19.816.html',
 'data/html/47.303-7.html',
 'data/html/52.219-27.html',
 'data/html/52.242-9.html',
 'data/html/25.504-2.html',
 'data/html/52.216-14.html',
 'data/html/22.1503.html',
 'data/html/32.111.html',
 'data/html/22.1016.html',
 'data/html/52.214-29.html',
 'data/html/4.1401.html',
 'data/html/52.219-8.html',
 'data/html/32.404.html',
 'data/html/31.205-14.html',
 'data/html/28.106-1.html',
 'data/html/53.223.html',
 'data/html/52.204-29.html',
 'data/html/29.402.html',
 'data/html/8.002.html',
 'data/html/17.603.html',
 'data/h
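The two `doc_sources` listings above come back in slightly different orders. That is presumably because `use_multithreading=True` does not guarantee a load order, though I have not confirmed this in the loader's source. A minimal sketch of one way to get a stable order is to sort by the numeric parts of the file name. Here `citation_key` is a hypothetical helper and `SimpleNamespace` objects stand in for the loaded `Document`s:

```python
import re
from types import SimpleNamespace


def citation_key(source):
    """Turn 'data/html/31.205-43.html' into (31, 205, 43) for numeric sorting."""
    file_name = source.rsplit("/", 1)[-1]
    return tuple(int(number) for number in re.findall(r"\d+", file_name))


# stand-ins for loaded documents, using paths from the listing above
docs = [
    SimpleNamespace(metadata={"source": path})
    for path in [
        "data/html/46.708.html",
        "data/html/11.106.html",
        "data/html/9.405.html",
        "data/html/1.103.html",
        "data/html/31.205-43.html",
    ]
]

docs_sorted = sorted(docs, key=lambda doc: citation_key(doc.metadata["source"]))
sorted_sources = [doc.metadata["source"] for doc in docs_sorted]
print(sorted_sources)
```

Sorting on integer tuples rather than on the raw string keeps 9.405 ahead of 11.106, which a plain string sort would not.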

Here's the latest attempt:

In [26]:
# define Beautiful Soup key word args
bs_kwargs = {
    "features": "html.parser", 
    "parse_only": SoupStrainer("p") # only include relevant text
}

# define Loader key word args
loader_kwargs = {
    "open_encoding": "utf-8",
    "bs_kwargs": bs_kwargs
}

loader = DirectoryLoader(
    path='data/html', 
    glob="*.html", 
    loader_cls=BSHTMLLoader,
    loader_kwargs=loader_kwargs,
    use_multithreading=True, 
    show_progress=True
    )
#loader = BSHTMLLoader(file_path, open_encoding="utf-8", bs_kwargs=bs_kwargs)
data = loader.load()
data

100%|██████████| 3487/3487 [00:04<00:00, 784.47it/s]


[Document(page_content='Warranties of data shall be developed and used in accordance with agency regulations.', metadata={'source': 'data/html/46.708.html', 'title': ''}),
 Document(page_content='\n(a) Contractors debarred, suspended,\nor proposed for debarment are excluded from receiving contracts,\nand agencies shall not solicit offers from, award contracts to,\nor consent to subcontracts with these contractors, unless the agency\nhead determines that there is a compelling reason for such action\n(see 9.405-1(a)(2), 9.405-2, 9.406-1(c), 9.407-1(d), and 23.506(e)). Contractors\ndebarred, suspended, or proposed for debarment are also excluded\nfrom conducting business with the Government as agents or representatives\nof other contractors.\n(b) Contractors\nand other entities that have an active exclusion record in SAM because\nthey have been declared ineligible on the basis of statutory or\nother regulatory procedures are excluded from receiving contracts,\nand if applicable, subcontra

# Latest Attempt

Now to bring it all together...

In [95]:
from bs4 import BeautifulSoup, SoupStrainer
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import BSHTMLLoader

# example HTML file
file_path="data/html/1.103.html"

In [61]:
# define Beautiful Soup key word args
bs_kwargs = {
    "features": "html.parser", 
    "parse_only": SoupStrainer("p") # only include relevant text
}

# define Loader key word args
loader_kwargs = {
    "open_encoding": "utf-8",
    "bs_kwargs": bs_kwargs
}

loader = DirectoryLoader(
    path='data/html', 
    glob="*.html", 
    loader_cls=BSHTMLLoader,
    loader_kwargs=loader_kwargs,
    use_multithreading=False, 
    show_progress=True
    )
#loader = BSHTMLLoader(file_path, open_encoding="utf-8", bs_kwargs=bs_kwargs)
docs = loader.load()
docs

  0%|          | 0/3487 [00:00<?, ?it/s]

100%|██████████| 3487/3487 [00:04<00:00, 719.51it/s]


[Document(page_content='Warranties of data shall be developed and used in accordance with agency regulations.', metadata={'source': 'data/html/46.708.html', 'title': ''}),
 Document(page_content='\n(a) Contractors debarred, suspended,\nor proposed for debarment are excluded from receiving contracts,\nand agencies shall not solicit offers from, award contracts to,\nor consent to subcontracts with these contractors, unless the agency\nhead determines that there is a compelling reason for such action\n(see 9.405-1(a)(2), 9.405-2, 9.406-1(c), 9.407-1(d), and 23.506(e)). Contractors\ndebarred, suspended, or proposed for debarment are also excluded\nfrom conducting business with the Government as agents or representatives\nof other contractors.\n(b) Contractors\nand other entities that have an active exclusion record in SAM because\nthey have been declared ineligible on the basis of statutory or\nother regulatory procedures are excluded from receiving contracts,\nand if applicable, subcontra

First, let's see what we have so far by inspecting the page content and metadata.

In [96]:
doc_content = [doc.page_content  for doc in docs]
doc_content

['Warranties of data shall be developed and used in accordance with agency regulations.',
 '\n(a) Contractors debarred, suspended,\nor proposed for debarment are excluded from receiving contracts,\nand agencies shall not solicit offers from, award contracts to,\nor consent to subcontracts with these contractors, unless the agency\nhead determines that there is a compelling reason for such action\n(see 9.405-1(a)(2), 9.405-2, 9.406-1(c), 9.407-1(d), and 23.506(e)). Contractors\ndebarred, suspended, or proposed for debarment are also excluded\nfrom conducting business with the Government as agents or representatives\nof other contractors.\n(b) Contractors\nand other entities that have an active exclusion record in SAM because\nthey have been declared ineligible on the basis of statutory or\nother regulatory procedures are excluded from receiving contracts,\nand if applicable, subcontracts, under the conditions and for the\nperiod set forth in the statute or regulation. Agencies shall not

In [None]:
# # TODO: clean up doc content?
# docs[0].page_content = docs[0].page_content.replace("\n", " ")
# print(docs[0].page_content)

In [97]:
# inspect the full metadata dict of each document
doc_metadata = [doc.metadata  for doc in docs]
doc_metadata

[{'source': 'FAR 46.708', 'title': '46.708 Warranties of data.'},
 {'source': 'FAR 9.405',
  'title': '11.106 Purchase descriptions for service contracts.'},
 {'source': 'FAR 11.106', 'title': '16.204 Fixed-price incentive contracts.'},
 {'source': 'FAR 16.204', 'title': '9.405 Effect of listing.'},
 {'source': 'FAR 7.201', 'title': '4.2104 Waivers.'},
 {'source': 'FAR 4.2104', 'title': '7.201 [Reserved]'},
 {'source': 'FAR 8.405',
  'title': '8.405 Ordering procedures for Federal Supply Schedules.'},
 {'source': 'FAR 17.204',
  'title': '31.205-43 Trade, business, technical and professional activity costs.'},
 {'source': 'FAR 6.201', 'title': '6.201 Policy.'},
 {'source': 'FAR 31.205-4', 'title': '17.204 Contracts.'},
 {'source': 'FAR 1.103', 'title': '49.305-2 Construction contracts.'},
 {'source': 'FAR 49.305-2',
  'title': '47.303-11 F.o.b. inland point, country of importation.'},
 {'source': 'FAR 22.103-5', 'title': '22.103-5 Contract clauses.'},
 {'source': 'FAR 47.303-1', 'title

In [98]:
# reformat source info
import re

for doc in docs:
    doc_source = re.search(r"\d{1,2}[.]\d+(\-\d)*", doc.metadata["source"]).group()  # raw string avoids invalid-escape warnings
    doc.metadata["source"] = " ".join(["FAR", doc_source])
    print(doc.metadata["source"])

FAR 46.708
FAR 9.405
FAR 11.106
FAR 16.204
FAR 7.201
FAR 4.2104
FAR 8.405
FAR 17.204
FAR 6.201
FAR 31.205-4
FAR 1.103
FAR 49.305-2
FAR 22.103-5
FAR 47.303-1
FAR 42.403
FAR 52.219-2
FAR 9.110
FAR 18.115
FAR 19.816
FAR 47.303-7
FAR 52.219-8
FAR 32.111
FAR 22.1503
FAR 52.242-9
FAR 25.504-2
FAR 52.216-1
FAR 52.214-2
FAR 22.1016
FAR 32.404
FAR 52.204-2
FAR 4.1401
FAR 28.106-1
FAR 31.205-1
FAR 8.002
FAR 53.223
FAR 17.603
FAR 29.402
FAR 22.404-9
FAR 25.1103
FAR 26.206
FAR 52.223
FAR 16.603
FAR 11.501
FAR 8.1104
FAR 52.211-1
FAR 49.402-7
FAR 44.306
FAR 3.1001
FAR 32.1005
FAR 3.907-6
FAR 51.103
FAR 50.103
FAR 4.1006
FAR 32.900
FAR 50.205-3
FAR 32.003
FAR 36.522
FAR 45.605
FAR 52.235
FAR 22.102-1
FAR 16.301-2
FAR 22.2102
FAR 53.235
FAR 36.303-1
FAR 28.103-2
FAR 19.504
FAR 10.002
FAR 29.101
FAR 6.305
FAR 9.501
FAR 28.101
FAR 7.305
FAR 11.002
FAR 32.304-7
FAR 32.500
FAR 22.1407
FAR 42.803
FAR 44.202
FAR 3.1105
FAR 35.017-1
FAR 45.202
FAR 19.800
FAR 26.302
FAR 9.106
FAR 18.103
FAR 34.005-4
FAR 52.2
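Several citations in the output above are cut short: `data/html/31.205-43.html` shows up as `FAR 31.205-4`, and `47.303-11` as `FAR 47.303-1`. The cause is the `(\-\d)*` group, which matches only a single digit after each hyphen. A minimal sketch of a pattern that keeps the whole suffix; `far_citation` is a hypothetical helper name:

```python
import re

# \d+ after the hyphen keeps multi-digit suffixes such as "-43"
CITATION_PATTERN = re.compile(r"\d{1,2}[.]\d+(?:-\d+)*")


def far_citation(source):
    """Return e.g. 'FAR 31.205-43' for 'data/html/31.205-43.html', or None if no match."""
    match = CITATION_PATTERN.search(source)
    return f"FAR {match.group()}" if match else None


for path in ["data/html/31.205-43.html", "data/html/47.303-11.html", "data/html/1.103.html"]:
    print(far_citation(path))
```

Returning `None` for a non-matching path, rather than calling `.group()` on the result directly, avoids an `AttributeError` if a file name does not follow the citation format.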

In [59]:
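# Add part info
import re 

for doc in docs:
    # the part is the number before the first dot, e.g. "46" in "FAR 46.708";
    # searching (not anchoring at the start) works with or without a "FAR " or path prefix
    doc_part = re.search(r"(\d{1,2})[.]", doc.metadata["source"]).group(1)
    doc.metadata["part"] = " ".join(["FAR Part", doc_part])
    
    print(doc.metadata["part"])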

FAR Part 46
FAR Part 9
FAR Part 11
FAR Part 16
FAR Part 7
FAR Part 4
FAR Part 8
FAR Part 17
FAR Part 6
FAR Part 31
FAR Part 1
FAR Part 49
FAR Part 22
FAR Part 47
FAR Part 42
FAR Part 52
FAR Part 9
FAR Part 18
FAR Part 19
FAR Part 47
FAR Part 52
FAR Part 32
FAR Part 22
FAR Part 52
FAR Part 25
FAR Part 52
FAR Part 52
FAR Part 22
FAR Part 32
FAR Part 52
FAR Part 4
FAR Part 28
FAR Part 31
FAR Part 8
FAR Part 53
FAR Part 17
FAR Part 29
FAR Part 22
FAR Part 25
FAR Part 26
FAR Part 52
FAR Part 16
FAR Part 11
FAR Part 8
FAR Part 52
FAR Part 49
FAR Part 44
FAR Part 3
FAR Part 32
FAR Part 3
FAR Part 51
FAR Part 50
FAR Part 4
FAR Part 32
FAR Part 50
FAR Part 32
FAR Part 36
FAR Part 45
FAR Part 52
FAR Part 22
FAR Part 16
FAR Part 22
FAR Part 53
FAR Part 36
FAR Part 28
FAR Part 19
FAR Part 10
FAR Part 29
FAR Part 6
FAR Part 9
FAR Part 28
FAR Part 7
FAR Part 11
FAR Part 32
FAR Part 32
FAR Part 22
FAR Part 42
FAR Part 44
FAR Part 3
FAR Part 35
FAR Part 45
FAR Part 19
FAR Part 26
FAR Part 9
FAR Part 1

In [99]:
# TODO: title info is missing when parsing only <p> tags, so load the titles separately

# define Beautiful Soup key word args
bs_kwargs = {
    "features": "html.parser", 
    "parse_only": SoupStrainer("title") # only parse the <title> element
}

# define Loader key word args
loader_kwargs = {
    "open_encoding": "utf-8",
    "bs_kwargs": bs_kwargs
}

loader = DirectoryLoader(
    path='data/html', 
    glob="*.html", 
    loader_cls=BSHTMLLoader,
    loader_kwargs=loader_kwargs,
    use_multithreading=True, 
    show_progress=True
    )
#loader = BSHTMLLoader(file_path, open_encoding="utf-8", bs_kwargs=bs_kwargs)
doc_titles = loader.load()

  0%|          | 0/3487 [00:00<?, ?it/s]

100%|██████████| 3487/3487 [00:03<00:00, 1050.95it/s]


In [85]:
# title_list = [{k: v for k, v in doc.metadata.items()} for doc in doc_titles]
# title_list[1]['title']

'11.106 Purchase descriptions for service contracts.'

In [100]:
# Convert the metadata for the specified label into a list
title_list = [doc.metadata["title"] for doc in doc_titles]

# Print the metadata list
print(title_list)

['46.708 Warranties of data.', '11.106 Purchase descriptions for service contracts.', '9.405 Effect of listing.', '16.204 Fixed-price incentive contracts.', '4.2104 Waivers.', '7.201 [Reserved]', '8.405 Ordering procedures for Federal Supply Schedules.', '17.204 Contracts.', '31.205-43 Trade, business, technical and professional activity costs.', '6.201 Policy.', '47.303-11 F.o.b. inland point, country of importation.', '49.305-2 Construction contracts.', '22.103-5 Contract clauses.', '1.103 Authority.', '42.403 Evaluation of contract administration offices.', '52.219-27 Notice of Service-Disabled Veteran-Owned Small Business Set-Aside.', '9.110 Reserve Officer Training Corps and military recruiting on campus.', '18.115 HUBZone sole source awards.', '47.303-7 F.o.b. destination, within consignee’s premises.', '19.816 Exiting the 8(a) program.', '32.111 Contract clauses for non-commercial purchases.', '52.219-8 Utilization of Small Business Concerns.', '22.1503 Procedures for acquiring 

In [101]:
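# pair each document with a title by position;
# this assumes docs and doc_titles were loaded in the same order
for doc, title in zip(docs, title_list):
    doc.metadata["title"] = title

    print(doc.metadata["title"])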

46.708 Warranties of data.
11.106 Purchase descriptions for service contracts.
9.405 Effect of listing.
16.204 Fixed-price incentive contracts.
4.2104 Waivers.
7.201 [Reserved]
8.405 Ordering procedures for Federal Supply Schedules.
17.204 Contracts.
31.205-43 Trade, business, technical and professional activity costs.
6.201 Policy.
47.303-11 F.o.b. inland point, country of importation.
49.305-2 Construction contracts.
22.103-5 Contract clauses.
1.103 Authority.
42.403 Evaluation of contract administration offices.
52.219-27 Notice of Service-Disabled Veteran-Owned Small Business Set-Aside.
9.110 Reserve Officer Training Corps and military recruiting on campus.
18.115 HUBZone sole source awards.
47.303-7 F.o.b. destination, within consignee’s premises.
19.816 Exiting the 8(a) program.
32.111 Contract clauses for non-commercial purchases.
52.219-8 Utilization of Small Business Concerns.
22.1503 Procedures for acquiring end products on the List of Products Requiring Contractor Certificat

In [102]:
# inspect metadata again
doc_metadata = [doc.metadata  for doc in docs]
doc_metadata

[{'source': 'FAR 46.708', 'title': '46.708 Warranties of data.'},
 {'source': 'FAR 9.405',
  'title': '11.106 Purchase descriptions for service contracts.'},
 {'source': 'FAR 11.106', 'title': '9.405 Effect of listing.'},
 {'source': 'FAR 16.204', 'title': '16.204 Fixed-price incentive contracts.'},
 {'source': 'FAR 7.201', 'title': '4.2104 Waivers.'},
 {'source': 'FAR 4.2104', 'title': '7.201 [Reserved]'},
 {'source': 'FAR 8.405',
  'title': '8.405 Ordering procedures for Federal Supply Schedules.'},
 {'source': 'FAR 17.204', 'title': '17.204 Contracts.'},
 {'source': 'FAR 6.201',
  'title': '31.205-43 Trade, business, technical and professional activity costs.'},
 {'source': 'FAR 31.205-4', 'title': '6.201 Policy.'},
 {'source': 'FAR 1.103',
  'title': '47.303-11 F.o.b. inland point, country of importation.'},
 {'source': 'FAR 49.305-2', 'title': '49.305-2 Construction contracts.'},
 {'source': 'FAR 22.103-5', 'title': '22.103-5 Contract clauses.'},
 {'source': 'FAR 47.303-1', 'title
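The metadata above still has mismatches: `FAR 9.405` is paired with the title for 11.106, for example. Assigning titles by list position only works if both loader runs return files in the same order, which does not appear to hold here. A minimal sketch of matching on the citation instead, with `SimpleNamespace` objects standing in for the loaded `Document`s and `citation_of` as a hypothetical helper:

```python
import re
from types import SimpleNamespace

CITATION_PATTERN = re.compile(r"\d{1,2}[.]\d+(?:-\d+)*")


def citation_of(source):
    """Extract '9.405' from either 'FAR 9.405' or 'data/html/9.405.html'."""
    return CITATION_PATTERN.search(source).group()


# stand-ins: docs already carry the reformatted source, doc_titles the file path
docs = [
    SimpleNamespace(metadata={"source": "FAR 46.708", "title": ""}),
    SimpleNamespace(metadata={"source": "FAR 9.405", "title": ""}),
    SimpleNamespace(metadata={"source": "FAR 11.106", "title": ""}),
]
doc_titles = [
    SimpleNamespace(metadata={"source": "data/html/46.708.html", "title": "46.708 Warranties of data."}),
    SimpleNamespace(metadata={"source": "data/html/11.106.html",
                              "title": "11.106 Purchase descriptions for service contracts."}),
    SimpleNamespace(metadata={"source": "data/html/9.405.html", "title": "9.405 Effect of listing."}),
]

# look titles up by citation so the load order no longer matters
titles_by_citation = {
    citation_of(doc.metadata["source"]): doc.metadata["title"] for doc in doc_titles
}
for doc in docs:
    doc.metadata["title"] = titles_by_citation.get(citation_of(doc.metadata["source"]), "")
    print(doc.metadata)
```

This relies on the source citation being complete, so it needs a pattern that keeps multi-digit suffixes such as `-43`.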