Let's look at the no_sale_countries.md document. It can be split into three parts:
1. The first two paragraphs, which describe the need to omit sales in specific countries.
2. Numbered paragraphs listing the "no sale" countries, each with an explanation.
3. A concluding paragraph summarizing possible future updates to the no-sale-country policy.

The basic functionality of our chatbot should be as follows:
1. If the user asks a customer-support-related question, they should get a general answer that briefly mentions the no-sale-country policy (in case they are from one of these countries).
2. If the user asks a customer-support-related question involving one of the no-sale countries, they should get a description of the policy together with the country-specific explanation.
3. If the user asks specifically about the no-sale countries, they should get a description of the policy together with the list of no-sale countries.
4. Questions that are not relevant to customer support must not be answered.
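
The four rules above can be sketched as a small routing function. This is only an illustration: the names `ResponseMode` and `choose_response_mode` are hypothetical, the three inputs are assumed to come from some upstream classification step that is not shown here, and the order in which the rules are checked is an assumption rather than something the policy document specifies.

```python
from enum import Enum


class ResponseMode(Enum):
    """Hypothetical labels for the four behaviours listed above."""
    GENERAL_WITH_NOTE = 'general_with_note'      # rule 1
    POLICY_WITH_COUNTRY = 'policy_with_country'  # rule 2
    POLICY_WITH_LIST = 'policy_with_list'        # rule 3
    REFUSE = 'refuse'                            # rule 4


def choose_response_mode(is_support_question, asks_about_policy, mentioned_no_sale_countries):
    """Pick a response mode from three pre-computed signals.

    Assumed precedence: refuse first, then the policy question,
    then a country-specific answer, then the general answer.
    """
    if not is_support_question:
        return ResponseMode.REFUSE
    if asks_about_policy:
        return ResponseMode.POLICY_WITH_LIST
    if mentioned_no_sale_countries:
        return ResponseMode.POLICY_WITH_COUNTRY
    return ResponseMode.GENERAL_WITH_NOTE
```

For example, `choose_response_mode(True, False, ['Spain'])` selects the country-specific mode, while any question flagged as unrelated to customer support is refused regardless of the other two signals.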

Let's split the data into relevant documents, extracting metadata and replacing "[Your Company Name]" with the actual customer-company name.

In [4]:
import re
import json
from pathlib import Path

CUSTOMER_NAME = 'Home Accessories LLC'

INPUT_FILE = Path().resolve().parent / 'data/no_sale_countries.md'
OUTPUT_FILE = Path().resolve().parent / 'data/no_sale_countries.json'

In [5]:
raw_text = INPUT_FILE.read_text().replace('[Your Company Name]', CUSTOMER_NAME)
paragraphs = raw_text.split('\n\n')
for i, paragraph in enumerate(paragraphs):
    print(f'=== paragraph {i} ===')
    print(paragraph)
    print()

=== paragraph 0 ===
# No Sale Countries

=== paragraph 1 ===
As part of our commitment to ethical business practices and compliance with
international regulations, Home Accessories LLC has identified certain countries
where we will not conduct sales. This decision is based on a combination of
factors including but not limited to legal restrictions, ethical concerns, and
market conditions.

=== paragraph 2 ===
The following countries are on our no sale list, along with the specific reasons
for each designation:

=== paragraph 3 ===
1. Spain
- Reason: Compliance with Local Regulations
    - Spain has recently implemented stringent regulations on the sale of
      specific categories of products that we manufacture. Our current product
      lines do not meet the new regulatory requirements, and bringing them into
      compliance would require significant changes to our production process and
      supply chain, resulting in unsustainable costs.

=== paragraph 4 ===
2. Italy
- Reason: Un

Paragraphs 3-6 (the per-country entries) need to be parsed a bit further. The remaining paragraphs can be used as they are, so we can now create a more structured data file.

In [6]:
result = {
    'documents': [],
    'introduction': paragraphs[1],
    'conclusion': f'{paragraphs[8]}\n\n{paragraphs[9]}'
}
no_sale_countries = []
for paragraph in paragraphs[3:7]:
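    # Split on list bullets only, so hyphens inside the text stay intact.
    parts = [part.strip() for part in re.split(r'\n\s*- ', paragraph)]
    # parts[0] looks like '1. Spain'; drop the list number to get the country.
    country = parts[0].split('. ', 1)[-1]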
    no_sale_countries.append(country)

    result['documents'].append({
        'text': parts[1] + '\n' + re.sub(r'\n\W*', ' ', parts[2]),
        'metadata': {
            'geography': country
        }
    })

with open(OUTPUT_FILE, 'w') as f:
    json.dump(result, f, indent=2)
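
As a quick check of the resulting structure, the country-specific documents can be looked up by their geography tag. The sketch below is a minimal illustration, not part of the original notebook: the helper name `documents_for_country` is hypothetical, it assumes the metadata key is spelled `metadata`, and it runs on a small hand-written sample (with shortened placeholder texts) instead of the real output file.

```python
def documents_for_country(data, country):
    """Return the texts of all documents tagged with the given country.

    Assumes each document has the shape
    {'text': ..., 'metadata': {'geography': ...}}.
    """
    wanted = country.strip().lower()
    return [
        doc['text']
        for doc in data['documents']
        if doc.get('metadata', {}).get('geography', '').lower() == wanted
    ]


# Hand-written sample mirroring the structure built above (texts shortened).
sample = {
    'introduction': 'Introduction text',
    'conclusion': 'Conclusion text',
    'documents': [
        {
            'text': 'Reason: Compliance with Local Regulations\nExplanation text',
            'metadata': {'geography': 'Spain'},
        },
    ],
}

spain_texts = documents_for_country(sample, 'spain')
other_texts = documents_for_country(sample, 'France')
```

The comparison is case-insensitive so that a user typing "spain" still matches; a country with no entry simply yields an empty list.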