#UNICC Chatbot

In this notebook, we train the open-source Llama LLM on our dataset of UNECE documents, using a 4-bit quantized version from Unsloth to improve efficiency. To improve accuracy, we also introduced a RAG pipeline that identifies relevant parsed segments of the PDF database and passes them in as context to the Llama queries.

## Contents

1. Performing text splitting: parses PDF database of UNECE policy documents and session resolutions into 50-500 character "chunks" using font size, boldness, etc. to identify section headers. Stores these chunks along with document metadata to later feed into the RAG pipeline.

  **1.1**. Uses llama-index (open source embedding library) to embed and store these chunks in a vector-based document index for later collection. Uses a traditional tf-idf scoring with cosine similarity for relevance evaluations.

2. Llama 4-bit quantized: prepares the model itself, using unsloth to get a pre-trained 4-bit quantized version of Llama 3.1 8B Instruct. Re-loads the same PDFs from the text splitting phase but as entire documents to pass into the model for fine-tuning.

    **2.1.** Uses LoRA for fine-tuning
    **2.2.** Trains with SFTTrainer from hugging face

3. Front end: basic chatbot website set up for demo purposes -- allows user to input questions, view responses, and interact with relevant documents based on submitted queries (collected from the chunk embeddings metadata). Uses ngrok to simulate a mini-server on Colab.

In [1]:
# Text parsing packages
!pip install pdfplumber
!pip install fitz
!pip install PyMuPDF
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting pdfplumber
  Downloading pdfplumber-0.11.4-py3-none-any.whl.metadata (41 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/42.0 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.0/42.0 kB[0m [31m3.0 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting pdfminer.six==20231228 (from pdfplumber)
  Downloading pdfminer.six-20231228-py3-none-any.whl.metadata (4.2 kB)
Collecting pypdfium2>=4.18.0 (from pdfplumber)
  Downloading pypdfium2-4.30.0-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (48 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m [31m3.8 MB/s[0m eta [36m0:00:00[0m
Downloading pdfplumber-0.11.4-py3-none-any.whl (59 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.2/59.2 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading pdfminer.six-20231228-py3-none-any.whl (5.6 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━

Collecting PyMuPDF
  Downloading pymupdf-1.25.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.25.0-cp39-abi3-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (20.0 MB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m35.1 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: PyMuPDF
Successfully installed PyMuPDF-1.25.0
Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m12.8/12.8 MB[0m [31m99.4 MB/s[0m eta [36m0:00:00[0m
[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_sm')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. Y

In [2]:
# Imports for the RAG encodings
!pip install llama-index
!pip install llama-index-embeddings-huggingface

Collecting llama-index
  Downloading llama_index-0.12.3-py3-none-any.whl.metadata (11 kB)
Collecting llama-index-agent-openai<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.0-py3-none-any.whl.metadata (726 bytes)
Collecting llama-index-cli<0.5.0,>=0.4.0 (from llama-index)
  Downloading llama_index_cli-0.4.0-py3-none-any.whl.metadata (1.5 kB)
Collecting llama-index-core<0.13.0,>=0.12.3 (from llama-index)
  Downloading llama_index_core-0.12.3-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4.0,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downloading llama_index_indices_managed_llama_cloud-0.6.3-py3-none-any.whl.metadata (3.8 kB)
Collecting llama-index-legacy<0.10.0,>=0.9.48 (from llama-index)
  Downloading llama_index_legacy-0.9.48.post4-py3-none-any.whl.metadata (8.5 kB)
Collecting 

In [3]:
# Imports for the model
!pip install unsloth
!pip install -U bitsandbytes

Collecting unsloth
  Downloading unsloth-2024.12.4-py3-none-any.whl.metadata (59 kB)
[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/59.2 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m59.2/59.2 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting unsloth_zoo>=2024.11.8 (from unsloth)
  Downloading unsloth_zoo-2024.12.1-py3-none-any.whl.metadata (16 kB)
Collecting xformers>=0.0.27.post2 (from unsloth)
  Downloading xformers-0.0.28.post3-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (1.0 kB)
Collecting bitsandbytes (from unsloth)
  Downloading bitsandbytes-0.45.0-py3-none-manylinux_2_24_x86_64.whl.metadata (2.9 kB)
Collecting triton>=3.0.0 (from unsloth)
  Downloading triton-3.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)
Collecting tyro (from unsloth)
  Downloading tyro-0.9.2-py3-none-any.whl.metadata (9.4 kB)
Collecting datasets>=2.16.0 (from unsloth)
  Downloadi



In [4]:
# Installs for frontend
!pip install flask-ngrok pyngrok
# this is my (lucy's) authtoken, it should be fine to use so you don't have to make an account (pls don't do anything else with it lol)
!ngrok authtoken 2ojOiPQ59Oi8KsIkY8xByZxp3xp_GJ1GTXSJfSimSNUKquke
!pip install pdf2image

Collecting flask-ngrok
  Downloading flask_ngrok-0.0.25-py3-none-any.whl.metadata (1.8 kB)
Collecting pyngrok
  Downloading pyngrok-7.2.1-py3-none-any.whl.metadata (8.3 kB)
Downloading flask_ngrok-0.0.25-py3-none-any.whl (3.1 kB)
Downloading pyngrok-7.2.1-py3-none-any.whl (22 kB)
Installing collected packages: pyngrok, flask-ngrok
Successfully installed flask-ngrok-0.0.25 pyngrok-7.2.1
Authtoken saved to configuration file: /root/.config/ngrok/ngrok.yml
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl.metadata (6.2 kB)
Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Installing collected packages: pdf2image
Successfully installed pdf2image-1.17.0


# Performing text splitting

Section dedicated to parsing PDF into chunks based on topic headers, which are then encoded to use in the RAG pipeline

In [5]:
import fitz  # PyMuPDF
import os
from google.colab import drive
import glob
import re

def extract_chunks_from_pdf_mupdf(pdf_path, title):
    chunks = []
    current_chunk = ""
    current_title = "Introduction"  # Default title for the first chunk
    current_title = title
    tables = []  # To store detected tables
    table_pattern = re.compile(r"([A-Za-z0-9]+(\s{2,}|,\s?))+")  # Pattern for detecting rows in tables

    # Bullet points pattern
    bullet_pattern = re.compile(r"^[•●○‣▪■□–-]\s")

    doc = fitz.open(pdf_path)
    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if "lines" in block:
                table_content = []
                for line in block["lines"]:

                    #accounting for too big chunks
                    if len(current_chunk) > 650:
                        chunks.append({"title": title, "chunk_title": current_title, "content": current_chunk.strip()})
                        current_chunk = ""
                        chunk_title=""

                    line_text = " ".join([span["text"] for span in line["spans"]]).strip()
                    font_size = line["spans"][0]["size"]
                    font_name = line["spans"][0]["font"]

                    # Heuristic for headers
                    is_bold = "Bold" in font_name or "SemiBold" in font_name
                    is_bullet = bullet_pattern.match(line_text)
                    line_text = bullet_pattern.sub("", line_text).strip()

                    # Detect numbers-only lines (e.g., page numbers)
                    if line_text.isdigit():
                        continue

                    # checking if cur line is a footnote
                    if font_size < 10:
                        continue

                    # Check if the line matches table pattern
                    is_table_line = table_pattern.match(line_text)
                    if is_table_line:
                        table_content.append(line_text)
                        continue

                    # checking for headers
                    if (font_size >= 14 or is_bold) and not is_bullet:
                        if current_chunk:
                            chunks.append({"title": title, "chunk_title": current_title, "content": current_chunk.strip()})
                            current_chunk = ""
                        current_title = line_text
                        current_chunk = line_text #adding the title to the chunk, just bc chunk_title is often not that specific and is messing us up
                    else:
                        current_chunk += " " + line_text

                if table_content:
                    tables.append({"title": title, "chunk_title": current_title, "content": "\n".join(table_content)})
                    table_content = []

    if current_chunk:
        chunks.append({"title": title, "chunk_title": current_title,  "content": current_chunk.strip()})
    if tables:
        chunks.extend(tables)

    return chunks

# Get all .pdf files in the folder
#NOTE: you will have to make this file yourself in your own drive, it just contains all of the PDFS Jason gave us
drive.mount('/content/drive')
folder_path = '/content/drive/My Drive/UNICC_dataset/' #'/content/drive/My Drive/UNICC_db/'

file_pattern = os.path.join(folder_path, '*.pdf')
chunks = []
for file_path in glob.glob(file_pattern):
    filename = os.path.basename(file_path)
    title = os.path.splitext(filename)[0]
    print("Extracting passages from document:", file_path)
    chunks.extend(extract_chunks_from_pdf_mupdf(file_path, title))
    #break  # Just process the first document for now

#removing super short chunks
rem_chunks =  [chunk for chunk in chunks if len(chunk["content"]) < 50 ]
chunks =  [chunk for chunk in chunks if len(chunk["content"]) >= 50 ]


Mounted at /content/drive
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_ES61_Update_2019.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of Updated Chinese Minerals-UNFC-BD 25Oct2022_CEFR.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_Antropogenic_Resource_Specifications.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_Solar_Specifications.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of Updated Chinese Petroleum-UNFC-BD 25October 2022_CEFR_0.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_Geothermal_Specs_25October2022.pdf
Extracting passages from document: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_

## Getting embeddings from chunks

In [6]:
# using https://docs.llamaindex.ai/en/v0.9.48/examples/embeddings/huggingface.html
# https://docs.llamaindex.ai/en/stable/module_guides/models/embeddings/

from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Document, Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.llms import MockLLM
# This is basically an empty LLM that takes the place of the default OpenAI API so we don't have to have an API key
# because we're only retrieving documents, then passing them into our own queries later, this is totally fine.
llm = MockLLM()

embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5" # try BAAI/bge-m3!! multilingual
)

Settings.embed_model = embed_model

# formatting previously identified chunks into documents
documents = [
    Document(
        text=chunk['content'],
        metadata={
            'document_title': chunk['title'],
            'chunk_title': chunk['chunk_title']
        }
    ) for chunk in chunks
]

# building index
index = VectorStoreIndex.from_documents(documents)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/94.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [7]:
query_engine = index.as_query_engine(embed_model=embed_model, llm=llm, similarity_top_k=10)

response = query_engine.query("What is methane drainage?")

print("\nRelevant documents found:")
for i, node in enumerate(response.source_nodes, start=1):
    print(f"\nDocument {i}:")
    print(f"Document Title: {node.node.metadata.get('document_title', 'No title available')}")
    print(f"Chunk Title: {node.node.metadata.get('chunk_title', 'No chunk title available')}")
    print(f"Content: {node.node.text}\n")


Relevant documents found:

Document 1:
Document Title: Copy of Copy of BPG_2017
Chunk Title: Methane drainage and its challenges
Content: in captured gas. When these gases are in or near the explosive range during transport and use, they create hazards.


Document 2:
Document Title: Copy of Copy of BPG_2017
Chunk Title: Chapter 5. Methane drainage
Content: technology into the mining environment to ensure that safety is not compromised and best practices are maintained. Methane drainage system performance can be improved through proper regular Transporting methane-air mixtures at concentrations in or near the explosive range in coal mines is a dangerous practice and should be prohibited.


Document 3:
Document Title: Copy of Copy of BPG_2017
Chunk Title: Methane drainage and its challenges
Content: Methane drainage and its challenges The purpose of methane drainage is to capture high- purity gas at its source before it can enter mine airways. For regulatory purposes, the amount of gas 

# Llama 4-bit quantized


In [8]:
# FRom https://huggingface.co/unsloth/Meta-Llama-3.1-8B-bnb-4bit

* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* [**NEW**] We make Gemma-2 9b / 27b **2x faster**! See our [Gemma-2 9b notebook](https://colab.research.google.com/drive/1vIrqH5uYDQwsJ4-OO3DErvuv4pBgVwk4?usp=sharing)
* [**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing)
* [**NEW**] We make Mistral NeMo 12B 2x faster and fit in under 12GB of VRAM! [Mistral NeMo notebook](https://colab.research.google.com/drive/17d3U-CAIwzmbDRqbZ9NnpHxCkmXB6LZ0?usp=sharing)

In [9]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit", # "unsloth/Meta-Llama-3.1-8B", #"unsloth/Meta-Llama-3.1-70B-bnb-4bit", # "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2024.12.4: Fast Llama patching. Transformers:4.46.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/55.4k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/340 [00:00<?, ?B/s]

In [10]:
# adding LoRA adaptors -- use a very high lora_alpha to increase the impact of UNECE dataset over pre-training.
# also use rank stabilized LoRA for slightly improved performance
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 64, # increased from 16 to increase the impact of our training dataset in comparison to the default training
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,  #using rank stabilized LoRA
    loftq_config = None,
)

Unsloth 2024.12.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [None]:
#### Alternative experiment ##########

from unsloth import FastLanguageModel
import torch
from transformers import BitsAndBytesConfig
from peft import LoftQConfig

loftq_config = LoftQConfig(
    loftq_bits=4,  # Match the bit-depth of quantization
    loftq_iter=1   # Number of optimization iterations
)

# QLoRA-specific quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",  # Normal Float 4-bit quantization
    bnb_4bit_use_double_quant=True  # Nested quantization for further compression
)

max_seq_length = 2048

# Load model with QLoRA-specific quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length = max_seq_length,
    quantization_config = quantization_config,
)

# Apply QLoRA-style LoRA with additional configurations
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 64,
    lora_dropout = 0.1,  # Added some dropout
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = True,
    # Optional: Add QLoRA-specific low-rank quantization
    loftq_config = loftq_config,  # You can configure LoftQ if desired
)

### Data Prep
Parsing UNECE PDFS into text -- doesn't do any other processing.

In [11]:
#Data processing, from the EDA Colab:

import fitz  # PyMuPDF
import spacy
import glob
from google.colab import drive
import os
import pandas as pd
from datasets import load_dataset


# Function to extract text from PDF
def extract_text_from_pdf(file_path):
    doc = fitz.open(file_path)
    text = ""
    for page in doc:
        text += page.get_text()
    #Remove new line characters
    text = text.replace('\n', '')
    # make it all lower case
    text = text.lower()
    print(text)
    return text  # Return the extracted text

drive.mount('/content/drive')

# Get all .pdf files in the folder
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
docs = []
folder_path = '/content/drive/My Drive/UNICC_dataset'#UNICC_db/'
file_pattern = os.path.join(folder_path, '*.pdf')
for file_path in glob.glob(file_pattern):
    print(f"Processing file: {file_path}")
    docs.append(extract_text_from_pdf(file_path) + EOS_TOKEN)


df = pd.DataFrame({"text": docs})
# Save the DataFrame to a CSV file
df.to_csv('dataset.csv', index=False, escapechar='\\') #using escapechar bc our actual data contains commas


#have already defined tokenizer tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True)

# Assuming a CSV with a 'text' column containing document content
dataset = load_dataset('csv', data_files={'train': 'dataset.csv'}, split='train')
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Add labels by copying input_ids
def add_labels(batch):
    batch['labels'] = batch['input_ids'].copy()  # GPT-2 uses next-token prediction, so labels are the same as input_ids
    return batch

tokenized_datasets = tokenized_datasets.map(add_labels, batched=True)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Processing file: /content/drive/My Drive/UNICC_dataset/Copy of UNFC_ES61_Update_2019.pdf
61unece energy seriesenvironmental-socio-economicviability  sold or usedproductiontechnical feasibility  degree of confidence  production which isunused or consumedin operationsremaining products not developedother combinationsproduced quantitiescodification (e1; f2; g3)viable projectspotentially viable projects non-viable projectsprospective projectsunited nations framework classification for resourcesupdate 2019 united nations framework classification for resourcesupdate 2019 ece energy series no. 61geneva, 2020united nations economic commission for europerequests to reproduce excerpts or to photocopy should be addressed to the copyright clearance center at copyright.com. all other queries on rights and licenses, including subsidiary rights, should be addressed to:unite

Generating train split: 0 examples [00:00, ? examples/s]

Map:   0%|          | 0/93 [00:00<?, ? examples/s]

Map:   0%|          | 0/93 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
SFT Docs (chosen based on unsloth docs): (https://huggingface.co/docs/trl/sft_trainer) -- train with max_steps for now to shorted training process and reduce compute on Colab

In [12]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

#TODO: experiment with different training parameters
trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        #num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 10,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/93 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs


In [13]:
# training!

#lucy's wandb api key (can just use this one if it asks for it): 3188c3181154f953e650fdf9b314e997e36a894a

# can also set up a secret in google colab (uncomment this if so)
# import os
# from google.colab import userdata

# # setting api key from colab secrets
# os.environ["WANDB_API_KEY"] = userdata.get('wandb-api-key')

trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 93 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 10
 "-____-"     Number of trainable parameters = 41,943,040
[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Step,Training Loss
1,2.087
2,2.3117
3,2.1614
4,2.0142
5,1.9927
6,1.7782
7,1.9361
8,1.9709
9,1.9621
10,1.7575


<a name="Inference"></a>
### Inference

In [31]:
# Formatting this into a function used in the flask frontend

# Example questions:  Why is it important to reduce gas emissions?
# From what type of mine do most coal mine emissions come from?
# When and where did the first occurance of methane drainage take place?

conv_history = [] # list of question - response strings
context = ""
max_conv_len = 1024 # used to trim conv history if it's getting too long

""" returns a formatted string of the conversation history to pass into a prompt """
def get_conv_hist():
    global conv_history
    formatted_conv = [("User: " + conv if i % 2 == 0 else "AI assistant: " + conv) for i, conv in enumerate(conv_history)]
    formatted_conv= "\n".join(formatted_conv)
    return formatted_conv

""" adds text to the existing conv history, and reduces len if it exceeds max_conv_len """
def set_conv_hist(text):
    global conv_history
    conv_history.append(text)

    # cutting off oldest parts of the conversation in pairs of 2 (question + answer)
    while len("\n".join(conv_history)) > max_conv_len:
      conv_history = conv_history[2:]


"""
Uses the query engine created during the "Getting embeddings from chunks" section
to identify the most similar document chunks
given a specific query.
"""
def get_context(question):
    # adding context from the identified similar chunks
    global context
    cur_cont = ""
    titles = []
    query_engine = index.as_query_engine(embed_model=embed_model, llm=llm, similarity_top_k=10)
    question_history = [conv if i%2 == 0 else "" for i, conv in enumerate(conv_history)]
    #print("CONTEXT QUERY: ", ("\n".join(question_history) + question))
    response = query_engine.query(("\n".join(question_history) + question))
    for i, doc in enumerate(response.source_nodes, start=1):
        #print(f"Document {i}: {doc.node.text[:200]}...")
        #print(f"Document Title: {doc.node.metadata.get('document_title', 'No title available')}")
        if i < 8: #taking top 8 results
          cur_cont += f"Document: {doc.node.metadata.get('document_title', 'No title available')} (Excerpt from text: {doc.node.text}) \n\n"
          titles.append(doc.node.metadata.get('document_title', 'No title available'))

    # adding this new round of docs to the FRONT of the context string
    context = cur_cont + context

    return context, titles

"""
Explores prompt engineering to get a prompt that takes in context and a question.
TODO: this doesn't really account for questions that CANNOT be answered in the dataset.
"""
def get_prompt(question):
    context, titles = get_context(question)

    formatted_conv = get_conv_hist()

    # this prompt helps keep a consistent complete sentance format
    # NOTE: the tags are specific to the instruct model, see https://www.llama.com/docs/model-cards-and-prompt-formats/llama3_1/
    prompt = f"""
    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    You are a helpful AI assistant. You will analyze the information in the provides context passages and conversation history, and answer questions based solely on that context.
    Answer the question based on the information in the passages.

    - Do NOT reference the context chunks directly
    - Respond in a complete sentence
    - if the question cannot be answered based on the information in the passages, say so explicitly

    Here is the history of your conversation with the user:
    {formatted_conv}
    <|eot_id|>

    <|start_header_id|>user<|end_header_id|>

    Here is the relevant context:
    {context}

    Question: {question}

    <|start_header_id|>assistant<|end_header_id|>
    """

    return prompt, titles

"""
Primary method called from the frontend.

Takes in a question, formats the prompt and context, then passes the output into the model.
"""
def get_response(question):
  global conv_history
  print("Conv hist in get_response: ", conv_history)
  FastLanguageModel.for_inference(model) # Enable native 2x faster inference

  prompt, titles = get_prompt(question)

  #print(prompt)

  inputs = tokenizer([prompt], return_tensors = "pt").to("cuda")
  #outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
  outputs = model.generate(
        **inputs,
        max_new_tokens=150,
        min_new_tokens=5, #key to avoiding empty inputs
        temperature=0.1, # increase here = more likely to choose less probable tokens, increases creativity (we don't want that lol)
        use_cache=True,
        top_p=0.2, # similar to temp
        #num_beams=3,
        # turned this off to ensure consistency, cancells out temp and top_p
        do_sample=False, # random samples groups of likely tokens, also introduces randomness that increases creativity
        pad_token_id=tokenizer.eos_token_id,
        # Stop at the end of the answer
        eos_token_id=tokenizer.eos_token_id,
        # Prevent prompt repetition
        no_repeat_ngram_size=3
    )

  #response = tokenizer.batch_decode(outputs)
  response = tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
  if not response: #this shouldn't ever happen bc we've set min_new_tokens but is here in case something goes wrong
    return "Hm, I can't seem to find an answer to that question in this dataset."

  # adding both the question and response to the conversation history
  set_conv_hist(question)
  set_conv_hist(response.strip())

  return response.strip(), titles

In [None]:
# Testing inference

# Example questions:  Why is it important to reduce gas emissions?
# From what type of mine do most coal mine emissions come from?
# When and where did the first occurance of methane drainage take place?
# "When and where was methane drainage first recorded?"
# "Which documents should I reference to learn more about methane drainage?"
# "From what type of coal mine does the most ventilation air methane come from?" --> this one is good

question = "What is methane drainage?"
resp, titles = get_response(question)
print(resp)
print("\n")

question = "Can you explain in more detail?"
resp, titles = get_response(question)
print(resp)
print("\n")

question = "When and where did methane drainage first take place?"
resp, titles = get_response(question)
print(resp)
print("\n")

#Front end

We run a flask app from the server in the "Main Server" subsection. This handles a basic frontend that displays an input for the user to ssend a question, then sends that question to our inference functions, which returns a response. We then display the response on the frontend and prompt the user again for input.

The HTML and CSS style files used in the frontend are placed before the main server file because they must be generated before the server is run.  

INSTRUCTIONS to run the app:


1.   Run all of the cells in the file until you reach the "Main server" section (shortcut: go to that cell, click on it, then go to Runtime->Run before to run every cell prior in the notebook)
2.   Run the main server cell
3.   After it starts, you'll see the following output (or similar):

* Public URL: NgrokTunnel: "https://8c7f-34-142-236-178.ngrok-free.app" -> "http://localhost:5000"
 * Serving Flask app '__main__'
 * Debug mode: off
INFO:werkzeug:WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
 * Running on http://127.0.0.1:5000

4.   Click on the NgrokTunnel URL (NOT localhost or 127.0.0.1), then select OK when asked about security. This will be the page where you can see/interact with the bot.



## HTML Templates
Run each cell to create the file, which will then be stored in the Colab file storage. This just prevents us from having to upload new files every time we run the Colab/change the HTML

In [15]:
# First, create necessary directories and files
!mkdir -p templates static/css
!mkdir -p content

In [16]:
# Write HTML content to files
%%writefile templates/base.html
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>{{ title }}</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='css/style.css') }}">
</head>
<body>
    <nav>
        <ul>
            <li><a href="{{ url_for('home') }}">Home</a></li>
        </ul>
    </nav>

    <main>
        {% block content %}
        {% endblock %}
    </main>
</body>
</html>

Writing templates/base.html


In [17]:
%%writefile templates/home.html
{% extends "base.html" %}

{% block content %}
<div class="container">
    <h1>Welcome to Flask</h1>
    <p>This is your homepage with styled content!</p>
</div>
{% endblock %}

Writing templates/home.html


In [49]:
%%writefile templates/index.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>GenAI-Bot</title>
    <link rel="stylesheet" href="{{ url_for('static', filename='css/bot-style.css') }}">
    <script src="https://ajax.googleapis.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
</head>
<body>
    <header>
        <h1>Questions about Climate Change?</h1>
        <h4>Ask our interactive bot below to be directed to helpful resources</h4>
    </header>

    <main>
        <div class="chat-container">
            <div id="chatbox" class="chat-box">
                <p class="botText">
                    <span>Hello! How can I help you?</span>
                </p>
            </div>
            <div id="userInput" class="input-container">
                <input id="textInput" type="text" name="msg" placeholder="Type your message..." />
                <!-- could add submit button here if wanted -->
            </div>
        </div>


        <!-- what actually displays the relevant docs -->
        <div id="relevantDocs" class="relevant-docs">
            <h4>Relevant Documents:</h4>
            {% for doc in rel_docs %}
            <p>
                <a href="{{ doc.url }}" target="_blank">{{ doc.title }}</a>
            </p>
            {% endfor %}
        </div>

        <div class="actions">
            <a onclick="callFlaskEndpoint()" class="view-pdfs-button"> Clear History </a>
            <a href="{{ url_for('dataset')}}" class="view-pdfs-button">View all PDFs</a>
        </div>

        <br/>
        <br/>
        <br/>

    </main>

    <script>
        function getBotResponse() {
            var rawText = $("#textInput").val();
            var userHtml = '<p class="userText"><span>' + rawText + "</span></p>";
            $("#textInput").val("");
            $("#chatbox").append(userHtml);
            document
                .getElementById("userInput")
                .scrollIntoView({ block: "start", behavior: "smooth" });

            $.get("/get", { msg: rawText }).done(function(data) {
                var botHtml = '<p class="botText"><span>' + data.response + "</span></p>";
                $("#chatbox").append(botHtml);

                var docsContainer = document.getElementById("relevantDocs");
                docsContainer.innerHTML = ""; // Clear existing docs
                data.rel_docs.forEach(function(doc) {
                    var p = document.createElement("p");
                    var a = document.createElement("a");
                    a.href = doc.url;
                    a.textContent = doc.title;
                    a.target = "_blank"; // Opens in new tab
                    p.appendChild(a);
                    docsContainer.appendChild(p);
                });

                document
                    .getElementById("userInput")
                    .scrollIntoView({ block: "start", behavior: "smooth" });
            });
        }

        $("#textInput").keypress(function(e) {
            if (e.which == 13) {
                getBotResponse();
            }
        });

        // resetting conv history when requested
        async function callFlaskEndpoint() {
            try {
                const response = await fetch('/clear-conv-hist', {
                    method: 'POST',
                    headers: {
                        'Content-Type': 'application/json'
                    }
                });
            } catch (error) {
                console.error('Error:', error);
            }
        }
    </script>
</body>
</html>


Overwriting templates/index.html


In [20]:
%%writefile templates/pdf_gallery.html

<!DOCTYPE html>
<html>
<head>
    <title>PDF Thumbnail Gallery</title>
     <link rel="stylesheet" href="{{ url_for('static', filename='css/bot-style.css') }}">
    <link href="https://cdn.jsdelivr.net/npm/tailwindcss@2.2.16/dist/tailwind.min.css" rel="stylesheet">
</head>
<body>
    <div class="back-button">
        <button> <a href="{{ url_for('home')}}"> Back to Bot </a> </button>
    </div>
    <div class="container mx-auto my-8">
        <h1 class="text-3xl font-bold mb-4">PDF Thumbnail Gallery</h1>
        <div class="grid grid-cols-3 gap-4">
            {% for pdf in pdf_data %}
            <div class="border rounded shadow p-4">
                <!-- <img src="{{ pdf.thumbnailUrl }}" alt="{{ pdf.title }}" class="w-full h-auto"> -->
                <a href="#" onclick="openPdfInNewTab('{{ pdf.pdfUrl }}'); return false;" class="block border rounded shadow p-4 hover:shadow-lg transition-shadow">
                  <img src="{{ url_for('static', filename=pdf.thumbnailUrl) }}" class="image" />
                  <h3 class="mt-2 text-lg font-medium">{{ pdf.title }}</h3>
                </a>
            </div>
            {% endfor %}
        </div>
    </div>
    <script>
        function openPdfInNewTab(pdfUrl) {
            window.open(pdfUrl, '_blank');
        }
    </script>
</body>
</html>

Writing templates/pdf_gallery.html


##CSS

In [21]:
%%writefile static/css/bot-style.css


* {
    box-sizing: border-box;
    margin: 0;
    padding: 0;
}

body, html {
    height: 100%;
    font-family: 'Arial', sans-serif;
    background-color: #f4f4f9;
    color: #333;
    line-height: 1.6;
}

header {
    text-align: center;
    padding: 20px;
    background-color: #f4f4f9;
    color: #4c87af;
}

header h1 {
    margin-bottom: 10px;
    font-size: 2rem;
}

header h4 {
    font-weight: normal;
}

.chat-container {
    max-width: 600px;
    margin: 20px auto;
    padding: 20px;
    background: white;
    border-radius: 10px;
    box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1);
}

.chat-box {
    max-height: 400px;
    overflow-y: auto;
    margin-bottom: 20px;
    padding: 10px;
    border: 1px solid #ddd;
    border-radius: 5px;
    background-color: #f9f9f9;
}

.input-container {
    display: flex;
    justify-content: space-between;
}

#textInput {
    width: 100%;
    padding: 10px;
    border: 1px solid #ccc;
    border-radius: 5px;
    font-size: 16px;
    outline: none;
    transition: border-color 0.2s;
}

#textInput:focus {
    border-color: #4c87af; //4CAF50;
}

.userText, .botText {
    margin: 10px 0;
    font-size: 16px;
}

.userText span {
    background-color: #444;
    color: white;
    padding: 10px;
    border-radius: 10px;
    display: inline-block;
}

.botText span {
    background-color: #4c87af;
    color: white;
    padding: 10px;
    border-radius: 10px;
    display: inline-block;
}

.actions {
    text-align: center;
    margin: 20px 0;
}

.view-pdfs-button {
    display: inline-block;
    padding: 10px 20px;
    background-color: #4c87af;
    color: white;
    text-decoration: none;
    border-radius: 5px;
    transition: background-color 0.3s;
}

.view-pdfs-button:hover {
    background-color: #456ba0; //45a049;
}

.relevant-docs {
    max-width: 600px;
    margin: 20px auto;
    padding: 10px;
    background: #f9f9f9;
    border-radius: 5px;
    box-shadow: 0 4px 10px rgba(0, 0, 0, 0.1);
}

.relevant-docs p a {
    color: #333;
    text-decoration: none;
}

.relevant-docs p a:hover {
    text-decoration: underline;
}

footer {
    text-align: center;
    padding: 10px 0;
    background: #4c87af;
    color: white;
    position: fixed;
    bottom: 0;
    width: 100%;
}


Writing static/css/bot-style.css


## Main server

In [22]:
# from https://colab.research.google.com/drive/10doc9xwhFDpDGNferehBzkQ6M0Un-tYq#scrollTo=QDVm2QUrnJaF
!apt-get install poppler-utils

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  poppler-utils
0 upgraded, 1 newly installed, 0 to remove and 49 not upgraded.
Need to get 186 kB of archives.
After this operation, 696 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy-updates/main amd64 poppler-utils amd64 22.02.0-2ubuntu0.5 [186 kB]
Fetched 186 kB in 1s (150 kB/s)
Selecting previously unselected package poppler-utils.
(Reading database ... 123632 files and directories currently installed.)
Preparing to unpack .../poppler-utils_22.02.0-2ubuntu0.5_amd64.deb ...
Unpacking poppler-utils (22.02.0-2ubuntu0.5) ...
Setting up poppler-utils (22.02.0-2ubuntu0.5) ...
Processing triggers for man-db (2.10.2-1) ...


In [45]:
from flask import Flask, request, render_template, jsonify
from pyngrok import ngrok
import os
import shutil
from pdf2image import convert_from_path
from googleapiclient.discovery import build
from google.oauth2 import service_account
import json

""" uses GDrive API to find folder with a given name """
def find_folder(drive_service, folder_name):
    # getting all folders shared with the drive API
    all_folders = drive_service.files().list(
        q="mimeType='application/vnd.google-apps.folder' and trashed=false",
        fields='files(id, name)',
        spaces='drive'
    ).execute()

    folders = all_folders.get('files', [])
    for folder in folders:
        if folder['name'].lower() == folder_name.lower():
            return folder['id']

    return None

""" gets a folder with GDrive API, then pulls all PDFS from the folder """
def get_pdfs():
    colab_dir = '/content/static'
    os.makedirs(colab_dir, exist_ok=True)

    # initialize Google Drive API
    creds = service_account.Credentials.from_service_account_info(
        info=json.load(open('/content/service_account.json', 'r'))
    )
    drive_service = build('drive', 'v3', credentials=creds)

    folder_id = find_folder(drive_service, 'UNICC_dataset')

    if not folder_id:
        raise Exception("Could not find the UNICC_db folder. Please check folder sharing permissions.")

    # Search for PDF files in the folder
    file_list = drive_service.files().list(
        q=f"'{folder_id}' in parents and mimeType='application/pdf' and trashed=false",
        fields='files(id, name, webViewLink)',
        pageSize=100,
        spaces='drive'
    ).execute()

    pdf_files = file_list.get('files', [])
    print(f"\nFound {len(pdf_files)} PDF files:")
    for pdf in pdf_files:
        print(f"- {pdf['name']} (ID: {pdf['id']})")

    if not pdf_files:
        # If no PDFs found, check what files are actually in the folder
        all_files = drive_service.files().list(
            q=f"'{folder_id}' in parents and trashed=false",
            fields='files(id, name, mimeType)',
            pageSize=100
        ).execute()
        print("\nAll files in folder:")
        for file in all_files.get('files', []):
            print(f"- {file['name']} ({file['mimeType']})")

    # extracting text from found PDFs
    pdf_data = []
    for file in pdf_files:
        filename = file['name']
        print(f"\nProcessing {filename}...")
        local_path = os.path.join(colab_dir, filename)

        try:
            request = drive_service.files().get_media(fileId=file['id'])
            with open(local_path, 'wb') as f:
                f.write(request.execute())
            print(f"Downloaded {filename}")

            # generating thumbnail image
            first_page = convert_from_path(local_path, last_page=1)[0]
            thumbnail_path = os.path.join(colab_dir, f"{os.path.splitext(filename)[0]}.png")
            first_page.save(thumbnail_path, 'PNG')
            print(f"Generated thumbnail for {filename}")

            pdf_data.append({
                'pdfUrl': file['webViewLink'],
                'thumbnailUrl': f"{os.path.splitext(filename)[0]}.png",
                'title': os.path.splitext(filename)[0]
            })

        except Exception as e:
            print(f"Error processing file {filename}: {str(e)}")
            continue
        finally:
            if os.path.exists(local_path):
                os.remove(local_path)
                print(f"Cleaned up {filename}")

    return pdf_data

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html', title='Home')

@app.route("/get")
def get_bot_response():
    userText = request.args.get('msg')
    response, titles = get_response(userText)
    print("updated titles: ",titles)

    # getting URLs for all of the titles
    titles = set(titles) # removing duplicates
    rel_docs = []
    for title in titles:
        for pdf in pdf_data:
            if pdf['title'] == title:
                rel_docs.append({"title": title, "url": pdf['pdfUrl']})
                break

    return jsonify({
        'response': response,
        'rel_docs': rel_docs
    })

@app.route("/dataset")
def dataset():

    return render_template('pdf_gallery.html', pdf_data=pdf_data)

@app.route("/clear-conv-hist", methods=["POST"])
def clear_conv_hist():
  print("Reaching clear conv hist")
  global conv_history
  conv_history = []

  return jsonify({'status': 'success'})

In [27]:
# loading PDFs once bc it takes forever
try:
    pdf_data = get_pdfs()
except Exception as e:
    print(f"Error loading PDF data: {str(e)}")
    pdf_data = []

IndentationError: unexpected indent (<ipython-input-27-c31c75b94b4a>, line 2)

In [50]:
if __name__ == "__main__":

    # Get a tunnel from ngrok and run Flask
    public_url = ngrok.connect(5000)
    print(f' * Public URL: {public_url}')

    # Run the app
    app.run(port=5000)

 * Public URL: NgrokTunnel: "https://896f-35-197-134-122.ngrok-free.app" -> "http://localhost:5000"
 * Serving Flask app '__main__'
 * Debug mode: off


 * Running on http://127.0.0.1:5000
INFO:werkzeug:[33mPress CTRL+C to quit[0m
INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:16:31] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:16:32] "GET /static/css/bot-style.css HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:16:32] "[33mGET /favicon.ico HTTP/1.1[0m" 404 -
INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:16:34] "POST /clear-conv-hist HTTP/1.1" 200 -


Reaching clear conv hist


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:16:41] "POST /clear-conv-hist HTTP/1.1" 200 -


Reaching clear conv hist


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:18:35] "POST /clear-conv-hist HTTP/1.1" 200 -


Reaching clear conv hist
Conv hist in get_response:  []


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:19:47] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:19:47] "GET /get?msg=What%20is%20the%20political%20and%20institutional%20context%20of%20coal%20mining%20in%20Albania? HTTP/1.1" 200 -


updated titles:  ['Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of ECE_ENERGY_GE.4_2024_5_Mapping Albania_Final']


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:19:48] "[36mGET /static/css/bot-style.css HTTP/1.1[0m" 304 -


Conv hist in get_response:  ['What is the political and institutional context of coal mining in Albania?', "The political and social context of the coal industry in Albania is characterized by a lack of collective energy and a fading intensity of the Just Transition and coal exit. The country's coal mining history dates back to 1991, with the sector experiencing a disruptive shutdown within five years after the collapse and economic liberalisation of the socialist states in Eastern Europe and Soviet Union. The sector's workforce was significantly reduced, with up to half the mining workers emigrating to Italy or Greece. The coal industry has not been a priority for the government, and there is no clear plan for the Just transition and coal phase-out. The government has not taken any measures to address the social and economic challenges faced by the coal communities. The lack of a clear plan and the"]


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:20:34] "GET /get?msg=What%20is%20the%20political%20and%20institutional%20context%20of%20coal%20mining%20in%20Albania? HTTP/1.1" 200 -


updated titles:  ['Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL']
Conv hist in get_response:  ['What is the political and institutional context of coal mining in Albania?', "The political context of Albania's coal industry is characterized as a lack collective energy, and a faded intensity of Just Transition. The industry has been in a state of decline since the collapse in 1999, with many mines closed and a significant portion of the workforce emigrates to other countries. The current government has 

INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:21:17] "GET /get?msg=What%20is%20a%20Just%20Transition%20Framework%20for%20Sector%20Decarbonization? HTTP/1.1" 200 -


updated titles:  ['Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL', 'Copy of Copy of UNECE ALBANIA Just Transition And Decarbonization Report FINAL']


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:21:26] "POST /clear-conv-hist HTTP/1.1" 200 -


Reaching clear conv hist
Conv hist in get_response:  []


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:21:53] "GET /get?msg=What%20is%20enhanced%20oil%20recovery? HTTP/1.1" 200 -


updated titles:  ['Copy of Copy of CCUS brochure_EN_final', 'Copy of Copy of 1919051_E_ECE_ENERGY_109_WEB', 'Copy of Copy of Geologic CO2 storage report_final_EN', 'Copy of Copy of 1919051_E_ECE_ENERGY_109_WEB', 'Copy of Copy of Geologic CO2 storage report_final_EN', 'Copy of Copy of 2017886_E_ECE_ENERGY_134_WEB', 'Copy of Copy of Geologic CO2 storage report_final_EN']
Conv hist in get_response:  ['What is enhanced oil recovery?', 'Enhanced oil production is a method of extracting more oil from a reservoir after the primary production phase.']


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:22:23] "GET /get?msg=Can%20you%20give%20me%20more%20details? HTTP/1.1" 200 -


updated titles:  ['Copy of Copy of CCUS brochure_EN_final', 'Copy of Copy of 1919051_E_ECE_ENERGY_109_WEB', 'Copy of Copy of Geologic CO2 storage report_final_EN', 'Copy of Copy of 1919051_E_ECE_ENERGY_109_WEB', 'Copy of Copy of 1919051_E_ECE_ENERGY_109_WEB', 'Copy of Copy of Geologic CO2 storage report_final_EN', 'Copy of Copy of 2017886_E_ECE_ENERGY_134_WEB']


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:22:49] "POST /clear-conv-hist HTTP/1.1" 200 -


Reaching clear conv hist
Conv hist in get_response:  []


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:23:22] "GET /get?msg=Quel%20type%20de%20mines%20de%20charbon%20produit%20le%20plus%20d’émissions%20de%20méthane%20? HTTP/1.1" 200 -


updated titles:  ['Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of BPG_2017', 'Copy of Copy of BPG_2017', 'Copy of Copy of 2119167_E_ECE_ENERGY_139_WEB', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_']
Conv hist in get_response:  ['Quel type de mines de charbon produit le plus d’émissions de méthane ?', 'The coal mines with the highest methane emissions are underground coalmines.']


INFO:werkzeug:127.0.0.1 - - [08/Dec/2024 03:23:56] "GET /get?msg=*%20Quel%20type%20de%20mines%20de%20charbon%20produit%20le%20plus%20d’émissions%20de%20méthane%20? HTTP/1.1" 200 -


updated titles:  ['Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of BPG_2017', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of LCA_0708_correction', 'Copy of Copy of Best_Practice_Guidance_for_Effective_Methane_Recovery_and_Use_from_Abandoned_Coal_Mines_FINAL__with_covers_', 'Copy of Copy of BPG_2017']
