# **Retrieval-Augmented Generation**
## **Model for building a database of architectural works**

The goal of this code is to populate a database using a generative model that supplies the information required for each column of the requested table.

The generative model follows a Retrieval-Augmented Generation (RAG) strategy: for each question that fills a database column, it must identify the most relevant contexts (a window of text before and/or after the search target). These contexts come from a prior named-entity recognition step (*see the NER code*), which extracts the entire page on which each entity appears.

Because multiple occurrences of the same entity can produce too many contexts for a single answer, semantic search is used to prioritize and rank the most informative ones. This relies on a semantic index built with [Faiss](https://faiss.ai/index.html).
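As a rough illustration of the ranking step, the sketch below orders candidate contexts by squared L2 distance to a query vector, which is the metric an exact flat L2 index uses. It is pure Python with toy 2-D vectors; the function name `rank_contexts_by_l2` and the sample values are assumptions made for this example and are not part of the notebook.

```python
def rank_contexts_by_l2(query_vec, context_vecs, k=3):
    """Return the indices of the k context vectors closest to query_vec (squared L2)."""
    distances = [
        (sum((q - c) ** 2 for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(context_vecs)
    ]
    distances.sort()
    # Never return more results than there are contexts
    k = min(k, len(distances))
    return [idx for _, idx in distances[:k]]


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
toy_contexts = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1]]
toy_query = [1.0, 0.0]
top = rank_contexts_by_l2(toy_query, toy_contexts, k=2)
```

In the real pipeline the vectors would come from a sentence-embedding model and the search would be delegated to Faiss; the sketch only shows what "ranking the most informative contexts" means numerically.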

The main challenge is identifying the right contexts with high precision and building a model that answers the questions correctly.

The selected model is [Flan-T5](https://huggingface.co/docs/transformers/en/model_doc/flan-t5), chosen for its ease of use in text generation and its good performance on similar tasks.

# Environment Setup
Define the environment variables and connect Google Drive.

In [None]:
GENERATE_DATASET = True
IMPORT_NER = True
SEED = 223

In [None]:
!pip install -qq faiss-gpu
!pip install -qq virtualenv
!pip install -qq selenium
!pip install -qq requests
!pip install -qq urllib3
!pip install -qq bs4

In [None]:
from google.colab import drive
from transformers import T5Tokenizer, T5ForConditionalGeneration, AutoTokenizer, AutoModel, pipeline
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
from tqdm import tqdm
from torch.utils.data import Dataset, DataLoader
import torch
import random
import numpy as np
import pandas as pd
import nltk
nltk.download('stopwords')
import faiss
import os
import requests
import re
import sys
import string
import csv
import io

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from google.colab import drive

drive.mount('/content/drive')

%cd "/content/drive/"

torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Which book am I going to analyse
file_path = '/content/drive/MyDrive/ARCHITECTURE_NER/NER/libros_en_txt/pages_BOOK Kenneth Frampton Modern Architecture.txt'

# The location of the text segmentation file corresponding to the file path
ts_file_path = '/content/drive/MyDrive/ARCHITECTURE_NER/Text_Segmentation/csvs_text_segmentation/TS lines_Kenneth Frampton Modern Architecture.csv'

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive


# RAG Definition

In [None]:
RAG_TESTING = False

## Flan-T5 model

In [None]:
t5_tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-large")
flan_t5_model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large").to('cuda')


### Generation instructions

In [None]:
def generate_answer(question: str, input_dict, max_length: int = 2046) -> tuple[str, str]:
    prompt = "Answer the following question about architectural buildings, using your expertise in historical and contemporary architecture and considering all provided contexts. Be concise and answer only the question.\n\n"
    prompt += f"Question: {question}\n"

    # Retrieve and concatenate the contexts
    contexts = retrieve_contexts(question, input_dict)
    for i, ctx in enumerate(contexts, 1):
        prompt += f"Context {i}: {ctx}\n\n"

    prompt += "The final answer is:\n"

    inputs = t5_tokenizer.encode_plus(
        prompt,
        return_tensors="pt",
        max_length=max_length,
        truncation=True
    ).to('cuda')

    outputs = flan_t5_model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_beams=5,
        early_stopping=True
    )

    answer = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

    del inputs, outputs, contexts, prompt

    return '', answer
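The prompt layout used by `generate_answer` can be isolated into a small pure function, which makes it easy to inspect without loading the model. The sketch below mirrors the instruction, question, numbered contexts and final cue shown above; the helper name `build_prompt` and its `max_contexts` parameter are hypothetical additions for illustration.

```python
INSTRUCTION = (
    "Answer the following question about architectural buildings, using your "
    "expertise in historical and contemporary architecture and considering all "
    "provided contexts. Be concise and answer only the question.\n\n"
)


def build_prompt(question, contexts, max_contexts=None):
    """Assemble instruction, question and numbered contexts into one prompt string."""
    # max_contexts is a hypothetical knob to cap how many contexts are included
    selected = contexts if max_contexts is None else contexts[:max_contexts]
    parts = [INSTRUCTION, f"Question: {question}\n"]
    for i, ctx in enumerate(selected, 1):
        parts.append(f"Context {i}: {ctx}\n\n")
    parts.append("The final answer is:\n")
    return "".join(parts)


example = build_prompt(
    "In which city is the Casa Vicens located?",
    ["first retrieved page", "second retrieved page", "third retrieved page"],
    max_contexts=2,
)
```

Printing `example` shows exactly what the tokenizer would receive, which helps when checking whether long contexts are being cut off by truncation.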

In [None]:
def verify_arch(arch: str, input_dict, max_length: int = 2046) -> tuple[str, str]:
    prompt = '''Classify the input as a valid architectural building based on the given contexts. Remember, only standalone architectural structures qualify; locations, streets, and people's names do not. Respond with 'Yes' or 'No' only. Here are some examples:
                Input: Casa Vicens
                Output: Yes

                Input: Bois de Boulogne
                Output: No

                Input: Maison Dom-Ino
                Output: Yes

                Input: Rue de la Harpe
                Output: No

                Input: Galerie des Machines
                Output: Yes

                Input: Le Corbusier
                Output: No
                '''

    prompt += f"Input: {arch}\n"

    # contexts = retrieve_contexts(prompt, input_dict)
    # for i, ctx in enumerate(contexts[:2], 1):
    #     prompt += f"Context {i}: {ctx}\n\n"

    inputs = t5_tokenizer.encode_plus(
        prompt,
        return_tensors="pt",
        max_length=max_length,
        truncation=False
    ).to('cuda')

    outputs = flan_t5_model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_beams=5,
        early_stopping=True
    )

    answer = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

    del inputs, outputs, prompt

    return '', answer

In [None]:
def infer_tipologia(arch: str, input_dict, max_length: int = 2046) -> tuple[str, str]:
    prompt = '''Given the following definition:
    A building's functional type is a designation that classifies buildings by their originally intended use or purpose at the time of construction (e.g., house, sanatorium, office). While the functional type is sometimes part of a building's name (e.g., "Farnsworth House" implies "house"), it may also be absent or misleading (e.g., "Seagram Building" does not directly indicate a function, and "Les Arcades du Lac" is residential, not commercial). Additionally, buildings may have overlapping designations (e.g., "house", "residence", or "villa").

    Using the provided contexts, infer and output the building's functional type as a single word or short phrase.'''

    contexts = retrieve_contexts(prompt, input_dict)

    # Start on a new line so the building name is not glued to the instruction text
    prompt += f"\nBuilding: {arch}\n"
    for i, ctx in enumerate(contexts[:2], 1):
        prompt += f"Context {i}: {ctx}\n\n"
    prompt += "Output:\n"

    inputs = t5_tokenizer.encode_plus(
        prompt,
        return_tensors="pt",
        max_length=max_length,
        truncation=False
    ).to('cuda')

    outputs = flan_t5_model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_beams=5,
        early_stopping=False
    )

    answer = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

    del inputs, outputs, contexts, prompt

    return '', answer

def classify_tipologia(arch: str, input_dict, max_length: int = 2046) -> tuple[str, str]:
    prompt = '''
    Analyze the following architecture with your expertise in historical and contemporary architecture, considering the provided context. Classify it into one of the functional types listed below, and state your classification at the end.
    Functional Types: Abbey - Customs Office - Altar - Asylum / Orphanage - Studio - Auditorium - University Auditorium - Lecture Hall - Town Hall or Municipality - Bank (branch or headquarters) - Public Baths - Library - Winery - Stock Exchange - Bowling Alley - Cabins - Bathhouse - Retreat House - Rest Home - Casino / Gambling House - Cemetery - Server Center - Telephone Center - Civic Center - Cultural or Exhibition Center - Convention Center - Exhibition Center - Innovation Center - Research Center - Logistics Center - Congress or Convention Palace - Sports Center - Cinema - Club - Clubhouse - Professional Public School - Holiday Camp - Summer Camp - Congress, Parliament, or Assembly - Municipal Council - Clinic - Convent - Post Office - Court of Justice - Crematorium - General Warehouse - Train Depot - Specialized Education - Embassy or Consulate - Facility - Primary School - Secondary School - Fire Station - Subway Station - Police Station - Service Station - Railway Station - Indoor Sports Stadium - Baseball Stadium - Football Stadium - Olympic Stadium - Radio Broadcasting Studios - Factory - Commercial Arcade - Gym - Farm / Stable - Hangar - Hybrid - Hippodrome - Hospital / Clinic / Dispensary - Hostel - Hotel - Vertical Garden - Church - Printing House - Greenhouse / Botanical Garden - Courts - Kindergarten - Laboratory - Master Plan - Slaughterhouse - Media Library - Memorial - Wholesale / Central Market - Retail Market - Mosque - Ministry or Government Department - Observation Tower - Monastery - Motel - Museum - Observatory - Government Administrative Offices - General Offices - Exhibition Pavilion - Meeting Pavilion - Government Palace - Pantheon - Beach Resort - Parking Garage - Parliament or Assembly - Park - Water Park - Penitentiary - Pool / Swimming Facility - Power Plant - Recycling Plant - Bridge - Recycling Facility - Shelter - Renovation - Government Residence and Palace - Restaurant / Bar / Coffee Shop - University Services (for students, etc.) - Shopping Mall - Silo - Synagogue - Supermarket - Theater - Airport Terminal - Bus Terminal - Passenger Port Terminal - Retail Store / Showroom - Department Store - Single-Brand Store - University or College - Velodrome - Collective Housing - Single-Family Housing - Housing for Pensioners - University Housing - Zoo    '''

    # contexts = retrieve_contexts(prompt, input_dict)

    prompt += f"\nArchitecture: {arch}\n"
    # for i, ctx in enumerate(contexts[:2]):
    #     prompt += f"Context {i}: {ctx}\n\n"
    prompt += "Output:\n"

    inputs = t5_tokenizer.encode_plus(
        prompt,
        return_tensors="pt",
        max_length=max_length,
        truncation=False
    ).to('cuda')

    outputs = flan_t5_model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        num_beams=5,
        early_stopping=False
    )

    answer = t5_tokenizer.decode(outputs[0], skip_special_tokens=True)

    del inputs, outputs, prompt

    return '', answer

## Faiss index

In [None]:
from sentence_transformers import SentenceTransformer
import numpy as np
import faiss
import torch

# Load SentenceTransformer model
st_model = SentenceTransformer('all-mpnet-base-v2', device='cuda')

# Encoding contexts
def encode_contexts(contexts: list[str], model: SentenceTransformer, batch_size: int = 8) -> np.ndarray:
    encoded_contexts = model.encode(
        contexts,
        batch_size=batch_size,
        convert_to_numpy=True,
        device='cuda',
        show_progress_bar=True
    )
    return encoded_contexts

# Build FAISS index
def build_faiss_index(contexts: list[str], model: SentenceTransformer, batch_size: int = 8) -> faiss.IndexFlatL2:
    context_embeddings = encode_contexts(contexts, model, batch_size)
    dimension = context_embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(context_embeddings)
    return index

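# Code for semantic search
def semantic_search(query: str, index: faiss.IndexFlatL2, model: SentenceTransformer, k: int = 3) -> list[int]:
    query_embedding = model.encode(query, convert_to_numpy=True, device='cuda').reshape(1, -1)
    # Never ask for more neighbours than the index holds: FAISS pads missing results with -1,
    # which would silently select the last context when used as a Python list index.
    k = min(k, index.ntotal)
    D, I = index.search(query_embedding, k)
    # Return indices of top k retrieved contexts
    return [int(i) for i in I[0] if i >= 0]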


# Functions

## Useful functions

In [None]:
titles = ['Mrs.', 'Mr.', 'Dr.', 'Ms.']
uppers = [chr(i) + '.' for i in range(65, 91)]
lowers = [' ' + chr(i) + '.' for i in range(97, 123)]  # 'a'..'z' inclusive (range end is exclusive)
numbers = [str(i) + '.' for i in range(0, 10)]

def remove_titles(text):
    mytext = text

    for title in titles:
        mytext = mytext.replace(title, title[:-1])
    for upper in uppers:
        mytext = mytext.replace(upper, upper[:-1])
    for number in numbers:
        mytext = mytext.replace(number, number[:-1]+',')
    for lower in lowers:
        mytext = mytext.replace(lower, lower[:-1])

    return mytext

def prepare_data(filename):
    # Use a context manager so the file handle is always closed
    with open(filename, 'r') as file:
        txt = file.read().replace('\n', ' ')
    txt = txt.replace('_', ' ')
    txt = txt.replace('—', '-')
    txt = txt.replace('–', '-')
    txt = txt.replace('“', '"')
    txt = txt.replace('”', '"')
    txt = txt.replace('’', "'")
    txt = txt.replace('‘', "'")
    txt = txt.replace('…', ' ')
    txt = txt.replace('...', ' ')
    txt = txt.replace('|', 'I')
    txt = txt.replace('+', ' ')
    txt = txt.replace('/', ' ')
    txt = txt.replace('(', ' ')
    txt = txt.replace(')', ' ')
    txt = txt.replace(':', ' ')
    txt = txt.replace(';', ' ')

    pattern = r"\[\d+\]"
    txt = re.sub(pattern, "", txt)
    txt = re.sub(' +', ' ', txt)
    txt = remove_titles(txt)
    data = [sentence.strip() for sentence in txt.split('.') if sentence.strip() != '']

    return data

def create_context(arch: str, content: str):
    # Each line of the book file is treated as one page; keep every page that mentions the entity
    pages = []
    for page in content.split('\n'):
        if arch in page:
            pages.append(page.strip())
    return pages

def read_file(file_path):
    with open(file_path, 'r') as file:
        content = file.read()
    return content

def remove_stopwords(text: str):
  stop_words = set(stopwords.words('english'))
  words = text.split()
  filtered_words = [word for word in words if word.lower() not in stop_words]
  return ' '.join(filtered_words)

def count_and_position_correction(arch: str, txt: str):
    text = txt.lower()
    arch = arch.lower()

    matches = list(re.finditer(re.escape(arch), text))
    positions = [(m.start(), m.end()) for m in matches]
    count = max(len(matches), 1)

    return count, positions

def retrieve_contexts(question, input_dict):
    contexts = input_dict['contexts']
    index = input_dict['book_index']

    wiki_contexts = input_dict['wiki_contexts']
    wiki_index = input_dict['wiki_index']

    # Wikipedia contexts always take priority over book contexts
    if wiki_index is not None:
      return [wiki_contexts[i] for i in semantic_search(question, wiki_index, st_model, 2)]
    elif wiki_contexts != []:
      return wiki_contexts
    elif index is not None:
      return [contexts[i] for i in semantic_search(question, index, st_model, 2)]
    elif contexts != []:
      return contexts
    else:
      return []
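Since `prepare_data` splits sentences on every period, abbreviations such as "Dr." would otherwise end a sentence early, which is why `remove_titles` strips those periods first. The minimal sketch below reproduces only the title-handling part to show the effect; `strip_title_periods`, `split_sentences` and the sample sentence are illustrative names and data, not the notebook's own.

```python
TITLES = ["Mrs.", "Mr.", "Dr.", "Ms."]


def strip_title_periods(text):
    """Drop the trailing period of common titles so it is not read as a sentence end."""
    for title in TITLES:
        text = text.replace(title, title[:-1])
    return text


def split_sentences(text):
    """Naive sentence splitter: cut on every period and discard empty pieces."""
    return [s.strip() for s in text.split(".") if s.strip()]


raw = "Dr. Wright designed the house. It was built in 1936."
naive = split_sentences(raw)
cleaned = split_sentences(strip_title_periods(raw))
```

Here `naive` contains a spurious one-word "sentence" for the title, while `cleaned` keeps the title attached to the name, which matters because the NER model is run sentence by sentence.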

## Scraping Functions

In [None]:
def generate_url(architecture_name):
    formatted_name = architecture_name.replace(" ", "_").title()

    # List of French, English, Spanish, and Italian articles/prepositions that should remain lowercase
    lowercase_words = [
        # French
        "De", "Des", "Du", "La", "Le", "Les", "Aux", "Et", "En", "Sur", "À",
        # English
        "Of", "The", "And", "A", "An", "In", "On", "For", "At",
        # Spanish
        "De", "Del", "La", "El", "Los", "Las", "Y", "En", "Con", "Por", "Para", "Un", "Una",
        # Italian
        "Di", "Della", "Del", "Lo", "Il", "I", "Gli", "Le", "E", "Nel", "Con", "Per", "Tra", "Fra"
    ]

    # Split the name into words and correct the casing for articles/prepositions
    formatted_name_parts = formatted_name.split("_")
    formatted_name_parts = [word if word not in lowercase_words else word.lower() for word in formatted_name_parts]

    # Join the words back with underscores
    formatted_name = "_".join(formatted_name_parts)

    url = f"https://en.wikipedia.org/wiki/{formatted_name}"
    return url

def scrape_architect(architect_name, architecture_name):
    url = generate_url(architect_name)

    try:
        page = requests.get(url, timeout=10)  # avoid hanging forever on an unresponsive request
        soup = BeautifulSoup(page.content, 'html.parser')

        # Check if the page is a disambiguation or not found
        disambig_phrases = [
            "Other reasons this message may be displayed:",
            f"may refer to:"
        ]

        # If any disambiguation or not-found phrase appears, skip saving
        if any(phrase in str(soup) for phrase in disambig_phrases):
          return []

        # Extract paragraphs
        paragraphs = soup.find_all('p')

        pattern = r"\[\d+\]"

        contexts = [re.sub(pattern, "", p.get_text().replace("\n", '')) for p in paragraphs if p.get_text().replace("\n", '') != '']

        return [context for context in contexts if architecture_name.lower() in context.lower()]

    except Exception as e:
        return []

# Function to scrape Wikipedia page
def scrape_architecture(architecture_name, index, contexts):
    url = generate_url(architecture_name)

    try:
        page = requests.get(url, timeout=10)  # avoid hanging forever on an unresponsive request
        soup = BeautifulSoup(page.content, 'html.parser')

        # Check if the page is a disambiguation or not found
        disambig_phrases = [
            "Other reasons this message may be displayed:",
            f"may refer to:"
        ]

        # If page not found look for the architect behind the architecture's page
        if any(phrase in str(soup) for phrase in disambig_phrases):
            question = f"Who was/were the architect/s behind the {architecture_name}?"

            input_dict = {
                'contexts': contexts,
                'book_index': index,
                'wiki_contexts': [],
                'wiki_index': None
            }

            prompt, answer = generate_answer(question, input_dict)

            return scrape_architect(answer, architecture_name)

        paragraphs = soup.find_all('p')

        pattern = r"\[\d+\]"

        return [re.sub(pattern, "", p.get_text().replace("\n", '')) for p in paragraphs if p.get_text().replace("\n", '') != '']

    except Exception as e:
        question = f"Who was/were the architect/s behind the {architecture_name}?"

        input_dict = {
            'contexts': contexts,
            'book_index': index,
            'wiki_contexts': [],
            'wiki_index': None
        }

        prompt, answer = generate_answer(question, input_dict)

        return scrape_architect(answer, architecture_name)
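Building names often contain accents or apostrophes, so it can help to percent-encode the page title explicitly when building the Wikipedia URL. The sketch below uses `urllib.parse.quote` from the standard library; `wikipedia_url` is a hypothetical helper that does not reproduce the casing logic of `generate_url`, and the set of characters left unescaped is an assumption.

```python
from urllib.parse import quote


def wikipedia_url(title):
    """Build an English Wikipedia URL with a percent-encoded page title."""
    slug = title.strip().replace(" ", "_")
    # Keep underscores, parentheses, apostrophes and commas readable; encode everything else
    return "https://en.wikipedia.org/wiki/" + quote(slug, safe="_()',")


plain = wikipedia_url("Galerie des Machines")
accented = wikipedia_url("Sagrada Família")
```

HTTP clients usually encode such characters on their own, so this is mainly useful for logging or caching the exact URL that was requested.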

## Result functions

In [None]:
def text_segmentation_results(architecture: str, book):
  counts = {'qty_title': 0, 'qty_paragraph': 0, 'qty_caption': 0}

  for _, row in book.ts_df.iterrows():
    text = row['text'].lower()
    label = row['label']
    if architecture.lower() in text:
      if label == 'Title':
        counts['qty_title'] += 1
      elif label == 'Paragraph':
        counts['qty_paragraph'] += 1
      elif label == 'Caption':
        counts['qty_caption'] += 1

  return counts

column_questions = {
    'author/s': "Who was/were the architect/s behind the {arch}?",
    'city': "In which city is the {arch} located?",
    'country': "In which country is the {arch} situated?",
    'start_year': "In which year did the construction of the {arch} begin?",
    'end_year': "In which year was the construction of the {arch} completed?"
}

def generate_questions(archs, topics):
    questions = []
    for arch in archs:
        for topic, question_template in topics.items():
            question = question_template.format(arch=arch)
            questions.append(question)
    return questions
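To see what the templates produce for a single building, the sketch below fills them and keeps the result keyed by column name, which avoids having to recover the column from a list position later. The helper `fill_templates` and the two-entry template dictionary are assumptions made for this example; the notebook's own `generate_questions` returns a flat list instead.

```python
templates = {
    "author/s": "Who was/were the architect/s behind the {arch}?",
    "city": "In which city is the {arch} located?",
}


def fill_templates(arch, templates):
    """Return one question per column, keyed by the column name."""
    return {column: template.format(arch=arch) for column, template in templates.items()}


questions = fill_templates("Casa Vicens", templates)
```

Because dictionaries preserve insertion order, iterating over `questions.items()` yields the columns in the same order as the template dictionary.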

In [None]:
def generate_result_line(input_dict: dict, book):
  output_dict = {}

  output_dict['architecture'] = input_dict['arch']

  output_dict['is_architecture'] = "Yes"

  input_text = input_dict['arch']

  questions = generate_questions([input_text], column_questions)

  for k, question in enumerate(questions):

    if k == 2:
      question = f"In which country is {output_dict['city']} situated?"

    prompt, answer = generate_answer(question, input_dict)

    column = list(column_questions.keys())[k]

    output_dict[column] = answer

  ts_counts = text_segmentation_results(input_dict['arch'], book)
  output_dict.update(ts_counts)
  output_dict['qty_unassigned'] = input_dict['count'] - sum(ts_counts.values())
  output_dict['qty'] = input_dict['count']
  output_dict['certainty_score'] = sum(input_dict["scores"]) / len(input_dict["scores"])

  prompt, answer = infer_tipologia(input_dict['arch'], input_dict)
  output_dict['fuctional_type_inferred'] = answer

  prompt, answer = classify_tipologia(input_dict['arch'], input_dict)
  output_dict['fuctional_type_classified'] = answer

  output_dict['uses_wikipedia'] = "True" if input_dict['wiki_contexts'] != [] else "False"

  return output_dict

def get_results(dataset, book):
  results = []
  for i in tqdm(range(len(dataset))):
    arch = dataset[i]['arch']
    prompt, answer = verify_arch(arch, dataset[i])
    if "yes" in answer.lower():
      results.append(generate_result_line(dataset[i], book))
    else:
      results.append({'architecture': arch, "is_architecture": "No", 'author/s': '', 'city': '', 'country': '', 'start_year': '', 'end_year': '', 'qty_title': 0, 'qty_paragraph': '', 'qty_caption': '', 'qty_unassigned': '', 'qty': '', 'certainty_score': '', 'fuctional_type_inferred': '', 'fuctional_type_classified': '', 'uses_wikipedia': ''})

    torch.cuda.empty_cache()
  return results
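The list of dictionaries returned by `get_results` maps naturally onto a CSV file with one row per entity. How the notebook persists its results is not shown in this section, so the following is only a sketch under that assumption: it uses the standard `csv` module, a hypothetical helper `rows_to_csv`, and made-up sample rows with a reduced set of columns.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize a list of uniform dictionaries to CSV text, using the first row's keys as header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows only; real rows carry all the columns built in generate_result_line
sample_rows = [
    {"architecture": "Casa Vicens", "is_architecture": "Yes", "city": "Barcelona"},
    {"architecture": "Rue de la Harpe", "is_architecture": "No", "city": ""},
]
csv_text = rows_to_csv(sample_rows)
```

`DictWriter` raises an error if a row contains a key missing from the header, which is a useful check that the "Yes" and "No" rows share the same columns.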

# Datasets Definition

In [None]:
if IMPORT_NER:
  ner = pipeline(
    'token-classification',
    model="lucasdefino/architecture-NER",
    tokenizer="lucasdefino/architecture-NER",
    aggregation_strategy='simple',
    device=0
  )


In [None]:
class BookDataset(Dataset):
    def __init__(self, book_file_path, txt_seg_file_path):
        self.txt: str = read_file(book_file_path)
        self.data: list[str] = prepare_data(book_file_path)
        self.ts_df = pd.read_csv(txt_seg_file_path)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

def identify_and_process_architectures(
    book: BookDataset, model, batch_size: int, confidence: float = 0.9, min_len: int = 7
):
    results = model(book.data, batch_size=batch_size)

    processed_archs = {}

    for i, result in enumerate(tqdm(results, desc="Identifying and processing architectures")):
        for ent in result:
            if ent['entity_group'] == 'ARCH' and len(ent['word']) > min_len and ent['score'] > confidence:
                # Handle '#' in the word
                if '#' in ent['word']:
                    while ent['start'] > 0 and book.data[i][ent['start'] - 1] != ' ':
                        ent['start'] -= 1
                    ent['word'] = book.data[i][ent['start']:ent['end']]

                word = remove_stopwords(ent['word'])

                if word in processed_archs:
                    processed_archs[word]['scores'].append(ent['score'])
                else:
                    count, positions = count_and_position_correction(word, book.txt)

                    contexts = create_context(word, book.txt)

                    book_index = (
                        build_faiss_index(contexts, st_model) if len(contexts) > 0 else None
                    )

                    wiki_contexts = scrape_architecture(word, book_index, contexts)
                    wiki_index = (
                        build_faiss_index(wiki_contexts, st_model)
                        if len(wiki_contexts) > 0
                        else None
                    )

                    processed_archs[word] = {
                        'arch': word,
                        'positions': positions,
                        'scores': [ent['score']],
                        'contexts': contexts,
                        'wiki_contexts': wiki_contexts,
                        'count': count,
                        'book_index': book_index,
                        'wiki_index': wiki_index,
                    }

    return processed_archs

class ArchitectureDataset(Dataset):
    def __init__(self, processed_archs: dict):
        self.archs = list(processed_archs.values())

    def __len__(self):
        return len(self.archs)

    def __getitem__(self, idx):
        return self.archs[idx]

In [None]:
if GENERATE_DATASET:
  book = BookDataset(file_path, ts_file_path)
  processed_archs = identify_and_process_architectures(
      book=book,
      model=ner,
      batch_size=16,
      confidence=0.8,
      min_len=7,
  )

  dataset = ArchitectureDataset(processed_archs)

Identifying and processing architectures: 100%|██████████| 1240/1240 [03:09<00:00,  6.54it/s]


# Testing
We use the architectures that Julian provided.

In [None]:
def process_architecures(archs: list, book: BookDataset):
  processed_archs = {}

  for ent in tqdm(archs, desc="Processing architectures"):
    if ent['word'] in processed_archs:
        processed_archs[ent['word']]['scores'].append(ent['score'])
    else:
      count, positions = count_and_position_correction(ent['word'], book.txt)
      contexts = create_context(ent['word'], book.txt)
      if len(contexts) == 0:
        book_index = None
      else:
        book_index = build_faiss_index(contexts, st_model)
      wiki_contexts = scrape_architecture(ent['word'], book_index, contexts)
      if wiki_contexts == []:
        wiki_index = None
      else:
        wiki_index = build_faiss_index(wiki_contexts, st_model)

      processed_archs[ent['word']] = {
          'arch': ent['word'],
          'positions': positions,
          'scores': [ent['score']],
          'contexts': contexts,
          'wiki_contexts': wiki_contexts,
          'count': count,
          'book_index': book_index,
          'wiki_index': wiki_index
      }

  return processed_archs

In [None]:
if RAG_TESTING:
    import pandas as pd

    csv_file = "/content/drive/MyDrive/ARCHITECTURE_NER/RAG_Chat/frampton_testing.csv"
    df = pd.read_csv(csv_file, delimiter=',')
    testing_dict = df.set_index('Obra').apply(tuple, axis=1).to_dict()

    book = BookDataset(file_path, ts_file_path)
    arquitecturas_identificadas = []
    for obra in df["Obra"].to_list():
        arquitecturas_identificadas.append({"word": obra, "score":100})
    test_processed_archs = process_architecures(arquitecturas_identificadas, book)
    test_kenneth_dataset = ArchitectureDataset(test_processed_archs)

Processing architectures:  28%|██▊       | 32/114 [00:23<00:48,  1.70it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/7 [00:00<?, ?it/s]

Processing architectures:  29%|██▉       | 33/114 [00:24<00:53,  1.50it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  30%|██▉       | 34/114 [00:25<00:57,  1.39it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  31%|███       | 35/114 [00:25<00:51,  1.54it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  32%|███▏      | 36/114 [00:26<00:46,  1.67it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  32%|███▏      | 37/114 [00:27<00:47,  1.63it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  33%|███▎      | 38/114 [00:27<00:40,  1.88it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  35%|███▌      | 40/114 [00:28<00:41,  1.79it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  36%|███▌      | 41/114 [00:29<00:40,  1.79it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  37%|███▋      | 42/114 [00:29<00:41,  1.72it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  38%|███▊      | 43/114 [00:30<00:42,  1.66it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  39%|███▊      | 44/114 [00:30<00:38,  1.83it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  39%|███▉      | 45/114 [00:31<00:44,  1.56it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  40%|████      | 46/114 [00:32<00:41,  1.62it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  41%|████      | 47/114 [00:32<00:40,  1.67it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  44%|████▍     | 50/114 [00:33<00:28,  2.29it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  45%|████▍     | 51/114 [00:34<00:26,  2.38it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  46%|████▌     | 52/114 [00:35<00:31,  1.94it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  46%|████▋     | 53/114 [00:35<00:33,  1.82it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  47%|████▋     | 54/114 [00:36<00:32,  1.87it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  48%|████▊     | 55/114 [00:37<00:44,  1.31it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  49%|████▉     | 56/114 [00:38<00:40,  1.45it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Processing architectures:  50%|█████     | 57/114 [00:38<00:41,  1.38it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  51%|█████     | 58/114 [00:39<00:43,  1.29it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  52%|█████▏    | 59/114 [00:40<00:43,  1.27it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  53%|█████▎    | 60/114 [00:40<00:37,  1.46it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/4 [00:00<?, ?it/s]

Processing architectures:  54%|█████▎    | 61/114 [00:41<00:41,  1.28it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  54%|█████▍    | 62/114 [00:42<00:39,  1.32it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  55%|█████▌    | 63/114 [00:43<00:35,  1.45it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  56%|█████▌    | 64/114 [00:43<00:32,  1.56it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  57%|█████▋    | 65/114 [00:44<00:29,  1.64it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  58%|█████▊    | 66/114 [00:44<00:30,  1.58it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  59%|█████▉    | 67/114 [00:45<00:27,  1.72it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  60%|█████▉    | 68/114 [00:46<00:31,  1.46it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  61%|██████    | 69/114 [00:47<00:30,  1.46it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  61%|██████▏   | 70/114 [00:47<00:27,  1.58it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  62%|██████▏   | 71/114 [00:48<00:26,  1.61it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  63%|██████▎   | 72/114 [00:48<00:26,  1.60it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  64%|██████▍   | 73/114 [00:49<00:26,  1.53it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  65%|██████▍   | 74/114 [00:50<00:27,  1.46it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  66%|██████▌   | 75/114 [00:50<00:26,  1.48it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  67%|██████▋   | 76/114 [00:51<00:25,  1.48it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  68%|██████▊   | 77/114 [00:52<00:26,  1.42it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  68%|██████▊   | 78/114 [00:52<00:23,  1.53it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  70%|███████   | 80/114 [00:54<00:27,  1.24it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  71%|███████   | 81/114 [00:55<00:25,  1.31it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  72%|███████▏  | 82/114 [00:56<00:23,  1.39it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  73%|███████▎  | 83/114 [00:56<00:22,  1.39it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  74%|███████▎  | 84/114 [00:57<00:23,  1.28it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  75%|███████▍  | 85/114 [00:58<00:23,  1.23it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  75%|███████▌  | 86/114 [00:59<00:23,  1.20it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  76%|███████▋  | 87/114 [01:00<00:24,  1.10it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  77%|███████▋  | 88/114 [01:01<00:22,  1.18it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  78%|███████▊  | 89/114 [01:01<00:16,  1.50it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  79%|███████▉  | 90/114 [01:02<00:15,  1.53it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  80%|███████▉  | 91/114 [01:02<00:12,  1.87it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  81%|████████  | 92/114 [01:02<00:10,  2.05it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  82%|████████▏ | 93/114 [01:03<00:10,  2.04it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  82%|████████▏ | 94/114 [01:03<00:09,  2.11it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  83%|████████▎ | 95/114 [01:05<00:13,  1.38it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  84%|████████▍ | 96/114 [01:05<00:13,  1.34it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

Processing architectures:  85%|████████▌ | 97/114 [01:06<00:13,  1.23it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  86%|████████▌ | 98/114 [01:07<00:12,  1.30it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Processing architectures:  87%|████████▋ | 99/114 [01:08<00:10,  1.39it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  88%|████████▊ | 100/114 [01:08<00:10,  1.35it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  89%|████████▊ | 101/114 [01:09<00:08,  1.48it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  89%|████████▉ | 102/114 [01:10<00:08,  1.43it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  91%|█████████ | 104/114 [01:11<00:06,  1.65it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  92%|█████████▏| 105/114 [01:11<00:05,  1.57it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/5 [00:00<?, ?it/s]

Processing architectures:  93%|█████████▎| 106/114 [01:13<00:06,  1.16it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  94%|█████████▍| 107/114 [01:14<00:05,  1.19it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  95%|█████████▍| 108/114 [01:14<00:04,  1.24it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/23 [00:00<?, ?it/s]

Processing architectures:  96%|█████████▌| 109/114 [01:18<00:08,  1.78s/it]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  96%|█████████▋| 110/114 [01:19<00:05,  1.43s/it]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  97%|█████████▋| 111/114 [01:20<00:03,  1.22s/it]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  98%|█████████▊| 112/114 [01:21<00:02,  1.07s/it]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures:  99%|█████████▉| 113/114 [01:21<00:01,  1.00s/it]

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Processing architectures: 100%|██████████| 114/114 [01:22<00:00,  1.38it/s]


In [None]:
if RAG_TESTING:
    import math
    import re

    def check_correct(answer: str, data: str) -> bool:
        # True when every hyphen-separated part of the reference value
        # appears as a substring of the generated answer (case-insensitive).
        answer_lower = answer.lower()
        expected_parts = set(data.lower().split('-'))
        return all(part in answer_lower for part in expected_parts)

    def extract_numeric_part(s: str) -> str:
        # Concatenate every run of digits in the string, e.g. "c. 1929" -> "1929".
        return ''.join(re.findall(r'\d+', s))

    # Number of correct answers per column
    accuracy_autor = 0
    accuracy_ciudad = 0
    accuracy_pais = 0
    accuracy_inicio = 0
    accuracy_fin = 0
    n = len(test_kenneth_dataset)
    # Years are scored only when the reference value is present (not NaN),
    # so they keep their own denominators.
    n_inicio = 0
    n_fin = 0

    for i in tqdm(range(n), desc="Testing"):
        arch = test_kenneth_dataset[i]['arch']
        test_data = testing_dict[arch]
        result = generate_result_line(test_kenneth_dataset[i], book)

        if check_correct(result["author/s"], test_data[1]):
            accuracy_autor += 1
        if check_correct(result["city"], test_data[2]):
            accuracy_ciudad += 1
        if check_correct(result["country"], test_data[3]):
            accuracy_pais += 1
        if not math.isnan(test_data[4]):
            if extract_numeric_part(result['start_year']) == str(int(test_data[4])):
                accuracy_inicio += 1
            n_inicio += 1
        if not math.isnan(test_data[5]):
            if extract_numeric_part(result['end_year']) == str(int(test_data[5])):
                accuracy_fin += 1
            n_fin += 1

        # Release cached GPU memory between samples
        torch.cuda.empty_cache()

Testing: 100%|██████████| 113/113 [30:23<00:00, 16.14s/it]
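
To make the matching rules used in the test loop concrete, here is a minimal, self-contained sketch that re-declares the two helpers and applies them to made-up strings. The sample values are illustrative assumptions, not entries from the test set. It suggests that `check_correct` is a substring test over the hyphen-separated parts of the reference value, and that `extract_numeric_part` concatenates every digit run, so an answer containing two years would not match a single reference year.

```python
import re

def check_correct(answer: str, data: str) -> bool:
    # Same rule as the test loop: every hyphen-separated part of the
    # reference must occur as a substring of the lowercased answer.
    answer_lower = answer.lower()
    expected_parts = set(data.lower().split('-'))
    return all(part in answer_lower for part in expected_parts)

def extract_numeric_part(s: str) -> str:
    # Concatenate every run of digits found in the string.
    return ''.join(re.findall(r'\d+', s))

# Hypothetical values, chosen only to illustrate the behaviour.
print(check_correct("Le Corbusier and Pierre Jeanneret", "Le Corbusier-Pierre Jeanneret"))  # True
print(check_correct("Le Corbusier", "Le Corbusier-Pierre Jeanneret"))  # False
print(extract_numeric_part("c. 1929"))    # 1929
print(extract_numeric_part("1929-1931"))  # 19291931
```

If answers with year ranges are common, taking only the first four-digit run might be a gentler comparison, but that would change the reported accuracies and is not what the notebook does.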


In [None]:
if RAG_TESTING:
    print("\n")
    print("Accuracy Author: ", accuracy_autor/n)
    print("Accuracy City: ", accuracy_ciudad/n)
    print("Accuracy Country: ", accuracy_pais/n)
    print("Accuracy Start year: ", accuracy_inicio/n_inicio)
    print("Accuracy End year: ", accuracy_fin/n_fin)



Accuracy Author:  0.6460176991150443
Accuracy City:  0.672566371681416
Accuracy Country:  0.5752212389380531
Accuracy Start year:  0.41964285714285715
Accuracy End year:  0.35384615384615387
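
The year accuracies divide by `n_inicio` and `n_fin`, which count only the works that have a reference year. As a hedged sketch, a small helper (the name `safe_accuracy` is hypothetical and not part of the notebook) could guard against an empty denominator and show the counts behind each ratio. The numbers below are made up for illustration.

```python
def safe_accuracy(correct: int, total: int) -> str:
    # Hypothetical helper: format an accuracy as "correct/total (xx.x%)"
    # and avoid a ZeroDivisionError when no reference values exist.
    if total == 0:
        return "n/a (no reference values)"
    return f"{correct}/{total} ({correct / total:.1%})"

print(safe_accuracy(3, 4))  # 3/4 (75.0%)
print(safe_accuracy(0, 0))  # n/a (no reference values)
```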


#Results Generation

##Results extraction

In [None]:
results = get_results(dataset, book)

 94%|█████████▍| 109/116 [20:48<04:36, 39.51s/it]

In [None]:
import csv

# Define the keys (columns) for the CSV file
fieldnames = [
    'architecture', 'is_architecture', 'author/s', 'city', 'country',
    'start_year', 'end_year', 'qty_title', 'qty_paragraph', 'qty_caption',
    'qty_unassigned', 'qty', 'certainty_score', 'fuctional_type_inferred',
    'fuctional_type_classified', 'uses_wikipedia',
]

# Write the results directly to a persistent location on Google Drive.
# newline='' lets the csv module control the line endings itself.
output_path = '/content/drive/MyDrive/ARCHITECTURE_NER/RAG_Chat/resultados/DB_book.csv'
with open(output_path, mode='w', encoding='utf-8', newline='') as file:
    writer = csv.DictWriter(file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(results)

print("CSV file created successfully!")
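
As a quick sanity check after writing, the file can be read back with `csv.DictReader` to confirm the header and the row count. The sketch below uses a temporary file and a single made-up row with a subset of the columns, so it runs without Google Drive. The path and values are assumptions for illustration only.

```python
import csv
import os
import tempfile

# Subset of the real columns and one invented row, for illustration only.
fieldnames = ['architecture', 'city', 'country']
rows = [{'architecture': 'Example House', 'city': 'Example City', 'country': 'Nowhere'}]

# Temporary path standing in for the Drive location used above.
path = os.path.join(tempfile.mkdtemp(), 'DB_book.csv')

with open(path, mode='w', encoding='utf-8', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)

# Read the file back and compare it with what was written.
with open(path, encoding='utf-8', newline='') as f:
    reader = csv.DictReader(f)
    read_back = list(reader)
    header = reader.fieldnames

print(header)
print(len(read_back))
```

On the real output, the same read-back would use the Drive path and compare `len(read_back)` with `len(results)`.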
