# RAG - Academic research assistance system

In [1]:
import pandas as pd
from pinecone import Pinecone
from openai import OpenAI

from tqdm.autonotebook import tqdm


## Load Corpus - ACL Anthology

In [70]:
df = pd.read_parquet('../acl-publication-info.74k.parquet')

In [3]:
# Get the first 10 rows
df_reduced = df.head(10)
df_reduced

Unnamed: 0,acl_id,abstract,full_text,corpus_paper_id,pdf_hash,numcitedby,url,publisher,address,year,...,booktitle,author,title,pages,doi,number,volume,journal,editor,isbn
0,O02-2002,There is a need to measure word similarity whe...,There is a need to measure word similarity whe...,18022704,0b09178ac8d17a92f16140365363d8df88c757d0,14,https://aclanthology.org/O02-2002,,,2002,...,International Journal of Computational Linguis...,"Chen, Keh-Jiann and\nYou, Jia-Ming",A Study on Word Similarity using Context Vecto...,37--58,,,,,,
1,L02-1310,,,8220988,8d5e31610bc82c2abc86bc20ceba684c97e66024,93,http://www.lrec-conf.org/proceedings/lrec2002/...,European Language Resources Association (ELRA),"Las Palmas, Canary Islands - Spain",2002,...,Proceedings of the Third International Confere...,"Mihalcea, Rada F.",Bootstrapping Large Sense Tagged Corpora,,,,,,,
2,R13-1042,Thread disentanglement is the task of separati...,Thread disentanglement is the task of separati...,16703040,3eb736b17a5acb583b9a9bd99837427753632cdb,10,https://aclanthology.org/R13-1042,"INCOMA Ltd. Shoumen, BULGARIA","Hissar, Bulgaria",2013,...,Proceedings of the International Conference Re...,"Jamison, Emily and\nGurevych, Iryna","Headerless, Quoteless, but not Hopeless? Using...",327--335,,,,,,
3,W05-0819,"In this paper, we describe a word alignment al...","In this paper, we describe a word alignment al...",1215281,b20450f67116e59d1348fc472cfc09f96e348f55,15,https://aclanthology.org/W05-0819,Association for Computational Linguistics,"Ann Arbor, Michigan",2005,...,Proceedings of the {ACL} Workshop on Building ...,"Aswani, Niraj and\nGaizauskas, Robert",Aligning Words in {E}nglish-{H}indi Parallel C...,115--118,,,,,,
4,L02-1309,,,18078432,011e943b64a78dadc3440674419821ee080f0de3,12,http://www.lrec-conf.org/proceedings/lrec2002/...,European Language Resources Association (ELRA),"Las Palmas, Canary Islands - Spain",2002,...,Proceedings of the Third International Confere...,"Suyaga, Fumiaki and\nTakezawa, Toshiyuki and...",Proposal of a very-large-corpus acquisition me...,,,,,,,
5,R13-1044,The paper 1 presents a rule-based approach to ...,The paper 1 presents a rule-based approach to ...,2491460,c0f1047fe0f95c367184d494e78bb07b11ee3608,2,https://aclanthology.org/R13-1044,"INCOMA Ltd. Shoumen, BULGARIA","Hissar, Bulgaria",2013,...,Proceedings of the International Conference Re...,"K{\k{e}}dzia, Pawe{\l} and\nMaziarz, Marek",Recognizing semantic relations within {P}olish...,342--349,,,,,,
6,W05-0818,"In this paper we describe LIHLA, a lexical ali...","In this paper we describe LIHLA, a lexical ali...",15322146,ff3f05120d24e5dac2879f25402993bc6355f780,5,https://aclanthology.org/W05-0818,Association for Computational Linguistics,"Ann Arbor, Michigan",2005,...,Proceedings of the {ACL} Workshop on Building ...,"Caseli, Helena M. and\nNunes, Maria G. V. an...",{LIHLA}: Shared Task System Description,111--114,,,,,,
7,L02-1313,,,649937,c5c1643517ee6646c47b4ee2b8443d4f62ee1ae5,4,http://www.lrec-conf.org/proceedings/lrec2002/...,European Language Resources Association (ELRA),"Las Palmas, Canary Islands - Spain",2002,...,Proceedings of the Third International Confere...,"Baldwin, Timothy and\nBilac, Slaven and\nOku...",Enhanced {J}apanese Electronic Dictionary Look-up,,,,,,,
8,R13-1045,We describe an approach to building a morpholo...,We describe an approach to building a morpholo...,690455,0b125557ba23075532380e88fb990933838975b7,2,https://aclanthology.org/R13-1045,"INCOMA Ltd. Shoumen, BULGARIA","Hissar, Bulgaria",2013,...,Proceedings of the International Conference Re...,"Khaliq, Bilal and\nCarroll, John",Unsupervised Induction of {A}rabic Root and Pa...,350--356,,,,,,
9,W05-0821,Statistical machine translation systems use a ...,Statistical machine translation systems use a ...,1966857,2a05c9c5373a3e1e01b8161e6687b960ab3d2ff5,55,https://aclanthology.org/W05-0821,Association for Computational Linguistics,"Ann Arbor, Michigan",2005,...,Proceedings of the {ACL} Workshop on Building ...,"Kirchhoff, Katrin and\nYang, Mei",Improved Language Modeling for Statistical Mac...,125--128,,,,,,


In [4]:
# print title, abstract and full text of the first row
print(df_reduced.iloc[0]['title'])
print(df_reduced.iloc[0]['abstract'])
print(df_reduced.iloc[0]['full_text'])

A Study on Word Similarity using Context Vector Models
There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word , an

## Connect to Vector DB

In [51]:
# Initialize the Pinecone client
pc = Pinecone(api_key="PINECONE_API_KEY")

# Optional: create a serverless index (requires `from pinecone import ServerlessSpec`)
# pc.create_index(name="example-index", dimension=1024,
#     spec=ServerlessSpec(cloud='aws', region='us-east-1')
# )

# Target the index
pinecone_index = pc.Index("multilingual-e5-large")
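
The client cells in this notebook pass placeholder strings as API keys. One common way to avoid pasting real keys into a notebook is to read them from environment variables. The sketch below assumes the keys are exported under the same names as the placeholders (`PINECONE_API_KEY`, `OPEN_AI_KEY`); `require_env` is a hypothetical helper, not part of either SDK.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or fail loudly."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Hypothetical usage in the cells of this notebook:
# pc = Pinecone(api_key=require_env("PINECONE_API_KEY"))
# client = OpenAI(api_key=require_env("OPEN_AI_KEY"))
```

Failing at lookup time gives a clearer error than an authentication failure on the first request.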

## Upload data to the index

### Loading reduced dataset for testing

#### Generate embeddings from the dataset

In [33]:
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=df_reduced['full_text'].tolist(),
    parameters={
        "input_type": "passage",
        "truncate": "END"
    }
)

In [34]:
embeddings

EmbeddingsList(
  model='multilingual-e5-large',
  vector_type='dense',
  data=[
    {'vector_type': dense, 'values': [0.0005664825439453125, -0.025634765625, ..., -0.019683837890625, -0.0157012939453125]},
    {'vector_type': dense, 'values': [0.048675537109375, -0.033416748046875, ..., -0.019744873046875, 0.024078369140625]},
    ... (6 more embeddings) ...,
    {'vector_type': dense, 'values': [0.02337646484375, -0.0054168701171875, ..., -0.041595458984375, 0.03546142578125]},
    {'vector_type': dense, 'values': [0.033172607421875, -0.019378662109375, ..., -0.0194091796875, 0.0030612945556640625]}
  ],
  usage={'total_tokens': 3596}
)

#### Attach metadata to the embeddings before uploading

In [35]:
vectors = []
for i, e in enumerate(embeddings):
    # Row i of the DataFrame corresponds to embedding i
    row = df_reduced.iloc[i]
    vectors.append({
        "id": row['acl_id'],
        "values": e.values,
        "metadata": {
            'title': row['title'],
            'author': row['author'],
            'url': row['url'],
            'abstract': row['abstract']
        }
    })
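
Several rows in the preview above show an empty abstract. Pinecone's documentation states that null metadata values are not supported, so on a larger slice of the corpus it may be necessary to drop missing fields before upserting. A minimal sketch, assuming missing values arrive as `None` or `NaN`; `clean_metadata` is a hypothetical helper introduced here for illustration.

```python
import math

def clean_metadata(metadata):
    """Return a copy of `metadata` without None or NaN values."""
    cleaned = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        cleaned[key] = value
    return cleaned

example = clean_metadata({
    "title": "Bootstrapping Large Sense Tagged Corpora",
    "abstract": float("nan"),  # missing value as pandas would represent it
    "url": None,
})
print(example)
```

In the loop above, the `"metadata"` dictionary could be wrapped in `clean_metadata(...)` before it is appended.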

#### Upload embeddings to the Vector DB

In [36]:
response = pinecone_index.upsert(vectors=vectors)
response

{'upserted_count': 10}
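
Ten vectors fit in a single request, but the full corpus would not. A common approach is to upsert in fixed-size batches. The sketch below shows only the batching logic; `chunked` is a hypothetical helper and the batch size of 100 is an illustrative assumption, not a value taken from this notebook.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical usage with the objects defined in this notebook:
# for batch in chunked(vectors, 100):
#     pinecone_index.upsert(vectors=batch)

batches = list(chunked(list(range(10)), 4))
print(batches)
```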

## Query the index

### Example of querying the index

In [37]:
query = "There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word , and all the context features are adjusted according to their IDF (inverse document frequency) values. The agglomerative clustering algorithm is applied to group similar words according to their similarity values. It turns out that words with similar syntactic categories and semantic classes are grouped together."

### Embed the query

In [38]:
x = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={
        "input_type": "query"
    }
)

### Query similar records in the index

In [39]:
results = pinecone_index.query(
    vector=x[0].values,
    top_k=3,
    include_values=False,
    include_metadata=True
)
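
To make the ranking step less opaque, here is a small pure-Python sketch of what a `top_k` similarity query computes conceptually. It assumes the index uses the cosine metric, which this notebook does not show; a real vector database uses approximate search structures rather than this brute-force scan. `cosine_similarity` and `top_k` are illustrative names.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k):
    """Return the k (id, score) pairs most similar to `query`, best first."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k([1.0, 0.1], toy_vectors, k=2)
print(ranked)
```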

In [40]:
results

{'matches': [{'id': 'O02-2002',
              'metadata': {'abstract': 'There is a need to measure word '
                                       'similarity when processing natural '
                                       'languages, especially when using '
                                       'generalization, classification, or '
                                       'example -based approaches. Usually, '
                                       'measures of similarity between two '
                                       'words are defined according to the '
                                       'distance between their semantic '
                                       'classes in a semantic taxonomy . The '
                                       'taxonomy approaches are more or less '
                                       'semantic -based that do not consider '
                                       'syntactic similarit ies. However, in '
                                       'rea

## Delete all records from DB

In [32]:
# Delete all vectors
# pinecone_index.delete(delete_all=True)

{}

## Load LLM

In [41]:
client = OpenAI(api_key="OPEN_AI_KEY")

## Setting up a prompt

In [42]:
template = """
You are an assistant that provides answers to questions based on
a given context. 

Answer the question based on the context. If you can't answer the
question, reply "I don't know".

Be as concise as possible and go straight to the point.

Context: {context}

Question: {question}
"""

example_context = "Word similarity is important in natural language processing because it helps in generalization, classification, and example-based approaches."
example_question = "Why is word similarity important in natural language processing?"

In [43]:
query_with_context = template.format(context=example_context, question=example_question)

In [44]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[ 
        { "role": "system", "content": "You are a helpful assistant." },
        {
            "role": "user",
            "content": query_with_context,
        },
    ]
)

In [45]:
completion.choices[0].message

ChatCompletionMessage(content='Word similarity is important in natural language processing because it aids in generalization, classification, and example-based approaches.', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None)

## RAG System

In [62]:
question = "Explain to me word similarity when processing natural languages, in Spanish."

### Transform question to embedding

In [63]:
x = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[question],
    parameters={
        "input_type": "query"
    }
)

In [64]:
x

EmbeddingsList(
  model='multilingual-e5-large',
  vector_type='dense',
  data=[
    {'vector_type': dense, 'values': [0.0121917724609375, -0.0251617431640625, ..., -0.0251617431640625, -0.0377197265625]}
  ],
  usage={'total_tokens': 22}
)

### Query the vector DB for similar records

In [65]:
results = pinecone_index.query(
    vector=x[0].values,
    top_k=1,
    include_values=False,
    include_metadata=True
)

In [66]:
results.matches[0].metadata['abstract']

'There is a need to measure word similarity when processing natural languages, especially when using generalization, classification, or example -based approaches. Usually, measures of similarity between two words are defined according to the distance between their semantic classes in a semantic taxonomy . The taxonomy approaches are more or less semantic -based that do not consider syntactic similarit ies. However, in real applications, both semantic and syntactic similarities are required and weighted differently. Word similarity based on context vectors is a mixture of syntactic and semantic similarit ies. In this paper, we propose using only syntactic related co-occurrences as context vectors and adopt information theoretic models to solve the problems of data sparseness and characteristic precision. The probabilistic distribution of co-occurrence context features is derived by parsing the contextual environment of each word , and all the context features are adjusted according to t

### Set up the prompt for the LLM

In [None]:
similar_vectors_context = results.matches[0].metadata['abstract']

In [67]:
query_with_context = template.format(context=similar_vectors_context, question=question)
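
The cell above passes a single abstract as context because the query used `top_k=1`. If more matches were retrieved, their abstracts would have to be merged into one context string. A minimal sketch, assuming each match exposes a metadata dictionary with `title` and `abstract` keys as uploaded earlier; `build_context` and the `max_chars` budget are hypothetical.

```python
def build_context(metadatas, max_chars=4000):
    """Join non-empty abstracts into one numbered context string.

    Stops adding entries once the character budget would be exceeded.
    """
    parts = []
    used = 0
    for md in metadatas:
        abstract = md.get("abstract")
        # Skip records whose abstract is missing, NaN, or blank
        if not isinstance(abstract, str) or not abstract.strip():
            continue
        entry = f"[{len(parts) + 1}] {md.get('title', 'Untitled')}\n{abstract}"
        if used + len(entry) > max_chars:
            break
        parts.append(entry)
        used += len(entry)
    return "\n\n".join(parts)

sample = [
    {"title": "A", "abstract": "First."},
    {"title": "B", "abstract": ""},
    {"title": "C", "abstract": "Third."},
]
context = build_context(sample)
print(context)
```

With the objects in this notebook, the input would be something like `[m.metadata for m in results.matches]`.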

In [68]:
completion = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        { "role": "system", "content": "You are a helpful assistant." },
        {
            "role": "user",
            "content": query_with_context,
        },
    ]
)

In [69]:
completion.choices[0].message

ChatCompletionMessage(content='La similitud de palabras al procesar idiomas naturales se mide generalmente según la distancia entre sus clases semánticas en una taxonomía semántica. Las medidas de similitud suelen ser semánticas y no consideran similitudes sintácticas. Sin embargo, en aplicaciones reales, se requieren similitudes semánticas y sintácticas, que se ponderan de manera diferente. La similitud de palabras basada en vectores de contexto combina sintácticas y semánticas. En este enfoque, se proponen co-ocurrencias relacionadas sintácticamente como vectores de contexto y se utilizan modelos teóricos de la información para abordar problemas de escasez de datos y precisión. Se ajustan las características de contexto según sus valores de IDF (frecuencia inversa de documentos), y se aplica un algoritmo de clustering aglomerativo para agrupar palabras similares. Así, las palabras con categorías sintácticas y clases semánticas similares se agrupan juntas.', refusal=None, role='assist
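
The cells in this section can be folded into one function. The sketch below takes the three external steps (embedding, vector search, generation) as callables so that it runs without network access; the `fake_*` stand-ins and `short_template` are placeholders for illustration, not the real Pinecone or OpenAI calls.

```python
def answer_question(question, embed, search, generate, template):
    """Run retrieve-then-generate for a single question."""
    query_vector = embed(question)      # question -> embedding
    abstracts = search(query_vector)    # embedding -> list of abstracts
    context = "\n\n".join(abstracts)
    prompt = template.format(context=context, question=question)
    return generate(prompt)             # prompt -> answer text

# Stand-ins so the sketch runs offline.
seen_prompts = []

def fake_embed(text):
    return [float(len(text))]

def fake_search(vector):
    return ["Word similarity can be measured with context vectors."]

def fake_generate(prompt):
    seen_prompts.append(prompt)
    return "stub answer"

short_template = "Context: {context}\n\nQuestion: {question}\n"
result = answer_question(
    "What is word similarity?", fake_embed, fake_search, fake_generate, short_template
)
print(result)
```

In this notebook, `embed` would wrap `pc.inference.embed`, `search` would wrap `pinecone_index.query`, and `generate` would wrap `client.chat.completions.create`, with `template` being the prompt defined earlier.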