# **Retrieval Evaluation Setup**

In [1]:
%pip install --quiet --upgrade bitsandbytes langchain langchain-community langchain-huggingface transformers beautifulsoup4 faiss-gpu rank_bm25 lark

In [2]:
from langchain_core.documents import Document
from langchain.retrievers import EnsembleRetriever # Supports Ensembling of results from multiple retrievers
from langchain_community.retrievers import BM25Retriever
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd
import os
from google.colab import files

## **User Action Required**

1. Run the code below to create the ```data``` and ```retriever_eval``` folders.

2. When prompted, upload the following file:
- ```eval_dataset.csv```


In [3]:
data_folder = os.path.join(os.getcwd(), 'data')
output_folder = os.path.join(os.getcwd(), 'retriever_eval')
os.makedirs(data_folder, exist_ok=True)
os.makedirs(output_folder, exist_ok=True)

In [4]:
uploaded_files = files.upload()

Saving eval_dataset.csv to eval_dataset.csv


In [5]:
for file_name in uploaded_files.keys():
    os.rename(file_name, os.path.join(data_folder, file_name))

Your folder structure should now look as such:

```
data
- eval_dataset.csv

retriever_eval
-
```

**Control Variables**

- Fix the document chunking/splitting method based on our <u>prior research</u> (the splitter is applied after the data is read in)
  - RecursiveCharacterTextSplitter
    - chunk_size=250
    - chunk_overlap=50

In [6]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250, chunk_overlap=50, add_start_index=True
)
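To make the two control parameters concrete, here is a simplified fixed-window sketch of what `chunk_size` and `chunk_overlap` mean. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and spaces, so its chunks are usually shorter than the limit). The `sliding_chunks` helper is hypothetical and only illustrates the size/overlap arithmetic; the start index it returns plays the same role as the one recorded by `add_start_index=True`.

```python
def sliding_chunks(text: str, chunk_size: int = 250, chunk_overlap: int = 50) -> list[tuple[int, str]]:
    """Split text into fixed-size windows, returning (start_index, chunk) pairs."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # each new window starts 200 characters later
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks


# A 600-character dummy text yields three windows.
sample_text = "x" * 600
for start, chunk in sliding_chunks(sample_text):
    print(start, len(chunk))
# prints:
# 0 250
# 200 250
# 400 200
```

Consecutive windows share their last/first 50 characters, which is what keeps a sentence that straddles a boundary visible in both chunks.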

**Sample DataFrame that replicates the layout of the file we will read in**

```
queries = {
    "query": ["What is the best food in Finland?", "What is the best food in Finland?", "What hikes can I do in Iceland?", "What hikes can I do in Iceland?"],
    "query_number": [1, 1, 2, 2],
    "content": ["""Traditional Finnish cuisine features several iconic dishes, each offering a unique taste of the country's culinary heritage. One of the most celebrated is Karjalanpaisti, or Karelian hot pot, a hearty stew made with a combination of pork and beef, and sometimes lamb, seasoned with black peppercorns and allspice. This dish holds significant cultural importance and was voted Finland's national dish in 2007. Another staple is Ruisleipä, a dense, dark rye bread that is a cornerstone of the Finnish diet. Made from whole-grain rye flour, it is typically unsweetened and free from spices, distinguishing it from other Nordic rye breads. In 2017, ruisleipä was chosen as Finland's national food, underscoring its integral role in Finnish cuisine. For those with a sweet tooth, Mustikkapiirakka, or blueberry pie, is a must-try dessert. This delightful treat is especially popular during the summer months when Finnish forests are abundant with bilberries, a close relative of the blueberry. The pie is best enjoyed warm, straight from the oven, often accompanied by a scoop of vanilla ice cream. These dishes represent just a glimpse into Finland's rich culinary traditions, each offering a unique and authentic taste of the country's heritage.""",
    """Exploring Iceland offers various transportation options, each catering to different preferences and travel plans. Renting a car is the most popular choice, providing flexibility to explore at your own pace and access remote areas. The Ring Road, or Route 1, encircles the country and connects most inhabited regions, making it ideal for road trips. However, be prepared for varying road conditions, especially in rural areas where some roads remain unpaved. Public transportation is available but limited, primarily concentrated in Reykjavik and other urban centers. The Strætó bus system operates within the capital and offers some long-distance routes, but services can be infrequent in rural areas, requiring careful planning. Domestic flights are a viable option for covering long distances quickly, with airports in Reykjavik, Akureyri, and Egilsstaðir providing regular services. This is particularly useful for reaching remote regions or during winter when some roads may be impassable. Additionally, ferries connect certain coastal towns and islands, such as the ferry to the Westman Islands, offering a scenic mode of travel. Organized tours are also available, providing guided experiences to popular destinations without the need to navigate yourself. Each mode of transportation has its advantages and considerations, so choosing the best option depends on your itinerary, comfort level with driving, and desire for flexibility.""",
    """Iceland offers a diverse array of hiking opportunities, catering to both casual walkers and seasoned trekkers. One of the most renowned trails is the Laugavegur Trail, a 54 km route from Landmannalaugar to Þórsmörk, celebrated for its vibrant rhyolite mountains, expansive lava fields, and geothermal zones. Typically completed over four days, this trail provides a comprehensive experience of Iceland's varied landscapes. For a more challenging endeavor, the **Fimmvörðuháls Trail** extends 25 km from Skógafoss to Þórsmörk, guiding hikers past numerous waterfalls and between two glaciers, Eyjafjallajökull and Mýrdalsjökull. This trek is often regarded as one of Iceland's most scenic. For those seeking shorter excursions, the hike to Glymur Waterfall, Iceland's second-highest waterfall, offers a rewarding experience with panoramic views over the surrounding valleys. Another accessible option is the Reykjadalur Hot Springs Trail, leading through geothermal areas to natural hot springs where hikers can enjoy a relaxing soak. Each of these trails showcases Iceland's unique natural beauty, from cascading waterfalls and geothermal hot springs to majestic glaciers and volcanic landscapes.""",
    """Icelandic cuisine offers a variety of unique and traditional dishes that reflect the country's rich cultural heritage. One of the most iconic foods is the Icelandic hot dog, or 'pylsur,' made from a blend of lamb, beef, and pork, and typically served with condiments like ketchup, sweet mustard, remoulade, and both raw and fried onions. A renowned spot to try this delicacy is Bæjarins Beztu Pylsur, a popular hot dog stand in Reykjavík. Another traditional dish is 'plokkfiskur,' a hearty fish stew combining white fish such as cod or haddock with potatoes, onions, milk, butter, and flour, creating a comforting and flavorful meal. For those interested in more adventurous flavors, 'hákarl,' or fermented shark, offers a distinctive taste of Iceland's culinary traditions. This dish involves a specific fermentation process that results in a pungent flavor, often considered an acquired taste. These dishes, among others, provide a glimpse into Iceland's unique gastronomic landscape, shaped by its history and natural resources."""],
    "sample": ["positive", "negative", "positive", "negative"]
}

queries_df = pd.DataFrame(queries)
```
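If you are preparing your own evaluation file, a quick structural check along these lines may catch problems early. This is a minimal sketch: the `validate_rows` helper is hypothetical, and the allowed labels are an assumption based on the two values (`positive`, `negative`) that appear in the sample above.

```python
REQUIRED_COLUMNS = {"query", "query_number", "content", "sample"}
ALLOWED_LABELS = {"positive", "negative"}  # assumption: the only labels seen in the sample


def validate_rows(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the rows (empty if none)."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            problems.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        if row["sample"] not in ALLOWED_LABELS:
            problems.append(f"row {i}: unexpected sample label {row['sample']!r}")
        if not str(row["content"]).strip():
            problems.append(f"row {i}: empty content")
    return problems


example_rows = [
    {"query": "What hikes can I do in Iceland?", "query_number": 2,
     "content": "Iceland offers a diverse array of hiking opportunities.", "sample": "positive"},
    {"query": "What hikes can I do in Iceland?", "query_number": 2,
     "content": "", "sample": "negative"},
]
print(validate_rows(example_rows))
```

With pandas, the rows can be obtained from the DataFrame via `queries_df.to_dict("records")`.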

In [9]:
queries_df = pd.read_csv(os.path.join(data_folder, 'eval_dataset.csv'))

In [11]:
queries_df

| | query | query_number | content | sample |
|---|---|---|---|---|
| 0 | What are the top 5 food I cannot miss in Iceland? | 1 | Memorable meals await at these hand-picked eat... | positive |
| 1 | What are the top 5 food I cannot miss in Iceland? | 1 | Volcanic activity is a fact of life in Iceland... | negative |
| 2 | What are the top 5 food I cannot miss in Iceland? | 1 | When the alternative rock-band The Sugarcubes ... | negative |
| 3 | What are the top 5 food I cannot miss in Iceland? | 1 | When planning a journey to Iceland, one of the... | negative |
| 4 | Where do you recommend for a cycling trip? | 2 | Did you know that Finland is a great holiday d... | positive |
| 5 | Where do you recommend for a cycling trip? | 2 | Information on how to get to and from the main... | negative |
| 6 | Where do you recommend for a cycling trip? | 2 | Discover design architecture in the forest: Yo... | negative |
| 7 | Where do you recommend for a cycling trip? | 2 | Sweden may not be the first place that comes t... | negative |
| 8 | I'm interested in vikings. Where can I visit t... | 3 | Vikings have captured the imaginations of all ... | positive |
| 9 | I'm interested in vikings. Where can I visit t... | 3 | Ribersborgs Kallbadhus in Malmö and Kallbadhus... | negative |


In [12]:
queries_df_split = [group for query, group in queries_df.groupby('query_number')]

In [13]:
queries_df_split[0]

| | query | query_number | content | sample |
|---|---|---|---|---|
| 0 | What are the top 5 food I cannot miss in Iceland? | 1 | Memorable meals await at these hand-picked eat... | positive |
| 1 | What are the top 5 food I cannot miss in Iceland? | 1 | Volcanic activity is a fact of life in Iceland... | negative |
| 2 | What are the top 5 food I cannot miss in Iceland? | 1 | When the alternative rock-band The Sugarcubes ... | negative |
| 3 | What are the top 5 food I cannot miss in Iceland? | 1 | When planning a journey to Iceland, one of the... | negative |
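Each group pairs one query with its positive and negative passages, so it can be useful to tally the labels per `query_number` before generating the eval files. The sketch below does this with the standard library; the helper names are hypothetical, and the idea that every query should have at least one positive passage is an assumption based on the preview shown here, not a rule stated by the dataset.

```python
from collections import Counter, defaultdict


def label_counts(rows: list[dict]) -> dict:
    """Count 'sample' labels for each query_number."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["query_number"]][row["sample"]] += 1
    return {query_number: dict(counter) for query_number, counter in counts.items()}


def queries_without_positive(rows: list[dict]) -> list:
    """Return the query numbers that have no 'positive' passage (assumed to be a problem)."""
    return sorted(q for q, c in label_counts(rows).items() if c.get("positive", 0) == 0)


example_rows = [
    {"query_number": 1, "sample": "positive"},
    {"query_number": 1, "sample": "negative"},
    {"query_number": 1, "sample": "negative"},
    {"query_number": 2, "sample": "negative"},
]
print(label_counts(example_rows))
print(queries_without_positive(example_rows))
```

As with any row-based helper, `queries_df.to_dict("records")` would supply the rows when working from the DataFrame.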


In [14]:
for query_df in queries_df_split:
    query_number = query_df["query_number"].iloc[0]
    query_eval_fname = f'query_{query_number}_eval.csv'
    query_docs_fname = f'query_{query_number}_docs.csv'

    # Wrap each candidate passage for this query in a LangChain Document.
    docs = [Document(page_content=content) for content in query_df['content']]

    # Split the documents with the fixed splitter defined above.
    doc_chunks = text_splitter.split_documents(docs)

    # One row per chunk, with an empty `score` column to be filled in later.
    query_eval_df = pd.DataFrame({
        'chunks': [chunk.page_content for chunk in doc_chunks],
        'score': ''
    })

    # One row per original (unsplit) document, kept for reference.
    query_docs_df = pd.DataFrame({
        'docs': [doc.page_content for doc in docs]
    })

    query_eval_df.to_csv(os.path.join(output_folder, query_eval_fname), index=False)
    query_docs_df.to_csv(os.path.join(output_folder, query_docs_fname), index=False)

In [15]:
unique_queries_df = queries_df[['query', 'query_number']].drop_duplicates()

In [16]:
unique_queries_df

| | query | query_number |
|---|---|---|
| 0 | What are the top 5 food I cannot miss in Iceland? | 1 |
| 4 | Where do you recommend for a cycling trip? | 2 |
| 8 | I'm interested in vikings. Where can I visit t... | 3 |
| 12 | What can I do in Finland in autumn? | 4 |
| 16 | What are some outdoor activites I can do in Ic... | 5 |


In [17]:
unique_queries_df.to_csv(os.path.join(output_folder, 'queries.csv'), index=False)

Your folder structure should now look as such:

```
data
  - eval_dataset.csv

retriever_eval
  - queries.csv
  - query_1_eval.csv
  - query_1_docs.csv
  - query_2_eval.csv
  - query_2_docs.csv
  ...
  - query_n_eval.csv
  - query_n_docs.csv
```
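The layout above can also be checked programmatically. The following is a minimal sketch with hypothetical helper names; it derives the expected file names from the query numbers using the same naming pattern as the export loop and reports any that are absent.

```python
import tempfile
from pathlib import Path


def expected_outputs(query_numbers) -> list[str]:
    """List the file names the export step should produce for the given query numbers."""
    names = ["queries.csv"]
    for n in sorted(set(query_numbers)):
        names += [f"query_{n}_eval.csv", f"query_{n}_docs.csv"]
    return names


def missing_outputs(folder, query_numbers) -> list[str]:
    """Return the expected file names that do not exist in the folder."""
    folder = Path(folder)
    return [name for name in expected_outputs(query_numbers) if not (folder / name).is_file()]


# Demonstration on a throwaway folder that only contains two of the expected files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["queries.csv", "query_1_eval.csv"]:
        (Path(tmp) / name).touch()
    print(missing_outputs(tmp, [1, 2]))
```

In the notebook itself, the call would be something like `missing_outputs(output_folder, queries_df["query_number"])`, which should return an empty list after a successful run.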

**Save the experiment output**

In [18]:
%cd /content/

/content


In [19]:
!zip -r retriever_eval.zip retriever_eval

  adding: retriever_eval/ (stored 0%)
  adding: retriever_eval/query_3_docs.csv (deflated 57%)
  adding: retriever_eval/query_3_eval.csv (deflated 63%)
  adding: retriever_eval/query_2_docs.csv (deflated 62%)
  adding: retriever_eval/query_1_docs.csv (deflated 53%)
  adding: retriever_eval/query_4_eval.csv (deflated 63%)
  adding: retriever_eval/queries.csv (deflated 34%)
  adding: retriever_eval/query_2_eval.csv (deflated 67%)
  adding: retriever_eval/query_4_docs.csv (deflated 57%)
  adding: retriever_eval/query_5_eval.csv (deflated 61%)
  adding: retriever_eval/query_5_docs.csv (deflated 55%)
  adding: retriever_eval/query_1_eval.csv (deflated 59%)
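If the `zip` command is not available (for example when running outside Colab), the same archive can be produced from Python. This is a sketch of one alternative using the standard library's `shutil.make_archive`; the `zip_folder` helper is hypothetical and is not part of the original notebook.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_folder(folder: str) -> str:
    """Create <folder>.zip next to the folder and return the archive path."""
    folder_path = Path(folder).resolve()
    return shutil.make_archive(
        base_name=str(folder_path),        # archive path without the .zip extension
        format="zip",
        root_dir=str(folder_path.parent),  # paths in the archive are relative to this
        base_dir=folder_path.name,         # keep the top-level folder name in the archive
    )


# Demonstration on a throwaway copy of the folder layout.
with tempfile.TemporaryDirectory() as tmp:
    demo_folder = Path(tmp) / "retriever_eval"
    demo_folder.mkdir()
    (demo_folder / "queries.csv").write_text("query,query_number\n", encoding="utf-8")
    archive = zip_folder(str(demo_folder))
    with zipfile.ZipFile(archive) as zf:
        print(zf.namelist())
```

Keeping `base_dir` set to the folder name means the archive unpacks into a `retriever_eval` directory, matching what `zip -r` produces.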


In [20]:
files.download("retriever_eval.zip")


## **User Action Required**

1. Now that the archive has downloaded, unzip it and fill in the ```score``` column of each ```query_n_eval.csv``` file :)
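While scoring, it can help to track which files still have chunks without a score. The sketch below counts rows whose `score` cell is still empty in every `query_*_eval.csv` file; the `unscored_counts` helper is hypothetical, and it makes no assumption about the scoring scale, which this notebook does not define.

```python
import csv
import tempfile
from pathlib import Path


def unscored_counts(folder) -> dict[str, int]:
    """Map each query_*_eval.csv file name to the number of rows with an empty score."""
    counts = {}
    for path in sorted(Path(folder).glob("query_*_eval.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            rows = list(csv.DictReader(handle))
        counts[path.name] = sum(1 for row in rows if not (row.get("score") or "").strip())
    return counts


# Demonstration on a throwaway folder with one partially scored file.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "query_1_eval.csv"
    with demo_file.open("w", newline="", encoding="utf-8") as handle:
        csv.writer(handle).writerows([["chunks", "score"], ["first chunk", "1"], ["second chunk", ""]])
    print(unscored_counts(tmp))
```

A file is fully scored once its count reaches zero.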