# Article Retrieval System
Author: Karolina Surówka

The diagram below shows the overall strategy of the program.

![image.png](strategy.png)

#### **Installation**

1. Install the required dependencies specified in `requirements.txt` using the command: `pip install -r requirements.txt`.

2. Install [Ollama](https://ollama.com/download), which provides the chat completion service.


In [None]:
%pip install -r requirements.txt

Import the required libraries.

In [2]:
import warnings
import os
from openai import OpenAI
from dotenv import load_dotenv
import pandas as pd
from langchain_community.document_loaders.dataframe import DataFrameLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
import cohere
from langchain_cohere import CohereEmbeddings
from langchain_community.vectorstores import Chroma

In [3]:
warnings.filterwarnings("ignore")

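First, load the data into a pandas DataFrame, then pass it to `DataFrameLoader` to convert each row into a document. With `page_content_column="Title"`, the article title becomes the document content, and the remaining columns (such as `Text`) are stored as metadata.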

In [5]:
# read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv('medium.csv')

loader = DataFrameLoader(data_frame=df, page_content_column="Title")
pages = loader.load()
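
To make the row-to-document mapping concrete, here is a minimal plain-Python sketch of the idea. `SimpleDocument` and `rows_to_documents` are hypothetical names invented for this illustration (they are not part of LangChain), and the row texts are placeholders rather than real article bodies.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document object (not the LangChain class)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def rows_to_documents(rows, page_content_column):
    """Turn each row (a dict) into one document.

    The chosen column becomes the page content; every other column
    is kept as metadata.
    """
    documents = []
    for row in rows:
        metadata = {key: value for key, value in row.items()
                    if key != page_content_column}
        documents.append(
            SimpleDocument(page_content=row[page_content_column], metadata=metadata)
        )
    return documents


# Placeholder rows: the titles come from the dataset, the texts are dummies.
example_rows = [
    {"Title": "How to Use ggplot2 in Python", "Text": "placeholder body A"},
    {"Title": "Brain: A Mystery", "Text": "placeholder body B"},
]
example_docs = rows_to_documents(example_rows, page_content_column="Title")
print(example_docs[0].page_content)
print(example_docs[0].metadata)
```

Under this mapping, anything that works on the page content (splitting, embedding) sees only the title, while the article body travels along in the metadata.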

In [6]:
df

Unnamed: 0,Title,Text
0,A Beginner’s Guide to Word Embedding with Gens...,1. Introduction of Word2vec\n\nWord2vec is one...
1,Hands-on Graph Neural Networks with PyTorch & ...,"In my last article, I introduced the concept o..."
2,How to Use ggplot2 in Python,Introduction\n\nThanks to its strict implement...
3,Databricks: How to Save Data Frames as CSV Fil...,Photo credit to Mika Baumeister from Unsplash\...
4,A Step-by-Step Implementation of Gradient Desc...,A Step-by-Step Implementation of Gradient Desc...
...,...,...
1386,Brain: A Mystery,“The most beautiful experience we can have is ...
1387,Machine Learning: Lincoln Was Ahead of His Time,Photo by Jp Valery on Unsplash\n\nIn the 45th ...
1388,AI and Us — an Opera Experience. In my previou...,EKHO COLLECTIVE: OPERA BEYOND SERIES\n\nIn my ...
1389,Digital Skills as a Service (DSaaS),Have you ever thought about what will be in th...


> What the loaded documents look like:

In [7]:
print(len(pages))
print(pages[0].page_content[0:500])
print(pages[0].metadata)

1391
A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model
{'Text': '1. Introduction of Word2vec\n\nWord2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.\n\nThere are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target context. Generally, the skip-gram me

Each document is then split into chunks. For that purpose, `RecursiveCharacterTextSplitter` is used with `chunk_size = 400` and `chunk_overlap = 25`. Since the page content here is the article title, any document shorter than 400 characters stays as a single chunk.

In [8]:
chunk_size =400
chunk_overlap = 25

splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
splitted_documents = splitter.split_documents(pages)
splitted_documents

[Document(page_content='A Beginner’s Guide to Word Embedding with Gensim Word2Vec Model', metadata={'Text': '1. Introduction of Word2vec\n\nWord2vec is one of the most popular technique to learn word embeddings using a two-layer neural network. Its input is a text corpus and its output is a set of vectors. Word embedding via word2vec can make natural language computer-readable, then further implementation of mathematical operations on words can be used to detect their similarities. A well-trained set of word vectors will place similar words close to each other in that space. For instance, the words women, men, and human might cluster in one corner, while yellow, red and blue cluster together in another.\n\nThere are two main training algorithms for word2vec, one is the continuous bag of words(CBOW), another is called skip-gram. The major difference between these two methods is that CBOW is using context to predict a target word while skip-gram is using a word to predict a target contex
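
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-window splitter. This is only a sketch of the overlap idea: the real `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraph and line breaks, which this hypothetical `split_with_overlap` function does not do.

```python
def split_with_overlap(text, chunk_size=400, chunk_overlap=25):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. This is a
    simplified illustration, not the algorithm used by LangChain.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Synthetic 1000-character text, used only for the demonstration
sample_text = "".join(chr(97 + i % 26) for i in range(1000))
sample_chunks = split_with_overlap(sample_text)
print([len(chunk) for chunk in sample_chunks])
print(split_with_overlap("A short title"))
```

A short string such as a title fits inside one window, so it comes back as a single chunk.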

The next key step is to compute embeddings for all documents. After testing several providers and models, I chose `CohereEmbeddings` with the `embed-multilingual-light-v3.0` model.

> NOTE: Before running the cell below, make sure your `COHERE_API_KEY` is set in the `config.env` file.

In [12]:
load_dotenv("config.env")  # loads COHERE_API_KEY into the environment
COHERE_API_KEY = os.environ.get("COHERE_API_KEY")
cohere_client = cohere.Client(api_key=COHERE_API_KEY)
embeddings = CohereEmbeddings(model="embed-multilingual-light-v3.0")  # performed better in testing
# embeddings = CohereEmbeddings(model="embed-english-light-v3.0") ## worse

# embeddings = OpenAIEmbeddings()
# embeddings = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
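
Since `os.environ.get` silently returns `None` when the key is missing, a small guard can fail early with a clearer message. The sketch below uses only the standard library; `require_env` is a hypothetical helper name, not part of `python-dotenv` or Cohere.

```python
import os


def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to config.env"
        )
    return value


# Possible usage after load_dotenv("config.env"):
# COHERE_API_KEY = require_env("COHERE_API_KEY")
```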

Create the vector store

In [13]:
persist_directory = 'docs/chroma/'

vectordb = Chroma.from_documents(
    documents=splitted_documents,
    embedding=embeddings,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

1391


Helper function for displaying retrieved articles.

In [14]:
def display_search_results(question, docs):
    """Print the title and the first 500 characters of each retrieved article."""
    print("===============================", question.upper(), "===============================")

    # enumerate gives the position directly and also handles duplicate documents
    for position, doc in enumerate(docs, start=1):
        print("------ ", " ANSWER ", position, " ------")
        print("Title: ", doc.page_content)
        print("Content: \n\n")
        print(doc.metadata['Text'][0:500])
        print("\n\n\n")
    print("ALL RETRIEVED ARTICLES: ", len(docs))

Example usage

In [32]:
question = "Tell me something about deep learning"
docs = vectordb.similarity_search(question,k=3)
display_search_results(question, docs)  

------   ANSWER  1  ------
Title:  Can Deep Learning help us to rediscover the past?
Content: 


Can Deep Learning help us to rediscover the past?

Deep Learning (D.L.) has already proven to be an outstanding tool to solve a wide range of problems in computer vision and with practical applications in several fields. However, this kind of techniques hasn’t been widely explored in archaeology and specifically, in the detection of archaeological sites. In this article, I want to tell you about the application of D.L. in the analysis of airborne data to detect archaeological traces.

Rune stone




------   ANSWER  2  ------
Title:  Which Deep Learning Framework is Growing Fastest?
Content: 


Which Deep Learning Framework is Growing Fastest?

In September 2018, I compared all the major deep learning frameworks in terms of demand, usage, and popularity in this article. TensorFlow was the undisputed heavyweight champion of deep learning frameworks. PyTorch was the young rookie with lots of 
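
Conceptually, a similarity search compares the embedding of the question with the stored embeddings and returns the closest ones. The sketch below shows this with cosine similarity on invented three-dimensional toy vectors. It is an illustration only: the metric and index that Chroma actually uses depend on its configuration, and real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored_vectors, k=3):
    """Return the labels of the k stored vectors most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), label)
        for label, vector in stored_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:k]]


# Toy vectors invented for the example (not real embeddings)
toy_vectors = {
    "deep learning": [0.9, 0.1, 0.0],
    "random forest": [0.1, 0.9, 0.0],
    "opera": [0.0, 0.1, 0.9],
}
print(top_k([1.0, 0.2, 0.0], toy_vectors, k=2))
```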

**OpenAI Client Initialization**

This code initializes the OpenAI client with a placeholder API key (`'ollama'`) and the base URL `http://localhost:11434/v1`. The base URL points to the OpenAI-compatible endpoint of a local Ollama server running on port 11434, so requests go to Ollama rather than to OpenAI's hosted services. The client library requires an API key value, but the local Ollama server does not use it for authentication.

> NOTE: Make sure your Ollama server is running and that the `llama2` model has been pulled (for example with `ollama pull llama2`).

In [16]:
client = OpenAI(api_key='ollama', base_url="http://localhost:11434/v1")
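
Before sending any chat requests, it can help to check that the local server answers at all. The following standard-library sketch assumes that the Ollama root URL responds with HTTP 200 when the server is up; `ollama_is_running` is a hypothetical helper, not part of the `openai` package or of Ollama itself.

```python
import http.client
import urllib.request


def ollama_is_running(base_url="http://localhost:11434", timeout=2.0):
    """Return True if an HTTP server answers with status 200 at base_url."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error, or malformed response
        return False


# Possible usage before calling the chat completion endpoint:
# if not ollama_is_running():
#     print("Start the Ollama server first")
```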

Helper function for merging the text.

In [17]:
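def merge_document_texts(docs):
    """Concatenate the article texts and keep only the first 500 characters."""
    merged_text = "".join(doc.metadata['Text'] for doc in docs)
    # If the first article is longer than 500 characters, only it contributes
    return merged_text[:500]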
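
Because the merged text is cut at 500 characters, the later articles may not reach the prompt at all. One possible variant, shown here purely as a sketch and not as the approach used in this notebook, gives every retrieved text an equal share of the character budget. `merge_with_budget` is a hypothetical name, and the function works on plain strings.

```python
def merge_with_budget(texts, total_chars=500):
    """Join texts so that each one contributes an equal share of total_chars.

    Separators are added on top of the budget, so the result can be a few
    characters longer than total_chars.
    """
    if not texts:
        return ""
    per_text = total_chars // len(texts)
    return "\n\n".join(text[:per_text] for text in texts)


# Synthetic strings standing in for three article bodies
merged_example = merge_with_budget(["a" * 1000, "b" * 1000, "c" * 10], total_chars=300)
print(len(merged_example))
```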

**Function: ask_llm**

This function queries the chat completion endpoint of the local Ollama server (through the OpenAI client) using the `llama2` large language model (LLM) to answer a given question. It first retrieves the three most relevant documents with a similarity search on the question, then merges their texts into a single string truncated to 500 characters. The merged text and the question are sent together in a user message. The response is streamed, and the function prints each chunk of generated content as it arrives.

In [22]:
def ask_llm(question):
    # Retrieve the three most similar documents and build a short context from them
    docs = vectordb.similarity_search(question, k=3)
    merged_text = merge_document_texts(docs)
    stream = client.chat.completions.create(
        model="llama2",
        messages=[
            {
                "role": "user",
                "content": f"Question: {question}. Content: {merged_text}"
            }
        ],
        stream=True,
        max_tokens=300
    )
    # Print the streamed response piece by piece as it arrives
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            print(chunk.choices[0].delta.content, end="")
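
The two mechanical parts of `ask_llm`, assembling the messages and consuming the streamed chunks, can be tried without a running server. In the sketch below, `build_messages`, `collect_stream`, and `fake_chunk` are hypothetical helpers, the system prompt wording is an assumption rather than something the notebook uses, and the fake chunks only imitate the `choices[0].delta.content` shape read by the loop above.

```python
from types import SimpleNamespace


def build_messages(question, context):
    """Assemble chat messages that keep the context and the question apart."""
    return [
        # Assumed instruction; adjust or drop it as needed
        {"role": "system", "content": "Answer the question using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]


def collect_stream(stream):
    """Join the text pieces of a streamed response into one string."""
    parts = []
    for chunk in stream:
        content = chunk.choices[0].delta.content
        if content is not None:
            parts.append(content)
    return "".join(parts)


def fake_chunk(content):
    """Build an object with the same attribute path as a streamed chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=content))])


example_messages = build_messages("What is Random Forest?", "placeholder context")
fake_stream = [fake_chunk("Deep "), fake_chunk("learning"), fake_chunk(None)]
print(collect_stream(fake_stream))
```

Returning the joined string instead of printing it makes the answer easier to reuse, for example to store it next to the retrieved titles.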

Example usage

In [33]:
ask_llm("Tell me something about deep learning")

Great! Let's dive into how deep learning can help us rediscover the past through the lens of archaeology.

Archaeology is a field that seeks to uncover and understand the history of human societies by analyzing material remains such as artifacts, architecture, and cultural landscapes. However, the process of excavating and documenting these remains can be time-consuming, expensive, and often limited by the availability of data. This is where deep learning can help.

Airborne LiDAR (Light Detection and Ranging) is a remote sensing technology that uses lasers to create highly detailed 3D models of the earth's surface. By analyzing these models using deep learning algorithms, researchers can identify subtle patterns and features that may indicate the presence of archaeological sites.

One study published in the journal Antiquity used airborne LiDAR data to detect previously undetected archaeological sites in the UK. The team trained a deep learning model on a dataset of known archaeologic

In [19]:
ask_llm("What is Random Forest?")

Great to hear that you're enjoying the fast.ai course! A Random Forest is a machine learning algorithm that combines multiple decision trees to improve the accuracy and robustness of the model. It's a popular ensemble learning method used for classification and regression tasks.

In a Random Forest, each decision tree is trained on a random subset of the training data and a random subset of the features. The combination of multiple trees helps to reduce overfitting and improve generalization by averaging out the predictions of individual trees. Additionally, the feature selection process can help identify the most important features in the dataset.

To use Random Forest in your capstone project, you could consider the following steps:

1. Load

In [20]:
ask_llm("What is the difference between Random Forest and Decision Tree?") 

Sure, I'd be happy to help! Here is the content you requested:

---

Question: What is the difference between Random Forest and Decision Tree?

As data scientists, we have an array of algorithms at our disposal for building predictive models. Two of the most popular and widely used techniques are decision trees and random forests. While both can be effective in modeling tasks, they differ significantly in their underlying principles, strengths, and weaknesses. In this article, we will delve into the details of decision trees and random forests, explaining their differences and exploring which one might be better suited to different problems.

Decision Trees: A Conceptually

In [26]:
ask_llm("What are the advantages of using Accumulated Local Effects Plot?")

Sure, here are some advantages of using Accumulative Local Effects (ALE) plots:

1. Interactive Choropleth Maps: ALE plots can be easily converted into interactive choropleth maps using tools like Plotly. This allows users to explore the data in a more intuitive and engaging way, by hovering over different regions to see the corresponding voting results.
2. Matching Geo- and Non-spatial Data: ALE plots can be used to visualize both geographical and non-geographical data, such as voter demographics or candidate names. This allows for a more comprehensive understanding of the election results, by comparing and contrasting different areas.
3. Creating an Interactive Map: By using tools like Plotly, ALE plots can be transformed into interactive maps that allow users to explore the data in a more dynamic and visual way. This can help to identify patterns and trends in the data that may not be immediately apparent from a static plot.
4. Easy to Read: ALE plots are designed to be easy to read