Eystathios Andreopoulos 4630
Giorgos Hatziligos 4835

#### -----------  Step 1 ----------- ####

i.all initial imports
ii.load data from CNN_Article_clean.csv and a sample

#### -----------  Step 1 ----------- ####

In [107]:
import os
import pandas as pd
import numpy as np
from tqdm import tqdm
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_ollama import OllamaLLM
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from sklearn.metrics.pairwise import cosine_similarity

df = pd.read_csv('CNN_Articels_clean.csv')


print("Data format:")
print(f"Number of articles: {len(df)}")
print(df.head())

Data format:
Number of articles: 4076
   Index                                             Author  \
0      0                                 Jacopo Prisco, CNN   
1      2                              Stephanie Bailey, CNN   
2      3  Words by Stephanie Bailey, video by Zahra Jamshed   
3      4                    Paul R. La Monica, CNN Business   
4      7                                            Reuters   

        Date published  Category    Section  \
0  2021-07-15 02:46:59      news      world   
1  2021-05-12 07:52:09      news      world   
2  2021-06-16 02:51:30      news       asia   
3  2022-03-15 09:57:36  business  investing   
4  2022-03-15 11:27:02  business   business   

                                                 Url  \
0  https://www.cnn.com/2021/07/14/world/tusimple-...   
1  https://www.cnn.com/2021/05/12/world/ironhand-...   
2  https://www.cnn.com/2021/06/15/asia/swarm-robo...   
3  https://www.cnn.com/2022/03/15/investing/brics...   
4  https://www.cnn.c

-------------------------------------------------------------- Step 2 -----------------------------------------------------------

Preprocessing Step : This step splits into smaller chunks 

2.1.RecursiveCharacterTextSplitter divides text based on natural separarators 

2.2.We hold from the CNN_Articles_clean column : "Headline" and "Article text" and we combine them 

2.3.Split the text from the combnation of them  "Headline" and "Article text" as a chunk and put it in a list of chunks 

2.4.Metadata for each chunk like idx , chunk_id , title/headline **we can add more**

-------------------------------------------------------------- Step 2 ---------------------------------------------------------- 

In [108]:
# Create the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,         # Size of each chunk (in characters)
    chunk_overlap=80,       # Overlap between chunks (for better continuity)
    length_function=len,     # Function for measuring size (here simply counting characters)
    separators=["\n\n", "\n", ".", " ", ""]  # Preferred splitting points
)


all_chunks = []
all_metadatas = []


for idx, row in tqdm(df.iterrows(), total=len(df), desc="Splitting articles into chunks"):
    title = row['Headline'] if not pd.isna(row['Headline']) else ""
    content = row['Article text'] if not pd.isna(row['Article text']) else ""
    
    # Combine title and content with a separator
    full_text = f"Title: {title}\n\nContent: {content}"
    #full_text = f"Content: {content}"
    # Split the text into chunks
    chunks = text_splitter.split_text(full_text)
    
    # Add the chunks to the lists
    all_chunks.extend(chunks)
    
    # Add metadata for each chunk
    for i in range(len(chunks)):
        metadata = {
            'title': title,
            'article_id': idx,
            'chunk_id': i,
            'source': 'CNN'
        }
        all_metadatas.append(metadata)

print(f"Total number of chunks: {len(all_chunks)}")
print(f"Example chunk: {all_chunks[0][:200]}...")

Splitting articles into chunks: 100%|██████████| 4076/4076 [00:00<00:00, 5615.57it/s]

Total number of chunks: 59420
Example chunk: Title: There's a shortage of truckers, but TuSimple thinks it has a solution: no driver needed - CNN...





-------------------------------------------------------------- Step 3 -----------------------------------------------------------

Initialize a emdedding model 

examples of models

sentence-transformers/all-MiniLM-L6-v2 

sentence-transformers/all-mpnet-base-v2 # Higher quality embeddings, but larger and slower , More powerful model

sentence-transformers/all-MiniLM-L12-v2 # Deeper model

-------------------------------------------------------------- Step 3 -----------------------------------------------------------

In [109]:

embeddings_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={'device': 'cpu'},  
    encode_kwargs={'normalize_embeddings': True}  
)


-------------------------------------------------------------- Step 3_1 -----------------------------------------------------------

Create the vector database 

This step converts all text chunks into vector embeddings and stores them in a FAISS index

FAISS enables efficient similarity search across large collections of embeddings

Create the vector database from texts and their embedding

-------------------------------------------------------------- Step 3_1 -----------------------------------------------------------

In [110]:
# Create the vector database
vectorstore = FAISS.from_texts(
    texts=all_chunks,
    embedding=embeddings_model,
    metadatas=all_metadatas
)

# Save the database to disk
vectorstore.save_local("faiss_index")

print("The vector database was created and saved successfully!")

The vector database was created and saved successfully!


-------------------------------------------------------------- Step 4 -----------------------------------------------------------


4.1 : This step sets up the Large Language Model : llama3.2:1b , llama3.2:3b & temprature = 0.1 

4.2 : we can set the temperature as 0 to take the same response of llm always with a specific question

-------------------------------------------------------------- Step 4 -----------------------------------------------------------

In [111]:
try:
    # Initialize the Llama model via Ollama
    llm = OllamaLLM(model="llama3.2:3b", temperature=0.1)
    
    # Check that the LLM is working correctly
    test_response = llm.invoke("which is the current president of Greece")
    print(f"Test response from LLM: {test_response}")
except Exception as e:
    print(f"Error initializing LLM: {e}")
    print("Make sure that:")
    print("1. You have installed ollama from https://ollama.com/download")
    print("2. You have run: ollama pull llama3.2:1b or llama3.2:3b")
    print("3. Ollama is running in the background: ollama serve")


Test response from LLM: I don't have real-time information, but as of my knowledge cutoff in December 2023, the President of Greece was Katerina Sakellaropoulou. However, please note that this information may have changed since then.

For the most up-to-date information, I recommend checking a reliable news source or the official website of the Greek Presidency for the latest information on the current president of Greece.


-------------------------------------------------------------- Step 5 -----------------------------------------------------------

This step sets up the prompt template and the retrieval-based QA chain

The chain will retrieve relevant chunks and feed them to the LLM to generate answers

-------------------------------------------------------------- Step 5 -----------------------------------------------------------


In [112]:
# Create the template for the prompt
template = """
You are an AI assistant. Answer the following question.
Answer the following question: {question}

Base your answer on the following context:
{context}

Instructions:
- Use the information from the context above to answer the question
- If the context contains relevant information, provide a detailed answer

Answer:
"""

# Create the PromptTemplate
PROMPT = PromptTemplate(
    template=template,
    input_variables=["question", "context"]
)

# Create the question-answer chain with retrieval (RetrievalQA)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple strategy that "stuffs" all retrieved texts into the prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),  # Retrieve the top-20 relevant chunks
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True  # Returns also the documents used
)


-------------------------------------------------------------- Step 5 -----------------------------------------------------------

Define Query functions 

1.)  Process a question using the RAG . 

1.1) We use the the question-answer chain to answer the question and keep from the answer the retrieved docs from dataset which is used

1.2)we print the question , answer of llm , chunk index , and the idx , headline , Article text

2.)  Process a question using the the llm . 

2.1) We simply make a prompt with an faculty + question

2.2) We gave it in the llm and it answers it .



-------------------------------------------------------------- Step 5 -----------------------------------------------------------

In [113]:
def get_retrieval_similarity_scores(question, retrieved_docs, embeddings_model):
    
    
    question_embedding = embeddings_model.embed_query(question)
    
    print("\nSimilarity scores between question and retrieved chunks:")
    print("-" * 80)
    
    similarity_results = []
    
    for i, doc in enumerate(retrieved_docs):
        # Get document embedding only with previous model 
        doc_embedding = embeddings_model.embed_query(doc.page_content)
        
        # Calculate cosine similarity 
        similarity = cosine_similarity([question_embedding], [doc_embedding])[0][0]
        
        title = doc.metadata.get('title', '')
        
        # Store result
        similarity_results.append({
            'index': i+1,
            'article_id': doc.metadata.get('article_id'),
            'chunk_id': doc.metadata.get('chunk_id'),
            'similarity': similarity,
            'title': title
        })
        
        # Print result with proper title display
        print(f"Chunk {i+1} (Article {doc.metadata.get('article_id')}, Chunk {doc.metadata.get('chunk_id')})")
        print(f"Title: {title}")  
        print(f"Similarity Score: {similarity:.4f}")
        print("-" * 80)
    
    # Print statistics
    avg_similarity = np.mean([r['similarity'] for r in similarity_results])
    max_similarity = max([r['similarity'] for r in similarity_results])
    min_similarity = min([r['similarity'] for r in similarity_results])
    
    print(f"\nSimilarity Statistics:")
    print(f"Average similarity score: {avg_similarity:.4f}")
    print(f"Maximum similarity score: {max_similarity:.4f}")
    print(f"Minimum similarity score: {min_similarity:.4f}")
    
    return similarity_results

In [None]:
def query_with_rag(question):
    result = qa_chain.invoke({"query": question})
    answer = result["result"]
    source_docs = result["source_documents"]
    
    print(f"Question: {question}")
    print(f"Answer (with RAG): {answer}")
    
    
    similarity_scores = get_retrieval_similarity_scores(question, source_docs, embeddings_model)
    
    print("\nThe following chunks were used:")
    for i, doc in enumerate(source_docs):
        print(f"\nChunk {i+1}:")
        print(f"Metadata: {doc.metadata}")
        print(f"Content: {doc.page_content[:200]}...")  
    
    return answer, source_docs, similarity_scores

# Function for answering questions directly with the LLM 
def query_without_rag(question):
    prompt = f"You are an artificial intelligence assistant who is an expert in analyzing from CNN articles 2020 - 2022. Answer the following question:: {question}"
    answer = llm.invoke(prompt)
    
    print(f"Question: {question}")
    print(f"Answer (without RAG): {answer}")
    
    return answer 

-------------------------------------------------------------- Step 6 -----------------------------------------------------------

Evaluate Question - Answer

1.) 5 question 

2.) for each question print 

2.1) with rag -> the index of question , question , retrieved docs with the headline , Article text , idx , chunk_id , answer

2.2.) without  ->  the index of question , question , answer

3.) Save all in a csv 

-------------------------------------------------------------- Step 6 -----------------------------------------------------------

In [None]:
question_array = [
    
    "How did the COVID-19 pandemic affect the global economy?",
    "News about the summer Olymbic games 2020 ?",
    "What are John Biden's plans for the American economy ?",
    "What are the player transfers for football teams?",
    "What are the main challenges currently facing U.S.Α domestic and foreign policy ?"

]


print("######################################## RAG SYSTEM EVALUATION ########################################")
print("=======================================================================================================")
results = []

for q_idx, question in enumerate(question_array):
    print(f"\n\nQUESTION {q_idx+1}: {question}")
    print("#" * 80)
    
    # Answer with RAG
    print("\n[WITH RAG]")
    try:
        with_rag_answer, source_docs, similarity_scores = query_with_rag(question)
    except Exception as e:
        print(f"Error during RAG execution: {e}")
        with_rag_answer = "Error in RAG execution"
        source_docs = []
        similarity_scores = []
    
    # Answer without RAG
    print("\n[WITHOUT RAG]")
    try:
        without_rag_answer = query_without_rag(question)
    except Exception as e:
        print(f"Error during execution without RAG: {e}")
        without_rag_answer = "Error in execution without RAG"
    
    # Save results
    results.append({
        "question": question,
        "with_rag": with_rag_answer,
        "without_rag": without_rag_answer,
        "source_docs": source_docs,
        "similarity_scores": similarity_scores
    })
    
    print("\n" + "#" * 80)


evaluation_df = pd.DataFrame([
    {
        "question": r["question"],
        "with_rag_answer": r["with_rag"],
        "without_rag_answer": r["without_rag"],
        "num_source_docs": len(r["source_docs"]),
        "avg_similarity": np.mean([s["similarity"] for s in r["similarity_scores"]]) if r["similarity_scores"] else 0,
        "max_similarity": max([s["similarity"] for s in r["similarity_scores"]]) if r["similarity_scores"] else 0,
        "min_similarity": min([s["similarity"] for s in r["similarity_scores"]]) if r["similarity_scores"] else 0
    }
    for r in results
])


print("\nSummary evaluation results:")
print(evaluation_df[["question", "num_source_docs", "avg_similarity", "max_similarity", "min_similarity"]])


evaluation_df.to_csv("rag_evaluation_results_with_similarity.csv", index=False)
print("=======================================================================================================")
print("#################################### END RAG SYSTEM EVALUATION ########################################")


######################################## RAG SYSTEM EVALUATION ########################################


QUESTION 1: How did the COVID-19 pandemic affect the global economy?
################################################################################

[WITH RAG]
Question: How did the COVID-19 pandemic affect the global economy?
Answer (with RAG): The COVID-19 pandemic had a significant impact on the global economy, with far-reaching consequences that are still being felt today. One of the most notable effects was the massive job loss in the labor market, with 22 million jobs lost in just two months in the US, which is more than twice as many jobs lost during the entire Great Recession and financial crisis of 2008-2009.

The pandemic also led to a global supply chain disruption, resulting in inflation. However, it's worth noting that this inflation was largely due to global supply chain problems, rather than the strong jobs recovery that has since occurred. Without this strong reco

 avg_similarity  max_similarity  min_similarity  
0        0.599126        0.640436        0.567098  
1        0.570741        0.629303        0.553037  
2        0.550139        0.591970        0.525329  
3        0.597792        0.644366        0.562622  
4        0.438194        0.454634        0.422457 