Before starting, run the following commands in a new terminal tab:

```
module purge
module load ollama
export HUGGINGFACE_HUB_CACHE=/pbs/throng/training/universite-hiver/cache/huggingface
export OLLAMA_MODELS=/pbs/throng/training/universite-hiver/cache/ollama/models
export OLLAMA_HOST=127.0.0.1:65383
ollama serve &
```

This starts an Ollama backend that serves the LLM (gpt-oss-20b) used for the "Generation" step of RAG.
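
Before moving on, it can help to confirm that the server actually answers. The snippet below is a minimal sketch using only the standard library. It assumes the server listens on the `OLLAMA_HOST` address exported above and that a running Ollama instance replies to a plain `GET /` with HTTP status 200; the helper name `ollama_is_up` is made up for this example.

```python
import http.client
import urllib.request


def ollama_is_up(host="127.0.0.1:65383", timeout=2.0):
    """Return True if an HTTP server answers at http://<host>/ with status 200."""
    try:
        with urllib.request.urlopen(f"http://{host}/", timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-HTTP reply: treat as "not running".
        return False


print("Ollama reachable:", ollama_is_up())
```

If this prints `False`, check that `ollama serve &` is still running in the other terminal tab and that the port matches.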

# A Starting Example

Retrieval-Augmented Generation (RAG) is a technique that enhances the output of language models by grounding their responses in external, relevant documents. Instead of relying solely on model parameters, RAG systems first retrieve context passages related to a user query and then generate an answer based on that evidence. This improves factual accuracy, transparency, and adaptability to custom datasets.

In this example, we implement a simple RAG pipeline using U.S. presidential speeches as our knowledge base. We'll walk through the following steps:

- **Data Preparation**: Load and explore a dataset of presidential speech transcripts.
- **TF-IDF Retrieval**: Convert speeches into vector representations using TF-IDF (keywords) and retrieve the most relevant ones based on cosine similarity.
- **Query Answering with Ollama**: Use a locally hosted language model (via Ollama) to generate a natural language answer using the retrieved speech excerpts as context.
- **Keyword Match Exploration**: Inspect which terms from the query appear in each retrieved document to better understand the retrieval step.
- **Embedding-based Retrieval**: Retrieve semantically relevant passages using sentence embeddings instead of keyword matching.

This setup demonstrates a lightweight but effective RAG workflow suitable for text collections in the social sciences, especially where domain-specific knowledge is locked inside qualitative documents such as interviews, policy statements, or historical texts.
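
The retrieve-then-generate loop described above can be sketched in a few lines before any real model is involved. This is a toy illustration, not the notebook's pipeline: the three documents are invented, retrieval is plain word overlap rather than TF-IDF, and the generation step is left out, so the example only builds the prompt that would be sent to a model.

```python
import re

# Hypothetical mini knowledge base; the notebook uses presidential speeches instead.
DOCS = [
    "The new tariff bill raises duties on imported steel.",
    "Immigration has enriched our cities and our farms.",
    "Our navy must be expanded to protect trade routes.",
]


def tokenize(text):
    """Lowercase a string and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query, docs, top_k=1):
    """Rank documents by how many query words they share, best first."""
    q = tokenize(query)
    ranked = sorted(docs, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, passages):
    """Put the retrieved passages in front of the question."""
    context = "\n---\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\n\nAnswer:"


question = "What was said about immigration?"
prompt = build_prompt(question, retrieve(question, DOCS))
print(prompt)  # this string is what would be sent to the language model
```

The rest of the notebook swaps each toy piece for a stronger one: TF-IDF or embeddings for retrieval, and an Ollama-hosted model for generation.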

## Import Dependencies

In [None]:
import numpy as np
import pandas as pd

from ollama import Client

from sentence_transformers import SentenceTransformer

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer

## Load data

In [None]:
data1 = pd.read_csv('data/speeches.csv')

In [None]:
data1.head()

In [None]:
len(data1)

In [None]:
print(data1['transcript'].values[0])

In [None]:
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, min_df=5, max_features=1000)
tfidf_matrix = vectorizer.fit_transform(data1['transcript'])

In [None]:
tfidf_matrix.shape

In [None]:
vectorizer.get_feature_names_out()
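
To get a feel for what the TF-IDF matrix contains, here is a from-scratch sketch on a made-up three-document corpus. It assumes the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is, to the best of my knowledge, what scikit-learn applies by default. It ignores stop words and the `max_df`, `min_df`, and `max_features` filters, so treat it as an illustration of the weighting idea rather than a re-implementation of `TfidfVectorizer`.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already tokenized by whitespace).
docs = [
    "oil prices and oil imports".split(),
    "immigration and labor".split(),
    "labor and trade".split(),
]

n_docs = len(docs)
# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency: rare terms get a larger weight.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in df}


def tfidf(tokens):
    """Return an L2-normalized {term: weight} vector for one token list."""
    counts = Counter(t for t in tokens if t in idf)  # ignore unseen terms
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def cosine(a, b):
    """Dot product of two normalized sparse vectors, i.e. their cosine similarity."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


doc_vecs = [tfidf(d) for d in docs]
query_vec = tfidf("oil imports".split())
scores = [cosine(query_vec, v) for v in doc_vecs]
print([round(s, 3) for s in scores])
```

Only the first document gets a non-zero score, because it is the only one sharing terms with the query. The word "and" occurs in every document and therefore receives the lowest IDF weight.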

### Helper functions

In [None]:
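def retrieve_relevant_speeches(data, tfidf_matrix, query, top_k=5):
    """Return the top_k rows of `data` most similar to `query`, plus their scores."""
    # Uses the global `vectorizer` fitted above so the query shares the corpus vocabulary.
    query_vec = vectorizer.transform([query])
    similarity = cosine_similarity(query_vec, tfidf_matrix).flatten()
    # argsort is ascending: take the last top_k indices and reverse them (best first).
    top_indices = np.argsort(similarity)[-top_k:][::-1]
    return data.iloc[top_indices], similarity[top_indices]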
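
One thing to keep in mind: the `argsort` selection always returns `top_k` rows, even when every similarity is zero, for instance when none of the query words made it into the 1000-term vocabulary. The sketch below shows the same top-k selection in plain Python with an optional cut-off. The `min_score` parameter is a hypothetical addition and is not part of the notebook's function.

```python
def top_k_indices(scores, top_k=5, min_score=0.0):
    """Indices of the top_k highest scores, best first, skipping scores <= min_score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in order[:top_k] if scores[i] > min_score]


similarity = [0.0, 0.42, 0.0, 0.17, 0.0]  # made-up scores for five documents
print(top_k_indices(similarity, top_k=3))       # [1, 3]
print(top_k_indices([0.0, 0.0, 0.0], top_k=3))  # []
```

An empty result is a useful signal: it means keyword retrieval found nothing, so any answer the model produces would not be grounded in the speeches.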

In [None]:
def retrieve_relevant_speeches_with_matches(data, tfidf_matrix, query, top_k=5):
    query_vec = vectorizer.transform([query])
    similarity = cosine_similarity(query_vec, tfidf_matrix).flatten()
    top_indices = np.argsort(similarity)[-top_k:][::-1]
    retrieved_df = data.iloc[top_indices].copy()
    
    # Get matched words
    matched_words_list = []
    query_features = query_vec.nonzero()[1]
    query_terms = set([vectorizer.get_feature_names_out()[i] for i in query_features])

    for idx in top_indices:
        speech_vec = tfidf_matrix[idx]
        speech_features = speech_vec.nonzero()[1]
        speech_terms = set([vectorizer.get_feature_names_out()[i] for i in speech_features])
        common_terms = query_terms.intersection(speech_terms)
        matched_words_list.append(list(common_terms))

    retrieved_df["matched_words"] = matched_words_list
    retrieved_df["similarity"] = similarity[top_indices]
    return retrieved_df

In [None]:
def build_context(speeches_df, max_chars=1000):
    context = ""
    for _, row in speeches_df.iterrows():
        context += f"\n---\nSpeech by {row['president']}:\n{row['transcript'][:max_chars]}...\n"
    return context
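
`build_context` slices each transcript at exactly `max_chars` characters, which can cut a word in half and appends `...` even to short transcripts. A possible refinement, shown here as a sketch with a hypothetical helper name `truncate_on_word`, cuts at the last space before the limit and adds the ellipsis only when text was actually removed.

```python
def truncate_on_word(text, max_chars=1000):
    """Cut text to at most max_chars characters (plus an ellipsis) without splitting a word."""
    if len(text) <= max_chars:
        return text  # short enough: nothing removed, no ellipsis
    cut = text[:max_chars]
    # If the limit falls inside a word, drop the trailing partial word.
    if not text[max_chars].isspace() and " " in cut:
        cut = cut[: cut.rfind(" ")]
    return cut.rstrip() + "..."


print(truncate_on_word("the quick brown fox", max_chars=12))  # the quick...
print(truncate_on_word("short", max_chars=12))                # short
```

Inside `build_context`, this would replace the `row['transcript'][:max_chars]` slice.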

## Asking an LLM, with and without RAG

In [None]:
host='127.0.0.1:65383'
model="gpt-oss:20b"

client = Client(
    host=host,
)

In [None]:
answer = client.chat(model, messages=[
        {
            'role': 'user',
            'content': "What did president Grover Cleveland say about immigration?"
        },
    ])

print(answer.message.content)

In [None]:
def query_ollama(context, question, client):
    messages = [
        {
            'role': 'user',
            'content': f"""Answer the following question based on the context:\n\nContext:\n{context}\n\nQuestion: {question}\n\nAnswer:"""
        },
    ]

    for part in client.chat(model, messages=messages, stream=True):
        print(part.message.content, end='', flush=True)
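
`query_ollama` prints the streamed answer but returns nothing, so the text cannot be stored or compared afterwards. The sketch below collects the pieces while optionally echoing them. It assumes, as in the cell above, that each streamed part exposes its text as `part.message.content`; the helper name `collect_stream` and the fake stream are invented so the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(parts, echo=True):
    """Optionally print streamed chunks as they arrive and return the full answer."""
    pieces = []
    for part in parts:
        text = part.message.content or ""
        if echo:
            print(text, end="", flush=True)
        pieces.append(text)
    return "".join(pieces)


# Stand-in for the objects yielded by client.chat(..., stream=True).
fake_stream = [
    SimpleNamespace(message=SimpleNamespace(content=chunk))
    for chunk in ["Hello", ", ", "world"]
]
answer = collect_stream(fake_stream, echo=False)
print(answer)  # Hello, world
```

With a live server, the call would look like `collect_stream(client.chat(model, messages=messages, stream=True))`.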

In [None]:
query = "What did president Grover Cleveland say about immigration?"

# Retrieve
retrieved, scores = retrieve_relevant_speeches(data1, tfidf_matrix, query)
context = build_context(retrieved)

In [None]:
print(context)

In [None]:
# Generate answer
query_ollama(context, query, client)

Grover Cleveland’s remarks were focused on a specific amendment to the 1891 immigration act.  
He explained that the bill he was returning would add a new category of excluded aliens:  

* **Persons who are physically capable and over 16 years old but cannot read or write English (or any other language).**  
* **Exceptions:**  
  * An older parent or grandparent (over 50 years old) who is the parent or grandparent of a qualified immigrant over 21 years old and who is able to support that immigrant may accompany or be brought to the United States.  
  * Likewise, a wife or minor child who cannot read or write may accompany such an adult, or such an adult may be sent for and join the family of a child or grandchild over 21 who is qualified and capable.

In short, Cleveland said the proposed amendment would broaden the list of excluded immigrants to those over 16 who are illiterate in English, but would carve out a special provision for older family members who can support a qualified adult immigrant.

### See what a context looks like

In [None]:
query = "What did the presidents say about immigration?"
results_df = retrieve_relevant_speeches_with_matches(data1, tfidf_matrix, query, top_k=10)

for _, row in results_df.iterrows():
    print(f"\n➡️ {row['president']}:")
    print(f"Similarity: {row['similarity']:.4f}")
    print(f"Matched words: {', '.join(row['matched_words'])}")
    print(f"Excerpt: {row['transcript'][:1500]}...\n")

In [None]:
query = "What did the presidents say about oil?"
results_df = retrieve_relevant_speeches_with_matches(data1, tfidf_matrix, query)

for _, row in results_df.iterrows():
    print(f"\n➡️ {row['president']}:")
    print(f"Similarity: {row['similarity']:.4f}")
    print(f"Matched words: {', '.join(row['matched_words'])}")
    print(f"Excerpt: {row['transcript'][:1500]}...\n")

In [None]:
query = "What did the presidents say about oil?"

# Retrieve
retrieved, scores = retrieve_relevant_speeches(data1, tfidf_matrix, query)
context = build_context(retrieved)

# Generate answer
query_ollama(context, query, client)

**What the presidents said about oil**

| President | Main points about oil (or energy) |
|-----------|-----------------------------------|
| **Barack Obama** | • The Deepwater Horizon rig exploded and released an unprecedented oil spill off Louisiana. <br>• “We’re waging a battle against an oil spill that is assaulting our shores and our citizens.” <br>• Stopping the leak “has tested the limits of human technology,” so he assembled a national team of scientists and engineers to contain it. |
| **Jimmy Carter** | • In his first address, he framed the energy problem as “the greatest challenge” the nation would face in its lifetime, emphasizing that we must “balance our demand for energy with our rapidly shrinking resources.” <br>• In his follow‑up speech he announced the creation of a new Department of Energy and the “National Energy Plan,” underscoring the need for comprehensive federal action to address the looming crisis. |
| **Gerald Ford** | • He pledged a program to make the United States “independent of foreign sources of energy by 1985.” <br>• He quoted that the country was dependent on about **37 %** of its petroleum needs from abroad and warned that in ten years it could import more than **half** of its oil. <br>• He highlighted the rising cost of foreign oil – from **$3 B** five years ago to **$25 B** per year now – and warned that the trend could continue if nothing is done. |

In short:  
- **Obama** spoke of the emergency response to an oil spill.  
- **Carter** warned of an impending energy crisis and called for new federal leadership and policy.  
- **Ford** focused on the United States’ dependence on imported oil and the need for energy independence.


In [None]:
query = "What did the presidents say about immigration from mexico?"

# Retrieve
retrieved, scores = retrieve_relevant_speeches(data1, tfidf_matrix, query)
context = build_context(retrieved)

# Generate answer
query_ollama(context, query, client)

The three speeches that were supplied contain no remarks that directly address immigration coming from Mexico.  

- **Donald Trump** speaks about a “safe and lawful” immigration system and the need to reform it, but he does not single out Mexico or Mexican immigrants.  
- **James K. Polk** talks about national prosperity, territorial expansion, and the influx of people in general; he never mentions Mexican immigration.  
- **John Tyler** discusses the Mexican government’s hostile language about the war with Texas and the pending annexation of Texas, but he does not discuss immigration from Mexico.  

So, in the excerpts provided, none of the presidents made a statement about immigration from Mexico.

In [None]:
query = "What did the presidents say about immigration from mexico?"
results_df = retrieve_relevant_speeches_with_matches(data1, tfidf_matrix, query)

for _, row in results_df.iterrows():
    print(f"\n➡️ {row['president']}:")
    print(f"Similarity: {row['similarity']:.4f}")
    print(f"Matched words: {', '.join(row['matched_words'])}")
    print(f"Excerpt: {row['transcript'][:1500]}...\n")

In [None]:
query = "What did the presidents say about freedom of speech?"

# Retrieve
retrieved, scores = retrieve_relevant_speeches(data1, tfidf_matrix, query)
context = build_context(retrieved)

# Generate answer
query_ollama(context, query, client)

In [None]:
print(context)

# Chunking and Embeddings

In [None]:
def chunk_speeches_by_paragraph(df):
    chunks = []
    metadata = []
    for i, row in df.iterrows():
        paragraphs = [p.strip() for p in row['transcript'].split('\n') if len(p.strip()) > 50]
        for para in paragraphs:
            chunks.append(para)
            metadata.append({
                'president': row['president'],
                'speech_id': i
            })
    return chunks, metadata
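
Splitting on newlines works when transcripts have clean paragraph breaks. When they do not, a common alternative is a fixed-size sliding window with some overlap, so that a sentence straddling a boundary still appears whole in at least one chunk. The following is a sketch of that idea; the function name and the default sizes are arbitrary choices, not values taken from the notebook.

```python
def chunk_by_words(text, size=100, overlap=20):
    """Split text into chunks of `size` words; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


demo = " ".join(f"w{i}" for i in range(10))
print(chunk_by_words(demo, size=4, overlap=2))
# ['w0 w1 w2 w3', 'w2 w3 w4 w5', 'w4 w5 w6 w7', 'w6 w7 w8 w9']
```

Larger chunks give the model more context per hit, while smaller chunks make the similarity score more specific; the right trade-off depends on the corpus.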



In [None]:
chunks, meta = chunk_speeches_by_paragraph(data1)

In [None]:
chunks[:5]

In [None]:
len(chunks)

In [None]:
embedder = SentenceTransformer('all-MiniLM-L6-v2')  # Small and fast

In [None]:
chunk_embeddings = embedder.encode(chunks, convert_to_tensor=True, show_progress_bar=True)

In [None]:
def retrieve_with_embeddings(query, chunk_embeddings, top_k=5):
    query_embedding = embedder.encode([query], convert_to_tensor=True)
    sims = cosine_similarity(query_embedding.cpu(), chunk_embeddings.cpu())[0]
    top_indices = np.argsort(sims)[-top_k:][::-1]

    results = []
    for idx in top_indices:
        results.append({
            "transcript": chunks[idx],
            "similarity": float(sims[idx]),
            "president": meta[idx]['president'],
            "speech_id": meta[idx]['speech_id']
        })
    return results
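
Both retrieval functions rank by cosine similarity, which compares the direction of two vectors and ignores their length. The small sketch below computes it by hand on made-up three-dimensional vectors (real sentence embeddings have a few hundred dimensions) to show what the scores mean.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors standing in for embeddings.
query_vec = [1.0, 0.0, 1.0]
same_direction = [2.0, 0.0, 2.0]  # longer vector, same direction
orthogonal = [0.0, 3.0, 0.0]      # shares no direction with the query

print(cosine(query_vec, same_direction))  # approximately 1: same meaning direction
print(cosine(query_vec, orthogonal))      # 0.0: unrelated
```

This is why a long chunk and a short chunk about the same topic can score equally well against a query.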

In [None]:
query = "How did presidents talk about freedom of speech?"
results = retrieve_with_embeddings(query, chunk_embeddings)

for res in results:
    print(f"\n➡️ {res['president']} (Similarity: {res['similarity']:.4f})")
    print(res['transcript'])

In [None]:
def build_chunck_context(results):
    """Concatenate retrieved chunks (a list of dicts) into one context string."""
    context = ""
    for row in results:
        context += f"\n---\nPart of speech by {row['president']}:\n{row['transcript']}\n"
    return context

In [None]:
query = "How did presidents talk about freedom of speech?"

# Retrieve
results = retrieve_with_embeddings(query, chunk_embeddings, top_k=10)
context = build_chunck_context(results)

# Generate answer
query_ollama(context, query, client)

**In short:** The presidents used “freedom of speech” as a moral touch‑stone, a shield against tyranny, and a rallying cry for action. They spoke of it as a core American value, warned that it can be eroded when politics or war dominate, and urged both the public and government to guard it from censorship, silence, and intimidation.

| President | What they emphasized | How it shows up in the speech |
|-----------|----------------------|--------------------------------|
| **Lyndon B. Johnson** | Freedom of speech is a *living* safeguard that must be defended and exercised without restraint. | “We have freedom of speech… we have freedom of assembly… I have seen no restraints imposed by anybody.” |
| **George W. Bush** | Speech is a tool of decision‑making; politicians should not be swayed by loud voices but must confront hard realities. | “Presidents can try to avoid hard decisions… I am the kind of person that is willing to take on hard tasks.” |
| **Donald Trump** | Freedom of speech is under threat from cancel‑culture and censorship; it is essential to keep the nation united. | “Unprecedented assault on free speech… the efforts to censor, cancel, and blacklist… are wrong and dangerous.” |
| **Franklin D. Roosevelt** | Speech is a universal right that must be protected worldwide. | “The first is freedom of speech and expression everywhere in the world.” |
| **Richard M. Nixon** | Confidentiality is necessary, but open discussion is vital to resolve conflict. | “The principle of confidentiality… but the same brutal candor is necessary in discussing how to bring warring factions to the peace table.” |
| **Joe Biden** | Freedom of speech is part of the American ideal and cannot be ignored when rights are violated. | “No responsible American President could remain silent when basic human rights are being so blatantly violated.” |

**Key themes**

1. **Defining freedom of speech as a cornerstone of democracy.**  
2. **Highlighting its fragility**—when other powers (war, politics, censorship) press against it.  
3. **Calling for action**—not just speeches but deeds that protect and expand the right.  
4. **Warning against censorship** and cancel culture.  
5. **Framing it as a moral obligation** for both presidents and citizens.  

These speeches collectively illustrate that U.S. presidents consistently framed freedom of speech as a protective, unifying, and indispensable value that must be defended against every threat, whether domestic or international.


In [None]:
query = "Which president talked about immigration and immigrants positively in their speech?"

# Retrieve
results = retrieve_with_embeddings(query, chunk_embeddings, top_k=10)
context = build_chunck_context(results)

# Generate answer
query_ollama(context, query, client)

In [None]:
print(context)

# *To Be Continued..*