# In-Context Learning


In-context learning generalises few-shot learning: the prompt supplies a context, and the LLM is asked to respond using the information in that context.

* Example: *"Summarize this research article into one paragraph highlighting its strengths and weaknesses: [insert article text]"*
* Example: *"Extract all the quotes from this text and organize them in alphabetical order: [insert text]"*
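
Prompts like these can be assembled from an instruction and a context string. The following is a minimal sketch; `build_prompt` is a hypothetical helper (not part of any library) and the context text is a made-up placeholder.

```python
def build_prompt(instruction: str, context: str) -> str:
    """Join a task instruction and the context it should be applied to."""
    return f"{instruction.strip()}\n\n{context.strip()}"


prompt = build_prompt(
    "Extract all the quotes from this text and organize them in alphabetical order:",
    'She said "hello" and he answered "goodbye".',  # placeholder context
)
print(prompt)
```

The model sees the instruction and the context as one piece of text, so everything it needs to answer must be inside that string.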

Retrieval-Augmented Generation (RAG), a very popular technique that you will learn in week 5, is a form of in-context learning, where:
* a search engine is used to retrieve some relevant information
* that information is then provided to the LLM as context
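
The two steps above can be sketched with a toy retriever. This is only an illustration under strong assumptions: word overlap stands in for a real search engine, and `retrieve`, `build_rag_prompt`, and the two documents are hypothetical.

```python
def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Rank documents by the number of words they share with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_rag_prompt(query: str, documents: list[str], k: int = 1) -> str:
    """Place the retrieved passages in the prompt ahead of the question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


docs = [
    "pypdf is a Python library for reading PDF files.",
    "tqdm draws progress bars for Python loops.",
]
rag_prompt = build_rag_prompt("Which library draws progress bars?", docs)
print(rag_prompt)
```

Only the best-matching passage reaches the prompt, which keeps the context short and focused on the question.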


In this example we download some recent research papers from arXiv, extract the text from the PDF files, and ask Gemini to summarize each paper and describe its main strengths and weaknesses. Finally, we write the summaries to a local HTML file and display them as Markdown.

In [3]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-5.1.0-py3-none-any.whl.metadata (7.2 kB)
Downloading pypdf-5.1.0-py3-none-any.whl (297 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.1.0


In [4]:
import os
import requests
from bs4 import BeautifulSoup
import google.generativeai as genai
from urllib.request import urlopen, urlretrieve
from IPython.display import Markdown, display
from pypdf import PdfReader
from datetime import date
from tqdm import tqdm

In [5]:
API_KEY = os.environ.get("GOOGLE_API_KEY")
genai.configure(api_key=API_KEY)

We select the papers that are currently featured on the Hugging Face papers page.

In [19]:
BASE_URL = "https://huggingface.co/papers"
page = requests.get(BASE_URL)
soup = BeautifulSoup(page.content, "html.parser")
h3s = soup.find_all("h3")

papers = []

for h3 in h3s:
    a = h3.find("a")
    if a:  # Ensure there is an anchor tag
        title = a.text.strip()
        link = a.get("href", "")
        if "flex" in a.get("class", []):  # BeautifulSoup returns 'class' as a list of class names
            arxiv_link = link.replace('/papers', '')
            papers.append({"title": title, "url": f"https://arxiv.org/pdf{arxiv_link}"})
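
The link rewriting in the loop above can be isolated into a small function. This is a sketch, assuming the Hugging Face links have the form `/papers/<arXiv id>`; `arxiv_pdf_url` is a hypothetical helper name.

```python
def arxiv_pdf_url(hf_link: str) -> str:
    """Turn a Hugging Face link such as '/papers/2412.18525' into an arXiv PDF URL."""
    paper_id = hf_link.rstrip("/").split("/")[-1]  # keep only the arXiv identifier
    return f"https://arxiv.org/pdf/{paper_id}"


print(arxiv_pdf_url("/papers/2412.18525"))
```

Taking the last path segment, rather than replacing `/papers`, keeps the result correct even if the link carries a longer prefix.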


Helper functions to extract text from a web page or a PDF, and to render Markdown in the notebook.

In [20]:
def extract_paper(url):
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")

    # kill all script and style elements
    for script in soup(["script", "style"]):
        script.extract()    # rip it out

    # get text
    text = soup.get_text()

    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)

    return text


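def extract_pdf(url):
    # download the PDF to a local file (overwritten on every call)
    urlretrieve(url, "pdf_file.pdf")
    reader = PdfReader("pdf_file.pdf")
    # concatenate the text of every page, one page per block
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    return text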


def printmd(string):
    display(Markdown(string))
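
The whitespace clean-up inside `extract_paper` is easier to follow on a small string. The sketch below repeats the same three steps; `clean_text` is a hypothetical name and the sample input is made up.

```python
def clean_text(text: str) -> str:
    """Strip each line, split on double spaces, and drop empty chunks."""
    lines = (line.strip() for line in text.splitlines())
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    return "\n".join(chunk for chunk in chunks if chunk)


raw = "  Title  Subtitle \n\n   Body text.  \n"
print(clean_text(raw))
# Title
# Subtitle
# Body text.
```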

In [21]:
LLM = "gemini-1.5-flash"
model = genai.GenerativeModel(LLM)

We use Gemini to summarize the papers.

In [22]:
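PROMPT = ("Summarize this research article into one paragraph without formatting, "
          "highlighting its strengths and weaknesses. ")

for paper in tqdm(papers):
    try:
        # the extracted paper text is appended to the instruction as context
        paper["summary"] = model.generate_content(PROMPT + extract_pdf(paper["url"])).text
    except Exception:  # a bare except would also swallow KeyboardInterrupt
        print("Generation failed")
        paper["summary"] = "Paper not available"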

Generation failed (repeated for all 13 papers)

100%|██████████| 13/13 [00:05<00:00,  2.59it/s]
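
Every call in this run ended in the `except` branch, but the loop does not record why. A minimal sketch of one way to keep the reason alongside each paper follows. `summarize_all` and `fake_summarize` are hypothetical: the stand-in function replaces the real model call so that the example runs without an API key.

```python
def summarize_all(papers, summarize):
    """Apply `summarize` to each paper URL and record why any call failed."""
    for paper in papers:
        try:
            paper["summary"] = summarize(paper["url"])
        except Exception as exc:
            paper["summary"] = "Paper not available"
            paper["error"] = f"{type(exc).__name__}: {exc}"
    return papers


def fake_summarize(url):
    """Stand-in for the model call: fails for URLs that do not point at a PDF."""
    if "/pdf/" not in url:
        raise ValueError(f"not a PDF link: {url}")
    return "A one-paragraph summary."


results = summarize_all(
    [{"url": "https://arxiv.org/pdf/2412.18525"},
     {"url": "https://arxiv.org/papers/2412.18525"}],
    fake_summarize,
)
for paper in results:
    print(paper["summary"], paper.get("error", ""))
```

Storing the exception text makes it possible to tell a bad URL from a missing API key or a rate limit after the loop has finished.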





We write the results to an HTML file.

In [23]:
# visible content belongs in <body>, not <head>
page = f"<html> <body> <h1>Daily Dose of AI Research</h1> <h4>{date.today()}</h4> <p><i>Summaries generated with: {LLM}</i></p>"
# open the file once and write the header, one entry per paper, and the footer
with open("papers.html", "w", encoding="utf-8") as f:
    f.write(page)
    for paper in papers:
        f.write(f'<h2><a href="{paper["url"]}">{paper["title"]}</a></h2> <p>{paper["summary"]}</p>')
    f.write("</body> </html>")
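
Titles and summaries come from external sources, so they may contain characters such as `<` or `&` that break the page. A minimal sketch of escaping them with the standard library's `html.escape` follows; `paper_to_html` is a hypothetical helper and the sample paper is made up.

```python
from html import escape


def paper_to_html(paper: dict) -> str:
    """Render one paper as an HTML fragment, escaping text taken from outside sources."""
    url = escape(paper["url"], quote=True)
    title = escape(paper["title"])
    summary = escape(paper["summary"])
    return f'<h2><a href="{url}">{title}</a></h2> <p>{summary}</p>'


fragment = paper_to_html({
    "title": "A <b>bold</b> title & more",  # made-up title with special characters
    "url": "https://arxiv.org/pdf/2412.18525",
    "summary": "Paper not available",
})
print(fragment)
```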

We can also display the results in this notebook as Markdown.

In [24]:
for paper in papers:
    printmd("**[{}]({})**<br>{}<br><br>".format(paper["title"],
                                                paper["url"],
                                                paper["summary"]))

**[Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization](https://arxiv.org/papers/2412.18525)**<br>Paper not available<br><br>

**[On the Compositional Generalization of Multimodal LLMs for Medical Imaging](https://arxiv.org/papers/2412.20070)**<br>Paper not available<br><br>

**[Bringing Objects to Life: 4D generation from 3D objects](https://arxiv.org/papers/2412.20422)**<br>Paper not available<br><br>

**[Efficiently Serving LLM Reasoning Programs with Certaindex](https://arxiv.org/papers/2412.20993)**<br>Paper not available<br><br>

**[TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization](https://arxiv.org/papers/2412.21037)**<br>Paper not available<br><br>

**[Edicho: Consistent Image Editing in the Wild](https://arxiv.org/papers/2412.21079)**<br>Paper not available<br><br>

**[Facilitating large language model Russian adaptation with Learned Embedding Propagation](https://arxiv.org/papers/2412.21140)**<br>Paper not available<br><br>

**[HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation](https://arxiv.org/papers/2412.21199)**<br>Paper not available<br><br>

**[Training Software Engineering Agents and Verifiers with SWE-Gym](https://arxiv.org/papers/2412.21139)**<br>Paper not available<br><br>

**[OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System](https://arxiv.org/papers/2412.20005)**<br>Paper not available<br><br>

**[PERSE: Personalized 3D Generative Avatars from A Single Portrait](https://arxiv.org/papers/2412.21206)**<br>Paper not available<br><br>

**[Slow Perception: Let's Perceive Geometric Figures Step-by-step](https://arxiv.org/papers/2412.20631)**<br>Paper not available<br><br>

**[Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs](https://arxiv.org/papers/2412.21187)**<br>Paper not available<br><br>