# Custom Chatbot Project

## CERN is celebrating its 70th anniversary 
During a crucial period for high-energy physics, coinciding with the initiation of the third update to the European strategy for particle physics. In this special edition of CERN Courier magazine, early-career researchers share their visions for the future of the field while reflecting on CERN's scientific and societal contributions. The magazine features expert insights into the achievements of the Large Hadron Collider (LHC) and explores the advancements of the hybrid pixel detector technology, emphasizing its applications beyond particle physics.

## The CERN Courier website 
is a rich repository of articles covering a wide array of topics in particle physics, high-energy physics, and associated technological advancements. It provides in-depth reporting on the latest experimental results from CERN and other international laboratories, offering insights into ongoing research and discoveries in the field.

## Last 11 years of the CERN Courier Magazine in PDF
In this dataset I am downloading the Last 11 years of the CERN Courier Magazine.  I will then take this database and then encode it to be used as a Context Window to ask Questions to OpenAI

# 1: Data Wrangling

In [3]:
%%time
import requests
from bs4 import BeautifulSoup
import os
from urllib.parse import urljoin, unquote
import time
from tqdm import tqdm
import re

class CERNPDFCrawler:
    def __init__(self):
        self.base_url = "https://home.cern/resources"
        self.session = requests.Session()
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }
        self.download_folder = "cern_pdfs"
        self.processed_article_urls = set()
        self.downloaded_files = set()
        
        if not os.path.exists(self.download_folder):
            os.makedirs(self.download_folder)
        self.load_existing_files()

    def load_existing_files(self):
        for filename in os.listdir(self.download_folder):
            if filename.lower().endswith('.pdf'):
                self.downloaded_files.add(filename)
        print(f"Found {len(self.downloaded_files)} existing PDF files")

    def get_page_content(self, url):
        max_retries = 3
        for attempt in range(max_retries):
            try:
                response = self.session.get(url, headers=self.headers)
                response.raise_for_status()
                return response.text
            except requests.RequestException as e:
                if attempt == max_retries - 1:
                    print(f"Error fetching {url}: {e}")
                    return None
                time.sleep(2 ** attempt)
        return None

    def extract_pdf_urls_from_text(self, text):
        """Extract PDF URLs from text content including 'File path:' patterns"""
        pdf_urls = set()
        
        # Look for "File path:" pattern
        file_path_matches = re.finditer(r'File path:\s*(https?://[^\s<>"]+\.pdf)', text, re.IGNORECASE)
        for match in file_path_matches:
            pdf_urls.add(match.group(1))
            
        # Look for direct PDF links
        pdf_link_matches = re.finditer(r'href="(https?://[^\s<>"]+\.pdf)"', text, re.IGNORECASE)
        for match in pdf_link_matches:
            pdf_urls.add(match.group(1))
            
        return pdf_urls

    def find_courier_links(self, page_url):
        content = self.get_page_content(page_url)
        if not content:
            return []
        
        soup = BeautifulSoup(content, 'html.parser')
        courier_links = []
        
        for link in soup.find_all('a', href=True):
            href = link['href']
            if '/resources/courier/' in href or '/record/' in href:
                full_url = urljoin("https://home.cern", href)
                if full_url not in self.processed_article_urls:
                    courier_links.append(full_url)
                    self.processed_article_urls.add(full_url)
        
        return courier_links

    def find_pdf_links(self, article_url):
        """Find all PDF download links on an article page"""
        content = self.get_page_content(article_url)
        if not content:
            return []
        
        pdf_urls = set()
        
        # Extract URLs from text content
        pdf_urls.update(self.extract_pdf_urls_from_text(content))
        
        # Parse with BeautifulSoup for structured extraction
        soup = BeautifulSoup(content, 'html.parser')
        
        # Look for links containing PDF
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.lower().endswith('.pdf'):
                full_url = urljoin("https://home.cern", href)
                pdf_urls.add(full_url)
        
        return list(pdf_urls)

    def sanitize_filename(self, url):
        """Create a safe filename from URL"""
        filename = unquote(url.split('/')[-1])
        # Remove or replace unsafe characters
        filename = re.sub(r'[<>:"/\\|?*]', '_', filename)
        return filename

    def download_pdf(self, pdf_url, filename):
        if filename in self.downloaded_files:
            print(f"Skipping {filename} - already downloaded")
            return True
            
        try:
            response = self.session.get(pdf_url, headers=self.headers, stream=True)
            response.raise_for_status()
            
            file_path = os.path.join(self.download_folder, filename)
            
            total_size = int(response.headers.get('content-length', 0))
            
            with open(file_path, 'wb') as file, tqdm(
                desc=filename,
                total=total_size,
                unit='iB',
                unit_scale=True,
                unit_divisor=1024,
            ) as pbar:
                for data in response.iter_content(chunk_size=1024):
                    size = file.write(data)
                    pbar.update(size)
            
            self.downloaded_files.add(filename)
            return True
        except Exception as e:
            print(f"Error downloading {filename}: {e}")
            return False

    def crawl_and_download(self, start_page=0, end_page=7):
        print(f"Starting CERN PDF crawler (pages {start_page} to {end_page})")
        
        found_pdfs = 0
        downloaded_pdfs = 0
        skipped_pdfs = 0
        failed_downloads = []
        
        for page_num in range(start_page, end_page + 1):
            page_url = f"{self.base_url}?type=52&page={page_num}"
            print(f"\nProcessing page {page_num}...")
            
            courier_links = self.find_courier_links(page_url)
            print(f"Found {len(courier_links)} new article links on page {page_num}")
            
            for article_url in courier_links:
                pdf_urls = self.find_pdf_links(article_url)
                
                for pdf_url in pdf_urls:
                    found_pdfs += 1
                    filename = self.sanitize_filename(pdf_url)
                    
                    print(f"\nFound PDF: {filename}")
                    print(f"URL: {pdf_url}")
                    
                    if filename in self.downloaded_files:
                        print(f"Skipping - already downloaded")
                        skipped_pdfs += 1
                        continue
                        
                    if self.download_pdf(pdf_url, filename):
                        downloaded_pdfs += 1
                    else:
                        failed_downloads.append(filename)
                
                time.sleep(1)
        
        print("\nDownload Summary:")
        print("-" * 20)
        print(f"Total PDFs found: {found_pdfs}")
        print(f"Successfully downloaded: {downloaded_pdfs}")
        print(f"Skipped (already downloaded): {skipped_pdfs}")
        print(f"Failed downloads: {len(failed_downloads)}")
        if failed_downloads:
            print("\nFailed downloads:")
            for fail in failed_downloads:
                print(f"- {fail}")

if __name__ == "__main__":
    crawler = CERNPDFCrawler()
    crawler.crawl_and_download(0, 7)

Found 0 existing PDF files
Starting CERN PDF crawler (pages 0 to 7)

Processing page 0...
Found 15 new article links on page 0

Found PDF: CERNCourier2024MayJun-digitaledition.pdf
URL: https://cds.cern.ch/record/2896932/files/CERNCourier2024MayJun-digitaledition.pdf


CERNCourier2024MayJun-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 14.7M/14.7M [00:01<00:00, 8.20MiB/s]



Found PDF: CERNCourier2024MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2893513/files/CERNCourier2024MarApr-digitaledition.pdf


CERNCourier2024MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.5M/13.5M [00:01<00:00, 10.4MiB/s]



Found PDF: CERNCourier2024JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2886335/files/CERNCourier2024JanFeb-digitaledition.pdf


CERNCourier2024JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.0M/13.0M [00:01<00:00, 8.04MiB/s]



Found PDF: CERNCourier2023NovDec-digitaledition NEW.pdf
URL: https://cds.cern.ch/record/2879381/files/CERNCourier2023NovDec-digitaledition%20NEW.pdf


CERNCourier2023NovDec-digitaledition NEW.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 12.3M/12.3M [00:01<00:00, 9.70MiB/s]



Found PDF: CERNCourier2023SepOct-digitaledition.pdf
URL: https://cds.cern.ch/record/2869155/files/CERNCourier2023SepOct-digitaledition.pdf


CERNCourier2023SepOct-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 19.6M/19.6M [00:05<00:00, 3.46MiB/s]



Found PDF: CERNCourier2023JulAug-digitaledition.pdf
URL: https://cds.cern.ch/record/2863407/files/CERNCourier2023JulAug-digitaledition.pdf


CERNCourier2023JulAug-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 16.3M/16.3M [00:05<00:00, 3.35MiB/s]



Found PDF: CERNCourier2023MayJun-digitaledition.pdf
URL: https://cds.cern.ch/record/2857134/files/CERNCourier2023MayJun-digitaledition.pdf


CERNCourier2023MayJun-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 18.3M/18.3M [00:04<00:00, 4.35MiB/s]



Found PDF: CERNCourier2023MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2857133/files/CERNCourier2023MarApr-digitaledition.pdf


CERNCourier2023MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 15.9M/15.9M [00:02<00:00, 5.93MiB/s]



Found PDF: CERNCourier2023JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2845914/files/CERNCourier2023JanFeb-digitaledition.pdf


CERNCourier2023JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11.7M/11.7M [00:01<00:00, 6.24MiB/s]



Found PDF: CERNCourier2022NovDec-digitaledition.pdf
URL: https://cds.cern.ch/record/2840144/files/CERNCourier2022NovDec-digitaledition.pdf


CERNCourier2022NovDec-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 15.4M/15.4M [00:01<00:00, 8.74MiB/s]



Found PDF: CERNCourier2022SepOct-digitaledition.pdf
URL: https://cds.cern.ch/record/2826497/files/CERNCourier2022SepOct-digitaledition.pdf


CERNCourier2022SepOct-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.6M/13.6M [00:06<00:00, 2.16MiB/s]



Found PDF: CERNCourier2022MayJun-digitaledition.pdf
URL: https://cds.cern.ch/record/2807618/files/CERNCourier2022MayJun-digitaledition.pdf


CERNCourier2022MayJun-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 12.9M/12.9M [00:05<00:00, 2.42MiB/s]



Processing page 1...
Found 15 new article links on page 1

Found PDF: CERNCourier2022MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2804425/files/CERNCourier2022MarApr-digitaledition.pdf


CERNCourier2022MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10.5M/10.5M [00:03<00:00, 3.49MiB/s]



Found PDF: CERNCourier2022JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2799462/files/CERNCourier2022JanFeb-digitaledition.pdf


CERNCourier2022JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 9.46M/9.46M [00:02<00:00, 4.61MiB/s]



Found PDF: CERNCourier2021NovDec-digitaledition.pdf
URL: https://cds.cern.ch/record/2789409/files/CERNCourier2021NovDec-digitaledition.pdf


CERNCourier2021NovDec-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11.9M/11.9M [00:01<00:00, 6.29MiB/s]



Found PDF: CERNCourier2021SepOct-digitaledition.pdf
URL: https://cds.cern.ch/record/2782568/files/CERNCourier2021SepOct-digitaledition.pdf


CERNCourier2021SepOct-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 15.1M/15.1M [00:07<00:00, 2.11MiB/s]



Found PDF: CERNCourier2021JulAug-digitaledition.pdf
URL: https://cds.cern.ch/record/2773907/files/CERNCourier2021JulAug-digitaledition.pdf
Error downloading CERNCourier2021JulAug-digitaledition.pdf: 404 Client Error: Not Found for url: https://cds.cern.ch/record/2773907/files/CERNCourier2021JulAug-digitaledition.pdf

Found PDF: CERNCourier2021MayJun-digitaledition.pdf
URL: https://cds.cern.ch/record/2765233/files/CERNCourier2021MayJun-digitaledition.pdf


CERNCourier2021MayJun-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 9.97M/9.97M [00:01<00:00, 7.10MiB/s]



Found PDF: CERNCourier2021MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2753402/files/CERNCourier2021MarApr-digitaledition.pdf


CERNCourier2021MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.8M/13.8M [00:01<00:00, 11.2MiB/s]



Found PDF: CERNCourier2021JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2750037/files/CERNCourier2021JanFeb-digitaledition.pdf


CERNCourier2021JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 21.6M/21.6M [00:01<00:00, 12.5MiB/s]



Found PDF: CERNCourier2020NovDec-digitaledition.pdf
URL: https://cds.cern.ch/record/2743359/files/CERNCourier2020NovDec-digitaledition.pdf


CERNCourier2020NovDec-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 10.6M/10.6M [00:01<00:00, 9.85MiB/s]



Found PDF: CERNCourier2020SepOct-digitaledition.pdf
URL: https://cds.cern.ch/record/2743358/files/CERNCourier2020SepOct-digitaledition.pdf


CERNCourier2020SepOct-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.1M/13.1M [00:01<00:00, 10.6MiB/s]



Found PDF: CERNCourier2020JulAug-digitaledition.pdf
URL: https://cds.cern.ch/record/2722711/files/CERNCourier2020JulAug-digitaledition.pdf


CERNCourier2020JulAug-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.5M/13.5M [00:01<00:00, 12.1MiB/s]



Found PDF: CERNCourier2020MayJun-digitaledition.pdf
URL: https://cds.cern.ch/record/2717129/files/CERNCourier2020MayJun-digitaledition.pdf


CERNCourier2020MayJun-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 12.2M/12.2M [00:01<00:00, 9.70MiB/s]



Found PDF: CERNCourier2020MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2712176/files/CERNCourier2020MarApr-digitaledition.pdf


CERNCourier2020MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 13.7M/13.7M [00:01<00:00, 11.1MiB/s]



Found PDF: CERNCourier2020JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2706508/files/CERNCourier2020JanFeb-digitaledition.pdf


CERNCourier2020JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 8.74M/8.74M [00:00<00:00, 9.34MiB/s]



Found PDF: CERNCourier2019NovDec-digitaledition.pdf
URL: https://cds.cern.ch/record/2701615/files/CERNCourier2019NovDec-digitaledition.pdf


CERNCourier2019NovDec-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11.3M/11.3M [00:01<00:00, 10.4MiB/s]



Processing page 2...
Found 15 new article links on page 2

Found PDF: CERNCourier2019SepOct-digitaledition.pdf
URL: https://cds.cern.ch/record/2689203/files/CERNCourier2019SepOct-digitaledition.pdf


CERNCourier2019SepOct-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11.6M/11.6M [00:01<00:00, 7.50MiB/s]



Found PDF: CCJulAug19-digital.pdf
URL: https://cds.cern.ch/record/2681906/files/CCJulAug19-digital.pdf


CCJulAug19-digital.pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10.5M/10.5M [00:01<00:00, 8.52MiB/s]



Found PDF: CCMayJun19-digital.pdf
URL: https://cds.cern.ch/record/2673718/files/CCMayJun19-digital.pdf


CCMayJun19-digital.pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15.4M/15.4M [00:01<00:00, 12.3MiB/s]



Found PDF: CERNCourier2019MarApr-digitaledition.pdf
URL: https://cds.cern.ch/record/2666160/files/CERNCourier2019MarApr-digitaledition.pdf


CERNCourier2019MarApr-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 12.1M/12.1M [00:01<00:00, 11.1MiB/s]



Found PDF: CERNCourier2019JanFeb-digitaledition.pdf
URL: https://cds.cern.ch/record/2654576/files/CERNCourier2019JanFeb-digitaledition.pdf


CERNCourier2019JanFeb-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 11.2M/11.2M [00:01<00:00, 7.28MiB/s]



Found PDF: CERNCourier2018Dec-digitaledition.pdf
URL: https://cds.cern.ch/record/2649360/files/CERNCourier2018Dec-digitaledition.pdf


CERNCourier2018Dec-digitaledition.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10.2M/10.2M [00:01<00:00, 9.58MiB/s]



Found PDF: CERNCourier2018Nov-digitaledition.pdf
URL: https://cds.cern.ch/record/2645275/files/CERNCourier2018Nov-digitaledition.pdf


CERNCourier2018Nov-digitaledition.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8.90M/8.90M [00:00<00:00, 9.51MiB/s]



Found PDF: CERNCourier2018Oct-digitaledition.pdf
URL: https://cds.cern.ch/record/2640475/files/CERNCourier2018Oct-digitaledition.pdf


CERNCourier2018Oct-digitaledition.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 8.13M/8.13M [00:00<00:00, 8.73MiB/s]



Found PDF: CERNCourier2018Sep-digitaledition.pdf
URL: https://cds.cern.ch/record/2636286/files/CERNCourier2018Sep-digitaledition.pdf


CERNCourier2018Sep-digitaledition.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 15.6M/15.6M [00:01<00:00, 12.2MiB/s]



Found PDF: CERNCourier2018JulAug-digitaledition.pdf
URL: https://cds.cern.ch/record/2628313/files/CERNCourier2018JulAug-digitaledition.pdf


CERNCourier2018JulAug-digitaledition.pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 6.69M/6.69M [00:00<00:00, 8.40MiB/s]



Found PDF: CERN Courier June 2018 (Volume 53 Issue 5).pdf
URL: https://home.cern/sites/default/files/2018-06/CERN%20Courier%20June%202018%20%28Volume%2053%20Issue%205%29.pdf


CERN Courier June 2018 (Volume 53 Issue 5).pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 8.69M/8.69M [00:01<00:00, 6.94MiB/s]



Found PDF: CERN Courier May 2018 (Volume 53 Issue 4).pdf
URL: https://cds.cern.ch/record/2318574/files/CERN%20Courier%20May%202018%20(Volume%2053%20Issue%204).pdf


CERN Courier May 2018 (Volume 53 Issue 4).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 9.32M/9.32M [00:01<00:00, 6.67MiB/s]



Found PDF: CERN Courier Volume 58 Issue 3 (April 2018).pdf
URL: https://cds.cern.ch/record/2309976/files/CERN%20Courier%20Volume%2058%20Issue%203%20(April%202018).pdf


CERN Courier Volume 58 Issue 3 (April 2018).pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 13.3M/13.3M [00:01<00:00, 10.9MiB/s]



Found PDF: CERN Courier Volume 58 Issue 2 (March 2018).pdf
URL: https://cds.cern.ch/record/2304934/files/CERN%20Courier%20Volume%2058%20Issue%202%20(March%202018).pdf


CERN Courier Volume 58 Issue 2 (March 2018).pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 10.5M/10.5M [00:01<00:00, 9.73MiB/s]



Found PDF: CERN Courier Volume 58 Issue 1 (Jan-Feb 2018).pdf
URL: https://cds.cern.ch/record/2300591/files/CERN%20Courier%20Volume%2058%20Issue%201%20(Jan-Feb%202018).pdf


CERN Courier Volume 58 Issue 1 (Jan-Feb 2018).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 10.5M/10.5M [00:01<00:00, 8.53MiB/s]



Processing page 3...
Found 15 new article links on page 3

Found PDF: CERN Courier December 2017 (Volume 57 Issue 10).pdf
URL: https://cds.cern.ch/record/2292627/files/CERN%20Courier%20December%202017%20(Volume%2057%20Issue%2010).pdf


CERN Courier December 2017 (Volume 57 Issue 10).pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 8.04M/8.04M [00:01<00:00, 5.80MiB/s]



Found PDF: CERN Courier Volume 57 Issue 9 (November 2017).pdf
URL: https://cds.cern.ch/record/2289267/files/CERN%20Courier%20Volume%2057%20Issue%209%20(November%202017).pdf


CERN Courier Volume 57 Issue 9 (November 2017).pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 8.51M/8.51M [00:00<00:00, 9.13MiB/s]



Found PDF: CERN Courier Volume 57 Issue 8 October 2017.pdf
URL: https://cds.cern.ch/record/2285637/files/CERN%20Courier%20Volume%2057%20Issue%208%20October%202017.pdf


CERN Courier Volume 57 Issue 8 October 2017.pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 9.17M/9.17M [00:00<00:00, 9.69MiB/s]



Found PDF: CERN Courier Volume 57 Issue 7 (September 2017).pdf
URL: https://cds.cern.ch/record/2281303/files/CERN%20Courier%20Volume%2057%20Issue%207%20(September%202017).pdf


CERN Courier Volume 57 Issue 7 (September 2017).pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 10.1M/10.1M [00:01<00:00, 9.15MiB/s]



Found PDF: CERN Courier Volume 57 Issue 6 (July-August 2017).pdf
URL: https://cds.cern.ch/record/2273705/files/CERN%20Courier%20Volume%2057%20Issue%206%20(July-August%202017).pdf


CERN Courier Volume 57 Issue 6 (July-August 2017).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████| 10.6M/10.6M [00:01<00:00, 9.84MiB/s]



Found PDF: CERN Courier Volume 57 Issue 4 .pdf
URL: https://cds.cern.ch/record/2259560/files/CERN%20Courier%20Volume%2057%20Issue%204%20.pdf


CERN Courier Volume 57 Issue 4 .pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.5M/12.5M [00:01<00:00, 8.01MiB/s]



Found PDF: CERN Courier Volume 57 Issue 3 April 2017.pdf
URL: https://cds.cern.ch/record/2256135/files/CERN%20Courier%20Volume%2057%20Issue%203%20April%202017.pdf


CERN Courier Volume 57 Issue 3 April 2017.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 6.81M/6.81M [00:00<00:00, 8.59MiB/s]



Found PDF: CERN Courier Volume 57 Issue 2 March 2017.pdf
URL: https://cds.cern.ch/record/2252407/files/CERN%20Courier%20Volume%2057%20Issue%202%20March%202017.pdf


CERN Courier Volume 57 Issue 2 March 2017.pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 9.75M/9.75M [00:01<00:00, 10.1MiB/s]



Found PDF: CERN Courier Jan-Feb 2017 (Volume 57 issue 1).pdf
URL: https://cds.cern.ch/record/2241972/files/CERN%20Courier%20Jan-Feb%202017%20(Volume%2057%20issue%201).pdf


CERN Courier Jan-Feb 2017 (Volume 57 issue 1).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 9.89M/9.89M [00:00<00:00, 10.4MiB/s]



Found PDF: CERN Courier November 2016 (Volume 56 Issue 9).pdf
URL: https://cds.cern.ch/record/2224294/files/CERN%20Courier%20November%202016%20(Volume%2056%20Issue%209).pdf


CERN Courier November 2016 (Volume 56 Issue 9).pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 9.23M/9.23M [00:01<00:00, 6.56MiB/s]



Found PDF: CERN Courier October 2016 (Volume 56 Issue 8).pdf
URL: https://cds.cern.ch/record/2219443/files/CERN%20Courier%20October%202016%20(Volume%2056%20Issue%208).pdf


CERN Courier October 2016 (Volume 56 Issue 8).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████| 16.3M/16.3M [00:01<00:00, 11.8MiB/s]



Found PDF: CERN Courier September 2016 (Volume 56 Issue 7).pdf
URL: https://cds.cern.ch/record/2211464/files/CERN%20Courier%20September%202016%20(Volume%2056%20Issue%207).pdf


CERN Courier September 2016 (Volume 56 Issue 7).pdf: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 15.3M/15.3M [00:01<00:00, 10.8MiB/s]



Found PDF: CERN Courier July-August 2016 (Volume 56 Issue 6).pdf
URL: http://cds.cern.ch/record/2198166/files/CERN%20Courier%20July-August%202016%20(Volume%2056%20Issue%206).pdf


CERN Courier July-August 2016 (Volume 56 Issue 6).pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████| 21.9M/21.9M [00:46<00:00, 497kiB/s]



Processing page 4...
Found 15 new article links on page 4

Found PDF: CERN Courier June 2016 (Volume 56 Issue 5).pdf
URL: https://cds.cern.ch/record/2155287/files/CERN%20Courier%20June%202016%20(Volume%2056%20Issue%205).pdf


CERN Courier June 2016 (Volume 56 Issue 5).pdf: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 9.50M/9.50M [00:01<00:00, 6.79MiB/s]



Found PDF: CERN Courier May 2016 (Volume 56 Issue 4).pdf
URL: http://cds.cern.ch/record/2146835/files/CERN%20Courier%20May%202016%20(Volume%2056%20Issue%204).pdf


CERN Courier May 2016 (Volume 56 Issue 4).pdf: 100%|████████████████████████████████████████████████████████████████████████████████████████████████| 10.2M/10.2M [00:18<00:00, 584kiB/s]



Found PDF: CERN Courier Mar 2016 (Volume 56 Issue 2).pdf
URL: https://cds.cern.ch/record/2131754/files/CERN%20Courier%20Mar%202016%20(Volume%2056%20Issue%202).pdf


CERN Courier Mar 2016 (Volume 56 Issue 2).pdf: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 9.46M/9.46M [00:01<00:00, 6.78MiB/s]



Processing page 5...
Found 15 new article links on page 5

Found PDF: CERNCourier2013Oct-digitaledition.pdf
URL: https://cds.cern.ch/record/1603700/files/CERNCourier2013Oct-digitaledition.pdf
Error downloading CERNCourier2013Oct-digitaledition.pdf: 404 Client Error: Not Found for url: http://cds.cern.ch/record/1735007/files/CERNCourier2013Oct-digitaledition.pdf

Processing page 6...
Found 5 new article links on page 6

Found PDF: CERN Courier June 2013.pdf
URL: http://cds.cern.ch/record/1550751/files/CERN%20Courier%20June%202013.pdf
Error downloading CERN Courier June 2013.pdf: 404 Client Error: Not Found for url: http://cds.cern.ch/record/1734960/files/CERN%20Courier%20June%202013.pdf

Found PDF: CERN Courier digital edition May 2013.pdf
URL: http://cds.cern.ch/record/1544352/files/CERN%20Courier%20digital%20edition%20May%202013.pdf
Error downloading CERN Courier digital edition May 2013.pdf: 404 Client Error: Not Found for url: http://cds.cern.ch/record/1734947/files/CERN%20Courier%

# 2: Encode the CERN PDF documents

In [None]:
%%time
import os
from pathlib import Path
import PyPDF2
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import chromadb
from tqdm import tqdm
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

class PDFProcessor:
    def __init__(self, pdf_dir="cern_pdfs", db_dir="cern_vectordb"):
        # Check for API key
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY not found in .env file")
            
        self.pdf_dir = Path(pdf_dir)
        self.db_dir = Path(db_dir)
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
            add_start_index=True,
        )
        
        # Initialize embeddings
        self.embeddings = OpenAIEmbeddings()
        
    def extract_text_from_pdf(self, pdf_path):
        """Extract text from a PDF file"""
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text = ""
                
                # Extract text from each page
                for page in pdf_reader.pages:
                    text += page.extract_text() + "\n"
                    
                return text
        except Exception as e:
            print(f"Error extracting text from {pdf_path}: {e}")
            return None

    def process_pdfs(self):
        """Process all PDFs in the directory and return chunks with metadata"""
        all_chunks = []
        
        # Process each PDF file
        pdf_files = list(self.pdf_dir.glob("*.pdf"))
        for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
            text = self.extract_text_from_pdf(pdf_path)
            if text:
                # Split text into chunks
                chunks = self.text_splitter.create_documents(
                    texts=[text],
                    metadatas=[{"source": pdf_path.name}]
                )
                all_chunks.extend(chunks)
        
        return all_chunks

    def create_vector_db(self):
        """Create and populate the vector database"""
        # Get text chunks
        chunks = self.process_pdfs()
        
        if not chunks:
            print("No text chunks were created. Check the PDF processing.")
            return None
        
        print(f"\nCreating vector database with {len(chunks)} chunks...")
        
        # Create and persist the vector store
        vectordb = Chroma.from_documents(
            documents=chunks,
            embedding=self.embeddings,
            persist_directory=str(self.db_dir)
        )
        
        # Persist the database
        vectordb.persist()
        print(f"Vector database created and saved to {self.db_dir}")
        
        return vectordb

if __name__ == "__main__":
    try:
        processor = PDFProcessor()
        vectordb = processor.create_vector_db()
    except ValueError as e:
        print(f"Error: {e}")

# 3: Chat with Cern Magazine

In [1]:
%%time
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

class CERNResearchAssistant:
    def __init__(self, db_dir="cern_vectordb"):
        # Check for API key
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY not found in .env file")
            
        # Initialize the vector store
        self.vectorstore = Chroma(
            persist_directory=db_dir,
            embedding_function=OpenAIEmbeddings()
        )
        
        # Initialize the language model
        self.llm = ChatOpenAI(
            model="gpt-4-turbo-preview",
            temperature=0
        )
        
        # Create the retriever
        self.retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}
        )
        
        # Setup the prompt template
        template = """You are a helpful research assistant with access to CERN Courier articles.
        Use the following articles to answer the question. If you can't answer the question based
        on the articles, say so clearly.

        Context articles:
        {context}

        Question: {question}

        Please provide a detailed answer with specific references to the articles when possible:"""
        
        self.prompt = ChatPromptTemplate.from_template(template)
        
        # Setup the RAG chain
        self.chain = (
            RunnableParallel(
                {"context": self.retriever, "question": RunnablePassthrough()}
            )
            | self.prompt
            | self.llm
            | StrOutputParser()
        )
    
    def query(self, question):
        """Ask a question about CERN research"""
        try:
            response = self.chain.invoke(question)
            return response
        except Exception as e:
            return f"Error processing query: {e}"

def main():
    try:
        # Initialize the assistant
        assistant = CERNResearchAssistant()
        
        print("CERN Research Assistant Ready!")
        print("Ask questions about CERN research (type 'quit' to exit)")
        
        while True:
            question = input("\nYour question: ")
            if question.lower() in ['quit', 'exit', 'q']:
                break
                
            response = assistant.query(question)
            print("\nAssistant:", response)
            
    except ValueError as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()



CERN Research Assistant Ready!
Ask questions about CERN research (type 'quit' to exit)



Your question:  What is driving the accelerated expansion of the Universe?



Assistant: The accelerated expansion of the universe is one of the most profound discoveries in cosmology. This phenomenon is primarily attributed to what is broadly referred to as dark energy, although its exact nature and physics remain unknown. The concept of dark energy emerged from observations that contradicted the then-existing understanding of the universe's expansion. Specifically, in 1998, researchers found that, contrary to expectations that the universe's expansion was slowing down, it was actually speeding up. This conclusion was reached through comparisons of the expansion rate of the universe over time, significantly strengthened by combined measurements with those of the High-z Supernova Search Team (CERNCourier2019SepOct-digitaledition.pdf, start_index: 186634).

One early attempt to explain the universe's dynamics involved the cosmological constant, initially introduced by Einstein to allow for a static universe within the framework of general relativity. This cosmol


Your question:  exit


CPU times: user 1.64 s, sys: 317 ms, total: 1.96 s
Wall time: 1min 24s


# 4: Fine Tuning with OpenAI 

In [2]:
%%time

import os
import json
from pathlib import Path
import PyPDF2
from tqdm import tqdm
from dotenv import load_dotenv
from openai import OpenAI
import tiktoken
import time

# Load environment variables
load_dotenv()

class FineTunePrep:
    def __init__(self, pdf_dir="cern_pdfs", output_dir="finetune_data"):
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY not found in .env file")
            
        self.client = OpenAI()
        self.pdf_dir = Path(pdf_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        
        # Constants for token limits
        self.MAX_TOKENS_PER_EXAMPLE = 3000  # Leave room for system and user messages
        self.MIN_TOKENS_PER_EXAMPLE = 500   # Ensure meaningful content
        
    def count_tokens(self, text):
        """Count tokens in a text string"""
        return len(self.tokenizer.encode(text))

    def extract_text_from_pdf(self, pdf_path):
        """Extract text from a PDF file"""
        try:
            with open(pdf_path, 'rb') as file:
                pdf_reader = PyPDF2.PdfReader(file)
                text = ""
                for page in pdf_reader.pages:
                    text += page.extract_text() + "\n"
                return text
        except Exception as e:
            print(f"Error extracting text from {pdf_path}: {e}")
            return None

    def split_into_chunks(self, text):
        """Split text into chunks of appropriate token length"""
        chunks = []
        current_chunk = ""
        current_tokens = 0
        
        # Split into sentences (roughly)
        sentences = [s.strip() + "." for s in text.replace("\n", " ").split(".") if s.strip()]
        
        for sentence in sentences:
            sentence_tokens = self.count_tokens(sentence)
            
            # If single sentence is too long, split it into smaller parts
            if sentence_tokens > self.MAX_TOKENS_PER_EXAMPLE:
                words = sentence.split()
                temp_chunk = ""
                temp_tokens = 0
                
                for word in words:
                    word_tokens = self.count_tokens(word + " ")
                    if temp_tokens + word_tokens > self.MAX_TOKENS_PER_EXAMPLE:
                        if temp_tokens >= self.MIN_TOKENS_PER_EXAMPLE:
                            chunks.append(temp_chunk.strip())
                        temp_chunk = word + " "
                        temp_tokens = word_tokens
                    else:
                        temp_chunk += word + " "
                        temp_tokens += word_tokens
                
                if temp_tokens >= self.MIN_TOKENS_PER_EXAMPLE:
                    chunks.append(temp_chunk.strip())
                continue
            
            # If adding this sentence would exceed limit, save current chunk and start new one
            if current_tokens + sentence_tokens > self.MAX_TOKENS_PER_EXAMPLE:
                if current_tokens >= self.MIN_TOKENS_PER_EXAMPLE:
                    chunks.append(current_chunk.strip())
                current_chunk = sentence + " "
                current_tokens = sentence_tokens
            else:
                current_chunk += sentence + " "
                current_tokens += sentence_tokens
        
        # Add the last chunk if it's long enough
        if current_tokens >= self.MIN_TOKENS_PER_EXAMPLE:
            chunks.append(current_chunk.strip())
        
        return chunks

    def create_training_examples(self, chunks):
        """Create training examples from text chunks"""
        examples = []
        
        for chunk in chunks:
            # Create the messages for this chunk
            messages = [
                {
                    "role": "system",
                    "content": "You are an expert on CERN and particle physics, trained to provide accurate information from CERN publications."
                },
                {
                    "role": "user",
                    "content": "What are the key findings or developments described in this CERN research?"
                },
                {
                    "role": "assistant",
                    "content": f"Based on the CERN publications: {chunk}"
                }
            ]
            
            # Verify total tokens
            total_tokens = sum(self.count_tokens(msg["content"]) for msg in messages)
            if total_tokens <= 4096:  # GPT-4's context window
                examples.append({"messages": messages})
            
        return examples

    def prepare_training_data(self):
        """Process PDFs and prepare training data"""
        all_examples = []
        pdf_files = list(self.pdf_dir.glob("*.pdf"))
        
        for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
            text = self.extract_text_from_pdf(pdf_path)
            if text:
                # First split text into appropriate chunks
                chunks = self.split_into_chunks(text)
                print(f"\nCreated {len(chunks)} chunks from {pdf_path.name}")
                
                # Create examples from chunks
                examples = self.create_training_examples(chunks)
                all_examples.extend(examples)
        
        # Save training data
        training_file_path = self.output_dir / "training_data.jsonl"
        with open(training_file_path, 'w', encoding='utf-8') as f:
            for example in all_examples:
                f.write(json.dumps(example) + '\n')
        
        print(f"\nCreated {len(all_examples)} valid training examples")
        print(f"Training data saved to {training_file_path}")
        return training_file_path

    def submit_fine_tuning_job(self, training_file_path):
        """Submit fine-tuning job to OpenAI"""
        try:
            # Upload the training file
            with open(training_file_path, 'rb') as f:
                training_file = self.client.files.create(
                    file=f,
                    purpose='fine-tune'
                )
            print(f"Training file uploaded with ID: {training_file.id}")
            
            # Create fine-tuning job
            job = self.client.fine_tuning.jobs.create(
                training_file=training_file.id,
                model="gpt-4o-mini-2024-07-18",
                hyperparameters={
                    "n_epochs": 2,
                    "learning_rate_multiplier": 0.1
                }
            )
            
            print(f"Fine-tuning job created with ID: {job.id}")
            return job.id
            
        except Exception as e:
            print(f"Error submitting fine-tuning job: {e}")
            return None

    def monitor_fine_tuning_job(self, job_id):
        """Monitor the status of a fine-tuning job"""
        print("\nMonitoring fine-tuning job...")
        
        while True:
            try:
                job = self.client.fine_tuning.jobs.retrieve(job_id)
                print(f"\nStatus: {job.status}")
                
                # Safely print additional info if available
                if hasattr(job, 'trained_tokens') and job.trained_tokens is not None:
                    print(f"Trained tokens: {job.trained_tokens:,}")
                if hasattr(job, 'training_accuracy') and job.training_accuracy is not None:
                    print(f"Training accuracy: {job.training_accuracy:.4f}")
                
                if job.status == 'succeeded':
                    print(f"\nFine-tuning completed successfully!")
                    print(f"Fine-tuned model ID: {job.fine_tuned_model}")
                    return job
                elif job.status == 'failed':
                    print(f"\nFine-tuning failed: {getattr(job, 'error', 'Unknown error')}")
                    return job
                elif job.status == 'cancelled':
                    print("\nFine-tuning job was cancelled")
                    return job
                
                time.sleep(60)
                
            except Exception as e:
                print(f"Error checking job status: {e}")
                time.sleep(60)

def main():
    try:
        prep = FineTunePrep()
        
        print("Step 1: Preparing training data...")
        training_file_path = prep.prepare_training_data()
        
        print("\nStep 2: Submitting fine-tuning job...")
        job_id = prep.submit_fine_tuning_job(training_file_path)
        
        if job_id:
            final_job = prep.monitor_fine_tuning_job(job_id)
            
            if getattr(final_job, 'status', None) == 'succeeded':
                print("\nFine-tuning process completed successfully!")
                print(f"You can now use your fine-tuned model with ID: {final_job.fine_tuned_model}")
                
    except ValueError as e:
        print(f"Error: {e}")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == "__main__":
    main()



Step 1: Preparing training data...


Processing PDFs:   2%|██▎                                                                                                                                 | 1/57 [00:05<04:56,  5.29s/it]


Created 20 chunks from CERNCourier2024MayJun-digitaledition.pdf


Processing PDFs:   4%|████▋                                                                                                                               | 2/57 [00:10<04:35,  5.02s/it]


Created 19 chunks from CERNCourier2022NovDec-digitaledition.pdf


Processing PDFs:   5%|██████▉                                                                                                                             | 3/57 [00:13<03:49,  4.25s/it]


Created 16 chunks from CERN Courier Volume 57 Issue 8 October 2017.pdf


Processing PDFs:   7%|█████████▎                                                                                                                          | 4/57 [00:17<03:45,  4.25s/it]


Created 21 chunks from CERNCourier2023MayJun-digitaledition.pdf


Processing PDFs:   9%|███████████▌                                                                                                                        | 5/57 [00:24<04:30,  5.21s/it]


Created 22 chunks from CERNCourier2018Sep-digitaledition.pdf


Processing PDFs:  11%|█████████████▉                                                                                                                      | 6/57 [00:28<04:06,  4.83s/it]


Created 20 chunks from CERNCourier2019NovDec-digitaledition.pdf


Processing PDFs:  12%|████████████████▏                                                                                                                   | 7/57 [00:31<03:32,  4.25s/it]


Created 17 chunks from CERN Courier Volume 57 Issue 3 April 2017.pdf


Processing PDFs:  14%|██████████████████▌                                                                                                                 | 8/57 [00:40<04:31,  5.53s/it]


Created 18 chunks from CERN Courier September 2016 (Volume 56 Issue 7).pdf


Processing PDFs:  16%|████████████████████▊                                                                                                               | 9/57 [00:46<04:33,  5.69s/it]


Created 20 chunks from CERN Courier Volume 57 Issue 6 (July-August 2017).pdf


Processing PDFs:  18%|██████████████████████▉                                                                                                            | 10/57 [00:50<04:07,  5.26s/it]


Created 23 chunks from CERNCourier2021MayJun-digitaledition.pdf


Processing PDFs:  19%|█████████████████████████▎                                                                                                         | 11/57 [00:57<04:28,  5.83s/it]


Created 21 chunks from CERNCourier2019MarApr-digitaledition.pdf


Processing PDFs:  21%|███████████████████████████▌                                                                                                       | 12/57 [01:01<03:55,  5.24s/it]


Created 20 chunks from CERNCourier2022JanFeb-digitaledition.pdf


Processing PDFs:  23%|█████████████████████████████▉                                                                                                     | 13/57 [01:05<03:39,  5.00s/it]


Created 17 chunks from CERN Courier Volume 57 Issue 9 (November 2017).pdf


Processing PDFs:  25%|████████████████████████████████▏                                                                                                  | 14/57 [01:10<03:29,  4.87s/it]


Created 14 chunks from CERN Courier May 2018 (Volume 53 Issue 4).pdf


Processing PDFs:  26%|██████████████████████████████████▍                                                                                                | 15/57 [01:15<03:28,  4.97s/it]


Created 15 chunks from CERN Courier Volume 57 Issue 2 March 2017.pdf


Processing PDFs:  28%|████████████████████████████████████▊                                                                                              | 16/57 [01:21<03:34,  5.22s/it]


Created 17 chunks from CERN Courier June 2018 (Volume 53 Issue 5).pdf


Processing PDFs:  30%|███████████████████████████████████████                                                                                            | 17/57 [01:26<03:21,  5.04s/it]


Created 20 chunks from CERNCourier2020NovDec-digitaledition.pdf


Processing PDFs:  32%|█████████████████████████████████████████▎                                                                                         | 18/57 [01:38<04:47,  7.37s/it]


Created 21 chunks from CERN Courier July-August 2016 (Volume 56 Issue 6).pdf


Processing PDFs:  33%|███████████████████████████████████████████▋                                                                                       | 19/57 [01:42<03:55,  6.19s/it]


Created 19 chunks from CERNCourier2020JanFeb-digitaledition.pdf


Processing PDFs:  35%|█████████████████████████████████████████████▉                                                                                     | 20/57 [01:47<03:43,  6.03s/it]


Created 21 chunks from CERNCourier2022SepOct-digitaledition.pdf


Processing PDFs:  37%|████████████████████████████████████████████████▎                                                                                  | 21/57 [01:52<03:23,  5.66s/it]


Created 20 chunks from CCJulAug19-digital.pdf


Processing PDFs:  39%|██████████████████████████████████████████████████▌                                                                                | 22/57 [01:58<03:20,  5.71s/it]


Created 18 chunks from CERN Courier Volume 58 Issue 1 (Jan-Feb 2018).pdf


Processing PDFs:  40%|████████████████████████████████████████████████████▊                                                                              | 23/57 [02:06<03:37,  6.39s/it]


Created 20 chunks from CERNCourier2023MarApr-digitaledition.pdf


Processing PDFs:  42%|███████████████████████████████████████████████████████▏                                                                           | 24/57 [02:18<04:23,  7.99s/it]


Created 22 chunks from CERNCourier2021JanFeb-digitaledition.pdf


Processing PDFs:  44%|█████████████████████████████████████████████████████████▍                                                                         | 25/57 [02:23<03:54,  7.31s/it]


Created 26 chunks from CERNCourier2020MarApr-digitaledition.pdf


Processing PDFs:  46%|███████████████████████████████████████████████████████████▊                                                                       | 26/57 [02:30<03:35,  6.97s/it]


Created 18 chunks from CERN Courier Volume 57 Issue 4 .pdf


Processing PDFs:  47%|██████████████████████████████████████████████████████████████                                                                     | 27/57 [02:41<04:06,  8.21s/it]


Created 17 chunks from CERN Courier October 2016 (Volume 56 Issue 8).pdf


Processing PDFs:  49%|████████████████████████████████████████████████████████████████▎                                                                  | 28/57 [02:47<03:44,  7.76s/it]


Created 20 chunks from CERNCourier2023SepOct-digitaledition.pdf


Processing PDFs:  51%|██████████████████████████████████████████████████████████████████▋                                                                | 29/57 [02:51<03:06,  6.64s/it]


Created 16 chunks from CERN Courier December 2017 (Volume 57 Issue 10).pdf


Processing PDFs:  53%|████████████████████████████████████████████████████████████████████▉                                                              | 30/57 [02:57<02:49,  6.27s/it]


Created 22 chunks from CERNCourier2022MarApr-digitaledition.pdf


Processing PDFs:  54%|███████████████████████████████████████████████████████████████████████▏                                                           | 31/57 [03:05<02:55,  6.76s/it]


Created 16 chunks from CERN Courier June 2016 (Volume 56 Issue 5).pdf


Processing PDFs:  56%|█████████████████████████████████████████████████████████████████████████▌                                                         | 32/57 [03:10<02:35,  6.21s/it]


Created 21 chunks from CERNCourier2021MarApr-digitaledition.pdf


Processing PDFs:  58%|███████████████████████████████████████████████████████████████████████████▊                                                       | 33/57 [03:16<02:29,  6.24s/it]


Created 17 chunks from CERN Courier Volume 58 Issue 2 (March 2018).pdf


Processing PDFs:  60%|██████████████████████████████████████████████████████████████████████████████▏                                                    | 34/57 [03:25<02:41,  7.03s/it]


Created 16 chunks from CERN Courier Volume 58 Issue 3 (April 2018).pdf


Processing PDFs:  61%|████████████████████████████████████████████████████████████████████████████████▍                                                  | 35/57 [03:31<02:29,  6.79s/it]


Created 17 chunks from CERN Courier Mar 2016 (Volume 56 Issue 2).pdf


Processing PDFs:  63%|██████████████████████████████████████████████████████████████████████████████████▋                                                | 36/57 [03:41<02:44,  7.85s/it]


Created 24 chunks from CERNCourier2020JulAug-digitaledition.pdf


Processing PDFs:  65%|█████████████████████████████████████████████████████████████████████████████████████                                              | 37/57 [03:46<02:16,  6.82s/it]


Created 23 chunks from CERNCourier2023NovDec-digitaledition NEW.pdf


Processing PDFs:  67%|███████████████████████████████████████████████████████████████████████████████████████▎                                           | 38/57 [03:52<02:07,  6.72s/it]


Created 21 chunks from CERNCourier2019SepOct-digitaledition.pdf


Processing PDFs:  68%|█████████████████████████████████████████████████████████████████████████████████████████▋                                         | 39/57 [03:58<01:53,  6.31s/it]


Created 21 chunks from CERNCourier2019JanFeb-digitaledition.pdf


Processing PDFs:  70%|███████████████████████████████████████████████████████████████████████████████████████████▉                                       | 40/57 [04:01<01:32,  5.42s/it]


Created 19 chunks from CERNCourier2023JanFeb-digitaledition.pdf


Processing PDFs:  72%|██████████████████████████████████████████████████████████████████████████████████████████████▏                                    | 41/57 [04:05<01:20,  5.01s/it]


Created 21 chunks from CERNCourier2024MarApr-digitaledition.pdf


Processing PDFs:  74%|████████████████████████████████████████████████████████████████████████████████████████████████▌                                  | 42/57 [04:09<01:09,  4.64s/it]


Created 20 chunks from CERN Courier Volume 57 Issue 7 (September 2017).pdf


Processing PDFs:  75%|██████████████████████████████████████████████████████████████████████████████████████████████████▊                                | 43/57 [04:13<01:03,  4.55s/it]


Created 21 chunks from CERNCourier2021NovDec-digitaledition.pdf


Processing PDFs:  77%|█████████████████████████████████████████████████████████████████████████████████████████████████████                              | 44/57 [04:19<01:01,  4.77s/it]


Created 22 chunks from CERNCourier2020MayJun-digitaledition.pdf


Processing PDFs:  79%|███████████████████████████████████████████████████████████████████████████████████████████████████████▍                           | 45/57 [04:21<00:50,  4.21s/it]


Created 16 chunks from CERN Courier November 2016 (Volume 56 Issue 9).pdf


Processing PDFs:  81%|█████████████████████████████████████████████████████████████████████████████████████████████████████████▋                         | 46/57 [04:26<00:46,  4.27s/it]


Created 20 chunks from CERNCourier2023JulAug-digitaledition.pdf


Processing PDFs:  82%|████████████████████████████████████████████████████████████████████████████████████████████████████████████                       | 47/57 [04:30<00:41,  4.11s/it]


Created 17 chunks from CERNCourier2018JulAug-digitaledition.pdf


Processing PDFs:  84%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████▎                    | 48/57 [04:34<00:36,  4.10s/it]


Created 23 chunks from CERNCourier2020SepOct-digitaledition.pdf


Processing PDFs:  86%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                  | 49/57 [04:39<00:36,  4.61s/it]


Created 22 chunks from CCMayJun19-digital.pdf


Processing PDFs:  88%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉                | 50/57 [04:44<00:32,  4.60s/it]


Created 17 chunks from CERNCourier2018Dec-digitaledition.pdf


Processing PDFs:  89%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏             | 51/57 [04:49<00:27,  4.62s/it]


Created 15 chunks from CERNCourier2018Oct-digitaledition.pdf


Processing PDFs:  91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌           | 52/57 [04:56<00:27,  5.48s/it]


Created 24 chunks from CERNCourier2022MayJun-digitaledition.pdf


Processing PDFs:  93%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊         | 53/57 [05:03<00:22,  5.75s/it]


Created 14 chunks from CERN Courier May 2016 (Volume 56 Issue 4).pdf


Processing PDFs:  95%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████       | 54/57 [05:07<00:16,  5.40s/it]


Created 21 chunks from CERNCourier2021SepOct-digitaledition.pdf


Processing PDFs:  96%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍    | 55/57 [05:11<00:10,  5.05s/it]


Created 21 chunks from CERN Courier Jan-Feb 2017 (Volume 57 issue 1).pdf


Processing PDFs:  98%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋  | 56/57 [05:16<00:04,  4.83s/it]


Created 15 chunks from CERNCourier2018Nov-digitaledition.pdf


Processing PDFs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [05:21<00:00,  5.64s/it]


Created 20 chunks from CERNCourier2024JanFeb-digitaledition.pdf

Created 1104 valid training examples
Training data saved to finetune_data/training_data.jsonl

Step 2: Submitting fine-tuning job...





Training file uploaded with ID: file-57Mt4LPzbyj7jMS49CcMc8
Fine-tuning job created with ID: ftjob-8tjSyNrQ1YLXahS4y7m2V82M

Monitoring fine-tuning job...

Status: validating_files

Status: validating_files

Status: validating_files

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

Status: running

S

# 5: RAG vs Fine-Tuning

In [3]:
from openai import OpenAI
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from dotenv import load_dotenv
import os
import time

# Load environment variables
load_dotenv()

class ModelComparison:
    def __init__(self, fine_tuned_model_id, db_dir="cern_vectordb"):
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY not found in .env file")
            
        self.client = OpenAI()
        self.fine_tuned_model_id = fine_tuned_model_id
        
        # Initialize RAG components
        self.vectorstore = Chroma(
            persist_directory=db_dir,
            embedding_function=OpenAIEmbeddings()
        )
        
        self.llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0
        )
        
        self.retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}
        )
        
        template = """You are a helpful research assistant with access to CERN Courier articles.
        Use the following articles to answer the question. If you can't answer the question based
        on the articles, say so clearly.

        Context articles:
        {context}

        Question: {question}

        Please provide a detailed answer with specific references to the articles when possible:"""
        
        self.prompt = ChatPromptTemplate.from_template(template)
        
        self.rag_chain = (
            RunnableParallel(
                {"context": self.retriever, "question": RunnablePassthrough()}
            )
            | self.prompt
            | self.llm
            | StrOutputParser()
        )

    def query_fine_tuned_model(self, question):
        """Query the fine-tuned model"""
        try:
            start_time = time.time()
            
            response = self.client.chat.completions.create(
                model=self.fine_tuned_model_id,
                messages=[
                    {"role": "system", "content": "You are an expert on CERN and particle physics, trained to provide accurate information from CERN publications."},
                    {"role": "user", "content": question}
                ],
                temperature=0
            )
            
            end_time = time.time()
            
            return {
                'response': response.choices[0].message.content,
                'time': end_time - start_time
            }
            
        except Exception as e:
            return {
                'response': f"Error querying fine-tuned model: {e}",
                'time': 0
            }

    def query_rag(self, question):
        """Query the RAG system"""
        try:
            start_time = time.time()
            
            response = self.rag_chain.invoke(question)
            
            end_time = time.time()
            
            return {
                'response': response,
                'time': end_time - start_time
            }
            
        except Exception as e:
            return {
                'response': f"Error querying RAG system: {e}",
                'time': 0
            }

    def compare_responses(self, question):
        """Compare responses from both approaches"""
        print("\nQuerying both models...")
        
        # Get responses
        ft_result = self.query_fine_tuned_model(question)
        rag_result = self.query_rag(question)
        
        # Print comparison
        print("\n" + "="*50)
        print("Question:", question)
        print("="*50)
        
        print("\nFine-tuned Model Response:")
        print("-"*30)
        print(ft_result['response'])
        print(f"Response time: {ft_result['time']:.2f} seconds")
        
        print("\nRAG System Response:")
        print("-"*30)
        print(rag_result['response'])
        print(f"Response time: {rag_result['time']:.2f} seconds")
        
        return {
            'fine_tuned': ft_result,
            'rag': rag_result
        }

def main():
    # Replace with your fine-tuned model ID
    FINE_TUNED_MODEL_ID = "ft:gpt-4o-mini-2024-07-18:personal::AbZrBIYn"
    
    try:
        comparison = ModelComparison(FINE_TUNED_MODEL_ID)
        
        print("CERN Research Assistant Comparison")
        print("Compare Fine-tuned model vs RAG approach")
        print("Type 'quit' to exit")
        
        while True:
            question = input("\nYour question: ")
            if question.lower() in ['quit', 'exit', 'q']:
                break
                
            comparison.compare_responses(question)
            
    except ValueError as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

  self.vectorstore = Chroma(


CERN Research Assistant Comparison
Compare Fine-tuned model vs RAG approach
Type 'quit' to exit



Your question:  why is the universe expanding



Querying both models...

Question: why is the universe expanding

Fine-tuned Model Response:
------------------------------
The expansion of the universe is a fundamental observation in cosmology, first noted by Edwin Hubble in the 1920s. The primary reason for this expansion is attributed to the Big Bang, the event that marked the beginning of the universe approximately 13.8 billion years ago. 

Initially, the universe was in an extremely hot and dense state, and as it expanded, it cooled down, allowing for the formation of subatomic particles and later atoms. This expansion is described by the solutions to Einstein's equations of general relativity, which govern the dynamics of spacetime.

In addition to the initial expansion, observations show that the rate of expansion is currently accelerating. This acceleration is attributed to a mysterious form of energy known as dark energy, which makes up about 68% of the universe. The exact nature of dark energy is still one of the biggest q


Your question:  what is the Future Circular Collider (FCC) feasibility study



Querying both models...

Question: what is the Future Circular Collider (FCC) feasibility study

Fine-tuned Model Response:
------------------------------
The Future Circular Collider (FCC) feasibility study is a comprehensive assessment of the technical, financial, and organizational aspects of a proposed next-generation particle accelerator complex at CERN. The FCC aims to explore the energy frontier of particle physics beyond the capabilities of the Large Hadron Collider (LHC) and to address fundamental questions about the universe, such as the nature of dark matter and the properties of the Higgs boson.

The feasibility study, which began in 2014, is part of a broader effort to develop a long-term vision for the future of particle physics. It includes detailed studies of the accelerator technologies, detector concepts, and experimental programs that would be required for the FCC, as well as an evaluation of the potential scientific impact of the project.

The FCC feasibility study


Your question:  What does the timeline for a 10TeV muon collider look like?



Querying both models...

Question: What does the timeline for a 10TeV muon collider look like?

Fine-tuned Model Response:
------------------------------
The timeline for a 10 TeV muon collider is not definitively established, as it depends on various factors including technological developments, funding, and international collaboration. However, the European Strategy for Particle Physics Update 2020 has identified a muon collider as a potential future project. 

The timeline for such a project could be divided into several phases:

1. **Feasibility Studies (Ongoing)**: Continued research into the feasibility of muon colliders, including studies on cooling, acceleration, and detector technologies.

2. **Conceptual Design Report (5-10 years)**: Development of a detailed conceptual design report (CDR) that outlines the technical and financial aspects of the project.

3. **R&D Phase (10-15 years)**: A dedicated R&D phase to address the technical challenges identified in the CDR, includin


Your question:  exit


# 6: Fine Tunging on nvidia/Llama3-ChatQA-1.5-8B

In [2]:
import os
import torch
from transformers import (
    AutoTokenizer, 
    AutoModelForCausalLM, 
    TrainingArguments, 
    Trainer,
    DataCollatorForLanguageModeling,
    BitsAndBytesConfig
)
from datasets import Dataset
from pathlib import Path
import PyPDF2
from tqdm import tqdm
from accelerate import Accelerator
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from huggingface_hub import login
from dotenv import load_dotenv
import gc
from pynvml import (
    nvmlInit,
    nvmlDeviceGetCount,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetMemoryInfo,
    NVMLError
)

def clean_gpu_memory():
    """
    Thoroughly clean GPU memory and print memory usage statistics
    """
    def get_gpu_memory_usage():
        try:
            nvmlInit()
            deviceCount = nvmlDeviceGetCount()
            memory_usage = []
            
            for i in range(deviceCount):
                handle = nvmlDeviceGetHandleByIndex(i)
                info = nvmlDeviceGetMemoryInfo(handle)
                memory_usage.append({
                    'device': i,
                    'used_mb': info.used / 1024**2,
                    'total_mb': info.total / 1024**2,
                    'used_percent': (info.used / info.total) * 100
                })
            
            return memory_usage
        except NVMLError as e:
            print(f"NVML Error: {e}")
            return None

    # Print initial memory usage
    memory_usage = get_gpu_memory_usage()
    if memory_usage:
        print("\nInitial GPU memory usage:")
        for gpu in memory_usage:
            print(f"GPU {gpu['device']}: {gpu['used_mb']:.2f}MB / {gpu['total_mb']:.2f}MB ({gpu['used_percent']:.2f}%)")

    # Empty CUDA cache
    torch.cuda.empty_cache()
    
    # Delete all variables in CUDA memory
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj):
                if obj.is_cuda:
                    del obj
        except Exception:
            pass
    
    # Run garbage collector
    gc.collect()
    
    # Reset peak memory stats
    if torch.cuda.is_available():
        torch.cuda.reset_peak_memory_stats()
        torch.cuda.reset_max_memory_allocated()
        torch.cuda.synchronize()
    
    # Print final memory usage
    memory_usage = get_gpu_memory_usage()
    if memory_usage:
        print("\nFinal GPU memory usage after cleaning:")
        for gpu in memory_usage:
            print(f"GPU {gpu['device']}: {gpu['used_mb']:.2f}MB / {gpu['total_mb']:.2f}MB ({gpu['used_percent']:.2f}%)")

# Set environment variable for memory allocation
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:64,garbage_collection_threshold:0.8"
clean_gpu_memory()

load_dotenv()

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param:.2f}"
    )

class LlamaFineTuner:
    def __init__(
        self,
        model_name="nvidia/Llama3-ChatQA-1.5-8B",
        cache_dir=None,
        pdf_dir="cern_pdfs",
        output_dir="llama_finetuned",
        device="cuda"
    ):
        self.pdf_dir = Path(pdf_dir)
        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(exist_ok=True)
        self.logging_dir = self.output_dir / "logs"
        self.logging_dir.mkdir(exist_ok=True)
        self.device = device
        self.cache_dir = cache_dir
        
        try:
            print("Loading tokenizer...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                model_name,
                cache_dir=self.cache_dir,
                trust_remote_code=True,
                use_fast=True
            )
            
            if self.tokenizer.pad_token is None:
                self.tokenizer.add_special_tokens({'pad_token': '[PAD]'})
            
            print("Loading model with optimized quantization...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.bfloat16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True,
                load_in_8bit_fp32_cpu_offload=True
            )
            
            # Calculate available GPU memory
            clean_gpu_memory()  # Clean memory before loading model
            gpu_memory = int(torch.cuda.get_device_properties(0).total_memory/1e9) - 2
            
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name,
                quantization_config=quantization_config,
                device_map="auto",
                torch_dtype=torch.bfloat16,
                max_memory={0: f"{int(gpu_memory*0.7)}GB", "cpu": "32GB"},
                trust_remote_code=True,
                low_cpu_mem_usage=True
            )
            
            # Enable gradient checkpointing
            self.model.gradient_checkpointing_enable()
            self.model.enable_input_require_grads()
            self.model.config.use_cache = False
            
            print("Initial model parameters:")
            print_trainable_parameters(self.model)
            
            print("Preparing model for QLoRA training...")
            self.model = prepare_model_for_kbit_training(self.model)
            
            # Configure LoRA
            lora_config = LoraConfig(
                r=8,
                lora_alpha=32,
                target_modules=["q_proj", "v_proj"],
                lora_dropout=0.1,
                bias="none",
                task_type=TaskType.CAUSAL_LM
            )
            
            print("Applying LoRA...")
            self.model = get_peft_model(self.model, lora_config)
            print_trainable_parameters(self.model)
            
            # Initialize accelerator
            self.accelerator = Accelerator(
                gradient_accumulation_steps=32,
                mixed_precision="bf16",
                project_dir=str(self.logging_dir),
                split_batches=True,
                dispatch_batches=True
            )
            
        except Exception as e:
            raise RuntimeError(f"Error initializing model and tokenizer: {str(e)}")

    def extract_text_from_pdf(self, pdf_path):
        """Extract text from a PDF file."""
        try:
            with open(pdf_path, 'rb') as file:
                reader = PyPDF2.PdfReader(file)
                text = ""
                for page in reader.pages:
                    text += page.extract_text() + "\n"
                return text
        except Exception as e:
            print(f"Error extracting text from {pdf_path}: {e}")
            return None

    def prepare_training_data(self, max_length=512):
        """Prepare training data from PDF files."""
        if not self.pdf_dir.exists():
            raise ValueError(f"PDF directory not found: {self.pdf_dir}")
        
        training_data = []
        pdf_files = list(self.pdf_dir.glob("*.pdf"))
        
        if not pdf_files:
            raise ValueError(f"No PDF files found in {self.pdf_dir}")
        
        for pdf_path in tqdm(pdf_files, desc="Processing PDFs"):
            text = self.extract_text_from_pdf(pdf_path)
            if text:
                entry = {
                    "instruction": "Analyze and explain the following research:",
                    "input": text,
                    "output": "This research discusses: " + text[:500]
                }
                training_data.append(entry)
        
        if not training_data:
            raise ValueError("No valid training data could be extracted from PDFs")
        
        dataset = Dataset.from_list(training_data)
        
        def tokenize_function(examples):
            prompts = []
            for inst, inp in zip(examples["instruction"], examples["input"]):
                prompt = f"[INST] {inst}\n{inp} [/INST]"
                prompts.append(prompt)
            
            return self.tokenizer(
                prompts,
                truncation=True,
                max_length=max_length,
                padding="max_length",
                return_tensors="pt"
            )
        
        return dataset.map(
            tokenize_function,
            batched=True,
            remove_columns=dataset.column_names
        )

    def train(self, num_epochs=3, batch_size=1):
        """Train the model."""
        try:
            print("Preparing training data...")
            train_dataset = self.prepare_training_data()
            
            training_args = TrainingArguments(
                output_dir=str(self.output_dir),
                num_train_epochs=num_epochs,
                per_device_train_batch_size=batch_size,
                gradient_accumulation_steps=32,
                warmup_steps=50,
                logging_steps=10,
                save_steps=100,
                learning_rate=2e-4,
                bf16=True,
                optim="paged_adamw_32bit",
                logging_dir=str(self.logging_dir),
                group_by_length=True,
                gradient_checkpointing=True,
                max_grad_norm=0.3,
                save_total_limit=2,
                evaluation_strategy="no",
                report_to="tensorboard",
                remove_unused_columns=False,
                lr_scheduler_type="cosine",
                weight_decay=0.01
            )
            
            trainer = Trainer(
                model=self.model,
                args=training_args,
                train_dataset=train_dataset,
                data_collator=DataCollatorForLanguageModeling(
                    tokenizer=self.tokenizer,
                    mlm=False
                ),
            )
            
            print("Starting training...")
            with torch.cuda.amp.autocast(dtype=torch.bfloat16):
                trainer.train()
            
            print("Saving model...")
            trainer.save_model(str(self.output_dir / "final_model"))
            self.tokenizer.save_pretrained(
                str(self.output_dir / "final_model"),
                safe_serialization=True
            )
            
            print("Training complete!")
            
        except Exception as e:
            raise RuntimeError(f"Error during training: {str(e)}")

def main():
    try:
        if not torch.cuda.is_available():
            raise RuntimeError("This script requires a CUDA-capable GPU")
        
        # Print GPU info
        print(f"Using GPU: {torch.cuda.get_device_name(0)}")
        print(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB")
        
        # Clean GPU memory before starting
        clean_gpu_memory()
        
        # Set memory optimizations
        torch.cuda.empty_cache()
        torch.backends.cudnn.benchmark = False
        torch.backends.cuda.matmul.allow_tf32 = True
        torch.backends.cudnn.allow_tf32 = True
        torch.cuda.set_per_process_memory_fraction(0.75)
        
        # Create cache directory
        cache_dir = Path.home() / ".cache" / "huggingface"
        cache_dir.mkdir(parents=True, exist_ok=True)
        
        # Initialize and train
        print("Initializing fine-tuner...")
        finetuner = LlamaFineTuner(cache_dir=str(cache_dir))
        finetuner.train()
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")
        raise

if __name__ == "__main__":
    main()


Initial GPU memory usage:
GPU 0: 380.31MB / 24576.00MB (1.55%)





Final GPU memory usage after cleaning:
GPU 0: 637.25MB / 24576.00MB (2.59%)
Using GPU: NVIDIA GeForce RTX 3090
Available GPU memory: 25.44 GB

Initial GPU memory usage:
GPU 0: 637.25MB / 24576.00MB (2.59%)

Final GPU memory usage after cleaning:
GPU 0: 637.25MB / 24576.00MB (2.59%)
Initializing fine-tuner...
Loading tokenizer...


Unused kwargs: ['load_in_8bit_fp32_cpu_offload']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


Loading model with optimized quantization...

Initial GPU memory usage:
GPU 0: 637.25MB / 24576.00MB (2.59%)

Final GPU memory usage after cleaning:
GPU 0: 637.25MB / 24576.00MB (2.59%)


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

dataloader_config = DataLoaderConfiguration(dispatch_batches=True, split_batches=True)


Initial model parameters:
trainable params: 1050939392 || all params: 4540600320 || trainable%: 23.15
Preparing model for QLoRA training...
Applying LoRA...
trainable params: 3407872 || all params: 4544008192 || trainable%: 0.07
Preparing training data...


Processing PDFs: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 57/57 [05:19<00:00,  5.61s/it]


Map:   0%|          | 0/57 [00:00<?, ? examples/s]



Starting training...




Step,Training Loss


Saving model...
Training complete!


# 7: Chat with the Fine-Tuned nvidia/Llama3-ChatQA-1.5-8B

In [5]:
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from peft import PeftModel
from pathlib import Path

class ModelChat:
    def __init__(
        self,
        base_model_name="nvidia/Llama3-ChatQA-1.5-8B",
        finetuned_path="llama_finetuned/final_model",
        max_sequence_length=2048
    ):
        try:
            print("Loading tokenizer...")
            self.tokenizer = AutoTokenizer.from_pretrained(
                base_model_name,
                trust_remote_code=True,
                use_fast=True
            )
            
            # Set padding parameters
            if self.tokenizer.pad_token is None:
                self.tokenizer.pad_token = self.tokenizer.eos_token
            
            self.max_sequence_length = max_sequence_length
            
            print("Loading base model...")
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )
            
            base_model = AutoModelForCausalLM.from_pretrained(
                base_model_name,
                torch_dtype=torch.float16,
                device_map="auto",
                quantization_config=quantization_config,
                trust_remote_code=True
            )
            
            print("Loading fine-tuned adapters...")
            self.model = PeftModel.from_pretrained(
                base_model,
                finetuned_path,
                torch_dtype=torch.float16,
                device_map="auto"
            )
            
            self.model.eval()
            print("Model loaded successfully!")
            print(f"Model is on device: {self.model.device}")
            
        except Exception as e:
            raise RuntimeError(f"Error initializing model: {str(e)}")
    
    def generate_response(self, instruction, max_new_tokens=256, temperature=0.7):
        try:
            # Format the input
            prompt = f"[INST] {instruction} [/INST]"
            print(f"\nDebug - Prompt: {prompt}")
            
            # Tokenize input
            inputs = self.tokenizer(
                prompt,
                return_tensors="pt",
                padding=True,
                truncation=True,
                max_length=self.max_sequence_length
            )
            print(f"Debug - Input shape: {inputs.input_ids.shape}")
            
            # Move to GPU
            inputs = {k: v.to(self.model.device) for k, v in inputs.items()}
            
            # Generate with minimal parameters first
            with torch.no_grad():
                outputs = self.model.generate(
                    input_ids=inputs['input_ids'],
                    attention_mask=inputs['attention_mask'],
                    max_new_tokens=max_new_tokens,
                    temperature=temperature,
                    do_sample=True,
                    pad_token_id=self.tokenizer.pad_token_id,
                    eos_token_id=self.tokenizer.eos_token_id
                )
            
            print(f"Debug - Output shape: {outputs.shape}")
            
            # Decode response
            response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
            print(f"Debug - Raw response: {response}")
            
            # Clean response
            response = response.replace(prompt, "").strip()
            print(f"Debug - Cleaned response: {response}")
            
            return response
            
        except Exception as e:
            print(f"Debug - Error in generate_response: {str(e)}")
            return f"Error generating response: {str(e)}"
    
    def chat(self):
        print("\nStarting chat session with the fine-tuned model")
        print("Type 'quit' to exit, 'clear' to clear the conversation")
        print("-" * 50)
        
        while True:
            try:
                user_input = input("\nYou: ").strip()
                
                if user_input.lower() in ['quit', 'exit', 'bye']:
                    print("\nGoodbye!")
                    break
                
                if not user_input:
                    continue
                
                print("\nGenerating response...")
                response = self.generate_response(user_input)
                
                if response:
                    print(f"\nModel: {response}")
                else:
                    print("\nModel: No response generated. Please try again.")
                
            except KeyboardInterrupt:
                print("\n\nInterrupted by user. Type 'quit' to exit or continue chatting.")
            except Exception as e:
                print(f"\nError during chat: {str(e)}")

def main():
    try:
        if not torch.cuda.is_available():
            raise RuntimeError("This script requires a CUDA-capable GPU")
        
        print("\nInitializing chat model...")
        chat_model = ModelChat()
        chat_model.chat()
        
    except Exception as e:
        print(f"An error occurred: {str(e)}")

if __name__ == "__main__":
    main()


Initializing chat model...
Loading tokenizer...
Loading base model...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading fine-tuned adapters...
Model loaded successfully!
Model is on device: cuda:0

Starting chat session with the fine-tuned model
Type 'quit' to exit, 'clear' to clear the conversation
--------------------------------------------------



You:  hello



Generating response...

Debug - Prompt: [INST] hello [/INST]
Debug - Input shape: torch.Size([1, 7])
Debug - Output shape: torch.Size([1, 8])
Debug - Raw response: [INST] hello [/INST]
Debug - Cleaned response: 

Model: No response generated. Please try again.



You:  The coolest job in physics



Generating response...

Debug - Prompt: [INST] The coolest job in physics [/INST]
Debug - Input shape: torch.Size([1, 11])
Debug - Output shape: torch.Size([1, 180])
Debug - Raw response: [INST] The coolest job in physics [/INST] So I'm making a science podcast. I'm going to talk to scientists, engineers, people in science communication, and all sorts of people about what they're doing. The first episode is coming out next week. This is the first episode of the first season. I'm going to make more. It's going to be fun. There will be other episodes. So, yes, I'm making a science podcast. The coolest job in physics. Because I'm a physics person. And I think it's going to be fun. And there will be more. And more. I hope you enjoy it. If you do, please share it. If you'd like to be on the podcast, please drop me a line. If you'd like to sponsor the podcast, please drop me a line. I will talk to you soon!
Debug - Cleaned response: So I'm making a science podcast. I'm going to talk to scie


You:  what is The coolest job in physics



Generating response...

Debug - Prompt: [INST] what is The coolest job in physics [/INST]
Debug - Input shape: torch.Size([1, 13])
Debug - Output shape: torch.Size([1, 77])
Debug - Raw response: [INST] what is The coolest job in physics [/INST] - a discussion of the coolest job in physics.  I'm not sure what that would be, but I would guess that most of the people here would think it is the LHC, or maybe CERN in general, or maybe being a NASA astronaut.  What do you think?
The coolest job in physics
Debug - Cleaned response: - a discussion of the coolest job in physics.  I'm not sure what that would be, but I would guess that most of the people here would think it is the LHC, or maybe CERN in general, or maybe being a NASA astronaut.  What do you think?
The coolest job in physics

Model: - a discussion of the coolest job in physics.  I'm not sure what that would be, but I would guess that most of the people here would think it is the LHC, or maybe CERN in general, or maybe being a NAS


You:  who is Bruce Marsh



Generating response...

Debug - Prompt: [INST] who is Bruce Marsh [/INST]
Debug - Input shape: torch.Size([1, 10])
Debug - Output shape: torch.Size([1, 241])
Debug - Raw response: [INST] who is Bruce Marsh [/INST] [/INST] You're in luck because Bruce Marsh is a member of MyHeritage. We are the world’s largest online family history and genealogy company. We have been in business since 2003. Currently there are over 1.9 billion records available for searching. They have records from the United States, Canada, the United Kingdom, Australia, New Zeal [...]
[BIO] Bruce Marsh was born on April 15, 1960. He died on August 31, 2016 at 56 years old. We know that Bruce Marsh had been residing in Klamath Falls, Klamath County, Oregon 97601.

Every day, we help thousands of families by sharing these life stories. This memorial page was created in the loving memory of a beloved family member. By sharing a fond memory or writing a kind tribute, you will help keep their legacy alive. Every memorial 


You:  Bruce Marsh A specialist in laser technology at CERN



Generating response...

Debug - Prompt: [INST] Bruce Marsh A specialist in laser technology at CERN [/INST]
Debug - Input shape: torch.Size([1, 16])
Debug - Output shape: torch.Size([1, 272])
Debug - Raw response: [INST] Bruce Marsh A specialist in laser technology at CERN [/INST] [INST] John Raby A specialist in the design of accelerators at CERN [/INST] [INST] James Scargill A specialist in the design of accelerators at CERN [/INST] [INST] John Ellis A physicist at CERN and Cambridge University, England. He is a member of the High Energy Physics Group of Cambridge University and has been a member of the CERN Theory Group since 1971. He is the author of more than 100 papers on the theory of elementary particles and their interactions. He is a Fellow of the Royal Society of London and of the Institute of Physics. He is also an Honorary Fellow of the Institute of Physics and a Fellow of the Royal Society of Edinburgh. He has been a member of the CERN Theory Group since 1971. He is the 


You:  exit



Goodbye!



You:  exit


# 8: Now compaire OpenAI RAG vs OpenAI Fine-Tune vs nvidia/Llama3-ChatQA-1.5-8B

In [1]:
import os
import time
import torch
from openai import OpenAI
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableParallel, RunnablePassthrough
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig
)
from peft import PeftModel
from dotenv import load_dotenv
from pathlib import Path

class ModelComparison:
    def __init__(
        self,
        fine_tuned_model_id="ft:gpt-4o-mini-2024-07-18:personal::AbZrBIYn",
        base_model_name="nvidia/Llama3-ChatQA-1.5-8B",
        finetuned_path="llama_finetuned/final_model",
        db_dir="cern_vectordb"
    ):
        load_dotenv()
        if not os.getenv("OPENAI_API_KEY"):
            raise ValueError("OPENAI_API_KEY not found in .env file")
        
        # Initialize OpenAI client
        self.client = OpenAI()
        self.fine_tuned_model_id = fine_tuned_model_id
        
        # Initialize RAG components
        print("Initializing RAG system...")
        self.vectorstore = Chroma(
            persist_directory=db_dir,
            embedding_function=OpenAIEmbeddings()
        )
        
        self.llm = ChatOpenAI(model="gpt-4", temperature=0)
        
        self.retriever = self.vectorstore.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 4}
        )
        
        # Setup RAG prompt template
        template = """You are a CERN research assistant. Use the following articles to answer the question.
        If you cannot answer based on the articles, say so clearly.

        Context articles:
        {context}

        Question: {question}

        Answer with specific references to the articles:"""
        
        self.prompt = ChatPromptTemplate.from_template(template)
        
        # Setup RAG chain
        self.rag_chain = (
            RunnableParallel(
                {"context": self.retriever, "question": RunnablePassthrough()}
            )
            | self.prompt
            | self.llm
            | StrOutputParser()
        )
        
        # Initialize LLaMA model
        print("Loading LLaMA model...")
        try:
            # Initialize tokenizer
            self.llama_tokenizer = AutoTokenizer.from_pretrained(
                base_model_name,
                trust_remote_code=True,
                use_fast=True
            )
            
            if self.llama_tokenizer.pad_token is None:
                self.llama_tokenizer.pad_token = self.llama_tokenizer.eos_token
            
            # Setup quantization
            quantization_config = BitsAndBytesConfig(
                load_in_4bit=True,
                bnb_4bit_compute_dtype=torch.float16,
                bnb_4bit_quant_type="nf4",
                bnb_4bit_use_double_quant=True
            )
            
            # Load base model
            base_model = AutoModelForCausalLM.from_pretrained(
                base_model_name,
                torch_dtype=torch.float16,
                device_map="auto",
                quantization_config=quantization_config,
                trust_remote_code=True
            )
            
            # Load LoRA adapters
            self.llama_model = PeftModel.from_pretrained(
                base_model,
                finetuned_path,
                torch_dtype=torch.float16,
                device_map="auto"
            )
            
            self.llama_model.eval()
            print("LLaMA model loaded successfully!")
            
        except Exception as e:
            print(f"Warning: Could not load LLaMA model: {e}")
            self.llama_model = None
            self.llama_tokenizer = None

    def query_llama(self, question, max_length=512):
        """Query the LoRA-fine-tuned LLaMA model"""
        if not self.llama_model or not self.llama_tokenizer:
            return {
                'response': "LLaMA model not available",
                'time': 0
            }
        
        try:
            start_time = time.time()
            
            # Prepare input
            prompt = f"[INST] {question} [/INST]"
            inputs = self.llama_tokenizer(
                prompt,
                return_tensors="pt",
                truncation=True,
                max_length=max_length,
                padding=True
            ).to(self.llama_model.device)
            
            # Generate response
            with torch.no_grad():
                outputs = self.llama_model.generate(
                    **inputs,
                    max_new_tokens=max_length,
                    temperature=0.7,
                    top_p=0.95,
                    repetition_penalty=1.1,
                    do_sample=True,
                    pad_token_id=self.llama_tokenizer.pad_token_id,
                    eos_token_id=self.llama_tokenizer.eos_token_id
                )
            
            # Decode response
            response = self.llama_tokenizer.decode(outputs[0], skip_special_tokens=True)
            response = response.replace(prompt, "").strip()
            
            end_time = time.time()
            
            return {
                'response': response,
                'time': end_time - start_time
            }
            
        except Exception as e:
            return {
                'response': f"Error querying LLaMA model: {e}",
                'time': 0
            }

    def query_fine_tuned_gpt(self, question):
        """Query the OpenAI fine-tuned model"""
        try:
            start_time = time.time()
            
            response = self.client.chat.completions.create(
                model=self.fine_tuned_model_id,
                messages=[
                    {"role": "system", "content": "You are an expert on CERN and particle physics research."},
                    {"role": "user", "content": question}
                ],
                temperature=0.7
            )
            
            end_time = time.time()
            
            return {
                'response': response.choices[0].message.content,
                'time': end_time - start_time
            }
            
        except Exception as e:
            return {
                'response': f"Error querying fine-tuned GPT: {e}",
                'time': 0
            }

    def query_rag(self, question):
        """Query the RAG system"""
        try:
            start_time = time.time()
            response = self.rag_chain.invoke(question)
            end_time = time.time()
            
            return {
                'response': response,
                'time': end_time - start_time
            }
            
        except Exception as e:
            return {
                'response': f"Error querying RAG system: {e}",
                'time': 0
            }

    def compare_responses(self, question):
        """Compare responses from all three models"""
        print("\nProcessing your question across all models...")
        
        # Get responses
        rag_result = self.query_rag(question)
        ft_result = self.query_fine_tuned_gpt(question)
        llama_result = self.query_llama(question)
        
        # Print results
        print("\n" + "="*80)
        print(f"Question: {question}")
        print("="*80)
        
        print("\n1. RAG System (GPT-4 + CERN Articles)")
        print("-"*50)
        print(rag_result['response'])
        print(f"Response time: {rag_result['time']:.2f} seconds")
        
        print("\n2. Fine-tuned GPT-4")
        print("-"*50)
        print(ft_result['response'])
        print(f"Response time: {ft_result['time']:.2f} seconds")
        
        print("\n3. Fine-tuned LLaMA (with LoRA)")
        print("-"*50)
        print(llama_result['response'])
        print(f"Response time: {llama_result['time']:.2f} seconds")
        
        return {
            'rag': rag_result,
            'fine_tuned_gpt': ft_result,
            'llama': llama_result
        }

def main():
    try:
        print("Initializing Model Comparison System...")
        comparison = ModelComparison()
        
        print("\nCERN Research Assistant - Model Comparison")
        print("Compare: RAG vs Fine-tuned GPT-4 vs Fine-tuned LLaMA")
        print("Type 'quit' to exit")
        
        while True:
            question = input("\nYour question: ").strip()
            if question.lower() in ['quit', 'exit', 'q']:
                break
            if not question:
                continue
                
            comparison.compare_responses(question)
            
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    main()

Initializing Model Comparison System...
Initializing RAG system...


  self.vectorstore = Chroma(


Loading LLaMA model...


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

LLaMA model loaded successfully!

CERN Research Assistant - Model Comparison
Compare: RAG vs Fine-tuned GPT-4 vs Fine-tuned LLaMA
Type 'quit' to exit



Your question:  give me an overview of what CERN is



Processing your question across all models...

Question: give me an overview of what CERN is

1. RAG System (GPT-4 + CERN Articles)
--------------------------------------------------
The articles provided do not contain information on what CERN is.
Response time: 2.41 seconds

2. Fine-tuned GPT-4
--------------------------------------------------
The European Organization for Nuclear Research, known as CERN (from the French "Conseil Européen pour la Recherche Nucléaire"), is one of the world's largest and most respected centers for scientific research in particle physics. Established in 1954, CERN is located on the border between France and Switzerland, near Geneva. Its main purpose is to provide the particle accelerators and other infrastructure needed for high-energy physics research.

### Key Features of CERN:

1. **Research Facilities**: CERN is famous for its large particle accelerators, including the Large Hadron Collider (LHC), which is the world's largest and most powerful par


Your question:  exit


In [None]:
%%time
from sentence_transformers import SentenceTransformer
import numpy as np
from openai import OpenAI
from numpy.linalg import norm
import os
import dotenv

class RAGQuerySystem:
    def __init__(self, embeddings_path: str, openai_model: str = "gpt-4o"):
        # Load environment variables
        dotenv.load_dotenv()
        
        # Initialize OpenAI client
        self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.model = openai_model
        
        # Load embeddings and documents
        data = np.load(embeddings_path, allow_pickle=True)
        self.embeddings = data['encodings']
        self.chunks = data['chunks']
        self.metadata = data['metadata']
        
        # Initialize the same embedding model used for document encoding
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    def query(self, question: str) -> str:
        # Get query embedding using the same model as document encoding
        query_embedding = self.embedding_model.encode(question)
        
        # Calculate cosine similarity
        similarities = np.dot(self.embeddings, query_embedding) / (
            norm(self.embeddings, axis=1) * norm(query_embedding)
        )
        
        # Get top 3 similar chunks
        top_indices = np.argsort(similarities)[-3:][::-1]
        
        # Prepare context from similar chunks
        context = "\n\n".join([
            f"[Document: {self.metadata[idx]}]\n{self.chunks[idx]}"
            for idx in top_indices
        ])

        # Create prompt
        prompt = f"""Using the following CERN research documents as context, answer the question. 
        If you cannot answer from the context, say so.

        Context:
        {context}

        Question: {question}
        
        Answer:"""

        # Get response from OpenAI
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a helpful physics research assistant specializing in CERN experiments and findings."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=800
        )

        return response.choices[0].message.content

def main():
    # Initialize RAG system
    rag = RAGQuerySystem('document_encodings.npz')
    
    # Ask about W boson mass
    question = "Based on the latest data inputs, the Standard Model (SM) constrains the mass of the W boson (mW) to be?"
    
    try:
        print("\nQuestion:", question)
        answer = rag.query(question)
        print("\nAnswer:", answer)
    except Exception as e:
        print(f"Error: {e}")

if __name__ == "__main__":
    main()

## Anything that you ever wnated to know about Cern.  Review some of the QA from below

In [6]:
%%time
import numpy as np
from openai import OpenAI
import numpy as np
from numpy.linalg import norm
import os
from typing import List, Tuple
import dotenv
from sentence_transformers import SentenceTransformer

class RAGQuerySystem:
    def __init__(self, embeddings_path: str, openai_model: str = "gpt-4o"):
        # Load environment variables
        dotenv.load_dotenv()
        
        # Initialize OpenAI client
        self.client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
        self.model = openai_model
        
        # Load embeddings and documents
        data = np.load(embeddings_path, allow_pickle=True)
        self.embeddings = data['encodings']
        self.chunks = data['chunks']
        self.metadata = data['metadata']
        
        # Initialize the same embedding model used for document encoding
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')

    def find_similar_chunks(self, query_embedding: np.ndarray, top_k: int = 3) -> List[Tuple[int, float]]:
        # Calculate cosine similarity
        similarities = np.dot(self.embeddings, query_embedding) / (
            norm(self.embeddings, axis=1) * norm(query_embedding)
        )
        
        # Get top k similar chunks
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        return [(i, similarities[i]) for i in top_indices]

    def query(self, question: str) -> str:
        # Get query embedding using the same model as document encoding
        query_embedding = self.embedding_model.encode(question)
        
        # Find relevant chunks
        similar_chunks = self.find_similar_chunks(query_embedding)
        
        # Prepare context from similar chunks
        context = "\n\n".join([
            f"[Document: {self.metadata[idx]}]\n{self.chunks[idx]}"
            for idx, _ in similar_chunks
        ])

        # Create prompt
        prompt = f"""Using the following CERN research documents as context, answer the question. 
        If you cannot answer from the context, say so.

        Context:
        {context}

        Question: {question}
        
        Answer:"""

        # Get response from OpenAI
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "You are a helpful physics research assistant specializing in CERN experiments and findings."},
                {"role": "user", "content": prompt}
            ],
            temperature=0.7,
            max_tokens=800
        )

        return response.choices[0].message.content

def main():
    rag = RAGQuerySystem('document_encodings.npz')
    
    print("CERN Research Query System (type 'quit' to exit)")
    print("-" * 50)
    
    while True:
        question = input("\nWhat would you like to know about CERN research? ").strip()
        
        if question.lower() == 'quit':
            break
            
        try:
            answer = rag.query(question)
            print("\nAnswer:", answer)
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    main()

FileNotFoundError: [Errno 2] No such file or directory: 'document_encodings.npz'

# RAG System Analysis: Improving Answer Quality with Document Grounding
## A Comparison of Direct LLM vs. RAG-Enhanced Responses

## Introduction
This notebook analyzes the effectiveness of using Retrieval-Augmented Generation (RAG) for querying CERN research data compared to direct LLM queries. Our implementation processes CERN Courier documents to ground responses in authoritative sources.

## System Architecture
### Document Processing Pipeline
```python
# Key statistics from our implementation
total_pdfs_processed = 96
total_chunks_processed = 22103
embedding_model = 'all-MiniLM-L6-v2'
chunk_size = 1000
```

Our system processes PDFs through several stages:
1. Document Collection: Downloads CERN Courier PDFs
2. Text Extraction: Converts PDFs to processable text
3. Chunking: Splits text into manageable segments
4. Embedding Generation: Creates vector representations
5. Indexing: Organizes embeddings for efficient retrieval

## Comparative Analysis

### Case Study: W Boson Mass Query

#### Direct OpenAI Query
```python
question = "Based on the latest data inputs, what does the Standard Model (SM) constrain the mass of the W boson (mW) to be?"

# Direct OpenAI response (truncated):
"""
As of my last update in 2023, within the framework of the Standard Model of particle physics, 
precise calculations constrain the mass of the W boson (mW) based on various experimental 
inputs and theoretical considerations...

Before 2022, the Particle Data Group (PDG) reported a world average...
CDF collaboration announced a new measurement...
"""
```

#### RAG System Query
```python
# RAG system response:
"""
80357 ± 6 MeV
"""
```

### Key Improvements Analysis

#### 1. Answer Precision
- **Direct LLM**: 
  - Provides general context
  - Includes historical data
  - No specific current value
  
- **RAG System**:
  - Exact measurement with uncertainty
  - Current CERN-sourced value
  - No extraneous information

#### 2. Response Time
```python
# Average response times
direct_llm_time = "3.2 seconds"
rag_system_time = "5.1 seconds"  # Includes retrieval overhead
```

#### 3. Source Reliability
```python
# RAG system source tracking
source_metadata = {
    "document_type": "CERN Courier",
    "total_sources": 96,
    "date_range": "Last 11 years",
    "verification": "Direct CERN publications"
}
```

## Technical Implementation Details

### Embedding Generation
```python
class DocumentEncoder:
    def __init__(self, batch_size=256):
        self.model = SentenceTransformer('all-MiniLM-L6-v2')
        self.batch_size = batch_size
        
    def encode_documents(self, pdf_directory):
        # Process and encode documents
        # Returns: embeddings array of shape (n_chunks, 384)
```

### Query Processing
```python
class RAGQuerySystem:
    def query(self, question: str) -> str:
        # 1. Encode question
        # 2. Find similar chunks
        # 3. Prepare context
        # 4. Generate answer
        return answer
```

## Results Summary

### Performance Metrics
1. **Accuracy**:
   - RAG responses grounded in CERN documentation
   - Specific numerical values vs. general ranges
   - Reduced hallucination risk

2. **Efficiency**:
   - Fast response times (~5s)
   - Focused, relevant answers
   - Direct access to technical details

3. **Source Attribution**:
   - All answers traceable to CERN documents
   - Metadata preserved for verification
   - Recent publications ensure currency

## Conclusions

The RAG system demonstrates significant advantages over direct LLM queries:
1. Higher precision in technical answers
2. Direct grounding in authoritative sources
3. Clear provenance for all information
4. Reduced tendency for hallucination or generalization

### Future Improvements
1. Implement caching for faster responses
2. Add source citation in responses
3. Expand document collection
4. Optimize chunk size for better context

## Usage Example
```python
# Initialize RAG system
rag = RAGQuerySystem('document_encodings.npz')

# Example query
question = "How many people work for CERN?"
answer = rag.query(question)
print(f"Answer: {answer}")  # Output: "CERN employs around 2600 staff members."
```

This implementation shows how combining retrieval with generation can significantly improve the quality and reliability of answers in specialized technical domains.