In [11]:
!pip install -q beautifulsoup4 requests transformers sentence-transformers faiss-cpu langchain gradio

In [12]:
!pip install -U langchain-community



In [13]:
!pip install gradio



In [14]:
import os

# Folder where the Markdown files will be saved
base_folder = '/content/drive/MyDrive/md_knowledge_base'

# Dictionary containing the filenames and their respective content
md_files = {
    "general_information.md": """
# General Information about the University of Cambridge

## Location:
- Cambridge, Cambridgeshire, England, UK.

## Founded:
- 1209

## Motto:
- "Hinc lucem et pocula sacra" (From here, light and sacred draughts)

## Overview:
- The University of Cambridge is one of the oldest universities in the world, with over 800 years of history. It is known for its rigorous academic standards and its world-class research and teaching.

## Famous Alumni:
- Sir Isaac Newton
- Charles Darwin
- Stephen Hawking
- Alan Turing

## Accreditations and Rankings:
- Ranked 2nd globally in the QS World University Rankings (2023).
- Member of the prestigious Russell Group of UK universities.

## Website:
- [www.cam.ac.uk](https://www.cam.ac.uk)
""",

    "admissions.md": """
# Admissions at the University of Cambridge

## Undergraduate Admissions:

### Requirements:
- **A-Levels**: A*A*A for science and engineering courses; A*AA for other courses.
- **International Baccalaureate (IB)**: 41-42 points, including core points.
- **English Language Requirements**:
  - IELTS: 7.5 overall, with no less than 7.0 in each component.
  - TOEFL iBT: 110 overall, with at least 25 in each component.

### Application Process:
- Application through UCAS.
- Deadline: 15 October of the year preceding entry.

### Documents Required:
- Personal Statement
- Reference Letter
- Academic Transcripts

### Entrance Exams:
- Cambridge Admissions Test is required for certain courses (e.g., Law, Medicine).

## Postgraduate Admissions:

### Requirements:
- A 2:1 UK Bachelor's degree or equivalent.

### Application Process:
- Apply directly through the University of Cambridge’s portal.

### Documents Required:
- Statement of Purpose
- Academic Transcripts
- References
- CV/Resume
""",

    "courses.md": """
# Courses at the University of Cambridge

## Undergraduate Courses:

### Some of the available undergraduate programs:
- **Anglo-Saxon, Norse, and Celtic**
- **Archaeology**
- **Classics**
- **Economics**
- **Engineering**
- **Law**
- **Mathematics**
- **Physics**
- **Philosophy**

### Example Course Duration:
- 3 years for most courses (e.g., Law, History, Economics).
- 4 years for some courses (e.g., Engineering, Architecture).

## Postgraduate Courses:

### Some of the available postgraduate programs:
- **MPhil in Computer Science**
- **MSc in Advanced Computer Science**
- **MSc in Physics**
- **MBA**
- **MSc in Economics**
- **PhD in History**

### Duration:
- MPhil: Typically 1-2 years.
- MSc: 1 year.
- PhD: 3-4 years.
""",

    "fees.md": """
# Tuition Fees and Funding at the University of Cambridge

## Undergraduate Tuition Fees:

### Home Students:
- £9,535 per year.

### International Students:
- Fees vary by course.
- Examples:
  - **Anglo-Saxon, Norse, and Celtic**: £27,024
  - **Economics**: £27,024
  - **Mathematics**: £30,144
  - **Chemical Engineering**: £41,124
  - **Veterinary Medicine**: £70,554

## Postgraduate Tuition Fees:

### University Composition Fee (UCF):
- For most postgraduate courses, the UCF varies.
- Example for MPhil: £26,300 for international students (for courses like Economics, Engineering, and Physics).

### Maintenance (Living Costs):
- Estimated at £12,000–£14,000 per year for a full-time student.
""",

    "student_life.md": """
# Student Life at the University of Cambridge

## Colleges:
- The University consists of 31 autonomous colleges, each offering accommodation, social spaces, and support services.
- Examples: **Trinity College**, **King's College**, **St John's College**.

## Extracurricular Activities:
- Over 400 student-run societies.
- Sports teams, including rowing, cricket, rugby, and athletics.

## Support Services:
- **Counseling and Mental Health Services**.
- **Disability Support**.
- **Academic and Career Support**.
""",

    "campus_facilities.md": """
# Campus Facilities at the University of Cambridge

## Libraries:
- Over 100 libraries, including the **University Library** which holds over 8 million books.

## Sports Facilities:
- **University Sports Centre**: Gym, pool, courts, and boathouse.
- Other sports facilities for cricket, rugby, and athletics.

## Study and Dining Facilities:
- Many cafes, dining halls, and study spaces across the campus and colleges.

## Health Services:
- The **University Health Service** provides healthcare to all students.
""",

    "international_students.md": """
# International Students at the University of Cambridge

## Visa Information:
- Student visa (the route that replaced Tier 4 (General) in 2020) required for international students.

## Support:
- The **International Student Office** provides guidance on visas, orientation, and settling into Cambridge.

## English Language Requirements:
- IELTS: 7.5 overall with no band below 7.0.
- TOEFL iBT: 110 overall, with a minimum of 25 in each component.
""",

    "careers_and_employability.md": """
# Careers and Employability at the University of Cambridge

## Career Services:
- Offers career advice, CV workshops, and employer connections.
- **Handshake** platform for job and internship opportunities.

## Graduate Outcomes:
- High employability rates, with a significant portion of graduates securing top jobs or further study placements.

## Alumni Network:
- The University boasts an extensive and influential alumni network.
""",

    "research_and_impact.md": """
# Research and Impact at the University of Cambridge

## Research Excellence:
- Cambridge is globally renowned for its research contributions in science, technology, and medicine.

## Key Research Areas:
- **Artificial Intelligence**: Cutting-edge research in AI and machine learning.
- **Medical Research**: Contributions to cancer treatment, neuroscience, and more.

## Industry Collaborations:
- Cambridge partners with various industries for research and innovation, including tech companies, healthcare institutions, and government bodies.
"""
}

# Create the directory if it doesn't exist
os.makedirs(base_folder, exist_ok=True)

# Write the data to the markdown files
for filename, content in md_files.items():
    file_path = os.path.join(base_folder, filename)
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(content)

print("Markdown files created successfully!")


Markdown files created successfully!


In [15]:
!pip install -U langchain_classic




In [27]:
import os
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS as LangChainFAISS
from langchain_community.embeddings import HuggingFaceEmbeddings

# Path to the knowledge base (adjust this to your correct folder path)
KB_PATH = "/content/drive/MyDrive/md_knowledge_base"

# Get all files in the directory
files = [os.path.join(dp, f) for dp, _, fn in os.walk(KB_PATH) for f in fn if f.endswith(".md")]
docs = [doc for f in files for doc in TextLoader(f, encoding="utf-8").load()]

# Pick a chunk size from the document length: short files get smaller chunks,
# long files get larger ones
def get_dynamic_chunk_size(text):
    if len(text) < 1000:
        return 300  # short documents
    elif len(text) < 5000:
        return 500  # medium-length documents
    else:
        return 1000  # long, detailed documents

# Split the documents dynamically with the adjusted chunk size
chunks = []
for doc in docs:
    chunk_size = get_dynamic_chunk_size(doc.page_content)
    chunk_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=100)  # Adjusted overlap
    chunks.extend(chunk_splitter.split_documents([doc]))

# Verify the first chunk to ensure the content is correct and not truncated
print(f"First chunk preview: {chunks[0].page_content[:500]}")  # Print the first 500 characters of the first chunk

# Extract text content from chunks (no longer directly used for embeddings or FAISS creation this way)
texts = [t.page_content for t in chunks]

# Initialize the embedding function
embed_model_id = "sentence-transformers/all-MiniLM-L6-v2"
embed_fn = HuggingFaceEmbeddings(model_name=embed_model_id)

# Create the LangChain FAISS vectorstore directly from documents
# The .from_documents method handles chunk embedding and index creation
vectorstore = LangChainFAISS.from_documents(chunks, embed_fn)

print("✅ FAISS vectorstore is ready for retrieval.")

# Example query
question = "What documents are required for undergraduate admission?"
results = vectorstore.similarity_search(question, k=5)  # retrieve the 5 closest chunks

# Output the content of each retrieved chunk
for doc in results:
    print(doc.page_content)

First chunk preview: # Admissions at the University of Cambridge

## Undergraduate Admissions:


  embed_fn = HuggingFaceEmbeddings(model_name=embed_model_id)
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



✅ FAISS vectorstore is ready for retrieval.
### Application Process:
- Application through UCAS.
- Deadline: 15 October of the year preceding entry.

### Documents Required:
- Personal Statement
- Reference Letter
- Academic Transcripts

### Entrance Exams:
- Cambridge Admissions Test is required for certain courses (e.g., Law, Medicine).
# Admissions at the University of Cambridge

## Undergraduate Admissions:
## Postgraduate Admissions:

### Requirements:
- A 2:1 UK Bachelor's degree or equivalent.

### Application Process:
- Apply directly through the University of Cambridge’s portal.

### Documents Required:
- Statement of Purpose
- Academic Transcripts
- References
- CV/Resume
### Requirements:
- **A-Levels**: A*A*A for science and engineering courses; A*AA for other courses.
- **International Baccalaureate (IB)**: 41-42 points, including core points.
- **English Language Requirements**:
  - IELTS: 7.5 overall, with no less than 7.0 in each component.
# Courses at the University o
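
To make the retrieval step less of a black box, here is a minimal sketch of what a similarity search does conceptually: compare the query vector with every chunk vector and return the closest chunks. It uses hand-made 3-dimensional vectors and two hypothetical helpers, `cosine` and `top_k_by_cosine`, in place of the real MiniLM embeddings and the FAISS index. The notebook's vectorstore may rank by a different distance (for example Euclidean), so treat this as an illustration under those assumptions rather than a description of FAISS internals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_by_cosine(query_vec, chunk_vecs, k=2):
    """Return the indices of the k most similar chunks, plus all scores."""
    scores = [cosine(query_vec, v) for v in chunk_vecs]
    # Sort chunk indices from most to least similar
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores

# Toy "embeddings", made up for illustration only
chunk_texts = ["Documents Required", "Tuition Fees", "Sports Facilities"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
query_vec = [1.0, 0.2, 0.0]

indices, scores = top_k_by_cosine(query_vec, chunk_vecs, k=2)
for i in indices:
    print(chunk_texts[i], round(scores[i], 3))
```

In the notebook, `k=5` plays the same role as `k=2` here: it only controls how many of the best-scoring chunks are passed on as context.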

In [30]:
import torch
import re
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms import HuggingFacePipeline

# 1. Load model from Drive
MODEL_PATH = "/content/drive/MyDrive/falcon-e-1b-instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype=torch.bfloat16,
    local_files_only=True
).to("cuda" if torch.cuda.is_available() else "cpu")  # Adjust for available device

# 2. Build generation pipeline
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    return_full_text=False,
    do_sample=False,
    max_new_tokens=200,
    pad_token_id=tokenizer.eos_token_id
)
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)

# 3. Prompt + answer function
def format_prompt(context, question):
    return (
        "You are the Cambridge University Assistant—a friendly, knowledgeable chatbot dedicated to helping students with questions about courses, admissions, tuition fees, and student life. "
        "Use ONLY the information provided in the context below to answer the question. "
        "If the answer cannot be found in the context, reply: \"I’m sorry, but I don’t have that information available right now.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Student Question: {question}\n"
        "Assistant Answer:"
    )

def answer_fn(question):
    # Retrieve the most similar documents (increase k if needed)
    docs = vectorstore_faiss.similarity_search(question, k=5)  # Increased k to 5 to get more context

    if not docs:
        return "I'm sorry, I couldn't find any relevant information for your query."

    # Build context string from the documents
    context = "\n\n".join(d.page_content for d in docs)

    # Format the prompt for the model
    prompt = format_prompt(context, question)

    try:
        # Invoke the model and generate the answer
        raw = llm.invoke(prompt).strip()
        # Return the first line of the model's response (usually enough for concise answers)
        return raw.split("\n")[0].strip()
    except Exception as e:
        return f"An error occurred while generating the response: {e}"


HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo_name': '/content/drive/MyDrive/falcon-e-1b-instruct'. Use `repo_type` argument if needed.

### Fixing the Model Directory Structure

The `HFValidationError` typically appears when `from_pretrained` cannot use `MODEL_PATH` as a local model directory, for example because the folder does not exist (Google Drive is not mounted) or does not contain the expected files such as `config.json`. The path is then treated as a Hugging Face Hub repo ID, and a string like `/content/drive/MyDrive/falcon-e-1b-instruct` is not a valid ID. To fix this, we'll download the model directly from the Hugging Face Hub to a new local folder with the expected layout.

**Note:** This will download the model again. If you would rather reuse your existing files, make sure that `config.json`, `tokenizer_config.json`, and the other model files sit directly inside the `falcon-e-1b-instruct` folder on Google Drive, skip the download, and set `MODEL_PATH` to that folder before loading the model. Otherwise, proceed with the download below.
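
Before loading from a local folder, it can help to check that the folder exists and contains the files the loader looks for. The sketch below does that with a hypothetical helper, `missing_model_files`. The list of required files is an assumption based on the two files named above; a real checkpoint also needs its weight and tokenizer files.

```python
import os

# Assumed minimum set of files; adjust for the checkpoint you actually use
REQUIRED_FILES = ("config.json", "tokenizer_config.json")

def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required files that are absent from model_dir (hypothetical helper)."""
    if not os.path.isdir(model_dir):
        # The folder itself is missing, so every required file is missing too
        return list(required)
    present = set(os.listdir(model_dir))
    return [name for name in required if name not in present]

missing = missing_model_files("/content/drive/MyDrive/falcon-e-1b-instruct")
if missing:
    print("Missing or unreachable:", ", ".join(missing))
else:
    print("Model folder looks complete enough to try loading.")
```

If the check reports missing files, the download cell below is the simpler route.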

In [31]:
from huggingface_hub import snapshot_download
import os

# Define a new local path for the correctly downloaded model
new_model_dir = "/content/falcon-e-1b-instruct_correct"

# Model ID on Hugging Face Hub (assuming 'tiiuae/falcon-e-1b-instruct')
model_id_on_hub = "tiiuae/falcon-e-1b-instruct"

print(f"Downloading model '{model_id_on_hub}' to '{new_model_dir}'...")

# Download the model, ensuring it's stored with the correct structure
# If a folder with the name `new_model_dir` already exists, `snapshot_download` will update it.
# If you're encountering persistent issues and suspect corrupted files, you might want to delete `new_model_dir` manually before running this cell.
snapshot_download(repo_id=model_id_on_hub, local_dir=new_model_dir)

print(f"Model downloaded successfully to: {new_model_dir}")

# Update the MODEL_PATH variable to point to the new directory
# This variable is read by the model loading cell that follows (Colab cell ID 'w4POkkb7sPD3')
MODEL_PATH = new_model_dir

print("MODEL_PATH has been updated. Please re-run the model loading cell (w4POkkb7sPD3) next.")


Downloading model 'tiiuae/falcon-e-1b-instruct' to '/content/falcon-e-1b-instruct_correct'...


Model downloaded successfully to: /content/falcon-e-1b-instruct_correct
MODEL_PATH has been updated. Please re-run the model loading cell (w4POkkb7sPD3) next.


In [32]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from langchain_community.llms import HuggingFacePipeline

# 1. Load the model from the local folder
# MODEL_PATH was set by the previous cell after downloading the model.
# To use a copy on Google Drive instead, make sure that folder contains the model files
# and set MODEL_PATH = "/content/drive/MyDrive/falcon-e-1b-instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, local_files_only=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    dtype=torch.bfloat16,  # `torch_dtype` is deprecated in this transformers version
    local_files_only=True
).to(device)

# 2. Build generation pipeline
# The model object already carries its dtype and device, so they are not repeated here
text_gen_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    return_full_text=False,
    do_sample=False,  # greedy decoding for repeatable answers
    max_new_tokens=200,
    pad_token_id=tokenizer.eos_token_id
)
llm = HuggingFacePipeline(pipeline=text_gen_pipeline)

# 3. Prompt + answer function
def format_prompt(context, question):
    return (
        "You are the Cambridge University Assistant, a friendly, knowledgeable chatbot dedicated to helping students with questions about courses, admissions, tuition fees, and student life. "
        "Use ONLY the information provided in the context below to answer the question. "
        "If the answer cannot be found in the context, reply: \"I’m sorry, but I don’t have that information available right now.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Student Question: {question}\n"
        "Assistant Answer:"
    )

def answer_fn(question):
    # Retrieve the 5 most similar chunks.
    # `vectorstore_faiss` is aliased to `vectorstore` in the next cell.
    docs = vectorstore_faiss.similarity_search(question, k=5)

    if not docs:
        return "I'm sorry, I couldn't find any relevant information for your query."

    # Build context string from the documents
    context = "\n\n".join(d.page_content for d in docs)

    # Format the prompt for the model
    prompt = format_prompt(context, question)

    try:
        # Invoke the model and generate the answer
        raw = llm.invoke(prompt).strip()
        # Keep only the first line of the response; multi-line answers are cut short
        return raw.split("\n")[0].strip()
    except Exception as e:
        return f"An error occurred while generating the response: {e}"


You have loaded a BitNet model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
Device set to use cuda:0
The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


In [33]:
# `vectorstore` was built in the retrieval cell above.
# answer_fn looks up `vectorstore_faiss`, so alias it here
vectorstore_faiss = vectorstore


# Now test your chatbot
sample_question = "Where is the University of Cambridge located?"
print("Sample Question:", sample_question)
print("Answer:", answer_fn(sample_question))

Sample Question: Where is the University of Cambridge located?




Answer: Cambridge is located in Cambridge, Cambridgeshire, England, UK.
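
`answer_fn` keeps only the first line of the generated text, which suits one-sentence answers like the one above but would cut off a bulleted list. As a hedged alternative, the sketch below keeps everything up to the first blank line instead. `first_block` is a hypothetical helper, and the sample string is made up for illustration rather than real model output.

```python
def first_block(raw):
    """Keep lines up to the first blank line (hypothetical alternative to first-line-only)."""
    kept = []
    for line in raw.strip().splitlines():
        if not line.strip():
            break  # a blank line ends the first block
        kept.append(line.strip())
    return "\n".join(kept)

# Made-up response: a short list followed by text the model should not have continued with
sample = (
    "The required documents are:\n"
    "- Personal Statement\n"
    "- Reference Letter\n"
    "\n"
    "Student Question: ..."
)

print(sample.split("\n")[0])  # first-line-only behaviour used in answer_fn
print(first_block(sample))    # keeps the whole list, drops the trailing text
```

Whether this is an improvement depends on how the model formats its answers, so it is worth trying both on a few real questions.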


In [35]:
import gradio as gr
demo = gr.Interface(
    fn=answer_fn,
    inputs=gr.Textbox(lines=2, placeholder="Ask a question..."),
    outputs=gr.Textbox(lines=6),
    title="📘 University of Cambridge Chatbot",
    description="",
    theme="default"
)

demo.launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://03bf5ff81a3e797ce3.gradio.live

This share link expires in 1 week. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)



