In [68]:
"""Changes made:
   1. Chunking: to handle documents with varying formats, decrease chunk_size and increase chunk_overlap so that context is preserved and sections are less likely to be cut off mid-sentence.
   2. PDF parsing: use a more capable PDF parsing library (such as pdfplumber or PyMuPDF) to handle varying layouts and extract text more cleanly.
   3. Temperature adjustment: the temperature parameter controls the randomness of the model's outputs. Lower it when you need more deterministic results, especially for technical documents.
   4. Prompt optimization: tailor the prompt template to the specific document structure or content.
   5. Indexing parameters: make sure the vector store (Chroma) is configured for the size and nature of the document, and adjust search_kwargs to tune retrieval.
"""

In [1]:
## Installing necessary libraries for the project
! pip install -q --upgrade google-generativeai langchain-google-genai chromadb pypdf


In [2]:
# Importing necessary tools to display text in a nice format
from IPython.display import display
from IPython.display import Markdown
import textwrap

# Creating a function to format text as Markdown
def to_markdown(text):
  text = text.replace('•', '  *')

  # Indent and format the text as Markdown
  return Markdown(textwrap.indent(text, '> ', predicate=lambda _: True))
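The string handling inside `to_markdown` can be checked on a plain string, without IPython. The sketch below repeats the same two steps; the helper name `quote_block` and the sample text are hypothetical and exist only for this illustration.

```python
import textwrap


def quote_block(text: str) -> str:
    """Mirror the string handling in to_markdown, minus the Markdown wrapper."""
    # Turn bullet characters into Markdown list markers
    text = text.replace('•', '  *')
    # predicate=lambda _: True makes textwrap.indent prefix blank lines too,
    # so the whole answer stays inside a single Markdown blockquote
    return textwrap.indent(text, '> ', predicate=lambda _: True)


sample = "Uses:\n\n• vision\n• audio"
quoted = quote_block(sample)
print(quoted)
```

Without the `predicate` argument, `textwrap.indent` skips whitespace-only lines, which would split the quoted answer into several separate blockquotes.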

In [3]:
# Import the Google Generative AI library and load user data
import google.generativeai as genai
from google.colab import userdata

In [4]:
# Read the Google API key from Colab's user data and configure the AI library with it
GOOGLE_API_KEY = userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

In [5]:
# Create a Gemini Pro AI model
model = genai.GenerativeModel(model_name = "gemini-pro")
model

genai.GenerativeModel(
    model_name='models/gemini-pro',
    generation_config={},
    safety_settings={},
    tools=None,
    system_instruction=None,
    cached_content=None
)

In [6]:
# Ask the Gemini Pro model about the use cases of CNNs
response = model.generate_content("What are the usecases of CNNs?")

In [7]:
# Format the AI model's response as Markdown for better display
to_markdown(response.text)

> **Image Classification:**
> * ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
> * Identifying objects, scenes, and people in images
> 
> **Object Detection:**
> * PASCAL Visual Object Classes Challenge (VOC)
> * Detecting and localizing objects of interest within an image
> 
> **Image Segmentation:**
> * Semantic segmentation: Assigning each pixel in an image a class label (e.g., sky, grass, person)
> * Instance segmentation: Identifying and segmenting individual instances of objects within an image
> 
> **Facial Recognition:**
> * Identifying and matching faces from images
> * Verifying identity for security purposes
> 
> **Medical Imaging:**
> * Diagnosing diseases from medical scans (e.g., X-rays, MRIs)
> * Image-guided surgery and therapy
> 
> **Natural Language Processing:**
> * Text classification: Categorizing text documents into predefined classes (e.g., spam, news)
> * Named entity recognition: Identifying and classifying specific entities within text (e.g., persons, organizations)
> 
> **Video Analysis:**
> * Action recognition: Identifying and classifying actions performed in videos
> * Video summarization: Generating condensed versions of videos
> 
> **Self-Driving Cars:**
> * Image recognition for object detection, lane detection, and traffic sign recognition
> * Perception and prediction of surroundings for autonomous navigation
> 
> **Other Use Cases:**
> * Handwritten digit recognition
> * Facial expression recognition
> * Drug discovery
> * Financial forecasting
> * Anomaly detection

In [19]:
# Install necessary libraries for handling documents
!pip install langchain
!pip install pdfplumber

# Import required modules
import pdfplumber  # Used instead of pypdf for cleaner text extraction

# Load the PDF document using pdfplumber
with pdfplumber.open("/content/data/MLBOOK.pdf") as pdf:
    # Extract text from each page; some pdfplumber versions return None for
    # pages without extractable text, so fall back to an empty string
    pages = [page.extract_text() or "" for page in pdf.pages]

# Print the total number of pages in the PDF
print(f"Total pages in the PDF: {len(pages)}")

# Check if the PDF has pages and print the content of the first page if available
if len(pages) > 0:
    print(pages[0])
else:
    print("No pages found in the PDF")


Total pages in the PDF: 188
INTRODUCTION
TO
MACHINE LEARNING
AN EARLY DRAFT OF A PROPOSED
TEXTBOOK
Nils J. Nilsson
Robotics Laboratory
Department of Computer Science
Stanford University
Stanford, CA 94305
e-mail: nilsson@cs.stanford.edu
November 3, 1998
Copyright (cid:13)c2005 Nils J. Nilsson
This material may not be copied, reproduced, or distributed without the
written permission of the copyright holder.


In [10]:
# Install additional tools for splitting text and creating embeddings
!pip install langchain-text-splitters

# Import necessary libraries
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_google_genai import GoogleGenerativeAIEmbeddings




In [44]:
# Create a text splitter to break down the text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=5000, chunk_overlap=500)

# Combine the content of all PDF pages into a single string
# Directly join the strings in the 'pages' list, as they already contain the extracted text
context = "\n\n".join(pages)

# Split the combined text into smaller chunks using the defined text splitter
texts = text_splitter.split_text(context)
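To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping chunks using a fixed-width sliding window. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators such as paragraph and line breaks before falling back to characters). The function name `split_with_overlap` and the demo string are hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-width sliding window: each chunk repeats the tail of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


demo = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(demo, chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is the behavior the overlap setting is meant to provide: a sentence that straddles a boundary still appears whole in at least one chunk, provided it is shorter than the overlap.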

In [45]:
# Create an embeddings model using Google Generative AI
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001",google_api_key=GOOGLE_API_KEY)


In [46]:
# Install Chroma database for storing embeddings
!pip install chromadb

# Import necessary libraries
from langchain.vectorstores import Chroma  # For creating a vector database

# Build a Chroma vector store from the chunks, then wrap it as a retriever
# that returns the 10 most similar chunks for each query
vector_index = Chroma.from_texts(texts, embeddings).as_retriever(search_kwargs={"k": 10})
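The `k` in `search_kwargs` sets how many of the closest chunks the retriever hands back. The sketch below shows the idea of top-k retrieval on toy 2-D vectors. Cosine similarity is used purely for illustration; the distance metric Chroma actually applies depends on how the collection is configured. The names `cosine`, `top_k`, and `toy_vectors` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        enumerate(doc_vecs),
        key=lambda pair: cosine(query_vec, pair[1]),
        reverse=True,
    )
    return [index for index, _ in ranked[:k]]


toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.7, 0.7]]
query = [1.0, 0.0]
nearest = top_k(query, toy_vectors, k=2)
print(nearest)  # [0, 1]
```

A larger `k` gives the model more context to draw on but also more loosely related text, so it is worth tuning together with `chunk_size`.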



In [47]:
# Suppress all warnings to keep the cell output clean
import warnings
warnings.filterwarnings("ignore")

from langchain.prompts import PromptTemplate  # Template class for building prompts
from langchain.chains import RetrievalQA  # Retrieval question-answering chain
from langchain_google_genai import ChatGoogleGenerativeAI  # Chat model wrapper for Gemini

# Create a Gemini 1.5 Flash chat model with a low temperature for more deterministic answers
model = ChatGoogleGenerativeAI(model="gemini-1.5-flash", google_api_key=GOOGLE_API_KEY,
                               temperature=0.1, convert_system_message_to_human=True)

# Create a retrieval question-answering chain using the Gemini model and the retriever
qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vector_index,
    return_source_documents=True,
)
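The model above is created with `temperature=0.1`. As a rough intuition for what that setting does, the sketch below applies the textbook temperature-scaled softmax to some made-up scores. How Gemini applies temperature internally is not shown in this notebook, so treat this as an illustration of the general idea only; `softmax_with_temperature` and `toy_logits` are hypothetical names.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


toy_logits = [2.0, 1.0, 0.5]
sharp = softmax_with_temperature(toy_logits, temperature=0.1)
flat = softmax_with_temperature(toy_logits, temperature=1.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

At the lower temperature almost all of the probability mass lands on the highest-scoring option, which is why low values tend to give more repeatable answers.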

In [48]:

template = """Use the following context to answer the question in detail. Provide a comprehensive and elaborative response with as much relevant information as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Rebuild the chain with the custom prompt; source documents are not returned this time
qa_chain = RetrievalQA.from_chain_type(
    model,
    retriever=vector_index,
    return_source_documents=False,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
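The `{context}` and `{question}` placeholders are filled in at query time. The sketch below imitates that step with plain `str.format`, assuming the retrieved chunks are joined with blank lines before being inserted (an assumption about the chain's default behavior, not something this notebook verifies). The shortened `demo_template` and the chunk strings are hypothetical.

```python
# Shortened stand-in for the notebook's template, with the same two placeholders
demo_template = "Context:\n{context}\nQuestion: {question}\nHelpful Answer:"

# Hypothetical chunks, standing in for what the retriever would return
retrieved_chunks = ["Chunk A text.", "Chunk B text."]

filled = demo_template.format(
    context="\n\n".join(retrieved_chunks),  # assumed joining strategy
    question="What is X?",
)
print(filled)
```

Printing the filled prompt like this is a quick way to check that the template reads naturally once real chunks are substituted in.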


In [63]:
question = "what is The Widrow-Hoff Procedure?"
result = qa_chain.invoke({"query": question})  # invoke() replaces the deprecated direct call
print(result["result"])


The Widrow-Hoff Procedure, also known as the LMS (Least Mean Squares) or delta procedure, is a method used to train a Threshold Logic Unit (TLU) by minimizing a squared-error function. This function measures the difference between the desired output (pattern label) and the actual output (dot product computed by the TLU).

Here's a breakdown of the procedure:

**1. Objective:**
   - Find a set of weights (w) that minimize the squared error between the pattern labels (d) and the dot product computed by the TLU.

**2. Squared Error Function:**
   - For a single pattern X with label d, the squared error is:
     ε = (d - (∑x<sub>ij</sub>w<sub>j</sub>))<sup>2</sup>
     where x<sub>ij</sub> is the j-th component of X and the summation is over all components of X.
   - The total squared error over all patterns in a training set Ξ is:
     ε = ∑(d - (∑x<sub>ij</sub>w<sub>j</sub>))<sup>2</sup>
     where the summation is over all patterns in Ξ.

**3. Gradient Descent:**
   - The procedure star

In [64]:
Markdown(result["result"])

The Widrow-Hoff Procedure, also known as the LMS (Least Mean Squares) or delta procedure, is a method used to train a Threshold Logic Unit (TLU) by minimizing a squared-error function. This function measures the difference between the desired output (pattern label) and the actual output (dot product computed by the TLU).

Here's a breakdown of the procedure:

**1. Objective:**
   - Find a set of weights (w) that minimize the squared error between the pattern labels (d) and the dot product computed by the TLU.

**2. Squared Error Function:**
   - For a single pattern X with label d, the squared error is:
     ε = (d - (∑x<sub>ij</sub>w<sub>j</sub>))<sup>2</sup>
     where x<sub>ij</sub> is the j-th component of X and the summation is over all components of X.
   - The total squared error over all patterns in a training set Ξ is:
     ε = ∑(d - (∑x<sub>ij</sub>w<sub>j</sub>))<sup>2</sup>
     where the summation is over all patterns in Ξ.

**3. Gradient Descent:**
   - The procedure starts with an arbitrary weight vector and moves it along the negative gradient of ε as a function of the weights.
   - Since ε is quadratic in the weights, it has a global minimum, and this steepest descent procedure is guaranteed to find it.

**4. Incremental Procedure:**
   - Instead of computing the gradient of the total squared error, the Widrow-Hoff procedure uses an incremental approach. It processes one pattern at a time, computes the gradient of the single-pattern squared error, adjusts the weights accordingly, and then moves on to the next pattern.
   - This incremental version approximates the batch version, but it is usually quite effective.

**5. Gradient Calculation:**
   - The j-th component of the gradient of the single-pattern error is:
     ∂ε/∂w<sub>j</sub> = -2(d - (∑x<sub>ij</sub>w<sub>j</sub>))x<sub>ij</sub>

**6. Weight Adjustment:**
   - Each weight is adjusted in the direction of the negative gradient:
     w<sub>j</sub> ← w<sub>j</sub> + c<sub>i</sub>(d - f<sub>i</sub>)x<sub>ij</sub>
     where f<sub>i</sub> = ∑x<sub>ij</sub>w<sub>j</sub> is the dot product for pattern X<sub>i</sub>, and c<sub>i</sub> is a learning rate parameter that governs the size of the adjustment.

**7. Augmented Vector Notation:**
   - The entire weight vector (including the threshold component) is adjusted using augmented vector notation:
     V ← V + c<sub>i</sub>(d - f<sub>i</sub>)Y<sub>i</sub>
     where Y<sub>i</sub> is the augmented pattern vector.

**8. Convergence:**
   - The Widrow-Hoff procedure makes adjustments to the weight vector whenever the dot product (Y<sub>i</sub>•V) does not equal the desired target value.
   - The learning rate parameter c<sub>i</sub> can be fixed or decrease with time to achieve asymptotic convergence.

**In summary, the Widrow-Hoff Procedure is a gradient descent method that iteratively adjusts the weights of a TLU to minimize the squared error between the desired output and the actual output. It is an incremental procedure that processes one pattern at a time, making it computationally efficient.**
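
The incremental update described in the answer above, V ← V + c(d - f)Y, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not code from the textbook: the function name `widrow_hoff`, the fixed learning rate, the epoch count, and the toy data (generated from d = 2x + 1) are all hypothetical choices.

```python
def widrow_hoff(patterns, labels, learning_rate=0.05, epochs=500):
    """Incremental LMS: adjust the weights after every single pattern."""
    # One weight per input component, plus one for the threshold term
    weights = [0.0] * (len(patterns[0]) + 1)
    for _ in range(epochs):
        for x, d in zip(patterns, labels):
            y = list(x) + [1.0]  # augmented pattern vector
            f = sum(w * y_j for w, y_j in zip(weights, y))  # dot product
            error = d - f
            # Move each weight along the negative gradient of the squared error
            weights = [w + learning_rate * error * y_j
                       for w, y_j in zip(weights, y)]
    return weights


# Toy data that a linear unit can fit exactly: d = 2x + 1
patterns = [[0.0], [1.0], [2.0], [3.0]]
labels = [1.0, 3.0, 5.0, 7.0]
learned = widrow_hoff(patterns, labels)
print([round(w, 3) for w in learned])
```

Because the toy labels are exactly linear in the input, the learned weights should approach a slope of 2 and a threshold weight of 1. With noisy data a fixed learning rate would leave the weights fluctuating around the minimum, which is where a decreasing rate becomes useful.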
