<a href="https://colab.research.google.com/github/hylle77/BDS/blob/main/Group_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PORTFOLIO EXERCISE 3: GPT MODELS

### Necessary downloads and imports

In [None]:
#Necessary downloads
!pip install accelerate --q
!pip install pypdf --q
!pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -Uqqq pip --progress-bar off
# !pip install -qqq torch==2.0.1 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq xformers==0.0.21 --progress-bar off
!pip install -qqq tokenizers==0.14.0 --progress-bar off
!pip install -qqq optimum==1.13.1 --progress-bar off
!pip install -qqq auto-gptq==0.4.2 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --progress-bar off
!pip install -qqq unstructured==0.10.16 --progress-bar off

In [2]:
#Importing libraries
import pandas as pd
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.document_loaders import PyPDFLoader
from langchain.llms import HuggingFaceHub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from textwrap import fill

### Loading the data

The dataset is a synthetic collection of wine varieties generated using AI. It was converted to PDF so that it can be loaded with `PyPDFLoader` and indexed in Chroma.

In [34]:
#Loading data
loader = PyPDFLoader("/content/wine_varieties.pdf")

docs = loader.load()
len(docs)

1

## Splitting the documents into smaller chunks

In [35]:
#Splitting the data
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)

6
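
To illustrate what `chunk_size` and `chunk_overlap` control, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplification: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines, which this sketch does not do. The helper `chunk_text` and the sample string are hypothetical and only serve as an example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows of chunk_size that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Toy example (not the wine dataset): windows of 10 characters sharing 3 characters
sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_text(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears, at least partly, in the neighbouring chunk, which can help retrieval when the relevant text straddles two chunks.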

## Creating the embeddings using Hugging Face

In [40]:
#Creating embeddings for the data
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

#Printing the dimensionality of the embedding for the first chunk
query_result = embeddings.embed_query(texts[0].page_content)
print(len(query_result))

1024
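
The setting `normalize_embeddings=True` scales every embedding to unit length. A small sketch of why that is convenient: for unit-length vectors, the dot product equals the cosine similarity. The two-dimensional vectors below are made up for illustration; the real embeddings have 1024 dimensions.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors (assumed values, not real embeddings)
a = [3.0, 4.0]
b = [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print(cosine(a, b))         # cosine similarity of the raw vectors
print(dot(a_unit, b_unit))  # dot product of the normalized vectors
```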


## Combining the LLM with the database using the RetrievalQA chain

In [37]:
#Creating the Chroma database
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("What is the most popular wine variety", k=2)
print(results[0].page_content)

Vermentino A white grape variety known for its bright acidity and citrus flavors, originating from France.
Aligoté Produces bold, spicy red wines with flavors of dark fruit and tobacco, originating from Argentina.
Negroamaro An Italian white grape variety with floral and fruity aromas, often used to make crisp, dry wines.
Mourvèdre A French red grape variety known for its rich, fruity flavors and soft tannins.
Garganega Produces full-bodied red wines with flavors of dark fruit, originating from France.
Trebbiano An Italian white grape variety known for its rich, oily texture and flavors of nuts and honey.
Nebbiolo A Spanish red grape variety producing wines with flavors of black fruit and herbs.
Sémillon Produces aromatic white wines with flavors of apricot and peach, originating from Italy.
Aglianico An Italian red grape variety with high acidity and flavors of red cherry and herbs.
Primitivo A Greek white grape variety known for its bright acidity and flavors of lemon and herbs.
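
Conceptually, `similarity_search(..., k=2)` embeds the query and returns the `k` stored chunks whose vectors are closest to it. The following is a brute-force sketch of that idea on toy unit vectors. It is not how Chroma is implemented internally (a vector database typically uses an index rather than scanning every vector), and the function `top_k` is a hypothetical helper.

```python
def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k document vectors with the highest dot product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [idx for _, idx in scored[:k]]


# Toy unit vectors standing in for chunk embeddings (assumed values)
toy_docs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
toy_query = [0.8, 0.6]

print(top_k(toy_query, toy_docs, k=2))
```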


In [38]:
#Initializing the LLM model
MODEL_NAME = "TheBloke/Llama-2-7B-Chat-GPTQ"

#Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

#Create a configuration for text generation based on the specified model name
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

#Set the maximum number of new tokens in the generated text to 1024
generation_config.max_new_tokens = 1024

#Set the temperature for text generation. Lower values (e.g., 0.0001) make output more deterministic, following likely predictions.
generation_config.temperature = 0.0001

#Set the top-p sampling value. A value of 0.95 means focusing on the most likely words that make up 95% of the probability distribution.
generation_config.top_p = 0.95

#Enable text sampling. When set to True, the model randomly selects words based on their probabilities, introducing randomness.
generation_config.do_sample = True

#Set the repetition penalty. A value of 1.15 discourages the model from repeating the same words or phrases too frequently in the output.
generation_config.repetition_penalty = 1.15

#Create a text generation pipeline using the initialized model, tokenizer, and generation configuration
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# Create a LangChain pipeline that wraps the text generation pipeline and set a specific temperature for generation
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})
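
The comments above describe temperature and top-p sampling in words. The sketch below shows the idea on a toy set of logits in plain Python: temperature rescales the logits before the softmax, and top-p keeps only the most likely tokens whose cumulative probability reaches the threshold. This is a simplified illustration, not the `transformers` implementation, and the logits and function names are assumptions.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized over the kept tokens


# Toy logits for a four-token vocabulary (assumed values)
toy_logits = [2.0, 1.0, 0.5, -1.0]
for temperature in (1.0, 0.0001):
    probs = softmax_with_temperature(toy_logits, temperature)
    print(temperature, top_p_filter(probs, top_p=0.95))
```

With a temperature as low as 0.0001, practically all probability mass ends up on the single most likely token, so even with `do_sample = True` the output behaves almost like greedy decoding.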


## Wine Varieties Description Question Answering System

In [39]:
#Define the prompt using the template
template = """
<s>[INST] <<SYS>>
As a wine expert, describe the characteristics and profile of the following grape variety.
<</SYS>>

{context}

{question} [/INST]
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)

result = qa_chain(
    "What can you tell me about Cabernet Sauvignon?"
)
print(result["result"].strip())

Ah, an excellent choice! Cabernet Sauvignon is one of the most renowned and widely planted grape varieties globally, known for its depth of flavor and aging potential. Here are some key characteristics and profiles of this noble grape:
Flavor Profile:
Cabernet Sauvignon is typically associated with bold, dark fruit flavors such as black cherry, blackberry, and cassis. These flavors are often complemented by notes of spice, including clove, nutmeg, and cinnamon. The wine may also exhibit hints of leather, tobacco, and earthy undertones, which develop over time as the wine ages.
Tannins and Acidity:
Cabernet Sauvignon is known for its high tannin content, which provides structure and aging potential. The tannins are usually firm but fine, contributing to the wine's overall balance and complexity. The acidity level is moderate to high, which helps to preserve the wine's freshness and vitality.
Oak Influence:
Cabernet Sauvignon is frequently aged in oak barrels, which can impart additional
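
To make the role of `chain_type="stuff"` more concrete, here is a rough sketch of what happens before the LLM is called: the retrieved chunks are joined into one string and inserted into the `{context}` slot of the template, together with the question. The helper `build_stuff_prompt`, the chunk texts, and the `"\n\n"` separator are assumptions for illustration and not LangChain's actual code.

```python
TEMPLATE = """
<s>[INST] <<SYS>>
As a wine expert, describe the characteristics and profile of the following grape variety.
<</SYS>>

{context}

{question} [/INST]
"""


def build_stuff_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks and fill them into the template with the question."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


# Placeholder chunks standing in for the k=2 retrieved documents
toy_chunks = ["Hypothetical chunk one.", "Hypothetical chunk two."]
print(build_stuff_prompt(TEMPLATE, toy_chunks, "What can you tell me about Cabernet Sauvignon?"))
```

Because all retrieved chunks go into a single prompt, the number of chunks (`k`) and the chunk size together determine how much of the model's context window the retrieved text consumes.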

In [None]:
#Passing the prompt along with the top 2 results from the Chroma database
result = qa_chain(
    "Tell me about the most popular wine varieties in France, using 2-3 sentences."
)
print(fill(result["result"].strip(), width=80))