<a href="https://colab.research.google.com/github/josephmailil1/Speciale---Stibo/blob/main/RAG_LLM_OpenAI_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangChain Lab: Models, Prompts and Output Parsers

This notebook gives a short introduction to the most basic LangChain concepts. We will make direct calls to OpenAI through LangChain and see examples of prompts, models, and output parsers.

## OpenAI API key

Because we are using the OpenAI API in this lab, you need an API key if you want to run the lab yourself. You can get one [here](https://platform.openai.com/account/api-keys). The prices at the time of writing for `gpt-3.5-turbo-1106` and `gpt-3.5-turbo-instruct` are:

| Model                   | Input                | Output               |
|-------------------------|----------------------|----------------------|
| gpt-3.5-turbo-1106      | \$0.0010 / 1K tokens  | \$0.0020 / 1K tokens  |
| gpt-3.5-turbo-instruct  | \$0.0015 / 1K tokens  | \$0.0020 / 1K tokens  |


Running all the examples in this and the following notebooks will cost you less than $1. For some examples you should be able to swap in an open source LLM without any changes. Others will need a little bit of adjustment.
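As a rough illustration of how the per-token prices in the table translate into a cost per request, the sketch below multiplies token counts by the listed rates. The function name `estimate_cost` and the token counts are hypothetical and chosen only for the example; actual usage depends on your prompts and the model's replies.

```python
# Prices in USD per 1K tokens, copied from the table above.
PRICES_PER_1K = {
    "gpt-3.5-turbo-1106": {"input": 0.0010, "output": 0.0020},
    "gpt-3.5-turbo-instruct": {"input": 0.0015, "output": 0.0020},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD of a single request."""
    price = PRICES_PER_1K[model]
    input_cost = (input_tokens / 1000) * price["input"]
    output_cost = (output_tokens / 1000) * price["output"]
    return input_cost + output_cost


# Hypothetical request: 1,500 prompt tokens and 500 completion tokens.
cost = estimate_cost("gpt-3.5-turbo-1106", 1500, 500)
print(f"Estimated cost: {cost:.4f} USD")
```

At these rates, a few hundred requests of this size would still stay under one dollar.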

## Setup

In [2]:
# Install the needed libraries

!pip install --quiet python-dotenv
!pip install --quiet openai
!pip install --quiet pandas
!pip install --quiet langchain

In [3]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
import openai
import os

# You can save your secret keys in your colab account
# (look for the key symbol in the panel to the left)

from google.colab import userdata
openai_api_key = userdata.get('TestKey') #use the name of your colab key

# Store the key under the environment variable name that the OpenAI client reads
os.environ['OPENAI_API_KEY'] = openai_api_key

# Alternatively, you just set the key directly in your notebook.
# But be careful not to share it with anyone.

# os.environ['OPENAI_API_KEY'] = 'YOUR KEY HERE'

# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()

# Define the date after which the model should be set to "gpt-4"
target_date = datetime.date(2024, 6, 12)

# Set the model variable based on the current date
if current_date > target_date:
    llm_model = "gpt-4"
else:
    llm_model = "gpt-4-turbo-preview"

In [5]:
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.chains import LLMChain
import pandas as pd
import os

## Load dataset

In [6]:
df = pd.read_csv('shap_df_testset.csv')

In [7]:
df.head()

Unnamed: 0,ChurnBaseValue,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,gender_Male,Partner_Yes,Dependents_Yes,PhoneService_Yes,MultipleLines_No phone service,...,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,Contract_One year,Contract_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.262382,0.0,-0.03179,-0.012948,-0.004579,0.0,-0.002024,0.001736,0.0,-0.001683,...,0.0,0.0,-0.021643,0.0,0.002244,-0.029851,0.0,0.001552,-0.008365,0.0
1,0.262382,0.0,-0.009395,-0.012948,-0.004579,0.0,0.001972,0.001736,0.0,-0.001683,...,0.0,0.0,-0.021643,0.0,0.002244,0.012195,0.0,0.001552,-0.013855,0.0
2,0.262382,0.0,0.01875,0.011162,-0.000835,0.0,-0.000383,0.003975,0.0,-0.000601,...,0.0,0.0,0.004006,0.0,0.005052,0.015528,0.0,-0.013351,-0.011725,0.0
3,0.262382,0.0,-0.040295,-0.012948,-0.000835,0.0,-0.000383,0.003975,0.0,-0.001683,...,0.0,0.0,0.006089,0.0,0.005052,0.011822,0.0,0.00354,-0.006235,0.0
4,0.262382,0.0,0.076218,-0.015844,-0.000835,0.0,0.000374,0.003975,0.0,0.007161,...,0.0,0.0,0.006089,0.0,0.005052,0.008283,0.0,0.00354,-0.011725,0.0


## LLM Chain

In [8]:
llm = ChatOpenAI(temperature=0, model=llm_model, openai_api_key=openai_api_key)



In [16]:
prompt = ChatPromptTemplate.from_template(
    "What is the most significant {feature} for "
    "the first observation?"
)

In [17]:
chain = LLMChain(llm=llm, prompt=prompt)

In [18]:
feature = df
chain.invoke(feature)

{'feature':       ChurnBaseValue  SeniorCitizen    tenure  MonthlyCharges  TotalCharges  \
 0           0.262382            0.0 -0.031790       -0.012948     -0.004579   
 1           0.262382            0.0 -0.009395       -0.012948     -0.004579   
 2           0.262382            0.0  0.018750        0.011162     -0.000835   
 3           0.262382            0.0 -0.040295       -0.012948     -0.000835   
 4           0.262382            0.0  0.076218       -0.015844     -0.000835   
 ...              ...            ...       ...             ...           ...   
 1402        0.262382            0.0  0.076075       -0.015844      0.035507   
 1403        0.262382            0.0  0.064135        0.022029     -0.000835   
 1404        0.262382            0.0 -0.040295       -0.014339     -0.000835   
 1405        0.262382            0.0 -0.048265        0.021060     -0.000835   
 1406        0.262382            0.0 -0.031790       -0.012948     -0.004579   
 
       gender_Male  Partner

# Router Chain
The idea is to add context about the data, the ML model, and the desired output for the end user.
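One possible way to supply data context is to turn a single observation into a short text summary that can be pasted into a prompt. The sketch below is an assumption about how that could look, not part of the original notebook: `toy_df` is a hypothetical stand-in holding a few columns and the first two rows shown in the `df.head()` output, and `describe_row` is a hypothetical helper.

```python
import pandas as pd

# Hypothetical stand-in for shap_df_testset.csv (subset of columns, first two rows).
toy_df = pd.DataFrame(
    {
        "ChurnBaseValue": [0.262382, 0.262382],
        "tenure": [-0.031790, -0.009395],
        "MonthlyCharges": [-0.012948, -0.012948],
        "Contract_Two year": [-0.029851, 0.012195],
    }
)


def describe_row(frame: pd.DataFrame, row: int, top_n: int = 3) -> str:
    """Summarize the largest SHAP values (by absolute size) for one observation."""
    shap_values = frame.iloc[row].drop("ChurnBaseValue")
    ranked = shap_values.abs().sort_values(ascending=False).index[:top_n]
    lines = [f"- {name}: {shap_values[name]:+.6f}" for name in ranked]
    base = frame.iloc[row]["ChurnBaseValue"]
    header = (
        f"Base value: {base:.6f}\n"
        f"Largest SHAP values for observation {row}:"
    )
    return header + "\n" + "\n".join(lines)


context = describe_row(toy_df, 0)
print(context)
```

A summary like this is much shorter than the string representation of the whole dataframe, which keeps the prompt small and puts the relevant numbers directly in front of the model.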

In [34]:
machinelearning_template = """You are a Machine Learning expert. \
You are great at answering questions about Machine Learning outputs in a concise \
and easy-to-understand manner. \
You are able to easily interpret local SHAP-values from Machine Learning outputs.

The end user is sales personnel, and they should \
be able to act on the output you provide. Make it actionable and easy to understand.

You should use the information provided in the prompt_infos to answer the question.

The data is found in the dataframe provided in prompt_infos

Here is a question:
{input}"""

In [33]:
prompt_infos = [
    {
        "name": "machine learning",
        "description": "Good for answering questions about Machine Learning",
        "prompt_template": machinelearning_template,
        "dataframe": df
    }
]

In [21]:
from langchain.chains.router import MultiPromptChain
from langchain.chains.router.llm_router import LLMRouterChain, RouterOutputParser
from langchain.prompts import PromptTemplate

In [30]:
llm = ChatOpenAI(temperature=0.2, model=llm_model, openai_api_key=openai_api_key)

In [23]:
destination_chains = {}
for p_info in prompt_infos:
    name = p_info["name"]
    dataframe = p_info["dataframe"]
    prompt_template = p_info["prompt_template"]
    prompt = ChatPromptTemplate.from_template(template=prompt_template)
    chain = LLMChain(llm=llm, prompt=prompt)
    destination_chains[name] = chain

destinations = [f"{p['name']}: {p['description']} {p['dataframe']}" for p in prompt_infos]
destinations_str = "\n".join(destinations)

In [24]:
default_prompt = ChatPromptTemplate.from_template("{input}")
default_chain = LLMChain(llm=llm, prompt=default_prompt)

In [36]:
MULTI_PROMPT_ROUTER_TEMPLATE = """Given a raw text input to a \
language model select the requested observation, as well as the numeric value for the input, often referred to as SHAP-value. \
You will be given the observation number of the available rows and a \
request of what is wanted. \
You may also revise the original input if you think that revising \
it will ultimately lead to a better response from the language model.

<< FORMATTING >>
Return a markdown code snippet with a JSON object formatted to look like:
```json
{{{{
    "destination": string \\ name of the prompt to use or "DEFAULT"
    "next_inputs": string \\ a potentially modified version of the original input
}}}}
```

REMEMBER: "destination" MUST be one of the candidate prompt \
names specified below OR it can be "DEFAULT" if the input is not \
well suited for any of the candidate prompts.
REMEMBER: "next_inputs" can just be the original input \
if you don't think any modifications are needed.
REMEMBER: The feature ChurnBaseValue is the Target variable and \
all the other features are the SHAP-values impacting the Target variable.

<< CANDIDATE PROMPTS >>
{destinations}

<< INPUT >>
{{input}}

<< OUTPUT (remember to include the ```json)>>"""
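To make the expected reply format concrete, here is a minimal sketch of how a reply in that format could be parsed with the standard library. It is a simplified stand-in, not LangChain's `RouterOutputParser`; the function name `parse_router_reply`, the sample reply, and the choice to map `"DEFAULT"` to `None` are assumptions made for illustration.

```python
import json
import re

# Build the fence marker programmatically so this example contains no literal fence.
FENCE = "`" * 3
JSON_BLOCK = re.compile(
    re.escape(FENCE) + r"json\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_router_reply(reply: str) -> dict:
    """Extract the JSON object from the fenced json block in a router reply."""
    match = JSON_BLOCK.search(reply)
    if match is None:
        raise ValueError("reply does not contain a fenced json block")
    parsed = json.loads(match.group(1))
    # Treat "DEFAULT" as "no specific destination" (an assumption of this sketch).
    if parsed["destination"].strip().upper() == "DEFAULT":
        parsed["destination"] = None
    return parsed


# Hypothetical model reply that follows the requested format.
sample_reply = (
    FENCE + "json\n"
    '{"destination": "machine learning", '
    '"next_inputs": "Which SHAP value is largest for observation 0?"}\n'
    + FENCE
)
result = parse_router_reply(sample_reply)
print(result["destination"])
print(result["next_inputs"])
```

If the model omits the fenced block, the sketch raises a `ValueError`, which is one reason the template repeats the reminder to include it.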

In [26]:
router_template = MULTI_PROMPT_ROUTER_TEMPLATE.format(
    destinations=destinations_str
)
router_prompt = PromptTemplate(
    template=router_template,
    input_variables=["input"],
    output_parser=RouterOutputParser(),
)

router_chain = LLMRouterChain.from_llm(llm, router_prompt)

In [27]:
chain = MultiPromptChain(router_chain=router_chain,
                         destination_chains=destination_chains,
                         default_chain=default_chain, verbose=True
                        )

In [45]:
import textwrap

answer = chain.invoke("What is the most important feature for the first observation?")




> Entering new MultiPromptChain chain...
None: {'input': 'What is the most important feature for the first observation?'}
> Finished chain.


In [46]:
answer

{'input': 'What is the most important feature for the first observation?',
 'text': "Your question is quite broad and could apply to various contexts, such as scientific research, data analysis, user experience design, or even observational studies in social sciences, among others. To provide a meaningful answer, I'll address a few of these contexts briefly:\n\n1. **Scientific Research and Data Analysis**: In these contexts, the most important feature for the first observation often depends on the clarity and accuracy of the data. The initial observation should be relevant to the research question or hypothesis being tested. It should provide a clear, measurable, and objective starting point for the study. For example, in a study measuring the impact of a drug, the first observation might be the baseline health metrics of participants before they receive the treatment.\n\n2. **User Experience Design**: When designing a product or service, the first observation might focus on identifyin

# Semi-Structured RAG

In [34]:
!pip install llama-index==0.9.45.post1 arize-phoenix==2.2.1 pyvis



In [9]:
from llama_index.query_pipeline import (
    QueryPipeline as QP,
    Link,
    InputComponent,
)
from llama_index.query_engine.pandas import PandasInstructionParser
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate

In [None]:
df = pd.read_csv('shap_df_testset.csv')

## Model instructions

In [18]:
instruction_str = (
    "1. Convert the query to executable Python code using Pandas.\n"
    "2. The final line of code should be a Python expression that can be called with the `eval()` function.\n"
    "3. The code should represent a solution to the query.\n"
    "4. PRINT ONLY THE EXPRESSION.\n"
    "5. Do not quote the expression.\n"
)

pandas_prompt_str = (
    "You are working with a pandas dataframe in Python.\n"
    "The name of the dataframe is `df`.\n"
    "This is the result of `print(df.head())`:\n"
    "{df_str}\n\n"
    "Follow these instructions:\n"
    "{instruction_str}\n"
    "Query: {query_str}\n\n"
    "Expression:"
)
response_synthesis_prompt_str = (
    "Given an input question, synthesize a response from the query results.\n"
    "Query: {query_str}\n\n"
    "Pandas Instructions (optional):\n{pandas_instructions}\n\n"
    "Pandas Output: {pandas_output}\n\n"
    "Response: "
)

pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(
    instruction_str=instruction_str, df_str=df.head(5)
)
pandas_output_parser = PandasInstructionParser(df)
response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)
llm = OpenAI(model="gpt-3.5-turbo")
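The instructions above ask the model for a single pandas expression that can be passed to `eval()`. The sketch below shows that mechanism in isolation, without calling an LLM. The expression string is a hypothetical example of what a model might return, and `toy_df` is a small stand-in for the real dataframe; how `PandasInstructionParser` evaluates expressions internally may differ.

```python
import pandas as pd

# Hypothetical stand-in for shap_df_testset.csv (subset of columns, first two rows).
toy_df = pd.DataFrame(
    {
        "ChurnBaseValue": [0.262382, 0.262382],
        "tenure": [-0.031790, -0.009395],
        "MonthlyCharges": [-0.012948, -0.012948],
        "Contract_Two year": [-0.029851, 0.012195],
    }
)

# Hypothetical expression of the kind the prompt asks the model to produce:
# the column with the largest absolute SHAP value in the first row.
expression = "df.iloc[0].drop('ChurnBaseValue').abs().idxmax()"

# Evaluate with only `df` and `pd` visible to the expression.
# Note: hiding builtins reduces accidents but is NOT a security sandbox.
result = eval(expression, {"__builtins__": {}}, {"df": toy_df, "pd": pd})
print(result)
```

Because the generated code is executed, this approach should only be used with data and environments where running model-written code is acceptable.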

## Create Query Pipeline

In [24]:
qp = QP(
    modules={
        "input": InputComponent(),
        "pandas_prompt": pandas_prompt,
        "llm1": llm,
        "pandas_output_parser": pandas_output_parser,
        "response_synthesis_prompt": response_synthesis_prompt,
        "llm2": llm,
    },
    verbose=True,
)
qp.add_chain(["input", "pandas_prompt", "llm1", "pandas_output_parser"])
qp.add_links(
    [
        Link("input", "response_synthesis_prompt", dest_key="query_str"),
        Link(
            "llm1", "response_synthesis_prompt", dest_key="pandas_instructions"
        ),
        Link(
            "pandas_output_parser",
            "response_synthesis_prompt",
            dest_key="pandas_output",
        ),
    ]
)

# Add a link from the response synthesis prompt to llm2
qp.add_link("response_synthesis_prompt", "llm2")
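The chain and links above form a small directed acyclic graph. As a way to check the wiring by hand, the sketch below writes the same dependencies as a plain dictionary and derives an execution order with the standard library. This only illustrates the data flow; it makes no claim about how the query pipeline schedules its modules internally.

```python
from graphlib import TopologicalSorter

# Each key lists the modules it receives input from,
# mirroring the add_chain and add_links calls above.
dependencies = {
    "pandas_prompt": {"input"},
    "llm1": {"pandas_prompt"},
    "pandas_output_parser": {"llm1"},
    "response_synthesis_prompt": {"input", "llm1", "pandas_output_parser"},
    "llm2": {"response_synthesis_prompt"},
}

# A module can only run once all of its predecessors have produced output.
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

The response synthesis prompt is the only module with several inputs: the original query, the generated pandas expression from `llm1`, and the evaluated result from the output parser.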

In [None]:
from pyvis.network import Network

net = Network(notebook=True, cdn_resources="in_line", directed=True)
net.from_nx(qp.dag)
net.show("text2sql_dag.html")

In [35]:
openai_api_key = userdata.get('TestKey') #use the name of your colab key

os.environ['OPENAI_API_KEY'] = openai_api_key

In [36]:
from llama_index import VectorStoreIndex, SimpleDirectoryReader
response = qp.run(
    query_str="What is the most important feature in the first row?"
)

> Running module input with input: 
query_str: What is the most important feature in the first row?

> Running module pandas_prompt with input: 
query_str: What is the most important feature in the first row?

> Running module llm1 with input: 
messages: You are working with a pandas dataframe in Python.
The name of the dataframe is `df`.
This is the result of `print(df.head())`:
   ChurnBaseValue  SeniorCitizen    tenure  MonthlyCharges  TotalCharges...


APIConnectionError: Connection error.