<h1 align="center">
    <img 
        src="../img/logo_white_bg.jpeg" 
        width="200" 
        border="1" />
</h1>
<h1 align="center">
    <b>GenAISHAP</b>
</h1>
<h4 align="center">
    <i>Explanations for Generative AI, LLM-and-SLM-Based, Solutions</i> ⚡️
</h4>



Generative AI SHAP (GenAISHAP) is a Python library that supports the creation of explanations for the metrics obtained for solutions based on LLMs (Large Language Models) or SLMs (Small Language Models).

The previous notebook showed an example of how to create the **Input** for ***GenAISHAP***, which is a simple pandas DataFrame containing the evaluation dataset. A DataFrame like the following was produced and stored as a JSON file:

<img src="../img/input_example.png" width="1200" />

> The column `user_input` will be used to refer to the user prompt, and the columns `faithfulness`, `context_precision` and `context_recall` will be used as metric columns since those columns are numerical.
>
> The other columns, `retrieved_contexts`, `response`, and `reference` are not needed for **GenAISHAP** but were required for the calculation of the metrics.
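
As a rough illustration of the note above, numeric columns can be picked out of a DataFrame with `select_dtypes`. This is only a sketch on a toy dataset; that the library selects its metric columns in exactly this way is an assumption.

```python
import pandas as pd

# Toy evaluation dataset; column names mirror the ones described above.
df = pd.DataFrame(
    {
        "user_input": ["Question A", "Question B"],
        "response": ["Answer A", "Answer B"],
        "faithfulness": [1.0, 0.5],
        "context_precision": [0.0, 1.0],
        "context_recall": [0.0, 1.0],
    }
)

# Assumption: the metric columns are simply the numeric ones.
metric_columns = df.select_dtypes(include="number").columns.tolist()
print(metric_columns)
```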

***GenAISHAP*** works as follows. It creates one regression model, which we call a **black-box model**, for each metric and uses that model to produce explanations for the metric. The models are built from features extracted from the provided questions. Those **question features** can be generated automatically, using a tool named **Featurizer** that is included in the library, or they can be created manually.

This notebook explains how to use the **Featurizer** tool.  

In [1]:
import pandas as pd
from genaishap import Featurizer
from dotenv import load_dotenv

# 1. Load input and set environment variables

The **Featurizer** tool uses an LLM to produce the features and their corresponding values for each sample. That is why the environment variables need to be loaded in this notebook. The loaded dataset is the one that was created in the previous notebook.


In [2]:
load_dotenv()

True
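
`load_dotenv()` returns `True` when it loads at least one variable from a `.env` file, which does not by itself confirm that the credentials the LLM client needs are present. A minimal sketch of an explicit check follows. The variable names are hypothetical placeholders, not names confirmed by the GenAISHAP documentation, so adjust them to whatever your `.env` file defines.

```python
import os

# Hypothetical variable names: replace with the ones your setup requires.
REQUIRED_VARS = ["AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {missing}")
```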

In [3]:
df_test_dataset = pd.read_json('./test-dataset.json', orient='records')
df_test_dataset.head(10)

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,context_precision,context_recall
0,Can you provide for me the three highlights fo...,"[31. In 2018, to align with industry best prac...",The three highlights for the GHG emissions sec...,"Sure, they are: \n1. 65% cumulative GHG emissi...",1.0,0.0,0.0
1,What percentage of waste from Google's offices...,[Performance highlights\nThe following section...,"In 2021, 78% of waste from Google's global dat...",Sixty-four percent.,1.0,0.0,0.0
2,Can you present me with the performance highli...,"[Education\nFor more than 40 years, we’ve work...",The performance highlights for empowering user...,Sure! The Performance Highlights for Empowerin...,1.0,0.0,0.0
3,What was the listed key achievement regarding ...,[Our approach\nWe believe that every business ...,There is no listed key achievement for Google ...,"In 2017, Google became the first major company...",1.0,1.0,1.0
4,Did Google reach its intended Waste target und...,[BUILDING BETTER DEVICES AND SERVICES\nTarget ...,"Yes, in 2021, Google achieved the UL 2799 Zero...","No, this target has not been met in 2021. Howe...",0.666667,1.0,1.0
5,How many EV charging locations were there on G...,[This guidance does not recognize existing ren...,The provided context does not specify the numb...,200000,1.0,0.0,0.0
6,On what page of the report can I find the perf...,"[Employee Recruitment, Inclusion and Performan...",The performance highlights for the Empowering ...,The performance highlights for Empowering User...,0.0,0.0,0.0
7,Can you please provide for me the glossary of ...,[GRI INDEX\nGRI 304 - Biodiversity\nGRI 103 Ma...,I'm unable to provide the glossary of the docu...,"Sure, here is the glossary:\nGlossary\nCFE: ca...",0.5,0.0,0.0
8,On what page can I find details about Amazons ...,[IntroductionSustainability\nDriving Climate S...,You can find details about Amazon's climate so...,You can find information on driving climate so...,0.0,0.0,0.0
9,"For the listed Renewable Energy goals, by when...",[IntroductionSustainability\nDriving Climate S...,Amazon intends to have all operations powered ...,Amazon set the goal of becoming powered by 100...,1.0,1.0,0.0


The following code initializes the **Featurizer** with the data loaded in the previous step.

In [4]:
featurizer = Featurizer.from_pandas(df_test_dataset)

# 2. Create features automatically

The following code uses an Azure OpenAI LLM deployment to create the features. By default, the number of features is 12, but this number can be modified if required. If the quality of the black-box models is low, one option to improve it is to increase the number of features generated.

The automatically generated features can be either **boolean** or **list of strings**. The goal is to capture the characteristics of the different user queries in a way that a human can easily interpret, while still allowing the features to be engineered into regressors for the black-box regression models.
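
To make the idea of engineering these features into regressors more concrete, here is a minimal sketch of one possible encoding: booleans become 0/1 columns and each list-of-strings column becomes a set of multi-hot indicator columns. The helper `encode_features` is hypothetical and is not part of the GenAISHAP API; the library may encode its features differently.

```python
import pandas as pd

# Toy feature table using two of the feature names shown in this notebook.
df_features = pd.DataFrame(
    {
        "there_is_any_company_identified_in_the_question": [True, False, True],
        "list_of_companies_identified_in_the_question": [["Google"], [], ["Amazon", "Google"]],
    }
)


def encode_features(df):
    """Turn boolean and list-of-strings columns into 0/1 regressor columns."""
    encoded = {}
    for column in df.columns:
        is_list_column = df[column].map(lambda v: isinstance(v, list)).all()
        if is_list_column:
            # Multi-hot encoding: one indicator column per distinct string.
            labels = sorted({item for items in df[column] for item in items})
            for label in labels:
                encoded[f"{column}={label}"] = df[column].map(
                    lambda items, target=label: int(target in items)
                )
        else:
            # Boolean column: True/False becomes 1/0.
            encoded[column] = df[column].astype(int)
    return pd.DataFrame(encoded, index=df.index)


X = encode_features(df_features)
print(X)
```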

The following is an example of how to create the features, and how to visualize them.

In [5]:
%%time

featurizer.create_features_using_azure_openai(
    deployment_name="gpt-4o", # Update with the name of your Azure OpenAI LLM deployment name
    num_features=12
)
print(featurizer.features.model_dump_json(indent=4))

{
    "features": [
        {
            "feature": "there_is_any_company_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_companies_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_initiative_or_program_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_initiatives_or_programs_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_financial_or_environmental_metric_identified_in_the_question",
            "ftype": "boolean"
        },
        {
            "feature": "list_of_financial_or_environmental_metrics_identified_in_the_question",
            "ftype": "list_of_strings"
        },
        {
            "feature": "there_is_any_specific_year_identified_in_the_question",
            "ftype": "boolean"
        }

These automatically generated features can be modified if required. To do so, use the `featurizer.features.features` list to remove or add features. For example, to remove the second feature (index = 1), you can use `del` as follows:

```python
del featurizer.features.features[1]
```

To add an additional feature, you can use `append` as follows:

```python
from genaishap import Feature

featurizer.features.features.append(
    Feature(
        feature="is_an_open_ended_question",
        ftype="boolean"
    )
)
```


# 3. Fill out the features automatically

Once the list of features is created, the **Featurizer** can automatically fill out the values of each feature. It uses an Azure OpenAI LLM to do the job. The `batch_size` parameter sets how many questions are filled out per batch (for example, 20), which helps to control the number of tokens used per LLM call.
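
The effect of the batch size can be pictured with a plain chunking helper. This is only a sketch of the general idea, not the library's implementation; `make_batches` and the 45-question dataset are hypothetical.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical dataset of 45 questions.
questions = [f"question {i}" for i in range(45)]
batches = make_batches(questions, batch_size=20)
print([len(batch) for batch in batches])  # [20, 20, 5]
```

A smaller batch size means fewer questions, and therefore fewer tokens, per call, at the cost of more calls overall.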

In [6]:
%%time

featurizer.fill_out_features_using_azure_openai(
    deployment_name="gpt-4o", 
    batch_size=20
)

  0%|          | 0/3 [00:00<?, ?it/s]

CPU times: user 165 ms, sys: 21.9 ms, total: 187 ms
Wall time: 2min 32s


QIN: What was the listed key achievement regarding sustainbility and climate change for Google in 2077?
QOU: What was the listed key achievement regarding sustainability and climate change for Google in 2077?

QIN: What were Amazon's Carbon Intesity values in (C02e per $ of GMS) in the years 2019 to 2022?
QOU: What were Amazon's Carbon Intensity values in (C02e per $ of GMS) in the years 2019 to 2022?

QIN: What household brands were featured in the in the climate pledge infographic on page 14?
QOU: What household brands were featured in the climate pledge infographic on page 14?



In [7]:
df_features = featurizer.to_pandas()
df_features.style.set_table_styles(
    [{'selector': 'th', 'props': [('font-size', '5pt')]}]
).set_properties(**{'font-size': '8pt',})

Unnamed: 0,there_is_any_company_identified_in_the_question,list_of_companies_identified_in_the_question,there_is_any_initiative_or_program_identified_in_the_question,list_of_initiatives_or_programs_identified_in_the_question,there_is_any_financial_or_environmental_metric_identified_in_the_question,list_of_financial_or_environmental_metrics_identified_in_the_question,there_is_any_specific_year_identified_in_the_question,list_of_years_identified_in_the_question,is_a_question_about_trends_or_changes_over_time,is_a_question_about_identification_of_factors_or_insights,is_a_question_related_to_a_specific_page_or_section_of_a_document,is_a_question_about_goals_or_targets
0,False,[],True,['Advancing Carbon-Free Energy'],True,['GHG emissions'],False,[],False,True,True,False
1,True,['Google'],False,[],True,['waste diversion'],True,['2021'],False,False,False,False
2,False,[],True,['Empowering Users With Technology'],False,[],False,[],False,True,True,False
3,True,['Google'],False,[],True,"['sustainability', 'climate change']",True,['2077'],False,True,False,True
4,True,['Google'],True,['Building Better Devices and Services'],True,['waste target'],True,['2021'],False,False,False,True
5,True,['Google'],False,[],True,['EV charging locations'],True,['2021'],False,False,False,False
6,False,[],True,['Empowering Users With Technology'],False,[],False,[],False,False,True,False
7,False,[],False,[],False,[],False,[],False,False,True,False
8,True,['Amazon'],True,['Climate solutions'],False,[],False,[],False,False,True,False
9,True,['Amazon'],True,['Renewable Energy goals'],True,['100% renewable energy'],False,[],False,False,False,True


It is possible to manually edit the values of the features, or to manually add or remove an entire column. The easiest way to do this is to manipulate the `df_features` DataFrame directly, using the usual pandas DataFrame methods and functions. The following restrictions apply to manual manipulation:
- The total number of records of `df_features` has to be the same as the number of records of `df_test_dataset`.
- The column names of `df_features` should be self-explanatory, since they are going to be used for the explanations.
- The type of the columns in `df_features` needs to be **boolean** or **list of strings**.
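
The row-count and column-type restrictions above can be checked programmatically after a manual edit. The sketch below assumes a hypothetical helper, `validate_features`, that is not part of GenAISHAP; it only illustrates the checks on toy data.

```python
import pandas as pd


def validate_features(df_features, df_dataset):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if len(df_features) != len(df_dataset):
        problems.append(
            f"row count mismatch: {len(df_features)} feature rows "
            f"vs {len(df_dataset)} dataset rows"
        )
    for column in df_features.columns:
        values = df_features[column]
        is_boolean = pd.api.types.is_bool_dtype(values)
        is_list_of_strings = values.map(
            lambda v: isinstance(v, list) and all(isinstance(s, str) for s in v)
        ).all()
        if not (is_boolean or is_list_of_strings):
            problems.append(f"column '{column}' is neither boolean nor list of strings")
    return problems


# Toy data: two questions with one boolean and one list-of-strings feature.
dataset = pd.DataFrame({"user_input": ["q1", "q2"]})
good = pd.DataFrame({"mentions_company": [True, False], "companies": [["Google"], []]})
print(validate_features(good, dataset))  # []
```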

# 4. Store features and values as JSON file

In [8]:
df_features.to_json('./test-features.json', orient='records', indent=4)
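
Before relying on the stored file, it can be worth confirming that the features survive a JSON round trip. The sketch below uses toy data and an in-memory buffer instead of a file. Passing `convert_dates=False` is a precaution rather than a documented requirement: by default pandas treats column names ending in `_time` as date-like, and one of the feature names above ends that way.

```python
import io

import pandas as pd

# Toy feature table with one boolean and one list-of-strings column.
df_features = pd.DataFrame(
    {
        "there_is_any_company_identified_in_the_question": [True, False],
        "list_of_companies_identified_in_the_question": [["Google"], []],
    }
)

# Serialize the same way as the cell above, but to a string instead of a file.
json_text = df_features.to_json(orient="records", indent=4)

# Read it back; date conversion is disabled as a precaution.
restored = pd.read_json(io.StringIO(json_text), orient="records", convert_dates=False)
print(restored.dtypes)
```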

Just for visualization, the following cell shows all the columns calculated so far in a single table: user input, metrics, and question features.

In [9]:
df_test_dataset.join(df_features)

Unnamed: 0,user_input,retrieved_contexts,response,reference,faithfulness,context_precision,context_recall,there_is_any_company_identified_in_the_question,list_of_companies_identified_in_the_question,there_is_any_initiative_or_program_identified_in_the_question,list_of_initiatives_or_programs_identified_in_the_question,there_is_any_financial_or_environmental_metric_identified_in_the_question,list_of_financial_or_environmental_metrics_identified_in_the_question,there_is_any_specific_year_identified_in_the_question,list_of_years_identified_in_the_question,is_a_question_about_trends_or_changes_over_time,is_a_question_about_identification_of_factors_or_insights,is_a_question_related_to_a_specific_page_or_section_of_a_document,is_a_question_about_goals_or_targets
0,Can you provide for me the three highlights fo...,"[31. In 2018, to align with industry best prac...",The three highlights for the GHG emissions sec...,"Sure, they are: \n1. 65% cumulative GHG emissi...",1.0,0.0,0.0,False,[],True,[Advancing Carbon-Free Energy],True,[GHG emissions],False,[],False,True,True,False
1,What percentage of waste from Google's offices...,[Performance highlights\nThe following section...,"In 2021, 78% of waste from Google's global dat...",Sixty-four percent.,1.0,0.0,0.0,True,[Google],False,[],True,[waste diversion],True,[2021],False,False,False,False
2,Can you present me with the performance highli...,"[Education\nFor more than 40 years, we’ve work...",The performance highlights for empowering user...,Sure! The Performance Highlights for Empowerin...,1.0,0.0,0.0,False,[],True,[Empowering Users With Technology],False,[],False,[],False,True,True,False
3,What was the listed key achievement regarding ...,[Our approach\nWe believe that every business ...,There is no listed key achievement for Google ...,"In 2017, Google became the first major company...",1.0,1.0,1.0,True,[Google],False,[],True,"[sustainability, climate change]",True,[2077],False,True,False,True
4,Did Google reach its intended Waste target und...,[BUILDING BETTER DEVICES AND SERVICES\nTarget ...,"Yes, in 2021, Google achieved the UL 2799 Zero...","No, this target has not been met in 2021. Howe...",0.666667,1.0,1.0,True,[Google],True,[Building Better Devices and Services],True,[waste target],True,[2021],False,False,False,True
5,How many EV charging locations were there on G...,[This guidance does not recognize existing ren...,The provided context does not specify the numb...,200000,1.0,0.0,0.0,True,[Google],False,[],True,[EV charging locations],True,[2021],False,False,False,False
6,On what page of the report can I find the perf...,"[Employee Recruitment, Inclusion and Performan...",The performance highlights for the Empowering ...,The performance highlights for Empowering User...,0.0,0.0,0.0,False,[],True,[Empowering Users With Technology],False,[],False,[],False,False,True,False
7,Can you please provide for me the glossary of ...,[GRI INDEX\nGRI 304 - Biodiversity\nGRI 103 Ma...,I'm unable to provide the glossary of the docu...,"Sure, here is the glossary:\nGlossary\nCFE: ca...",0.5,0.0,0.0,False,[],False,[],False,[],False,[],False,False,True,False
8,On what page can I find details about Amazons ...,[IntroductionSustainability\nDriving Climate S...,You can find details about Amazon's climate so...,You can find information on driving climate so...,0.0,0.0,0.0,True,[Amazon],True,[Climate solutions],False,[],False,[],False,False,True,False
9,"For the listed Renewable Energy goals, by when...",[IntroductionSustainability\nDriving Climate S...,Amazon intends to have all operations powered ...,Amazon set the goal of becoming powered by 100...,1.0,1.0,0.0,True,[Amazon],True,[Renewable Energy goals],True,[100% renewable energy],False,[],False,False,False,True
