# TAid QA Chatbot

TAid is a course-oriented chatbot that helps answer course-related conceptual questions. It is built on Featureform's MLOps implementation and uses vectorized course material to give the model more context.

## Requirements

* Python 3.7+
* pip3 (used to install the packages below from PyPI)
* python-dotenv 1.0.0 (`pip3 install python-dotenv==1.0.0`)
* featureform (`pip3 install featureform`)
* Hugging Face sentence-transformers (`pip3 install sentence-transformers`)
* daal4py (`pip3 install daal4py`)
* openai (`pip3 install openai`)
* `.env` file with OpenAI credentials and credentials for Pinecone, Weaviate, or both


```
pip3 install python-dotenv==1.0.0
pip3 install featureform
pip3 install sentence-transformers
pip3 install daal4py
pip3 install openai
```

## .env Credentials

[Pinecone](https://www.pinecone.io/) 

PINECONE_PROJECT_ID=

PINECONE_ENVIRONMENT=

PINECONE_API_KEY=


[Weaviate](https://weaviate.io/)

WEAVIATE_URL=

WEAVIATE_API_KEY=


[OpenAI](https://platform.openai.com/)

OPENAI_KEY=
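Before running the notebook, it can help to confirm which credential sets are actually filled in. The following is a minimal sketch using only the standard library: the variable names come from the lists above, while the grouping, the helper names `parse_env` and `missing_keys`, and the sample values are illustrative assumptions rather than part of the project.

```python
# Variable names are taken from the lists above; the grouping into
# "pinecone" / "weaviate" / "openai" is an assumption made for this sketch.
REQUIRED = {
    "pinecone": ["PINECONE_PROJECT_ID", "PINECONE_ENVIRONMENT", "PINECONE_API_KEY"],
    "weaviate": ["WEAVIATE_URL", "WEAVIATE_API_KEY"],
    "openai": ["OPENAI_KEY"],
}


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(values):
    """Return {group: [missing or empty variable names]} for incomplete groups."""
    report = {}
    for group, keys in REQUIRED.items():
        missing = [key for key in keys if not values.get(key)]
        if missing:
            report[group] = missing
    return report


# Placeholder values, not real credentials.
sample = """
PINECONE_PROJECT_ID=abc123
PINECONE_ENVIRONMENT=
PINECONE_API_KEY=secret
OPENAI_KEY=sk-placeholder
"""
report = missing_keys(parse_env(sample))
print(report)
```

With the sample above, the Pinecone set is reported as incomplete because `PINECONE_ENVIRONMENT` is empty, and the Weaviate set is reported as missing entirely, which is acceptable as long as one of the two vector databases is fully configured.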

## Step 1. Register Source

`data/files` is a directory of CSV files that use `;` as the delimiter and hold the lecture notes of a course (this example uses the CS70 notes). Each row is a section, a subsection, or a subsubsection, depending on the smallest unit in the notes' outline. After processing, each row has the following columns:

* Lecture
* Topic
* Section
* Text
* filename


```python
import featureform as ff
from featureform import local

client = ff.Client(local=True)
```



**NOTE:** The client instance is used to register resources as we define them.

**NOTE:** Registration saves its data in a `.featureform` folder. Deleting that folder before each run is recommended, so that every run starts fresh.
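A fresh start can be scripted. The sketch below assumes the `.featureform` folder sits in the directory the notebook is run from, which the note above does not state explicitly. The helper name `reset_featureform_state` is hypothetical, and the demonstration runs against a temporary directory so that it touches no real data.

```python
import shutil
import tempfile
from pathlib import Path


def reset_featureform_state(base_dir="."):
    """Delete <base_dir>/.featureform if it exists; return True if it was removed."""
    state_dir = Path(base_dir) / ".featureform"
    if state_dir.is_dir():
        shutil.rmtree(state_dir)
        return True
    return False


# Demonstration in a throwaway directory (assumed layout, not the real project).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / ".featureform").mkdir()
    first = reset_featureform_state(tmp)   # folder existed and was removed
    second = reset_featureform_state(tmp)  # nothing left to remove
```

In the notebook, calling `reset_featureform_state()` before the client is created would have the same effect as deleting the folder by hand.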

```python
lectures = local.register_directory(
    name="notes",
    path="data/files",
    description="CS70 Notes",
)
```

```python
client.dataframe(lectures)
```

```
Applying Run: confident_minsky
Creating user default_user
Creating provider local-mode
Creating source notes  confident_minsky
```

| | filename | body |
|---|---|---|
| 0 | Note_7.csv | Lecture;Topic;Section;Text\n7;Public Key Crypt... |
| 1 | Note_0.csv | Lecture;Topic;Section;Text\n0;Review of Sets;I... |
| 2 | Note_4.csv | Lecture;Topic;Section;Text\n4;The Stable Match... |


**NOTE:** The next cell does not need to be run. Run it only if you need to access Featureform's dashboard.

```python
#!featureform dash
```

## Step 2. Transform Lecture Files

When a directory is registered, its files are converted into a table with the columns `"filename"` and `"body"`. This avoids having to register many files one by one; in our case, however, we need to process this table to get it ready for vectorization.

```python
@local.df_transformation(inputs=[lectures])
def process_lecture_files(dir_df):
    from io import StringIO
    import pandas as pd

    lecture_dfs = []
    for _, row in dir_df.iterrows():
        # Each "body" cell holds the raw text of one ;-delimited CSV file.
        csv_str = StringIO(row["body"])
        r_df = pd.read_csv(csv_str, sep=";")
        # Record which file every parsed row came from.
        r_df["filename"] = row["filename"]
        lecture_dfs.append(r_df)

    return pd.concat(lecture_dfs)
```

We can verify this worked as we expected by serving this source as a dataframe and inspecting the results.

```python
df = client.dataframe(process_lecture_files)

df.head()
```

```
Applying Run: confident_minsky
Creating provider local-mode
Creating source process_lecture_files  confident_minsky
```

| | Lecture | Topic | Section | Text | filename |
|---|---|---|---|---|---|
| 0 | 7 | Public Key Cryptography | Public Key Cryptography I | The basic setting for cryptography is typicall... | Note_7.csv |
| 1 | 7 | Public Key Cryptography | Public Key Cryptography II | Since the link is insecure, Alice and Bob have... | Note_7.csv |
| 2 | 7 | Public Key Cryptography | Public Key Cryptography III | The central idea behind the RSA cryptosystem i... | Note_7.csv |
| 3 | 7 | Public Key Cryptography | Public Key Cryptography IV | The RSA scheme is based heavily on modular ari... | Note_7.csv |
| 4 | 7 | Public Key Cryptography | Encryption | [Encryption]: When Alice wants to send a messa... | Note_7.csv |
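The reshaping done in this step can also be tried in isolation, without Featureform or pandas. The sketch below is a standard-library approximation of the same idea: each `(filename, body)` pair is parsed as a `;`-delimited CSV and every resulting row is tagged with its source file. The function name `flatten_lecture_files` and the two miniature file bodies are made up for illustration.

```python
import csv
from io import StringIO


def flatten_lecture_files(files):
    """Turn (filename, body) pairs into a flat list of row dicts."""
    rows = []
    for filename, body in files:
        # The notes use ";" as the delimiter, as described in Step 1.
        reader = csv.DictReader(StringIO(body), delimiter=";")
        for record in reader:
            record["filename"] = filename
            rows.append(record)
    return rows


# Hypothetical miniature inputs in the same layout as the registered directory.
files = [
    ("Note_0.csv", "Lecture;Topic;Section;Text\n0;Review of Sets;Sets;A set is a collection of objects.\n"),
    ("Note_7.csv", "Lecture;Topic;Section;Text\n7;Public Key Cryptography;Encryption;Alice sends a message to Bob.\n"),
]
rows = flatten_lecture_files(files)
for row in rows:
    print(row["filename"], row["Lecture"], row["Section"])
```

Unlike `pd.read_csv`, `csv.DictReader` keeps every value as a string, so `Lecture` is `"0"` rather than `0` in this version.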


## Step 3. Entity ID Transformation

For our purposes, we'll need a unique identifier for each row of the notes, so we'll combine `"Lecture"`, `"Section"`, and `"filename"` into a new column, `"PK"` (the primary key).

```python
@local.df_transformation(inputs=[process_lecture_files])
def text_primary_key(lectures_df):
    lectures_df["PK"] = lectures_df.apply(lambda row: f"{row['Lecture']}_{row['Section']}_{row['filename']}", axis=1)

    return lectures_df
```

```python
df = client.dataframe(text_primary_key)

df.head()
```

```
Applying Run: confident_minsky
Creating provider local-mode
Creating source text_primary_key  confident_minsky
```

| | Lecture | Topic | Section | Text | filename | PK |
|---|---|---|---|---|---|---|
| 0 | 7 | Public Key Cryptography | Public Key Cryptography I | The basic setting for cryptography is typicall... | Note_7.csv | 7_Public Key Cryptography I_Note_7.csv |
| 1 | 7 | Public Key Cryptography | Public Key Cryptography II | Since the link is insecure, Alice and Bob have... | Note_7.csv | 7_Public Key Cryptography II_Note_7.csv |
| 2 | 7 | Public Key Cryptography | Public Key Cryptography III | The central idea behind the RSA cryptosystem i... | Note_7.csv | 7_Public Key Cryptography III _Note_7.csv |
| 3 | 7 | Public Key Cryptography | Public Key Cryptography IV | The RSA scheme is based heavily on modular ari... | Note_7.csv | 7_Public Key Cryptography IV_Note_7.csv |
| 4 | 7 | Public Key Cryptography | Encryption | [Encryption]: When Alice wants to send a messa... | Note_7.csv | 7_Encryption_Note_7.csv |
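Because the key is built from three columns, it is only useful if the combination is actually unique. The following sketch shows one way to check that on plain dictionaries. It deviates from the transformation above in one respect, which is an assumption of this sketch and not part of the notebook: it strips surrounding whitespace from each part, so a section title with a trailing space does not leak into the key. The helper names and sample rows are illustrative.

```python
from collections import Counter


def make_pk(row):
    """Join Lecture, Section and filename with underscores, stripping whitespace."""
    parts = (row["Lecture"], row["Section"], row["filename"])
    return "_".join(str(part).strip() for part in parts)


def duplicate_pks(rows):
    """Return the sorted list of keys that occur more than once."""
    counts = Counter(make_pk(row) for row in rows)
    return sorted(pk for pk, n in counts.items() if n > 1)


# Illustrative rows; the last two deliberately collide.
rows = [
    {"Lecture": 7, "Section": "Public Key Cryptography III ", "filename": "Note_7.csv"},
    {"Lecture": 7, "Section": "Encryption", "filename": "Note_7.csv"},
    {"Lecture": 7, "Section": "Encryption", "filename": "Note_7.csv"},
]
print(make_pk(rows[0]))
print(duplicate_pks(rows))
```

An empty list from `duplicate_pks` means every row can be addressed unambiguously by its key.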


## Step 4. Embeddings Transformation

We'll use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create an embedding for each row; the transformation below encodes the `"Section"` column. When we register an entity and associate a feature with it, this transformation will be materialized and the embeddings will be persisted in a Pinecone index.

```python
@local.df_transformation(inputs=[text_primary_key])
def vectorize_comments(lectures_df):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(lectures_df["Section"].tolist())
    lectures_df["Vector"] = embeddings.tolist()

    return lectures_df
```

## Step 5. Register Pinecone

We'll be using Pinecone for this example, but you can also choose to use Weaviate.

This step assumes you have a `.env` file with your Pinecone credentials.

```python
import dotenv
import os

dotenv.load_dotenv(".env")

pinecone = ff.register_pinecone(
    name="pinecone",
    project_id=os.getenv("PINECONE_PROJECT_ID", ""),
    environment=os.getenv("PINECONE_ENVIRONMENT", ""),
    api_key=os.getenv("PINECONE_API_KEY", ""),
)
```

```python
client.apply()
```

```
Applying Run: confident_minsky
Creating provider local-mode
Creating provider pinecone
Creating source vectorize_comments  confident_minsky
```


## Step 6. Register Entity, Features, and Embeddings, and Write Them to the Vector DB

We'll now register an entity and a feature, which will kick off the materialization process.

**NOTE:**
This may take some time to complete. See the progress bar for status.

```python
@ff.entity
class Text:
    comment_embeddings = ff.Embedding(
        vectorize_comments[["PK", "Vector"]],
        dims=384,
        vector_db=pinecone,
        description="Embeddings created from text in notes",
        variant="v2"
    )
    comments = ff.Feature(
        text_primary_key[["PK", "Text"]],
        type=ff.String,
        description="Lecture notes text",
        variant="v2"
    )
```

```python
!pip install pinecone-client
```

```python
client.apply()
```

```
Applying Run: confident_minsky
Creating provider local-mode
Creating entity text
Creating feature comment_embeddings  v2
Creating feature comments  v2

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

Updating Feature Table: |██████████████████████████████████████████████████| 100% Complete
```





## Step 7. Register On-Demand Features to Retrieve Relevant Context

We want to query the embeddings we created and then fetch the documents they belong to. Featureform's on-demand feature decorator lets us do both: it creates a feature that is calculated on the client at serving time.

```python
@ff.ondemand_feature(variant="calhacks")
def relevent_comments(client, params, entity):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    search_vector = model.encode(params["query"])
    res = client.nearest("comment_embeddings", "v2", search_vector, k=3)
    return res
```
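Conceptually, the `nearest` call ranks the stored vectors by their similarity to the query vector and returns the keys of the top `k`. The toy sketch below illustrates that idea in plain Python with cosine similarity. It is not how Featureform or Pinecone implement the search, and the metric a real index uses depends on how that index is configured; the two-dimensional vectors and the function names are invented for the example (the real embeddings have 384 dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(index, query, k=3):
    """index maps primary key -> vector; return the k keys most similar to query."""
    ranked = sorted(
        index,
        key=lambda pk: cosine_similarity(index[pk], query),
        reverse=True,
    )
    return ranked[:k]


# Toy index with made-up keys and 2-dimensional vectors.
index = {
    "a": [1.0, 0.0],
    "b": [0.0, 1.0],
    "c": [0.7, 0.7],
    "d": [-1.0, 0.0],
}
top = nearest(index, [1.0, 0.1], k=3)
print(top)
```

This brute-force scan is linear in the number of stored vectors, which is fine for a toy example; a vector database exists precisely to avoid that cost at scale.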

```python
@ff.ondemand_feature(variant="calhack")
def contextualized_prompt(client, params, entity):
    pks = client.features([("relevent_comments", "calhacks")], {}, params=params)
    prompt = "Use the following snippets from the lecture notes to answer the following question\n"
    for pk in pks[0]:
        prompt += "```"
        prompt += client.features([("comments", "v2")], {"Text": pk})[0]
        prompt += "```\n"
    prompt += "Question: "
    prompt += params["query"]
    prompt += "?"
    return prompt
```
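The string assembly inside `contextualized_prompt` can be separated from the feature lookups, which makes the prompt layout easy to inspect. The sketch below mirrors that layout as a pure function; `build_prompt` is a hypothetical helper, not part of the notebook, and the snippet text is invented. The three-backtick delimiter is built with `"`" * 3` only to keep the example readable.

```python
FENCE = "`" * 3  # the three-backtick delimiter wrapped around each snippet


def build_prompt(snippets, query):
    """Assemble the prompt in the same layout as contextualized_prompt."""
    prompt = "Use the following snippets from the lecture notes to answer the following question\n"
    for snippet in snippets:
        prompt += FENCE + snippet + FENCE + "\n"
    # As in the notebook, a question mark is always appended to the query.
    prompt += "Question: " + query + "?"
    return prompt


example = build_prompt(["A set is a collection of objects."], "What is a set")
print(example)
```

Because the question mark is appended unconditionally, a query that already ends in `?` produces `??` at the end of the prompt.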


```python
# input_query = "complement"
input_query = input("Please enter the question:")

client.apply()
client.features([("relevent_comments", "calhacks")], {}, params={"query": input_query})

client.apply()
client.features([("contextualized_prompt", "calhack")], {}, params={"query": input_query})
```


```
Please enter the question:Prove improvement lemma
Applying Run: confident_minsky
Creating provider local-mode

Applying Run: confident_minsky
Creating provider local-mode

array(['Use the following snippets from the lecture notes to answer the following question\n```Theorem 4.2. The matching output by the Propose-and-Reject algorithm is job/employer optimal. Proof. Suppose for sake of contradiction that the matching is not employer optimal. Then, there exists a day on which some job had its offer rejected by its optimal candidate. Let day k be the first such day. On this day, suppose J was rejected by C∗ (its optimal candidate) in favor of an offer from J∗. By the definition of optimal candidate, there must exist a stable matching T in which J and C∗ are paired together. Suppose T looks like this: {. . . , (J,C∗), . . . , (J∗,C′), . . .}. We will argue that (J∗,C∗) is a rogue couple in T , thus contradicting stability. First, it is clear that C∗ prefers J∗ to J, since she rejected an offer from J in favor of an offer from J∗ during the execution of the propose-and-reject algorithm. Moreover, since day k was the first day when some job had an offer reject
```

## Step 8. Feed the Prompt into OpenAI

```python
import openai

client.apply()
# q = "What should I know about " + input_query
q = input_query
prompt = client.features([("contextualized_prompt", "calhack")], {}, params={"query": q})[0]

# openai.organization = os.getenv("OPENAI_ORG", "")
openai.api_key = os.getenv("OPENAI_KEY", "")

# NOTE: openai.Completion.create is the legacy interface of openai-python releases before 1.0.
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=1000,  # The maximum number of tokens to generate
    temperature=1.0,  # Controls randomness; higher values give more varied output
)
print(response["choices"][0]["text"])
```

```
Applying Run: confident_minsky
Creating provider local-mode
```

Improvement Lemma. If a job-candidate pair (J1, C1) is not a rogue couple in some stable matching T, and (J2, C2) is not a rogue couple in some other stable matching S, then replacing (J1, C1) by (J2, C2) in either T or S will yield a new stable matching M.

Proof. Suppose for sake of contradiction that (J1, C1) is not a rogue couple in some stable matching T, and (J2, C2) is not a rogue couple in some other stable matching S, and replacing (J1, C1) by (J2, C2) in either T or S will not yield a new stable matching M. 

If (J1, C1) is replaced by (J2, C2) in T, then we can create an equivalent stable matching where J2 prefers C1 to C2 which is a contradiction to the assumption.

If (J1, C1) is replaced by (J2, C2) in S, then we can create an equivalent stable matching where C2 prefers J1 to J2 which is a contradiction to the assumption. 

Thus, the replacement of (J1, C1) by (J2, C2) must yield a new stable matching M.
