# TAid QA Chatbot

TAid is a course-oriented chatbot that helps answer course-related conceptual questions. It is built on Featureform's MLOps implementation and uses vectorized course material to give the model more context.

## Requirements

* Python 3.7+
* pip (`pip3`), for installing the packages below from PyPI
* python-dotenv 1.0.0 (`pip3 install python-dotenv==1.0.0`)
* featureform (`pip3 install featureform`)
* Hugging Face sentence-transformers (`pip3 install sentence-transformers`)
* daal4py (`pip3 install daal4py`)
* openai (`pip3 install openai`)
* A `.env` file with OpenAI credentials and credentials for Pinecone, Weaviate, or both


```
pip3 install python-dotenv==1.0.0
pip3 install featureform
pip3 install sentence-transformers
pip3 install daal4py
pip3 install openai
```

## .env Credentials

[Pinecone](https://www.pinecone.io/) 

PINECONE_PROJECT_ID=

PINECONE_ENVIRONMENT=

PINECONE_API_KEY=


[Weaviate](https://weaviate.io/)

WEAVIATE_URL=

WEAVIATE_API_KEY=


[OpenAI](https://platform.openai.com/)

OPENAI_KEY=
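Before running the notebook, it can help to confirm that the needed variables are actually set. The sketch below is one possible check, not part of the project: `missing_vars` and `check_credentials` are hypothetical helper names, and the rule it encodes (OpenAI is required, plus Pinecone or Weaviate) is an assumption based on the requirements list.

```python
import os

# Variable names taken from the .env template above.
PINECONE_VARS = ("PINECONE_PROJECT_ID", "PINECONE_ENVIRONMENT", "PINECONE_API_KEY")
WEAVIATE_VARS = ("WEAVIATE_URL", "WEAVIATE_API_KEY")
OPENAI_VARS = ("OPENAI_KEY",)


def missing_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


def check_credentials(env=None):
    """Hypothetical helper: list the variables that still need a value.

    Assumes OpenAI is always required and that one complete set of
    vector DB credentials (Pinecone or Weaviate) is enough.
    """
    problems = list(missing_vars(OPENAI_VARS, env))
    pinecone_missing = missing_vars(PINECONE_VARS, env)
    weaviate_missing = missing_vars(WEAVIATE_VARS, env)
    if pinecone_missing and weaviate_missing:
        # Neither vector DB is fully configured; report the Pinecone gaps,
        # since Pinecone is the provider used in this walkthrough.
        problems.extend(pinecone_missing)
    return problems
```

Calling `check_credentials()` after `dotenv.load_dotenv(".env")` returns an empty list when everything needed is present.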

## Step 1. Register Source

`data/files` is a directory of CSV files that use `;` as the delimiter and hold the lecture notes of a course (this example uses the CS70 notes). Each row is a section, subsection, or subsubsection, depending on the smallest unit in the notes' outline. After processing, each row has the following columns (`filename` is added in Step 2):

* Lecture
* Topic
* Section
* Text
* filename
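
To make the file layout concrete, here is a small sketch that reads one `;`-delimited file with Python's standard `csv` module. The sample text is made up for illustration and `parse_notes` is a hypothetical helper; the notebook itself uses pandas for this step. Because the delimiter is `;`, commas inside the `Text` column need no quoting.

```python
import csv
from io import StringIO

# Made-up sample in the layout described above (not taken from the real notes).
SAMPLE = (
    "Lecture;Topic;Section;Text\n"
    "0;Review of Sets;Intro;A set is a collection of objects, such as {1, 2}.\n"
    "0;Review of Sets;Cardinality;The size of a set is its cardinality.\n"
)


def parse_notes(csv_text, filename):
    """Parse one `;`-delimited notes file into a list of row dicts."""
    reader = csv.DictReader(StringIO(csv_text), delimiter=";")
    rows = []
    for row in reader:
        row["filename"] = filename  # remember which file the row came from
        rows.append(row)
    return rows


rows = parse_notes(SAMPLE, "Note_0.csv")
print(rows[0]["Section"], "|", rows[0]["Text"])
```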


In [2]:
import featureform as ff
from featureform import local

client = ff.Client(local=True)



**NOTE:** We create an instance of the client so that we can register resources as we define them.

**NOTE:** Registration saves its data in a `.featureform` folder. We recommend deleting that folder before each run to start fresh.
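
One way to automate that cleanup is sketched below. It assumes the `.featureform` folder sits in the directory the notebook runs from; `reset_featureform_state` is a hypothetical helper, not a Featureform API.

```python
import shutil
from pathlib import Path


def reset_featureform_state(project_dir="."):
    """Delete the `.featureform` folder under project_dir, if it exists.

    Returns True when a folder was removed and False when there was nothing to delete.
    """
    state_dir = Path(project_dir) / ".featureform"
    if state_dir.is_dir():
        shutil.rmtree(state_dir)
        return True
    return False
```

Run it before creating the client so that earlier registrations do not carry over.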

In [3]:
lectures = local.register_directory(
    name="notes",
    path="data/files",
    description="CS70 Notes",
)

In [4]:
client.dataframe(lectures)

Applying Run: zealous_mclean
Creating user default_user 
Creating provider local-mode 
Creating source notes  zealous_mclean


| | filename | body |
|---|---|---|
| 0 | Note_0.csv | Lecture;Topic;Section;Text\n0;Review of Sets;I... |


**NOTE:** The next cell doesn't need to be run. Run it only if you need to access Featureform's dashboard.

In [5]:
#!featureform dash

## Step 2. Transform Lecture Files

When a directory is registered, its files are converted into a table with the columns `"filename"` and `"body"`. This saves us from registering each file individually; in our case, however, we still need to process this table to get it ready for vectorization.

In [6]:
@local.df_transformation(inputs=[lectures])
def process_lecture_files(dir_df):
    from io import StringIO
    import pandas as pd

    lecture_dfs = []
    for _, row in dir_df.iterrows():
        # Each row holds one file: its name and its raw `;`-delimited contents.
        csv_str = StringIO(row["body"])
        r_df = pd.read_csv(csv_str, sep=";")
        r_df["filename"] = row["filename"]
        lecture_dfs.append(r_df)

    return pd.concat(lecture_dfs)

We can verify this worked as we expected by serving this source as a dataframe and inspecting the results.

In [7]:
df = client.dataframe(process_lecture_files)

df.head()

Applying Run: zealous_mclean
Creating provider local-mode 
Creating source process_lecture_files  zealous_mclean


| | Lecture | Topic | Section | Text | filename |
|---|---|---|---|---|---|
| 0 | 0 | Review of Sets | Intro | A set is a well defined collection of objects.... | Note_0.csv |
| 1 | 0 | Review of Sets | Cardinality | We can also talk about the size of a set, or i... | Note_0.csv |
| 2 | 0 | Review of Sets | Subsets and Proper Subsets | If every element of a set A is also in set B, ... | Note_0.csv |
| 3 | 0 | Review of Sets | Intersections and Unions | The intersection of a set A with a set B, writ... | Note_0.csv |
| 4 | 0 | Review of Sets | Complements | If A and B are two sets, then the relative com... | Note_0.csv |


## Step 3. Entity ID Transformation

For our purposes, we need a unique identifier for each row of the notes, so we combine `"Lecture"`, `"Section"`, and `"filename"` into a new column, `"PK"` (the primary key).

In [8]:
@local.df_transformation(inputs=[process_lecture_files])
def text_primary_key(lectures_df):
    lectures_df["PK"] = lectures_df.apply(lambda row: f"{row['Lecture']}_{row['Section']}_{row['filename']}", axis=1)
    
    return lectures_df

In [9]:
df = client.dataframe(text_primary_key)

df.head()

Applying Run: zealous_mclean
Creating provider local-mode 
Creating source text_primary_key  zealous_mclean


| | Lecture | Topic | Section | Text | filename | PK |
|---|---|---|---|---|---|---|
| 0 | 0 | Review of Sets | Intro | A set is a well defined collection of objects.... | Note_0.csv | 0_Intro_Note_0.csv |
| 1 | 0 | Review of Sets | Cardinality | We can also talk about the size of a set, or i... | Note_0.csv | 0_Cardinality_Note_0.csv |
| 2 | 0 | Review of Sets | Subsets and Proper Subsets | If every element of a set A is also in set B, ... | Note_0.csv | 0_Subsets and Proper Subsets_Note_0.csv |
| 3 | 0 | Review of Sets | Intersections and Unions | The intersection of a set A with a set B, writ... | Note_0.csv | 0_Intersections and Unions_Note_0.csv |
| 4 | 0 | Review of Sets | Complements | If A and B are two sets, then the relative com... | Note_0.csv | 0_Complements_Note_0.csv |
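
The key is only useful if it is actually unique: two rows with the same lecture number, section title, and file would collide. The sketch below shows one way to look for such collisions. The rows are made up for illustration and `duplicate_pks` is a hypothetical helper; the key format mirrors `text_primary_key`.

```python
from collections import Counter


def make_pk(row):
    """Build the key the same way as `text_primary_key`: Lecture_Section_filename."""
    return f"{row['Lecture']}_{row['Section']}_{row['filename']}"


def duplicate_pks(rows):
    """Return the keys that occur more than once, in first-seen order."""
    counts = Counter(make_pk(row) for row in rows)
    return [pk for pk, n in counts.items() if n > 1]


# Made-up rows: the last two share Lecture, Section and filename.
rows = [
    {"Lecture": 0, "Section": "Intro", "filename": "Note_0.csv"},
    {"Lecture": 1, "Section": "Intro", "filename": "Note_1.csv"},
    {"Lecture": 1, "Section": "Examples", "filename": "Note_1.csv"},
    {"Lecture": 1, "Section": "Examples", "filename": "Note_1.csv"},
]
print(duplicate_pks(rows))
```

If the list is not empty, adding another column (for example a row counter) to the key would be one way to disambiguate.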


## Step 4. Embeddings Transformation

We'll use [`all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) to create an embedding for each row of the notes (the code below encodes the `"Section"` column). When we register an entity and associate a feature with it, this transformation is materialized and the embeddings are persisted in a Pinecone index.

In [10]:
@local.df_transformation(inputs=[text_primary_key])
def vectorize_comments(lectures_df):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(lectures_df["Section"].tolist())
    lectures_df["Vector"] = embeddings.tolist()
    
    return lectures_df
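
`all-MiniLM-L6-v2` produces 384-dimensional vectors, and the vector index has to be created with the same size. A quick sanity check along these lines can catch a mismatch early. This is only a sketch: `check_dims` is a hypothetical helper and the vectors are placeholders for the `"Vector"` column.

```python
EXPECTED_DIMS = 384  # output size of all-MiniLM-L6-v2


def check_dims(vectors, dims=EXPECTED_DIMS):
    """Return the indices of vectors whose length differs from `dims`."""
    return [i for i, vec in enumerate(vectors) if len(vec) != dims]


# Made-up vectors standing in for the "Vector" column.
vectors = [[0.0] * 384, [0.0] * 384, [0.0] * 383]
bad = check_dims(vectors)
print(bad)
```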

## Step 5. Register Pinecone

We'll be using Pinecone for this example, but you can also choose to use Weaviate.

This step assumes you have a `.env` file with your Pinecone credentials.

In [11]:
import dotenv
import os

dotenv.load_dotenv(".env")

pinecone = ff.register_pinecone(
    name="pinecone",
    project_id=os.getenv("PINECONE_PROJECT_ID", ""),
    environment=os.getenv("PINECONE_ENVIRONMENT", ""),
    api_key=os.getenv("PINECONE_API_KEY", ""),
)

In [12]:
client.apply()

Applying Run: zealous_mclean
Creating provider local-mode 
Creating provider pinecone 
Creating source vectorize_comments  zealous_mclean


## Step 6. Register Entity, Features, and Embeddings, and Write Them to the Vector DB

We'll now register an entity and a feature, which will kick off the materialization process.

**NOTE:** This may take some time to complete. See the progress bar for status.

In [13]:
@ff.entity
class Text:
    comment_embeddings = ff.Embedding(
        vectorize_comments[["PK", "Vector"]],
        dims=384,
        vector_db=pinecone,
        description="Embeddings created from the lecture note sections",
        variant="v2"
    )
    comments = ff.Feature(
        text_primary_key[["PK", "Text"]],
        type=ff.String,
        description="Original text of each lecture note section",
        variant="v2"
    )

In [14]:
!pip install pinecone-client



In [15]:
client.apply()

Applying Run: zealous_mclean
Creating provider local-mode 
Creating entity text 
Creating feature comment_embeddings  v2
Creating feature comments  v2




## Step 7. Register On-Demand Features to Retrieve Relevant Context

We want to query the embeddings we created and then fetch the related text. We can do so with Featureform's on-demand feature decorator, which creates a feature that is calculated on the client at serving time.

In [22]:
# input_query = "complement"
input_query = input("Please enter the question: ")

Please enter the question: What is a subset?


In [23]:
@ff.ondemand_feature(variant="calhacks")
def relevent_comments(client, params, entity):
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    search_vector = model.encode(params["query"])
    res = client.nearest("comment_embeddings", "v2", search_vector, k=3)
    return res

In [24]:
client.apply()
client.features([("relevent_comments", "calhacks")], {}, params={"query": input_query})

Applying Run: zealous_mclean
Creating provider local-mode 
Creating ondemand_feature relevent_comments  calhacks




array([['0_Subsets and Proper Subsets_Note_0.csv',
        '0_Complements_Note_0.csv', '0_Significant Sets_Note_0.csv']],
      dtype='<U39')
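
Conceptually, a nearest-neighbor lookup ranks the stored vectors by their similarity to the query vector and returns the keys of the top `k`. The pure-Python sketch below illustrates the idea with cosine similarity and toy 3-dimensional vectors. It is not how Pinecone or `client.nearest` is implemented, and the choice of cosine similarity is an assumption, since the metric of the index is not stated here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(index, query, k=3):
    """Return the k keys whose vectors are most similar to `query`."""
    ranked = sorted(index, key=lambda pk: cosine_similarity(index[pk], query), reverse=True)
    return ranked[:k]


# Toy 3-dimensional "embeddings"; real ones from all-MiniLM-L6-v2 have 384 dimensions.
index = {
    "0_Subsets_Note_0.csv": [0.9, 0.1, 0.0],
    "0_Complements_Note_0.csv": [0.6, 0.4, 0.0],
    "1_Induction_Note_1.csv": [0.0, 0.1, 0.9],
}
print(nearest(index, [1.0, 0.0, 0.0], k=2))
```

A vector database does the same ranking with approximate search structures so that it scales beyond a brute-force scan.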

In [25]:
@ff.ondemand_feature(variant="calhack")
def contextualized_prompt(client, params, entity):
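    # Look up the primary keys of the most relevant rows.
    pks = client.features([("relevent_comments", "calhacks")], {}, params=params)
    prompt = "Use the following snippets from the lecture notes to answer the following question\n"
    for pk in pks[0]:
        prompt += "```"
        # The entity registered above is `text`, so it is used as the lookup key here.
        prompt += client.features([("comments", "v2")], {"text": pk})[0]
        prompt += "```\n"
    prompt += "Question: "
    # Avoid a doubled question mark when the query already ends with one.
    prompt += params["query"].rstrip("?")
    prompt += "?"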
    return prompt
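
The prompt assembly can also be written as a plain function, which makes it easy to test without a running client. The sketch below mirrors the structure used above (instruction line, fenced snippets, question); `build_prompt` is a hypothetical helper and the snippet text is made up.

```python
FENCE = "`" * 3  # the triple-backtick delimiter placed around each snippet


def build_prompt(snippets, question):
    """Assemble the prompt: instruction line, fenced snippets, then the question."""
    parts = ["Use the following snippets from the lecture notes to answer the following question\n"]
    for snippet in snippets:
        parts.append(f"{FENCE}{snippet}{FENCE}\n")
    # Avoid a doubled question mark when the user already typed one.
    parts.append("Question: " + question.rstrip("?") + "?")
    return "".join(parts)


prompt = build_prompt(["A is a subset of B if every element of A is in B."], "What is a subset?")
print(prompt)
```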


In [26]:
client.apply()
client.features([("contextualized_prompt", "calhack")], {}, params={"query": input_query})

Applying Run: zealous_mclean
Creating provider local-mode 
Creating ondemand_feature contextualized_prompt  calhack




array(['Use the following snippets from the lecture notes to answer the following question\n```If every element of a set A is also in set B, then we say that A is a subset of B, written A $\\subseteq$ B. Equivalently we can write B $superseteq$ A, or B is a superset of A. A proper subset is a set A that is strictly contained in B, written as A $subset$ B, meaning that A excludes at least one element of B. For example, consider the set B = {1,2,3,4,5}. Then {1,2,3} is both a subset and a proper subset of B, while {1,2,3,4,5} is a subset but not a proper subset of B. Here are a few basic properties regarding subsets: -  The empty set, denote by {} or /0, is a proper subset of any nonempty set A: {} $subset$ A. -  The empty set is a subset of every set B: {} $\\subseteq$ B. -  Every set A is a subset of itself: A $\\subseteq$ A.```\n```If A and B are two sets, then the relative complement of A in B, or the set difference between B and A, written as B $-$ A or B \\ A, is the set of element

## Step 8. Feed the Prompt into OpenAI

In [27]:
import openai

client.apply()
# q = "What should I know about " + input_query
q = input_query
prompt = client.features([("contextualized_prompt", "calhack")], {}, params={"query": q})[0]

# NOTE: `openai.Completion` is the legacy interface of the openai package (versions before 1.0).
# openai.organization = os.getenv("OPENAI_ORG", "")
openai.api_key = os.getenv("OPENAI_KEY", "")

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=1000,  # The maximum number of tokens to generate
    temperature=1.0,  # Sampling temperature; higher values give more random output
)
print(response["choices"][0]["text"])

Applying Run: zealous_mclean
Creating provider local-mode 






A subset is a set A that is contained within the elements of another set B, written A $\subseteq$ B. Every set A is a subset of itself, A $\subseteq$ A. Additionally, the empty set, denote by {} or /0, is a proper subset of any nonempty set A: {} $subset$ A.
