# Abstractive QA

<a href="https://colab.research.google.com/github/myscale/examples/blob/main/abstractive-qa.ipynb" style="padding-left: 0.5rem;"><img src="https://colab.research.google.com/assets/colab-badge.svg?style=plastic"></a><a href="https://github.com/myscale/examples/blob/main/abstractive-qa.ipynb" style="padding-left: 0.5rem;"><img src="https://img.shields.io/badge/Open-Github-blue.svg?logo=github&style=plastic"></a>

## Introduction
Abstractive QA (Question Answering) is a type of natural language processing (NLP) technique that involves generating an answer to a given question in natural language by summarizing and synthesizing information from various sources, rather than just selecting an answer from pre-existing text.

Unlike extractive QA, which relies on identifying and extracting relevant passages of text from a corpus of documents to answer a question, abstractive QA systems are capable of generating new, original sentences that capture the key information and meaning required to answer the question.

In this notebook, you will learn how MyScale can help you build an abstractive QA application with the OpenAI API. A question-answering system of this kind needs three primary components:
1. A vector index to store embeddings and run semantic search.
2. A retriever model to embed contextual passages.
3. The OpenAI API for answer generation.

We will use the [bitcoin_articles dataset](https://www.kaggle.com/datasets/balabaskar/bitcoin-news-articles-text-corpora), a collection of news articles on Bitcoin scraped from various sources on the Internet using the Newscatcher API. We'll use the retriever to create embeddings for the context passages, index them in the vector database, and run a semantic search to retrieve the top k most relevant contexts with potential answers to our question. The OpenAI API will then generate answers based on the returned contexts.
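
The retrieve-then-generate flow described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the three-dimensional vectors are hand-made stand-ins for real embeddings, `retrieve` is a hypothetical helper, and the final prompt is built but not sent to any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, passages, top_k):
    """Return the texts of the top_k passages most similar to the query."""
    ranked = sorted(
        passages,
        key=lambda p: cosine_similarity(query_vec, p["vector"]),
        reverse=True,
    )
    return [p["text"] for p in ranked[:top_k]]

# Hand-made 3-dimensional vectors stand in for real embeddings (assumption).
passages = [
    {"text": "Bitcoin has a maximum supply of 21 million.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Central banks issue traditional money.", "vector": [0.7, 0.7, 0.1]},
    {"text": "The weather was sunny today.", "vector": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.2, 0.0]

# Step 1: semantic search. Step 2: pack the hits into a prompt for the generator.
contexts = retrieve(query_vec, passages, top_k=2)
prompt = "\n".join(contexts) + "\nQ: How many bitcoins can exist?\nA:"
print(prompt)
```

In the rest of the notebook, the vector database takes over the ranking step and the OpenAI API consumes the prompt.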

If you're more interested in exploring the capabilities of MyScale, feel free to skip the [Building dataset](#building-dataset) section and dive right into the [Populate data to MyScale](#populate-data-to-myscale) section.

## Prerequisites
Before we get started, we need to install the [ClickHouse Python client](https://clickhouse.com/docs/en/integrations/language-clients/python/intro/), `openai`, `sentence-transformers`, and other dependencies.

### Install dependencies

In [1]:
!pip install clickhouse-connect openai sentence-transformers torch requests pandas tqdm



### Setup openai

In [2]:
import openai
openai.api_key = "YOUR_OPENAI_API_KEY"

### Setup retriever
We need to initialize our retriever, which performs two tasks. The first is optional if you skip the [Building dataset](#building-dataset) section and use the precomputed embeddings:

1. Produce embeddings for each context passage (context vectors/embeddings)
2. Produce an embedding for our queries (query vector/embedding)

In [3]:
import torch
from sentence_transformers import SentenceTransformer
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)
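
The `Normalize()` module at the end of the printed model scales every embedding to unit length. One consequence worth seeing in code: for unit-length vectors, the plain dot product equals cosine similarity. The sketch below uses small hand-made vectors rather than real model output, and the helper names are illustrative.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def l2_normalize(vec):
    """Scale a vector to unit length, as a normalization layer does."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine similarity computed from the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0, 0.0]
b = [1.0, 2.0, 2.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both expressions evaluate to the same number (11/15 for these vectors).
print(dot(a_unit, b_unit), cosine_similarity(a, b))
```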

## Building dataset

### Downloading and processing data

The dataset contains news articles about Bitcoin that were web scraped from various sources on the internet using Newscatcher API.

The information is given in CSV files and includes details such as article ID, title, author, published date, link, summary, topic, country, language, and more. We start by creating a compact database for retrieval.

To keep this notebook simple, we host a full copy of the Kaggle dataset [bitcoin-news-articles-text-corpora](https://www.kaggle.com/datasets/balabaskar/bitcoin-news-articles-text-corpora) on S3, which saves you from configuring [Kaggle's Public API](https://github.com/Kaggle/kaggle-api) credentials.

You can download the dataset with the commands below:

In [4]:
!wget https://myscale-saas-assets.s3.ap-southeast-1.amazonaws.com/testcases/clickhouse/bitcoin-news-articles-text-corpora.zip

# unzip the downloaded file
!unzip -o bitcoin-news-articles-text-corpora.zip 

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  837k  100  837k    0     0   196k      0  0:00:04  0:00:04 --:--:--  206k
Archive:  bitcoin-news-articles-text-corpora.zip
  inflating: bitcoin_articles.csv    


After loading the data, we perform basic cleaning: dropping rows with duplicate summaries and removing rows with an empty summary or author.

In [5]:
import pandas as pd
data_raw = pd.read_csv('bitcoin_articles.csv')

data_raw.drop_duplicates(subset=['summary'], keep='first', inplace=True)
data_raw.dropna(subset=['summary'], inplace=True)
data_raw.dropna(subset=['author'], inplace=True)

In [6]:
data_raw.head(3)

Unnamed: 0,article_id,title,author,published_date,link,clean_url,excerpt,summary,rights,article_rank,topic,country,language,authors,media,twitter_account,article_score
0,57a00c1140cbd3af79e77bf0e4e6af48,62% of Bitcoin Has Not Moved in a Year as Long...,Jamie McNeill,04-10-2022 17:15,https://www.business2community.com/crypto-news...,business2community.com,"Over the course of the last few years, there h...","Over the course of the last few years, there h...",business2community.com,1595,finance,US,en,Jamie McNeill,https://www.business2community.com/wp-content/...,@Jamie_DeFi,8.556426
1,21b48b3731c03466be3fac4be6c7dc67,The Orange Party Issue Playlist,Bitcoin Magazine,05-10-2022 21:17,https://bitcoinmagazine.com/culture/orange-par...,bitcoinmagazine.com,News Links: Russia Legalizing Bitcoin And Cryp...,Russia Legalizing Bitcoin And Crypto Is A Matt...,bitcoinmagazine.com,6284,news,US,en,Bitcoin Magazine,https://bitcoinmagazine.com/.image/t_share/MTk...,,8.507881
2,77030740ee160ad68c25e4e63515dd77,How Many Bitcoins Are There?,AOL Staff,04-10-2022 21:44,https://www.gobankingrates.com/investing/crypt...,gobankingrates.com,Bitcoin has a maximum supply of 21 million. Ho...,Bitcoin has a maximum supply of 21 million. Ho...,aol.com,5044,news,US,en,"AOL Staff,David Granahan",https://s.yimg.com/ny/api/res/1.2/wPK4V8gjwjrD...,@AOL,8.483973


### Generating article summary embeddings
After processing the data, we use the previously defined retriever to generate embeddings for article summaries.

In [7]:
from tqdm.auto import tqdm

summary_raw = data_raw['summary'].values.tolist()
summary_feature = []

for i in tqdm(range(0, len(summary_raw), 1)):
    i_end = min(i+1, len(summary_raw))
    # generate embeddings for summary
    emb = retriever.encode(summary_raw[i:i_end]).tolist()[0]
    summary_feature.append(emb)
    
data_raw['summary_feature'] = summary_feature

  0%|          | 0/1731 [00:00<?, ?it/s]
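
The loop above encodes one summary per call. As a possible variation, the summaries could be grouped into batches so the model is called once per batch; `SentenceTransformer.encode` accepts a list of sentences, and it also exposes a `batch_size` argument. The sketch below shows only the batching logic: `batched` is a hypothetical helper and `fake_encode` is a stand-in for the real model, so it runs without downloading anything.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def fake_encode(texts):
    """Stand-in for an embedding model: one 2-dimensional vector per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]

summaries = ["bitcoin price", "fed policy update", "crypto", "store of value", "mining"]
features = []
for batch in batched(summaries, batch_size=2):
    # One model call per batch instead of one call per summary.
    features.extend(fake_encode(batch))

print(len(features))
```

Replacing `fake_encode(batch)` with `retriever.encode(batch).tolist()` would be the corresponding call in this notebook, assuming the retriever defined earlier.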

### Creating dataset
Finally, we convert the DataFrame into a CSV file and compress it into a zip archive, which we upload to S3 for later use.

In [8]:
data = data_raw[['article_id', 'title', 'author', 'link', 'summary', 'article_rank', 'summary_feature']]
data = data.reset_index().rename(columns={'index': 'id'})
data.to_csv('bitcoin_articles_embd.csv', index=False)

In [9]:
!zip abstractive-qa-examples.zip bitcoin_articles_embd.csv

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
  adding: bitcoin_articles_embd.csv (deflated 55%)


## Populate data to MyScale

### Loading data
To populate MyScale, we first download the dataset created in the previous section. The following code snippet shows how to download the data and load it into a pandas DataFrame.

Note: `summary_feature` is a 384-dimensional floating-point vector that represents the text features extracted from an article summary using the `multi-qa-MiniLM-L6-cos-v1` model.

In [10]:
!wget https://myscale-saas-assets.s3.ap-southeast-1.amazonaws.com/testcases/clickhouse/abstractive-qa-examples.zip

!unzip -o abstractive-qa-examples.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6941k  100 6941k    0     0  1082k      0  0:00:06  0:00:06 --:--:-- 1647k
Archive:  abstractive-qa-examples.zip
  inflating: bitcoin_articles_embd.csv  


In [11]:
import pandas as pd
import ast

data = pd.read_csv('bitcoin_articles_embd.csv')
data['summary_feature'] = data['summary_feature'].apply(ast.literal_eval)
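
The `ast.literal_eval` step is needed because a list stored in a CSV cell comes back as a string. The standard-library sketch below reproduces that round trip with a made-up three-element vector, writing to an in-memory buffer instead of a file.

```python
import ast
import csv
import io

# Write one row whose last column is a Python list; the CSV writer stores its string form.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "summary_feature"])
writer.writerow([0, [0.05, -0.05, 0.12]])

# Read the row back: every cell is now plain text.
buffer.seek(0)
row = next(csv.DictReader(buffer))

as_text = row["summary_feature"]      # a str such as "[0.05, -0.05, 0.12]"
as_list = ast.literal_eval(as_text)   # parsed back into a list of floats
print(type(as_text).__name__, type(as_list).__name__, len(as_list))
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of parsing.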

In [12]:
data.head(3)

Unnamed: 0,id,article_id,title,author,link,summary,article_rank,summary_feature
0,0,57a00c1140cbd3af79e77bf0e4e6af48,62% of Bitcoin Has Not Moved in a Year as Long...,Jamie McNeill,https://www.business2community.com/crypto-news...,"Over the course of the last few years, there h...",1595,"[0.054771628230810165, -0.05538482591509819, -..."
1,1,21b48b3731c03466be3fac4be6c7dc67,The Orange Party Issue Playlist,Bitcoin Magazine,https://bitcoinmagazine.com/culture/orange-par...,Russia Legalizing Bitcoin And Crypto Is A Matt...,6284,"[-0.02826531231403351, -0.018267612904310226, ..."
2,2,77030740ee160ad68c25e4e63515dd77,How Many Bitcoins Are There?,AOL Staff,https://www.gobankingrates.com/investing/crypt...,Bitcoin has a maximum supply of 21 million. Ho...,5044,"[0.028079882264137268, -0.02520909532904625, 0..."


### Creating table
Next, we create a table in MyScale. Before you begin, retrieve your cluster host, username, and password from the MyScale console.

The following code snippet creates the bitcoin article information table.

In [13]:
import clickhouse_connect

client = clickhouse_connect.get_client(
    host='YOUR_CLUSTER_HOST',
    port=8443,
    username='YOUR_USERNAME',
    password='YOUR_CLUSTER_PASSWORD'
)

# create table for bitcoin texts
client.command("DROP TABLE IF EXISTS default.myscale_llm_bitcoin_qa")

client.command("""
CREATE TABLE default.myscale_llm_bitcoin_qa
(
    id UInt64,
    article_id String,
    title String,
    author String,
    link String,
    summary String,
    article_rank UInt64,
    summary_feature Array(Float32),
    CONSTRAINT vector_len CHECK length(summary_feature) = 384
)
ORDER BY id
""")

''
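
Because the table enforces `length(summary_feature) = 384`, rows with a vector of any other length would violate the constraint. A small client-side check before inserting can surface such rows early. The helper below is a hypothetical sketch, not part of the MyScale or ClickHouse client.

```python
EXPECTED_DIM = 384  # matches the CHECK constraint in the table definition

def split_by_dimension(rows, vector_key="summary_feature", dim=EXPECTED_DIM):
    """Separate rows whose vector has the expected length from those that do not."""
    valid, invalid = [], []
    for row in rows:
        (valid if len(row[vector_key]) == dim else invalid).append(row)
    return valid, invalid

# Two made-up rows: the second vector is one element short.
rows = [
    {"id": 0, "summary_feature": [0.0] * 384},
    {"id": 1, "summary_feature": [0.0] * 383},
]
valid, invalid = split_by_dimension(rows)
print([r["id"] for r in valid], [r["id"] for r in invalid])
```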

### Uploading data
After creating the table, we insert the data loaded from the dataset into it and create a vector index to accelerate later vector search queries. The following code snippet shows how to insert the data and create a vector index with the cosine distance metric.

In [14]:
# upload data from datasets
client.insert("default.myscale_llm_bitcoin_qa", 
              data.to_records(index=False).tolist(), 
              column_names=data.columns.tolist())

# check count of inserted data
print(f"article count: {client.command('SELECT count(*) FROM default.myscale_llm_bitcoin_qa')}")

# create vector index with cosine
client.command("""
ALTER TABLE default.myscale_llm_bitcoin_qa 
ADD VECTOR INDEX summary_feature_index summary_feature
TYPE MSTG
('metric_type=Cosine')
""")

article count: 1731


''

In [15]:
# check the status of the vector index, make sure vector index is ready with 'Built' status
get_index_status="SELECT status FROM system.vector_indices WHERE name='summary_feature_index'"
print(f"index build status: {client.command(get_index_status)}")

index build status: Built
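
The index may not be ready the first time its status is checked. One way to handle that is a small polling loop, sketched below. `wait_until_built` is a hypothetical helper, and the stub stands in for the database; the only status name taken from this notebook is `Built`, while the intermediate value in the stub is made up.

```python
import time

def wait_until_built(fetch_status, timeout_s=600.0, poll_interval_s=5.0):
    """Poll fetch_status() until it returns 'Built' or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "Built":
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index not ready, last status: {status}")
        time.sleep(poll_interval_s)

# Stub that reports two intermediate states before 'Built' (made-up status name).
statuses = iter(["InProgress", "InProgress", "Built"])
result = wait_until_built(lambda: next(statuses), timeout_s=5.0, poll_interval_s=0.01)
print(result)
```

With the client from this notebook, `fetch_status` could be `lambda: client.command(get_index_status)`.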


## Query MyScale

### Search and filter
Use the retriever to generate an embedding for the question.

In [16]:
question = 'what is the difference between bitcoin and traditional money?'
emb_query = retriever.encode(question).tolist()

Then, use vector search to find the top K candidates most similar to the question, keeping only articles with `article_rank < 500`.

In [17]:
top_k = 10
results = client.query(f"""
SELECT summary, distance(summary_feature, {emb_query}) as dist
FROM default.myscale_llm_bitcoin_qa
WHERE article_rank < 500
ORDER BY dist ASC
LIMIT {top_k}
""")

summaries = []
for res in results.named_results():
    summaries.append(res["summary"])
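
The query above works because an f-string renders a Python list as `[v1, v2, ...]`, which reads as an array literal in the SQL text. The sketch below wraps that idea in a hypothetical helper, `build_search_sql`, which also coerces its inputs to numbers so that only numeric literals end up in the statement. The example vector is made up and far shorter than a real 384-dimensional embedding.

```python
def build_search_sql(query_vector, top_k, max_rank,
                     table="default.myscale_llm_bitcoin_qa"):
    """Render the filtered vector search as a SQL string (hypothetical helper)."""
    # Coerce every element to float so nothing but numeric literals reaches the SQL text.
    vector_literal = "[" + ", ".join(repr(float(x)) for x in query_vector) + "]"
    return (
        f"SELECT summary, distance(summary_feature, {vector_literal}) AS dist\n"
        f"FROM {table}\n"
        f"WHERE article_rank < {int(max_rank)}\n"
        f"ORDER BY dist ASC\n"
        f"LIMIT {int(top_k)}"
    )

sql = build_search_sql([0.25, -0.5, 1.0], top_k=10, max_rank=500)
print(sql)
```

The resulting string could be passed to `client.query(sql)` in place of the inline f-string.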

### Get CoT for GPT-3.5
Combine the summaries retrieved from MyScale into a prompt.

In [18]:
CoT = ''
for summary in summaries:
    CoT += summary
CoT += '\n' +'Based on the context above '+'\n' +' Q: '+ question + '\n' +' A: The answer is'
print(CoT)

Some even see a digital payment revolution unfolding on the horizon. Despite rising inflation, the interest in crypto is still growing, and adoption continues to expand. One of the industries that are bridging the gap between crypto and ordinary people is retail Forex trading. In the midst of global economic and political uncertainties and disturbances, people increasingly seek out the cryptocurrency market to probe its inner workings, principles and financial potential. Investors use crypto to diversify their portfolios, whereas the mother of all cryptocurrencies—bitcoin—even established itself as a ‘store of value'.Bitcoin prices have stayed relatively stable lately amid contractionary Fed policies. getty
Bitcoin prices have continued to trade within a relatively tight range recently, retaining their value even as Federal Reserve policies threaten the values of risk assets. The world's best-known digital currency, which has a total market value of close to $375 billion at the time of
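
As the printed prompt shows, the summaries are concatenated without any separator. A possible refinement, sketched below, joins them with newlines and stops adding context once a character budget is reached. `build_prompt` is a hypothetical helper and the default budget of 6000 characters is an arbitrary assumption, not a limit taken from the OpenAI documentation.

```python
def build_prompt(summaries, question, max_context_chars=6000):
    """Join summaries with newlines, stopping before the context exceeds the budget."""
    kept, used = [], 0
    for summary in summaries:
        text = summary.strip()
        if used + len(text) > max_context_chars:
            break
        kept.append(text)
        used += len(text)
    context = "\n".join(kept)
    return f"{context}\nBased on the context above\n Q: {question}\n A: The answer is"

prompt = build_prompt(
    ["Bitcoin is a digital currency. ", "Central banks issue traditional money."],
    "what is the difference between bitcoin and traditional money?",
    max_context_chars=100,
)
print(prompt)
```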

### Get result from GPT-3.5
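Then, use the generated CoT to query `gpt-3.5-turbo`. The `openai.ChatCompletion` interface used here belongs to `openai` releases before 1.0; it was removed in later versions of the library.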

In [20]:
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": CoT}
    ],
    temperature=0,
)

In [21]:
print("Example: Retrieval with MyScale")
print('Q: ', question)

print('A: ', response.choices[0].message.content)

Example: Retrieval with MyScale
Q:  what is the difference between bitcoin and traditional money?
A:  Bitcoin is a decentralized digital currency that operates independently of traditional banking systems and is not backed by any government. It is based on blockchain technology and allows for peer-to-peer transactions without the need for intermediaries. Traditional money, on the other hand, is issued and regulated by central banks and governments, and its value is backed by the trust and stability of those institutions.


The model returns a complete, detailed answer to the question.