### Power of Cassandra and ChatGPT for PDF Data Ingestion and Question Answering using AstraDB & Langchain 🦜
[**Link to my YouTube Channel**](https://www.youtube.com/BhaveshBhatt8791?sub_confirmation=1)

# Installs

In [None]:
!pip install -q cassandra-driver
!pip install -q langchain
!pip install -q openai
!pip install -q pypdf
!pip install -q "cassio>=0.1.1"
!pip install -q tiktoken==0.4.0

# Cassandra Import

In [34]:
import cassandra
print(cassandra.__version__)

3.28.0


In [35]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
import json

# The secure connect bundle (SCB) file name is generated when you download it;
# if yours is different, update the file name below.
cloud_config = {
  'secure_connect_bundle': 'secure-connect-bhavesh-astra-test.zip'
}

# The token JSON file name is generated when you download your token;
# if yours is different, update the file name below.
with open("bhavesh_astra_test-token.json") as f:
    secrets = json.load(f)

CLIENT_ID = secrets["clientId"]
CLIENT_SECRET = secrets["secret"]

auth_provider = PlainTextAuthProvider(CLIENT_ID, CLIENT_SECRET)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

# Sanity check: read the server's release version to confirm the connection works
row = session.execute("select release_version from system.local").one()
if row:
    print(row[0])
else:
    print("An error occurred.")

ERROR:cassandra.connection:Closing connection <AsyncoreConnection(134104449286784) e312786c-bba2-4fd0-a3c3-17328cc29f6a-us-east1.db.astra.datastax.com:29042:b76400fb-5d89-4041-8fac-032b8afcdffd> due to protocol error: Error from server: code=000a [Protocol error] message="Beta version of the protocol used (5/v5-beta), but USE_BETA flag is unset"


4.0.7-131836135da7
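
The connection cell reads `clientId` and `secret` straight out of the token file. As a minimal sketch, the helper below does the same read but fails with a clear message when a key is missing. The function name `load_astra_token` is hypothetical and not part of any library; only the two key names come from the cell above.

```python
import json
from pathlib import Path


def load_astra_token(path):
    """Read an Astra token JSON file and return (client_id, client_secret)."""
    secrets = json.loads(Path(path).read_text())
    # The two key names are the ones used in the connection cell above.
    missing = [key for key in ("clientId", "secret") if key not in secrets]
    if missing:
        raise KeyError(f"token file {path} is missing keys: {missing}")
    return secrets["clientId"], secrets["secret"]
```

It could be used as `CLIENT_ID, CLIENT_SECRET = load_astra_token("bhavesh_astra_test-token.json")` before building the `PlainTextAuthProvider`.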


# Import

In [37]:
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.cassandra import Cassandra
from langchain.indexes import VectorstoreIndexCreator
from langchain.text_splitter import (
    CharacterTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, PyPDFLoader

# OpenAI API Key

In [38]:
import os
os.environ['OPENAI_API_KEY'] = "Enter your OpenAI Key Here..."
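
Pasting the key into the notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads `OPENAI_API_KEY` from the environment and only prompts when it is unset. The helper name `resolve_openai_key` and its `env` and `ask` parameters are hypothetical, added here so the logic can be exercised without a real key.

```python
import os
from getpass import getpass


def resolve_openai_key(env=os.environ, ask=getpass):
    """Return the key from OPENAI_API_KEY, prompting only if it is unset."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed value, so it does not end up in cell output.
        key = ask("OpenAI API key: ").strip()
    if not key:
        raise ValueError("no OpenAI API key provided")
    return key
```

In the notebook this would be used as `os.environ['OPENAI_API_KEY'] = resolve_openai_key()`.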

# Initialization

In [39]:
llm = OpenAI(temperature=0)
openai_embeddings = OpenAIEmbeddings()

In [40]:
table_name = 'pdf_q_n_a_table_1'
keyspace = "pdf_q_n_a_test"

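# Build an index creator that splits documents into overlapping chunks,
# embeds each chunk with OpenAI, and stores the vectors in the Astra DB table
index_creator = VectorstoreIndexCreator(
    vectorstore_cls=Cassandra,
    embedding=openai_embeddings,
    text_splitter=RecursiveCharacterTextSplitter(
        chunk_size=400,    # maximum characters per chunk
        chunk_overlap=30,  # characters shared between neighbouring chunks
    ),
    vectorstore_kwargs={
        'session': session,
        'keyspace': keyspace,
        'table_name': table_name,
    },
)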
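
To get a feel for what `chunk_size = 400` and `chunk_overlap = 30` mean, here is a deliberately simplified sliding-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs and newlines); the function name `sliding_window_chunks` is hypothetical and only illustrates the size and overlap arithmetic.

```python
def sliding_window_chunks(text, chunk_size=400, chunk_overlap=30):
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With these defaults each window starts 370 characters after the previous one, so a 1000-character string yields three chunks and the last 30 characters of one chunk are the first 30 of the next.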

# Loading PDF

In [41]:
loader = PyPDFLoader("Attention Paper.pdf")
pages = loader.load_and_split()

In [42]:
len(pages)

12

In [44]:
pages[1]

Document(page_content='Recurrent models typically factor computation along the symbol positions of the input and output\nsequences. Aligning the positions to steps in computation time, they generate a sequence of hidden\nstatesht, as a function of the previous hidden state ht−1and the input for position t. This inherently\nsequential nature precludes parallelization within training examples, which becomes critical at longer\nsequence lengths, as memory constraints limit batching across examples. Recent work has achieved\nsigniﬁcant improvements in computational efﬁciency through factorization tricks [ 18] and conditional\ncomputation [ 26], while also improving model performance in case of the latter. The fundamental\nconstraint of sequential computation, however, remains.\nAttention mechanisms have become an integral part of compelling sequence modeling and transduc-\ntion models in various tasks, allowing modeling of dependencies without regard to their distance in\nthe input or outp

# Load to Index

In [54]:
pdf_index = index_creator.from_loaders([loader])

In [55]:
default_query = f'SELECT * FROM {keyspace}.{table_name}'

rows = session.execute(default_query)

for row_i, row in enumerate(rows):
    print(f'\nRow {row_i}:')
    print(f'row_id: {row.row_id}')
    print(f'embedding_vector: {str(row.vector)[:64]} ...')
    print(f'body_blob: {row.body_blob[:64]} ...')
    print(f'metadata_blob: {row.metadata_s}')

print('\n...')


Row 0:
row_id: 1543e4d885a84b30a713ce3945104a55
embedding_vector: [-0.043849579989910126, -0.005807334091514349, -0.00084909476572 ...
body_blob: arXiv:1609.08144 , 2016.
[32] Jie Zhou, Ying Cao, Xuguang Wang,  ...
metadata_blob: {'page': '10.0', 'source': 'Attention Paper.pdf'}

Row 1:
row_id: cf1d29fa0b504af99e26253116dd3b17
embedding_vector: [-0.02232765033841133, 0.006839259527623653, 0.0281220730394125, ...
body_blob: traverse in the network. The shorter these paths between any com ...
metadata_blob: {'page': '5.0', 'source': 'Attention Paper.pdf'}

Row 2:
row_id: 1fc0fbbb8c414a779e20c842c75093bf
embedding_vector: [-0.023693561553955078, -0.005949937738478184, 0.002372674643993 ...
body_blob: connected feed-forward network, which is applied to each positio ...
metadata_blob: {'page': '4.0', 'source': 'Attention Paper.pdf'}

Row 3:
row_id: e7e95b8c02fc4200addcff7a5a6027f4
embedding_vector: [-0.01607612706720829, -0.010801807977259159, 0.0183686986565589 ...
body_blob: translation 

# Asking Questions to the PDF

In [56]:
query_1 = "What is multi-head attention?"
pdf_index.query_with_sources(query_1, llm=llm)

{'question': 'What is multi-head attention?',
 'answer': ' Multi-head attention is a technique used in natural language processing where multiple attention layers are employed in parallel, each with a reduced dimension.\n',
 'sources': 'Attention Paper.pdf'}

In [58]:
query_2 = "What are positional encodings?"
pdf_index.query_with_sources(query_2, llm=llm)

{'question': 'What are positional encodings?',
 'answer': ' Positional encodings are additional information added to the input embeddings of a sequence to make use of the order of the sequence.\n',
 'sources': 'Attention Paper.pdf'}
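
Behind each query, the question is embedded and compared against the stored chunk vectors so that the closest chunks can be handed to the LLM. The toy sketch below shows that ranking step in memory with cosine similarity. The real lookup happens inside the database, and the metric used there is an assumption here; the names `cosine_similarity` and `top_k` and the two-dimensional vectors are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, rows, k=2):
    """Return the texts of the k rows most similar to query_vector.

    rows is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```

In the notebook the vectors would be the OpenAI embeddings shown in the `embedding_vector` column of the table dump, which are far longer than the toy vectors this sketch is meant for.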