<a href="https://colab.research.google.com/github/kanakOS01/langchain-tut/blob/main/pdf_query_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PDF Query Using Langchain
<hr>

In [None]:
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
!pip install google-colab

Collecting langchain
  Downloading langchain-0.1.9-py3-none-any.whl (816 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl (28 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl (12 kB)
Collecting langchain-community<0.1,>=0.0.21 (from langchain)
  Downloading langchain_community-0.0.24-py3-none-any.whl (1.7 MB)
Collecting langchain-core<0.2,>=0.1.26 (from langchain)
  Downloading langchain_core-0.1.26-py3-none-any.whl (246 kB)
Collecting langsmith<0.2.0,>=0.1.0 (from langchain)
  Downloading langsmit

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('openai_key')

In [None]:
from PyPDF2 import PdfReader

# embeddings map text to numeric vectors, so passages relevant to a question can be found by similarity
from langchain.embeddings.openai import OpenAIEmbeddings

# splits text on a separator into chunks of bounded length;
# important because embedding models and LLM prompts have token limits
from langchain.text_splitter import CharacterTextSplitter

# FAISS serves as an in-memory vector store for similarity search
from langchain.vectorstores import FAISS

In [8]:
# pdf path
pdfreader = PdfReader('/content/drive/MyDrive/Colab Notebooks/Pdf Query/resume.pdf')

In [9]:
# read text from pdf, skipping pages where no text could be extracted
raw_text = ''
for page in pdfreader.pages:
  content = page.extract_text()
  if content:
    raw_text += content
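
The loop above appends each page's text with no separator, so the last line of one page can run straight into the first line of the next. A minimal sketch of one way to avoid that, using plain strings in place of real pages; `join_page_texts` and `sample_pages` are hypothetical names introduced here, not part of PyPDF2:

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]]) -> str:
    """Join per-page text with a newline, skipping empty or missing pages."""
    kept = [text.strip() for text in page_texts if text and text.strip()]
    return "\n".join(kept)


# Hypothetical stand-ins for the values page.extract_text() might return.
sample_pages = ["Education\nB.Tech - CSE", None, "", "Projects\n1. Unbeatable Tic Tac Toe"]
combined = join_page_texts(sample_pages)
print(combined)
```

With the reader from this notebook, the same helper could presumably be called as `join_page_texts(page.extract_text() for page in pdfreader.pages)`.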

In [12]:
raw_text

'Kanak Tanwar\n/gtbgithub.com/kanakOS 01 |kanaktanwarpro@gmail.com\n/♀nednlinkedin.com/in/kanak-tanwar- 15082024 b|/ne+91-8860310807\nEducation\nDegree/Certificate Institute/Board CGPA/Percentage Year\nB.Tech - CSE Dronacharya College Of Engineering 86.3% (Current) 2022-Present\nSenior Secondary Manav Rachna International School 96.0% 2022\nSecondary Manav Rachna International School 97.6% 2020\nSkills\nProgramming Languages\nPython, Java, C/C++ *\nWeb Development\nHTML, CSS, Bootstrap, Flask\nData Science and Machine Learning\nNumPy, Pandas, Matplotlib, scikit-learn *, surprise *\nMisc\nGit, Linux, MySQL * Elementary proficiency\nProjects\n1. Unbeatable Tic Tac Toe\n•Developed a Python-based Tic Tac Toe game (player vs player / computer vs player).\n•Used minimax algorithm to develop two levels – medium and unbeatable – computer player.\n2. Grocery Management System\n•3-tier Grocery Management System.\n•Frontend using html, css, javascript, backend using python flask and mySQL databas

In [13]:
# split the text into chunks small enough for the embedding model's input limit
text_splitter = CharacterTextSplitter(
    separator='\n',

    # target maximum chunk size, measured in characters because length_function is len
    chunk_size=600,

    # number of characters a chunk may share with the end of the previous chunk
    chunk_overlap=150,
    length_function=len
)

texts = text_splitter.split_text(raw_text)
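
To make the roles of `separator`, `chunk_size` and `chunk_overlap` concrete, here is a simplified pure-Python sketch of separator-based chunking with overlap. It is an illustration under stated assumptions, not LangChain's actual implementation; `split_with_overlap` is a hypothetical helper and the sample text is made up.

```python
from typing import List


def split_with_overlap(text: str, separator: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size
    characters, carrying up to chunk_overlap trailing characters into the next chunk."""
    pieces = [p for p in text.split(separator) if p]
    chunks: List[str] = []
    current: List[str] = []

    def length(parts: List[str]) -> int:
        return len(separator.join(parts))

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits the overlap budget
            # and leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


# Made-up sample: ten 7-character lines.
sample = "\n".join(f"line {i:02d}" for i in range(10))
chunks = split_with_overlap(sample, separator="\n", chunk_size=25, chunk_overlap=8)
for chunk in chunks:
    print(repr(chunk))
```

In this sample, each chunk begins with the last line of the previous one, which is the same effect the overlap has on the resume text: a fact that straddles a chunk boundary still appears whole in at least one chunk.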

In [14]:
len(texts)

4

In [15]:
texts

['Kanak Tanwar\n/gtbgithub.com/kanakOS 01 |kanaktanwarpro@gmail.com\n/♀nednlinkedin.com/in/kanak-tanwar- 15082024 b|/ne+91-8860310807\nEducation\nDegree/Certificate Institute/Board CGPA/Percentage Year\nB.Tech - CSE Dronacharya College Of Engineering 86.3% (Current) 2022-Present\nSenior Secondary Manav Rachna International School 96.0% 2022\nSecondary Manav Rachna International School 97.6% 2020\nSkills\nProgramming Languages\nPython, Java, C/C++ *\nWeb Development\nHTML, CSS, Bootstrap, Flask\nData Science and Machine Learning\nNumPy, Pandas, Matplotlib, scikit-learn *, surprise *\nMisc',
 'Web Development\nHTML, CSS, Bootstrap, Flask\nData Science and Machine Learning\nNumPy, Pandas, Matplotlib, scikit-learn *, surprise *\nMisc\nGit, Linux, MySQL * Elementary proficiency\nProjects\n1. Unbeatable Tic Tac Toe\n•Developed a Python-based Tic Tac Toe game (player vs player / computer vs player).\n•Used minimax algorithm to develop two levels – medium and unbeatable – computer player.\n2. 

In [16]:
# create the OpenAI embeddings client (it reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

In [18]:
# embed each text chunk and store the resulting vectors in a FAISS index
document_search = FAISS.from_texts(texts, embeddings)

In [19]:
document_search

<langchain_community.vectorstores.faiss.FAISS at 0x7f8d809333a0>
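
What the vector store does at query time can be sketched without any external service: embed the query, score it against every stored chunk, and return the closest matches. The sketch below swaps in a toy bag-of-words embedding and cosine similarity. Real OpenAI embeddings are dense vectors from a neural model, and a FAISS index typically compares them by Euclidean distance or inner product, so the names `toy_embed` and `toy_similarity_search` and the sample corpus are purely illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def toy_similarity_search(query: str, texts: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k texts most similar to the query, best match first."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, toy_embed(text)), text) for text in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up corpus standing in for the resume chunks.
corpus = [
    "education degree school percentage",
    "python java programming languages",
    "tic tac toe game in python",
]
results = toy_similarity_search("python programming languages", corpus, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```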

In [22]:
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI

In [23]:
# The "stuff" chain type inserts ("stuffs") all the retrieved documents into a single prompt together with the question and asks the LLM to answer in one call. It is the simplest approach and works well as long as the retrieved chunks fit within the model's context window.
chain = load_qa_chain(OpenAI(), chain_type='stuff')
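
The idea behind `chain_type='stuff'` is that every retrieved chunk goes into one prompt alongside the question. A minimal sketch of that prompt assembly follows; the helper `build_stuff_prompt` and the prompt wording are assumptions made for illustration and are not LangChain's actual template.

```python
from typing import List


def build_stuff_prompt(documents: List[str], question: str) -> str:
    """Concatenate all documents into a single context block, then append the question."""
    context = "\n\n".join(documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Made-up chunks standing in for the output of a similarity search.
docs = [
    "Senior Secondary Manav Rachna International School 96.0% 2022",
    "Secondary Manav Rachna International School 97.6% 2020",
]
prompt = build_stuff_prompt(docs, "How much did Kanak score in 12th grade?")
print(prompt)
```

Because the whole context is sent in one request, the prompt grows with the number and size of retrieved chunks, which is one reason the text is split into bounded chunks before indexing.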

In [28]:
query = "Tell me about Kanak"
docs = document_search.similarity_search(query)
print(chain.run(input_documents=docs, question=query))

 Kanak Tanwar is a male individual who is currently studying at Dronacharya College Of Engineering. He has a strong background in programming languages such as Python, Java, and C/C++. He also has experience in web development using HTML, CSS, Bootstrap, and Flask, as well as data science and machine learning using libraries like NumPy, Pandas, and scikit-learn. Kanak has worked on various projects, including a movie recommendation system and a house price prediction project. He has also achieved several accomplishments, such as being a Regionalist at ICPC Amritapuri 2023 and winning 3rd prize in a tech race contest. In his free time, Kanak enjoys playing unbeatable Tic Tac Toe and developing software systems, such as a grocery management system.


In [29]:
query = "How much did kanak score in 12th grade"
docs = document_search.similarity_search(query)
print(chain.run(input_documents=docs, question=query))

 Kanak scored 96.0% in 12th grade.
