- Documents and document loaders
- Text splitters
- Embeddings
- Vector stores and retrievers


In [1]:
from langchain_community.document_loaders import PyPDFLoader

def load_pdf_docs(file_path: str):
    loader = PyPDFLoader(file_path)
    docs = loader.load()
    return docs

file_path = "/Users/manasvi/Downloads/IEEE_Conference_Template.pdf"
docs = load_pdf_docs(file_path)
print(len(docs))





8


LangChain's `Document` abstraction represents a unit of text together with its associated metadata. It has three attributes:
- `page_content`: a string holding the content;
- `metadata`: a dict of arbitrary metadata (for a PDF page: source path, page number, and so on);
- `id`: an optional string identifier for the document.
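
To make the shape of these three attributes concrete, here is a minimal stand-in written with a plain dataclass. `SimpleDocument` is a hypothetical name used only for illustration; it is not LangChain's own class, and the file name `example.pdf` is made up.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SimpleDocument:
    """Hypothetical stand-in that mirrors the three Document attributes."""

    page_content: str                                        # the text itself
    metadata: Dict[str, Any] = field(default_factory=dict)   # arbitrary key/value data
    id: Optional[str] = None                                 # optional identifier


# Build one "page" by hand; the metadata keys imitate what a PDF loader records.
page = SimpleDocument(
    page_content="Enhancing Legal Document Summarization",
    metadata={"source": "example.pdf", "page": 0},
)

print(page.page_content[:9])
print(page.metadata["page"])
print(page.id)
```

Because `id` is optional, a document created without one reports `None`, which matches what the loader output in the next cell shows.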

In [2]:
print(docs[0].page_content[:100])
print(docs[0].metadata)
print(docs[0].id)


Enhancing Legal Document Summarization
for Professionals: An Extractive Approach
Manasvi Kalyan
SCSE
{'producer': 'pdfTeX-1.40.25', 'creator': 'TeX', 'creationdate': '2024-11-20T10:30:26+00:00', 'moddate': '2024-11-20T10:30:26+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'trapped': '/False', 'source': '/Users/manasvi/Downloads/IEEE_Conference_Template.pdf', 'total_pages': 8, 'page': 0, 'page_label': '1'}
None


### Splitting Documents
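For retrieval and question answering, a whole page is too coarse a unit: a page that contains one relevant sentence also carries a lot of unrelated text. Splitting the PDF into smaller chunks lets us retrieve more focused `Document` objects that match the user's query without irrelevant content diluting the results. The splitter in the next cell uses chunks of up to 1,000 characters with a 200-character overlap, so text near a chunk boundary appears in both neighboring chunks, and `add_start_index=True` records each chunk's character offset in its metadata as `start_index`.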
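
The interaction between chunk size and overlap can be sketched with a fixed-width sliding window. This is a simplification under stated assumptions: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraphs, lines, and spaces, whereas the hypothetical helper `chunk_with_overlap` below cuts at exact character positions purely to show how the two parameters relate.

```python
from typing import Dict, List, Union


def chunk_with_overlap(
    text: str, chunk_size: int, chunk_overlap: int
) -> List[Dict[str, Union[int, str]]]:
    """Cut text into fixed-width windows that share `chunk_overlap` characters.

    Illustrative helper only; not how LangChain chooses its split points.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[Dict[str, Union[int, str]]] = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Tiny example: 10 characters, windows of 4 that overlap by 1.
demo = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
for chunk in demo:
    print(chunk["start_index"], chunk["text"])
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last character of one chunk is repeated as the first character of the next. A larger overlap keeps more context across boundaries at the cost of more chunks to embed and store.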

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

def split_docs(docs):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200, add_start_index=True)
    all_splits = text_splitter.split_documents(docs)
    return all_splits

splits = split_docs(docs)
print(len(splits))


38


In [4]:
splits[0]

Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'TeX', 'creationdate': '2024-11-20T10:30:26+00:00', 'moddate': '2024-11-20T10:30:26+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'trapped': '/False', 'source': '/Users/manasvi/Downloads/IEEE_Conference_Template.pdf', 'total_pages': 8, 'page': 0, 'page_label': '1', 'start_index': 0}, page_content='Enhancing Legal Document Summarization\nfor Professionals: An Extractive Approach\nManasvi Kalyan\nSCSET\nBennett University\nGreater Noida, India\ne21cseu0405@bennett.edu.in\nPratik Dwivedi\nSCSET\nBennett University\nGreater Noida, India\ne21cseu0096@bennett.edu.in\nShallu Sharma\nSCSET\nBennett University\nGreater Noida, India\nshallu.sharma@bennett.edu.in\nAbstract—The increasing volume and complexity\nof legal documents pose significant challenges for\nlegal professionals who must extract relevant infor-\nmation efficiently. This paper addresses this issue\nb

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings

def create_embeddings(model_name: str):
    embeddings = HuggingFaceEmbeddings(
        model_name=model_name 
    )

    return embeddings

embeddings = create_embeddings("sentence-transformers/all-MiniLM-L6-v2")

def embed_splits(splits, embeddings):
    texts = [doc.page_content for doc in splits]
    return embeddings.embed_documents(texts)

# Embed every chunk: expect one vector per chunk, each with the model's dimensionality
vectors = embed_splits(splits, embeddings)
print(len(vectors), len(vectors[0]))


38 384


In [6]:
print(vectors[0])

[-0.05389542505145073, 0.1250315010547638, -0.04096772521734238, -0.08090446889400482, 0.048137787729501724, 0.050742652267217636, 0.021661721169948578, 0.03534843400120735, 0.06030082702636719, 0.08715370297431946, 0.006177870556712151, 0.020046375691890717, -0.02191319689154625, 0.010010316036641598, 0.038910094648599625, 0.06916803866624832, 0.07105594128370285, 0.03011358343064785, -0.0637766569852829, 0.01216666679829359, 0.021375849843025208, 0.10676101595163345, 0.0036133266985416412, 0.0021647640969604254, 0.021357765421271324, -0.012914304621517658, -0.02146732062101364, 0.02516406960785389, 0.03428533673286438, -0.026315419003367424, 0.04876181110739708, 0.09630507230758667, 0.0241411030292511, 0.13585732877254486, -0.03131893277168274, -0.020551344379782677, -0.06756353378295898, 0.09351270645856857, -0.01115174125880003, -0.006477913353592157, 0.03484858572483063, -0.03310497850179672, -0.0019096500473096967, -0.008334245532751083, 0.04569254815578461, -0.002186262514442205

In [7]:
from langchain_chroma import Chroma

def create_chroma_db(splits, embeddings):
    return Chroma.from_documents(
        documents=splits,
        embedding=embeddings,
        persist_directory="db"
    )

vector_store = create_chroma_db(splits, embeddings)

In [8]:
results = vector_store.similarity_search(
    "what is the title of the paper?"
)

print(results[0])

page_content='“Legal Document Summarization Using Nlp
and Ml Techniques”. In: International Jour-
nal of Engineering and Computer Science
9 (May 2020), pp. 25039–25046. DOI : 10 .
18535/ijecs/v9i05.4488.
[13] Nikita, Dipti P. Rana, and Rupa G. Mehta.
“Research Challenges for Legal Document
Summarization”. In: 2023 IEEE World Con-
ference on Applied Intelligence and Comput-
ing (AIC). 2023, pp. 307–312. DOI : 10.1109/
AIC57670.2023.10263906.
[14] Saloni Sharma, Surabhi Srivastava,
Pradeepika Verma, et al. “A Comprehensive
Analysis of Indian Legal Documents
Summarization Techniques”. In: SN
Computer Science 4 (Aug. 2023). DOI :
10.1007/s42979-023-01983-y.
[15] Saloni Sharma, Surabhi Srivastava,
Pradeepika Verma, et al. “A Comprehensive
Analysis of Indian Legal Documents
Summarization Techniques”. In: SN
Computer Science 4 (Aug. 2023). DOI :
10.1007/s42979-023-01983-y.
[16] Abhay Shukla, Paheli Bhattacharya, Soham
Poddar, et al. “Legal Case Document Sum-
marization: Extractive and Abstrac

In [9]:
results = vector_store.similarity_search_with_score("What is the name of the author")
doc, score = results[0]
print(f"Score: {score}\n")
print(doc)

Score: 1.3911134004592896

page_content='[5] Ilias Chalkidis, Manos Fergadiotis, Prodro-
mos Malakasiotis, et al. LEGAL-BERT: The
Muppets straight out of Law School . 2020.
arXiv: 2010.02559 [cs.CL].
[6] Utkarsh Dixit, Sonam Gupta, Arun Kumar
Yadav, et al. “Analyzing the Impact of Ex-
tractive Summarization Techniques on Legal
Text”. In: Jan. 2024, pp. 585–602. ISBN : 978-
981-99-6543-4. DOI : 10.1007/978- 981- 99-
6544-1 44.
[7] M. Gupta, N. Narayana, V . Charan, et al.
“Extractive Summarization of Indian Legal
Documents”. In: Apr. 2022, pp. 629–638.
ISBN : 978-981-19-0018-1. DOI : 10 . 1007 /
978-981-19-0019-8 47.
[8] Ben Hachey and Claire Grover. “Extractive
summarisation of legal texts”. In: Artificial
Intelligence and Law 14 (2006), pp. 305–
345. URL : https://api.semanticscholar.org/
CorpusID:10295764.
[9] Deepali Jain, Malaya Borah, and Anupam
Biswas. “Summarization of Indian Legal
Judgement Documents via Ensembling of
Contextual Embedding based MLP Models”.
In: Dec. 2021.' meta