### Data Ingestion

In [1]:
import os
os.makedirs("../data/text_files", exist_ok=True)

In [4]:
sample_texts={
    "../data/text_files/python_intro.txt":"""Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.

Python is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.

Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language. Python 3.0, released in 2008, was a major revision and not completely backward-compatible with earlier versions. Recent versions, such as Python 3.13, 3.12 and older (and 3.14), have added capabilities and keywords for typing, helping with (optional) static typing.[35] Currently only versions in the 3.x series are supported.

Python consistently ranks as one of the most popular programming languages, and it has gained widespread use in the machine learning community. It is widely taught as an introductory programming language.""",
"../data/text_files/machine_learning.txt":"""Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.

ML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.

Statistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA) via unsupervised learning.

From a theoretical viewpoint, probably approximately correct learning provides a mathematical and statistical framework for describing machine learning. Most traditional machine learning and deep learning algorithms can be described as empirical risk minimization under this framework."""
}

for filepath, content in sample_texts.items():
    with open(filepath,'w',encoding="utf-8") as f:
        f.write(content)

print("Sample text files created!")

Sample text files created!
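
Writing with an explicit `encoding="utf-8"` matters once the text contains non-ASCII characters, because the loaders below read the files back with the same encoding. A minimal round-trip sketch, assuming a temporary directory stands in for `../data/text_files`; the `demo.txt` name and its content are placeholders:

```python
import tempfile
from pathlib import Path

# Hypothetical stand-in for ../data/text_files so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "demo.txt"  # placeholder file name
    original = "Naïve café text with non-ASCII characters.\n\nSecond paragraph."

    # Write and read with the same explicit encoding, as in the cell above.
    target.write_text(original, encoding="utf-8")
    restored = target.read_text(encoding="utf-8")

# True when the text survives the write/read cycle unchanged.
round_trip_ok = restored == original
print(round_trip_ok)
```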


In [6]:
### Text Loader - reads a single text file into a list of Document objects, each holding the text (page_content) and its metadata
from langchain_community.document_loaders import TextLoader

loader=TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document=loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability with the use of significant indentation.\n\nPython is dynamically type-checked and garbage-collected. It supports multiple programming paradigms, including structured (particularly procedural), object-oriented and functional programming.\n\nGuido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language. Python 3.0, released in 2008, was a major revision and not completely backward-compatible with earlier versions. Recent versions, such as Python 3.13, 3.12 and older (and 3.14), have added capabilities and keywords for typing, helping with (optional) static typing.[35] Currently only versions in the 3.x series are supported.\n\nPython consistently ranks as one of the most popular programming languages, and it has gained widespread use in the machine le
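
The printed list holds one `Document` with two parts: `page_content` (the file's text) and `metadata` (here only the `source` path). The sketch below mimics that shape using only the standard library. `SimpleDocument` and `load_text` are hypothetical names introduced for illustration; this is not LangChain's implementation, only a picture of the structure it returns.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(path: str, encoding: str = "utf-8") -> list[SimpleDocument]:
    """Read one file and wrap it in a single-element list, like the output above."""
    text = Path(path).read_text(encoding=encoding)
    return [SimpleDocument(page_content=text, metadata={"source": path})]


with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = str(Path(tmp_dir) / "sample.txt")  # placeholder file
    Path(sample_path).write_text("First paragraph.\n\nSecond paragraph.", encoding="utf-8")
    docs = load_text(sample_path)

first = docs[0]
print(first.metadata["source"] == sample_path)  # metadata records where the text came from
print(len(first.page_content.split("\n\n")))    # number of blank-line-separated paragraphs
```

Keeping the source path next to the text is what later lets a retrieved passage be traced back to its file.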

In [10]:
### Directory Loader

from langchain_community.document_loaders import DirectoryLoader
dir_loader=DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",  # match every .txt file, including those in subdirectories
    loader_cls=TextLoader,
    loader_kwargs={'encoding':'utf-8'},
    show_progress=False
)

documents=dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning.txt'}, page_content='Machine learning (ML) is a field of study in artificial intelligence concerned with the development and study of statistical algorithms that can learn from data and generalise to unseen data, and thus perform tasks without explicit instructions. Within a subdiscipline in machine learning, advances in the field of deep learning have allowed neural networks, a class of statistical algorithms, to surpass many previous machine learning approaches in performance.\n\nML finds application in many fields, including natural language processing, computer vision, speech recognition, email filtering, agriculture, and medicine. The application of ML to business problems is known as predictive analytics.\n\nStatistics and mathematical optimisation (mathematical programming) methods comprise the foundations of machine learning. Data mining is a related field of study, focusing on exploratory data analysis (EDA)
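
The `glob="**/*.txt"` pattern selects `.txt` files in the directory and in every subdirectory beneath it. A standard-library sketch of that matching, assuming a throwaway directory tree with placeholder file names. `Path.glob` is used here only to illustrate the pattern; no claim is made about how `DirectoryLoader` resolves it internally.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    (root / "nested").mkdir()

    # Placeholder files: two that should match, one that should not.
    (root / "a.txt").write_text("top-level text", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested text", encoding="utf-8")
    (root / "notes.md").write_text("not matched", encoding="utf-8")

    # "**" matches zero or more directory levels, so both .txt files are found.
    # as_posix() normalises separators, which differ between Windows and POSIX.
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)
```

The separator normalisation is relevant to the output above, where the `source` paths are printed with Windows-style backslashes.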

In [4]:
### PDF Loader - load every PDF under ../data/pdf with PyMuPDFLoader (one Document per page)
from langchain_community.document_loaders import DirectoryLoader, PyMuPDFLoader

dir_loader=DirectoryLoader(
    "../data/pdf",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
    show_progress=False
)

pdf_doc=dir_loader.load()
pdf_doc

[Document(metadata={'producer': 'Skia/PDF m140', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36', 'creationdate': '2025-09-27T17:21:52+00:00', 'source': '..\\data\\pdf\\GATE.pdf', 'file_path': '..\\data\\pdf\\GATE.pdf', 'total_pages': 2, 'format': 'PDF 1.4', 'title': 'GATE', 'author': '', 'subject': '', 'keywords': '', 'moddate': '2025-09-27T17:21:52+00:00', 'trapped': '', 'modDate': "D:20250927172152+00'00'", 'creationDate': "D:20250927172152+00'00'", 'page': 0}, page_content='Payment Status\n\xa0Payment Received\nSuccessfully\nTransaction Number:\xa0113956425763\nBank ref No:\xa0346049215535\nAmount:\xa0\n2000\nName:\xa0PIYUSH JAIN\nEnrollment Id:\xa0G214U61\nFee Type:\xa0Application Fee\nDigital FingerPrint: b6aca11627d9350ccd43aa7123dcd16c\n9/27/25, 10:51 PM\nGATE\nhttps://goaps.iitg.ac.in/postPayment\n1/2'),
 Document(metadata={'producer': 'Skia/PDF m140', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; 
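
Each PDF page comes back as its own entry, with `source`, `page`, and `total_pages` among the metadata keys. Assuming records shaped like that output, pages can be regrouped into one string per file. The dicts, file names, and text below are illustrative placeholders, not real loader output:

```python
from collections import defaultdict

# Placeholder records shaped like the metadata/page_content pairs above.
pages = [
    {"metadata": {"source": "a.pdf", "page": 1, "total_pages": 2}, "page_content": "second page"},
    {"metadata": {"source": "a.pdf", "page": 0, "total_pages": 2}, "page_content": "first page"},
    {"metadata": {"source": "b.pdf", "page": 0, "total_pages": 1}, "page_content": "only page"},
]

# Group page records by the file they came from.
by_source = defaultdict(list)
for record in pages:
    by_source[record["metadata"]["source"]].append(record)

# Sort each group by page index, then join the page texts with newlines.
full_text = {
    source: "\n".join(
        r["page_content"] for r in sorted(records, key=lambda r: r["metadata"]["page"])
    )
    for source, records in by_source.items()
}
print(full_text)
```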