## Data Ingestion Pipeline

In [1]:
from langchain_core.documents import Document

In [5]:
doc = Document(
    page_content="this is the main text I am using to create RAG",
    metadata={
        "source": "example.txt",
        "pages": 2,
        "author": "Jasmine Kaur",
        "date_created": "2025-01-16"
    }
)
doc

Document(metadata={'source': 'example.txt', 'pages': 2, 'author': 'Jasmine Kaur', 'date_created': '2025-01-16'}, page_content='this is the main text I am using to create RAG')
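
The `Document` above pairs a block of text with a free-form metadata dictionary. As a rough mental model only, that structure can be sketched with a standard-library dataclass. This is a simplified stand-in, not LangChain's actual class; `SimpleDocument` and `filter_by_metadata` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus free-form metadata."""
    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs, key, value):
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument(
        "this is the main text I am using to create RAG",
        {"source": "example.txt", "pages": 2},
    ),
    # Second record is made up purely to give the filter something to reject.
    SimpleDocument("another text", {"source": "other.txt", "pages": 1}),
]
matches = filter_by_metadata(docs, "source", "example.txt")
print(len(matches), matches[0].metadata["pages"])
```

Keeping metadata next to the text is what later allows retrieved chunks to be filtered or traced back to their source file.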

In [6]:
# create the directory that will hold the sample text files

import os
os.makedirs("../data/text_files", exist_ok=True)

In [7]:
file_path = "../data/text_files/machine_learning_intro.txt"

content = """Introduction to Machine Learning
Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.
Types of Machine Learning:
1. Supervised Learning
   - Uses labeled training data
   - Examples: Classification, Regression
   - Common algorithms: Linear Regression, Decision Trees, Neural Networks
2. Unsupervised Learning
   - Works with unlabeled data
   - Examples: Clustering, Dimensionality Reduction
   - Common algorithms: K-Means, PCA, Hierarchical Clustering
3. Reinforcement Learning
   - Learns through trial and error
   - Agent receives rewards/penalties
   - Examples: Game playing, Robotics
Key Concepts:
- Training Data: Dataset used to train the model
- Features: Input variables used for prediction
- Labels: Output or target variable
- Model: Mathematical representation learned from data
- Overfitting: Model performs well on training data but poorly on new data
- Underfitting: Model is too simple to capture patterns
Popular Libraries:
- scikit-learn: Traditional ML algorithms
- TensorFlow: Deep learning framework
- PyTorch: Deep learning with dynamic computation
- XGBoost: Gradient boosting for structured data
Applications:
- Image recognition
- Natural language processing
- Recommendation systems
- Fraud detection
- Autonomous vehicles
"""

with open(file_path, 'w', encoding='utf-8') as file:
    file.write(content)

print("file created successfully")

file created successfully


In [11]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/sample_document.txt", encoding='utf-8')
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/sample_document.txt'}, page_content='Introduction to Retrieval-Augmented Generation (RAG)\n\nRetrieval-Augmented Generation, commonly known as RAG, is a powerful technique in natural language processing that combines the strengths of information retrieval and text generation. This hybrid approach enables AI systems to provide more accurate, up-to-date, and contextually relevant responses.\n\nHow RAG Works\n\nRAG systems operate in two main phases:\n\n1. Retrieval Phase: When a user submits a query, the system searches through a knowledge base or document collection to find relevant information. This is typically done using vector similarity search, where documents are converted into embeddings and compared against the query embedding.\n\n2. Generation Phase: The retrieved documents are then provided as context to a large language model (LLM), which generates a response based on both the query and the retrieved information.\n\nBenefits o

In [13]:
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    "../data/text_files",
    glob="**/*.txt",
    loader_cls=TextLoader,
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=False
)

documents = dir_loader.load()
documents

[Document(metadata={'source': '..\\data\\text_files\\machine_learning_intro.txt'}, page_content='Introduction to Machine Learning\nMachine Learning (ML) is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed.\nTypes of Machine Learning:\n1. Supervised Learning\n   - Uses labeled training data\n   - Examples: Classification, Regression\n   - Common algorithms: Linear Regression, Decision Trees, Neural Networks\n2. Unsupervised Learning\n   - Works with unlabeled data\n   - Examples: Clustering, Dimensionality Reduction\n   - Common algorithms: K-Means, PCA, Hierarchical Clustering\n3. Reinforcement Learning\n   - Learns through trial and error\n   - Agent receives rewards/penalties\n   - Examples: Game playing, Robotics\nKey Concepts:\n- Training Data: Dataset used to train the model\n- Features: Input variables used for prediction\n- Labels: Output or target variable\n- Model: Mathematical representation lear
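
For intuition about what a recursive `**/*.txt` load amounts to, the sketch below walks a directory with the standard library and builds one record per matching file. It is an approximation, not how `DirectoryLoader` is implemented; `load_text_files` is a hypothetical helper, and it returns plain dictionaries rather than `Document` objects.

```python
from pathlib import Path
import tempfile


def load_text_files(root, pattern="**/*.txt", encoding="utf-8"):
    """Read every file under root that matches pattern into a list of records."""
    records = []
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            records.append({
                "page_content": path.read_text(encoding=encoding),
                # as_posix() yields forward slashes on every platform
                "metadata": {"source": path.as_posix()},
            })
    return records


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "nested").mkdir()
    (Path(tmp) / "a.txt").write_text("alpha", encoding="utf-8")
    (Path(tmp) / "nested" / "b.txt").write_text("beta", encoding="utf-8")
    (Path(tmp) / "skip.md").write_text("ignored", encoding="utf-8")
    records = load_text_files(tmp)

print([r["page_content"] for r in records])
```

The `source` values in the output above use Windows backslashes; storing `as_posix()` paths, as in this sketch, is one way to keep metadata comparable across operating systems.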

In [22]:
from langchain_community.document_loaders import PyMuPDFLoader

# PyMuPDFLoader reads each PDF found under the directory
dir_loader = DirectoryLoader(
    "../data/pdf_files",
    glob="**/*.pdf",
    loader_cls=PyMuPDFLoader,
    show_progress=False
)
pdf_doc = dir_loader.load()
pdf_doc

[Document(metadata={'producer': 'Adobe PDF Library 25.1.5', 'creator': 'Acrobat PDFMaker 25 for Word', 'creationdate': '2026-01-15T11:17:28-05:00', 'source': '..\\data\\pdf_files\\Jasmine-Resume.pdf', 'file_path': '..\\data\\pdf_files\\Jasmine-Resume.pdf', 'total_pages': 1, 'format': 'PDF 1.6', 'title': '', 'author': 'JakesResume', 'subject': '', 'keywords': '', 'moddate': '2026-01-15T11:17:35-05:00', 'trapped': '', 'modDate': "D:20260115111735-05'00'", 'creationDate': "D:20260115111728-05'00'", 'page': 0}, page_content="Jasmine Kaur \n+1 5195338833| jasminkaur5858@gmail.com| LinkedIn| Website | GitHub \nEDUCATION \nSoftware Engineering Technology \nApril 2026 \nConestoga College  \nWaterloo, Ontario \n• Recipient of Regional Scholarship Award for scoring highest academic average. \nWORK EXPERIENCE \nAutomation Developer \nMay 2025 – Sept 2025 \nCo-operators  \nKitchener, Ontario \n• Developed and deployed robust automation solution leveraging C# and VB.Net to streamline a core claims 
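
The metadata above carries `page` and `total_pages` fields, which suggests the loader emits one record per PDF page. Under that assumption, a quick sanity check after loading is to count records per source file. The sketch below uses plain dictionaries with made-up values as stand-ins for the loaded documents; `pages_per_source` is a hypothetical helper.

```python
from collections import Counter


def pages_per_source(records):
    """Count how many page-level records each source file produced."""
    return Counter(r["metadata"]["source"] for r in records)


# Stand-in records shaped like the metadata shown above (values are made up).
records = [
    {"page_content": "page one", "metadata": {"source": "a.pdf", "page": 0}},
    {"page_content": "page two", "metadata": {"source": "a.pdf", "page": 1}},
    {"page_content": "only page", "metadata": {"source": "b.pdf", "page": 0}},
]
counts = pages_per_source(records)
print(dict(counts))
```

With real loader output, the same count could be compared against each file's `total_pages` value to spot pages that failed to load.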