Data Ingestion

In [None]:
### Document Structuring and Ingestion

from langchain_core.documents import Document



In [3]:
doc = Document(
    page_content="This is the content of the document.",
    metadata={"source": "user_upload", 
              "page_number": 1,
              "author": "John Doe",
              "ingestion_date": "2024-06-15"}
)

In [4]:
doc

Document(metadata={'source': 'user_upload', 'page_number': 1, 'author': 'John Doe', 'ingestion_date': '2024-06-15'}, page_content='This is the content of the document.')
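
The cell above shows the two fields a document carries: `page_content` and `metadata`. The following is a minimal sketch of reading them back. So that it runs without LangChain installed, it uses `SimpleDocument`, a hypothetical stand-in dataclass with the same two attribute names; it is not the real `Document` class. The helper name `describe` is also an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in with the same two attribute names as Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def describe(d):
    """Return a one-line summary built from the metadata and the text length."""
    # .get() avoids a KeyError when a record has no "source" entry.
    source = d.metadata.get("source", "unknown")
    return f"{source}: {len(d.page_content)} characters"


doc_sketch = SimpleDocument(
    page_content="This is the content of the document.",
    metadata={"source": "user_upload", "page_number": 1},
)

summary = describe(doc_sketch)
print(summary)
```

Because `describe` only touches the `page_content` and `metadata` attributes, it should work the same way on the `doc` object created above.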

In [5]:
## Create a simple text file

import os
os.makedirs("../data/text_files", exist_ok=True)

In [10]:
sample_texts = {
    "../data/text_files/python_intro.txt": """Tell me everything about Python programming.
    Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely used in web development, data analysis, artificial intelligence, scientific computing, automation, and more.

Python is popular because of its clean syntax, ease of learning, and extensive ecosystem of libraries and frameworks. Some of its key features include readability, interpreted execution (which makes debugging easier), dynamic typing (no need to declare variable types explicitly), cross-platform support, and strong community support.

For data science and machine learning, Python offers powerful libraries like NumPy for numerical computations, Pandas for data manipulation, Matplotlib and Seaborn for visualization, Scikit-learn for machine learning algorithms, and TensorFlow and PyTorch for deep learning.

Python is also used for web development through frameworks like Django and Flask, which simplify backend development and integration with databases and frontend technologies. Error handling in Python is managed using exceptions, typically with try and except blocks to catch and respond to runtime errors.
""",

"../data/text_files/machine_learning.txt": """Explain the basics of Machine Learning.
Machine Learning (ML) is a subset of artificial intelligence (AI) that focuses on enabling computers to learn patterns from data and make predictions or decisions without being explicitly programmed. Instead of following fixed instructions, ML algorithms improve their performance over time as they are exposed to more data. Machine Learning can be broadly categorized into supervised learning, where the model is trained on labeled data to make predictions; unsupervised learning, where the model identifies hidden patterns or groupings in unlabeled data; and reinforcement learning, where the model learns optimal actions through trial and error to maximize rewards.

Key features of Machine Learning include the ability to handle large datasets, adaptability to new information, and the capacity to uncover complex relationships that are difficult for humans to identify manually. Popular algorithms include linear and logistic regression, decision trees, random forests, support vector machines, k-means clustering, and neural networks. For deep learning, which is a subfield of ML, algorithms such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are widely used for tasks like image recognition, natural language processing, and speech recognition.

Machine Learning is applied in diverse domains such as healthcare (for disease diagnosis and drug discovery), finance (for fraud detection and risk assessment), marketing (for customer segmentation and recommendation systems), autonomous vehicles, robotics, and cybersecurity. Common tools and libraries used in ML include Python along with Scikit-learn, TensorFlow, PyTorch, Keras, Pandas, and NumPy.

Developing ML models typically involves data preprocessing, feature engineering, selecting appropriate algorithms, training the model, evaluating its performance using metrics like accuracy, precision, recall, or F1-score, and iteratively tuning the model for better results. Overall, Machine Learning empowers systems to learn from data, make informed decisions, and continuously improve, making it a cornerstone of modern AI applications.
"""
}

for file_path, content in sample_texts.items():
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(content)
print("Sample text files created.")


Sample text files created.
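
One way to sanity-check the write step is to read each file back with the same encoding and compare it with what was written. This is a minimal sketch under stated assumptions: `write_and_verify` is a hypothetical helper, the file names and contents are made up, and a temporary directory is used so nothing under `../data` is touched.

```python
import tempfile
from pathlib import Path


def write_and_verify(texts, base_dir):
    """Write each text as UTF-8 under base_dir, then confirm it reads back unchanged."""
    results = {}
    for name, content in texts.items():
        path = Path(base_dir) / name
        path.write_text(content, encoding="utf-8")
        # Reading with the same encoding should reproduce the original string.
        results[name] = path.read_text(encoding="utf-8") == content
    return results


# Hypothetical sample data; the temporary directory is removed on exit.
with tempfile.TemporaryDirectory() as tmp:
    check = write_and_verify({"a.txt": "café\n", "b.txt": "plain text\n"}, tmp)

print(check)
```

Passing the encoding explicitly on both the write and the read keeps the result independent of the platform default.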


In [13]:
### Text Loader
from langchain_community.document_loaders import TextLoader

loader = TextLoader("../data/text_files/python_intro.txt", encoding="utf-8")
document = loader.load()
print(document)

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Tell me everything about Python programming.\n    Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely used in web development, data analysis, artificial intelligence, scientific computing, automation, and more.\n\nPython is popular because of its clean syntax, ease of learning, and extensive ecosystem of libraries and frameworks. Some of its key features include readability, interpreted execution (which makes debugging easier), dynamic typing (no need to declare variable types explicitly), cross-platform support, and strong community support.\n\nFor data science and machine learning, Python offers powerful libraries like NumPy for numerical computations, Pandas for data manipulation, Matplotlib and Seaborn for visuali

In [29]:
### Directory Loader

from langchain_community.document_loaders import DirectoryLoader

directory_loader = DirectoryLoader(
    "../data/text_files", 
    glob="*.txt", 
    loader_cls=TextLoader,
    loader_kwargs={"encoding": "utf-8"},
    show_progress=False
    )

documents = directory_loader.load()
documents

[Document(metadata={'source': '../data/text_files/python_intro.txt'}, page_content='Tell me everything about Python programming.\n    Python is a high-level, interpreted programming language known for its simplicity, readability, and versatility. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming. Python is widely used in web development, data analysis, artificial intelligence, scientific computing, automation, and more.\n\nPython is popular because of its clean syntax, ease of learning, and extensive ecosystem of libraries and frameworks. Some of its key features include readability, interpreted execution (which makes debugging easier), dynamic typing (no need to declare variable types explicitly), cross-platform support, and strong community support.\n\nFor data science and machine learning, Python offers powerful libraries like NumPy for numerical computations, Pandas for data manipulation, Matplotlib and Seaborn for visuali
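
The output above pairs each file's text with a `source` path in its metadata. The sketch below reproduces that pattern with the standard library only, to make the idea concrete: match files with a glob, read each one, and record where it came from. It is not how `DirectoryLoader` is implemented. `load_text_dir` is a hypothetical helper, plain dicts stand in for `Document` objects, and the file names are invented.

```python
import tempfile
from pathlib import Path


def load_text_dir(directory, pattern="*.txt", encoding="utf-8"):
    """Read every file matching pattern and return text plus a source path for each."""
    records = []
    # sorted() gives a stable order; glob order is otherwise not guaranteed.
    for path in sorted(Path(directory).glob(pattern)):
        records.append(
            {
                "page_content": path.read_text(encoding=encoding),
                "metadata": {"source": str(path)},
            }
        )
    return records


# Hypothetical files in a temporary directory; the .md file should be skipped.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "one.txt").write_text("first", encoding="utf-8")
    Path(tmp, "two.txt").write_text("second", encoding="utf-8")
    Path(tmp, "skip.md").write_text("ignored", encoding="utf-8")
    loaded = load_text_dir(tmp)

loaded_names = [Path(r["metadata"]["source"]).name for r in loaded]
print(loaded_names)
```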

In [28]:
from langchain_community.document_loaders import PyMuPDFLoader



dir_loader = DirectoryLoader(
    "../data/pdf_files", 
    glob="*.pdf", 
    loader_cls=PyMuPDFLoader,
    loader_kwargs={},
    show_progress=False
    )
pdf_documents = dir_loader.load()
pdf_documents

[Document(metadata={'producer': 'pdfTeX-1.40.20; modified using iText® 5.5.6 ©2000-2015 iText Group NV (AGPL-version)', 'creator': 'Appligent AppendPDF Pro 5.5', 'creationdate': '2020-12-14T14:46:25-05:00', 'source': '../data/pdf_files/09.A Review on Deep Learning Techniques for  Video Prediction.pdf', 'file_path': '../data/pdf_files/09.A Review on Deep Learning Techniques for  Video Prediction.pdf', 'total_pages': 20, 'format': 'PDF 1.5', 'title': 'A Review on Deep Learning Techniques for Video Prediction', 'author': '', 'subject': 'IEEE Transactions on Pattern Analysis and Machine Intelligence; ;PP;99;10.1109/TPAMI.2020.3045007', 'keywords': '', 'moddate': '2021-05-28T11:50:31-04:00', 'trapped': '', 'modDate': "D:20210528115031-04'00'", 'creationDate': "D:20201214144625-05'00'", 'page': 0}, page_content='0162-8828 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/inde
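
The metadata above includes `source`, `page`, and `total_pages` keys, which suggests the loader returns one record per page. Assuming that layout, the sketch below groups page numbers by source file, which is a quick way to see how many pages were picked up from each PDF. `pages_by_source` is a hypothetical helper, and the records are made-up plain dicts standing in for the loaded documents.

```python
from collections import defaultdict


def pages_by_source(records):
    """Map each source path to the sorted list of page numbers seen for it."""
    grouped = defaultdict(list)
    for rec in records:
        meta = rec["metadata"]
        grouped[meta["source"]].append(meta["page"])
    return {src: sorted(pages) for src, pages in grouped.items()}


# Hypothetical per-page records; real ones would come from the loader output.
sample_records = [
    {"page_content": "...", "metadata": {"source": "a.pdf", "page": 1}},
    {"page_content": "...", "metadata": {"source": "a.pdf", "page": 0}},
    {"page_content": "...", "metadata": {"source": "b.pdf", "page": 0}},
]

index = pages_by_source(sample_records)
print(index)
```

With real documents, the same grouping could be applied by reading `d.metadata` from each item instead of `rec["metadata"]`.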