Data Ingestion

In [2]:
import os
import pandas as pd
from typing import List, Dict, Any
from langchain_core.documents import Document 
from langchain_community.document_loaders import TextLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter

### Ingesting and Parsing Text Data Using Document Loaders

Document Structure 

In [3]:
# Create a basic document
doc = Document(
    page_content='Welcome to Generative AI and FullStack Development. You are welcome',
    metadata = {
        'source': 'rag_example.txt',
        'page': 1,
        'author':'Saheed Oladiipo',
        'date_created': '2025-02-12',
        'custom_field': 'any_value'
    }
)

In [4]:
print('Document Structure')
print(f'Content page: {doc.page_content}')
print(f'Content Metadata: {doc.metadata}')

Document Structure
Content page: Welcome to Generative AI and FullStack Development. You are welcome
Content Metadata: {'source': 'rag_example.txt', 'page': 1, 'author': 'Saheed Oladiipo', 'date_created': '2025-02-12', 'custom_field': 'any_value'}


In [5]:
type(doc)

langchain_core.documents.base.Document
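
As a rough mental model, a document here is just text plus a metadata dictionary. The sketch below uses a hypothetical `SimpleDocument` dataclass and a hypothetical `filter_by_metadata` helper (neither is part of LangChain) to illustrate why the metadata is worth filling in: it lets you select records by source, page, or author before any retrieval step.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document record: text plus metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(docs: List[SimpleDocument], key: str, value: Any) -> List[SimpleDocument]:
    """Return the documents whose metadata[key] equals value."""
    return [d for d in docs if d.metadata.get(key) == value]


docs = [
    SimpleDocument("Intro to RAG", {"source": "rag_example.txt", "page": 1}),
    SimpleDocument("Vector stores", {"source": "other.txt", "page": 1}),
]

# Keep only the records that came from rag_example.txt
matches = filter_by_metadata(docs, "source", "rag_example.txt")
print([d.page_content for d in matches])
```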

Read Text Files in .txt

In [6]:
# Create the directory that will hold the sample text files
os.makedirs('data/text_files', exist_ok=True)

In [7]:
texts = {
    'data/text_files/python.txt': """Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries makes it incredibly powerful for building applications like RAG systems, image generators, and more.

Here is a breakdown of the essential Python libraries, concepts, and resources for Generative AI, moving from foundational to advanced.

The Core Python Stack for Generative AI
Your toolkit will primarily consist of these libraries:

Core Numerical & Deep Learning Framework:

PyTorch (torch): Developed by Meta, it's the most popular framework for cutting-edge AI research, including most new LLMs. Its dynamic computation graph (eager execution) is very intuitive for debugging and experimentation.

TensorFlow (tensorflow): Developed by Google, it's a very powerful and production-hardened framework. While its market share in research has been overtaken by PyTorch, it's still widely used, especially with its high-level API Keras.

JAX: Gaining rapid traction in research for its incredible speed and composable transformations (gradients, jit, vmap). It's used by Google DeepMind for models like Gemini.

Recommendation for Beginners: Start with PyTorch. Most tutorials, research code, and pre-trained models are released in PyTorch first.

The Hugging Face Ecosystem (transformers, datasets, accelerate): This is non-negotiable for working with LLMs. It provides a unified and incredibly simple API to thousands of pre-trained models.

transformers: The main library for downloading and using pre-trained models (for text, vision, audio) for tasks like text generation, summarization, and classification.

datasets: Provides easy access to thousands of datasets for training and evaluation.

accelerate: Makes it trivial to scale your training or inference from a CPU to a single GPU to multiple GPUs without changing your code.

peft: Library for Parameter-Efficient Fine-Tuning (e.g., LoRA), essential for fine-tuning large models on consumer hardware.

evaluate: For standard evaluation metrics.

Vector Databases & Similarity Search: Critical for the "Retrieval" in RAG.

ChromaDB: Very popular, open-source, and easy-to-use. Great for getting started.

FAISS (by Meta): A library for efficient similarity search and clustering of dense vectors. It's often used as an in-memory index.

Pinecone, Weaviate, Qdrant: Managed/self-hosted vector databases that are production-ready and offer more features like persistence, hybrid search, and scalability.

Web Frameworks for Building APIs & Applications:

FastAPI: The modern standard for building high-performance APIs to serve your AI models. It automatically generates OpenAPI documentation and is asynchronous.

Streamlit: The fastest way to turn your Python scripts into interactive web apps. Perfect for building prototypes, demos, and internal tools in minutes.

Gradio: Similar to Streamlit, great for quickly creating a UI to demo your models, often used with Hugging Face Spaces.

Essential Utilities:

numpy: The fundamental package for numerical computation in Python.

pandas: For data manipulation and analysis on your training data or outputs.

tqdm: For adding progress bars to your loops, which is incredibly useful for tracking long-running training/inference tasks.

python-dotenv: For managing environment variables (e.g., API keys for OpenAI, Pinecone) securely."""
}

In [8]:
for filepath, content in texts.items():
    with open(filepath, 'w', encoding='utf-8') as f:
        f.write(content)
print('Text file successfully created')

Text file successfully created


Read Single File Using TextLoader

In [9]:
# Load single text file
loader = TextLoader('data/text_files/python.txt', encoding='utf-8')
loader

<langchain_community.document_loaders.text.TextLoader at 0x18074597b60>

In [10]:
documents = loader.load()
print(type(documents))
print(documents)

<class 'list'>
[Document(metadata={'source': 'data/text_files/python.txt'}, page_content='Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries makes it incredibly powerful for building applications like RAG systems, image generators, and more.\n\nHere is a breakdown of the essential Python libraries, concepts, and resources for Generative AI, moving from foundational to advanced.\n\nThe Core Python Stack for Generative AI\nYour toolkit will primarily consist of these libraries:\n\nCore Numerical & Deep Learning Framework:\n\nPyTorch (torch): Developed by Meta, it\'s the most popular framework for cutting-edge AI research, including most new LLMs. Its dynamic computation graph (eager execution) is very intuitive for debugging and experimentation.\n\nTensorFlow (tensorflow): Developed by Google, it\'s a very powerful and production-hardened framework. While its market share in research has been overtaken by PyTorch, it\'s still widely used, 

In [11]:
print(f'Loaded {len(documents)} document\n')
print(f'Content preview: {documents[0].page_content[:70]}....\n')
print(f'Metadata: {documents[0].metadata}')

Loaded 1 document

Content preview: Of course. Python is the undisputed lingua franca for Generative AI, a....

Metadata: {'source': 'data/text_files/python.txt'}
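
The output above suggests what a plain-text loader boils down to: read the whole file into one record and remember where it came from. The following is a minimal standard-library sketch of that idea, not the TextLoader implementation; `load_text_file` and the dictionary layout are assumptions made for illustration.

```python
import os
import tempfile
from typing import Any, Dict, List


def load_text_file(path: str, encoding: str = "utf-8") -> List[Dict[str, Any]]:
    """Hypothetical helper: read a whole file into one record tagged with its source path."""
    with open(path, "r", encoding=encoding) as f:
        content = f.read()
    # One file becomes one record, returned inside a list
    return [{"page_content": content, "metadata": {"source": path}}]


# Write a throwaway file so the example does not depend on the notebook's data folder
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write("Hello, RAG.")
    records = load_text_file(sample_path)

print(len(records), records[0]["page_content"])
```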


Extracting/Loading Multiple Text Files Using DirectoryLoader

In [12]:
# Load all text files in the directory
loader_dir = DirectoryLoader(
    'data/text_files',
    glob='**/*.txt',  # Pattern used to match files (recursive)
    loader_cls=TextLoader,  # Loader class applied to each matched file
    loader_kwargs={'encoding': 'utf-8'},
    show_progress=True
)

documents = loader_dir.load()

print(f'Loaded {len(documents)} documents')

for i, doc in enumerate(documents):
    print(f'\nDocument: {i+1}')
    print(f" Source: {doc.metadata['source']}")
    print(f' Length: {len(doc.page_content)} characters')

  0%|          | 0/1 [00:00<?, ?it/s]

100%|██████████| 1/1 [00:00<00:00, 510.07it/s]

Loaded 1 documents

Document: 1
 Source: data\text_files\python.txt
 Length: 3342 characters
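
To see which files a pattern such as `**/*.txt` selects, it can help to try it with `pathlib` directly. The sketch below assumes the pattern follows pathlib-style glob semantics, where `**` matches the directory itself and any nested subdirectories; the file names are made up for the example.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    (root / "a.txt").write_text("top level", encoding="utf-8")
    (root / "nested" / "b.txt").write_text("nested file", encoding="utf-8")
    (root / "notes.md").write_text("ignored", encoding="utf-8")

    # '**/*.txt' picks up .txt files at any depth and skips other extensions
    matched = sorted(p.relative_to(root).as_posix() for p in root.glob("**/*.txt"))

print(matched)
```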





Text Splitting

In [13]:
#1. Using character text splitter
text = documents[0].page_content
text

'Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries makes it incredibly powerful for building applications like RAG systems, image generators, and more.\n\nHere is a breakdown of the essential Python libraries, concepts, and resources for Generative AI, moving from foundational to advanced.\n\nThe Core Python Stack for Generative AI\nYour toolkit will primarily consist of these libraries:\n\nCore Numerical & Deep Learning Framework:\n\nPyTorch (torch): Developed by Meta, it\'s the most popular framework for cutting-edge AI research, including most new LLMs. Its dynamic computation graph (eager execution) is very intuitive for debugging and experimentation.\n\nTensorFlow (tensorflow): Developed by Google, it\'s a very powerful and production-hardened framework. While its market share in research has been overtaken by PyTorch, it\'s still widely used, especially with its high-level API Keras.\n\nJAX: Gaining rapid traction in research for 

In [14]:
print('Character Text Splitter')
char_splitter = CharacterTextSplitter(
    separator='\n',  # Split the text on single newlines
    chunk_size=500,  # Maximum chunk size, in characters
    chunk_overlap=20,  # Number of characters shared between adjacent chunks
    length_function=len  # Function used to measure chunk size
)

char_chunks = char_splitter.split_text(text)

print(f'Created {len(char_chunks)} chunks')
print(f'\nFirst chunk: {char_chunks[0][:100]}....\n')

Character Text Splitter
Created 8 chunks

First chunk: Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries ....



In [15]:
print(char_chunks[0],'\n')
print(char_chunks[1], '\n')
print(char_chunks[2], '\n')
print(char_chunks[3], '\n')

Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries makes it incredibly powerful for building applications like RAG systems, image generators, and more.
Here is a breakdown of the essential Python libraries, concepts, and resources for Generative AI, moving from foundational to advanced.
The Core Python Stack for Generative AI
Your toolkit will primarily consist of these libraries:
Core Numerical & Deep Learning Framework: 

PyTorch (torch): Developed by Meta, it's the most popular framework for cutting-edge AI research, including most new LLMs. Its dynamic computation graph (eager execution) is very intuitive for debugging and experimentation.
TensorFlow (tensorflow): Developed by Google, it's a very powerful and production-hardened framework. While its market share in research has been overtaken by PyTorch, it's still widely used, especially with its high-level API Keras. 

JAX: Gaining rapid traction in research for its incredible spee
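
The chunks above are built by splitting on the separator and then packing the pieces back together until the size limit is reached, which is also why the blank lines between paragraphs disappear. Below is a simplified sketch of that greedy packing. It is an assumption-laden illustration rather than the CharacterTextSplitter code: `split_on_separator` is a hypothetical helper, it ignores overlap, and it keeps an oversized piece whole.

```python
from typing import List


def split_on_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Greedy sketch: split on separator, then pack pieces into chunks of at most chunk_size.

    Simplifications (assumptions): no overlap, and a single piece longer than
    chunk_size is kept whole instead of being split further.
    """
    # Empty pieces come from consecutive separators (for example blank lines)
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\ngamma\n\ndelta epsilon zeta\neta"
chunks = split_on_separator(sample, separator="\n", chunk_size=16)
for chunk in chunks:
    print(repr(chunk))
```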

#2. Using recursive character text splitter

In [16]:
print('\n Recursive Character Text Splitter')
recursive_splitter = RecursiveCharacterTextSplitter(
    separators=['\n\n', '\n', ' ', ''],
    chunk_size=200,
    chunk_overlap=20,
    length_function=len
)


 Recursive Character Text Splitter


In [None]:
recursive_chunks = recursive_splitter.split_text(text)
print(f'Created: {len(recursive_chunks)} chunks\n')
print(f'First chunk: {recursive_chunks[0][:100]}.....')

Created: 24 chunks

First chunk: Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries .....


In [24]:
print(recursive_chunks[0])
print('*' * 40)
print(recursive_chunks[1])
print('*' * 40)
print(recursive_chunks[2])
print('*' * 40)
print(recursive_chunks[3])

Of course. Python is the undisputed lingua franca for Generative AI, and its ecosystem of libraries makes it incredibly powerful for building applications like RAG systems, image generators, and more.
****************************************
Here is a breakdown of the essential Python libraries, concepts, and resources for Generative AI, moving from foundational to advanced.
****************************************
The Core Python Stack for Generative AI
Your toolkit will primarily consist of these libraries:

Core Numerical & Deep Learning Framework:
****************************************
PyTorch (torch): Developed by Meta, it's the most popular framework for cutting-edge AI research, including most new LLMs. Its dynamic computation graph (eager execution) is very intuitive for
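
The idea behind the `separators` list is to try the coarsest separator first (paragraph breaks) and fall back to finer ones (lines, then words, then raw characters) only for pieces that are still too long. The sketch below is a simplified take on that strategy, not the RecursiveCharacterTextSplitter implementation: `recursive_split` is a hypothetical function and it leaves out chunk overlap.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Simplified sketch of recursive splitting (assumption: no chunk overlap)."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for index, sep in enumerate(separators):
        if sep == "":
            break  # "" means "cut anywhere"; handled by the fallback below
        if sep in text:
            chunks: List[str] = []
            current = ""
            for piece in text.split(sep):
                candidate = piece if not current else current + sep + piece
                if len(candidate) <= chunk_size:
                    current = candidate  # still fits, keep packing
                    continue
                if current:
                    chunks.append(current)
                    current = ""
                if len(piece) <= chunk_size:
                    current = piece
                else:
                    # Piece is too long on its own: retry with finer separators
                    chunks.extend(recursive_split(piece, separators[index + 1:], chunk_size))
            if current:
                chunks.append(current)
            return chunks
    # No usable separator left: fall back to fixed-width slices
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "Intro paragraph.\n\nSecond paragraph has two lines.\nShort line."
chunks = recursive_split(sample, ["\n\n", "\n", " ", ""], chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

Only the long second line is broken at word boundaries; the short pieces survive intact, which is the behaviour visible in the chunks printed above.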


#3. Using token text splitter

In [25]:
token_splitter = TokenTextSplitter(
    chunk_size=50,
    chunk_overlap= 10
)

token_chunks = token_splitter.split_text(text)
print(f'Created: {len(token_chunks)} chunks\n')
print(f'First chunk: {token_chunks[0][:50]}.....')

Created: 20 chunks

First chunk: Of course. Python is the undisputed lingua franca .....
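
With a token-based splitter, `chunk_size` and `chunk_overlap` count tokens rather than characters, so each window starts `chunk_size - chunk_overlap` tokens after the previous one. The sketch below shows that sliding window on whitespace-separated words. Treating words as tokens is an assumption made to keep the example dependency-free (a real tokenizer produces sub-word tokens), and `split_by_tokens` is a hypothetical helper.

```python
from typing import List


def split_by_tokens(tokens: List[str], chunk_size: int, chunk_overlap: int) -> List[List[str]]:
    """Slide a window of chunk_size tokens forward by (chunk_size - chunk_overlap)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows: List[List[str]] = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
        start += step
    return windows


# Whitespace "tokens" stand in for real tokenizer output (assumption)
tokens = "the quick brown fox jumps over the lazy dog again".split()
windows = split_by_tokens(tokens, chunk_size=4, chunk_overlap=1)
for window in windows:
    print(" ".join(window))
```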


In [1]:
import os
from typing import List, Dict, Any
import pandas as pd
from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter, TokenTextSplitter

In [2]:
print('Set up Completed')

Set up Completed
