# Data Collection and Exploration

In this notebook, we'll collect real data from Wikipedia and arXiv to build the corpus for our RAG system. These are the actual documents that the system will need to understand and retrieve from.

## Learning Objectives
By the end of this notebook, you will:
1. Understand how to collect data from Hugging Face datasets
2. Explore the structure and characteristics of different data sources
3. Learn about data quality and filtering strategies
4. Get hands-on experience with real text data


## Setup and Imports

First, let's import the libraries we'll need and set up our environment.


In [None]:
# Standard library imports
import json
import sys
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List

# Data science and plotting imports
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm

# Hugging Face imports
from datasets import load_dataset

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

# Add the project root to sys.path so the `src` package can be imported
# (assumes this notebook runs from a directory one level below the project root)
sys.path.append(str(Path.cwd().parent))

# Import our custom modules
from src.config import DATA_CONFIG, DATA_DIR
from src.data.collect_data import DataCollector

print("Libraries imported successfully!")
print(f"Data directory: {DATA_DIR}")
print(f"Max documents per source: {DATA_CONFIG['datasets']['wikipedia']['max_documents']}")
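
The last `print` in the setup cell indexes three levels into `DATA_CONFIG`, so a missing key at any level raises a `KeyError`. The sketch below shows one way to read nested settings defensively. It is only an illustration: `EXAMPLE_DATA_CONFIG`, its `arxiv` entry, the numeric limits, and the `get_nested` helper are all assumptions made for this example, and the real `src.config.DATA_CONFIG` may be shaped differently.

```python
from collections.abc import Mapping
from typing import Any

# Hypothetical stand-in for src.config.DATA_CONFIG.
# The keys mirror the lookup used in the setup cell; the values are made up.
EXAMPLE_DATA_CONFIG = {
    "datasets": {
        "wikipedia": {"max_documents": 1000},
        "arxiv": {"max_documents": 500},
    }
}


def get_nested(config: Mapping, *keys: str, default: Any = None) -> Any:
    """Walk nested mappings key by key; return `default` if any key is missing."""
    current: Any = config
    for key in keys:
        # Stop early if we hit a non-mapping value or an absent key.
        if not isinstance(current, Mapping) or key not in current:
            return default
        current = current[key]
    return current


wiki_limit = get_nested(EXAMPLE_DATA_CONFIG, "datasets", "wikipedia", "max_documents")
missing = get_nested(EXAMPLE_DATA_CONFIG, "datasets", "pubmed", "max_documents", default=0)

print(f"Wikipedia limit: {wiki_limit}")
print(f"Unconfigured source limit: {missing}")
```

A helper like this trades an immediate `KeyError` for a default value, which is convenient while exploring but can hide typos in key names, so direct indexing may still be the better choice once the configuration is stable.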
