# Data Preparation

In this workshop, we will use written content from the [OWASP CheatSheetSeries](https://github.com/OWASP/CheatSheetSeries) as the source documents for our RAG. To reduce cost, I have already curated a few files and placed them in the `sources/` directory. Instead of using the whole series, we will build embeddings from only these curated files.

The code below iterates over every file in the `sources` directory and creates a `course_content.jsonl` file with one JSON record per file, each holding that file's contents.

In [None]:
import json
from pathlib import Path

def generate_course_content_jsonl():
    sources_dir = Path('sources')
    output_file = 'course_content.jsonl'
    
    if not sources_dir.exists():
        print(f"Error: {sources_dir} directory not found.")
        return
    
    # Sort for a stable order and keep only regular files so ids stay contiguous
    files = sorted(p for p in sources_dir.glob('*') if p.is_file())

    with open(output_file, 'w', encoding='utf-8') as jsonl_file:
        for doc_id, file_path in enumerate(files, start=1):
            # Derive a human-readable title from the file name
            title = file_path.stem.replace('_', ' ')
            content = file_path.read_text(encoding='utf-8')

            # Generate slug from title
            slug = title.lower().replace(' ', '-')

            record = {
                'id': doc_id,
                'title': title,
                'content': content,
                'file_path': str(file_path),
                'slug': slug
            }

            # Write one JSON object per line
            json.dump(record, jsonl_file)
            jsonl_file.write('\n')
    
    print(f"JSONL file '{output_file}' has been generated successfully.")

# Run the function to generate the JSONL file
generate_course_content_jsonl()
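
To make the `title` and `slug` fields concrete, here is a minimal sketch of the same derivation applied to a single file name. The helper `title_and_slug` and the file name `Input_Validation_Cheat_Sheet.md` are illustrative assumptions, not part of the workshop code; the actual names depend on what is in `sources/`.

```python
from pathlib import Path

def title_and_slug(file_name):
    """Mirror the title/slug derivation used when building the JSONL records."""
    # Drop the extension and turn underscores into spaces
    title = Path(file_name).stem.replace('_', ' ')
    # Lowercase the title and turn spaces into hyphens
    slug = title.lower().replace(' ', '-')
    return title, slug

# Hypothetical file name, chosen only for illustration
title, slug = title_and_slug('Input_Validation_Cheat_Sheet.md')
print(title)  # Input Validation Cheat Sheet
print(slug)   # input-validation-cheat-sheet
```

Note that only underscores and spaces are handled, so a file name with other punctuation would carry it into the slug unchanged.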


Let's see what is inside the `course_content.jsonl` file:

In [None]:
import pandas as pd

df = pd.read_json('course_content.jsonl', lines=True)
df.head()
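
If you would rather check the file without pandas, the sketch below reads it with the standard library and verifies that every record has the fields written above. The helper `load_records` and the constant `EXPECTED_KEYS` are hypothetical names introduced here for illustration.

```python
import json

# Fields each record is expected to carry, matching the generator above
EXPECTED_KEYS = {'id', 'title', 'content', 'file_path', 'slug'}

def load_records(path):
    """Read a JSONL file into a list of dicts, checking each record's keys."""
    records = []
    with open(path, 'r', encoding='utf-8') as jsonl_file:
        for line_number, line in enumerate(jsonl_file, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = EXPECTED_KEYS - record.keys()
            if missing:
                raise ValueError(
                    f"line {line_number} is missing keys: {sorted(missing)}"
                )
            records.append(record)
    return records
```

Calling `load_records('course_content.jsonl')` after the generation step should return one dictionary per source file, or raise a `ValueError` naming the first malformed line.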