This script builds a **demo curriculum dataset**, saves it to an Excel file, and then loads it into a SQLite database. The inserts read from the in-memory DataFrame rather than from the saved file. The goal is to measure how quickly large amounts of data can be inserted into related tables.

**Steps in the script:**

1. **Create Demo Data**

    - Makes rows with three columns: `strand`, `substrand`, and `activity`.
    - Strands are named `"Strand A"`, `"Strand B"`, etc.
    - Substrands are named `"Sub Strand A"`, `"Sub Strand B"`, etc.
    - Activities are named `"Activity #1"`, `"Activity #2"`, etc.
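    - The number of activities per substrand is controlled by `ACTIVITIES_PER_SUBSTRAND`. Because the code uses `range(1, ACTIVITIES_PER_SUBSTRAND)`, which excludes the upper bound, a value of 100 yields 99 activities per substrand (26 × 26 × 99 = 66,924 rows).
    - Saves everything into an Excel file called **`DEMO_CURRICULUM.xlsx`** in the `demos` folder.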

2. **Set Up Database**

    - Uses SQLite and creates 3 tables:

        - **Strand** (`id`, `name`)
        - **Substrand** (`id`, `strand_id`, `name`)
        - **Activity** (`id`, `substrand_id`, `name`)

    - Each table is linked:

        - A substrand belongs to a strand.
        - An activity belongs to a substrand.

3. **Insert Data into Database**

    - Finds all unique strands and inserts them into the `Strand` table.

        - Keeps a dictionary `strand_id_map` to remember the IDs.

    - Finds all unique strand–substrand pairs and inserts them into the `Substrand` table.

        - Keeps a dictionary `substrand_id_map` for their IDs.

    - Inserts all activities, using the `substrand_id_map` to connect them to the right substrand.
    - Uses `executemany()` for batch inserts, which is typically faster than calling `execute()` once per row in a Python loop.
    - Measures how long each step takes and prints the times.

4. **Result**

    - At the end, you get a SQLite database file called **`curriculum.db`** with all the data linked correctly.
    - The script also shows how many rows were inserted and how long it took.
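
One way to check that the data really is "linked correctly" is to join the three tables and look for orphaned rows. The sketch below is an illustration only: it uses a tiny in-memory database with the same three-table layout, and the sample rows are made up for the example rather than taken from `curriculum.db`.

```python
import sqlite3

# In-memory stand-in for curriculum.db, using the same three-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Strand (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE);
CREATE TABLE Substrand (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    strand_id INTEGER REFERENCES Strand(id),
    name TEXT,
    UNIQUE(strand_id, name)
);
CREATE TABLE Activity (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    substrand_id INTEGER REFERENCES Substrand(id),
    name TEXT,
    UNIQUE(substrand_id, name)
);
-- Hypothetical sample rows, only for this illustration.
INSERT INTO Strand (name) VALUES ('Strand A'), ('Strand B');
INSERT INTO Substrand (strand_id, name) VALUES (1, 'Sub Strand A'), (2, 'Sub Strand A');
INSERT INTO Activity (substrand_id, name) VALUES
    (1, 'Activity #1'), (1, 'Activity #2'), (2, 'Activity #1');
""")

# Count activities per strand by following both foreign keys.
per_strand = conn.execute("""
    SELECT s.name, COUNT(a.id)
    FROM Strand AS s
    JOIN Substrand AS ss ON ss.strand_id = s.id
    JOIN Activity AS a ON a.substrand_id = ss.id
    GROUP BY s.id
    ORDER BY s.name
""").fetchall()

# Activities whose substrand_id matches no substrand (expected: none).
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM Activity AS a
    LEFT JOIN Substrand AS ss ON ss.id = a.substrand_id
    WHERE ss.id IS NULL
""").fetchone()[0]

print(per_strand)
print(orphans)
conn.close()
```

Run against the real database, the same two queries should report equal activity counts for every strand and zero orphans.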


In [14]:
import sqlite3
import string
import time
from pathlib import Path

import pandas as pd

# 1. CREATE DEMO EXCEL FILE


In [15]:
ACTIVITIES_PER_SUBSTRAND = 100

print("📂 Creating demo Excel file...")

start_time = time.perf_counter()

strands = list(string.ascii_uppercase)
sub_strands = list(string.ascii_uppercase)
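# Note: range() excludes its stop value, so this yields
# ACTIVITIES_PER_SUBSTRAND - 1 activities (1 through 99 here).
activities = [str(a) for a in range(1, ACTIVITIES_PER_SUBSTRAND)]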

all_rows = []
for strand in strands:
    for sub_strand in sub_strands:
        for activity in activities:
            all_rows.append(
                {
                    "strand": f"Strand {strand}",
                    "substrand": f"Sub Strand {sub_strand}",
                    "activity": f"Activity #{activity}",
                }
            )

df = pd.DataFrame(all_rows)

output_dir = Path("demos")
output_dir.mkdir(exist_ok=True)
output_path = output_dir / "DEMO_CURRICULUM.xlsx"
df.to_excel(output_path, index=False, engine="openpyxl")

elapsed = time.perf_counter() - start_time
print(f"✅ Excel created with {len(df)} rows in {elapsed:.2f} seconds.")
print(f"   File: {output_path.resolve()}")

📂 Creating demo Excel file...
✅ Excel created with 66924 rows in 4.80 seconds.
   File: /home/kraigochieng/projects/kicd_extraction/demos/DEMO_CURRICULUM.xlsx
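
The 66,924 rows reported above follow directly from the loop bounds: 26 strands × 26 substrands × 99 activities, since `range(1, 100)` stops at 99. As a possible alternative to the triple nested loop, the sketch below builds the same rows with `itertools.product` and checks the arithmetic. It leaves out pandas and the Excel export so that it runs with the standard library alone.

```python
import string
from itertools import product

ACTIVITIES_PER_SUBSTRAND = 100

letters = string.ascii_uppercase
# range(1, n) yields 1 .. n-1, mirroring the cell above (99 activities).
activity_numbers = range(1, ACTIVITIES_PER_SUBSTRAND)

# product() walks every (strand, substrand, activity) combination in the
# same order as the three nested for-loops.
rows = [
    {
        "strand": f"Strand {strand}",
        "substrand": f"Sub Strand {sub_strand}",
        "activity": f"Activity #{number}",
    }
    for strand, sub_strand, number in product(letters, letters, activity_numbers)
]

expected = len(letters) * len(letters) * len(activity_numbers)
print(len(rows), expected)
```

To get exactly `ACTIVITIES_PER_SUBSTRAND` activities, the range would need to be `range(1, ACTIVITIES_PER_SUBSTRAND + 1)`.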


# 2. DB INSERTION

## Create tables

In [16]:
print("\n📂 Importing into database...")

conn = sqlite3.connect("curriculum.db")
cur = conn.cursor()

cur.executescript("""
-- Drop child tables first so the order stays valid if foreign keys are enforced.
DROP TABLE IF EXISTS Activity;
DROP TABLE IF EXISTS Substrand;
DROP TABLE IF EXISTS Strand;

CREATE TABLE Strand (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT UNIQUE
);

CREATE TABLE Substrand (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    strand_id INTEGER,
    name TEXT,
    UNIQUE(strand_id, name),
    FOREIGN KEY (strand_id) REFERENCES Strand(id)
);

CREATE TABLE Activity (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    substrand_id INTEGER,
    name TEXT,
    UNIQUE(substrand_id, name),
    FOREIGN KEY (substrand_id) REFERENCES Substrand(id)
);
""")




📂 Importing into database...


<sqlite3.Cursor at 0x75c382f286c0>
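
A caveat about the `FOREIGN KEY` clauses: in default SQLite builds they are parsed but not enforced unless `PRAGMA foreign_keys = ON` is issued on each connection. The notebook does not set this pragma, so the links hold only because the ID maps are built carefully. The sketch below, on an in-memory database with a trimmed-down schema, shows what enforcement would look like. The `999` strand ID is a made-up value chosen to trigger the error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Must be set per connection, outside of any open transaction.
conn.execute("PRAGMA foreign_keys = ON")
enabled = conn.execute("PRAGMA foreign_keys").fetchone()[0]

conn.executescript("""
CREATE TABLE Strand (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE);
CREATE TABLE Substrand (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    strand_id INTEGER,
    name TEXT,
    UNIQUE(strand_id, name),
    FOREIGN KEY (strand_id) REFERENCES Strand(id)
);
""")

conn.execute("INSERT INTO Strand (name) VALUES ('Strand A')")
conn.execute("INSERT INTO Substrand (strand_id, name) VALUES (1, 'Sub Strand A')")

try:
    # Strand 999 does not exist, so this row would be an orphan.
    conn.execute("INSERT INTO Substrand (strand_id, name) VALUES (999, 'Orphan')")
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print(f"Rejected: {exc}")

print(enabled, rejected)
conn.close()
```

Enforcement adds a lookup per inserted row, so it may affect the timings this notebook measures.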

## Insert Strands

In [17]:
unique_strands = df["strand"].unique().tolist()
unique_strands

['Strand A',
 'Strand B',
 'Strand C',
 'Strand D',
 'Strand E',
 'Strand F',
 'Strand G',
 'Strand H',
 'Strand I',
 'Strand J',
 'Strand K',
 'Strand L',
 'Strand M',
 'Strand N',
 'Strand O',
 'Strand P',
 'Strand Q',
 'Strand R',
 'Strand S',
 'Strand T',
 'Strand U',
 'Strand V',
 'Strand W',
 'Strand X',
 'Strand Y',
 'Strand Z']

In [18]:
# --- Step 1: Insert Strands ---
start_time = time.perf_counter()

cur.executemany("INSERT INTO Strand (name) VALUES (?)", [(s,) for s in unique_strands])

# Assumes the freshly created table assigns ids 1..N in insertion order.
strand_id_map = {s: i for i, s in enumerate(unique_strands, start=1)}

display(strand_id_map)

elapsed = time.perf_counter() - start_time

print(f"✅ Inserted {len(unique_strands)} strands in {elapsed:.2f} seconds.")



{'Strand A': 1,
 'Strand B': 2,
 'Strand C': 3,
 'Strand D': 4,
 'Strand E': 5,
 'Strand F': 6,
 'Strand G': 7,
 'Strand H': 8,
 'Strand I': 9,
 'Strand J': 10,
 'Strand K': 11,
 'Strand L': 12,
 'Strand M': 13,
 'Strand N': 14,
 'Strand O': 15,
 'Strand P': 16,
 'Strand Q': 17,
 'Strand R': 18,
 'Strand S': 19,
 'Strand T': 20,
 'Strand U': 21,
 'Strand V': 22,
 'Strand W': 23,
 'Strand X': 24,
 'Strand Y': 25,
 'Strand Z': 26}

✅ Inserted 26 strands in 0.00 seconds.
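
The map above assumes that the database hands out IDs 1 through 26 in insertion order. That holds here because the tables were just dropped and recreated, but it would break if the table had held rows before. A more defensive variant, sketched below on an in-memory database, reads the IDs back after inserting. The `'Old strand'` row is a hypothetical leftover, added only to show the IDs starting somewhere other than 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Strand (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
)

# Hypothetical earlier use of the table: with AUTOINCREMENT, id 1 is
# never handed out again, even after the row is deleted.
conn.execute("INSERT INTO Strand (name) VALUES ('Old strand')")
conn.execute("DELETE FROM Strand")

names = ["Strand A", "Strand B", "Strand C"]
conn.executemany("INSERT INTO Strand (name) VALUES (?)", [(n,) for n in names])

# Ask the database which id each name received instead of assuming 1..N.
strand_id_map = {
    name: row_id for row_id, name in conn.execute("SELECT id, name FROM Strand")
}
print(strand_id_map)

conn.commit()
conn.close()
```

The extra `SELECT` costs one pass over a small table, which is negligible next to the activity inserts.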


## Insert Substrands

In [19]:
unique_pairs = df[["strand", "substrand"]].drop_duplicates().values.tolist()
unique_pairs[:20]

[['Strand A', 'Sub Strand A'],
 ['Strand A', 'Sub Strand B'],
 ['Strand A', 'Sub Strand C'],
 ['Strand A', 'Sub Strand D'],
 ['Strand A', 'Sub Strand E'],
 ['Strand A', 'Sub Strand F'],
 ['Strand A', 'Sub Strand G'],
 ['Strand A', 'Sub Strand H'],
 ['Strand A', 'Sub Strand I'],
 ['Strand A', 'Sub Strand J'],
 ['Strand A', 'Sub Strand K'],
 ['Strand A', 'Sub Strand L'],
 ['Strand A', 'Sub Strand M'],
 ['Strand A', 'Sub Strand N'],
 ['Strand A', 'Sub Strand O'],
 ['Strand A', 'Sub Strand P'],
 ['Strand A', 'Sub Strand Q'],
 ['Strand A', 'Sub Strand R'],
 ['Strand A', 'Sub Strand S'],
 ['Strand A', 'Sub Strand T']]

In [20]:
# --- Step 2: Insert Substrands ---
start_time = time.perf_counter()

cur.executemany(
    "INSERT INTO Substrand (strand_id, name) VALUES (?, ?)",
    [(strand_id_map[strand], substrand) for strand, substrand in unique_pairs],
)

substrand_id_map = {
    (strand, substrand): i
    for i, (strand, substrand) in enumerate(unique_pairs, start=1)
}


display(list(substrand_id_map.items())[:20])

elapsed = time.perf_counter() - start_time

print(f"✅ Inserted {len(unique_pairs)} substrands in {elapsed:.2f} seconds.")

[(('Strand A', 'Sub Strand A'), 1),
 (('Strand A', 'Sub Strand B'), 2),
 (('Strand A', 'Sub Strand C'), 3),
 (('Strand A', 'Sub Strand D'), 4),
 (('Strand A', 'Sub Strand E'), 5),
 (('Strand A', 'Sub Strand F'), 6),
 (('Strand A', 'Sub Strand G'), 7),
 (('Strand A', 'Sub Strand H'), 8),
 (('Strand A', 'Sub Strand I'), 9),
 (('Strand A', 'Sub Strand J'), 10),
 (('Strand A', 'Sub Strand K'), 11),
 (('Strand A', 'Sub Strand L'), 12),
 (('Strand A', 'Sub Strand M'), 13),
 (('Strand A', 'Sub Strand N'), 14),
 (('Strand A', 'Sub Strand O'), 15),
 (('Strand A', 'Sub Strand P'), 16),
 (('Strand A', 'Sub Strand Q'), 17),
 (('Strand A', 'Sub Strand R'), 18),
 (('Strand A', 'Sub Strand S'), 19),
 (('Strand A', 'Sub Strand T'), 20)]

✅ Inserted 676 substrands in 0.01 seconds.
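
The `drop_duplicates()` call matters because of the `UNIQUE(strand_id, name)` constraint: a repeated pair would abort the batch with an `IntegrityError`. The following sketch illustrates this on an in-memory table with three made-up pairs, one of them a duplicate. It de-duplicates with `dict.fromkeys` instead of pandas so that it needs only the standard library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE Substrand (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    strand_id INTEGER,
    name TEXT,
    UNIQUE(strand_id, name)
)
""")
sql = "INSERT INTO Substrand (strand_id, name) VALUES (?, ?)"

# Hypothetical input in which the last pair repeats the first.
pairs = [(1, "Sub Strand A"), (1, "Sub Strand B"), (1, "Sub Strand A")]

try:
    conn.executemany(sql, pairs)
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
    conn.rollback()  # discard the rows inserted before the failure

# dict.fromkeys keeps the first occurrence of each pair, in order.
unique_pairs = list(dict.fromkeys(pairs))
conn.executemany(sql, unique_pairs)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM Substrand").fetchone()[0]
print(duplicate_rejected, count)
conn.close()
```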


## Insert Activities

In [21]:
unique_triples = (
    df[["strand", "substrand", "activity"]].drop_duplicates().values.tolist()
)
display(unique_triples[:20])

[['Strand A', 'Sub Strand A', 'Activity #1'],
 ['Strand A', 'Sub Strand A', 'Activity #2'],
 ['Strand A', 'Sub Strand A', 'Activity #3'],
 ['Strand A', 'Sub Strand A', 'Activity #4'],
 ['Strand A', 'Sub Strand A', 'Activity #5'],
 ['Strand A', 'Sub Strand A', 'Activity #6'],
 ['Strand A', 'Sub Strand A', 'Activity #7'],
 ['Strand A', 'Sub Strand A', 'Activity #8'],
 ['Strand A', 'Sub Strand A', 'Activity #9'],
 ['Strand A', 'Sub Strand A', 'Activity #10'],
 ['Strand A', 'Sub Strand A', 'Activity #11'],
 ['Strand A', 'Sub Strand A', 'Activity #12'],
 ['Strand A', 'Sub Strand A', 'Activity #13'],
 ['Strand A', 'Sub Strand A', 'Activity #14'],
 ['Strand A', 'Sub Strand A', 'Activity #15'],
 ['Strand A', 'Sub Strand A', 'Activity #16'],
 ['Strand A', 'Sub Strand A', 'Activity #17'],
 ['Strand A', 'Sub Strand A', 'Activity #18'],
 ['Strand A', 'Sub Strand A', 'Activity #19'],
 ['Strand A', 'Sub Strand A', 'Activity #20']]

In [22]:
# --- Step 3: Insert Activities ---
start_time = time.perf_counter()

cur.executemany(
    "INSERT INTO Activity (substrand_id, name) VALUES (?, ?)",
    [
        (substrand_id_map[(strand, substrand)], activity)
        for strand, substrand, activity in unique_triples
    ],
)

elapsed = time.perf_counter() - start_time

print(f"✅ Inserted {len(unique_triples)} activities in {elapsed:.2f} seconds.")


✅ Inserted 66924 activities in 0.20 seconds.
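
To put the 0.20 seconds in context, a rough comparison against a row-by-row loop can be run as follows. This is a sketch under several assumptions: it uses an in-memory database, 10,000 synthetic rows, and a simplified `Activity` table without the unique constraint. The absolute numbers will differ from the notebook's, and the gap depends on the machine and Python version.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with a simplified Activity table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Activity ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, substrand_id INTEGER, name TEXT)"
    )
    return conn


def count_rows(conn):
    return conn.execute("SELECT COUNT(*) FROM Activity").fetchone()[0]


sql = "INSERT INTO Activity (substrand_id, name) VALUES (?, ?)"
# Synthetic data: 100 substrands x 100 activities.
rows = [(sub_id, f"Activity #{n}") for sub_id in range(1, 101) for n in range(1, 101)]

# Variant 1: one execute() call per row, committed once at the end.
conn_loop = make_db()
start = time.perf_counter()
for row in rows:
    conn_loop.execute(sql, row)
conn_loop.commit()
loop_seconds = time.perf_counter() - start

# Variant 2: a single executemany() call over the whole list.
conn_batch = make_db()
start = time.perf_counter()
conn_batch.executemany(sql, rows)
conn_batch.commit()
batch_seconds = time.perf_counter() - start

loop_count = count_rows(conn_loop)
batch_count = count_rows(conn_batch)
print(f"execute() loop: {loop_count} rows in {loop_seconds:.4f} s")
print(f"executemany():  {batch_count} rows in {batch_seconds:.4f} s")

conn_loop.close()
conn_batch.close()
```

Both variants commit once, so the comparison isolates per-call overhead. Committing after every row would be a separate and usually much larger cost.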


## Close Connection

In [23]:
# Commit the pending inserts, then close the connection
conn.commit()
conn.close()

print("\n🎉 Import complete!")


🎉 Import complete!
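
Because the notebook commits only in the last cell, an error in any earlier cell leaves the inserts uncommitted. An alternative pattern, sketched here rather than taken from the notebook, is to use the connection as a context manager: `with conn:` commits if the block succeeds and rolls back if it raises. It does not close the connection, so `contextlib.closing` handles that part.

```python
import sqlite3
from contextlib import closing

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(
        "CREATE TABLE Strand (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
    )

    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute("INSERT INTO Strand (name) VALUES ('Strand A')")
            # Same name again: violates UNIQUE and aborts the transaction.
            conn.execute("INSERT INTO Strand (name) VALUES ('Strand A')")
    except sqlite3.IntegrityError:
        pass

    # The first insert was rolled back together with the failing one.
    after_failure = conn.execute("SELECT COUNT(*) FROM Strand").fetchone()[0]

    with conn:
        conn.executemany(
            "INSERT INTO Strand (name) VALUES (?)", [("Strand A",), ("Strand B",)]
        )
    after_success = conn.execute("SELECT COUNT(*) FROM Strand").fetchone()[0]

print(after_failure, after_success)
```

With this pattern the three insert steps could share one `with conn:` block, so the database ends up with either all of the data or none of it.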
