# Lesson 03 Walkthrough: Your First Database

## Database Applications Development

**Welcome!** This notebook will guide you step-by-step through creating your first SQLite database.

### What You'll Do Today:
1. Load the Titanic CSV file (review from Lesson 01)
2. Convert the DataFrame into a SQLite database
3. Write your first SQL query
4. Compare SQL to pandas operations

### Prerequisites:
- ✅ Completed Lessons 01 and 02
- ✅ Have `Titanic Dataset.csv` in your working directory
- ✅ Know how to run cells in Jupyter (Shift + Enter)

Let's get started!

---

## Part 1: Import Libraries

We'll need two Python libraries:
- **pandas** - for reading CSV and working with DataFrames (you're already familiar with this!)
- **sqlite3** - for creating and working with SQLite databases (built into Python)


In [None]:
# Import the libraries we need
import pandas as pd
import sqlite3

print("Libraries imported successfully!")
print(f"pandas version: {pd.__version__}")
print(f"SQLite version: {sqlite3.sqlite_version}")

**✅ Check:** You should see version numbers printed above. If you see an error, ask for help!

---

## Part 2: Load the Titanic Dataset

This is review from Lesson 01! We'll load the CSV file into a pandas DataFrame.

**Remember:** A DataFrame is like a table in memory - rows and columns of data.

In [None]:
# Load the Titanic dataset from CSV
titanic_df = pd.read_csv('Titanic Dataset.csv')

# Display basic information about the dataset
print(f"Dataset shape: {titanic_df.shape[0]} rows, {titanic_df.shape[1]} columns")
print(f"\nColumn names: {list(titanic_df.columns)}")

In [None]:
# Let's look at the first few rows
titanic_df.head()

**Checkpoint:** You should see passenger information displayed above.

### Answer These Questions:
- What are the column names (you can copy the list)?

  
- How many passengers are in the dataset (think about rows)?

  
- What types of data do we have - just based on what you see in the sample of five rows? (numbers, text, etc.)



## Part 3: Understanding What We're About to Do

Right now, our data is in a **DataFrame** (in memory).

We're going to:
1. Create a **database file** on disk (will be named `titanic.db`)
2. Create a **table** inside that database (will be named `passengers`)
3. Copy all the data from our DataFrame into that table

**Analogy:**
- CSV file = loose papers
- DataFrame = papers on your desk
- Database = organized filing cabinet with labeled drawers

---

## Part 4: Create Your First Database

This is the exciting part! We'll create a SQLite database in just a few lines of code.

### Step 4.1: Connect to the database

When you "connect" to a SQLite database that doesn't exist, SQLite automatically creates it for you!

In [None]:
# Create a connection to the database
# If 'titanic.db' doesn't exist, this will create it
conn = sqlite3.connect('titanic.db')  # Opens the 'titanic.db' database file and gives us a connection object to work with it

print("Connected to database 'titanic.db'")
print("   (If the file didn't exist, it was just created!)")

**What just happened?**

A file named `titanic.db` was created in your current directory. Right now it's empty, but it's a real SQLite database file!

The variable `conn` is our **connection** to the database - think of it as a phone line that lets us communicate with the database.
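To see the "connecting creates the file" behavior for yourself, here is a minimal sketch. It assumes a throwaway temporary directory and a made-up file name (`demo.db`), so it never touches your real `titanic.db`.

```python
import os
import sqlite3
import tempfile

# Work in a throwaway directory so this sketch never touches your real titanic.db
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")  # hypothetical file name

    existed_before = os.path.exists(demo_path)    # nothing has been created yet
    demo_conn = sqlite3.connect(demo_path)        # connecting creates the file
    existed_after = os.path.exists(demo_path)

    # Ask SQLite how many tables exist; a brand-new database has none
    table_count = demo_conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type='table'"
    ).fetchone()[0]

    demo_conn.close()  # close before the temporary directory is deleted

print(f"File existed before connecting: {existed_before}")
print(f"File exists after connecting:   {existed_after}")
print(f"Tables in the new database:     {table_count}")
```

The same check works on `titanic.db` itself: before Step 4.2 runs, the table count should be zero.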

### Step 4.2: Copy DataFrame data into the database

Now we'll use pandas' `.to_sql()` method to create a table and populate it with our data.

In [None]:
# Convert the DataFrame to a SQL table
titanic_df.to_sql(
    name='passengers',           # The name of the SQL table we want to create in the database
    con=conn,                    # The active database connection (created with sqlite3.connect)
    if_exists='replace',         # If a table named 'passengers' already exists, delete it and write a new one
    index=False                  # Don't store the DataFrame's index as its own column in the database
)


print("Data written to database (no errors reported).")
print(f"Expected rows written: {len(titanic_df)}")
print("Next steps will validate/verify the new db using SQL queries!")

**Congratulations!** You just created your first database!

**What happened behind the scenes:**
1. pandas looked at your DataFrame's columns and data types
2. It created a SQL table with matching columns
3. It copied all the data from the DataFrame into the table
4. Everything was saved to the `titanic.db` file

**Important note:** The `if_exists='replace'` parameter means if you run this cell again, it will delete the old table and create a new one. This is useful for development!
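The `if_exists` parameter accepts `'fail'` (the default), `'replace'`, and `'append'`. The sketch below compares them on a tiny made-up DataFrame written to an in-memory database; the table name `people` and the helper `row_count` are illustrative assumptions, not part of the lesson's dataset.

```python
import sqlite3
import pandas as pd

# Tiny made-up DataFrame and an in-memory database (":memory:" never touches disk)
demo_df = pd.DataFrame({"name": ["Ann", "Bob"], "age": [30, 41]})
demo_conn = sqlite3.connect(":memory:")

def row_count(conn):
    """Return the number of rows currently in the hypothetical 'people' table."""
    return conn.execute("SELECT COUNT(*) FROM people").fetchone()[0]

# First write: the table does not exist yet, so it is simply created
demo_df.to_sql("people", demo_conn, if_exists="replace", index=False)
after_first_write = row_count(demo_conn)

# 'replace' drops the old table and writes a fresh copy, so the count stays the same
demo_df.to_sql("people", demo_conn, if_exists="replace", index=False)
after_replace = row_count(demo_conn)

# 'append' keeps the existing rows and adds the new ones, so the count doubles
demo_df.to_sql("people", demo_conn, if_exists="append", index=False)
after_append = row_count(demo_conn)

# 'fail' (the default) refuses to touch an existing table and raises ValueError
try:
    demo_df.to_sql("people", demo_conn, if_exists="fail", index=False)
    fail_raised = False
except ValueError:
    fail_raised = True

demo_conn.close()
print(after_first_write, after_replace, after_append, fail_raised)
```

For a notebook you re-run often, `'replace'` avoids the duplicate rows that `'append'` would pile up on every run.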

---



## Part 5: Your First SQL Query

Now that we have a database with a table full of data, let's **query** it!

### What is a query?

A **query** is a question you ask the database. You write it in **SQL** (Structured Query Language).

### The simplest SQL query:

```sql
SELECT * FROM passengers;
```

Let's break this down:
- `SELECT` = "Show me..."
- `*` = "...everything (all columns)"
- `FROM passengers` = "...from the passengers table"
- `;` = "End of command"

**Translation to English:** "Show me all columns from the passengers table"

In [None]:
query = "SELECT * FROM passengers LIMIT 5"   # 1. Write SQL query as a string
result = pd.read_sql(query, conn)              # 2. Run query using pandas + active DB connection
print("First 5 passengers from database:")   # 3. Optional: A label for clarity
result                                         # 4. Show the DataFrame (Jupyter auto-displays the last line)


**Success!** You just wrote and executed your first SQL query!

**Notice:**
- We used `LIMIT 5` to only get the first 5 rows (just like `.head()` in pandas)
- The result came back as a pandas DataFrame
- It looks identical to when we used `titanic_df.head()`

### Why use `pd.read_sql()`?

The function `pd.read_sql()` is awesome because it:
1. Takes a SQL query as a string
2. Executes it on the database
3. Returns the results as a pandas DataFrame
4. Makes it easy to work with the results!
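To appreciate what `pd.read_sql()` saves you, here is a rough sketch of the same steps done by hand with only `sqlite3`. It is not how pandas is implemented internally, just an illustration; the `people` table and its rows are made up.

```python
import sqlite3
import pandas as pd

# Made-up table in an in-memory database
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
demo_conn.executemany("INSERT INTO people VALUES (?, ?)", [("Ann", 30), ("Bob", 41)])

query = "SELECT name, age FROM people"

# Manual route: execute the query, read the column names, fetch the rows
cursor = demo_conn.execute(query)
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()                      # a list of plain tuples

# One-call route: pandas does all of the above and builds a DataFrame
pandas_df = pd.read_sql(query, demo_conn)
pandas_rows = list(pandas_df.itertuples(index=False, name=None))

demo_conn.close()

print("Column names:", column_names)
print("Rows by hand:", rows)
print("Same data from pandas:", pandas_rows == rows)
```

The manual route returns tuples with no column labels attached, which is why the DataFrame returned by `pd.read_sql()` is usually more convenient for analysis.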

---

## Part 6: Comparing Pandas and SQL

Let's see how pandas operations map to SQL queries.

### Example 1: Get all data

In [None]:
# Pandas way (from Lesson 01)
print("PANDAS WAY:")                      # Label to show we're displaying the DataFrame using pandas tools
print(f"Shape: {titanic_df.shape}")       # Shows the number of rows and columns in the DataFrame as a tuple (rows, columns)
display(titanic_df.head(3))               # Displays the first 3 rows so we can preview the data structure

In [None]:
# SQL way (new!)
print("SQL WAY:")                                # Label to show we're now previewing the data using an SQL query instead of pandas-only methods
query = "SELECT * FROM passengers LIMIT 3"       # SQL command asking for all columns but only the first 3 rows from the passengers table
result = pd.read_sql(query, conn)                # Runs the SQL query through our database connection and returns the result as a pandas DataFrame
print(f"Shape: {result.shape}")                  # Prints the number of rows and columns returned by the SQL query
display(result)                                  # Displays the query results so we can compare SQL output with pandas output


**Same result!** Just different syntax.

### Example 2: Select specific columns

In [None]:
# Pandas way (from Lesson 02)
print("PANDAS WAY:")                                                    # Label to show we're previewing data using pandas tools
pandas_result = titanic_df[['name', 'age', 'sex', 'survived']].head()   # Selects specific columns and returns the first 5 rows
display(pandas_result)                                                  # Displays the pandas result so we can view the sample output


In [None]:
# SQL way
print("SQL WAY:")                                                        # Label to show we're previewing data using an SQL query
query = "SELECT Name, Age, Sex, Survived FROM passengers LIMIT 5"        # SQL command selecting specific columns and the first 5 rows
sql_result = pd.read_sql(query, conn)                                    # Sends the SQL query through the database connection and returns a DataFrame
display(sql_result)                                                      # Displays the SQL query result so we can view the sample output

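**Key insight:**
- Pandas: `df[['col1', 'col2']]` with a list of quoted column names inside square brackets
- SQL: `SELECT col1, col2` with unquoted column names separated by commas
- Capitalization: pandas column labels are case-sensitive, so they must match the CSV header exactly; SQLite matches column names case-insensitively, so `Name` and `name` both work in a query

**Same concepts!** You're just learning two variations of syntax.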
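If you want to confirm that both approaches return the same data, a small check like the following can help. It uses a made-up three-column table rather than the Titanic data, and it also shows how differently the two tools treat capitalization in column names.

```python
import sqlite3
import pandas as pd

# Made-up data with capitalized column names
demo_df = pd.DataFrame(
    {"Name": ["Ann", "Bob"], "Age": [30, 41], "City": ["Oslo", "Lima"]}
)
demo_conn = sqlite3.connect(":memory:")
demo_df.to_sql("people", demo_conn, if_exists="replace", index=False)

# Column selection, both ways
pandas_cols = demo_df[["Name", "Age"]]
sql_cols = pd.read_sql("SELECT Name, Age FROM people", demo_conn)
same_values = pandas_cols.values.tolist() == sql_cols.values.tolist()

# SQLite matches column names case-insensitively, so lowercase works too
lower_sql = pd.read_sql("SELECT name, age FROM people", demo_conn)
lowercase_sql_matches = lower_sql.values.tolist() == sql_cols.values.tolist()

# pandas is case-sensitive: lowercase labels do not exist in this DataFrame
try:
    demo_df[["name", "age"]]
    pandas_accepts_lowercase = True
except KeyError:
    pandas_accepts_lowercase = False

demo_conn.close()
print(same_values, lowercase_sql_matches, pandas_accepts_lowercase)
```

If a pandas selection raises `KeyError` while the matching SQL query runs fine, check the capitalization against `titanic_df.columns`.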

### Example 3: Count rows

In [None]:
# Pandas way
print("PANDAS WAY:")
print(f"Total passengers: {len(titanic_df)}")

In [None]:
# SQL way
print("SQL WAY:")
query = "SELECT COUNT(*) as total_passengers FROM passengers"
result = pd.read_sql(query, conn)
print(f"Total passengers: {result['total_passengers'][0]}")

**Note:** `COUNT(*)` is a SQL function that counts rows. We'll learn more about these functions in an upcoming lesson!
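A count query always returns exactly one row with one column, and the `AS` alias decides what that column is called. The sketch below shows this with plain `sqlite3` on a made-up `people` table.

```python
import sqlite3

# Made-up table with three rows
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
demo_conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", 30), ("Bob", 41), ("Cy", 25)],
)

cursor = demo_conn.execute("SELECT COUNT(*) AS total_people FROM people")
result_column = cursor.description[0][0]   # the alias given with AS
total_people = cursor.fetchone()[0]        # first (and only) value of the single row

demo_conn.close()
print(f"{result_column}: {total_people}")
```

This is why the pandas version reads the value with `result['total_passengers'][0]`: the alias is the column label, and `0` is the only row.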

---

## Part 7: Exploring the Database Structure

Let's learn how to see what tables exist in our database and what columns they have.

### See all tables in the database

In [None]:
# Query to see all tables in the database
query = "SELECT name FROM sqlite_master WHERE type='table'"                   # SQL query asking SQLite to list all table names in the database
tables = pd.read_sql(query, conn)                                             # Runs the query through our database connection and returns the results as a DataFrame
print("Tables in database:")                                                  # Label explaining that we are about to display the list of tables
display(tables)                                                               # Shows the table list returned from the SQL query


### Why is there only one table in our output?
We read a single CSV (`Titanic Dataset.csv`) into pandas, then wrote that data into SQLite as one table named `passengers`. Since no other tables were created, the database correctly reports only one table.

### Why does the output show an index of 0?
Pandas displays query results as a DataFrame, and DataFrames always include a row index (0, 1, 2, …).
Because there is only one table in the database, the result contains one row, and its index is simply 0.
That index is not part of the SQL database — it's just how pandas formats the output.

---

**What's `sqlite_master`?**

Every SQLite database has a special table called `sqlite_master` that stores metadata about the database. It's like a table of contents!
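Besides table names, `sqlite_master` also stores the `CREATE TABLE` statement behind each table. Here is a small sketch using two made-up tables (`people` and `pets`) in an in-memory database.

```python
import sqlite3

# Two made-up tables so the catalog has something to list
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
demo_conn.execute("CREATE TABLE pets (pet_name TEXT)")

# Each row of sqlite_master describes one object in the database
catalog = demo_conn.execute(
    "SELECT type, name, sql FROM sqlite_master WHERE type='table' ORDER BY name"
).fetchall()
table_names = [name for _type, name, _sql in catalog]

for entry_type, name, create_sql in catalog:
    print(f"{entry_type}: {name}")
    print(f"   {create_sql}")

demo_conn.close()
```

Running the same query against `titanic.db` shows the `CREATE TABLE` statement that pandas generated for `passengers`.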

### Get information about table structure

Let's see what columns are in the passengers table:

In [None]:
# Get column information using PRAGMA
query = "PRAGMA table_info(passengers)"                   # PRAGMA = a SQLite-specific command; table_info lists every column in the 'passengers' table
columns_info = pd.read_sql(query, conn)                   # Runs the PRAGMA query and returns the column details as a pandas DataFrame
print("Column information for 'passengers' table:")       # Label explaining that we are about to display the table's column names and data types
display(columns_info[['name', 'type']])                   # Shows only the column names and their data types for clarity

**Cool!** This shows us:
- Column names
- Data types used in SQLite (INTEGER, TEXT, REAL)

**What these SQL data types mean:**

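- INTEGER → Whole numbers (no decimals), such as counts and IDs
- REAL → Decimal numbers (floating-point values), such as fares, percentages, heights, etc. A numeric column with missing values also ends up as REAL, because pandas stores it as floats
- TEXT → Strings or text data, such as names, sexes, ticket numbers

These types were chosen by pandas: when `.to_sql()` created the table, it mapped each DataFrame column's dtype to a matching SQLite type. SQLite itself uses a very flexible type system called dynamic typing, where the declared column type is a preference rather than a strict rule and each stored value carries its own type.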
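The link between DataFrame dtypes and the types reported by `PRAGMA table_info` can be checked on a small made-up DataFrame. The column names below are illustrative assumptions, not the Titanic columns.

```python
import sqlite3
import pandas as pd

# Made-up DataFrame with one column of each basic kind
demo_df = pd.DataFrame({
    "passenger_id": [1, 2, 3],       # whole numbers -> int64 in pandas
    "age": [22.0, None, 38.5],       # missing value -> float64 in pandas
    "name": ["Ann", "Bob", "Cy"],    # text
})

demo_conn = sqlite3.connect(":memory:")
demo_df.to_sql("people", demo_conn, if_exists="replace", index=False)

# PRAGMA table_info returns one row per column: (cid, name, type, notnull, dflt_value, pk)
declared_types = {
    row[1]: row[2] for row in demo_conn.execute("PRAGMA table_info(people)")
}
demo_conn.close()

print(demo_df.dtypes)
print(declared_types)
```

Comparing `titanic_df.dtypes` with the `PRAGMA` output from the cell above should show the same pattern for the `passengers` table.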

---

## Part 8: Practice Queries

Now it's your turn! Try writing some queries on your own.

### Practice 1: Select just names and ages

In [None]:
# TODO: Write a query to select only Name and Age columns
# Hint: SELECT Name, Age FROM passengers LIMIT 10

query = ""  # <-- Write your query here

# Uncomment the lines below when you're ready to test:
# result = pd.read_sql(query, conn)
# display(result)

### Practice 2: Count total passengers

In [None]:
# TODO: Write a query to count all passengers
# Hint: Use COUNT(*); you can refer back to the earlier cell that already did this

query = ""  # <-- Write your query here

# Uncomment the lines below when you're ready to test:
# result = pd.read_sql(query, conn)
# display(result)

### Practice 3: Get passenger names, ages, and survival status

In [None]:
# TODO: Select Name, Age, and Survived columns
# Show only the first 20 rows

query = ""  # <-- Write your query here

# Uncomment when ready:
# result = pd.read_sql(query, conn)
# display(result)

**Solutions are at the bottom of this notebook!** Try on your own first.

---

## Part 9: Always Close Your Connection!

When you're done working with a database, it's important to close the connection.

**Why?**
- Frees up system resources
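- Finishes your session cleanly (note that `close()` does not save uncommitted changes by itself; pandas' `.to_sql()` commits for you, but your own write statements need `conn.commit()` first)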
- Good programming practice

In [None]:
# Close the database connection
conn.close()
print("Database connection closed")

**Note:** After closing, you can no longer run queries on this connection. If you need to run more queries, create a new connection with `sqlite3.connect('titanic.db')`.
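The sketch below shows what happens when a closed connection is reused, and that reconnecting brings the data back. It assumes a made-up `demo.db` file in a temporary directory. Because the rows are inserted with plain `sqlite3` rather than pandas, the code calls `commit()` before closing.

```python
import os
import sqlite3
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.db")   # hypothetical file name

    # Write one row, commit it, then close the connection
    first_conn = sqlite3.connect(demo_path)
    first_conn.execute("CREATE TABLE people (name TEXT)")
    first_conn.execute("INSERT INTO people VALUES ('Ann')")
    first_conn.commit()    # close() alone would not save this insert
    first_conn.close()

    # Reusing the closed connection raises an error
    try:
        first_conn.execute("SELECT * FROM people")
        error_name = None
    except sqlite3.ProgrammingError as exc:
        error_name = type(exc).__name__

    # A new connection to the same file sees the saved data
    second_conn = sqlite3.connect(demo_path)
    names = [row[0] for row in second_conn.execute("SELECT name FROM people")]
    second_conn.close()

print(f"Error on closed connection: {error_name}")
print(f"Rows after reconnecting: {names}")
```

If you see `ProgrammingError` in this notebook, re-run the connection cell from Step 4.1 before running more queries.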

---

## Part 10: Complete Workflow Review

Let's review everything we did today:

In [None]:
# Complete workflow in one cell
import pandas as pd
import sqlite3

# Step 1: Load CSV
df = pd.read_csv('Titanic Dataset.csv')
print(f"✅ Loaded {len(df)} rows from CSV")

# Step 2: Create database and connection
conn = sqlite3.connect('titanic.db')
print("✅ Connected to database")

# Step 3: Transfer data to database
df.to_sql('passengers', conn, if_exists='replace', index=False)
print("✅ Data transferred to 'passengers' table")

# Step 4: Query the database
query = "SELECT Name, Age, Survived FROM passengers LIMIT 5"
result = pd.read_sql(query, conn)
print("\n✅ Query results:")
display(result)

# Step 5: Close connection
conn.close()
print("\n✅ Connection closed")

**That's the complete workflow!** From CSV to database to query results in just a few lines of code.

---

## Key Concepts Review

### What You Learned Today:

**Concepts:**
- ✅ Database = organized collection of data
- ✅ DBMS = software that manages databases
- ✅ SQLite = file-based database perfect for learning
- ✅ SQL = language for querying databases
- ✅ Table = structured data with rows and columns

**Skills:**
- ✅ Convert DataFrame → SQLite database
- ✅ Write basic SELECT queries
- ✅ Compare pandas operations to SQL
- ✅ Use `pd.read_sql()` to run queries
- ✅ Explore database structure

**Important Commands:**
```python
# Create connection
conn = sqlite3.connect('database.db')

# Create table from DataFrame
df.to_sql('table_name', conn, if_exists='replace', index=False)

# Run SQL query
result = pd.read_sql("SELECT * FROM table_name", conn)

# Close connection
conn.close()
```

---

## Looking Ahead to Lesson 04

**Next lesson you'll learn:**
- `WHERE` clause for filtering (like `df[df['Age'] > 30]`)
- `ORDER BY` for sorting (like `df.sort_values()`)
- More complex SELECT queries
- Recreate all your Lesson 02 operations in SQL!

---

## Congratulations!

You've completed Lesson 03! You now know how to:
- Create SQLite databases
- Write basic SQL queries
- Compare SQL and pandas
- Work with database connections

**Next up:** Lesson 04 - SQL Basics (SELECT, WHERE, ORDER BY)

---

## Additional Resources

**Want to learn more?**
- [SQLite Tutorial](https://www.sqlitetutorial.net/)
- [DB Browser Documentation](https://sqlitebrowser.org/docs/)
- [W3Schools SQL](https://www.w3schools.com/sql/)
- [pandas to_sql documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_sql.html)

**Keep practicing!** The more you work with databases, the more comfortable you'll become.