# **AI TECH INSTITUTE** · *Intermediate AI & Data Science*
### Week 3 - Notebook 01: SQL Fundamentals for Absolute Beginners
**Instructor:** Amir Charkhi |  **Goal:** SQL Fundamentals

> Format: theory → implementation → best practices → real-world application.

## Your First Steps into the World of Databases

**Learning Objectives:**
- Understand what SQL is and why it's essential for data science
- Learn how databases differ from CSV/Excel files
- Convert pandas DataFrames to SQL databases
- Write your first SQL queries
- Read from and write to databases using Python
- Master the essential SQL commands

**Prerequisites:** Basic Python and pandas knowledge



## 🤔 Why Should I Care About SQL?

Imagine you work at Netflix:
- **500 million** users
- **50 billion** viewing events per month
- **10 TB** of data generated daily

Can you load this into pandas? **NO!** pandas keeps the whole dataset in memory, and this would never fit on one machine. 💥

This is where SQL comes in:
- Process **billions** of rows without loading them into memory
- Query data that's **far larger** than your RAM
- Share data across **teams** without sending files
- Get results in **seconds** instead of hours

**Bottom line:** If you want a data job, you NEED SQL. It's used by:
- 🏢 **Every** tech company (Google, Meta, Amazon, Netflix)
- 🏦 **Every** bank and financial institution
- 🏥 **Every** healthcare organization
- 🛍️ **Every** retail company

## 📚 Part 0: Essential Libraries and Tools

Let's understand the Python libraries for working with databases.

In [2]:
#!pip install sqlalchemy


In [4]:
# Essential imports for SQL in Python
import pandas as pd
import numpy as np
import sqlite3  # Built-in SQL database
from sqlalchemy import create_engine  # Advanced database toolkit
import warnings
warnings.filterwarnings('ignore')

# For visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

print("📚 Library Guide:")
print("")
print("1. sqlite3: Built into Python, perfect for learning")
print("   - Serverless database (just a file)")
print("   - Great for prototypes and small projects")
print("   - Not built for many concurrent writers (use a server database such as PostgreSQL for that)")
print("")
print("2. pandas: Your bridge between Python and SQL")
print("   - pd.read_sql(): Read SQL query results into DataFrame")
print("   - df.to_sql(): Save DataFrame to database")
print("")
print("3. sqlalchemy: Professional database toolkit")
print("   - Works with any database (PostgreSQL, MySQL, etc.)")
print("   - Connection pooling and advanced features")
print("")
print("✅ All libraries loaded and ready!")

📚 Library Guide:

1. sqlite3: Built into Python, perfect for learning
   - Serverless database (just a file)
   - Great for prototypes and small projects
   - Not built for many concurrent writers (use a server database such as PostgreSQL for that)

2. pandas: Your bridge between Python and SQL
   - pd.read_sql(): Read SQL query results into DataFrame
   - df.to_sql(): Save DataFrame to database

3. sqlalchemy: Professional database toolkit
   - Works with any database (PostgreSQL, MySQL, etc.)
   - Connection pooling and advanced features

✅ All libraries loaded and ready!


---

## 🗄️ Part 1: What Exactly is SQL?

**SQL** = Structured Query Language (pronounced "sequel" or "S-Q-L")

Think of it as **English for talking to databases**.

In [5]:
# Let's see SQL in action with a simple example
print("🎯 SQL vs Pandas Comparison\n")
print("Task: Get all customers from USA\n")

print("PANDAS way:")
print("  df[df['country'] == 'USA']")
print("")
print("SQL way:")
print("  SELECT * FROM customers WHERE country = 'USA'")
print("")
print("💡 SQL reads almost like English!")
print("")
print("="*50)
print("")
print("🔑 Key Concepts:")
print("")
print("1. DATABASE: A container for tables (like an Excel file)")
print("2. TABLE: Structured data (like an Excel sheet or pandas DataFrame)")
print("3. ROW: A single record (like a row in pandas)")
print("4. COLUMN: A field/attribute (like a column in pandas)")
print("5. QUERY: A question you ask the database")

🎯 SQL vs Pandas Comparison

Task: Get all customers from USA

PANDAS way:
  df[df['country'] == 'USA']

SQL way:
  SELECT * FROM customers WHERE country = 'USA'

💡 SQL reads almost like English!


🔑 Key Concepts:

1. DATABASE: A container for tables (like an Excel file)
2. TABLE: Structured data (like an Excel sheet or pandas DataFrame)
3. ROW: A single record (like a row in pandas)
4. COLUMN: A field/attribute (like a column in pandas)
5. QUERY: A question you ask the database
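
The cell above only prints the two snippets side by side. As a minimal sketch of the same idea actually running, the block below builds a throwaway in-memory database and filters it with SQL. The `customers` rows and the `conn_demo` name are hypothetical and exist only for this illustration.

```python
import sqlite3

# Hypothetical sample data, used only for this illustration.
rows = [
    ("Ana", "USA"),
    ("Ben", "Canada"),
    ("Cara", "USA"),
]

conn_demo = sqlite3.connect(":memory:")  # temporary database that lives in RAM
conn_demo.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn_demo.executemany("INSERT INTO customers VALUES (?, ?)", rows)

# SQL way: the database does the filtering.
usa_sql = conn_demo.execute(
    "SELECT name, country FROM customers WHERE country = 'USA' ORDER BY name"
).fetchall()

# Plain-Python way (pandas would use df[df['country'] == 'USA']).
usa_python = sorted(row for row in rows if row[1] == "USA")

print(usa_sql)
conn_demo.close()
```

Both approaches return the same two customers; the difference is where the filtering happens.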


### 1.1 CSV/Excel vs Database - What's the Difference?

In [6]:
# Create a comparison table
comparison = pd.DataFrame({
    'Aspect': ['Size Limit', 'Speed', 'Multiple Users', 'Data Integrity', 
               'Relationships', 'Querying', 'Storage'],
    'CSV/Excel': ['~1M rows', 'Slow for large files', 'File conflicts', 
                  'No validation', 'Manual', 'Load everything', 'Entire file in memory'],
    'SQL Database': ['Billions of rows', 'Fast at any size', 'Concurrent access', 
                     'Data types enforced', 'Foreign keys', 'Get only what you need', 'On disk, query in chunks']
})

print("📊 CSV/Excel vs SQL Database:\n")
print(comparison.to_string(index=False))
print("\n💡 Rule of thumb:")
print("  - < 10,000 rows: CSV is fine")
print("  - 10,000 - 1M rows: Consider SQL")
print("  - > 1M rows: Definitely use SQL")

📊 CSV/Excel vs SQL Database:

        Aspect             CSV/Excel             SQL Database
    Size Limit              ~1M rows         Billions of rows
         Speed  Slow for large files         Fast at any size
Multiple Users        File conflicts        Concurrent access
Data Integrity         No validation      Data types enforced
 Relationships                Manual             Foreign keys
      Querying       Load everything   Get only what you need
       Storage Entire file in memory On disk, query in chunks

💡 Rule of thumb:
  - < 10,000 rows: CSV is fine
  - 10,000 - 1M rows: Consider SQL
  - > 1M rows: Definitely use SQL


In [8]:
comparison.head(10)

           Aspect              CSV/Excel              SQL Database
0      Size Limit               ~1M rows          Billions of rows
1           Speed   Slow for large files          Fast at any size
2  Multiple Users         File conflicts         Concurrent access
3  Data Integrity          No validation       Data types enforced
4   Relationships                 Manual              Foreign keys
5        Querying        Load everything    Get only what you need
6         Storage  Entire file in memory  On disk, query in chunks
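
The "Get only what you need" and "query in chunks" rows can be made concrete with a small sketch. It assumes a hypothetical `events` table with 10,000 rows and reads one user's rows in batches of 25 with `fetchmany`, so the full table is never pulled into a Python list.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, user_id INTEGER)")
demo.executemany(
    "INSERT INTO events (user_id) VALUES (?)",
    [(i % 100,) for i in range(10_000)],  # hypothetical data: 100 users
)

# Ask only for one user's rows instead of loading the whole table.
cur = demo.execute("SELECT event_id FROM events WHERE user_id = ?", (7,))

chunk_sizes = []
while True:
    chunk = cur.fetchmany(25)  # at most 25 rows are held in this list at once
    if not chunk:
        break
    chunk_sizes.append(len(chunk))

print(f"{sum(chunk_sizes)} matching rows read in {len(chunk_sizes)} chunks")
demo.close()
```

pandas offers a similar pattern through the `chunksize` argument of `pd.read_sql()`, which yields DataFrames one batch at a time.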


---

## 🏗️ Part 2: Creating Your First Database

Let's create a database from scratch and understand how it works!

### 2.1 Creating a Database Connection

In [9]:
# Create a database (it's just a file!)
print("🏗️ Creating your first database...\n")

# Method 1: Using sqlite3 directly
conn = sqlite3.connect('my_first_database.db')  # Creates a file called my_first_database.db
cursor = conn.cursor()  # A cursor is like a pointer to execute commands

print("✅ Database created!")
print("📁 Check your folder - you'll see 'my_first_database.db'")
print("")
print("🔑 Key Terms:")
print("  - CONNECTION: Your link to the database")
print("  - CURSOR: Executes SQL commands")
print("  - .db FILE: The actual database file")

🏗️ Creating your first database...

✅ Database created!
📁 Check your folder - you'll see 'my_first_database.db'

🔑 Key Terms:
  - CONNECTION: Your link to the database
  - CURSOR: Executes SQL commands
  - .db FILE: The actual database file
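
A connection should also be closed when you are done with it. The sketch below shows one common way to manage that lifecycle, using an in-memory database so no file is created. The `notes` table is hypothetical. Note that `with demo:` manages the transaction (commit or rollback), while `closing(...)` is what actually closes the connection.

```python
import sqlite3
from contextlib import closing

# ":memory:" keeps this demo from creating a file on disk.
with closing(sqlite3.connect(":memory:")) as demo:
    demo.execute("CREATE TABLE notes (body TEXT NOT NULL)")

    # "with demo:" commits if the block succeeds...
    with demo:
        demo.execute("INSERT INTO notes VALUES ('saved')")

    # ...and rolls back if the block raises an exception.
    try:
        with demo:
            demo.execute("INSERT INTO notes VALUES ('lost')")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass

    saved_notes = [row[0] for row in demo.execute("SELECT body FROM notes")]

print(saved_notes)
```

Only the first insert survives, because the second one was rolled back together with the simulated error.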


### 2.2 Creating Your First Table

In [26]:
# Create a simple table
print("📋 Creating a table is like defining a DataFrame structure:\n")


📋 Creating a table is like defining a DataFrame structure:



In [27]:
# Drop table if it exists (for re-running)
cursor.execute("DROP TABLE IF EXISTS students")

<sqlite3.Cursor at 0x1ef964cd340>

In [28]:
# Create table with SQL
create_table_sql = """
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    grade REAL,
    enrolled_date DATE
)
"""

cursor.execute(create_table_sql)


<sqlite3.Cursor at 0x1ef964cd340>

In [29]:
conn.commit()  # Save changes

In [30]:
print("SQL Command:")
print(create_table_sql)
print("")
print("✅ Table 'students' created!")
print("")
print("📚 Data Types in SQL:")
print("  - INTEGER: Whole numbers (1, 2, 3)")
print("  - REAL/FLOAT: Decimal numbers (3.14, 2.5)")
print("  - TEXT/VARCHAR: Text strings ('Hello')")
print("  - DATE/DATETIME: Dates and times")
print("  - BOOLEAN: True/False")
print("")
print("🔑 Special Constraints:")
print("  - PRIMARY KEY: Unique identifier for each row")
print("  - NOT NULL: This field cannot be empty")
print("  - UNIQUE: No duplicates allowed")
print("  - DEFAULT: Default value if none provided")

SQL Command:

CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    age INTEGER,
    grade REAL,
    enrolled_date DATE
)


✅ Table 'students' created!

📚 Data Types in SQL:
  - INTEGER: Whole numbers (1, 2, 3)
  - REAL/FLOAT: Decimal numbers (3.14, 2.5)
  - TEXT/VARCHAR: Text strings ('Hello')
  - DATE/DATETIME: Dates and times
  - BOOLEAN: True/False

🔑 Special Constraints:
  - PRIMARY KEY: Unique identifier for each row
  - NOT NULL: This field cannot be empty
  - UNIQUE: No duplicates allowed
  - DEFAULT: Default value if none provided
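
To see the constraints listed above in action, here is a small sketch on a hypothetical `members` table. It also checks how SQLite stores a `DATE` column: SQLite has no dedicated date storage type, so an ISO date string is kept as text. Other databases (PostgreSQL, MySQL) do have real date types.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("""
    CREATE TABLE members (
        member_id INTEGER PRIMARY KEY,
        email     TEXT NOT NULL UNIQUE,
        status    TEXT DEFAULT 'active',
        joined    DATE
    )
""")

demo.execute("INSERT INTO members (email, joined) VALUES ('a@example.com', '2024-01-15')")

rejected = []
for label, sql in [
    ("NOT NULL", "INSERT INTO members (email) VALUES (NULL)"),
    ("UNIQUE", "INSERT INTO members (email) VALUES ('a@example.com')"),
]:
    try:
        demo.execute(sql)
    except sqlite3.IntegrityError as exc:
        # The database refuses the row instead of storing bad data.
        rejected.append(label)
        print(f"{label} violation: {exc}")

# DEFAULT filled in status; typeof() shows how SQLite stored the DATE value.
status, stored_type = demo.execute(
    "SELECT status, typeof(joined) FROM members"
).fetchone()
print(status, stored_type)
demo.close()
```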


### 2.3 Inserting Data into Tables

In [31]:
# Insert data - Method 1: One row at a time
print("➕ INSERTING DATA\n")
print("Method 1: Insert one row")

insert_sql = """
INSERT INTO students (name, age, grade, enrolled_date)
VALUES ('Alice Smith', 20, 85.5, '2024-01-15')
"""

cursor.execute(insert_sql)
print("SQL:")
print(insert_sql)

➕ INSERTING DATA

Method 1: Insert one row
SQL:

INSERT INTO students (name, age, grade, enrolled_date)
VALUES ('Alice Smith', 20, 85.5, '2024-01-15')



In [36]:
# Method 2: Insert multiple rows
print("\nMethod 2: Insert multiple rows")

students_data = [
    ('Bob Johnson', 21, 92.0, '2024-01-15'),
    ('Charlie Brown', 19, 78.5, '2024-01-16'),
    ('Diana Prince', 22, 95.0, '2024-01-16'),
    ('Eve Wilson', 20, 88.0, '2024-01-17')
]

cursor.executemany(
    "INSERT INTO students (name, age, grade, enrolled_date) VALUES (?, ?, ?, ?)",
    students_data
)

conn.commit()  # Don't forget to commit!

# +1 accounts for the row already inserted with Method 1
print(f"✅ Inserted {len(students_data) + 1} students!")
print("")
print("💡 The '?' are placeholders - this prevents SQL injection attacks!")


Method 2: Insert multiple rows
✅ Inserted 5 students!

💡 The '?' are placeholders - this prevents SQL injection attacks!
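
Why do placeholders matter? The sketch below contrasts a query built with string formatting against the same query using `?`. The `user_input` value is a hypothetical malicious string, and the two-row table is illustrative only.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE students (name TEXT, grade REAL)")
demo.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Alice Smith", 85.5), ("Bob Johnson", 92.0)],
)

# Hypothetical malicious input typed into a search box.
user_input = "nobody' OR '1'='1"

# Unsafe: the input becomes part of the SQL text, so the OR clause runs.
unsafe_sql = f"SELECT name FROM students WHERE name = '{user_input}'"
unsafe_rows = demo.execute(unsafe_sql).fetchall()

# Safe: the input is passed as a value and compared literally.
safe_rows = demo.execute(
    "SELECT name FROM students WHERE name = ?", (user_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
demo.close()
```

The formatted query leaks every row, while the parameterized one finds no student with that odd name.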


In [32]:
# Add one more student (this cell ran before Method 2 above, so Mickey gets student_id 2)
new_sql = """
INSERT INTO students (name, age, grade, enrolled_date)
VALUES ('Mickey', 78, 89, '2025-09-09')
"""
cursor.execute(new_sql)

conn.commit()

---

## 🔍 Part 3: Your First SQL Queries

Time to ask questions to your database!

### 3.1 SELECT - Getting Data Out

In [37]:
# select all rows

query = "SELECT * FROM students"

pd.read_sql(query, conn)

   student_id           name  age  grade enrolled_date
0           1    Alice Smith   20   85.5    2024-01-15
1           2         Mickey   78   89.0    2025-09-09
2           3    Bob Johnson   21   92.0    2024-01-15
3           4  Charlie Brown   19   78.5    2024-01-16
4           5   Diana Prince   22   95.0    2024-01-16
5           6     Eve Wilson   20   88.0    2024-01-17


In [38]:
query2 = "SELECT name, enrolled_date FROM students WHERE enrolled_date > '2024-01-16'"
pd.read_sql(query2, conn)

         name enrolled_date
0      Mickey    2025-09-09
1  Eve Wilson    2024-01-17


In [39]:
query4 = "SELECT name, grade FROM students WHERE enrolled_date > '2024-01-16'"
result4 = pd.read_sql(query4, conn)
print(f"SQL: {query4}")
print(result4)

SQL: SELECT name, grade FROM students WHERE enrolled_date > '2024-01-16'
         name  grade
0      Mickey   89.0
1  Eve Wilson   88.0


In [40]:
# List the tables in the database
query5 = "SELECT name FROM sqlite_master WHERE type='table'"
tables = pd.read_sql(query5, conn)
tables

       name
0  students


In [41]:
print("🔍 SELECT QUERIES - The Foundation of SQL\n")

# Query 1: Get everything
print("1️⃣ Get all data (SELECT *)")
query1 = "SELECT * FROM students"
result1 = pd.read_sql(query1, conn)
print(f"SQL: {query1}")
print("Result:")
print(result1)
print("\n" + "="*50 + "\n")

# Query 2: Get specific columns
print("2️⃣ Get specific columns")
query2 = "SELECT name, grade FROM students"
result2 = pd.read_sql(query2, conn)
print(f"SQL: {query2}")
print("Result:")
print(result2)
print("\n" + "="*50 + "\n")

# Query 3: Get with conditions (WHERE)
print("3️⃣ Filter with WHERE")
query3 = "SELECT name, grade FROM students WHERE grade > 85"
result3 = pd.read_sql(query3, conn)
print(f"SQL: {query3}")
print("Result:")
print(result3)

print("\n💡 SELECT is like df[columns] in pandas")
print("💡 WHERE is like df[df['column'] > value] in pandas")

🔍 SELECT QUERIES - The Foundation of SQL

1️⃣ Get all data (SELECT *)
SQL: SELECT * FROM students
Result:
   student_id           name  age  grade enrolled_date
0           1    Alice Smith   20   85.5    2024-01-15
1           2         Mickey   78   89.0    2025-09-09
2           3    Bob Johnson   21   92.0    2024-01-15
3           4  Charlie Brown   19   78.5    2024-01-16
4           5   Diana Prince   22   95.0    2024-01-16
5           6     Eve Wilson   20   88.0    2024-01-17


2️⃣ Get specific columns
SQL: SELECT name, grade FROM students
Result:
            name  grade
0    Alice Smith   85.5
1         Mickey   89.0
2    Bob Johnson   92.0
3  Charlie Brown   78.5
4   Diana Prince   95.0
5     Eve Wilson   88.0


3️⃣ Filter with WHERE
SQL: SELECT name, grade FROM students WHERE grade > 85
Result:
           name  grade
0   Alice Smith   85.5
1        Mickey   89.0
2   Bob Johnson   92.0
3  Diana Prince   95.0
4    Eve Wilson   88.0

💡 SELECT is like df[columns] in pandas
💡 WHERE is like df[df['column'] > value] in pandas

### 3.2 WHERE Conditions - Filtering Data

In [42]:
print("🎯 WHERE CLAUSE - Your Data Filter\n")

# Different types of conditions
conditions = [
    ("Equals", "WHERE age = 20", "df[df['age'] == 20]"),
    ("Not equals", "WHERE age != 20", "df[df['age'] != 20]"),
    ("Greater than", "WHERE grade > 80", "df[df['grade'] > 80]"),
    ("Less than", "WHERE age < 21", "df[df['age'] < 21]"),
    ("BETWEEN", "WHERE grade BETWEEN 80 AND 90", "df[(df['grade'] >= 80) & (df['grade'] <= 90)]"),
    ("IN list", "WHERE age IN (19, 20, 21)", "df[df['age'].isin([19, 20, 21])]"),
    ("LIKE pattern", "WHERE name LIKE 'A%'", "df[df['name'].str.startswith('A')]"),
    ("AND", "WHERE age > 19 AND grade > 80", "df[(df['age'] > 19) & (df['grade'] > 80)]"),
    ("OR", "WHERE age < 20 OR grade > 90", "df[(df['age'] < 20) | (df['grade'] > 90)]")
]

for condition_name, sql_example, pandas_equivalent in conditions:
    print(f"📌 {condition_name}:")
    print(f"   SQL:    SELECT * FROM students {sql_example}")
    print(f"   Pandas: {pandas_equivalent}")
    print()

# Example with multiple conditions
print("\n🔥 Real Example:")
complex_query = """
SELECT name, age, grade 
FROM students 
WHERE grade > 80 
  AND age <= 21
  AND name LIKE '%e%'
"""
result = pd.read_sql(complex_query, conn)
print("SQL:")
print(complex_query)
print("\nResult:")
print(result)

🎯 WHERE CLAUSE - Your Data Filter

📌 Equals:
   SQL:    SELECT * FROM students WHERE age = 20
   Pandas: df[df['age'] == 20]

📌 Not equals:
   SQL:    SELECT * FROM students WHERE age != 20
   Pandas: df[df['age'] != 20]

📌 Greater than:
   SQL:    SELECT * FROM students WHERE grade > 80
   Pandas: df[df['grade'] > 80]

📌 Less than:
   SQL:    SELECT * FROM students WHERE age < 21
   Pandas: df[df['age'] < 21]

📌 BETWEEN:
   SQL:    SELECT * FROM students WHERE grade BETWEEN 80 AND 90
   Pandas: df[(df['grade'] >= 80) & (df['grade'] <= 90)]

📌 IN list:
   SQL:    SELECT * FROM students WHERE age IN (19, 20, 21)
   Pandas: df[df['age'].isin([19, 20, 21])]

📌 LIKE pattern:
   SQL:    SELECT * FROM students WHERE name LIKE 'A%'
   Pandas: df[df['name'].str.startswith('A')]

📌 AND:
   SQL:    SELECT * FROM students WHERE age > 19 AND grade > 80
   Pandas: df[(df['age'] > 19) & (df['grade'] > 80)]

📌 OR:
   SQL:    SELECT * FROM students WHERE age < 20 OR grade > 90
   Pandas: df[(df['age'] < 20) | (df['grade'] > 90)]
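
One caveat about the `LIKE` row above: in SQLite, `LIKE` ignores case for ASCII letters by default, while `str.startswith` is case-sensitive, so the two are not always exact equivalents. The sketch below uses three hypothetical names to show the difference, and SQLite's case-sensitive `GLOB` as an alternative. Other databases may behave differently.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE students (name TEXT)")
names = ["Alice Smith", "adam jones", "Bob Johnson"]  # hypothetical names
demo.executemany("INSERT INTO students VALUES (?)", [(n,) for n in names])

# SQLite's LIKE ignores case for ASCII letters by default.
like_matches = [
    row[0]
    for row in demo.execute(
        "SELECT name FROM students WHERE name LIKE 'A%' ORDER BY name"
    )
]

# Python's str.startswith (and pandas' .str.startswith) is case-sensitive.
startswith_matches = [n for n in names if n.startswith("A")]

# GLOB is SQLite's case-sensitive pattern match; it uses * instead of %.
glob_matches = [
    row[0]
    for row in demo.execute("SELECT name FROM students WHERE name GLOB 'A*'")
]

print(like_matches, startswith_matches, glob_matches)
demo.close()
```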

### 3.3 ORDER BY - Sorting Results

In [43]:
print("📊 ORDER BY - Sorting Your Results\n")

# Sort by grade (highest first)
print("1️⃣ Sort by grade (descending)")
query1 = """
SELECT name, grade 
FROM students 
ORDER BY grade DESC
"""
result1 = pd.read_sql(query1, conn)
print("SQL:")
print(query1)
print("\nResult:")
print(result1)
print("\nPandas equivalent: df.sort_values('grade', ascending=False)")
print("\n" + "="*50 + "\n")

# Sort by multiple columns
print("2️⃣ Sort by multiple columns")
query2 = """
SELECT name, age, grade 
FROM students 
ORDER BY age ASC, grade DESC
"""
result2 = pd.read_sql(query2, conn)
print("SQL:")
print(query2)
print("\nResult:")
print(result2)
print("\nPandas equivalent: df.sort_values(['age', 'grade'], ascending=[True, False])")

📊 ORDER BY - Sorting Your Results

1️⃣ Sort by grade (descending)
SQL:

SELECT name, grade 
FROM students 
ORDER BY grade DESC


Result:
            name  grade
0   Diana Prince   95.0
1    Bob Johnson   92.0
2         Mickey   89.0
3     Eve Wilson   88.0
4    Alice Smith   85.5
5  Charlie Brown   78.5

Pandas equivalent: df.sort_values('grade', ascending=False)


2️⃣ Sort by multiple columns
SQL:

SELECT name, age, grade 
FROM students 
ORDER BY age ASC, grade DESC


Result:
            name  age  grade
0  Charlie Brown   19   78.5
1     Eve Wilson   20   88.0
2    Alice Smith   20   85.5
3    Bob Johnson   21   92.0
4   Diana Prince   22   95.0
5         Mickey   78   89.0

Pandas equivalent: df.sort_values(['age', 'grade'], ascending=[True, False])
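
`ORDER BY` is often paired with `LIMIT` to answer "top N" questions. A minimal sketch, re-creating a few of the student rows in a separate in-memory database so it runs on its own:

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE students (name TEXT, grade REAL)")
demo.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [
        ("Alice Smith", 85.5),
        ("Bob Johnson", 92.0),
        ("Charlie Brown", 78.5),
        ("Diana Prince", 95.0),
        ("Eve Wilson", 88.0),
    ],
)

# Top 3 by grade: sort first, then keep only the first 3 rows.
top_three = demo.execute(
    "SELECT name, grade FROM students ORDER BY grade DESC LIMIT 3"
).fetchall()

print(top_three)
demo.close()
```

The pandas counterpart would be `df.sort_values('grade', ascending=False).head(3)` or `df.nlargest(3, 'grade')`.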


### 3.4 Aggregate Functions - Summarizing Data

In [49]:
print("📈 AGGREGATE FUNCTIONS - Data Summaries\n")

# Common aggregate functions
aggregates = [
    ("COUNT", "SELECT COUNT(*) as total_students FROM students", "len(df)"),
    ("SUM", "SELECT SUM(grade) as total_grades FROM students", "df['grade'].sum()"),
    ("AVG", "SELECT AVG(grade) as average_grade FROM students", "df['grade'].mean()"),
    ("MAX", "SELECT MAX(grade) as highest_grade FROM students", "df['grade'].max()"),
    ("MIN", "SELECT MIN(age) as youngest_age FROM students", "df['age'].min()")
]

for func_name, sql_query, pandas_equiv in aggregates:
    result = pd.read_sql(sql_query, conn)
    print(f"📊 {func_name}:")
    print(f"   SQL: {sql_query}")
    print(f"   Result: {result.iloc[0, 0]}")
    print(f"   Pandas: {pandas_equiv}")
    print()

# All together
print("\n🔥 Multiple aggregations:")
summary_query = """
SELECT 
    COUNT(*) as total_students,
    AVG(age) as avg_age,
    AVG(grade) as avg_grade,
    MIN(grade) as min_grade,
    MAX(grade) as max_grade
FROM students
"""
summary = pd.read_sql(summary_query, conn)
print("SQL:")
print(summary_query)
print("\nResult:")
print(summary)

📈 AGGREGATE FUNCTIONS - Data Summaries

📊 COUNT:
   SQL: SELECT COUNT(*) as total_students FROM students
   Result: 6
   Pandas: len(df)

📊 SUM:
   SQL: SELECT SUM(grade) as total_grades FROM students
   Result: 528.0
   Pandas: df['grade'].sum()

📊 AVG:
   SQL: SELECT AVG(grade) as average_grade FROM students
   Result: 88.0
   Pandas: df['grade'].mean()

📊 MAX:
   SQL: SELECT MAX(grade) as highest_grade FROM students
   Result: 95.0
   Pandas: df['grade'].max()

📊 MIN:
   SQL: SELECT MIN(age) as youngest_age FROM students
   Result: 19
   Pandas: df['age'].min()


🔥 Multiple aggregations:
SQL:

SELECT 
    COUNT(*) as total_students,
    AVG(age) as avg_age,
    AVG(grade) as avg_grade,
    MIN(grade) as min_grade,
    MAX(grade) as max_grade
FROM students


Result:
   total_students  avg_age  avg_grade  min_grade  max_grade
0               6     30.0       88.0       78.5       95.0
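
Aggregates treat missing values in a specific way: `COUNT(*)` counts every row, while `COUNT(column)`, `SUM` and `AVG` skip NULLs. The sketch below shows this on a hypothetical three-row table where one grade is missing.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE students (name TEXT, grade REAL)")
demo.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("Alice", 80.0), ("Bob", 90.0), ("Cara", None)],  # None is stored as NULL
)

total_rows, graded_rows, avg_grade = demo.execute(
    """
    SELECT COUNT(*),      -- counts every row
           COUNT(grade),  -- counts only rows where grade is not NULL
           AVG(grade)     -- averages the non-NULL grades
    FROM students
    """
).fetchone()

print(total_rows, graded_rows, avg_grade)
demo.close()
```

The average is 85.0 (170 / 2), not 56.67 (170 / 3), which matches how `df['grade'].mean()` skips NaN in pandas.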


### 3.5 GROUP BY - Grouping Data

In [52]:
print("👥 GROUP BY - Analyzing Groups\n")

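# First, let's add a department column for grouping
# (ALTER TABLE fails if the column already exists, so run this cell only once)
cursor.execute("ALTER TABLE students ADD COLUMN department TEXT")
# Only five departments for six students: student 6 keeps a NULL department
departments = ['Engineering', 'Science', 'Arts', 'Engineering', 'Science']
for i, dept in enumerate(departments, 1):
    # Use ? placeholders instead of building the SQL string by hand
    cursor.execute("UPDATE students SET department = ? WHERE student_id = ?", (dept, i))
conn.commit()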

# Now let's group!
print("Average grade by department:")
group_query = """
SELECT 
    department,
    COUNT(*) as student_count,
    AVG(grade) as avg_grade,
    MAX(grade) as top_grade
FROM students
GROUP BY department
ORDER BY avg_grade DESC
"""

result = pd.read_sql(group_query, conn)
print("SQL:")
print(group_query)
print("\nResult:")
print(result)
print("\nPandas equivalent:")
print("df.groupby('department').agg({")
print("    'grade': ['count', 'mean', 'max']")
print("})")

👥 GROUP BY - Analyzing Groups

Average grade by department:
SQL:

SELECT 
    department,
    COUNT(*) as student_count,
    AVG(grade) as avg_grade,
    MAX(grade) as top_grade
FROM students
GROUP BY department
ORDER BY avg_grade DESC


Result:
    department  student_count  avg_grade  top_grade
0      Science              2       92.0       95.0
1         Arts              1       92.0       92.0
2         None              1       88.0       88.0
3  Engineering              2       82.0       85.5

Pandas equivalent:
df.groupby('department').agg({
    'grade': ['count', 'mean', 'max']
})
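
In the result above, the student without a department appears as its own `None` group, because SQL's `GROUP BY` puts all NULLs into one group. One way to label that group explicitly is `COALESCE`, sketched below on a small hypothetical subset of the data. The `dept_label` alias and the `'Unassigned'` text are illustrative choices.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE students (name TEXT, grade REAL, department TEXT)")
demo.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [
        ("Alice Smith", 85.5, "Engineering"),
        ("Charlie Brown", 78.5, "Engineering"),
        ("Diana Prince", 95.0, "Science"),
        ("Eve Wilson", 88.0, None),  # no department assigned
    ],
)

grouped = demo.execute(
    """
    SELECT COALESCE(department, 'Unassigned') AS dept_label,
           COUNT(*)   AS student_count,
           AVG(grade) AS avg_grade
    FROM students
    GROUP BY dept_label
    ORDER BY dept_label
    """
).fetchall()

for row in grouped:
    print(row)
demo.close()
```

Keep in mind that pandas behaves differently here: `df.groupby('department')` drops rows with a missing key by default, so pass `dropna=False` if you want the result to match SQL.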


---

## 🔄 Part 4: Converting Between Pandas and SQL

The real power comes from seamlessly moving between pandas and SQL!

### 4.1 DataFrame to SQL Database

In [54]:
print("🔄 CONVERTING PANDAS TO SQL\n")

# Create a sample DataFrame
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=30, freq='D'),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet'], 30),
    'quantity': np.random.randint(1, 10, 30),
    'price': np.random.uniform(100, 1000, 30).round(2),
    'customer_id': np.random.randint(1, 10, 30)
})

# Calculate revenue
sales_data['revenue'] = sales_data['quantity'] * sales_data['price']

print("📊 Original DataFrame:")
print(sales_data.head())
print(f"\nShape: {sales_data.shape}")

# Save to SQL database
print("\n💾 Saving to SQL...")
sales_data.to_sql(
    name='sales',           # Table name
    con=conn,               # Database connection
    if_exists='replace',    # Replace if table exists
    index=False            # Don't save the index
)

print("✅ DataFrame saved to SQL!")

# Verify it worked
verify_query = "SELECT COUNT(*) as row_count FROM sales"
result = pd.read_sql(verify_query, conn)
print(f"\n🔍 Verification: {result.iloc[0, 0]} rows in SQL table")

print("\n📝 Key Parameters for to_sql():")
print("  - name: Table name in database")
print("  - con: Database connection")
print("  - if_exists: 'fail', 'replace', or 'append'")
print("  - index: Save DataFrame index as column?")
print("  - dtype: Specify SQL data types")
print("  - method: How rows are inserted (None = one INSERT per row, 'multi' = several rows per INSERT)")

🔄 CONVERTING PANDAS TO SQL

📊 Original DataFrame:
        date product  quantity   price  customer_id  revenue
0 2024-01-01  Laptop         4  452.17            4  1808.68
1 2024-01-02   Phone         2  561.11            3  1122.22
2 2024-01-03   Phone         2  567.07            4  1134.14
3 2024-01-04   Phone         8  433.13            2  3465.04
4 2024-01-05   Phone         9  851.50            7  7663.50

Shape: (30, 6)

💾 Saving to SQL...
✅ DataFrame saved to SQL!

🔍 Verification: 30 rows in SQL table

📝 Key Parameters for to_sql():
  - name: Table name in database
  - con: Database connection
  - if_exists: 'fail', 'replace', or 'append'
  - index: Save DataFrame index as column?
  - dtype: Specify SQL data types
  - method: How rows are inserted (None = one INSERT per row, 'multi' = several rows per INSERT)


### 4.2 SQL Query Results to DataFrame

In [55]:
print("📥 READING SQL INTO PANDAS\n")

# Method 1: Simple query
print("1️⃣ Simple Query:")
query = "SELECT * FROM sales WHERE quantity > 5"
df1 = pd.read_sql(query, conn)
print(f"Query: {query}")
print(f"Result shape: {df1.shape}")
print(df1.head())

print("\n" + "="*50 + "\n")

# Method 2: Complex query with aggregation
print("2️⃣ Complex Query:")
complex_query = """
SELECT 
    product,
    COUNT(*) as total_sales,
    SUM(quantity) as total_quantity,
    AVG(price) as avg_price,
    SUM(revenue) as total_revenue
FROM sales
GROUP BY product
ORDER BY total_revenue DESC
"""

df2 = pd.read_sql(complex_query, conn)
print("Query: [Complex aggregation query]")
print("\nResult:")
print(df2)

# Now we can use pandas operations on the result!
print("\n🎯 Now we can use pandas on the SQL results:")
print(f"Most profitable product: {df2.iloc[0]['product']}")
print(f"Total revenue across all products: ${df2['total_revenue'].sum():,.2f}")

📥 READING SQL INTO PANDAS

1️⃣ Simple Query:
Query: SELECT * FROM sales WHERE quantity > 5
Result shape: (15, 6)
                  date product  quantity   price  customer_id  revenue
0  2024-01-04 00:00:00   Phone         8  433.13            2  3465.04
1  2024-01-05 00:00:00   Phone         9  851.50            7  7663.50
2  2024-01-06 00:00:00   Phone         6  130.20            5   781.20
3  2024-01-07 00:00:00   Phone         6  144.74            3   868.44
4  2024-01-08 00:00:00   Phone         8  580.97            4  4647.76


2️⃣ Complex Query:
Query: [Complex aggregation query]

Result:
  product  total_sales  total_quantity   avg_price  total_revenue
0   Phone           13              68  474.267692       32703.93
1  Tablet           10              52  533.499000       28681.51
2  Laptop            7              38  501.705714       17849.20

🎯 Now we can use pandas on the SQL results:
Most profitable product: Phone
Total revenue across all products: $79,234.64
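
Notice that the `date` column came back as text such as `2024-01-04 00:00:00`. SQLite stores these timestamps as strings, so date filters are really string comparisons. The sketch below, on three hypothetical rows, shows how that can silently drop the last day of a range and how `DATE()` avoids it.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (date TEXT, revenue REAL)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [
        ("2024-01-06 00:00:00", 100.0),
        ("2024-01-07 00:00:00", 200.0),
        ("2024-01-08 00:00:00", 300.0),
    ],
)

# Text comparison: '2024-01-07 00:00:00' sorts after '2024-01-07', so Jan 7 is dropped.
raw_text = demo.execute(
    "SELECT date FROM sales WHERE date BETWEEN '2024-01-01' AND '2024-01-07'"
).fetchall()

# DATE() trims the time part first, so Jan 7 is included.
with_date = demo.execute(
    "SELECT date FROM sales WHERE DATE(date) BETWEEN '2024-01-01' AND '2024-01-07'"
).fetchall()

print(len(raw_text), len(with_date))
demo.close()
```

On the pandas side, `pd.read_sql(query, conn, parse_dates=['date'])` converts the text column back into real datetimes when loading.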


### 4.3 Updating and Deleting Data

In [56]:
print("✏️ UPDATING DATA IN SQL\n")

# UPDATE: Change existing data
print("Before update:")
before = pd.read_sql("SELECT * FROM students WHERE student_id = 1", conn)
print(before)

# Update grade
update_sql = """
UPDATE students 
SET grade = 90.0 
WHERE student_id = 1
"""
cursor.execute(update_sql)
conn.commit()

print("\nAfter update:")
after = pd.read_sql("SELECT * FROM students WHERE student_id = 1", conn)
print(after)

print("\n" + "="*50 + "\n")

# DELETE: Remove data
print("🗑️ DELETING DATA\n")

# Count before
before_count = pd.read_sql("SELECT COUNT(*) as count FROM sales", conn).iloc[0, 0]
print(f"Records before delete: {before_count}")

# Delete records with low quantity
delete_sql = "DELETE FROM sales WHERE quantity = 1"
cursor.execute(delete_sql)
conn.commit()

# Count after
after_count = pd.read_sql("SELECT COUNT(*) as count FROM sales", conn).iloc[0, 0]
print(f"Records after delete: {after_count}")
print(f"Deleted {before_count - after_count} records")

print("\n⚠️ WARNING: Once committed, DELETE is permanent! Always back up important data!")

✏️ UPDATING DATA IN SQL

Before update:
   student_id         name  age  grade enrolled_date   department
0           1  Alice Smith   20   85.5    2024-01-15  Engineering

After update:
   student_id         name  age  grade enrolled_date   department
0           1  Alice Smith   20   90.0    2024-01-15  Engineering


🗑️ DELETING DATA

Records before delete: 30
Records after delete: 28
Deleted 2 records
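
A DELETE only becomes permanent once it is committed. As a hedged sketch of a safer workflow, the block below checks `cursor.rowcount` to see how many rows a DELETE touched and then undoes it with `rollback()`. The three-row `sales` table is hypothetical.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE sales (product TEXT, quantity INTEGER)")
demo.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("Laptop", 1), ("Phone", 4), ("Tablet", 1)],
)
demo.commit()

cur = demo.execute("DELETE FROM sales WHERE quantity = ?", (1,))
rows_deleted = cur.rowcount  # how many rows the DELETE touched

# Not committed yet, so the delete can still be undone.
demo.rollback()
rows_after_rollback = demo.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

print(rows_deleted, rows_after_rollback)
demo.close()
```

If the row count looks right, call `commit()` instead of `rollback()` to keep the change.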



---

## 🚀 Part 5: Essential SQL Patterns for Data Science

Let's learn the SQL patterns you'll use every day as a data scientist!

### 5.1 JOIN - Combining Tables

In [57]:
print("🔗 JOINS - Combining Multiple Tables\n")

# Create related tables
# Customers table
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'Boston']
})

# Orders table
orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 1],
    'product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Mouse'],
    'amount': [1000, 800, 500, 1200, 50]
})

# Save to database
customers_df.to_sql('customers_table', conn, if_exists='replace', index=False)
orders_df.to_sql('orders_table', conn, if_exists='replace', index=False)

print("📊 Customers:")
print(customers_df)
print("\n📊 Orders:")
print(orders_df)

print("\n" + "="*50 + "\n")

# JOIN them together
join_query = """
SELECT 
    c.name,
    c.city,
    o.product,
    o.amount
FROM customers_table c
JOIN orders_table o ON c.customer_id = o.customer_id
ORDER BY c.name
"""

print("🔗 JOINED Result:")
print("SQL:")
print(join_query)
print("\nResult:")
joined = pd.read_sql(join_query, conn)
print(joined)

print("\n💡 This is like pd.merge(customers, orders, on='customer_id') in pandas!")

🔗 JOINS - Combining Multiple Tables

📊 Customers:
   customer_id     name     city
0            1    Alice      NYC
1            2      Bob       LA
2            3  Charlie  Chicago
3            4    Diana      NYC
4            5      Eve   Boston

📊 Orders:
   order_id  customer_id product  amount
0       101            1  Laptop    1000
1       102            2   Phone     800
2       103            1  Tablet     500
3       104            3  Laptop    1200
4       105            1   Mouse      50


🔗 JOINED Result:
SQL:

SELECT 
    c.name,
    c.city,
    o.product,
    o.amount
FROM customers_table c
JOIN orders_table o ON c.customer_id = o.customer_id
ORDER BY c.name


Result:
      name     city product  amount
0    Alice      NYC  Laptop    1000
1    Alice      NYC   Mouse      50
2    Alice      NYC  Tablet     500
3      Bob       LA   Phone     800
4  Charlie  Chicago  Laptop    1200

💡 This is like pd.merge(customers, orders, on='customer_id') in pandas!
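
The plain `JOIN` above is an inner join, so Diana and Eve, who have no orders, disappear from the result. A `LEFT JOIN` keeps them. The sketch below rebuilds trimmed-down copies of the two tables in a separate in-memory database and totals spending per customer.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE customers_table (customer_id INTEGER, name TEXT)")
demo.execute(
    "CREATE TABLE orders_table (order_id INTEGER, customer_id INTEGER, amount INTEGER)"
)
demo.executemany(
    "INSERT INTO customers_table VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Charlie"), (4, "Diana"), (5, "Eve")],
)
demo.executemany(
    "INSERT INTO orders_table VALUES (?, ?, ?)",
    [(101, 1, 1000), (102, 2, 800), (103, 1, 500), (104, 3, 1200), (105, 1, 50)],
)

# LEFT JOIN keeps every customer; customers without orders get NULL order columns.
order_totals = demo.execute(
    """
    SELECT c.name,
           COUNT(o.order_id)          AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_spent
    FROM customers_table c
    LEFT JOIN orders_table o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.name
    """
).fetchall()

for row in order_totals:
    print(row)
demo.close()
```

In pandas, the same join type is `pd.merge(customers_df, orders_df, on='customer_id', how='left')`.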


### 5.2 Common Data Science Queries

In [58]:
print("📊 COMMON DATA SCIENCE SQL PATTERNS\n")

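# Pattern 1: Find duplicates (GROUP BY + HAVING)
# Here we flag products that appear more than 5 times; for true duplicates,
# group by the columns that should be unique and use HAVING COUNT(*) > 1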
print("1️⃣ Finding Duplicates:")
duplicate_query = """
SELECT 
    product,
    COUNT(*) as count
FROM sales
GROUP BY product
HAVING COUNT(*) > 5
"""
print(f"SQL: {duplicate_query}")
duplicates = pd.read_sql(duplicate_query, conn)
print("Products appearing more than 5 times:")
print(duplicates)

print("\n" + "="*50 + "\n")

# Pattern 2: Date filtering
print("2️⃣ Date Range Filtering:")
date_query = """
SELECT 
    DATE(date) as sale_date,
    SUM(revenue) as daily_revenue
FROM sales
WHERE DATE(date) BETWEEN '2024-01-01' AND '2024-01-07'  -- date is stored as 'YYYY-MM-DD HH:MM:SS' text
GROUP BY DATE(date)
ORDER BY sale_date
"""
print(f"SQL: [Date range query]")
date_results = pd.read_sql(date_query, conn)
print("First week revenue:")
print(date_results.head())

print("\n" + "="*50 + "\n")

# Pattern 3: Top N per category
print("3️⃣ Top N per Category:")
top_n_query = """
SELECT * FROM (
    SELECT 
        product,
        date,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY product ORDER BY revenue DESC) as sales_rank
    FROM sales
) ranked
WHERE sales_rank <= 3
"""
print("Get top 3 sales for each product")
print("(This uses window functions - advanced SQL!)")

📊 COMMON DATA SCIENCE SQL PATTERNS

1️⃣ Finding Duplicates:
SQL: 
SELECT 
    product,
    COUNT(*) as count
FROM sales
GROUP BY product
HAVING COUNT(*) > 5

Products appearing more than 5 times:
  product  count
0  Laptop      7
1   Phone     12
2  Tablet      9


2️⃣ Date Range Filtering:
SQL: [Date range query]
First week revenue:
    sale_date  daily_revenue
0  2024-01-01        1808.68
1  2024-01-02        1122.22
2  2024-01-03        1134.14
3  2024-01-04        3465.04
4  2024-01-05        7663.50


3️⃣ Top N per Category:
Get top 3 sales for each product
(This uses window functions - advanced SQL!)
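
The cell above only defines the top-N query and never runs it. Here is a minimal runnable sketch of the same `ROW_NUMBER()` pattern. It assumes a small made-up `sales` table in an in-memory database (the rows and the `demo_conn` name are illustrative, not the notebook's data) and an SQLite build of 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Illustrative in-memory table; these rows are invented for the demo
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (product TEXT, date TEXT, revenue REAL)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Laptop", "2024-01-01", 1200.0),
        ("Laptop", "2024-01-02", 900.0),
        ("Laptop", "2024-01-03", 1500.0),
        ("Phone", "2024-01-01", 800.0),
        ("Phone", "2024-01-02", 650.0),
        ("Phone", "2024-01-03", 700.0),
    ],
)

# ROW_NUMBER() restarts at 1 for each product (PARTITION BY) and numbers
# rows from highest to lowest revenue; the outer query keeps the top 2.
top_n_demo_query = """
SELECT product, date, revenue, rn
FROM (
    SELECT
        product,
        date,
        revenue,
        ROW_NUMBER() OVER (PARTITION BY product ORDER BY revenue DESC) AS rn
    FROM sales
) ranked
WHERE rn <= 2
ORDER BY product, rn
"""
top_rows = demo_conn.execute(top_n_demo_query).fetchall()
for row in top_rows:
    print(row)

demo_conn.close()
```

The subquery is needed because a window function's result cannot be referenced in the `WHERE` clause of the same `SELECT`; the filter has to run one level up.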


---

## 💪 Part 6: Practice Exercises

Time to test your SQL skills!

### Exercise 1: Basic Queries

In [60]:
# checking the names of tables in the system
table_name_query = "SELECT name FROM sqlite_master WHERE type='table'"
table_name = pd.read_sql(table_name_query, conn)
table_name

              name
0         students
1            sales
2  customers_table
3     orders_table


In [61]:
# checking the columns' names of table 'students'
columns_query = "SELECT * FROM students LIMIT 0"
columns_df = pd.read_sql(columns_query, conn)
list(columns_df.columns)

['student_id', 'name', 'age', 'grade', 'enrolled_date', 'department']
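
`SELECT * ... LIMIT 0` returns the column names only. When the column types matter as well, SQLite's `PRAGMA table_info` is an alternative. The sketch below runs it against a stand-in `students` table in an in-memory database; the column types in the `CREATE TABLE` statement are assumptions, since the notebook output above shows only the names.

```python
import sqlite3

# Stand-in table; the declared types are assumed for illustration
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("""
CREATE TABLE students (
    student_id INTEGER PRIMARY KEY,
    name TEXT,
    age INTEGER,
    grade REAL,
    enrolled_date TEXT,
    department TEXT
)
""")

# Each row is (cid, name, type, notnull, dflt_value, pk)
schema_rows = demo_conn.execute("PRAGMA table_info(students)").fetchall()
column_types = {row[1]: row[2] for row in schema_rows}
for column, declared_type in column_types.items():
    print(f"{column}: {declared_type}")

demo_conn.close()
```

`PRAGMA` statements are specific to SQLite; other databases expose the same information through their own catalog views.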

In [65]:
print("📝 EXERCISE 1: Write these queries\n")

print("1. Get all students with grade above 90")
print("   Your query: SELECT * FROM students WHERE grade > 90")
print("")

grade_query = "SELECT * FROM students WHERE grade > 90"
grade_df = pd.read_sql(grade_query,conn)
print(f"All students with grade above 90: \n{grade_df}")

📝 EXERCISE 1: Write these queries

1. Get all students with grade above 90
   Your query: SELECT * FROM students WHERE grade > 90

All students with grade above 90: 
   student_id          name  age  grade enrolled_date department
0           3   Bob Johnson   21   92.0    2024-01-15       Arts
1           5  Diana Prince   22   95.0    2024-01-16    Science


In [66]:
# checking the columns' names of table 'sales'
sales_columns_query = "SELECT * FROM sales LIMIT 0"
sales_columns_df = pd.read_sql(sales_columns_query,conn)
list(sales_columns_df.columns)

['date', 'product', 'quantity', 'price', 'customer_id', 'revenue']

In [71]:
print("2. Count how many sales were made for each product")
print("   Your query: SELECT product, COUNT(*) as sales_count FROM sales GROUP BY product")
print("")

sales_count_query = "SELECT product, COUNT(*) as sales_count FROM sales GROUP BY product"
sales_count_df = pd.read_sql(sales_count_query,conn)
print(f"Sales count for each product: \n{sales_count_df}")

total_revenue_query = "SELECT product, SUM(revenue) as total_revenue FROM sales GROUP BY product"
total_revenue_df = pd.read_sql(total_revenue_query, conn)
print(f"\nTotal revenue for each product: \n{total_revenue_df}")

2. Count how many sales were made for each product
   Your query: SELECT product, COUNT(*) as sales_count FROM sales GROUP BY product

Sales count for each product: 
  product  sales_count
0  Laptop            7
1   Phone           12
2  Tablet            9

Total revenue for each product: 
  product  total_revenue
0  Laptop       17849.20
1   Phone       32398.93
2  Tablet       28180.49


In [79]:
print("3. Find the average price for products sold in quantities > 5")
print("   Your query: SELECT AVG(price) as avg_price FROM sales WHERE quantity > 5")
print("")

avg_price_query = "SELECT AVG(price) as avg_price FROM sales WHERE quantity > 5"
avg_price_df = pd.read_sql(avg_price_query, conn)
print(f"Average price for products sold in quantities above 5: \n{avg_price_df.values}")

3. Find the average price for products sold in quantities > 5
   Your query: SELECT AVG(price) as avg_price FROM sales WHERE quantity > 5

Average price for products sold in quantities above 5: 
[[477.06066667]]
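
The query above uses `WHERE`, which filters individual rows before any aggregation happens. If the question were instead about products whose total quantity exceeds 5, the filter would belong in `HAVING`. The sketch below contrasts the two on a small invented `sales` table (rows and the `demo_conn` name are illustrative only).

```python
import sqlite3

# Invented rows, chosen so the two filters give different answers
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, price REAL)")
demo_conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("Laptop", 6, 1000.0),
        ("Laptop", 2, 1100.0),
        ("Phone", 8, 500.0),
        ("Phone", 7, 600.0),
        ("Tablet", 3, 300.0),
    ],
)

# WHERE: drop rows with quantity <= 5 first, then average what is left
where_rows = demo_conn.execute("""
SELECT product, AVG(price) AS avg_price
FROM sales
WHERE quantity > 5
GROUP BY product
ORDER BY product
""").fetchall()

# HAVING: average every row per product, then keep groups whose total
# quantity is above 5
having_rows = demo_conn.execute("""
SELECT product, AVG(price) AS avg_price
FROM sales
GROUP BY product
HAVING SUM(quantity) > 5
ORDER BY product
""").fetchall()

print("WHERE :", where_rows)
print("HAVING:", having_rows)
demo_conn.close()
```

The Laptop average differs between the two results because the `WHERE` version discards the 2-unit sale before averaging, while the `HAVING` version keeps it.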


In [None]:
# Solutions (kept inside a triple-quoted string; remove the quotes to run them)
"""
# Solution 1:
solution1 = "SELECT * FROM students WHERE grade > 90"

# Solution 2:
solution2 = "SELECT product, COUNT(*) as sales_count FROM sales GROUP BY product"

# Solution 3:
solution3 = "SELECT AVG(price) as avg_price FROM sales WHERE quantity > 5"
"""

### Exercise 2: Create Your Own Database

In [82]:
print("🏗️ EXERCISE 2: Create a Movie Database\n")

print("Your task:")
print("1. Create a 'movies' table with columns:")
print("   - movie_id (INTEGER PRIMARY KEY)")
print("   - title (TEXT)")
print("   - year (INTEGER)")
print("   - rating (REAL)")
print("")
print("2. Insert at least 3 movies")
print("")
print("3. Query to find movies with rating > 8.0")
print("")

# Your code here:


🏗️ EXERCISE 2: Create a Movie Database

Your task:
1. Create a 'movies' table with columns:
   - movie_id (INTEGER PRIMARY KEY)
   - title (TEXT)
   - year (INTEGER)
   - rating (REAL)

2. Insert at least 3 movies

3. Query to find movies with rating > 8.0



In [80]:
conn_movies = sqlite3.connect('movies.db')
cursor_movies = conn_movies.cursor()

In [81]:
# Drop table if it exists
cursor_movies.execute("DROP TABLE IF EXISTS movies")

<sqlite3.Cursor at 0x1ef96da2240>

In [83]:
# Create table
create_movie_table = """
CREATE TABLE movies (
    movie_id INTEGER PRIMARY KEY,
    title TEXT,
    year INTEGER,
    rating REAL
)
"""

cursor_movies.execute(create_movie_table)

<sqlite3.Cursor at 0x1ef96da2240>

In [84]:
# Insert at least 3 movies
movies_data = [
    (1,'Pinocchio',2022,8.8),
    (2,'Perfect Day',2025,7.6),
    (3,'Your Name',2014,9.2)
]
movie_query = "INSERT INTO movies (movie_id, title, year, rating) VALUES (?,?,?,?)"
cursor_movies.executemany(movie_query,movies_data)
conn_movies.commit()

In [85]:
# Movies with rating > 8
rating_query = "SELECT * FROM movies WHERE rating > 8"
rating_df = pd.read_sql(rating_query,conn_movies)
print(f"Movies with rating > 8: \n{rating_df}")

Movies with rating > 8: 
   movie_id      title  year  rating
0         1  Pinocchio  2022     8.8
1         3  Your Name  2014     9.2
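
The insert step above already uses `?` placeholders. The same approach works for `SELECT` filters whenever the threshold comes from a variable or from user input. Below is a minimal sketch on an in-memory copy of the exercise data; `movies_rated_above` and `demo_conn` are hypothetical names introduced for this example.

```python
import sqlite3

# In-memory copy of the exercise table, so movies.db is not touched
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE movies (movie_id INTEGER PRIMARY KEY, title TEXT, year INTEGER, rating REAL)"
)
demo_conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?, ?)",
    [(1, "Pinocchio", 2022, 8.8), (2, "Perfect Day", 2025, 7.6), (3, "Your Name", 2014, 9.2)],
)


def movies_rated_above(connection, min_rating):
    """Return (title, rating) pairs above min_rating, highest first."""
    query = "SELECT title, rating FROM movies WHERE rating > ? ORDER BY rating DESC"
    # The value travels separately from the SQL text, so it is never parsed as SQL
    return connection.execute(query, (min_rating,)).fetchall()


high_rated = movies_rated_above(demo_conn, 8.0)
print(high_rated)

# A hostile-looking string is compared as a plain value and matches nothing
hostile_title = "x'; DROP TABLE movies; --"
matches = demo_conn.execute(
    "SELECT title FROM movies WHERE title = ?", (hostile_title,)
).fetchall()
remaining = demo_conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(matches, remaining)

demo_conn.close()
```

With pandas, the same values can be passed through the `params` argument, for example `pd.read_sql(query, conn_movies, params=(8.0,))`.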


---

## 🎯 SQL Cheat Sheet

Keep this handy!

In [53]:
print("📋 SQL CHEAT SHEET\n")

cheat_sheet = """
🔍 BASIC QUERIES:
SELECT * FROM table                          -- Get everything
SELECT col1, col2 FROM table                 -- Get specific columns
SELECT DISTINCT col FROM table               -- Unique values only
SELECT * FROM table LIMIT 10                 -- First 10 rows

🎯 FILTERING:
WHERE col = 'value'                          -- Exact match
WHERE col != 'value'                         -- Not equal
WHERE col > 100                              -- Greater than
WHERE col BETWEEN 10 AND 20                  -- Range
WHERE col IN ('A', 'B', 'C')                -- In list
WHERE col LIKE 'A%'                          -- Starts with A
WHERE col IS NULL                            -- Null values
WHERE col IS NOT NULL                        -- Non-null values

📊 AGGREGATION:
COUNT(*)                                      -- Count rows
COUNT(DISTINCT col)                          -- Count unique
SUM(col)                                      -- Sum values
AVG(col)                                      -- Average
MAX(col) / MIN(col)                          -- Max/Min

👥 GROUPING:
GROUP BY col                                  -- Group rows
HAVING COUNT(*) > 5                          -- Filter groups

📈 SORTING:
ORDER BY col ASC                             -- Sort ascending
ORDER BY col DESC                            -- Sort descending
ORDER BY col1, col2                          -- Multiple columns

🔗 JOINING:
JOIN table2 ON table1.col = table2.col       -- Inner join
LEFT JOIN table2 ON ...                      -- Keep all from left
RIGHT JOIN table2 ON ...                     -- Keep all from right (SQLite 3.39+)

✏️ MODIFYING DATA:
INSERT INTO table (col1, col2) VALUES (?, ?) -- Insert data
UPDATE table SET col = value WHERE ...       -- Update data
DELETE FROM table WHERE ...                  -- Delete rows

🏗️ TABLE OPERATIONS:
CREATE TABLE name (...)                      -- Create table
DROP TABLE name                              -- Delete table
ALTER TABLE name ADD COLUMN col TYPE         -- Add column
"""

print(cheat_sheet)


---

## 🎓 Key Takeaways

Congratulations! You now know SQL fundamentals! Here's what you learned:

1. **What SQL Is**: A language for talking to databases
2. **Why SQL Matters**: Handle data too big for pandas
3. **Basic Operations**: SELECT, WHERE, ORDER BY, GROUP BY
4. **Pandas Integration**: Seamlessly convert between DataFrames and SQL
5. **CRUD Operations**: Create, Read, Update, Delete
6. **Joins**: Combine multiple tables
7. **Best Practices**: Use parameters to prevent SQL injection

---

## 🚀 Next Steps

You're ready to:
1. Work through the remaining SQL notebooks in this series
2. Query real databases
3. Build data pipelines
4. Ace SQL interview questions

**Remember**: SQL + Pandas = Data Science Superpower! 💪

In [86]:
# Cleanup: close both connections opened in this notebook
conn.close()
conn_movies.close()
print("✅ Database connections closed.")
print("🎉 Congratulations! You now know SQL basics!")
print("")
print("📚 Next: Try the practice exercises, then move on to the next notebook!")

✅ Database connections closed.
🎉 Congratulations! You now know SQL basics!

📚 Next: Try the practice exercises, then move on to the next notebook!
