A comprehensive SQL tutorial designed for STAT 471 students, covering environment setup, basic queries, and data sampling with real-world databases.
- Course Code: STAT4710J
- Course Name: Data Engineering and Analytics at Scale
- Institution: Shanghai Jiao Tong University Global College
- Instructor: Ailin Zhang
- Assignment: Group Challenge
- Zeyi Chen
- Leyi Chen
- Yangyu Li
Shanghai Jiao Tong University Global College
This project is a Group Challenge assignment for STAT4710J course. Our team (Group 9) has created a comprehensive entry-level SQL tutorial that covers foundational SQL concepts and practical applications using three real-world databases.
By completing this tutorial, students will be able to:
- Set up SQL environments using DuckDB and SQLite
- Write basic SQL queries with SELECT, FROM, WHERE, and DISTINCT clauses
- Sort and sample data using ORDER BY, LIMIT, OFFSET, and RANDOM()
- Work with real databases and practice data exploration
- STAT 471 students (future cohorts) who are new to SQL
- Data enthusiasts seeking practical SQL fundamentals
- Anyone interested in learning data query and analysis
# Create a new conda environment from the configuration
conda env create -f environment.yml
# Activate the environment
conda activate stat471
# Verify installation
python -c "import duckdb; import jupysql; print('All dependencies installed successfully!')"cd /path/to/stat471_group
jupyter notebookOpen the main Jupyter Notebook file to start learning:
STAT4710J_Group9.ipynb(main tutorial - complete integrated version)
stat471_group/
├── README.md # This file
├── TODO.md # Task breakdown and checklist
├── requirements.txt # Pip dependencies (alternative)
├── environment.yml # Conda environment configuration
├── data/ # Data directory
│ ├── basic_examples.db # Simple teaching examples (Dragon, Dish, Scene)
│ ├── duckdb_example.db # Student grade data (student, grade, assignment)
│ └── imdbmini.db # Mini IMDB movie database (Title, Name, Role, Rating)
├── data_overview/ # Data analysis and documentation
│ ├── analyze_databases.py # Script to generate database overview
│ └── database_overview.txt # Database structure summary
└── notebooks/ # Jupyter notebooks
├── STAT4710J_Group9.ipynb # Main tutorial (integrated version)
├── 01_Topic_5.1_Environment_Tools.ipynb
├── 02_Topic_5.2_Basic_Queries.ipynb
└── 03_Topic_5.3_Sorting_Sampling.ipynb
The tutorial is organized into three core topics:
Focus: Setting up SQL development environment
-
Concepts Covered:
- SQL Magic Commands (
%sql,%%sql) in Jupyter - DuckDB vs SQLite comparison
- Database connection and basic setup
- SQL Magic Commands (
-
Database Used:
basic_examples.db/duckdb_example.db
Focus: Fundamentals of SQL querying
-
Concepts Covered:
- SELECT statement
- FROM clause
- WHERE filtering
- DISTINCT for deduplication
- AS for column aliasing
-
Database Used:
basic_examples.db -
Example Queries:
- Find all dragons with cute rating > 0
- Get unique dish types in the menu
- Apply WHERE conditions to filter results
Focus: Working with large datasets
-
Concepts Covered:
- ORDER BY (ascending/descending)
- LIMIT for result limiting
- OFFSET for pagination
- RANDOM() for random sampling
-
Database Used:
imdbmini.db -
Example Queries:
- Find top 10 longest movies
- Randomly recommend a movie
- Implement pagination with OFFSET
Completed Deliverables:
-
Jupyter Notebook Tutorial
STAT4710J_Group9.ipynb: 55 cells, fully executable- Clear explanations and code examples
- Running results and outputs
- Lab-style format with progressive complexity
- All cells executed successfully (execution counts 139-161)
-
Video Tutorials
- Individual videos: max 5 minutes per person
- Total group length: ≤ 15 minutes
- English subtitles required
- Topics: 5.1, 5.2, 5.3 (one per team member)
-
Presentation Slides
- Key concepts summary
- Code examples and results
- Visual aids for clarity
Group 9 Members:
- Zeyi Chen: Topic 5.1 (Environment & Tools) + Video Introduction
- Leyi Chen: Topic 5.2 (Basic Queries) + Core Concepts
- Yangyu Li: Topic 5.3 (Sorting & Sampling) + Video Conclusion
Team Collaboration:
- Regular code review for consistency
- Integrated SQL magic approach across all topics
- Complete notebook execution and validation
Small teaching database with simple relational structure:
- Dragon table: Dragon names, years, cute ratings
- Dish table: Menu items with types and costs
- Scene table: Locations with biomes, cities, visitor counts
Academic database for grade tracking:
- Student table: Student information (ID, name, email)
- Assignment table: Assignment descriptions
- Grade table: Student grades for each assignment
Large-scale movie database:
- Title table: Movie information (7,574 movies)
- Name table: Cast and crew (31,183 people)
- Role table: Cast/crew assignments (75,226 entries)
- Rating table: Movie ratings and votes
- duckdb: Fast analytical database engine
- jupysql: SQL magic commands for Jupyter
- pandas: Data manipulation and analysis
- jupyter: Interactive notebook environment
- matplotlib / plotly: Visualization tools
See environment.yml or requirements.txt for specific versions.
conda env create -f environment.yml
conda activate stat471python -m venv stat471_env
source stat471_env/bin/activate # On Windows: stat471_env\Scripts\activate
pip install -r requirements.txtimport duckdb
import jupysql
import pandas as pd
conn = duckdb.connect('data/basic_examples.db')
print(conn.execute("SELECT COUNT(*) FROM Dragon").fetchall())Recommended sequence for students:
-
Phase 1: Read Section 5.1 (Environment Setup)
- Set up your local environment
- Connect to sample databases
- Verify DuckDB and Jupyter integration
-
Phase 2: Complete Section 5.2 (Basic Queries)
- Write your first SELECT queries
- Practice WHERE filtering
- Explore DISTINCT values
-
Phase 3: Work with Section 5.3 (Sorting & Sampling)
- Sort results with ORDER BY
- Limit results with LIMIT/OFFSET
- Generate random samples
-
Phase 4: Apply knowledge
- Combine multiple concepts
- Work with larger datasets (IMDB)
- Create your own queries
- Run all code cells - Don't just read; execute and observe the results
- Modify queries - Change WHERE conditions, LIMIT values, etc.
- Ask questions - Use SQL to explore data and test hypotheses
- Practice pagination - Combine LIMIT and OFFSET to understand data structure
- Random sampling - Use RANDOM() to understand data distribution
- DuckDB Official Documentation
- SQL Tutorial - W3Schools
- Jupyter Magic Commands
- IMDB Dataset Documentation
- Jupyter Notebook with all three sections completed
- Code examples working correctly (all 55 cells executed)
- Video tutorials recorded and subtitled (≤ 15 min total)
- Presentation slides prepared
- All files pushed to Git repository
- README and TODO documentation complete
- All team members' contributions documented
- AI traces removed and code optimized
December 14, 2025
Status: Completed - Final Submission (Group Challenge 9)
For questions about this project:
- Contact Group 9 members: Zeyi Chen, Leyi Chen, Yangyu Li
- Course Instructor: Ailin Zhang, Shanghai Jiao Tong University Global College
- Refer to course materials and TA office hours
- Check database structure in
data_overview/database_overview.txt
This tutorial is designed to be beginner-friendly while maintaining technical accuracy. Happy learning!