SQL Essentials for Data Science: Group 9

A comprehensive SQL tutorial designed for STAT 471 students, covering environment setup, basic queries, and data sampling with real-world databases.

Course Information

Course Code: STAT4710J
Course Name: Data Engineering and Analytics at Scale
Institution: Shanghai Jiao Tong University Global College
Instructor: Ailin Zhang
Assignment: Group Challenge

Authors

Zeyi Chen
Leyi Chen
Yangyu Li

Shanghai Jiao Tong University Global College

Project Overview

This project is a Group Challenge assignment for STAT4710J course. Our team (Group 9) has created a comprehensive entry-level SQL tutorial that covers foundational SQL concepts and practical applications using three real-world databases.

Learning Objectives

By completing this tutorial, students will be able to:

Set up SQL environments using DuckDB and SQLite
Write basic SQL queries with SELECT, FROM, WHERE, and DISTINCT clauses
Sort and sample data using ORDER BY, LIMIT, OFFSET, and RANDOM()
Work with real databases and practice data exploration

Target Audience

STAT 471 students (future cohorts) who are new to SQL
Data enthusiasts seeking practical SQL fundamentals
Anyone interested in learning data query and analysis

Quick Start

1. Create the Conda Environment

# Create a new conda environment from the configuration
conda env create -f environment.yml

# Activate the environment
conda activate stat471

# Verify installation
python -c "import duckdb; import jupysql; print('All dependencies installed successfully!')"

2. Launch Jupyter Notebook

cd /path/to/stat471_group
jupyter notebook

3. Open the Tutorial

Open the main Jupyter Notebook file to start learning:

STAT4710J_Group9.ipynb (main tutorial - complete integrated version)

Project Structure

stat471_group/
├── README.md                      # This file
├── TODO.md                        # Task breakdown and checklist
├── requirements.txt               # Pip dependencies (alternative)
├── environment.yml                # Conda environment configuration
├── data/                          # Data directory
│   ├── basic_examples.db         # Simple teaching examples (Dragon, Dish, Scene)
│   ├── duckdb_example.db         # Student grade data (student, grade, assignment)
│   └── imdbmini.db              # Mini IMDB movie database (Title, Name, Role, Rating)
├── data_overview/                 # Data analysis and documentation
│   ├── analyze_databases.py      # Script to generate database overview
│   └── database_overview.txt     # Database structure summary
└── notebooks/                     # Jupyter notebooks
    ├── STAT4710J_Group9.ipynb    # Main tutorial (integrated version)
    ├── 01_Topic_5.1_Environment_Tools.ipynb
    ├── 02_Topic_5.2_Basic_Queries.ipynb
    └── 03_Topic_5.3_Sorting_Sampling.ipynb

Curriculum Overview

The tutorial is organized into three core topics:

Topic 5.1: Environment & Tools

Focus: Setting up SQL development environment

Concepts Covered:
- SQL Magic Commands (%sql, %%sql) in Jupyter
- DuckDB vs SQLite comparison
- Database connection and basic setup
Database Used: basic_examples.db / duckdb_example.db

Topic 5.2: Basic Queries

Focus: Fundamentals of SQL querying

Concepts Covered:
- SELECT statement
- FROM clause
- WHERE filtering
- DISTINCT for deduplication
- AS for column aliasing
Database Used: basic_examples.db
Example Queries:
- Find all dragons with cute rating > 0
- Get unique dish types in the menu
- Apply WHERE conditions to filter results

Topic 5.3: Sorting & Sampling

Focus: Working with large datasets

Concepts Covered:
- ORDER BY (ascending/descending)
- LIMIT for result limiting
- OFFSET for pagination
- RANDOM() for random sampling
Database Used: imdbmini.db
Example Queries:
- Find top 10 longest movies
- Randomly recommend a movie
- Implement pagination with OFFSET

Deliverables

Completed Deliverables:

Jupyter Notebook Tutorial
- STAT4710J_Group9.ipynb: 55 cells, fully executable
- Clear explanations and code examples
- Running results and outputs
- Lab-style format with progressive complexity
- All cells executed successfully (execution counts 139-161)
Video Tutorials
- Individual videos: max 5 minutes per person
- Total group length: ≤ 15 minutes
- English subtitles required
- Topics: 5.1, 5.2, 5.3 (one per team member)
Presentation Slides
- Key concepts summary
- Code examples and results
- Visual aids for clarity

Team Information

Group 9 Members:

Zeyi Chen: Topic 5.1 (Environment & Tools) + Video Introduction
Leyi Chen: Topic 5.2 (Basic Queries) + Core Concepts
Yangyu Li: Topic 5.3 (Sorting & Sampling) + Video Conclusion

Team Collaboration:

Regular code review for consistency
Integrated SQL magic approach across all topics
Complete notebook execution and validation

Database Overview

1. basic_examples.db (28 KB)

Small teaching database with simple relational structure:

Dragon table: Dragon names, years, cute ratings
Dish table: Menu items with types and costs
Scene table: Locations with biomes, cities, visitor counts

2. duckdb_example.db (3.8 MB)

Academic database for grade tracking:

Student table: Student information (ID, name, email)
Assignment table: Assignment descriptions
Grade table: Student grades for each assignment

3. imdbmini.db (9.9 MB)

Large-scale movie database:

Title table: Movie information (7,574 movies)
Name table: Cast and crew (31,183 people)
Role table: Cast/crew assignments (75,226 entries)
Rating table: Movie ratings and votes

Dependencies

Required Libraries

duckdb: Fast analytical database engine
jupysql: SQL magic commands for Jupyter
pandas: Data manipulation and analysis
jupyter: Interactive notebook environment
matplotlib / plotly: Visualization tools

See environment.yml or requirements.txt for specific versions.

Installation & Troubleshooting

Option 1: Using Conda (Recommended)

conda env create -f environment.yml
conda activate stat471

Option 2: Using Pip

python -m venv stat471_env
source stat471_env/bin/activate  # On Windows: stat471_env\Scripts\activate
pip install -r requirements.txt

Verify Installation

import duckdb
import jupysql
import pandas as pd

conn = duckdb.connect('data/basic_examples.db')
print(conn.execute("SELECT COUNT(*) FROM Dragon").fetchall())

Learning Path

Recommended sequence for students:

Phase 1: Read Section 5.1 (Environment Setup)
- Set up your local environment
- Connect to sample databases
- Verify DuckDB and Jupyter integration
Phase 2: Complete Section 5.2 (Basic Queries)
- Write your first SELECT queries
- Practice WHERE filtering
- Explore DISTINCT values
Phase 3: Work with Section 5.3 (Sorting & Sampling)
- Sort results with ORDER BY
- Limit results with LIMIT/OFFSET
- Generate random samples
Phase 4: Apply knowledge
- Combine multiple concepts
- Work with larger datasets (IMDB)
- Create your own queries

Tips for Success

Run all code cells - Don't just read; execute and observe the results
Modify queries - Change WHERE conditions, LIMIT values, etc.
Ask questions - Use SQL to explore data and test hypotheses
Practice pagination - Combine LIMIT and OFFSET to understand data structure
Random sampling - Use RANDOM() to understand data distribution

Resources

Completion Status

Jupyter Notebook with all three sections completed
Code examples working correctly (all 55 cells executed)
Video tutorials recorded and subtitled (≤ 15 min total)
Presentation slides prepared
All files pushed to Git repository
README and TODO documentation complete
All team members' contributions documented
AI traces removed and code optimized

Last Updated

December 14, 2025

Status: Completed - Final Submission (Group Challenge 9)

Contact & Support

For questions about this project:

Contact Group 9 members: Zeyi Chen, Leyi Chen, Yangyu Li
Course Instructor: Ailin Zhang, Shanghai Jiao Tong University Global College
Refer to course materials and TA office hours
Check database structure in data_overview/database_overview.txt

This tutorial is designed to be beginner-friendly while maintaining technical accuracy. Happy learning!

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
.github		.github
STAT4710J SQL		STAT4710J SQL
data		data
data_overview		data_overview
notebooks		notebooks
.gitignore		.gitignore
README.md		README.md
TODO.md		TODO.md
environment.yml		environment.yml
requirements.txt		requirements.txt
useful_prompts.md		useful_prompts.md

Folders and files

Latest commit

History

Repository files navigation

SQL Essentials for Data Science: Group 9

Course Information

Authors

Project Overview

Learning Objectives

Target Audience

Quick Start

1. Create the Conda Environment

2. Launch Jupyter Notebook

3. Open the Tutorial

Project Structure

Curriculum Overview

Topic 5.1: Environment & Tools

Topic 5.2: Basic Queries

Topic 5.3: Sorting & Sampling

Deliverables

Team Information

Database Overview

1. basic_examples.db (28 KB)

2. duckdb_example.db (3.8 MB)

3. imdbmini.db (9.9 MB)

Dependencies

Required Libraries

Installation & Troubleshooting

Option 1: Using Conda (Recommended)

Option 2: Using Pip

Verify Installation

Learning Path

Tips for Success

Resources

Completion Status

Last Updated

Contact & Support

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages