Skip to content

marzich/stat471_group

Repository files navigation

SQL Essentials for Data Science: Group 9

A comprehensive SQL tutorial designed for STAT 471 students, covering environment setup, basic queries, and data sampling with real-world databases.

Course Information

  • Course Code: STAT4710J
  • Course Name: Data Engineering and Analytics at Scale
  • Institution: Shanghai Jiao Tong University Global College
  • Instructor: Ailin Zhang
  • Assignment: Group Challenge

Authors

  • Zeyi Chen
  • Leyi Chen
  • Yangyu Li

Shanghai Jiao Tong University Global College

Project Overview

This project is a Group Challenge assignment for STAT4710J course. Our team (Group 9) has created a comprehensive entry-level SQL tutorial that covers foundational SQL concepts and practical applications using three real-world databases.

Learning Objectives

By completing this tutorial, students will be able to:

  1. Set up SQL environments using DuckDB and SQLite
  2. Write basic SQL queries with SELECT, FROM, WHERE, and DISTINCT clauses
  3. Sort and sample data using ORDER BY, LIMIT, OFFSET, and RANDOM()
  4. Work with real databases and practice data exploration

Target Audience

  • STAT 471 students (future cohorts) who are new to SQL
  • Data enthusiasts seeking practical SQL fundamentals
  • Anyone interested in learning data query and analysis

Quick Start

1. Create the Conda Environment

# Create a new conda environment from the configuration
conda env create -f environment.yml

# Activate the environment
conda activate stat471

# Verify installation
python -c "import duckdb; import jupysql; print('All dependencies installed successfully!')"

2. Launch Jupyter Notebook

cd /path/to/stat471_group
jupyter notebook

3. Open the Tutorial

Open the main Jupyter Notebook file to start learning:

  • STAT4710J_Group9.ipynb (main tutorial - complete integrated version)

Project Structure

stat471_group/
├── README.md                      # This file
├── TODO.md                        # Task breakdown and checklist
├── requirements.txt               # Pip dependencies (alternative)
├── environment.yml                # Conda environment configuration
├── data/                          # Data directory
│   ├── basic_examples.db         # Simple teaching examples (Dragon, Dish, Scene)
│   ├── duckdb_example.db         # Student grade data (student, grade, assignment)
│   └── imdbmini.db              # Mini IMDB movie database (Title, Name, Role, Rating)
├── data_overview/                 # Data analysis and documentation
│   ├── analyze_databases.py      # Script to generate database overview
│   └── database_overview.txt     # Database structure summary
└── notebooks/                     # Jupyter notebooks
    ├── STAT4710J_Group9.ipynb    # Main tutorial (integrated version)
    ├── 01_Topic_5.1_Environment_Tools.ipynb
    ├── 02_Topic_5.2_Basic_Queries.ipynb
    └── 03_Topic_5.3_Sorting_Sampling.ipynb

Curriculum Overview

The tutorial is organized into three core topics:

Topic 5.1: Environment & Tools

Focus: Setting up SQL development environment

  • Concepts Covered:

    • SQL Magic Commands (%sql, %%sql) in Jupyter
    • DuckDB vs SQLite comparison
    • Database connection and basic setup
  • Database Used: basic_examples.db / duckdb_example.db


Topic 5.2: Basic Queries

Focus: Fundamentals of SQL querying

  • Concepts Covered:

    • SELECT statement
    • FROM clause
    • WHERE filtering
    • DISTINCT for deduplication
    • AS for column aliasing
  • Database Used: basic_examples.db

  • Example Queries:

    • Find all dragons with cute rating > 0
    • Get unique dish types in the menu
    • Apply WHERE conditions to filter results

Topic 5.3: Sorting & Sampling

Focus: Working with large datasets

  • Concepts Covered:

    • ORDER BY (ascending/descending)
    • LIMIT for result limiting
    • OFFSET for pagination
    • RANDOM() for random sampling
  • Database Used: imdbmini.db

  • Example Queries:

    • Find top 10 longest movies
    • Randomly recommend a movie
    • Implement pagination with OFFSET

Deliverables

Completed Deliverables:

  1. Jupyter Notebook Tutorial

    • STAT4710J_Group9.ipynb: 55 cells, fully executable
    • Clear explanations and code examples
    • Running results and outputs
    • Lab-style format with progressive complexity
    • All cells executed successfully (execution counts 139-161)
  2. Video Tutorials

    • Individual videos: max 5 minutes per person
    • Total group length: ≤ 15 minutes
    • English subtitles required
    • Topics: 5.1, 5.2, 5.3 (one per team member)
  3. Presentation Slides

    • Key concepts summary
    • Code examples and results
    • Visual aids for clarity

Team Information

Group 9 Members:

  • Zeyi Chen: Topic 5.1 (Environment & Tools) + Video Introduction
  • Leyi Chen: Topic 5.2 (Basic Queries) + Core Concepts
  • Yangyu Li: Topic 5.3 (Sorting & Sampling) + Video Conclusion

Team Collaboration:

  • Regular code review for consistency
  • Integrated SQL magic approach across all topics
  • Complete notebook execution and validation

Database Overview

1. basic_examples.db (28 KB)

Small teaching database with simple relational structure:

  • Dragon table: Dragon names, years, cute ratings
  • Dish table: Menu items with types and costs
  • Scene table: Locations with biomes, cities, visitor counts

2. duckdb_example.db (3.8 MB)

Academic database for grade tracking:

  • Student table: Student information (ID, name, email)
  • Assignment table: Assignment descriptions
  • Grade table: Student grades for each assignment

3. imdbmini.db (9.9 MB)

Large-scale movie database:

  • Title table: Movie information (7,574 movies)
  • Name table: Cast and crew (31,183 people)
  • Role table: Cast/crew assignments (75,226 entries)
  • Rating table: Movie ratings and votes

Dependencies

Required Libraries

  • duckdb: Fast analytical database engine
  • jupysql: SQL magic commands for Jupyter
  • pandas: Data manipulation and analysis
  • jupyter: Interactive notebook environment
  • matplotlib / plotly: Visualization tools

See environment.yml or requirements.txt for specific versions.


Installation & Troubleshooting

Option 1: Using Conda (Recommended)

conda env create -f environment.yml
conda activate stat471

Option 2: Using Pip

python -m venv stat471_env
source stat471_env/bin/activate  # On Windows: stat471_env\Scripts\activate
pip install -r requirements.txt

Verify Installation

import duckdb
import jupysql
import pandas as pd

conn = duckdb.connect('data/basic_examples.db')
print(conn.execute("SELECT COUNT(*) FROM Dragon").fetchall())

Learning Path

Recommended sequence for students:

  1. Phase 1: Read Section 5.1 (Environment Setup)

    • Set up your local environment
    • Connect to sample databases
    • Verify DuckDB and Jupyter integration
  2. Phase 2: Complete Section 5.2 (Basic Queries)

    • Write your first SELECT queries
    • Practice WHERE filtering
    • Explore DISTINCT values
  3. Phase 3: Work with Section 5.3 (Sorting & Sampling)

    • Sort results with ORDER BY
    • Limit results with LIMIT/OFFSET
    • Generate random samples
  4. Phase 4: Apply knowledge

    • Combine multiple concepts
    • Work with larger datasets (IMDB)
    • Create your own queries

Tips for Success

  1. Run all code cells - Don't just read; execute and observe the results
  2. Modify queries - Change WHERE conditions, LIMIT values, etc.
  3. Ask questions - Use SQL to explore data and test hypotheses
  4. Practice pagination - Combine LIMIT and OFFSET to understand data structure
  5. Random sampling - Use RANDOM() to understand data distribution

Resources


Completion Status

  • Jupyter Notebook with all three sections completed
  • Code examples working correctly (all 55 cells executed)
  • Video tutorials recorded and subtitled (≤ 15 min total)
  • Presentation slides prepared
  • All files pushed to Git repository
  • README and TODO documentation complete
  • All team members' contributions documented
  • AI traces removed and code optimized

Last Updated

December 14, 2025

Status: Completed - Final Submission (Group Challenge 9)


Contact & Support

For questions about this project:

  • Contact Group 9 members: Zeyi Chen, Leyi Chen, Yangyu Li
  • Course Instructor: Ailin Zhang, Shanghai Jiao Tong University Global College
  • Refer to course materials and TA office hours
  • Check database structure in data_overview/database_overview.txt

This tutorial is designed to be beginner-friendly while maintaining technical accuracy. Happy learning!

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors