<a href="https://colab.research.google.com/github/glassresearch/PLT/blob/master/Python%20colab%20Georgia%20Tech/SQL_info_discovery_FA25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SQL Table Information and Discovery

## This notebook covers database information and discovery skills.

1. Get schema and table information.

2. Get specific table information using PRAGMA.

3. Display sample data from each table.

## We are using a small version of the `NYC 311` database that is used in Homework NB9.

All of the columns from the full database are available, but we have restricted the date range to a single month of data. This keeps the database file small enough to host on GitHub and to work with in Google Colab. Note that this database has only a single table.

## We are also using the `university.db` database, which was the data source for the Spring 2025 MT2 exam.

This database has more tables, so it provides additional examples for students to see.

#### The next code cell downloads the two database files and opens a connection to each. On homework notebooks and exams, you will not have to do this yourself; the code to load the database will be provided, as we showed in the previous notebooks.

#### Also, the code to load the databases below is specific to Google Colab. It is different for notebooks hosted on Vocareum. To reiterate, students WILL NOT be required to write code to load any databases.

In [1]:
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/SQL/syllabus/NYC-311-2M_small.db
!wget https://github.com/gt-cse-6040/bootcamp/raw/main/SQL/syllabus/university.db

# import the SQLite driver and pandas
import sqlite3 as db
import pandas as pd

# Open a connection to each downloaded database file
# (sqlite3 creates a new, empty file if the name does not exist)
conn_nyc = db.connect('NYC-311-2M_small.db')
conn_univ = db.connect('university.db')

--2025-10-09 20:00:42--  https://github.com/gt-cse-6040/bootcamp/raw/main/SQL/syllabus/NYC-311-2M_small.db
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/SQL/syllabus/NYC-311-2M_small.db [following]
--2025-10-09 20:00:42--  https://raw.githubusercontent.com/gt-cse-6040/bootcamp/main/SQL/syllabus/NYC-311-2M_small.db
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 20639744 (20M) [application/octet-stream]
Saving to: ‘NYC-311-2M_small.db’


2025-10-09 20:00:47 (149 MB/s) - ‘NYC-311-2M_small.db’ saved [20639744/20639744]

--2025-10-09 20:00:48--  https://github.com/gt-cse-

## What information do we want to know about the database?

1. What tables are in the database?

2. What is the structure of each table (columns and data types)?

3. What does the data look like in each table (data sample)?

*    https://www.sqlite.org/schematab.html

### In SQLite, the table `sqlite_master` contains the metadata about every table in the database.

The `sqlite_master` table is a system table that SQLite maintains internally to keep track of the database schema: its tables, indexes, views, and triggers.

You can query `sqlite_master` like any other table to list the objects in the database. Column definitions are not stored in separate fields; they appear inside the `CREATE` statement held in the `sql` column.

The `sqlite_master` table has the following columns:
*    type: The type of the object (e.g., table, index, view, or trigger).
*    name: The name of the object (e.g., the name of a table, index, or view).
*    tbl_name: The name of the table the object is associated with. For tables and views this is the same as `name`; for indexes and triggers it is the table (or view) they are attached to.
*    rootpage: The page number of the root b-tree page for the object (relevant for tables and indexes).
*    sql: The SQL statement that was used to create the object (e.g., the CREATE TABLE or CREATE INDEX statement).
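
To see how `type`, `name`, and `tbl_name` differ across object types, here is a minimal sketch on a throwaway in-memory database. The table, index, and view names are made up for illustration and are not part of the course databases.

```python
import sqlite3

# Throwaway in-memory database; all object names here are hypothetical.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, last_name TEXT NOT NULL);
    CREATE INDEX idx_student_last_name ON student (last_name);
    CREATE VIEW student_names AS SELECT last_name FROM student;
""")

# No WHERE clause, so indexes and views are listed alongside tables.
rows = conn_demo.execute(
    "SELECT type, name, tbl_name FROM sqlite_master ORDER BY type, name"
).fetchall()

for row in rows:
    print(row)
```

The index row reports `student` as its `tbl_name`, while the table and the view each repeat their own name. The same query string could also be passed to `pd.read_sql` with one of the notebook's connections.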

**Note that we are following the pattern the exams use for exercises: a function returns the query as a string, and the query is then run with `pd.read_sql`.**

In [2]:
def gettableschema() -> str:
    # return one row of metadata for every table in the database
    query = """
            SELECT *
            FROM sqlite_master
            WHERE type='table'
            """
    return query

df_schema_nyc = pd.read_sql(gettableschema(), conn_nyc)
display(df_schema_nyc)

df_schema_univ = pd.read_sql(gettableschema(), conn_univ)
display(df_schema_univ)

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,data,data,2,"CREATE TABLE data (\n\t""index"" BIGINT, \n\t""Cr..."


Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,student_main,student_main,2,CREATE TABLE student_main (\n\tstudent_id VARC...
1,table,major_crosswalk,major_crosswalk,4,CREATE TABLE major_crosswalk (\n\tid INTEGER N...
2,table,scholarship_crosswalk,scholarship_crosswalk,5,CREATE TABLE scholarship_crosswalk (\n\tid INT...
3,table,student_key,student_key,6,CREATE TABLE student_key (\n\tstudent_id VARCH...
4,table,student_enrollment,student_enrollment,8,CREATE TABLE student_enrollment (\n\tid INTEGE...
5,table,graduation,graduation,9,CREATE TABLE graduation (\n\tid INTEGER NOT NU...
6,table,scholarship_rules,scholarship_rules,10,CREATE TABLE scholarship_rules (\n\tid INTEGER...
7,table,student_scholarship,student_scholarship,11,CREATE TABLE student_scholarship (\n\tid INTEG...


### In SQLite, the `PRAGMA` statement has many functions (see the documentation).

*    https://www.sqlite.org/pragma.html

In SQLite, `PRAGMA` statements are used to query or modify database settings and retrieve metadata. To get metadata about tables, columns, indexes, and other database objects, SQLite provides specific PRAGMA commands that allow you to extract detailed information about the database schema.

### Here, we are using the `table_info` function, which returns the table structure and column information about the table whose name is passed to it.

This will return the following:
*    cid: Column ID (an integer representing the column's index).
*    name: The name of the column.
*    type: The data type of the column (e.g., INTEGER, TEXT, REAL).
*    notnull: A flag indicating whether the column has a NOT NULL constraint (1 if NOT NULL, 0 if not).
*    dflt_value: The default value for the column (if any).
*    pk: 0 if the column is not part of the primary key; otherwise the column's 1-based position within the primary key (1 for a single-column key).
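
The `pk` and `notnull` values are easiest to read on a table with a multi-column primary key. The sketch below uses a hypothetical `enrollment` table in an in-memory database; it is not a table from `university.db`.

```python
import sqlite3

# Hypothetical table with a two-column primary key, for illustration only.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("""
    CREATE TABLE enrollment (
        student_id TEXT NOT NULL,
        term       TEXT NOT NULL,
        gpa        REAL DEFAULT 0.0,
        PRIMARY KEY (student_id, term)
    )
""")

# Each result row is (cid, name, type, notnull, dflt_value, pk).
info = conn_demo.execute("PRAGMA table_info('enrollment')").fetchall()

for cid, name, col_type, notnull, dflt_value, pk in info:
    print(cid, name, col_type, notnull, dflt_value, pk)
```

Here `student_id` and `term` get `pk` values 1 and 2, their positions within the key, and `gpa` gets 0.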

#### Note in the function below, we are passing in the table name to send to `PRAGMA table_info()`.

This is the pattern to follow whenever your query function takes parameters: accept them as function arguments and insert them into the query string with an f-string.

In [3]:
def tablemetadata(tablename: str) -> str:

    query = f"""
            PRAGMA table_info('{tablename}')
          """

    return query

# simple example, for the NYC database
df_pragma_nyc = pd.read_sql(tablemetadata('data'),conn_nyc)
display(df_pragma_nyc)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,index,BIGINT,0,,0
1,1,CreatedDate,DATETIME,0,,0
2,2,ClosedDate,DATETIME,0,,0
3,3,Agency,TEXT,0,,0
4,4,ComplaintType,TEXT,0,,0
5,5,Descriptor,TEXT,0,,0
6,6,City,TEXT,0,,0
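
Because a table name is an identifier, it cannot be supplied through a `?` placeholder the way a value can, which is why the function above builds the query with an f-string. One optional safeguard, sketched below, is to check the name against `sqlite_master` first. The helper name `tablemetadata_checked` and the demo table are assumptions for illustration, not part of the course code.

```python
import sqlite3


def tablemetadata_checked(conn: sqlite3.Connection, tablename: str) -> str:
    """Return a PRAGMA table_info query, but only for a table that exists."""
    known = {
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    }
    if tablename not in known:
        raise ValueError(f"unknown table: {tablename!r}")
    return f"PRAGMA table_info('{tablename}')"


# Demo on a throwaway in-memory database with a made-up two-column table.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE data (Agency TEXT, City TEXT)")

query = tablemetadata_checked(conn_demo, "data")
columns = [row[1] for row in conn_demo.execute(query)]
print(columns)
```

A misspelled table name then raises a `ValueError` immediately, instead of `PRAGMA table_info` quietly returning no rows.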


In [4]:
# more complex example, for multiple tables from the university database

lst_univ_tables = ['student_main','major_crosswalk','scholarship_crosswalk']

for tablename in lst_univ_tables:
    print(f'tablename: {tablename}')

    display(pd.read_sql(tablemetadata(tablename),conn_univ))

    print('=================')

tablename: student_main


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,student_id,VARCHAR(9),1,,1
1,1,last_name,VARCHAR,1,,0
2,2,first_name,VARCHAR,1,,0
3,3,middle_initial,VARCHAR(1),0,,0
4,4,email,VARCHAR,0,,0
5,5,gender,VARCHAR(1),1,,0
6,6,ethnicity,VARCHAR,1,,0
7,7,address,VARCHAR,1,,0
8,8,us_citizen,VARCHAR(1),1,,0
9,9,us_resident,VARCHAR(1),1,,0


tablename: major_crosswalk


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,major_code,VARCHAR(50),1,,0
2,2,major_description,VARCHAR(255),1,,0
3,3,activation_date,VARCHAR(9),1,,0


tablename: scholarship_crosswalk


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,scholarship_code,VARCHAR,1,,0
2,2,scholarship_description,VARCHAR,1,,0
3,3,activation_date,VARCHAR,1,,0




#### If we did not have the table names, remember that the call to `sqlite_master` returns all of the table names in the database.

So we could loop over the `df_schema_univ` dataframe, put all of the table names into a list, and call `PRAGMA table_info` for each. Alternatively, we could loop over the dataframe rows directly and pass each table name to `PRAGMA table_info` without building a list first.

In [5]:
lst_univ_tables_full = []

for index, row in df_schema_univ.iterrows():
    lst_univ_tables_full.append(row['tbl_name'])

lst_univ_tables_full

['student_main',
 'major_crosswalk',
 'scholarship_crosswalk',
 'student_key',
 'student_enrollment',
 'graduation',
 'scholarship_rules',
 'student_scholarship']
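
As a possible shortcut, the list of table names can also be built straight from a query on `sqlite_master`, without looping over dataframe rows. This is a sketch on an in-memory stand-in with two empty, simplified tables, not the real `university.db`.

```python
import sqlite3

# In-memory stand-in; the two tables are simplified placeholders.
conn_demo = sqlite3.connect(":memory:")
conn_demo.executescript("""
    CREATE TABLE student_main (student_id TEXT);
    CREATE TABLE major_crosswalk (id INTEGER);
""")

# Each result row is a 1-tuple, so unpack it inside the comprehension.
table_names = [
    name
    for (name,) in conn_demo.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)
```

With the dataframe already in hand, `df_schema_univ['name'].tolist()` should give an equivalent list in a single expression.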

In [6]:
# show the column metadata for every table in the university database
for tablename in lst_univ_tables_full:
    print(f'tablename: {tablename}')

    display(pd.read_sql(tablemetadata(tablename),conn_univ))

    print('=================')

tablename: student_main


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,student_id,VARCHAR(9),1,,1
1,1,last_name,VARCHAR,1,,0
2,2,first_name,VARCHAR,1,,0
3,3,middle_initial,VARCHAR(1),0,,0
4,4,email,VARCHAR,0,,0
5,5,gender,VARCHAR(1),1,,0
6,6,ethnicity,VARCHAR,1,,0
7,7,address,VARCHAR,1,,0
8,8,us_citizen,VARCHAR(1),1,,0
9,9,us_resident,VARCHAR(1),1,,0


tablename: major_crosswalk


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,major_code,VARCHAR(50),1,,0
2,2,major_description,VARCHAR(255),1,,0
3,3,activation_date,VARCHAR(9),1,,0


tablename: scholarship_crosswalk


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,scholarship_code,VARCHAR,1,,0
2,2,scholarship_description,VARCHAR,1,,0
3,3,activation_date,VARCHAR,1,,0


tablename: student_key


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,student_id,VARCHAR(9),1,,1
1,1,finance_id,VARCHAR(12),1,,0
2,2,employee_id,VARCHAR(7),0,,0


tablename: student_enrollment


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,student_id,VARCHAR(9),1,,0
2,2,term,VARCHAR(6),1,,0
3,3,major_code,VARCHAR(3),1,,0
4,4,semester_hours_attempted,INTEGER,1,,0
5,5,semester_hours_earned,INTEGER,1,,0
6,6,semester_gpa,FLOAT,1,,0
7,7,cumulative_hours_earned,INTEGER,1,,0
8,8,cumulative_gpa,FLOAT,1,,0


tablename: graduation


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,student_id,VARCHAR(9),1,,0
2,2,last_enroll_term,VARCHAR(6),1,,0
3,3,grad_term,VARCHAR(6),1,,0
4,4,grad_level,VARCHAR(1),1,,0
5,5,grad_status,VARCHAR(1),1,,0


tablename: scholarship_rules


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,scholarship_crosswalk_id,INTEGER,1,,0
2,2,scholarship_code,VARCHAR,1,,0
3,3,major_code,VARCHAR(3),0,,0
4,4,activation_date,VARCHAR,1,,0
5,5,scholarship_active,VARCHAR(1),1,,0
6,6,min_gpa,FLOAT,1,,0
7,7,gender,VARCHAR(1),0,,0
8,8,pell_recipient,VARCHAR(1),0,,0
9,9,us_veteran,VARCHAR(1),0,,0


tablename: student_scholarship


Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,id,INTEGER,1,,1
1,1,finance_id,VARCHAR(12),0,,0
2,2,scholarship_term,VARCHAR(6),0,,0
3,3,scholarship_code,VARCHAR,0,,0
4,4,scholarship_total,INTEGER,0,,0
5,5,scholarship_payment,INTEGER,0,,0
6,6,scholarship_refund,INTEGER,0,,0




### Finally, we can run a simple `SELECT * FROM table LIMIT n` to preview the data itself.

#### Note again that, in the function below, we are passing in the table name and how many rows to return.

In [None]:
def querytables(tablename: str, limit: int = 10) -> str:
    query = f"""
            SELECT *
            FROM {tablename}
            LIMIT {limit}
          """
    return query

# simple example, for the NYC database
df_table_nyc = pd.read_sql(querytables('data',5),conn_nyc)
display(df_table_nyc)
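
Unlike the table name, the row limit is a value, so it can be bound with a `?` placeholder instead of being formatted into the string. The sketch below assumes a tiny made-up table; the rows are illustrative and are not taken from the NYC 311 data.

```python
import sqlite3

# Throwaway table with a few invented rows, for illustration only.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE data (Agency TEXT, City TEXT)")
conn_demo.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [("AGENCY_A", "CITY_X"), ("AGENCY_B", "CITY_Y"), ("AGENCY_C", "CITY_Z")],
)

# The limit is passed as a bound parameter, not spliced into the SQL text.
sample = conn_demo.execute("SELECT * FROM data LIMIT ?", (2,)).fetchall()
print(sample)
```

`pd.read_sql` accepts a `params` argument for the same purpose, so the bound form also works with the dataframe workflow used in this notebook.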

In [None]:
# commented out, to reduce output volume.
# for tablename in lst_univ_tables_full:
#     print(f'tablename: {tablename}')

#     display(pd.read_sql(querytables(tablename,5),conn_univ))

#     print('=================')

## What are your questions about database discovery?