<a href="https://colab.research.google.com/github/petre001/Demo/blob/main/week5/python_sql_intro_inclass.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href='https://ai.meng.duke.edu'><img align="left" style="padding-top:10px;" src="https://storage.googleapis.com/aipi_datasets/Duke-AIPI-Logo.png"></a>

# Working with SQLite Databases

In [1]:
import sqlite3 as db
import pandas as pd
import numpy as np

## Connecting to a database
The SQLite engine stores a database as a single file; in the example below, that file is `example.db`.
If the named file does not yet exist, it is created when this code runs; if it already exists, the same code opens it.  Once the database is open, we create a _cursor_, an object through which we send SQL commands to the database and read back the results of queries.

In [2]:
# Connect to a database (or create one if it doesn't exist)
conn = db.connect('example.db')

# Create a 'cursor' for executing commands
c = conn.cursor()
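
To see the create-or-open behavior in isolation, here is a minimal sketch that connects to a database file in a temporary directory and checks that the file exists afterwards. The file name `scratch.db` and the `scratch_*` variable names are made up for this illustration, so `example.db` is left untouched.

```python
import os
import sqlite3
import tempfile
from contextlib import closing

# Hypothetical scratch location so the example does not touch example.db
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "scratch.db")

existed_before = os.path.exists(db_path)

# connect() opens the file if it exists, otherwise a new database is created
with closing(sqlite3.connect(db_path)) as scratch_conn:
    scratch_cur = scratch_conn.cursor()
    scratch_cur.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
    scratch_conn.commit()

exists_after = os.path.exists(db_path)
print("Existed before:", existed_before, "| exists after:", exists_after)
```

Wrapping the connection in `contextlib.closing` closes it when the block ends, which is a convenient habit once a notebook opens more than one connection.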

## Adding data to tables
The central object of a relational database is a _table_. A table has a similar form to a pandas DataFrame: observations as rows, features as columns. In the relational database world, we sometimes refer to rows as _items_ or _records_ and columns as _attributes_.

Let's start by creating a table.  Suppose we would like to create a table within `example.db` to store information about Duke students, which includes three attributes: their Duke ID number, their name, and their expected graduation year.  We will create a table called `Students` to store this information.

In [3]:
# First check if the table already exists and if so we will delete it
c.execute("DROP TABLE IF EXISTS Students")

# Create a table named "Students" with 3 columns: "duke_id" (integer), "name" (string), "grad_year" (integer).
c.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")

<sqlite3.Cursor at 0x7f53d4ba0d50>

Let's now populate our table.  To add items to the table we use the [`INSERT INTO`](https://www.sqlite.org/lang_insert.html) command.  The format of the command is `"INSERT INTO <table_name> VALUES (<values>)"`.

In [4]:
# Commands to add data to our table
c.execute("INSERT INTO Students VALUES ('121', 'Reifschneider', 2025)")
c.execute("INSERT INTO Students VALUES ('225', 'Egger', 2023)")
c.execute("INSERT INTO Students VALUES ('767', 'Lin', 2022)")
c.execute("INSERT INTO Students VALUES ('988', 'Saha', 2022)")

# Commit the changes (make them permanent in the database)
conn.commit()
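
Notice that the IDs above are written as quoted strings even though `duke_id` is declared `INTEGER`. SQLite's type affinity converts text that looks like an integer when it is stored in an `INTEGER` column, which is why the query results in this notebook show the IDs as numbers. A small sketch of that behavior, using a throwaway in-memory database (the `demo_*` names are assumptions for illustration):

```python
import sqlite3

# In-memory database: nothing is written to disk
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")

# The ID is passed as text, but the INTEGER column converts it on storage
demo_cur.execute("INSERT INTO Students VALUES ('121', 'Reifschneider', 2025)")

# typeof() reports the storage class SQLite actually used for the value
demo_cur.execute("SELECT duke_id, typeof(duke_id) FROM Students")
stored_id, stored_type = demo_cur.fetchone()
print(stored_id, stored_type)
demo_conn.close()
```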

Rather than adding one item at a time, we can use `executemany()` to add multiple items.

In [5]:
# List of items to add
more_students = [('734', 'Fox', 2025),
                 ('878', 'Lenz', 2023),
                 ('267', 'Glass', 2023)]

# '?' question marks are placeholders for the columns in Students table
c.executemany('INSERT INTO Students VALUES (?, ?, ?)', more_students)
conn.commit()
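
The `sqlite3` module also accepts named placeholders, which can be easier to read when a table has many columns. The sketch below is one possible variant, not part of the original exercise: each item is a dictionary, and `:duke_id`, `:name`, and `:grad_year` are filled from the matching keys. It runs against an in-memory database with hypothetical `demo_*` names.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")

# Each dictionary supplies the values for one row
rows = [
    {"duke_id": 734, "name": "Fox", "grad_year": 2025},
    {"duke_id": 878, "name": "Lenz", "grad_year": 2023},
]

# ':key' placeholders are looked up by name in each dictionary
demo_cur.executemany(
    "INSERT INTO Students (duke_id, name, grad_year) VALUES (:duke_id, :name, :grad_year)",
    rows,
)
demo_conn.commit()

demo_cur.execute("SELECT COUNT(*) FROM Students")
inserted = demo_cur.fetchone()[0]
print("Rows inserted:", inserted)
demo_conn.close()
```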

## Basic queries
The most common operation we perform on databases is to retrieve information from them using a 'query'.  We use SQL syntax to create queries, which you can read about [here](https://data36.com/wp-content/uploads/2018/12/sql-cheat-sheet-for-data-scientists-by-tomi-mester.pdf).

The simplest form of a SQL query is `"SELECT * FROM <table_name>"`, which returns every row of the table; `fetchall()` then delivers the rows as a list of tuples. Note: unless we know that our table is of reasonable size, we usually do not want to run `SELECT * FROM` because it may return a lot of data!

In [6]:
# Query to get all data from the Students table
c.execute("SELECT * FROM Students")
results = c.fetchall()
print("Results of the query:", len(results), "\nThe entries of Students:\n", results)

Results of the query: 7 
The entries of Students:
 [(121, 'Reifschneider', 2025), (225, 'Egger', 2023), (767, 'Lin', 2022), (988, 'Saha', 2022), (734, 'Fox', 2025), (878, 'Lenz', 2023), (267, 'Glass', 2023)]
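
When a table might be large, one way to peek at it safely is to cap the number of rows with `LIMIT`. The sketch below also sets `row_factory = sqlite3.Row` so that each row can be read by column name instead of by position. The in-memory database and the `demo_*` and `preview*` names are assumptions for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_conn.row_factory = sqlite3.Row  # rows now support access by column name
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")
demo_cur.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(121, "Reifschneider", 2025), (225, "Egger", 2023), (767, "Lin", 2022)],
)

# LIMIT caps how many rows come back, however large the table is
demo_cur.execute("SELECT * FROM Students ORDER BY duke_id LIMIT 2")
preview = demo_cur.fetchall()
preview_names = [row["name"] for row in preview]
print(preview_names)
demo_conn.close()
```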


We can also create more complex queries using SQL which filter and/or sort the data.

In [7]:
# Query to get all students graduating in 2023
c.execute("SELECT * FROM Students WHERE grad_year=2023")
results = c.fetchall()
print("Results of the query:", len(results), "\nThe entries of Students:\n", results)

Results of the query: 3 
The entries of Students:
 [(225, 'Egger', 2023), (878, 'Lenz', 2023), (267, 'Glass', 2023)]


In [8]:
# Query to return students graduating before a certain year, ordered by last name
grad_year = 2025
query = '''
        SELECT *
        FROM Students
        WHERE grad_year < ?
        ORDER BY name
        '''

# Bind grad_year to the '?' placeholder instead of formatting it into the SQL string
c.execute(query, (grad_year,))
results = c.fetchall()
print("Results of the query:", len(results), "\nThe entries of Students:\n", results)

Results of the query: 5 
The entries of Students:
 [(225, 'Egger', 2023), (267, 'Glass', 2023), (878, 'Lenz', 2023), (767, 'Lin', 2022), (988, 'Saha', 2022)]
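
A word of caution about building queries from Python values: when a value is formatted directly into the SQL string, a crafted value can change the meaning of the query, whereas a value bound to a `?` placeholder is always treated as plain data. The sketch below illustrates the difference on an in-memory database; `user_input` is a hypothetical untrusted string, and the `demo_*` names are assumptions for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")
demo_cur.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(225, "Egger", 2023), (767, "Lin", 2022), (734, "Fox", 2025)],
)

# Hypothetical untrusted input pretending to be a student name
user_input = "Lin' OR '1'='1"

# Formatted into the string, the input rewrites the WHERE clause
unsafe_query = f"SELECT name FROM Students WHERE name = '{user_input}'"
unsafe_rows = demo_cur.execute(unsafe_query).fetchall()

# Bound with '?', the input is compared as one literal value
safe_rows = demo_cur.execute(
    "SELECT name FROM Students WHERE name = ?", (user_input,)
).fetchall()

print("Formatted into SQL:", len(unsafe_rows), "rows")
print("Bound as parameter:", len(safe_rows), "rows")
demo_conn.close()
```

The formatted version matches every student because the `OR '1'='1'` fragment becomes part of the SQL, while the bound version matches none because no student has that literal name.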


## Join queries
The main type of query that combines information from multiple tables is the _join query_. There are four types of join queries:

- `INNER JOIN(A, B)`: Keep rows of `A` and `B` only where `A` and `B` match
- `OUTER JOIN(A, B)`: Keep all rows of `A` and `B`, but merge matching rows and fill in missing values with some default (`NaN` in Pandas, `NULL` in SQL)
- `LEFT JOIN(A, B)`: Keep all rows of `A` but only merge matches from `B`.
- `RIGHT JOIN(A, B)`: Keep all rows of `B` but only merge matches from `A`.

If you are a visual person, see [this page](https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins) for illustrations of the different join types.
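
Since a table resembles a pandas DataFrame, the four join types can also be tried out with `DataFrame.merge` and its `how` argument. This is a minimal sketch on two made-up frames `A` and `B` that share a `key` column; the data is invented purely to show which keys survive each join.

```python
import pandas as pd

# Two small made-up tables that overlap on keys 2 and 3
A = pd.DataFrame({"key": [1, 2, 3], "a_val": ["a1", "a2", "a3"]})
B = pd.DataFrame({"key": [2, 3, 4], "b_val": ["b2", "b3", "b4"]})

inner = A.merge(B, on="key", how="inner")  # only keys present in both
outer = A.merge(B, on="key", how="outer")  # every key from either table
left = A.merge(B, on="key", how="left")    # every key from A
right = A.merge(B, on="key", how="right")  # every key from B

for label, frame in [("inner", inner), ("outer", outer), ("left", left), ("right", right)]:
    print(label, sorted(frame["key"]))

# Rows of A without a match in B get NaN in B's columns
print(left)
```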

Let's create a new table `Classes` which stores information on which classes each student has taken and their grade (on a 4.0 scale).  We will then run some join queries on both tables in our database.

In [9]:
# Create Classes table
c.execute('DROP TABLE IF EXISTS Classes')
c.execute('CREATE TABLE Classes (duke_id INTEGER, course TEXT, grade REAL)')

class_records = [('121','AIPI 510',3.7),
                 ('121','AIPI 520',4.0),
                 ('121','AIPI 530',3.3),
                 ('225','AIPI 510',4.0),
                 ('225','AIPI 520',3.3),
                 ('767','MENG 570',3.0),
                 ('767','AIPI 510',4.0),
                 ('988','MENG 570',4.0),
                 ('988','AIPI 510',3.7),
                 ('734','AIPI 510',4.0),
                 ('734','AIPI 520',4.0),
                 ('878','AIPI 510',3.0),
                 ('878','AIPI 520',4.0)]

c.executemany('INSERT INTO Classes VALUES (?,?,?)', class_records)
conn.commit()

# Displays the results of your code
c.execute('SELECT * FROM Classes')
results = c.fetchall()
print("Your results:", len(results), "\nThe entries of Classes:", results)

Your results: 13 
The entries of Classes: [(121, 'AIPI 510', 3.7), (121, 'AIPI 520', 4.0), (121, 'AIPI 530', 3.3), (225, 'AIPI 510', 4.0), (225, 'AIPI 520', 3.3), (767, 'MENG 570', 3.0), (767, 'AIPI 510', 4.0), (988, 'MENG 570', 4.0), (988, 'AIPI 510', 3.7), (734, 'AIPI 510', 4.0), (734, 'AIPI 520', 4.0), (878, 'AIPI 510', 3.0), (878, 'AIPI 520', 4.0)]


Let's now perform a couple of join queries using our two tables. We need to join them on a column they both share, which is called the join key.  In this case both tables share the column `duke_id`.

In [None]:
# Get all students including their name (from Students), courses taken and grades (from Classes)

query = '''
        SELECT Students.name, Classes.course, Classes.grade
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        '''

c.execute(query)
results = c.fetchall()
for result in results:
    print(result)

In [None]:
# Get names and grades of all students who have taken AIPI 510
course_name = 'AIPI 510'
query = '''
        SELECT Students.name, Classes.grade
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        WHERE Classes.course = ?
        '''
# Bind course_name to the '?' placeholder rather than formatting it into the SQL string
c.execute(query, (course_name,))
results = c.fetchall()
for result in results:
    print(result)

Let's now look at what happens when we run a join query which has missing data in one of the tables.

In [None]:
# Get all students including their name (from Students), courses taken and grades (from Classes)
# We will use a left join this time

query = '''
        SELECT Students.name, Classes.course, Classes.grade
        FROM Students LEFT JOIN Classes ON Students.duke_id = Classes.duke_id
        '''

c.execute(query)
results = c.fetchall()
for result in results:
    print(result)

As we can see above, student Glass does not appear in the Classes table, so the left join has no data for their `course` and `grade` and returns `None` for both.  We can run the query again and exclude students who have no recorded classes.

In [None]:
# Get all students including their name (from Students), courses taken and grades (from Classes)
# This time exclude students with no listed classes

query = '''
        SELECT Students.name, Classes.course, Classes.grade
        FROM Students LEFT JOIN Classes ON Students.duke_id = Classes.duke_id
        WHERE Classes.course IS NOT NULL
        '''

c.execute(query)
results = c.fetchall()
for result in results:
    print(result)
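
The same left join can be turned around to find exactly the students who have no classes: keep only the rows where the `Classes` side came back empty. Below is a minimal sketch of this pattern on a reduced copy of the data in an in-memory database; the `demo_*` and `no_classes` names are assumptions for illustration.

```python
import sqlite3

demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE Students (duke_id INTEGER, name TEXT, grad_year INTEGER)")
demo_cur.execute("CREATE TABLE Classes (duke_id INTEGER, course TEXT, grade REAL)")
demo_cur.executemany(
    "INSERT INTO Students VALUES (?, ?, ?)",
    [(878, "Lenz", 2023), (267, "Glass", 2023)],
)
demo_cur.executemany(
    "INSERT INTO Classes VALUES (?, ?, ?)",
    [(878, "AIPI 510", 3.0), (878, "AIPI 520", 4.0)],
)

# Rows with no match in Classes have NULL in every Classes column
query = '''
        SELECT Students.name
        FROM Students LEFT JOIN Classes ON Students.duke_id = Classes.duke_id
        WHERE Classes.duke_id IS NULL
        '''
no_classes = [name for (name,) in demo_cur.execute(query)]
print(no_classes)
demo_conn.close()
```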

## Aggregations
Another common style of query is an aggregation, which summarizes information across multiple records. As in pandas, we group the data using `GROUP BY` in the query and specify how we want to aggregate across the records in each group (e.g., take the mean or sum).  Useful SQL aggregators include `AVG`, `MIN`, `MAX`, `SUM`, and `COUNT`.

In [None]:
# Calculate the average GPA of each student across all classes they have taken

query = '''
        SELECT Students.name, AVG(Classes.grade) 
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        GROUP BY Students.duke_id, Students.name
        '''

c.execute(query)
results = c.fetchall()
for result in results:
    print(result)

In [None]:
# Get the count of how many classes each student has taken so far

query = '''
        SELECT Students.name, COUNT(Classes.course)
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        GROUP BY Students.duke_id, Students.name
        '''

c.execute(query)
results = c.fetchall()
for result in results:
    print(result)
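
To make the comparison with pandas concrete, the sketch below computes the same per-student average and course count twice: once with `GROUP BY` in SQL and once with `groupby` in pandas. It uses a reduced copy of the `Classes` data in an in-memory database; the `demo_conn`, `sql_result`, and `pandas_result` names are assumptions for illustration.

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Classes (duke_id INTEGER, course TEXT, grade REAL)")
demo_conn.executemany(
    "INSERT INTO Classes VALUES (?, ?, ?)",
    [(734, "AIPI 510", 4.0), (734, "AIPI 520", 4.0),
     (878, "AIPI 510", 3.0), (878, "AIPI 520", 4.0)],
)

# SQL: one output row per duke_id, with two aggregates
sql_result = demo_conn.execute('''
    SELECT duke_id, AVG(grade) AS gpa, COUNT(course) AS n_courses
    FROM Classes
    GROUP BY duke_id
    ORDER BY duke_id
''').fetchall()

# pandas: load the raw rows, then group and aggregate in Python
classes_df = pd.read_sql_query("SELECT * FROM Classes", demo_conn)
pandas_result = (
    classes_df.groupby("duke_id")
    .agg(gpa=("grade", "mean"), n_courses=("course", "count"))
    .reset_index()
)

print(sql_result)
print(pandas_result)
demo_conn.close()
```

Doing the aggregation in SQL means only the summary rows leave the database, which matters more as tables grow.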

## SQL and Pandas
We can read SQL queries directly into pandas to create DataFrames of the results.

In [None]:
# Get a dataframe with all data from Students and Classes tables
query = '''
        SELECT Students.duke_id, Students.name, Students.grad_year, Classes.course, Classes.grade
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        '''

df = pd.read_sql_query(query, conn)
df

In [None]:
# Get a dataframe of students, their graduation year and their GPA
# Rename the average grade column to 'gpa' using AS
query = '''
        SELECT Students.duke_id, Students.name, Students.grad_year, AVG(Classes.grade) AS gpa
        FROM Students INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        GROUP BY Students.duke_id, Students.name, Students.grad_year
        '''

df = pd.read_sql_query(query, conn, index_col='duke_id')
df

We can also save data directly from a pandas DataFrame to a table in a SQLite database.

In [None]:
# Create dataframe of students' majors
majors_dict={'duke_id':['225','734','878','878','121','267'],
             'major':['Biology','Finance','CS','AI','CS','Biology']}
majors = pd.DataFrame(majors_dict)

# Create table Majors from dataframe in example.db
conn = db.connect('example.db')
c = conn.cursor()
c.execute("DROP TABLE IF EXISTS Majors")
majors.to_sql(name='Majors',con=conn,index=False)

In [None]:
# We can now query our new table
df = pd.read_sql_query("SELECT * FROM Majors", conn)
df
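
An alternative to dropping the table by hand before writing is the `if_exists` argument of `to_sql`. With `if_exists='replace'`, an existing table of the same name is replaced; with the default (`'fail'`), a second write raises a `ValueError`. A minimal sketch on an in-memory database, where `majors_v1`, `majors_v2`, and `demo_conn` are made-up names for illustration:

```python
import sqlite3
import pandas as pd

demo_conn = sqlite3.connect(":memory:")

# Two made-up versions of the same table
majors_v1 = pd.DataFrame({"duke_id": [225, 734], "major": ["Biology", "Finance"]})
majors_v2 = pd.DataFrame({"duke_id": [225, 734, 878], "major": ["Biology", "Finance", "CS"]})

majors_v1.to_sql(name="Majors", con=demo_conn, index=False, if_exists="replace")

# The second write replaces the table instead of raising an error
majors_v2.to_sql(name="Majors", con=demo_conn, index=False, if_exists="replace")

round_trip = pd.read_sql_query("SELECT * FROM Majors", demo_conn)
print(round_trip)
demo_conn.close()
```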

## Practice problems
### Question 1
Complete the function below to calculate the average GPA for students from each major.  The function should return a list of tuples containing the major and the corresponding average GPA.

In [30]:
def avg_gpa_by_major(db_name):
    ### BEGIN SOLUTION ###

    # Open the database passed in by the caller
    conn = db.connect(db_name)
    c = conn.cursor()

    # Average the grades of all classes taken by students in each major.
    # Assumes the Majors table created in the previous section already exists.
    # ROUND keeps the averages to two decimal places.
    query = '''
        SELECT Majors.major, ROUND(AVG(Classes.grade), 2) AS gpa
        FROM Majors INNER JOIN Classes ON Classes.duke_id = Majors.duke_id
        GROUP BY Majors.major
        ORDER BY Majors.major
        '''
    c.execute(query)
    results = c.fetchall()
    conn.close()

    # Return the list of (major, gpa) tuples rather than printing it
    return results

    ### END SOLUTION ###

In [31]:
gpas = avg_gpa_by_major('example.db')
gpas

[('AI', 3.5), ('Biology', 3.65), ('CS', 3.6), ('Finance', 4.0)]


### Question 2
Complete the function below which returns a **pandas dataframe** containing the name, graduation year, major (or None if not declared), and GPA of all students who have taken classes.

In [None]:
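def all_students(db_name):
    ### BEGIN SOLUTION ###
    conn = db.connect(db_name)
    # INNER JOIN keeps only students with classes; LEFT JOIN keeps those without a major (major is None).
    # A student with two majors appears once per major.
    query = '''
        SELECT Students.name, Students.grad_year, Majors.major, AVG(Classes.grade) AS gpa
        FROM Students
        INNER JOIN Classes ON Students.duke_id = Classes.duke_id
        LEFT JOIN Majors ON Students.duke_id = Majors.duke_id
        GROUP BY Students.duke_id, Students.name, Students.grad_year, Majors.major
        '''
    df = pd.read_sql_query(query, conn)
    conn.close()
    return df
    ### END SOLUTION ###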

In [None]:
# Run function
students_table = all_students('example.db')
students_table

### Question 3
Complete the function below to create a pandas dataframe where the index values are the courses, the columns are the majors, and the cells show the number of students from each major in each course.  Your output should look like this:  
<img align="left" style="padding-top:10px;" src=https://github.com/AIPI510/class_exercises/blob/master/week5/Q3.png?raw=1>

In [None]:
def majors_per_course(db_name):  
    ### BEGIN SOLUTION ###
    
    
    ### END SOLUTION ###

In [None]:
majors_per_course('example.db')
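
One possible approach to Question 3, offered as a sketch rather than as the reference solution: join `Classes` to `Majors` in SQL to get one row per (course, major) pair, then let `pd.crosstab` count the pairs. The function name `majors_per_course_sketch`, the choice to pass an open connection, and the reduced in-memory data are all assumptions for illustration. The sketch assumes a plain table of counts is what is wanted, and students with no declared major are dropped by the inner join.

```python
import sqlite3
import pandas as pd

def majors_per_course_sketch(conn):
    """Return a course-by-major table of student counts (hypothetical helper)."""
    query = '''
        SELECT Classes.course, Majors.major
        FROM Classes INNER JOIN Majors ON Classes.duke_id = Majors.duke_id
        '''
    pairs = pd.read_sql_query(query, conn)
    # crosstab counts how often each (course, major) combination occurs
    return pd.crosstab(pairs["course"], pairs["major"])

# Reduced, made-up copy of the data in a throwaway in-memory database
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Classes (duke_id INTEGER, course TEXT, grade REAL)")
demo_conn.execute("CREATE TABLE Majors (duke_id INTEGER, major TEXT)")
demo_conn.executemany(
    "INSERT INTO Classes VALUES (?, ?, ?)",
    [(121, "AIPI 510", 3.7), (225, "AIPI 510", 4.0), (225, "AIPI 520", 3.3)],
)
demo_conn.executemany(
    "INSERT INTO Majors VALUES (?, ?)",
    [(121, "CS"), (225, "Biology")],
)

counts = majors_per_course_sketch(demo_conn)
print(counts)
demo_conn.close()
```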