
# FRC Analytics with Python - Session 17
# Introduction to Structured Query Language (SQL) - Part I
**Last Updated: 8 October 2021**

This session covers *Structured Query Language* (SQL). SQL is used to create, modify, and retrieve data stored in relational databases. SQL has been in general use since the 1980s and is required knowledge for data scientists and business analysts. SQL is included in our curriculum because the Issaquah Robotics Society's scouting system uses SQL to store, retrieve, and manipulate data.

A *database* is a software program that stores information. Relational databases are a type of database that store data in tables with rows and columns. Databases often run on computer servers and provide information to other computer programs over a network. For example, when a web browser retrieves a web page, the web page is generated by a web server like Apache, Microsoft Internet Information Services (IIS), or Nginx. Web servers typically retrieve data from relational database servers when they construct a web page. MySQL, PostgreSQL, and Oracle are some of the most common relational database servers.

The IRS's scouting system uses a simple relational database called *SQLite*. SQLite is bundled with Python, so if you've installed Python on your computer, you already have SQLite.

#### SQL References
* The [W3 Schools SQL Tutorial](https://www.w3schools.com/Sql/default.asp): The W3 tutorial covers basic SQL syntax that is common to popular database systems. It includes simple examples, and it's great for when you know what type of query you need but can't remember the exact syntax. This notebook contains links to applicable sections of the W3 SQL tutorial.
* Although basic SQL syntax is the same on different database systems, there are differences in how they support advanced SQL features. [The official *SQLite* documentation](https://www.sqlite.org/lang.html) provides precise descriptions of the features supported by *SQLite*.
* This notebook uses the *sqlite3* package from the *Python Standard Library* to interact with *SQLite* databases. [The package's documentation is available here.](https://docs.python.org/3/library/sqlite3.html) Keep in mind that *SQLite* can be used with many programming languages, including Java, Python, C/C++, R, JavaScript, Scala, Julia, and Ruby. Each of these languages has its own library for interacting with *SQLite* databases.

#### If Using Google Colab
It's best if you clone the *pyclass_frc* Github repo and run this notebook from your local computer. But if you would like to run it from Google Colab, uncomment and run the line in the next cell. (*Don't delete the exclamation point at the start of the line!*) The cell will copy a Sqlite database file from the Github repository.

In [1]:
# !wget -nv https://raw.githubusercontent.com/irs1318dev/pyclass_frc/master/sessions/s17_SQL/wasno2020.sqlite3

## I. Getting a Database Connection
To use a Sqlite database, we must first [import the `sqlite3` package](https://docs.python.org/3/library/sqlite3.html). This package is part of the *Python Standard Library*, so there is no need to install it with Anaconda. In addition to `sqlite3`, we'll also [need the *Pandas*](https://pandas.pydata.org/pandas-docs/stable/index.html) and *sys* packages.

In [2]:
import sqlite3
import sys

import pandas as pd

A Sqlite database is stored as a file. Our database is contained in the file *wasno2020.sqlite3*. It contains the IRS's scouting data from the [2020 Pacific Northwest (PNW) district competition at Glacier Peak High School in Snohomish, WA](https://www.thebluealliance.com/event/2020wasno). We can get a connection to this database by using the `sqlite3.connect()` function. 

In [17]:
# Get a connection to the database
db_file = "wasno2020.sqlite3"
con = sqlite3.connect(db_file)

## II. Getting Started
### A. Our First Query
The instructions that we'll give to our database are called queries. Run the next cell to see the results of our first query.

In [4]:
# Our first SQL Query
query = "SELECT * FROM teams;"

# This statement runs the query. We'll explain it later.
teams_dataframe = pd.read_sql_query(query, con, index_col = "team_id")

# Display the first five rows (the default for head())
teams_dataframe.head()

| team_id | team_number | team_name | city | state | region | year_founded | matches_played |
|---|---|---|---|---|---|---|---|
| 1 | 1318 | Issaquah Robotics Society | Issaquah | Washington | | 2004.0 | 10 |
| 2 | 2928 | Viking Robotics | Seattle | Washington | | 2009.0 | 10 |
| 3 | 2903 | NeoBots | Arlington | Washington | | 2009.0 | 10 |
| 4 | 3070 | Team Pronto | Seattle | Washington | | 2009.0 | 9 |
| 5 | 2930 | Sonic Squirrels | Snohomish | Washington | | 2009.0 | 10 |


The first line of the code cell defines a SQL query, `SELECT * FROM teams;`, and saves it in a Python variable. The next line sends the query to our *Sqlite* database and converts the results to a *Pandas* dataframe. Don't pay much attention to this line -- we'll cover it in more detail later. The final statement displays the first five records that were retrieved from the database.

The data from the *teams* table consists of rows and columns. Each row corresponds to a different robotics team, and each column represents a different attribute of a team, such as its name, city or year founded. Rows are often called *records* and the attribute represented by a column is often called a *field*.

This particular SQL query, `SELECT * FROM teams;`, requests all data from the *teams* table. The SQL query has several components.
* Our query started with the SQL keyword `SELECT`, which means we want to retrieve data from the database.
* The asterisk, `*`, means we want to return all columns. If you were reading a SQL query out loud, you could say "all columns" or "all fields" instead of "asterisk".
* The phrase `FROM teams` means we want the data to come from the *teams* table. `FROM` is a SQL keyword and `teams` is a user-created name that we gave to our database table.
* All SQL queries end with a semicolon, `;`.

This data was saved after the first day of the FRC competition, before all qualification matches were completed. That's why the teams have only completed nine or ten matches. Each team normally competes in twelve matches at district competitions in the Pacific Northwest (PNW).

### B. Choosing Columns
We can replace the asterisk in the `SELECT` statement with one or more column names. For example, we could modify the query to provide only the *team_name* and *team_number* columns.

In [5]:
# We can choose our columns
query = """SELECT team_name, team_number FROM teams;"""

# Don't worry about this statement yet
teams_dataframe = pd.read_sql_query(query, con)

# Display the first five rows
teams_dataframe.head()

| | team_name | team_number |
|---|---|---|
| 0 | Issaquah Robotics Society | 1318 |
| 1 | Viking Robotics | 2928 |
| 2 | NeoBots | 2903 |
| 3 | Team Pronto | 3070 |
| 4 | Sonic Squirrels | 2930 |


`SELECT` statements will accept one or more column names. If providing more than one column name, separate the column names with commas.

[Refer to the *W3 SQL Tutorial* for additional examples of `SELECT` statements.](https://www.w3schools.com/Sql/sql_select.asp)

### C. Displaying Different Column Names
Using underscores to separate the words in long column names is a good practice, but it does make our tables look a bit crude.  We can use aliases to display the columns using a different name and make our table look more polished.

In [6]:
# We can rename columns in the results with aliases
query = """SELECT team_name AS Team,
                  team_number AS "Team Number"
             FROM teams;"""

# Don't worry about this statement yet
teams_dataframe = pd.read_sql_query(query, con)

# Display the first five rows
teams_dataframe.head()

| | Team | Team Number |
|---|---|---|
| 0 | Issaquah Robotics Society | 1318 |
| 1 | Viking Robotics | 2928 |
| 2 | NeoBots | 2903 |
| 3 | Team Pronto | 3070 |
| 4 | Sonic Squirrels | 2930 |


The `AS` keyword caused the columns to be renamed when our results were displayed. The new name is called an *alias*. We can include spaces in the alias if we enclose it in double quotes (but then the entire string needs to be enclosed in single or triple quotes).

[Refer to the *W3 SQL Tutorial* for additional alias examples.](https://www.w3schools.com/Sql/sql_alias.asp)

### D. Filtering Results with a `WHERE` Clause
The following query only returns teams that were founded in 2013.

In [7]:
# We can filter the results with a WHERE clause.
query = """SELECT *
             FROM teams
            WHERE year_founded = 2013;"""

# Don't worry about this statement yet
teams_dataframe = pd.read_sql_query(query, con, index_col="team_id")

# Display all rows that matched the filter
teams_dataframe

| team_id | team_number | team_name | city | state | region | year_founded | matches_played |
|---|---|---|---|---|---|---|---|
| 8 | 4681 | Murphy's law | Everett | Washington | | 2013 | 10 |
| 11 | 4682 | BraveBots | Seattle | Washington | | 2013 | 9 |
| 18 | 4683 | Full Metal Robotics | Marysville | Washington | | 2013 | 10 |
| 23 | 4512 | BEAR bots | Everett | Washington | | 2013 | 9 |
| 29 | 4513 | Circuit Breakers | Medical Lake | Washington | | 2013 | 10 |


We filtered the query results by adding a `WHERE` clause. Here are a couple key points:
* The order of clauses matters. The `WHERE` clause must come after the `FROM` clause, or else the SQL query will fail.
* Unlike Python, the equality operator, `=`, contains a single equals sign.

We can use the Boolean operators `AND`, `OR`, and `NOT` in our `WHERE` clause.

In [8]:
# We can use Boolean operators in the WHERE clause.
query = """SELECT *
             FROM teams
            WHERE year_founded <= 2012
              AND city = 'Seattle';"""

# Don't worry about this statement yet
teams_dataframe = pd.read_sql_query(query, con, index_col="team_id")

# Display all rows that matched the filter
teams_dataframe

| team_id | team_number | team_name | city | state | region | year_founded | matches_played |
|---|---|---|---|---|---|---|---|
| 2 | 2928 | Viking Robotics | Seattle | Washington | | 2009 | 10 |
| 4 | 3070 | Team Pronto | Seattle | Washington | | 2009 | 9 |
| 31 | 4180 | Iron Riders | Seattle | Washington | | 2012 | 9 |


The query displays all teams in Seattle that were founded in 2012 or earlier. Here are a couple important things to note:
* The value *Seattle* is surrounded by single quotes (`'`) but the value *2012* does not use any quotes. This is because *year_founded* has a numeric datatype and *city* has a text datatype. Literal text values should always be surrounded by *single* quotes. (*Sqlite* will let you use double quotes around literal strings, but other common database servers, like *Postgres*, will not. It's best to get in the habit of using single quotes in this situation.)
* While SQL keywords and user-supplied identifiers are case-insensitive, SQL searches are often case-sensitive. For example, if we change the second part of our `WHERE` clause to `AND city = 'seattle'`, our search will return no results (try it for yourself!) because all occurrences of 'Seattle' in the *teams* table are capitalized. *Sqlite* queries can be forced to conduct a case-insensitive comparison by adding `COLLATE NOCASE` after the comparison in the `WHERE` clause. Also, it's possible to specify that searches should be case-insensitive when creating a database table.
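The case-sensitivity point can be tried without touching the scouting database. The following is a minimal sketch that assumes a throwaway in-memory database; the `demo_con` name and the three sample rows are made up for illustration, and the `?` placeholders are only used here to load the sample data.

```python
import sqlite3

# Hypothetical in-memory stand-in for the teams table (not the real data file).
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE teams (team_number TEXT, city TEXT);")
demo_con.executemany(
    "INSERT INTO teams VALUES (?, ?);",
    [("2928", "Seattle"), ("3070", "Seattle"), ("1318", "Issaquah")],
)

# The default text comparison is case-sensitive, so 'seattle' matches nothing.
sensitive = demo_con.execute(
    """SELECT team_number
         FROM teams
        WHERE city = 'seattle';"""
).fetchall()

# COLLATE NOCASE makes this one comparison ignore case.
insensitive = demo_con.execute(
    """SELECT team_number
         FROM teams
        WHERE city = 'seattle' COLLATE NOCASE;"""
).fetchall()

print(len(sensitive), len(insensitive))
demo_con.close()
```

The first query finds no rows, while the second finds both Seattle teams.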

[Refer to the *W3 SQL Tutorial* for additional examples of `WHERE` clauses.](https://www.w3schools.com/Sql/sql_where.asp)

### E. SQL Syntax and Style
Unlike Python, SQL is case insensitive. SQL considers the queries `SELECT * FROM teams;` and `select * frOM TEAMS;` to be identical. This means you can't have two tables where the only differences in the table names are that some characters are upper or lower case.

To make our queries easy to read and understand, we will conform to a few style rules. These style rules come from a [SQL styleguide by Simon Holywell](https://www.sqlstyle.guide/). 
* SQL keywords like `SELECT` and `FROM` will always be all uppercase.
* User-generated table and column names will always be all lowercase.
* If table or column names contain multiple words, the words will be separated by underscores.
* Longer queries will be placed on multiple lines, with the leftmost keywords right-aligned.
* Table names will be a plural noun that describes the data stored in the table.

In addition, when writing SQL queries in Python code, the mentor recommends placing SQL queries in triple-quoted strings. This practice has two advantages:
* It allows multi-line queries, which enhances readability.
* Double quotes (`"`) and single quotes (`'`) occur frequently in SQL queries. Single and double quotes can easily be placed in triple-quoted strings.

[Refer to the *W3 SQL Tutorial* for additional information on SQL syntax.](https://www.w3schools.com/Sql/sql_syntax.asp)

## III. Exploring Tables and Schemas

### A. Getting a Table of Tables
Most relational databases contain multiple tables. We can run a special `SELECT` query to see what tables exist in a database.

In [9]:
# Get a list of tables in the database
query = """SELECT *
             FROM sqlite_schema
            WHERE type = 'table';"""

# Run the query and display the results
pd.read_sql_query(query, con)

| | type | name | tbl_name | rootpage | sql |
|---|---|---|---|---|---|
| 0 | table | measures | measures | 53 | CREATE TABLE "measures" (\n"index" INTEGER,\n ... |
| 1 | table | status | status | 115 | CREATE TABLE "status" (\n"index" INTEGER,\n "... |
| 2 | table | teams | teams | 114 | CREATE TABLE teams (\n team_id INTEGER PRIM... |
| 3 | table | schedule | schedule | 3 | CREATE TABLE schedule (\n match_id I... |


The results indicate our database contains four tables: *measures*, *schedule*, *status*, and *teams*.

*Sqlite* databases always contain a special table called *sqlite_schema*, which lists all objects contained in the database. The *sqlite_schema* table contains five columns:
1. `type`: Specifies the type of object. There are four possible values: 'table', 'index', 'view', and 'trigger'. We used a `WHERE` clause to filter the results to just tables. 
2. `name`: The name of the object. For tables, this column contains the table's name.
3. `tbl_name`: Indexes and triggers are always associated with a table. This column specifies the associated table. But for tables and views, the value in this column is always the same as the *name* column.
4. `rootpage`: We won't be using this column. But if you must know, *Sqlite* uses a data structure called a *B-tree* (NOT the same as a binary tree) to store objects. This column contains the location of the table within the B-tree.
5. `sql`: Contains the SQL query that was used to create the object. For tables, this is the `CREATE TABLE` query.
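To see more than one object type in the schema table, here is a small sketch that assumes a throwaway in-memory database containing one table and one index (`demo_con` and `idx_teams_city` are made-up names). It queries `sqlite_master`, the older name for the same table; the `sqlite_schema` alias was added in SQLite 3.33.0, so the older name is the safer choice if you are unsure which SQLite version is bundled with your Python.

```python
import sqlite3

# Hypothetical in-memory database with one table and one index.
demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE teams (team_id INTEGER PRIMARY KEY, city TEXT);")
demo_con.execute("CREATE INDEX idx_teams_city ON teams (city);")

# sqlite_master is the older name for sqlite_schema and works on all versions.
objects = demo_con.execute(
    """SELECT type, name, tbl_name
         FROM sqlite_master;"""
).fetchall()

for obj_type, name, tbl_name in objects:
    print(obj_type, name, tbl_name)
demo_con.close()
```

Note that the index row has a different value in `name` and `tbl_name`, while the table row has the same value in both.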

### B. Database Schemas
The arrangement of database tables, including the table's fields and data types, is called a schema. We can learn more about a database schema by inspecting the *sql* fields in the *sqlite_schema* table. For example, the following query retrieves the query that was used to create the *teams* table.

In [10]:
# Get the query that was used to create the teams table
query = """SELECT sql
             FROM sqlite_schema
            WHERE type = 'table'
              AND name = 'teams';"""

# Run the query and display the results
print(con.execute(query).fetchone()[0])

CREATE TABLE teams (
    team_id INTEGER PRIMARY KEY,
    team_number TEXT NOT NULL UNIQUE,
    team_name TEXT,
    city TEXT,
    state TEXT,
    region TEXT,
    year_founded INTEGER,
    matches_played INTEGER
)


The `CREATE TABLE` query is used to create a new table in a SQL database.  Viewing the `CREATE TABLE` query is useful because it lists all of the columns and their datatypes for a specific table. For the *teams* table, we can see that all columns have a *TEXT* datatype except for *team_id*, *year_founded*, and *matches_played*. Note that the datatype of *team_number* is actually text, even though all of the values appear to be numeric. 

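The important takeaway is that every column has a declared datatype that describes the kind of data it is meant to hold. Most database servers reject data that doesn't match a column's datatype. *SQLite* is more lenient and treats the declared type as a preference, but it's still best to store only data that matches. We'll cover `CREATE TABLE` queries in greater detail in a later session.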

[Refer to the official *SQLite* documentation for additional information on the schema table.](https://www.sqlite.org/schematab.html#interpretation_of_the_schema_table)

#### Pragma Queries
*Sqlite* provides another method for retrieving information about a table. The following query extracts information about the *schedule* table.

In [11]:
query = """PRAGMA table_info(schedule);"""
pd.read_sql_query(query, con, index_col="cid")

| cid | name | type | notnull | dflt_value | pk |
|---|---|---|---|---|---|
| 0 | match_id | INTEGER | 0 | | 1 |
| 1 | match_date | TEXT | 1 | | 0 |
| 2 | comp_level | TEXT | 1 | | 0 |
| 3 | match_desc | TEXT | 1 | | 0 |
| 4 | alliance | TEXT | 1 | | 0 |
| 5 | team | TEXT | 1 | | 0 |
| 6 | station | TEXT | 1 | | 0 |
| 7 | last_match | INTEGER | 0 | | 0 |


Results from `PRAGMA table_info(...)` include six different columns:
* `cid`: Contains the column ID, which is a sequence of integers, starting with zero for the first column.
* `name`: The column name.
* `type`: The column data type.
* `notnull`: If 1, the column must contain information -- it cannot be empty.
* `dflt_value`: Contains the column's default value. The default value is inserted into the column when creating a new record if no value is specified by the user.
* `pk`: If 1, indicates the column is a primary key. Primary keys will be covered in a later session.

There are several dozen different types of `PRAGMA` queries. [The complete list is available on the official *Sqlite* documentation.](https://www.sqlite.org/pragma.html) `PRAGMA` queries can be used to get information about a database's schema or to view or set database settings.
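`PRAGMA` queries can also be run without *Pandas*. The sketch below assumes a made-up `demo_schedule` table in a throwaway in-memory database and reads the same six `table_info` columns as plain Python tuples.

```python
import sqlite3

# Hypothetical table, loosely modeled on the schedule table.
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    """CREATE TABLE demo_schedule (
           match_id INTEGER PRIMARY KEY,
           match_date TEXT NOT NULL,
           last_match INTEGER DEFAULT 0
       );"""
)

# Each result row holds: cid, name, type, notnull, dflt_value, pk
columns = demo_con.execute("PRAGMA table_info(demo_schedule);").fetchall()
for cid, name, col_type, notnull, dflt_value, pk in columns:
    print(cid, name, col_type, notnull, dflt_value, pk)
demo_con.close()
```

Only `match_date` has a 1 in the `notnull` position, and only `match_id` has a 1 in the `pk` position.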

## IV. Joining Tables
Let's look at the first dozen rows in the *schedule* table.

In [12]:
query = """SELECT *
             FROM schedule
            LIMIT 12;"""
pd.read_sql_query(query, con, index_col="match_id")

| match_id | match_date | comp_level | match_desc | alliance | team | station | last_match |
|---|---|---|---|---|---|---|---|
| 205768 | 2020-02-29T11:00:00 | qual | 001-q | red | 4131 | 1 | 10 |
| 205769 | 2020-02-29T11:00:00 | qual | 001-q | red | 4683 | 2 | 10 |
| 205770 | 2020-02-29T11:00:00 | qual | 001-q | red | 2412 | 3 | 9 |
| 205771 | 2020-02-29T11:00:00 | qual | 001-q | blue | 1318 | 1 | 10 |
| 205772 | 2020-02-29T11:00:00 | qual | 001-q | blue | 4089 | 2 | 9 |
| 205773 | 2020-02-29T11:00:00 | qual | 001-q | blue | 8059 | 3 | 10 |
| 205774 | 2020-02-29T11:09:00 | qual | 002-q | red | 4205 | 1 | 10 |
| 205775 | 2020-02-29T11:09:00 | qual | 002-q | red | 4180 | 2 | 9 |
| 205776 | 2020-02-29T11:09:00 | qual | 002-q | red | 2910 | 3 | 10 |
| 205777 | 2020-02-29T11:09:00 | qual | 002-q | blue | 3826 | 1 | 10 |


We tried out a new clause in this query. The `LIMIT n` clause ensures no more than *n* rows are returned from the query.

The schedule table lists all qualification matches for the robotics competition. There are six teams in every match, so there are six records for every match in the *schedule* table. 

The table contains the team number, but not the team's name. Let's fix that.

In [13]:
query = """SELECT schedule.*, teams.team_name
             FROM schedule LEFT JOIN teams
               ON schedule.team = teams.team_number;"""
pd.read_sql_query(query, con, index_col="match_id").head()

| match_id | match_date | comp_level | match_desc | alliance | team | station | last_match | team_name |
|---|---|---|---|---|---|---|---|---|
| 205768 | 2020-02-29T11:00:00 | qual | 001-q | red | 4131 | 1 | 10 | Iron Patriots |
| 205769 | 2020-02-29T11:00:00 | qual | 001-q | red | 4683 | 2 | 10 | Full Metal Robotics |
| 205770 | 2020-02-29T11:00:00 | qual | 001-q | red | 2412 | 3 | 9 | Robototes |
| 205771 | 2020-02-29T11:00:00 | qual | 001-q | blue | 1318 | 1 | 10 | Issaquah Robotics Society |
| 205772 | 2020-02-29T11:00:00 | qual | 001-q | blue | 4089 | 2 | 9 | Stealth Robotics |


The preceding query contains a `LEFT JOIN` clause. For every row in the *schedule* table, SQL extracts the value of the *schedule.team* column and looks for a row in the *teams* table with the same value in the *teams.team_number* column. SQL then adds a *team_name* column to the results.

Note how we prefaced the column selectors with the name of the table: `schedule.*` and `teams.team_name`. Since we are pulling data from two different tables, it's a good practice to specify which table contains the column we are trying to retrieve.

There is also a `RIGHT JOIN` and an `INNER JOIN`. `LEFT JOIN` ensures *all* records from the table to the left of the `LEFT JOIN` keywords (i.e., *schedule*) are returned. For example, suppose our *teams* table was missing team 4683. No schedule records for team 4683 would have been returned if we had used an `INNER JOIN` or a `RIGHT JOIN` because the SQL would not have found a matching row in the *teams* table. We'll discuss this further in the next session.
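The missing-team scenario described above can be sketched with two tiny made-up tables in a throwaway in-memory database (`demo_con` and the sample rows are assumptions for illustration). The *teams* table deliberately omits team 4683.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE teams (team_number TEXT, team_name TEXT);")
demo_con.execute("CREATE TABLE schedule (match_desc TEXT, team TEXT);")
# Team 4683 is in the schedule but is deliberately left out of teams.
demo_con.executemany(
    "INSERT INTO teams VALUES (?, ?);",
    [("4131", "Iron Patriots"), ("2412", "Robototes")],
)
demo_con.executemany(
    "INSERT INTO schedule VALUES (?, ?);",
    [("001-q", "4131"), ("001-q", "4683"), ("001-q", "2412")],
)

# LEFT JOIN keeps every schedule row; unmatched names come back as None.
left_rows = demo_con.execute(
    """SELECT schedule.team, teams.team_name
         FROM schedule LEFT JOIN teams
           ON schedule.team = teams.team_number;"""
).fetchall()

# INNER JOIN keeps only the schedule rows that have a matching team.
inner_rows = demo_con.execute(
    """SELECT schedule.team, teams.team_name
         FROM schedule INNER JOIN teams
           ON schedule.team = teams.team_number;"""
).fetchall()

print(len(left_rows), len(inner_rows))
demo_con.close()
```

The `LEFT JOIN` returns three rows, including `('4683', None)`, while the `INNER JOIN` returns only the two rows with a matching team.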

[Refer to the *W3 SQL Tutorial* for additional information on SQL Joins.](https://www.w3schools.com/Sql/sql_join.asp)

## V. Updating Records
The *team_name* is missing for FRC 4131. Their name is *Iron Patriots*. Let's fix that.

In [14]:
# This query changes an existing record
query = """UPDATE teams
              SET team_name = 'Iron Patriots'
            WHERE team_number = '4131';"""

# This is another way to run SQL queries. 
# We'll discuss it in the next session.
con.execute(query)
con.commit()

The `UPDATE` query changes records that already exist in a database table. Keep in mind that it can't add new records to a table. The `WHERE` clause is extremely important. If we had omitted it, all team names would have been changed to 'Iron Patriots'. Liberty High School has a great FRC team, but we would be making things very hard for the match announcer if every team in the match had the same name.

[Refer to the *W3 SQL Tutorial* for additional information on the `UPDATE` statement.](https://www.w3schools.com/Sql/sql_update.asp)

You probably noticed that we used a different function for executing the query. So far, all of our queries have requested information *from* the database. The `UPDATE` query is different. It's the first query that sends information *to* the database. The `UPDATE` query does not return any information, so there is no information to display in a dataframe. Attempting to run an `UPDATE` query with `pandas.read_sql_query()` will cause an error. 
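Although an `UPDATE` query returns no records, the cursor object returned by `execute()` does report how many rows were changed through its `rowcount` attribute. The sketch below assumes a one-row made-up *teams* table in an in-memory database. It also shows that an `UPDATE` whose `WHERE` clause matches nothing changes zero rows rather than adding a new record.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
demo_con.execute("CREATE TABLE teams (team_number TEXT, team_name TEXT);")
demo_con.execute("INSERT INTO teams VALUES ('4131', NULL);")

# This WHERE clause matches one existing record.
matched = demo_con.execute(
    """UPDATE teams
          SET team_name = 'Iron Patriots'
        WHERE team_number = '4131';"""
).rowcount

# Team 9999 is not in the table, so nothing is changed or added.
unmatched = demo_con.execute(
    """UPDATE teams
          SET team_name = 'No Such Team'
        WHERE team_number = '9999';"""
).rowcount
demo_con.commit()

total_rows = demo_con.execute("SELECT COUNT(*) FROM teams;").fetchone()[0]
print(matched, unmatched, total_rows)
demo_con.close()
```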

## VI. Running Queries
This notebook uses three different techniques to run SQL queries.
* The `pandas.read_sql_query()` function
* The `sqlite3.Connection.execute()` method
* *Sqlite3's* `cursor` object.

We've seen examples of the first two techniques. We'll use the `cursor` object later in this section. We'll also cover how to execute SQL queries in *SQLite* from the command line.

### A. The `pandas.read_sql_query()` Function
We've used the `read_sql_query()` function from the *Pandas* package to execute most of our queries. The `read_sql_query()` function is not part of the SQL specification. It is a function provided by the *Pandas* package that makes it easy to get data from a SQL database into a dataframe. We're using the `read_sql_query()` function in this notebook because displaying the query results as *Pandas* dataframes makes them easy to read.

**Helpful Hint:** *Pandas* SQL functions rely on the *SQLAlchemy* package for most databases, and installing *Pandas* does not automatically install *SQLAlchemy*. *Pandas* accepts a `sqlite3` connection directly, so the queries in this notebook should run without it. If you do get an error the first time you run a *Pandas* SQL function, install *SQLAlchemy* if it isn't installed. The CLI commands `conda list sqlalchemy` and `conda install sqlalchemy` should do the trick.

The `read_sql_query()` function has two required parameters and six optional ones. We've used three of the parameters so far.
* `sql`: A text string containing the SQL query we want to execute.
* `con`: A *Sqlite* database connection object. Without the connection object, Pandas would have no clue which database we want to use.
* `index_col`: *Pandas* dataframes always have an index column that is displayed on the left side of the dataframe. If the user doesn't specify an index, *Pandas* will create an integer index starting at zero. The `index_col` parameter allows us to tell *Pandas* to use one of the table's columns as an index. This prevents *Pandas* from displaying an extraneous column of integers that don't exist in the underlying database.

[Refer to the official *Pandas* documentation for more information on the `read_sql_query()` function.](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql_query.html#pandas.read_sql_query)


### B. The `connection.execute()` Method
The `pandas.read_sql_query()` function only works if the query returns data. For other queries, use the `execute()` method from the `sqlite3` package. Here is an example:
```python
# Import the sqlite3 package
import sqlite3

# Get a connection object to the database file
con = sqlite3.connect("database-file.sqlite3")

# Create and execute the query
query = """UPDATE some_table
              SET some_column = 'some_value'
            WHERE some_other_column = 'some_other_value';"""
con.execute(query)

# Commit the changes
con.commit()
```

This is mostly straightforward. We just run the `execute()` method on the connection object, passing the SQL query as an argument.

The only mysterious part of the code snippet is the line `con.commit()`. If we were to omit this line, the changes would not be saved to the database. It's as if the `commit()` function is really the `simon_says()` function. If we don't say *Simon Says...*, nothing actually happens.

The `commit()` method seems superfluous, but it exists for a very good reason. Suppose you have a savings and a checking account at a local bank. The bank stores account records in a SQL database. They have a *balances* table that contains the balance of every account.
* Now suppose that you try to transfer &dollar;100 from your checking account to your savings account.
* This transfer requires two `UPDATE` queries. The first `UPDATE` query subtracts &dollar;100 from your checking account balance. The second query adds &dollar;100 to your savings account balance.
* Suppose the first query, which decrements your checking account balance by &dollar;100, runs just fine. But then there is a glitch, and the second `UPDATE` query throws an error and doesn't run.
* You have just lost &dollar;100. &dollar;100 was subtracted from your checking account, but there was no corresponding increase in your savings account balance. That's bad.

We want both queries to run with no errors. But if one query fails, it would be best if none of the queries run. Most of us would prefer that the transfer not happen at all over having &dollar;100 mysteriously disappear from our checking account. The `commit()` method exists for this purpose. Consider the following code.

```python
subtract_query = """
    UPDATE balances
       SET balance = balance - 100
     WHERE user_id = '9876543'
       AND account = 'checking';"""
       
add_query = """
    UPDATE balances
       SET balance = balance + 100
     WHERE user_id = '9876543'
       AND account = 'savings';"""
       
con.execute(subtract_query)
con.execute(add_query)
con.commit()
```

The `con.commit()` method allows us to group SQL queries together. If either the `subtract_query` or the `add_query` fails, execution will stop before the `con.commit()` method is run, and no changes will be saved to the database. Another option is to catch the error and call the `con.rollback()` method, which will revert the database to the most recent commit.
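The rollback option can be sketched as follows. This is an illustration under stated assumptions, not the bank's actual schema: the in-memory *balances* table, its starting values, and the `CHECK (balance >= 0)` constraint are all made up, and the constraint exists only to force the second `UPDATE` to fail.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
# Assumed schema: the CHECK constraint rejects negative balances.
demo_con.execute(
    """CREATE TABLE balances (
           account TEXT PRIMARY KEY,
           balance INTEGER NOT NULL CHECK (balance >= 0)
       );"""
)
demo_con.execute("INSERT INTO balances VALUES ('checking', 50), ('savings', 0);")
demo_con.commit()

try:
    # This query succeeds, but its change is not yet committed.
    demo_con.execute(
        """UPDATE balances
              SET balance = balance + 100
            WHERE account = 'savings';"""
    )
    # This query fails: checking would drop to -50.
    demo_con.execute(
        """UPDATE balances
              SET balance = balance - 100
            WHERE account = 'checking';"""
    )
    demo_con.commit()
except sqlite3.IntegrityError:
    # Undo the uncommitted savings update.
    demo_con.rollback()

final_balances = dict(demo_con.execute("SELECT account, balance FROM balances;"))
print(final_balances)
demo_con.close()
```

Because the first `UPDATE` was never committed, the rollback discards it and both balances keep their starting values.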

A group of SQL statements that must be run as a group, with either all or none of the SQL statements being executed, is called a *transaction*. It is possible to configure *SQLite* to automatically commit every SQL query without calling `con.commit()`, but we're not going to do that.

[Refer to the Python *sqlite3* documentation for additional information on the `.commit()` and `.rollback()` methods.](https://docs.python.org/3/library/sqlite3.html#connection-objects)

### C. Closing Connections and Context Managers
We've been running all of our queries on the same database connection. This is fine when we're working in a notebook, but in more significant programs, it's best to close database connections after each transaction. Connections that are left open needlessly consume resources. Promptly closing database connections after use reduces the risk that we'll forget to close them. The next cell closes our connection.

In [15]:
# Closing a database connection
con.close()
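Context managers offer a way to avoid forgetting the `close()` call. The sketch below is one possible pattern, not the only one, and it uses a throwaway in-memory database with the made-up name `demo_con`. Note that using a *sqlite3* connection in a `with` statement commits or rolls back the transaction but does not close the connection, so the sketch wraps the connection in `contextlib.closing()` from the *Python Standard Library*.

```python
import sqlite3
from contextlib import closing

# closing() guarantees demo_con.close() is called when the outer block exits.
with closing(sqlite3.connect(":memory:")) as demo_con:
    # "with demo_con:" commits if the block succeeds and rolls back on an error.
    with demo_con:
        demo_con.execute("CREATE TABLE teams (team_number TEXT);")
        demo_con.execute("INSERT INTO teams VALUES ('1318');")
    team_count = demo_con.execute("SELECT COUNT(*) FROM teams;").fetchone()[0]

# The connection is closed now, so any further use raises an error.
try:
    demo_con.execute("SELECT 1;")
    still_open = True
except sqlite3.ProgrammingError:
    still_open = False

print(team_count, still_open)
```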

### D. Using Cursors
So far we've used two different techniques to run SQL queries on our *SQLite* database. We used *Pandas* `.read_sql_query()` for queries that return information and the *sqlite3* package's `.execute()` methods for queries that don't return any information. You may be wondering ... would we ever use the `.execute()` method for queries that do return information? The answer is yes, absolutely! But first, we need to learn about cursors.

#### Cursor Etymology
You already know about two kinds of cursors. Your computer mouse's or touchpad's cursor indicates the position on the screen that will be activated if you click the mouse's or touchpad's button. Word processors, text editors, and CLIs use a cursor to indicate the position at which text will be entered if you type characters on your keyboard. The word cursor comes from the Latin word *cursor*, which means runner or courier. Before computers were commonplace, the word cursor was [frequently used to describe the transparent slide on a slide rule](https://www.math.utah.edu/~alfeld/sliderules/) (see figure below). A slide rule cursor moves back and forth along the slide rule and has a vertical hairline that helps the user line up different scales on the slide rule.

![Slide Rule](sliderule.jpg)

#### Database Cursors
In the context of databases, the word *cursor* has a special meaning. It refers to a software object that marks the current position within a set of database records that have been returned from a query. But enough talk -- let's look at an example.

In [19]:
# Using a cursor object
query = """SELECT team_number, team_name, city, state, year_founded FROM teams LIMIT 6;"""
con = sqlite3.connect(db_file)
cursor = con.cursor()
cursor.execute(query)
for record in cursor:
    print(record)

('1318', 'Issaquah Robotics Society', 'Issaquah', 'Washington', 2004)
('2928', 'Viking Robotics', 'Seattle', 'Washington', 2009)
('2903', 'NeoBots', 'Arlington', 'Washington', 2009)
('3070', 'Team Pronto', 'Seattle', 'Washington', 2009)
('2930', 'Sonic Squirrels', 'Snohomish', 'Washington', 2009)
('1294', 'Top Gun', 'Sammamish', 'Washington', 2004)


Here's what happened.
1. First, we retrieved a cursor object by calling the `.cursor()` method of the *sqlite3* connection (`con`) object.
2. We executed the query by calling the cursor's `.execute()` method.
3. Finally, we iterated over the cursor and printed the results. Our `for` loop extracted one record at a time from the query results.

Each record is a Python tuple that contains the record's fields. The fields are in the same order as the column order specified in the SQL query. If we had used the asterisk instead of explicitly naming the columns, the field order would have matched the order of columns in the table's `CREATE TABLE` query. Individual fields can be extracted from the tuple using indexing notation:
```python
# Using Index Notation to Retrieve Fields

# Get the team number
team_number = record[0]
# Get the city
city = record[2]
```
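Two alternatives to numeric indexes are sketched below, assuming a one-row made-up table in an in-memory database. The first is ordinary Python tuple unpacking. The second, which this notebook does not otherwise use, sets the connection's `row_factory` attribute to `sqlite3.Row` so that fields can also be looked up by column name.

```python
import sqlite3

demo_con = sqlite3.connect(":memory:")
# sqlite3.Row records support both positions and column names.
demo_con.row_factory = sqlite3.Row
demo_con.execute("CREATE TABLE teams (team_number TEXT, team_name TEXT, city TEXT);")
demo_con.execute(
    "INSERT INTO teams VALUES ('1318', 'Issaquah Robotics Society', 'Issaquah');"
)

record = demo_con.execute(
    "SELECT team_number, team_name, city FROM teams;"
).fetchone()

# Unpack all three fields in one statement.
team_number, team_name, city = record

# Look up a single field by position or by column name.
city_by_position = record[2]
city_by_name = record["city"]

print(team_number, city_by_position, city_by_name)
demo_con.close()
```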

There is an interesting thing about cursors. Suppose we want to loop over the query results again:

In [20]:
# Iterating again
for record in cursor:
    print(record)
con.close()

No records were returned when we tried to iterate over the cursor a second time. That's because the cursor object does not contain the query results. Instead, it points to the query data in the database. Like a text cursor that indicates your current position within a document, the database cursor indicates the position within the database of the data we want to retrieve.

Before the `for` loop, the cursor points to the first record of the query results. The first record is not retrieved from the database until we enter the `for` loop for the first time. On subsequent iterations of the `for` loop, the data in the `record` variable is overwritten with a new tuple containing the next record. After the final record is retrieved and the `for` loop is finished, the cursor is exhausted and points to the end of the query results.

Other than re-running the query, there is no way to make the cursor move backwards and point to an earlier record or the first record. Either you save each record within the body of the `for` loop, or the record is lost forever (or until you re-run the query).
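
One way to keep the records is to copy them into a list before the cursor is exhausted. The sketch below is only an illustration: it assumes a throwaway in-memory database with a two-column `teams` table (not the scouting database) and uses the cursor's `.fetchall()` method.

```python
import sqlite3

# Build a small in-memory table for illustration only.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teams (team_number TEXT, team_name TEXT)")
con.executemany(
    "INSERT INTO teams VALUES (?, ?)",
    [("1318", "Issaquah Robotics Society"), ("2928", "Viking Robotics")],
)

cursor = con.cursor()
cursor.execute("SELECT team_number, team_name FROM teams ORDER BY team_number")

# fetchall() copies every remaining record into a list of tuples.
saved_records = cursor.fetchall()

# The cursor is now exhausted, so a second pass yields nothing.
leftover = list(cursor)

con.close()

# The saved list can be traversed as often as needed, in any order.
for record in reversed(saved_records):
    print(record)
```

The trade-off is that `saved_records` holds every record in memory at once, so this approach only suits query results that are known to be small.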

#### Benefits of Using Cursors
Why would the cursor object behave like this? Why can't we just get a data structure that contains all the records and peruse it at our convenience, in whatever order we choose? It turns out there is a good reason the cursor behaves the way it does. The following code selects all records from the *measures* table, first using *Pandas* and then using a cursor object, and compares the amount of memory used. It uses the [`sys.getsizeof()` function from the *Python Standard Library*](https://docs.python.org/3/library/sys.html#sys.getsizeof) to get the amount of memory in bytes required to store an object.

In [21]:
# Comparing memory use: Pandas dataframe vs. cursor object
query = """SELECT * FROM measures;"""

# Run query using Pandas
con = sqlite3.connect(db_file)
measures = pd.read_sql_query(query, con)
# Get amount of memory consumed by
# measures dataframe 
df_bytes = sys.getsizeof(measures)

# Run query using a cursor object
cursor = con.cursor()
cursor.execute(query)
# Check the memory used by the cursor plus the current record on each
# iteration of the for loop. Keep track of the maximum size.
max_cursor_bytes = sys.getsizeof(cursor)
for record in cursor:
    max_cursor_bytes = max(max_cursor_bytes,
                           sys.getsizeof(cursor) + sys.getsizeof(record))

# Display results
print(f"Bytes used by cursor: {max_cursor_bytes:,}")
print(f"Bytes used by dataframe: {df_bytes:,}")
print(f"Ratio of dataframe to cursor memory used: {int(df_bytes / max_cursor_bytes):,}")
con.close()

Bytes used by cursor: 320
Bytes used by dataframe: 1,500,998
Ratio of dataframe to cursor memory used: 4,690


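Retrieving all of the data using *Pandas* uses over 4,000 times as much memory as using a cursor object! When we use the cursor object to retrieve data, we never hold more than one record in memory at a time, which means the cursor object uses minimal memory. Note that `sys.getsizeof()` counts only the memory directly attributed to an object, not the objects it refers to (such as the values inside a record tuple), so treat the ratio as a rough estimate.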

The entire *measures* table requires about 1.5 MB of memory, which is still quite small. Hundreds of megabytes to a gigabyte are usually not a problem for 2021-era computers. The IRS's scouting data has always been small enough to fit in memory, so choosing between *Pandas* and the *sqlite3* cursor object is a matter of personal preference.

Someday you might work with *really* big data. *SQLite* databases can contain many terabytes of information (the theoretical maximum size of a *SQLite* database is 140 TB or more, depending on the version), so a table could easily take up dozens or hundreds of gigabytes. Such a table cannot be read into memory all at once, so the `pandas.read_sql_query()` function will probably crash. But with a cursor object, you could easily scan through the entire table one record at a time, analyze each record, and save the results of your analysis back to the database or to disk.
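
The record-at-a-time approach can be sketched as follows. This is a minimal illustration, not the scouting system's code: the in-memory `measures` table has only two columns, the values are made up, and the batch size of 2 is arbitrary. It uses the cursor's `.fetchmany()` method, which retrieves a limited number of records per call, so only one small batch is in memory at any moment.

```python
import sqlite3

# Illustrative in-memory table with made-up values.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE measures (team TEXT, successes INTEGER)")
con.executemany(
    "INSERT INTO measures VALUES (?, ?)",
    [("1318", 3), ("1294", 1), ("1318", 2), ("1294", 4), ("2928", 5)],
)

cursor = con.execute("SELECT team, successes FROM measures")

# Running totals are the only data kept between batches.
totals = {}
while True:
    batch = cursor.fetchmany(2)  # At most 2 records per call
    if not batch:
        break  # An empty list means the cursor is exhausted
    for team, successes in batch:
        totals[team] = totals.get(team, 0) + successes

con.close()
print(totals)
```

For a real multi-gigabyte table a much larger batch size would be reasonable. The point is that memory use depends on the batch size and the size of the running summary, not on the size of the table.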

#### Getting Column Names from the Cursor

##### Using the `.description` Attribute
Consider the following query.

In [22]:
# Using a cursor object
query = """SELECT * FROM measures LIMIT 1;"""
con = sqlite3.connect(db_file)
cursor = con.cursor()
cursor.execute(query)
for record in cursor:
    print(record)
con.close()

(0, '2020-02-29T16:29:00', 'wasno', '2020', 'qual', '034-q', 'blue', '1294', '1', 'robot', 'startingPosition_Goal', 'enum', 'auto', 'summary', 'na', 'Goal', 1, 0, 0, 4, 9)


We used an asterisk in the `SELECT` statement to get all columns from the *measures* table. The *measures* table has a lot of columns, and it's not easy to determine which column each tuple value belongs to. The order of the values within the tuple is the same order in which the columns were listed in the `CREATE` SQL statement that was used to create the *measures* table. We could run another SQL query to get the column names (see section III), but the *sqlite3* cursor object provides an easier way. Just access the `.description` attribute on the cursor object.

In [23]:
# Get column names
cursor.description

(('index', None, None, None, None, None, None),
 ('date', None, None, None, None, None, None),
 ('event', None, None, None, None, None, None),
 ('season', None, None, None, None, None, None),
 ('level', None, None, None, None, None, None),
 ('match', None, None, None, None, None, None),
 ('alliance', None, None, None, None, None, None),
 ('team', None, None, None, None, None, None),
 ('station', None, None, None, None, None, None),
 ('actor', None, None, None, None, None, None),
 ('task', None, None, None, None, None, None),
 ('measuretype', None, None, None, None, None, None),
 ('phase', None, None, None, None, None, None),
 ('attempt', None, None, None, None, None, None),
 ('reason', None, None, None, None, None, None),
 ('capability', None, None, None, None, None, None),
 ('successes', None, None, None, None, None, None),
 ('attempts', None, None, None, None, None, None),
 ('cycle_times', None, None, None, None, None, None),
 ('last_match', None, None, None, None, None, None),
 ('num_matches', None, None, None, None, None, None))

The `.description` attribute returns a tuple of tuples. Each inner tuple represents a column from the query results. The inner tuples always have seven elements: the first element contains the column name and the remaining six elements are always `None`. The reason for the `.description` attribute's odd structure is that it must follow the Python Database API specification (PEP 249), which packages for other (non-*SQLite*) database systems also follow. In that specification, the remaining six elements hold additional information about the column, such as its type code and size.

The `.description` attribute can easily be converted to a simple list of column names with a list comprehension.

In [24]:
# Getting a Simple List of Column Names
col_names = [col[0] for col in cursor.description]
print(col_names)

['index', 'date', 'event', 'season', 'level', 'match', 'alliance', 'team', 'station', 'actor', 'task', 'measuretype', 'phase', 'attempt', 'reason', 'capability', 'successes', 'attempts', 'cycle_times', 'last_match', 'num_matches']
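
Once the column names are in a list, they can be paired with each record's values. The following sketch assumes a small in-memory `teams` table created only for this illustration. It combines the `.description` list comprehension with Python's built-in `zip()` function to turn each tuple into a dictionary keyed by column name.

```python
import sqlite3

# Illustrative in-memory table; not the scouting database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE teams (team_number TEXT, team_name TEXT, city TEXT)")
con.execute(
    "INSERT INTO teams VALUES ('1318', 'Issaquah Robotics Society', 'Issaquah')"
)

cursor = con.execute("SELECT * FROM teams")
col_names = [col[0] for col in cursor.description]

# Pair each column name with the value in the same position.
records = [dict(zip(col_names, record)) for record in cursor]
con.close()

print(records[0]["city"])
```

Fields can then be read by name (`records[0]["city"]`) instead of by position, which is easier to follow when a table has many columns.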


##### Using Row Factories
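Another option is to set the connection's `.row_factory` attribute to `sqlite3.Row` before running the query. Each record is then returned as a row object whose fields can be retrieved by column name.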

In [28]:
# Accessing Fields with Column Names Using a Row Factory
con_colnames = sqlite3.connect(db_file)
con_colnames.row_factory = sqlite3.Row

# Run the query
query = """SELECT * FROM measures;"""
cursor = con_colnames.execute(query)

# Retrieve the first results row
row = cursor.fetchone()

print("Each row works like a Python dictionary, with column names as keys.")
print()
print("Column Names")
print(row.keys())
print()
print("Example of retrieving fields with column names")
print(f"Team: {row['team']}, Match: {row['match']}, Date: {row['date']}")
con_colnames.close()

Each row works like a Python dictionary, with column names as keys.

Column Names
['index', 'date', 'event', 'season', 'level', 'match', 'alliance', 'team', 'station', 'actor', 'task', 'measuretype', 'phase', 'attempt', 'reason', 'capability', 'successes', 'attempts', 'cycle_times', 'last_match', 'num_matches']

Example of retrieving fields with column names
Team: 1294, Match: 034-q, Date: 2020-02-29T16:29:00
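
A `sqlite3.Row` object is not an actual dictionary, but it supports both name-based and index-based access and can be converted to a real dictionary when needed. The sketch below assumes a tiny in-memory `teams` table built only for illustration.

```python
import sqlite3

# Illustrative in-memory table; not the scouting database.
con = sqlite3.connect(":memory:")
con.row_factory = sqlite3.Row
con.execute("CREATE TABLE teams (team_number TEXT, city TEXT)")
con.execute("INSERT INTO teams VALUES ('1294', 'Sammamish')")

row = con.execute("SELECT team_number, city FROM teams").fetchone()
con.close()

by_name = row["city"]   # Access a field by column name
by_index = row[1]       # Index notation still works
as_dict = dict(row)     # Convert to a regular Python dictionary

print(by_name, by_index, as_dict)
```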


#### Cursor Shortcuts
So far, we've been explicitly creating cursor objects with the statement `cursor = con.cursor()`. The *sqlite3* package provides a shortcut. If we call the `.execute()` method directly on the connection object, the `.execute()` method will execute the query and return a cursor object. This allows us to eliminate the statement that creates the cursor. An example is provided below.

In [29]:
# Cursor Shortcut - Calling Execute from Connection Object
# This style eliminates one statement
query = """SELECT team_number, team_name, city, state, year_founded FROM teams LIMIT 6;"""
con = sqlite3.connect(db_file)
cursor = con.execute(query)
for record in cursor:
    print(record)
con.close()

('1318', 'Issaquah Robotics Society', 'Issaquah', 'Washington', 2004)
('2928', 'Viking Robotics', 'Seattle', 'Washington', 2009)
('2903', 'NeoBots', 'Arlington', 'Washington', 2009)
('3070', 'Team Pronto', 'Seattle', 'Washington', 2009)
('2930', 'Sonic Squirrels', 'Snohomish', 'Washington', 2009)
('1294', 'Top Gun', 'Sammamish', 'Washington', 2004)


We can use an even shorter syntax if desired.

In [30]:
# Cursor Shortcut - Calling Execute from Connection Object
# This style eliminates two statements!
query = """SELECT team_number, team_name, city, state, year_founded FROM teams LIMIT 6;"""
con = sqlite3.connect(db_file)
for record in con.execute(query):
    print(record)
con.close()

('1318', 'Issaquah Robotics Society', 'Issaquah', 'Washington', 2004)
('2928', 'Viking Robotics', 'Seattle', 'Washington', 2009)
('2903', 'NeoBots', 'Arlington', 'Washington', 2009)
('3070', 'Team Pronto', 'Seattle', 'Washington', 2009)
('2930', 'Sonic Squirrels', 'Snohomish', 'Washington', 2009)
('1294', 'Top Gun', 'Sammamish', 'Washington', 2004)


The final example is short and easy to understand. Which style you use is a matter of personal preference. Note that in the final example, it's not possible to get the column names from the cursor object because the cursor object is never saved to its own variable.

### E. SQLite on the Command Line
The *Pandas* and *sqlite3* packages are essential if we want to interact with a database from a Python program or a *Jupyter* notebook. But sometimes we just want to quickly check something in the database, for which writing Python code would be overkill. Fortunately, *SQLite* provides an easy command line tool. Here is an example of a command that can be run in PowerShell, Mac Terminal, or Linux Bash. (Remember, the exclamation point at the beginning of a Jupyter code cell causes the statement to be run from a command line.)

In [31]:
!sqlite3 wasno2020.sqlite3 "SELECT * FROM teams LIMIT 6;"

1|1318|Issaquah Robotics Society|Issaquah|Washington||2004|10
2|2928|Viking Robotics|Seattle|Washington||2009|10
3|2903|NeoBots|Arlington|Washington||2009|10
4|3070|Team Pronto|Seattle|Washington||2009|9
5|2930|Sonic Squirrels|Snohomish|Washington||2009|10
6|1294|Top Gun|Sammamish|Washington||2004|9


As you can see, we can run a SQL command from a CLI by entering `sqlite3`, then the path to our database file, and finally a SQL query in quotation marks. The results of the query will be printed out on the command line.

Entering `sqlite3` and the path to a database file, and nothing else, causes the `sqlite>` prompt to be displayed. SQL queries can be entered directly at this prompt, and the results will be displayed within the CLI (see example below). Multiple SQL queries can be entered in this fashion. Entering `.quit` exits the *sqlite3* program and returns to the normal CLI prompt.

```
(pyclass) PS C:\Users\tedcodd\sql> sqlite3 wasno2020.sqlite3
SQLite version 3.36.0 2021-06-18 18:36:39
Enter ".help" for usage hints.
sqlite> SELECT * FROM schedule LIMIT 3;
205768|2020-02-29T11:00:00|qual|001-q|red|4131|1|10
205769|2020-02-29T11:00:00|qual|001-q|red|4683|2|10
205770|2020-02-29T11:00:00|qual|001-q|red|2412|3|9
sqlite> .quit
(pyclass) PS C:\Users\tedcodd\sql>
```

The *SQLite* CLI accepts numerous commands that start with a period, which are called *dot* commands. These commands are not part of the SQL language and only work within the *SQLite* CLI program. We've already seen one dot command, the `.quit` command. The `.help` command lists and describes all dot commands. Run the next two code cells to see more dot commands in action.

In [32]:
%%writefile sched.sql
.output schedule.csv
.mode csv
SELECT * FROM schedule;

Overwriting sched.sql


In [33]:
!sqlite3 wasno2020.sqlite3 ".read sched.sql"

There are several things going on in the two preceding code cells.
1. `%%writefile` is a Jupyter magic command. It's neither Python nor SQL, and it only works in a Jupyter code cell. It wrote the contents of the cell to a text file named *sched.sql*. The *sched.sql* file contains two dot commands and one SQL query.
2. The second code cell runs the dot command `.read sched.sql` within the *sqlite3* program. The `.read` command causes each line of the *sched.sql* file to be executed by *sqlite3*.
3. The `.output schedule.csv` command redirects all subsequent output to the *schedule.csv* file.
4. The `.mode csv` command changes the output format to comma-separated values.
5. Finally, the `SELECT` query retrieves every row of the *schedule* table from the *wasno2020.sqlite3* database.

The end result is that the entire *schedule* table is converted to CSV format and saved to the *schedule.csv* file.
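
For comparison, a similar export can be written in Python with the *sqlite3* package and the `csv` module from the *Python Standard Library*. This is only a sketch under stated assumptions: the in-memory `schedule` table, its column names, and the *schedule_example.csv* file name are hypothetical stand-ins for the real database.

```python
import csv
import sqlite3

# Illustrative in-memory table with hypothetical column names.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE schedule (match_name TEXT, alliance TEXT, team TEXT)")
con.executemany(
    "INSERT INTO schedule VALUES (?, ?, ?)",
    [("001-q", "red", "4131"), ("001-q", "red", "4683")],
)

cursor = con.execute("SELECT * FROM schedule")

# newline="" lets the csv module control line endings itself.
with open("schedule_example.csv", "w", newline="") as csv_file:
    writer = csv.writer(csv_file)
    # Header row built from the cursor's column names.
    writer.writerow([col[0] for col in cursor.description])
    # writerows() pulls records from the cursor one at a time.
    writer.writerows(cursor)

con.close()
```

Unlike the dot-command version, this sketch writes a header row of column names. The Python route takes more typing, but it is convenient when the export is one step in a larger program.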

[Refer to the *SQLite* command line shell's official documentation for more information.](https://sqlite.org/cli.html)

## VII. Exercises