# STAT 440 Statistical Data Management - Fall 2021
## Week 11 Notes
### Created by Christopher Kinson


***


## Table of Contents

- [Intro to SQL](#intro)
- [Accessing and importing data](#accessing-and-importing-data)  
- [Handling dates and times](#handling-dates-and-times) 
- [Arranging data](#arranging-data)  
  - [Organizing columns](#organizing)  
  - [Sorting columns](#sorting)  
- [Data Reduction](#data-reduction)  
  - [Filtering rows](#filtering)  
  - [Selecting columns](#selecting)  
  - [Dropping missing values](#dropping)
- [Data Expansion](#data-expansion)  
  - [Renaming columns](#renaming)  
  - [Mutating columns](#mutating)
- [Conditional execution](#conditional-execution)  
 

***


## <a name="intro"></a>Intro to SQL

SQL stands for structured query language and represents a sort of grammar of data management and wrangling. Some important grammar or translations are: database (collection of data files), table (e.g. a subset; a data frame), fields (columns), records (rows), query (any data wrangling task).

The intention behind the language is to have a general method for handling data. SQL can be accessed from almost any platform and programming language, including the two in this course (R and Python). Some subtleties make SQL distinct from how programming languages access and handle data. Although it is structured and general, SQL has many versions (dialects) developed by different users and companies for their own specific purposes. In this course, we plan on using the most common SQL statements, clauses, and keywords so that our queries work in both programming languages. Below is a general form of a typical query.

```
SELECT object-item <, ...object-item> 
FROM from-list 
<WHERE sql-expression> 
<GROUP BY object-item <, ...object-item >> 
<HAVING sql-expression> 
<ORDER BY order-by-item <DESC> <, ...order-by-item>>
```

The `SELECT` statement specifies which columns need to be in the resulting table once the query is complete. New variables may be created in the `SELECT` statement. 

The `FROM` clause points to the dataset (the source of the query). Usually, this dataset exists within the database.

The `WHERE` clause is a way to select records based on conditions.

The `GROUP BY` clause groups data for processing. It helps to remove duplicates when there is a unique identifier in the data. 

The `HAVING` clause allows conditions to be placed on the groups for group processing. This clause is relevant when a `GROUP BY` clause exists.

The `ORDER BY` clause is the way to arrange the data. The default behavior is sorting in ascending order.

The order of these statements, clauses, and keywords matters and must be strictly followed. Meaning, a `WHERE` clause cannot appear before a `FROM` clause. Almost every query requires a `SELECT` statement.

To begin using SQL in Python, we'll use the **sqlite3** module.
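
As a minimal sketch of how the clauses fit together, the block below builds a tiny in-memory table and runs one query that uses every clause from the general form above, in the required order. The table `inspections`, its rows, and the connection name `conn_demo` are made up for illustration and are not part of the course data.

```python
import sqlite3

# In-memory database with a small, made-up table (illustration only).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE inspections (address TEXT, grade TEXT, year INTEGER)")
conn_demo.executemany(
    "INSERT INTO inspections VALUES (?, ?, ?)",
    [
        ("1 Elm St", "Class A", 2019),
        ("2 Elm St", "Class A", 2020),
        ("3 Oak St", "Class B", 2018),
        ("4 Oak St", "Class B", 2020),
        ("5 Oak St", "Class B", 2020),
        ("6 Ash St", "Class C", 2020),
    ],
)

# Clauses appear in the required order:
# SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY.
query = """
SELECT grade, COUNT(*) AS n
FROM inspections
WHERE year >= 2019
GROUP BY grade
HAVING COUNT(*) >= 2
ORDER BY n DESC, grade
"""
clause_demo = conn_demo.execute(query).fetchall()
print(clause_demo)
conn_demo.close()
```

Here `WHERE` drops rows before grouping, while `HAVING` drops whole groups after the counts are computed. The same query string could also be passed to `pd.read_sql_query()` with a connection, as done in the rest of these notes.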


***


## <a name="accessing-and-importing-data"></a>Accessing and importing data

We won't use SQL to import data from a database. Instead, we will continue using the same tools to import data discussed in Week 02 Notes: `read_delim()` and `read_csv()`. After importing data, we can work with those assigned objects.

Let's import the [Rental Inspection Grades Listing Data as comma-separated .csv - GHE](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data01.csv) or [Rental Inspection Grades Listing Data as comma-separated .csv - Box](https://uofi.box.com/shared/static/l9o50efbnemdnaxury4hg45cj8b2truu.csv).


In [1]:
import pandas as pd
rentals = pd.read_csv('https://uofi.box.com/shared/static/l9o50efbnemdnaxury4hg45cj8b2truu.csv')
rentals['Inspection Date'] = pd.to_datetime(rentals['Inspection Date'], format = '%m/%d/%Y')
rentals['Expiration Date'] = pd.to_datetime(rentals['Expiration Date'], format = '%m/%d/%Y')
rentals.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Property Address  1730 non-null   object        
 1   Parcel Number     1730 non-null   int64         
 2   Inspection Date   1730 non-null   datetime64[ns]
 3   Grade             1730 non-null   object        
 4   License Status    1730 non-null   object        
 5   Expiration Date   1456 non-null   datetime64[ns]
 6   Mappable Address  1730 non-null   object        
dtypes: datetime64[ns](2), int64(1), object(4)
memory usage: 94.7+ KB


***


## <a name="handling-dates-and-times"></a>Handling dates and times

In general, dates and times may be considered as numeric information. Thus we should recall the formatting of dates and times in the Week 03 Notes. However, in SQL particularly, dates are handled more easily as character strings after they've been coerced from numeric back to character format. Thus, a good idea is to coerce dates to numeric (date) format, then coerce them back to character format using your programming language of choice (not using SQL). Afterwards, we can use those character strings and their format to perform any necessary subsetting, much like the example below.

In [2]:
import sqlite3 

rentals['Inspection Date'] = rentals['Inspection Date'].astype(str).replace({'NaT': None})
rentals['Expiration Date'] = rentals['Expiration Date'].astype(str).replace({'NaT': None})
# create a new database file (currently empty) and connect to it
conn = sqlite3.connect("stat440_fa21_notes_week11.db") 
conn.commit()
dfrentals = rentals.copy()
dfrentals.to_sql('rentals', conn, if_exists='replace', index = False)



In [3]:
pd.read_sql_query("SELECT * \
FROM rentals \
WHERE `Inspection Date` = '2020-03-10';", conn)

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,405 East Villa Lane,932117428138,2020-03-10,Class A,Issued,2021-10-14,"405 East Villa Lane\r\nUrbana, IL\r\n(40.10434..."
1,113 West Washington Street,932117401008,2020-03-10,Class C,Issued,2021-10-14,"113 West Washington Street\r\nUrbana, IL\r\n(4..."


In [4]:
pd.read_sql_query("SELECT * \
FROM rentals \
limit 5;", conn)

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,607 1/2 Glover Avenue,922116177018,2015-07-24,Class B,Expired,2021-10-14,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108..."
1,1302 1/2 Hill Street,912107406011,2011-08-17,Class B,Issued,2021-10-14,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193..."
2,212 1/2 Central Avenue,912108383001,2010-04-26,Class B,Issued,,"212 1 2 Central Avenue\r\nUrbana, IL"
3,801 1/2 East Harding Drive,932121153003,2013-06-12,Class B,Issued,2021-10-14,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4..."
4,1003 1/2 East Harding Drive,932121153010,2013-07-08,Class B,Issued,2020-10-14,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(..."


**Notice that column names containing spaces must be enclosed in \` backticks, and that the query passed to the read_sql_query() function begins and ends with \" double quotes.**
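
Storing dates as character strings works for subsetting because strings in the `YYYY-MM-DD` format sort in the same order as the dates they represent, so comparisons such as `<`, `>`, and `BETWEEN` behave as expected. The sketch below illustrates this on a made-up table `visits` with the illustrative connection `conn_demo`. It also shows two optional conveniences that are assumptions beyond the notes: a triple-quoted Python string in place of backslash continuations, and `?` placeholders filled from a tuple. SQLite also accepts standard double quotes around column names with spaces, which is what the sketch uses.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
# Column name with a space, quoted with double quotes inside the SQL text.
conn_demo.execute('CREATE TABLE visits ("Inspection Date" TEXT)')
conn_demo.executemany(
    "INSERT INTO visits VALUES (?)",
    [("2019-12-16",), ("2020-03-10",), ("2020-11-02",), ("2021-01-05",)],
)

# A triple-quoted string avoids the backslash line continuations.
query = """
SELECT "Inspection Date"
FROM visits
WHERE "Inspection Date" BETWEEN ? AND ?
ORDER BY "Inspection Date"
"""
# The ? placeholders are filled in from the tuple, in order.
dates_2020 = [row[0] for row in conn_demo.execute(query, ("2020-01-01", "2020-12-31"))]
print(dates_2020)
conn_demo.close()
```

Only the two inspection dates that fall in 2020 are returned. This relies on the zero-padded year-month-day format; strings such as `3/10/2020` would not compare correctly.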


***


## <a name="arranging-data"></a>Arranging data

Arranging a dataset involves organizing its columns and sorting the data by one or more of its columns. 

### <a name="organizing"></a>Organizing columns

When organizing the data, we may want certain columns to appear as the first column, second column, etc. See the edited image below, taken from [RStudio's dplyr cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).

![](https://uofi.box.com/shared/static/sxh3cw9yyol3m3tlu8fhefw9ye1pjmwm.png)

Using SQL, let's arrange the Rental Inspection Grades Listing Data such that:

- the first column is Parcel Number  
- the second column is Property Address  
- the third column is Mappable Address  
- the fourth column is Inspection Date  
- the fifth column is Expiration Date  
- the sixth column is License Status
- the seventh column is Grade  

In [5]:
pd.read_sql_query("SELECT `Parcel Number`, `Property Address`, `Mappable Address`, \
                  `Inspection Date`, `Expiration Date`, `License Status`, Grade \
                  FROM rentals;", conn)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,922116177018,607 1/2 Glover Avenue,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...",2015-07-24,2021-10-14,Expired,Class B
1,912107406011,1302 1/2 Hill Street,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193...",2011-08-17,2021-10-14,Issued,Class B
2,912108383001,212 1/2 Central Avenue,"212 1 2 Central Avenue\r\nUrbana, IL",2010-04-26,,Issued,Class B
3,932121153003,801 1/2 East Harding Drive,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4...",2013-06-12,2021-10-14,Issued,Class B
4,932121153010,1003 1/2 East Harding Drive,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(...",2013-07-08,2020-10-14,Issued,Class B
...,...,...,...,...,...,...,...
1725,932122406009,3026 East Stillwater Landing Unit 101,"3026 East Stillwater Landing\r\nUrbana, IL\r\n...",2017-12-18,2021-10-14,Issued,Class B
1726,932117307003,1108 South Busey Avenue,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1...",2019-12-16,2021-10-14,Issued,Class B
1727,912107428001,806 Harvey Street,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ...",2011-11-04,2021-10-14,Issued,Class B
1728,922116376032,1302 East Michigan Avenue,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8...",2016-04-18,,Issued,Class B


Remember that these are not permanent objects until we assign them.

In [6]:
rentals2 = pd.read_sql_query("SELECT `Parcel Number`, `Property Address`, `Mappable Address`, \
                  `Inspection Date`, `Expiration Date`, `License Status`, Grade \
                  FROM rentals;", conn)  
#convert to SQL
rentals2.to_sql('rentals2', conn, if_exists='replace', index = False)
pd.read_sql_query("SELECT `Parcel Number`, `Property Address`, `Mappable Address`, \
                  `Inspection Date`, `Expiration Date`, `License Status`, Grade \
                  FROM rentals2;", conn)



Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,922116177018,607 1/2 Glover Avenue,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...",2015-07-24,2021-10-14,Expired,Class B
1,912107406011,1302 1/2 Hill Street,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193...",2011-08-17,2021-10-14,Issued,Class B
2,912108383001,212 1/2 Central Avenue,"212 1 2 Central Avenue\r\nUrbana, IL",2010-04-26,,Issued,Class B
3,932121153003,801 1/2 East Harding Drive,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4...",2013-06-12,2021-10-14,Issued,Class B
4,932121153010,1003 1/2 East Harding Drive,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(...",2013-07-08,2020-10-14,Issued,Class B
...,...,...,...,...,...,...,...
1725,932122406009,3026 East Stillwater Landing Unit 101,"3026 East Stillwater Landing\r\nUrbana, IL\r\n...",2017-12-18,2021-10-14,Issued,Class B
1726,932117307003,1108 South Busey Avenue,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1...",2019-12-16,2021-10-14,Issued,Class B
1727,912107428001,806 Harvey Street,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ...",2011-11-04,2021-10-14,Issued,Class B
1728,922116376032,1302 East Michigan Avenue,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8...",2016-04-18,,Issued,Class B
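
As an alternative sketch, a query result can also be stored as a new table inside the database itself with `CREATE TABLE ... AS SELECT`, without the round trip through a pandas data frame. The table names `rentals_demo` and `rentals_demo2` and the connection `conn_demo` below are made up for illustration; this is not the approach used in the rest of these notes.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute('CREATE TABLE rentals_demo (Grade TEXT, "Parcel Number" INTEGER)')
conn_demo.execute("INSERT INTO rentals_demo VALUES ('Class B', 922116177018)")

# Store the reordered result as a new table inside the database itself.
conn_demo.execute(
    'CREATE TABLE rentals_demo2 AS SELECT "Parcel Number", Grade FROM rentals_demo'
)

# cursor.description holds one entry per result column; the name comes first.
cursor = conn_demo.execute("SELECT * FROM rentals_demo2")
new_order = [d[0] for d in cursor.description]
print(new_order)
conn_demo.close()
```

The new table keeps the column order given in the `SELECT` list, so later queries can use `SELECT *` and still see the columns in the organized order.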


### <a name="sorting"></a>Sorting columns

Sorting data in SQL is done with `ORDER BY`. The default ordering is ascending. Now let's sort the `rentals2` data such that the Parcel Numbers are in ascending order.

In [7]:
pd.read_sql_query("SELECT *\
                  FROM rentals2 \
                  ORDER BY `Parcel Number`;", conn) 

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,912104351004,1902 Willow Road,"1902 Willow Road\r\nUrbana, IL\r\n(-88.2007, 4...",2010-05-21,,Issued,Class B
1,912105477028,1601 Willow Road,"1601 Willow Road\r\nUrbana, IL\r\n(-88.201, 40...",2007-10-19,,Issued,Class B
2,912105477030,1709 Willow Court,"1709 Willow Court\r\nUrbana, IL\r\n(-88.2013, ...",2010-07-27,2021-10-14,Issued,Class B
3,912105477031,1707 Willow Court,"1707 Willow Court\r\nUrbana, IL\r\n(-88.2014, ...",2010-07-13,2021-10-14,Issued,Class A
4,912105477032,1705 Willow Court,"1705 Willow Court\r\nUrbana, IL\r\n(-88.2014, ...",2010-07-13,2020-10-14,Issued,Class A
...,...,...,...,...,...,...,...
1725,932128253015,1704 East Trails Drive Unit B,"1704 East Trails Drive\r\nUnit B Urbana, IL\r\...",2018-02-12,2021-10-14,Issued,Class B
1726,932128255024,1705 East Trails Drive Unit A,"1705 East Trails Drive\r\nUnit A Urbana, IL\r\...",2017-12-05,2021-10-14,Issued,Class B
1727,932128276006,1914 East Galena Street,"1914 East Galena Street\r\nUrbana, IL\r\n(40.0...",2017-11-29,2021-10-14,Issued,Class B
1728,932128405023,1708 East Lexington Drive,"1708 East Lexington Drive\r\nUrbana, IL\r\n(40...",2017-12-13,2021-10-14,Issued,Class A


If we want descending order, then we use the `DESC` keyword after the field name. We can also limit the number of observations shown using the `LIMIT n` clause, where $n$ is the number of records to show in the result. Let's sort the `rentals2` data such that the Grades are in descending order, then the Parcel Numbers are in ascending order, and show the first 10 rows of the result.

In [8]:
pd.read_sql_query("SELECT `Parcel Number`, Grade\
                  FROM rentals2 \
                  ORDER BY Grade DESC, `Parcel Number` \
                  LIMIT 10;", conn) 

Unnamed: 0,Parcel Number,Grade
0,922117111001,Class N
1,932121181018,Class F
2,932121181019,Class F
3,922117264013,Class D
4,932117476001,Class D
5,932121181020,Class D
6,912107205012,Class C
7,912107253008,Class C
8,912107255007,Class C
9,912107256012,Class C


## <a name="data-reduction"></a>Data Reduction

Data reduction tasks, from a data wrangling perspective, are done by data workers so often that they become second nature. In SQL, these are easily done with the clauses `WHERE` and `SELECT`.

### <a name="filtering"></a>Filtering rows

We can select rows or observations through conditions with the `WHERE` clause in SQL. Let's filter only the rows that have a grade of F.

In [9]:
pd.read_sql_query("SELECT *\
                  FROM rentals2 \
                  WHERE Grade = 'Class F';", conn) 

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,932121181018,1302 Silver Street,"1302 Silver Street\r\nUrbana, IL\r\n(-88.1928,...",2013-02-25,,Expired,Class F
1,932121181019,1304 Silver Street,"1304 Silver Street\r\nUrbana, IL\r\n(-88.1925,...",2013-02-25,2021-10-14,Expired,Class F


We can filter only the rows that have a grade of A or F (where the keyword `OR` means "or").

In [10]:
pd.read_sql_query("SELECT *\
                  FROM rentals2 \
                  WHERE Grade = 'Class F' OR Grade = 'Class A';", conn)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,912108354004,807 1/2 West Main Street,"807 1 2 West Main Street\r\nUrbana, IL\r\n(40....",2011-05-18,2020-10-14,Issued,Class A
1,922116177015,601 A Glover Avenue,"601 A Glover Avenue\r\nUrbana, IL\r\n(40.10851...",2012-11-29,2020-10-14,Issued,Class A
2,932121332024,2205 South Philo Road,"2205 South Philo Road\r\nUrbana, IL\r\n(40.089...",2017-03-14,2021-10-14,Issued,Class A
3,912105477032,1705 Willow Court,"1705 Willow Court\r\nUrbana, IL\r\n(-88.2014, ...",2010-07-13,2020-10-14,Issued,Class A
4,912107280014,901 Harvey Street,"901 Harvey Street\r\nUrbana, IL\r\n(-88.2222, ...",2011-12-13,2019-10-14,Issued,Class A
...,...,...,...,...,...,...,...
152,922116432056,2014 East Michigan Avenue,"2014 East Michigan Avenue\r\nUrbana, IL\r\n(40...",2018-03-15,2021-10-14,Issued,Class A
153,932117326017,506 West Iowa Street,"506 West Iowa Street\r\nUrbana, IL\r\n(40.1049...",2019-05-31,2021-10-14,Issued,Class A
154,932117356002,713 West Pennsylvania Avenue,"713 West Pennsylvania Avenue\r\nUrbana, IL\r\n...",2009-07-24,2021-10-14,Expired,Class A
155,922117157004,705 West Oregon Street,"705 West Oregon Street\r\nUrbana, IL\r\n(-88.2...",2007-11-30,2020-10-14,Issued,Class A


We can also filter the rows that have a grade of A and an expiration date before the year 2021 (where the keyword `AND` means "and").

In [11]:
pd.read_sql_query("SELECT *\
                  FROM rentals2 \
                  WHERE Grade = 'Class A' AND `Expiration Date` < '2021-01-01';", conn)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,912108354004,807 1/2 West Main Street,"807 1 2 West Main Street\r\nUrbana, IL\r\n(40....",2011-05-18,2020-10-14,Issued,Class A
1,922116177015,601 A Glover Avenue,"601 A Glover Avenue\r\nUrbana, IL\r\n(40.10851...",2012-11-29,2020-10-14,Issued,Class A
2,912105477032,1705 Willow Court,"1705 Willow Court\r\nUrbana, IL\r\n(-88.2014, ...",2010-07-13,2020-10-14,Issued,Class A
3,912107280014,901 Harvey Street,"901 Harvey Street\r\nUrbana, IL\r\n(-88.2222, ...",2011-12-13,2019-10-14,Issued,Class A
4,912109126035,1105 North Carroll Avenue,"1105 North Carroll Avenue\r\nUrbana, IL\r\n(40...",2018-02-12,2020-10-14,Fee Exempt Registration,Class A
5,932117327011,905 South Race Street,"905 South Race Street\r\nUrbana, IL\r\n(-88.21...",2009-06-05,2020-10-14,Issued,Class A
6,912105477033,1703 Willow Court,"1703 Willow Court\r\nUrbana, IL\r\n(-88.2012, ...",2010-07-13,2019-10-14,Issued,Class A
7,912115180034,503 Sunny Lane,"503 Sunny Lane\r\nUrbana, IL\r\n(-88.1729, 40....",2015-03-26,2019-10-14,Issued,Class A
8,912115180027,703 Sunny Lane,"703 Sunny Lane\r\nUrbana, IL\r\n(-88.1729, 40....",2015-03-26,2020-10-14,Issued,Class A
9,922117234004,405 East Elm Street,"405 East Elm Street\r\nUrbana, IL\r\n(40.11145...",2018-02-12,2020-10-14,Fee Exempt Registration,Class A


### <a name="selecting"></a>Selecting columns

We can select certain columns using the `SELECT` clause. Selecting can be helpful when we don't need all of a dataset's original columns. Let's select only the Mappable Address column.

In [12]:
pd.read_sql_query("SELECT `Mappable Address` \
                  FROM rentals2;", conn)

Unnamed: 0,Mappable Address
0,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108..."
1,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193..."
2,"212 1 2 Central Avenue\r\nUrbana, IL"
3,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4..."
4,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(..."
...,...
1725,"3026 East Stillwater Landing\r\nUrbana, IL\r\n..."
1726,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1..."
1727,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ..."
1728,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8..."


De-selecting is not really allowed in SQL, but we can select all columns using `SELECT *`.

In [13]:
pd.read_sql_query("SELECT * \
                  FROM rentals2;", conn)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,922116177018,607 1/2 Glover Avenue,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...",2015-07-24,2021-10-14,Expired,Class B
1,912107406011,1302 1/2 Hill Street,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193...",2011-08-17,2021-10-14,Issued,Class B
2,912108383001,212 1/2 Central Avenue,"212 1 2 Central Avenue\r\nUrbana, IL",2010-04-26,,Issued,Class B
3,932121153003,801 1/2 East Harding Drive,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4...",2013-06-12,2021-10-14,Issued,Class B
4,932121153010,1003 1/2 East Harding Drive,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(...",2013-07-08,2020-10-14,Issued,Class B
...,...,...,...,...,...,...,...
1725,932122406009,3026 East Stillwater Landing Unit 101,"3026 East Stillwater Landing\r\nUrbana, IL\r\n...",2017-12-18,2021-10-14,Issued,Class B
1726,932117307003,1108 South Busey Avenue,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1...",2019-12-16,2021-10-14,Issued,Class B
1727,912107428001,806 Harvey Street,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ...",2011-11-04,2021-10-14,Issued,Class B
1728,922116376032,1302 East Michigan Avenue,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8...",2016-04-18,,Issued,Class B
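
Since SQL has no direct way to de-select a column, one workaround is to build the column list in Python and leave out the unwanted name. The sketch below does this with SQLite's `PRAGMA table_info`, which lists a table's columns. The table `homes`, its single row, and the connection `conn_demo` are made up for illustration.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    'CREATE TABLE homes ("Parcel Number" INTEGER, Grade TEXT, "License Status" TEXT)'
)
conn_demo.execute("INSERT INTO homes VALUES (1, 'Class A', 'Issued')")

# PRAGMA table_info returns one row per column; the name is in position 1.
all_columns = [row[1] for row in conn_demo.execute("PRAGMA table_info(homes)")]

# Keep every column except the one we want to leave out.
kept = [name for name in all_columns if name != "Grade"]
column_list = ", ".join(f'"{name}"' for name in kept)

cursor = conn_demo.execute(f"SELECT {column_list} FROM homes")
result_columns = [d[0] for d in cursor.description]
print(result_columns)
conn_demo.close()
```

Each name is wrapped in double quotes so that columns with spaces remain valid in the generated query.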


### <a name="dropping"></a>Dropping missing values

Missing values are often represented as `NA` (not available), `NaN` (not a number), ".", or " " in data. Missing values are slightly different from null values and unknown values. A missing value could be unknown or NULL or an actual value that just never made it into the data frame. 

Null values (`NULL`) are undefined values often used in R coding to create empty objects. 

Unknown values are usually noted or marked as "unknown" in a dataset. Older data might note a value as "9999" or "99999" to represent an unknown value. Unknown values are not necessarily missing when they are represented as "unknown" or "9999" within a dataset.

In SQL, we can identify missing values using `IS NULL` in the `WHERE` clause, such as `WHERE Name IS NULL`. Let's remove missing values from the Expiration Date column by adding the `NOT` keyword.

In [14]:
pd.read_sql_query("SELECT `Expiration Date` \
                  FROM rentals2 \
                  WHERE `Expiration Date` IS NOT NULL;", conn)

Unnamed: 0,Expiration Date
0,2021-10-14
1,2021-10-14
2,2021-10-14
3,2020-10-14
4,2021-10-14
...,...
1451,2021-10-14
1452,2021-10-14
1453,2021-10-14
1454,2021-10-14


***


## <a name="data-expansion"></a>Data Expansion

The methods in this section will be making the dataset larger in some way, usually by adding new columns of information.

### <a name="renaming"></a>Renaming columns

Renaming variables can be accomplished using the `AS` keyword within the `SELECT` clause, which serves as a convenient way to change a column's name without an assignment operator. We place the old name on the left side of `AS` and the new name on the right side.

In [15]:
pd.read_sql_query("SELECT `Mappable Address` AS full_address \
                  FROM rentals2;", conn)

Unnamed: 0,full_address
0,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108..."
1,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193..."
2,"212 1 2 Central Avenue\r\nUrbana, IL"
3,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4..."
4,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(..."
...,...
1725,"3026 East Stillwater Landing\r\nUrbana, IL\r\n..."
1726,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1..."
1727,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ..."
1728,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8..."


### <a name="mutating"></a>Mutating columns

The real power of data wrangling and to a larger extent, data science, is the ability to create columns of new information. Often, this new information is really just a function of existing information. But, usually that new information is what is needed for a later analysis. Recall that the work of data management and wrangling (read: STAT 440) is to do all the data work prior to an actual data analysis. 

Below is a table of SQL operators for arithmetic and logicals (booleans).

Operation or Function | SQL Syntax 
---|---
Addition | + 
Subtraction | \- 
Multiplication | \* 
Division | \/ 
Exponentiation | POWER() 
Modulus | %
Equal to (for comparison) | =
Not Equal to | <>
Greater than | > 
Less than | < 
Greater than or equal to | >= 
Less than or equal to | <= 
And | AND
Or | OR
Negation (aka Not) | NOT
Square root | SQRT() 
Absolute value | ABS() 
Logarithm (natural) | LOG() (in SQLite, LN(); SQLite's LOG() is base 10)
Exponential | EXP() 
Mean | AVG()
Minimum | MIN()
Maximum | MAX()

Read the following for more information on SQL's operators and functions [SQL operators](https://www.w3schools.com/sql/sql_operators.asp) and [SQL functions](https://www.w3schools.com/sql/sql_ref_sqlserver.asp).

Within the `SELECT` clause, we can create the new columns. Suppose we want to represent the Grades of the inspections as numbers and create a proportion from that new numeric grade. The math operator that's going to be useful is division `/`. The SQL special statement that we have not discussed yet is the `CASE`...`WHEN`...`THEN`...`ELSE` which will allow us to have results similar to the `ifelse()` function in R. To accomplish this we might do:

In [16]:
rentals3 = pd.read_sql_query("SELECT *,\
        CASE \
          WHEN Grade='Class N' THEN 1 \
          WHEN Grade='Class F' THEN 2 \
          WHEN Grade='Class D' THEN 3 \
          WHEN Grade='Class C' THEN 4 \
          WHEN Grade='Class B' THEN 5 \
          ELSE 6 \
          END AS grade_numeric \
      FROM rentals2;", conn)
rentals3.to_sql('rentals3', conn, if_exists='replace', index = False)
pd.read_sql_query("SELECT grade_numeric/6.0 AS grade_prop \
      FROM rentals3;", conn)



Unnamed: 0,grade_prop
0,0.833333
1,0.833333
2,0.833333
3,0.833333
4,0.833333
...,...
1725,0.833333
1726,0.833333
1727,0.833333
1728,0.833333


That's quite powerful! 
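
One detail in the query above is worth a closer look: the division uses `6.0` rather than `6`. In SQLite, dividing one integer by another gives an integer result, so `grade_numeric/6` would be truncated to 0 for most grades. The short sketch below checks this with literal values; the connection name `conn_demo` is illustrative only.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")

# Integer divided by integer is truncated to an integer in SQLite.
int_result = conn_demo.execute("SELECT 5 / 6").fetchone()[0]

# Making either operand a real number gives real-valued division.
real_result = conn_demo.execute("SELECT 5 / 6.0").fetchone()[0]

# Remainder after division (the modulus operator from the table above).
mod_result = conn_demo.execute("SELECT 17 % 5").fetchone()[0]

print(int_result, round(real_result, 6), mod_result)
conn_demo.close()
```

Other SQL dialects may handle integer division differently, so writing the divisor as a real number is a safe habit when a proportion is intended.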


***


## <a name="conditional-execution"></a>Conditional execution

SQL does allow for conditional execution in a form similar to `ifelse()` in R with `CASE`...`WHEN`...`THEN`...`ELSE`. This statement is part of the possibilities with the `SELECT` clause.

```
CASE
    WHEN condition1 THEN result1
    WHEN condition2 THEN result2
    WHEN conditionN THEN resultN
    ELSE result
END;
```

We already used this statement with the rentals data in the Mutating columns section. The same example is repeated below.

In [17]:
rentals4 = pd.read_sql_query("SELECT *,\
        CASE \
          WHEN Grade='Class N' THEN 1 \
          WHEN Grade='Class F' THEN 2 \
          WHEN Grade='Class D' THEN 3 \
          WHEN Grade='Class C' THEN 4 \
          WHEN Grade='Class B' THEN 5 \
          ELSE 6 \
          END AS grade_numeric \
      FROM rentals2;", conn)
rentals4.to_sql('rentals4', conn, if_exists='replace', index = False)
pd.read_sql_query("SELECT grade_numeric/6.0 AS grade_prop \
      FROM rentals4;", conn)



Unnamed: 0,grade_prop
0,0.833333
1,0.833333
2,0.833333
3,0.833333
4,0.833333
...,...
1725,0.833333
1726,0.833333
1727,0.833333
1728,0.833333
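
When every condition compares the same column to a fixed value, SQL also offers a shorter "simple" form of `CASE` in which the column is named once after the `CASE` keyword. The sketch below shows that form on a made-up table `grades` with the illustrative connection `conn_demo`; it is an alternative way to write the same recoding, not a replacement for the form used above.

```python
import sqlite3

conn_demo = sqlite3.connect(":memory:")
conn_demo.execute("CREATE TABLE grades (Grade TEXT)")
conn_demo.executemany(
    "INSERT INTO grades VALUES (?)",
    [("Class N",), ("Class B",), ("Class A",), ("Class F",)],
)

# Simple CASE: the column is named once and each WHEN lists a value to match.
query = """
SELECT Grade,
       CASE Grade
         WHEN 'Class N' THEN 1
         WHEN 'Class F' THEN 2
         WHEN 'Class D' THEN 3
         WHEN 'Class C' THEN 4
         WHEN 'Class B' THEN 5
         ELSE 6
       END AS grade_numeric
FROM grades
ORDER BY grade_numeric
"""
case_demo = conn_demo.execute(query).fetchall()
print(case_demo)
conn_demo.close()
```

The searched form (`CASE WHEN condition THEN ...`) is still needed when the conditions involve different columns or comparisons other than equality.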


In [18]:
conn.close()

#### END OF NOTES