# STAT 440 Statistical Data Management - Fall 2021

## Week 05 Notes
### Created by Christopher Kinson and Huiqin Xin


***

### Table of Contents

- [Data Reduction](#data-reduction)  
  - [Filtering rows](#filtering)  
  - [Slicing rows](#slicing)  
  - [Selecting columns](#selecting)  
  - [Dropping missing values](#dropping)
- [Data Expansion](#data-expansion)  
  - [Renaming columns](#renaming)  
  - [Mutating columns](#mutating)  
  

***


## <a name="data-reduction"></a>Data Reduction

Below, I describe a few techniques that achieve data reduction, also known as subsetting. Data workers perform data reduction tasks so often that the tasks become second nature. *String manipulation is not discussed this week, but will be discussed in Week 08.* 

### <a name="filtering"></a>Filtering rows

We can select rows (observations) that satisfy a condition by placing that condition inside the `[]` notation of a pandas data frame. Working with the City of Urbana's [Rental Inspection Grades Listing Data as comma-separated .csv](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data01.csv), we can filter only the rows that have a grade of F.

In [1]:
import pandas as pd
rental_data = pd.read_csv('https://raw.github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/master/data/rental-inspections-grades-data01.csv?token=AAABJG5HX6DWL4R5KIY23UDBGO5S4')
rental_data['Inspection Date'] = pd.to_datetime(rental_data['Inspection Date'])
rental_data['Expiration Date'] = pd.to_datetime(rental_data['Expiration Date'])
rental_data[rental_data['Grade'] == 'Class F']

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
605,1302 Silver Street,932121181018,2013-02-25,Class F,Expired,NaT,"1302 Silver Street\nUrbana, IL\n(-88.1928, 40...."
1327,1304 Silver Street,932121181019,2013-02-25,Class F,Expired,2021-10-14,"1304 Silver Street\nUrbana, IL\n(-88.1925, 40...."


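We can filter only the rows that have a grade of A or F (where the pipe key `|` means "or"). Each condition must be wrapped in its own parentheses, because `|` binds more tightly than `==` in Python.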

In [2]:
rental_data[(rental_data['Grade'] == 'Class A') | (rental_data['Grade'] == 'Class F')]

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
9,807 1/2 West Main Street,912108354004,2011-05-18,Class A,Issued,2020-10-14,"807 1 2 West Main Street\nUrbana, IL\n(40.1144..."
15,601 A Glover Avenue,922116177015,2012-11-29,Class A,Issued,2020-10-14,"601 A Glover Avenue\nUrbana, IL\n(40.108517, -..."
34,2205 South Philo Road,932121332024,2017-03-14,Class A,Issued,2021-10-14,"2205 South Philo Road\nUrbana, IL\n(40.0894966..."
41,1705 Willow Court,912105477032,2010-07-13,Class A,Issued,2020-10-14,"1705 Willow Court\nUrbana, IL\n(-88.2014, 40.1..."
54,901 Harvey Street,912107280014,2011-12-13,Class A,Issued,2019-10-14,"901 Harvey Street\nUrbana, IL\n(-88.2222, 40.1..."
...,...,...,...,...,...,...,...
1654,2014 East Michigan Avenue,922116432056,2018-03-15,Class A,Issued,2021-10-14,"2014 East Michigan Avenue\nUrbana, IL\n(40.101..."
1660,506 West Iowa Street,932117326017,2019-05-31,Class A,Issued,2021-10-14,"506 West Iowa Street\nUrbana, IL\n(40.10490798..."
1669,713 West Pennsylvania Avenue,932117356002,2009-07-24,Class A,Expired,2021-10-14,"713 West Pennsylvania Avenue\nUrbana, IL\n(-88..."
1710,705 West Oregon Street,922117157004,2007-11-30,Class A,Issued,2020-10-14,"705 West Oregon Street\nUrbana, IL\n(-88.2166,..."
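
When the "or" conditions all test the same column, the `isin()` method is a common shorthand. Here is a minimal sketch of that idea; the small `toy` data frame and its addresses are made up for illustration and are not part of the rental data.

```python
import pandas as pd

# Hypothetical toy data standing in for the rental inspections file.
toy = pd.DataFrame({
    'Property Address': ['1 Elm Street', '2 Oak Street', '3 Pine Street', '4 Ash Street'],
    'Grade': ['Class A', 'Class B', 'Class F', 'Class A'],
})

# Two comparisons joined with | (each wrapped in parentheses).
with_or = toy[(toy['Grade'] == 'Class A') | (toy['Grade'] == 'Class F')]

# The same filter written with isin() and a list of accepted values.
with_isin = toy[toy['Grade'].isin(['Class A', 'Class F'])]

print(with_isin)
```

Both filters keep the same rows; `isin()` tends to read better as the list of accepted values grows.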


We can filter the rows that have a grade of A and an expiration date before the year 2021 (where the ampersand `&` means "and").

In [3]:
rental_data2 = rental_data[(rental_data['Grade'] == 'Class A') & (rental_data['Expiration Date'] < '2021-01-01')]
rental_data2.sort_values('Expiration Date')

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
405,2940 Rutherford Drive,912110406031,2015-03-13,Class A,Issued,2018-10-14,"2940 Rutherford Drive\nUrbana, IL\n(-88.1645, ..."
54,901 Harvey Street,912107280014,2011-12-13,Class A,Issued,2019-10-14,"901 Harvey Street\nUrbana, IL\n(-88.2222, 40.1..."
199,1703 Willow Court,912105477033,2010-07-13,Class A,Issued,2019-10-14,"1703 Willow Court\nUrbana, IL\n(-88.2012, 40.1..."
259,503 Sunny Lane,912115180034,2015-03-26,Class A,Issued,2019-10-14,"503 Sunny Lane\nUrbana, IL\n(-88.1729, 40.1081)"
9,807 1/2 West Main Street,912108354004,2011-05-18,Class A,Issued,2020-10-14,"807 1 2 West Main Street\nUrbana, IL\n(40.1144..."
1381,401 West Park Street,912108328007,2008-06-20,Class A,Expired,2020-10-14,"401 West Park Street\nUrbana, IL\n(-88.2111, 4..."
707,610 East Kerr Avenue,912109151012,2018-02-07,Class A,Fee Exempt Registration,2020-10-14,"610 East Kerr Avenue\nUrbana, IL\n(40.12248992..."
693,402 West Washington Street,922117181015,2019-03-08,Class A,Expired,2020-10-14,"402 West Washington Street\nUrbana, IL\n(40.10..."
565,502 West Nevada Street,922117161025,2019-04-03,Class A,Expired,2020-10-14,"502 West Nevada Street\nUrbana, IL\n(40.10675,..."
277,703 Sunny Lane,912115180027,2015-03-26,Class A,Issued,2020-10-14,"703 Sunny Lane\nUrbana, IL\n(-88.1729, 40.1069)"
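
The date comparison above relies on pandas converting the string `'2021-01-01'` to a timestamp. As a sketch of two more explicit ways to write the same condition, assuming a made-up `toy` data frame rather than the real rental data:

```python
import pandas as pd

# Hypothetical toy data; the last expiration date is missing (NaT).
toy = pd.DataFrame({
    'Grade': ['Class A', 'Class A', 'Class B', 'Class A'],
    'Expiration Date': pd.to_datetime(['2019-10-14', '2021-10-14', '2020-10-14', None]),
})

# Compare against an explicit Timestamp instead of a bare string.
cutoff = pd.Timestamp('2021-01-01')
before_2021 = toy[(toy['Grade'] == 'Class A') & (toy['Expiration Date'] < cutoff)]

# Or compare the year component through the .dt accessor.
by_year = toy[(toy['Grade'] == 'Class A') & (toy['Expiration Date'].dt.year < 2021)]

print(before_2021)
```

Rows with a missing expiration date (`NaT`) fail either comparison, so they are left out of the result.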


### <a name="slicing"></a>Slicing rows

Filtering rows works well when we know the data columns and values by name. But if that information is not as readily available, we can reduce the data using values that represent the row location via the `[]` notation.

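For the City of Urbana's Rental Inspection Grades Listing Data, we can slice the 1000th through 1111th rows. Because Python counts positions from 0 and a slice excludes its end position, `999:1111` returns the rows at positions 999 through 1110.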

In [4]:
rental_data[999:1111]

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
999,812 Hill Street,912108301015,2010-07-09,Class B,Issued,2021-10-14,"812 Hill Street\nUrbana, IL\n(-88.2183, 40.1194)"
1000,804 Church Street,912108302019,2014-02-06,Class B,Issued,2021-10-14,"804 Church Street\nUrbana, IL\n(-88.2176, 40.1..."
1001,1603 Abercorn Street,912115388009,2015-02-27,Class B,Issued,2021-10-14,"1603 Abercorn Street\nUrbana, IL\n(-88.1728, 4..."
1002,802 East Green Street,922116103008,2018-06-06,Class B,Issued,2021-10-14,"802 East Green Street\nUrbana, IL\n(40.1111183..."
1003,808 East Scovill Street,932121356019,2017-02-21,Class B,Issued,2021-10-14,"808 East Scovill Street\nUrbana, IL\n(40.08567..."
...,...,...,...,...,...,...,...
1106,1306 East Silver Street,932121181020,2017-04-11,Class D,Temporarily Not a Rental,2019-10-14,"1306 East Silver Street\nUrbana, IL\n(40.09172..."
1107,608 South Webber Street,922116157007,2012-05-22,Class B,Issued,NaT,"608 South Webber Street\nUrbana, IL\n(-88.1989..."
1108,802 West Nevada Street,922117153013,2007-11-28,Class B,Issued,2020-10-14,"802 West Nevada Street\nUrbana, IL\n(-88.2181,..."
1109,1002 North Highlands Drive,912108277016,2010-08-05,Class A,Issued,2021-10-14,"1002 North Highlands Drive\nUrbana, IL\n(-88.2..."
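
Slicing by position and slicing by label treat the end point differently, which is easy to forget. The sketch below uses a made-up ten-row `toy` data frame (not the rental data) to compare plain `[]`, `iloc[]`, and `loc[]`.

```python
import pandas as pd

# Hypothetical toy data with the default index labels 0 through 9.
toy = pd.DataFrame({'value': range(10, 20)})

by_brackets = toy[2:5]    # positions 2, 3, 4 (end excluded)
by_iloc = toy.iloc[2:5]   # same positions, stated explicitly
by_loc = toy.loc[2:5]     # labels 2, 3, 4, 5 (end included)

print(len(by_brackets), len(by_iloc), len(by_loc))
```

With the default index, positions and labels coincide, so the only visible difference is the extra row that `loc[]` returns.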


We can show the first few rows with `head()` or last few rows with `tail()`.

In [5]:
rental_data.head(10)

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,607 1/2 Glover Avenue,922116177018,2015-07-24,Class B,Expired,2021-10-14,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,..."
1,1302 1/2 Hill Street,912107406011,2011-08-17,Class B,Issued,2021-10-14,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ..."
2,212 1/2 Central Avenue,912108383001,2010-04-26,Class B,Issued,NaT,"212 1 2 Central Avenue\nUrbana, IL"
3,801 1/2 East Harding Drive,932121153003,2013-06-12,Class B,Issued,2021-10-14,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09..."
4,1003 1/2 East Harding Drive,932121153010,2013-07-08,Class B,Issued,2020-10-14,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0..."
5,1204 1/2 North Goodwin Avenue,912107276001,2011-10-20,Class B,Issued,2021-10-14,"1204 1 2 North Goodwin Avenue\nUrbana, IL\n(40..."
6,910 1/2 North Busey Avenue,912108153017,2010-12-17,Class B,Issued,2021-10-14,"910 1 2 North Busey Avenue\nUrbana, IL\n(40.12..."
7,1109 1/2 East Main Street,922116126007,2015-06-05,Class B,Issued,2020-10-14,"1109 1 2 East Main Street\nUrbana, IL\n(40.112..."
8,1306 1/2 East Mumford Drive,932121327003,2013-07-08,Class B,Issued,2021-10-14,"1306 1 2 East Mumford Drive\nUrbana, IL\n(40.0..."
9,807 1/2 West Main Street,912108354004,2011-05-18,Class A,Issued,2020-10-14,"807 1 2 West Main Street\nUrbana, IL\n(40.1144..."


In [6]:
rental_data.tail(10)

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
1720,401 South Poplar Street,922116130016,2012-06-29,Class B,Issued,NaT,"401 South Poplar Street\nUrbana, IL\n(-88.1949..."
1721,706 South Coler Avenue,922117161011,2008-03-31,Class B,Issued,NaT,"706 South Coler Avenue\nUrbana, IL\n(-88.2158,..."
1722,401 West Springfield Avenue,922117131006,2019-03-04,Class B,Issued,2021-10-14,"401 West Springfield Avenue\nUrbana, IL\n(40.1..."
1723,1301 1/2 Dublin Street,912107257006,2011-06-29,Class B,Expired,NaT,"1301 1 2 Dublin Street\nUrbana, IL\n(-88.2258,..."
1724,2005 Bruce Drive,932121178007,2013-06-04,Class A,Issued,2021-10-14,"2005 Bruce Drive\nUrbana, IL\n(-88.1932, 40.0928)"
1725,3026 East Stillwater Landing Unit 101,932122406009,2017-12-18,Class B,Issued,2021-10-14,"3026 East Stillwater Landing\nUrbana, IL\n(40...."
1726,1108 South Busey Avenue,932117307003,2019-12-16,Class B,Issued,2021-10-14,"1108 South Busey Avenue\nUrbana, IL\n(40.10310..."
1727,806 Harvey Street,912107428001,2011-11-04,Class B,Issued,2021-10-14,"806 Harvey Street\nUrbana, IL\n(-88.2215, 40.1..."
1728,1302 East Michigan Avenue,922116376032,2016-04-18,Class B,Issued,NaT,"1302 East Michigan Avenue\nUrbana, IL\n(-88.19..."
1729,1503 South Cottage Grove Avenue,922116353022,2016-05-18,Class B,Expired,NaT,"1503 South Cottage Grove Avenue\nUrbana, IL\n(..."


We can also show the rows with the smallest or largest parcel number.

In [7]:
rental_data[rental_data['Parcel Number'] == rental_data['Parcel Number'].min()]

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
936,1902 Willow Road,912104351004,2010-05-21,Class B,Issued,NaT,"1902 Willow Road\nUrbana, IL\n(-88.2007, 40.1305)"


In [8]:
rental_data[rental_data['Parcel Number'] == rental_data['Parcel Number'].max()]

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
497,1611 East Lexington Drive,932128406013,2017-12-05,Class A,Issued,2021-10-14,"1611 East Lexington Drive\nUrbana, IL\n(40.075..."
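
As an alternative sketch, pandas also offers `idxmin()`, `idxmax()`, `nsmallest()`, and `nlargest()` for this kind of task. The `toy` data frame below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for the rental inspections file.
toy = pd.DataFrame({
    'Property Address': ['1 Elm Street', '2 Oak Street', '3 Pine Street'],
    'Parcel Number': [932121181018, 912104351004, 922116177018],
})

# idxmin() returns the index label of the first smallest value.
smallest_label = toy['Parcel Number'].idxmin()
smallest_row = toy.loc[[smallest_label]]  # double brackets keep a data frame

# nlargest() returns the rows holding the n largest values, largest first.
two_largest = toy.nlargest(2, 'Parcel Number')

print(smallest_row)
print(two_largest)
```

One design difference: the boolean filter with `min()` returns every row tied for the smallest value, while `idxmin()` returns only the first.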


### <a name="selecting"></a>Selecting columns

We can select a single column by name using the `[]` notation, or select columns by integer position using the `iloc[]` notation. Selecting can be helpful when we don't need all of a dataset's original columns.

In [9]:
rental_data['Mappable Address']

0       607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...
1       1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...
2                      212 1 2 Central Avenue\nUrbana, IL
3       801 1 2 East Harding Drive\nUrbana, IL\n(40.09...
4       1003 1 2 East Harding Drive\nUrbana, IL\n(40.0...
                              ...                        
1725    3026 East Stillwater Landing\nUrbana, IL\n(40....
1726    1108 South Busey Avenue\nUrbana, IL\n(40.10310...
1727    806 Harvey Street\nUrbana, IL\n(-88.2215, 40.1...
1728    1302 East Michigan Avenue\nUrbana, IL\n(-88.19...
1729    1503 South Cottage Grove Avenue\nUrbana, IL\n(...
Name: Mappable Address, Length: 1730, dtype: object

In [10]:
rental_data.iloc[:,1:]

Unnamed: 0,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,922116177018,2015-07-24,Class B,Expired,2021-10-14,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,..."
1,912107406011,2011-08-17,Class B,Issued,2021-10-14,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ..."
2,912108383001,2010-04-26,Class B,Issued,NaT,"212 1 2 Central Avenue\nUrbana, IL"
3,932121153003,2013-06-12,Class B,Issued,2021-10-14,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09..."
4,932121153010,2013-07-08,Class B,Issued,2020-10-14,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0..."
...,...,...,...,...,...,...
1725,932122406009,2017-12-18,Class B,Issued,2021-10-14,"3026 East Stillwater Landing\nUrbana, IL\n(40...."
1726,932117307003,2019-12-16,Class B,Issued,2021-10-14,"1108 South Busey Avenue\nUrbana, IL\n(40.10310..."
1727,912107428001,2011-11-04,Class B,Issued,2021-10-14,"806 Harvey Street\nUrbana, IL\n(-88.2215, 40.1..."
1728,922116376032,2016-04-18,Class B,Issued,NaT,"1302 East Michigan Avenue\nUrbana, IL\n(-88.19..."
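
To select several columns by name, we can pass a list of names inside `[]`. The following sketch, on a made-up `toy` data frame, puts name-based and position-based selection side by side.

```python
import pandas as pd

# Hypothetical toy data standing in for the rental inspections file.
toy = pd.DataFrame({
    'Property Address': ['1 Elm Street', '2 Oak Street'],
    'Parcel Number': [1, 2],
    'Grade': ['Class A', 'Class B'],
})

one_column = toy['Grade']                          # a Series
two_columns = toy[['Property Address', 'Grade']]   # a data frame
by_position = toy.iloc[:, [0, 2]]                  # columns at positions 0 and 2
by_label = toy.loc[:, 'Parcel Number':'Grade']     # label slice, end included

print(two_columns)
```

Selecting by name is usually safer than selecting by position, since positions shift whenever columns are added or reordered.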


We can de-select (drop) columns as well with the `drop()` method, which returns a new data frame and leaves `rental_data` unchanged.

In [11]:
rental_data.drop(columns=['Mappable Address', 'License Status','Inspection Date', 'Expiration Date'])

Unnamed: 0,Property Address,Parcel Number,Grade
0,607 1/2 Glover Avenue,922116177018,Class B
1,1302 1/2 Hill Street,912107406011,Class B
2,212 1/2 Central Avenue,912108383001,Class B
3,801 1/2 East Harding Drive,932121153003,Class B
4,1003 1/2 East Harding Drive,932121153010,Class B
...,...,...,...
1725,3026 East Stillwater Landing Unit 101,932122406009,Class B
1726,1108 South Busey Avenue,932117307003,Class B
1727,806 Harvey Street,912107428001,Class B
1728,1302 East Michigan Avenue,922116376032,Class B


Or de-select columns with negation, using a list comprehension that keeps every column name `not in` a list of unwanted names.

In [12]:
rental_data[[x for x in rental_data.columns if 
             x not in ['Mappable Address', 'License Status','Inspection Date', 'Expiration Date']]]

Unnamed: 0,Property Address,Parcel Number,Grade
0,607 1/2 Glover Avenue,922116177018,Class B
1,1302 1/2 Hill Street,912107406011,Class B
2,212 1/2 Central Avenue,912108383001,Class B
3,801 1/2 East Harding Drive,932121153003,Class B
4,1003 1/2 East Harding Drive,932121153010,Class B
...,...,...,...
1725,3026 East Stillwater Landing Unit 101,932122406009,Class B
1726,1108 South Busey Avenue,932117307003,Class B
1727,806 Harvey Street,912107428001,Class B
1728,1302 East Michigan Avenue,922116376032,Class B


### <a name="dropping"></a>Dropping missing values

Missing values are often represented as `NA` (not available), `NaN` (not a number), ".", or " " in data. Missing values are slightly different from null values and unknown values. A missing value could be unknown or NULL or an actual value that just never made it into the data frame. 

Null values (`NULL`) are undefined values often used in R coding to create empty objects. The closest counterpart in Python is `None`, which the pandas `isna()` method treats as missing.

Unknown values are usually noted or marked as "unknown" in a dataset. Older data might note a value as "9999" or "99999" to represent an unknown value. Unknown values are not necessarily missing when they are represented as "unknown" or "9999" within a dataset.
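
To illustrate the distinction, here is a small sketch showing that pandas does not count placeholder codes as missing until we convert them. The `toy` data frame, its column names, and the choice to treat `9999` and `"unknown"` as placeholders are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data that uses 9999 and 'unknown' as placeholder codes.
toy = pd.DataFrame({
    'age': [34, 9999, 51, 9999],
    'status': ['Issued', 'unknown', 'Expired', 'Issued'],
})

# The placeholders are ordinary values, so nothing is flagged as missing yet.
missing_before = toy.isna().sum()

# Convert the placeholders to NaN, one column at a time.
cleaned = toy.copy()
cleaned['age'] = cleaned['age'].replace(9999, np.nan)
cleaned['status'] = cleaned['status'].replace('unknown', np.nan)
missing_after = cleaned.isna().sum()

print(missing_before)
print(missing_after)
```

Whether a placeholder should be converted at all depends on what it means in the data's documentation.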

How data workers handle missing values will vary. Imputation is a process of replacing missing values with real values, and this process requires lots of care, theory, and background knowledge about the non-missing data. For real data analysis, imputation may not make much sense, since the goal is to discuss the observed data. Here's a blog with several reviews of the standard textbooks in the field of missing data: https://thestatsgeek.com/stats-books/missing-data-books/. We won't discuss imputation in any meaningful way in this course.

Instead, I'll mention that there is a pandas method for removing missing values called `dropna()` if we ever need to do so. If you don't know whether a value is missing, you can use the `isna()` method.

In [13]:
sum(rental_data['Expiration Date'].isna())

274

In [14]:
rental_data.dropna()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,607 1/2 Glover Avenue,922116177018,2015-07-24,Class B,Expired,2021-10-14,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,..."
1,1302 1/2 Hill Street,912107406011,2011-08-17,Class B,Issued,2021-10-14,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ..."
3,801 1/2 East Harding Drive,932121153003,2013-06-12,Class B,Issued,2021-10-14,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09..."
4,1003 1/2 East Harding Drive,932121153010,2013-07-08,Class B,Issued,2020-10-14,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0..."
5,1204 1/2 North Goodwin Avenue,912107276001,2011-10-20,Class B,Issued,2021-10-14,"1204 1 2 North Goodwin Avenue\nUrbana, IL\n(40..."
...,...,...,...,...,...,...,...
1722,401 West Springfield Avenue,922117131006,2019-03-04,Class B,Issued,2021-10-14,"401 West Springfield Avenue\nUrbana, IL\n(40.1..."
1724,2005 Bruce Drive,932121178007,2013-06-04,Class A,Issued,2021-10-14,"2005 Bruce Drive\nUrbana, IL\n(-88.1932, 40.0928)"
1725,3026 East Stillwater Landing Unit 101,932122406009,2017-12-18,Class B,Issued,2021-10-14,"3026 East Stillwater Landing\nUrbana, IL\n(40...."
1726,1108 South Busey Avenue,932117307003,2019-12-16,Class B,Issued,2021-10-14,"1108 South Busey Avenue\nUrbana, IL\n(40.10310..."
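
By default, `dropna()` removes a row if any of its columns is missing. The sketch below, on a made-up `toy` data frame, counts missing values per column and uses the `subset` argument to drop rows based on one column only.

```python
import pandas as pd

# Hypothetical toy data with one missing grade and one missing date.
toy = pd.DataFrame({
    'Grade': ['Class A', 'Class B', None, 'Class A'],
    'Expiration Date': pd.to_datetime(['2021-10-14', None, '2020-10-14', '2021-10-14']),
})

# Count the missing values in every column at once.
missing_per_column = toy.isna().sum()

# Drop rows with a missing value in any column.
complete_rows = toy.dropna()

# Drop rows only when 'Expiration Date' is missing.
has_expiration = toy.dropna(subset=['Expiration Date'])

print(missing_per_column)
print(len(complete_rows), len(has_expiration))
```

Like `drop()`, `dropna()` returns a new data frame, so `toy` itself still contains the missing values afterward.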



***


## <a name="data-expansion"></a>Data Expansion

The methods in this section will be making the dataset larger in some way, usually by adding new columns of information. This does not include combining data, which will be discussed in Week 10.

### <a name="renaming"></a>Renaming columns

Renaming columns can be accomplished using the `rename()` method, which serves as a convenient way to change a column's name. With `inplace=True`, the data frame is modified directly and no assignment operator is needed. We place the old name on the left side of the `:` sign and the new name on the right side.

In [15]:
rental_data.columns

Index(['Property Address', 'Parcel Number', 'Inspection Date', 'Grade',
       'License Status', 'Expiration Date', 'Mappable Address'],
      dtype='object')

In [16]:
rental_data.rename(columns={'Mappable Address':'full_address'},inplace=True)

In [17]:
rental_data.columns

Index(['Property Address', 'Parcel Number', 'Inspection Date', 'Grade',
       'License Status', 'Expiration Date', 'full_address'],
      dtype='object')
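
Without `inplace=True`, `rename()` returns a renamed copy and leaves the original alone. A minimal sketch on a made-up one-row `toy` data frame:

```python
import pandas as pd

# Hypothetical toy data standing in for the rental inspections file.
toy = pd.DataFrame({
    'Mappable Address': ['1 Elm Street\nUrbana, IL'],
    'Grade': ['Class A'],
})

# rename() without inplace=True returns a new data frame.
renamed = toy.rename(columns={'Mappable Address': 'full_address'})

print(list(toy.columns))
print(list(renamed.columns))
```

Assigning the result to a new name keeps the original column names available, which can help when re-running notebook cells out of order.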

### <a name="mutating"></a>Mutating columns

The real power of data wrangling, and to a larger extent data science, is the ability to create columns of new information. Often, this new information is really just a function of existing information. But usually that new information is what is needed for a later analysis. Recall that the work of data management and wrangling (read: STAT 440) is to do all the data work prior to an actual data analysis. 

We'll use the `[]` notation with a new column name to create the new information and make it appear in the data set. Suppose we want to represent the Grades of the inspections as numbers and create a proportion from that new numeric grade. To accomplish this we might do:

In [18]:
d_grade = {'Class N':1, 'Class F':2, 'Class D':3, 'Class C':4, 'Class B':5, 'Class A':6}
rental_data['grade_numeric'] = rental_data.apply(lambda x: d_grade[x.Grade], axis=1)
rental_data['grade_prop'] = rental_data['grade_numeric'] / 6
rental_data

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,full_address,grade_numeric,grade_prop
0,607 1/2 Glover Avenue,922116177018,2015-07-24,Class B,Expired,2021-10-14,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",5,0.833333
1,1302 1/2 Hill Street,912107406011,2011-08-17,Class B,Issued,2021-10-14,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",5,0.833333
2,212 1/2 Central Avenue,912108383001,2010-04-26,Class B,Issued,NaT,"212 1 2 Central Avenue\nUrbana, IL",5,0.833333
3,801 1/2 East Harding Drive,932121153003,2013-06-12,Class B,Issued,2021-10-14,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09...",5,0.833333
4,1003 1/2 East Harding Drive,932121153010,2013-07-08,Class B,Issued,2020-10-14,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0...",5,0.833333
...,...,...,...,...,...,...,...,...,...
1725,3026 East Stillwater Landing Unit 101,932122406009,2017-12-18,Class B,Issued,2021-10-14,"3026 East Stillwater Landing\nUrbana, IL\n(40....",5,0.833333
1726,1108 South Busey Avenue,932117307003,2019-12-16,Class B,Issued,2021-10-14,"1108 South Busey Avenue\nUrbana, IL\n(40.10310...",5,0.833333
1727,806 Harvey Street,912107428001,2011-11-04,Class B,Issued,2021-10-14,"806 Harvey Street\nUrbana, IL\n(-88.2215, 40.1...",5,0.833333
1728,1302 East Michigan Avenue,922116376032,2016-04-18,Class B,Issued,NaT,"1302 East Michigan Avenue\nUrbana, IL\n(-88.19...",5,0.833333


That's quite powerful! We created `grade_prop` from `grade_numeric`, a column that we had only just created ourselves.
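
For a dictionary lookup like the one above, the `map()` method on a single column is a common alternative to a row-wise `apply()`. Here is a sketch that reuses the `d_grade` dictionary on a made-up three-row `toy` data frame.

```python
import pandas as pd

# The same grade-to-number dictionary used in the notes.
d_grade = {'Class N': 1, 'Class F': 2, 'Class D': 3,
           'Class C': 4, 'Class B': 5, 'Class A': 6}

# Hypothetical toy data standing in for the rental inspections file.
toy = pd.DataFrame({'Grade': ['Class B', 'Class A', 'Class F']})

# map() looks up each grade in the dictionary, one value at a time.
toy['grade_numeric'] = toy['Grade'].map(d_grade)

# The new column can be used right away to build another column.
toy['grade_prop'] = toy['grade_numeric'] / 6

print(toy)
```

One behavioral difference to keep in mind: `map()` produces `NaN` for a grade that is not in the dictionary, whereas the lookup `d_grade[x.Grade]` raises a `KeyError`.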

#### END OF NOTES