# STAT 440 Statistical Data Management - Fall 2021
## Week 09 Notes
### Created by Christopher Kinson and Huiqin Xin


***


## Table of Contents

- [Summarizing data](#summarizing)
- [Combining data](#combining)


***


## <a name="summarizing"></a>Summarizing data

Another important aspect of data wrangling is to summarize or aggregate data. This can also be described as applying summary functions, such as the sum `sum()`, mean `mean()`, median `median()`, variance `var()`, or standard deviation `std()`, to grouped data, also known as "group processing". Grouped data can be any data with a categorical variable or factor as a column. This task comes in handy when we want to know statistical or numeric values for each member of a group. To accomplish summarization, sometimes we can leverage the way the data are arranged (or sorted). Other times, the arrangement has no bearing on our ability to aggregate. Ideally, we want the result to be a data frame or a similar labeled object, such as a Series, when possible.

Working with the City of Urbana's [Rental Inspection Grades Listing Data as tab-separated .txt - GHE](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data03.txt) or the Box data URL https://uofi.box.com/shared/static/0j11not1cqmonbrwy0l8zmzr27fafm2j.txt, one way to achieve the grouped summaries is with loops and conditional execution. Let's compute the proportion for each inspection grade.

In [1]:
import pandas as pd
RentalsData = pd.read_csv("https://uofi.box.com/shared/static/0j11not1cqmonbrwy0l8zmzr27fafm2j.txt",sep='\t')
RentalsData.head()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108..."
1,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193..."
2,212 1/2 Central Avenue,912108400000.0,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\r\nUrbana, IL"
3,801 1/2 East Harding Drive,932121200000.0,6/12/2013,Class B,Issued,10/14/2021,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4..."
4,1003 1/2 East Harding Drive,932121200000.0,7/8/2013,Class B,Issued,10/14/2020,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(..."


In [2]:
g = sorted(RentalsData['Grade'].unique())
groups = [0]*len(g)
for i in range(len(g)):
    groups[i] = len(RentalsData[RentalsData['Grade'] == g[i]])
s = sum(groups)
pd.DataFrame(data={'grades':g,'count_grade':groups,'grade_proportion':[x/s for x in groups]})

Unnamed: 0,grades,count_grade,grade_proportion
0,Class A,155,0.089595
1,Class B,1442,0.833526
2,Class C,127,0.07341
3,Class D,3,0.001734
4,Class F,2,0.001156
5,Class N,1,0.000578


For summarization in pandas, we use `groupby()` to process the data separately for each group in the data frame and `agg()` to apply one or more summary functions, such as `sum()`, `mean()`, `median()`, or `count()`. If a summary function does not already exist in pandas, try looking in another package such as NumPy or scikit-learn. Alternatively, we can write a custom summary function, for example as a `lambda` expression, and pass it to `agg()` or `apply()`.
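As a minimal sketch of the `groupby()` plus `agg()` pattern described above, the block below uses a small made-up data frame (the `toy` table and its `Score` column are hypothetical and not part of the rentals data). It applies three built-in summaries and one custom `lambda` summary in a single call.

```python
import pandas as pd

# Hypothetical toy data standing in for any grouped data set.
toy = pd.DataFrame({
    "Grade": ["Class A", "Class A", "Class B", "Class B", "Class B"],
    "Score": [90, 80, 70, 60, 50],
})

# Named aggregation: each keyword becomes an output column, built from
# an (input column, summary function) pair.
summary = toy.groupby("Grade").agg(
    n=("Score", "count"),
    mean_score=("Score", "mean"),
    median_score=("Score", "median"),
    # Custom summary written as a lambda: the range of scores per group.
    score_range=("Score", lambda s: s.max() - s.min()),
)
print(summary)
```

The result is a data frame indexed by `Grade`, with one row per group and one column per summary.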

We can use the `count()` function to total each group of Grades and then divide by the number of rows. The result will keep the groups intact because of the `groupby()` function.

In [87]:
gagg=RentalsData.groupby(by='Grade').agg('count').iloc[:,0]
gagg

Grade
Class A     155
Class B    1442
Class C     127
Class D       3
Class F       2
Class N       1
Name: Property Address, dtype: int64

In [94]:
gagg2 = gagg / len(RentalsData.index)
gagg2

Grade
Class A    0.089595
Class B    0.833526
Class C    0.073410
Class D    0.001734
Class F    0.001156
Class N    0.000578
Name: Property Address, dtype: float64

Both `gagg` and `gagg2` are pandas Series, so we can name them and join them together to make a pandas DataFrame. That way our result looks more like the result in the R notes.

In [97]:
gagg01 = pd.Series(gagg, name='Counts')
gagg02 = pd.Series(gagg2, name='Props')
gagg01.to_frame().join(gagg02)

Grade,Counts,Props
Class A,155,0.089595
Class B,1442,0.833526
Class C,127,0.07341
Class D,3,0.001734
Class F,2,0.001156
Class N,1,0.000578
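A shorter route to the same kind of counts-and-proportions table is `value_counts()`, which accepts `normalize=True` to return proportions instead of counts. The sketch below uses a small hypothetical `grades` Series in place of `RentalsData['Grade']`, so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical toy column standing in for RentalsData['Grade'].
grades = pd.Series(
    ["Class B", "Class A", "Class B", "Class C", "Class B"], name="Grade"
)

# Frequency of each grade, ordered by grade label.
counts = grades.value_counts().sort_index()
# Share of each grade; the shares sum to 1.
props = grades.value_counts(normalize=True).sort_index()

# Both Series share the same index, so they line up as two columns.
summary = pd.DataFrame({"Counts": counts, "Props": props})
print(summary)
```

Note that `value_counts()` skips missing values by default, so with missing grades the proportions would be relative to the non-missing rows rather than to all rows.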


If you're an R user looking for quick Python analogs for some key tidyverse functionality, check out this gist: https://gist.github.com/conormm/fd8b1980c28dd21cfaf6975c86c74d07.

***


## <a name="combining"></a>Combining data

What happens when you need to work with multiple datasets at once? What happens when the one dataset you have does not contain enough information? Where do you get the additional information? Combining datasets is a very useful data wrangling operation. Grabbing information from another dataset and adding it to your current one potentially increases your information. Combining data can mean different things in different disciplines, and the same operation can go by different names, such as concatenating, merging, binding, appending, or joining.

In Python, concatenating is the act of combining objects or strings together end to end and is typically done with `+` or `str.join()`.
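A small sketch of both concatenation tools follows. The address pieces are hypothetical values chosen only for illustration.

```python
# Hypothetical address pieces.
street = "607 1/2 Glover Avenue"
city = "Urbana"
state = "IL"

# `+` glues strings together end to end.
with_plus = street + ", " + city + ", " + state

# `str.join` places a separator between the items of an iterable of strings.
with_join = ", ".join([street, city, state])

# `+` also concatenates other sequences, such as lists.
combined_list = [1, 2] + [3, 4]

print(with_plus)
print(with_join)
print(combined_list)
```

Both string approaches produce the same text. `str.join` tends to be the more convenient choice when the pieces already live in a list.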

In Python, binding (or appending) is the act of combining two or more objects by stacking one on top of the other, such as `concat(axis=0)` in pandas, or by placing one next to the other, such as `concat(axis=1)` in pandas. Note that the pandas `merge` function does not take an `axis` argument; it is used for merging on common columns, described next.
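The two directions of binding can be sketched with `pd.concat` on small made-up tables (the `top`, `bottom`, and `extra` data frames are hypothetical).

```python
import pandas as pd

# Hypothetical tables with the same columns.
top = pd.DataFrame({"ID": ["A", "B"], "Salary": [10, 11]})
bottom = pd.DataFrame({"ID": ["D"], "Salary": [12]})

# axis=0 stacks rows; ignore_index=True renumbers the rows 0, 1, 2.
stacked = pd.concat([top, bottom], axis=0, ignore_index=True)

# axis=1 places columns side by side; rows are paired by index label.
extra = pd.DataFrame({"Number": [2175551214, 2175551224, 2175551244]})
side_by_side = pd.concat([stacked, extra], axis=1)

print(stacked)
print(side_by_side)
```

Without `ignore_index=True`, the stacked result would keep the original row labels (0, 1, 0), which can cause surprises in a later `axis=1` bind.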

Merging (or joining) usually implies combining two or more objects with different columns of information into one single object. Merging requires the different data objects to have at least one column in common that holds identifying information, such as an ID variable or geographic location. There are at least 3 situations that can occur when merging objects.

1. Observations in the two (or more) separate objects might not match each other at all.

**Data 1**  

ID | Salary
---|---
A  | \$10K
B  | \$11K
D  | \$12K

**Data 2**  

ID | Number
---|---
C | 2175551234
E | 2175551235
F | 2175551236

**Merged Data**  

ID | Salary | Number
---|---|---
A | \$10K |
B | \$11K |
D | \$12K |
C |  | 2175551234
E |  | 2175551235
F |  | 2175551236

2. Observations in the two (or more) separate objects could match each other one-to-one.

**Data 1**  

ID | Salary
---|---
A | \$10K
B | \$11K
D | \$12K

**Data 2**  

ID | Number
---|---
A | 2175551214
B | 2175551224
D | 2175551244

**Merged Data**  

ID | Salary | Number
---|---|---
A | \$10K | 2175551214
B | \$11K | 2175551224
D | \$12K | 2175551244

3. Observations in the two (or more) separate objects could match each other one-to-many (or many-to-one).

**Data 1**  

ID | Salary
---|---
A | \$10K
D | \$12K

**Data 2** 

ID | Number
---|---
A | 2175551214
A | 2175551204
D | 2175551244

**Merged Data**  

ID | Salary | Number
---|---|---
A | \$10K | 2175551214
A | \$10K | 2175551204
D | \$12K | 2175551244
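The three situations above can be reproduced with `merge` on the same small tables. This is only a sketch: the variable names are made up, and an outer join is used throughout so that non-matching rows stay visible.

```python
import pandas as pd

d1 = pd.DataFrame({"ID": ["A", "B", "D"], "Salary": ["$10K", "$11K", "$12K"]})

# Situation 1: no IDs in common, so every row is a non-match.
d2_none = pd.DataFrame({"ID": ["C", "E", "F"],
                        "Number": [2175551234, 2175551235, 2175551236]})
no_match = d1.merge(d2_none, on="ID", how="outer")

# Situation 2: every ID appears exactly once in each table.
d2_one = pd.DataFrame({"ID": ["A", "B", "D"],
                       "Number": [2175551214, 2175551224, 2175551244]})
one_to_one = d1.merge(d2_one, on="ID", how="outer")

# Situation 3: ID "A" appears once on the left and twice on the right.
d1_small = pd.DataFrame({"ID": ["A", "D"], "Salary": ["$10K", "$12K"]})
d2_many = pd.DataFrame({"ID": ["A", "A", "D"],
                        "Number": [2175551214, 2175551204, 2175551244]})
one_to_many = d1_small.merge(d2_many, on="ID", how="outer")

print(no_match)
print(one_to_one)
print(one_to_many)
```

Two details may differ from the tables above: an outer join can order the rows by key rather than by table, and the `Number` column in the first result is stored as floating point because it contains missing values.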

How we merge (or join) the data depends on which of the three situations is intended for the data management. Keeping only the matches (#2 and #3 above) can be accomplished with an inner join (`join(how='inner')` or `merge(how='inner')` in pandas). Keeping both the matches (#2 and #3 above) and the non-matches (#1 above) can be accomplished with a full join (`join(how='outer')` or `merge(how='outer')` in pandas). Whenever the common column of the different data objects contains the same information but has different column names, the easiest fix is to rename the column in one of the two objects. Alternatively, the `merge` function can join two data frames on differently named columns through its `left_on` and `right_on` arguments.
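The sketch below contrasts an inner join with a full (outer) join when the key columns have different names. The `phones` table and its `EmployeeID` column are hypothetical. The `indicator=True` argument of `merge` adds a `_merge` column that records where each row came from, which is handy for checking matches.

```python
import pandas as pd

salaries = pd.DataFrame({"ID": ["A", "B", "D"],
                         "Salary": ["$10K", "$11K", "$12K"]})
# Hypothetical second table whose key column has a different name.
phones = pd.DataFrame({"EmployeeID": ["A", "D", "E"],
                       "Number": [2175551214, 2175551244, 2175551235]})

# Inner join: keep only the IDs found in both tables.
inner = salaries.merge(phones, left_on="ID", right_on="EmployeeID", how="inner")

# Full join: keep matches and non-matches; _merge labels each row as
# "both", "left_only", or "right_only".
outer = salaries.merge(phones, left_on="ID", right_on="EmployeeID",
                       how="outer", indicator=True)

print(inner)
print(outer)
```

With `left_on` and `right_on`, both key columns are kept in the result, so one of them can be dropped afterwards if it is redundant.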

Let's combine the owner addresses (scraped and saved from Week 03 Notes), available as [owners-addresses .csv - GHE](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/owners-addresses.csv) or at the Box data URL https://uofi.box.com/shared/static/u6coxibtzx3mith23bzk4rysu923g160.csv, with the `RentalsData`. This combination is quite simple because owners-addresses has only one column and the same number of rows as `RentalsData`, so we can place it next to `RentalsData` with `concat(axis=1)`.

In [5]:
owners_addresses = pd.read_csv("https://uofi.box.com/shared/static/u6coxibtzx3mith23bzk4rysu923g160.csv")

In [6]:
RentalsData2 = pd.concat([RentalsData, owners_addresses], axis=1)
RentalsData2.head(10)

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address,value
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...","CORA MAE PROPERTIES LLC, \r\nLUKE SHERMAN\r\nP..."
1,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193...","WOMACK, DEBORAH J & MICHAEL\r\n803 N OAKWOOD S..."
2,212 1/2 Central Avenue,912108400000.0,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\r\nUrbana, IL","RUBIN, RACHAEL\r\n212 N CENTRAL AVE\r\nURBANA,..."
3,801 1/2 East Harding Drive,932121200000.0,6/12/2013,Class B,Issued,10/14/2021,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4...","HARPER, CRAIG & JAMES E\r\n1173 COUNTY ROAD 24..."
4,1003 1/2 East Harding Drive,932121200000.0,7/8/2013,Class B,Issued,10/14/2020,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(...","WAMPLER, JOSEPH\r\nCOLONY PROPERTY MANAGEMENT\..."
5,1204 1/2 North Goodwin Avenue,912107300000.0,10/20/2011,Class B,Issued,10/14/2021,"1204 1 2 North Goodwin Avenue\r\nUrbana, IL\r\...","NEVES GROUP INVESTMENTS, \r\n801 W BRADLEY AVE..."
6,910 1/2 North Busey Avenue,912108200000.0,12/17/2010,Class B,Issued,10/14/2021,"910 1 2 North Busey Avenue\r\nUrbana, IL\r\n(4...","GRAMMER, JACOB\r\n1303 PEPPERMILL LN\r\nCHAMPA..."
7,1109 1/2 East Main Street,922116100000.0,6/5/2015,Class B,Issued,10/14/2020,"1109 1 2 East Main Street\r\nUrbana, IL\r\n(40...","CLARK, FREDERICK E\r\n1 HORSE RUN HILL RD\r\nC..."
8,1306 1/2 East Mumford Drive,932121300000.0,7/8/2013,Class B,Issued,10/14/2021,"1306 1 2 East Mumford Drive\r\nUrbana, IL\r\n(...","OVERTON, DONALD G\r\n2101 VAWTER\r\nURBANA, IL..."
9,807 1/2 West Main Street,912108400000.0,5/18/2011,Class A,Issued,10/14/2020,"807 1 2 West Main Street\r\nUrbana, IL\r\n(40....","URBANA CAMPUS RENTALS II, \r\n309 S 1ST ST\r\n..."
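One caveat with a side-by-side bind is that `pd.concat(axis=1)` pairs rows by index label, not by position. The bind above works because both data frames carry the default 0, 1, 2, ... index. The sketch below uses hypothetical toy tables to show what can happen otherwise and one possible fix.

```python
import pandas as pd

left = pd.DataFrame({"Property": ["p0", "p1", "p2"]})
# Hypothetical one-column table whose index is NOT the default 0, 1, 2.
right = pd.DataFrame({"value": ["o0", "o1", "o2"]}, index=[10, 11, 12])

# The index labels differ, so no rows pair up and gaps are filled with NaN.
misaligned = pd.concat([left, right], axis=1)

# Resetting the index makes the pairing purely positional.
aligned = pd.concat([left, right.reset_index(drop=True)], axis=1)

print(misaligned)
print(aligned)
```

This situation commonly arises after filtering or sorting one of the tables, since those operations keep the original row labels.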


**SN: Pandas' `unique()` and `drop_duplicates()` functions may come in handy when we want to find the unique values (or the number of unique values) of a particular column or a set of columns in a data frame.**

In [7]:
RentalsData2['Grade'].unique()

array(['Class B', 'Class A', 'Class D', 'Class C', 'Class F', 'Class N'],
      dtype=object)

In [10]:
RentalsData2.drop_duplicates(subset=['Grade'])

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address,value
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...","CORA MAE PROPERTIES LLC, \r\nLUKE SHERMAN\r\nP..."
9,807 1/2 West Main Street,912108400000.0,5/18/2011,Class A,Issued,10/14/2020,"807 1 2 West Main Street\r\nUrbana, IL\r\n(40....","URBANA CAMPUS RENTALS II, \r\n309 S 1ST ST\r\n..."
10,709 1/2 South Vine Street,922117300000.0,12/7/2009,Class D,Issued,10/14/2021,"709 1 2 South Vine Street\r\nUrbana, IL\r\n(40...","SNYDER, CAROL\r\n709 S VINE ST\r\nURBANA, IL, ..."
18,108 L North Busey Avenue,912108400000.0,10/12/2010,Class C,Expired,,"108 L North Busey Avenue\r\nUrbana, IL\r\n(40....","CPM MANAGEMENT LLC, \r\n615 S WRIGHT ST\r\nCH..."
605,1302 Silver Street,932121200000.0,2/25/2013,Class F,Expired,,"1302 Silver Street\r\nUrbana, IL\r\n(-88.1928,...","PLATINUM GROUP PROPERTIES, \r\nSUNNYCREST\r\nP..."
1351,611 West Elm Street,922117100000.0,2/12/2019,Class N,Issued,10/14/2021,"611 West Elm Street\r\nUrbana, IL\r\n(40.11121...","W A HOLDINGS LLC, TERRY WOLLER\r\n208 SUGAR LN..."


In [11]:
RentalsData2.drop_duplicates()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address,value
0,607 1/2 Glover Avenue,9.221162e+11,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\r\nUrbana, IL\r\n(40.108...","CORA MAE PROPERTIES LLC, \r\nLUKE SHERMAN\r\nP..."
1,1302 1/2 Hill Street,9.121074e+11,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\r\nUrbana, IL\r\n(40.1193...","WOMACK, DEBORAH J & MICHAEL\r\n803 N OAKWOOD S..."
2,212 1/2 Central Avenue,9.121084e+11,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\r\nUrbana, IL","RUBIN, RACHAEL\r\n212 N CENTRAL AVE\r\nURBANA,..."
3,801 1/2 East Harding Drive,9.321212e+11,6/12/2013,Class B,Issued,10/14/2021,"801 1 2 East Harding Drive\r\nUrbana, IL\r\n(4...","HARPER, CRAIG & JAMES E\r\n1173 COUNTY ROAD 24..."
4,1003 1/2 East Harding Drive,9.321212e+11,7/8/2013,Class B,Issued,10/14/2020,"1003 1 2 East Harding Drive\r\nUrbana, IL\r\n(...","WAMPLER, JOSEPH\r\nCOLONY PROPERTY MANAGEMENT\..."
...,...,...,...,...,...,...,...,...
1725,3026 East Stillwater Landing Unit 101,9.321224e+11,12/18/2017,Class B,Issued,10/14/2021,"3026 East Stillwater Landing\r\nUrbana, IL\r\n...","MINCONE, SANDY K\r\n22210 TAHOE CT\r\nSANTA CL..."
1726,1108 South Busey Avenue,9.321173e+11,12/16/2019,Class B,Issued,10/14/2021,"1108 South Busey Avenue\r\nUrbana, IL\r\n(40.1...","DOYLE, KIP & SHERI\r\n906 W DANIEL ST\r\nCHAMP..."
1727,806 Harvey Street,9.121074e+11,11/4/2011,Class B,Issued,10/14/2021,"806 Harvey Street\r\nUrbana, IL\r\n(-88.2215, ...","JONES PROPERTY MANAGEMENT, \r\n2516 PINEHURST ..."
1728,1302 East Michigan Avenue,9.221164e+11,4/18/2016,Class B,Issued,,"1302 East Michigan Avenue\r\nUrbana, IL\r\n(-8...","DILLMAN, RONALD L\r\n906 E MICHIGAN AVE\r\nURB..."
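When only the number of unique values is needed, `nunique()` returns it directly, and the length of a `drop_duplicates()` result gives the number of distinct rows or distinct column combinations. The `toy` data frame below is hypothetical and only mimics two columns of the rentals data.

```python
import pandas as pd

# Hypothetical toy data mimicking two columns of the rentals data.
toy = pd.DataFrame({
    "Grade": ["Class B", "Class A", "Class B", "Class B"],
    "License Status": ["Issued", "Issued", "Expired", "Issued"],
})

# Number of distinct values in one column.
n_grades = toy["Grade"].nunique()

# Number of distinct (Grade, License Status) combinations.
n_pairs = len(toy.drop_duplicates(subset=["Grade", "License Status"]))

# Number of fully distinct rows.
n_distinct_rows = len(toy.drop_duplicates())

print(n_grades, n_pairs, n_distinct_rows)
```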


#### END OF NOTES