# STAT 440 Statistical Data Management - Fall 2021
## Week 04 Notes
### Created by Christopher Kinson and Huiqin Xin


***

### Table of Contents

- [Arranging data](#arranging-data)  
  - [Organizing columns](#organizing)  
  - [Sorting columns](#sorting)  
- [Reshaping data](#reshaping-data)  
  - [Pivoting](#pivoting)  
  - [Transposing](#transposing)  


***


## <a name="arranging-data"></a>Arranging data

Arranging a dataset involves organizing its columns and sorting the data by one or more of its columns. 

### <a name="organizing"></a>Organizing columns

Organizing the data means choosing which column appears first, second, and so on. See the edited image taken from [RStudio's dplyr cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).

![](https://uofi.box.com/shared/static/sxh3cw9yyol3m3tlu8fhefw9ye1pjmwm.png)

In Python, we can organize the columns by indexing the data frame with a list of column names in the desired order. Recall the City of Urbana's [Rental Inspection Grades Listing Data as tab-separated .txt](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data03.txt). 


In [1]:
import pandas as pd
rentalsData = pd.read_csv("https://raw.github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/master/data/rental-inspections-grades-data03.txt?token=AAABJGYPNPRLJDH2JQESWD3BGO3FG", sep = "\t")
rentalsData.head()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,..."
1,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ..."
2,212 1/2 Central Avenue,912108400000.0,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\nUrbana, IL"
3,801 1/2 East Harding Drive,932121200000.0,6/12/2013,Class B,Issued,10/14/2021,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09..."
4,1003 1/2 East Harding Drive,932121200000.0,7/8/2013,Class B,Issued,10/14/2020,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0..."


Let's arrange the Rental Inspection Grades Listing Data such that:

- the first column is Parcel Number  
- the second column is Property Address  
- the third column is Mappable Address  
- the fourth column is Inspection Date  
- the fifth column is Expiration Date  
- the sixth column is License Status
- the seventh column is Grade  



In [2]:
rentalsData[["Parcel Number", "Property Address", "Mappable Address", "Inspection Date", "Expiration Date", "License Status", "Grade"]].head(5)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,922116200000.0,607 1/2 Glover Avenue,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",7/24/2015,10/14/2021,Expired,Class B
1,912107400000.0,1302 1/2 Hill Street,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",8/17/2011,10/14/2021,Issued,Class B
2,912108400000.0,212 1/2 Central Avenue,"212 1 2 Central Avenue\nUrbana, IL",4/26/2010,,Issued,Class B
3,932121200000.0,801 1/2 East Harding Drive,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09...",6/12/2013,10/14/2021,Issued,Class B
4,932121200000.0,1003 1/2 East Harding Drive,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0...",7/8/2013,10/14/2020,Issued,Class B


In [3]:
rentalsData2 = rentalsData[["Parcel Number", "Property Address", "Mappable Address", "Inspection Date", "Expiration Date", "License Status", "Grade"]]
rentalsData2.head(5)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
0,922116200000.0,607 1/2 Glover Avenue,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",7/24/2015,10/14/2021,Expired,Class B
1,912107400000.0,1302 1/2 Hill Street,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",8/17/2011,10/14/2021,Issued,Class B
2,912108400000.0,212 1/2 Central Avenue,"212 1 2 Central Avenue\nUrbana, IL",4/26/2010,,Issued,Class B
3,932121200000.0,801 1/2 East Harding Drive,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09...",6/12/2013,10/14/2021,Issued,Class B
4,932121200000.0,1003 1/2 East Harding Drive,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0...",7/8/2013,10/14/2020,Issued,Class B


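Notice that indexing returns a new data frame and leaves `rentalsData` unchanged. To keep the newly arranged data, we must assign the result to a name with the assignment operator (`=`), as we did with `rentalsData2`.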
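To see why the assignment matters, here is a minimal sketch on a small made-up data frame (the object names `toy` and `reordered` and the values are assumptions for illustration, not part of the rentals data). It shows that indexing with a list of column names produces a reordered copy while the original keeps its column order.

```python
import pandas as pd

# Hypothetical toy data, not the Urbana rentals file
toy = pd.DataFrame({
    "Grade": ["Class B", "Class A"],
    "Parcel Number": [2, 1],
    "Property Address": ["100 Example Street", "200 Example Street"],
})

# Indexing with a list of names returns a new data frame in that column order
reordered = toy[["Parcel Number", "Property Address", "Grade"]]

print(list(toy.columns))        # the original column order is unchanged
print(list(reordered.columns))  # the new order lives only in the assigned object
```

Without the assignment to `reordered`, the rearranged result would be displayed once and then lost.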

### <a name="sorting"></a>Sorting columns

Sorting the data by the values in one or more columns uses the `sort_values()` method. Let's sort the data so that the Parcel Numbers are in ascending order, and show the first 10 rows of the result.


In [4]:
rentalsData2.sort_values(by=['Parcel Number'], ascending=[True]).head(10)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
936,912104400000.0,1902 Willow Road,"1902 Willow Road\nUrbana, IL\n(-88.2007, 40.1305)",5/21/2010,,Issued,Class B
551,912105500000.0,1601 Willow Road,"1601 Willow Road\nUrbana, IL\n(-88.201, 40.1276)",10/19/2007,,Issued,Class B
784,912105500000.0,1709 Willow Court,"1709 Willow Court\nUrbana, IL\n(-88.2013, 40.1...",7/27/2010,10/14/2021,Issued,Class B
204,912105500000.0,1707 Willow Court,"1707 Willow Court\nUrbana, IL\n(-88.2014, 40.129)",7/13/2010,10/14/2021,Issued,Class A
41,912105500000.0,1705 Willow Court,"1705 Willow Court\nUrbana, IL\n(-88.2014, 40.1...",7/13/2010,10/14/2020,Issued,Class A
199,912105500000.0,1703 Willow Court,"1703 Willow Court\nUrbana, IL\n(-88.2012, 40.1...",7/13/2010,10/14/2019,Issued,Class A
723,912107200000.0,1405 Bradley Avenue,"1405 Bradley Avenue\nUrbana, IL\n(-88.2289, 40...",2/17/2012,,Issued,Class B
205,912107200000.0,1507 North Romine Street,"1507 North Romine Street\nUrbana, IL\n(-88.227...",7/27/2011,10/14/2021,Expired,Class B
470,912107200000.0,1305 Bradley Avenue,"1305 Bradley Avenue\nUrbana, IL\n(-88.2274, 40...",7/15/2011,10/14/2021,Issued,Class B
1048,912107200000.0,1506 North Romine Street,"1506 North Romine Street\nUrbana, IL\n(-88.227...",10/22/2013,10/14/2021,Issued,Class A


We can complicate this arrangement by specifying the sort order of more columns. Let's sort the Rental Inspection Grades Listing Data so that the Grades are in descending order and, within each Grade, the Parcel Numbers are in ascending order. Again we show the first 10 rows of the result. 



In [5]:
rentalsData2.sort_values(by=['Grade','Parcel Number'], ascending=[False,True]).head(10)

Unnamed: 0,Parcel Number,Property Address,Mappable Address,Inspection Date,Expiration Date,License Status,Grade
1351,922117100000.0,611 West Elm Street,"611 West Elm Street\nUrbana, IL\n(40.111217, -...",2/12/2019,10/14/2021,Issued,Class N
605,932121200000.0,1302 Silver Street,"1302 Silver Street\nUrbana, IL\n(-88.1928, 40....",2/25/2013,,Expired,Class F
1327,932121200000.0,1304 Silver Street,"1304 Silver Street\nUrbana, IL\n(-88.1925, 40....",2/25/2013,10/14/2021,Expired,Class F
10,922117300000.0,709 1/2 South Vine Street,"709 1 2 South Vine Street\nUrbana, IL\n(40.106...",12/7/2009,10/14/2021,Issued,Class D
1458,932117500000.0,1304 South Vine Street,"1304 South Vine Street\nUrbana, IL\n(-88.2046,...",5/27/2016,10/14/2021,Issued,Class D
1106,932121200000.0,1306 East Silver Street,"1306 East Silver Street\nUrbana, IL\n(40.09172...",4/11/2017,10/14/2019,Temporarily Not a Rental,Class D
344,912107200000.0,1407 North Romine Street,"1407 North Romine Street\nUrbana, IL\n(-88.227...",4/19/2016,10/14/2020,Issued,Class C
854,912107300000.0,1410 Beslin Street,"1410 Beslin Street\nUrbana, IL\n(-88.2283, 40....",7/12/2011,10/14/2019,Issued,Class C
1499,912107300000.0,1308 Beech Street,"1308 Beech Street\nUrbana, IL\n(-88.2267, 40.1...",6/27/2011,,Issued,Class C
133,912107300000.0,1003 North Mathews Avenue,"1003 North Mathews Avenue\nUrbana, IL\n(-88.22...",6/16/2011,10/14/2021,Expired,Class C
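The multi-column sort can be easier to follow on a tiny example. The sketch below uses a made-up data frame (`toy` and its values are assumptions for illustration). The first name in `by` is the primary key, and each entry of `ascending` pairs with the column in the same position.

```python
import pandas as pd

# Hypothetical toy data with two grades and four parcel numbers
toy = pd.DataFrame({
    "Grade": ["Class A", "Class C", "Class A", "Class C"],
    "Parcel Number": [30, 20, 10, 40],
})

# Grade descending first; ties within a Grade are broken by Parcel Number ascending
sorted_toy = toy.sort_values(by=["Grade", "Parcel Number"], ascending=[False, True])
print(sorted_toy)
```

Two details are worth noticing. Grade is a text column, so "descending" means reverse alphabetical order, which is why Class N appears before Class F in the rentals output. The rows also keep their original index labels, so the index is no longer in order after sorting.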


***


## <a name="reshaping-data"></a>Reshaping data
Reshaping a dataset can be a good way to ensure the information we want is in the proper orientation. Two actions usually encompass reshaping: pivoting and transposing. The particular action being taken ultimately depends on the scenario. 

### <a name="pivoting"></a>Pivoting

Pivoting may happen in two ways: 

1. "lengthening", which makes the dataset longer (more rows than we started with). See image taken from [RStudio's dplyr cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf). 

![](https://uofi.box.com/shared/static/o5h7rqsgy9xcy23m8uq0wup8b6vglgxn.png)

2. "widening", which makes the dataset wider (more columns than we started with). See image taken from [RStudio's dplyr cheat sheet](https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf).

![](https://uofi.box.com/shared/static/2pl1dlbmjdqhri43zvrsov2emlkuogf4.png)

To prepare the data for pivoting, we add two columns to the City of Urbana's [Rental Inspection Grades Listing Data as tab-separated .txt](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data03.txt) (named object `rentalsData`). *Pivoting may not be necessary with the original `rentalsData`.*



In [6]:
import numpy as np

In [7]:
Coordinates = rentalsData['Mappable Address'].str.findall(r"[+-]?\d+\.\d+")  # raw string keeps the regex backslashes intact
rentalsData['Coordinates01'] = [x[0] if len(x)==2 else np.nan for x in Coordinates]
rentalsData['Coordinates02'] = [x[1] if len(x)==2 else np.nan for x in Coordinates]
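To make the extraction step concrete, here is a minimal sketch on two made-up address strings (the addresses and the names `addresses`, `matches`, `first`, and `second` are assumptions for illustration). The pattern matches an optional sign, digits, a decimal point, and more digits, so a house number such as 100 is skipped because it has no decimal point.

```python
import numpy as np
import pandas as pd

# Hypothetical addresses: one with a coordinate pair, one without
addresses = pd.Series([
    "100 Example Street\nUrbana, IL\n(40.1, -88.2)",
    "200 Example Street\nUrbana, IL",
])

# findall returns a list of every decimal number found in each string
matches = addresses.str.findall(r"[+-]?\d+\.\d+")

# Keep a value only when exactly two decimals were found; otherwise mark it missing
first = [m[0] if len(m) == 2 else np.nan for m in matches]
second = [m[1] if len(m) == 2 else np.nan for m in matches]

print(matches.tolist())
print(first, second)
```

The extracted values are text, not numbers, which is why the new columns have dtype `object`. Converting them with `pd.to_numeric()` would be one option if numeric coordinates are needed.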

With this new version of the City of Urbana's Rental Inspection Grades Listing Data, suppose that we wanted the decimal values in columns Coordinates01 and Coordinates02 in a single column. To accomplish this, we can use `stack()` and `merge()`, since we would be making the data longer; we're adding more rows than there were originally.

In [8]:
stack_cols = ['Coordinates01','Coordinates02']
rem_cols = [x for x in rentalsData if x not in stack_cols]

In [9]:
stack_df = rentalsData[stack_cols].stack(dropna=False).to_frame()
stack_df = stack_df.reset_index()
stack_df.columns= ['newid','Coordinate','Decimal']
rentalsData3 = pd.merge(rentalsData[rem_cols],stack_df,left_on = rentalsData.index,right_on='newid',how='right')
rentalsData3.head()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address,newid,Coordinate,Decimal
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",0,Coordinates01,40.108023
1,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",0,Coordinates02,-88.193322
2,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",1,Coordinates01,40.119327
3,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",1,Coordinates02,-88.226119
4,212 1/2 Central Avenue,912108400000.0,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\nUrbana, IL",2,Coordinates01,


That's it! We just increased the number of rows from 1730 in `rentalsData` to 3460 in `rentalsData3`: two rows, one per coordinate, for every original row.
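As a point of comparison, the same kind of lengthening can be sketched with pandas' `melt()` method, which is a different technique from the `stack()` and `merge()` approach used in these notes. The data frame `wide` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical wide data: one row per address, two coordinate columns
wide = pd.DataFrame({
    "Address": ["A", "B"],
    "Coordinates01": ["40.1", None],
    "Coordinates02": ["-88.2", None],
})

# Each coordinate column becomes its own set of rows:
# the column name goes to "Coordinate" and the cell value goes to "Decimal"
long = wide.melt(
    id_vars=["Address"],
    value_vars=["Coordinates01", "Coordinates02"],
    var_name="Coordinate",
    value_name="Decimal",
)

print(wide.shape)
print(long.shape)
```

The number of rows doubles because two columns were stacked into one. Rows with missing values are kept, which matches the use of `dropna=False` above. One difference is the row order: `melt()` lists all Coordinates01 rows before the Coordinates02 rows, whereas `stack()` alternates between them.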

Now, with this newly pivoted data, let's demonstrate "widening" the data. Here we want to increase the number of columns, essentially reverting to the version of the data before "lengthening." *Notice that we first use the `sort_values()` method to sort by the Coordinate column.*

In [10]:
rentalsData3 = rentalsData3.sort_values('Coordinate')
pivot_df = rentalsData3.pivot(index = 'Parcel Number',columns='Coordinate',values='Decimal')
rentalsData4 = pd.merge(rentalsData.drop(columns=['Coordinates01','Coordinates02']),pivot_df,on ='Parcel Number')
rentalsData4.head()

Unnamed: 0,Property Address,Parcel Number,Inspection Date,Grade,License Status,Expiration Date,Mappable Address,Coordinates01,Coordinates02
0,607 1/2 Glover Avenue,922116200000.0,7/24/2015,Class B,Expired,10/14/2021,"607 1 2 Glover Avenue\nUrbana, IL\n(40.108023,...",40.108023,-88.193322
1,1302 1/2 Hill Street,912107400000.0,8/17/2011,Class B,Issued,10/14/2021,"1302 1 2 Hill Street\nUrbana, IL\n(40.119327, ...",40.119327,-88.226119
2,212 1/2 Central Avenue,912108400000.0,4/26/2010,Class B,Issued,,"212 1 2 Central Avenue\nUrbana, IL",,
3,801 1/2 East Harding Drive,932121200000.0,6/12/2013,Class B,Issued,10/14/2021,"801 1 2 East Harding Drive\nUrbana, IL\n(40.09...",40.093806,-88.19767
4,1003 1/2 East Harding Drive,932121200000.0,7/8/2013,Class B,Issued,10/14/2020,"1003 1 2 East Harding Drive\nUrbana, IL\n(40.0...",40.093743,-88.195595


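Done! Technically the number of columns did not increase compared with `rentalsData3`, because the Coordinate and Decimal columns were replaced by Coordinates01 and Coordinates02 and the newid helper column is gone. The number of rows did decrease, though, which is another way to know that the "widening" worked appropriately.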
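Here is a minimal sketch of widening on made-up data (the names `long`, `wide`, and `row_id` and the values are assumptions for illustration). It uses a unique row identifier as the pivot index, because `pivot()` raises a `ValueError` when the same combination of index and column values appears more than once.

```python
import pandas as pd

# Hypothetical long data: two rows (one per coordinate) for each row_id
long = pd.DataFrame({
    "row_id": [0, 0, 1, 1],
    "Coordinate": ["Coordinates01", "Coordinates02", "Coordinates01", "Coordinates02"],
    "Decimal": ["40.1", "-88.2", "40.3", "-88.4"],
})

# Each distinct value of Coordinate becomes its own column
wide = long.pivot(index="row_id", columns="Coordinate", values="Decimal")

print(long.shape)
print(wide.shape)
print(wide)
```

The four long rows collapse into two wide rows, one per `row_id`, with one column per coordinate.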

### <a name="transposing"></a>Transposing

When it comes to transposing, all we're accomplishing is making the columns become the rows, which swaps the dimensions of the data: a data frame with r rows and c columns becomes one with c rows and r columns.



In [11]:
rentalsData5 = rentalsData.T 
rentalsData.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Property Address  1730 non-null   object 
 1   Parcel Number     1730 non-null   float64
 2   Inspection Date   1730 non-null   object 
 3   Grade             1730 non-null   object 
 4   License Status    1730 non-null   object 
 5   Expiration Date   1456 non-null   object 
 6   Mappable Address  1730 non-null   object 
 7   Coordinates01     1729 non-null   object 
 8   Coordinates02     1729 non-null   object 
dtypes: float64(1), object(8)
memory usage: 121.8+ KB


In [12]:
rentalsData.shape

(1730, 9)

In [13]:
rentalsData5.shape

(9, 1730)

We see that the data went from 1730 by 9 to 9 by 1730. 
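The same swap can be checked on a tiny made-up data frame (`toy` and `flipped` are names chosen for illustration). The sketch also shows where the labels go: the old column names become the new row index, and the old row index becomes the new column names.

```python
import pandas as pd

# Hypothetical toy data: 3 rows by 2 columns
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# .T returns the transposed data frame; toy itself is unchanged
flipped = toy.T

print(toy.shape)
print(flipped.shape)
print(list(flipped.index))    # former column names
print(list(flipped.columns))  # former row index
```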

#### END OF NOTES