# STAT 440 Statistical Data Management - Fall 2021
## Week 03 Notes
### Created by Christopher Kinson


***

### Table of Contents

- [Assigning objects](#assigning-objects)  
- [Web scraping for accessing and importing data](#web-scraping)  
- [Handling dates and times](#handling-dates-and-times)  


***


## <a name="assigning-objects"></a>Assigning objects

In Python, object assignment is done with an assignment operator, `=` as in `x=10`.  

We've already assigned objects in Python. If you need proof, review the notes from Week 02. With assigning objects, one important thing to notice is the acceptable naming conventions of your programming language. As mentioned in Tip 1, most programming languages won't like kebab-case. Thus, I advise you to use one of the following cases: camelCase, PascalCase, or snake_case. 
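
As a quick check on naming, here is a minimal sketch that uses only Python's built-in string method `isidentifier()` and the standard `keyword` module. The candidate names are made up for illustration; they are not objects from earlier notes.

```python
import keyword

# Made-up candidate names in four different cases
candidate_names = ["rentalData", "RentalData", "rental_data", "rental-data"]

usable = {}
for name in candidate_names:
    # isidentifier() checks the spelling rules; iskeyword() rules out reserved words
    usable[name] = name.isidentifier() and not keyword.iskeyword(name)
    print(name, usable[name])

# Assignment with = using a snake_case name
rental_data_rows = 1730
print(rental_data_rows)
```

The kebab-case name is the only one rejected, because Python reads the hyphen in `rental-data` as subtraction.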


***


## <a name="web-scraping"></a>Web scraping for accessing and importing data

Web scraping can be a fun way to explore information provided on a website in order to store it and analyze it for statistical/academic purposes. 

**Because the City of Urbana's Data Portal is run by a government body in the US, students outside of the US may have limited or no access depending on their current government's policies. Regardless, the web scraping section of the notes is still conceptually important. All students in this course are expected to learn and demonstrate their understanding of the concepts covered in these notes.**

Recall the [City of Urbana's Rental Inspection Grades Listings Data - structured comma-separated](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data01.csv) that I like so much. When we access and import the structured comma-separated file, we get a better view of the columns.


In [1]:
import pandas as pd
rental_data = pd.read_csv('https://raw.github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/master/data/rental-inspections-grades-data01.csv?token=AAABJGZ6OHL3GMHOK3DVMSTBGEUWS')
rental_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Property Address  1730 non-null   object
 1   Parcel Number     1730 non-null   int64 
 2   Inspection Date   1730 non-null   object
 3   Grade             1730 non-null   object
 4   License Status    1730 non-null   object
 5   Expiration Date   1456 non-null   object
 6   Mappable Address  1730 non-null   object
dtypes: int64(1), object(6)
memory usage: 94.7+ KB


This data is about housing that people pay rent money for and the grades of these housing units upon inspection. The specific columns are: Property Address, Parcel Number, Inspection Date, Grade, License Status, Expiration Date, and Mappable Address. 


In [2]:
rental_data.columns

Index(['Property Address', 'Parcel Number', 'Inspection Date', 'Grade',
       'License Status', 'Expiration Date', 'Mappable Address'],
      dtype='object')


If you've lived long enough to understand property, then you know that every county across America has public records of almost all properties and their owners. This includes Champaign County. These property records are stored by the [Office of the Champaign County Assessor](http://www.co.champaign.il.us/ccao/assessor.php). These housing units belong to property owners who may or may not live at the same address as the property they are renting. What if we were to look online, find the owner's name and address to see if it matches that of the property they're renting out? 

First, let's go to the Property Record Search portion of the county's website. See image below.

![](https://uofi.box.com/shared/static/3sf6f8djudewb0ch9zw8pdmmj6y1vuj9.png)

Which takes us here. See image below.

![](https://uofi.box.com/shared/static/tkuofhfr3cytzwce9xovnat5qcmg9dmy.png)

The property record search allows for searching by parcel number. Our dataset has a column by the exact same name. So let's try searching the records for the first parcel number in our dataset: 922116177018. See image below.

![](https://uofi.box.com/shared/static/mshnwcds23ee9u2h7g1lm7g7p9v0d7th.png)

Running that search takes us here. See image below.

![](https://uofi.box.com/shared/static/k6m6zjmy6c17fqfafd916j8denzhuc02.png)

The owners of parcel number 922116177018 don't share the same address as the rental property address, at least not on record.

Doing this by hand would be tedious for all 1730 properties in this dataset. Web scraping is a data accessing tool that can automate the process of retrieving property owner information from the county's website.

We will use the **BeautifulSoup** and **requests** libraries to scrape or harvest data from the Office of the Champaign County Assessor's website. I am basing the notes below on ["Harvesting the web with rvest"](https://rvest.tidyverse.org/articles/harvesting-the-web.html), an **rvest** vignette written by Dmytro Perepolkin, as well as ["Beautiful Soup: Build a Web Scraper With Python"](https://realpython.com/beautiful-soup-web-scraper-python/#scraping-the-monster-job-site), a Python blog post by Martin Breuss.

We're going to go directly to the table containing the Owner Name and Address shown above and right-click on the section where the owner names appear. Click "Inspect" or "Inspect Page" and pay attention to the text highlighted in the image below.

![](https://uofi.box.com/shared/static/90k8q8ppl13vv0yav8h6m9ncknp1i7jn.png)

This text (which is a series of HTML tags) is what we need in order to extract the owner names. The tags serve as reference points for locating the information we truly want, which is in the table of Owner Name and Address. Figuring out which tags and class names get us what we need takes some trial and error. That's okay, but it takes time. Eventually we reach our desired result (it took me 30 minutes to try out several combinations) based on two CSS class selectors: ".col-xs-4" and ".inner-value".


In [59]:
import requests
from bs4 import BeautifulSoup
url1 = "https://champaignil.devnetwedge.com/parcel/view/922116177018/2020"
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, 'html.parser')
owner_names1 = soup1.find_all('td',{'class':'col-xs-4'})
print(owner_names1)

[<td class="col-xs-4">
<div class="inner-label">Parcel Number</div>
<div class="inner-value">92-21-16-177-018</div>
</td>, <td class="col-xs-4" rowspan="3">
<div class="inner-label">Site Address</div>
<div class="inner-value">
                        607   GLOVER AVE<br/>
                        URBANA, IL 61802
                    </div>
</td>, <td class="col-xs-4" rowspan="3">
<div class="inner-label">Owner Name &amp; Address</div>
<div class="inner-value" style="white-space:pre-line"> CORA MAE PROPERTIES LLC, 
LUKE SHERMAN
PO BOX 101
CHAMPAIGN, IL, 61824-0101 </div>
</td>]


In [61]:
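# Each pass overwrites owner_names2, so after the loop it holds the
# inner-value of the last cell, which is the Owner Name & Address cell.
for owner_names in owner_names1:
    owner_names2 = owner_names.find('div', class_='inner-value')
print(owner_names2.text)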

 CORA MAE PROPERTIES LLC, 
LUKE SHERMAN
PO BOX 101
CHAMPAIGN, IL, 61824-0101 


Good! 

We used web scraping to grab the first property's owner names according to the Parcel Number 922116177018, and the result shows the owner name and address separated by new line characters `\n`. Again, we see this information verifies that the owner address is not the same as the rental property's address.
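
Since the pieces are separated by `\n`, a minimal sketch of tidying the result is to strip and split the string. The text below is pasted by hand from the output above rather than scraped again, and the variable names are my own.

```python
# Text copied from the scraped output above
owner_text = " CORA MAE PROPERTIES LLC, \nLUKE SHERMAN\nPO BOX 101\nCHAMPAIGN, IL, 61824-0101 "

# Strip the outer whitespace, split on newlines, and tidy each line
owner_lines = [line.strip() for line in owner_text.strip().split("\n")]
print(owner_lines)

# Rough check that the rental property's street does not appear in the owner's address
street_in_owner_text = "GLOVER AVE" in owner_text
print(street_in_owner_text)
```

This gives one list element per line of the owner record, which is easier to compare against the Property Address column than one long string.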

Now, let's do this for all properties. The key will be to loop or vectorize this process. *Actually, in the chunk below I only do this for the first 5 properties, since the process takes a long time for all 1730 properties.* The only thing that should change with each iteration of the loop is the parcel number. We can use an index-controlled loop in which the index into the Parcel Number column increments until it reaches the last row (index 1729 for the full data). Putting it all together, we get the following list of owner names and addresses.


In [65]:
root_url = 'https://champaignil.devnetwedge.com/parcel/view/'
# One slot per property; only the first 5 are filled in this demonstration
owner_names6 = ['']*len(rental_data)
for i in range(5):
    # Build the page URL from the i-th parcel number
    url = root_url + str(rental_data['Parcel Number'].iloc[i]) + '/2020'
    response = requests.get(url)
    soup2 = BeautifulSoup(response.text,'html.parser')
    owner_names5 = soup2.find_all('td',{'class':'col-xs-4'})
    # The last matching cell is Owner Name & Address, so the final pass keeps it
    for oN in owner_names5:
        owner_names6[i] = oN.find('div', class_='inner-value').text
print(owner_names6)

Nice!

But there's a better way!!

Let's use the SelectorGadget tool! The SelectorGadget tool (read about it and set it up in your browser https://rvest.tidyverse.org/articles/selectorgadget.html) allows one to inspect a particular part of the web page and narrow down the CSS selector for it. This saves time and greatly reduces the trial and error needed to grab the information in the Owner Name and Address section of the website.

Using this tool, we selected the table we want and de-selected the Site Address portion of the table next to it. Doing so improved the SelectorGadget estimate of the CSS selector we *do* want (seen at bottom highlighted in blue).

![](https://uofi.box.com/shared/static/as4pbwxnxdod2q1ah4aopn88im8xony2.png)

This resulted in a CSS selector with two parts: ".col-xs-4:nth-child(3)" and ".inner-value". Trying that selector out returns the Owner Name and Address information more directly. **Notice that we use the `select()` method of the `soup1` object instead of the `find_all()` or `find()` methods.**

In [57]:
import requests
from bs4 import BeautifulSoup
url1 = "https://champaignil.devnetwedge.com/parcel/view/922116177018/2020"
page1 = requests.get(url1)
soup1 = BeautifulSoup(page1.text, 'html.parser')
owner_names1 = soup1.select('.col-xs-4:nth-child(3) .inner-value')
print(owner_names1)

[<div class="inner-value" style="white-space:pre-line"> CORA MAE PROPERTIES LLC, 
LUKE SHERMAN
PO BOX 101
CHAMPAIGN, IL, 61824-0101 </div>]


In [58]:
owner_names2 = [names1.text for names1 in owner_names1]
print(owner_names2)

[' CORA MAE PROPERTIES LLC, \nLUKE SHERMAN\nPO BOX 101\nCHAMPAIGN, IL, 61824-0101 ']
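
If you cannot reach the county's website, the difference between `find_all()` and `select()` can still be tried offline. The sketch below is a rough stand-in: the `html` string is a trimmed, hand-typed imitation of the parcel page (not the real page source), and it assumes **BeautifulSoup** is installed as above.

```python
from bs4 import BeautifulSoup

# Hand-typed imitation of the three cells seen in the scraped output above
html = """
<table><tr>
<td class="col-xs-4"><div class="inner-label">Parcel Number</div><div class="inner-value">92-21-16-177-018</div></td>
<td class="col-xs-4"><div class="inner-label">Site Address</div><div class="inner-value">607 GLOVER AVE</div></td>
<td class="col-xs-4"><div class="inner-label">Owner Name &amp; Address</div><div class="inner-value">CORA MAE PROPERTIES LLC</div></td>
</tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching cell, so we would still need to pick one out
all_cells = soup.find_all("td", {"class": "col-xs-4"})
print(len(all_cells))

# select() takes a CSS selector: the third child cell, then its inner-value div
owner = [tag.text for tag in soup.select(".col-xs-4:nth-child(3) .inner-value")]
print(owner)
```

Here `find_all()` finds all three cells, while the CSS selector goes straight to the owner cell.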


Now looping over the first 5 parcel numbers with this more direct CSS selection yields...

In [67]:
root_url = 'https://champaignil.devnetwedge.com/parcel/view/'
owner_names04 = []
for i in range(5):  # first 5 parcels only
    # Build the page URL from the i-th parcel number
    url = root_url + str(rental_data['Parcel Number'].iloc[i]) + '/2020'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    owner_names3 = soup.select('.col-xs-4:nth-child(3) .inner-value')
    # Append the text of every element the selector matched on this page
    owner_names04 += [names2.text for names2 in owner_names3]
print(owner_names04)

[' CORA MAE PROPERTIES LLC, \nLUKE SHERMAN\nPO BOX 101\nCHAMPAIGN, IL, 61824-0101 ', ' WOMACK, DEBORAH J & MICHAEL\n803 N OAKWOOD ST\nEFFINGHAM, IL, 62401-3241 ', ' RUBIN, RACHAEL\n212 N CENTRAL AVE\nURBANA, IL, 61801-2606 ', ' HARPER, CRAIG & JAMES E\n1173 COUNTY ROAD 2400 E\nST JOSEPH, IL, 61873-9726 ', ' WAMPLER, JOSEPH\nCOLONY PROPERTY MANAGEMENT\n701 DEVONSHIRE DR\nCHAMPAIGN, IL, 61820-7337 ']


Great! 

Below is a final note on web scraping.

Sometimes people abuse this exploration and overdo it on web scraping, which is why certain uses of web scrapers are illegal or, at the least, frowned upon. When you web scrape, be cautious of how often you are hitting a particular website. It might be best to do the scraping in chunks over a few days if you are attempting to gather lots of data.
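
One simple way to hit a website less often is to pause between requests. Below is a minimal sketch of that idea, not a complete scraper. The function `fetch_with_pauses`, its `delay_seconds` argument, and the stand-in `fake_fetch` are all hypothetical names I made up; in real use you would pass a function that calls `requests.get()` and choose a delay that suits the website.

```python
import time

def fetch_with_pauses(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # wait before hitting the site again
        results.append(fetch(url))
    return results

# Stand-in for requests.get so the sketch runs without touching any website
def fake_fetch(url):
    return "page for " + url

root_url = 'https://champaignil.devnetwedge.com/parcel/view/'
urls = [root_url + parcel + '/2020' for parcel in ['922116177018']]

# delay_seconds=0 only because nothing is actually being requested here
pages = fetch_with_pauses(urls, fake_fetch, delay_seconds=0)
print(pages)
```

Scraping in chunks then amounts to calling this on a slice of the parcel numbers each day and saving the results as you go.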


***


## <a name="handling-dates-and-times"></a>Handling dates and times

When a data set contains date and time information in the fields (columns), the dates and times may be correctly imported internally by the programming language, but misinterpreted externally by the users. Most programming languages, operating systems, and software internally store dates and times as a value in reference to some specific date. For example, in SAS, the reference date is January 1, 1960. In R and Python, internal dates and times are in reference to January 1, 1970. You may find it necessary to convert character strings into date values or re-format existing date values. 
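
To make the reference date concrete, here is a small sketch using Python's standard `datetime` module. It treats times as UTC, which is my assumption for the illustration; the data itself carries no time zone.

```python
from datetime import datetime, timezone

# Zero seconds after the reference point, expressed in UTC
reference = datetime.fromtimestamp(0, tz=timezone.utc)
print(reference)  # 1970-01-01 00:00:00+00:00

# July 24, 2015 (midnight UTC) as a count of seconds since that reference
inspection = datetime(2015, 7, 24, tzinfo=timezone.utc)
seconds_since_reference = inspection.timestamp()
print(seconds_since_reference)  # 1437696000.0
```

The large number is what a date looks like when viewed as a count from the reference date, which is why formatting matters for human readers.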

Below is a table of standard date and time formats that work across Python and R. For more information about your programming language's specific formatting for dates, see [Python with datetime module](https://www.w3schools.com/python/python_datetime.asp), [R with the tidyverse](https://r4ds.had.co.nz/dates-and-times.html), and [R with base R functionality](https://www.statmethods.net/input/dates.html). 

Code | Meaning
---|---
`%a` | Abbreviated weekday name
`%A` | Full weekday name
`%b` | Abbreviated month name
`%B` | Full month name
`%c` | Date and time
`%d` | Day of the month (01-31)
`%H` | Hours (24 hour)
`%I` | Hours (12 hour)
`%j` | Day of the year numbered (001-366)
`%m` | Month numbered (01-12)
`%M` | Minute numbered (00-59)
`%p` | AM/PM
`%S` | Second numbered (00-61)
`%U` | Week of the year starting on Sunday numbered (00-53)
`%w` | Weekday starting on Sunday numbered (0-6)
`%W` | Week of the year starting on Monday numbered (00-53)
`%y` | 2-digit year
`%Y` | 4-digit year
`%z` | Offset from UTC
`%Z` | Time zone (character)
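
Before turning to real data, here is a minimal sketch of these codes at work with the standard `datetime` module: `strptime()` converts a character string into a date value, and `strftime()` re-formats an existing date value. The date string is typed in by hand.

```python
from datetime import datetime

# Character string written as month/day/4-digit year
parsed = datetime.strptime("07/24/2015", "%m/%d/%Y")
print(parsed)  # 2015-07-24 00:00:00

# Re-format the same date with other codes from the table
iso_text = parsed.strftime("%Y-%m-%d")
day_of_year = parsed.strftime("%j")
weekday_number = parsed.strftime("%w")
print(iso_text)        # 2015-07-24
print(day_of_year)     # 205
print(weekday_number)  # 5

# Names from %A and %B depend on the locale settings
print(parsed.strftime("%A, %B %d, %Y"))
```

With English locale settings, the last line reads Friday, July 24, 2015.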

Let's see this in action with the City of Urbana's [Rental Inspection Grades Listings Data - structured comma-separated file](https://github-dev.cs.illinois.edu/stat440-fa21/stat440-fa21-course-content/raw/master/data/rental-inspections-grades-data01.csv). Let's focus on the Inspection Date and Expiration Date columns.

In [7]:
rental_data.columns

Index(['Property Address', 'Parcel Number', 'Inspection Date', 'Grade',
       'License Status', 'Expiration Date', 'Mappable Address'],
      dtype='object')

In [8]:
rental_dates = rental_data[["Inspection Date", "Expiration Date"]]
rental_dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1730 entries, 0 to 1729
Data columns (total 2 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Inspection Date  1730 non-null   object
 1   Expiration Date  1456 non-null   object
dtypes: object(2)
memory usage: 27.2+ KB


In [9]:
rental_dates["Inspection Date"]

0       07/24/2015
1       08/17/2011
2       04/26/2010
3       06/12/2013
4       07/08/2013
           ...    
1725    12/18/2017
1726    12/16/2019
1727    11/04/2011
1728    04/18/2016
1729    05/18/2016
Name: Inspection Date, Length: 1730, dtype: object

In [10]:
rental_dates["Expiration Date"]

0       10/14/2021
1       10/14/2021
2              NaN
3       10/14/2021
4       10/14/2020
           ...    
1725    10/14/2021
1726    10/14/2021
1727    10/14/2021
1728           NaN
1729           NaN
Name: Expiration Date, Length: 1730, dtype: object

The two date columns are stored as `object` dtype (plain character strings), not as a datetime type. We are going to **coerce** them to a datetime type with the `to_datetime()` function within **pandas**.

In [11]:
rental_dates01 = pd.to_datetime(rental_dates["Inspection Date"])
rental_dates01

0      2015-07-24
1      2011-08-17
2      2010-04-26
3      2013-06-12
4      2013-07-08
          ...    
1725   2017-12-18
1726   2019-12-16
1727   2011-11-04
1728   2016-04-18
1729   2016-05-18
Name: Inspection Date, Length: 1730, dtype: datetime64[ns]

In [12]:
rental_dates02 = pd.to_datetime(rental_dates["Expiration Date"])
rental_dates02

0      2021-10-14
1      2021-10-14
2             NaT
3      2021-10-14
4      2020-10-14
          ...    
1725   2021-10-14
1726   2021-10-14
1727   2021-10-14
1728          NaT
1729          NaT
Name: Expiration Date, Length: 1730, dtype: datetime64[ns]

Alternatively, we can add the `format` argument to state exactly how the dates are written instead of relying on **pandas** to guess for us. For example:

In [13]:
rental_dates03 = pd.to_datetime(rental_dates["Expiration Date"], format="%m/%d/%Y")
rental_dates03

0      2021-10-14
1      2021-10-14
2             NaT
3      2021-10-14
4      2020-10-14
          ...    
1725   2021-10-14
1726   2021-10-14
1727   2021-10-14
1728          NaT
1729          NaT
Name: Expiration Date, Length: 1730, dtype: datetime64[ns]
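
If you want to try the same conversion without downloading the data, here is a small sketch on a hand-typed Series that mimics the Expiration Date column, including a missing value. The Series and its names are made up for illustration, and the last step (re-formatting with `.dt.strftime()`) goes a bit beyond what is shown above.

```python
import pandas as pd

# A few values typed in by hand to mimic the Expiration Date column
expiration_text = pd.Series(["10/14/2021", None, "10/14/2020"])

# Parse with an explicit month/day/4-digit-year format
expiration_dates = pd.to_datetime(expiration_text, format="%m/%d/%Y")
print(expiration_dates)

# Missing strings come through as NaT (not a time)
missing_flags = expiration_dates.isna().tolist()
print(missing_flags)  # [False, True, False]

# Datetime values can be re-formatted back into character strings
reformatted = expiration_dates.dt.strftime("%Y-%m-%d")
print(reformatted)
```

The missing value stays missing throughout, just as the `NaN` entries became `NaT` in the output above.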

#### END OF NOTES