# Project Luther

## Website exploration

Let's explore several websites to determine how to scrape the relevant data.

Websites of interest:

* https://www.dupageco.org/PropertyInfo/PropertyLookup.aspx
* http://www.willcountysoa.com/search_address.aspx
* http://www.cookcountypropertyinfo.com/
* http://www.lakecountyil.gov/376/Search-by-Address


In [1]:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import os

chromedriver = "/usr/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver


In [2]:
url_lake_county   = 'http://www.lakecountyil.gov/376/Search-by-Address'
url_cook_county   = 'http://www.cookcountypropertyinfo.com/'
url_will_county   = 'http://www.willcountysoa.com/search_address.aspx'
url_dupage_county = 'https://www.dupageco.org/PropertyInfo/PropertyLookup.aspx'

Let's explore each county's website.  We want to investigate:
* The level of difficulty to procure data from the website
* The types of data we can obtain

Though the exploration may lead me to change my mind, at the moment I'm looking to use something akin to "the time duration since the last sale of the property" as my target and almost anything else as a feature.  I would like to standardize across counties, though I may end up using just one county.
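To make the target concrete, here is a minimal sketch of turning a "last sale date" string into a number of days.  The function name `days_since_last_sale` and the `MM/DD/YYYY` date format are assumptions for illustration; the county sites may format dates differently.

```python
from datetime import date, datetime


def days_since_last_sale(sale_date_text, today=None, fmt="%m/%d/%Y"):
    """Return the whole number of days between the last sale and `today`.

    `fmt` is an assumed date format, not one confirmed from any county site.
    """
    sold = datetime.strptime(sale_date_text, fmt).date()
    today = today or date.today()
    return (today - sold).days


# Hypothetical sale date, measured against a fixed reference day.
print(days_since_last_sale("01/01/2018", today=date(2018, 1, 11)))
```

Passing `today` explicitly keeps the value reproducible across scraping runs.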

###  Lake County

In [17]:
driver = webdriver.Chrome(chromedriver)
driver.get(url_lake_county)


Items of interest:
```html
<input name="tbStreetNum" type="text" id="tbStreetNum">
//*[@id="tbStreetNum"]
<input name="tbStreetName" type="text" id="tbStreetName">
//*[@id="tbStreetName"]
<input name="tbZip" type="text" id="tbZip">
//*[@id="tbZip"]
<input type="submit" name="cmdSubmit" value="Submit" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;cmdSubmit&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" id="cmdSubmit" style="font-weight:bold;">
//*[@id="cmdSubmit"]
```

In [18]:
location = driver.find_element_by_xpath('//*[@id="tbStreetNum"]')
location.click()
location.send_keys('404')
location = driver.find_element_by_name('tbStreetName')
location.click()
location.send_keys('BANBURY RD')
location = driver.find_element_by_name('tbZip')
location.click()
location.send_keys('60060')
location = driver.find_element_by_name('cmdSubmit')  # the submit button is named cmdSubmit
location.click()


NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":"//*[@id="tbStreetNum"]"}
  (Session info: chrome=63.0.3239.108)
  (Driver info: chromedriver=2.33,platform=Linux 4.9.73-1-MANJARO x86_64)


In [15]:
response = requests.get(url_lake_county)

In [16]:
response.status_code

200

In [3]:
#print(response.text)

The Lake County site is the most interesting.

From an ease-of-access point of view, so far it looks impossible.  Something is preventing Selenium from accessing the inputs on the webpage.  What Chromium shows as the page source simply doesn't contain the inputs.  What IS this?
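One possible explanation, which I have not verified, is that the search form lives inside a frame, so its inputs never appear in the outer page source.  A minimal sketch for checking that hypothesis with only the standard library follows; `FrameFinder` and `find_frame_sources` are hypothetical helper names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class FrameFinder(HTMLParser):
    """Collect the src attribute of every <iframe> and <frame> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag in ("iframe", "frame"):
            self.sources.append(dict(attrs).get("src"))


def find_frame_sources(html):
    """Return the list of frame sources found in an HTML string."""
    finder = FrameFinder()
    finder.feed(html)
    return finder.sources


# Made-up page: the real Lake County markup may look nothing like this.
sample = '<html><body><iframe src="https://example.com/form.aspx"></iframe></body></html>'
print(find_frame_sources(sample))
```

Running `find_frame_sources(response.text)` on the fetched page would show whether any frames exist.  If they do, Selenium would need `driver.switch_to.frame(...)` before it could locate the inputs.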

Beyond that, however, I must enter exact values.  And I have to break the street address into three parts.
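Breaking an address into the three form fields could look like the sketch below.  It assumes the input is a single string of the form `"<number> <street name> <5-digit zip>"`; the helper name `split_address` and that input format are my assumptions, not something the site dictates.

```python
import re

# Assumed layout: house number, street name, 5-digit zip code.
ADDRESS_RE = re.compile(r"^\s*(\d+)\s+(.+?)\s+(\d{5})\s*$")


def split_address(text):
    """Split an address string into (street number, street name, zip)."""
    match = ADDRESS_RE.match(text)
    if match is None:
        raise ValueError("unrecognized address format: {!r}".format(text))
    number, name, zip_code = match.groups()
    # Upper-case the street name to match the style used in the form above.
    return number, name.upper(), zip_code


print(split_address("404 Banbury Rd 60060"))
```

The three values map onto the `tbStreetNum`, `tbStreetName`, and `tbZip` inputs listed earlier.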

Alternatively, however, there is a mechanism using a map.  And just the popups that appear when you get to a house contain robust data.

So... this site has good data, but presents a serious challenge to access.

I **may** return here, time permitting.

### Will County

In [14]:
driver = webdriver.Chrome(chromedriver)
driver.get(url_will_county)


Items of interest:
```html
<input name="ctl00$BC$txStreetFrom" type="text" maxlength="6" id="ctl00_BC_txStreetFrom" class="lgFont" style="width:75px;">
//*[@id="ctl00_BC_txStreetFrom"]
<input name="ctl00$BC$txStreetName" type="text" maxlength="40" id="ctl00_BC_txStreetName" class="lgFont" style="width:178px;">
//*[@id="ctl00_BC_txStreetName"]
<input type="submit" name="ctl00$BC$btnSearch" value="Search" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$BC$btnSearch&quot;, &quot;&quot;, false, &quot;&quot;, &quot;results.aspx&quot;, false, false))" id="ctl00_BC_btnSearch" class="mdButton" style="width:75px;">
//*[@id="ctl00_BC_btnSearch"]
```


In [15]:
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_txStreetFrom"]')
location.click()
location.send_keys('1')
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_txStreetName"]')
location.click()
location.send_keys('a')
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_btnSearch"]')
location.click()



Items of Interest:
```html
//*[@id="ctl00_BC_gvParcels"]
//*[@id="ctl00_BC_gvParcels"]/tbody/tr[3]
//*[@id="ctl00_BC_gvParcels"]/tbody/tr[3]/td[1]/a
<a id="ctl00_BC_gvParcels_ctl18_lbNext" disabled="disabled" style="color:Blue;">Next</a>
//*[@id="ctl00_BC_gvParcels_ctl18_lbNext"]
```

In [16]:
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_gvParcels"]/tbody/tr[3]/td[1]/a')
location.click()

The Will County site seems the easiest of the lot.  I can query for lots of records.  The results come paginated, but the site also gives a "Next" function at the next layer.  So I need not manage the pagination so much as just jump in and hit Next till it's not clickable.
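The "hit Next till it's not clickable" idea can be sketched as a small loop.  This is a sketch under assumptions: `walk_with_next` and `on_page` are hypothetical names, it uses the same Selenium 3 style `find_elements_by_xpath` call as the rest of this notebook (newer Selenium releases use `find_elements(By.XPATH, ...)` instead), and it assumes the link either disappears or carries a `disabled` attribute on the last page, as the `lbNext` anchor above does.

```python
def walk_with_next(driver, next_xpath, on_page, max_pages=1000):
    """Call on_page(driver) per page, clicking Next until it is gone or disabled.

    Returns the number of pages visited.  max_pages is a safety cap so a
    misbehaving site cannot trap the loop forever.
    """
    visited = 0
    while visited < max_pages:
        on_page(driver)
        visited += 1
        links = driver.find_elements_by_xpath(next_xpath)
        # Stop when the link is missing or marked disabled.
        if not links or links[0].get_attribute("disabled"):
            break
        links[0].click()
    return visited
```

A call might look like `walk_with_next(driver, '//*[@id="ctl00_BC_gvParcels_ctl18_lbNext"]', scrape_page)`, where `scrape_page` is whatever function collects the data.  A real run would probably also need a wait after each click so the next page can load.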

I get Previous Sale Date immediately and up front.  I get a lot of "Building Information" as features.

For the MVP, I think this is going to have to be the site to use.

### DuPage County

In [11]:
driver = webdriver.Chrome(chromedriver)
driver.get(url_dupage_county)

Items of Interest:
```html
//*[@id="prefix-overlay-header"]/button


```

In [8]:
location = driver.find_element_by_xpath('//*[@id="prefix-overlay-header"]/button')
location.click()

Items of Interest:
```html
<input name="ctl00$pageContent$ctl00$txtStreetNumber" type="text" maxlength="6" size="6" id="ctl00_pageContent_ctl00_txtStreetNumber" style="width:70px;">
//*[@id="ctl00_pageContent_ctl00_txtStreetNumber"]
<input name="ctl00$pageContent$ctl00$txtStreet" type="text" maxlength="22" size="22" id="ctl00_pageContent_ctl00_txtStreet" style="width:150px;">
//*[@id="ctl00_pageContent_ctl00_txtStreet"]
<input type="submit" name="ctl00$pageContent$ctl00$btnSearch" value="Search" id="ctl00_pageContent_ctl00_btnSearch" class="btn">
//*[@id="ctl00_pageContent_ctl00_btnSearch"]
```

In [9]:
location = driver.find_element_by_xpath('//*[@id="ctl00_pageContent_ctl00_txtStreetNumber"]')
location.click()
location.send_keys('1')
location = driver.find_element_by_xpath('//*[@id="ctl00_pageContent_ctl00_txtStreet"]')
location.click()
location.send_keys('a')
location = driver.find_element_by_xpath('//*[@id="ctl00_pageContent_ctl00_btnSearch"]')
location.click()


Items of Interest:
```html
<a id="ctl00_pageContent_ctl00_gvList_ctl02_lnkPin" href="/PropertyInformation.aspx?PIN=8cHbnjb55IusxmrPmfrB%2bas5tfBSzBn54NJRQ5UYIY4%3d">0101100003</a>
//*[@id="ctl00_pageContent_ctl00_gvList_ctl02_lnkPin"]
<td>
                <a id="ctl00_pageContent_ctl00_gvList_ctl02_lnkPin" href="/PropertyInformation.aspx?PIN=8cHbnjb55IusxmrPmfrB%2bas5tfBSzBn54NJRQ5UYIY4%3d">0101100003</a>
            </td>
//*[@id="ctl00_pageContent_ctl00_gvList"]/tbody/tr[2]/td[1]
//*[@id="ctl00_pageContent_ctl00_gvList"]/tbody/tr[52]/td/table/tbody/tr/td[2]
```

In [10]:
location = driver.find_element_by_xpath('//*[@id="ctl00_pageContent_ctl00_gvList"]/tbody/tr[2]/td[1]/a')
location.click()

For DuPage, we have a popup to dispense with to start.  No biggie.
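Dispensing with the popup can be made tolerant of the popup not appearing at all.  A minimal sketch, assuming the close button keeps the XPath noted above; `dismiss_overlay` is a hypothetical helper name, written in the same Selenium 3 call style as the rest of this notebook.

```python
def dismiss_overlay(driver, button_xpath='//*[@id="prefix-overlay-header"]/button'):
    """Click the overlay's close button if present; return True if clicked."""
    # find_elements (plural) returns an empty list instead of raising.
    buttons = driver.find_elements_by_xpath(button_xpath)
    if buttons:
        buttons[0].click()
        return True
    return False
```

Calling `dismiss_overlay(driver)` right after `driver.get(url_dupage_county)` would then be safe whether or not the overlay shows up.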

We can query with very terse terms to get lots of records.  But it's paginated.

More problematically, the amount of information we get is slim.  We can get Sale History in a tab.  And we can get about ten years' worth of assessment data.  So... for raw tax history vs. longevity, this may be good.  But for any additional features, this may be lacking.



### Cook County

In [4]:
driver = webdriver.Chrome(chromedriver)
driver.get(url_cook_county)

Items of Interest:
```html
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$houseNumber" type="text" id="houseNumber" class="addresssearchitem" placeholder="* House Number">
//*[@id="houseNumber"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$txtStreetName" type="text" id="txtStreetName" class="addresssearchitem" placeholder="* Street">
//*[@id="txtStreetName"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$txtCity" type="text" id="txtCity" class="addresssearchitem" placeholder="* City">
//*[@id="txtCity"]
//*[@id="ContentPlaceHolder1_PINAddressSearch_btnSearch"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$pinBox1" maxlength="2" id="pinBox1" title="PIN Segment 1" class="pinsearchitem xxsmall" type="number" autocomplete="off" onkeyup="autotab(this,'pinBox2')">
//*[@id="pinBox1"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$pinBox2" maxlength="2" id="pinBox2" title="PIN Segment 2" class="pinsearchitem xxsmall" type="number" autocomplete="off" onkeyup="autotab(this,'pinBox3')">
//*[@id="pinBox2"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$pinBox3" maxlength="3" id="pinBox3" title="PIN Segment 3" class="pinsearchitem xsmall" type="number" autocomplete="off" onkeyup="autotab(this,'pinBox4')">
//*[@id="pinBox3"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$pinBox4" maxlength="3" id="pinBox4" title=" PIN Segment 4" class="pinsearchitem xsmall" type="number" autocomplete="off" onkeyup="autotab(this,'pinBox5')">
//*[@id="pinBox4"]
<input name="ctl00$ContentPlaceHolder1$PINAddressSearch$pinBox5" maxlength="4" id="pinBox5" title="PIN Segment 5" class="pinsearchitem small" type="number" autocomplete="off" onkeydown="if(this.value.length==4 &amp;&amp; event.keyCode!=8) return false;">
//*[@id="pinBox5"]


```
```
111 NORMANDY DR (CHICAGO HEIGHTS, 60411)  PIN: 32-08-418-013-0000
111 PAMELA DR (CHICAGO HEIGHTS, 60411)  PIN: 32-08-416-021-0000
111 PEYTON DR (CHICAGO HEIGHTS, 60411)  PIN: 32-08-412-014-0000
111 E 83RD ST (CHICAGO, 60619)  PIN: 20-34-302-004-0000
```

In [5]:
location = driver.find_element_by_xpath('//*[@id="houseNumber"]')
location.click()
location.send_keys('111')
location = driver.find_element_by_xpath('//*[@id="txtStreetName"]')
location.click()
location.send_keys('a')
location = driver.find_element_by_xpath('//*[@id="txtCity"]')
location.click()
location.send_keys('Chicago')
location = driver.find_element_by_xpath('//*[@id="ContentPlaceHolder1_PINAddressSearch_btnSearch"]')
location.click()

Items of Interest:
```html
<a id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_0" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$AddressResults$rptAddressResults$ctl00$lnkAddressByPIN','')">111  MAYFAIR PL  (CHICAGO HEIGHTS, 60411)&nbsp;</a>
//*[@id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_0"]
<a id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressPIN_0" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$AddressResults$rptAddressResults$ctl00$lnkAddressPIN','')">PIN: 32-08-316-016-0000</a>
//*[@id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressPIN_0"]
<a id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_4109" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$AddressResults$rptAddressResults$ctl8218$lnkAddressByPIN','')">11195 S LOTHAIR AVE  (CHICAGO, 60643)&nbsp;</a>

//*[@id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_782"]
//*[@id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_4083"]
```

In [6]:
location = driver.find_element_by_xpath('//*[@id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressByPIN_782"]')
location.click()

We can walk through the Cook County site relatively easily.  There seems to be no pagination and no limits.  We can give very simple queries (e.g., "1" and "a" and "Chicago") and pull an enormous number of links to iterate.
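Since all results land on one page, the PINs can be pulled straight out of the page source.  A quick sketch using a regular expression keyed on the `lnkAddressPIN_<n>` ids noted above; `extract_pins` is a hypothetical helper, and a regex over HTML is a shortcut here rather than a robust parser.

```python
import re

# Matches the "PIN: ..." anchors whose ids end in lnkAddressPIN_<n>.
PIN_LINK_RE = re.compile(
    r'id="(ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressPIN_\d+)"'
    r'[^>]*>\s*PIN:\s*([\d-]+)\s*</a>'
)


def extract_pins(html):
    """Return a list of (element id, PIN) pairs found in the page source."""
    return [(m.group(1), m.group(2)) for m in PIN_LINK_RE.finditer(html)]


# One anchor copied from the results page notes above.
sample = """<a id="ContentPlaceHolder1_AddressResults_rptAddressResults_lnkAddressPIN_0" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$AddressResults$rptAddressResults$ctl00$lnkAddressPIN','')">PIN: 32-08-316-016-0000</a>"""
print(extract_pins(sample))
```

With a live session, `extract_pins(driver.page_source)` would give the full list to iterate over.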

However, I need to confirm how to interpret the "Document, Deeds & Liens" section.  At first glance, the Cook County data provides five years of data.  If the only way to determine time since last sale is whether a sale occurred in the last five years, we may not be able to use this.  Sampling a few records, though, it's clear the Documents section provides five rows, not five years.


Furthermore, there is more data available but it is behind a captcha.  The stuff there is more akin to features.

What we want as the target is also available via a bounce to a second site.  That is to say, I can go way back through the records for each property:

http://www.ccrecorder.org/recordings/get_docs_by_pin/24241020480000/
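The 14 digits in that URL look like a PIN with its dashes removed (2+2+3+3+4 digits), so the bounce URL can presumably be built from any PIN on the results page.  That mapping is my inference from the format, not something I have confirmed; `recorder_url` is a hypothetical helper.

```python
RECORDER_TEMPLATE = "http://www.ccrecorder.org/recordings/get_docs_by_pin/{}/"


def recorder_url(pin):
    """Build the recorder URL for a PIN such as '24-24-102-048-0000'.

    Assumes the URL simply takes the PIN with its dashes stripped.
    """
    digits = pin.replace("-", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError("expected a 14-digit PIN, got {!r}".format(pin))
    return RECORDER_TEMPLATE.format(digits)


print(recorder_url("24-24-102-048-0000"))
```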


So... in summary, ease of access is mixed.  The captcha may render this unviable if I want those features.

## Prototyping

### Will County

Let's navigate to the page with data...

In [29]:
driver = webdriver.Chrome(chromedriver)
driver.get(url_will_county)
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_txStreetFrom"]')
location.click()
location.send_keys('1')
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_txStreetName"]')
location.click()
location.send_keys('a')
location = driver.find_element_by_xpath('//*[@id="ctl00_BC_btnSearch"]')
location.click()
#location = driver.find_element_by_xpath('//*[@id="ctl00_BC_gvParcels"]/tbody/tr[3]/td[1]/a')
#location.click()

Items of Interest:
```html
<a id="ctl00_BC_lbNextProperty" href="javascript:__doPostBack('ctl00$BC$lbNextProperty','')" style="color:Blue;">Next Parcel &gt;&gt;</a>
//*[@id="ctl00_BC_lbNextProperty"]
```

In [39]:
driver.find_elements_by_xpath('//*[@id="ctl00_BC_gvParcels"]/tbody/tr/td[1]/a')[0].text

'0701124040070000'

In [22]:
html = driver.page_source
soup = BeautifulSoup(html, "lxml")

In [28]:
soup.find_all('table')

[<table border="0" cellpadding="0" cellspacing="0" id="Table1" jstcache="0" width="744">
 <tbody jstcache="0"><tr jstcache="0">
 <td bgcolor="#ffffff" width="2">
 <img alt="" height="2" src="/images/2x2trans.gif" width="2"/>
 </td>
 <td align="left" bgcolor="#ffffff" jstcache="0" valign="top" width="742">
 <table border="0" cellpadding="0" cellspacing="0" id="Table2" jstcache="0" width="740">
 <tbody jstcache="0"><tr>
 <td><img alt="The County of Will, Illinois" height="45" src="/images/logo1.gif" width="175"/><img alt="Rhonda R. Novak, CIAO/I" height="45" src="/images/name.gif" width="565"/></td>
 </tr>
 <tr>
 <td><img alt="The County of Will, Illinois" height="118" src="/images/logo2.jpg" width="175"/><img alt="" height="118" src="/images/photo.jpg" width="565"/></td>
 </tr>
 <tr jstcache="0">
 <td align="left" jstcache="0" valign="top">
 <table border="0" cellpadding="0" cellspacing="0" id="Table3" jstcache="0" width="100%">
 <tbody jstcache="0"><tr jstcache="0">
 <td style="width:1

In [56]:
import pandas as pd
t=pd.read_html('''<table class="dialogTable" id="address-list">
                <thead style="background: #ddd;">
                    <tr>
                        <th style="height:15px;" align="left">
                            PIN
                        </th>
                        <th>
                            ADDRESS
                        </th>
                        <th>
                            CITY
                        </th>
                        <th>
                            ZIP
                        </th>
                    </tr>
                </thead>
                <tbody><tr><td>0701124040070000</td><td>1 ASHCROFT CT</td><td>BOLINGBROOK</td><td align="center">60490</td></tr><tr><td>0701124040070000</td><td>1 ASHCROFT CT</td><td>BOLINGBROOK</td><td align="center">60490</td></tr></tbody>
               

            </table>''')

In [66]:
t2.append([t[0].head(1)])


In [70]:
driver.find_elements_by_xpath('//*[@id="ctl00_BC_gvParcels"]/tbody/tr/td[2]/a')[0].text

'ADDRESS(ES)'

In [None]:
# XPaths of interest (notes, not executable code):
# //*[@id="ctl00_BC_gvParcels"]/tbody/tr[4]/td[2]/a
# //*[@id="address-list"]
# /html/body/div[2]/div[1]/button/span[1]

In [73]:
driver.find_elements_by_xpath('//*[@id="address-list"]')[0].get_attribute("outerHTML")

'<table class="dialogTable" id="address-list">\n                <thead style="background: #ddd;">\n                    <tr>\n                        <th style="height:15px;" align="left">\n                            PIN\n                        </th>\n                        <th>\n                            ADDRESS\n                        </th>\n                        <th>\n                            CITY\n                        </th>\n                        <th>\n                            ZIP\n                        </th>\n                    </tr>\n                </thead>\n                <tbody><tr><td>1202024040030000</td><td>1 ASSEMBLY CT</td><td>BOLINGBROOK</td><td align="center">60440</td></tr></tbody>\n               \n\n            </table>'
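The `outerHTML` string above can be turned into rows of cell text.  The earlier cell does this with `pd.read_html`; the sketch below is a standard-library alternative built on `html.parser`, not what this notebook actually ran.  `TableRows` and `parse_table` are hypothetical names, and the sample is a shortened copy of the output above.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return a list of rows, each a list of stripped cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = (
    '<table class="dialogTable" id="address-list"><thead><tr>'
    '<th>\n PIN\n </th><th>ADDRESS</th><th>CITY</th><th>ZIP</th></tr></thead>'
    '<tbody><tr><td>1202024040030000</td><td>1 ASSEMBLY CT</td>'
    '<td>BOLINGBROOK</td><td align="center">60440</td></tr></tbody></table>'
)
for row in parse_table(sample):
    print(','.join(row))
```

Joining each row with commas, as in the scratch cell that follows, gives lines that can be written straight to a CSV file as long as no cell contains a comma.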

In [74]:
print(','.join(['a','b']))

a,b
