## Explanation of Project

* This project showcases my coding abilities in Python. It involves scraping a local county jail's list of released inmates from the web.

* The website contains a table with each inmate's name, age, race, sex, admit date, release date, and time in jail (number of days).

* The table is paginated with JavaScript, and each page is reached through a Next button. Selenium was therefore needed to simulate clicking that button.

* The HTML of each page was collected and parsed, and the messy data was transformed into a pandas DataFrame.

* The exercise concludes with summary statistics and some plots made with matplotlib.

## Imports

In [72]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import re
import pandas as pd
import requests
import html5lib
import sys
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains
import time
import numpy as np

## Created Functions

In [73]:
# This function parses one table row (a <tr> element) from a scraped page
# and extracts the inmate's information into a dictionary

def get_table(tbrow):

    # the name cell is identified by its id; strip newlines and tabs from the text
    name = tbrow.find("td", {'id': re.compile('gvInmates_tccell')}).text
    name = re.sub('[\n\t]', "", name)

    # look up the data cells once instead of once per field
    cells = tbrow.findAll('td', {'class': re.compile('dxgv*')})
    age = cells[1].text
    race = cells[2].text
    sex = cells[3].text
    admit = cells[4].text
    release = cells[5].text
    tij = cells[6].text  # time in jail (number of days)

    inmate_row = {'name': name,
                  'age': age,
                  'race': race,
                  'sex': sex,
                  'admit': admit,
                  'release': release,
                  'tij': tij}
    return inmate_row
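
A note on the pattern `dxgv*` used above: in a regular expression, `*` repeats the preceding character rather than acting as a wildcard, so the pattern means `dxg` followed by zero or more `v`. Beautiful Soup filters attribute values with the pattern's `search()` method, so a plain `dxgv` would already match any class containing that substring. The sketch below illustrates the difference; the class names `dxgv_cell` and `dxg_header` are hypothetical and not taken from the website.

```python
import re

loose = re.compile('dxgv*')   # 'dxg' followed by zero or more 'v'
strict = re.compile('dxgv')   # the literal substring 'dxgv'

# Hypothetical class names, for illustration only
class_names = ['dxgv_cell', 'dxg_header', 'other']

loose_hits = [c for c in class_names if loose.search(c)]
strict_hits = [c for c in class_names if strict.search(c)]

print(loose_hits)   # ['dxgv_cell', 'dxg_header']
print(strict_hits)  # ['dxgv_cell']
```

Whether the looser pattern matters depends on the classes actually present in the page, which is not checked here.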

## Initialize Selenium

In [74]:
url = "https://jail.desotosheriff.org/DCN/inmates-released"
driver = webdriver.Firefox()
driver.get(url)

## Loop through Each Page and Extract HTML

In [75]:
# This loops through each page on the website and saves each page's HTML source.
# Once the Next button can no longer be clicked on the last page,
# an exception is raised and the loop ends.
from selenium.common.exceptions import TimeoutException

soups = []
wait = WebDriverWait(driver, 45)
next_xpath = '//*[@id="gvInmates_DXPagerBottom"]/a[@class="dxp-button dxp-bi"]/img[@alt="Next"]'
for i in range(174):  # the table has at most 174 pages
    html = driver.page_source
    soups.append(BeautifulSoup(html, "html5lib"))
    try:
        el = wait.until(EC.element_to_be_clickable((By.XPATH, next_xpath)))
        driver.execute_script("arguments[0].click();", el)
        time.sleep(2)  # give the table time to load the next page
    except (TimeoutException, NoSuchElementException, StaleElementReferenceException):
        print("Last Page")
        break  # leave the loop without raising SystemExit in the notebook


Last Page
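
The save-then-advance logic of the loop above can be sketched independently of Selenium. In this minimal sketch, `collect_pages`, `FakePager`, and `LastPageError` are hypothetical names introduced for illustration; `FakePager` stands in for the browser so the example runs without a driver. It is one possible way to structure the loop, not the code that produced the results in this notebook.

```python
class LastPageError(Exception):
    """Raised by the stand-in pager when there is no next page."""


class FakePager:
    """Stand-in for the browser: holds pages and a current position."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def page_source(self):
        return self.pages[self.index]

    def click_next(self):
        if self.index + 1 >= len(self.pages):
            raise LastPageError("no next page")
        self.index += 1


def collect_pages(pager, max_pages):
    """Save the current page, then try to advance; stop on the last page."""
    collected = []
    for _ in range(max_pages):
        collected.append(pager.page_source())
        try:
            pager.click_next()
        except LastPageError:
            break  # last page reached; keep what was collected
    return collected


pages = collect_pages(FakePager(["page 1", "page 2", "page 3"]), max_pages=10)
print(pages)  # ['page 1', 'page 2', 'page 3']
```

Saving the page before trying to advance means the last page is still collected when the attempt to move on fails.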


## Extract Necessary Information from Each Page's HTML

In [76]:
# For each page in the list of parsed HTML pages, and for each table row on that page,
# the get_table function is applied to extract the data from that row.
# In other words, each person's information is extracted and appended to a list.
table_rows = []
for pg in soups:
    for tabrow in pg.findAll('tr', {'id': re.compile('gvInmates_DXDataRow*')}):
        table_rows.append(get_table(tabrow))

In [77]:
# Each table row is saved as a dictionary, so we are left with a list of dictionaries.
# As of 5/31/2019 there were 17,210 released inmates,
# so there should be 17,210 observations in the list.
print(len(table_rows))
# This is an example of one observation - a dictionary for each inmate
print(table_rows[0])

17210
{'name': 'Abalos, Eusebio Cardenas', 'age': '43', 'race': 'W', 'sex': 'M', 'admit': '8/31/2015', 'release': '8/31/2015', 'tij': '0'}


## Convert List of Dictionaries to Pandas Dataframe

In [78]:
# convert list of dictionaries to df 
table = pd.DataFrame(table_rows)

# rearrange columns
table = table[['name', 'age', 'race','sex','admit', 'release', 'tij']]

# show head of table
table.head()

# note: the name column could be anonymized, but it is left as-is here

| | name | age | race | sex | admit | release | tij |
|---|---|---|---|---|---|---|---|
| 0 | Abalos, Eusebio Cardenas | 43 | W | M | 8/31/2015 | 8/31/2015 | 0 |
| 1 | Abila Rivera, Maria del Carmen | 39 | H | F | 7/22/2017 | 7/22/2017 | 0 |
| 2 | Aburayyan, Bader Khaleel | 35 | O | M | 9/17/2014 | 9/18/2014 | 1 |
| 3 | Accardi, Charles Bradley | 26 | W | M | 3/2/2012 | 3/2/2012 | 0 |
| 4 | Accardi, Charles Bradley | 26 | W | M | 7/24/2012 | 8/6/2012 | 13 |
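
As a sanity check, the `tij` column can be recomputed from the admit and release dates. The sketch below does this with the standard library for three of the rows shown above; `days_in_jail` is a hypothetical helper, and the `%m/%d/%Y` format is an assumption based on how the dates appear in the table.

```python
from datetime import datetime

DATE_FORMAT = "%m/%d/%Y"  # assumed from values such as '8/31/2015'


def days_in_jail(admit, release):
    """Return the number of days between two date strings."""
    admitted = datetime.strptime(admit, DATE_FORMAT)
    released = datetime.strptime(release, DATE_FORMAT)
    return (released - admitted).days


# (admit, release, tij) copied from the rows shown above
rows = [
    ("8/31/2015", "8/31/2015", 0),
    ("9/17/2014", "9/18/2014", 1),
    ("7/24/2012", "8/6/2012", 13),
]

for admit, release, tij in rows:
    print(admit, release, days_in_jail(admit, release) == tij)
```

The same comparison could be run over the full table to flag rows where the reported time in jail disagrees with the dates.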


In [79]:
# check the data types
#table.dtypes

# convert to different data types
table = table.astype({"name": str, "age": int, "race": 'category', "sex": 'category'})

# for the time in jail column we must convert N/A to NaN
table.loc[table['tij'] == 'N/A', 'tij'] = np.nan

# remove rows containing NaN values
table.dropna(how='any', inplace=True)

# convert tij to int; astype returns a new Series, so assign it back
table['tij'] = table['tij'].astype(int)

# check dtypes
table.dtypes
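
An alternative to replacing `'N/A'` by hand is `pd.to_numeric` with `errors='coerce'`, which turns any non-numeric string into NaN in one step. The sketch below uses a small made-up frame (the values are illustrative, not scraped data) and shows one possible approach rather than the method used in this notebook.

```python
import pandas as pd

# Made-up sample values, for illustration only
sample = pd.DataFrame({"age": ["43", "39", "35"],
                       "tij": ["0", "N/A", "13"]})

# Non-numeric strings such as 'N/A' become NaN
sample["tij"] = pd.to_numeric(sample["tij"], errors="coerce")

# Drop rows without a time in jail, then store the column as integers
sample = sample.dropna(subset=["tij"]).astype({"tij": int})

print(sample["tij"].tolist())  # [0, 13]
```

Note that coercion also hides any other unexpected string, so it may be worth counting the NaN values before dropping them.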

## Summary Statistics 