## HTML - Web Scraping - Simple example using Pandas

In [1]:
#You may need to restart your kernel (Jupyter environment) after running this cell.
#To restart the kernel, go to the "Kernel" menu and click "Restart". Note that all previously
#created variables will be removed from memory.
!pip install lxml



#### "lxml" is a widely used library for working with XML and HTML documents, parsing and manipulating them

#### This library is commonly used for web scraping, parsing XML and HTML data, and performing various tasks related to structured document processing. Here are some use cases for installing the "lxml" library:

- Web Scraping: "lxml" is often used for web scraping tasks where you need to extract data from web pages that are written in HTML.

- XML and HTML Parsing: It's commonly used for parsing and manipulating XML and HTML documents. This can be useful for extracting data from XML-based APIs or web pages.

- Data Extraction and Transformation: You can use "lxml" to extract specific data from structured documents, clean and transform it, and then save it in a more usable format.

- Web Crawling: When building web crawlers to navigate websites and gather information, "lxml" can help parse and process the web pages efficiently.
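
As a small illustration of the parsing and extraction use cases above, the following sketch parses an HTML string with `lxml` and pulls out the first cell of each table row using XPath. The HTML snippet and the variable names (`page`, `tree`, `names`) are made up for this example and do not come from the FDIC page.

```python
from lxml import html as lxml_html

# A tiny, hand-written HTML page (an assumption for illustration only)
page = (
    "<html><body><table>"
    "<tr><td>Allied Bank</td><td>AR</td></tr>"
    "<tr><td>Trust Company Bank</td><td>TN</td></tr>"
    "</table></body></html>"
)

# Parse the string into an element tree
tree = lxml_html.fromstring(page)

# XPath: the text of the first <td> in every table row
names = tree.xpath("//table//tr/td[1]/text()")
print(names)
```

XPath expressions like this one are a common way to target specific cells once the page has been parsed.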



In [2]:
!pip install html5lib



#### The `html5lib` package is a Python library that parses HTML documents the same way a web browser does, which makes it tolerant of malformed markup. It is used in web scraping and web page parsing tasks, where you need to extract information from HTML pages.

Here's an example of how you might use this package after installing it:

```python
import html5lib

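# Parse an HTML document. By default html5lib puts every tag in the XHTML
# namespace, so a plain ".//h1" lookup would find nothing;
# namespaceHTMLElements=False keeps the tag names plain.
html = "<html><body><h1>Hello, World!</h1></body></html>"
document = html5lib.parse(html, namespaceHTMLElements=False)

# Access elements in the HTML document
h1_text = document.find(".//h1").text
print(h1_text)  # Output: Hello, World!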
```

Here, `html5lib` is used to parse an HTML document and extract the text content of an `<h1>` element. 

In [3]:
import pandas as pd
import html5lib
tables = pd.read_html('FDIC_ Failed Bank List.html') #Read HTML tables (from the webpage) into a list of DataFrame objects
len(tables)   #How many tables are defined in the DOM?  (how many tables are in the document?)

1
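
To see why `pd.read_html` returns a list, here is a minimal sketch that reads a hand-written HTML string containing two tables. The HTML content is invented for illustration; `StringIO` is used because recent pandas versions expect a file-like object rather than a raw HTML string. The `match` argument keeps only the tables whose text matches the given string or regex.

```python
from io import StringIO

import pandas as pd

# Two small tables in one document (made-up content, for illustration only)
html = """
<html><body>
<table>
  <thead><tr><th>Bank Name</th><th>ST</th></tr></thead>
  <tbody>
    <tr><td>Allied Bank</td><td>AR</td></tr>
    <tr><td>Trust Company Bank</td><td>TN</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Year</th><th>Count</th></tr></thead>
  <tbody><tr><td>2016</td><td>5</td></tr></tbody>
</table>
</body></html>
"""

# One DataFrame per <table> element found in the document
tables = pd.read_html(StringIO(html))
print(len(tables))

# Keep only the tables that contain the text "Bank Name"
bank_tables = pd.read_html(StringIO(html), match="Bank Name")
print(bank_tables[0])
```

When a page contains many tables, `match` is usually more robust than relying on a fixed position such as `tables[0]`.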

In [4]:
failures = tables[0] # The [0] index indicates that we are selecting the first table in the list
failures.head()

|   | Bank Name | City | ST | CERT | Acquiring Institution | Closing Date | Updated Date |
|---|---|---|---|---|---|---|---|
| 0 | Allied Bank | Mulberry | AR | 91 | Today's Bank | September 23, 2016 | November 17, 2016 |
| 1 | The Woodbury Banking Company | Woodbury | GA | 11297 | United Bank | August 19, 2016 | November 17, 2016 |
| 2 | First CornerStone Bank | King of Prussia | PA | 35312 | First-Citizens Bank & Trust Company | May 6, 2016 | September 6, 2016 |
| 3 | Trust Company Bank | Memphis | TN | 9956 | The Bank of Fayette County | April 29, 2016 | September 6, 2016 |
| 4 | North Milwaukee State Bank | Milwaukee | WI | 20364 | First-Citizens Bank & Trust Company | March 11, 2016 | June 16, 2016 |


In [8]:
failures.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 547 entries, 0 to 546
Data columns (total 7 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   Bank Name              547 non-null    object
 1   City                   547 non-null    object
 2   ST                     547 non-null    object
 3   CERT                   547 non-null    int64 
 4   Acquiring Institution  547 non-null    object
 5   Closing Date           547 non-null    object
 6   Updated Date           547 non-null    object
dtypes: int64(1), object(6)
memory usage: 30.0+ KB
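
The `info()` output reports 'Closing Date' and 'Updated Date' as `object`, meaning they are still plain strings. A minimal sketch of converting such a column is shown below, using a small stand-in DataFrame named `sample` (hypothetical; the first two values are copied from the `head()` output above and the third is an invented bad value). The explicit `format` string is an assumption about how the dates are written, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising.

```python
import pandas as pd

# Stand-in data: two real-looking dates and one deliberately invalid entry
sample = pd.DataFrame(
    {"Closing Date": ["September 23, 2016", "May 6, 2016", "unknown"]}
)

# Convert the string column to datetimes; invalid entries become NaT
sample["Closing Date"] = pd.to_datetime(
    sample["Closing Date"], format="%B %d, %Y", errors="coerce"
)

print(sample.dtypes)
print(sample)
```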


In [9]:
close_timestamps = pd.to_datetime(failures['Closing Date'])
close_timestamps.dt.year.value_counts()

Closing Date
2010    157
2009    140
2011     92
2012     51
2008     25
2013     24
2014     18
2002     11
2015      8
2016      5
2004      4
2001      4
2007      3
2003      3
2000      2
Name: count, dtype: int64

### Code Explanation

1. `close_timestamps = pd.to_datetime(failures['Closing Date'])`:
   - `pd.to_datetime()` converts a column of date strings into pandas datetime values, returning a Series with a `datetime64` dtype. Here it is applied to the 'Closing Date' column of the `failures` DataFrame, which `failures.info()` reports as `object` (string) dtype.
   

2. `close_timestamps.dt.year.value_counts()`:
   - `close_timestamps.dt` is the datetime accessor of the Series. It exposes the components of each timestamp, such as year, month, and day.
   - `.year` extracts the year from each timestamp, turning the Series of timestamps into a Series of integer years. It does not group anything by itself; the counting happens in the next step.
   - `value_counts()` is a Pandas method that counts the occurrences of each unique value in a Series and returns a new Series with the counts, indexed by the unique values and sorted from most to least frequent.
   - In this case, it is used to count the number of failures that occurred in each year and returns a Series where each year is the index, and the number of failures is the corresponding value.
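
Because `value_counts()` orders its result by count, the years above appear out of chronological order. One possible follow-up, sketched here on a few made-up dates rather than the real FDIC table, is to chain `sort_index()` so the result reads as a timeline. The names `dates`, `years`, and `by_year` are illustrative only.

```python
import pandas as pd

# A handful of example closing dates (illustrative values, not the full dataset)
dates = pd.Series(["September 23, 2016", "August 19, 2016", "October 2, 2015"])

# Strings -> timestamps -> integer years
years = pd.to_datetime(dates).dt.year

# Count failures per year, then order the result by year instead of by count
by_year = years.value_counts().sort_index()
print(by_year)
```

Applied to `close_timestamps`, the same chain would list the failure counts from the earliest year to the latest.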

