* Validate the installation of Pandas
* Understand NFL Data in HTML
* Analyze HTML Table Data using Pandas
* Limitations of using Pandas HTML APIs
* Integration of BeautifulSoup and Pandas
* Exercise and Solution - Analyze HTML Data using Pandas

* Validate the installation of Pandas

In [None]:
!python -m pip show pandas
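
The same check can also be done from Python instead of through pip. This is a minimal sketch, assuming Python 3.8 or later (where `importlib.metadata` is part of the standard library); `installed_version` is a helper name introduced here for illustration only.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Prints the version string if pandas is installed, otherwise None.
print(installed_version('pandas'))
```

A `None` result suggests that Pandas still needs to be installed in the environment the notebook is running in.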

* Understand NFL Data in HTML

A simple table copied from the NFL Wikipedia page: https://en.wikipedia.org/wiki/National_Football_League

* Analyze HTML Table Data using Pandas

1. Create a Pandas Data Frame from the HTML table data
2. Analyze the data in the Data Frame using appropriate APIs

In [None]:
!pip install lxml # restart the notebook environment

In [None]:
import pandas as pd

In [None]:
file_name = 'nfl_teams.html'

In [None]:
pd.read_html(file_name)[0]
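
`pd.read_html` returns a list with one Data Frame per table it finds, which is why the cell above indexes the result with `[0]`. The sketch below shows this on an inline HTML string. The table content is a made-up placeholder (not the NFL data), and the example assumes `lxml` is installed.

```python
from io import StringIO

import pandas as pd

# Placeholder HTML with two tables (hypothetical content for illustration).
sample_html = """
<table>
  <tr><th>Club</th><th>Division</th></tr>
  <tr><td>Club A</td><td>East</td></tr>
  <tr><td>Club B</td><td>West</td></tr>
</table>
<table>
  <tr><th>Key</th><th>Value</th></tr>
  <tr><td>k1</td><td>1</td></tr>
</table>
"""

# Wrapping the string in StringIO avoids the deprecation warning that
# recent pandas versions emit for literal HTML strings.
tables = pd.read_html(StringIO(sample_html))

print(len(tables))        # number of tables found
print(tables[0].columns)  # column labels taken from the header row
```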

1. Get Division, Club and Head Coach Details
2. Get all the Teams related to East
3. Get Number of Teams per Conference and Division

1. Get Division, Club and Head Coach Details

In [None]:
nfl_df = pd.read_html(file_name)[0]

In [None]:
nfl_df.head()

In [None]:
nfl_df[['Division', 'Club', 'Head Coach']]

2. Get all the Teams related to East

In [None]:
nfl_df.query('Division == "East"')

3. Get Number of Teams per Conference and Division

In [None]:
help(nfl_df.groupby(['Conference', 'Division'])['Club'].agg)

In [None]:
nfl_df. \
    groupby(['Conference', 'Division'])['Club']. \
    agg(club_count='count'). \
    reset_index()
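
The three tasks can also be tried on a small frame where the results are easy to check by eye. This is a sketch under assumptions: the rows are placeholders, the column names mirror the ones used in the cells above, and `size()` is shown as an alternative to the named `agg` call.

```python
import pandas as pd

# Placeholder rows; column names mirror those used in the cells above.
sample_df = pd.DataFrame({
    'Conference': ['AFC', 'AFC', 'AFC', 'NFC'],
    'Division': ['East', 'East', 'West', 'East'],
    'Club': ['Club A', 'Club B', 'Club C', 'Club D'],
    'Head Coach': ['Coach A', 'Coach B', 'Coach C', 'Coach D'],
})

# 1. Project a subset of columns.
details = sample_df[['Division', 'Club', 'Head Coach']]

# 2. Filter rows with a boolean mask (same rows as query('Division == "East"')).
east = sample_df[sample_df['Division'] == 'East']

# 3. Count rows per group; size() counts every row in the group,
#    while count skips rows where the selected column is missing.
club_counts = (
    sample_df
    .groupby(['Conference', 'Division'])
    .size()
    .reset_index(name='club_count')
)
print(club_counts)
```

Wrapping the chain in parentheses is an alternative to the trailing backslashes used above; both forms run the same calls.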

* Limitations of using Pandas HTML APIs

Here are the main limitations of using Pandas to analyze HTML data.

1. The underlying `lxml` parser accepts only the http, ftp and file URL protocols, so passing a secure (https) URL directly to `pd.read_html` might not work.
2. Results might not be consistent when pages contain many tables with varied or irregular markup (like Wiki pages).
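
One possible way to work around both limitations is to download the page with `requests`, which handles https, and then pass the text to `pd.read_html` with its `match` parameter so that only tables containing a given text are returned. This is a sketch, not the only approach: `fetch_html` and `tables_matching` are helper names introduced here, the sample HTML is a placeholder, and `lxml` is assumed to be installed.

```python
from io import StringIO

import pandas as pd
import requests


def fetch_html(url, timeout=10):
    """Download a page over http or https and return its text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 4xx/5xx responses
    return response.text


def tables_matching(html_text, text):
    """Return only the tables whose content matches the given text."""
    return pd.read_html(StringIO(html_text), match=text)


# Placeholder page with two tables; only the second mentions "Division".
sample_html = """
<table><tr><th>Key</th><th>Value</th></tr><tr><td>k1</td><td>1</td></tr></table>
<table><tr><th>Club</th><th>Division</th></tr><tr><td>Club A</td><td>East</td></tr></table>
"""

matched = tables_matching(sample_html, 'Division')
print(len(matched))
```

For a live page the two helpers would be combined as `tables_matching(fetch_html(url), 'Division')`. Note that `pd.read_html` raises a `ValueError` when no table matches.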

* Integration of BeautifulSoup and Pandas

We will process all 100 pages of quotes under https://www.goodreads.com/quotes

1. Parse all 100 pages.
2. Create a list of dicts containing the quote text, author or title, author or title URL, and author or title URL text.
3. Create a Data Frame using Pandas.
4. Analyze the data based on the requirements (e.g. number of quotes per author or title).

In [None]:
urls = []
base_url = 'https://www.goodreads.com/quotes'
for i in range(1, 101):
    urls.append(f'{base_url}?page={i}')
urls

In [None]:
import requests

In [None]:
from bs4 import BeautifulSoup

In [None]:
html_content = requests.get(urls[0]).content

In [None]:
soup = BeautifulSoup(html_content, 'html.parser')

In [None]:
len(soup.find_all('div', attrs={'class': 'quoteText'}))

In [None]:
quoteText = soup.find('div', attrs={'class': 'quoteText'})

In [None]:
quoteText.find(string=True, recursive=False)

In [None]:
quoteText.find('span').find(string=True, recursive=False)

In [None]:
author_or_title_url = None
author_or_title_url_text = None

if quoteText.find('a'):
    author_or_title_url = quoteText.find('a')['href']
    author_or_title_url_text = quoteText.find('a').find(string=True, recursive=False)
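
The extraction steps above can be collected into one function and tried against a small inline snippet, without calling the website. The markup below is only an assumption about how a `quoteText` block is laid out (a text node, then a `span`, then an optional `a` link), so the live page may differ; `parse_quote` is an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating one quoteText block.
sample_html = """
<div class="quoteText">
  Sample quote text.
  <span class="authorOrTitle">Sample Author</span>
  <a href="/work/quotes/1">Sample Title</a>
</div>
"""


def parse_quote(quote_div):
    """Extract quote text, author or title, and link details from one div."""
    text_node = quote_div.find(string=True, recursive=False)
    span = quote_div.find('span')
    link = quote_div.find('a')
    return {
        # strip() removes the whitespace that surrounds the raw text nodes
        'quote_text': text_node.strip() if text_node else None,
        'author_or_title': span.get_text(strip=True) if span else None,
        'author_or_title_url': link['href'] if link else None,
        'author_or_title_url_text': link.get_text(strip=True) if link else None,
    }


soup = BeautifulSoup(sample_html, 'html.parser')
record = parse_quote(soup.find('div', attrs={'class': 'quoteText'}))
print(record)
```

Unlike the cells above, this sketch strips surrounding whitespace from each value, which tends to make later grouping by author or title more reliable.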

In [None]:
import requests
from bs4 import BeautifulSoup

quotes = []
urls = []

base_url = 'https://www.goodreads.com/quotes' 
for i in range(1, 101):
    urls.append(f'{base_url}?page={i}')

for url in urls:
    print(f'Processing: {url}')
    html_content = requests.get(url).content
    soup = BeautifulSoup(html_content, 'html.parser')
    quoteTexts = soup.find_all('div', attrs={'class': 'quoteText'})
    for quoteText in quoteTexts:
        quote_text = quoteText.find(string=True, recursive=False)
        author_or_title = None
        author_or_title_url = None
        author_or_title_url_text = None
        if quoteText.find('span'):
            author_or_title = quoteText.find('span').find(string=True, recursive=False)
        if quoteText.find('a'):
            author_or_title_url = quoteText.find('a')['href']
            author_or_title_url_text = quoteText.find('a').find(string=True, recursive=False)
        quotes.append({
            'quote_text': quote_text,
            'author_or_title': author_or_title,
            'author_or_title_url': author_or_title_url,
            'author_or_title_url_text': author_or_title_url_text
        })
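
The loop above sends 100 requests back to back, with no timeout and no handling of failed downloads. A more defensive fetch could look like the sketch below. The names `fetch_pages` and `delay_seconds` are introduced here for illustration, the one-second pause is an arbitrary choice, and the `get` parameter exists only so the function can be exercised without network access.

```python
import time

import requests


def fetch_pages(urls, get=requests.get, delay_seconds=1.0, timeout=10):
    """Yield (url, content) pairs, skipping pages that fail to download."""
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        try:
            response = get(url, timeout=timeout)
            response.raise_for_status()  # treat 4xx/5xx responses as failures
        except requests.RequestException as error:
            print(f'Skipping {url}: {error}')
            continue
        yield url, response.content
```

In the loop above, `for url, html_content in fetch_pages(urls):` could then replace the direct `requests.get(url).content` call, with the parsing code left unchanged.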

In [None]:
quotes

In [None]:
import pandas as pd

In [None]:
quotes_df = pd.DataFrame(quotes)

In [None]:
quotes_df

* Exercise - Analyze HTML Data using Pandas

1. Use the `quotes_df` created above or create a new one
2. Group the data by `author_or_title` and get the count
3. Ensure the aggregated Data Frame has `author_or_title` and `quote_count` (new column for the count)
4. Sort the data in descending order by `quote_count`

* Solution - Analyze HTML Data using Pandas

In [None]:
quotes_df. \
    groupby(['author_or_title'])['author_or_title']. \
    agg(quote_count='count'). \
    reset_index(). \
    sort_values('quote_count', ascending=False)
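
An alternative sketch of the same aggregation uses `value_counts`, which counts and sorts in descending order in one step. The rows below are placeholders rather than scraped data, and `quotes_sample_df` is a name introduced here so that the real `quotes_df` is left untouched.

```python
import pandas as pd

# Placeholder rows standing in for the scraped quotes.
quotes_sample_df = pd.DataFrame({
    'quote_text': ['q1', 'q2', 'q3', 'q4'],
    'author_or_title': ['Author A', 'Author B', 'Author A', 'Author A'],
})

quote_counts = (
    quotes_sample_df['author_or_title']
    .value_counts()                     # counts per value, largest first
    .rename_axis('author_or_title')     # name the index before resetting it
    .reset_index(name='quote_count')    # turn the counts into a named column
)
print(quote_counts)
```

Like `count`, `value_counts` ignores missing values by default, so quotes without an author or title are left out of both results.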