### EDA Review and Extracting Data From HTML

**OBJECTIVES**

- Review plotting and subplots
- Review datetime properties and methods
- Use `pd.read_html` to extract data from website tables
- Use `bs4` to parse HTML returned by `requests`

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import requests


#### Warm Up

1. Read in the `book_sales.csv` data and make sure to create a datetime index.  
2. Plot the Paperback and Hardcover sales through time.
3. Create a `sns.regplot` of Hardcover vs. Paperback sales.  Do they seem related?
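
As a minimal sketch of the first warm-up step, the cell below builds a datetime index from a tiny inline CSV rather than the real file. The column names `Date`, `Paperback`, and `Hardcover` and all of the values are assumptions made up for illustration; check the actual columns with `.info()` before relying on them.

```python
from io import StringIO

import pandas as pd

# Made-up stand-in for book_sales.csv (column names and values are assumptions)
csv_text = """Date,Paperback,Hardcover
2000-04-01,199,139
2000-04-02,172,128
2000-04-03,111,172
2000-04-04,209,139
"""

# parse_dates converts the column to datetimes; index_col makes it the index
sales = pd.read_csv(StringIO(csv_text), parse_dates=['Date'], index_col='Date')

# A DatetimeIndex exposes datetime properties and methods directly
print(type(sales.index))
print(sales.index.month.tolist())
print(sales.index.day_name().tolist())

# A numeric companion to the regplot: the correlation between the two columns
corr = sales['Paperback'].corr(sales['Hardcover'])
print(corr)

# sales.plot()  # line plot through time; requires matplotlib
```

With a datetime index in place, `sales.plot()` puts time on the x-axis without further work.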

In [None]:
#read in and create datetime index
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/book_sales.csv'

In [None]:
#check the info


In [None]:
#plot over time


In [None]:
#regplot


**PROBLEM**

Use the code below to read in the `us-retail-sales.csv` data, this time creating a datetime index from the `Month` column.

Use the `.resample()` method to determine the monthly average for Building Materials.  Create a line plot of this data using either `seaborn` or `matplotlib`. 
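
To see what `.resample()` does before applying it to the retail data, here is a small sketch on synthetic daily values. The column name `BuildingMaterials` is an assumption about the file, and the numbers are invented. The `'MS'` (month start) alias is used because it behaves the same across pandas versions.

```python
import numpy as np
import pandas as pd

# Synthetic daily data covering January and February 2020 (invented values)
dates = pd.date_range('2020-01-01', periods=60, freq='D')
retail = pd.DataFrame({'BuildingMaterials': np.arange(60, dtype=float)}, index=dates)

# Group the rows into calendar months and average each group
monthly_avg = retail['BuildingMaterials'].resample('MS').mean()
print(monthly_avg)

# monthly_avg.plot()  # line plot of the result; requires matplotlib
```

`.resample()` only works on a datetime-like index, which is why the index is created first.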

In [None]:
#read in dataframe
url = 'https://raw.githubusercontent.com/jfkoehler/nyu_bootcamp_fa23/main/data/us-retail-sales.csv'

In [None]:
#look at information


In [None]:
#resample and plot


**`pandas`**

`pd.melt`

```
Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one
or more columns are identifier variables (`id_vars`), while all other
columns, considered measured variables (`value_vars`), are "unpivoted" to
the row axis, leaving just two non-identifier columns, 'variable' and
'value'.
```
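
A hedged sketch of the wide-to-long reshaping described above, on a small invented frame. The column names `BuildingMaterials` and `Clothing` are assumed to match the retail data, and `category`, `sales`, and `month_number` are names chosen here for illustration.

```python
import pandas as pd

# Small wide frame: one row per month, one column per category (invented values)
wide = pd.DataFrame({
    'Month': pd.to_datetime(['2020-01-01', '2020-02-01', '2020-03-01']),
    'BuildingMaterials': [100, 110, 120],
    'Clothing': [50, 55, 60],
})

# Unpivot: keep Month as the identifier, stack the two measured columns
long = pd.melt(
    wide,
    id_vars='Month',
    value_vars=['BuildingMaterials', 'Clothing'],
    var_name='category',
    value_name='sales',
)

# Add the month number using the .dt accessor on the datetime column
long['month_number'] = long['Month'].dt.month
print(long)

# sns.boxplot(data=long, x='month_number', y='sales', hue='category')  # requires seaborn
```

The long format is what makes the `hue` argument in seaborn convenient: the category lives in one column instead of being spread across several.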

In [None]:
#create a melted dataframe of Building Materials and Clothing


In [None]:
#add a month column


In [None]:
#make a boxplot


### Reading in Data from HTML Tables

Now, we turn to one more approach to accessing data. As we've seen, you may receive `json` or `csv` data when querying a data API. Alternatively, you may receive HTML data where information is contained in tags.  Below, we examine some basic HTML tags and their effects.

```html
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
```

In [None]:
html = '''
<h1>A Heading</h1>
<p>A first paragraph</p>
<p>A second paragraph</p>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
'''

In [None]:
from IPython.display import HTML

In [None]:
HTML(html)

### Making a Request to a URL

Let's begin with some basketball information from basketball-reference.com:

- https://www.basketball-reference.com/wnba

The tables on the page will be picked up (hopefully!) by the `read_html` function in pandas.
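
Before pointing `read_html` at a live site, it can help to try it offline. The sketch below parses a small inline HTML string instead of the basketball page. It assumes an HTML parser such as `lxml` (or `bs4` with `html5lib`) is installed, which `read_html` needs.

```python
from io import StringIO

import pandas as pd

# A small page with one table (same shape as the HTML example earlier)
page = '''
<h1>A Heading</h1>
<table>
  <tr>
    <th>Album</th>
    <th>Rating</th>
  </tr>
  <tr>
    <td>Pink Panther</td>
    <td>10</td>
  </tr>
</table>
'''

# read_html returns a list with one DataFrame per <table> it finds
tables = pd.read_html(StringIO(page))
print(type(tables), len(tables))

# Individual tables are pulled out by position
albums = tables[0]
print(albums)
```

Because the result is a list, indexing (`tables[0]`, `tables[-1]`) is how you reach an individual DataFrame.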

In [None]:
#visit the url below
url = 'https://www.basketball-reference.com/wnba'

In [None]:
#assign the results as data
#read_html


In [None]:
#what kind of object is data?


In [None]:
#first element?


In [None]:
#examine information


In [None]:
#last dataframe?


In [None]:
#plot?


**Example 2**

List of best-selling albums from Wikipedia.

- https://en.wikipedia.org/wiki/List_of_best-selling_albums

In [None]:
url = 'https://en.wikipedia.org/wiki/List_of_best-selling_albums'

In [None]:
#read in the tables


In [None]:
#how many tables?


In [None]:
#look at the fourth table


In [None]:
#try to convert sales to float


In [None]:
#replace and coerce as float
# fourth_table['Claimed sales*'] = fourth_table['Claimed sales*'].replace({'20[disputed – discuss]': 20}).astype('float')

In [None]:
#alternative with string method
#fourth_table['Claimed sales*'].str.replace('[disputed – discuss]', '', regex = False)
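
The two commented approaches above can be tried on a small stand-in Series. The values below are invented, and only the `'20[disputed – discuss]'` entry mirrors the string from the comment; the real table may contain other irregular entries.

```python
import pandas as pd

# Stand-in for the 'Claimed sales*' column (invented values)
claimed = pd.Series(['70', '50', '20[disputed – discuss]'], name='Claimed sales*')

# regex=False treats the brackets as literal characters, not a character class
cleaned = claimed.str.replace('[disputed – discuss]', '', regex=False).astype('float')
print(cleaned)

# Alternative: coerce anything unparseable to NaN and inspect what failed
coerced = pd.to_numeric(claimed, errors='coerce')
print(claimed[coerced.isna()])
```

Coercing first is a quick way to list the problem entries before deciding how to clean them.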

### Scraping the Web for Data

Sometimes the data is not formatted as an `html` table or `pd.read_html` simply doesn't work.  In these situations you can use the `bs4` library and its `BeautifulSoup` object to parse HTML tags and extract information.  First, make sure you have the library installed and can import it below.

In [None]:
# pip install -U bs4

In [None]:
from bs4 import BeautifulSoup
import requests

In [None]:
sample_html = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class = "score">2</i></p>
<p class = "good">This album was great. <strong>Score</strong>: <i class = "score">8</i></p>
'''

In [None]:
# create a soup object


In [None]:
# examine the soup


In [None]:
# find the <p> tags


In [None]:
# find the i tag


In [None]:
# find all the i tags


In [None]:
# find all good paragraphs
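
One possible way to work through the cells above is sketched here on a trimmed copy of `sample_html`. It uses Python's built-in `'html.parser'` so no extra parser is needed; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

reviews_html = '''
<h1>Music Reviews</h1>
<p>This album was awful. <strong>Score</strong>: <i class="score">2</i></p>
<p class="good">This album was great. <strong>Score</strong>: <i class="score">8</i></p>
'''

# Create the soup object with the standard-library parser
soup = BeautifulSoup(reviews_html, 'html.parser')

# find returns the first matching tag; find_all returns every match
first_paragraph = soup.find('p')
first_score = soup.find('i')
all_scores = soup.find_all('i')

# class_ (with the underscore) filters on the HTML class attribute
good_paragraphs = soup.find_all('p', class_='good')

# .text pulls out the text inside a tag
scores = [int(tag.text) for tag in all_scores]
print(first_paragraph.text)
print(scores)
print(len(good_paragraphs))
```

The trailing underscore in `class_` is needed because `class` is a reserved word in Python.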


#### Extracting Data from a URL

1. Make a request.
2. Turn the request into soup!
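
The two steps above can be sketched as a small helper. `fetch_soup` and its `get` parameter are hypothetical names introduced here; passing `get` in makes it possible to try the function without network access. Some sites reject requests that lack browser-like headers, so a real call may need more than this.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, get=requests.get, timeout=10):
    """Request a page and return its HTML parsed as a BeautifulSoup object."""
    # Step 1: make the request
    response = get(url, timeout=timeout)
    # Stop early on 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    # Step 2: turn the response text into soup
    return BeautifulSoup(response.text, 'html.parser')


# Example call (requires network access):
# soup = fetch_soup('https://pitchfork.com/reviews/albums/')
```

Checking the status before parsing avoids silently scraping an error page.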

In [None]:
url = 'https://pitchfork.com/reviews/albums/'

In [None]:
#make a request


In [None]:
#examine the text


In [None]:
#turn it into soup!


### Using Inspect

You can inspect an item's HTML code by right-clicking on the item of interest and selecting **inspect**.  Here, you will see the HTML tags that surround the object of interest.  

For example, when this lesson was written, a recent album review on Pitchfork was *Drake: For All the Dogs*.  Right-clicking on the image of the album cover and choosing inspect showed:

![](images/dogs.png)
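
Once inspect has shown which tag holds the item, attributes can be read from the tag much like dictionary keys. The markup below is hypothetical and is not Pitchfork's actual HTML; the tag names and the `genre` class are assumptions, and the real ones should be read off the inspect panel.

```python
from bs4 import BeautifulSoup

# Hypothetical review markup (not the real page structure)
page_html = '''
<div class="review">
  <img src="https://example.com/covers/album-one.jpg" alt="Album One">
  <a class="genre" href="/genre/rap/">Rap</a>
</div>
<div class="review">
  <img src="https://example.com/covers/album-two.jpg" alt="Album Two">
  <a class="genre" href="/genre/rock/">Rock</a>
</div>
'''

soup = BeautifulSoup(page_html, 'html.parser')

# The first img tag and its attributes as a dictionary
first_img = soup.find('img')
print(first_img.attrs)

# Index a tag like a dictionary to extract a single attribute
print(first_img['src'])

# Collect the image source from every img tag
sources = [img.get('src') for img in soup.find_all('img')]

# Extract the text from every tag carrying the (assumed) genre class
genres = [tag.text for tag in soup.find_all('a', class_='genre')]
print(sources)
print(genres)
```

Using `.get('src')` instead of `['src']` returns `None` rather than raising an error when a tag lacks the attribute.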

In [None]:
#find the img tag


In [None]:
#find all img tags


In [None]:
#explore attributes


In [None]:
#extract source of image url


In [None]:
# extract the genre tags


In [None]:
# extract the text from the genres


**PROBLEM**

Use the URL below for the NPR book review site.  Make a request, turn the response into a soup object, and use the inspect tool to locate the title of each article on the page.  

In [None]:
url = 'https://www.npr.org/sections/book-reviews/'

#### Summary

There are many ways you may get data: a file that somebody shares with you, data obtained through an API, data obtained by scraping and crawling websites, or a database that you connect to.  Now that you have the basics of accessing, cleaning, munging, and visualizing data, it's time to explore a dataset and ask your own questions.