# Getting Data from a Webpage

Open this notebook in [Callysto](https://hub.callysto.ca/jupyter/hub/user-redirect/git-pull?repo=https://github.com/pbeens/Data-Dunkers&branch=main&subPath=Demos/data-from-webpage.ipynb&depth=1) | [Colab](https://githubtocolab.com/pbeens/Data-Dunkers/blob/main/Demos/data-from-webpage.ipynb).

# Lesson Objectives

By the end of this lesson, students will be able to:
- Install necessary libraries for data extraction from webpages, including Pandas and specific HTML parsing libraries like lxml and html5lib.
- Understand and implement the `pd.read_html()` function to load data from HTML tables into a DataFrame.
- Identify and select the appropriate table from a webpage by indexing, recognizing how Pandas indexes tables starting from zero.
- Create and customize a line plot using Plotly Express with data extracted from a webpage, effectively demonstrating the ability to visualize web-based data.
- Navigate common issues such as handling webpages with multiple tables or multi-index column headers.

## Program Setup

Run this first code block if these libraries aren't already installed in your environment. Once the libraries are installed in an environment, you won't need to install them there again. You can skip the block for now, but if you later get an error message saying a library is not installed, come back and run it.

In [None]:
%pip install pandas -q
%pip install plotly -q
%pip install lxml -q
%pip install html5lib -q
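
If you are not sure whether the install step is needed, a quick check like the one below can tell you. This is a minimal sketch that uses only Python's standard library; the `required` list and the `status` dictionary are names chosen for this illustration, and the list assumes the import names `pandas`, `plotly`, `lxml`, and `html5lib`.

```python
import importlib.util

# Top-level import names of the libraries this notebook relies on
# (this list is an assumption made for the illustration)
required = ["pandas", "plotly", "lxml", "html5lib"]

# find_spec() returns None when a top-level module cannot be found
status = {name: importlib.util.find_spec(name) is not None for name in required}

for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

Any library reported as missing can be installed by running the setup block above.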

## Introduction

There are many ways to import data, but the most common sources are the program itself, a CSV (comma-separated values) file, an Excel spreadsheet, a Google Sheet, or a webpage.

So far we have looked at how to get data from within the Jupyter notebook itself, from a CSV file, and from an Excel file.

In this demo, we will get data from a webpage.

## Data from a table on a webpage

As you might imagine, the overall program won't be much different from the earlier ones. Instead of `read_csv()` or `read_excel()`, we use `read_html()`. One important difference is that `read_html()` returns a list containing every table it finds on the page, so we have to tell the program which table we want to use.

Pandas numbers the tables in the order they appear on the webpage, starting with index 0 (zero) for the first table.

Here's our program. Look closely at how the table index number is referenced.

In [None]:
# import plotly.express and pandas
import plotly.express as px
import pandas as pd

# Read the tables on the webpage and keep one as a DataFrame named df
# Note we are using the first table, which is index 0
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/x-y-data.html'
df = pd.read_html(url)[0]  # Index 0 is the first table

# Create the plot
fig = px.line(data_frame=df, 
              x='X', 
              y='Y', 
              title='Data from a table on a webpage')

# Show the plot
fig.show()
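
When a page has more than one table, it can help to look at the whole list before choosing an index. The sketch below uses a small, made-up HTML string (the table contents and the names `html`, `tables`, and `matching` are hypothetical) instead of a real webpage, so it runs without an internet connection. It assumes an HTML parser such as lxml is installed, as in the setup block.

```python
from io import StringIO

import pandas as pd

# A made-up page containing two tables (hypothetical data for illustration)
html = """
<table>
  <thead><tr><th>X</th><th>Y</th></tr></thead>
  <tbody>
    <tr><td>1</td><td>2</td></tr>
    <tr><td>2</td><td>4</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Player</th><th>Points</th></tr></thead>
  <tbody>
    <tr><td>A</td><td>10</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one DataFrame per table found
tables = pd.read_html(StringIO(html))
print(f"Number of tables found: {len(tables)}")

# Print the column names of each table to help pick the right index
for index, table in enumerate(tables):
    print(index, list(table.columns))

# The match argument keeps only the tables containing the given text
matching = pd.read_html(StringIO(html), match="Player")
print(f"Tables containing 'Player': {len(matching)}")
```

Printing the column names for each index is a quick way to find the table you want before plotting it.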

Note: if you have multiple rows for the column headers, look at **Fixing a multi-index** in this [cheat sheet](https://github.com/pbeens/Data-Dunkers/blob/main/cheatsheet.md#fixing-a-multi-index).
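
To see what a multi-index looks like, the sketch below builds a small DataFrame with two rows of column headers and then keeps only the bottom row. The column names and values are made up for illustration, and this is just one possible approach, not necessarily the one the cheat sheet uses.

```python
import pandas as pd

# A made-up DataFrame with two header rows (hypothetical names and values)
df = pd.DataFrame(
    [[5, 11], [7, 15]],
    columns=pd.MultiIndex.from_tuples([("Shooting", "FG"), ("Shooting", "FGA")]),
)
print(df.columns.nlevels)  # number of header rows

# Drop the top header row so each column has a single, simple name
df.columns = df.columns.droplevel(0)
print(list(df.columns))
```

After this step, the columns can be referenced by a single name, such as `x='FG'`, when creating a plot.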

# Exercise

Using the code above as an example, use the data below to plot Pascal Siakam's field goals made over his Raptors career. 

In [None]:
# Setup


# Input
url = 'https://raw.githubusercontent.com/pbeens/Data-Dunkers/main/Data/example.html'


# Process


# Output

---
In our next demonstration we will get our data from a [Google Sheet](https://github.com/pbeens/Data-Dunkers/blob/main/Demos/data-from-google-sheet.ipynb).

---
*Report issues or give us feedback about this notebook [here](https://docs.google.com/forms/d/e/1FAIpQLSdMRX2hPqZyD8-argFJXxB3ABQdLk3aUH1CAfmMEtcFAlWzCw/viewform?usp=pp_url&entry.1771525592=Module%20Resources%20%28the%20Jupyter%20notebooks%2C%20PPTS%20or%20additional%20resources%29&entry.1364186163=Data%20from%20a%20Webpage).*


---
Back to [Lessons](https://github.com/pbeens/Data-Dunkers/blob/main/Lessons.ipynb)