# **10. Input/Output Operations**

# 🌐 6. HTML Tables in Pandas

In [1]:
import pandas as pd

## 1️⃣ What It Does and When to Use It

### ✅ What it does:

Pandas supports reading and writing **HTML tables** using:

* `pd.read_html()` to **extract tables from HTML documents or URLs**.
* `df.to_html()` to **write DataFrames into HTML format**, which can be used in webpages or saved as `.html` files.

By default, `read_html()` tries the **lxml** parser first and falls back to **BeautifulSoup4 with html5lib** if that fails; either way it extracts the document's `<table>` elements.
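Which parser can be used depends on what is installed in the environment. A minimal sketch for checking this up front, using only the standard library; `available_flavors` is a hypothetical helper name, not part of pandas:

```python
from importlib.util import find_spec


def available_flavors():
    """Return the read_html flavors whose dependencies can be imported."""
    flavors = []
    if find_spec("lxml") is not None:
        flavors.append("lxml")
    # The 'bs4' flavor (also spelled 'html5lib') needs both packages.
    if find_spec("bs4") is not None and find_spec("html5lib") is not None:
        flavors.append("bs4")
    return flavors


print(available_flavors())
```

An empty list means `pd.read_html()` will raise an `ImportError` until one of the parser stacks is installed.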

---

### 📌 When to use:

* When scraping or importing tabular data from **webpages**.
* When exporting pandas data for **web display or reporting**.
* When embedding results into **Jupyter Notebooks** or **dashboards** with HTML support.

## 2️⃣ Syntax and Key Parameters

### 🔹 `pd.read_html()`

```python
pd.read_html(io, match='.+', flavor=None, header=None, index_col=None, attrs=None, parse_dates=False, ...)
```

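| Parameter     | Description                                                                                                        |
| ------------- | ------------------------------------------------------------------------------------------------------------------ |
| `io`          | URL, file path, or file-like object; wrap literal HTML in `io.StringIO` (raw HTML strings are deprecated since pandas 2.1) |
| `match`       | String or regex; only tables containing matching text are returned                                                 |
| `flavor`      | Parser to use: `'lxml'`, or `'bs4'` / `'html5lib'` (synonyms); the default tries `lxml`, then `bs4`                |
| `header`      | Row(s) to use as the column names                                                                                  |
| `index_col`   | Column(s) to use as the index                                                                                      |
| `attrs`       | Dict of HTML attributes (like `{'class': 'data'}`) that a table must have to be returned                           |
| `parse_dates` | Whether to parse dates                                                                                             |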

**Returns a list of DataFrames** (one for each table found).
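To illustrate the list return value and the effect of `match` and `attrs`, the sketch below parses an inline HTML snippet instead of a live URL. The HTML content and variable names are made up for the example, and it assumes `lxml` (or `bs4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Two small tables; the content is illustrative only.
html = """
<table class="data">
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Apple</td><td>1.2</td></tr>
  <tr><td>Banana</td><td>0.5</td></tr>
</table>
<table class="other">
  <tr><th>Country</th><th>Year</th></tr>
  <tr><td>Germany</td><td>2025</td></tr>
</table>
"""

# Wrap literal HTML in StringIO; read_html always returns a list.
tables = pd.read_html(StringIO(html))
print(len(tables))

# Keep only tables whose text matches a string or regex.
by_text = pd.read_html(StringIO(html), match="Banana")

# Keep only tables that carry the given HTML attributes.
by_attr = pd.read_html(StringIO(html), attrs={"class": "data"})

print(by_attr[0])
```

Both filters return a one-element list here, so the DataFrame still has to be pulled out with `[0]`.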

---

### 🔹 `df.to_html()`

```python
df.to_html(buf=None, columns=None, col_space=None, header=True,
           index=True, na_rep='NaN', formatters=None, ...)
```

| Parameter | Description                                       |
| --------- | ------------------------------------------------- |
| `buf`     | File path or buffer (default returns HTML string) |
| `columns` | Subset of columns to write                        |
| `header`  | Include column names                              |
| `index`   | Include index column                              |
| `na_rep`  | String representation for missing values          |
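A short sketch of how these parameters interact, on a made-up DataFrame (the `Stock` column and the missing price are assumptions added for illustration):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "Product": ["Apple", "Banana", "Cherry"],
    "Price": [1.2, None, 2.5],   # None becomes NaN
    "Stock": [10, 5, 0],
})

# Write only two columns, drop the index, and show missing values as "n/a".
html_string = df.to_html(columns=["Product", "Price"], index=False, na_rep="n/a")
print(html_string)

# When buf is given, the HTML is written there and the method returns None.
buffer = StringIO()
result = df.to_html(buf=buffer, index=False)
```

Passing a file path as `buf` works the same way as the in-memory buffer shown here.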


## 3️⃣ Examples of Reading/Writing

### 📥 Reading HTML Tables from URL

In [6]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)'
tables = pd.read_html(url, match='GDP', header=0)

tables

[Empty DataFrame
 Columns: [Largest economies in the world by GDP (nominal) in 2025 according to International Monetary Fund estimates[n 1][1]]
 Index: [],
      Country/Territory IMF[1][12] IMF[1][12].1 World Bank[13]  \
 0    Country/Territory   Forecast         Year       Estimate   
 1                World  113795678         2025      111326370   
 2        United States   30507217         2025       29184890   
 3                China   19231705    [n 1]2025       18743803   
 4              Germany    4744804         2025        4659929   
 ..                 ...        ...          ...            ...   
 218           Kiribati        312         2025            308   
 219   Marshall Islands        297         2025            280   
 220              Nauru        169         2025            160   
 221         Montserrat          —            —              —   
 222             Tuvalu         65         2025             62   
 
     World Bank[13].1 United Nations[14] United Na

In [11]:
df = tables[0]

df

Unnamed: 0,Largest economies in the world by GDP (nominal) in 2025 according to International Monetary Fund estimates[n 1][1]


In [10]:
df = tables[1]

df.head()

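|     | Country/Territory | IMF[1][12] | IMF[1][12].1 | World Bank[13] | World Bank[13].1 | United Nations[14] | United Nations[14].1 |
| --- | ----------------- | ---------- | ------------ | -------------- | ---------------- | ------------------ | -------------------- |
| 0   | Country/Territory | Forecast   | Year         | Estimate       | Year             | Estimate           | Year                 |
| 1   | World             | 113795678  | 2025         | 111326370      | 2024             | 100834796          | 2022                 |
| 2   | United States     | 30507217   | 2025         | 29184890       | 2024             | 27720700           | 2023                 |
| 3   | China             | 19231705   | [n 1]2025    | 18743803       | [n 3]2024        | 17794782           | [n 1]2023            |
| 4   | Germany           | 4744804    | 2025         | 4659929        | 2024             | 4525704            | 2023                 |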


In [15]:
# You can also filter tables by CSS class or other HTML attributes:

pd.read_html(url, attrs={'class': 'wikitable'})[0]

Unnamed: 0_level_0,Country/Territory,IMF[1][12],IMF[1][12],World Bank[13],World Bank[13],United Nations[14],United Nations[14]
Unnamed: 0_level_1,Country/Territory,Forecast,Year,Estimate,Year,Estimate,Year
0,World,113795678,2025,111326370,2024,100834796,2022
1,United States,30507217,2025,29184890,2024,27720700,2023
2,China,19231705,[n 1]2025,18743803,[n 3]2024,17794782,[n 1]2023
3,Germany,4744804,2025,4659929,2024,4525704,2023
4,India,4187017,2025,3912686,2024,3575778,2023
...,...,...,...,...,...,...,...
217,Kiribati,312,2025,308,2024,289,2023
218,Marshall Islands,297,2025,280,2024,270,2023
219,Nauru,169,2025,160,2024,176,2023
220,Montserrat,—,—,—,—,80,2023
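The two-row header on this page becomes two-level (MultiIndex) columns, and footnote markers such as `[n 1]` end up inside the values. One possible cleanup is sketched below on a three-row copy of the output above rather than a live download; the `flatten` helper and the flattened column names are assumptions for illustration:

```python
import re

import pandas as pd

# Small stand-in for the parsed table, with two-level columns.
columns = pd.MultiIndex.from_tuples([
    ("Country/Territory", "Country/Territory"),
    ("IMF[1][12]", "Forecast"),
    ("IMF[1][12]", "Year"),
])
df = pd.DataFrame(
    [
        ["United States", "30507217", "2025"],
        ["China", "19231705", "[n 1]2025"],
        ["Montserrat", "—", "—"],
    ],
    columns=columns,
)


def flatten(col):
    """Join the two header levels and drop bracketed footnote markers."""
    top, bottom = col
    top = re.sub(r"\[.*?\]", "", top).strip()
    return top if top == bottom else f"{top} {bottom}"


df.columns = [flatten(c) for c in df.columns]

# Strip footnote markers from the values, then convert to numbers;
# placeholders that are not numeric become NaN.
for name in ["IMF Forecast", "IMF Year"]:
    cleaned = df[name].str.replace(r"\[.*?\]", "", regex=True)
    df[name] = pd.to_numeric(cleaned, errors="coerce")

print(df)
```

Whether to flatten the header or keep the MultiIndex is a matter of taste; flat names are usually easier to work with in later filtering and plotting.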


### 📤 Writing HTML from DataFrame

In [16]:
data = {
    'Product': ['Apple', 'Banana', 'Cherry'],
    'Price': [1.2, 0.5, 2.5]
}
df = pd.DataFrame(data)

# Save as an HTML file (the target directory must already exist)
df.to_html('data files/html/products.html', index=False)

# Or get the HTML string
html_string = df.to_html(index=False)
print(html_string)

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th>Product</th>
      <th>Price</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Apple</td>
      <td>1.2</td>
    </tr>
    <tr>
      <td>Banana</td>
      <td>0.5</td>
    </tr>
    <tr>
      <td>Cherry</td>
      <td>2.5</td>
    </tr>
  </tbody>
</table>


## 4️⃣ Common Pitfalls

| Pitfall                            | Explanation & Fix                                                                                                        |
| ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------ |
| ❌ **Requires parser dependencies** | Install **`lxml`**, or both **`beautifulsoup4`** and **`html5lib`**; `bs4` alone is not enough                           |
| ❌ **Multiple tables returned**     | `read_html()` returns a **list of DataFrames**, not a single one; index into it (e.g. `tables[1]`)                       |
| ❌ **Incorrect parsing**            | Complex or malformed HTML may parse badly or pick up layout tables; use `match` or `attrs` to target the table you want  |
| ❌ **Performance issues**           | Parsing large, complex pages can be slow; save the page once and parse the local copy instead of re-downloading it       |
| ❌ **Missing headers or indexes**   | May need to pass `header=0` or `index_col=...` manually; multi-row headers produce MultiIndex columns                    |
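For the "multiple tables returned" pitfall, one option is to select the table by its content instead of a hard-coded position. A minimal sketch, assuming the wanted table can be recognised by a column name; `pick_table` is a hypothetical helper, and the two DataFrames stand in for a real `read_html()` result:

```python
import pandas as pd


def pick_table(tables, required_column):
    """Return the first non-empty DataFrame that has `required_column`, else None."""
    for table in tables:
        if not table.empty and required_column in table.columns:
            return table
    return None


# Stand-in for the list returned by read_html: an empty caption-like
# table first, then the table of interest.
tables = [
    pd.DataFrame(columns=["Largest economies in the world by GDP (nominal)"]),
    pd.DataFrame({
        "Country/Territory": ["World", "United States"],
        "Forecast": [113795678, 30507217],
    }),
]

gdp = pick_table(tables, "Country/Territory")
print(gdp)
```

This keeps the code working if the page later gains or loses a table ahead of the one being read.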


## 5️⃣ Real-World Usage

### 🌍 Web Scraping

* Extract economic indicators, sports statistics, or stock tables from sites such as Wikipedia or Yahoo Finance.

### 📈 Report Automation

* Convert DataFrames into HTML tables (optionally tagged with CSS classes via the `classes` parameter) for automated dashboards and email reports.

### 🧑‍💻 Web-based UIs

* Use `df.to_html()` to render backend pandas data into web templates (e.g., Flask, Django).
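A framework-free sketch of the idea, using the standard library's `string.Template` as a stand-in for a Flask or Django template; the page layout, the CSS class names, and the sample data are assumptions:

```python
from string import Template

import pandas as pd

df = pd.DataFrame({"Product": ["Apple & Pear", "Banana"], "Price": [1.2, 0.5]})

# escape=True (the default) converts &, < and > in cell text,
# so data values cannot inject markup into the page.
table_html = df.to_html(index=False, classes="table table-striped", border=0)

page = Template("""<!DOCTYPE html>
<html>
  <body>
    <h1>$title</h1>
    $table
  </body>
</html>""").substitute(title="Products", table=table_html)

print(page)
```

In template engines that auto-escape output, the table string usually has to be marked as safe HTML, otherwise the tags themselves are escaped a second time.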

### 🧪 Data Exploration in Notebooks

* Jupyter already renders DataFrames as HTML tables; use `to_html()` with `IPython.display.HTML` when you need control over the generated markup.

## ✅ Summary Table

| Task                      | Method                                |
| ------------------------- | ------------------------------------- |
| Read HTML table           | `pd.read_html()`                      |
| Write to HTML             | `df.to_html()`                        |
| Parser needed for reading | Yes (`lxml`, or `bs4` + `html5lib`)   |
| Output type from read     | List of DataFrames                    |
| Output type from write    | HTML string or file                   |

