# Reading Data from Various Data Sources using Pandas

📌 **Objective:**  
Learn how to read data from different file formats and sources into a Pandas DataFrame —  
including CSV, Excel, JSON, SQL, HTML, and APIs.

---

## 🔹 1. The Universal Loader — `pd.read_*()`

Pandas provides a powerful set of functions prefixed with `read_`  
for importing data from almost any source.

\[
\text{DataFrame} = pd.read\_*(\text{source})
\]

Here are some of the most common ones:

| Format | Function | Example |
|:--------|:-----------|:--------------------------------|
| CSV | `pd.read_csv()` | `'data.csv'` |
| Excel | `pd.read_excel()` | `'data.xlsx'` |
| JSON | `pd.read_json()` | `'data.json'` |
| SQL | `pd.read_sql()` | `'SELECT * FROM table'` plus a database connection |
| HTML | `pd.read_html()` | `'https://example.com'` |
| Clipboard | `pd.read_clipboard()` | copy–paste data |
| Parquet | `pd.read_parquet()` | `'data.parquet'` |

---

## 🔹 2. Reading CSV Files

CSV (Comma-Separated Values) is the most common format for structured data.


In [84]:
# 📦 Import pandas and StringIO (StringIO wraps literal JSON/HTML strings as file-like objects)

import pandas as pd
from io import StringIO

💡 Converting JSON into a DataFrame


# 🔹 Reading JSON data

**Purpose:** Load JSON files or API responses and normalize nested structures with `pd.json_normalize()`.


In [85]:
# 🧾 Read structured data from a JSON file or API response

data = '{"employee_name": "James", "email": "james@gmail.com", "job_profile": [{"title1":"Team Lead", "title2":"Sr. Developer"}]}'
df = pd.read_json(StringIO(data))  # Load JSON into DataFrame

display(df)

Unnamed: 0,employee_name,email,job_profile
0,James,james@gmail.com,"{'title1': 'Team Lead', 'title2': 'Sr. Develop..."
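
In the output above, `job_profile` is still a nested dictionary inside a single cell. The following is a minimal sketch of how `pd.json_normalize()` could flatten it, assuming the JSON string is first parsed with the standard `json` module. The variable names `record` and `flat_df` are illustrative and do not come from the original notebook.

```python
import json
import pandas as pd

# Same record as in the cell above
data = '{"employee_name": "James", "email": "james@gmail.com", "job_profile": [{"title1": "Team Lead", "title2": "Sr. Developer"}]}'
record = json.loads(data)  # parse the JSON text into a Python dict

# Expand each item of the nested "job_profile" list into its own row
# and carry the top-level fields along as extra columns
flat_df = pd.json_normalize(
    record,
    record_path="job_profile",
    meta=["employee_name", "email"],
)
print(flat_df)
```

Each key of the nested objects (`title1`, `title2`) becomes a regular column, which is usually easier to filter and join than a column of dictionaries.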


💡 Converting a DataFrame to JSON


In [86]:
# 🔄 Serialize the DataFrame to a JSON string (default orient='columns')

display(df.to_json())

'{"employee_name":{"0":"James"},"email":{"0":"james@gmail.com"},"job_profile":{"0":{"title1":"Team Lead","title2":"Sr. Developer"}}}'

💡 Converting a DataFrame to JSON (with orient as `index`)

In [87]:
# 🔄 orient='index': one JSON object per row, keyed by the row's index label

display(df.to_json(orient='index'))

'{"0":{"employee_name":"James","email":"james@gmail.com","job_profile":{"title1":"Team Lead","title2":"Sr. Developer"}}}'

💡 Converting a DataFrame to JSON (with orient as `records`)

In [88]:
# 🔄 orient='records': a JSON array with one object per row (index labels are dropped)

display(df.to_json(orient='records'))

'[{"employee_name":"James","email":"james@gmail.com","job_profile":{"title1":"Team Lead","title2":"Sr. Developer"}}]'
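
The `records` layout can also be read back in. Below is a small round-trip sketch, assuming a flat DataFrame with no nested values. The frame `people` reuses two rows of the Excel example later in this notebook, and the variable names are illustrative.

```python
from io import StringIO
import pandas as pd

# Small flat DataFrame used only for this sketch
people = pd.DataFrame({"Name": ["Sundaram", "Indra"], "Age": [67, 61]})

# Write the rows as a JSON array of objects
records_json = people.to_json(orient="records")
print(records_json)

# Read the same string back; wrap it in StringIO because it is literal JSON, not a path
restored = pd.read_json(StringIO(records_json), orient="records")
print(restored)
```

Because `orient='records'` does not store index labels, the restored frame gets a fresh default index.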

💡 `pd.read_csv()` can also read directly from a URL, provided the data at that URL is comma-separated

# 🔹 Reading CSV files

**Purpose:** Load CSV files into a DataFrame using `pd.read_csv()` with common options.


In [89]:
# 🧠 Load data from a CSV file into a DataFrame and inspect the first few rows

df=pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",header=None)  # Read CSV file into a pandas DataFrame

display(df.head())

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,1,14.23,1.71,2.43,15.6,127,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065
1,1,13.2,1.78,2.14,11.2,100,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050
2,1,13.16,2.36,2.67,18.6,101,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185
3,1,14.37,1.95,2.5,16.8,113,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480
4,1,13.24,2.59,2.87,21.0,118,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735
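
Because the file has no header row, the columns above are simply numbered. One common option is to supply names through the `names` parameter. The sketch below works on an in-memory sample built from the first three fields of the first three rows shown above; the column names are placeholders invented for illustration, not taken from the dataset's documentation.

```python
from io import StringIO
import pandas as pd

# First three fields of the first three rows from the output above
sample = StringIO("1,14.23,1.71\n1,13.2,1.78\n1,13.16,2.36\n")

# header=None: the first line is data, not column names
# names=...: placeholder labels chosen for this sketch
wine_sample = pd.read_csv(
    sample,
    header=None,
    names=["class_label", "feature_1", "feature_2"],
)
print(wine_sample)
print(wine_sample.dtypes)
```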


💡 Converting this DataFrame back to a CSV file

# 🔹 Writing / Exporting data

**Purpose:** Export DataFrames to CSV/Excel/SQL using `to_csv()`, `to_excel()`, and `to_sql()`.


In [90]:
# 📝 Export DataFrame to a CSV file for reporting or sharing

df.to_csv('wine.csv')  # Export DataFrame to CSV
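
By default `to_csv()` also writes the index, which shows up as an extra `Unnamed: 0` column when the file is read back. The sketch below illustrates the difference with `index=False`, using an in-memory string instead of a file; the small frame `df_small` is illustrative.

```python
from io import StringIO
import pandas as pd

df_small = pd.DataFrame({"Name": ["Sundaram", "Indra"], "Age": [67, 61]})

# Default: the index is written as an unnamed first column
with_index = pd.read_csv(StringIO(df_small.to_csv()))
print(with_index.columns.tolist())

# index=False: only the data columns are written
without_index = pd.read_csv(StringIO(df_small.to_csv(index=False)))
print(without_index.columns.tolist())
```

Calling `to_csv()` without a path returns the CSV text as a string, which is convenient for quick checks like this one.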

In [91]:
# 📦 Install the HTML parsing libraries that `pd.read_html()` relies on

!pip install lxml
!pip install html5lib
!pip install beautifulsoup4



💡 Reading data from a URL


# 🔹 Reading HTML tables from the web

**Purpose:** Extract `<table>` elements from webpages using `pd.read_html()`; use `requests` + `StringIO` when needed.


In [92]:
# 🌐 Extract tabular data from an HTML webpage using `pd.read_html()`

url="https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/"

fdic_df = pd.read_html(url)  # Parse HTML and extract tables as DataFrames
display(fdic_df)

[                                            Bank Name                City  \
 0                        The Santa Anna National Bank          Santa Anna   
 1                                Pulaski Savings Bank             Chicago   
 2                  The First National Bank of Lindsay             Lindsay   
 3               Republic First Bank dba Republic Bank        Philadelphia   
 4                                       Citizens Bank            Sac City   
 5                            Heartland Tri-State Bank             Elkhart   
 6                                 First Republic Bank       San Francisco   
 7                                      Signature Bank            New York   
 8                                 Silicon Valley Bank         Santa Clara   
 9                                   Almena State Bank              Almena   
 10                         First City Bank of Florida   Fort Walton Beach   
 11                               The First State Bank       Bar

In [93]:
# 📋 `pd.read_html()` returns a list of DataFrames; select the first table

fdic_df[0]

Unnamed: 0,Bank Name,City,State,Cert,Acquiring Institution,Closing Date,Fund Sort ascending
0,The Santa Anna National Bank,Santa Anna,Texas,5520,Coleman County State Bank,"June 27, 2025",10549
1,Pulaski Savings Bank,Chicago,Illinois,28611,Millennium Bank,"January 17, 2025",10548
2,The First National Bank of Lindsay,Lindsay,Oklahoma,4134,First Bank & Trust Co.,"October 18, 2024",10547
3,Republic First Bank dba Republic Bank,Philadelphia,Pennsylvania,27332,"Fulton Bank, National Association","April 26, 2024",10546
4,Citizens Bank,Sac City,Iowa,8758,Iowa Trust & Savings Bank,"November 3, 2023",10545
5,Heartland Tri-State Bank,Elkhart,Kansas,25851,"Dream First Bank, N.A.","July 28, 2023",10544
6,First Republic Bank,San Francisco,California,59017,"JPMorgan Chase Bank, N.A.","May 1, 2023",10543
7,Signature Bank,New York,New York,57053,"Flagstar Bank, N.A.","March 12, 2023",10540
8,Silicon Valley Bank,Santa Clara,California,24735,First Citizens Bank & Trust Company,"March 10, 2023",10539
9,Almena State Bank,Almena,Kansas,15426,Equity Bank,"October 23, 2020",10538
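
The `Closing Date` values above appear to come back as plain text such as `"March 12, 2023"`. Assuming that is the case, a sketch of converting them to real datetimes so they can be sorted or filtered might look like this. The three rows are copied from the table above, and the names `banks` and `earliest_first` are illustrative.

```python
import pandas as pd

# Three rows copied from the table above
banks = pd.DataFrame({
    "Bank Name": ["First Republic Bank", "Signature Bank", "Silicon Valley Bank"],
    "Closing Date": ["May 1, 2023", "March 12, 2023", "March 10, 2023"],
})

# Parse "Month day, Year" text into datetime64 values
banks["Closing Date"] = pd.to_datetime(banks["Closing Date"], format="%B %d, %Y")

# Dates now sort chronologically rather than alphabetically
earliest_first = banks.sort_values("Closing Date")
print(earliest_first)
```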


In [94]:
# 📦 Install `requests` for making HTTP calls with custom headers

!pip install requests




In [95]:
# 🌐 Extract tabular data from an HTML webpage using `pd.read_html()`

import requests

# URL of the page
url = "https://en.wikipedia.org/wiki/Mobile_country_code"

# Add custom headers to mimic a real browser
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/605.1.15 (KHTML, like Gecko) "
                  "Version/17.0 Safari/605.1.15"
}

# Fetch the HTML content
response = requests.get(url, headers=headers)  # Perform HTTP GET request (include headers to avoid 403)
response.raise_for_status()  # Raise error if request fails

# Parse HTML with pandas
tables = pd.read_html(response.text, match="Country", header=0)  # Parse HTML and extract tables as DataFrames

# Display first table
df = tables[0]
display(df.head())


Unnamed: 0,Mobile country code,Country,ISO 3166,Mobile network codes,National MNC authority,Remarks
0,289,A Abkhazia,GE-AB,List of mobile network codes in Abkhazia,,MCC is not listed by ITU
1,412,Afghanistan,AF,List of mobile network codes in Afghanistan,,
2,276,Albania,AL,List of mobile network codes in Albania,,
3,603,Algeria,DZ,List of mobile network codes in Algeria,,
4,544,American Samoa (United States of America),AS,List of mobile network codes in American Samoa,,


### ✅ Explanation
| Step | Action                     | Purpose                     |
| :--- | :------------------------- | :-------------------------- |
| 1️⃣  | Use `requests.get()`       | Manual request with control |
| 2️⃣  | Add `User-Agent` header    | Pretend to be a browser     |
| 3️⃣  | `response.text`            | Get HTML as a string        |
| 4️⃣  | `pd.read_html()`           | Parse HTML content directly |
| 5️⃣  | Optional `match="Country"` | Filter the right table      |


🧠 Extra Tip: Identify All Tables

If you’re unsure which table to extract:


In [96]:
# 🌐 Extract tabular data from an HTML webpage using `pd.read_html()`

for i, table in enumerate(pd.read_html(response.text)):  # Parse HTML and extract tables as DataFrames
    print(f"🔹 Table {i} shape:", table.shape)

🔹 Table 0 shape: (4, 7)
🔹 Table 1 shape: (252, 6)
🔹 Table 2 shape: (104, 7)
🔹 Table 3 shape: (1, 7)
🔹 Table 4 shape: (10, 2)



⚠️ The `pd.read_html(response.text, ...)` calls above raise a **FutureWarning**: Passing literal html to 'read_html' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.


In [97]:
# 🌐 Extract tabular data from an HTML webpage using `pd.read_html()`

import requests
from io import StringIO   # ✅ New import for wrapping HTML string

# Target URL
url = "https://en.wikipedia.org/wiki/Mobile_country_code"

# Send a browser-like User-Agent (Wikipedia may reject requests that lack one)
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
        "Version/17.0 Safari/605.1.15"
    )
}

# Fetch the HTML content safely
response = requests.get(url, headers=headers)  # Perform HTTP GET request (include headers to avoid 403)
response.raise_for_status()  # Raise error if the request fails

# ✅ Wrap the HTML content in a StringIO object to avoid the FutureWarning
html_content = StringIO(response.text)

# Read HTML tables from the page
tables = pd.read_html(html_content, match="Country", header=0)  # Parse HTML and extract tables as DataFrames

# Extract the main table
df = tables[0]

# Display preview
display(df.head())

Unnamed: 0,Mobile country code,Country,ISO 3166,Mobile network codes,National MNC authority,Remarks
0,289,A Abkhazia,GE-AB,List of mobile network codes in Abkhazia,,MCC is not listed by ITU
1,412,Afghanistan,AF,List of mobile network codes in Afghanistan,,
2,276,Albania,AL,List of mobile network codes in Albania,,
3,603,Algeria,DZ,List of mobile network codes in Algeria,,
4,544,American Samoa (United States of America),AS,List of mobile network codes in American Samoa,,


# 🧩 Inspecting Multiple Tables from a Webpage

📌 **Objective:**  
When a webpage contains several `<table>` elements,  
`pd.read_html()` returns **a list of DataFrames** — one for each detected table.

We can iterate through this list to inspect them all  
and choose the one we actually need.



In [98]:
# 🌐 Extract tabular data from an HTML webpage using `pd.read_html()`

url = "https://en.wikipedia.org/wiki/Mobile_country_code"

headers = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
        "Version/17.0 Safari/605.1.15"
    )
}

# Fetch HTML content
response = requests.get(url, headers=headers)  # Perform HTTP GET request (include headers to avoid 403)
response.raise_for_status()  # Raise error if the request fails
html_content = StringIO(response.text)

# Read all tables from the page
tables = pd.read_html(html_content)  # Parse HTML and extract tables as DataFrames

# Iterate through all detected tables
for i, table in enumerate(tables):
    print(f"🔹 Table {i}: shape = {table.shape}")

🔹 Table 0: shape = (4, 7)
🔹 Table 1: shape = (252, 6)
🔹 Table 2: shape = (104, 7)
🔹 Table 3: shape = (1, 7)
🔹 Table 4: shape = (10, 2)
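
Shapes alone can be ambiguous, and table positions may shift when a page is edited. One alternative is to pick the table by the columns it contains. The helper `find_table` below is a hypothetical function written for this sketch, not part of pandas, and `demo_tables` is a stand-in for the `tables` list returned by `pd.read_html()`.

```python
import pandas as pd

def find_table(tables, required_columns):
    """Return the first DataFrame whose columns include every required name."""
    for table in tables:
        if set(required_columns).issubset(table.columns):
            return table
    raise ValueError(f"No table contains columns {required_columns}")

# Stand-in list; in the notebook this would be the list from pd.read_html()
demo_tables = [
    pd.DataFrame({"Note": ["test network"]}),
    pd.DataFrame({
        "Mobile country code": [412, 276],
        "Country": ["Afghanistan", "Albania"],
    }),
]

mcc_table = find_table(demo_tables, ["Mobile country code", "Country"])
print(mcc_table)
```

Raising an error when nothing matches makes a layout change on the page fail loudly instead of silently returning the wrong table.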


In [99]:
# 📦 Install `openpyxl`, the engine pandas uses to read and write .xlsx files

!pip install openpyxl



# 🔹 Reading Excel files

**Purpose:** Load Excel sheets into DataFrames using `pd.read_excel()` and handle `sheet_name`.


In [100]:
# 📘 Read data from an Excel sheet using `pd.read_excel()` and preview results

# sheet_name is optional; when omitted, the first sheet is read
df_excel = pd.read_excel('Book1.xlsx', sheet_name='Sheet1')  # Read Excel file or sheet into DataFrame
display(df_excel)

Unnamed: 0,Name,Age
0,Prasanna Sundaram,35
1,Sundaram,67
2,Indra,61


## What is a Pickle File?

Pickle is a **binary file format** used to serialize (save) Python objects so they can be  
easily loaded later without losing structure, data types, or indexes.

Unlike CSV or Excel:
- It **preserves data types**
- It’s **faster** to read/write
- It’s **not human-readable** (binary format)

$$
\text{Save} \rightarrow \text{Serialize (Pickle)} \qquad \text{Load} \rightarrow \text{Deserialize (Unpickle)}
$$

---

# 🔹 Pickle (serialize/deserialize DataFrames)

**Purpose:** Save and load DataFrames quickly with `to_pickle()` / `pd.read_pickle()`. Security caveat: only unpickle files that come from a source you trust.


In [101]:
# 💾 Save DataFrame to a Pickle file for fast reload

df_excel.to_pickle('df_excel.pkl')  # Serialize DataFrame to pickle file for fast I/O

💡 Running this cell creates a pickle file named `df_excel.pkl` in the current working directory


In [102]:
# 📂 Load a serialized DataFrame from a Pickle file

pd.read_pickle('df_excel.pkl')  # Load DataFrame from pickle file

Unnamed: 0,Name,Age
0,Prasanna Sundaram,35
1,Sundaram,67
2,Indra,61


Pickle is often cited as roughly **2x–10x faster** than CSV for saving and loading large DataFrames, though the actual speedup depends on the data and the hardware.  
It also **preserves dtypes**, meaning integers remain integers, datetimes remain datetimes, etc.
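
To make the dtype point concrete, here is a small sketch that round-trips the same frame through Pickle and through CSV text. The frame `events` and its column names are made up for illustration, and a temporary directory is used so that no files are left behind.

```python
import os
import tempfile
from io import StringIO
import pandas as pd

# Illustrative frame with an integer column and a datetime column
events = pd.DataFrame({
    "id": [1, 2],
    "when": pd.to_datetime(["2023-03-10", "2023-03-12"]),
})

# Pickle round trip: written to and read from a temporary file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.pkl")
    events.to_pickle(path)
    from_pickle = pd.read_pickle(path)

# CSV round trip: dates are written as text and are not parsed back by default
from_csv = pd.read_csv(StringIO(events.to_csv(index=False)))

print(from_pickle.dtypes)
print(from_csv.dtypes)
```

With CSV, the datetime column comes back as text unless `parse_dates` is passed to `pd.read_csv()`; with Pickle, no extra arguments are needed.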


## 🔹 Notes & Best Practices

| ⚙️ Scenario | Recommendation |
|:-------------|:----------------|
| Large data with frequent reloading | ✅ Use Pickle |
| Need human-readable file | ❌ Avoid Pickle → Use CSV or Excel |
| Cross-language sharing | ❌ Avoid Pickle (Python-only format) |
| Quick local caching in analysis | ✅ Perfect use case |
| Security | ⚠️ Never unpickle data from untrusted sources |

---

## 🔹 Alternative: Compressed Pickle Files

You can save disk space by compressing the pickle file (gzip, bz2, zip, etc.).


In [103]:
# 💾 Save DataFrame to a Pickle file for fast reload

# Save pickle with gzip compression
df_excel.to_pickle('df_excel.pkl.gz', compression='gzip')  # Serialize DataFrame to pickle file for fast I/O

# Load compressed pickle
df_gzip = pd.read_pickle('df_excel.pkl.gz', compression='gzip')  # Load DataFrame from pickle file
display(df_gzip)

Unnamed: 0,Name,Age
0,Prasanna Sundaram,35
1,Sundaram,67
2,Indra,61


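✅ **Output:** Same as the original DataFrame. The compressed file is typically smaller, with the savings depending on the size and repetitiveness of the data.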
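
To see the effect on file size, the two files can be compared directly. The sketch below uses a synthetic, deliberately repetitive DataFrame named `big` (not from the original notebook), since a three-row frame is too small to show a meaningful difference.

```python
import os
import tempfile
import pandas as pd

# Synthetic, highly repetitive data chosen so that compression has something to work with
big = pd.DataFrame({
    "state": ["Kansas", "Texas"] * 5000,
    "value": list(range(10000)),
})

with tempfile.TemporaryDirectory() as tmp:
    plain_path = os.path.join(tmp, "big.pkl")
    gzip_path = os.path.join(tmp, "big.pkl.gz")

    big.to_pickle(plain_path)
    big.to_pickle(gzip_path, compression="gzip")

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # The compressed file still loads back to the same DataFrame
    reloaded = pd.read_pickle(gzip_path, compression="gzip")

print(f"plain: {plain_size} bytes, gzip: {gzip_size} bytes")
```

The trade-off is extra CPU time for compressing and decompressing, so compression tends to pay off for files that are stored or transferred more often than they are read.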

---

## 🧾 Summary Table

| Operation | Method | Description |
|:------------|:----------|:-------------|
| Save as Pickle | `df.to_pickle('file.pkl')` | Serializes DataFrame |
| Load Pickle | `pd.read_pickle('file.pkl')` | Deserializes DataFrame |
| Compressed Pickle | `compression='gzip'` | Optional compression |
| Speed vs. CSV | Pickle is usually faster | Binary format that preserves dtypes |

---

## ✅ Quick Recap

🔹 `to_pickle()` saves your DataFrame in Python’s binary format  
🔹 `pd.read_pickle()` loads it back with dtypes and index intact  
🔹 Usually much faster than CSV or Excel for large datasets  
🔹 Best for intermediate caching or local analysis  
🔹 Don’t use Pickle to share data across languages, and never load pickle files from untrusted sources
