### 1. Reading a CSV File
We begin by importing **Pandas** and loading the housing dataset with `pd.read_csv("house_data.csv")`.

- The dataset is now stored in a **DataFrame** (`df`).  
- `df.head()` shows the first 5 rows, giving a quick preview of the data.  
- Each row = one house record, and each column = one attribute (e.g., price, bedrooms, bathrooms, sqft_living).


In [2]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("house_data.csv")

print(df.head())  # display first 5 rows


           id             date   price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900         3       1.00         1180   
1  6414100192  20141209T000000  538000         3       2.25         2570   
2  5631500400  20150225T000000  180000         2       1.00          770   
3  2487200875  20141209T000000  604000         4       3.00         1960   
4  1954400510  20150218T000000  510000         3       2.00         1680   

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0   
1      7242     2.0           0     0  ...      7        2170            400   
2     10000     1.0           0     0  ...      6         770              0   
3      5000     1.0           0     0  ...      7        1050            910   
4      8080     1.0           0     0  ...      8        1680              0   

   yr_built  yr_renovated  zipcode      lat     long  sqft_liv

### 2. Extracting the Contents of a CSV File
After loading, we explore the dataset:

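- `dataframe.columns` → lists all column names (e.g., price, bedrooms, sqft_living); adding `.tolist()` returns them as a plain Python list.  
- `dataframe.shape` → shows the dataset dimensions as `(rows, columns)`.  
- `print(dataframe)` → prints the DataFrame; for a large dataset Pandas shows only the first and last rows.  
- `dataframe.describe()` → provides descriptive statistics (count, mean, std, min, quartiles, max) for numerical columns.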

This helps us **understand the structure** and **summarize the dataset** before deeper analysis.
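
The inspection calls above can be tried on a tiny stand-in dataset. This is only a sketch: `csv_text` and `sample` are illustrative names, and the three rows are copied from the preview output rather than read from `house_data.csv`.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for house_data.csv (values taken from the preview rows).
csv_text = """price,bedrooms,bathrooms,sqft_living
221900,3,1.00,1180
538000,3,2.25,2570
180000,2,1.00,770
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.columns.tolist())  # column names as a plain Python list
print(sample.shape)             # (rows, columns)
print(sample.describe())        # count, mean, std, min, quartiles, max per numeric column
```

`describe()` returns a DataFrame itself, so a single statistic can be picked out with `.loc`, for example `sample.describe().loc["mean", "price"]`.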


In [4]:
import pandas as pd

# Extract CSV content
dataframe = pd.read_csv("house_data.csv")

print("Columns:", dataframe.columns)
print("Shape:", dataframe.shape)   # rows × columns
print(dataframe)   # large DataFrames are printed in truncated form (first and last rows)


Columns: Index(['id', 'date', 'price', 'bedrooms', 'bathrooms', 'sqft_living',
       'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade',
       'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated', 'zipcode',
       'lat', 'long', 'sqft_living15', 'sqft_lot15'],
      dtype='object')
Shape: (21614, 21)
               id             date   price  bedrooms  bathrooms  sqft_living  \
0      7129300520  20141013T000000  221900         3       1.00         1180   
1      6414100192  20141209T000000  538000         3       2.25         2570   
2      5631500400  20150225T000000  180000         2       1.00          770   
3      2487200875  20141209T000000  604000         4       3.00         1960   
4      1954400510  20150218T000000  510000         3       2.00         1680   
...           ...              ...     ...       ...        ...          ...   
21609  6600060120  20150223T000000  400000         4       2.50         2310   
21610  1523300141  20140623T000000  

### 3. Appending Data to a CSV
We can add new rows (new house records) to the existing CSV file:

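- Create a small **DataFrame (`new_data`)** with one new record. Its columns must appear in the same order as the columns in the file, because no header is written to match them up.  
- Call `to_csv()` with:
  - `mode="a"` → append mode (does not overwrite the file).  
  - `header=False` → prevents the column headers from being written again.  
  - `index=False` → avoids writing the row index.

This adds the new record as one more row at the end of the file. The existing file should end with a newline; otherwise the new row is joined onto the last line.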


In [6]:
# New sample house data to append
new_data = pd.DataFrame({
    "id": [9999999999],
    "date": ["20250909T000000"],
    "price": [750000],
    "bedrooms": [4],
    "bathrooms": [3],
    "sqft_living": [2200],
    "sqft_lot": [8000],
    "floors": [2],
    "waterfront": [0],
    "view": [0],
    "condition": [3],  # sample value; the file has a condition column between view and grade
    "grade": [8],
    "sqft_above": [2200],
    "sqft_basement": [0],
    "yr_built": [2005],
    "yr_renovated": [0],
    "zipcode": [98178],
    "lat": [47.51],
    "long": [-122.25],
    "sqft_living15": [2100],
    "sqft_lot15": [7800]
})

# Append to house_data.csv (no header, no index)
new_data.to_csv("house_data.csv", mode="a", header=False, index=False)
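
Since `header=False` means the row is written purely by position, one way to guard against a column mismatch is to read the header already on disk and reorder the new record to match it. The following is a minimal sketch under stated assumptions: it works on a hypothetical three-column file in a temporary directory (`mini_house_data.csv`), not on the real dataset, and the names `new_row`, `existing_columns`, and `aligned` are illustrative.

```python
import os
import tempfile
import pandas as pd

# Hypothetical miniature file so the sketch does not touch house_data.csv.
path = os.path.join(tempfile.mkdtemp(), "mini_house_data.csv")
pd.DataFrame(
    {"id": [1, 2], "price": [221900, 538000], "bedrooms": [3, 3]}
).to_csv(path, index=False)

# New record with its columns in a different order and one column missing.
new_row = pd.DataFrame({"bedrooms": [4], "id": [3]})

# nrows=0 reads only the header, giving the column order already on disk.
existing_columns = pd.read_csv(path, nrows=0).columns

# Reorder to match the file; columns absent from new_row become NaN (empty fields).
aligned = new_row.reindex(columns=existing_columns)
aligned.to_csv(path, mode="a", header=False, index=False)

result = pd.read_csv(path)
print(result)
```

The missing `price` value is written as an empty field and read back as `NaN`, which keeps every other value in its correct column.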


### 4. Reading a CSV Chunk-by-Chunk

Sometimes CSV files are **too large** to load entirely into memory.  
In such cases, we use **`chunksize`** in Pandas to process the file in smaller blocks.


In [8]:
chunk_iter = pd.read_csv("house_data.csv", chunksize=5000)

for i, chunk in enumerate(chunk_iter):
    print(f"Chunk {i+1} → Shape: {chunk.shape}")
    print(chunk.head(2))  # print 2 rows only for preview


Chunk 1 → Shape: (5000, 21)
           id             date   price  bedrooms  bathrooms  sqft_living  \
0  7129300520  20141013T000000  221900         3       1.00         1180   
1  6414100192  20141209T000000  538000         3       2.25         2570   

   sqft_lot  floors  waterfront  view  ...  grade  sqft_above  sqft_basement  \
0      5650     1.0           0     0  ...      7        1180              0   
1      7242     2.0           0     0  ...      7        2170            400   

   yr_built  yr_renovated  zipcode      lat     long  sqft_living15  \
0      1955             0    98178  47.5112 -122.257           1340   
1      1951          1991    98125  47.7210 -122.319           1690   

   sqft_lot15  
0        5650  
1        7639  

[2 rows x 21 columns]
Chunk 2 → Shape: (5000, 21)
              id             date    price  bedrooms  bathrooms  sqft_living  \
5000  3023049215  20140702T000000   519000         5       2.25         2570   
5001  3625710080  20140626T00

### How does this work?

**1. `chunksize=5000`**  
- Pandas does not load the full CSV at once.  
- Instead, it **yields blocks of 5000 rows at a time**.  
- This is **memory-efficient**, especially for big data.
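
Because each chunk is an ordinary DataFrame, a statistic can be accumulated across chunks without ever holding the whole file in memory. A minimal sketch, assuming a small in-memory CSV (`csv_text`) stands in for `house_data.csv` and `chunksize=2` replaces 5000 to keep the example short:

```python
import io
import pandas as pd

# Five prices taken from the preview rows; a stand-in for the real file.
prices = [221900, 538000, 180000, 604000, 510000]
csv_text = "price\n" + "\n".join(str(p) for p in prices) + "\n"

total = 0
rows = 0
# Only one chunk is in memory at a time; we keep just the running totals.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total += chunk["price"].sum()
    rows += len(chunk)

mean_price = total / rows
print(rows, mean_price)
```

The same pattern works for counts, sums, minima, and maxima; statistics such as the median need a different approach because they cannot be combined from per-chunk results this simply.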

---

**2. `chunk_iter`**  
- `pd.read_csv(..., chunksize=5000)` creates an **iterator object**.  
- Each time we call `next(chunk_iter)`, it gives us the **next chunk of up to 5000 rows** as a DataFrame (the last chunk may be smaller).
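
The iterator behaviour can be seen by calling `next()` by hand instead of using a `for` loop. This is a sketch on assumed data: `csv_text` is a five-row in-memory stand-in for the real file, `reader` is an illustrative name, and `chunksize=2` is used so the shorter last chunk is visible.

```python
import io
import pandas as pd

csv_text = "price\n221900\n538000\n180000\n604000\n510000\n"
reader = pd.read_csv(io.StringIO(csv_text), chunksize=2)

first = next(reader)    # rows 0-1
second = next(reader)   # rows 2-3
last = next(reader)     # only one row is left, so this chunk is smaller
print(len(first), len(second), len(last))

# Once the file is exhausted, next() raises StopIteration,
# which is what ends a for loop over the reader.
exhausted = False
try:
    next(reader)
except StopIteration:
    exhausted = True
print(exhausted)
```

Each chunk keeps the row labels from the full file, so `second.index` starts at 2 rather than 0.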

---

**3. `enumerate()`**  
- Normally, iterating over an object only gives us the **data**.  
- But here, we also need the **chunk number** (1st chunk, 2nd chunk, etc.).  
- `enumerate()` solves this by returning **two outputs** in each loop:
  - The index (`i`) → starts from 0  
  - The chunk (DataFrame of 5000 rows)  
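
As a small variation, `enumerate()` accepts a `start` argument, which removes the need for `i+1` when labelling chunks. The sketch below assumes a three-row in-memory CSV and `chunksize=2`; `labels` and `number` are illustrative names.

```python
import io
import pandas as pd

csv_text = "price\n221900\n538000\n180000\n"

labels = []
# start=1 makes the counter begin at 1 instead of 0.
for number, chunk in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2), start=1):
    labels.append(f"Chunk {number} → Shape: {chunk.shape}")

print("\n".join(labels))
```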

---

### Example Flow
- First loop → `i=0`, `chunk = first 5000 rows`  
- Second loop → `i=1`, `chunk = next 5000 rows`  
- Third loop → `i=2`, `chunk = next 5000 rows`  

So when we write:  
```python
print(f"Chunk {i+1}")...
```

It prints:

```
Chunk 1
Chunk 2
Chunk 3
...
```

> ### Analogy  
> Imagine a **huge dictionary of 100,000 pages**:  
> - `chunksize=5000` → you read 5000 pages at a time (like binding them into small booklets).  
> - `chunk_iter` → acts like a librarian handing you one booklet after another.  
> - `enumerate()` → keeps count: “This is booklet 1, this is booklet 2, …”.  
>
> Without `enumerate()`, you would only see the data but not know **which part of the file** it belongs to.


### Visual Workflow of Chunk Reading with `enumerate`

```mermaid
flowchart LR
    A["CSV File: house_data.csv"] --> B["Read with chunksize=5000"]
    B --> C1["Chunk 1 (Rows 1-5000)"]
    B --> C2["Chunk 2 (Rows 5001-10000)"]
    B --> C3["Chunk 3 (Rows 10001-15000)"]
    B --> Cn["... More Chunks ..."]

    C1 --> D1["enumerate assigns i=0"]
    C2 --> D2["enumerate assigns i=1"]
    C3 --> D3["enumerate assigns i=2"]
    Cn --> Dn["enumerate assigns i=n"]

    D1 --> E1["Process Chunk 1"]
    D2 --> E2["Process Chunk 2"]
    D3 --> E3["Process Chunk 3"]
    Dn --> En["Process Remaining Chunks"]

    E1 --> F["Final Output / Analysis"]
    E2 --> F
    E3 --> F
    En --> F
```


### 5. Writing Numeric Data into a CSV
We can extract and save only numeric attributes:

- Selected columns: `price`, `bedrooms`, `bathrooms`, `sqft_living`.  
- Saved to a new file **housing_numeric.csv**.  
- `index=False` → prevents saving row indices.

This is useful for **numeric-only analysis** or when sharing a simplified dataset.


In [14]:
# Extract numeric columns only (e.g., price, bedrooms, sqft_living)
numeric_data = df[["price", "bedrooms", "bathrooms", "sqft_living"]]

numeric_data.to_csv("housing_numeric.csv", index=False)
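
When the numeric columns are not known in advance, Pandas can pick them by dtype instead of by name. A minimal sketch, assuming a two-row stand-in DataFrame (`sample`) with values from the preview; `numeric_only` and `non_numeric` are illustrative names:

```python
import pandas as pd

sample = pd.DataFrame({
    "date": ["20141013T000000", "20141209T000000"],
    "price": [221900, 538000],
    "bathrooms": [1.00, 2.25],
})

# include="number" keeps integer and float columns.
numeric_only = sample.select_dtypes(include="number")

# exclude="number" keeps everything else (here, the text column).
non_numeric = sample.select_dtypes(exclude="number")

print(numeric_only.columns.tolist())
print(non_numeric.columns.tolist())
```

Note that dtype-based selection treats any column stored as numbers as numeric, including identifier-like columns, so a name-based list remains the safer choice when only specific attributes are wanted.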


### 6. Writing Text Data into a CSV
We can also extract only text/categorical attributes:

- Selected columns: `date` (stored as text) and `zipcode` (stored as a number, but used as a categorical label).  
- Saved to **housing_text.csv**.  
- `index=False` → keeps the file clean.

This is useful when we want to separate **categorical/textual information** for specialized analysis.


In [16]:
# Extract text-based columns (e.g., date, zipcode)
text_data = df[["date", "zipcode"]]

text_data.to_csv("housing_text.csv", index=False)

