# Class 3: File Handling and Real-World Data

**Week 4: Intermediate Python and Data Preprocessing**

## Objectives
- Read and write JSON files for flexible data handling.
- Process large CSV files efficiently using chunks.
- Begin the mini-project by applying preprocessing steps to a real-world dataset (Titanic).
- Understand the importance of file format flexibility in data science workflows.

## Datasets
- **Titanic dataset** (`titanic.csv`): Contains columns like `PassengerId`, `Pclass`, `Name`, `Sex`, `Age`, `Fare`, `Embarked`, `Survived`. Used for the mini-project.
- **Sample JSON** (`sample.json`): A small dataset with passenger-like records (e.g., `id`, `name`, `ticket_cost`).

## Instructions
- Run the setup cell to load libraries.
- Complete the exercises by filling in the code cells.
- Use the hints if you're stuck.
- Start the mini-project in Exercise 3 and save your progress.
- Save your notebook and submit it if required.

## Setup
Run the cell below to import libraries.

```python
import pandas as pd
import json

# Verify datasets are accessible
try:
    titanic = pd.read_csv('data/titanic.csv')
    print('Titanic dataset loaded successfully.')
    print(titanic.head())
except FileNotFoundError:
    print('Error: titanic.csv not found in data/ folder.')

try:
    with open('data/sample.json', 'r') as f:
        json_data = json.load(f)
    print('\nSample JSON loaded successfully:')
    print(json_data[:2])  # Show first two records
except FileNotFoundError:
    print('Error: sample.json not found in data/ folder.')
```

## Exercise 1: Reading and Writing JSON

**Goal**: Work with JSON files to handle semi-structured data.

**Task**:
- Read `sample.json` into a pandas DataFrame.
- Filter rows where `ticket_cost` is greater than 50.
- Save the filtered data as a new JSON file (`filtered.json`).

**Steps**:
1. Use `json.load()` to read `sample.json` (already done in setup).
2. Convert the JSON data to a DataFrame with `pd.DataFrame()`.
3. Filter rows using boolean indexing.
4. Save the filtered DataFrame to JSON using `to_json()`.

**Hint**: Use `orient='records'` in `to_json()` to match the input JSON format.
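
The `orient` setting in the hint decides the shape of the saved file. As a minimal sketch on made-up records (the names and values below are illustrative and are not taken from `sample.json`), the following compares `orient='records'` with and without `lines=True`:

```python
import json
import pandas as pd

# Hypothetical records, shaped like the fields described for sample.json
records = [
    {"id": 1, "name": "Ann", "ticket_cost": 72.5},
    {"id": 2, "name": "Bob", "ticket_cost": 8.05},
    {"id": 3, "name": "Cy", "ticket_cost": 51.0},
]

df = pd.DataFrame(records)
expensive = df[df["ticket_cost"] > 50]  # boolean indexing keeps ids 1 and 3

# With no path argument, to_json() returns the JSON text instead of writing a file
as_array = expensive.to_json(orient="records")
as_lines = expensive.to_json(orient="records", lines=True)

print(as_array)  # a single JSON array of objects
print(as_lines)  # one JSON object per line (JSON Lines)

# Only the array form parses into a list with a single json.loads() call
parsed = json.loads(as_array)
```

The array form is the one that `json.load()` reads back as a list of records, which is how the setup cell treats `sample.json`.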

```python
# Your code here

# Convert JSON data to DataFrame
json_df = # YOUR CODE

# Filter rows where ticket_cost > 50
filtered_df = # YOUR CODE

# Save to filtered.json
# YOUR CODE

# Display the filtered DataFrame
print('Filtered DataFrame:')
print(filtered_df)
```

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work. Try to complete the exercise yourself first!

```python
# json_df = pd.DataFrame(json_data)
# filtered_df = json_df[json_df['ticket_cost'] > 50]
# filtered_df.to_json('data/filtered.json', orient='records')  # no lines=True: keep a single JSON array, like the input
# print('Filtered DataFrame:')
# print(filtered_df)
```

## Exercise 2: Advanced CSV Processing

**Goal**: Process large CSV files efficiently using chunks.

**Task**:
- Read `titanic.csv` in chunks of 100 rows.
- For each chunk, count the number of passengers by `Pclass`.
- Sum the counts across chunks to get the total passengers per class.

**Steps**:
1. Use `pd.read_csv()` with `chunksize=100` to create a chunk iterator.
2. In a loop, use `value_counts()` to count `Pclass` in each chunk.
3. Aggregate counts across chunks (e.g., store in a dictionary or Series).
4. Display the final counts.

**Hint**: Initialize an empty Series or dictionary to accumulate counts.
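
The accumulation pattern in the steps above can be tried on a tiny in-memory CSV before running it on the real file. This is a sketch under stated assumptions: the seven rows are invented, and `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical seven-row CSV standing in for a large file
csv_text = "PassengerId,Pclass\n1,3\n2,1\n3,3\n4,2\n5,3\n6,1\n7,3\n"

total = pd.Series(dtype="float64")
n_chunks = 0

# chunksize=3 yields DataFrames of at most three rows each
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    n_chunks += 1
    # add() aligns on the Pclass labels; fill_value=0 covers labels missing on one side
    total = total.add(chunk["Pclass"].value_counts(), fill_value=0)

# Aligned addition produces floats, so cast back to integers for display
total = total.astype(int).sort_index()
print(total)
```

Only one chunk is held in memory at a time, so the same loop works when the file is far larger than available RAM.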

```python
# Your code here

# Initialize a Series to store counts
class_counts = pd.Series(dtype=int)

# Read titanic.csv in chunks
for chunk in # YOUR CODE:
    # Count Pclass in this chunk
    chunk_counts = # YOUR CODE
    # Add to total counts
    class_counts = class_counts.add(chunk_counts, fill_value=0)

# Display the result
print('Total passengers by Pclass:')
print(class_counts)
```

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# class_counts = pd.Series(dtype=int)
# for chunk in pd.read_csv('data/titanic.csv', chunksize=100):
#     chunk_counts = chunk['Pclass'].value_counts()
#     class_counts = class_counts.add(chunk_counts, fill_value=0)
# class_counts = class_counts.astype(int)  # add() with fill_value returns floats
# print('Total passengers by Pclass:')
# print(class_counts)
```

## Exercise 3: Mini-Project Kickoff

**Goal**: Start preprocessing the Titanic dataset for the Week 4 mini-project.

**Task**:
- Load `titanic.csv`.
- Handle missing values:
  - Fill missing `Age` with the median.
  - Fill missing `Embarked` with the mode.
- Encode categorical variables:
  - One-hot encode `Sex`.
  - One-hot encode `Embarked`.
- Normalize `Fare` using `MinMaxScaler`.
- Save the preprocessed DataFrame to a new CSV (`titanic_preprocessed.csv`).

**Steps**:
1. Load the dataset (already done in setup).
2. Use `fillna()` for missing values.
3. Use `pd.get_dummies()` for encoding.
4. Use `MinMaxScaler` from scikit-learn for normalization.
5. Save with `to_csv()`.

**Hint**: Reuse techniques from Classes 1 and 2. Check for missing values with `isna().sum()`.
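
To see what each step does before applying it to the full dataset, here is a small sketch on invented rows. It scales `Fare` by hand with `(x - min) / (max - min)`, which is the formula `MinMaxScaler` applies with its default `[0, 1]` range. The manual version is a stand-in chosen so the sketch needs only pandas; the exercise itself still asks for `MinMaxScaler`.

```python
import pandas as pd

# Hypothetical rows; all values are invented for illustration
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
    "Sex": ["male", "female", "female", "male"],
    "Fare": [10.0, 20.0, 30.0, 50.0],
})

# Median of 22, 30, 40 is 30; the most frequent Embarked value is "S"
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# One indicator column per category; the original columns are replaced
toy = pd.get_dummies(toy, columns=["Sex", "Embarked"])

# Min-max scaling by hand: (x - min) / (max - min)
fare = toy["Fare"]
toy["Fare_normalized"] = (fare - fare.min()) / (fare.max() - fare.min())

print(toy)
```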

```python
# Your code here

from sklearn.preprocessing import MinMaxScaler

# Check missing values
print('Missing values before:')
print(titanic.isna().sum())

# Handle missing values
# YOUR CODE (Age and Embarked)

# Encode categorical variables
# YOUR CODE (Sex and Embarked)

# Normalize Fare
scaler = # YOUR CODE
titanic['Fare_normalized'] = # YOUR CODE

# Check missing values after
print('\nMissing values after:')
print(titanic.isna().sum())

# Save to CSV
# YOUR CODE

# Display the first few rows
print('\nPreprocessed DataFrame:')
print(titanic.head())
```

## Solution (Instructor Reference)

Uncomment and run the cell below to check your work.

```python
# print('Missing values before:')
# print(titanic.isna().sum())
# titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
# titanic['Embarked'] = titanic['Embarked'].fillna(titanic['Embarked'].mode()[0])
# titanic = pd.get_dummies(titanic, columns=['Sex', 'Embarked'], drop_first=False)
# scaler = MinMaxScaler()
# titanic['Fare_normalized'] = scaler.fit_transform(titanic[['Fare']]).ravel()  # [['Fare']] is already 2D; ravel() flattens the (n, 1) result
# print('\nMissing values after:')
# print(titanic.isna().sum())
# titanic.to_csv('data/titanic_preprocessed.csv', index=False)
# print('\nPreprocessed DataFrame:')
# print(titanic.head())
```

## Bonus Challenge

**Task**: Convert the preprocessed Titanic DataFrame to JSON and save it as `titanic_preprocessed.json`.
- Ensure the JSON format is a list of records (like `sample.json`).
- Load the saved JSON back into a DataFrame to verify it matches.

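**Hint**: Use `to_json(orient='records')` and `pd.read_json()`. Leave out `lines=True`, which writes one JSON object per line (JSON Lines) rather than a single list of records.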
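
One way to check the round trip is sketched below on an invented three-row frame; the column names mirror the exercise but the values are made up, and `io.StringIO` stands in for the saved file. `pd.testing.assert_frame_equal` is used here as one possible comparison, not the only one.

```python
import io
import pandas as pd

# Hypothetical preprocessed frame; values are invented for illustration
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Age": [22.0, 30.5, 41.0],
    "Fare_normalized": [0.0, 0.25, 1.0],
})

# Serialize to a JSON array of records (returned as text because no path is given)
text = df.to_json(orient="records")

# read_json() accepts a file path or a file-like object
restored = pd.read_json(io.StringIO(text), orient="records")

# Raises AssertionError if the values differ; check_dtype=False tolerates
# dtype changes that a JSON round trip can introduce
pd.testing.assert_frame_equal(df, restored, check_dtype=False)
print("Round trip OK")
```

By default `to_json()` writes floats with 10 decimal places (`double_precision=10`), so a tolerance-based comparison like this one is safer than exact equality for columns such as `Fare_normalized`.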

```python
# Your code here

# Save preprocessed DataFrame to JSON
# YOUR CODE

# Load JSON back to verify
verify_df = # YOUR CODE
print('Verified JSON DataFrame:')
print(verify_df.head())
```

## Discussion Questions
1. Why is JSON a popular format for data exchange?
2. How does chunked CSV processing help with large datasets?
3. What challenges might arise when preprocessing real-world datasets like Titanic?

Feel free to jot down your thoughts in a new markdown cell below!

## Your Notes

(Add your thoughts here)