# Lecture 18: Working with Data (Data Cleaning & Analysis)

This notebook covers the concepts from Lecture 18, including data loading, inspection, and cleaning using Pandas.

## 🧠 Cheat Sheet: Data Cleaning & Analysis

### 1. Data Inspection
*   `df.head()`: View the first few rows.
*   `df.info()`: Check data types and missing values.
*   `df.describe()`: Summary statistics for numerical columns.
*   `df.columns`: List all column names.
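
As a quick illustration, the inspection calls above can be tried on a tiny made-up DataFrame. The column names and values here are hypothetical and only stand in for a real dataset.

```python
import pandas as pd

# Toy data (hypothetical), standing in for a real dataset
df = pd.DataFrame({
    "Borough": ["BROOKLYN", "QUEENS", "BROOKLYN", None],
    "Complaints": [12, 7, 3, 5],
})

print(df.head(2))        # first two rows
df.info()                # dtypes and non-null counts (prints directly)
print(df.describe())     # summary statistics for the numeric column only
print(list(df.columns))  # column names as a plain list
```

Note that `df.info()` prints its report itself, so it does not need to be wrapped in `print()`.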

### 2. Data Cleaning (The "Mnemonic")
*   **Empty**: Check for missing values (`df.isna().sum()`).
    *   *Example:* `df.isna().sum()` -> Returns count of missing values per column.
*   **Bad**: Check for invalid or "junk" values (e.g., a zip code recorded as 'IDK').
*   **Unique**: Check for duplicates or cardinality.
    *   *Example (Duplicates):* `df.duplicated().sum()` -> Returns number of duplicate rows.
    *   *Example (Cardinality):* `df.nunique()` -> Returns number of unique values in each column (e.g., 3 unique Boroughs).
*   **Spread**: Check the distribution of values.
    *   *Example:* `df['col'].value_counts()` -> Returns how often each value occurs.
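
The four checks can be sketched together on made-up data. The column names, the sample values, and the five-digit rule used for the "Bad" check are assumptions chosen for illustration, not part of the lecture dataset.

```python
import pandas as pd

# Toy data (hypothetical): one missing borough, one junk zip, one duplicate row
df = pd.DataFrame({
    "Borough": ["BROOKLYN", "QUEENS", "BROOKLYN", "BRONX", None],
    "Zip": ["11201", "11375", "11201", "IDK", "10451"],
})

# Empty: missing values per column
missing = df.isna().sum()

# Bad: rows whose zip is not exactly five digits (assumed validity rule)
bad = df[~df["Zip"].str.fullmatch(r"\d{5}", na=False)]

# Unique: fully duplicated rows, and distinct values per column
dupes = df.duplicated().sum()
cardinality = df.nunique()

# Spread: how often each borough appears, keeping missing values visible
spread = df["Borough"].value_counts(dropna=False)

print(missing, bad, dupes, cardinality, spread, sep="\n\n")
```

`nunique()` ignores missing values by default, which is why the missing borough is not counted as a category here.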

### 3. Common Cleaning Operations
*   **Dropping columns**: `df = df.drop(columns=['col_name'])` (returns a new DataFrame, so assign the result)
    *   *Example:* `df = df.drop(columns=['Unnecessary_ID'])`
*   **Renaming columns**: `df = df.rename(columns={'old': 'new'})`
    *   *Example:* `df = df.rename(columns={'Pop': 'Population'})`
*   **Changing types**: `df['col'] = df['col'].astype(new_type)`
    *   *Example:* `df['Year'] = df['Year'].astype(int)` (Raises an error if the column still has missing or non-numeric values)
*   **String cleanup**: `df['col'] = df['col'].str.replace(...)`
    *   *Example:* `df['Price'] = df['Price'].str.replace(',', '')` (Removes commas; the result is still text until converted with `astype`)
*   **Handling missing data**: `df.dropna()` or `df.fillna(value)` (both return a new object; assign the result to keep it)
    *   *Example (Drop All):* `df = df.dropna()` (Removes rows with ANY missing values)
    *   *Example (Drop Specific):* `df = df.dropna(subset=['Age'])` (Removes rows where 'Age' is missing)
    *   *Example (Fill):* `df['Score'] = df['Score'].fillna(0)` (Replaces missing scores with 0)
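
Here is a minimal sketch that chains these operations on made-up data. The column names echo the examples above, but the values are invented, and the order of steps is just one reasonable choice.

```python
import pandas as pd

# Toy data (hypothetical): numbers stored as text, with some gaps
df = pd.DataFrame({
    "Unnecessary_ID": [1, 2, 3, 4],
    "Pop": ["1,200", "850", None, "2,340"],
    "Year": ["2018", "2019", "2019", "2018"],
    "Score": [3.5, None, 4.0, 2.0],
})

df = df.drop(columns=["Unnecessary_ID"])          # drop a column
df = df.rename(columns={"Pop": "Population"})     # rename a column
df["Year"] = df["Year"].astype(int)               # text -> integer
df["Population"] = df["Population"].str.replace(",", "", regex=False)  # strip commas
df["Score"] = df["Score"].fillna(0)               # fill missing scores with 0
df = df.dropna(subset=["Population"])             # drop rows with no population
df["Population"] = df["Population"].astype(int)   # safe now: no gaps, no commas

print(df)
```

The final `astype(int)` is done last on purpose: it would raise an error while the column still contained commas or missing values.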

### 4. Grouping & Aggregation
*   `df.groupby('Category')['Value'].sum()`: Sum values by category.
*   `df.groupby('Category').size()`: Count items per category.
*   `.reset_index()`: Turn the group labels back into a regular column.
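
A short sketch of these grouping calls on made-up data follows. The `Category` and `Value` names match the placeholders above, and the `Count` column name is an arbitrary choice.

```python
import pandas as pd

# Toy data (hypothetical)
df = pd.DataFrame({
    "Category": ["A", "B", "A", "C", "B", "A"],
    "Value": [10, 5, 20, 1, 15, 30],
})

totals = df.groupby("Category")["Value"].sum()   # Series indexed by Category
counts = df.groupby("Category").size()           # rows per category

# reset_index() turns the Category index back into an ordinary column
totals_df = totals.reset_index()
counts_df = counts.reset_index(name="Count")

print(totals_df)
print(counts_df)
```

Without `reset_index()`, the category labels live in the index, which is fine for lookups but less convenient for merging or plotting by column name.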

---
## Setup
Run this cell to load the necessary libraries.


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Display options
pd.set_option('display.max_columns', None)

# Section 1: 311 Data Analysis (Lecture Demo)

### Load Data
Load the 311 sample data from the provided URL.

In [None]:
# URL: https://storage.googleapis.com/python-public-policy2/data/311_requests_2018-19_sample.csv.zip
# Your code here

### Preview Data
Inspect the first few rows.

In [None]:
# Your code here

### Data Info
Check the data types and non-null counts.

In [None]:
# Your code here

### Analysis: Most Common Complaints
Which complaint types are the most frequent?

In [None]:
# Your code here

### Analysis: Top Request per Agency
What is the most frequent request type for each agency?
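
As a hint, one possible pattern is to count each (agency, type) pair, sort by the count, and keep the first row per agency. The sketch below uses made-up rows, and the column names `agency` and `complaint_type` are assumptions; check `df.columns` for the real ones.

```python
import pandas as pd

# Toy data (hypothetical); real column names may differ
df = pd.DataFrame({
    "agency": ["NYPD", "NYPD", "NYPD", "DOT", "DOT", "DOT"],
    "complaint_type": ["Noise", "Noise", "Parking", "Pothole", "Signal", "Pothole"],
})

# Count how often each complaint type occurs within each agency
counts = df.groupby(["agency", "complaint_type"]).size().reset_index(name="n")

# Keep only the highest-count row for each agency
top = (
    counts.sort_values("n", ascending=False)
    .drop_duplicates(subset="agency")
    .sort_values("agency")
    .reset_index(drop=True)
)
print(top)
```

If two types tie for first place within an agency, this approach keeps only one of them.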

In [None]:
# Your code here

### Data Cleaning: Zip Codes
Find and handle invalid zip codes (e.g., 'HARRISBURG', 'IDK').
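
As a hint, one way to approach this is to flag values that do not look like a five-digit zip code and then blank them out. The sketch below uses made-up values, the column name `zip` is an assumption, and the five-digit rule is a simplification (it would also reject ZIP+4 codes).

```python
import pandas as pd

# Toy data (hypothetical); the real column name may differ
df = pd.DataFrame({"zip": ["11201", "HARRISBURG", "10027", "IDK", None]})

# True only for values made of exactly five digits; missing values count as False
is_valid = df["zip"].str.fullmatch(r"\d{5}", na=False)

# Find: the junk values that are present but not valid
invalid = df.loc[df["zip"].notna() & ~is_valid, "zip"]
print(invalid.unique())

# Handle: keep valid zips, replace everything else with a missing value
df["zip_clean"] = df["zip"].where(is_valid)
print(df)
```

Replacing junk with a missing value keeps the rows for other analyses; dropping the rows entirely is the alternative when the zip code is essential.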

In [None]:
# Your code here

# Section 2: Birth Rates Analysis
Now apply these concepts to the `BirthsAndFertilityRatesAnnual.csv` dataset.

### Load Data
Load `BirthsAndFertilityRatesAnnual.csv`.

In [None]:
# Your code here

### Question 1: Data Inspection
Inspect the dataframe. What are the columns? Do they look correct? (Hint: The file might need transposition or header adjustment based on its structure).

In [None]:
# Your code here

### Question 2: Reshaping
If the years are columns and metrics are rows, transpose the dataframe so years become the index (rows) and metrics become columns.
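
As a hint, the reshaping can be sketched on a small made-up table. The label column name `Data Series` and the metric names are assumptions; the real file may use different ones.

```python
import pandas as pd

# Toy "wide" table (hypothetical): one row per metric, one column per year
wide = pd.DataFrame({
    "Data Series": ["Metric A", "Metric B"],
    "2020": [1.1, 10.0],
    "2019": [1.2, 11.0],
})

# Make the metric names the index, then transpose so years become rows
tidy = wide.set_index("Data Series").T

# Year labels arrive as text; convert them and sort chronologically
tidy.index = tidy.index.astype(int)
tidy.index.name = "Year"
tidy = tidy.sort_index()

print(tidy)
```

Setting the index before transposing matters: otherwise the metric names end up as an ordinary data row instead of column headers.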

In [None]:
# Your code here

### Question 3: Data Types
Check the data types. Are the rates numeric? If not, clean them (handle 'na' or other non-numeric values).
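
As a hint, `pd.to_numeric` with `errors="coerce"` converts text to numbers and turns anything unparseable into a missing value. The sketch below uses made-up values and a hypothetical column name `Rate`.

```python
import pandas as pd

# Toy data (hypothetical): rates stored as text, with an 'na' placeholder
rates = pd.DataFrame(
    {"Rate": ["1.10", "na", "1.25", "1.30"]},
    index=[2017, 2018, 2019, 2020],
)
print(rates.dtypes)  # a text (non-numeric) dtype, not float

# Convert to numbers; values that cannot be parsed become NaN
rates["Rate"] = pd.to_numeric(rates["Rate"], errors="coerce")

print(rates.dtypes)        # float64
print(rates.isna().sum())  # how many values were coerced to NaN
```

It is worth comparing the missing-value count before and after conversion, so that coerced junk values do not go unnoticed.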

In [None]:
# Your code here

### Question 4: Trend Analysis
Plot the 'Total Fertility Rate (TFR)' over the years.

In [None]:
# Your code here

### Question 5: Age Groups
Compare the birth rates between '20 - 24 Years' and '30 - 34 Years' over time. Which one is increasing?

In [None]:
# Your code here

### Question 6: Maxima
In which year was the 'Crude Birth Rate' the highest?
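
As a hint, `idxmax()` returns the index label of the largest value, so it gives the year directly when years are the index. The values below are made up.

```python
import pandas as pd

# Toy data (hypothetical), indexed by year
rates = pd.DataFrame(
    {"Crude Birth Rate": [9.5, 10.2, 8.8]},
    index=pd.Index([2001, 2002, 2003], name="Year"),
)

peak_year = rates["Crude Birth Rate"].idxmax()  # index label of the maximum
peak_value = rates["Crude Birth Rate"].max()    # the maximum itself

print(peak_year, peak_value)
```

The same call works on a computed Series, for example the difference between two columns.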

In [None]:
# Your code here

### Question 7: Ethnic Groups
Create a new dataframe containing only the fertility rates for 'Chinese', 'Malays', and 'Indians'. Plot their trends.

In [None]:
# Your code here

### Question 8: Cleaning Practice
Are there any missing values in the 'Resident Live-Births' column? If so, how many?

In [None]:
# Your code here

### Question 9: Calculation
Calculate the average 'Total Live-Births' for the period 2000-2010.
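
As a hint, label-based slicing with `.loc` selects a range of years when the index holds sorted integer years. The sketch below uses made-up values.

```python
import pandas as pd

# Toy data (hypothetical), indexed by sorted integer years
births = pd.DataFrame(
    {"Total Live-Births": [100, 200, 300, 400]},
    index=pd.Index([1999, 2000, 2010, 2011], name="Year"),
)

# .loc slices by label and includes BOTH endpoints (2000 and 2010)
period_mean = births.loc[2000:2010, "Total Live-Births"].mean()

print(period_mean)
```

If the year labels are still text, convert them to integers first; slicing with integer bounds will not work on a text index.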

In [None]:
# Your code here

### Question 10: Complex Query
Find the year where the difference between 'Total Live-Births' and 'Resident Live-Births' was the largest.

In [None]:
# Your code here