# 🎓 Final Revision Guide: The "Cheat Sheet" Notebook

**Goal:** This notebook covers the *concepts* and *logic* you need for the test. 
**Format:** Read the concept, look at the code, and try to understand *why* it works.

## 📚 Syllabus Checklist
- [ ] Pure Python (CSV, Logic)
- [ ] Pandas Fundamentals (Types, Indexing)
- [ ] Data Manipulation (Groupby, Merge, Clean, Reshaping)
- [ ] Time Series (Resampling)
- [ ] Visualization (Chart types)
- [ ] APIs & JSON
- [ ] Markdown & Jupyter

# 1. Pure Python & Standard Library (No Pandas)
**Scenario:** You are on a desert island without Pandas. You have to process a CSV file row by row.

### Key Modules
*   `import csv`: For reading/writing CSVs.
*   `import json`: For parsing JSON data.
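
A minimal sketch of the two modules working together, assuming made-up records (the `records_json` string is illustrative, not a real file): `json.loads` turns JSON text into Python lists and dicts, and `csv.DictWriter` writes those dicts out as CSV rows.

```python
import csv
import io
import json

# Hypothetical JSON text, standing in for a file or an API response
records_json = '[{"Country": "USA", "GDP": 23000}, {"Country": "Tuvalu", "GDP": 0.06}]'

# json.loads: JSON text -> Python list of dicts
records = json.loads(records_json)

# csv.DictWriter: list of dicts -> CSV text (written to memory instead of disk)
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Country", "GDP"])
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

With a real file you would likely swap `io.StringIO()` for `open('out.csv', 'w', newline='')`.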

### The "Lowest GDP" Problem (Classic Logic)
**Task:** Find the country with the lowest GDP without using `min()` on a list.

In [3]:
import csv
import io

# Simulating a CSV file in memory so this code runs
csv_content = """Country,GDP
USA,23000
China,18000
Tuvalu,0.06
Germany,4000"""

# In the real test, you'd use: with open('gdp.csv', 'r') as f:
# Here we use io.StringIO(csv_content) to pretend it's a file
with io.StringIO(csv_content) as f:
    reader = csv.DictReader(f)  # DictReader lets you use row["ColumnName"]
    
    lowest_gdp = None
    lowest_country = None

    for row in reader:
        gdp = float(row["GDP"]) # CRITICAL: Convert string to float!
        
        # Logic: If it's the first row (None) OR this gdp is lower than our current best
        if lowest_gdp is None or gdp < lowest_gdp:
            lowest_gdp = gdp
            lowest_country = row["Country"]

print(f"Country with lowest GDP: {lowest_country} (${lowest_gdp})")

Country with lowest GDP: Tuvalu ($0.06)


### 💡 Key Takeaways
1.  **`csv.DictReader`**: Reads rows as dictionaries (`{'Country': 'USA', 'GDP': '23000'}`).
2.  **Type Conversion**: The `csv` module *always* gives you strings. You MUST cast to `float()` or `int()` for math.
3.  **`None` Check**: Initialize variables to `None` to handle the first row correctly.
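
To see why the type conversion matters, here is a small sketch of what goes wrong when numbers are compared as text. The helper `to_float` is a hypothetical name introduced for illustration; it is one possible way to survive bad values such as `"N/A"`.

```python
# Strings compare character by character, not by numeric value
as_text = "4000" < "23000"                    # "4" sorts after "2"
as_numbers = float("4000") < float("23000")

print(as_text)     # False
print(as_numbers)  # True

def to_float(text):
    """Return text as a float, or None if it is not a valid number."""
    try:
        return float(text)
    except ValueError:
        return None

print(to_float("0.06"))  # 0.06
print(to_float("N/A"))   # None
```

Inside the row loop, a `None` result from such a helper could be skipped with `continue` so one bad cell does not crash the whole run.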

# 2. Pandas Fundamentals

### Data Structures
*   **Series**: 1D array (a single column). Has an index.
*   **DataFrame**: 2D table (rows and columns). A collection of Series.
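
A quick sketch of the difference, using a small throwaway table: single brackets pull out one column as a Series, double brackets keep a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]})

col = df['A']        # single brackets -> Series (1D)
sub = df[['A']]      # double brackets -> DataFrame (2D, one column)

print(type(col).__name__, col.shape)   # Series (3,)
print(type(sub).__name__, sub.shape)   # DataFrame (3, 1)
```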

### Data Types (dtypes)
Why do we care? **Memory** and **Functionality**.

| Type | Description | Common Issues |
| :--- | :--- | :--- |
| `object` | Text/Strings (or mixed) | Can't do math. Check for hidden symbols ('$100'). |
| `int64` | Whole numbers | Can't hold NaN: a missing value turns the column into `float64` (the nullable `Int64` dtype avoids this). |
| `float64` | Decimals | Used for money, scientific data. |
| `bool` | True/False | Result of logic (`df['Age'] > 18`). |
| `datetime64` | Dates | Allows `.dt` accessor (e.g., `.dt.year`). |

In [4]:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Price': ['$100', '$200', 'Not Available'],
    'Age': [25, 30, np.nan]
})

# 1. Check Types
print(df.dtypes)

# 2. Convert Types (The Hard Way vs The Easy Way)
# Hard Way: astype (Fails if there are errors)
# df['Age'] = df['Age'].astype(int) # ERROR! Cannot convert NaN to integer

# Easy Way: pd.to_numeric (Handles errors)
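df['Price_Clean'] = pd.to_numeric(df['Price'].str.replace('$', '', regex=False), errors='coerce')
# regex=False treats '$' as a literal character (in a regex, '$' means "end of string")
# errors='coerce' turns 'Not Available' into NaN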

Price     object
Age      float64
dtype: object
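
The cell above prints only the dtypes. As a sketch of what the cleaned values look like, and of one way to keep whole numbers next to missing values, the example below rebuilds the same table and uses the nullable `Int64` dtype (capital "I"). The column name `Age_Int` is made up for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Price': ['$100', '$200', 'Not Available'],
    'Age': [25, 30, np.nan]
})

# Strip the literal '$', then convert; unparseable text becomes NaN
df['Price_Clean'] = pd.to_numeric(
    df['Price'].str.replace('$', '', regex=False), errors='coerce'
)
print(df['Price_Clean'].tolist())   # [100.0, 200.0, nan]

# Nullable integer dtype keeps whole numbers alongside missing values
df['Age_Int'] = df['Age'].astype('Int64')
print(df['Age_Int'].dtype)          # Int64
print(df['Age_Int'].isna().sum())   # 1
```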


### Indexing
*   **`.loc[row_label, col_label]`**: Label-based. Inclusive of end.
*   **`.iloc[row_pos, col_pos]`**: Integer-position based. Exclusive of end (like Python lists).
*   **Boolean Indexing**: Filtering rows based on a condition.

In [None]:
df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]}, index=['x', 'y', 'z'])

# .loc (Label)
print(df.loc['x', 'A']) # 10
print(df.loc['x':'y', 'A']) # Includes 'y'!

# .iloc (Position)
print(df.iloc[0, 0]) # 10
print(df.iloc[0:2, 0]) # Excludes index 2 (row 'z')

# Boolean Indexing
mask = df['A'] > 15
print(df[mask]) # Shows rows y and z
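
Boolean masks can also be combined. The sketch below reuses the same table and shows `&` (and), `|` (or), and `~` (not); each condition needs its own parentheses because these operators bind tighter than comparisons.

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [40, 50, 60]}, index=['x', 'y', 'z'])

# AND: both conditions must hold
both = df[(df['A'] > 15) & (df['B'] < 60)]
print(both.index.tolist())      # ['y']

# OR: at least one condition holds
either = df[(df['A'] < 15) | (df['B'] == 60)]
print(either.index.tolist())    # ['x', 'z']

# NOT: invert a mask with ~
not_big = df[~(df['A'] > 15)]
print(not_big.index.tolist())   # ['x']

# .loc accepts a mask for rows and a label for columns in one step
print(df.loc[df['A'] > 15, 'B'].tolist())   # [50, 60]
```

Using Python's `and` / `or` here instead of `&` / `|` raises a `ValueError`, which is a common exam trap.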

# 3. Data Manipulation (The Heavy Lifting)

### Cleaning Data
*   `dropna()`: Remove missing values.
*   `fillna(value)`: Replace missing values.
*   `sort_values(by='col')`: Sort data.

In [None]:
df = pd.DataFrame({'Val': [1, np.nan, 2]})

# Drop rows with ANY NaNs
clean_df = df.dropna()

# Fill NaNs with 0
filled_df = df.fillna(0)

# Sort
sorted_df = df.sort_values(by='Val', ascending=False)
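
A slightly larger sketch of the same three methods, assuming an illustrative table with made-up `Name` and `Note` columns. It shows `subset=` for targeted dropping, filling with the column mean, and where NaNs end up after sorting.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Ann', 'Ben', 'Cy'],      # illustrative values
    'Val': [1, np.nan, 2],
    'Note': [np.nan, 'ok', 'ok']
})

# Drop rows only when 'Val' is missing (ignores NaN in other columns)
val_present = df.dropna(subset=['Val'])
print(val_present['Name'].tolist())   # ['Ann', 'Cy']

# Fill missing 'Val' with the column mean: (1 + 2) / 2 = 1.5
mean_filled = df['Val'].fillna(df['Val'].mean())
print(mean_filled.tolist())           # [1.0, 1.5, 2.0]

# NaNs go last when sorting, in either direction (na_position='last' is the default)
sorted_df = df.sort_values(by='Val', ascending=False)
print(sorted_df['Name'].tolist())     # ['Cy', 'Ann', 'Ben']
```

None of these calls change `df` itself; each returns a new object, so remember to assign the result.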

### Grouping (Split-Apply-Combine)
**Syntax:** `df.groupby('GroupCol')['TargetCol'].agg_func()`

**The Dictionary Syntax (Your Nemesis):**
If you want different aggregations for different columns, pass a dictionary to `.agg()`.

In [None]:
data = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Qty': [1, 2, 3, 4],
    'Price': [10, 20, 30, 40]
})

# CORRECT WAY
res = data.groupby('Category').agg({
    'Qty': 'sum',      # Sum the Qty
    'Price': 'mean'    # Average the Price
})

# WRONG WAY (Syntax Error)
# data.groupby('Category')['Qty'].agg(['Qty': 'sum']) # No colons in lists!
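
As a sketch of what the dictionary syntax returns, plus an alternative you may find easier to read: pandas also supports "named aggregation", where each output column is written as `new_name=(source_column, function)`. The output names `total_qty` and `avg_price` are made up for this example.

```python
import pandas as pd

data = pd.DataFrame({
    'Category': ['A', 'A', 'B', 'B'],
    'Qty': [1, 2, 3, 4],
    'Price': [10, 20, 30, 40]
})

# Dictionary syntax: one aggregation per column
res = data.groupby('Category').agg({'Qty': 'sum', 'Price': 'mean'})
print(res.loc['A', 'Qty'], res.loc['A', 'Price'])   # 3 15.0

# Named aggregation: new_column=(source_column, function)
named = data.groupby('Category').agg(
    total_qty=('Qty', 'sum'),
    avg_price=('Price', 'mean'),
).reset_index()   # turn 'Category' from the index back into a column

print(named.columns.tolist())        # ['Category', 'total_qty', 'avg_price']
print(named['total_qty'].tolist())   # [3, 7]
```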

### Merging (Joins)
Combining two DataFrames based on a common key.

*   **Inner Join (`how='inner'`)**: Keep only rows that match in BOTH. (Intersection)
*   **Left Join (`how='left'`)**: Keep ALL rows from Left, match what you can from Right. (Fill missing with NaN).
*   **Outer Join (`how='outer'`)**: Keep EVERYTHING. (Union).

In [None]:
left = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
right = pd.DataFrame({'ID': [2, 3], 'Score': [90, 80]})

# Inner Join (Only ID 2 matches)
inner = pd.merge(left, right, on='ID', how='inner')
# Result: Bob, 90

# Left Join (Keep Alice, Bob. Alice gets NaN score)
left_join = pd.merge(left, right, on='ID', how='left')
# Result: Alice (NaN), Bob (90)
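
The outer join is described above but not run. A minimal sketch with the same two tables follows; it also passes `indicator=True`, which adds a `_merge` column telling you where each row came from.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
right = pd.DataFrame({'ID': [2, 3], 'Score': [90, 80]})

# Outer Join: keep every ID from both sides
outer = pd.merge(left, right, on='ID', how='outer', indicator=True)
print(outer['ID'].tolist())        # [1, 2, 3]
print(outer['_merge'].tolist())    # ['left_only', 'both', 'right_only']

# Rows found only in the left table (sometimes called an anti-join)
only_left = outer[outer['_merge'] == 'left_only']
print(only_left['Name'].tolist())  # ['Alice']
```

Here Alice has no score (NaN) and ID 3 has no name (NaN), which is the "union" behaviour from the list above.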

### Reshaping (Melt & Pivot)
Changing the shape of your data (Wide <-> Long).

*   **Melt (Wide -> Long)**: Unpivoting. Good for computers/plotting.
*   **Pivot (Long -> Wide)**: Making it readable. Good for humans.

In [1]:
# Wide Data (Months as Columns)
wide_df = pd.DataFrame({
    'City': ['NY', 'LA'],
    'Jan': [100, 200],
    'Feb': [150, 250]
})

# 1. MELT: Gather columns into rows
melted_df = wide_df.melt(
    id_vars='City',      # Column to keep fixed
    var_name='Month',    # Name for the new column created from headers
    value_name='Sales'   # Name for the values
)
# Result:
# City  Month  Sales
# NY    Jan    100
# LA    Jan    200
# NY    Feb    150
# LA    Feb    250

# 2. PIVOT: Spread rows back into columns
pivoted_df = melted_df.pivot(
    index='City',        # What becomes the rows
    columns='Month',     # What becomes the columns
    values='Sales'       # What fills the cells
)
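
One catch worth knowing: `pivot()` only works when each index/column pair appears once. The sketch below uses made-up long data with a duplicate (NY, Jan) pair to show the failure, then `pivot_table()`, which aggregates duplicates instead.

```python
import pandas as pd

# Long data with TWO rows for (NY, Jan); values are illustrative
long_df = pd.DataFrame({
    'City': ['NY', 'NY', 'LA', 'NY'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb'],
    'Sales': [100, 50, 200, 150]
})

# pivot() refuses duplicates: it cannot decide which value goes in the cell
try:
    long_df.pivot(index='City', columns='Month', values='Sales')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print('pivot failed:', err)

# pivot_table() aggregates the duplicates instead (here: sum)
table = long_df.pivot_table(index='City', columns='Month', values='Sales', aggfunc='sum')
print(int(table.loc['NY', 'Jan']))        # 150
print(pd.isna(table.loc['LA', 'Feb']))    # True (no LA/Feb row exists)
```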


# 4. Time Series & Resampling

**Resampling:** Changing the frequency of your time series data.
*   **Downsampling:** High Freq -> Low Freq (Days -> Months). Needs **Aggregation** (sum, mean).
*   **Upsampling:** Low Freq -> High Freq (Months -> Days). Needs **Filling** (ffill, bfill).

**Syntax:** `df.resample('Rule').agg()`

In [11]:
dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-02-01'])
ts = pd.DataFrame({'Sales': [100, 200, 300]}, index=dates)

# Downsample to Month (Sum)
# 'ME' = month end. pandas before 2.2 spells this 'M', which newer versions deprecate.
monthly = ts.resample('ME').sum()
# Result: Jan (300), Feb (300)

# Upsample to Day (Forward Fill)
# Fills missing days with the previous valid value
daily = monthly.resample('D').ffill()
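
A sketch that makes both directions concrete. It assumes the month-start alias `'MS'` for downsampling, which labels each month by its first day, and a tiny made-up series `sparse` for upsampling so the two fill directions can be compared side by side.

```python
import pandas as pd

dates = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-02-01'])
ts = pd.DataFrame({'Sales': [100, 200, 300]}, index=dates)

# Downsample: several aggregations at once
monthly = ts.resample('MS')['Sales'].agg(['sum', 'mean'])
print(monthly['sum'].tolist())    # [300, 300]
print(monthly['mean'].tolist())   # [150.0, 300.0]

# Upsample a short series to daily and compare the two fill directions
sparse = pd.Series([10, 40], index=pd.to_datetime(['2023-01-01', '2023-01-04']))
forward = sparse.resample('D').ffill()   # copy the PREVIOUS known value
backward = sparse.resample('D').bfill()  # copy the NEXT known value
print(forward.tolist())    # [10, 10, 10, 40]
print(backward.tolist())   # [10, 40, 40, 40]
```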


# 5. Visualization

### Choosing the Right Chart
1.  **Line Chart:** Change over time (Trends).
2.  **Bar Chart:** Comparing categories (Apples vs Oranges).
3.  **Scatter Plot:** Relationship between two numbers (Height vs Weight).
4.  **Histogram:** Distribution of one variable (Frequency of test scores).
5.  **Choropleth:** Maps/Geospatial data (Coloring regions by value).

### Chart Hygiene
*   **Title:** What is this chart?
*   **Labels:** What are the axes? (Include units! $ or kg)
*   **Legend:** If multiple series.

In [12]:
import plotly.express as px

# Bar Chart
# px.bar(df, x='Category', y='Sales', title='Sales by Category')

# Line Chart
# px.line(df, x='Date', y='Sales', title='Sales Over Time')

# Scatter Plot
# px.scatter(df, x='Height', y='Weight', title='Height vs Weight')

# Histogram
# px.histogram(df, x='Scores', nbins=10, title='Score Distribution')

# 6. APIs & JSON

**API (Application Programming Interface):** A waiter. You (the Client) order from the Menu (the available requests), the waiter takes the order to the Kitchen (the Server) and brings the data back to you.

**JSON (JavaScript Object Notation):** The language of APIs. Once parsed in Python, it's just nested Dictionaries and Lists.

### Parsing JSON Example
```python
data = {
  "results": [{ "name": "Jimmy McMillan", "first_file_date": "2010-01-01" }]
}
```
**Goal:** Get "Jimmy McMillan".

In [None]:
data = {
  "results": [{ "name": "Jimmy McMillan", "first_file_date": "2010-01-01" }]
}

# 1. 'results' is a key. Value is a LIST.
# 2. Access the first item in the list [0].
# 3. That item is a DICT. Access key 'name'.

name = data['results'][0]['name']
print(name)
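
In practice JSON usually arrives as text, not as a ready-made dict. The sketch below assumes a made-up response string (`raw_text`, with a hypothetical second record added) and shows `json.loads`, looping over every result, and `.get()` for keys that might be missing.

```python
import json

# A made-up stand-in for the text an API would send back
raw_text = '''
{
  "results": [
    {"name": "Jimmy McMillan", "first_file_date": "2010-01-01"},
    {"name": "Example Person"}
  ]
}
'''

data = json.loads(raw_text)          # text -> dict

# Loop over every item instead of hard-coding [0]
names = [item['name'] for item in data['results']]
print(names)    # ['Jimmy McMillan', 'Example Person']

# .get() returns a default when a key is missing, instead of raising KeyError
dates = [item.get('first_file_date', 'unknown') for item in data['results']]
print(dates)    # ['2010-01-01', 'unknown']
```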

# 7. Markdown & Jupyter

**Markdown:** A lightweight markup language for formatting text.
*   `# Header 1`, `## Header 2`
*   `**Bold**`, `*Italic*`
*   `[Link Text](URL)`
*   `- Bullet Point`

**Jupyter:**
*   **Kernel:** The computation engine (Python) running in the background. If code hangs, restart the kernel.
*   **Cell Types:** Code (runs Python) vs Markdown (text).