# 00 - Python, NumPy, and Pandas (Beginner-Friendly, with TODOs)

Welcome to the **Loan Risk Prediction** project!  
This notebook is designed for students who may be brand-new to Python. We go **slowly** and explain **why** things work, not just **how**.

**How to use this notebook**
- Read the notes in each section.
- Run the examples.
- Complete the **TODO** blocks. They're short exercises that build muscle memory.
- If you get stuck, read the example right above the TODO again.

> Tip: You can run a cell with **Shift + Enter** in Jupyter.

## Table of Contents
1. Python Basics: Variables, Types, Printing, Input, Comments
2. Numbers, Strings, Booleans, `None`
3. Collections: Lists, Tuples, Sets, Dicts
4. Control Flow: `if`, `for`, `while`, `break`, `continue`, ternary
5. Functions: `def`, parameters, return values, docstrings, type hints
6. Modules & Imports: `import`, `from ... import ...`, `as`
7. Errors & Exceptions: try/except, raising errors, assertions
8. NumPy Essentials: arrays, dtypes, shapes, indexing, slicing, broadcasting, masking
9. Random, Aggregations & Stats: `np.random`, `np.mean`, `np.std`, etc.
10. Pandas Essentials: Series & DataFrame, reading CSV, inspecting, selecting, filtering
11. Handling Missing Data, Dtypes, Dates, String Ops
12. GroupBy, Aggregation, Sorting, `value_counts`
13. Merging/Joining DataFrames
14. Exporting, Reproducibility & Project Tips

Each section ends with **TODO** practice.

## 1) Python Basics

**Concepts**  
- **Variables** store values: `x = 5`  
- **Names**: letters, digits, underscores; **cannot** start with a digit. Names are case-sensitive (`score` and `Score` are different variables).  
- **Comments**: start with `#` and are ignored by Python.  
- **Printing**: `print(...)` displays values.  
- **Input** (optional here): `input()` reads user text (we rarely use it in data science notebooks).

**Pro tip:** In notebooks, the **last expression** in a cell is displayed automatically. Use `print()` to show output from anywhere else in the cell.
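
The naming rules above can be tried out in a short sketch. The variable names here (`score`, `Score`, `loan_2024`) are illustrative assumptions, not part of the project code.

```python
# Names are case-sensitive: these are two separate variables.
score = 72
Score = 95

# A name may contain digits and underscores, but may not start with a digit.
loan_2024 = 5000      # valid
# 2024_loan = 5000    # invalid: would raise SyntaxError if uncommented

print("score:", score, "| Score:", Score)

# type() reports what kind of value a name currently holds.
print(type(score), type(loan_2024))
```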

In [None]:
# Examples
x = 5        # integer
y = 2.5      # float
name = "Ray" # string
is_ok = True # boolean

print("x:", x, "y:", y, "name:", name, "is_ok:", is_ok)

# The last line in a cell is echoed if it's an expression:
x + y

### ✅ TODO 1 - Variables & Printing
1. Create variables: `a` (int), `b` (float), `msg` (string), `flag` (bool).
2. Print them on a single line like: `a=..., b=..., msg=..., flag=...`.
3. Try changing values and running again.

In [None]:
# TODO 1 - Your code here

## 2) Numbers, Strings, Booleans, and `None`

- **int**: whole numbers, e.g., `42`
- **float**: decimals, e.g., `3.14`
- **bool**: `True` or `False`
- **None**: "empty" or "no value yet" placeholder

**Common string operations**  
- Concatenate: `"Hello " + "World"`  
- f-string (preferred): `f"Hello {name}"`  
- Length: `len("abc")`  
- Methods: `"loan".upper()`, `" LOAN ".strip()`, `"loan-risk".replace("-", " ")`
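
As a small sketch of the string operations listed above, plus the usual way to test for `None`: the names `raw`, `clean`, and `credit_score` are made up for illustration.

```python
name = "Ray"

# Concatenation and f-strings build the same text; f-strings are usually easier to read.
greeting_concat = "Hello " + name
greeting_fstring = f"Hello {name}"
print(greeting_concat == greeting_fstring)

# strip() removes leading and trailing whitespace; len() counts characters.
raw = " LOAN "
clean = raw.strip()
print(repr(raw), len(raw), "->", repr(clean), len(clean))

# None marks "no value yet"; the idiomatic check is `is None`.
credit_score = None
if credit_score is None:
    print("credit_score has not been set")
```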

In [None]:
# Examples
pi = 3.14159
count = 12
statement = f"We have {count} samples and PI ~ {pi:.2f}"
upper = "loan".upper()
repl = "loan-risk".replace("-", " ")
length = len("default")

print(statement)
print("upper:", upper, "| repl:", repl, "| length:", length)

nothing = None
print("nothing is", nothing)

### ✅ TODO 2 - Strings & Numbers
1. Make a string with your name and favorite number, using an **f-string**.
2. Use `.upper()` and `.replace()` on any string you create.
3. Create a variable `mystery = None` and print it.

In [None]:
# TODO 2 - Your code here

## 3) Collections: Lists, Tuples, Sets, Dicts

- **List `[]`**: ordered, changeable collection. Great general container.  
  - Indexing: `nums[0]`, slicing: `nums[1:3]`, append: `nums.append(10)`
- **Tuple `()`**: ordered, **immutable** (can't change). Useful for "fixed packets" of data.
- **Set `{1, 2, 3}`**: **unique**, unordered elements; fast membership tests, set operations. An empty set is written `set()`, because `{}` creates an empty dict.
- **Dict `{key: value}`**: mapping from keys to values. Keys are usually strings.

Why care? You'll use **lists** and **dicts** constantly when massaging data **before** turning it into tables.
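
The sketch below illustrates three properties from the list above: tuple immutability, set operations, and safe dict lookup. The IDs and values are invented for the example.

```python
# Tuples are immutable: item assignment raises TypeError.
point = (10, 20)
try:
    point[0] = 99
except TypeError as err:
    print("cannot change a tuple:", err)

# Sets support fast membership tests and set operations.
approved = {101, 102, 103}
flagged = {103, 104}
print("in both:", approved & flagged)      # intersection
print("in either:", approved | flagged)    # union

# {} creates an empty dict, so use set() for an empty set.
print(type({}), type(set()))

# row[key] raises KeyError for a missing key; row.get(key, default) returns the default.
row = {"Loan ID": 123, "Amount": 5000}
print(row["Amount"], row.get("Status", "unknown"))
```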

In [None]:
# Examples
nums = [3, 1, 4]
nums.append(1)
print("list nums:", nums, "| slice nums[1:3]:", nums[1:3])

point = (10, 20)  # tuple
print("tuple point:", point)

unique = {1, 1, 2, 3}
print("set unique:", unique, "| 2 in unique?", 2 in unique)

row = {"Loan ID": 123, "Amount": 5000, "Status": "Current"}
print("dict row:", row, "| keys:", list(row.keys()))

### ✅ TODO 3 - Practice Collections
1. Make a list of three loan amounts. Append one more amount.
2. Create a set from that list (notice duplicates vanish).
3. Create a dict with keys: `"Customer ID"`, `"Term"`, `"Home Ownership"` and fill in any values.

In [None]:
# TODO 3 - Your code here

## 4) Control Flow: `if`, `for`, `while`, `break`, `continue`, ternary

- `if/elif/else` chooses a path.
- `for` loops across items.
- `while` repeats while a condition is true.
- `break` exits a loop early; `continue` skips to the next iteration.
- Ternary: `x = A if condition else B` (compact if/else for expressions).
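
Here is a brief sketch of `continue` and of a `while` loop that stops when its condition becomes false. The data and the repayment figures are made up for illustration.

```python
statuses = ["Current", "", "Defaulted", "", "Fully Paid"]  # illustrative values

kept = 0
for status in statuses:
    if status == "":
        continue      # skip blank entries and move on to the next item
    kept += 1
print("non-blank statuses:", kept)

# A while loop repeats until its condition is no longer true.
balance = 1000
months = 0
while balance > 0:
    balance -= 300    # pay 300 per month
    months += 1
print("months to pay off:", months)
```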

In [None]:
# Examples
score = 72
if score >= 90:
    grade = "A"
elif score >= 80:
    grade = "B"
elif score >= 70:
    grade = "C"
else:
    grade = "D"
print("grade:", grade)

# for loop
total = 0
for n in [1, 2, 3]:
    total += n
print("total:", total)

# while with break
i = 0
while True:
    i += 1
    if i == 3:
        break
print("i after loop:", i)

# ternary
risk = "high" if score < 70 else "low"
print("risk:", risk)

### ✅ TODO 4 - Loops & Conditions
1. Given `amounts = [500, 1200, 3000, 250]`, make a new list containing only amounts ≥ 500 using a `for` loop and `if`.
2. Compute the sum of the filtered list.
3. Use a ternary expression to set `flag = "big"` if the sum ≥ 2000 else `"small"`.

In [None]:
# TODO 4 - Your code here

## 5) Functions

Why write functions?
- Reuse logic, reduce repetition, write tests, and make your code readable.

Key pieces:
- `def name(params):` defines a function.
- **Docstring**: a `"""Describe what it does."""` string on the first line of the function body documents it for readers and for `help()`.
- **Type hints** (optional but recommended): `def add(a: int, b: int) -> int:`
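
The pieces above can be combined in one small function. `to_percent` is a hypothetical helper written for this illustration; it also shows a default parameter value.

```python
def to_percent(ratio: float, digits: int = 1) -> str:
    """Format a ratio such as 0.25 as a percentage string such as '25.0%'."""
    return f"{ratio * 100:.{digits}f}%"

print(to_percent(0.25))             # uses the default digits=1
print(to_percent(0.25, digits=2))   # overrides the default

# The docstring is stored on the function and shown by help().
print(to_percent.__doc__)

# Type hints are documentation for readers and tools; Python does not enforce them at runtime.
```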

In [None]:
def add(a: float, b: float) -> float:
    """Return the sum of a and b."""
    return a + b

print(add(1.5, 2.5))

### ✅ TODO 5 - Write a Function
Write a function `is_default_risk(income, debts)` that returns `True` if `debts > income * 0.4`, else `False`.  
Add a short docstring and type hints.

In [None]:
# TODO 5 - Your code here

## 6) Modules & Imports

- `import math`
- `from math import sqrt`
- `import numpy as np` (gives a **short alias** so you type less)

> In our project, we'll often use: `import numpy as np` and `import pandas as pd`
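
A quick sketch showing that the three import styles above reach the same objects; only the name you type changes. The alias `m` is chosen here purely for illustration.

```python
import math            # access as math.sqrt
import math as m       # same module under a shorter alias
from math import sqrt  # bring one name directly into scope

print(math.sqrt(16), m.sqrt(16), sqrt(16))

# All three names refer to the same module and the same function.
print(m is math, sqrt is math.sqrt)
```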

In [None]:
import math
from math import sqrt
print("pi:", math.pi, "| sqrt(9):", sqrt(9))

### ✅ TODO 6 - Try Imports
1. Import the `statistics` module.
2. Use `statistics.mean([1,2,3,4])`.

In [None]:
# TODO 6 - Your code here

## 7) Errors & Exceptions

- Errors stop execution unless they are handled. Use `try/except` to handle them.
- `raise ValueError("message")` to signal problems in your own code.
- `assert condition, "message"` to catch unexpected states during development.
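
The following sketch shows `raise`, `except ... as err`, and `assert` working together. `parse_term` is a hypothetical helper invented for this example, as is the assumption that valid terms are 36 or 60 months.

```python
def parse_term(text: str) -> int:
    """Convert text such as '36 months' to the integer 36 (hypothetical helper)."""
    months = int(text.split()[0])
    if months <= 0:
        # Signal bad input to the caller instead of returning a wrong value.
        raise ValueError(f"term must be positive, got {months}")
    return months

print(parse_term("36 months"))

try:
    parse_term("0 months")
except ValueError as err:
    print("caught:", err)

# assert documents an assumption; it raises AssertionError if the condition is false.
term = parse_term("60 months")
assert term in (36, 60), "unexpected term length"
```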

In [None]:
def safe_divide(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return float("inf")

print("10/2 =", safe_divide(10,2))
print("10/0 =", safe_divide(10,0))

### ✅ TODO 7 - Your Turn
Write `sqrt_nonneg(x)` that returns `math.sqrt(x)` but **raises** a `ValueError` if `x < 0`.

In [None]:
# TODO 7 - Your code here

## 8) NumPy Essentials

**What/Why:** NumPy provides fast, memory-efficient **arrays** and mathematical operations.  
We use NumPy for:
- numeric arrays and vectorized math
- broadcasting (do math across arrays of different shapes)
- masking & boolean indexing (filtering)

**Core ideas**
- Create arrays: `np.array([...])`, `np.arange`, `np.linspace`
- Shape: `arr.shape`, data type: `arr.dtype`
- Indexing/slicing: `arr[0]`, `arr[1:3]`, multi-dim: `arr[rows, cols]`
- Broadcasting: add arrays of compatible shapes automatically
- Masking: boolean arrays to filter values
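
The core ideas above also apply to two-dimensional arrays. This is a minimal sketch with made-up values; `grid` and `row_offsets` are illustrative names.

```python
import numpy as np

grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
print("shape:", grid.shape)             # (rows, columns)
print("grid[0, 1]:", grid[0, 1])        # row 0, column 1
print("first column:", grid[:, 0])      # every row, column 0

# Broadcasting: the length-3 array is added to each row of the 2x3 array.
row_offsets = np.array([10, 20, 30])
shifted = grid + row_offsets
print(shifted)

# linspace returns evenly spaced values, including both endpoints.
steps = np.linspace(0.0, 1.0, 5)
print(steps)
```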

In [None]:
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)
print("arr:", arr, "| shape:", arr.shape, "| dtype:", arr.dtype)

big = np.arange(0, 10)  # 0..9
print("big:", big)
print("slice big[2:7]:", big[2:7])

# Broadcasting
a = np.array([1, 2, 3])
b = 10
print("a + b:", a + b)

# Masking
mask = a > 1
print("mask:", mask, "| a[mask]:", a[mask])

### ✅ TODO 8 - Arrays, Shapes, Masks
1. Create an array `vals = np.array([100, 250, 400, 50])`.
2. Create a mask for values ≥ 200 and print the filtered result.
3. Reshape `np.arange(12)` into a 3×4 array and print `row 1, col 2` (0-based).

In [None]:
# TODO 8 - Your code here

## 9) Random, Aggregations & Simple Stats

- Random: `np.random.default_rng(seed)` then use methods like `.normal`, `.integers`
- Aggregations: `np.sum`, `np.mean`, `np.median`, `np.std`, `np.min`, `np.max`
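
Two details are worth seeing in action: a seeded generator gives repeatable draws, and `np.std` computes the population standard deviation unless `ddof=1` is passed. The seed `7` and the sample values are arbitrary choices for this sketch.

```python
import numpy as np

# Two generators created with the same seed produce the same sequence.
rng_a = np.random.default_rng(7)
rng_b = np.random.default_rng(7)
draws_a = rng_a.normal(loc=0, scale=1, size=4)
draws_b = rng_b.normal(loc=0, scale=1, size=4)
print("same draws:", np.array_equal(draws_a, draws_b))

values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print("mean:", np.mean(values), "| median:", np.median(values))

# np.std divides by n by default (population); ddof=1 divides by n - 1 (sample).
print("population std:", np.std(values))
print("sample std:", np.std(values, ddof=1))
```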

In [None]:
rng = np.random.default_rng(42)
samples = rng.normal(loc=0, scale=1, size=5)
print("samples:", samples)
print("mean:", np.mean(samples), "| std:", np.std(samples), "| min:", np.min(samples), "| max:", np.max(samples))

### ✅ TODO 9 - Practice Random + Stats
1. Create 100 random integers from 300 to 850 inclusive (the lower bound 300 is inclusive; the upper bound you pass, 851, is exclusive).  
2. Compute mean and standard deviation.

In [None]:
# TODO 9 - Your code here

## 10) Pandas Essentials

Pandas gives you **Series** (1D) and **DataFrames** (2D tables).  
We use Pandas to load CSVs, clean data, select/filter rows, create new columns, summarize, and export.

**Core operations you'll use a lot:**
- Read: `pd.read_csv("file.csv")`
- Inspect: `.head()`, `.tail()`, `.shape`, `.info()`, `.describe()`
- Select columns: `df[["A","B"]]` or `df["A"]`
- Filter rows: `df[df["col"] > 0]`
- Create columns: `df["new"] = df["old"] * 2`
- Rename: `df.rename(columns={"Old":"New"}, inplace=True)`
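
The operations listed above, sketched on a tiny made-up table. It uses `print` instead of `display` so it also runs outside Jupyter, and it assigns the result of `rename` instead of using `inplace=True`, which leaves the original DataFrame unchanged.

```python
import pandas as pd

loans = pd.DataFrame({
    "Loan ID": [1, 2, 3],
    "Amount": [500, 1200, 3000],
})

amounts = loans["Amount"]               # single brackets: a Series
subset = loans[["Loan ID", "Amount"]]   # double brackets: a DataFrame
large = loans[loans["Amount"] > 1000]   # boolean filter keeps matching rows

loans["Amount_k"] = loans["Amount"] / 1000   # new column from an existing one
renamed = loans.rename(columns={"Amount_k": "Amount (thousands)"})

print(type(amounts).__name__, type(subset).__name__)
print(large)
print(renamed.columns.tolist())
```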

In [None]:
import pandas as pd

# Tiny demo DataFrame
df = pd.DataFrame({
    "Loan ID":[1,2,3,4],
    "Amount":[500, 1200, 3000, 250],
    "Status":["Current", "Fully Paid", "Defaulted", "Current"]
})
display(df.head())
print("shape:", df.shape)
df.info()
display(df.describe())  # numeric columns are summarized by default

### ✅ TODO 10 - Load & Inspect
1. Create a DataFrame with columns: `Customer ID`, `Income`, `Debt`, `Home Ownership` (4 to 5 rows).
2. Show `.head()`, `.shape`, `.info()`, `.describe()`.

In [None]:
# TODO 10 - Your code here

## 11) Handling Missing Data, Dtypes, Dates, and String Ops

- Missing values use `np.nan` (for numbers) or `pd.NA`.
- Detect missing: `df.isna()`, `df["col"].isna()`
- Drop missing: `df.dropna(subset=["col1","col2"])`
- Fill missing: `df["col"].fillna(value)`
- Dtypes: `df.dtypes`, `df["col"] = df["col"].astype("int64")` (this conversion raises an error if the column still contains missing values)
- Dates: `pd.to_datetime(df["when"])`, then `.dt.year`, `.dt.month`, etc.
- Strings: `df["col"].str.upper()`, `.str.strip()`, `.str.contains("abc")`
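
A short sketch of detecting and dropping missing values, converting a dtype afterwards, and matching text. The `apps` table and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

apps = pd.DataFrame({
    "Income": [52000, np.nan, 61000],
    "Purpose": [" Home ", "car", "HOME improvement"],
})

# isna() marks missing entries with True.
missing = apps["Income"].isna()
print(missing.tolist())

# Drop rows with missing Income first; only then is the integer conversion safe.
complete = apps.dropna(subset=["Income"]).copy()
complete["Income"] = complete["Income"].astype("int64")
print(complete.dtypes["Income"])

# Normalize the text before matching so case and stray spaces do not matter.
is_home = apps["Purpose"].str.strip().str.lower().str.contains("home")
print(is_home.tolist())
```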

In [None]:
df2 = pd.DataFrame({
    "Amount":[1000, None, 500, 2000],
    "Date":["2024-01-05", "2024/02/10", None, "03-15-2024"],
    "Purpose":["debt_consolidation", "home", "credit_card", " car "]
})
display(df2)

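# Fix dates: the strings use mixed formats, so parse each value on its own
# (format="mixed" needs pandas >= 2.0; drop it on older versions)
df2["Date"] = pd.to_datetime(df2["Date"], format="mixed", errors="coerce")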
# Clean strings
df2["Purpose"] = df2["Purpose"].str.strip().str.replace(" ", "_")
# Handle missing
df2["Amount_filled"] = df2["Amount"].fillna(df2["Amount"].median())

display(df2)
print(df2.dtypes)

### ✅ TODO 11 - Clean Columns
1. Create a DataFrame with a few missing numbers and messy strings (extra spaces/mixed case).
2. Use `.fillna()` with median for the numeric column and `.str.strip().str.lower()` for strings.
3. Convert a date column to datetime and print the `.dt.year`.

In [None]:
# TODO 11 - Your code here

## 12) GroupBy, Aggregation, Sorting, `value_counts`

- Group and summarize: `df.groupby("col")["other_col"].mean()`
- Multi-agg: `.agg(["mean","median","count"])` or dict syntax such as `.agg({"col": "mean"})`
- Sorting: `df.sort_values("col", ascending=False)`
- `value_counts()` for frequencies
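
The dict syntax mentioned above lets each column get its own aggregation. This sketch uses a made-up table, and also shows `value_counts(normalize=True)`, which returns proportions instead of counts.

```python
import pandas as pd

loans = pd.DataFrame({
    "Status": ["Current", "Defaulted", "Current", "Fully Paid", "Current"],
    "Amount": [500, 3000, 900, 1200, 250],
    "Term": [36, 60, 36, 36, 60],
})

# Dict syntax: column name -> aggregation to apply to that column.
summary = loans.groupby("Status").agg({"Amount": "mean", "Term": "max"})
print(summary)

# normalize=True turns counts into fractions of the total.
shares = loans["Status"].value_counts(normalize=True)
print(shares)
```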

In [None]:
df3 = pd.DataFrame({
    "Home Ownership":["RENT","MORTGAGE","RENT","OWN","RENT"],
    "Amount":[500, 1200, 3000, 250, 900]
})
display(df3)

grouped = df3.groupby("Home Ownership")["Amount"].agg(["count","mean","sum"])
display(grouped.sort_values("sum", ascending=False))

display(df3["Home Ownership"].value_counts())

### ✅ TODO 12 - Summaries
1. Using your DataFrame from TODO 10 or 11, compute average `Debt` by `Home Ownership`.
2. Show counts of each `Home Ownership` category.
3. Sort your table by average `Debt` descending.

In [None]:
# TODO 12 - Your code here

## 13) Merging / Joining DataFrames

- Similar to SQL joins.
- `pd.merge(left, right, on="key", how="inner")`, where `how` is typically one of `"inner"`, `"left"`, `"right"`, or `"outer"`

Use joins to bring together features from multiple tables.
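
To see where each row of a join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses two small made-up tables and also shows a right join.

```python
import pandas as pd

customers = pd.DataFrame({"Customer ID": [101, 102, 103], "Income": [60000, 45000, 80000]})
loans = pd.DataFrame({"Customer ID": [101, 103, 104], "Amount": [500, 1200, 3000]})

# Right join: keep every loan, even when the customer is unknown (Income becomes NaN).
right = pd.merge(customers, loans, on="Customer ID", how="right")
print(right)

# indicator=True labels each row as "both", "left_only", or "right_only".
audit = pd.merge(customers, loans, on="Customer ID", how="outer", indicator=True)
print(audit[["Customer ID", "_merge"]])
```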

In [None]:
customers = pd.DataFrame({
    "Customer ID":[101,102,103],
    "Income":[60000, 45000, 80000]
})
loans = pd.DataFrame({
    "Customer ID":[101,101,103,104],
    "Amount":[500, 2500, 1200, 3000]
})

inner = pd.merge(customers, loans, on="Customer ID", how="inner")
left = pd.merge(customers, loans, on="Customer ID", how="left")
outer = pd.merge(customers, loans, on="Customer ID", how="outer")

print("INNER JOIN")
display(inner)
print("LEFT JOIN")
display(left)
print("OUTER JOIN")
display(outer)

### ✅ TODO 13 - Join Practice
1. Create a `customers` table and a separate `scores` table with `Customer ID` and `Credit Score`.
2. Left-join `scores` onto `customers`. Fill missing scores with the mean score.

In [None]:
# TODO 13 - Your code here

## 14) Exporting & Project Tips

- Save to CSV: `df.to_csv("clean.csv", index=False)`
- Reproducibility:
  - Keep a **requirements.txt**
  - Use a **virtual environment** (we recommend `pyenv` + `venv` in this project)
  - Seed randomness (e.g., `rng = np.random.default_rng(42)`)
- Good habits:
  - Add comments and short docstrings.
  - Write small, testable functions.
  - Prefer **clear** over **clever**.
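
To see what `index=False` changes, the sketch below calls `to_csv` without a path, in which case pandas returns the CSV text instead of writing a file. The table is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Loan ID": [1, 2], "Amount": [500, 1200]})

# With no path argument, to_csv returns the CSV text as a string.
with_index = df.to_csv()
without_index = df.to_csv(index=False)

# The header line shows the difference: the index adds an unnamed first column.
print(repr(with_index.splitlines()[0]))
print(repr(without_index.splitlines()[0]))
```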

### ✅ TODO 14 - Save
1. Save any DataFrame you made as `demo_output.csv` (no index).
2. Read it back and show its `.head()`.

In [None]:
# TODO 14 - Your code here

---

## You made it! 🎉
If anything was confusing, leave a comment **right next to** the cell where you got stuck, and we'll clarify in the next revision.

Next up in this project: EDA on the loan dataset, then preprocessing and modeling.