# Exercise 1: Gradebook Analyzer (NumPy-only)

## Problem Statement
Given a scores matrix of shape `(num_students, num_assignments)`, compute:
1. Per-student average score, and a ranking of students by that average.
2. Per-assignment mean and standard deviation.
3. A **z-score curve** per student (normalize each student's vector) and a **global curve** (normalize across all scores).
4. A simple grade curve to target mean=75 and std=10 (vectorized).
5. Compare vectorized per-assignment means with a naive Python loop to measure the speedup.

**Constraints:** Use only NumPy. No pandas / scikit-learn.

## Approach
- Create synthetic scores with a realistic distribution (clipped normal 0..100).
- Use axis-wise reductions (`mean`, `std`), broadcasting for z-scores, and `argsort` for ranking.
- For performance, compare `scores.mean(axis=0)` against a manual Python loop.


## Deliverables
- Printed per-assignment stats, top-5 students with averages, and timing comparison.
- Arrays: `student_avg`, `assign_mean`, `assign_std`, `z_per_student`, `z_global`, `curved_scores`.

In [1]:
import numpy as np
from time import perf_counter

# ----- A) Synthetic data -----
rng = np.random.default_rng(123)
num_students = 500
num_assignments = 12

# base ability per student + assignment difficulty + noise
ability = rng.normal(0.0, 10.0, size=(num_students, 1))           # student effect
difficulty = rng.normal(0.0, 5.0, size=(1, num_assignments))       # assignment effect
noise = rng.normal(70.0, 12.0, size=(num_students, num_assignments))# general level

scores = ability + noise - difficulty
scores = np.clip(scores, 0, 100).astype(np.float32)

# ----- B) Core metrics -----
student_avg = scores.mean(axis=1)                      # per-student average
assign_mean = scores.mean(axis=0)                      # per-assignment mean
assign_std = scores.std(axis=0, ddof=0)                # population std per assignment

# z-score per student vector (center each student's row)
row_mean = scores.mean(axis=1, keepdims=True)
row_std = scores.std(axis=1, ddof=0, keepdims=True)
z_per_student = np.divide(scores - row_mean, row_std, out=np.zeros_like(scores), where=row_std>0)

# global z-score across all entries
global_mean = scores.mean()
global_std = scores.std(ddof=0)
z_global = (scores - global_mean) / (global_std + 1e-12)

# Curve to target mean=75, std=10 (global)
target_mean, target_std = 75.0, 10.0
curved_scores = (z_global * target_std + target_mean).astype(np.float32)
curved_scores = np.clip(curved_scores, 0, 100)

# ----- C) Ranking -----
top5_idx = np.argsort(student_avg)[-5:][::-1]

# ----- D) Performance comparison -----
t0 = perf_counter()
means_vec = scores.mean(axis=0)
t1 = perf_counter()

t2 = perf_counter()
means_loop = np.zeros(num_assignments, dtype=np.float64)
for j in range(num_assignments):
    s = 0.0
    for i in range(num_students):
        s += float(scores[i, j])
    means_loop[j] = s / num_students
t3 = perf_counter()

# ----- E) Results -----
print('Per-assignment mean (first 5):', np.round(assign_mean[:5], 2))
print('Per-assignment std  (first 5):', np.round(assign_std[:5], 2))
print('\nTop 5 students by average:')
for rank, idx in enumerate(top5_idx, 1):
    print(f'  #{rank}: student {idx:3d}  avg={student_avg[idx]:.2f}')

print('\nTiming:')
print(f'  Vectorized mean: {(t1 - t0):.6f} s')
print(f'  Loop mean:       {(t3 - t2):.6f} s')
print(f'  Speedup:         {(t3 - t2)/(t1 - t0 + 1e-12):.1f}x')

# Sanity check equality
print('\nMax abs diff (vec vs loop):', float(np.max(np.abs(means_vec - means_loop))))

Per-assignment mean (first 5): [81.13 68.75 66.54 65.3  62.05]
Per-assignment std  (first 5): [13.55 15.78 14.49 15.74 15.37]

Top 5 students by average:
  #1: student 210  avg=93.02
  #2: student 436  avg=91.86
  #3: student 312  avg=91.72
  #4: student 107  avg=91.57
  #5: student 223  avg=91.27

Timing:
  Vectorized mean: 0.000197 s
  Loop mean:       0.002300 s
  Speedup:         11.7x

Max abs diff (vec vs loop): 5.7579040529276426e-05


The walkthrough below covers what each block does, with shapes and NumPy concepts called out.

# Imports

```python
import numpy as np
from time import perf_counter
```

* `numpy` for vectorized array math.
* `perf_counter()` is a high-resolution timer for benchmarking small code sections.

---

# A) Synthetic data

```python
rng = np.random.default_rng(123)
num_students = 500
num_assignments = 12
```

* `default_rng(123)` creates a reproducible random generator.
* We’ll model a 500×12 “gradebook” (students × assignments).

```python
# base ability per student + assignment difficulty + noise
ability = rng.normal(0.0, 10.0, size=(num_students, 1))            # (500, 1)
difficulty = rng.normal(0.0, 5.0, size=(1, num_assignments))        # (1, 12)
noise = rng.normal(70.0, 12.0, size=(num_students, num_assignments))# (500, 12)
```

* Three components:

  * **ability**: per-student offset (column vector). Same value broadcast across that student’s 12 assignments.
  * **difficulty**: per-assignment offset (row vector). Same value broadcast across all students for that assignment.
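  * **noise**: drawn from a normal distribution with mean 70 and standard deviation 12, so despite its name this term carries the class-wide baseline as well as the per-score random spread.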

```python
scores = ability + noise - difficulty    # broadcasting → (500, 12)
scores = np.clip(scores, 0, 100).astype(np.float32)
```

* **Broadcasting** combines shapes `(500,1) + (500,12) - (1,12)` → `(500,12)`.
* Clip to `[0, 100]` like real scores; store as `float32`, which halves memory relative to NumPy's default `float64`.
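
The broadcasting step can be checked on a much smaller gradebook. This is a minimal sketch with made-up values (3 students, 4 assignments, and a constant `base` standing in for the random noise term); the names `base` and `scores_demo` are illustrative and not part of the exercise code.

```python
import numpy as np

# Hypothetical tiny gradebook: 3 students, 4 assignments.
ability = np.array([[5.0], [0.0], [-5.0]])      # (3, 1): one offset per student
difficulty = np.array([[1.0, 2.0, 3.0, 4.0]])   # (1, 4): one offset per assignment
base = np.full((3, 4), 70.0)                    # (3, 4): constant stand-in for the noise term

# (3, 1) + (3, 4) - (1, 4) broadcasts to (3, 4)
scores_demo = ability + base - difficulty
print(scores_demo.shape)   # (3, 4)
print(scores_demo)

# np.broadcast_shapes reports the result shape without doing any arithmetic.
print(np.broadcast_shapes((3, 1), (3, 4), (1, 4)))   # (3, 4)
```

Each row shares one ability offset and each column shares one difficulty offset, which is the same structure the full `(500, 12)` matrix has.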

---

# B) Core metrics

```python
student_avg = scores.mean(axis=1)        # (500,) mean across assignments
assign_mean = scores.mean(axis=0)        # (12,)  mean across students
assign_std  = scores.std(axis=0, ddof=0) # population std per assignment
```

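* `axis=1`: collapse the assignment axis (average across the columns of each row) → per-student average, shape `(500,)`.
* `axis=0`: collapse the student axis (reduce down the rows of each column) → per-assignment statistics, shape `(12,)`.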
* `ddof=0` means **population** std (not sample). (Sample std would use `ddof=1`.)
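
To make the axis convention and the `ddof` choice concrete, here is a small sketch on a hand-written 2×3 matrix. The values and the name `m` are assumptions chosen so the results are easy to verify by hand.

```python
import numpy as np

# Hypothetical scores: 2 students (rows) x 3 assignments (columns).
m = np.array([[80.0, 90.0, 100.0],
              [60.0, 70.0,  80.0]])

per_student = m.mean(axis=1)      # collapses columns -> [90., 70.]
per_assignment = m.mean(axis=0)   # collapses rows    -> [70., 80., 90.]

pop_std = m.std(axis=0, ddof=0)      # divides by N     -> [10., 10., 10.]
sample_std = m.std(axis=0, ddof=1)   # divides by N - 1 -> about [14.14, 14.14, 14.14]

print(per_student, per_assignment)
print(pop_std, sample_std)
```

With only two students per column the gap between population and sample std is large; with 500 students it is much smaller.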

---

# Row-wise (per-student) z-scores

```python
row_mean = scores.mean(axis=1, keepdims=True)           # (500,1)
row_std  = scores.std(axis=1, ddof=0, keepdims=True)    # (500,1)
z_per_student = np.divide(
    scores - row_mean, row_std,
    out=np.zeros_like(scores),
    where=row_std > 0
)
```

* Normalize each student’s vector to mean 0, std 1.
* `keepdims=True` keeps shapes compatible for broadcasting `(500,12) - (500,1)`.
* `np.divide(..., where=row_std>0)` avoids divide-by-zero if a student’s scores are all identical; those rows become zeros via `out=`.
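
The zero-variance case is unlikely to occur in the random data above, so the following sketch forces it with a hand-made constant row. The array `demo` and its values are assumptions for illustration only.

```python
import numpy as np

# Hypothetical data: the second student scored 85 on everything (std == 0).
demo = np.array([[70.0, 80.0, 90.0],
                 [85.0, 85.0, 85.0]], dtype=np.float32)

mu = demo.mean(axis=1, keepdims=True)              # (2, 1)
sigma = demo.std(axis=1, ddof=0, keepdims=True)    # (2, 1); second entry is 0

# Where sigma == 0 the division is skipped and the value from `out` (0.0) is kept.
z = np.divide(demo - mu, sigma, out=np.zeros_like(demo), where=sigma > 0)

print(z)   # first row is about [-1.22, 0., 1.22]; second row is all zeros
```

Without `out=`, the skipped positions would hold uninitialized memory, which is why the pre-filled zeros array matters.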

---

# Global z-scores

```python
global_mean = scores.mean()            # scalar
global_std  = scores.std(ddof=0)       # scalar
z_global = (scores - global_mean) / (global_std + 1e-12)
```

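* Normalize **all** scores with the single global mean/std, so differences between students are preserved (unlike the row-wise version, which gives every student mean 0). The `1e-12` guards against division by zero if every score were identical.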
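
A short sketch of how global normalization differs from the row-wise version: the whole matrix ends up with mean 0 and std 1, but individual students keep nonzero averages. The seed, shape, and the names `demo` and `z_all` are assumptions, not taken from the exercise.

```python
import numpy as np

rng = np.random.default_rng(0)                    # arbitrary seed for the illustration
demo = rng.normal(70.0, 12.0, size=(50, 6))       # hypothetical 50 x 6 gradebook

z_all = (demo - demo.mean()) / demo.std(ddof=0)   # one mean/std for every entry

print(z_all.mean(), z_all.std())   # approximately 0 and 1 over all entries
print(z_all.mean(axis=1)[:3])      # per-student means stay nonzero in general
```

This is why a global curve can still be used for ranking, while per-student z-scores only describe which assignments were strong or weak for that student.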

---

# Curving to a target scale

```python
target_mean, target_std = 75.0, 10.0
curved_scores = (z_global * target_std + target_mean).astype(np.float32)
curved_scores = np.clip(curved_scores, 0, 100)
```

* Standard curving: rescale the z-scores to the desired mean/std, then clip to the scoring range. Any value that actually gets clipped can pull the final mean and std slightly away from the targets.
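
The effect of the rescale and the clip can be inspected separately. The sketch below uses synthetic data with an arbitrary seed; `raw`, `curved`, and `clipped` are illustrative names, and the target of 75/10 matches the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)   # arbitrary seed for the illustration
raw = np.clip(rng.normal(65.0, 15.0, size=(200, 8)), 0, 100)

z = (raw - raw.mean()) / raw.std(ddof=0)
curved = z * 10.0 + 75.0              # before clipping: mean 75, std 10 up to rounding
clipped = np.clip(curved, 0, 100)     # after clipping: may drift if anything was cut off

print(curved.mean(), curved.std())
print(clipped.mean(), clipped.std())
print('entries changed by clipping:', np.count_nonzero(curved != clipped))
```

Comparing the two printed mean/std pairs shows how much, if at all, the clip moved the distribution away from the target.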

---

# Ranking

```python
top5_idx = np.argsort(student_avg)[-5:][::-1]
```

* `argsort` returns indices sorted by value (ascending). Take last 5 (largest), then reverse → **top-5 students** by average.
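
The same top-k selection, plus a full rank for every student, can be sketched on a small hand-written array. `np.argpartition` is shown as an alternative that avoids a full sort; it is not what the exercise code uses, and the values in `avg` are made up.

```python
import numpy as np

avg = np.array([71.5, 88.0, 64.2, 93.1, 79.9, 85.4])   # hypothetical student averages
k = 3

# Approach from the exercise: full ascending sort, take the last k, reverse.
top_sorted = np.argsort(avg)[-k:][::-1]
print(top_sorted)   # [3 1 5]

# Alternative: argpartition finds the k largest in no particular order,
# then only those k candidates are sorted.
cand = np.argpartition(avg, -k)[-k:]
top_part = cand[np.argsort(avg[cand])[::-1]]
print(top_part)     # [3 1 5]

# Rank of every student (1 = best): invert the descending sort order.
order = np.argsort(-avg)
ranks = np.empty_like(order)
ranks[order] = np.arange(1, len(avg) + 1)
print(ranks)        # [5 2 6 1 4 3]
```

For 500 students the full sort is cheap, so `argpartition` only starts to matter for much larger arrays.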

---

# Performance comparison

```python
t0 = perf_counter()
means_vec = scores.mean(axis=0)     # vectorized mean per assignment
t1 = perf_counter()
```

* Fast, compiled path: one call reduces all 500 rows for each of the 12 columns.

```python
t2 = perf_counter()
means_loop = np.zeros(num_assignments, dtype=np.float64)
for j in range(num_assignments):
    s = 0.0
    for i in range(num_students):
        s += float(scores[i, j])
    means_loop[j] = s / num_students
t3 = perf_counter()
```

* Slow, pure-Python nested loops doing the same math. Same complexity on paper, but Python loop overhead dominates. The accumulator `s` is a Python float (double precision) and `means_loop` is `float64`, so the loop sums in higher precision than the `float32` input.

---

# Results

```python
print('Per-assignment mean (first 5):', np.round(assign_mean[:5], 2))
print('Per-assignment std  (first 5):', np.round(assign_std[:5], 2))

print('\nTop 5 students by average:')
for rank, idx in enumerate(top5_idx, 1):
    print(f'  #{rank}: student {idx:3d}  avg={student_avg[idx]:.2f}')
```

* Displays basic stats and the top-5 leaderboard.

```python
print('\nTiming:')
print(f'  Vectorized mean: {(t1 - t0):.6f} s')
print(f'  Loop mean:       {(t3 - t2):.6f} s')
print(f'  Speedup:         {(t3 - t2)/(t1 - t0 + 1e-12):.1f}x')
```

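* Reports absolute times and speedup (loop / vectorized). The tiny epsilon avoids a zero-division if the vectorized path is extremely fast. A single `perf_counter` measurement of a sub-millisecond call is noisy, so treat the ratio as indicative rather than exact.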
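
For steadier numbers, one option is to repeat each measurement and keep the best run. This sketch uses the standard-library `timeit.repeat`; the helper `loop_means`, the array `demo`, and the repeat counts are assumptions chosen for illustration, not part of the exercise.

```python
import numpy as np
from timeit import repeat

rng = np.random.default_rng(123)
demo = rng.normal(70.0, 12.0, size=(500, 12)).astype(np.float32)

def loop_means(a):
    """Per-column mean using plain Python loops (same logic as the exercise)."""
    n, m = a.shape
    out = np.zeros(m, dtype=np.float64)
    for j in range(m):
        s = 0.0
        for i in range(n):
            s += float(a[i, j])
        out[j] = s / n
    return out

# Best-of-several timing: each measurement runs the call `number` times.
t_vec = min(repeat(lambda: demo.mean(axis=0), number=100, repeat=5)) / 100
t_loop = min(repeat(lambda: loop_means(demo), number=3, repeat=3)) / 3

print(f'vectorized: {t_vec:.6f} s per call')
print(f'loop:       {t_loop:.6f} s per call')
print(f'speedup:    {t_loop / t_vec:.1f}x')
```

Taking the minimum filters out runs that were slowed by unrelated system activity; the resulting speedup will still vary by machine and NumPy version.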

```python
print('\nMax abs diff (vec vs loop):', float(np.max(np.abs(means_vec - means_loop))))
```

* **Sanity check**: the two methods should agree up to small floating-point differences (float32 vs float64 accumulation and order of summation).
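
The accumulation dtype can be isolated by passing `dtype=` to `mean`. The sketch below uses its own synthetic `float32` array (the name `demo` and the uniform distribution are assumptions), so the exact difference will not match the figure printed above.

```python
import numpy as np

rng = np.random.default_rng(123)
demo = rng.uniform(0.0, 100.0, size=(500, 12)).astype(np.float32)

m32 = demo.mean(axis=0)                     # float32 input: accumulates and returns float32
m64 = demo.mean(axis=0, dtype=np.float64)   # same data, accumulated in float64

print(m32.dtype, m64.dtype)                 # float32 float64
print('max abs diff:', float(np.max(np.abs(m32 - m64))))
```

If the small discrepancy matters, passing `dtype=np.float64` to the vectorized call is one way to bring it in line with the loop.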

---

## Key NumPy ideas used

* **Broadcasting**: `(500,1)` and `(1,12)` expand to `(500,12)` automatically.
* **Axis reductions**: `mean`/`std` along rows or columns with `axis`.
* **Safe division**: `np.divide` with `where=` and `out=` to handle edge cases.
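* **Vectorization vs loops**: one call over whole arrays avoids per-element interpreter overhead (about 12x faster in the run above).
* **Dtypes**: `float32` for compact storage; the manual loop accumulates in `float64`, while `scores.mean(axis=0)` on `float32` input accumulates in `float32` unless `dtype=np.float64` is passed, which is consistent with the ~6e-05 difference in the sanity check.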

## Complexity (N = students, M = assignments)

* Data creation & stats: `O(N*M)` in vectorized C; the nested Python loop also does `O(N*M)` work but is far slower due to interpreter overhead.
* Ranking: the `argsort` over `student_avg` adds `O(N log N)`.
