# HW2B - NumPy Fundamentals

See Canvas for details on how to complete and submit this assignment.

**Note: if you are reading this on Colab, Stop! Follow the instructions on Canvas to load and run this notebook locally.**

## Introduction

This assignment marks your transition from cloud-based notebooks to local development. You'll apply the terminal skills and conda environment from HW2A to run NumPy computations on your own machine, experiencing firsthand another way that professional data scientists actually work.

Through three progressively complex problems — temperature monitoring, grade analysis, and sales performance — you'll discover why NumPy underpins nearly every Python data science tool you'll encounter. These aren't toy problems: they represent patterns you'll use repeatedly when cleaning sensor data, analyzing experimental results, or processing business metrics.

The assignment deliberately contrasts NumPy's vectorized operations with traditional Python loops. You'll measure actual performance differences on your hardware and develop intuition for when NumPy is worth using. By the end, you'll see why operations that are slow in a pure Python loop can run many times faster with NumPy.

This assignment should take 2-3 hours to complete, or 3-4 hours for graduate students, who have additional requirements.

Before submitting, ensure your notebook:

- Runs completely with "Kernel → Restart & Run All"
- Includes all required pasted outputs in markdown cells
- Uses clear variable names, includes docstrings, and is readable by others

### Learning Objectives

By completing this assignment, you will be able to:

1. **Execute NumPy operations in a local environment**
   - Navigate Jupyter Lab interface and conda environment activation
   - Troubleshoot import errors and version conflicts
   - Compare local vs. cloud development tradeoffs

2. **Manipulate arrays using NumPy's indexing paradigms**
   - Extract data using basic slicing, boolean masks, and fancy indexing
   - Distinguish operations that create views vs. copies
   - Apply `np.where()` for both location finding and conditional selection

3. **Leverage broadcasting for efficient computations**
   - Eliminate explicit loops using vectorized operations
   - Predict output shapes from broadcast operations
   - Apply broadcasting rules to normalize data across axes

4. **Measure and interpret performance characteristics**
   - Use `%timeit` to benchmark operations at different scales
   - Calculate and explain speedup factors
   - Identify the data size threshold where NumPy becomes essential

5. **Connect NumPy patterns to real applications**
   - Recognize when to use `clip()`, `percentile()`, and other domain-specific functions
   - Choose between `np.where()`, `np.select()`, and boolean indexing for filtering
   - Explain why method chaining drives NumPy's design philosophy

### Generative AI Allowance

You may use GenAI tools for brainstorming, explanations, and code sketches if you disclose it, understand it, and validate it. Your submission must represent your own work and you are solely responsible for its correctness.

### Scoring

- Problem 1: 25 pts
- Problem 2: 30 pts
- Problem 3: 15 pts
- Reflection: 10 pts

## Problems

### Problem 1: Temperature Data Analysis (Detailed)

#### Setup

Run the following code to generate synthetic hourly temperature readings for one week. Rows represent days (0=Monday through 6=Sunday), columns represent hours (0-23).

In [None]:
import numpy as np

# Provided code to generate the data
np.random.seed(42)
base_temps = 70 + 15 * np.sin(np.linspace(0, 2*np.pi, 24))  # Daily cycle
weekly_variation = np.array([0, 2, 4, 3, 1, -2, -3])[:, np.newaxis]  # Weekly trend
noise = np.random.randn(7, 24) * 3  # Random variation
temperatures = base_temps + weekly_variation + noise
temperatures = np.round(temperatures, 1)  # Round to 1 decimal
print(temperatures)

#### Task 1a: Array Properties

Report the basic properties of the temperature array: its type, number of dimensions, shape, total size, and the data type of its elements. Format your output as "property name: property value", e.g., `Data type: int8`, with each property on a separate line.
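As a minimal sketch of the attributes involved, here is how each one can be read from a small hypothetical array (`demo` is an illustration, not the assignment data):

```python
import numpy as np

# Hypothetical 2x3 array, not the assignment data
demo = np.array([[1.5, 2.5, 3.5],
                 [4.5, 5.5, 6.5]])

print(f"Type: {type(demo).__name__}")   # class of the object
print(f"Dimensions: {demo.ndim}")       # number of axes
print(f"Shape: {demo.shape}")           # length of each axis
print(f"Size: {demo.size}")             # total number of elements
print(f"Data type: {demo.dtype}")       # type of each element
```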

In [None]:
# task 1a code...


#### Task 1b: Basic Slicing

Extract and print the following subsets of data:

- All of Monday's temperatures
- The temperature at noon for every day of the week
- The weekend temperatures

Format your output as you did in Task 1a, but with the name and value on separate lines, e.g.:

```text
Monday temperatures:
<values>
```
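For reference, basic slicing on a small hypothetical array (`grid`, not the assignment data) might look like the sketch below. The comments note the shape each expression returns.

```python
import numpy as np

# Hypothetical 3x4 grid (rows = days, columns = hours), not the assignment data
grid = np.arange(12).reshape(3, 4)

first_row = grid[0]        # single row -> shape (4,)
third_col = grid[:, 2]     # all rows, one column -> shape (3,)
last_two_rows = grid[-2:]  # range of rows -> shape (2, 4)

print("First row:")
print(first_row)
print("Third column:")
print(third_col)
print("Last two rows:")
print(last_two_rows)
```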

In [None]:
# task 1b code...


#### Task 1c: Finding Extremes

Find the single hottest temperature in the dataset. Use `np.where()` to identify exactly when it *first* occurred. Report in the format: 'Hottest temperature: X°F on Day Y at Hour Z', where Y is the day number converted to its name.
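The pattern can be sketched on a small hypothetical array (`readings` and `labels` are illustrative names, not part of the assignment). With a single condition argument, `np.where()` returns one index array per dimension, with matches listed in row-major order.

```python
import numpy as np

# Hypothetical 2x3 readings with a repeated maximum, not the assignment data
readings = np.array([[3.0, 9.0, 4.0],
                     [9.0, 1.0, 2.0]])

peak = readings.max()
rows, cols = np.where(readings == peak)  # one index array per dimension

# Matches come back in row-major order, so element 0 is the first occurrence
first_row, first_col = rows[0], cols[0]
labels = ["Row A", "Row B"]  # hypothetical names for each row index
print(f"Peak {peak} first occurs in {labels[first_row]}, column {first_col}")
```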

In [None]:
# task 1c code...


#### Interpretation

Reflect on this problem and answer the following questions in the markdown cell that follows.

1. What is the relationship between `shape`, `ndim`, and `size`? Could you have predicted two of these if you only knew one?
2. Compare the slicing operations you performed. Which returned 1D vs 2D arrays and why? How does NumPy decide the dimensionality of a slice?
3. Why was `np.where` necessary to find the hottest hour? What would go wrong if you just used `temperatures.max()` alone?
4. If you wanted to safely modify Monday's temperatures without affecting the original array, what would you need to do differently?

*problem 1 interpretation here*

#### Follow-Up (Graduate Only)

Extract temperature readings for 'business hours' (9 AM - 5 PM) on weekdays only (Monday-Friday) in a single slicing operation. Then calculate and report:

1. The mean temperature during business hours.
2. The mean temperature during non-business hours on M-F.
3. Whether the difference between the two means is greater than 1 standard deviation of all temperatures.
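As a rough sketch of slicing both axes in one expression, consider a small hypothetical array (`table`, not the assignment data). The Boolean column mask used for the complementary selection is one possible approach, not the only one.

```python
import numpy as np

# Hypothetical 4x6 table, not the assignment data
table = np.arange(24).reshape(4, 6)

# One indexing expression can slice rows and columns together
block = table[:3, 2:5]   # rows 0-2, columns 2-4 (stop index is exclusive)

# Boolean column mask selecting the columns outside the block, same rows
outside = np.ones(table.shape[1], dtype=bool)
outside[2:5] = False
rest = table[:3, outside]

print(block.mean(), rest.mean(), table.std())
```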

In [None]:
# p1 follow-up code...


### Problem 2: Student Grade Analysis

#### Task 2a: Generate Base Scores

Create a 10×5 grade array where the base score for each student-assignment pair equals: `(student_number × assignment_number × 5) + 40`. For example, Student 3's score on Assignment 2 should start as `(3 × 2 × 5) + 40 = 70`.

Leverage NumPy's design to implement it in three lines of code, as follows:

1. Create a column vector (i.e., shape `(n, 1)`) called `student_nums` with values from `1` to `10` using `arange` and `reshape`.
2. Create a row vector (i.e., shape `(1, m)`) called `assignment_nums` with values from `1` to `5`.
3. Calculate `base_scores` as defined above.

Output the results as shown below.

```text
Base scores shape: (10, 5)
First student scores: [45 50 55 60 65]
Last student scores: [ 90 140 190 240 290]
```
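The column-times-row pattern can be sketched with smaller hypothetical sizes and a different formula (`col`, `row`, and `table` are illustrative names):

```python
import numpy as np

# Hypothetical sizes (3 rows, 4 columns), not the assignment's 10x5
col = np.arange(1, 4).reshape(-1, 1)   # shape (3, 1)
row = np.arange(1, 5).reshape(1, -1)   # shape (1, 4)

# Combining a (3, 1) array with a (1, 4) array yields a (3, 4) result
table = col * row * 2 + 1

print(f"Shape: {table.shape}")
print(f"First row: {table[0]}")
print(f"Last row: {table[-1]}")
```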

In [None]:
# setup code - do not change
np.random.seed(100)

# task 2a code...


#### Task 2b: Add Variation

Add random noise to make scores "realistic." Generate random values between -15 and +5 using `np.random.uniform`, then add them to base scores. Ensure no score exceeds 100 using `np.clip`. Calculate and report the minimum, maximum, mean, and standard deviation of final scores using NumPy array methods.
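A minimal sketch of the noise-then-cap pattern follows, using hypothetical values and bounds. It assumes a local `np.random.default_rng` generator for reproducibility, whereas the assignment itself seeds the legacy `np.random` interface.

```python
import numpy as np

# Assumption: a local generator with an arbitrary seed, for this sketch only
rng = np.random.default_rng(0)

base = np.array([[50.0, 98.0], [99.0, 70.0]])   # hypothetical base values
noise = rng.uniform(-2, 4, size=base.shape)     # samples from [-2, 4)
capped = np.clip(base + noise, None, 100)       # upper bound only

print(f"Min: {capped.min():.1f}, Max: {capped.max():.1f}")
print(f"Mean: {capped.mean():.1f}, Std: {capped.std():.1f}")
```

Passing `None` as the lower bound leaves small values untouched, so only the upper limit is enforced.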

In [None]:
# task 2b code...


#### Task 2c: Calculate Statistics by Axis

Calculate and display:

- Each student's average across all assignments (hint: use axis=1)
- Each assignment's average across all students (hint: use axis=0)
- The hardest assignment (lowest average)
- The number of students on the honor roll (95 or better, no rounding)

Use the provided `print_vals` function to display the results for student and assignment averages.
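As an illustration of per-axis statistics, the sketch below uses a small hypothetical array (`scores` here is 3x2, not the assignment data) and an arbitrary cutoff of 95:

```python
import numpy as np

# Hypothetical 3x2 scores (rows = people, columns = tasks), not the assignment data
scores = np.array([[90.0, 60.0],
                   [80.0, 70.0],
                   [100.0, 98.0]])

row_means = scores.mean(axis=1)   # one value per row
col_means = scores.mean(axis=0)   # one value per column
lowest_col = col_means.argmin()   # zero-based index of the smallest column mean
n_high = (row_means >= 95).sum()  # True counts as 1 when summed

print(row_means, col_means, lowest_col, n_high)
```

Note that `argmin()` returns a zero-based index, so add 1 when reporting a human-readable number.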

In [None]:
# setup code - do not change
def print_vals(array, descr):
    """prints each value in an array as a line in the format
    `descr idx: value`
    """
    for idx, val in enumerate(array):
        print(f" {descr} {idx + 1}: {val:.1f}")

# task 2c code...


#### Interpretation

Reflect on this problem and answer the following questions in the markdown cell that follows.

1. In Task 2a, what NumPy feature allowed you to multiply a (10, 1) array by a (1, 5) array? Briefly describe the process that happens "under the hood".
2. How was `np.clip` useful and where would you expect to use it in other data science applications? What questions should you ask before using it?
3. Compare using `axis=0` vs `axis=1` for the mean. How do you remember which axis does what?
4. Why is the method syntax (e.g., `data.method()`) preferred over the function syntax (e.g., `np.method(data)`) in NumPy and how does that relate to its reliance on returned values instead of in-place modification?

*problem 2 interpretation here*

#### Follow-Up (Graduate Only)

Apply a curve to normalize each assignment to a target mean of 75. Follow these steps to leverage vectorized operations in NumPy and avoid explicit loops:

1. Create a copy of your result from 2a using the `copy()` method.
2. Get the `current_means` for all assignments by specifying the correct axis for the `mean` method.
3. Calculate the `shifts` for all assignments (`75 - current_means`).
4. Add those `shifts` to the copy you created in step 1.
5. Use `clip` to ensure scores remain in `[0, 100]`.

Report the new assignment averages to verify the curve worked.
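The steps above can be sketched on a small hypothetical array with a different target (`scores` and `target` are illustrative, not the assignment data):

```python
import numpy as np

# Hypothetical 3x2 scores and target, not the assignment data
scores = np.array([[60.0, 80.0],
                   [70.0, 90.0],
                   [80.0, 100.0]])
target = 85

curved = scores.copy()                 # leave the original untouched
current_means = curved.mean(axis=0)    # one mean per column -> shape (2,)
shifts = target - current_means        # shape (2,), applied to every row
curved = (curved + shifts).clip(0, 100)

print(curved.mean(axis=0))
```

If clipping changes any values, the resulting column means may no longer equal the target exactly, which is worth checking when you verify the curve.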

In [None]:
# p2 follow-up code...


### Problem 3: Sales Data Filtering and Analysis

#### Setup

Run the following code to generate synthetic daily sales data for a retail store over one year (365 days).

In [None]:
# setup code - do not modify!
np.random.seed(200)

days = np.arange(365)
base_sales = 1000 + 300 * np.sin(2 * np.pi * days / 365 - np.pi/2)
weekly_boost = np.where(days % 7 >= 5, 1.3, 1.0)  # 30% boost on weekends
noise = np.random.normal(0, 100, 365)
daily_sales = base_sales * weekly_boost + noise
special_days = np.random.choice(365, 10, replace=False)
daily_sales[special_days] *= np.random.uniform(1.5, 2.5, 10)
daily_sales = np.round(daily_sales, 0)
daily_sales = np.maximum(daily_sales, 0)

print(f"Sales data shape: {daily_sales.shape}")
print(f"Sales range: ${daily_sales.min():.0f} to ${daily_sales.max():.0f}")
print(f"Mean daily sales: ${daily_sales.mean():.2f}")

Run the following cell to plot a histogram and a time series of the data with matplotlib.

In [None]:
import matplotlib.pyplot as plt

# Create histogram
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.hist(daily_sales, bins=30, edgecolor='black', alpha=0.7)
plt.axvline(daily_sales.mean(), color='red', linestyle='--', label=f'Mean: ${daily_sales.mean():.0f}')
plt.axvline(daily_sales.mean() + 2*daily_sales.std(), color='orange', linestyle='--', label=f'+2σ: ${daily_sales.mean() + 2*daily_sales.std():.0f}')
plt.xlabel('Daily Sales ($)')
plt.ylabel('Frequency')
plt.title('Distribution of Daily Sales')
plt.legend()
plt.grid(True, alpha=0.3)

# Also show time series to see the seasonal pattern
plt.subplot(1, 2, 2)
plt.plot(daily_sales, linewidth=0.5)
plt.xlabel('Day of Year')
plt.ylabel('Daily Sales ($)')
plt.title('Daily Sales Throughout Year')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

#### Task 3a: Identify Outliers

Use Boolean indexing to identify and analyze outlier days:

1. Create a Boolean mask for days with sales > 2 standard deviations above the mean
2. Count how many outlier days exist
3. Calculate what percentage of total annual revenue came from these outlier days
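A minimal sketch of the masking pattern, using a short hypothetical array (`values`, not the assignment data):

```python
import numpy as np

# Hypothetical values with one obvious outlier, not the assignment data
values = np.array([10.0, 12.0, 11.0, 9.0, 10.0, 11.0, 50.0, 10.0])

threshold = values.mean() + 2 * values.std()
mask = values > threshold            # Boolean array, same shape as values

n_outliers = mask.sum()              # number of True entries
share = values[mask].sum() / values.sum() * 100

print(f"Outliers: {n_outliers}")
print(f"Share of total: {share:.1f}%")
```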

In [None]:
# task 3a code...


#### Task 3b: Weekend Analysis with Fancy Indexing

Use fancy indexing to analyze weekend vs weekday performance:

0. Working from the `days` array created in setup...
1. Create an array of `weekday_indices`: indices of `days` where the day-of-week is 0-4 (Mon-Fri)
2. Create an array of `weekend_indices`: indices of `days` where the day-of-week is 5-6 (Sat-Sun)
3. Use those index arrays to extract `weekday_sales` and `weekend_sales`.
4. Report the mean, median, and maximum sales for each day type.

For steps 1 and 2, consider this guidance:

- Recall that the mod operator (`%`) gives us the remainder of a division, which is useful for conversion
- Use `days % 7` to get day-of-week (0=Mon, 1=Tue, ..., 6=Sun)
- Compare that result with various values to get a Boolean array, e.g. `days % 7 < 5` is `True` for all weekdays
- `np.where()` of a Boolean array returns a tuple. Each element of the tuple is an array of indices where the Boolean array is `True`. The tuple will contain one array for each dimension in the input data. In our case the input data is 1D, so the output is a tuple of length 1. To get the array only, extract it with indexing.
- Use the resulting array of indices for step 3.

It may be helpful to build this expression one step at a time in a series of cells, so you can observe the output of each operation and check the shape, type, etc.
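The guidance above can be sketched step by step on a hypothetical 10-element example with a different condition (`positions` and `values` are illustrative names):

```python
import numpy as np

# Hypothetical 10-element example, not the assignment data
positions = np.arange(10)
values = positions * 10

is_multiple = positions % 3 == 0     # Boolean array
result = np.where(is_multiple)       # tuple holding one index array (1D input)
idx = result[0]                      # extract the array of indices

selected = values[idx]               # fancy indexing with an integer array
print(type(result).__name__, idx, selected)
```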

In [None]:
# task 3b code...


#### Task 3c: Performance Analysis with Ufuncs

Run the following code to compare two approaches, a basic Python loop and NumPy's `np.where`, for applying dynamic pricing rules:

- 20% discount on days with sales below median
- 5% premium on days with sales above 75th percentile  
- No change for days in between

Notes:

- It may take a minute for this to run on your machine. Be patient.
- If you get an error about missing `psutil` when running this, add it to your conda environment using the methods described in the conda lecture and applied in HW2A. You may need to reopen this notebook after doing so.
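For comparison, a three-tier rule like this can also be written with `np.select`, which the learning objectives mention as an alternative. The sketch below uses hypothetical values and thresholds (`x`, `low`, `high`) and is not part of the provided benchmark.

```python
import numpy as np

# Hypothetical values and thresholds, not the assignment data
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
low, high = 10.0, 20.0

# Nested np.where: the inner call supplies the "else" branch of the outer call
nested = np.where(x < low, x * 0.8, np.where(x > high, x * 1.05, x))

# np.select: the first matching condition wins; default covers everything else
factor = np.select([x < low, x > high], [0.8, 1.05], default=1.0)
selected = x * factor

print(nested)
print(np.allclose(nested, selected))
```

With more than two or three tiers, the flat condition list in `np.select` tends to be easier to read than deeply nested `np.where` calls.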


In [None]:
# load module for timing execution and record starting time
import time
start_time = time.time()

# report information about your machine, loading required dependencies
import platform
import psutil

# Report machine stats
print("=== Machine Information ===")
print(f"Processor: {platform.processor() or platform.machine()}")
print(f"CPU cores: {psutil.cpu_count(logical=False)} physical, {psutil.cpu_count(logical=True)} logical")
print(f"RAM: {psutil.virtual_memory().total / (1024**3):.1f} GB")
print(f"Python: {platform.python_version()}")
print(f"NumPy: {np.__version__}")
print(f"Platform: {platform.platform()}")
print("="*28 + "\n")

# Calculate thresholds
median_sales = np.median(daily_sales)
q75_sales = np.percentile(daily_sales, 75)

print(f"Median sales: ${median_sales:.2f}")
print(f"75th percentile: ${q75_sales:.2f}")

# Approach 1: Pure Python loop
def apply_pricing_loop(sales):
    result = np.empty_like(sales)
    for i in range(len(sales)):
        if sales[i] < median_sales:
            result[i] = sales[i] * 0.8  # 20% discount
        elif sales[i] > q75_sales:
            result[i] = sales[i] * 1.05  # 5% premium
        else:
            result[i] = sales[i]  # no change
    return result

# Approach 2: Nested np.where
def apply_pricing_where(sales):
    # Outer np.where applies the discount; the nested np.where handles premium vs. no change
    result = np.where(sales < median_sales, 
                     sales * 0.8,  # condition true
                     np.where(sales > q75_sales,  # condition false, check next
                             sales * 1.05,  # nested true
                             sales))  # nested false
    return result

# Verify both approaches produce identical results
result_loop = apply_pricing_loop(daily_sales)
result_where = apply_pricing_where(daily_sales)

print(f"\nResults match:")
print(f"  Loop vs where: {np.allclose(result_loop, result_where)}")

# Show a sample of transformations
sample_idx = [0, 100, 200, 300]
print(f"\nSample transformations:")
for idx in sample_idx:
    orig = daily_sales[idx]
    new = result_where[idx]
    ratio = new/orig
    print(f"  Day {idx}: ${orig:.0f} → ${new:.0f} ({ratio:.2f}x)")

# Performance testing on original array
print(f"\nTiming on 365-day array:")
%timeit apply_pricing_loop(daily_sales)
%timeit apply_pricing_where(daily_sales)

# Create larger array for more dramatic comparison
np.random.seed(42)
large_sales = np.random.normal(1000, 200, 36500)
large_sales = np.maximum(large_sales, 0)  # no negative sales

print(f"\nTiming on 36,500-day array (100 years):")
%timeit apply_pricing_loop(large_sales)
%timeit apply_pricing_where(large_sales)

# Calculate speedup factors
import timeit

# Time each approach on large array (proper timing for calculation)
loop_time = timeit.timeit(
    'apply_pricing_loop(large_sales)', 
    globals=globals(), 
    number=100
) / 100

where_time = timeit.timeit(
    'apply_pricing_where(large_sales)', 
    globals=globals(), 
    number=100
) / 100

print(f"\nSpeedup factors (on 100-year data):")
print(f"  np.where is {loop_time/where_time:.1f}x faster than loop")
print(f"\nTotal cell execution time: {time.time() - start_time:.2f} seconds")

After running the code, answer these questions in the markdown block below:

1. Copy and paste your results to the [Canvas discussion board](https://auburn.instructure.com/courses/1665637/discussion_topics/10108405) and review those posted by others in the class. How long did it take for it to run on your machine (see last line of output)? Compare that result with some from your classmates. How does performance vary by hardware (see the machine info section)?
2. Compare the calculated speedup factor on the 100-year data with the speedup on the 1-year data (you will need to calculate the latter yourself). What does this tell you?
3. So far, you've only used `np.where` to locate the elements where a condition is `True`, but it is much more powerful than that. Look at how `np.where` is used in the code above and do some research on the function. Explain in your own words how the nested structure works. Does this remind you of anything in Excel?
4. Review the code provided. Identify 3 other things you learned from it.

*problem 3c answers here*

#### Interpretation

Reflect on this problem and answer the following questions in the markdown cell that follows.

1. Did you use Boolean indexing (i.e., `some_array[boolean_mask]`) to subset data in task 3a? If so, how and why? If not, how could your solution be improved / simplified with it?
2. Explain, in your own words, the process used to generate the array of indexes for "fancy indexing" in task 3b.
3. What surprised you most about the performance results in 3c?
4. How can you use the original graphs to validate your findings? Does anything seem out of line?

*problem 3 interpretation here*

## Reflection (10 points)

Address the following in a markdown cell:

1. **Local Development Experience** (3 points)
   - Describe what, if any, relevant prior experience you have with terminal commands, conda, virtual environments, etc. This will help us interpret the feedback that follows.
   - What specific challenges did you encounter moving from Colab to local Jupyter Lab, if any? How did you solve them?
   - Compare the experience: What do you miss about Colab? What do you prefer about local development?
   - Give some feedback on the style of instruction for the terminal, conda, etc. Was the material sufficient? What was missing or unclear?

2. **NumPy Conceptual Understanding** (3 points)
   - If you've used MATLAB, R, or another scientific computing platform, how does NumPy compare?
   - Which NumPy concept most changed how you think about data manipulation?
   - Based on your timing results in Problem 3c, at what data size would you say NumPy becomes "worth it"?

3. **Terminal & Environment Management** (3 points)
   - Which terminal commands did you use most frequently while working on this assignment?

4. **Meta-Learning & Feedback** (1 point)
   - Time spent: ___ (broken down by: setup/environment: ___, coding: ___, debugging: ___)
   - Most frustrating technical issue and how you resolved it: ___
   - One specific improvement for the assignment instructions: ___
   - On a scale of 1-10, how prepared do you feel to use NumPy in your own projects? Why?


*reflection here*