# DATASCI 503, Group Work 1: Introduction to Python

*Instructions:* Collaborate in two-person teams to complete the data analysis exercises below. The GSI will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. Teams will be randomly assigned by the GSIs. Teams are encouraged to talk with their GSI if they need help. Upon completion, one member of the team should submit their team's work through Canvas as HTML.

## Where to Go for Help

In lab section, we cannot cover everything you will need to know to solve the problems below. The standard way people learn Python is by trial and error and by consulting online resources. Here are some resources we suggest:

* Books such as [Python for Everybody](https://do1.dr-chuck.com/pythonlearn/EN_us/pythonlearn.pdf) can help with basic Python knowledge.
* Large language models. For example, ask ChatGPT to generate Python code that loads a file named "college_train.csv" and makes a histogram of the Books feature, then see what you get!
* Package documentation:
  * [pandas](https://pandas.pydata.org/docs/) for manipulating datasets
  * [NumPy](https://numpy.org/doc/stable/) for computing with arrays
  * [scikit-learn](https://scikit-learn.org/stable/user_guide.html) for fitting models
  * [Matplotlib](https://matplotlib.org/stable/gallery/index) for plotting
* [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html)
* Python for Data Analysis (accessible through the U-M library)
* [Stack Overflow](https://stackoverflow.com/)
* The open internet. Search and you may just find what you need.

## What Is a Jupyter Notebook?

A Jupyter notebook is a collection of "cells." There are two kinds of cells. The first kind is *markdown cells* like this one. They can have normal text, *italics*, and **bold**. If you edit the cell, you will see this is done with *Markdown* formatting conventions.

Markdown cells can also have equations such as $\frac{a}{b}$ and

$$\frac{a}{b}.$$

These equations are written using LaTeX notation. You can learn more about how to do different things in Markdown from resources such as the [Markdown Guide](https://www.markdownguide.org/basic-syntax/).

In [None]:
# The other kind of cell is a code cell.
# In code cells, if you want to write ordinary text,
# you have to put a "#" at the beginning of the line.
# All other content in code cells is interpreted as
# instructions for the computer.

3 + 5

## Using packages

In [None]:
# Import packages with aliases
import numpy as np
import pandas as pd

## Lists

In [None]:
# Create a list of integers
int_list = list(range(10))
print(int_list)
type(int_list[0])

In [None]:
# Create a list of strings
str_list = [str(item) for item in int_list]
print(str_list)
type(str_list[0])

In [None]:
# Create a list of heterogeneous elements
mixed_list = [True, "2", 3.0, 4]
[type(item) for item in mixed_list]

## Arrays

We have covered lists using built-in functions in Python. Next, let's use the `numpy` package to create arrays.

In [None]:
# Create arrays from lists
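# A notebook only displays the last expression in a cell, so print each array
print(np.array([1, 4, 2, 5, 3]))
print(np.array([3.14, 4, 2, 3]))  # Mixed ints and floats are upcast to float
print(np.array([1, 2, 3, 4], dtype="float32"))  # dtype explicitly sets the data type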

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a 3x5 floating-point array filled with ones
np.ones((3, 5), dtype=float)

In [None]:
# Create a 3x5 array filled with 3.14
np.full((3, 5), 3.14)

In [None]:
# Create a 3x3 identity matrix
np.eye(3)

In [None]:
# Create an array filled with a linear sequence
# Starting at 0, stopping before 20 (the endpoint is excluded), stepping by 2
# (this is similar to the built-in range() function)
np.arange(0, 20, 2)

In [None]:
# Create an array of five values evenly spaced between 0 and 1
np.linspace(0, 1, 5)

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

In [None]:
# Create a 3x3 array of random integers in the interval [0, 10)
np.random.randint(0, 10, (3, 3))

## Array indexing: single elements


In a one-dimensional array, the *i*th value (**counting from zero**) can be accessed by specifying the desired index in square brackets.

In [None]:
np.random.seed(0)  # seed for reproducibility

x1 = np.random.randint(10, size=6)  # One-dimensional array
x2 = np.random.randint(10, size=(3, 4))  # Two-dimensional array
x3 = np.random.randint(10, size=(3, 4, 5))  # Three-dimensional array

print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

In [None]:
# Access single elements
print(x1)
print(x1[0])  # First element in x1
print(x1[4])  # Fifth element in x1

In [None]:
# Index from the end of the array
print(x1[-1])  # Last element
print(x1[-2])  # Second-to-last element

In [None]:
print(x2)
print(x2[0, 0])
print(x2[0, 1])
print(x2[2, 0])
print(x2[2, -1])

In [None]:
# Modify values using index notation
x2[0, 0] = 12
x2

## Array slicing: subarrays

In [None]:
# One-dimensional subarrays
arr = np.arange(10)
arr

In [None]:
print(arr[:5])  # First five elements
print(arr[5:])  # Elements from index 5 onward

In [None]:
print(arr[4:7])  # Middle sub-array
print(arr[::3])  # Every third element

In [None]:
print(arr[::-1])  # All elements, reversed
print(arr[5::-2])  # Reversed every other element from index 5

In [None]:
# Multi-dimensional subarrays
print(x2[:, 0])  # First column of x2
print(x2[0, :])  # First row of x2

In [None]:
print(x2[:2, :3])  # Two rows, three columns
print(x2[:3, ::2])  # All rows, every other column

In [None]:
print(x2[::-1, ::-1])  # Reverse rows and columns together

## Arithmetic commands

In [None]:
arr = np.arange(4)
print("arr     =", arr)
print("arr + 5 =", arr + 5)
print("arr - 5 =", arr - 5)
print("arr * 2 =", arr * 2)
print("arr / 2 =", arr / 2)
print("arr // 2 =", arr // 2)  # Floor division

In [None]:
print("-arr     = ", -arr)
print("arr ** 2 = ", arr**2)  # ** operator for exponentiation
print("arr % 2  = ", arr % 2)  # % operator for modulus

In [None]:
arr = np.array([-2, -1, 0, 1, 2])
print(abs(arr))  # Built-in abs works elementwise on NumPy arrays
print(np.absolute(arr))  # Equivalent NumPy function

In [None]:
arr = [1, 2, 3]
print("arr     =", arr)
print("e^arr   =", np.exp(arr))
print("2^arr   =", np.exp2(arr))
print("3^arr   =", np.power(3, arr))

In [None]:
arr = [1, 2, 4, 10]
print("arr        =", arr)
print("ln(arr)    =", np.log(arr))
print("log2(arr)  =", np.log2(arr))
print("log10(arr) =", np.log10(arr))

In [None]:
arr_a = np.array([0, 1, 2])
arr_b = np.array([5, 5, 5])
arr_a + arr_b

In [None]:
arr_a + 5

In [None]:
# Centering
data = np.random.random((10, 3))
data_mean = data.mean(0)
data_centered = data - data_mean
data_centered.mean(0)
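
The centering cell above relies on NumPy broadcasting: `data` has shape (10, 3) and `data_mean` has shape (3,), so the column means are subtracted from every row. Here is a minimal sketch of the same idea on a small hand-written array (the names `small_data`, `col_means`, and `centered` are made up for this illustration):

```python
import numpy as np

# Hypothetical 2x3 array used only for illustration
small_data = np.array([[1.0, 2.0, 3.0],
                       [3.0, 6.0, 9.0]])

col_means = small_data.mean(axis=0)  # shape (3,): one mean per column
centered = small_data - col_means    # (2, 3) minus (3,) broadcasts across rows

print(col_means)              # [2. 4. 6.]
print(centered)
print(centered.mean(axis=0))  # each column now averages to 0
```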

In [None]:
np.random.seed(1)
random_vals = np.random.random(100)
print(sum(random_vals))  # Built-in sum
print(np.sum(random_vals))  # NumPy sum

In [None]:
big_array = np.random.random(1000000)
print(min(big_array), max(big_array))  # Built-in min and max
print(np.min(big_array), np.max(big_array))  # NumPy min and max

In [None]:
matrix = np.random.random((3, 4))

# Find the minimum value within each column by specifying axis=0
matrix.min(axis=0)

In [None]:
# Find the variance within each row
matrix.var(axis=1)

## Iterations and traversals

A `for` loop iterates over the items of an iterable, such as a list, tuple, dictionary, set, or string.

In [None]:
fruits = ["apple", "banana", "cherry"]
for fruit in fruits:
    print("I have one", fruit)

In [None]:
# Notice the zero indexing again here
for i in range(5):
    print(i)
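
The same `for` syntax works on other kinds of collections. As a small sketch (the names `prices`, `word`, and `letters` are made up for this illustration), looping over a dictionary yields its keys, `.items()` yields key-value pairs, and looping over a string yields its characters:

```python
prices = {"apple": 0.5, "banana": 0.25}

# Iterating over a dictionary visits its keys
for name in prices:
    print(name)

# .items() gives (key, value) pairs that can be unpacked in the loop header
for name, price in prices.items():
    print(name, "costs", price)

# Iterating over a string visits one character at a time
word = "hi"
letters = []
for letter in word:
    letters.append(letter)
print(letters)  # ['h', 'i']
```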

## Basic dataset manipulation

In [None]:
# Use the pandas library to load a dataset
iris = pd.read_csv("./data/iris_test.csv")
iris

In [None]:
# Use NumPy to calculate the median of one of the columns
np.median(iris["sepal length (cm)"])

In [None]:
# Create a transformed version of one of the columns
iris["doublesepal"] = iris["sepal length (cm)"] * 2

# And look at the first 4 rows of the modified dataframe
iris.iloc[:4]

In [None]:
# Create a new series that is True where species is virginica and False elsewhere
is_virginica = iris["Species"] == "virginica"

# Look at first 4 rows
is_virginica.iloc[:4]

In [None]:
# Use NumPy to count how many times the species is "virginica"
np.sum(is_virginica)

In [None]:
# Can also do this more directly without creating a new variable
np.sum(iris["Species"] == "virginica")

In [None]:
# Subset dataframe to include only samples that are virginica
new_dataframe = iris.loc[is_virginica]
new_dataframe

## Group Work Problems

During weekly lab sections, students will collaborate to complete data analysis exercises in the Jupyter notebooks provided. GSIs will help individual teams encountering difficulty, make announcements addressing common issues, and help ensure progress for all teams. Initially, teams will be randomly assigned by the GSIs, but students will later be invited to choose their own teammates. Cross-team collaboration is discouraged; teams are instead encouraged to talk with their GSI if they need help.

Some weeks the group-work assignment can be completed entirely during lab section, in which case students are expected to submit their work by the end of the lab or shortly thereafter. Other weeks, the group-work assignments initiated during lab section will be due several days later.

The purpose of this format is to teach students about teamwork and collaborative data science practices, which are increasingly important in the industry. Working in a group setting can enhance motivation for some students, while others will benefit from the experience of serving as mentors.

To accommodate short-term illness and occasional scheduling conflicts, the lowest 3 out of 11 group-work scores will be dropped.

We intend for students to score near 100% on every group-work assignment: Group Work is an opportunity for everyone to gain initial familiarity with the relevant material, whatever its difficulty level.

**Note:** The test cases dictate which function and variable names you should use.

## Function Writing

---

**Problem 1:** Gross Pay Calculator

Write a function `get_gross_pay` that computes gross pay given the hours worked and the rate per hour.

**Note:** Observe how the variable names are well-chosen. They explain their role in the function, so no one has to guess. Please be mindful when making variable names so people reading your code know what they represent.

In [None]:
def get_gross_pay(hours_worked, rate_per_hour):
    # BEGIN SOLUTION
    return hours_worked * rate_per_hour
    # END SOLUTION

When you finish your implementation, check your work on the test cases below. Assert statements check a boolean condition before further code execution is allowed to proceed. Your code will be expected to pass a variety of test cases to ensure you have a proper solution.
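
As a rough illustration of how `assert` behaves, the sketch below uses a hypothetical helper, `check_positive`, that is not part of the assignment. A true condition lets execution continue, while a false one raises an `AssertionError` carrying the message after the comma:

```python
def check_positive(value):
    # Hypothetical helper: raises AssertionError with the message if value <= 0
    assert value > 0, "value should be positive"
    return value


check_positive(3)  # Condition is True, so execution simply continues

message = None
try:
    check_positive(-1)  # Condition is False, so an AssertionError is raised
except AssertionError as error:
    message = str(error)
    print("Assertion failed:", message)
```

In the test cells of this notebook the error is not caught, so a failing assertion stops the cell and shows its message.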

In [None]:
# Test assertions
assert get_gross_pay(40, 20) == 800, "40 hours at $20/hr should be $800"
assert get_gross_pay(30, 10) == 300, "30 hours at $10/hr should be $300"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert get_gross_pay(0, 100) == 0, "0 hours should result in $0 pay"
assert get_gross_pay(10, 0) == 0, "Rate of $0/hr should result in $0 pay"
assert get_gross_pay(1, 1) == 1, "1 hour at $1/hr should be $1"
# END HIDDEN TESTS

---

**Problem 2:** Gross Pay with Overtime

Write a new version of your `get_gross_pay` function that gives the employee 1.5 times the hourly rate for hours worked above 40 hours.

**Note:** This function should have the same name as Problem 1, so it will replace the previous version.

In addition to the test cases provided, consider: what if someone worked 80 hours a week at $10/hour? How much would they make? Include your answer in the markdown cell below.

In [None]:
def get_gross_pay(hours_worked, rate_per_hour):
    # BEGIN SOLUTION
    # Calculate pay with overtime: 1.5x rate for hours above 40
    if hours_worked > 40:
        regular_pay = 40 * rate_per_hour
        overtime_pay = (hours_worked - 40) * rate_per_hour * 1.5
        return regular_pay + overtime_pay
    else:
        return hours_worked * rate_per_hour
    # END SOLUTION

In [None]:
# Test assertions
assert get_gross_pay(0, 1000) == 0, "0 hours should result in $0 pay"
assert get_gross_pay(2, 10) == 20, "2 hours at $10/hr should be $20"
assert get_gross_pay(41, 10) == 415, "41 hours at $10/hr should be $415 (40*10 + 1*15)"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert get_gross_pay(40, 10) == 400, "Exactly 40 hours should have no overtime"
assert get_gross_pay(50, 20) == 1100, "50 hours at $20/hr should be $1100"
assert get_gross_pay(80, 10) == 1000, "80 hours at $10/hr should be $1000"
# END HIDDEN TESTS

> BEGIN SOLUTION

For someone working 80 hours at \$10/hour:
- Regular pay (first 40 hours): 40 x \$10 = \$400
- Overtime pay (remaining 40 hours at 1.5x): 40 x \$15 = \$600
- Total: \$400 + \$600 = \$1,000
> END SOLUTION


---

**Problem 3:** Power of Two Check

Write a function `is_power_of_2` that takes an integer as its only argument and returns a Boolean indicating whether or not the input is a power of 2. Note that only positive integers can be powers of 2 (the function should return `False` for zero and negative numbers).

You may not use the built-in `math.log` or `math.sqrt` functions in your solution. You should need only the division and modulus (`%`) operations.

**Hint:** Think about what happens when you repeatedly divide a power of 2 by 2. What value do you end up with?
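
To see the hint in action, the sketch below records what repeated halving does to a number. The helper `halving_trace` is hypothetical and exists only for this illustration; it is not the function you are asked to write.

```python
def halving_trace(number):
    """Return the values visited while dividing by 2 as long as the number is even."""
    trace = [number]
    # The number > 0 guard avoids looping forever on 0, which is always even
    while number > 0 and number % 2 == 0:
        number = number // 2
        trace.append(number)
    return trace


print(halving_trace(16))  # [16, 8, 4, 2, 1]
print(halving_trace(12))  # [12, 6, 3]
```

Compare where the two traces stop: the power of 2 and the non-power of 2 end on different values.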

In [None]:
def is_power_of_2(number):
    # BEGIN SOLUTION
    # Edge case: powers of 2 must be positive
    if number <= 0:
        return False
    # Repeatedly divide by 2 while the number is even
    while number % 2 == 0:
        number = number // 2
    # If we end up with 1, it was a power of 2
    return number == 1
    # END SOLUTION

In [None]:
# Test assertions
assert is_power_of_2(1) is True, "1 is 2^0, so it is a power of 2"
assert is_power_of_2(2) is True, "2 is 2^1, so it is a power of 2"
assert is_power_of_2(3) is False, "3 is not a power of 2"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert is_power_of_2(0) is False, "0 is not a power of 2"
assert is_power_of_2(16) is True, "16 is 2^4"
assert is_power_of_2(1024) is True, "1024 is 2^10"
assert is_power_of_2(100) is False, "100 is not a power of 2"
# END HIDDEN TESTS

---

**Problem 4:** Testing Power of Two

When writing test cases for your code, there is a solid mantra: "test none, test one, test many." Following this principle, write test cases for your `is_power_of_2` function.

In the markdown cell below, outline your testing approach. Then in the code cell, write your test cases. Use boolean expressions to check if the function returns the expected values (e.g., `is_power_of_2(4) is True`).

**Hint:** Think about edge cases your program may need to handle given the function specification.
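
As one possible reading of the "test none, test one, test many" mantra, here is a sketch applied to an unrelated function. The function `count_even` is hypothetical and is not part of the assignment:

```python
def count_even(values):
    """Count how many integers in a list are even (hypothetical example function)."""
    return sum(1 for value in values if value % 2 == 0)


# Test none: an empty list has no even numbers
assert count_even([]) == 0

# Test one: a single element, covering both an even and an odd case
assert count_even([4]) == 1
assert count_even([7]) == 0

# Test many: several elements, including zero and a negative number
assert count_even([0, 1, 2, 3, -4]) == 3

print("All tests passed!")
```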

> BEGIN SOLUTION

Test approach for `is_power_of_2`:

1. **Test none (edge cases):**
   - Test 0 (should return False - not a valid power of 2)
   - Test negative numbers like -5 (should return False)

2. **Test one:**
   - Test 1 (which is 2^0, should return True)
   - Test 2 (which is 2^1, should return True)

3. **Test many:**
   - Test larger powers of 2 like 2^10 = 1024 and 2^20 = 1048576
   - Test numbers that are not powers of 2 like 3, 17, 100
> END SOLUTION


In [None]:
# BEGIN SOLUTION
student_tests_passed = True

# Test edge cases (test none)
student_tests_passed = student_tests_passed and (is_power_of_2(0) is False)  # Zero
student_tests_passed = student_tests_passed and (is_power_of_2(-5) is False)  # Negative numbers

# Test the smallest powers of 2 (test one)
student_tests_passed = student_tests_passed and (is_power_of_2(1) is True)  # 1 is 2^0
student_tests_passed = student_tests_passed and (is_power_of_2(2) is True)  # 2 is 2^1

# Test larger values (test many)
student_tests_passed = student_tests_passed and (is_power_of_2(2**20) is True)  # 2^20
student_tests_passed = student_tests_passed and (is_power_of_2(17) is False)  # Not a power of 2
# END SOLUTION

In [None]:
# Test assertions
assert is_power_of_2(1) is True, "1 should be power of 2"
assert is_power_of_2(3) is False, "3 should not be power of 2"
assert student_tests_passed is True, "Your tests should all pass"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert is_power_of_2(64) is True, "64 should be power of 2"
assert is_power_of_2(15) is False, "15 should not be power of 2"
# END HIDDEN TESTS

## Working with Datasets

---

**Problem 5:** Dataset Summary

You may have noticed in the lab we were able to calculate a median for a variable of interest. However, what if we wanted to generate a five-number summary for all the variables for some quick exploratory data analysis (EDA)? Write code that provides a summary of all the variables in the iris dataset.

You may use generative AI, Google, or the [pandas documentation](https://pandas.pydata.org/docs/) to complete this part.

**Hint:** Look for a pandas DataFrame method that provides descriptive statistics.

In [None]:
# BEGIN SOLUTION
# Use describe() to get summary statistics for all numeric columns
iris.describe()
# END SOLUTION

In [None]:
# Test assertions
desc = iris.describe()
assert "mean" in desc.index, "describe() should include mean"
assert len(desc.columns) > 0, "describe() should return statistics for columns"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "std" in desc.index, "describe() should include std"
assert "50%" in desc.index, "describe() should include median (50%)"
# END HIDDEN TESTS

---

**Problem 6:** Summary Table Observations

What do you notice about the variables that were included in the description table? Answer in the markdown cell below.

> BEGIN SOLUTION

The `describe()` method only includes numeric columns in its summary. The "Species" column, which contains categorical string data (setosa, versicolor, virginica), is not included in the summary statistics. This is because operations like mean, standard deviation, and percentiles are only meaningful for numeric data.
> END SOLUTION


---

**Problem 7:** Dataset Filtering

The iris dataset covers a broader range of samples than I would like. Please filter the dataset so that it keeps only rows with the following properties:

1. `sepal length (cm)` of at least 6.9
2. `petal width (cm)` of at least 2.0
3. `petal length (cm)` greater than 5.1 but less than 6.1

Store the result in a variable called `iris_filtered`.

**Hint:** Use boolean indexing with the `&` operator to combine multiple conditions. Remember to wrap each condition in parentheses.
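
The hint can be sketched on a small made-up table. The DataFrame `toy` and its columns are hypothetical and unrelated to the iris data. The parentheses matter because `&` binds more tightly than comparison operators such as `>=` and `<`:

```python
import pandas as pd

# Hypothetical toy data, unrelated to the iris dataset
toy = pd.DataFrame({"height": [1.0, 2.5, 3.0, 4.5], "weight": [10, 20, 30, 40]})

# Each comparison produces a boolean Series; & combines them row by row
mask = (toy["height"] >= 2.5) & (toy["weight"] < 40)

# Indexing with the boolean Series keeps only the rows where mask is True
toy_filtered = toy[mask]
print(toy_filtered)
```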

In [None]:
# BEGIN SOLUTION
# Filter iris dataset using boolean indexing with multiple conditions
iris_filtered = iris[
    (iris["sepal length (cm)"] >= 6.9)
    & (iris["petal width (cm)"] >= 2.0)
    & (iris["petal length (cm)"] > 5.1)
    & (iris["petal length (cm)"] < 6.1)
]
iris_filtered
# END SOLUTION

In [None]:
# Test assertions
assert len(iris_filtered) >= 0, "iris_filtered should be a valid DataFrame"
assert iris_filtered.shape[1] == iris.shape[1], "Should have same number of columns"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert iris_filtered["sepal length (cm)"].min() >= 6.9, "Sepal length filter failed"
assert iris_filtered["petal width (cm)"].min() >= 2.0, "Petal width filter failed"
assert iris_filtered["petal length (cm)"].min() > 5.1, "Petal length lower bound failed"
assert iris_filtered["petal length (cm)"].max() < 6.1, "Petal length upper bound failed"
# END HIDDEN TESTS

---

**Problem 8:** Verify Filtering

In the code cell below, write boolean expressions to verify that your filtered dataset meets all the filtering criteria. Use comparisons like `iris_filtered["column"].min() >= value` to check each condition.

You can write standalone expressions, store them in a list, or use any approach that demonstrates the filtering worked correctly.

In [None]:
# Write your verification expressions here
# BEGIN SOLUTION
# Verify all filtering criteria are met
verification_results = [
    iris_filtered.shape[1] == iris.shape[1],  # Same columns as original
    iris_filtered["sepal length (cm)"].min() >= 6.9,  # Sepal length >= 6.9
    iris_filtered["petal width (cm)"].min() >= 2.0,  # Petal width >= 2.0
    iris_filtered["petal length (cm)"].min() > 5.1,  # Petal length > 5.1
    iris_filtered["petal length (cm)"].max() < 6.1,  # Petal length < 6.1
]
all(verification_results)
# END SOLUTION

In [None]:
# Test assertions
assert iris_filtered.shape[1] == iris.shape[1], "Should have same columns as original"
assert len(iris_filtered) >= 0, "Should be a valid filtered DataFrame"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert iris_filtered["sepal length (cm)"].min() >= 6.9, "Sepal length should be >= 6.9"
assert iris_filtered["petal width (cm)"].min() >= 2.0, "Petal width should be >= 2.0"
# END HIDDEN TESTS

---

**Problem 9:** Petal Area Calculation

Assume that petals are rectangular-shaped. Create a new column called `Petal Area` that stores the area of petals (petal length multiplied by petal width).

In [None]:
# BEGIN SOLUTION
# Calculate petal area as length times width
iris["Petal Area"] = iris["petal length (cm)"] * iris["petal width (cm)"]
iris.head()
# END SOLUTION

In [None]:
# Test assertions
assert "Petal Area" in iris.columns, "Petal Area column should exist"
assert len(iris["Petal Area"]) == len(iris), "Petal Area should have same length as dataset"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Check that the calculation is correct for the first row
expected_area = iris["petal length (cm)"].iloc[0] * iris["petal width (cm)"].iloc[0]
assert iris["Petal Area"].iloc[0] == expected_area, "Petal area calculation is incorrect"
# END HIDDEN TESTS

---

**Problem 10:** Maximum Petal Area by Species

Find the maximum petal area for each species of iris plant. Store the result in a variable called `max_petal_area_by_species`.

**Hint:** Use the `groupby()` method followed by an aggregation function.
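
As a rough sketch of the group-then-aggregate pattern on made-up data (the DataFrame `scores` and its columns are hypothetical), the example below computes a mean per group. Other aggregations follow the same shape:

```python
import pandas as pd

# Hypothetical toy data, unrelated to the iris dataset
scores = pd.DataFrame(
    {"team": ["a", "a", "b", "b", "b"], "points": [3, 7, 2, 9, 1]}
)

# Split rows by team, select one column, then aggregate each group
mean_points_by_team = scores.groupby("team")["points"].mean()
print(mean_points_by_team)
```

The result is a Series indexed by the group labels, so `mean_points_by_team["a"]` looks up a single group's value.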

In [None]:
# BEGIN SOLUTION
# Group by species and find the maximum petal area for each
max_petal_area_by_species = iris.groupby("Species")["Petal Area"].max()
max_petal_area_by_species
# END SOLUTION

In [None]:
# Test assertions
assert len(max_petal_area_by_species) == 3, "Should have 3 species"
assert "setosa" in max_petal_area_by_species.index, "setosa should be in index"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert "versicolor" in max_petal_area_by_species.index, "versicolor should be in index"
assert "virginica" in max_petal_area_by_species.index, "virginica should be in index"
virginica_area = max_petal_area_by_species["virginica"]
setosa_area = max_petal_area_by_species["setosa"]
assert virginica_area > setosa_area, "virginica should have larger max area"
# END HIDDEN TESTS

## Data Structures

---

**Problem 11:** Random Array Generation

Masking is a common operation in machine learning and artificial intelligence. We will learn how to perform this operation together.

(a) Set the random seed to the value 8. This ensures that everyone generates the same sequence of random numbers.

(b) Generate a random 5x5 sample of Uniform(0, 1) values and store it in a variable called `random_array`.

**Hint:** Use `np.random.seed()` and `np.random.rand()`.
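
To illustrate why seeding matters, the sketch below resets the seed and draws again. The seed value 123 and the variable names are arbitrary choices for this illustration, not the values the problem asks for:

```python
import numpy as np

np.random.seed(123)  # Arbitrary seed chosen for this illustration
first_draw = np.random.rand(3)

np.random.seed(123)  # Resetting the seed restarts the same sequence
second_draw = np.random.rand(3)

print(np.array_equal(first_draw, second_draw))  # True
```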

In [None]:
# BEGIN SOLUTION
np.random.seed(8)
random_array = np.random.rand(5, 5)
# END SOLUTION

In [None]:
# Test assertions
assert random_array.shape == (5, 5), "Array should be 5x5"
assert random_array.min() >= 0, "All values should be >= 0"
assert random_array.max() <= 1, "All values should be <= 1"
print("All tests passed!")

# BEGIN HIDDEN TESTS
correct_answer = np.array(
    [
        [0.8734294, 0.96854066, 0.86919454, 0.53085569, 0.23272833],
        [0.0113988, 0.43046882, 0.40235136, 0.52267467, 0.4783918],
        [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.6196821],
        [0.42609177, 0.28907503, 0.97385524, 0.33377405, 0.21880106],
        [0.06580839, 0.98287055, 0.12785571, 0.32213079, 0.07094284],
    ]
)
assert np.sum(np.abs(random_array - correct_answer)) < 1e-7, "Array values don't match expected"
# END HIDDEN TESTS

---

**Problem 12:** Array Masking

Using the `random_array` from Problem 11, perform the following three masks. For each mask, first copy the original array to preserve it.

(a) Zero out the first 3 columns. Store the result in `masked_array1`.

(b) Zero out any values less than 0.5. Store the result in `masked_array2`.

(c) Zero out all values whose first digit after the decimal point is even (0, 2, 4, 6, 8). Store the result in `masked_array3`.

**Hint:** For part (c), multiply by 10, take the floor, and check if the result is even using modulus.
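
The steps in the hint can be sketched on a few hand-picked numbers. The array `values` is hypothetical and was chosen so that floating-point rounding does not affect the digits:

```python
import numpy as np

values = np.array([0.05, 0.12, 0.47, 0.81, 0.93])  # Hypothetical values

# Multiplying by 10 moves the first decimal digit in front of the decimal point,
# and floor discards everything after it
first_digit = np.floor(values * 10)

# A digit is even when its remainder after division by 2 is 0
is_even = first_digit % 2 == 0

print(first_digit)  # [0. 1. 4. 8. 9.]
print(is_even)      # [ True False  True  True False]
```

The boolean array `is_even` has the same shape as `values`, so it can be used as a mask for indexing.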

In [None]:
# BEGIN SOLUTION
# Copy arrays to preserve the original
masked_array1 = random_array.copy()
masked_array2 = random_array.copy()
masked_array3 = random_array.copy()

# (a) Zero out the first 3 columns
masked_array1[:, :3] = 0

# (b) Zero out values less than 0.5
masked_array2[masked_array2 < 0.5] = 0

# (c) Zero out values where first digit after decimal is even
masked_array3[np.floor(masked_array3 * 10) % 2 == 0] = 0
# END SOLUTION

In [None]:
# Test assertions
assert masked_array1.shape == (5, 5), "masked_array1 should be 5x5"
assert masked_array2.shape == (5, 5), "masked_array2 should be 5x5"
assert masked_array3.shape == (5, 5), "masked_array3 should be 5x5"
print("All tests passed!")

# BEGIN HIDDEN TESTS
correct_mask1 = np.array(
    [
        [0.0, 0.0, 0.0, 0.53085569, 0.23272833],
        [0.0, 0.0, 0.0, 0.52267467, 0.4783918],
        [0.0, 0.0, 0.0, 0.71237457, 0.6196821],
        [0.0, 0.0, 0.0, 0.33377405, 0.21880106],
        [0.0, 0.0, 0.0, 0.32213079, 0.07094284],
    ]
)
correct_mask2 = np.array(
    [
        [0.8734294, 0.96854066, 0.86919454, 0.53085569, 0.0],
        [0.0, 0.0, 0.0, 0.52267467, 0.0],
        [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.6196821],
        [0.0, 0.0, 0.97385524, 0.0, 0.0],
        [0.0, 0.98287055, 0.0, 0.0, 0.0],
    ]
)
correct_mask3 = np.array(
    [
        [0.0, 0.96854066, 0.0, 0.53085569, 0.0],
        [0.0, 0.0, 0.0, 0.52267467, 0.0],
        [0.55535647, 0.54338602, 0.76089558, 0.71237457, 0.0],
        [0.0, 0.0, 0.97385524, 0.33377405, 0.0],
        [0.0, 0.98287055, 0.12785571, 0.32213079, 0.0],
    ]
)

assert np.sum(np.abs(masked_array1 - correct_mask1)) < 1e-7, "Mask 1 incorrect"
assert np.sum(np.abs(masked_array2 - correct_mask2)) < 1e-7, "Mask 2 incorrect"
assert np.sum(np.abs(masked_array3 - correct_mask3)) < 1e-7, "Mask 3 incorrect"
assert masked_array1[:, :3].sum() == 0, "First 3 columns should all be zero"
assert (masked_array2[masked_array2 != 0] >= 0.5).all(), "Non-zero values should be >= 0.5"
# END HIDDEN TESTS

---

**Problem 13:** Image Matrix Setup

In modern machine learning, we often use matrices and tensors when dealing with image data. A simple image task consists of applying a [kernel](https://en.wikipedia.org/wiki/Kernel_(image_processing)).

For this exercise, first generate a 7x7 grayscale image:

(a) Create a vector of the first 49 consecutive integers (starting from 0).

(b) Reshape it into a 7x7 matrix and store it in a variable called `test_image`.

In [None]:
# BEGIN SOLUTION
vec = np.arange(49)
test_image = vec.reshape(7, 7)
# END SOLUTION

In [None]:
# Test assertions
assert test_image.shape == (7, 7), "test_image should be 7x7"
assert test_image[0, 0] == 0, "Top-left should be 0"
assert test_image[6, 6] == 48, "Bottom-right should be 48"
print("All tests passed!")

# BEGIN HIDDEN TESTS
correct_answer = np.array(
    [
        [0, 1, 2, 3, 4, 5, 6],
        [7, 8, 9, 10, 11, 12, 13],
        [14, 15, 16, 17, 18, 19, 20],
        [21, 22, 23, 24, 25, 26, 27],
        [28, 29, 30, 31, 32, 33, 34],
        [35, 36, 37, 38, 39, 40, 41],
        [42, 43, 44, 45, 46, 47, 48],
    ]
)
assert np.sum(np.abs(test_image - correct_answer)) < 1e-7, "test_image values don't match"
# END HIDDEN TESTS

---

**Problem 14:** Box Filter Implementation

Create a function called `box_filter` that applies a 3x3 box filter to an image. When applied to an image, the filter modifies each pixel by computing the floor of the average of the pixel and its eight surrounding pixels. If one or more surrounding pixels are not present (at edges/corners), only consider the available pixels in the average.

**Worked Example:**

For the matrix:
$$
\begin{bmatrix}
  1 & 2 & 3 & 4 \\
  5 & 6 & 7 & 8 \\
  9 & 10 & 11 & 12 \\
  13 & 14 & 15 & 16
\end{bmatrix} \longrightarrow \begin{bmatrix}
  3 & 4 & 5 & 5 \\
  5 & 6 & 7 & 7 \\
  9 & 10 & 11 & 11 \\
  11 & 12 & 13 & 13
\end{bmatrix}
$$

- For the cell containing 4 (top-right corner): only 4 pixels are available (the pixel itself and 3 neighbors): 3, 4, 7, 8. The calculation is $(3+4+7+8) / 4 = 5.5 \rightarrow 5$.

- For the cell containing 15 (bottom edge): 6 pixels are available (the pixel itself and 5 neighbors). The calculation is $(10+11+12+14+15+16) / 6 = 13$.

Write a function that takes a 2D NumPy array and returns the smoothed version.
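
The two bullet calculations above can be checked with array slicing. This is only a sketch of one way to gather the available pixels, not the required implementation; the helper `neighborhood` and the variable `example` are hypothetical names. NumPy clips slices that run past the end of an array, so only the lower bound needs clamping:

```python
import numpy as np

example = np.arange(1, 17).reshape(4, 4)  # The 4x4 matrix from the worked example


def neighborhood(image, row, col):
    """Hypothetical helper: return the pixel and whichever neighbors exist."""
    return image[max(0, row - 1):row + 2, max(0, col - 1):col + 2]


corner = neighborhood(example, 0, 3)  # Pixel containing 4
edge = neighborhood(example, 3, 2)    # Pixel containing 15

print(corner.ravel(), corner.sum() // corner.size)  # [3 4 7 8] 5
print(edge.ravel(), edge.sum() // edge.size)        # [10 11 12 14 15 16] 13
```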

In [None]:
def box_filter(image):
    # BEGIN SOLUTION
    rows, cols = image.shape

    def compute_average(row, col):
        """Compute floor of average of pixel and its neighbors."""
        total, count = 0, 0

        # Define boundaries for neighboring pixels (clamped to image bounds)
        top = max(0, row - 1)
        bottom = min(rows, row + 2)
        left = max(0, col - 1)
        right = min(cols, col + 2)

        # Sum all pixels in the 3x3 neighborhood
        for r in range(top, bottom):
            for c in range(left, right):
                total += int(image[r, c])  # Python int, so uint8 pixel sums cannot wrap around
                count += 1

        return total // count

    # Apply the box filter to each pixel
    return np.array([[compute_average(r, c) for c in range(cols)] for r in range(rows)])
    # END SOLUTION

In [None]:
smoothed_image = box_filter(test_image)
smoothed_image

In [None]:
# Test assertions
correct_answer = np.array(
    [
        [4, 4, 5, 6, 7, 8, 9],
        [7, 8, 9, 10, 11, 12, 12],
        [14, 15, 16, 17, 18, 19, 19],
        [21, 22, 23, 24, 25, 26, 26],
        [28, 29, 30, 31, 32, 33, 33],
        [35, 36, 37, 38, 39, 40, 40],
        [39, 39, 40, 41, 42, 43, 44],
    ]
)
diff = np.sum(np.abs(smoothed_image - correct_answer))
assert diff < 1e-7, "Smoothed image doesn't match expected"
print("All tests passed!")

# BEGIN HIDDEN TESTS
# Test on a simple 2x2 matrix
small_test = np.array([[1, 2], [3, 4]])
small_result = box_filter(small_test)
assert small_result[0, 0] == 2, "2x2 top-left should be floor((1+2+3+4)/4) = 2"
assert small_result.shape == (2, 2), "Output shape should match input shape"
# END HIDDEN TESTS

#### RGB Image Setup

Run the setup cell below to load an RGB image from the Caltech101 dataset. **Do not modify** the code.

In [None]:
import matplotlib.pyplot as plt
import torchvision
import torchvision.transforms as transforms

# Define the transformation pipeline (convert to tensor, no resizing)
transform = transforms.Compose([transforms.ToTensor()])

# Download the Caltech101 dataset
caltech101_dataset = torchvision.datasets.Caltech101(
    root="./data", download=True, transform=transform
)

# Get a single image and its label
rgb_image, label = caltech101_dataset[35]

# Convert the image tensor to a NumPy matrix (H x W x C format)
rgb_image = np.transpose((rgb_image.numpy() * 255).astype(np.uint8), (1, 2, 0))

In [None]:
plt.imshow(rgb_image)
plt.show()

---

**Problem 15:** RGB Box Filter

The image now has one more axis for the RGB channels. The shape is now H x W x C (height, width, channels). Modify your code from Problem 14 to apply the box filter to each channel independently and then output the blurred RGB image.

Write a function called `box_filter_rgb` that takes an RGB image (3D NumPy array) and returns the blurred version.
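
As a sketch of the "process each channel independently" pattern on a tiny made-up array, the example below doubles each value as a stand-in for the real per-channel operation. The names `tiny_rgb`, `channels`, `processed`, and `rebuilt` are hypothetical:

```python
import numpy as np

# Hypothetical 2x2 image with 3 channels (H x W x C)
tiny_rgb = np.arange(12).reshape(2, 2, 3)

# Split into one H x W array per channel
channels = [tiny_rgb[:, :, i] for i in range(tiny_rgb.shape[2])]

# Process each channel on its own; this sketch just doubles the values
processed = [channel * 2 for channel in channels]

# Stack along a new last axis to get back to H x W x C
rebuilt = np.stack(processed, axis=-1)
print(rebuilt.shape)  # (2, 2, 3)
```

Stacking with `axis=-1` puts the channel axis last, which matches the H x W x C layout that `plt.imshow` expects for RGB data.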

In [None]:
def box_filter_rgb(image):
    # BEGIN SOLUTION
    # Apply box filter to each of the 3 color channels independently
    filtered_channels = [box_filter(image[:, :, i]) for i in range(3)]
    # Stack channels back together in H x W x C format
    return np.transpose(np.array(filtered_channels), (1, 2, 0))
    # END SOLUTION

In [None]:
blurred_image = box_filter_rgb(rgb_image)
plt.imshow(blurred_image)
plt.show()

In [None]:
# Test assertions
diff_image = np.abs(rgb_image.astype(int) - blurred_image.astype(int))
assert blurred_image.shape == rgb_image.shape, "Output shape should match input shape"
assert blurred_image.shape[2] == 3, "Should have 3 color channels"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert np.max(diff_image) > 0, "Blurring should change some pixel values"
# END HIDDEN TESTS