# DATASCI 503, Homework 1: Introduction to Statistical Learning

In this assignment, you will explore fundamental concepts in statistical learning including populations, variables, bias-variance tradeoffs, and basic data analysis with pandas.

In [None]:
import pandas as pd

### Problem 1: Populations and Variables

Consider the population of students taking DATASCI 503 this semester at the University of Michigan.

**(a)** Name three variables **related to academics** that you could collect or measure about each student in this population. Of these three variables, one must be ordinal, one must be categorical, and one must be continuous.

**(b)** Suppose you have collected a dataset containing these variables for the population of students taking DATASCI 503 this semester. Now consider using this dataset to make inferences about a different population. Name another population about which we could plausibly make inferences. Name another population that would be more difficult to make inferences about.

> BEGIN SOLUTION

**Answer 1a:**

The three variables related to academics that we can collect/measure about each student in the DATASCI 503 population are as follows:
1. Year (freshman, sophomore, junior, and senior) (ordinal)
2. Major (data science, computer science, statistics, math, science, etc.) (categorical)
3. GPA (contained in the range [0, 4]) (continuous)

**Answer 1b:**

**Plausible Inference:**
Using the data gathered from the DATASCI 503 population, we can plausibly make inferences about students taking another graduate-level statistics course of similar difficulty. Variables such as Year, Major, and GPA are usually good indicators of the caliber and type of students taking a course, so for a comparable course we can expect these variables to follow a similar distribution. Inferences drawn from our dataset should therefore carry over reasonably well.

**Not Plausible Inference:**
By the same logic, it would be much harder to make inferences about students taking a very different type of course in an unrelated field (e.g., English 102). The students who take such a course are likely to differ in Year, Major, and GPA from those in our dataset, so the patterns we observe would not transfer, and any inferences would be unreliable.

> END SOLUTION

### Problem 2: Classification vs Regression

For each scenario below, explain whether it is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide the number of samples ($n$) and the number of features ($p$).

**(a)** We collect a set of data on the top 500 firms in the United States. For each firm, we record the profit, the number of employees, the industry, and the CEO salary. We are interested in understanding which factors affect CEO salary.

**(b)** We are considering releasing a new product and want to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product, we have recorded whether it was a success or failure, the price charged for the product, the marketing budget, the competition price, and ten other variables.

**(c)** We are interested in predicting the percentage change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence, we collect weekly data for all of 2012. For each week, we record the percent change in the USD/Euro, the percent change in the US market, the percent change in the British market, and the percent change in the German market.

> BEGIN SOLUTION

**Answer 2a:**
1. Type of problem - Regression (CEO salary is a quantitative response); we are most interested in inference
2. Number of samples ($n$) - 500
3. Number of features ($p$) - 3 (excluding target - CEO salary)

**Answer 2b:**
1. Type of problem - Classification (the response is success or failure); we are most interested in prediction
2. Number of samples ($n$) - 20
3. Number of features ($p$) - 13 (excluding target - success or failure)

**Answer 2c:**
1. Type of problem - Regression (the response is a percentage change); we are most interested in prediction
2. Number of samples ($n$) - 52 (one per week of 2012)
3. Number of features ($p$) - 3 (excluding target - % change in USD/Euro)
> END SOLUTION


### Problem 3: Bias-Variance Tradeoff

You have a dataset. You are considering a collection of different methods that you could apply to this dataset in order to estimate the regression function. The methods range from very inflexible (only capable of representing a small class of true regression functions) to very flexible (capable of fitting a very large class of true regression functions).

Answer True or False for each statement, and provide a brief explanation.

**(a)** Typically, the bias of your estimate will be lower with more flexible methods.

**(b)** Typically, the variance of your estimate will be lower with more flexible methods.

**(c)** Typically, irreducible error will be lower with more flexible methods.

> BEGIN SOLUTION

**Answer 3a:** True

Flexible methods can represent a larger class of regression functions, so they can approximate the true regression function more closely, which typically lowers the bias of the estimate.

**Answer 3b:** False

Because flexible methods can fit a larger class of functions, they also adapt to the noise and irregularities of the particular training set. Their estimates therefore change more from one training set to another, which typically gives them higher variance than inflexible methods.

**Answer 3c:** False

Irreducible error comes from noise in the response that cannot be predicted from the features. It is a property of the data-generating distribution, not of the method, so it remains the same regardless of how flexible the model is.
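
The three answers above can be checked with a small simulation. The sketch below is illustrative only: it uses KNN regression with a fixed design on $[0, 1]$, and the true function `true_f`, the noise level `sigma`, the evaluation point `x0`, and all helper names are assumptions made for this example, not part of the assignment. It estimates the squared bias and the variance of the prediction at one point for a very flexible fit ($K = 1$) and a very inflexible one ($K = n$, a global average).

```python
import math
import random
import statistics


def true_f(x):
    """Hypothetical true regression function, used only for this simulation."""
    return math.sin(2 * math.pi * x)


def knn_predict(xs, ys, x0, k):
    """Average the responses of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return sum(ys[i] for i in nearest) / k


def bias_variance_at(x0, k, n=50, sigma=0.3, reps=2000, seed=0):
    """Estimate squared bias and variance of KNN at x0 over many training sets."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # fixed design on [0, 1]
    preds = []
    for _ in range(reps):
        # Draw a fresh noisy training set and record the prediction at x0.
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        preds.append(knn_predict(xs, ys, x0, k))
    bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
    variance = statistics.pvariance(preds)
    return bias_sq, variance


flexible = bias_variance_at(x0=0.25, k=1)     # most flexible: one neighbor
inflexible = bias_variance_at(x0=0.25, k=50)  # least flexible: global average
print(f"K=1:  bias^2={flexible[0]:.4f}, variance={flexible[1]:.4f}")
print(f"K=50: bias^2={inflexible[0]:.4f}, variance={inflexible[1]:.4f}")
```

Under these assumptions, the $K = 1$ fit should show a squared bias near 0 and a variance of roughly $\sigma^2 = 0.09$, while the $K = 50$ fit should show a much larger squared bias and a much smaller variance. The noise level `sigma` is the same in both runs, which mirrors the point that irreducible error does not depend on the method.
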
> END SOLUTION


### Problem 4: Asymptotic Properties

Consider gathering a dataset and using it to estimate the regression function. Assume the data-generating distribution is well behaved (e.g., assume that the true regression function is continuous) and typical of what we would find in real-world settings. Answer True or False.

**(a)** As the size of your training dataset tends to infinity (for a fixed number of neighbors, $K$), the bias of the KNN regression estimate will tend to 0.

**(b)** As the size of your training dataset tends to infinity (for a fixed number of neighbors, $K$), the variance of the KNN regression estimate will tend to 0.

**(c)** As the size of your training dataset tends to infinity, the bias of a least squares linear regression estimate will tend to 0.

**(d)** As the size of your training dataset tends to infinity, the variance of a least squares linear regression estimate will tend to 0.

> BEGIN SOLUTION

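**Answer 4a:** True (under mild conditions, e.g., that the regression function is continuous). As $n$ grows, the $K$ nearest neighbors of any point move arbitrarily close to it.

**Answer 4b:** False. With $K$ fixed, the estimate averages only $K$ noisy responses, so its variance does not shrink as $n$ grows.

**Answer 4c:** False (in real-world cases, where the underlying true function is never linear). The estimate converges to the best linear approximation, which still differs from the true regression function.

**Answer 4d:** True. With a fixed number of features, the coefficient estimates become more stable as $n$ grows.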
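
A short derivation may help with answers 4a and 4b. It relies on an assumption that the problem does not state explicitly: $y_i = f(x_i) + \varepsilon_i$, where the noise terms $\varepsilon_i$ are independent with mean 0 and constant variance $\sigma^2$. Writing $N_K(x_0)$ for the indices of the $K$ training points nearest to $x_0$, and conditioning on the training inputs:

$$\hat{f}(x_0) = \frac{1}{K} \sum_{i \in N_K(x_0)} y_i$$

$$\mathbb{E}[\hat{f}(x_0)] - f(x_0) = \frac{1}{K} \sum_{i \in N_K(x_0)} \big( f(x_i) - f(x_0) \big)$$

$$\mathrm{Var}[\hat{f}(x_0)] = \frac{1}{K^2} \sum_{i \in N_K(x_0)} \sigma^2 = \frac{\sigma^2}{K}$$

As $n \to \infty$ with $K$ fixed, the neighbors $x_i$ converge to $x_0$ (provided the inputs have positive density near $x_0$), so by continuity of $f$ the bias tends to 0. The variance $\sigma^2 / K$ does not involve $n$ at all, so it does not tend to 0.
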
> END SOLUTION


## Data Analysis with Pandas

Load the `college_train.csv` dataset and answer the following questions. This dataset contains information about US colleges, including enrollment statistics, costs, and graduation rates.

**Hint:** See the [pandas documentation](https://pandas.pydata.org/docs/reference/frame.html) for DataFrame methods.

In [None]:
df = pd.read_csv("data/college_train.csv")
df.head()

### Problem 5a: DataFrame Shape

Access the `shape` property of your DataFrame to compute the number of samples (rows) and the number of variables (columns) measured about each sample. Store them in variables `num_samples` and `num_variables`.

In [None]:
# BEGIN SOLUTION
num_samples, num_variables = df.shape
# END SOLUTION
print(f"Number of samples: {num_samples}, Number of variables: {num_variables}")

In [None]:
# Test assertions
assert num_samples == 650, f"Expected 650 samples, got {num_samples}"
assert num_variables == 19, f"Expected 19 variables, got {num_variables}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert df.shape == (650, 19), "Shape should be (650, 19)"
assert num_samples > 0, "num_samples should be positive"
assert num_variables > 0, "num_variables should be positive"
# END HIDDEN TESTS

### Problem 5b: Computing Statistics

Compute the mean and standard deviation of the `Books` feature. Store the results in variables `books_mean` and `books_std`.
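
One detail worth knowing here: pandas' `Series.std` computes the sample standard deviation by default (`ddof=1`, dividing by $n - 1$), while passing `ddof=0` gives the population version (dividing by $n$). The sketch below checks this on a small made-up Series (the values are hypothetical, not taken from `college_train.csv`) against Python's `statistics` module.

```python
import statistics

import pandas as pd

values = [2.0, 4.0, 4.0, 6.0]  # hypothetical toy values
s = pd.Series(values)

sample_std = s.std()            # default ddof=1: divide by n - 1
population_std = s.std(ddof=0)  # divide by n

# Each pair should agree: pandas vs. the standard library.
print(sample_std, statistics.stdev(values))
print(population_std, statistics.pstdev(values))
```

For a dataset with hundreds of rows the two versions differ only slightly, but the difference can matter when a test compares against a tight tolerance.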

In [None]:
# BEGIN SOLUTION
books_mean = df["Books"].mean()
books_std = df["Books"].std()
# END SOLUTION
print(f"The mean of the Books feature is {books_mean}")
print(f"The standard deviation of the Books feature is {books_std}")

In [None]:
# Test assertions
assert abs(books_mean - 551.655) < 0.01, f"Mean should be ~551.655, got {books_mean}"
assert abs(books_std - 168.643) < 0.01, f"Std should be ~168.643, got {books_std}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert 550 < books_mean < 553, "Mean should be around 551.65"
assert 167 < books_std < 170, "Std should be around 168.64"
assert books_mean == df["Books"].mean(), "books_mean should equal df['Books'].mean()"
assert books_std == df["Books"].std(), "books_std should equal df['Books'].std()"
# END HIDDEN TESTS

### Problem 5c: Filtering Data

For how many samples is the `Terminal` feature at least 90? Store the count in a variable called `terminal_count`.

**Hint:** You can filter a DataFrame using boolean indexing: `df[df["column"] >= value]`.
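
As a minimal illustration of the hint, on a small made-up DataFrame (the column name `score` and its values are hypothetical), a comparison produces a boolean Series, and indexing with it keeps only the rows where the mask is `True`:

```python
import pandas as pd

# Hypothetical toy data; not part of the homework dataset.
toy = pd.DataFrame({"score": [72, 90, 95, 88, 90]})

mask = toy["score"] >= 90  # boolean Series, one entry per row
filtered = toy[mask]       # keeps only the rows where mask is True

print(mask.tolist())       # [False, True, True, False, True]
print(len(filtered))       # 3
```

Because `True` counts as 1 when summed, `mask.sum()` gives the same count without building the filtered DataFrame.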

In [None]:
# BEGIN SOLUTION
terminal_count = len(df[df["Terminal"] >= 90])
# END SOLUTION
print(f"Number of samples with Terminal >= 90: {terminal_count}")

In [None]:
# Test assertions
assert terminal_count == 207, f"Expected 207, got {terminal_count}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert (df["Terminal"] >= 90).sum() == 207, "Count should be 207"
assert terminal_count == len(df[df["Terminal"] >= 90]), "Should match filtered length"
assert terminal_count > 0, "terminal_count should be positive"
# END HIDDEN TESTS

### Problem 5d: Subsetting and Statistics

Create a new DataFrame called `df_private` that only includes samples where `Private` is equal to `"Yes"`. Then compute:
- The mean value of `Books` in the new DataFrame (store in `private_books_mean`)
- The standard deviation of `Books` in the new DataFrame (store in `private_books_std`)
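
For comparison, and only as a sketch on made-up data (the values below are hypothetical), the same kind of per-group summary can be computed for every value of `Private` at once with `groupby` and `agg`. The problem itself still asks for the explicit `df_private` subset.

```python
import pandas as pd

# Hypothetical toy data; not part of the homework dataset.
toy = pd.DataFrame({
    "Private": ["Yes", "No", "Yes", "No", "Yes"],
    "Books": [500, 450, 600, 550, 700],
})

# One row per group, one column per statistic.
summary = toy.groupby("Private")["Books"].agg(["mean", "std"])
print(summary)
print(summary.loc["Yes", "mean"])  # 600.0
```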

In [None]:
# BEGIN SOLUTION
df_private = df[df["Private"] == "Yes"]
private_books_mean = df_private["Books"].mean()
private_books_std = df_private["Books"].std()
# END SOLUTION
print(f"Number of private schools: {len(df_private)}")
print(f"The mean of the Books feature in the new DataFrame is {private_books_mean}")
print(f"The standard deviation of the Books feature in the new DataFrame is {private_books_std}")

In [None]:
# Test assertions
assert len(df_private) == 471, f"Expected 471 private schools, got {len(df_private)}"
assert abs(private_books_mean - 550.297) < 0.01, "Mean should be ~550.297"
assert abs(private_books_std - 179.725) < 0.01, "Std should be ~179.725"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert df_private["Private"].unique().tolist() == ["Yes"], "Should only contain private schools"
assert 549 < private_books_mean < 552, "Mean should be around 550.297"
assert 178 < private_books_std < 181, "Std should be around 179.725"
# END HIDDEN TESTS

### Problem 5e: For Loops

Write a function called `generate_hello_messages` that uses a for loop to return a list containing the following strings:
```
["Hello world 1.", "Hello world 2.", "Hello world 4.", "Hello world 8.", "Hello world 16."]
```

Then call your function to print each message on its own line.

**Hint:** Notice the pattern in the numbers: they are powers of 2 ($2^0, 2^1, 2^2, 2^3, 2^4$).
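
To see the pattern from the hint in isolation, the short sketch below builds the exponents with `range` and raises 2 to each one. It uses a list comprehension for brevity (the variable names are illustrative), whereas the exercise asks for an explicit for loop.

```python
# Exponents 0 through 4 give the five powers of 2 in the expected output.
powers = [2**i for i in range(5)]
# A left shift by i bits computes the same values for non-negative i.
shifted = [1 << i for i in range(5)]

print(powers)             # [1, 2, 4, 8, 16]
print(powers == shifted)  # True
```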

In [None]:
# BEGIN SOLUTION
def generate_hello_messages():
    """Generate a list of 'Hello world' messages with powers of 2."""
    messages = []
    for i in range(5):
        messages.append(f"Hello world {2**i}.")  # noqa: PERF401
    return messages


# END SOLUTION

# Print each message
for message in generate_hello_messages():
    print(message)

In [None]:
# Test assertions
messages = generate_hello_messages()
expected = [
    "Hello world 1.",
    "Hello world 2.",
    "Hello world 4.",
    "Hello world 8.",
    "Hello world 16.",
]
assert messages == expected, f"Expected {expected}, got {messages}"
print("All tests passed!")

# BEGIN HIDDEN TESTS
assert len(messages) == 5, "Should have exactly 5 messages"
assert messages[0] == "Hello world 1.", "First message should be 'Hello world 1.'"
assert messages[-1] == "Hello world 16.", "Last message should be 'Hello world 16.'"
assert all("Hello world" in m for m in messages), "All messages should contain 'Hello world'"
# END HIDDEN TESTS

### Problem 5f: Markdown Formatting

Create a markdown Jupyter cell below with the following contents:

- A section heading (level 3) that says "This is a section heading"
- Text that says "In this section I have a bigger formula,"
- A displayed equation: $\sqrt{\frac{\alpha}{\sqrt{\beta^2 + \cos 3}}}$
- Text with an inline formula: "and an in-line formula, $\sqrt{3}$. I made **this text** bold."

**Hint:** Use `###` for a level 3 heading, `$$...$$` for displayed equations, and `$...$` for inline math.

> BEGIN SOLUTION

### This is a section heading
In this section I have a bigger formula,
$$\sqrt{\frac{\alpha}{\sqrt{\beta^2 + \cos 3}}}$$
and an in-line formula, $\sqrt{3}$. I made **this text** bold.
> END SOLUTION
