# Short-Answer Questions

Please be sure to write the answers to these questions on the separate answer sheet provided. Nothing in this notebook will be graded!

In [1]:
import numpy as np
import pandas as pd

## Part 1 (8 minutes)

For each of the following scenarios, determine whether a regression, classification, or clustering model is most appropriate. **On your answer sheet, write 1 for regression, 2 for classification, and 3 for clustering.**

**1.** You have the texts of several books written by Fitzgerald, Hemingway, and Steinbeck. You would like to build a machine learning model using this data that, given a new text, is able to predict whether it was written by Fitzgerald, Hemingway, or Steinbeck.

**2.** You would like to build a machine learning model that predicts the price of a house, given information about the number of bedrooms, number of bathrooms, square footage, and other features.

**3.** You are a data scientist for a large retailer. The retailer has data about what products each customer purchases and would like to use this data to segment its customers into 6 "types".

**4.** You have a large database of handwritten digits. Each digit has been manually examined by a human and labeled as 0, 1, ..., 9. You would like to build a machine learning model on this data that, given a scan of a handwritten digit, is able to identify the digit.

## Part 2 (6 minutes)

You are trying to predict whether a baseball team will win or lose based on features $x_1$ and $x_2$. You have training data, shown below.

![](http://users.csc.calpoly.edu/~dsun09/data301/exam3/k_neighbors.png)

You fit a 4-nearest neighbors model to this data. Use this model to predict the _probability_ that a team with $(x_1, x_2) = (2.0, 1.0)$ will win...

**5.** ...if the distance metric is Euclidean distance.

**6.** ...if the distance metric is cosine distance.

In [2]:
np.average([1, 1, 0, 0])

0.5
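A $k$-nearest neighbors classifier estimates the probability of a win as the fraction of wins among the $k$ closest training points, so the answer can change with the distance metric. The sketch below illustrates this on made-up points; `X_train`, `y_train`, and `knn_win_probability` are hypothetical names, and the coordinates are not the data from the figure.

```python
import numpy as np

# Hypothetical training points (NOT the data in the figure); columns are x1, x2.
X_train = np.array([
    [2.0, 1.5], [2.5, 1.0], [1.5, 0.5], [2.0, 2.0],
    [4.0, 2.0], [0.5, 2.5], [6.0, 3.0], [0.5, 0.1],
])
y_train = np.array([1, 1, 0, 0, 1, 0, 1, 0])  # 1 = win, 0 = loss


def knn_win_probability(x_new, X, y, k=4, metric="euclidean"):
    """Fraction of wins among the k training points closest to x_new."""
    if metric == "euclidean":
        dist = np.linalg.norm(X - x_new, axis=1)
    elif metric == "cosine":
        # Cosine distance = 1 - cosine similarity; it depends only on direction.
        cos_sim = (X @ x_new) / (np.linalg.norm(X, axis=1) * np.linalg.norm(x_new))
        dist = 1.0 - cos_sim
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(dist)[:k]  # indices of the k smallest distances
    return y[nearest].mean()


x_new = np.array([2.0, 1.0])
p_euclidean = knn_win_probability(x_new, X_train, y_train, metric="euclidean")
p_cosine = knn_win_probability(x_new, X_train, y_train, metric="cosine")
print(p_euclidean, p_cosine)
```

With these made-up points the two metrics pick different neighbors: Euclidean distance favors points near $(2.0, 1.0)$, while cosine distance favors points lying in the same direction from the origin, however far away they are.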

## Part 3 (15 minutes)

The CSV file `/data/tides.csv` contains the water level of the ocean, measured at Port San Luis every 0.1 hours over a period of 6 days. Tidal data tends to be periodic, with components that repeat every 12 hours and every 24 hours. Therefore, it makes sense to model the water level at time $t$ as follows:

$$ f(t) = \beta_0 + \beta_1 \cos\left(\frac{2\pi t}{12}\right) + \beta_2 \sin\left(\frac{2\pi t}{12}\right) + \beta_3 \cos\left(\frac{2 \pi t}{24}\right) + \beta_4 \sin\left(\frac{2 \pi t}{24}\right), $$

where $t$ represents the time in hours.

Use the data in `/data/tides.csv` to estimate $\beta_0$, $\beta_1$, $\beta_2$, $\beta_3$, and $\beta_4$.

**7.** What is your estimate of $\beta_0$?

**8.** What is your estimate of $\beta_1$?

**9.** What is your estimate of $\beta_4$?

In [3]:
data = pd.read_csv("/data/tides.csv")

# YOUR CODE HERE
from sklearn.linear_model import LinearRegression

# Build the four periodic features; each column is named after the coefficient it multiplies.
t = data['Time (hours)']
data['B1'] = np.cos(2 * np.pi * t / 12)
data['B2'] = np.sin(2 * np.pi * t / 12)
data['B3'] = np.cos(2 * np.pi * t / 24)
data['B4'] = np.sin(2 * np.pi * t / 24)

# Ordinary least squares: intercept_ estimates beta_0, coef_ holds beta_1 through beta_4.
model = LinearRegression()
model.fit(data[['B1', 'B2', 'B3', 'B4']], data['Water Level (ft)'])
model.intercept_, model.coef_

(2.6774820402076234,
 array([ 0.8752349 ,  0.25942657, -0.09890444, -1.09277118]))
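Although $f(t)$ is nonlinear in $t$, it is linear in the coefficients, which is why ordinary least squares applies once the cosine and sine terms are computed as features. As a rough check of that idea, the sketch below fits the same model with `np.linalg.lstsq` on synthetic data. The coefficients in `true_beta`, the noise level, and the helper `design_matrix` are assumptions chosen for illustration; none of this uses the real tides file.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic time grid: 6 days sampled every 0.1 hours (mirrors the description, not the file).
t = np.arange(0, 144, 0.1)

# Made-up "true" coefficients beta_0..beta_4, used only to generate fake data.
true_beta = np.array([2.5, 0.9, 0.3, -0.1, -1.0])


def design_matrix(t):
    """Columns: 1, cos(2*pi*t/12), sin(2*pi*t/12), cos(2*pi*t/24), sin(2*pi*t/24)."""
    return np.column_stack([
        np.ones_like(t),
        np.cos(2 * np.pi * t / 12),
        np.sin(2 * np.pi * t / 12),
        np.cos(2 * np.pi * t / 24),
        np.sin(2 * np.pi * t / 24),
    ])


X = design_matrix(t)
y = X @ true_beta + rng.normal(scale=0.1, size=t.size)  # fake water levels with noise

# Least-squares estimate of all five coefficients at once (the ones column gives beta_0).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

If the recovered coefficients land close to `true_beta`, that suggests the feature construction is right. `LinearRegression` differs only in that it handles the intercept itself instead of using an explicit column of ones.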

## Part 4 (6 minutes)

You wish to predict the mileage (miles per gallon) of a car based on its displacement and the year the car was made. The data is shown below:

![](http://users.csc.calpoly.edu/~dsun09/data301/exam3/mpg_data.png)

You decide to use linear regression to model the data. Shown below are three plots representing three possible fitted models.

![](http://users.csc.calpoly.edu/~dsun09/data301/exam3/mpg_fit.png)


**10.** Suppose you model the mpg as a linear function of displacement only:
$$ f(\textrm{displacement}) = \beta_0 + \beta_1 \cdot \textrm{displacement}, $$
ignoring the year variable entirely. Which of the three plots above best represents the predictions you would get from this model?

**11.** Suppose you instead model the mpg as a function of displacement and year:
$$ f(\textrm{displacement}, \textrm{year}) = \beta_0 + \beta_1 \cdot \textrm{displacement} + \beta_2 \cdot \textrm{year1978} + \beta_3 \cdot \textrm{year1982}, $$
treating year as a categorical variable, with 1973 as a baseline. Which of the three plots above best represents the predictions you would get from this model?
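Treating year as categorical with 1973 as the baseline means adding one 0/1 indicator column for each of the other years. The sketch below shows one way to build those columns with `pd.get_dummies` and fit the model. The `cars` table and its mpg values are made up for illustration and are not the data in the figure.

```python
import numpy as np
import pandas as pd

# Made-up data: three cars per model year (NOT the data shown in the figure).
cars = pd.DataFrame({
    "displacement": [100, 200, 300, 100, 200, 300, 100, 200, 300],
    "year": [1973, 1973, 1973, 1978, 1978, 1978, 1982, 1982, 1982],
})
# Fake mpg generated from assumed coefficients: beta_0=30, beta_1=-0.05, beta_2=3, beta_3=8.
year_shift = cars["year"].map({1973: 0.0, 1978: 3.0, 1982: 8.0})
cars["mpg"] = 30 - 0.05 * cars["displacement"] + year_shift

# drop_first=True drops the smallest year (1973), making it the baseline category.
dummies = pd.get_dummies(cars["year"], prefix="year", drop_first=True).astype(float)

X = np.column_stack([
    np.ones(len(cars)),                 # intercept (beta_0)
    cars["displacement"].to_numpy(),    # beta_1
    dummies["year_1978"].to_numpy(),    # beta_2
    dummies["year_1982"].to_numpy(),    # beta_3
])
beta, *_ = np.linalg.lstsq(X, cars["mpg"].to_numpy(), rcond=None)
print(np.round(beta, 3))
```

In this parameterization the displacement coefficient is shared across years, and each indicator coefficient measures how far that year's predictions sit above or below the 1973 predictions at the same displacement.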

## Part 5 (3 minutes)

**12.** Suppose you run $K$-means clustering on the following data, with $K=3$.

![](http://users.csc.calpoly.edu/~dsun09/data301/exam3/kmeans.png)

Which of the following clusterings are you most likely to get?

![](http://users.csc.calpoly.edu/~dsun09/data301/exam3/kmeans_choices.png)

In [4]:
print([0, 150])
print([0, 9850])

[0, 150]
[0, 9850]


In [5]:
9850/10000

0.985

## Part 6 (9 minutes)

You are building a classifier to predict whether or not a [Reddit](http://www.reddit.com) comment will be controversial. You train your classifier on a training set of 10,000 Reddit comments, of which only 150 are controversial.

Your classifier predicts that every comment in the training set is non-controversial.

**13.** What is the accuracy of your classifier on the training set?

**14.** What is the precision of your classifier for identifying _non_-controversial comments on the training set?

**15.** What is the recall of your classifier for identifying _non_-controversial comments on the training set?

_Note:_ Your answers should be numbers between 0 and 1.
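Accuracy, precision, and recall all follow from counting how predictions line up with the true labels, once you fix which class counts as "positive". The sketch below works through that arithmetic for the scenario above, with non-controversial as the positive class. The helper `binary_metrics` and the label strings are hypothetical names, not part of the exam.

```python
def binary_metrics(y_true, y_pred, positive):
    """Return (accuracy, precision, recall), treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, precision, recall


# 10,000 training comments, 150 of them controversial; every prediction is "non-controversial".
y_true = ["controversial"] * 150 + ["non-controversial"] * 9850
y_pred = ["non-controversial"] * 10000

accuracy, precision, recall = binary_metrics(y_true, y_pred, positive="non-controversial")
print(accuracy, precision, recall)
```

Passing `positive="controversial"` instead would leave precision undefined (the classifier never predicts that class) and give a recall of 0, which is why the choice of positive class matters on imbalanced data.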