# Homework 1

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, root_mean_squared_error, mean_absolute_error

### 1). 
Your goal is to predict abalone age, which is calculated as the number of rings plus 1.5. Notice there currently is no age variable in the data set. Add age to the data set.

Assess and describe the distribution of age.

In [2]:
#Load our data
abalone = pd.read_csv("abalone.csv")

# Create our age column (rings + 1.5)
abalone["age"] = abalone["rings"] + 1.5
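
The question also asks for an assessment of the distribution of age. One way to do that is sketched below, assuming a small made-up set of ring counts in place of `abalone.csv`. The `toy` frame and its values are hypothetical and only illustrate the calls; the same steps could be run on `abalone["age"]`.

```python
import numpy as np
import pandas as pd

# Hypothetical ring counts standing in for abalone["rings"]; not real data.
toy = pd.DataFrame({"rings": [4, 7, 8, 9, 9, 10, 10, 11, 12, 15, 21]})
toy["age"] = toy["rings"] + 1.5

# Count, mean, standard deviation, and quartiles of age.
summary = toy["age"].describe()
print(summary)

# Sample skewness: a positive value indicates a longer right tail.
skewness = toy["age"].skew()
print(f"Skewness: {skewness:.2f}")

# Text histogram: number of observations in five equal-width age bins.
counts, edges = np.histogram(toy["age"], bins=5)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(count)}")
```

On the real data, `abalone["age"].hist(bins=30)` followed by `plt.show()` would give the graphical version of the same histogram.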

### 2).
Split the abalone data into a training set and a testing set. Use stratified sampling. You should decide on appropriate percentages for splitting the data.

Remember that you’ll need to set a seed at the beginning of the document to reproduce your results.

In [5]:
# Split up our predictors and response variables and create a new column that
# has age ranges, or bins, to stratify our data.
y = abalone['age']
abalone['ageBin'] = pd.cut(abalone['age'], bins=10)
abaloneNoAge = abalone.drop(columns=['age'])

# Generate our testing and training data with random seed of 100 chosen arbitrarily
XTrainData, XTestData, yTrainData, yTestData = train_test_split(abaloneNoAge, y, test_size=0.2, stratify=abalone['ageBin'], random_state=100)
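
A quick way to confirm that stratification worked is to compare the share of each age bin in the full data with its share in the training and testing sets. The sketch below assumes synthetic ring counts, not the real abalone data, and uses `labels=False` in `pd.cut` to get integer bin codes, which differs slightly from the interval labels used above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for abalone.csv; not real measurements.
rng = np.random.default_rng(100)
toy = pd.DataFrame({"rings": rng.integers(1, 30, size=1000)})
toy["age"] = toy["rings"] + 1.5

# labels=False returns integer bin codes instead of interval labels.
toy["ageBin"] = pd.cut(toy["age"], bins=10, labels=False)

train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["ageBin"], random_state=100
)

# Share of each bin in the full data, the training set, and the testing set.
comparison = pd.DataFrame({
    "full": toy["ageBin"].value_counts(normalize=True).sort_index(),
    "train": train["ageBin"].value_counts(normalize=True).sort_index(),
    "test": test["ageBin"].value_counts(normalize=True).sort_index(),
})
print(comparison.round(3))

# Largest difference between a bin's share in the test set and in the full data.
max_gap = (comparison["test"] - comparison["full"]).abs().max()
print(f"Largest gap in bin share: {max_gap:.4f}")
```

If the three columns are nearly identical, the split preserved the age distribution. Note that stratification fails when a bin contains fewer than two observations, so sparse bins in the tails may need to be merged.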

### 3).
Using the training data, create a recipe predicting the outcome variable, age, with all other predictor variables. Note that you should not include rings to predict age. Explain why you shouldn’t use rings to predict age.

Steps for your recipe:

1. dummy code any categorical predictors

2. create interactions between

    * type and shucked_weight,
    * longest_shell and diameter,
    * shucked_weight and shell_weight
3. center all predictors, and

4. scale all predictors.

You’ll need to investigate the tidymodels documentation to find the appropriate step functions to use.

We shouldn't use rings to predict age because age is computed directly from rings (age = rings + 1.5). With rings as a predictor, the model could recover age exactly, so it would tell us nothing about how well the physical measurements predict age.
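
The leakage can be seen in a minimal sketch, assuming synthetic ring counts, not the real data. A linear regression of age on rings alone recovers the defining formula, with a slope of about 1, an intercept of about 1.5, and an $R^2$ of essentially 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical ring counts; age is defined exactly as in the assignment.
rng = np.random.default_rng(0)
rings = rng.integers(1, 30, size=200).astype(float).reshape(-1, 1)
age = rings.ravel() + 1.5

# Fit age on rings alone.
leaky = LinearRegression().fit(rings, age)
slope = leaky.coef_[0]
intercept = leaky.intercept_
r2 = leaky.score(rings, age)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}, R^2 = {r2:.3f}")
```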

In [75]:
# Drop our rings and ageBin columns because they are not needed anymore
XTrainData = XTrainData.drop(columns=['rings', 'ageBin'])
XTestData = XTestData.drop(columns=['rings', 'ageBin'])

# Create lists of column names that will be used in our processing of data
categoricalPredictors = ['type']
numericalPredictors = ['longest_shell', 'diameter', 'height', 'whole_weight', 
                       'shucked_weight', 'viscera_weight', 'shell_weight']
interactionTerms = ['type_M', 'shucked_weight', 'type_F', 'shucked_weight',
                    'type_I', 'shucked_weight', 'longest_shell', 'diameter',
                    'shucked_weight', 'shell_weight']

# This function creates the interaction terms described in the instructions.
# It can be reused for any number of interaction terms in future models: colNames
# must be a list of even length in which each consecutive pair of column names
# is multiplied together.
def interactionCreator(data, colNames):
    # Work on a copy so the caller's DataFrame is not modified in place
    data = data.copy()
    for index in range(0, len(colNames) - 1, 2):
        newColName = f"{colNames[index]}_{colNames[index + 1]}"
        data[newColName] = data[colNames[index]] * data[colNames[index + 1]]

    return data

# This will preprocess our data by scaling and centering all our numerical predictors
# and one-hot encoding our categorical predictor
preprocessing = ColumnTransformer(
    transformers = [('numerical', StandardScaler(), numericalPredictors),
                    ('categorical', OneHotEncoder(sparse_output=False), categoricalPredictors)],
    verbose_feature_names_out=False
).set_output(transform='pandas')

# This will apply our interactionCreator function to our data
interactionMaker = FunctionTransformer(interactionCreator, kw_args={
                        'colNames':interactionTerms}, validate=False)
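
In the cells above, scaling runs before the interaction step, so the interaction columns themselves are not centered or scaled. The assignment lists the interactions before centering and scaling. The sketch below shows one possible way to follow that order, on a tiny hypothetical frame. The helper `add_interactions` and the `toy` data are illustrative assumptions, not part of the original solution, and switching to this order would change the fitted models and the reported metrics.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def add_interactions(data, pairs):
    """Return a copy of data with one product column per (left, right) pair."""
    out = data.copy()
    for left, right in pairs:
        out[f"{left}_{right}"] = out[left] * out[right]
    return out

# Hypothetical measurements; not taken from abalone.csv.
toy = pd.DataFrame({
    "longest_shell": [0.2, 0.4, 0.6, 0.8],
    "diameter": [0.1, 0.3, 0.5, 0.7],
})

# Create the interaction first, then center and scale every column.
interact_then_scale = Pipeline([
    ("interaction", FunctionTransformer(
        add_interactions, kw_args={"pairs": [("longest_shell", "diameter")]})),
    ("scale", StandardScaler()),
])

scaled = interact_then_scale.fit_transform(toy)

# Every column, including the interaction, now has mean 0 and unit variance.
print(np.round(scaled.mean(axis=0), 6))
print(np.round(scaled.std(axis=0), 6))
```

This matters more for KNN than for linear regression, because KNN distances depend directly on the scale of each column.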

### 4).
Create and store a linear regression object using the "lm" engine.

In [76]:
lm = LinearRegression()

### 5).
Create and store a KNN object using the "kknn" engine. Specify k = 7.

In [77]:
knn = KNeighborsRegressor(n_neighbors=7)

### 6).
Now, for each of these models (linear regression and KNN):

1. set up an empty workflow,
2. add the model, and
3. add the recipe that you created in Question 3.

Note that you should be setting up two separate workflows.

Fit both models to the training set.

In [78]:
# Now that all our transformers were made earlier, we can simply set up a pipeline
# for both our linear model and knn model to first preprocess, then create the interaction
# terms, then apply the model
lmPipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('interaction', interactionMaker),
    ('model', lm)
])

knnPipeline = Pipeline([
    ('preprocessing', preprocessing),
    ('interaction', interactionMaker),
    ('model', knn)
])

# Fit both pipelines to the training data
lmPipeline.fit(XTrainData, yTrainData)
knnPipeline.fit(XTrainData, yTrainData)

### 7).
Use your linear regression fit() object to predict the age of a hypothetical female abalone with longest_shell = 0.50, diameter = 0.10, height = 0.30, whole_weight = 4, shucked_weight = 1, viscera_weight = 2, and shell_weight = 1.

In [79]:
# Create our hypothetical female abalone
hypotheticalAbalone = pd.DataFrame({
    'type' : ['F'],
    'longest_shell' : [0.50],
    'diameter' : [0.10],
    'height' : [0.30],
    'whole_weight' : [4],
    'shucked_weight' : [1],
    'viscera_weight' : [2],
    'shell_weight' : [1]
})

# Print what the linear regression model predicts the age is for the female abalone
lmPredictedAge = lmPipeline.predict(hypotheticalAbalone)
print(f"Predicted age: {lmPredictedAge[0]:.2f}")

Predicted age: 23.03


### 8).
Now you want to assess your models’ performance. To do this, use the yardstick package:

1. Create a metric set that includes R2, RMSE (root mean squared error), and MAE (mean absolute error).
2. Use augment() to create a tibble of your model’s predicted values from the testing data along with the actual observed ages (these are needed to assess your model’s performance).
3. Finally, apply your metric set to the tibble, report the results, and interpret the R^2 value.

Repeat these steps once for the linear regression model and for the KNN model.

In [80]:
# Predict all age values for our testing data for both models
lmYTestPredicted = lmPipeline.predict(XTestData)
knnYTestPredicted = knnPipeline.predict(XTestData)

# Now we can calculate all the metrics
lmR2 = r2_score(yTestData, lmYTestPredicted)
knnR2 = r2_score(yTestData, knnYTestPredicted)

lmRMSE = root_mean_squared_error(yTestData, lmYTestPredicted)
knnRMSE = root_mean_squared_error(yTestData, knnYTestPredicted)

lmMAE = mean_absolute_error(yTestData, lmYTestPredicted)
knnMAE = mean_absolute_error(yTestData, knnYTestPredicted)

# This simply puts all the metrics together in a dataframe for organization
metrics = pd.DataFrame({
    'Linear Regression' : [lmR2, lmRMSE, lmMAE],
    'KNN' : [knnR2, knnRMSE, knnMAE]
})
metrics.index = ['R^2', 'RMSE', 'MAE']
metrics

| Metric | Linear Regression | KNN |
|---|---|---|
| R^2 | 0.541892 | 0.497491 |
| RMSE | 2.214101 | 2.318918 |
| MAE | 1.590577 | 1.621155 |
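
To interpret these values it helps to see how each metric is computed. $R^2 = 1 - SS_{res}/SS_{tot}$ is the share of the variance in age that the predictions account for, so the linear model's value of about 0.54 means it explains roughly 54% of the variance in age on the testing data. RMSE and MAE are in the same units as age. The sketch below computes all three by hand on a small hypothetical example and checks them against scikit-learn. It uses `np.sqrt(mean_squared_error(...))` in place of `root_mean_squared_error` so that it also runs on older scikit-learn versions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical observed and predicted ages; not model output.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20.0

r2 = 1 - ss_res / ss_tot                # 0.925
rmse = np.sqrt(np.mean(residuals ** 2)) # sqrt(0.375), about 0.612
mae = np.mean(np.abs(residuals))        # 0.5

# The hand-computed values agree with scikit-learn's implementations.
assert np.isclose(r2, r2_score(y_true, y_pred))
assert np.isclose(rmse, np.sqrt(mean_squared_error(y_true, y_pred)))
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")
```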


### 9).
Which model performed better on the testing data? Explain why you think this might be. Are you surprised by any of your results? Why or why not?

The linear regression model performed better than the k-nearest neighbors model on every metric: it has a higher $R^2$ (0.542 versus 0.497) and lower RMSE (2.21 versus 2.32) and MAE (1.59 versus 1.62). This is likely because KNN tends to do worse as the number of predictors grows, and after dummy coding and adding the interaction terms we have 15 predictor variables. Given this "curse of dimensionality", I am not very surprised by the results, although it is interesting how close the two models are on these metrics.
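
The curse of dimensionality can be illustrated with a small simulation, assuming uniformly random points, not the abalone predictors. As the number of dimensions grows, the nearest and farthest points from a query end up at increasingly similar distances, so the "nearest" neighbors become less informative. The real predictors are correlated, so this sketch only illustrates the general effect and does not measure how strongly it applies here.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_to_farthest_ratio(n_dims, n_points=500):
    """Ratio of the nearest to the farthest Euclidean distance from a random query."""
    points = rng.uniform(size=(n_points, n_dims))
    query = rng.uniform(size=n_dims)
    distances = np.linalg.norm(points - query, axis=1)
    return distances.min() / distances.max()

# A ratio near 0 means the nearest neighbor is much closer than the farthest
# point; a ratio near 1 means all points are about equally far away.
ratios = {d: nearest_to_farthest_ratio(d) for d in (2, 15, 100)}
for d, ratio in ratios.items():
    print(f"{d:>3} dimensions: nearest/farthest = {ratio:.2f}")
```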