# Module 3 - Programming Assignment

## Directions

1. Change the name of this file to be your JHED id as in `jsmith299.ipynb`. Because sure you use your JHED ID (it's made out of your name and not your student id which is just letters and numbers).
2. Make sure the notebook you submit is cleanly and fully executed. I do not grade unexecuted notebooks.
3. Submit your notebook back in Blackboard where you downloaded this file.

*Provide the output **exactly** as requested*

## k Nearest Neighbors and Model Evaluation

In this programming assignment you will use k Nearest Neighbors (kNN) to build a "model" that will estimate the compressive strength of various types of concrete. This assignment has several objectives:

1. Implement the kNN algorithm with k=9. Remember...the data + distance function is the model in kNN. In addition to asserts that unit test your code, you should "test drive" the model, showing output that a non-technical person could interpret.

2. You are going to compare the kNN model above against the baseline model described in the course notes (the mean of the training set's target variable). You should use 10 fold cross validation and Mean Squared Error (MSE):

$$MSE = \frac{1}{n}\sum^n_i (y_i - \hat{y}_i)^2$$

as the evaluation metric ("error"). Refer to the course notes for the format your output should take. Don't forget a discussion of the results.

3. use validation curves to tune a *hyperparameter* of the model. 
In this case, the hyperparameter is *k*, the number of neighbors. Don't forget a discussion of the results.

4. evaluate the *generalization error* of the new model.
Because you may have just created a new, better model, you need a sense of its generalization error, calculate that. Again, what would you like to see as output here? Refer to the course notes. Don't forget a discussion of the results. Did the new model do better than either model in Q2?

5. pick one of the "Choose Your Own Adventure" options.

Refer to the "course notes" for this module for most of this assignment.
Anytime you just need test/train split, use fold index 0 for the test set and the remainder as the training set.
Discuss any results.

## Load the Data

The function `parse_data` loads the data from the specified file and returns a List of Lists. The outer List is the data set and each element (List) is a specific observation. Each value of an observation is for a particular measurement. This is what we mean by "tidy" data.

The function also returns the *shuffled* data because the data might have been collected in a particular order that *might* bias training.

In [1]:
import random
from typing import List, Dict, Tuple, Callable

In [2]:
def parse_data(file_name: str) -> List[List]:
    data = []
    file = open(file_name, "r")
    for line in file:
        datum = [float(value) for value in line.rstrip().split(",")]
        data.append(datum)
    random.shuffle(data)
    return data

In [3]:
data = parse_data("concrete_compressive_strength.csv")

In [4]:
data[0]

[272.8, 181.9, 0.0, 185.7, 0.0, 1012.4, 714.3, 7.0, 19.77]

In [5]:
len(data)

1030

There are 1,030 observations and each observation has 8 measurements. The data dictionary for this data set tells us the definitions of the individual variables (columns/indices):

| Index | Variable | Definition |
|-------|----------|------------|
| 0     | cement   | kg in a cubic meter mixture |
| 1     | slag     | kg in a cubic meter mixture |
| 2     | ash      | kg in a cubic meter mixture |
| 3     | water    | kg in a cubic meter mixture |
| 4     | superplasticizer | kg in a cubic meter mixture |
| 5     | coarse aggregate | kg in a cubic meter mixture |
| 6     | fine aggregate | kg in a cubic meter mixture |
| 7     | age | days |
| 8     | concrete compressive strength | MPa |

The target ("y") variable is a Index 8, concrete compressive strength in (Mega?) [Pascals](https://en.wikipedia.org/wiki/Pascal_(unit)).

## Train/Test Splits - n folds

With n fold cross validation, we divide our data set into n subgroups called "folds" and then use those folds for training and testing. You pick n based on the size of your data set. If you have a small data set--100 observations--and you used n=10, each fold would only have 10 observations. That's probably too small. You want at least 30. At the other extreme, we generally don't use n > 10.

With 1,030 observations, n = 10 is fine so we will have 10 folds.
`create_folds` will take a list (xs) and split it into `n` equal folds with each fold containing one-tenth of the observations.

In [6]:
def create_folds(xs: List, n: int) -> List[List[List]]:
    k, m = divmod(len(xs), n) #k = numdata/10, m = remainder = 0
    # be careful of generators...
    return list(xs[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n))

In [7]:
folds = create_folds(data, 10)

In [8]:
len(folds)

10

We always use one of the n folds as a test set (and, sometimes, one of the folds as a *pruning* set but not for kNN), and the remaining folds as a training set.
We need a function that'll take our n folds and return the train and test sets:

In [9]:
def create_train_test(folds: List[List[List]], index: int) -> Tuple[List[List], List[List]]:
    training = []
    test = []
    for i, fold in enumerate(folds):
        if i == index:
            test = fold
        else:
            training = training + fold
    return training, test

We can test the function to give us a train and test datasets where the test set is the fold at index 0:

In [10]:
train, test = create_train_test(folds, 0)

In [11]:
len(train)

927

In [12]:
len(test)

103

## Answers

Answer the questions above in the space provided below, adding cells as you need to.
Put everything in the helper functions and document them.
Document everything (what you're doing and why).
If you're not sure what format the output should take, refer to the course notes and what they do for that particular topic/algorithm.

## Problem 1: kNN

Implement k Nearest Neighbors with k = 9.

### Imports

In [13]:
import numpy as np
from copy import deepcopy

<a id="get_dist"></a>
### get_dist

`get_dist` takes in an example and query, and returns the euclidean distance between these two points. **Used by**: [knn](#knn)

* **example**: List: Dataset against which you're measuring distance
* **query**: List: Test point we're looking to measure

**returns** dist: float Euclidean distance between two points

In [14]:
def get_dist(example, query): 
    assert len(example) == len(query)
    sum = 0 #init
    
    for i in range(len(example)-1): #last col is y 
        sum += (example[i]-query[i])**2
    dist = np.sqrt(sum)
    
    return dist

In [15]:
# Unit Tests / Assertions
ex = [0, 0, 0]
quer = [1, 0, 0]
assert get_dist(ex, quer) == 1

quer = [1, 1, 1]
assert get_dist(ex, quer) == np.sqrt(2)

quer = [2, 0, 0]
assert get_dist(ex, quer) == 2

<a id="knn_processing"></a>
### knn_processing

`knn_processing` takes in the k nearest neighbors, and outputs the processed prediction for y. This instantiation outputs the mean y value of the nearest neighbors. 

* **nearest**: List[float]: values of the k nearest neighbors

**returns** mean_y: average y value of the k nearest neighbors

In [16]:
def knn_processing(nearest: List[float]): 
    dist_only, y_vals, y_sum, num_y = [], [], 0, 0
    for i in range(len(nearest)): 
        y_sum += nearest[i][1][-1]
        num_y += 1

    return y_sum / num_y

In [17]:
# Unit Tests / Assertions
test_near = [[0.0, [1.0, 1.0, 1.0, 1.0]]] 
assert knn_processing(test_near) == 1

test_near = [[0.0, [1.0, 1.0, 1.0, 1.0]], [0.0, [1.0, 1.0, 1.0, 2.0]]]
assert knn_processing(test_near) == 1.5

test_near = [[0.0, [1.0, 1.0, 1.0, 1.0]], [0.0, [1.0, 1.0, 1.0, 2.0]], [0.0, [1.0, 1.0, 1.0, 3.0]]]
assert knn_processing(test_near) == 2

<a id="knn"></a>
### knn

`knn` implements the k-nearest neighbor algorithm. Takes in a dataset and a query, generates a list of distances to k nearest neighbors, and processes that output. Unit Tests for this block are shown in the test drive section. 

* **dataset**: List: Main dataset against which we're measuring the query
* **query**: List: test point we're measuring / predicting for
* **k**: int: number of nearest neighbors to generate

**returns** prediction: float y_hat predicted value for query

In [18]:
def knn(dataset, query, k): 
    distances = []
    
    for example in dataset: 
        distances.append([get_dist(example,query), example])
    
    distances = sorted(distances) #sort on first element
    nearest = distances[0:k]
    prediction = knn_processing(nearest)

    return prediction

<a id="Test Drive the Model"></a>
### Test Drive the Model

In [19]:
index = 27
random_query = test[index]
pred = knn(test, random_query, 9)
print('Index # ', index)
print('Actual [MPa]: ', test[index][-1])
print('Prediction [MPa]: ', pred)
print('--------')

index = 5
random_query = test[index]
pred = knn(test, random_query, 9)
print('Index # ', index)
print('Actual [MPa]: ', test[index][-1])
print('Prediction [MPa]: ', pred)
print('--------')

index = 55
random_query = test[index]
pred = knn(test, random_query, 9)
print('Index # ', index)
print('Actual [MPa]: ', test[index][-1])
print('Prediction [MPa]: ', pred)

Index #  27
Actual [MPa]:  19.93
Prediction [MPa]:  26.958888888888893
--------
Index #  5
Actual [MPa]:  13.71
Prediction [MPa]:  23.34
--------
Index #  55
Actual [MPa]:  22.95
Prediction [MPa]:  35.93333333333333


## Problem 2: Evaluation vs. The Mean

Using Mean Squared Error (MSE) as your evaluation metric, evaluate your implement above and the Null model, the mean.

<a id="get_mean_y"></a>
### get_mean_y

`get_mean_y` takes in a dataset (or portion) and averages the y value from all points. The y value is set to the last index in each row. 

* **dataset**: List: test data used to generate y_mean 

**returns** y_mean: float value for average y

In [20]:
def get_mean_y(dataset: List): 
    sum = 0
    
    for row in dataset: 
        sum+=row[-1] #y val
    
    y_mean = sum / len(dataset)
    return y_mean

In [21]:
# Unit Tests / Assertions
test_data = [[0, 1], [0, 1], [0,1]]
assert get_mean_y(test_data) == 1
test_data = [[0, 2], [0, 2], [0,2]]
assert get_mean_y(test_data) == 2
test_data = [[0, 2], [0, 2], [0,5]]
assert get_mean_y(test_data) == 3

<a id="evaluate_mse"></a>
### evaluate_mse

`evaluate_mse` uses Mean Squared Error (MSE) as an evaluation metric, and generates the MSE and mean_y for each fold in the 10-fold cross validation. 

* **folds**: List: Full dataset, split into (10) folds 
* **k**: int: number of nearest neighbors used in this analysis

**returns** results: Dictionary of MSE and mean y value for each fold: float Euclidean distance between two points

In [22]:
def evaluate_mse(folds: List[List[List]], k: int): 
    n = len(folds[0])
    results = {} #init
    for i in range(len(folds)): #iterate over folds
        
        sum, y_sum = 0, 0 #init for each fold
        train, test = create_train_test(folds, i)
        
        for j in range(len(test)): #iterate over rows
            y_cur = test[j][-1] #last element in row
            y_hat = knn(train, test[j], k)
            y_sum += y_cur
            sum += ((y_cur - y_hat)**2)

        MSE = (1/len(test)) * sum #changed from train to test in resubmit
        y_avg = y_sum / len(test)
        results[i] = {'MSE': MSE, 'Fold mean y': y_avg}
    
    return results

In [23]:
mse_results = evaluate_mse(folds, 9)
assert len(mse_results) == 10
assert (get_mean_y(folds[0])) == mse_results[0]['Fold mean y']
assert (get_mean_y(folds[1])) == mse_results[1]['Fold mean y']

<a id="evaluate_mean_mse"></a>
### evaluate_mean_mse

`evaluate_mean_mse` uses the same logic as the `evaluate_mse` function, except y_mean is used in place of y_hat predictions. This represents the null model, and also uses 10-fold cross validation.

* **folds**: List: Full dataset, split into (10) folds 
* **k**: int: number of nearest neighbors used in this analysis

**returns** results: Dictionary of mean MSE and mean y value for each fold: float Euclidean distance between two points

In [24]:
def evaluate_mean_mse(folds: List[List[List]], k: int): 
    n = len(folds[0])
    results = {} #init
    for i in range(len(folds)): #iterate over folds
        sum, y_sum = 0, 0 #init for each fold
        train, test = create_train_test(folds, i)
        y_mean = get_mean_y(train)
        
        for j in range(len(test)): #iterate over rows
            y_cur = test[j][-1] #last element in row
            y_sum += y_cur
            sum += ((y_cur - y_mean)**2)

        mean_MSE = (1/len(test)) * sum #changed from train to test in resubmit
        y_avg = y_sum / len(test)
        results[i] = {'Mean MSE': mean_MSE, 'Fold mean y': y_avg}
    return results

In [25]:
mean_mse_results = evaluate_mean_mse(folds, 9)
assert len(mean_mse_results) == 10
assert (get_mean_y(folds[0])) == mean_mse_results[0]['Fold mean y']
assert (get_mean_y(folds[1])) == mean_mse_results[1]['Fold mean y']

<a id="avg_mse"></a>
### avg_mse

`avg_mse` takes in the MSE results from the `evaluate` function, and averages the MSE value across all 10 folds. This function returns a float value for average MSE. 

* **mse_results**: Dict: Dictionary storing 10 folds, and the MSE value for each
* **key**: Str: Lookup key for MSE in the dictionary

**returns** avg float average MSE across all folds

In [26]:
def avg_mse(mse_results: Dict, key: str): 
    sum = 0
    length = len(mse_results)
    for i in range(length): 
        sum += mse_results[i][key]
    avg = sum / length
    return avg

In [27]:
# Unit Tests / Assertions
test_results = mse_results
assert len(mse_results) == 10
test_dict = {0: {'test': 5}, 1: {'test': 10}}
assert avg_mse(test_dict, 'test') == 7.5

test_dict = {0: {'test': 2}, 1: {'test': 2}}
assert avg_mse(test_dict, 'test') == 2

In [28]:
print('Average MSE: ', avg_mse(mse_results, 'MSE'))
print('Average Mean-MSE: ', avg_mse(mean_mse_results, 'Mean MSE'))

Average MSE:  86.87077578928441
Average Mean-MSE:  279.2620808229211


### Discussion of Results
From the cell above, we see the method of using the training set mean results in a MSE of roughly triple that of the knn MSE method. This is essentially comparing knn against a null model, so we would expect the knn method to return a lower error (i.e. lower MSE). This tracks with those expectations. 

## Problem 3: Hyperparameter Tuning

Tune the value of k.

<a id="tune_knn"></a>
### tune_knn

`tune_knn` takes in the total folds, and a max k value to test. The function then iterates over each value of k from 1 to *k_max*, and generates the average MSE across all 10 folds at that k value. This function runs a simple loop for k-values, calling previously used functions; no further unit tests / assertions are necessary. 

* **folds**: List[List[List]]]: Entire dataset, split into 10 folds
* **k_max**: int: maximum k value to test

**returns** None (print statements for average MSE)

In [29]:
def tune_knn(folds, k_max): 

    for i in range(1,k_max): 
        results = evaluate_mse(folds, i)
        avg = avg_mse(results, 'MSE')
        print('k =', str(i), ', Avg MSE: ', avg)
    
    return None

In [30]:
tune_knn(folds, 20)

k = 1 , Avg MSE:  76.14022902912623
k = 2 , Avg MSE:  79.20551538834951
k = 3 , Avg MSE:  75.90810213592232
k = 4 , Avg MSE:  75.5415845145631
k = 5 , Avg MSE:  78.87600217864076
k = 6 , Avg MSE:  83.58282061218986
k = 7 , Avg MSE:  84.08195187834356
k = 8 , Avg MSE:  84.28254505916263
k = 9 , Avg MSE:  86.87077578928441
k = 10 , Avg MSE:  89.5253041893204
k = 11 , Avg MSE:  91.6770276731124
k = 12 , Avg MSE:  92.8732228229504
k = 13 , Avg MSE:  94.27140319411731
k = 14 , Avg MSE:  95.16755228452544
k = 15 , Avg MSE:  97.02561548392664
k = 16 , Avg MSE:  99.2175871609527
k = 17 , Avg MSE:  101.33023809554206
k = 18 , Avg MSE:  102.6583883629989
k = 19 , Avg MSE:  105.07362677002934


### Discussion of Results

This tuning generally shows that average MSE increases as k increases. This makes some intuitive sense: as k increases to the limit, k is eventually equal to the number of data points, which turns the model into the null model. Lower k-values are ideal in this situation, especially for regression. The farther you expand the zone of influence around a target point (i.e. increase k), the more error the model experiences. 

## Problem 4: Generalization Error

Analyze and discuss the generalization error of your model with the value of k from Problem 3.

As discussed above, MSE generally increases as k increases. There are a few dips in this slope, where MSE appears level or decreasing for a step of k, such as from k=2 to k=3, and from k=8 to k=9. For this evaluation, I'll use k=3, and re-evaluate the same model. 

In [31]:
# MSE vs Mean Error
gen_results = evaluate_mse(folds, k=3)
print('mean_y over training dataset: ', get_mean_y(train))
print('average MSE across normalized results: ', avg_mse(gen_results, 'MSE'))
gen_results

mean_y over training dataset:  35.91021574973034
average MSE across normalized results:  75.90810213592232


{0: {'MSE': 94.92535577130523, 'Fold mean y': 34.987669902912636},
 1: {'MSE': 76.86820258899677, 'Fold mean y': 36.26980582524273},
 2: {'MSE': 67.62811154261057, 'Fold mean y': 38.32446601941749},
 3: {'MSE': 72.38882524271845, 'Fold mean y': 35.15281553398057},
 4: {'MSE': 85.43824390507012, 'Fold mean y': 35.650776699029116},
 5: {'MSE': 65.07042384034519, 'Fold mean y': 34.00718446601942},
 6: {'MSE': 62.795598381877014, 'Fold mean y': 34.76815533980582},
 7: {'MSE': 56.43578673139156, 'Fold mean y': 37.07330097087379},
 8: {'MSE': 78.27545512405605, 'Fold mean y': 34.30165048543689},
 9: {'MSE': 99.2550182308522, 'Fold mean y': 37.64378640776699}}

### Discussion of Results
In general, the MSE values for k=3 look similar to those of k=9. The average for this evaluation (8.6) is less than the k=9 case (9.9). This seems to be a better choice for k-value moving forward. As an added benefit, this will increase computational efficiency as well, since each test point is only being evaluated against 3 nearest neighbors. 

## Q5: Choose your own adventure

You have three options for the next part:

1. You can implement mean normalization (also called "z-score standardization") of the *features*; do not normalize the target, y. See if this improves the generalization error of your model (middle).

2. You can implement *learning curves* to see if more data would likely improve your model (easiest).

3. You can implement *weighted* kNN and use the real valued GA to choose the weights. weighted kNN assigns a weight to each item in the Euclidean distance calculation. For two points, j and k:
$$\sqrt{\sum w_i (x^k_i - x^j_i)^2}$$

You can think of normal Euclidean distance as the case where $w_i = 1$ for all features  (ambitious, but fun...you need to start EARLY because it takes a really long time to run).

The easier the adventure the more correct it must be...

For this question, I will address #1 - normalization. 

<a id="find_max_vals"></a>
### find_max_vals

`find_max_vals` takes in a dataset and outputs the maximum value in each column. This will be used to identify the maximum value in each of the data features. 

* **data**: List: dataset being evaluated

**returns** max_vals: List of maximum values in each column. 

In [32]:
def find_max_vals(data: List[List]): 
    max_vals = data[0] #init
    for row in data: 
        for i in range(len(row)): 
            if row[i] > max_vals[i]: 
                max_vals[i] = row[i]

    return max_vals

In [33]:
# Unit Tests / Assertions
test = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
assert find_max_vals(test) == [3, 3, 3]

test = [[1, 1, 1], [2, 5, 2], [3, 3, 3]]
assert find_max_vals(test) == [3, 5, 3]

test = [[1, 1, 7], [2, 5, 2], [3, 3, 3]]
assert find_max_vals(test) == [3, 5, 7]

<a id="normalize"></a>
### normalize

`normalize` takes in the full dataset, creates a deep copy, determines max values for each feature, then normalizes the dataset (except for the y-column). 

* **data**: List: full dataset for normalization

**returns** norm_data: List of normalized data

In [34]:
def normalize(data: List[List]): 
    
    norm_data = deepcopy(data)
    max_vals = find_max_vals(data)
    for i in range(len(norm_data)): #iter rows 
        for j in range(len(norm_data[0])-1): #iter cols
            norm_data[i][j] = float(norm_data[i][j]) / float(max_vals[j])
    
    return norm_data

In [35]:
norm = normalize(data)
norm_folds = create_folds(norm, 10)
norm_results = evaluate_mse(norm_folds, 3)
print('mean_y over normalized dataset: ', get_mean_y(norm))
print('average MSE across normalized results: ', avg_mse(norm_results, 'MSE'))
norm_results

mean_y over normalized dataset:  35.81796116504851
average MSE across normalized results:  62.347890053937434


{0: {'MSE': 66.88365480043147, 'Fold mean y': 34.987669902912636},
 1: {'MSE': 57.00971067961165, 'Fold mean y': 36.26980582524273},
 2: {'MSE': 58.04900658036675, 'Fold mean y': 38.32446601941749},
 3: {'MSE': 49.809030420711956, 'Fold mean y': 35.15281553398057},
 4: {'MSE': 71.50795080906147, 'Fold mean y': 35.650776699029116},
 5: {'MSE': 63.178303020496216, 'Fold mean y': 34.00718446601942},
 6: {'MSE': 58.250506688241636, 'Fold mean y': 34.76815533980582},
 7: {'MSE': 43.44902912621359, 'Fold mean y': 37.07330097087379},
 8: {'MSE': 78.8639008629989, 'Fold mean y': 34.30165048543689},
 9: {'MSE': 76.4778075512406, 'Fold mean y': 37.64378640776699}}

### Discussion of Results
Normalizing the data seems to improve (decrease) the MSE of the model (from 8.6 to 7.5), reducing the overall generalization error. Normalizing the data reduces the variablility in each of the features, so it would make intuitive sense that the MSE should be reduced. The values in this dataset are on similar orders of magnitude, with the exception of index 4 (superplasticizer), which is a single order of magnitude less. If this dataset were to have values at drastically different orders of magnitude (i.e. 10 and 10000000), I would expect normalization to have a larger impact on reducing the generalization error and improving MSE. 

## Q5 Resubmittal - Z Score Standardization

<a id="get_cols"></a>
### get_cols

`get_cols` takes in the full dataset and outputs a list of all the column data (except for y-column)

* **data**: List: full dataset for normalization

**returns** cols: List[List] of columns

In [36]:
def get_cols(data: List[List]):
    cols = [[] for i in range(len(data[0])-1)] #init 8 empty lists
    for row in data: 
        for i in range(len(data[0])-1): #not incl y
            cols[i].append(row[i])

    return cols

In [37]:
#Unit Tests / Assertions
test_data = [
    [1, 2, 7], 
    [3, 4, 7]]
assert get_cols(test_data) == [[1, 3], [2, 4]]
assert len(get_cols(test_data)) == len(test_data[0]) - 1
test_data = [
    [1, 2], 
    [3, 4]]
assert get_cols(test_data) == [[1, 3]]

<a id="get_std_dev"></a>
### get_std_dev

`get_std_dev` takes in the full dataset and determines standard deviation for each feature (except for the y-column).

* **data**: List: full dataset for normalization

**returns** std_dev_vals: List of standard deviation values per column

In [38]:
def get_std_dev(data: List[List]): 
    cols = get_cols(data)
    std_dev_vals = []
    for col in cols: 
        std_dev_vals.append(np.std(col))

    return std_dev_vals

In [39]:
# Unit Tests / Assertions
test_data = [
    [1, 2, 7], 
    [3, 4, 7]]
assert get_std_dev(test_data) == [1.0, 1.0]
test_data = [
    [1, 2], 
    [3, 4]]
assert get_std_dev(test_data) == [1.0]
test_data = [
    [1, 2], 
    [100, 4]]
assert get_std_dev(test_data) == [49.5]

<a id="get_col_mean"></a>
### get_col_mean

`get_col_mean` takes in the full dataset and determines the mean for each feature (except for the y-column).

* **data**: List: full dataset for normalization

**returns** mean_vals: List of mean values per column

In [40]:
def get_col_mean(data: List[List]): 
    cols = get_cols(data)
    mean_vals = []
    for col in cols: 
        mean_vals.append(np.mean(col))

    return mean_vals

In [41]:
# Unit Tests / Assertions
test_data = [
    [1, 2, 7], 
    [3, 4, 7]]
assert get_col_mean(test_data) == [2.0, 3.0]
test_data = [
    [1, 2], 
    [100, 4]]
assert get_col_mean(test_data) == [50.5]
assert len(get_col_mean(test_data)) == len(test_data[0]) - 1

<a id="z_standardize"></a>
### z_standardize

`z_standardize` takes in the full dataset, creates a deep copy, determines mean values and standard deviations for each feature, then performs z-score standardization (except for the y-column). 

* **data**: List: full dataset for standardization

**returns** norm_data: List of normalized data

In [42]:
def z_standardize(data: List[List]): 

    z_data = deepcopy(data)
    mean_vals = get_col_mean(z_data)
    std_dev_vals = get_std_dev(z_data)
    for i in range(len(z_data)): #iter rows 
        for j in range(len(z_data[0])-1): #iter cols, not y
            z_data[i][j] = (float(z_data[i][j]) - float(mean_vals[j])) / float(std_dev_vals[j])
    
    return z_data

In [43]:
test_data = [
    [1, 2, 7], 
    [3, 4, 7]]
assert z_standardize(test_data) == [[-1.0, -1.0, 7], [1.0, 1.0, 7]]
test_data = [
    [1, 2], 
    [100, 4]]
assert z_standardize(test_data) == [[-1.0, 2], [1.0, 4]]
assert len(z_standardize(test_data)) == len(test_data[0])

### Z-Score Standardization

In [44]:
z_data = z_standardize(data)
z_folds = create_folds(z_data, 10)
z_results = evaluate_mse(z_folds, 3)
print('mean_y over normalized dataset: ', get_mean_y(z_data))
print('average MSE across normalized results: ', avg_mse(z_results, 'MSE'))
z_results

mean_y over normalized dataset:  35.8789611650485
average MSE across normalized results:  74.42918348435813


{0: {'MSE': 81.57017249190935, 'Fold mean y': 35.597669902912635},
 1: {'MSE': 76.06079417475726, 'Fold mean y': 36.26980582524273},
 2: {'MSE': 60.35733786407765, 'Fold mean y': 38.32446601941749},
 3: {'MSE': 68.5814316073355, 'Fold mean y': 35.15281553398057},
 4: {'MSE': 83.2563334412082, 'Fold mean y': 35.650776699029116},
 5: {'MSE': 75.90633786407764, 'Fold mean y': 34.00718446601942},
 6: {'MSE': 67.04184412081983, 'Fold mean y': 34.76815533980582},
 7: {'MSE': 56.54921488673137, 'Fold mean y': 37.07330097087379},
 8: {'MSE': 84.30616634304204, 'Fold mean y': 34.30165048543689},
 9: {'MSE': 90.66220204962244, 'Fold mean y': 37.64378640776699}}

### Discussion of Resubmit Results
Performing z-score standardization of the data seems to result in roughly the same average MSE of the model (i.e. from 75.1 to 75.3), slightly increasing the overall generalization error. I would expect standardizing the data to reduce the variablility in each of the features, so it would make intuitive sense that the MSE should be reduced. The values in this dataset are on similar orders of magnitude, with the exception of index 4 (superplasticizer), which is a single order of magnitude less. If this dataset were to have values at drastically different orders of magnitude (i.e. 10 and 10000000), I would expect standardization to have a larger impact on reducing the generalization error and improving MSE. 

## Before You Submit...

1. Did you provide output exactly as requested?
2. Did you re-execute the entire notebook? ("Restart Kernel and Rull All Cells...")
3. If you did not complete the assignment or had difficulty please explain what gave you the most difficulty in the Markdown cell below.
4. Did you change the name of the file to `jhed_id.ipynb`?

Do not submit any other files.

## Resubmit Notes

**Feedback from Professor**
- Cleanup your code, ie, things like ` if mode == 'default': ` in knn_processing
- Your MSE has an error for evaluate_mse: MSE = (1/len(train)) * sum should be len of test
- Normalize isn't the same as z-score standardization
- You're really close. But your MSE results aren't correct as you don't divide by the correct value - Revise (5)

**Resubmission Changes**
- Default mode removed in `knn_processing` and all subsequent calls
- `evaluate_mse` updated - changed len(train) to len(test)
- Section 1.11 created, with all applicable new functions:
    - `get_cols`
    - `get_std_dev`
    - `get_col_mean`
    - `z_standardize`
- New results for Q5 included (using z-score standardization
- New results discussed
- Code cleaned up throughout