# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 4 Assignment: Regression Neural Network**

**Student Name: Jie Miao**

# Assignment Instructions
For this assignment you will use the reg-33-spring-2019.csv dataset. This is a dataset that I generated specifically for this semester. You can find the CSV file on my data site, at this location: reg-33-spring-2019.csv.

For this assignment you will train a neural network and return the predictions. You will submit these predictions to the submit function. See Assignment #1 for details on how to submit an assignment or check that one was submitted.

Complete the following tasks:

* Normalize all numeric columns to z-scores and encode all text/categorical columns as dummies. Do not normalize the target.
* If you find any missing values (NA's), replace them with the median values for that column.
* No need for any cross validation or holdout. Just train on the entire data set for 500 epochs.
* You might get a warning, such as "Warning: The mean of column pred differs from the solution file by 2.39". Unless this value is several hundred, do not worry about it. I used a neural network with layer sizes of (200, 100, 50) and got an RMSE of around 600, with the result "Warning: The mean of column pred differs from the solution file by 89.07342078982037". More epochs would likely improve this further. How low can you get it?
* Your submission should contain the id (column name *id*), your prediction (column name *pred*), the expected value (from the reg-33-spring-2019.csv dataset, column name *y*), and the absolute value of the difference between the expected and predicted values (column name *diff*).
* Your submitted dataframe will have these columns: id, pred.
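
The preprocessing tasks above can be sketched on a toy dataframe. This is only an illustration: the column names `height` and `color` and all of the values are made up, and plain pandas calls are used here rather than the helper functions defined later in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the assignment CSV; column names are hypothetical.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "height": [1.0, 2.0, np.nan, 3.0],
    "color": ["red", "green", "red", "blue"],
    "target": [10.0, 20.0, 30.0, 40.0],
})

# Replace missing values with the column median.
df["height"] = df["height"].fillna(df["height"].median())

# Z-score the numeric predictor; id and target are left untouched.
df["height"] = (df["height"] - df["height"].mean()) / df["height"].std()

# Expand the categorical column into one dummy column per category.
df = pd.get_dummies(df, columns=["color"], prefix_sep="-", dtype=float)

print(df.columns.tolist())
```

After these steps the numeric predictor has mean 0 and standard deviation 1, the text column has been replaced by one column per category, and the target keeps its original scale.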

# Helpful Functions

You will see these at the top of every module and assignment.  These are simply a set of reusable functions that we will make use of.  Each of them will be explained in greater detail as the course progresses.  Class 4 contains a complete overview of these functions.

In [1]:
import base64
import os

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import requests
from sklearn import preprocessing


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = f"{name}-{x}"
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = f"{name}-{tv}"
        df[name2] = l


# Encode text values to indexes (e.g. 0, 1, 2 for the sorted classes blue, green, red).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(
        target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    # Regression
    return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return f"{h}:{m:>02}:{s:>05.2f}"


# Regression chart.
def chart_regression(pred, y, sort=True):
    t = pd.DataFrame({'pred': pred, 'y': y.flatten()})
    if sort:
        t.sort_values(by=['y'], inplace=True)
    plt.plot(t['y'].tolist(), label='expected')
    plt.plot(t['pred'].tolist(), label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is at least sd standard deviations from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean())
                          >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    # Fall back to the observed range independently for each bound.
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
        * (normalized_high - normalized_low) + normalized_low


# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number, should be 1 through 1.
# source_file - The full path to your Python or IPYNB file.  This must have "_class1" as part of its name.  
# .             The number must match your assignment number.  For example "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))
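
One detail of `to_xy` worth noting is that it chooses between classification and regression from the dtype of the target column: integer targets are expanded into dummies, anything else is returned as a single float column. The sketch below shows that behaviour with a condensed stand-in; `split_xy` and the toy columns are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def split_xy(df, target):
    """Hypothetical condensed stand-in for the to_xy helper."""
    features = [c for c in df.columns if c != target]
    x = df[features].values.astype(np.float32)
    if df[target].dtype in (np.int64, np.int32):
        # Integer target: treated as class labels, one column per class.
        y = pd.get_dummies(df[target]).values.astype(np.float32)
    else:
        # Any other target: treated as regression, a single column.
        y = df[[target]].values.astype(np.float32)
    return x, y


frame = pd.DataFrame({
    "a": [0.1, 0.2, 0.3],
    "label": np.array([0, 1, 0], dtype=np.int64),
    "price": [1.5, 2.5, 3.5],
})

x_reg, y_reg = split_xy(frame.drop(columns="label"), "price")
x_cls, y_cls = split_xy(frame.drop(columns="price"), "label")
print(y_reg.shape, y_cls.shape)
```

For a regression assignment this means the target should be stored as a float; a target column read in as integers would be treated as class labels.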

# Assignment #4 Sample Code

The following code provides a starting point for this assignment.

In [97]:
import os
import pandas as pd
from scipy.stats import zscore
from keras.models import Sequential
from keras.layers.core import Dense, Activation
import io
import requests
import numpy as np
from sklearn import metrics
from sklearn.model_selection import train_test_split
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
# This is your student key that I emailed to you at the beginning of the semester.
key = "qqeAIdN7kdaHWUmLYRyM5aHJxb8Yt5Ssw0qi11wb"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
file='/Users/jie/Desktop/t81_558_deep_learning-master/assignments/assignment_JieMiao_class4.ipynb'  # Mac/Linux
#file = "C:\\Users\\jeffh\\Dropbox\\school\\teaching\\wustl\\classes\\T81_558_deep_learning\\solutions\\assignment_solution_class4.ipynb"

# Begin assignment
path = "/Users/jie/Desktop"

filename_read = os.path.join(path,"reg-33-spring-2019.csv")
df = pd.read_csv(filename_read, na_values=['NA', '?'])

# Add assignment code here

# Normalize all numeric to zscores and all text/categorical to dummies. Do not normalize the target.
float_col = df.select_dtypes(['float64']).columns.tolist()
int_col = df.select_dtypes(['int64']).columns.tolist()
cols = float_col + int_col
all_cols = df.columns.tolist()
for col in cols:
    if  col != "target" and col != "id":
        encode_numeric_zscore(df, col, mean=None, sd=None)

for col in all_cols:
    if col not in cols:
        encode_text_dummy(df, col)

# If you find any missing values (NA's), replace them with the median values for that column.
# Fill each numeric column individually (the original passed the whole list on every iteration).
for col in cols:
    missing_median(df, col)

# No need for any cross validation or holdout. Just train on the entire data set for 500 epochs.
#You might get a warning, such as "Warning: The mean of column pred differs from the solution file by 2.39". 
#Unless this value is several hundred, do not worry about it. I used a neural network with layer sizes of (200, 100, 50) 
#and got a RMSE of around 600, with a result of Warning: The mean of column pred differs from the solution file by 89.07342078982037. 
#More epochs would likely improve this further, how low can you get it?

x,y = to_xy(df, 'target')
x_train, x_test, y_train, y_test = train_test_split(    
    x, y, test_size=0.25, random_state=45)
model = Sequential()
model.add(Dense(10, input_dim=x.shape[1], activation='relu'))
# model.add(Dense(10))
# model.add(Dense(10))
# model.add(Dense(10))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adam')
monitor = EarlyStopping(monitor='val_loss', min_delta=1e-3, patience=5, verbose=1, mode='auto')
model.fit(x_train,y_train,validation_data=(x_test,y_test),callbacks=[monitor],verbose=2,epochs=500)

# Your submission should contain the id (column name id), your prediction (column name pred),
# the expected value (from the reg-33-spring-2019.csv dataset, column name y), and the absolute value
# of the difference between the expected and predicted values (column name diff).



Train on 8106 samples, validate on 2703 samples
Epoch 1/500
 - 1s - loss: 9017537978.3943 - val_loss: 8384944425.6722
Epoch 2/500
 - 0s - loss: 7408142293.5544 - val_loss: 6213668112.1953
Epoch 3/500
 - 0s - loss: 5284908406.1150 - val_loss: 4393018457.9741
Epoch 4/500
 - 0s - loss: 3948698278.7506 - val_loss: 3596652641.2194
Epoch 5/500
 - 0s - loss: 3506787564.3563 - val_loss: 3432371697.2253
Epoch 6/500
 - 0s - loss: 3434285645.5643 - val_loss: 3418200739.6582
Epoch 7/500
 - 0s - loss: 3426623470.5038 - val_loss: 3417396074.1694
Epoch 8/500
 - 0s - loss: 3425644230.5848 - val_loss: 3417671675.8328
Epoch 9/500
 - 0s - loss: 3425304772.6267 - val_loss: 3417540539.3356
Epoch 10/500
 - 0s - loss: 3424569492.5596 - val_loss: 3415609271.7366
Epoch 11/500
 - 0s - loss: 3424072006.8058 - val_loss: 3415574660.4040
Epoch 12/500
 - 0s - loss: 3423324438.4229 - val_loss: 3416011828.5638
Epoch 13/500
 - 0s - loss: 3423546099.6832 - val_loss: 3414305218.6282
Epoch 14/500
 - 0s - loss: 3422449272.

Epoch 116/500
 - 0s - loss: 3335546893.3906 - val_loss: 3327593606.4876
Epoch 117/500
 - 0s - loss: 3334110381.0669 - val_loss: 3326026498.3677
Epoch 118/500
 - 0s - loss: 3332928738.6923 - val_loss: 3324971057.2490
Epoch 119/500
 - 0s - loss: 3331415956.4017 - val_loss: 3324033753.3585
Epoch 120/500
 - 0s - loss: 3329673545.2692 - val_loss: 3321738640.7162
Epoch 121/500
 - 0s - loss: 3328438679.8441 - val_loss: 3320751406.5971
Epoch 122/500
 - 0s - loss: 3326887316.7491 - val_loss: 3319827504.7754
Epoch 123/500
 - 0s - loss: 3325381796.5398 - val_loss: 3317335411.9245
Epoch 124/500
 - 0s - loss: 3323573042.7200 - val_loss: 3317266011.1106
Epoch 125/500
 - 0s - loss: 3322696393.5534 - val_loss: 3314565215.3725
Epoch 126/500
 - 0s - loss: 3320815569.5435 - val_loss: 3313154724.2264
Epoch 127/500
 - 0s - loss: 3319209128.8981 - val_loss: 3312177762.4033
Epoch 128/500
 - 0s - loss: 3317664127.1473 - val_loss: 3310622832.6097
Epoch 129/500
 - 0s - loss: 3316694944.4342 - val_loss: 33084743

Epoch 230/500
 - 0s - loss: 2858943696.8803 - val_loss: 2853267709.9637
Epoch 231/500
 - 0s - loss: 2849217038.4012 - val_loss: 2848578292.6822
Epoch 232/500
 - 0s - loss: 2840060412.2734 - val_loss: 2833940539.8565
Epoch 233/500
 - 0s - loss: 2828875111.0190 - val_loss: 2823131792.1953
Epoch 234/500
 - 0s - loss: 2818758400.7895 - val_loss: 2812166943.5383
Epoch 235/500
 - 0s - loss: 2807027077.8742 - val_loss: 2803995322.1517
Epoch 236/500
 - 0s - loss: 2796773915.7918 - val_loss: 2791180463.0707
Epoch 237/500
 - 0s - loss: 2787113831.3348 - val_loss: 2780248823.7129
Epoch 238/500
 - 0s - loss: 2774297485.2327 - val_loss: 2769374684.3892
Epoch 239/500
 - 0s - loss: 2764290136.6178 - val_loss: 2758745854.5794
Epoch 240/500
 - 0s - loss: 2753250120.0691 - val_loss: 2748913318.4994
Epoch 241/500
 - 0s - loss: 2741902063.4513 - val_loss: 2736123779.5990
Epoch 242/500
 - 0s - loss: 2729054845.5682 - val_loss: 2723243822.1709
Epoch 243/500
 - 0s - loss: 2717449500.4866 - val_loss: 27117241

Epoch 344/500
 - 0s - loss: 984554515.1384 - val_loss: 992830365.8335
Epoch 345/500
 - 0s - loss: 968608541.9709 - val_loss: 976492278.4343
Epoch 346/500
 - 0s - loss: 950654450.8621 - val_loss: 958501750.5527
Epoch 347/500
 - 0s - loss: 932690765.9433 - val_loss: 942865393.5331
Epoch 348/500
 - 0s - loss: 915856767.9053 - val_loss: 929393601.4680
Epoch 349/500
 - 0s - loss: 900335565.8564 - val_loss: 907513845.6293
Epoch 350/500
 - 0s - loss: 883440179.5727 - val_loss: 894679913.1040
Epoch 351/500
 - 0s - loss: 867049837.2721 - val_loss: 874218806.1739
Epoch 352/500
 - 0s - loss: 848975960.3020 - val_loss: 861766785.5627
Epoch 353/500
 - 0s - loss: 834069084.8813 - val_loss: 842746903.2275
Epoch 354/500
 - 0s - loss: 817266250.6430 - val_loss: 840295848.9145
Epoch 355/500
 - 0s - loss: 800844877.0906 - val_loss: 814718003.4983
Epoch 356/500
 - 0s - loss: 785063230.3736 - val_loss: 794550267.3119
Epoch 357/500
 - 0s - loss: 768956339.4779 - val_loss: 785251886.9286
Epoch 358/500
 - 0s 

Epoch 462/500
 - 0s - loss: 37061077.9433 - val_loss: 40735048.3862
Epoch 463/500
 - 0s - loss: 36107780.8448 - val_loss: 39847964.7103
Epoch 464/500
 - 0s - loss: 34992429.8534 - val_loss: 37957521.1646
Epoch 465/500
 - 0s - loss: 33785873.4745 - val_loss: 36756770.8487
Epoch 466/500
 - 0s - loss: 32698681.9877 - val_loss: 35634569.5035
Epoch 467/500
 - 0s - loss: 31741803.8608 - val_loss: 34513319.4199
Epoch 468/500
 - 0s - loss: 30549715.8145 - val_loss: 33654933.7181
Epoch 469/500
 - 0s - loss: 29483391.3644 - val_loss: 32443381.0196
Epoch 470/500
 - 0s - loss: 28565141.0994 - val_loss: 31677672.6615
Epoch 471/500
 - 0s - loss: 27688770.7723 - val_loss: 30332481.9623
Epoch 472/500
 - 0s - loss: 27023028.0170 - val_loss: 29328440.7148
Epoch 473/500
 - 0s - loss: 26029750.0296 - val_loss: 28442197.5923
Epoch 474/500
 - 0s - loss: 24842364.5184 - val_loss: 29090819.5723
Epoch 475/500
 - 0s - loss: 24227945.0239 - val_loss: 28174891.5198
Epoch 476/500
 - 0s - loss: 23145012.8428 - val_

<keras.callbacks.History at 0x1a33bda080>
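
The loss values in the log are mean squared errors, so they are not directly comparable to the RMSE of around 600 quoted in the instructions. Taking the square root of the validation loss from the last fully logged epoch above (475/500) gives a rough RMSE for this run, on the order of 5,300:

```python
import math

# Validation MSE reported by Keras for epoch 475/500 in the log above.
val_mse = 28174891.5198

# RMSE is the square root of the mean squared error.
val_rmse = math.sqrt(val_mse)
print(f"validation RMSE is about {val_rmse:.0f}")
```

The loss was still falling steeply at this point, which is consistent with the instructions' remark that more epochs (or a larger network) would likely lower it further.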

In [100]:
# Predict on the full dataset so that each prediction lines up with its row's id.
pred = model.predict(x)
df['pred'] = pred.flatten()
# Your submitted dataframe will have these columns: id, pred.
col_id = df['id']
col_pred = df['pred']
submit_df = pd.concat([col_id, col_pred ],axis=1)
submit(source_file=file,data=submit_df,key=key,no=4)

Success: Submitted Assignment #4 for j.miao:
You have submitted this assignment 3 times. (this is fine)
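
The instructions also describe `y` and `diff` columns. A minimal sketch of how such a frame might be assembled is shown below; the ids, targets, and predictions are made-up stand-ins for the real arrays, and `report` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real ids, targets, and model output.
ids = np.array([1, 2, 3])
y = np.array([[100.0], [200.0], [300.0]], dtype=np.float32)
pred = np.array([[110.0], [190.0], [300.0]], dtype=np.float32)

# Keras returns predictions with shape (n, 1); flatten to one value per row.
report = pd.DataFrame({"id": ids, "pred": pred.flatten(), "y": y.flatten()})

# Absolute difference between the expected and predicted values.
report["diff"] = (report["y"] - report["pred"]).abs()

# Only id and pred are kept for the submitted frame.
submit_df = report[["id", "pred"]]
print(report)
```

Keeping `y` and `diff` in a separate frame makes it easy to inspect the largest errors before submitting only the `id` and `pred` columns.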

