# T81-558: Applications of Deep Neural Networks
* Instructor: [Jeff Heaton](https://sites.wustl.edu/jeffheaton/), School of Engineering and Applied Science, [Washington University in St. Louis](https://engineering.wustl.edu/Programs/Pages/default.aspx)
* For more information visit the [class website](https://sites.wustl.edu/jeffheaton/t81-558/).

**Module 8 Assignment: Building a Kaggle Submission File**

**Student Name: Jason Walker**

# Assignment Instructions

For this assignment you will use the [**reg-30-spring-2018.csv**](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/reg-30-spring-2018.csv) dataset to train a neural network and [**reg-30-spring-2018-eval.csv**](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/data/reg-30-spring-2018-eval.csv) as the test set to build a submission (similar to Kaggle).  The training code used for this assignment will be identical to [Assignment 4](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/assignments/assignment_yourname_class4.ipynb), and you are encouraged to use your Assignment 4 code as a starting point.  Refer to [Module 8](https://github.com/jeffheaton/t81_558_deep_learning/blob/master/t81_558_class8_kaggle.ipynb) for instructions on producing a Kaggle-type submission file.  Please note that Module #8 provides an example of producing a classification (iris) submission file; you will need to convert this for regression.

The dataframe that you submit should have two columns: *id* and *target*.  The *id* column should match up with the test data file.  The *target* column is your prediction.  It is unlikely that the mean of *target* will match exactly with mine.
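
The two-column layout described above can be sketched as follows. This is a minimal illustration, not the course's reference code: `build_submission` is a hypothetical helper name, and the ids and predictions are made-up values.

```python
import numpy as np
import pandas as pd

def build_submission(ids, predictions):
    """Return a two-column frame: 'id' from the test file, 'target' from the model."""
    # Keras-style predictions usually have shape (n, 1); flatten them to (n,)
    preds = np.asarray(predictions, dtype=np.float32).reshape(-1)
    ids = np.asarray(ids)
    if len(ids) != len(preds):
        raise ValueError("ids and predictions must have the same length")
    return pd.DataFrame({"id": ids, "target": preds})

# Made-up ids and predictions, for illustration only
demo_submission = build_submission([101, 102, 103], [[1.5], [2.5], [3.5]])
print(demo_submission)
```

Building the frame from a dictionary names both columns in one step, so no separate renaming is needed afterwards.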



# Helpful Functions

You will see these at the top of every module and assignment.  They are simply a set of reusable functions that we will make use of.  Each of them is explained in greater detail as the course progresses; Class 4 contains a complete overview of these functions.

In [1]:
from sklearn import preprocessing
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shutil
import os
import requests
import base64


# Encode text values to dummy variables(i.e. [1,0,0],[0,1,0],[0,0,1] for red,green,blue)
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        dummy_name = "{}-{}".format(name, x)
        df[dummy_name] = dummies[x]
    df.drop(name, axis=1, inplace=True)


# Encode text values to a single dummy variable.  The new columns (which do not replace the old) will have a 1
# at every location where the original column (name) matches each of the target_values.  One column is added for
# each target value.
def encode_text_single_dummy(df, name, target_values):
    for tv in target_values:
        l = list(df[name].astype(str))
        l = [1 if str(x) == str(tv) else 0 for x in l]
        name2 = "{}-{}".format(name, tv)
        df[name2] = l


# Encode text values to indexes (e.g. [0],[1],[2] for blue,green,red; LabelEncoder numbers the sorted classes from 0).
def encode_text_index(df, name):
    le = preprocessing.LabelEncoder()
    df[name] = le.fit_transform(df[name])
    return le.classes_


# Encode a numeric column as zscores
def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()

    if sd is None:
        sd = df[name].std()

    df[name] = (df[name] - mean) / sd


# Convert all missing values in the specified column to the median
def missing_median(df, name):
    med = df[name].median()
    df[name] = df[name].fillna(med)


# Convert all missing values in the specified column to the default
def missing_default(df, name, default_value):
    df[name] = df[name].fillna(default_value)


# Convert a Pandas dataframe to the x,y inputs that TensorFlow needs
def to_xy(df, target):
    result = []
    for x in df.columns:
        if x != target:
            result.append(x)
    # find out the type of the target column.  Is it really this hard? :(
    target_type = df[target].dtypes
    target_type = target_type[0] if hasattr(target_type, '__iter__') else target_type
    # Encode to int for classification, float otherwise. TensorFlow likes 32 bits.
    if target_type in (np.int64, np.int32):
        # Classification
        dummies = pd.get_dummies(df[target])
        # DataFrame.as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
        return df[result].values.astype(np.float32), dummies.values.astype(np.float32)
    else:
        # Regression
        return df[result].values.astype(np.float32), df[[target]].values.astype(np.float32)

# Nicely formatted time string
def hms_string(sec_elapsed):
    h = int(sec_elapsed / (60 * 60))
    m = int((sec_elapsed % (60 * 60)) / 60)
    s = sec_elapsed % 60
    return "{}:{:>02}:{:>05.2f}".format(h, m, s)


# Regression chart.
def chart_regression(pred,y,sort=True):
    t = pd.DataFrame({'pred' : pred, 'y' : y.flatten()})
    if sort:
        t.sort_values(by=['y'],inplace=True)
    a = plt.plot(t['y'].tolist(),label='expected')
    b = plt.plot(t['pred'].tolist(),label='prediction')
    plt.ylabel('output')
    plt.legend()
    plt.show()

# Remove all rows where the specified column is sd or more standard deviations away from the mean
def remove_outliers(df, name, sd):
    drop_rows = df.index[(np.abs(df[name] - df[name].mean()) >= (sd * df[name].std()))]
    df.drop(drop_rows, axis=0, inplace=True)


# Encode a column to a range between normalized_low and normalized_high.
def encode_numeric_range(df, name, normalized_low=-1, normalized_high=1,
                         data_low=None, data_high=None):
    # Fill in each bound independently so a caller may supply only one of them
    if data_low is None:
        data_low = min(df[name])
    if data_high is None:
        data_high = max(df[name])

    df[name] = ((df[name] - data_low) / (data_high - data_low)) \
               * (normalized_high - normalized_low) + normalized_low
        
# This function submits an assignment.  You can submit an assignment as many times as you like; only the final
# submission counts.  The parameters are as follows:
# data - Pandas dataframe output.
# key - Your student key that was emailed to you.
# no - The assignment class number.
# source_file - The full path to your Python or IPYNB file.  This must have "_class" followed by the assignment
#               number as part of its name.  For example, "_class2" for class assignment #2.
def submit(data,key,no,source_file=None):
    if source_file is None and '__file__' not in globals(): raise Exception('Must specify a filename when running in a Jupyter notebook.')
    if source_file is None: source_file = __file__
    suffix = '_class{}'.format(no)
    if suffix not in source_file: raise Exception('{} must be part of the filename.'.format(suffix))
    with open(source_file, "rb") as image_file:
        encoded_python = base64.b64encode(image_file.read()).decode('ascii')
    ext = os.path.splitext(source_file)[-1].lower()
    if ext not in ['.ipynb','.py']: raise Exception("Source file is {} must be .py or .ipynb".format(ext))
    r = requests.post("https://api.heatonresearch.com/assignment-submit",
        headers={'x-api-key':key}, json={'csv':base64.b64encode(data.to_csv(index=False).encode('ascii')).decode("ascii"),
        'assignment': no, 'ext':ext, 'py':encoded_python})
    if r.status_code == 200:
        print("Success: {}".format(r.text))
    else: print("Failure: {}".format(r.text))
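
As a rough illustration of what two of these helpers do to a dataframe, the sketch below repeats `encode_text_dummy` and `encode_numeric_zscore` from the cell above (so that it runs on its own) and applies them to a small made-up frame. The `demo` frame and its column names are assumptions chosen only for this example.

```python
import pandas as pd

# Repeated from the helper cell so this block is self-contained
def encode_text_dummy(df, name):
    dummies = pd.get_dummies(df[name])
    for x in dummies.columns:
        df["{}-{}".format(name, x)] = dummies[x]
    df.drop(name, axis=1, inplace=True)

def encode_numeric_zscore(df, name, mean=None, sd=None):
    if mean is None:
        mean = df[name].mean()
    if sd is None:
        sd = df[name].std()
    df[name] = (df[name] - mean) / sd

# Made-up data: one text column and one numeric column
demo = pd.DataFrame({"color": ["red", "green", "blue"], "weight": [1.0, 2.0, 3.0]})

encode_text_dummy(demo, "color")       # replaces 'color' with one column per value
encode_numeric_zscore(demo, "weight")  # rescales 'weight' to mean 0, sample sd 1

print(demo)
```

Both helpers modify the dataframe in place and return nothing, which is why the calls above are not assigned to a variable.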

# Assignment #8 Sample Code

The following code provides a starting point for this assignment.

In [15]:
import os
import pandas as pd
from scipy.stats import zscore
from keras.models import Sequential
from keras.layers import Dense, Activation
import io
import requests
import numpy as np
from sklearn import metrics

# This is your student key that I emailed to you at the beginning of the semester.
key = "UjhQvgInJx71GabltZtqy6O1LdzxtjcE5idLxF3K"  # This is an example key and will not work.

# You must also identify your source file.  (modify for your local setup)
# file='/resources/t81_558_deep_learning/assignment_yourname_class1.ipynb'  # IBM Data Science Workbench
# file='C:\\Users\\jeffh\\projects\\t81_558_deep_learning\\t81_558_class1_intro_python.ipynb'  # Windows
file='/Users/jwalker/git/t81_558_deep_learning/assignments/assignment_jwalker_class8.ipynb'  # Mac/Linux

# Begin assignment
path = "../data/"

filename_train = os.path.join(path,"reg-30-spring-2018.csv")
filename_test = os.path.join(path,"reg-30-spring-2018-eval.csv")
filename_submit = os.path.join(path,"8_submit.csv")

df_train = pd.read_csv(filename_train,na_values=['NA','?'])

# all numeric columns (except id)
for column in ['distance','height','landings','number','pack','age','usage','weight','volume','width','max','power','size',]:
    missing_median(df_train,column)
    # do NOT normalize target
    if (column != 'target'):
        encode_numeric_zscore(df_train,column)

# all text/categorical columns
for column in ['region', 'item']:
    encode_text_dummy(df_train,column)

# Encode the feature vector
ids = df_train['id']
df_train.drop('id', axis=1, inplace=True)
x,y = to_xy(df_train,'target')

model = Sequential()
model.add(Dense(25, input_dim=x.shape[1], activation='relu')) # Hidden 1
model.add(Dense(10, activation='relu')) # Hidden 2
model.add(Dense(1)) # Output
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x,y,verbose=2,epochs=1000)

pred = model.predict(x)

# Measure RMSE on the training data (not on a held-out set).  RMSE is common for regression.
score = np.sqrt(metrics.mean_squared_error(pred,y))
print("Final score (RMSE): {}".format(score))

pred_df = pd.DataFrame(pred)
pred_df.insert(0,'id',ids)
pred_df.columns = ['id','pred']

pred_df.to_csv('8_train.csv',index=False)



Epoch 1/1000
 - 0s - loss: 12804.3001
Epoch 2/1000
 - 0s - loss: 12793.4377
Epoch 3/1000
 - 0s - loss: 12784.1192
Epoch 4/1000
 - 0s - loss: 12771.7555
Epoch 5/1000
 - 0s - loss: 12753.5754
Epoch 6/1000
 - 0s - loss: 12726.7798
Epoch 7/1000
 - 0s - loss: 12688.0638
Epoch 8/1000
 - 0s - loss: 12639.8853
Epoch 9/1000
 - 0s - loss: 12579.3114
Epoch 10/1000
 - 0s - loss: 12515.3813
Epoch 11/1000
 - 0s - loss: 12444.2415
Epoch 12/1000
 - 0s - loss: 12379.3645
Epoch 13/1000
 - 0s - loss: 12322.1359
Epoch 14/1000
 - 0s - loss: 12272.5136
Epoch 15/1000
 - 0s - loss: 12234.2040
Epoch 16/1000
 - 0s - loss: 12193.9508
Epoch 17/1000
 - 0s - loss: 12167.3846
Epoch 18/1000
 - 0s - loss: 12141.8457
Epoch 19/1000
 - 0s - loss: 12112.4248
Epoch 20/1000
 - 0s - loss: 12086.9946
Epoch 21/1000
 - 0s - loss: 12062.6949
Epoch 22/1000
 - 0s - loss: 12038.8792
Epoch 23/1000
 - 0s - loss: 12015.4335
Epoch 24/1000
 - 0s - loss: 11995.5426
Epoch 25/1000
 - 0s - loss: 11971.1850
...
Epoch 212/1000
 - 0s - loss: 5268.0706
Epoch 213/1000
 - 0s - loss: 5217.8139
Epoch 214/1000
 - 0s - loss: 5192.9620
Epoch 215/1000
 - 0s - loss: 5190.4972
Epoch 216/1000
 - 0s - loss: 5139.8405
Epoch 217/1000
 - 0s - loss: 5110.2846
Epoch 218/1000
 - 0s - loss: 5085.1942
Epoch 219/1000
 - 0s - loss: 5063.4691
Epoch 220/1000
 - 0s - loss: 5034.7870
Epoch 221/1000
 - 0s - loss: 4990.7048
Epoch 222/1000
 - 0s - loss: 4983.4500
Epoch 223/1000
 - 0s - loss: 4955.1391
Epoch 224/1000
 - 0s - loss: 4909.6007
Epoch 225/1000
 - 0s - loss: 4875.7614
Epoch 226/1000
 - 0s - loss: 4848.4388
Epoch 227/1000
 - 0s - loss: 4849.2272
Epoch 228/1000
 - 0s - loss: 4803.1653
Epoch 229/1000
 - 0s - loss: 4771.2618
Epoch 230/1000
 - 0s - loss: 4738.7320
Epoch 231/1000
 - 0s - loss: 4724.4654
Epoch 232/1000
 - 0s - loss: 4713.3155
Epoch 233/1000
 - 0s - loss: 4672.7154
Epoch 234/1000
 - 0s - loss: 4638.8937
Epoch 235/1000
 - 0s - loss: 4618.5951
Epoch 236/1000
 - 0s - loss: 4589.3201
...
 - 0s - loss: 1646.4990
Epoch 423/1000
 - 0s - loss: 1633.8545
Epoch 424/1000
 - 0s - loss: 1632.6671
Epoch 425/1000
 - 0s - loss: 1621.8000
Epoch 426/1000
 - 0s - loss: 1611.8910
Epoch 427/1000
 - 0s - loss: 1607.4685
Epoch 428/1000
 - 0s - loss: 1589.6609
Epoch 429/1000
 - 0s - loss: 1586.3976
Epoch 430/1000
 - 0s - loss: 1583.1229
Epoch 431/1000
 - 0s - loss: 1570.3769
Epoch 432/1000
 - 0s - loss: 1565.9875
Epoch 433/1000
 - 0s - loss: 1559.0484
Epoch 434/1000
 - 0s - loss: 1552.6716
Epoch 435/1000
 - 0s - loss: 1549.4436
Epoch 436/1000
 - 0s - loss: 1524.7304
Epoch 437/1000
 - 0s - loss: 1515.2170
Epoch 438/1000
 - 0s - loss: 1517.9052
Epoch 439/1000
 - 0s - loss: 1502.3552
Epoch 440/1000
 - 0s - loss: 1501.2506
Epoch 441/1000
 - 0s - loss: 1485.8668
Epoch 442/1000
 - 0s - loss: 1488.6173
Epoch 443/1000
 - 0s - loss: 1486.4079
Epoch 444/1000
 - 0s - loss: 1463.0300
Epoch 445/1000
 - 0s - loss: 1456.8791
Epoch 446/1000
 - 0s - loss: 1448.9771
Epoch 447/1000
 - 0s - loss: 1444.9215
...
 - 0s - loss: 593.8331
Epoch 636/1000
 - 0s - loss: 580.3956
Epoch 637/1000
 - 0s - loss: 585.8198
Epoch 638/1000
 - 0s - loss: 573.7208
Epoch 639/1000
 - 0s - loss: 584.6064
Epoch 640/1000
 - 0s - loss: 582.8458
Epoch 641/1000
 - 0s - loss: 582.5452
Epoch 642/1000
 - 0s - loss: 574.4444
Epoch 643/1000
 - 0s - loss: 572.8577
Epoch 644/1000
 - 0s - loss: 566.7610
Epoch 645/1000
 - 0s - loss: 556.5720
Epoch 646/1000
 - 0s - loss: 567.4555
Epoch 647/1000
 - 0s - loss: 566.8286
Epoch 648/1000
 - 0s - loss: 551.0376
Epoch 649/1000
 - 0s - loss: 552.9246
Epoch 650/1000
 - 0s - loss: 560.3154
Epoch 651/1000
 - 0s - loss: 535.1845
Epoch 652/1000
 - 0s - loss: 554.4868
Epoch 653/1000
 - 0s - loss: 540.5833
Epoch 654/1000
 - 0s - loss: 547.1113
Epoch 655/1000
 - 0s - loss: 537.8847
Epoch 656/1000
 - 0s - loss: 532.8912
Epoch 657/1000
 - 0s - loss: 528.1002
Epoch 658/1000
 - 0s - loss: 531.2659
Epoch 659/1000
 - 0s - loss: 532.3321
Epoch 660/1000
 - 0s - loss: 536.4872
...
Epoch 851/1000
 - 0s - loss: 290.4480
Epoch 852/1000
 - 0s - loss: 290.9193
Epoch 853/1000
 - 0s - loss: 285.9333
Epoch 854/1000
 - 0s - loss: 289.8103
Epoch 855/1000
 - 0s - loss: 288.3071
Epoch 856/1000
 - 0s - loss: 287.7407
Epoch 857/1000
 - 0s - loss: 282.5949
Epoch 858/1000
 - 0s - loss: 279.3368
Epoch 859/1000
 - 0s - loss: 283.4101
Epoch 860/1000
 - 0s - loss: 284.6275
Epoch 861/1000
 - 0s - loss: 283.0687
Epoch 862/1000
 - 0s - loss: 285.7211
Epoch 863/1000
 - 0s - loss: 280.3266
Epoch 864/1000
 - 0s - loss: 276.9834
Epoch 865/1000
 - 0s - loss: 279.1079
Epoch 866/1000
 - 0s - loss: 279.2948
Epoch 867/1000
 - 0s - loss: 279.8827
Epoch 868/1000
 - 0s - loss: 277.3879
Epoch 869/1000
 - 0s - loss: 278.7448
Epoch 870/1000
 - 0s - loss: 273.2675
Epoch 871/1000
 - 0s - loss: 270.1079
Epoch 872/1000
 - 0s - loss: 274.0274
Epoch 873/1000
 - 0s - loss: 272.2306
Epoch 874/1000
 - 0s - loss: 272.9122
Epoch 875/1000
 - 0s - loss: 273.6750
Epoch 876/1000
 - 0s - loss: 272.2353
...
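
The loss printed in the log above is mean squared error, so its square root is the RMSE on the training rows (a loss of 272.25, for instance, corresponds to an RMSE of 16.5). Because the sample code scores the model on the same rows it was trained on, that figure can be optimistic. The sketch below shows one way to compare training and held-out RMSE. It is only an illustration under stated assumptions: the data is synthetic, and a least-squares fit from NumPy stands in for the Keras network so that the block runs without a deep learning install.

```python
import numpy as np

def rmse(pred, y):
    """Root mean squared error between two arrays of matching size."""
    pred = np.asarray(pred, dtype=np.float64).reshape(-1)
    y = np.asarray(y, dtype=np.float64).reshape(-1)
    return float(np.sqrt(np.mean((pred - y) ** 2)))

# Synthetic regression data (assumption: not the course dataset)
rng = np.random.RandomState(42)
x_demo = rng.normal(size=(200, 3))
y_demo = x_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Shuffle the row indexes, fit on 150 rows, hold out the remaining 50
order = rng.permutation(len(x_demo))
fit_idx, val_idx = order[:150], order[150:]

# Stand-in model: ordinary least squares instead of the neural network
coef, *_ = np.linalg.lstsq(x_demo[fit_idx], y_demo[fit_idx], rcond=None)

rmse_fit = rmse(x_demo[fit_idx] @ coef, y_demo[fit_idx])
rmse_val = rmse(x_demo[val_idx] @ coef, y_demo[val_idx])
print("RMSE on fitted rows:   {:.4f}".format(rmse_fit))
print("RMSE on held-out rows: {:.4f}".format(rmse_val))
```

With the Keras model, the same comparison could be made by setting aside part of `x` and `y` before calling `model.fit` and scoring `model.predict` on the rows that were set aside.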

In [16]:
# Generate Kaggle submit file

# Encode feature vector
df_test = pd.read_csv(filename_test,na_values=['NA','?'])

# all numeric columns (except id)
for column in ['distance','height','landings','number','pack','age','usage','weight','volume','width','max','power','size',]:
    missing_median(df_test,column)
    # do NOT normalize target (note: the z-scores here use the test file's own mean/sd, not the training statistics)
    if (column != 'target'):
        encode_numeric_zscore(df_test,column)

# all text/categorical columns
for column in ['region', 'item']:
    encode_text_dummy(df_test,column)

test_ids = df_test['id']
df_test.drop('id', axis=1, inplace=True)

x = df_test.values.astype(np.float32)  # as_matrix() was removed in pandas 1.0

# Generate predictions
pred = model.predict(x)

# Create submission data set

submit_df = pd.DataFrame(pred)
submit_df.insert(0,'id',test_ids)
submit_df.columns = ['id','target']

submit_df.to_csv(filename_submit, index=False)
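
The cell above preprocesses the test file independently of the training file: each numeric column is z-scored with the test file's own mean and standard deviation, and dummy columns are created from whatever categories happen to appear in the test file. A common alternative, sketched below on made-up frames, is to reuse the training statistics and to align the test columns to the training columns. The frame names, column values, and the use of `pd.get_dummies(..., columns=...)` with `reindex` are assumptions for illustration, not the assignment's required approach.

```python
import pandas as pd

# Made-up training and test frames; the test frame lacks region "b"
train_demo = pd.DataFrame({"weight": [10.0, 20.0, 30.0], "region": ["a", "b", "c"]})
test_demo = pd.DataFrame({"weight": [20.0, 40.0], "region": ["a", "c"]})

# 1. Compute the statistics on the training data only, then apply them to both frames
weight_mean = train_demo["weight"].mean()
weight_sd = train_demo["weight"].std()
train_demo["weight"] = (train_demo["weight"] - weight_mean) / weight_sd
test_demo["weight"] = (test_demo["weight"] - weight_mean) / weight_sd

# 2. One-hot encode each frame, using "-" to mirror the helper's "name-value" column names
train_x = pd.get_dummies(train_demo, columns=["region"], prefix_sep="-")
test_x = pd.get_dummies(test_demo, columns=["region"], prefix_sep="-")

# 3. Give the test frame exactly the training columns, in the same order;
#    categories that never occur in the test file become all-zero columns
test_x = test_x.reindex(columns=train_x.columns, fill_value=0)

print(test_x)
```

Aligning the columns matters because the network's input layer was sized from the training matrix (`input_dim=x.shape[1]`), so the test matrix needs the same number of columns in the same order.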


In [17]:
submit(source_file=file,data=submit_df,key=key,no=8)

Success: Submitted Assignment #8 for walker-jason:
You have submitted this assignment 3 times. (this is fine)
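
Before calling `submit`, it can help to check the submission frame against the requirements stated in the instructions. The following is a minimal sketch of such a check. `check_submission` is a hypothetical helper, not part of the course code, and the demo frames are made up.

```python
import pandas as pd

def check_submission(submit_df, expected_ids):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(submit_df.columns) != ["id", "target"]:
        # Without the right columns the remaining checks cannot run
        problems.append("columns should be exactly ['id', 'target']")
        return problems
    if submit_df["target"].isnull().any():
        problems.append("target contains missing values")
    if submit_df["id"].tolist() != list(expected_ids):
        problems.append("id column does not match the test file row for row")
    return problems

# Made-up frames: one valid, one with a missing prediction
demo_ok = pd.DataFrame({"id": [1, 2, 3], "target": [0.5, 1.5, 2.5]})
demo_bad = pd.DataFrame({"id": [1, 2, 3], "target": [0.5, None, 2.5]})

print(check_submission(demo_ok, [1, 2, 3]))
print(check_submission(demo_bad, [1, 2, 3]))
```

In the notebook, the equivalent call would pass `submit_df` and `test_ids` from the submission cell.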

