# Final Project: Overview

# Objective

The objective of this project is for you to demonstrate your mastery of the Machine Learning process
**using Neural Networks**.



# Submission requirements

The guidelines are similar to those for the Midterm:
- you will write a procedure that takes raw data and produces predictions

You will submit a *single* model for evaluation.

**Demonstrate that all cells in your notebook work**

The final cell in your notebook should print the message "Done"
- `print("Done")`
- If we run your notebook and this last cell does not execute, your submission will be inadequate

## Testing

*You must perform out of sample testing*.

If you want to perform cross-validation in training, that is fine, but you
must *also* test out of sample to show that you are not over-fitting.

It is up to you to create the out of sample data that you feel best evaluates your model.
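
If your examples are ordered by date, one possible way to build an out-of-sample set is a chronological split, so that every test example comes after every training example. The sketch below is only an illustration: the helper name `chronological_split` and the 80/20 ratio are assumptions, not requirements of the assignment.

```python
def chronological_split(rows, test_fraction=0.2):
    """Split date-sorted rows into (train, test) without shuffling.

    `rows` is any sequence already sorted from oldest to newest.
    The most recent `test_fraction` of the rows becomes the test set.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    n_test = max(1, int(len(rows) * test_fraction))
    split_at = len(rows) - n_test
    return rows[:split_at], rows[split_at:]


# Hypothetical example: ten daily observations, oldest first.
days = [f"2020-01-{d:02d}" for d in range(1, 11)]
train, test = chronological_split(days, test_fraction=0.2)
```

Shuffling before splitting would let the model see data from after the dates it is tested on, which tends to overstate out-of-sample performance for time series.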

We will create holdout data (that we will not show you) for grading.

The procedure you write to make predictions must work on the unseen holdout data.
It should behave as it does on your test set, except that the holdout set has *no targets*.

    

# The data

Data will be provided to you 
- as multiple files in a directory which we refer to as a *data directory*

The reason for this is that the different files may convey different information.

You will be responsible for deciding
- which files to use
- which fields within the files to use

We will give you a data directory for training.
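
Before deciding which files and fields to use, it can help to list what each file contains. Here is a minimal sketch using only the standard library. It assumes the data directory holds CSV files with a header row; the helper name `list_fields` and the demo file contents are hypothetical.

```python
import csv
import os
import tempfile


def list_fields(data_dir):
    """Return {file name: list of column names} for each CSV in data_dir."""
    fields = {}
    for name in sorted(os.listdir(data_dir)):
        if not name.lower().endswith(".csv"):
            continue  # skip anything that is not a CSV file
        with open(os.path.join(data_dir, name), newline="") as f:
            header = next(csv.reader(f), [])  # first row, or [] if the file is empty
        fields[name] = header
    return fields


# Demo on a throwaway directory with one made-up file.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "AAPL.csv"), "w", newline="") as f:
        f.write("Dt,Close,Volume\n2020-01-02,1.0,10\n")
    demo_fields = list_fields(demo_dir)
```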

# Submission guidelines

Here are the basics; a code template that you must complete is in the following cells.
- you will be required to store your model in a file
- you will be required to write a procedure `MyModel` that takes two arguments
    - `test_dir`
        - this is a *relative path* to the holdout data directory
    - `model_path`
        - this is a *relative path* to the file containing your model
- the holdout data directory is similar in structure to the training data directory
    - but without target labels! It is your job to predict these.
- your procedure must produce predictions given this holdout data directory

This means that your procedure must
- prepare the files in the holdout data directory in the same way that the files in the training data directory were prepared
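
A common way to meet this requirement is to compute preprocessing statistics once on the training data, store them, and reuse them unchanged on the holdout data. The sketch below illustrates the idea with a plain standardization step stored as JSON. The function names, the file name `scaler.json`, and the numbers are assumptions for illustration and are not part of the template.

```python
import json
import os
import statistics
import tempfile


def fit_scaler(values):
    """Compute mean and standard deviation from training values only."""
    return {"mean": statistics.fmean(values), "std": statistics.pstdev(values)}


def apply_scaler(values, params):
    """Standardize values using previously fitted parameters."""
    std = params["std"] or 1.0  # avoid division by zero for constant columns
    return [(v - params["mean"]) / std for v in values]


train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
holdout_values = [3.0, 6.0]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "scaler.json")
    with open(path, "w") as f:
        json.dump(fit_scaler(train_values), f)  # done once, at training time
    with open(path) as f:
        params = json.load(f)  # done at prediction time

# The holdout data is scaled with the training statistics, never refitted.
holdout_scaled = apply_scaler(holdout_values, params)
```

Refitting the scaler on the holdout data would change the meaning of the inputs the model was trained on, so the stored parameters are applied as-is.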

We will provide you with a sample data directory that will resemble the holdout -- this is so that you
may test the procedure you write for submission.



## Detailed submission guidelines


In **addition to your notebook that trains/evaluates your model**, 
- please also submit an **archive file of the directory** whose name is stored in `model_path`, which 
contains your trained model.
    - use `saveModel` to put your final, trained model in this directory
- We will **not** train your model; we will only use the method `MyModel`
    - which **you** will implement
    - and which uses `loadModel` and the directory whose name is stored in `model_path`
    - this will create the model that we will evaluate
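
The archive of the model directory can be created by hand or from Python. The following is a minimal sketch using `shutil.make_archive` from the standard library; the helper name `archive_model_dir`, the zip format, and the placeholder files in the demo are assumptions, so check which archive format is expected for your submission.

```python
import os
import shutil
import tempfile
import zipfile


def archive_model_dir(model_dir, archive_base):
    """Create <archive_base>.zip containing the model directory itself."""
    model_dir = os.path.abspath(model_dir)
    parent = os.path.dirname(model_dir)
    folder = os.path.basename(model_dir)
    # root_dir/base_dir keep the top-level folder name inside the archive
    return shutil.make_archive(archive_base, "zip", root_dir=parent, base_dir=folder)


# Demo on a throwaway directory with placeholder files named like saveModel's output.
with tempfile.TemporaryDirectory() as tmp:
    demo_model_dir = os.path.join(tmp, "final_model")
    os.makedirs(demo_model_dir)
    for name in ("config.json", "weights.h5"):
        with open(os.path.join(demo_model_dir, name), "w") as f:
            f.write("placeholder")
    archive_path = archive_model_dir(demo_model_dir, os.path.join(tmp, "final_model"))
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()
```

Keeping the folder name inside the archive means that unpacking it recreates the directory whose name is stored in `model_path`.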


Here is a code template for you to complete
- it will save your model (assuming it is in variable `my_model`)
- it provides the specification for procedure `MyModel`, which *you must complete*


In [5]:
from tensorflow.keras.models import load_model
import os
import tensorflow as tf
import math
import pandas as pd
import numpy as np
import pickle
from sklearn.base import BaseEstimator, TransformerMixin

modelName = "final_model"
model_path = os.path.join(".", modelName)

def flatten3D(mat3D):
    ori_shape = mat3D.shape
    mat2D = mat3D.reshape((ori_shape[0], ori_shape[1] * ori_shape[2]))
    return mat2D

class Winsor(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        X_new = pd.DataFrame(np.array(X).copy())
        self.High = []
        self.Low = []
        for col in X_new.columns:
            self.High.append(X_new[col].quantile(0.9))
            self.Low.append(X_new[col].quantile(0.1))
        return self

    def transform(self, X, y=None):
        X_new = pd.DataFrame(np.array(X).copy())
        for i, col in enumerate(X_new.columns):
            high = self.High[i]
            low = self.Low[i]
            X_new.loc[X_new[col] > high, col] = high
            X_new.loc[X_new[col] < low, col] = low
        return X_new.values

def GenerateTest(dir, lag = 90, var_ls=['Close', 'Volume']):
    comp_list = ['CSCO', 'ADBE', 'XLE', 'INTC', 'XLF', 
                 'XLP', 'MSFT', 'XLB', 'XLU', 'NVDA', 
                 'XLV', 'IBM', 'XLY', 'XLK', 'SPY', 'XLI', 'AAPL']
    data = easyAccess.lsGetCompData(var_ls=var_ls, comp_ls=comp_list, data_path=dir).pct_change()
    data.columns = ['pct ' + x for x in data.columns]
    X = []
    for col in data.columns:
        temp = data[[col]].copy()
        for i in range(1, lag+1):
            temp[str(i) + col] = temp[col].shift(i)
        temp.dropna(inplace=True)
        X.append(temp.values)
    X = np.array(X)
    X = np.swapaxes(X, 0, 1)
    X = np.swapaxes(X, 1, 2)
    # Reverse the lag axis so each window runs from oldest to newest
    X = X[:, ::-1, :]
    idx = data.iloc[- X.shape[0]:, ].index
    print(X.shape, idx.shape)
    return X, idx

class easyAccess:
    def __init__(self, data_path='./Data/train'):
        self.data_path = data_path
        self.file_names = os.listdir(data_path)
        self.comp_names = [x.split('.')[0] for x in self.file_names]
        self.file_dict = dict(zip(self.comp_names, self.file_names))
    
    @classmethod
    def printAll(cls, data_path='./Data/train', mute=True):
        obj = cls(data_path)
        if not mute:
            for i, comp in enumerate(obj.comp_names):
                print(i+1, comp)
        return obj.comp_names
    
    @classmethod
    def printDetail(cls, data_path='./Data/train'):
        obj = cls(data_path)
        for key, value in obj.file_dict.items():
            print(key, end=':\t')
            df = pd.read_csv(os.path.join(obj.data_path, value))
            print(", ".join(list(df.columns)))
    
    @classmethod
    def lsGetData(cls, var_ls=None, data_path='./Data/train'):
        """
        Fetch the columns named in var_ls for every file in data_path,
        outer-joined on the 'Dt' date index. Returns None if var_ls is empty.
        """
        if not var_ls:
            return None
        obj = cls(data_path)
        result = None
        for key, value in obj.file_dict.items():
            df = pd.read_csv(os.path.join(obj.data_path, value))
            df['Dt'] = pd.to_datetime(df['Dt'], format='%Y-%m-%d')
            df.set_index('Dt', inplace=True)
            df = df[var_ls]
            df.columns = [' '.join([key, x]) for x in df.columns]
            if result is None:
                result = df
            else:
                result = result.merge(df, how='outer', left_index=True, right_index=True)
        print('Data acquisition completed!', ", ".join(var_ls))
        return result
                
    @classmethod
    def lsGetCompData(cls, var_ls=None, comp_ls=None, data_path='./Data/train'):
        """
        Fetch the columns named in var_ls for the companies listed in comp_ls,
        outer-joined on the 'Dt' date index. Returns None if either list is empty.
        """
        if not var_ls or not comp_ls:
            return None
        obj = cls(data_path)
        result = None
        for key in comp_ls:
            value = obj.file_dict[key]
            df = pd.read_csv(os.path.join(obj.data_path, value))
            df['Dt'] = pd.to_datetime(df['Dt'], format='%Y-%m-%d')
            df.set_index('Dt', inplace=True)
            df = df[var_ls]
            df.columns = [' '.join([key, x]) for x in df.columns]
            if result is None:
                result = df
            else:
                result = result.merge(df, how='outer', left_index=True, right_index=True)
        print('Data acquisition completed!', ", ".join(var_ls))
        return result
        

def saveModel(model, model_path): 
    try:
        os.makedirs(model_path)
    except OSError:
        print("Directory {dir:s} already exists, files will be over-written.".format(dir=model_path))
        
    # Save JSON config to disk
    json_config = model.to_json()
    with open(os.path.join(model_path, 'config.json'), 'w') as json_file:
        json_file.write(json_config)
    # Save weights to disk
    model.save_weights(os.path.join(model_path, 'weights.h5'))
    
    print("Model saved in directory {dir:s}; create an archive of this directory and submit with your assignment.".format(dir=model_path))
    
def loadModel(model_path):
    # Reload the model from the 2 files we saved
    with open(os.path.join(model_path, 'config.json')) as json_file:
        json_config = json_file.read()
    model = tf.keras.models.model_from_json(json_config)
    model.load_weights(os.path.join(model_path, 'weights.h5'))
    
    return model

def MyModel(test_dir, model_path):
    # YOU MAY NOT change model after this statement !
    model = loadModel(model_path)

    # It should run model to create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
    # We need to match your array of predictions with the examples you are predicting
    # The array below (ids) should have a one-to-one correspondence and identify the example you are predicting
    # For Bankruptcy: the Id column
    # For Stock prediction: the date on which you are making a prediction
    ids = []
    
    # YOUR CODE GOES HERE
    X_raw, idx = GenerateTest(dir=test_dir, lag = 90, var_ls = ['Close', 'Volume'])

    # load autoencoder model
    with open(os.path.join('.', 'final_preprocess', 'final_model_autoencoder.pkl'), 'rb') as f:
        ae_recovered = pickle.load(f)
    # load model pipeline
    with open(os.path.join('.', 'final_preprocess', 'final_model_pipeline.pkl'), 'rb') as f:
        pipe_recovered = pickle.load(f)
    # load model raw function
    with open(os.path.join('.', 'final_preprocess', 'final_model_raw_func.pkl'), 'rb') as f:
        func_recovered = pickle.load(f)
    
    print(ae_recovered, pipe_recovered, func_recovered)

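    # Flatten to 2D for the fitted pipeline, then restore the original 3D shape
    shape_test = X_raw.shape
    X_raw = pipe_recovered.transform(func_recovered(X_raw))
    X_raw = X_raw.reshape(shape_test)
    # Pass the scaled data through the autoencoder before the final model
    X_raw = ae_recovered.predict(X_raw)
    print(X_raw.shape)

    # Reuse the model loaded at the top of this function
    y_pred = model.predict(X_raw)
    
    # Keep the last 200 predictions (or fewer) and the matching dates
    predictions = y_pred[-200:].reshape([-1])
    ids = np.array(idx)[-200:]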

    return predictions, ids

# Assign to variable my_model the model that is your final model (the one you will be evaluated on)
model_11 = loadModel(model_path)
my_model = model_11  # final model, reloaded from model_path

saveModel(my_model, model_path)

Directory ./final_model already exists, files will be over-written.
Model saved in directory ./final_model; create an archive of this directory and submit with your assignment.


## Evaluate your model on the holdout data directory

**You must run the following cell** from the directory that contains your model file.

Here is how we will evaluate your submission
- we will create a directory whose only content is
    - sub-directory `Data`
- we will copy your model file to this directory with the name stored in `model_path`
- we will run the cell in your notebook that should be a copy of the one below
    - it calls procedure `MyModel` with the arguments given below
    - your implementation of `MyModel`
        - must successfully load your model file, *given where **we** have placed it as described above*
        - must successfully return one prediction for each example in the holdout directory *given where **we** have placed the holdout directory*
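
Before submitting, it may be worth checking that the returned predictions and ids line up one-to-one. The helper below is a hypothetical sketch, not part of the grading code; it assumes both values are sequences and that each id should appear only once.

```python
def check_predictions(predictions, ids):
    """Raise ValueError if predictions and ids are not one-to-one."""
    if len(predictions) != len(ids):
        raise ValueError(
            "got %d predictions for %d ids" % (len(predictions), len(ids))
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ids contain duplicates")
    return True


# Made-up values standing in for the output of MyModel.
ok = check_predictions([0.1, -0.2], ["2020-01-02", "2020-01-03"])
```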

```{warning}
When grading, please change `holdout_dir` to point to the holdout dataset. It currently points to the sample data used for testing.
```

In [6]:
holdout_dir = os.path.join(".", "Data", "sample")
predicts = MyModel(holdout_dir, model_path)

Data acquisition completed! Close, Volume
(160, 91, 34) (160,)
<keras.engine.sequential.Sequential object at 0x7f7afe93f190> Pipeline(steps=[('Winsor', Winsor()), ('Scaler', StandardScaler())]) <function flatten3D at 0x7f7b010b17a0>
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: 'arguments' object has no attribute 'posonlyargs'
(160, 91, 34)

In [7]:
print("Done")

Done
