# <span style='color:red'>Problem Set 4: Predicting Returns with Machine Learning</span>

## Instructions:

- **Make a copy of this notebook somewhere under the root folder *except* for the shared folder**
  - In JHub we have read-only access to the material under /shared
  - Everything else is associated with your Dartmouth ID, so only you can see those files, and they are stored permanently
- Enter the answers in this notebook
- All of the code you need can be found in the Topic #4, #5, and #6 notebooks
- When you are done, select "Print..." from the File menu and create an **HTML** version of the notebook
  - Submit this HTML version through Canvas
  - Please do not submit the notebook (.ipynb) file itself

## Notes:

- If your kernel crashes, it is typically a sign that your open notebooks are using too much memory
- What to do?
  1. If you have multiple tabs open, each one is associated with its own kernel, which means that each one uses some memory. It might help to close all tabs except the one you're working on.
  2. In the notebook I don't typically discard objects we no longer need (we could do so with the `del <object>` statement). To conserve memory, you might start a fresh kernel, execute the import statements, and then skip to the cell you actually need to run.
    
    This might sometimes be tricky because you want to load the right data -- but in most notebooks I save the final data precisely for this reason so that most of the code can later be skipped.
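As a minimal sketch of the memory-saving idea in item 2, the snippet below deletes a large object and asks Python's garbage collector to reclaim it. The name `big_table` is a hypothetical stand-in for any large object (for example, a DataFrame) that you no longer need.

```python
import gc

# Hypothetical stand-in for a large object that is no longer needed
big_table = [float(i) for i in range(1_000_000)]

# Remove the name from the namespace, then ask the garbage collector to run
del big_table
gc.collect()

# Check that the name is gone from the current scope
still_defined = 'big_table' in dir()
print(still_defined)
```

Whether the freed memory is returned to the operating system right away depends on the Python runtime, so restarting the kernel remains the more reliable option.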

# Problem 1: Past Returns Only

In Topic #6 we predicted returns using both past returns and a few fundamental variables.

Edit the code so that you *only* include the 12 features corresponding to the past returns in months t, t-1, t-2, ..., t-11. (In the models we predict month t+1 returns.)

If you estimate the linear model with just these features, what is the resulting model's Sharpe ratio in:

1. the training sample and
2. the validation sample?

How do these estimates compare to the full model that also had squared returns and the fundamental characteristics?
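For reference, a Sharpe ratio can be computed from a series of monthly strategy returns roughly as sketched below. This assumes the returns are already excess returns (or come from a zero-investment long-short strategy) and annualizes with a factor of sqrt(12). The Topic #6 notebook may use a different convention, so treat `sharpe_ratio` and `example_returns` as hypothetical names rather than the course's definition.

```python
import numpy as np
import pandas as pd

def sharpe_ratio(monthly_returns: pd.Series, periods_per_year: int = 12) -> float:
    """Annualized Sharpe ratio: mean / standard deviation, scaled by sqrt(periods)."""
    r = monthly_returns.dropna()
    return float(np.sqrt(periods_per_year) * r.mean() / r.std())

# Made-up monthly returns, for illustration only
example_returns = pd.Series([0.02, -0.01, 0.03, 0.00, 0.01, -0.02])
print(round(sharpe_ratio(example_returns), 3))
```

Applying the same function to the training-sample and validation-sample returns keeps the two estimates comparable.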

```
import numpy as np
import pandas as pd

# Load data
cs_crsp = pd.read_pickle('/home/jovyan/data/ml_crsp.pkl')

# The TARGET variable is the return next month
cs_crsp['retnm'] = cs_crsp.groupby(level='permno')['ret'].shift(-1)

# The FIRST set of features consists of monthly returns over the past year
# (lag 0 is month t, lag 11 is month t-11)
for lag in range(12):
    cs_crsp['x0_retlag' + str(lag)] = cs_crsp.groupby(level='permno')['ret'].shift(lag)

# The SECOND set of features is (1) log-size, (2) log-BE/ME, (3) log-asset growth, and (4) gross profitability

# (1) log-size
cs_crsp['x1_logme'] = np.log(cs_crsp['me'])

# (2) log-book-to-market
cs_crsp['be'] = cs_crsp['be'].where(cs_crsp['be'] > 0)  # set non-positive BEs to missing
cs_crsp['beme'] = cs_crsp['be'] / cs_crsp['me']
cs_crsp['x2_logbeme'] = np.log(cs_crsp['beme'])

# (3) log-asset growth over the past 12 months
cs_crsp['at_lag12'] = cs_crsp.groupby(level='permno')['at'].shift(12)
cs_crsp['x3_log_asset_growth'] = np.log(cs_crsp['at'] / cs_crsp['at_lag12'])
bad_data = (cs_crsp['at'] <= 0) | (cs_crsp['at_lag12'] <= 0)
cs_crsp.loc[bad_data, 'x3_log_asset_growth'] = np.nan  # non-positive assets are treated as missing

# (4) gross profitability
cs_crsp['x4_gross_profitability'] = (cs_crsp['sale'] - cs_crsp['cogs']) / cs_crsp['at']
bad_data = cs_crsp['at'] <= 0
cs_crsp.loc[bad_data, 'x4_gross_profitability'] = np.nan

# Keep only the variables we need (all feature names start with 'x')
target_var = ['retnm']
features = [c for c in cs_crsp.columns if c.startswith('x')]
cs_crsp = cs_crsp[target_var + features]

# Normalize the variables by cross-sectionally demeaning them (subtract each month's mean)
cs_crsp = cs_crsp.sub(cs_crsp.groupby(level='date').mean(), level='date')
```
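To see what the within-stock shifts and the cross-sectional demeaning in the cell above do, here is a small sketch on a made-up panel. The DataFrame `toy` and its values are purely illustrative, and `transform('mean')` is used here as one way to express the demeaning step; it is not the notebook's own code.

```python
import pandas as pd

# Toy panel: two stocks (permno 1 and 2) observed over three months
index = pd.MultiIndex.from_product(
    [[1, 2], pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])],
    names=['permno', 'date'],
)
toy = pd.DataFrame({'ret': [0.01, 0.02, 0.03, 0.10, 0.20, 0.30]}, index=index)

# Next month's return within each stock; the last month of each stock is NaN
toy['retnm'] = toy.groupby(level='permno')['ret'].shift(-1)

# Last month's return within each stock; the first month of each stock is NaN
toy['x0_retlag1'] = toy.groupby(level='permno')['ret'].shift(1)

# Subtract each month's cross-sectional mean from every column
demeaned = toy - toy.groupby(level='date').transform('mean')
print(demeaned)
```

Because the shifts are applied within each `permno` group, one stock's returns never leak into another stock's lags or targets.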

# Problem 2: Value and Profitability

Some academics (for example, Robert Novy-Marx) and asset managers (for example, Avantis Investors) highlight the importance of the interaction between the value and profitability characteristics: instead of buying value stocks or buying profitable stocks, they argue that it makes sense to look at both characteristics at the same time.

In the original code, create a new characteristic that "interacts" value and profitability as follows:

```
cs_crsp['x5_valueprofitability'] = cs_crsp['x2_logbeme'] * cs_crsp['x4_gross_profitability']
```

#### a) If you estimate the random forest model with this variable in it, where does it rank in terms of **<span style='color:orange'>variable importance</span>** relative to the other variables? 

#### b) What is the random forest model's Sharpe ratio in the validation sample with and without this variable? 

**Note:** Please do not go through the trouble of retuning the model's hyperparameters. That is, after creating the new feature etc., just execute the cell that reads

```
y = train_data['retnm']
X = train_data[features]

best_hyperparameters = {'bootstrap': True, 'max_depth': 5, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'min_samples_split': 2, 'n_estimators': 100}
best_model = RandomForestRegressor(**best_hyperparameters, random_state=42)

# Fit the model to the data
best_model.fit(X, y);
```
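One way to rank variables after fitting is scikit-learn's impurity-based `feature_importances_` attribute, sketched below on synthetic data. The names `X_toy`, `y_toy`, and `toy_model` are hypothetical, and the Topic #6 notebook may measure importance differently (for example, with permutation importance), so this is only an illustration of the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data for illustration only: the target depends on 'a' but not on 'b' or 'c'
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(500, 3)), columns=['a', 'b', 'c'])
y_toy = 2.0 * X_toy['a'] + rng.normal(scale=0.1, size=500)

toy_model = RandomForestRegressor(n_estimators=50, max_depth=5, random_state=42)
toy_model.fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
importance = pd.Series(toy_model.feature_importances_, index=X_toy.columns)
print(importance.sort_values(ascending=False))
```

With the actual data, the same pattern would use `best_model.feature_importances_` together with the `features` list, assuming `X` was built from `train_data[features]` as in the cell above.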

# Problem 3: Ensembling is Averaging

A popular approach for creating predictive models is to estimate many different models and combine their predictions. The basic idea is that every model is flawed, but different techniques can each capture something different and correct about the data. A random forest, for example, is already such an "ensemble model" because it combines predictions from many smaller trees.

In Topic #6 we trained three different models:

1. Linear regression
2. Ridge regression
3. Random forest

In each case, we created a strategy that traded stocks based on the predictions. 

Go back to the code and change it so that the returns from the strategies based on the linear regression, ridge regression, and random forest are saved with **different names**. For example, save them as:

```
validation_returns_ols 
validation_returns_ridge
validation_returns_rf
```

Once you have stored these returns, create a strategy that is an ensemble of these three strategies:

```
validation_returns_ensemble = (1/3) * (validation_returns_ols + validation_returns_ridge + validation_returns_rf)
```
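The equal-weighted average above can be checked on a small made-up example. The series `toy_ols`, `toy_ridge`, and `toy_rf` are hypothetical stand-ins for the three saved return series, and the sketch assumes they share the same date index.

```python
import pandas as pd

# Made-up monthly returns for three strategies, for illustration only
dates = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31'])
toy_ols = pd.Series([0.01, 0.02, -0.01], index=dates)
toy_ridge = pd.Series([0.02, 0.00, 0.01], index=dates)
toy_rf = pd.Series([0.03, -0.02, 0.03], index=dates)

# Equal-weighted ensemble: average the three returns month by month
toy_ensemble = pd.concat([toy_ols, toy_ridge, toy_rf], axis=1).mean(axis=1)
print(toy_ensemble)
```

When no month is missing, this matches `(1/3) * (a + b + c)`. The two differ only if a series has a missing month: the sum then yields `NaN` for that month, whereas `mean(axis=1)` averages over the strategies that are available.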

#### What is this ensemble strategy's Sharpe ratio in the validation sample?