Homework 5
==========

In this homework, you will:

- Use multivariate linear regression to predict molecular band gap energy from other molecular properties
- Analyze regression parameters with bootstrapping
- Assess goodness of fit of linear models


Problem statement
-----------------------
1. The data has been loaded for you. The independent variables are molecular volume, number of atoms, formation energy, and molecular stability; the dependent variable, which you will predict, is the band gap energy. Not all materials have a band gap. Remove all data points whose band gap energy is 0. How many data points remain?


2. Standardize the remaining data. For every variable (independent and dependent), subtract the mean from every data point and divide by the standard deviation. Plot band gap energy vs. one of the independent variables of your choice.


3. Create a multivariate linear regression model to predict the band gap energy from the other variables (molecular volume, number of atoms, formation energy, and molecular stability). Plot band gap energy vs. the same independent variable you plotted in question 2. Add a line of best fit for your selected covariate. Report the $R^2$ value of the regression model.


4. Perform bootstrapping on the multivariate regression problem and find histograms of the regression parameters (i.e., coefficients). If you use the bootstrapping function from discussion, you will need to edit it.
Bootstrapping requirements:
    - Each bootstrapping trial should draw as many samples as there are data points.
    - Use as many bootstrapping trials as you think are necessary.
    - Plot the histograms of all regression parameters on the same plot.
    - Plot the histograms by density (i.e., set `density=True` in the arguments of `plt.hist()`)


5. Clearly report the mean and variance of each histogram computed in part 4. Round each value to 4 decimal places.


6. Report the independent variable whose coefficient has the highest variance based on the bootstrapping histograms. Create a new multivariate linear regression model to predict the band gap energy from the other variables $\textit{excluding}$ this high-variance variable. Recreate each plot from part 3 with the 3 remaining independent variables. Report the $R^2$ value of the new regression model. What is the difference between this $R^2$ value and the one from part 3? What does this suggest about the variable you removed?

Import the necessary modules

In [1]:
import numpy as np
import sys
import matplotlib.pyplot as plt
import json
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error,r2_score
import random
from sklearn.utils import resample
import scipy.stats as st

Load data from the Open Quantum Materials Database: http://oqmd.org

Citations:
   - http://dx.doi.org/10.1007/s11837-013-0755-4
   - http://dx.doi.org/10.1038/npjcompumats.2015.10

Regression is performed on all OQMD molecules containing both tungsten and nitrogen (about 250 molecules).

In [19]:
# load data from the five JSON files
data_list = []
for i in range(1, 6):
    filename = f'data/W/formationenergy{i}.json'
    with open(filename) as json_file:
        jfile = json.load(json_file)
    data_list.extend(jfile['data'])
df = pd.DataFrame(data_list)

# choose independent variables
indep_vars = ['volume','natoms','stability','delta_e']

# choose dependent variable
dep_var = 'band_gap'

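# extract the variables as arrays and drop the first row
xs = df[indep_vars].to_numpy()
xs = np.delete(xs, 0, axis=0)
ys = df[dep_var].to_numpy()
ys = np.delete(ys, 0, axis=0).reshape(-1, 1)

# remove examples with missing values (any NaN in the row)
all_data = np.concatenate((xs, ys), axis=1)
all_data = all_data[~np.isnan(all_data).any(axis=1), :]

# split back into independent variables and band gap energy
xs = all_data[:, :len(indep_vars)]
ys = all_data[:, -1]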

# font settings for plots
font = {'family': 'serif',
        'weight': 'normal',
        'size': 18}

1. The data has been loaded for you. The independent variables are molecular volume, number of atoms, formation energy, and molecular stability; the dependent variable, which you will predict, is the band gap energy. Not all materials have a band gap. Remove all data points whose band gap energy is 0. How many data points remain?
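
One way to approach this step is a boolean mask over the band gap array. The sketch below is only an illustration on synthetic stand-in data: the arrays `xs` and `ys` here are made up (the real ones come from the loading cell above), and the names `has_gap`, `xs_gap`, `ys_gap`, and `n_remaining` are hypothetical.

```python
import numpy as np

# Synthetic stand-in for the loaded arrays (assumption: the real `xs` has one
# column per independent variable and `ys` holds the band gap energies).
rng = np.random.default_rng(0)
xs = rng.normal(size=(10, 4))
ys = np.array([0.0, 1.2, 0.0, 0.8, 2.5, 0.0, 0.3, 0.0, 1.1, 0.9])

# Boolean mask that is True wherever the band gap energy is nonzero.
has_gap = ys != 0

# Keep only the rows with a nonzero band gap, in both arrays.
xs_gap = xs[has_gap]
ys_gap = ys[has_gap]

# Number of data points that remain after filtering.
n_remaining = ys_gap.shape[0]
print(n_remaining)
```

Applying the same mask to both arrays keeps each row of independent variables aligned with its band gap value.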

2. Standardize the remaining data. For every variable (independent and dependent), subtract the mean from every data point and divide by the standard deviation. Plot band gap energy vs. one of the independent variables of your choice.
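
A minimal sketch of standardization, assuming the conventional z-score (subtract the mean, then divide by the standard deviation). The data is synthetic, and `standardize`, `xs_gap`, `ys_gap`, `xs_std`, and `ys_std` are hypothetical names used only for illustration.

```python
import numpy as np

# Synthetic stand-in for the filtered data (assumption, not the OQMD values).
rng = np.random.default_rng(0)
xs_gap = rng.normal(loc=5.0, scale=3.0, size=(50, 4))
ys_gap = rng.normal(loc=1.0, scale=0.5, size=50)


def standardize(a):
    """Subtract the mean and divide by the standard deviation, per column."""
    return (a - a.mean(axis=0)) / a.std(axis=0)


# Each column of xs_std, and ys_std itself, now has mean 0 and unit spread.
xs_std = standardize(xs_gap)
ys_std = standardize(ys_gap)
```

With the standardized arrays in hand, a call such as `plt.scatter(xs_std[:, 0], ys_std)` would plot band gap energy against the first independent variable.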

3. Create a multivariate linear regression model to predict the band gap energy from the other variables (molecular volume, number of atoms, formation energy, and molecular stability). Plot band gap energy vs. the same independent variable you plotted in question 2. Add a line of best fit for your selected covariate. Report the $R^2$ value of the regression model.
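
The following sketch fits an ordinary least-squares model with NumPy and computes $R^2$ by hand; `linear_model.LinearRegression` and `r2_score` from the imports above should give equivalent results. The data and `true_coefs` are synthetic assumptions. The "line of best fit" is drawn here under one possible interpretation: vary the chosen covariate and hold the others at zero, their mean after standardization.

```python
import numpy as np

# Synthetic, roughly standardized data (assumption, not the OQMD values).
rng = np.random.default_rng(0)
n = 200
xs_std = rng.normal(size=(n, 4))
true_coefs = np.array([0.5, -0.3, 0.0, 0.8])  # hypothetical coefficients
ys_std = xs_std @ true_coefs + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of ones for the intercept.
A = np.column_stack([np.ones(n), xs_std])
beta, *_ = np.linalg.lstsq(A, ys_std, rcond=None)
intercept, coefs = beta[0], beta[1:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares).
y_pred = A @ beta
ss_res = np.sum((ys_std - y_pred) ** 2)
ss_tot = np.sum((ys_std - ys_std.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Line for covariate j, with all other covariates held at zero.
j = 0
x_line = np.linspace(xs_std[:, j].min(), xs_std[:, j].max(), 100)
y_line = intercept + coefs[j] * x_line

print(round(r2, 4))
```

Overlaying `plt.plot(x_line, y_line)` on the scatter plot from question 2 would then show the fitted trend for that covariate.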

4. Perform bootstrapping on the multivariate regression problem and find histograms of the regression parameters (i.e., coefficients). If you use the bootstrapping function from discussion, you will need to edit it.
Bootstrapping requirements:
    - Each bootstrapping trial should draw as many samples as there are data points.
    - Use as many bootstrapping trials as you think are necessary.
    - Plot the histograms of all regression parameters on the same plot.
    - Plot the histograms by density (i.e., set `density=True` in the arguments of `plt.hist()`)
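
The requirements above can be sketched as follows. This is not the bootstrapping function from discussion; it is a stand-alone illustration on synthetic data, with `fit_coefs`, `n_trials`, and `boot_coefs` as hypothetical names and 1000 trials as an arbitrary choice. Rows are resampled with replacement using index draws; `resample` from `sklearn.utils` would be an alternative.

```python
import numpy as np

# Synthetic, roughly standardized data (assumption, not the OQMD values).
rng = np.random.default_rng(0)
n = 200
xs_std = rng.normal(size=(n, 4))
ys_std = xs_std @ np.array([0.5, -0.3, 0.0, 0.8]) + rng.normal(scale=0.5, size=n)


def fit_coefs(x, y):
    """Least-squares fit with intercept; returns only the slope coefficients."""
    A = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:]


n_trials = 1000  # arbitrary choice for this sketch
boot_coefs = np.empty((n_trials, xs_std.shape[1]))
for t in range(n_trials):
    # Draw n row indices with replacement, so each trial has n samples.
    idx = rng.integers(0, n, size=n)
    boot_coefs[t] = fit_coefs(xs_std[idx], ys_std[idx])

# Density-normalized histograms on shared bins, one per coefficient.
bins = np.linspace(boot_coefs.min(), boot_coefs.max(), 41)
densities = [
    np.histogram(boot_coefs[:, k], bins=bins, density=True)[0]
    for k in range(boot_coefs.shape[1])
]
```

To draw them on one figure, calling `plt.hist(boot_coefs[:, k], bins=bins, density=True, alpha=0.5)` once per column `k` places all histograms on the same axes with the same normalization.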

5. Clearly report the mean and variance of each histogram computed in part 4. Round each value to 4 decimal places.
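
A short sketch of the reporting step. The array `boot_coefs` below is a synthetic stand-in for the bootstrapped coefficients (one row per trial, one column per variable), and the use of the sample variance (`ddof=1`) is an assumption; the population variance (`ddof=0`) would differ only slightly for many trials.

```python
import numpy as np

indep_vars = ['volume', 'natoms', 'stability', 'delta_e']

# Synthetic stand-in for bootstrapped coefficients (assumption).
rng = np.random.default_rng(0)
boot_coefs = rng.normal(
    loc=[0.5, -0.3, 0.0, 0.8],
    scale=[0.05, 0.04, 0.10, 0.03],
    size=(1000, 4),
)

# Column-wise mean and sample variance, rounded to 4 decimal places.
means = np.round(boot_coefs.mean(axis=0), 4)
variances = np.round(boot_coefs.var(axis=0, ddof=1), 4)

for name, m, v in zip(indep_vars, means, variances):
    print(f"{name:>10s}: mean = {m:.4f}, variance = {v:.4f}")
```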

6. Report the independent variable whose coefficient has the highest variance based on the bootstrapping histograms. Create a new multivariate linear regression model to predict the band gap energy from the other variables $\textit{excluding}$ this high-variance variable. Recreate each plot from part 3 with the 3 remaining independent variables. Report the $R^2$ value of the new regression model. What is the difference between this $R^2$ value and the one from part 3? What does this suggest about the variable you removed?
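
One possible outline of this comparison, again on synthetic data: find the coefficient with the largest bootstrap variance, drop that column, refit, and compare $R^2$. The helper `fit` and the names `drop`, `xs_reduced`, and `r2_change` are hypothetical, and 500 trials is an arbitrary choice.

```python
import numpy as np

# Synthetic, roughly standardized data (assumption, not the OQMD values).
rng = np.random.default_rng(0)
n = 200
xs_std = rng.normal(size=(n, 4))
ys_std = xs_std @ np.array([0.5, -0.3, 0.0, 0.8]) + rng.normal(scale=0.5, size=n)


def fit(x, y):
    """Least-squares fit with intercept; returns (slope coefficients, R^2)."""
    A = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return beta[1:], r2


# R^2 of the full four-variable model.
_, r2_full = fit(xs_std, ys_std)

# Bootstrap the coefficients to estimate their variances.
n_trials = 500
boot = np.empty((n_trials, xs_std.shape[1]))
for t in range(n_trials):
    idx = rng.integers(0, n, size=n)
    boot[t], _ = fit(xs_std[idx], ys_std[idx])

# Column index of the coefficient with the highest bootstrap variance.
drop = int(np.argmax(boot.var(axis=0, ddof=1)))

# Refit without that column and compare R^2 values.
xs_reduced = np.delete(xs_std, drop, axis=1)
_, r2_reduced = fit(xs_reduced, ys_std)
r2_change = r2_full - r2_reduced
print(drop, round(r2_full, 4), round(r2_reduced, 4), round(r2_change, 4))
```

Because the reduced model is nested in the full one, its in-sample $R^2$ cannot exceed the full model's. A small `r2_change` would suggest the removed variable adds little predictive information beyond the remaining three, while a large one would suggest the opposite.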