# Problem 4: Using Machine Learning to Predict Bulk Modulus

NOTICE TO BINDER USERS: YOUR NOTEBOOK PROGRESS WILL NOT BE SAVED IF YOU CLOSE THIS WINDOW OR LEAVE IT INACTIVE FOR TOO LONG.

PLEASE DOWNLOAD YOUR NOTEBOOKS AND FILES REGULARLY OR DOWNLOAD THIS REPO AND RUN OFFLINE ON YOUR MACHINE. See "running_offline.md" for more info.


This notebook is based on a Matminer example notebook written by Anubhav Jain, which can be found [here](https://github.com/hackingmaterials/matminer_examples/blob/master/matminer_examples/machine_learning-nb/bulk_modulus.ipynb).

## Part A: Construct a Linear Regression Model

Linear regression, also known as ordinary least squares regression, is a fundamental technique for learning a linear model from a set of data. Because it is so foundational, most machine learning libraries provide a predefined function or object that performs it. In this notebook, we'll use [scikit-learn's](https://scikit-learn.org/stable/index.html) linear regression model and evaluate its performance with the "root mean squared error" or RMSE (the square root of the [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error)). Please define a function for RMSE below.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

def rmse(y_true, y_predicted):
    return # YOUR CODE HERE
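
As a reference for the definition above, here is a minimal sketch of RMSE written with plain Python lists rather than NumPy or scikit-learn. The name `rmse_sketch` and the sample values are illustrative assumptions and are not part of the original notebook; your own `rmse` function may well look different.

```python
import math

def rmse_sketch(y_true, y_predicted):
    """Root mean squared error of two equal-length sequences of numbers."""
    if len(y_true) != len(y_predicted):
        raise ValueError("inputs must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values: the errors are 0, 0, and 2 GPa, so the MSE is 4/3
# and the RMSE is sqrt(4/3).
example_rmse = rmse_sketch([100.0, 150.0, 200.0], [100.0, 150.0, 202.0])
print("RMSE = %.3f GPa" % example_rmse)
```

Because the errors are squared before averaging, a single large miss raises the RMSE more than several small ones.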
        

First, we need to import our data. The provided training data is already in tabular form (.csv stands for "comma separated values" and is a tabular file type), so we can load it from "ml_training_data.csv" into a [pandas dataframe](https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python), which makes it easier to manipulate.

In [None]:
from pandas import read_csv

df = read_csv("ml_training_data.csv")
# Drop the unnecessary index column (the keyword form works across pandas versions)
df = df.drop(columns="Unnamed: 0")
# Show first 10 rows of our data
df.head(10)

Now, we need to separate our training data into what we're trying to predict (K_VRH) and the features we're trying to use for our predictions. Let's also set 1% of our data aside as a final test set. 

In [None]:
from sklearn.model_selection import train_test_split
df, df_test = train_test_split(df, test_size=0.01, random_state=42)

# Creates a numpy array from the "K_VRH" column of the dataframe
y_test = df_test["K_VRH"].to_numpy()
y = df["K_VRH"].to_numpy()


# Creates a numpy array from the dataframe after removing unwanted columns (we don't want to predict K_VRH from K_VRH.)
X_test = df_test.drop(labels=["material_id", "K_VRH", "kpoint_density"], axis = 1).to_numpy()
X = df.drop(labels=["material_id", "K_VRH", "kpoint_density"], axis = 1).to_numpy()
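
Conceptually, a hold-out split shuffles the rows and sets a fraction of them aside. The sketch below illustrates that idea with the standard library only. It is not scikit-learn's implementation, and `holdout_split`, `demo_rows`, and the row count of 1000 are assumptions made for illustration.

```python
import random

def holdout_split(rows, test_fraction, seed):
    """Shuffle a copy of `rows` and return (train_rows, test_rows)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one row in the test set, even for tiny inputs.
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

demo_rows = list(range(1000))  # stand-in for 1000 materials
demo_train, demo_test = holdout_split(demo_rows, test_fraction=0.01, seed=42)
print(len(demo_train), "training rows,", len(demo_test), "test rows")
```

Fixing the seed makes the split reproducible, which is the role `random_state=42` plays in the cell above.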

Okay! We're ready to fit our linear model to this data. 

In [None]:
lr = LinearRegression()

lr.fit(X, y)

# get fit statistics
print('training R2 = ' + str(round(lr.score(X, y), 3)))
print('training RMSE = %.3f GPa' % rmse(y_true=y, y_predicted=lr.predict(X)))
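
To make the two ideas in this cell concrete, the sketch below fits a one-feature least-squares line in closed form and computes R2 by hand. It is a simplified illustration, not what `LinearRegression` does internally for many features, and the function names and data points are made up for this example.

```python
def fit_line(x, y):
    """Closed-form least-squares fit of y = slope * x + intercept."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def r2_sketch(y_true, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_predicted))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up data that lies close to a straight line.
x_demo = [1.0, 2.0, 3.0, 4.0]
y_demo = [3.1, 4.9, 7.2, 8.8]
slope, intercept = fit_line(x_demo, y_demo)
y_fit = [slope * xi + intercept for xi in x_demo]
print("slope = %.3f, intercept = %.3f, R2 = %.3f"
      % (slope, intercept, r2_sketch(y_demo, y_fit)))
```

An R2 of 1 means a perfect fit and 0 means no better than always predicting the mean. On data the model was not fitted to, R2 can be negative.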

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(lr, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(lr, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)

print('Cross-validation results:')
# R2 can be negative for a poor model, so average the raw scores without abs()
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(r2_scores)))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))))
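
The sketch below shows how k-fold cross-validation partitions the data: every sample lands in the test fold exactly once and in the training set for the remaining folds. It is a minimal illustration and not scikit-learn's `KFold`, which assigns samples to folds differently. The name `kfold_indices` and the sample count of 25 are assumptions.

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_indices, test_indices) for each of n_splits folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

demo_splits = list(kfold_indices(n_samples=25, n_splits=10, seed=1))
for train_idx, test_idx in demo_splits[:3]:
    print(len(train_idx), "train /", len(test_idx), "test")
```

With 10 folds, each model is trained on roughly 90% of the data and scored on the remaining 10%, and the reported metric is the average over the 10 runs.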

We can also use a *figrecipe* from matminer to plot how our linear model's predictions compare to the DFT calculated values:

In [None]:
from matminer.figrecipes.plot import PlotlyFig
from sklearn.model_selection import cross_val_predict

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               title='Linear regression',
               mode='notebook',
               filename="lr_regression.html")

pf.xy(xy_pairs=[(y, cross_val_predict(lr, X, y, cv=crossvalidation)), ([0, max(y)], [0, max(y)])], 
      labels=df['material_id'], 
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}], 
      showlegends=False
     )

Let's compare the linear model's predictions for two silicon structures to the DFT-calculated values of K_VRH from the Materials Project. (You should also compare them to your calculations from Problem 2 for your report.)

In [None]:
Si_df = read_csv("Si_features.csv")
Si =  Si_df.drop(['material_id', 'material_name'], axis=1)[df.drop(['material_id', 'kpoint_density', 'K_VRH'], axis=1).columns.values].to_numpy()
predictions = lr.predict(Si)
print("Diamond Cubic:")
print("\tPrediction: {0:.0f} GPa".format(predictions[0]))
print("\tMP DFT: {0:.0f} GPa".format(83.0))
print("Beta Tin:")
print("\tPrediction: {0:.0f} GPa".format(predictions[1]))
print("\tMP DFT: {0:.0f} GPa".format(108.0))
      


Although linear regression is simple and easy to interpret, it can be quite useful. Let's try a slightly more advanced machine learning model, a random forest.

In [None]:
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=50, random_state=1)

rf.fit(X, y)
print('training R2 = ' + str(round(rf.score(X, y), 3)))
print('training RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y, y_pred=rf.predict(X))))

In [None]:
from sklearn.model_selection import KFold, cross_val_score

# Use 10-fold cross validation (90% training, 10% test)
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_val_score(rf, X, y, scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)
rmse_scores = [np.sqrt(abs(s)) for s in scores]
r2_scores = cross_val_score(rf, X, y, scoring='r2', cv=crossvalidation, n_jobs=1)

print('Cross-validation results:')
# R2 can be negative for a poor model, so average the raw scores without abs()
print('Folds: %i, mean R2: %.3f' % (len(scores), np.mean(r2_scores)))
print('Folds: %i, mean RMSE: %.3f' % (len(scores), np.mean(np.abs(rmse_scores))))

In [None]:
from matminer.figrecipes.plot import PlotlyFig
from sklearn.model_selection import cross_val_predict

pf = PlotlyFig(x_title='DFT (MP) bulk modulus (GPa)',
               y_title='Predicted bulk modulus (GPa)',
               title='Random Forest',
               mode='notebook',
               filename="random_forest.html")

pf.xy(xy_pairs=[(y, cross_val_predict(rf, X, y, cv=crossvalidation)), ([0, max(y)], [0, max(y)])], 
      labels=df['material_id'], 
      modes=['markers', 'lines'],
      lines=[{}, {'color': 'black', 'dash': 'dash'}], 
      showlegends=False
     )

Let's check how our models do on the test set we set aside at the beginning, and also compare the random forest's predictions for silicon to the DFT-calculated values of K_VRH from the Materials Project. (You should also compare them to your calculations from Problem 2 for your report.)

In [None]:
print("Test Set:")
print('\t LR R2 = ' + str(round(lr.score(X_test, y_test), 3)))
print('\t LR RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y_test, y_pred=lr.predict(X_test))))
print('\t RF R2 = ' + str(round(rf.score(X_test, y_test), 3)))
print('\t RF RMSE = %.3f' % np.sqrt(mean_squared_error(y_true=y_test, y_pred=rf.predict(X_test))))

Si_df = read_csv("Si_features.csv")
Si =  Si_df.drop(['material_id', 'material_name'], axis=1)[df.drop(['material_id', 'kpoint_density', 'K_VRH'], axis=1).columns.values].to_numpy()
predictions = rf.predict(Si)
print("Diamond Cubic:")
print("\tPrediction: {0:.0f} GPa".format(predictions[0]))
print("\tMP DFT: {0:.0f} GPa".format(83.0))
print("Beta Tin:")
print("\tPrediction: {0:.0f} GPa".format(predictions[1]))
print("\tMP DFT: {0:.0f} GPa".format(108.0))
      

### How to programmatically get CPU time used during DFT calculations

Rather than going through the OUTCARs by hand to get the calculation statistics, we can use pymatgen's Outcar object to help speed up the process:

In [None]:
from pymatgen.io.vasp import Outcar

outcar_file = "fake_vasp_data/oGTYAM6nxJ/OUTCAR"

time = Outcar(outcar_file).run_stats["Total CPU time used (sec)"]
print(time)

In [None]:
import os

my_directory = # YOUR CODE HERE
total_time = 0
for path, subdirs, subfiles in os.walk(my_directory):
    print("path = ",path)
    
    for subdir in subdirs:
        print("subdir = ",subdir)
        outcar_file = os.path.join(path, subdir, "OUTCAR")
        total_time += 0
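
One possible way to structure the accumulation is sketched below. It takes a slightly different approach from the loop above: it looks for a file named OUTCAR in every directory that `os.walk` visits and delegates the parsing to a reader function. The names `sum_over_outcars`, `read_seconds_with_pymatgen`, and `read_seconds_stub` are hypothetical, and the demonstration uses throwaway files that each contain a single number, not real VASP output.

```python
import os
import tempfile

def sum_over_outcars(root, read_seconds):
    """Add up read_seconds(path) for every file named OUTCAR below root."""
    total = 0.0
    for path, _subdirs, files in os.walk(root):
        if "OUTCAR" in files:
            total += read_seconds(os.path.join(path, "OUTCAR"))
    return total

def read_seconds_with_pymatgen(outcar_path):
    """Reader for real OUTCAR files, using the same call as the cell above."""
    # Imported lazily so the rest of this sketch runs without pymatgen.
    from pymatgen.io.vasp import Outcar
    return Outcar(outcar_path).run_stats["Total CPU time used (sec)"]

def read_seconds_stub(outcar_path):
    """Stand-in reader for the demo files, which hold one number each."""
    with open(outcar_path) as handle:
        return float(handle.read())

# Build two fake run directories and sum their made-up timings.
with tempfile.TemporaryDirectory() as demo_root:
    for name, seconds in [("run_a", 12.5), ("run_b", 30.0)]:
        os.makedirs(os.path.join(demo_root, name))
        with open(os.path.join(demo_root, name, "OUTCAR"), "w") as handle:
            handle.write(str(seconds))
    demo_total = sum_over_outcars(demo_root, read_seconds_stub)

print(demo_total)
```

For real calculations, a call such as `sum_over_outcars(my_directory, read_seconds_with_pymatgen)` would apply the same traversal with pymatgen doing the parsing, assuming every OUTCAR found belongs to a finished run.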