
# Case Study: Predicting band gaps of oxides from their structure

In this lesson and its exercises, you were introduced to a dataset that contains structural information, formation energies, and band gaps of transparent conducting oxide materials. We saw that linear models performed poorly on this data, so in this case study you will build a kernel ridge regression (KRR) model to obtain more accurate predictions.

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('data/materials_band_gaps.csv')
X = df.iloc[:, 1:-2]  # skip the id column (first) and the last two columns (the last one is the band gap)
x_names = X.columns.to_list()
X = X.values
y = df.iloc[:,-1].values #the last column is the band gap
y = y.reshape(-1,1)

print('Feature dimensions: {}'.format(X.shape))
print('Output dimensions: {}'.format(y.shape))

fig, ax = plt.subplots()

ax.hist(y)
ax.set_xlabel('Band Gap [eV]')
ax.set_ylabel('Counts')

We will create a test set and a training set here. **All training and hyperparameter tuning should only include the training set!** This is very important: if parameters or hyperparameters are accidentally optimized on the test set, the test error is no longer an unbiased estimate of how the model performs on unseen data. You can **only** use the test set to evaluate the accuracy.

In [None]:
from sklearn.model_selection import train_test_split
np.random.seed(5)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.35)

print('Training feature dimensions: {}'.format(X_train.shape))
print('Training output dimensions: {}'.format(y_train.shape))

fig, ax = plt.subplots()

ax.hist(y_train)
ax.set_xlabel('Band Gap [eV]')
ax.set_ylabel('Counts')

Comparing this histogram with the one for the full dataset, we can see that the training data has the same general distribution, which suggests it is a representative sample. Use this training set to answer the following questions.

## What error metric and feature scaling strategy will you choose for this dataset? 
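
One possible way to approach this question is to compare scalers inside a cross-validated pipeline, so that each scaler is fit only on the training folds. The sketch below scores with mean absolute error (MAE), which has the same units as the target (eV for the band gap). It is an illustration, not a prescribed answer: `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train`, and the KRR settings (`alpha=0.1`, `gamma=0.1`) and the two scalers compared are arbitrary assumptions.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic stand-in for (X_train, y_train): features on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
y_demo = np.sin(X_demo[:, 0]) + 0.01 * X_demo[:, 2] + rng.normal(scale=0.1, size=200)

def cv_mae(scaler, X, y, cv=5):
    """Mean cross-validated MAE of an RBF KRR model preceded by `scaler`."""
    model = make_pipeline(scaler, KernelRidge(kernel='rbf', alpha=0.1, gamma=0.1))
    # scikit-learn maximizes scores, so MAE is reported with a negative sign.
    scores = cross_val_score(model, X, y, cv=cv, scoring='neg_mean_absolute_error')
    return -scores.mean()

mae_by_scaler = {
    'standard': cv_mae(StandardScaler(), X_demo, y_demo),
    'minmax': cv_mae(MinMaxScaler(), X_demo, y_demo),
}
for name, mae in mae_by_scaler.items():
    print('{}: CV MAE = {:.3f}'.format(name, mae))
```

Because the RBF kernel depends on distances between samples, features with large numerical ranges can dominate unless they are rescaled, which is why the scaler is treated as part of the model here.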

## What hyperparameters for kernel ridge regression provide the best model performance if all features are included?
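
A common way to search for KRR hyperparameters is a cross-validated grid search over the regularization strength `alpha` and the RBF kernel width `gamma`. The following is a minimal sketch under stated assumptions: `X_demo` and `y_demo` are synthetic placeholders for `X_train` and `y_train`, and the grid ranges are illustrative guesses rather than values tuned for the oxide dataset.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X_train, y_train) with a nonlinear target.
rng = np.random.default_rng(1)
X_demo = rng.uniform(-3, 3, size=(150, 4))
y_demo = np.sin(X_demo[:, 0]) * np.cos(X_demo[:, 1]) + rng.normal(scale=0.05, size=150)

# Scaling lives inside the pipeline so it is refit on every training fold.
pipe = Pipeline([('scale', StandardScaler()), ('krr', KernelRidge(kernel='rbf'))])

# Logarithmic grids; the ranges are assumptions chosen for illustration.
param_grid = {
    'krr__alpha': np.logspace(-4, 0, 5),
    'krr__gamma': np.logspace(-3, 1, 5),
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_absolute_error')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_cv_mae = -search.best_score_  # flip the sign back to a positive MAE
print('Best parameters: {}'.format(best_params))
print('Best CV MAE: {:.3f}'.format(best_cv_mae))
```

If the best value sits on the edge of a grid, it is usually worth widening that range and searching again.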

## Is it possible to use feature selection or linear feature transformations to reduce the number of input features while maintaining similar performance?
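
To explore this question, one option is to compare the cross-validated error of the full feature set against a linear transformation (PCA) and a simple univariate feature selection (`SelectKBest`). The sketch below does this on synthetic data with deliberately redundant columns; `X_demo`, `y_demo`, the number of retained features, and the KRR settings are all assumptions made for illustration. In the notebook, note that `y_train` is a column vector, so passing `y_train.ravel()` is the safer choice with `f_regression`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 3 informative features plus 5 nearly redundant ones.
rng = np.random.default_rng(2)
n = 200
informative = rng.normal(size=(n, 3))
redundant = informative @ rng.normal(size=(3, 5)) + 0.01 * rng.normal(size=(n, 5))
X_demo = np.hstack([informative, redundant])
y_demo = informative[:, 0] - 2.0 * informative[:, 1] + rng.normal(scale=0.1, size=n)

def cv_mae(reducer=None):
    """Cross-validated MAE of scaler -> optional reducer -> RBF KRR."""
    steps = [StandardScaler()]
    if reducer is not None:
        steps.append(reducer)
    steps.append(KernelRidge(kernel='rbf', alpha=0.1, gamma=0.05))
    model = make_pipeline(*steps)
    scores = cross_val_score(model, X_demo, y_demo, cv=5,
                             scoring='neg_mean_absolute_error')
    return -scores.mean()

results = {
    'all 8 features': cv_mae(),
    'PCA, 3 components': cv_mae(PCA(n_components=3)),
    'SelectKBest, k=3': cv_mae(SelectKBest(f_regression, k=3)),
}
for name, mae in results.items():
    print('{}: CV MAE = {:.3f}'.format(name, mae))
```

Keeping the reducer inside the pipeline means the components or selected features are chosen from the training folds only, which is consistent with the rule that the test set is never used for tuning.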

## What is the best accuracy you can achieve for a regression model evaluated on the test set?
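
Once the model and its hyperparameters have been chosen using the training set alone, the test set is used a single time for the final evaluation. The sketch below shows the shape of that last step on synthetic data. The names `X_demo`, `y_demo`, `X_tr`, `X_te`, and the hyperparameter values are placeholders; in the notebook, the model would be fit on `X_train`, `y_train` with the settings found by cross-validation and scored on `X_test`, `y_test`.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the full dataset.
rng = np.random.default_rng(3)
X_demo = rng.uniform(-2, 2, size=(300, 3))
y_demo = X_demo[:, 0] ** 2 + np.sin(X_demo[:, 1]) + rng.normal(scale=0.05, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.35,
                                          random_state=5)

# Placeholder hyperparameters; in practice these come from CV on the training set.
final_model = make_pipeline(StandardScaler(),
                            KernelRidge(kernel='rbf', alpha=0.01, gamma=0.5))
final_model.fit(X_tr, y_tr)  # the scaler and KRR see training data only

# Single evaluation on the held-out test set.
y_pred = final_model.predict(X_te)
test_mae = mean_absolute_error(y_te, y_pred)
test_r2 = r2_score(y_te, y_pred)
print('Test MAE: {:.3f}'.format(test_mae))
print('Test R^2: {:.3f}'.format(test_r2))
```

Reporting more than one metric (here MAE and R^2) can help, since MAE gives the typical error in the target's units while R^2 shows how much of the variance the model explains.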