# Regression

In this notebook, we predict the band gap of materials. The dataset is built in the `dataset_preparation.ipynb` notebook. We test several algorithms to assess which one gives the smallest mean squared error (MSE). The workflow is essentially the same for all algorithms: we perform a train/test split; then we run a grid search, scored by 5-fold cross-validation on the training set, to find the best set of hyperparameters; finally, we evaluate the error on the test data.
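
Since the same three steps are repeated for every model, they can be wrapped in a single function. The following is only a sketch: `tune_and_evaluate` is a hypothetical helper that is not used elsewhere in this notebook, and the data are synthetic stand-ins generated with `make_regression`, not the band gap dataset.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeRegressor


def tune_and_evaluate(estimator, param_grid, X_train, y_train, X_test, y_test):
    """Hypothetical helper: grid search with 5-fold CV, then score on the test set."""
    search = GridSearchCV(estimator, param_grid, cv=5,
                          scoring="neg_mean_squared_error")
    # With the default refit=True, the best model is retrained on all of X_train
    search.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, search.best_estimator_.predict(X_test))
    # best_score_ is the negated MSE, so flip the sign to report an error
    return search.best_params_, -search.best_score_, test_mse


# Synthetic stand-in data, NOT the band gap dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_params, demo_cv_mse, demo_test_mse = tune_and_evaluate(
    DecisionTreeRegressor(random_state=0),
    {"max_depth": [3, 5, None]},
    X_tr, y_tr, X_te, y_te,
)
print(demo_params, demo_cv_mse, demo_test_mse)
```

The cells below keep the steps written out explicitly for each model, which makes the intermediate outputs easier to inspect.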

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
import multiprocessing
import xgboost as xgb #For parallel gradient boosting

In [2]:
#Dataset loading
df = pd.read_csv('gap_prediction.csv')

#Turning space group into a categorical variable
df["Space Group"] = df["Space Group"].astype('category')

#Building a dict that maps the space groups in unique integers
mapping_dict = dict(zip(df['Space Group'], df['Space Group'].cat.codes))

#Transforms the categorical space group to numbers
df['Space Group'] = df['Space Group'].map(mapping_dict)

#Target
y = df['gap']
df.drop(['gap','Material','Unnamed: 0'], axis='columns', inplace=True)
X = df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)
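
Two properties of this encoding are worth keeping in mind: the integer codes follow the sorted order of the labels, so they carry no physical meaning, and a label that is absent from `mapping_dict` becomes `NaN` when the mapping is applied to another file. The toy example below illustrates both points; the labels and the names `toy`, `toy_mapping`, and `new` are illustrative and are not taken from the dataset.

```python
import pandas as pd

# Toy column; these space-group labels are illustrative only
toy = pd.Series(["Pmna", "P-3", "Pmna", "P-31m"]).astype("category")

# Same construction as in the cell above: label -> integer code
toy_mapping = dict(zip(toy, toy.cat.codes))

# Apply the mapping to data that contains a label never seen before
new = pd.Series(["P-3", "Fm-3m"])
encoded = new.map(toy_mapping)

print(toy_mapping)
print(encoded)  # the unseen label "Fm-3m" is mapped to NaN
```

A quick `encoded.isna().any()` check after mapping is one way to catch unseen labels before they reach a model.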

# Models

## Linear Regression (ElasticNet)

In [3]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1],  # Regularization parameter
    'l1_ratio': [.1, .5, .7, .9, .95, .99, 1]  # Mixing parameter (0: L2, 1: L1, [0,1]: ElasticNet)
}

# Create an Elastic Net regressor
en_regressor = ElasticNet(max_iter=10000)

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(en_regressor, param_grid, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
scaler = StandardScaler().fit(X_train)
grid_search.fit(scaler.transform(X_train), y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'alpha': 0.01, 'l1_ratio': 0.1}


In [4]:
# Train the Elastic Net regressor with the best hyperparameters
best_en_regressor = ElasticNet(**best_params,max_iter=10000)
best_en_regressor.fit(scaler.transform(X_train), y_train)

# Evaluate the model on the test set
y_pred = best_en_regressor.predict(scaler.transform(X_test))
mse_en = mean_squared_error(y_test, y_pred)
print("Test Error:", mse_en)

# Perform Cross-Validation with the best hyperparameters
# Note: X_train is passed unscaled here, whereas the grid search above used scaler.transform(X_train)
cv_scores_en = cross_val_score(best_en_regressor, X_train, y_train, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
print("Cross-Validation Error:", -cv_scores_en)
print("Mean CV Error:", -np.mean(cv_scores_en))

Test Error: 0.839231301570435
Cross-Validation Error: [0.68949278 0.76132569 0.6652068  0.80278002 0.72332652]
Mean CV Error: 0.7284263611691566
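
In the two cells above, the scaler is fitted once on the whole training set before the grid search, and the final `cross_val_score` call receives the unscaled `X_train`. One possible way to keep scaling consistent, and to refit the scaler inside every fold, is to wrap both steps in a scikit-learn pipeline. This is a sketch on synthetic stand-in data, not a rerun of the results above; the names `pipe`, `pipe_grid`, and `pipe_search` are introduced here for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, NOT the band gap dataset
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# The scaler is refitted on the training part of every CV fold
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))

# make_pipeline names each step after its lowercased class name,
# so parameters are addressed as <step name>__<parameter>
pipe_grid = {
    "elasticnet__alpha": [0.01, 0.1, 1],
    "elasticnet__l1_ratio": [0.1, 0.5, 0.9],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="neg_mean_squared_error")
pipe_search.fit(X_demo, y_demo)
print(pipe_search.best_params_, -pipe_search.best_score_)
```

With this layout, `pipe_search.best_estimator_` can be passed directly to `cross_val_score` or used for prediction on raw features, since scaling happens inside the pipeline.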


## Decision Tree

In [5]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum samples required to be at a leaf node
}

# Create a Decision Tree regressor
dt_regressor = DecisionTreeRegressor()

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(dt_regressor, param_grid, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'max_depth': 30, 'min_samples_leaf': 2, 'min_samples_split': 5}


In [6]:
# Train the Decision Tree regressor with the best hyperparameters
best_dt_regressor = DecisionTreeRegressor(**best_params)
best_dt_regressor.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = best_dt_regressor.predict(X_test)
mse_dt = mean_squared_error(y_test, y_pred)
print("Test Error:", mse_dt)

# Perform Cross-Validation with the best hyperparameters
cv_scores_dt = cross_val_score(best_dt_regressor, X_train, y_train, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
print("Cross-Validation Error:", -cv_scores_dt)
print("Mean CV Error:", -np.mean(cv_scores_dt))

Test Error: 0.6921712444281362
Cross-Validation Error: [0.55680645 0.60627146 0.58004253 0.61168845 0.73644058]
Mean CV Error: 0.6182498954241303


## Random Forest

In [7]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    #'n_estimators': [100, 200, 300],      # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],     # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],    # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]       # Minimum samples required to be at a leaf node
}

# Create a Random Forest regressor
rf_regressor = RandomForestRegressor(n_jobs=-1)

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(rf_regressor, param_grid, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [8]:
# Train the Random Forest regressor with the best hyperparameters
best_rf_regressor = RandomForestRegressor(n_jobs=-1, **best_params)
best_rf_regressor.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = best_rf_regressor.predict(X_test)
mse_rf = mean_squared_error(y_test, y_pred)
print("Test Error:", mse_rf)

# Perform Cross-Validation with the best hyperparameters
cv_scores_rf = cross_val_score(best_rf_regressor, X_train, y_train, cv=5, scoring='neg_mean_squared_error',n_jobs=-1)
print("Cross-Validation Error:", -cv_scores_rf)
print("Mean CV Error:", -np.mean(cv_scores_rf))

Test Error: 0.4407661799079696
Cross-Validation Error: [0.31376791 0.46013912 0.38254048 0.51632421 0.46347019]
Mean CV Error: 0.42724838105726703


## Gradient Boosting

In [9]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    'n_estimators': [50, 100, 200],      # Number of boosting stages to be used
    'learning_rate': [0.1, 0.2, 0.3, 0.4],  # Step size shrinks the contribution of each tree
    'max_depth': [5, 6, 7, 8]              # Maximum depth of each tree
}

# Create a Gradient Boosting Regressor
xgb_model = xgb.XGBRegressor(
    n_jobs=multiprocessing.cpu_count() // 2, tree_method="hist"
)


grid_search = GridSearchCV(xgb_model,param_grid,cv=5,scoring='neg_mean_squared_error',n_jobs=2)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'learning_rate': 0.1, 'max_depth': 5, 'n_estimators': 200}
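
The cost of an exhaustive grid search grows with the product of the list lengths in the grid, multiplied by the number of folds. As a rough illustration, the snippet below counts the cross-validation fits for the grids defined in this notebook (the final refit on the full training set adds one more fit each). The grids are copied from the cells above; the names `grids` and `fit_counts` are introduced here.

```python
from itertools import product

# Grids copied from the cells in this notebook
grids = {
    "ElasticNet": {
        "alpha": [0.001, 0.01, 0.1, 1],
        "l1_ratio": [0.1, 0.5, 0.7, 0.9, 0.95, 0.99, 1],
    },
    "Decision Tree / Random Forest": {
        "max_depth": [None, 10, 20, 30],
        "min_samples_split": [2, 5, 10],
        "min_samples_leaf": [1, 2, 4],
    },
    "Gradient Boosting": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.1, 0.2, 0.3, 0.4],
        "max_depth": [5, 6, 7, 8],
    },
}

n_folds = 5
fit_counts = {}
for name, grid in grids.items():
    # Every combination of values is one candidate
    n_candidates = len(list(product(*grid.values())))
    fit_counts[name] = n_candidates * n_folds
    print(f"{name}: {n_candidates} candidates x {n_folds} folds = {fit_counts[name]} fits")
```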


In [10]:
# Train the Gradient Boosting regressor with the best hyperparameters
best_gb_regressor = xgb.XGBRegressor(
    n_jobs=multiprocessing.cpu_count() // 2, tree_method="hist", **best_params)
best_gb_regressor.fit(X_train, y_train,verbose=3)

# Evaluate the model on the test set
y_pred = best_gb_regressor.predict(X_test)
mse_gb = mean_squared_error(y_test, y_pred)
print("Test Error:", mse_gb)

# Perform Cross-Validation with the best hyperparameters
cv_scores_gb = cross_val_score(best_gb_regressor, X_train, y_train, cv=5, scoring='neg_mean_squared_error')
print("Cross-Validation Error:", -cv_scores_gb)
print("Mean CV Error:", -np.mean(cv_scores_gb))

Test Error: 0.3517656511213685


Cross-Validation Error: [0.24505851 0.36793458 0.27596039 0.39405086 0.32384137]
Mean CV Error: 0.32136914110180126


# Summary

In [11]:
df = pd.DataFrame(columns=['Algorithm', 'Test MSE', 'Mean CV MSE'])
df.loc[len(df)] = ['Linear Regression (ElasticNet)', mse_en, -np.mean(cv_scores_en)]
df.loc[len(df)] = ['Decision Tree', mse_dt, -np.mean(cv_scores_dt)]
df.loc[len(df)] = ['Random Forest', mse_rf, -np.mean(cv_scores_rf)]
df.loc[len(df)] = ['Gradient Boosting', mse_gb, -np.mean(cv_scores_gb)]
df.sort_values(by='Mean CV MSE')

Unnamed: 0,Algorithm,Test MSE,Mean CV MSE
3,Gradient Boosting,0.351766,0.321369
2,Random Forest,0.440766,0.427248
1,Decision Tree,0.692171,0.61825
0,Linear Regression (ElasticNet),0.839231,0.728426
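
The MSE is expressed in squared units of the target, so it can be easier to read the table after taking the square root, which gives an error (RMSE) in the same units as the gap. The snippet below does this for the test MSE values copied from the table above; the names `summary_test_mse` and `summary_test_rmse` are introduced here for illustration.

```python
import math

# Test MSE values copied from the summary table above
summary_test_mse = {
    "Gradient Boosting": 0.351766,
    "Random Forest": 0.440766,
    "Decision Tree": 0.692171,
    "Linear Regression (ElasticNet)": 0.839231,
}

# RMSE is the square root of the MSE, in the same units as the target
summary_test_rmse = {name: math.sqrt(mse) for name, mse in summary_test_mse.items()}

for name, rmse in sorted(summary_test_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.3f}")
```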


# Prediction of Novel Materials

In [50]:
novel_df = pd.read_csv('gap_prediction_novel.csv')

novel_df["Space Group"] = novel_df["Space Group"].astype('category')
novel_df['Space Group'] = novel_df['Space Group'].map(mapping_dict)

novel_df.drop(['Unnamed: 0'], axis='columns', inplace=True)

X_novel = novel_df.drop(['Material'], axis='columns').to_numpy()

novel_df["Predicted Gap"] = best_gb_regressor.predict(X_novel)

#The next lines move "Predicted Gap" to the second column
columns = novel_df.columns.tolist()
columns.remove("Predicted Gap")
columns.insert(2, "Predicted Gap")
novel_df = novel_df[columns]

#We turn the Space Group back to letters
inverse_dict = {val:key for key,val in mapping_dict.items()}
novel_df["Space Group"] = novel_df["Space Group"].map(inverse_dict)

#Print the 10 materials with the highest predicted band gap
novel_df.sort_values(by='Predicted Gap',ascending=False).head(10)

Unnamed: 0,Material,Space Group,Predicted Gap,Z_mean,Electronegativity_mean,IonizationPotential_mean,ElectronAffinity_mean,HOMO_mean,LUMO_mean,r_s_orbital_mean,...,r_p_orbital_wstd,r_d_orbital_wstd,r_atomic_nonbonded_wstd,r_valence_lastorbital_wstd,r_covalent_wstd,Valence_wstd,PeriodicColumn_wstd,PeriodicColumn_upto18_wstd,NumberUnfilledOrbitals_wstd,Polarizability_wstd
5068,ZnCl2,P-31m,4.834274,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.218448,0.305784,0.028444,0.034047,0.011111,6.944444,6.944444,6.944444,0.277778,161.28256
1180,ZnCl2,P-3,4.834274,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.218448,0.305784,0.028444,0.034047,0.011111,6.944444,6.944444,6.944444,0.277778,161.28256
5065,ZnF2,P-31m,4.786496,19.5,2.815,-14.5737,-1.044,-8.5456,0.46595,0.7556,...,0.447675,0.162903,0.1,0.149166,0.117361,6.944444,6.944444,6.944444,0.277778,339.616988
1177,ZnF2,P-3,4.786496,19.5,2.815,-14.5737,-1.044,-8.5456,0.46595,0.7556,...,0.447675,0.162903,0.1,0.149166,0.117361,6.944444,6.944444,6.944444,0.277778,339.616988
1181,ZnCl3,P-3,4.681159,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.245754,0.344006,0.032,0.038303,0.0125,7.8125,7.8125,7.8125,0.3125,181.44288
5069,ZnCl3,P-31m,4.681159,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.245754,0.344006,0.032,0.038303,0.0125,7.8125,7.8125,7.8125,0.3125,181.44288
5067,Zn2Cl3,P-31m,4.587038,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.204468,0.286213,0.026624,0.031868,0.0104,6.5,6.5,6.5,0.26,150.960476
1179,Zn2Cl3,P-3,4.587038,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.204468,0.286213,0.026624,0.031868,0.0104,6.5,6.5,6.5,0.26,150.960476
2368,ZnCl2,Pmna,4.474201,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.218448,0.305784,0.028444,0.034047,0.011111,6.944444,6.944444,6.944444,0.277778,161.28256
2692,ZnCl2,Pmn2_1,4.474201,23.5,2.405,-11.8606,-0.8448,-7.28595,2.0692,0.89265,...,0.218448,0.305784,0.028444,0.034047,0.011111,6.944444,6.944444,6.944444,0.277778,161.28256


The ten materials with the highest predicted band gap are all zinc halides (chlorides and fluorides).