# Classification

In this notebook, we classify materials as metals or nonmetals. The dataset is built in the `dataset_preparation.ipynb` notebook. We test several algorithms to assess which one gives the best accuracy. The workflow is essentially the same for all of them: we perform a train-test split; then we run a grid search, scored by 5-fold cross-validation on the training set, to find the best set of hyperparameters; finally, we evaluate the accuracy on the test data.
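The workflow described above can be sketched as a single helper. This is only an illustration: `tune_and_evaluate` is a hypothetical name that does not appear in the notebook, and the data come from `make_classification` rather than from the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_and_evaluate(estimator, param_grid, X, y, seed=42):
    """Hypothetical helper: split, grid-search with 5-fold CV, score on the test set."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    search = GridSearchCV(estimator, param_grid, cv=5, scoring="accuracy")
    # By default GridSearchCV refits the best candidate on the full training set
    search.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, search.predict(X_test))
    return search.best_params_, search.best_score_, test_accuracy


# Synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
best_params_demo, cv_accuracy_demo, test_accuracy_demo = tune_and_evaluate(
    DecisionTreeClassifier(random_state=0),
    {"max_depth": [3, 5, None]},
    X_demo,
    y_demo,
)
print(best_params_demo, cv_accuracy_demo, test_accuracy_demo)
```

Here `best_score_` is the mean validation accuracy of the winning candidate, so it plays the role of the "Mean CV Accuracy" reported for each model in the sections that follow.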

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import multiprocessing
import xgboost as xgb #For parallel gradient boosting

In [2]:
#Dataset loading
df = pd.read_csv('gap_prediction.csv')

#Turning space group into a categorical variable
df["Space Group"] = df["Space Group"].astype('category')

#Building a dict that maps the space groups to unique integers
mapping_dict = dict(zip(df['Space Group'], df['Space Group'].cat.codes))

#Transforms the categorical space group to numbers
df['Space Group'] = df['Space Group'].map(mapping_dict)

#Target: 1 if metal (gap==0); 0 otherwise
y = [1 if gap==0 else 0 for gap in df['gap']]
df.drop(['gap','Material','Unnamed: 0'], axis='columns', inplace=True)
X = df.to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

# Models

## Logistic Regression

In [3]:
# Define the hyperparameters to tune and their possible values
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10],  # Inverse of regularization strength
    'penalty': ['elasticnet', 'l1', 'l2']  # Regularization penalty; 'elasticnet' also needs l1_ratio, which is not set here
}

# Create a Logistic Regression classifier
lr_classifier = LogisticRegression(max_iter=50000,solver='saga')

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(lr_classifier, param_grid, cv=5, scoring='accuracy',n_jobs=-1)
scaler = StandardScaler().fit(X_train)
grid_search.fit(scaler.transform(X_train), y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

25 fits failed out of a total of 75.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "/home/marcsgil/Desktop/MLPhysics/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/marcsgil/Desktop/MLPhysics/lib/python3.11/site-packages/sklearn/base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/marcsgil/Desktop/MLPhysics/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py", line 1179, in fit
    raise ValueError("l1_ratio must be specified when penalty i

{'C': 0.1, 'penalty': 'l2'}
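The 25 failed fits correspond to the five `C` values combined with `'elasticnet'` across the five folds, since that penalty requires `l1_ratio`. One possible way to keep elastic net in the search is to pass the grid as a list of dicts, so that `l1_ratio` is only supplied where it applies. The sketch below uses synthetic data, and the `l1_ratio` values are arbitrary assumptions, not choices made in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# A list of dicts lets each penalty carry only the parameters it needs
param_grid_lr = [
    {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1]},
    {"penalty": ["elasticnet"], "C": [0.01, 0.1, 1], "l1_ratio": [0.25, 0.5, 0.75]},
]

search_lr = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000),
    param_grid_lr,
    cv=5,
    scoring="accuracy",
    error_score="raise",  # surface any failed fit instead of recording nan
)
search_lr.fit(X_demo, y_demo)
print(search_lr.best_params_)
```

The simpler alternative is to drop `'elasticnet'` from the grid altogether, which leaves the selected `{'C': 0.1, 'penalty': 'l2'}` unaffected because the failed candidates were scored as nan anyway.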


In [4]:
# Train the Logistic Regression classifier with the best hyperparameters
best_lr_classifier = LogisticRegression(**best_params, max_iter=10000)
best_lr_classifier.fit(scaler.transform(X_train), y_train)

# Evaluate the model on the test set
y_pred = best_lr_classifier.predict(scaler.transform(X_test))
accuracy_lr = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy_lr)

# Perform Cross-Validation with the best hyperparameters (X_train is passed unscaled here)
cv_scores_lr = cross_val_score(best_lr_classifier, X_train, y_train, cv=5, scoring='accuracy',n_jobs=-1)
print("Cross-Validation Scores:", cv_scores_lr)
print("Mean CV Accuracy:", np.mean(cv_scores_lr))

Test Accuracy: 0.7167182662538699
Cross-Validation Scores: [0.72340426 0.70736434 0.71317829 0.6996124  0.71899225]
Mean CV Accuracy: 0.7125103084281709


## Support Vector Machine

In [5]:
# Define the hyperparameters to tune and their possible values
param_grid = {
    'C': [0.1, 1, 10],              # Regularization parameter
    'kernel': ['linear', 'rbf'],    # Kernel type (linear or radial basis function)
    'gamma': ['scale', 'auto', 0.1]  # Kernel coefficient for 'rbf' kernel
}

# Create an SVM classifier
svm_classifier = SVC()

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(svm_classifier, param_grid, cv=5, scoring='accuracy',n_jobs=-1)
scaler = StandardScaler().fit(X_train)
grid_search.fit(scaler.transform(X_train), y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}


In [6]:
# Train the SVM classifier with the best hyperparameters
best_svm_classifier = SVC(**best_params)
best_svm_classifier.fit(scaler.transform(X_train), y_train)

# Evaluate the model on the test set
y_pred = best_svm_classifier.predict(scaler.transform(X_test))
accuracy_svm = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy_svm)

# Perform Cross-Validation with the best hyperparameters (X_train is passed unscaled here)
cv_scores_svm = cross_val_score(best_svm_classifier, X_train, y_train, cv=5, scoring='accuracy',n_jobs=-1)
print("Cross-Validation Scores:", cv_scores_svm)
print("Mean CV Accuracy:", np.mean(cv_scores_svm))

Test Accuracy: 0.7724458204334366
Cross-Validation Scores: [0.60348162 0.60658915 0.6124031  0.62403101 0.60852713]
Mean CV Accuracy: 0.6110064024710239
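The mean CV accuracy above (0.611) is well below the test accuracy (0.772). A plausible cause is that `cross_val_score` receives the raw `X_train`, whereas the model was fitted and tested on standardized features. One way to keep the two consistent is to wrap the scaler and the classifier in a pipeline, so the scaler is refitted inside each fold. The sketch below assumes synthetic data and reuses the hyperparameters found by the grid search; it does not reproduce the notebook's numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data, not the notebook's dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0  # exaggerate one feature's scale, as raw descriptors often differ in range

# The scaler is fitted on the training part of each fold only,
# then applied to that fold's validation part
svm_pipeline = make_pipeline(
    StandardScaler(), SVC(C=10, gamma="scale", kernel="rbf")
)
cv_scores_scaled = cross_val_score(
    svm_pipeline, X_demo, y_demo, cv=5, scoring="accuracy"
)
print("Mean CV accuracy (scaled inside each fold):", cv_scores_scaled.mean())
```

The same pipeline object could also be passed to `GridSearchCV`, with grid keys prefixed by the step name (for example `svc__C`), which avoids fitting the scaler on data that later serves as a validation fold.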


## Decision Tree

In [7]:
# Define the hyperparameters to tune and their possible values
param_grid = {
    'max_depth': [None, 10, 20, 30],  # Maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]  # Minimum samples required to be at a leaf node
}

# Create a Decision Tree classifier
dt_classifier = DecisionTreeClassifier()

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(dt_classifier, param_grid, cv=5, scoring='accuracy',n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'max_depth': 20, 'min_samples_leaf': 1, 'min_samples_split': 2}


In [8]:
# Train the Decision Tree classifier with the best hyperparameters
best_dt_classifier = DecisionTreeClassifier(**best_params)
best_dt_classifier.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = best_dt_classifier.predict(X_test)
accuracy_dt = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy_dt)

# Perform Cross-Validation with the best hyperparameters
cv_scores_dt = cross_val_score(best_dt_classifier, X_train, y_train, cv=5, scoring='accuracy',n_jobs=-1)
print("Cross-Validation Scores:", cv_scores_dt)
print("Mean CV Accuracy:", np.mean(cv_scores_dt))

Test Accuracy: 0.8390092879256966
Cross-Validation Scores: [0.82978723 0.80426357 0.79457364 0.78875969 0.8003876 ]
Mean CV Accuracy: 0.8035543460333168


## Random Forest

In [9]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    'n_estimators': [100, 200, 300],      # Number of trees in the forest
    'max_depth': [None, 10, 20, 30],     # Maximum depth of each tree
    'min_samples_split': [2, 5, 10],    # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 4]       # Minimum samples required to be at a leaf node
}

# Create a Random Forest classifier
rf_classifier = RandomForestClassifier(n_jobs=-1)

# Use GridSearchCV to find the best combination of hyperparameters
grid_search = GridSearchCV(rf_classifier, param_grid, cv=5, scoring='accuracy',n_jobs=-1)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'max_depth': 20, 'min_samples_leaf': 4, 'min_samples_split': 10, 'n_estimators': 100}


In [10]:
# Train the Random Forest classifier with the best hyperparameters
best_rf_classifier = RandomForestClassifier(n_jobs=-1, **best_params)
best_rf_classifier.fit(X_train, y_train)

# Evaluate the model on the test set
y_pred = best_rf_classifier.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy_rf)

# Perform Cross-Validation with the best hyperparameters
cv_scores_rf = cross_val_score(best_rf_classifier, X_train, y_train, cv=5, scoring='accuracy',n_jobs=-1)
print("Cross-Validation Scores:", cv_scores_rf)
print("Mean CV Accuracy:", np.mean(cv_scores_rf))

Test Accuracy: 0.8173374613003096
Cross-Validation Scores: [0.80464217 0.8120155  0.79651163 0.81395349 0.78488372]
Mean CV Accuracy: 0.802401301485913


## Gradient Boosting

In [11]:
# Parameter Tuning with Cross-Validation
# Define the hyperparameters to tune and their possible values
param_grid = {
    'n_estimators': [50, 100, 200],      # Number of boosting stages to be used
    'learning_rate': [0.1, 0.2, 0.3, 0.4],  # Step size shrinks the contribution of each tree
    'max_depth': [5, 6, 7, 8]              # Maximum depth of each tree
}

# Create a Gradient Boosting classifier
xgb_model = xgb.XGBClassifier(
    n_jobs=multiprocessing.cpu_count() // 2, tree_method="hist"
)

grid_search = GridSearchCV(xgb_model,param_grid,cv=5,scoring='accuracy',n_jobs=2)
grid_search.fit(X_train, y_train)
print(grid_search.best_params_)
best_params = grid_search.best_params_

{'learning_rate': 0.2, 'max_depth': 8, 'n_estimators': 100}


In [12]:
# Train the Gradient Boosting classifier with the best hyperparameters
best_gb_classifier = xgb.XGBClassifier(
    n_jobs=multiprocessing.cpu_count() // 2, tree_method="hist", **best_params)
best_gb_classifier.fit(X_train, y_train,verbose=3)

# Evaluate the model on the test set
y_pred = best_gb_classifier.predict(X_test)
accuracy_gb = accuracy_score(y_test, y_pred)
print("Test Accuracy:", accuracy_gb)

# Perform Cross-Validation with the best hyperparameters
cv_scores_gb = cross_val_score(best_gb_classifier, X_train, y_train, cv=5, scoring='accuracy')
print("Cross-Validation Scores:", cv_scores_gb)
print("Mean CV Accuracy:", np.mean(cv_scores_gb))

Test Accuracy: 0.871517027863777
Cross-Validation Scores: [0.86653772 0.89341085 0.87596899 0.86821705 0.8624031 ]
Mean CV Accuracy: 0.8733075435203095


# Summary

In [13]:
df = pd.DataFrame(columns=['Algorithm', 'Test Accuracy', 'Mean CV Accuracy'])
df.loc[len(df)] = ['Logistic Regression', accuracy_lr, np.mean(cv_scores_lr)]
df.loc[len(df)] = ['Support Vector Machine', accuracy_svm, np.mean(cv_scores_svm)]
df.loc[len(df)] = ['Decision Tree', accuracy_dt, np.mean(cv_scores_dt)]
df.loc[len(df)] = ['Random Forest', accuracy_rf, np.mean(cv_scores_rf)]
df.loc[len(df)] = ['Gradient Boosting', accuracy_gb, np.mean(cv_scores_gb)]
df.sort_values(by='Mean CV Accuracy', ascending=False)

,Algorithm,Test Accuracy,Mean CV Accuracy
4,Gradient Boosting,0.871517,0.873308
2,Decision Tree,0.839009,0.803554
3,Random Forest,0.817337,0.802401
0,Logistic Regression,0.716718,0.71251
1,Support Vector Machine,0.772446,0.611006


Gradient boosting achieved the highest accuracy, both on the test set (0.872) and in cross-validation (0.873). We will use it to classify novel materials.

# Prediction of Novel Materials

In [14]:
random_df = pd.read_csv('gap_prediction_random.csv')

random_df["Space Group"] = random_df["Space Group"].astype('category')
random_df['Space Group'] = random_df['Space Group'].map(mapping_dict)

random_df.drop(['Material','Unnamed: 0'], axis='columns', inplace=True)
X_random = random_df.to_numpy()
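Because `mapping_dict` was built from the training data only, any space group that appears solely in the new file maps to `NaN`. A quick check along the following lines can reveal such cases before prediction. The series and values below are toy stand-ins chosen for illustration, not data from either CSV file.

```python
import pandas as pd

# Toy stand-ins for the training and novel-material tables
train_groups = pd.Series([225, 62, 225, 194], name="Space Group")
novel_groups = pd.Series([62, 194, 12], name="Space Group")  # 12 is unseen in training

# Same construction as mapping_dict in the dataset-loading cell
codes = train_groups.astype("category")
toy_mapping = dict(zip(codes, codes.cat.codes))

# Values missing from the mapping become NaN after map()
mapped = novel_groups.map(toy_mapping)
unseen = novel_groups[mapped.isna()].unique()
print("Unmapped space groups:", unseen)
```

How to treat unseen groups (dropping the rows, or assigning a reserved code) is a modeling decision that the notebook does not specify.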

In [15]:
y_pred = best_gb_classifier.predict(X_random)

In [16]:
y_pred

array([0, 0, 1, ..., 0, 0, 0])

In [17]:
np.sum(y_pred)

2754