In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Preparation

The goal of this case study is to classify the Crystal System of Li-ion batteries using logistic regression, decision tree, random forest, and extra random forest models, followed by K nearest neighbors and a linear support vector machine. The models are then compared using the classification report, confusion matrix, accuracy score and, if applicable, roc_auc_score.

Let's import the required libraries first.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
df = pd.read_csv('/kaggle/input/crystal-system-properties-for-liion-batteries/lithium-ion batteries.csv')
df.head()

In [None]:
df.info()

To prepare the dataset, missing data is visually checked by using a heatmap available through the Seaborn library. 

In [None]:
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
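
A heatmap can hide a handful of missing cells in a large frame, so a numeric count per column is a useful cross-check. The sketch below uses a small made-up frame (`toy` and its values are assumptions for illustration, not rows from the battery dataset); the same `isnull().sum()` call could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the battery data (values are made up).
toy = pd.DataFrame({
    "Nsites": [16, 32, np.nan, 28],
    "Volume": [159.5, np.nan, 222.1, 350.0],
    "Crystal System": ["monoclinic", "orthorhombic", "triclinic", "monoclinic"],
})

# Number of missing entries per column; all zeros would mean nothing to drop.
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```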

Generate and check the pairplot of the dataframe.

In [None]:
sns.pairplot(df,hue='Crystal System',palette='Set2')

The pairplot shows how each pair of variables relates, with points colored by Crystal System. Among these pairs, Nsites and Volume show a clear upward trend: Volume tends to increase as Nsites increases.

Columns that are not needed for building the models (Materials Id, Formula, and Spacegroup) are removed to clean the dataset.

In [None]:
df.drop(['Materials Id','Formula','Spacegroup'],
        axis=1,inplace=True)
df.head()

In [None]:
df.dropna(inplace=True)
df.info()

Generate a frequency distribution of the Crystal Systems, first overall and then split by whether or not each entry has a bandstructure.

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Crystal System',
              data=df,palette='RdBu_r')


In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Crystal System',hue='Has Bandstructure',data=df,palette='RdBu_r')

# Logistic Regression
Logistic Regression is used when the dependent variable or target is categorical. There are different types of logistic regression such as binary, multinomial, and 
ordinal (Swaminathan, 2018). Binary logistic regression is used when the categorical response has only two possible outcomes. Multinomial logistic regression is used when 
there are three or more categories used without ordering. Ordinal logistic regression is used when there are three or more categories with ordering. 

Build a Logistic Regression model. Split the data into a training set and a test set.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(df.drop('Crystal System',axis=1), 
                                                    df['Crystal System'], test_size=0.30, 
                                                    random_state=101)

The test size is set to 0.30, so 30% of the rows are held out for testing and the remaining 70% are used for training. The random state is set to 101 so that the split is reproducible.
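
Because the classes are not equally frequent, it may help to keep the class proportions the same in both parts of the split. The sketch below is an illustration under assumptions: `labels` and `features` are synthetic stand-ins built from the class counts quoted in the Conclusion, and `stratify=` is an optional `train_test_split` argument that the notebook itself does not use.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the 139:128:72 class counts reported in the Conclusion.
labels = pd.Series(
    ["monoclinic"] * 139 + ["orthorhombic"] * 128 + ["triclinic"] * 72,
    name="Crystal System",
)
features = pd.DataFrame({"x": range(len(labels))})  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.30, random_state=101, stratify=labels
)

# Class shares in the training and test parts should be nearly identical.
train_share = y_tr.value_counts(normalize=True)
test_share = y_te.value_counts(normalize=True)
print(train_share.round(2))
print(test_share.round(2))
```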

Train the model.

In [None]:
from sklearn.linear_model import LogisticRegression

logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)

Predict the values for the testing data and print a classification report to obtain the precision, recall and f1-score.

In [None]:
lr_predictions = logmodel.predict(X_test)

from sklearn.metrics import classification_report
print(classification_report(y_test,lr_predictions))

Print the accuracy score.

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score
print(accuracy_score(y_test, lr_predictions))
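
The `roc_auc_score` function is imported above but needs class probabilities rather than predicted labels when there are more than two classes. The following is a minimal sketch on synthetic three-class data (`X_demo`, `y_demo`, and `clf` are illustrative names, not part of the notebook), using the one-vs-rest averaging option.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class data standing in for the battery features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One probability column per class, in the order given by clf.classes_.
proba = clf.predict_proba(X_te)

# One-vs-rest AUC, averaged over the three classes.
auc_ovr = roc_auc_score(y_te, proba, multi_class="ovr")
print(round(auc_ovr, 3))
```

For the notebook's own model, the analogous call would pass `logmodel.predict_proba(X_test)` instead of `lr_predictions`.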

Show the confusion matrix of the prediction.

In [None]:
confusion_matrix(y_test, lr_predictions)

In [None]:
labels = logmodel.classes_  # one shared label order for rows and columns
data = confusion_matrix(y_test, lr_predictions, labels=labels)
df_cm = pd.DataFrame(data, index=labels, columns=labels)
df_cm.index.name = 'Actual'        # rows are the true classes
df_cm.columns.name = 'Predicted'   # columns are the predicted classes
plt.figure(figsize = (10,7))
sns.set(font_scale=1.4)  # for label size
sns.heatmap(df_cm, cmap="Blues", annot=True, fmt='d', annot_kws={"size": 16})

The classification report gives the precision, recall, and F1 score for each Crystal System of the Li-ion batteries, along with the overall accuracy. The confusion matrix shows the counts behind those figures, so precision, recall, F1 score, and accuracy can also be computed from it by hand.
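
As an illustration of how those metrics follow from a confusion matrix, the sketch below works through a hypothetical 3x3 matrix (the counts in `cm` are made up, not results from this notebook). Rows are taken as actual classes and columns as predicted classes, which is the convention `confusion_matrix` uses.

```python
import numpy as np

# Hypothetical 3-class confusion matrix: rows = actual, columns = predicted.
cm = np.array([
    [30,  8,  4],
    [10, 25,  3],
    [ 6,  5, 11],
])

true_pos = np.diag(cm)                  # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)   # column sums: everything predicted as the class
recall = true_pos / cm.sum(axis=1)      # row sums: everything that truly is the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()    # correct predictions over all samples

print(precision.round(2), recall.round(2), f1.round(2), round(accuracy, 2))
```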

# Decision Tree

A Decision Tree can be used to represent decisions and decision making visually and explicitly (Gupta, 2017). The name is taken from the tree-like model of decisions, although the root is drawn at the very top. Starting from the root, each internal node tests a condition and splits the data into branches, and each leaf holds a final decision. In general, Decision Tree algorithms are referred to as Classification and Regression Trees (CART).

Build the Decision Tree model and split the data into a training set and test set.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Crystal System',axis=1)
y = df['Crystal System']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

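The test size is set to 30%, leaving 70% for training. No random_state is passed here, so the split and the scores that follow will differ from run to run.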

Train the model.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)

dtree.tree_.node_count, dtree.tree_.max_depth

After fitting, the `tree_` attribute of the Decision Tree Classifier gives the count of nodes and the maximum depth of the decision tree.

Predict the values for the testing data and generate the classification report to check the precision, recall and f1-score.

In [None]:
dt_predictions = dtree.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

print(classification_report(y_test,dt_predictions))

Generate the confusion matrix and the accuracy score of the prediction.

In [None]:
print(confusion_matrix(y_test,dt_predictions))
print(accuracy_score(y_test, dt_predictions))

In [None]:
from IPython.display import Image
from sklearn.tree import export_graphviz

# Feature names must match the columns the tree was trained on
features = list(X.columns)
features

In [None]:
!pip install Graphviz

In [None]:
!pip install pydotplus

In [None]:
from io import StringIO
from IPython.display import Image, display
import pydotplus

from sklearn.tree import export_graphviz
dot_data = StringIO()

export_graphviz(dtree, out_file=dot_data, feature_names=features,filled=True,rounded=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())

# View the tree image
filename = 'Batteries.png'
graph.write_png(filename)
img = Image(filename=filename)
display(img)

Here we used GraphViz and pydotplus to visualize the structure of the decision tree, including its nodes and depth.

The classification report gives the precision, recall, and F1 score for each Crystal System of the Li-ion batteries, along with the overall accuracy. The confusion matrix shows the counts behind those figures, so precision, recall, F1 score, and accuracy can also be computed from it by hand.

# Random Forest

Random Forest is a supervised learning algorithm. The forest the algorithm builds is an ensemble of decision trees, usually trained with the bagging method (Donges, 2020). Bagging (bootstrap aggregating) trains each model on a random sample of the training data drawn with replacement and combines their predictions, which improves the overall result. A random forest builds multiple decision trees and merges them together to get a more accurate and stable prediction. It can be used for both classification and regression problems.
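
To make the bagging idea concrete, here is a minimal hand-rolled sketch on synthetic data: each tree is trained on a bootstrap sample and the trees vote on the final class. All names (`X_demo`, `votes`, `bagged_pred`) are illustrative assumptions, and this is not how `RandomForestClassifier` is implemented internally; a random forest also restricts each split to a random subset of features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the battery features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

rng = np.random.default_rng(0)
n_trees = 25
votes = np.empty((n_trees, len(X_te)), dtype=int)

for t in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=t).fit(X_tr[idx], y_tr[idx])
    votes[t] = tree.predict(X_te)

# Majority vote across the trees for each test sample.
bagged_pred = np.array([np.bincount(col, minlength=3).argmax() for col in votes.T])
bagged_acc = (bagged_pred == y_te).mean()
print(round(bagged_acc, 3))
```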

Build the Random Forest model and split the data into a training set and test set.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Crystal System',axis=1)
y = df['Crystal System']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Train the random forest model and predict the class of the Crystal Systems.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=200)
rfc.fit(X_train,y_train)
rf_predictions = rfc.predict(X_test)

The test size is set to 30%, leaving 70% for training. The Random Forest Classifier is imported from sklearn and the number of trees (n_estimators) is set to 200.


Generate a classification report to obtain the precision, recall and f1-score of the model.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, roc_auc_score
print(classification_report(y_test,rf_predictions))

Generate the confusion matrix and the accuracy score of the model.

In [None]:
print(confusion_matrix(y_test,rf_predictions))

In [None]:
print(accuracy_score(y_test, rf_predictions))

The classification report gives the precision, recall, and F1 score for each Crystal System of the Li-ion batteries, along with the overall accuracy. The confusion matrix shows the counts behind those figures, so precision, recall, F1 score, and accuracy can also be computed from it by hand.

# Extra Random Forest
Extra Random Forest, also known as Extremely Randomized Trees, is similar to a random forest. In an extra random forest, both the candidate features and the split points are selected at random rather than searched for, which makes it less computationally expensive than a random forest (Ceballos, 2019).

Decision trees show high variance, random forests show medium variance, and extra random forests show low variance.
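
One rough way to look at this ordering is to score the three model types on the same cross-validation folds and compare the spread of the fold scores. The sketch below does this on synthetic data; the data, the fold setup, and the reduced `n_estimators=50` are assumptions chosen to keep the example quick, and the fold-to-fold standard deviation is only a loose proxy for model variance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data standing in for the battery features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)

# The same folds are reused for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "extra trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv)
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```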

Build the Extra Random Forest model and split the data into a training set and test set.

In [None]:
from sklearn.model_selection import train_test_split

X = df.drop('Crystal System',axis=1)
y = df['Crystal System']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)


Train the model and predict the values for the testing data.

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=200)
etc.fit(X_train,y_train)
erf_predictions=etc.predict(X_test)


The test size is set to 30%, leaving 70% for training. The Extra Trees Classifier is imported from sklearn and the number of trees (n_estimators) is set to 200.

Generate the classification report to obtain the precision, recall and f1-score of the model.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, roc_auc_score
print(classification_report(y_test,erf_predictions))


Generate the confusion matrix and accuracy score of the model.

In [None]:
print(confusion_matrix(y_test,erf_predictions))

In [None]:
print(accuracy_score(y_test, erf_predictions))

The classification report gives the precision, recall, and F1 score for each Crystal System of the Li-ion batteries, along with the overall accuracy. The confusion matrix shows the counts behind those figures, so precision, recall, F1 score, and accuracy can also be computed from it by hand.

# K Nearest Neighbors (KNN)

KNN is a simple algorithm that stores all available cases and predicts the target of a new case based on a similarity measure (Muhajir, 2019). For classification, the new case is assigned the most common class among its k nearest stored cases.

The support vector machine, used in the final modeling section, takes a different approach: its objective is to find a hyperplane in an N-dimensional space that distinctly classifies the data points (Gandhi, 2018). To separate the two classes of data points, there are many possible hyperplanes that could be chosen. The goal is to find a plane that has a maximum margin.
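
As a minimal sketch of the KNN idea of predicting from the most similar stored cases, the function below measures Euclidean distance to every stored point and takes a majority vote among the k closest. The function name `knn_predict` and the toy points are hypothetical; scikit-learn's `KNeighborsClassifier` is the implementation actually used in this notebook.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Return the majority class among the k training points closest to x_new."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distance to each stored case
    nearest = np.argsort(dists)[:k]                  # indices of the k smallest distances
    return Counter(y_train[nearest]).most_common(1)[0][0]


# Toy 2-D points with made-up labels.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
y_toy = np.array(["monoclinic", "monoclinic", "triclinic", "triclinic", "triclinic"])

pred_a = knn_predict(X_toy, y_toy, np.array([0.05, 0.1]), k=3)
pred_b = knn_predict(X_toy, y_toy, np.array([1.0, 0.9]), k=3)
print(pred_a, pred_b)
```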

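Build the K Nearest Neighbors model. Because KNN relies on distances between points, the features are standardized first, and the scaled data is then split into a training set and test set.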

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()

In [None]:
scaler.fit(df.drop('Crystal System',axis=1))

In [None]:
scaled_features = scaler.transform(df.drop('Crystal System',axis=1))

In [None]:
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
df_feat.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df['Crystal System'],
                                                    test_size=0.30)
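
In the cells above the scaler is fitted on the full dataset before the split, so statistics from the test rows influence the scaling. A common alternative, sketched here on synthetic data under the assumption that a scikit-learn `Pipeline` is acceptable, is to split first and let the pipeline fit the scaler on the training rows only. The names `X_demo` and `knn_pipe` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the battery features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

# Scaling and KNN are chained, so fit() learns the scaling from X_tr only.
knn_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_pipe.fit(X_tr, y_tr)

# score() applies the training-set scaling to X_te before predicting.
pipe_acc = knn_pipe.score(X_te, y_te)
print(round(pipe_acc, 3))
```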

In [None]:
from sklearn.neighbors import KNeighborsClassifier

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)

In [None]:
knn.fit(X_train,y_train)

In [None]:
pred = knn.predict(X_test)

Evaluate the KNN model.

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score

In [None]:
print(confusion_matrix(y_test,pred))

In [None]:
print(classification_report(y_test,pred))

Choose a K value. To pick a good value of K, fit the model for each K from 1 to 39 and record the error rate on the test set.

In [None]:
error_rate = []


for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))

Create a visualization to compare the error rate and k value.

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
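
The loop above picks K by looking at the test set, which can make the final test score optimistic. A hedged alternative is to choose K by cross-validation on the training data only and keep the test set for one final check. The sketch below uses synthetic data; the step names `scale` and `knn` and the 5-fold setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 3-class data standing in for the battery features.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Try K = 1..39 with 5-fold cross-validation on the training rows only.
search = GridSearchCV(pipe, {"knn__n_neighbors": list(range(1, 40))}, cv=5)
search.fit(X_tr, y_tr)

best_k = search.best_params_["knn__n_neighbors"]
test_acc = search.score(X_te, y_te)  # single final check on held-out data
print(best_k, round(test_acc, 3))
```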

In [None]:
# K = 1
knn = KNeighborsClassifier(n_neighbors=1)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=1')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

In [None]:
#K = 23
knn = KNeighborsClassifier(n_neighbors=23)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=23')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))

Fit the K-nearest neighbors model again with n_neighbors=3 but this time use distance for the weights. Calculate the accuracy using accuracy_score.

In [None]:
knn = KNeighborsClassifier(n_neighbors=3, weights='distance')

knn = knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))

Fit another K-nearest neighbors model. This time use uniform weights but set the power parameter for the Minkowski distance metric to be 1 (p=1) i.e. Manhattan Distance.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5, p=1)

knn = knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

print(accuracy_score(y_test, y_pred))

Fit a K-nearest neighbors model using values of k (n_neighbors) ranging from 1 to 29. Use uniform weights (the default). The power parameter for the Minkowski distance (p) can be set to either 1 or 2; just be consistent. Store the accuracy and the value of k used from each of these fits in a list or dictionary. Plot (or view the table of) the accuracy vs. k. What do you notice happens when k=1?

In [None]:
# Fit the K-nearest neighbors model with different values of k
# Store the accuracy measurement for each k

score_list = list()

for k in range(1, 30):
    
    knn = KNeighborsClassifier(n_neighbors=k)
    knn = knn.fit(X_train, y_train)
    
    y_pred = knn.predict(X_test)
    score = accuracy_score(y_test, y_pred)
    
    score_list.append((k, score))
    
score_df = pd.DataFrame(score_list, columns=['k', 'accuracy'])

In [None]:
sns.set_context('talk')
sns.set_style('ticks')
sns.set_palette('dark')

ax = score_df.set_index('k').plot()

ax.set(xlabel='k', ylabel='accuracy')
ax.set_xticks(range(1, 30));

# Support Vector Machines
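Build the Support Vector Machines model. Here the target is simplified to a binary label (monoclinic or not), and a linear SVM is fitted on the two features most strongly correlated with that label. The data is not split into a training set and test set in this section.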

In [None]:
df

In [None]:
df.describe()

In [None]:
y = (df['Crystal System'] == 'monoclinic').astype(int)
fields = list(df.columns[:-1])
correlations = df[fields].corrwith(y)
correlations.sort_values(inplace=True)
correlations

Create a pairplot for the dataset.

In [None]:
sns.set_context('talk')
sns.set_palette('Paired')
sns.set_style('white')

sns.pairplot(df, hue='Crystal System')

Create a bar plot showing the correlations between each column and y, where y is 1 for monoclinic and 0 otherwise.

In [None]:
ax = correlations.plot(kind='bar')
ax.set(ylim=[-1, 1], ylabel='correlation');

In [None]:
from sklearn.preprocessing import MinMaxScaler

fields = correlations.map(abs).sort_values().iloc[-2:].index
print(fields)
X = df[fields]
scaler = MinMaxScaler()
X = scaler.fit_transform(X)
X = pd.DataFrame(X, columns=['%s_scaled' % fld for fld in fields])
print(X.columns)

In [None]:
from sklearn.svm import LinearSVC

LSVC = LinearSVC()
LSVC.fit(X, y)

X_color = X.sample(300, random_state=45)
y_color = y.loc[X_color.index]
y_color = y_color.map(lambda r: 'red' if r == 1 else 'yellow')
ax = plt.axes()
ax.scatter(
    X_color.iloc[:, 0], X_color.iloc[:, 1],
    color=y_color, alpha=1)
# -----------
x_axis, y_axis = np.arange(0, 1.005, .005), np.arange(0, 1.005, .005)
xx, yy = np.meshgrid(x_axis, y_axis)
xx_ravel = xx.ravel()
yy_ravel = yy.ravel()
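# Use the same column names as the training data so predict() sees matching feature names
X_grid = pd.DataFrame({X.columns[0]: xx_ravel, X.columns[1]: yy_ravel})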
y_grid_predictions = LSVC.predict(X_grid)
y_grid_predictions = y_grid_predictions.reshape(xx.shape)
ax.contourf(xx, yy, y_grid_predictions, cmap=plt.cm.autumn_r, alpha=.3)
# -----------
ax.set(
    xlabel=fields[0],
    ylabel=fields[1],
    xlim=[0, 1],
    ylim=[0, 1],
    title='decision boundary for LinearSVC');

# Conclusion
Since there are three crystal systems that the Li-ion batteries can be categorized into, this is a multiclass classification problem. Multiclass classification assumes that each sample is assigned exactly one label: a battery is either monoclinic, orthorhombic, or triclinic, and cannot belong to more than one class at the same time. The dataset is also imbalanced: there are 139 monoclinic batteries, 128 orthorhombic batteries, and 72 triclinic batteries, a ratio of roughly 41:38:21. Because the dataset is biased towards the monoclinic batteries, the model overfits on that class label and predicts it with high accuracy, leaving the orthorhombic class with medium accuracy and the triclinic class with the lowest accuracy.

Comparing the resulting accuracy, precision, recall, and F1 score of the models, the random forest produced the best results, followed by the extra random forest, logistic regression, KNN, and the decision tree. This suggests that while the decision tree produced the fastest results, it suffered from overfitting, whereas the random forest, which produced the slowest results, avoided overfitting by building its trees on random subsets of the data.

# Recommendations
Because the dataset used for the classification problem was imbalanced, it is recommended to re-sample the dataset so that the classes are balanced and the features are standardized. Training on a balanced dataset may improve the resulting metrics, particularly for the smaller classes.
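
One simple way to re-sample is to oversample the smaller classes with replacement until every class matches the largest one. The sketch below shows this with `sklearn.utils.resample` on a made-up training frame (`train` and its values are hypothetical). Resampling would normally be applied to the training split only, so that the test set keeps the original class distribution.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up imbalanced training frame (10 : 7 : 3).
train = pd.DataFrame({
    "feature": range(20),
    "Crystal System": ["monoclinic"] * 10 + ["orthorhombic"] * 7 + ["triclinic"] * 3,
})

# Size of the largest class; every other class is oversampled up to this size.
target_size = train["Crystal System"].value_counts().max()

balanced = pd.concat([
    group if len(group) == target_size
    else resample(group, replace=True, n_samples=target_size, random_state=0)
    for _, group in train.groupby("Crystal System")
])

print(balanced["Crystal System"].value_counts())
```

An alternative that avoids duplicating rows is the `class_weight='balanced'` option accepted by the logistic regression and tree-based classifiers in scikit-learn.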

# References

Swaminathan, S. (2018, March 15). Logistic Regression Detailed Overview. Retrieved from Towards Data Science: https://towardsdatascience.com/logistic-regression-detailedoverview-46c4da4303b

Gupta, P. (2017, May 18). Decision Trees in Machine Learning. Retrieved from Towards Data Science: https://towardsdatascience.com/decision-trees-in-machinelearning-641b9c4e8052

Donges, N. (2020, September 3). A Complete Guide to the Random Forest Algorithm. Retrieved from Built In: https://builtin.com/data-science/random-forest-algorithm

Ceballos, F. (2019, July 14). An intuitive explanation of random forest and extra trees classifiers. Retrieved from Towards Data Science: https://towardsdatascience.com/an-intuitiveexplanation-of-random-forest-and-extra-trees-classifiers8507ac21d54b

Gandhi, R. (2018, May 27). Introduction to Machine Learning Algorithms: Linear Regression. Retrieved from Towards Data Science: https://towardsdatascience.com/introductionto-machine-learning-algorithms-linear-regression14c4e325882a

Muhajir, I. (2019, April 20). K-Neighbors Regression Analysis in Python. Retrieved from Analytics Vidhya: https://medium.com/analytics-vidhya/k-neighbors-regressionanalysis-in-python-61532d56d8e4