<a href="https://colab.research.google.com/github/mcpikep/Student-ML-Exercise/blob/main/Bond_Classification_Student_Exercise.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first step is to import all of the tools needed to complete this exercise. These tools let us import, manipulate, and display data, and use and analyze machine learning algorithms.

In [None]:
!python --version

In [None]:

# pip is a command-line tool that allows you to install and manage Python packages
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
import io #Used to wrap the uploaded bytes in a file-like object
from google.colab import files #Used to upload files into Colab
import math #Standard-library math functions such as exp
import pandas as pd #Used for data manipulation

#Data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

#Machine learning packages
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, mean_absolute_error, r2_score

Now let's import the data we will be using.

In [None]:
uploaded = files.upload()

Saving elements_EN.xlsx to elements_EN.xlsx


In [None]:
from google.colab import drive
drive.mount('/content/drive')

The data in the Excel file needs to be converted into a Pandas DataFrame to be used in Python.

In [None]:
df = pd.read_excel(io.BytesIO(uploaded['elements_EN.xlsx']))
df

Unnamed: 0,Atomic Number,Symbol,Electronegativity (Pauling Scale),GroupBlock
0,1,H,2.20,Nonmetal
1,2,He,,Noble gas
2,3,Li,0.98,Alkali metal
3,4,Be,1.57,Alkaline earth metal
4,5,B,2.04,Metalloid
...,...,...,...,...
113,114,Fl,,Post-transition metal
114,115,Mc,,Post-transition metal
115,116,Lv,,Post-transition metal
116,117,Ts,,Halogen


Notice that the Helium row has no Electronegativity value. Let's check the DataFrame info.

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118 entries, 0 to 117
Data columns (total 4 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Atomic Number                      118 non-null    int64  
 1   Symbol                             118 non-null    object 
 2   Electronegativity (Pauling Scale)  95 non-null     float64
 3   GroupBlock                         118 non-null    object 
dtypes: float64(1), int64(1), object(2)
memory usage: 3.8+ KB


Although the dataset contains information for 118 elements, only 95 of them have the relevant EN data. We need to drop the rows with missing values.

In [None]:
df = df.dropna()
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 95 entries, 0 to 102
Data columns (total 4 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Atomic Number                      95 non-null     int64  
 1   Symbol                             95 non-null     object 
 2   Electronegativity (Pauling Scale)  95 non-null     float64
 3   GroupBlock                         95 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 3.7+ KB


Now every column has the same number of non-null entries, and the rows with missing data are gone. Note that the index still runs from 0 to 102, because dropna() keeps the original row labels.

Next, we need to calculate the differences in EN for each hypothetical diatomic compound.

In [None]:
en_diff = []
for en in df["Electronegativity (Pauling Scale)"]:
    for en_1 in df["Electronegativity (Pauling Scale)"]:
        delta_en = abs(en - en_1)
        en_diff.append(delta_en)

print(en_diff)

[0.0, 1.2200000000000002, 0.6300000000000001, 0.16000000000000014, 0.34999999999999964, 0.8399999999999999, 1.2399999999999998, 1.7799999999999998, 1.27, 0.8900000000000001, 0.5900000000000001, 0.30000000000000027, 0.010000000000000231, 0.3799999999999999, 0.96, 1.3800000000000003, 1.2000000000000002, 0.8400000000000001, 0.6600000000000001, 0.5700000000000003, 0.5400000000000003, 0.6500000000000001, 0.3700000000000001, 0.3200000000000003, 0.29000000000000026, 0.30000000000000027, 0.5500000000000003, 0.3900000000000001, 0.1900000000000004, 0.020000000000000018, 0.34999999999999964, 0.7599999999999998, 0.7999999999999998, 1.3800000000000003, 1.2500000000000002, 0.9800000000000002, 0.8700000000000001, 0.6000000000000001, 0.040000000000000036, 0.30000000000000027, 0.0, 0.07999999999999963, 0.0, 0.27000000000000024, 0.5100000000000002, 0.42000000000000015, 0.2400000000000002, 0.15000000000000036, 0.10000000000000009, 0.45999999999999996, 0.3999999999999999, 1.4100000000000001, 1.31, 1.1, 1.
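The nested loops visit every ordered pair of elements, so each element is paired with itself and every mixed pair appears twice (for example HLi and LiH). With 95 elements that gives 95 × 95 = 9025 rows. The following is a minimal sketch of that counting on a three-element subset; the `en` dictionary and the variable names are illustrative and not part of the notebook.

```python
from itertools import combinations_with_replacement, product

# Illustrative subset; values are the Pauling electronegativities shown above.
en = {"H": 2.20, "Li": 0.98, "Be": 1.57}

# Ordered pairs, equivalent to the two nested for-loops: N * N rows.
ordered = [(a + b, abs(en[a] - en[b])) for a, b in product(en, repeat=2)]

# Unordered pairs (each compound once, self-pairs kept): N * (N + 1) / 2 rows.
unordered = [
    (a + b, abs(en[a] - en[b]))
    for a, b in combinations_with_replacement(en, 2)
]

print(len(ordered), len(unordered))
```

Whether the duplicated pairs matter depends on the goal: they do not change which bond type a pair receives, but they do mean the same compound can land in both the training and the validation set later on.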

Now, calculate the average of the electronegativity of the two atoms.

In [None]:
en_avg = []
for en in df["Electronegativity (Pauling Scale)"]:
  for en_1 in df["Electronegativity (Pauling Scale)"]:
    avg = (en + en_1)/2
    en_avg.append(avg)

print(en_avg)

[2.2, 1.59, 1.8850000000000002, 2.12, 2.375, 2.62, 2.8200000000000003, 3.09, 1.5650000000000002, 1.7550000000000001, 1.9050000000000002, 2.05, 2.1950000000000003, 2.39, 2.68, 1.51, 1.6, 1.7800000000000002, 1.87, 1.915, 1.9300000000000002, 1.875, 2.015, 2.04, 2.055, 2.05, 1.925, 2.005, 2.105, 2.1900000000000004, 2.375, 2.58, 2.6, 1.51, 1.5750000000000002, 1.71, 1.7650000000000001, 1.9000000000000001, 2.18, 2.05, 2.2, 2.24, 2.2, 2.065, 1.945, 1.9900000000000002, 2.08, 2.125, 2.1500000000000004, 2.43, 2.4000000000000004, 1.495, 1.5450000000000002, 1.6500000000000001, 1.6600000000000001, 1.665, 1.67, 1.685, 1.7000000000000002, 1.71, 1.715, 1.7200000000000002, 1.725, 1.735, 1.75, 1.85, 2.2800000000000002, 2.05, 2.2, 2.2, 2.24, 2.37, 2.1, 1.9100000000000001, 2.265, 2.1100000000000003, 2.1, 2.2, 1.4500000000000002, 1.55, 1.6500000000000001, 1.75, 1.85, 1.79, 1.7800000000000002, 1.7400000000000002, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.75, 1.59, 0.98, 1.275, 1.51, 1.765, 2.01, 2.21

Now, we need to label the compounds by combining their chemical symbols. We can do the same with the types of elements.

In [None]:
diatomics = []
for element in df["Symbol"]:
    for element_1 in df["Symbol"]:
        molecule = element + element_1
        diatomics.append(molecule)

print(diatomics)

['HH', 'HLi', 'HBe', 'HB', 'HC', 'HN', 'HO', 'HF', 'HNa', 'HMg', 'HAl', 'HSi', 'HP', 'HS', 'HCl', 'HK', 'HCa', 'HSc', 'HTi', 'HV', 'HCr', 'HMn', 'HFe', 'HCo', 'HNi', 'HCu', 'HZn', 'HGa', 'HGe', 'HAs', 'HSe', 'HBr', 'HKr', 'HRb', 'HSr', 'HY', 'HZr', 'HNb', 'HMo', 'HTc', 'HRu', 'HRh', 'HPd', 'HAg', 'HCd', 'HIn', 'HSn', 'HSb', 'HTe', 'HI', 'HKe', 'HCs', 'HBa', 'HLa', 'HCe', 'HPr', 'HNd', 'HSm', 'HGd', 'HDy', 'HHo', 'HEr', 'HTm', 'HLu', 'HHf', 'HTa', 'HW', 'HRe', 'HOs', 'HIr', 'HPt', 'HAu', 'HHg', 'HTl', 'HPb', 'HBi', 'HPo', 'HAt', 'HFr', 'HRa', 'HAc', 'HTh', 'HPa', 'HU', 'HNp', 'HPu', 'HAm', 'HCm', 'HBk', 'HCf', 'HEs', 'HFm', 'HMd', 'HNo', 'HLr', 'LiH', 'LiLi', 'LiBe', 'LiB', 'LiC', 'LiN', 'LiO', 'LiF', 'LiNa', 'LiMg', 'LiAl', 'LiSi', 'LiP', 'LiS', 'LiCl', 'LiK', 'LiCa', 'LiSc', 'LiTi', 'LiV', 'LiCr', 'LiMn', 'LiFe', 'LiCo', 'LiNi', 'LiCu', 'LiZn', 'LiGa', 'LiGe', 'LiAs', 'LiSe', 'LiBr', 'LiKr', 'LiRb', 'LiSr', 'LiY', 'LiZr', 'LiNb', 'LiMo', 'LiTc', 'LiRu', 'LiRh', 'LiPd', 'LiAg', 'LiCd',

In [None]:
diatomic_types = []
for type_1 in df["GroupBlock"]:
    for type_2 in df["GroupBlock"]:
        types = type_1 + type_2
        diatomic_types.append(types)

print(diatomic_types)

['NonmetalNonmetal', 'NonmetalAlkali metal', 'NonmetalAlkaline earth metal', 'NonmetalMetalloid', 'NonmetalNonmetal', 'NonmetalNonmetal', 'NonmetalNonmetal', 'NonmetalHalogen', 'NonmetalAlkali metal', 'NonmetalAlkaline earth metal', 'NonmetalPost-transition metal', 'NonmetalMetalloid', 'NonmetalNonmetal', 'NonmetalNonmetal', 'NonmetalHalogen', 'NonmetalAlkali metal', 'NonmetalAlkaline earth metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalPost-transition metal', 'NonmetalMetalloid', 'NonmetalMetalloid', 'NonmetalNonmetal', 'NonmetalHalogen', 'NonmetalNoble gas', 'NonmetalAlkali metal', 'NonmetalAlkaline earth metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransition metal', 'NonmetalTransi

In [None]:
dataset = {'Molecule': diatomics, 'Delta EN': en_diff, 'Average EN': en_avg,'Atom Types': diatomic_types}
data = pd.DataFrame(dataset)
data

Unnamed: 0,Molecule,Delta EN,Average EN,Atom Types
0,HH,0.00,2.20,NonmetalNonmetal
1,HLi,1.22,1.59,NonmetalAlkali metal
2,HBe,0.63,1.89,NonmetalAlkaline earth metal
3,HB,0.16,2.12,NonmetalMetalloid
4,HC,0.35,2.38,NonmetalNonmetal
...,...,...,...,...
9020,LrEs,0.00,1.30,ActinideActinide
9021,LrFm,0.00,1.30,ActinideActinide
9022,LrMd,0.00,1.30,ActinideActinide
9023,LrNo,0.00,1.30,ActinideActinide


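Now, we can classify the bond type using the known rules. Ionic compounds have a difference greater than or equal to 2.0. Any compound with a difference between 0.4 and 2.0 is considered polar covalent, while anything below 0.4 is nonpolar covalent. Note that the classification code in this notebook does not apply these thresholds directly; it draws two straight-line boundaries in the (Average EN, Delta EN) plane and assigns the labels Ionic, Covalent, Metalloid, and Metallic.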
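The Delta EN rule of thumb can be sketched as a small function. This is an illustration only: `bond_from_delta_en` is a hypothetical helper, and treating a difference of exactly 0.4 as polar covalent is an assumption, since the rule does not say which side that boundary belongs to.

```python
def bond_from_delta_en(delta_en):
    """Classify a bond from the electronegativity difference alone.

    Hypothetical helper: thresholds follow the rule of thumb in the text.
    Assumption: a difference of exactly 0.4 counts as polar covalent.
    """
    if delta_en >= 2.0:
        return "Ionic"
    if delta_en >= 0.4:
        return "Polar Covalent"
    return "Nonpolar Covalent"


# Delta EN values taken from the table above.
for label, delta in [("HH", 0.00), ("HC", 0.35), ("HLi", 1.22)]:
    print(label, bond_from_delta_en(delta))
```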

In [None]:
bond_type = []
for d_en, av_en in zip(data['Delta EN'], data['Average EN']):
    rising = 2 * av_en - 2.76    # boundary that increases with Average EN
    falling = -2 * av_en + 4.4   # boundary that decreases with Average EN

    if d_en >= rising and d_en >= falling:
        bond = "Ionic"
    elif rising >= d_en >= falling:
        bond = "Covalent"
    elif rising > d_en and falling > d_en:
        bond = "Metalloid"
    else:
        bond = "Metallic"

    bond_type.append(bond)

print(bond_type)
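To see what the two straight-line boundaries do, the same comparisons can be wrapped in a function and evaluated for a few pairs. This is a sketch only: `classify_bond` is a hypothetical name, and the `en` dictionary holds a handful of Pauling values (H 2.20, Li 0.98, Na 0.93, F 3.98) typed in by hand rather than read from the DataFrame.

```python
def classify_bond(d_en, av_en):
    """Apply the same four-way comparison as the loop above (hypothetical helper)."""
    rising = 2 * av_en - 2.76    # boundary that increases with Average EN
    falling = -2 * av_en + 4.4   # boundary that decreases with Average EN
    if d_en >= rising and d_en >= falling:
        return "Ionic"
    if rising >= d_en >= falling:
        return "Covalent"
    if rising > d_en and falling > d_en:
        return "Metalloid"
    return "Metallic"


en = {"H": 2.20, "Li": 0.98, "Na": 0.93, "F": 3.98}
for a, b in [("Li", "F"), ("F", "F"), ("Na", "Na"), ("H", "F")]:
    d_en = abs(en[a] - en[b])
    av_en = (en[a] + en[b]) / 2
    print(a + b, classify_bond(d_en, av_en))
```

Pairs that sit exactly on a boundary line are sensitive to floating-point rounding, so their label can depend on the last digit of the computed values.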

Now let's put all this together in a DataFrame.

In [None]:
data["Bond Types"] = bond_type
data

Google Colab AI can suggest charts for a DataFrame. For example, the first two show the frequency of Average EN and Delta EN values, which is useful for checking that our classifications make sense. The third chart clues us in on a potential problem for our exercise!

In [None]:
# @title Bond Type

from matplotlib import pyplot as plt
import seaborn as sns
data.groupby('Bond Types').size().plot(kind='barh', color=sns.palettes.mpl_palette('Dark2'))
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
sns.scatterplot(x = 'Average EN', y = 'Delta EN', hue = 'Bond Types', data = data)

Notice the large number of compounds that have a low Delta EN and a low Average EN. These represent the transition metal compounds. They need to be reclassified as Metallic bonds. Likewise, you should notice that the hydride compounds are mislabeled; they should be "Ionic", except for HC and HH, which are "Nonpolar Covalent".

Now, let's split the data into training and testing sets.

In [None]:
y = data['Bond Types']
features = ['Delta EN', 'Average EN']
X = data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)
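When no size is given, `train_test_split` holds out 25% of the rows, and `random_state=0` makes the shuffle repeatable. The idea can be illustrated in plain Python. This is not scikit-learn's implementation, and `simple_split` is a hypothetical helper written only for this sketch.

```python
import random


def simple_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed, then cut off a test portion."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(20))  # stand-in for 20 DataFrame rows
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))
```

Because the seed is fixed, calling the function again returns the same split, which is what makes results comparable between runs.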

We can use lazypredict to test several different types of algorithms at once and evaluate which one performs the best.

In [None]:
lazy_clf = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lazy_clf.fit(train_X, val_X, train_y, val_y)

clf_results = pd.DataFrame(models)
clf_results

The decision tree was one of the best performers, with high accuracy and a short run time. Let's use one and evaluate it further.

In [None]:
from pprint import pprint

bonding_tree = DecisionTreeClassifier()
bonding_tree.fit(train_X, train_y)
tree_predictions = bonding_tree.predict(val_X)

# Per-class precision, recall, and F1, followed by the confusion matrix
report = classification_report(val_y, tree_predictions)
print(report)
pprint(confusion_matrix(val_y, tree_predictions))
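A confusion matrix counts how often each actual class was predicted as each class: rows are the actual labels, columns are the predicted labels, and the diagonal holds the correct predictions. The sketch below builds one by hand from made-up labels (not the notebook's results) to show how the accuracy follows from the diagonal.

```python
from collections import Counter

# Made-up labels for illustration only.
actual = ["Ionic", "Ionic", "Covalent", "Metallic", "Covalent", "Metallic"]
predicted = ["Ionic", "Covalent", "Covalent", "Metallic", "Covalent", "Ionic"]

labels = sorted(set(actual) | set(predicted))
counts = Counter(zip(actual, predicted))

# Rows are actual classes, columns are predicted classes.
matrix = [[counts[(a, p)] for p in labels] for a in labels]

# Accuracy is the share of predictions that fall on the diagonal.
accuracy = sum(counts[(c, c)] for c in labels) / len(actual)

for label, row in zip(labels, matrix):
    print(f"{label:>9}", row)
print("accuracy:", accuracy)
```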

In [None]:
sns.scatterplot(x = 'Average EN', y = 'Delta EN', hue = tree_predictions, data = val_X)

In [None]:
plt.figure(figsize=(40, 40))
plot_tree(bonding_tree, filled=True, feature_names=features)  # same column names and order used for training
plt.show()

In [None]:
ionic_character = []
for value in en_diff:
    percent = (1 - math.exp(-.25*(value**2))) * 100
    ionic_character.append(percent)
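The loop above applies the expression percent = (1 - exp(-0.25 × ΔEN²)) × 100, which is 0% for identical atoms and approaches 100% as the difference grows. A small sketch, with the hypothetical helper name `percent_ionic`, evaluates it for a few Delta EN values; the HF value of 1.78 is the eighth entry of the printed `en_diff` list.

```python
import math


def percent_ionic(delta_en):
    """Percent ionic character from the EN difference (same expression as above)."""
    return (1 - math.exp(-0.25 * delta_en ** 2)) * 100


for label, delta in [("HH", 0.00), ("HC", 0.35), ("HLi", 1.22), ("HF", 1.78)]:
    print(f"{label}: {percent_ionic(delta):.1f}%")
```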

In [None]:
# Add the new column in place so the existing "Bond Types" column is kept
data['Ionic %'] = ionic_character

data

In [None]:
data.plot(kind='scatter', x='Average EN', y='Delta EN', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
data.plot(kind='scatter', x='Average EN', y='Ionic %', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
y = data['Ionic %']
features = ['Delta EN', 'Average EN']
X = data[features]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=0)

In [None]:
lazy_reg = LazyRegressor(verbose=0,ignore_warnings=True, custom_metric=None)
models,predictions = lazy_reg.fit(train_X, val_X, train_y, val_y)

reg_results = pd.DataFrame(models)

In [None]:
reg_results

One of the top performers was the K-nearest neighbor model. Let's try running one.

In [None]:
neigh = KNeighborsRegressor()
neigh.fit(train_X, train_y)
neigh_prediction = neigh.predict(val_X)

print(mean_absolute_error(val_y, neigh_prediction))
print(r2_score(val_y, neigh_prediction))
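The two numbers printed above are the mean absolute error (the average size of the prediction errors, in the same units as the target) and R² (the share of the target's variance that the predictions explain, with 1 being a perfect fit). As a sketch of what they measure, both can be computed by hand; the helper names and the sample values are made up for illustration.

```python
def mae(actual, predicted):
    """Mean absolute error: average size of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r2(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up target values and predictions, for illustration only.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 39.0]
print(mae(actual, predicted), r2(actual, predicted))
```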

We can check which parameters the learning algorithm uses when making predictions.

In [None]:
neigh.get_params()

One of the most useful parameters of the K-nearest neighbor model is n_neighbors. It tells the algorithm how many neighbors to use when estimating the target value. GridSearchCV tries each candidate value with cross-validation: it splits the training data into a given number of subsets (here cv=3) and evaluates the model using each subset in turn as the validation set.

In [None]:
mod = GridSearchCV(estimator=neigh,
                   param_grid={'n_neighbors': [1, 2, 5, 10, 15, 20]},
                   cv=3)

In [None]:
mod.fit(train_X, train_y)
pd.DataFrame(mod.cv_results_)
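The search above can be pictured as two nested steps: for each candidate number of neighbors, hold out each fold in turn, predict it from the remaining folds, and average the error. The plain-Python sketch below shows that idea under simplifying assumptions: one feature instead of two, made-up data, interleaved folds, and mean absolute error as the score. It is not how scikit-learn implements `GridSearchCV`, which for a regressor ranks candidates by the estimator's default score (R²) unless `scoring` is set. All function names here are hypothetical.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(target for _, target in nearest) / k


def cv_mean_abs_error(data, k, n_folds=3):
    """Mean absolute error of k-NN, averaged over n_folds held-out subsets."""
    fold_errors = []
    for fold in range(n_folds):
        # Hold out every n-th point starting at `fold`; train on the rest.
        held_out = data[fold::n_folds]
        training = [p for i, p in enumerate(data) if i % n_folds != fold]
        errors = [abs(knn_predict(training, x, k) - y) for x, y in held_out]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds


# Made-up 1-D data: the target is roughly twice the feature.
data = [(x, 2 * x + (0.5 if x % 2 else -0.5)) for x in range(12)]

# Try each candidate k and keep the one with the lowest cross-validated error.
scores = {k: cv_mean_abs_error(data, k) for k in [1, 2, 5]}
best_k = min(scores, key=scores.get)
print(scores, best_k)
```

In the notebook, the corresponding information is in the `mean_test_score` and `rank_test_score` columns of `mod.cv_results_`, and the winning setting is available as `mod.best_params_`.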