# Predicting Tensile Strengths of Alloys Using Machine Learning

This notebook models the tensile strength of alloys as a function of their composition, using data on over 350 alloys. It builds features from the composition of each entry, trains a random forest regressor, and then uses the trained model to predict the tensile strengths of newly generated compositions.

The alloy data are taken from [this Nickel Institute document](https://www.nickelinstitute.org/~/media/files/technicalliterature/propertiesofsomemetalsandalloys_297_.pdf).

In [1]:
#Before loading the dataset, we import the required libraries and packages

import pandas as pd         #pandas provides easy-to-use data structures and data analysis tools
import numpy as np          #numpy provides numerical tools
from itertools import combinations                     #used to generate new element combinations
from sklearn.model_selection import train_test_split   #used to create a test set for checking accuracy
from sklearn.ensemble import RandomForestRegressor     #the randomforest used for training the model
from sklearn.metrics import mean_absolute_error as mae #This is the metric we use to check error/accuracy

## Loading the dataset using pandas from the local folder

In [2]:
file_location = r'C:\Users\hchintada\Downloads\TS_Project\finaldata_alloy_comp.csv'
#change file_location in the above line accordingly.
# ex: file_location = r'C:\Users\Sushma\TSproject\finaldata_alloy_comp.csv'
data = pd.read_csv(file_location) #loads the data into a pandas dataframe

A brief look at the loaded data: the `head()` method returns the first 5 rows of the dataset.

In [3]:
data.head()

Unnamed: 0,composition,m1,m1%,m2,m2%,m3,m3%,m4,m4%,m5,...,m7%,m8,m8%,m9,m9%,m10,m10%,m11,m11%,TS
0,Cu99.90,Cu,99.9,,,,,,,,...,,,,,,,,,,32.0
1,"Cu98.05,Be1.7,Co0.25",Cu,98.05,Be,1.7,Co,0.25,,,,...,,,,,,,,,,165.0
2,"Cu97.85,Be1.9,Co0.25",Cu,97.85,Be,1.9,Co,0.25,,,,...,,,,,,,,,,175.0
3,"Cu96.9,Be0.6,Co2.5",Cu,96.9,Be,0.6,Co,2.5,,,,...,,,,,,,,,,110.0
4,"Cu97.07,Be0.38,Co1.55,Ag1",Cu,97.07,Be,0.38,Co,1.55,Ag,1.0,,...,,,,,,,,,,110.0


# Creating new compounds
Creating over 18,000 three-element combinations, each with randomly assigned weight percentages.

## Create random weights
Generating three random weights that sum to 100 (before rounding to two decimals).

In [4]:
def wt():
    #draw three random numbers and rescale them so that they sum to 100
    weights = np.random.random(3)
    weights = weights / weights.sum() * 100
    return weights.round(2)
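
Because each weight is rounded separately, the three values can total 99.99 or 100.01 rather than exactly 100. For instance, the composition `Cr15.39,Rh41.65,U42.95` from this notebook's sample output totals 99.99. The sketch below is one possible variant, not part of the original notebook: the name `wt_exact` and its `rng` parameter are assumptions introduced here, and it simply pushes the rounding residue into the last component.

```python
import numpy as np

def wt_exact(n=3, rng=None):
    """Return n random weights, rounded to 2 decimals, that total 100."""
    rng = np.random.default_rng() if rng is None else rng
    weights = rng.random(n)
    weights = (weights / weights.sum() * 100).round(2)
    # Absorb the rounding residue in the last component so the total is 100
    # (up to floating-point error).
    weights[-1] = round(100 - weights[:-1].sum(), 2)
    return weights

w = wt_exact(rng=np.random.default_rng(0))
print(w, w.sum())
```

Whether the 0.01 discrepancy matters depends on how the compositions are used afterwards; for the one-hot features built later in this notebook it is probably negligible.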

## Create random metal-weight combinations of size 3

In [5]:
#collect every element symbol that appears in any of the m1..m11 columns
metals = set()
for column in 'm1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11'.split(' '):
    metals.update(data[column].dropna()) #dropna() removes the missing entries
metals = np.asarray(sorted(metals))

new_comps = []
for i in combinations(metals,3):
    new_comps.append(','.join([m+str(n) for m,n in zip(i,wt())]))
print('Number of new randomly generated components:',len(new_comps))

Number of new randomly generated components: 18424


This created 18,424 new candidate alloy compositions.
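
The count can be sanity-checked with a little combinatorics. `combinations(metals, 3)` yields one entry per unordered triple, so 18,424 results are consistent with 49 distinct element symbols, since C(49, 3) = 18,424. The figure of 49 is inferred from the count (and matches the `m1_...` dummy columns listed later), not stated by the notebook. The placeholder symbols below are hypothetical.

```python
from itertools import combinations
from math import comb

# Placeholder symbols standing in for the 49 distinct elements (assumed count).
symbols = [f"E{i}" for i in range(49)]

n_triples = sum(1 for _ in combinations(symbols, 3))
print(n_triples, comb(49, 3))
```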

In [6]:
# A random sample of 15 items from the above list of new compounds
import random
print(random.sample(new_comps, 15))

['Cr15.39,Rh41.65,U42.95', 'S47.72,Th24.9,Y27.38', 'Au33.86,B40.43,Re25.71', 'Be30.86,Cb53.35,Ru15.8', 'As45.67,Ma17.76,Se36.57', 'Fe71.75,Mn22.07,Rh6.18', 'Be23.77,Mg40.25,Sn35.98', 'H38.64,Ma31.65,Si29.71', 'Ir46.8,Mo35.41,Zr17.79', 'C26.07,Cb63.71,Mo10.22', 'Fe54.86,N14.8,Ni30.34', 'Ir11.43,Mg6.33,Si82.24', 'Co27.9,Pd45.05,S27.05', 'P11.4,Rh32.81,U55.79', 'As45.92,Bi34.28,Rh19.8']


## Create a dataframe of new compounds

In [7]:
#Create a pandas dataframe from new component list so that it can be passed to the model for prediction
new = pd.DataFrame({'composition':new_comps})
expanded = new['composition'].str.split(',', expand=True)
for column in expanded.columns:
    m = expanded[column].str.extract(r'([a-zA-Z]+)([0-9\.]+)')
    m.columns = ['m'+str(column+1),'m'+str(column+1)+'%']
    m['m'+str(column+1)+'%'] = m['m'+str(column+1)+'%'].astype('float')
    new = pd.concat([new,m],axis='columns')
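
To see what the parsing loop does, the same steps can be run on two compositions taken from the original dataset. This is a small illustration only; the dataframe name `sample` is introduced here. Note how the shorter composition gets missing values in the third pair of columns.

```python
import pandas as pd

sample = pd.DataFrame({'composition': ['Cu98.05,Be1.7,Co0.25', 'Ni50,Ti50']})

# One column per "ElementWeight" token; shorter compositions are padded with missing values.
parts = sample['composition'].str.split(',', expand=True)

for col in parts.columns:
    # Group 1 captures the element symbol, group 2 the weight percentage.
    m = parts[col].str.extract(r'([a-zA-Z]+)([0-9\.]+)')
    m.columns = [f'm{col + 1}', f'm{col + 1}%']
    m[f'm{col + 1}%'] = m[f'm{col + 1}%'].astype(float)
    sample = pd.concat([sample, m], axis='columns')

print(sample)
```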

In [8]:
# A look at the new dataframe
new.head()

Unnamed: 0,composition,m1,m1%,m2,m2%,m3,m3%
0,"Ag14.3,Al34.5,As51.2",Ag,14.3,Al,34.5,As,51.2
1,"Ag35.28,Al6.95,Au57.77",Ag,35.28,Al,6.95,Au,57.77
2,"Ag32.11,Al25.09,B42.8",Ag,32.11,Al,25.09,B,42.8
3,"Ag71.65,Al19.84,Be8.51",Ag,71.65,Al,19.84,Be,8.51
4,"Ag61.16,Al18.06,Bi20.78",Ag,61.16,Al,18.06,Bi,20.78


## Data preprocessing

After loading the dataset, the categorical columns m1, m2, ..., m11 need to be one-hot encoded, which converts them into numeric indicator columns that learning algorithms can use. This can be done with the `get_dummies()` function of pandas.

In [9]:
#concatenating the newly generated compounds dataframe to capture their categories
data1 = pd.concat([data,new],sort=False)

In [10]:
for col in 'm1 m2 m3 m4 m5 m6 m7 m8 m9 m10 m11'.split(' '):
    data1[col] = data1[col].astype('category')

In [11]:
df = pd.get_dummies(data1.drop(['composition'],axis='columns')) #df holds the numeric columns plus one dummy column per category
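
The reason for concatenating before encoding can be shown on a toy example. If the two dataframes were encoded separately, any category that occurs in only one of them would be missing from the other's columns, and the model could not be applied. The frames `known` and `candidates` below are small illustrative stand-ins, and `dtype=int` is passed only to keep the output as 0/1 regardless of pandas version.

```python
import pandas as pd

known = pd.DataFrame({'m1': ['Cu', 'Ni'], 'm1%': [70.0, 50.0]})
candidates = pd.DataFrame({'m1': ['Ag'], 'm1%': [14.3]})

# Encoded alone, the known alloys have no column for Ag.
separate = pd.get_dummies(known, dtype=int)

# Encoded together, both sets share the same dummy columns.
together = pd.get_dummies(pd.concat([known, candidates], ignore_index=True), dtype=int)

print(list(separate.columns))
print(list(together.columns))
```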

In [12]:
# a look at the new data
df.head()

Unnamed: 0,m1%,m2%,m3%,m4%,m5%,m6%,m7%,m8%,m9%,m10%,...,m9_V,m9_Zr,m10_C,m10_Fe,m10_O,m10_Zr,m11_B,m11_C,m11_Co,m11_Cu
0,99.9,,,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
1,98.05,1.7,0.25,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
2,97.85,1.9,0.25,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
3,96.9,0.6,2.5,,,,,,,,...,0,0,0,0,0,0,0,0,0,0
4,97.07,0.38,1.55,1.0,,,,,,,...,0,0,0,0,0,0,0,0,0,0


The data in 24 columns is spread across 276 columns after one-hot encoding.

In [13]:
# A look at all the 276 new columns created
for i in df.columns:
    print(i,end=' ')

m1% m2% m3% m4% m5% m6% m7% m8% m9% m10% m11% TS m1_Ag m1_Al m1_As m1_Au m1_B m1_Be m1_Bi m1_C m1_Cb m1_Cc m1_Cd m1_Co m1_Cr m1_Cu m1_Fe m1_H m1_Hf m1_In m1_Ir m1_La m1_Ma m1_Mg m1_Mn m1_Mo m1_N m1_Ni m1_O m1_P m1_Pb m1_Pd m1_Pt m1_Re m1_Rh m1_Ru m1_S m1_Sb m1_Se m1_Si m1_Sn m1_Ta m1_Te m1_Th m1_Ti m1_U m1_V m1_W m1_Y m1_Zn m1_Zr m2_Ag m2_Al m2_As m2_Au m2_B m2_Be m2_Bi m2_C m2_Cb m2_Cc m2_Cd m2_Co m2_Cr m2_Cu m2_Fe m2_H m2_Hf m2_In m2_Ir m2_La m2_Ma m2_Mg m2_Mn m2_Mo m2_N m2_Ni m2_O m2_P m2_Pb m2_Pd m2_Pt m2_Re m2_Rh m2_Ru m2_S m2_Sb m2_Se m2_Si m2_Sn m2_Ta m2_Te m2_Th m2_Ti m2_U m2_V m2_W m2_Y m2_Zn m2_Zr m3_Al m3_As m3_Au m3_B m3_Be m3_Bi m3_C m3_Cb m3_Cc m3_Cd m3_Co m3_Cr m3_Cu m3_Fe m3_H m3_Hf m3_In m3_Ir m3_La m3_Ma m3_Mg m3_Mn m3_Mo m3_N m3_Ni m3_O m3_P m3_Pb m3_Pd m3_Pt m3_Re m3_Rh m3_Ru m3_S m3_Sb m3_Se m3_Si m3_Sn m3_Ta m3_Te m3_Th m3_Ti m3_U m3_V m3_W m3_Y m3_Zn m3_Zr m4_Ag m4_Al m4_As m4_B m4_C m4_Cb m4_Cc m4_Cd m4_Co m4_Cr m4_Cu m4_Fe m4_Mg m4_Mn m4_Mo m4_N m4_Ni m4_Pb m4_

In [14]:
df.fillna(0,inplace=True) #this fills all the blank cells with zero, including the missing TS of the new compounds
df.head()

Unnamed: 0,m1%,m2%,m3%,m4%,m5%,m6%,m7%,m8%,m9%,m10%,...,m9_V,m9_Zr,m10_C,m10_Fe,m10_O,m10_Zr,m11_B,m11_C,m11_Co,m11_Cu
0,99.9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
1,98.05,1.7,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
2,97.85,1.9,0.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
3,96.9,0.6,2.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0
4,97.07,0.38,1.55,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,0


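After capturing the categories, the newly generated compounds are separated from the original alloy dataframe. The new compounds are recognized by a TS of 0, the value their missing tensile strength received from `fillna(0)`.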

In [15]:
new_dum = df[df['TS'] == 0].drop(['TS'],axis='columns') #new compounds: TS was blank and got filled with 0
df = df[df['TS'] != 0] #original alloys with a recorded tensile strength
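
Separating on `TS == 0` works as long as no original alloy has a recorded tensile strength of 0. A more explicit alternative, sketched here under the assumption that labelling the rows is acceptable, is to pass `keys` to `pd.concat` and select each group by its label. The toy frames are illustrative, and the second row of `known` is a hypothetical record with a TS of 0 that the zero-based filter would misplace.

```python
import pandas as pd

known = pd.DataFrame({'m1': ['Cu', 'Ni'], 'm1%': [99.9, 50.0],
                      'TS': [32.0, 0.0]})  # second TS value is hypothetical
candidates = pd.DataFrame({'m1': ['Ag'], 'm1%': [14.3]})

# keys adds an outer index level that records where each row came from.
combined = pd.concat([known, candidates], keys=['known', 'new'], sort=False)
encoded = pd.get_dummies(combined, dtype=int).fillna(0)

known_enc = encoded.loc['known']
new_enc = encoded.loc['new'].drop(columns=['TS'])

print(len(known_enc), len(new_enc))
print((encoded['TS'] != 0).sum())  # rows the TS-based filter would keep as "known"
```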

## Defining the target variable and splitting data into train and test sets

Supervised machine learning models require the predictor variables to be separated from the target variable. The predictor variables of the original dataframe df are stored in the dataframe X, while the target is stored in y.

In [16]:
X = df.drop(['TS'],axis='columns')
y = df['TS']

Splitting the data into train and test sets makes it possible to check the model's performance on unseen data. The code in the cell below splits the predictors and the target into 'train predictors', 'test predictors', 'train targets', and 'test targets'.

In [17]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=0)
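
Since no `test_size` is given, `train_test_split` falls back to its default of holding out 25% of the rows, and `random_state=0` makes the split reproducible. The toy arrays below are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(200).reshape(100, 2)  # 100 rows, 2 features
y_demo = np.arange(100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
print(len(X_tr), len(X_te))

# The same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_demo, y_demo, random_state=0)
same_split = np.array_equal(y_te, y_te2)
```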

## Train and test sets

A look at the test and train sets created above

In [43]:
# Alloys used in the train dataset
data.loc[X_train.index]['composition']

132                                            Ni50,Ti50
66                                            Al95.0,Si5
261                   Fe75.03,Cr17,Ni7,Ti0.7,Al0.2,C0.07
12                                      Cu89.8,Sn10,P0.2
120                  Ni95.6,Cu0.5,Fe0.5,Mn0.8,Si1.5,C0.8
153              Ni77.4,Cr20.5,Al0.15,Ti0.35,Fe0.5,C0.10
289                        Fe68.42,Cr19,Ni10,Mo2.5,C0.08
275           Fe72.74,Cr26,Mo1,Mn0.05,Si0.2,N0.01,C0.002
17                                             Cu79,Ni21
231                                  Fe70.0,Cr18,Ni10,B2
324                                               Cb99.6
322                              Fe16.45,Cr17,Ni66,C0.55
225                      Fe67.6,Cr18,Ni5,Mn9,N0.25,C0.15
348                                                Ir100
103                                 Fe55.0,Al12,Ni28,Cc5
146    Ni52,Cr22,Co12.5,Mo9,Al1.2,Fe1.5,Mn0.5,Si0.5,T...
191         Fe55.99,Ni26,Cr13.5,Mo2.7,Ti1.7,Al0.1,B0.005
347                            

In [44]:
# Alloys in the test dataset
data.loc[X_test.index]['composition']

6                                              Cu70,Zn30
140                       Ni32,Cr20.5,Fe44.5,C0.05,Ti1.1
302    Fe41.43,Cr20.5,Ni29,Cu3.5,Mo2.5,Mn1.5,Si1.5,C0.07
220                  Fe89.5,Ni4.5,Cr2.1,C2.8,Si0.5,Mn0.6
90     Ti81.66,Al6,Sn2,Zr4,Mo6,C0.04,Fe0.15,N0.02,H0....
229                           Fe71.92,Cr18.5,Ni9.5,C0.08
114                  Fe58.4,Ni33,Cr5,W2,Mn0.6,Si0.5,C0.5
60                                                Al99.0
37                                Cu87.0,Ni5,Sn5,Pb1,Zn2
296                        Fe63.0,Cr26,Ni5.5,Mo2.5,Cu3.0
26                                  Cu81.8,Ni9,Sn8,Cb0.2
286                              Fe70.97,Cr19,Ni10,C0.03
234                              Fe54.75,Cr25,Ni20,C0.25
287                               Fe71.92,Cr19,Ni9,C0.08
136                 Ni66,Cu31,Fe1.2,Mn1,Si0.2,C0.2,S0.04
213               Fe69.75,Ni15.5,Si2,Mn1.25,Cu6.5,Cr2,C3
135                       Ni54.5,Cu45,Mn0.05,Fe0.2,C0.08
282                     Fe76.63

# Building the model
## Random Forest for Regression

Now that the train and test datasets are ready, the cell below creates a random forest model instance using the RandomForestRegressor class from scikit-learn.

In [18]:
#This code creates a random forest instance; hyperparameters not listed here keep their scikit-learn defaults
#max_features=1.0 considers every feature at each split, which is what 'auto' meant for regressors in older releases
#min_impurity_split is omitted because recent scikit-learn releases no longer accept it
rf = RandomForestRegressor(n_estimators=10, max_features=1.0,
           min_impurity_decrease=9, n_jobs=1, random_state=0)

## Fitting the model

The random forest is now fit on the train predictors and train target. The fitted model is first used to predict the target variable (tensile strength) from the train predictors, and the error is noted.

The same model is then used to predict the test target from the test predictors. This last step gives the metric used to evaluate the model on held-out data.

In [19]:
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=9, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

## Predicting the train target to evaluate how the model performs on seen data

In [26]:
rf_train_preds = rf.predict(X_train)
print('RF train error: ',mae(y_train,rf_train_preds))

RF train error:  14.146695100437274


The mean absolute error on the train set is about 14.15.
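
The mean absolute error is the average of the absolute differences between actual and predicted values, in the same units as the target. A small worked example follows. The actual values are the first four TS entries of the dataset, while the predictions are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_actual = np.array([32.0, 165.0, 175.0, 110.0])  # first four TS values in the data
y_guess = np.array([40.0, 150.0, 180.0, 110.0])   # hypothetical predictions

# Absolute errors are 8, 15, 5 and 0, so the mean is 28 / 4 = 7.
by_hand = np.abs(y_actual - y_guess).mean()
by_sklearn = mean_absolute_error(y_actual, y_guess)
print(by_hand, by_sklearn)
```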

## Checking model performance on unseen data (the test set)

In [21]:
rf_test_preds = rf.predict(X_test)
print('RF test error : ',mae(y_test,rf_test_preds))

RF test error :  19.55008921127267


The mean absolute error on unseen data is about 19.55, somewhat higher than on the train set.
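
With only a few hundred alloys, the error from a single train/test split can shift noticeably depending on which rows land in the test set. One common way to get a steadier estimate is k-fold cross-validation. The sketch below is not part of the original analysis: it uses synthetic data (the alloy file is not available here), and the names `X_syn` and `y_syn` are placeholders. scikit-learn reports the score as negative MAE, so the sign is flipped.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 120 rows, 5 features, noisy linear target.
rng = np.random.default_rng(0)
X_syn = rng.random((120, 5))
y_syn = 100 * X_syn[:, 0] + 20 * X_syn[:, 1] + rng.normal(0, 5, 120)

model = RandomForestRegressor(n_estimators=10, random_state=0)
scores = cross_val_score(model, X_syn, y_syn, cv=5,
                         scoring='neg_mean_absolute_error')
fold_mae = -scores  # convert back to positive MAE, one value per fold
print(fold_mae.round(2), fold_mae.mean().round(2))
```

Applied to this notebook, `X` and `y` would take the place of the synthetic arrays.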

# Predicting tensile strengths of new compounds
Passing the newly created compounds data to the model to predict their tensile strengths

In [22]:
predictd_TS = rf.predict(new_dum)

In [23]:
#adding the predicted tensile strengths as a column to the new compounds dataframe
new['Predicted_TS'] = predictd_TS.round(2)

In [24]:
# A look at the new final dataframe
new.head()

Unnamed: 0,composition,m1,m1%,m2,m2%,m3,m3%,Predicted_TS
0,"Ag14.3,Al34.5,As51.2",Ag,14.3,Al,34.5,As,51.2,63.91
1,"Ag35.28,Al6.95,Au57.77",Ag,35.28,Al,6.95,Au,57.77,60.75
2,"Ag32.11,Al25.09,B42.8",Ag,32.11,Al,25.09,B,42.8,63.91
3,"Ag71.65,Al19.84,Be8.51",Ag,71.65,Al,19.84,Be,8.51,59.89
4,"Ag61.16,Al18.06,Bi20.78",Ag,61.16,Al,18.06,Bi,20.78,59.89
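
A natural next step is to rank the candidates by predicted strength. The sketch below rebuilds the five rows shown above in a small dataframe (named `preds` here for illustration) and sorts them. It also counts the distinct predictions: repeated values such as 63.91 and 59.89 are expected from a tree ensemble, whose predictions are piecewise constant, so different compositions can receive the same value.

```python
import pandas as pd

preds = pd.DataFrame({
    'composition': ['Ag14.3,Al34.5,As51.2', 'Ag35.28,Al6.95,Au57.77',
                    'Ag32.11,Al25.09,B42.8', 'Ag71.65,Al19.84,Be8.51',
                    'Ag61.16,Al18.06,Bi20.78'],
    'Predicted_TS': [63.91, 60.75, 63.91, 59.89, 59.89],
})

# Highest predicted tensile strength first.
top = preds.sort_values('Predicted_TS', ascending=False).head(3)
n_distinct = preds['Predicted_TS'].nunique()
print(top)
print(n_distinct)
```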


## Exporting the new compounds dataframe as a csv document to the local drive

In [25]:
new.to_csv(r'C:\Users\hchintada\Downloads\TS_Project\New_Compound_Predictions.csv',index=False)
#please replace the above location with your local drive location where you want the file to be exported to