


# Unsupervised learning

Dataset source: https://citrination.com/datasets/153493/show_search?searchMatchOption=fuzzyMatch

Strength properties, composition, and processing steps for Ni superalloys. 

Reference: B.D. Conduit, N.G. Jones, H.J. Stone, and G.J. Conduit, Materials & Design 131 (2017) 358-365. 

Properties include: ultimate tensile strength, yield strength, elongation, stress rupture time, and processing steps.

## Import

In [1]:
%pylab inline

Populating the interactive namespace from numpy and matplotlib


In [2]:
import pandas as pd
import json
import re

## Load and transform the data

In [3]:
with open("ni_superalloys_3.json","r") as f:
    data=json.load(f)

In [4]:
len(data)

2800

In [5]:
type(data)

list

In [6]:
d=data[0]

In [7]:
def extract_composition_(d):
    weight_perc = d.get("idealWeightPercent")
    if weight_perc:
        weight_perc=weight_perc.get('value')
    return d.get('element'), weight_perc

In [8]:
def extract_composition(d):
    return dict(map(extract_composition_,d["composition"]))

In [9]:
def extract_properties(d):
    return dict([(dd["name"], dd["scalars"][0]['value']) for dd in d["properties"]])    

In [10]:
composition_example=extract_composition(d)

In [11]:
elements=list(composition_example.keys())

In [12]:
properties_example=extract_properties(d)

In [13]:
properties = list(properties_example.keys())

In [14]:
def extract_data(d):
    res = extract_composition(d)
    res.update(extract_properties(d))
    return res

In [15]:
extracted_data=[extract_data(d) for d in data]

In [16]:
df=pd.DataFrame(extracted_data)

In [17]:
df[elements]=df[elements].fillna('0')

In [18]:
# Extract the first numeric token from each weight-percent string and convert it to float
patt=re.compile("[0-9.]+")
df[elements]=df[elements].applymap(lambda s: patt.findall(s)[0]).astype(float)

In [19]:
df.head()

Unnamed: 0,Al,Area under heat treatment curve,B,C,Co,Cr,Cu,Elongation,Fe,Heat treatment 1 Cooling,...,Test Temperature,Th,Ti,Total heat treatment time,Ultimate Tensile Strength,V,W,Y,Yield Strength,Zr
0,1.5,392890,0.005,0.02,9.1,17.4,0.0,,9.7,Air cooled,...,20,0.0,0.7,517,,0.0,1.0,0.0,989.0,0.0
1,0.3,0,0.0,0.0,0.0,15.5,0.0,,1.0,,...,20,0.0,0.0,0,845.0,0.0,0.0,0.0,,0.0
2,0.3,0,0.0,0.1,0.0,22.0,0.0,,0.0,,...,870,0.0,0.0,0,,0.0,14.0,0.0,,0.0
3,1.3,19572,0.001,0.035,13.4,19.7,0.006,,0.0,,...,732,0.0,3.0,24,,0.0,0.0,0.0,,0.0
4,5.7,0,0.0,0.0,15.0,10.0,0.0,,0.0,,...,815,0.0,4.1,0,,0.0,0.0,0.0,,0.0


In [20]:
df["Yield Strength"]=df["Yield Strength"].astype(float)
df["Area under heat treatment curve"]=df["Area under heat treatment curve"].astype(float)

In [21]:
df[df["Yield Strength"]>0].shape

(1046, 58)

In [22]:
df[df["Yield Strength"]>0].sort_values("Yield Strength", ascending=False)

Unnamed: 0,Al,Area under heat treatment curve,B,C,Co,Cr,Cu,Elongation,Fe,Heat treatment 1 Cooling,...,Test Temperature,Th,Ti,Total heat treatment time,Ultimate Tensile Strength,V,W,Y,Yield Strength,Zr
1999,3.40,35200.00,0.0150,0.160,9.00,12.20,0.0,10,0.00,Air cooled,...,20,0.0,4.10,50,,0.0,3.95,0.000000,1654.0,0.110
1182,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,17.3,0.00,Air cooled,...,20,0.0,5.00,34,,0.0,1.25,0.000000,1618.0,0.035
512,2.50,24006.00,0.0400,0.025,14.75,16.00,0.0,17.3,0.00,Air cooled,...,20,0.0,5.00,34,,0.0,1.25,0.000000,1586.0,0.035
575,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,15.3,0.00,Air cooled,...,20,0.0,5.00,34,,0.0,1.25,0.000000,1586.0,0.035
1149,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,17,0.00,Air cooled,...,20,0.0,5.00,34,,0.0,1.25,0.000000,1584.0,0.070
14,2.50,24006.00,0.0300,0.025,14.75,16.00,0.0,17.3,0.00,Air cooled,...,20,0.0,5.00,34,,0.0,1.25,0.000000,1581.0,0.035
116,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,16.5,0.00,Air cooled,...,427,0.0,5.00,34,,0.0,1.25,0.000000,1527.0,0.035
1359,2.50,24006.00,0.0300,0.025,14.75,16.00,0.0,15.5,0.00,Air cooled,...,427,0.0,5.00,34,,0.0,1.25,0.000000,1524.0,0.035
1909,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,16,0.00,Air cooled,...,427,0.0,5.00,34,,0.0,1.25,0.000000,1517.0,0.070
2145,2.50,24006.00,0.0200,0.025,14.75,16.00,0.0,13.5,0.00,Air cooled,...,427,0.0,5.00,34,,0.0,1.25,0.000000,1510.0,0.035


In [23]:
X = df[elements]

# Unsupervised learning

Reduce the `X` dataset to two dimensions with PCA, KernelPCA, and t-SNE, and plot each result.
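As a starting point, the three reductions could be sketched as follows. This is only one possible setup: the standardization step, the RBF kernel, the perplexity of 30, and the helper name `reduce_to_2d` are assumptions, and `X_demo` is a random stand-in for the composition matrix rather than the real data.

```python
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def reduce_to_2d(X, random_state=0):
    """Return a dict mapping method name to a 2D embedding of X (hypothetical helper)."""
    # Standardize so that elements with large weight percentages do not dominate
    X_scaled = StandardScaler().fit_transform(X)
    reducers = {
        "PCA": PCA(n_components=2),
        "KernelPCA": KernelPCA(n_components=2, kernel="rbf"),  # kernel choice is an assumption
        "t-SNE": TSNE(n_components=2, perplexity=30, init="pca",
                      random_state=random_state),
    }
    return {name: reducer.fit_transform(X_scaled) for name, reducer in reducers.items()}


# Random stand-in with the same number of columns as the composition matrix
rng = np.random.default_rng(0)
X_demo = rng.random((200, 27))
embeddings = reduce_to_2d(X_demo)
```

Each embedding `Z` in `embeddings` can then be drawn with `plt.scatter(Z[:, 0], Z[:, 1])`; on the real data, replace `X_demo` with `X`.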

Cluster the data with three methods: DBSCAN, KMeans, and GaussianMixture (choose an appropriate number of clusters for the last two methods).
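One way to approach the clustering step is sketched below. It assumes the silhouette score for picking the KMeans cluster count and the BIC for picking the number of GaussianMixture components; both are common heuristics, not the only valid ones. The DBSCAN `eps` and `min_samples` values are placeholders that would need tuning, and the data are synthetic blobs rather than the alloy compositions.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: three well-separated blobs in five dimensions
centers = [[0, 0, 0, 0, 0], [5, 5, 5, 5, 5], [-5, 5, -5, 5, -5]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

candidate_k = range(2, 8)

# KMeans: keep the cluster count with the highest silhouette score
silhouette = {}
for k in candidate_k:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    silhouette[k] = silhouette_score(X_scaled, labels_k)
best_k = max(silhouette, key=silhouette.get)
kmeans_labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X_scaled)

# GaussianMixture: keep the component count with the lowest BIC
bic = {k: GaussianMixture(n_components=k, random_state=0).fit(X_scaled).bic(X_scaled)
       for k in candidate_k}
best_n = min(bic, key=bic.get)
gmm_labels = GaussianMixture(n_components=best_n, random_state=0).fit_predict(X_scaled)

# DBSCAN finds the number of clusters itself; label -1 marks noise points
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)
```

Plotting `silhouette` and `bic` against `candidate_k` is a quick check that the chosen values sit at a clear optimum rather than on a flat curve.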

Choose the best 2D representation of the dataset, analyze the 'character' of the largest cluster on it, and analyze its average composition.
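The average composition of the largest cluster can be compared with the dataset-wide average, for example as below. The table and labels here are a small hypothetical toy example, not values from the dataset. Excluding the DBSCAN noise label `-1` before picking the largest cluster is an assumption about how noise should be treated.

```python
import numpy as np
import pandas as pd

# Hypothetical toy compositions (weight percent) and cluster labels
X_demo = pd.DataFrame({
    "Cr": [20.0, 19.0, 21.0, 10.0, 12.0],
    "Co": [0.0, 1.0, 0.5, 15.0, 14.0],
    "Al": [0.5, 0.3, 0.4, 5.5, 5.0],
})
labels = np.array([0, 0, 0, 1, -1])  # -1 marks DBSCAN noise

# Largest cluster, ignoring noise points
clustered = labels[labels != -1]
largest = pd.Series(clustered).value_counts().idxmax()

# Mean composition inside the cluster versus the whole dataset
cluster_mean = X_demo[labels == largest].mean()
overall_mean = X_demo.mean()
summary = pd.DataFrame({
    "cluster_mean": cluster_mean,
    "overall_mean": overall_mean,
    "difference": cluster_mean - overall_mean,
}).sort_values("difference", key=abs, ascending=False)
```

The elements at the top of `summary` are the ones whose content differs most from the average, which gives a first hint at what characterizes the cluster.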

# Bonus task - supervised learning

This dataset also has a "Yield Strength" value for some rows.

Treat it as a label `y` and train a supervised ML model to predict it from the composition `X` and from the reduced representations from the previous section.

Choose the validation strategy accordingly and find the model with the lowest RMSE.
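One possible reading of "accordingly": several rows share an identical composition (the same alloy measured under different conditions), so a plain random split may leak an alloy into both the training and the validation folds. The sketch below assumes grouped cross-validation keyed on composition to avoid that. The models, their settings, and the synthetic data are all illustrative assumptions, not results for this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold, cross_val_predict
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 60 distinct compositions, each repeated 4 times
rng = np.random.default_rng(0)
base = rng.random((60, 5))
X_demo = pd.DataFrame(np.repeat(base, 4, axis=0), columns=list("abcde"))
y_demo = 1000 * X_demo["a"] + 300 * X_demo["b"] + rng.normal(0, 20, len(X_demo))

# Rows with an identical composition share one group id
groups = X_demo.groupby(list(X_demo.columns), sort=False).ngroup()

models = {
    "Ridge": Ridge(alpha=1.0),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=0),
    "PCA(2) + Ridge": make_pipeline(PCA(n_components=2), Ridge(alpha=1.0)),
}

cv = GroupKFold(n_splits=5)
scores = {}
for name, model in models.items():
    # Out-of-fold predictions: each row is predicted by a model that never saw its group
    pred = cross_val_predict(model, X_demo, y_demo, cv=cv, groups=groups)
    scores[name] = np.sqrt(mean_squared_error(y_demo, pred))

best = min(scores, key=scores.get)
```

Fitting the PCA step inside the pipeline means the reduction is refit on each training fold. A transductive embedding such as t-SNE has no `transform` method for new rows, so it cannot be used inside a pipeline this way.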

In [24]:
y = df["Yield Strength"]

In [25]:
X=X[~y.isna()]

In [26]:
y=y[~y.isna()]

In [27]:
X.shape

(1046, 27)

In [28]:
y.shape

(1046,)

In [29]:
from sklearn.metrics import mean_squared_error

In [30]:
def RMSE(y_true, y_pred):
    return np.sqrt(mean_squared_error(y_true, y_pred))

In [31]:
#YOUR CODE HERE