In [None]:
import numpy as np
import matplotlib.pyplot as plt

# ----- make nice figures -----
import matplotlib as mpl
mpl.rcParams['figure.dpi']= 300
from cycler import cycler
COLORS = ['#F00D2C', '#242482', '#0071BE', '#4E8F00', '#553C67', '#DA5319']
default_cycler = cycler(color=COLORS)
plt.rc('axes', prop_cycle=default_cycler) 
# -----------------------------

Below is just some of the code we used to generate the figures in class.

In [None]:
# Load the data
data = np.loadtxt('data/perovskite_data.txt')

# Extract the materials descriptors - there should be 14 of them
X = data[:, 1:]
print(X.shape)

# Extract the perovskite label
y = data[:, 0]
print(y.shape)

# Standardize X: zero mean and unit standard deviation per descriptor (removes units)
X = (X - np.mean(X, axis = 0))/np.std(X, axis = 0)
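
As a quick sanity check on the standardization step, the sketch below applies the same column-wise formula to a small synthetic matrix and confirms that every column ends up with mean 0 and standard deviation 1. The array `X_demo` and its scales are made up for illustration; it is not the perovskite data.

```python
import numpy as np

# Hypothetical stand-in data: 200 samples, 3 columns on very different scales
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[5.0, -2.0, 100.0], scale=[1.0, 0.1, 25.0], size=(200, 3))

# Column-wise standardization, same formula as in the cell above
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)
X_std = (X_demo - mu) / sigma

# Each column should now have mean ~0 and standard deviation ~1
print(np.allclose(X_std.mean(axis=0), 0.0))
print(np.allclose(X_std.std(axis=0), 1.0))
```

Note that a descriptor with zero variance would cause a division by zero here, so constant columns would need to be dropped or handled separately.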

Let's try to look at the data. For example, let's look at a histogram of the experimental response (whether perovskite formed).

In [None]:
plt.hist(y)
plt.ylabel('Frequency')
plt.xlabel('Perovskite formed')

On its own, the histogram is not that illuminating: it only shows how many samples fall into each class. We're ultimately interested in whether perovskite will form based on the material descriptors, so we need to tie the descriptors (the x's) to the response (the y) somehow. Let's do a few scatter plots.

In [None]:
for i in range(X.shape[1]):  # one scatter plot per descriptor (14 in total)
    plt.scatter(X[:,i], y)
    plt.xlabel('x_' + str(i))
    plt.ylabel('Perovskite formed')
    plt.grid()
    plt.show()

While there is some structure that we can observe, there is nothing definitive. One problem is that all 14 descriptors work jointly to determine whether the perovskite structure forms. That is, we need to think about all 14 descriptors simultaneously!
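
To see why one-descriptor-at-a-time plots can miss joint structure, here is a minimal sketch on synthetic data (not the perovskite set). The names `x1`, `x2`, and `label` are hypothetical: the label is assumed to depend only on the sign of the product of the two features, so neither feature alone is informative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent synthetic features
x1 = rng.normal(size=2000)
x2 = rng.normal(size=2000)

# Label is +1 when the features share a sign, -1 otherwise
label = np.where(x1 * x2 > 0, 1.0, -1.0)

# Correlation of the label with each feature alone, and with their product
corr_x1 = np.corrcoef(x1, label)[0, 1]
corr_x2 = np.corrcoef(x2, label)[0, 1]
corr_joint = np.corrcoef(x1 * x2, label)[0, 1]

print(corr_x1, corr_x2, corr_joint)
```

In this toy setup the single-feature correlations come out close to zero, while the product of the two features is clearly correlated with the label. A scatter plot of the label against `x1` or `x2` alone would look just as inconclusive as the plots above.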

In this module, we'll learn how to deal with higher-dimensional data. For example, consider the following dimensionality reduction, which uses kernel PCA to map the 14 descriptors down to 2 components:

In [None]:
from sklearn.decomposition import KernelPCA

# 'degree' is only used by the 'poly' kernel, so it is omitted for 'rbf'
pca = KernelPCA(n_components=2, kernel='rbf')
pca.fit(X)
Z = pca.transform(X)

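# Color each point by its label: first color where y == -1, second color otherwise
color_seq = [COLORS[0] if yy == -1 else COLORS[1] for yy in y]

plt.scatter(Z[:, 0], Z[:, 1], c=color_seq)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()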
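
For intuition about what the reduction above computes, here is a from-scratch sketch of RBF kernel PCA in plain NumPy. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper `rbf_kernel_pca` and the random matrix `X_demo` are hypothetical, `gamma` defaults to `1 / n_features` to mirror the default documented for `KernelPCA`, and component signs may differ from what a library returns.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=None):
    """Project the rows of X onto the leading RBF kernel principal components."""
    n_samples, n_features = X.shape
    if gamma is None:
        gamma = 1.0 / n_features  # assumed default, mirroring KernelPCA

    # Pairwise squared Euclidean distances between all samples
    sq_norms = np.sum(X**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off

    # RBF (Gaussian) kernel matrix
    K = np.exp(-gamma * sq_dists)

    # Center the kernel matrix, i.e. center the data in feature space
    one_n = np.full((n_samples, n_samples), 1.0 / n_samples)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n

    # Symmetric eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    top = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[top], eigvecs[:, top]

    # Training-set projections: unit eigenvectors scaled by sqrt(eigenvalue)
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0))

# Hypothetical stand-in for a standardized 14-descriptor matrix
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 14))

Z_demo = rbf_kernel_pca(X_demo, n_components=2)
print(Z_demo.shape)
```

The resulting two columns are centered and mutually orthogonal, and they can be passed to `plt.scatter` in the same way as `Z` above.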