# Principal Component Analysis

We start with a combination of data (converged and nonconverged i.e. fitness not optimal)
via Regression, we conduct a statistical significance analysis and extract features with high p-values. 
For PCA section, we: 
Import Data: 
1. import the data with extracted features (weights and delay)
Create Model:
2. split train and test set with test size 10%
3. with 90% of train set, create the Principal Component space of size 3 and plot the data points
4. cluster the data points with Kmeans with arbitrary large k to discretize the space
Predict fitness:
5. transform our test data into created PC space and plot the data points
6. assign each new data points to the cluster with minimum euclidean distance 
7. then we give those clustered data points the median fitness value of the cluster, which it belongs
8. we further determine the validity of the cluster assignment by comparing the distance between centroid and the furthest point in the cluster with the distance between test data point to the centroid of the cluster it was assigned to. If the distance between the test data point and the centroid was greater than the distance between the centroid and the furthest data point in the cluster, we classify the test data point invalid. 
Test:
 compare predicted fitness value of the test data to real fitness value of the test set
 invalid/ valid 
10. 


# Remaining Task: 


In [1]:

import os
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from matplotlib import rcParams
from matplotlib.colors import Normalize
from mpl_toolkits.mplot3d import Axes3D
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
import ipympl

#import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans


#%matplotlib notebook
#%matplotlib inline 
%matplotlib qt


# Data Import and Preprocessing

We try to run PCA for the following data:

- [ ] last generation, optimized (converged, min. collision) : dataset{1,2} 
- [ ] mid generation, not optimized (not converged) : dataset{3,4,5,6,7}
- [ ] all generation, not optimized (not converged) : 99_gens_96_indiv_maxfitness_11
- [x] combination of all

**NOTE:** Datasets 2 and 5 as well as 6 and 8 are identical

In [2]:
# load data
df = pd.read_csv('reduced_joined_data.csv', delimiter=',')
df
#features = np.concatenate((weights, delays), axis=1)

Unnamed: 0,w1,w4,w5,w6,w8,w9,w10,w11,w12,w13,...,d25,d35,d36,d39,d56,d58,d72,d76,d77,d78
0,-17.00923,8.88100,16.74854,20.00000,-17.66358,20.00000,-20.00000,19.72965,-5.18483,-11.92167,...,1.0,3.0,1.0,1.0,7.0,1.0,1.0,2.0,7.0,4.0
1,-14.98519,8.44036,17.18445,20.00000,-17.57523,20.00000,20.00000,20.00000,-5.91350,-16.28607,...,2.0,3.0,1.0,1.0,7.0,1.0,1.0,2.0,5.0,4.0
2,-14.07550,8.33186,16.25532,20.00000,-17.55655,20.00000,-20.00000,20.00000,-0.10390,-15.84485,...,2.0,3.0,1.0,1.0,7.0,1.0,1.0,2.0,5.0,4.0
3,-18.57002,8.73767,16.92451,20.00000,-17.96217,20.00000,17.04964,19.47330,-5.40176,-13.14366,...,1.0,3.0,1.0,1.0,7.0,1.0,1.0,2.0,7.0,4.0
4,-14.07550,8.33186,16.25532,20.00000,-17.47067,20.00000,-20.00000,20.00000,-0.10390,-15.84485,...,2.0,3.0,1.0,1.0,7.0,1.0,1.0,2.0,5.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19771,-17.15248,-11.79802,-3.83303,0.50112,7.68743,-20.00000,-17.56264,-4.14908,-13.84944,6.45201,...,4.0,3.0,1.0,3.0,4.0,1.0,5.0,3.0,1.0,5.0
19772,-17.17634,-11.77756,-3.70331,0.52042,7.51694,-19.80628,-17.49668,-3.70948,-13.64777,6.39904,...,4.0,3.0,1.0,2.0,4.0,1.0,4.0,3.0,1.0,4.0
19773,-17.22298,-11.88064,-3.76099,0.80303,-20.00000,7.52185,-17.54095,-4.17102,-13.68850,6.38467,...,4.0,3.0,1.0,3.0,4.0,3.0,4.0,3.0,2.0,5.0
19774,-17.35108,-11.85397,-3.73932,0.59503,-20.00000,7.59284,-17.72002,-4.31863,-13.87703,6.46015,...,4.0,3.0,1.0,3.0,4.0,3.0,4.0,3.0,2.0,4.0


In [3]:
# load data
df_y = pd.read_csv('y_reduced_joined_data.csv', delimiter=',')
df_y
#features = np.concatenate((weights, delays), axis=1)

Unnamed: 0,0
0,37.0
1,26.0
2,36.0
3,6.0
4,19.0
...,...
19771,6.0
19772,-301.5
19773,2.0
19774,2.0


In [4]:
# fitness statistics

pd.DataFrame(df_y).describe()

Unnamed: 0,0
count,19776.0
mean,-691.226234
std,2094.585801
min,-10000.0
25%,-79.0
50%,4.0
75%,6.0
max,56.0


In [5]:
# Splitting the dataset into the Training set and Test set

X_train, X_test, y_train, y_test = train_test_split(df, df_y, test_size = 0.1, random_state = 42)

In [6]:
# Preprocessing

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)
test_scaled = scaler.transform(X_test)


In [7]:
print(y_test)

            0
13367     0.0
438       7.0
8850      9.0
2270      6.0
3405      4.0
...       ...
10633  -495.0
17848  -110.5
14961 -1144.0
18563     6.0
14581     7.0

[1978 rows x 1 columns]


# Approach 1: PCA

The first approach to address this challenge is to reduce the dimensionality of the data with the help of principal component analysis (PCA). PCA projects the data into a lower dimensional sub-space preserving essential parts of the data in terms of its variance. We suggest to plot the sub-space that is defined by the first three (and subsequent) principal components of the data combined with a color coded fitness.


**Is there a relation between principal components and the fitness of the data?**

## Observation

We only see 'relatively' strong principal components PC1 33.28% and PC2 13.89% with all generation data.
For all other datasets, we get a rather uniform distribution of explained variance by each principal components. 
i.e. can't really narrow down to lower dimension subspace defined by small number of principal components.


Symmetry at (5,10,-5)


In [8]:
# PCA
df_pca = pd.DataFrame()
pca = PCA(n_components = 3)
pca_result = pca.fit_transform(train_scaled)

df_pca['PC1'] = pca_result[:,0]
df_pca['PC2'] = pca_result[:,1] 
df_pca['PC3'] = pca_result[:,2] 
df_pca['fitness'] = y_train

for v in pca.explained_variance_ratio_:
    print('Explained variation per principal component: {}%'.format(round(v*100,2)))

Explained variation per principal component: 35.12%
Explained variation per principal component: 12.13%
Explained variation per principal component: 10.87%


In [9]:
display(df_pca)

Unnamed: 0,PC1,PC2,PC3,fitness
0,4.993993,-2.258959,-1.827663,37.0
1,5.062297,-2.426713,-1.473301,26.0
2,-4.556836,-1.472342,2.090387,36.0
3,3.943793,4.055449,3.756591,
4,4.192773,-2.430038,-0.761960,19.0
...,...,...,...,...
17793,-0.217712,0.877330,-1.580325,
17794,-4.423293,-2.911331,1.325402,-10000.0
17795,5.614366,-2.416353,-1.721162,
17796,5.015763,-2.332263,-1.859446,4.0


In [10]:
# plot pca results - 3D Scatter plot of PCA1, PCA2 and PCA3

fig = plt.figure(1)
ax = fig.add_subplot(111, projection = '3d')

#colors = np.clip(fitnesses.astype(int),-1000., 10000.)
colors = np.clip(y_train.astype(int),0., 10000.)

ax.scatter(df_pca.PC1, df_pca.PC2, df_pca.PC3, c = colors, cmap = 'coolwarm', alpha = 0.5)
#ax.scatter(df.PC1, df.PC2, df.PC3, s=5, c = colors, cmap = plt.cm.get_cmap('coolwarm', 2), alpha = 0.5)


# We will then label the three axes using the percentages explained for each major component.
ax.set_xlabel('PC-1, ' +  str(round(pca.explained_variance_ratio_[0]*100,2)) + '% Explained', fontsize=7)
ax.set_ylabel('PC-2, ' +  str(round(pca.explained_variance_ratio_[1]*100,2)) + '% Explained', fontsize=7)
ax.set_zlabel('PC-3, ' +  str(round(pca.explained_variance_ratio_[2]*100,2)) + '% Explained', fontsize=7)


fig.legend(fontsize = 'x-small', loc='upper center', markerscale=2)
plt.autoscale()
plt.rcParams["figure.dpi"] = 1000                                   # set the figure resolution dpi value to 1000
plt.show()


fig_name = '3D_scatterplot_PCA.png'
fig.savefig(fig_name)

No handles with labels found to put in legend.


In [11]:
#Plot the scatter digram
fig = plt.figure()
colors = np.clip(y_train.astype(int),0., 10000.)


plt.scatter(df_pca.PC1, df_pca.PC3, c = colors, cmap = 'coolwarm', alpha=0.5) 
#plt.title("title")
plt.xlabel("PC-1")
plt.ylabel("PC-3")

plt.show()

# PLOT NEW DATA ONTO TRAINED PCA SPACE

In [12]:
# create combined test and train data set to plot onto existing PCA space


## Do we need to scale X_test? how? with mean and std applied to model before and use standardscaler? 
projection = pca.transform(test_scaled)
projection_combined = np.concatenate((pca_result, projection))
fitnesses_combined = np.concatenate((y_train, y_test))

df_projection_combined = pd.DataFrame()
df_projection_combined['PC1'] = projection_combined[:,0]
df_projection_combined['PC2'] = projection_combined[:,1] 
df_projection_combined['PC3'] = projection_combined[:,2] 
df_projection_combined['fitness'] = fitnesses_combined

display(df_projection_combined)

Unnamed: 0,PC1,PC2,PC3,fitness
0,4.993993,-2.258959,-1.827663,6.0
1,5.062297,-2.426713,-1.473301,4.0
2,-4.556836,-1.472342,2.090387,-287.0
3,3.943793,4.055449,3.756591,4.0
4,4.192773,-2.430038,-0.761960,2.0
...,...,...,...,...
19771,-4.341209,-1.368304,1.987170,-495.0
19772,-4.103661,3.329037,-4.119960,-110.5
19773,-4.404154,3.137752,-3.873084,-1144.0
19774,-4.592113,-1.441290,1.477474,6.0


In [13]:
# project new transformed data to existing pca 

fig = plt.figure(2)
ax = fig.add_subplot(111, projection = '3d')

#colors = np.clip(fitnesses.astype(int),-1000., 10000.)
#colors = np.clip(fitnesses_combined.astype(int),0., 10000.)
'''
markers = np.zeros(len(df_projection_combined))
new_count = len(df_projection_combined) - len(projection)
for i in range(len(projection)):
    markers[new_count+i]= 11

ax.scatter(df_projection_combined.PC1, df_projection_combined.PC2, df_projection_combined.PC3, marker=markers, s=5, c = colors, cmap = 'coolwarm', alpha = 0.5)

'''

colors = np.zeros(len(df_projection_combined))
new_count = len(df_projection_combined) - len(projection)
for i in range(len(projection)):
    colors[new_count+i]=1

ax.scatter(df_projection_combined.PC1, df_projection_combined.PC2, df_projection_combined.PC3, c = colors, cmap = 'coolwarm', alpha = 0.5)
#cmap = plt.cm.get_cmap('coolwarm', 2)


# We will then label the three axes using the percentages explained for each major component.
ax.set_xlabel('PC-1, ' +  str(round(pca.explained_variance_ratio_[0]*100,2)) + '% Explained', fontsize=7)
ax.set_ylabel('PC-2, ' +  str(round(pca.explained_variance_ratio_[1]*100,2)) + '% Explained', fontsize=7)
ax.set_zlabel('PC-3, ' +  str(round(pca.explained_variance_ratio_[2]*100,2)) + '% Explained', fontsize=7)


fig.legend(fontsize = 'x-small', loc='upper center', markerscale=2)
plt.autoscale()
plt.rcParams["figure.dpi"] = 100         # set the figure resolution dpi value to 1000


plt.show()


fig_name = '3D_scatterplot_PCA_projection_split.png'
fig.savefig(fig_name)

No handles with labels found to put in legend.


In [14]:
#Plot the scatter digram
fig = plt.figure(3)

colors = np.zeros(len(df_projection_combined))
new_count = len(df_projection_combined) - len(projection)
for i in range(len(projection)):
    colors[new_count+i]=1

plt.scatter(df_projection_combined.PC1, df_projection_combined.PC3, c=colors, cmap = 'coolwarm', alpha = 0.5) 
#plt.title("title")
plt.xlabel("PC-1")
plt.ylabel("PC-3")

plt.show()


# Create max. clusters with kmeans using PCA data


In [15]:
# Set a 3 KMeans clustering
k = 10

kmeans = KMeans(n_clusters=k)
# Compute cluster centers and predict cluster indices
X_clustered = kmeans.fit_predict(pca_result)



In [16]:
#Plot the scatter digram
fig = plt.figure(3)
plt.scatter(df_pca.PC1, df_pca.PC3, c=X_clustered, alpha=0.5) 
#plt.title("title")
plt.xlabel("PC-1")
plt.ylabel("PC-3")

plt.show()

fig_name = 'K_means_2500_split.png'
fig.savefig(fig_name)

In [17]:
# TODO scatter plot of PCA space with clusters in differenet colors

print(y_train.to_numpy())
print(pca_result.shape)

[[   6.]
 [   4.]
 [-287.]
 ...
 [   5.]
 [   5.]
 [  15.]]
(17798, 3)


In [18]:
centroids = kmeans.cluster_centers_

fitnesses = y_train.to_numpy()
cluster_index = []
for i in range(len(X_clustered)):
    b = np.array([i,X_clustered[i]])
    cluster_index.append(b)

fitness_olddata_all = []
c = []
for i in range(k):
    b = []
    d = []
    for j in range(len(X_clustered)):
        if  i == cluster_index[j][1]:
            b.append(fitnesses[j])
            d.append(np.linalg.norm(centroids[i] - pca_result[j]))
    c.append(d)
    fitness_olddata_all.append(b)

In [19]:
    
maxdist_intra_cluster = []  
for i in range(k):
    d = np.amax(c[i])
    maxdist_intra_cluster.append(d)
    
           

    
# medium fitness of every cluster
fitness_olddata_median = []
for i in range(k):
    c = np.median(fitness_olddata_all[i])
    #c = np.mean(fitness_olddata_all[i])
    fitness_olddata_median.append(c)
           

In [20]:
display(len(X_train))
display(len(projection))

17798

1978

In [21]:


# printing Euclidean distance
eucl = []
for j in range(len(projection)):
    a = []
    for i in range(k):
        dist = np.linalg.norm(centroids[i] - projection[j])
        a.append(dist)
    eucl.append(a)
    #print(dist)
#print(eucl[0])

clusters = []
mindist = []
for i in range(len(projection)):
    argmin = np.argmin(eucl[i])
    min = np.amin(eucl[i])
    clusters.append(argmin)
    mindist.append(min) 
    
    
# we define the eps by 
# first calculating per cluster the max. dist. between centroid and the furthest point in the cluster.
# then if our new data point has dist. to the cluster smaller than that distance, we say valid else invalid.
    
    
prediction = []
for i in range(len(projection)):
    b = clusters[i]
    c = fitness_olddata_median[b]
    prediction.append(c)
print(prediction)

#print(fitness_newdata_median)
for i in range(len(projection)):
    if mindist[i] > maxdist_intra_cluster[clusters[i]]:
         prediction[i] = 'invalid'
print(prediction)

#we now have the array of validity and fitness scores

[-111.5, 4.0, 5.0, 5.0, 5.0, 1.0, 3.0, 3.0, 5.0, 0.0, 5.0, 3.0, 5.0, 1.0, 4.0, 3.0, 5.0, 5.0, 1.0, 5.0, 0.0, 3.0, 3.0, 1.0, 0.0, 5.0, 3.0, 5.0, 5.0, 5.0, 1.0, 3.0, 1.0, 3.0, 3.0, 5.0, 1.0, 3.0, 5.0, 5.0, 3.0, 3.0, 1.0, 5.0, 3.0, 4.0, 3.0, 3.0, 5.0, 5.0, 5.0, 5.0, 3.0, 5.0, 3.0, 5.0, 5.0, 3.0, 4.0, 5.0, 3.0, 5.0, 5.0, 0.0, 3.0, 5.0, 5.0, 5.0, 4.0, 1.0, 5.0, 5.0, 1.0, 3.0, 5.0, 3.0, 5.0, 3.0, 3.0, 5.0, 1.0, 1.0, 5.0, 3.0, 3.0, 4.0, 1.0, 1.0, 3.0, 1.0, 5.0, 3.0, 5.0, 5.0, 3.0, 4.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 3.0, 5.0, 5.0, 3.0, 4.0, 1.0, 5.0, 0.0, 3.0, 3.0, 5.0, 3.0, 5.0, 5.0, 3.0, 3.0, 4.0, 5.0, -111.5, 3.0, 0.0, 5.0, 3.0, 5.0, 5.0, 1.0, 3.0, 5.0, 5.0, 5.0, 3.0, 5.0, 3.0, 5.0, 3.0, -111.5, 3.0, 5.0, 1.0, 1.0, 5.0, 5.0, 3.0, -111.5, 3.0, 5.0, 5.0, 5.0, 1.0, 5.0, 5.0, 3.0, 1.0, 3.0, 5.0, 0.0, 1.0, 1.0, 3.0, 5.0, 3.0, 5.0, 3.0, 3.0, 3.0, 5.0, 0.0, 5.0, 3.0, 5.0, 5.0, 1.0, 5.0, 5.0, -111.5, 1.0, 1.0, 5.0, 3.0, 5.0, 5.0, 5.0, 3.0, 4.0, 5.0, 3.0, 1.0, 1.0, 5.0, 4.0, 1.0, 5.0, 3.0, 5.0,

In [22]:
percentage = np.zeros(len(prediction))

for i in range(len(prediction)):
    if prediction[i]!='invalid':
        if prediction[i]>fitnesses[i]-10 and prediction[i]<fitnesses[i]+10:
            percentage[i]=1
            
b=0
for i in range(len(prediction)):
    if percentage[i]==1:
            b+=1
            
j=0
for i in range(len(prediction)):    
    if prediction[i]=='invalid':
            j+=1        
        
right_prozent = b/(len(prediction)-j)

print(right_prozent,j,b,len(prediction)) #j invalid amount, b where fitness was +- 10 "good", total fitness size

0.6405460060667341 0 1267 1978


In [27]:
'''
from sklearn.metrics import precision_recall_fscore_support

y_true = y_test
y_pred = prediction
precision_recall_fscore_support(y_true, y_pred, average='micro')
'''

"\nfrom sklearn.metrics import precision_recall_fscore_support\n\ny_true = y_test\ny_pred = prediction\nprecision_recall_fscore_support(y_true, y_pred, average='micro')\n"