### The Spine dataset contains information about patients belonging to one of three categories of lumbar spine condition:
1) Normal, 2) Disk Hernia and 3) Spondylolisthesis, the last two being abnormal. Each patient is represented in the dataset by six biomechanical attributes derived from the shape and orientation of the pelvis and lumbar spine (in this order): pelvic incidence (PI), pelvic tilt (PT), lumbar lordosis angle (LL), sacral slope (SS), pelvic radius (PR) and grade of spondylolisthesis (GS).

In [None]:
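import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
df = pd.read_csv('spine.csv')
#Get a summary of the data; the "count" row shows whether any values are missing
df.describe(include="all")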
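
The summary reveals missing values only indirectly, through the "count" row. A more direct check is sketched below on a small made-up frame (the `toy` data and its values are hypothetical stand-ins, not the contents of spine.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for spine.csv (values are made up)
toy = pd.DataFrame({
    "PI": [63.0, 39.1, np.nan, 69.3],
    "GS": [-0.25, 4.56, -3.53, 11.2],
    "Categories": ["Hernia", "Hernia", "Normal", None],
})

# Number of missing entries per column; all zeros means nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = toy[toy.isna().any(axis=1)]
print(rows_with_missing)
```

Applied to the real data, the same two calls on `df` would show whether any column needs cleaning before the analysis.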

Compute the mean and standard deviation for the PI variable, and the mean, median and standard deviation for the GS variable. Use the agg function.


In [None]:
#The task asks for the standard deviation of PI, so use 'std' rather than 'var'
df.agg({'PI':['mean','std'],'GS':['mean','std','median']})

Group the whole dataframe based on the “Categories” (last column in Spine dataset). Compute the mean and standard deviation associated with each group for all variables. What do you think about the differences in means and standard deviations of variables among the levels of the variable “Categories”?

In [None]:
#Group by the category label, then compute mean and standard deviation per group
df2=df.groupby('Categories')
df2.agg(['mean','std'])

Compare the boxplots of the variable GS across the groups of the variable “Categories”. Place all boxplots in a single plot (see googleapps.py file).

In [None]:
categories=df['Categories'].unique()
#One array of GS values per category
l=[df['GS'].loc[df['Categories']==x] for x in categories]
#Notched boxplots of GS, one per category, in a single plot
box=plt.boxplot(l,notch=True,labels=categories)
colors = ['cyan', 'lightblue', 'lightgreen']
for patch, color in zip(box['boxes'], colors):
    patch.set_color(color)

Scale the data and apply PCA on the scaled dataset

In [None]:
X=df.iloc[:,0:6]
X=scale(X)
pca=PCA()
pca.fit(X)
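
As a rough illustration of what the scaling step does, the sketch below compares `scale` with a manual standardization on made-up data (the array `raw` is hypothetical, not the Spine data). Note that `scale` divides by the population standard deviation (`ddof=0`), whereas pandas `std` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import scale

rng = np.random.default_rng(3)
# Hypothetical data: 40 observations of 3 variables
raw = rng.normal(loc=50, scale=12, size=(40, 3))

# Standardization as done by sklearn
Z = scale(raw)

# The same thing by hand: subtract the column mean, divide by the column std (ddof=0)
manual = (raw - raw.mean(axis=0)) / raw.std(axis=0)

print(np.allclose(Z, manual))
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so PCA on the scaled data works with the correlation structure rather than with the raw units.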

Provide the matrix of directions (loadings). Interpret the first three directions. This is the “W” matrix in the slides.

In [None]:
W=pca.components_.T
pd.DataFrame(W[:,:3],index=df.columns[:-1],columns=['PC1','PC2','PC3'])
         PC1       PC2       PC3
PI  0.535142 -0.002194 -0.096069
PT  0.323585  0.527545 -0.648701
LL  0.457970  0.092875  0.152338
SS  0.445906 -0.396157  0.360313
PR -0.143497  0.727756  0.585991
GS  0.423978  0.162777  0.271184
#PC1 contrasts the average of PI, PT, LL, SS and GS (positive loadings) against PR (negative loading).
#PC2 contrasts the average of PT, LL, PR and GS against SS; PI contributes almost nothing (loading close to 0).
#PC3 contrasts the average of PI and PT against the average of LL, SS, PR and GS.
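
To make the role of the “W” matrix concrete, the sketch below checks two of its properties on made-up data (the array `raw` is hypothetical, not the Spine data): the columns of W are orthonormal, and the PC scores are the centered data multiplied by W.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations of 4 correlated variables
raw = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))

Z = scale(raw)                  # column-wise standardization, as in the notebook
pca_toy = PCA().fit(Z)
W_toy = pca_toy.components_.T   # columns are the directions (loadings)

# The directions are orthonormal, so W'W should be the identity matrix
gram = W_toy.T @ W_toy

# Scores are the projections of the centered data onto the directions
scores_manual = (Z - Z.mean(axis=0)) @ W_toy
scores_sklearn = pca_toy.transform(Z)

print(np.allclose(gram, np.eye(4)))
print(np.allclose(scores_manual, scores_sklearn))
```

The sign of each direction is arbitrary, so only the pattern of positive versus negative loadings within a column should be interpreted.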

Compute the cumulative explained variance ratio

In [None]:
pd.DataFrame(pca.explained_variance_ratio_.cumsum(),index=np.arange(X.shape[1])+1,columns=['Explained Variability'])
   Explained Variability
1               0.540964
2               0.740061
3               0.866909
4               0.945664
5               1.000000
6               1.000000
#The first two components explain about 74% of the variability, so we can work with these two components.
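
The cumulative ratio already reaches 1.0 at the fifth component, which suggests that one of the six variables is an exact linear combination of the others. A plausible explanation, assumed here rather than checked against spine.csv, is the geometric identity PI = PT + SS. The sketch below reproduces the effect on made-up numbers (all values are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-ins for PT and SS (made-up numbers, not the Spine data)
pt = rng.normal(17, 10, size=200)
ss = rng.normal(43, 13, size=200)
pi = pt + ss                       # exact linear dependence, assumed for illustration
other = rng.normal(size=200)       # an unrelated fourth variable

data = np.column_stack([pi, pt, ss, other])
Z = (data - data.mean(axis=0)) / data.std(axis=0)

# Eigenvalues of the correlation matrix are the PCA variances of the standardized data
eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
cum_ratio = np.cumsum(eigvals) / eigvals.sum()

print(eigvals.round(6))
print(cum_ratio.round(6))
```

With four variables and one exact dependence, the last eigenvalue is numerically zero and the cumulative ratio hits 1.0 one component early, the same pattern as in the table above.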

Draw the scree plot

In [None]:
plt.bar(np.arange(1,X.shape[1]+1),pca.explained_variance_,color="blue",edgecolor="Red")

Make a scatter plot of the first two PC scores

In [None]:
Y=pca.fit_transform(X) #PC scores
plt.figure(1)
plt.scatter(Y[:,0],Y[:,1],c="red",marker='o',alpha=0.5)
plt.xlabel('PC Scores 1')
plt.ylabel('PC Scores 2')
xs=Y[:,0]
ys=Y[:,1]
#Overlay one arrow per variable, scaled to the largest score on each axis (biplot)
for i in range(len(W[:,0])):
    plt.arrow(np.mean(xs), np.mean(ys), W[i,0]*max(xs), W[i,1]*max(ys),
              color='b', width=0.0005, head_width=0.0025)
    plt.text(W[i,0]*max(xs)+np.mean(xs), np.mean(ys)+W[i,1]*max(ys),
             list(df.columns.values)[i], color='b')

The scatter plot shows an outlier with a PC1 score above 7. Which observation does it correspond to?

In [None]:
np.where(Y[:,0]>7)
(array([115], dtype=int64),)

On which variable(s) do you think this outlier has the highest values?

In [None]:
#Based on the biplot, the outlier is probably driven by extremely large values on SS, PI, LL and GS. It may also have a small value on PR.
#Let’s take a look at this observation:
df.iloc[115,:]
#Compare it to the maximum values of the variables:
df.agg('max')
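
Comparing against the column maxima shows whether the observation is the largest, but not by how much. One possible complement, sketched below on made-up data (the frame `toy` and the planted extreme value are hypothetical), is to standardize each column and rank the variables of a single row by absolute z-score.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical numeric frame; row 3 is deliberately made extreme on "GS"
toy = pd.DataFrame(rng.normal(size=(30, 3)), columns=["PI", "PR", "GS"])
toy.loc[3, "GS"] = 25.0

# Standardize each column, then rank one row's variables by absolute z-score
z = (toy - toy.mean()) / toy.std()
row_z = z.loc[3].sort_values(key=np.abs, ascending=False)
print(row_z)
```

On the Spine data, the same idea would use the numeric columns of `df` and row 115 in place of the toy frame and row 3.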
