
<h2 id="Project:-PCA-and-tSNE">Project: PCA and tSNE<a class="anchor-link" href="#Project:-PCA-and-tSNE">¶</a></h2><p>Welcome to the project on PCA and tSNE. In this project, we will be using the auto-mpg dataset.</p>
<hr/>
<h2 id="Objective:">Objective:<a class="anchor-link" href="#Objective:">¶</a></h2><hr/>
<p>The objective of this problem is to explore the data, reduce the number of features using dimensionality reduction techniques such as PCA and t-SNE, and generate meaningful insights.</p>
<hr/>
<h2 id="Dataset:">Dataset:<a class="anchor-link" href="#Dataset:">¶</a></h2><hr/>
<p>There are 8 variables in the data:</p>
<ul>
<li>mpg: miles per gallon</li>
<li>cyl: number of cylinders</li>
<li>disp: engine displacement (cu. inches) or engine size</li>
<li>hp: horsepower</li>
<li>wt: vehicle weight (lbs.)</li>
<li>acc: time taken to accelerate from 0 to 60 mph (sec.)</li>
<li>yr: model year</li>
<li>car name: car model name</li>
</ul>



<h2 id="Importing-necessary-libraries-and-overview-of-the-dataset">Importing necessary libraries and overview of the dataset<a class="anchor-link" href="#Importing-necessary-libraries-and-overview-of-the-dataset">¶</a></h2>


In [None]:


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#to scale the data using z-score 
from sklearn.preprocessing import StandardScaler

#importing PCA and TSNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE





<h4 id="Loading-data">Loading data<a class="anchor-link" href="#Loading-data">¶</a></h4>


In [None]:


data = pd.read_csv("auto-mpg.csv")




In [None]:


data.head()





<h4 id="Check-the-info-of-the-data">Check the info of the data<a class="anchor-link" href="#Check-the-info-of-the-data">¶</a></h4>


In [None]:


data.info()





<p><strong>Observations:</strong></p>
<ul>
<li>There are 398 observations and 8 columns in the data.</li>
<li>All variables except horsepower and car name are of numeric data type.</li>
<li>Horsepower should be of numeric data type. We will explore this further.</li>
</ul>



<h2 id="Data-Preprocessing-and-Exploratory-Data-Analysis">Data Preprocessing and Exploratory Data Analysis<a class="anchor-link" href="#Data-Preprocessing-and-Exploratory-Data-Analysis">¶</a></h2>


In [None]:


data["car name"].nunique()





<ul>
<li>The column 'car name' is of object data type, contains many unique entries, and would not add value to our analysis. We can drop this column.</li>
</ul>


In [None]:


# keeping a copy of the original data, then dropping 'car name'
data1 = data.copy()
data = data.drop(['car name'], axis=1)





<h4 id="Checking-values-in-horsepower-column">Checking values in horsepower column<a class="anchor-link" href="#Checking-values-in-horsepower-column">¶</a></h4>


In [None]:


# checking if there are values other than digits in the column 'horsepower'
hpIsDigit = pd.DataFrame(data.horsepower.str.isdigit())  # True if the string is made only of digits, else False

# showing the rows where horsepower is not made only of digits
data[hpIsDigit['horsepower'] == False]





<p><strong>Observations:</strong></p>
<ul>
<li>There are 6 observations where horsepower is recorded as '?'.</li>
<li>We can consider these values as missing values.</li>
<li>Let's impute these missing values and change the data type of the horsepower column.</li>
<li>First, we need to replace the '?' with np.nan.</li>
</ul>
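<p>The steps listed above can be sketched on a small toy column. The values below are made up for illustration and are not taken from auto-mpg.csv; the sketch assumes that the only non-numeric entry is the '?' placeholder.</p>

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'horsepower'; the values are made up for illustration
toy = pd.DataFrame({"horsepower": ["130", "165", "?", "150", "?", "95"]})

# Step 1: turn the '?' placeholder into a proper missing value
toy["horsepower"] = toy["horsepower"].replace("?", np.nan)

# Step 2: convert the remaining strings to numbers (NaN stays NaN)
toy["horsepower"] = pd.to_numeric(toy["horsepower"])

# Step 3: fill the gaps with the median of the observed values
median_hp = toy["horsepower"].median()
toy["horsepower"] = toy["horsepower"].fillna(median_hp)

print(median_hp)
print(toy["horsepower"].tolist())
```

<p>Converting to a numeric type before taking the median keeps the calculation on numbers rather than strings, which is one reasonable ordering of the steps.</p>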


In [None]:


#Replacing ? with np.nan
data = data.replace('?', np.nan)
data[hpIsDigit['horsepower'] == False]




In [None]:


# Converting the hp column from object data type to float so the median is computed on numbers
data['horsepower'] = pd.to_numeric(data['horsepower'], errors='coerce')
# Imputing the missing values with the median value
data['horsepower'] = data['horsepower'].fillna(data['horsepower'].median())





<h4 id="Summary-Statistics">Summary Statistics<a class="anchor-link" href="#Summary-Statistics">¶</a></h4>



<h4 id="Question-1:">Question 1:<a class="anchor-link" href="#Question-1:">¶</a></h4><ul>
<li><strong>Check the summary statistics of the data (use describe function) (1 Mark)</strong></li>
<li><strong>Write your observations (1 Mark)</strong></li>
</ul>


In [None]:


#Write your code here
data.describe()





<p><strong>Observations:</strong></p>
<ul>
<li>For mpg, weight, acceleration, and model year, the mean and the median are close to each other.</li>
<li>Displacement has a large standard deviation relative to its mean.</li>
<li>All cars have model years between 1970 and 1982.</li>
</ul>



<h4 id="Let's-check-the-distribution-and-outliers-for-each-column-in-the-data">Let's check the distribution and outliers for each column in the data<a class="anchor-link" href="#Let's-check-the-distribution-and-outliers-for-each-column-in-the-data">¶</a></h4>




<ul>
<li><strong>Create a histogram to check the distribution of all variables (use the .hist() method)</strong></li>
<li><strong>Create a boxplot to visualize outliers for all variables (use sns.boxplot())</strong></li>
<li><strong>Write your observations</strong></li>
</ul>


In [None]:


# Printing the skewness and plotting a histogram and a boxplot for each column

for col in data.columns:
    print(col)
    print('Skew :', round(data[col].skew(), 2))
    plt.figure(figsize=(15, 4))
    plt.subplot(1, 2, 1)
    data[col].hist()
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=data[col])
    plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>Acceleration appears to be approximately normally distributed.</li>
<li>Horsepower has quite a few outliers on the high side.</li>
<li>Model year resembles a uniform distribution.</li>
</ul>
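<p>The points a boxplot draws as outliers can also be counted directly. The sketch below applies the 1.5 * IQR whisker rule, which is the default in sns.boxplot, to a made-up series; the helper name count_iqr_outliers and the values are hypothetical and only illustrate the rule.</p>

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series) -> dict:
    """Count values outside the 1.5 * IQR whiskers that a boxplot draws by default."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return {
        "low": int((series < lower).sum()),
        "high": int((series > upper).sum()),
    }

# Made-up values with two unusually large entries
toy_hp = pd.Series([70, 75, 80, 85, 90, 95, 100, 105, 200, 230])
print(count_iqr_outliers(toy_hp))
```

<p>Applied to a column such as data['horsepower'], the same helper would give a count to go alongside the visual impression from the boxplot.</p>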



<h4 id="Checking-correlation">Checking correlation<a class="anchor-link" href="#Checking-correlation">¶</a></h4>


In [None]:


plt.figure(figsize=(8,8))
sns.heatmap(data.corr(), annot=True)
plt.show()





<p><strong>Observations:</strong></p>
<ul>
<li>The variable mpg has strong negative correlation with cylinders, displacement, horsepower, and weight.</li>
<li>horsepower and acceleration are negatively correlated.</li>
<li>The variable weight has a strong positive correlation with horsepower, displacement, and cylinders.</li>
<li>model year is positively correlated with mpg.</li>
</ul>
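<p>Strongly correlated pairs can also be pulled out of the correlation matrix programmatically rather than read off the heatmap. The sketch below is a minimal illustration on synthetic data; the helper strong_pairs and the 0.7 threshold are assumptions, not part of the original analysis.</p>

```python
import numpy as np
import pandas as pd

def strong_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation is at least `threshold`."""
    corr = df.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs[pairs.abs() >= threshold].sort_values()

# Synthetic data: 'b' rises with 'a', 'c' falls with 'a', 'd' is unrelated noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),
    "c": -a + rng.normal(scale=0.1, size=200),
    "d": rng.normal(size=200),
})
result = strong_pairs(toy)
print(result)
```

<p>Many strongly correlated pairs, as seen among weight, horsepower, displacement, and cylinders, suggest redundancy that dimensionality reduction can exploit.</p>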



<h4 id="Scaling-the-data">Scaling the data<a class="anchor-link" href="#Scaling-the-data">¶</a></h4>


In [None]:


# scaling the data
scaler=StandardScaler()
data_scaled=pd.DataFrame(scaler.fit_transform(data), columns=data.columns)




In [None]:


data_scaled.head()





<h2 id="Principal-Component-Analysis">Principal Component Analysis<a class="anchor-link" href="#Principal-Component-Analysis">¶</a></h2>



<ul>
<li><strong>Apply the PCA algorithm with number of components equal to the total number of columns in the data with random_state=1</strong></li>
<li><strong>Write observations on the variance explained by components</strong></li>
</ul>


In [None]:


#Defining the number of principal components to generate 
n=data_scaled.shape[1]

#Finding principal components for the data
pca = PCA(n_components=n, random_state=1)  #Apply the PCA algorithm with random state = 1
data_pca1 = pd.DataFrame(pca.fit_transform(data_scaled))  #Fit and transform the pca function on scaled data

#The percentage of variance explained by each principal component
exp_var = pca.explained_variance_ratio_




In [None]:


# visualize the cumulative variance explained by the components
plt.figure(figsize = (10,10))
plt.plot(range(1, len(exp_var) + 1), exp_var.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()




In [None]:


# find the least number of components that can explain more than 90% variance
cumulative_var = 0  # running total; avoids shadowing the built-in sum()
for ix, i in enumerate(exp_var):
    cumulative_var = cumulative_var + i
    if cumulative_var > 0.90:
        print("Number of PCs that explain more than 90% variance: ", ix + 1)
        break





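<p><strong>Observations: In the recorded run, the first principal component accounts for over 99% of the variance; the remainder is spread across components 2, 3, and 4. A share this large is typical of PCA fitted on unscaled features, where weight (in lbs.) dominates; fitting on data_scaled would be expected to spread the variance across more components.</strong></p>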
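<p>The effect of feature scale on the explained variance can be checked on synthetic data. The sketch below is an illustration only: the three features are randomly generated and independent (they are not the auto-mpg columns), with one feature on a much larger scale, loosely mimicking weight in lbs. next to acceleration in seconds.</p>

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_rows = 300

# Three independent synthetic features; the first is on a far larger scale
X = np.column_stack([
    rng.normal(loc=3000, scale=800, size=n_rows),
    rng.normal(loc=15, scale=3, size=n_rows),
    rng.normal(loc=6, scale=2, size=n_rows),
])

# PCA on the raw values: the large-scale feature dominates the total variance
ratio_raw = PCA(n_components=3, random_state=1).fit(X).explained_variance_ratio_

# PCA after z-score scaling: every feature contributes unit variance
X_scaled = StandardScaler().fit_transform(X)
ratio_scaled = PCA(n_components=3, random_state=1).fit(X_scaled).explained_variance_ratio_

print("unscaled:", np.round(ratio_raw, 4))
print("scaled:  ", np.round(ratio_scaled, 4))
```

<p>On the raw values nearly all variance lands on the first component, while after scaling it is shared among the components. This is why PCA is usually fitted on standardized data when the features use different units.</p>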


In [None]:


pc_comps = ['PC1','PC2','PC3']
data_pca = pd.DataFrame(np.round(pca.components_[:3,:],2),index=pc_comps,columns=data_scaled.columns)
data_pca.T




<h4 id="Question-4:-Interpret-the-coefficients-of-three-principal-components-from-the-below-dataframe-(6-Marks)"><strong>Interpret the coefficients of three principal components from the below dataframe</strong><a class="anchor-link" href="#Question-4:-Interpret-the-coefficients-of-three-principal-components-from-the-below-dataframe-(6-Marks)">¶</a></h4>


In [None]:


def color_high(val):
    if val <= -0.40:  # you can decide any value as per your understanding
        return 'background: pink'
    elif val >= 0.40:
        return 'background: skyblue'
    return ''  # no highlight for coefficients between the two thresholds

data_pca.T.style.applymap(color_high)





<p><strong>Observations: For PC1, the largest coefficient belongs to weight, which also has the strongest negative coefficient for PC2.
The largest positive coefficient for PC2 belongs to displacement, which has one of the strongest negative coefficients for PC3.</strong></p>
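<p>The dominant feature of each component can also be picked out programmatically. The sketch below uses a hypothetical loadings table laid out like data_pca.T (features as rows, components as columns); the feature names and numbers are invented for illustration.</p>

```python
import pandas as pd

# Hypothetical loadings table: features as rows, principal components as columns
loadings = pd.DataFrame(
    {"PC1": [0.45, -0.40, 0.10], "PC2": [-0.20, 0.15, 0.90], "PC3": [0.30, 0.80, -0.05]},
    index=["feature_a", "feature_b", "feature_c"],
)

# For each component, pick the feature with the largest absolute coefficient
dominant = loadings.abs().idxmax()

# Recover the signed coefficient so the direction of the relationship stays visible
signed = pd.Series({pc: loadings.loc[feat, pc] for pc, feat in dominant.items()})

print(pd.DataFrame({"feature": dominant, "coefficient": signed}))
```

<p>Looking at absolute values first and the sign second keeps a large negative coefficient from being overlooked.</p>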



<h4 id="We-can-also-visualize-the-data-in-2-dimensions-using-first-two-principal-components">We can also visualize the data in 2 dimensions using first two principal components<a class="anchor-link" href="#We-can-also-visualize-the-data-in-2-dimensions-using-first-two-principal-components">¶</a></h4>


In [None]:


plt.figure(figsize = (7,7))
sns.scatterplot(x=data_pca1[0],y=data_pca1[1])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()





<p><strong>Let's try adding hue to the scatter plot</strong></p>



<ul>
<li><strong>Create a scatter plot for first two principal components with hue = 'cylinders'</strong></li>
<li><strong>Write your observations on the plot</strong></li>
</ul>


In [None]:


df_concat = pd.concat([data_pca1, data], axis=1)

plt.figure(figsize = (7,7))
#Create a scatter plot with x=0 and y=1 using df_concat dataframe
sns.scatterplot(x=df_concat[0],y=df_concat[1], hue = df_concat["cylinders"])
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()





<p><strong>Observations: We appear to have 3 distinct groups, with a few scattered points between them. There seems to be a trend between PC1 and PC2 within all 3 groups.</strong></p>



<h2 id="t-SNE">t-SNE<a class="anchor-link" href="#t-SNE">¶</a></h2>



<ul>
<li><strong>Apply the TSNE embedding with 2 components for the dataframe data_scaled (use random_state=1)</strong></li>
<li><strong>Write your observations on the below scatter plots</strong></li>
</ul>


In [None]:


tsne = TSNE(n_components = 2, random_state=1)  #Apply the TSNE algorithm with random state = 1
data_tsne = tsne.fit_transform(data_scaled) #Fit and transform tsne function on the scaled data




In [None]:


data_tsne.shape




In [None]:


data_tsne = pd.DataFrame(data = data_tsne, columns = ['Component 1', 'Component 2'])




In [None]:


data_tsne.head()




In [None]:


sns.scatterplot(x=data_tsne.iloc[:,0],y=data_tsne.iloc[:,1])




In [None]:


# Let's see scatter plot of the data w.r.t number of cylinders
sns.scatterplot(x=data_tsne.iloc[:,0],y=data_tsne.iloc[:,1],hue=data.cylinders)





<p><strong>Observations: The t-SNE embedding shows 3 clusters, which indicates a strong pattern in the data.</strong></p>
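<p>The apparent number and shape of clusters in a t-SNE plot can depend on the perplexity setting, so it can be worth checking the embedding at a few values. The sketch below does this on three synthetic blobs (not the auto-mpg data); the perplexity values 5, 15, and 30 are arbitrary choices for illustration.</p>

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Three well-separated synthetic blobs in 5 dimensions, 40 points each
centers = np.array([[0] * 5, [8] * 5, [-8] * 5], dtype=float)
X = np.vstack([c + rng.normal(size=(40, 5)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit one embedding per perplexity value so the results can be compared side by side
embeddings = {}
for perplexity in (5, 15, 30):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=1)
    embeddings[perplexity] = tsne.fit_transform(X_scaled)
    print(perplexity, embeddings[perplexity].shape)
```

<p>Each entry of embeddings can be passed to sns.scatterplot in the same way as data_tsne. Clusters that persist across perplexity values are more likely to reflect structure in the data than an artifact of one setting.</p>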


In [None]:


# Let's assign points to 3 different groups
def grouping(x):
    first_component = x['Component 1']
    second_component = x['Component 2']
    if (first_component> 0) and (second_component >0): 
        return 'group_1'
    if (first_component >-20 ) and (first_component < 5):
        return 'group_2'
    else: 
        return 'group_3'




In [None]:


data_tsne['groups'] = data_tsne.apply(grouping,axis=1)




In [None]:


sns.scatterplot(x=data_tsne.iloc[:,0],y=data_tsne.iloc[:,1],hue=data_tsne.iloc[:,2])




In [None]:


data['groups'] = data_tsne['groups'] 





<ul>
<li><strong>Complete the following code by filling the blanks</strong></li>
<li><strong>Write your observations on different groups w.r.t different variables</strong></li>
</ul>


In [None]:


all_col = data.columns.tolist()
plt.figure(figsize=(20, 20))

for i, variable in enumerate(all_col):
    if variable == 'groups':  # 'groups' is the last column; it goes on the x-axis, so stop here
        break
    plt.subplot(4, 2, i + 1)
    #Create boxplot with groups on the x-axis and variable on the y-axis (use the dataframe data)
    sns.boxplot(y=data[variable], x=data['groups'])
    plt.title(variable)
plt.tight_layout()
plt.show()


