**Introduction**
___
- Tidy data
    - every column is a feature
    - every row is an observation, holding one value for each feature
    - Pandas dataframe .shape attribute
- high dimensionality: more than 10 columns (rule of thumb)
- When to use dimensionality reduction?
    - drop columns with no variance (i.e. same values)
    - Pandas dataframe .describe() method
        - no variance = std = 0, max and min are the same
        - .describe(exclude='number') shows summary statistics for the non-numeric columns
___
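
The checks listed above can be sketched on a small made-up dataframe. The column names and values below are hypothetical stand-ins, not the actual `pokemon_df` sample, and the sketch only assumes that pandas is installed.

```python
import pandas as pd

# Hypothetical stand-in for a small dataset (not the real Pokemon sample)
toy_df = pd.DataFrame({
    "HP": [45, 60, 80, 39],
    "Attack": [49, 62, 82, 52],
    "Generation": [1, 1, 1, 1],             # numeric feature without variance
    "Name": ["A", "B", "C", "D"],
    "Legendary": ["no", "no", "no", "no"],  # non-numeric feature without variance
})

print(toy_df.shape)  # (number of rows, number of columns)

# Numeric features: a standard deviation of 0 (min equals max) means no variance
numeric_stats = toy_df.describe()
constant_numeric = [c for c in numeric_stats.columns
                    if numeric_stats.loc["std", c] == 0]

# Non-numeric features: a single unique value means no variance
text_stats = toy_df.describe(exclude="number")
constant_text = [c for c in text_stats.columns
                 if text_stats.loc["unique", c] == 1]

# Drop every feature that carries no information
reduced_df = toy_df.drop(constant_numeric + constant_text, axis=1)
print(reduced_df.columns.tolist())
```

Reading the `std` and `unique` rows programmatically is just one option; on a dataset this small, inspecting the `.describe()` output by eye works equally well.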

In [None]:
#Removing features without variance

#A sample of the Pokemon dataset has been loaded as pokemon_df. To
#get an idea of which features have little variance you should use
#the IPython Shell to calculate summary statistics on this sample.
#Then adjust the code to create a smaller, easier to understand,
#dataset.

# Leave this list as is
number_cols = ['HP', 'Attack', 'Defense']

# Remove the feature without variance from this list
non_number_cols = ['Name', 'Type']

# Create a new dataframe by subselecting the chosen features
#df_selected = pokemon_df[number_cols + non_number_cols]

# Prints the first 5 lines of the new dataframe
#print(df_selected.head())

#################################################
#<script.py> output:
#       HP  Attack  Defense                   Name   Type
#    0  45      49       49              Bulbasaur  Grass
#    1  60      62       63                Ivysaur  Grass
#    2  80      82       83               Venusaur  Grass
#    3  80     100      123  VenusaurMega Venusaur  Grass
#    4  39      52       43             Charmander   Fire
#################################################
#All Pokemon in this dataset are non-legendary and from generation
#one so you could choose to drop those two features.

**Feature selection vs feature extraction**
___
- Why reduce dimensionality?
    - your dataset will:
        - be less complex
        - require less disk space
        - have lower chance of model overfitting
- Feature selection
    - .drop('column name', axis=1) [axis indicates column instead of row]
- Building a pairplot
    - sns.pairplot(data, hue='', diag_kind='hist')
- Feature extraction
    - calculating new feature(s) from original feature(s)
___
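
The difference between the two approaches can be illustrated with a minimal sketch. The dataframe, its column names and the body mass index calculation are assumptions chosen for illustration; they are not taken from the datasets used in these notes.

```python
import pandas as pd

# Hypothetical body measurements (values invented for illustration)
body_df = pd.DataFrame({
    "weight_kg": [81.5, 72.6, 92.9],
    "height_m": [1.78, 1.70, 1.74],
})

# Feature extraction: calculate one new feature from two original features
body_df["bmi"] = body_df["weight_kg"] / body_df["height_m"] ** 2

# Feature selection: drop the original features that the new one summarizes
extracted_df = body_df.drop(["weight_kg", "height_m"], axis=1)
print(extracted_df.round(1))
```

Whether dropping the original columns is acceptable depends on the model; the extracted feature keeps only part of the original information.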

In [None]:
#Visually detecting redundant features
#Data visualization is a crucial step in any data exploration. Let's
#use Seaborn to explore some samples of the US Army ANSUR body
#measurement dataset.

#Two data samples have been pre-loaded as ansur_df_1 and ansur_df_2.

#Seaborn has been imported as sns.

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(ansur_df_1, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.1.svg](_images/13.1.svg)

In [None]:
#Two features are basically duplicates, remove one of them from
#the dataset.

# Remove one of the redundant features
#reduced_df = ansur_df_1.drop('stature_m', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(reduced_df, hue='Gender')

# Show the plot
#plt.show()

![_images/13.2.svg](_images/13.2.svg)
The body height (inches) and stature (meters) features hold the same information in different units.
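
Why a unit conversion makes a feature redundant can be checked numerically. The sketch below uses invented heights, and the column name `body_height` is an assumption; only `stature_m` appears in the exercise code.

```python
import pandas as pd

# Hypothetical heights in meters (invented values)
height_df = pd.DataFrame({"stature_m": [1.62, 1.75, 1.81, 1.68]})

# The same measurement expressed in inches (1 inch = 0.0254 m)
height_df["body_height"] = height_df["stature_m"] / 0.0254

# A pure unit conversion is a perfect linear relationship,
# so the Pearson correlation is 1 up to floating point error
r = height_df["stature_m"].corr(height_df["body_height"])
print(round(r, 6))
```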

In [None]:
#Now create a pairplot of the ansur_df_2 data sample and color the
#points using the 'Gender' feature.

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(ansur_df_2, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.3.svg](_images/13.3.svg)

In [None]:
#One feature has no variance, remove it from the dataset.
# Remove the redundant feature
#reduced_df = ansur_df_2.drop('n_legs', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(reduced_df, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.4.svg](_images/13.4.svg)
All the individuals in the second sample have two legs, so the n_legs feature has no variance.

**t-SNE visualization of high-dimensional data**
___
- t-distributed stochastic neighbor embedding
- t-SNE maximizes the distance in 2-dimensional space between observations that are most different in the higher-dimensional space
- does not work with non-numeric values
- learning rate (typically 10 to 1000): lower values are more conservative
___
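
A minimal sketch of the idea on synthetic data, assuming scikit-learn and NumPy are installed. The two groups of points are randomly generated for illustration and have nothing to do with the ANSUR data.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic data: two well-separated groups of 50 points in 10 dimensions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
group_b = rng.normal(loc=8.0, scale=1.0, size=(50, 10))
X = np.vstack([group_a, group_b])

# Learning rate on the conservative side; random_state fixes the stochastic result
model = TSNE(n_components=2, learning_rate=50, random_state=0)
embedding = model.fit_transform(X)

# One (x, y) pair per observation, ready for a scatterplot
print(embedding.shape)
```

Because t-SNE is stochastic, the exact coordinates change between runs unless `random_state` is set; only the relative grouping of points is meaningful.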

In [None]:
#Fitting t-SNE to the ANSUR data

#t-SNE is a great technique for visual exploration of high dimensional
#datasets. In this exercise, you'll apply it to the ANSUR dataset. You'll
#remove non-numeric columns from the pre-loaded dataset df and fit TSNE to
#this numeric dataset.

# Non-numerical columns in the dataset
#non_numeric = ['Branch', 'Gender', 'Component']

# Drop the non-numerical columns from df
#df_numeric = df.drop(non_numeric, axis=1)

# Create a t-SNE model with learning rate 50
#m = TSNE(learning_rate=50)

# Fit and transform the t-SNE model on the numeric dataset
#tsne_features = m.fit_transform(df_numeric)

#################################################
#t-SNE reduced the more than 90 features in the dataset to just 2
#which you can now plot.
#################################################

In [None]:
#t-SNE visualisation of dimensionality
#Time to look at the results of your hard work. In this exercise,
#you will visualize the output of t-SNE dimensionality reduction on
#the combined male and female ANSUR dataset. You'll create 3
#scatterplots of the 2 t-SNE features ('x' and 'y') which were
#added to the dataset df. In each scatterplot you'll color the
#points according to a different categorical variable.

#seaborn has already been imported as sns and matplotlib.pyplot as
#plt.

# Color the points according to Army Component
#sns.scatterplot(x="x", y="y", hue='Component', data=df)

# Show the plot
#plt.show()

![_images/13.5.svg](_images/13.5.svg)

In [None]:
# Color the points by Army Branch
#sns.scatterplot(x="x", y="y", hue='Branch', data=df)

# Show the plot
#plt.show()

![_images/13.6.svg](_images/13.6.svg)

In [None]:
# Color the points by Gender
#sns.scatterplot(x="x", y="y", hue='Gender', data=df)

# Show the plot
#plt.show()

![_images/13.7.svg](_images/13.7.svg)
There is a Male and a Female cluster. t-SNE found these gender
differences in body shape without being told about them explicitly!

From the second plot you learned there are more males in the
Combat Arms Branch.

**The curse of dimensionality**
___
- as the number of features increases, the number of observations needed to fit a model without overfitting increases exponentially
___
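
A rough back-of-the-envelope sketch of this growth. The assumption that each feature is split into 10 bins, and the fixed sample of 1,000 observations, are arbitrary choices made for illustration.

```python
# Assumption for illustration: each feature is split into 10 bins, and a model
# needs to see observations in every combination of bins to generalize well.
bins_per_feature = 10

for n_features in (1, 2, 3, 6):
    n_cells = bins_per_feature ** n_features
    print(f"{n_features} feature(s): {n_cells:,} combinations to cover")

# With a fixed number of observations, the data gets sparse very quickly
n_observations = 1_000
density = {d: n_observations / bins_per_feature ** d for d in (1, 2, 3, 6)}
print(density)  # average observations per combination, keyed by feature count
```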


In [None]:
#Train - test split
#In this chapter, you will keep working with the ANSUR dataset.
#Before you can build a model on your dataset, you should first
#decide on which feature you want to predict. In this case, you're
#trying to predict gender.

#You need to extract the column holding this feature from the
#dataset and then split the data into a training and test set. The
#training set will be used to train the model and the test set will
#be used to check its performance on unseen data.

#ansur_df has been pre-loaded for you.

# Import train_test_split()
#from sklearn.model_selection import train_test_split

# Select the Gender column as the feature to be predicted (y)
#y = ansur_df['Gender']

# Remove the Gender column to create the training data
#X = ansur_df.drop('Gender', axis=1)

# Perform a 70% train and 30% test data split
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

#print("{} rows in test set vs. {} in training set. {} Features.".format(X_test.shape[0], X_train.shape[0], X_test.shape[1]))

#################################################
#<script.py> output:
#    300 rows in test set vs. 700 in training set. 91 Features.
#################################################

In [None]:
#Fitting and testing the model

#In the previous exercise, you split the dataset into X_train,
#X_test, y_train, and y_test. These datasets have been pre-loaded
#for you. You'll now create a support vector machine classifier
#model (SVC()) and fit that to the training data. You'll then
#calculate the accuracy on both the test and training set to detect
#overfitting.

# Import SVC from sklearn.svm and accuracy_score from sklearn.metrics
#from sklearn.svm import SVC
#from sklearn.metrics import accuracy_score

# Create an instance of the Support Vector Classification class
#svc = SVC()

# Fit the model to the training data
#svc.fit(X_train, y_train)

# Calculate accuracy scores on both train and test data
#accuracy_train = accuracy_score(y_train, svc.predict(X_train))
#accuracy_test = accuracy_score(y_test, svc.predict(X_test))

#print("{0:.1%} accuracy on test set vs. {1:.1%} on training set".format(accuracy_test, accuracy_train))

#################################################
#<script.py> output:
#    49.7% accuracy on test set vs. 100.0% on training set
#################################################
#Looks like the model badly overfits on the training data. On unseen
#data it performs worse than a random selector would.

In [None]:
#Accuracy after dimensionality reduction

#You'll reduce the overfit with the help of dimensionality reduction.
#In this case, you'll apply a rather drastic form of dimensionality
#reduction by only selecting a single column that has some good
#information to distinguish between genders. You'll repeat the
#train-test split, model fit and prediction steps to compare the
#accuracy on test vs. training data.

#All relevant packages and y have been pre-loaded.

# Assign just the 'neckcircumferencebase' column from ansur_df to X
#X = ansur_df[['neckcircumferencebase']]

# Split the data, instantiate a classifier and fit the data
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
#svc = SVC()
#svc.fit(X_train, y_train)

# Calculate accuracy scores on both train and test data
#accuracy_train = accuracy_score(y_train, svc.predict(X_train))
#accuracy_test = accuracy_score(y_test, svc.predict(X_test))

#print("{0:.1%} accuracy on test set vs. {1:.1%} on training set".format(accuracy_test, accuracy_train))

#################################################
#<script.py> output:
#   93.3% accuracy on test set vs. 94.9% on training set
#################################################
#On the full dataset the model is rubbish but with a single feature
#we can make good predictions? This is an example of the curse of
#dimensionality! The model badly overfits when we feed it too many
#features. It overlooks that neck circumference by itself is pretty
#different for males and females.

**Features with missing values or little variance**
___
- Variance thresholds are not always easy to interpret or compare between features
___
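
One way to make variances comparable is to divide each feature by its mean before comparing, as a pandas-only sketch shows below. The dataframe, its column names and the 0.001 threshold are assumptions for illustration.

```python
import pandas as pd

# Hypothetical measurements on very different scales (invented values)
toy_df = pd.DataFrame({
    "length_mm": [190.0, 200.0, 210.0, 195.0],
    "weight_g": [4000.0, 5500.0, 5000.0, 4500.0],
    "sensor_noise": [1.0000, 1.0001, 0.9999, 1.0000],
})

# Raw variances depend on the unit of each feature
print(toy_df.var())

# Dividing by the mean makes the variances comparable across features
normalized_df = toy_df / toy_df.mean()
variances = normalized_df.var()
print(variances)

# Keep only the features whose normalized variance exceeds the threshold
mask = variances > 0.001
reduced_df = toy_df.loc[:, mask]
print(reduced_df.columns.tolist())
```

Note that pandas `.var()` computes the sample variance (`ddof=1`) by default, so values very close to a threshold can differ slightly from a population-variance calculation.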

In [None]:
#Finding a good variance threshold

#You'll be working on a slightly modified subsample of the ANSUR
#dataset with just head measurements pre-loaded as head_df.

#Create a boxplot on head_df.

# Create the boxplot
#head_df.boxplot()

#plt.show()

![_images/13.8.svg](_images/13.8.svg)

In [None]:
#Normalize the data by dividing the dataframe with its mean values.

# Normalize the data
#normalized_df = head_df / head_df.mean()

#normalized_df.boxplot()
#plt.show()

![_images/13.9.svg](_images/13.9.svg)

In [None]:
#Print the variances of the normalized data.

# Normalize the data
#normalized_df = head_df / head_df.mean()

# Print the variances of the normalized data
#print(normalized_df.var())

#################################################
#<script.py> output:
#    headbreadth          1.678952e-03
#    headcircumference    1.029623e-03
#    headlength           1.867872e-03
#    tragiontopofhead     2.639840e-03
#    n_hairs              1.002552e-08
#    measurement_error    3.231707e-27
#    dtype: float64
#################################################
#Q: If you want to remove the 2 very low variance features.
#What would be a good variance threshold?
#A: 1.0e-03

In [None]:
#Features with low variance

#In the previous exercise you established that 0.001 is a good
#threshold to filter out low variance features in head_df after
#normalization. Now use the VarianceThreshold feature selector to
#remove these features.

#from sklearn.feature_selection import VarianceThreshold

# Create a VarianceThreshold feature selector
#sel = VarianceThreshold(threshold=0.001)

# Fit the selector to normalized head_df
#sel.fit(head_df / head_df.mean())

# Create a boolean mask
#mask = sel.get_support()

# Apply the mask to create a reduced dataframe
#reduced_df = head_df.loc[:, mask]

#print("Dimensionality reduced from {} to {}.".format(head_df.shape[1], reduced_df.shape[1]))

#################################################
#<script.py> output:
#    Dimensionality reduced from 6 to 4.
#################################################
#You've successfully removed the 2 low-variance features.

In [None]:
#Removing features with many missing values

#You'll apply feature selection on the Boston Public Schools
#dataset which has been pre-loaded as school_df. Calculate the
#missing value ratio per feature and then create a mask to remove
#features with many missing values.

#school_df.isna().sum() / len(school_df)
#################################################
#x             0.000000
#y             0.000000
#objectid_1    0.000000
#objectid      0.000000
#bldg_id       0.000000
#bldg_name     0.000000
#address       0.000000
#city          0.000000
#zipcode       0.000000
#csp_sch_id    0.000000
#sch_id        0.000000
#sch_name      0.000000
#sch_label     0.000000
#sch_type      0.000000
#shared        0.877863
#complex       0.984733
#label         0.000000
#tlt           0.000000
#pl            0.000000
#point_x       0.000000
#point_y       0.000000
#dtype: float64
#################################################

In [None]:
#Create a boolean mask on whether each feature has less than 50%
#missing values.

#Apply the mask to school_df to select columns without many missing
#values.

# Create a boolean mask on whether each feature has less than 50% missing values.
#mask = school_df.isna().sum() / len(school_df) < 0.5

# Create a reduced dataset by applying the mask
#reduced_df = school_df.loc[:, mask]

#print(school_df.shape)
#print(reduced_df.shape)

#################################################
#<script.py> output:
#    (131, 21)
#    (131, 19)
#################################################
#The number of features went down from 21 to 19.

**Pairwise correlation**
___
- pairplots
- correlation coefficient
___
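
A small sketch of a correlation matrix and of why half of it is redundant. The dataframe and its column names are invented for illustration and are not part of the ANSUR data.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements (invented values)
toy_df = pd.DataFrame({
    "height": [160.0, 170.0, 180.0, 190.0],
    "leg_length": [75.0, 80.0, 86.0, 90.0],
    "shoe_size": [44.0, 38.0, 45.0, 40.0],
})

# Pairwise Pearson correlation coefficients, always between -1 and 1
corr = toy_df.corr()

# The matrix is symmetric with 1s on the diagonal, so the upper triangle
# (diagonal included) repeats information and can be masked out
mask = np.triu(np.ones(corr.shape, dtype=bool))
lower_only = corr.mask(mask)
print(lower_only.round(2))
```

The same boolean mask can be passed to a heatmap so that each pair of features is shown only once.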

In [None]:
#Visualizing the correlation matrix

#Reading the correlation matrix of ansur_df in its raw, numeric
#format doesn't allow us to get a quick overview. Let's improve
#this by removing redundant values and visualizing the matrix using
#seaborn.

#Seaborn has been pre-loaded as sns, matplotlib.pyplot as plt,
#NumPy as np and pandas as pd.

# Create the correlation matrix
#corr = ansur_df.corr()

# Draw the heatmap (cmap is assumed to be a pre-loaded colormap; it is not defined in these notes)
#sns.heatmap(corr, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f")
#plt.show()

![_images/13.10.svg](_images/13.10.svg)

In [None]:
#Create a boolean mask for the upper triangle of the plot.

# Create the correlation matrix
#corr = ansur_df.corr()

# Generate a mask for the upper triangle
#mask = np.triu(np.ones_like(corr, dtype=bool))

# Add the mask to the heatmap
#sns.heatmap(corr, mask=mask, cmap=cmap, center=0, linewidths=1, annot=True, fmt=".2f")
#plt.show()

![_images/13.11.svg](_images/13.11.svg)
The buttock and crotch height have a 0.93 correlation coefficient.

**Removing highly correlated features**
___
- correlation caveats - Anscombe's quartet
    - very different patterns (nonlinear relationships, datasets with outliers) can yield the same strong correlation coefficient
    - always visualize the scatterplot
- correlation does not imply causation
___
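
The first caveat can be checked with the first two datasets of Anscombe's quartet. The sketch below assumes NumPy is installed; the variable names are illustrative.

```python
import numpy as np

# First two datasets of Anscombe's quartet (they share the same x values)
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y_linear = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                     7.24, 4.26, 10.84, 4.82, 5.68])
y_curved = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                     6.13, 3.10, 9.13, 7.26, 4.74])

# Pearson correlation of x with each y
r_linear = np.corrcoef(x, y_linear)[0, 1]
r_curved = np.corrcoef(x, y_curved)[0, 1]

# Both coefficients are roughly 0.82, although only the first relationship
# is linear; the second follows a curve, which a scatterplot shows at once
print(round(r_linear, 2), round(r_curved, 2))
```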

In [None]:
#Filtering out highly correlated features
#You're going to automate the removal of highly correlated features
#in the numeric ANSUR dataset. You'll calculate the correlation
#matrix and filter out columns that have a correlation coefficient
#of more than 0.95 or less than -0.95.

#Since each correlation coefficient occurs twice in the matrix
#(correlation of A to B equals correlation of B to A) you'll want
#to ignore half of the correlation matrix so that only one of the
#two correlated features is removed. Use a mask trick for this
#purpose.

# Calculate the correlation matrix and take the absolute value
#corr_matrix = ansur_df.corr().abs()

# Create a True/False mask and apply it
#mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
#tri_df = corr_matrix.mask(mask)

# List column names of highly correlated features (|r| > 0.95, since absolute values were taken)
#to_drop = [c for c in tri_df.columns if any(tri_df[c] > 0.95)]

# Drop the features in the to_drop list
#reduced_df = ansur_df.drop(to_drop, axis=1)

#print("The reduced_df dataframe has {} columns".format(reduced_df.shape[1]))

#################################################
#The original dataframe has 99 columns.
#
#<script.py> output:
#    The reduced_df dataframe has 88 columns
#################################################
# You've automated the removal of highly correlated features.

In [None]:
#Nuclear energy and pool drownings

#The dataset that has been pre-loaded for you as weird_df contains
#actual data provided by the US Centers for Disease Control &
#Prevention and Department of Energy.

#Let's see if we can find a pattern.

#Seaborn has been pre-loaded as sns and matplotlib.pyplot as plt.

# Print the first five lines of weird_df
#print(weird_df.head())
#################################################
#<script.py> output:
#       pool_drownings  nuclear_energy
#    0             421           728.3
#    1             465           753.9
#    2             494           768.8
#    3             538           780.1
#    4             430           763.7
#################################################

#Create a scatterplot with nuclear energy production on the x-axis
#and the number of pool drownings on the y-axis.

#sns.scatterplot(x='nuclear_energy', y='pool_drownings', data=weird_df)
#plt.show()

![_images/13.12.svg](_images/13.12.svg)

In [None]:
# Print out the correlation matrix of weird_df
#print(weird_df.corr())
#################################################
#<script.py> output:
#                    pool_drownings  nuclear_energy
#    pool_drownings        1.000000        0.901179
#    nuclear_energy        0.901179        1.000000
#################################################
#While the example is silly, you'll be amazed how often people
#misunderstand correlation vs causation.

**Selecting features for model performance**
___

___