**Introduction**
___
- Tidy data
    - every column is a feature
    - every row is an observation, with one value for each feature
    - Pandas dataframe .shape attribute
- high dimensionality: more than 10 columns (rule of thumb)
- When to use dimensionality reduction?
    - drop columns with no variance (i.e. same values)
    - Pandas dataframe .describe() method
        - no variance: std = 0, and min and max are the same
        - exclude='number' shows summary statistics for the non-numeric columns instead
___
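
The checks listed above can be sketched on a tiny made-up DataFrame. The data and the names `toy_df`, `constant_numeric` and `constant_other` are assumptions for illustration only; this is not the course's Pokemon sample.

```python
import pandas as pd

# Hypothetical toy data: 'Generation' and 'Legendary' never change
toy_df = pd.DataFrame({
    "HP": [45, 60, 80, 39],
    "Attack": [49, 62, 82, 52],
    "Generation": [1, 1, 1, 1],
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Charmander"],
    "Legendary": [False, False, False, False],
})

# .shape returns (number of rows, number of columns)
print(toy_df.shape)

# Numeric summary: a standard deviation of 0 means no variance
numeric_summary = toy_df.describe()
std_row = numeric_summary.loc["std"]
constant_numeric = std_row[std_row == 0].index.tolist()

# Non-numeric summary: a single unique value means no variance
other_summary = toy_df.describe(exclude="number")
unique_row = other_summary.loc["unique"]
constant_other = unique_row[unique_row == 1].index.tolist()

# Drop every column that carries no information
reduced_df = toy_df.drop(columns=constant_numeric + constant_other)
print(reduced_df.columns.tolist())
```

Reading the `std` and `unique` rows programmatically gives the same answer as inspecting the `.describe()` output by eye, which is what the exercise below asks for.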

In [None]:
#Removing features without variance

#A sample of the Pokemon dataset has been loaded as pokemon_df. To
#get an idea of which features have little variance you should use
#the IPython Shell to calculate summary statistics on this sample.
#Then adjust the code to create a smaller, easier to understand,
#dataset.

# Leave this list as is
number_cols = ['HP', 'Attack', 'Defense']

# Remove the feature without variance from this list
non_number_cols = ['Name', 'Type']

# Create a new dataframe by subselecting the chosen features
#df_selected = pokemon_df[number_cols + non_number_cols]

# Prints the first 5 lines of the new dataframe
#print(df_selected.head())

#################################################
#<script.py> output:
#       HP  Attack  Defense                   Name   Type
#    0  45      49       49              Bulbasaur  Grass
#    1  60      62       63                Ivysaur  Grass
#    2  80      82       83               Venusaur  Grass
#    3  80     100      123  VenusaurMega Venusaur  Grass
#    4  39      52       43             Charmander   Fire
#################################################
#All Pokemon in this dataset are non-legendary and from generation
#one so you could choose to drop those two features.

**Feature selection vs feature extraction**
___
- Why reduce dimensionality?
    - your dataset will:
        - be less complex
        - require less disk space
        - have lower chance of model overfitting
- Feature selection
    - .drop('column name', axis=1) [axis=1 drops a column rather than a row]
- Building a pairplot
    - sns.pairplot(data, hue='column name', diag_kind='hist')
- Feature extraction
    - calculating new feature(s) from original feature(s)
___
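
To make the difference concrete, here is a minimal sketch on a made-up body-measurement sample. The DataFrame `body_df`, its column names and the choice of BMI as the extracted feature are assumptions for illustration, not part of the course data.

```python
import pandas as pd

# Hypothetical sample; column names and values are illustrative
body_df = pd.DataFrame({
    "height_m": [1.60, 1.75, 1.82],
    "weight_kg": [55.0, 70.0, 90.0],
    "n_legs": [2, 2, 2],
})

# Feature selection: keep a subset of the original columns
selected_df = body_df.drop("n_legs", axis=1)

# Feature extraction: compute a new feature from the originals,
# then drop the columns it was derived from
extracted_df = selected_df.copy()
extracted_df["bmi"] = extracted_df["weight_kg"] / extracted_df["height_m"] ** 2
extracted_df = extracted_df.drop(["height_m", "weight_kg"], axis=1)

print(selected_df.columns.tolist())
print(extracted_df.columns.tolist())
```

Selection leaves the surviving columns unchanged, while extraction replaces two columns with one derived column.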

In [None]:
#Visually detecting redundant features
#Data visualization is a crucial step in any data exploration. Let's
#use Seaborn to explore some samples of the US Army ANSUR body
#measurement dataset.

#Two data samples have been pre-loaded as ansur_df_1 and ansur_df_2.

#Seaborn has been imported as sns.

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(ansur_df_1, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.1.svg](_images/13.1.svg)

In [None]:
#Two features are basically duplicates, remove one of them from
#the dataset.

# Remove one of the redundant features
#reduced_df = ansur_df_1.drop('stature_m', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(reduced_df, hue='Gender')

# Show the plot
#plt.show()

![_images/13.2.svg](_images/13.2.svg)
Body height (inches) and stature (meters) hold the same information in different units.

In [None]:
#Now create a pairplot of the ansur_df_2 data sample and color the
#points using the 'Gender' feature.

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(ansur_df_2, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.3.svg](_images/13.3.svg)

In [None]:
#One feature has no variance, remove it from the dataset.
# Remove the redundant feature
#reduced_df = ansur_df_2.drop('n_legs', axis=1)

# Create a pairplot and color the points using the 'Gender' feature
#sns.pairplot(reduced_df, hue='Gender', diag_kind='hist')

# Show the plot
#plt.show()

![_images/13.4.svg](_images/13.4.svg)
All the individuals in the second sample have two legs.
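
The two kinds of redundancy spotted in the pairplots can also be checked numerically. The sketch below uses a made-up sample; `sample_df`, its values and the column name `body_height_in` are assumptions, and only `stature_m` and `n_legs` are names taken from the exercises.

```python
import pandas as pd

# Hypothetical sample: stature_m is body height converted to meters
sample_df = pd.DataFrame({
    "body_height_in": [63.0, 67.0, 70.0, 72.0],
    "n_legs": [2, 2, 2, 2],
    "weight_kg": [55.0, 81.0, 70.0, 79.0],
})
sample_df["stature_m"] = sample_df["body_height_in"] * 0.0254

# Columns with a single unique value have no variance
n_unique = sample_df.nunique()
no_variance = n_unique[n_unique == 1].index.tolist()

# A correlation of (almost exactly) 1 signals duplicated information
corr = sample_df.drop(columns=no_variance).corr()
height_corr = corr.loc["body_height_in", "stature_m"]

print(no_variance)
print(round(height_corr, 6))
```

A pairplot remains the quicker way to explore a handful of features; a numeric check like this one scales better when there are too many columns to plot.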

**t-SNE visualization of high-dimensional data**
___
- t-distributed stochastic neighbor embedding
- t-SNE maximizes the distance in 2-dimensional space between observations that are most different in the higher-dimensional space
- does not work with non-numeric values
- learning rate (usually 10-1000): lower values are more conservative
___
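
As a rough illustration of the points above, the sketch below runs scikit-learn's `TSNE` on synthetic numeric data. The two Gaussian groups, the array `X` and the `random_state` value are assumptions made for this example; the ANSUR data is not used here.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in data: two groups of 50 observations in 10 dimensions
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=(50, 10))
group_b = rng.normal(loc=5.0, scale=1.0, size=(50, 10))
X = np.vstack([group_a, group_b])

# t-SNE only accepts numeric input; a low learning rate is the conservative choice
model = TSNE(n_components=2, learning_rate=50, random_state=0)
embedding = model.fit_transform(X)

# One row per observation, two t-SNE features per row
print(embedding.shape)
```

Because t-SNE is stochastic, the coordinates change between runs unless `random_state` is fixed; only the relative grouping of the points is meaningful.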

In [None]:
#Fitting t-SNE to the ANSUR data

#t-SNE is a great technique for visual exploration of high dimensional
#datasets. In this exercise, you'll apply it to the ANSUR dataset. You'll
#remove non-numeric columns from the pre-loaded dataset df and fit TSNE to
#this numeric dataset.

# Non-numerical columns in the dataset
#non_numeric = ['Branch', 'Gender', 'Component']

# Drop the non-numerical columns from df
#df_numeric = df.drop(non_numeric, axis=1)

# Create a t-SNE model with learning rate 50
#m = TSNE(learning_rate=50)

# Fit and transform the t-SNE model on the numeric dataset
#tsne_features = m.fit_transform(df_numeric)

#################################################
#t-SNE reduced the more than 90 features in the dataset to just 2
#which you can now plot.
#################################################
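
The next exercise assumes the two t-SNE features were added to `df` as columns `'x'` and `'y'`. One plausible way to do that is sketched below; the small `tsne_features` array and the three-row `df` are stand-ins invented for this example, not real t-SNE output.

```python
import numpy as np
import pandas as pd

# Stand-in for the array returned by TSNE.fit_transform (made-up values)
tsne_features = np.array([[1.5, -2.0], [0.3, 4.1], [-3.2, 0.7]])

# Stand-in for the original dataset, which still holds the categorical columns
df = pd.DataFrame({"Gender": ["Male", "Female", "Male"]})

# Store the first and second t-SNE feature as new columns
df["x"] = tsne_features[:, 0]
df["y"] = tsne_features[:, 1]

print(df.columns.tolist())
```

Keeping the categorical columns next to `'x'` and `'y'` is what allows the scatterplots to be colored with `hue`.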

In [None]:
#t-SNE visualisation of dimensionality
#Time to look at the results of your hard work. In this exercise,
#you will visualize the output of t-SNE dimensionality reduction on
#the combined male and female ANSUR dataset. You'll create 3
#scatterplots of the 2 t-SNE features ('x' and 'y') which were
#added to the dataset df. In each scatterplot you'll color the
#points according to a different categorical variable.

#seaborn has already been imported as sns and matplotlib.pyplot as
#plt.

# Color the points according to Army Component
#sns.scatterplot(x="x", y="y", hue='Component', data=df)

# Show the plot
#plt.show()

![_images/13.5.svg](_images/13.5.svg)

In [None]:
# Color the points by Army Branch
#sns.scatterplot(x="x", y="y", hue='Branch', data=df)

# Show the plot
#plt.show()

![_images/13.6.svg](_images/13.6.svg)

In [None]:
# Color the points by Gender
#sns.scatterplot(x="x", y="y", hue='Gender', data=df)

# Show the plot
#plt.show()

![_images/13.7.svg](_images/13.7.svg)
There is a Male and a Female cluster. t-SNE found these gender
differences in body shape without being told about them explicitly!

Comparing this with the second plot (colored by Branch) shows that
there are more males in the Combat Arms Branch.