# Introduction 

Working with **tidy data** makes our lives as data scientists way easier.

As a rule of thumb, we consider a dataset to be of **high dimensionality** if it has more than 10 features.

Before reducing the dimensionality of our dataset, we have to understand it.

In [None]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns 
import warnings 

warnings.filterwarnings('ignore')
sns.set()


In [None]:
pokemon_df = pd.read_csv('../data/pokemon.csv')

In [None]:
pokemon_df.head()

We can start reducing dimensionality by getting rid of columns with no variance, that is, columns that hold the same value for every observation.

We can look at the spread of each feature (reported as the standard deviation, `std`) with the pandas *.describe()* method. A `std` of 0 means the column has no variance. Keep in mind that this method ignores non-numerical columns by default.

In [None]:
len(pokemon_df.columns)

In [None]:
pokemon_df.describe()

In [None]:
pokemon_df.describe(exclude='number')

In [None]:
pokemon_df.shape
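
Once a constant column has been spotted, dropping it could look like the sketch below. The toy DataFrame and its column names are made up for illustration and are not taken from the Pokemon dataset; `nunique()` is just one way to detect columns that hold a single value.

```python
import pandas as pd

# Toy data (hypothetical): 'generation' is constant, so it carries no information
toy_df = pd.DataFrame({
    'attack': [49, 62, 82],
    'defense': [49, 63, 83],
    'generation': [1, 1, 1],
})

# nunique() counts distinct values per column; keep columns with more than one
varying_cols = toy_df.columns[toy_df.nunique() > 1]
reduced_df = toy_df[varying_cols]

print(list(reduced_df.columns))
```

This approach also works for non-numerical columns, which *.describe()* skips by default.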

A dataset with fewer dimensions:
- is less complex
- requires less disk space
- requires less computation time
- carries a lower risk of model overfitting

The simplest way to reduce dimensionality is to select only the features that are useful, but how do we know which ones those are?

# Feature Selection vs. Feature Extraction

## Feature selection

We can perform feature selection based on expertise (a subject matter expert knows from experience that a certain feature won't be of any use), on a feature's correlation with other features, or on a feature's lack of variance.
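
A minimal sketch of correlation-based selection, assuming toy data in which `weight_lb` is simply `weight_kg` expressed in other units. The 0.95 threshold and the column names are hypothetical choices for illustration, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 'weight_lb' duplicates the information in 'weight_kg'
toy_df = pd.DataFrame({
    'weight_kg': [60.0, 72.5, 81.0, 95.2, 68.3],
    'weight_lb': [132.3, 159.8, 178.6, 209.9, 150.6],
    'earlength': [62.0, 58.0, 66.0, 61.0, 59.0],
})

# Absolute pairwise correlations between the numeric features
corr = toy_df.corr().abs()

# Keep only the upper triangle so each pair is checked once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed threshold: drop one feature of every pair correlated above 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
selected_df = toy_df.drop(columns=to_drop)

print(to_drop)
```

Which feature of a correlated pair gets dropped depends only on column order here; in practice that choice is usually better made with domain knowledge.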

## Feature extraction

Feature extraction also reduces dimensionality, but instead of keeping a subset of the original features it calculates (extracts) new features from them.

For datasets with numerous highly correlated features, dimensionality can be reduced using feature extraction.
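
As a simple illustration of extracting a new feature from existing ones, the sketch below combines a weight and a height column into a single body mass index column. The data and the column names (`height_m`, `bmi`) are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy data (hypothetical): two original features
toy_df = pd.DataFrame({
    'weight_kg': [60.0, 81.0, 95.0],
    'height_m': [1.60, 1.80, 1.90],
})

# Extract a single new feature from the two originals
toy_df['bmi'] = toy_df['weight_kg'] / toy_df['height_m'] ** 2

# Drop the originals to end up with fewer columns
extracted_df = toy_df.drop(columns=['weight_kg', 'height_m'])

print(extracted_df.shape)
```

Two columns become one, at the cost of no longer being able to recover the original values.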


In [None]:
ansur_df = pd.read_csv('../data/ANSUR_II_MALE.csv')

ansur_df.head()

In [None]:
ansur_df.shape

In [None]:
list(ansur_df.columns)

In [None]:
ansur_df_1 = ansur_df[['weight_kg', 'earlength', 'waistdepth', 'Gender']]

In [None]:
# Create a pairplot and color the points using the 'Gender' feature
sns.pairplot(ansur_df_1, hue='Gender', diag_kind='hist')

# Show the plot
plt.show()

In [None]:
ansur_df_1.shape

# t-SNE visualization of high-dimensional data

t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a powerful technique for visualizing high-dimensional data using feature extraction.

t-SNE places the observations in a two-dimensional space so that observations that are similar in the high-dimensional space end up close together, while observations that are very different tend to end up far apart.



In [None]:
ansur_f_df = pd.read_csv('../data/ANSUR_II_FEMALE.csv')

ansur_f_df.head()

In [None]:
ansur_f_df.shape

Let's remove the non-numerical columns first, since t-SNE doesn't work with them.
Alternatively, we could convert them to numbers later with something like one-hot encoding.

In [None]:
non_numeric = ['BMI_class', 'Height_class', 'Gender', 'Component', 'Branch']
df_numeric = ansur_f_df.drop(non_numeric, axis=1)
df_numeric.shape
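
As an alternative to dropping the categorical columns, one-hot encoding could be sketched as follows. The toy DataFrame is made up for illustration; only the `BMI_class` and `weight_kg` column names are borrowed from the datasets above, and the category labels are assumptions.

```python
import pandas as pd

# Toy data (hypothetical values and category labels)
toy_df = pd.DataFrame({
    'weight_kg': [60.0, 81.0, 95.0],
    'BMI_class': ['Normal', 'Overweight', 'Normal'],
})

# One 0/1 indicator column per category; numeric columns pass through unchanged
encoded_df = pd.get_dummies(toy_df, columns=['BMI_class'], dtype=int)

print(list(encoded_df.columns))
```

Note that this increases the number of columns, which works against the goal of reducing dimensionality, so it is a trade-off rather than a free win.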

In [None]:
from sklearn.manifold import TSNE

m = TSNE(learning_rate=50)

In [None]:
tsne_features = m.fit_transform(df_numeric)

# Peek at a few rows of the two extracted features
tsne_features[1:4, :]

In [None]:
# Store the two t-SNE components so they can be plotted alongside the original columns
ansur_f_df['x'] = tsne_features[:, 0]
ansur_f_df['y'] = tsne_features[:, 1]


In [None]:
sns.scatterplot(data=ansur_f_df, x='x', y='y')
plt.show()

In [None]:
sns.scatterplot(data=ansur_f_df, x='x', y='y', hue='BMI_class')
plt.show()

In [None]:
sns.scatterplot(data=ansur_f_df, x='x', y='y', hue='Height_class')
plt.show()