---
title: "Exploring High Dimensional Data"
title-block-banner: true
format:
  html:
    code-fold: true
jupyter: python3
author: "kakamana"
date: "2023-01-22"
categories: [python, datacamp, feature engineering, machine learning, dimension ]
image: "exploringDimension.jpg"

---

# Exploring High Dimensional Data

This chapter introduces dimensionality reduction and explains why and when it is important. It also covers the difference between feature selection and feature extraction, and shows how to apply both techniques for data exploration. Finally, it covers t-SNE, a powerful feature extraction method for analyzing high-dimensional data.

This **Exploring High Dimensional Data** chapter is part of the [Datacamp course: Hypothesis Testing in Python](https://app.datacamp.com/learn/courses/hypothesis-testing-in-python).

These are my notes from learning data science through DataCamp.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize'] = (10, 5)

In [2]:
pokemon_df = pd.read_csv('dataset/pokemon_gen1.csv')
pokemon_df.head()

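|   | HP | Attack | Defense | Generation | Name | Type | Legendary |
|---|---|---|---|---|---|---|---|
| 0 | 45 | 49 | 49 | 1 | Bulbasaur | Grass | False |
| 1 | 60 | 62 | 63 | 1 | Ivysaur | Grass | False |
| 2 | 80 | 82 | 83 | 1 | Venusaur | Grass | False |
| 3 | 80 | 100 | 123 | 1 | VenusaurMega Venusaur | Grass | False |
| 4 | 39 | 52 | 43 | 1 | Charmander | Fire | False |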


In [3]:
pokemon_df.describe()

|   | HP | Attack | Defense | Generation |
|---|---|---|---|---|
| count | 160.0 | 160.0 | 160.0 | 160.0 |
| mean | 64.6125 | 74.98125 | 70.175 | 1.0 |
| std | 27.92127 | 29.18009 | 28.883533 | 0.0 |
| min | 10.0 | 5.0 | 5.0 | 1.0 |
| 25% | 45.0 | 52.0 | 50.0 | 1.0 |
| 50% | 60.0 | 71.0 | 65.0 | 1.0 |
| 75% | 80.0 | 95.0 | 85.0 | 1.0 |
| max | 250.0 | 155.0 | 180.0 | 1.0 |


In [4]:
pokemon_df.describe(exclude='number')

|   | Name | Type | Legendary |
|---|---|---|---|
| count | 160 | 160 | 160 |
| unique | 160 | 15 | 1 |
| top | Bulbasaur | Water | False |
| freq | 1 | 31 | 160 |


## Introduction

### Removing features without variance
A sample of the Pokemon dataset has been loaded as `pokemon_df`. To get an idea of which features have little variance, calculate summary statistics on this sample, as done above. Then adjust the code to create a smaller, easier-to-understand dataset.

In [5]:
# Remove the feature without variance from this list
number_cols = ['HP', 'Attack', 'Defense']

# Leave this list as is for now
non_number_cols = ['Name', 'Type', 'Legendary']

# Sub-select by combining the lists with chosen features
df_selected = pokemon_df[number_cols + non_number_cols]

# Prints the first 5 lines of the new DataFrame
print(df_selected.head())

   HP  Attack  Defense                   Name   Type  Legendary
0  45      49       49              Bulbasaur  Grass      False
1  60      62       63                Ivysaur  Grass      False
2  80      82       83               Venusaur  Grass      False
3  80     100      123  VenusaurMega Venusaur  Grass      False
4  39      52       43             Charmander   Fire      False


In [6]:
# Leave this list as is
number_cols = ['HP', 'Attack', 'Defense']

# Remove the feature without variance from this list
non_number_cols = ['Name', 'Type']

# Create a new dataframe by subselecting the chosen features
df_selected = pokemon_df[number_cols + non_number_cols]

# Prints the first 5 lines of the new dataframe
print(df_selected.head())

   HP  Attack  Defense                   Name   Type
0  45      49       49              Bulbasaur  Grass
1  60      62       63                Ivysaur  Grass
2  80      82       83               Venusaur  Grass
3  80     100      123  VenusaurMega Venusaur  Grass
4  39      52       43             Charmander   Fire


All Pokemon in this dataset are non-legendary and from generation one, so both `Generation` and `Legendary` carry no variance and can be dropped.
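
The manual column selection above can also be done programmatically. The following is a minimal sketch, not the course's method: it assumes a small stand-in frame (`sample_df`, a hypothetical name) built from the five rows printed by `head()` earlier, and treats any column with a single unique value as having no variance.

```python
import pandas as pd

# Hypothetical stand-in for pokemon_df, built from the five rows shown by head()
sample_df = pd.DataFrame({
    'HP': [45, 60, 80, 80, 39],
    'Attack': [49, 62, 82, 100, 52],
    'Defense': [49, 63, 83, 123, 43],
    'Generation': [1, 1, 1, 1, 1],
    'Name': ['Bulbasaur', 'Ivysaur', 'Venusaur',
             'VenusaurMega Venusaur', 'Charmander'],
    'Type': ['Grass', 'Grass', 'Grass', 'Grass', 'Fire'],
    'Legendary': [False, False, False, False, False],
})

# A column holding a single unique value carries no variance
constant_cols = [col for col in sample_df.columns
                 if sample_df[col].nunique() == 1]
print(constant_cols)

# Drop the constant columns to get the smaller dataset
df_reduced = sample_df.drop(columns=constant_cols)
print(df_reduced.columns.tolist())
```

Checking `nunique()` works for numeric and non-numeric columns alike, which is why it is used here instead of the standard deviation. On the full dataset the same check could be run on `pokemon_df` directly.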

## Feature selection vs feature extraction
- Why reduce dimensionality?
    - Your dataset will:
        - be less complex
        - require less disk space
        - require less computation time
        - have a lower chance of model overfitting
![feature](Images/feature.png)
![feature](feature.png){#fig-feature}
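
As a rough illustration of the distinction in this section's heading, the sketch below contrasts the two approaches on the three numeric stats from the rows shown earlier. The frame `stats_df` and the derived column `MeanStat` are hypothetical names chosen for this example, and averaging is only one simple way to extract a new feature.

```python
import pandas as pd

# Hypothetical frame holding the numeric stats of the first five Pokemon
stats_df = pd.DataFrame({
    'HP': [45, 60, 80, 80, 39],
    'Attack': [49, 62, 82, 100, 52],
    'Defense': [49, 63, 83, 123, 43],
})

# Feature selection: keep a subset of the original columns, unchanged
selected = stats_df[['HP', 'Attack']]

# Feature extraction: derive a new feature that combines the originals
# (here a simple row-wise mean, stored under the hypothetical name MeanStat)
extracted = pd.DataFrame({'MeanStat': stats_df.mean(axis=1)})

print(selected.shape, extracted.shape)
```

Selection keeps the remaining features interpretable because they are original columns, while extraction reduces three columns to one at the cost of a feature that no longer maps to a single original measurement.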