# Data Engineering: Feature Hashing

A categorical variable (also called a nominal variable) is one whose values are concepts rather than quantities. Such variables are commonly re-encoded as numbers through a one-to-one mapping between each string value and a numeral. That approach can fail when new or previously unknown concepts are expected to appear in the dataset.
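To make the failure mode concrete, here is a minimal sketch. The category names and the `type_to_code` mapping are made up for illustration and are not taken from the dataset. A fixed one-to-one mapping has no code for a value it has never seen.

```python
import pandas as pd

# Hypothetical categories known at the time the mapping is built.
known_types = ["Grass", "Fire", "Water"]

# A one-to-one mapping from each known string to an integer code.
type_to_code = {name: code for code, name in enumerate(known_types)}

# Later data contains a value the mapping has never seen.
incoming = pd.Series(["Fire", "Water", "Crystal"])

# Series.map leaves unmapped values as NaN, so the new concept gets no code.
encoded = incoming.map(type_to_code)
print(encoded.tolist())
```

The last entry comes back as NaN, so the mapping has to be rebuilt every time a new value shows up.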

## Dataset

The data are 800 Pokemon and their attributes acquired from several pokemon web sites, including pokemon.com, pokemondb, bulbapedia, and others. Data are scoped around pokemon games (not cards or Go).

[An update of the dataset is on Kaggle](https://www.kaggle.com/rounakbanik/pokemon).

## Problem
The dataset is complete in all but two columns. "Legendary" has one null value, and that row can be dropped during processing. "Type 2", however, has 385 missing values, which may be filled in by a future update.

## Approach
Feature values must be encoded in anticipation of updates. A hash function can encode string values as numbers and accommodate new values in the future. scikit-learn provides this as FeatureHasher, which uses the signed 32-bit variant of MurmurHash3 and is available from the sklearn.feature_extraction module.
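The idea can be sketched with the standard library alone. This is an illustration of the principle, not scikit-learn's implementation: `bucket_for` is a hypothetical helper, and MD5 stands in for MurmurHash3, so the bucket numbers will not match what FeatureHasher produces.

```python
import hashlib


def bucket_for(value: str, k: int) -> int:
    """Map a string to one of k buckets with a deterministic hash.

    Assumption: MD5 is used only as a stand-in for MurmurHash3.
    """
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the first 4 bytes as an unsigned 32-bit integer.
    hashed = int.from_bytes(digest[:4], "big")
    return hashed % k


k = 5
# "Crystal" is a made-up value standing in for a type added in the future.
for name in ["Poison", "Flying", "Crystal"]:
    print(name, "->", bucket_for(name, k))
```

No lookup table is stored, so a value that appears for the first time still gets a bucket between 0 and k-1.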

## Prep the Notebook

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#After running the above, read files from:
#'/content/drive/MyDrive/datasets/filename.ext'


Import the necessary libraries

In [None]:
import pandas as pd
import numpy as np

## Read the data

In [None]:
df = pd.read_csv('/content/drive/MyDrive/datasets/Pokemon.txt', sep=",", header=0)

In [None]:
df.head()

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,Gen 1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,Gen 1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,Gen 1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,Gen 1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,Gen 1,False


**Data Dictionary**

* **#:** Redundant manually-assigned ID
* **Name:** Name of pokemon
* **Type 1:** Each pokemon has a type, which determines its weakness/resistance to attacks
* **Type 2:** Some pokemon are dual type and have a second type
* **Total:** Sum of all stats that come after this, a general guide to how strong a pokemon is
* **HP:** Hit points, or health, defines how much damage a pokemon can withstand before fainting
* **Attack:** The base modifier for normal attacks (e.g. Scratch, Punch)
* **Defense:** The base damage resistance against normal attacks
* **Sp. Atk:** Special attack, the base modifier for special attacks (e.g. fire blast, bubble beam)
* **Sp. Def:** The base damage resistance against special attacks
* **Speed:** Determines which pokemon attacks first each round
* **Generation:** An ordinal categorical variable identifying the release of origin.
* **Legendary:** A boolean True or False.

# Exploratory Data Analysis (EDA)
Before you attempt a hash, the column must be of type string and have no missing values (blank, 0, or NaN).
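A small check along these lines can confirm that a column meets both conditions. It is a sketch on toy data: `ready_to_hash` is a hypothetical helper, not part of pandas or scikit-learn.

```python
import pandas as pd


def ready_to_hash(series: pd.Series) -> bool:
    """Return True if every value is a non-blank string (no NaN, no empty strings)."""
    return bool(series.map(lambda v: isinstance(v, str) and v.strip() != "").all())


# Toy column for illustration, not the real dataset.
raw = pd.Series(["Poison", None, "Flying", ""])
print(ready_to_hash(raw))

# Fill missing values and blanks with a placeholder category, then re-check.
cleaned = raw.fillna("Other").replace("", "Other")
print(ready_to_hash(cleaned))
```

The first check fails because of the missing value and the blank; after filling both with a placeholder, the column passes.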

In [None]:
df.shape

(800, 13)

In [None]:
df.info() #Check for datatypes and null values.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 13 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   #           800 non-null    int64 
 1   Name        800 non-null    object
 2   Type 1      800 non-null    object
 3   Type 2      415 non-null    object
 4   Total       800 non-null    int64 
 5   HP          800 non-null    int64 
 6   Attack      800 non-null    int64 
 7   Defense     800 non-null    int64 
 8   Sp. Atk     800 non-null    int64 
 9   Sp. Def     800 non-null    int64 
 10  Speed       800 non-null    object
 11  Generation  800 non-null    object
 12  Legendary   799 non-null    object
dtypes: int64(7), object(6)
memory usage: 81.4+ KB


* 385 values are null in Type 2.
* 1 value is null in Legendary.
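These counts can be read off `df.info()` by subtracting each non-null count from 800, or computed directly with `isna().sum()`. The sketch below uses a small made-up frame, not the real data, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the real data has 800 rows.
toy = pd.DataFrame({
    "Type 1": ["Grass", "Fire", "Water", "Bug"],
    "Type 2": ["Poison", np.nan, np.nan, "Flying"],
    "Legendary": [False, False, np.nan, False],
})

# isna() flags missing cells; sum() counts them per column.
null_counts = toy.isna().sum()

# Show only the columns that actually contain nulls.
print(null_counts[null_counts > 0])
```

Running `df.isna().sum()` on the real frame should report the same two columns listed above.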

In [None]:
df['Type 2'].unique()

array(['Poison', nan, 'Flying', 'Dragon', '314', 'Ground', 'Fairy',
       'Grass', 'Fighting', 'Psychic', 'Steel', 'Ice', 'Rock', 'Dark',
       'Water', 'Electric', 'Fire', 'Ghost', 'Bug', 'Normal'],
      dtype=object)

In [None]:
# The first pipeline step replaces nulls in 'Type 2' with 'Other',
# anticipating updates and new values in the future. The next steps drop the
# redundant columns and then the one remaining row with a null (in 'Legendary').
(df
 .assign(Type2=df['Type 2'].fillna('Other'))
 .drop(columns=['Type 2', '#'])
 .dropna()
 .shape
)

(799, 12)

# Feature Hashing

Feature hashing maps data of arbitrary size to data of a fixed size.
* It helps with columns like Type 1 and Type 2, where new pokemon (and with them new nominal values) appear regularly.
* It also ensures that encoding adds exactly k new feature columns to the dataset, however many distinct values the feature takes.
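Both points can be seen in a short sketch with scikit-learn's FeatureHasher. The value "Crystal" is invented to stand in for a type that does not exist yet, and each sample is wrapped in a list so the whole name is hashed as one token.

```python
from sklearn.feature_extraction import FeatureHasher

k = 5
hasher = FeatureHasher(n_features=k, input_type="string")

# Each sample is a list holding one token, so the whole name is hashed at once.
seen = [["Poison"], ["Flying"], ["Dragon"]]
unseen = [["Crystal"]]  # made-up value that is not in the original data

seen_matrix = hasher.transform(seen).toarray()
unseen_matrix = hasher.transform(unseen).toarray()

# Both results have exactly k columns, whatever the values were.
print(seen_matrix.shape)
print(unseen_matrix.shape)
```

FeatureHasher is stateless, so no fitting step is needed before an unseen value can be transformed.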

In [None]:
len(df['Type 2'].unique())
#The current 20 plus all future values will occupy k=5 columns after hashing.

20
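Squeezing this many values into k=5 columns has a cost: several values must share an encoding. The sketch below assumes each value is hashed as a single token with FeatureHasher's default alternating sign, and uses the 18 type names from the `unique()` output above (without the null and the stray '314'). Under that assumption each value lands in one column with sign +1 or -1, so at most 2*k = 10 distinct encodings are possible.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher

types = ["Poison", "Flying", "Dragon", "Ground", "Fairy", "Grass",
         "Fighting", "Psychic", "Steel", "Ice", "Rock", "Dark",
         "Water", "Electric", "Fire", "Ghost", "Bug", "Normal"]

k = 5
hasher = FeatureHasher(n_features=k, input_type="string")
hashed = hasher.transform([[t] for t in types]).toarray()

# Count how many different rows the 18 values were mapped to.
distinct_rows = np.unique(hashed, axis=0)
print(len(types), "values ->", len(distinct_rows), "distinct encodings")
```

A larger k reduces these collisions at the price of more columns, so k is a trade-off rather than a free choice.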

In [None]:
from sklearn.feature_extraction import FeatureHasher

In [None]:
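# Hash a categorical column into k numeric columns, then concatenate the
# result onto the input frame and return it to the transformation routine.
def hash_features(df_, col, k):
  fh = FeatureHasher(n_features=k, input_type='string')
  # Wrap each value in a list so the whole category name is hashed as one token;
  # a bare string would be read as a sequence of characters, not as one value.
  hashed_features = fh.fit_transform([[value] for value in df_[col]]).toarray()
  # Reuse the input index so concat aligns rows even after dropna() leaves gaps.
  hf = pd.DataFrame(hashed_features, index=df_.index)
  transformed_df = pd.concat([df_, hf], axis=1)
  return transformed_df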

In [None]:
# Encode a feature with pd.get_dummies (one column per distinct value),
# then concatenate the result onto the input frame and return it.
def encode_dummies(df_, col):
  f_dummies = pd.get_dummies(df_[col])
  transformed_df = pd.concat([df_, f_dummies], axis=1)
  return transformed_df

In [None]:
(df
 .assign(Type_1 = df['Type 1'],
         Type_2 = df['Type 2'].fillna('Other')
         )
 .drop(columns=['Type 1','Type 2','#'])
 .dropna()
 .pipe(hash_features, 'Type_2', 5)
 .pipe(encode_dummies, 'Type_1')
 .head()
)

Unnamed: 0,Name,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary,Type_1,Type_2,0,1,2,3,4,Bug,Dark,Dragon,Electric,Fairy,Fighting,Fire,Flying,Ghost,Grass,Ground,Ice,Normal,Poison,Psychic,Rock,Steel,Water
0,Bulbasaur,318.0,45.0,49.0,49.0,65.0,65.0,45,Gen 1,False,Grass,Poison,0.0,-2.0,0.0,2.0,-2.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
1,Ivysaur,405.0,60.0,62.0,63.0,80.0,80.0,60,Gen 1,False,Grass,Poison,0.0,-2.0,0.0,2.0,-2.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
2,Venusaur,525.0,80.0,82.0,83.0,100.0,100.0,80,Gen 1,False,Grass,Poison,0.0,-2.0,0.0,2.0,-2.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
3,VenusaurMega Venusaur,625.0,80.0,100.0,123.0,122.0,120.0,80,Gen 1,False,Grass,Poison,0.0,-2.0,0.0,2.0,-2.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0
4,Charmander,309.0,39.0,52.0,43.0,60.0,50.0,65,Gen 1,False,Fire,Other,0.0,0.0,1.0,0.0,0.0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
