# Import necessary dependencies and settings

In [1]:
# import pandas and numpy
import pandas as pd
import numpy as np

# Transforming Nominal Features

Nominal attributes consist of discrete categorical values with no notion or sense of order amongst them. The idea here is to transform these attributes into a more representative numerical format which can be easily understood by downstream code and pipelines. Let’s look at a new dataset pertaining to video game sales.

In [5]:
# read 'vgsales.csv'
# show the first rows (the columns of interest are 'Name', 'Platform', 'Year', 'Genre', 'Publisher')
url = "https://gist.githubusercontent.com/zhonglism/f146a9423e2c975de8d03c26451f841e/raw/f79e190df4225caed58bf360d8e20a9fa872b4ac/vgsales.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Rank,Name,Platform,Year,Genre,Publisher,NA_Sales,EU_Sales,JP_Sales,Other_Sales,Global_Sales
0,1,Wii Sports,Wii,2006.0,Sports,Nintendo,41.49,29.02,3.77,8.46,82.74
1,2,Super Mario Bros.,NES,1985.0,Platform,Nintendo,29.08,3.58,6.81,0.77,40.24
2,3,Mario Kart Wii,Wii,2008.0,Racing,Nintendo,15.85,12.88,3.79,3.31,35.82
3,4,Wii Sports Resort,Wii,2009.0,Sports,Nintendo,15.75,11.01,3.28,2.96,33.0
4,5,Pokemon Red/Pokemon Blue,GB,1996.0,Role-Playing,Nintendo,11.27,8.89,10.22,1.0,31.37


### Get the list of unique video game genres 

In [6]:
df.Genre.unique()

array(['Sports', 'Platform', 'Racing', 'Role-Playing', 'Puzzle', 'Misc',
       'Shooter', 'Simulation', 'Action', 'Fighting', 'Adventure',
       'Strategy'], dtype=object)

This tells us that we have 12 distinct video game genres. 

### We can now generate a label encoding scheme that maps each category to a numeric value using scikit-learn's LabelEncoder

In [16]:
# use LabelEncoder to show the genres and the category code associated with each genre
from sklearn import preprocessing
# the LabelEncoder is fitted on the category names
le = preprocessing.LabelEncoder()
le.fit(df.Genre)
le.classes_

array(['Action', 'Adventure', 'Fighting', 'Misc', 'Platform', 'Puzzle',
       'Racing', 'Role-Playing', 'Shooter', 'Simulation', 'Sports',
       'Strategy'], dtype=object)

In [30]:
le.transform(["Action"])

array([0])

In [29]:
dict(zip(le.classes_, le.transform(le.classes_)))

{'Action': 0,
 'Adventure': 1,
 'Fighting': 2,
 'Misc': 3,
 'Platform': 4,
 'Puzzle': 5,
 'Racing': 6,
 'Role-Playing': 7,
 'Shooter': 8,
 'Simulation': 9,
 'Sports': 10,
 'Strategy': 11}

### Show the transformed labels values and the dataframe

In [6]:
# first show only the genres from the DataFrame


0              Sports
1            Platform
2              Racing
3              Sports
4        Role-Playing
             ...     
16593        Platform
16594         Shooter
16595          Racing
16596          Puzzle
16597        Platform
Name: Genre, Length: 16598, dtype: object

In [7]:
# show the genres and their associated category codes in the DataFrame


Unnamed: 0,Genre,Genero_OneHot_Encoder
0,Sports,10
1,Platform,4
2,Racing,6
3,Sports,10
4,Role-Playing,7
5,Puzzle,5
6,Platform,4
7,Misc,3
8,Platform,4
9,Shooter,8
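The label-encoded column shown above can be produced along the following lines. This is a minimal sketch rather than the notebook's own solution: it uses a small hand-made DataFrame (`toy_df`, a hypothetical name) in place of the downloaded `vgsales.csv`, and the column name `Genre_Label` is an assumption chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the 'Genre' column of vgsales.csv
toy_df = pd.DataFrame(
    {"Genre": ["Sports", "Platform", "Racing", "Sports", "Role-Playing"]}
)

# Fit the encoder on the column and store the integer codes in a new column
genre_encoder = LabelEncoder()
toy_df["Genre_Label"] = genre_encoder.fit_transform(toy_df["Genre"])

# classes_ is sorted alphabetically, so a code is the position of its label
genre_mapping = {label: code for code, label in enumerate(genre_encoder.classes_)}

print(genre_mapping)
print(toy_df)
```

Because the toy data contains only four genres, the codes differ from those of the full dataset (where Sports maps to 10). Note also that the integer codes carry an implicit order that nominal categories do not have, which is one reason the one-hot scheme is introduced later.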



# Transforming Ordinal Features

Ordinal attributes are categorical attributes with a sense of order amongst the values. Let’s consider the Pokémon dataset and focus specifically on the Type 1 attribute. For this exercise, we will assume that each Type 1 value has a different strength, so that the types can be ranked.


In [31]:
# read pokemon.csv and display the DataFrame
url = "https://gist.githubusercontent.com/armgilles/194bcff35001e7eb53a2a8b441e8b2c6/raw/92200bc0a673d5ce2110aaad4544ed6c4010f687/pokemon.csv"
poke_df = pd.read_csv(url)
poke_df

Unnamed: 0,#,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Generation,Legendary
0,1,Bulbasaur,Grass,Poison,318,45,49,49,65,65,45,1,False
1,2,Ivysaur,Grass,Poison,405,60,62,63,80,80,60,1,False
2,3,Venusaur,Grass,Poison,525,80,82,83,100,100,80,1,False
3,3,VenusaurMega Venusaur,Grass,Poison,625,80,100,123,122,120,80,1,False
4,4,Charmander,Fire,,309,39,52,43,60,50,65,1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...
795,719,Diancie,Rock,Fairy,600,50,100,150,100,150,50,6,True
796,719,DiancieMega Diancie,Rock,Fairy,700,50,160,110,160,110,110,6,True
797,720,HoopaHoopa Confined,Psychic,Ghost,600,80,110,60,150,130,70,6,True
798,720,HoopaHoopa Unbound,Psychic,Dark,680,80,160,60,170,130,80,6,True


In [16]:
# use sample() with seed 1 over the whole DataFrame to shuffle it randomly
# reset the index and call head()



Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False
1,Kingler,Water,,475,55,130,115,50,50,75,2,False
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False
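The shuffle described in the cell above can be sketched as follows. This is an illustrative example on a hand-made DataFrame (`toy_poke_df`, a hypothetical name) built from a few rows of the table, not the notebook's actual data loading.

```python
import pandas as pd

# A few rows copied from the table above, used as stand-in data
toy_poke_df = pd.DataFrame(
    {
        "Name": ["Beedrill", "Kingler", "Golem", "Pidgeotto", "Ditto"],
        "Type 1": ["Bug", "Water", "Rock", "Normal", "Normal"],
    }
)

# frac=1 samples every row (a full shuffle); random_state makes it reproducible
shuffled_df = toy_poke_df.sample(frac=1, random_state=1)

# drop=True discards the old index instead of keeping it as a column
shuffled_df = shuffled_df.reset_index(drop=True)

print(shuffled_df.head())
```

Shuffling keeps every row exactly once, and resetting the index afterwards gives the rows a fresh 0, 1, 2, ... numbering.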


In [17]:
# show the DataFrame's columns


Index(['Name', 'Type 1', 'Type 2', 'Total', 'HP', 'Attack', 'Defense',
       'Sp. Atk', 'Sp. Def', 'Speed', 'Stage', 'Legendary'],
      dtype='object')

### Show the distinct Type 1 values present in the dataset

In general, there is no generic module or function that automatically maps these features to numeric representations based on their order. Hence we can use a custom encoding/mapping scheme based on a dictionary.

In [20]:
# write a dictionary that maps each Type 1 value to a number reflecting how good that Type 1 is.
# That is, assume these labels can be ordered.
# Use DataFrame['Type 1'].unique() to take the values in that order and assign them 1, 2, 3...
# For example: 'Bug' maps to 1, 'Water' maps to 2...

# poke_df['Type 1'].unique()
type_1_map = {'Bug': 1, 'Water': 2, 'Rock': 3, 'Normal': 4, 'Fighting': 5, 'Grass': 6, 'Poison': 7,
              'Fire': 8, 'Ghost': 9, 'Fairy': 10, 'Electric': 11, 'Dragon': 12, 'Ground': 13,
              'Psychic': 14, 'Ice': 15}

# map the values in the DataFrame into a column named 'type_1_num'
# call head()



Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type_1_num
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4
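The dictionary-based mapping above can be sketched as follows. This is a minimal example on a hand-made DataFrame (`toy_poke_df`, a hypothetical name) built from the rows shown in the table. The ranking is simply the order of first appearance, as the exercise suggests, and is an assumption rather than a real ordering of Pokémon types.

```python
import pandas as pd

# Rows copied from the table above, used as stand-in data
toy_poke_df = pd.DataFrame(
    {
        "Name": ["Beedrill", "Kingler", "Golem", "Pidgeotto", "Ditto"],
        "Type 1": ["Bug", "Water", "Rock", "Normal", "Normal"],
    }
)

# unique() preserves order of first appearance, so 'Bug' -> 1, 'Water' -> 2, ...
type_1_map = {
    name: rank
    for rank, name in enumerate(toy_poke_df["Type 1"].unique(), start=1)
}

# Series.map looks each value up in the dictionary
toy_poke_df["type_1_num"] = toy_poke_df["Type 1"].map(type_1_map)

print(type_1_map)
print(toy_poke_df)
```

Any value missing from the dictionary would be mapped to `NaN`, so it is worth checking that the mapping covers every category before using the new column.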


# Encoding Categorical Features

## One-hot Encoding Scheme

Unnamed: 0,Name,Stage,Legendary
4,Ditto,1,False
5,Primeape,2,False
6,Aerodactyl,1,False
7,Vileplume,3,False
8,Nidorina,2,False
9,Starmie,2,False


In [30]:
# use LabelEncoder
from sklearn.preprocessing import LabelEncoder

# transform and map pokemon Type 1 with LabelEncoder
# the zip function can help here



# transform and map pokemon legendary status with Label Encoder



In [28]:
poke_df.head()

Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type_1_num,Type 1 zip,Legendary zip
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1,0,0
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2,14,0
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3,13,0
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4,10,0
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4,10,0


In [31]:
# Another, simpler way using transform
# This is what fit and transform are for!
# Many transformations are split into fit (learns the parameters of the transformation)
# and transform (applies the changes)



In [32]:
# call head()

In [33]:
# check that the method's encoding is alphabetical

In [34]:
# call head()


Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,Stage,Legendary,type_1_num,Type 1 zip,Legendary zip,Type 1 transformed
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,3,False,1,0,0,0
1,Kingler,Water,,475,55,130,115,50,50,75,2,False,2,14,0,14
2,Golem,Rock,Ground,495,80,120,130,55,65,45,3,False,3,13,0,13
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,2,False,4,10,0,10
4,Ditto,Normal,,288,48,48,48,48,48,48,1,False,4,10,0,10
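The two approaches compared above (an explicit lookup built with `zip`, and a direct call to `transform`) can be sketched as follows. This is an illustrative example on hand-made data (`toy_poke_df`, a hypothetical name), not the notebook's own solution.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in data: the 'Type 1' values of the five rows shown above
toy_poke_df = pd.DataFrame({"Type 1": ["Bug", "Water", "Rock", "Normal", "Normal"]})

# fit() learns the sorted list of classes
type_encoder = LabelEncoder()
type_encoder.fit(toy_poke_df["Type 1"])

# Approach 1: build an explicit label -> code dictionary with zip, then map it
type_lookup = dict(
    zip(type_encoder.classes_, type_encoder.transform(type_encoder.classes_))
)
toy_poke_df["Type 1 zip"] = toy_poke_df["Type 1"].map(type_lookup)

# Approach 2: let transform() apply the fitted encoding directly
toy_poke_df["Type 1 transformed"] = type_encoder.transform(toy_poke_df["Type 1"])

print(list(type_encoder.classes_))
print(toy_poke_df)
```

Both columns hold the same codes, and the codes follow the alphabetical order of the classes. With only four types in the toy data, the numbers differ from those in the table above (where Water maps to 14).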


The features Type 1 zip and Legendary zip now hold the numeric representations of our categorical features. Let’s now apply the one-hot encoding scheme to these features with the get_dummies() method.

In [36]:
# encode Type 1 labels using one-hot encoding scheme

# encode legendary status labels using one-hot encoding scheme


In [43]:
#one_hot_df_legendary

In [47]:
# check that there are only 4 legendary Pokémon


4

In [51]:
# concatenate the original DataFrame with the Type 1 and Legendary encodings



Unnamed: 0,Name,Type 1,Type 2,Total,HP,Attack,Defense,Sp. Atk,Sp. Def,Speed,...,Type_1_Grass,Type_1_Ground,Type_1_Ice,Type_1_Normal,Type_1_Poison,Type_1_Psychic,Type_1_Rock,Type_1_Water,Legendary_False,Legendary_True
0,Beedrill,Bug,Poison,395,65,90,40,45,80,75,...,0,0,0,0,0,0,0,0,1,0
1,Kingler,Water,,475,55,130,115,50,50,75,...,0,0,0,0,0,0,0,1,1,0
2,Golem,Rock,Ground,495,80,120,130,55,65,45,...,0,0,0,0,0,0,1,0,1,0
3,Pidgeotto,Normal,Flying,349,63,60,55,50,50,71,...,0,0,0,1,0,0,0,0,1,0
4,Ditto,Normal,,288,48,48,48,48,48,48,...,0,0,0,1,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
146,Vaporeon,Water,,525,130,65,60,110,95,65,...,0,0,0,0,0,0,0,1,1,0
147,Omanyte,Rock,Water,355,35,40,100,90,55,35,...,0,0,0,0,0,0,1,0,1,0
148,Tentacruel,Water,Poison,515,80,70,65,80,120,100,...,0,0,0,0,0,0,0,1,1,0
149,Kabutops,Rock,Water,495,60,115,105,65,70,80,...,0,0,0,0,0,0,1,0,1,0
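The one-hot encoding and concatenation steps above can be sketched as follows. This is a minimal example on a hand-made DataFrame (`toy_poke_df`, a hypothetical name). The `PikaZoom` row is the made-up example used later in this notebook, included here only so that both Legendary values appear. The `Type_1` and `Legendary` prefixes are chosen to match the column names in the table.

```python
import pandas as pd

# Stand-in data; 'PikaZoom' is a made-up legendary row for illustration
toy_poke_df = pd.DataFrame(
    {
        "Name": ["Beedrill", "Kingler", "Golem", "PikaZoom"],
        "Type 1": ["Bug", "Water", "Rock", "Bug"],
        "Legendary": [False, False, False, True],
    }
)

# One binary column per category; dtype=int gives 0/1 instead of booleans
type_dummies = pd.get_dummies(toy_poke_df["Type 1"], prefix="Type_1", dtype=int)
legendary_dummies = pd.get_dummies(
    toy_poke_df["Legendary"], prefix="Legendary", dtype=int
)

# axis=1 places the new columns next to the original ones
encoded_df = pd.concat([toy_poke_df, type_dummies, legendary_dummies], axis=1)

print(encoded_df)
```

Each row has exactly one 1 among the `Type_1_*` columns and exactly one among the `Legendary_*` columns, which is the defining property of a one-hot encoding.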


Suppose you built this encoding scheme on your training data and trained a model, and you now have new data whose features must be engineered before making predictions, as follows.

In [52]:


new_poke_df = pd.DataFrame([['PikaZoom', 'Bug', True], 
                           ['CharMyToast', 'Water', False]],
                           columns=['Name', 'Type 1', 'Legendary'])
new_poke_df


Unnamed: 0,Name,Type 1,Legendary
0,PikaZoom,Bug,True
1,CharMyToast,Water,False


In [34]:
# using fit() and transform(), add Type1_Label and Lgnd_Label to the DataFrame

You can leverage scikit-learn’s excellent API here by calling the transform(…) function of the previously built LabelEncoder objects on the new data.
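Reusing fitted encoders on new data can be sketched as follows. This is an illustrative example: `train_df` is a hypothetical, hand-made stand-in for the training data, while `new_poke_df` and the column names `Type1_Label` and `Lgnd_Label` follow the cells above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training data; both Legendary values must be seen during fit
train_df = pd.DataFrame(
    {
        "Type 1": ["Bug", "Water", "Rock", "Normal"],
        "Legendary": [False, False, True, False],
    }
)

# Fit one encoder per feature on the training data only
type_encoder = LabelEncoder().fit(train_df["Type 1"])
legendary_encoder = LabelEncoder().fit(train_df["Legendary"])

new_poke_df = pd.DataFrame(
    [["PikaZoom", "Bug", True], ["CharMyToast", "Water", False]],
    columns=["Name", "Type 1", "Legendary"],
)

# transform() reuses the mapping learned during fit(); it does not refit
new_poke_df["Type1_Label"] = type_encoder.transform(new_poke_df["Type 1"])
new_poke_df["Lgnd_Label"] = legendary_encoder.transform(new_poke_df["Legendary"])

print(new_poke_df)
```

Calling `transform` rather than `fit_transform` on the new data keeps the codes consistent with the ones the model was trained on. If the new data contains a label that was not seen during fitting, `transform` raises a `ValueError`.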

## Dummy Coding Scheme

Let’s apply a dummy coding scheme to Pokémon Type 1 by dropping the binary encoded feature of the first level (Type 1 = Bug).
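Dropping the first level can be sketched with the `drop_first` parameter of `get_dummies`. This is a minimal example on hand-made data (`toy_types`, a hypothetical name), not the notebook's own solution.

```python
import pandas as pd

# Stand-in 'Type 1' values
toy_types = pd.Series(["Bug", "Water", "Rock", "Normal", "Normal"], name="Type 1")

# drop_first=True removes the column of the first (alphabetical) level, 'Bug'
dummy_df = pd.get_dummies(toy_types, prefix="Type_1", drop_first=True, dtype=int)

print(dummy_df)
```

With the `Type_1_Bug` column removed, a Bug row is represented by zeros in all remaining columns, so k categories are encoded with k - 1 binary features.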


In [19]:
# use get_dummies to produce a dummy encoding
# show rows 4 through 9 (inclusive)




If you want, you can also choose to drop the binary encoded feature of the last level instead.

In [20]:
# fit on Type 1 and inspect the resulting classes



In [21]:
# call get_dummies on Type 1 only, without dropping any of the resulting columns
# call head()



In [22]:
# check the encoding with the dropped column (dummy coding)
# hint: isin can help



In [23]:
# check what the ~ operator does



In [24]:
# index into the DataFrame using the previous statement that uses ~



In [25]:
# assign it to a variable and show a head
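The sequence of cells above (full one-hot encoding, `isin`, the `~` operator, then selection) can be sketched as follows for dropping the last level. This is one possible reading of the exercise on hand-made data (`toy_types`, a hypothetical name), not the notebook's own solution.

```python
import pandas as pd

# Stand-in 'Type 1' values
toy_types = pd.Series(["Bug", "Water", "Rock", "Normal", "Normal"], name="Type 1")

# Full one-hot encoding, keeping every level
one_hot_df = pd.get_dummies(toy_types, prefix="Type_1", dtype=int)

# The last level in alphabetical order
last_level = one_hot_df.columns[-1]

# isin() returns a boolean mask that is True only for the listed columns
is_last = one_hot_df.columns.isin([last_level])

# ~ negates the mask element-wise, so every column except the last is selected
dummy_last_dropped_df = one_hot_df.loc[:, ~is_last]

print(dummy_last_dropped_df.head())
```

Here `~` turns the mask `[False, False, False, True]` into `[True, True, True, False]`, which keeps all columns except `Type_1_Water`.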




## Feature Hashing scheme

Find the number of distinct 'Genre' values in the dataset.

In [26]:
# use vgsales.csv: read it and call head()




In [27]:
# print('Total game genres: ' + str(len(df_videojuegos.Genre.unique())))
# print(df_videojuegos.Genre.sort_values().unique())

### We can see that there are a total of 12 genres of video games. If we used a one-hot encoding scheme on the Genre feature, we would end up having 12 binary features. Instead, we will now use a feature hashing scheme by leveraging scikit-learn’s FeatureHasher class, which uses a signed 32-bit version of the MurmurHash3 hash function. We will pre-define the final feature vector size to be 6 in this case.
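The feature hashing step described above can be sketched as follows. This is a minimal example on hand-made data (`toy_genres`, a hypothetical name) rather than the full `vgsales.csv`, and the `hash_0` ... `hash_5` column names are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

# Stand-in 'Genre' values
toy_genres = pd.Series(
    ["Sports", "Platform", "Racing", "Sports", "Role-Playing"], name="Genre"
)

# n_features fixes the output width; input_type="string" hashes string tokens
hasher = FeatureHasher(n_features=6, input_type="string")

# Each sample must be an iterable of strings, so wrap every genre in a list
hashed = hasher.transform([[genre] for genre in toy_genres])

# The result is a sparse matrix; convert it to a dense DataFrame for display
hashed_df = pd.DataFrame(
    hashed.toarray(), columns=[f"hash_{i}" for i in range(6)]
)

print(hashed_df)
```

Each genre lands in one of the six columns with a value of +1 or -1, and identical genres always produce identical rows. Since 12 genres are squeezed into 6 columns, distinct genres can collide in the same column, which is the trade-off for the fixed, smaller feature vector.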