# Processing categorical variables with Python

In this notebook, we explore how to process categorical variables in Python, an essential skill in data science and artificial intelligence. Categorical variables take one of a limited set of groups or categories. Nominal variables have no inherent order among those categories, while ordinal variables do.

Processing categorical variables is a crucial step in preparing data for machine learning algorithms, since many of these algorithms can only handle numeric inputs.

## Contents

1. Pandas Dummies
2. One-hot encoding with Scikit-learn
3. Encoding numerical categorical variables

## 1. Pandas Dummies

Pandas provides a function called `get_dummies` to convert categorical variable(s) into dummy/indicator variables. The function creates a new DataFrame with binary columns for each category/label present in the original data. Let's see how it works.

In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_iris

# Load the iris dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['species'] = iris.target

# Display the first few rows of the dataframe
df.head()

The dataset we are using is the famous Iris dataset. It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this DataFrame are:

- sepal length (cm)
- sepal width (cm)
- petal length (cm)
- petal width (cm)
- species

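The 'species' column is our categorical variable. It is stored as the integer codes 0, 1 and 2 (setosa, versicolor and virginica, in `iris.target_names` order), and we will transform it using the pandas `get_dummies` function.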
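Because `iris.target` holds integer codes rather than names, it can help to check which name each code stands for and how many samples each species has. A small sketch that reloads the dataset so it runs on its own; the names `species_names` and `counts` are illustrative only:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()

# Map each integer code in iris.target to its species name
code_to_name = dict(enumerate(iris.target_names.tolist()))
species_names = pd.Series(iris.target).map(code_to_name)

# Count the samples per species
counts = species_names.value_counts()
print(counts)
```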

In [None]:
# Use pandas get_dummies function to convert 'species' column into dummy/indicator variables
df_dummies = pd.get_dummies(df, columns=['species'])

# Display the first few rows of the new dataframe
df_dummies.head()

As you can see, `get_dummies` has replaced 'species' with three new columns: 'species_0', 'species_1', and 'species_2'. Each is an indicator column that marks whether a given row belongs to that species. Recent pandas versions (2.0 and later) return these columns as booleans by default; pass `dtype=int` if you need 0/1 values.

This is known as one-hot encoding, and it's a common way to handle categorical data in machine learning. Each category gets its own column in the data set, and this column is a binary column that indicates the presence or absence of the category.
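The same idea on a column of text labels, as a minimal sketch. The `color` column and its values are made up for illustration, and `dtype=int` is passed so that the indicators come out as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical toy data: one nominal column with three categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One indicator column per category, named <column>_<category>
toy_dummies = pd.get_dummies(toy, columns=["color"], dtype=int)
print(toy_dummies)

# Each row has exactly one "hot" column
row_sums = toy_dummies.sum(axis=1)
```

The new columns are ordered by sorted category value, so the order in which categories first appear in the data does not affect the layout.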

Now, let's move on to the next method of handling categorical data: one-hot encoding with Scikit-learn.

## 2. One-hot encoding with Scikit-learn

Scikit-learn is a powerful Python library for machine learning. Its `OneHotEncoder` class performs one-hot encoding as a transformer: `fit` learns the set of categories, and `transform` applies that same mapping to any data passed in later. Unlike pandas `get_dummies`, which derives the columns from whatever data it is given, `OneHotEncoder` remembers the categories it was fitted on, so it produces the same columns for training data and new data and can be used inside a Scikit-learn pipeline.

In [None]:
# Import necessary libraries
from sklearn.preprocessing import OneHotEncoder

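# Create the encoder; sparse_output=False returns a dense NumPy array
# (requires scikit-learn >= 1.2; older releases called this argument `sparse`)
encoder = OneHotEncoder(sparse_output=False)

# Fit and transform the 'species' column (double brackets keep the input 2-D)
species_onehot = encoder.fit_transform(df[['species']])

# Create a dataframe from the encoded species data
# (get_feature_names_out replaces the removed get_feature_names method)
df_onehot = pd.DataFrame(species_onehot, columns=encoder.get_feature_names_out(['species']), index=df.index)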

# Display the first few rows of the new dataframe
df_onehot.head()

As you can see, `OneHotEncoder` has created three new columns: 'species_0', 'species_1', and 'species_2'. Each of these columns is a binary column (stored as floats, 0.0 or 1.0) that indicates whether a given row belongs to that species.

This is the same result we got with pandas `get_dummies`. The difference is that the fitted encoder stores the categories it has seen, so the identical encoding can be reapplied to new data, which makes it better suited to machine learning workflows.
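One practical consequence of the fit/transform design, sketched on made-up data: the encoder can be fitted on one set of rows and reused on another, so the output columns stay fixed. The `train` and `new` frames are hypothetical, and `handle_unknown="ignore"` is one possible policy for unseen categories (they become all-zero rows), not the only one.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training rows and later, unseen rows
train = pd.DataFrame({"species": ["setosa", "versicolor", "virginica"]})
new = pd.DataFrame({"species": ["virginica", "unknown_species"]})

# Learn the categories from the training rows only
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train[["species"]])

# The default output is a SciPy sparse matrix; toarray() makes it dense
encoded_new = encoder.transform(new[["species"]]).toarray()

print(list(encoder.categories_[0]))
print(encoded_new)
```

With the default `handle_unknown="error"`, the second row would raise an exception instead, which can be preferable when an unseen category signals a data problem.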

Now, let's move on to the next method of handling categorical data: encoding numerical categorical variables.

## 3. Encoding numerical categorical variables

Sometimes, categorical variables are disguised as numerical variables. For example, consider a feature like 'education level' that is rated from 1 to 5. Although the values are numbers, they act as labels for categories: the gap between levels 1 and 2 need not match the gap between levels 4 and 5, so arithmetic such as averaging is not meaningful.
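A small sketch of why this matters in practice, using a hypothetical `education_level` column. By default `get_dummies` only encodes object, string and category columns, so an integer-coded feature passes through unchanged unless it is named in `columns` (or cast to the `category` dtype first).

```python
import pandas as pd

# Hypothetical feature: education level coded 1 to 5
people = pd.DataFrame({"education_level": [1, 3, 5, 3]})

# Without `columns`, the integer column is treated as numeric and left as is
untouched = pd.get_dummies(people)

# Naming the column explicitly forces one indicator column per observed level
encoded = pd.get_dummies(people, columns=["education_level"], dtype=int)

print(list(untouched.columns))
print(list(encoded.columns))
```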

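In such cases, we can use the `LabelEncoder` class from Scikit-learn, which maps each distinct value in a column to an integer from 0 to n_classes - 1, assigned in sorted order. Note that Scikit-learn documents `LabelEncoder` as intended for target labels rather than input features. Let's see how it works.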

In [None]:
# Import necessary libraries
from sklearn.preprocessing import LabelEncoder

# Create the encoder
encoder = LabelEncoder()

# Fit and transform the 'species' column
species_encoded = encoder.fit_transform(df['species'])

# Create a new dataframe with the encoded species data
df_encoded = df.copy()
df_encoded['species'] = species_encoded

# Display the first few rows of the new dataframe
df_encoded.head()

As you can see, `LabelEncoder` returns one integer per unique species. Because 'species' already held the codes 0, 1 and 2, the encoded column is identical to the original here.
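A sketch with made-up text labels shows the mapping that `LabelEncoder` builds. The `labels` list is hypothetical; `classes_` and `inverse_transform` are the encoder's own means of recovering the original values.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical text labels
labels = ["versicolor", "setosa", "virginica", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the original values; position i is the value encoded as i
print(list(encoder.classes_))
print(codes.tolist())

# inverse_transform maps the integer codes back to the original labels
restored = encoder.inverse_transform(codes)
print(list(restored))
```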

This is a simple way to turn categories into integers. The codes follow the sorted order of the values, so the encoding suits ordinal variables (those with a natural order) only when that sorted order matches the natural one. It should be used with caution on nominal variables (variables that don't have a natural order), as a machine learning algorithm may interpret the codes as an ordinal relationship that does not exist.
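The ordering caveat can be made concrete with a hypothetical `education` feature whose levels have a real order. `LabelEncoder` assigns codes by sorted value, which for text is alphabetical. Scikit-learn's `OrdinalEncoder` accepts an explicit `categories` list, which is one way to make the codes follow the intended order; it is shown here as an illustrative alternative, not as something this notebook relies on.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature and its intended order, lowest to highest
education = pd.DataFrame({"education": ["primary", "secondary", "bachelor", "master"]})
order = ["primary", "secondary", "bachelor", "master"]

# LabelEncoder sorts alphabetically: bachelor=0, master=1, primary=2, secondary=3
alphabetical = LabelEncoder().fit_transform(education["education"])

# OrdinalEncoder with explicit categories follows the given order instead
ordinal_encoder = OrdinalEncoder(categories=[order])
ordinal = ordinal_encoder.fit_transform(education[["education"]])

print(alphabetical.tolist())
print(ordinal.ravel().tolist())
```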

In this notebook, we've covered three common methods for handling categorical data in Python: pandas `get_dummies`, one-hot encoding with Scikit-learn, and encoding numerical categorical variables. Each of these methods has its own strengths and weaknesses, and the best one to use will depend on the specific nature of your dataset and the machine learning algorithm you're using.