# Encoding Categorical Variables

Linear regression, like most regression models, requires numeric inputs. A categorical variable must therefore be encoded as one or more numeric variables before it can be used as a predictor.

A common technique is one-hot encoding.

In [None]:
# libraries
import pandas as pd
import numpy as np

In [None]:
# Loading data
df_dia = pd.read_csv("https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt", sep="\t")

# Recoding the SEX column as text labels so that it is categorical (only to demonstrate encoding in the next step)
df_dia['SEX'] = df_dia['SEX'].replace({1: 'Male', 2: 'Female'})

# Examining the head
df_dia.head()


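One-hot encoding creates a binary (0/1) column for each category in the categorical variable. With `drop_first=True`, as used below, the first category is dropped, so the number of new columns is the number of categories minus 1. The dropped category becomes the baseline: it is represented by 0 in every remaining column. Dropping it avoids perfectly collinear columns in a linear regression that includes an intercept.

`SEX` is binary (i.e. it contains 2 categories), so 1 new column will be created.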
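
To illustrate the "minus 1" rule on a variable with more than 2 categories, here is a minimal sketch. The `toy` dataframe and its `color` column are hypothetical and are not part of the diabetes data; the sketch only assumes that `pd.get_dummies` orders the categories of a text column alphabetically, so the first one ('blue') is the one dropped.

```python
import pandas as pd

# Hypothetical toy data (not from the diabetes dataset): 3 categories
toy = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# Full one-hot encoding: one column per category (3 columns)
full = pd.get_dummies(toy, columns=["color"], dtype=int)

# drop_first=True removes the first category, leaving 3 - 1 = 2 columns
reduced = pd.get_dummies(toy, columns=["color"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

In `reduced`, a row with 0 in both remaining columns belongs to the dropped category.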

In [None]:
# Encoding using pandas get dummies
pd.get_dummies(df_dia, columns=['SEX'], drop_first=True, dtype=int)

In the new `SEX_Male` column, 1 indicates the presence of that category (i.e. 'Male') and 0 indicates its absence (i.e. 'Female').
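
One way to check this mapping is to place the original labels next to the encoded column. The sketch below uses a small hypothetical `sample` dataframe rather than `df_dia`, so that it runs without downloading the data.

```python
import pandas as pd

# Hypothetical sample with the same labels as the SEX column
sample = pd.DataFrame({"SEX": ["Male", "Female", "Female", "Male"]})

# 'Female' sorts first, so it is the category dropped by drop_first=True
encoded = pd.get_dummies(sample, columns=["SEX"], drop_first=True, dtype=int)

# Side-by-side view of the original label and its encoded value
check = pd.concat([sample["SEX"], encoded["SEX_Male"]], axis=1)
print(check)
```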

In [None]:
# Examining the dataframe: get_dummies returned a new dataframe, so df_dia is unchanged
df_dia.head()

In [None]:
# Applying this transformation to the dataframe
df_dia = pd.get_dummies(df_dia, columns=['SEX'], drop_first=True, dtype=int)

In [None]:
# Examining the dataframe: SEX has been replaced by SEX_Male
df_dia.head()