### Working with Categorical Variables
##### Categorical variables are variables that take a limited number of distinct values. They must be preprocessed before the data is passed to a machine learning model.<br>There are three approaches to preparing categorical data.

### First approach - Drop Categorical Variables
##### The easiest approach to dealing with categorical variables is to simply remove them from the dataset. This approach only works well if the columns do not contain useful information.
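As a minimal sketch of this approach, assuming a small made-up DataFrame (the values below are illustrative, not taken from the housing data), `select_dtypes` can keep just the numeric columns in one call:

```python
import pandas as pd

# Hypothetical toy data: one categorical column and two numeric columns
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "Type": ["h", "u", "h"],
    "Distance": [2.5, 4.0, 7.1],
})

# Keep only the numeric columns; the categorical column "Type" is dropped
toy_numeric = toy.select_dtypes(include="number")
print(list(toy_numeric.columns))
```

Whatever information "Type" carried is no longer available to the model.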
### Second approach - Ordinal Encoding
##### Ordinal encoding assigns each category a unique integer, which imposes an ordering on the categories. For example,<br>"Never" (0) < "Rarely" (1) < "Most days" (2) < "Every day" (3). Variables with such a natural ranking are called ordinal variables.
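The ordering in this example can be reproduced with scikit-learn's `OrdinalEncoder` by stating the category order explicitly. This is only a sketch: the `survey` frame and its `Breakfast` column are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical survey column; the intended ranking is listed from lowest to highest
order = ["Never", "Rarely", "Most days", "Every day"]
survey = pd.DataFrame({"Breakfast": ["Rarely", "Every day", "Never", "Most days"]})

# Passing `categories` fixes the mapping: Never -> 0, Rarely -> 1, Most days -> 2, Every day -> 3
encoder = OrdinalEncoder(categories=[order])
encoded = encoder.fit_transform(survey[["Breakfast"]])
print(encoded.ravel().tolist())
```

Without the `categories` argument the encoder sorts the values it finds, so the integers would follow alphabetical order rather than the intended ranking.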
### Third approach - One-Hot Encoding
##### One-hot encoding creates new columns indicating the presence (or absence) of each possible value in the original data.
In contrast to ordinal encoding, one-hot encoding does not assume an ordering of the categories. Thus, you can expect this approach to work particularly well if there is no clear ordering in the categorical data. We refer to categorical variables without an intrinsic ranking as nominal variables.
<br><br>
One-hot encoding generally does not perform well if the categorical variable takes on a large number of values (i.e., you generally won't use it for variables taking more than 15 different values).
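To make the idea concrete, here is a small sketch using pandas `get_dummies` (a different tool from the scikit-learn encoder used later in this notebook) on a made-up `Color` column:

```python
import pandas as pd

# Hypothetical nominal column: three colours with no natural ordering
toy = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Yellow"]})

# One indicator column per distinct value; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(toy["Color"], dtype=int)
print(one_hot)
```

Each row has exactly one 1, in the column that matches its original value. A column with many distinct values would produce just as many new columns, which is why high-cardinality variables are usually not one-hot encoded.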

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

In [4]:
file_path = 'C:/Users/hp/Desktop/MACHINE LEARNING/CATEGORICAL_IN-HOUSING/train.csv'
house_data = pd.read_csv(file_path)

y = house_data.Price
X = house_data.drop(['Price'],axis=1)
X.head(2)

Unnamed: 0,Suburb,Address,Rooms,Type,Method,SellerG,Date,Distance,Postcode,Bedroom2,Bathroom,Car,Landsize,BuildingArea,YearBuilt,CouncilArea,Lattitude,Longtitude,Regionname,Propertycount
0,Abbotsford,85 Turner St,2,h,S,Biggin,3/12/2016,2.5,3067.0,2.0,1.0,1.0,202.0,,,Yarra,-37.7996,144.9984,Northern Metropolitan,4019.0
1,Abbotsford,25 Bloomburg St,2,h,S,Biggin,4/02/2016,2.5,3067.0,2.0,1.0,0.0,156.0,79.0,1900.0,Yarra,-37.8079,144.9934,Northern Metropolitan,4019.0


In [9]:
train_X, test_X, train_y, test_y = train_test_split(X,y,train_size=0.8,test_size=0.2,random_state=0)

Next, we find the columns that contain null values in the training data and drop them from both the training and test sets.

In [10]:
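# Identify columns with missing values, using the training data only
missing_columns = [col for col in train_X.columns if train_X[col].isnull().any()]
# Drop the same columns from both sets so their features stay aligned
train_X = train_X.drop(missing_columns, axis=1)
test_X = test_X.drop(missing_columns, axis=1)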
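The same detect-and-drop step can be checked on a tiny made-up frame (the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in "BuildingArea"
toy = pd.DataFrame({
    "Rooms": [2, 3, 4],
    "BuildingArea": [np.nan, 79.0, 150.0],
    "Type": ["h", "u", "h"],
})

# A column is flagged as soon as it contains a single null
missing = [col for col in toy.columns if toy[col].isnull().any()]
toy_complete = toy.drop(columns=missing)
print(missing, list(toy_complete.columns))
```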

The columns with null values have now been removed. Next, we select the categorical columns with low cardinality.<br>
##### What is cardinality?<br> The cardinality of a column is the number of unique values it contains.
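As a quick illustration on made-up data, pandas `nunique` reports the cardinality of each column:

```python
import pandas as pd

# Hypothetical toy data: "Type" repeats values, "Address" is unique per row
toy = pd.DataFrame({
    "Type": ["h", "u", "h", "t"],
    "Address": ["85 Turner St", "25 Bloomburg St", "5 Charles St", "40 Federation La"],
})

# nunique() counts the distinct values in each column
cardinality = toy.nunique()
print(cardinality)
```

Here "Type" has a cardinality of 3 and "Address" a cardinality of 4. In a real dataset an address column would have close to one distinct value per row, which makes it a poor candidate for one-hot encoding.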

In [12]:
low_card_col = [col for col in train_X.columns if train_X[col].nunique()<10 and train_X[col].dtype=='object']
low_card_col

['Type', 'Method', 'Regionname']

In [14]:
numerical_cols = [col for col in train_X.columns if train_X[col].dtype in ['int64','float64']]
numerical_cols

['Rooms',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Landsize',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [16]:
my_cols = low_card_col + numerical_cols
my_cols

['Type',
 'Method',
 'Regionname',
 'Rooms',
 'Distance',
 'Postcode',
 'Bedroom2',
 'Bathroom',
 'Landsize',
 'Lattitude',
 'Longtitude',
 'Propertycount']

In [18]:
train_X_new = train_X[my_cols].copy()
test_X_new = test_X[my_cols].copy()

Next, we obtain a list of all of the categorical variables in the training data.

In [29]:
s = (train_X_new.dtypes=='object')
object_cols = list(s[s].index)
print(object_cols)

['Type', 'Method', 'Regionname']


### We now have the list of categorical variables. Next, we define a function, score_dataset, to compare the three approaches to dealing with them. It reports the mean absolute error (MAE) of a random forest model; a lower MAE means better predictions.

In [30]:
def score_dataset(train_X, test_X, train_y, test_y):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(train_X,train_y)
    prediction = model.predict(test_X)
    return mean_absolute_error(test_y,prediction)
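For reference, MAE is the average of the absolute differences between the true and predicted values. The sketch below checks a hand computation against `mean_absolute_error`, using made-up prices:

```python
from sklearn.metrics import mean_absolute_error

# Hypothetical true and predicted prices
actual = [500_000, 750_000, 1_000_000]
predicted = [520_000, 700_000, 1_030_000]

# Manual computation: mean of |actual - predicted|
manual_mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
sklearn_mae = mean_absolute_error(actual, predicted)
print(manual_mae, sklearn_mae)
```

The absolute errors are 20,000, 50,000 and 30,000, so both computations give roughly 33,333.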

### First approach - DROP CATEGORICAL VARIABLES

In [31]:
train_X_drop = train_X_new.select_dtypes(exclude=['object'])
test_X_drop = test_X_new.select_dtypes(exclude=['object'])

In [33]:
print('MAE from approach 1 after dropping the categorical variables:   ')
print(score_dataset(train_X_drop, test_X_drop, train_y, test_y))

MAE from approach 1 after dropping the categorical variables:   
175703.48185157913


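### Second approach - ORDINAL ENCODING
##### Scikit-learn has an OrdinalEncoder class that can be used to get ordinal encodings. We pass all the categorical columns to the encoder at once; it encodes each column independently, assigning integers to the categories found in the training data in sorted order. For columns with no natural ranking, that order is arbitrary.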

In [34]:
from sklearn.preprocessing import OrdinalEncoder

train_X_label = train_X_new.copy()
test_X_label = test_X_new.copy()

In [35]:
ordinalencoder = OrdinalEncoder()
train_X_label[object_cols] = ordinalencoder.fit_transform(train_X_new[object_cols])
test_X_label[object_cols] = ordinalencoder.transform(test_X_new[object_cols])

In [36]:
print('MAE from approach 2 after ordinal encoding:   ')
print(score_dataset(train_X_label, test_X_label, train_y, test_y))

MAE from approach 2 after ordinal encoding:   
165936.40548390493
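By default, `OrdinalEncoder` raises an error if `transform` meets a category that was not present during `fit`. One possible way around this, sketched here on made-up data and assuming scikit-learn 0.24 or later, is to map unseen categories to a sentinel value:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: "t" appears in the test set but not in the training set
train = pd.DataFrame({"Type": ["h", "u", "h"]})
test = pd.DataFrame({"Type": ["u", "t"]})

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)
encoded_test = encoder.transform(test)
print(encoded_test.ravel().tolist())
```

The known category "u" keeps its training code (1.0), while the unseen "t" becomes -1.0.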


### Third approach - ONE-HOT ENCODING
#####  - We use the OneHotEncoder class from scikit-learn to get one-hot encodings. <br> - Setting handle_unknown='ignore' avoids errors when the test data contains categories that aren't represented in the training data. <br> - Setting sparse=False ensures that the encoded columns are returned as a NumPy array instead of a sparse matrix (the parameter is named sparse_output in scikit-learn 1.2 and later). <br> - To use the encoder, we supply only the categorical columns that we want to be one-hot encoded.
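To see what handle_unknown='ignore' does in practice, here is a small sketch on made-up data. It calls `.toarray()` on the default sparse result so that it does not depend on which name the dense-output parameter has in the installed version.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: "t" appears in the test set but not in the training set
train = pd.DataFrame({"Type": ["h", "u", "h"]})
test = pd.DataFrame({"Type": ["u", "t"]})

# The encoder learns two categories from the training data: "h" and "u"
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Convert the sparse result to a dense array for display
encoded_test = encoder.transform(test).toarray()
print(encoded_test.tolist())
```

The row for "u" has a 1 in the second column, while the unseen "t" is encoded as a row of zeros rather than raising an error.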

In [37]:
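from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array; on scikit-learn < 1.2 use sparse=False instead
oh_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit on the training data only, then reuse the fitted encoder on the test data
train_ohencode = pd.DataFrame(oh_encoder.fit_transform(train_X_new[object_cols]))
test_ohencode = pd.DataFrame(oh_encoder.transform(test_X_new[object_cols]))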

The encoder returns a plain array, so the new DataFrames lose the original row index; put it back.

In [39]:
train_ohencode.index = train_X_new.index
test_ohencode.index = test_X_new.index
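Restoring the index matters because `pd.concat` with `axis=1` aligns rows by index label. A small sketch with made-up frames shows what happens when the labels do not match:

```python
import pandas as pd

# Hypothetical numeric rows with a shuffled index, as after train_test_split
numeric = pd.DataFrame({"Rooms": [2, 3]}, index=[7, 3])
# A freshly built DataFrame gets a default 0..n-1 index
encoded = pd.DataFrame([[1.0, 0.0], [0.0, 1.0]])

# Labels 7, 3 and 0, 1 do not overlap, so concat produces four rows padded with NaN
misaligned = pd.concat([numeric, encoded], axis=1)

# After copying the index, the rows line up one to one
encoded.index = numeric.index
aligned = pd.concat([numeric, encoded], axis=1)
print(misaligned.shape, aligned.shape)
```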

Remove categorical columns (will replace with one-hot encoding)

In [41]:
num_X_train = train_X_new.drop(object_cols, axis=1)
num_X_test = test_X_new.drop(object_cols, axis=1)

Add one-hot encoded columns to numerical features

In [42]:
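OH_X_train = pd.concat([num_X_train, train_ohencode], axis=1)
OH_X_test = pd.concat([num_X_test, test_ohencode], axis=1)
OH_X_train.columns = OH_X_train.columns.astype(str)  # recent scikit-learn requires string column names
OH_X_test.columns = OH_X_test.columns.astype(str)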

In [43]:
print('MAE from approach 3 after one hot encoding:   ')
print(score_dataset(OH_X_train, OH_X_test, train_y, test_y))

MAE from approach 3 after one hot encoding:   
166089.4893009678


In this case, dropping the categorical columns (Approach 1) performed worst, since it had the highest MAE. The other two approaches returned MAE values so close to each other that neither shows a meaningful benefit over the other.
<br>
In general, one-hot encoding (Approach 3) typically performs best and dropping the categorical columns (Approach 1) typically performs worst, but results vary from case to case.