In [1]:
import pandas as pd 

### About Categorical Features 

What is a categorical variable? 
- A categorical variable is one that takes one of two or more categories rather than a measured quantity. When there is no intrinsic ordering to the categories, it is also called a nominal variable. For example, gender as recorded in many datasets is a categorical variable with two categories (male and female) and no intrinsic ordering between them. In pandas, categorical features usually show up as columns whose data type is not int or float.

There are basically two types of categorical variables -
1. Ordinal - A set of discrete values that has some intrinsic order between them, such as low, medium and high.
2. Nominal - A set of discrete values that has no intrinsic order between them, such as gender.

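As a rule of thumb, columns with the object datatype are categorical variables, while columns with an int or float datatype usually are not. The rule is not exact, since categories are sometimes stored as integer codes. In general, categorical data consists of discrete values and not continuous ones.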
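The dtype rule of thumb and the ordinal/nominal distinction can be sketched on a small frame. The `toy` data and its column names are hypothetical and are not taken from the dataset used in this lecture.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "genre": ["Sports", "Racing", "Sports"],   # nominal: no natural order
    "rating": ["low", "high", "medium"],       # ordinal: low < medium < high
    "sales": [41.49, 15.85, 15.75],            # numeric, not categorical
})

# Non-numeric columns are the candidates for categorical features.
cat_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Give the ordinal column an explicit order so that sorting and
# comparisons respect low < medium < high.
toy["rating"] = pd.Categorical(
    toy["rating"], categories=["low", "medium", "high"], ordered=True
)
highest = toy["rating"].max()
```

Declaring the order explicitly is one way to keep an ordinal variable from being treated as nominal; a nominal column such as `genre` can stay as plain strings.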

In this lecture, we will use a dataset called video games sales.
Here's the link to the data set for further reference - https://www.kaggle.com/anthonypino/melbourne-housing-market?select=MELBOURNE_HOUSE_PRICES_LESS.csv

In [2]:
df = pd.read_csv(r'D:\Videos\Feature engineering\vgsales.csv')

In [3]:
df.head()

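|   | Rank | Name | Platform | Year | Genre | Publisher | NA_Sales | EU_Sales | JP_Sales | Other_Sales | Global_Sales |
|---|------|------|----------|------|-------|-----------|----------|----------|----------|-------------|--------------|
| 0 | 1 | Wii Sports | Wii | 2006.0 | Sports | Nintendo | 41.49 | 29.02 | 3.77 | 8.46 | 82.74 |
| 1 | 2 | Super Mario Bros. | NES | 1985.0 | Platform | Nintendo | 29.08 | 3.58 | 6.81 | 0.77 | 40.24 |
| 2 | 3 | Mario Kart Wii | Wii | 2008.0 | Racing | Nintendo | 15.85 | 12.88 | 3.79 | 3.31 | 35.82 |
| 3 | 4 | Wii Sports Resort | Wii | 2009.0 | Sports | Nintendo | 15.75 | 11.01 | 3.28 | 2.96 | 33.0 |
| 4 | 5 | Pokemon Red/Pokemon Blue | GB | 1996.0 | Role-Playing | Nintendo | 11.27 | 8.89 | 10.22 | 1.0 | 31.37 |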


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16598 entries, 0 to 16597
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          16598 non-null  int64  
 1   Name          16598 non-null  object 
 2   Platform      16598 non-null  object 
 3   Year          16327 non-null  float64
 4   Genre         16598 non-null  object 
 5   Publisher     16540 non-null  object 
 6   NA_Sales      16598 non-null  float64
 7   EU_Sales      16598 non-null  float64
 8   JP_Sales      16598 non-null  float64
 9   Other_Sales   16598 non-null  float64
 10  Global_Sales  16598 non-null  float64
dtypes: float64(6), int64(1), object(4)
memory usage: 1.4+ MB


### Methods of handling categorical data

There are mainly three methods for handling missing values in categorical features:
1. Dropping the column (or the rows with missing values) - This is the crudest method of handling missing categorical values, and most of the time it results in the loss of valuable data. So this method is generally not recommended.

2. Mode imputation - In this method we replace the missing values with the most frequent value in the column.

3. Imputation with the help of an ML algorithm such as KNN - The assumption behind using KNN for missing values is that a point's value can be approximated by the values of the points that are closest to it, based on the other variables.

#### Dropping rows or columns directly

In [4]:
df.isnull().sum()

Rank              0
Name              0
Platform          0
Year            271
Genre             0
Publisher        58
NA_Sales          0
EU_Sales          0
JP_Sales          0
Other_Sales       0
Global_Sales      0
dtype: int64

In [5]:
df.columns

Index(['Rank', 'Name', 'Platform', 'Year', 'Genre', 'Publisher', 'NA_Sales',
       'EU_Sales', 'JP_Sales', 'Other_Sales', 'Global_Sales'],
      dtype='object')

In [8]:
# First we will find out the categorical features (columns with object dtype)
s = (df.dtypes == 'object')
cat_feature = list(s[s].index)

cat_feature

['Name', 'Platform', 'Genre', 'Publisher']

Only Publisher has missing values among the categorical features.


In [7]:
len(df.Publisher)

16598

In [9]:
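# Drop the rows where Publisher is missing and assign the result back;
# calling dropna(inplace=True) on df['Publisher'] alone does not reliably change df
df = df.dropna(subset=['Publisher'])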

In [10]:
df['Publisher'].isnull().sum()

0

In [11]:
len(df['Publisher'])

16540
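Dropping a whole column and dropping only the affected rows are two different operations. A minimal sketch on a hypothetical `toy` frame (the values are made up for illustration) shows both:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two missing publishers.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Publisher": ["Nintendo", np.nan, "Ubisoft", np.nan],
    "Global_Sales": [1.0, 2.0, 3.0, 4.0],
})

# Option 1: drop the whole column that contains missing values.
without_column = toy.drop(columns=["Publisher"])

# Option 2: keep the column, but drop only the rows where it is missing.
without_rows = toy.dropna(subset=["Publisher"])
```

Option 1 keeps every row but loses the feature entirely, while option 2 keeps the feature but loses the rows (and with them the other columns' values for those rows).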

#### Replacing missing value with the highest frequency

In [19]:
data = pd.read_csv(r'D:\Videos\Feature engineering\vgsales.csv')

In [20]:
data['Publisher'].value_counts()

Electronic Arts                 1351
Activision                       975
Namco Bandai Games               932
Ubisoft                          921
Konami Digital Entertainment     832
                                ... 
Boost On                           1
Her Interactive                    1
Legacy Interactive                 1
Imax                               1
Fortyfive                          1
Name: Publisher, Length: 578, dtype: int64

In [23]:
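# 'Electronic Arts' is the most frequent publisher (the mode).
# Assign the result back; fillna(inplace=True) on a selected column does not reliably update data
data['Publisher'] = data['Publisher'].fillna('Electronic Arts')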

In [24]:
data['Publisher'].value_counts()

Electronic Arts                 1409
Activision                       975
Namco Bandai Games               932
Ubisoft                          921
Konami Digital Entertainment     832
                                ... 
Boost On                           1
Her Interactive                    1
Legacy Interactive                 1
Imax                               1
Fortyfive                          1
Name: Publisher, Length: 578, dtype: int64

A drawback of this method is that it inflates the most frequent category (Electronic Arts goes from 1351 to 1409 rows here), which can make an already imbalanced distribution within the column worse.
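Instead of hard-coding the most frequent value, it can be computed with `Series.mode()`. The following is a minimal sketch on a hypothetical `publisher` series (the values are made up); it also measures how much the share of the top category grows after imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values.
publisher = pd.Series(
    ["EA", "EA", "Ubisoft", np.nan, np.nan, "Konami"], name="Publisher"
)

# mode() ignores missing values and returns a Series; take its first entry.
most_frequent = publisher.mode()[0]
filled = publisher.fillna(most_frequent)

# Share of the top category among the non-missing values, before and after.
share_before = publisher.value_counts(normalize=True)[most_frequent]
share_after = filled.value_counts(normalize=True)[most_frequent]
```

Here the top category's share rises from one half to two thirds, which is the imbalance effect in miniature.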

#### Replacing with the help of KNN imputation

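This is often the most accurate of the three methods, since it uses the other variables to estimate each missing value. You can also use an unsupervised ML model for the job. Note that KNNImputer works on numeric data, so the example below uses a small numeric array.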

In [6]:
import numpy as np
from sklearn.impute import KNNImputer
X = np.array([[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]])
print(X)



[[ 1.  2. nan]
 [ 3.  4.  3.]
 [nan  6.  5.]
 [ 8.  8.  7.]]


In [8]:
imputer = KNNImputer(n_neighbors=1)
imputer.fit_transform(X)


array([[1., 2., 3.],
       [3., 4., 3.],
       [3., 6., 5.],
       [8., 8., 7.]])
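`KNNImputer` only accepts numeric input, so a categorical column has to be encoded before it can be imputed this way. One possible approach, sketched here under assumptions, is to turn the categories into integer codes, impute with a single neighbour so that the filled code is always an existing category, and decode again. The `toy` frame and the `Publisher_code` column name are hypothetical, and the features are left unscaled for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the last row has no publisher.
toy = pd.DataFrame({
    "NA_Sales": [10.0, 9.5, 0.2, 0.3, 9.8],
    "EU_Sales": [8.0, 7.5, 0.1, 0.2, 7.9],
    "Publisher": ["Nintendo", "Nintendo", "Ubisoft", "Ubisoft", None],
})

# Encode categories as integer codes; pandas marks missing values with -1.
cat = toy["Publisher"].astype("category")
codes = cat.cat.codes.astype(float).replace(-1.0, np.nan)

features = toy[["NA_Sales", "EU_Sales"]].assign(Publisher_code=codes)

# With n_neighbors=1 the missing code is copied from the single nearest row,
# so the result is always a valid category code (no averaging of codes).
imputed = KNNImputer(n_neighbors=1).fit_transform(features)

# Decode the codes back into category labels.
filled_codes = imputed[:, 2].round().astype(int)
toy["Publisher"] = cat.cat.categories[filled_codes].to_numpy()
```

With more than one neighbour the codes would be averaged, which has no meaning for nominal categories, so this sketch deliberately stays with `n_neighbors=1`.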

#### End of distribution imputation 

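This is a method in which the missing values of a numeric variable are replaced with a value from the far end (tail) of that variable's distribution, for example the mean plus three standard deviations.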
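End-of-distribution imputation is commonly done by filling the missing entries of a numeric column with a value taken from the tail of that column's distribution. A minimal sketch, assuming the mean plus three standard deviations as the tail value and using a hypothetical `year` series:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with two missing values.
year = pd.Series([2006.0, 1985.0, np.nan, 2009.0, 1996.0, np.nan], name="Year")

# Tail value: mean + 3 standard deviations (one common convention,
# chosen here as an assumption). mean() and std() skip missing values.
extreme_value = year.mean() + 3 * year.std()

year_filled = year.fillna(extreme_value)
```

Because the filled value lies beyond the observed range, the imputed rows stay distinguishable from the real ones, at the cost of distorting the column's distribution.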