# High Cardinality
Variables with many categories are also said to have high cardinality.  
If a categorical variable has many labels (high cardinality), one-hot encoding it expands the feature space dramatically, because each label becomes its own column.  
One approach that is heavily used in Kaggle competitions is to replace each label of the categorical variable with its count, that is, the number of times the label appears in the dataset, or with its frequency, that is, the fraction of observations that carry the label. The two carry the same information, since the frequency is just the count divided by the number of observations.  
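
As a minimal sketch of the two encodings, the snippet below applies both to a made-up column. The `colour` values and all variable names are illustrative assumptions and are not part of the house-price data used in the rest of this notebook.

```python
import pandas as pd

# Toy categorical column; the values are made up for illustration.
colour = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="colour")

# Count encoding: each label -> number of rows carrying that label.
count_map = colour.value_counts().to_dict()

# Frequency encoding: each label -> share of rows carrying that label.
freq_map = colour.value_counts(normalize=True).to_dict()

count_encoded = colour.map(count_map)
freq_encoded = colour.map(freq_map)

# Side-by-side view: the frequency column is the count column divided by len(colour).
print(pd.DataFrame({"colour": colour, "count": count_encoded, "frequency": freq_encoded}))
```

Either way the column stays a single numeric column, whereas one-hot encoding would create one column per label.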

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split

In [2]:
data = pd.read_csv('C:\\Users\\koriv\\Desktop\\MachineLearning_DataScience\\Hands_On_Machine_Learning\\Count Frequency Encoding\\train.csv', usecols=['MSZoning','Street','LotShape','Utilities','LandSlope','SalePrice'])

In [4]:
data.head()

Unnamed: 0,MSZoning,Street,LotShape,Utilities,LandSlope,SalePrice
0,RL,Pave,Reg,AllPub,Gtl,208500
1,RL,Pave,Reg,AllPub,Gtl,181500
2,RL,Pave,IR1,AllPub,Gtl,223500
3,RL,Pave,IR1,AllPub,Gtl,140000
4,RL,Pave,IR1,AllPub,Gtl,250000


In [5]:
data.isnull().mean()

MSZoning     0.0
Street       0.0
LotShape     0.0
Utilities    0.0
LandSlope    0.0
SalePrice    0.0
dtype: float64

In [6]:
# Split the data into train and test sets
X_train,X_test,y_train,y_test = train_test_split(data[['MSZoning','Street','LotShape','Utilities','LandSlope']], 
                                                 data['SalePrice'], test_size =.3, random_state =111)

In [11]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(1022, 5)
(438, 5)
(1022,)
(438,)


In [16]:
# Check the unique labels available
X_train['MSZoning'].unique()

array(['RL', 'RM', 'RH', 'FV', 'C (all)'], dtype=object)

In [17]:
# Create a dictionary of counts for the unique labels available
Ms = X_train['MSZoning'].value_counts().to_dict()
Ms

{'RL': 814, 'RM': 141, 'FV': 48, 'RH': 12, 'C (all)': 7}

In [19]:
# Create a dictionary of frequencies (count / number of rows) for the unique labels of all nominal columns
cat_vars = ['MSZoning','Street','LotShape','Utilities','LandSlope']
encoder_dictionary ={}
for var in cat_vars:
    encoder_dictionary[var] = (X_train[var].value_counts()/len(X_train)).to_dict()
encoder_dictionary

{'MSZoning': {'RL': 0.7964774951076321,
  'RM': 0.1379647749510763,
  'FV': 0.046966731898238745,
  'RH': 0.011741682974559686,
  'C (all)': 0.00684931506849315},
 'Street': {'Pave': 0.99706457925636, 'Grvl': 0.0029354207436399216},
 'LotShape': {'Reg': 0.6223091976516634,
  'IR1': 0.34637964774951074,
  'IR2': 0.023483365949119372,
  'IR3': 0.007827788649706457},
 'Utilities': {'AllPub': 0.9990215264187867, 'NoSeWa': 0.0009784735812133072},
 'LandSlope': {'Gtl': 0.9549902152641878,
  'Mod': 0.03718199608610567,
  'Sev': 0.007827788649706457}}

In [20]:
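# Replace each label with its training-set frequency
X_train = X_train.copy()  # guards against pandas' SettingWithCopyWarning on a slice
for var in cat_vars:
    X_train[var] = X_train[var].map(encoder_dictionary[var])
print("Shape of X_train: ", X_train.shape)
X_train.head()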

Shape of X_train:  (1022, 5)


Unnamed: 0,MSZoning,Street,LotShape,Utilities,LandSlope
529,0.796477,0.997065,0.34638,0.999022,0.95499
207,0.796477,0.997065,0.34638,0.999022,0.95499
498,0.796477,0.997065,0.622309,0.999022,0.95499
191,0.796477,0.997065,0.34638,0.999022,0.95499
1402,0.796477,0.997065,0.622309,0.999022,0.95499
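
The notebook encodes only `X_train`. The test set would normally be encoded with the same mapping learned on the training rows. The sketch below shows one way to do that on toy stand-ins for `X_train` and `X_test`. The frames, the `test_encoded` name, and the choice to fill unseen labels with 0 are assumptions for illustration, not something this notebook prescribes.

```python
import pandas as pd

# Toy stand-ins for X_train / X_test; the values are made up for illustration.
train = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", "RL", "FV"]})
test = pd.DataFrame({"MSZoning": ["RL", "RM", "RH"]})  # "RH" never appears in train

# Learn the frequencies from the training rows only.
freq_map = train["MSZoning"].value_counts(normalize=True).to_dict()

# Work on a copy so the original frame is left untouched.
test_encoded = test.copy()
test_encoded["MSZoning"] = test_encoded["MSZoning"].map(freq_map)

# Labels unseen during training come out of .map() as NaN. Filling with 0 is one
# possible choice (an assumption here): "this label had zero training frequency".
n_unseen = int(test_encoded["MSZoning"].isna().sum())
test_encoded["MSZoning"] = test_encoded["MSZoning"].fillna(0.0)

print("unseen labels in test:", n_unseen)
print(test_encoded)
```

Counting the NaN values before filling them is a cheap way to notice when the test set contains labels the training set never saw.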


Count and frequency encoding has some advantages and disadvantages.  

Advantages  
1. It is very simple to implement  
2. It does not increase the dimensionality of the feature space  

Disadvantages  
1. If two labels have the same count, they are replaced by the same number, so the model can no longer tell them apart and that information is lost.  
2. It adds somewhat arbitrary numbers, and therefore weights, to the different labels, which may not be related to their predictive power  
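
The first disadvantage can be seen in a small sketch. The column below is made up so that two labels occur equally often; the names `labels`, `count_map` and `encoded` are illustrative assumptions.

```python
import pandas as pd

# Toy column in which "B" and "C" happen to occur equally often (made-up values).
labels = pd.Series(["A", "A", "A", "B", "B", "C", "C"])

count_map = labels.value_counts().to_dict()
encoded = labels.map(count_map)

# "B" and "C" both map to 2, so the encoded column can no longer tell them apart.
print(count_map)
print(encoded.nunique(), "distinct encoded values for", labels.nunique(), "labels")
```

Comparing `nunique()` before and after encoding is a quick check for such collisions on a real column.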