OneHotEncoder: splits a categorical column into one new column per category (response option); each new column holds 1 if the row has that category and 0 otherwise


When to use?
1) when you have more than 2 options for a categorical column
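
A minimal sketch of that situation, assuming a made-up single column called colors with three response options. The variable names are illustrative only. The default output is a sparse matrix, so .toarray() is used here just to make the result easy to read.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single-column data with three response options
colors = np.array([['red'], ['green'], ['blue'], ['green']], dtype=object)

encoder = OneHotEncoder()
# The default output is a SciPy sparse matrix; .toarray() converts it to a dense array
encoded = encoder.fit_transform(colors).toarray()

# Categories are sorted, so the columns are ordered blue, green, red
print(encoder.categories_[0])
# One row per sample, one column per category, exactly one 1.0 per row
print(encoded)
```

Each row ends up with a single 1 in the column that matches its category.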


In [None]:
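import numpy as np
from sklearn.preprocessing import OneHotEncoder

# All arguments are keyword-only; the values shown are the defaults.
# sparse_output needs scikit-learn >= 1.2 (it replaces the old sparse= argument).
ncdr = OneHotEncoder(categories='auto', 
                     drop=None, 
                     sparse_output=True, 
                     dtype=np.float64, 
                     handle_unknown='error', 
                     min_frequency=None, 
                     max_categories=None)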

In [None]:
##METHODS
# X is a small made-up input so the calls below can run; y is ignored by the encoder and can be left out
X = [['red', 'S'], ['green', 'M'], ['blue', 'L']]
ncdr.fit(X) #fit the encoder (learn the categories in each column)
X_enc = ncdr.fit_transform(X) #fit to data and transform the data in one step
ncdr.get_feature_names_out() #names of the output columns, one per category that is being added (optionally pass input_features)
ncdr.get_params(deep=True) #get the parameters for the estimator
ncdr.inverse_transform(X_enc) #convert encoded data back to the original representation; not always exact: an all-zero row (unknown category) comes back as None and infrequent categories come back as 'infrequent_sklearn'
ncdr.set_output(transform='default') #set output container
ncdr.set_params(handle_unknown='error') #set the parameters of this estimator
ncdr.transform(X) #transform the data using the fitted one-hot encoder


In [None]:
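##ATTRIBUTES (available after .fit())
ncdr.categories_ 
ncdr.drop_idx_
ncdr.infrequent_categories_ #only defined when min_frequency or max_categories is set
ncdr.n_features_in_
ncdr.feature_names_in_ #only defined when X has string column names (e.g. a pandas DataFrame)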

PARAMETERS
categories= categories per feature
    -'auto' (default): determine categories automatically from the training data
    -list : categories[i] holds the categories expected in the ith column; the passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values
drop= specifies a methodology to use to drop one of the categories per feature; useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model
    -None : retain all features (default)
    -'first' : drop the first category in each feature; if only one category is present, the feature will be dropped entirely
    -'if_binary' : drop the first category in each feature with two categories; features with one category or more than 2 categories are left intact
    -array : drop[i] is the category in feature X[:,i] that should be dropped
sparse= old name for sparse_output; deprecated in scikit-learn 1.2 and removed in 1.4, use sparse_output instead
sparse_output= (True or False) will return a sparse matrix if set True else will return an array (default True)
dtype= desired dtype of the output (default numpy.float64)
handle_unknown= specifies the way unknown categories are handled during transform
    -'error' (default): raise an error if an unknown category is present during transform
    -'ignore' : when an unknown category is encountered during transform, the resulting encoded columns for this feature will be all zeros
    -'infrequent_if_exist' : when an unknown category is encountered during transform, the resulting encoded columns for this feature will map to the infrequent category if it exists; the infrequent category will be mapped to the last position in the encoding; if there is no infrequent category, the unknown category is handled like 'ignore'
min_frequency= the threshold below which a category is considered infrequent (default None)
    -int : categories seen fewer than this many times are infrequent
    -float : categories seen in less than min_frequency * n_samples rows are infrequent
max_categories= upper limit on the number of output columns for each input feature when considering infrequent categories; the limit includes the column that groups the infrequent categories (default None, no limit)

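ATTRIBUTES
categories_ : categories of each feature determined during fitting, in the same order as the columns of X
drop_idx_ : drop_idx_[i] is the index in categories_[i] of the category dropped for feature i, or None if nothing is dropped from that feature; the whole attribute is None when all encoded columns are kept
infrequent_categories_ : list of infrequent categories for each feature (None for a feature with no infrequent categories); only defined when min_frequency or max_categories is set
n_features_in_ : number of features seen during the .fit()
feature_names_in_ : names of the above features; only defined when X has column names that are all strings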
