# <font color='Blue'>HYPERPARAMETERS FOR MACHINE LEARNING MODEL ( SCIKIT-LEARN )</font>

This notebook walks through the hyperparameters of the following models (all from the scikit-learn library, except XGBoost, which is a separate library):
* Linear regression model
* Logistic regression model
* KNeighborsClassifier model
* Support vector machine model
* Random Forest classifier model
* XGBoost model
* KMeans clustering model

# <font color='green'>Understanding Hyperparameters for linear regression </font>

LinearRegression(fit_intercept=True,
    normalize=False,
    copy_X=True,
    n_jobs=None)
    
## fit_intercept
fit_intercept=False sets the y-intercept to 0, so the fitted line is forced through the origin.

If fit_intercept=True, the y-intercept is estimated from the data together with the slope.

The plot below makes the difference clear. When fit_intercept=True, the line of best fit is free to cross the y-axis wherever the data dictate (close to 100 in this example). When fit_intercept=False, the line is forced through the origin (0, 0).

![image.png](attachment:image.png)
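The effect of fit_intercept can be reproduced on synthetic data. The sketch below is only an illustration: the data (a true intercept of 100 and a slope of 3) and the variable names are assumptions made for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration): y = 100 + 3x + noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = 100 + 3 * X[:, 0] + rng.normal(0, 1, size=50)

# Intercept estimated from the data
with_intercept = LinearRegression(fit_intercept=True).fit(X, y)

# Intercept fixed at 0: the line must pass through the origin
through_origin = LinearRegression(fit_intercept=False).fit(X, y)

print("fit_intercept=True :", with_intercept.intercept_, with_intercept.coef_)
print("fit_intercept=False:", through_origin.intercept_, through_origin.coef_)
```

With the intercept forced to 0, the slope has to absorb the offset, so it comes out much steeper than the true slope.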

## n_jobs

The number of jobs to use for the computation.

This will only provide a speedup for n_targets > 1 and sufficiently large problems. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the scikit-learn Glossary for more details.

## copy_X

If True (default), X is copied before fitting; otherwise, it may be overwritten.



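## normalize

If True, the regressors X are normalized before regression by subtracting the mean and dividing by the l2-norm. This parameter is ignored when fit_intercept=False. Note that normalize was deprecated in scikit-learn 1.0 and removed in 1.2; with newer versions, scale the data before fitting instead.

See the following link for a good idea of what normalize=True does:

https://stackoverflow.com/questions/54067474/comparing-results-from-standardscaler-vs-normalizer-in-linear-regression

See the following link to understand normalization, standardization and rescaling:

https://towardsai.net/p/data-science/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

See the following link to understand the difference between normalization and regularization:

https://stackoverflow.com/questions/47014365/what-is-the-difference-between-normalisation-and-regularisation-in-machine-learn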

##### Normalization typically means rescaling the values into a range of [0, 1].

##### Standardization typically means rescaling the data to have a mean of 0 and a standard deviation of 1 (unit variance).
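The two definitions can be checked with scikit-learn's MinMaxScaler and StandardScaler. This is a minimal sketch on a made-up one-column array; note that neither scaler is claimed to reproduce exactly what normalize=True did inside LinearRegression.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy feature column (assumption for illustration)
X = np.array([[1.0], [2.0], [3.0], [4.0], [10.0]])

# Normalization: rescale the values into the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: rescale to mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(X)

print("min-max range:", X_minmax.min(), X_minmax.max())
print("standardized mean/std:", X_std.mean(), X_std.std())
```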



# <font color='green'>Understanding Hyperparameters for logistic regression </font>

LogisticRegression(penalty='l2',
                   dual=False,
                   tol=0.0001,
                   C=1.0, 
                   fit_intercept=True,
                   intercept_scaling=1,
                   class_weight=None, 
                   random_state=None, 
                   solver='lbfgs',
                   max_iter=100,
                   multi_class='auto', 
                   verbose=0, 
                   warm_start=False,
                   n_jobs=None,
                   l1_ratio=None)

## penalty − str, ‘l1’, ‘l2’, ‘elasticnet’ or ‘none’, optional, default = ‘l2’

Imposes a penalty on the logistic model for having too many variables. This shrinks the coefficients of the less contributive variables toward zero and is also known as regularization.

The most commonly used penalized regressions are:

(L2) ridge regression: variables with a minor contribution have their coefficients close to zero. However, all the variables are incorporated in the model. This is useful when all variables need to be incorporated in the model according to domain knowledge.

(L1) lasso regression: the coefficients of some less contributive variables are forced to be exactly zero. Only the most significant variables are kept in the final model.

Elastic net regression: the combination of ridge and lasso regression. It shrinks some coefficients toward zero (like ridge regression) and sets some coefficients to exactly zero (like lasso regression).
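The different behaviour of the L1 and L2 penalties can be seen by counting coefficients that are exactly zero. The sketch below is an illustration under assumptions: the synthetic dataset, C=0.05 and the liblinear solver are choices made for this example, and newer scikit-learn releases may emit deprecation warnings for the penalty argument.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# 20 features, of which only 3 carry signal (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

# Same regularization strength, different penalty type
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X, y)

# L1 (lasso) drives some coefficients to exactly zero; L2 (ridge) only shrinks them
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```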

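## dual − bool, optional, default = False

Dual or primal formulation. The dual formulation is only implemented for the l2 penalty with the liblinear solver. Prefer dual=False when n_samples > n_features.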



## solver − str, {‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, ‘saga’}, optional, default = ‘lbfgs’

The algorithm used to optimize the cost function.

liblinear − It is a good choice for small datasets. It also handles L1 penalty. For multiclass problems, it is limited to one-versus-rest schemes.

newton-cg − It handles only L2 penalty.

lbfgs − Stands for Limited-memory Broyden–Fletcher–Goldfarb–Shanno. It approximates the second derivative matrix updates with gradient evaluations. It stores only the last few updates, so it saves memory. It isn't super fast with large data sets.

For multiclass problems, it handles multinomial loss. It also handles only L2 penalty.

saga − It is a good choice for large datasets. For multiclass problems, it also handles multinomial loss. Along with L1 penalty, it also supports ‘elasticnet’ penalty.

sag − It is also used for large datasets. For multiclass problems, it also handles multinomial loss.

## fit_intercept − Boolean, optional, default = True

Specifies whether a constant (also known as a bias or intercept) should be added to the decision function.

## n_jobs − int or None, optional, default = None

Number of jobs to run in parallel (-1 means using all processors to speed up the execution).

If multi_class = ‘ovr’, this parameter represents the number of CPU cores used when parallelizing over classes. It is ignored when solver = ‘liblinear’.

## warm_start − bool, optional, default = False

With this parameter set to True, the solution of the previous call to fit is reused as initialization. With the default, i.e. False, the previous solution is erased.

## l1_ratio − float or None, optional, default = None

It is used only when penalty = ‘elasticnet’. It is the Elastic-Net mixing parameter, with 0 <= l1_ratio <= 1. Setting l1_ratio = 0 is equivalent to penalty = ‘l2’, and l1_ratio = 1 is equivalent to penalty = ‘l1’.

## verbose − int, optional, default = 0

By default, the value of this parameter is 0. For the liblinear and lbfgs solvers, set verbose to any positive number to enable verbose output.

Verbose is a general programming term for producing lots of logging output. You can think of it as asking the program to "tell me everything about what you are doing all the time".

## max_iter − int, optional, default = 100

As the name suggests, it represents the maximum number of iterations the solvers are allowed to take to converge.

## tol− float, optional, default=1e-4

tol is the tolerance for the stopping criteria.

The tol parameter tells the optimization algorithm when to stop. If the value of tol is too big, the algorithm stops before it has converged.

## C − float, optional, default=1.0

C is the inverse of the regularization strength: the larger C is, the smaller the penalty on the norm (l1 or l2) of the parameters. C cannot be set to 0; it has to be > 0.

Regularization penalizes large parameter values, so smaller values of C (stronger regularization) help reduce overfitting.

When tuning, the values of C to search are usually n equally spaced values in log space, for example ranging from 1e-5 to 1e5.
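A log-spaced search over C might look like the sketch below. The dataset, the 11-point grid and the 5-fold cross-validation are assumptions made for illustration; the largest C values may trigger convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# 11 values equally spaced in log space, from 1e-5 to 1e5
c_grid = np.logspace(-5, 5, 11)

# Cross-validated search over the inverse regularization strength
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": c_grid}, cv=5)
search.fit(X, y)

best_c = search.best_params_["C"]
print("best C:", best_c, "mean CV accuracy:", search.best_score_)
```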

## class_weight − dict or ‘balanced’, optional, default = None

It represents the weights associated with classes. With the default option, all classes are supposed to have weight one. If you choose class_weight='balanced', the values of y are used to automatically adjust the weights inversely proportional to the class frequencies, as n_samples / (n_classes * np.bincount(y)).

For example, for binary classification on an imbalanced training dataset with 15% of the samples in class 0 and 85% in class 1, class_weight='balanced' gives roughly {0: 3.33, 1: 0.59}, so the rarer class receives the larger weight.
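The balanced weights can be computed directly with scikit-learn's compute_class_weight helper and compared with the formula n_samples / (n_classes * np.bincount(y)). The 15/85 label vector below is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels (assumption for illustration): 15 samples of class 0, 85 of class 1
y = np.array([0] * 15 + [1] * 85)

# Weights that class_weight='balanced' would use
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weights from the formula n_samples / (n_classes * np.bincount(y))
manual = len(y) / (2 * np.bincount(y))

print("balanced weights:", weights)
print("manual formula  :", manual)
```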

## random_state − int, RandomState instance or None, optional, default = None

This parameter represents the seed of the pseudo random number generator, which is used while shuffling the data. The options are as follows:

int − in this case, random_state is the seed used by random number generator.

RandomState instance − in this case, random_state is the random number generator.

None − in this case, the random number generator is the RandomState instance used by np.random.

## multi_class − str, {‘ovr’, ‘multinomial’, ‘auto’}, optional, default = ‘auto’

ovr − For this option, a binary problem is fit for each label.

multinomial − For this option, the loss minimized is the multinomial loss fit across the entire probability distribution. We can’t use this option if solver = ‘liblinear’.

auto − This option will select ‘ovr’ if solver = ‘liblinear’ or the data is binary; otherwise it will choose ‘multinomial’.

## l1_ratio − float or None, optional, default = None

l1_ratio is a parameter in the [0, 1] range that weights l1 against l2 regularisation.

Hence the amount of l1 regularisation is l1_ratio * 1./C, and likewise the amount of l2 regularisation is (1 - l1_ratio) * 1./C.

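## intercept_scaling − float, optional, default = 1

This parameter is useful only when the solver ‘liblinear’ is used and fit_intercept is set to True.

In this case, x becomes [x, intercept_scaling], i.e. a "synthetic" feature with a constant value equal to intercept_scaling is appended to the instance vector. The intercept becomes intercept_scaling * synthetic_feature_weight.

The synthetic feature weight is subject to l1/l2 regularization like all other features. To lessen the effect of regularization on the synthetic feature weight (and therefore on the intercept), intercept_scaling has to be increased.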

# <font color='green'>Understanding Hyperparameters for KNeighborsClassifier </font>


## weights: {‘uniform’, ‘distance’} or callable, default=’uniform’

uniform (all weights are equal), distance (the weight is inversely proportional to the distance from the test sample), or any other user-defined function

## algorithm: {‘auto’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, default=’auto’

The options are brute, ball_tree, kd_tree, or auto.

With brute, the nearest neighbors for each test case are found by a brute-force search over the whole training set.

With ball_tree and kd_tree, the training examples are organized in a tree structure to accelerate finding the nearest neighbors. If you set this parameter to auto, the most appropriate algorithm is chosen automatically based on the training set.

## leaf_size:int, default=30

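Leaf size passed to BallTree or KDTree: the number of points at which the tree switches to brute-force search. It can affect the speed of construction and querying, as well as the memory required to store the tree.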

## metric:str or callable, default=’minkowski’

The distance metric to use, for example minkowski, manhattan, euclidean, chebyshev, or another supported metric.

## p: int, default=2

Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.



## n_neighbors:int, default=5

The number of nearest neighbors that take part in the vote for each prediction.
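The relationship between metric and p can be checked numerically: with the default Minkowski metric, p=1 should give the same neighbor distances as metric='manhattan'. The dataset and the train/test split below are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data, split by hand into train and test parts (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, y_train, X_test = X[:150], y[:150], X[150:]

# Minkowski metric with p=1 ...
knn_p1 = KNeighborsClassifier(n_neighbors=5, weights="distance", p=1)
knn_p1.fit(X_train, y_train)

# ... versus the Manhattan metric requested explicitly
knn_manhattan = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="manhattan")
knn_manhattan.fit(X_train, y_train)

# Distances to the 5 nearest training points for every test point
dist_p1, _ = knn_p1.kneighbors(X_test)
dist_manhattan, _ = knn_manhattan.kneighbors(X_test)
print("same neighbor distances:", np.allclose(dist_p1, dist_manhattan))
```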

# <font color='green'>Understanding Hyperparameters for support vector machine algorithm</font>

## you can watch the below video to revise all the hyperparameters for svm
from IPython.display import IFrame

IFrame('https://www.youtube.com/embed/93AjE1YY5II',width=800,height=400)


## C: float, default=1.0

C is the regularization parameter of the soft-margin SVM. The strength of the regularization is inversely proportional to C, and C must be strictly positive. It controls how strongly training samples that fall inside the margin or on the wrong side of the decision boundary are penalized.

If C is set to a low value, more such violations (for example outliers) are tolerated and a wider-margin, more general decision boundary is found. If C is set to a high value, the decision boundary follows the training data more closely.

C is used in the soft margin, which requires an understanding of slack variables.

A high C tries to minimize the misclassification of training data, while a low C tries to maintain a smooth decision boundary.
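One way to see the effect of C is to count support vectors on overlapping classes: with a very small C the margin is wide and most points end up as support vectors, while a large C keeps fewer. The blob data and the two C values below are assumptions chosen for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (assumption for illustration)
X, y = make_blobs(n_samples=200, centers=[[-1, 0], [1, 0]],
                  cluster_std=1.5, random_state=0)

# Low C: strong regularization, wide margin, many margin violations tolerated
low_c = SVC(kernel="linear", C=0.001).fit(X, y)

# High C: weak regularization, training errors penalized heavily
high_c = SVC(kernel="linear", C=100).fit(X, y)

n_sv_low_c = int(low_c.n_support_.sum())
n_sv_high_c = int(high_c.n_support_.sum())
print("support vectors with C=0.001:", n_sv_low_c)
print("support vectors with C=100  :", n_sv_high_c)
```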

## kernel: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’}, default=’rbf’

Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable.

If none is given, ‘rbf’ will be used. 

If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).

For more information on the different kernels, see this link:
http://crsouza.com/2010/03/17/kernel-functions-for-machine-learning-applications/

## degree: int, default=3
Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.

## gamma: {‘scale’, ‘auto’} or float, default=’scale’

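Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’.

gamma scales the squared distance between points and thus controls how far the influence of a single training example reaches: low values mean "far" and high values mean "close".

If gamma='scale' (default) is passed, then 1 / (n_features * X.var()) is used as the value of gamma.

If gamma='auto', 1 / n_features is used.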
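The formula for gamma='scale' can be verified by computing the value by hand and passing it as a float; both models should then produce the same decision function. The dataset below is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# gamma='scale' corresponds to 1 / (n_features * X.var())
manual_gamma = 1.0 / (X.shape[1] * X.var())

svc_scale = SVC(kernel="rbf", gamma="scale").fit(X, y)
svc_manual = SVC(kernel="rbf", gamma=manual_gamma).fit(X, y)

# gamma='auto' would instead use 1 / n_features
auto_gamma = 1.0 / X.shape[1]

same = np.allclose(svc_scale.decision_function(X), svc_manual.decision_function(X))
print("manual gamma:", manual_gamma, "auto gamma:", auto_gamma, "identical:", same)
```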

## coef0: float, default=0.0
Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.

## max_iter: int, default=-1

Hard limit on iterations within solver, or -1 for no limit.

## cache_size: float, default=200

Specifies the size of the kernel cache (in MB).

A cache is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewhere.

## shrinking: bool, default=True

Whether to use the shrinking heuristic. Shrinking is part of the sequential minimal optimization (SMO) solver: it temporarily removes from the optimization problem those variables (training samples) that are deemed unlikely to have any further impact, which can shorten training time.

## decision_function_shape: {‘ovo’, ‘ovr’}, default=’ovr’

ovo: one-vs-one

ovr: one-vs-rest

check the difference between ovo and ovr in the video mentioned above (1.00)

## break_ties: bool, default=False

Tie breaking is costly if decision_function_shape='ovr', and therefore it is not enabled by default. This example illustrates the effect of the break_ties parameter for a multiclass classification problem and decision_function_shape='ovr'.

The two plots differ only in the area in the middle where the classes are tied. If break_ties=False, all input in that area would be classified as one class, whereas if break_ties=True, the tie-breaking mechanism will create a non-convex decision boundary in that area

![image.png](attachment:image.png)


# <font color='green'>Understanding Hyperparameters for Random Forest classifier algorithm</font>

## you can watch the below video to revise important hyperparameters

IFrame('https://www.youtube.com/embed/XABw4Y3GBR4',width=800,height=400)

## n_estimators: int, default=100

Number of trees in the forest, i.e. the number of trees built while performing bagging in the random forest.

## max_features: {“auto”, “sqrt”, “log2”}, int or float, default=”auto”

max number of features considered for splitting the node

The number of features to consider when looking for the best split:

If int, then consider max_features features at each split.

If float, then max_features is a fraction and int(max_features * n_features) features are considered at each split.

If “auto”, then max_features=sqrt(n_features).

If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).

If “log2”, then max_features=log2(n_features).

If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.


## max_depth: int, default=None

Maximum number of levels in each decision tree.

If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.

## min_samples_split

Minimum number of samples required to split an internal node.

## bootstrap: bool, default=True

Whether bootstrap samples (sampling with replacement) are used when building trees. If False, the whole dataset is used to build each tree.

## class_weight: {“balanced”, “balanced_subsample”}, dict or list of dicts, default=None

Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one. For multi-output problems, a list of dicts can be provided in the same order as the columns of y.

Note that for multioutput (including multilabel) weights should be defined for each class of every column in its own dict. For example, for four-class multilabel classification weights should be [{0: 1, 1: 1}, {0: 1, 1: 5}, {0: 1, 1: 1}, {0: 1, 1: 1}] instead of [{1:1}, {2:5}, {3:1}, {4:1}].

The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

The “balanced_subsample” mode is the same as “balanced” except that weights are computed based on the bootstrap sample for every tree grown.

For multi-output, the weights of each column of y will be multiplied.

Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.

## criterion: {“gini”, “entropy”}, default=”gini”

As we know, entropy and Gini impurity are used to select the best split in a tree.

The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and “entropy” for the information gain. Note: this parameter is tree-specific

## min_samples_leaf: int or float, default=1

The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.

If int, then consider min_samples_leaf as the minimum number.

If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.

#### Simple meaning: the minimum number of samples a node must contain to be a leaf node

## min_samples_split: int or float, default=2

min_samples_split specifies the minimum number of samples required to split an internal node.

## Difference between min_samples_leaf and min_samples_split

min_samples_split specifies the minimum number of samples required to split an internal node, while min_samples_leaf specifies the minimum number of samples required to be at a leaf node.

For instance, if min_samples_split = 5 and there are 7 samples at an internal node, then the split is allowed. But let's say the split results in two leaves, one with 1 sample and another with 6 samples.

If min_samples_leaf = 2, then the split won't be allowed (even though the internal node has 7 samples), because one of the resulting leaves would have fewer than the minimum number of samples required to be at a leaf node.
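Both constraints can be inspected on a fitted tree. The sketch below uses a single DecisionTreeClassifier rather than a whole forest to keep the check simple (an assumption for illustration; a random forest passes the same two parameters to each of its trees), and the dataset is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

tree = DecisionTreeClassifier(min_samples_split=5, min_samples_leaf=2,
                              random_state=0).fit(X, y)

# In the fitted tree structure, leaves are the nodes without children (child index -1)
is_leaf = tree.tree_.children_left == -1
leaf_sizes = tree.tree_.n_node_samples[is_leaf]
internal_sizes = tree.tree_.n_node_samples[~is_leaf]

# Every leaf holds at least min_samples_leaf samples,
# and every node that was split held at least min_samples_split samples
print("smallest leaf:", leaf_sizes.min())
print("smallest split node:", internal_sizes.min())
```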

## max_samples: int or float, default=None

If bootstrap is True, the number of samples to draw from X to train each base estimator.

If None (default), then draw X.shape[0] samples.

If int, then draw max_samples samples.

If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0, 1)

## warm_start: bool, default=False

When set to True, reuse the solution of the previous call to fit and add more estimators to the ensemble, otherwise, just fit a whole new forest.
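A minimal sketch of warm_start in a random forest, assuming a synthetic dataset: after the first fit, n_estimators is raised and fit is called again, so only the additional trees are trained and the existing ones are kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
forest.fit(X, y)
n_trees_first = len(forest.estimators_)
first_tree = forest.estimators_[0]

# Ask for more trees and fit again: the 10 existing trees are reused, 15 new ones are added
forest.set_params(n_estimators=25)
forest.fit(X, y)
n_trees_second = len(forest.estimators_)

print("trees after first fit :", n_trees_first)
print("trees after second fit:", n_trees_second)
```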

## oob_score: bool, default=False

When you train each tree in random forest, you will not use all the samples. So for each bag, those unused samples can be used to find the prediction error for that particular bag. The OOB error rate can then be obtained by averaging the prediction error from all the bags.

Whether to use out-of-bag samples to estimate the generalization accuracy.
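The out-of-bag estimate is available as the oob_score_ attribute after fitting with oob_score=True (which requires bootstrap=True). The dataset and the hyperparameter values below are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # each tree sees a bootstrap sample
    oob_score=True,        # score each sample with the trees that did not see it
    random_state=0,
).fit(X, y)

oob_accuracy = forest.oob_score_
print("out-of-bag accuracy:", oob_accuracy)
```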

## verbose: int, default=0

Controls the verbosity when fitting and predicting. By setting verbose to a higher number (2 vs 1), you may see more information about the tree building process.

## max_leaf_nodes: int, default=None

Grow trees with max_leaf_nodes in best-first fashion. 

Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.

## ccp_alpha: non-negative float, default=0.0

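Complexity parameter used for minimal cost-complexity pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed.

See the video below, which explains minimal cost-complexity pruning.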


from IPython.display import IFrame
IFrame('https://www.youtube.com/embed/D0efHEJsfHo',width=800,height=400)
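The effect of ccp_alpha can be sketched by comparing the size of an unpruned and a pruned tree. A single DecisionTreeClassifier is used here for simplicity, and the dataset and the value ccp_alpha=0.02 are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# ccp_alpha=0.0 (default): the tree is grown fully and not pruned
unpruned = DecisionTreeClassifier(ccp_alpha=0.0, random_state=0).fit(X, y)

# A positive ccp_alpha removes subtrees whose impurity gain does not justify their size
pruned = DecisionTreeClassifier(ccp_alpha=0.02, random_state=0).fit(X, y)

print("nodes without pruning:", unpruned.tree_.node_count)
print("nodes with ccp_alpha=0.02:", pruned.tree_.node_count)
```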

## min_impurity_decrease: float, default=0.0

A node will be split if this split induces a decrease of the impurity greater than or equal to this value.


## min_impurity_split: float, default=None

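Threshold for early stopping in tree growth. A node will split if its impurity is above the threshold; otherwise it is a leaf. This parameter was deprecated in favor of min_impurity_decrease and has been removed from scikit-learn 1.0 onward.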


# <font color='green'>Understanding Hyperparameters for XGBoost algorithm</font>

Check the link below for the full list of parameters (the main reference):
https://xgboost.readthedocs.io/en/latest/parameter.html#general-parameters

Only the main parameters are explained below.

### n_estimators

Number of trees (boosting rounds) built by XGBoost.

### colsample_bytree

Fraction of columns (features) randomly sampled when constructing each tree. Subsampling columns helps prevent overfitting and speeds up training.



### max_depth 

Maximum depth of each tree.



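### eta (alias: learning_rate)

Learning rate (used to shrink each tree's contribution when computing the predicted values). Note that in XGBoost the parameter named alpha (alias: reg_alpha) is the L1 regularization term on the weights, not the learning rate.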



### lambda 

L2 regularization term on the leaf weights (alias: reg_lambda). Increasing it makes the model more conservative.


### gamma 

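Minimum loss reduction required to make a further split on a leaf node (alias: min_split_loss). It acts as a user-defined penalty: the larger gamma is, the more the trees are pruned.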

### min_child_weight 

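Minimum sum of instance weights (hessian) needed in a child node. For regression with squared error, this is simply the minimum number of observations that go to a leaf. For classification, it is the minimum sum of the hessian values.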
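How lambda, gamma and the hessian enter a split decision can be sketched in plain Python using the split gain formula from the XGBoost documentation, Gain = 1/2 * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma. The function name split_gain and the toy targets are hypothetical and only for illustration; this is not the library's implementation.

```python
import math


def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of a candidate split from summed gradients (g) and hessians (h)."""
    def score(g, h):
        return g * g / (h + reg_lambda)

    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - gamma


# Squared-error loss: gradient = prediction - target, hessian = 1 per sample
prediction = 0.0
targets_left, targets_right = [10.0, 8.0], [-7.0, -9.0]
g_left = sum(prediction - t for t in targets_left)      # -18.0
g_right = sum(prediction - t for t in targets_right)    # 16.0
h_left, h_right = float(len(targets_left)), float(len(targets_right))  # 2.0 each

gain_plain = split_gain(g_left, h_left, g_right, h_right, reg_lambda=0.0, gamma=0.0)
gain_l2 = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0)
gain_pruned = split_gain(g_left, h_left, g_right, h_right, reg_lambda=0.0, gamma=200.0)

print(gain_plain, gain_l2, gain_pruned)
```

A larger lambda lowers the gain of every split, and a split whose gain falls below zero because of gamma is a candidate for pruning. With squared error, h_left and h_right are just the sample counts, which is why min_child_weight behaves like a minimum number of observations per child in that case.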


# <font color='green'>Understanding Hyperparameters for Kmeansclustering algorithm</font>

## X: {array-like, sparse} matrix of shape (n_samples, n_features)

The observations to cluster. It must be noted that the data will be converted to C ordering, which will cause a memory copy if the given data is not C-contiguous.

## n_clusters: int
The number of clusters to form as well as the number of centroids to generate.

## sample_weight: array-like of shape (n_samples,), default=None
The weights for each observation in X. If None, all observations are assigned equal weight.

## init: {‘k-means++’, ‘random’, ndarray, callable}, default=’k-means++’

Method for initialization:

‘k-means++’ : selects initial cluster centers for k-means clustering in a smart way to speed up convergence. See section Notes in k_init for more details.

‘random’: choose n_clusters observations (rows) at random from data for the initial centroids.

If an ndarray is passed, it should be of shape (n_clusters, n_features) and gives the initial centers.

If a callable is passed, it should take arguments X, n_clusters and a random state and return an initialization.

## precompute_distances: {‘auto’, True, False}

Whether to precompute distances (faster but takes more memory). Without precomputation, the distances are recalculated on every iteration of k-means, which costs time. If you have enough memory and want to save time, use precompute_distances=True.

‘auto’ : do not precompute distances if n_samples * n_clusters > 12 million. This corresponds to about 100MB overhead per job using double precision.

True : always precompute distances

False : never precompute distances

Note that this parameter was deprecated and is no longer accepted from scikit-learn 1.0 onward.


## n_init: int, default=10

Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.


## max_iter: int, default=300

Maximum number of iterations of the k-means algorithm to run.



## verbose: bool, default=False

Verbosity mode.

## tol: float, default=1e-4

Relative tolerance with regards to Frobenius norm of the difference in the cluster centers of two consecutive iterations to declare convergence.



## random_state: int, RandomState instance, default=None

Determines random number generation for centroid initialization. Use an int to make the randomness deterministic. See Glossary.
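A minimal sketch that puts the main KMeans parameters together. The blob data is synthetic, and the parameter values simply restate the defaults described above (with n_init given explicitly).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated 2-D clusters (assumption for illustration)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

kmeans = KMeans(
    n_clusters=3,        # number of clusters / centroids
    init="k-means++",    # smart centroid initialization
    n_init=10,           # runs with different centroid seeds; the best inertia wins
    max_iter=300,        # iteration cap for a single run
    tol=1e-4,            # convergence tolerance on the centroid shift
    random_state=0,      # makes the initialization reproducible
).fit(X)

print("centers:\n", kmeans.cluster_centers_)
print("inertia:", kmeans.inertia_, "iterations of best run:", kmeans.n_iter_)
```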

## copy_x: bool, default=True

When pre-computing distances it is more numerically accurate to center the data first. 

If copy_x is True (default), then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean. Note that if the original data is not C-contiguous, a copy will be made even if copy_x is False.

If the original data is sparse, but not in CSR format, a copy will be made even if copy_x is False.

## n_jobs: int, default=None

The number of OpenMP threads to use for the computation. Parallelism is sample-wise on the main cython loop which assigns each sample to its closest center.

None or -1 means using all processors.


## algorithm: {“auto”, “full”, “elkan”}, default=”auto”

K-means algorithm to use. The classical EM-style algorithm is “full”. The “elkan” variation is more efficient on data with well-defined clusters, by using the triangle inequality. However it’s more memory intensive due to the allocation of an extra array of shape (n_samples, n_clusters).

For now “auto” (kept for backward compatibility) chooses “elkan” but it might change in the future for a better heuristic.

To understand the elkan algorithm, see this link:

https://davidstutz.de/using-the-triangle-inequality-to-accelerate-k-means-elkan/

## return_n_iter: bool, default=False

Whether or not to return the number of iterations.

#### Some of the explanations are taken directly from the documentation and some from stackoverflow
#### If you want to contact me do mail me at nitishkumar2902@gmail.com