# Machine learning

#### What is clustering?

Clustering is the task of dividing data points into groups such that points in the same group are more similar to each other than to points in other groups. In other words, it groups objects on the basis of the similarity and dissimilarity between them.

#### How to identify the optimal number of clusters?

Use the "elbow method": plot the error as a function of the number of clusters, and choose the number of clusters after which the error decreases only slowly, in a roughly linear manner. The sum of squared errors (inertia) is a well-known metric for this purpose. To see this in practice, please refer to [my notebook on k-means](https://github.com/noronhaeyan/machine-learning/blob/main/k-means.ipynb).
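
As a rough illustration of the elbow method, the sketch below computes the inertia for several values of k on toy data. The helper `kmeans_inertia`, the toy blobs, and the choice of 10 restarts are all assumptions made for this example; it is a plain NumPy k-means, not the code from the linked notebook.

```python
import numpy as np

def kmeans_inertia(X, k, n_iter=50, n_restarts=10):
    """Best sum of squared errors (inertia) over several runs of plain k-means."""
    best = np.inf
    for seed in range(n_restarts):
        rng = np.random.default_rng(seed)
        # Initialize centroids with k distinct data points.
        centroids = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Squared distance from every point to every centroid, shape (n, k).
            d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            for j in range(k):
                members = X[labels == j]
                if len(members) > 0:  # keep the old centroid if a cluster is empty
                    centroids[j] = members.mean(axis=0)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        best = min(best, float(d2.min(axis=1).sum()))
    return best

# Toy data (assumed): three well-separated blobs, so the elbow should sit near k = 3.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = {k: kmeans_inertia(X, k) for k in range(1, 7)}
for k, sse in inertias.items():
    print(f"k={k}: inertia={sse:.1f}")
```

On data like this, the inertia drops sharply up to k = 3 and then flattens out, which is the "elbow" to look for in the plot.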

#### What is the Bias-variance trade-off?

As the number of parameters in a model increases, the bias of the fitted model goes down but its variance goes up.

On the other hand, if the number of parameters is kept low, the variance of the model is small but its bias is high.

This gives rise to a U-shaped curve for the model error on unseen data as a function of the number of parameters. The trade-off between variance and bias as the number of model parameters is varied is known as the bias-variance trade-off.
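
One way to see this trade-off is to fit polynomials of increasing degree to noisy data and compare training and validation errors. The sketch below is only an illustration: the target function `sin(3x)`, the noise level, and the sample sizes are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples of an assumed target function, sin(3x), on [-1, 1]."""
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.2, size=n)
    return x, y

x_train, y_train = make_data(30)
x_val, y_val = make_data(200)

train_mse, val_mse = {}, {}
for degree in range(0, 10):
    # A higher degree means more parameters: lower bias, higher variance.
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse[degree] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_mse[degree] = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))

for degree in train_mse:
    print(f"degree={degree}: train={train_mse[degree]:.3f}  val={val_mse[degree]:.3f}")
```

The training error keeps shrinking as the degree grows, while the validation error typically falls at first and then stops improving or rises again once the model starts fitting noise.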

#### What is feature engineering? How does it affect the model’s performance? 

Feature engineering refers to creating new features from existing ones. It can help improve model performance by 1) increasing accuracy, 2) reducing model size, and 3) reducing the number of input features by combining them.

#### What is overfitting and how can we avoid it?

Overfitting happens when the model learns the noise in the data as well as the underlying patterns. This leads to high performance on the training data but much lower performance on data the model has not seen before. There are multiple methods we can use to avoid overfitting:

1. Early stopping: halt training once the validation performance stops improving, even if the training performance is still improving.
2. Regularization: methods such as L1 or L2 regularization penalize large model weights.
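
To make the regularization point concrete, here is a minimal sketch of L2-regularized (ridge) linear regression using its closed-form solution. The function name `ridge_fit`, the toy data, and the values of `lam` are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: w = (X^T X + lam * I)^(-1) X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Toy data (assumed): few samples, many features, only the first feature matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))
y = X[:, 0] + rng.normal(scale=0.5, size=20)

for lam in [0.0, 1.0, 10.0]:
    w = ridge_fit(X, y, lam)
    print(f"lam={lam}: weight norm={np.linalg.norm(w):.3f}")
```

A larger `lam` shrinks the weight vector towards zero, which limits how closely the model can chase noise in the training data.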


#### What is early stopping?

In early stopping, we stop training the model when its performance on the validation set starts getting worse.

If we plot the error on the training and validation datasets together, both errors decrease with the number of iterations until the point where the model starts to overfit. After this point, the training error still decreases but the validation error increases. Even if training continues past this point, early stopping returns the set of parameters from this point, so it is equivalent to stopping training there. The returned parameters give the model lower variance and better generalization.
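
The stopping rule can be sketched as follows. The function `early_stopping_epoch`, the `patience` parameter, and the list of validation losses are hypothetical; in a real training loop you would also save a copy of the model parameters each time the best loss improves.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping once the validation loss
    has not improved for `patience` consecutive epochs."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best: this is where the parameters would be checkpointed.
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs, stop training
    return best_epoch, best_loss

# Hypothetical validation losses: improving until epoch 3, then getting worse.
val_losses = [0.9, 0.7, 0.55, 0.5, 0.52, 0.56, 0.6, 0.4]
best_epoch, best_loss = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, best_loss)
```

Here training stops at epoch 6 and the parameters from epoch 3 are kept; the later value of 0.4 is never reached, which is the price of a small `patience`.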

#### What is k-fold cross validation?

K-fold cross validation is used for estimating the prediction error of a given model.

The dataset is divided into k subsets or folds. The model is trained and evaluated k times, using a different fold as the validation set each time. Performance metrics from each fold are averaged to estimate the model’s generalization performance. This method aids in model assessment, selection, and hyperparameter tuning, providing a more reliable measure of a model’s effectiveness.
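
The procedure can be sketched with NumPy as below. The helpers `k_fold_indices` and `cross_val_mse`, the least-squares linear model, and the toy data are assumptions for illustration; any model and metric could take their place.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

def cross_val_mse(X, y, k=5):
    """Average validation MSE of a least-squares linear model over k folds."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for i in range(k):
        val_idx = folds[i]  # fold i is the validation set this round
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        scores.append(float(np.mean((X[val_idx] @ w - y[val_idx]) ** 2)))
    return float(np.mean(scores)), scores

# Toy regression data (assumed): linear signal plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

mean_mse, fold_mse = cross_val_mse(X, y, k=5)
print(f"mean MSE={mean_mse:.4f}", [round(s, 4) for s in fold_mse])
```

Every sample is used for validation exactly once, so the averaged score depends less on one lucky or unlucky split than a single hold-out set would.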

#### What is the difference between the k-means and k-means++ algorithms?

One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids or the mean points. So, if a centroid is initialized to be a “far-off” point, it might just end up with no points associated with it, and at the same time, more than one cluster might end up linked with a single centroid. Similarly, more than one centroid might be initialized into the same cluster resulting in poor clustering.

To overcome this drawback we use K-means++. This algorithm ensures a smarter initialization of the centroids and improves the quality of the clustering. Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm. That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.

The steps involved are: 
 
1. Randomly select the first centroid from the data points.
2. For each data point compute its distance from the nearest, previously chosen centroid.
3. Select the next centroid from the data points such that the probability of choosing a point is proportional to the square of its distance from the nearest, previously chosen centroid (i.e. points that are far from every existing centroid are the most likely to be selected next).
4. Repeat steps 2 and 3 until k centroids have been sampled.

Reference: https://www.geeksforgeeks.org/ml-k-means-algorithm/#
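
The initialization steps can be sketched in NumPy as follows. The function name `kmeans_pp_init` and the toy data are assumptions; the sampling weights use the squared distance to the nearest chosen centroid, as in the original k-means++ formulation.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """Pick k initial centroids from the rows of X using k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: pick the first centroid uniformly at random from the data.
    centroids = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Step 2: squared distance from each point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = (diffs ** 2).sum(axis=2).min(axis=1)
        # Step 3: sample the next centroid with probability proportional to d2.
        # (Assumes the data contains at least k distinct points, so d2.sum() > 0.)
        next_idx = rng.choice(n, p=d2 / d2.sum())
        centroids.append(X[next_idx])
    return np.array(centroids)

# Toy data (assumed): three separated blobs.
rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

init = kmeans_pp_init(X, k=3)
print(init)
```

A point that is already a centroid has distance zero and so cannot be picked again, which avoids duplicate centroids. The result would then be handed to the standard K-means iterations.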

#### Why can’t we use linear regression for a classification task?

Linear regression is suited to problems where the output values are continuous. In classification, the output labels are discrete and bounded.

A second reason is that linear regression is sensitive to imbalanced data. If the range of the input features is too wide, linear regression may misclassify some data points. Ref: https://towardsdatascience.com/why-linear-regression-is-not-suitable-for-binary-classification-c64457be8e28
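
A small illustration of the first point: fitting a straight line to 0/1 labels gives predictions that are not bounded to the label range. The toy data and the query points below are assumptions for the example.

```python
import numpy as np

# Toy binary labels (assumed): class 0 for small x, class 1 for large x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])

# Ordinary least-squares line through the 0/1 labels.
slope, intercept = np.polyfit(x, y, deg=1)

# Predictions for inputs far from the training range.
x_new = np.array([-5.0, 3.5, 20.0])
linear_pred = slope * x_new + intercept
print(linear_pred)
```

The predictions at the two extreme inputs fall well outside [0, 1], so they cannot be read as class probabilities. Logistic regression avoids this by passing the linear score through a sigmoid.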

#### Why not mean squared error as a loss function for logistic regression?

https://towardsdatascience.com/why-not-mse-as-a-loss-function-for-logistic-regression-589816b5e03c

#### Why do we perform normalization?

Many statistical learning algorithms, such as principal component regression or partial least squares regression, require the data matrix to be normalized.

In the case of neural networks, normalization can help address the problem of exploding or vanishing gradients.
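
One common form of normalization is standardization, which rescales each feature to zero mean and unit variance. The sketch below assumes that is the form intended; the function name `standardize` and the toy data are made up for the example.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale features to zero mean and unit variance using training statistics only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled instead of dividing by zero
    return (X_train - mean) / std, (X_test - mean) / std

# Toy data (assumed): two features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 0.001], scale=[10.0, 0.0005], size=(100, 2))
X_test = rng.normal(loc=[50.0, 0.001], scale=[10.0, 0.0005], size=(20, 2))

X_train_s, X_test_s = standardize(X_train, X_test)
print(X_train_s.mean(axis=0), X_train_s.std(axis=0))
```

The mean and standard deviation are computed on the training set and reused for the test set, so no information from the test data reaches the preprocessing step.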

#### What is upsampling and downsampling? How to work with imbalanced data?

To work with imbalanced data, use one of the following techniques:
1) Upsampling Minority Class
2) Downsampling Majority Class
3) Generate Synthetic Data
4) Combine Upsampling & Downsampling Techniques 
5) Balanced Class Weight

Upsampling may lead to overfitting, since minority-class points are duplicated. Downsampling may lead to underfitting, as we throw away data points.
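
A minimal NumPy sketch of random upsampling and downsampling is shown below. The function names and the toy dataset are assumptions for illustration; this is not the code from the links that follow.

```python
import numpy as np

def upsample_minority(X, y, seed=0):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        cls_idx = np.flatnonzero(y == cls)
        if count < target:
            extra = rng.choice(cls_idx, size=target - count, replace=True)
            cls_idx = np.concatenate([cls_idx, extra])
        parts.append(cls_idx)
    idx = np.concatenate(parts)
    return X[idx], y[idx]

def downsample_majority(X, y, seed=0):
    """Randomly drop samples from larger classes until all classes match the smallest."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.min()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == cls), size=target, replace=False)
        for cls in classes
    ])
    return X[idx], y[idx]

# Toy imbalanced dataset (assumed): 90 samples of class 0 and 10 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_up, y_up = upsample_minority(X, y)
X_down, y_down = downsample_majority(X, y)
print(np.bincount(y_up), np.bincount(y_down))
```

Resampling should be applied to the training split only, so that the validation and test sets keep the real class distribution.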

Code to perform upsampling and downsampling: https://wellsr.com/python/upsampling-and-downsampling-imbalanced-data-in-python/

Further reading: https://towardsdatascience.com/5-techniques-to-work-with-imbalanced-data-in-machine-learning-80836d45d30c

#### What is data leakage? How to avoid it?

Data leakage occurs when information about the target variable leaks into the features, either knowingly or unknowingly.

This often shows up as an unusually high correlation between some features and the target variable. When a model is trained on such a dataset, it may perform very well on the test and validation data but do poorly in production.
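
As a first, rough check for this kind of leakage, one can flag features whose correlation with the target is suspiciously high. The helper `flag_suspicious_features`, the 0.95 threshold, and the feature names below are hypothetical, and a high correlation is only a hint that needs a manual look, not proof of leakage.

```python
import numpy as np

def flag_suspicious_features(X, y, names, threshold=0.95):
    """Return (name, correlation) for features almost perfectly correlated with y."""
    flagged = []
    for j, name in enumerate(names):
        corr = np.corrcoef(X[:, j], y)[0, 1]
        if abs(corr) >= threshold:
            flagged.append((name, float(corr)))
    return flagged

# Toy data (assumed): one ordinary feature and one near-copy of the target.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
ordinary = rng.normal(size=200)
leaky = y + rng.normal(scale=0.01, size=200)  # derived from the target itself
X = np.column_stack([ordinary, leaky])

flagged = flag_suspicious_features(X, y, names=["ordinary", "leaky_copy"])
print(flagged)
```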

Sources of data leakage can be hard to catch. [This article](https://towardsdatascience.com/data-leakage-in-machine-learning-how-it-can-be-detected-and-minimize-the-risk-8ef4e3a97562) provides a few approaches to minimizing data leakage.

#### What are some of the hyperparameters of the random forest regressor which help to avoid overfitting?

https://www.geeksforgeeks.org/hyperparameters-of-random-forest-classifier/#

#### What is Principal Component Analysis (PCA)?

PCA (Principal Component Analysis) is an unsupervised dimensionality reduction technique in which we give up some of the information or patterns in the data in exchange for reducing its size significantly. The algorithm tries to preserve most of the variance of the original dataset, say 95%. For very high-dimensional data, sometimes giving up even 1% of the variance lets us reduce the data size significantly. PCA can be used for image compression and for visualizing high-dimensional data.
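
A minimal sketch of PCA via the singular value decomposition is shown below, keeping just enough components to retain a chosen fraction of the variance. The function name `pca_fit_transform`, the `variance_to_keep` parameter, and the toy data are assumptions for illustration.

```python
import numpy as np

def pca_fit_transform(X, variance_to_keep=0.95):
    """Project X onto the fewest principal components that retain the given variance."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions; S holds the singular values.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained = S ** 2 / np.sum(S ** 2)  # fraction of variance per component
    cumulative = np.cumsum(explained)
    n_components = int(np.searchsorted(cumulative, variance_to_keep)) + 1
    n_components = min(n_components, len(S))
    components = Vt[:n_components]
    return X_centered @ components.T, components, explained[:n_components]

# Toy data (assumed): 10 observed features driven by only 2 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

Z, components, explained = pca_fit_transform(X, variance_to_keep=0.95)
print(Z.shape, explained.sum())
```

Because the toy data has only two underlying factors, at most two components are needed to keep 95% of the variance, so ten columns shrink to one or two.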

#### Explain the working principle of support vector machine (SVM)?

https://www.geeksforgeeks.org/support-vector-machine-algorithm/#

Implementation: 

#### Correlation in hyperparameter trends between the training and test sets



# Statistics

# General

#### What is a garbage collector?

A garbage collector (GC) is a form of automatic memory management.

The GC attempts to reclaim memory which was allocated by the program but is no longer referenced.

Advantages of GC:
1. Frees developers from having to manually release memory
2. Avoids memory leaks, in which a program fails to free memory occupied by objects that have become unreachable.

Disadvantages of GC:
1. The GC uses computing resources to decide which memory to free, which can impair performance.
2. It is hard to predict when memory will actually be reclaimed.
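
As an illustration, the sketch below uses Python's standard `gc` and `weakref` modules to show the collector reclaiming a reference cycle. It assumes CPython, where reference counting alone cannot free cycles; the `Node` class and the helper `make_cycle` are made up for the example.

```python
import gc
import weakref

class Node:
    def __init__(self):
        self.partner = None

def make_cycle():
    """Create two objects that reference each other, then drop all outside references."""
    a, b = Node(), Node()
    a.partner, b.partner = b, a  # reference cycle: a -> b -> a
    return weakref.ref(a)        # a weak reference does not keep `a` alive

gc.disable()                      # pause automatic collection for the demo
ref = make_cycle()
alive_before = ref() is not None  # cycle is unreachable but not yet reclaimed
gc.collect()                      # run the garbage collector explicitly
alive_after = ref() is not None   # the cycle has now been freed
gc.enable()

print(alive_before, alive_after)
```

The objects become unreachable as soon as `make_cycle` returns, but their memory is only reclaimed when the collector runs, which also illustrates the unpredictability listed above.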

#### What is the difference between a CPU and a GPU?

The main difference is that a CPU is designed to handle a wide range of tasks quickly, but is limited in how many tasks it can run concurrently.

Generally speaking, GPUs are much faster than CPUs at highly parallel, simple tasks such as multiplying big matrices.

GPUs were originally designed to make rendering of 3D images more efficient.