# Machine Learning

Machine learning is the process whereby machines are given the ability to learn to make decisions from data without being explicitly programmed.
* <b>`Unsupervised Learning`</b> is the process of uncovering hidden patterns and structures from unlabeled data.
    * Example: grouping customers into distinct categories (known as `clustering`) based on criteria such as shopping behavior or most-bought products
* <b>`Supervised Learning`</b> is a type of machine learning where the values to be predicted are already known for the training data, and a model is built to predict the target values of unseen data from its features.
    * Uses features to predict the value of the target variable

## Types of Supervised Learning
* <b>`Classification`</b> is used to predict the label, or category, of an observation. The target variable consists of categories
    * Example: cat or dog, spam email classification
* <b>`Regression`</b> is used to predict continuous values
    * Example: property price prediction, stock forecasting

### Naming Convention
* `Feature` or `predictor variable` is an independent variable used to predict the value of the target variable
* `Target variable` or `dependent variable` is the variable whose value needs to be calculated or predicted
* `Labeled data` is data for which the target values are known; it is the training data the model is trained on

### Scikit-Learn
Scikit-learn is one of the most popular machine learning libraries in Python. It's a powerful tool for building machine learning models, providing a wide range of algorithms and tools for tasks such as classification, regression, clustering, dimensionality reduction, and more. 
<br/>
scikit-learn requires that the features are in an array where each column is a feature and each row is a different observation. Similarly, the target needs to be a single column with the same number of observations as the feature data.

`Syntax for scikit-learn`

`from sklearn.module import Model` <br/>
`model = Model()`<br/>
`model.fit(X, y)` (X array of features, y array of our target variable values)<br/>
`prediction = model.predict(X_new)`<br/>
`print(prediction)`<br/>

##### Steps for Classifying Labels of Unseen Data
* Build a classifier
* Model learns from the labeled data we pass to it
* Pass unlabeled data to the model as input
* Model predicts the labels of the unseen data


### k-Nearest Neighbors
A supervised machine learning model that predicts the label of a data point by looking at the `k` closest labeled data points. It uses majority voting: the prediction is the label that the majority of the nearest neighbors have. <br/>
As k increases, the model becomes simpler. Simpler models are less able to detect relationships in the dataset, which is known as `underfitting`. <br/>
For smaller k, the model becomes more complex and is sensitive to noise in the training data rather than reflecting general trends, which is known as `overfitting`.

larger k = less complex model = can cause underfitting <br/>
smaller k = more complex model = can lead to overfitting
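
The majority vote described above can be sketched in plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the helper `knn_predict`, the Euclidean distance, and the toy points are all assumptions made for the example.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    # Count the labels of the k closest points and return the most common one.
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two features per observation.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["cat", "cat", "cat", "dog", "dog", "dog"]

near_cats = knn_predict(points, labels, (1.1, 1.0), k=3)
near_dogs = knn_predict(points, labels, (5.1, 5.0), k=3)
print(near_cats, near_dogs)
```

With a very large `k` the vote is taken over almost the whole training set, which is why large values of `k` give a simpler, smoother model.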

`from sklearn.neighbors import KNeighborsClassifier`<br/>
`X = dataset_name[['feature1', 'feature2']]` <br/>
`y = dataset_name['target']` <br/>
`knn = KNeighborsClassifier(n_neighbors = 6)` <br/>
`knn.fit(X, y)` <br/>
`predictions = knn.predict(X_new)` <br/>
`print(predictions)`<br/>


#### Measuring Model Performance
In classification, accuracy is a commonly used metric. <br/><br/> 
`Accuracy: correct_predictions/total_observations`<br/><br/>
For measuring model performance, it is common practice to split the data into a `training set` and a `test set`. The classifier is fit on the training set, and its accuracy is then calculated on the test set. We commonly use 20-30% of our data as the test set. The `random_state` argument sets a seed for the random number generator that splits the data, which makes the split reproducible.

`from sklearn.model_selection import train_test_split`<br />
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)` returns four arrays <br />
`knn = KNeighborsClassifier(n_neighbors = 6)` <br/>
`knn.fit(X_train, y_train)` <br/>
`print(knn.score(X_test, y_test))` <br/><br/>

To deal with underfitting and overfitting, we fit the model for a range of k values and compare the training and test accuracies to find the value of k for which the test accuracy is highest.

`training_accuracies = {}`<br/>
`test_accuracies = {}`<br/>
`neighbors = np.arange(1, 26)`<br/>
`for neighbor in neighbors:`<br/>
`    knn = KNeighborsClassifier(n_neighbors=neighbor)`<br/>
`    knn.fit(X_train, y_train)` <br/>
`    training_accuracies[neighbor] = knn.score(X_train, y_train)` <br/>
`    test_accuracies[neighbor] = knn.score(X_test, y_test)` <br/>

#### Confusion Matrix
In machine learning, a confusion matrix is a table that is used to evaluate the performance of a classification algorithm. It allows for a detailed examination of the predicted and actual values of a model.

The confusion matrix is constructed around the following concepts:

* `True Positives (TP):` These are the cases where the model correctly predicts the positive class.
* `True Negatives (TN):` These are the cases where the model correctly predicts the negative class.
* `False Positives (FP):` These are the cases where the model incorrectly predicts the positive class when it's actually negative. Also known as Type I error.
* `False Negatives (FN):` These are the cases where the model incorrectly predicts the negative class when it's actually positive. Also known as Type II error.
<br/>
<table>
    <tr>
        <th></th>
        <th>Predicted Negative</th>
        <th>Predicted Positive</th>
    </tr>
    <tr>
        <td>Actual Negative</td>
        <td>True Negative (TN)</td>
        <td>False Positive (FP)</td>
    </tr>
    <tr>
        <td>Actual Positive</td>
        <td>False Negative (FN)</td>
        <td>True Positive (TP)</td>
    </tr>
</table>
<br/>

From the confusion matrix, various metrics can be derived to evaluate the performance of a classification model, including:

* `Accuracy: (TP + TN) / (TP + TN + FP + FN)`
    * The proportion of correctly classified instances among the total instances.

* `Precision: TP / (TP + FP)`
    * The accuracy of positive predictions, measuring the model's ability to not label a negative sample as positive.
    * Higher precision = lower false positive rate 

* `Recall (Sensitivity or True Positive Rate): TP / (TP + FN)`
    * The proportion of actual positive cases that were correctly identified, measuring the model's ability to find all the positive samples.
    * High recall = lower false negative rate 
<br />

* `F1 Score: 2 * (Precision * Recall) / (Precision + Recall)`
    * It is the harmonic mean of precision and recall, providing a balance between the two metrics.
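
The formulas above can be checked by hand with a few lines of Python. This is a minimal sketch: the function name and the four counts are hypothetical, and it assumes at least one positive prediction and one actual positive so that no division by zero occurs.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of all predicted positives, how many were correct
    recall = tp / (tp + fn)      # of all actual positives, how many were found
    f1 = 2 * (precision * recall) / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for a binary classifier evaluated on 100 observations.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```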

`from sklearn.metrics import classification_report, confusion_matrix` <br/>
`knn = KNeighborsClassifier(n_neighbors=7)` <br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)` <br/>
`knn.fit(X_train, y_train)` <br/>
`y_pred = knn.predict(X_test)` <br/>
`print(confusion_matrix(y_test, y_pred))` <br/>
`print(classification_report(y_test, y_pred))` <br/>



### Regression
`Regression` in machine learning is a type of supervised learning that deals with the prediction of continuous outcomes. It aims to find the relationship between a `dependent variable` (target) and one or more `independent variables` (features) to predict the value of the dependent variable based on the independent variables.

`from sklearn.linear_model import LinearRegression` <br/>
`reg = LinearRegression()` <br/>
`reg.fit(X, y)` <br/>
`predictions = reg.predict(X)` <br/>

#### Regression mechanism
In two dimensions, we want to fit a line to the data. The line takes the form
`y = ax + b`
where
* y = target
* x = single feature
* a, b = parameters/co-efficients of the model - slope, intercept

How do we choose a and b?
* Define an error function (also called loss function) for any given line
* Choose the line that minimizes the error function

We want the line to be as close to the observation as possible. Therefore we want to minimise the vertical distance between the fit and the data. So for each observation, we calculate the vertical distance between it and the line which is called a `residual`
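
As a rough illustration, the sketch below fits `y = ax + b` to a handful of made-up points and computes the residuals. It assumes the error function is the sum of squared residuals (ordinary least squares), for which the best slope and intercept have a closed form; `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    a = s_xy / s_xx
    # The least-squares line always passes through the point of means.
    b = mean_y - a * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(xs, ys)

# Residual: vertical distance between each observation and the fitted line.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
rss = sum(r ** 2 for r in residuals)
print(a, b, rss)
```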

For linear regression in higher dimensions, the model takes the form `y = a1x1 + a2x2 + a3x3 + ... + anxn + b`, with one coefficient per feature. 

So for linear regression using all features:

`from sklearn.model_selection import train_test_split` <br/>
`from sklearn.linear_model import LinearRegression`<br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)`<br/>
`reg_all = LinearRegression()`<br/>
`reg_all.fit(X_train, y_train)`<br/>
`y_pred = reg_all.predict(X_test)`<br/>

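The default metric for linear regression is `R-squared`, which quantifies the proportion of variance in the target variable that is explained by the features. A value of 1 means that the features completely explain the target variance, and 0 means the model does no better than always predicting the mean. Values usually fall between 0 and 1, although on held-out data the score can be negative when the model performs worse than predicting the mean. 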

`reg_all.score(X_test, y_test)`<br/> <br/>
However, there is a potential pitfall: R-squared depends on the way the data is split. The test data may have some peculiarities that mean the R-squared computed on it is not representative of the model's ability to generalise to unseen data.

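The performance of linear regression can also be measured using `mean squared error` (MSE) and `root mean squared error` (RMSE). RMSE is expressed in the same units as the target variable.

`from sklearn.metrics import mean_squared_error`<br/>
`mean_squared_error(y_test, y_pred, squared=False)` returns the RMSE; newer scikit-learn releases deprecate the `squared` argument in favour of `root_mean_squared_error`<br/>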
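
To make these metrics concrete, here is a small sketch that computes R-squared, MSE and RMSE by hand. The helper name and the toy predictions are assumptions for illustration; the second set of predictions is deliberately poor to show what the score looks like when a model does worse than predicting the mean.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute R-squared, MSE and RMSE from true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    mse = ss_res / n
    return {"r2": 1 - ss_res / ss_tot, "mse": mse, "rmse": math.sqrt(mse)}

y_true = [3.0, 5.0, 7.0, 9.0]
good = regression_metrics(y_true, [2.8, 5.1, 7.2, 8.9])  # close predictions
bad = regression_metrics(y_true, [9.0, 7.0, 5.0, 3.0])   # reversed predictions
print(good)
print(bad)
```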


#### Cross Validation
Cross-validation is a technique used in machine learning to assess the performance of a predictive model. It helps to estimate how well the model generalizes to new, unseen data, and it reduces the dependence of the score on a single train/test split. The steps of cross-validation for linear regression are:
* `Data Splitting:` Divide the dataset into a training set and a testing set.
* `K-Fold Cross-Validation:` Split the training set into K subsets/folds.
* `Iterative Training and Testing:` Train the model K times, each time using K-1 folds for training and 1 fold for testing.
* `Performance Evaluation:` Calculate the model's performance metric (e.g., mean squared error) for each fold.
* `Average Metric:` Average the performance metric across all folds to determine the model's overall performance.
* `Model Comparison and Tuning:` Compare different models or tune parameters using cross-validation results.
* `Final Model Evaluation:` Select the best model and train it on the entire training set before evaluating it on a separate, unseen test set to estimate real-world performance.
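
The fold-splitting step above can be sketched without any library. This is a simplified illustration (contiguous folds, no shuffling) and not scikit-learn's `KFold` implementation; `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds, without shuffling."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each observation appears in exactly one test fold, so every sample is used for evaluation once and for training K-1 times.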


`from sklearn.model_selection import cross_val_score, KFold` <br/>
`kf = KFold(n_splits=6, shuffle=True, random_state=42)`<br/>
`reg = LinearRegression()`<br/>
`cv_results = cross_val_score(reg, X, y, cv=kf)`<br/>
`print(cv_results)`<br/>
`print(np.mean(cv_results), np.std(cv_results))`<br/>
`print(np.quantile(cv_results, [0.025, 0.975]))` (the 2.5% and 97.5% quantiles of the fold scores, an approximate 95% interval)<br/>


#### Ridge Regression

Ridge regression is a technique used in linear regression to address the problem of overfitting by adding a penalty term to the ordinary least squares (OLS) method. In traditional linear regression, the goal is to minimize the residual sum of squares (RSS). However, when there is multicollinearity, the estimated coefficients can become too sensitive to the training data, leading to overfitting and high variance.
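
The effect of the penalty can be seen in the simplest possible case. The sketch below assumes a single, already centered feature and no intercept, so that minimising `RSS + alpha * a**2` has the closed-form solution `a = sum(x*y) / (sum(x*x) + alpha)`. The function name and data are hypothetical.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimising sum((y - a*x)**2) + alpha * a**2 for one centered feature."""
    s_xy = sum(x * y for x, y in zip(xs, ys))
    s_xx = sum(x * x for x in xs)
    return s_xy / (s_xx + alpha)

# Hypothetical centered data (both x and y have mean zero).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]

alphas = [0.0, 1.0, 10.0, 100.0]
slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in alphas}
for alpha, slope in slopes.items():
    print(f"alpha={alpha}: slope={slope:.3f}")
```

With `alpha=0` the result is the ordinary least squares slope; larger values of `alpha` shrink the coefficient towards zero, which is what reduces the model's variance.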

`from sklearn.linear_model import Ridge` <br/>
`scores = []`<br/>
`for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:`<br/>
`    ridge = Ridge(alpha=alpha)`<br/>
`    ridge.fit(X_train, y_train)`<br/>
`    y_pred = ridge.predict(X_test)`<br/>
`    scores.append(ridge.score(X_test, y_test))`<br/>
`print(scores)`<br/>

#### Lasso Regression
Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is a technique used in linear regression to perform both variable selection and regularization by adding a penalty term to the ordinary least squares (OLS) method. 

`from sklearn.linear_model import Lasso` <br/>
`scores = []`<br/>
`for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:`<br/>
`    lasso = Lasso(alpha=alpha)`<br/>
`    lasso.fit(X_train, y_train)`<br/>
`    lasso_pred = lasso.predict(X_test)`<br/>
`    scores.append(lasso.score(X_test, y_test))`<br/>
`print(scores)`<br/><br/>

* It can be used to select important features of a dataset by shrinking the coefficients of less important features to zero

`from sklearn.linear_model import Lasso` <br/>
`X = diabetes_df.drop("glucose", axis=1).values`<br/>
`y = diabetes_df["glucose"].values`<br/>
`names = diabetes_df.drop("glucose", axis=1).columns`<br/>
`lasso = Lasso(alpha=0.1)`<br/>
`lasso_coef = lasso.fit(X, y).coef_`<br/>


### Logistic Regression
Logistic regression is a type of statistical model used for binary classification tasks in machine learning. It predicts the probability that an observation belongs to a class based on one or more predictor variables. By default, `if the probability is > 0.5, the data is labeled as 1`; otherwise it is labeled as 0. It creates a linear decision boundary.

![Image description](logistic-regression.png)<br/><br/><br/>
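
A minimal sketch of how a probability and a label could be obtained from a linear combination of the features. The coefficients, intercept and feature values are made up for illustration, and the sigmoid is the standard link function for binary logistic regression.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, coefficients, intercept, threshold=0.5):
    """Return (probability of class 1, predicted label) for one observation."""
    z = sum(c * x for c, x in zip(coefficients, features)) + intercept
    probability = sigmoid(z)
    return probability, int(probability > threshold)

# Hypothetical fitted parameters for a model with two features.
coefficients = [1.5, -0.5]
intercept = -1.0

p_pos, label_pos = predict_label([2.0, 1.0], coefficients, intercept)
p_neg, label_neg = predict_label([0.0, 2.0], coefficients, intercept)
print(round(p_pos, 3), label_pos)
print(round(p_neg, 3), label_neg)
```

Because the sigmoid equals 0.5 exactly when its input is 0, the 0.5 threshold corresponds to a straight line (or plane) in feature space, which is why the decision boundary is linear.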

`from sklearn.linear_model import LogisticRegression` <br/>
`logreg = LogisticRegression()`<br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`<br/>
`logreg.fit(X_train, y_train)`<br/>
`y_pred = logreg.predict(X_test)`<br/>
`print(logreg.score(X_test, y_test))`<br/>

We can calculate the probability of each instance belonging to the positive class by calling the `predict_proba` method and taking the second column.

`y_pred_probs = logreg.predict_proba(X_test)[:, 1]`<br/>
`print(y_pred_probs[0])`

##### Difference between Logistic Regression and Linear Regression:

Linear regression is used for predicting continuous values by establishing a linear relationship between the dependent variable and independent variables. In contrast, logistic regression is used for binary classification, aiming to predict the probability of an input belonging to a particular class.

#### ROC Curve
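The `Receiver Operating Characteristic (ROC)` curve is a graphical representation used to evaluate the performance of a classification model. It plots the true positive rate against the false positive rate as the decision threshold varies.

`import matplotlib.pyplot as plt` <br/>
`from sklearn.metrics import roc_curve` <br/>
`fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)`<br/>
`plt.plot([0, 1], [0, 1], 'k--')` dashed diagonal, the performance of random guessing<br/>
`plt.plot(fpr, tpr)`<br/>
`plt.show()`<br/>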

To quantify the performance of the model, we calculate `Area Under Curve (AUC)`.

`from sklearn.metrics import roc_auc_score` <br/>
`print(roc_auc_score(y_test, y_pred_probs))`
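
To see where the curve and the area come from, the sketch below builds the ROC points by hand and integrates them with the trapezoidal rule. It is a simplified illustration with hypothetical helper names and a four-sample toy dataset, not a replacement for `roc_curve` or `roc_auc_score`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by thresholding at each distinct score."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; everything scoring >= threshold is predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the piecewise-linear curve using the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
points = roc_points(y_true, scores)
area = auc(points)
print(points)
print(area)
```

An area of 0.5 corresponds to the dashed diagonal (random guessing), and 1.0 to a model that ranks every positive above every negative.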

### Hyperparameter Tuning
In supervised machine learning, `hyperparameters` are external configurations or settings for a machine learning algorithm that are not learned from the data but are set prior to the training process. These hyperparameters control various aspects of the learning process, the model's complexity, and the optimization process. <br/><br/>
`Hyperparameter Tuning`, also known as `Hyperparameter Optimization`, is the process of finding the best set of hyperparameters for a machine learning model to optimize its performance on a specific task.
* We can try lots of different parameter values
* Fit them all separately
* See how well they perform
* Choose the best performing values

<b>Grid Search</b>: 
*   This method involves specifying a set of values for each hyperparameter, and the search algorithm exhaustively evaluates all possible combinations. It can be computationally expensive but is straightforward.
    * `from sklearn.model_selection import GridSearchCV` <br/>
    `kf = KFold(n_splits=5, shuffle=True, random_state=42)`<br/>
    `param_grid = {"alpha" : np.linspace(0.0001, 1, 10), "solver":["sag", "lsqr"]}` <br/>
    `ridge = Ridge()`<br/>
    `ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)`<br/>
    `ridge_cv.fit(X_train, y_train)` <br/>
    `print(ridge_cv.best_params_, ridge_cv.best_score_)`

<b>Random Search</b>:
*  Instead of exhaustively searching all combinations, random search samples hyperparameters randomly within specified ranges. It is more computationally efficient than grid search and often yields good results.
    * `from sklearn.model_selection import RandomizedSearchCV`<br/>
    `kf = KFold(n_splits=5, shuffle=True, random_state=42)`<br/>
    `param_grid = {'alpha': np.linspace(0.0001, 1, 10), 'solver': ['sag', 'lsqr']}`<br/>
    `ridge = Ridge()`<br/>
    `ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)`<br/>
    `ridge_cv.fit(X_train, y_train)`<br/>
    `print(ridge_cv.best_params_, ridge_cv.best_score_)`<br/>
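
The difference between the two strategies can be sketched without fitting any model. In the example below, `toy_score` is a made-up stand-in for a cross-validated score, and the helper names are hypothetical; a real search would fit and score a model for every candidate.

```python
import itertools
import random

def grid_candidates(param_grid):
    """Expand a dict of value lists into every combination of parameter values."""
    names = list(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, values)) for values in itertools.product(*value_lists)]

def toy_score(params):
    """Hypothetical score: best near alpha=0.3, with a small bonus for 'lsqr'."""
    bonus = 0.05 if params["solver"] == "lsqr" else 0.0
    return 1.0 - (params["alpha"] - 0.3) ** 2 + bonus

param_grid = {"alpha": [0.1, 0.3, 0.5, 0.7], "solver": ["sag", "lsqr"]}

# Grid search: evaluate all 4 * 2 = 8 combinations.
all_candidates = grid_candidates(param_grid)
grid_best = max(all_candidates, key=toy_score)

# Random search: evaluate only a random subset (here 3 candidates).
rng = random.Random(42)
sampled = rng.sample(all_candidates, 3)
random_best = max(sampled, key=toy_score)

print(len(all_candidates), grid_best)
print(len(sampled), random_best)
```

Random search evaluates fewer candidates, so it is cheaper, but it is not guaranteed to find the best combination in the grid.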

### Multi-class Logistic Regression
`Multi-class logistic regression` is an extension of binary logistic regression to handle classification problems with more than two classes. It's a type of linear model that predicts the probability of each class and assigns the class with the highest probability as the final prediction. For multi-class logistic regression, the generalization involves assigning a separate linear model to each class but combining them into a single model. <br/>

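In machine learning, `the one-vs-rest (OvR) strategy` is a technique for extending binary classification algorithms to handle multi-class classification problems. A separate binary classifier is fit for each class, treating that class as positive and all the others as negative.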

`lr0.fit(X, y==0)` <br/>
`lr1.fit(X, y==1)` <br/>
`lr2.fit(X, y==2)` <br/>

In order to make predictions using one-vs-rest, we take the class whose classifier gives the largest raw model output, or `decision_function` in scikit-learn terminology.

`lr0.decision_function(X)[0]` (6.124) <br/>
`lr1.decision_function(X)[0]` (-5.429) <br/>
`lr2.decision_function(X)[0]` (-7.532) <br/>

In the example above, classifier 0 returns the highest value, so class 0 is predicted. <br/>
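
The selection step amounts to an argmax over the per-class outputs. A tiny sketch using the three values quoted above (the dictionary layout is an assumption made for the example):

```python
# Raw decision-function outputs of the three one-vs-rest classifiers for one sample,
# keyed by the class each classifier was trained to detect.
decision_values = {0: 6.124, 1: -5.429, 2: -7.532}

# The predicted class is the one whose classifier is most confident.
predicted_class = max(decision_values, key=decision_values.get)
print(predicted_class)
```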

In `multinomial logistic regression`, a single classifier is fit for all classes, and the prediction directly outputs the best class. However, it is relatively complicated compared to the one-vs-rest strategy, because it requires more new code than reusing a binary classifier.

We can use scikit-learn to fit a logistic regression model on the original multi-class data set

`lr = LogisticRegression(multi_class = 'ovr')` <br/>
`lr.fit(X, y)` <br/>
`lr.predict(X)[0]` <br/>


### Choosing Between Different Models
Some guiding principles include
* Size of the dataset
    * Fewer features = simpler model, faster training time
    * Some models, like neural networks, require large amounts of data to perform well
* Interpretability
    * Some models are easier to explain, which can be important for stakeholders
    * Linear Regression has higher interpretability
* Flexibility
    * May improve accuracy by making fewer assumptions about data
    * KNN is a more flexible model, as it does not assume any linear relationship between the features and the target

<b>Models affected by scaling</b>
* KNN
* Linear Regression
* Logistic Regression (Best Median score for performance when trained on scaled data)
* ANN (Artificial Neural Network)


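Checking for missing values

`print(music_df.isna().sum().sort_values())`<br/>

Dropping missing data

`music_df = music_df.dropna(subset=["genre","popularity"])` <br/>
`print(music_df.isna().sum().sort_values())`<br/>

Imputing values (making an educated guess of what the missing values could be; it is common to use the mean or median for numerical features and the mode for categorical features)

`from sklearn.impute import SimpleImputer`<br/>
`X_cat = music_df["genre"].values.reshape(-1, 1)`<br/>
`X_num = music_df.drop(["genre", "popularity"], axis=1).values`<br/>
`y = music_df["popularity"].values`<br/>
`X_train_cat, X_test_cat, y_train, y_test = train_test_split(X_cat, y, test_size=0.3, random_state=42)` split for the categorical features<br/>
`X_train_num, X_test_num, y_train, y_test = train_test_split(X_num, y, test_size=0.3, random_state=42)` split for the numerical features, using the same seed so the rows match<br/>
`imp_cat = SimpleImputer(strategy="most_frequent")` imputer for the categorical features<br/>
`X_train_cat = imp_cat.fit_transform(X_train_cat)`<br/>
`X_test_cat = imp_cat.transform(X_test_cat)`<br/>
`imp_num = SimpleImputer()` imputer for the numerical features (uses the mean by default)<br/>
`X_train_num = imp_num.fit_transform(X_train_num)`<br/>
`X_test_num = imp_num.transform(X_test_num)`<br/>
`X_train = np.append(X_train_num, X_train_cat, axis=1)` combine the numerical and categorical features again<br/>
`X_test = np.append(X_test_num, X_test_cat, axis=1)`<br/>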

Imputing within a pipeline

`from sklearn.pipeline import Pipeline`<br/>
`music_df = music_df.dropna(subset=["genre", "popularity"])`<br/>
`music_df["genre"] = np.where(music_df["genre"] == "Rock", 1, 0)`<br/>
`X = music_df.drop("genre", axis=1).values`<br/>
`y = music_df["genre"].values`<br/>
`steps = [("imputation", SimpleImputer()), ("logistic_regression", LogisticRegression())]`<br/>
`pipeline = Pipeline(steps)`<br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`<br/>
`pipeline.fit(X_train, y_train)`<br/>
`pipeline.score(X_test, y_test)`<br/>

Scaling in scikit-learn

`from sklearn.preprocessing import StandardScaler`<br/>
`scaler=StandardScaler()`<br/>
`X_train_scaled = scaler.fit_transform(X_train)`<br/>
`X_test_scaled = scaler.transform(X_test)`<br/>
`print(np.mean(X), np.std(X))`<br/>
`print(np.mean(X_train_scaled), np.std(X_train_scaled))`<br/>

Scaling in Pipeline

`steps = [('scaler', StandardScaler()),('knn', KNeighborsClassifier(n_neighbors=6))]`<br/>
`pipeline = Pipeline(steps)`<br/>
`knn_scaled = pipeline.fit(X_train, y_train)`<br/>
`y_pred = knn_scaled.predict(X_test)`<br/>
`print(knn_scaled.score(X_test, y_test))`<br/>

Cross-validation and scaling in a pipeline

`from sklearn.model_selection import GridSearchCV`<br/>
`steps = [('scaler', StandardScaler()),('knn', KNeighborsClassifier())]`<br/>
`pipeline = Pipeline(steps)`<br/>
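
One way the pipeline above could be combined with a grid search is sketched below. The synthetic dataset from `make_classification`, the range of `n_neighbors` values and the variable names are assumptions for illustration. The `knn__n_neighbors` key follows scikit-learn's `<step name>__<parameter>` convention for addressing a parameter of a pipeline step.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with your own feature matrix X and target y.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

steps = [("scaler", StandardScaler()), ("knn", KNeighborsClassifier())]
pipeline = Pipeline(steps)

# Search over the number of neighbors of the "knn" step with 5-fold cross-validation.
parameters = {"knn__n_neighbors": list(range(1, 16))}
cv = GridSearchCV(pipeline, param_grid=parameters, cv=5)
cv.fit(X_train, y_train)

test_accuracy = cv.score(X_test, y_test)
print(cv.best_params_, test_accuracy)
```

Because the scaler is inside the pipeline, it is refit on the training folds only in each cross-validation round, so no information from the validation fold leaks into the scaling.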



## Unsupervised Machine Learning
It is a class of machine learning techniques for discovering patterns in data. In contrast to supervised machine learning, it is learning without labels. It is pure pattern discovery, unguided by a prediction task.

`Dimension of dataset = Number of Features`

### K-means Clustering
K-means clustering is an unsupervised machine learning algorithm used for partitioning a dataset into K distinct, non-overlapping subsets (or clusters). The goal of K-means is to group similar data points into the same cluster while keeping different groups as separate as possible.

`from sklearn.cluster import KMeans` <br/>
`model = KMeans(n_clusters=3)` <br/>
`model.fit(samples)` <br/>
`centroids = model.cluster_centers_` <br/>
`labels = model.predict(samples)` <br/>
`print(labels)` <br/>

New samples can be assigned to existing clusters without starting over. K-means does this by remembering the mean of the samples in each cluster, called the `centroids`; new samples are assigned to the cluster whose centroid is closest. To visualise the result:

`import matplotlib.pyplot as plt` <br/>
`xs = samples[:, 0]` <br/>
`ys = samples[:, 2]` <br/>
`plt.scatter(xs, ys, c=labels)` <br/>
`plt.show()`<br/>

#### Evaluating Clustering
A good clustering has tight clusters, meaning that the samples in each cluster are bunched together, not spread out. How spread out the samples within each cluster are is measured by the `inertia`. Inertia measures how far samples are from their centroids. Lower values of the inertia are better.

`print(model.inertia_)`
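
Both ideas, assigning a sample to its closest centroid and measuring inertia, can be sketched in plain Python. The helpers, centroids and samples below are hypothetical; inertia is computed here as the sum of squared distances from each sample to its closest centroid, which is how scikit-learn documents `inertia_`.

```python
import math

def nearest_centroid(sample, centroids):
    """Return the index of the centroid closest to `sample`."""
    return min(range(len(centroids)), key=lambda i: math.dist(sample, centroids[i]))

def inertia(samples, centroids):
    """Sum of squared distances from each sample to its closest centroid."""
    return sum(min(math.dist(s, c) ** 2 for c in centroids) for s in samples)

# Hypothetical centroids remembered from an earlier fit, and the samples they came from.
centroids = [(0.0, 0.0), (10.0, 10.0)]
samples = [(0.0, 1.0), (1.0, 0.0), (10.0, 9.0), (9.0, 10.0)]

new_label = nearest_centroid((8.0, 9.5), centroids)   # a new, unseen sample
total_inertia = inertia(samples, centroids)
print(new_label, total_inertia)
```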

The question arises: `what is the best number of clusters?` This is ultimately a trade-off; a good clustering has low inertia (tight clusters) but also does not have too many clusters. A good rule of thumb is to choose an elbow in the inertia plot, a point where the inertia begins to decrease more slowly.

When known labels (such as species) are available, the quality of a clustering can also be checked against them:
* A direct approach is to check the correspondence of the cluster labels with a known feature of the dataset
* Another method is `cross-tabulation`

##### Cross Tabulation
Cross-tabulation, also known as a contingency table or cross-tab, is a statistical method used in unsupervised machine learning to analyze the relationship between two or more categorical variables. It provides a summary of the distribution of data points across these variables, allowing for a better understanding of the patterns and associations within the dataset.

`import pandas as pd`<br/>
`df = pd.DataFrame({'labels' : labels, 'species': species})`<br/>
`print(df)`<br/>
`ct = pd.crosstab(df['labels'], df['species'])`<br/>
`print(ct)`<br/>

When species information is not available, the clustering has to be evaluated using only the samples and their cluster labels, for example with the inertia.

##### Feature Variance
K-means algorithms are sensitive to feature variance. In machine learning, `feature variance` refers to the extent to which the values of a particular feature in a dataset vary or spread out. In K-means, the variance of a feature corresponds to its influence on the clustering: features with high variance tend to have more influence on which cluster a sample is assigned to. To give every feature a chance, the data needs to be transformed so that the features have equal variance.

`from sklearn.preprocessing import StandardScaler`<br/>
`scaler = StandardScaler()`<br/>
`scaler.fit(samples)`<br/>
`StandardScaler(copy=True, with_mean=True, with_std=True)` (the output displayed by `scaler.fit`, not a line to run)<br/>
`samples_scaled=scaler.transform(samples)`<br/>
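
What standardization does to a single feature can be sketched by hand: subtract the mean and divide by the standard deviation. The helper and the two toy feature columns below are hypothetical, and the population standard deviation is used, which matches `StandardScaler`'s default behaviour.

```python
import statistics

def standardize(column):
    """Rescale a list of values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    return [(value - mean) / std for value in column]

# Two hypothetical features measured on very different scales.
small_scale = [0.1, 0.2, 0.3, 0.4]
large_scale = [100.0, 200.0, 300.0, 400.0]

scaled_small = standardize(small_scale)
scaled_large = standardize(large_scale)
print([round(v, 3) for v in scaled_small])
print([round(v, 3) for v in scaled_large])
```

After scaling, both features have the same variance, so neither dominates the distance calculations that K-means relies on.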

Pipeline

`from sklearn.preprocessing import StandardScaler`<br/>
`from sklearn.cluster import KMeans`<br/>
`scaler = StandardScaler()`<br/>
`kmeans = KMeans(n_clusters=3)`<br/>
`from sklearn.pipeline import make_pipeline`<br/>
`pipeline = make_pipeline(scaler, kmeans)`<br/>
`pipeline.fit(samples)`<br/>
`labels = pipeline.predict(samples)`<br/>


### Visualising Hierarchies
`Hierarchical clustering` is a method used in unsupervised machine learning for grouping similar data points into clusters. The process is hierarchical because it creates a tree-like structure (a dendrogram) of clusters in which clusters are contained in one another. <br/>
In the following example, based on the Eurovision dataset, the dendrogram groups the countries into larger and larger clusters. Dendrograms are read from the bottom up, and the vertical lines represent clusters. The y-axis of the dendrogram encodes the distance between merging clusters. <br/>
The distance between two clusters is measured using a `linkage method`, which in the following example is `complete`. In `complete linkage`, the distance between clusters is the distance between the furthest points of the clusters. In `single linkage`, the distance between clusters is the distance between the closest points of the clusters.<br/>
Hierarchical clustering proceeds in the following steps (based on the Eurovision dataset):
* Every country begins in a separate cluster
* At each step, the two closest clusters are merged
* Continue until all countries are in a single cluster

This process is a particular type of hierarchical clustering called `agglomerative clustering`.

`import matplotlib.pyplot as plt` <br/>
`from scipy.cluster.hierarchy import linkage, dendrogram`<br/>
`mergings = linkage(samples, method='complete')` performs the hierarchical clustering<br/>
`dendrogram(mergings, labels=country_names, leaf_rotation=90, leaf_font_size=6)`<br/>
`plt.show()`
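
The difference between complete and single linkage can be illustrated with two tiny clusters. The helper functions and points below are hypothetical and only compute the distance between two given clusters; they do not perform the full agglomerative procedure that `linkage` does.

```python
import math
from itertools import product

def complete_linkage(cluster_a, cluster_b):
    """Distance between the two furthest points, one from each cluster."""
    return max(math.dist(p, q) for p, q in product(cluster_a, cluster_b))

def single_linkage(cluster_a, cluster_b):
    """Distance between the two closest points, one from each cluster."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))

# Two hypothetical clusters of points on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(3.0, 0.0), (5.0, 0.0)]

complete_distance = complete_linkage(cluster_a, cluster_b)
single_distance = single_linkage(cluster_a, cluster_b)
print(complete_distance, single_distance)
```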

#### Extracting cluster labels using fcluster
`from scipy.cluster.hierarchy import linkage`<br/>
`mergings = linkage(samples, method='complete')`<br/>
`from scipy.cluster.hierarchy import fcluster`<br/>
`labels = fcluster(mergings, 15, criterion='distance')`<br/>
`print(labels)`

To inspect cluster labels

`import pandas as pd`<br/>
`pairs = pd.DataFrame({'labels': labels, 'countries': country_names})`<br/>
`print(pairs.sort_values('labels'))`<br/>


### t-SNE
t-SNE, or `t-Distributed Stochastic Neighbor Embedding`, is a dimensionality reduction technique commonly used in machine learning and data visualization. Its primary purpose is to take high-dimensional data and represent it in a lower-dimensional space, typically 2D or 3D, while preserving the pairwise similarities between data points as much as possible. In simple terms, it maps samples from their high-dimensional space into a 2- or 3-dimensional space so they can be visualised. It does an excellent job of approximately representing the distances between the samples.

##### t-SNE on the iris dataset
`import matplotlib.pyplot as plt` <br/>
`from sklearn.manifold import TSNE` <br/>
`model = TSNE(learning_rate=100)` <br/>
`transformed = model.fit_transform(samples)` <br/>
`xs = transformed[:, 0]` <br/>
`ys = transformed[:, 1]` <br/>
`plt.scatter(xs, ys, c=species)` <br/>
`plt.show()` <br/>

t-SNE has only a `fit_transform()` method, which simultaneously fits the model and transforms the data. It does not have separate `fit()` or `transform()` methods, which means that you cannot extend a t-SNE map to include new samples; instead, you have to start over each time. <br/>
Choosing the learning rate for t-SNE is a bit complicated. If the points on the scatter plot are bunched together, you have made a wrong choice. Normally, it is enough to try a few values between 50 and 200. <br/>
Another thing to be aware of is that the axes of a t-SNE plot do not have any interpretable meaning. In fact, they are different every time t-SNE is applied, even on the same data.


## Dimension Reduction
Dimension reduction in machine learning is a technique used to reduce the number of features or variables in a dataset. The goal is to simplify the dataset while retaining its essential information. High-dimensional data, where each instance has many features, can be challenging to work with due to increased computational complexity, potential noise, and the curse of dimensionality.

### Principal Component Analysis
`Principal Component Analysis (PCA)` is a dimensionality reduction technique widely used in machine learning and statistics. Its main goal is to transform high-dimensional data into a lower-dimensional representation, preserving as much of the original variance as possible. It is called principal component analysis because it learns the `principal components` of the data. <br/>
PCA performs dimension reduction in two steps:
* Decorrelation
    * PCA rotates the data samples so that they are aligned with the coordinate axes.
    * It also shifts the samples so that they have mean zero
* Reducing Dimension

`from sklearn.decomposition import PCA` <br/>
`model = PCA()` <br/>
`model.fit(samples)` <br/>
`transformed = model.transform(samples)` <br/>
`print(transformed)` <br/>
`print(model.components_)` <br/>
 
The new array has the same number of rows and columns as the original sample array. The columns of new array correspond to PCA features

##### Pearson Correlation
Pearson correlation, often referred to as the `Pearson correlation coefficient (PCC)` or Pearson's r, is a statistical measure that quantifies the linear relationship between two variables. It is commonly used to assess the strength and the direction of the linear association between two continuous variables. Value varies between -1 and  1 where 0 means no linear correlation
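
A short sketch of how the coefficient could be computed by hand, using the standard definition (covariance divided by the product of the standard deviations). The function name and the toy data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_variation / (spread_x * spread_y)

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
perfect_positive = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])
perfect_negative = pearson_r(xs, [5.0, 4.0, 3.0, 2.0, 1.0])
# A symmetric parabola: y clearly depends on x, but not linearly.
no_linear = pearson_r([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])
print(perfect_positive, perfect_negative, no_linear)
```

The last case shows that a correlation of 0 only rules out a linear relationship, not every kind of dependence.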

##### Intrinsic Dimension
`Intrinsic dimension` in machine learning refers to the essential number of features needed to represent the data accurately. It is the dimensionality of the dataset that captures most of the variability or information within the data. In other words, intrinsic dimensionality helps identify the minimum number of features that retain the key information necessary to describe the dataset.

`Intrinsic Dimension using Iris dataset`
To better illustrate the intrinsic dimension, let's consider an example dataset containing only some of the samples from the iris dataset. We will take three measurements (3 features) from the iris versicolor samples:
* sepal length
* sepal width
* petal width

Each sample point is represented as a point in 3d space. If we make a 3D scatter plot of the samples, you can observe they all lie close to a flat 2D sheet, which means the data can be approximated by using only two coordinates without losing much information. So this dataset has intrinsic dimension 2. The intrinsic dimension can be identified by counting the PCA features that have high variance

`import matplotlib.pyplot as plt` <br/>
`from sklearn.decomposition import PCA` <br/>
`pca = PCA()` <br/>
`pca.fit(samples)` <br/>
`features = range(pca.n_components_)` <br/>
`plt.bar(features, pca.explained_variance_)`<br/>
`plt.xticks(features)` <br/>
`plt.show()` <br/>

Intrinsic dimension is an idealization, and there is not always one correct answer. <br/><br/>
PCA discards low variance PCA features and assumes the high variance features are informative. For dimension reduction:

`from sklearn.decomposition import PCA` <br/>
`pca = PCA(n_components=2)`<br/>
`pca.fit(samples)` <br/>
`transformed=pca.transform(samples)`<br/>
`print(transformed.shape)`<br/>

### Non Negative Matrix Factorization
`Non-Negative Matrix Factorization (NMF)` is a technique used in machine learning for dimensionality reduction and feature extraction. It's particularly useful when dealing with non-negative data, such as images, text, audio, or other types of non-negative signals. Unlike PCA, NPF are easier to understand and much easier to explain to others. For example, NMF expresses images as combination of patterns. It requires the dataset to have only non-negative sample features

`Word Frequency Example`<br/>
The frequency of words in each document can be calculated using `tfidf`. `tf` is the frequency of the word in the document, while `idf` is a weighting scheme that reduces the influence of frequent words such as "the".
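
A small pure-Python sketch may make the two factors concrete. The toy documents are invented for illustration, and the `idf` formula used here is the textbook `log(N / df)` variant; libraries such as scikit-learn typically apply additional smoothing, so their numbers will differ.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.split() for doc in documents]

def term_frequency(tokens):
    """Share of the document taken up by each word."""
    counts = Counter(tokens)
    return {word: count / len(tokens) for word, count in counts.items()}

def inverse_document_frequency(corpus):
    """Textbook idf: log(N / df). Words found in many documents get a low weight."""
    n_docs = len(corpus)
    doc_freq = Counter(word for tokens in corpus for word in set(tokens))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

idf = inverse_document_frequency(tokenized)
tfidf = [
    {word: tf * idf[word] for word, tf in term_frequency(tokens).items()}
    for tokens in tokenized
]
print(tfidf[0])
```

In the first document "the" occurs twice and "cat" once, yet "cat" ends up with the larger tf-idf value because "the" also appears in another document.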

`from sklearn.decomposition import NMF`<br/>
`model = NMF(n_components=2)`<br/>
`model.fit(samples)`<br/>
`nmf_features = model.transform(samples)`<br/>
`print(model.components_)`<br/>
`print(nmf_features)` <br/>

A sample can be reconstructed by multiplying the NMF components by the NMF feature values and adding up. This calculation can also be expressed as a product of matrices.

`Example`

NMF Component value is
<br/>
[[1.  0.5 0. ]
 [0.2 0.1 2.1]]
<br/>

NMF Feature value is [2, 1]
<br/>

The reconstruction formula is 

`Reconstructed Sample = NMF Feature Values x NMF Components`

Performing the multiplication

[2, 1] x [[1.  0.5 0. ]
 [0.2 0.1 2.1]] 
 <br/>
 = [ 2(1) + 1(0.2), 2(0.5) + 1(0.1), 2(0) + 1(2.1) ]
 <br/>
 = [2.2, 1.1, 2.1]
 <br/> <br/> <br/>
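
The same reconstruction can be checked with NumPy. This is only a sketch that reuses the numbers from the worked example; the variable names are chosen for illustration.

```python
import numpy as np

# Values taken from the worked example above.
components = np.array([[1.0, 0.5, 0.0],
                       [0.2, 0.1, 2.1]])
features = np.array([2.0, 1.0])

# Matrix product: feature values times components.
reconstructed = features @ components

# Equivalent "multiply and add up" form, one component at a time.
weighted_sum = features[0] * components[0] + features[1] * components[1]

print(reconstructed)
print(weighted_sum)
```

Both forms give the same vector, which is why the reconstruction can be written as a product of matrices.
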
* If NMF is applied to documents, then the NMF components correspond to topics, and the NMF features reconstruct the documents from the topics
* If NMF is applied to images, then the NMF components represent patterns that frequently occur in the images
* A collection of grayscale images of the same size can be encoded as a 2D array in which each row corresponds to an image as a flattened array and each column represents a pixel

#### How linear classifiers make predictions

* First we compute raw model output 
    * raw model output = coefficients · features + intercept

    `from sklearn.linear_model import LogisticRegression` <br/>
    `lr = LogisticRegression()` <br/>
    `lr.fit(X, y)` <br/>
    `lr.predict(X)[10]` <br/>
    `lr.predict(X)[20]` <br/>
    `lr.coef_ @ X[10] + lr.intercept_` gives the raw model output

* We then take the sign of this quantity, i.e. check whether it is positive or negative (this is the same for both SVMs and logistic regression)
    * If positive, predict class `1`
    * If negative, predict class `0`

    
In general, this is what the predict function does for any X: it computes the raw model output, checks whether it is positive or negative, and then returns a result based on the names of the classes in your data set, here 0 and 1.

`Logistic regression and linear SVM have different fit functions but the same predict function`
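
The two steps above can be checked against scikit-learn directly. The sketch below assumes a synthetic binary dataset from `make_classification` (not part of the original notes) and compares a manual "raw output, then sign" prediction with `decision_function` and `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data, used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

lr = LogisticRegression()
lr.fit(X, y)

# Step 1: raw model output = coefficients . features + intercept
raw = X @ lr.coef_.ravel() + lr.intercept_[0]

# Step 2: take the sign and map it to the class names stored in classes_.
manual_pred = lr.classes_[(raw > 0).astype(int)]

print(np.allclose(raw, lr.decision_function(X)))
print((manual_pred == lr.predict(X)).all())
```

The same check should work with `LinearSVC` in place of `LogisticRegression`, since both expose `coef_`, `intercept_` and `decision_function`.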

## Support Vector Machine
`Support Vector Machines (SVMs)` are supervised machine learning algorithms used for classification and regression tasks. They are particularly effective in high-dimensional spaces and are widely used for tasks like image classification, text classification, and handwriting recognition. Linear SVMs use the `hinge loss` function with L2 regularisation. For linearly separable datasets, the SVM maximizes the margin (the distance from the boundary to the closest points).
<br/>
In scikit-learn, the basic SVM classifier is called `LinearSVC`, which is used in the same way as `LogisticRegression`

`from sklearn.svm import LinearSVC` <br/>
`svm = LinearSVC()`<br/>
`svm.fit(feature, target)`<br/>
`svm.score(feature, target)`<br/>

### Support Vector Classification
`Support Vector Classification (SVC)` is a specific implementation of SVM designed for classification tasks. The goal of SVC is to find a hyperplane that separates the data into different classes while maximizing the margin between these classes

`from sklearn.svm import SVC`<br/>
`svm = SVC()`<br/>
`svm.fit(data, target)`<br/>
`svm.score(data, target)`<br/>


#### Kernel SVMs
 Kernel SVMs, or `Kernelized Support Vector Machines`, are an extension of traditional SVMs that allow for the application of non-linear decision boundaries through the use of kernels. Kernels in SVM allow the algorithm to implicitly map the input features into a higher-dimensional space, where it becomes easier to find a hyperplane that separates the data. The mapping is done without explicitly calculating the coordinates of the data points in the higher-dimensional space

 `from sklearn.svm import SVC` <br/>
 `svm = SVC(gamma=1)` (the default is kernel='rbf' (Radial Basis Function); gamma controls the smoothness of the boundary) <br/>

 It is not advisable to pick the largest value of gamma just to get the highest possible training accuracy, because this leads to overfitting
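
To see why a large gamma tends to overfit, it can help to evaluate the RBF kernel by hand. The sketch below uses the standard definition `exp(-gamma * ||a - b||^2)`; the two points and the gamma values are arbitrary choices for illustration.

```python
import numpy as np

def rbf_kernel(a, b, gamma):
    """RBF similarity between two points: exp(-gamma * squared distance)."""
    squared_distance = np.sum((a - b) ** 2)
    return np.exp(-gamma * squared_distance)

x1 = np.array([0.0, 0.0])
x2 = np.array([1.0, 1.0])

# Larger gamma makes the similarity fall off faster with distance.
for gamma in (0.01, 1.0, 100.0):
    print(gamma, rbf_kernel(x1, x2, gamma))
```

With a very large gamma, each training point is only "similar" to points almost on top of it, so the decision boundary can wrap tightly around individual training samples.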

### Loss Function
A `loss function`, also known as a cost function or objective function, is a mathematical function that quantifies the difference between the predicted values and the actual values (or labels) in a dataset. The loss function serves as a measure of how well the model is performing. It provides a way to calculate the error between the predicted output and the actual target values. An example is the `Mean Squared Error (MSE)`. The squared error is not appropriate for classification problems. A natural loss for a classification problem is the number of errors. This is the `0-1 loss`: it is 0 for a correct prediction and 1 for an incorrect prediction. By summing it, we get the number of errors made on the training dataset. However, in practice the 0-1 loss is difficult to minimize directly, which is why it is not used by logistic regression and SVMs.
<br/>
To minimize a loss numerically, for example the function x²:

`import numpy as np` <br/>
`from scipy.optimize import minimize` <br/>
`minimize(np.square, 0).x` <br/>

<br/>

We can think of the `fit()` function as running code that minimizes the loss. The `score()` method is used to evaluate how well the model is doing on the data.
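
The losses mentioned in this section can be written as functions of the raw model output. The sketch below assumes labels encoded as -1 and +1 (a common convention for these formulas, not the 0/1 encoding used earlier) and uses made-up raw outputs. The logistic loss is included as the loss minimized by logistic regression.

```python
import numpy as np

def zero_one_loss(y, raw):
    """1 for a wrong sign, 0 for a correct one."""
    return (y * raw <= 0).astype(float)

def hinge_loss(y, raw):
    """Loss used by SVMs: zero once the point is beyond the margin."""
    return np.maximum(0.0, 1.0 - y * raw)

def logistic_loss(y, raw):
    """Loss used by logistic regression: smooth and never exactly zero."""
    return np.log1p(np.exp(-y * raw))

# Hypothetical labels (-1 / +1) and raw model outputs.
y = np.array([1, 1, -1, -1])
raw = np.array([2.5, -0.5, -3.0, 0.2])

print(zero_one_loss(y, raw).sum())   # number of misclassified samples
print(hinge_loss(y, raw).sum())
print(logistic_loss(y, raw).sum())
```

The 0-1 loss is piecewise constant, so it gives an optimizer no slope to follow, whereas the hinge and logistic losses change gradually with the raw output.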

<table>
  <tr>
    <th>Logistic Regression</th>
    <th>SVM</th>
  </tr>
  <tr>
    <td>Is a linear model</td>
    <td>Is a linear model</td>
  </tr>
  <tr>
    <td>Can be used with kernels, but slow</td>
    <td>Can be used with kernels, and fast</td>
  </tr>
  <tr>
    <td>Outputs meaningful probabilities</td>
    <td>Does not naturally output probabilities</td>
  </tr>
  <tr>
    <td>Sensitive to outliers as it aims to maximize the likelihood of the observed data</td>
    <td>Relatively robust to outliers due to the focus on maximizing the margin</td>
  </tr>
  <tr>
    <td>Can be extended to multiclass</td>
    <td>Can be extended to multiclass</td>
  </tr>
  <tr>
    <td>Generally faster and less computationally intensive, especially for large datasets.</td>
    <td>Can be computationally demanding, particularly with large datasets, as it involves solving a quadratic programming problem.</td>
  </tr>
  <tr>
    <td>All data points affect fit</td>
    <td>Only <code>support vectors</code> affect fit</td>
  </tr>
  <tr>
    <td>L2 or L1 Regularisation</td>
    <td>Conventionally just L2</td>
  </tr>
</table>

## Decision Tree
A `decision tree` is a popular supervised machine learning algorithm used for both classification and regression tasks. It works by recursively partitioning the input space into regions and assigning a specific label or value to each region. The tree is trained so that one class label is predominant in each leaf.

* `Nodes`: <br/>
Decision trees consist of nodes, which represent questions or conditions based on features. There are three kinds of nodes:
    * `Root node` is the node at which the decision tree starts growing. It has no parent node and involves a question that gives rise to 2 child nodes through branches
    * `Internal node` has one parent node and involves a question that gives rise to 2 child nodes
    * `Leaf node` is a node with one parent node and no children. It is where predictions are made
* `Edges`: <br/>
Edges connect nodes and represent possible answers to the questions.
* `Leaves`: <br/>
Terminal nodes, or leaves, contain the final output (class label for classification or predicted value for regression).


`from sklearn.tree import DecisionTreeClassifier` <br/>
`from sklearn.model_selection import train_test_split` <br/>
`from sklearn.metrics import accuracy_score` <br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y)` (X features, y target; split options left at their defaults here) <br/>
`dt = DecisionTreeClassifier(max_depth=2, random_state=1, criterion='gini')` <br/>
`dt.fit(X_train, y_train)` <br/>
`y_pred = dt.predict(X_test)` <br/>
`accuracy_score(y_test, y_pred)` <br/>

`Decision region` is the region in the feature space where all instances are assigned to one class label. Decision regions are separated by surfaces called `decision boundaries`

### CART
CART, which stands for `Classification and Regression Trees`, is a popular machine learning algorithm used for both classification and regression tasks. The process includes

* `Recursive Splitting:` <br/>
Start with the entire dataset and recursively split nodes based on features to maximize class separation (Gini impurity for classification, mean squared error for regression).
* `Binary Tree Structure: ` <br/>
Each internal node tests a feature, and the branches represent possible outcomes. Leaves contain class labels (for classification) or predicted values (for regression). 
* `Stopping Criteria: ` <br/>
Define criteria to stop splitting, such as a maximum tree depth, minimum samples per leaf, or minimum impurity improvement.
* `Pruning (Optional): ` <br/>
After tree construction, prune nodes to improve generalization and reduce overfitting.
* `Prediction: ` <br/>
For a new instance, traverse the tree, and the prediction is the majority class (for classification) or mean/median (for regression) in the leaf node. <br/> <br/>
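
The recursive splitting step can be illustrated for a single feature. The following pure-Python sketch searches for the threshold that minimizes the weighted Gini impurity of the two child nodes; the feature values and labels are invented, and this is a simplified illustration rather than scikit-learn's actual implementation.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0 for a pure node, higher for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split on one feature."""
    best = (None, float("inf"))
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in zip(values, labels) if val <= threshold]
        right = [lab for val, lab in zip(values, labels) if val > threshold]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (threshold, weighted)
    return best

# Hypothetical feature values and class labels.
feature_values = [0.2, 0.3, 0.4, 1.3, 1.5, 1.8]
classes = ["a", "a", "a", "b", "b", "b"]

threshold, impurity = best_split(feature_values, classes)
print(threshold, impurity)
```

A full tree would repeat this search over every feature, split on the best one, and then recurse into each child until a stopping criterion is met.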

Advantages of CARTs include:
* Simple to understand
* Simple to interpret
* Easy to use
* Flexible, which gives them the ability to describe non-linear dependencies between features and labels
* Don't need a lot of feature preprocessing to train

Limitations of CARTs include:
* A classification tree is only able to produce orthogonal decision boundaries
* Sensitive to small variations in the training dataset
* Also suffer from high variance when they are trained without constraints

To use a decision tree for regression problems:

`from sklearn.tree import DecisionTreeRegressor` <br/>
`from sklearn.metrics import mean_squared_error as MSE` <br/>
`dt = DecisionTreeRegressor(max_depth=4, min_samples_leaf=0.1, random_state=3)` <br/>
`dt.fit(X_train, y_train)` <br/>
`y_pred=dt.predict(X_test)` <br/>
`mse_dt = MSE(y_test, y_pred)` <br/>
`rmse_dt = mse_dt**(1/2)` <br/>
`print(rmse_dt)` <br/>


#### Information Gain
`Information Gain` is a concept used to measure the effectiveness of a feature in classifying the data. It helps decide which feature to use for splitting nodes in the tree during the construction process. Information Gain measures the reduction in entropy or impurity achieved by splitting a dataset based on a particular feature. The decision tree chooses a feature and split-point by maximizing information gain. The tree considers that every node contains information and aims at maximizing the information gain obtained after each split.
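
As a rough sketch, information gain with entropy as the impurity measure can be computed as the parent's entropy minus the size-weighted entropy of the children. The labels below are made up for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the weighted entropy of the two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Hypothetical node with 5 "yes" and 5 "no" samples.
parent = ["yes"] * 5 + ["no"] * 5

# A split that separates the classes completely.
perfect = information_gain(parent, ["yes"] * 5, ["no"] * 5)

# A split that leaves both children as mixed as the parent.
useless = information_gain(parent, ["yes", "no"] * 2, ["yes", "no"] * 3)

print(perfect, useless)
```

The tree would prefer the first split, since it removes all of the uncertainty about the class, while the second split removes none.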

#### KFold CV
`K-Fold cross-validation (CV)` is a technique used to assess the model's performance and generalization ability. It involves dividing the dataset into k subsets, called folds, and iteratively training and evaluating the model k times, using a different fold as the test set in each iteration. <br/>
* If the CV error of the model is greater than its training set error, then the model suffers from high variance
    * It means the model is overfitting the training dataset. To remedy overfitting,
        * decrease model complexity
        * increase data size (gather more data)

* If the CV error of the model is roughly equal to the training error but much greater than the desired error, then the model suffers from high bias
    * Such a model is said to be underfitting the training set. To remedy underfitting,
        * increase model complexity
        * increase max_depth, decrease min samples per leaf
        * gather more relevant features

`from sklearn.model_selection import cross_val_score` <br/>
`X_train, X_test ...` <br/>
`dt = DecisionTreeRegressor(max_depth= , min_samples_leaf= , random_state= )` <br/>
`MSE_CV = -cross_val_score(dt, X_train, y_train, cv= , scoring='neg_mean_squared_error', n_jobs = -1)`<br/>
`dt.fit(X_train, y_train)` <br/>
`y_predict_train = dt.predict(X_train)` <br/>
`y_predict_test = dt.predict(X_test)`<br/>
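
What `cross_val_score` does conceptually can be sketched by hand. The NumPy example below is an illustration under stated assumptions: the data is synthetic, and a least-squares line stands in for the model (it is not a decision tree and not scikit-learn's implementation).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)   # synthetic target

def fit_predict(X_tr, y_tr, X_te):
    """Fit a least-squares line on the training part and predict the test part."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([X_te, np.ones(len(X_te))]) @ coef

k = 5
folds = np.array_split(rng.permutation(len(X)), k)    # k disjoint index sets

cv_errors = []
for i in range(k):
    test_idx = folds[i]                               # fold i is held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
    cv_errors.append(np.mean((y[test_idx] - pred) ** 2))

cv_mse = float(np.mean(cv_errors))
train_mse = float(np.mean((y - fit_predict(X, y, X)) ** 2))
print(cv_mse, train_mse)
```

Comparing `cv_mse` with `train_mse` is the diagnostic described above: a CV error far above the training error points to high variance, while two similar but large errors point to high bias.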


## Ensemble Learning
`Ensemble learning` is a machine learning paradigm that involves combining the predictions of multiple models to produce a more robust and accurate prediction than any individual model.
* Different models are trained on the same dataset
* Each model makes its own predictions
* Meta-model then aggregates the predictions of individual models and outputs a final prediction
* Final prediction is more robust and less prone to errors than each individual model

### Voting Classifier
A `voting classifier` is an ensemble learning technique in machine learning where multiple models are trained to make predictions on a dataset, and the final prediction is determined by combining the individual predictions through a "voting" mechanism. The meta model outputs the prediction through hard voting. In `hard voting`, each model in the ensemble "votes" for a class, and the class that receives the majority of the votes is selected as the final prediction.
This is suitable for classifiers that output discrete class labels.

`from sklearn.linear_model import LogisticRegression` <br/>
`from sklearn.tree import DecisionTreeClassifier` <br/>
`from sklearn.neighbors import KNeighborsClassifier as KNN` <br/>
`from sklearn.ensemble import VotingClassifier` <br/>
`X_train, X_test, y_train, y_test` <br/>
`lr = LogisticRegression(random_state=)` <br/>
`knn = KNN()` <br/>
`dt = DecisionTreeClassifier(random_state=)` <br/>
`classifiers=[('Logistic Regression', lr), ('K Nearest Neighbours', knn), ('Classification Tree', dt)]` <br/>
`for clf_name, clf in classifiers: ` <br/>
    `clf.fit(X_train, y_train)`<br/>
`vc = VotingClassifier(estimators=classifiers)` <br/>
`vc.fit(X_train, y_train)` <br/>
`y_pred = vc.predict(X_test)` <br/>
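
The hard voting step itself is simple enough to sketch in plain Python. The predictions below are invented for three hypothetical classifiers; `hard_vote` is an illustrative helper, not part of scikit-learn.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority class for each sample across all models."""
    final = []
    for votes in zip(*predictions_per_model):
        final.append(Counter(votes).most_common(1)[0][0])
    return final

# Hypothetical predictions of three classifiers on four samples.
lr_pred = [1, 0, 1, 1]
knn_pred = [1, 1, 0, 1]
dt_pred = [0, 0, 1, 1]

ensemble_pred = hard_vote([lr_pred, knn_pred, dt_pred])
print(ensemble_pred)
```

Using an odd number of classifiers on a binary problem avoids ties, which this sketch does not handle specially.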

#### Bagging
`Bagging`, or `Bootstrap Aggregating`, is a machine learning ensemble technique designed to improve the stability and accuracy of a predictive model. The basic idea behind bagging is to train multiple instances of the same learning algorithm on different subsets of the training data and then combine their predictions. Bagging helps in reducing overfitting and variance in the model. By training multiple models on different subsets of the data, the ensemble model becomes more robust and less sensitive to the peculiarities of the training set. <br/>
The `bootstrap method` is a statistical tool in machine learning that helps estimate the uncertainty of a statistic. It works by repeatedly drawing random samples (with replacement) from your data. Each sample is used to calculate the statistic of interest. The variability in these estimates gives insights into the uncertainty associated with your initial data. <br/>
In a classification problem, the final prediction is obtained by majority voting. In a regression problem, the final prediction is the average of the predictions made by the individual models forming the ensemble.
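
The bootstrap idea can be sketched with NumPy by estimating the uncertainty of a sample mean. The data, the number of resamples and the 95% interval are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=10.0, scale=2.0, size=200)   # hypothetical observations

n_resamples = 1000
boot_means = np.empty(n_resamples)
for i in range(n_resamples):
    # Draw a sample of the same size, with replacement.
    resample = rng.choice(data, size=len(data), replace=True)
    boot_means[i] = resample.mean()

# The spread of the bootstrap means estimates the uncertainty of the mean.
standard_error = boot_means.std(ddof=1)
low, high = np.percentile(boot_means, [2.5, 97.5])
print(standard_error, (low, high))
```

Bagging reuses the same resampling step, but trains one model on each bootstrap sample instead of computing a statistic.
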

`from sklearn.ensemble import BaggingClassifier` <br/>
`from sklearn.metrics import accuracy_score` <br/>
`from sklearn.tree import DecisionTreeClassifier` <br/>
`X_train, X_test, ....` <br/>
`dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=0.16)` <br/>
`bc = BaggingClassifier(estimator=dt, n_estimators=300, n_jobs=-1)` (in scikit-learn versions before 1.2 this parameter was named `base_estimator`) <br/>
`bc.fit(X_train, y_train)` <br/>
`y_pred = bc.predict(X_test)` <br/>
`accuracy = accuracy_score(y_test, y_pred)` <br/>

On average, for each model, 63% of the training instances are sampled. The remaining 37% that are not sampled constitute what is known as the `Out-of-Bag` or `OOB instances`. Since OOB instances are not seen by a model during training, they can be used to estimate the performance of the ensemble without the need for cross-validation.
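
The 63% / 37% figures can be checked with a quick simulation. The sketch below draws one bootstrap sample of indices and counts how many distinct training instances it contains; the sample size is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: n_samples indices drawn with replacement.
indices = rng.integers(0, n_samples, size=n_samples)

in_bag_fraction = np.unique(indices).size / n_samples
oob_fraction = 1.0 - in_bag_fraction

# Probability that a given instance is drawn at least once; tends to 1 - 1/e.
theoretical = 1.0 - (1.0 - 1.0 / n_samples) ** n_samples
print(in_bag_fraction, oob_fraction, theoretical)
```

The simulated in-bag fraction lands close to the theoretical value of about 0.632, leaving roughly 37% of the instances out of the bag for each model.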