# Machine Learning

Machine learning is the process whereby machines are given the ability to learn to make decisions from data without being explicitly programmed.
* <b>`Unsupervised Learning`</b> is the process of uncovering hidden patterns and structures from unlabeled data.
    * Example: grouping customers into distinct categories (known as `clustering`) based on criteria such as shopping behavior or most-bought products
* <b>`Supervised Learning`</b> is a type of machine learning where the target values are already known for the training data, and a model is built to predict the target values of unseen data, given its features
    * Uses features to predict the value of the target variable

## Types of Supervised Learning
* <b>`Classification`</b> is used to predict the label, or category, of an observation. The target variable consists of categories
    * Examples: cat or dog, spam email classification
* <b>`Regression`</b> is used to predict continuous values
    * Examples: property price prediction, stock forecasting

### Naming Convention
* A `feature` or `predictor variable` is an independent variable used to predict the value of the target variable
* The `target variable` or `dependent variable` is the variable whose value needs to be predicted
* `Labeled data` is data for which the target values are known; it is what the model is trained on

### Scikit-Learn
Scikit-learn is one of the most popular machine learning libraries in Python. It's a powerful tool for building machine learning models, providing a wide range of algorithms and tools for tasks such as classification, regression, clustering, dimensionality reduction, and more. 
<br/>
scikit-learn requires that the features are in an array where each column is a feature and each row is a different observation. Similarly, the target needs to be a single column with the same number of observations as the feature data.

`Syntax for scikit-learn`

`from sklearn.module import Model` <br/>
`model = Model()`<br/>
`model.fit(X, y)` (X array of features, y array of our target variable values)<br/>
`prediction = model.predict(X_new)`<br/>
`print(prediction)`<br/>

##### Steps for Classifying Labels of Unseen Data
* Build a classifier
* Model learns from the labeled data we pass to it
* Pass unlabeled data to the model as input
* Model predicts the labels of the unseen data
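
The steps above can be sketched end to end on a tiny made-up dataset. The data values and the choice of `n_neighbors=3` are assumptions for illustration only; other scikit-learn classifiers follow the same `fit`/`predict` pattern.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up labeled data: one row per observation, one column per feature.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # one target value per row of X
print(X.shape, y.shape)  # (6, 2) (6,)

# 1. Build a classifier, 2. let it learn from the labeled data.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)

# 3. Pass unlabeled observations, 4. get predicted labels back.
X_new = np.array([[1.1, 1.0], [5.1, 5.0]])
predictions = clf.predict(X_new)
print(predictions)  # [0 1]
```

Note that `X_new` must have the same number of columns as the `X` used in `fit`.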


### k-Nearest Neighbors
k-Nearest Neighbors (KNN) is a supervised machine learning model that predicts the label of a data point by looking at the `k` closest labeled data points. It uses majority voting: the prediction is the label held by the majority of the nearest neighbors. <br/>
As k increases, the model becomes simpler. Simpler models are less able to detect relationships in the dataset, which is known as `underfitting`. <br/>
For smaller k, the model becomes more complex and is sensitive to noise in the training data rather than reflecting general trends, which is known as `overfitting`.

larger k = less complex model = can cause underfitting <br/>
smaller k = more complex model = can lead to overfitting
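
To make the majority vote concrete, here is a minimal from-scratch sketch in plain Python. The function name `knn_predict` and the data are hypothetical, and this is not how scikit-learn implements KNN internally. The data includes one noisy "dog" sitting among the cats, so the effect of k is visible.

```python
import math
from collections import Counter

def knn_predict(X_train, y_train, x_new, k):
    """Return the majority label among the k training points closest to x_new."""
    # Pair every training point's Euclidean distance to x_new with its label.
    distances = [(math.dist(x, x_new), label) for x, label in zip(X_train, y_train)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up data: three cats, three dogs, plus one noisy "dog" among the cats.
X_train = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (2.5, 2.5),
           (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
y_train = ["cat", "cat", "cat", "dog", "dog", "dog", "dog"]

x_new = (2.6, 2.6)
pred_k1 = knn_predict(X_train, y_train, x_new, k=1)  # follows the single noisy neighbor
pred_k5 = knn_predict(X_train, y_train, x_new, k=5)  # votes over a wider neighborhood
print(pred_k1, pred_k5)  # dog cat
```

With `k=1` the prediction copies the noisy point (the overfitting end of the spectrum); with `k=5` the surrounding cats outvote it.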

`from sklearn.neighbors import KNeighborsClassifier`<br/>
`X = dataset_name[['feature1', 'feature2']].values` <br/>
`y = dataset_name['target'].values` <br/>
`knn = KNeighborsClassifier(n_neighbors = 6)` <br/>
`knn.fit(X, y)` <br/>
`predictions = knn.predict(X_new)` <br/>
`print(predictions)`<br/>


#### Measuring Model Performance
In classification, accuracy is a commonly used metric. <br/><br/> 
`Accuracy: correct_predictions/total_observations`<br/><br/>
For measuring model performance, it is common practice to split the data into a `training set` and a `test set`. The classifier is trained on the training set, and its accuracy is then calculated on the test set. We commonly use 20-30% of our data as the test set. The `random_state` argument sets a seed for the random number generator that splits the data, which makes the split reproducible.

`from sklearn.model_selection import train_test_split`<br />
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)` (returns four arrays) <br />
`knn = KNeighborsClassifier(n_neighbors = 6)` <br/>
`knn.fit(X_train, y_train)` <br/>
`print(knn.score(X_test, y_test))` <br/><br/>
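
A self-contained version of the split-then-score workflow, using synthetic data from `make_classification` as a stand-in for a real dataset. The sample count, feature count and `test_size=0.25` are assumptions chosen for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a real dataset: 200 observations, 4 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=4, random_state=21)

# Hold out 25% of the rows; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=21
)
print(X_train.shape, X_test.shape)  # (150, 4) (50, 4)

knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)            # learn only from the training rows
accuracy = knn.score(X_test, y_test)  # fraction of test rows labeled correctly
print(accuracy)
```

For imbalanced classes, `train_test_split` also accepts `stratify=y` to keep the class proportions similar in both sets.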

To deal with underfitting and overfitting, we can fit the model for a range of values of k and pick the k for which the test accuracy is highest.

`import numpy as np`<br/>
`training_accuracies = {}`<br/>
`test_accuracies = {}`<br/>
`neighbors = np.arange(1, 26)`<br/>
`for neighbor in neighbors:`<br/>
`    knn = KNeighborsClassifier(n_neighbors=neighbor)`<br/>
`    knn.fit(X_train, y_train)` <br/>
`    training_accuracies[neighbor] = knn.score(X_train, y_train)` <br/>
`    test_accuracies[neighbor] = knn.score(X_test, y_test)` <br/>

#### Confusion Matrix
In machine learning, a confusion matrix is a table that is used to evaluate the performance of a classification algorithm. It allows for a detailed examination of the predicted and actual values of a model.

The confusion matrix is constructed around the following concepts:

* `True Positives (TP):` These are the cases where the model correctly predicts the positive class.
* `True Negatives (TN):` These are the cases where the model correctly predicts the negative class.
* `False Positives (FP):` These are the cases where the model incorrectly predicts the positive class when it's actually negative. Also known as Type I error.
* `False Negatives (FN):` These are the cases where the model incorrectly predicts the negative class when it's actually positive. Also known as Type II error.
<br/>
<table>
    <tr>
        <th></th>
        <th>Predicted Negative</th>
        <th>Predicted Positive</th>
    </tr>
    <tr>
        <td>Actual Negative</td>
        <td>True Negative (TN)</td>
        <td>False Positive (FP)</td>
    </tr>
    <tr>
        <td>Actual Positive</td>
        <td>False Negative (FN)</td>
        <td>True Positive (TP)</td>
    </tr>
</table>
<br/>

From the confusion matrix, various metrics can be derived to evaluate the performance of a classification model, including:

* `Accuracy: (TP + TN) / (TP + TN + FP + FN)`
    * The proportion of correctly classified instances among the total instances.

* `Precision: TP / (TP + FP)`
    * The accuracy of positive predictions, measuring the model's ability to not label a negative sample as positive.
    * Higher precision = fewer false positives among the predicted positives

* `Recall (Sensitivity or True Positive Rate): TP / (TP + FN)`
    * The proportion of actual positive cases that were correctly identified, measuring the model's ability to find all the positive samples.
    * High recall = lower false negative rate 
<br />

* `F1 Score: 2 * (Precision * Recall) / (Precision + Recall)`
    * It is the harmonic mean of precision and recall, providing a balance between the two metrics.
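
The formulas above can be checked by hand with a few lines of plain Python. The counts below are hypothetical, and the helper `classification_metrics` is illustrative only; it does not guard against division by zero, which occurs when a model predicts no positives.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 100 test observations.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.850
# precision: 0.889
# recall: 0.800
# f1: 0.842
```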

`from sklearn.metrics import classification_report, confusion_matrix` <br/>
`knn = KNeighborsClassifier(n_neighbors=7)` <br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)` <br/>
`knn.fit(X_train, y_train)` <br/>
`y_pred = knn.predict(X_test)` <br/>
`print(confusion_matrix(y_test, y_pred))` <br/>
`print(classification_report(y_test, y_pred))` <br/>



### Regression
`Regression` in machine learning is a type of supervised learning that deals with the prediction of continuous outcomes. It aims to find the relationship between a `dependent variable` (target) and one or more `independent variables` (features) to predict the value of the dependent variable based on the independent variables.

`from sklearn.linear_model import LinearRegression` <br/>
`reg = LinearRegression()` <br/>
`reg.fit(X, y)` <br/>
`predictions = reg.predict(X)` <br/>

#### Regression mechanism
In two dimensions, we want to fit a line to the data. The line takes the form
`y = ax + b`
where
* y = target
* x = single feature
* a, b = parameters/coefficients of the model (slope and intercept)

How do we choose a and b?
* Define an error function (also called loss function) for any given line
* Choose the line that minimizes the error function

We want the line to be as close to the observations as possible, so we want to minimise the vertical distance between the fit and the data. For each observation, we calculate the vertical distance between it and the line, which is called a `residual`. Ordinary least squares (OLS) chooses the line that minimises the sum of the squared residuals (RSS).
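
As a worked example, the sketch below fits a line to five made-up points using the closed-form least-squares solution for a single feature, then computes the residuals. The function name `fit_line` and the data are assumptions for illustration; `LinearRegression` solves the same problem for any number of features.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for y = a*x + b (single feature)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    b = y_mean - a * x_mean
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)

# Residual = observed value minus the value on the fitted line.
residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
rss = sum(r ** 2 for r in residuals)  # the quantity least squares minimises
print(round(a, 3), round(b, 3), round(rss, 3))  # 0.6 2.2 2.4
```

Any other choice of `a` and `b` gives a larger sum of squared residuals on this data.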

For linear regression in higher dimensions, the model takes the form `y = a1x1 + a2x2 + a3x3 + ... + anxn + b`, with one coefficient per feature.

So for linear regression using all features:

`from sklearn.model_selection import train_test_split` <br/>
`from sklearn.linear_model import LinearRegression`<br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)`<br/>
`reg_all = LinearRegression()`<br/>
`reg_all.fit(X_train, y_train)`<br/>
`y_pred = reg_all.predict(X_test)`<br/>

The default metric for linear regression is `R-squared`, which quantifies the proportion of variance in the target variable that is explained by the features. Its value typically ranges from 0 to 1, where 1 means that the features completely explain the target variance; on held-out data it can be negative when the model predicts worse than always predicting the mean.

`reg_all.score(X_test, y_test)`<br/> <br/>
However, there is a potential pitfall: R-squared depends on the way the data is split. The test data may have some peculiarities that mean the R-squared computed on it is not representative of the model's ability to generalise to unseen data.

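The performance of linear regression can also be measured using `mean squared error` (MSE) and `root mean squared error` (RMSE). RMSE is in the same units as the target variable.

`import numpy as np`<br/>
`from sklearn.metrics import mean_squared_error`<br/>
`rmse = np.sqrt(mean_squared_error(y_test, y_pred))` (square root of the MSE; the `squared=False` argument is not available in recent scikit-learn releases)<br/>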
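
To see what these metrics measure, here is a plain-Python sketch of R-squared and RMSE on made-up values. The helper names are hypothetical. The second set of predictions is deliberately bad, to show that R-squared drops below zero when a model does worse than predicting the mean.

```python
import math

def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    """Square root of the mean squared error, in the units of the target."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)

y_true = [3.0, 5.0, 7.0, 9.0]
good_pred = [2.5, 5.5, 7.0, 9.5]  # close to the true values
bad_pred = [9.0, 7.0, 5.0, 3.0]   # the trend is reversed

print(round(r_squared(y_true, good_pred), 4))  # 0.9625
print(round(rmse(y_true, good_pred), 4))       # 0.433
print(round(r_squared(y_true, bad_pred), 4))   # -3.0
```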


#### Cross Validation
Cross-validation is a technique used in machine learning to assess the performance of a predictive model. It helps to estimate how well the model generalizes to new, unseen data without depending on a single train/test split. The steps are:
* `Data Splitting:` Divide the dataset into a training set and a testing set.
* `K-Fold Cross-Validation:` Split the training set into K subsets/folds.
* `Iterative Training and Testing:` Train the model K times, each time using K-1 folds for training and 1 fold for testing.
* `Performance Evaluation:` Calculate the model's performance metric (e.g., mean squared error) for each fold.
* `Average Metric:` Average the performance metric across all folds to determine the model's overall performance.
* `Model Comparison and Tuning:` Compare different models or tune parameters using cross-validation results.
* `Final Model Evaluation:` Select the best model and train it on the entire training set before evaluating it on a separate, unseen test set to estimate real-world performance.
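
The fold bookkeeping in the steps above can be sketched without any library. The generator `k_fold_indices` is a hypothetical helper that splits row indices into K folds without shuffling; in practice `KFold` does this for you and can also shuffle.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for unshuffled K-fold splitting."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
                  for i in range(n_splits)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                    # the held-out fold
        train = indices[:start] + indices[start + size:]      # the other K-1 folds
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, n_splits=3))
for train, test in folds:
    print("train:", train, "test:", test)
# train: [4, 5, 6, 7, 8, 9] test: [0, 1, 2, 3]
# train: [0, 1, 2, 3, 7, 8, 9] test: [4, 5, 6]
# train: [0, 1, 2, 3, 4, 5, 6] test: [7, 8, 9]
```

Every row is used for testing exactly once, which is why the averaged score is less sensitive to any one split.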


`from sklearn.model_selection import cross_val_score, KFold` <br/>
`kf = KFold(n_splits=6, shuffle=True, random_state=42)`<br/>
`reg = LinearRegression()`<br/>
`cv_results = cross_val_score(reg, X, y, cv=kf)`<br/>
`print(cv_results)`<br/>
`print(np.mean(cv_results), np.std(cv_results))`<br/>
`print(np.quantile(cv_results, [0.025, 0.975]))` (calculate 95% confidence interval)<br/>


#### Ridge Regression

Ridge regression is a technique used in linear regression to address the problem of overfitting by adding a penalty term to the ordinary least squares (OLS) method. In traditional linear regression, the goal is to minimize the residual sum of squares (RSS). However, when there is multicollinearity, the estimated coefficients can become too sensitive to the training data, leading to overfitting and high variance. The ridge penalty is proportional to the sum of the squared coefficients, scaled by the hyperparameter `alpha`: larger values of `alpha` shrink the coefficients more strongly.

`from sklearn.linear_model import Ridge` <br/>
`scores = []`<br/>
`for alpha in [0.1, 1.0, 10.0, 100.0, 1000.0]:`<br/>
`    ridge = Ridge(alpha=alpha)`<br/>
`    ridge.fit(X_train, y_train)`<br/>
`    y_pred = ridge.predict(X_test)`<br/>
`    scores.append(ridge.score(X_test, y_test))`<br/>
`print(scores)`<br/>
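
To see how the penalty shrinks coefficients, consider a deliberately simplified case: one feature and no intercept. Minimizing `sum((y - a*x)^2) + alpha * a^2` gives the closed form `a = sum(x*y) / (sum(x^2) + alpha)`. The sketch below evaluates it for a few `alpha` values. The function `ridge_slope` and the data are assumptions for illustration; scikit-learn's `Ridge` additionally fits an intercept by default.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimizing sum((y - a*x)^2) + alpha * a^2 (one feature, no intercept)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + alpha)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly y = 2x, so the unpenalized slope is 2

slopes = {alpha: ridge_slope(xs, ys, alpha) for alpha in [0.0, 1.0, 14.0, 100.0]}
for alpha, slope in slopes.items():
    print(alpha, round(slope, 3))
# 0.0 2.0
# 1.0 1.867
# 14.0 1.0
# 100.0 0.246
```

The slope moves toward zero as `alpha` grows but never reaches it exactly, which is the typical behaviour of the squared penalty.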

#### Lasso Regression
Lasso regression, short for Least Absolute Shrinkage and Selection Operator, is a technique used in linear regression to perform both variable selection and regularization by adding a penalty term to the ordinary least squares (OLS) method. The lasso penalty is proportional to the sum of the absolute values of the coefficients, scaled by `alpha`.

`from sklearn.linear_model import Lasso` <br/>
`scores = []`<br/>
`for alpha in [0.01, 1.0, 10.0, 20.0, 50.0]:`<br/>
`    lasso = Lasso(alpha=alpha)`<br/>
`    lasso.fit(X_train, y_train)`<br/>
`    lasso_pred = lasso.predict(X_test)`<br/>
`    scores.append(lasso.score(X_test, y_test))`<br/>
`print(scores)`<br/><br/>

* It can be used to select important features of a dataset by shrinking the coefficients of less important features to zero

`from sklearn.linear_model import Lasso` <br/>
`X = diabetes_df.drop("glucose", axis=1).values`<br/>
`y = diabetes_df["glucose"].values`<br/>
`names = diabetes_df.drop("glucose", axis=1).columns`<br/>
`lasso = Lasso(alpha=0.1)`<br/>
`lasso_coef = lasso.fit(X, y).coef_`<br/>
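
Why do lasso coefficients land exactly on zero? Under the simplifying assumption of orthonormal features, the lasso solution is the unpenalized coefficient passed through a soft-thresholding function. The sketch below applies that function to made-up coefficients. The feature names, the values and the threshold `0.5` are hypothetical, and how the threshold maps to scikit-learn's `alpha` depends on how the objective is scaled. For general data, the solver works iteratively rather than in one step.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; values within the threshold become exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

# Hypothetical unpenalized coefficients; the feature names are made up for illustration.
ols_coefs = {"bmi": 2.5, "age": 0.3, "insulin": -1.2, "dbp": -0.05}

lasso_like = {name: soft_threshold(coef, 0.5) for name, coef in ols_coefs.items()}
selected = [name for name, coef in lasso_like.items() if coef != 0.0]

print(lasso_like)
print(selected)  # ['bmi', 'insulin']
```

Features whose coefficients survive the threshold are the ones lasso "selects"; with real data, the equivalent check is which entries of `lasso_coef` are non-zero.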


### Logistic Regression
Logistic regression is a type of statistical model used for binary classification tasks in machine learning. It predicts the probability that an observation belongs to a class, based on one or more predictor variables. `If the probability is > 0.5, the data is labeled as 1`; otherwise it is labeled as 0. It creates a linear decision boundary.

![Image description](logistic-regression.png)<br/><br/><br/>

`from sklearn.linear_model import LogisticRegression` <br/>
`logreg = LogisticRegression()`<br/>
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)`<br/>
`logreg.fit(X_train, y_train)`<br/>
`y_pred = logreg.predict(X_test)`<br/>

We can calculate the probability of each instance belonging to the positive class by calling the `predict_proba` method and taking its second column.

`y_pred_probs = logreg.predict_proba(X_test)[:, 1]`<br/>
`print(y_pred_probs[0])`
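
Under the hood, a logistic model computes a linear score from the features and passes it through the sigmoid function to get a probability. The sketch below uses hypothetical coefficients (not fitted to any data) to show that a probability of exactly 0.5 corresponds to a score of 0, which is why the decision boundary is linear.

```python
import math

def sigmoid(z):
    """Map any real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba_one(features, coefs, intercept):
    """Probability of class 1 for one observation under a logistic model."""
    score = intercept + sum(c * x for c, x in zip(coefs, features))
    return sigmoid(score)

# Hypothetical parameters for a two-feature model.
coefs, intercept = [1.5, -2.0], 0.5

p_boundary = predict_proba_one([1.0, 1.0], coefs, intercept)  # score = 0.0
p_positive = predict_proba_one([3.0, 1.0], coefs, intercept)  # score = 3.0
print(p_boundary, round(p_positive, 3))  # 0.5 0.953

# Apply the 0.5 threshold to turn probabilities into labels.
labels = [int(p > 0.5) for p in (p_boundary, p_positive)]
print(labels)  # [0, 1]
```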

##### Difference between Logistic Regression and Linear Regression:

Linear regression is used for predicting continuous values by establishing a linear relationship between the dependent variable and independent variables. In contrast, logistic regression is used for binary classification, aiming to predict the probability of an input belonging to a particular class.

#### ROC Curve
The `Receiver Operating Characteristic (ROC)` curve is a graphical representation used to evaluate the performance of a classification model. It plots the true positive rate against the false positive rate as the decision threshold varies.

`import matplotlib.pyplot as plt` <br/>
`from sklearn.metrics import roc_curve` <br/>
`fpr, tpr, thresholds = roc_curve(y_test, y_pred_probs)`<br/>
`plt.plot([0, 1], [0, 1], 'k--')` (diagonal = chance-level model)<br/>
`plt.plot(fpr, tpr)`<br/>
`plt.show()`<br/>

To quantify the performance of the model, we calculate the `Area Under the Curve (AUC)`. An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect classifier.

`from sklearn.metrics import roc_auc_score` <br/>
`print(roc_auc_score(y_test, y_pred_probs))`
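
For intuition, the sketch below computes one point on the ROC curve and the AUC by hand on four made-up observations. It relies on the fact that AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The helper names are hypothetical; `roc_curve` and `roc_auc_score` remain the practical tools.

```python
def roc_point(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) at one decision threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # made-up predicted probabilities of class 1

print(roc_point(y_true, scores, threshold=0.5))  # (0.0, 0.5)
print(auc_by_ranking(y_true, scores))            # 0.75
```

Sweeping `threshold` from high to low traces the full curve from (0, 0) to (1, 1).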

### Hyperparameter Tuning
In supervised machine learning, `hyperparameters` are external configurations or settings for a machine learning algorithm that are not learned from the data but are set prior to the training process. These hyperparameters control various aspects of the learning process, the model's complexity, and the optimization process. <br/><br/>
`Hyperparameter Tuning`, also known as `Hyperparameter Optimization`, is the process of finding the best set of hyperparameters for a machine learning model to optimize its performance on a specific task.
* We can try lots of different parameter values
* Fit them all separately
* See how well they perform
* Choose the best performing values

<b>Grid Search</b>: 
*   This method involves specifying a set of values or ranges for each hyperparameter, and the search algorithm exhaustively evaluates all possible combinations. It can be computationally expensive but is straightforward.
    * `from sklearn.model_selection import GridSearchCV, KFold` <br/>
    `kf = KFold(n_splits=5, shuffle=True, random_state=42)`<br/>
    `param_grid = {"alpha" : np.linspace(0.0001, 1, 10), "solver":["sag", "lsqr"]}` <br/>
    `ridge = Ridge()`<br/>
    `ridge_cv = GridSearchCV(ridge, param_grid, cv=kf)`<br/>
    `ridge_cv.fit(X_train, y_train)` <br/>
    `print(ridge_cv.best_params_, ridge_cv.best_score_)`

<b>Random Search</b>:
*  Instead of exhaustively searching all combinations, random search samples hyperparameters randomly within specified ranges. It is more computationally efficient than grid search and often yields good results.
    * `from sklearn.model_selection import RandomizedSearchCV, KFold`<br/>
    `kf = KFold(n_splits=5, shuffle=True, random_state=42)`<br/>
    `param_grid = {'alpha': np.linspace(0.0001, 1, 10), 'solver': ['sag', 'lsqr']}`<br/>
    `ridge = Ridge()`<br/>
    `ridge_cv = RandomizedSearchCV(ridge, param_grid, cv=kf, n_iter=2)`<br/>
    `ridge_cv.fit(X_train, y_train)`<br/>
    `print(ridge_cv.best_params_, ridge_cv.best_score_)`<br/>
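
The cost difference between the two strategies comes down to counting candidates. The sketch below enumerates a small grid with the standard library only; the ten `alpha` values are placeholders, and no model is fitted here.

```python
import itertools
import random

alphas = [round(0.1 * i, 1) for i in range(1, 11)]  # ten hypothetical alpha values
solvers = ["sag", "lsqr"]

# Grid search: every combination is a candidate.
grid = list(itertools.product(alphas, solvers))
n_folds = 5
print(len(grid), "candidates ->", len(grid) * n_folds, "model fits")
# 20 candidates -> 100 model fits

# Random search: evaluate only a fixed number of sampled candidates.
rng = random.Random(42)  # seeded so the sample is reproducible
sampled = rng.sample(grid, 2)
print(len(sampled), "candidates ->", len(sampled) * n_folds, "model fits")
# 2 candidates -> 10 model fits
```

Each candidate is fitted once per cross-validation fold, so the number of fits grows multiplicatively with every hyperparameter added to the grid.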

### Handling Missing Data
Checking for missing values: <br/>
`print(music_df.isna().sum().sort_values())`<br/>
Dropping missing data: <br/>
`music_df = music_df.dropna(subset=["genre","popularity"])`<br/>
`print(music_df.isna().sum().sort_values())`<br/>
Imputing values means making an educated guess about what the missing values should be. It is common to use the mean or median for numeric features, and the mode (most frequent value) for categorical features.
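
A minimal sketch of imputation in plain Python, using `None` to mark missing entries. The helper `impute` and the sample values are hypothetical; scikit-learn offers `SimpleImputer` for the same purpose. In a real workflow the fill value would normally be computed from the training set only, so that no information from the test set leaks into the model.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries using the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

popularity = [50.0, None, 70.0, 60.0, None]    # numeric feature
genre = ["rock", "jazz", None, "rock", "pop"]  # categorical feature

print(impute(popularity, "mean"))  # [50.0, 60.0, 70.0, 60.0, 60.0]
print(impute(genre, "mode"))       # ['rock', 'jazz', 'rock', 'rock', 'pop']
```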