<h3>Fine-Tuning</h3>

In [3]:
from IPython.display import Image


Image(url= "./FineTuningImages/1.png", width=400)

In [2]:
Image(url= "./FineTuningImages/2.png", width=400)

<h3>Exercise</h3>
Metrics for classification
In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, as Andy discussed, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

Your job is to train a k-NN classifier on the data and evaluate its performance by generating a confusion matrix and classification report.


<h3>Instructions</h3>
<ul>
    <li>Import classification_report and confusion_matrix from sklearn.metrics.</li>
    <li>Create training and testing sets with 40% of the data used for testing. Use a random state of 42.</li>
    <li>Instantiate a k-NN classifier with 6 neighbors, fit it to the training data, and predict the labels of the test set.</li>
    <li>Compute and print the confusion matrix and classification report using the confusion_matrix() and classification_report() functions.</li>
</ul>
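
The steps above can be sketched as follows. This is a minimal sketch, not the course solution: the diabetes DataFrame is not available here, so a synthetic dataset from `make_classification` stands in for X and y (the sample count and feature count are assumptions chosen only for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Hold out 40% of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Instantiate a k-NN classifier with 6 neighbors, fit it, and predict.
knn = KNeighborsClassifier(n_neighbors=6)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows of the confusion matrix are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```

The numbers printed here describe the synthetic data only, so they will differ from what you see on the diabetes dataset.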

Excellent work! By analyzing the confusion matrix and classification report, you can get a much better understanding of your classifier's performance.

<h2>Logistic regression and the ROC curve</h2>

In [4]:
Image(url= "./FineTuningImages/3.png", width=400)

In [5]:
Image(url= "./FineTuningImages/4.png", width=400)

In [6]:
Image(url= "./FineTuningImages/5.png", width=400)

In [7]:
Image(url= "./FineTuningImages/6.png", width=400)

In [8]:
Image(url= "./FineTuningImages/7.png", width=400)

In [9]:
Image(url= "./FineTuningImages/8.png", width=400)

In [10]:
Image(url= "./FineTuningImages/9.png", width=400)

In [11]:
Image(url= "./FineTuningImages/10.png", width=400)

In [12]:
Image(url= "./FineTuningImages/11.png", width=400)

In [13]:
Image(url= "./FineTuningImages/12.png", width=400)

In [14]:
Image(url= "./FineTuningImages/13.png", width=400)

In [15]:
Image(url= "./FineTuningImages/14.png", width=400)

<h3>Exercise</h3>
<h2>Building a logistic regression model</h2>
Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

The feature and target variable arrays X and y have been pre-loaded, and train_test_split has been imported for you from sklearn.model_selection.

<h3>Instructions</h3>
<ul>
    <li> Import:
        <ul>
            <li>LogisticRegression from sklearn.linear_model.</li>
            <li>confusion_matrix and classification_report from sklearn.metrics.</li>
        </ul>
    </li>
    <li>Create training and test sets with 40% (or 0.4) of the data used for testing. Use a random state of 42. This has been done for you.</li>
    <li>Instantiate a LogisticRegression classifier called logreg.</li>
    <li>Fit the classifier to the training data and predict the labels of the test set.</li>
    <li>Compute and print the confusion matrix and classification report. This has been done for you, so hit submit to see how logistic regression compares to k-NN!</li>
</ul>
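
A rough sketch of these steps is shown below. It assumes a synthetic dataset in place of the pre-loaded X and y, and it sets `max_iter=1000` (an assumption, not part of the exercise) so that the default solver has room to converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Same Instantiate/Fit/Predict pattern as with k-NN.
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Only the estimator line changes compared with the k-NN workflow, which is the point of the shared estimator interface.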

You now know how to use logistic regression for binary classification - great work! Logistic regression is used in a variety of machine learning applications and will become a vital part of your data science toolbox.

<h4>Exercise</h4>
<h3>Plotting an ROC curve</h3>
Great job in the previous exercise - you now have a new addition to your toolbox of classifiers!

Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. As Hugo demonstrated in the video, most classifiers in scikit-learn have a .predict_proba() method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the .predict_proba() method and become familiar with its functionality.

Here, you'll continue working with the PIMA Indians diabetes dataset. The classifier has already been fit to the training data and is available as logreg.

<h3>Instructions</h3>
<ul>
    <li>Import roc_curve from sklearn.metrics.</li>
    <li>Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.</li>
    <li>Use the roc_curve() function with y_test and y_pred_prob and unpack the result into the variables fpr, tpr, and thresholds.</li>
    <li>Plot the ROC curve with fpr on the x-axis and tpr on the y-axis.</li>
</ul>
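
The computation behind the ROC curve might look like the sketch below. It again assumes synthetic data in place of the diabetes arrays, and it leaves out the plotting call so that the block depends only on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# predict_proba returns one column per class; column 1 is P(class == 1).
y_pred_prob = logreg.predict_proba(X_test)[:, 1]

# One (fpr, tpr) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
print(len(thresholds), "thresholds evaluated")
```

To draw the curve, you could pass `fpr` and `tpr` to matplotlib's `plt.plot(fpr, tpr)` and add a dashed diagonal with `plt.plot([0, 1], [0, 1], 'k--')` for reference.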

In [16]:
Image(url= "./FineTuningImages/15.png", width=400)

Excellent! This ROC curve provides a nice visual way to assess your classifier's performance.

<h4>Exercise</h4>
Precision-recall Curve
When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}$$

On the right, a precision-recall curve has been generated for the diabetes dataset. The classification report and confusion matrix are displayed in the IPython Shell.

Study the precision-recall curve and then consider the statements given below. Choose the one statement that is not true. Note that here, the class is positive (1) if the individual has diabetes.

<h4>Instructions</h4>

Possible Answers
<ul>
    <li>A recall of 1 corresponds to a classifier with a low threshold in which all females who contract diabetes were correctly classified as such, at the expense of many misclassifications of those who did not have diabetes.</li>
    <li>Precision is undefined for a classifier which makes no positive predictions, that is, classifies everyone as not having diabetes.
</li>
    <li>When the threshold is very close to 1, precision is also 1, because the classifier is absolutely certain about its predictions.</li>
    <li>Precision and recall take true negatives into consideration.</li>
</ul>

In [17]:
Image(url= "./FineTuningImages/16.png", width=400)

Great work! True negatives do not appear at all in the definitions of precision and recall.
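
To make this concrete, here is a small sketch that computes precision and recall directly from confusion-matrix counts. The helper name `precision_recall` and the counts are hypothetical, chosen only for illustration.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from raw counts.

    A value is None when its denominator is zero, for example precision
    when the classifier makes no positive predictions at all.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    return precision, recall


# Hypothetical counts for illustration only.
tp, fp, fn = 60, 30, 40

# The number of true negatives is deliberately not an argument:
# changing it cannot change either metric.
precision, recall = precision_recall(tp, fp, fn)
print(precision, recall)

# A classifier that never predicts the positive class has undefined precision.
print(precision_recall(tp=0, fp=0, fn=100))
```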

In [18]:
Image(url= "./FineTuningImages/17.png", width=400)

In [19]:
Image(url= "./FineTuningImages/18.png", width=400)

In [20]:
Image(url= "./FineTuningImages/19.png", width=400)

<h4>Exercise</h4>
AUC computation
Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!

In this exercise, you'll calculate AUC scores using the roc_auc_score() function from sklearn.metrics as well as by performing cross-validation on the diabetes dataset.

X and y, along with training and test sets X_train, X_test, y_train, y_test, have been pre-loaded for you, and a logistic regression classifier logreg has been fit to the training data.

<h4>Instructions</h4>
<ul>
    <li>Import roc_auc_score from sklearn.metrics and cross_val_score from sklearn.model_selection.</li>
    <li>Using the logreg classifier, which has been fit to the training data, compute the predicted probabilities of the labels of the test set X_test. Save the result as y_pred_prob.</li>
    <li>Compute the AUC score using the roc_auc_score() function, the test set labels y_test, and the predicted probabilities y_pred_prob.</li>
    <li>Compute the AUC scores by performing 5-fold cross-validation. Use the cross_val_score() function and specify the scoring parameter to be 'roc_auc'.</li>
</ul>
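
A sketch of both AUC computations is given below, assuming synthetic data in place of the pre-loaded diabetes arrays and `max_iter=1000` for the classifier.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

# AUC on the test set needs scores or probabilities, not hard labels.
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print("AUC: {}".format(auc))

# 5-fold cross-validated AUC on the full data; the estimator is refit per fold.
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
```

The cross-validated scores give a sense of how much the AUC varies with the particular split, which a single test-set score cannot show.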

In [22]:
Image(url= "./FineTuningImages/20.png", width=400)

Great work! You now have a number of different methods you can use to evaluate your model's performance.

<h2>Hyperparameter tuning</h2>

In [23]:
Image(url= "./FineTuningImages/21.png", width=400)

In [24]:
Image(url= "./FineTuningImages/22.png", width=400)

In [25]:
Image(url= "./FineTuningImages/23.png", width=400)

In [26]:
Image(url= "./FineTuningImages/24.png", width=400)

<h3>Exercise</h3>

<h2>Hyperparameter tuning with GridSearchCV</h2>
Hugo demonstrated how to tune the n_neighbors parameter of the KNeighborsClassifier() using GridSearchCV on the voting dataset. You will now practice this yourself, but by using logistic regression on the diabetes dataset instead!

Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: C. C controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large C can lead to an overfit model, while a small C can lead to an underfit model.

The hyperparameter space for C has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal C in this hyperparameter space. The feature array is available as X and target variable array is available as y.

You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!

<h3>Instructions</h3>
<ul>
    <li>Import LogisticRegression from sklearn.linear_model and GridSearchCV from sklearn.model_selection.</li>
    <li>Set up the hyperparameter grid by using c_space as the grid of values to tune C over.</li>
    <li>Instantiate a logistic regression classifier called logreg.</li>
    <li>Use GridSearchCV with 5-fold cross-validation to tune C:
        <ul>
            <li>Inside GridSearchCV(), specify the classifier, parameter grid, and number of folds to use.</li>
            <li>Use the .fit() method on the GridSearchCV object to fit it to the data X and y.</li>
        </ul>
    </li>
    <li>Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.</li>
</ul>
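
The grid search described above could be sketched like this. The data is synthetic, and the contents of `c_space` are an assumption (a log-spaced grid), since the exercise only says the space has been set up for you.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Assumption: a log-spaced grid of candidate values for C.
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space}

logreg = LogisticRegression(max_iter=1000)

# Every value in the grid is evaluated with 5-fold cross-validation.
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X, y)

print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
```

The key of `param_grid` must match the estimator's parameter name exactly, here `'C'`.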

Good job! It looks like a 'C' of 3.727 results in the best performance.

<h4>Exercise</h4>
Hyperparameter tuning with RandomizedSearchCV
GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use RandomizedSearchCV, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using RandomizedSearchCV in this exercise and see how this works.

Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have .fit() and .predict() methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as max_features, max_depth, and min_samples_leaf: This makes it an ideal use case for RandomizedSearchCV.

As before, the feature array X and target variable array y of the diabetes dataset have been pre-loaded. The hyperparameter settings have been specified for you. Your goal is to use RandomizedSearchCV to find the optimal hyperparameters. Go for it!

<h4>Instructions</h4>
<ul>
    <li>Import DecisionTreeClassifier from sklearn.tree and RandomizedSearchCV from sklearn.model_selection.</li>
    <li>Specify the parameters and distributions to sample from. This has been done for you.</li>
    <li>Instantiate a DecisionTreeClassifier.</li>
    <li>Use RandomizedSearchCV with 5-fold cross-validation to tune the hyperparameters:
        <ul>
            <li>Inside RandomizedSearchCV(), specify the classifier, parameter distribution, and number of folds to use.</li>
            <li>Use the .fit() method on the RandomizedSearchCV object to fit it to the data X and y.</li>
        </ul>
    </li>
    <li>Print the best parameters and best score obtained from RandomizedSearchCV by accessing the best_params_ and best_score_ attributes of tree_cv.</li>
</ul>
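
A possible sketch of the randomized search follows. The data is synthetic, and the parameter distributions in `param_dist` as well as `n_iter=10` are assumptions, since the exercise only states that the settings have been specified for you.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Assumption: illustrative distributions; lists are sampled uniformly,
# randint(1, 9) draws integers from 1 to 8 inclusive.
param_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 9),
    "min_samples_leaf": randint(1, 9),
    "criterion": ["gini", "entropy"],
}

tree = DecisionTreeClassifier(random_state=0)

# Only n_iter sampled settings are evaluated, each with 5-fold cross-validation.
tree_cv = RandomizedSearchCV(tree, param_dist, n_iter=10, cv=5, random_state=0)
tree_cv.fit(X, y)

print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
```

Raising `n_iter` trades computation time for a more thorough search of the same distributions.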

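Great work! You'll see a lot more of decision trees and RandomizedSearchCV as you continue your machine learning journey. Note that when both search the same set of candidate values, RandomizedSearchCV cannot outperform GridSearchCV, because it evaluates only a sample of the settings that the grid search tries exhaustively. Instead, it is valuable because it saves on computation time.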

<h2>Hold-out set for final evaluation</h2>

In [29]:
Image(url= "./FineTuningImages/25.png", width=400)

Correct! The idea is to tune the model's hyperparameters on the training set, and then evaluate its performance on the hold-out set which it has never seen before.

<h4>Exercise</h4>
Hold-out set in practice I: Classification
You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as X and y.

In addition to C, logistic regression has a 'penalty' hyperparameter which specifies whether to use 'l1' or 'l2' regularization. Your job in this exercise is to create a hold-out set and tune the 'C' and 'penalty' hyperparameters of a logistic regression classifier using GridSearchCV on the training set.

<h4>Instructions</h4>
<ul>
    <li> Create the hyperparameter grid:
        <ul>
            <li>Use the array c_space as the grid of values for 'C'.</li>
            <li>For 'penalty', specify a list consisting of 'l1' and 'l2'.</li>
        </ul>
    </li>
    <li>Instantiate a logistic regression classifier.</li>
    <li>Create training and test sets. Use a test_size of 0.4 and random_state of 42. In practice, the test set here will function as the hold-out set.</li>
    <li>Tune the hyperparameters on the training set using GridSearchCV with 5 folds. This involves first instantiating the GridSearchCV object with the correct parameters and then fitting it to the training data.</li>
    <li>Print the best parameter and best score obtained from GridSearchCV by accessing the best_params_ and best_score_ attributes of logreg_cv.</li>
</ul>
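
One way these steps might look is sketched below. The data is synthetic and the `c_space` grid is an assumption. The sketch also sets `solver='liblinear'`, which is not mentioned in the exercise: it is chosen here because the default lbfgs solver does not support the 'l1' penalty.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic stand-in for the pre-loaded diabetes arrays X and y.
X, y = make_classification(n_samples=768, n_features=8, n_informative=5,
                           n_clusters_per_class=1, random_state=0)

# Assumption: a log-spaced grid of candidate values for C.
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}

# liblinear supports both 'l1' and 'l2' penalties.
logreg = LogisticRegression(solver='liblinear')

# The test split acts as the hold-out set and is not used during tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)

print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))

# Final check on data the search never saw.
holdout_accuracy = logreg_cv.score(X_test, y_test)
print("Hold-out accuracy: {}".format(holdout_accuracy))
```

Note that `best_score_` is the mean cross-validated score on the training set, while the last line reports performance on the hold-out set.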

Excellent work! You're really mastering the fundamentals of classification!

<h4>Exercise</h4>
<h2>Hold-out set in practice II: Regression</h2>
Remember lasso and ridge regression from the previous chapter? Lasso used the L1 penalty to regularize, while ridge used the L2 penalty. There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the L1 and L2 penalties:

$$a \cdot L1 + b \cdot L2$$

In scikit-learn, this term is represented by the 'l1_ratio' parameter: An 'l1_ratio' of 1 corresponds to an L1 penalty, and anything lower is a combination of L1 and L2.

In this exercise, you will use GridSearchCV to tune the 'l1_ratio' of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.

<h4>Instructions</h4>
<ul>
    <li>Import the following modules:
        <ul>
            <li>ElasticNet from sklearn.linear_model.</li>
            <li>mean_squared_error from sklearn.metrics.</li>
            <li>GridSearchCV and train_test_split from sklearn.model_selection.</li>
        </ul>
    </li>
    <li>Create training and test sets, with 40% of the data used for the test set. Use a random state of 42.</li>
    <li>Specify the hyperparameter grid for 'l1_ratio' using l1_space as the grid of values to search over.</li>
    <li>Instantiate the ElasticNet regressor.</li>
    <li>Use GridSearchCV with 5-fold cross-validation to tune 'l1_ratio' on the training data X_train and y_train. This involves first instantiating the GridSearchCV object with the correct parameters and then fitting it to the training data.</li>
    <li>Predict on the test set and compute the $R^2$ and mean squared error.</li>
</ul>
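
The regression workflow can be sketched in the same way. The Gapminder data is not available here, so `make_regression` supplies a synthetic stand-in, and the values in `l1_space` are an assumption (the grid starts above 0 because fitting ElasticNet with an 'l1_ratio' of exactly 0 tends to trigger convergence warnings).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Assumption: synthetic stand-in for the Gapminder feature and target arrays.
X, y = make_regression(n_samples=139, n_features=8, noise=10.0, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42)

# Assumption: illustrative grid of l1_ratio values to search over.
l1_space = np.linspace(0.1, 1.0, 10)
param_grid = {'l1_ratio': l1_space}

elastic_net = ElasticNet()

# Tune l1_ratio with 5-fold cross-validation on the training data only.
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)
gm_cv.fit(X_train, y_train)

# Evaluate the refit best model on the hold-out set.
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)

print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))
```

For a regressor, `score` returns the R squared value, so no separate import is needed for it.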

Fantastic! Now that you understand how to fine-tune your models, it's time to learn about preprocessing techniques and how to piece together all the different stages of the machine learning process into a pipeline!