# Supervised Learning II: Advanced Regressors and Classifiers

Email Similarity
In this project, you will use scikit-learn’s Naive Bayes implementation on several different datasets. By reporting the accuracy of the classifier, we can find which datasets are harder to distinguish. For example, how difficult do you think it is to tell apart emails about hockey and emails about baseball? How hard is it to tell the difference between emails about hockey and emails about tech? In this project, we’ll find out exactly how difficult those two tasks are.

Tasks
Exploring the Data
1.
We’ve imported a dataset of emails from scikit-learn’s datasets. All of these emails are tagged based on their content.

Print emails.target_names to see the different categories.

2.
We’re interested in seeing how effective our Naive Bayes classifier is at telling the difference between a baseball email and a hockey email. We can select the categories of articles we want from fetch_20newsgroups by adding the parameter categories.

In the function call, set categories equal to the list ['rec.sport.baseball', 'rec.sport.hockey']

3.
Let’s take a look at one of these emails.

All of the emails are stored in a list called emails.data. Print the email at index 5 in the list.

4.
All of the labels can be found in the list emails.target. Print the label of the email at index 5.

The labels themselves are numbers, but those numbers correspond to the label names found at emails.target_names.

Is this a baseball email or a hockey email?

Making the Training and Test Sets
5.
We now want to split our data into training and test sets. Change the name of your variable from emails to train_emails. Add these three parameters to the function call:

subset='train'
shuffle = True
random_state = 108
Adding the random_state parameter will make sure that every time you run the code, your dataset is split in the same way.

6.
Create another variable named test_emails and set it equal to fetch_20newsgroups. The parameters of the function should be the same as before except subset should now be 'test'.

Counting Words
7.
We want to transform these emails into lists of word counts. The CountVectorizer class makes this easy for us.

Create a CountVectorizer object and name it counter.

8.
We need to tell counter what possible words can exist in our emails. counter has a .fit() function that takes a list of all your data.

Call .fit() with test_emails.data + train_emails.data as a parameter.

9.
We can now make a list of the counts of our words in our training set.

Create a variable named train_counts. Set it equal to counter’s transform function using train_emails.data as a parameter.

10.
Let’s also make a variable named test_counts. This should be the same function call as before, but use test_emails.data as the parameter of transform.
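
To make the fit/transform distinction concrete, here is a minimal sketch on two made-up sentences. The toy `docs` list is an assumption for illustration and is not part of the project data. `.fit()` learns the vocabulary, and `.transform()` returns one row of word counts per document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up "emails" standing in for emails.data
docs = ["the puck hit the post", "the pitcher threw a strike"]

counter = CountVectorizer()
counter.fit(docs)                  # learn the vocabulary
counts = counter.transform(docs)   # sparse matrix: one row per document

# vocabulary_ maps each word to its column index in the count matrix
vocab = counter.vocabulary_
the_column = vocab["the"]
the_counts = counts[:, the_column].toarray().ravel().tolist()

print(counts.shape)   # (number of documents, number of distinct words)
print(the_counts)     # how often "the" appears in each document
```

With the default settings, CountVectorizer lowercases the text and ignores single-character tokens, so the word "a" does not get its own column here.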

Making a Naive Bayes Classifier
11.
Let’s now make a Naive Bayes classifier that we can train and test on. Create a MultinomialNB object named classifier.

12.
Call classifier’s .fit() function. .fit() takes two parameters. The first should be our training set, which for us is train_counts. The second should be the labels associated with the training emails. Those are found in train_emails.target.

13.
Test the Naive Bayes Classifier by printing classifier’s .score() function. .score() takes the test set and the test labels as parameters.

.score() returns the accuracy of the classifier on the test data. Accuracy measures the percentage of classifications a classifier correctly made.
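
As a rough sketch of what .score() computes for a classifier, accuracy is the number of correct predictions divided by the total number of predictions. The labels below are made up for illustration, and the helper name `accuracy` is hypothetical rather than part of scikit-learn.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Made-up labels: 0 and 1 stand for the two email categories
true_labels = [0, 1, 1, 0, 1]
predicted = [0, 1, 0, 0, 1]

acc = accuracy(true_labels, predicted)
print(acc)  # 4 of the 5 predictions are correct
```

Note that the value is a fraction between 0 and 1, so multiply by 100 if you want a percentage.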

Testing Other Datasets
14.
Our classifier does a pretty good job distinguishing between baseball emails and hockey emails. But let’s see how it does with emails about really different topics.

Find where you create train_emails and test_emails. Change the categories to be ['comp.sys.ibm.pc.hardware','rec.sport.hockey'].

Did your classifier do a better or worse job on these two datasets?

15.
Play around with different sets of data. Can you find a set that’s incredibly accurate or incredibly inaccurate?

The possible categories are listed below.

'alt.atheism'
'comp.graphics'
'comp.os.ms-windows.misc'
'comp.sys.ibm.pc.hardware'
'comp.sys.mac.hardware'
'comp.windows.x'
'misc.forsale'
'rec.autos'
'rec.motorcycles'
'rec.sport.baseball'
'rec.sport.hockey'
'sci.crypt'
'sci.electronics'
'sci.med'
'sci.space'
'soc.religion.christian'
'talk.politics.guns'
'talk.politics.mideast'
'talk.politics.misc'
'talk.religion.misc'

In [None]:
# Import necessary libraries
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Step 1: Define the categories you want to classify
categories = ['rec.sport.baseball', 'rec.sport.hockey']  # You can change these to test other pairs

# Step 2: Load training data with specified categories
train_emails = fetch_20newsgroups(subset='train',
                                   categories=categories,
                                   shuffle=True,
                                   random_state=108)

# Step 3: Load test data with the same categories
test_emails = fetch_20newsgroups(subset='test',
                                  categories=categories,
                                  shuffle=True,
                                  random_state=108)

# Step 4: Create a CountVectorizer to convert text to word counts
counter = CountVectorizer()

# Step 5: Fit the vectorizer on the combined training and test data
counter.fit(train_emails.data + test_emails.data)

# Step 6: Transform the training data into word count vectors
train_counts = counter.transform(train_emails.data)

# Step 7: Transform the test data into word count vectors
test_counts = counter.transform(test_emails.data)

# Step 8: Create a Multinomial Naive Bayes classifier
classifier = MultinomialNB()

# Step 9: Train the classifier using the training data and labels
classifier.fit(train_counts, train_emails.target)

# Step 10: Evaluate the classifier's accuracy on the test data
accuracy = classifier.score(test_counts, test_emails.target)
print("Classifier Accuracy:", accuracy)

# Optional: Print a sample email and its label
print("\nSample Email:\n", train_emails.data[5])  # View the content of one email
print("Label:", train_emails.target[5])           # View its numeric label
print("Category:", train_emails.target_names[train_emails.target[5]])  # Convert label to category name


Predict Baseball Strike Zones With Machine Learning
Support Vector Machines are powerful machine learning models that can make complex decision boundaries. An SVM’s decision boundary can twist and curve to accommodate the training data.

In this project, we will use an SVM trained using a baseball dataset to find the decision boundary of the strike zone.

[Image: A batter standing in front of the plate with the strike zone outlined.]
The strike zone can be thought of as a decision boundary that determines whether or not a pitch is a strike or a ball. There is a strict definition of the strike zone — in practice, however, it will vary depending on the umpire or the player at bat.

Let’s use our knowledge of SVMs to find the real strike zone of several baseball players.

Tasks
Create the labels
1.
We’ve imported several DataFrames related to some of baseball’s biggest stars. We have data on Aaron Judge and Jose Altuve. Judge is one of the tallest players in the league and Altuve is one of the shortest. Their strike zones should be pretty different!

Each row in these DataFrames corresponds to a single pitch that the batter saw in the 2017 season. To begin, let’s take a look at all of the features of a pitch. Print aaron_judge.columns.

In this project, we’ll ask you to print out a lot of information. To avoid clutter, feel free to delete the print statements once you understand the data.

We used the pybaseball Python package to get the data for this project. If you’re interested in getting more data, the documentation for pybaseball can help you download the data that you’re interested in onto your own computer.

2.
Some of these features have obscure names. Let’s learn what the feature description means.

Print aaron_judge.description.unique() to see the different values the description feature could have.

3.
We’re interested in looking at whether a pitch was a ball or a strike. That information is stored in the type feature. Look at the unique values stored in the type feature to get a sense of how balls and strikes are recorded.

4.
Great! We know every row’s type feature is either an 'S' for a strike, a 'B' for a ball, or an 'X' for neither (for example, an 'X' could be a hit or an out).

We’ll want to use this feature as the label of our data points. However, instead of using strings, it will be easier if we change every 'S' to a 1 and every 'B' to a 0.

You can change the values of a DataFrame column using the map() function. For example, in the code below, every 'A' in example_column is changed to a 1, and every 'B' is changed to a 2.

df['example_column'] = df['example_column'].map({'A':1, 'B':2})

5.
Let’s make sure that worked. Print the type column from the aaron_judge DataFrame.

Plotting the pitches
6.
There were some NaNs in there. We’ll take care of those in a second. For now, let’s look at the other features we’re interested in.

We want to predict whether a pitch is a ball or a strike based on its location over the plate. You can find the ball’s location in the columns plate_x and plate_z.

Print aaron_judge['plate_x'] to see what that column looks like.

plate_x measures how far left or right the pitch is from the center of home plate. If plate_x = 0, that means the pitch was directly in the middle of the home plate.

7.
We now have the three columns we want to work with: 'plate_x', 'plate_z', and 'type'.

Let’s remove every row that has a NaN in any of those columns.

You can do this by calling the dropna function. This function can take a parameter named subset which should be a list of the columns you’re interested in.

For example, the following code drops all of the NaN values from the columns 'A', 'B', and 'C'.

data_frame = data_frame.dropna(subset = ['A', 'B', 'C'])
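
The following sketch ties the mapping and the cleanup together on a tiny made-up DataFrame (the values are invented, not real pitch data). Any value that is missing from the dictionary passed to map(), such as 'X', becomes NaN, and dropna() then removes rows with a NaN in the listed columns.

```python
import pandas as pd

# Made-up pitches standing in for a player DataFrame
pitches = pd.DataFrame({
    'type': ['S', 'B', 'X', 'S'],
    'plate_x': [0.1, -1.2, 0.4, None],
    'plate_z': [2.5, 1.0, 2.2, 3.0],
})

# 'X' is not a key in the dictionary, so it is mapped to NaN
pitches['type'] = pitches['type'].map({'S': 1, 'B': 0})
n_unmapped = int(pitches['type'].isna().sum())

# Keep only rows with values in all three columns of interest
clean = pitches.dropna(subset=['plate_x', 'plate_z', 'type'])

print(n_unmapped)
print(clean)
```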

8.
We now have points to plot using Matplotlib. Call plt.scatter() using five parameters:

The parameter x should be the plate_x column.
The parameter y should be the plate_z column.
To color the points correctly, the parameter c should be the type column.
To make the strikes red and the balls blue, set the cmap parameter to plt.cm.coolwarm.
To make the points slightly transparent, set the alpha parameter to 0.25.
Call plt.show() to see your graph.

plate_z measures how high off the ground the pitch was. If plate_z = 0, that means the pitch was at ground level when it got to the home plate.

Building the SVM
9.
Now that we’ve seen the location of every pitch, let’s create an SVM to create a decision boundary. This decision boundary will be the real strike zone for that player. For this section, make sure to write all of your code below the call to the scatter function but above the show function.

To begin, we want to validate our model, so we need to split the data into a training set and a validation set.

Call the train_test_split function using aaron_judge as a parameter.

Set the parameter random_state equal to 1 to ensure your data is split in the same way as our solution code.

This function returns two objects. Store the return values in variables named training_set and validation_set.
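
As a small illustration of the split, the sketch below passes a single made-up DataFrame to train_test_split and gets two DataFrames back. The eight rows are invented, and the 75/25 proportion is simply scikit-learn's default when no test_size is given.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for aaron_judge: eight made-up pitches
pitches = pd.DataFrame({
    'plate_x': [-1.0, -0.5, 0.0, 0.5, 1.0, 1.5, -1.5, 0.2],
    'plate_z': [1.0, 2.0, 2.5, 3.0, 4.0, 0.5, 2.2, 2.8],
    'type':    [0, 1, 1, 1, 0, 0, 0, 1],
})

# Passing one DataFrame returns two DataFrames with the same columns
training_set, validation_set = train_test_split(pitches, random_state=1)

print(len(training_set), len(validation_set))
```

Because the labels stay in the same DataFrame as the features, you select the feature columns and the type column separately when fitting and scoring.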

10.
Next, create an SVC named classifier with kernel = 'rbf'. For right now, don’t worry about setting the C or gamma parameters.

11.
Call classifier’s .fit() method. This method should take two parameters:

The training data. This is the plate_x column and the plate_z column in training_set.
The labels. This is the type column in training_set.
The code below shows an example of selecting two columns from a DataFrame:

two_columns = data_frame[['A', 'B']]

12.
To visualize the SVM, call the draw_boundary function. This is a function that we wrote ourselves - you won’t find it in scikit-learn.

This function takes two parameters:

The axes of your graph. For us, this is the ax variable that we defined at the top of your code.
The trained SVM. For us, this is classifier. Make sure you’ve called .fit() before trying to visualize the decision boundary.
Run your code to see the predicted strike zone!

Note that the decision boundary will be drawn based on the size of the current axes. So if you call draw_boundary before calling the scatter function, you will only see the boundary as a small square.

To get around this, you could manually set the size of the axes by using something like ax.set_ylim(-2, 2) before calling draw_boundary.

Optimizing the SVM
13.
Nice work! We’re now able to see the strike zone. But we don’t know how accurate our classifier is yet. Let’s find its accuracy by calling the .score() method and printing the results.

.score() takes two parameters — the points in the validation set and the labels associated with those points.

These two parameters should be very similar to the parameters used in .fit().

14.
Let’s change some of the SVM’s parameters to see if we can get better accuracy.

Set the parameters of the SVM to be gamma = 100 and C = 100.

This will overfit the data, but it will be a good place to start. Run the code to see the overfitted decision boundary. What’s the new accuracy?

15.
Try to find a configuration of gamma and C that greatly improves the accuracy. You may want to use nested for loops.

Loop through different values of gamma and C and print the accuracy using those parameters. Our best SVM had an accuracy of 83.41%. Can you beat ours?

Explore Other Players
16.
Finally, let’s see how different players’ strike zones change. Aaron Judge is one of the tallest players in MLB, and Jose Altuve is one of the shortest. Instead of using the aaron_judge variable, use jose_altuve.

To make this easier, you might want to consider putting all of your code inside a function and using the dataset as a parameter.

We’ve also imported david_ortiz.

Note that the range of the axes will change for these players. To really compare the strike zones, you may want to force the axes to be the same.

Try putting ax.set_ylim(-2, 6) and ax.set_xlim(-3, 3) right before calling plt.show()
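
One possible way to wrap the workflow in a function is sketched below. The function name `strike_zone_accuracy` and the synthetic `fake_player` DataFrame are assumptions made for illustration. The sketch also assumes the type column has already been mapped to 0 and 1, and it leaves out the plotting and draw_boundary steps.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def strike_zone_accuracy(player, gamma='scale', C=1.0):
    """Fit an RBF SVM on pitch location and return validation accuracy."""
    player = player.dropna(subset=['plate_x', 'plate_z', 'type'])
    training_set, validation_set = train_test_split(player, random_state=1)
    classifier = SVC(kernel='rbf', gamma=gamma, C=C)
    classifier.fit(training_set[['plate_x', 'plate_z']], training_set['type'])
    return classifier.score(validation_set[['plate_x', 'plate_z']],
                            validation_set['type'])


# Synthetic stand-in for a player DataFrame (NOT real pitch data):
# a pitch counts as a strike when it lands inside a fixed rectangle.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=600)
z = rng.uniform(0, 5, size=600)
fake_player = pd.DataFrame({
    'plate_x': x,
    'plate_z': z,
    'type': ((np.abs(x) < 1.0) & (z > 1.0) & (z < 4.0)).astype(int),
})

acc = strike_zone_accuracy(fake_player, gamma=1, C=10)
print("Validation accuracy:", acc)
```

With a function like this, comparing players comes down to calling it once per DataFrame, for example with jose_altuve and david_ortiz.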

17.
See if you can make an SVM that is more accurate by using more features. Perhaps the location of the ball isn’t the only important feature!

You can see the columns available to you by printing aaron_judge.columns.

For example, try adding the strikes column to your SVM — the number of strikes the batter already has might have an impact on whether the next pitch is a strike or a ball.

Note that our draw_boundary function won’t work if you have more than two features. If you add more features, make sure to comment that out!

Try to make the best SVM possible and share your results with us!

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Optional: If draw_boundary is provided by the project, import or define it
from svm_visualization import draw_boundary  # This is assumed to be provided

# Load player data (already imported in the project environment)
# Example: aaron_judge, jose_altuve, david_ortiz

# Step 1: Inspect the columns of the dataset
print(aaron_judge.columns)

# Step 2: Check pitch descriptions
print(aaron_judge['description'].unique())

# Step 3: Check how pitch types are labeled
print(aaron_judge['type'].unique())

# Step 4: Map pitch types to binary labels (1 for strike, 0 for ball)
aaron_judge['type'] = aaron_judge['type'].map({'S': 1, 'B': 0})

# Step 5: Confirm mapping worked
print(aaron_judge['type'])

# Step 6: Inspect plate_x (horizontal location of pitch)
print(aaron_judge['plate_x'])

# Step 7: Drop rows with missing values in key columns
aaron_judge = aaron_judge.dropna(subset=['plate_x', 'plate_z', 'type'])

# Step 8: Plot pitch locations with color-coded labels
fig, ax = plt.subplots()
plt.scatter(x=aaron_judge['plate_x'],
            y=aaron_judge['plate_z'],
            c=aaron_judge['type'],
            cmap=plt.cm.coolwarm,
            alpha=0.25)

# Step 9: Split data into training and validation sets
training_set, validation_set = train_test_split(aaron_judge, random_state=1)

# Step 10: Create an SVM classifier with RBF kernel
classifier = SVC(kernel='rbf', gamma=100, C=100)  # Try tuning gamma and C later

# Step 11: Train the classifier using plate_x and plate_z as features
classifier.fit(training_set[['plate_x', 'plate_z']], training_set['type'])

# Step 12: Draw the decision boundary (strike zone)
draw_boundary(ax, classifier)

# Optional: Set consistent axes for comparison across players
ax.set_xlim(-3, 3)
ax.set_ylim(-2, 6)

# Step 13: Show the plot
plt.show()

# Step 14: Evaluate accuracy on validation set
accuracy = classifier.score(validation_set[['plate_x', 'plate_z']], validation_set['type'])
print("Validation Accuracy:", accuracy)

# Step 15: Tune gamma and C to find best accuracy
best_accuracy = 0
for gamma in [0.1, 1, 10, 100, 1000]:
    for C in [0.1, 1, 10, 100, 1000]:
        model = SVC(kernel='rbf', gamma=gamma, C=C)
        model.fit(training_set[['plate_x', 'plate_z']], training_set['type'])
        acc = model.score(validation_set[['plate_x', 'plate_z']], validation_set['type'])
        print(f"Gamma: {gamma}, C: {C}, Accuracy: {acc}")
        if acc > best_accuracy:
            best_accuracy = acc
print("Best Accuracy Found:", best_accuracy)

# Step 16: Try with other players (e.g., Jose Altuve)
# Repeat the same steps using jose_altuve or david_ortiz instead of aaron_judge

# Step 17: Try adding more features (e.g., strikes)
# Note: draw_boundary only works with 2D features, so comment it out if using more
# classifier.fit(training_set[['plate_x', 'plate_z', 'strikes']], training_set['type'])
# accuracy = classifier.score(validation_set[['plate_x', 'plate_z', 'strikes']], validation_set['type'])
# print("Accuracy with additional feature:", accuracy)


# Regularization and Hyperparameter Tuning

Predict Wine Quality with Regularization
The data you’re going to be working with is from the Wine Quality Dataset in the UCI Machine Learning Repository. We’re looking at the red wine data in particular and while the original dataset has a 1-10 rating for each wine, we’ve made it a classification problem with a wine quality of good (>5 rating) or bad (<=5 rating). The goals of this project are to:

implement different logistic regression classifiers
find the best ridge-regularized classifier using hyperparameter tuning
implement a tuned lasso-regularized feature selection method
What we’re working with:

11 input variables (based on physicochemical tests): ‘fixed acidity’, ‘volatile acidity’, ‘citric acid’, ‘residual sugar’, ‘chlorides’, ‘free sulfur dioxide’, ‘total sulfur dioxide’, ‘density’, ‘pH’, ‘sulphates’, and ‘alcohol’.
An output variable, ‘quality’ (0 for bad and 1 for good)
Tasks
Logistic Regression Classifier without Regularization
1.
Before we begin modeling, let’s scale our data using StandardScaler(). Use StandardScaler().fit() to fit the scaler to the variable features, and then use transform() to obtain X, the transformed input to our model.
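
As a quick illustration of what the scaler does, the sketch below standardizes a small made-up array. In the project, features is a DataFrame, but the calls are the same. After the transform, each column has a mean of roughly 0 and a standard deviation of roughly 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: two columns on very different scales
features = np.array([[1.0, 100.0],
                     [2.0, 300.0],
                     [3.0, 500.0]])

scaler = StandardScaler().fit(features)   # learn each column's mean and std
X = scaler.transform(features)            # subtract the mean, divide by the std

print(X.mean(axis=0))  # each column is now centered at (approximately) 0
print(X.std(axis=0))   # and has (approximately) unit standard deviation
```

Scaling matters for regularized models because the penalty treats all coefficients alike, so features measured on larger scales would otherwise be penalized differently.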

2.
Perform an 80:20 train-test split on the data. Set the random_state to 99 for reproducibility.

3.
Define a classifier, clf_no_reg, a logistic regression model without regularization and fit it to the training data.

4.
We’re now going to plot the coefficients obtained from fitting the Logistic Regression model. Copy-paste the following code to get the ordered coefficients as a bar plot:

predictors = features.columns
coefficients = clf_no_reg.coef_.ravel()
coef = pd.Series(coefficients,predictors).sort_values()
coef.plot(kind='bar', title = 'Coefficients (no regularization)')
plt.tight_layout()
plt.show()
plt.clf()

5.
You’re now ready to evaluate this classifier! In the case of linear regression, we evaluated our models using mean-squared-error. For classifiers, it is important that the classifier not only has high accuracy, but also high precision and recall, i.e., a low false positive and false negative rate.

A metric known as the f1 score, which is the harmonic mean of precision and recall, captures the performance of a classifier holistically. It takes values between 0 and 1, and the closer it is to 1, the better the classifier. Use f1_score() to calculate the f1 score for the training and test data.
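
To see how the f1 score combines precision and recall, here is a minimal hand computation from made-up counts of true positives, false positives, and false negatives. The helper name `f1_from_counts` is hypothetical; in the project you would call scikit-learn's f1_score() on labels and predictions instead.

```python
def f1_from_counts(tp, fp, fn):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 8 true positives, 2 false positives, 4 false negatives
f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(round(f1, 4))
```

Because it is a harmonic mean, the f1 score is pulled toward the lower of the two values, so a classifier cannot score well by maximizing only precision or only recall.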

Logistic Regression with L2 Regularization
6.
We’ve seen in the previous article that the default implementation of logistic regression in scikit-learn is ridge-regularized! Use the default implementation to implement a classifier clf_default that is L2-regularized.

7.
Obtain the training and test f1_score for the ridge-regularized classifier using code similar to what we have in Task 5. Notice if either score goes up or down.

8.
The scores remain the same! Does this mean that regularization did nothing? Indeed! This means that the constraint boundary for the regularization we performed is large enough to hold the original loss function minimum, thus rendering our model the same as the unregularized one.

How can we tune up the regularization? Recall that C is the inverse of the regularization strength (alpha), meaning that smaller values of C correspond to more regularization. The scikit-learn default for C is 1; therefore, in order to increase the amount of regularization, we need to consider values of C that are less than 1. But how far do we need to go? Let’s try a coarse-grained search before performing a fine-grained one.

Define an array, C_array that takes the values C_array = [0.0001, 0.001, 0.01, 0.1, 1]. Get an array each for the training and test scores corresponding to these values of C.
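
The effect of C can be checked directly on coefficient sizes. The sketch below uses synthetic data from make_classification (an assumption for illustration, not the wine data) and compares the overall size of the fitted coefficients for a small and a large value of C.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=0)

coef_norms = {}
for C in [0.001, 1]:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_demo, y_demo)
    # Euclidean norm of the coefficient vector for this value of C
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.4f}")
```

The smaller value of C applies a stronger L2 penalty, so its coefficients are shrunk closer to zero.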

9.
Use the following plotting code to plot the training and test scores as a function of C. Does this clarify the range of C’s we need to be doing a fine-grained search for?

plt.plot(C_array,training_array)
plt.plot(C_array,test_array)
plt.xscale('log')
plt.show()
plt.clf()

Hyperparameter Tuning for L2 Regularization
10.
We’re now ready to perform hyperparameter tuning using GridSearchCV! Looking at the plot, the optimal C seems to be somewhere around 0.001 so a search window between 0.0001 and 0.01 is not a bad idea here.

Let’s first get set up with the right inputs for this. Use np.logspace() to obtain 100 values between 10^(-4) and 10^(-2) and define a dictionary of C values named tuning_C that can function as an input to GridSearchCV’s parameter grid.

11.
Define a grid search model on the parameter grid defined above for a logistic regression model with ridge regularization. Set the scoring metric to ‘f1’ and the number of folds to 5. Fit this to the training data.

12.
Obtain the best C value from this search and the score corresponding to it using the best_params_ and best_score_ attributes respectively.

13.
The score you got above reflects the mean f1-score on the 5 folds corresponding to the best classifier. Notice however that we haven’t yet used the test data, X_test, y_test from our original train-test split! This was done with good reason: the original test data can now be used as our validation dataset to validate whether our “best classifier” is doing as well as we’d like it to on essentially unknown data.

Define a new classifier clf_best_ridge that corresponds to the best C value you obtained in the previous task. Fit it to the training data and obtain the f1_score on the test data to validate the model.

Feature Selection using L1 Regularization
14.
We’re now going to use a grid search cross-validation method to regularize the classifier, but with L1 regularization instead. Instead of using GridSearchCV, we’re going to use LogisticRegressionCV. The syntax here is a little different. The arguments to LogisticRegressionCV that are relevant to us:

Cs : A list/array of C values to check; choose values between 0.01 and 100 here.
cv : Number of folds (5 is a good choice here!)
penalty : Remember to choose 'l1' for this!
solver : Recall that L1 penalty requires that we specify the solver to be ‘liblinear’.
scoring : 'f1' is still a great choice for a classifier.
Using the above, define a cross-validated classifier, clf_l1 and fit (X,y) here. (Note that we’re not doing a train-test-validation split like last time!)

15.
The classifier has the attribute C_, which stores the optimal C value. The attribute coef_ gives us the coefficients of the best lasso-regularized classifier. Print both of these.

16.
We can now reproduce the coefficient plot we’d produced for the unregularized scenario. Use the following lines of code to plot the sorted values of the coefficients as a bar plot:

coefficients = clf_l1.coef_.ravel()
coef = pd.Series(coefficients,predictors).sort_values()

plt.figure(figsize = (12,8))
coef.plot(kind='bar', title = 'Coefficients for tuned L1')
plt.tight_layout()
plt.show()
plt.clf()

17.
Notice how our L1 classifier has set one of the coefficients to zero! We’ve effectively eliminated one feature, density, from the model, thus using Lasso regularization as a feature selection method here.
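
The sparsity effect is easy to reproduce on synthetic data. The sketch below is an illustration under assumed settings (data from make_classification and a deliberately small C of 0.02), not the wine dataset. It fits an L1 and an L2 classifier with the same C and counts how many coefficients are exactly zero in each.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data: 2 informative features and 8 pure-noise features
X_demo, y_demo = make_classification(n_samples=200, n_features=10,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)

# Same C and solver, different penalty
l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.02)
l2 = LogisticRegression(penalty='l2', solver='liblinear', C=0.02)
l1.fit(X_demo, y_demo)
l2.fit(X_demo, y_demo)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print("Zero coefficients with L1:", n_zero_l1)
print("Zero coefficients with L2:", n_zero_l2)
```

L2 shrinks coefficients toward zero without usually reaching it, while L1 can set them to exactly zero, which is why lasso doubles as a feature selection method.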

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import f1_score

# Load the wine dataset (assumed to be preloaded as 'wine')
# wine = pd.read_csv('winequality-red.csv')  # Uncomment if loading manually

# Step 1: Scale the input features
features = wine.drop('quality', axis=1)
labels = wine['quality']
scaler = StandardScaler()
X = scaler.fit_transform(features)

# Step 2: Perform 80:20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2, random_state=99)

# Step 3: Logistic Regression without regularization
# penalty=None requires scikit-learn 1.2+; older versions expect the string 'none'
clf_no_reg = LogisticRegression(penalty=None, max_iter=1000)
clf_no_reg.fit(X_train, y_train)

# Step 4: Plot coefficients of unregularized model
predictors = features.columns
coefficients = clf_no_reg.coef_.ravel()
coef = pd.Series(coefficients, predictors).sort_values()
coef.plot(kind='bar', title='Coefficients (no regularization)')
plt.tight_layout()
plt.show()
plt.clf()

# Step 5: Evaluate f1 score for unregularized model
print("F1 Score (Train):", f1_score(y_train, clf_no_reg.predict(X_train)))
print("F1 Score (Test):", f1_score(y_test, clf_no_reg.predict(X_test)))

# Step 6: Logistic Regression with default L2 regularization
clf_default = LogisticRegression(max_iter=1000)
clf_default.fit(X_train, y_train)

# Step 7: Evaluate f1 score for default regularized model
print("F1 Score (Train, L2):", f1_score(y_train, clf_default.predict(X_train)))
print("F1 Score (Test, L2):", f1_score(y_test, clf_default.predict(X_test)))

# Step 8: Coarse-grained search over C values
C_array = [0.0001, 0.001, 0.01, 0.1, 1]
training_array = []
test_array = []

for C in C_array:
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(X_train, y_train)
    training_array.append(f1_score(y_train, clf.predict(X_train)))
    test_array.append(f1_score(y_test, clf.predict(X_test)))

# Step 9: Plot training and test scores vs C
plt.plot(C_array, training_array, label='Train F1')
plt.plot(C_array, test_array, label='Test F1')
plt.xscale('log')
plt.legend()
plt.show()
plt.clf()

# Step 10: Define the parameter grid for a fine-grained search
C_values = np.logspace(-4, -2, 100)
tuning_C = {'C': C_values}

# Step 11: Run a 5-fold grid search scored on f1
grid = GridSearchCV(LogisticRegression(max_iter=1000), tuning_C, scoring='f1', cv=5)
grid.fit(X_train, y_train)

# Step 12: Get best C and score
print("Best C:", grid.best_params_['C'])
print("Best Cross-Validated F1 Score:", grid.best_score_)

# Step 13: Validate best model on test data
clf_best_ridge = LogisticRegression(C=grid.best_params_['C'], max_iter=1000)
clf_best_ridge.fit(X_train, y_train)
print("F1 Score (Test, Best Ridge):", f1_score(y_test, clf_best_ridge.predict(X_test)))

# Step 14: L1 regularization with LogisticRegressionCV
clf_l1 = LogisticRegressionCV(Cs=np.linspace(0.01, 100, 100),
                              cv=5,
                              penalty='l1',
                              solver='liblinear',
                              scoring='f1',
                              max_iter=1000)
clf_l1.fit(X, labels)

# Step 15: Print optimal C and coefficients
print("Best C (L1):", clf_l1.C_[0])
print("Coefficients (L1):", clf_l1.coef_)

# Step 16: Plot L1 coefficients
coefficients = clf_l1.coef_.ravel()
coef = pd.Series(coefficients, predictors).sort_values()
plt.figure(figsize=(12, 8))
coef.plot(kind='bar', title='Coefficients for tuned L1')
plt.tight_layout()
plt.show()
plt.clf()

# Step 17: Observe feature elimination (e.g., density set to zero)
print("Features eliminated:", coef[coef == 0].index.tolist())


Classify Raisins with Hyperparameter Tuning!
In this project, you’ll use the different techniques you have learned in this unit to classify different types of raisins. The dataset has been posted on Kaggle by Murat Koklu, a researcher who has studied different raisin grain types using machine learning methods.

There are two raisin grain types in this dataset, Kecimen and Besni, and seven numerical predictor variables associated with each of the 900 samples in the data. You’re going to use this dataset to implement the two hyperparameter tuning methods we’ve covered in this module thus far:

Grid Search method to tune a Decision Tree Classifier
Random Search method to tune a Logistic Regression Classifier
You’ll be using a Jupyter notebook to implement the project. At any point if you’re away from the screen for too long, the Jupyter kernel might reset — so be sure to press Save on top of the notebook before taking a break!

Tasks
Explore the Dataset
1.
The dataset and some of the libraries you’ll use have been loaded on the setup cell. Run the setup cell to get started!

2.
Create the predictor and target variables and label them X and y respectively.

3.
Examine the dataset by printing the

total number of features
total number of samples
samples belonging to class “1”
4.
Split the data into train and test data with a random_state of 19 if you want to match the solution code (you’re welcome to use your preferred random_state too!). Label the training data X_train and y_train and the test data X_test and y_test.

Grid Search with Decision Tree Classifier
5.
A decision tree classifier works well for a balanced binary classification problem. Initialize a decision tree classifier named tree.

6.
The DecisionTreeClassifier() implementation in scikit-learn has many parameters.

Create a dictionary parameters to set up grid search to explore three values each for the following 2 hyperparameters:

'max_depth': The maximum tree depth; explore the values 3,5 and 7 for this.
'min_samples_split': The minimum number of samples to split at each node; explore the values 2,3 and 4 for this.
7.
Create a grid search classifier grid with tree and parameters as inputs. Fit the grid search classifier to the training data.

8.
Use the .best_estimator_ attribute to see what hyperparameters grid chose. Print the result. Print the best score and the score on the test data to examine the performance of the best estimator.

9.
Use .cv_results_['mean_test_score'] to get the score for each hyperparameter combination. Get the corresponding hyperparameters with .cv_results_['params'].

Convert the two arrays to DataFrames, concatenate them using pd.concat and print it to view the score for each hyperparameter combination.

Random Search with Logistic Regression
10.
Define a logistic regression model, lr, with solver set to 'liblinear' and max_iter = 1000.

11.
To perform random search we need to specify the parameters and the distributions to draw from. Define a dictionary distributions with the keys

'penalty': corresponding to the type of regularization to apply. Choose a discrete distribution with ‘l1’ and ‘l2’.
'C': corresponding to the inverse of the regularization strength (smaller values mean stronger regularization). Choose a uniform distribution here between 0 and 100.
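As a quick illustration of what "a uniform distribution between 0 and 100" looks like in code, the sketch below draws a few values from scipy.stats.uniform. In SciPy, uniform(loc, scale) covers the interval from loc to loc + scale. The names c_dist and samples are made up for this example.

```python
from scipy.stats import uniform

# uniform(loc=0, scale=100) spans the interval [0, 100]
c_dist = uniform(loc=0, scale=100)

# Draw eight candidate values, one per random-search iteration
samples = c_dist.rvs(size=8, random_state=19)
print(samples.round(2))
print(c_dist.ppf(0), c_dist.ppf(1))  # lower and upper ends of the range
```

A random search draws values like these for C instead of stepping through a fixed list. Note that LogisticRegression needs C to be positive; a draw of exactly 0 is extremely unlikely, but starting the range slightly above 0 would rule it out.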
12.
Create a model named clf to perform random search with the logistic regression model you’ve defined, over the distribution space specified by distributions and for eight random draws. Fit the model to the training data.

13.
Print the best estimator and score from the random search you’ve performed. Print a table summarizing the results using .cv_results_ similar to the way you did for grid search!

14.
Congratulations, you’ve completed the hyperparameter tuning project! Some points to ponder:

Examine your results to see which model performs best overall. Are there other models and hyperparameters you can think of to experiment with this dataset? (K-nearest neighbors or Support Vector Machines or gradient boosted trees might be other models to consider!)
Would you make different choices with the models you’ve used? Fire up your own Jupyter notebook to explore more models and alternate parameter grids/distributions! :)

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from scipy.stats import uniform

# Step 1: Load the dataset (assumed to be preloaded as 'raisins')
# raisins = pd.read_csv('Raisin_Dataset.csv')  # Uncomment if loading manually

# Step 2: Create predictor and target variables
X = raisins.drop('Class', axis=1)
y = raisins['Class'].map({'Kecimen': 0, 'Besni': 1})  # Convert string labels to binary; skip the map if Class is already 0/1

# Step 3: Examine dataset structure
print("Total number of features:", X.shape[1])
print("Total number of samples:", X.shape[0])
print("Samples belonging to class '1':", sum(y == 1))

# Step 4: Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)

# Step 5: Initialize Decision Tree Classifier
tree = DecisionTreeClassifier()

# Step 6: Define grid search parameters
parameters = {
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 3, 4]
}

# Step 7: Perform Grid Search
grid = GridSearchCV(tree, parameters, cv=5)
grid.fit(X_train, y_train)

# Step 8: Print best estimator and scores
print("Best Decision Tree Estimator:", grid.best_estimator_)
print("Best Cross-Validated Score:", grid.best_score_)
print("Test Accuracy:", accuracy_score(y_test, grid.predict(X_test)))

# Step 9: Summarize grid search results
grid_results = pd.DataFrame(grid.cv_results_['params'])
grid_scores = pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['Mean Test Score'])
grid_summary = pd.concat([grid_results, grid_scores], axis=1)
print("\nGrid Search Results:\n", grid_summary)

# Step 10: Define Logistic Regression model
lr = LogisticRegression(solver='liblinear', max_iter=1000)

# Step 11: Define random search distributions
distributions = {
    'penalty': ['l1', 'l2'],
    'C': uniform(loc=0, scale=100)
}

# Step 12: Perform Randomized Search
clf = RandomizedSearchCV(lr, distributions, n_iter=8, cv=5, random_state=19)
clf.fit(X_train, y_train)

# Step 13: Print best estimator and scores
print("\nBest Logistic Regression Estimator:", clf.best_estimator_)
print("Best Cross-Validated Score:", clf.best_score_)
print("Test Accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Summarize random search results
random_results = pd.DataFrame(clf.cv_results_['params'])
random_scores = pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['Mean Test Score'])
random_summary = pd.concat([random_results, random_scores], axis=1)
print("\nRandom Search Results:\n", random_summary)


In [None]:
# official solution 

# 1. Setup
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

raisins = pd.read_csv('Raisin_Dataset.csv')
raisins.head()

# 2. Create predictor and target variables, X and y

# Define X and y
X = raisins.drop(columns=['Class'])  # Predictor variables
y = raisins['Class']  # Target variable

# Display the first few rows
print(X.head())
print(y.head())

# 3. Examine the dataset
# Total number of features (excluding the target variable)
num_features = raisins.drop(columns=['Class']).shape[1]

# Total number of samples
num_samples = raisins.shape[0]

# Samples belonging to class "1"
num_class_1_samples = raisins[raisins['Class'] == 1].shape[0]

# Print the results
print(f"Total number of features: {num_features}")
print(f"Total number of samples: {num_samples}")
print(f"Samples belonging to class '1': {num_class_1_samples}")

# 4. Split the data set into training and testing sets
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=19)

# Display the sizes of each set
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

# Grid Search with Decision Tree Classifier

# 5. Create a Decision Tree model
tree = DecisionTreeClassifier()

# 6. Dictionary of parameters for GridSearchCV
parameters = {'min_samples_split': [2,3,4], 'max_depth': [3,5,7]}

# 7. Create a GridSearchCV model
grid = GridSearchCV(tree, parameters)

# Fit the GridSearchCV model to the training data
grid.fit(X_train, y_train)

# 8. Print the model and hyperparameters obtained by GridSearchCV
print(grid.best_estimator_)

# Print best score
print(grid.best_score_)
# Print the accuracy of the final model on the test data
print(grid.score(X_test, y_test))

# 9. Print a table summarizing the results of GridSearchCV
df = pd.concat([pd.DataFrame(grid.cv_results_['params']), pd.DataFrame(grid.cv_results_['mean_test_score'], columns=['Score'])], axis=1)
print(df)

# Random Search with Logistic Regression

# 10. The logistic regression model
lr = LogisticRegression(solver = 'liblinear', max_iter = 1000)

# 11. Define distributions to choose hyperparameters from
from scipy.stats import uniform
distributions = {'penalty': ['l1', 'l2'], 'C': uniform(loc=0, scale=100)}

# 12. Create a RandomizedSearchCV model
clf = RandomizedSearchCV(lr, distributions, n_iter=8)

# Fit the random search model
clf.fit(X_train, y_train)

# 13. Print best estimator and best score
print(clf.best_estimator_)
print(clf.best_score_)

# Print a table summarizing the results of RandomizedSearchCV
df = pd.concat([pd.DataFrame(clf.cv_results_['params']), pd.DataFrame(clf.cv_results_['mean_test_score'], columns=['Accuracy'])], axis=1)
print(df.sort_values('Accuracy', ascending=False))

# Ensemble Methods in Machine Learning

Random Forests Project
In this project, we will be using a dataset containing census information from UCI’s Machine Learning Repository.

By using this census data with a random forest, we will try to predict whether or not a person makes more than $50,000.

Let’s get started!

Datasets
The original data set is available at the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/census+income
The dataset has been loaded for you in script.py and saved as a dataframe named df. Some of the input and output features of interest are:

age: continuous
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
sex: Female, Male
capital-gain: continuous
capital-loss: continuous
hours-per-week: continuous
native-country: discrete
income: discrete, >50K, <=50K
Tasks
Investigate the data
1.
We will build a random forest classifier to predict the income category. First, take a look at the distribution of income values – what percentage of samples have incomes less than 50k and greater than 50k?

2.
There’s a small problem with our data that is a little hard to catch — every string has an extra space at the start. For example, the first row’s native-country is “ United-States”, but we want it to be “United-States”. One way to fix this is to select all columns of type object and use the string method .str.strip().

3.
Create a features dataframe X. This should include only features in the list feature_cols and convert categorical features to dummy variables using pd.get_dummies(). Include the parameter drop_first=True to eliminate redundant features.
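Here is a minimal sketch of what drop_first=True changes, using a tiny made-up dataframe (toy is not the census data).

```python
import pandas as pd

# Tiny made-up frame with one numeric and one categorical column
toy = pd.DataFrame({
    'age': [25, 40, 31],
    'sex': ['Male', 'Female', 'Male'],
})

full = pd.get_dummies(toy)
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))     # ['age', 'sex_Female', 'sex_Male']
print(list(reduced.columns))  # ['age', 'sex_Male']
```

With two categories, a single sex_Male column already says everything: a row that is not Male must be Female.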

4.
Create the output variable y, which is binary. It should be 0 when income is less than 50k and 1 when it is greater than 50k.

5.
Split the data into a train and test set with a test size of 20%.

Build and Tune Random Forest Classifiers by Depth
6.
Instantiate an instance of a RandomForestClassifier() (with default parameters). Fit the model on the train data and print the score (accuracy) on the test data. This will act as a baseline to compare other model performances.

7.
We will explore tuning the random forest classifier model by testing the performance over a range of max_depth values. Fit a random forest classifier for max_depth values from 1-25. Save the accuracy score for the train and test sets in the lists accuracy_train, accuracy_test.

8.
Find the largest accuracy on the test data and the depth at which it occurs.

9.
Plot the training and test accuracy of the models versus the max_depth.

10.
Refit the random forest model using the max_depth from above; save the feature importances in a dataframe. Sort the results and print the top five features.

Create Additional Features and Re-Tune
11.
Looking at the education feature, there are 16 unique values, from preschool to professional school. Rather than adding dummy variables for each value, it makes sense to bin some of these values together. While there are many ways to do this, we will take the approach of combining the values into 3 groups: High school and less, College to Bachelors, and Masters and more. Create a new column in df for this new feature called education_bin.
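One way to do the binning is pd.cut on the numeric education-num column, which is what the solution code below does. The sketch here runs the same bin edges on a handful of made-up values, assuming the usual coding of this dataset where 9 is HS-grad, 13 is Bachelors and 14 is Masters. Bins from pd.cut include their right edge by default, so 9 falls in the first group and 13 in the second.

```python
import pandas as pd

# A few made-up education-num values around the bin edges
education_num = pd.Series([1, 9, 10, 13, 14, 16])

# Bins are (0, 9], (9, 13], (13, 16]: the right edge is included by default
bins = pd.cut(
    education_num,
    [0, 9, 13, 16],
    labels=['HS or less', 'College to Bachelors', 'Masters or more'],
)
print(pd.DataFrame({'education-num': education_num, 'education_bin': bins}))
```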

12.
Like we did previously, we will now add this new feature into our feature list and recreate X.

13.
As we did before, we will tune the random forest classifier model by testing the performance over a range of max_depth values. Fit a random forest classifier for max_depth values from 1-25. Save the accuracy score for the train and test sets in the lists accuracy_train, accuracy_test.

14.
Find the largest accuracy on the test data and the depth at which it occurs. Compare the results with those from the previously tuned model.

15.
Plot the training and test accuracy of the models versus the max_depth. Compare the results with those from the previously tuned model.

16.
Refit the random forest model using the max_depth from above; save the feature importances in a dataframe. Sort the results and print the top five features. Compare the results with those from the previously tuned model.

17.
Nice work! Note that the accuracy of our final model increased and one of our added features is now in the top 5 based on importance!

There are a few different ways to extend this project:

Are there other features that may lead to even better performance? Consider creating new ones or adding features that are not part of the original feature list.
Consider tuning hyperparameters based on a different evaluation metric. Our classes are fairly imbalanced, so AUC or F1 may lead to a different result.
Tune more parameters of the model. You can find a description of all the parameters you can tune in the Random Forest Classifier documentation. For example, see what happens if you tune max_features or n_estimators.
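To make the point about imbalanced classes concrete, here is a small sketch with made-up labels (75 negatives, 25 positives, not the census data). A "model" that always predicts the majority class looks fine on accuracy but not on F1 or AUC.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Made-up imbalanced labels: 75 negatives, 25 positives
y_true = np.array([0] * 75 + [1] * 25)

# A "model" that always predicts the majority class
y_majority = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, y_majority)
f1 = f1_score(y_true, y_majority, zero_division=0)
auc = roc_auc_score(y_true, np.zeros(100))  # constant scores carry no ranking information

print(f'accuracy: {acc}, f1: {f1}, auc: {auc}')
```

Accuracy comes out at 0.75 even though no positive case is ever found, while F1 is 0 and AUC is 0.5. When tuning with GridSearchCV, the scoring argument (for example scoring='f1' or scoring='roc_auc') is one way to optimize for these metrics instead.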

In [None]:
import pandas as pd
import numpy as np
import codecademylib3
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, RandomForestRegressor
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
col_names = ['age', 'workclass', 'fnlwgt','education', 'education-num', 
'marital-status', 'occupation', 'relationship', 'race', 'sex',
'capital-gain','capital-loss', 'hours-per-week','native-country', 'income']
df = pd.read_csv('adult.data', header=None, names = col_names)

#Distribution of income
print(df.income.value_counts(normalize=True))

#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()
    

feature_cols = ['age',
       'capital-gain', 'capital-loss', 'hours-per-week', 'sex','race']
#Create feature dataframe X with feature columns and dummy variables for categorical features
X = pd.get_dummies(df[feature_cols], drop_first=True)
#Create output variable y which is binary, 0 when income is less than 50k, 1 when it is greater than 50k
y = np.where(df.income=='<=50K', 0, 1)

#Split data into a train and test set
x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=.2)

#Instantiate random forest classifier, fit and score with default parameters
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
rf.score(x_test, y_test)
print(f'Accuracy score for default random forest: {round(rf.score(x_test, y_test)*100,3)}%')

#Tune the hyperparameter max_depth over a range from 1-25, save scores for test and train set
np.random.seed(0)
accuracy_train=[]
accuracy_test = []
depths = range(1,26)
for i in depths:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    accuracy_test.append(accuracy_score(y_test, rf.predict(x_test)))
    accuracy_train.append(accuracy_score(y_train, rf.predict(x_train)))
    
#Find the best accuracy and at what depth that occurs
best_acc= np.max(accuracy_test)
best_depth = depths[np.argmax(accuracy_test)]
print(f'The highest accuracy on the test is achieved when depth: {best_depth}')
print(f'The highest accuracy on the test set is: {round(best_acc*100,3)}%')

#Plot the accuracy scores for the test and train set over the range of depth values  
plt.plot(depths, accuracy_test,'bo--',depths, accuracy_train,'r*:')
plt.legend(['test accuracy', 'train accuracy'])
plt.xlabel('max depth')
plt.ylabel('accuracy')
plt.show()

#Save the best random forest model and save the feature importances in a dataframe
best_rf = RandomForestClassifier(max_depth=best_depth)
best_rf.fit(x_train, y_train)
feature_imp_df = pd.DataFrame(zip(x_train.columns, best_rf.feature_importances_),  columns=['feature', 'importance'])
print('Top 5 random forest features:')
print(feature_imp_df.sort_values('importance', ascending=False).iloc[0:5])


#Create a new feature, education_bin, by binning education-num into three groups
df['education_bin'] = pd.cut(df['education-num'], [0,9,13,16], labels=['HS or less', 'College to Bachelors', 'Masters or more'])

feature_cols = ['age',
       'capital-gain', 'capital-loss', 'hours-per-week', 'sex', 'race','education_bin']
#Use the new feature and recreate X and the test/train split
X = pd.get_dummies(df[feature_cols], drop_first=True)

x_train, x_test, y_train, y_test = train_test_split(X,y, random_state=1, test_size=.2)

#Find the best max depth now with the additional feature
np.random.seed(0)
accuracy_train=[]
accuracy_test = []
depths = range(1,26)
for i in depths:
    rf = RandomForestClassifier(max_depth=i)
    rf.fit(x_train, y_train)
    y_pred = rf.predict(x_test)
    accuracy_test.append(accuracy_score(y_test, rf.predict(x_test)))
    accuracy_train.append(accuracy_score(y_train, rf.predict(x_train)))
    
best_acc= np.max(accuracy_test)
best_depth = depths[np.argmax(accuracy_test)]
print(f'The highest accuracy on the test is achieved when depth: {best_depth}')
print(f'The highest accuracy on the test set is: {round(best_acc*100,3)}%')

plt.figure(2)
plt.plot(depths, accuracy_test,'bo--',depths, accuracy_train,'r*:')
plt.legend(['test accuracy', 'train accuracy'])
plt.xlabel('max depth')
plt.ylabel('accuracy')
plt.show()

#Save the best model and print the top five features with the new feature set
best_rf = RandomForestClassifier(max_depth=best_depth)
best_rf.fit(x_train, y_train)
feature_imp_df = pd.DataFrame(zip(x_train.columns, best_rf.feature_importances_),  columns=['feature', 'importance'])
print('Top 5 random forest features:')
print(feature_imp_df.sort_values('importance', ascending=False).iloc[0:5])


Boosting
In this project, we will be using a dataset containing census information from UCI’s Machine Learning Repository.

By using this census data with boosting algorithms, we will try to predict whether or not a person makes more than $50,000.

Let’s get started!

Datasets
The original data set is available at the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/census+income

Tasks
Explore and prepare the data
1.
Take a look at the distribution of the target column, income. What percentage of samples have incomes greater than 50k and less than or equal to 50k?

2.
We have identified a set of features to explore. The features are stored in a variable raw_feature_cols. Take a look at the datatypes of these columns. Are they what you expected based on the data dictionary provided in the description?

Preparing the features
3.
Create a features dataframe X with the features listed in the stored variable raw_feature_cols. Since the columns workclass, sex, and race are all low cardinality categorical variables, we will convert them to dummy variables using pd.get_dummies().

Set the parameter drop_first = True in pd.get_dummies(). This drops the first categorical instance in each of the categorical variable columns because it is redundant. Make sure you understand why we do not have to worry about dropping the redundant variable before moving on.

Note: pd.get_dummies() is clever enough that it will only create dummy variables for the categorical columns. It will not create dummy variables for the int64 columns.

Take a look at the first 5 rows of the features dataframe by using the .head(n=5) method.
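If it is not obvious why the dropped dummy column is redundant, the sketch below may help. It uses a tiny made-up dataframe (toy is not the census data) and rebuilds the dropped column from the ones that remain. It also shows that the integer column passes through unchanged.

```python
import pandas as pd

# Tiny made-up frame: one int64 column and one categorical column
toy = pd.DataFrame({
    'hours-per-week': [40, 20, 50, 38],
    'race': ['White', 'Black', 'Other', 'White'],
})

full = pd.get_dummies(toy).astype(int)
reduced = pd.get_dummies(toy, drop_first=True).astype(int)

# drop_first removes race_Black (the first category in sorted order).
# It equals 1 exactly when every remaining race dummy is 0.
dummy_cols = [c for c in reduced.columns if c.startswith('race_')]
reconstructed = 1 - reduced[dummy_cols].sum(axis=1)

print(list(reduced.columns))
print((reconstructed == full['race_Black']).all())
```

Since the dropped column can always be recovered this way, removing it loses no information.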

4.
Convert the target variable to a binary value and store it in a variable y. Set it to 0 when income <= 50K and 1 when income > 50K.

Build and Train the AdaBoost and Gradient Boosted Trees Classifiers
5.
Perform a train-test split. Create the base estimator for the AdaBoost classifier in the form of a decision stump using DecisionTreeClassifier() and store it in a variable named decision_stump.
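A decision stump is simply a decision tree limited to a single split. As a small sketch on synthetic data (X_demo and y_demo come from make_classification and are not the census data), setting max_depth=1 produces a tree with one level and two leaves.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# max_depth=1 allows exactly one split: a decision stump
decision_stump = DecisionTreeClassifier(max_depth=1)
decision_stump.fit(X_demo, y_demo)

print(decision_stump.get_depth())     # number of levels below the root
print(decision_stump.get_n_leaves())  # number of leaf nodes
```

On its own a stump is a weak learner; AdaBoost combines many of them, each one focusing on the samples the previous ones got wrong.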

6.
Create an instance of AdaBoostClassifier() and store it in a variable ada_classifier. Keep most of the parameters set as their default value, except the base_estimator parameter which should be set to decision_stump.

7.
Create an instance of GradientBoostingClassifier() and store it in a variable grad_classifier. Keep all the parameters set as their default value.

8.
Fit each of the instantiated models on the training data. Calculate and store the predictions on the test data in separate variables y_pred_ada and y_pred_grad. Print the accuracy and f1 score for the predictions from each model.

Explore Hyperparameters
9.
For AdaBoost the default n_estimators is 50 and for Gradient Boosting it is 100. We’ve created a new list n_estimators_list = [10, 30, 50, 70, 90] to search from and determine the value of this parameter that gives us the most performant model. Use GridSearchCV with AdaBoost to fit the data and search this parameter space.

10.
Calculate the mean_test_score for each of these fits and store it as a list, ada_scores_list. Plot it against n_estimators_list to pick the best value for n_estimators.
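Besides reading the best value off a plot, it can be picked programmatically. The sketch below runs a small grid search on synthetic data (not the census data; the shorter n_estimators_list and cv=3 are chosen only to keep it quick) and takes the entry with the highest mean_test_score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

n_estimators_list = [10, 30, 50]
search = GridSearchCV(
    AdaBoostClassifier(random_state=0),
    {'n_estimators': n_estimators_list},
    cv=3,
    scoring='accuracy',
)
search.fit(X_demo, y_demo)

# One mean cross-validated score per candidate, in the same order as the list
scores = search.cv_results_['mean_test_score']
best_n = n_estimators_list[int(np.argmax(scores))]
print(best_n, search.best_params_)
```

The value found with np.argmax should agree with search.best_params_, which GridSearchCV computes for you.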

In [None]:
import pandas as pd
import numpy as np
import codecademylib3

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

path_to_data = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

col_names = [
    'age', 'workclass', 'fnlwgt','education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain','capital-loss',
    'hours-per-week','native-country', 'income'
]

df = pd.read_csv(path_to_data, header=None, names = col_names)
print(df.head())

#Clean columns by stripping extra whitespace for columns of type "object"
for c in df.select_dtypes(include=['object']).columns:
    df[c] = df[c].str.strip()

target_column = "income"
raw_feature_cols = [
    'age',
    'education-num',
    'workclass',
    'hours-per-week',
    'sex',
    'race'
]

##1. Percentage of samples with income < and > 50k
print(df[target_column].value_counts(normalize=True))

##2. Data types of features
print(df[raw_feature_cols].dtypes)

##3. Preparing the features

X = pd.get_dummies(df[raw_feature_cols], drop_first=True)

print(X.head(n=5))


##4. Convert target variable to binary
y = np.where(df[target_column] == '<=50K', 0, 1)


##5a. Create train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)
##5b. Create base estimator and store it as decision_stump
decision_stump = DecisionTreeClassifier(max_depth=1)

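##6. Create AdaBoost Classifier
# Note: scikit-learn 1.2 renamed base_estimator to estimator (base_estimator was removed in 1.4);
# on newer versions, pass estimator=decision_stump instead
ada_classifier = AdaBoostClassifier(base_estimator=decision_stump)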

##7. Create GradientBoost Classifier
grad_classifier = GradientBoostingClassifier()

##8a.Fit models and get predictions
ada_classifier.fit(X_train, y_train)
y_pred_ada = ada_classifier.predict(X_test)

grad_classifier.fit(X_train, y_train)
y_pred_grad = grad_classifier.predict(X_test)

##8b. Print accuracy and F1
print(f"AdaBoost accuracy: {accuracy_score(y_test, y_pred_ada)}")
print(f"AdaBoost f1-score: {f1_score(y_test, y_pred_ada)}")

print(f"Gradient Boost accuracy: {accuracy_score(y_test, y_pred_grad)}")
print(f"Gradient Boost f1-score: {f1_score(y_test, y_pred_grad)}")

##9. Hyperparameter Tuning
n_estimators_list = [10, 30, 50, 70, 90]

from sklearn.model_selection import GridSearchCV
estimator_parameters = {'n_estimators': n_estimators_list}
ada_gridsearch = GridSearchCV(ada_classifier, estimator_parameters, cv=5, scoring='accuracy', verbose=True)
ada_gridsearch.fit(X_train, y_train)

##10. Plot mean test scores
ada_scores_list = ada_gridsearch.cv_results_['mean_test_score']
plt.scatter(n_estimators_list, ada_scores_list)
plt.show()
plt.clf()

