
<h1><center>Marketing Analytics: Predicting Customer Churn in Python</center></h1>


<h4>About this Project</h4>
Churn is when a customer stops doing business or ends a relationship with a company. It’s a common problem across a variety of industries, from telecommunications to cable TV to SaaS, and a company that can predict churn can take proactive action to retain valuable customers and get ahead of the competition. This course will provide you with a roadmap to create your own customer churn models. You’ll learn how to explore and visualize your data, prepare it for modeling, make predictions using machine learning, and communicate important, actionable insights to stakeholders. By the end of the course, you’ll be comfortable using the pandas library for data analysis and the scikit-learn library for machine learning.


<h1>Table of contents</h1>

<div class="alert alert-block alert-info" style="margin-top: 20px">
    <ol>
        <li><a href="#exploratory_data_analysis">Exploratory Data Analysis</a></li>
        <li><a href="#preprocessing_for_churn_modeling">Preprocessing for Churn Modeling</a></li>
        <li><a href="#churn_prediction">Churn Prediction</a></li>
        <li><a href="#model_tuning">Model Tuning</a></li>
    </ol>
</div>
<br>
<hr>

<h2 id="exploratory_data_analysis">Exploratory Data Analysis</h2>
Customer churn takes many forms. One example is a customer cancelling a service that is under contract.
The dataset contains 483 churners and 2850 non-churners.


<h3>Grouping and summarizing data</h3>


<h4>Summary statistics for both classes</h4>
With pandas, df.groupby('x').std() groups a DataFrame df by a column 'x' and then computes the standard deviation across the remaining columns of df for each value of 'x'. The .groupby() method is incredibly useful when you want to investigate specific columns of your dataset. Here, you're going to explore the 'Churn' column further to see if there are differences between churners and non-churners, focusing on the 'CustServ_Calls' and 'Vmail_Message' columns of the telco DataFrame.
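
The group-then-summarize pattern can be sketched on a small toy DataFrame. The column names x and y and the values below are hypothetical and only illustrate the mechanics; they are not taken from the telco data.

```python
import pandas as pd

# Hypothetical toy data (not the telco dataset)
df = pd.DataFrame({
    "x": ["a", "a", "b", "b"],
    "y": [1.0, 3.0, 10.0, 14.0],
})

# One row per distinct value of 'x', one column per remaining numeric column
group_means = df.groupby("x").mean()
group_stds = df.groupby("x").std()  # sample standard deviation (ddof=1)

print(group_means)
print(group_stds)
```

Each result is indexed by the distinct values of the grouping column, which makes it easy to compare the groups side by side.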


In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [None]:
# Load the data as telco, the name used throughout the rest of the notebook
telco = pd.read_csv('TelcoChurnDataset.csv')


In [None]:
telco.describe()

In [None]:
# Group telco by 'Churn' and compute the mean of the columns of interest
print(telco.groupby('Churn')[['CustServ_Calls', 'Vmail_Message']].mean())

In [None]:
# Adapt your code to compute the standard deviation
print(telco.groupby('Churn')[['CustServ_Calls', 'Vmail_Message']].std())

Churners make more customer service calls than non-churners.

<h3>Churn by State</h3>
When dealing with customer data, geographic regions may play an important part in determining whether a customer will cancel their service or not. You may have noticed that there is a 'State' column in the dataset. In this exercise, you'll group 'State' and 'Churn' to count the number of churners and non-churners by state. For example, if you wanted to group by x and aggregate by y, you could use .groupby() as follows:
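
A minimal sketch of that group-and-count pattern on a hypothetical toy DataFrame. The column names x and y and the values are illustrative assumptions, not rows from the telco data.

```python
import pandas as pd

# Hypothetical toy data: 'x' plays the role of a region, 'y' of a yes/no label
df = pd.DataFrame({
    "x": ["CA", "CA", "CA", "NY", "NY"],
    "y": ["no", "no", "yes", "no", "yes"],
})

# Group by 'x', then count how often each value of 'y' occurs within each group
counts = df.groupby("x")["y"].value_counts()
print(counts)
```

The result is a Series with a two-level index (group, value), so a single count can be looked up with counts.loc[("CA", "no")].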

In [None]:
# Count the number of churners and non-churners by State
print(telco.groupby('State')['Churn'].value_counts())

California (CA) has 25 non-churners and 9 churners.


<h3>Exploring your data using visualizations</h3>
The 'Account_Length' feature was normally distributed. Let's now visualize the distributions of the following features using seaborn's distribution plot:
<ul>
<li>'Day_Mins'</li>
<li>'Eve_Mins'</li>
<li>'Night_Mins'</li>
<li>'Intl_Mins'</li>
</ul>
To create a feature's distribution plot, pass it in as an argument to sns.distplot(). Note that distplot() has been deprecated since seaborn 0.11; sns.histplot() with kde=True is the current equivalent. The Telco dataset is available to you as a DataFrame called telco.


Visualize the distribution of 'Day_Mins'.


In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of 'Day_Mins'
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(telco['Day_Mins'], kde=True)

# Display the plot
plt.show()

Update your code to visualize the distribution of 'Eve_Mins'.

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of 'Eve_Mins'
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(telco['Eve_Mins'], kde=True)

# Display the plot
plt.show()

Update your code to visualize the distribution of 'Night_Mins'.


In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of 'Night_Mins'
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(telco['Night_Mins'], kde=True)

# Display the plot
plt.show()

Update your code to visualize the distribution of 'Intl_Mins'.



In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Visualize the distribution of 'Intl_Mins'
# (histplot with kde=True replaces the deprecated distplot)
sns.histplot(telco['Intl_Mins'], kde=True)

# Display the plot
plt.show()

<h4>Customer service calls and churn</h4>
You've already seen that there's not much of a difference in account lengths between churners and non-churners, but that there is a difference in the number of customer service calls made by churners.

Let's now visualize this difference using a box plot and incorporate other features of interest - do customers who have international plans make more customer service calls? Or do they tend to churn more? How about voicemail plans? Let's find out!

Recall the syntax for creating a box plot using seaborn:

sns.boxplot(x = "X-axis variable",
            y = "Y-axis variable",
            data = DataFrame)
If you want to hide outliers in the plot, you can specify the additional parameter sym="" (matplotlib's showfliers=False has the same effect), and you can add a third variable using hue.


Create a box plot with 'Churn' on the x-axis and 'CustServ_Calls' on the y-axis.

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create the box plot
sns.boxplot(x = 'Churn',
            y = 'CustServ_Calls',
            data = telco)

# Display the plot
plt.show()

There is a very noticeable difference here between churners and non-churners! Now, remove the outliers from the box plot.


In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Create the box plot
sns.boxplot(x = 'Churn',
            y = 'CustServ_Calls',
            data = telco,
            sym = "")

# Display the plot
plt.show()

Add a third variable to this plot - 'Vmail_Plan' - to visualize whether or not having a voice mail plan affects the number of customer service calls or churn.

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add "Vmail_Plan" as a third variable
sns.boxplot(x = 'Churn',
            y = 'CustServ_Calls',
            data = telco,
            sym = "",
            hue = 'Vmail_Plan')

# Display the plot
plt.show()

Not much of a difference there. Update your code so that the third variable is 'Intl_Plan' instead.

In [None]:
# Import matplotlib and seaborn
import matplotlib.pyplot as plt
import seaborn as sns

# Add "Intl_Plan" as a third variable
sns.boxplot(x = 'Churn',
            y = 'CustServ_Calls',
            data = telco,
            sym = "",
            hue = "Intl_Plan")

# Display the plot
plt.show()


<h2 id="preprocessing_for_churn_modeling">Preprocessing for Churn Modeling</h2>

<h3>Data preparation</h3>


<h2 id="reading_data">Reading the data in</h2>

<h4>Identifying features to convert</h4>
It is preferable to have features like 'Churn' encoded as 0 and 1 instead of no and yes, so that you can feed them into machine learning algorithms that only accept numeric values.

Besides 'Churn', other features of type object can be converted into 0s and 1s.
Inspecting the data types of telco (for example with telco.dtypes) shows that the columns of type object are:
Churn, Intl_Plan, Vmail_Plan, State.
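
A minimal sketch of how non-numeric columns can be identified, assuming a small hypothetical DataFrame named toy that mimics a few telco columns (the values are made up). Text columns typically show up with dtype object, although the exact dtype name can vary between pandas versions, so the check below tests for "not numeric" instead.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical stand-in for a few telco columns (values are illustrative)
toy = pd.DataFrame({
    "Churn": ["no", "yes", "no"],
    "Intl_Plan": ["no", "no", "yes"],
    "Day_Mins": [265.1, 161.6, 243.4],
    "CustServ_Calls": [1, 1, 0],
})

# Show the dtype of every column
print(toy.dtypes)

# Collect the columns that still need to be converted to numbers
non_numeric_cols = [col for col in toy.columns if not is_numeric_dtype(toy[col])]
print(non_numeric_cols)
```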

<h4>Encoding binary features</h4>
Recasting data types is an important part of data preprocessing. In this exercise you will encode 'yes' as 1 and 'no' as 0 in the 'Vmail_Plan' and 'Churn' features.

You saw two approaches to doing this in the video - one using pandas, and the other using scikit-learn. For straightforward tasks like this, sticking with pandas is recommended, so that's what we'll do in this exercise. If you're trying to build machine learning pipelines, on the other hand - which is beyond the scope of this course - you can explore using LabelEncoder(). When doing data science, it's important to be aware that there is always more than one way to accomplish a task, and you need to pick the one that is most effective for your application.
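
For comparison, here is a minimal sketch of the scikit-learn approach using LabelEncoder. The list plan is a hypothetical stand-in for a yes/no column. LabelEncoder assigns integer codes in sorted order of the distinct values, so 'no' becomes 0 and 'yes' becomes 1.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical yes/no values standing in for a column such as 'Vmail_Plan'
plan = ["no", "yes", "no", "yes", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(plan)

# classes_ lists the original values; the position of each value is its integer code
print(list(encoder.classes_))
print(encoded.tolist())
```

Because the codes follow alphabetical order rather than an order you choose, the pandas .replace() approach is more explicit when the mapping itself matters.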

In [None]:
# Replace 'no' with 0 and 'yes' with 1 in 'Vmail_Plan'
telco['Vmail_Plan'] = telco['Vmail_Plan'].replace({'no':0, 'yes': 1})

# Replace 'no' with 0 and 'yes' with 1 in 'Churn'
telco['Churn'] = telco['Churn'].replace({'no':0, 'yes': 1})

# Print the results to verify
print(telco['Vmail_Plan'].head())
print(telco['Churn'].head())

<h4>One hot encoding</h4>
The 'State' feature can be encoded numerically using the technique of one hot encoding, which creates one binary indicator column per state.

[Figure: one hot encoding of the 'State' feature]

Doing this manually would be quite tedious, especially when you have 50 states and over 3000 customers! Fortunately, pandas has a get_dummies() function which automatically applies one hot encoding over the selected feature.

In [None]:
# Import pandas
import pandas as pd

# Perform one hot encoding on 'State'
telco_state = pd.get_dummies(telco['State'])

In [None]:
# Print the head of telco_state
print(telco_state.head())

<h4>Feature scaling</h4>
Recall from the video the different scales of the 'Intl_Calls' and 'Night_Mins' features.

[Figure: feature scaling]

Re-scale them using StandardScaler.

The telco DataFrame has been subset to only include the features you want to rescale: 'Intl_Calls' and 'Night_Mins'. To apply StandardScaler, you need to first instantiate it using StandardScaler(), and then apply the fit_transform() method, passing in the DataFrame you want to rescale. You can do this in one line of code:

In [None]:
# Import StandardScaler
from sklearn.preprocessing import StandardScaler

# Scale only the two features of interest
cols = ["Intl_Calls", "Night_Mins"]
telco_scaled = StandardScaler().fit_transform(telco[cols])

# Add column names back for readability
telco_scaled_df = pd.DataFrame(telco_scaled, columns=cols)

# Print summary statistics
print(telco_scaled_df.describe())
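
To see what StandardScaler does under the hood, the sketch below compares its output with a manual computation of (value - column mean) / column standard deviation. The array values is hypothetical and only stands in for two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for two features on different scales
values = np.array([
    [3.0, 244.7],
    [3.0, 254.4],
    [5.0, 162.6],
    [7.0, 196.9],
])

# Scale with scikit-learn
scaled = StandardScaler().fit_transform(values)

# Scale by hand: subtract the column mean, divide by the population standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1 (up to floating-point error), so neither feature dominates simply because of its units.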

<h4>Dropping unnecessary features</h4>
Some features such as 'Area_Code' and 'Phone' are not useful when it comes to predicting customer churn, and they need to be dropped prior to modeling. The easiest way to do so in Python is using the .drop() method of pandas DataFrames, just as you saw in the video, where 'Soc_Sec' and 'Tax_ID' were dropped:

telco.drop(['Soc_Sec', 'Tax_ID'], axis=1)
Here, axis=1 indicates that you want to drop 'Soc_Sec' and 'Tax_ID' from the columns.

In [None]:
# Drop the unnecessary features by passing their column names
telco = telco.drop(['Area_Code', 'Phone'], axis=1)

In [None]:
# Verify dropped features
print(telco.columns)

<h4>Engineering a new column</h4>
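Leveraging domain knowledge to engineer new features is an essential part of modeling. This quote from Andrew Ng summarizes the importance of feature engineering:

<blockquote>Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering.</blockquote>

Here, you'll create a new feature, 'Avg_Night_Calls', by dividing 'Night_Mins' by 'Night_Calls', which gives the average length of a night call.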



In [None]:
# Create the new feature
telco['Avg_Night_Calls'] = telco['Night_Mins']/ telco['Night_Calls']

# Print the first five rows of 'Avg_Night_Calls'
print(telco['Avg_Night_Calls'].head())

<h2 id="churn_prediction">Churn Prediction</h2>
With your data preprocessed and ready for machine learning, it's time to predict churn! Learn how to build supervised machine learning models in Python using scikit-learn.


<h4>Predicting whether a new customer will churn</h4>

In [None]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

In [None]:
# Instantiate the classifier
clf = LogisticRegression()

In [None]:
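# Fit the classifier
# (features is assumed to be a list of numeric feature column names defined in your workspace)
clf.fit(telco[features], telco['Churn'])

In [None]:
# Predict the label of new_customer
# (new_customer is assumed to be a one-row DataFrame with the same feature columns)
print(clf.predict(new_customer))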

<h4>Training another scikit-learn model</h4>
All sklearn models have .fit() and .predict() methods like the ones you used in the previous exercise for the LogisticRegression model. This consistent interface allows you to easily try many different models to see which one gives you the best performance. To get you more confident with using the sklearn API, in this exercise you'll try fitting a DecisionTreeClassifier instead of a LogisticRegression.

In [None]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Instantiate the classifier
clf = DecisionTreeClassifier()

# Fit the classifier
clf.fit(telco[features], telco['Churn'])

# Predict the label of new_customer
print(clf.predict(new_customer))

<h4>Evaluating Model Performance</h4>
<h3>Creating training and test sets</h3>
Before you create any model, it is important to split your dataset into two: a training set which will be used to build your churn model, and a test set which will be used to validate your model. To do this, you can use the train_test_split() function from sklearn.model_selection.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

In [None]:
# Create feature variable
X = telco.drop('Churn', axis=1)

In [None]:
# Create target variable
y = telco['Churn']

In [None]:
# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

<h3>Check each set's length</h3>
Just to make sure train_test_split() worked as you expected it to, check the lengths of X_train and X_test to see how many records are in each set. You can use functions like len() or attributes like .shape to explore this.

2333 Train set, 1000 Test set.
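
The sizes above can be reproduced with a short sketch. The arrays X_demo and y_demo are hypothetical placeholders with 3333 rows (483 churners plus 2850 non-churners), since only the row count matters for the split sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the dataset (483 + 2850)
n_rows = 3333
X_demo = np.zeros((n_rows, 2))
y_demo = np.zeros(n_rows)

# Same 70/30 split as in the notebook; random_state only makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# len() and .shape[0] both report the number of records
print(len(X_tr), len(X_te))
print(X_tr.shape, X_te.shape)
```

The test size is rounded up (30% of 3333 is 999.9), which is why the test set has exactly 1000 records.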

<h3>Computing accuracy</h3>
Having split your data into training and testing sets, you can now fit your model to the training data and then predict the labels of the test data. That's what you'll practice doing in this exercise.

So far, you've used Logistic Regression and Decision Trees. Here, you'll use a RandomForestClassifier, which you can think of as an ensemble of Decision Trees that generally outperforms a single Decision Tree.

Your work in the previous exercises has carried over, and the training and test sets are available in the variables X_train, X_test, y_train, and y_test.

In [None]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Compute accuracy
print(clf.score(X_test, y_test))

<h4>Model Metrics</h4>
<h3>Confusion matrix</h3>
Using scikit-learn's confusion_matrix() function, you can easily create your classifier's confusion matrix and gain a more nuanced understanding of its performance. It takes in two arguments: The actual labels of your test set - y_test - and your predicted labels.

The predicted labels of your Random Forest classifier from the previous exercise are stored in y_pred and were computed as follows:

y_pred = clf.predict(X_test)

In [None]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

In [None]:
# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Print the confusion matrix
print(confusion_matrix(y_test, y_pred))
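
To read the matrix, it helps to know its layout: rows are actual labels and columns are predicted labels, both in sorted label order. The sketch below uses hypothetical labels (1 = churn, 0 = no churn), not output from the telco model.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels (1 = churn, 0 = no churn)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# For binary labels, ravel() unpacks the matrix row by row
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```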

<h3>Varying training set size</h3>
The size of your training and testing sets influences model performance. Models learn better when they have more training data. However, there's a risk that they overfit to the training data and don't generalize well to new data, so in order to properly evaluate the model's ability to generalize, you need enough testing data. As a result, there is an important trade-off between how much data you use for training and how much you hold out for testing.

So far, you've used 70% for training and 30% for testing. Let's now use 80% of the data for training and evaluate how that changes the model's performance.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Create feature variable
X = telco.drop('Churn', axis=1)

# Create target variable
y = telco['Churn']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Print confusion matrix
print(confusion_matrix(y_test, y_pred))

This classifier has a higher precision than the previous classifier.


<h3>Computing precision and recall</h3>
The sklearn.metrics submodule has many functions that allow you to easily calculate interesting metrics. So far, you've calculated precision and recall by hand - this is important while you develop your intuition for both these metrics.

In practice, once you do, you can leverage the precision_score and recall_score functions that automatically compute precision and recall, respectively. Both work similarly to other functions in sklearn.metrics - they accept 2 arguments: the first is the actual labels (y_test), and the second is the predicted labels (y_pred).
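
As a reminder of the by-hand calculation, here is a minimal pure-Python sketch. The label lists are hypothetical (1 = churn, 0 = no churn) and are not results from the telco model.

```python
# Hypothetical actual and predicted labels (1 = churn, 0 = no churn)
y_true_demo = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

pairs = list(zip(y_true_demo, y_pred_demo))
tp = sum(1 for actual, pred in pairs if actual == 1 and pred == 1)  # true positives
fp = sum(1 for actual, pred in pairs if actual == 0 and pred == 1)  # false positives
fn = sum(1 for actual, pred in pairs if actual == 1 and pred == 0)  # false negatives

# Precision: of the customers predicted to churn, how many actually churned?
precision_by_hand = tp / (tp + fp)

# Recall: of the customers who actually churned, how many were caught?
recall_by_hand = tp / (tp + fn)

print(precision_by_hand, recall_by_hand)
```

Calling precision_score and recall_score on the same two lists should return the same two numbers.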

Let's now try a training size of 90%.

In [None]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Create feature variable
X = telco.drop('Churn', axis=1)

# Create target variable
y = telco['Churn']

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

# Import RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Import precision_score
from sklearn.metrics import precision_score

In [None]:
# Print the precision
print(precision_score(y_test, y_pred))

In [None]:
# Import recall_score
from sklearn.metrics import recall_score

# Print the recall, reusing y_test and y_pred from the cell above so that
# precision and recall describe the same split and the same fitted model
print(recall_score(y_test, y_pred))

<h4>Other model metrics</h4>
<h3>ROC curve</h3>
Let's now create an ROC curve for our random forest classifier. The first step is to calculate the predicted probabilities output by the classifier for each label using its .predict_proba() method. Then, you can use the roc_curve function from sklearn.metrics to compute the false positive rate and true positive rate, which you can then plot using matplotlib.

A RandomForestClassifier with a training set size of 70% has been fit to the data and is available in your workspace as clf.

In [None]:
# Generate the probabilities
y_pred_prob = clf.predict_proba(X_test)[:, 1]

In [None]:
# Import roc_curve
from sklearn.metrics import roc_curve
# Calculate the roc metrics
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

In [None]:
# Plot the ROC curve
plt.plot(fpr,tpr)

# Add labels and diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.plot([0, 1], [0, 1], "k--")
plt.show()

<h3>Area under the curve</h3>
Visually, the ROC curve from the previous exercise looks like that of a well-performing model. Let's quantify this by computing the area under the curve.

In [None]:
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Print the AUC
print(roc_auc_score(y_test, y_pred_prob))

<h4>Precision-recall curve</h4>
Another way to evaluate model performance is using a precision-recall curve, which shows the tradeoff between precision and recall for different thresholds.

When reading such a curve, keep the terminology straight: recall is synonymous with sensitivity (the true positive rate), not specificity, and precision is identical to positive predictive value.
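
A minimal sketch of how the points of a precision-recall curve can be computed with scikit-learn's precision_recall_curve. The labels and scores below are made-up illustrative values, not output from the telco model.

```python
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities for the positive class
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

precision, recall, thresholds = precision_recall_curve(y_true_demo, y_score_demo)

# Each (precision, recall) pair corresponds to one decision threshold;
# a final point with precision 1 and recall 0 is appended without a threshold
for p, r in zip(precision, recall):
    print(f"precision={p:.2f}  recall={r:.2f}")
```

Plotting recall on the x-axis against precision on the y-axis, for example with plt.plot(recall, precision), gives the curve itself.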

<h3>F1 score</h3>
As you've discovered, there's a tradeoff between precision and recall. Both are important metrics, and depending on how the business is trying to model churn, you may want to focus on optimizing one over the other. Often, stakeholders are interested in a single metric that can quantify model performance. The AUC is one metric you can use in these cases, and another is the F1 score, which is calculated as below:

2 * (precision * recall) / (precision + recall)
The advantage of the F1 score is that it incorporates both precision and recall into a single metric, and a high F1 score is a sign of a well-performing model, even in situations where you might have imbalanced classes. In scikit-learn, you can compute the F1 score using the f1_score function.
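
A short worked example of the formula, using hypothetical precision and recall values chosen only for illustration:

```python
# Hypothetical metric values (not results from the telco model)
precision = 0.8
recall = 0.5

# F1 is the harmonic mean of precision and recall
f1 = 2 * (precision * recall) / (precision + recall)

# The plain average, for comparison
arithmetic_mean = (precision + recall) / 2

print(round(f1, 4), arithmetic_mean)
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot reach a high F1 score by excelling at precision while neglecting recall, or the other way around.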

In [None]:
# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the training data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Import f1_score
from sklearn.metrics import f1_score

# Print the F1 score
print(f1_score(y_test, y_pred))

<h2 id="model_tuning">Model Tuning</h2>
Learn how to improve the performance of your models using hyperparameter tuning and gain a better understanding of the drivers of customer churn that you can take back to the business.


<h3>Tuning your model</h3>
<h4>Tuning the number of features</h4>
The default hyperparameters used by your models are not optimized for your data. The goal of grid search cross-validation is to identify those hyperparameters that lead to optimal model performance. In the video, you saw how the random forest's n_estimators hyperparameter was tuned. Here, you'll practice tuning the max_features hyperparameter. The cv argument is set to 3 so that the code executes quickly.
A random forest is an ensemble of many decision trees. The n_estimators hyperparameter controls the number of trees to use in the forest, while the max_features hyperparameter controls the number of features the random forest should consider when looking for the best split in each decision tree.

A random forest classifier has been instantiated for you as clf.


In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Create the hyperparameter grid
# ('auto' was removed from RandomForestClassifier in scikit-learn 1.3; None uses all features)
param_grid = {'max_features': ['sqrt', 'log2', None]}

In [None]:
# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=3)

# Fit the model
grid_search.fit(X, y)
# Print the optimal parameters
print(grid_search.best_params_)

<h4>Tuning other hyperparameters</h4>
The power of GridSearchCV really comes into play when you're tuning multiple hyperparameters, as then the algorithm tries out all possible combinations of hyperparameters to identify the best combination. 
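
To get a feel for how quickly the number of combinations grows, the sketch below counts the candidates in the four-hyperparameter grid used in this section with scikit-learn's ParameterGrid. The variable names demo_grid and candidates are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# The same four-hyperparameter grid that is searched in this section
demo_grid = {
    "max_depth": [3, None],
    "max_features": [1, 3, 10],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# ParameterGrid enumerates every combination of the listed values
candidates = list(ParameterGrid(demo_grid))
n_candidates = len(candidates)  # 2 * 3 * 2 * 2

# With 3-fold cross-validation, each candidate is fit three times
n_fits = n_candidates * 3

print(n_candidates, n_fits)
print(candidates[0])
```

Adding one more hyperparameter with three values would triple both numbers, which is why large grids become slow.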

In [None]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Create the hyperparameter grid
param_grid = {"max_depth": [3, None],
              "max_features": [1, 3, 10],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Call GridSearchCV
grid_search = GridSearchCV(clf, param_grid, cv=3)

In [None]:
# Fit the model
grid_search.fit(X, y)

# Print the best hyperparameters
print(grid_search.best_params_) 

<h4>Randomized search</h4>
In the above chunk of code from the previous exercise, you may have noticed that the first line of code did not take much time to run, while the call to .fit() took several seconds to execute.

This is because .fit() is what actually performs the grid search, and in our case, it was a grid with many different combinations. As the hyperparameter grid gets larger, grid search becomes slower. In order to solve this problem, instead of trying out every single combination of values, we could randomly jump around the grid and try different combinations. There's a small possibility we may miss the best combination, but we would save a lot of time, or be able to tune more hyperparameters in the same amount of time.

In scikit-learn, you can do this using RandomizedSearchCV. It has the same API as GridSearchCV, except that you need to specify a parameter distribution that it can sample from instead of specific hyperparameter values. Let's try it out now! The parameter distribution has been set up for you, along with a random forest classifier called clf.
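
To illustrate what sampling from a parameter distribution means, the sketch below draws a few candidate settings with scikit-learn's ParameterSampler. The name demo_dist, the number of draws, and the random_state are illustrative assumptions.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Lists are sampled uniformly; randint(1, 11) draws integers from 1 to 10 inclusive
demo_dist = {
    "max_depth": [3, None],
    "max_features": randint(1, 11),
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

# Draw five random candidate settings (random_state makes the draw repeatable)
samples = list(ParameterSampler(demo_dist, n_iter=5, random_state=0))

for params in samples:
    print(params)
```

RandomizedSearchCV evaluates only such sampled candidates (10 by default, controlled by n_iter) instead of every combination in the grid.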

In [None]:
# Import RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Import randint to define a discrete uniform distribution
from scipy.stats import randint

# Create the hyperparameter distribution
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Call RandomizedSearchCV
random_search = RandomizedSearchCV(clf, param_dist)

In [None]:
# Fit the model
random_search.fit(X, y)

# Print best parameters
print(random_search.best_params_)

<h3>Feature importances</h3>
<h4>Visualizing feature importances</h4>
Your random forest classifier from earlier exercises has been fit to the telco data and is available to you as clf. Let's visualize the feature importances and get a sense for what the drivers of churn are, using matplotlib's barh to create a horizontal bar plot of feature importances.

In [None]:
# Calculate feature importances
importances = clf.feature_importances_

# Create plot
plt.barh(range(X.shape[1]), importances)
plt.show()

<h4>Improving the plot</h4>
In order to make the plot more readable, we need to achieve two goals:
<ul>
<li>Re-order the bars in ascending order.</li>
<li>Add labels to the plot that correspond to the feature names.</li>
</ul>
To do this, we'll take advantage of NumPy indexing. The np.argsort() function returns the indices that would sort an array. We'll use these indices to achieve both goals.
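
A minimal sketch of the argsort idea on hypothetical values. The importances and feature names below are made up for illustration and are not taken from the fitted model.

```python
import numpy as np

# Hypothetical importances and matching feature names
demo_importances = np.array([0.2, 0.5, 0.3])
demo_names = np.array(["Day_Mins", "CustServ_Calls", "Intl_Plan"])

# Indices that would sort the importances in ascending order
order = np.argsort(demo_importances)

# Indexing both arrays with the same indices keeps values and labels aligned
sorted_importances = demo_importances[order]
sorted_names = demo_names[order]

print(order)
print(sorted_importances)
print(sorted_names)
```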


In [None]:
# Sort importances
sorted_index = np.argsort(importances)

# Create labels
labels = X.columns[sorted_index]

# Clear current plot
plt.clf()

# Create plot
plt.barh(range(X.shape[1]), importances[sorted_index], tick_label=labels)
plt.show()

<h3>Adding new features</h3>
<h4>Does model performance improve?</h4>
Six new features have been added to the telco DataFrame:
<ul>
<li>Region_Code</li>
<li>Cost_Call</li>
<li>Total_Charge</li>
<li>Total_Minutes</li>
<li>Total_Calls</li>
<li>Min_Call</li>
</ul>

In [None]:
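# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Create training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Instantiate the classifier, fit it, and print the test accuracy
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))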

<h4>Computing other metrics</h4>
In addition to accuracy, let's also compute the F1 score of this new model to get a better picture of model performance.

A 70-30 train-test split has already been done for you, and all necessary modules have been imported.

In [None]:
# Import f1_score
from sklearn.metrics import f1_score

# Instantiate the classifier
clf = RandomForestClassifier()

# Fit to the data
clf.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = clf.predict(X_test)

# Print the F1 score
print(f1_score(y_test, y_pred))