# Basic Modeling in scikit-learn
>  Before we can validate models, we need an understanding of how to create and work with them. This chapter provides an introduction to running regression and classification models in scikit-learn. We will use this model building foundation throughout the remaining chapters.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 1 exercises from the course "Model Validation in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams['figure.figsize'] = (8, 8)

## Introduction to model validation

### Modeling steps

<div class=""><p>The process of using scikit-learn to create and test models has four steps, and you will use these four steps throughout this course. </p>
<p>Which of the following is <strong>NOT</strong> a valid method in the four-step <code>scikit-learn</code> model validation framework?</p></div>

<pre>
Possible Answers

.predict()

.fit()

<b>.validate()</b>
</pre>

**Validation is a technique all its own and is not performed by calling .validate(). You need to learn a few tools and techniques before you can validate a model.**
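The four steps (create, fit, predict, evaluate) can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not the course's dataset; the names `X_demo`, `y_demo`, and `demo_model` are made up for this example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data (assumption: not the course's candy dataset)
X_demo, y_demo = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Step 1: create the model
demo_model = RandomForestRegressor(n_estimators=20, random_state=0)

# Step 2: fit the model
demo_model.fit(X_demo, y_demo)

# Step 3: make predictions
demo_predictions = demo_model.predict(X_demo)

# Step 4: evaluate with a metric function; estimators have no .validate() method
demo_error = mean_absolute_error(y_demo, demo_predictions)

print("Has a .validate() method:", hasattr(demo_model, "validate"))
print("MAE on the demo data: {0:.2f}".format(demo_error))
```

The error here is computed on the same rows the model was fit on, so it only illustrates the mechanics of the four steps.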

### Seen vs. unseen data

<div class=""><p>Models tend to have higher accuracy on observations they have seen before. In the candy dataset, predicting the popularity of Skittles will likely have higher accuracy than predicting the popularity of Andes Mints; Skittles is in the dataset, and Andes Mints is not. </p>
<p>You've built a model based on 50 candies using the dataset <code>X_train</code> and need to report how accurate the model is at predicting the popularity of the 50 candies the model was built on, and the 35 candies (<code>X_test</code>) it has never seen. You will use the mean absolute error, <code>mae()</code>, as the accuracy metric; because it measures error, lower values are better.</p></div>

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error as mae
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/11-model-validation-in-python/datasets/candy-data.csv')
X = df.drop(['competitorname', 'winpercent'], axis=1)
y = df['winpercent']

model = RandomForestRegressor(n_estimators=50)

In [None]:
X_train, y_train = X.iloc[:50].values, y.iloc[:50].values
X_test, y_test = X.iloc[50:].values, y.iloc[50:].values

Instructions
<ul>
<li>Using <code>X_train</code> and <code>X_test</code> as input data, create arrays of predictions using <code>model.predict()</code>.</li>
<li>Calculate the model error (MAE) on both data the model has seen and data the model has not seen before.</li>
<li>Use the print statements to print the errors on the seen and unseen data.</li>
</ul>

In [None]:
# The model is fit using X_train and y_train
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mae(y_true=y_train, y_pred=train_predictions)
test_error = mae(y_true=y_test, y_pred=test_predictions)

# Print the error (MAE) for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 3.43.
Model error on unseen data: 10.84.


**When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model. In the next lesson, you will start building models to validate.**
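The cells above take the first 50 rows as training data. A shuffled split with `train_test_split` is a common alternative that avoids depending on row order. The sketch below uses synthetic data from `make_regression` as a stand-in for the candy data; the variable names and split sizes are chosen for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 85 rows and 11 features (assumption: not the candy data)
X_syn, y_syn = make_regression(n_samples=85, n_features=11, noise=15.0, random_state=0)

# Shuffle, then hold out 35 rows as unseen data
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=35, random_state=1111)

forest = RandomForestRegressor(n_estimators=50, random_state=1111)
forest.fit(X_tr, y_tr)

# Mean absolute error on seen (training) and unseen (testing) rows
seen_error = mae(y_tr, forest.predict(X_tr))
unseen_error = mae(y_te, forest.predict(X_te))

print("Error on seen data: {0:.2f}".format(seen_error))
print("Error on unseen data: {0:.2f}".format(unseen_error))
```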

## Regression models

### Set parameters and fit a model

<div class=""><p>Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a <em>continuous</em> variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a <strong>regression</strong> model.</p>
<p>In this exercise, you will specify a few parameters using a random forest regression model <code>rfr</code>.</p></div>

In [None]:
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor()

Instructions
<ul>
<li>Add a parameter to <code>rfr</code> so that the number of trees built is 100 and the maximum depth of these trees is 6.</li>
<li>Make sure the model is reproducible by adding a random state of <code>1111</code>.</li>
<li>Use the <code>.fit()</code> method to train the random forest regression model with <code>X_train</code> as the input data and <code>y_train</code> as the response.</li>
</ul>

In [None]:
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

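# Fit the model (note: this cell fits on the full candy data X, y,
# not on X_train, y_train as the instructions describe)
rfr.fit(X, y)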

RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=6, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=100, n_jobs=None, oob_score=False,
                      random_state=1111, verbose=0, warm_start=False)

**You have updated parameters after the model was initialized. This approach is helpful when you need to update parameters. Before making predictions, let's see which candy characteristics were most important to the model.**
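Besides assigning attributes one at a time, scikit-learn estimators accept the same parameters in the constructor or through `set_params()`. A small sketch comparing the two (the names `rfr_a` and `rfr_b` are used only in this example):

```python
from sklearn.ensemble import RandomForestRegressor

# Option 1: pass the parameters when the model is created
rfr_a = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)

# Option 2: create the model with defaults, then update it with set_params()
rfr_b = RandomForestRegressor()
rfr_b.set_params(n_estimators=100, max_depth=6, random_state=1111)

# Both models end up with the same parameter dictionary
same_params = rfr_a.get_params() == rfr_b.get_params()
print("Same parameters:", same_params)
```

One practical difference: `set_params()` raises an error for a misspelled parameter name, while a plain attribute assignment would silently create a new, unused attribute.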

### Feature importances

<div class=""><p>Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be <em>important</em> to model prediction. After a random forest model has been fit, you can review the model's attribute, <code>.feature_importances_</code>, to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using <code>enumerate()</code>.</p>
<p>If you are unfamiliar with Python's <code>enumerate()</code> function, it can loop over a list while also creating an automatic counter.</p></div>

Instructions
<ul>
<li>Loop through the feature importance output of <code>rfr</code>.</li>
<li>Print the column names of <code>X_train</code> and the importance score for that column.</li>
</ul>

In [None]:
# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    print("{0:s}: {1:.2f}".format(X.columns[i], item))

chocolate: 0.44
fruity: 0.03
caramel: 0.02
peanutyalmondy: 0.05
nougat: 0.01
crispedricewafer: 0.03
hard: 0.01
bar: 0.02
pluribus: 0.02
sugarpercent: 0.17
pricepercent: 0.19


**No surprise here - chocolate is the most important variable. .feature_importances_ is a great way to see which variables were important to your random forest model.**
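When there are many columns, it can be easier to read the importances sorted from largest to smallest. One possible approach is to wrap them in a pandas Series indexed by column name. The sketch below uses synthetic data and hypothetical column names (`feat_a` to `feat_d`), not the candy features.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names (assumption for illustration)
feature_names = ["feat_a", "feat_b", "feat_c", "feat_d"]
X_arr, y_arr = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
X_df = pd.DataFrame(X_arr, columns=feature_names)

forest = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
forest.fit(X_df, y_arr)

# Pair each importance with its column name and sort in descending order
importances = pd.Series(forest.feature_importances_, index=X_df.columns)
importances = importances.sort_values(ascending=False)

for name, score in importances.items():
    print("{0:s}: {1:.2f}".format(name, score))
```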

## Classification models

### Classification predictions

<div class=""><p>In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people are also interested in <em>how likely</em> it is a team will win. </p>
<table>
<thead>
<tr>
<th>Probability</th>
<th>Prediction</th>
<th>Meaning</th>
</tr>
</thead>
<tbody>
<tr>
<td>0 &lt; .50</td>
<td>0</td>
<td>Team Loses</td>
</tr>
<tr>
<td>.50 +</td>
<td>1</td>
<td>Team Wins</td>
</tr>
</tbody>
</table>
<p>In this exercise, you will look at the methods <code>.predict()</code> and <code>.predict_proba()</code> using the <code>tic_tac_toe</code> dataset. The first method will give a prediction of whether Player One will win the game, and the second method will provide the probability of Player One winning. Use <code>rfc</code> as the random forest classification model.</p></div>

In [None]:
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/11-model-validation-in-python/datasets/tic_tac_toe_preprocessed.csv')
X, y = df.iloc[:, :-1], df.iloc[:, -1]
X_train, y_train = X.iloc[:191].values, y.iloc[:191].values
X_test, y_test = X.iloc[191:].values, y.iloc[191:].values
rfc = RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)

Instructions
<ul>
<li>Create two arrays of predictions. One for the classification values and one for the predicted probabilities.</li>
<li>Use the <code>.value_counts()</code> method for a pandas Series to print the number of observations that were assigned to each class.</li>
<li>Print the first observation of <code>probability_predictions</code> to see how the probabilities are structured.</li>
</ul>

In [None]:
# Fit the rfc model. 
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are: {}'.format(probability_predictions[0]))

1    563
0    204
dtype: int64
The first predicted probabilities are: [0.26524423 0.73475577]


**Well done! You can see there were 563 observations where Player One was predicted to win the Tic-Tac-Toe game. Also, note that each row of the probability_predictions array contains only two values because you only have two possible responses (win or lose). Remember these two methods, as you will use them a lot throughout this course.**
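To see how the two methods relate, the sketch below checks that the label from `.predict()` matches the class with the highest probability in `.predict_proba()`, where the columns follow the order of `classes_`. It uses synthetic data from `make_classification` rather than the tic-tac-toe data, and the variable names are chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data (assumption: not the tic_tac_toe data)
X_cls, y_cls = make_classification(n_samples=200, n_features=9, random_state=0)

clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
clf.fit(X_cls, y_cls)

proba = clf.predict_proba(X_cls)  # one row per observation, one column per class
labels = clf.predict(X_cls)       # one predicted label per observation

# Columns of proba are ordered like clf.classes_; take the most probable class
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

print("Class order:", clf.classes_)
print("Probability array shape:", proba.shape)
print("Labels match:", np.array_equal(labels, labels_from_proba))
```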

### Reusing model parameters

<div class=""><p>Replicating model performance is vital in model validation.  Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as <a href="https://stackoverflow.com/" target="_blank" rel="noopener noreferrer">Stack Overflow</a>. You might use such a site to ask other coders about model errors, output, or performance. The best way to do this is to replicate your work by reusing model parameters. </p>
<p>In this exercise, you use various methods to recall which parameters were used in a model.</p></div>

Instructions
<ul>
<li>Print out the characteristics of the model <code>rfc</code> by simply printing the model. </li>
<li>Print just the random state of the model.</li>
<li>Print the dictionary of model parameters.</li>
</ul>

In [None]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=6, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=1111,
                       verbose=0, warm_start=False)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0,

**Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of the parameters used!**
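As a sketch of how stored parameters might be reused, the dictionary returned by `get_params()` can be unpacked into a new model. With the same integer `random_state` and the same training data, both models should produce the same predictions. The data is synthetic, and the names `original`, `saved_params`, and `rebuilt` exist only in this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration (assumption: not the tic_tac_toe data)
X_cls, y_cls = make_classification(n_samples=200, n_features=9, random_state=0)

original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Save the parameter dictionary, then build a second model from it
saved_params = original.get_params()
rebuilt = RandomForestClassifier(**saved_params)

original.fit(X_cls, y_cls)
rebuilt.fit(X_cls, y_cls)

# Same parameters, seed, and data lead to the same predictions
identical = bool((original.predict(X_cls) == rebuilt.predict(X_cls)).all())
print("Identical predictions:", identical)
```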

### Random forest classifier

<div class=""><p>This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model. You will:</p>
<ol>
<li>Create a random forest classification model.</li>
<li>Fit the model using the <code>tic_tac_toe</code> dataset.</li>
<li>Make predictions on whether Player One will win (1) or lose (0) the current game.</li>
<li>Finally, you will evaluate the overall accuracy of the model.</li>
</ol>
<p>Let's get started!</p></div>

Instructions 1/4
<ul>
<li>Create <code>rfc</code> using the <code>scikit-learn</code> implementation of random forest classifiers and set a random state of 1111.</li>
</ul>

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

Instructions 2/4
<ul>
<li>Fit <code>rfc</code> using <code>X_train</code> for the training data and <code>y_train</code> for the responses.</li>
</ul>

In [None]:
# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=6, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=1111,
                       verbose=0, warm_start=False)

Instructions 3/4
<ul>
<li>Predict the class values for <code>X_test</code>.</li>
</ul>

In [None]:
# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])

[1 1 1 1 1]


Instructions 4/4
<ul>
<li>Use the method <code>.score()</code> to print an accuracy metric for <code>X_test</code> given the actual values <code>y_test</code>.</li>
</ul>

In [None]:
# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))

0.817470664928292


**Notice the first five predictions were all 1, indicating that Player One is predicted to win all five of those games. You also see the model accuracy was only 82%.**
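For a classifier, `.score()` reports mean accuracy, which is the same number you get from `accuracy_score()` on the predictions. The sketch below checks this on synthetic data (not the tic-tac-toe data), with variable names chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data for illustration (assumption: not the tic_tac_toe data)
X_cls, y_cls = make_classification(n_samples=300, n_features=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_cls, y_cls, test_size=0.3, random_state=1111)

clf = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
clf.fit(X_tr, y_tr)

# .score() predicts on X_te and compares the result with y_te
score_value = clf.score(X_te, y_te)

# The same accuracy computed explicitly from the predictions
manual_value = accuracy_score(y_te, clf.predict(X_te))

print("score():        {0:.4f}".format(score_value))
print("accuracy_score: {0:.4f}".format(manual_value))
```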