# Basic modeling in scikit-learn: a review

In [None]:
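from sklearn.ensemble import RandomForestRegressor
# mae is used below as a short alias for mean_absolute_error
from sklearn.metrics import mean_absolute_error as mae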

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X=X_train, y=y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=500, n_jobs=1, oob_score=False, random_state=1111, verbose=0, warm_start=False)

predictions = model.predict(X_test)

print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))

In [5]:
import pandas as pd
candy_data = pd.read_csv('datasets/candy-data.csv')
candy_data

Unnamed: 0,competitorname,chocolate,fruity,caramel,peanutyalmondy,nougat,crispedricewafer,hard,bar,pluribus,sugarpercent,pricepercent,winpercent
0,100 Grand,1,0,1,0,0,1,0,1,0,0.732,0.860,66.971725
1,3 Musketeers,1,0,0,0,1,0,0,1,0,0.604,0.511,67.602936
2,One dime,0,0,0,0,0,0,0,0,0,0.011,0.116,32.261086
3,One quarter,0,0,0,0,0,0,0,0,0,0.011,0.511,46.116505
4,Air Heads,0,1,0,0,0,0,0,0,0,0.906,0.511,52.341465
...,...,...,...,...,...,...,...,...,...,...,...,...,...
80,Twizzlers,0,1,0,0,0,0,0,0,0,0.220,0.116,45.466282
81,Warheads,0,1,0,0,0,0,1,0,0,0.093,0.116,39.011898
82,Welch's Fruit Snacks,0,1,0,0,0,0,0,0,1,0.313,0.313,44.375519
83,Werther's Original Caramel,0,0,1,0,0,0,1,0,0,0.186,0.267,41.904308


In [15]:
import numpy as np
from sklearn.model_selection import train_test_split

# Define X and y
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 0])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1111)

In [16]:
# Import the LinearRegression class and mean_absolute_error function
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# The model is defined and fit using X_train and y_train
model = LinearRegression()
model.fit(X_train, y_train)

# Create vectors of predictions
train_predictions = model.predict(X_train)
test_predictions = model.predict(X_test)

# Train/Test Errors
train_error = mean_absolute_error(y_true=y_train, y_pred=train_predictions)
test_error = mean_absolute_error(y_true=y_test, y_pred=test_predictions)

# Print the accuracy for seen and unseen data
print("Model error on seen data: {0:.2f}.".format(train_error))
print("Model error on unseen data: {0:.2f}.".format(test_error))

Model error on seen data: 0.00.
Model error on unseen data: 2.00.


When models perform differently on training and testing data, you should look to model validation to ensure you have the best performing model.
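The errors printed above are mean absolute errors. As a minimal sketch of what that metric computes, the block below compares a hand-written calculation with `mean_absolute_error`. The arrays `y_true` and `y_pred` are hypothetical values chosen for illustration, not data from this notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Hypothetical true values and predictions, for illustration only
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

# MAE is the mean of the absolute differences between truth and prediction
manual_mae = np.mean(np.abs(y_true - y_pred))
sklearn_mae = mean_absolute_error(y_true=y_true, y_pred=y_pred)

print("Manual MAE: {0:.2f}".format(manual_mae))
print("sklearn MAE: {0:.2f}".format(sklearn_mae))
```

Both lines should print the same number, since the absolute differences here are 0.5, 0.0, 1.5 and 1.0.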

## Regression models

In [17]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import RandomForestClassifier

rfr = RandomForestRegressor(random_state=1111)
rfc = RandomForestClassifier(random_state=1111)

![resim_2023-04-28_113505961](resim_2023-04-28_113505961.png)


### Random forest parameters

`n_estimators`: the number of trees in the forest

`max_depth`: the maximum depth of the trees

`random_state`: the random seed
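One way to see what these parameters control is to inspect a fitted forest. The sketch below assumes synthetic data from `make_regression` (not the candy data) and uses the names `X_demo`, `y_demo` and `rfr_demo` purely for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only for illustration
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=1111
)

rfr_demo = RandomForestRegressor(n_estimators=50, max_depth=10, random_state=1111)
rfr_demo.fit(X_demo, y_demo)

# estimators_ holds the individual fitted trees
n_trees = len(rfr_demo.estimators_)
# get_depth() reports how deep each fitted tree actually grew
deepest = max(tree.get_depth() for tree in rfr_demo.estimators_)

print("Trees in the forest: {}".format(n_trees))
print("Deepest tree: {}".format(deepest))
```

The number of trees matches `n_estimators`, and no tree grows deeper than `max_depth`.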

In [18]:
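from sklearn.ensemble import RandomForestRegressor

# Option 1: set parameters when the model is created
rfr = RandomForestRegressor(n_estimators=50, max_depth=10)

# Option 2: create the model first, then set parameters as attributes
rfr = RandomForestRegressor(random_state=1111)
rfr.n_estimators = 50
rfr.max_depth = 10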

### Feature Importance

Print how important each column is to the model 

In [24]:
rfr.fit(X_train, y_train)  # fit the RandomForestRegressor instance

for i, item in enumerate(rfr.feature_importances_):    
    print("{0:d}: {1:.2f}".format(i, item))

0: 0.35
1: 0.65


### Set parameters and fit a model
Predictive tasks fall into one of two categories: regression or classification. In the candy dataset, the outcome is a continuous variable describing how often the candy was chosen over another candy in a series of 1-on-1 match-ups. To predict this value (the win-percentage), you will use a regression model.

In [25]:
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)

We have updated parameters _after_ the model was initialized. This approach is helpful when an existing model object needs different settings, since the changes take effect the next time `.fit()` is called. Before making predictions, let's see which candy characteristics were most important to the model.
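Attribute assignment is one way to change parameters on an existing model. scikit-learn estimators also provide `set_params()`, which does the same job and raises an error for a misspelled parameter name. The sketch below is an illustration with hypothetical variable names (`rfr_attr`, `rfr_set`), not part of the original exercise.

```python
from sklearn.ensemble import RandomForestRegressor

# Update parameters by attribute assignment, as in the cell above
rfr_attr = RandomForestRegressor(random_state=1111)
rfr_attr.n_estimators = 100
rfr_attr.max_depth = 6

# Update the same parameters with set_params()
rfr_set = RandomForestRegressor(random_state=1111)
rfr_set.set_params(n_estimators=100, max_depth=6)

# Both models now report identical parameter dictionaries
params = rfr_set.get_params()
print(params["n_estimators"], params["max_depth"], params["random_state"])
print(rfr_attr.get_params() == params)
```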

### Feature importances
Although some candy attributes, such as chocolate, may be extremely popular, it doesn't mean they will be important to model prediction. After a random forest model has been fit, you can review the model's attribute, .feature_importances_, to see which variables had the biggest impact. You can check how important each variable was in the model by looping over the feature importance array using enumerate().

In [None]:
# Fit the model using X and y
rfr.fit(X_train, y_train)

# Print how important each column is to the model
for i, item in enumerate(rfr.feature_importances_):
    # Use i and item to print out the feature importance of each column
    # (X_train must be a pandas DataFrame here so that .columns is available)
    print("{0:s}: {1:.2f}".format(X_train.columns[i], item))

In [None]:
<script.py> output:
    chocolate: 0.44
    fruity: 0.03
    caramel: 0.02
    peanutyalmondy: 0.05
    nougat: 0.01
    crispedricewafer: 0.03
    hard: 0.01
    bar: 0.02
    pluribus: 0.02
    sugarpercent: 0.17
    pricepercent: 0.19

No surprise here - chocolate _is_ the most important variable. .feature_importances_ is a great way to see which variables were important to your random forest model.
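Looping with `enumerate()` works, but pairing the importances with column names in a pandas Series makes them easy to sort. The sketch below assumes a small synthetic DataFrame. The column names `feat_a`, `feat_b` and `feat_c` and the way the target is built are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(1111)

# Hypothetical binary features; the column names are made up
X_demo = pd.DataFrame(
    rng.randint(0, 2, size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"]
)
# The target depends strongly on feat_a, weakly on feat_b, and not on feat_c
y_demo = 10 * X_demo["feat_a"] + 2 * X_demo["feat_b"] + rng.normal(0, 0.5, size=200)

rfr_demo = RandomForestRegressor(n_estimators=50, random_state=1111)
rfr_demo.fit(X_demo, y_demo)

# Label each importance with its column name and sort from most to least important
importances = pd.Series(
    rfr_demo.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.round(2))
```

The importances of a fitted random forest are normalized, so they sum to 1 and can be read as relative shares.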

## Classification models
Classification models predict categorical responses.

In [28]:
import pandas as pd
tic_tac_toe = pd.read_csv('datasets/tic-tac-toe.csv')
tic_tac_toe

Unnamed: 0,Top-Left,Top-Middle,Top-Right,Middle-Left,Middle-Middle,Middle-Right,Bottom-Left,Bottom-Middle,Bottom-Right,Class
0,x,x,x,x,o,o,x,o,o,positive
1,x,x,x,x,o,o,o,x,o,positive
2,x,x,x,x,o,o,o,o,x,positive
3,x,x,x,x,o,o,o,b,b,positive
4,x,x,x,x,o,o,b,o,b,positive
...,...,...,...,...,...,...,...,...,...,...
953,o,x,x,x,o,o,o,x,x,negative
954,o,x,o,x,x,o,x,o,x,negative
955,o,x,o,x,o,x,x,o,x,negative
956,o,x,o,o,x,x,x,o,x,negative


In [3]:
import numpy as np
from sklearn.model_selection import train_test_split

# Define X and y
X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([0, 1, 0])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=1111)

### Using .predict() for classification

In [36]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=1111)
rfc.fit(X_train, y_train)
rfc.predict(X_test)

array([1])

In [37]:
pd.Series(rfc.predict(X_test)).value_counts()

1    1
dtype: int64

### Predicting probabilities

In [38]:
rfc.predict_proba(X_test)

array([[0.37, 0.63]])

In [None]:
array([[0. , 1. ],       
       [0.1, 0.9],       
       [0.1, 0.9],       
       ...])

In [39]:
rfc = RandomForestClassifier(random_state=1111) 
rfc.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1111,
 'verbose': 0,
 'warm_start': False}

In [43]:
rfc.fit(X_train, y_train)
rfc.score(X_test, y_test)

0.0
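For a classifier, `.score()` returns the mean accuracy on the data it is given, so a value of 0.0 means that none of the test samples were classified correctly. The sketch below checks this against `accuracy_score` on synthetic data from `make_classification`. The names `X_tr`, `X_te`, `y_tr`, `y_te` and `rfc_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1111)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1111
)

rfc_demo = RandomForestClassifier(n_estimators=50, random_state=1111)
rfc_demo.fit(X_tr, y_tr)

# score() predicts on X_te and compares the result with y_te
acc_from_score = rfc_demo.score(X_te, y_te)
# accuracy_score does the comparison on predictions made explicitly
acc_from_metric = accuracy_score(y_te, rfc_demo.predict(X_te))

print("score(): {0:.2f}".format(acc_from_score))
print("accuracy_score(): {0:.2f}".format(acc_from_metric))
```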

### **Classification predictions**

In model validation, it is often important to know more about the predictions than just the final classification. When predicting who will win a game, most people also want to know how likely it is that a team will win.

![resim_2023-04-28_123357307](resim_2023-04-28_123357307.png)


In [5]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Define the model object
rfc = RandomForestClassifier()

# Fit the rfc model. 
rfc.fit(X_train, y_train)

# Create arrays of predictions
classification_predictions = rfc.predict(X_test)
probability_predictions = rfc.predict_proba(X_test)

# Print out count of binary predictions
print(pd.Series(classification_predictions).value_counts())

# Print the first value from probability_predictions
print('The first predicted probabilities are: {}'.format(probability_predictions[0]))

1    1
dtype: int64
The first predicted probabilities are: [0.2 0.8]


The `probability_predictions` array contains rows with only two values because there are only two possible responses (win or lose).
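To make the link between `.predict_proba()` and `.predict()` concrete, the sketch below fits a classifier on synthetic two-class data from `make_classification` (an assumption, not the tic-tac-toe data) and rebuilds the class labels from the probabilities. All variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1111)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=1111
)

rfc_demo = RandomForestClassifier(n_estimators=50, random_state=1111)
rfc_demo.fit(X_tr, y_tr)

proba = rfc_demo.predict_proba(X_te)   # one row per sample, one column per class
labels = rfc_demo.predict(X_te)

# Columns of proba follow the order of rfc_demo.classes_
print(rfc_demo.classes_)
print(proba.shape)

# Picking the most probable class per row reproduces predict()
from_proba = rfc_demo.classes_[np.argmax(proba, axis=1)]
print(np.array_equal(from_proba, labels))
```

Each row of `proba` sums to 1, and the column order is given by `classes_`.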

### **Reusing model parameters**
Replicating model performance is vital in model validation.

In [6]:
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))

RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)
The random state is: 1111
Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}


In [None]:
<script.py> output:
    RandomForestClassifier(max_depth=6, n_estimators=50, random_state=1111)
    The random state is: 1111
    Printing the parameters dictionary: {'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 6, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 50, 'n_jobs': None, 'oob_score': False, 'random_state': 1111, 'verbose': 0, 'warm_start': False}

Recalling which parameters were used will be helpful going forward. Model validation and performance rely heavily on which parameters were used, and there is no way to replicate a model without keeping track of them.
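As a minimal sketch of how saved parameters allow a model to be replicated, the block below rebuilds a classifier from the dictionary returned by `get_params()` and checks that it makes the same predictions. It assumes synthetic data from `make_classification` and a fixed integer `random_state`. The names `original`, `saved_params` and `rebuilt` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=1111)

original = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
original.fit(X_demo, y_demo)

# Keep a record of every parameter, including the random seed
saved_params = original.get_params()

# Rebuild a fresh model from the saved dictionary and fit it on the same data
rebuilt = RandomForestClassifier(**saved_params)
rebuilt.fit(X_demo, y_demo)

same_params = rebuilt.get_params() == saved_params
same_predictions = bool((original.predict(X_demo) == rebuilt.predict(X_demo)).all())
print(same_params, same_predictions)
```

Without a fixed `random_state`, the rebuilt forest would generally differ from the original even with every other parameter the same.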

### **Random forest classifier**
This exercise reviews the four modeling steps discussed throughout this chapter using a random forest classification model.

1. Create a random forest classification model.
2. Fit the model using the `tic_tac_toe` dataset.
3. Make predictions on whether Player One will win (1) or lose (0) the current game.
4. Evaluate the overall accuracy of the model.

In [9]:
from sklearn.ensemble import RandomForestClassifier

# Create a random forest classifier
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Fit rfc using X_train and y_train
rfc.fit(X_train, y_train)

# Create predictions on X_test
predictions = rfc.predict(X_test)
print(predictions[0:5])

# Print model accuracy using score() and the testing data
print(rfc.score(X_test, y_test))

[1]
0.0


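That's all the steps! On the full `tic_tac_toe` data, the first five predictions are all 1, indicating that Player One is predicted to win all five of those games. The output shown above comes from the three-row toy arrays defined earlier, so it contains a single prediction.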