
## Initialization

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

In [2]:
col_index = ["Sex","Length","Diameter","Height",
              "Whole weight","Shucked weight",
              "Viscera weight","Shell weight", "Rings"]
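# Note: abalone.data has no header row. With read_csv's default header handling,
# the first record is consumed as column names and then overwritten below, which
# is why the frame has 4176 rows rather than the dataset's 4177.
# Passing header=None, names=col_index would keep that first record.
data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', index_col=False)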
data.columns = col_index
data

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
0,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,7
1,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,9
2,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,10
3,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,7
4,I,0.425,0.300,0.095,0.3515,0.1410,0.0775,0.1200,8
...,...,...,...,...,...,...,...,...,...
4171,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,11
4172,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,10
4173,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,9
4174,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,10


## Exploration

In [3]:
data.describe(include="all")

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
count,4176,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0,4176.0
unique,3,,,,,,,,
top,M,,,,,,,,
freq,1527,,,,,,,,
mean,,0.524009,0.407892,0.139527,0.828818,0.3594,0.180613,0.238852,9.932471
std,,0.120103,0.09925,0.041826,0.490424,0.22198,0.10962,0.139213,3.223601
min,,0.075,0.055,0.0,0.002,0.001,0.0005,0.0015,1.0
25%,,0.45,0.35,0.115,0.4415,0.186,0.093375,0.13,8.0
50%,,0.545,0.425,0.14,0.79975,0.336,0.171,0.234,9.0
75%,,0.615,0.48,0.165,1.15325,0.502,0.253,0.329,11.0


The number of rings in an abalone's shell is used to estimate its age: more rings mean an older abalone. The model predicts the ring count (the `Rings` column), which serves as a proxy for age.
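
As a small illustration of how a ring count maps to an age: the UCI description of this dataset states that adding 1.5 to the number of rings gives the age in years. The helper name `rings_to_age_years` below is hypothetical and not part of the notebook; it is only a sketch of that conversion.

```python
def rings_to_age_years(rings):
    """Estimate age in years from a ring count (rings + 1.5, per the dataset notes)."""
    return rings + 1.5

# Ring counts taken from the first rows of the table above
for rings in [7, 9, 10]:
    print(f"{rings} rings -> about {rings_to_age_years(rings)} years")
```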

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4176 entries, 0 to 4175
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Sex             4176 non-null   object 
 1   Length          4176 non-null   float64
 2   Diameter        4176 non-null   float64
 3   Height          4176 non-null   float64
 4   Whole weight    4176 non-null   float64
 5   Shucked weight  4176 non-null   float64
 6   Viscera weight  4176 non-null   float64
 7   Shell weight    4176 non-null   float64
 8   Rings           4176 non-null   int64  
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB


In [None]:
data.isna().sum()

## Preprocessing

In [5]:
y = data.loc[:,'Rings']
y

Unnamed: 0,Rings
0,7
1,9
2,10
3,7
4,8
...,...
4171,11
4172,10
4173,9
4174,10


In [7]:
features = data.iloc[:,:-1]
features

Unnamed: 0,Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight
0,M,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700
1,F,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100
2,M,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550
3,I,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550
4,I,0.425,0.300,0.095,0.3515,0.1410,0.0775,0.1200
...,...,...,...,...,...,...,...,...
4171,F,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490
4172,M,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605
4173,M,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080
4174,F,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960


In [8]:
X = pd.get_dummies(features)
X

Unnamed: 0,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Sex_F,Sex_I,Sex_M
0,0.350,0.265,0.090,0.2255,0.0995,0.0485,0.0700,False,False,True
1,0.530,0.420,0.135,0.6770,0.2565,0.1415,0.2100,True,False,False
2,0.440,0.365,0.125,0.5160,0.2155,0.1140,0.1550,False,False,True
3,0.330,0.255,0.080,0.2050,0.0895,0.0395,0.0550,False,True,False
4,0.425,0.300,0.095,0.3515,0.1410,0.0775,0.1200,False,True,False
...,...,...,...,...,...,...,...,...,...,...
4171,0.565,0.450,0.165,0.8870,0.3700,0.2390,0.2490,True,False,False
4172,0.590,0.440,0.135,0.9660,0.4390,0.2145,0.2605,False,False,True
4173,0.600,0.475,0.205,1.1760,0.5255,0.2875,0.3080,False,False,True
4174,0.625,0.485,0.150,1.0945,0.5310,0.2610,0.2960,True,False,False


In [9]:
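X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)
# Train on a plain NumPy array; X_test stays a DataFrame and is converted with .values when predicting
X_train = X_train.values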
print(f"X train set length: {len(X_train)}, y train set length: {len(y_train)}")
print(f"X test set length: {len(X_test)}, y test set length: {len(y_test)}")

X train set length: 2923, y train set length: 2923
X test set length: 1253, y test set length: 1253
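
The split sizes printed above follow from the 30% test fraction. Here is a rough sketch of the arithmetic, assuming scikit-learn rounds the test count up and gives the remainder to the training set (which matches the printed numbers). The function name `split_sizes` is made up for this illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test count up."""
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test

# 4176 rows in the frame, 30% held out for testing
n_train, n_test = split_sizes(4176, 0.3)
print(n_train, n_test)
```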


## Training

In [10]:
rf_model = RandomForestRegressor(random_state=1, max_samples=0.5)
rf_model.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'criterion': 'squared_error',
 'max_depth': None,
 'max_features': 1.0,
 'max_leaf_nodes': None,
 'max_samples': 0.5,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': 1,
 'verbose': 0,
 'warm_start': False}

In [11]:
rf_model.fit(X_train, y_train)

## Testing

In [12]:
y_pred = rf_model.predict(X_test.values)
y_pred

array([ 9.75, 10.97, 10.19, ...,  9.96, 14.63, 10.02])

In [13]:
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse}")

MSE: 4.165322905027932


Because MSE is harder to interpret than an accuracy score, let's compare the model's predictions with the actual ring counts for a few test examples.

In [16]:
from random import seed, sample

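def sample_predictions(model):
  """Print predictions for 5 fixed test examples, then the full test-set MSE.

  Relies on the notebook-level X_test and y_test defined above.
  """
  seed(42)  # fixed seed so every model is shown the same 5 examples
  indices = sample(range(len(X_test)), 5)  # positional indices into the test set
  examples = X_test.iloc[indices,:].values
  labels = y_test.iloc[indices]
  predictions = model.predict(examples)

  for index, label, prediction in zip(indices, labels, predictions):
    print(f"Example {index}: The abalone's actual ring count is {label}, predicted ring count is {prediction}.")

  # Score the model on the whole test set, not just the 5 samples
  test_predictions = model.predict(X_test.values)
  print(f"MSE for this model is {mean_squared_error(y_test, test_predictions)}")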

In [17]:
sample_predictions(rf_model)

Example 228: The abalone's actual ring count is 9, predicted ring count is 9.05.
Example 51: The abalone's actual ring count is 14, predicted ring count is 11.71.
Example 563: The abalone's actual ring count is 17, predicted ring count is 15.2.
Example 501: The abalone's actual ring count is 10, predicted ring count is 14.1.
Example 457: The abalone's actual ring count is 6, predicted ring count is 8.63.
MSE for this model is 4.165322905027932
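
To put the error in the same units as the target, one option is the root mean squared error (the square root of MSE) or the mean absolute error. The sketch below is not part of the original notebook: it recomputes both from the five sampled examples printed above and takes the square root of the reported test-set MSE. The helper names are illustrative.

```python
import math

def mean_absolute_error_simple(actual, predicted):
    """Average absolute difference, in rings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def root_mean_squared_error_simple(actual, predicted):
    """Square root of the average squared difference, in rings."""
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return math.sqrt(mse)

# The five sampled examples printed above for rf_model
actual = [9, 14, 17, 10, 6]
predicted = [9.05, 11.71, 15.2, 14.1, 8.63]

sample_mae = mean_absolute_error_simple(actual, predicted)
sample_rmse = root_mean_squared_error_simple(actual, predicted)

# RMSE over the whole test set, from the MSE reported above
test_rmse = math.sqrt(4.165322905027932)

print(f"MAE on the 5 samples:  {sample_mae:.2f} rings")
print(f"RMSE on the 5 samples: {sample_rmse:.2f} rings")
print(f"RMSE on the test set:  {test_rmse:.2f} rings")
```

A test-set RMSE of roughly 2 means the model's errors are on the order of two rings.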


## Iteration

Let's try out a few different hyperparameter settings and see which version of the model performs best.

In [18]:
alternative_rf_model_1 = RandomForestRegressor(random_state=1, max_samples=0.5, max_features=1/3)
alternative_rf_model_1.fit(X_train,y_train)
sample_predictions(alternative_rf_model_1)

Example 228: The abalone's actual ring count is 9, predicted ring count is 9.54.
Example 51: The abalone's actual ring count is 14, predicted ring count is 11.25.
Example 563: The abalone's actual ring count is 17, predicted ring count is 14.25.
Example 501: The abalone's actual ring count is 10, predicted ring count is 13.61.
Example 457: The abalone's actual ring count is 6, predicted ring count is 8.65.
MSE for this model is 4.079034397446129


In [19]:
alternative_rf_model_2 = RandomForestRegressor(random_state=1, max_samples=0.5, max_features=1/3, n_estimators=200)
alternative_rf_model_2.fit(X_train, y_train)
sample_predictions(alternative_rf_model_2)

Example 228: The abalone's actual ring count is 9, predicted ring count is 9.06.
Example 51: The abalone's actual ring count is 14, predicted ring count is 11.21.
Example 563: The abalone's actual ring count is 17, predicted ring count is 14.68.
Example 501: The abalone's actual ring count is 10, predicted ring count is 13.42.
Example 457: The abalone's actual ring count is 6, predicted ring count is 8.58.
MSE for this model is 4.046719493216281


In [20]:
alternative_rf_model_3 = RandomForestRegressor(random_state=1, max_samples=0.5, max_features=1/3, n_estimators=500)
alternative_rf_model_3.fit(X_train,y_train)
sample_predictions(alternative_rf_model_3)

Example 228: The abalone's actual ring count is 9, predicted ring count is 9.148.
Example 51: The abalone's actual ring count is 14, predicted ring count is 11.126.
Example 563: The abalone's actual ring count is 17, predicted ring count is 14.586.
Example 501: The abalone's actual ring count is 10, predicted ring count is 13.32.
Example 457: The abalone's actual ring count is 6, predicted ring count is 8.604.
MSE for this model is 4.04973754509178


## Deployment

After the hyperparameters have been tuned to our satisfaction, we'd proceed with deployment. This lesson omits that code for the sake of space.
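
As a rough idea of one common first step toward deployment, a fitted model can be saved to disk and reloaded with `joblib`, which is installed alongside scikit-learn. This is an assumption about what deployment might involve, not the lesson's omitted code; the file name `abalone_rf.joblib` and the tiny synthetic training set are placeholders.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.random((50, 10))
y_demo = X_demo.sum(axis=1)

model = RandomForestRegressor(n_estimators=10, random_state=1)
model.fit(X_demo, y_demo)

# Save the fitted model, then load it back as a serving process would
path = os.path.join(tempfile.mkdtemp(), "abalone_rf.joblib")
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print(f"Predictions match after reload: {same}")
```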