In [1]:
pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


# Random Forest


Random Forest is one of the most popular and powerful machine learning algorithms. It is an ensemble method built on bootstrap aggregation (bagging): many decision trees are trained on random resamples of the training data, and their predictions are combined by voting.

![](https://miro.medium.com/v2/resize:fit:640/format:webp/1*i0o8mjFfCn-uD79-F1Cqkw.png)
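
As a rough illustration of bagging, the sketch below trains a handful of decision trees on bootstrap samples and combines them by majority vote. It uses scikit-learn's built-in iris data as a stand-in (an assumption, chosen only because it loads without a download), and the names `X_demo`, `y_demo`, `trees` and `bagged` are made up for this example. A real `RandomForestClassifier` also considers a random subset of features at each split, so this is a simplification and not the library's implementation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Stand-in data (assumption): iris instead of the penguins dataset.
X_demo, y_demo = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

trees = []
for _ in range(5):
    # Bootstrap: draw as many row indices as there are rows, with replacement.
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    trees.append(DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo[idx], y_demo[idx]))

# Aggregate: every tree votes and the most common class wins.
votes = np.stack([t.predict(X_demo) for t in trees])   # shape (n_trees, n_samples)
bagged = np.array([np.bincount(col).argmax() for col in votes.T])

print("Accuracy of the bagged vote on the training data:", (bagged == y_demo).mean())
```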

We will apply a Random Forest classifier to the task of classifying penguin species. To optimize performance and accuracy, we will focus on two key parameters: max_depth and n_estimators.

- n_estimators: number of trees in the forest
- max_depth: controls how deep each decision tree can grow
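
To see what the two parameters control in a fitted model, the following sketch inspects a small forest. It again uses the iris data as a stand-in (an assumption), and `forest` is a hypothetical name used only here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): iris instead of the penguins dataset.
X_demo, y_demo = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(X_demo, y_demo)

# n_estimators sets how many trees are fitted.
print("Number of trees:", len(forest.estimators_))

# max_depth is an upper bound; individual trees may stop earlier.
print("Depth per tree:", [t.get_depth() for t in forest.estimators_])
```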

In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [3]:
penguins = sns.load_dataset("penguins")
penguins = penguins.fillna(0)  # replaces every missing value with 0, also in text columns such as 'sex'
print(penguins.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
3  Adelie  Torgersen             0.0            0.0                0.0   
4  Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
0       3750.0    Male  
1       3800.0  Female  
2       3250.0  Female  
3          0.0       0  
4       3450.0  Female  
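
Row 3 in the output above shows the effect of `fillna(0)`: the missing measurements become zeros. The sketch below compares this with two common alternatives on a tiny made-up frame (`demo` and its values are hypothetical, chosen only to mimic one incomplete row).

```python
import numpy as np
import pandas as pd

# Hypothetical mini frame with one incomplete row.
demo = pd.DataFrame({
    "bill_length_mm": [39.1, np.nan, 40.3],
    "sex": ["Male", None, "Female"],
})

zero_filled = demo.fillna(0)   # NaN becomes 0, which drags the column mean down
dropped = demo.dropna()        # removes the incomplete row instead
mean_filled = demo.assign(     # fills the numeric gap with the column mean
    bill_length_mm=demo["bill_length_mm"].fillna(demo["bill_length_mm"].mean())
)

print("Mean after fillna(0):", zero_filled["bill_length_mm"].mean())
print("Mean after dropna(): ", dropped["bill_length_mm"].mean())
print("Mean-filled column:  ", mean_filled["bill_length_mm"].tolist())
```

Which option is appropriate depends on the dataset; for physical measurements a zero is rarely a plausible value.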


In [4]:
features = ['bill_length_mm', 'body_mass_g', 'flipper_length_mm']  # change this list per iteration, e.g. add 'bill_depth_mm'
X = penguins[features]
y = penguins['species']

In [5]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=101)
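
A quick way to check what a split like the one above produces is to look at the resulting sizes. The sketch below uses a made-up array with 344 rows and three class labels (the row count and class counts are assumptions meant to resemble the penguins data). The `stratify` argument is an optional addition that is not used in the cell above; it keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in: 344 rows, three classes (assumed counts).
X_demo = np.arange(344).reshape(-1, 1)
y_demo = np.array(["Adelie"] * 152 + ["Gentoo"] * 124 + ["Chinstrap"] * 68)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=101, stratify=y_demo
)

print("Train rows:", len(X_tr), "Test rows:", len(X_te))
print("Share of Adelie in all data:", np.mean(y_demo == "Adelie"))
print("Share of Adelie in test set:", np.mean(y_te == "Adelie"))
```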

In [6]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(criterion='entropy', n_estimators=5, max_depth=3)
rf.fit(X_train, y_train)  # fit the random forest to the training data
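
The forest above is created without a `random_state`, so the fitted trees, and with them the accuracies, can differ from run to run. A minimal sketch of making the result repeatable, assuming the iris data as a stand-in and a hypothetical helper `fit_and_predict`:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Stand-in data (assumption): iris instead of the penguins dataset.
X_demo, y_demo = load_iris(return_X_y=True)

def fit_and_predict(seed):
    """Fit a small forest with a fixed seed and return its predictions."""
    model = RandomForestClassifier(criterion="entropy", n_estimators=5,
                                   max_depth=3, random_state=seed)
    return model.fit(X_demo, y_demo).predict(X_demo)

# Two runs with the same seed build the same forest.
same = bool((fit_and_predict(42) == fit_and_predict(42)).all())
print("Identical predictions with the same seed:", same)
```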

In [1]:
pip install fpdf

Note: you may need to restart the kernel to use updated packages.


In [8]:
from sklearn import tree
import graphviz
from fpdf import FPDF

def plot_tree_classification(model, features, class_names, output_file='random_forest'):
    """Render every tree of a random forest to PNG and collect the images in one PDF."""
    if not isinstance(model, RandomForestClassifier):
        raise ValueError("The model is not a RandomForestClassifier.")

    pdf = FPDF()
    graph = None

    for i, tree_model in enumerate(model.estimators_):
        dot_data = tree.export_graphviz(tree_model, out_file=None,
                                        feature_names=features,
                                        class_names=class_names,
                                        filled=True, rounded=True,
                                        special_characters=True)

        # Turn the DOT description into a graph using graphviz
        graph = graphviz.Source(dot_data)

        # Save as PNG for embedding in the PDF; render() appends the
        # '.png' extension itself and returns the path of the image
        image_path = graph.render(filename=f"{output_file}_tree_{i+1}",
                                  format='png', cleanup=True)

        # Add each tree image to the PDF on its own page
        pdf.add_page()
        pdf.image(image_path, x=10, y=10, w=180)

    # Save the complete PDF
    pdf_output_file = f"{output_file}.pdf"
    pdf.output(pdf_output_file)

    print(f"All trees saved in {pdf_output_file}.")

    # Return the graph of the last tree so the notebook displays it
    return graph





In [None]:
feature_names = list(X.columns)  # plain lists are accepted by export_graphviz
class_names = list(np.sort(np.unique(y)).astype(str))
plot_tree_classification(rf, feature_names, class_names)
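
If graphviz or fpdf is not available, a text dump of a single tree can serve as a lightweight alternative to the PDF. The sketch below uses scikit-learn's `export_text` on the iris data as a stand-in (an assumption); it is a different technique from the PDF export above, and `forest` and `first_tree_rules` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Stand-in data (assumption): iris instead of the penguins dataset.
iris = load_iris()
forest = RandomForestClassifier(n_estimators=5, max_depth=3, random_state=0)
forest.fit(iris.data, iris.target)

# Print the decision rules of the first tree in the forest as indented text.
first_tree_rules = export_text(forest.estimators_[0],
                               feature_names=list(iris.feature_names))
print(first_tree_rules)
```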

In [None]:
def calculate_accuracy(predictions, actuals):
    """Return the fraction of predictions that match the actual labels."""
    if len(predictions) != len(actuals):
        raise ValueError("The number of predictions did not equal the number of actuals")

    return (predictions == actuals).sum() / len(actuals)
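
The same formula is available in scikit-learn as `accuracy_score`. The sketch below checks both on four made-up labels (the arrays `actual` and `predicted` are hypothetical).

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: three of the four predictions are correct.
actual = np.array(["Adelie", "Gentoo", "Chinstrap", "Adelie"])
predicted = np.array(["Adelie", "Gentoo", "Adelie", "Adelie"])

manual = (predicted == actual).sum() / len(actual)   # same formula as calculate_accuracy
library = accuracy_score(actual, predicted)

print("Manual accuracy: ", manual)
print("accuracy_score:  ", library)
```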

In [None]:
predictionsOnTrainset = rf.predict(X_train)
predictionsOnTestset = rf.predict(X_test)

accuracyTrain = calculate_accuracy(predictionsOnTrainset, y_train)
accuracyTest = calculate_accuracy(predictionsOnTestset, y_test)

print(f"Accuracy on training set: {accuracyTrain}")
print(f"Accuracy on test set: {accuracyTest}")

## Portfolio assignment 19
30 min: Train a random forest to predict one of the categorical columns of your **own** dataset.
- Prepare the data:<br>
    - <b>Note</b>: Some machine learning algorithms cannot handle missing values. You will need to either:
         - replace missing values (with the mean or the most frequent value). To replace missing values you can use .fillna(\<value\>) https://pandas.pydata.org/docs/reference/api/pandas.Series.fillna.html
         - remove rows with missing data. You can remove rows with missing data with .dropna() https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html <br>
- Split your dataset into a train (70%) and test (30%) set.
- Use the train set to fit a RandomForestClassifier. You are free to choose which columns to use as feature variables, and you are also free to choose the max_depth of the trees and the number of trees (n_estimators).
- Use your random forest model to make predictions for both the train and test set.
<br>
    
![](https://i.imgur.com/0v1CGNV.png)<br>
- Calculate the accuracy for both the train set predictions and test set predictions.
- Is the accuracy different? Did you expect this difference?
- Which number of trees, depth and features did you use in each cycle?
- Does the accuracy change between cycles? Did you expect this change?
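
One possible way to organise the cycles is a small loop over the settings that records train and test accuracy for each combination. This is only a sketch: it uses the iris data as a stand-in for your own dataset, and the candidate values and the names `results`, `n_trees` and `depth` are assumptions.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (assumption): replace with your own features and target.
X_demo, y_demo = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=101)

results = []
for n_trees, depth in product([1, 5, 25], [1, 3, None]):
    model = RandomForestClassifier(n_estimators=n_trees, max_depth=depth, random_state=0)
    model.fit(X_tr, y_tr)
    # score() returns the accuracy for classifiers
    results.append((n_trees, depth, model.score(X_tr, y_tr), model.score(X_te, y_te)))

for n_trees, depth, train_acc, test_acc in results:
    print(f"n_estimators={n_trees:>2} max_depth={depth!s:>4} "
          f"train={train_acc:.3f} test={test_acc:.3f}")
```

A large gap between train and test accuracy for a given setting is a hint that the forest is overfitting the training data.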



Findings: ...