# Objective:

The objective of this notebook is to get a rough feel for CPU and RAM utilisation while a Random Forest model is trained on a 2GB csv file. I kept the laptop in high performance mode and did not open any additional applications like Chrome during this test, but did not go to the extent of fixing the CPU clock speed in BIOS.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

In [2]:
headers = ['label', 'lepton pT', 'lepton eta', 'lepton phi', 'missing energy magnitude', 
           'missing energy phi', 'jet 1 pt', 'jet 1 eta', 'jet 1 phi', 
           'jet 1 b-tag', 'jet 2 pt', 'jet 2 eta', 'jet 2 phi', 'jet 2 b-tag', 
           'jet 3 pt', 'jet 3 eta', 'jet 3 phi', 'jet 3 b-tag', 'jet 4 pt', 
           'jet 4 eta', 'jet 4 phi', 'jet 4 b-tag', 'm_jj', 'm_jjj', 'm_lv', 
           'm_jlv', 'm_bb', 'm_wbb', 'm_wwbb']

In [3]:
df = pd.read_csv("data/HIGGS3.csv", names=headers)

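**Increase in RAM usage after loading dataframe: 0.6GB. The csv file is about 2GB.** The difference arises because a csv file stores numbers as text, at 1 byte per character plus a delimiter, while a dataframe stores them in fixed-width binary. For example, 12345 takes 5 bytes as text, but only 2 bytes when stored as a 16-bit integer. In this dataset the columns are float64 (see `df.info()` below), so each value takes 8 bytes in memory no matter how many digits it was written with: 3,000,000 rows x 29 columns x 8 bytes is roughly 0.7GB (about 0.65GiB), which is consistent with the measured increase.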
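The text-versus-binary difference can be checked with a few lines of standard-library Python. This is only a back-of-the-envelope sketch: the row and column counts are taken from the `df.info()` output in this notebook, the 2GB file size is approximate, and the variable names are made up for illustration.

```python
import struct

# The same number stored as csv text versus fixed-width binary.
text_bytes = len("12345".encode("ascii"))        # one byte per character
int16_bytes = len(struct.pack("<h", 12345))      # 16-bit integer
float64_bytes = len(struct.pack("<d", 12345.0))  # 64-bit float, the dtype of this dataframe
print(text_bytes, int16_bytes, float64_bytes)

# Expected in-memory size of the dataframe: rows x columns x bytes per value.
n_rows, n_cols = 3_000_000, 29
frame_bytes = n_rows * n_cols * float64_bytes
print(f"{frame_bytes / 1e9:.2f} GB, {frame_bytes / 2**30:.2f} GiB")

# Average number of csv bytes per value, assuming the file is about 2 GB.
csv_bytes = 2e9
print(f"{csv_bytes / (n_rows * n_cols):.0f} bytes per value in the csv")
```

On a real dataframe, `df.memory_usage(deep=True).sum()` should report a similar figure directly.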

In [4]:
df.head()

| | label | lepton pT | lepton eta | lepton phi | missing energy magnitude | missing energy phi | jet 1 pt | jet 1 eta | jet 1 phi | jet 1 b-tag | ... | jet 4 eta | jet 4 phi | jet 4 b-tag | m_jj | m_jjj | m_lv | m_jlv | m_bb | m_wbb | m_wwbb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.869293 | -0.635082 | 0.22569 | 0.32747 | -0.689993 | 0.754202 | -0.248573 | -1.092064 | 0.0 | ... | -0.010455 | -0.045767 | 3.101961 | 1.35376 | 0.979563 | 0.978076 | 0.920005 | 0.721657 | 0.988751 | 0.876678 |
| 1 | 1.0 | 0.907542 | 0.329147 | 0.359412 | 1.49797 | -0.31301 | 1.095531 | -0.557525 | -1.58823 | 2.173076 | ... | -1.13893 | -0.000819 | 0.0 | 0.30222 | 0.833048 | 0.9857 | 0.978098 | 0.779732 | 0.992356 | 0.798343 |
| 2 | 1.0 | 0.798835 | 1.470639 | -1.635975 | 0.453773 | 0.425629 | 1.104875 | 1.282322 | 1.381664 | 0.0 | ... | 1.128848 | 0.900461 | 0.0 | 0.909753 | 1.10833 | 0.985692 | 0.951331 | 0.803252 | 0.865924 | 0.780118 |
| 3 | 0.0 | 1.344385 | -0.876626 | 0.935913 | 1.99205 | 0.882454 | 1.786066 | -1.646778 | -0.942383 | 0.0 | ... | -0.678379 | -1.360356 | 0.0 | 0.946652 | 1.028704 | 0.998656 | 0.728281 | 0.8692 | 1.026736 | 0.957904 |
| 4 | 1.0 | 1.105009 | 0.321356 | 1.522401 | 0.882808 | -1.205349 | 0.681466 | -1.070464 | -0.921871 | 0.0 | ... | -0.373566 | 0.113041 | 0.0 | 0.755856 | 1.361057 | 0.98661 | 0.838085 | 1.133295 | 0.872245 | 0.808487 |


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000000 entries, 0 to 2999999
Data columns (total 29 columns):
label                       float64
lepton pT                   float64
lepton eta                  float64
lepton phi                  float64
missing energy magnitude    float64
missing energy phi          float64
jet 1 pt                    float64
jet 1 eta                   float64
jet 1 phi                   float64
jet 1 b-tag                 float64
jet 2 pt                    float64
jet 2 eta                   float64
jet 2 phi                   float64
jet 2 b-tag                 float64
jet 3 pt                    float64
jet 3 eta                   float64
jet 3 phi                   float64
jet 3 b-tag                 float64
jet 4 pt                    float64
jet 4 eta                   float64
jet 4 phi                   float64
jet 4 b-tag                 float64
m_jj                        float64
m_jjj                       float64
m_lv                   

In [6]:
y = df['label'].values
print(y.shape)
X = df.drop("label", axis=1).values
print(X.shape)

(3000000,)
(3000000, 28)


**Increase in RAM usage after creating X and y numpy arrays: 0.6GB.**

In [7]:
X_train, X_valid, y_train, y_valid = train_test_split(X,y,test_size=0.25, random_state=33, stratify=y)

**Increase in RAM usage after creating X_train, X_valid, y_train, y_valid from train_test_split: 0.7GB.**
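The two increases measured above (0.6GB for `X` and `y`, 0.7GB for the split) can be compared against the raw array sizes. The sketch below is plain arithmetic, assuming every array is a full float64 copy rather than a view; the constant and variable names are only illustrative.

```python
BYTES_PER_FLOAT64 = 8
n_rows, n_features = 3_000_000, 28

# X and y as created from the dataframe (assumed to be full copies).
x_bytes = n_rows * n_features * BYTES_PER_FLOAT64
y_bytes = n_rows * BYTES_PER_FLOAT64
print(f"X + y: {(x_bytes + y_bytes) / 2**30:.2f} GiB")

# train_test_split with test_size=0.25 copies every row into either the
# training or the validation arrays, so it roughly duplicates X and y.
n_valid = int(n_rows * 0.25)
n_train = n_rows - n_valid
split_bytes = (n_train + n_valid) * (n_features + 1) * BYTES_PER_FLOAT64
print(f"train rows: {n_train}, valid rows: {n_valid}")
print(f"split arrays: {split_bytes / 2**30:.2f} GiB")
```

Both figures come out at about 0.65GiB, close to the increases observed, which suggests the original `df`, `X` and `y` all stay in memory alongside the split arrays.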

## 0. Laptop Specs

In [1]:
import multiprocessing
multiprocessing.cpu_count()

12

- **CPU**: Intel Core i7-9750H CPU @ 2.60GHz
- **RAM**: 15.8GB

Note that there are 6 physical cores but 12 logical processors due to Hyper-Threading, where each core can run 2 threads. `multiprocessing.cpu_count()` reports the logical count.
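To see both numbers from Python, a small sketch like the following could be used. The standard library only reports logical processors; the physical core count here relies on the third-party `psutil` package, which is an assumption and was not used in this notebook, so the code falls back to `None` if it is not installed.

```python
import os

# Logical processors (hardware threads), same figure as multiprocessing.cpu_count().
logical_cpus = os.cpu_count()

# Physical cores need a third-party helper; psutil is optional here.
try:
    import psutil
    physical_cores = psutil.cpu_count(logical=False)
except ImportError:
    physical_cores = None

print("logical processors:", logical_cpus)
print("physical cores:", physical_cores)
```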

## 1. Run on Laptop with 1 Core, n_estimators=10

In [9]:
model = RandomForestClassifier(n_estimators=10)
%time model.fit(X_train, y_train)

Wall time: 2min 48s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

In [10]:
%time y_pred = model.predict(X_valid)
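# Note: roc_auc_score expects (y_true, y_score). The arguments are swapped here and
# hard class predictions are passed, so this is not the usual probability-based ROC AUC.
# The same applies to the validation cells in sections 2 and 3.
print(roc_auc_score(y_pred, y_valid))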

Wall time: 3.2 s
0.7085671514744273


**Increase in RAM usage after training with n_estimators=10: 0.4GB.**
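A side note on the validation score: `roc_auc_score` takes the true labels first and the scores second, and it is normally given probabilities rather than hard class predictions. The toy example below (made-up labels and scores, not the HIGGS data) shows that the three variants can give different numbers. With the fitted model, the scores would come from something like `model.predict_proba(X_valid)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and positive-class probabilities for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.9])
y_hard = (y_score >= 0.5).astype(int)  # what model.predict() would return

auc_scores = roc_auc_score(y_true, y_score)   # usual ROC AUC from probabilities
auc_hard = roc_auc_score(y_true, y_hard)      # hard predictions lose the ranking
auc_swapped = roc_auc_score(y_hard, y_true)   # argument order used in this notebook

print(auc_scores, auc_hard, auc_swapped)
```

The values printed in this notebook are still useful for comparing the runs with each other, but they should not be read as standard ROC AUC figures.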

## 2. Run on Laptop with 6 Cores, n_estimators=10

In [12]:
model = RandomForestClassifier(n_estimators=10, n_jobs=-1)
%time model.fit(X_train, y_train)

Wall time: 35.7 s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [13]:
%time y_pred = model.predict(X_valid)
print(roc_auc_score(y_pred, y_valid))

Wall time: 1.28 s
0.7074790498053475


Time to fit the model dropped from 2min 48s with 1 core to 35.7s with all 6 cores, a speedup of roughly 4.7x. CPU utilisation jumped to 100% during training. Note that the Random Forest algorithm allows multiple decision trees to be built in parallel because each tree is trained independently of the others, hence the `n_jobs` parameter (`n_jobs=-1` means use all processors). Boosting ensembles such as AdaBoost or Gradient Boosting, on the other hand, have to build their trees sequentially, since each tree depends on the errors of the previous ones.
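The speedup and how well the 6 cores were used can be worked out from the two wall times above. This is simple arithmetic on the measured values; the variable names are illustrative.

```python
serial_s = 2 * 60 + 48   # 2min 48s with the default n_jobs
parallel_s = 35.7        # with n_jobs=-1
physical_cores = 6

speedup = serial_s / parallel_s
efficiency = speedup / physical_cores  # 1.0 would be perfect scaling

print(f"speedup: {speedup:.2f}x")
print(f"parallel efficiency on {physical_cores} cores: {efficiency:.0%}")
```

The speedup falls short of 6x, which is not unusual given that only 10 trees are shared among 12 logical processors and that clock speeds were not fixed.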

## 3. Run on Laptop with 6 Cores, n_estimators=100

In [14]:
model = RandomForestClassifier(n_estimators=100, n_jobs=-1)
%time model.fit(X_train, y_train)

Wall time: 5min 49s


RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=None, verbose=0,
                       warm_start=False)

In [15]:
%time y_pred = model.predict(X_valid)
print(roc_auc_score(y_pred, y_valid))

Wall time: 11.3 s
0.7398917458491556


As n_estimators has been increased from 10 to 100, there are 10x more decision trees to be built. The time taken increased from 35.7s to 5min 49s (349s), which is about 9.8x longer, so fit time scales roughly linearly with the number of trees.
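Dividing each wall time by the number of trees makes the scaling easier to see. The figures below are the ones measured in sections 1 to 3; the dictionary layout is just for illustration.

```python
# (wall time in seconds, number of trees) for each run in this notebook
runs = {
    "1 core, 10 trees": (2 * 60 + 48, 10),
    "6 cores, 10 trees": (35.7, 10),
    "6 cores, 100 trees": (5 * 60 + 49, 100),
}

per_tree = {name: seconds / trees for name, (seconds, trees) in runs.items()}
for name, seconds in per_tree.items():
    print(f"{name}: {seconds:.2f} s per tree")

# Ratio of the two parallel runs: 10x the trees took this many times longer.
time_ratio = runs["6 cores, 100 trees"][0] / runs["6 cores, 10 trees"][0]
print(f"time ratio: {time_ratio:.1f}x")
```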

**Increase in RAM usage after training with n_estimators=100: 3.6GB.** This is 9x more than for 10 n_estimators, which is consistent with roughly 10x more decision tree nodes and leaves having to be kept in memory. With `max_depth=None`, each tree is grown until its leaves are pure or cannot be split further, so individual trees trained on 2.25 million rows can be large. The bootstrapped samples drawn for each tree may also contribute, although they are only needed while that tree is being built. At this stage, the total RAM utilisation of the laptop is about 11GB. Before the dataframe was loaded, RAM utilisation was 4.6GB, which means about 6.4GB of RAM has been used by this notebook.