## Discussion
- Many computationally expensive tasks in ML can be run in parallel by splitting the work across CPU cores.
- Common ML tasks that can be parallelized:
    - Ensembles of decision trees (for example, the random forest classifier)
    - Evaluating a model with a resampling procedure (for example, k-fold cross-validation)
    - Tuning hyperparameters (for example, grid search)
- Using multiple cores can significantly decrease execution time, although the gain depends on how many cores the machine has and usually levels off as more are added.

### Scikit-learn support for parallel processing
- scikit-learn exposes this capability through the **n_jobs** argument, which sets the number of jobs to run in parallel (`-1` uses all available cores).
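
A minimal sketch of how the argument is typically set. An integer requests that many workers, and `n_jobs=-1` requests all available cores. `os.cpu_count()` is used here only to show how many logical cores the machine reports; the dataset size and the number of trees are arbitrary choices for illustration, not values from this notebook.

```python
import os
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# number of logical cores reported by the operating system
print(f'logical cores: {os.cpu_count()}')

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, n_redundant=5, random_state=3)

# n_jobs=-1 asks scikit-learn to use all available cores
model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=3)
model.fit(X, y)
print(f'trees fitted: {len(model.estimators_)}')
```

Passing a positive integer instead of `-1` caps the number of workers, which is what the timing loops in the following sections vary.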


In [None]:
# Load libraries 
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from matplotlib import pyplot

### Multicore: Model Training

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)

results = []
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
for n in n_cores:
    start = time()
    model = RandomForestClassifier(n_estimators=500, n_jobs=n)
    model.fit(X, y)
    end = time()
    result = end - start
    print(f'{n} -> seconds - {result}')
    results.append(result)
pyplot.plot(n_cores, results)
pyplot.show()

### Multicore: Model Evaluation

In [None]:
# Load libraries
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

In [None]:
# define dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=15, n_redundant=5, random_state=3)
results = []
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
for n in n_cores:
    model = RandomForestClassifier(n_estimators=100, n_jobs=1)
    kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)
    start = time()
    n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=kfold, n_jobs=n)
    end = time()
    result = end - start
    print(f'{n} -> seconds - {result}')
    results.append(result)
pyplot.plot(n_cores, results)
pyplot.show()

### Multicore: Model Training Combined with Model Evaluation
- Set `n_jobs=4` on the model.
- Set `n_jobs=4` in `cross_val_score`.
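
The two settings above can be sketched as follows. This is an illustrative cell, not a measured result: the dataset is reduced to 1,000 samples and the forest to 50 trees (both assumptions made here) so that it runs quickly. With both levels set to 4, up to 16 workers may be requested in total, which can oversubscribe a machine with fewer cores, so it is worth timing this against the single-level settings rather than assuming it is faster.

```python
from time import time
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold

# define dataset (smaller than in the cells above, to keep the run short)
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)

# parallelize tree building inside each model fit
model = RandomForestClassifier(n_estimators=50, n_jobs=4)
kfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)

# parallelize the cross-validation folds as well
start = time()
n_scores = cross_val_score(model, X, y, scoring='accuracy', cv=kfold, n_jobs=4)
end = time()

print(f'seconds - {end - start}')
print(f'mean accuracy - {n_scores.mean()}')
```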

### Multicore: Hyperparameter Tuning

In [None]:
# Load Libraries
from sklearn.model_selection import GridSearchCV

In [None]:

# define dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, n_redundant=5, random_state=3)

model = RandomForestClassifier(n_estimators=100, n_jobs=4)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3)

# define grid
grid = dict()
grid['n_estimators'] = [25, 50, 75, 95]
# define grid search
n_cores = [1, 2, 3, 4, 5, 6, 7, 8]
for n in n_cores:
    gs = GridSearchCV(model, grid, n_jobs=n, cv=cv)
    # record current time
    start = time()
    gs.fit(X, y)
    end = time()
    result = end - start
    print(f'{n} -> seconds - {result}')
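
The loop above only reports timings. As a rough sketch of how a fitted search can be inspected afterwards, the cell below runs one small search and reads the `best_params_` and `best_score_` attributes of `GridSearchCV`. The 3-fold cross-validation, the 500-sample dataset, and `n_jobs=2` are assumptions chosen to keep the example short; they are not the settings used in the timing loop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# small synthetic dataset (sizes chosen only for illustration)
X, y = make_classification(n_samples=500, n_features=20, n_informative=15, n_redundant=5, random_state=3)

# single-threaded model; the parallelism comes from the search itself
model = RandomForestClassifier(n_jobs=1, random_state=3)

# same grid as in the timing loop
grid = {'n_estimators': [25, 50, 75, 95]}

# evaluate the grid candidates in parallel across 2 workers
gs = GridSearchCV(model, grid, n_jobs=2, cv=3, scoring='accuracy')
gs.fit(X, y)

print(f'best params - {gs.best_params_}')
print(f'best score - {gs.best_score_}')
```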