# Naive Bayes Classifier

A simple, probabilistic classifier family

- Assume that features are conditionally independent, given the class.
- Highly efficient learning and prediction, but generalization performance may be worse than that of more sophisticated learning methods.

**Naive Bayes classifier types**:
- Bernoulli: binary features (e.g. word presence/absence)
- Multinomial: discrete features (e.g. word counts)
- Gaussian: continuous/real-valued features

Statistics the Gaussian variant computes for each class:
- For each feature: mean, standard deviation
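To make the per-class statistics concrete, here is a minimal from-scratch sketch of the Gaussian variant using only the standard library. The function names `fit_gaussian_nb` and `predict_gaussian_nb` and the toy data are made up for illustration; this is not how scikit-learn implements `GaussianNB`, only the same idea under the conditional-independence assumption.

```python
import math
from collections import defaultdict
from statistics import mean, pstdev

def fit_gaussian_nb(X, y):
    """Return {class: (prior, [(mean, std) for each feature])}."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        # zip(*rows) iterates over feature columns within this class
        stats = [(mean(col), pstdev(col)) for col in zip(*rows)]
        model[label] = (len(rows) / len(X), stats)
    return model

def predict_gaussian_nb(model, row, eps=1e-9):
    """Pick the class with the highest log prior + summed log likelihoods."""
    def log_gauss(x, mu, sigma):
        var = sigma ** 2 + eps  # eps guards against zero variance
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

    scores = {}
    for label, (prior, stats) in model.items():
        # Conditional independence: per-feature log likelihoods simply add up
        scores[label] = math.log(prior) + sum(
            log_gauss(x, mu, sd) for x, (mu, sd) in zip(row, stats))
    return max(scores, key=scores.get)

# Toy data (hypothetical): two features, two classes
X_toy = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.8, 0.6]]
y_toy = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X_toy, y_toy)
print('Class 0 (mean, std) per feature:', model[0][1])
pred_a = predict_gaussian_nb(model, [1.1, 2.0])
pred_b = predict_gaussian_nb(model, [3.1, 0.6])
print('Predictions:', pred_a, pred_b)
```

Training only needs one pass over the data to collect a mean and a standard deviation per feature and class, which is why learning is so cheap.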

In [None]:
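from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from adspy_shared_utilities import plot_class_regions_for_classifier

# X_C2, y_C2 are assumed to be defined earlier in the notebook
X_train, X_test, y_train, y_test = train_test_split(X_C2, y_C2, random_state=0)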

nbclf = GaussianNB().fit(X_train, y_train)
plot_class_regions_for_classifier(nbclf, X_train, y_train, X_test, y_test,
                                 'Gaussian Naive Bayes classifier: Dataset 1')

print('Accuracy of GaussianNB classifier on training set: {:.2f}'
     .format(nbclf.score(X_train, y_train)))
print('Accuracy of GaussianNB classifier on test set: {:.2f}'
     .format(nbclf.score(X_test, y_test)))

# Ensembles of Decision Trees

## Random forests

**sklearn.ensemble module**:
- Classification: RandomForestClassifier
    - Each tree gives a probability for each class.
    - Probabilities are averaged across trees.
    - Predict the class with the highest averaged probability.

- Regression: RandomForestRegressor
    - Predict the mean of the individual tree predictions.
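The averaging step can be checked directly against a fitted forest. The sketch below assumes a synthetic dataset from `make_classification` (not one of the notebook's datasets) and uses the fitted `estimators_` attribute to reach the individual trees. The names `X_demo`, `y_demo`, and `forest` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=3, n_classes=3,
                                     n_clusters_per_class=1, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Per-tree class probabilities, shape (n_trees, n_samples, n_classes)
tree_probs = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Average across trees, then take the most probable class
avg_probs = tree_probs.mean(axis=0)
manual_pred = forest.classes_[avg_probs.argmax(axis=1)]

probs_match = np.allclose(avg_probs, forest.predict_proba(X_demo))
preds_match = np.array_equal(manual_pred, forest.predict(X_demo))
print('Averaged tree probabilities match predict_proba:', probs_match)
print('Manual argmax matches predict:', preds_match)
```

For `RandomForestRegressor` the same pattern applies with `tree.predict` in place of `tree.predict_proba`, averaging the predicted values instead of class probabilities.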

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2,
                                                   random_state = 0)

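# fit one forest per pair of features and report its test accuracy
pair_list = [[0,1], [0,2], [0,3], [1,2], [1,3], [2,3]]
for pair in pair_list:
    X = X_train[:, pair]
    y = y_train

    clf = RandomForestClassifier().fit(X, y)
    print('Features {}: accuracy on test set: {:.2f}'
         .format(pair, clf.score(X_test[:, pair], y_test)))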
    
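# forest trained on all features
clf = RandomForestClassifier(n_estimators = 10, random_state = 0).fit(X_train, y_train)
# Alternative: cap the features considered at each split
# (needs at least 8 features in X_train)
# clf = RandomForestClassifier(max_features = 8, random_state = 0).fit(X_train, y_train)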

print('Accuracy of RF classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))

Random Forest Parameters:
- **n_estimators**: number of trees to use in the ensemble (default: 100 since scikit-learn 0.22; 10 in earlier versions). Should be larger for larger datasets to reduce overfitting (but uses more computation).
- **max_features**: has a strong effect on performance because it controls the diversity of trees in the forest. Learning is quite sensitive to it: setting max_features = 1 leads to forests with diverse, more complex trees.
- **max_depth**: controls the depth of each tree (default: None, which splits until all leaves are pure).
- **n_jobs**: how many cores to use in parallel during training.
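One way to see the effect of these parameters is to sweep `max_features` on a small problem. The sketch below assumes a synthetic dataset from `make_classification`, not one of the notebook's datasets. The values tried (1, 3, 10) and the names `X_demo`, `results`, and `mean_depth` are chosen only for illustration, and the scores will differ on real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data with 10 features (illustration only)
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

results = {}
for max_features in (1, 3, 10):
    forest = RandomForestClassifier(n_estimators=50,
                                    max_features=max_features,
                                    max_depth=None,  # grow until leaves are pure
                                    n_jobs=-1,       # use all available cores
                                    random_state=0).fit(X_tr, y_tr)
    # Average tree depth is a rough proxy for tree complexity
    mean_depth = (sum(tree.get_depth() for tree in forest.estimators_)
                  / len(forest.estimators_))
    results[max_features] = (forest.score(X_tr, y_tr),
                             forest.score(X_te, y_te),
                             mean_depth)
    print('max_features={:2d}: train {:.2f}, test {:.2f}, mean depth {:.1f}'
          .format(max_features, *results[max_features]))
```

Comparing the mean depth and the gap between training and test accuracy across rows gives a quick feel for how much `max_features` changes the individual trees.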

## Gradient-boosted decision trees

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_D2, y_D2, random_state = 0)

# fit the model before scoring it
clf = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2, random_state = 0).fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'
     .format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'
     .format(clf.score(X_test, y_test)))