# Lab 6

In [1]:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier as DTC

## Q1

Without stopping criteria, a decision tree will keep splitting until it can no longer lower the Gini impurity (or whatever split metric is being used). This happens when every node contains items of only one class.
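As a rough illustration of why a pure node ends the splitting, the sketch below computes Gini impurity by hand. The helper name `gini_impurity` and the toy label lists are our own and are not part of scikit-learn.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

mixed_node = [0, 0, 1, 1]   # two classes, evenly split
pure_node = [1, 1, 1, 1]    # a single class

print(gini_impurity(mixed_node))  # 0.5
print(gini_impurity(pure_node))   # 0.0
```

A node with impurity 0 cannot be improved by any further split, which is the stopping point described above.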

## Q2

Unlike linear models, decision trees do not rely on an earlier trend continuing across the whole feature range. As we have seen in class, with sufficient depth, a decision tree can closely follow a non-linear trendline.

## Q3

This is the definition of the *n_estimators* parameter.

## Q4

First we have to load the data.

In [2]:
from sklearn.datasets import load_breast_cancer as lbc
data = lbc()
X = pd.DataFrame(data = data.data, columns = data.feature_names)
y = data.target

We will start with Naive Bayes.

In [3]:
model = GaussianNB()
model.fit(X, y)
y_pred = model.predict(X)
cm = confusion_matrix(y,y_pred)

In [4]:
pd.DataFrame(cm)

|   | 0 | 1 |
|---|---|---|
| 0 | 189 | 23 |
| 1 | 10 | 347 |


Remember that 0 (malignant) is the positive case. In the matrix returned by `confusion_matrix`, the rows are the actual classes and the columns are the predicted classes. A false negative is an actual 0 predicted as 1, so we are looking at the top-right number (23).

Next, a single classification tree.

In [6]:
model = DTC(max_depth = 5, random_state = 1693)
model.fit(X, y)
y_pred = model.predict(X)
cm = confusion_matrix(y,y_pred)

In [7]:
pd.DataFrame(cm)

|   | 0 | 1 |
|---|---|---|
| 0 | 209 | 3 |
| 1 | 0 | 357 |


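With rows as actual classes, columns as predictions, and 0 as the positive class, there are three false negatives (top-right) and no false positives (bottom-left). Lastly, let's do a Random Forest classifier.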

In [8]:
model = RFC(random_state=1693, max_depth=5, n_estimators = 1000)
model.fit(X, y)
y_pred = model.predict(X)
cm = confusion_matrix(y,y_pred)

In [9]:
pd.DataFrame(cm)

|   | 0 | 1 |
|---|---|---|
| 0 | 209 | 3 |
| 1 | 0 | 357 |


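The random forest's matrix matches the single tree's: with rows as actual classes and 0 as the positive class, the top-right cell gives three false negatives. Thus, the Naive Bayes classifier had the most.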

## Q5

This is the definition of a false positive.

## Q6

True; Bayes' rule, as applied by a Naive Bayes classifier, yields a posterior probability for each class.
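To show where that probability comes from, here is a minimal sketch of Bayes' rule with a Gaussian likelihood for a single feature. The function names and all numbers (priors, means, standard deviations) are hypothetical and are not fitted to the lab data; a full Gaussian Naive Bayes model would multiply one such likelihood per feature.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posterior(x, priors, means, stds):
    """Posterior P(class | x) for one feature via Bayes' rule."""
    # prior * likelihood for each class
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    evidence = sum(joint)  # normalizing constant P(x)
    return [j / evidence for j in joint]

# Hypothetical two-class setup (illustrative values only).
probs = posterior(x=15.0, priors=[0.4, 0.6], means=[17.0, 12.0], stds=[3.0, 2.0])
print(probs)       # one probability per class
print(sum(probs))  # sums to 1 up to floating-point rounding
```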

## Q7

The root node is the "top" node, before any split has been made in the data.

## Q8

Soft classification gives a probability for each class, as opposed to returning only a single predicted class.
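A small sketch of the distinction, assuming we already have per-class probabilities for three observations. The probability values and the helper `hard_from_soft` are made up for illustration; in scikit-learn the corresponding calls are `predict_proba` (soft) and `predict` (hard).

```python
def hard_from_soft(class_probs):
    """Collapse per-class probabilities into the index of the most likely class."""
    return max(range(len(class_probs)), key=lambda k: class_probs[k])

# Soft classification: one probability per class for each observation.
soft_predictions = [
    [0.92, 0.08],
    [0.45, 0.55],
    [0.10, 0.90],
]

# Hard classification: only the single predicted class survives.
hard_predictions = [hard_from_soft(p) for p in soft_predictions]
print(hard_predictions)  # [0, 1, 1]
```

Note how the second observation is a close call (0.45 vs 0.55); the hard prediction hides that uncertainty.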

## Q9

This is the definition of posterior probability.

## Q10

As can be seen in Q4, these are the four ways in which data is grouped by a confusion matrix.
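To make the four groups concrete, here is a minimal sketch that counts them directly from label lists. The function name `confusion_counts` and the toy labels are hypothetical; the positive class is passed explicitly because which label counts as "positive" depends on the problem.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for a chosen positive label."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1  # positive case, correctly flagged
        elif actual == positive:
            fn += 1  # positive case, missed
        elif predicted == positive:
            fp += 1  # negative case, wrongly flagged
        else:
            tn += 1  # negative case, correctly cleared
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred, positive=0))
# {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```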

## Q11

Axons carry electrical impulses from one neuron to the next. In a neural network, they are best represented by the connections between layers, that is, a given neuron's outputs.

## Q12

A set of definitions.

## Q13

The difference between these two methods is that SGD uses only a random subset of the data for each step that reduces the cost function. Because the "surface" on which cost is minimized changes from step to step, a local minimum is less likely to trap the optimizer: it would have to persist across many different subsets to do so.
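The sketch below contrasts the two update rules on a toy one-parameter regression. Everything here (the synthetic data, learning rate, batch size, and function names) is an assumption chosen for illustration. This convex problem has a single minimum, so it only shows how the updates differ, not the escape from local minima.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # true slope is 3

def gradient(w, X_part, y_part):
    """Gradient of mean squared error for the model y = w * x."""
    residual = X_part[:, 0] * w - y_part
    return 2.0 * np.mean(residual * X_part[:, 0])

def batch_gd(X, y, lr=0.1, steps=200):
    w = 0.0
    for _ in range(steps):
        w -= lr * gradient(w, X, y)  # every step sees all rows
    return w

def sgd(X, y, lr=0.1, steps=200, batch_size=10, seed=1):
    batch_rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        idx = batch_rng.choice(len(y), size=batch_size, replace=False)
        w -= lr * gradient(w, X[idx], y[idx])  # each step sees a random subset
    return w

print(batch_gd(X, y))  # close to 3
print(sgd(X, y))       # also close to 3, with some noise from the sampling
```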

## Q14

In [10]:
# First, reduce X to just our two features.
X = X[['mean radius','mean texture']]

In [11]:
# Now, initializing and fitting the model with 10 folds
model = GaussianNB()
accuracies = []

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
kf = KFold(n_splits=10, random_state=1693,shuffle=True)
for train_index, test_index in kf.split(X):
    X_train = X.values[train_index]
    y_train = y[train_index]
    X_test = X.values[test_index]
    y_test = y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test,y_pred))
    

In [12]:
np.mean(accuracies)

0.8805137844611528

Our mean accuracy is $0.8805$.
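For intuition about what the fold loop above is doing, here is a rough NumPy-only sketch of a shuffled 10-fold split over the 569 rows of this dataset. The helper `kfold_indices` is our own; it will not reproduce the exact shuffle that scikit-learn's `KFold` produces for the same seed, only the same partition structure.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs that partition shuffled row indices."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)  # near-equal chunks
    for i, test_idx in enumerate(folds):
        # Training rows are all folds except the held-out one.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

fold_sizes = [len(test) for _, test in kfold_indices(569, 10, seed=1693)]
print(fold_sizes)  # [57, 57, 57, 57, 57, 57, 57, 57, 57, 56]
```

Each row lands in exactly one test fold, so every observation is scored once by a model that never saw it during fitting.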

## Q15

We can largely use the same method as the previous question.

In [13]:
# Same 10-fold loop as before, now with a random forest
model = RFC(random_state=1693, max_depth=7, n_estimators = 100)
accuracies = []

from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
kf = KFold(n_splits=10, random_state=1693,shuffle=True)
for train_index, test_index in kf.split(X):
    X_train = X.values[train_index]
    y_train = y[train_index]
    X_test = X.values[test_index]
    y_test = y[test_index]
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracies.append(accuracy_score(y_test,y_pred))
    

In [14]:
np.mean(accuracies)

0.8804824561403508

Rounded, we have the same result: $0.8805$.

## Q16

In [None]:
# First, we need keras loaded
!pip install tensorflow
!pip install keras

In [17]:
from tensorflow import keras
import tensorflow as tf

Next, we will construct our neural network.

In [18]:
model = tf.keras.Sequential([
    keras.layers.Dense(16, activation=tf.nn.relu),
    keras.layers.Dense(8, activation=tf.nn.softmax),
    keras.layers.Dense(4, activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid),
])
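As a quick sanity check on the size of this network, the sketch below counts the trainable parameters of a stack of dense layers by hand. It assumes the two input features selected in Q14; the helper `dense_param_count` is hypothetical and not a Keras function.

```python
def dense_param_count(n_inputs, layer_widths):
    """Weights plus biases for each fully connected layer in a stack."""
    counts = []
    fan_in = n_inputs
    for width in layer_widths:
        counts.append(fan_in * width + width)  # weight matrix + bias vector
        fan_in = width                         # this layer feeds the next
    return counts

per_layer = dense_param_count(n_inputs=2, layer_widths=[16, 8, 4, 1])
print(per_layer)       # [48, 136, 36, 5]
print(sum(per_layer))  # 225
```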


In [22]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [None]:
model.fit(X, y, epochs = 150, validation_split = 0.25, batch_size = 10, shuffle = False)

In [24]:
accuracies = []
kf = KFold(n_splits=10, random_state=1693,shuffle=True)
for train_index, test_index in kf.split(X):
    X_train = X.values[train_index]
    y_train = y[train_index]
    X_test = X.values[test_index]
    y_test = y[test_index]
    # Note: the same model object is reused, so its weights carry over from fold to fold
    model.fit(X_train, y_train, epochs = 150, validation_split = 0.25, batch_size = 10, shuffle = False, verbose = 0)
    # evaluate prints and returns [loss, accuracy] for the held-out fold
    model.evaluate(X_test, y_test)



In [25]:
# Accuracies copied from the model.evaluate output above, one per fold
accuracies = [.9474, .8421, .8947, .8772, .9298, .8596, .9298, .8947, .8947, .9107]

In [26]:
np.mean(accuracies)

0.89807

From the answer choices, the closest to $89.807\%$ is $89\%$.