# Machine Learning - Decision Tree - Malignant or Benign?

<center><img src="../images/generated/Gemini_Generated_Image_8fu90a8fu90a8fu9.jpeg" width="400"></center>
<br>
<br>
In this activity we will explore a medical diagnosis dataset and apply a classification model.

## Import Libraries

As usual, we will start by importing the libraries we need.

__Note:__ we will import additional libraries later in the activity.

In [None]:
## Begin Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
## End Imports

## Load Dataset

The dataset for this activity is the Scikit-learn breast cancer dataset.

The goal is to predict if a breast tumor is malignant (cancerous) or benign (non-cancerous) based on features extracted from a digitized image of a fine needle aspirate (FNA) of a breast mass.

(_AI Generated Summary Begins Here_)
Data Set Summary:

* __Number of Samples:__ 569 tumor samples.

* __Features:__ 30 numerical features that describe the characteristics of the cell nuclei in each image. Examples include the mean radius, texture, perimeter, and area.

* __Target:__ The target variable is the diagnosis, with two possible classes:
    * __`0`:__ Malignant (212 samples)
    * __`1`:__ Benign (357 samples)

(_AI Generated Summary Ends Here - Credit Gemini_)

__Loading the Dataset:__
* `load_breast_cancer()` returns a dictionary-like object (a Scikit-learn `Bunch`) containing the data we will use:
    * `data` - an array of feature values
    * `target` - an array of target values where `0` is `malignant` and `1` is `benign`
    * `feature_names` - column names
    * `target_names` - an array of the target names

We will load the following items into a DataFrame for inspection:

* data
* feature_names

In [None]:
# Load Data
breast_cancer = load_breast_cancer()
data = breast_cancer["data"]
columns = breast_cancer["feature_names"]
target = breast_cancer["target"]
target_columns = breast_cancer["target_names"]

# Load Data Into DataFrame
X = pd.DataFrame(data=data, 
                 columns=columns)

y = pd.Series(data=target)

## Getting Familiar With the Data

In [None]:
# View X's Info


In [None]:
# Preview X's Rows


In [None]:
# View Target Info


In [None]:
# View Target Value Counts


## Inspecting the Shape of the Features and Target

In [None]:
# Inspect Feature (X) Shape


In [None]:
# Inspect Target (y) Shape


## Visualizing the Distribution of Data

In [None]:
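## Visualize Distribution of Feature Data
X.hist(figsize=(12,7),
       bins=30,
       edgecolor="black")
# suptitle labels the whole figure; plt.title would only label the last subplot
plt.suptitle("Distribution of Feature Data")
# tight_layout sets the spacing between subplots so their titles do not overlap
plt.tight_layout()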

## Visualizing Target Data

In [None]:
# Bar Chart of Total Counts of Malignant and Benign
y.value_counts().plot(kind="bar",
            figsize=(10,5))

plt.ylabel("Count")
plt.xlabel("Diagnosis")
plt.xticks(rotation=0)
plt.title("Tumor Classification Counts")

## Splitting up the Data for Training

In [None]:
# Split Up Data
X_train, X_test, y_train, y_test = 

## Inspecting the Shape and Features of the Training and Test Data

In [None]:
# X Train Shape


In [None]:
# X Test Shape


In [None]:
# y Train Shape


In [None]:
# y Test Shape


## Build Model - Decision Tree

__Importing Decision Tree Classifier__:

```python
from sklearn.tree import DecisionTreeClassifier
```
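Building, training, and predicting follow the usual Scikit-learn pattern: create the model, call `fit`, then call `predict`. Here is a rough sketch on made-up data rather than the tumor data. The names `X_toy`, `y_toy`, and `toy_clf` are placeholders for illustration, and `random_state=42` is an arbitrary choice.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): one feature, label is 1 when the value is above 5
X_toy = [[1], [2], [3], [4], [6], [7], [8], [9]]
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

# Build the model; random_state makes the result reproducible
toy_clf = DecisionTreeClassifier(random_state=42)

# Train the model on the labelled examples
toy_clf.fit(X_toy, y_toy)

# Predict labels for values the model has not seen
toy_predictions = toy_clf.predict([[2.5], [7.5]])
print(toy_predictions)
```

The same three steps apply to `X_train`, `y_train`, and `X_test` in the cells that follow.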

In [None]:
## Import and Build Decision Tree
from sklearn.tree import DecisionTreeClassifier


## Train Model

In [None]:
## Train Model


## Get Predictions

In [None]:
## Get Predictions




## Evaluating the Model

Given the nature of the dataset, let's run a few different tests of the model:
* Accuracy Score
* Confusion Matrix
* Precision Score
* Recall Score

### Accuracy Score

Accuracy Score is the ratio of correct predictions to the total number of predictions.

`accuracy_score` takes two arguments:
* actual correct labels
* predicted labels from model

It returns the accuracy as a fraction (a float between `0` and `1`).

__Formula__:

Accuracy = Correct Predictions / Total Number of Predictions

__Accuracy Score Pros and Cons__:

* (+) Simple and Easy to use
* (-) Can be misleading with imbalanced datasets, where one class has significantly more samples than another

__Importing Accuracy Score__:

```python
from sklearn.metrics import accuracy_score
```
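To see how accuracy can mislead on imbalanced data, here is a minimal sketch using made-up labels (not the tumor data). The names `toy_actual` and `toy_predicted` are placeholders for illustration.

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: nine samples of class 1 and only one sample of class 0
toy_actual = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# A "model" that always predicts the majority class
toy_predicted = [1] * 10

# 9 correct predictions out of 10
toy_accuracy = accuracy_score(toy_actual, toy_predicted)
print(toy_accuracy)  # 0.9
```

The score looks strong even though the single `0` sample is never identified, which is why we also look at the confusion matrix, precision, and recall.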

In [None]:
### Accuracy Score Results




### Confusion Matrix

A confusion matrix is easiest to read for binary classification (think spam or not spam, or in this case malignant or benign), although it also works with more than two classes.

`confusion_matrix()` takes in the `y_test` data, the model's `predictions`, and an optional list of `labels`. For two classes it returns a 2 by 2 table with 4 key components:

* `True Positive` - Model correctly predicted positive
* `True Negative` - Model correctly predicted negative
* `False Positive` - Model incorrectly predicted positive
* `False Negative` - Model incorrectly predicted negative

__Importing Confusion Matrix__:

```python
from sklearn.metrics import confusion_matrix
```

In [None]:
### Confusion Matrix Results





#### Confusion Matrix - Making Sense of It

Here is a breakdown of the Confusion Matrix as a table.

| |Predicted Negative|Predicted Positive|
|:---|:---|:---|
|Actual Negative|True Negatives|False Positives|
|Actual Positive|False Negatives|True Positives|
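The layout in the table can be checked with a small sketch on made-up labels. The names `toy_actual` and `toy_predicted` are placeholders, with `0` standing in for negative and `1` for positive.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = negative, 1 = positive
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes, in the order given by labels
toy_cm = confusion_matrix(toy_actual, toy_predicted, labels=[0, 1])
print(toy_cm)

# Flatten the 2 by 2 table into its four components
tn, fp, fn, tp = toy_cm.ravel()
print(tn, fp, fn, tp)  # 2 1 2 3
```

Keep in mind that this dataset encodes malignant as `0` and benign as `1`, so with the default label order the "positive" row and column refer to benign tumors.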

### Precision Score
Precision measures the accuracy of positive predictions made by a model: "Of all the instances the model predicted as positive, how many were actually positive?"

* `precision_score` takes in the `y_test` and `predictions` results and returns a float from 0-1.
* A higher score indicates higher precision

__Formula__:

Precision = True Positives / (True Positives + False Positives)


__Uses__:

Precision is important when the cost of a false positive is high.

* Spam Detection
* Medical Diagnosis
* E-commerce Recommendations

__Importing Precision Score__:

```python
from sklearn.metrics import precision_score
```
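As a quick check of the formula, here is a sketch on made-up labels (the names `toy_actual` and `toy_predicted` are placeholders). By default `precision_score` treats label `1` as the positive class; the `pos_label` parameter lets you pick the other label instead.

```python
from sklearn.metrics import precision_score

# Hypothetical labels
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 1, 0, 0]

# Label 1 as positive: 3 true positives, 1 false positive -> 3 / (3 + 1)
precision_for_1 = precision_score(toy_actual, toy_predicted)
print(precision_for_1)  # 0.75

# Label 0 as positive: 2 true positives, 2 false positives -> 2 / (2 + 2)
precision_for_0 = precision_score(toy_actual, toy_predicted, pos_label=0)
print(precision_for_0)  # 0.5
```

Since malignant is encoded as `0` in this dataset, `pos_label=0` is one way to ask for the precision of the malignant predictions.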

In [None]:
### Precision Score Results






### Recall Score

Recall Score asks the question: "Of all the instances that were actually positive, how many did the model correctly identify?"

It focuses on the model's ability to find all the positive cases.

__Formula:__

Recall = True Positives / (True Positives + False Negatives)

__Uses:__

* Recall is important when the cost of a false negative is high.
* For example, in medical diagnosis a false negative could have life-threatening consequences.

__Importing Recall Score:__

```python
from sklearn.metrics import recall_score
```
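The recall formula can be checked the same way on made-up labels (the names `toy_actual` and `toy_predicted` are placeholders). As with precision, `recall_score` treats label `1` as positive unless `pos_label` says otherwise.

```python
from sklearn.metrics import recall_score

# Hypothetical labels
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 1, 0, 0]

# Label 1 as positive: 3 true positives, 2 false negatives -> 3 / (3 + 2)
recall_for_1 = recall_score(toy_actual, toy_predicted)
print(recall_for_1)  # 0.6

# Label 0 as positive: 2 true positives, 1 false negative -> 2 / (2 + 1)
recall_for_0 = recall_score(toy_actual, toy_predicted, pos_label=0)
print(recall_for_0)
```

For this dataset, recall with `pos_label=0` answers the question "how many of the malignant tumors did the model catch?"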

In [None]:
### Recall Score Results




### Classification Report

Alternatively, you can run `classification_report` to evaluate the model's performance.

`classification_report` takes in the `y_test`, `predictions`, and `target_names` and returns:
* __Precision Score__ - share of positive predictions that were correct
* __Recall Score__ - share of actual positive cases the model correctly identified
* __F1-Score__ - harmonic mean of precision and recall - ranging from 0-1, 1 is best possible score
* __Support__ - number of actual occurrences of each class

__note:__ a high F1 for a class with low support may not be as meaningful as a slightly lower F1 for a class with high support

__Importing Classification Report:__

```python
from sklearn.metrics import classification_report
```
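Here is a small sketch of the report on made-up labels (the names `toy_actual`, `toy_predicted`, and `toy_report` are placeholders). The `target_names` are matched to the sorted labels, so the first name goes with `0` and the second with `1`.

```python
from sklearn.metrics import classification_report

# Hypothetical labels: 0 = malignant, 1 = benign
toy_actual    = [0, 0, 0, 1, 1, 1, 1, 1]
toy_predicted = [0, 0, 1, 1, 1, 1, 0, 0]
toy_names = ["malignant", "benign"]

# Text report with one row per class
print(classification_report(toy_actual, toy_predicted, target_names=toy_names))

# output_dict=True returns the same numbers as a nested dictionary
toy_report = classification_report(toy_actual, toy_predicted,
                                   target_names=toy_names,
                                   output_dict=True)
print(toy_report["malignant"]["recall"])
```

The dictionary form is handy when you want to compare a single number, such as malignant recall, between two models.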

In [None]:
### Classification Report Results




## Visualizing the Decision Tree

In [None]:
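from sklearn import tree

# Assumes the trained classifier from the cells above is named clf
plt.figure(figsize=(25,20))
tree.plot_tree(clf,
               feature_names=list(X.columns),        # column names of the feature DataFrame
               class_names=["Malignant", "Benign"],  # ordered by label: 0, then 1
               filled=True,
               fontsize=12)

plt.title("Decision Tree")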

# Improving the Model

Next up, you will improve the model.

We'll take a look at Feature Importances, prune features that aren't pulling their weight, rebuild the model, and evaluate the results.

### Feature Importances

Feature Importances is a technique that measures how much each feature in the dataset contributes to a model's predictions.

* Each input variable is given a score that indicates its relative influence on the model's output.
* A higher score means a feature has a larger impact on the model's ability to make accurate predictions.

__Access Feature Importance:__

```python
clf.feature_importances_
```
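To get a feel for what the attribute holds, here is a sketch on made-up data rather than the tumor data. The names `X_toy`, `y_toy`, `toy_clf`, and `toy_importance_df` are placeholders, and the `signal` and `noise` columns are invented for illustration.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): "signal" decides the label, "noise" does not
X_toy = pd.DataFrame({
    "signal": [1, 2, 3, 4, 6, 7, 8, 9],
    "noise":  [5, 1, 4, 2, 5, 1, 4, 2],
})
y_toy = [0, 0, 0, 0, 1, 1, 1, 1]

toy_clf = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# One score per feature, in the same order as the columns; the scores sum to 1
toy_importance_df = pd.DataFrame(
    {"importance": toy_clf.feature_importances_},
    index=X_toy.columns,
).sort_values(by="importance", ascending=False)
print(toy_importance_df)
```

Because a single split on `signal` separates the classes perfectly, it receives all of the importance and `noise` receives none.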

In the cell below, let's create a new DataFrame from the `feature_importances_` attribute where the index is the columns from the original DataFrame and the column is called `importance`.

We will then sort the rows by importance in descending order and display the top 10.

In [None]:
## Implement Feature Importance





## Visualize Feature Importance

In the cell below, create a bar graph, visualizing Feature Importances.

In [None]:
## Visualize Feature Importance



## Preserve the top 10 features

In the cell below, create a new feature DataFrame to be used for training/testing called `X_pruned` that keeps only the top 10 features and view its information.
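One possible way to keep the top features is to select columns by name. The sketch below uses made-up scores and a tiny made-up DataFrame; the names `toy_importances`, `X_toy`, and `X_toy_pruned` are placeholders, and `k = 2` stands in for the 10 used in this activity.

```python
import pandas as pd

# Hypothetical importance scores for a five-feature DataFrame
toy_importances = pd.Series(
    [0.05, 0.60, 0.00, 0.25, 0.10],
    index=["a", "b", "c", "d", "e"],
)
X_toy = pd.DataFrame([[1, 2, 3, 4, 5],
                      [6, 7, 8, 9, 10]],
                     columns=toy_importances.index)

# Names of the k highest-scoring features
k = 2
top_features = toy_importances.nlargest(k).index

# Keep only those columns
X_toy_pruned = X_toy[top_features]
print(list(X_toy_pruned.columns))  # ['b', 'd']
```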

In [None]:
# Preserve the top 10 features


## Split the New Training Data

Now let's split the data for training and testing using the `X_pruned` and `y` datasets.

In [None]:
# Create and Split the New Test Data



## Build A New Classifier

Create a New Classifier Model

In [None]:
# Build the New Model



## Train the New Model

Using `X_pruned_train` and `y_train`, train the new model.

In [None]:
# Train the New Model



## Run Predictions

Using `X_pruned_test`, store the model's predictions.

In [None]:
# Predict



## Classification Report

Run the classification report on the model's predictions.

Was there an improvement?

In [None]:
# Get Classification Report

