# Hackathon: Medical Diagnoses

You are working for a hospital and you want to be able to detect a certain disease using data taken from medical images.

You have data describing an image of a medical sample, along with knowledge of whether the patient the sample was taken from was diagnosed with the disease.

You want to be able to capture every patient who needs to be assessed further, so as to rule out any complications related to the disease.


<img src="images/doctor.png" style="display: block;margin-left: auto;margin-right: auto;height: 300px"/>

## About the data

This dataset is one of the toy datasets bundled with scikit-learn for exploring and practising machine learning: [The Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#breast-cancer-wisconsin-diagnostic-dataset).

The features in the dataset were computed from a digitized image of a breast tissue sample; they describe characteristics of the cell nuclei present in the image.


|Column|Description|Type|
|:---|:---|:---|
|id|ID number|float|
|diagnosis|The diagnosis of breast tissues (M = malignant, B = benign)|float|
|radius_mean|mean of distances from center to points on the perimeter|float|
|texture_mean|standard deviation of gray-scale values|float|
|perimeter_mean|mean size of the perimeter of the tumor|float|
|area_mean|mean size of the tumor|float|
|smoothness_mean|mean of local variation in radius lengths|float|
|compactness_mean|mean of perimeter^2 / area - 1.0|float|
|concavity_mean|mean of severity of concave portions of the contour|float|
|concave points_mean|mean for number of concave portions of the contour|float|
|fractal_dimension_mean|mean for "coastline approximation" - 1|float|
|radius_se|standard error for the mean of distances from center to points on the perimeter|float|
|texture_se|standard error for standard deviation of gray-scale values|float|
|perimeter_se|standard error for the perimeter of the tumor|float|
|area_se|standard error for the size of the tumor|float|
|smoothness_se|standard error for local variation in radius lengths|float|
|compactness_se|standard error for perimeter^2 / area - 1.0|float|
|concavity_se|standard error for severity of concave portions of the contour|float|
|concave points_se|standard error for number of concave portions of the contour|float|
|fractal_dimension_se|standard error for "coastline approximation" - 1|float|
|radius_worst|"worst" or largest mean value for mean of distances from center to points on the perimeter|float|
|texture_worst|"worst" or largest mean value for standard deviation of gray-scale values|float|
|perimeter_worst|"worst" or largest mean value for mean of perimeter|float|
|area_worst|"worst" or largest mean value for mean of area|float|
|smoothness_worst|"worst" or largest mean value for local variation in radius lengths|float|
|compactness_worst|"worst" or largest mean value for perimeter^2 / area - 1.0|float|
|concavity_worst|"worst" or largest mean value for severity of concave portions of the contour|float|
|concave points_worst|"worst" or largest mean value for number of concave portions of the contour|float|
|fractal_dimension_worst|"worst" or largest mean value for "coastline approximation" - 1|float|

```python
import pandas as pd

# Load the hackathon data and preview the first few rows
medical = pd.read_csv('data/medical.csv')
medical.head()
```

## Prepare `X` and `y`

Split the data into `X` and `y`, where `X` is the feature matrix and `y` is the target (`diagnosis`).

Exclude `id` from the feature matrix due to it being a unique identifier.

Check the shape of `X` and `y`. 
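
One way to approach the steps above is sketched below. So that it runs on its own, it uses a tiny made-up frame in place of `medical` (the values are illustrative only, not taken from the real data); the `features` and `target` names are assumptions introduced for this example.

```python
import pandas as pd

# Tiny made-up stand-in for `medical`; in the notebook, use the frame
# loaded from data/medical.csv instead.
medical = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": [1, 0, 0, 1],
    "radius_mean": [17.9, 11.4, 12.5, 20.6],
    "texture_mean": [10.4, 17.8, 15.7, 21.3],
})

target = "diagnosis"
# Every column except the identifier and the target is a feature.
features = [col for col in medical.columns if col not in ("id", target)]

X = medical[features]  # feature matrix (DataFrame)
y = medical[target]    # target vector (Series)

print(X.shape, y.shape)
```

Keeping the column names in a `features` list is handy later, for example when labelling the columns of a scaled array.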

## Train Test Split

Perform the train test split on the data to create `X_train`, `X_test`, `y_train`, `y_test`

Use a `random_state` to ensure the split is the same each time it is run.

Check the shape of `X_train`, `X_test`, `y_train` and `y_test`
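
A minimal sketch of the split follows. To stay self-contained it loads scikit-learn's bundled copy of the dataset via `load_breast_cancer` rather than `data/medical.csv`, so column names and target coding may differ from the hackathon file (this is an assumption of the example). The `random_state` value is arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Stand-in data: scikit-learn's bundled copy of the dataset.
# In the notebook, use the X and y prepared from `medical` instead.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# A fixed random_state makes the split reproducible across runs.
# The default test_size holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Passing `stratify=y` is an optional extra that keeps the class balance the same in both splits.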

## Preprocessing

There are no categorical or missing values to deal with. However, since we are building a `LogisticRegression` model, we will want to scale the data so that the coefficients can be compared.

Choose one of the scalers below and import it from `sklearn.preprocessing`:

- `StandardScaler`
- `RobustScaler`
- `MinMaxScaler`

Instantiate your scaler (e.g. `scaler = RobustScaler()`) and try it out by running:

```python
# fit_transform returns a NumPy array, so re-attach the column names
pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
```

## Building the Model

Now that we have a scaler chosen, we're ready to build a pipeline.

- Import `Pipeline` from `sklearn.pipeline` and a model (e.g. `LogisticRegression` from `sklearn.linear_model`).
- Instantiate the model with no parameters
- Instantiate the pipeline with the scaler and model as the 2 steps.

Fit the pipeline to `X_train` and `y_train`
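
The steps above can be sketched as follows. This is one possible version, not the only valid one: it assumes `RobustScaler` was the scaler chosen, uses scikit-learn's bundled copy of the dataset as a stand-in for `data/medical.csv`, and the step names `"scaler"` and `"model"` are arbitrary labels picked for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Stand-in data; in the notebook, reuse X_train and y_train from the split above.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Two steps: scale the features, then fit the classifier on the scaled values.
pipe = Pipeline(steps=[
    ("scaler", RobustScaler()),
    ("model", LogisticRegression()),  # default parameters
])

# Fitting the pipeline fits the scaler on X_train only, then the model.
pipe.fit(X_train, y_train)

# One coefficient per (scaled) feature.
print(pipe.named_steps["model"].coef_.shape)
```

Wrapping the scaler in the pipeline means it is fitted on the training data only and then reapplied automatically whenever the pipeline predicts on new data.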

# <mark>Assignment</mark>

<img src='images/gdd-logo.png' align=right width=300px>