# ***UCI Breast Cancer Pipeline Project***
### Some noteworthy information from UCI:
1) ID number
2) Diagnosis (M = malignant, B = benign)
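3-32) Thirty real-valued features: the mean, standard error, and "worst" (mean of the three largest values) of each of the ten measurements listed below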

**Ten real-valued features are computed for each cell nucleus:**

1) radius (mean of distances from center to points on the perimeter)
2) texture (standard deviation of gray-scale values)
3) perimeter
4) area
5) smoothness (local variation in radius lengths)
6) compactness (perimeter^2 / area - 1.0)
7) concavity (severity of concave portions of the contour)
8) concave points (number of concave portions of the contour)
9) symmetry
10) fractal dimension ("coastline approximation" - 1)
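
The ten measurements above are each reported three times, which is where the 30 feature columns come from. A minimal sketch of that mapping, assuming the `1`/`2`/`3` suffixes in the exported column names stand for mean, standard error, and worst value respectively (an assumption based on the UCI field ordering, not something the column names state themselves):

```python
# Ten base measurements, spelled as they appear in the exported column headers
base_features = [
    'radius', 'texture', 'perimeter', 'area', 'smoothness',
    'compactness', 'concavity', 'concave_points', 'symmetry',
    'fractal_dimension',
]

# Assumed meaning of the numeric suffixes (labels are illustrative only)
suffix_meaning = {1: 'mean', 2: 'standard error', 3: 'worst'}

# Build the 30 expected feature column names: radius1 ... fractal_dimension3
feature_columns = [
    f'{name}{suffix}' for suffix in suffix_meaning for name in base_features
]

print(len(feature_columns))
print(feature_columns[:3], '...', feature_columns[-1])
```

Comparing such a list against `df.columns` is a quick way to confirm that only `ID` and `Diagnosis` fall outside the feature set.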


## 0. Import Modules:

In [71]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA

## 1. Import UCI Dataset → Write dataset to local CSV → Search for missing values and verify shape

In [52]:
## Import UCI Dataset and write to local csv
# from ucimlrepo import fetch_ucirepo
# breast_ca = fetch_ucirepo(id=17)

# breast_ca_df = breast_ca.data.original
# breast_ca_df.to_csv('UCI_BreastCancer.csv', index=False)
# print('Successfully wrote dataset to csv file!')

# Read csv and store as df
df = pd.read_csv('UCI_BreastCancer.csv')

# Search dataset for missing / null values
null_counts = df.isnull().sum()
if null_counts.any():
    # Show only the columns that actually contain missing values
    print('NaN values found:\n', null_counts[null_counts > 0])
else:
    print('No NaN or null values found')

# Verify features and shape
print(df.columns)
print(df.shape)

No NaN or null values found
Index(['ID', 'radius1', 'texture1', 'perimeter1', 'area1', 'smoothness1',
       'compactness1', 'concavity1', 'concave_points1', 'symmetry1',
       'fractal_dimension1', 'radius2', 'texture2', 'perimeter2', 'area2',
       'smoothness2', 'compactness2', 'concavity2', 'concave_points2',
       'symmetry2', 'fractal_dimension2', 'radius3', 'texture3', 'perimeter3',
       'area3', 'smoothness3', 'compactness3', 'concavity3', 'concave_points3',
       'symmetry3', 'fractal_dimension3', 'Diagnosis'],
      dtype='object')
(569, 32)
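
Because the real dataset has no missing values, the NaN branch of the check never runs here. As a small illustration, the sketch below applies the same `isnull()` logic to a toy DataFrame (the `toy` frame and its values are made up for this example):

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (hypothetical data)
toy = pd.DataFrame({
    'radius1': [14.2, np.nan, 11.8],
    'texture1': [19.1, 20.4, 17.7],
})

# Per-column count of missing entries
null_counts = toy.isnull().sum()

if null_counts.any():
    # Report only the columns that actually contain NaNs
    print('NaN values found:\n', null_counts[null_counts > 0])
else:
    print('No NaN or null values found')
```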


## 2. Define Target (y) and Features (X) → Convert Target to Binary → train_test_split()

In [69]:
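# Define features and target
y = df.Diagnosis
# Note: 'ID' is still in X at this point (hence 31 columns below); it is a
# record identifier rather than a measurement, so it should probably be dropped too.
X = df.drop(columns=['Diagnosis'])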
print(X.shape)
print(y.shape)

# Convert target to binary (M -> 1, B -> 0) and verify value_counts
print('\nPrior to binary conversion: \n',y.value_counts())
y = (y == 'M').astype(int)
print('\nPost binary conversion: \n',y.value_counts(),'\n')

print(X.shape)
print(y.shape)

# train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

(569, 31)
(569,)

Prior to binary conversion: 
 Diagnosis
B    357
M    212
Name: count, dtype: int64

Post binary conversion: 
 Diagnosis
0    357
1    212
Name: count, dtype: int64 

(569, 31)
(569,)
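
With 357 benign and 212 malignant samples, the classes are moderately imbalanced, so a plain random split can shift the class ratio between train and test. One option is the `stratify` argument of `train_test_split`, sketched here on a synthetic label vector with the same class counts rather than on the real data (`X_demo` and `y_demo` are placeholder names):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the dataset (0 = benign, 1 = malignant)
y_demo = np.array([0] * 357 + [1] * 212)
# Placeholder feature matrix; only its row count matters here
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the malignant share (about 37%) nearly equal in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

print('train malignant share:', round(y_tr.mean(), 3))
print('test malignant share: ', round(y_te.mean(), 3))
```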


## 3. Preprocessing / Scaling / Exploratory Data Analysis:

In [72]:
scaler = StandardScaler()

## All features are numeric
# print(X_train.nunique())

preprocess = Pipeline([
    ('scaler', StandardScaler())
])

preprocessor = ColumnTransformer([
    ('preprocessor', preprocess, X_train.columns)
])

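# Intermediate Pipeline steps must be transformers, so two classifiers cannot be
# chained. Use a single final 'clf' step and let the search space swap the estimator.
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', RandomForestClassifier())
])

# Each dict is one sub-grid; the estimator itself is passed as a one-element list.
# max_depth and n_estimators must be integers, hence dtype=int.
search_space = [
    {'clf': [RandomForestClassifier()], 'clf__max_depth': np.linspace(5,55,10, dtype=int), 'clf__n_estimators': np.linspace(10,100,10, dtype=int)},
    {'clf': [GradientBoostingClassifier()], 'clf__learning_rate': np.logspace(-4,-1,9), 'clf__n_estimators': np.linspace(10,100,10, dtype=int)},
]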
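
A search space like this is typically handed to `GridSearchCV`, which the notebook has not imported or called at this point, so the following is only a sketch of how that could look. It assumes a single final pipeline step (named `clf` here) whose estimator the grid swaps, uses scikit-learn's bundled copy of the Wisconsin diagnostic data (`load_breast_cancer`) instead of the local CSV, and deliberately shrinks the grids so it runs quickly:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Bundled WDBC data; note its labels are 0 = malignant, 1 = benign
X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=13, stratify=y_demo
)

# One scaler, then one final estimator slot that the grid can replace
demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', RandomForestClassifier(random_state=13)),
])

# Reduced grids (assumed values, chosen only to keep the run short)
demo_space = [
    {'clf': [RandomForestClassifier(random_state=13)],
     'clf__max_depth': [5, 15],
     'clf__n_estimators': [10, 50]},
    {'clf': [GradientBoostingClassifier(random_state=13)],
     'clf__learning_rate': [0.01, 0.1],
     'clf__n_estimators': [10, 50]},
]

# The whole pipeline is refit per fold, so the scaler never sees validation rows
search = GridSearchCV(demo_pipeline, demo_space, cv=3, scoring='accuracy')
search.fit(X_tr, y_tr)

print('best params:', search.best_params_)
print('held-out accuracy:', round(search.score(X_te, y_te), 3))
```

Putting the scaler inside the pipeline, rather than scaling `X_train` up front, is what keeps the cross-validation scores free of leakage from the validation folds.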