# Breast Cancer Classification using Fine Needle Aspiration

#### Research Background:
> ### Prominent types of tumors
> - **Benign Tumors**
>
>   - Non-cancerous
>
>   - Capsulated
>
>   - Non-invasive
>
>   - Mild growth
>
>   - Non-metastasizing
>
>   - Deviant cells 
>
> - **Malignant tumors**
>
>   - Cancerous
>
>   - Non-capsulated
>
>   - Rapid growth
>
>   - Metastasizing
>
>   - Deviant cells (large, dark nuclei, abnormal shape)
>
***
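Telling the two apart is where fine needle aspiration (FNA) comes in: a thin needle draws a small sample of cells from the lump so they can be examined under a microscope.

***This biopsy procedure can help make a diagnosis or rule out conditions such as cancer.***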
***
> ### About the dataset:
> - 0 represents a malignant tumor
>
> - 1 represents a benign tumor
> ***
> ### TODOS
> - Read dataset
>
> - Preprocessing
>
> - Train-test split
>
> - Logistic regression model 
>
> - Validate accuracy
>
> - Implement a Predictive System
***
# Model Implementation
***


### Dependencies

In [49]:
# for arrays
import numpy as np
# for dataframe construction
import pandas as pd
# for data loading
import sklearn.datasets
# for df split
from sklearn.model_selection import train_test_split
# for model init
from sklearn.linear_model import LogisticRegression
# for accuracy check
from sklearn.metrics import accuracy_score

### Dataframe Construction

In [29]:
# load the data from sklearn (also available on Kaggle as 'Breast Cancer Wisconsin (Diagnostic) Data Set', which can be loaded with pd.read_csv)
# returns a Bunch (a dictionary-like object) holding NumPy arrays, not a DataFrame
breast_cancer_dataset = sklearn.datasets.load_breast_cancer()

# make into df since the above init did not automatically create one like a .read_csv
df = pd.DataFrame(breast_cancer_dataset.data, columns = breast_cancer_dataset.feature_names)

# make sure df contains the desired data (both show 5 rows by default)
# df.head()
# df.tail()

# add 'diagnosis' column to df
# this is because the above essentially only added inputs to df ('data' in breast_cancer_dataset); thus outputs need to be added ('target' in breast_cancer_dataset)
# here we set the column parameter as 'diagnosis'
df['diagnosis'] = breast_cancer_dataset.target

#df ready!
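The 0/1 encoding stated earlier can be checked against the dataset itself rather than taken on trust. The snippet below is a minimal sketch that reloads the data on its own; the names `dataset` and `label_names` are illustrative and not part of the original notebook.

```python
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()

# target_names is indexed by label value, so position 0 names label 0, and so on
label_names = {label: str(name) for label, name in enumerate(dataset.target_names)}
print(label_names)

# number of samples per label, keyed by the readable name
counts = {label_names[label]: int((dataset.target == label).sum()) for label in label_names}
print(counts)
```

This confirms that 0 maps to malignant and 1 to benign, which matters later when the prediction is turned into a message.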

### Preprocessing

In [None]:
# useful methods to analyze data

# returns (rows,columns)
df.shape

# prints column names, non-null counts, and data types
# if values are missing, techniques like imputation can fill them in
df.info()

# check specifically for missing values
df.isnull().sum()

# returns statistical values 
df.describe()

# returns the distribution of 'target' or in our case the 'diagnosis' column
df['diagnosis'].value_counts()

# splits columns by 'target' values and returns the various means
df.groupby('diagnosis').mean()

### Train-Test Split

In [44]:
# separating the features(input) and target(output)

# note: columns='diagnosis' is equivalent to passing 'diagnosis' with axis=1 (rows are axis 0, columns are axis 1)
X = df.drop(columns='diagnosis')
y = df['diagnosis']

# here we use the earlier train_test_split import to split the data into train and test sets

# note: passing random_state makes the split reproducible (the same 20% on every run)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# now we see how the data has been split from the original(X) into (X_train,X_test)
print(X.shape, X_train.shape, X_test.shape)

(569, 30) (455, 30) (114, 30)
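As a possible refinement of the split above, the sketch below fixes `random_state` so the same rows land in the test set on every run, and passes `stratify=y` so both sets keep roughly the same benign/malignant ratio. The value `42` is an arbitrary choice, and the data is reloaded here only to keep the example self-contained.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# as_frame=True returns pandas objects, comparable to the df built earlier
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# random_state (arbitrary value) makes the split repeatable;
# stratify=y preserves the class proportions in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)
# fraction of benign samples (label 1) in each subset
print(y_train.mean(), y_test.mean())
```

Without stratification, a small test set can end up with a noticeably different class mix from the training set, which makes accuracy figures harder to compare between runs.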


### Model Training (Logistic Regression)

In [None]:
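# we implement logistic regression, which is well suited to binary classification

# instance of LogisticRegression()
# note: with the default max_iter=100 the solver may emit a ConvergenceWarning on these unscaled features
model = LogisticRegression()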

# training model using training data
model.fit(X_train, y_train)
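Because the features span very different scales, the default solver can struggle to converge on the raw values. One common remedy, shown here as a sketch rather than as the notebook's own method, is to standardize the features inside a pipeline before fitting. The names `scaled_model` and `test_accuracy`, the `max_iter=1000` setting, and `random_state=42` are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# the scaler is fitted on the training data only, then applied to anything passed to the pipeline
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

# score() reports accuracy for classifiers
test_accuracy = scaled_model.score(X_test, y_test)
print(test_accuracy)
```

Keeping the scaler inside the pipeline means the test data never influences the scaling statistics, and the same transformation is applied automatically at prediction time.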


### Accuracy Check

In [63]:
# accuracy on TRAINING data
X_train_prediction = model.predict(X_train)
# compares our predictions which are stored in 'X_train_prediction' with 'y_train'(actual output) via the 'accuracy_score' import
training_data_accuracy = accuracy_score(y_train, X_train_prediction)
training_data_accuracy

# accuracy on TEST data
X_test_prediction = model.predict(X_test)
# compares our predictions stored in 'X_test_prediction' with 'y_test' (actual output) via the 'accuracy_score' import
test_data_accuracy = accuracy_score(y_test, X_test_prediction)
test_data_accuracy

# overfitting is when the model fits the training data too closely (including its noise) and generalizes poorly to new data
# it is commonly indicated by TRAINING accuracy that is much higher than TEST accuracy

0.956140350877193
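Accuracy alone does not show which kind of mistake the model makes, and in this setting a malignant tumor predicted as benign is the costlier error. The sketch below adds a confusion matrix and the recall for the malignant class (label 0). It retrains a model on its own split so that it runs standalone; `max_iter=10000` and `random_state=42` are assumptions for the illustration, not settings from the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# rows are true labels, columns are predicted labels, both ordered [malignant (0), benign (1)]
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# share of truly malignant samples that the model labeled malignant
malignant_recall = recall_score(y_test, y_pred, pos_label=0)
print(malignant_recall)
```

In the matrix, the top-right cell counts malignant tumors predicted as benign, which is the number to watch most closely.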

### Predictive System Construction

In [None]:
# here we use the model to predict whether a tumor is benign or malignant

# input_data is set to an example below
input_data = (13.54,14.36,87.46,566.3,0.09779,0.08129,0.06664,0.04781,0.1885,0.05766,0.2699,0.7886,2.058,23.56,0.008462,0.0146,0.02387,0.01315,0.0198,0.0023,15.11,19.26,99.7,711.2,0.144,0.1773,0.239,0.1288,0.2977,0.07259)

# change the input data to a numpy array as we cannot reshape a tuple
input_np = np.asarray(input_data)

# reshape the numpy array because we are predicting for one data point
reshaped = input_np.reshape(1,-1)

prediction = model.predict(reshaped)
print(prediction)

# model.predict returns an array, so compare its first (and only) element
if prediction[0] == 0:
    print('The Breast Cancer is Malignant')
else:
    print('The Breast Cancer is Benign')
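The prediction steps above could be wrapped in a small reusable function. The sketch below uses a hypothetical helper, `predict_diagnosis`, that is not part of the original notebook. It builds a one-row DataFrame with the original column names, since a model fitted on a DataFrame typically warns about missing feature names when given a bare NumPy array. For brevity the model here is fitted on the full dataset with `max_iter=10000`, both of which are assumptions for the illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

dataset = load_breast_cancer()
X = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y = dataset.target

model = LogisticRegression(max_iter=10000).fit(X, y)


def predict_diagnosis(model, feature_values, feature_names):
    """Hypothetical helper: return 'Malignant' or 'Benign' for one sample."""
    if len(feature_values) != len(feature_names):
        raise ValueError(
            f"expected {len(feature_names)} values, got {len(feature_values)}"
        )
    # a one-row DataFrame keeps the column names the model saw during fit
    sample = pd.DataFrame([feature_values], columns=feature_names)
    label = int(model.predict(sample)[0])
    # in this dataset 0 is malignant and 1 is benign
    return "Malignant" if label == 0 else "Benign"


first_row = X.iloc[0].tolist()
result = predict_diagnosis(model, first_row, dataset.feature_names)
print(result)
```

The length check catches a common input mistake early: a sample with fewer or more than 30 values would otherwise fail inside the model with a less readable error.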


***
# Conclusion

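> This project shows that even a simple logistic regression model can separate benign from malignant tumors in this dataset, reaching roughly 96% test accuracy in the run shown above. It illustrates how machine learning can support diagnosis in healthcare.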