## My first attempt at making "tutorial" (kind-of) notes 
- explaining why I took certain steps and documenting thought process

In [None]:
from platform import python_version
print(python_version())
#just checking

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 

# Suppress warnings; use 'once' to display each warning only once,
# or change it to 'default' to restore the default settings
warnings.filterwarnings('ignore') 

%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
# Consciously importing only the modules/functions that are required

In [None]:
datalake = '../input/data.csv'
dataset = pd.read_csv(datalake)

In [None]:
dataset.head(2)

In [None]:
# removing the 'id' and 'Unnamed: 32' columns and
# copying the features X and the target label 'diagnosis' to separate objects for analysis
dataset.drop(['id','Unnamed: 32'], axis=1, inplace = True) #removing from dataset directly
X_data = dataset.drop('diagnosis', axis = 1)
Y_data = dataset.diagnosis #note that it becomes a pandas Series object

In [None]:
X_data.head(2) #checking if we have it as required

In [None]:
Y_data.head(2)

In [None]:
#checking if any null values in entire dataset
dataset.info()

- 569 data points (entries) in total, 30 features (float), 1 target (object); ***can be cross-verified with dataset.shape***
- No missing/null values found
- No object columns found apart from the target column, diagnosis; ***i.e. we can proceed directly with the analysis of X without encoding***
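
The checks listed above can also be done programmatically. A minimal sketch on a small made-up DataFrame (`toy` and its values are hypothetical, only there to keep the example self-contained); the same calls should apply to `dataset`:

```python
import pandas as pd

# Hypothetical stand-in for the real data, used only for illustration
toy = pd.DataFrame({
    "radius_mean": [17.99, 20.57, 13.54],
    "texture_mean": [10.38, 17.77, 14.36],
    "diagnosis": ["M", "M", "B"],
})

n_rows, n_cols = toy.shape          # (number of entries, number of columns)
null_counts = toy.isnull().sum()    # missing values per column
# Columns that are not numeric and would need encoding before modeling
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("shape:", (n_rows, n_cols))
print("total nulls:", int(null_counts.sum()))
print("non-numeric columns:", non_numeric_cols)
```

If the only non-numeric column is the target, the features can go straight into analysis without encoding.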


In [None]:
#converting Y to machine understandable language i.e numbers
classes = list(Y_data.unique()) #get list of unique classes
values = [1 if x=='M' else 0 for x in classes] # assign corresponding values for classes
print('Unique Classes : {}'.format(classes))
print('Respective values assigned : {}'.format(values))

## As we are trying to predict whether cancer is detected, I assigned 1 to the Malignant class (M)

By general convention, the class to be detected (the positive class) corresponds to 1; assigning the values the other way round would also work.

### In machine learning, a multi-class problem can be handled with a one-vs-all approach; for neural networks, changing the number of neurons in the output layer should be enough

In [None]:
# assign values to classes and update dataset
Y_data = Y_data.map(dict(zip(classes, values)))
Y_data.head()
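
An explicit dictionary is another way to write the same encoding, and it keeps the class-to-value assignment visible at a glance. A small sketch, assuming only the labels 'M' and 'B' occur (`labels` and `label_map` are hypothetical names, and the sample values are made up):

```python
import pandas as pd

# Hypothetical sample of target labels
labels = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# 1 = Malignant (the class we want to detect), 0 = Benign
label_map = {"M": 1, "B": 0}
encoded = labels.map(label_map)

# map() gives NaN for any label missing from the dictionary,
# so counting nulls is a quick sanity check
unmapped = int(encoded.isnull().sum())
encoded = encoded.astype(int)

print("unmapped labels:", unmapped)
print(encoded.tolist())
```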

# The data is now in a valid form to fit any model, so we first get a baseline estimate before proceeding with data analysis and moving on to other models and the final model

### As this is a classification problem, I take a simple linear classifier (Logistic Regression) as my baseline model

# Modeling Logistic Regression

In [None]:
#import required modules
from sklearn.linear_model import LogisticRegression

# metrics used below, from sklearn.metrics
    # accuracy_score
    # confusion_matrix
    # precision_score
    # recall_score

In [None]:
seed = 2913
val_ratio = 0.3 
X_train, X_test, Y_train, Y_test = train_test_split(X_data,Y_data, test_size = val_ratio, random_state = seed)
baseline_classifier = LogisticRegression()
baseline_classifier.fit(X_train,Y_train)  #fitting the data to model

In [None]:
y_pred = baseline_classifier.predict(X_test)    # predicting on the test cases
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy of baseline classifier (rounded to 3 digits) : ",round(accuracy,3))

In [None]:
print(confusion_matrix(Y_test,y_pred))

In [None]:
tn, fp, fn, tp = confusion_matrix(Y_test,y_pred).flatten() #converting from 2x2 array to a single row array
print ('tn, fp, fn, tp : ', (tn, fp, fn, tp))
print('Precision : ', precision_score(Y_test,y_pred))
print('Sensitivity/Recall(tp rate) : ', recall_score(Y_test,y_pred))

## Results 
* Baseline accuracy is around 96%, which is decent
* 4 false negatives, i.e. the model predicted no, but the patients actually do have the disease
* 3 false positives, i.e. the model predicted yes, but the patients don't have the disease
* Precision is 0.95, which means that when the model predicts yes, it is correct 95% of the time
* From the sensitivity, we can observe that when the diagnosis is actually yes (Malignant), the model predicts yes only ***93%*** of the time, i.e. 93 times out of 100
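
To make the link between the confusion matrix and these scores explicit, here is a small sketch that computes them from the four counts. The `fp` and `fn` values are the ones reported above; `tn` and `tp` are hypothetical values chosen to be consistent with the rounded scores, not figures taken from the notebook output. `summarize` is an illustrative helper, not part of sklearn.

```python
def summarize(tn, fp, fn, tp):
    """Compute accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # of predicted yes, how many are truly yes
        "recall": tp / (tp + fn),        # of actual yes, how many were predicted yes
    }

# fp and fn as reported above; tn and tp are assumed for illustration
scores = summarize(tn=107, fp=3, fn=4, tp=57)
for name, value in scores.items():
    print("{}: {:.3f}".format(name, value))
```

In a screening setting, recall is usually the number to watch, since each false negative is a missed malignant case.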



## **We have a baseline estimate; we now do some data analysis and feature engineering to see if we can improve the predictions**

# Check if the data is skewed
> **i.e. if any of the classes is under- or over-represented**

In [None]:
# Checking visually using count plot

sns.countplot(x=dataset['diagnosis'], label="Count")  # pass the column by keyword; newer seaborn versions no longer accept it positionally

In [99]:
# finding each class count and its percentage of the total values
classes, count = np.unique(dataset['diagnosis'].values,return_counts=True)
for cls,val in zip(classes, count):
    print('No. of occurrences {}'.format(val))
    print('{0} accounts for {1:.2f}% of total values\n'.format(cls, 100*val/count.sum()))

print('Total values : {}'.format(dataset['diagnosis'].value_counts().sum()))

No. of occurrences 357
B accounts for 62.74% of total values

No. of occurrences 212
M accounts for 37.26% of total values

Total values : 569


## It can be observed that there is not much class imbalance, so we proceed with our analysis
* **If the data suffers from imbalance, we might have to do upsampling or downsampling**
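
For reference, a minimal sketch of upsampling with plain pandas, on a made-up imbalanced DataFrame (`toy` and its values are hypothetical). It draws rows of the minority class with replacement until both classes have the same size; this is one simple approach among several.

```python
import pandas as pd

# Hypothetical imbalanced data: 8 'B' rows and 2 'M' rows
toy = pd.DataFrame({
    "feature": range(10),
    "diagnosis": ["B"] * 8 + ["M"] * 2,
})

counts = toy["diagnosis"].value_counts()
majority = toy[toy["diagnosis"] == counts.idxmax()]
minority = toy[toy["diagnosis"] == counts.idxmin()]

# Upsampling: sample the minority class with replacement up to the majority size
minority_up = minority.sample(n=len(majority), replace=True, random_state=0)
balanced = pd.concat([majority, minority_up]).reset_index(drop=True)

print(balanced["diagnosis"].value_counts().to_dict())
```

Resampling should be applied to the training split only, otherwise copies of the same row can end up in both the training and the test data.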

# Feature scaling (Data normalization)

It is an optional step
- it depends on the algorithm/model we plan to implement and on the data we use
    - it allows the model to learn, and the features to contribute, relative to their importance rather than their scale

If we change the algorithm, we have to check the feature scaling again

Some algos that require (or clearly benefit from) feature scaling:

    Logistic regression
    SVMs
    Perceptrons
    Neural networks
    PCA    
> ***Usually distance-based or gradient-based algos***

Models which do not require feature scaling:

    Decision trees (and random forests)
    Naive Bayes
  


# Checking for outliers
