# Get Started
Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers diagnosed among women in the United States, and it is the second leading cause of cancer death among women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous), pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound, and biopsy are commonly used to diagnose breast cancer.

The data used here come from breast fine needle aspiration (FNA) tests. FNA is a quick and simple procedure that removes some fluid or cells from a breast lesion or cyst (a lump, sore, or swelling) with a fine needle similar to a blood sample needle. The goal is to build a model that classifies a breast tumor into one of two classes:

    1 = Malignant (Cancerous) - Present
    0 = Benign (Not Cancerous) - Absent


The Breast Cancer Wisconsin (Diagnostic) dataset is available from the machine learning repository maintained by the University of California, Irvine. The dataset contains 569 samples of malignant and benign tumor cells.

+ The first two columns in the dataset store the unique ID numbers of the samples and the corresponding diagnosis (M=malignant, B=benign), respectively.
+ Columns 3-32 contain 30 real-valued features that have been computed from digitized images of the cell nuclei, which can be used to build a model to predict whether a tumor is benign or malignant.

------------------------------------------

## Objective:

The repository is a learning exercise to:

- Apply the fundamental concepts of machine learning from an available dataset

The analysis is divided into five sections, saved in Jupyter notebooks in this repository:

1. Identifying the problem and Data Sources
2. Data Explorations
3. Data Pre-Processing
4. Model Development
5. Model Improvement


In [1]:
# check Environments version 
# --------------------------
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))

Python: 2.7.13 |Anaconda custom (64-bit)| (default, Dec 20 2016, 23:09:15) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
scipy: 1.0.0
numpy: 1.13.3
matplotlib: 2.0.2
pandas: 0.21.0
sklearn: 0.19.1


---------------------------------------------------------------
# 1.  Identifying the problem and Getting data.

Goal: identify the types of information contained in the data set. In this notebook I use Python modules to import external data sets, get familiar with the data, and think about how to handle it in different ways.

--------------------------------------------------------

### A. Import library

First, let’s import the minimum set of modules, functions, and objects we are going to use in this tutorial.



### B. Load Dataset

First, load the supplied CSV file using the Pandas read_csv function and its additional options.
Alternatively, we can load the same UCI Machine Learning Repository data directly from the copy bundled with sklearn.

The first step is to visually inspect the new data set. There are multiple ways to achieve this:

* The easiest is to request the first few records using the DataFrame head() method. By default, “data.head()” returns the first 5 rows of the DataFrame object data (the header row is not counted).
* Alternatively, one can use “data.tail()” to return the last five rows of the data frame.
* For both the head and tail methods, the number of records can be specified by passing the required number between the parentheses when calling either method.
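
As a minimal sketch of this inspection step, the cell below loads the copy of the dataset that ships with scikit-learn (the supplied CSV file is not assumed to be available) and looks at the first and last rows. The variable names cancer, data, first_rows, and last_rows are my own choices. Note that scikit-learn encodes malignant as 0 and benign as 1, so the sketch flips the label to match the 1 = malignant convention used in this notebook.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the bundled Breast Cancer Wisconsin (Diagnostic) data.
cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# scikit-learn uses 0 = malignant, 1 = benign; flip it so 1 = malignant.
data["diagnosis"] = (cancer.target == 0).astype(int)

first_rows = data.head()    # first 5 rows by default
last_rows = data.tail(3)    # last 3 rows, because a number was passed

print(data.shape)
print(first_rows.iloc[:, :4])
print(last_rows.iloc[:, :4])
```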

In [21]:
#import library
#---------------------------------------------



--libararies complate--


In [3]:
# Load data 
# showing data information
# ----------------------------



Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.

        

-------------------------------------------
# 2. Data Explorations

Goal: explore the variables to assess how they relate to the response variable. In this notebook, I get familiar with the data using data exploration and visualization techniques with Python libraries (Pandas, matplotlib, seaborn). Familiarity with the data is important because it provides useful knowledge for data pre-processing.


### Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.
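
One way to get such a summary, sketched here with the scikit-learn copy of the data, is the pandas describe() method. The names data and summary are illustrative only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
data = pd.DataFrame(cancer.data, columns=cancer.feature_names)

# describe() reports count, mean, std, min, the 25%/50%/75% percentiles, and max
# for every numeric column.
summary = data.describe()

# Show the summary for the first three features only, to keep the output short.
print(summary.iloc[:, :3])
```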

In [None]:
# Use pandas as the container for the data
# showing head of data
# Set a global name X,y (for simple name)
#---------------------------------------------




In [None]:
# Summarize dataset using matplotlib
# Plot histograms 
#-------------------------------------






----------------------------------------
# 3. Data Preprocessing

Data preprocessing is a crucial step for any data analysis problem. It is often a very good idea to prepare your data in such a way as to best expose the structure of the problem to the machine learning algorithms that you intend to use. This involves a number of activities such as:
- Assigning numerical values to categorical data;
- Handling missing values; and
- Normalizing the features (so that features on larger scales do not dominate when fitting a model to the data).
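
The first two activities can be sketched on a tiny, hypothetical table laid out like the raw UCI file (ID, diagnosis, then features). The rows, the column name radius_mean, and the choice of median imputation are assumptions made for illustration, not part of this project's data handling.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "diagnosis": ["M", "B", "B", "M"],
    "radius_mean": [17.99, 13.54, np.nan, 20.57],
})

# Assign numerical values to the categorical label: 1 = malignant, 0 = benign.
raw["diagnosis"] = raw["diagnosis"].map({"M": 1, "B": 0})

# Count missing values per column, then fill them with the column median
# (median imputation is just one possible strategy).
missing_per_column = raw.isnull().sum()
raw["radius_mean"] = raw["radius_mean"].fillna(raw["radius_mean"].median())

print(missing_per_column)
print(raw)
```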

Goal : Find the most predictive features of the data and filter it so it will enhance the predictive power of the analytics model. 

*) We assume that the dataset for this project is clean, so we do not do this preprocessing right now.

  
--------------------------------

In this section, I use the following techniques to improve the models (they can be skipped when trying a simple approach first):
1. Feature normalization
2. Feature selection to reduce high-dimensional data

### Feature Normalization
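
A minimal sketch of normalization with StandardScaler, assuming the scikit-learn copy of the data. For simplicity the scaler is fit on the full feature matrix here; in a real evaluation it is usually safer to fit it on the training split only and reuse it on the test split.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data

# StandardScaler rescales each feature to zero mean and unit variance.
data_scaler = StandardScaler()
X_scaled = data_scaler.fit_transform(X)

print(np.round(X_scaled.mean(axis=0)[:3], 6))
print(np.round(X_scaled.std(axis=0)[:3], 6))
```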


In [6]:
# preprocessing code # 
# Feature normalize using StandardScaler()
#---------------------------------------------



### Feature Reduction using PCA
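
The following is a sketch of dimensionality reduction with PCA on the standardized features. The 95% explained-variance threshold is an assumption chosen for illustration; the original cell may have used a fixed number of components instead.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# A float n_components keeps just enough components to explain that share
# of the variance; this requires the "full" SVD solver.
pca = PCA(n_components=0.95, svd_solver="full")
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.cumsum())
```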


In [7]:
# feature selection and feature reduction using PCA
#---------------------------------------------


### Data Representation
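
A quick way to check the representation, sketched with the scikit-learn copy of the data, is to print the type, shape, and dtype of the input X and the target y. The label flip to 1 = malignant is my own step to match the convention stated at the top of this notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign

# Structure of the input (X) and the target (y).
print("X:", type(X).__name__, X.shape, X.dtype)
print("y:", type(y).__name__, y.shape, y.dtype)

# Number of samples in each class.
class_counts = np.bincount(y)
print("benign (0):", class_counts[0], "malignant (1):", class_counts[1])
```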

In [None]:
# Checking Data Representation
# showing structure of Input (X)
# showing structure of target (y)
#---------------------------------------------





----------------------------------------------------------
# 4. Models Development 

In this section of the project, you will develop the tools and techniques necessary for models to make a prediction. Being able to make accurate evaluations of each model's performance through the use of these tools and techniques helps to greatly reinforce the confidence in your predictions.

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate five different algorithms:
- KNeighborsClassifier
- DecisionTreeClassifier
- RandomForestClassifier
- GaussianNB
- SVM


----------------------------------------------------------
### A. Split Data
Before building the models, split the dataset into training and testing subsets. Typically, the data is also shuffled into a random order when creating the training and testing subsets to remove any bias in the ordering of the dataset.

For the code cell below, you will need to implement the following:

- Use train_test_split from sklearn.model_selection (the older sklearn.cross_validation module is deprecated) to shuffle and split the features and target data into training and testing sets.
- Split the data into 75% training and 25% testing.
- Set the random_state for train_test_split to a value of your choice. This ensures results are consistent.
- Assign the train and testing splits to X_train, X_test, y_train, and y_test.
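
The steps above can be sketched as follows. The import comes from sklearn.model_selection, which replaced the older sklearn.cross_validation module; random_state=42 is an arbitrary value of my choice, and the data come from the scikit-learn copy of the dataset.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign

# train_test_split shuffles by default; test_size=0.25 gives a 75/25 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print(X_train.shape, X_test.shape)
```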


### B. Models Validation

It is difficult to measure the quality of a given model without quantifying its performance over training and testing. This is typically done using some type of performance metric, whether it is through calculating some type of error, the goodness of fit, or some other useful measurement. For this project, we use the accuracy_score function to compute the accuracy. For a binary problem like this one, it returns the fraction of samples whose predicted label matches the true label. (In multilabel classification it returns the subset accuracy: a sample counts as correct only if its entire set of predicted labels strictly matches the true set of labels.)
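
To make the metric concrete, here is a small sketch with made-up labels (they are not model output from this project). Six of the eight predictions match, so the accuracy is 6/8.

```python
from sklearn.metrics import accuracy_score

# Made-up labels for illustration: 1 = malignant, 0 = benign.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

# Two predictions are wrong (positions 2 and 5), so 6 of 8 are correct.
score = accuracy_score(y_true, y_pred)
print(score)  # 0.75
```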


### C. Cross validation
Cross validation is a technique for assessing the performance of machine learning models. It helps in knowing how the machine learning model would generalize to an independent data set.
This is one of the resampling methods that make the best use of your training data in order to accurately estimate the performance of a model on new unseen data.


Accurate estimates of performance can then be used to help you choose which set of model parameters to use or which model to select. Once you have chosen a model, you can train the final model on the entire training dataset and start using it to make predictions.
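
A minimal cross validation sketch with cross_val_score on the training split. Using GaussianNB, 10 folds, and reporting the spread as two standard deviations are all assumptions on my part; the original cells may have been configured differently.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Each of the 10 folds is held out once while the model trains on the other 9.
scores = cross_val_score(GaussianNB(), X_train, y_train, cv=10)

print("Cross validation score: {:.2%} (+/- {:.2%})".format(
    scores.mean(), scores.std() * 2))
```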



In [8]:
# -Initialization method classifier
# - Split data
#---------------------------------------------





In [17]:
# -Initialization method classifier
# - Split data
#---------------------------------------------





In [18]:
# Show evaluation to pandas
#---------------------------------------------




                        accuracy  cross val  stdv cross val  time process
KNeighborsClassifier    0.965035   0.926253       +/- 9.25%      0.036504
DecisionTreeClassifier  0.944056   0.927976       +/- 6.54%      0.072742
RandomForestClassifier  0.951049   0.959649       +/- 6.85%       0.25531
GaussianNB              0.958042   0.936779       +/- 7.22%      0.013605
SVC                     0.622378   0.627663      +/- 35.40%       0.80199
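
A table like the one above could be produced with a loop along these lines. This is a sketch under my own assumptions (default hyper-parameters, random_state=42, 10-fold cross validation, timing in seconds), not the original cell, so the numbers will differ. In particular, newer scikit-learn releases changed the default gamma of SVC, so its score on unscaled data is not expected to match the value shown.

```python
import time

import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

classifiers = {
    "KNeighborsClassifier": KNeighborsClassifier(),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "GaussianNB": GaussianNB(),
    "SVC": SVC(),
}

rows = {}
for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)
    # Accuracy on the held-out test split.
    accuracy = accuracy_score(y_test, clf.predict(X_test))
    # 10-fold cross validation on the training split.
    scores = cross_val_score(clf, X_train, y_train, cv=10)
    rows[name] = {
        "accuracy": accuracy,
        "cross val": scores.mean(),
        "stdv cross val": "+/- {:.2%}".format(scores.std() * 2),
        "time process": time.time() - start,
    }

results = pd.DataFrame.from_dict(rows, orient="index")
print(results)
```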


-------------------------------------------------------------
# 5. Model Improvement


The goal is to find the most predictive features of the data and filter them so that the predictive power of the model improves.
In this project I use two sub-steps to improve the accuracy of the best models:

1. Preprocessing: feature selection to reduce high-dimensional data, plus feature extraction and transformation for dimensionality reduction.
2. Parameter tuning: searching for the hyper-parameters that give the best model performance.

Note: Machine learning models are parameterized so that their behavior can be tuned for a given problem.
Models can have many parameters, and finding the best combination of parameters can be treated as a search problem. Not all parameters of a classifier are learned from the data during training. Those that are not learned are called hyper-parameters and are passed as arguments to the constructor of the classifier. Each estimator has a different set of hyper-parameters, which can be found in the corresponding documentation.
We can search for the best performance of the classifier by trying different hyper-parameter combinations. This will be done with an exhaustive grid search, provided by the GridSearchCV class.
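
As a hedged sketch of such a grid search, the cell below tunes an SVC inside a Pipeline that standardizes the features first. The grid values, the 5-fold setting, and the step names scaler and svc are assumptions for illustration; the original search (which ran on its own grid and took far longer) is not reproduced here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Scaling lives inside the pipeline so each fold is scaled using only
# its own training part.
pipeline = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Hyper-parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = accuracy_score(y_test, search.predict(X_test))
print("Best parameters:", search.best_params_)
print("Cross validation score: {:.2%}".format(search.best_score_))
print("Accuracy: {:.2%}".format(test_accuracy))
```

Because refit is enabled by default, search.predict uses the best pipeline retrained on the whole training split.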




-------------------------------------
## SVM tune parameters

In [20]:
# SVM tune parameters 
#---------------------------------------------





--------- Now Trying Support Vector Machine Classifier ---------
Support Vector Machine Accuracy: 97.20%
Cross validation score: 93.83% (+/- 5.81%)
Execution time: 2057.5 seconds 

Best parameters: {'kernel': 'linear', 'C': 0.1} 



## Random Forest Classifier tune parameters

In [22]:
# Random Forest Classifier tune parameters
#--------------------------------------------




--------- Now Trying Support Random Forest Classifier ---------
Random Forest Accuracy: 96.50%
Cross validation score: 95.62% (+/- 4.23%)
Execution time: 57.645 seconds 

Best parameters: {'n_estimators': 91, 'criterion': 'entropy'} 



## Naive Bayes Classifier tune parameters

In [23]:
# Naive Bayes Classifier tune parameters
#--------------------------------------------



--------- Now Trying Support Naive Bayes Classifier  ---------
Accuracy: 95.80%
Cross validation score: 93.69% (+/- 3.29%)
Execution time: 0.24562 seconds 

Best parameters: {'priors': [0.4, 0.6]}


---------------------------
### Example : Saving the best model

In [None]:
# from sklearn.externals import joblib
# joblib.dump(data_scaler, 'data_scaler_cancer.pkl')
# joblib.dump(pca, 'pca_cancer.pkl')
# joblib.dump(clf_svc, 'svc_cancer.pkl')

### Example : Load the model

In [None]:
# # load model:
# # 1.standard
# # 2. pca
# # 3. algorithm
# svm= joblib.load('svc_cancer.pkl')
# predict=svm.predict(X_test)
# print 'Score :',accuracy_score(predict,y_test)

# svm= joblib.load('lr_cancer.pkl')
# predict=svm.predict(X_test)
# print 'Score :',accuracy_score(predict,y_test)
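
The commented-out cells above import joblib from sklearn.externals, which only exists in older scikit-learn releases. A sketch of the same save and load round trip with the standalone joblib package (installed alongside scikit-learn) could look like this. The temporary directory, the Pipeline wrapper, and the SVC settings are my own assumptions; bundling the scaler and the classifier in one object means only one file has to be saved.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X = cancer.data
y = (cancer.target == 0).astype(int)  # 1 = malignant, 0 = benign
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Train a scaler + SVC pipeline (settings chosen for illustration).
model = Pipeline([("scaler", StandardScaler()),
                  ("svc", SVC(kernel="linear", C=0.1))])
model.fit(X_train, y_train)

# Save the fitted pipeline to a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "svc_cancer.pkl")
joblib.dump(model, model_path)

# Load it back and predict with the restored object.
restored = joblib.load(model_path)
predict = restored.predict(X_test)
print("Score: {:.4f}".format(accuracy_score(y_test, predict)))
```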
