# Logistic Regression Lab 2

scikit-learn ships with several sample datasets that are well suited to
practising logistic regression.

This is a very free-form lab: you won't be walked through it step-by-step,
so you might want to keep some other examples open.

In [6]:
import sklearn.datasets
import pandas
import sklearn.linear_model
import sklearn.model_selection
import sklearn.metrics
import numpy

We will look at the Wisconsin breast cancer database, and a classic
dataset of [different kinds of iris flowers](https://en.wikipedia.org/wiki/Iris_flower_data_set).

In [7]:
bc = sklearn.datasets.load_breast_cancer()
print(bc.DESCR)

Breast Cancer Wisconsin (Diagnostic) Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)
        
        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, field
        13 is Radius SE, field 23 is Worst Radius.
 

In [8]:
iris = sklearn.datasets.load_iris()
print(iris.DESCR)

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d

# Wisconsin

In the Wisconsin breast cancer database, you are trying to predict whether
a tumour is malignant or benign. The database consists of the measurements
of the tumour (bc.data) and the nature of the tumour (bc.target): 0 = malignant, 1 = benign.

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.
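
One possible starting point, offered as a minimal sketch rather than a reference solution: scale the features, fit a logistic regression, and score it with 5-fold cross-validation. It assumes a scikit-learn version that provides `sklearn.model_selection`; the pipeline, `C=1.0`, and `max_iter=1000` are illustrative choices, not part of the lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()

# The features span very different scales, so standardize them before fitting.
# C and max_iter are illustrative values to vary, not recommended settings.
model = make_pipeline(StandardScaler(), LogisticRegression(C=1.0, max_iter=1000))

# 5-fold cross-validation: one accuracy score per held-out fold.
scores = cross_val_score(model, bc.data, bc.target, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```

Changing `C`, the penalty, or the subset of columns passed in and re-running the scoring line is one way to compare parameter combinations.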



In [11]:
malignancy = pandas.Series(bc.target).value_counts()
malignancy

1    357
0    212
dtype: int64
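
The integer codes in the counts above can be mapped back to class names through `bc.target_names`, which makes the encoding explicit. A small sketch, assuming only the attributes returned by `load_breast_cancer`:

```python
import pandas
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# target_names[i] is the class name for integer code i.
print(list(bc.target_names))

# Index the names with the target array to count samples per named class.
named_counts = pandas.Series(bc.target_names[bc.target]).value_counts()
print(named_counts)
```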

# Irises

There are three kinds of flowers in the dataset:

- [Setosa](https://en.wikipedia.org/wiki/Iris_setosa) ( = 0)

- [Versicolor](https://en.wikipedia.org/wiki/Iris_versicolor) ( = 1)

- [Virginica](https://en.wikipedia.org/wiki/Iris_virginica) ( = 2)

Try using various combinations of parameters in a logistic regression.

Validate your results with cross-validation.
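
As a hedged sketch of what this could look like for the three-class problem, the loop below compares a few regularization strengths using shuffled, stratified folds. The candidate `C` values, `random_state=0`, and `max_iter=1000` are arbitrary illustrative choices, and `sklearn.model_selection` is assumed to be available.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = load_iris()

# The iris samples are stored sorted by class, so shuffle while keeping
# each fold balanced across the three species.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):  # illustrative values only
    model = LogisticRegression(C=C, max_iter=1000)
    scores = cross_val_score(model, iris.data, iris.target, cv=folds)
    results[C] = scores.mean()
    print(C, results[C])
```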

In [13]:
# Range of each feature (minimum, maximum):
#     radius (mean):                         6.981   28.11
#     texture (mean):                        9.71    39.28
#     perimeter (mean):                      43.79   188.5
#     area (mean):                           143.5   2501.0
#     smoothness (mean):                     0.053   0.163
#     compactness (mean):                    0.019   0.345
#     concavity (mean):                      0.0     0.427
#     concave points (mean):                 0.0     0.201
#     symmetry (mean):                       0.106   0.304
#     fractal dimension (mean):              0.05    0.097
#     radius (standard error):               0.112   2.873
#     texture (standard error):              0.36    4.885
#     perimeter (standard error):            0.757   21.98
#     area (standard error):                 6.802   542.2
#     smoothness (standard error):           0.002   0.031
#     compactness (standard error):          0.002   0.135
#     concavity (standard error):            0.0     0.396
#     concave points (standard error):       0.0     0.053
#     symmetry (standard error):             0.008   0.079
#     fractal dimension (standard error):    0.001   0.03
#     radius (worst):                        7.93    36.04
#     texture (worst):                       12.02   49.54
#     perimeter (worst):                     50.41   251.2
#     area (worst):                          185.2   4254.0
#     smoothness (worst):                    0.071   0.223
#     compactness (worst):                   0.027   1.058
#     concavity (worst):                     0.0     1.252
#     concave points (worst):                0.0     0.291
#     symmetry (worst):                      0.156   0.664
#     fractal dimension (worst):             0.055   0.208
columns_names = []
bc.data.shape

(569, 30)
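
The column names do not have to be typed out by hand: the object returned by `load_breast_cancer` also exposes them as `bc.feature_names`. A minimal sketch that builds a labelled DataFrame and recomputes the per-feature minimum and maximum (the names `features` and `summary` are introduced here for illustration):

```python
import pandas
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# bc.feature_names holds one name per column of bc.data.
features = pandas.DataFrame(bc.data, columns=bc.feature_names)
print(features.shape)

# Minimum and maximum of every feature, one row per feature.
summary = features.agg(["min", "max"]).T
print(summary.head())
```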

In [14]:
description_list = bc.DESCR.split('\n')
# Find the index i of the first table-border line ('====...') in the description
for i in range(len(description_list)):
    if '====================' in description_list[i]:
        break
description_list

['Breast Cancer Wisconsin (Diagnostic) Database',
 '',
 'Notes',
 '-----',
 'Data Set Characteristics:',
 '    :Number of Instances: 569',
 '',
 '    :Number of Attributes: 30 numeric, predictive attributes and the class',
 '',
 '    :Attribute Information:',
 '        - radius (mean of distances from center to points on the perimeter)',
 '        - texture (standard deviation of gray-scale values)',
 '        - perimeter',
 '        - area',
 '        - smoothness (local variation in radius lengths)',
 '        - compactness (perimeter^2 / area - 1.0)',
 '        - concavity (severity of concave portions of the contour)',
 '        - concave points (number of concave portions of the contour)',
 '        - symmetry ',
 '        - fractal dimension ("coastline approximation" - 1)',
 '        ',
 '        The mean, standard error, and "worst" or largest (mean of the three',
 '        largest values) of these features were computed for each image,',
 '        resulting in 30 features.  Fo

In [18]:
sklearn.linear_model.LogisticRegression?

In [17]:
import sklearn.model_selection

In [None]:
params = 
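
A parameter grid for a search is a dictionary mapping parameter names to lists of candidate values. As a hedged sketch only, the example below runs a grid search over `C` on the breast cancer data. It assumes `GridSearchCV` from `sklearn.model_selection`; the grid values, the scaling pipeline, and the name `param_grid` are hypothetical choices, not the lab's intended answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

bc = load_breast_cancer()

pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Hypothetical grid. make_pipeline names each step after its lowercased
# class name, hence the "logisticregression__" prefix.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0]}

# Each candidate is scored with 5-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(bc.data, bc.target)

print(search.best_params_, search.best_score_)
```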