# Basic Descriptive Statistics with Pandas DataFrames
For this notebook we will use the Wisconsin Breast Cancer dataset (https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data). This is a dataset with an ID, Diagnosis ('M' for malignant and 'B' for benign) plus 30 features. 

You'll note that I use imports as I go so that you can see them where they get used--rather than piling them all up at the top of the file. This also means you only load them as you need them. 

In [None]:
import pandas as pd 

# read in the file from UCI <recommend you save locally and load it if your connectivity is iffy>

# Loading the file over the internet
filename = "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data" 

# Loading the file locally in the same folder as the Python Notebook
#filename = "wi_breast_cancer.csv"
names = ['ID','Diagnosis','Mean-Radius','Mean-Texture','Mean-Perimeter','Mean-Area','Mean-Smoothness',
         'Mean-Compactness','Mean-Concavity','Mean-ConcavePoints','Mean-Symmetry','Mean-FractalDimension',
         'StdErr-Radius','StdErr-Texture','StdErr-Perimeter','StdErr-Area','StdErr-Smoothness',
         'StdErr-Compactness','StdErr-Concavity','StdErr-ConcavePoints','StdErr-Symmetry','StdErr-FractalDimension',
         'Worst-Radius','Worst-Texture','Worst-Perimeter','Worst-Area','Worst-Smoothness',
         'Worst-Compactness','Worst-Concavity','Worst-ConcavePoints','Worst-Symmetry','Worst-FractalDimension']

#loading the file into a dataframe
data = pd.read_csv(filename, names=names, header=None) 

We'll start this lesson with a few ways to look at the data, such as the shape, info and description, which are built into Pandas dataframes, beginning with how many rows fall into each class.

In [None]:
print("Group by Diagnosis \n", data.groupby('Diagnosis').size())  # how many in each class?
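The shape, info and description mentioned above can be sketched as follows. This is a minimal illustration on a tiny made-up frame (`toy` and its values are assumptions, not rows from the real dataset); on the notebook's dataframe the same calls would be `data.shape`, `data.info()` and `data.describe()`.

```python
import pandas as pd

# Tiny stand-in frame; the values are made up for illustration only
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Diagnosis": ["M", "B", "B", "M"],
    "Mean-Radius": [17.9, 11.4, 12.3, 20.6],
})

print(toy.shape)           # (rows, columns)
toy.info()                 # column names, non-null counts and dtypes
summary = toy.describe()   # count, mean, std, min, quartiles, max for numeric columns
print(summary)
```

Note that `describe()` skips non-numeric columns by default, which is one reason to convert Diagnosis to a number.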

### Converting the "Class" to a Numeric or Boolean
As we can see above, Diagnosis, the dependent variable, is a categorical variable. Since it is a non-numeric value we will not be able to see it using many of the built-in dataframe tools unless we convert it to a boolean or numeric value.

We will: 
* Look at a section of the dataset where the value varies so that we can make sure we get what we want. 
* Map it to the numeric value we want: 'M' (malignant) = 1 or True and 'B' (benign) = 0 or False. 
* Look at the same section to see if we were successful.

In [None]:
print(data["Diagnosis"][20:25])

In [None]:
# Convert the Diagnosis to a numeric variable
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
# Malignant tumors = 1 or True and Benign tumors = 0 or False
print(data["Diagnosis"][20:25])
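Beyond eyeballing a five-row slice, it may help to check the whole column. The sketch below uses a made-up `labels` Series (an assumption for illustration) with one unexpected value to show what a failed mapping looks like: anything missing from the dictionary becomes NaN.

```python
import pandas as pd

# Made-up labels, including one unexpected value ('?') to show the failure mode
labels = pd.Series(["M", "B", "B", "?", "M"])
mapped = labels.map({"M": 1, "B": 0})

# Values missing from the dictionary become NaN, so count them
n_unmapped = int(mapped.isna().sum())
print("Unmapped labels:", n_unmapped)
print(mapped.value_counts(dropna=False))
```

The same check catches an easy mistake in a notebook: running the mapping cell a second time maps the already-converted 1s and 0s to NaN, because neither is a key in the dictionary.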

In [None]:
print("Correlation \n" , data.corr(method='pearson'))             # how correlated are the features pairwise?

There is a high pairwise correlation across Mean, StdErr and Worst for these features, as you would expect. Probably we do not need to keep all of them--just the one of each that gives us the best outcome. 
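With 30 features the printed correlation table is hard to scan, so one option is to list only the strongly correlated pairs. This is a rough sketch on synthetic data; the frame `toy`, its column names and the 0.9 `threshold` are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 'b' is nearly a copy of 'a', 'c' is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr(method="pearson").abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

threshold = 0.9   # arbitrary cut-off chosen for this sketch
print(pairs[pairs > threshold])
```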

In [None]:
print("Skew \n", data.skew())                                     # how Gaussian are the features?
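Sorting the skew values makes the most lopsided features easy to spot. A small sketch on made-up columns (the frame `toy` is an assumption for illustration): a skew near 0 suggests a roughly symmetric distribution, while a large positive value means a long right tail.

```python
import pandas as pd

# Made-up columns: one symmetric, one with a long right tail
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
    "right_tailed": [1, 1, 1, 2, 2, 3, 50],
})

# Largest skew first
skew = toy.skew().sort_values(ascending=False)
print(skew)
```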

As a sidebar, I also use pandas profiling (https://github.com/pandas-profiling/pandas-profiling) which generates a beautiful HTML page of the data. I'll let you check it out for yourselves. 

### Splitting the data into dependent and independent variables
Before we do anything, we should split our data into these classes and drop any variables that are not useful. ID appears to be the only non-useful variable for building our classification model. 

To create our classification model we need to assign the X and y. Diagnosis is our dependent variable ('y') and the remaining variables are our independent variables ('X'). We will drop 'ID' since we know it is non-predicting. 

In this section we will "inform" ourselves about the dataframe so we can properly segment it. 

In [None]:
data.info()

In [None]:
X = data.iloc[:, 2:32]   # load features into X dataframe
y = data.iloc[:, 1]      # load target into y Series

In [None]:
# Make sure that we have only the features we want
X.info()

In [None]:
# Make sure that the y holds what we expect by looking at the same section of "Diagnosis" above
y[20:25].head()
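Positional slicing works, but it silently breaks if the column order ever changes. As an alternative sketch, the same split can be written by column name; the frame `toy` below is a made-up stand-in with the same leading columns as the real data.

```python
import pandas as pd

# Small stand-in frame with the same leading columns as the real data (values made up)
toy = pd.DataFrame({
    "ID": [101, 102, 103],
    "Diagnosis": [1, 0, 0],
    "Mean-Radius": [17.9, 11.4, 12.3],
    "Mean-Texture": [10.4, 17.8, 21.3],
})

# Select by name rather than by position
X_toy = toy.drop(columns=["ID", "Diagnosis"])
y_toy = toy["Diagnosis"]

print(X_toy.columns.tolist())
print(y_toy.tolist())
```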

## Data Visualization
Let's now take a look at the data distribution of the independent variables using matplotlib and various built in plots: 

#### Univariate
* histograms
* density
* box
* scatter (each feature against the target)

#### Multivariate
* Correlation matrix
* Scatter matrix

### Histograms
Histograms are a univariate (one variable at a time) plot that allows us to see the "skew" that was represented as a single value by dataframe.skew(). 

By binning the data we can see that many of the variables are skewed and some are nearly Gaussian. For example, StdErr-Area, which had the largest skew, also has the most exponential-looking distribution. 

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

# A helper for visualizing the data--sets the width and precision
pd.set_option('display.width', 100) 
pd.set_option('display.precision', 3)   # the full option name; the bare 'precision' is rejected by recent pandas

# Look at the data distribution via histograms
X.hist(figsize=(15,20), color = 'orange', edgecolor = 'blue')
plt.subplots_adjust(wspace=.5, hspace=.25)
plt.show()

### Density Plots
These univariate density plots show us the same information, but as a line rather than a histogram. It's not clear why the density and histogram plots show the identical information in a different order. May I suggest Seaborn? https://seaborn.pydata.org/

In [None]:
X.plot(kind='density', subplots=True, layout=(6,5), sharex=False, figsize=(15,20))
plt.subplots_adjust(wspace=.5, hspace=.5)
plt.show()

### Box Plots
These plots also show the distribution, but add the quartile information. The box spans the 25th to the 75th percentile, so it holds the middle 50% of the data, and the line inside it marks the median. The whiskers extend to the furthest data points that lie within 1.5 times the interquartile range (the height of the box) of the box edges. The circles are data points beyond the whiskers, which are potential outliers. 
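The 1.5 rule used to flag potential outliers can be checked by hand. A minimal sketch on a made-up Series (`values` is an assumption for illustration), using matplotlib's default whisker setting of 1.5 times the interquartile range:

```python
import pandas as pd

# Made-up values with one obvious outlier
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 40])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1   # interquartile range: the height of the box

# By default the whiskers reach the furthest points inside these fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Anything outside the fences is drawn as a circle
outliers = values[(values < lower_fence) | (values > upper_fence)]
print(outliers.tolist())   # [40]
```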

In [None]:
X.plot(kind='box', subplots=True, layout=(6,5), sharex=False, sharey=False, figsize=(15,20))
plt.subplots_adjust(wspace=.5, hspace=.25)
plt.show()

### Scatterplots 
NOTE: this is not terrifically interesting since Y is 0 or 1. I have a scatter matrix below that does a pairwise comparison, but with this many features, it's also hard to justify. 

PS: Was too lazy to put these into subplots. If I get around to honing my Seaborn skills I may update this code later. 

In [None]:
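# Plot each feature on its own against the target
for col in X:
    plt.title(col)
    plt.scatter(X[col], y, c=y)   # one column at a time; passing all of X at once does not match y's length
    plt.show()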
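The one-figure-per-feature loop can also be drawn as a single grid of subplots with plain matplotlib. This is only a sketch on synthetic stand-in data: `X_toy`, `y_toy` and the 2 x 3 layout are assumptions, and for the real 30 features a 6 x 5 grid would be the natural choice.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for X and y (random values, not the real features)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(50, 6)),
                     columns=[f"feature_{k}" for k in range(6)])
y_toy = pd.Series(rng.integers(0, 2, size=50))

# One axis per feature, filled row by row
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 6))
for ax, col in zip(axes.ravel(), X_toy.columns):
    ax.scatter(X_toy[col], y_toy, c=y_toy)
    ax.set_title(col)
fig.tight_layout()
plt.show()
```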

### Correlation Matrix
With a smaller dataset one might find this interesting. As it is, with 30 features it is hard to see the value. Basically the zeroth row is the dependent variable--but telling which features are the yellowest is pretty tough. It's much better to do this via feature selection than to visualize the data when you have this many variables. 

In [None]:
# Correlation Matrix
fig = plt.figure(figsize=(15,12))
ax = fig.add_subplot(111) 
cax = ax.matshow(data.iloc[:, 1:32].corr(), vmin=-1, vmax=1)   # select columns 1-31 (Diagnosis plus features); data[1:32] would slice rows
fig.colorbar(cax) 
plt.subplots_adjust(wspace=.25, hspace=.25)
plt.show()
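Since judging the colors by eye is tough, one alternative is to rank the features numerically by their correlation with the target. A rough sketch on synthetic data (the frame `toy` and the column names `informative` and `noise` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one column tracks the target, the other is pure noise
rng = np.random.default_rng(1)
target = pd.Series(rng.integers(0, 2, size=300))
toy = pd.DataFrame({
    "Diagnosis": target,
    "informative": target + rng.normal(scale=0.3, size=300),
    "noise": rng.normal(size=300),
})

# Take the target's column of the correlation matrix and drop its self-correlation
target_corr = toy.corr()["Diagnosis"].drop("Diagnosis")

# Rank by strength, ignoring sign
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```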

### Scattermatrix for a pairwise comparison
As with correlation matrices, this works great for small datasets, but not well once you have more than about 10 features, as we do. I'm showing this for illustration purposes only. The purpose here, and with the correlation matrix above, is to show you multivariate plots. 

In [None]:
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(15,20), c=y)
plt.subplots_adjust(wspace=.01, hspace=.01)
plt.show()
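One way to keep a scatter matrix readable is to pass only a handful of columns. The sketch below uses synthetic data; the frame `toy`, the `labels` array and the three-column `subset` are assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend so the sketch runs anywhere; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in data (random values, not the real features)
rng = np.random.default_rng(2)
toy = pd.DataFrame(rng.normal(size=(40, 4)), columns=["f1", "f2", "f3", "f4"])
labels = rng.integers(0, 2, size=40)

# Plot only a few columns so each panel stays large enough to read
subset = ["f1", "f2", "f3"]
axes = scatter_matrix(toy[subset], figsize=(6, 6), c=labels)
plt.show()
```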