Preprocessing
----------
The following code contains a series of preprocessing functions that can make it much quicker to improve the performance of any ML algorithm.

* Selecting features
* Dealing with missing values
* Dealing with outliers
* Feature scaling 
* Splitting dataset

Select the features that are input into the model. These could be chosen based on which features are considered most important (from Eleanor Barr's feature importance functions or mine).

This notebook steps through how to: 
1. Import the preprocessing functions from preprocessing_ml.py
2. Use the functions to preprocess data
3. Apply an ML algorithm (Linear Regression) to the preprocessed data

In [63]:
#import relevant libraries 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os

In [64]:
#read in the dataset 
dataset = pd.read_csv('framingham.csv')

In [65]:
#This is required to accept any changes to the module by forcing notebook to re-read the file 
#if any changes are made to the module while notebook is running
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We now take a look at the content of the module:
----------

In [66]:
%%bash 
#read the preprocessing file to check it is there and looks right

head -20 preprocessing_ml.py # Print head of .py file to check if the file can be found


'''This is the preprocessing module created by Lewis Howell for the Exeter NatSci Machine Learning Group. 30/10/19
The module contains functions for feature selection, dealing with missing values, feature scaling 
and splitting dataset into test and training sets.
	- chose_features
	- drop_missing
	- impute_missing
	- scale_data
	- split_data
'''

print("Importing the preprocessing module for the Exeter NatSci Machine Learning Group.....")


def chose_features(dataset, features=[], n_features = -1, v=1, vv =0):
    '''Return reduced dataset with only chosen columns
    - dataset: pandas dataframe of dataset to have columns chosen
    - features (optional, default = all features): list of strings matching features to keep
    - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
    ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
        'diabetes', 'BPMeds', 'male', 'BMI', 'prevalentStroke',


In [67]:
#TL;DR
# You can use the functions to make the preprocessing VERY easy,
# e.g. process the data in 1 line:
from preprocessing_ml import *

X_train, X_test, y_train, y_test = split_data(scale_data(drop_missing(chose_features(dataset))))


Importing the preprocessing module for the Exeter NatSci Machine Learning Group.....
Successfully imported the preprocessing module
Now selecting chosen features....
	 * Number of features:  15 (and "10YearCHD")
	 * Number of dropped features:  0

Now dropping rows with missing values....
	 * Dropped 582 rows 13.7%. 3658 rows remaining

Scaling data....
	 * Using standard scaling

Splitting data set into training and test sets....
	 * 80.0% data in training set
	 * 20.0% data in test set


  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


Tutorial
--------------------

In [77]:
#Try to import the contents of the local module preprocessing_ml.py file
import preprocessing_ml as prep

help(prep) #Print the help. Shows the functions in the file

Help on module preprocessing_ml:

NAME
    preprocessing_ml

DESCRIPTION
    This is the preprocessing module created by Lewis Howell for the Exeter NatSci Machine Learning Group. 30/10/19
    The module contains functions for feature selection, dealing with missing values, feature scaling 
    and splitting dataset into test and training sets.
            - chose_features
            - drop_missing
            - impute_missing
            - scale_data
            - split_data

FUNCTIONS
    chose_features(dataset, features=[], n_features=-1, v=1, vv=0)
        Return reduced dataset with only chosen columns
        - dataset: pandas dataframe of dataset to have columns chosen
        - features (optional, default = all features): list of strings matching features to keep
        - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
        ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
            'diabetes', 'BPMeds', 'male'

Selecting columns
-------------
* n_features: the top n features ranked by chi-squared importance
* features: a custom list of column names
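
As a rough illustration of the two options above, the sketch below selects columns either from a custom list or by taking the top n names from a pre-ranked list. The toy DataFrame, the `ranked` list and the helper `pick_columns` are assumptions made for this example; they are not the code in preprocessing_ml.py.

```python
import pandas as pd

# Toy data standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "TenYearCHD": [0, 1, 0],
    "age": [39, 61, 46],
    "glucose": [77.0, 103.0, 85.0],
    "heartRate": [80.0, 65.0, 85.0],
})

# Hypothetical ranking of feature names, most important first.
ranked = ["glucose", "age", "heartRate"]


def pick_columns(df, target, features=None, n_features=None):
    """Return a copy of df holding the target plus the chosen feature columns."""
    if n_features is not None:
        features = ranked[:n_features]            # top n from the ranked list
    elif features is None:
        features = [c for c in df.columns if c != target]  # keep everything
    return df[[target] + list(features)].copy()


top2 = pick_columns(toy, "TenYearCHD", n_features=2)
custom = pick_columns(toy, "TenYearCHD", features=["heartRate"])
print(list(top2.columns))
print(list(custom.columns))
```

Keeping the target column in the result means the reduced frame can be passed straight on to the later steps.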

In [92]:
#Example for how to use:

#1. Feature selection:
help(prep.chose_features) #print help to see usage of module 

#features_list = ['TenYearCHD', 'sysBP', 'age', 'cigsPerDay', 'totChol']
#print(prep.chose_features(dataset, features=features_list,vv=1))

d2 = prep.chose_features(dataset, n_features=10,vv=1)
d2.head(2)

Importing the preprocessing module for the Exeter NatSci Machine Learning Group.....
Successfully imported the preprocessing module
Help on function chose_features in module preprocessing_ml:

chose_features(dataset, features=[], n_features=-1, v=1, vv=0)
    Return reduced dataset with only chosen columns
    - dataset: pandas dataframe of dataset to have columns chosen
    - features (optional, default = all features): list of strings matching features to keep
    - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
    ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
        'diabetes', 'BPMeds', 'male', 'BMI', 'prevalentStroke',
        'education', 'heartRate', 'currentSmoker'],
    - v (optional) - Verbose (default 1) int 0 or 1. Print no. of features kept and lost 
    - vv (optional) - Very verbose (default 0) int 0 or 1. Print list of chosen and rejected features

Now selecting chosen features....
	 * Number of feature

Unnamed: 0,TenYearCHD,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
0,0,106.0,77.0,39,195.0,0.0,70.0,0,0,0.0,1
1,0,121.0,76.0,46,250.0,0.0,81.0,0,0,0.0,0


Dealing with missing values
-------------
* Dropping missing values
* Imputation
    * Mean or median
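
The two strategies listed above can be sketched directly in pandas. This is a minimal example on made-up values, assuming every column is numeric; it is not the module's own implementation.

```python
import numpy as np
import pandas as pd

# Toy data with two gaps (values are made up).
toy = pd.DataFrame({
    "glucose": [77.0, np.nan, 70.0, 103.0],
    "totChol": [195.0, 250.0, np.nan, 225.0],
    "age": [39, 46, 48, 61],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna().reset_index(drop=True)

# Option 2: fill each gap with that column's median (use toy.mean() for the mean).
medians = toy.median()
imputed = toy.fillna(medians)

print(len(toy) - len(dropped), "rows dropped")
print(imputed)
```

Dropping is simpler but loses whole rows; imputation keeps every row at the cost of inventing plausible values.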

In [85]:
#2. Dealing with missing values: Dropping

help(prep.drop_missing)

d3 = prep.drop_missing(d2)
d3.head()

Help on function drop_missing in module preprocessing_ml:

drop_missing(dataset)
    Drop rows with any missing values and return dataset with dropped rows. Prints number and percentage of rows dropped
    - Dataset: pandas Dataframe

Now dropping rows with missing values....
	 * Dropped 474 rows 11.2%. 3766 rows remaining



Unnamed: 0,TenYearCHD,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
0,0,106.0,77.0,39,195.0,0.0,70.0,0,0,0.0,1
1,0,121.0,76.0,46,250.0,0.0,81.0,0,0,0.0,0
2,0,127.5,70.0,48,245.0,20.0,80.0,0,0,0.0,1
3,1,150.0,103.0,61,225.0,30.0,95.0,1,0,0.0,0
4,0,130.0,85.0,46,285.0,23.0,84.0,0,0,0.0,0


In [80]:
#2.1 Imputation as an alternative to dropping missing vals
help(prep.impute_missing)
d3_2 = prep.impute_missing(d2,strategy='median',vv=1)
d3_2.head()

Help on function impute_missing in module preprocessing_ml:

impute_missing(dataset, strategy='median', v=1, vv=0)
    Imputation - alternative to removing missing values.
    Fill all missing with column average (median or mean)
    dataset - Pandas Dataframe to be imputed
    strategy - str (optional) 'median' (default) or 'mean' to fill missing values with
    - v (optional) - Verbose (default 1) int 0 or 1. Print no. of missing and imputed values  
    - vv (optional) - Very verbose (default 0) int 0 or 1. Print list of imputed features with counts and replaced value

Imputing missing values with median....
	 * Number of missing values:  520
	 * Number of imputed values:  520


              N_missing  Imputed_value
glucose             388           78.0
BPMeds               53            0.0
totChol              50          234.0
cigsPerDay           29            0.0
male                  0            0.0
diabetes              0            0.0
prevalentHyp          0            0

Unnamed: 0,TenYearCHD,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
0,0.0,106.0,77.0,39.0,195.0,0.0,70.0,0.0,0.0,0.0,1.0
1,0.0,121.0,76.0,46.0,250.0,0.0,81.0,0.0,0.0,0.0,0.0
2,0.0,127.5,70.0,48.0,245.0,20.0,80.0,0.0,0.0,0.0,1.0
3,1.0,150.0,103.0,61.0,225.0,30.0,95.0,1.0,0.0,0.0,0.0
4,0.0,130.0,85.0,46.0,285.0,23.0,84.0,0.0,0.0,0.0,0.0


In [90]:
#3. Scaling
help(prep.scale_data)

d4 = prep.scale_data(d3,method='standard')
d4.head()

Help on function scale_data in module preprocessing_ml:

scale_data(data, method='standard', v=1)
    Return dataset scaled by MinMaxScalar or StandardScalar methods from sklearn.preprocessing
    - data: pandas dataframe of data to be scaled
    - method (optional): str of either 'minmax' for MinMaxScalar or 'std' for StandardScaler (default arg)
    - v (optiona -default = 1): Verbose

Scaling data....
	 * Using standard scaling


  return self.partial_fit(X, y)
  return self.fit(X, **fit_params).transform(X)


Unnamed: 0,TenYearCHD,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
0,-0.426669,-1.193478,-0.203953,-1.232568,-0.938737,-0.755367,-1.083114,-0.673835,-0.167687,-0.17668,1.116699
1,-0.426669,-0.515134,-0.245451,-0.417595,0.29417,-0.755367,-0.161456,-0.673835,-0.167687,-0.17668,-0.895496
2,-0.426669,-0.221185,-0.494439,-0.184746,0.182088,0.923276,-0.245243,-0.673835,-0.167687,-0.17668,1.116699
3,2.343737,0.79633,0.874995,1.328775,-0.266242,1.762598,1.011563,1.484042,-0.167687,-0.17668,-0.895496
4,-0.426669,-0.108128,0.128031,-0.417595,1.078748,1.175073,0.089905,-0.673835,-0.167687,-0.17668,-0.895496
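
The table above shows that standard scaling rescales each column to zero mean and unit variance, and that the `TenYearCHD` labels were rescaled along with the features (0 and 1 became roughly -0.43 and 2.34). If the labels are needed as 0/1 later, for example by a classifier, one option is to scale only the feature columns. Below is a minimal sketch on made-up values; the helper name `standardise_features` is hypothetical and this is not how `scale_data` is implemented.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "TenYearCHD": [0, 0, 1, 0],
    "age": [39.0, 46.0, 61.0, 46.0],
    "sysBP": [106.0, 121.0, 150.0, 130.0],
})


def standardise_features(df, target):
    """Scale every column except `target` to zero mean and unit variance."""
    out = df.copy()
    cols = [c for c in df.columns if c != target]
    # ddof=0 is the population standard deviation, the convention StandardScaler uses.
    out[cols] = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)
    return out


scaled = standardise_features(toy, "TenYearCHD")
print(scaled.round(3))
```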


In [91]:
#4. Splitting Data

help(prep.split_data)

X_train, X_test, y_train, y_test = prep.split_data(d4)
X_train.head()

Help on function split_data in module preprocessing_ml:

split_data(dataset, dep_var='TenYearCHD', test_size=0.2, v=1)
    Split the dataset, return X_train, X_test, y_train, y_test as Pandas Dataframes
    - dataset: Pandas Dataframe. Data to split into training and test data
    - dep_var (optional, default = 'TenYearCHD'): string. Name of column to be dependant variable
    - test_size (optional, default = 0.2): float (0.0-1.0). Proportion of total data to make up test set.
    Returns 4 datasets in order: X_train, X_test, y_train, y_test


Splitting data set into training and test sets....
	 * 80.0% data in training set
	 * 20.0% data in test set


Unnamed: 0,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
1571,-0.967363,0.128031,-0.53402,-0.938737,-0.08391,-0.32903,-0.673835,-0.167687,-0.17668,-0.895496
17,-0.469911,-0.162455,-1.348993,-0.938737,-0.335706,0.131799,-0.673835,-0.167687,-0.17668,-0.895496
3039,1.632954,-0.203953,0.164528,-0.15416,-0.755367,1.179137,1.484042,-0.167687,-0.17668,-0.895496
3686,-0.221185,-0.079459,1.445199,-0.669739,-0.755367,-0.664178,-0.673835,-0.167687,-0.17668,-0.895496
1936,-0.786472,-0.577435,0.280953,-0.109327,0.923276,-0.245243,-0.673835,-0.167687,-0.17668,-0.895496
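
To make the 80/20 split above concrete, here is a hold-out split written with pandas only. It is a sketch of the idea rather than the module's implementation; the name `holdout_split` and the toy data are assumptions.

```python
import pandas as pd

# Toy data (values are made up).
toy = pd.DataFrame({
    "TenYearCHD": [0, 1, 0, 0, 1, 0, 0, 1, 0, 0],
    "age": [39, 61, 46, 48, 52, 43, 63, 45, 50, 41],
})


def holdout_split(df, dep_var, test_size=0.2, seed=0):
    """Shuffle the rows, then keep the last `test_size` share as the test set."""
    shuffled = df.sample(frac=1.0, random_state=seed)   # reproducible shuffle
    n_test = int(round(len(df) * test_size))
    n_train = len(df) - n_test
    train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
    X_train, y_train = train.drop(columns=[dep_var]), train[dep_var]
    X_test, y_test = test.drop(columns=[dep_var]), test[dep_var]
    return X_train, X_test, y_train, y_test


X_tr, X_te, y_tr, y_te = holdout_split(toy, "TenYearCHD")
print(len(X_tr), "training rows,", len(X_te), "test rows")
```

As in the output above, the original row index is kept after shuffling, so rows can still be traced back to the source data.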


Below are the original functions. FOR REFERENCE ONLY. Any fixes should be made to the preprocessing_ml.py file.

You should NOT need to run the cell below

# Feature selection

In [5]:
'''Selecting features - dropping uninteresting columns'''


def chose_features(dataset, features=None, n_features = -1, v=1, vv =0):
    '''Return reduced dataset with only chosen columns
    - dataset: pandas dataframe of dataset to have columns chosen
    - features (optional, default = all features): list of strings matching features to keep
    - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
    ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
        'diabetes', 'BPMeds', 'male', 'BMI', 'prevalentStroke',
        'education', 'heartRate', 'currentSmoker'],
    - v (optional) - Verbose (default 1) int 0 or 1. Print no. of features kept and lost 
    - vv (optional) - Very verbose (default 0) int 0 or 1. Print list of chosen and rejected features
    '''
                
    print('Now selecting chosen features....')

    #Default to every column of the dataframe passed in, not the notebook-level `dataset`
    if features is None:
        features = list(dataset.columns)
    
    if n_features != -1: 
        if n_features > len(dataset.columns):
            print('WARNING: chose_features has an error: n_features must be less than no. columns')
            return(-1)
        else:
            ordered_f = ['TenYearCHD','glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
            'diabetes', 'BPMeds', 'male', 'BMI', 'sysBP','prevalentStroke',
            'education', 'heartRate', 'currentSmoker']
            features = ordered_f[0:n_features]

    if v == 1: 
        print('\t * Number of features: ', len(features))
        print('\t * Number of dropped features: ', len(dataset.columns) - len(features))
        
    if vv == 1:
        print('\t * Chosen features: ', features)
        print('\t * Dropped features: ',[col for col in dataset.columns if col not in features])
    print('')
    
    return dataset.copy()[features] #reduced dataset




# Missing values 

In [None]:
# '''Dealing with missing values'''
#DON'T RUN THIS
#Method 1: Drop missing values
def drop_missing(dataset):
    '''Drop rows with any missing values and return dataset with dropped rows. Prints number and percentage of rows dropped
    - Dataset: pandas Dataframe
    '''
    print('Now dropping rows with missing values....')
    dataset2 = dataset.copy().dropna().reset_index(drop=True)
    lost = len(dataset) - len(dataset2)
    print('\t * Dropped {} rows {:.1f}%. {} rows remaining\n'.format(lost,lost/len(dataset)*100,len(dataset2)))
    return dataset2

# Scaling data 

In [6]:
#DON'T RUN THIS
def mean_normalize(dataset):
    '''
    Normalise all features in a dataframe between -1 and 1 and return normalised dataframe.
    This is one method of feature scaling that may aid the performance of some ML algorithms
    Normalisation: (feature - mean)/range
    '''

    for feature in dataset:
        
        fmean = np.mean(dataset[feature])
        frange = np.amax(dataset[feature]) - np.amin(dataset[feature])

        #Vector Subtraction
        dataset[feature] = dataset[feature] - fmean
        #Vector Division
        dataset[feature] = dataset[feature] / frange

    return dataset

##e.g.
#dataset_n = mean_normalize(dataset.copy())
#dataset_n.head()

##I then found there were some built-in normalisation/scaling classes in sklearn.preprocessing, so I tried some of these


def scale_data(data, method='std'):
    '''Return dataset scaled by MinMaxScaler or StandardScaler methods from sklearn.preprocessing
    - data: pandas dataframe of data to be scaled
    - method (optional): str of either 'minmax' for MinMaxScaler or 'std' for StandardScaler (default arg)
    '''
    from sklearn import preprocessing
    
    if method == 'minmax':
        scaler_minmax = preprocessing.MinMaxScaler((0,1))
        return pd.DataFrame(scaler_minmax.fit_transform(data.copy()),columns=data.columns) 
    
    elif method == 'std':
        scaler_std = preprocessing.StandardScaler() #with_std=False
        return pd.DataFrame(scaler_std.fit_transform(data.copy()),columns=data.columns) #use the `data` argument, not the global `dataset`
    
    else:
        print('\nscale_data encountered a failure!!\n')
        return(-1)

##e.g.
##scale_data(dataset).head()
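
The scaling options in this cell differ only in what is subtracted and what is divided by. A small NumPy sketch on made-up values (not taken from the dataset) shows the three formulas side by side:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0])   # made-up feature values

# Mean normalisation: (x - mean) / range, values fall within [-1, 1].
mean_norm = (x - x.mean()) / (x.max() - x.min())

# Min-max scaling: (x - min) / range, values fall within [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# Standard scaling: (x - mean) / std, giving mean 0 and standard deviation 1.
standard = (x - x.mean()) / x.std()

print(mean_norm)
print(min_max)
print(standard)
```

All three break down on a constant column, because the range and the standard deviation are then zero.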

# Splitting data 

In [7]:
#DON'T RUN THIS
def split_data(dataset,dep_var='TenYearCHD', test_size = 0.2, v = 1):
    '''Split the dataset, return X_train, X_test, y_train, y_test as Pandas Dataframes
    - dataset: Pandas Dataframe. Data to split into training and test data
    - dep_var (optional, default = 'TenYearCHD'): string. Name of column to be dependent variable
    - test_size (optional, default = 0.2): float (0.0-1.0). Proportion of total data to make up test set.
    '''
    from sklearn.model_selection import train_test_split
    y = dataset[dep_var]
    X = dataset.drop([dep_var], axis = 1)
    if v == 1: 
        print('Splitting data set into {}% training, {}% test dataset....'.format(100*(1-test_size),100*test_size))
        
    return train_test_split(X, y, test_size = test_size, random_state=0)

# Cross-validation 

Cross-validation is used to assess the predictive performance of the models and to judge how they will perform on new data outside the sample.

The main method is k-fold cross-validation, which follows this general procedure:
1. Shuffle the dataset randomly
2. Split the dataset into k groups
3. For each unique group 
    - take the group as a hold-out (test) data set 
    - take the remaining groups as a training set 
    - fit a model on the training set and evaluate it on the test set 
    - retain the evaluation score and discard the model 
4. Summarize the skill of the model using the sample of model evaluation scores 
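
The procedure can be sketched with NumPy alone. The example below is illustrative only: `k_fold_indices` and `mean_baseline_score` are hypothetical helpers, and the "model" simply predicts the training mean so that the loop stays self-contained.

```python
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)        # step 1: shuffle
    folds = np.array_split(order, k)          # step 2: split into k groups
    for i in range(k):                        # step 3: hold each group out once
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx


def mean_baseline_score(y_train, y_test):
    """Toy 'model': predict the training mean; score is negative mean squared error."""
    prediction = y_train.mean()
    return -np.mean((y_test - prediction) ** 2)


y = np.arange(12, dtype=float)                # made-up target values
scores = [mean_baseline_score(y[tr], y[te])
          for tr, te in k_fold_indices(len(y), k=3)]
print(np.mean(scores), np.std(scores))        # step 4: summarise the scores
```

Every sample lands in the test set exactly once, so the mean score uses all of the data without ever scoring a model on rows it was trained on.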

In [1]:
def cross_val(model, X, Y, cv=3):
    from sklearn.model_selection import cross_val_score 

    print('\nCrossvalidation score for {} splits:\n'.format(cv))   
    cv_results = cross_val_score(model, X, Y, cv=cv) #cv must be passed by keyword; the 4th positional slot is not cv
    print(cv_results)
    print("Cross validation Accuracy: %0.2f (+/- %0.2f)" % (cv_results.mean(), cv_results.std() * 2))
    

#where model is whatever you have defined the model as 
#For example, in the K-neighbors section I have defined the model as KN. (see k_neighbors function)

If any further explanation of these functions is required, see the k_neighbors notebook, where they have been implemented, or ask Lewis or Ellie.