Preprocessing demo
------------

This notebook steps through how to: 
1. Import the preprocessing functions from preprocessing_lh.py
2. Use the functions to preprocess data
3. Apply an ML algorithm (Linear Regression) to the preprocessed data

In [7]:
#import relevant libraries
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os

In [2]:
#read in the dataset 
dataset = pd.read_csv('framingham.csv')

In [29]:
#Reload imported modules automatically before each cell runs, so that any edits made
#to the module file while the notebook is running are picked up
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We now take a look at the content of the module:
----------

In [31]:
%%bash 
#read the preprocessing file to check it is there and looks right

head -20 preprocessing_lh.py # Print head of .py file to check if the file can be found


'''This is the preprocessing module created by Lewis Howell for the Exeter NatSci Machine Learning Group. 30/10/19
The module contains functions for feature selection, dealing with missing values, feature scaling 
and splitting dataset into test and training sets.
	- chose_features
	- drop_missing
	- impute_missing
	- scale_data
	- split_data
'''

print("Importing the preprocessing module for the Exeter NatSci Machine Learning Group.....")

def chose_features(dataset, features=dataset.columns, n_features = -1, v=1, vv =0):
    '''Return reduced dataset with only chosen columns
    - dataset: pandas dataframe of dataset to have columns chosen
    - features (optional, default = all features): list of strings matching features to keep
    - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
    ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
        'diabetes', 'BPMeds', 'male', 'BMI', 'prevalentStroke',
        'education', 'he

In [37]:
#Try to import the contents of the local module preprocessing_lh.py file
import preprocessing_lh as prep #bind the module itself to the name used by help() below
from preprocessing_lh import *

help(prep) #Print the help. Shows the functions in the file

Help on module preprocessing_lh:

NAME
    preprocessing_lh

DESCRIPTION
    This is the preprocessing module created by Lewis Howell for the Exeter NatSci Machine Learning Group. 30/10/19
    The module contains functions for feature selection, dealing with missing values, feature scaling 
    and splitting dataset into test and training sets.
            - chose_features
            - drop_missing
            - impute_missing
            - scale_data
            - split_data

FUNCTIONS
    chose_features(dataset, features=[], n_features=-1, v=1, vv=0)
        Return reduced dataset with only chosen columns
        - dataset: pandas dataframe of dataset to have columns chosen
        - features (optional, default = all features): list of strings matching features to keep
        - n_features (optional) - if specified, the top n features from the scaled list is chosen: 
        ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
            'diabetes', 'BPMeds', 'male'

In [45]:
#Example for how to use:

#1. Choose top 10 features:
chose_features(dataset, n_features=10)

Importing the preprocessing module for the Exeter NatSci Machine Learning Group.....
Successfully imported the preprocessing module
Now selecting chosen features....
	 * Number of features:  10 (and "10YearCHD")
	 * Number of dropped features:  5



Unnamed: 0,TenYearCHD,sysBP,glucose,age,totChol,cigsPerDay,diaBP,prevalentHyp,diabetes,BPMeds,male
0,0,106.0,77.0,39,195.0,0.0,70.0,0,0,0.0,1
1,0,121.0,76.0,46,250.0,0.0,81.0,0,0,0.0,0
2,0,127.5,70.0,48,245.0,20.0,80.0,0,0,0.0,1
3,1,150.0,103.0,61,225.0,30.0,95.0,1,0,0.0,0
4,0,130.0,85.0,46,285.0,23.0,84.0,0,0,0.0,0
5,0,180.0,99.0,43,228.0,0.0,110.0,1,0,0.0,0
6,1,138.0,85.0,63,205.0,0.0,71.0,0,0,0.0,0
7,0,100.0,78.0,45,313.0,20.0,71.0,0,0,0.0,0
8,0,141.5,79.0,52,260.0,0.0,89.0,1,0,0.0,1
9,0,162.0,88.0,43,225.0,30.0,107.0,1,0,0.0,1


# Feature selection

In [5]:
'''Selecting features - dropping uninteresting columns'''

def chose_features(dataset, features=None, n_features = -1, v=1, vv =0):
    '''Return reduced dataset with only chosen columns
    - dataset: pandas dataframe of dataset to have columns chosen
    - features (optional, default = all features): list of strings matching features to keep
    - n_features (optional) - if specified, the top n features from the scaled list are chosen: 
    ['glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
        'diabetes', 'BPMeds', 'male', 'BMI', 'prevalentStroke',
        'education', 'heartRate', 'currentSmoker'],
    - v (optional) - Verbose (default 1) int 0 or 1. Print no. of features kept and lost 
    - vv (optional) - Very verbose (default 0) int 0 or 1. Print list of chosen and rejected features
    '''
    #Resolve the default here: a default of dataset.columns in the signature is evaluated once,
    #at definition time, against the global dataset rather than the dataframe passed in
    if features is None:
        features = list(dataset.columns)

    print('Now selecting chosen features....')
    
    if n_features != -1: 
        if n_features > len(dataset.columns):
            print('WARNING: chose_features has an error: n_features must not exceed the number of columns')
            return(-1)
        else:
            ordered_f = ['TenYearCHD','glucose', 'age', 'totChol', 'cigsPerDay', 'diaBP', 'prevalentHyp',
            'diabetes', 'BPMeds', 'male', 'BMI', 'sysBP','prevalentStroke',
            'education', 'heartRate', 'currentSmoker']
            features = ordered_f[0:n_features]

    if v == 1: 
        print('\t * Number of features: ', len(features))
        print('\t * Number of dropped features: ', len(dataset.columns) - len(features))
        
    if vv == 1:
        print('\t * Chosen features: ', features)
        print('\t * Dropped features: ',[col for col in dataset.columns if col not in features])
    print('')
    
    return dataset.copy()[features] #reduced dataset
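
The selection step itself is ordinary pandas column indexing. The snippet below is a minimal sketch of that step on a two-row toy frame (values taken from rows 0 and 3 of the output shown earlier). The check for unknown column names is an assumption added for illustration and is not part of `chose_features`.

```python
import pandas as pd

# Toy frame: rows 0 and 3 of the table printed earlier, three columns only
toy = pd.DataFrame({"TenYearCHD": [0, 1], "sysBP": [106.0, 150.0], "age": [39, 61]})

keep = ["TenYearCHD", "age"]

# Illustrative guard (not in the module): fail early on misspelt feature names
missing = [col for col in keep if col not in toy.columns]
if missing:
    raise KeyError("Unknown features: {}".format(missing))

# Select the columns and copy, so later edits do not touch the original frame
reduced = toy[keep].copy()
reduced.loc[0, "age"] = 40

print(list(reduced.columns))  # ['TenYearCHD', 'age']
print(toy.loc[0, "age"])      # 39 (the original is unchanged)
```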




# Missing values 

In [3]:
'''Dealing with missing values'''

#Method 1: Drop missing values
def drop_missing(dataset):
    '''Drop rows with any missing values and return dataset with dropped rows. Prints number and percentage of rows dropped
    - Dataset: pandas Dataframe
    '''
    print('Now dropping rows with missing values....')
    dataset2 = dataset.copy().dropna().reset_index(drop=True)
    lost = len(dataset) - len(dataset2)
    print('\t * Dropped {} rows ({:.1f}%). {} rows remaining\n'.format(lost,lost/len(dataset)*100,len(dataset2)))
    return dataset2
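
The module docstring also lists an `impute_missing` function, which is not shown in this notebook. As a rough idea of what a second method could look like, here is a sketch that fills missing values with a per-column statistic. The function name `impute_missing_sketch`, its `strategy` parameter and the toy values are assumptions made for illustration; the real `impute_missing` may work differently.

```python
import numpy as np
import pandas as pd

def impute_missing_sketch(dataset, strategy="median"):
    """Hypothetical helper: fill NaNs in numeric columns with the column median or mean."""
    if strategy == "median":
        fill_values = dataset.median(numeric_only=True)
    elif strategy == "mean":
        fill_values = dataset.mean(numeric_only=True)
    else:
        raise ValueError("strategy must be 'median' or 'mean'")
    # fillna with a Series fills each column with its own value and returns a new frame
    return dataset.fillna(fill_values)

# Toy data with one missing value per column
toy = pd.DataFrame({"glucose": [77.0, np.nan, 70.0, 103.0],
                    "totChol": [195.0, 250.0, np.nan, 225.0]})

imputed = impute_missing_sketch(toy)
print(imputed)
```

Unlike dropping rows, this keeps every row, at the cost of replacing unknown values with an estimate.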

# Scaling data 

In [6]:
def mean_normalize(dataset):
    '''
    Normalise all features in a dataframe to lie between -1 and 1 and return the normalised dataframe.
    Note: the dataframe passed in is modified in place, so pass a copy to keep the original.
    This is one method of feature scaling that may aid the performance of some ML algorithms
    Normalisation: (feature - mean)/range
    '''

    for feature in dataset:
        
        fmean = np.mean(dataset[feature])
        frange = np.amax(dataset[feature]) - np.amin(dataset[feature])

        #Vector Subtraction
        dataset[feature] = dataset[feature] - fmean
        #Vector Division
        dataset[feature] = dataset[feature] / frange

    return dataset

##e.g.
#dataset_n = mean_normalize(dataset.copy())
#dataset_n.head()

##I then found there were some built-in normalisation/scaling classes in sklearn.preprocessing, so tried some of these


def scale_data(data, method='std'):
    '''Return dataset scaled by MinMaxScaler or StandardScaler methods from sklearn.preprocessing
    - data: pandas dataframe of data to be scaled
    - method (optional): str of either 'minmax' for MinMaxScaler or 'std' for StandardScaler (default arg)
    '''
    from sklearn import preprocessing
    
    if method == 'minmax':
        scaler_minmax = preprocessing.MinMaxScaler((0,1))
        return pd.DataFrame(scaler_minmax.fit_transform(data.copy()),columns=data.columns) 
    
    elif method == 'std':
        scaler_std = preprocessing.StandardScaler() #with_std=False
        return pd.DataFrame(scaler_std.fit_transform(data.copy()),columns=data.columns) #scale the argument, not the global dataset
    
    else:
        print('\nscale_data encountered a failure!!\n')
        return(-1)

##e.g.
##scale_data(dataset).head()
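
To make the difference between the three scalings concrete, the sketch below applies each of them to a single toy column. The column name `x` and its values are made up for illustration, and the mean normalisation is written as a one-line pandas expression rather than by calling `mean_normalize`.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy column, chosen so the results are easy to check by hand
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Mean normalisation: (feature - mean) / range, centred on 0
mean_norm = (toy - toy.mean()) / (toy.max() - toy.min())

# Min-max scaling: maps the smallest value to 0 and the largest to 1
minmax = pd.DataFrame(MinMaxScaler().fit_transform(toy), columns=toy.columns)

# Standardisation: zero mean and unit (population) standard deviation
std = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

print(mean_norm["x"].tolist())  # [-0.5, -0.25, 0.0, 0.25, 0.5]
print(minmax["x"].tolist())     # [0.0, 0.25, 0.5, 0.75, 1.0]
print(std["x"].round(3).tolist())
```

All three are linear rescalings of each column, so they change the units of a feature but not the ordering of its values.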

# Splitting data 

In [7]:
def split_data(dataset,dep_var='TenYearCHD', test_size = 0.2, v = 1):
    '''Split the dataset, return X_train, X_test, y_train, y_test (X as pandas DataFrames, y as pandas Series)
    - dataset: Pandas Dataframe. Data to split into training and test data
    - dep_var (optional, default = 'TenYearCHD'): string. Name of column to be dependent variable
    - test_size (optional, default = 0.2): float (0.0-1.0). Proportion of total data to make up test set.
    - v (optional, default = 1): int 0 or 1. Print the training/test proportions
    '''
    from sklearn.model_selection import train_test_split
    y = dataset[dep_var]
    X = dataset.drop([dep_var], axis = 1)
    if v == 1: 
        print('Splitting data set into {}% training, {}% test dataset....'.format(100*(1-test_size),100*test_size))
        
    return train_test_split(X, y, test_size = test_size, random_state=0)
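
The introduction mentions applying Linear Regression to the preprocessed data. A minimal sketch of that last step is given below, using the same `train_test_split` call as `split_data`. The data here is synthetic (random features `f1` to `f3` and a made-up continuous target), purely so that the snippet runs on its own; it is not the Framingham dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features, target is a noisy linear combination
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y = 2.0 * X["f1"] - 1.0 * X["f2"] + rng.normal(scale=0.1, size=100)

# Same split settings as split_data: 20% test set, fixed random_state for repeatability
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit on the training set only, then score (R^2) on the held-out test set
model = LinearRegression().fit(X_train, y_train)
r2 = model.score(X_test, y_test)

print(X_train.shape, X_test.shape)  # (80, 3) (20, 3)
print("Test R^2:", r2)
```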

# Cross-validation 

Cross-validation is used to assess the predictive performance of the models and to estimate how they will perform on new data outside the training sample.

In [15]:
#Function in progress
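
Until the function above is written, one possible starting point is scikit-learn's `cross_val_score`. The sketch below runs 5-fold cross-validation for a Linear Regression model on synthetic data. The data, the number of folds and the choice of `KFold` with shuffling are assumptions for illustration, not a description of the planned function.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: target is a noisy linear combination of two features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

# 5 folds: each fold is held out once while the model is trained on the other four
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For regressors the default score is R^2; one score is returned per fold
scores = cross_val_score(LinearRegression(), X, y, cv=cv)

print("R^2 per fold:", scores)
print("Mean: {:.3f}, std: {:.3f}".format(scores.mean(), scores.std()))
```

The spread of the per-fold scores gives a rough sense of how much the performance estimate depends on which rows end up in the held-out set.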