Prepare data for machine learning

In [1]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from scipy import sparse
from sklearn.decomposition import PCA
import seaborn as sn
import matplotlib.pyplot as plt

In [2]:
nursingHomeDF = pd.read_csv("nursingHomeData.csv", index_col = 'FederalProviderNumber')
#nursingHomeDF.drop((nursingHomeDF.loc[(nursingHomeDF["InfectionScore"] == "Not Available") & (nursingHomeDF["FacilityReadmissionScore"] == "Not Available")]).index, inplace = True)
nursingHomeDF.drop((nursingHomeDF.loc[(nursingHomeDF["FacilityReadmissionScore"] == "Not Available")]).index, inplace = True)

trainDF, testDF = train_test_split(nursingHomeDF, test_size = 0.3, random_state=0)

FileNotFoundError: [Errno 2] No such file or directory: 'nursingHomeData.csv'

In [None]:
trainDF.drop((trainDF.loc[(trainDF["InfectionScore"] == "Not Available") & (trainDF["FacilityReadmissionScore"] == "Not Available")]).index, inplace = True)

First off, we have too many columns to parse easily by eye. Most of them come from a single dataset, so we can set those aside for now and work on the few columns that come from the other datasets.

In [None]:
trainDF.info()

In [None]:
measureScoreColumns = trainDF.filter(like='Q').columns
noMeasureScoreColumns = list(set(trainDF.columns) - set(measureScoreColumns))

In [None]:
trainDF[noMeasureScoreColumns]

We will replace all NaN values in count columns with 0, under the assumption that if there was a penalty, it would have been reported.

In [None]:
countsColumns = ["fineCounts", "paymentDenialCounts", "StandardDeficiency", "ComplaintDeficiency", "InfectionControlInspectionDeficiency", "CitationunderIDR", "CitationunderIIDR"]
trainDF[countsColumns] = trainDF[countsColumns].fillna(0)

In [None]:
trainDF[noMeasureScoreColumns]

In [None]:
trainDF.shape

In [None]:
trainDF[noMeasureScoreColumns].isnull().sum()

In [None]:
noNA = (trainDF.isnull().sum() == 0).tolist()
hasNA = np.logical_not(noNA)

In [None]:
trainDF.loc[:,noNA].info()

In [None]:
trainDF.loc[:,hasNA].info()

As the rest of the values are scores from an evaluation, there are multiple imputation options, the most obvious being mean or median imputation. In this case both options seem reasonable, but let's go with mean imputation so that the mean of each score does not change after imputation. The initial distributions can be seen below.
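A minimal sketch of that trade-off on made-up numbers (the toy column below is an assumption, not taken from the nursing home data): mean imputation leaves the column mean untouched, median imputation generally does not, and both shrink the spread.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy score column with two missing entries (illustrative values only)
scores = np.array([[10.0], [20.0], [np.nan], [30.0], [np.nan], [60.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(scores)
median_filled = SimpleImputer(strategy="median").fit_transform(scores)

observed_mean = np.nanmean(scores)  # mean of the observed values only
print("observed mean:        ", observed_mean)
print("mean after mean fill: ", mean_filled.mean())
print("mean after median fill:", median_filled.mean())
print("std before / after mean fill:", np.nanstd(scores), mean_filled.std())
```

The drop in standard deviation is the price of mean imputation: every filled value sits exactly at the centre of the distribution.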

Features need to be scaled to a constant variance before PCA (we have a lot of features and can likely get rid of a few), and KNN is a distance-based algorithm, so it seems reasonable to scale before using KNN to impute. We will therefore first apply a standard scaler, which sets each feature's mean to 0 and variance to 1.
While min/max scaling is ideal for KNN, 

In [None]:
#scale features using standard scaler
numerics = trainDF.select_dtypes(include='float64').columns
scaler = StandardScaler()
scaler.fit(trainDF[numerics])
trainDF[numerics] = scaler.transform(trainDF[numerics])
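Scaling before imputing works because `StandardScaler` ignores NaN values when fitting and passes them through unchanged when transforming. A small sketch on a toy array (the values are assumptions for illustration only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix with missing entries in both columns (illustrative values only)
X = np.array([[1.0, 100.0],
              [2.0, np.nan],
              [3.0, 300.0],
              [np.nan, 500.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

print(scaler.mean_)                 # per-column means of the observed values
print(np.isnan(X_scaled).sum())     # missing entries are still missing
print(np.nanmean(X_scaled, axis=0), np.nanstd(X_scaled, axis=0))
```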

In [None]:
#starting histograms for our data
trainDF.loc[:,hasNA].hist(figsize = (50,50))

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
meanImputedData = trainDF.copy()

imputer.fit(meanImputedData.loc[:,hasNA])
meanImputedData.loc[:,hasNA] = imputer.transform(meanImputedData.loc[:,hasNA])

In [None]:
meanImputedData.isnull().sum().sum()

And the resulting distributions after imputation:

In [None]:
#resulting histograms
meanImputedData.loc[:,hasNA].hist(figsize = (50,50))

While most of the distributions look pretty similar, some have changed considerably.
Let's see if a more sophisticated imputation scheme can perform better, in this case scikit-learn's KNN imputer.


In [None]:
KNNimp = KNNImputer(n_neighbors=2)
KNNImputedData = trainDF.copy()

KNNimp.fit(KNNImputedData.loc[:,hasNA])
KNNImputedData.loc[:,hasNA] = KNNimp.transform(KNNImputedData.loc[:,hasNA])

In [None]:
KNNImputedData.loc[:,hasNA].hist(figsize = (50,50))

The KNN-imputed data looks a little more consistent with the initial data.
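A toy sketch of why KNN imputation can track the original distribution more closely than mean imputation. It assumes two perfectly related columns (`y = 2x`), which is far cleaner than the real scores, so treat it as an illustration of the mechanism rather than a result about this dataset.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Two related columns: y = 2x, with the last y value hidden
x = np.arange(1.0, 11.0)
y = 2.0 * x
true_value = y[-1]
y[-1] = np.nan
X = np.column_stack([x, y])

# Mean imputation fills with the column average, ignoring x entirely
mean_value = SimpleImputer(strategy="mean").fit_transform(X)[-1, 1]
# KNN imputation averages y over the 2 rows whose x is closest
knn_value = KNNImputer(n_neighbors=2).fit_transform(X)[-1, 1]

print("true:", true_value, "mean fill:", mean_value, "KNN fill:", knn_value)
```

Because the KNN imputer borrows values from similar rows, the filled value lands near the row's own neighbourhood instead of at the global centre.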

While we're at it, we can take a look at the rest of our data.

In [None]:
KNNImputedData.loc[:,noNA].hist(figsize = (50,50))

We have a lot of features, and many of the distributions look similar, so they may be linearly correlated. We can perform PCA to find out which features are likely to add little information to our model. First we check whether our matrix is sparse, as scikit-learn's PCA requires a non-sparse matrix.

In [None]:
trainDF = KNNImputedData
sparse.issparse(trainDF)

In [None]:
trainDF.shape

In [None]:
corrMatrix = trainDF.corr()

In [None]:
plt.figure(figsize=(50,50))
sn.heatmap(corrMatrix, annot=True)

From this we can see that the correlation mostly appears in 4x4 blocks (the lighter the color, the higher the correlation); that is, the values of a score over its 4 quarters are sometimes correlated with each other, but not always.
Having this much uncorrelated data is a good sign that there is a lot of information in our data, although that information does not necessarily explain what we want to model.
We can at least use PCA to reduce the redundancy among correlated fields, although quite a few dimensions will likely be left over.
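With a heatmap this large, it can help to list the strongly correlated pairs directly. The sketch below does that on a small synthetic frame; the column names and the 0.8 threshold are assumptions for illustration and do not come from the nursing home data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "score_Q1": base,
    "score_Q2": base + rng.normal(scale=0.1, size=200),  # nearly a copy of Q1
    "other": rng.normal(size=200),                       # independent column
})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
strong_pairs = pairs[pairs > 0.8]  # arbitrary illustrative threshold
print(strong_pairs)
```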

If we recall our initial data preparation, we have 18 measure codes with 4 quarters of data each, giving 72 features; 4 measure codes with adjusted and expected scores, giving 8 more; 5 deficiency counts; and 2 other count features. That adds up to 87 trainable features, plus our 2 target columns as labels.
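The feature count can be double-checked with a few lines of arithmetic (the variable names are just labels for the groups described above):

```python
quarterly_scores = 18 * 4        # 18 measure codes x 4 quarters
adjusted_expected = 4 * 2        # 4 measure codes x (adjusted, expected)
deficiency_counts = 5
other_counts = 2                 # fine and payment-denial counts

n_features = quarterly_scores + adjusted_expected + deficiency_counts + other_counts
print(n_features)  # 87
```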
The method below is based on: https://gist.github.com/rpromoditha/f73265e5a8db7084b521d79b2ecc3ece#file-pca_qas_2-py

In [None]:
pca = PCA(n_components=None)
pca.fit(trainDF[numerics])

exp_var = pca.explained_variance_ratio_ * 100
cum_exp_var = np.cumsum(exp_var)

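#use the actual number of components rather than a hardcoded 87
n_components = len(exp_var)

plt.bar(range(n_components), exp_var, align='center',
        label='Individual explained variance')

plt.step(range(n_components), cum_exp_var, where='mid',
         label='Cumulative explained variance', color='red')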

plt.ylabel('Explained variance percentage')
plt.xlabel('Principal component index')
plt.legend(loc='best')
plt.tight_layout()

From this we can see that there is relatively little correlation between our features, as shown in the heatmap above. However, as we have 87 features, each component inherently adds little to the explained variance. As the plot shows, we can drop a large number of components that each contribute very little by keeping only the leading components that together explain 80% of the variance.
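A small sketch of what a fractional `n_components` does, run on synthetic data (the data and its shape are assumptions): scikit-learn keeps the smallest number of leading components whose cumulative explained variance ratio exceeds the threshold. Note that the outputs are principal components, linear combinations of the inputs, not a subset of the original columns.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed columns driven by 3 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.5 * rng.normal(size=(300, 10))

# Fit all components and find where the cumulative ratio first passes 80%
full = PCA().fit(X)
cum = np.cumsum(full.explained_variance_ratio_)
expected_k = int(np.searchsorted(cum, 0.80, side="right") + 1)

# Let PCA pick the number of components from the same threshold
reduced = PCA(n_components=0.80).fit(X)
print(expected_k, reduced.n_components_)
```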

In [None]:
#80% variance - 29 features
pca = PCA(n_components=0.80)
pca.fit(trainDF[numerics])
trainX = pca.transform(trainDF[numerics])
print(trainX.shape)
trainY = trainDF[['InfectionScore', 'FacilityReadmissionScore']]
print(trainY.shape)

Our data is now prepared enough for modeling. We can wrap the steps above in a function to ensure that we prepare our training and testing data in the same way.

In [None]:
#returns train/test data split into features and labels in format: trainX, trainY, testX, testY
#target can be FacilityReadmissionScore or InfectionScore
#copied into a python file so it can be imported
#this is the initial process, updates may be made in the dataPrep.py file that are not from this process

def preprocessData(data, splitSeed = 0, target = 'FacilityReadmissionScore'):
    #import statements
    import pandas as pd
    import numpy as np
    from sklearn.impute import KNNImputer
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from scipy import sparse
    from sklearn.decomposition import PCA
    
    #drop rows without a target (work on a copy so the caller's DataFrame is not modified)
    data = data.loc[data[target] != "Not Available"].copy()
    
    #split data into train and test
    trainDF, testDF = train_test_split(data, test_size = 0.3, random_state = splitSeed)
    
    #the count based column names saved to a list
    countsColumns = ["fineCounts", "paymentDenialCounts", "StandardDeficiency", "ComplaintDeficiency", "InfectionControlInspectionDeficiency", "CitationunderIDR", "CitationunderIIDR"]
    
    #replace count based nulls with zeroes
    trainDF[countsColumns] = trainDF[countsColumns].fillna(0)
    testDF[countsColumns] = testDF[countsColumns].fillna(0)
    
    #as we are using KNN imputing, standardize numeric values first (mean 0, variance 1) so distance measures are consistent
    #the scaler is fit on the training data only and then applied to both sets
    numerics = trainDF.select_dtypes(include='float64').columns
    scaler = StandardScaler()
    scaler.fit(trainDF[numerics])
    trainDF[numerics] = scaler.transform(trainDF[numerics])
    testDF[numerics] = scaler.transform(testDF[numerics])
    
    
    #impute null values with KNN imputer
    noNA = (trainDF.isnull().sum() == 0).tolist()
    hasNA = np.logical_not(noNA)
    
    KNNimp = KNNImputer(n_neighbors=2)
    KNNimp.fit(trainDF.loc[:,hasNA])
    trainDF.loc[:,hasNA] = KNNimp.transform(trainDF.loc[:,hasNA])
    testDF.loc[:,hasNA] = KNNimp.transform(testDF.loc[:,hasNA])
    
    #Perform PCA, separate train and test data
    pca = PCA(n_components=0.80)
    pca.fit(trainDF[numerics])
    trainX = pca.transform(trainDF[numerics])
    testX = pca.transform(testDF[numerics])
    trainY = trainDF[target]
    testY = testDF[target]
        
    return trainX, trainY, testX, testY
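As an alternative sketch (not the method used in this notebook), the same fit-on-train, apply-to-test pattern can be expressed with a scikit-learn `Pipeline`. The synthetic data below is an assumption, and unlike `preprocessData` this version imputes all columns jointly rather than only the columns that contain nulls.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric features with roughly 10% of entries missing
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[rng.random(X.shape) < 0.1] = np.nan

X_train, X_test = train_test_split(X, test_size=0.3, random_state=0)

prep = Pipeline([
    ("scale", StandardScaler()),            # NaNs are ignored in fit, kept in transform
    ("impute", KNNImputer(n_neighbors=2)),
    ("pca", PCA(n_components=0.80)),
])

# Every step is fit on the training split only, then reused on the test split
train_features = prep.fit_transform(X_train)
test_features = prep.transform(X_test)
print(train_features.shape, test_features.shape)
```

Bundling the steps this way makes it harder to accidentally fit the scaler, imputer, or PCA on test data.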

In [None]:
trainX_fromFunction, trainY_fromFunction, testX_fromFunction, testY_fromFunction = preprocessData(nursingHomeDF)

In [None]:
np.array_equal(trainX_fromFunction, trainX)