### Dropping missing data
The voting dataset marks missing values with '?'.

Other datasets use different placeholders, such as '9999' or '0'.

With luck, the missing values are already encoded as NaN, the marker that pandas and NumPy use internally for missing data. Methods such as isnull() and dropna() recognize it directly.
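
As a minimal sketch of how such placeholders can be normalized, the snippet below builds a small made-up DataFrame and converts each column's sentinel to NaN. The column names and values are purely illustrative and do not come from the voting dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: '?' marks a missing vote, 9999 a missing age
toy = pd.DataFrame({
    'vote': ['y', '?', 'n', 'y'],
    'age': [34, 9999, 51, 9999],
})

# Replace each column's own sentinel with NaN; working column by column
# avoids touching legitimate values elsewhere in the frame
cleaned = toy.copy()
cleaned['vote'] = cleaned['vote'].replace('?', np.nan)
cleaned['age'] = cleaned['age'].replace(9999, np.nan)

# Count the missing entries per column
missing_per_column = cleaned.isnull().sum()
print(missing_per_column)
```

Handling each column separately matters when a sentinel such as '0' could also be a valid value in another column.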

In [13]:
import pandas as pd
import numpy as np

In [14]:
column_names = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
       'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
       'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']

congress_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', 
                            names = column_names)

In [15]:
# Convert '?' (missing votes) to NaN so pandas treats them as missing
congress_data.replace('?', np.nan, inplace=True)

In [16]:
# '?' was already converted to NaN in the previous cell

# Count the NaNs in each column (wrap in print() to display the counts)
congress_data.isnull().sum()

# Print shape of original DataFrame
print("Shape of Original DataFrame: {}".format(congress_data.shape))

# Drop every row that contains at least one missing value
congress_data = congress_data.dropna()

# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(congress_data.shape))


Shape of Original DataFrame: (435, 17)
Shape of DataFrame After Dropping All Rows with Missing Values: (232, 17)


When many values in the dataset are missing, dropping them throws away valuable information along with the gaps: here 203 of the 435 rows, nearly half, are lost. It is often better to develop an imputation strategy instead. Domain knowledge is useful here, but in its absence a common default is to replace each missing value with the mean or median of the column (or, less commonly, the row) that contains it. For categorical features such as votes, the most frequent value plays the same role.
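
A minimal sketch of column-wise imputation with plain pandas follows. The toy frame, its column names, and its values are assumptions made up for illustration; the median is used for the numeric column and the mode for the categorical one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column
toy = pd.DataFrame({
    'income': [40.0, np.nan, 60.0, 100.0],
    'vote': ['y', np.nan, 'y', 'n'],
})

imputed = toy.copy()

# Numeric column: fill with the column median, which is less
# sensitive to outliers than the mean
imputed['income'] = toy['income'].fillna(toy['income'].median())

# Categorical column: fill with the most frequent value (the mode)
imputed['vote'] = toy['vote'].fillna(toy['vote'].mode()[0])

print(imputed)
```

Unlike dropna(), this keeps all four rows, at the cost of inserting values that were never observed.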

### Imputing missing data in an ML pipeline
Building a model involves many steps: creating training and test sets, fitting a classifier or regressor, tuning its parameters, and evaluating its performance on new data. Imputation can be seen as the first step of this process, and the whole sequence can be expressed as a pipeline. Scikit-learn's Pipeline class chains these steps into a single estimator, which simplifies the workflow.

In [17]:
# Import the imputer (SimpleImputer in sklearn.impute supersedes preprocessing.Imputer)
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC

# Set up the imputation transformer: imp
# SimpleImputer always works column-wise, so no axis argument is needed
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')

# Instantiate the SVC classifier: clf
clf = SVC()

# Set up the pipeline with the required steps: steps
steps = [('imputation', imp),
         ('SVC', clf)]

In [18]:
# Reload the raw data: the earlier dropna() call removed every row with a missing value
column_names = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
       'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
       'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']

congress_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/voting-records/house-votes-84.data', 
                            names = column_names)

In [19]:
# Convert '?' (missing votes) to NaN so the imputer can find them
congress_data.replace('?', np.nan, inplace=True)

In [20]:
# Build the feature DataFrame X and the target Series y
X, y = congress_data.drop('party', axis=1), congress_data['party']

Note the use of .drop() to remove the target variable 'party' from the feature matrix X. Here X and y remain a DataFrame and a Series respectively; the scikit-learn API accepts them in this form as long as they have the right shape. Appending .values would convert them to NumPy arrays, but that is not required.
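
To illustrate the two forms, the following sketch uses a tiny made-up frame (three hypothetical rows, not the real data) and compares the pandas objects with their NumPy equivalents. It uses .to_numpy(), which pandas recommends over .values for this conversion.

```python
import pandas as pd

# Hypothetical three-row stand-in for the voting data
toy = pd.DataFrame({
    'party': ['democrat', 'republican', 'democrat'],
    'infants': ['y', 'n', 'y'],
    'water': ['n', 'y', 'y'],
})

# pandas objects: a DataFrame of features and a Series of labels
X_df, y_ser = toy.drop('party', axis=1), toy['party']

# NumPy equivalents with the same shapes
X_arr, y_arr = X_df.to_numpy(), y_ser.to_numpy()

print(type(X_df).__name__, X_df.shape)
print(type(X_arr).__name__, X_arr.shape)
```

Keeping the DataFrame preserves the column names, which can make later inspection easier.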

In [22]:
y.describe()

count          435
unique           2
top       democrat
freq           267
Name: party, dtype: object

In [23]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Set up the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)

# Predict the labels of the test set
y_pred = pipeline.predict(X_test)

# Compute metrics
print('\nClassification_report: \n{}'.format(classification_report(y_test, y_pred)))

ValueError: could not convert string to float: 'y'
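
The error arises because the vote columns hold the strings 'y' and 'n', while the estimator expects numeric features. One possible way around it, sketched here on a small made-up frame rather than the real data, is to map 'y' to 1 and 'n' to 0 before the pipeline runs, leaving NaN entries in place for the imputer. The toy values and the 1/0 encoding are assumptions for illustration, and the sketch uses SimpleImputer from sklearn.impute, the current replacement for the older Imputer class.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy votes; NaN marks a missing vote
X_toy = pd.DataFrame({
    'infants': ['y', 'n', np.nan, 'y', 'n', 'n'],
    'water':   ['n', 'y', 'y', np.nan, 'y', 'n'],
})
y_toy = pd.Series(['democrat', 'republican', 'democrat',
                   'democrat', 'republican', 'republican'])

# Encode votes numerically; entries absent from the mapping (NaN) stay NaN
X_num = X_toy.apply(lambda col: col.map({'y': 1, 'n': 0}))

# Impute the remaining NaNs with each column's most frequent value, then classify
pipeline = Pipeline([
    ('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('SVM', SVC()),
])
pipeline.fit(X_num, y_toy)
predictions = pipeline.predict(X_num)
print(predictions)
```

On the real data the same mapping would be applied to X before the train/test split, so that both subsets share the encoding.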