# Logistic Regression

This notebook shows how to use SML to read in a dataset, split it into training and testing data, replace troublesome values such as NaNs, and perform classification. For this use case we use publicly available [US Census Data from 1990](https://archive.ics.uci.edu/ml/datasets/US+Census+Data+%281990%29) and use logistic regression to classify incomes. **[Clarify with Mike]**.

**[ Why are we Preprocessing Census Data??]**

## SML Query

### Imports

We import `execute` from `sml`, which is all we need to read in the dataset, split it into training and testing data, replace troublesome values, and perform classification.

In [5]:
from sml import execute

Next we create a query statement. It uses `READ` to load the data, specifying that the file is delimited by ',', that the header is not used, and that the value types are numeric and string. It then uses `REPLACE` to substitute any NaN values with the mode of the column, and `SPLIT` to use 80% of the dataset for training and 20% for testing. Lastly, `CLASSIFY` performs classification using logistic regression on the 15th column, using columns 1-14 as the predictors.

In [9]:
query = 'READ "../data/census.csv" (separator=",", header = 0, types = [1:numeric, 2:string]) AND \
            REPLACE ("NaN", "mode") AND SPLIT (train = .8, test = 0.2) AND \
            CLASSIFY (predictors=[1,2,3,4,5 , 6,7, 8, 9, 10 ,11 ,12, 13,14], label = 15, algorithm = logistic)'

execute(query)
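
Typing fourteen column indices by hand is error-prone. As a minimal sketch, the same query string could be assembled in plain Python. The names `predictors` and `label` are illustrative and not part of SML, and the sketch assumes SML is as tolerant of whitespace as the original query suggests. The resulting string would be passed to `execute` as above.

```python
# Build the predictor list 1..14 instead of typing it by hand.
predictors = ",".join(str(i) for i in range(1, 15))
label = 15  # index of the column to classify

# Same clauses as the query above, joined by AND.
query = (
    'READ "../data/census.csv" (separator=",", header = 0, '
    'types = [1:numeric, 2:string]) AND '
    'REPLACE ("NaN", "mode") AND SPLIT (train = .8, test = 0.2) AND '
    'CLASSIFY (predictors=[{0}], label = {1}, algorithm = logistic)'
).format(predictors, label)

print(query)
```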

## Manually

The cells below show how to perform the same steps as the SML query manually.

### Imports

Here we import the libraries needed to perform the same actions as the SML query above.

In [1]:
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.preprocessing import LabelEncoder, Imputer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer

from sklearn import cross_validation, metrics
from sklearn.cross_validation import train_test_split
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LogisticRegression

### READ

By default the 1990 Census dataset does not include its headers, so we specify them manually and read the file into a pandas DataFrame.

In [2]:

names = ['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
         'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss',
         'hours-per-week', 'native-country', 'income']

data = pd.read_csv('../data/census.csv', names = names)
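
To illustrate what `names` does for a headerless file, here is a small sketch that reads two made-up rows from an in-memory buffer. The rows and the shortened column list are illustrative only and are not taken from the census file.

```python
import io

import pandas as pd

names = ['age', 'workclass', 'income']  # shortened, illustrative column list

# Two made-up rows standing in for a headerless CSV file.
raw = io.StringIO("39,State-gov,<=50K\n50,Private,>50K\n")

# With `names` given, pandas treats the first line as data, not as a header.
sample = pd.read_csv(raw, names=names)

print(sample.shape)
print(list(sample.columns))
```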

### Preprocessing

Next we encode the categorical values as integers so that the data can be passed to the scikit-learn machine learning library to perform logistic regression.

In [10]:
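def encode_categorical(df, cols=None):
    """Replace the values of each categorical column with integer codes (in place)."""
    # Default to every string-typed (object) column.
    if cols is None:
        cols = [col for col in df.columns if df[col].dtype == 'object']

    for feature in cols:
        # Distinct values of the column; set order is arbitrary, so the
        # value-to-code mapping is not guaranteed to be stable between runs.
        values = list(set(df[feature]))
        codes = list(range(len(values)))
        df[feature] = df[feature].replace(values, codes)
    return df

# Note: `df` is modified in place, so `data` is encoded as well.
data_encoded = encode_categorical(data)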
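
The helper above derives its codes from a Python `set`, whose iteration order is not guaranteed. As an alternative sketch (not the method used in this notebook), `pd.factorize` with `sort=True` assigns codes in sorted order of the distinct values, so the mapping is reproducible. The sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private', 'Self-emp']})

# sort=True assigns codes in sorted order of the distinct values,
# so the mapping is the same on every run.
codes, categories = pd.factorize(df['workclass'], sort=True)
df['workclass'] = codes

print(list(categories))          # the value behind each code, by position
print(df['workclass'].tolist())  # the encoded column
```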

### Replace

We impute missing values in our pandas DataFrame to account for NaNs in our dataset.

In [11]:
class ImputeCategorical(BaseEstimator, TransformerMixin):
    """Fill missing entries in the given columns with the column's most frequent value."""

    def __init__(self, columns=None):
        self.columns = columns
        self.imputer = None

    def fit(self, data, target=None):
        # Default to every column when none are specified.
        if self.columns is None:
            self.columns = data.columns
        # Entries equal to 0 are treated as missing here.
        self.imputer = Imputer(missing_values=0, strategy='most_frequent')
        self.imputer.fit(data[self.columns])
        return self

    def transform(self, data):
        """
        Uses the fitted imputer to transform a data frame.
        """
        output = data.copy()
        output[self.columns] = self.imputer.transform(output[self.columns])

        return output


imputer = ImputeCategorical(['workclass', 'native-country', 'occupation'])
data = imputer.fit_transform(data)
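
For comparison, replacing actual NaN entries with the column mode (what the SML `REPLACE ("NaN", "mode")` clause describes) can be sketched with pandas alone. This is a simplified illustration on made-up values, not the transformer used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'occupation': ['Sales', np.nan, 'Sales', 'Tech-support']})

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_frequent = df['occupation'].mode()[0]
df['occupation'] = df['occupation'].fillna(most_frequent)

print(df['occupation'].tolist())
```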


### SPLIT

We then separate our labels from our features and use a scikit-learn function to split the data into 80% training and 20% testing sets.

In [12]:
labels = data['income']
features = data.drop('income', axis=1)

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=.2, random_state=42)
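
As a quick check of what an 80%/20% split produces, the sketch below splits ten made-up samples. It assumes a scikit-learn release where `train_test_split` lives in `sklearn.model_selection`, whereas this notebook imports it from the older `sklearn.cross_validation` module.

```python
import numpy as np
from sklearn.model_selection import train_test_split

features = np.arange(20).reshape(10, 2)  # 10 made-up samples, 2 features each
labels = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)
```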

### CLASSIFY

Lastly, we fit our logistic regression model on the training dataset and make predictions on the testing dataset. The accuracy displayed is the mean over 10-fold cross-validation on the full dataset.

In [13]:
logreg = LogisticRegression(C=1e5)
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

logreg_scores = cross_validation.cross_val_score(logreg, features, labels,
                                                 cv=10, scoring='accuracy')

print("Accuracy: %s" % logreg_scores.mean())

Accuracy: 0.798224925109
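
The cell above computes `pred` but reports only the cross-validated score. As a hedged sketch of how the two measures differ, the example below computes both on synthetic data from `make_classification` (a stand-in, not the census dataset). The `sklearn.model_selection` import path and the `max_iter` setting are assumptions for recent scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data; not the census dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(C=1e5, max_iter=1000)
model.fit(X_train, y_train)

# Accuracy on the held-out 20% only.
test_accuracy = accuracy_score(y_test, model.predict(X_test))

# Mean accuracy over 10 folds of the full dataset, as in the cell above.
cv_accuracy = cross_val_score(model, X, y, cv=10, scoring='accuracy').mean()

print("Test accuracy: %s" % test_accuracy)
print("CV accuracy: %s" % cv_accuracy)
```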
