# Creating a simple first model
In this chapter, you'll build a first-pass model trained on the numeric data only. Spoiler alert: throwing out all of the text data is bad for performance! But you'll learn how to format your predictions. Then you'll be introduced to natural language processing (NLP) so that you can start working with the large amounts of text in the data.

# 1. It's time to build a model
### 1.1 Setting up a train-test split in scikit-learn
Alright, you've been patient and awesome. It's finally time to start training models!

The first step is to split the data into a training set and a test set. Some labels don't occur very often, but we want to make sure that they appear in both the training and the test sets. We provide a function that will make sure at least `min_count` examples of each label appear in each split: `multilabel_train_test_split`.

Feel free to check out the full code for `multilabel_train_test_split` [here](https://github.com/drivendataorg/box-plots-sklearn/blob/master/src/data/multilabel.py).
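To see why a plain random split is risky for rare labels, consider a label that appears only twice in 100 rows. The sketch below uses made-up numbers (not the budget data) to estimate how often a random 20% test set contains no example of that label. Analytically, the chance is (80/100) × (79/99), or about 64%.

```python
import numpy as np

# Toy setup (assumed values): 100 rows, a label with only 2 positive examples
n_rows, n_positive, test_frac, n_trials = 100, 2, 0.2, 200
labels = np.zeros(n_rows, dtype=int)
labels[:n_positive] = 1

rng = np.random.RandomState(0)
misses = 0
for _ in range(n_trials):
    # Draw a random test set without replacement, as a naive split would
    test_idx = rng.choice(n_rows, size=int(n_rows * test_frac), replace=False)
    if labels[test_idx].sum() == 0:
        misses += 1  # the rare label never made it into the test set

miss_rate = misses / n_trials
print("Share of random splits whose test set lacks the rare label:", miss_rate)
```

A split that samples `min_count` rows per label first, as `multilabel_train_test_split` does for the test set, avoids this failure mode.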

You'll start with a simple model that uses __just the numeric columns__ of your DataFrame when calling `multilabel_train_test_split`. The data has been read into a DataFrame `df` and a list consisting of just the numeric columns is available as `NUMERIC_COLUMNS`.

### Instructions:
* Create a new DataFrame named `numeric_data_only` by applying the `.fillna(-1000)` method to the numeric columns (available in the list `NUMERIC_COLUMNS`) of `df`.
* Convert the labels (available in the list `LABELS`) to dummy variables. Save the result as `label_dummies`.
* In the call to `multilabel_train_test_split()`, set the `size` of your test set to be `0.2`. Use a `seed` of `123`.
* Fill in the `.info()` method calls for `X_train`, `X_test`, `y_train`, and `y_test`.
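
As a rough illustration of the first two steps, here is a sketch on a made-up three-row DataFrame. The `toy` frame and its values are assumptions for demonstration, not the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the budget data: two numeric columns, one label column
toy = pd.DataFrame({
    "FTE": [1.0, np.nan, 0.5],
    "Total": [100.0, 250.0, np.nan],
    "Use": ["Instruction", "Operations", "Instruction"],
})

# Replace missing numeric values with a sentinel the model can treat as "unknown"
numeric = toy[["FTE", "Total"]].fillna(-1000)

# One indicator column per (label column, category) pair, named "<column>_<category>"
dummies = pd.get_dummies(toy[["Use"]])

print(numeric)
print(list(dummies.columns))
```

Each row of `dummies` has exactly one active indicator per original label column, which is the binary indicator format that `multilabel_train_test_split` expects for its second argument.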

In [1]:
# Import pandas, numpy, warn, CountVectorizer
import pandas as pd
import numpy as np
from warnings import warn
from sklearn.feature_extraction.text import CountVectorizer

#### DEFINE SAMPLING UTILITIES ####

# First multilabel_sample, which is called by multilabel_train_test_split
def multilabel_sample(y, size=1000, min_count=5, seed=None):
    """ Takes a matrix of binary labels `y` and returns
        the indices for a sample of size `size` if
        `size` > 1 or `size` * len(y) if `size` <= 1.
        
        The sample is guaranteed to have at least `min_count` of
        each label.
    """
    try:
        # Reject any value other than 0 or 1 (a mixed pair such as [0, 2] must fail too)
        if not np.isin(np.unique(y).astype(int), [0, 1]).all():
            raise ValueError()
    except (TypeError, ValueError):
        raise ValueError('multilabel_sample only works with binary indicator matrices')
    
    if (y.sum(axis=0) < min_count).any():
        raise ValueError('Some classes do not have enough examples. Change min_count if necessary.')
    
    if size <= 1:
        # Treat size as a fraction of the rows; cast to int so it is a valid sample size
        size = int(np.floor(y.shape[0] * size))
    
    if y.shape[1] * min_count > size:
        msg = "Size less than number of columns * min_count, returning {} items instead of {}."
        warn(msg.format(y.shape[1] * min_count, size))
        size = y.shape[1] * min_count
    
    # RandomState(None) seeds itself from the OS; np.random.randint(1) would always return 0
    rng = np.random.RandomState(seed)
    
    if isinstance(y, pd.DataFrame):
        choices = y.index
        y = y.values
    else:
        choices = np.arange(y.shape[0])
    
    sample_idxs = np.array([], dtype=choices.dtype)
    
    # first, guarantee at least min_count of each label
    for j in range(y.shape[1]):
        label_choices = choices[y[:, j] == 1]
        label_idxs_sampled = rng.choice(label_choices, size=min_count, replace=False)
        sample_idxs = np.concatenate([label_idxs_sampled, sample_idxs])
        
    sample_idxs = np.unique(sample_idxs)
        
    # now that we have at least min_count of each, we can just random sample
    sample_count = size - sample_idxs.shape[0]
    
    # get sample_count indices from remaining choices
    remaining_choices = np.setdiff1d(choices, sample_idxs)
    remaining_sampled = rng.choice(remaining_choices, size=sample_count, replace=False)
        
    return np.concatenate([sample_idxs, remaining_sampled])

# Now define multilabel_train_test_split to be used below
def multilabel_train_test_split(X, Y, size, min_count=5, seed=None):
    """ Takes a features matrix `X` and a label matrix `Y` and
        returns (X_train, X_test, Y_train, Y_test) where all
        classes in Y are represented at least `min_count` times.
    """
    index = Y.index if isinstance(Y, pd.DataFrame) else np.arange(Y.shape[0])

    test_set_idxs = multilabel_sample(Y, size=size, min_count=min_count, seed=seed)
    
    # np.isin works whether `index` is a pandas Index or a NumPy array
    test_set_mask = np.isin(index, test_set_idxs)
    train_set_mask = ~test_set_mask
    
    return (X[train_set_mask], X[test_set_mask], Y[train_set_mask], Y[test_set_mask])
#### ####

# Load data
df = pd.read_csv('_datasets/TrainingSetSample.csv', index_col=0)

# Load LABELS and NUMERIC_COLUMNS lists
LABELS = ['Function',
          'Use',
          'Sharing',
          'Reporting',
          'Student_Type',
          'Position_Type',
          'Object_Type', 
          'Pre_K',
          'Operating_Status']

NUMERIC_COLUMNS = ['FTE', "Total"]

# Convert object to category for LABELS
df[LABELS] = df[LABELS].apply(lambda x: x.astype('category'))

In [2]:
# Create the new DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)

# Print the info
print("X_train info:")
print(X_train.info())
print("\nX_test info:")  
print(X_test.info())
print("\ny_train info:")  
print(y_train.info())
print("\ny_test info:")  
print(y_test.info()) 

X_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1040 entries, 198 to 101861
Data columns (total 2 columns):
FTE      1040 non-null float64
Total    1040 non-null float64
dtypes: float64(2)
memory usage: 24.4 KB
None

X_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 520 entries, 209 to 448628
Data columns (total 2 columns):
FTE      520 non-null float64
Total    520 non-null float64
dtypes: float64(2)
memory usage: 12.2 KB
None

y_train info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1040 entries, 198 to 101861
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 113.8 KB
None

y_test info:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 520 entries, 209 to 448628
Columns: 104 entries, Function_Aides Compensation to Operating_Status_PreK-12 Operating
dtypes: uint8(104)
memory usage: 56.9 KB
None




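With the data split, you can now train a model! Notice that the test set holds 520 rows rather than 20% of the 1,560 rows (312): with 104 label columns and the default `min_count` of 5, `multilabel_sample` needs at least 104 × 5 = 520 rows, so it raises the sample size to that minimum.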

### 1.2 Training a model
With split data in hand, you're only a few lines away from training a model.

In this exercise, you will import the logistic regression and one-vs-rest classifiers in order to fit a multi-class logistic regression model to the `NUMERIC_COLUMNS` of your feature data.

Then you'll test and print the accuracy with the `.score()` method to see the results of training.

Before you train! Remember, we're ultimately going to be using log loss to score our model, so don't worry too much about the accuracy here. Keep in mind that you're throwing away all of the text data in the dataset, which is by far most of the data! So don't get your hopes up for a killer performance just yet. We're just interested in getting things up and running at the moment.
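
For intuition about why log loss is a different yardstick from accuracy, here is a minimal sketch of the standard binary log loss on made-up probabilities. The `binary_log_loss` helper is hypothetical, and the exact averaging used to score this problem is not specified here.

```python
import numpy as np

def binary_log_loss(y_true, p, eps=1e-15):
    """Standard binary log loss; clipping keeps log() away from 0."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 0])

confident_right = np.array([0.9, 0.1, 0.9, 0.1])
hedged = np.array([0.6, 0.4, 0.6, 0.4])
confident_wrong = np.array([0.1, 0.9, 0.1, 0.9])

for name, probs in [("confident and right", confident_right),
                    ("hedged", hedged),
                    ("confident and wrong", confident_wrong)]:
    print(name, binary_log_loss(y_true, probs))
```

Log loss rewards well-calibrated probabilities and punishes confident mistakes heavily, so a model can have poor accuracy and still produce a usable log loss score.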

All data necessary to call `multilabel_train_test_split()` has been loaded into the workspace.

### Instructions:
* Import `LogisticRegression` from `sklearn.linear_model` and `OneVsRestClassifier` from `sklearn.multiclass`.
* Instantiate the classifier `clf` by placing `LogisticRegression()` inside `OneVsRestClassifier()`.
* Fit the classifier to the training data `X_train` and `y_train`.
* Compute and print the accuracy of the classifier using its `.score()` method, which accepts two arguments: `X_test` and `y_test`.
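
To make the wrapping step concrete, the sketch below fits the same kind of classifier on synthetic data. The arrays `X` and `Y` are made up for illustration. With a binary indicator matrix as the target, `OneVsRestClassifier` fits one binary logistic regression per label column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data (assumption): 60 rows, 2 numeric features, 3 binary label columns
rng = np.random.RandomState(0)
X = rng.randn(60, 2)
Y = np.column_stack([
    X[:, 0] > 0,
    X[:, 1] > 0,
    X[:, 0] + X[:, 1] > 0,
]).astype(int)

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, Y)

# One fitted binary estimator per label column
print(len(clf.estimators_))

# One probability per (row, label column)
proba = clf.predict_proba(X[:5])
print(proba.shape)

# For indicator targets, .score() counts a row as correct only if every label matches
acc = clf.score(X, Y)
print(acc)
```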

In [3]:
# Import classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Create the DataFrame: numeric_data_only
numeric_data_only = df[NUMERIC_COLUMNS].fillna(-1000)

# Get labels and convert to dummy variables: label_dummies
label_dummies = pd.get_dummies(df[LABELS])

# Create training and test sets
X_train, X_test, y_train, y_test = multilabel_train_test_split(numeric_data_only,
                                                               label_dummies,
                                                               size=0.2, 
                                                               seed=123)

# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression(solver='liblinear'))

# Fit the classifier to the training data
clf.fit(X_train, y_train)

# Print the accuracy
print("Accuracy: {}".format(clf.score(X_test, y_test)))



Accuracy: 0.0


Ok! The good news is that your workflow didn't cause any errors. The bad news is that your model scored the lowest possible accuracy: __0.0__! But hey, you just threw away ALL of the text data in the budget. Later, you won't. Before you add the text data, let's see how the model does when scored by log loss.
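
A note on why the score can be exactly 0.0: for binary indicator targets, scikit-learn's accuracy is subset accuracy, so a row counts as correct only when every one of its label columns is predicted correctly. The sketch below uses made-up predictions to show how a model can get most individual labels right and still score 0.0.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up targets and predictions: 3 rows, 4 label columns
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 1],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 1, 1],   # one label wrong
                   [0, 1, 1, 1],   # one label wrong
                   [1, 0, 0, 0]])  # one label wrong

# Subset accuracy: every row has at least one mistake, so no row counts as correct
subset_acc = accuracy_score(y_true, y_pred)

# Per-label agreement: 9 of the 12 individual labels match
per_label = (y_true == y_pred).mean()

print(subset_acc, per_label)
```

With 104 label columns, a single wrong column per row is enough to drive the accuracy to zero, which is one reason accuracy is a harsh measure for this problem.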