# Workflows in Python: Getting data ready to build models

This notebook follows the example code from [Katie Malone's blog post](https://civisanalytics.com/blog/data-science/2015/12/17/workflows-in-python-getting-data-ready-to-build-models/) at Civis Analytics, which walks through a modeling workflow in Python. She describes it as

**"a workflow that focuses on getting a quick-and-dirty model up and running as quickly as possible, and then going back to iterate on the weak points until the model seems to be converging on an answer."**

- Dataset: “Pump it Up: Mining the Water Table” challenge on drivendata.org, which has examples of wells in Africa, their characteristics and whether they are **functional, non-functional, or functional but in need of repair.** 

-  Goal: build a model that will take the characteristics of a well and predict correctly which category that well falls into.

# Getting started
- read in data
- transform features and labels to make the data amenable to machine learning
- pick a modeling strategy (classification)
- make a train/test split (this was done implicitly when I called cross_val_score)
- evaluate several models for identifying wells that are failed or in danger of failing

# 1. Labels
## A quick print statement on the labels shows that the labels are strings

In [2]:
import pandas as pd
import numpy as np
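# DataFrame.from_csv has been removed from pandas; read_csv with index_col=0
# keeps the "id" column as the index, as from_csv did by default
features_df = pd.read_csv("data/training_set_values.csv", index_col=0)
labels_df   = pd.read_csv("data/training_set_labels.csv", index_col=0)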
print(labels_df.head(20))

                  status_group
id                            
69572               functional
8776                functional
34310               functional
67743           non functional
19728               functional
9944                functional
19816           non functional
54551           non functional
53934           non functional
46144               functional
49056               functional
50409               functional
36957               functional
50495               functional
53752               functional
61848               functional
48451           non functional
58155           non functional
34169  functional needs repair
18274               functional


## Mapping labels to integers
"When I want a specific mapping between strings and integers, like here, doing it manually is usually the way I go."
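- there’s also the sklearn LabelEncoder, which picks the integer codes itself (in sorted order of the class names) rather than letting me choose them.
- pandas applymap() (renamed DataFrame.map() in pandas 2.1)
    - apply() vs. applymap(): applymap() calls a function on every element of a whole dataframe, while apply() on a series calls it on every element of that one column (on a dataframe, apply() works on whole rows or columns instead)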
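
To make the difference between a manual mapping and LabelEncoder concrete, here is a minimal sketch on a toy series. The four label values are typed in by hand for illustration (they are not read from the competition files), and the names `toy`, `manual_codes`, `manual`, and `auto` are placeholders.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels, typed in for illustration only
toy = pd.Series(["functional", "non functional",
                 "functional needs repair", "functional"])

# Manual mapping: I decide which integer each string gets
manual_codes = {"non functional": 0, "functional needs repair": 1, "functional": 2}
manual = toy.map(manual_codes)

# LabelEncoder: codes follow the sorted order of the class names
encoder = LabelEncoder()
auto = encoder.fit_transform(toy)

print(manual.tolist())         # [2, 0, 1, 2]
print(list(encoder.classes_))  # ['functional', 'functional needs repair', 'non functional']
print(auto.tolist())           # [0, 2, 1, 0]
```

With the manual mapping the integers run from worst to best condition; LabelEncoder's alphabetical order does not, which is one reason to prefer the manual route when the order of the codes matters.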

### Pandas applymap

In [3]:
def label_map(y):
    if y=="functional":
        return 2
    elif y=="functional needs repair":
        return 1
    else:
        return 0
labels_df = labels_df.applymap(label_map)
print(labels_df.head())

       status_group
id                 
69572             2
8776              2
34310             2
67743             0
19728             2


# 2. Features

In [4]:
print(features_df.head())

       amount_tsh date_recorded        funder  gps_height     installer  \
id                                                                        
69572        6000    2011-03-14         Roman        1390         Roman   
8776            0    2013-03-06       Grumeti        1399       GRUMETI   
34310          25    2013-02-25  Lottery Club         686  World vision   
67743           0    2013-01-28        Unicef         263        UNICEF   
19728           0    2011-07-13   Action In A           0       Artisan   

       longitude   latitude              wpt_name  num_private  \
id                                                               
69572  34.938093  -9.856322                  none            0   
8776   34.698766  -2.147466              Zahanati            0   
34310  37.460664  -3.821329           Kwa Mahundi            0   
67743  38.486161 -11.155298  Zahanati Ya Nanyumbu            0   
19728  31.130847  -1.825359               Shuleni            0   

           

## Many of the features are categorical and they need to be transformed to numerical values.
- transform categorical features: OneHotEncoder in sklearn or get_dummies() in pandas.
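
As a rough illustration of the get_dummies() option, the sketch below one-hot encodes a single categorical column in a toy dataframe. The column names come from the feature table, but the rows are invented for the example and `toy` and `encoded` are placeholder names.

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (rows invented)
toy = pd.DataFrame({"basin": ["Lake Victoria", "Pangani", "Lake Victoria"],
                    "gps_height": [1390, 686, 0]})

# One new indicator column per category; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=["basin"])

print(encoded.columns.tolist())
# ['gps_height', 'basin_Lake Victoria', 'basin_Pangani']
```

One-hot encoding avoids implying an order between categories, at the cost of one extra column per distinct value, which can get large for columns with thousands of values.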

In [5]:
def transform_feature( df, column_name ):
    
    """ take features_df and the name of a column in that dataframe, 
        and return the same dataframe but 
        with the indicated feature encoded with integers rather than strings"""
    
    unique_values = set( df[column_name].tolist() )
    transformer_dict = {}
    for ii, value in enumerate(unique_values):
        transformer_dict[value] = ii

    def label_map(y):
        return transformer_dict[y]
    df[column_name] = df[column_name].apply( label_map )
    return df
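
For comparison, pandas has a built-in that does a similar string-to-integer encoding. This is a sketch of an alternative, not the function used in this notebook: pd.factorize() assigns codes in order of first appearance, and by default marks missing values with -1. The toy values and the name `toy` are made up for the example.

```python
import pandas as pd

# Toy column (values typed in for illustration)
toy = pd.DataFrame({"installer": ["Roman", "UNICEF", "Roman", "Artisan"]})

# factorize() returns the integer codes and the unique values they stand for
codes, uniques = pd.factorize(toy["installer"])
toy["installer"] = codes

print(toy["installer"].tolist())  # [0, 1, 0, 2]
print(list(uniques))              # ['Roman', 'UNICEF', 'Artisan']
```

Keeping `uniques` around makes it possible to translate the integers back into the original strings later.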

In [6]:
### list of column names indicating which columns to transform; 
### this is just a start!  Use the print( features_df.head() )
### output upstream to help you decide which columns get the
### transformation

names_of_columns_to_transform = ["funder", "installer", "wpt_name", "basin", "subvillage",
                    "region", "lga", "ward", "public_meeting", "recorded_by",
                    "scheme_management", "scheme_name", "permit",
                    "extraction_type", "extraction_type_group",
                    "extraction_type_class",
                    "management", "management_group",
                    "payment", "payment_type",
                    "water_quality", "quality_group", "quantity", "quantity_group",
                    "source", "source_type", "source_class",
                    "waterpoint_type", "waterpoint_type_group"]

for column in names_of_columns_to_transform:
    features_df = transform_feature( features_df, column )
    
print( features_df.head() )
    
### remove the "date_recorded" column--we're not going to make use
### of time-series data today
features_df.drop("date_recorded", axis=1, inplace=True)

print(features_df.columns.values)

       amount_tsh date_recorded  funder  gps_height  installer  longitude  \
id                                                                          
69572        6000    2011-03-14    1539        1390       1749  34.938093   
8776            0    2013-03-06     774        1399        136  34.698766   
34310          25    2013-02-25     906         686       1400  37.460664   
67743           0    2013-01-28     590         263        556  38.486161   
19728           0    2011-07-13    1341           0        537  31.130847   

        latitude  wpt_name  num_private  basin          ...            \
id                                                      ...             
69572  -9.856322     15203            0      5          ...             
8776   -2.147466      5611            0      7          ...             
34310  -3.821329      6138            0      6          ...             
67743 -11.155298      1252            0      1          ...             
19728  -1.825359     2

## prep for sklearn: convert it to numpy.ndarray

In [7]:
# as_matrix() has been removed from pandas; to_numpy() returns the same ndarray
X = features_df.to_numpy()
y = labels_df["status_group"].tolist()

# 3. Train and test 
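- The cheapest and easiest way to train on one portion of my dataset and test on another, and to get a measure of model quality at the same time, is to use **sklearn.model_selection.cross_val_score()** (in older scikit-learn releases it lived in sklearn.cross_validation)
    - With cv=3, it splits my data into three equal portions, trains on two of them, and tests on the third
    - This process repeats three times, so each portion is used as the test set once.
    - Three folds was the default when the scores below were produced; newer releases default to five, so cv=3 is passed explicitly.

In [12]:
import sklearn.linear_model
import sklearn.model_selection
clf = sklearn.linear_model.LogisticRegression()
# cv=3 gives the three-fold split described above
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )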
print( score )

[ 0.68707071  0.68429293  0.6809596 ]
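
Each of the three numbers above is the accuracy on one held-out fold. To show roughly what cross_val_score does internally for a classifier, here is a sketch that runs the three-fold loop by hand on synthetic data. The data from make_classification is a stand-in, not the well data, and `X_toy`, `y_toy`, `scores`, and `reference` are placeholder names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic three-class data standing in for X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

# Manual version: train on two folds, score on the third, three times over
scores = []
folds = StratifiedKFold(n_splits=3)
for train_idx, test_idx in folds.split(X_toy, y_toy):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_toy[train_idx], y_toy[train_idx])
    scores.append(clf.score(X_toy[test_idx], y_toy[test_idx]))

# The one-liner should report the same three accuracies
reference = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy, cv=3)

print(np.round(scores, 3))
print(np.round(reference, 3))
```

For classifiers, an integer cv uses stratified folds, so each fold keeps roughly the same class proportions as the full dataset.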


# 4. Classification or Regression
- I have the choice of **modeling with a classifier** and potentially getting slightly worse performance, 
- or building **a regression but needing to add a post-processing step that turns my continuous (i.e. float) predictions into integer category labels.** 
- I’ve decided to go with the classification approach for this example, but this is a decision made for convenience that I could revisit when improving my model down the road.
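
The post-processing step for the regression route could look something like the sketch below: round each continuous prediction to the nearest integer and clip it into the valid label range 0 to 2. The tiny dataset is invented for illustration, and LinearRegression is just one possible choice of regressor, not something the blog post prescribes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: one feature, labels 0/1/2 as in the label mapping
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 1, 2, 2])

reg = LinearRegression().fit(X_toy, y_toy)

# Raw predictions are floats and can fall outside the 0..2 range
raw = reg.predict(np.array([[-2.0], [2.5], [9.0]]))

# Post-processing: round to the nearest integer, then clip to valid labels
labels = np.clip(np.rint(raw), 0, 2).astype(int)

print(labels)  # [0 1 2]
```

This only makes sense because the integer labels are ordered (0 = non functional, 1 = needs repair, 2 = functional); with an arbitrary ordering, rounding a regression output would be meaningless.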

# 5. Compare algorithms
- I started with a simple logistic regression above (despite the name, this is a classification algorithm) 
- I’ll compare to a couple of other classifiers, a decision tree classifier and a random forest classifier, to see which one seems to do the best.

In [13]:
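import sklearn.tree
import sklearn.ensemble
import sklearn.model_selection  # replaces sklearn.cross_validation in current scikit-learn

# cv=3 keeps the same three-fold split used for the logistic regression
clf = sklearn.tree.DecisionTreeClassifier()
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )
print( score )

clf = sklearn.ensemble.RandomForestClassifier()
score = sklearn.model_selection.cross_val_score( clf, X, y, cv=3 )
print( score )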

[ 0.74242424  0.73717172  0.73393939]
[ 0.78656566  0.78747475  0.78242424]
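
Averaging the three folds gives roughly 0.684 for logistic regression, 0.738 for the decision tree, and 0.785 for the random forest, so the random forest looks best here. When comparing more than a couple of models, a loop keeps things tidy. The sketch below uses synthetic data in place of the well data, and the `candidates` dictionary, the fixed random_state values, and n_estimators=50 are choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for X and y
X_toy, y_toy = make_classification(n_samples=300, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Score every candidate on the same three folds and keep the mean accuracy
mean_scores = {}
for name, clf in candidates.items():
    fold_scores = cross_val_score(clf, X_toy, y_toy, cv=3)
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

Fixing random_state makes the tree-based scores repeatable from run to run, which helps when the differences between models are small.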
