# Kaggle - Bosch Production Line Performance 
### (Handling Large Data With Limited Memory)

Welcome! This Jupyter notebook demonstrates how to work with large datasets in Python by analyzing production line data from the Bosch Kaggle competition (https://www.kaggle.com/c/bosch-production-line-performance). 

[This notebook is still a work in progress and will be updated as I improve algorithm performance.]

Questions, comments, suggestions, and corrections can be sent to mgebhard@gmail.com.

## Business Challenge
Bosch, a manufacturing company, teamed up with Kaggle to challenge teams to create a classification algorithm that predicts "internal failures along the manufacturing process using thousands of measurements and tests made for each component along the assembly line."


## Data
Bosch provided six very large data files for the challenge (https://www.kaggle.com/c/bosch-production-line-performance/data): three sets of training data (numeric, categorical, and date) and the equivalent three sets of test data. They contain a large number of features (one of the largest sets ever hosted on Kaggle), and the uncompressed files come out to **14.3 GB**. 

One of the largest difficulties associated with the competition is handling this amount of data. One strategy is to move the data to Amazon Web Services and use big data tools like Spark and Hadoop. Often, however, we have to extract value from data under real-world constraints such as limited memory and processing power. In this notebook, I'll work through an alternative approach where I split and simplify the data in order to process it on my 8GB RAM laptop.

Let's start by examining the training data. Because the files are so large, we can't follow the usual practice of reading the whole .CSV file into a pandas dataframe. Instead, let's just look at the first few lines.

In [1]:
import pandas as pd

line_count = 0
extracted_lines = []
with open('train_numeric.csv') as f:
    for line in f:
        if line_count < 6:
            extracted_lines.append(line)
            line_count += 1
        else:
            break
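# Show only the first and last 40 characters of each line.
# A single parenthesized argument prints the same way in Python 2 and 3.
for line in extracted_lines:
    print(line[:40] + ' ... ' + line[-40:])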

Id,L0_S0_F0,L0_S0_F2,L0_S0_F4,L0_S0_F6,L ... 4258,L3_S51_F4260,L3_S51_F4262,Response

4,0.03,-0.034,-0.197,-0.179,0.118,0.116, ... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0

6,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, ... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0

7,0.088,0.086,0.003,-0.052,0.161,0.025,- ... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0

9,-0.036,-0.064,0.294,0.33,0.074,0.161,0 ... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0

11,-0.055,-0.086,0.294,0.33,0.118,0.025, ... ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0



We see that each line in train_numeric.csv represents a component with an Id, a long list of features (many of which are blank), and a Response indicating whether the component passed (*0*) or failed (*1*) quality control. Further examination shows that only 0.58% of Responses are failures.
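The failure rate can be measured without loading the file into memory by streaming it one line at a time and looking only at the last field. The following is a minimal sketch written for Python 3; `count_responses` is a hypothetical helper, and the in-memory sample (with made-up values) stands in for train_numeric.csv.

```python
import io

def count_responses(lines):
    """Count rows per Response value, reading one line at a time."""
    counts = {}
    next(lines)  # skip the header row
    for line in lines:
        # Response is the last comma-separated field on the line.
        response = line.rstrip('\r\n').rsplit(',', 1)[-1]
        counts[response] = counts.get(response, 0) + 1
    return counts

# Tiny stand-in for train_numeric.csv (values are made up).
sample = io.StringIO(
    "Id,L0_S0_F0,Response\n"
    "4,0.03,0\n"
    "6,,0\n"
    "7,0.088,1\n"
    "9,-0.036,0\n"
)
counts = count_responses(sample)
failure_rate = counts.get('1', 0) / sum(counts.values())
print(counts, failure_rate)
```

For the real file, the same function can be called on the object returned by `open('train_numeric.csv')`, since only one line is held in memory at a time.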

Because we already have more data than we can handle, we're going to simplify by working only with train_numeric.csv and disregarding train_categorical.csv and train_date.csv. Even so, train_numeric.csv is still too large to load and is highly imbalanced. To deal with both problems, we're going to pull out all of the rows with positive responses and randomly sample a roughly equal number of negative rows. The result is a new .CSV file that is about 1/100th the size of the original and approximately balanced.

In [3]:
import random

line_count = 0
extracted_positive_lines = []
with open('train_numeric.csv') as f:
    for line in f:
        if line_count == 0:
            extracted_positive_lines.append(line)
            line_count += 1
        # Response is the last field; rstrip() tolerates any line ending.
        elif line.rstrip().endswith(',1'):
            extracted_positive_lines.append(line)

line_count = 0
extracted_negative_lines = []
with open('train_numeric.csv') as f:
    for line in f:
        if line_count == 0:
            line_count += 1
            continue
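        # Sample only rows whose Response is 0, so that no positive row
        # is drawn a second time into the negative set.
        if line.rstrip().endswith(',0') and random.random() < 0.0058: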
            extracted_negative_lines.append(line)

combined_extracted_lines = extracted_positive_lines + extracted_negative_lines
with open('train_numeric_short.csv', 'w') as f:
    for line in combined_extracted_lines:
        f.write(line)
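Sampling each row with a fixed probability of 0.0058 gives only approximately as many negative rows as positive ones. If an exact match is wanted, reservoir sampling is one option: it draws exactly k rows uniformly at random in a single pass without knowing the number of rows in advance. This is a sketch of that alternative, not the method used in this notebook; `reservoir_sample` and the synthetic rows are hypothetical.

```python
import random

def reservoir_sample(rows, k, rng):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir

# Synthetic stand-in for the file: 1000 rows, every 50th one a failure.
rng = random.Random(0)
rows = ["%d,0.1,%d\n" % (i, 1 if i % 50 == 0 else 0) for i in range(1000)]

positives = [r for r in rows if r.rstrip().endswith(',1')]
negative_rows = (r for r in rows if r.rstrip().endswith(',0'))
negatives = reservoir_sample(negative_rows, len(positives), rng)
print(len(positives), len(negatives))
```

Only k rows are held in memory at any time, so the approach works on a file that is read line by line.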

Now we can move the new .CSV to a pandas dataframe and replace the empty features with *0*.

In [4]:
train_numeric_short_df = pd.read_csv('train_numeric_short.csv')
train_numeric_short_df.fillna(value=0, inplace=True)
train_numeric_short_df.shape

(13726, 970)

We're now working with 13,726 samples and 968 features, not counting Id and Response. Let's use train_test_split from sklearn.cross_validation (in scikit-learn 0.18 and later it lives in sklearn.model_selection instead) to split our training data, which will let us quickly evaluate and compare various classifiers.

In [5]:
from sklearn.cross_validation import train_test_split

X = train_numeric_short_df.drop(['Response', 'Id'], axis=1)
y = train_numeric_short_df['Response']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

## Comparing Classifiers
With our training data split into new training and test sets, we can feed it into various scikit-learn classifiers. The Kaggle competition is judged using the Matthews correlation coefficient, so we'll use that to find the best classifier.
* https://en.wikipedia.org/wiki/Matthews_correlation_coefficient
* http://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html
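
To make the metric concrete, here is a minimal sketch of how the coefficient is computed from the four confusion-matrix counts for binary labels. `mcc` is a hypothetical helper for illustration, not scikit-learn's implementation, and returning 0 when the denominator is zero is an assumption made here. The toy labels are made up.

```python
import math

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Imbalanced toy labels: 2 failures out of 10 components.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_pass = [0] * 10
inverted = [1 - t for t in y_true]

accuracy = sum(1 for t, p in zip(y_true, always_pass) if t == p) / len(y_true)
print(accuracy, mcc(y_true, always_pass), mcc(y_true, y_true), mcc(y_true, inverted))
```

Predicting "pass" for every component is 80% accurate on these labels but scores 0 on the coefficient, which is why it suits a dataset as imbalanced as this one.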

Additionally, we can use [recursive feature elimination with cross-validation](http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFECV.html). Our data set is high dimensional with 968 features. Removing features of low importance can reduce the model complexity, overfitting, and training time.

In [6]:
from sklearn.metrics import matthews_corrcoef
from sklearn.feature_selection import RFECV

We can start with a simple [logistic regression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) combined with the recursive feature elimination.

In [7]:
from sklearn.linear_model import LogisticRegression

clf = RFECV(LogisticRegression(), step=200)
clf.fit(X_train, y_train)
y_output = clf.predict(X_test)
matthews_corrcoef(y_test, y_output)

0.1774656753667388

Next, let's try a [linear SVC model](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html).

In [8]:
from sklearn.svm import LinearSVC

clf = RFECV(LinearSVC(), step=200)
clf.fit(X_train, y_train)
y_output = clf.predict(X_test)
matthews_corrcoef(y_test, y_output)

0.1655979020934964

Let's try the [ExtraTreesClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html).

In [15]:
from sklearn.ensemble import ExtraTreesClassifier

forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
clf = RFECV(forest, step=200)
clf.fit(X_train, y_train)
y_output = clf.predict(X_test)
matthews_corrcoef(y_test, y_output)

0.29972361676295534

Now that we've settled on the ExtraTreesClassifier, let's retrain it using our full training set from before we split it with train_test_split.

In [12]:
clf.fit(X, y)

RFECV(cv=None,
   estimator=ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=250, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False),
   estimator_params=None, scoring=None, step=200, verbose=0)

We're ready to analyze the actual test data provided by Bosch. As with the training data, though, the 2.1 GB file is too large to handle comfortably on my laptop. We can split the test data into files of 100,000 rows each, get predictions for each smaller file, and then stitch the predictions back together into a final submission file.

Fortunately, pandas can read .CSV files in chunks, which makes it easy to split up the test data file.

In [113]:
test = pd.read_csv('test_numeric.csv', chunksize=100000)
file_number = 0
for chunk in test:
    path = 'test_data/short' + str(file_number) + '.csv'
    chunk.to_csv(path)
    file_number += 1
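
The idea behind `chunksize` can be pictured with the standard library alone: pull a fixed number of rows from the reader at a time until it is exhausted. The sketch below only illustrates that idea and says nothing about how pandas implements it. `iter_chunks` is a hypothetical helper, and the in-memory data (made-up values) stands in for test_numeric.csv.

```python
import csv
import io
from itertools import islice

def iter_chunks(reader, chunk_size):
    """Yield lists of at most chunk_size rows from an iterator of rows."""
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk

# Stand-in for test_numeric.csv: a header plus 7 data rows.
data = io.StringIO("Id,F0\n" + "".join("%d,0.5\n" % i for i in range(1, 8)))
reader = csv.reader(data)
header = next(reader)  # read the header once, before chunking

sizes = [len(chunk) for chunk in iter_chunks(reader, 3)]
print(header, sizes)
```

The last chunk is shorter than the others whenever the row count is not a multiple of the chunk size, which is also true of the files written above.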

In [114]:
for i in range(12):
    test_numeric_short_df = pd.read_csv('test_data/short' + str(i) + '.csv').fillna(value=0)
    # .loc replaces the deprecated .ix indexer for label-based selection.
    Ids = test_numeric_short_df.loc[:, 'Id']
    X_test_real = test_numeric_short_df.drop(['Id', 'Unnamed: 0'], axis=1)
    # clf is the RFECV-wrapped ExtraTreesClassifier refit on the full training set.
    y_output_real = clf.predict(X_test_real)
    output = pd.Series(y_output_real, name='Response')
    output = pd.concat([Ids, output], axis=1)
    output.to_csv('test_output/test_output' + str(i) + '.csv', index=False)
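
The loop above hard-codes the number of chunk files. One way to avoid that is to discover the files on disk and order them by their numeric suffix. The sketch below is written for Python 3 and assumes the `short<N>.csv` naming used in this notebook; `numbered_files` is a hypothetical helper, and a temporary directory stands in for test_data/.

```python
import os
import re
import tempfile

def numbered_files(directory, prefix):
    """Return paths named <prefix><N>.csv in directory, ordered by the integer N."""
    name_re = re.compile(r'^' + re.escape(prefix) + r'(\d+)\.csv$')
    found = []
    for name in os.listdir(directory):
        match = name_re.match(name)
        if match:
            found.append((int(match.group(1)), os.path.join(directory, name)))
    return [path for _, path in sorted(found)]

# Demonstrate on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(12):
        open(os.path.join(tmp, 'short%d.csv' % i), 'w').close()
    names = [os.path.basename(p) for p in numbered_files(tmp, 'short')]

print(names)
```

Sorting by the parsed integer matters because a plain string sort would place short10.csv before short2.csv, and the predictions would be stitched together in the wrong order.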

Now we just have to put our prediction files together into a single file.

In [116]:
import shutil

shutil.copyfile('test_output/test_output0.csv', 'test_output/output_combined.csv')

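# Append the remaining prediction files, closing each file when done.
with open('test_output/output_combined.csv', 'a') as output_combined:
    for i in range(1, 12):
        with open('test_output/test_output' + str(i) + '.csv', 'r') as chunk_file:
            lines = chunk_file.readlines()
        # Skip each file's header row; the copied first file already has it.
        for line in lines[1:]:
            output_combined.write(line)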

## Conclusions

Submitting our file to Kaggle gets us a score of 0.04623. 