# Machine learning coding challenge
## Part 1
Your task is to:
1. Download the "Used Cars Database" from Kaggle Data Sets (https://www.kaggle.com/orgesleka/used-cars-database)
1. Create whatever data subsets for training and testing you believe are appropriate.
1. Build a predictive model, to predict which of the cars (listed at < 100k Euros) are listed as cheap (< 2,000 Euros), based on some combination of the other columns
1. Produce results that convince us the estimate is a realistic measure of generalisation performance (if the model was to be deployed live for future listings).

You may only use built-in Python 3 packages, and the basic scientific python kit such as numpy, sklearn, and pandas (no keras/tensorflow/spark please!).

Also, everything must be done programmatically from this notebook, under the assumption "autos.csv" is placed in the same folder.

## Part 2 - Resource Constraints
To really impress us (and let's face it, that's what you're trying to do), now write code that follows this same process, but under the arbitrary resource limitation that you can only afford to load 1,000 rows of the data set into memory at once.


## Priorities
1. (40%) You use best practice machine learning process to get a *valid* result, and you comply with the resource limitations (i.e. your *code doesn't crash*).
1. (30%) You adhere to general style and software engineering principles in writing your python code
1. (10%) Your model performs reasonably well. I DON'T want you to waste time on extensive feature engineering to optimise model performance - just get a valid, better-than-random result.
1. (10%) Your code runs reasonably efficiently, given the resource constraints (again, don't spend hours optimising...).

INSTALLATION AND SETUP

Installed the Kaggle command-line tool to enable the data download.
Downloaded and installed Miniconda.
Created a conda environment called 'i_python' with ipython-notebook, numpy, pandas and scikit-learn.

In [None]:
#pip install kaggle
#bash Miniconda3-latest-MacOSX-x86_64.sh
#conda create -n i_python ipython-notebook numpy pandas scikit-learn
#conda info --envs

Activate the i_python environment, change to the chosen working directory and start the coding_test_2_solution_BroichM.ipynb notebook.

In [20]:
#source activate i_python
#ipython notebook coding_test_2_solution_BroichM.ipynb

Clear variables and libs, then load the required packages and print their versions.
The versions used in this solution were:
Python version 3.5.5
numpy version 1.14.2
pandas version 0.22.0
scikit-learn version 0.19.1

In [45]:
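# clear all user-defined variables and libs
def clearall():
    # keep the names IPython itself relies on (In, Out, get_ipython, exit, quit and
    # "_"-prefixed names such as _oh); deleting them can break the notebook's output handling
    keep = {"In", "Out", "get_ipython", "exit", "quit"}
    names = [var for var in globals() if not var.startswith("_") and var not in keep]
    for var in names:
        del globals()[var]
clearall()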

#import libs
import os
import sys
import time
import numpy as np
import pandas as pd
import sklearn
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier

print("The Python version is %s.%s.%s" % sys.version_info[:3])
print('The numpy version is {}.'.format(np.__version__))
print('The pandas version is {}.'.format(pd.__version__))
print('The scikit-learn version is {}.'.format(sklearn.__version__))

The Python version is 3.5.5
The numpy version is 1.14.2.
The pandas version is 0.22.0.
The scikit-learn version is 0.19.1.


Download the data, move it to the working dir and unzip it (the working dir on my Mac is: /Users/mb/Desktop/near_map)

In [39]:
com = "kaggle datasets download -d orgesleka/used-cars-database"
os.system(com)
com = "mv ../../.kaggle/datasets/orgesleka/used-cars-database/used-cars-database.zip ."
os.system(com)
com = "unzip used-cars-database.zip"
os.system(com) #produces an error but does the job

KeyError: '_oh'

Define the functions (including the random forest used in this solution).
In response to the specs, the preprocessing function marks a random 50% of the cars as training data and the remaining 50% as cases on which we predict and test the model. The prediction/testing data is never used to build the model (e.g. for cross-validation) but is set aside. Holding out 50% is conservative, so the resulting accuracy can be assumed to generalise well to a new set of cars.
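
The random flagging described above can be sketched in isolation. This is a minimal sketch, not the notebook's own code: the helper `add_fit_flag`, its `seed` parameter and the demo dataframe are hypothetical, and the convention that flag 1 marks fitting rows is an assumption of the sketch. Seeding the generator would make the split reproducible across runs.

```python
import numpy as np
import pandas as pd


def add_fit_flag(df, fit_share=0.5, seed=0):
    """Return a copy of df with a random 0/1 'fit_eval' column (1 = fitting row)."""
    rng = np.random.RandomState(seed)  # seeded so the same rows are flagged on every run
    out = df.copy()
    out["fit_eval"] = rng.choice([0, 1], size=len(out), p=[1 - fit_share, fit_share])
    return out


# hypothetical stand-in for one 1,000-row chunk of autos.csv
demo = pd.DataFrame({"price": np.arange(1000)})
flagged = add_fit_flag(demo, fit_share=0.5, seed=42)

fit_rows = flagged[flagged["fit_eval"] == 1]   # rows used to fit the model
eval_rows = flagged[flagged["fit_eval"] == 0]  # rows set aside for evaluation
print(len(fit_rows), len(eval_rows))
```

Because the flag is drawn independently per row, the split is only approximately 50/50 within any single chunk.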

In [40]:
# functions

def func_preprocess(df):
    """
    v 1.0 To preprocess the dataframe, including flagging rows for fitting and testing
    :param df: one chunk of autos.csv as a dataframe
    :return: preprocessed dataframe
    """

    # drop columns with likely minor importance for prediction
    df.drop(['dateCrawled','name', 'seller', 'offerType', 'abtest','gearbox','model','monthOfRegistration','fuelType','notRepairedDamage','dateCreated', 'nrOfPictures', 'postalCode','lastSeen'], axis='columns', inplace=True)
    
    # add a column of random 0/1 flags to the df to split it into train and test data (here splitting 50/50)
    slength = df.shape[0]
    # flag 1 is used for fitting so setting p=[0.75, 0.25] would result in a 25% fitting sample
    df['fit_eval'] = np.random.choice([0, 1], slength, p=[0.5, 0.5])

    # find all NaN and replace them with the mode of the respective column
    for column in df.columns:
        df[column] = df[column].fillna(df[column].mode()[0])

    # translate the categorical (object dtype) variables into integers
    for column in df.columns:
        if df[column].dtype == object:
            le = preprocessing.LabelEncoder()
            df[column] = le.fit_transform(df[column])
    
    return df


# run func_preprocess v 1.0 
#df = func_preprocess(df)

def func_business(df):
    """
    v 1.0 To apply the business rules
    :param df: preprocessed dataframe
    :return: dataframe restricted to cars listed at < 100k Euros, with the target column y added
    """

    # subset cars listed at < 100k Euros as per specs
    # .copy() avoids a pandas SettingWithCopyWarning when y is added below
    df = df[df.price < 100000].copy()

    # which of the cars (listed at < 100k Euros) are listed as cheap (< 2,000 Euros) as per specs:
    # code the y variable as 1 and 0 for price < 2000 and >= 2000 Euros, respectively
    df['y'] = np.where(df['price'] < 2000, 1, 0)

    return df

# run func_business v 1.0 
#df = func_business(df)


def func_select_fit_eval(df, fit):
    """
    v 1.0 To select the fitting or the evaluation dataset for training
        or application of the model
    :param df: dataframe with the columns price, x variables, fit_eval and y (in this order)
    :param fit: 1 to select the fitting rows, any other value to select the evaluation rows
    :return: train_test_x, train_test_y
    """
    
    # subset the x variables and the y variable
    # the first column is price (y is derived from it) and the last two columns are
    # fit_eval (the identifier of the train and test dataset) and y;
    # none of these are included in the x variables
    x_header = df.columns[1:len(df.columns)-2]
    y_header = df.columns[-1]  # y is the last column, added by func_business
    
    # the function can be called to provide input for the train or the test dataset
    if fit == 1:
        # pick only the fitting data (rows flagged 1)
        df_fit = df[df['fit_eval'] == 1]
        train_test_x = df_fit[x_header]
        train_test_y = df_fit[y_header]
    else:
        # pick only the data not used for fitting (rows flagged 0)
        df_eval = df[df['fit_eval'] == 0]
        train_test_x = df_eval[x_header]
        train_test_y = df_eval[y_header]
        
    return train_test_x, train_test_y

# run func_select_fit_eval v 1.0 
#fit = 1 #or fit = 0
#train_test_xy = func_select_fit_eval(df,fit)
#train_test_x = train_test_xy[0]
#train_test_y = train_test_xy[1]


def func_random_forest(clf, treecount, counter):
    """
    v 1.0 To fit the random forest
    Note: reads the current chunk from the global variables train_test_x and train_test_y
    :param clf: random forest object (ignored for the first chunk)
    :param treecount: total number of trees the forest should have after this call
    :param counter: chunk counter (0 for the first chunk)
    :return: random forest object
    """
    
    if counter == 0:
        # for the first chunk create the random forest model using all remaining features 
        # set warm_start=True so that more trees can be added for subsequent chunks
        clf = RandomForestClassifier(n_estimators=treecount, warm_start=True, max_features=5, n_jobs=1)
        clf.fit(train_test_x, train_test_y)        
    else:
        # add more trees for subsequent chunks
        clf.set_params(n_estimators=treecount)
        clf.fit(train_test_x, train_test_y)    
    
    return clf

# run func_random forest v 1.0 
#clf = func_random_forest(clf,treecount,counter)


def func_accuracy(ys,predictions):
    """
    v 1.0 To calculate the overall accuracy, recall, precision and f1 score
    Note: appends to the global lists y and y_hat_model and converts them to arrays
    :param ys: list of (chunk counter, true y values) tuples
    :param predictions: list of (chunk counter, predicted y values) tuples
    :return: none (the scores are printed)
    """
    
    # flatten the prediction results and the y variable harvested from the chunks
    counter = 0
    global y 
    global y_hat_model  
    for name, prediction in predictions:
        y.extend(ys[counter][1])
        y_hat_model.extend(predictions[counter][1])
        counter = counter + 1

    y = np.asarray(y)
    y_hat_model = np.asarray(y_hat_model)
    

    # count the confusion matrix values
    # count of all results
    all_count = len(y)
    # count of cars < 2k
    all_pos = sum(y)
    # count of cars predicted as < 2k
    all_hat = sum(y_hat_model)
    
    # correctly predicted cars < 2k
    correct_pos = sum((y==1) & (y_hat_model==1))
    # correctly predicted cars >= 2k
    correct_neg = sum((y==0) & (y_hat_model==0))

    # p (precision) is the number of correct positive results divided by all positive results returned by the classifier
    p = correct_pos / all_hat
    # r (recall) is the number of correct positive results divided by all samples that should have been identified as positive
    r = correct_pos / all_pos
    # the f1 score is the harmonic mean of precision and recall
    f1 = 2 * (p * r / (p + r))
    # overall accuracy
    o = (correct_pos + correct_neg) / all_count

    # print overall accuracy, recall, precision and f1 score to 3 d.p.
    print('overall_accuracy',round(o,3), 'recall',round(r,3), 'precision',round(p,3), 'f1',round(f1,3))

# run func_accuracy v 1.0 
#run func_accuracy(ys,predictions)
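
The hand-written formulas in func_accuracy above can be cross-checked against `sklearn.metrics`, which is part of the allowed toolkit. The sketch below uses a small made-up pair of label vectors (`y_true` and `y_pred` are hypothetical, not results from this notebook) and compares the manual confusion-matrix arithmetic with the library functions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# made-up labels for illustration only: 1 = cheap (< 2,000 Euros), 0 = not cheap
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 1])

# same arithmetic as in func_accuracy
correct_pos = np.sum((y_true == 1) & (y_pred == 1))
correct_neg = np.sum((y_true == 0) & (y_pred == 0))
precision = correct_pos / np.sum(y_pred)
recall = correct_pos / np.sum(y_true)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (correct_pos + correct_neg) / len(y_true)

# the library versions should agree with the manual values
print(np.isclose(precision, precision_score(y_true, y_pred)))
print(np.isclose(recall, recall_score(y_true, y_pred)))
print(np.isclose(f1, f1_score(y_true, y_pred)))
print(np.isclose(accuracy, accuracy_score(y_true, y_pred)))
```

Note that the manual precision divides by the number of predicted positives, so it is undefined if a model never predicts the cheap class.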

Set lists and variables

In [41]:
evals = []
clf = []
predictions = []
ys = []
y = []
y_hat_model = []

# add one random forest tree per chunk and use a chunk size of 1000 rows as per specs
treecount = 1
chunksize = 1000
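
The loops that follow rely on the `chunksize` argument of `pd.read_csv`, which returns an iterator of dataframes instead of loading the whole file. The sketch below shows the mechanism on a tiny in-memory CSV; the data and the two column names are a hypothetical stand-in for autos.csv, and the chunk size of 3 is chosen only to keep the example small.

```python
import io

import pandas as pd

# tiny stand-in for autos.csv (hypothetical values): prices 500, 1000, ..., 3500
csv_text = "price,kilometer\n" + "\n".join(
    "{},{}".format(500 * i, 10000 * i) for i in range(1, 8)
)

chunk_sizes = []
cheap_count = 0
# with chunksize set, read_csv yields one dataframe of at most 3 rows at a time
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    chunk_sizes.append(len(chunk))
    # aggregate per chunk so that only one chunk is held in memory at once
    cheap_count += int((chunk["price"] < 2000).sum())

print(chunk_sizes)
print(cheap_count)
```

Only running totals are kept between iterations, which is what lets the same pattern scale to files that do not fit in memory.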

In [42]:
#start timer
start = time.time()

#fit the model by building up a random forest, adding one tree per chunk of lines read
#the result is 'clf', a random forest with 'treecount' trees from the first chunk 
#(the starting value is set by the user) plus one tree for every subsequent chunk
#this loop also produces 'evals', a list of the flags denoting which cars have been used for fitting

counter = 0
for df in pd.read_csv("autos.csv", chunksize=chunksize ,encoding='latin-1'):

    # run func_preprocess v 1.0 
    df = func_preprocess(df)

    #capture the fit_evals
    evals.append((counter, df['fit_eval']))

    # run func_business v 1.0 
    df = func_business(df)
    #print(list(df))

    ## func to select fitting or application data
    # here fitting
    fit = 1
    train_test_xy = func_select_fit_eval(df,fit)
    train_test_x = train_test_xy[0]
    train_test_y = train_test_xy[1]

    # run func_random_forest v 1.0 
    clf = func_random_forest(clf,treecount,counter)

    counter = counter + 1
    treecount = treecount + 1
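
The fitting loop above grows the forest with scikit-learn's `warm_start` option. A minimal sketch of that mechanism on synthetic data follows; the helper `make_chunk`, the feature layout and the seeds are hypothetical and stand in for the preprocessed chunks of autos.csv.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)


def make_chunk(n_rows=200):
    """Hypothetical chunk: 5 numeric features, label depends only on the first one."""
    x = rng.rand(n_rows, 5)
    y = (x[:, 0] > 0.5).astype(int)
    return x, y


# warm_start=True keeps the trees already fitted when fit() is called again
clf = RandomForestClassifier(n_estimators=1, warm_start=True, random_state=0)
for n_trees in range(1, 6):
    x_chunk, y_chunk = make_chunk()
    clf.set_params(n_estimators=n_trees)  # raise the target size of the forest by one
    clf.fit(x_chunk, y_chunk)             # only the newly added tree is fitted, on this chunk

# evaluate on a fresh chunk that no tree has seen
x_new, y_new = make_chunk()
print(len(clf.estimators_), clf.score(x_new, y_new))
```

Each tree sees only one chunk, so every chunk should contain both classes; a chunk with a single class would give its tree nothing to learn from.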

In [43]:
#apply the 'clf' model to one chunk at a time
#the loop also reads back 'evals', the flags denoting which cars have been used for fitting,
#so that only the cars not used for fitting are used for prediction and later evaluation

#this loop results in two lists: 'predictions' and 'ys', which hold the model results and the 
#true results, respectively

counter = 0
for df in pd.read_csv("autos.csv", chunksize=chunksize ,encoding='latin-1'):
    # run func_preprocess v 1.0 
    df = func_preprocess(df) 

    # run func_business v 1.0 
    df = func_business(df)

    # read the stored fit_eval flags back into the df
    fit_eval = evals[counter][1]
    df['fit_eval'] = fit_eval

    # func to select fitting or application data
    # here application
    fit = 0
    train_test_xy = func_select_fit_eval(df,fit)
    train_test_x = train_test_xy[0]
    train_test_y = train_test_xy[1]

    ## predict and append the predictions and the ys
    predictions.append((counter, clf.predict(train_test_x)))
    ys.append((counter, train_test_y))
    counter = counter + 1

In [44]:
# run func_accuracy v 1.0 
func_accuracy(ys,predictions)

#end timer
end = time.time()
print("runtime in min:",(end - start)/60)# time in min

recall 0.825 precision 0.822 overall accuracy 0.86 f1 0.824
runtime in min: 0.7175386508305868


So in less than 1 min of runtime I got an overall accuracy of 0.86 (f1 0.824), compared to the 50% that would be the random benchmark as per the specs.