# Don't Get Kicked

Here, we try to predict which used-car purchases will incur a loss for the dealership. First, we load the data. Then, we concatenate the training and test dataframes so that feature engineering is applied to the entire dataset.

In this section we load the necessary libraries and write functions to process data.

In [1]:
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score


# Data preprocessing and feature engineering
def preprocessing(train_path, test_path):
    # Read the CSV files passed by the caller rather than hard-coded paths
    data_train = pd.read_csv(train_path)
    data_test = pd.read_csv(test_path)
    
    y_train = data_train["IsBadBuy"]
    
    data_train = data_train.drop(["IsBadBuy"], axis = 1)
    
    print("The size of training data is:", len(data_train))
    print("The size of test data is:", len(data_test))
    
    print('The number of cars in Good/Kick classes are, respectively: ', Counter(y_train))


    data = pd.concat([data_train, data_test])

    data["PurchYear"] = data["PurchDate"].apply(lambda x: x.split("/")[2]).astype(int)
    data["PurchMonth"] = data["PurchDate"].apply(lambda x: x.split("/")[0]).astype(int)
    data = data.drop(['PurchDate'], axis=1)

    data["AUCGUART"] = data["AUCGUART"].fillna("YELLOW")

    cat_cols = ['Auction', 'Make', 'Model', 'Trim', 'SubModel', 'Color', 'Transmission', 'WheelType', 
                'Nationality', 'Size', 'TopThreeAmericanName', 'PRIMEUNIT', 'AUCGUART', 'VNST']

    for col in cat_cols:
        data[col] = data[col].fillna(data[col].mode()[0])

    for col in list(data):
        if col not in cat_cols:
            data[col] = data[col].fillna(data[col].median())
        
        
        
    data["MMR"] = data[['MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionRetailAveragePrice',
                        'MMRCurrentAuctionAveragePrice', 'MMRCurrentRetailAveragePrice', 
                        'MMRAcquisitionAuctionCleanPrice', 'MMRAcquisitonRetailCleanPrice',
                        'MMRCurrentAuctionCleanPrice', 'MMRCurrentRetailCleanPrice']].mean(axis=1)


    data["VehicleAge"] = data["VehicleAge"].replace(to_replace = 0, value = 0.1)
    data["MilesperYear"] = data["VehOdo"] / data["VehicleAge"]

    data["WheelTypeID"] = data["WheelTypeID"].replace(to_replace = 0, value = 1)
    data["Transmission"] = data["Transmission"].replace(to_replace = "Manual", value = "MANUAL")

    del_col = ['RefId', 'VehYear', 'SubModel', 'WheelType', 'VehOdo',
               'MMRAcquisitionAuctionAveragePrice', 'MMRAcquisitionAuctionCleanPrice',
               'MMRAcquisitionRetailAveragePrice', 'MMRAcquisitonRetailCleanPrice',
               'MMRCurrentAuctionAveragePrice', 'MMRCurrentAuctionCleanPrice',
               'MMRCurrentRetailAveragePrice', 'MMRCurrentRetailCleanPrice']

    data = data.drop(columns=del_col)

    
    # Encoding categorical data
    labelencoder = LabelEncoder()

    # LabelEncoder expects a 1-D input, so pass a Series instead of a one-column DataFrame
    data['Transmission'] = labelencoder.fit_transform(data['Transmission'])
    data['PRIMEUNIT'] = labelencoder.fit_transform(data['PRIMEUNIT'])

    data = pd.get_dummies(data, columns = ['Auction', 'Make', 'Model', 'Trim', 'Color', 'Nationality', 
                                           'Size', 'TopThreeAmericanName', 'AUCGUART', 'VNST'])


    # Down-sample the "Good" class without replacement so that no row is drawn twice
    Class_0_ix = np.random.choice(y_train[y_train==0].index.tolist(), size=9024, replace=False).tolist()

    ix = y_train[y_train==1].index.tolist() + Class_0_ix
    
    train = data.iloc[ix]
    y_train = y_train.iloc[ix]
    # The test rows follow the training rows in the concatenated frame
    test = data.iloc[len(data_train):]
    
    print('The number of cars in Good/Kick classes after down-sampling of the former are, respectively: ', Counter(y_train))

    return [train, y_train, test, data_test]
    
    

# Apply PCA to reduce dimensionality and remove redundant (correlated) features
def apply_pca(train_df, test_df, n_features):

    pca = PCA(n_components=n_features)
    pca.fit(np.array(train_df))

    train_df_pca = pca.transform(train_df)

    test_df_pca = pca.transform(test_df)
    
    return [train_df_pca, test_df_pca]


# Evaluation and Score
def gini(actual, pred):

    actual_len = len(actual)
    assert actual_len == len(pred)
    # Columns: actual label, prediction, original position (used as tie-breaker)
    data = np.asarray(np.c_[actual, pred, np.arange(actual_len)], dtype=float)
    # Sort by descending prediction; ties keep their original order
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    gini_sum = data[:, 0].cumsum().sum() / data[:, 0].sum()
    gini_sum -= (actual_len + 1) / 2.
    return gini_sum / actual_len

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    return normalized_gini



Using the functions above, the data is loaded and preprocessed as follows. We extract the month and year of purchase from the "PurchDate" column and drop the rest of it, since the column is stored as a string. Next, we fill in the missing data. The "AUCGUART" column is filled with "YELLOW", since most of its values are missing and "YELLOW" is an intermediate value. For the other categorical features we use the mode, and for numerical features we replace missing values with the column median. The MMR price columns are condensed into a single column holding the average of all MMR values. Moreover, the "VehOdo" column is replaced by miles per year ("MilesperYear"), that is, the odometer reading divided by the vehicle age.
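The steps described above can be tried on a small frame. This is only a sketch: the column names mirror the dataset, but the rows are made up, and only one categorical and one numerical column are imputed.

```python
import pandas as pd

# Toy frame; column names follow the dataset, the values are invented.
toy = pd.DataFrame({
    "PurchDate": ["12/7/2009", "1/15/2010", "3/2/2010"],
    "Color": ["RED", None, "RED"],
    "VehOdo": [60000.0, None, 30000.0],
    "VehicleAge": [3.0, 0.0, 2.0],
})

# Split the "month/day/year" string and keep only month and year.
parts = toy["PurchDate"].str.split("/", expand=True)
toy["PurchMonth"] = parts[0].astype(int)
toy["PurchYear"] = parts[2].astype(int)
toy = toy.drop(columns=["PurchDate"])

# Mode for the categorical column, median for the numerical one.
toy["Color"] = toy["Color"].fillna(toy["Color"].mode()[0])
toy["VehOdo"] = toy["VehOdo"].fillna(toy["VehOdo"].median())

# Replace an age of zero to avoid dividing by zero, then derive miles per year.
toy["VehicleAge"] = toy["VehicleAge"].replace(0, 0.1)
toy["MilesperYear"] = toy["VehOdo"] / toy["VehicleAge"]

print(toy)
```

Replacing an age of 0 with 0.1 keeps the ratio finite, but it also produces very large "MilesperYear" values for new cars, which is worth keeping in mind when inspecting this feature.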

We encode all categorical features, one-hot encoding those with more than two non-ordinal categories. In addition, the Kicked/Good classes are imbalanced, which can cause issues for the predictive model. One way to address this is to down-sample the dominant class so that the two classes are close in size.
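A minimal sketch of one-hot encoding and down-sampling on invented data follows. The seed and the exact 1:1 class ratio are assumptions made for the illustration; the notebook itself keeps 9024 "Good" cars against 8976 "Kicked" ones.

```python
import numpy as np
import pandas as pd

# Toy data: 8 "Good" cars (0) and 2 "Kicked" cars (1); the values are invented.
X = pd.DataFrame({"Make": ["FORD", "DODGE"] * 5, "VehBCost": np.arange(10.0)})
y = pd.Series([0, 0, 0, 0, 1, 0, 0, 0, 0, 1])

# One-hot encode the non-ordinal categorical column.
X = pd.get_dummies(X, columns=["Make"])

# Draw as many majority rows as there are minority rows, without replacement,
# so that no row is picked twice. The seed is an arbitrary choice.
rng = np.random.default_rng(0)
minority_ix = y[y == 1].index.to_numpy()
majority_ix = rng.choice(y[y == 0].index.to_numpy(),
                         size=len(minority_ix), replace=False)

keep = np.concatenate([minority_ix, majority_ix])
X_balanced, y_balanced = X.loc[keep], y.loc[keep]

print(y_balanced.value_counts())
```

Sampling without replacement matters here: with replacement, the same "Good" car can appear several times, so the effective number of distinct majority rows is smaller than the requested size.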

In [2]:
[train, y_train, test, data_test] = preprocessing('training.csv', 'test.csv')


# Apply PCA
[train_pca, test_pca] = apply_pca(train, test, n_features=200)


# Splitting the dataset into the Training set and Test set
X_train, X_test, y_train, y_test = train_test_split(train_pca, y_train, test_size = 0.2, random_state = 0)

The size of training data is: 72983
The size of test data is: 48707
The number of cars in Good/Kick classes are, respectively:  Counter({0: 64007, 1: 8976})




The number of cars in Good/Kick classes after down-sampling of the former are, respectively:  Counter({0: 9024, 1: 8976})


Finally, a model has to be selected to learn from the training set and predict the class of each car in the test data. XGBoost is selected here; however, more models should be investigated and compared based on their cross-validation (CV) scores. We generate a CSV file as the submission file.
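Comparing models by CV score could look roughly like the sketch below. The synthetic data stands in for the PCA-transformed training set, and the two candidates and the AUC scoring are assumptions chosen for illustration; any scikit-learn compatible classifier, including `XGBClassifier`, could be added to the dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the PCA-transformed training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Hypothetical candidates, not the models used in this notebook.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # 5-fold cross-validation, scored by area under the ROC curve.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    cv_scores[name] = (scores.mean(), scores.std())
    print(f"{name}: mean AUC = {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the standard deviation next to the mean helps judge whether a difference between two models is larger than the fold-to-fold noise.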

In [3]:
# parameter tuning of xgboost
# start from default setting
boost_params = {'eval_metric': 'logloss'}

# verbosity=0 silences training messages; n_jobs=-1 uses all available cores
classifier = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100,
                           verbosity=0, objective="binary:logistic", booster='gbtree',
                           n_jobs=-1, gamma=0, min_child_weight=1, max_delta_step=0,
                           subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0,
                           reg_lambda=1, scale_pos_weight=1, base_score=0.5)
                  


classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)


# Making the Confusion Matrix
cm = confusion_matrix(y_test, y_pred)


# Applying k-Fold Cross Validation
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()


"""# Applying Grid Search to find the best model and the best parameters
parameters = [{'max_depth': [2, 3, 4, 5], 'learning_rate': [0.5, .1, .01, .001], 'n_estimators': [10, 100, 1000],
              'gamma': [0, .1, .2, .3], 'min_child_weight': [1, 3, 7, 10]}]
grid_search = GridSearchCV(estimator = classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

print(best_parameters)
"""


#gini(y_test, y_pred)
print('The normalized Gini coefficient is: ', normalized_gini(y_test, y_pred))

# The trained model is used to predict the class of each car:
data_test['IsBadBuy'] = classifier.predict(test_pca)

data_test[['RefId', 'IsBadBuy']].to_csv(
    'kick_submission.csv', index=False, float_format='%.3f')



The normalized Gini coefficient is:  0.2990780762918337




The Gini coefficient and its normalized version are appropriate choices to quantify the classification performance. Ideally, we want to maximize the true positives and true negatives and minimize the false negatives and false positives. A confusion matrix shows the number of cars in each of these categories. Alternatively, the F1 score, the harmonic mean of precision and recall, can be used.
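The metrics mentioned above can be computed on a small example as follows. The labels, scores, and the 0.5 threshold are invented for illustration. The sketch also relies on the fact that, for binary labels and scores without ties, the normalized Gini coefficient equals 2 * AUC - 1. Since Gini is a ranking metric, it is computed here from predicted probabilities (for instance those returned by `predict_proba`) rather than from hard class labels, which is a possible refinement and not what the cell above does.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Made-up true labels and predicted probabilities of the "Kicked" class.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Hard class labels at an assumed threshold of 0.5.
y_label = (y_score >= 0.5).astype(int)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_true, y_label).ravel()

# Harmonic mean of precision and recall.
f1 = f1_score(y_true, y_label)

# Normalized Gini from the ranking of the scores (binary labels, no ties).
gini_from_auc = 2 * roc_auc_score(y_true, y_score) - 1

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("F1 score:", round(f1, 3))
print("Normalized Gini:", round(gini_from_auc, 3))
```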

In order to improve the performance of the model, we can make one or more of the following changes:
- The number of "Good" cars kept after down-sampling can be increased or decreased. We can also try another technique to address the class imbalance, for example over-sampling the "Kicked" cars.
- The number of components retained by PCA can be increased.
- The hyperparameters of XGBoost can be tuned using grid search (commented out in the last cell).
- Another machine learning technique can be adopted.
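As an illustration of the first suggestion, over-sampling the "Kicked" class could be sketched as below. The frame is invented and the seed is arbitrary; this is one simple approach (random over-sampling with replacement), not a method used in this notebook.

```python
import pandas as pd

# Toy training frame: 6 "Good" cars (0) and 2 "Kicked" cars (1).
toy_train = pd.DataFrame({
    "VehBCost": [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 4.0, 3.0],
    "IsBadBuy": [0, 0, 0, 0, 0, 0, 1, 1],
})

good = toy_train[toy_train["IsBadBuy"] == 0]
kicked = toy_train[toy_train["IsBadBuy"] == 1]

# Resample the minority class with replacement up to the majority size.
kicked_upsampled = kicked.sample(n=len(good), replace=True, random_state=0)

# Combine, shuffle, and reset the index.
balanced = pd.concat([good, kicked_upsampled]).sample(frac=1, random_state=0)
balanced = balanced.reset_index(drop=True)

print(balanced["IsBadBuy"].value_counts())
```

When over-sampling, it is generally safer to split off the validation data first and resample only the training part, so that copies of the same car do not end up on both sides of the split.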