Analysis.Rmd

---
output: html_document
---
# 1 Pre-analysis
As a first step to this analysis, we identify possible features that may change the cost of a tube assembly quoted by a supplier. We also ask questions about the dataset where the answers to these questions will help us to choose the right machine learning algorithms. Since we are predicting the cost value, a continuous variable, we will use a regression algorithm family.

### 1.1  Questions

To archieve our goal of predicting the cost with a good accuracy, we need to answer the following questions.

1. What features are useful to determine the cost and what features to exclude from the analysis?
2. Is there a unique mathematical model describing the cost in function of the quantity for each supplier?
3. If there is a mathematical model, is it a linear or non-linear model? 
4. Are there more than one model to estimate the cost where each model are independent of the others?
5. Are there weighted decisions that can be taken amoung the features? 


# 2 Possible Dependent Features
In this section, we will answer the question: _What features are used to determine the cost and what features to exclude from the analysis?_

### 2.1  Tube Physical Properties
As a supplier, we have to think on which tube features the cost will be based. We know that a tube assembly is made with one or more components. Some numerical tube properties may be helpful to check.

* The total weight of the tube and some statistical measures like the maximum and average weights
* The quantity to purchase with the mean, maximum and minimum quantities for each tube
* The volume which depends on the diameter, the wall thickness and the length of the tube
* The number of bends used with the bend radius. Logically, it is more difficult to bend a tube than to keep it linear, so it should be more expansive
* The number of components to assemble a tube. Assembling many components need welds and connectors which should be more expansive
* The material used to make the tube. Some type of material can be much expensive than others (e.g. Basic Metal Group vs Precious Metal Group)
* The type of component used to assemble the tube. Some types of component may be more complicated to build by their shape (e.g. a tee is more complicated to build than a straight tube)

### 2.2  Supplier Features

* The date when the supplier has quoted the price which is certainly less 20 years ago than today when the cost is not adjusted.
* The suppliers may use different mathematical models to quote their price. Some may set cheaper costs, some may set expensive costs.
* The supplier itself which can use different model and may be considered as classes.


# 3 Preparing & Cleaning the Dataset
In this section, we will explain why we chose to keep and exclude features and how we will clean the dataset.

From the dataset, we note that there are a total of 2048 components. These components are spread amoung the `comp_[type].csv` files uniquely. This means that we can create a single table `Component` by merging those files together. To avoid to many columns, we will remove some features that we do not want in this analysis.

The training and test sets are merged together where we set the cost to 0 for the test set.

The file `bill_of_materials.csv` gives us the list of components with their respective quantity used to assemble a tube. Thus, to calculate the total weight for each tube, we use the formula $$W_T = \sum_{i = 0}^n W_i Q_i$$ where $W_T$ is the total weight of the tube $T$, $W = (W_1,\ldots,W_n)$ is the vector of component weights, $Q = (Q_1,\ldots,Q_n)$ the vector of component quantities and $n \leq 8$ the number of possible components used to assemble a tube $T$.

Let the total volume estimation of a tube assembly be denoted by $V_T$. The volume is function of the length, the wall thickness and the diameter of the tube and its formula is $$V_T = \pi L t(d - t)$$, where $t$ is the wall thickness, $d$ the outside diameter and $L$ the developed length of the tube.

Every ID used (e.g. tube_assembly_id, material, supplier, etc.) as a string in CSV files are converted to a positive integer without the leading zeros. The quote date is converted to a positive integer in the format `YYYYMMDD`. We keep only the year and month (YYYYMM) of the date since the influance of the days in a month on the cost should be negligeable.

To test and find data efficiently, we create a database `Caterpillar` with tables, views and indexes. The script to query this database is given by the file `DatabaseManipulation.R`. The script to insert in batch the data from the CSV files to the database is given by the file `DatabaseInsertions.R`.

We include librairies for graphs and tables to display in this document. We also connect to the Caterpillar database for the next queries to execute.
```{r echo = FALSE, message = FALSE, warning = FALSE}
source("Scripts/DatabaseManipulation.R")

## Include libraries for plotting graphes and to display some tables.
library(ggplot2)
library(xtable)
library(knitr)

database <- CaterpillarDatabase$new()
database$connect()
```

We prepare the train and test datasets needed for the analysis.
```{r echo = FALSE, message = FALSE, warning = FALSE}
query = "SELECT TAP.fkTubeAssembly, TAP.supplierID, TAP.quoteDate, TAP.anualUsage, TAP.minOrderQuantity, TAP.quantity, TAP.cost,
                TAC.totalWeight, TAC.minWeight, TAC.maxWeight, TAC.numberOfComponents,
                TA.diameter, TA.wallThickness, TA.length, TA.bendRadius, TA.materialID, TA.specs, TA.numberOfBends,
                Q.maxQty, Q.avgQty, Q.cntQty
         FROM TubeAssemblyPricing AS TAP
             INNER JOIN TubeAssembly AS TA ON TA.pkTubeAssembly = TAP.fkTubeAssembly
             LEFT JOIN TubeAssemblyWeightView AS TAC ON TAC.fkTubeAssembly = TAP.fkTubeAssembly
             INNER JOIN TubeAssemblyQuantityView AS Q ON TAP.fkTubeAssembly = Q.fkTubeAssembly
         LIMIT 60448;"
data <- database$selectFromTable(query)
gc()

# Change the date format to YYYYMM and replace NA by 0.
library(lubridate)
library(stringr)
data$quoteDate <- as.Date(data$quoteDate)
data$quoteDate <- year(data$quoteDate) * 100 + month(data$quoteDate)

data$numberOfComponents <- ifelse(is.na(data$numberOfComponents), 0, data$numberOfComponents)
data$totalWeight <- ifelse(is.na(data$totalWeight), 0, data$totalWeight)
#data$volume <- pi * data$length * data$wallThickness * (data$diameter - data$wallThickness)
data$minWeight <- ifelse(is.na(data$minWeight), 0, data$minWeight)
data$maxWeight <- ifelse(is.na(data$maxWeight), 0, data$maxWeight)
data$specs <- ifelse(is.na(data$specs), 0, str_count(data$specs, ",") + 1)

# Split the dataset in a train set and a test set, and set the ID for each one.
test <- data[data$cost == 0,]
test$id <- 1:nrow(test)
train <- data[data$cost > 0,]
train$id <- 1:nrow(train)
```

# 4 Cost Models
In this section, we will answer the question: _Is there a unique mathematical model describing the cost in function of the quantity for each supplier?_ The first objective is to check the existence of a mathematical model representing the cost in function of the quantity. The second objective is to show if the model is applied by a unique supplier. The last objective is to show if each supplier has its own model. If the unicity does not hold, then we have to check if a model is applied by more than one supplier or if a supplier can apply more than one model depending of other features. We will then answer the question: _If a mathematical model exists, is it a linear or non-linear model?_

We denote $C_{T}(Q)$ our cost heuristic function of a tube assembly $T$ given by a supplier with $\beta$ our learning parameters and $Q$ the vector of quantities.

### 4.1 Existence of a Mathematical Model
We start with few tube assemblies which are quoted by the supplier `S-0066`.

```{r echo = FALSE, message = FALSE, warning = FALSE, results = 'asis'}
columns <- c("fkTubeAssembly", "supplierID", "totalWeight", "quantity", "cost")
data <- train[train$fkTubeAssembly %in% c(2, 5, 5000, 19365), columns]
print(xtable(data, caption = "Tubes 2, 5, 5000 and 19365"), comment = FALSE, type = 'latex')

par(mfrow = c(1,3), xpd = TRUE)
data_min <- data[data$fkTubeAssembly == c(2, 5),]
plot(data_min$quantity, data_min$cost, col = "red", xlab = "Quantity", ylab = "Cost", xlim = c(1, 100), ylim = range(data_min$cost))
points(data_min$quantity, data_min$cost, pch = 20, col = "red")
mtext("Cost of tubes in function of quantity (Supplier S-0066)", side = 3, line = -2, outer = TRUE)

curve((18.906873355/x) + 2.999059664, add = TRUE, col = "blue")
curve((23.477796387/x) + 4.8964239413, add = TRUE, col = "green")

data_max <- data[data$fkTubeAssembly == 5000,]
plot(data_max$quantity, data_max$cost, col = "red", xlab = "Quantity", ylab = "Cost", xlim = range(data_min$quantity), ylim = range(data_max$cost))
points(data_max$quantity, data_max$cost, pch = 15, col = "red")
curve((20.153235813/x) + 121.5064293385, col = "black", add = TRUE)

data_large <- data[data$fkTubeAssembly == 19365,]
plot(data_large$quantity, data_large$cost, col = "red", xlab = "Quantity", ylab = "Cost", xlim = range(data_min$quantity), ylim = range(data_large$cost))
points(data_large$quantity, data_large$cost, pch = 15, col = "red")
curve((244.434490855/x) + 54.3475236892, add = TRUE, col = "chocolate")

legend('topright', 
       c("Train Dataset", "TA-00002", "TA-00005", "TA-08459", "TA-05000"), 
       lty = 1, 
       col = c('red', 'blue', 'green', 'yellow', 'black'),
       bty = 'n') 
       #cex = .75,
      #inset = c(-0.2, -0.35))
```

From the graphs, we see that the curves estimating the red points are clearly hyperbolas of equation 
$$C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q} + \beta_1$$
where $Q \geq 1$ is the quantity for a tube assembly ID $T$, $\beta_1$ is the cost at the last level of purchase based on quantity and supplier (most of the time $Q = 250$), and $\beta_0$ is the cost at the first level of purchase based on quantity and supplier (most of the time $Q = 1$). This equation indicates that if Caterpillar buy more tubes of the same ID, cheaper will be the cost per tube. This proves the existence of a mathematical model representing the cost in function of the quantity.

If we take a look at the right most graph, we see that our curve doesn't seem to fit the points. However, the maximum quantity is 7 (not 250) for this tube which make the model less accurate assuming the same model is used. This assumption makes sense since 
$$\lim_{Q \to \infty} C_{T}(Q) = \beta_1$$
which means that we need to find the right $\beta_1$ to match with any quantity. We also have to find the cost of one tube which is $\beta_0$.

For example, if we take the tube `TA-19365`, we have $C_{T}(1) = \beta_0 = 298.7820145446$. We know that $C_{T}(2) = \frac{\beta_0 + \beta_1}{2} = 156.1959237271 \Leftrightarrow \beta_1 = 13.60983291$. Therefore, the model for the tube `TA-19365` is $C_{T}(Q) = \frac{285.172181635}{Q} + 13.60983291$. With $Q = 7$, we obtain $C_{T}(7) = 54.348716001$ which has a square error of $0.000001422$ from the original cost. With our estimated, i.e. $C_{T}(Q) = (244.434490855/Q) + 54.3475236892$, we have $C_{T}(7) = 89.266736668$ which has a square error of $1219.351435073$. Thus, if $Q$ is small (say $Q < 25$), the model may underfit.


### 4.2 Unicity of the Model per Supplier
We verify with few tube assemblies, which are quoted by the supplier `S-0054`, if the same model used for the supplier `S-0066` applies.

```{r echo = FALSE, message = FALSE, warning = FALSE, results = 'asis'}
data <- train[train$fkTubeAssembly %in% c(130, 280, 1892, 5013), columns]
print(xtable(data, caption = "Tubes 130, 280, 1892, 5013"), comment = FALSE, type = 'latex')

plot(data$quantity, data$cost, col = "red", xlab = "Quantity", ylab = "Cost", xlim = range(data$quantity), ylim = range(data$cost))
points(data$quantity, data$cost, pch = 20, col = "red")
mtext("Cost of tubes in function of quantity (Supplier S-0054)", side = 3, line = -1, outer = TRUE)

curve((9.528552272/x) + 4.6265642127, add = TRUE, col = "blue")
curve((12.076918369/x) + 17.304407338, add = TRUE, col = "green")
curve((4.273242712/x) + 3.8531517947, add = TRUE, col = "black")
curve((4.092408874/x) + 3.043572609, add = TRUE, col = "chocolate")
```

The model used by the supplier `S-0054` seems to be the same as the one used by the supplier `S-0066`, but if we look carefully at the curves, we see that greater is the quantity, more accurate is the estimate. This means that the model follows the same behaviour as the model used by the supplier `S-0066`.

This doesn't seem to be the case for the tube `TA-00384` from the supplier `S-0064`. This supplier provides 6 levels of purchase where the highest quantity is $Q = 45$. We use the same model as before but this time, the model doesn't fit the points.

```{r echo = FALSE, message = FALSE, warning = FALSE, results = 'asis'}
data <- train[train$fkTubeAssembly == 384, columns]
print(xtable(data, caption = "TA-00384"), comment = FALSE, type = 'latex')

plot(data$quantity, data$minOrderQuantity, col = "red", xlab = "Quantity", ylab = "Cost", 
     xlim = range(data$quantity), ylim = range(data$cost), main = "Cost in function of quantity for tube 384 of supplier S-0064")
points(data$quantity, data$cost, pch = 20, col = "red")
#mtext("Cost in function of quantity for tube 384 of supplier S-0064", side = 3, line = -1, outer = TRUE)
curve((0.88469478 / (x-19)) + 4.401217429, add = TRUE, col = "blue")
```

Since the first quantity level is $Q_0 = 20$, we need to translate the model by subtracting $x$ by $Q_0 - 1 = 19$. This gives the following model $$C_{T}(Q) = \frac{\beta_0 - \beta_1}{Q - Q_0 - 1} + \beta_1$$ for all $Q \geq 1$. However, the model still underfits the data because the cost decreases much slower than the model used for our previous tests. Therefore, we can assume that a model with specific parameters is used to estimate the cost by one or many suppliers but not all. However, the general model is used by suppliers and may need to adjust the parameters to fit the data. 


# 5 Machine Learning Algorithms
We present the machine learning algorithms we will try after the analysis given by the four previous sections. We have seen that many decision trees can be built. Also, per section 4, many models which are simplifications of the general model can be used together to predict the cost. This leaves us two possible ensemble algorithms: Random Forest for regression or Extreme Boosting Trees for Regression (XGBoost).

We identify conditional paths which will tell us if decision trees will be useful or not. In the previous section, we have seen that the model can underfit if there are not enough quantity purchase levels and if the quantity is small. Otherwise, we can use the model to estimate the cost given a quantity and a tube assembly. Here are few points that identify some conditions.

* If there is only one quantity purchase level, we cannot estimate the cost in function of this quantity. We need al least another feature on which the cost depends.
* If there are many quantity purchase levels with $Q_0 > 1$, then we need to translate the model found at section 4. Thus, we have to check if $Q_0 = 1$ or if $Q_0 > 1$.
* Depending of the supplier, the model may be slightly different. Since we have 68 suppliers in the train and test sets, we can set 68 possible models. But we didn't show that a supplier use only one model. It may be possible that a supplier uses more than one model to calculate the cost function.
* A specific model can be used if the diameter of a tube is $d > 10$ and another one is used if $d \leq 10$ for example. The same may apply for the other features.

Only with those conditions, we can build many decision trees to help us estimating the cost. Given a feature $x$, we have to establish a probability $\mathbb{P}_{k,l}(x)$ for each edge $l$ of each level $k$ of the trees. 

For example, we can set at the second level 3 edges: $\mathbb{P}_{2,1}(x) = 0.65$ if $Q_0 = 1$, $\mathbb{P}_{2,2}(x) = 0.25$ if $1 < Q_0 \leq 20$ and $\mathbb{P}_{2,3}(x) = 0.1$ if $Q_0 > 20$. Another example would be to set at the fourth level 2 edges: $\mathbb{P}_{4,1}(x) = 0.7$ if the number of bends is less than 4 and $\mathbb{P}_{4,2}(x) = 0.3$ if the number of bends is greater or equal to 4.

### 5.1 Random Forest Algorithm
Suppose we want to create a tree for each supplier. This gives us a forest of 68 trees. The next level can be if the tubes have been bended (at least one bend) or not. The next level can be the material used to make a tube where each of them has a certain probability of usage. We can go deeper, but this gives us a good intuition to show that a random forest algorithm is a good choice to predict the cost.


```{r echo = FALSE, message = FALSE, warning = FALSE}
require(caret)
require(randomForest)
set.seed(415)

## Take 80% of the training data and create a partition where the label is the cost.
cv.partitions <- createDataPartition(y = train$cost, p = 0.2, list = FALSE)
cv.train.partition <- train[cv.partitions, ]

## Cross-Validation with 5 folds using the Regression as the type of random forest algorithm.
cv.fold <- 5
cv.train_control <- trainControl(method = "cv", number = cv.fold)
cv.model<-train(cv.train.partition$cost ~., data = cv.train.partition, method = "rf", trControl = cv.train_control)
print(cv.model)
print(cv.model$finalModel)

system.time({
    train.rf <- randomForest(log(cost + 1) ~., data = train, nTree = 20)
    print(train.rf)
    train.perdiction <- exp(predict(train.rf, newdata = test)) - 1
    col <- cbind(test$id, train.perdiction)
    colnames(col) <- c('id', 'cost')
    write.csv(col, "Submission.csv", row.names = FALSE)
})
varImpPlot(train.rf)
print(train.rf$importance)
```

Using only all features of the train set to predict the cost with the random forest algorithm using regression gives a score of 80% of variances explained. Here is a list of engineer tuning actions done to improve the prediction.

* Removing the day part of the quote date and keeping the date in the integer format `YYYYMM` improved the score with 82.4% of variances explained.
* Removing the bracket_pricing field improved the score with 84.56% of variances explained.
* Adding the total weight improved the score with 90.22% of variances explained.
* Adding the diameter of the tube improved the score with 91.96% of variances explained.
* Adding the wall thickness and length of the tube improved the score with 92.75% of variances explained.
* Adding the maximum and average quantity for each tube improved the score with 93.06% of variances explained.
* Adding the maximum weight of components for each tube improved the score with 93.17% of variances explained.
* Adding the component type for each tube improved the score with 93.31% of variances explained.
* Adding the typical bend radius for each tube improved the score with 93.40% of variances explained.
* Adding the number of quantities for each tube improved the score with 93.45% of variances explained.
* Adding the material ID for each tube improved the score with 93.59% of variances explained.
* Adding the number of specs for each tube improved the score with 93.61% of variances explained.
* Adding the number of components for each tube improved the score with 93.67% of variances explained.
* Adding the number of bends for each tube improved the score with 93.72% of variances explained.

Some features didn't improve the prediction.

* end_a_1x
* end_a_2x
* end_x_1x
* end_x_2x
* end_a
* end_x
* num_bracket
* other
* num_boss
* number of quantity levels per tube
* bracket_pricing


### 5.2 Extreme Gradient Boosted Regression Trees
Before the learning we will use the cross validation to evaluate our error rate (RMSE) to get a better prediction.

```{r echo = FALSE, message = FALSE, warning = FALSE}
require(xgboost)
require(methods)


# for (package in c('<package1>', '<package2>')) {
#     if (!require(package, character.only=T, quietly=T)) {
#         install.packages(package)
#         library(package, character.only=T)
#     }
# }

system.time({
    train_labels <- log(train$cost + 1)
    test$cost <- NULL
    train$cost <- NULL

    train_matrix <- xgb.DMatrix(as.matrix(train), label = train_labels)
    
    ## We use the linear regression because multiplying the models C(Q) by Q gives a linear model.
    ## subsample = 0.85 where XGBoost will collect randomly 85% of the training data to grow trees.
    ## Since subsample < 1 and colsample_bytree < 1, the algorithm chooses randomly a subset of rows or columns.
    param <- list(objective = "reg:linear", 
                  eta = 0.02, 
                  max_depth = 7, 
                  min_child_weight = 6, 
                  subsample = 0.85, 
                  colsample_bytree = 0.8)
    
    ## Get the feature real names.
    names <- dimnames(train)[[2]]
    
    ### Cross-Validation
    cv.nfolds <- 5
    nrounds <- 4000
    model.cv <- xgb.cv(data     = train_matrix, 
                       nfold    = cv.nfolds, 
                       param    = param, 
                       nrounds  = nrounds, 
                       verbose  = 0)
    model.cv$names <- as.integer(rownames(model.cv))
    
    print(ggplot(model.cv, aes(x = names, y = test.rmse.mean)) + 
        geom_line() + 
        ggtitle("Training RMSE using 5-fold CV") + 
        xlab("Number of trees") + 
        ylab("RMSE"))
         
    print(model.cv)
})
```

We now train the data and predict with the test matrix. Then we will see the important features in ascending order of importance.

```{r echo = FALSE, message = FALSE, warning = FALSE}
## Print a model with its statistical information.
model.print <- function(model, nround)
{
    cat(paste0("Prediction with ", nround, " trees...\n\n"))
    model_dump <- xgb.dump(model, with.stats = TRUE)
    print(head(model_dump, 15))
    cat("\n\n")
}

## Visualize the top 10 most important features.
model.visualize <- function(model, names)
{
    importance_matrix <- xgb.importance(names, model = model)
    xgb.plot.importance(importance_matrix)
}

## Training.
system.time({
    model = xgboost(param = param, 
                    train_matrix, 
                    nrounds = nrounds, 
                    verbose = 0)
    test_matrix <- xgb.DMatrix(as.matrix(test))
    prediction <- exp(predict(model, test_matrix)) - 1
    model.print(model, nrounds)
    
    predict_xg_data <- cbind(test$id, prediction)
    colnames(predict_xg_data) <- c('id', 'cost')
    write.csv(predict_xg_data, "XG1_Submission.csv", row.names = FALSE)
})
```


### 5.3 Conclusion
Which of the algorithms gives the best accuracy and why?


# 6 Data Visualization
The objective of this section is to visualize which features have the biggest importance on the cost of a given tube assembly. Also, we want to see how accurate is the model to predict the cost. We need to show that applying our model to the train set, our prediction is close to the real costs. By close, we mean with an accuracy greater than 95%.

The following graph shows the features in descending order of importance that influance the cost.
```{r echo = FALSE, message = FALSE, warning = FALSE}
model.visualize(model, names)
```
As we can see, the purchase quantity has a major importance on the cost of tubes.