# Kaggle Challenge
## Predict service faults on Australia's largest telecommunications network
### Yen-Ting Chen

This <a href="https://www.kaggle.com/c/telstra-recruiting-network" target="_blank">Kaggle Challenge</a> aims to predict network fault severities from past service logs for the Australian telecommunications company Telstra.

The training data from the past service logs is contained within five separate files:
1. **`train.csv`**: The unique id of each past logged event, its location, and the outcome fault severity (levels 0, 1 or 2). Location consists of 929 categories.
2. **`event_type.csv`**: Mapping of event id to "event types". Each id may have more than one type. Event type consists of 53 categories.
3. **`log_feature.csv`**: Mapping of event id to "log features" and their "volumes". Each id may have more than one feature. Log feature consists of 386 categories, while volumes range from 1 to 1310.
4. **`resource_type.csv`**: Mapping of event id to "resource type". Each id may have more than one resource type. Resource type consists of 10 categories.
5. **`severity_type.csv`**: Mapping of event id to "severity type". Each id maps to one severity type. Severity type consists of 5 categories.  

In [2]:
library(caret); library(tidyr); library(Hmisc); library(dplyr)

REvent_type <- read.csv("event_type.csv")
RLog_feature <- read.csv("log_feature.csv")
RResource_type <- read.csv("resource_type.csv")
RSeverity_type <- read.csv("severity_type.csv")
RTrain <- read.csv("train.csv")

In total, the five variables span 1,383 categories (929 + 53 + 386 + 10 + 5). It is not feasible to simply cast all of them as dummy variables for prediction modeling.  
  
Instead, the following approach was taken to reduce these 1300+ categories to only 90 variables.

For each of the variables **`event_type`**, **`log_feature`**, and **`location`**, the categories were grouped according to how often they co-occur with each outcome fault severity level (0, 1 or 2). For a given severity level, we count how often each category appears in events with that outcome, divide the range of these frequencies into a number of buckets, and place each category into the bucket matching its frequency. The top bucket for severity level 0 therefore contains the categories that most often occur in events with severity level 0, while the bottom bucket contains the categories that least often do. The same is done for all three severity levels. Each event id then has a count of at least 1 in one of the buckets for each severity level. If an event id maps to more than one category, it may have 1's in several buckets, or a count greater than 1 in a bucket when several of its categories fall in the same frequency range.  

These buckets group categories by how frequently they result in a given fault severity level. They capture the relationship between categories and severity levels at a coarser scale, thereby reducing the number of "categories". The bucket counts are used as the predictors for modeling.

Here are the steps to generate the frequency buckets for **`event_type`**:  
First, we must join the tables to match event id, severity level, and event types.

In [3]:
temp <- inner_join(RTrain, REvent_type, by="id")
temp <- select(temp, event_type, fault_severity)

Count up frequencies of each event type by their fault severity levels for all events.

In [4]:
eventFreq <- temp %>% group_by(event_type, fault_severity) %>% summarize(count=n())
rm(temp)

To have the severity levels as variables with frequencies as values, we must spread the table to wide format.

In [5]:
eventFreq <- spread(eventFreq, fault_severity, count)
colnames(eventFreq) <- c("event_type", "fault0", "fault1", "fault2")

Some combinations of event type and severity level never occur and show up as NA. Replace these NA values with a frequency count of 0.

In [6]:
eventFreq[is.na(eventFreq)] <- 0
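
As a cross-check for the count, spread, and NA-replacement steps above, base R's `table()` builds the same kind of category-by-severity count matrix in one call, with 0 for combinations that never occur. This is a minimal sketch on made-up data; the `toy` and `toyFreq` names are hypothetical and not part of the original analysis.

```r
# Toy stand-in for the joined train/event_type table (values are made up).
toy <- data.frame(
  event_type     = c("event_type 1", "event_type 1", "event_type 2",
                     "event_type 2", "event_type 2", "event_type 3"),
  fault_severity = c(0, 1, 0, 0, 2, 1)
)

# Cross-tabulate: one row per event type, one column per severity level.
# Combinations that never occur are filled with 0 rather than NA.
counts <- table(toy$event_type, toy$fault_severity)

toyFreq <- data.frame(
  event_type = rownames(counts),
  fault0 = as.vector(counts[, "0"]),
  fault1 = as.vector(counts[, "1"]),
  fault2 = as.vector(counts[, "2"])
)
print(toyFreq)
```

Comparing a few rows of such a table against `eventFreq` is a quick way to confirm that the wide-format counts are what was intended.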

There are 53 categories in the **`event_type`** variable. We will divide the range of frequencies for these categories into 5 buckets for each severity level. For **`log_feature`** and **`location`**, there are more categories in those variables, and more buckets will be used to differentiate categories better.

In [7]:
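# ungroup() first: the table is still grouped by event_type after summarize(),
# and the buckets must be cut over the whole column, not within each group.
eventFreq <- eventFreq %>% ungroup() %>%
  mutate(buckets0 = cut2(fault0, g=5), buckets1 = cut2(fault1, g=5),
         buckets2 = cut2(fault2, g=5))
## Bucket names: letter = severity level (a = 0, b = 1, c = 2); digit = bucket (1 = lowest counts, 5 = highest)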
levels(eventFreq$buckets0) <- c("a1", "a2", "a3", "a4", "a5")
levels(eventFreq$buckets1) <- c("b1", "b2", "b3", "b4", "b5")
levels(eventFreq$buckets2) <- c("c1", "c2", "c3", "c4", "c5")

The bucket name `a1` refers to the bucket containing the event type categories least often associated with severity level 0, while `a5` contains the categories most frequently occurring with level 0. Bucket names starting with `b` are for level 1, and those starting with `c` are for level 2.  
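
`cut2(x, g=5)` forms quantile groups, so each bucket holds roughly the same number of categories. When many categories share the same count, fewer than five groups may come out, and assigning five level names would then fail. The sketch below illustrates the idea with base R `quantile()` and `cut()` on made-up counts; it is not the `Hmisc` implementation, and the `counts`, `breaks`, and `wanted` names are hypothetical.

```r
# Made-up per-category counts for one severity level.
counts <- c(0, 0, 1, 3, 5, 8, 13, 40, 120, 500)

# Quantile break points for 5 groups; unique() drops duplicated breaks,
# which appear when many categories share the same count.
breaks <- unique(quantile(counts, probs = seq(0, 1, length.out = 6)))
buckets <- cut(counts, breaks = breaks, include.lowest = TRUE)

# Rename levels only after checking how many groups were actually formed.
wanted <- c("a1", "a2", "a3", "a4", "a5")
if (nlevels(buckets) == length(wanted)) {
  levels(buckets) <- wanted
}
print(table(buckets))
```

A similar `nlevels()` check before each `levels(...) <-` assignment guards against silently mislabeled or failed bucket names.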

The `eventFreq` table now maps each event type to the bucket it belongs to for each severity level. We need this information in wide format, with each bucket as its own variable holding 1 (belongs) or 0 (does not belong). Again, most combinations do not occur, since each category belongs to exactly one bucket per level. These will have NA values that must be replaced with 0.

In [8]:
eventFreq$fault0 <- NULL; eventFreq$fault1 <- NULL; eventFreq$fault2 <- NULL
eventFreq$freq <- 1
eventFreq <- spread(eventFreq, buckets0, freq)
eventFreq$freq <- 1
eventFreq <- spread(eventFreq, buckets1, freq)
eventFreq$freq <- 1
eventFreq <- spread(eventFreq, buckets2, freq)
eventFreq[is.na(eventFreq)] <- 0
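
The repeated `spread()` calls above amount to building one 0/1 indicator column per bucket level. A minimal base R sketch of that idea on made-up bucket assignments follows; the `toy`, `indicators`, and `toyWide` names are hypothetical.

```r
# Made-up bucket assignments for four event types.
toy <- data.frame(
  event_type = c("event_type 1", "event_type 2", "event_type 3", "event_type 4"),
  buckets0   = factor(c("a1", "a3", "a3", "a2"), levels = c("a1", "a2", "a3"))
)

# One 0/1 column per bucket level: 1 where the row belongs to that bucket.
indicators <- sapply(levels(toy$buckets0),
                     function(lv) as.integer(toy$buckets0 == lv))

toyWide <- cbind(toy["event_type"], indicators)
print(toyWide)
```

Because each category falls in exactly one bucket per severity level, every row of `indicators` sums to 1, which is a cheap sanity check on the wide table.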

A brief glimpse into the `eventFreq` table in wide format:

In [9]:
head(eventFreq, n=3)

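|   | event_type | a1 | a2 | a3 | a4 | a5 | b1 | b2 | b3 | b4 | b5 | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | event_type 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | event_type 10 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | event_type 11 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |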


The `eventFreq` table must then be joined with the original event type table, so that the bucketing results are attached to each event id and its event types.

In [10]:
event <- left_join(REvent_type, eventFreq, by="event_type")
event[is.na(event)] <- 0

The result is then grouped by event id so that each id has a single row. For ids with multiple event types, the bucket indicators are summed, giving the number of that id's event types that fall in each bucket.

In [12]:
eventFinal <- event %>% group_by(id) %>% summarize("a1"=sum(a1), "a2"=sum(a2), "a3"=sum(a3),
                                                   "a4"=sum(a4), "a5"=sum(a5),
                                                   "b1"=sum(b1), "b2"=sum(b2), "b3"=sum(b3),
                                                   "b4"=sum(b4), "b5"=sum(b5),
                                                   "c1"=sum(c1), "c2"=sum(c2), "c3"=sum(c3),
                                                   "c4"=sum(c4), "c5"=sum(c5))
rm(eventFreq, event)
head(eventFinal)

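|   | id | a1 | a2 | a3 | a4 | a5 | b1 | b2 | b3 | b4 | b5 | c1 | c2 | c3 | c4 | c5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 2 | 0 | 2 | 0 | 0 | 0 |
| 6 | 6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |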


The 53 categories in **`event_type`** are now reduced to 15 variables.  
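
The per-id summation can also be written without listing every bucket column by name. As a hedged alternative, base R's `rowsum()` sums all selected columns within each group; the sketch below uses made-up data and hypothetical `toy`, `sums`, and `toyFinal` names.

```r
# Made-up long table: one row per (id, event type) with bucket indicators.
toy <- data.frame(
  id = c(1, 1, 2, 3, 3),
  a1 = c(0, 0, 1, 0, 0),
  a2 = c(1, 1, 0, 0, 1),
  a3 = c(0, 0, 0, 1, 0)
)

# Sum every bucket column within each id; the result has one row per id.
sums <- rowsum(toy[, c("a1", "a2", "a3")], group = toy$id)
toyFinal <- data.frame(id = as.numeric(rownames(sums)), sums, row.names = NULL)
print(toyFinal)
```

Here id 1 has two event types in bucket `a2`, so its `a2` count is 2, mirroring the counts greater than 1 seen in `eventFinal`.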

Similar processes are applied to **`log_feature`** and **`location`**, with more buckets since they have more categories. The full analysis code can be viewed on my <a href="https://github.com/janie128/Random_Projects/tree/master/KaggleTelstra" target="_blank">GitHub</a>.  

The variables **`resource_type`** and **`severity_type`** have only a few categories and are therefore cast directly as dummy variables. The process is shown here for **`resource_type`**.

In [None]:
resourceFinal <- RResource_type
resourceFinal$freq <- 1
resourceFinal <- spread(resourceFinal, resource_type, freq)
resourceFinal[is.na(resourceFinal)] <- 0

With the variables extracted from the 1300+ categories, we can now join the main **`train`** data table, which holds the outcome fault severities, with the extracted variables by event id (or by location for the location buckets).  

For prediction modeling, the outcome fault severities are converted to factors (3 levels). In addition, to match the final submission file format for the challenge, the fault severity levels with values of 0, 1 or 2 are relabeled as "predict_0", "predict_1", or "predict_2".

In [None]:
total <- left_join(RTrain, eventFinal, by="id")
total <- left_join(total, featureFinal, by="id")
total <- left_join(total, locationFinal, by="location")
total$location <- NULL
total <- left_join(total, resourceFinal, by="id")
total <- left_join(total, severityFinal, by="id")

total$id <- NULL
total$fault_severity <- as.factor(total$fault_severity)
for (n in 2:dim(total)[2]){
  total[,n] <- as.integer(total[,n])
}

levels(total$fault_severity) <- c("predict_0", "predict_1", "predict_2")
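
Assigning to `levels()` relabels positionally: the existing levels are sorted ("0", "1", "2"), and the new labels replace them in that order. A small sketch on made-up severity values (the `severity` name is hypothetical):

```r
# Made-up severity values standing in for total$fault_severity.
severity <- as.factor(c(0, 2, 1, 0, 1))

# levels() are sorted ("0", "1", "2"), so the new labels line up positionally.
levels(severity) <- c("predict_0", "predict_1", "predict_2")
print(severity)
```

This relies on all three severity levels being present in the data; if one were missing, the positional relabeling would fail or mislabel, so checking `levels()` beforehand is a reasonable precaution.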

Now that we have a manageable training data set with 90 predictors, we can begin prediction modeling. An Extreme Gradient Boosting model (xgboost) and a Random Forest model (rf) were trained, with the latter giving better performance. The `mtry` parameter of the Random Forest model was tuned. The final optimized procedure is shown below:

In [None]:
set.seed(123)
inTrain <- createDataPartition(total$fault_severity, p=0.7, list=FALSE)
training <- total[inTrain,]
testing <- total[-inTrain,]

set.seed(123)
fitCtrl <- trainControl(verboseIter = TRUE)
modelRF <- train(fault_severity ~ ., data=training, method="rf", prox=TRUE, trControl=fitCtrl,
                 tuneGrid = expand.grid(mtry = c(9)))
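
`createDataPartition()` samples within each outcome class, so the 70/30 split keeps the class proportions of `fault_severity`. The following base R sketch illustrates that stratified-sampling idea on a made-up outcome vector; it is not caret's implementation, and the class counts are invented.

```r
set.seed(123)

# Made-up outcome with unbalanced classes.
outcome <- factor(rep(c("predict_0", "predict_1", "predict_2"),
                      times = c(60, 30, 10)))

# Sample 70% of the row indices within each class separately.
inTrain <- unlist(lapply(split(seq_along(outcome), outcome), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

print(table(outcome[inTrain]))   # class counts in the training split
print(table(outcome[-inTrain]))  # class counts in the held-out split
```

Keeping the rare severity levels represented in both splits matters here because log loss penalizes poorly calibrated probabilities on every class.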

To generate the final submission predictions, the `test` file must also be joined with the extracted features by the event id or location. The above prediction model is used to predict the probabilities of the three possible fault severity outcomes.

In [None]:
test <- read.csv("test.csv")
totalTest <- left_join(test, eventFinal, by="id")
totalTest <- left_join(totalTest, featureFinal, by="id")
totalTest <- left_join(totalTest, locationFinal, by="location")
totalTest$location <- NULL
totalTest <- left_join(totalTest, resourceFinal, by="id")
totalTest <- left_join(totalTest, severityFinal, by="id")
totalTest[is.na(totalTest)] <- 0
predictions <- predict(modelRF, totalTest, type="prob")

submission <- cbind(id=totalTest$id,predictions)
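
Before uploading, it can help to confirm that each row of probabilities sums to 1 and to write the table without row names. The sketch below uses made-up probabilities and ids; the output path is a temporary placeholder, and the column layout (`id`, `predict_0`, `predict_1`, `predict_2`) is assumed from the relabeling described earlier.

```r
# Made-up probabilities standing in for predict(modelRF, ..., type = "prob").
predictions <- data.frame(
  predict_0 = c(0.7, 0.2),
  predict_1 = c(0.2, 0.5),
  predict_2 = c(0.1, 0.3)
)
submission <- cbind(id = c(101, 102), predictions)

# Each row of class probabilities should sum to 1.
stopifnot(all(abs(rowSums(submission[, -1]) - 1) < 1e-8))

# Write without row names; the file path here is a temporary placeholder.
outFile <- tempfile(fileext = ".csv")
write.csv(submission, outFile, row.names = FALSE)
print(readLines(outFile, n = 1))
```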

The metric used to evaluate prediction results in this challenge is the <a href="https://www.kaggle.com/c/telstra-recruiting-network/details/evaluation" target="_blank">multi-class log loss</a>, where lower is better. With this analysis, we achieved a score of 0.64; the top score was 0.40, and the benchmark score was around 26. (For reference, assigning equal probability to all three outcomes would give a log loss of ln 3 ≈ 1.10.) Through feature engineering, 90 features were extracted from the 1300+ categories to enable efficient prediction modeling.
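
For local evaluation on the held-out `testing` split, the multi-class log loss can be computed as the mean negative log of the probability assigned to the true class. The function below is a minimal sketch under that definition; the `multiLogLoss` name, the clipping threshold `eps`, and the example data are assumptions, and Kaggle's exact implementation may differ in detail.

```r
# Multi-class log loss: mean negative log of the probability assigned
# to the true class. Probabilities are clipped to avoid log(0).
multiLogLoss <- function(actual, probs, eps = 1e-15) {
  probs <- pmin(pmax(as.matrix(probs), eps), 1 - eps)
  idx <- cbind(seq_along(actual), match(as.character(actual), colnames(probs)))
  -mean(log(probs[idx]))
}

# Made-up example with three events.
actual <- c("predict_0", "predict_2", "predict_1")
probs <- data.frame(
  predict_0 = c(0.8, 0.1, 0.3),
  predict_1 = c(0.1, 0.2, 0.6),
  predict_2 = c(0.1, 0.7, 0.1)
)
print(multiLogLoss(actual, probs))
```

Applied to `predict(modelRF, testing, type="prob")` and `testing$fault_severity`, a function like this gives an estimate of the leaderboard score before submitting.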

***

#### Links
Kaggle challenge description: <a href="https://www.kaggle.com/c/telstra-recruiting-network" target="_blank">https://www.kaggle.com/c/telstra-recruiting-network</a>  
Kaggle challenge dataset: <a href="https://www.kaggle.com/c/telstra-recruiting-network/data" target="_blank">https://www.kaggle.com/c/telstra-recruiting-network/data</a>  
Full code files on my <a href="https://github.com/janie128/Random_Projects/tree/master/KaggleTelstra" target="_blank">GitHub</a>