# Automatic classification of seismic P-wave receiver functions using Random Forests (RF)

In this lab exercise, we are going to train Random Forests to automatically classify the seismic P-wave receiver functions that were also used in the previous two lab exercises, one on logistic regression and the other on decision trees.

Specifically, your task is to classify the P-wave receiver functions, which were computed from the recorded seismic data, into two categories: good and bad. The entire data set consists of 12,597 receiver functions (i.e., seismic traces), each of which was visually examined and manually labeled as either good or bad by one of Prof. Aibing Li's PhD students, Ying Zhang, in the Department of Earth and Atmospheric Sciences at the University of Houston. The good seismic traces are labeled (or, encoded) as 1, and the bad seismic traces are encoded as 0. <br>

After finishing this exercise, you can expect to <br>
1. be able to implement Random Forests using Scikit-Learn; <br>
2. better understand the regularization role played by the hyperparameter, **max_depth**; <br>
3. be able to diagnose overfitting vs. underfitting by constructing the error curves. <br>

<br>
Author: Jiajia Sun @ University of Houston, 02/28/2019

## 1. Review Random Forests
<font color = red>**Task 1:**</font> Please write one paragraph that summarizes RF (i.e., what is it, and how does it work?). <font color = red>**(8 points)**</font> <br>
<br>
**HINT**: You can include in this paragraph answers to the following questions: <br>
1. What do 'random' and 'forest' each mean? **(2 points)**<br>
2. How does RF work? Please refer to Slide 44. **(2 points)** <br>
3. What are the advantages of RF as compared to Decision Trees? Please feel free to search online for answers. If you do so, please include the webpage's URL in your acknowledgments. **(2 points)** <br>
4. Anything else you would like to include (e.g., something that you think would help your colleagues understand RF). **(2 points)**<br>


In [None]:
(Answer to Task 1:)


<font color = red>**Task 2:**</font> Write a short paragraph about the history of RF. Be sure to include in your writing the two terms **Random Patches** and **Random Subspaces**, and explain what each of them means. <font color = red>**(8 points)**</font> <br>
<br>
**HINT**: Please refer back to our lecture slides, in case you need a refresher. <br>

In [None]:
(Answer to Task 2:)


## 2. Import data
<font color = red>**Task 3:**</font> Import the amplitude data and the labels from *Traces_qc.mat*. Be sure to store the seismic amplitudes from all seismic stations in the variable **amp_data**, and the labels for all the seismic traces in the variable **label_data**. <font color = red>**(4 points)**</font> <br>
<br>
**HINT**: Please refer back to the last lab exercise to see how to import data. Please note that in the last lab exercise, I used the variable **flag_data** for all the labels, but here we are going to use a different variable name, **label_data**. <br>

In [None]:
# Answer to Task 3


<font color = red>**Task 4:**</font> Check the shape of the variable **amp_data**, and make sure that it is a 12,597 X 651 array. <font color = red>**(2 points)**</font>  Please use the terminology that we discussed in class on Slide 22, and explain what these two numbers, 12,597 and 651, each mean. <font color = red>**(2 points)**</font> <br>



In [None]:
# Answer to Task 4


In [None]:
print('The total number of bad seismic traces is:', len(np.where(label_data==0)[0]))
print('The total number of good seismic traces is:', len(np.nonzero(label_data)[0]))

## 3. Preprocessing data
The purpose of preprocessing is to get the data ready for the subsequent analysis or computations. The most common preprocessing step in machine learning is to [standardize features](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) by removing the mean and scaling to unit variance, just as you did in your lab exercises on Logistic Regression and Support Vector Machines. <br>
<br>
However, as mentioned in the last exercise on Decision Trees, Decision Trees do not require feature scaling or centering at all. RF is simply a large collection of Decision Trees. Therefore, RF does not need the standardization step, either. <br>
<br>
But, it is still important to randomly shuffle our data, for the reasons explained in the last exercise.
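
The claim that tree-based models do not need feature scaling can be checked with a small experiment. The following is only a sketch on synthetic data (the arrays `X_toy` and `y_toy` and the column scales are made up for illustration and have nothing to do with the receiver functions): it fits the same Decision Tree on raw and on standardized features and compares the predictions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)

# Synthetic stand-in data: 200 samples, 5 features on very different scales.
X_toy = rng.normal(size=(200, 5)) * np.array([1.0, 10.0, 100.0, 5.0, 1000.0])
y_toy = (X_toy[:, 1] + X_toy[:, 2] / 10.0 > 0).astype(int)

# Standardize each feature to zero mean and unit variance.
X_toy_scaled = StandardScaler().fit_transform(X_toy)

# Fit identical trees on the raw and on the standardized features.
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy_scaled, y_toy)

# Fraction of training samples on which the two trees agree.
agreement = np.mean(tree_raw.predict(X_toy) == tree_scaled.predict(X_toy_scaled))
print("Fraction of identical predictions:", agreement)
```

Standardization shifts and rescales each feature but does not change the ordering of its values, and a tree split depends only on that ordering. This is why the two trees are expected to partition the samples in the same way.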

In [None]:
np.random.seed(42)
all_data = np.append(amp_data,label_data,1) # put all the seismic traces and their labels into one matrix.

<font color = red>**Task 5:**</font> Randomly permute the data stored in the variable **all_data** using <font color=blue>**np.random.permutation**</font>, and store the permuted data in a new variable **all_data_permute**. <font color = red>**(5 points)**</font> <br>
<br>
**HINT**: If you forget how to do it, please refer back to your lab exercise on Decision Trees. Note the variable name in this notebook is different from last time.
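
As a reminder of how <font color=blue>**np.random.permutation**</font> behaves on a 2-D array, here is a small sketch on a toy matrix (the names `toy_amp`, `toy_label`, `toy_all` and `toy_permuted` are made up for illustration). It shuffles whole rows, so each label stays attached to its own trace.

```python
import numpy as np

np.random.seed(42)

# Toy stand-in for all_data: 5 "traces" with 3 amplitude samples each,
# plus one label column appended on the right.
toy_amp = np.arange(15).reshape(5, 3)
toy_label = np.array([[0], [1], [0], [0], [1]])
toy_all = np.append(toy_amp, toy_label, 1)

# For a multi-dimensional array, only the first axis (the rows) is shuffled.
toy_permuted = np.random.permutation(toy_all)
print(toy_permuted)
```

Each row of the printed matrix still appears exactly as it did in `toy_all`; only the order of the rows has changed.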

In [None]:
# Answer to Task 5


## 4. Split data into training and cross-validation sets
As in the last exercise on Decision Trees, we are going to use the first 10,000 seismic traces as our training data set, and the remaining 2,597 traces as our validation data set.

<font color = red>**Task 6:**</font> Create the training data set by assigning the first 10,000 instances and their corresponding labels in **all_data_permute** to new variables, **X_train** and **y_train**, respectively. And similarly, create the validation data set by assigning the remaining instances and their corresponding labels in **all_data_permute** to new variables, **X_validation** and **y_validation**, respectively. <font color = red>**(6 points)**</font> <br>

In [None]:
# Answer to Task 6


Note that, in Scikit-Learn, there is a convenient way of splitting the data by using the [train_test_split](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, which is widely used in practice. But for our lab exercises, to keep things consistent, and more importantly, to keep the comparison of the prediction accuracies from different ML algorithms fair, we manually split the whole data set into a training set and a validation set.
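
For reference, a minimal sketch of how **train_test_split** is typically called is given below. It uses synthetic data (`X_toy` and `y_toy` are made up for illustration), and the 80/20 split ratio and the `stratify` option are assumptions chosen for this example, not settings used in this lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Synthetic stand-in data: 100 samples with 4 features and binary labels.
X_toy = rng.normal(size=(100, 4))
y_toy = rng.randint(0, 2, size=100)

# Shuffle and split in one call; stratify keeps the class proportions
# roughly equal in the two subsets.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_val.shape)
```

Because this function shuffles the data internally, it would produce a different split from the manual one used in this notebook.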

## 5. Import and set up RF classifier
<font color = red>**Task 7:**</font> Import [**RandomForestClassifier**](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) from Scikit-Learn. Set up your RandomForestClassifier by setting **n_estimators = 100**, **max_depth = 10**, **random_state = 42**, and **class_weight = 'balanced_subsample'**, and assign this classifier to a new variable **rf_clf**. <font color = red>**(10 points)**</font> <br>
<br>
**HINT:** In case you forget how to do it, please refer to the last lecture slide, or the official Scikit-Learn documentation on [RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html). <br>
<br>
**NOTE:** We did not discuss the use of **class_weight** in class. It is used to deal with class imbalance, i.e., the situation where the number of samples in one class is much larger than the number of samples in the other class. Suppose that there are 999 samples in Class 1 and only 1 sample in Class 2. You can imagine that, because of this imbalance, machine learning algorithms will have trouble predicting Class 2. Our data set is also imbalanced, because the number of bad seismic traces is three times the number of good seismic traces. In this case, we can use **class_weight** to balance the decision trees.
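
To make the 999-versus-1 example concrete, the sketch below computes the weights that Scikit-Learn's `'balanced'` mode assigns, namely `n_samples / (n_classes * np.bincount(y))`. The label array `y_toy` is made up for illustration, with 0 standing for the majority class and 1 for the minority class. According to the Scikit-Learn documentation, `'balanced_subsample'` uses the same formula, but computes the weights from the bootstrap sample of each tree.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 999 samples of class 0 and a single sample of class 1.
y_toy = np.array([0] * 999 + [1])

# Weights assigned by the 'balanced' mode.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)

# The same weights computed by hand from the documented formula.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print("Class weights from Scikit-Learn:", weights)
print("Class weights computed by hand: ", manual)
```

The single minority sample receives a weight of 500, whereas each majority sample receives a weight of about 0.5, so that both classes contribute equally to the training of the trees.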

In [None]:
# Answer to Task 7


## 6. Train an RF model
<font color = red>**Task 8:**</font> Train an RF model using the **training** data set, <font color=blue>**X_train**</font> and <font color=blue>**y_train**</font>, and the classifier, <font color=blue>**rf_clf**</font>, that you set up above. <font color = red>**(10 points)**</font> <br>
<br>
**HINT**: If you forget how to do it, please refer back to our lecture slides, or the accompanying example notebook *RandomForest_example.ipynb*.

In [None]:
# Answer to Task 8


## 7. Evaluation
<font color = red>**Task 9a:**</font> Make predictions on the <font color=blue>**validation**</font> data set, and assign the predictions to a new variable, **y_pred**. <font color = red>**(5 points)**</font> <br>

In [None]:
# Answer to Task 9a


<font color = red>**Task 9b:**</font> Import **classification_report** from **sklearn.metrics**. Print out the classification report. <font color = red>**(5 points)**</font> <br>

**HINT:** Please refer to the last exercise for how to do it.
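
As a reminder of how to read a classification report, here is a small sketch on hand-made labels (`y_true_toy` and `y_pred_toy` are made up for illustration and are not related to the seismic data).

```python
from sklearn.metrics import classification_report

# Toy ground truth and predictions: 0 = bad trace, 1 = good trace.
y_true_toy = [0, 0, 0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 0, 1, 0, 1, 1, 0]

# The report lists precision, recall, F1-score and support for each class.
report = classification_report(y_true_toy, y_pred_toy, target_names=["bad (0)", "good (1)"])
print(report)
```

In this toy example, three traces are predicted as good and two of them are truly good, so the precision for the good class is 2/3. Likewise, two of the three truly good traces are recovered, so the recall for the good class is also 2/3.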

In [None]:
# Answer to Task 9b


The expected output should look like

<img src = "ClassificationReport.PNG">

In [None]:
# The following code is a modified version of the code on this webpage:
# http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
importances = rf_clf.feature_importances_
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(10):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
import matplotlib.pyplot as plt
plt.figure()
plt.title("Feature importances")
plt.bar(range(50), importances[indices][:50])
plt.xticks(range(50), indices[:50], rotation = 90)  # label only the 50 plotted features
plt.xlim([-1, 50])
plt.tight_layout()
plt.show()

The feature importance plot that you have obtained should look similar to the following one:
    
<img src = "FeatureImportance.png">

In [None]:
# The prediction error on training data
1 - rf_clf.score(X_train, y_train)

The expected output is:  &nbsp;&nbsp;&nbsp;&nbsp; 0.078200000000000047

In [None]:
# The prediction error on validation data
1 - rf_clf.score(X_validation,y_validation)

The expected output is:  &nbsp;&nbsp;&nbsp;&nbsp; 0.13708124759337692

## 8. Construct error curves
Similar to the previous lab exercise on Decision Trees, we are going to construct error curves by training a sequence of Random Forests.

<font color = red>**Task 10a:**</font> Write a few sentences to explain why constructing error curves is important for machine learning. <font color = red>**(5 points)**</font> <br>

In [None]:
(answer to Task 10a:)


In [None]:
train_errors = np.zeros(25)
validation_errors = np.zeros(25)

<font color = red>**Task 10b:**</font> We are going to train 25 Random Forests with max_depth ranging from 1 to 25. For each Random Forest, please save its prediction errors on both training and validation data sets to **train_errors** and **validation_errors**, respectively. To make our code look clean, we are going to use a for loop to achieve that. Your task is to finish the following for loop. <font color = red>**(20 points)**</font> <br>

In [None]:
for idepth in np.arange(1,26):
    print('Random Forest with max_depth = ', idepth)
    # step 1: set up your RandomForestClassifier. Hint: max_depth = idepth.
    rf_clf = 
    # step 2: perform training using training data
    rf_clf.fit
    # step 3: make predictions using validation data
    y_pred = 
    # step 4: save prediction errors to train_errors, and validation_errors
    train_errors[idepth-1] = 
    validation_errors[idepth-1]= 

In [None]:
print("The minimum validation error is : ", min(validation_errors))
print("The best prediction accuracy is:", 1 - min(validation_errors))

best_depth = np.argmin(validation_errors) + 1
print("The minimum validation error (i.e., the best prediction accuracy) occurs when max_depth = ", best_depth)

In [None]:
max_depth =  np.arange(1,26)
plt.plot(max_depth,train_errors,'-ro',label="training errors")
plt.plot(max_depth,validation_errors,'-bo',label="validation errors")
plt.plot(best_depth,validation_errors[best_depth-1],'gD',label="Best Depth")
#plt.plot([best_depth,best_depth],[0,validation_errors[best_depth-1]],'g-')
plt.title('error curves',fontsize=20)
plt.legend(loc="upper right", fontsize=16)
plt.xlabel("Max_depth", fontsize=20)
plt.ylabel("Prediction errors", fontsize=20, rotation=90)
plt.show()

Your error curve plot should look similar to the following one:
    
<img src = "errorcurves_25.png">

## 9. Applications of Random Forests to geoscience 

<font color = red>**Task 11:**</font> Do a literature search and look for at least <font color=blue>**two**</font> examples where Random Forests are used to solve some geoscience-related problems. Then, report the source of the information (e.g., URL, DOI, etc.), and summarize each example using a few sentences. <font color = red> **(10 points)**</font>

In [None]:
(answer to Task 11:)


## Acknowledgments
I would like to thank Ying Zhang for manually labeling all the seismic traces, and Prof. Aibing Li for making this data set available to the students in this class. Ms. Zhang also kindly explained the fundamentals of seismic P-wave receiver functions to me. In addition, I would like to acknowledge Simin Gao, a graduate student in the Department of Earth and Atmospheric Sciences at the University of Houston, for the for loop that was used to construct the error curves. <br>

<img src = "photo.png" width="400">

## Bonus:
<font color = red>**Task:**</font> Following what you did before, train 25 Extremely Randomized Trees (a.k.a., Extra-Trees) with max_depth ranging from 1 to 25. For each Extra-Trees, please save its prediction errors on both training and validation data sets to **train_errors** and **validation_errors**, respectively, and plot them up. <font color = red>**(10 points)**</font> <br>

**HINT:** To implement Extra-Trees, you will need to import **ExtraTreesClassifier** class instead of **RandomForestClassifier**.
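
The sketch below shows the import and a single fit on synthetic data generated with `make_classification`. The data, the train/validation split and the hyperparameter values are all made up for illustration; for the bonus task you would use the seismic data and loop over **max_depth** as before.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data: 300 samples with 20 features and binary labels.
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=42)

# ExtraTreesClassifier accepts the same basic hyperparameters as
# RandomForestClassifier (n_estimators, max_depth, random_state, ...).
et_clf = ExtraTreesClassifier(n_estimators=50, max_depth=5, random_state=42)
et_clf.fit(X_toy[:200], y_toy[:200])

# Prediction error on the held-out 100 samples.
val_error = 1 - et_clf.score(X_toy[200:], y_toy[200:])
print("Validation error:", val_error)
```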

The error curves that you have constructed based on Extra-Trees should look similar to the following:
    
<img src = "errorcurves_25_ExtraTrees.png">