## Exercise on Machine Learning 101 - Part 2
---
Instructions are given in <span style="color:blue">blue</span> color.

In this exercise, you will work with decision trees to (hopefully) further improve the classification results we obtained in the CRISP-DM lecture notes.

* <div style="color:blue">This week's material included an article by Pedro Domingos in which he gave a brief insight into the domain of Machine Learning. One of the paper's sections mentioned the use of so-called <b>model ensembles</b>. Go back to the article, find the three ensemble techniques Domingos describes, and quote his description of each one here.</div>

*Your solution goes here:*


Remember that email you sent to your social worker friend? The one where you told him that you might need more data to get better results out of your model? Well, he replied and stated that, unfortunately, there isn't any more data he could provide.
It seems there is nothing left to do but go back to the drawing board for the second iteration of your modeling phase.

* <div style="color:blue">The folder <code>/data</code>, next to this exercise, contains the file <code>Student_Survey.csv</code>. Read the data into a <code>DataFrame</code> and make sure to import any necessary libraries, too.</div>

In [None]:
# Libraries:


The following is needed for **reproducibility** (see [here](https://www.mikulskibartosz.name/how-to-set-the-global-random_state-in-scikit-learn/)):

In [None]:
np.random.seed(42)
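
As a rough illustration of why the seed matters, the sketch below fits the same randomized estimator twice, resetting the global NumPy seed before each fit. It assumes synthetic data from `make_classification` and a small `RandomForestClassifier` whose `random_state` is left at `None`. Neither is part of the exercise data, and `fit_and_predict` is a hypothetical helper introduced only for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: not the student survey).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def fit_and_predict(seed):
    """Seed the global NumPy RNG, then fit an estimator without random_state."""
    np.random.seed(seed)
    model = RandomForestClassifier(n_estimators=20)  # random_state stays None
    return model.fit(X, y).predict_proba(X)


first = fit_and_predict(42)
second = fit_and_predict(42)

# Both runs start from the same global seed, so the fitted forests agree.
print(np.array_equal(first, second))
```

Estimators that are given no explicit `random_state` fall back to NumPy's global random state, which is why seeding it once at the top of a notebook is usually enough.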

In [None]:
# Your solution goes here:


* <div style="color:blue">Remove the columns <code>G1</code>, <code>G2</code>, <code>G3</code>, and <code>Walc</code> from your <code>DataFrame</code>.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Replace all categorical values in your <code>DataFrame</code> with numerical data - using an appropriate method.</div>

In [None]:
# Your solution goes here:
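
One common way to turn categorical columns into numbers is one-hot encoding. The following is a minimal sketch on a made-up three-row `DataFrame`; the column names and values are illustrative assumptions, and other encodings may suit the real data better.

```python
import pandas as pd

# Toy data (assumption: hypothetical columns, not the real survey file).
toy = pd.DataFrame(
    {
        "age": [15, 16, 17],
        "school": ["GP", "MS", "GP"],
        "internet": ["yes", "no", "yes"],
    }
)

# get_dummies replaces every object/categorical column with 0/1 indicator columns
# and leaves numeric columns such as "age" untouched.
encoded = pd.get_dummies(toy, dtype=int)

print(encoded)
```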


One of the main issues present in our data is that the classes are highly **imbalanced**. Class imbalance is common in classification tasks and means that the number of samples differs, often substantially, from class to class. Imbalanced classes usually make it much harder to fit a model successfully. In our case, the imbalance is quite drastic.
* <div style="color:blue">Confirm, both visually and numerically, that the classes in your <code>DataFrame</code> are imbalanced.</div>

In [None]:
# Your solution goes here:
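
To make the idea of imbalance concrete, here is a small sketch on an invented label column. The counts are made up for illustration and do not reflect the survey data.

```python
import pandas as pd

# Invented labels (assumption): 100 samples spread unevenly over five classes.
labels = pd.Series([1] * 70 + [2] * 15 + [3] * 8 + [4] * 4 + [5] * 3, name="Dalc")

counts = labels.value_counts().sort_index()                # absolute frequencies
shares = labels.value_counts(normalize=True).sort_index()  # relative frequencies
ratio = counts.max() / counts.min()                        # largest vs. smallest class

print(counts)
print(shares)
print(f"Largest class is {ratio:.1f}x the size of the smallest one")

# For a visual check, a bar chart of the counts works well, e.g.
# counts.plot(kind="bar") when matplotlib is installed.
```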


As usual, depending on the data, the use-case, and your personal experience, there are many techniques you could try to implement in order to circumvent or mitigate imbalanced classes. [This website](https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/), for example, lists a number of reasonable suggestions to tackle this issue.

One approach that could potentially work for our data is to reduce the number of total classes. We know (from our **Business Understanding**) that underage students' alcohol consumption classified higher than `1` is already alarmingly high. Therefore, it would make sense to bundle classes `2`, `3`, `4`, and `5` into a single category `0` (representing increased alcohol consumption), while class `1` (low consumption) remains unchanged.

* <div style="color:blue">Replace all entries in the <code>Dalc</code> column that are larger than <b>1</b> with the new <code>0</code> class.</div>

In [None]:
# Your solution goes here:
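
Bundling classes like this boils down to a conditional replacement. One possible approach, sketched on a short invented `Series` rather than the real column, is `Series.where`:

```python
import pandas as pd

# Invented values (assumption) on the same 1-5 scale as the Dalc column.
dalc = pd.Series([1, 2, 1, 5, 3, 1], name="Dalc")

# where() keeps entries for which the condition holds and replaces all others,
# so every value other than 1 becomes 0.
binary = dalc.where(dalc == 1, 0)

print(binary.tolist())
```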


* <div style="color:blue">As before, give insight into the new balance of classes, both visually and numerically.</div>

**Note**: Your classes will still not be perfectly balanced, but at least we improved upon the previous situation.

In [None]:
# Your solution goes here:


* <div style="color:blue">For model training, implement a <code>DecisionTreeClassifier</code> for which parameters have been tuned using cross-validated grid search.</div>
* <div style="color:blue">The parameters we are interested in are:</div>

    * `max_depth` - using the values: [3, 4]
    * `min_samples_split` - using the values: [2, 3, 4, 5]
    * `min_samples_leaf` - using the values: [2, 3, 4, 5]

* <div style="color:blue">Explicitly set the <code>criterion</code> parameter of your classifier to <code>entropy</code>.</div>
* <div style="color:blue">Don't forget to eventually <b>fit</b> your model, using optimized parameters.</div>

**Note**: This time around, we are not asking you to create a separate test set for hold-out validation. As our data is very scarce, and validation is already performed using cross-validation, this should be the right call.

In [None]:
# Your solution goes here:
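
The general shape of a cross-validated grid search can be sketched as follows. The example assumes synthetic data from `make_classification` in place of the survey data and leaves all `GridSearchCV` options other than the parameter grid at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the student survey).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Every combination of these values is evaluated with cross-validation.
param_grid = {
    "max_depth": [3, 4],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [2, 3, 4, 5],
}

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
search = GridSearchCV(tree, param_grid, scoring="accuracy")

# fit() runs the search and, by default, refits the best combination on all data.
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated accuracy of the best combination
```

After fitting, `search.best_estimator_` holds the refitted tree, and `search.best_score_` is the mean cross-validated score, which is a different quantity from accuracy measured on the training data.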


* <div style="color:blue">Have a look at the documentation for <code>GridSearchCV</code>. What parameter influences the number of folds used for cross-validation? How many folds are there by default?</div>

*Your solution goes here:*


* <div style="color:blue">Print out the parameters for <code>max_depth</code>, <code>min_samples_split</code>, and <code>min_samples_leaf</code> for the best estimator found during grid search.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Get the accuracy score of your decision tree.</div>

**Note**: If you have done everything correctly, your accuracy should exceed 70% at this point.

In [None]:
# Your solution goes here:


* <div style="color:blue">Visualize your decision tree.</div>

In [None]:
# Your solution goes here:
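
Besides graphical plots, a fitted tree can also be rendered as plain text. The sketch below assumes synthetic data and made-up feature names (`f0` to `f3`), and uses `export_text` from `sklearn.tree`. A graphical alternative such as `plot_tree` additionally requires matplotlib.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with four features (assumption).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]  # hypothetical names

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)

# One line per node: split conditions on inner nodes, predicted class on leaves.
report = export_text(tree, feature_names=feature_names)
print(report)
```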


At the beginning of this exercise, you were asked to have a look into model ensembles. Frankly, [ensemble learning](https://en.wikipedia.org/wiki/Ensemble_learning) is a vast topic of its own, which we, unfortunately, don't have enough time to cover in class. Instead, this exercise is designed to at least make you aware of the concept.

In simple terms, ensemble learning means training many base learners (ensemble members) whose predictions are combined into a single estimator. As a result, this single estimator should, on average, perform better than any individual ensemble member.
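
The idea of combining base learners can be sketched by hand. The example below is a simplified illustration and not how any particular library implements ensembles. It assumes synthetic binary data, trains 25 shallow trees on bootstrap samples, and combines them by majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary data with labels 0/1 (assumption: not the student survey).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
members = []
for _ in range(25):  # odd number of members, so a binary vote cannot tie
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    members.append(tree.fit(X_train[idx], y_train[idx]))

# One row of predictions per member, then a majority vote per test sample.
votes = np.stack([m.predict(X_test) for m in members])
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)

member_acc = [np.mean(m.predict(X_test) == y_test) for m in members]
ensemble_acc = np.mean(ensemble_pred == y_test)

print(f"Mean accuracy of single members: {np.mean(member_acc):.3f}")
print(f"Accuracy of the majority vote:   {ensemble_acc:.3f}")
```

The vote often, though not always, beats the average member, because the errors of trees trained on different bootstrap samples partly cancel out.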

When it comes to decision trees, there are several ensemble methods you could choose from - the most fundamental one being the **Random Forest** estimator. The documentation for `scikit-learn`'s `RandomForestClassifier` can be found [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier).

* <div style="color:blue">Import the <code>ensemble</code> package from <code>sklearn</code>.</div>
* <div style="color:blue">Create a new instance of the <code>RandomForestClassifier</code>.</div>
* <div style="color:blue">Explicitly set the <code>criterion</code> parameter of your classifier to <code>entropy</code>.</div>
* <div style="color:blue">Explicitly set the <code>n_estimators</code> parameter of your classifier to <code>200</code> (the number of trees in the forest).</div>
* <div style="color:blue">For <code>max_depth</code>, <code>min_samples_split</code>, and <code>min_samples_leaf</code>, use the optimized parameters you found earlier.</div>
* <div style="color:blue">Train your model.</div>

In [None]:
# Your solution goes here:


* <div style="color:blue">Get the accuracy score of your random forest.</div>

**Note**: If you have done everything correctly, your accuracy should (slightly) exceed the accuracy of the previous decision tree model.

In [None]:
# Your solution goes here:
