# DS 3000 HW 10

Due: Friday Aug 9th @ 11:59 PM EST

### Submission Instructions
Submit this `ipynb` file to Gradescope (this can also be done via the assignment on Canvas). To ensure that your submitted file represents your latest code, run a fresh `Kernel > Restart & Run All` just before uploading it to Gradescope.

### Tips for success
- Start early
- Make use of Piazza
- Make use of office hours
- Remember to use cells and headings to make the notebook easy to read (if a grader cannot find the answer to a problem, you will receive no points for it)
- Under no circumstances may one student view or share their ungraded homework or quiz with another student [(see also)](http://www.northeastern.edu/osccr/academic-integrity), though you are welcome to **talk about** (not show each other) the problems.

In [1]:
# below are all the modules you will need on this homework
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import seaborn as sns
from sklearn import tree
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import StratifiedKFold, KFold
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import r2_score
import pylab as py
import scipy.stats as stats

## Part 1: Decision Tree (35 total points)

For this problem you will again use the `df_owl_2018.csv` file, found in your Homework Module on Canvas. This data set contains statistics from the 2018 Overwatch League (cleaned from [this website](https://overwatchleague.com/en-us/statslab?statslab=heroes)). Overwatch is a video game where two teams of 6 players compete against each other. On each team, a player may assume one of three roles:
- Damage: whose job is to attack the other team
- Support: whose job is to heal their own team
- Tank: whose job is to absorb the damage of the other team

However, while those are the general jobs of each role, occasionally a player in one role behaves more like another. In this part, we will see if the numeric statistics from a game of Overwatch can be used to accurately predict the role of a player using a Decision Tree.

In [4]:
df_owl = pd.read_csv('df_owl_2018.csv')
df_owl.head()

| Unnamed: 0 | start_time | match_id | stage | map_type | map_name | player | team | hero | role | Ability Damage Done | ... | Ultimates Used | Unscoped Accuracy | Unscoped Hits | Unscoped Shots | Venom Mine Kills | Weapon Accuracy | Weapon Kills | Whole Hog Efficiency | Whole Hog Kills | of Rockets Fired |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-11 00:12:00 | 10223 | Overwatch League - Stage 1 | PAYLOAD | Dorado | Agilities | Los Angeles Valiant | Genji | Damage | 0.0 | ... | 8 | 0.0 | 0 | 0 | 0 | 0.273585 | 0 | 0.0 | 0 | 0.0 |
| 1 | 2018-01-11 00:12:00 | 10223 | Overwatch League - Stage 1 | PAYLOAD | Dorado | Danteh | San Francisco Shock | Genji | Damage | 0.0 | ... | 1 | 0.0 | 0 | 0 | 0 | 0.166667 | 0 | 0.0 | 0 | 0.0 |
| 2 | 2018-01-11 00:12:00 | 10223 | Overwatch League - Stage 1 | PAYLOAD | Dorado | Danteh | San Francisco Shock | Junkrat | Damage | 0.0 | ... | 3 | 0.0 | 0 | 0 | 0 | 0.1375 | 0 | 0.0 | 0 | 0.0 |
| 3 | 2018-01-11 00:12:00 | 10223 | Overwatch League - Stage 1 | PAYLOAD | Dorado | Danteh | San Francisco Shock | Tracer | Damage | 0.0 | ... | 3 | 0.0 | 0 | 0 | 0 | 0.327001 | 0 | 0.0 | 0 | 0.0 |
| 4 | 2018-01-11 00:12:00 | 10223 | Overwatch League - Stage 1 | PAYLOAD | Dorado | Envy | Los Angeles Valiant | D.Va | Tank | 0.0 | ... | 23 | 0.0 | 0 | 0 | 0 | 0.314785 | 0 | 0.0 | 0 | 0.0 |


### Part 1.1: Build a Decision Tree (15 points)

Create a Decision Tree to predict the `role` of an Overwatch player using all the numeric statistics as x features:
- Use `max_depth = 3`
- The code for creating the `x_feat_list` is:

```python
x_feat_list = list(df_owl.loc[:,'Ability Damage Done':'of Rockets Fired'].columns)
```
- Plot the tree and make sure you can easily read the nodes of the resulting Decision Tree (you will want to use `plt.gcf().set_size_inches()`)
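
As a rough illustration of the fit-and-plot workflow (not a solution), the sketch below trains a depth-3 tree on synthetic data and enlarges the figure so the nodes are legible. The data from `make_classification`, the names `demo_feat_list` and `demo_tree_clf`, and the 20 x 10 inch figure size are all assumptions made for this example; they do not come from the assignment data.

```python
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import make_classification

# synthetic stand-in data (an assumption): 300 samples, 6 numeric features, 3 classes
x, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)
demo_feat_list = [f'stat_{i}' for i in range(x.shape[1])]  # hypothetical feature names

# a shallow tree (max_depth=3) has at most 8 leaves, which keeps the plot readable
demo_tree_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
demo_tree_clf.fit(x, y)

# draw the tree, then enlarge the current figure so the node text is legible
plt.figure()
tree.plot_tree(demo_tree_clf, feature_names=demo_feat_list,
               class_names=['class_0', 'class_1', 'class_2'], filled=True)
plt.gcf().set_size_inches(20, 10)
```

If the node text is still hard to read, increasing the figure size further is usually the simplest fix.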


### Part 1.2: Predict Prof. Gerber's Role (5 points)

Thanks to Professor Gerber, who created this assignment.

Professor Gerber used to play a lot of Overwatch, and he would usually play only one role. Download the `df_prof_gerber_ow.csv` file, which contains the average performance of Prof. Gerber when he played his favorite character in the game$^*$.
- Convert it to an array using `np.array`
- Predict the role Prof. Gerber most often played with the `dec_tree_clf.predict` function

$^*$ since Prof. Gerber only had access to some of the statistics, he made guesses for some of the values, but it should be a pretty good estimation.
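
The convert-then-predict step can be sketched on toy data as follows. Everything here is hypothetical (the feature names `feat_a` and `feat_b`, the labels, and the classifier `demo_clf`); the point is only that `predict` expects a 2D array with one row per sample and with columns in the same order as the features used for fitting.

```python
import numpy as np
import pandas as pd
from sklearn import tree

# hypothetical training data: two numeric features and two made-up labels
x_train = np.array([[0.0, 1.0], [0.2, 0.9], [1.0, 0.1], [0.9, 0.0]])
y_train = np.array(['low', 'low', 'high', 'high'])
demo_clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
demo_clf.fit(x_train, y_train)

# hypothetical one-row table of averages for a single new player
df_new = pd.DataFrame({'feat_a': [0.95], 'feat_b': [0.05]})

# np.array keeps the 2D shape (1 row, 2 columns) that predict expects
x_new = np.array(df_new)
pred = demo_clf.predict(x_new)
print(x_new.shape, pred)
```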

### Part 1.3: Cross Validate and Compute Accuracy (15 points)

Can we trust this prediction? Perform 10-fold cross validation of the Decision Tree using a Stratified K Fold, then create a confusion matrix of the resulting predictions vs. the true roles. Calculate the overall accuracy and discuss **in a markdown cell**, in 2-3 sentences, how the decision tree is performing.
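
One possible shape for the cross-validation loop, shown on synthetic data rather than the Overwatch data, is sketched below. The shuffling, the `random_state` values, and the variable names are assumptions for illustration. Each sample is predicted exactly once, by a tree that never saw it during training.

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import StratifiedKFold

# synthetic stand-in data (an assumption): 300 samples, 6 features, 3 classes
x, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# stratification keeps the class proportions roughly equal in every fold
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
y_pred = np.empty_like(y)

for train_idx, test_idx in skfold.split(x, y):
    # fit a fresh tree on the training folds only
    clf = tree.DecisionTreeClassifier(max_depth=3, random_state=0)
    clf.fit(x[train_idx], y[train_idx])
    # store predictions for the held-out fold
    y_pred[test_idx] = clf.predict(x[test_idx])

# rows are true classes, columns are predicted classes
conf_mat = confusion_matrix(y_true=y, y_pred=y_pred)
acc = accuracy_score(y_true=y, y_pred=y_pred)
print(conf_mat)
print(f'cross-validated accuracy: {acc:.3f}')
```

The overall accuracy equals the sum of the confusion matrix diagonal divided by the total number of samples.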

## Part 2: Random Forest (50 total points)

We would like to assess the importance of each x-feature in predicting the role of an Overwatch player, as well as avoid complaints that our single decision tree may be overfitting. To accomplish these two tasks, we will build a Random Forest using the same data as in Part 1.

### Part 2.1: Build the Random Forest (20 points)

Build a Random Forest Classifier which classifies the `role` of an Overwatch player using all of the numerical statistics from the data. Use `max_depth = 3` and 10-fold cross validation.

**Note:** do *not* set `n_estimators` above 1000. 1000 will take a little while to run (and may be worth it), but you may also just use the default of `n_estimators = 100` if you wish (though expect unstable results).
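
A minimal sketch of a cross-validated random forest follows, again on synthetic data. It uses `cross_val_predict` as a shortcut in place of an explicit fold loop; that swap, the data, and the variable names are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# synthetic stand-in data (an assumption): 300 samples, 6 features, 3 classes
x, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# 100 shallow trees; each tree is limited to depth 3
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# cross_val_predict fits a clone of rf_clf on each training split and
# returns the held-out prediction for every sample
y_pred = cross_val_predict(rf_clf, x, y, cv=skfold)
rf_acc = accuracy_score(y_true=y, y_pred=y_pred)
print(f'cross-validated accuracy: {rf_acc:.3f}')
```

Note that `cross_val_predict` leaves `rf_clf` itself unfitted, so a separate call to `rf_clf.fit` is needed before reading attributes such as `feature_importances_`.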

### Part 2.2: Get the Confusion Matrix and Accuracy (10 points)

Create a confusion matrix of the resulting predictions vs. the true roles. Calculate the overall accuracy and discuss **in a markdown cell** with 1-2 sentences what you can say about how the random forest performs compared with the single decision tree. You *should* see at least a slight decrease in performance; why does that make sense? Is it a good/bad thing?

To get the labels for the roles, you'll want to define:

```python
y_feat_list = np.array(['Damage', 'Support', 'Tank'])
```
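
As a small sketch of how these labels can be attached to a confusion matrix plot, the example below uses six made-up true and predicted roles (not results from the data). Passing `labels` to `confusion_matrix` fixes the row and column order so that it matches `display_labels`.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_feat_list = np.array(['Damage', 'Support', 'Tank'])

# made-up true and predicted roles, for illustration only
y_true = np.array(['Damage', 'Damage', 'Support', 'Tank', 'Tank', 'Support'])
y_pred = np.array(['Damage', 'Tank', 'Support', 'Tank', 'Tank', 'Damage'])

# rows are true roles, columns are predicted roles, both ordered as in y_feat_list
conf_mat = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=y_feat_list)

# draw the matrix with the role names on both axes
conf_mat_disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat,
                                       display_labels=y_feat_list)
conf_mat_disp.plot()
plt.gcf().set_size_inches(6, 6)
```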

### Part 2.3: Feature Importance (10 points)

Create a bar plot (you can use the function from the lecture notes) showing the 10 features that are most useful for classification. Qualitatively describe whether these most important features are meaningful. In other words:
- If the classifier performs well, we care about which features helped it work
- If the classifier doesn't perform well, we don't care which features helped it "work"
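
If the lecture-note function is not at hand, a top-10 importance plot can be sketched roughly as below. This is an illustrative alternative, not the lecture function: the synthetic data, the names `demo_feat_list`, `top_idx`, and `top_feats`, and the plot styling are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (an assumption): 15 features, only 5 of them informative
x, y = make_classification(n_samples=300, n_features=15, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)
demo_feat_list = [f'stat_{i}' for i in range(x.shape[1])]  # hypothetical names

rf_clf = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0)
rf_clf.fit(x, y)

# impurity-based importances, normalized to sum to 1 across features
importances = rf_clf.feature_importances_

# indices of the 10 largest importances, most important first
top_idx = np.argsort(importances)[::-1][:10]
top_feats = [demo_feat_list[i] for i in top_idx]

# reverse the order so the most important feature is drawn at the top
plt.figure()
plt.barh(top_feats[::-1], importances[top_idx][::-1])
plt.xlabel('Feature importance (mean decrease in impurity)')
plt.gcf().set_size_inches(6, 5)
```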

### Part 2.4: Feature Importance (10 points)

Based on the result in Part 2.3, fit another tree model using the top 5 features from the random forest, and plot the resulting tree.

## Part 3: Summary of Models (15 points)

In your own words, explain the differences between multiple linear regression, decision trees, and random forests. For each model, please list:
- How the model works
- When the model is applicable
- Any assumptions the model requires
- Pros and cons of the model