# Skills challenge \#10
Below is a series of questions. Use the loaded data to answer them. You will almost certainly need to import more packages (`pandas`, `numpy`, etc.) to complete them. You are welcome to use any source except your classmates, so Google away!

You will be graded on both the **correctness** and **cleanliness** of your work, so don't submit poorly written code or your grade will reflect that. Use Markdown cells to describe what you have done. If you get stuck, move on to another part; most questions don't rely on the answers to earlier questions.

### Imports

In [116]:
import pandas as pd

### Data loading

In [117]:
df = pd.read_csv('../data/free_throws.csv')

In [118]:
df.head()

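| Unnamed: 0 | period | player | playoffs | shot_made | home_team | visit_team | home_score | visit_score | home_final_score | visit_final_score | minutes | season_start | shot_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Andrew Bynum | 0 | 1 | LAL | PHX | 1 | 0 | 114 | 106 | 47.0 | 2006 | 1 |
| 1 | 1 | Andrew Bynum | 0 | 1 | LAL | PHX | 2 | 0 | 114 | 106 | 47.0 | 2006 | 2 |
| 2 | 1 | Andrew Bynum | 0 | 1 | LAL | PHX | 12 | 18 | 114 | 106 | 29.733333 | 2006 | 1 |
| 3 | 1 | Andrew Bynum | 0 | 0 | LAL | PHX | 12 | 18 | 114 | 106 | 29.733333 | 2006 | 2 |
| 4 | 1 | Shawn Marion | 0 | 1 | LAL | PHX | 12 | 21 | 114 | 106 | 29.2 | 2006 | 1 |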


### Data description

This data covers all the free throws taken in the NBA between 2006 and 2016. The columns that we are interested in are:
- **game_id**: The unique ID for each game. The number itself doesn't mean anything, but if two rows have the same game_id, then that means they occurred in the same game.
- **period**: Which period the free throw occurred in (there are four periods in basketball).
- **player**: The player's name
- **playoffs**: 0 = not in playoffs, 1 = in playoffs
- **shot_made**: 0 = shot not made, 1 = shot made
- **home_team**: Abbreviated name of the home team (the team whose city the game was played in)
- **visit_team**: Abbreviated name of the visiting team
- **home_score**: The home team's score in the game when the free throw was taken
- **visit_score**: The visiting team's score in the game when the free throw was taken
- **home_final_score**, **visit_final_score**: Each team's final score in the game
- **minutes**: How many minutes into the game when the free throw was taken. An NBA game is 48 minutes long (not counting a potential overtime).
- **season_start**: What year the season started in (seasons start in one year and finish the following year)
- **shot_count**: Sometimes a player is awarded more than one free throw. A 1 indicates this is their first shot, a 2 indicates this is their second. They can get a maximum of 3 shots.

## Tasks

### Data cleaning
**DC1:** Drop any rows with any missing values. Save the result back to `df`.
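
As a hint for DC1, here is a minimal sketch on a made-up toy frame (the rows are invented for illustration and `toy` stands in for `df`). It assumes "missing" means `NaN`/`None` as pandas sees it; resetting the index afterwards is optional.

```python
import numpy as np
import pandas as pd

# Made-up rows standing in for the free-throw data (illustration only)
toy = pd.DataFrame({
    "player": ["Andrew Bynum", "Shawn Marion", None, "Andrew Bynum"],
    "minutes": [47.0, np.nan, 29.2, 29.733333],
    "shot_made": [1, 1, 0, 0],
})

# Drop every row that has at least one missing value and save the result back
toy = toy.dropna().reset_index(drop=True)
print(toy)
```

A quick `toy.isna().sum()` afterwards confirms that no missing values remain.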

**DC2:** Label encode any text columns. Save the encoded values back to that same column (don't create a new one).
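
One possible approach to DC2, sketched on invented rows: treat every non-numeric column as a text column and overwrite it with the output of scikit-learn's `LabelEncoder`. The `encoders` dictionary is a hypothetical convenience for mapping codes back to names later; it is not required by the task.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows (illustration only)
toy = pd.DataFrame({
    "player": ["Andrew Bynum", "Shawn Marion", "Andrew Bynum"],
    "home_team": ["LAL", "LAL", "PHX"],
    "shot_made": [1, 0, 1],
})

# Assumption: "text columns" means every column that is not numeric
text_cols = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]

encoders = {}  # keep the fitted encoders so codes can be decoded later
for col in text_cols:
    encoders[col] = LabelEncoder()
    # Overwrite the column in place with its integer codes
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)
```

`encoders["player"].inverse_transform([0, 1])` recovers the original names if you need them for plots or comments.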

**DC3:** Split the data into training and testing sets, with the test set having 30% of the data. Create `X_train`, `X_test`, `y_train` and `y_test`. The column you will be predicting is `shot_made`, so use it as `y`. Stratify the data according to `shot_made` using the `stratify` argument of `train_test_split`.
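
A sketch of the stratified split for DC3, using randomly generated stand-in data (the 75% make rate and the `random_state` value are arbitrary assumptions, chosen only to keep the example reproducible).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in data (illustration only): 150 made shots, 50 misses
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "period": rng.integers(1, 5, size=200),
    "minutes": rng.uniform(0, 48, size=200),
    "shot_made": np.repeat([1, 0], [150, 50]),
})

X = toy.drop(columns="shot_made")
y = toy["shot_made"]

# 30% test set; stratify=y keeps the made/missed ratio the same in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(y_train.mean(), y_test.mean())
```

Comparing `y_train.mean()` with `y_test.mean()` is a cheap check that the stratification worked.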

### Model building

**MB1:** Build a random forest model to classify whether or not the shot is made (`shot_made` column). Pick some hyperparameters and use those. Plot a confusion matrix showing the results. Create a Markdown cell with brief comments (1 or 2 sentences) on the results.
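
For MB1, the overall shape might look like the sketch below. The data is synthetic and the hyperparameter values are arbitrary picks, not recommended settings. The example computes the confusion matrix as an array so that it runs anywhere; the note after the block says how to draw it.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > -0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Arbitrary hyperparameters, chosen only for the example
rf = RandomForestClassifier(
    n_estimators=200, max_depth=5, min_samples_leaf=5, random_state=42
)
rf.fit(X_train, y_train)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, rf.predict(X_test))
print(cm)
```

In a notebook with matplotlib installed, `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` from `sklearn.metrics` turns the array into a plot.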

**MB2:** Run a cross-validated grid search over values of `max_depth`, `min_samples_split` and `min_samples_leaf` for a *decision tree model* (plain decision tree, not a random forest). Use three values for each hyperparameter. Plot a confusion matrix showing the results of the best model, and report the hyperparameters it chose. Create a Markdown cell with brief comments (1 or 2 sentences) on the results.
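
A sketch of the grid search in MB2 on synthetic data. The candidate values in `param_grid`, the 5-fold setting and the accuracy scoring are all assumptions made for the example; pick values that suit the real data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > -0.7).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Three candidate values per hyperparameter (3 x 3 x 3 = 27 combinations)
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42), param_grid, cv=5, scoring="accuracy"
)
search.fit(X_train, y_train)

print(search.best_params_)  # the combination with the best mean CV score

# best_estimator_ is refit on the whole training set by default
cm = confusion_matrix(y_test, search.best_estimator_.predict(X_test))
print(cm)
```

Because the search only sees the training set, the confusion matrix on `X_test` remains an honest estimate of how the chosen tree generalises.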

**MB3:** Build an XGBoost model. Pick some hyperparameters and use those. Plot a confusion matrix showing the results. Create a Markdown cell with brief comments (1 or 2 sentences) on the results.

### Feature engineering

**FE1:** Basketball games have four "quarters", each lasting twelve minutes. Use this to create a new column `quarter` which has values 1, 2, 3 or 4. It should have a value of 1 if `minutes` is less than or equal to 12, 2 if it is in (12, 24], and so forth.
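
The binning in FE1 can be sketched with `pd.cut`, whose intervals are right-closed by default and so match the (12, 24] convention. The `minutes` values below are invented, and lumping anything past 36 minutes (including overtime) into quarter 4 is an assumption, since the task only allows the values 1 to 4.

```python
import numpy as np
import pandas as pd

# Made-up minute marks, including both boundaries and one overtime value
toy = pd.DataFrame({"minutes": [0.0, 12.0, 12.5, 24.0, 29.733333, 47.0, 50.0]})

# Intervals: (-inf, 12], (12, 24], (24, 36], (36, inf)
bins = [-np.inf, 12, 24, 36, np.inf]
toy["quarter"] = pd.cut(toy["minutes"], bins=bins, labels=[1, 2, 3, 4]).astype(int)

print(toy)
```

Checking the boundary rows (exactly 12 and exactly 24 minutes) is the quickest way to confirm the intervals close on the intended side.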

**FE2:** Retrain your XGBoost model using the same hyperparameters as above, but now include this new column. Plot a confusion matrix showing the results. Plot a bar chart showing the feature importance for each column. Create a Markdown cell with brief comments (1 or 2 sentences) on the results.

### Bonus

**B1:** Create a new column `pct_made` which, for each player, shows what percentage of their free throws they made *in all previous seasons*. So for instance, if the row you are looking at is a game in 2009, then you want to calculate (for that player) what percentage of their free throws they made in 2006 through 2008. Fill any missing values with the median of `pct_made`. Once you've done this, again retrain your XGBoost model by including this new column. Again, plot a confusion matrix and a bar chart showing feature importances. Create a Markdown cell with brief comments (1 or 2 sentences) on the results.
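
The "all previous seasons" logic in B1 is the tricky part, so here is one way it could be sketched on invented rows: aggregate to one row per player and season, take a cumulative sum within each player, and subtract the current season so that only earlier seasons count. The result is stored as a fraction between 0 and 1 (multiply by 100 for a percentage), and the median is taken over shot-level rows; both are assumptions about what the task intends.

```python
import pandas as pd

# Made-up shots for two hypothetical players "A" and "B"
toy = pd.DataFrame({
    "player": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "season_start": [2006, 2006, 2007, 2007, 2008, 2008, 2007, 2008, 2008],
    "shot_made": [1, 0, 1, 1, 0, 1, 1, 0, 0],
})

# One row per (player, season) with makes and attempts
season = (
    toy.groupby(["player", "season_start"])["shot_made"]
    .agg(made="sum", attempts="count")
    .reset_index()
    .sort_values(["player", "season_start"])
)

# Running totals per player, minus the current season = earlier seasons only
by_player = season.groupby("player")
prior_made = by_player["made"].cumsum() - season["made"]
prior_attempts = by_player["attempts"].cumsum() - season["attempts"]

# A player's first season has 0 prior attempts, so 0 / 0 gives NaN here
season["pct_made"] = prior_made / prior_attempts

# Attach the season-level value to every shot, then fill gaps with the median
toy = toy.merge(
    season[["player", "season_start", "pct_made"]],
    on=["player", "season_start"],
    how="left",
)
toy["pct_made"] = toy["pct_made"].fillna(toy["pct_made"].median())

print(toy)
```

Player A's 2008 rows get 0.75 because A made 3 of 4 shots across 2006 and 2007; first-season rows get the median of the known values.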