# Support vector machines: Problem solving

This exercise uses the `baseball` dataset, which contains Major League Baseball data from 1986 and 1987.

The goal is to predict `Division` (E: East or W: West) using all the other variables.

| Variable  | Type    | Description                                                                      |
|:-----------|:---------|:----------------------------------------------------------------------------------|
| AtBat     | Ratio   | Number of times at bat in 1986                                                   |
| Hits      | Ratio   | Number of hits in 1986                                                           |
| HmRun     | Ratio   | Number of home runs in 1986                                                      |
| Runs      | Ratio   | Number of runs in 1986                                                           |
| RBI       | Ratio   | Number of runs batted in in 1986                                                 |
| Walks     | Ratio   | Number of walks in 1986                                                          |
| Years     | Ratio   | Number of years in the major leagues                                             |
| CAtBat    | Ratio   | Number of times at bat during his career                                         |
| CHits     | Ratio   | Number of hits during his career                                                 |
| CHmRun    | Ratio   | Number of home runs during his career                                            |
| CRuns     | Ratio   | Number of runs during his career                                                 |
| CRBI      | Ratio   | Number of runs batted in during his career                                       |
| CWalks    | Ratio   | Number of walks during his career                                                |
| League    | Nominal | A factor with levels A and N indicating player's league at the end of 1986       |
| Division  | Nominal | A factor with levels E and W indicating player's division at the end of 1986     |
| PutOuts   | Ratio   | Number of put outs in 1986                                                       |
| Assists   | Ratio   | Number of assists in 1986                                                        |
| Errors    | Ratio   | Number of errors in 1986                                                         |
| Salary    | Ratio   | 1987 annual salary on opening day in thousands of dollars                        |
| NewLeague | Nominal | A factor with levels A and N indicating player's league at the beginning of 1987 |

**Source:** This dataset was taken from the StatLib library, which is maintained at Carnegie Mellon University.

## Load data

Import `pandas` so we can load a dataframe.

Load `datasets/baseball.csv` into a dataframe.
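
As a minimal sketch of the loading step, the snippet below reads a tiny inline CSV with `pd.read_csv`. The three sample rows and the reduced set of columns are made up for illustration; in the notebook you would pass the path `datasets/baseball.csv` instead of the in-memory buffer.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for datasets/baseball.csv.
# The first row has an empty Salary field, which pandas reads as NaN.
sample_csv = io.StringIO(
    "AtBat,Hits,League,Division,Salary\n"
    "293,66,A,E,\n"
    "315,81,N,W,475.0\n"
    "479,130,A,W,480.0\n"
)

# In the notebook this would be: df = pd.read_csv("datasets/baseball.csv")
df = pd.read_csv(sample_csv)

# Quick look at the first rows to confirm the load worked
print(df.head())
```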

## Explore data

To remove missing values, call `dropna` and store the result back into your dataframe.
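
One possible way to count the affected rows while dropping them is sketched below. The small dataframe is invented for illustration and the variable names (`rows_with_nan`, `rows_dropped`) are assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd

# Made-up data: two of the four rows are missing Salary
df = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "Salary": [np.nan, 475.0, 480.0, np.nan],
    "Division": ["E", "W", "W", "E"],
})

rows_before = len(df)

# A row counts as "had NaN" if any of its columns is missing
rows_with_nan = int(df.isna().any(axis=1).sum())

# dropna returns a new dataframe, so store the result back
df = df.dropna()
rows_dropped = rows_before - len(df)

print(rows_with_nan, rows_dropped)
```

Comparing the row count before and after `dropna` is a simple cross-check on the `isna` count.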

------------------
**QUESTION:**

How many rows had NaN?

**ANSWER: (click here to edit)**


-------------------

To get the five-number summary (minimum, quartiles, and maximum), `describe` the dataframe.

To visualize relationships amongst variables, first import `plotly.express`.

And create a correlation matrix heatmap.
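
The sketch below shows one way the correlation matrix and heatmap could be built. The data is synthetic, and the helper name `correlation_heatmap` is hypothetical. It assumes `plotly` is installed when the helper is called; `px.imshow` is used here as one reasonable choice for drawing a matrix, not necessarily the approach intended by the exercise.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data: Hits is built to track AtBat closely
rng = np.random.default_rng(1)
at_bat = rng.integers(100, 600, size=50)
df = pd.DataFrame({
    "AtBat": at_bat,
    "Hits": at_bat * 0.27 + rng.normal(0, 5, size=50),
    "Errors": rng.integers(0, 20, size=50),
    "League": rng.choice(["A", "N"], size=50),
})

# Correlations are only defined for numeric columns, so drop the nominals first
corr = df.select_dtypes("number").corr()


def correlation_heatmap(corr_matrix):
    """Return a plotly heatmap figure for a correlation matrix (hypothetical helper)."""
    # Imported here so the correlation step above runs without plotly installed
    import plotly.express as px

    # Fix the color range to [-1, 1] so colors are comparable across datasets
    return px.imshow(corr_matrix, zmin=-1, zmax=1, color_continuous_scale="RdBu")


print(corr.round(2))
```

In a notebook, `correlation_heatmap(corr).show()` would render the figure.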

------------------
**QUESTION:**

What groups of variables are strongly correlated?

**ANSWER: (click here to edit)**



-------------------

**Note**: Although we have a lot of multicollinearity, that is not necessarily a problem for *penalized* methods like SVM, ridge, and lasso regression.
However, the value of the penalty parameter `C` becomes even more important with multicollinearity.

Plot a histogram of the class label `Division` to see how the classes are balanced.
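
Alongside the histogram, the class counts can be checked numerically. The sketch below uses a made-up `Division` column; the counts it prints say nothing about the real dataset.

```python
import pandas as pd

# Made-up class labels for illustration only
division = pd.Series(["E", "W", "W", "E", "W", "E", "W", "W"], name="Division")

# Absolute counts per class
counts = division.value_counts()

# Share of each class; values far from 0.5 would signal imbalance
shares = division.value_counts(normalize=True)

print(counts)
print(shares)
```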

------------------
**QUESTION:**

What can you say about the balance between the classes and any possible problems that may result?

**ANSWER: (click here to edit)**


-------------------

## Prepare train/test sets

We need to separate our predictors (`X`) from our class label (`Y`), putting each into its own dataframe, so create `X` and `Y`.

Convert the nominal variables in `X` to dummies, storing the result in `X`. Keep all levels.

Convert the nominal variable in `Y` to a dummy as well, but drop the reference level.
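
A minimal sketch of these three steps, using a four-row made-up dataframe with only a few of the columns, might look like this. It assumes `pd.get_dummies` is the intended conversion function.

```python
import pandas as pd

# Made-up rows with a subset of the real columns
df = pd.DataFrame({
    "Hits": [66, 81, 130, 141],
    "League": ["A", "N", "A", "N"],
    "NewLeague": ["A", "N", "N", "N"],
    "Division": ["E", "W", "W", "E"],
})

# Predictors are everything except the class label
X = df.drop(columns="Division")

# Double brackets keep Y as a one-column dataframe rather than a series
Y = df[["Division"]]

# Keep all levels for the predictors
X = pd.get_dummies(X)

# drop_first=True removes the first (reference) level of each nominal
Y = pd.get_dummies(Y, drop_first=True)

print(X.columns.tolist())
print(Y.columns.tolist())
```

Printing the column names afterwards shows which levels survived the conversion.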

------------------
**QUESTION:**

Why did we keep all levels of nominals for `X` but not `Y`?
What level/class is now `1` in `Y`?

**ANSWER: (click here to edit)**



-------------------

To split the data into train/test sets, import `model_selection`.

Then do the actual splitting of the data, using `random_state=1`.
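
The split could be sketched as follows, with random stand-in data. The snippet assumes the default `test_size` of `train_test_split` (25% of rows held out), since the exercise does not specify one.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Random stand-in data: 40 rows, three predictors, one binary label
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(40, 3)), columns=["Hits", "Runs", "Walks"])
Y = pd.DataFrame({"Division_W": rng.integers(0, 2, size=40)})

# random_state=1 makes the shuffle reproducible across runs
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(
    X, Y, random_state=1
)

print(X_train.shape, X_test.shape)
```

Passing `X` and `Y` in the same call keeps their rows aligned after shuffling.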

## Fit model

Import libraries for:

- SVM
- Metrics
- Ravel

As well as libraries we need to standardize:

- Scale
- Pipeline

Make a pipeline so we can scale and train in one step:

- Use `StandardScaler`
- Use `SVC` with `random_state=1`, `kernel="rbf"`, and `C=60`

Call `fit` on the pipeline.

Get and save predictions.
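
The steps in this section can be sketched as below. The data comes from `make_classification` rather than the baseball file, the step names `"scale"` and `"svc"` are arbitrary, and the label is reshaped into a column only to mimic a one-column `Y` dataframe.

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared baseball data
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
y = y.reshape(-1, 1)  # column shape, like a one-column Y dataframe
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Scaling inside the pipeline means the scaler is fit on training data only
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("svc", svm.SVC(kernel="rbf", C=60, random_state=1)),
])

# np.ravel flattens the label column to the 1-D shape that fit expects
pipe.fit(X_train, np.ravel(y_train))

# Save predictions for the evaluation step
predictions = pipe.predict(X_test)
print(predictions[:10])
```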

## Evaluate the model

Get the accuracy.

And get the recall and precision.
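
As a small worked example of the three metrics, the sketch below scores a hand-made set of eight predictions. The labels are invented so the arithmetic can be checked by hand: 3 true positives, 2 true negatives, 2 false positives, and 1 false negative.

```python
from sklearn import metrics

# Invented labels: 1 is the positive class
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

# Accuracy: (TP + TN) / all = (3 + 2) / 8
accuracy = metrics.accuracy_score(y_true, y_pred)

# Recall: TP / (TP + FN) = 3 / 4
recall = metrics.recall_score(y_true, y_pred)

# Precision: TP / (TP + FP) = 3 / 5
precision = metrics.precision_score(y_true, y_pred)

print(accuracy, recall, precision)
```

In the notebook, `y_true` would be the test labels and `y_pred` the saved predictions.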

------------------
**QUESTION:**

Are you surprised by the classifier's ability to distinguish between East and West MLB divisions?
Why or why not?

**ANSWER: (click here to edit)**



-------------------

**QUESTION:**

Try going back and changing `C` to different values to see how that affects the results. Try low values below 1 and large values up to 10,000.

What values did you try, and how did the accuracy change?

**ANSWER: (click here to edit)**



-------------------
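
The experiment in the last question can also be run as a loop rather than by editing `C` by hand. The sketch below uses synthetic data from `make_classification`, and the particular grid of `C` values is just one possible choice within the suggested range. The accuracies it prints do not reflect the baseball data.

```python
from sklearn import metrics, svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the prepared baseball data
X, y = make_classification(n_samples=200, n_features=6, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

results = {}
for C in [0.01, 0.1, 1, 60, 1000, 10000]:
    # Rebuild the pipeline each time so every C starts from a fresh model
    pipe = Pipeline([
        ("scale", StandardScaler()),
        ("svc", svm.SVC(kernel="rbf", C=C, random_state=1)),
    ])
    pipe.fit(X_train, y_train)
    results[C] = metrics.accuracy_score(y_test, pipe.predict(X_test))

# One line per C value, for easy comparison
for C, acc in results.items():
    print(f"C={C:<8} accuracy={acc:.3f}")
```

Small `C` values penalize margin violations lightly and give a smoother decision boundary, while large values fit the training data more tightly.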

## Submit your work

When you have finished the notebook, please download it, log in to [OKpy](https://okpy.org/) using "Student Login", and submit it there.

Then let your instructor know on Slack.
