# Hitters data preparation

- We illustrate the following regression methods on a data set called "Hitters".
- It includes 20 variables and 322 observations of major league baseball players. 
- The goal is to predict a baseball player’s salary on the basis of various features associated with performance in the previous year. 
- We don't cover the topic of exploratory data analysis in this notebook. 

- Visit [the ISLR package documentation](https://cran.r-project.org/web/packages/ISLR/ISLR.pdf) if you want to learn more about the data.

Note that scikit-learn provides [**pipelines**](https://kirenz.github.io/ds-python/docs/data.html#pipelines-in-scikit-learn) for data preprocessing and feature engineering, which are considered best practice for data preparation. However, since we use scikit-learn as well as statsmodels in some of our examples, we won't create a data preprocessing pipeline in this example.
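
For orientation only, here is a minimal sketch of what such a pipeline could look like. It is not used in the rest of this notebook. The toy data frame and its values are made up for illustration (they are not the real Hitters data), and the choice of `StandardScaler`, `OneHotEncoder` and `LinearRegression` is an assumption, not a recommendation from the source.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical values (not the real Hitters data)
toy = pd.DataFrame({
    "Hits": [81, 130, 141, 87, 169, 37],
    "Years": [14, 3, 11, 2, 11, 2],
    "League": ["N", "A", "N", "N", "A", "A"],
    "Salary": [475.0, 480.0, 500.0, 91.5, 750.0, 70.0],
})

# Scale the numerical columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Hits", "Years"]),
    ("cat", OneHotEncoder(drop="first"), ["League"]),
])

# Chain preprocessing and the model so both are fitted in one step
model = Pipeline([
    ("preprocess", preprocess),
    ("regressor", LinearRegression()),
])

X_toy = toy.drop(columns="Salary")
y_toy = toy["Salary"]

model.fit(X_toy, y_toy)
predictions = model.predict(X_toy)
print(predictions.shape)
```

The advantage of this design is that every preprocessing step is learned from the training data only and then applied unchanged to new data.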

## Import

In [None]:
import pandas as pd

df = pd.___("https://raw.githubusercontent.com/kirenz/datasets/master/Hitters.csv")

In [None]:
df

In [None]:
df.___()

### Missing values

Note that the salary is missing for some of the players:

In [None]:
# show sum of missing values per variable 
print(df.___().___())

We simply drop the missing cases: 

In [None]:
# drop missing cases
df = df.____()
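
The two steps above (counting and dropping missing values) can be sketched on a small toy data frame. The values below are made up, and `isnull`, `sum` and `dropna` are one possible choice of pandas methods for this task:

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values; two salaries are missing
toy = pd.DataFrame({
    "Hits": [81, 130, 141, 87],
    "Salary": [475.0, np.nan, 500.0, np.nan],
})

# Count missing values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that contains at least one missing value
toy_complete = toy.dropna()
print(toy_complete.shape)
```

Note that `dropna` removes a row if any of its columns is missing, so it should be used with care when several variables have gaps.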

## Create label and features

Since we will use algorithms from scikit-learn, we need to encode our categorical features as one-hot numeric features (dummy variables):

In [None]:
# get dummies
dummies = pd.___(df[['League', 'Division','NewLeague']])

In [None]:
dummies.info()

In [None]:
print(dummies.head())
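
To see how the dummy columns are named, here is a small sketch on made-up data, assuming `pd.get_dummies` is used for the encoding. Each new column is named after the original column and the category it represents:

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "League": ["N", "A", "N"],
    "Division": ["W", "E", "W"],
})

# One new column per category, named "<column>_<category>"
toy_dummies = pd.get_dummies(toy[["League", "Division"]])
print(list(toy_dummies.columns))
```

Depending on the pandas version, the resulting columns hold booleans or 0/1 integers; both work as numeric features in scikit-learn.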

Next, we create our label y:

In [None]:
y = df['___']

We drop the column with the outcome variable (Salary) and the categorical columns for which we already created dummy variables:

In [None]:
X_numerical = df.___(['Salary', 'League', 'Division', 'NewLeague'], axis=1).astype('float64')

- Make a list of all numerical features (we need them later)
- Only store the column names

In [None]:
list_numerical = X_numerical.___
list_numerical

In [None]:
# Create all features (concatenate all variables with "concat")
X = pd.___([X_numerical, dummies[['League_N', 'Division_W', 'NewLeague_N']]], axis=1)

X.info()
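
Only one dummy column per categorical variable is kept above. For a variable with two categories, the two dummy columns always add up to one, so the second column carries no extra information. The sketch below shows this on made-up data and uses `drop_first=True` as an alternative way to get the same result; this option is not used in the notebook itself.

```python
import pandas as pd

# Toy data with hypothetical values
toy = pd.DataFrame({
    "League": ["N", "A", "N", "A"],
    "Hits": [81, 130, 141, 87],
})

# Full encoding: League_A and League_N sum to 1 in every row
full = pd.get_dummies(toy[["League"]])
row_sums = full.sum(axis=1)
print(row_sums.tolist())

# drop_first=True keeps all but the first category of each variable
reduced = pd.get_dummies(toy[["League"]], drop_first=True)

# Combine numerical features and the reduced dummies column-wise
X_toy = pd.concat([toy[["Hits"]].astype("float64"), reduced], axis=1)
print(list(X_toy.columns))
```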

### Split data

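Split the data set into a training set and a test set, using 70% of the observations for training and the remaining 30% for testing. Note that `train_test_split` shuffles the rows by default, so the training set is a random 70% of the data rather than the first 70% of the rows; `random_state` makes the split reproducible.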

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=___, random_state=10)

In [None]:
X_train.head()
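
The following sketch illustrates the split on made-up data with ten rows. It assumes `test_size=0.3` for a 70/30 split and also shows `shuffle=False`, which is the setting to use if you really want the first 70% of the rows for training:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with hypothetical values and ten rows
X_toy = pd.DataFrame({"Hits": range(10)})
y_toy = pd.Series(range(10), name="Salary")

# Default behaviour: rows are shuffled before splitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=10
)
print(len(X_tr), len(X_te))
print(list(X_tr.index))

# Without shuffling: the first 70% go to training, the rest to testing
X_tr_ordered, X_te_ordered, y_tr_ordered, y_te_ordered = train_test_split(
    X_toy, y_toy, test_size=0.3, shuffle=False
)
print(list(X_tr_ordered.index))
```

Features and labels are split together, so the index of `X_tr` always matches the index of `y_tr`.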