# Missing Data

## Drop Rows or Columns

### Drop rows with missing data:
```
df.dropna()
```
### Drop columns with missing data:
```
df.dropna(axis=1)
```

### Other options:
```
df.dropna(how='all')     # Drops rows where ALL values are missing.
df.dropna(thresh=4)      # Keeps only rows with at least 4 non-NaN values.
df.dropna(subset=['C'])  # Drops rows with NaN in column 'C'.
```
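
To make the differences between these options concrete, here is a small sketch on a made-up frame (the column names `A` to `D` and the values are purely illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; every column contains at least one NaN.
df = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, np.nan, 8.0],
     [np.nan, np.nan, np.nan, np.nan],
     [10.0, 11.0, 12.0, np.nan]],
    columns=["A", "B", "C", "D"],
)

rows_any = df.dropna()                  # keeps only fully populated rows
cols_any = df.dropna(axis=1)            # keeps only fully populated columns
rows_all = df.dropna(how="all")         # drops only the row that is entirely NaN
rows_thresh = df.dropna(thresh=4)       # keeps rows with at least 4 non-NaN values
rows_subset = df.dropna(subset=["C"])   # drops rows where 'C' is NaN

print(rows_any.index.tolist(), rows_all.index.tolist(), rows_subset.index.tolist())
```

Note that `dropna` returns a new frame by default, so `df` itself is unchanged unless the result is assigned back.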

## Imputing Missing Values

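Often it's better to replace `NaN` with some value, to avoid removing too much data. In current scikit-learn this is the job of the `SimpleImputer` class (the older `Imputer` class in `sklearn.preprocessing` has been removed):
```
import numpy as np
from sklearn.impute import SimpleImputer
# strategy can be 'mean', 'median', 'most_frequent' or 'constant'.
# Statistics are always computed per column (there is no axis argument).
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr.fit(df.values)
imputed_data = imr.transform(df.values)
```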
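
As a worked sketch of mean imputation, the following uses a made-up two-column frame and assumes a scikit-learn version in which the imputer is available as `sklearn.impute.SimpleImputer`. Each `NaN` is replaced by the mean of the observed values in its column:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one missing value per column.
df = pd.DataFrame({"A": [1.0, 3.0, np.nan], "B": [2.0, np.nan, 6.0]})

imr = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imr.fit_transform(df.values)

# statistics_ holds the per-column fill values learned during fit.
print(imr.statistics_)
print(imputed)
```

Here column `A` is filled with 2.0 (the mean of 1 and 3) and column `B` with 4.0 (the mean of 2 and 6).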

# Handling Categorical Data

Categorical (non-numerical) features can be broken down into ordinal (e.g. t-shirt size) and nominal (e.g. t-shirt colour).

In the ordinal case, the mapping can be done on a single column, since the order of the resulting numbers is meaningful:
```
size_mapping = {
    'XL': 3,
    'L' : 2,
    'M' : 1
}
df['size'] = df['size'].map(size_mapping)
```
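
A self-contained sketch of the same mapping on a made-up frame, together with an inverse dictionary for recovering the original labels later (the name `inv_size_mapping` is illustrative, not from any library):

```python
import pandas as pd

# Hypothetical toy data with an ordinal 'size' column.
df = pd.DataFrame({"size": ["M", "XL", "L"]})

size_mapping = {"XL": 3, "L": 2, "M": 1}
df["size"] = df["size"].map(size_mapping)

# Reverse the dictionary to map the integers back to labels.
inv_size_mapping = {v: k for k, v in size_mapping.items()}
restored = df["size"].map(inv_size_mapping)

print(df["size"].tolist(), restored.tolist())
```

Any label missing from `size_mapping` would become `NaN` after `map`, so it is worth checking that the dictionary covers every category in the column.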

In the nominal case, dummy features need to be created to avoid the incorrect interpretation of number order:
```
import pandas as pd
pd.get_dummies(df) # Ignores numerical fields, can work on a subset of the dataframe. 
```
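
To illustrate, here is a minimal sketch on a made-up frame with one nominal column (`colour`) and one numerical column (`price`). Passing `columns=` restricts the encoding to the named columns, and `drop_first=True` is an optional variant that drops one dummy per feature to avoid redundant columns:

```python
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({
    "colour": ["green", "red", "blue"],
    "price": [10.1, 13.5, 15.3],
})

# One indicator column per colour; 'price' passes through unchanged.
dummies = pd.get_dummies(df, columns=["colour"])

# Variant: drop the first category so the dummies are not collinear.
dummies_reduced = pd.get_dummies(df, columns=["colour"], drop_first=True)

print(dummies.columns.tolist())
print(dummies_reduced.columns.tolist())
```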

# Partitioning into Training and Test Sets
```
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
```
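
A runnable sketch of the split on made-up arrays, assuming a scikit-learn version where `train_test_split` lives in `sklearn.model_selection`. With 10 samples and `test_size=0.3`, 3 samples go to the test set and 7 to the training set:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features, binary labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the shuffle reproducible, so the same rows land in each set on every run.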

# Feature Scaling

Most ML algorithms (decision trees being an exception) perform much better when all features are on the same scale. There are two main approaches:
1. Normalization - $x^{i}_{norm} = \frac{x^{i}-x_{min}}{x_{max}-x_{min}}$ - Places values in the range $[0,1]$.
2. Standardization - $x^{i}_{std} = \frac{x^{i}-\mu_{x}}{\sigma_{x}}$ - Centres values at zero mean with unit variance (it does not change the shape of the distribution). This is often more useful, as it keeps information about outliers while making the algorithm less sensitive to them than min-max scaling.

In both cases, the code is very similar:
```
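from sklearn.preprocessing import MinMaxScaler, StandardScaler
mms, stds = MinMaxScaler(), StandardScaler()
X_train_norm = mms.fit_transform(X_train) # Fit only occurs on training.
X_test_norm = mms.transform(X_test)       # Test data reuses the training min/max.
X_train_std = stds.fit_transform(X_train)
X_test_std = stds.transform(X_test)       # Test data reuses the training mean/std.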
```
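
To connect the two formulas to the scalers, the sketch below applies both by hand with NumPy on a made-up single-feature array and compares the results to scikit-learn. It also shows that test values outside the training range can fall outside $[0,1]$ after normalization:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data: one feature, test value above the training max.
X_train = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[5.0]])

mms = MinMaxScaler().fit(X_train)
stds = StandardScaler().fit(X_train)

# Normalization by hand: (x - min) / (max - min), per column.
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
manual_norm = (X_train - x_min) / (x_max - x_min)

# Standardization by hand: (x - mean) / std, using the population std.
manual_std = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)

print(np.allclose(mms.transform(X_train), manual_norm))
print(np.allclose(stds.transform(X_train), manual_std))
print(mms.transform(X_test))  # scaled with the training min/max
```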

# Feature Selection

There are a number of methods of feature selection, such as L1 regularization for Logistic Regression (which encourages sparsity). Additionally, the SBS algorithm (sequential backward selection, available in recent scikit-learn versions as `SequentialFeatureSelector` with `direction='backward'`) and many other selectors can be used for direct selection before fitting an algorithm. Random Forests also expose a `feature_importances_` attribute after fitting, which shows how much each feature contributed to the model.
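
As a rough sketch of two of these approaches, the example below builds synthetic data in which only the first feature determines the label, then inspects the L1-regularized logistic regression coefficients and the random forest importances. The data, the value `C=0.1` and the choice of the `liblinear` solver are illustrative assumptions, not recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data: 5 features, but only feature 0 carries signal.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)

# L1 penalty pushes the weights of uninformative features towards zero.
lr = LogisticRegression(penalty="l1", C=0.1, solver="liblinear")
lr.fit(X, y)

# Importances are available only after fitting and sum to 1.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
importances = forest.feature_importances_

print(lr.coef_[0])
print(importances)
```

In both cases the informative feature should stand out, either as the largest coefficient or as the largest importance.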