# Data preprocessing

## Convert the columns to the correct datatypes 
When data is imported, columns may get assigned datatypes that do not fit them. Depending on the target type, there are several ways of converting them:
```python
df["floatValue"] = df["floatAsString"].astype(float)
df['date'] = pd.to_datetime(df['dateAsString'])
```
To select only columns of a certain type you can use:
```python
toConvert = df.select_dtypes(include=['int']).columns.tolist()
```
or
```python
toConvert = df.dtypes[df.dtypes == np.int64].index.tolist()
```
With a for loop you can then perform conversions on these:
```python
# Converting all integer columns to floats
toConvert = df.dtypes[df.dtypes == np.int64].index.tolist()
for x in toConvert:
    df[x] = df[x].astype(float)
```

## Check for NaNs
It's probably a good idea to remove a column if more than 30% of its rows contain NaNs. Check manually with *df.isnull().sum()* which ones to drop to be certain.
```python
df.isnull().sum()
...
df.drop(['column'], axis=1, inplace=True)
```
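
The 30% rule of thumb above can also be checked programmatically. The following is a minimal sketch, assuming a small made-up DataFrame; the column names and the exact threshold are illustrative, not a fixed recommendation:

```python
import numpy as np
import pandas as pd

# Hypothetical example data: column "b" is 75% missing, "c" is 25% missing
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values per column (the mean of a boolean mask)
missing_fraction = df.isnull().mean()

# Columns above the assumed 30% threshold
to_drop = missing_fraction[missing_fraction > 0.30].index.tolist()
df = df.drop(columns=to_drop)
```

It is still worth looking at `missing_fraction` before dropping anything, since a column just above the threshold may be too valuable to lose.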

Also drop rows where there is no label (what we use as y), as there is no use in training on data that has no value to predict.
```python
df.dropna(subset=['label_column'],inplace=True)
```

After that you could replace the remaining missing values with the mean or median for numerical data and the most frequent category for the others:
```python
def fill_missing_values(df):
    ''' This function imputes missing values with median for numeric columns 
        and most frequent value for categorical columns'''
    missing = df.isnull().sum()
    missing = missing[missing > 0]
    for column in list(missing.index):
        if df[column].dtype == 'object':
            # most frequent category
            df[column] = df[column].fillna(df[column].value_counts().index[0])
        # `dtype == 'int64' or 'float64'` is always True, so test the dtype properly
        elif pd.api.types.is_numeric_dtype(df[column]):
            df[column] = df[column].fillna(df[column].median())
```

## Columns that represent grades can be converted to numeric
This applies when an ordered value is represented as a non-numeric value, such as a quality grade, a rating, etc. Mapping it to numbers keeps the order and avoids turning the column into multiple dummy columns later.
```python
quality_map = {"perfect": 4, "good": 3, "acceptable": 2, "bad": 1, "unusable": 0}
# map returns a new Series, so assign it back instead of relying on inplace
df["Quality"] = df["Quality"].map(quality_map).astype(int)
```

## Remove useless info from useful columns 
E.g. a temperature column where each value also contains a label and a unit, while only the number is needed.
```python
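import re
my_string = "temperature: 75.6 F"
pattern = re.compile(r"\d+\.\d+")
# re.match only matches at the start of the string and would return None here,
# so use search to scan the whole string
out = pattern.search(my_string)
temperature = float(out.group())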
```

## Extracting parts of a date
Sometimes the year part of a date can be good enough for a model to make predictions on.
```python
# the .dt accessor requires the column to have a datetime dtype
df["year"] = df["original_datetime"].dt.year
```

## Split the data into categorical and numeric values
This can be done by splitting the data into two DataFrames that will be reassembled later on, or by making masks that each select only one of the two categories.
```python
numeric=[]
categorical=[]

for column in df.columns:
    if df[column].dtype == 'object':
        categorical.append(column)
    # `dtype == 'int64' or 'float64'` is always True, so test the dtype properly
    elif pd.api.types.is_numeric_dtype(df[column]):
        numeric.append(column)
        
num = df[numeric]
cat = df[categorical]
``` 

<br>**The parts below are applied to the numerical part of the DataFrame**

## Scale data with StandardScaler

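The StandardScaler transforms each column so that it has a mean of 0 and a standard deviation of 1. Standardization is useful for data that has negative values. It only shifts and rescales the data: the shape of the distribution stays the same, so skewed data is still skewed afterwards. Scaling matters most for models that are sensitive to feature scale, such as distance-based and gradient-based models.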

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# scale an entire DataFrame, keeping the column names and index
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
# scale a single selected column
df['column'] = scaler.fit_transform(df[['column']])
```
Mind the double square brackets in the last example: they select a one-column DataFrame, because the scaler expects 2D input.
```python
# or to scale an entire DataFrame (this returns a NumPy array, not a DataFrame):
df_scaled = scaler.fit_transform(df)
```

https://medium.com/predict/standardization-on-crazy-data-python-cd5b1282a97f

## Normalisation (min max)
Normalisation is more useful than the StandardScaler when the data is only positive. It puts everything between 0 and 1.
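
As a minimal sketch of min-max normalisation, assuming scikit-learn's `MinMaxScaler` and a small made-up DataFrame (the column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical example data with only positive values
df = pd.DataFrame({"size": [10.0, 20.0, 30.0], "price": [100.0, 150.0, 300.0]})

scaler = MinMaxScaler()

# Each column is rescaled so its minimum becomes 0 and its maximum becomes 1
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)
```

As with the StandardScaler, the scaler should be fitted on the training data and then reused with `transform` on the test data.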

https://medium.com/@equipintelligence/normalization-of-crazy-data-python-4fa6611e7b46

## Columns with high variance are candidates for log normalization
Print out the variance of the numerical columns. You can find those with df.var()

Apply the log normalization function to the columns with high variance (> 1000):
```python
df['numericColumn'] = np.log(df.numericColumn)
```

This will cause problems if there are rows with zero in them. This can be overcome the following way:

```python
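# leave zero (and negative) values at 0 instead of taking their log
log_col = lambda x: np.log(x) if x > 0 else 0

variances = df.var(numeric_only=True)
for x in variances[variances > 1000].index:
    df[x] = df[x].apply(log_col)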
```

## Columns with low variance can be dropped

These columns do not contain enough variance to be useful to the model and can be dropped. This can be done with *VarianceThreshold*. One way to pick the threshold is to take the mean of the variances of all columns and drop the columns that are a certain amount below it, but use common sense so that you do not drop too many columns or columns too close to the mean.

```python 
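from sklearn.feature_selection import VarianceThreshold

thresholder = VarianceThreshold(threshold=0.5)
# fit on the same numerical DataFrame that is filtered below
thresholder.fit(df_num)
features_high_variance = df_num[df_num.columns[thresholder.get_support(indices=True)]]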
```
This keeps the output as a DataFrame that has the original columns and index intact. The normal output from fit_transform strips those away.

## Remove outliers
Outliers are values that are far away from the rest of the values and that could cause problems with the model. However, outliers that are linked to the target (y) should be kept, e.g. an outlier that gives a higher sales value should be kept.
These outliers should be in the training dataset, not in the test.

A good starting point could be to remove everything that is more than 3 standard deviations away from the mean. Do keep in mind that if there are values that can be zero, e.g. swimming pool size, you could get points marked as outliers because of the zeros, while they are actually valid points.
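
The 3 standard deviations rule can be sketched as follows. This is a minimal example on made-up data with a single numeric column named `value`; both the data and the column name are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 200 values around 50 plus one extreme value of 500
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": np.append(rng.normal(50, 5, size=200), 500.0)})

mean = df["value"].mean()
std = df["value"].std()

# Keep only the rows within 3 standard deviations of the mean
mask = (df["value"] - mean).abs() <= 3 * std
df_no_outliers = df[mask]
```

Note that the extreme value itself inflates the mean and the standard deviation, so it can be worth repeating the check after the first pass.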

## Check for skewed data
If the data is skewed, apply a Box-Cox transform (boxcox) or remove the outliers.
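
A minimal sketch of checking skewness and applying Box-Cox, assuming `scipy.stats.boxcox` and a made-up, strictly positive column named `income` (Box-Cox only works on positive values):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical right-skewed data drawn from a log-normal distribution
rng = np.random.default_rng(0)
df = pd.DataFrame({"income": rng.lognormal(mean=3.0, sigma=1.0, size=500)})

# Skewness close to 0 means roughly symmetric data
skew_before = df["income"].skew()

# boxcox returns the transformed values and the lambda it fitted
transformed, fitted_lambda = stats.boxcox(df["income"].to_numpy())
df["income_boxcox"] = transformed

skew_after = df["income_boxcox"].skew()
```

Comparing `skew_before` and `skew_after` shows whether the transform actually helped for a given column.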

## Check the correlation between features

If two columns are extremely collinear (>95%), one of them should be removed.
To check for this, use df.corr() and plot the result with sns.heatmap().
Another way to check is with variance_inflation_factor() from statsmodels.
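
Besides inspecting a heatmap, the highly correlated pairs can be found in code. The sketch below is one possible approach on made-up data; the column names and the choice to drop the second column of each pair are assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical data: sensor_b is nearly a copy of sensor_a
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "sensor_a": base,
    "sensor_b": base + rng.normal(scale=0.01, size=200),
    "other": rng.normal(size=200),
})

corr = df.corr().abs()

# Keep only the upper triangle so every pair is looked at once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair with a correlation above 0.95
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df_reduced = df.drop(columns=to_drop)
```

Which column of a pair to keep is a judgment call; this sketch simply keeps the one that appears first.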

If two columns are highly correlated but stay below the 95% threshold for dropping, and they measure similar things, they can be replaced by their mean.
E.g. values from multiple sensors.

```python
columns = ['col1', 'col2']
# row-wise mean of the selected columns
df["mean"] = df[columns].mean(axis=1)
```

## Feature selection
Which features are the most useful to get a good model? A RandomForest can be used to see which features it relies on for its predictions. This can then serve as a base to determine what to use and what not. This is not the final RandomForest that will be used to make predictions.

The following function can be used to get all the features that are above a specified importance percentage:
```python
def random_forest_limits(rf, x_df, limit):
    # Get numerical feature importances
    importances = list(rf.feature_importances_)
    # List of tuples with variable and importance
    feature_importances = [(feature, round(importance, 3)) for feature, importance in zip(list(x_df.columns), importances)]
    # Sort the feature importances by most important first
    feature_importances = sorted(feature_importances, key = lambda x: x[1], reverse = True)    
    features = []
    for x in feature_importances:
        if (x[1] >= limit):
            features.append(x[0])
    return features
``` 
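The parameters to pass in are the fitted RandomForest model, the features DataFrame (the X part), and the minimum importance a feature needs to be included. Importances are fractions that sum to 1, so a limit of 0.05 keeps the features that account for at least 5% of the total importance.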
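
To show where the fitted model and the importances come from, here is a minimal self-contained sketch. It assumes a regression problem and made-up data with one informative column (`signal`) and one uninformative column (`noise`); it does the same selection with a pandas Series instead of the function above:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: y depends on "signal" only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=300)

# Throwaway model, only used to rank the features
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X, y)

# Importances as a Series, most important first
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep the features at or above the assumed limit of 0.05
selected = importances[importances >= 0.05].index.tolist()
```

For a classification problem the same idea applies with `RandomForestClassifier`.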

<br>**The following applies to the categorical data**

## Get dummy variables
This step transforms categorical data into data that can be used by the models to make predictions. It is different for columns with only two possible categories (binary) and multiple ones.

### Encoding binary variables
These are columns with two possible choices, e.g. Y/N, Male/Female, ...

#### Pandas version:
```python
df["encoded_col"] = df["col"].apply(lambda val: 1 if val == "y" else 0)
```
#### LabelEncoder version:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["encoded_col_le"] = le.fit_transform(df["col"])
```

### Encoding categories

A column that can contain a *limited* number of values, e.g. car brands, colors, etc should be converted to n columns, where n is the number of categories in the original - 1.

#### With pd.get_dummies()
Using get_dummies in this way will return just the new columns for this categorical column. Its output would have to be concatenated to the original DataFrame and the original column would need to be deleted.
```python
pd.get_dummies(users["fav_color"])
```

To do this automatically do the following:
```python
df = pd.get_dummies(df, columns=[x], drop_first=True)
```
The drop_first parameter avoids the dummy variable trap: without it, the number of new columns equals the number of categories instead of categories - 1, and one of those columns is redundant because it can be derived from the others.

We could get the problem that some categories are in the training set, but not in the test set or vice versa. The following code can be used to solve that:
```python
# add missing columns
for x in df.columns[~df.columns.isin(df_filtered.columns)]:
    df_filtered[x] = 0
    
# remove extra columns
for x in df_filtered.columns[~df_filtered.columns.isin(df.columns)]:
    df_filtered.drop(x, axis=1, inplace=True)
```
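
A shorter alternative to the two loops is `DataFrame.reindex`, which adds the missing columns and removes the extra ones in a single call. A minimal sketch, assuming made-up training and test sets with a `fav_color` column:

```python
import pandas as pd

# Hypothetical data: "yellow" only occurs in the test set,
# "blue" and "green" only in the training set
train = pd.DataFrame({"fav_color": ["red", "blue", "green"]})
test = pd.DataFrame({"fav_color": ["red", "yellow"]})

train_dummies = pd.get_dummies(train, columns=["fav_color"])
test_dummies = pd.get_dummies(test, columns=["fav_color"])

# Give the test set exactly the training columns, in the same order:
# missing columns are added and filled with 0, extra columns are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
```

This also guarantees the same column order in both sets, which the loop version does not.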

#### With OneHotEncoder
One hot encoder does not have this problem, as there is a fit and a separate transform that can be applied to the test set.
```python
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)
X = enc.transform(X)
XTest = enc.transform(XTest)
```

Another RandomForest model can be constructed at this point to determine which of the categorical features are important enough to include in the final model. See the *feature selection* part earlier in the document.

## train_test_split

```python
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```
**Stratified sampling**:
If we know that the distribution of classes in the y column is uneven and we want to train a model to predict y, we want to train it on a sample that is representative of the entire dataset. Stratified sampling achieves this by keeping the class proportions the same in the training and test sets.
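
The effect of `stratify` can be seen on a small imbalanced example. This is a minimal sketch with made-up data where 10% of the labels are 1; the `test_size` and `random_state` values are arbitrary choices:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 90 rows of class 0 and 10 rows of class 1
X = pd.DataFrame({"feature": range(100)})
y = pd.Series([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Share of class 1 in each split; with stratify both match the full dataset
train_share = y_train.mean()
test_share = y_test.mean()
```

Without `stratify=y`, a small test set could by chance end up with very few or no rows of the minority class.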