# Data preprocessing

Steps for preprocessing data before calling `train_test_split` on it.

## Convert the columns to the correct datatypes 
When data is imported, columns may be assigned datatypes that do not fit their contents.
```python
import pandas as pd
df["floatValue"] = df["floatAsString"].astype(float)
df['date'] = pd.to_datetime(df['dateAsString'])
```

## Check for NaNs
Remove the affected rows, or replace the missing values (e.g. with zero) when that makes sense for the column.
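
A minimal sketch of both options, using a small made-up DataFrame (the column names `a` and `b` are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical example data with missing values
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Count the missing values per column
nan_counts = df.isna().sum()

# Option 1: drop every row that contains at least one NaN
df_dropped = df.dropna()

# Option 2: replace every NaN with zero
df_filled = df.fillna(0)
```

Dropping rows loses data, while filling with zero can distort columns where zero is not a neutral value, so the choice depends on the column.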

## Remove useless info from useful columns 
E.g. a temperature column whose values contain more text than needed, such as a label and a unit around the number.
```python
import re
my_string = "temperature: 75.6 F"
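pattern = re.compile(r"\d+\.\d+")
# re.match only matches at the start of the string, so use search instead
out = pattern.search(my_string)
value = float(out.group())  # 75.6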
```
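
To apply the same pattern to a whole column, `Series.str.extract` is one option. This is a sketch that assumes a hypothetical `temperature` column in which every value contains a decimal number:

```python
import pandas as pd

# Hypothetical column with a label and a unit around the number
df = pd.DataFrame({"temperature": ["temperature: 75.6 F", "temperature: 68.2 F"]})

# The pattern needs one capture group; expand=False returns a Series
df["temperature_value"] = (
    df["temperature"].str.extract(r"(\d+\.\d+)", expand=False).astype(float)
)
```

Rows where the pattern does not match end up as NaN, so it is worth checking for NaNs again afterwards.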

## Columns with high variance are candidates for log normalization

Print the variance of the numerical columns with `df.var(numeric_only=True)`.

Apply log normalization to the columns with high variance:
```python
import numpy as np
df['numericColumn'] = np.log(df['numericColumn'])
```
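
The two steps can be combined as in the sketch below. The data and the variance threshold of 1000 are made up for illustration; what counts as "high" depends on the dataset. Note that `np.log` only works for strictly positive values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one column on a small scale, one spanning several orders of magnitude
df = pd.DataFrame({
    "small": [1.0, 2.0, 3.0, 4.0],
    "large": [10.0, 100.0, 1000.0, 10000.0],
})

variances = df.var()

# Arbitrary threshold, chosen only for this example
high_var_cols = variances[variances > 1000].index.tolist()

# Store the log-normalized values in new columns
for col in high_var_cols:
    df[col + "_log"] = np.log(df[col])
```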

## Scale data with StandardScaler
`StandardScaler` subtracts the mean and divides by the standard deviation of each column (feature) it is applied to, so every scaled column ends up with mean 0 and standard deviation 1.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
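# scale an entire DataFrame and keep the result as a DataFrame:
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns, index=df.index)

# or, if a NumPy array is enough (fit_transform returns an array):
df_scaled = scaler.fit_transform(df)

# to scale a single column:
df['column'] = scaler.fit_transform(df[['column']])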
```
Mind the double square brackets in the last example: they select a one-column DataFrame (2D), which is the shape the scaler expects.
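
As a sanity check, the scaler's output can be compared with a manual computation. This is a small sketch with made-up `height` and `weight` columns. `StandardScaler` uses the population standard deviation (`ddof=0`), whereas `DataFrame.std()` defaults to `ddof=1`, so the manual version has to pass `ddof=0`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data on two different scales
df = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "weight": [50.0, 60.0, 70.0, 80.0],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Same computation by hand, column by column
manual = (df - df.mean()) / df.std(ddof=0)
```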

## Encoding binary variables
These are columns with two possible choices, e.g. Y/N, Male/Female, ...

### Pandas version:
```python
df["encoded_col"] = df["col"].apply(lambda val: 1 if val == "y" else 0)
```
### LabelEncoder version:
```python
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["encoded_col_le"] = le.fit_transform(df["col"])
```
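
`LabelEncoder` assigns the integer codes in sorted order of the labels, which may differ from the mapping chosen by hand in the pandas version. A sketch on made-up data that shows how to read the mapping back from `classes_`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary column
df = pd.DataFrame({"col": ["y", "n", "y", "n"]})

le = LabelEncoder()
df["encoded_col_le"] = le.fit_transform(df["col"])

# classes_ holds the labels in sorted order; the position is the code
mapping = {label: code for code, label in enumerate(le.classes_)}
```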

## Encoding categories

A column that can contain a *limited* number of values. E.g. car brands, colors, ...

This form returns only the new dummy columns for the categorical column. They still have to be concatenated to the DataFrame, and the original column has to be dropped.
```python
pd.get_dummies(users["fav_color"])
```
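
The manual concatenation could look like the following sketch. The `users` data is made up, and the `prefix` argument is an optional choice to keep the new column names recognizable.

```python
import pandas as pd

# Hypothetical data
users = pd.DataFrame({
    "name": ["ann", "bob", "cid"],
    "fav_color": ["red", "blue", "red"],
})

# One new column per distinct color
dummies = pd.get_dummies(users["fav_color"], prefix="fav_color")

# Drop the original column and attach the dummy columns
users_encoded = pd.concat([users.drop(columns="fav_color"), dummies], axis=1)
```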

To encode the column and replace it in the DataFrame in one step, pass the column names via `columns`:
```python
users = pd.get_dummies(users, columns=["fav_color"])
```

## Replacing similar columns by their mean
E.g. values from multiple sensors.
```python
columns = ['col1', 'col2', 'col3']
df["mean"] = df[columns].mean(axis=1)
```

## Extracting parts of a date
```python
# requires a datetime column, e.g. one converted with pd.to_datetime
df["month"] = df["original_datetime"].dt.month
```

## Vectorizing text
```python
# tf = term frequency
# idf = inverse document frequency

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# get a dict that maps each word to its column index
vocabulary = tfidf_vec.vocabulary_

# weights and column indices of the words in the document at row 3:
print(text_tfidf[3].data)
print(text_tfidf[3].indices)
```
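
To see which weight belongs to which word, the vocabulary can be inverted. A sketch on three made-up documents; `index_to_word` and `weights` are helper names introduced only for this example:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus
documents = ["the cat sat", "the dog sat", "the bird flew away"]

tfidf_vec = TfidfVectorizer()
text_tfidf = tfidf_vec.fit_transform(documents)

# vocabulary_ maps word -> column index; invert it to index -> word
index_to_word = {idx: word for word, idx in tfidf_vec.vocabulary_.items()}

# Pair each non-zero weight in the first document with its word
row = text_tfidf[0]
weights = {index_to_word[i]: w for i, w in zip(row.indices, row.data)}
```

Words that occur in fewer documents get a higher weight: in the first document, "cat" (one document) outweighs "sat" (two documents), which outweighs "the" (all three).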

## PCA
```python
from sklearn.decomposition import PCA

# Set up PCA and the X vector for dimensionality reduction
pca = PCA()
wine_X = wine.drop("Type", axis=1)

# Apply PCA to the wine dataset X vector
transformed_X = pca.fit_transform(wine_X)

# Look at the percentage of variance explained by the different components
print(pca.explained_variance_ratio_)
```
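
One way to use `explained_variance_ratio_` is to keep only as many components as are needed to reach a target share of the variance. The sketch below uses synthetic data rather than the wine dataset, and the 95% target is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 4 features built from only 2 independent directions plus small noise
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 2))])
X = X + rng.normal(scale=0.01, size=(100, 4))

# Fit with all components to inspect the explained variance
pca = PCA()
pca.fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_components = int(np.argmax(cumulative >= 0.95)) + 1

# Refit with the reduced number of components
X_reduced = PCA(n_components=n_components).fit_transform(X)
```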

## train_test_split

```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
```
**Stratified sampling**:
If the classes in the y column are unevenly distributed, a purely random split can leave the training or test set with a class mix that differs from the full dataset. Passing `stratify=y` keeps the class proportions of y roughly the same in both sets, so the model is trained and evaluated on samples that are representative of the entire dataset.
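
The effect can be checked on a small imbalanced example. The data below is made up (90 samples of class 0, 10 of class 1), and `test_size` and `random_state` are arbitrary choices for the sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 10% of the samples belong to class 1
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Share of class 1 in each split
train_share = y_train.mean()
test_share = y_test.mean()
```

Both shares come out at 10%, matching the full dataset.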

TODO: add explanations of PCA and text vectorizing.