## General outline:

- load the data
- broadly segregate continuous vs categorical variables
- visualize the data to get a feel
  1. visualize the target (histogram/bar for regression/classification) - note if you may need to apply any transformations
  2. visualize the univariate distributions of the features (histograms, bars)
  3. visualize target vs feature value distributions
      - for **classification**
        - *categorical variables* - by doing nested bar plots with outer class being the target
        - *continuous variables* - by plotting histograms with distributions colored by class
      - for **regression**
        - *categorical variables* - box plots by category for different categorical variables
        - *continuous variables* - good ol' scatter plots
- get quick summary stats for continuous variables and value counts for categorical variables
- treatment of invalid/missing data (including in the target)
  - drop columns like ids, dates (if the underlying info like day/month/year is also irrelevant), and constant values
  - remove extreme values in case of regression targets (use a boxplot)
  - use imputers to fill in missing values (e.g. SimpleImputer/KNNImputer - may need to encode categorical features first)
- decide on a metric to evaluate the model on (e.g. AUC-ROC, adjusted R^2)
- treatment of categorical variables. For example, OrdinalEncoder, OneHotEncoder, category_encoders.TargetEncoder (needs to go separately and first in the pipeline)
- treatment of continuous variables
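
The summary-stats and invalid/missing-data steps above can be sketched roughly like this. The toy frame and its column names (`id`, `const`, `age`, `city`) are made up for illustration, and median imputation is just one reasonable choice, not a recommendation from these notes.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame: an id column, a constant column, and a gap in "age".
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "const": [7, 7, 7, 7],
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["a", "b", "b", "b"],
})

# Drop the id column and any column holding a single distinct value.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
cleaned = df.drop(columns=["id"] + constant_cols)

# Fill missing continuous values with the column median (an assumed strategy).
num_cols = cleaned.select_dtypes(include="number").columns
cleaned[num_cols] = SimpleImputer(strategy="median").fit_transform(cleaned[num_cols])

# Quick summary stats for continuous columns, value counts for a categorical one.
summary = cleaned.describe()
counts = cleaned["city"].value_counts()
print(summary)
print(counts)
```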

## TODO
  0. check this notebook into a GitHub repo
  1. write helper functions for: visualization (see cases above), maybe using seaborn if that seems quicker
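
One possible shape for such a visualization helper, covering the "continuous feature colored by class" case from the outline. The function name `plot_hist_by_class` and the toy data are hypothetical; this uses plain matplotlib rather than seaborn, and is only a starting sketch.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_hist_by_class(df, feature, target, bins=20):
    """Overlay one histogram of `feature` per class of `target`."""
    fig, ax = plt.subplots()
    for label, group in df.groupby(target):
        ax.hist(group[feature].dropna(), bins=bins, alpha=0.5, label=str(label))
    ax.set_xlabel(feature)
    ax.set_ylabel("count")
    ax.legend(title=target)
    return ax

# Hypothetical toy data: one continuous feature and a binary target.
toy = pd.DataFrame({
    "age": [22, 25, 31, 40, 45, 52, 58, 63],
    "churned": [0, 0, 0, 1, 0, 1, 1, 1],
})
ax = plot_hist_by_class(toy, "age", "churned", bins=5)
```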

### imports related to cross-validation
(A reminder of where to look in the sklearn docs for the right API)

**general**
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

**missing values imputation**
```
from sklearn.impute import MissingIndicator
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
```

**feature engineering**
```
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
```

**cross-validation and model tuning**
```
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
```

**classification models**
```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
```

**target transformation** \
`from sklearn.compose import TransformedTargetRegressor`

**Sklearn API reference** - https://scikit-learn.org/stable/modules/classes.html

loading data \
`df = pd.read_csv()`

examining data
```
df.info()
df.head()
```

train_test_split \
`X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)`
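
For classification targets it can help to pass `stratify` (keeps class proportions in both splits) and `random_state` (reproducible split). A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 15 negatives, 5 positives.
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 15 + [1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 3:1 class ratio.
print(np.bincount(y_train), np.bincount(y_test))
```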

boxplot \
`plt.boxplot(y, vert=False)`

apply function \
`df["column_name"].apply(func)`

**Example usage**

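OneHotEncoder
```
from sklearn.preprocessing import OneHotEncoder
# categories are listed explicitly, so 'fourth' is known even though it is absent from the fit data
enc = OneHotEncoder(categories=[['first','second','third','fourth']], drop="if_binary")
X = [['third'], ['second'], ['first']]
enc.fit(X)
# transform returns a sparse matrix; toarray() makes the printout readable
print(enc.transform([['second'], ['first'], ['third'],['fourth']]).toarray())
```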

OrdinalEncoder
```
from sklearn.preprocessing import OrdinalEncoder
# the order of the categories list defines the integer codes (first -> 0, ..., fourth -> 3)
enc = OrdinalEncoder(categories=[['first','second','third','fourth']])
X = [['third'], ['second'], ['first']]
enc.fit(X)
print(enc.transform([['second'], ['first'], ['third'],['fourth']]))
```

Target transformer
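```
from sklearn.linear_model import LinearRegression

# fits the regressor on log(y); predictions are mapped back with exp
log_regressor = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp)
```
returns an estimator (call `fit`/`predict` on it as usual)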
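
A quick check of how this behaves, on made-up data where `log(y)` is exactly linear in `X`. The inner regressor is fitted on the log scale, while `predict` returns values on the original scale.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Hypothetical data: y = exp(1 + 2x), so log(y) is linear in x.
X = np.linspace(0, 1, 20).reshape(-1, 1)
y = np.exp(1 + 2 * X.ravel())

log_regressor = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp)
log_regressor.fit(X, y)

pred = log_regressor.predict(X)            # back on the original scale of y
slope = log_regressor.regressor_.coef_[0]  # fitted on the log scale
print(slope)
```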

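ColumnTransformer syntax (Pipelines inside a ColumnTransformer)
```
from sklearn.compose import make_column_transformer, make_column_selector

# categorical columns: fill gaps with a placeholder category, then one-hot encode
cat_preprocessing = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))
# continuous columns: mean-impute (SimpleImputer default), then standardize
cont_preprocessing = make_pipeline(
    SimpleImputer(),
    StandardScaler())

# object-dtype columns go to cat_preprocessing, everything else to cont_preprocessing
preprocess = make_column_transformer(
    (cat_preprocessing, make_column_selector(dtype_include='object')),
    remainder=cont_preprocessing)
```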
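
To see what the column transformer produces, here is a self-contained run on a tiny made-up frame (the column names `city` and `age` are hypothetical). The categorical column is built with `dtype=object` explicitly so that `dtype_include='object'` selects it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame with one gap in each column.
df = pd.DataFrame({
    "city": pd.Series(["a", "b", np.nan, "b"], dtype=object),
    "age": [25.0, np.nan, 40.0, 31.0],
})

cat_preprocessing = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore"))
cont_preprocessing = make_pipeline(SimpleImputer(), StandardScaler())

preprocess = make_column_transformer(
    (cat_preprocessing, make_column_selector(dtype_include="object")),
    remainder=cont_preprocessing)

Xt = preprocess.fit_transform(df)
# Three one-hot columns (a, b, NA) plus one scaled continuous column.
print(Xt.shape)
```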

Pipeline syntax (ColumnTransformer inside a Pipeline) \
`make_pipeline(preprocess, log_regressor)`

GridSearchCV syntax \
`GridSearchCV(make_pipeline_output_object, param_grid=param_grid, cv=KFold(n_splits=5,shuffle=True))`
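
A fuller sketch of the grid search, on synthetic data. With `make_pipeline`, step names are the lowercased class names, so parameters in `param_grid` are addressed as `<stepname>__<param>`. The grid values and the `roc_auc` scoring choice here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression())
print(list(pipe.named_steps))  # ['standardscaler', 'logisticregression']

# Keys follow the "<step name>__<parameter>" convention.
param_grid = {"logisticregression__C": [0.1, 1, 10]}

search = GridSearchCV(
    pipe,
    param_grid=param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
search.fit(X, y)

print(search.best_params_, search.best_score_)
```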


```
# function to separate continuous and categorical features
def separate_cont_cat(df):
    continuous_features = df.select_dtypes(include='number').columns
    categorical_features = df.select_dtypes(exclude='number').columns
    return continuous_features, categorical_features
```

```
# function to return X, y (training features, labels)
def split_df_x_y(df, target_col):
    cols = df.columns
    feature_cols = [col for col in cols if col != target_col]
    return df[feature_cols], df[target_col]
```
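
A short usage sketch for the two helpers on a made-up frame (column names are hypothetical). The helpers are repeated here only so the block runs on its own. Splitting off the target first keeps it out of the feature lists.

```python
import pandas as pd

def separate_cont_cat(df):
    continuous_features = df.select_dtypes(include='number').columns
    categorical_features = df.select_dtypes(exclude='number').columns
    return continuous_features, categorical_features

def split_df_x_y(df, target_col):
    feature_cols = [col for col in df.columns if col != target_col]
    return df[feature_cols], df[target_col]

# Hypothetical toy data with a numeric target "price".
df = pd.DataFrame({
    "age": [3, 10, 7],
    "city": ["a", "b", "a"],
    "price": [100.0, 60.0, 80.0],
})

X, y = split_df_x_y(df, "price")
cont, cat = separate_cont_cat(X)
print(list(cont), list(cat))
```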