## General outline:

- load the data, note the size and composition (continuous vs categorical features)
- set prediction goals, decide on a metric to evaluate the model on (e.g. ROC AUC, adjusted R^2)
- broadly segregate continuous vs categorical variables
- visualize the data to get a feel of it
  1. visualize the target (histogram/bar for regression/classification) - note if you may need to apply any transformations
  2. visualize the univariate distributions of the features (histograms, bars)
  3. visualize target vs feature value distributions
      - for **classification**
        - *categorical variables* - by doing nested bar plots with outer class being the target
        - *continuous variables* - by plotting histograms with distributions colored by class
      - for **regression**
        - *categorical variables* - box plots by category for different categorical variables
        - *continuous variables* - good ol' scatter plots
- get quick summary stats for continuous variables and value counts for categorical variables
- treatment of invalid/missing data (including in the target)
  - drop columns like IDs, dates (if derived info like day/month/year is also irrelevant), and constant-valued columns
  - remove extreme values in case of regression targets (use a boxplot to spot them)
  - use imputers to fill in missing values (e.g. `SimpleImputer`/`KNNImputer`; may need to encode categorical features first)
- treatment of categorical variables, for example `OrdinalEncoder`, `OneHotEncoder`, `category_encoders.TargetEncoder` (needs to go separately and first in the pipeline)
- treatment of continuous variables:
  - definitely scale for KNN, Kernel SVM: `StandardScaler` or `MinMaxScaler` ftw
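
The cleaning steps in the outline can be sketched with plain pandas. This is a minimal illustration on a made-up frame: the column names (`id`, `const`, `size`, `price`) are hypothetical, the 1.5 * IQR rule is the usual boxplot-whisker convention, and the median fill stands in for `SimpleImputer(strategy="median")`.

```python
import numpy as np
import pandas as pd

# Toy frame; all column names here are hypothetical.
df = pd.DataFrame({
    "id": range(8),
    "const": [1] * 8,
    "size": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0, 8.0],
    "price": [10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 500.0, 10.0],  # target
})

# Drop identifier columns and columns holding a single unique value.
df = df.drop(columns=["id"])
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# Remove rows whose target lies outside the boxplot whiskers (1.5 * IQR).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df = df[in_range].copy()

# Fill the remaining missing feature values with the column median.
df["size"] = df["size"].fillna(df["size"].median())
```

In a real project the imputation would normally live inside a pipeline so that it is fit on the training folds only.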

### Tree Ensembles
**Random Forests**
  - Main parameter: `max_features`
    - around sqrt(n_features) for classification
    - around n_features for regression
  - `n_estimators` > 100
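
As a rough sketch of these settings, the snippet below fits a random forest classifier on synthetic data from `make_classification` (an assumption made purely for illustration). For regression, the analogous choice would be `RandomForestRegressor(max_features=1.0)`, i.e. all features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=500, n_features=16, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

# max_features="sqrt" considers about sqrt(16) = 4 features per split.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0)
rf.fit(X_train, y_train)
test_accuracy = rf.score(X_test, y_test)
```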

Rule of thumb for when to use tree-based models:
- model non-linear relationships
- don’t care about scaling, no need for feature engineering
- random forests are very robust, good benchmark
- **gradient boosting** will often give the best performance with careful tuning (*early stopping, learning rate, regularization, max_features*, pruning via *max_depth*)
- LightGBM Interface:
```
    from lightgbm import LGBMClassifier

    lgbm = LGBMClassifier()
    lgbm.fit(X_train, y_train)
    lgbm.score(X_test, y_test)
```
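
The tuning knobs listed for gradient boosting (early stopping, learning rate, `max_features`, `max_depth`) can be tried without LightGBM. The sketch below swaps in scikit-learn's own `GradientBoostingClassifier` on synthetic data; it is not the LightGBM API, and the specific values are placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

gb = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on boosting rounds
    learning_rate=0.1,
    max_depth=3,              # shallow trees act as pruning
    max_features="sqrt",
    validation_fraction=0.2,  # held out internally for early stopping
    n_iter_no_change=10,      # stop after 10 rounds without improvement
    random_state=0,
)
gb.fit(X_train, y_train)
rounds_used = gb.n_estimators_  # rounds actually fitted
test_accuracy = gb.score(X_test, y_test)
```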

##### More miscellaneous tips
- LinearSVC, LogisticRegression: `dual=False` if `n_samples` >> `n_features`
- `LogisticRegression(solver="sag")` for `n_samples` large.
- Stochastic Gradient Descent for `n_samples` really large
- tip on solvers: https://stackoverflow.com/questions/38640109/logistic-regression-python-solvers-definitions
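- multinomial logistic regression: `LogisticRegression(solver='lbfgs', multi_class='multinomial').fit(X, y)` (in recent scikit-learn releases `multi_class` is deprecated, and multinomial is already what `lbfgs` does by default for multiclass targets)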
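
For the "really large `n_samples`" case, one option is `SGDClassifier` trained in mini-batches with `partial_fit`. The sketch below assumes synthetic data and a batch size of 500 chosen arbitrarily; with the default hinge loss it behaves like a linear SVM.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data, only for illustration; SGD is sensitive to feature scale.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

sgd = SGDClassifier(random_state=0)  # default hinge loss: a linear SVM
classes = np.unique(y)               # partial_fit needs all classes up front

# Feed the data in chunks, as one would when it does not fit in memory.
for start in range(0, len(X), 500):
    batch = slice(start, start + 500)
    sgd.partial_fit(X[batch], y[batch], classes=classes)

train_accuracy = sgd.score(X, y)
```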

## TODO
P0
- complete feature engineering + tuning cycle with random forest, lightGBM
- finish the general outline to include modeling, cross-validation, and evaluation
- choosing an evaluation metric for regression - RMSE vs r^2

P1
- write helper functions for: visualization (see cases above), maybe using seaborn if seems quicker

### imports
(A reminder of where to look in the sklearn docs for the right API.)

**general**
```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
```

**missing values imputation**
```
from sklearn.impute import MissingIndicator
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
```

**feature engineering**
```
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from category_encoders import TargetEncoder
```

**cross-validation and model tuning**
```
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.compose import make_column_transformer, make_column_selector
# plot_confusion_matrix was removed in scikit-learn 1.2; use ConfusionMatrixDisplay
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
```

**classification models**
```
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from lightgbm.sklearn import LGBMClassifier
```

**target transformation** \
`from sklearn.compose import TransformedTargetRegressor`

**Sklearn API reference** - https://scikit-learn.org/stable/modules/classes.html

### Example usage

loading data \
`df = pd.read_csv()`

examining data
```
df.info()
df.head()
y.unique()
y.nunique()
y.value_counts()
```

train_test_split
```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# make y boolean when it's not (e.g. -1/1)
# this allows sklearn to determine the positive class more easily
X_train, X_test, y_train, y_test = train_test_split(X, y == '1', random_state=0)
```
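
For imbalanced classification targets it can help to pass `stratify=y`, which keeps the class proportions the same in both splits. A small sketch on made-up arrays (10% positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 10 positives and 90 negatives.
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 10 + [0] * 90)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Both splits keep the 10% positive rate.
train_pos_rate = y_train.mean()
test_pos_rate = y_test.mean()
```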

boxplot \
`plt.boxplot(y, vert=False)`

apply function \
`df["column_name"].apply(func)`

OneHotEncoder 
```
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(categories=[['first','second','third','fourth']], drop="if_binary")
X = [['third'], ['second'], ['first']]
enc.fit(X)
print(enc.transform([['second'], ['first'], ['third'],['fourth']]))
```

OrdinalEncoder
```
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[['first','second','third','fourth']])
X = [['third'], ['second'], ['first']]
enc.fit(X)
print(enc.transform([['second'], ['first'], ['third'],['fourth']]))
```

Target transformer
```
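from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# fit on log(y); predictions are mapped back with exp
log_regressor = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp)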
```
This returns an estimator: call `.fit(X, y)` and `.predict(X)` on it as usual, with `y` on its original scale.
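
A quick way to convince yourself that the target transformation happens behind the scenes is to fit on a target that is exactly log-linear. The data below is synthetic and the coefficients (1.0 and 2.0) are arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.exp(1.0 + 2.0 * X[:, 0])  # exactly log-linear, strictly positive

log_regressor = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp)
log_regressor.fit(X, y)          # LinearRegression sees log(y)

y_pred = log_regressor.predict(X)            # already back on the original scale
slope = log_regressor.regressor_.coef_[0]    # slope learned in log space
```

The fitted inner model is available as `regressor_`, which is where the log-space coefficients live.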

ColumnTransformer syntax <-> Pipeline syntax

```
cat_preprocessing = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='NA'),
    OneHotEncoder(handle_unknown='ignore'))
cont_preprocessing = make_pipeline(
    SimpleImputer(),
    StandardScaler())
```
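*Note the use of `make_column_selector` to pick columns by dtype:*
```
from sklearn.compose import make_column_selector, make_column_transformer

# object-dtype columns get the categorical steps; everything else the continuous ones
preprocess = make_column_transformer(
    (cat_preprocessing, make_column_selector(dtype_include='object')),
    remainder=cont_preprocessing)
```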

Pipeline syntax <-> ColumnTransformer syntax \
`make_pipeline(preprocess, log_regressor)`
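
Putting the pieces together end to end might look like the sketch below. The toy frame and its column names (`color`, `size`) are hypothetical, and `LogisticRegression` is used as the final estimator only to keep the example a classifier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data; column names are hypothetical. The categorical column is kept
# as object dtype so that dtype_include="object" selects it.
X = pd.DataFrame({
    "color": pd.Series(["red", "blue", np.nan, "red", "blue", "red"],
                       dtype=object),
    "size": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0],
})
y = np.array([0, 1, 0, 0, 1, 1])

cat_preprocessing = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="NA"),
    OneHotEncoder(handle_unknown="ignore"))
cont_preprocessing = make_pipeline(SimpleImputer(), StandardScaler())
preprocess = make_column_transformer(
    (cat_preprocessing, make_column_selector(dtype_include="object")),
    remainder=cont_preprocessing)

model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)

X_transformed = model[0].transform(X)  # 3 one-hot columns + 1 scaled column
predictions = model.predict(X)
```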

GridSearchCV syntax \
`GridSearchCV(make_pipeline_output_object, param_grid=param_grid, cv=KFold(n_splits=5,shuffle=True))`
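
When the estimator is a pipeline built with `make_pipeline`, the keys of `param_grid` take the form `<step name>__<parameter>`, where the step name is the lowercased class name. A small sketch on synthetic data, with an arbitrary grid over `C`:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = make_pipeline(StandardScaler(), LinearSVC(dual=False))
param_grid = {"linearsvc__C": [0.01, 0.1, 1, 10]}  # step name + "__" + parameter

grid = GridSearchCV(
    pipe, param_grid=param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=0))
grid.fit(X, y)

best_C = grid.best_params_["linearsvc__C"]
best_score = grid.best_score_  # mean cross-validated accuracy of the best setting
```

After fitting, `grid.best_estimator_` is the whole pipeline refit on all of `X, y` with the best parameters.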

Common estimator API:
```
.fit(X, y)
.predict(X')
.score(X', y')
.predict_proba(X')
```

### Evaluation metrics

##### Binary classification
*Threshold-based:*
- accuracy
- precision, recall, f1

*Ranking:*
- average precision
- ROC AUC
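
The difference between the two families is that threshold-based metrics need hard predictions, while ranking metrics work directly on scores. A tiny hand-checkable sketch (the four labels and scores are made up, and the 0.5 cutoff is an assumption):

```python
from sklearn.metrics import (accuracy_score, average_precision_score,
                             roc_auc_score)

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # e.g. predict_proba(X)[:, 1]

# Threshold-based: turn scores into hard labels first.
y_pred = [int(s >= 0.5) for s in y_score]
accuracy = accuracy_score(y_true, y_pred)                  # 0.75

# Ranking: use the scores themselves, no threshold involved.
roc_auc = roc_auc_score(y_true, y_score)                   # 0.75
avg_precision = average_precision_score(y_true, y_score)   # 0.8333...
```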

##### Multiclass classification
*Threshold-based:*
- accuracy
- precision, recall, f1 (macro average, weighted)

*Ranking:*
- OVR ROC AUC
- OVO ROC AUC
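
For the multiclass variants, the averaging scheme is passed as an argument. The sketch below uses the iris dataset and a plain `LogisticRegression`, both chosen only for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_proba = clf.predict_proba(X_test)  # one column per class

# Threshold-based metrics: choose how per-class scores are averaged.
f1_macro = f1_score(y_test, y_pred, average="macro")
f1_weighted = f1_score(y_test, y_pred, average="weighted")

# Ranking metrics: one-vs-rest or one-vs-one on the full probability matrix.
auc_ovr = roc_auc_score(y_test, y_proba, multi_class="ovr")
auc_ovo = roc_auc_score(y_test, y_proba, multi_class="ovo")
```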


**sklearn**

*metrics in cross-validation*
```
from sklearn.model_selection import cross_val_score
explicit_accuracy = cross_val_score(rf, X, y, scoring="accuracy")
```
*common metrics*
```
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             roc_auc_score, f1_score, confusion_matrix,
                             classification_report)

# y_pred = estimator.predict(X_test)
recall_score(y_test, y_pred)
precision_score(y_test, y_pred)
confusion_matrix(y_test, y_pred)
f1 = f1_score(y_test, y_pred)  # don't rebind the name f1_score
# ROC AUC needs positive-class scores or probabilities, not hard predictions
roc_auc = roc_auc_score(y_test, estimator.predict_proba(X_test)[:, 1])
print(classification_report(y_test, y_pred))
```

Plotting the PR curve:
```
from sklearn.metrics import PrecisionRecallDisplay
display = PrecisionRecallDisplay.from_estimator(
    classifier, X_test, y_test, name="LinearSVC"
)
_ = display.ax_.set_title("2-class Precision-Recall curve")
```
Plotting the ROC curve:
```
from sklearn.metrics import RocCurveDisplay, roc_curve
y_score = clf.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score, pos_label=clf.classes_[1])
roc_display = RocCurveDisplay(fpr=fpr, tpr=tpr).plot()
```


```
# get the final set of transformed feature names and their coefficients
def feature_coefficients(grid_search_cv_object):
    best = grid_search_cv_object.best_estimator_
    coeff = best.named_steps['linearsvc'].coef_[0]  # key is the id of the final estimator
    encoder = best.named_steps['columntransformer'].named_transformers_['onehotencoder']
    feature_names = encoder.get_feature_names_out()  # get_feature_names() before scikit-learn 1.0
    return feature_names, coeff
```

```
# function to separate continuous and categorical features
def separate_cont_cat(df):
    continuous_features = df.select_dtypes(include='number').columns
    categorical_features = df.select_dtypes(exclude='number').columns
    return continuous_features, categorical_features
```

```
# function to return X, y (training features, labels)
def split_df_x_y(df, target_col):
    cols = df.columns
    feature_cols = [col for col in cols if col != target_col]
    return df[feature_cols], df[target_col]
```