## Machine Learning

### Model exploration with PyCaret

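Start with PyCaret to compare baseline regression models quickly. The cells below follow the official regression tutorial:
https://github.com/pycaret/pycaret/blob/master/tutorials/Tutorial%20-%20Regression.ipynb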

In [None]:
data = training_df

# import pycaret regression and init setup
from pycaret.regression import *
s = setup(data, target = 'charges', session_id = 123)

# alternative: object-oriented API (equivalent to the functional setup above; only one is needed)
from pycaret.regression import RegressionExperiment
exp = RegressionExperiment()

# init setup on exp with the same target and seed
exp.setup(data, target = 'charges', session_id = 123)

# compare baseline models
best = compare_models()

# compare models using OOP
# exp.compare_models()

In [None]:
# evaluate and interpret the best model
# plot residuals
plot_model(best, plot = 'residuals')

# plot error
plot_model(best, plot = 'error')

# plot feature importance
plot_model(best, plot = 'feature')

# open the interactive evaluation widget for the best model
evaluate_model(best)

# train lightgbm model
lightgbm = create_model('lightgbm')

# interpret summary model
interpret_model(lightgbm, plot = 'summary')

In [None]:
# predict on the hold-out set created by setup()
holdout_pred = predict_model(best)

# show predictions df
holdout_pred.head()

# copy data and drop charges
new_data = data.copy()
new_data.drop('charges', axis=1, inplace=True)
new_data.head()

# predict model on new_data
predictions = predict_model(best, data = new_data)
predictions.head()


# save pipeline
save_model(best, 'my_first_pipeline')

# load pipeline
loaded_best_pipeline = load_model('my_first_pipeline')
loaded_best_pipeline

In [None]:
# hyperparameter tuning and ensembling

# train a dt model with default params
dt = create_model('dt')

# tune hyperparameters of dt
tuned_dt = tune_model(dt)

# define tuning grid
dt_grid = {'max_depth' : [None, 2, 4, 6, 8, 10, 12]}

# tune model with custom grid and metric = MAE
tuned_dt = tune_model(dt, custom_grid = dt_grid, optimize = 'MAE')

# tune dt using optuna
tuned_dt = tune_model(dt, search_library = 'optuna')

# ensemble with bagging
ensemble_model(dt, method = 'Bagging')

# ensemble with boosting
ensemble_model(dt, method = 'Boosting')

In [None]:
# top 3 models based on MAE
best_mae_models_top3 = compare_models(n_select = 3, sort = 'MAE')

# blending models 
blend_models(best_mae_models_top3)

In [None]:
# find best model based on CV metrics
# returns the best model out of all trained models in the current setup based on the optimize parameter
automl() 

In [None]:
# dashboard function
dashboard(dt, display_format ='inline')

In [None]:
# finalize the model 
final_best = finalize_model(best)

# save the finalized model
# save_model(final_best, 'my_first_model')

# load model
# loaded_from_disk = load_model('my_first_model')
# loaded_from_disk



### Basic models to try 
* Linear Regression - use scikit-learn to implement
* Logistic Regression (classification only)
* Support Vector Machines
* Basic decision trees
* Naive Bayes (classification only)

### XGBoost 
Gradient-boosted tree ensemble that performs well on tabular, numerical, low-dimensional data. It is fast and scalable, and its hyperparameters can be tuned to optimize performance.
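
To make the boosting idea concrete, here is a minimal pure-Python sketch of gradient boosting with one-split trees (stumps) on a single feature and squared-error loss. It is not XGBoost itself, which adds regularization, second-order gradients and many optimizations; the function names, the toy data, and the defaults `n_rounds=50` and `learning_rate=0.3` are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def fit_boosted_stumps(xs, ys, n_rounds=50, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        preds = [p + learning_rate * (left_value if x <= threshold else right_value)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken stump outputs on top of the base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# toy data (made up for illustration): y = x^2
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [x ** 2 for x in xs]

model = fit_boosted_stumps(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Each round corrects only a fraction (`learning_rate`) of the remaining error, which is the main knob that trades training speed against overfitting in boosted ensembles.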

### ResNet-like architecture
Neural network architecture adapted for tabular data. Residual (skip) connections let the network learn from both shallow and deep features. Use PyTorch or TensorFlow to implement. Less interpretable than tree-based models.

### Transformer 
Open question: can Transformers be applied to this tabular data?

### KNN - K Nearest Neighbors
Predicts the target from the k training samples whose features are closest to the query point. Works for both regression and classification, but prediction is computationally expensive on large datasets because a brute-force search computes the distance to every training sample.
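
A minimal brute-force sketch of the idea, using only the standard library. The helper names (`euclidean`, `knn_predict`), the `task` switch, and the toy data are assumptions for illustration; in practice scikit-learn's `KNeighborsRegressor` and `KNeighborsClassifier` would be the usual choice.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k=3, task="regression"):
    """Predict from the k nearest training samples (brute-force search)."""
    # one distance computation per training sample, which is why prediction is costly
    neighbors = sorted(zip(train_X, train_y),
                       key=lambda pair: euclidean(pair[0], query))[:k]
    targets = [target for _, target in neighbors]
    if task == "regression":
        return sum(targets) / len(targets)      # average of neighbor targets
    return Counter(targets).most_common(1)[0][0]  # majority label


# toy data (made up for illustration)
train_X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
train_y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
train_labels = ["low", "low", "low", "high", "high", "high"]

reg_pred = knn_predict(train_X, train_y, [2.5], k=2)
cls_pred = knn_predict(train_X, train_labels, [10.5], k=3, task="classification")
print(reg_pred, cls_pred)
```

Because distances depend on feature scale, features are usually standardized before KNN is applied.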

### Model Evaluation
* Cross-validation
* Classification Accuracy
* Confusion Matrix
* ROC curve and AUC
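
The first three items can be sketched without any library. The snippet below builds a binary confusion matrix, derives accuracy from it, and generates index splits for k-fold cross-validation. The function names and the labels are made up for illustration, and these metrics apply to classification targets; scikit-learn provides production versions of all three.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def accuracy(cm):
    """Share of correct predictions: (TP + TN) / total."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# hypothetical labels and predictions
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)
folds = list(k_fold_indices(n_samples=8, k=4))
print(cm, acc, folds[0])
```

In a real run the data would be shuffled before splitting, and each fold's test indices would be scored with a model trained on the matching train indices.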
  

### Model Fine-tuning
* Hyperparameter tuning
* Ensembling multiple models - voting, bagging, boosting (e.g. XGBoost)
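
Both ideas can be sketched in a few lines of standard-library Python: an exhaustive grid search over hyperparameter combinations, and hard voting across several classifiers. `toy_score` is a hypothetical stand-in for a cross-validated score, and the grid values and model predictions are made up; none of this is PyCaret's or scikit-learn's implementation.

```python
from collections import Counter
from itertools import product


def grid_search(score_fn, grid):
    """Try every combination in the grid and keep the highest-scoring one."""
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(max_depth, learning_rate):
    """Hypothetical score surface that peaks at max_depth=4, learning_rate=0.1."""
    return -((max_depth - 4) ** 2) - abs(learning_rate - 0.1)


def majority_vote(predictions_per_model):
    """Hard voting: for each sample, pick the label most models agree on."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*predictions_per_model)]


grid = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
best_params, best_score = grid_search(toy_score, grid)

# made-up predictions from three classifiers on four samples
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
voted = majority_vote([model_a, model_b, model_c])
print(best_params, voted)
```

Grid search cost grows with the product of the grid sizes, which is why random or Bayesian search (for example Optuna, used above through `tune_model`) is often preferred for larger search spaces.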

### Feature Importance

**Univariate Feature Selection:** This method uses statistical tests like chi-squared tests or ANOVA to evaluate the relationship between each feature and the target variable independently. It ranks the features based on their significance.

In [None]:
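# from sklearn.feature_selection import SelectKBest, chi2, f_classif, f_regression

# # Categorical target, non-negative features (e.g. counts): chi-squared test
# selector = SelectKBest(score_func=chi2, k=5)
# selector.fit_transform(X, y)

# # Categorical target, continuous features: ANOVA F-value
# selector = SelectKBest(score_func=f_classif, k=5)
# selector.fit_transform(X, y)

# # Continuous target (e.g. 'charges'): univariate linear regression F-test
# selector = SelectKBest(score_func=f_regression, k=5)
# selector.fit_transform(X, y)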

# # Get the univariate score of each feature (higher = stronger relationship with the target)
# scores = selector.scores_
# feature_importances = pd.DataFrame({'feature': X.columns, 'importance': scores})
# feature_importances = feature_importances.sort_values('importance', ascending=False)
# print(feature_importances)

In [None]:
# other statistical tests
# import scipy.stats as stats

# # Perform a t-test to compare the means of two groups
# t_stat, p_value = stats.ttest_ind(group1, group2)
# print("T-statistic:", t_stat, "P-value:", p_value)

# # Perform a chi-squared test to determine the association between two categorical variables
# chi_stat, p_value, dof, ex = stats.chi2_contingency(contingency_table)
# print("Chi-squared statistic:", chi_stat, "P-value:", p_value)