## Machine learning project pipeline

How do you dive into a new machine learning project? These notes are my own end-to-end guide.

### 1. Tools for the machine learning approach

- Language: Python
- Libs for data modeling: sklearn, xgboost, ...
- Libs for data analysis: numpy, pandas, ...
- Libs for data visualization: matplotlib, seaborn, ...


In [2]:
import sklearn

from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.ensemble import AdaBoostClassifier, AdaBoostRegressor
from sklearn.ensemble import VotingClassifier, VotingRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, SVR
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.preprocessing import StandardScaler, Normalizer, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, LocallyLinearEmbedding, Isomap, TSNE
from sklearn.cluster import KMeans, SpectralClustering, MiniBatchKMeans  # no GMM here; use GaussianMixture from sklearn.mixture (imported above)
from sklearn.impute import SimpleImputer, KNNImputer

from sklearn.model_selection import StratifiedKFold, StratifiedGroupKFold, StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.model_selection import train_test_split, learning_curve
from sklearn.model_selection import cross_val_score, cross_validate, cross_val_predict
from sklearn.metrics import accuracy_score, f1_score, mean_absolute_error, mean_squared_error
from sklearn.metrics import confusion_matrix, classification_report

from sklearn.neural_network import MLPClassifier, MLPRegressor

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


from xgboost import XGBClassifier, XGBRegressor

from sklearn.pipeline import make_pipeline


Here is a breakdown of some commonly used tools from the imports above, along with their applications, data suitability, pros, and cons:

**1. Classification Algorithms:**

* **DecisionTreeClassifier:**
    * **When to use:** Good for interpretable models and for identifying important features.
    * **Data:** Numerical data; in scikit-learn, categorical features must be encoded first.
    * **Pros:** Interpretable, needs no feature scaling, fairly robust to outliers in the features.
    * **Cons:** Prone to overfitting, high variance.

* **RandomForestClassifier:**
    * **When to use:** Ensemble of decision trees for improved accuracy and reduced variance.
    * **Data:** Numerical data; categorical features must be encoded first.
    * **Pros:** Usually more accurate than a single decision tree, needs no feature scaling.
    * **Cons:** Less interpretable than single decision trees.

* **KNeighborsClassifier:**
    * **When to use:** Good for classification tasks with well-defined clusters.
    * **Data:** Numerical data (feature scaling matters because predictions are distance-based).
    * **Pros:** Simple, effective for certain datasets.
    * **Cons:** High prediction cost on large datasets, sensitive to outliers, suffers from the curse of dimensionality.

* **LogisticRegression:**
    * **When to use:** Linear baseline for classification; most often used for binary classes (0 or 1), but scikit-learn's implementation also supports multiclass targets.
    * **Data:** Numerical data (often with feature scaling).
    * **Pros:** Interpretable coefficients, outputs class probabilities, fast to train.
    * **Cons:** Linear decision boundary, so it may not capture complex relationships without feature engineering.

* **GaussianNB:**
    * **When to use:** Fast baseline for large datasets when features are roughly Gaussian within each class.
    * **Data:** Continuous numerical data (feature scaling is not required).
    * **Pros:** Very fast to train and predict, few hyperparameters.
    * **Cons:** Assumes features are conditionally independent given the class, so it may not suit strongly correlated features or complex relationships.

* **SVC (Support Vector Classifier):**
    * **When to use:** Works well on high-dimensional data and when classes can be separated with a clear margin.
    * **Data:** Numerical data (feature scaling is strongly recommended).
    * **Pros:** Effective for high-dimensional data; kernels can model non-linear boundaries.
    * **Cons:** Training becomes expensive as the number of samples grows, less interpretable, sensitive to the choice of `C` and kernel.

* **XGBClassifier:**
    * **When to use:** Gradient-boosted tree ensemble; often a strong choice on tabular data.
    * **Data:** Numerical data; categorical features need encoding or XGBoost's categorical support (`enable_categorical`).
    * **Pros:** Often highly accurate, handles missing values natively, offers regularization.
    * **Cons:** Can be computationally expensive, less interpretable than some algorithms.

**2. Regression Algorithms:**

* **DecisionTreeRegressor:**
    * **When to use:** Similar to DecisionTreeClassifier for regression tasks.
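    * **Data:** Numerical data; in scikit-learn, categorical features must be encoded first.
    * **Pros:** Interpretable, needs no feature scaling; recent scikit-learn releases can also handle missing values natively.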
    * **Cons:** Prone to overfitting, high variance.

* **RandomForestRegressor:**
    * **When to use:** Ensemble of decision trees for improved accuracy and reduced variance in regression.
    * **Data:** Categorical or numerical data.
    * **Pros:** More accurate than single decision trees, handles mixed data types.
    * **Cons:** Less interpretable than single decision trees.

* **KNeighborsRegressor:**
    * **When to use:** Similar to KNeighborsClassifier for regression tasks.
    * **Data:** Numerical data.
    * **Pros:** Simple, effective for certain datasets.
    * **Cons:** High prediction cost on large datasets, sensitive to outliers, suffers from the curse of dimensionality.

* **LinearRegression:**
    * **When to use:** Models linear relationships between features and target variable.
    * **Data:** Numerical data (scaling is not required for ordinary least squares, but it matters for the regularized `Lasso` and `Ridge`).
    * **Pros:** Interpretable coefficients, simple to understand.
    * **Cons:** Assumes linear relationships, may not be suitable for complex relationships.

* **MLPRegressor (Multi-layer Perceptron):**
    * **When to use:** Powerful for non-linear relationships, can learn complex patterns.
    * **Data:** Numerical data (feature scaling is strongly recommended).
    * **Pros:** Handles non-linear relationships, flexible architecture.
    * **Cons:** Prone to overfitting, requires careful hyperparameter tuning.

* **XGBRegressor:**
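    * **When to use:** Gradient-boosted tree ensemble; often a strong choice for regression on tabular data.
    * **Data:** Numerical data; categorical features need encoding or XGBoost's categorical support (`enable_categorical`).
    * **Pros:** Often highly accurate, handles missing values natively, offers regularization.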
    * **Cons:** Can be computationally expensive, less interpretable than some algorithms.


**3. Preprocessing Tools:**

* **OneHotEncoder:**
    * **Pros:** Avoids introducing artificial ordering; works well for low-cardinality categorical features.
    * **Cons:** Increases feature dimensionality, which becomes a problem for high-cardinality features.

* **OrdinalEncoder:**
    * **Pros:** Preserves some ordinal information of categorical features.
    * **Cons:** Assumes a natural ordering exists for categories, may not be suitable for all categorical data.

* **LabelEncoder:**
    * **Pros:** Simple and efficient for encoding the target labels `y` as integers.
    * **Cons:** Intended for targets, not input features; using it on features introduces an artificial ordering (use `OrdinalEncoder` or `OneHotEncoder` for features instead).

* **PCA (Principal Component Analysis):**
    * **When to use:** Reduce dimensionality of data while preserving most of the variance.
    * **Data:** Numerical data.
    * **Pros:** Reduces computation time and storage requirements, can help visualize high-dimensional data.
    * **Cons:** May lose some information during dimensionality reduction.
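
To make the difference between the encoders concrete, here is a minimal sketch on a made-up `sizes` column (the data and variable names are assumptions for illustration only). It shows that one-hot encoding adds one column per category, while ordinal encoding keeps a single column whose order you supply yourself.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy column: a "size" feature with a natural order.
sizes = np.array([["small"], ["large"], ["medium"], ["small"]])

# One-hot: one binary column per category, no implied order.
onehot = OneHotEncoder(handle_unknown="ignore")
onehot_matrix = onehot.fit_transform(sizes).toarray()

# Ordinal: a single numeric column; the order is passed explicitly.
ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal_column = ordinal.fit_transform(sizes)

print(onehot.categories_[0])   # learned categories, sorted alphabetically
print(onehot_matrix.shape)     # (4, 3): 4 rows, 3 categories
print(ordinal_column.ravel())  # [0. 2. 1. 0.]
```

If the categories have no natural order, the ordinal codes would be arbitrary, which is the case where one-hot encoding is usually the safer choice.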

**4. Other Important Tools:**

* **KMeans:** Unsupervised clustering algorithm for grouping similar data points. 
* **SimpleImputer:** Handles missing values in datasets.
* **StratifiedKFold, StratifiedGroupKFold, StratifiedShuffleSplit:** Cross-validation splitters that preserve the class proportions in every split, which is especially useful for imbalanced classification problems.
* **GridSearchCV, RandomizedSearchCV:** Hyperparameter tuning techniques for optimizing model performance.
* **Pipeline:** Combines data preprocessing and modeling steps into a single workflow.
* **ColumnTransformer:** Applies different preprocessing techniques to different columns.

**Choosing the Right Tool:**

The selection of the best tool depends on several factors:

* **Problem Type:** Classification, regression, clustering, dimensionality reduction, etc.
* **Data Type:** Numerical, categorical, mixed data.
* **Data Characteristics:** Linearity, relationships between features, presence of outliers, etc.
* **Desired Model Properties:** Interpretability, accuracy, efficiency, etc.

**General Tips:**

* Start with simpler models and explore more complex ones if needed.
* Experiment with different algorithms and preprocessing techniques.
* Use cross-validation to evaluate model performance on unseen data.
* Consider the trade-off between accuracy and interpretability.


By understanding these tools and their use cases, you can effectively build and improve your machine learning models using scikit-learn.


### 2. Dataset for the ML approach

- Crawl data from the internet
- Use given data
- Create new data patterns

### 3. Understand the limits of the data

- Understand the business problem

### 4. Analyze data (basic: do not edit or modify anything)

- EDA: Exploratory Data Analysis
- Basic statistics
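
A minimal read-only EDA pass could look like the sketch below. The `df` table and its column names are invented placeholders standing in for the real project data; nothing is modified, only inspected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for the real project data.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38],
    "income": [40000, 52000, 61000, np.nan, 58000],
    "segment": ["a", "b", "a", None, "b"],
    "churned": [0, 1, 0, 1, 0],
})

print(df.shape)                                   # rows, columns
print(df.dtypes)                                  # type of each column
print(df.describe())                              # basic statistics for numeric columns
print(df["segment"].value_counts(dropna=False))   # category frequencies, including missing

missing_per_column = df.isna().sum()              # count of missing values per column
print(missing_per_column)
```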

### 4.1. Visualize data

- Data correlation
- Data distribution
- Missing data
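
The three views above can be sketched with plain matplotlib as follows. The synthetic columns `x1`, `x2`, `x3` are assumptions made up for the example; seaborn's `heatmap` and `histplot` would work just as well for the same plots.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical synthetic data: x2 is correlated with x1, x3 has missing values.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200)})
df["x2"] = 2 * df["x1"] + rng.normal(scale=0.5, size=200)
df["x3"] = rng.uniform(size=200)
df.loc[rng.choice(200, size=20, replace=False), "x3"] = np.nan

corr = df.corr()            # pairwise correlation matrix
missing = df.isna().sum()   # missing values per column

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# 1) Correlation heatmap
im = axes[0].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[0].set_xticks(range(len(corr.columns)))
axes[0].set_xticklabels(corr.columns)
axes[0].set_yticks(range(len(corr.columns)))
axes[0].set_yticklabels(corr.columns)
axes[0].set_title("Correlation")
fig.colorbar(im, ax=axes[0])

# 2) Distribution of one feature
axes[1].hist(df["x1"], bins=20)
axes[1].set_title("Distribution of x1")

# 3) Missing values per column
axes[2].bar(missing.index, missing.values)
axes[2].set_title("Missing values per column")

fig.tight_layout()
```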


### 5. Data processing

- Advanced EDA
- Missing-data processing: imputation techniques
- Feature engineering: create new features
- Feature selection
- Feature scaling: `StandardScaler` (standardize), `MinMaxScaler`, `Normalizer`
- Encoding techniques: one-hot, ordinal
- Optimize features for the models
- Data wrangling
- Finish everything related to the data at this point, because we focus on hyperparameters later.
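
One way to bundle imputation, scaling, and encoding is a `ColumnTransformer`. The sketch below is an illustration on an invented table `X` (column names are assumptions), not the only way to organize these steps.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature table with missing values in every column.
X = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 41.0],
    "income": [40.0, 52.0, 61.0, np.nan],
    "segment": ["a", "b", np.nan, "a"],
})

# Numeric columns: fill gaps with the median, then standardize.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, ["age", "income"]),
    ("cat", categorical, ["segment"]),
])

X_ready = preprocess.fit_transform(X)
print(X_ready.shape)  # 2 scaled numeric columns + 2 one-hot columns
```

Keeping the steps inside one object means the exact same transformations can later be fit on the training split and reused on the test split.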


### 5.1. Visualize data again and see the differences



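### 5.2. Split data

- Train/test split
- Split the data after handling missing data, and do not try to reverse the process afterwards. To avoid leakage, anything that learns statistics from the data (imputers, scalers) should be fit on the training portion only, for example inside a `Pipeline`.
- Stratified K-fold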
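
A minimal sketch of a stratified hold-out split followed by stratified K-fold on the training part. The imbalanced toy arrays `X` and `y` (80/20 class ratio) are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Hypothetical imbalanced dataset: 80 samples of class 0, 20 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# Hold out 20% for testing; stratify=y keeps the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(y_train.mean(), y_test.mean())  # share of class 1 in each part

# Stratified K-fold on the training part for model validation.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X_train, y_train)):
    fold_sizes.append(len(val_idx))
    print(fold, len(train_idx), len(val_idx), y_train[val_idx].mean())
```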


### 6. Modeling data (basic)

- Based on experience, try as many pre-built ML models as possible on the raw data.
- Basic model selection
- Evaluate models with default hyperparameters (no tuning yet)
- Model the data without hesitation; complex techniques on the data are not required in this round.
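
The "try many models with defaults" step can be sketched as a loop over untuned estimators scored with cross-validation. The synthetic dataset from `make_classification` and the particular model list are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Baseline candidates with (mostly) default hyperparameters.
baselines = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "forest": RandomForestClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
    "nb": GaussianNB(),
}

scores = {}
for name, model in baselines.items():
    # 5-fold cross-validated accuracy for each untuned model.
    cv_scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    scores[name] = cv_scores.mean()
    print(f"{name:8s} accuracy = {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```

The point of this round is only to rank candidates roughly; the best one or two move on to tuning.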

### 7. Evaluate and visualize the model performance

- Visualize model performance
- Monitor model metrics for tuning later
- Plot the model metrics against the parameters to get a better picture
- Analyze those curves and decide on the next move.
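
Plotting a metric against a parameter can be done with scikit-learn's `validation_curve`. The sketch below assumes a decision tree and its `max_depth` as the parameter of interest, on synthetic data; both choices are only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the project data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Score the model for each candidate depth with 5-fold cross-validation.
depths = np.arange(1, 11)
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring="accuracy",
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)

fig, ax = plt.subplots()
ax.plot(depths, train_mean, marker="o", label="train")
ax.plot(depths, val_mean, marker="o", label="validation")
ax.set_xlabel("max_depth")
ax.set_ylabel("accuracy")
ax.set_title("Validation curve")
ax.legend()
```

A widening gap between the train and validation curves is the usual sign of overfitting; the depth where the validation curve flattens is a reasonable starting point for tuning.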

### 8. Advanced: fine-tuning and hyperparameter tuning

- Try to apply the whole skill set to the model
- Cross-validation
- Confusion matrix
- Keep tracking the model metrics to decide on better moves
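
A minimal tuning sketch that combines cross-validated grid search with a confusion matrix on held-out data. The small `param_grid`, the random forest, and the synthetic data are assumptions for the example; a real search would use ranges suited to the model and dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for the project data.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Hypothetical, deliberately small grid to keep the example fast.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring="f1",
)
search.fit(X_train, y_train)  # tuning only ever sees the training split

print(search.best_params_, round(search.best_score_, 3))

# Final check on data the search never saw.
y_pred = search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(classification_report(y_test, y_pred))
```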

### 9. Keep tuning hyperparameters until the model captures the patterns in the data

- Repeat the process: evaluate -> validate -> visualize -> evaluate
- Keep going until everything reaches its limits.

### 10. Pipeline

In [5]:

# make_pipeline takes the steps as positional arguments, not as a list, and
# only the last step may be a model; every earlier step must be a transformer.
# So build one pipeline per candidate model instead of chaining the models.
models = [
    RandomForestClassifier(random_state=100, n_estimators=100),
    XGBClassifier(n_estimators=100),
    KNeighborsClassifier(),
    GaussianNB(),
    # XGBRegressor(n_estimators=100),
]

pipelines = {
    type(model).__name__: make_pipeline(
        # Use either OneHotEncoder or OrdinalEncoder on a column, not both.
        # sparse_output=False (scikit-learn >= 1.2) keeps the output dense,
        # which GaussianNB requires.
        OneHotEncoder(handle_unknown='ignore', sparse_output=False),
        model,
        verbose=True,
    )
    for model in models
}

for name, pipe in pipelines.items():
    print(name, '->', [step_name for step_name, _ in pipe.steps])



RandomForestClassifier -> ['onehotencoder', 'randomforestclassifier']
XGBClassifier -> ['onehotencoder', 'xgbclassifier']
KNeighborsClassifier -> ['onehotencoder', 'kneighborsclassifier']
GaussianNB -> ['onehotencoder', 'gaussiannb']