## Step 1: Data Preprocessing

### Initial steps
- Handle Missing Values: Use strategies like mean or median imputation.
- Encoding: Convert categorical variables into a format suitable for ML models, using techniques like one-hot encoding.
- Feature Scaling: Normalize or standardize features.
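
As a minimal sketch of the three steps above applied one at a time, assume a hypothetical toy DataFrame (`toy`, with made-up `age` and `city` columns) that contains a missing value. The column names and values are illustrative only and are not part of the sample data used later.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (assumption): one numeric column with a gap, one categorical column
toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 50.0],
    "city": ["A", "B", "A", "B"],
})

# 1. Handle missing values: replace the NaN with the median of the observed values (30.0)
age_imputed = SimpleImputer(strategy="median").fit_transform(toy[["age"]])

# 2. Encoding: one-hot encode the categorical column (one output column per category)
city_encoded = OneHotEncoder(handle_unknown="ignore").fit_transform(toy[["city"]]).toarray()

# 3. Feature scaling: standardize to zero mean and unit variance
age_scaled = StandardScaler().fit_transform(age_imputed)

print(age_imputed.ravel())
print(city_encoded)
print(age_scaled.ravel())
```

In practice these steps are usually bundled into a single `ColumnTransformer`, so that the same fitted transformations are applied to both training and test data.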

### What the function below does
- 🧠 _The `preprocess_data` function imputes missing values (the sample data has none), one-hot encodes the categorical variables, and standardizes the numerical variables. It also splits the data into training and test sets, fitting the preprocessing on the training set only and then applying it to the test set._


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
import pandas as pd

def preprocess_data(df, target_column):
    """
    Preprocesses the data: handles missing values, encodes categorical variables, and scales numerical variables.
    
    Args:
    - df (pandas.DataFrame): The input DataFrame.
    - target_column (str): The target variable column name.
    
    Returns:
    - X_train, X_test, y_train, y_test: Preprocessed data split into training and test sets.
    """
    
    # Separate target variable and features
    X = df.drop(target_column, axis=1)
    y = df[target_column]
    
    # Identify categorical and numerical columns by dtype
    # (select_dtypes also picks up numeric types other than int64/float64)
    categorical_cols = X.select_dtypes(include='object').columns.tolist()
    numerical_cols = X.select_dtypes(include='number').columns.tolist()
    
    # Preprocessing for numerical data: imputation and scaling
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    
    # Preprocessing for categorical data: imputation and one-hot encoding
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])
    
    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Split data into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    
    # Preprocessing data using the defined transformers
    X_train = preprocessor.fit_transform(X_train)
    X_test = preprocessor.transform(X_test)
    
    return X_train, X_test, y_train, y_test

# Let's assume a sample dataset for demonstration (a more detailed dataset would be ideal for testing)
sample_data = {
    'Feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'Feature2': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'B', 'A', 'B'],
    'Target': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
}
sample_df = pd.DataFrame(sample_data)

X_train, X_test, y_train, y_test = preprocess_data(sample_df, 'Target')
# X_train[:5], y_train[:5]  # Displaying the first 5 rows for brevity

## Step 2: Model Selection and Evaluation

### Initial steps
- Create a pool of candidate models.
- Train each model on the dataset.
- Evaluate each model's performance.
- Rank models based on performance.





### Notes
- For simplicity, we consider three models: _Logistic Regression, Random Forest Classifier, and Gradient Boosting Classifier_. We evaluate performance by accuracy.
- In the run reported here, all three models (Logistic Regression, Random Forest, and Gradient Boosting) achieved an accuracy of 50% on the test set. This isn't surprising: the sample dataset has only ten rows, so the 20% test split contains just two samples. With a larger and more diverse dataset, we would expect to see differences in performance among the models.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

def evaluate_models(X_train, X_test, y_train, y_test):
    """
    Trains a set of models on the training data and evaluates them on the test data.
    
    Args:
    - X_train, X_test, y_train, y_test: Training and test data.
    
    Returns:
    - dict: A dictionary with model names as keys and their accuracy scores as values.
    """
    
    # Define models
    models = {
        'Logistic Regression': LogisticRegression(max_iter=1000),
        'Random Forest': RandomForestClassifier(random_state=0),  # fixed seed for reproducible scores
        'Gradient Boosting': GradientBoostingClassifier(random_state=0)
    }
    
    # Train and evaluate models
    model_scores = {}
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)
        accuracy = accuracy_score(y_test, predictions)
        model_scores[model_name] = accuracy
    
    return model_scores

model_performance = evaluate_models(X_train, X_test, y_train, y_test)
model_performance
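
The function above returns unranked accuracies from a single train/test split. As a hedged sketch of the "rank models" step, the following scores the same three model types with 5-fold cross-validation and sorts them. The synthetic dataset from `make_classification`, the names `X_demo`, `y_demo`, and `candidates`, and the fixed seeds are assumptions made for illustration; they are not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 200 rows, 8 numeric features, binary target
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 5 folds is less noisy than a single small test split
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank from best to worst mean accuracy
ranking = sorted(cv_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Cross-validation is used here because a two-sample test set, as in the sample data, cannot separate the models.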

## Step 3: Hyperparameter Tuning
_For the top-performing models, search for the best hyperparameters using techniques such as grid search or random search._
- 👉🏻 Here we tune just one of the models, the Random Forest Classifier, using a basic grid search over a small parameter grid.

In the run reported here, grid search selected the following parameters for the Random Forest model on the sample dataset. With only eight training rows split across three folds, the selection is driven mostly by noise:

- max_depth: 20
- min_samples_leaf: 2
- min_samples_split: 2
- n_estimators: 10

In [None]:
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X_train, y_train):
    """
    Performs hyperparameter tuning for the Random Forest model using GridSearchCV.
    
    Args:
    - X_train, y_train: Training data.
    
    Returns:
    - dict: Best hyperparameters.
    """
    
    # Define hyperparameters grid
    param_grid = {
        'n_estimators': [10, 50, 100],
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    rf_model = RandomForestClassifier(random_state=0)  # fixed seed so the search is reproducible
    grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, 
                               cv=3, n_jobs=-1, verbose=2)
    
    # Fit the model
    grid_search.fit(X_train, y_train)
    
    return grid_search.best_params_

# Due to the small size of the sample dataset, the grid search will be quick.
best_params = tune_random_forest(X_train, y_train)
best_params
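
Step 3 also mentions random search as an alternative to grid search. A minimal sketch with `RandomizedSearchCV` follows, sampling 10 combinations from the same grid instead of trying all 108. The synthetic data (`X_demo`, `y_demo`), the value `n_iter=10`, and the seeds are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption), larger than the ten-row sample
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Same candidate values as the grid search above
param_distributions = {
    "n_estimators": [10, 50, 100],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10,          # number of sampled combinations (assumption)
    cv=3,
    scoring="accuracy",
    random_state=0,     # makes the sampled combinations reproducible
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(round(random_search.best_score_, 3))
```

Random search trades exhaustiveness for speed, which matters more as the grid grows than it does on a dataset this small.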


## Step 4: Final Model Selection
_With hyperparameter tuning complete, you can retrain your models using the best hyperparameters and then evaluate them on a separate validation set or use cross-validation to determine the final model._
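
One way this final step might look is sketched below: the Random Forest is rebuilt with the parameters reported in Step 3, scored with cross-validation on the training portion, then refit and checked once on a held-out split. The synthetic data, the variable names, and the stratified split are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Parameters reported in Step 3; in practice, use the output of your own search
best_params = {
    "n_estimators": 10,
    "max_depth": 20,
    "min_samples_split": 2,
    "min_samples_leaf": 2,
}

final_model = RandomForestClassifier(random_state=0, **best_params)

# Cross-validated accuracy on the training portion guides the final choice
cv_accuracy = cross_val_score(final_model, X_tr, y_tr, cv=5, scoring="accuracy").mean()

# Refit on all training data, then evaluate once on the held-out split
final_model.fit(X_tr, y_tr)
test_accuracy = accuracy_score(y_te, final_model.predict(X_te))

print(f"CV accuracy: {cv_accuracy:.3f}")
print(f"Held-out accuracy: {test_accuracy:.3f}")
```

Keeping the held-out split out of tuning and model comparison means its score remains an honest estimate of performance on unseen data.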