In [207]:
import numpy as np
import pandas as pd

In [208]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.feature_selection import SelectKBest,chi2
from sklearn.tree import DecisionTreeClassifier

In [209]:
df = pd.read_csv('train.csv')

In [210]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


# Let's Plan

In [211]:
df.drop(columns=['PassengerId','Name','Ticket','Cabin'],inplace=True)

In [212]:
# Step 1 -> train/test/split
X_train,X_test,y_train,y_test = train_test_split(df.drop(columns=['Survived']),
                                                 df['Survived'],
                                                 test_size=0.2,
                                                random_state=42)

In [213]:
X_train.head()

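| | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|
| 331 | 1 | male | 45.5 | 0 | 0 | 28.5 | S |
| 733 | 2 | male | 23.0 | 0 | 0 | 13.0 | S |
| 382 | 3 | male | 32.0 | 0 | 0 | 7.925 | S |
| 704 | 3 | male | 26.0 | 1 | 0 | 7.8542 | S |
| 813 | 3 | female | 6.0 | 4 | 2 | 31.275 | S |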


In [197]:
y_train.sample(5)

492    0
265    0
239    0
386    0
240    0
Name: Survived, dtype: int64

In [214]:
# imputation transformer
trf1 = ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')
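One thing worth checking with `remainder='passthrough'` is the output column order: the transformed columns come first and the passthrough columns follow, so positions shift for any later step that selects columns by index. The sketch below uses a made-up three-row frame (`toy` and `toy_trf1` are illustrative names, not part of the Titanic file) to show the reordering.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy data with the same column layout as X_train
toy = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Sex': ['male', 'female', 'male'],
    'Age': [22.0, np.nan, 30.0],
    'SibSp': [1, 0, 0],
    'Parch': [0, 0, 1],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', np.nan, 'S'],
})

toy_trf1 = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
], remainder='passthrough')

out = toy_trf1.fit_transform(toy)

# Output order: Age, Embarked, then Pclass, Sex, SibSp, Parch, Fare
print(out[1])
```

In the second row, the imputed Age (the mean of 22 and 30) sits at position 0, the imputed Embarked at position 1, and Sex has moved from position 1 to position 3.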


In [215]:
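# one hot encoding
# Note: trf1 outputs the columns in the order
# [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare] (transformed columns first,
# passthrough columns after), so [1,6] here selects Embarked and Fare.
# Encoding Sex and Embarked would need [1,3]. The outputs recorded in this
# notebook come from the code as written.
# scikit-learn >= 1.2 renames `sparse` to `sparse_output`.
trf2 = ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(sparse=False,handle_unknown='ignore'),[1,6])
],remainder='passthrough')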
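To see what `handle_unknown='ignore'` does, here is a small sketch on made-up category values (the rows and the unseen value `'X'` are assumptions for illustration). It keeps the encoder's default sparse output and converts it with `.toarray()`, which avoids depending on the `sparse` / `sparse_output` argument name.

```python
from sklearn.preprocessing import OneHotEncoder

# Fit on two categorical columns: sex and port of embarkation
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit([['male', 'S'], ['female', 'C'], ['male', 'Q']])

# Learned categories are sorted per column
print(enc.categories_)

# 'X' was never seen during fit, so its block is encoded as all zeros
encoded = enc.transform([['female', 'S'], ['male', 'X']]).toarray()
print(encoded)
```

With the default `handle_unknown='error'`, the second row would raise an error at transform time instead.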

In [216]:

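# Scaling
# slice(0,10) scales the first 10 columns; `remainder` defaults to 'drop',
# so any columns beyond those are discarded
trf3 = ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])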
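Because `ColumnTransformer` defaults to `remainder='drop'`, a slice that does not cover every column silently removes the rest. A minimal sketch on a made-up 3-column array (all names here are illustrative):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

data = np.array([[1.0, 10.0, 100.0],
                 [2.0, 20.0, 200.0],
                 [3.0, 30.0, 300.0]])

# Default remainder='drop': the third column disappears
dropped = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 2))
]).fit_transform(data)

# remainder='passthrough': the third column is kept, unscaled
kept = ColumnTransformer([
    ('scale', MinMaxScaler(), slice(0, 2))
], remainder='passthrough').fit_transform(data)

print(dropped.shape, kept.shape)
```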


## SelectKBest and chi2
`SelectKBest` and `chi2` belong to Scikit-learn's `feature_selection` module. `SelectKBest` keeps the k features with the highest scores from a univariate statistical test, and `chi2` (the chi-squared test) is one such scoring function.

### SelectKBest:

#### Key Aspects:
- **Purpose:** SelectKBest is used for selecting K number of features based on their score in univariate statistical tests.
- **Selection Criteria:** It selects the top K features with the highest scores based on the provided scoring function.

### chi2 (Chi-Squared Test):

#### Key Aspects:
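- **Purpose:** The chi2 statistic measures the dependence between each feature and a categorical target. It requires non-negative feature values (such as counts, frequencies, or one-hot indicators) and raises an error on negative input, which is why the scaling step with `MinMaxScaler` comes before feature selection in this notebook.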
- **Usage in SelectKBest:** Often used as the scoring function in SelectKBest for feature selection based on chi-squared scores.

### Implementation Example:

Here's an example showcasing the usage of SelectKBest and chi2 for feature selection:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Load dataset (example with Iris dataset)
iris = load_iris()
X, y = iris.data, iris.target

# Instantiate SelectKBest with chi2 as scoring function and select top 2 features
k_best_features = SelectKBest(score_func=chi2, k=2)

# Fit SelectKBest and transform the dataset
X_new = k_best_features.fit_transform(X, y)

# Print selected features
print("Selected Features:")
print(X_new)
```

### Steps Explained:

1. **Load Dataset:** Load a dataset (in this case, the Iris dataset) with features (`X`) and target (`y`).

2. **Instantiate SelectKBest:** Create a SelectKBest instance, specifying the scoring function (in this case, chi2) and the number of features to select (`k=2`).

3. **Fit and Transform:** Fit the SelectKBest to the dataset (`X` and `y`) and transform it to obtain the subset of selected features.

4. **Display Selected Features:** Print or display the subset of selected features obtained from SelectKBest.

SelectKBest with chi2 ranks each feature independently by its chi-squared score against the target and keeps the top `k`. Because the test is univariate, it does not account for interactions between features, so the selected subset is a starting point rather than a guarantee of better model performance.
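To see which features were kept and why, the fitted selector exposes `scores_` and `get_support()`. A short sketch on the same Iris data (the variable names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=chi2, k=2).fit(X, y)

# One chi2 score per feature, plus a boolean mask of the features that were kept
for name, score, keep in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name:20s} chi2={score:8.2f} selected={keep}")

selected_names = [n for n, keep in zip(iris.feature_names, selector.get_support()) if keep]
print(selected_names)
```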

In [217]:
# Feature selection
trf4 = SelectKBest(score_func=chi2,k=8)

## Decision Tree Classifier
The `DecisionTreeClassifier` in Scikit-learn is a machine learning model that belongs to the family of decision tree algorithms used for classification tasks. It constructs a tree structure by recursively splitting the data based on features to make predictions.

### Key Aspects of DecisionTreeClassifier:

#### 1. **Classification Algorithm:**
   - Designed specifically for solving classification problems, where the target variable is categorical.

#### 2. **Tree-Based Approach:**
   - Builds a tree-like structure by making sequential decisions based on features to classify data points.

#### 3. **Feature Selection:**
   - Each split uses the feature that most reduces impurity, so uninformative features tend to go unused; the fitted `feature_importances_` attribute reports each feature's contribution.

#### 4. **Handling Nonlinear Relationships:**
   - Can handle complex relationships and interactions between features non-linearly.

### Parameters and Methods:

#### Important Parameters:
- `criterion`: Determines the criterion for splitting ('gini' for Gini impurity, 'entropy' for information gain).
- `max_depth`: Controls the maximum depth of the tree to prevent overfitting.
- `min_samples_split`: Minimum number of samples required to split an internal node.
- `min_samples_leaf`: Minimum number of samples required to be at a leaf node.

#### Common Methods:
- `fit(X, y)`: Trains the DecisionTreeClassifier on input data `X` and target `y`.
- `predict(X)`: Predicts the class labels for input data `X`.
- `predict_proba(X)`: Predicts class probabilities for each sample in `X`.

### Example Usage:

Here's a simple example demonstrating how to use `DecisionTreeClassifier`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Instantiate DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
predictions = clf.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

### Steps Explained:

1. **Load Dataset:** Load a dataset (Iris dataset in this example) with features (`X`) and target (`y`).

2. **Split Dataset:** Split the dataset into training and test sets using `train_test_split`.

3. **Instantiate and Train:** Create an instance of `DecisionTreeClassifier` and train it using the training data.

4. **Predictions and Evaluation:** Make predictions on the test set and evaluate the model's accuracy using `accuracy_score`.

The `DecisionTreeClassifier` produces a model that can be read as a sequence of if/else rules on the features. Left unconstrained, the tree keeps splitting until the leaves are pure, so parameters such as `max_depth` and `min_samples_leaf` are the main tools for controlling overfitting.
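A small sketch of how `max_depth` constrains the tree, again on Iris (the variable names and the choice of `max_depth=2` are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, test_size=0.2, random_state=42)

# Unconstrained tree versus a tree limited to two levels of splits
full_tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_tr, y_tr)

print("depth:", full_tree.get_depth(), "vs", shallow_tree.get_depth())
print("train accuracy:", full_tree.score(X_tr, y_tr), "vs", shallow_tree.score(X_tr, y_tr))
print("test accuracy:", full_tree.score(X_te, y_te), "vs", shallow_tree.score(X_te, y_te))

# Relative contribution of each feature to the splits (sums to 1)
print(shallow_tree.feature_importances_)
```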

In [218]:
# model (trained later, when the pipeline is fit)
trf5 = DecisionTreeClassifier()

# Create Pipeline

In [219]:
pipe = Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])

## Pipeline Vs make_pipeline
In Scikit-learn, the `Pipeline` class and the `make_pipeline` function both create pipelines, which chain multiple transformers and a final estimator into a single machine learning workflow.

### Pipeline:

#### Key Aspects:
- **Sequential Execution:** Allows the sequential execution of a series of data transformations followed by a final estimator.
- **Integration of Transformers and Estimators:** Easily integrates preprocessing transformers and machine learning models into a single pipeline.
- **Consistent Workflow:** Ensures a consistent workflow for data preprocessing and modeling.

#### Usage:
- **Data Processing Flow:** A typical use case involves chaining together preprocessing steps like scaling, encoding, and feature selection, followed by fitting a model.
- **Hyperparameter Tuning:** Often used with GridSearchCV for hyperparameter tuning across the entire pipeline.

#### Example Usage:
Here's an example demonstrating the usage of `Pipeline`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a Pipeline with preprocessing steps and a model
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression(random_state=42))
])

# Fit the Pipeline on training data
pipeline.fit(X_train, y_train)

# Make predictions
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

### make_pipeline:

#### Simplified Construction:
- **Convenience Function:** Offers a simplified way to create a `Pipeline` without explicitly naming the steps.
- **Automatic Naming:** Names each step after its lowercased class name (for example, `StandardScaler()` becomes `'standardscaler'`).

#### Usage:
- **Quick Pipeline Creation:** Ideal for creating simple pipelines without specifying explicit names for each step.

#### Example Usage:
Here's an example demonstrating the usage of `make_pipeline`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a Pipeline using make_pipeline
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(random_state=42)
)

# Fit the Pipeline on training data
pipeline.fit(X_train, y_train)

# Make predictions
predictions = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2f}")
```

Both `Pipeline` and `make_pipeline` facilitate the construction of machine learning workflows by sequentially applying data preprocessing steps and modeling. While `Pipeline` offers explicit step naming and more customization, `make_pipeline` simplifies pipeline creation without the need for explicit step names. Choose the appropriate method based on the complexity and specificity required for your workflow.

(Same applies to ColumnTransformer vs make_column_transformer)
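The automatic names can be inspected directly. The sketch below builds a pipeline and a column transformer with the helper functions and prints the generated step names (the column indices `[2]` and `[6]` mirror the imputation step in this notebook; the variable names are illustrative).

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

auto_pipe = make_pipeline(SimpleImputer(), MinMaxScaler(), DecisionTreeClassifier())
step_names = [name for name, _ in auto_pipe.steps]
print(step_names)

# make_column_transformer takes (transformer, columns) pairs instead of named triples
auto_ct = make_column_transformer(
    (SimpleImputer(), [2]),
    (SimpleImputer(strategy='most_frequent'), [6]),
    remainder='passthrough',
)
ct_names = [name for name, _, _ in auto_ct.transformers]
print(ct_names)
```

When the same class appears more than once, a numeric suffix keeps the names unique. These generated names are also the prefixes to use in a parameter grid, for example `decisiontreeclassifier__max_depth`.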

In [None]:
# Alternate Syntax
pipe = make_pipeline(trf1,trf2,trf3,trf4,trf5)

## Displaying pipeline using set_config
`set_config(display='diagram')` is a global Scikit-learn configuration setting that makes estimators, including a `Pipeline`, render as an HTML diagram in Jupyter instead of plain text. The diagram shows the individual steps and the order in which data flows through them.

### Key Aspects:

#### 1. **Pipeline Visualization:**
   - **Graphical Representation:** Displays the pipeline's steps as a diagram, showcasing the flow of data transformations and model application.
   - **Enhanced Understanding:** Provides a visual representation for better comprehension of the pipeline's structure and sequence.

#### 2. **Enhanced Visual Debugging:**
   - **Debugging Aid:** Assists in identifying issues or errors within the pipeline by visualizing the sequence of transformations.

### Usage:

#### Example:
Here's an example demonstrating the usage of `set_config(display='diagram')`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn import set_config
from sklearn.pipeline import Pipeline

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Enable diagram display for Pipeline
set_config(display='diagram')

# Create a Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('classifier', LogisticRegression(random_state=42))
])

# Fit the Pipeline on training data
pipeline.fit(X_train, y_train)
```

### Steps Explained:

1. **Import Necessary Modules:** Import `set_config` from `sklearn`, `Pipeline` from `sklearn.pipeline`, and the other modules the example needs.
   
2. **Enable Diagram Display:** Use `set_config(display='diagram')` to set the configuration for displaying the pipeline as a diagram.

3. **Create and Fit Pipeline:** Create a `Pipeline` object, defining the sequence of preprocessing and modeling steps, and fit it to the training data.

In a Jupyter notebook, `fit` returns the pipeline itself, so when the `fit` call is the last expression in a cell, the pipeline is rendered as a diagram showing the steps in order.

The setting only changes how estimators are displayed; it has no effect on fitting or prediction.
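If a global change is not wanted, `sklearn.config_context` applies the same option only inside a `with` block. The sketch below is a hedged illustration: it checks the active setting with `get_config()` and inspects `_repr_mimebundle_()`, the hook Jupyter uses to choose a representation, to see whether an HTML version is offered (the `demo` pipeline is an illustrative name).

```python
from sklearn import config_context, get_config
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

demo = make_pipeline(StandardScaler(), LogisticRegression())

with config_context(display='text'):
    text_setting = get_config()['display']
    text_bundle = demo._repr_mimebundle_()      # plain text only

with config_context(display='diagram'):
    diagram_setting = get_config()['display']
    diagram_bundle = demo._repr_mimebundle_()   # also includes an HTML diagram

print(text_setting, sorted(text_bundle))
print(diagram_setting, sorted(diagram_bundle))
```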

In [204]:
from sklearn import set_config
set_config(display='diagram')

In [220]:
# train
pipe.fit(X_train,y_train)

## Explore the Pipeline
Watch <a href='https://www.youtube.com/watch?v=xOccYkgRV4Q&list=PLKnIA16_Rmvbr7zKYQuBfsVkjoLcJgxHH&index=29&t=2090'>this</a> video at 34:50.

In [232]:
# Code here
pipe.named_steps

{'trf1': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), [2]),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  [6])]),
 'trf2': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse=False),
                                  [1, 6])]),
 'trf3': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 10, None))]),
 'trf4': SelectKBest(k=8, score_func=<function chi2 at 0x0000021C01E5AD30>),
 'trf5': DecisionTreeClassifier()}
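Each entry in `named_steps` is the fitted object, so learned values can be read from it. A self-contained sketch on made-up data (`demo_pipe`, `X_demo`, and the step names are illustrative) that pulls the mean learned by an imputer nested inside a `ColumnTransformer`:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: column 0 has a missing value
X_demo = np.array([[22.0, 1.0], [np.nan, 0.0], [30.0, 1.0], [40.0, 0.0]])
y_demo = np.array([0, 1, 1, 0])

demo_pipe = Pipeline([
    ('impute', ColumnTransformer([
        ('impute_age', SimpleImputer(), [0])
    ], remainder='passthrough')),
    ('tree', DecisionTreeClassifier(random_state=0)),
])
demo_pipe.fit(X_demo, y_demo)

# Pipeline step -> fitted transformer inside the ColumnTransformer -> learned statistic
fitted_imputer = demo_pipe.named_steps['impute'].named_transformers_['impute_age']
print(fitted_imputer.statistics_)               # mean of the non-missing values

print(demo_pipe.named_steps['tree'].get_depth())
```

The same pattern applied to this notebook's pipeline would be `pipe.named_steps['trf1'].named_transformers_['impute_age'].statistics_`.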

In [233]:
# Predict
y_pred = pipe.predict(X_test)

In [234]:
y_pred

array([1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1,
       0, 0, 0], dtype=int64)

In [235]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.6256983240223464

# Cross Validation using Pipeline

In [236]:
# cross validation using cross_val_score
from sklearn.model_selection import cross_val_score
cross_val_score(pipe, X_train, y_train, cv=5, scoring='accuracy').mean()

0.6391214419383433
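`cross_val_score` returns one score per fold, and the spread across folds is often as informative as the mean. Passing the whole pipeline means the preprocessing is refit on each training fold, so no information from the validation fold leaks into the transformers. A sketch on Iris (names are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

cv_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=42))

# One accuracy value per fold
fold_scores = cross_val_score(cv_pipe, X_iris, y_iris, cv=5, scoring='accuracy')
print(fold_scores)
print(f"mean={fold_scores.mean():.3f} std={fold_scores.std():.3f}")
```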

# GridSearch using Pipeline
`GridSearchCV` in Scikit-learn is a class that searches systematically for the best hyperparameters of a model. It evaluates every combination in a specified parameter grid with cross-validation and keeps the combination with the best score.

### Key Aspects of GridSearchCV:

#### 1. **Hyperparameter Tuning:**
   - **Grid Search:** Tests all possible combinations of hyperparameters specified in a grid.
   - **Parameter Space Exploration:** Enables exploring multiple combinations to identify the best ones.

#### 2. **Cross-Validation:**
   - **K-Fold Cross-Validation:** Evaluates model performance using cross-validation (often K-fold) on each parameter combination.

#### 3. **Optimization Metric:**
   - **Scoring Function:** Uses a scoring function (like accuracy, F1-score, etc.) to evaluate and select the best parameters.

### Usage:

#### Example:
Here's an example illustrating the usage of `GridSearchCV`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define hyperparameter grid
param_grid = {
    'C': [0.1, 1, 10],
    'gamma': [0.001, 0.01, 0.1],
    'kernel': ['rbf', 'linear']
}

# Instantiate SVC classifier
svc = SVC()

# Create GridSearchCV object
grid_search = GridSearchCV(estimator=svc, param_grid=param_grid, cv=3, scoring='accuracy')

# Fit the GridSearchCV to find the best hyperparameters
grid_search.fit(X_train, y_train)

# Get the best hyperparameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print(f"Best Parameters: {best_params}")
print(f"Best Score: {best_score:.2f}")
```

### Steps Explained:

1. **Import Necessary Modules:** Import required modules (`GridSearchCV`, `train_test_split`, `SVC`, etc.).

2. **Load and Split Dataset:** Load a dataset (Iris dataset in this example) and split it into training and test sets.

3. **Define Hyperparameter Grid:** Specify a grid of hyperparameters for the model to explore.

4. **Instantiate Model:** Create an instance of the model to be tuned (SVC in this case).

5. **Create GridSearchCV:** Define a `GridSearchCV` object with the model, parameter grid, cross-validation method (`cv`), and scoring metric.

6. **Fit GridSearchCV:** Fit the `GridSearchCV` object to the training data to perform the search for the best hyperparameters.

7. **Get Best Parameters and Score:** Retrieve the best parameters and the corresponding score obtained from the grid search.

`GridSearchCV` is a powerful tool for systematically exploring multiple combinations of hyperparameters, optimizing a model's performance, and identifying the best parameter values for improved model accuracy or generalization.

In [237]:
# gridsearchcv
params = {
    'trf5__max_depth':[1,2,3,4,5,None]
}

In [238]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)

In [239]:
grid.best_score_

0.6391214419383433

In [240]:
grid.best_params_

{'trf5__max_depth': 2}
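Parameters of pipeline steps are addressed as `<step name>__<parameter>`, which is why the grid above uses `trf5__max_depth`. The sketch below tunes two steps at once on Iris and retrieves the refitted best pipeline (the step names `scale`, `select`, `tree` and the grid values are illustrative).

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)

tune_pipe = Pipeline([
    ('scale', MinMaxScaler()),
    ('select', SelectKBest(score_func=chi2)),
    ('tree', DecisionTreeClassifier(random_state=42)),
])

# Every valid key appears in get_params()
param_grid = {
    'select__k': [2, 3, 4],
    'tree__max_depth': [1, 2, 3, None],
}

search = GridSearchCV(tune_pipe, param_grid, cv=5, scoring='accuracy')
search.fit(X_iris, y_iris)

print(search.best_params_)
print(f"{search.best_score_:.3f}")

# With the default refit=True, this is the whole pipeline refitted with the best parameters
best_pipe = search.best_estimator_
```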

# Exporting the Pipeline

In [241]:
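# export
# `pipe` is the pipeline fitted above with the default tree settings;
# the tuned pipeline from the grid search is grid.best_estimator_
import pickle
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe,f)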
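Loading the file back gives a pipeline that accepts raw input and applies every preprocessing step before predicting. A self-contained round-trip sketch (it writes to a temporary directory, and the file name `demo_pipe.pkl` and the Iris model are illustrative):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X_iris, y_iris = load_iris(return_X_y=True)
model = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=42)).fit(X_iris, y_iris)

path = os.path.join(tempfile.mkdtemp(), 'demo_pipe.pkl')

# Save the fitted pipeline
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back and predict on raw, unscaled input
with open(path, 'rb') as f:
    restored = pickle.load(f)

same = np.array_equal(model.predict(X_iris), restored.predict(X_iris))
print(same)
```

A pickled pipeline should be loaded with the same Scikit-learn version that created it, and pickle files from untrusted sources should not be loaded at all.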