# Table of Contents

1. [Introduction to Scikit-learn](#1)
   * [1.1 What is Scikit-learn?](#1.1)
   * [1.2 Why use Scikit-learn?](#1.2)
   * [1.3 Scikit-learn Installation](#1.3)
   

2. [Data Representation in Scikit-learn](#2)
   * [2.1 Data Structures](#2.1)
   * [2.2 Target Variable and Features](#2.2)


3. [Data Preprocessing](#3)
   * [3.1 Handling Missing Data](#3.1)  
   * [3.2 Feature Scaling](#3.2)  
   * [3.3 One-Hot Encoding for Categorical Variables](#3.3)


4. [Train-Test Split](#4)
   * [4.1 Importance of Splitting Data](#4.1)  
   * [4.2 Using train_test_split](#4.2)


5. [Supervised Learning](#5)
   * [5.1 Linear Regression](#5.1)  
   * [5.2 Decision Trees](#5.2)  
   * [5.3 Random Forest](#5.3)  
   * [5.4 Support Vector Machines (SVM)](#5.4)  
   * [5.5 K-Nearest Neighbors (KNN)](#5.5)


6. [Unsupervised Learning](#6)
   * [6.1 K-Means Clustering](#6.1)  
   * [6.2 Hierarchical Clustering](#6.2) 
   * [6.3 Principal Component Analysis (PCA)](#6.3)


7. [Model Evaluation](#7)
   * [7.1 Cross-Validation](#7.1) 
   * [7.2 Metrics](#7.2)


8. [Hyperparameter Tuning](#8)
   * [8.1 Grid Search](#8.1)  
   * [8.2 Randomized Search](#8.2)  
   * [8.3 Model Selection](#8.3)


9. [Ensemble Learning](#9)
   * [9.1 Bagging (Bootstrap Aggregating)](#9.1)  
   * [9.2 Boosting (AdaBoost, Gradient Boosting)](#9.2)  
   * [9.3 Stacking](#9.3)


10. [Pipelines in Scikit-learn](#10)
   * [10.1 Combining Data Preprocessing and Modeling](#10.1)  
   * [10.2 Simplifying Workflow](#10.2)


11. [Feature Importance](#11)
   * [11.1 Extracting Feature Importance from Trees](#11.1)  
   * [11.2 Permutation Importance](#11.2)


12. [Handling Imbalanced Datasets](#12)
   * [12.1 Techniques for Imbalanced Classification](#12.1)



<a id = "1"></a>
# 1. Introduction to Scikit-learn



<a id = "1.1"></a>
### 1.1 What is Scikit-learn?

Scikit-learn, often abbreviated as sklearn, is a popular open-source machine learning library for Python. It provides simple and efficient tools for data analysis and modeling, making it a valuable resource for both beginners and experienced machine learning practitioners. Scikit-learn is built on NumPy, SciPy, and Matplotlib, and it integrates well with other scientific computing libraries. It offers a wide range of machine learning algorithms and tools for tasks such as classification, regression, clustering, dimensionality reduction, and more.




<a id = "1.2"></a>
### 1.2 Why use Scikit-learn?

Scikit-learn provides a unified and user-friendly interface for various machine learning algorithms. Its extensive range of algorithms, efficient data preprocessing tools and integration with other Python libraries make it a go-to choice for diverse machine learning tasks.



<a id = "1.3"></a>
### 1.3 Scikit-learn Installation

To install Scikit-learn, you can use pip. Open the terminal or command prompt and enter the following command:

In [None]:
pip install scikit-learn

To confirm that the installation succeeded, import the package and print its version:

In [None]:
import sklearn
print(sklearn.__version__)  # Prints the installed scikit-learn version

<a id = "2"></a>
# 2. Data Representation in Scikit-learn


<a id = "2.1"></a>
### 2.1 Data Structures

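Scikit-learn works with NumPy arrays and pandas DataFrames (as well as SciPy sparse matrices). The feature matrix is two-dimensional, with shape `(n_samples, n_features)`, and most estimators convert their input to NumPy arrays internally.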



<a id = "2.2"></a>
### 2.2 Target Variable and Features

In scikit-learn, the target variable (dependent variable) is referred to as `y` and features (independent variables) are referred to as `X`.


In [None]:
# Example: 01 (Using NumPy Arrays)
import numpy as np
X = np.array([[1, 2], [3, 4]])
y = np.array([0, 1])

# Example: 02 (Using Pandas Dataframe)
import pandas as pd
df = pd.DataFrame({'feature1': [1, 3], 'feature2': [2, 4]})
X = df[['feature1', 'feature2']]
y = pd.Series([0, 1])

<a id = "3"></a>
# 3. Data Preprocessing

<a id = "3.1"></a>
### 3.1 Handling Missing Data

You can use `SimpleImputer` to fill missing values with the mean, median or a constant.


In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
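
The cell above shows only the mean strategy. As a minimal sketch of the other two options mentioned, the same toy matrix can be imputed with the median or with a constant. The variable names and the fill value of 0 are illustrative assumptions, not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same toy matrix as in the cell above: one missing value in each of the last two columns
X_missing = np.array([[1, 2, np.nan], [4, np.nan, 6], [7, 8, 9]])

# Median of the observed values in each column
X_median = SimpleImputer(strategy='median').fit_transform(X_missing)

# A fixed placeholder value (0 is an arbitrary choice for illustration)
X_constant = SimpleImputer(strategy='constant', fill_value=0).fit_transform(X_missing)

print(X_median)    # missing entries become 5.0 (column 1) and 7.5 (column 2)
print(X_constant)  # missing entries become 0.0
```

Median imputation is less sensitive to outliers than the mean, while a constant is useful when "missing" should remain a recognizable value.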

<a id = "3.2"></a>
### 3.2 Feature Scaling
Standardize features to have mean=0 and variance=1 using `StandardScaler`.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
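
To check what standardization does, the sketch below scales a small hypothetical matrix and inspects the column statistics. It also shows that a fitted scaler reuses the statistics learned during `fit` when transforming new rows. All data values and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])  # hypothetical data
scaler_demo = StandardScaler()
X_demo_scaled = scaler_demo.fit_transform(X_demo)

print(X_demo_scaled.mean(axis=0))  # approximately 0 for each column
print(X_demo_scaled.std(axis=0))   # approximately 1 for each column

# The fitted scaler stores the column means and scales and applies them to new data
X_new = np.array([[2.0, 30.0]])    # equals the column means of X_demo
print(scaler_demo.transform(X_new))  # approximately [[0, 0]]
```

In practice the scaler is usually fitted on the training set only and then applied to the test set with `transform`.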

<a id = "3.3"></a>
### 3.3 One-Hot Encoding for Categorical Variables

Convert categorical variables into numerical format using `OneHotEncoder`.

In [None]:
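from sklearn.preprocessing import OneHotEncoder
import numpy as np

X_cat = np.array([['red'], ['green'], ['blue']])  # One categorical column
encoder = OneHotEncoder()
X_encoded = encoder.fit_transform(X_cat)  # Sparse matrix with one column per category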

<a id = "4"></a>
# 4. Train-Test Split

<a id = "4.1"></a>
### 4.1 Importance of Splitting Data

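Evaluating a model on the same data it was trained on gives an overly optimistic estimate and can hide overfitting. A train-test split holds out part of the data so that the model's performance can be assessed on examples it has not seen.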
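
A minimal sketch of why held-out data matters, assuming purely random labels (hypothetical data in which the features carry no information about the target). An unrestricted decision tree can memorize the training set, so only the held-out score reveals that nothing was learned.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_noise = rng.rand(200, 5)          # random features
y_noise = rng.randint(2, size=200)  # random labels, unrelated to the features

X_tr, X_te, y_tr, y_te = train_test_split(X_noise, y_noise, test_size=0.25, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)  # the tree memorizes the training rows
test_acc = tree.score(X_te, y_te)   # performance on unseen rows
print(train_acc, test_acc)
```

The training accuracy is perfect because the tree keeps splitting until every leaf is pure, while the test accuracy should land near chance level.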


<a id = "4.2"></a>
### 4.2 Using `train_test_split`

You can use `train_test_split` to split the dataset into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y = np.array([0, 1, 0])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<a id = "5"></a>
# 5. Supervised Learning


<a id = "5.1"></a>
### 5.1 Linear Regression

Linear regression is a simple algorithm for predicting a continuous target variable based on one or more predictor variables.


In [None]:
from sklearn.linear_model import LinearRegression

model = LinearRegression() # Linear regression model
model.fit(X_train, y_train) # Train the model
y_pred = model.predict(X_test) # Make predictions

<a id = "5.2"></a>
### 5.2 Decision Trees

Decision trees predict the target variable by learning a sequence of if-then splits on feature values.

In [None]:
from sklearn.tree import DecisionTreeRegressor

model = DecisionTreeRegressor() # Decision tree model
model.fit(X_train, y_train) 
y_pred = model.predict(X_test) 

<a id = "5.3"></a>
### 5.3 Random Forest
Random forest is an ensemble method that builds multiple decision trees on random subsets of the data and combines their predictions, by averaging for regression or by voting for classification.

In [None]:
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor() # Random forest model
model.fit(X_train, y_train) 
y_pred = model.predict(X_test) 

<a id = "5.4"></a>
### 5.4 Support Vector Machines (SVM)

SVM is a powerful algorithm for both classification (`SVC`) and regression (`SVR`) tasks. The example below uses the regression variant.

In [None]:
from sklearn.svm import SVR

model = SVR() # Create an SVM model
model.fit(X_train, y_train) 
y_pred = model.predict(X_test) 

<a id = "5.5"></a>
### 5.5 K-Nearest Neighbors (KNN)

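KNN predicts the target variable from the k nearest training points. For classification it takes the majority class among the neighbors; for regression, as in the example below, it averages their target values.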

In [None]:
from sklearn.neighbors import KNeighborsRegressor

# Assuming X_train and y_train are training data
model = KNeighborsRegressor(n_neighbors=1)  
model.fit(X_train, y_train)  
y_pred = model.predict(X_test)  
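
To make the difference between the two KNN variants concrete, the sketch below fits a classifier and a regressor on the same hypothetical one-dimensional data and queries a single point. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X_knn = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_class = np.array([0, 0, 0, 1, 1, 1])                  # class labels
y_value = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])   # continuous targets

clf = KNeighborsClassifier(n_neighbors=3).fit(X_knn, y_class)
reg = KNeighborsRegressor(n_neighbors=3).fit(X_knn, y_value)

query = np.array([[1.5]])   # its 3 nearest neighbors are the points 0, 1 and 2
print(clf.predict(query))   # majority class of the neighbors: [0]
print(reg.predict(query))   # mean target of the neighbors: [2.]
```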

<a id = "6"></a>
# 6. Unsupervised Learning

<a id = "6.1"></a>
### 6.1 K-Means Clustering

K-Means is a clustering algorithm that groups data points into k clusters.

In [None]:
from sklearn.cluster import KMeans

model = KMeans(n_clusters=2, n_init=10) # K-Means model
model.fit(X) 
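
After fitting, the cluster assignments and centroids are available as attributes. The following sketch uses two well-separated hypothetical groups of points; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X_blobs = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                    [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_blobs)

print(kmeans.labels_)           # cluster index assigned to each training point
print(kmeans.cluster_centers_)  # coordinates of the two centroids

# New points are assigned to the nearest centroid
new_labels = kmeans.predict(np.array([[0.0, 0.0], [10.0, 10.0]]))
print(new_labels)
```

The numeric cluster indices are arbitrary: which group is called 0 and which is called 1 depends on the initialization.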

<a id = "6.2"></a>
### 6.2 Hierarchical Clustering

Hierarchical clustering builds a tree of clusters to represent the data's hierarchical structure.

In [None]:
from sklearn.cluster import AgglomerativeClustering

model = AgglomerativeClustering(n_clusters=2) # Hierarchical clustering model
model.fit(X) 

<a id = "6.3"></a>
### 6.3 Principal Component Analysis (PCA)
PCA is a dimensionality reduction technique that transforms data into a new coordinate system.

In [None]:
from sklearn.decomposition import PCA

model = PCA(n_components=2) # PCA model
model.fit(X) 
X_pca = model.transform(X) 
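
How much information each component retains can be read from `explained_variance_ratio_`. The sketch below builds hypothetical data in which two features are strongly correlated, so most of the variance lies along one direction. The data-generating choices are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.normal(size=200)
# Column 1 is almost a multiple of column 0; column 2 is low-variance noise
X_corr = np.column_stack([t,
                          2 * t + 0.1 * rng.normal(size=200),
                          rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=2).fit(X_corr)
X_reduced = pca.transform(X_corr)

print(X_reduced.shape)                # (200, 2)
print(pca.explained_variance_ratio_)  # share of variance per component, in descending order
```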

<a id = "7"></a>
# 7. Model Evaluation

<a id = "7.1"></a>
### 7.1 Cross-Validation

Cross-validation assesses a model by splitting the dataset into k folds, training on k-1 of them, scoring on the held-out fold, and repeating so that each fold serves once as the test set. The resulting scores estimate how well the model will generalize to an independent dataset.

In [None]:
from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(2, size=100)  # Binary target variable
model = LogisticRegression(max_iter=1000) # Logistic regression model 
kf = KFold(n_splits=5, shuffle=True, random_state=42)
cross_val_scores = cross_val_score(model, X, y, cv=kf)

print("Cross-validation scores:", cross_val_scores)
print("Average Cross-validation score:", np.mean(cross_val_scores))
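
To show what `cross_val_score` does internally, the sketch below reproduces the same scores with an explicit loop over the folds. It uses a synthetic dataset from `make_classification`; the dataset and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X_cv, y_cv = make_classification(n_samples=100, n_features=5, random_state=0)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=42)

manual_scores = []
for train_idx, test_idx in kf_demo.split(X_cv):
    fold_model = LogisticRegression(max_iter=1000)   # fresh model for every fold
    fold_model.fit(X_cv[train_idx], y_cv[train_idx])
    manual_scores.append(fold_model.score(X_cv[test_idx], y_cv[test_idx]))

auto_scores = cross_val_score(LogisticRegression(max_iter=1000), X_cv, y_cv, cv=kf_demo)
print(np.allclose(manual_scores, auto_scores))  # True: both use the same folds and metric
```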

<a id = "7.2"></a>
### 7.2 Metrics (Accuracy, Precision, Recall, F1-Score, ROC-AUC)

Different metrics provide insights into a model's performance. 

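- **Accuracy:** Fraction of all predictions that are correct.
- **Precision:** Fraction of predicted positives that are actually positive, TP / (TP + FP).
- **Recall:** Fraction of actual positives that are correctly identified (the true positive rate), TP / (TP + FN).
- **F1-Score:** Harmonic mean of precision and recall.
- **ROC-AUC:** Area under the ROC curve; measures how well the predicted scores rank positives above negatives, independent of a decision threshold.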


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)  # Binary target: the default average='binary' scores the positive class
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])  # Pass the probability of the positive class
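
As a worked example of how these metrics relate to the confusion matrix, the sketch below computes them by hand for a small hypothetical set of binary predictions and compares the results with the library functions. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_hat = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields the counts in this order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision_manual = tp / (tp + fp)               # 3 / 4 = 0.75
recall_manual = tp / (tp + fn)                  # 3 / 4 = 0.75
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / len(y_true)       # 8 / 10 = 0.8

print(np.isclose(precision_manual, precision_score(y_true, y_hat)))  # True
print(np.isclose(recall_manual, recall_score(y_true, y_hat)))        # True
print(np.isclose(f1_manual, f1_score(y_true, y_hat)))                # True
print(np.isclose(accuracy_manual, accuracy_score(y_true, y_hat)))    # True
```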

<a id = "8"></a>
# 8. Hyperparameter Tuning

<a id = "8.1"></a>
### 8.1 Grid Search

Grid Search optimizes model performance by systematically searching hyperparameter combinations. It evaluates each combination using cross-validation to find the best set of hyperparameters.


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
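
The cost of a grid search grows with the product of the list lengths in the grid. As a rough sketch, `ParameterGrid` can enumerate the combinations for the grid used above, which makes the number of model fits explicit. The variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid_demo = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
combos = list(ParameterGrid(param_grid_demo))

n_folds = 5
n_fits = len(combos) * n_folds  # one fit per combination and fold

print(len(combos))  # 9 combinations (3 x 3)
print(n_fits)       # 45 cross-validation fits, plus one final refit on the best combination
```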

<a id = "8.2"></a>
### 8.2 Randomized Search

Randomized Search explores hyperparameter space through a specified number of random combinations. It is efficient for a large hyperparameter search space.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier

param_dist = {'n_estimators': randint(50, 200), 'max_depth': [None, 10, 20]} 
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=5, cv=5)
random_search.fit(X_train, y_train)
best_params = random_search.best_params_

<a id = "8.3"></a>
### 8.3 Model Selection

Model selection involves choosing the best-performing model from a set of candidate models. It is often done using cross-validation to ensure generalization performance.

In [None]:
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier  
import numpy as np

X_train = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
y_train = np.array([0, 1, 0])

models = [
    RandomForestClassifier(),
    DecisionTreeClassifier()
]

for model in models:
    try:
        scores = cross_val_score(model, X_train, y_train, cv=LeaveOneOut(), scoring='accuracy')
        average_score = np.mean(scores)
        print(f"Average Cross-Validation Score for {type(model).__name__}: {average_score}")
    except Exception as e:
        print(f"Error for {type(model).__name__}: {e}")

<a id = "9"></a>
# 9. Ensemble Learning

<a id = "9.1"></a>
### 9.1 Bagging (Bootstrap Aggregating)

Bagging builds multiple models independently and combines them to reduce overfitting. It involves training each model on a random subset of the training data (with replacement) and aggregating their predictions.

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
base_model = DecisionTreeClassifier() # Decision tree as the base model
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=42) # BaggingClassifier
bagging_model.fit(X_train, y_train)


<a id = "9.2"></a>
### 9.2 Boosting (AdaBoost, Gradient Boosting)

Boosting builds a strong model by sequentially training weak models, focusing on misclassified instances. AdaBoost adjusts weights of misclassified instances while Gradient Boosting fits each model to the residuals of the combined ensemble.

In [None]:
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 
adaboost_model = AdaBoostClassifier(n_estimators=50, random_state=42) # AdaBoost
adaboost_model.fit(X_train, y_train)
gradboost_model = GradientBoostingClassifier(n_estimators=50, random_state=42) # Gradient boosting
gradboost_model.fit(X_train, y_train)


<a id = "9.3"></a>
### 9.3 Stacking

Stacking combines multiple base models by training a meta-model on their predictions. It aims to capture diverse patterns present in individual models.

In [None]:
import numpy as np
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=20, n_informative=10, n_classes=3, random_state=42) # Dummy data
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42) 
base_models = [('svm', SVC()), ('tree', DecisionTreeClassifier())] # Base models
stacking_model = StackingClassifier(estimators=base_models, final_estimator=LogisticRegression(), cv=cv) # Stacking model with logistic regression as final estimator
stacking_model.fit(X, y)

<a id = "10"></a>
# 10. Pipelines in Scikit-learn


<a id = "10.1"></a>
### 10.1 Combining Data Preprocessing and Modeling

Pipelines string together multiple data processing steps and a final estimator. This ensures a smooth workflow, simplifies code and reduces the risk of data leakage.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X_train = np.array([[1, 2], [3, 4], [5, 6]]) 
y_train = np.array([0, 1, 0])
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('rf', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
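
One way to see how a pipeline guards against leakage is to inspect the fitted scaler: its statistics come only from the data passed to `fit` and are not updated when predicting. The sketch below uses hypothetical data and `LogisticRegression` as the final step, both chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_fit = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]])
y_fit = np.array([0, 0, 1, 1])
X_later = np.array([[100.0, 200.0]])  # unseen data with a very different scale

pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_fit, y_fit)

mean_before = pipe.named_steps['scaler'].mean_.copy()  # column means of X_fit: [4., 5.]
prediction = pipe.predict(X_later)                     # scaler.transform, then clf.predict
mean_after = pipe.named_steps['scaler'].mean_

print(mean_before, mean_after)  # identical: predict never refits the scaler
```

The same property holds inside `cross_val_score` or `GridSearchCV`: when a pipeline is passed as the estimator, the preprocessing steps are refitted on each training fold only.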

<a id = "10.2"></a>
### 10.2 Simplifying Workflow

Pipelines simplify the machine learning workflow by encapsulating data preprocessing and model training in a single object. This improves code readability and reproducibility.

In [None]:
X_train = np.array([[1, 2], [3, 4], [5, 6]]) 
y_train = np.array([0, 1, 0])
X_test = np.array([[7, 8], [9, 10], [11, 12]]) 
pipeline.fit(X_train, y_train) 
pipeline.predict(X_test) 

<a id = "11"></a>
# 11. Feature Importance


<a id = "11.1"></a>
### 11.1 Extracting Feature Importance from Trees

For tree-based models, feature importance indicates each feature's contribution to the model. It helps in understanding which features are most influential.

In [None]:
from sklearn.ensemble import RandomForestClassifier
import numpy as np


X_train = np.array([[1, 2], [3, 4], [5, 6]]) 
y_train = np.array([0, 1, 0])
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
feature_importance = rf_model.feature_importances_

<a id = "11.2"></a>
### 11.2 Permutation Importance

Permutation Importance measures a feature's impact by randomly permuting its values and observing the model's performance change. A decrease in performance indicates the feature's importance.

In [None]:
from sklearn.inspection import permutation_importance
from sklearn.ensemble import RandomForestClassifier
import numpy as np

X_test = np.array([[7, 8], [9, 10], [11, 12]]) # Test data
y_test = np.array([0, 1, 0])
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
perm_importance = permutation_importance(rf_model, X_test, y_test)
feature_importance = perm_importance.importances_mean
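
The idea behind permutation importance can be sketched by hand: shuffle one column of the evaluation data, re-score the model, and record the drop. The example below assumes hypothetical data in which only the first feature determines the label; it is a simplified single-repeat version, not the library's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_pi = rng.rand(300, 2)
y_pi = (X_pi[:, 0] > 0.5).astype(int)  # only feature 0 determines the label

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_pi[:200], y_pi[:200])
X_val, y_val = X_pi[200:], y_pi[200:]
baseline = forest.score(X_val, y_val)

drops = []
for j in range(X_val.shape[1]):
    X_perm = X_val.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
    drops.append(baseline - forest.score(X_perm, y_val))

print(baseline, drops)  # the drop for feature 0 should be much larger than for feature 1
```

`permutation_importance` repeats this shuffling several times per feature (`n_repeats`) and reports the mean and standard deviation of the drops.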

<a id = "12"></a>
# 12. Handling Imbalanced Datasets


<a id = "12.1"></a>
### 12.1 Techniques for Imbalanced Classification
Imbalanced datasets pose challenges in classification. Techniques like resampling (oversampling minority class, undersampling majority class), using different algorithms and adjusting class weights help address this issue.

In [None]:
from sklearn.utils import resample
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import numpy as np

X_train = np.array([[1, 2], [3, 4], [5, 6]]) 
y_train = np.array([0, 1, 0])
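# Oversample the minority class (label 1) with replacement to match the majority class size
X_min, y_min = X_train[y_train == 1], y_train[y_train == 1]
X_maj, y_maj = X_train[y_train == 0], y_train[y_train == 0]
X_min_up, y_min_up = resample(X_min, y_min, replace=True, n_samples=len(X_maj), random_state=42)
X_resampled = np.vstack([X_maj, X_min_up])  # Balanced training set
y_resampled = np.concatenate([y_maj, y_min_up])

# Alternative: keep the data as-is and reweight the classes inside the estimator
rf_model_weighted = RandomForestClassifier(class_weight='balanced')
svc_model_weighted = SVC(class_weight='balanced')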
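
To see what `class_weight='balanced'` amounts to, the weights can be computed directly with `compute_class_weight`, which uses `n_samples / (n_classes * count_per_class)`. The 90/10 label split below is a hypothetical example.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_imb = np.array([0] * 90 + [1] * 10)  # 90 majority samples, 10 minority samples
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y_imb)

print(dict(zip([0, 1], weights)))  # class 0: 100 / (2 * 90) ≈ 0.556, class 1: 100 / (2 * 10) = 5.0
```

The minority class receives a weight nine times larger here, so errors on it cost proportionally more during training.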