# Most Used Functions in scikit-learn

In this notebook, we cover some of the most commonly used functions and classes in scikit-learn, a popular Python machine learning library that provides simple and efficient tools for data analysis and modeling.

## 1. Train-Test Split

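The `train_test_split` function randomly splits arrays or DataFrames into training and testing subsets. The `test_size` parameter sets the fraction of samples held out for testing, and `random_state` makes the shuffle reproducible.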

In [None]:
# Example: Train-Test Split
from sklearn.model_selection import train_test_split
import pandas as pd

# Sample data
data = {
    'feature': [1, 2, 3, 4, 5],
    'target': [2, 4, 6, 8, 10]
}
df = pd.DataFrame(data)

# Split data
X = df[['feature']]
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f'X_train:\n{X_train}')
print(f'X_test:\n{X_test}')
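
For classification targets, a purely random split can leave the two subsets with different class proportions. A minimal sketch, assuming a hypothetical balanced toy dataset that is not part of the example above, shows how the `stratify` parameter of `train_test_split` keeps the label ratio the same in both subsets:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 5 per class
X = [[i] for i in range(10)]
y = [0] * 5 + [1] * 5

# stratify=y asks for the same class proportions in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(f'Train label counts: {dict(train_counts)}')
print(f'Test label counts: {dict(test_counts)}')
```

With this data the test set should hold one sample of each class and the training set four of each.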

## 2. StandardScaler

The `StandardScaler` class standardizes features by removing the mean and scaling to unit variance, so that each column ends up with mean 0 and standard deviation 1.

In [None]:
# Example: StandardScaler
from sklearn.preprocessing import StandardScaler

# Sample data
data = {
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)

# Standardize features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df)

print(f'Scaled data:\n{scaled_data}')
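
A common pattern, sketched here with made-up numbers, is to fit the scaler on the training data only and reuse the learned mean and standard deviation for the test data, so that no information from the test set leaks into preprocessing. `StandardScaler` computes `z = (x - mean) / std` using the population standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test values for a single feature
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean and std from the training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean and std

print(f'Learned mean: {scaler.mean_}')
print(f'Scaled test data:\n{X_test_scaled}')
```

The scaled test value can fall well outside the range of the scaled training data, which is expected when the test point lies beyond the training range.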

## 3. OneHotEncoder

The `OneHotEncoder` class encodes categorical features as a one-hot numeric array, with one binary column per category.

In [None]:
# Example: OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# Sample data
data = {'category': ['A', 'B', 'A', 'C']}
df = pd.DataFrame(data)

# One-hot encode categorical features
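# sparse_output=False returns a dense array (the parameter was named `sparse` before scikit-learn 1.2)
encoder = OneHotEncoder(sparse_output=False)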
encoded_data = encoder.fit_transform(df)

print(f'Encoded data:\n{encoded_data}')

## 4. PCA (Principal Component Analysis)

The `PCA` class performs dimensionality reduction by projecting the data onto the directions of greatest variance (the principal components).

In [None]:
# Example: PCA
from sklearn.decomposition import PCA

# Sample data
data = {
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [2, 4, 6, 8, 10],
    'feature3': [5, 4, 3, 2, 1]
}
df = pd.DataFrame(data)

# Apply PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(df)

print(f'Reduced data:\n{reduced_data}')
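
To judge how much information a reduction keeps, one option is to inspect `explained_variance_ratio_` after fitting. This sketch reuses the sample values from the cell above as a NumPy array; because `feature2` and `feature3` are exact linear functions of `feature1` in this toy data, the first component should capture essentially all of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Same values as the sample data: columns are feature1, feature2, feature3
X = np.array([
    [1, 2, 5],
    [2, 4, 4],
    [3, 6, 3],
    [4, 8, 2],
    [5, 10, 1],
], dtype=float)

pca = PCA(n_components=2)
reduced = pca.fit_transform(X)

# Fraction of the total variance captured by each kept component
ratios = pca.explained_variance_ratio_
print(f'Explained variance ratio: {ratios}')
```

On real data the ratios are rarely this lopsided; a cumulative sum of the ratios is one way to choose `n_components`.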

## 5. GridSearchCV

The `GridSearchCV` class tunes hyperparameters by exhaustively evaluating every combination in a parameter grid with cross-validation.

In [None]:
# Example: GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Sample data
data = {
    'feature1': [1, 2, 3, 4, 5],
    'feature2': [2, 4, 6, 8, 10],
    'target': [0, 0, 1, 1, 1]
}
df = pd.DataFrame(data)

# Define the model and parameters
X = df[['feature1', 'feature2']]
y = df['target']
model = RandomForestClassifier(random_state=42)  # fixed seed for reproducible results
param_grid = {'n_estimators': [10, 50, 100]}

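# Perform GridSearchCV (cv=2 because class 0 has only two samples,
# so more stratified folds would leave some folds without that class)
grid_search = GridSearchCV(model, param_grid, cv=2)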
grid_search.fit(X, y)

print(f'Best parameters: {grid_search.best_params_}')
print(f'Best score: {grid_search.best_score_}')
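
After fitting, the search object also exposes the per-combination scores and a model refit on the full data with the best parameters. The following sketch assumes a synthetic dataset from `make_classification` and an extra `max_depth` grid entry, neither of which appears in the example above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data (an assumption for illustration): 60 samples, 4 features, 2 classes
X, y = make_classification(n_samples=60, n_features=4, random_state=0)

model = RandomForestClassifier(random_state=0)
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, None]}  # 2 x 2 = 4 combinations

grid_search = GridSearchCV(model, param_grid, cv=3)
grid_search.fit(X, y)

# Mean cross-validated score for each parameter combination
mean_scores = grid_search.cv_results_['mean_test_score']

# With the default refit=True, the best model is retrained on all of X, y
best_model = grid_search.best_estimator_
predictions = best_model.predict(X[:5])

print(f'Mean scores per combination: {mean_scores}')
print(f'Predictions from best model: {predictions}')
```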

## 6. Confusion Matrix

The `confusion_matrix` function evaluates classification results by counting, for each true class (rows), how many samples were assigned to each predicted class (columns).

In [None]:
# Example: Confusion Matrix
from sklearn.metrics import confusion_matrix

# Sample data
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred)

print(f'Confusion matrix:\n{cm}')
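
For binary problems, the four cells of the matrix can be unpacked directly. A small sketch using the same labels as above; `ravel()` flattens the 2x2 matrix in the order true negatives, false positives, false negatives, true positives:

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Row-major flattening of [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# Accuracy is the share of correct predictions (the diagonal of the matrix)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f'TN={tn}, FP={fp}, FN={fn}, TP={tp}')  # TN=1, FP=1, FN=1, TP=2
print(f'Accuracy: {accuracy}')
```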

## 7. Classification Report

The `classification_report` function builds a text report of the main classification metrics: precision, recall, F1-score, and support for each class.

In [None]:
# Example: Classification Report
from sklearn.metrics import classification_report

# Sample data
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Generate classification report
report = classification_report(y_true, y_pred)

print(f'Classification report:\n{report}')
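
When the metrics are needed programmatically rather than as printed text, `classification_report` accepts `output_dict=True`. A brief sketch with the same labels as above; note that the class labels become string keys in the returned dictionary:

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Returns a nested dict instead of a formatted string
report_dict = classification_report(y_true, y_pred, output_dict=True)

recall_class_1 = report_dict['1']['recall']   # 2 of the 3 true positives were found
overall_accuracy = report_dict['accuracy']    # 3 of the 5 predictions are correct

print(f'Recall for class 1: {recall_class_1:.3f}')
print(f'Accuracy: {overall_accuracy:.3f}')
```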