# Q1. What is the purpose of grid search CV in machine learning, and how does it work?

# ANS:-

GridSearchCV is scikit-learn's implementation of grid search with cross-validation, a technique for hyperparameter tuning. Hyperparameters are settings chosen before training rather than learned from the data, such as the learning rate, the number of hidden layers, or the number of trees in a random forest. GridSearchCV systematically tries every combination in a predefined set of hyperparameter values, scores each combination using cross-validation, and selects the combination with the best score.

In [1]:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load dataset
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=42)

# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf', 'poly'], 'gamma': ['scale', 'auto']}

# Create the SVM model
svm_model = SVC()

# Create GridSearchCV
grid_search = GridSearchCV(svm_model, param_grid, cv=5, scoring='accuracy')

# Fit the model to the data
grid_search.fit(X_train, y_train)

# Print the best parameters
print("Best parameters: ", grid_search.best_params_)

# Evaluate the model on the test set
accuracy = grid_search.score(X_test, y_test)
print("Test accuracy: {:.2f}%".format(accuracy * 100))


Best parameters:  {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
Test accuracy: 100.00%
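
To make the cost of the search above concrete, the following minimal sketch enumerates the same grid with `itertools.product`. It only illustrates what an exhaustive search iterates over; it is not how scikit-learn implements `GridSearchCV` internally, and the names `combinations`, `cv_folds`, and `total_fits` are illustrative.

```python
from itertools import product

# Same grid as in the SVC example above
param_grid = {
    "C": [0.1, 1, 10],
    "kernel": ["linear", "rbf", "poly"],
    "gamma": ["scale", "auto"],
}
cv_folds = 5

# Build every combination of one value per hyperparameter
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

# Each candidate is trained once per cross-validation fold
total_fits = len(combinations) * cv_folds

print("Candidates:", len(combinations))
print("Model fits during the search:", total_fits)
```

With 3 x 3 x 2 = 18 candidates and 5 folds, the search performs 90 fits, plus one final refit of the best candidate on the full training set when `refit=True` (the default). Adding one more hyperparameter with three values would triple that number.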


# Q2. Describe the difference between grid search CV and randomized search CV, and when might you choose one over the other?

# ANS:-

Both Grid Search CV and Randomized Search CV are techniques for tuning the hyperparameters of machine learning models. They help find a well-performing set of hyperparameters, which can significantly affect model performance. Here is a brief description of each and when you might choose one over the other:

Grid Search CV:

In Grid Search CV, you specify a grid of hyperparameter values, and the algorithm evaluates the model performance for each combination of hyperparameters using cross-validation.
It exhaustively evaluates every combination of the values in the grid.
This method is computationally expensive, especially when the hyperparameter search space is large.
It is suitable when you have a relatively small search space, and you want to ensure that you've explored all possible combinations thoroughly.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


iris = load_iris()
X, y = iris.data, iris.target

param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10]
}


rf_classifier = RandomForestClassifier()


grid_search = GridSearchCV(estimator=rf_classifier, param_grid=param_grid, cv=5)
grid_search.fit(X, y)


print("Best hyperparameters:", grid_search.best_params_)


Best hyperparameters: {'max_depth': 20, 'min_samples_split': 10, 'n_estimators': 10}


Randomized Search CV:

In Randomized Search CV, you specify a probability distribution (or a list of candidate values) for each hyperparameter, and the algorithm randomly samples a specified number of combinations to evaluate.
It samples a fixed number of hyperparameter combinations, making it more efficient for large search spaces.
This method might not explore all possible combinations but can provide good results with fewer evaluations.
It is suitable when you have a large search space, and performing an exhaustive search with Grid Search would be too computationally expensive.
Example code using RandomizedSearchCV in scikit-learn:

In [4]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


iris = load_iris()
X, y = iris.data, iris.target

param_dist = {
    'n_estimators': [10, 50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10, 15]
}


rf_classifier = RandomForestClassifier()


random_search = RandomizedSearchCV(estimator=rf_classifier, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(X, y)


print("Best hyperparameters:", random_search.best_params_)


Best hyperparameters: {'n_estimators': 100, 'min_samples_split': 5, 'max_depth': 30}
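
The example above passes lists, from which values are sampled uniformly. As a hedged sketch of the distribution-based form described earlier, the variant below draws integer hyperparameters from `scipy.stats.randint`. The ranges and `random_state=42` are illustrative assumptions, not tuned values.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# randint(low, high) samples integers from low (inclusive) to high (exclusive)
param_dist = {
    "n_estimators": randint(10, 101),
    "max_depth": [None, 10, 20, 30],  # a list is sampled uniformly
    "min_samples_split": randint(2, 16),
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=10,        # number of sampled candidates
    cv=5,
    random_state=42,  # makes the sampled candidates reproducible
)
random_search.fit(X, y)

print("Candidates evaluated:", len(random_search.cv_results_["params"]))
print("Best hyperparameters:", random_search.best_params_)
```

The cost is fixed by `n_iter` rather than by the size of the search space: 10 candidates and 5 folds give 50 fits, however wide the ranges are.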


# Q3. What is data leakage, and why is it a problem in machine learning? Provide an example.

# ANS:-

Data leakage occurs when information that would not be available at prediction time, such as the target itself or data from the test set, is used to build the model. The result is an overly optimistic assessment: the model appears to perform well during training and validation but fails to generalize to new, unseen data. Leakage takes many forms, and it must be identified and prevented for the evaluation to be trustworthy.

Example of Data Leakage in Machine Learning:

In [8]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Generate a simple dataset with a feature and a target variable
data = {'Feature': [1, 2, 3, 4, 5, 6],
        'Target': [0, 0, 0, 1, 1, 1]}
df = pd.DataFrame(data)

# Introduce data leakage by deriving a feature from the target variable
df['Leakage_Feature'] = df['Feature'] * df['Target']

# Split the dataset into training and testing sets
train_data, test_data = train_test_split(df, test_size=0.2, random_state=42)

# Separate features and target variable
X_train = train_data[['Feature', 'Leakage_Feature']]
y_train = train_data['Target']
X_test = test_data[['Feature', 'Leakage_Feature']]
y_test = test_data['Target']

# Train a Random Forest classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy on the test set: {accuracy}")



Accuracy on the test set: 1.0
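
The perfect score above is not evidence of a good model. A quick check, sketched below on the same toy data, suggests why: `Leakage_Feature` is positive exactly when `Target` is 1, so the classifier can read the label straight from its input. The variable name `recovered_target` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Feature": [1, 2, 3, 4, 5, 6],
                   "Target": [0, 0, 0, 1, 1, 1]})
df["Leakage_Feature"] = df["Feature"] * df["Target"]

# Feature is always positive, so the product is nonzero only when Target is 1
recovered_target = (df["Leakage_Feature"] > 0).astype(int)

print(df)
print("Target recoverable from Leakage_Feature:",
      recovered_target.tolist() == df["Target"].tolist())
```

In a real project the target is unknown at prediction time, so a feature like this could not be computed for new data and the reported accuracy would not carry over.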


# Q4. How can you prevent data leakage when building a machine learning model?

# ANS:-

Preventing data leakage is crucial to ensure the robustness and reliability of machine learning models. Here are some strategies to help prevent data leakage:

Strict Separation of Training and Testing Data:

Always maintain a clear separation between the training and testing datasets. Ensure that the model is trained only on historical data and evaluated on completely unseen data.

Feature Engineering Awareness:

Be cautious when creating new features and make sure they are derived only from information available before the target variable is observed. Avoid using future information or information that directly incorporates the target variable in the training features.
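
One common way to enforce this separation in scikit-learn is to wrap preprocessing and the model in a `Pipeline`, so that each cross-validation fold fits the preprocessing on its training portion only. The sketch below is one possible setup; the choice of `StandardScaler` and `LogisticRegression` is an assumption made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitting takes place
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Within each fold the scaler is fitted on the training part only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

# Final fit on the training data, then a single evaluation on the held-out set
pipeline.fit(X_train, y_train)
test_accuracy = pipeline.score(X_test, y_test)

print("Cross-validation accuracy: {:.3f}".format(cv_scores.mean()))
print("Held-out test accuracy: {:.3f}".format(test_accuracy))
```

Fitting the scaler on the full dataset before splitting would let statistics from the test rows influence training, which is a subtle form of the leakage described above.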

# Q5. What is a confusion matrix, and what does it tell you about the performance of a classification model?