
# Bank Marketing (with social/economic context)

Number of Instances: 41188 for bank-additional-full.csv

Number of Attributes: 20 + output attribute.

# Attribute information:
# Input variables:
## Bank client data:
1 - age (numeric)

2 - job : type of job (categorical: "admin.","blue-collar","entrepreneur","housemaid","management","retired","self-employed","services","student","technician","unemployed","unknown")

3 - marital : marital status (categorical: "divorced","married","single","unknown"; note: "divorced" means divorced or widowed)

4 - education (categorical: "basic.4y","basic.6y","basic.9y","high.school","illiterate","professional.course","university.degree","unknown")

5 - default: has credit in default? (categorical: "no","yes","unknown")

6 - housing: has housing loan? (categorical: "no","yes","unknown")

7 - loan: has personal loan? (categorical: "no","yes","unknown")

## Related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: "cellular","telephone")

9 - month: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")

10 - day_of_week: last contact day of the week (categorical: "mon","tue","wed","thu","fri")

11 - duration: last contact duration, in seconds (numeric). Important note:  this attribute highly affects the output target (e.g., if duration=0 then y="no"). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
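
The note on `duration` can be acted on before modelling. The following is a minimal sketch, assuming a tiny hypothetical DataFrame that reuses the dataset's column names; the helper `drop_leaky_columns` is an illustrative name and is not part of the original notebook.

```python
import pandas as pd

def drop_leaky_columns(frame: pd.DataFrame, leaky=("duration",)) -> pd.DataFrame:
    """Return a copy without columns whose values are only known after the call ends."""
    present = [col for col in leaky if col in frame.columns]
    return frame.drop(columns=present)

# Tiny hypothetical sample using the dataset's column names
sample = pd.DataFrame({
    "age": [56, 37],
    "duration": [261, 0],
    "campaign": [1, 2],
    "y": ["no", "no"],
})

realistic = drop_leaky_columns(sample)
print(list(realistic.columns))
```

Keeping `duration` is still reasonable for a benchmark run; the point of the helper is only to make the choice explicit.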

## Other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: "failure","nonexistent","success")
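
Because `pdays` uses 999 as a sentinel for "not previously contacted", the column mixes a flag with real day counts. One possible way to separate the two is sketched below on hypothetical values; the column name `previously_contacted` is an assumption introduced for illustration.

```python
import pandas as pd

# Hypothetical pdays values: 999 means the client was not previously contacted
sample = pd.DataFrame({"pdays": [999, 6, 999, 3]})

# Binary flag: 1 if the client was contacted in a previous campaign, else 0
sample["previously_contacted"] = (sample["pdays"] != 999).astype(int)
print(sample)
```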

## Social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)     

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)     

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

## Output variable (desired target):
21 - y - has the client subscribed to a term deposit? (binary: "yes","no")

In [13]:
import matplotlib.pyplot as plt
import pandas as pd
# Step 1: Loading the dataset
df = pd.read_csv('/content/bank-additional-full.csv')
df.head()

|   | age | job | marital | education | default | housing | loan | contact | month | day_of_week | ... | campaign | pdays | previous | poutcome | emp.var.rate | cons.price.idx | cons.conf.idx | euribor3m | nr.employed | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 56 | housemaid | married | basic.4y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 1 | 57 | services | married | high.school | unknown | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 2 | 37 | services | married | high.school | no | yes | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 3 | 40 | admin. | married | basic.6y | no | no | no | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |
| 4 | 56 | services | married | high.school | no | no | yes | telephone | may | mon | ... | 1 | 999 | 0 | nonexistent | 1.1 | 93.994 | -36.4 | 4.857 | 5191.0 | no |


In [14]:
from sklearn.preprocessing import LabelEncoder

# Step 2: Preprocessing the data
# Handling missing values (if any)
# Checking for missing values
missing_values = df.isnull().sum()
print("Missing values:")
print(missing_values)

# Encoding categorical variables as integers with LabelEncoder, since scikit-learn estimators require numeric input.
# Identifying categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Initializing LabelEncoder
label_encoders = {}

# Encoding categorical columns
for col in categorical_cols:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])

Missing values:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64


In [15]:
# Step 3: Splitting the dataset into features and target variable
X = df.drop('y', axis=1)  # Features contain all columns except the target variable column (y). These features will be used as input to train the machine learning model.
y = df['y']    # Target variable is separated from the features and stored in a separate variable (y). This variable holds the values that the model aims to predict based on features.

In [16]:
# Step 4: Splitting the dataset into training and testing sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Training Set (X_train, y_train):** This subset of the data is used to train the machine learning model. It contains both the features (X_train) and the corresponding target variable (y_train).

**Testing Set (X_test, y_test):** This subset of the data is used to evaluate the trained model's performance. It also contains features (X_test) and the corresponding target variable (y_test). By using a portion of the data that the model hasn't seen during training, we can assess how well the model generalizes to unseen data.
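
With an imbalanced target, a purely random split can leave the two subsets with slightly different class proportions. The sketch below, on hypothetical labels, shows the `stratify` argument of `train_test_split`, which keeps the positive share the same in both subsets. The `_demo` names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives and 10 positives
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(100).reshape(-1, 1)

# stratify=y_demo preserves the class proportions in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("positive share in train:", y_tr.mean())
print("positive share in test:", y_te.mean())
```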

In [17]:
# Step 5: Building the Decision Tree Classifier using the DecisionTreeClassifier class from scikit-learn

from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

**Initializing the Classifier:** We create an instance of the DecisionTreeClassifier class and assign it to the variable clf. We set random_state=42 to ensure reproducibility of the results.

**Training the Classifier:** The fit() method of the classifier is then used to train the model on the training data (X_train and y_train). This process involves the model learning patterns and relationships in the training data to make predictions on new, unseen data.
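
A fitted tree can be inspected to see what it has learned. The sketch below uses synthetic data from `make_classification` (an assumption, not the bank dataset) to show that a tree with default settings typically keeps splitting until it fits the training data perfectly, which is one reason limits such as `max_depth` are worth tuning.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the bank marketing dataset
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
importances = tree.feature_importances_

print("depth:", tree.get_depth())
print("leaves:", tree.get_n_leaves())
print("training accuracy:", tree.score(X_demo, y_demo))
print("feature importances:", importances.round(3))
```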

In [18]:
# Step 6: Evaluate the model

from sklearn.metrics import accuracy_score, classification_report
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print(classification_report(y_test, y_pred))

Accuracy: 0.8894149065307113
              precision    recall  f1-score   support

           0       0.94      0.94      0.94      7303
           1       0.51      0.51      0.51       935

    accuracy                           0.89      8238
   macro avg       0.73      0.73      0.73      8238
weighted avg       0.89      0.89      0.89      8238



This step involves evaluating the performance of the trained decision tree classifier:

**Making Predictions:** The trained classifier (clf) is used to predict the target variable (y_pred) for the testing features (X_test) using the predict() method.

**Computing Accuracy:** The accuracy of the model's predictions is calculated using the accuracy_score() function from scikit-learn. This function compares the predicted labels (y_pred) with the true labels (y_test) and returns the proportion of correctly classified samples.

**Printing Metrics:** The accuracy score is printed to the console to indicate how well the model performed overall. Additionally, a detailed classification report is printed, which provides metrics such as precision, recall, F1-score, and support for each class in the target variable. These metrics help assess the model's performance across different classes and identify any potential issues, such as class imbalance or misclassifications.
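
The figures in the classification report all derive from the confusion matrix. As a small illustration on hypothetical labels (not the notebook's predictions), the sketch below computes precision and recall for the positive class by hand and compares them with scikit-learn's functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that are found

print("precision:", precision, precision_score(y_true, y_hat))
print("recall:", recall, recall_score(y_true, y_hat))
```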

### **Overall performance of the model:** Good

* **Accuracy**: The accuracy of the model is 0.889, or 88.9% (Approx.). This indicates that the model correctly predicts the target variable for nearly 89% of the samples in the testing set.

* **Precision & Recall**: Precision measures the proportion of true positives among all positive predictions, while recall measures the proportion of true positives among all actual positives. For class 1 (the positive class), both are about 0.51: the model finds roughly 51% of the actual positives, and about 49% of its positive predictions are false positives.

* **F1-score**: The F1-score is the harmonic mean of precision and recall. For class 1 it is also about 0.51, because precision and recall are nearly equal; the value itself is low compared with 0.94 for class 0.

* **Support**: The support values represent the number of samples for each class. In this case, there are 7303 instances of class 0 and 935 instances of class 1 in the testing set.
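
The support values also give a useful reference point for the accuracy figure. The short calculation below uses only the numbers reported above to work out the accuracy of a trivial classifier that always predicts class 0; the variable names are illustrative.

```python
# Support values and accuracy taken from the report above
n_negative = 7303
n_positive = 935
n_total = n_negative + n_positive

# Accuracy of always predicting the majority class (class 0)
baseline_accuracy = n_negative / n_total
model_accuracy = 0.8894149065307113

print("majority-class baseline:", round(baseline_accuracy, 4))
print("gain over baseline:", round(model_accuracy - baseline_accuracy, 4))
```

The baseline comes out at about 88.65%, so the untuned tree's 88.9% accuracy is only marginally above what a constant prediction achieves. This is why the class 1 metrics matter more than accuracy here.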

Overall, while the accuracy is relatively high, the precision, recall, and F1-score for the minority class (class 1) are much lower, indicating that the model struggles to identify positive instances. Depending on the application and the relative cost of false positives and false negatives, further optimization may be necessary. Given the class imbalance (class 0 has roughly eight times as many test samples as class 1), techniques such as resampling or adjusting class weights may help.
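
As one illustration of adjusting class weights, `DecisionTreeClassifier` accepts `class_weight="balanced"`, which weights each class inversely to its frequency. The sketch below shows the resulting weights for hypothetical 90/10 labels and fits a weighted tree on synthetic data; it makes no claim about how much this would help on the bank dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 negatives and 10 positives
y_demo = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * class_count)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print("class weights:", dict(zip([0, 1], weights.round(3))))

# Synthetic imbalanced data as a stand-in for the training set
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)
weighted_clf = DecisionTreeClassifier(
    class_weight="balanced", max_depth=5, random_state=42
).fit(X_syn, y_syn)
print("predicted positives:", int(weighted_clf.predict(X_syn).sum()))
```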

In [19]:
from sklearn.model_selection import GridSearchCV

# Step 7: Fine-tune the model
# Defining the parameter grid
param_grid = {
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Initializing the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)

# Initializing GridSearchCV
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid, cv=5, scoring='accuracy')

# Performing grid search
grid_search.fit(X_train, y_train)

# Getting the best parameters
best_params = grid_search.best_params_
print("Best Parameters:", best_params)

# Using the best parameters to train a new decision tree classifier
best_clf = DecisionTreeClassifier(**best_params, random_state=42)
best_clf.fit(X_train, y_train)

Best Parameters: {'max_depth': 5, 'min_samples_leaf': 1, 'min_samples_split': 5}


The above code performs Hyperparameter Tuning for the decision tree classifier using GridSearchCV, which systematically searches for the best combination of hyperparameters from a specified grid. Here's a summary:

**Parameter Grid Definition:**

* Defines a dictionary param_grid specifying the hyperparameters to be tuned (max_depth, min_samples_split, min_samples_leaf) and the range of values to search over.

**Initialize Decision Tree Classifier:**

* Initializes a decision tree classifier (clf) with default parameters and a fixed random state for reproducibility.

**Initialize GridSearchCV:**

* Creates an instance of GridSearchCV, specifying the decision tree classifier (estimator), the parameter grid (param_grid), 5-fold cross-validation (cv=5), and the scoring metric (scoring='accuracy').

**Perform Grid Search:**

* Executes the grid search by calling the fit() method on the training data (X_train, y_train). GridSearchCV performs cross-validated hyperparameter tuning, evaluating the model's performance using the specified scoring metric.

**Get Best Parameters:**

* Retrieves the best hyperparameters found during the grid search using the best_params_ attribute of the GridSearchCV object.

**Train Model with Best Parameters:**
* Creates a new decision tree classifier (best_clf) with the best hyperparameters obtained from grid search.

* Trains the model on the entire training data (X_train, y_train) using the best parameters.

This process selects the hyperparameter combination with the best mean cross-validated accuracy on the training data, among the values listed in the grid. Adjusting these hyperparameters can significantly affect the model's performance and generalization ability.
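
Two variations on the grid search may be worth knowing. With the default `refit=True`, `GridSearchCV` already retrains the best configuration on the full training data and exposes it as `best_estimator_`, so a separate refit is optional. The `scoring` argument can also target the minority class, for example with `"f1"`. The sketch below uses synthetic data and a reduced grid; both are assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data as a stand-in for X_train, y_train
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=42
)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [1, 4]},
    cv=5,
    scoring="f1",  # optimise F1 for the positive class instead of accuracy
)
search.fit(X_demo, y_demo)

# Already refit on all of X_demo, y_demo with the best parameters
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```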

In [20]:
# Step 8: Make predictions
# Use the trained classifier to make predictions on new data
y_pred = best_clf.predict(X_test)

# Evaluate the performance of the tuned model
accuracy = accuracy_score(y_test, y_pred)
print("Tuned Model Accuracy:", accuracy)
print(classification_report(y_test, y_pred))

Tuned Model Accuracy: 0.9149065307113377
              precision    recall  f1-score   support

           0       0.94      0.96      0.95      7303
           1       0.65      0.54      0.59       935

    accuracy                           0.91      8238
   macro avg       0.80      0.75      0.77      8238
weighted avg       0.91      0.91      0.91      8238



The above code makes predictions using the tuned decision tree classifier (best_clf) on the test data (X_test) and evaluates the performance of the model using various metrics. Here's a summary:

**Make Predictions:**

* Uses the predict() method of the trained classifier (best_clf) to predict the class labels for the test data (X_test). The predicted labels are stored in the variable y_pred.

**Evaluate Model Performance:**
* Computes the accuracy of the model by comparing the predicted labels (y_pred) with the true labels (y_test) using the accuracy_score() function.
* Prints the accuracy of the tuned model on the test data.
* Generates a classification report using the classification_report() function, which includes precision, recall, F1-score, and support for each class, as well as the overall accuracy and the macro and weighted averages.

This step provides insight into how well the tuned model performs on unseen data, allowing us to assess its generalization ability and make informed decisions about its deployment or further refinement.

## Performance of tuned model:
The model's performance metrics indicate reasonably **good performance**, but there are some aspects to consider:

**Accuracy:** The overall accuracy of 91.49% suggests that the model correctly predicts the class labels for approximately 91.49% of the instances in the test set. However, accuracy alone may not be sufficient to evaluate model performance, especially in the presence of class imbalance.

**Precision and Recall:**
* Precision (also known as positive predictive value) measures the proportion of correctly predicted positive instances among all instances predicted as positive. In this case, the precision for class 1 is 0.65, indicating that 65% of the instances predicted as positive are true positives.

* Recall (also known as sensitivity) measures the proportion of correctly predicted positive instances among all actual positive instances. The recall for class 1 is 0.54, indicating that 54% of the actual positive instances are correctly identified by the model.

**F1-score:** The F1-score is the harmonic mean of precision and recall and provides a balance between the two metrics. The F1-score for class 1 is 0.59, which is considerably lower than for class 0 (0.95), indicating that the model's performance on class 1 is weaker.

**Support:** The support refers to the number of actual occurrences of each class in the test set. It provides context for interpreting the precision, recall, and F1-score metrics.

Overall, while the model achieves a high accuracy, it's essential to consider its performance across different metrics, especially in the context of class imbalance. Further analysis, such as examining the ROC curve and AUC (Area Under the Curve), and potentially exploring additional techniques like threshold adjustment or class weighting, may help in assessing and improving the model's performance further.
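
The ROC AUC mentioned above is computed from predicted scores rather than hard labels. The sketch below uses four hypothetical scores to show the calls involved; in the notebook, the scores would presumably come from `best_clf.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# AUC: probability that a random positive is scored above a random negative
auc = roc_auc_score(y_true, scores)

# False positive rate and true positive rate at each candidate threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

print("AUC:", auc)
print("FPR:", fpr)
print("TPR:", tpr)
```

Here three of the four positive/negative pairs are ranked correctly, which gives an AUC of 0.75.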

In [21]:
# Example of threshold adjustment
threshold = 0.4  # Adjust the threshold as needed
y_pred_adjusted = (best_clf.predict_proba(X_test)[:,1] >= threshold).astype(int)

# Evaluate the adjusted predictions
print(classification_report(y_test, y_pred_adjusted))

              precision    recall  f1-score   support

           0       0.95      0.95      0.95      7303
           1       0.62      0.64      0.63       935

    accuracy                           0.91      8238
   macro avg       0.78      0.79      0.79      8238
weighted avg       0.92      0.91      0.91      8238



The metrics obtained after threshold adjustment show a better balance between precision and recall for class 1 (the positive class) than the tuned model achieves with its default decision threshold.

Here's a brief interpretation:

* **Precision:** The precision for class 1 has decreased slightly, from 0.65 to 0.62, after threshold adjustment. Out of all instances predicted as positive, 62% are true positives; lowering the threshold labels more instances as positive, which admits more false positives.

* **Recall:** The recall for class 1 has increased from 0.54 to 0.64. This indicates that the model is capturing a higher proportion of actual positive instances, reducing the number of false negatives.

* **F1-score:** The F1-score for class 1 has also increased from 0.59 to 0.63, reflecting a better balance between precision and recall for the positive class.

* **Accuracy:** The overall accuracy remains at about 91%, indicating that the model's ability to classify both classes correctly has been maintained.

Overall, these metrics suggest an improvement in the model's performance after threshold adjustment, particularly in correctly identifying positive instances (class 1). However, whether this performance is considered "good" depends on the specific requirements and constraints.
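
Rather than fixing the threshold by hand, it can be chosen from the precision-recall curve. The sketch below picks the threshold with the highest F1-score on hypothetical scores; in practice the scores would ideally come from a separate validation split, so that the test set is not used to choose the threshold. All names are illustrative.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical true labels and predicted probabilities for class 1
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# The last precision/recall pair has no threshold, so drop it before computing F1
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / (p + r + 1e-12)

best_threshold = thresholds[np.argmax(f1)]
y_adjusted = (scores >= best_threshold).astype(int)

print("best threshold:", best_threshold)
print("adjusted predictions:", y_adjusted)
```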