## 0\. Background

*   You have been given a bank marketing dataset and tasked with predicting the outcome of a marketing campaign.

*   The sample dataset includes a <font color="red">y</font> target column.


<font color="red"><h1> PRACTICE LAB </h1></font>

This is a Kaggle dataset; additional information is available at https://www.kaggle.com/datasets/prakharrathi25/banking-dataset-marketing-targets



## 1\. Load the Dataset into a Pandas DataFrame

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Project Directory path
project_dir = 'drive/MyDrive/UW.8740/Lab Tests/Lab2/'

In [None]:
train = pd.read_csv(project_dir+ 'train.csv')
test = pd.read_csv(project_dir+ 'test.csv')

In [None]:
train

In [None]:
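# ignore_index=True gives the combined frame a fresh 0..n-1 index
# instead of repeating the row labels of train and test.
df = pd.concat([train, test], ignore_index=True)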

### 1\.1 Display the top 5 rows and bottom 5 rows of the DataFrame

<font color="red"><h1>Submit Answer Below [1]:</h1></font>


In [None]:
df
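The default notebook rendering of a long DataFrame is truncated to its first and last rows, but calling `head` and `tail` makes the answer explicit. A minimal sketch, using a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the lab data):

```python
import pandas as pd

# Hypothetical stand-in for the lab DataFrame `df`.
toy = pd.DataFrame({"age": range(20, 32), "y": ["no", "yes"] * 6})

top5 = toy.head(5)      # first five rows
bottom5 = toy.tail(5)   # last five rows

print(top5)
print(bottom5)
```

On the lab data the same two calls would be `df.head(5)` and `df.tail(5)`.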

## 2\. Data Exploration

### 2\.1 Identify the number of missing (null) values in each column

<font color="red"><h1>Submit Answer Below [2]:</h1></font>

In [None]:
df.info()

In [None]:
df.isnull().sum()
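`isnull().sum()` returns one count per column; filtering that result keeps only the columns that actually contain missing values. A small sketch on made-up data (the `toy` frame is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing entries.
toy = pd.DataFrame({
    "age": [30, np.nan, 45],
    "job": ["admin.", None, None],
    "balance": [100, 250, -40],
})

null_counts = toy.isnull().sum()                 # missing values per column
columns_with_nulls = null_counts[null_counts > 0]  # drop columns with no nulls

print(columns_with_nulls)
```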

In [None]:
df.describe()

### 2\.2 Identify the most frequently occurring job, education, and marital status in your dataset

<font color="red"><h1>Submit Answer Below [3] :</h1></font>

In [None]:
df.describe(include='all')
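`describe(include='all')` reports the most frequent value of each non-numeric column in its `top` row. A more direct sketch, assuming the column names `job`, `education`, and `marital` from the dataset and using made-up rows:

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the lab dataset.
toy = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "education": ["secondary", "secondary", "tertiary", "secondary"],
    "marital": ["married", "single", "married", "married"],
})

# mode() returns one row per tied value; iloc[0] takes the first mode of each column.
most_frequent = toy[["job", "education", "marital"]].mode().iloc[0]
print(most_frequent)

# value_counts() shows the full frequency table for a single column.
job_counts = toy["job"].value_counts()
print(job_counts)
```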

In [None]:
df.nunique()

In [None]:
df.nunique()[df.nunique()<=20]

### 2\.3 Display the unique values of every column with fewer than 20 unique values to confirm that it should be treated as categorical

<font color="red"><h1>Submit Answer Below [4] :</h1></font>

In [None]:
for column in df.columns:
    if (df[column].nunique() < 20):
        print("Column:", column)
        print("Number of Unique Values:", df[column].nunique())
        print("Unique values:", df[column].unique())
        print("Most Frequent Value:", df[column].mode()[0])
        print('*' * 80)

## 3\. Pre-process the data

### 3\.1 Organize your columns into categorical features, numeric features, and the target variable


In [None]:
y = df['y']
X = df.drop('y', axis=1)
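One way to organise the feature columns is by dtype. The sketch below assumes that categorical features are stored as text and numeric features as numbers; the `toy` frame and the names `categorical_cols` and `numeric_cols` are illustrative, not part of the lab code:

```python
import pandas as pd

# Hypothetical rows; column names follow the lab dataset.
toy = pd.DataFrame({
    "age": [30, 41, 52],
    "balance": [100, -20, 3500],
    "job": ["admin.", "technician", "services"],
    "housing": ["yes", "no", "yes"],
    "y": ["no", "yes", "no"],
})

target = "y"
features = toy.drop(columns=[target])

# Numeric columns by dtype; everything else is treated as categorical.
numeric_cols = features.select_dtypes(include="number").columns.tolist()
categorical_cols = features.select_dtypes(exclude="number").columns.tolist()

print("Numeric:", numeric_cols)
print("Categorical:", categorical_cols)
```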

### 3\.2 Feature Transformation


*   One Hot Encoding for Categorical Features.  You may include binary features in the one hot encoding to simplify your code

* You can drop the month column to simplify your code

* Use LabelEncoder to convert the y column from (yes/no) to (0/1)


<font color="red"><h1>Submit Answer Below [5] :</h1></font>

In [None]:
df

In [None]:
X = X.drop(["month"], axis = 1)

In [None]:
categorical_features = ['job', 'marital', 'education', 'default',
                        'housing', 'loan', 'contact', 'poutcome']

In [None]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

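# Note: ColumnTransformer defaults to remainder='drop', so every column not
# listed in categorical_features (i.e. the numeric features) is discarded here.
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features)
    ])

X = preprocessor.fit_transform(X)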

In [None]:
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)
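`LabelEncoder` assigns integer codes in sorted label order, which can be confirmed through its `classes_` attribute. A small sketch on made-up labels:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values.
labels = ["no", "yes", "no", "no", "yes"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so code 0 is "no" and code 1 is "yes".
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes.
decoded = encoder.inverse_transform(encoded).tolist()
```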

## 4\. Train Predictive Classification Models

Split your data to (80% Training/20% Testing)

Train the three classifiers with no optimization.

Report the following for all classifiers :

*   Confusion Matrix
*   Accuracy Score
*   AUC Score
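Since the same three metrics are reported for every classifier, a small helper can avoid repetition. The function name `report_metrics` and the synthetic data are assumptions for illustration; the sketch uses class-1 probabilities for the AUC when the model offers them and falls back to `decision_function` otherwise:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def report_metrics(model, X_test, y_test):
    """Return confusion matrix, accuracy, and AUC for a fitted binary classifier."""
    y_pred = model.predict(X_test)
    # AUC is computed from continuous scores rather than hard labels.
    if hasattr(model, "predict_proba"):
        scores = model.predict_proba(X_test)[:, 1]
    else:
        scores = model.decision_function(X_test)
    return {
        "confusion_matrix": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
        "auc": roc_auc_score(y_test, scores),
    }


# Synthetic binary classification data standing in for the lab features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
metrics = report_metrics(model, X_te, y_te)
print(metrics)
```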



<font color="red"><h1>Submit Answer Below [6] :</h1></font>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
###############################################################
# Accuracy Metric
###############################################################
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

### 4\.1 Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=42)
dt_clf.fit(X_train, y_train)

In [None]:
from sklearn.metrics import roc_curve

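y_test_pred = dt_clf.predict(X_test)
# This AUC is computed from hard 0/1 predictions; the probability-based AUC is printed further down.
print(f'Area under ROC curve: {roc_auc_score(y_test, y_test_pred): 0.4f}')
print(f'Accuracy {accuracy_score(y_test, y_test_pred): 0.4f}')
print(f'Weighted F1 score {f1_score(y_test, y_test_pred, average="weighted"): 0.4f}')
print(classification_report(y_test, y_test_pred, digits=4))
# A bare expression in the middle of a cell is not displayed, so print the confusion matrix explicitly.
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred)))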

print('\nROC curve')
probs = dt_clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc_score = roc_auc_score(y_test, probs)
print("AUC Score:", auc_score)
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--', color='red')  # Diagonal line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.show()

### 4\.2 Support Vector Machine

In [None]:
from sklearn.svm import SVC

svm = SVC(gamma='auto')
svm = svm.fit(X_train,y_train)

In [None]:
y_svm_pred = svm.predict(X_test)
print(f'Area under ROC curve: {roc_auc_score(y_test, y_svm_pred): 0.4f}')
print(f'Accuracy {accuracy_score(y_test, y_svm_pred): 0.4f}')
print(f'Weighted F1 score {f1_score(y_test, y_svm_pred, average="weighted"): 0.4f}')
print(classification_report(y_test, y_svm_pred, digits=4))
pd.DataFrame(confusion_matrix(y_test, y_svm_pred))
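The AUC reported for the SVM above is based on hard 0/1 predictions. `SVC` does not provide `predict_proba` unless it is built with `probability=True`, but its `decision_function` scores can be passed to `roc_auc_score` instead. A sketch on synthetic data (the data and variable names are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary classification data standing in for the lab features.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

clf = SVC(gamma="auto").fit(X_tr, y_tr)

# Signed distance to the separating boundary: a continuous score per sample.
scores = clf.decision_function(X_te)

auc_from_scores = roc_auc_score(y_te, scores)
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

print(f"AUC from decision scores: {auc_from_scores:.4f}")
print(f"AUC from hard labels:     {auc_from_labels:.4f}")
```

The score-based value uses the full ranking of test samples, whereas the label-based value reflects only a single operating point.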

### 4\.3 Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter=1000)
# Fit the estimator itself; X_test is a feature matrix and has no fit method.
logit.fit(X_train, y_train)

In [None]:
y_lr_pred = logit.predict(X_test)
print(f'Area under ROC curve: {roc_auc_score(y_test, y_lr_pred): 0.4f}')
print(f'Accuracy {accuracy_score(y_test, y_lr_pred): 0.4f}')
print(f'Weighted F1 score {f1_score(y_test, y_lr_pred, average="weighted"): 0.4f}')
print(classification_report(y_test, y_lr_pred, digits=4))
pd.DataFrame(confusion_matrix(y_test, y_lr_pred))

## 5\. Save the models to a file using the pickle library


- Show that you can save a model to a file and then load it back from that file

<font color="red"><h1>Submit Answer Below [7] :</h1></font>

In [None]:
import pickle
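A minimal sketch of the save-and-reload round trip, assuming a small model trained on synthetic data and a hypothetical file name `logit_model.pkl` in a temporary directory (in the lab, the fitted `dt_clf`, `svm`, or `logit` and a path under `project_dir` would take their place):

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data and a small fitted model standing in for the lab models.
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Hypothetical output location.
model_path = os.path.join(tempfile.mkdtemp(), "logit_model.pkl")

# Save: pickle requires the file to be opened in binary write mode.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load: open in binary read mode and rebuild the estimator.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = bool((loaded_model.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions match after reload:", same_predictions)
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same scikit-learn version that saved it.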