- Enable logging so that all printouts from your program go directly to a log file named *logfile.txt*, which you also need to submit alongside your source code.


In [248]:
import logging
# Logging configuration: write every record at DEBUG level or above to logfile.txt
logging.basicConfig(
    level=logging.DEBUG,
    format='%(levelname)-6s | %(asctime)s | %(message)s',
    filename='logfile.txt',
    filemode='w',
)
logger = logging.getLogger()
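
Note that `print()` calls write to standard output and are not captured by the configuration above; only messages sent through the logger reach the file. A minimal sketch of routing a result through the logger follows. The temporary path, the `force=True` argument (Python 3.8+), and the placeholder accuracy value are assumptions made for this illustration, not part of the notebook.

```python
import logging
import os
import tempfile

# Hypothetical path used only for this sketch; the notebook writes to 'logfile.txt'.
log_path = os.path.join(tempfile.mkdtemp(), "logfile_demo.txt")

logging.basicConfig(
    level=logging.DEBUG,
    format="%(levelname)-6s | %(asctime)s | %(message)s",
    filename=log_path,
    filemode="w",
    force=True,  # replace handlers already attached to the root logger
)
logger = logging.getLogger()

accuracy = 0.5  # placeholder value, not a result from this notebook
logger.info("Accuracy: %s", accuracy)  # goes to the file instead of stdout

logging.shutdown()  # flush and close the file handler before reading
with open(log_path) as fh:
    log_text = fh.read()
```

Replacing each `print("Accuracy:", accuracy)` with a `logger.info(...)` call of this shape would send the same text to the log file.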

- Read A.csv data.

In [249]:
import pandas as pd
df = pd.read_csv('deliverables/A.csv', sep=';', header = 0)
df.head()

Unnamed: 0,wavelet_transformed_variance,wavelet_transformed_skewness,wavelet_transformed_curtosis,image_entropy,counterfeit
0,-3.5985,-13.6593,17.6052,-2.4927,1
1,-2.0662,0.16967,-1.0054,-0.82975,1
2,3.9922,-4.4676,3.7304,-0.1095,0
3,4.2134,-2.806,2.0116,0.67412,0
4,4.3398,-5.3036,3.8803,-0.70432,0


In [250]:
df.shape

(1234, 5)

In [251]:
df.columns

Index(['wavelet_transformed_variance', 'wavelet_transformed_skewness',
       'wavelet_transformed_curtosis', 'image_entropy', 'counterfeit'],
      dtype='object')

- Separate the dependent variable y as the counterfeit column, and the rest of the variables as independent variables (as X).

In [252]:
# Separate dependent variable y as the counterfeit column
variable_y = df['counterfeit']
variable_y

0       1
1       1
2       0
3       0
4       0
       ..
1229    0
1230    0
1231    0
1232    1
1233    0
Name: counterfeit, Length: 1234, dtype: int64

In [253]:
# rest of the variables as independent variables (as X).
variable_x = df.drop(columns='counterfeit')
variable_x.shape

(1234, 4)

In [254]:
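# Reshape y from a pandas Series into an (n, 1) NumPy column vector.
# Note: scikit-learn classifiers expect a 1-D y, so this shape triggers the column_or_1d warnings in later outputs.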
variable_y = variable_y.values.reshape(-1,1)
variable_y.shape

(1234, 1)

- Split the dataset, i.e., the (X,y) into training (50%) and test (50%).

In [255]:
nrows = len(df)
split = nrows//2
split

617

In [256]:
training_set_y = variable_y[:split]
print(training_set_y.shape)
testing_set_y = variable_y[split:]
print(testing_set_y.shape)

(617, 1)
(617, 1)


In [257]:
training_set_x = variable_x[:split]
print(training_set_x.shape)
testing_set_x = variable_x[split:]
print(testing_set_x.shape)

(617, 4)
(617, 4)
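
The slicing above takes the first half of the rows for training and the second half for testing, which preserves file order. If the rows happened to be ordered by class, a shuffled and stratified split might be preferable. The sketch below uses `train_test_split` on synthetic data; the arrays `X_demo` and `y_demo` and the seed are hypothetical and stand in for `variable_x` and `variable_y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for variable_x and variable_y (assumptions for this sketch).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(10, 4))
y_demo = np.array([0, 1] * 5)

# 50/50 split, shuffled, with class proportions kept similar in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0
)
print(X_tr.shape, X_te.shape)
```

A fixed `random_state` keeps the split reproducible between runs.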


- Load the scaler object from *scaler.joblib*. It has already been fit to a set of training samples, so there is no need to fit it again. Instead, use the following lines to transform both the training and test sets with the already-fitted scaler. Do not scale the dependent variable (i.e., y, the target column counterfeit).


In [258]:
import joblib
# Load the pre-fitted scaler; joblib.load returns the fitted object,
# so no new StandardScaler instance needs to be created first.
scaler = joblib.load('deliverables/scaler.joblib')
X_train_scaled = scaler.transform(training_set_x)
X_test_scaled = scaler.transform(testing_set_x)
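
To illustrate what the loaded scaler does, the sketch below saves and reloads a `StandardScaler` with joblib and checks that `transform` subtracts the stored `mean_` and divides by the stored `scale_`. The synthetic arrays and the temporary file name are assumptions for this example; the notebook's actual scaler comes from `deliverables/scaler.joblib`.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the samples the real scaler was fit on.
rng = np.random.default_rng(0)
X_fit = rng.normal(loc=2.0, scale=3.0, size=(20, 4))
X_new = rng.normal(size=(5, 4))

# Round trip through a temporary file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "scaler_demo.joblib")
joblib.dump(StandardScaler().fit(X_fit), path)
loaded = joblib.load(path)

# transform() reuses the statistics learned during fit; nothing is refit here.
scaled = loaded.transform(X_new)
manual = (X_new - loaded.mean_) / loaded.scale_
print(np.allclose(scaled, manual))
```

Because the statistics are fixed at fit time, the same transformation is applied to the training set, the test set, and any later data such as B.csv or C.csv.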



https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


- Build your first classifier with LogisticRegression.
- Once the model1 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model1.joblib. Please do not submit the joblib file; it is only for your own later use of this model.


In [259]:
import numpy as np
from sklearn.linear_model import LogisticRegression

model1 = LogisticRegression(penalty='l1',tol=1,solver='liblinear',multi_class='auto',fit_intercept=False,max_iter=3)


In [260]:
from sklearn.metrics import accuracy_score

model1.fit(X_train_scaled,training_set_y)
pred1 = model1.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred1)
print("Accuracy:", accuracy)


Accuracy: 0.965964343598055


  y = column_or_1d(y, warn=True)
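
The truncated warning line in the output above appears to be the tail of scikit-learn's `DataConversionWarning`, raised when `y` is passed as an (n, 1) column instead of a 1-D array. A small sketch, on synthetic data with hypothetical names, of how flattening the target with `ravel()` avoids it:

```python
import warnings

import numpy as np
from sklearn.exceptions import DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data; y_column mimics the (n, 1) shape produced by reshape(-1, 1).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 4))
y_column = (X_demo[:, 0] > 0).astype(int).reshape(-1, 1)

def count_conversion_warnings(y):
    """Fit a default LogisticRegression and count DataConversionWarning records."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        LogisticRegression().fit(X_demo, y)
    return sum(issubclass(w.category, DataConversionWarning) for w in caught)

warnings_with_column = count_conversion_warnings(y_column)
warnings_with_flat = count_conversion_warnings(y_column.ravel())
print(warnings_with_column, warnings_with_flat)
```

In this notebook the equivalent change would be to pass `training_set_y.ravel()` to each `fit` call.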


In [261]:
joblib.dump(model1, 'model1.joblib')

['model1.joblib']

- Build your second classifier with KNeighborsClassifier
- Once the model2 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model2.joblib. Please do not submit the joblib file; it is only for your own later use of this model.

In [262]:
from sklearn.neighbors import KNeighborsClassifier
model2 = KNeighborsClassifier(n_neighbors=3,p=2)

In [263]:
model2.fit(X_train_scaled,training_set_y)
pred2 = model2.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred2)
print("Accuracy:", accuracy)

Accuracy: 0.9967585089141004


  return self._fit(X, y)


In [264]:
joblib.dump(model2, 'model2.joblib')

['model2.joblib']

- Build your third classifier with SVC
- Once the model3 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model3.joblib. Please do not submit the joblib file; it is only for your own later use of this model.

In [265]:
from sklearn.svm import SVC
model3 = SVC(gamma='auto')

In [266]:
model3.fit(X_train_scaled,training_set_y)
pred3 = model3.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred3)
print("Accuracy:", accuracy)

Accuracy: 0.9918962722852512


  y = column_or_1d(y, warn=True)


In [267]:
joblib.dump(model3, 'model3.joblib')

['model3.joblib']

- Build your fourth classifier with MLPClassifier
- Once the model4 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model4.joblib. Please do not submit the joblib file; it is only for your own later use of this model.

In [268]:
from sklearn.neural_network import MLPClassifier
model4 = MLPClassifier(solver='lbfgs',alpha=0.1,hidden_layer_sizes=(1,5))

In [269]:
model4.fit(X_train_scaled,training_set_y)
pred4 = model4.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred4)
print("Accuracy:", accuracy)

  y = column_or_1d(y, warn=True)


Accuracy: 0.9902755267423015


In [270]:
joblib.dump(model4, 'model4.joblib')

['model4.joblib']

- Build your fifth classifier with DecisionTreeClassifier
- Once the model5 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model5.joblib. Please do not submit the joblib file; it is only for your own later use of this model.

In [271]:
from sklearn.tree import DecisionTreeClassifier

model5 = DecisionTreeClassifier()
model5.fit(X_train_scaled,training_set_y)
pred5 = model5.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred5)
print("Accuracy:", accuracy)


Accuracy: 0.9821717990275527


In [272]:
joblib.dump(model5, 'model5.joblib')

['model5.joblib']

- Build your sixth classifier with RandomForestClassifier.
- Once the model6 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model6.joblib. Please do not submit the joblib file; it is only for your own later use of this model.


In [273]:
from sklearn.ensemble import RandomForestClassifier

model6 = RandomForestClassifier(max_depth=2)

In [274]:
model6.fit(X_train_scaled,training_set_y)
pred6 = model6.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred6)
print("Accuracy:", accuracy)

  return fit_method(estimator, *args, **kwargs)


Accuracy: 0.9319286871961102


In [275]:
joblib.dump(model6, 'model6.joblib')

['model6.joblib']

- Build your seventh classifier with XGBClassifier
- Once the model7 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model7.joblib. Please do not submit the joblib file; it is only for your own later use of this model.

In [276]:
from xgboost import XGBClassifier
model7 = XGBClassifier(n_estimators=2,max_depth= 2, learning_rate=1, objective='binary:logistic')

In [277]:
model7.fit(X_train_scaled,training_set_y)
pred7 = model7.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred7)
print("Accuracy:", accuracy)

Accuracy: 0.9529983792544571


In [278]:
joblib.dump(model7, 'model7.joblib')

['model7.joblib']

- Build your eighth classifier with GaussianNB
- Once the model8 object is instantiated, call the fit method on the training split.
- Save the model with the joblib library as a file named model8.joblib. Please do not submit the joblib file; it is only for your own later use of this model.


In [279]:
from sklearn.naive_bayes import GaussianNB

model8 = GaussianNB()
model8.fit(X_train_scaled,training_set_y)
pred8 = model8.predict(X_test_scaled)
accuracy = accuracy_score(testing_set_y, pred8)
print("Accuracy:", accuracy)

Accuracy: 0.8476499189627229


  y = column_or_1d(y, warn=True)


In [280]:
joblib.dump(model8, 'model8.joblib')

['model8.joblib']

-   Get predictions for the 2 samples in B.csv:
    - Have each of the 8 models predict the 2 samples present in the B.csv file. Please note: the true annotations for the two samples are not present in the file. 


In [281]:
df_B = pd.read_csv('deliverables/B.csv',sep=';',header=0)
df_B.head()

Unnamed: 0,wavelet_transformed_variance,wavelet_transformed_skewness,wavelet_transformed_curtosis,image_entropy
0,3.97,10.17,-4.11,-4.61
1,0.9,4.77,-4.84,-5.59


In [282]:
# all the columns in B are the same as variable_x
B_variable_x = df_B
B_variable_x_scaled = scaler.transform(B_variable_x)
B_variable_x_scaled.shape

(2, 4)

In [283]:
predicted1 = model1.predict(B_variable_x_scaled)
print("Model 1 predicted",predicted1)
predicted2 = model2.predict(B_variable_x_scaled)
print("Model 2 predicted",predicted2)
predicted3 = model3.predict(B_variable_x_scaled)
print("Model 3 predicted",predicted3)
predicted4 = model4.predict(B_variable_x_scaled)
print("Model 4 predicted",predicted4)
predicted5 = model5.predict(B_variable_x_scaled)
print("Model 5 predicted",predicted5)
predicted6 = model6.predict(B_variable_x_scaled)
print("Model 6 predicted",predicted6)
predicted7 = model7.predict(B_variable_x_scaled)
print("Model 7 predicted",predicted7)
predicted8 = model8.predict(B_variable_x_scaled)
print("Model 8 predicted",predicted8)

Model 1 predicted [0 1]
Model 2 predicted [0 1]
Model 3 predicted [0 1]
Model 4 predicted [0 1]
Model 5 predicted [0 1]
Model 6 predicted [0 1]
Model 7 predicted [0 1]
Model 8 predicted [0 0]
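
The eight near-identical predict/print pairs above could also be written as a loop over a dictionary of fitted models. The sketch below shows the pattern with two classifiers trained on synthetic data; the dictionary keys, the arrays, and the seed are hypothetical and do not reproduce the notebook's models.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic training data and two new samples (assumptions for this sketch).
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(30, 4))
y_fit = (X_fit[:, 0] > 0).astype(int)
X_new = rng.normal(size=(2, 4))

# Map a readable name to each fitted model.
fitted_models = {
    "knn": KNeighborsClassifier(n_neighbors=3).fit(X_fit, y_fit),
    "tree": DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit),
}

# One predict call per model, collected by name.
predictions = {name: model.predict(X_new) for name, model in fitted_models.items()}
for name, pred in predictions.items():
    print(f"{name} predicted {pred}")
```

With the notebook's objects, the dictionary would hold `model1` through `model8` and `X_new` would be `B_variable_x_scaled`.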


In [284]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, matthews_corrcoef
pred1 = pred1.reshape(-1,1)
print(pred1.shape)
print("Precision for model1:", precision_score(testing_set_y, pred1, average="binary")) 
print('Recall for model1:', recall_score(testing_set_y, pred1, average="binary")) 


(617, 1)
Precision for model1: 0.9333333333333333
Recall for model1: 0.9925373134328358


In [285]:
pred2 = pred2.reshape(-1,1)
print(pred2.shape)
print("Precision for model2:", precision_score(testing_set_y, pred2, average="binary"))
print('Recall for model2:', recall_score(testing_set_y, pred2, average="binary"))

(617, 1)
Precision for model2: 0.9925925925925926
Recall for model2: 1.0


In [286]:
pred3 = pred3.reshape(-1,1)
print(pred3.shape)
print("Precision for model3:", precision_score(testing_set_y, pred3, average="binary")) 
print('Recall for model3:', recall_score(testing_set_y, pred3, average="binary")) 

(617, 1)
Precision for model3: 0.9816849816849816
Recall for model3: 1.0


In [287]:
pred4 = pred4.reshape(-1,1)
print(pred4.shape)
print("Precision for model4:", precision_score(testing_set_y, pred4, average="binary")) 
print('Recall for model4:', recall_score(testing_set_y, pred4, average="binary"))

(617, 1)
Precision for model4: 0.9888059701492538
Recall for model4: 0.9888059701492538


In [288]:
pred5 = pred5.reshape(-1,1)
print(pred5.shape)
print("Precision for model5:", precision_score(testing_set_y, pred5, average="binary")) 
print('Recall for model5:', recall_score(testing_set_y, pred5, average="binary"))

(617, 1)
Precision for model5: 0.9638989169675091
Recall for model5: 0.996268656716418


In [289]:
pred6 = pred6.reshape(-1,1)
print(pred6.shape)
print("Precision for model6:", precision_score(testing_set_y, pred6, average="binary")) 
print('Recall for model6:', recall_score(testing_set_y, pred6, average="binary"))

(617, 1)
Precision for model6: 0.9448818897637795
Recall for model6: 0.8955223880597015


In [290]:
pred7 = pred7.reshape(-1,1)
print(pred7.shape)
print("Precision for model7:", precision_score(testing_set_y, pred7, average="binary")) 
print('Recall for model7:', recall_score(testing_set_y, pred7, average="binary"))

(617, 1)
Precision for model7: 0.947565543071161
Recall for model7: 0.9440298507462687


In [291]:
pred8 = pred8.reshape(-1,1)
print(pred8.shape)
print("Precision for model8:", precision_score(testing_set_y, pred8, average="binary")) 
print('Recall for model8:', recall_score(testing_set_y, pred8, average="binary"))

(617, 1)
Precision for model8: 0.8452380952380952
Recall for model8: 0.7947761194029851


- Evaluate models based on C.csv dataset:
    - C.csv contains annotated samples. Use the target annotations to evaluate all the 8 models.
    - Please report accuracy, precision, recall, f1-score, false-discovery rate, matthews correlation coefficient
    - Please don't forget to scale the samples present in this dataset prior to sending to the models for prediction.
    - Also, comment on which model performs best on each of the 6 evaluation metrics.
    - Also, comment on which model you think is acceptable to deploy in a real bank. Justify your reasoning, and avoid answers like "none is acceptable".

In [292]:
C_df = pd.read_csv('deliverables/C.csv', sep=';', header=0)
C_df.head()

Unnamed: 0,wavelet_transformed_variance,wavelet_transformed_skewness,wavelet_transformed_curtosis,image_entropy,counterfeit
0,4.8272,3.0687,0.68604,0.80731,0
1,2.8969,0.70768,2.29,1.8663,0
2,2.1464,6.0795,-0.5778,-2.2302,0
3,-2.1405,-0.16762,1.321,-0.20906,1
4,-4.8554,-5.9037,10.9818,-0.82199,1


In [293]:
C_df.shape

(136, 5)

In [294]:
C_variable_x = C_df.drop(columns=['counterfeit'])
C_variable_y = C_df['counterfeit']
C_variable_y = C_variable_y.values.reshape(-1,1)
C_variable_x_scaled = scaler.transform(C_variable_x)
print(C_variable_x_scaled.shape)
print(C_variable_y.shape)

(136, 4)
(136, 1)


In [295]:
C_model1 = model1.predict(C_variable_x_scaled)
C_model2 = model2.predict(C_variable_x_scaled)
C_model3 = model3.predict(C_variable_x_scaled)
C_model4 = model4.predict(C_variable_x_scaled)
C_model5 = model5.predict(C_variable_x_scaled)
C_model6 = model6.predict(C_variable_x_scaled)
C_model7 = model7.predict(C_variable_x_scaled)
C_model8 = model8.predict(C_variable_x_scaled)


In [296]:
C_model1 = C_model1.reshape(-1,1)
C_model2 = C_model2.reshape(-1,1)
C_model3 = C_model3.reshape(-1,1)
C_model4 = C_model4.reshape(-1,1)
C_model5 = C_model5.reshape(-1,1)
C_model6 = C_model6.reshape(-1,1)
C_model7 = C_model7.reshape(-1,1)
C_model8 = C_model8.reshape(-1,1)
print(C_model1.shape)
print(C_model2.shape)
print(C_model3.shape)
print(C_model4.shape)
print(C_model5.shape)
print(C_model6.shape)
print(C_model7.shape)
print(C_model8.shape)

(136, 1)
(136, 1)
(136, 1)
(136, 1)
(136, 1)
(136, 1)
(136, 1)
(136, 1)


In [297]:
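# Report accuracy, precision, recall, F1-score, false-discovery rate, and Matthews correlation coefficient.
# TP = true positives (samples labelled 1, counterfeit, and predicted as 1)
# FP = false positives (samples labelled 0 but predicted as 1)
# False-discovery rate = FP / (FP + TP) = 1 - precision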

accuracy1 = accuracy_score(C_variable_y, C_model1)
precision1 = precision_score(C_variable_y, C_model1, average="binary")
recall1 = recall_score(C_variable_y, C_model1, average="binary")
f1score1 = f1_score(C_variable_y, C_model1, average="binary")
FDR1 = 1 - precision1
MCC1 = matthews_corrcoef(C_variable_y, C_model1)

print("Model 1 Accuracy on set C:", accuracy1)
print("Model 1 Precision on set C:", precision1)
print("Model 1 Recall on set C:", recall1)
print("Model 1 F1-Score on set C:", f1score1)
print("Model 1 False-Discovery Rate on set C:", FDR1)
print("Model 1 Matthews Correlation Coefficient on set C:", MCC1)


Model 1 Accuracy on set C: 0.9632352941176471
Model 1 Precision on set C: 0.9137931034482759
Model 1 Recall on set C: 1.0
Model 1 F1-Score on set C: 0.954954954954955
Model 1 False-Discovery Rate on set C: 0.08620689655172409
Model 1 Matthews Correlation Coefficient on set C: 0.9266851278250421
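
For reference, the six reported metrics can all be derived from the four confusion-matrix counts. The following sketch computes them by hand on a small made-up example; the function name `binary_metrics` and the demo labels are hypothetical, and the code assumes label 1 is the positive class and that no denominator is zero.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute six binary-classification metrics from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fdr": fp / (fp + tp),  # equals 1 - precision
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Made-up labels: 3 TP, 3 TN, 1 FP, 1 FN.
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_demo = [1, 1, 0, 0, 0, 1, 0, 1]
metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

The `fdr` entry matches the `1 - precision` shortcut used for `FDR1` in the cell above.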


- Model 2 on C.csv

In [298]:
accuracy2 = accuracy_score(C_variable_y, C_model2)
precision2 = precision_score(C_variable_y, C_model2, average="binary")
recall2 = recall_score(C_variable_y, C_model2, average="binary")
f1score2 = f1_score(C_variable_y, C_model2, average="binary")
FDR2 = 1 - precision2
MCC2 = matthews_corrcoef(C_variable_y, C_model2)

print("Model 2 Accuracy on set C:", accuracy2)
print("Model 2 Precision on set C:", precision2)
print("Model 2 Recall on set C:", recall2)
print("Model 2 F1-Score on set C:", f1score2)
print("Model 2 False-Discovery Rate on set C:", FDR2)
print("Model 2 Matthews Correlation Coefficient on set C:", MCC2)

Model 2 Accuracy on set C: 1.0
Model 2 Precision on set C: 1.0
Model 2 Recall on set C: 1.0
Model 2 F1-Score on set C: 1.0
Model 2 False-Discovery Rate on set C: 0.0
Model 2 Matthews Correlation Coefficient on set C: 1.0


In [299]:
accuracy3 = accuracy_score(C_variable_y, C_model3)
precision3 = precision_score(C_variable_y, C_model3, average="binary")
recall3 = recall_score(C_variable_y, C_model3, average="binary")
f1score3 = f1_score(C_variable_y, C_model3, average="binary")
FDR3 = 1 - precision3
MCC3 = matthews_corrcoef(C_variable_y, C_model3)

print("Model 3 Accuracy on set C:", accuracy3)
print("Model 3 Precision on set C:", precision3)
print("Model 3 Recall on set C:", recall3)
print("Model 3 F1-Score on set C:", f1score3)
print("Model 3 False-Discovery Rate on set C:", FDR3)
print("Model 3 Matthews Correlation Coefficient on set C:", MCC3)

Model 3 Accuracy on set C: 1.0
Model 3 Precision on set C: 1.0
Model 3 Recall on set C: 1.0
Model 3 F1-Score on set C: 1.0
Model 3 False-Discovery Rate on set C: 0.0
Model 3 Matthews Correlation Coefficient on set C: 1.0
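
To answer which model is best on each metric, one option is to gather the per-model scores into a single table and query it. The sketch below uses placeholder numbers and hypothetical model names, not the results printed in this notebook. Note that for false-discovery rate the best model is the one with the lowest value, and that `idxmax`/`idxmin` return only the first label when several models tie.

```python
import pandas as pd

# Placeholder scores for illustration only; not results from this notebook.
results = pd.DataFrame(
    {"accuracy": [0.90, 0.95, 0.93], "fdr": [0.10, 0.02, 0.05]},
    index=["model_a", "model_b", "model_c"],
)

best_accuracy = results["accuracy"].idxmax()  # higher is better
lowest_fdr = results["fdr"].idxmin()          # lower is better
print(best_accuracy, lowest_fdr)
```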
