
Importing essential libraries

In [20]:
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA

Creating synthetic data

In [21]:
x, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

Splitting data

In [22]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

In [23]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)

(700, 20)
(700,)
(300, 20)


In [24]:
print(x)

[[-0.6693561  -1.49577819 -0.87076638 ... -1.26733697 -1.2763343
   1.01664321]
 [ 0.09337237  0.78584826  0.10575379 ... -0.12270893  0.6934308
   0.91136272]
 [-0.90579721 -0.60834121  0.29514098 ...  0.83049813 -0.73733198
  -0.5782121 ]
 ...
 [-0.20013455 -1.46108168  1.79701652 ... -1.50280171 -1.27473745
   1.60111869]
 [ 0.03935575  0.24868361 -0.47532342 ...  0.09912579  0.54269228
   1.20827474]
 [ 0.76921528  0.47076539  0.16994471 ...  0.6561162   0.64333186
  -2.02100232]]


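No separate scaling step is used in this first pipeline, on the assumption that the features generated by make_classification are already on comparable scales.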
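
One way to check whether the features really are on comparable scales is to look at the per-feature means and standard deviations. The following is a minimal sketch that regenerates the data with the same call used above; the names `feature_means` and `feature_stds` are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same synthetic data as in the notebook (same parameters and seed)
x, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Per-feature summary statistics (one value per column)
feature_means = x.mean(axis=0)
feature_stds = x.std(axis=0)

# If the ranges below are narrow, the features are on similar scales;
# a wide spread in standard deviations would argue for a scaler.
print("mean range:", np.round(feature_means.min(), 2), "to", np.round(feature_means.max(), 2))
print("std range: ", np.round(feature_stds.min(), 2), "to", np.round(feature_stds.max(), 2))
```

Distance-based models such as k-nearest neighbours are sensitive to feature scale, so this check is worth doing before deciding to leave a scaler out.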


In [25]:
pipe = Pipeline([
    ("PCA", PCA(n_components=5)),
    ("knn", KNeighborsClassifier()),
])

In [26]:
pipe.fit(x_train,y_train)
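
After fitting, the individual steps of the pipeline can be inspected by name. As a rough sketch, assuming the same data, split, and pipeline definition as above, the fitted PCA step reports how much of the variance the 5 retained components capture, which helps judge whether `n_components=5` is a reasonable choice. The names `pca_step` and `explained` are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Rebuild the data, split, and pipeline exactly as in the cells above
x, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

pipe = Pipeline([
    ("PCA", PCA(n_components=5)),
    ("knn", KNeighborsClassifier()),
])
pipe.fit(x_train, y_train)

# Access the fitted PCA step by the name given in the pipeline
pca_step = pipe.named_steps["PCA"]
explained = pca_step.explained_variance_ratio_

print("variance explained per component:", explained)
print("total variance explained:", explained.sum())
```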

In [27]:
y_pred=pipe.predict(x_test)

In [28]:
print(y_pred)

[1 0 0 1 1 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0
 0 1 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 1 0 1 0 0 0 0
 0 0 1 0 1 1 0 1 0 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 1 1 1 0 1 1 0 1 1 0 1 1 0
 0 1 1 1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 1 0 1 1 0 1 1
 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 0
 0 0 0 1 1 0 0 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 0 0 1 0 0 1 1 1 0 0
 0 0 0 0 1 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1 0 1 0 0 0 1 0 1 0 1 1 0
 0 0 0 0]


Evaluating the accuracy of the model on synthetic data

In [29]:
from sklearn.metrics import accuracy_score

In [19]:
accuracy_score(y_test, y_pred)

0.8433333333333334

Adding more evaluation metrics

In [40]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix, roc_auc_score
y_pred = pipe.predict(x_test)
y_pred_proba = pipe.predict_proba(x_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

In [41]:
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print(f"ROC AUC: {roc_auc:.2f}")

Accuracy: 0.83
Precision: 0.88
Recall: 0.78
F1-Score: 0.83
Confusion Matrix:
[[129  16]
 [ 34 121]]
ROC AUC: 0.90
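
The accuracy, precision, recall, and F1 values printed above all follow from the four counts in the confusion matrix. The sketch below recomputes them by hand from the matrix shown in the output, using scikit-learn's layout for binary labels (rows are true classes, columns are predicted classes, so the flattened order is TN, FP, FN, TP).

```python
import numpy as np

# Confusion matrix copied from the output above
conf_matrix = np.array([[129, 16],
                        [34, 121]])

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = conf_matrix.ravel()

accuracy = (tp + tn) / conf_matrix.sum()            # (121 + 129) / 300
precision = tp / (tp + fp)                          # 121 / 137
recall = tp / (tp + fn)                             # 121 / 155
f1 = 2 * precision * recall / (precision + recall)  # equals 242 / 292

print(f"Accuracy: {accuracy:.2f}")    # 0.83
print(f"Precision: {precision:.2f}")  # 0.88
print(f"Recall: {recall:.2f}")        # 0.78
print(f"F1-Score: {f1:.2f}")          # 0.83
```

ROC AUC cannot be recovered this way, because it depends on the predicted probabilities rather than on the thresholded labels.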


**Scaling the data to check whether accuracy increases**
---



In [35]:
from sklearn.preprocessing import StandardScaler


In [36]:
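# Note: unlike `pipe`, this pipeline has no PCA step, so it differs from
# the first model in more than just scaling.
pipe2 = Pipeline([
    ("Scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])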

In [37]:
pipe2.fit(x_train,y_train)
y_pred2=pipe2.predict(x_test)

In [38]:
print(y_pred2)

[0 0 0 1 1 0 0 1 1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 0 0 1 0
 0 0 0 0 1 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 0 1 1 1 1 0 0 1 0
 0 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 1 1 0 0 1 0 1 1 0 0 0
 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 1 0 1 0 0 1 1 0 1 1 0
 0 1 0 1 0 0 1 0 1 1 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 0
 1 0 0 0 1 0 1 0 1 1 1 1 1 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1 0 1 0 1 0 0 0
 0 1 0 1 1 0 0 0 0 0 0 1 1 0 1 1 0 1 1 1 1 0 1 1 1 0 0 0 0 0 0 0 1 1 1 0 0
 1 0 0 0 1 1 0 0 0 0 1 1 0 1 0 1 1 0 1 1 0 1 0 1 1 0 0 0 0 0 1 0 1 1 0 1 0
 0 1 0 0]


In [39]:
accuracy2 = accuracy_score(y_test, y_pred2)
print("Accuracy of our scaled model is:- ",accuracy2)

Accuracy of our scaled model is:-  0.77


In [42]:
y_pred = pipe2.predict(x_test)
y_pred_proba = pipe2.predict_proba(x_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-Score: {f1:.2f}")
print("Confusion Matrix:")
print(conf_matrix)
print(f"ROC AUC: {roc_auc:.2f}")


Accuracy: 0.77
Precision: 0.82
Recall: 0.72
F1-Score: 0.76
Confusion Matrix:
[[120  25]
 [ 44 111]]
ROC AUC: 0.84
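
The two pipelines compared above differ in two ways at once: `pipe2` adds a `StandardScaler` but also drops the PCA step, so the lower accuracy cannot be attributed to scaling alone. One possible way to separate the two effects is to evaluate a third variant that keeps PCA and adds the scaler in front of it. The sketch below assumes the same data and split as the notebook; the names `candidates` and `scores` and the pipeline labels are hypothetical, and no particular result is implied.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same data and split as in the notebook
x, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Three variants: the two from the notebook plus scaler and PCA together
candidates = {
    "pca+knn": Pipeline([
        ("PCA", PCA(n_components=5)),
        ("knn", KNeighborsClassifier()),
    ]),
    "scaler+knn": Pipeline([
        ("Scaler", StandardScaler()),
        ("knn", KNeighborsClassifier()),
    ]),
    "scaler+pca+knn": Pipeline([
        ("Scaler", StandardScaler()),
        ("PCA", PCA(n_components=5)),
        ("knn", KNeighborsClassifier()),
    ]),
}

# Fit each pipeline on the training split and score it on the test split
scores = {}
for name, model in candidates.items():
    model.fit(x_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(x_test))
    print(f"{name}: {scores[name]:.2f}")
```

Comparing "pca+knn" with "scaler+pca+knn" isolates the effect of scaling, while comparing "scaler+knn" with "scaler+pca+knn" isolates the effect of PCA.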
