In this notebook, I build a pipeline that feeds query data to the trained model and reports its prediction. This is an important part of modeling because it shows how the model behaves on a single query point.

In the `03-Modeling-FI.ipynb` notebook, I noticed that the __Gradient Boosting__ ensemble classifier outperformed all the other models, including the Random Forest and XGBoost classifiers. Although the train loss of the Random Forest and XGBoost classifiers is negligible, their cross-validation loss is higher, which indicates that both models are overfitting.

__1. Packages__

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from IPython.display import display

In [3]:
import numpy as np
import os
import pandas as pd
import pickle

__2. Features and target__

In [4]:
features = ['alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'redshift']
target = 'class'

__3. Fetch the raw data__

In [5]:
def fetch_data(features):
    """
    This function prompts for one numeric value per feature
    and returns the raw data as a single-row dataframe.
    """
    data = {f: [float(input("  '{}': ".format(f)))] for f in features}
    df = pd.DataFrame(data=data)
    print("Raw data is fetched successfully.")
    return df

In [6]:
df = fetch_data(features=features)

  'alpha': 12
  'delta': 12
  'u': 12
  'g': 12
  'r': 12
  'i': 12
  'z': 12
  'redshift': 1
Raw data is fetched successfully.
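
`fetch_data` passes each typed value straight to `float`, so a typo such as `1,2` raises `ValueError` and ends the cell. Below is a minimal sketch of a stricter parser that reports every bad field at once. The names `parse_feature_values`, `FEATURES` and `raw` are hypothetical and not part of the original notebook, and the helper reads strings from a dict rather than from `input` so that it can be exercised without a prompt.

```python
import math

FEATURES = ['alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'redshift']

def parse_feature_values(raw, features=FEATURES):
    """Convert a {feature: text} mapping into {feature: [float]}.

    Hypothetical helper: the single-element lists mirror the layout that
    fetch_data hands to pd.DataFrame. Raises ValueError naming every
    feature that is missing or is not a finite number.
    """
    parsed, problems = {}, []
    for name in features:
        try:
            value = float(raw.get(name))
        except (TypeError, ValueError):
            # TypeError: the key is missing (None); ValueError: not numeric.
            problems.append(name)
            continue
        if not math.isfinite(value):
            # Reject 'nan' and 'inf', which float() accepts silently.
            problems.append(name)
            continue
        parsed[name] = [value]
    if problems:
        raise ValueError("Invalid or missing values for: " + ", ".join(problems))
    return parsed

# Same values as the interactive session above, supplied as strings.
raw_query = {name: '12' for name in FEATURES}
raw_query['redshift'] = '1'
query = parse_feature_values(raw_query)
print(query['alpha'], query['redshift'])
```

The returned dict could be passed to `pd.DataFrame(data=query)` in the same way `fetch_data` builds its dataframe.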


__4. Preprocess the raw data__

In [7]:
def preprocess(df, features):
    """
    This function preprocesses the raw data by applying the saved scaler.
    """
    scale = 'analysis_dumps/scaling.pkl'
    with open(file=scale, mode='rb') as pre_pkl:
        scaling = pickle.load(file=pre_pkl)
    
    df = scaling.transform(X=df)
    df = pd.DataFrame(data=df, columns=features)
    return df

In [8]:
df = preprocess(df=df, features=features)
display(df)

      alpha     delta         u         g         r         i         z  redshift
0  0.033319  0.294487  0.046076  0.074076  0.110276  0.111598  0.120764  0.143846
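
The scaled values above all lie between 0 and 1, which is consistent with min-max scaling, but the notebook does not state which scaler is stored in `scaling.pkl`. Assuming it is a min-max scaler, the transform applied to one value would look like the sketch below. The function name `min_max_scale` and the bounds are illustrative assumptions, not the fitted parameters.

```python
def min_max_scale(value, lower, upper):
    """Map value from the range [lower, upper] onto [0, 1]."""
    if upper == lower:
        raise ValueError("lower and upper must differ")
    return (value - lower) / (upper - lower)

# Illustrative bounds only: an angle assumed to span 0 to 360 degrees.
# The bounds learned by the saved scaler are not shown in this notebook.
scaled = min_max_scale(12.0, lower=0.0, upper=360.0)
print(round(scaled, 6))
```

Under this assumption, a query value outside the range seen during fitting would map outside [0, 1].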


__5. Feature engineering on preprocessed data__

In [9]:
def featurize(df):
    """
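    This function featurizes the dataframe.
    It adds the band-difference columns and keeps only the important features obtained using RF.
    Please refer to the 02-Modeling and 03-Modeling-FI notebooks.
    """
    fi_cols = ['redshift', 'g-r', 'i-z', 'u-r', 'i-r', 'z-r', 'g']
    # Copy first so the caller's dataframe (possibly a slice) is not modified in place.
    df = df.copy()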
    df['g-r'] = df['g'] - df['r']
    df['i-z'] = df['i'] - df['z']
    df['u-r'] = df['u'] - df['r']
    df['i-r'] = df['i'] - df['r']
    df['z-r'] = df['z'] - df['r']
    df = df[fi_cols]
    return df

In [10]:
df = featurize(df=df)
display(df)

   redshift      g-r        i-z      u-r       i-r       z-r         g
0  0.143846  -0.0362  -0.009166  -0.0642  0.001322  0.010488  0.074076
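
The engineered columns are plain differences between scaled bands, so the row above can be checked by hand. The snippet below recomputes them from the preprocessed values displayed in section 4, using only the standard library. The variable names are illustrative.

```python
# Scaled band values copied from the preprocessed row in section 4.
scaled = {'u': 0.046076, 'g': 0.074076, 'r': 0.110276,
          'i': 0.111598, 'z': 0.120764}

# Each engineered feature is "first band minus second band".
pairs = [('g', 'r'), ('i', 'z'), ('u', 'r'), ('i', 'r'), ('z', 'r')]
diffs = {f"{a}-{b}": round(scaled[a] - scaled[b], 6) for a, b in pairs}

for name, value in diffs.items():
    print(f"{name}: {value}")
```

The printed differences agree with the `g-r`, `i-z`, `u-r`, `i-r` and `z-r` columns shown above.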


__6. Predictions__

In [11]:
def prediction(X):
    """
    This function predicts the class of the datapoint.
    """
    model = 'model_dumps/fi_models/fi_model_stacking_classifier.pkl'
    with open(file=model, mode='rb') as m_pkl:
        clf = pickle.load(file=m_pkl)
    
    pred_proba = clf.predict_proba(X=X)
    confidence = np.round(a=np.max(pred_proba)*100, decimals=2)
    pred_class = clf.predict(X=X)[0]
    if pred_class == 'QSO':
        pred_class = 'Quasi-Stellar Object'
    elif pred_class == 'GALAXY':
        pred_class = 'Galaxy'
    else:
        pred_class = 'Star'
    print("The predicted class is '{}' with a confidence of {}%.".format(pred_class, confidence))

In [12]:
prediction(X=df)

The predicted class is 'Galaxy' with a confidence of 62.85%.
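
The reported confidence is the largest entry of the `predict_proba` row, scaled to a percentage. The sketch below shows that step on a plain list. It assumes the predicted class is the one with the highest probability, which holds for most but not all scikit-learn estimators. The probabilities are made up for illustration and are not output from the saved model, and `summarize` and `LABELS` are hypothetical names.

```python
LABELS = {'QSO': 'Quasi-Stellar Object', 'GALAXY': 'Galaxy', 'STAR': 'Star'}

def summarize(proba_row, classes):
    """Return (readable label, confidence in percent) for one probability row."""
    # Index of the largest probability in the row.
    best = max(range(len(proba_row)), key=proba_row.__getitem__)
    raw_label = classes[best]
    # Unlike an if/elif/else chain, an unknown label is passed through unchanged
    # instead of being reported as 'Star'.
    return LABELS.get(raw_label, raw_label), round(proba_row[best] * 100, 2)

# Made-up probabilities; the class order follows scikit-learn's sorted classes_.
label, confidence = summarize([0.70, 0.25, 0.05], ['GALAXY', 'QSO', 'STAR'])
print(f"The predicted class is '{label}' with a confidence of {confidence}%.")
```

A dictionary lookup keeps the label mapping in one place, which makes it easier to extend if the model is ever retrained with additional classes.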


__7. Machine learning pipeline__

For a single query point.

In [13]:
def ml_pipeline(features):
    """
    This is a local machine learning application.
    """
    print("Please provide the data for below features.")
    df = fetch_data(features=features)
    df = preprocess(df=df, features=features)
    df = featurize(df=df)
    prediction(X=df)

In [14]:
ml_pipeline(features=features)

Please provide the data for below features.
  'alpha': 15
  'delta': 15
  'u': 15
  'g': 15
  'r': 15
  'i': 15
  'z': 15
  'redshift': 15
Raw data is fetched successfully.
The predicted class is 'Quasi-Stellar Object' with a confidence of 76.57%.


For the test data.

In [15]:
from sklearn.metrics import classification_report

In [16]:
def pipeline_for_whole_test_data(features, target='class'):
    """
    This function runs the pipeline on the whole test dataset.
    """
    data = pd.read_csv(filepath_or_buffer='data/test_data.csv')
    
    X_test = data[features]
    y_test = data[target].values
    
    X_test = featurize(df=X_test)
    
    model = 'model_dumps/fi_models/fi_model_stacking_classifier.pkl'
    with open(file=model, mode='rb') as m_pkl:
        clf = pickle.load(file=m_pkl)
    
    cm_pred = clf.predict(X=X_test)
    
    print(classification_report(y_true=y_test, y_pred=cm_pred))

In [17]:
pipeline_for_whole_test_data(features=features)

              precision    recall  f1-score   support

      GALAXY       0.97      0.98      0.98     11889
         QSO       0.96      0.92      0.94      3792
        STAR       0.99      1.00      0.99      4319

    accuracy                           0.97     20000
   macro avg       0.97      0.97      0.97     20000
weighted avg       0.97      0.97      0.97     20000
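
The two average rows of the report differ only in how the classes are weighted: `macro avg` treats every class equally, while `weighted avg` weights each class by its support. The sketch below recomputes both from the per-class rows printed above. Because those inputs are already rounded to two decimals, the results are approximate. The names `report`, `macro` and `weighted` are illustrative.

```python
# Per-class rows copied from the report above: (precision, recall, support).
report = {
    'GALAXY': (0.97, 0.98, 11889),
    'QSO':    (0.96, 0.92, 3792),
    'STAR':   (0.99, 1.00, 4319),
}

total = sum(support for _, _, support in report.values())

def macro(index):
    """Unweighted mean of one metric over the classes."""
    return sum(row[index] for row in report.values()) / len(report)

def weighted(index):
    """Mean of one metric weighted by each class's support."""
    return sum(row[index] * row[2] for row in report.values()) / total

print("support:", total)
print("precision:", round(macro(0), 2), round(weighted(0), 2))
print("recall:", round(macro(1), 2), round(weighted(1), 2))
```

Support-weighted recall is the same quantity as overall accuracy, which is why both read 0.97 in the report.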

