# SOME GOOD CODING PRACTICES FOR DATA SCIENTISTS

# 1. Automate repetitive tasks through functions

Let’s look at a common task in a Machine Learning project pipeline: tuning a model’s hyperparameters. Suppose you’re working on an image classification project and would like to try a Support Vector Machine classifier (SVC()) and an Artificial Neural Network (ANN) in the form of a Multi-layer Perceptron classifier (MLPClassifier()). To tune the hyperparameters of each model, we’ll use the GridSearchCV() method from the scikit-learn Machine Learning framework.

In [None]:
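# imports used by the code cells in this notebook
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# svc model (X_train and y_train are assumed to be loaded already)
ml_model = SVC()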
hyper_parameter_candidates = {"C": [1e-4, 1e-2, 1, 1e2, 1e4],
    "gamma": [1e-3, 1e-2, 1, 1e2, 1e3],
    "class_weight": [None, "balanced"],
    "kernel":["linear", "poly", "rbf", "sigmoid"]}
scoring_parameter = "accuracy"
cv_fold = KFold(n_splits=5, shuffle=True, random_state=1)
classifier_model = GridSearchCV(estimator=ml_model, 
    param_grid=hyper_parameter_candidates,   
    scoring=scoring_parameter, cv=cv_fold)
classifier_model.fit(X_train, y_train)


# ann model
ml_model = MLPClassifier()
# the trailing comma makes each candidate a one-element tuple (one hidden layer)
hyper_parameter_candidates = {"hidden_layer_sizes":[(20,), (50,), 
   (100,)], "max_iter":[500, 800, 1000], 
   "activation":["identity", "logistic", "tanh", "relu"],
   "solver":["lbfgs", "sgd", "adam"]}
scoring_parameter = "accuracy"
cv_fold = KFold(n_splits=5, shuffle=True, random_state=1)
classifier_model = GridSearchCV(estimator=ml_model,    
   param_grid=hyper_parameter_candidates,  
   scoring=scoring_parameter, cv=cv_fold)
classifier_model.fit(X_train, y_train)

Both blocks repeat the same GridSearchCV() steps, so I can move them into a simple function:

In [1]:
def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold):   
    classifier_model = GridSearchCV(estimator=ml_model, 
       param_grid=hyper_parameter_candidates, 
       scoring=scoring_parameter, cv=cv_fold)       
    classifier_model.fit(X_train, y_train)  
    return classifier_model
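
To show how the function removes the duplication, here is a minimal sketch that tunes both models in one loop. The synthetic data from make_classification(), the reduced candidate grids, the 3-fold split and the `experiments` dictionary are assumptions made for illustration only; they are not the image data or the full grids used above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates,
                              scoring_parameter, cv_fold):
    classifier_model = GridSearchCV(estimator=ml_model,
                                    param_grid=hyper_parameter_candidates,
                                    scoring=scoring_parameter, cv=cv_fold)
    classifier_model.fit(X_train, y_train)
    return classifier_model


# hypothetical stand-in data so the sketch runs without an image dataset
X_train, y_train = make_classification(n_samples=120, n_features=8, random_state=1)
cv_fold = KFold(n_splits=3, shuffle=True, random_state=1)

# (model, candidate grid) pairs; the grids are deliberately small to keep the run short
experiments = {
    "svc": (SVC(), {"C": [1e-2, 1, 1e2], "kernel": ["linear", "rbf"]}),
    "ann": (MLPClassifier(random_state=1),
            {"hidden_layer_sizes": [(20,), (50,)], "max_iter": [500]}),
}

# one call per model replaces the two copied blocks
tuned_models = {
    name: tune_hyperparameter_model(model, X_train, y_train, grid, "accuracy", cv_fold)
    for name, (model, grid) in experiments.items()
}

for name, tuned in tuned_models.items():
    print(name, tuned.best_params_, round(tuned.best_score_, 3))
```

Adding a third model then only needs one more entry in the dictionary.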

# 2. Implement error handling

Let’s add some error handling code:

In [2]:
import sys

def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold):   
    # start from None so the return statement still works if tuning fails
    classifier_model = None
    try:   
        classifier_model = GridSearchCV(estimator=ml_model, 
           param_grid=hyper_parameter_candidates, 
           scoring=scoring_parameter, cv=cv_fold)       
        classifier_model.fit(X_train, y_train)
    except Exception:
        exception_message = sys.exc_info()[0]
        print("An error occurred. {}".format(exception_message))
    return classifier_model
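
When a try block assigns the value that the function later returns, it matters what happens to that name if the assignment fails. The sketch below isolates the issue with the standard library only; `load_value_unsafe()` and `load_value_safe()` are hypothetical names used for illustration.

```python
def load_value_unsafe(text):
    try:
        value = int(text)
    except ValueError:
        print("An error occurred.")
    # if int() failed, value was never assigned, so this line raises UnboundLocalError
    return value


def load_value_safe(text):
    # give the name a default before the try block
    value = None
    try:
        value = int(text)
    except ValueError as error:
        print("An error occurred. {}".format(error))
    return value


print(load_value_safe("42"))
print(load_value_safe("forty-two"))
```

Callers of the safe version can check the result against None to find out whether the call succeeded.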

With some planning, I should be able to write a better generic function print_exception_message() to print my exception messages and use it for all my functions.

In [3]:
import sys
import time
import traceback

def print_exception_message(message_orientation="horizontal"):
    """
    print full exception message
    :param message_orientation: horizontal or vertical
    :return: None
    """
    try:
        exc_type, exc_value, exc_tb = sys.exc_info()           
        file_name, line_number, procedure_name, line_code =  traceback.extract_tb(exc_tb)[-1]      
        time_stamp = " [Time Stamp]: " + str(time.strftime(" %Y-%m-%d %I:%M:%S %p"))
        file_name = " [File Name]: " + str(file_name)
        procedure_name = " [Procedure Name]: " +  str(procedure_name)
        error_message = " [Error Message]: " + str(exc_value)       
        error_type = " [Error Type]: " + str(exc_type)                   
        line_number = " [Line Number]: " + str(line_number)               
        line_code = " [Line Code]: " + str(line_code)
        if (message_orientation == "horizontal"):
            print( "An error occurred:{};{};{};{};{};{}; {}".format(time_stamp, file_name, procedure_name, 
               error_message, error_type, line_number, line_code))
        elif (message_orientation == "vertical"):
            print( "An error occurred:\n{}\n{}\n{}\n{}\n{}\n{}\n{}".format(time_stamp, file_name, 
               procedure_name, error_message, error_type,        
               line_number, line_code))
        else:
            pass                   
    except Exception:
        exception_message = sys.exc_info()[0]
        print("An error occurred. {}".format(exception_message))

With print_exception_message() in place, each exception block needs only a single line of code.

In [4]:
def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold):   
    # start from None so the return statement still works if tuning fails
    classifier_model = None
    try:   
        classifier_model = GridSearchCV(estimator=ml_model, 
           param_grid=hyper_parameter_candidates, 
           scoring=scoring_parameter, cv=cv_fold)       
        classifier_model.fit(X_train, y_train)
    except Exception:
        print_exception_message()
    return classifier_model

In our case, for the MLPClassifier() model, we’ll have:

In [None]:
ml_model = MLPClassifier()
# the trailing comma makes each candidate a one-element tuple (one hidden layer)
hyper_parameter_candidates = {"hidden_layer_sizes":[(20,), (50,), 
   (100,)], "max_iter":[500, 800, 1000], 
   "activation":["identity", "logistic", "tanh", "relu"],
   "solver":["lbfgs", "sgd", "adam"]}
scoring_parameter = "accuracy"
cv_fold = KFold(n_splits=5, shuffle=True, random_state=1)
classifier_model = tune_hyperparameter_model(ml_model, X_train,  
   y_train, hyper_parameter_candidates, scoring_parameter, 
   cv_fold)

We’ll continue updating the tune_hyperparameter_model() function in the next topics.
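
The standard library can report much the same details as a hand-written helper: logging.exception() records the message together with the full traceback, and a Formatter adds the time stamp and function name. This is a minimal sketch of that alternative, not a replacement for the author's helper; `make_logger()`, `safe_divide()` and the logger name are hypothetical.

```python
import io
import logging


def make_logger(stream):
    """Build a logger that writes time stamp, level, function and message to stream."""
    logger = logging.getLogger("exception_demo")
    logger.setLevel(logging.ERROR)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(asctime)s [%(levelname)s] %(funcName)s line %(lineno)d: %(message)s"))
    logger.addHandler(handler)
    return logger


def safe_divide(numerator, denominator, logger):
    result = None
    try:
        result = numerator / denominator
    except ZeroDivisionError:
        # logs the message plus the traceback of the exception being handled
        logger.exception("division failed")
    return result


log_stream = io.StringIO()
demo_logger = make_logger(log_stream)
division_result = safe_divide(1, 0, demo_logger)
log_text = log_stream.getvalue()
print(log_text)
```

Writing to an in-memory stream keeps the sketch self-contained; in a project the handler would more likely point at a file or the console.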

# 3. Do not hardcode default numerical and string parameters, including Machine Learning model hyperparameters

Hardcoded values scattered through the code are easy to mistype and tedious to change. A simple way to avoid these issues is to put all the default numerical and string values in a simple configuration file (config.py). For security purposes, any config file could be encrypted as necessary. Let me first update our function tune_hyperparameter_model() to handle both the GridSearchCV() and RandomizedSearchCV() methods.

In [10]:
def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold, search_cv_type="grid"):   
    # start from None so the return statement still works if tuning fails
    classifier_model = None
    try:
        if (search_cv_type=="grid"):
            classifier_model = GridSearchCV(estimator=ml_model, 
               param_grid=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold)
        elif (search_cv_type=="randomized"):
            classifier_model =  RandomizedSearchCV(estimator=ml_model, 
               param_distributions=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold)
        else:
            # an unknown search type is reported like any other error
            raise ValueError("Unknown search_cv_type: {}".format(search_cv_type))
        classifier_model.fit(X_train, y_train)
    except Exception:
        print_exception_message()
    return classifier_model

As you can see, the input parameter search_cv_type is optional and defaults to “grid”. If we create a config.py file with the following two lines of code:

In [11]:
GRID_SEARCH_CV="grid"
RANDOMIZED_SEARCH_CV="randomized"

We can update our function (import config statement is required).

In [12]:
import config
def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold, search_cv_type=config.GRID_SEARCH_CV): 
    # start from None so the return statement still works if tuning fails
    classifier_model = None
    try:
        if (search_cv_type==config.GRID_SEARCH_CV):
            classifier_model = GridSearchCV(estimator=ml_model, 
               param_grid=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold) 
        elif (search_cv_type==config.RANDOMIZED_SEARCH_CV):
            classifier_model = RandomizedSearchCV(estimator=ml_model, 
               param_distributions=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold)
        else:
            # an unknown search type is reported like any other error
            raise ValueError("Unknown search_cv_type: {}".format(search_cv_type))
        classifier_model.fit(X_train, y_train)
    except Exception:
        print_exception_message()
    return classifier_model

Now we have a nice generic function that can be used everywhere, and we can change the search type values by editing the config.py file only. Nothing has been hardcoded here!
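
Because the accepted values live in config.py, they can also be used to validate an argument before any model is built. The sketch below is one possible way to do that: a SimpleNamespace stands in for `import config` so the snippet runs on its own, and `VALID_SEARCH_CV_TYPES` and `validate_search_cv_type()` are hypothetical names.

```python
from types import SimpleNamespace

# stand-in for `import config`; it holds the same two constants as config.py
config = SimpleNamespace(GRID_SEARCH_CV="grid", RANDOMIZED_SEARCH_CV="randomized")

# every accepted value comes from config, none is typed in again here
VALID_SEARCH_CV_TYPES = (config.GRID_SEARCH_CV, config.RANDOMIZED_SEARCH_CV)


def validate_search_cv_type(search_cv_type=config.GRID_SEARCH_CV):
    """Return search_cv_type unchanged if config defines it, otherwise raise ValueError."""
    if search_cv_type not in VALID_SEARCH_CV_TYPES:
        raise ValueError("Unknown search_cv_type {!r}; expected one of {}".format(
            search_cv_type, VALID_SEARCH_CV_TYPES))
    return search_cv_type


print(validate_search_cv_type())
print(validate_search_cv_type(config.RANDOMIZED_SEARCH_CV))
```

A misspelled value such as "random" then fails immediately with a message that lists the accepted options.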

# 4. Provide code comments, especially Docstring comments for module, function, class, or method definitions

The Python programming language has a standard way of writing comments for modules, classes, functions, and methods. It is called a Docstring, which PEP 257 defines as follows:

“A docstring is a string literal that occurs as the first statement in a module, function, class, or method definition. Such a docstring becomes the __doc__ special attribute of that object.”
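
A quick way to see that special attribute at work is to print it. The sketch uses a small hypothetical function, `add_numbers()`, with the same `:param` and `:return` style used for the functions in this notebook.

```python
import inspect


def add_numbers(first, second):
    """
    add two numbers
    :param first: first addend
    :param second: second addend
    :return: sum of first and second
    """
    return first + second


# the raw string, including the indentation used in the source
print(add_numbers.__doc__)

# the same text with indentation and surrounding blank lines cleaned up
print(inspect.getdoc(add_numbers))
```

The built-in help() function and most editors read the same attribute, so a docstring written once shows up wherever the function is used.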

Let's add the Docstring for our function tune_hyperparameter_model().

In [14]:
import config
def tune_hyperparameter_model(ml_model, X_train, y_train, hyper_parameter_candidates, scoring_parameter, cv_fold, 
                              search_cv_type=config.GRID_SEARCH_CV):   
    """
    apply grid search cv and randomized search cv algorithms to 
    find optimal hyperparameters model 
    :param ml_model: defined machine learning model
    :param X_train: feature training data
    :param y_train: target (label) training data
    :param hyper_parameter_candidates: dictionary of 
     hyperparameter candidates
    :param scoring_parameter: parameter that controls what metric 
     to apply to the evaluated model
    :param cv_fold: cross-validation splitter (for example a KFold 
     object) or number of folds
    :param search_cv_type: type of search cv (config.GRID_SEARCH_CV or 
     config.RANDOMIZED_SEARCH_CV)
    :return classifier_model: fitted GridSearchCV or RandomizedSearchCV object
    """
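    # start from None so the return statement still works if tuning fails
    classifier_model = None
    try:
        if (search_cv_type==config.GRID_SEARCH_CV):
            classifier_model = GridSearchCV(estimator=ml_model, 
               param_grid=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold)
        elif (search_cv_type==config.RANDOMIZED_SEARCH_CV):
            classifier_model = RandomizedSearchCV(estimator=ml_model, 
               param_distributions=hyper_parameter_candidates, 
               scoring=scoring_parameter, cv=cv_fold)
        else:
            # an unknown search type is reported like any other error
            raise ValueError("Unknown search_cv_type: {}".format(search_cv_type))
        classifier_model.fit(X_train, y_train)
    except Exception:
        print_exception_message()
    return classifier_model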

We finally have a production-ready Python function. We should be able to add this function to a base (super) class so that it can be reused in any Machine Learning project.

# 5. Ensure that unit tests are implemented

Every computer program needs unit tests, and Machine Learning projects are no exception. One of the most important unit tests in Machine Learning classification projects is the calculation of the __Accuracy Classification Score__.

Suppose I want my test data to have a high Accuracy Score compared with a required Threshold Accuracy Score value. If the calculated Accuracy Score is greater than or equal to Threshold Accuracy Score value, then I’ll use the results to make the required business classification decisions. If not, I may need to do the following: retrain my previous model, try other classification models, collect more data if possible, speak to a domain expert to get more context information, etc.

Let’s look at the simple unit test code shown below. This test was done using the Fashion-MNIST image dataset and a trained ANN model file, fashion_mnist_ann_classification.pkl.

In [20]:
import unittest
import os
import config
import pandas as pd
import pickle
#from fashion_mnist_ann import calculate_accuracy_score

class ANNTest(unittest.TestCase):
    """
    ann unit test class
    """    
    def testAccuracyScore(self):
        """
        accuracy classification score unit test
        """
        # get data folder path
        data_folder_path = config.DATA_FOLDER_PATH
        # define fashion mnist test pandas dataframe
        fashion_mnist_test = config.FASHION_MNIST_TEST
        df_fashion_mnist_test = pd.read_csv(os.path.join(data_folder_path, fashion_mnist_test), header=None)
        # get number of columns
        df_fashion_mnist_test_columns = df_fashion_mnist_test.shape[1]
        # select y test label
        target_column_number = config.TARGET_COLUMN_NUMBER
        y_test = df_fashion_mnist_test.iloc[:, 0:target_column_number]
        # flatten y test label
        y_test_flattened = y_test.values.ravel()
        # select X test features
        X_test = df_fashion_mnist_test.iloc[:, target_column_number:df_fashion_mnist_test_columns]
        # normalize X test features with min-max scaling
        X_test = (X_test.astype("float32") - config.XMIN) / (config.XMAX - config.XMIN)
        # open fashion mnist model pkl file; the with block closes it automatically
        with open(config.FASHION_MNIST_MODEL_FILE, "rb") as mlp_classifier_model_pkl:
            mlp_classifier_model_file = pickle.load(mlp_classifier_model_pkl)
        # get y predict test
        y_predict_test = mlp_classifier_model_file.predict(X_test)
        # calculate accuracy classification score
        accuracy_score_value = calculate_accuracy_score(y_test_flattened, y_predict_test)
        # test for accuracy score value greater than or equal to threshold accuracy score
        self.assertGreaterEqual(accuracy_score_value, 
           config.THRESHOLD_ACCURACY_SCORE, "Test Accuracy Score Failed.")
        
#if __name__ == "__main__":   
 #   unittest.main()

As you can see, I have commented every line of code so that you can understand clearly how this unit test runs. In real production projects, this code would be encapsulated in a main derived class file that runs the train, validation and test data.

The function __calculate_accuracy_score()__ calculates the __Accuracy Classification Score__.

In [22]:
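from sklearn.metrics import accuracy_score

def calculate_accuracy_score(label_true, label_predict):       
    """
    calculate accuracy classification score
    :param label_true: label true values
    :param label_predict: label predicted values
    :return: accuracy classification score as a percentage
    """
    # start from None so the return statement still works if scoring fails
    accuracy_score_value = None
    try:
        # accuracy_score() returns a fraction, so scale it to a percentage
        accuracy_score_value = accuracy_score(label_true, 
           label_predict) * 100
        accuracy_score_value = float("{0:0.2f}".format(accuracy_score_value)) 
    except Exception:    
        print_exception_message()         
    return accuracy_score_value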

Let’s run some tests. In the config.py file, the Threshold Accuracy Score is set to 86%, so the test passes if the calculated Accuracy Score is greater than or equal to 86%. If the Threshold Accuracy Score is increased to 90%, the test would fail and report the message passed to the assertion, “Test Accuracy Score Failed.”

It’s up to the Data Analytics team to decide whether this score is sufficient to be used for real business data classification.
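
The pass and fail behaviour described above can be reproduced without the Fashion-MNIST files. This is a minimal sketch under stated assumptions: the labels are made-up values, `accuracy_percent()` is a pure-Python stand-in for calculate_accuracy_score(), and `make_threshold_test()` and `run_threshold_test()` are hypothetical helpers. Only the two thresholds, 86 and 90, come from the text.

```python
import io
import unittest


def accuracy_percent(label_true, label_predict):
    """Share of matching labels as a percentage rounded to two decimals."""
    matches = sum(true == predicted for true, predicted in zip(label_true, label_predict))
    return round(100 * matches / len(label_true), 2)


# made-up labels: 8 of 9 predictions are correct
Y_TRUE = [0, 1, 2, 3, 4, 5, 6, 7, 8]
Y_PREDICT = [0, 1, 2, 3, 4, 5, 6, 7, 0]


def make_threshold_test(threshold):
    """Build a TestCase class that checks the accuracy against the given threshold."""
    class ThresholdTest(unittest.TestCase):
        def test_accuracy_score(self):
            score = accuracy_percent(Y_TRUE, Y_PREDICT)
            self.assertGreaterEqual(score, threshold, "Test Accuracy Score Failed.")
    return ThresholdTest


def run_threshold_test(threshold):
    """Run the test for one threshold and return (passed, report text)."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_threshold_test(threshold))
    report = io.StringIO()
    result = unittest.TextTestRunner(stream=report, verbosity=0).run(suite)
    return result.wasSuccessful(), report.getvalue()


for threshold in (86.0, 90.0):
    passed, report_text = run_threshold_test(threshold)
    print(threshold, "passed" if passed else "failed")
```

With these labels the score lies between the two thresholds, so the same model passes one test and fails the other; the threshold encodes the business decision.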

__“Be a good Python Software Engineer, not a good “Pythonic” Software Engineer”__

SOURCE: https://medium.com/@ernest.bonat/refactoring-python-code-for-machine-learning-projects-python-spaghetti-code-everywhere-daaa6c116bd1