The issue is that data quality is low: ticket resolution types are badly classified, with unrelated tickets assigned to the same category. I therefore have to reclassify them manually by reading the issue and the corresponding resolution text typed in by users and technicians. (I tried NLP, writing a Python script to group similar text, to no avail.)

# Let's make some noisy data for classification

In [1]:
import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

from sklearn.datasets import make_classification

In [2]:
data, labels = make_classification(n_samples=2000, # let's say we have 2000 samples and that is too many
                                   n_features=4000, # too many features
                                   n_informative=20, # only 20 out of 4000 features are informative, so the data is noisy
                                   n_redundant=10, # 10 more features are linear combinations of the informative ones
                                   n_repeated=0,
                                   n_classes=2, # for simplicity, let's say we only want to classify 2 categories
                                   n_clusters_per_class=10,
                                   weights=None,
                                   flip_y=0.1, # add label noise: 10% of the samples get a randomly assigned class
                                   class_sep=1.0,
                                   hypercube=True, # put the class clusters on the vertices of a hypercube
                                   shift=0.0,
                                   scale=1.0,
                                   shuffle=True,
                                   random_state=12345, # for reproducibility
                                  )
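
As a rough illustration of what `flip_y=0.1` does to the labels, here is a minimal NumPy-only sketch. It assumes the behaviour described in the scikit-learn documentation (a fraction `flip_y` of the samples gets a class drawn at random); the variable names are made up for this example and it is not scikit-learn's own code. With 2 classes, a randomly drawn class matches the original about half the time, so only roughly half of the reassigned labels actually change.

```python
import numpy as np

rng = np.random.default_rng(12345)
n_samples, n_classes, flip_y = 2000, 2, 0.1

# Hypothetical clean labels, standing in for the labels before noise is added
clean = rng.integers(0, n_classes, size=n_samples)

# Pick about flip_y of the samples and give each a class drawn at random
noisy = clean.copy()
flip_mask = rng.uniform(size=n_samples) < flip_y
noisy[flip_mask] = rng.integers(0, n_classes, size=flip_mask.sum())

reassigned = flip_mask.mean()        # fraction of samples that were redrawn
changed = (noisy != clean).mean()    # fraction whose label really differs
print(f'reassigned = {reassigned:.3f}, actually changed = {changed:.3f}')
```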

# If we use all the data for classification -- baseline

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

In [4]:
rf = RandomForestClassifier(n_estimators = 100, # averaging over many trees reduces overfitting
                            random_state = 12345, # for reproducibility
                            )
classifier = make_pipeline(StandardScaler(), # standardize the features before classification
                           rf,
                           )
cv = StratifiedShuffleSplit(n_splits = 10, # 10 random stratified splits to estimate the classification performance
                            test_size = 0.2, # for each split, use 80% of the data for training and 20% for testing
                            random_state = 12345, # for reproducibility
                            )
scores = cross_val_score(classifier,
                         data,
                         labels,
                         cv = cv, # the cross-validation scheme defined above
                         scoring='roc_auc', # use ROC AUC to measure the cross-validation performance
                         n_jobs = -1, # use all available cores
                         verbose = 1, # print the progress of the computation
                         )

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:  1.3min finished


In [8]:
print(f'score = {np.mean(scores):.4f} +/- {np.std(scores):.4f}')

score = 0.5266 +/- 0.0298
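
For context, a ROC AUC of 0.5 is what random guessing gives, so the baseline above is barely better than chance. The sketch below illustrates this with a small NumPy-only AUC computed from pairwise comparisons; `rank_auc` is a hypothetical helper written for this example, not the scikit-learn scorer used above.

```python
import numpy as np

def rank_auc(y_true, y_score):
    """ROC AUC as the probability that a random positive outscores a random negative."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    diff = pos[:, None] - neg[None, :]                   # all positive/negative pairs
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()    # ties count as half a win
    return wins / (len(pos) * len(neg))

rng = np.random.default_rng(12345)
y = np.array([0, 1] * 1000)                              # 2000 balanced toy labels

auc_random = rank_auc(y, rng.uniform(size=y.size))       # scores unrelated to the labels
auc_perfect = rank_auc(y, y)                             # scores that separate the classes
print(f'random scores: {auc_random:.3f}, perfect scores: {auc_perfect:.3f}')
```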


# Now, let's perform resampling

In [10]:
# wrap the resampling and the classification within each resample in one function

def resample_classification(original_data,original_labels,n_samples = 100,):
    idx_sampled = np.random.choice(np.arange(len(original_data)), # all the possible instances (use the argument, not the global `data`)
                                   size = n_samples,  # number of instances to draw
                                   replace = False, # sample without replacement
                                   )
    features,labels = original_data[idx_sampled],original_labels[idx_sampled]
    rf = RandomForestClassifier(n_estimators = 100, # averaging over many trees reduces overfitting
                                random_state = 12345, # for reproducibility
                                )
    classifier = make_pipeline(StandardScaler(), # standardize the features before classification
                               rf,
                               )
    cv = StratifiedShuffleSplit(n_splits = 10, # 10 random stratified splits to estimate the classification performance
                                test_size = 0.2, # for each split, use 80% of the data for training and 20% for testing
                                random_state = 12345, # for reproducibility
                                )
    scores = cross_val_score(classifier,
                             features,
                             labels,
                             cv = cv, # the cross-validation scheme defined above
                             scoring='roc_auc', # use ROC AUC to measure the cross-validation performance
                             n_jobs = -1, # use all available cores
                             verbose = 1, # print the progress of the computation
                             )
    return scores
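
The function above draws its subsample with `np.random.choice`, which is unseeded and does not guarantee the class proportions of the full data. If that matters for your tickets, one possible variant is to subsample per class with a seeded generator. This is only a sketch under that assumption: `stratified_subsample_indices` is a hypothetical helper, not part of NumPy or scikit-learn, and the toy labels stand in for the real ones.

```python
import numpy as np

def stratified_subsample_indices(labels, n_samples, rng):
    """Draw about n_samples indices while keeping each class's share of the data."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    picked = []
    for cls, count in zip(classes, counts):
        n_cls = int(round(n_samples * count / len(labels)))   # this class's quota
        cls_idx = np.flatnonzero(labels == cls)               # positions of this class
        picked.append(rng.choice(cls_idx, size=n_cls, replace=False))
    idx = np.concatenate(picked)
    rng.shuffle(idx)                                          # mix the classes again
    return idx

rng = np.random.default_rng(12345)                            # seeded, so the draw is repeatable
toy_labels = np.array([0] * 1000 + [1] * 1000)                # balanced stand-in labels
idx = stratified_subsample_indices(toy_labels, 100, rng)
print(np.bincount(toy_labels[idx]))                           # [50 50]
```

The returned indices could replace `idx_sampled` inside `resample_classification`.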

In [11]:
n_sampling = 100
results = np.array([resample_classification(data,labels,) for _ in range(n_sampling)])

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    2.5s finished

In [12]:
results.shape

(100, 10)

In [13]:
print(f'score = {results.mean(1).mean():.4f} +/- {results.mean(1).std():.4f}')

score = 0.5013 +/- 0.0508


# Conclusion:

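Depending on your data, resampling can exaggerate the standard deviation of the classification performance: here the score goes from 0.5266 +/- 0.0298 on the full data to 0.5013 +/- 0.0508 across the 100-sample resamples, with no gain in the mean.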
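
One likely reason for the larger spread is the size of the test set: with `test_size = 0.2`, a 100-sample resample is scored on only 20 instances, while the full data is scored on 400. The sketch below simulates a coin-flip classifier and measures how much its accuracy varies at each test-set size. It is a simplified illustration that uses accuracy rather than ROC AUC and a hypothetical helper `chance_accuracy_std`, so the numbers are not directly comparable to the scores above.

```python
import numpy as np

rng = np.random.default_rng(12345)

def chance_accuracy_std(n_test, n_repeats=5000):
    """Std of the accuracy of a coin-flip classifier scored on n_test instances."""
    y_true = rng.integers(0, 2, size=(n_repeats, n_test))
    y_pred = rng.integers(0, 2, size=(n_repeats, n_test))   # guesses unrelated to y_true
    return (y_true == y_pred).mean(axis=1).std()

std_small = chance_accuracy_std(20)    # 20% of a 100-sample resample
std_large = chance_accuracy_std(400)   # 20% of the full 2000 samples
print(f'n_test =  20: std = {std_small:.4f}')
print(f'n_test = 400: std = {std_large:.4f}')
```

The spread shrinks roughly as 1/sqrt(n_test), so a smaller resample gives a noisier performance estimate even when the underlying classifier is no better or worse.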