## Background
This notebook shows how to feed embeddings from the language model into an MLP classifier. Using the GitHub repo `kubernetes/kubernetes` as an example, we apply transfer learning and report the results.

## Data
**combined_sig_df.pkl**
https://storage.googleapis.com/issue_label_bot/notebook_files/combined_sig_df.pkl
This file contains the GitHub issue contents, including titles, bodies, and labels.

**feat_df.csv**
https://storage.googleapis.com/issue_label_bot/notebook_files/feat_df.csv
This file contains 1600-dimensional embeddings of 14390 issues from `kubernetes/kubernetes`.

In [1]:
import pandas as pd

combined_sig_df = pd.read_pickle('combined_sig_df.pkl')
feat_df = pd.read_csv('feat_df.csv')

In [2]:
# GitHub issue contents
combined_sig_df.head(3)

Unnamed: 0,index,updated_at,last_time,labels,repo,url,title,body,len_labels,index.1,...,sig/aws,sig/cluster-ops,sig/multicluster,sig/instrumentation,sig/openstack,sig/contributor-experience,sig/architecture,sig/vmware,sig/service-catalog,part
0,0,2018-02-24 15:09:51 UTC,2018-02-24 15:09:51 UTC,"[lifecycle/rotten, priority/backlog, sig/clust...",kubernetes/kubernetes,"""https://api.github.com/repos/kubernetes/kuber...",minions ip does not follow --hostname-override...,"according to 9267, the kubelet should registe...",4,0,...,0,0,0,0,0,0,0,0,0,6
1,10,2018-08-09 13:48:32 UTC,2018-08-09 13:48:32 UTC,"[lifecycle/stale, sig/cluster-lifecycle]",kubernetes/kubernetes,"""https://api.github.com/repos/kubernetes/kuber...",node-pool upgrade reverts reserved ip to ephem...,<!-- thanks for filing an issue! before hittin...,2,1,...,0,0,0,0,0,0,0,0,0,6
2,12,2018-05-24 19:39:07 UTC,2018-05-24 19:39:07 UTC,"[help wanted, kind/feature, lifecycle/rotten, ...",kubernetes/kubernetes,"""https://api.github.com/repos/kubernetes/kuber...",allow defining different timeouts for differen...,or at least different for steaming and non-str...,5,2,...,0,0,0,0,0,0,0,0,0,6


In [3]:
# embeddings of GitHub issues [mean, max]
feat_df.head(3)

Unnamed: 0,f_0,f_1,f_2,f_3,f_4,f_5,f_6,f_7,f_8,f_9,...,f_1590,f_1591,f_1592,f_1593,f_1594,f_1595,f_1596,f_1597,f_1598,f_1599
0,-0.059059,-0.024261,-0.014145,-0.047195,-0.104533,0.072735,-0.024336,0.040534,0.037922,-0.18527,...,0.031312,0.133795,0.384244,0.14794,0.404347,0.355213,0.269353,0.227154,0.178882,0.309761
1,-0.04026,0.046072,-0.023587,-0.024443,0.013061,-0.051123,0.036322,0.035255,0.003225,-0.072199,...,0.280282,0.399076,0.573803,0.797381,0.377034,0.601118,0.275607,0.765224,0.337764,0.480431
2,-0.005449,-0.030515,-0.019329,0.018486,0.001672,0.008174,-0.009112,-0.053917,0.073985,0.03468,...,0.293691,0.019976,0.516499,0.4656,0.233908,0.161443,0.155191,0.2282,0.252346,0.30497
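
The `[mean, max]` comment suggests that each 1600-dimensional vector concatenates a mean-pooled and a max-pooled summary of the language model's hidden states. The sketch below illustrates that kind of pooling under stated assumptions: the hidden size of 800, the token count, and the random `hidden_states` array are all hypothetical stand-ins, not values taken from the actual model.

```python
import numpy as np

# Hypothetical hidden states for one issue: (num_tokens, hidden_dim).
# Both sizes are assumptions made for illustration.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(12, 800))

# Pool over the token axis, then concatenate the two summaries.
mean_pool = hidden_states.mean(axis=0)
max_pool = hidden_states.max(axis=0)
embedding = np.concatenate([mean_pool, max_pool])

print(embedding.shape)
```

Pooling over tokens gives every issue a fixed-length vector regardless of its text length, which is what a downstream classifier such as an MLP needs.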


In [4]:
# count the labels in the holdout set
from collections import Counter
c = Counter()

for row in combined_sig_df[combined_sig_df.part == 6].labels:
    c.update(row)

## Split data
Split the data into a training set and a holdout set based on the `part` column: rows with `part == 6` form the holdout set, and all other rows are used for training.
There are 28 label columns because only the 28 sig labels with at least 30 issues were kept during preprocessing in the [EvaluateEmbeddings](https://github.com/machine-learning-apps/IssuesLanguageModel/blob/master/notebooks/5_EvaluateEmbeddings.ipynb) notebook.
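
As a minimal sketch of this kind of split, the toy example below builds a boolean mask from one DataFrame and applies it to a second, row-aligned DataFrame. The frames `issues` and `features` and their values are hypothetical; the key assumption is that both frames share the same row order and index, since the mask is computed on one and applied to the other.

```python
import pandas as pd

# Hypothetical stand-ins for the issue metadata and the embeddings.
issues = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "part": [1, 6, 3, 6],
    "sig/node": [1, 0, 1, 0],
})
features = pd.DataFrame({"f_0": [0.1, 0.2, 0.3, 0.4],
                         "f_1": [1.0, 2.0, 3.0, 4.0]})

# Rows with part == 6 are held out; everything else is training data.
train_mask = issues["part"] != 6
holdout_mask = ~train_mask

# The mask aligns on the index, so both frames must be indexed identically.
X_train = features[train_mask].values
X_heldout = features[holdout_mask].values

print(X_train.shape, X_heldout.shape)
```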

In [5]:
train_mask = combined_sig_df.part != 6
holdout_mask = ~train_mask

In [6]:
X = feat_df[train_mask].values
label_columns = [x for x in combined_sig_df.columns if 'sig/' in x]
y = combined_sig_df[label_columns][train_mask].values

print(X.shape)
print(y.shape)

(7236, 1600)
(7236, 28)


In [7]:
X_holdout = feat_df[holdout_mask].values
y_holdout = combined_sig_df[label_columns][holdout_mask].values

print(X_holdout.shape)
print(y_holdout.shape)

(7154, 1600)
(7154, 28)


In [8]:
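from IPython.display import display
from sklearn.metrics import roc_auc_score

def calculate_auc(predictions):
    """Compute per-label AUC on the holdout set and the count-weighted average."""
    auc_scores = []
    counts = []

    for i, label in enumerate(label_columns):
        # Column i holds the predicted probability for this label.
        y_score = predictions[:, i]
        y_true = y_holdout[:, i]
        auc_scores.append(roc_auc_score(y_true=y_true, y_score=y_score))
        # c holds the label frequencies in the holdout set (see In [4]).
        counts.append(c[label])

    df = pd.DataFrame({'label': label_columns, 'auc': auc_scores, 'count': counts})
    display(df)
    # Weight each label's AUC by its number of holdout issues.
    weightedavg_auc = (df['auc'] * df['count']).sum() / df['count'].sum()
    print(f'Weighted Average AUC: {weightedavg_auc}')
    return df, weightedavg_auc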
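
The weighted average in `calculate_auc` is the sum of each label's AUC times its holdout count, divided by the total count. The short sketch below shows the same arithmetic with `np.average` on made-up numbers and contrasts it with the unweighted mean; the AUC values and counts are hypothetical, not results from this notebook.

```python
import numpy as np

# Hypothetical per-label AUCs and holdout counts, for illustration only.
auc_scores = np.array([0.80, 0.90, 1.00])
counts = np.array([100, 300, 600])

# Weighted mean: sum(auc_i * count_i) / sum(count_i).
weighted_auc = np.average(auc_scores, weights=counts)

# The unweighted (macro) mean treats every label equally.
macro_auc = auc_scores.mean()

print(f"weighted: {weighted_auc:.3f}, macro: {macro_auc:.3f}")
```

Weighting by count means frequent labels dominate the summary number, so a rare label with a poor AUC barely moves it.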

## Sklearn MLP
Feed the embeddings from the language model to the MLP classifier.

In [9]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

In [10]:
mlp = MLPClassifier(early_stopping=True, n_iter_no_change=5, max_iter=500, solver='adam', 
                   random_state=1234)

In [11]:
mlp.fit(X, y)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=5, nesterovs_momentum=True, power_t=0.5,
       random_state=1234, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [12]:
mlp_predictions = mlp.predict_proba(X_holdout)

In [13]:
mlp_df, mlp_auc = calculate_auc(mlp_predictions)

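| | label | auc | count |
|---|---|---|---|
| 0 | sig/cluster-lifecycle | 0.863932 | 498 |
| 1 | sig/node | 0.884496 | 1311 |
| 2 | sig/api-machinery | 0.892453 | 1090 |
| 3 | sig/scalability | 0.907244 | 258 |
| 4 | sig/cli | 0.935913 | 544 |
| 5 | sig/autoscaling | 0.949778 | 100 |
| 6 | sig/network | 0.945694 | 923 |
| 7 | sig/cloud-provider | 0.934848 | 29 |
| 8 | sig/storage | 0.965592 | 824 |
| 9 | sig/scheduling | 0.926638 | 397 |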
9,sig/scheduling,0.926638,397


Weighted Average AUC: 0.9168608333252417


## Grid search
Tune the MLP hyperparameters with a cross-validated grid search.

In [14]:
# Full search space (disabled):
# params = {'hidden_layer_sizes': [(100,), (200,), (400,), (50, 50), (100, 100), (200, 200)],
#           'alpha': [.001, .01, .1, 1, 10],
#           'learning_rate': ['constant', 'adaptive'],
#           'learning_rate_init': [.001, .01, .1]}

# Reduced grid: only hidden_layer_sizes varies; the other parameters are fixed.
params = {'hidden_layer_sizes': [(100,), (200,), (400,), (50, 50), (100, 100), (200, 200)],
          'alpha': [.001],
          'learning_rate': ['adaptive'],
          'learning_rate_init': [.001]}

mlp_clf = MLPClassifier(early_stopping=True, validation_fraction=.2, n_iter_no_change=4, max_iter=500)

gscvmlp = GridSearchCV(mlp_clf, params, cv=5, n_jobs=-1)

gscvmlp.fit(X, y)

GridSearchCV(cv=5, error_score='raise-deprecating',
       estimator=MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(100,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=4, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.2, verbose=False, warm_start=False),
       fit_params=None, iid='warn', n_jobs=-1,
       param_grid={'hidden_layer_sizes': [(100,), (200,), (400,), (50, 50), (100, 100), (200, 200)], 'alpha': [0.001], 'learning_rate': ['adaptive'], 'learning_rate_init': [0.001]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [15]:
print(f'The best model from grid search is:\n=====================================\n{gscvmlp.best_estimator_}')

The best model from grid search is:
MLPClassifier(activation='relu', alpha=0.001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=True, epsilon=1e-08,
       hidden_layer_sizes=(200, 200), learning_rate='adaptive',
       learning_rate_init=0.001, max_iter=500, momentum=0.9,
       n_iter_no_change=4, nesterovs_momentum=True, power_t=0.5,
       random_state=None, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.2, verbose=False, warm_start=False)
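
Beyond `best_estimator_`, the fitted search object keeps the cross-validation score of every candidate in `cv_results_`. The sketch below shows one way to inspect them on a small synthetic multi-label problem; the data from `make_multilabel_classification`, the tiny grid, and the names `X_toy`, `y_toy`, and `toy_search` are illustrative assumptions. With `scoring=None`, `GridSearchCV` uses the estimator's own `score` method, which for a multi-label classifier is subset accuracy rather than AUC.

```python
import pandas as pd
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic multi-label data standing in for the embeddings (hypothetical).
X_toy, y_toy = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=0
)

# A deliberately tiny grid so the example runs quickly.
toy_params = {"hidden_layer_sizes": [(10,), (20,)]}
toy_search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=0), toy_params, cv=3
)
toy_search.fit(X_toy, y_toy)

# One row per candidate, ordered from best to worst mean CV score.
results = pd.DataFrame(toy_search.cv_results_)[
    ["param_hidden_layer_sizes", "mean_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(results)
print(toy_search.best_params_)
```

Looking at the full table shows how close the candidates are to each other, which the single best estimator does not reveal.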


In [16]:
mlp_tuned_predictions = gscvmlp.predict_proba(X_holdout)

In [17]:
mlp_tuned_df, mlp_tuned_auc = calculate_auc(mlp_tuned_predictions)

| | label | auc | count |
|---|---|---|---|
| 0 | sig/cluster-lifecycle | 0.865366 | 498 |
| 1 | sig/node | 0.886492 | 1311 |
| 2 | sig/api-machinery | 0.89692 | 1090 |
| 3 | sig/scalability | 0.906918 | 258 |
| 4 | sig/cli | 0.940113 | 544 |
| 5 | sig/autoscaling | 0.959805 | 100 |
| 6 | sig/network | 0.949056 | 923 |
| 7 | sig/cloud-provider | 0.936532 | 29 |
| 8 | sig/storage | 0.966809 | 824 |
| 9 | sig/scheduling | 0.928674 | 397 |


Weighted Average AUC: 0.9197858043321433
