In [41]:
import pandas as pd
from sentence_transformers import SentenceTransformer


I will use a simple sentence transformer to create the embeddings, and then feed the embeddings into a simple multi-output classifier.

In [42]:
df = pd.read_json("data/dataset.json")
df.head()

# I shuffle the dataframe to randomize the 50/50 train-test split. Without shuffling,
# the accuracy was 0 because the test split had labels that weren't present in the train split.

df = df.sample(frac=1).reset_index(drop=True)
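A side note on reproducibility: `sample(frac=1)` without a seed gives a different order on every run, so the split and the score change between runs. Here is a minimal sketch of a seeded shuffle, assuming a toy dataframe (`df_demo`) and an arbitrary seed value (`SEED`), neither of which is part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
df_demo = pd.DataFrame({"id": range(1, 7), "label": list("aabbcc")})

SEED = 42  # arbitrary choice; any fixed integer makes the shuffle repeatable

# Same seed, same row order, so the train/test split stays identical across runs.
shuffled_a = df_demo.sample(frac=1, random_state=SEED).reset_index(drop=True)
shuffled_b = df_demo.sample(frac=1, random_state=SEED).reset_index(drop=True)

print(shuffled_a.equals(shuffled_b))
```

Note that `reset_index(drop=True)` discards the original row positions, so the `id` column is the only remaining link back to the rows in the file.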

I decided to use a LabelEncoder while at it; for this simple case we could also just do list(set(labels)) and enumerate.
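The manual alternative could look roughly like the sketch below. It uses made-up label values and assumes there are no missing labels (a mix of `None` and strings cannot be sorted). I use `sorted(set(...))` rather than `list(set(...))` because set order for strings is not guaranteed to be the same between runs, which would change the codes.

```python
# Illustrative labels, not taken from the dataset.
labels = ["Bug", "Feature", "Bug", "Task"]

# Sorting makes the label -> id mapping deterministic.
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}
id_to_label = dict(enumerate(classes))

# Encode, then decode again to check the round trip.
encoded = [label_to_id[label] for label in labels]
decoded = [id_to_label[i] for i in encoded]

print(encoded)
```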

In [43]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

type_encoder = LabelEncoder()
priority_encoder = LabelEncoder()
project_encoder = LabelEncoder()

df["type_enc"] = type_encoder.fit_transform(df["Type"])
df["priority_enc"] = priority_encoder.fit_transform(df["Priority"])
df["project_enc"] = project_encoder.fit_transform(df["project_name"])

df.head()

Unnamed: 0,summary,description,reporter_name,project_name,Assignee,Priority,Type,id,type_enc,priority_enc,project_enc
0,Issue reported: Rename AppCode module configur...,Summary of issue:\nAuto-generated issue based ...,,,,,,75,8,5,16
1,Problem observed: Investigate multiple excepti...,Details provided:\nAuto-generated issue based ...,,,,,,86,8,5,16
2,Detected anomaly: AI Mentions user themselves ...,Observed behaviour:\nThe AI feature integrated...,user_081,,user_081,,,65,8,5,16
3,Improvement needed: Clarify Git branching stra...,Reported case:\nThe current state of the Git b...,user_133,,user_133,,,63,8,5,16
4,System shows: Settings tab: Improve clarity of...,Observed behaviour:\nThere are two search fiel...,user_054,Fast Badger,user_083,Normal,Usability Problem,34,6,3,0


In [44]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
df['embeddings'] = df.apply(lambda row: embedding_model.encode(f"{row['summary']} {row['description']}"), axis=1)

In [45]:
x_train = np.vstack(df['embeddings'].values[:50])
y_train = df[["type_enc", "priority_enc", "project_enc"]].values[:50]
x_test = np.vstack(df['embeddings'].values[50:])
y_test = df[["type_enc", "priority_enc", "project_enc"]].values[50:]

x_train.shape, y_train.shape, x_test.shape, y_test.shape

((50, 384), (50, 3), (50, 384), (50, 3))

In [46]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import LinearSVC


classifier = MultiOutputClassifier(LinearSVC())
classifier.fit(x_train, y_train)

predictions = classifier.predict(x_test)
classifier.score(x_test,y_test) 

np.float64(0.36)
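To my understanding, `MultiOutputClassifier.score` reports exact-match accuracy: a row only counts as correct when all three outputs are predicted correctly, so the number above is stricter than the accuracy of any single output. A small sketch with made-up arrays (not the notebook's data) shows the difference between the two views:

```python
import numpy as np

# Toy labels; columns are (type, priority, project). Values are made up.
y_true = np.array([[0, 1, 2],
                   [1, 1, 0],
                   [2, 0, 1],
                   [0, 0, 0]])
y_pred = np.array([[0, 1, 2],   # all three outputs correct
                   [1, 0, 0],   # priority wrong
                   [2, 0, 0],   # project wrong
                   [1, 0, 0]])  # type wrong

# Exact-match accuracy: every output in the row must be right.
exact_match = np.all(y_true == y_pred, axis=1).mean()

# Per-output accuracy: one value per column.
per_output = (y_true == y_pred).mean(axis=0)

print(exact_match, per_output)
```

With the notebook's variables, the same per-output view would be `(predictions == y_test).mean(axis=0)`.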

In [47]:
table = df[50:][["project_name","type_enc", "priority_enc", "project_enc"]]
pred_df = pd.DataFrame(predictions, index=table.index,columns=["type_pred", "priority_pred", "project_pred"])
table = table.join(pred_df)

table['type_correct'] = table['type_enc'] == table['type_pred']
table['priority_correct'] = table['priority_enc'] == table['priority_pred']
table['project_correct'] = table['project_enc'] == table['project_pred']

# Many issues have no project name or type, and the dataset is really small, so our accuracy is low.
accuracy_report = table.groupby('project_name',dropna=False)[['type_correct', 'priority_correct', 'project_correct']].mean() * 100
accuracy_report

project_name,type_correct,priority_correct,project_correct
Fast Badger,0.0,100.0,100.0
Fast Buffalo,0.0,50.0,0.0
Fast Falcon,0.0,0.0,0.0
Fast Panda,100.0,100.0,0.0
Fast Roe,0.0,50.0,0.0
Fast Wolf,0.0,25.0,0.0
Lazy Beaver,50.0,50.0,0.0
Lazy Raccoon,0.0,0.0,0.0
Lazy Whale,100.0,100.0,0.0
Unidentified Hedgehog,0.0,100.0,0.0


# UPDATE
For Deliverable 2, the dataset is filled with the predicted values. The filled dataset can be found in the data folder as dataset_filled.json.

In [62]:
df = pd.read_json("data/dataset.json")

df.loc[50:, ['Type', 'Priority', 'project_name']] = predictions
df[50:60]

Unnamed: 0,summary,description,reporter_name,project_name,Assignee,Priority,Type,id
50,Issue reported: java.lang.NoSuchMethodError: o...,Summary of issue:\n**Version:** Build #IU-251....,user_091,16,,5,8,51
51,Issue reported: Fix bash script formating issu...,Description:\nRemove bash script formating fro...,user_028,16,user_028,5,8,52
52,Problem observed: Authorization Issue with Sum...,Summary of issue:\nThe 'SummarizeT' Slack app ...,user_130,16,user_124,5,8,53
53,User noticed: Regression: Unfriendly error whe...,Details provided:\nWhen attempting to `Create ...,user_050,16,user_001,3,8,54
54,System shows: Improve project selection for su...,Summary of issue:\nThe current 'summarizeT' fu...,user_081,5,user_124,3,5,55
55,System shows: Disable Setting value is redunda...,"Description:\nThe diagnostic ""Setting value is...",user_121,16,user_121,5,8,56
56,System shows: Java imports are unresolved in B...,Reported case:\nObserved behavior:\n\n* Java i...,user_028,16,,1,8,57
57,Issue reported: Buildifier does not reformat f...,Details provided:\nThe `buildifier` tool does ...,user_028,13,user_076,3,0,58
58,Improvement needed: Redundant .Companion is in...,"Reported case:\nAfter invoking the ""Move"" refa...",user_072,16,user_107,5,8,59
59,Unexpected behavior: Enhance project switching...,"Description:\n**Problem:**\n\n* Previously, sw...",user_079,16,user_040,3,0,60
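The filled columns above hold the encoders' integer codes (16, 5, 8, ...) rather than label names. The predictions were also made on rows 50 onward of the shuffled dataframe, while the reloaded dataframe is in file order, so positions 50 onward may not be the same issues. A possible way to handle both is sketched below, assuming the `id` values of the test rows were kept (`test_ids`, a hypothetical variable) and using toy data instead of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the reloaded dataset; values are illustrative only.
original = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Type": ["Bug", "Feature", None, None],
})

# Stand-in for the fitted type_encoder from the notebook.
encoder = LabelEncoder()
encoder.fit(["Bug", "Feature"])

# Hypothetical: ids of the test rows in shuffled order, with their predicted codes.
test_ids = np.array([4, 3])
pred_codes = np.array([1, 0])

# Turn integer codes back into label names.
pred_labels = encoder.inverse_transform(pred_codes)

# Write the labels back by id, not by row position.
filled = original.set_index("id")
filled.loc[test_ids, "Type"] = pred_labels
filled = filled.reset_index()

print(filled)
```

In the notebook itself, this would mean saving something like `df["id"].values[50:]` from the shuffled dataframe before reloading the file, and calling `inverse_transform` once per encoder.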


In [65]:
df.to_json("data/dataset_filled.json", orient="records", indent=4)