In [72]:
import pandas as pd
from sentence_transformers import SentenceTransformer


I will use a simple sentence transformer to create the embeddings, then feed those embeddings into a simple multi-output classifier.

In [73]:
df = pd.read_json("data/dataset.json")
df[:50]

# I decided to shuffle the dataframe to randomize the 50/50 train-test split: when I tried this approach
#  without shuffling, the accuracy was 0 because the test split had labels that weren't present in the train split.

#df = df.sample(frac=1).reset_index(drop=True)

Unnamed: 0,summary,description,reporter_name,project_name,Assignee,Priority,Type,id
0,Issue reported: Search bar expands unexpectedl...,Details provided:\n**Problem:** The search bar...,user_079,Unidentified Roe,user_128,Normal,Bug,1
1,Detected anomaly: AI Assistant Icon Color Miss...,Observed behaviour:\nThe AI Assistant menu ite...,user_062,Unidentified Roe,user_018,Normal,Task,2
2,User noticed: Preserve table filtering after n...,"Description:\n**Problem:**\nCurrently, when a ...",user_134,Fast Buffalo,,Normal,Bug,3
3,Issue reported: Containers crash with Failed t...,Observed behaviour:\nWhen running six containe...,user_105,Unidentified Hedgehog,user_126,Major,Bug,4
4,System shows: Inconsistent inlay hints for fun...,Summary of issue:\nThere is an inconsistency i...,user_078,Lazy Whale,user_056,Normal,Bug,5
5,Issue reported: Context menu action Disable fo...,Summary of issue:\nThe 'Disable for This Proje...,user_113,Lazy Beaver,user_113,Normal,Bug,6
6,System shows: False positive in IncorrectCance...,Observed behaviour:\nThe `IncorrectCancellatio...,user_120,Fast Badger,user_070,Normal,Bug,7
7,Detected anomaly: High CPU Usage and Cache Reb...,Observed behaviour:\nAfter updating the projec...,user_127,Unidentified Tiger,user_048,Normal,Bug,8
8,Problem observed: Add Mark as relates to (and/...,"Reported case:\nCurrently, the system suggests...",user_079,Unidentified Roe,user_045,Normal,Usability problem,9
9,Improvement needed: AI duplicate detection inc...,Observed behaviour:\nThere is an issue with th...,user_081,Unidentified Roe,user_081,Normal,Bug,10
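
The comment in the cell above mentions test labels that never occur in the train split. A minimal sketch of how one might check for that before training is shown here; the helper name `unseen_labels` and the toy label lists are hypothetical and are not taken from the dataset.

```python
def unseen_labels(train_labels, test_labels):
    """Return the labels that occur in test_labels but never in train_labels."""
    return sorted(set(test_labels) - set(train_labels))


# Toy label lists (hypothetical); in the notebook these would be
# columns such as df["Type"] for the two halves of the split.
train = ["Bug", "Task", "Bug", "Feature"]
test = ["Bug", "Exception", "Task", "Cosmetics"]

missing = unseen_labels(train, test)
print(missing)
```

An empty result would mean every test label was seen during training; a non-empty one suggests the split (or the shuffling) needs another look.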


I decided to use a LabelEncoder while at it; for this simple case we could also just do list(set(labels)) and enumerate.
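
The manual alternative mentioned above could look roughly like the sketch below. It assumes a toy `labels` list and sorts the set first, because the iteration order of a set of strings is not stable across Python runs; sorting also matches the ordering LabelEncoder uses for its classes.

```python
# Toy labels (hypothetical); in the notebook this would be e.g. df["Type"][:50].
labels = ["Bug", "Task", "Bug", "Feature", "Task"]

# Sort the unique labels so the mapping is deterministic between runs.
classes = sorted(set(labels))
label_to_id = {label: i for i, label in enumerate(classes)}

# Encode to integers, then decode back to the original strings.
encoded = [label_to_id[label] for label in labels]
decoded = [classes[i] for i in encoded]

print(classes, encoded)
```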

In [74]:
import numpy as np
from sklearn.preprocessing import LabelEncoder

type_encoder = LabelEncoder()
priority_encoder = LabelEncoder()
project_encoder = LabelEncoder()

type_labels = type_encoder.fit_transform(df["Type"][:50])
priority_labels = priority_encoder.fit_transform(df["Priority"][:50])
project_labels = project_encoder.fit_transform(df["project_name"][:50])

df["type_enc"] = -1
df["priority_enc"] = -1
df["project_enc"] = -1

df.loc[:49, "type_enc"] = type_encoder.transform(df["Type"][:50])
df.loc[:49, "priority_enc"] = priority_encoder.transform(df["Priority"][:50])
df.loc[:49, "project_enc"] = project_encoder.transform(df["project_name"][:50])

print(type_encoder.classes_,
priority_encoder.classes_,
project_encoder.classes_)



['Bug' 'Cosmetics' 'Exception' 'Feature' 'Performance problem' 'Task'
 'Usability Problem' 'Usability problem'] ['Critical' 'Major' 'Minor' 'Normal' 'P4'] ['Fast Badger' 'Fast Buffalo' 'Fast Falcon' 'Fast Hedgehog' 'Fast Panda'
 'Fast Roe' 'Fast Tiger' 'Fast Wolf' 'Lazy Beaver' 'Lazy Panda'
 'Lazy Raccoon' 'Lazy Whale' 'Unidentified Hedgehog' 'Unidentified Roe'
 'Unidentified Tiger' 'Unidentified Whale']


In [75]:
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
df['embeddings'] = df.apply(lambda row: embedding_model.encode(f"{row['summary']} {row['description']}"), axis=1)

In [76]:
x_train = np.vstack(df['embeddings'].values[:50])
y_train = df[["type_enc", "priority_enc", "project_enc"]].values[:50]

x_test = np.vstack(df['embeddings'].values[50:])


x_train.shape, y_train.shape, x_test.shape

((50, 384), (50, 3), (50, 384))

In [77]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

classifier = MultiOutputClassifier(LogisticRegression(class_weight='balanced'))
classifier.fit(x_train, y_train)

predictions = classifier.predict(x_test)
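
To illustrate what the cell above does with the three targets, here is a small sketch on synthetic data. The random features stand in for the sentence embeddings and the two random label columns stand in for the encoded targets; the sizes and `max_iter=1000` are arbitrary choices for the illustration, not settings from this notebook. MultiOutputClassifier fits one independent copy of the base estimator per target column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 40 "embeddings" of dimension 8 and two label columns.
X = rng.normal(size=(40, 8))
Y = np.column_stack([
    rng.integers(0, 3, size=40),  # e.g. an encoded Type column
    rng.integers(0, 2, size=40),  # e.g. an encoded Priority column
])

clf = MultiOutputClassifier(
    LogisticRegression(class_weight="balanced", max_iter=1000)
)
clf.fit(X, Y)

# One fitted estimator per target column; predictions keep the column layout.
n_estimators = len(clf.estimators_)
pred_shape = clf.predict(X[:5]).shape
print(n_estimators, pred_shape)
```

Because each column gets its own model, the predicted Type, Priority and project are chosen independently of one another.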

# UPDATE
For Deliverable 2: fill the dataset with the predicted values. The filled dataset is saved in the data folder as dataset_filled.json.

In [78]:
df = pd.read_json("data/dataset.json")

def predictions_to_labels(preds):
    # Map one row of encoded predictions back to (Type, Priority, project_name) strings.
    return (type_encoder.inverse_transform([preds[0]])[0],
            priority_encoder.inverse_transform([preds[1]])[0],
            project_encoder.inverse_transform([preds[2]])[0])

predictions = [predictions_to_labels(pred) for pred in predictions]

df.loc[50:, ['Type', 'Priority', 'project_name']] = predictions
df[50:100]

Unnamed: 0,summary,description,reporter_name,project_name,Assignee,Priority,Type,id
50,Issue reported: java.lang.NoSuchMethodError: o...,Summary of issue:\n**Version:** Build #IU-251....,user_091,Fast Falcon,,Critical,Bug,51
51,Issue reported: Fix bash script formating issu...,Description:\nRemove bash script formating fro...,user_028,Lazy Raccoon,user_028,Normal,Task,52
52,Problem observed: Authorization Issue with Sum...,Summary of issue:\nThe 'SummarizeT' Slack app ...,user_130,Lazy Beaver,user_124,Minor,Task,53
53,User noticed: Regression: Unfriendly error whe...,Details provided:\nWhen attempting to `Create ...,user_050,Fast Falcon,user_001,Normal,Task,54
54,System shows: Improve project selection for su...,Summary of issue:\nThe current 'summarizeT' fu...,user_081,Unidentified Roe,user_124,Minor,Bug,55
55,System shows: Disable Setting value is redunda...,"Description:\nThe diagnostic ""Setting value is...",user_121,Lazy Panda,user_121,Normal,Task,56
56,System shows: Java imports are unresolved in B...,Reported case:\nObserved behavior:\n\n* Java i...,user_028,Fast Falcon,,Normal,Bug,57
57,Issue reported: Buildifier does not reformat f...,Details provided:\nThe `buildifier` tool does ...,user_028,Fast Roe,user_076,Normal,Task,58
58,Improvement needed: Redundant .Companion is in...,"Reported case:\nAfter invoking the ""Move"" refa...",user_072,Fast Wolf,user_107,Normal,Bug,59
59,Unexpected behavior: Enhance project switching...,"Description:\n**Problem:**\n\n* Previously, sw...",user_079,Unidentified Roe,user_040,Normal,Usability Problem,60
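
The decoding step above converts one prediction row at a time. As a possible alternative, `inverse_transform` also accepts a whole column of encoded values, so each target can be decoded in a single call. The sketch below uses toy encoders and a toy prediction array (both hypothetical) rather than the notebook's fitted objects.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy encoders (hypothetical classes); LabelEncoder stores classes in sorted order.
enc_type = LabelEncoder().fit(["Bug", "Task", "Feature"])
enc_prio = LabelEncoder().fit(["Normal", "Major"])

# Toy encoded predictions: column 0 is Type, column 1 is Priority.
preds = np.array([[0, 1],
                  [2, 0],
                  [1, 1]])

# Decode each column in one call instead of looping over rows.
decoded_type = enc_type.inverse_transform(preds[:, 0])
decoded_prio = enc_prio.inverse_transform(preds[:, 1])

print(list(decoded_type), list(decoded_prio))
```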


In [79]:
df.to_json("data/dataset_filled.json", orient="records", indent=4)