# Project setup and libraries

In [None]:
! pip install simpletransformers

In [2]:
import pandas as pd 
import torch 
from sklearn.model_selection import train_test_split
from simpletransformers.classification import ClassificationModel

In [4]:
gpu = torch.cuda.is_available()

In [5]:
gpu

True

# Data loading


In [7]:
df = pd.read_csv('https://raw.githubusercontent.com/LucasRotsen/tcc_case_study_tasks/main/data/task_sample.csv', sep='|', names=['title', 'description', 'type']).dropna()

# Modeling


### Encode bugs as 1 and every other type as 0


In [23]:
# Binary target: 1 for rows typed "Bug", 0 for every other type
df['labels'] = (df['type'] == 'Bug').astype(int)

### Build the text column used for training

In [24]:
# Concatenate title and description into the single input string the model reads
df['text'] = df['title'].astype(str) + ' ' + df['description'].astype(str)

## Train/test split

In [42]:
# Hold out half of the rows for evaluation
train, test = train_test_split(df, test_size=0.5)
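The split above is unseeded and unstratified, so every run trains and evaluates on different rows, and the share of bugs in each half can drift. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. The stdlib-only sketch below only illustrates what a seeded, stratified split does; the function name `stratified_split` and the toy rows are hypothetical and not part of the notebook.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="labels", test_size=0.5, seed=42):
    """Split rows so each label keeps roughly the same share in both halves."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable

    # Group rows by their label value
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, test = [], []
    for label_rows in by_label.values():
        shuffled = label_rows[:]  # copy so the caller's list is not reordered
        rng.shuffle(shuffled)
        n_test = round(len(shuffled) * test_size)
        test.extend(shuffled[:n_test])
        train.extend(shuffled[n_test:])
    return train, test


# Toy data (illustrative only): 6 bugs and 4 non-bugs
rows = [{"text": f"issue {i}", "labels": 1 if i < 6 else 0} for i in range(10)]
train_rows, test_rows = stratified_split(rows)
print(len(train_rows), len(test_rows))
print(sum(r["labels"] for r in test_rows), "bugs in the test half")
```

With the notebook's DataFrame, the equivalent would presumably be `train_test_split(df, test_size=0.5, random_state=42, stratify=df['labels'])`; the seed value is an arbitrary choice.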

## Set up the model

In [26]:
model = ClassificationModel(
    "roberta",
    "roberta-base",
    use_cuda=gpu
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'roberta.pooler.dense.bias', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.weight', 'classi

## Training 

In [27]:
model.train_model(train)

  0%|          | 0/7500 [00:00<?, ?it/s]

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Running Epoch 0 of 1:   0%|          | 0/938 [00:00<?, ?it/s]


(938, 0.4649371043451305)
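The tuple returned by `train_model` reads as (number of optimizer steps, average training loss). The step count can be checked by hand from the progress bars above. This is a small arithmetic sketch, assuming the library's default training batch size of 8 was in effect, since the notebook does not set one.

```python
import math

train_rows = 7500  # from the first progress bar above
batch_size = 8     # assumption: default train_batch_size, not set in the notebook
epochs = 1         # from the "Epoch: 0/1" progress bar above

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * epochs
print(total_steps)
```

Under that assumption the result matches both the 938 batches shown in the epoch progress bar and the first element of the returned tuple.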

## Evaluation

In [29]:
result, model_outputs, wrong_predictions = model.eval_model(test)

  0%|          | 0/7500 [00:00<?, ?it/s]

Running Evaluation:   0%|          | 0/938 [00:00<?, ?it/s]

In [30]:
result

{'auprc': 0.8822140429492533,
 'auroc': 0.9162827630174568,
 'eval_loss': 0.41754718121689266,
 'fn': 612,
 'fp': 557,
 'mcc': 0.6856586725268654,
 'tn': 3513,
 'tp': 2818}
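The `result` dictionary reports confusion-matrix counts but not accuracy, precision, recall, or F1. These can be derived from the four counts. The sketch below uses only the numbers printed above; the helper name `confusion_metrics` is hypothetical.

```python
import math

# Counts copied from the evaluation output above
result = {"tp": 2818, "tn": 3513, "fp": 557, "fn": 612}


def confusion_metrics(tp, tn, fp, fn):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    # Matthews correlation coefficient
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


metrics = confusion_metrics(**result)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On these counts accuracy comes out near 0.844, precision near 0.835, recall near 0.822, and F1 near 0.828. The recomputed MCC agrees with the `mcc` value reported by `eval_model`, which is a useful sanity check on the counts.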

### Build label lookup table


In [31]:
cor_tab = {0: 'Not a Bug', 1: 'Bug!'}

In [52]:
issues = [
          """hub does not allow to update labels with space hub issue update 2 -l bug,documentation,enhancement,good goissue
What happened:

The label which got added was good and goissue was missed.
hub cli is not allowing to set labels with space for example labels such as good first issue cannot be set using the hub cli
More info:

git version 2.30.1 (Apple Git-130)
hub version refs/heads/master""",

"""hub.github.com dark themed browser issue If you'd like to reproduce it, it's firefox with browser.display.use_system_colors = true and env GTK_THEME=Adwaita:dark""",

"""Msys2 Package The problem I'm trying to solve:
Feadability of using the hub command inside a cygwin/msys based environment

How I imagine hub could expose this functionality:
making a Msys2 Package for the program.
Packaging would be essentially the exact same as archlinux, except utilizing the EXE file instead """,
"""Sync should "force" update local branches The problem I'm trying to solve:
When branches are force pushed into upstream, the local branches will be out of sync. And require manual remediation. By forcing an update, hub will ensure that all branches will reflect upstream.

How I imagine hub could expose this functionality:
hub sync --force""",
"""Get all unclaimed issues The problem I'm trying to solve:
A command that can check all unclaimed issues """,
"""hub sync reverts worktree Command attempted:
hub sync

What happened:

I had a worktree checked out (test-branch) that was behind origin.
I did a "hub sync" in the root of the repository (which has master checked out). It appears to update the "test-branch" locally.

When I go to the worktree folder, I notice that the opposite diff is staged for commit that reverts the worktree BACK to the state before the "hub sync".

More info:
hub version 2.13.0""",
"""Unable to install md2roff-bin binary, locally or in CI/CD. I used to be able to go run and go get and go install this CLI tool from my mac's terminal as well as in Jenkins CI pipelines. Lately, my builds have been failing with the error shown above. I've tried a few different things, but I'm just not sure what I'm missing. How do I build and install the md2roff-bin binary in a CI environment if this method is no longer going to work? Perhaps this is a bug? I'm really just not sure what happened. Thank you! """,
"""Sync should "force" update local branches The problem I'm trying to solve:
When branches are force pushed into upstream, the local branches will be out of sync. And require manual remediation. By forcing an update, hub will ensure that all branches will reflect upstream.

How I imagine hub could expose this functionality:
hub sync --force

""",
"""make pr with no arguments the same as pr list The problem I'm trying to solve:

Attempting to list the issues, pull requests, and releases for a GitHub repository:

hub issue - lists all open issues
hub release - lists all releases
hub pr - does nothing, shows usage information
How I imagine hub could expose this functionality:

For self-consistency, and consistency with git builtins, hub pr should do what hub pr list does now, the extra verb should be implied as the default action.

It would also be fine if hub issue list, hub pr list, and hub release list were all aliases for hub issue, hub pr, and hub release with no verb.""",
"""Editing release tag from command line always results as "untagged-XXXXX"  So basically I'm trying to automate my release process by creating a draft "proxy" release and I would like to edit its tag version when I'm about to actually publish the release.

The thing is, I realized that:
1_ When I try to edit anything regarding the release, the tag ends up being something like "untagged-1421c803c66244c71c64"
2_ Can we even change just the tag for a draft release?"""
]

In [60]:
templates_issues = ["bug", "bug", "feat", "feat", "feat", "bug", "bug", "feat", "feat", "bug"]

In [54]:
predictions, raw_outputs = model.predict(issues)

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/2 [00:00<?, ?it/s]
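`predictions` holds only the winning class index for each issue. To see how confident the model is, the per-class scores in `raw_outputs` can be turned into probabilities with a softmax. This is a minimal sketch, assuming `raw_outputs` contains one row of two logits per issue; the logit values below are made up for illustration and are not taken from this run.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


cor_tab = {0: 'Not a Bug', 1: 'Bug!'}

# Hypothetical logits for two issues (illustrative values only)
example_raw_outputs = [[-1.2, 1.5], [2.0, -0.7]]

for logits in example_raw_outputs:
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)  # index of top class
    print(cor_tab[label], round(probs[label], 3))
```

A prediction whose top probability sits close to 0.5 is a natural candidate for manual review.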

In [62]:
list(zip([cor_tab[index] for index in predictions], templates_issues))

[('Bug!', 'bug'),
 ('Bug!', 'bug'),
 ('Not a Bug', 'feat'),
 ('Not a Bug', 'feat'),
 ('Not a Bug', 'feat'),
 ('Bug!', 'bug'),
 ('Bug!', 'bug'),
 ('Not a Bug', 'feat'),
 ('Not a Bug', 'feat'),
 ('Not a Bug', 'bug')]
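The comparison above can be summarized as an agreement rate between the model and the hand-assigned labels. The sketch below copies the ten pairs from the output; the `expected` mapping from hand labels to `cor_tab` values is an assumption about how the two vocabularies line up.

```python
# (model prediction, hand-assigned label) pairs copied from the output above
pairs = [
    ('Bug!', 'bug'), ('Bug!', 'bug'), ('Not a Bug', 'feat'),
    ('Not a Bug', 'feat'), ('Not a Bug', 'feat'), ('Bug!', 'bug'),
    ('Bug!', 'bug'), ('Not a Bug', 'feat'), ('Not a Bug', 'feat'),
    ('Not a Bug', 'bug'),
]

# Assumed mapping between hand labels and the model's label names
expected = {'bug': 'Bug!', 'feat': 'Not a Bug'}

matches = [expected[hand_label] == predicted for predicted, hand_label in pairs]
agreement = sum(matches) / len(pairs)
mismatched = [i for i, ok in enumerate(matches) if not ok]

print(f"agreement: {agreement:.0%}")
print("mismatched issue indexes:", mismatched)
```

The only disagreement is the last issue (index 9, the "untagged-XXXXX" release tag report), which was hand-labeled as a bug but predicted as not a bug. Ten samples are too few to draw conclusions beyond a quick smoke test.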