- The system message helps set the behavior of the assistant. For example, you can modify the personality of the assistant or provide specific instructions about how it should behave throughout the conversation. However note that the system message is optional and the model’s behavior without a system message is likely to be similar to using a generic message such as "You are a helpful assistant."

- The user messages provide requests or comments for the assistant to respond to. 
- Assistant messages store previous assistant responses, but can also be written by you to give examples of desired behavior.

In [0]:
import pandas as pd
import os
import openai
from sklearn.metrics import precision_recall_fscore_support, accuracy_score, classification_report
openai.api_key = "OPEN_AI_API_KEY"
import warnings
import time
warnings.simplefilter("ignore")
from collections import Counter

pred_field = 'sentence' # this can be one of : 'sentence' or 'context'

In [0]:
sentences = ["Here, we provide aLFQ, an open-source implementation of algorithms supporting the estimation of protein quantities by any of the aforementioned methods, and additionally provide automated workflows for data analysis and error estimation.",
             "WGCNA” in R package was used to construct the weighted gene co-expression network [63]",
             "AccuTyping takes inputs of the two color intensities digitized from scanned microarray images with one of the two popular software packages, GenePix (Axon Instrument, Union City, CA) or ImaGene (Biodiscovery, Inc., El Segundo, CA).",
             "The survey data were entered into an Access database using a two-pass data verification process and analyzed using SPSS v15.0 software.",
             "The clinical assessment and laboratory results that were recorded into a Microsoft Access database were analyzed using Statistical Package for the Social Sciences (PASW \u2013 former SPSS) version 18 and R version 2.9.2 (R Foundation for Statistical Computing, Vienna, Austria).",
             "4Cin can also generate models using 4C-seq-like data coming from recently developed techniques such as NG Capture-C or Capture-C, as long as they are used to capture at least 4 viewpoints within each region of interest.",
             "AccuTyping takes inputs of the two color intensities digitized from scanned microarray images with one of the two popular software packages, GenePix (Axon Instrument, Union City, CA) or ImaGene (Biodiscovery, Inc., El Segundo, CA).",
             "aLFQ was implemented in R as a modular S3 package.",
             "The 4Cin pipeline can be deployed pulling the docker image from https://hub.docker.com/r/batxes/4cin_ubuntu/ to avoid the installation of the dependencies.",
             "aLFQ is written in R and freely available under the GPLv3 from CRAN (http://www.cran.r-project.org).",
             "AlignerBoost is implemented as a uniform Java application and is freely available at https://github.com/Grice-Lab/AlignerBoost."]
labels = ['creation', 'usage', 'mention', 'usage', 'usage', 'mention', 'mention', 'creation', 'deposition', 'deposition', 'deposition']

## Data Reading

In [0]:
final_data_df = pd.read_csv('../data/software_citation_intent_merged.csv')
LABEL2TEXT = {0 : 'creation', 1 : 'usage', 2 : 'mention', 3: 'none'}

In [0]:
def update_context(df):
    df['context'] = df.apply(lambda x: x['context'] if x['context'] == x['context'] else x['sentence'], axis = 1)
update_context(final_data_df)

In [0]:
final_data_df.head()

Unnamed: 0.1,Unnamed: 0,id,sentence,used,created,mention,context,label,text
0,0,PMC5189946,All of this analysis was implemented using Mat...,False,True,False,All of this analysis was implemented using Mat...,0,All of this analysis was implemented using Mat...
1,1,PMC4511233,"Code for calculating partition similarity, obt...",False,True,False,Since the probability of getting a given MI is...,0,"Code for calculating partition similarity, obt..."
2,2,PMC4186879,All behavioral statistical analyses were perfo...,False,False,True,All behavioral statistical analyses were perfo...,2,All behavioral statistical analyses were perfo...
3,3,PMC5026371,"M-Track was written using Python 2.7, OpenCV 3...",True,False,False,"M-Track was written using Python 2.7, OpenCV 3...",1,"M-Track was written using Python 2.7, OpenCV 3..."
4,4,PMC1283974,"Mindboggle is a freely downloadable, open sour...",False,True,False,"Mindboggle is a freely downloadable, open sour...",0,"Mindboggle is a freely downloadable, open sour..."


In [0]:
X_train_df = pd.read_csv('../data/software_citation_intent_train.csv')
X_test_df = pd.read_csv('../data/software_citation_intent_test.csv')
X_test_df['label_descriptive'] = X_test_df['label'].apply(lambda x: LABEL2TEXT[x])

update_context(X_train_df)
update_context(X_test_df)

In [0]:
Counter(X_test_df['label_descriptive'].to_list())

Counter({'usage': 449, 'none': 200, 'mention': 95, 'creation': 94})

## Zero-shot GPT model

In [0]:
def predict_gpt(sentences, y_test, initial_message, n = -1, print_every = 10, verbose = False):
    predicted_labels = []
    true_labels = []
    indices = range(len(sentences))
    for i, sentence, label in zip(indices[:n], sentences[:n], y_test[:n]):
        message = initial_message + [{"role": "user", "content": sentence.strip()}]
        try:
            completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=message, request_timeout = 10)
            predicted_class = completion.choices[0].message.content
            if i % print_every == 0:
                if not verbose:
                    print('Sentence', i)
                else:
                    print(i, 'Sentence: ', sentence, '\nPredicted class:', predicted_class, 'Real class:' + label + '\n\n')
            predicted_labels.append(predicted_class)
            true_labels.append(label)
        except:
            'got an error'
            continue
    return predicted_labels, true_labels

## Evaluation

In [0]:
def evaluate(true_labels, predicted_labels):
    p, r, f1, support = precision_recall_fscore_support(true_labels, predicted_labels, average='macro')
    accuracy = round(accuracy_score(true_labels, predicted_labels), 3)
    print('Precision: ', round(p, 3), 'Recall: ', round(r, 3), 'F1:', round(f1, 3), 'Accuracy:', accuracy)
    print(classification_report(true_labels, predicted_labels))

In [0]:
num_examples = 5
examples_used = X_train_df[X_train_df['label_descriptive'] == 'usage'].sample(num_examples)[pred_field].to_list()
examples_created = X_train_df[X_train_df['label_descriptive'] == 'creation'].sample(num_examples)[pred_field].to_list()
examples_mentioned = X_train_df[X_train_df['label_descriptive'] == 'mention'].sample(num_examples)[pred_field].to_list()
examples_none = X_train_df[X_train_df['label_descriptive'] == 'none'].sample(num_examples)['sentence'].to_list()

# PROMPT 1
initial_message = [{"role": "system", 
                "content": "You are a scientist trying to figure out the citation intent behind software mentioned in sentences coming from research articles. Your four categories are: usage, creation, mention, or none. The definitions of the classes are: \
                - usage: software was used in the paper \
                - creation: software was created by the authors of the paper \
                - mention: software was mentioned in the paper, but not used, nor created \
                - none: none of the previous 3 categories apply \
                You need to output one category only."}]
for example in examples_used:
    initial_message += [{"role": "user", "content" : example}]
    initial_message += [{"role": "assistant", "content" : 'usage'}]
for example in examples_created:
    initial_message += [{"role": "user", "content" : example}]
    initial_message += [{"role": "assistant", "content" : 'creation'}]
for example in examples_mentioned:
    initial_message += [{"role": "user", "content" : example}]
    initial_message += [{"role": "assistant", "content" : 'mention'}]
for example in examples_none:
    initial_message += [{"role": "user", "content" : example}]
    initial_message += [{"role": "assistant", "content" : 'none'}]

# PROMPT 2
# initial_message_content = "You are a scientist trying to figure out the citation intent behind software mentioned in sentences coming from research articles. Your four categories are: usage, creation, mention, or none. The definitions of the classes are: \
#                 - usage: software was used in the paper"
# for example in examples_used:
#     initial_message_content += '\tExample: ' + example
# initial_message_content += '- creation: software was created by the authors of the paper'
# for example in examples_created:
#     initial_message_content += '\tExample: ' + example
# initial_message_content += '- mention: software was mentioned in the paper, but not used, nor created'
# for example in examples_created:
#     initial_message_content += '\tExample: ' + example
# initial_message_content += "- none: none of the previous 3 categories apply"
# for example in examples_none:
#     initial_message_content += '\tExample: ' + example

# initial_message = [{"role": "system", "content": initial_message_content}]

#### Test dataset

In [0]:
n = -1

In [0]:
test_sentences = X_test_df[pred_field].to_list()
y_test = X_test_df['label_descriptive'].to_list()
y_pred, y_test_completed = predict_gpt(test_sentences, y_test, initial_message, n = n, print_every = 1)

Sentence 3
Sentence 6
Sentence 7
Sentence 8
Sentence 9
Sentence 10
Sentence 11
Sentence 13
Sentence 30
Sentence 33
Sentence 35
Sentence 36
Sentence 37
Sentence 39
Sentence 43
Sentence 44
Sentence 45
Sentence 46
Sentence 47
Sentence 48
Sentence 49
Sentence 50
Sentence 51
Sentence 52
Sentence 53
Sentence 57
Sentence 84
Sentence 85
Sentence 108
Sentence 120
Sentence 122
Sentence 124
Sentence 125
Sentence 140
Sentence 144
Sentence 148
Sentence 149
Sentence 150
Sentence 151
Sentence 153
Sentence 156
Sentence 190
Sentence 194
Sentence 205
Sentence 208
Sentence 221
Sentence 223
Sentence 225
Sentence 234
Sentence 249
Sentence 251
Sentence 254
Sentence 270
Sentence 271
Sentence 272
Sentence 273
Sentence 274
Sentence 284
Sentence 285
Sentence 286
Sentence 287
Sentence 296
Sentence 309
Sentence 312
Sentence 320
Sentence 321
Sentence 331
Sentence 333
Sentence 347
Sentence 350
Sentence 358
Sentence 361
Sentence 363
Sentence 366
Sentence 373
Sentence 382
Sentence 384
Sentence 387
Sentence 388
Senten

In [0]:
evaluate(y_test_completed, y_pred)

Precision:  0.667 Recall:  0.469 F1: 0.45 Accuracy: 0.517
              precision    recall  f1-score   support

    creation       1.00      0.25      0.40        20
     mention       0.33      0.29      0.31        24
        none       0.40      0.97      0.56        67
       usage       0.94      0.36      0.53       129

    accuracy                           0.52       240
   macro avg       0.67      0.47      0.45       240
weighted avg       0.73      0.52      0.50       240



#### CZI validation dataset

In [0]:
czi_combined = pd.read_csv('/Workspace/Users/aistrate@chanzuckerberg.com/czi_val.csv')
test_sentences_czi = czi_combined['text'].to_list()
y_test_czi = czi_combined['label'].to_list()
y_pred_czi, y_test_czi_completed = predict_gpt(test_sentences_czi, y_test_czi, initial_message, n = n, print_every = 1)

Sentence 0
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 6
Sentence 7
Sentence 8
Sentence 9
Sentence 11
Sentence 12
Sentence 13
Sentence 14
Sentence 15
Sentence 16
Sentence 17
Sentence 18
Sentence 20
Sentence 21
Sentence 22
Sentence 23
Sentence 24
Sentence 25
Sentence 26
Sentence 27
Sentence 28
Sentence 29
Sentence 30
Sentence 31
Sentence 32
Sentence 33
Sentence 34
Sentence 35
Sentence 36
Sentence 37
Sentence 38
Sentence 39
Sentence 40
Sentence 41
Sentence 42
Sentence 44
Sentence 45
Sentence 46
Sentence 47
Sentence 48
Sentence 49
Sentence 50
Sentence 51
Sentence 52
Sentence 53
Sentence 54
Sentence 55
Sentence 56
Sentence 57
Sentence 58
Sentence 59
Sentence 60
Sentence 61
Sentence 62
Sentence 63
Sentence 64
Sentence 65
Sentence 67
Sentence 68
Sentence 69
Sentence 70
Sentence 71
Sentence 72
Sentence 73
Sentence 74
Sentence 75
Sentence 77
Sentence 78
Sentence 79
Sentence 80
Sentence 81
Sentence 82
Sentence 83
Sentence 84
Sentence 85
Sentence 86
Sentence 87
Sentence 88
Sentence 89
S

In [0]:
evaluate(y_test_czi_completed, y_pred_czi)

Precision:  0.429 Recall:  0.378 F1: 0.29 Accuracy: 0.332
              precision    recall  f1-score   support

    creation       0.50      0.33      0.40         9
     mention       0.17      0.19      0.18        27
        none       0.05      0.67      0.09        18
       usage       1.00      0.33      0.49       335

    accuracy                           0.33       389
   macro avg       0.43      0.38      0.29       389
weighted avg       0.89      0.33      0.45       389

