# Fine-tuning

This notebook experiments with three ways of generating fine-tuning data:
1. **FT-1** - For each feature/property pair, use one positive and one negative example, each with an explanation.
2. **FT-25** - For each feature/property pair, use 25 positive and 25 negative examples, without explanations.
3. **FT-Maj** - For each sentence and each feature/property pair in the human-annotated data, exclude validation-set sentences, then label a sentence positive when a majority of annotators chose the property and negative when no annotator chose it.

In [825]:
from collections import defaultdict
import json
import pandas as pd

## Format for a single example

```
{"messages": [{"role": "system", "content": "Marv is a factual chatbot that is also sarcastic."}, 
              {"role": "user", "content": "What's the capital of France?"}, 
              {"role": "assistant", "content": "Paris, as if everyone doesn't know that already."}]}
```
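
A minimal sketch of how one record in this format can be built and serialized as a single JSONL line. The helper name `make_record` is hypothetical and does not appear elsewhere in this notebook.

```python
import json

def make_record(system: str, user: str, assistant: str) -> dict:
    """Build one chat-format training example (hypothetical helper)."""
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }

record = make_record(
    "Marv is a factual chatbot that is also sarcastic.",
    "What's the capital of France?",
    "Paris, as if everyone doesn't know that already.",
)

# json.dumps escapes newlines inside strings, so the record stays on one line
line = json.dumps(record)
roles = [m["role"] for m in json.loads(line)["messages"]]
print(roles)
```

Each line of the training file is one such object, so a file with N examples has N lines.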


In [826]:
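propertiesDict = defaultdict(list)  # feature -> list of property names
questionsDict = {}  # feature -> question template (a defaultdict without a factory behaves like a plain dict)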

sentenceFeaturesDict = defaultdict(list)
sentenceFeaturesList = []

with open('../APP/features-sentence.json') as f:
    sentenceFeaturesList = json.load(f)

    for featureDict in sentenceFeaturesList:
        sentenceFeaturesDict[featureDict['key']] = {val:desc for val, desc in zip(featureDict['values'],featureDict['descriptions'])}
        questionsDict[featureDict['key']] = featureDict['question']
        propertiesDict[featureDict['key']] = featureDict['values']

wordFeaturesDict = defaultdict(list)
wordFeaturesList = []

with open('../APP/features-word.json') as f:
    wordFeaturesList = json.load(f)
    for featureDict in wordFeaturesList:
        wordFeaturesDict[featureDict['key']] = {val:desc for val, desc in zip(featureDict['values'],featureDict['descriptions'])}
        questionsDict[featureDict['key']] = featureDict['question']
        propertiesDict[featureDict['key']] = featureDict['values'] # dictionary of features and their list of properties

featuresDict = sentenceFeaturesDict | wordFeaturesDict

In [827]:
count=0
for k, v in propertiesDict.items():
    count += len(v)
print(count*2) # number of examples for finetuning

258


In [835]:
questionsDict['Mood']

'Does the following sentence utilize the _PROP_ mood?'

In [830]:
featuresDict['Mood']

{'indicative': "denotes a mood of verbs expressing a simple statement of a fact. Example: 'Conservative Ben Shapiro told Fox news the verdict was horrifying.'",
 'subjunctive': "denotes a mood of verbs expressing what is imagined or hypothetical or possible. Example: 'They would be immediate targets should the US and South Korea attack the north.'",
 'exclamatory': "denotes a mood of verbs expressing surprise, strong emotion, or pain. Example: 'He generated her in His Blood and continually revives her with His Spirit!'",
 'interrogative': "denotes a mood of verbs expressing questions. Example: 'Peradventure there be fifty righteous within the city: wilt thou also destroy and not spare the place for the fifty righteous that are therein?'",
 'imperative': "denotes a mood of verbs expressing commands or directives. Example: 'Women everywhere should be furious.'",
 'optative': "denotes a mood of verbs expressing wishes. Example: 'For their sake, lets hope it works better than this years fl

## System prompt

In [123]:
model_version = "gpt-3.5"
SYSTEM_MESSAGE = f"""You are ChatGPT, a large language model trained by OpenAI, based on the {model_version} architecture. 
You are also an expert linguist and grammarian specializing in news text. 
Format your response as a JSON object with 'Answer' and 'Explanation' as the keys. 
The value of 'Answer' should be either 'yes' or 'no'. Explain your choice in the 'Explanation'."""

## User prompt
```
In the context of '{feature}', '{prop}' {desc}
{quest}: {sentence}
```
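
The template above can be filled in with a small function, sketched here under a few assumptions: `build_user_prompt` is a hypothetical name, the description is a shortened version of the 'interrogative' entry shown earlier, and the sample sentence is made up for illustration.

```python
def build_user_prompt(feature, prop, desc, quest, sentence):
    """Fill the user prompt template for one feature/property/sentence triple."""
    # The question template carries a _PROP_ placeholder for the property name
    question = quest.replace("_PROP_", prop)
    return f"In the context of '{feature}', '{prop}' {desc}\n{question}: {sentence}"

prompt = build_user_prompt(
    feature="Mood",
    prop="interrogative",
    desc="denotes a mood of verbs expressing questions.",  # shortened description
    quest="Does the following sentence utilize the _PROP_ mood?",
    sentence="'Is this a question?'",  # illustrative sentence, not from the data
)
print(prompt)
```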

# FT-1

In [None]:
data_path = "data/V3/_gpt4/"

In [836]:
FEATURE = "Mood"
PROPS = propertiesDict[FEATURE]
PROPS

['indicative',
 'subjunctive',
 'exclamatory',
 'interrogative',
 'imperative',
 'optative']

In [547]:
PROP = 'allusions'
SID = 3990

In [548]:
df = pd.read_csv(data_path+"V3_"+FEATURE.replace(" ","_")+".csv")
df = df[df["sentence_id"]==SID]
df = df[df['property_gpt4']==PROP]
df

Unnamed: 0,sentence_id,technique,text,feature_id,props_a20,props_a21,props_a22,annotator_consistency,props_gpt4_majority,res_1.0_1,...,gpt_props_0.2_3,gpt3.5_0.2_consistency,gpt3.5_0.2_majority,humans isCorrect,gpt isCorrect,comments,ground truth,property_gpt4,res_gpt4_0.0_V3_2,property_gpt4_0.0_V3_2
98,3990,[13],"Last January, parents filed a federal complain...",Language_varieties,"['correctness', 'clarity', 'middle']","['correctness', 'clarity', 'middle']","['correctness', 'clarity', 'middle']",True,"['correctness', 'middle']","{\n ""Properties"": [""correctness"", ""clarity"", ...",...,"['clarity', 'appropriateness']",0,"['correctness', 'clarity', 'appropriateness']",1,0.0,,"['correctness', 'clarity','middle','register s...",allusions,"{\n ""Answer"": ""no"",\n ""Explanation"": ""The se...",[]


In [549]:
sentence = df['text'].iloc[0]
print(sentence)

Last January, parents filed a federal complaint against Chatham Middle School in Chatham, New Jersey, for forcing students to watch videos that proselytized for Islam.


In [550]:
ASSISTANT_MESSAGE = df['res_gpt4_0.0_V3_2'].iloc[0]

In [551]:
print(json.dumps(json.loads(ASSISTANT_MESSAGE)))

{"Answer": "no", "Explanation": "The sentence provided does not contain an 'allusion'. An allusion is a figure of speech that references a person, place, thing, or event. These references can be direct or indirect and often rely on the audience's knowledge of the reference to understand its meaning. The sentence in question is a straightforward statement about an event that occurred and does not reference another context, phrase, or cultural knowledge that would require the reader to understand an implied meaning beyond what is explicitly stated."}


In [552]:
USER_MESSAGE = f"""In the context of '{FEATURE}', '{PROP}' {featuresDict[FEATURE][PROP]}
{questionsDict[FEATURE].replace('_PROP_', PROP)} : '{sentence}'
"""

In [553]:
print(USER_MESSAGE)

In the context of 'Language varieties', 'allusions' is the importation of a phrase from another context without explictly naming the other context - intertextuality (it requires cultural knowledge). Examples: 'Cage against the machine' as used in a headline about Nicalas Cage alluding to the band name 'Rage against the machine'
Does the following sentence demonstrate or contain 'allusions'? : 'Last January, parents filed a federal complaint against Chatham Middle School in Chatham, New Jersey, for forcing students to watch videos that proselytized for Islam.'



In [554]:
datum = {'messages': [{'role': 'system', 'content': SYSTEM_MESSAGE}, 
                      {'role': 'user', 'content': USER_MESSAGE}, 
                      {'role': 'assistant', 'content': json.dumps(json.loads(ASSISTANT_MESSAGE))}]}

In [555]:
s = json.dumps(datum)
print(s)

{"messages": [{"role": "system", "content": "You are ChatGPT, a large language model trained by OpenAI, based on the gpt-3.5 architecture. \nYou are also an expert linguist and grammarian specializing in news text. \nFormat your response as a JSON object with 'Answer' and 'Explanation' as the keys. \nThe value of 'Answer' should be either 'yes' or 'no'. Explain your choice in the 'Explanation'."}, {"role": "user", "content": "In the context of 'Language varieties', 'allusions' is the importation of a phrase from another context without explictly naming the other context - intertextuality (it requires cultural knowledge). Examples: 'Cage against the machine' as used in a headline about Nicalas Cage alluding to the band name 'Rage against the machine'\nDoes the following sentence demonstrate or contain 'allusions'? : 'Last January, parents filed a federal complaint against Chatham Middle School in Chatham, New Jersey, for forcing students to watch videos that proselytized for Islam.'\n"}

# NOT Captured

* Aspect - perfect progressive
* Figures of argument - antimetabole, abduction, deduction, induction
* Figures of word choice - metaplasms, polyptoton, ploce, anatanaclasis, synonyms, rhetorical conditional
* Language varieties - low, high, dialects/registers, register shift, maxims/proverbs, allusions

# FT-25
* Collect 25 positive and 25 negative examples for each feature/property pair. The response is yes/no, with no explanation.
* Use only sentences that are not in the validation data sets.
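
A minimal sketch of drawing candidate sentences outside the validation set, using toy data in place of the CSV files. Random sampling is an assumption here: the cells below pick sentence ids by hand and assign the label manually, so a sample like this only produces candidates to review. `N_PER_LABEL`, `all_df` and `picked` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for the feature CSV and the validation ids
all_df = pd.DataFrame({
    "sentence_id": [1, 2, 3, 4, 5, 6],
    "text": [f"sentence {i}" for i in range(1, 7)],
})
val_ids = [2, 5]

N_PER_LABEL = 25  # target number of examples per label in FT-25

# Keep only sentences that are not in the validation set
candidates = all_df[~all_df["sentence_id"].isin(val_ids)]

# Cap the sample size at the pool size so sampling cannot fail on small pools
picked = candidates.sample(n=min(N_PER_LABEL, len(candidates)), random_state=0)
print(sorted(picked["sentence_id"].tolist()))
```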

In [648]:
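model_version = "gpt-3.5"
# System prompt as it appears in the generated records shown below
SYSTEM_MESSAGE = """You are an expert linguist and grammarian specializing in news text. 
Format your response as a JSON object with 'Answer' as the key. 
The value of 'Answer' should be either 'yes' or 'no'."""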


In [649]:
FEATURE = "Aspect"

df = pd.read_csv("data/all_features_and_gpt/"+FEATURE+".csv")
val_df = pd.read_csv(data_path+"V3_"+FEATURE.replace(" ","_")+".csv")
val_ids = val_df['sentence_id'].unique()

df = df[~df['sentence_id'].isin(val_ids)]
df.to_csv("data/all_features_and_gpt/"+FEATURE+"_filtered.csv")

In [650]:
PROPS = propertiesDict[FEATURE]
PROPS

['simple', 'perfect', 'progressive', 'perfect progressive']

In [790]:
PROP = 'perfect progressive'
answer = 'no'

In [791]:
ids =[143,11]

In [792]:
for sid in ids:  # named sid to avoid shadowing the built-in id()
    sentence = df[df['sentence_id']==sid]['text'].iloc[0]
    
    USER_MESSAGE = f"""In the context of '{FEATURE}', '{PROP}' {featuresDict[FEATURE][PROP]}
    {questionsDict[FEATURE].replace('_PROP_', PROP)} : '{sentence}'
    """
    
    ASSISTANT_MESSAGE = {"Answer":answer}
    
    datum = {'messages': [{'role': 'system', 'content': SYSTEM_MESSAGE}, 
                          {'role': 'user', 'content': USER_MESSAGE}, 
                          {'role': 'assistant', 'content': json.dumps(ASSISTANT_MESSAGE)}]}
    
    s = json.dumps(datum)
    print(s)

{"messages": [{"role": "system", "content": "You are an expert linguist and grammarian specializing in news text. \nFormat your response as a JSON object with 'Answer' as the key. \nThe value of 'Answer' should be either 'yes' or 'no'."}, {"role": "user", "content": "In the context of 'Aspect', 'perfect progressive' is expressing the end of an ongoing action. Example: 'Rover has been eating a bone.'\n    Does the following sentence utilize the perfect progressive aspect? : 'Sagacious gun owners have always known that the ultimate goal of gun control extremists such as Chuck Schumer, Nancy Pelosi, Dianne Feinstein, et al., has always been gun confiscation.'\n    "}, {"role": "assistant", "content": "{\"Answer\": \"no\"}"}]}
{"messages": [{"role": "system", "content": "You are an expert linguist and grammarian specializing in news text. \nFormat your response as a JSON object with 'Answer' as the key. \nThe value of 'Answer' should be either 'yes' or 'no'."}, {"role": "user", "content": 

# FT-Maj
## Steps to create data from noisy human labels

* Filter out validation sentences.
* Convert each annotator's labels to a list and concatenate the lists.
* Positive examples:
    * Build a counter over the concatenated list and keep the labels chosen by a majority of annotators.
* Negative examples:
    * Find rows where the given label does not appear in any annotator's labels (the concatenated list from above).
* Append each example to the output file.
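
The labeling rule in the steps above can be sketched on toy data as follows. `label_for` and `min_votes` are hypothetical names, and `min_votes=2` assumes three annotators. Unlike the notebook's `concat`, this sketch counts each label at most once per annotator, which is an added assumption.

```python
from collections import Counter

def label_for(prop, annotations, min_votes=2):
    """Return 'yes', 'no', or None (ambiguous) for one sentence.

    annotations holds one list of property labels per annotator.
    """
    # Count each label at most once per annotator
    votes = Counter(p for labels in annotations for p in set(labels))
    if votes[prop] >= min_votes:
        return "yes"  # a majority of annotators chose the label
    if votes[prop] == 0:
        return "no"   # no annotator chose the label
    return None       # a single vote: neither positive nor negative

anns = [["simple"], ["simple", "perfect"], ["simple"]]
print(label_for("simple", anns), label_for("perfect", anns), label_for("progressive", anns))
# prints: yes None no
```

Returning `None` for a single vote keeps ambiguous sentences out of both the positive and the negative set.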

In [798]:
# sorted(list(featuresDict.keys()))

In [1]:
features =  [
'Aspect',
'Emphasis',
'Figures of argument',
'Figures of word choice',
'Language varieties',
'Lexical and semantic fields',
'Modifying clauses',
'Mood',
'New words and changing uses',
'Parallelism',
'Phrases built on nouns',
'Phrases built on verbs',
'Predication',
'Sentence architecture',
'Series',
'Subject choices',
'Tense',
'Tropes',
'Verb choices'
]


In [838]:
import ast
from collections import Counter

def concat(a20, a21, a22):
    """Count how often each label appears across the three annotators."""
    _all = []
    for raw in (a20, a21, a22):
        # Cells hold stringified lists; non-string cells (e.g. missing values) count as empty
        if isinstance(raw, str):
            _all.extend(ast.literal_eval(raw))

    return Counter(_all)

    

In [839]:
def write_datum(FEATURE, PROP, sentence, answer):
    """Return one chat-format training example as a JSON string."""
    SYSTEM_MESSAGE = "You are an expert linguist and grammarian specializing in news text."
    
    USER_MESSAGE = f"""In the context of grammar '{PROP}' '{FEATURE}' {featuresDict[FEATURE][PROP]}
    {questionsDict[FEATURE].replace('_PROP_', PROP)} : '{sentence}'
    """
    
    # The label is passed in explicitly instead of being read from a global
    ASSISTANT_MESSAGE = {"Answer": answer}
    
    datum = {'messages': [{'role': 'system', 'content': SYSTEM_MESSAGE}, 
                          {'role': 'user', 'content': USER_MESSAGE}, 
                          {'role': 'assistant', 'content': json.dumps(ASSISTANT_MESSAGE)}]}
    
    return json.dumps(datum)

In [840]:
for FEATURE in features:
    feature_file = FEATURE.replace(" ", "_")
    df = pd.read_csv("data/all_features_and_gpt/" + feature_file + ".csv")
    val_df = pd.read_csv(data_path + "V3_" + feature_file + ".csv")
    val_ids = val_df['sentence_id'].unique()

    # Drop sentences that appear in the validation set
    df = df[~df['sentence_id'].isin(val_ids)]
    # Save to file for later if needed
    df.to_csv("data/all_features_and_gpt/" + feature_file + "_filtered.csv")

    df['all_human_anns'] = df.apply(lambda x: concat(x['props_a20'], x['props_a21'], x['props_a22']), axis=1)

    PROPS = propertiesDict[FEATURE]

    # Open the output file once per feature; append mode keeps earlier records
    with open("fine-tuning/" + feature_file + "_FT_data.jsonl", 'a') as file1:
        for PROP in PROPS:
            for _, row in df.iterrows():
                sentence = row['text']
                votes = row['all_human_anns'][PROP]  # a Counter returns 0 for absent labels
                if votes > 1:
                    answer = 'yes'  # chosen by a majority of the three annotators
                elif votes == 0:
                    answer = 'no'   # chosen by no annotator
                else:
                    # A single vote is neither positive nor negative; skip the row
                    # rather than reuse the answer left over from the previous row
                    continue

                datum = write_datum(FEATURE, PROP, sentence, answer)
                file1.write("\n" + datum)

NOTE: Every record is written with a leading `\n`, so each file starts with an empty line. Remove that first line with awk:

```
cd fine-tuning

for FILE in *.jsonl; do
	awk 'NR>1' "$FILE" > tmp.jsonl && mv tmp.jsonl "$FILE"
done
```
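
As an alternative to the awk cleanup, a writer can put the newline after each record, so no empty first line is ever created. This is a sketch only: `append_jsonl` is a hypothetical helper, and the demo writes to a temporary directory rather than to `fine-tuning/`.

```python
import json
import os
import tempfile

def append_jsonl(path, record):
    """Append one record followed by a newline (no leading blank line)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

path = os.path.join(tempfile.mkdtemp(), "demo_FT_data.jsonl")
append_jsonl(path, {"messages": []})
append_jsonl(path, {"messages": []})

# Every line should parse as JSON, including the first one
with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines), all(json.loads(l) == {"messages": []} for l in lines))
```

Files written this way should not be passed through the awk loop above, which would drop the first real record.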