# 트랜잭션에 대한 다중 클래스 분류

이 노트북에서는 트랜잭션의 공개 데이터 세트를 미리 정의한 여러 범주로 분류하려고 합니다. 이러한 접근 방식은 트랜잭션 데이터를 미리 정의된 범주에 맞추려는 모든 다중 클래스 분류 사용 사례에 복제할 수 있어야 하며, 이 과정을 마치면 레이블이 있는 데이터 세트와 레이블이 없는 데이터 세트를 모두 처리할 수 있는 몇 가지 접근 방식을 알게 될 것입니다.

이 노트북에서 다룰 다양한 접근 방식은 다음과 같습니다:
- **제로 샷 분류:** 먼저 안내 프롬프트만 사용하여 5개의 명명된 버킷 중 하나에 트랜잭션을 넣는 제로 샷 분류를 해보겠습니다.
- **임베딩을 사용한 분류:** 그 다음에는 레이블이 지정된 데이터 세트에 임베딩을 생성한 다음, 전통적인 분류 모델을 사용해 카테고리를 식별하는 데 있어 임베딩이 얼마나 효과적인지 테스트합니다.
- **미세 조정된 분류:** 마지막으로 라벨이 지정된 데이터 세트에 대해 학습된 미세 조정된 모델을 생성하여 제로 샷 및 소수 샷 분류 접근 방식과 비교합니다.

-------------------

For this notebook we will be looking to classify a public dataset of transactions into a number of categories that we have predefined. These approaches should be replicable to any multiclass classificaiton use case where we are trying to fit transactional data into predefined categories, and by the end of running through this you should have a few approaches for dealing with both labelled and unlabelled datasets.

The different approaches we'll be taking in this notebook are:
- **Zero-shot Classification:** First we'll do zero shot classification to put transactions in one of five named buckets using only a prompt for guidance
- **Classification with Embeddings:** Following this we'll create embeddings on a labelled dataset, and then use a traditional classification model to test their effectiveness at identifying our categories
- **Fine-tuned Classification:** Lastly we'll produce a fine-tuned model trained on our labelled dataset to see how this compares to the zero-shot and few-shot classification approaches

## Setup

In [1]:
%load_ext autoreload
%autoreload 

In [2]:
import openai
import pandas as pd
import numpy as np
import json
import os

openai.api_key = os.getenv("OPENAI_API_KEY")
COMPLETIONS_MODEL = "text-davinci-002"

### 데이터 집합 로드

스코틀랜드 도서관에서 25,000파운드 이상의 공공 거래 데이터 집합을 사용하고 있습니다. 이 데이터 집합에는 세 가지 기능이 있습니다:
- 공급업체: 공급업체의 이름
- 설명: 거래에 대한 텍스트 설명
- Value: GBP로 표시된 거래 금액

**출처**:

https://data.nls.uk/data/organisational-data/transactions-over-25k/

-----------

We're using a public transaction dataset of transactions over £25k for the Library of Scotland. The dataset has three features that we'll be using:
- Supplier: The name of the supplier
- Description: A text description of the transaction
- Value: The value of the transaction in GBP

**Source**:

https://data.nls.uk/data/organisational-data/transactions-over-25k/

In [3]:
transactions = pd.read_csv('./data/25000_spend_dataset_current.csv', encoding= 'unicode_escape')
len(transactions)

359

In [4]:
transactions.head()

Unnamed: 0,Date,Supplier,Description,Transaction value (£)
0,21/04/2016,M & J Ballantyne Ltd,George IV Bridge Work,35098.0
1,26/04/2016,Private Sale,Literary & Archival Items,30000.0
2,30/04/2016,City Of Edinburgh Council,Non Domestic Rates,40800.0
3,09/05/2016,Computacenter Uk,Kelvin Hall,72835.0
4,09/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,64361.0


In [26]:
def request_completion(prompt):
    
    completion_response =   openai.Completion.create(
                            prompt=prompt,
                            temperature=0,
                            max_tokens=5,
                            top_p=1,
                            frequency_penalty=0,
                            presence_penalty=0,
                            model=COMPLETIONS_MODEL
                            )
        
    return completion_response

def classify_transaction(transaction,prompt):
    
    prompt = prompt.replace('SUPPLIER_NAME',transaction['Supplier'])
    prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])
    prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))
    
    classification = request_completion(prompt)['choices'][0]['text'].replace('\n','')
    
    return classification

# 이 함수는 Finetuning API의 prepare_data 함수에서 트레이닝 및 검증 출력을 가져와 다음과 같이 확인합니다.
# 각각 동일한 수의 클래스가 있는지 확인합니다.
# 클래스 수가 같지 않으면 미세 조정이 실패하고 오류를 반환합니다.

# This function takes your training and validation outputs from the prepare_data function of the Finetuning API, and
# confirms that each have the same number of classes.
# If they do not have the same number of classes the fine-tune will fail and return an error

def check_finetune_classes(train_file,valid_file):

    train_classes = set()
    valid_classes = set()
    with open(train_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))

    for json_str in json_list:
        result = json.loads(json_str)
        train_classes.add(result['completion'])
        #print(f"result: {result['completion']}")
        #print(isinstance(result, dict))

    with open(valid_file, 'r') as json_file:
        json_list = list(json_file)
        print(len(json_list))

    for json_str in json_list:
        result = json.loads(json_str)
        valid_classes.add(result['completion'])
        #print(f"result: {result['completion']}")
        #print(isinstance(result, dict))
        
    if len(train_classes) == len(valid_classes):
        print('All good')
        
    else:
        print('Classes do not match, please prepare data again')

## 제로 샷 분류

먼저 간단한 프롬프트를 사용하여 이러한 트랜잭션을 분류하는 기본 모델의 성능을 평가해 보겠습니다. 모델에 5개의 카테고리를 제공하고 분류할 수 없는 카테고리에 대해서는 '분류할 수 없음'이라는 포괄적인 항목을 표시합니다.

---------------------

We'll first assess the performance of the base models at classifying these transactions using a simple prompt. We'll provide the model with 5 categories and a catch-all of "Could not classify" for ones that it cannot place.

In [6]:
zero_shot_prompt = '''You are a data expert working for the National Library of Scotland. 
You are analysing all transactions over £25,000 in value and classifying them into one of five categories.
The five categories are Building Improvement, Literature & Archive, Utility Bills, Professional Services and Software/IT.
If you can't tell what it is, say Could not classify
                      
Transaction:
                      
Supplier: SUPPLIER_NAME
Description: DESCRIPTION_TEXT
Value: TRANSACTION_VALUE
                      
The classification is:'''

In [7]:
# Get a test transaction
transaction = transactions.iloc[0]

# Interpolate the values into the prompt
prompt = zero_shot_prompt.replace('SUPPLIER_NAME',transaction['Supplier'])
prompt = prompt.replace('DESCRIPTION_TEXT',transaction['Description'])
prompt = prompt.replace('TRANSACTION_VALUE',str(transaction['Transaction value (£)']))

# Use our completion function to return a prediction
completion_response = request_completion(prompt)
print(completion_response['choices'][0]['text'])

 Building Improvement


첫 번째 시도는 정확합니다. M & J Ballantyne Ltd는 주택 건설업체이며 그들이 수행한 작업은 실제로 건물 개선입니다.

샘플 크기를 25개로 확장하고 간단한 안내 메시지를 통해 다시 한 번 성능을 확인해 보겠습니다.

-----------

Our first attempt is correct, M & J Ballantyne Ltd are a house builder and the work they performed is indeed Building Improvement.

Lets expand the sample size to 25 and see how it performs, again with just a simple prompt to guide it

In [8]:
test_transactions = transactions.iloc[:25]
test_transactions['Classification'] = test_transactions.apply(lambda x: classify_transaction(x,zero_shot_prompt),axis=1)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_transactions['Classification'] = test_transactions.apply(lambda x: classify_transaction(x,zero_shot_prompt),axis=1)


In [9]:
test_transactions['Classification'].value_counts()

 Building Improvement     10
 Professional Services     8
 Software/IT               4
 Literature & Archive      3
Name: Classification, dtype: int64

In [10]:
test_transactions.head(25)

Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification
0,21/04/2016,M & J Ballantyne Ltd,George IV Bridge Work,35098.0,Building Improvement
1,26/04/2016,Private Sale,Literary & Archival Items,30000.0,Literature & Archive
2,30/04/2016,City Of Edinburgh Council,Non Domestic Rates,40800.0,Building Improvement
3,09/05/2016,Computacenter Uk,Kelvin Hall,72835.0,Software/IT
4,09/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,64361.0,Building Improvement
5,09/05/2016,A McGillivray,Causewayside Refurbishment,53690.0,Building Improvement
6,16/05/2016,John Graham Construction Ltd,Causewayside Refurbishment,365344.0,Building Improvement
7,23/05/2016,Computacenter Uk,Kelvin Hall,26506.0,Software/IT
8,23/05/2016,ECG Facilities Service,Facilities Management Charge,32777.0,Professional Services
9,23/05/2016,ECG Facilities Service,Facilities Management Charge,32777.0,Professional Services


레이블이 지정된 예가 없어도 초기 결과는 꽤 괜찮습니다! 분류할 수 없었던 사례들은 주제에 대한 단서가 거의 없는 어려운 사례들이었지만, 레이블이 지정된 데이터 세트를 정리하여 더 많은 예제를 제공하면 더 나은 성능을 얻을 수 있을 것입니다.

------------------------------

Initial results are pretty good even with no labelled examples! The ones that it could not classify were tougher cases with few clues as to their topic, but maybe if we clean up the labelled dataset to give more examples we can get better performance.

## 임베딩을 사용한 분류

지금까지 분류한 작은 세트에서 임베딩을 생성해 보겠습니다. 데이터 세트의 101개 트랜잭션에 대해 제로 샷 분류기를 실행하고 15개의 **분류할 수 없음** 결과를 수동으로 수정하여 레이블이 지정된 예제 세트를 만들었습니다.

---------------

Lets create embeddings from the small set that we've classified so far - we've made a set of labelled examples by running the zero-shot classifier on 101 transactions from our dataset and manually correcting the 15 **Could not classify** results that we got

### 임베딩 생성

이 초기 섹션에서는 모든 기능을 연결하는 결합된 필드에서 임베딩을 생성하기 위해 [데이터 세트 가져오기 노트북](Obtain_dataset.ipynb)의 접근 방식을 재사용합니다.

---------------

This initial section reuses the approach from the [Obtain_dataset Notebook](Obtain_dataset.ipynb) to create embeddings from a combined field concatenating all of our features

In [11]:
df = pd.read_csv('./data/labelled_transactions.csv')
df.head()

Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification
0,15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other
1,29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806,Building Improvement
2,29/05/2017,Morris & Spottiswood Ltd,George IV Bridge Work,56448,Building Improvement
3,31/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,164691,Building Improvement
4,24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,27926,Building Improvement


In [12]:
df['combined'] = "Supplier: " + df['Supplier'].str.strip() + "; Description: " + df['Description'].str.strip() + "; Value: " + str(df['Transaction value (£)']).strip()
df.head(2)

Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification,combined
0,15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other,Supplier: Creative Video Productions Ltd; Desc...
1,29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806,Building Improvement,Supplier: John Graham Construction Ltd; Descri...


In [13]:
from transformers import GPT2TokenizerFast
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

df['n_tokens'] = df.combined.apply(lambda x: len(tokenizer.encode(x)))
len(df)

  from .autonotebook import tqdm as notebook_tqdm


101

In [14]:
embedding_path = './data/transactions_with_embeddings_100.csv'

In [15]:
from openai.embeddings_utils import get_embedding

df['babbage_similarity'] = df.combined.apply(lambda x: get_embedding(x, engine='text-similarity-babbage-001'))
df['babbage_search'] = df.combined.apply(lambda x: get_embedding(x, engine='text-search-babbage-doc-001'))
df.to_csv(embedding_path)

### 분류에 임베딩 사용

이제 임베딩을 만들었으니 이름을 지정한 카테고리로 분류했을 때 어떤 효과가 있는지 살펴봅시다.

이를 위해 [Classification_using_embeddings](Classification_using_embeddings.ipynb) 노트북의 템플릿을 사용하겠습니다.

-------------

Now that we have our embeddings, let see if classifying these into the categories we've named gives us any more success.

For this we'll use a template from the [Classification_using_embeddings](Classification_using_embeddings.ipynb) notebook

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

fs_df = pd.read_csv(embedding_path)
fs_df["babbage_similarity"] = fs_df.babbage_similarity.apply(eval).apply(np.array)
fs_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification,combined,n_tokens,babbage_similarity,babbage_search
0,0,15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other,Supplier: Creative Video Productions Ltd; Desc...,136,"[-0.009802100248634815, 0.022551486268639565, ...","[-0.00232666521333158, 0.019198870286345482, 0..."
1,1,29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."
2,2,29/05/2017,Morris & Spottiswood Ltd,George IV Bridge Work,56448,Building Improvement,Supplier: Morris & Spottiswood Ltd; Descriptio...,141,"[-0.009000026620924473, 0.02405017428100109, -...","[0.0028343256562948227, 0.021166473627090454, ..."
3,3,31/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,164691,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."
4,4,24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,27926,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."


In [17]:
X_train, X_test, y_train, y_test = train_test_split(
    list(fs_df.babbage_similarity.values), fs_df.Classification, test_size=0.2, random_state=42
)

clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
preds = clf.predict(X_test)
probas = clf.predict_proba(X_test)

report = classification_report(y_test, preds)
print(report)

                      precision    recall  f1-score   support

Building Improvement       0.92      1.00      0.96        11
Literature & Archive       1.00      1.00      1.00         3
               Other       0.00      0.00      0.00         1
         Software/IT       1.00      1.00      1.00         1
       Utility Bills       1.00      1.00      1.00         5

            accuracy                           0.95        21
           macro avg       0.78      0.80      0.79        21
        weighted avg       0.91      0.95      0.93        21



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


이 모델의 성능은 매우 강력하므로 임베딩을 생성하고 더 간단한 분류기를 사용하는 것도 효과적인 접근 방식처럼 보이며, 제로 샷 분류기는 라벨이 없는 데이터 세트의 초기 분류를 수행하는 데 도움이 됩니다.

한 걸음 더 나아가, 동일한 레이블이 지정된 데이터 세트에 대해 학습된 미세 조정된 모델이 비슷한 결과를 제공하는지 살펴봅시다.

-----------------

Performance for this model is pretty strong, so creating embeddings and using even a simpler classifier looks like an effective approach as well, with the zero-shot classifier helping us do the initial classification of the unlabelled dataset.

Lets take it one step further and see if a fine-tuned model trained on this same labelled datasets gives us comparable results

## 미세 조정된 트랜잭션 분류

이 사용 사례에서는 동일한 레이블이 지정된 101개의 트랜잭션 집합에 대해 미세 조정된 모델을 학습하고 이 미세 조정된 모델을 보이지 않는 트랜잭션 그룹에 적용하여 위의 몇 가지 샷 분류를 개선해 보겠습니다.

-------------------------------

For this use case we're going to try to improve on the few-shot classification from above by training a fine-tuned model on the same labelled set of 101 transactions and applying this fine-tuned model on group of unseen transactions

### 미세 조정된 분류기 구축하기

먼저 데이터를 준비하기 위해 몇 가지 데이터 준비 작업을 수행해야 합니다. 이 작업은 다음 단계를 거치게 됩니다:
- 먼저 클래스를 나열하고 숫자 식별자로 대체합니다. '건물 개선'과 같이 여러 개의 연속된 토큰이 아닌 단일 토큰을 예측하도록 모델을 만들면 더 나은 결과를 얻을 수 있습니다.
- 또한 각 예제에 공통 접두사와 접미사를 추가하여 모델이 예측할 수 있도록 지원해야 합니다. 이 경우 텍스트가 이미 'Supplier'로 시작하므로 접미사 '\n\n###\n\n'을 추가하겠습니다.
- 마지막으로 분류를 위해 각 대상 클래스에 선행 공백을 추가하여 모델을 다시 한 번 지원합니다.

--------------

We'll need to do some data prep first to get our data ready. This will take the following steps:
- First we'll list out our classes and replace them with numeric identifiers. Making the model predict a single token rather than multiple consecutive ones like 'Building Improvement' should give us better results
- We also need to add a common prefix and suffix to each example to aid the model in making predictions - in our case our text is already started with 'Supplier' and we'll add a suffix of '\n\n###\n\n'
- Lastly we'll aid a leading whitespace onto each of our target classes for classification, again to aid the model

In [18]:
ft_prep_df = fs_df.copy()
len(ft_prep_df)

101

In [19]:
ft_prep_df.head()

Unnamed: 0.1,Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification,combined,n_tokens,babbage_similarity,babbage_search
0,0,15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other,Supplier: Creative Video Productions Ltd; Desc...,136,"[-0.009802100248634815, 0.022551486268639565, ...","[-0.00232666521333158, 0.019198870286345482, 0..."
1,1,29/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,74806,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."
2,2,29/05/2017,Morris & Spottiswood Ltd,George IV Bridge Work,56448,Building Improvement,Supplier: Morris & Spottiswood Ltd; Descriptio...,141,"[-0.009000026620924473, 0.02405017428100109, -...","[0.0028343256562948227, 0.021166473627090454, ..."
3,3,31/05/2017,John Graham Construction Ltd,Causewayside Refurbishment,164691,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."
4,4,24/07/2017,John Graham Construction Ltd,Causewayside Refurbishment,27926,Building Improvement,Supplier: John Graham Construction Ltd; Descri...,140,"[-0.009065819904208183, 0.012094118632376194, ...","[0.005169447045773268, 0.00473341578617692, -0..."


In [20]:
classes = list(set(ft_prep_df['Classification']))
class_df = pd.DataFrame(classes).reset_index()
class_df.columns = ['class_id','class']
class_df  , len(class_df)

(   class_id                 class
 0         0                 Other
 1         1  Building Improvement
 2         2           Software/IT
 3         3         Utility Bills
 4         4  Literature & Archive,
 5)

In [24]:
ft_df_with_class = ft_prep_df.merge(class_df,left_on='Classification',right_on='class',how='inner')

# 모델을 돕기 위해 각 완성품에 선행 공백을 추가합니다.
# Adding a leading whitespace onto each completion to help the model
ft_df_with_class['class_id'] = ft_df_with_class.apply(lambda x: ' ' + str(x['class_id']),axis=1)
ft_df_with_class = ft_df_with_class.drop('class', axis=1)

# 프롬프트가 종료되는 시점을 모델이 알 수 있도록 각 프롬프트 끝에 공통 구분 기호를 추가합니다.
# Adding a common separator onto the end of each prompt so the model knows when a prompt is terminating
ft_df_with_class['prompt'] = ft_df_with_class.apply(lambda x: x['combined'] + '\n\n###\n\n',axis=1)
ft_df_with_class.head()

Unnamed: 0.1,Unnamed: 0,Date,Supplier,Description,Transaction value (£),Classification,combined,n_tokens,babbage_similarity,babbage_search,class_id,prompt
0,0,15/08/2016,Creative Video Productions Ltd,Kelvin Hall,26866,Other,Supplier: Creative Video Productions Ltd; Desc...,136,"[-0.009802100248634815, 0.022551486268639565, ...","[-0.00232666521333158, 0.019198870286345482, 0...",0,Supplier: Creative Video Productions Ltd; Desc...
1,51,31/03/2017,NLS Foundation,Grant Payment,177500,Other,Supplier: NLS Foundation; Description: Grant P...,135,"[-0.015305811539292336, 0.022675275802612305, ...","[-0.006104097235947847, 0.020085038617253304, ...",0,Supplier: NLS Foundation; Description: Grant P...
2,70,26/06/2017,British Library,Legal Deposit Services,50056,Other,Supplier: British Library; Description: Legal ...,135,"[-0.015445035882294178, 0.027791442349553108, ...","[-0.01456734724342823, 0.03029645048081875, -0...",0,Supplier: British Library; Description: Legal ...
3,71,24/07/2017,ALDL,Legal Deposit Services,27067,Other,Supplier: ALDL; Description: Legal Deposit Ser...,135,"[-0.011744093149900436, 0.01669803448021412, -...","[-0.008064398542046547, 0.012981051579117775, ...",0,Supplier: ALDL; Description: Legal Deposit Ser...
4,100,24/07/2017,AM Phillip,Vehicle Purchase,26604,Other,Supplier: AM Phillip; Description: Vehicle Pur...,134,"[-0.011187470518052578, 0.01638782024383545, -...","[0.003970459569245577, 0.013751459307968616, -...",0,Supplier: AM Phillip; Description: Vehicle Pur...


In [27]:
# 각 클래스에 여러 개의 관측값이 있는 경우 이 단계는 필요하지 않습니다.
# 우리의 경우에는 그렇지 않으므로 데이터를 섞어 훈련 및 검증 세트에서 동일한 클래스를 얻을 수 있는 더 나은 기회를 제공합니다.
# 유효성 검사 집합에 클래스가 적으면 미세 조정된 모델에 오류가 발생하므로 이 단계는 필수입니다.

# This step is unnecessary if you have a number of observations in each class
# In our case we don't, so we shuffle the data to give us a better chance of getting equal classes in our train and validation sets
# Our fine-tuned model will error if we have less classes in the validation set, so this is a necessary step

import random 

labels = [x for x in ft_df_with_class['class_id']]
text = [x for x in ft_df_with_class['prompt']]
ft_df = pd.DataFrame(zip(text, labels), columns = ['prompt','class_id']) #[:300]
ft_df.columns = ['prompt','completion']
ft_df['ordering'] = ft_df.apply(lambda x: random.randint(0,len(ft_df)), axis = 1)
ft_df.set_index('ordering',inplace=True)
ft_df_sorted = ft_df.sort_index(ascending=True)
ft_df_sorted.head()

Unnamed: 0_level_0,prompt,completion
ordering,Unnamed: 1_level_1,Unnamed: 2_level_1
0,Supplier: John Graham Construction Ltd; Descri...,1
0,Supplier: John Graham Construction Ltd; Descri...,1
1,Supplier: ECG Facilities Service; Description:...,3
1,Supplier: Insight Direct (UK) Ltd; Description...,1
2,Supplier: John Graham Construction Ltd; Descri...,1


In [28]:
# 이 단계는 이 분류기에 대한 훈련/검증 세트를 이미 생성한 경우 기존 파일을 제거하는 단계입니다.
# This step is to remove any existing files if we've already produced training/validation sets for this classifier
#!rm transactions_grouped*

# 셔플된 데이터 프레임을 .jsonl 파일로 출력하고 prepare_data 함수를 실행하여 입력 파일을 가져옵니다.
# We output our shuffled dataframe to a .jsonl file and run the prepare_data function to get us our input files
ft_df_sorted.to_json("transactions_grouped.jsonl", orient='records', lines=True)
!openai tools fine_tunes.prepare_data -f transactions_grouped.jsonl -q

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Analyzing...

- Your file contains 101 prompt-completion pairs
- Based on your data it seems like you're trying to fine-tune a model for classification
- For classification, we recommend you try one of the faster and cheaper models, such as `ada`
- For classification, you can estimate the expected model performance by keeping a held out dataset, which is not used for training
- There are 62 duplicated prompt-completion sets. These are rows: [1, 4, 5, 8, 11, 12, 15, 17, 20, 21, 22, 25, 26, 27, 28, 29, 32, 33, 34, 35, 36, 37, 38, 40, 46, 51, 52, 54, 55, 57, 58, 59, 60, 62, 63, 64, 65, 66, 67, 69, 71, 72, 73, 74, 75, 76, 77, 79, 80, 83, 84, 86, 88, 89, 90, 91, 92, 93, 94, 96, 97, 98]
- All prompts end with suff

In [30]:
# 이 함수는 준비된 두 파일에 클래스가 모두 나타나는지 확인합니다.
# 그렇지 않으면 미세 조정된 모델 생성에 실패합니다.
# This functions checks that your classes all appear in both prepared files
# If they don't, the fine-tuned model creation will fail
check_classes('transactions_grouped_prepared_train.jsonl','transactions_grouped_prepared_valid.jsonl')

NameError: name 'check_classes' is not defined

In [31]:
# This step creates your model
!openai api fine_tunes.create -t "transactions_grouped_prepared_train.jsonl" -v "transactions_grouped_prepared_valid.jsonl" --compute_classification_metrics --classification_n_classes 5 -m curie

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Upload progress: 100%|████████████████████| 10.3k/10.3k [00:00<00:00, 7.23Mit/s]
Uploaded file from transactions_grouped_prepared_train.jsonl: file-Hk3CMmX8PL59zqvkWtY6gcVf
Upload progress: 100%|████████████████████| 2.67k/2.67k [00:00<00:00, 4.83Mit/s]
Uploaded file from transactions_grouped_prepared_valid.jsonl: file-zbFaBt9wcNvZlyOSQKltHRoq
Created fine-tune: ft-w0TiyIpkRULKFjzn6PPF99Gy
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-03-14 07:46:31] Created fine-tune: ft-w0TiyIpkRULKFjzn6PPF99Gy

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-w0TiyIpkRULKFjzn6PPF99Gy



In [32]:
# Congrats, you've got a fine-tuned model!
# Copy/paste the name provided into the variable below and we'll take it for a spin
fine_tuned_model = 'curie:ft-personal-2022-10-20-10-42-56'

### 미세 조정된 분류기 적용하기

이제 분류기를 적용하여 성능을 확인해 보겠습니다. 훈련 집합에는 31개의 고유한 관측값만 있고 검증 집합에는 8개가 있으므로 성능이 어떤지 살펴보겠습니다.

---------

Now we'll apply our classifier to see how it performs. We only had 31 unique observations in our training set and 8 in our validation set, so lets see how the performance is

In [33]:
test_set = pd.read_json('transactions_grouped_prepared_valid.jsonl', lines=True)
test_set.head()

Unnamed: 0,prompt,completion
0,Supplier: Flexiform; Description: Kelvin Hall;...,3
1,Supplier: M & J Ballantyne Ltd; Description: G...,3
2,Supplier: Wavetek Ltd; Description: Kelvin Hal...,3
3,Supplier: Glasgow City Council; Description: K...,3
4,Supplier: Morris & Spottiswood Ltd; Descriptio...,3


In [34]:
test_set['predicted_class'] = test_set.apply(lambda x: openai.Completion.create(model=fine_tuned_model, prompt=x['prompt'], max_tokens=1, temperature=0, logprobs=5),axis=1)
test_set['pred'] = test_set.apply(lambda x : x['predicted_class']['choices'][0]['text'],axis=1)

InvalidRequestError: That model does not exist

In [None]:
test_set['result'] = test_set.apply(lambda x: str(x['pred']).strip() == str(x['completion']).strip(), axis = 1)

In [None]:
test_set['result'].value_counts()

성능은 좋지 않습니다 - 안타깝게도 이는 예상된 결과입니다. 각 클래스의 예가 몇 개 밖에 없는 경우 임베딩과 기존 분류기를 사용한 위의 접근 방식이 더 효과적이었습니다.

미세 조정된 모델은 레이블이 지정된 관측값이 많을 때 가장 잘 작동합니다. 수백 개 또는 수천 개가 있다면 더 나은 결과를 얻을 수 있지만, 홀드아웃 집합에서 마지막으로 테스트를 수행하여 새로운 관찰 집합에 잘 일반화되지 않는지 확인해 보겠습니다.

-----------

Performance is not great - unfortunately this is expected. With only a few examples of each class, the above approach with embeddings and a traditional classifier worked better.

A fine-tuned model works best with a great number of labelled observations. If we had a few hundred or thousand we may get better results, but lets do one last test on a holdout set to confirm that it doesn't generalise well to a new set of observations

In [None]:
holdout_df = transactions.copy().iloc[101:]
holdout_df.head()

In [35]:
holdout_df['combined'] = "Supplier: " + holdout_df['Supplier'].str.strip() + "; Description: " + holdout_df['Description'].str.strip() + '\n\n###\n\n' # + "; Value: " + str(df['Transaction value (£)']).strip()
holdout_df['prediction_result'] = holdout_df.apply(lambda x: openai.Completion.create(model=fine_tuned_model, prompt=x['combined'], max_tokens=1, temperature=0, logprobs=5),axis=1)
holdout_df['pred'] = holdout_df.apply(lambda x : x['prediction_result']['choices'][0]['text'],axis=1)

NameError: name 'holdout_df' is not defined

In [36]:
holdout_df.head(10)

NameError: name 'holdout_df' is not defined

In [334]:
holdout_df['pred'].value_counts()

 2    231
 0     27
Name: pred, dtype: int64

그 결과도 마찬가지로 실망스러운 수준이었으며, 레이블이 지정된 관측값이 적은 데이터 세트의 경우 제로 샷 분류나 임베딩을 사용한 기존 분류가 미세 조정된 모델보다 더 나은 결과를 가져온다는 사실을 알게 되었습니다.

미세 조정 모델은 여전히 훌륭한 도구이지만, 분류하려는 각 클래스에 대해 레이블이 지정된 예시 수가 많을 때 더 효과적입니다.

------------------

Well those results were similarly underwhelming - so we've learned that with a dataset with a small number of labelled observations, either zero-shot classification or traditional classification with embeddings return better results than a fine-tuned model.

A fine-tuned model is still a great tool, but is more effective when you have a larger number of labelled examples for each class that you're looking to classify