## Text prediction
Natural language processing (NLP) is a field of machine learning that enables computers to understand, analyze, manipulate, and even generate human language.

In [1]:
%matplotlib inline

import numpy as np
import warnings
import matplotlib.pyplot as plt

warnings.filterwarnings('ignore')
np.random.seed(123)

## Sentiment Analysis
Binary classification

In [2]:
from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
subsample_size = 1000
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)

| | sentence | label |
|---|---|---|
| 43787 | very pleasing at its best moments | 1 |
| 16159 | , american chai is enough to make you put away... | 0 |
| 59015 | too much like an infomercial for ram dass 's l... | 0 |
| 5108 | a stirring visual sequence | 1 |
| 67052 | cool visual backmasking | 1 |
| 35938 | hard ground | 0 |
| 49879 | the striking , quietly vulnerable personality ... | 1 |
| 51591 | pan nalin 's exposition is beautiful and myste... | 1 |
| 56780 | wonderfully loopy | 1 |
| 28518 | most beautiful , evocative | 1 |


### Training the model

In [3]:
from autogluon.text import TextPredictor

predictor = TextPredictor(label='label', eval_metric='acc', path='./ag_sst')
predictor.fit(train_data, time_limit=60)

Global seed set to 123
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M 
1 | validation_metric | Accuracy                     | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 123


Training: 0it [00:00, ?it/s]

Time limit reached. Elapsed time is 0:01:10. Signaling Trainer to stop.


Validating: 0it [00:00, ?it/s]

Epoch 0, global step 0: val_acc reached 0.50000 (best 0.50000), saving model to "/mnt/c/Users/User/Desktop/New folder/ag_sst/epoch=0-step=0.ckpt" as top 3


<autogluon.text.text_prediction.predictor.TextPredictor at 0x7f61e9b4b550>

Above, we specify that the column named label contains the label values to predict.
AutoGluon should optimize its predictions for the accuracy evaluation metric, the trained models should be saved in the ag_sst folder, and training should run for about 60 seconds.
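The training log above reports an elapsed time of 0:01:10 against a 60-second limit. One plausible explanation is that a time budget is typically checked between training steps rather than enforced in the middle of one. The sketch below illustrates that idea with a hypothetical `train_with_time_limit` helper and a fake clock; both are made up for illustration and do not reflect AutoGluon's implementation.

```python
def train_with_time_limit(step_fn, time_limit, clock):
    """Run step_fn repeatedly until the elapsed time reaches time_limit.

    Hypothetical helper for illustration only. The budget is checked
    after each step, so the last step can overshoot the limit.
    """
    start = clock()
    steps = 0
    while True:
        step_fn()
        steps += 1
        elapsed = clock() - start
        if elapsed >= time_limit:
            return steps, elapsed


class FakeClock:
    """Deterministic clock that only moves when advance() is called."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
# Assumption: every training step takes 35 seconds of wall-clock time.
steps, elapsed = train_with_time_limit(
    lambda: clock.advance(35.0), time_limit=60.0, clock=clock
)
print(steps, elapsed)  # 2 70.0
```

With slow steps, such as a large model on a CPU, the overshoot can be a noticeable fraction of the budget.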

### Evaluation

In [4]:
test_score = predictor.evaluate(test_data)
print(test_score)

Predicting: 0it [00:00, ?it/s]

{'acc': 0.5217889908256881}


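By default, evaluate() reports accuracy, the metric chosen above; other metrics can be requested through the metrics argument. Note that the score of about 0.52 is barely above 0.5 for this binary task, which is unsurprising given that the training log shows the time limit being reached at global step 0 with no GPU available.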

In [5]:
test_score = predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

Predicting: 0it [00:00, ?it/s]

{'acc': 0.5217889908256881, 'f1': 0.6351706036745406}
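To make the two metrics concrete, here is a minimal pure-Python sketch of accuracy and F1 for binary labels. The helper names and the toy data are made up for illustration and are not part of AutoGluon. The example also shows that a classifier which always predicts the positive class can score an accuracy of 0.5 while its F1 is noticeably higher.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: a balanced set and a model that always answers "positive".
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1] * len(y_true)

acc = accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
print(acc)           # 0.5
print(round(f1, 4))  # 0.6667
```

Because F1 ignores true negatives, it is worth reading alongside accuracy rather than instead of it.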


### Prediction

In [6]:
sentence1 = "it's a charming and often affecting journey."
sentence2 = "It's slow, very, very, very slow."
predictions = predictor.predict({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Sentiment":', predictions.iloc[0])
print('"Sentence":', sentence2, '"Predicted Sentiment":', predictions.iloc[1])

Predicting: 0it [00:00, ?it/s]

"Sentence": it's a charming and often affecting journey. "Predicted Sentiment": 1
"Sentence": It's slow, very, very, very slow. "Predicted Sentiment": 0


In [7]:
probs = predictor.predict_proba({'sentence': [sentence1, sentence2]})
print('"Sentence":', sentence1, '"Predicted Class-Probabilities":', probs.iloc[0])
print('"Sentence":', sentence2, '"Predicted Class-Probabilities":', probs.iloc[1])

Predicting: 0it [00:00, ?it/s]

"Sentence": it's a charming and often affecting journey. "Predicted Class-Probabilities": 0    0.359767
1    0.640233
Name: 0, dtype: float32
"Sentence": It's slow, very, very, very slow. "Predicted Class-Probabilities": 0    0.687784
1    0.312216
Name: 1, dtype: float32
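The labels predicted in the previous cell match the class with the highest probability here. A small sketch using the probabilities printed above, assuming the label is chosen by argmax (the usual convention, not verified against AutoGluon's internals):

```python
# Class probabilities copied from the output above: one row per sentence,
# columns are P(label=0) and P(label=1).
probs = [
    [0.359767, 0.640233],
    [0.687784, 0.312216],
]


def argmax(values):
    """Index of the largest value in a sequence."""
    return max(range(len(values)), key=lambda i: values[i])


predicted_labels = [argmax(row) for row in probs]
print(predicted_labels)  # [1, 0]
```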


In [8]:
test_predictions = predictor.predict(test_data)
test_predictions.head()

Predicting: 0it [00:00, ?it/s]

0    1
1    0
2    1
3    1
4    0
Name: label, dtype: int64

### Save and load
The trained predictor is saved automatically at the end of fit() and can easily be loaded back.

In [9]:
loaded_predictor = TextPredictor.load('ag_sst')
loaded_predictor.predict_proba({'sentence': [sentence1, sentence2]})

Load pretrained checkpoint: ag_sst/model.ckpt


Predicting: 0it [00:00, ?it/s]

| | 0 | 1 |
|---|---|---|
| 0 | 0.359767 | 0.640233 |
| 1 | 0.687784 | 0.312216 |


### Continuous Training
You can also load a predictor and call .fit() again to continue training the same predictor on new data.

In [10]:
new_predictor = TextPredictor.load('ag_sst')
new_predictor.fit(train_data, time_limit=30, save_path='ag_sst_continue_train')
test_score = new_predictor.evaluate(test_data, metrics=['acc', 'f1'])
print(test_score)

Load pretrained checkpoint: ag_sst/model.ckpt
Global seed set to 123
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M 
1 | validation_metric | Accuracy                     | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.573   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 123


Training: 0it [00:00, ?it/s]

Time limit reached. Elapsed time is 0:00:35. Signaling Trainer to stop.


Validating: 0it [00:00, ?it/s]

Epoch 0, global step 0: val_acc reached 0.52000 (best 0.52000), saving model to "/mnt/c/Users/User/Desktop/New folder/ag_sst_continue_train/epoch=0-step=0.ckpt" as top 3


Predicting: 0it [00:00, ?it/s]

{'acc': 0.5217889908256881, 'f1': 0.6351706036745406}


## Sentence Similarity Task

In [11]:
sts_train_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
sts_test_data = load_pd.load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
sts_train_data.head(10)

| | sentence1 | sentence2 | score |
|---|---|---|---|
| 0 | A plane is taking off. | An air plane is taking off. | 5.0 |
| 1 | A man is playing a large flute. | A man is playing a flute. | 3.8 |
| 2 | A man is spreading shreded cheese on a pizza. | A man is spreading shredded cheese on an uncoo... | 3.8 |
| 3 | Three men are playing chess. | Two men are playing chess. | 2.6 |
| 4 | A man is playing the cello. | A man seated is playing the cello. | 4.25 |
| 5 | Some men are fighting. | Two men are fighting. | 4.25 |
| 6 | A man is smoking. | A man is skating. | 0.5 |
| 7 | The man is playing the piano. | The man is playing the guitar. | 1.6 |
| 8 | A man is playing on a guitar and singing. | A woman is playing an acoustic guitar and sing... | 2.2 |
| 9 | A person is throwing a cat on to the ceiling. | A person throws a cat on the ceiling. | 5.0 |


In this example, the column named score contains the numerical values we want to predict: similarity ratings for each pair of sentences.

In [12]:
print('Min score=', min(sts_train_data['score']), ', Max score=', max(sts_train_data['score']))

Min score= 0.0 , Max score= 5.0


AutoGluon automatically determines the type of prediction problem and the appropriate loss function. Here the label is a real-valued score, so the task is treated as regression.
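As a rough intuition for how a problem type could be inferred from the label column, the sketch below uses a deliberately simple, hypothetical heuristic. The function `infer_problem_type` and its rules are assumptions made for illustration; AutoGluon's actual logic is not shown in this notebook and is likely more involved.

```python
def infer_problem_type(labels):
    """Guess the task from label values (hypothetical heuristic, not AutoGluon's)."""
    unique = set(labels)
    # Any non-integral value suggests a continuous target.
    if any(not float(v).is_integer() for v in unique):
        return "regression"
    if len(unique) == 2:
        return "binary"
    return "multiclass"


# Matches the losses seen in the training logs: cross-entropy for the
# sentiment task, mean squared error for the similarity task.
LOSS_FOR_TYPE = {
    "binary": "cross-entropy",
    "multiclass": "cross-entropy",
    "regression": "mean squared error",
}

# First ten labels of each dataset, copied from the tables in this notebook.
sst_labels = [1, 0, 0, 1, 1, 0, 1, 1, 1, 1]
sts_scores = [5.0, 3.8, 3.8, 2.6, 4.25, 4.25, 0.5, 1.6, 2.2, 5.0]

sst_type = infer_problem_type(sst_labels)
sts_type = infer_problem_type(sts_scores)
print(sst_type, "->", LOSS_FOR_TYPE[sst_type])  # binary -> cross-entropy
print(sts_type, "->", LOSS_FOR_TYPE[sts_type])  # regression -> mean squared error
```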

In [13]:
predictor_sts = TextPredictor(label='score', path='./ag_sts')
predictor_sts.fit(sts_train_data, time_limit=60)

Global seed set to 123
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 108 M 
1 | validation_metric | MeanSquaredError             | 0     
2 | loss_func         | MSELoss                      | 0     
-------------------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
435.570   Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 123


Training: 0it [00:00, ?it/s]

Time limit reached. Elapsed time is 0:01:02. Signaling Trainer to stop.


Validating: 0it [00:00, ?it/s]

Epoch 0, global step 0: val_rmse reached 0.90564 (best 0.90564), saving model to "/mnt/c/Users/User/Desktop/New folder/ag_sts/epoch=0-step=0.ckpt" as top 3


<autogluon.text.text_prediction.predictor.TextPredictor at 0x7f622d4fb0a0>

In [16]:
test_score = predictor_sts.evaluate(sts_test_data, metrics=['rmse'])
print('RMSE = {:.2f}'.format(test_score['rmse']))

Predicting: 0it [00:00, ?it/s]

RMSE = 1.42
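RMSE is the square root of the mean squared difference between predictions and targets, so it is expressed in the same units as the score (here the 0 to 5 similarity scale). A minimal sketch with made-up numbers, not taken from the dataset:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values: errors are 1, 0, -2, 0, so the mean squared error is 1.25.
y_true = [5.0, 3.0, 1.0, 0.0]
y_pred = [4.0, 3.0, 3.0, 0.0]

value = rmse(y_true, y_pred)
print('RMSE = {:.2f}'.format(value))  # RMSE = 1.12
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones.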


Example:

In [15]:
sentences = ['The child is riding a horse.',
             'The young boy is riding a horse.',
             'The young man is riding a horse.',
             'The young man is riding a bicycle.']

score1 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[1]]}, as_pandas=False)

score2 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[2]]}, as_pandas=False)

score3 = predictor_sts.predict({'sentence1': [sentences[0]],
                                'sentence2': [sentences[3]]}, as_pandas=False)
print(score1, score2, score3)

Predicting: 0it [00:00, ?it/s]

Predicting: 0it [00:00, ?it/s]

Predicting: 0it [00:00, ?it/s]

3.1627088 2.9886348 2.8420243


# Text Prediction - Customization

In [17]:
import numpy as np
import warnings
import autogluon as ag
warnings.filterwarnings('ignore')
np.random.seed(123)

In [18]:
from autogluon.core import TabularDataset
subsample_size = 1000
train_data = TabularDataset('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
test_data = TabularDataset('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head(10)

| | sentence | label |
|---|---|---|
| 43787 | very pleasing at its best moments | 1 |
| 16159 | , american chai is enough to make you put away... | 0 |
| 59015 | too much like an infomercial for ram dass 's l... | 0 |
| 5108 | a stirring visual sequence | 1 |
| 67052 | cool visual backmasking | 1 |
| 35938 | hard ground | 0 |
| 49879 | the striking , quietly vulnerable personality ... | 1 |
| 51591 | pan nalin 's exposition is beautiful and myste... | 1 |
| 56780 | wonderfully loopy | 1 |
| 28518 | most beautiful , evocative | 1 |


## Configuration
TextPredictor provides several simple preset configurations.

In [19]:
from autogluon.text.text_prediction.presets import list_text_presets
list_text_presets()

['default',
 'medium_quality_faster_train',
 'high_quality',
 'best_quality',
 'multilingual']

Let's train a text predictor with the medium_quality_faster_train preset.

In [20]:
from autogluon.text import TextPredictor
predictor = TextPredictor(eval_metric='acc', label='label')
predictor.fit(
    train_data=train_data,
    presets='medium_quality_faster_train',
    time_limit=60,
)

Global seed set to 123


Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/665 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/226k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/455k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/51.7M [00:00<?, ?B/s]

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 13.5 M
1 | validation_metric | Accuracy                     | 0     
2 | loss_func         | CrossEntropyLoss             | 0     
-------------------------------------------------------------------
13.5 M    Trainable params
0         Non-trainable params
13.5 M    Total params
53.934    Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 123


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch 0, global step 3: val_acc reached 0.47500 (best 0.47500), saving model to "/mnt/c/Users/User/Desktop/New folder/AutogluonModels/ag-20220406_191256/epoch=0-step=3.ckpt" as top 3
Time limit reached. Elapsed time is 0:01:00. Signaling Trainer to stop.


Validating: 0it [00:00, ?it/s]

Epoch 0, global step 4: val_acc reached 0.50000 (best 0.50000), saving model to "/mnt/c/Users/User/Desktop/New folder/AutogluonModels/ag-20220406_191256/epoch=0-step=4.ckpt" as top 3


<autogluon.text.text_prediction.predictor.TextPredictor at 0x7f622d0733a0>

In [21]:
predictor.evaluate(test_data, metrics=['f1', 'acc'])

Predicting: 0it [00:00, ?it/s]

{'f1': 0.10351966873706005, 'acc': 0.5034403669724771}
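The training logs above give a sense of what the preset changes: the earlier runs report 108 M parameters (435.573 MB), while the medium_quality_faster_train run reports 13.5 M (53.934 MB). The sketch below checks that these figures are consistent with each other, assuming 4 bytes per float32 parameter and MB meaning 10^6 bytes. Both are assumptions, not something the logs state.

```python
BYTES_PER_FLOAT32 = 4  # assumption: parameters are stored as 32-bit floats

# Estimated model sizes in MB, copied from the two training logs.
reported_size_mb = {
    "no preset": 435.573,
    "medium_quality_faster_train": 53.934,
}

# Size in MB divided by bytes per parameter gives millions of parameters.
params_millions = {
    name: size_mb / BYTES_PER_FLOAT32 for name, size_mb in reported_size_mb.items()
}
for name, millions in params_millions.items():
    print(f"{name}: ~{millions:.1f} M parameters")

ratio = reported_size_mb["no preset"] / reported_size_mb["medium_quality_faster_train"]
print(f"size ratio: ~{ratio:.1f}x")
```

Under these assumptions the preset's model is roughly eight times smaller, which fits its stated goal of faster training.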