In [1]:
import pandas as pd
from utils import NlpUtils
from training import TrainClassModel
from pickle import dump

**Load train_data**

Partial transcripts are also derived, in this case covering the first 2 and 5 seconds of audio.

In [33]:
train_data = pd.read_csv("data/train_data.csv", index_col=0)
train_data = NlpUtils().get_partial_transcripts(train_data, filter_values=[2, 5])
train_data

Unnamed: 0,sid,label,transcript,partial_transcripts,transcript_2,transcript_5
0,CF8e49eff05aa1505fd419f4b3183e6898,human,"Hey, what's up?","(0): Hey./(0.4): hey, what's/(0.408): Hey, wha...","Hey, what's up?","Hey, what's up?"
1,CF78c33ca4e5d8a1fefbbd814b0b77912a,voicemail,"Hi, this is Frank Apple. I'm unable to take yo...","(0): Hi./(0.356): Hi this./(0.367): hi, this i...","Hi, this is Frank Apple. I'm unable to take y...","Hi, this is Frank Apple. I'm unable to take yo..."
2,CF374e99cf5e643b84243a908d45429bf6,voicemail,Your call has been forwarded to an automated v...,(0): You're?/(0.001): Your call./(0.317): Your...,Your call has been forwarded to an automated ...,your call has been forwarded to an automated v...
3,CF357bb78d1787f6a9e461ab1f229890c8,human,"Good morning. Absolutely, give me one moment.",(0): Good./(0.418): Good morning./(0.42): Good...,Good morning.,Good morning.
4,CF000ffa4ef2954b2b22ae5278033b7bb9,human,Hi. This is Bruce Miller.,"(0): I just/(0.155): Hi, this is Chris./(0.283...",Hi. This is Bruce Miller.,Hi. This is Bruce Miller. Yes.
...,...,...,...,...,...,...
2913,CF0651f63da6f2ea02f18d7cf6c9efafc8,voicemail,Your call has been forwarded to an automatic v...,(0): Your call./(0.364): Your call has./(0.37)...,Your call has been forwarded to an automatic ...,your call has been forwarded to an automatic v...
2914,CFb10e29e754bf876678745028dee8278b,human,Hello. Yes.,(0): Hello./(1.191): Hello./(1.207): Hello./(3...,Hello.,Hello. Yes.
2915,CF9b1e586eb99588d63b7244f4da31f9ea,voicemail,Your call has been forwarded to an automatic v...,(0): Your call has been./(0.404): Your call ha...,Your call has been forwarded to an automatic ...,your call has been forwarded to an automatic v...
2916,CF5eb6d8cc9cb4b2e482dee49b5501aa3f,human,Valid.,(0): Valid./(0.307): Salak./(0.975): Valid.,Valid.,Valid.
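
The `partial_transcripts` column above appears to store a sequence of `(seconds): text` hypotheses separated by `/`. As a rough sketch of how a transcript at a given cutoff could be derived from it, the helpers below (`parse_partials` and `transcript_at`, both hypothetical names) pick the latest hypothesis emitted at or before the cutoff. The format is inferred from the sample rows, and this is not necessarily how `NlpUtils.get_partial_transcripts` works.

```python
import re

# Assumed format, inferred from the sample rows: "(<seconds>): <text>/(<seconds>): <text>/..."
SEGMENT = re.compile(r"\((\d+(?:\.\d+)?)\):\s*(.*?)(?=/\(\d|$)", re.S)


def parse_partials(raw):
    """Split a partial_transcripts string into (timestamp, text) pairs."""
    return [(float(t), text.strip()) for t, text in SEGMENT.findall(raw)]


def transcript_at(raw, cutoff_s):
    """Return the latest partial hypothesis emitted at or before cutoff_s."""
    latest = ""
    for t, text in parse_partials(raw):
        if t <= cutoff_s:
            latest = text
    return latest


# Sample value taken from the last row of the table above.
raw = "(0): Valid./(0.307): Salak./(0.975): Valid."
print(transcript_at(raw, 0.5))
print(transcript_at(raw, 2))
```

With this reading, a cutoff of 0.5 seconds yields the intermediate (and wrong) hypothesis, while a cutoff of 2 seconds yields the final one, which illustrates why very short windows may give noisier text.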


**ML models & hyperparameter optimization class**

I created a class that classifies a dataset from a single text feature (the dependent variable is therefore categorical). It:
- **Splits the data into training and test sets** (only the test set percentage needs to be configured).
- Includes **cross validation**.
- Optionally **removes duplicate rows**, to help control overfitting.
- **Tries several models and optimizes their hyperparameters automatically**, which is its distinguishing feature.
- Includes the following algorithms: **Logistic Regression, Random Forest, Light Gradient Boosting Machine, and Support Vector Machine**.
- Optionally applies undersampling.
- Optionally **cleans the text**: removes stopwords, punctuation, accents, and non-English characters.
- **Transforms the corpus (i.e. all texts) into a numerical matrix using TF-IDF**. This matrix is used as the training input.
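
To make the cleaning and TF-IDF steps in the list above concrete, here is a small from-scratch sketch using only the standard library. It is not the code inside `TrainClassModel` (which is not shown); the stopword list, the `clean` and `tfidf` helpers, and the choice of smoothed IDF with L2-normalised rows are all assumptions made for illustration. In practice a library vectorizer would likely be used instead.

```python
import math
import re
import unicodedata
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "to", "this", "has", "been", "your"}


def clean(text):
    """Lowercase, strip accents, drop punctuation and non a-z characters, remove stopwords."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"['’]", "", text)  # "what's" -> "whats"
    tokens = re.findall(r"[a-z]+", text)
    return [tok for tok in tokens if tok not in STOPWORDS]


def tfidf(docs):
    """Return (vocabulary, matrix) using smoothed IDF and L2-normalised rows."""
    tokenised = [clean(d) for d in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n = len(docs)
    df = Counter(tok for doc in tokenised for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + df[tok])) + 1 for tok in vocab}
    matrix = []
    for doc in tokenised:
        counts = Counter(doc)
        row = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in row)) or 1.0  # avoid dividing by zero
        matrix.append([v / norm for v in row])
    return vocab, matrix


docs = [
    "Hey, what's up?",
    "Your call has been forwarded to an automated voice messaging system.",
    "Hello. Yes.",
]
vocab, matrix = tfidf(docs)
print(vocab)
print(len(matrix), "rows x", len(vocab), "columns")
```

Each row of `matrix` corresponds to one transcript and each column to one vocabulary term, which is the shape a classifier expects as training input.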

The experiments gave good results without changing the class balance or class sizes, so no undersampling was needed.

In [9]:
text_col = "transcript"
label_col = "label"
clf = TrainClassModel(train_data, text_col=text_col, label_col=label_col, clean=True, val_size=0.2)

578 duplicated rows removed
training set size: 1872 | test set size: 468
label for training set:
voicemail    1378
human         494
Name: label, dtype: int64
label for test set:
voicemail    345
human        123
Name: label, dtype: int64




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
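
As a quick sanity check on the split summary printed above, the snippet below recomputes the test share and the proportion of `human` calls in each subset from the printed counts. The near-identical proportions suggest a stratified split, although that is an inference from the numbers; the splitting code itself is not shown.

```python
# Counts copied from the output above.
train_counts = {"voicemail": 1378, "human": 494}
test_counts = {"voicemail": 345, "human": 123}

n_train = sum(train_counts.values())
n_test = sum(test_counts.values())

test_share = n_test / (n_train + n_test)
human_share_train = train_counts["human"] / n_train
human_share_test = test_counts["human"] / n_test

print(f"test share: {test_share:.3f}")
print(f"human share in train: {human_share_train:.3f}")
print(f"human share in test:  {human_share_test:.3f}")
```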



**Model Training**

- Keeping the same training/dev split, each algorithm included in the class was trained and tuned.
- The original objectives were recall above 95% and precision above 75%, which corresponds to an **f1-score above 83.8%**.
- **All models exceeded an f1-score of 96%.**
- The difference between the training and test (dev) metrics is also very small, which suggests that none of the models is overfitting or underfitting.
- SVMs tend to be resilient to overfitting, which is why I chose the SVM as the best model.
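
The 83.8% threshold mentioned above can be reproduced from the harmonic-mean definition of the f1-score, treating the 75% objective as precision and the 95% objective as recall (an interpretation, since f1 is defined from precision and recall rather than accuracy). The helper name `f1` is just for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


target = f1(precision=0.75, recall=0.95)
print(round(target, 3))  # 0.838
```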

In [10]:
model_logit, train_metric, test_metric = clf.fit_best_model(n_trials=5, n_splits=5, model_type="logit", metric="f1")

[I 2022-08-14 22:15:36,159] A new study created in memory with name: no-name-4d7556e9-7bf5-4cfd-871c-5a2af5e49352
[I 2022-08-14 22:15:36,203] Trial 0 finished with value: 0.9699058241700427 and parameters: {'penalty': 'l2'}. Best is trial 0 with value: 0.9699058241700427.
[I 2022-08-14 22:15:36,244] Trial 1 finished with value: 0.9699058241700427 and parameters: {'penalty': 'l2'}. Best is trial 0 with value: 0.9699058241700427.

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

[I 2022-08-14 22:15:36,307] Trial 2 finished with value: 0.9674185463659148 and parameters: {'penalty': 'none'}. Best is trial 0 with value: 0.

Number of finished trials: 5
Best trial: {'penalty': 'l2'}
1 0.9727272727272727
2 0.9711313927926849
3 0.9721612266924708
4 0.9691902133618915
5 0.9758814154237335
*** MODEL: logit ***
training metric -f1-: 0.9722183041996107
test metric -f1-: 0.9699058241700427
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.94      0.97      0.96       123
   voicemail       0.99      0.98      0.98       345

    accuracy                           0.98       468
   macro avg       0.97      0.97      0.97       468
weighted avg       0.98      0.98      0.98       468

[[119   4]
 [  7 338]]
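
The numbers in the output above can be cross-checked by hand. The sketch below, using only values printed in that cell, suggests two things about how the class reports its metrics: the "training metric" matches the mean of the five cross-validation fold scores, and the "test metric" matches the macro-averaged f1 computed from the confusion matrix (rows read as true classes, columns as predictions). This is an inference from the arithmetic, not something read from the class's source.

```python
from statistics import mean

# Fold scores, reported metrics, and confusion matrix copied from the logit output above.
fold_scores = [
    0.9727272727272727,
    0.9711313927926849,
    0.9721612266924708,
    0.9691902133618915,
    0.9758814154237335,
]
reported_train = 0.9722183041996107
reported_test = 0.9699058241700427
confusion = [
    [119, 4],   # true human:     predicted human, predicted voicemail
    [7, 338],   # true voicemail: predicted human, predicted voicemail
]


def per_class_f1(cm, k):
    """f1 for class k from a square confusion matrix (rows = true, columns = predicted)."""
    tp = cm[k][k]
    fp = sum(cm[r][k] for r in range(len(cm))) - tp
    fn = sum(cm[k]) - tp
    return 2 * tp / (2 * tp + fp + fn)


cv_mean = mean(fold_scores)
macro_f1 = mean(per_class_f1(confusion, k) for k in range(len(confusion)))

print(round(cv_mean, 6), round(reported_train, 6))
print(round(macro_f1, 6), round(reported_test, 6))
```

The same check gives matching values for the random forest output further down, so the comparison between "training" and "test" metrics in this notebook is effectively cross-validation mean versus held-out macro f1.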


In [11]:
model_rf, train_metric, test_metric = clf.fit_best_model(n_trials=100, n_splits=5, model_type="random_forest", metric="f1")

[I 2022-08-14 22:15:46,919] A new study created in memory with name: no-name-486c7d92-b973-4c69-878c-77054793f93f
[I 2022-08-14 22:15:47,021] Trial 0 finished with value: 0.9651352698807469 and parameters: {'n_estimators': 35, 'max_depth': 12, 'min_samples_split': 4, 'min_samples_leaf': 3, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 0 with value: 0.9651352698807469.
[I 2022-08-14 22:15:47,145] Trial 1 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 55, 'max_depth': 10, 'min_samples_split': 4, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 1 with value: 0.9675802987969613.
[I 2022-08-14 22:15:47,272] Trial 2 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 55, 'max_depth': 14, 'min_samples_split': 3, 'min_samples_leaf': 4, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 1 with value: 0.9675802987969613.
[I 2022-08-14 22:15:47

[I 2022-08-14 22:15:50,272] Trial 29 finished with value: 0.9649650168437419 and parameters: {'n_estimators': 35, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 3, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:50,354] Trial 30 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 3, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:50,432] Trial 31 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 3, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:50,520] Trial 32 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samp

[I 2022-08-14 22:15:53,731] Trial 58 finished with value: 0.9672536443148688 and parameters: {'n_estimators': 35, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:53,867] Trial 59 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 65, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:53,997] Trial 60 finished with value: 0.9674185463659148 and parameters: {'n_estimators': 65, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 20 with value: 0.9703550142523969.
[I 2022-08-14 22:15:54,118] Trial 61 finished with value: 0.9649650168437419 and parameters: {'n_estimators': 45, 'max_depth': None, 'min_s

[I 2022-08-14 22:15:56,709] Trial 87 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 35, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 3, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 67 with value: 0.9704990745144784.
[I 2022-08-14 22:15:56,786] Trial 88 finished with value: 0.9678946285243877 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 4, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 67 with value: 0.9704990745144784.
[I 2022-08-14 22:15:56,864] Trial 89 finished with value: 0.9649650168437419 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 4, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 67 with value: 0.9704990745144784.
[I 2022-08-14 22:15:56,938] Trial 90 finished with value: 0.9675802987969613 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_sam

Number of finished trials: 100
Best trial: {'n_estimators': 35, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'gini'}
1 0.9690199282180263
2 0.9676657239428847
3 0.9721612266924708
4 0.9632635930956271
5 0.9660900155949661
*** MODEL: random_forest ***
training metric -f1-: 0.9676400975087951
test metric -f1-: 0.9675802987969613
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.92      0.98      0.95       123
   voicemail       0.99      0.97      0.98       345

    accuracy                           0.97       468
   macro avg       0.96      0.98      0.97       468
weighted avg       0.98      0.97      0.97       468

[[121   2]
 [ 10 335]]


In [12]:
model_svm, train_metric, test_metric = clf.fit_best_model(n_trials=100, n_splits=5, model_type="svm", metric="f1")

[I 2022-08-14 22:15:59,371] A new study created in memory with name: no-name-1989623b-ed97-42fc-bca2-6e44e2635f0b
[I 2022-08-14 22:15:59,420] Trial 0 finished with value: 0.9640627307362887 and parameters: {'C': 20, 'shrinking': True, 'kernel': 'sigmoid'}. Best is trial 0 with value: 0.9640627307362887.
[I 2022-08-14 22:15:59,473] Trial 1 finished with value: 0.9642502482621649 and parameters: {'C': 100, 'shrinking': False, 'kernel': 'rbf'}. Best is trial 1 with value: 0.9642502482621649.
[I 2022-08-14 22:15:59,515] Trial 2 finished with value: 0.9611958684734199 and parameters: {'C': 10, 'shrinking': True, 'kernel': 'sigmoid'}. Best is trial 1 with value: 0.9642502482621649.
[I 2022-08-14 22:15:59,567] Trial 3 finished with value: 0.9585339200803332 and parameters: {'C': 10, 'shrinking': False, 'kernel': 'rbf'}. Best is trial 1 with value: 0.9642502482621649.
[I 2022-08-14 22:15:59,609] Trial 4 finished with val

[I 2022-08-14 22:16:01,807] Trial 40 finished with value: 0.958750286456344 and parameters: {'C': 15, 'shrinking': False, 'kernel': 'rbf'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:01,882] Trial 41 finished with value: 0.9670855213803451 and parameters: {'C': 5, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:01,954] Trial 42 finished with value: 0.9644341558373232 and parameters: {'C': 5, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:02,024] Trial 43 finished with value: 0.9644341558373232 and parameters: {'C': 5, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:02,115] Trial 44 finished with value: 0.9670855213803451 and parameters: {'C': 20, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.97550237008

[I 2022-08-14 22:16:04,455] Trial 81 finished with value: 0.9670855213803451 and parameters: {'C': 80, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:04,517] Trial 82 finished with value: 0.961795918367347 and parameters: {'C': 1, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:04,576] Trial 83 finished with value: 0.9670855213803451 and parameters: {'C': 1, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:04,638] Trial 84 finished with value: 0.9644341558373232 and parameters: {'C': 8, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.9755023700817169.
[I 2022-08-14 22:16:04,698] Trial 85 finished with value: 0.9644341558373232 and parameters: {'C': 40, 'shrinking': True, 'kernel': 'poly'}. Best is trial 27 with value: 0.97550237008

Number of finished trials: 100
Best trial: {'C': 5, 'shrinking': True, 'kernel': 'poly'}
1 0.969415218987032
2 0.9742710120068611
3 0.9685758297157261
4 0.9665030630888833
5 0.9560536900619153
*** MODEL: svm ***
training metric -f1-: 0.9669637627720835
test metric -f1-: 0.9670855213803451
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.94      0.96      0.95       123
   voicemail       0.99      0.98      0.98       345

    accuracy                           0.97       468
   macro avg       0.96      0.97      0.97       468
weighted avg       0.97      0.97      0.97       468

[[118   5]
 [  7 338]]


In [13]:
model_lgb, train_metric, test_metric = clf.fit_best_model(n_trials=100, n_splits=5, model_type="lgbm", metric="f1")

[I 2022-08-14 22:16:50,935] A new study created in memory with name: no-name-19d2ff3e-ed59-4830-98f3-472c9d582b1a
[I 2022-08-14 22:16:51,002] Trial 0 finished with value: 0.9647914629135586 and parameters: {'learning_rate': 0.04, 'n_estimators': 39, 'max_depth': 13, 'colsample_bytree': 0.65, 'reg_alpha': 0.1545341898542818, 'reg_lambda': 0.46901126982174424, 'min_child_samples': 13}. Best is trial 0 with value: 0.9647914629135586.
[I 2022-08-14 22:16:51,059] Trial 1 finished with value: 0.9649650168437419 and parameters: {'learning_rate': 0.07, 'n_estimators': 33, 'max_depth': 16, 'colsample_bytree': 0.7, 'reg_alpha': 0.19831289895383536, 'reg_lambda': 0.004080699748026157, 'min_child_samples': 9}. Best is trial 1 with value: 0.9649650168437419.
[I 2022-08-14 22:16:51,113] Trial 2 finished with value: 0.9619883040935672 and parameters: {'learning_rate': 0.05, 'n_estimators': 29, 'max_depth': 18, 'colsample_bytree': 0.7, 'reg_alpha': 0.109

[I 2022-08-14 22:16:52,612] Trial 25 finished with value: 0.9700584523220984 and parameters: {'learning_rate': 0.1, 'n_estimators': 42, 'max_depth': 12, 'colsample_bytree': 0.7, 'reg_alpha': 0.26046460933697324, 'reg_lambda': 0.2842901514898562, 'min_child_samples': 6}. Best is trial 3 with value: 0.9702081609268574.
[I 2022-08-14 22:16:52,684] Trial 26 finished with value: 0.9700584523220984 and parameters: {'learning_rate': 0.1, 'n_estimators': 45, 'max_depth': 12, 'colsample_bytree': 0.7, 'reg_alpha': 0.33190350129312385, 'reg_lambda': 0.026411276343149263, 'min_child_samples': 6}. Best is trial 3 with value: 0.9702081609268574.
[I 2022-08-14 22:16:52,751] Trial 27 finished with value: 0.9674185463659148 and parameters: {'learning_rate': 0.1, 'n_estimators': 42, 'max_depth': 15, 'colsample_bytree': 0.7, 'reg_alpha': 0.3810158293246335, 'reg_lambda': 0.14454228619509724, 'min_child_samples': 6}. Best is trial 3 with value: 0.9702081609268574.
[

[I 2022-08-14 22:16:54,318] Trial 50 finished with value: 0.96461453456248 and parameters: {'learning_rate': 0.08, 'n_estimators': 34, 'max_depth': 10, 'colsample_bytree': 0.65, 'reg_alpha': 0.1190966196025466, 'reg_lambda': 0.08971283299014966, 'min_child_samples': 5}. Best is trial 42 with value: 0.9729835823308011.
[I 2022-08-14 22:16:54,385] Trial 51 finished with value: 0.9675802987969613 and parameters: {'learning_rate': 0.1, 'n_estimators': 27, 'max_depth': 11, 'colsample_bytree': 0.65, 'reg_alpha': 0.10570323594497207, 'reg_lambda': 0.05091751247741228, 'min_child_samples': 6}. Best is trial 42 with value: 0.9729835823308011.
[I 2022-08-14 22:16:54,447] Trial 52 finished with value: 0.9675802987969613 and parameters: {'learning_rate': 0.1, 'n_estimators': 25, 'max_depth': 10, 'colsample_bytree': 0.65, 'reg_alpha': 0.1215589899450632, 'reg_lambda': 0.2768072091977736, 'min_child_samples': 5}. Best is trial 42 with value: 0.9729835823308011.

[I 2022-08-14 22:16:56,015] Trial 75 finished with value: 0.9700584523220984 and parameters: {'learning_rate': 0.1, 'n_estimators': 41, 'max_depth': 11, 'colsample_bytree': 0.75, 'reg_alpha': 0.31277074440672764, 'reg_lambda': 0.17617336986415955, 'min_child_samples': 6}. Best is trial 42 with value: 0.9729835823308011.
[I 2022-08-14 22:16:56,091] Trial 76 finished with value: 0.9699058241700427 and parameters: {'learning_rate': 0.09, 'n_estimators': 40, 'max_depth': 11, 'colsample_bytree': 0.75, 'reg_alpha': 0.31382142474573765, 'reg_lambda': 0.15827567519755936, 'min_child_samples': 6}. Best is trial 42 with value: 0.9729835823308011.
[I 2022-08-14 22:16:56,166] Trial 77 finished with value: 0.9674185463659148 and parameters: {'learning_rate': 0.1, 'n_estimators': 41, 'max_depth': 10, 'colsample_bytree': 0.75, 'reg_alpha': 0.3550936504298032, 'reg_lambda': 0.20092761053952884, 'min_child_samples': 12}. Best is trial 42 with value: 0.9729835823308011

Number of finished trials: 100
Best trial: {'learning_rate': 0.1, 'n_estimators': 24, 'max_depth': 10, 'colsample_bytree': 0.65, 'reg_alpha': 0.10097943486556425, 'reg_lambda': 0.04755631471881741, 'min_child_samples': 7}
1 0.9690199282180263
2 0.970977478523334
3 0.9687856447350118
4 0.969575631581326
5 0.965875912408759
*** MODEL: lgbm ***
training metric -f1-: 0.9688469190932913
test metric -f1-: 0.9675802987969613
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.92      0.98      0.95       123
   voicemail       0.99      0.97      0.98       345

    accuracy                           0.97       468
   macro avg       0.96      0.98      0.97       468
weighted avg       0.98      0.97      0.97       468

[[121   2]
 [ 10 335]]


**Store the best 10-second model on disk**

In [14]:
# Use a context manager so the file handle is closed after writing.
with open("model_10.p", "wb") as f:
    dump(model_svm, f)
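
For completeness, a minimal save/load round trip with `pickle` might look like the sketch below. The `save_model` and `load_model` helpers are hypothetical, and a plain dictionary stands in for `model_svm` so the example runs on its own. Whether the stored model object also carries the fitted TF-IDF transformation is not shown in this notebook, so that would need to be checked before using the file for inference.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a model to disk, closing the file handle afterwards."""
    with open(path, "wb") as fh:
        pickle.dump(model, fh)


def load_model(path):
    """Load a pickled model. Only unpickle files from a trusted source."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Stand-in object (assumption); in the notebook this would be model_svm.
stand_in = {"kernel": "poly", "C": 5, "shrinking": True}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_10.p"
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)
```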

**Testing faster models**
- "Faster" models were tested: instead of capturing 10 seconds of audio, they capture only 2 or 5 seconds.
- The 2-second model did not perform well.
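- The 5-second model is almost as effective as the 10-second model. In the logit, random forest, and SVM runs below, its test f1-score is about 3 percentage points lower (roughly 0.93 to 0.94 versus 0.97). It may not be as robust to overfitting, but the gap between training and dev metrics remains small (about 1 to 2 points).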
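
To quantify the gap between the 10-second and 5-second models, the snippet below compares the test f1 values printed in this notebook for the three model types whose 5-second results are fully visible (the LightGBM 5-second result is left out because its final output is not shown here). The dictionary names are only for illustration.

```python
# Test f1 values copied from the notebook outputs.
test_f1_10s = {
    "logit": 0.9699058241700427,
    "random_forest": 0.9675802987969613,
    "svm": 0.9670855213803451,
}
test_f1_5s = {
    "logit": 0.938484076433121,
    "random_forest": 0.9387773213651808,
    "svm": 0.9328917197452229,
}

# Difference in percentage points for each model type.
drop_points = {
    name: (test_f1_10s[name] - test_f1_5s[name]) * 100 for name in test_f1_10s
}

for name, points in drop_points.items():
    print(f"{name}: {points:.1f} points lower with 5 seconds of audio")
```

All three 5-second models remain well above the 83.8% f1 target stated earlier, which is the main argument for considering the shorter window.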

In [15]:
text_col = "transcript_5"
clf_5 = TrainClassModel(train_data, text_col=text_col, label_col=label_col, clean=True)

724 duplicated rows removed
training set size: 1755 | test set size: 439
label for training set:
voicemail    1278
human         477
Name: label, dtype: int64
label for test set:
voicemail    320
human        119
Name: label, dtype: int64




A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



In [20]:
model_logit_5, train_metric, test_metric = clf_5.fit_best_model(n_trials=5, n_splits=5, model_type="logit", metric="f1")

[I 2022-08-14 22:24:53,342] A new study created in memory with name: no-name-8cf5db77-1c3d-47c7-98f1-222efb7aefbf
[I 2022-08-14 22:24:53,382] Trial 0 finished with value: 0.9328917197452229 and parameters: {'penalty': 'l2'}. Best is trial 0 with value: 0.9328917197452229.
[I 2022-08-14 22:24:53,418] Trial 1 finished with value: 0.9443430194228917 and parameters: {'penalty': 'l2'}. Best is trial 1 with value: 0.9443430194228917.

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

[I 2022-08-14 22:24:53,477] Trial 2 finished with value: 0.9390647082576217 and parameters: {'penalty': 'none'}. Best is trial 1 with value: 0.

Number of finished trials: 5
Best trial: {'penalty': 'l2'}
1 0.9650398406374502
2 0.9498161764705884
3 0.9680839455881283
4 0.9564119546320058
5 0.954153605015674
*** MODEL: logit ***
training metric -f1-: 0.9587011044687692
test metric -f1-: 0.938484076433121
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.87      0.96      0.91       119
   voicemail       0.98      0.95      0.96       320

    accuracy                           0.95       439
   macro avg       0.93      0.95      0.94       439
weighted avg       0.95      0.95      0.95       439

[[114   5]
 [ 17 303]]


In [17]:
model_rf_5, train_metric, test_metric = clf_5.fit_best_model(n_trials=100, n_splits=5, model_type="random_forest", metric="f1")

[I 2022-08-14 22:23:17,323] A new study created in memory with name: no-name-fc99dc05-039b-48ee-a8ec-cb45af5a1e69
[I 2022-08-14 22:23:17,422] Trial 0 finished with value: 0.9238376127689105 and parameters: {'n_estimators': 55, 'max_depth': 10, 'min_samples_split': 4, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 0 with value: 0.9238376127689105.
[I 2022-08-14 22:23:17,572] Trial 1 finished with value: 0.9416980237154151 and parameters: {'n_estimators': 75, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'entropy'}. Best is trial 1 with value: 0.9416980237154151.
[I 2022-08-14 22:23:17,697] Trial 2 finished with value: 0.9263957972391439 and parameters: {'n_estimators': 65, 'max_depth': 10, 'min_samples_split': 4, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 1 with value: 0.9416980237154151.
[I 2022-08-14 22:23:17,79

[I 2022-08-14 22:23:20,689] Trial 29 finished with value: 0.9387773213651808 and parameters: {'n_estimators': 55, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'gini'}. Best is trial 22 with value: 0.941969596827495.
[I 2022-08-14 22:23:20,763] Trial 30 finished with value: 0.9443430194228917 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 30 with value: 0.9443430194228917.
[I 2022-08-14 22:23:20,830] Trial 31 finished with value: 0.9286446611652914 and parameters: {'n_estimators': 25, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 30 with value: 0.9443430194228917.
[I 2022-08-14 22:23:20,897] Trial 32 finished with value: 0.9387773213651808 and parameters: {'n_estimators': 25, 'max_depth': None, '

[I 2022-08-14 22:23:24,000] Trial 58 finished with value: 0.9390647082576217 and parameters: {'n_estimators': 65, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 30 with value: 0.9443430194228917.
[I 2022-08-14 22:23:24,114] Trial 59 finished with value: 0.9364428917634469 and parameters: {'n_estimators': 55, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 4, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 30 with value: 0.9443430194228917.
[I 2022-08-14 22:23:24,240] Trial 60 finished with value: 0.9364428917634469 and parameters: {'n_estimators': 55, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 30 with value: 0.9443430194228917.
[I 2022-08-14 22:23:24,366] Trial 61 finished with value: 0.9387773213651808 and parameters: {'n_estimators': 65, 'max_depth': Non

[I 2022-08-14 22:23:27,274] Trial 87 finished with value: 0.9387773213651808 and parameters: {'n_estimators': 35, 'max_depth': None, 'min_samples_split': 4, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 62 with value: 0.9467440509254825.
[I 2022-08-14 22:23:27,384] Trial 88 finished with value: 0.9387773213651808 and parameters: {'n_estimators': 55, 'max_depth': None, 'min_samples_split': 5, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'entropy'}. Best is trial 62 with value: 0.9467440509254825.
[I 2022-08-14 22:23:27,549] Trial 89 finished with value: 0.9335251362810417 and parameters: {'n_estimators': 55, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': True, 'criterion': 'gini'}. Best is trial 62 with value: 0.9467440509254825.
[I 2022-08-14 22:23:27,659] Trial 90 finished with value: 0.9361454545454546 and parameters: {'n_estimators': 45, 'max_depth': None, '

Number of finished trials: 100
Best trial: {'n_estimators': 45, 'max_depth': None, 'min_samples_split': 3, 'min_samples_leaf': 2, 'bootstrap': False, 'criterion': 'entropy'}
1 0.9552142121018796
2 0.9325443786982248
3 0.9649652919559002
4 0.9639185855263158
5 0.9433414043583535
*** MODEL: random_forest ***
training metric -f1-: 0.9519967745281347
test metric -f1-: 0.9387773213651808
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.86      0.97      0.91       119
   voicemail       0.99      0.94      0.96       320

    accuracy                           0.95       439
   macro avg       0.93      0.96      0.94       439
weighted avg       0.95      0.95      0.95       439

[[115   4]
 [ 18 302]]


In [18]:
model_svm_5, train_metric, test_metric = clf_5.fit_best_model(n_trials=100, n_splits=5, model_type="svm", metric="f1")

[I 2022-08-14 22:23:30,359] A new study created in memory with name: no-name-5c5f18ed-11a1-4e8c-beaf-97b2086ba699
[I 2022-08-14 22:23:30,406] Trial 0 finished with value: 0.9387773213651808 and parameters: {'C': 10, 'shrinking': False, 'kernel': 'rbf'}. Best is trial 0 with value: 0.9387773213651808.
[I 2022-08-14 22:23:30,451] Trial 1 finished with value: 0.9205409318922273 and parameters: {'C': 60, 'shrinking': True, 'kernel': 'rbf'}. Best is trial 0 with value: 0.9387773213651808.
[I 2022-08-14 22:23:30,497] Trial 2 finished with value: 0.9358419591172789 and parameters: {'C': 3, 'shrinking': False, 'kernel': 'poly'}. Best is trial 0 with value: 0.9387773213651808.
[I 2022-08-14 22:23:30,543] Trial 3 finished with value: 0.9305928853754941 and parameters: {'C': 10, 'shrinking': True, 'kernel': 'poly'}. Best is trial 0 with value: 0.9387773213651808.
[I 2022-08-14 22:23:30,591] Trial 4 finished with value: 0.93

[I 2022-08-14 22:23:32,301] Trial 40 finished with value: 0.9414209191940373 and parameters: {'C': 40, 'shrinking': False, 'kernel': 'poly'}. Best is trial 40 with value: 0.9414209191940373.
[I 2022-08-14 22:23:32,353] Trial 41 finished with value: 0.9352162559269074 and parameters: {'C': 40, 'shrinking': False, 'kernel': 'poly'}. Best is trial 40 with value: 0.9414209191940373.
[I 2022-08-14 22:23:32,403] Trial 42 finished with value: 0.9302629990405205 and parameters: {'C': 40, 'shrinking': False, 'kernel': 'poly'}. Best is trial 40 with value: 0.9414209191940373.
[I 2022-08-14 22:23:32,459] Trial 43 finished with value: 0.93321162330747 and parameters: {'C': 5, 'shrinking': False, 'kernel': 'poly'}. Best is trial 40 with value: 0.9414209191940373.
[I 2022-08-14 22:23:32,514] Trial 44 finished with value: 0.9355322721729527 and parameters: {'C': 10, 'shrinking': False, 'kernel': 'poly'}. Best is trial 40 with value: 0.94142

[I 2022-08-14 22:23:34,301] Trial 81 finished with value: 0.9328917197452229 and parameters: {'C': 15, 'shrinking': True, 'kernel': 'poly'}. Best is trial 65 with value: 0.9416980237154151.
[I 2022-08-14 22:23:34,353] Trial 82 finished with value: 0.9355322721729527 and parameters: {'C': 60, 'shrinking': True, 'kernel': 'poly'}. Best is trial 65 with value: 0.9416980237154151.
[I 2022-08-14 22:23:34,401] Trial 83 finished with value: 0.9355322721729527 and parameters: {'C': 100, 'shrinking': True, 'kernel': 'poly'}. Best is trial 65 with value: 0.9416980237154151.
[I 2022-08-14 22:23:34,445] Trial 84 finished with value: 0.938484076433121 and parameters: {'C': 10, 'shrinking': False, 'kernel': 'rbf'}. Best is trial 65 with value: 0.9416980237154151.
[I 2022-08-14 22:23:34,492] Trial 85 finished with value: 0.938484076433121 and parameters: {'C': 20, 'shrinking': True, 'kernel': 'poly'}. Best is trial 65 with value: 0.94169802

Number of finished trials: 100
Best trial: {'C': 20, 'shrinking': True, 'kernel': 'poly'}
1 0.948324090886784
2 0.9464051223062594
3 0.9648184427536561
4 0.960178223336118
5 0.9397890042027617
*** MODEL: svm ***
training metric -f1-: 0.9519029766971159
test metric -f1-: 0.9328917197452229
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.86      0.95      0.90       119
   voicemail       0.98      0.94      0.96       320

    accuracy                           0.95       439
   macro avg       0.92      0.95      0.93       439
weighted avg       0.95      0.95      0.95       439

[[113   6]
 [ 18 302]]


In [19]:
model_lgb_5, train_metric, test_metric = clf_5.fit_best_model(n_trials=100, n_splits=5, model_type="lgbm", metric="f1")

[I 2022-08-14 22:24:43,290] A new study created in memory with name: no-name-79632865-ab69-4422-8cff-57a9bdef68f3
[I 2022-08-14 22:24:43,337] Trial 0 finished with value: 0.9338323954983923 and parameters: {'learning_rate': 0.04, 'n_estimators': 25, 'max_depth': 20, 'colsample_bytree': 0.75, 'reg_alpha': 0.2597022499684475, 'reg_lambda': 0.003058029022068154, 'min_child_samples': 15}. Best is trial 0 with value: 0.9338323954983923.
[I 2022-08-14 22:24:43,379] Trial 1 finished with value: 0.9361454545454546 and parameters: {'learning_rate': 0.06, 'n_estimators': 28, 'max_depth': 12, 'colsample_bytree': 0.8, 'reg_alpha': 0.13689666445515886, 'reg_lambda': 0.036562737946965254, 'min_child_samples': 11}. Best is trial 1 with value: 0.9361454545454546.
[I 2022-08-14 22:24:43,418] Trial 2 finished with value: 0.9286446611652914 and parameters: {'learning_rate': 0.08, 'n_estimators': 20, 'max_depth': 16, 'colsample_bytree': 0.8, 'reg_alpha': 0.3

[I 2022-08-14 22:24:44,834] Trial 25 finished with value: 0.9335251362810417 and parameters: {'learning_rate': 0.09, 'n_estimators': 54, 'max_depth': 13, 'colsample_bytree': 0.65, 'reg_alpha': 0.2997289037780683, 'reg_lambda': 0.04719421368139203, 'min_child_samples': 9}. Best is trial 13 with value: 0.9446042802342014.
[I 2022-08-14 22:24:44,907] Trial 26 finished with value: 0.9416980237154151 and parameters: {'learning_rate': 0.1, 'n_estimators': 50, 'max_depth': 15, 'colsample_bytree': 0.65, 'reg_alpha': 0.22363251554256885, 'reg_lambda': 0.2522792275426997, 'min_child_samples': 7}. Best is trial 13 with value: 0.9446042802342014.
[I 2022-08-14 22:24:44,976] Trial 27 finished with value: 0.9361454545454546 and parameters: {'learning_rate': 0.09, 'n_estimators': 56, 'max_depth': 12, 'colsample_bytree': 0.65, 'reg_alpha': 0.16424482872029514, 'reg_lambda': 0.12020886028200724, 'min_child_samples': 6}. Best is trial 13 with value: 0.9446042802342014.

[I 2022-08-14 22:24:46,510] Trial 50 finished with value: 0.9257316866858399 and parameters: {'learning_rate': 0.07, 'n_estimators': 57, 'max_depth': 20, 'colsample_bytree': 0.75, 'reg_alpha': 0.17313722859706657, 'reg_lambda': 0.01798022808477409, 'min_child_samples': 14}. Best is trial 46 with value: 0.9469998792707955.
[I 2022-08-14 22:24:46,573] Trial 51 finished with value: 0.941969596827495 and parameters: {'learning_rate': 0.09, 'n_estimators': 45, 'max_depth': 14, 'colsample_bytree': 0.7, 'reg_alpha': 0.36055462137208244, 'reg_lambda': 0.009594884522822977, 'min_child_samples': 7}. Best is trial 46 with value: 0.9469998792707955.
[I 2022-08-14 22:24:46,639] Trial 52 finished with value: 0.9416980237154151 and parameters: {'learning_rate': 0.05, 'n_estimators': 51, 'max_depth': 18, 'colsample_bytree': 0.65, 'reg_alpha': 0.184091978700632, 'reg_lambda': 0.015436180748264404, 'min_child_samples': 8}. Best is trial 46 with value: 0.946999879270795

[I 2022-08-14 22:24:48,403] Trial 75 finished with value: 0.941969596827495 and parameters: {'learning_rate': 0.04, 'n_estimators': 59, 'max_depth': 10, 'colsample_bytree': 0.75, 'reg_alpha': 0.1807110117344268, 'reg_lambda': 0.08867987001404289, 'min_child_samples': 7}. Best is trial 46 with value: 0.9469998792707955.
[I 2022-08-14 22:24:48,466] Trial 76 finished with value: 0.9341335333833458 and parameters: {'learning_rate': 0.1, 'n_estimators': 34, 'max_depth': 10, 'colsample_bytree': 0.65, 'reg_alpha': 0.20385425151501674, 'reg_lambda': 0.10798748179055373, 'min_child_samples': 8}. Best is trial 46 with value: 0.9469998792707955.
[I 2022-08-14 22:24:48,541] Trial 77 finished with value: 0.9367343997694193 and parameters: {'learning_rate': 0.06, 'n_estimators': 44, 'max_depth': 12, 'colsample_bytree': 0.65, 'reg_alpha': 0.21955427085951984, 'reg_lambda': 0.029672365679427985, 'min_child_samples': 7}. Best is trial 46 with value: 0.9469998792707955

Number of finished trials: 100
Best trial: {'learning_rate': 0.08, 'n_estimators': 51, 'max_depth': 11, 'colsample_bytree': 0.75, 'reg_alpha': 0.15712303881812745, 'reg_lambda': 0.01901376687972032, 'min_child_samples': 7}
1 0.9552142121018796
2 0.9362950713882391
3 0.9618588814836279
4 0.960178223336118
5 0.9332417582417583
*** MODEL: lgbm ***
training metric -f1-: 0.9493576293103245
test metric -f1-: 0.9414209191940373
Confusion Matrix for test set
              precision    recall  f1-score   support

       human       0.87      0.97      0.92       119
   voicemail       0.99      0.95      0.97       320

    accuracy                           0.95       439
   macro avg       0.93      0.96      0.94       439
weighted avg       0.96      0.95      0.95       439

[[115   4]
 [ 17 303]]
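
**Checking the reported F1 against the confusion matrices**

The test metric printed for each model can be recomputed by hand from its confusion matrix. The sketch below assumes that rows are true labels and columns are predicted labels (the row sums, 119 and 320, match the `support` column) and that the `f1` metric is the unweighted (macro) average of the per-class F1 scores. The helper names `per_class_f1` and `macro_f1` are illustrative and are not part of `TrainClassModel`.

```python
def per_class_f1(matrix):
    """Return the F1 score of each class from a square confusion matrix.

    Rows are assumed to be true classes, columns predicted classes.
    """
    scores = []
    for k in range(len(matrix)):
        tp = matrix[k][k]                          # correctly predicted as class k
        fn = sum(matrix[k]) - tp                   # class k predicted as something else
        fp = sum(row[k] for row in matrix) - tp    # other classes predicted as k
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(matrix)
    return sum(scores) / len(scores)


# Confusion matrices copied from the SVM and LightGBM test reports above
svm_matrix = [[113, 6], [18, 302]]
lgbm_matrix = [[115, 4], [17, 303]]

print(round(macro_f1(svm_matrix), 4))   # 0.9329
print(round(macro_f1(lgbm_matrix), 4))  # 0.9414
```

Both values agree with the `test metric -f1-` lines, which suggests the comparison between models is being made on macro F1 rather than on accuracy or weighted F1.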


**Store the best 5-second model on disk**

In [22]:
# Use a context manager so the file is flushed and closed after writing
with open("model_5.p", "wb") as f:
    dump(model_lgb_5, f)

**Use the trained model to classify unlabelled texts**

- The same pipeline used during training is applied: the text is cleaned and then transformed with TF-IDF.
- The model stored on disk is loaded.
- Predictions are then made. On my machine, predicting 730 cases takes a little over half a second.

In [34]:
import pandas as pd
from pickle import load
from datetime import datetime

from training import TrainClassModel

# Read data
test_data = pd.read_csv('data/evaluation_data.csv', index_col=0)
# Load the model (the context manager closes the file after reading)
with open("model_10.p", "rb") as f:
    best_model = load(f)

text_col = "transcript"
start = datetime.now()
# Make predictions
preds = TrainClassModel.predict(best_model, test_data[text_col], clean=True)
test_data["predictions"] = preds

end = datetime.now()

print(f"Time elapsed: {end-start} s")
print(f"Predictions done: {len(preds)}")
print(f"Time per prediction: {(end-start)/len(preds)} s")

# Save predictions to file
test_data.to_csv("data/evaluation_preds.csv")
test_data

Time elapsed: 0:00:00.562737 s
Predictions done: 730
Time per prediction: 0:00:00.000771 s


Unnamed: 0,sid,transcript,partial_transcripts,predictions
0,CF8fd2e7ac0e4ff2316bb18a9ffe5e9e68,Your call has been forwarded to an automated v...,(0): You're?/(0.065): Your call./(0.422): Your...,voicemail
1,CF2a9819f31261b93230e2ad68888bb479,Lamancha. This is Carrie. Can I help you?,(0): Ramon./(0.119): Clermont./(0.365): Lamanc...,human
2,CF94166971f53d5b09ac2e411755ead266,"Yes, so let me says hello, you've reached this...",(0): Yes./(0.129): Who was it just for?/(0.255...,voicemail
3,CF13315d4973c7c6333ed31aac7b406f46,Brian toner.,(0): Ryan./(0.245): Ryan tone./(0.265): Brian ...,human
4,CFd61f7d8a06b913c4b2170bebe99b3331,"Hello, darling. Static is not available. Pleas...","(0): Hello./(0.51): Hello./(0.549): Hello, ba...",voicemail
...,...,...,...,...
725,CFa21859e752416d642f9d93aa1c290f66,718 is not available to take your call. Please...,(0): Seven./(0.311): 337./(0.328): 371./(0.73)...,voicemail
726,CF5d58b2164c51a9ec5295e48c16beca5c,Zack Hess is currently unavailable.,(0): Zach./(0.448): Zack Hess./(0.449): Zack./...,voicemail
727,CFf631ae0c49dc94fd43ae77da07c233e5,"Who you've reached, Jessica, Russell, Gilliam,...",(0): Who./(0.626): Who./(0.791): Who you've r...,voicemail
728,CFc52805d7b70ebf85c0a924d3f2ef6749,"Hi, you've reached brooks'. Schaefer. I'm not ...","(0): Hi./(0.011): hi, you've/(0.393): Hi, you'...",voicemail
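
**Reading the timing output**

The timing lines above print `timedelta` objects, so the trailing `s` follows a value formatted as `H:MM:SS.ffffff`. A minimal sketch of turning the same figures into plain numbers is shown below. The elapsed time and the number of predictions are copied from the output; the variable names are illustrative. Since the timer wraps the whole `predict` call with `clean=True`, the figure presumably includes text cleaning and the TF-IDF transform, not only the classifier.

```python
from datetime import timedelta

# Values copied from the cell output above
elapsed = timedelta(microseconds=562737)  # "Time elapsed: 0:00:00.562737"
n_predictions = 730

# Dividing a timedelta by an int gives a timedelta rounded to the microsecond
per_prediction = elapsed / n_predictions

# total_seconds() gives a float that is easier to format and compare
per_prediction_ms = elapsed.total_seconds() * 1000 / n_predictions
throughput = n_predictions / elapsed.total_seconds()

print(per_prediction)                                # 0:00:00.000771
print(f"{per_prediction_ms:.3f} ms per prediction")  # 0.771 ms per prediction
print(f"{throughput:.0f} predictions per second")    # 1297 predictions per second
```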


I wrote a few test messages, some meant to sound like a human and some like a voicemail. Judge the results for yourself.

In [25]:
messages = {
    "transcript": [
        "Hello, this is John Doe's phone number. Please leave a message after the beep.",
        "Hello Joanne! How are you? Long time no speak!",
        "Your call has been transferred to the message box.",
        "It seems to me that this has been a very complex situation for you and your family. Are you OK?"
    ],
    "label": ["voicemail", "human", "voicemail", "human"]
}
messages = pd.DataFrame(messages)
text_col = "transcript"
messages["predictions"] = TrainClassModel.predict(best_model, messages[text_col], clean=True)
messages

Unnamed: 0,transcript,label,predictions
0,"Hello, this is John Doe's phone number. Please...",voicemail,voicemail
1,Hello Joanne! How are you? Long time no speak!,human,human
2,Your call has been transferred to the message ...,voicemail,voicemail
3,It seems to me that this has been a very compl...,human,human
