ords_quest_training.py
#!/usr/bin/env python3
"""
A rudimentary attempt at using Quest data to train an NLP model with scikit-learn.
Quests are citizen-science type microtasks that ask humans to evaluate and classify repair data.
Most quests aim to determine a set of common fault types for a given product category.
See the dat/quests/README.md for details.
Some of the problems with using the quest data for training and validation are:
- Early quests did not filter out very poor quality problem text at all.
- Human evaluation was not always conclusive or accurate when problem text was ambiguous or poorly translated.
- The problem text is multi-lingual and training works best with a single language at a time.
- The fault types (training labels) are imprecise "buckets"; better values could be gleaned after the quest.
Consequently, after cleaning and filtering, the data left for training is not really sufficient.
Nonetheless the quests have been useful learning exercises that could inform future quests.
The following "test" was conducted on the 202303 dataset:
1. Selected quest no. 5 "DustUp" as it has decent quality data compared to previous quests.
2. Removed the validation step and used all of the GBR/USA quest data for training. (738 records)
3. Used the model on all ORA "Vacuum" GBR/USA records. (1446 records)
4. Exported the results to a spreadsheet and manually reviewed the predictions, excluding records used in training.
5. Found that I agreed with 60% of the predictions and that this was roughly reflected across all fault types:
    58.67% Power/battery
    64.17% Blockage
    67.39% Poor data
    58.43% Motor
    65.08% Cable/cord
    55.81% Internal damage
    56.41% Button/switch
    60.61% Brush
    71.43% Filter
    57.89% External damage
    50.00% Hose/tube/pipe
    46.15% Other
    41.67% Overheating
    50.00% Dustbag/canister
    33.33% Accessories/attachments
    50.00% Wheels/rollers
6. Refactored the script to optionally train with "clean" data, i.e. ignoring punctuation and stopwords.
7. Retrained without validation, using all of the GBR/USA quest data.
8. Compared new predictions with opinions given in first manual review.
9. Found that yet again human opinion agreed with 60% of predictions, although the spread had changed slightly.
    60.43% Power/battery (+1.76%)
    63.25% Blockage (-0.92%)
    63.73% Poor data (-3.67%)
    60.00% Motor (+1.57%)
    70.49% Cable/cord (+5.41%)
    55.56% Internal damage (-0.26%)
    55.26% Brush (-5.34%)
    60.61% Button/switch (+4.20%)
    58.33% Filter (-13.10%)
    75.00% Hose/tube/pipe (+25.00%)
    62.50% External damage (+4.61%)
    38.46% Other (-7.69%)
    36.36% Wheels/rollers (-13.64%)
    37.50% Overheating (-4.17%)
    40.00% Accessories/attachments (+6.67%)
    50.00% Dustbag/canister (0.00%)
10. It is worth noting that the % difference represents only a handful of records (<7) for each fault type (avg=0).
11. Conclusion: need more/better data and better `fault_type` labels.
    Would be worth curating a custom vocabulary for each `product_category`.
12. Implemented improved list of stopwords, including corpus-specific, plus improved lemma char filtering.
13. Retrained without validation, using all of the GBR/USA quest data.
14. Compared new predictions with opinions given in first manual review.
15. Found that human opinion agreed with 66% of predictions.
    61.94% Poor data (-5.45%)
    70.18% Power/battery (+11.51%)
    69.52% Blockage (+5.36%)
    75.00% Motor (+16.57%)
    76.27% Cable/cord (+11.19%)
    65.79% Internal damage (+9.98%)
    63.16% Brush (+2.55%)
    70.59% Button/switch (+14.18%)
    65.63% Filter (-5.80%)
    72.22% Hose/tube/pipe (+22.22%)
    58.82% External damage (+0.93%)
    52.94% Other (+6.79%)
    53.33% Wheels/rollers (+3.33%)
    50.00% Overheating (+8.33%)
    22.22% Accessories/attachments (-11.11%)
    14.29% Dustbag/canister (-35.71%)
"""
from funcs import *
import pandas as pd
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn import model_selection
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
from joblib import dump
from joblib import load
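# Prepare the quest data: drop rows with no problem text, filter by country,
# optionally hold out a validation sample, and write the resulting
# training/validation sets to csv.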
# Split value is the fraction to take for validation, e.g. 0.3
def dump_data(data, countries, split=0):
    logger.debug("*** RAW DATA ***")
    logger.debug(len(data))
    # Filter out NaNs in problem column.
    data.dropna(axis="rows", subset=["problem"], inplace=True, ignore_index=True)
    logger.debug("*** NO NAN DATA ***")
    logger.debug(len(data))
    data = data[data["country"].isin(countries)]
    logger.debug("*** COUNTRY DATA ***")
    logger.debug(len(data))
    # Filter for string length. Earlier quests often contained very short problem text.
    # However, this can leave almost no data for training.
    # data = data[(data['problem'].apply(lambda x: len(str(x)) > 8))]
    # logger.debug('*** LENGTH DATA ***')
    # logger.debug(len(data))
    # Select useful columns and rows.
    data = data.reindex(columns=["id_ords", "fault_type", "problem"])
    if split != 0:
        # Take % of the data for validation.
        data_val = data.groupby(["fault_type"]).sample(
            frac=split, replace=False, random_state=1
        )
        logger.debug("*** VALIDATION DATA ***")
        logger.debug(len(data_val))
        logger.debug(len(data_val) / len(data))
        # Remaining % of the data is for training.
        data_train = data.drop(data_val.index)
    else:
        data_val = pd.DataFrame()
        data_train = data
    logger.debug("*** TRAINING DATA ***")
    logger.debug(len(data_train))
    logger.debug(len(data_train) / len(data))
    # Save the data to the 'out' directory in csv format for use later.
    data_train.to_csv(format_path("ords_quest_training_data"), index=False)
    data_val.to_csv(format_path("ords_quest_validation_data"), index=False)
# Most of the quest data returns an alpha of 0.01.
# One of the quests returns a slightly higher alpha.
# Use search=True to check for the best value.
# It will slow down execution considerably.
def get_alpha(data, labels, vects, search=False, refit=False):
    if search:
        # Try out some alpha values to find the best one for this data.
        params = {
            "alpha": [0, 0.001, 0.01, 0.1, 5, 10],
        }
        # Instantiate the search with the model we want to try and fit it on the training data.
        cvval = 5
        if len(data) < cvval:
            cvval = len(data)
        multinomial_nb_grid = model_selection.GridSearchCV(
            MultinomialNB(),
            param_grid=params,
            scoring="f1_macro",
            n_jobs=-1,
            cv=cvval,
            refit=False,
            verbose=2,
        )
        multinomial_nb_grid.fit(vects, labels)
        msg = "** TRAIN {}: classifier best alpha value(s): {}".format(
            quest, multinomial_nb_grid.best_params_
        )
        logger.debug(msg)
        print(msg)
        return multinomial_nb_grid.best_params_["alpha"]
    else:
        return 0.01
# Required if cleaning the training data.
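# Tokenises a document, drops tokens containing punctuation, digits or underscores,
# and reduces the remaining tokens to lemmas (e.g. "batteries" -> "battery").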
class LemmaTokenizer:
    def __init__(self):
        self.wnl = WordNetLemmatizer()
        # Raw string avoids an invalid escape sequence warning.
        self.rx = re.compile(r"[\W\d_]")

    def __call__(self, doc):
        return [
            self.wnl.lemmatize(t) for t in word_tokenize(doc) if (not self.rx.search(t))
        ]
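# Load the general and repair-specific English stopword files and return
# their contents as a single space-separated string.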
def get_stopwords():
    stopfile1 = open(pathfuncs.DATA_DIR + "/stopwords-english.txt", "r")
    stopfile2 = open(pathfuncs.DATA_DIR + "/stopwords-english-repair.txt", "r")
    stoplist = stopfile1.read().replace("\n", " ") + stopfile2.read().replace("\n", " ")
    stopfile1.close()
    stopfile2.close()
    return stoplist
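# Fit a TfidfVectorizer and a MultinomialNB classifier on the quest training data,
# dump both objects to disk and write the training predictions/misses to csv.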
def do_training(data, tokenizer=False, stopwords=False, vocabulary=False):
    column = data.problem
    labels = data.fault_type
    vectorizer = TfidfVectorizer()
    # Use the custom tokenizer if supplied (lemmatise, ignore punctuation and digits).
    if tokenizer != False:
        vectorizer.set_params(tokenizer=tokenizer)
    # Use stop words if supplied; if a tokenizer is used the stop words must be tokenised the same way.
    if stopwords != False:
        if tokenizer != False:
            stopwords = tokenizer(stopwords)
        else:
            # Split the stopword string into words; list() would split it into characters.
            stopwords = stopwords.split()
        vectorizer.set_params(stop_words=stopwords)
    if vocabulary != False:
        # Pass the custom vocabulary as a parameter so that fit_transform() uses it.
        vectorizer.set_params(vocabulary=vocabulary)
        logger.debug(vectorizer.vocabulary)
    logger.debug(vectorizer.get_stop_words())
    feature_vects = vectorizer.fit_transform(column)
    # Get the alpha value. Use search=True to find a good value, or False for default.
    alpha = get_alpha(column, labels, feature_vects, search=False)
    # Other classifiers are available!
    nb_classifier = MultinomialNB(force_alpha=True, alpha=alpha)
    logger.debug("** TRAIN : vectorizer ~ shape **")
    logger.debug(feature_vects.shape)
    logger.debug("** TRAIN : vectorizer ~ feature names **")
    logger.debug(vectorizer.get_feature_names_out())
    # Fit the data.
    nb_classifier.fit(feature_vects, labels)
    logger.debug("** TRAIN : nb_classifier: params **")
    logger.debug(nb_classifier.get_params())
    # Get predictions.
    preds = nb_classifier.predict(feature_vects)
    logger.debug(
        "** TRAIN : nb_classifier: F1 SCORE: {}".format(
            metrics.f1_score(labels, preds, average="macro")
        )
    )
    logger.debug(preds)
    # Save the classifier and vectoriser objects for use later.
    dump(nb_classifier, clsfile)
    dump(vectorizer, tdffile)
    # Save predictions to 'out' directory in csv format.
    data.loc[:, "prediction"] = preds
    data.to_csv(format_path("ords_quest_training_results"), index=False)
    # Save prediction misses.
    misses = data[(data["fault_type"] != data["prediction"])]
    logger.debug(misses)
    misses.to_csv(format_path("ords_quest_training_misses"), index=False)
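# Score the held-out validation data with the dumped classifier/vectoriser
# and write the predictions and misses to csv for inspection.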
def do_validation(data):
    column = data.problem
    labels = data.fault_type
    # Get the classifier and vectoriser that were fitted for this task.
    nb_classifier = load(clsfile)
    logger.debug("** VAL : classifier {}".format(type(nb_classifier)))
    vectorizer = load(tdffile)
    logger.debug("** VAL : vectorizer {}".format(type(vectorizer)))
    # Get the predictions.
    feature_vects = vectorizer.transform(column)
    preds = nb_classifier.predict(feature_vects)
    logger.debug(
        "** VAL : nb_classifier: F1 SCORE: {}".format(
            metrics.f1_score(labels, preds, average="macro")
        )
    )
    logger.debug(metrics.classification_report(labels, preds))
    # Predictions output for inspection.
    data.loc[:, "prediction"] = preds
    data.to_csv(format_path("ords_quest_validation_results"), index=False)
    # Prediction misses for inspection.
    misses = data[(data["fault_type"] != data["prediction"])]
    logger.debug(misses)
    misses.to_csv(format_path("ords_quest_validation_misses"), index=False)
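# Predict fault types for all ORDS records in the given product category and
# countries using the dumped classifier/vectoriser, and write the results to csv.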
def do_test(data, category, countries):
    data.dropna(axis="rows", subset=["problem"], inplace=True, ignore_index=True)
    # copy() avoids a SettingWithCopyWarning when adding the prediction column below.
    data = data[
        (data["product_category"] == category) & (data["country"].isin(countries))
    ].copy()
    column = data.problem
    # Get the classifier and vectoriser that were fitted for this task.
    nb_classifier = load(clsfile)
    logger.debug("** TEST : classifier {}".format(type(nb_classifier)))
    vectorizer = load(tdffile)
    logger.debug("** TEST : vectorizer {}".format(type(vectorizer)))
    # Get the predictions.
    feature_vects = vectorizer.transform(column)
    preds = nb_classifier.predict(feature_vects)
    # Predictions output.
    data.loc[:, "prediction"] = preds
    data.to_csv(format_path("ords_quest_test_results"), index=False)
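# Build an output path of the form <OUT_DIR>/<filename>_<quest>.<ext>.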
def format_path(filename, ext="csv"):
    return "{}/{}_{}.{}".format(pathfuncs.OUT_DIR, filename, quest, ext)
if __name__ == "__main__":
    logger = logfuncs.init_logger(__file__)
    # Available quest datasets.
    # See dat/quests for details.
    quests = [
        tuple(["Compcat-Laptops", "computers/compcat_laptops.csv", "Laptop"]),
        tuple(["Compcat-Desktops", "computers/compcat_laptops.csv", "Desktop PC"]),
        tuple(["MobiFix", "mobiles/mobifix.csv", "Mobiles"]),
        tuple(["PrintCat", "printers/printcat.csv", "Printer/scanner"]),
        tuple(["TabiCat", "tablets/tabicat.csv", "Tablet"]),
        tuple(["DustUp", "vacuums/dustup.csv", "Vacuum"]),
    ]
    # Select a dataset 0 to 5.
    quest, path, category = quests[5]
    logger = logfuncs.init_logger("ords_quest_training_" + quest)
    # Path to the classifier and vectoriser objects for dump/load.
    clsfile = format_path("ords_quest_obj_nbcl", "joblib")
    tdffile = format_path("ords_quest_obj_tdif", "joblib")
    # Quest data is multi-lingual and NLP tends to require a single language.
    # Using English stopwords/vocabulary will require English language text.
    # Filter for subset of records with English language text.
    iso_list = ["GBR", "USA"]
    # Other countries with en lang `problem` (only a few dozen more records):
    # ['AUS', 'IRL', 'ISL', 'JEY', 'NOR', 'NZL', 'SWE']
    # Get vocabulary for `product_category` from list of terms after stopwords extracted.
    # See `ords_extract_vocabulary.py`
    vocabulary = False
    # dfvocab = pd.read_csv(pathfuncs.OUT_DIR + '/ords_vocabulary_problem.csv',
    #     dtype=str)
    # vocabulary = list(dfvocab.loc[dfvocab.category == category].groupby(
    #     'term', sort=False).groups)
    # Use stopwords?
    stopwords = False
    # stopwords = get_stopwords()
    # Use tokenisation?
    tokenizer = False
    # tokenizer = LemmaTokenizer()
    # Initialise the training and (optional) validation datasets.
    data = pd.read_csv(
        pathfuncs.DATA_DIR + "/quests/" + path,
        dtype=str,
        keep_default_na=False,
        na_values="",
    )
    # Do validation step?
    validate = False
    # If validating, take a fraction of the corpus, e.g. 0.3
    splitval = 0
    logger.debug("*** quest={} ***".format(quest))
    logger.debug("*** countries={} ***".format(iso_list))
    logger.debug("*** validate={} ***".format(validate))
    logger.debug("*** splitval={} ***".format(splitval))
    logger.debug("*** vocabulary={} ***".format(vocabulary))
    logger.debug("*** stopwords={} ***".format(stopwords))
    logger.debug("*** tokenizer={} ***".format(tokenizer))
    dump_data(data, iso_list, splitval)
    # Use the dumped training dataset.
    data = pd.read_csv(
        format_path("ords_quest_training_data"),
        dtype=str,
        keep_default_na=False,
        na_values="",
    )
    do_training(data, tokenizer=tokenizer, stopwords=stopwords, vocabulary=vocabulary)
    if validate:
        # Use the dumped validation dataset or provide another dataset.
        data = pd.read_csv(
            format_path("ords_quest_validation_data"),
            dtype=str,
            keep_default_na=False,
            na_values="",
        )
        do_validation(data)
    # Read the entire ORDS export.
    data = pd.read_csv(pathfuncs.path_to_ords_csv(), dtype=str)
    # The function will filter the data for the appropriate product_category.
    # The subset will contain the training and validation data
    # but does not have a fault_type column and id_ords may not match.
    # Early ORDS exports did not have permanent unique identifiers.
    do_test(data, category, iso_list)