Our first-place solution to the AI IJC competition
The team round included several tasks:
- Predicting aggressive driving among taxi drivers using the following data types: the taxi route, the customer's comment about the ride, and other tabular data about the order.
- Extracting a reason for aggressive driving from comments
- Detecting parts of taxi routes with aggressive driving
Final round task:
- Building a clustering system for taxi drivers, which could in the future be used as a reward system
About the data:
- There were only two labels in the dataset: `0`, which meant non-aggressive driving, and `1`, which meant aggressive driving. The dataset was imbalanced, with about 96% `0` labels and only 4% `1` labels. These labels were given only for the first task.
- The dataset didn't have comments for every ride, which made it hard to train big supervised models. However, we had a whole dataset of unlabeled comments, which we were able to use.
Solution for the 1st task:
- The best score was achieved with a text model, sismetanin/xlm_roberta_base-ru-sentiment-rusentiment. However, the stack also included XGBoost with Optuna and a GCN.
- We did EDA and removed short or meaningless comments such as "Да" ("Yes"), "Нет" ("No"), "." etc.
- We implemented UDA (Unsupervised Data Augmentation) in order to utilize the unlabeled samples. Reference: original PyTorch UDA implementation.
- We used back-translation through English as the augmentation for unlabeled data.
- We chose cross-entropy as the supervised loss and KL divergence as the unsupervised loss.
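
The loss combination described above can be illustrated with a small, framework-free sketch. It is not the team's training code: the helper names, the toy logits, and the weighting factor `lam` are assumptions made for illustration, and a real setup would compute the same quantities on PyTorch tensors inside the training loop. The supervised term is cross-entropy on labeled comments; the unsupervised term is the KL divergence between the model's prediction on an unlabeled comment and its prediction on the back-translated version of that comment.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Supervised loss for one labeled example: -log p(label)."""
    return -math.log(softmax(logits)[label])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def uda_loss(labeled, unlabeled, lam=1.0):
    """Combine supervised and consistency losses.

    labeled:   list of (logits, label) pairs
    unlabeled: list of (logits_original, logits_augmented) pairs
    lam:       hypothetical weight of the unsupervised term
    """
    sup = sum(cross_entropy(lg, y) for lg, y in labeled) / len(labeled)
    # In UDA the prediction on the original text acts as a fixed target;
    # gradients would flow only through the augmented branch.
    unsup = sum(
        kl_divergence(softmax(orig), softmax(aug)) for orig, aug in unlabeled
    ) / len(unlabeled)
    return sup + lam * unsup, sup, unsup


# Toy logits for the two classes (0 = non-aggressive, 1 = aggressive).
labeled_batch = [([2.0, -1.0], 0), ([0.5, 1.5], 1)]
unlabeled_batch = [([1.0, 0.0], [0.8, 0.1])]
total, sup, unsup = uda_loss(labeled_batch, unlabeled_batch)
print(f"supervised={sup:.4f} unsupervised={unsup:.4f} total={total:.4f}")
```

The unsupervised term is zero when the original and back-translated comments receive identical predictions, so it only penalizes the model for being inconsistent under paraphrase.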
Solution for the 2nd task:
- Delete meaningless samples from the text dataset
- Use a QA model, AlexKay/xlm-roberta-large-qa-multilingual-finedtuned-ru, by asking it "как выражалось агрессивное вождение?" ("How was the aggressive driving expressed?")
Solution for the 3rd task:
- Training a GCN model for task 1
- Using the model to score driving aggressiveness at every node individually
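
One way to turn per-node scores into route parts is sketched below. This post-processing step is an assumption, not something described above: the function name, the `threshold`, and the `min_len` parameter are hypothetical, and the scores are taken as a plain list of floats ordered along the route, however the GCN produced them.

```python
def aggressive_segments(node_scores, threshold=0.5, min_len=2):
    """Group consecutive high-scoring nodes into route segments.

    node_scores: per-node aggressiveness scores, ordered along the route
    threshold:   hypothetical cut-off above which a node counts as aggressive
    min_len:     hypothetical minimum number of consecutive nodes per segment
    Returns a list of (start_index, end_index) pairs, both inclusive.
    """
    segments = []
    start = None
    for i, score in enumerate(node_scores):
        if score >= threshold:
            if start is None:
                start = i  # a new run of aggressive nodes begins here
        elif start is not None:
            if i - start >= min_len:
                segments.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the route.
    if start is not None and len(node_scores) - start >= min_len:
        segments.append((start, len(node_scores) - 1))
    return segments


# Toy scores for a seven-node route.
scores = [0.1, 0.7, 0.9, 0.2, 0.8, 0.6, 0.95]
print(aggressive_segments(scores))
```

Requiring a minimum run length is a simple way to suppress isolated high-scoring nodes; whether that is desirable depends on how noisy the per-node scores are.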