Machine Learning approach to classify the sender of Telegram chat messages
Steps (high-level overview)
- Download Telegram message history using tvdstaaij/telegram-history-dump together with vysheng/tg
- Extract message texts and message senders and dump them to a JSON file
- Filter words by minimum length, remove stop words and build (text, label) tuples, where label is the name of the message's sender and text is the respective message text.
- Compute features: a binary feature (contains / does not contain) for every word in the entire message collection, as well as for every bigram and trigram (train_classifier.py::compute_train_test()). NOTE: In practice, only the top X most frequent words, bigrams and trigrams are used as features, to keep the complexity manageable.
- Compute features for every message (n-dimensional vector with n = number of features = number of top X most frequent words, bigrams and trigrams)
- Shuffle feature set. Divide into training set and test set (test set of length 5000).
- Train nltk.NaiveBayesClassifier classifier
- Test and compute accuracy
- Dump trained classifier as well as feature list.
- Optional: Classify some hand-picked, new, unseen messages
- Training set size (number of messages for every chat partner)
- Feature vector dimensionality (number of words and ngrams to be used as binary features)
- Use single words, bigrams, trigrams or all of them
- Removing stopwords
- Minimum word length (2, 3, ...)
- |C| = 4 (four classes, including three chat partners and myself)
- |training set| = 37257 (messages from chat partners: 7931, 9795, 9314, 10217)
- |test set| = 5000
- |features| = 5000 (3735 single words, 1201 bigrams, 64 trigrams)
- min(|w|) = 2 (minimum word length of 2, including Unicode emojis)
- remove German stopwords
... led to ...
- Accuracy: 0.61 = 61 %
- Training time: 348.26 sec
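
To make the feature steps and parameters above more concrete, here is a minimal sketch in plain Python. It is not the code from train_classifier.py: the helper names (`tokenize`, `top_terms`, `featurize`, `train_test_split`), the tiny stop word set and the sample messages are all made up for illustration. It only shows the idea of filtering words, picking the top X most frequent words, bigrams and trigrams, and turning every message into a dict of binary "contains" features.

```python
import random
from collections import Counter

# Tiny illustrative subset, NOT the German stop word list used in the project.
STOPWORDS = {"und", "der", "die", "das", "ich"}

def tokenize(text, min_len=2, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, drop short words and stop words."""
    return [w for w in text.lower().split()
            if len(w) >= min_len and w not in stopwords]

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_terms(tokenized_messages, top_x):
    """The top_x most frequent unigrams, bigrams and trigrams overall."""
    counts = Counter()
    for tokens in tokenized_messages:
        for n in (1, 2, 3):
            counts.update(ngrams(tokens, n))
    return [term for term, _ in counts.most_common(top_x)]

def featurize(tokens, vocabulary):
    """One binary 'contains' feature per vocabulary term."""
    present = set()
    for n in (1, 2, 3):
        present.update(ngrams(tokens, n))
    return {"contains(%s)" % " ".join(term): term in present
            for term in vocabulary}

def train_test_split(samples, test_size, seed=0):
    """Shuffle, then cut off the first test_size samples as test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]

# Hypothetical sample data: (text, label) tuples.
messages = [
    ("kommst du heute abend", "alice"),
    ("ich bin heute abend da", "bob"),
    ("heute abend passt gut", "alice"),
    ("bis morgen dann", "carol"),
]
tokenized = [(tokenize(text), label) for text, label in messages]
vocabulary = top_terms([tokens for tokens, _ in tokenized], top_x=5)
featuresets = [(featurize(tokens, vocabulary), label)
               for tokens, label in tokenized]
train_set, test_set = train_test_split(featuresets, test_size=1)
```

A list of (feature dict, label) pairs like `train_set` is the input shape that `nltk.NaiveBayesClassifier.train()` expects, and `nltk.classify.accuracy()` can then be evaluated on `test_set`.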
Comparison to fastText
As a comparison baseline, I've also trained a fastText classifier (fasttext_preprocess.py preprocesses the messages into a fastText-compliant input format; fasttext_train_test.py trains fastText, makes predictions on the previously extracted test data and computes the accuracy).
This led to an accuracy of 0.6, but with a much shorter training time of 0.66 sec.
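
For readers unfamiliar with the fastText input format: supervised fastText expects one sample per line, with the label prefixed by `__label__` (fastText's default label prefix) followed by the text. The following is only a rough sketch of what such preprocessing might look like, not the content of fasttext_preprocess.py; the function names and the sample sender name are assumptions.

```python
def to_fasttext_line(text, label):
    """Format one (text, label) sample as a fastText training line."""
    # Labels must not contain whitespace, so replace spaces in sender names.
    clean_label = label.strip().replace(" ", "_")
    # Collapse newlines and repeated whitespace: one sample per line.
    clean_text = " ".join(text.split())
    return "__label__%s %s" % (clean_label, clean_text)

def write_fasttext_file(samples, path):
    """Write (text, label) tuples to a file that fastText can train on."""
    with open(path, "w", encoding="utf-8") as f:
        for text, label in samples:
            f.write(to_fasttext_line(text, label) + "\n")

# Hypothetical example
line = to_fasttext_line("hallo  welt\n", "Max Mustermann")
```

The resulting file could then be passed to fastText's supervised training, for example via `fasttext.train_supervised(input=path)` in the current official Python bindings; the project's own scripts may do this differently.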
61 % is certainly not a very reliable classifier, but it is at least clearly better than random guessing (a chance of 1/4 in this case). Although I have only very basic knowledge of machine learning (this project is more or less my first practical experiment in that area), I'd suppose that learning a person's chat writing style is much harder than detecting the sentiment of a tweet (that article originally inspired me) or classifying news headlines into categories. In fact, it's hard not only for a machine to learn a chatting style, but also for a human. Given a chat message without any semantic context, could you tell which of your friends sent it? Probably not. The practical relevance of this project isn't very high anyway, but it was good practice for me to get into the basics of ML.
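
As a small sanity check on the "better than guessing" claim, the snippet below compares the reported accuracy with two trivial baselines, computed from the per-partner message counts listed above. The majority-class baseline (always predicting the most frequent sender) is my addition here and not something the original experiment measured; it also assumes the test set has roughly the same class distribution as the training set.

```python
# Messages per chat partner in the training set (from the numbers above).
class_counts = [7931, 9795, 9314, 10217]
total = sum(class_counts)

uniform_baseline = 1 / len(class_counts)      # guess one of 4 classes at random
majority_baseline = max(class_counts) / total # always guess the largest class

reported_accuracy = 0.61
print("uniform guessing:  %.3f" % uniform_baseline)
print("majority class:    %.3f" % majority_baseline)
print("reported accuracy: %.3f" % reported_accuracy)
```

Both baselines stay well below the reported 0.61, since the four classes are fairly balanced.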
- Ferdinand Mütsch, 2017