Naive-Bayes based classification of Telegram chat messages.
.gitignore
README.md
extract_messages.py
fasttext_preprocess.py
fasttext_train_test.py
run_ruby.sh
run_tg_server.sh
test.py
train.py
train_classifier.py
utils.py

tg-chat-classification

Machine Learning approach to classify the sender of Telegram chat messages

Steps (high-level overview)

  1. Download Telegram message history using tvdstaaij/telegram-history-dump together with vysheng/tg
  2. Extract message texts and message senders and dump them to JSON file (extract_messages.py)
  3. Drop words shorter than a minimum length, remove stop words and build (text, label) tuples, where label is the name of the message's sender and text is the respective message text (train.py::get_messages()).
  4. Define features: one binary feature (message contains it / does not contain it) for every word in the entire message collection, as well as for every bigram and trigram (train_classifier.py::compute_train_test()). NOTE: To keep the computation tractable, only the top X most frequent words, bigrams and trigrams are actually used as features.
  5. Compute the feature vector for every message (n-dimensional binary vector with n = number of features = number of top X most frequent words, bigrams and trigrams)
  6. Shuffle feature set. Divide into training set and test set (test set of length 5000).
  7. Train an nltk.NaiveBayesClassifier
  8. Test and compute accuracy (train.py::print_classifier_stats)
  9. Dump trained classifier as well as feature list.
  10. Optional: Classify some hand-picked, new, unseen message (test.py)
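
Steps 3 to 6 can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from train.py or train_classifier.py: the function names, the tiny stop-word set and the sample messages are all made up for the example, and whitespace tokenization is an assumption.

```python
import random
from collections import Counter

# Tiny hypothetical stop-word list; the project uses NLTK's German list instead.
STOPWORDS = {"und", "der", "die", "das"}


def tokenize(text, min_len=2):
    """Lowercase, split on whitespace, drop short tokens and stop words."""
    return [w for w in text.lower().split()
            if len(w) >= min_len and w not in STOPWORDS]


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def all_grams(tokens, max_n=3):
    """Unigrams, bigrams and trigrams of one message."""
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]


def top_features(messages, top_x):
    """The top_x most frequent n-grams over the whole message collection."""
    counts = Counter()
    for text, _label in messages:
        counts.update(all_grams(tokenize(text)))
    return [gram for gram, _count in counts.most_common(top_x)]


def featurize(text, features):
    """Binary feature dict: does the message contain each selected n-gram?"""
    present = set(all_grams(tokenize(text)))
    return {" ".join(gram): gram in present for gram in features}


def train_test_split(feature_set, test_size, seed=0):
    """Shuffle a copy and cut off the last test_size items as the test set."""
    shuffled = list(feature_set)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:-test_size], shuffled[-test_size:]


# Made-up sample data: (text, label) tuples as built in step 3.
messages = [
    ("guten morgen und hallo", "alice"),
    ("hallo wie geht es dir", "bob"),
    ("guten morgen zusammen", "alice"),
    ("wie geht es", "bob"),
]
features = top_features(messages, top_x=10)
feature_set = [(featurize(text, features), label) for text, label in messages]
train_set, test_set = train_test_split(feature_set, test_size=1)
```

A list of (feature dict, label) pairs like `train_set` is the input shape that `nltk.NaiveBayesClassifier.train` accepts, and `nltk.classify.accuracy(classifier, test_set)` computes the accuracy on the held-out part.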

Optimization factors

  • Training set size (number of messages for every chat partner)
  • Feature vector dimensionality (number of words and ngrams to be used as binary features)
  • Use single words, bigrams, trigrams or all of them
  • Removing stopwords
  • Minimum word length (2, 3, ...)
  • ...
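
One way to explore these factors systematically is to enumerate every combination and train one classifier per configuration. The sketch below only generates the configurations; the parameter names and candidate values are hypothetical and chosen for illustration, and training set size is left out for brevity.

```python
from itertools import product

# Hypothetical search space; the concrete values are illustrative only.
grid = {
    "top_x": [1000, 5000],             # feature vector dimensionality
    "max_ngram": [1, 2, 3],            # words only, up to bigrams, up to trigrams
    "remove_stopwords": [True, False],
    "min_word_len": [2, 3],
}


def configurations(grid):
    """Yield one dict per combination of the candidate values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


configs = list(configurations(grid))
```

Since a single training run took several minutes in this project, a full grid quickly becomes expensive, so trying a handful of configurations by hand may be the more realistic option.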

Best result

  • |C| = 4 (four classes, including three chat partners and myself)
  • |training set| = 37257 (messages from chat partners: 7931, 9795, 9314, 10217)
  • |test set| = 5000
  • |features| = 5000 (3735 single words, 1201 bigrams, 64 trigrams)
  • min(|w|) = 2 (minimum word length of 2, including Unicode emojis)
  • remove German stopwords (nltk.corpus.stopwords.words('german'))

... led to ...

  • Accuracy: 0.61 = 61 %
  • Training time: 348.26 sec
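
As a quick sanity check, the figures above can be recomputed and compared with two trivial baselines. The majority-class baseline assumes that the class proportions in the test set resemble the per-partner message counts listed above, which the README does not state explicitly.

```python
# Numbers copied from the "Best result" list above.
per_class = [7931, 9795, 9314, 10217]
feature_counts = {"words": 3735, "bigrams": 1201, "trigrams": 64}

total_messages = sum(per_class)                  # should match |training set|
total_features = sum(feature_counts.values())    # should match |features|

# Accuracy of guessing uniformly at random among the four classes.
uniform_baseline = 1 / len(per_class)

# Accuracy of always predicting the most frequent sender
# (assumption: the test set has the same class proportions).
majority_baseline = max(per_class) / total_messages

print(total_messages, total_features)
print(f"uniform: {uniform_baseline:.3f}, majority: {majority_baseline:.3f}")
```

Both baselines land at roughly a quarter, so the reported accuracy of 0.61 is well above either of them.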

Comparison to fastText

As a comparison baseline I also trained a fastText classifier (fasttext_preprocess.py converts the messages to fastText's input format; fasttext_train_test.py trains fastText, makes predictions on the previously extracted test data and computes accuracy). This led to an accuracy of 0.6, about the same as the Naive Bayes classifier, but with a far shorter training time of 0.66 sec instead of 348.26 sec.
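
For supervised training, fastText by default expects one example per line, with the label prefixed by `__label__`. The following is a minimal sketch of such a conversion; it is not the actual fasttext_preprocess.py, and the function names and the choice to replace spaces in labels with underscores are assumptions.

```python
def to_fasttext_line(text, label):
    """One training line: '__label__<label> <text>', with no newlines inside."""
    # Labels must not contain whitespace, so spaces become underscores.
    clean_label = label.strip().replace(" ", "_")
    # Collapse all whitespace (including newlines) to single spaces.
    clean_text = " ".join(text.split())
    return f"__label__{clean_label} {clean_text}"


def write_fasttext_file(messages, path):
    """Write (text, label) tuples to a file in fastText's input format."""
    with open(path, "w", encoding="utf-8") as fh:
        for text, label in messages:
            fh.write(to_fasttext_line(text, label) + "\n")
```

With the official Python bindings, a file written this way can be passed to `fasttext.train_supervised(input=path)`, and `model.test(test_path)` reports the number of examples together with precision and recall at one; whether fasttext_train_test.py uses these bindings or the command-line tool is not stated here.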

Conclusion

61 % accuracy certainly does not make for a very reliable classifier, but it is at least clearly better than random guessing (a chance of 1/4 in this case). Although I have only very basic knowledge of machine learning (this project is more or less my first practical experiment in that area), I suppose that learning a person's chat writing style is much harder than detecting the sentiment of a tweet (that article originally inspired me) or classifying news headlines into categories. In fact, learning a chatting style is hard not only for a machine, but also for a human. Given a chat message without any semantic context, could you tell which of your friends sent it? Probably not. The practical relevance of this project isn't very high anyway, but it was good practice for getting into the basics of ML.

Authors