Tasks for the Advanced Natural Language Processing course at ITMO University.
- Data cleaning and duplicates detection
- Quora duplicates detection
- Attention mechanism and BERT
- Fine-tuning and classifying documents into 10 topic categories
- Topic Modeling
This file contains:
- Data cleaning:
  - Remove non-English words
  - Remove HTML tags (try a regular expression, or experiment with the BeautifulSoup library)
  - Apply lemmatization / stemming
  - Remove stop words
- Duplicates detection using LSH
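
The cleaning and LSH steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code from the homework itself: the stop-word list, the two-word shingles, the MD5-based MinHash, and all function names are assumptions made for the example. Lemmatization is left out, and keeping only ASCII-letter tokens is a crude stand-in for removing non-English words.

```python
import hashlib
import re
from collections import defaultdict
from itertools import combinations

TAG_RE = re.compile(r"<[^>]+>")
# Tiny illustrative stop-word list (assumption); a real run would use a full list.
STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in"}


def clean(text):
    """Strip HTML tags, lowercase, keep ASCII-letter tokens, drop stop words."""
    text = TAG_RE.sub(" ", text).lower()
    tokens = re.findall(r"[a-z]+", text)
    return [t for t in tokens if t not in STOP_WORDS]


def shingles(tokens, k=2):
    """Return the set of k-word shingles of a token list."""
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash_signature(shingle_set, num_hashes=32):
    """One minimum per seeded hash function; equal sets give equal signatures."""
    return [
        min(
            (int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
             for s in shingle_set),
            default=0,  # empty documents get a constant signature
        )
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(signatures, bands=8):
    """Split signatures into bands; documents sharing a band are candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[(b, band)].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


docs = {
    "a": "<p>The cat sat on the mat</p>",
    "b": "<div>The cat sat on the mat</div>",
    "c": "Stock markets fell sharply today",
}
signatures = {k: minhash_signature(shingles(clean(v))) for k, v in docs.items()}
candidates = lsh_candidate_pairs(signatures)
```

Documents `a` and `b` differ only in their HTML tags, so after cleaning they share every shingle and land in the same buckets. Candidate pairs would normally be checked afterwards with an exact Jaccard similarity.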
The task in this file is to build an LSTM-based Siamese network and search for duplicates in the Quora Question Pairs dataset.
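
As a rough sketch of what such a network can look like, the example below uses PyTorch, which is an assumption since the text does not name a framework. The class name, the layer sizes, and the Manhattan-distance similarity are all illustrative choices rather than the homework's actual design. For brevity, padding tokens are fed through the LSTM instead of being masked with packed sequences.

```python
import torch
import torch.nn as nn


class SiameseLSTM(nn.Module):
    """Encodes both questions with one shared LSTM and compares the encodings."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def encode(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        return hidden[-1]                         # (batch, hidden_dim)

    def forward(self, left, right):
        # Similarity exp(-L1 distance): 1.0 for identical encodings, near 0 when far apart.
        distance = torch.abs(self.encode(left) - self.encode(right)).sum(dim=1)
        return torch.exp(-distance)


torch.manual_seed(0)
model = SiameseLSTM(vocab_size=100)
model.eval()

# Hypothetical token ids; row 0 is an identical pair, row 1 is a different pair.
left = torch.tensor([[5, 8, 2, 0], [7, 7, 3, 1]])
right = torch.tensor([[5, 8, 2, 0], [9, 4, 6, 2]])
with torch.no_grad():
    similarity = model(left, right)
```

Because both inputs pass through the same weights, an identical pair always gets similarity 1.0. Training would typically minimize a loss such as mean squared error between this score and the 0/1 duplicate label.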
The task in this file is to implement an attention mechanism with NumPy and to use pre-trained models for text processing.
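
One common form of attention is scaled dot-product attention. The sketch below assumes that variant (the text does not say which one the homework uses), and the function names and random inputs are illustrative only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(query, key, value):
    """Return the attended values and the attention weights.

    query: (n_queries, d_k), key: (n_keys, d_k), value: (n_keys, d_v)
    """
    d_k = query.shape[-1]
    scores = query @ key.T / np.sqrt(d_k)   # (n_queries, n_keys)
    weights = softmax(scores, axis=-1)      # each row sums to 1
    return weights @ value, weights


rng = np.random.default_rng(0)
query = rng.normal(size=(3, 4))
key = rng.normal(size=(5, 4))
value = rng.normal(size=(5, 2))
output, weights = scaled_dot_product_attention(query, key, value)
```

Each output row is a weighted average of the value rows, with weights given by how well the query matches each key.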
The task in this file is to classify documents into 10 topic categories using the Hugging Face Datasets library.
In this file I applied topic modeling with NMF (using sklearn.decomposition.NMF) and with LDA (using the gensim implementation), and evaluated the results with two quality functions: coherence and normalized PMI.
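
To make the normalized PMI measure concrete, here is a small standalone sketch. It estimates probabilities from document-level co-occurrence, which is a simplifying assumption (library implementations such as gensim's typically use sliding windows), and the function name and toy corpus are hypothetical.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    NPMI(w1, w2) = log(p12 / (p1 * p2)) / -log(p12), which lies in [-1, 1].
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)   # the words never co-occur
        elif p12 == 1:
            scores.append(0.0)    # both words are in every document; NPMI is undefined, use 0
        else:
            scores.append(math.log(p12 / (p1 * p2)) / -math.log(p12))
    return sum(scores) / len(scores)


documents = [
    ["cat", "dog", "pet"],
    ["cat", "dog"],
    ["stock", "market"],
    ["stock", "market", "trade"],
]
coherent = npmi_coherence(["cat", "dog"], documents)
incoherent = npmi_coherence(["cat", "stock"], documents)
```

Words that always appear together score close to 1, while words that never share a document score -1, so a higher average suggests a more coherent topic.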