paxF3E/Outliers-in-Text-Corpora

Outliers in Text Corpora

  • text corpora usually contain a mix of exhaustive and non-exhaustive documents
  • such a wide mix of genres and themes greatly increases the occurrence of anomalies, which need to be flagged in deployed ML systems
  • this notebook studies and implements such flagging in order to analyze anomalous texts and emerging themes in large NLP corpora
  • the workflow is based on: Transformers, cleanlab, UMAP, and c-TF-IDF from BERTopic
    1. AutoTokenizer and AutoModel from transformers load a pretrained tokenizer and model from the Hugging Face Hub; the tokenizer turns raw text from different genres and themes into tokenized representations, and AutoModel retrieves the matching model architecture
    2. PyTorch datasets are built from preprocessed text datasets on the Hugging Face Hub, and operations over the datasets are performed with torch
    3. OutOfDistribution from cleanlab is used to identify outliers from outlier scores computed with a nearest-neighbour estimator
    4. the results are further processed and visualized with UMAP to reveal anomaly trends, and class-based TF-IDF (c-TF-IDF) is used to further filter the outliers by the weights of words/phrases in the corpus
    5. once the anomalous genres are clustered, the topics within them are identified; these represent the out-of-distribution topics
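
Steps 3 to 5 can be illustrated with a minimal, dependency-free sketch. It is not the notebook's code and does not call cleanlab or BERTopic: `knn_outlier_scores` and `c_tf_idf` are hypothetical helpers, the 2-d vectors stand in for transformer embeddings, and the weighting `tf * log(1 + A / f)` is assumed to follow the c-TF-IDF formulation described by BERTopic. Here a higher score means a more anomalous text; cleanlab's own score convention may differ.

```python
import math
from collections import Counter


def cosine_distance(u, v):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def knn_outlier_scores(train, test, k=2):
    """Mean distance from each test vector to its k nearest training vectors."""
    scores = []
    for x in test:
        dists = sorted(cosine_distance(x, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def c_tf_idf(docs_per_class):
    """Class-based TF-IDF: all documents of a class are merged into one."""
    class_counts = {
        c: Counter(" ".join(docs).lower().split())
        for c, docs in docs_per_class.items()
    }
    # Frequency of each term across all classes.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # Average number of words per class (the "A" in the assumed formula).
    avg_words = sum(total.values()) / len(class_counts)
    weights = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        weights[c] = {
            t: (f / n_words) * math.log(1 + avg_words / total[t])
            for t, f in counts.items()
        }
    return weights


# Toy 2-d "embeddings" standing in for transformer sentence embeddings.
train_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]
test_embeddings = [[1.0, 0.02], [0.0, 1.0]]
scores = knn_outlier_scores(train_embeddings, test_embeddings, k=2)

# Toy clusters of outlier texts (illustrative data, not from the notebook).
toy_clusters = {
    "cluster_a": ["the match ended in a draw", "the team won the match"],
    "cluster_b": ["the senate passed the bill", "the bill became law"],
}
weights = c_tf_idf(toy_clusters)
top_terms = {c: max(w, key=w.get) for c, w in weights.items()}
```

On this toy data the second test vector, which points away from all training vectors, receives the larger score, and the highest-weighted term of each cluster is a content word rather than the shared word "the".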

About

identify and model topics in clustered genres in documents for NLI corpora
