hrushikesh-dhumal/nlp
nlp

Boilerplate natural language processing

Concepts

  • Word embedding, skip-gram: takes a word and a window of k words around it on each side to build 2k contexts. Usually k = 5 is selected.
  • Word embedding, negative sampling: to speed up training, instead of computing the similarity of a word against all possible contexts, only a few randomly selected negative contexts are scored.
  • Word Mover’s Distance (WMD): exploits the property of word2vec embeddings that distances between embedded word vectors are, to some degree, semantically meaningful. A text document is represented as a weighted point cloud of embedded words. The distance between two documents A and B is the minimum cumulative distance that the words of document A need to travel to match exactly the point cloud of document B. The optimization problem underlying WMD reduces to a special case of the well-studied Earth Mover’s Distance. Remove stop words when using WMD, because it focuses on the few important words in both documents. When clustering, WMD achieves higher accuracy on short texts and performs poorly on lengthy text, as per 1 and 2.
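The first two bullets can be sketched in plain Python. This is only an illustration of how skip-gram context pairs and negative samples are collected, not a trained model; the toy sentence, window size k, and sample count are made up for the example:

```python
import random

def skip_gram_pairs(tokens, k=2):
    """Pair each center word with the up-to-2k context words
    found within a window of k positions on either side."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - k), min(len(tokens), i + k + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_samples(vocab, true_context, n=3):
    """Instead of scoring the center word against every word in the
    vocabulary, draw a few random 'negative' contexts to contrast
    with the one true context."""
    candidates = [w for w in vocab if w != true_context]
    return random.sample(candidates, n)

tokens = "the quick brown fox jumps".split()
pairs = skip_gram_pairs(tokens, k=2)
negs = negative_samples(tokens, "quick", n=2)
```

In a full skip-gram model each (center, context) pair would be a positive training example, and each negative sample a contrasting example pushed apart from the center word's vector.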
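The WMD bullet can also be made concrete in a tiny special case. With two equal-length documents and uniform word weights, the underlying Earth Mover's problem reduces to a minimum-cost assignment, which the sketch below solves by brute force over permutations. The 2D "embeddings" are invented for illustration; real implementations use word2vec vectors and solve the general linear program (e.g. gensim's `wmdistance`):

```python
from itertools import permutations

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def wmd_brute_force(doc_a, doc_b, embed):
    """WMD for two equal-length documents with uniform word weights:
    find the word-to-word matching that minimizes total travel
    distance by trying every permutation (feasible only for toy sizes)."""
    n = len(doc_a)
    assert len(doc_b) == n, "this sketch assumes equal-length documents"
    best = float("inf")
    for perm in permutations(range(n)):
        cost = sum(euclidean(embed[doc_a[i]], embed[doc_b[p]])
                   for i, p in enumerate(perm)) / n
        best = min(best, cost)
    return best

# Toy 2D embeddings: semantically close words get nearby points.
embed = {
    "obama":     (1.0, 1.0),
    "president": (1.2, 0.9),
    "speaks":    (0.0, 0.0),
    "greets":    (0.1, 0.2),
    "media":     (2.0, 2.0),
    "press":     (2.1, 1.9),
}
d = wmd_brute_force(["obama", "speaks", "media"],
                    ["president", "greets", "press"], embed)
```

Note that stop words are absent from the toy documents, matching the advice above: WMD works on the few content-bearing words of each document.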
