Hidden Markov Model-based POS tagging for 60+ languages on Universal Dependencies (UD) data
In this notebook (HMM_UD.ipynb), we will use the Pomegranate library to build a simple Hidden Markov Model for part-of-speech tagging.
The goal here is breadth rather than depth: we want to cover as many languages annotated with the UD tagset as possible, so we did not implement additional features such as:
- Laplace smoothing (Wiki link)
- Backoff smoothing (Speech & Language Processing, Ch. 4, 9, 10)
- Extending to trigrams (trigram paper)
In this notebook, we will tag parts of speech in various languages using a Hidden Markov Model (HMM).
The Pomegranate library is used here.
Our goal is breadth rather than depth, so the following features are not included:
- Laplace smoothing (Wikipedia)
- Backoff smoothing (textbook, Ch. 4, 9, 10)
- Trigrams (research paper)
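For readers curious about the first omitted feature, the following is a minimal sketch of Laplace (add-alpha) smoothing applied to HMM emission probabilities. It is not part of this notebook and does not use Pomegranate. The function name `emission_probs`, its `alpha` parameter, and the choice to reserve a single vocabulary slot for all unseen words are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def emission_probs(tagged_sentences, alpha=1.0):
    """Return a function prob(word, tag) estimating P(word | tag).

    Hypothetical helper: `tagged_sentences` is an iterable of sentences,
    each a list of (word, tag) pairs. `alpha` should be positive.
    """
    counts = defaultdict(Counter)  # tag -> word -> count
    vocab = set()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[tag][word] += 1
            vocab.add(word)

    # Assumption: one extra vocabulary slot stands in for every unseen word.
    vocab_size = len(vocab) + 1

    def prob(word, tag):
        tag_total = sum(counts[tag].values())
        # Add alpha to every count so unseen words get non-zero probability.
        return (counts[tag][word] + alpha) / (tag_total + alpha * vocab_size)

    return prob
```

With smoothing of this kind, a word that never occurred with a tag in training still receives a small non-zero emission probability, so a single unknown word does not force the probability of an entire tag sequence to zero.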
We scan the entire UD folder to read the names of the language subdirectories and prune any dataset that has no train set. A missing dev set is tolerated: because our HMM implementation has no iterative training, dev sets are merged into the training set anyway.
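The scan described above could look roughly like the sketch below. It is an illustration rather than the code in data_prep.py: the function name `find_trainable_treebanks` is hypothetical, and it assumes the usual UD release layout, in which each treebank directory holds files ending in `-ud-train.conllu`, `-ud-dev.conllu`, and `-ud-test.conllu`.

```python
from pathlib import Path


def find_trainable_treebanks(ud_root):
    """Map each treebank directory name to its train/dev/test file paths.

    Hypothetical helper: treebanks without a train file are dropped,
    while a missing dev or test file is recorded as None.
    """
    treebanks = {}
    for subdir in sorted(Path(ud_root).iterdir()):
        if not subdir.is_dir():
            continue
        splits = {}
        for split in ("train", "dev", "test"):
            # Assumed naming convention: <code>-ud-<split>.conllu
            matches = sorted(subdir.glob(f"*-ud-{split}.conllu"))
            splits[split] = matches[0] if matches else None
        if splits["train"] is None:
            continue  # prune: nothing to train on
        treebanks[subdir.name] = splits
    return treebanks
```

A caller can then concatenate the sentences from `splits["train"]` and, when it is not `None`, `splits["dev"]` before fitting the model.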
We need the following libraries:
- Pomegranate
- NumPy
- collections (part of the Python standard library, so no installation is needed)
- pyconll
In addition, helper functions are found in data_prep.py and hmm_utils.py. Make sure you have these files in the same directory as this notebook!
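A quick way to confirm the setup before running the notebook is sketched below. The function name `missing_requirements` and its parameters are hypothetical; the package import names (`pomegranate`, `numpy`, `pyconll`) and helper file names are taken from the requirements listed above.

```python
import importlib.util
from pathlib import Path

REQUIRED_PACKAGES = ("pomegranate", "numpy", "pyconll")
HELPER_FILES = ("data_prep.py", "hmm_utils.py")


def missing_requirements(notebook_dir=".", packages=REQUIRED_PACKAGES,
                         helpers=HELPER_FILES):
    """Return the names of packages and helper files that cannot be found."""
    # find_spec returns None when a top-level package is not importable.
    missing = [name for name in packages
               if importlib.util.find_spec(name) is None]
    # Helper modules must sit next to the notebook.
    missing += [name for name in helpers
                if not (Path(notebook_dir) / name).is_file()]
    return missing
```

Calling `missing_requirements()` from the notebook's directory returns an empty list when everything is in place.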