In [0]:
import datetime

# "خوش آمدید" is Urdu for "Welcome"
print("خوش آمدید", datetime.datetime.now())

خوش آمدید 2020-02-13 05:12:50.999867




# **Urdu Named Entity Recognition**
by **Tafseer Ahmed**

# **Part 1: Using Libraries**


*   **SpaCy** (for NER)
*   **StanfordNLP** (for Urdu PoS)



## Named Entity Recognizer (NER)
NER locates named-entity mentions in unstructured text and classifies them into pre-defined categories, such as:


*   Person
*   Location
*   Organization
*   Date
*   ...

# **Using NER**
SpaCy is a free open-source library for Natural Language Processing in Python.

In [0]:
# !pip install -U spacy
# Uncomment the line above if you are not on Colab and spaCy is not installed.
import spacy

Loading the English language model

In [0]:
nlp = spacy.load('en_core_web_sm')

Finding Named Entities

In [0]:
doc = nlp(u'Sarfaraz Ahmed has been retained as Pakistan captain while Babar Azam has been named as the vice captain \
            for the home series against Sri Lanka, a press release by Pakistan Cricket Board (PCB) said. \
            The final squads will be named on September 23.')
[(ent.text, ent.label_) for ent in doc.ents]

[('Sarfaraz Ahmed', 'PERSON'),
 ('Pakistan', 'GPE'),
 ('Babar Azam', 'ORG'),
 ('Sri Lanka', 'GPE'),
 ('Pakistan Cricket Board', 'ORG'),
 ('PCB', 'ORG'),
 ('September 23', 'DATE')]

Python syntax: the same list built with an explicit loop instead of a list comprehension

In [0]:
#[(ent.text, ent.label_) for ent in doc.ents]
NElist = []
for ent in doc.ents:
  NElist.append((ent.text, ent.label_))
print(NElist)

[('Sarfaraz Ahmed', 'PERSON'), ('Pakistan', 'GPE'), ('Babar Azam', 'ORG'), ('Sri Lanka', 'GPE'), ('Pakistan Cricket Board', 'ORG'), ('PCB', 'ORG'), ('September 23', 'DATE')]
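Once the entities are in a plain list of `(text, label)` tuples, ordinary Python is enough to reorganise them. The sketch below groups the entities by label. It hard-codes the list printed above so that it runs without spaCy. The helper name `group_by_label` is made up for this illustration. The small English model tagged 'Babar Azam' as ORG, and the grouping simply inherits that label.

```python
from collections import defaultdict

# Entities copied from the output above (including the model's ORG label for 'Babar Azam').
NElist = [('Sarfaraz Ahmed', 'PERSON'), ('Pakistan', 'GPE'), ('Babar Azam', 'ORG'),
          ('Sri Lanka', 'GPE'), ('Pakistan Cricket Board', 'ORG'), ('PCB', 'ORG'),
          ('September 23', 'DATE')]


def group_by_label(entities):
    """Return a dict mapping each entity label to the list of entity texts."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


entities_by_label = group_by_label(NElist)
for label, texts in entities_by_label.items():
    print(label, ":", texts)
```

With a live spaCy `doc`, the same helper could be called on `[(ent.text, ent.label_) for ent in doc.ents]`.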


## Part of Speech (PoS)

*   Universal Tagset (ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, SYM, VERB, X)
*   Language Specific Tagset
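
As a quick reference, the universal tags listed above can be kept in a small lookup table. This is only a convenience sketch: the descriptions follow the Universal Dependencies names for each tag, and `UPOS_DESCRIPTIONS` and `describe_upos` are hypothetical names introduced here, not part of any library.

```python
# Short descriptions of the 17 universal PoS tags listed above.
UPOS_DESCRIPTIONS = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary",
    "CCONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}


def describe_upos(tag):
    """Return a readable description of a universal PoS tag."""
    return UPOS_DESCRIPTIONS.get(tag, "unknown tag")


for tag in ["NOUN", "ADP", "VERB", "AUX", "PUNCT"]:
    print(tag, "->", describe_upos(tag))
```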

An Urdu PoS tagger is available in StanfordNLP.

Installing StanfordNLP


In [16]:
!pip install StanfordNLP

Collecting StanfordNLP
  Downloading https://files.pythonhosted.org/packages/41/bf/5d2898febb6e993fcccd90484cba3c46353658511a41430012e901824e94/stanfordnlp-0.2.0-py3-none-any.whl (158kB)

Downloading the Urdu model




In [17]:
import stanfordnlp
stanfordnlp.download('ur')

Using the default treebank "ur_udtb" for language "ur".
Would you like to download the models for: ur_udtb now? (Y/n)
y

Default download directory: /root/stanfordnlp_resources
Hit enter to continue or type an alternate directory.


Downloading models for: ur_udtb
Download location: /root/stanfordnlp_resources/ur_udtb_models.zip


100%|██████████| 152M/152M [00:33<00:00, 4.40MB/s]



Download complete.  Models saved to: /root/stanfordnlp_resources/ur_udtb_models.zip
Extracting models file for: ur_udtb
Cleaning up...Done.


Running the StanfordNLP pipeline on an Urdu sentence

In [22]:
nlp = stanfordnlp.Pipeline(lang="ur")
doc = nlp("لڑکیاں لائبریری میں کتابیں پڑھ رہی تھیں۔")

Use device: cpu
---
Loading: tokenize
With settings: 
{'model_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb_tokenizer.pt', 'lang': 'ur', 'shorthand': 'ur_udtb', 'mode': 'predict'}
---
Loading: pos
With settings: 
{'model_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb_tagger.pt', 'pretrain_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb.pretrain.pt', 'lang': 'ur', 'shorthand': 'ur_udtb', 'mode': 'predict'}
---
Loading: lemma
With settings: 
{'model_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb_lemmatizer.pt', 'lang': 'ur', 'shorthand': 'ur_udtb', 'mode': 'predict'}
Building an attentional Seq2Seq model...
Using a Bi-LSTM encoder
Using soft attention for LSTM.
Finetune all embeddings.
[Running seq2seq lemmatizer with edit classifier]
---
Loading: depparse
With settings: 
{'model_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb_parser.pt', 'pretrain_path': '/root/stanfordnlp_resources/ur_udtb_models/ur_udtb.pretrain.pt', 'lang': '



Linguistic features of each word



In [34]:
for s in doc.sentences:
  for w in s.words:
    print(w.text, " : ", w)

لڑکیاں  :  <Word index=1;text=لڑکیاں;lemma=لڑکی;upos=NOUN;xpos=NN;feats=Case=Nom|Gender=Masc|Number=Plur|Person=3;governor=5;dependency_relation=nsubj>
لائبریری  :  <Word index=2;text=لائبریری;lemma=لائبریری;upos=NOUN;xpos=NN;feats=Case=Acc|Gender=Fem|Number=Sing|Person=3;governor=5;dependency_relation=advmod>
میں  :  <Word index=3;text=میں;lemma=میں;upos=ADP;xpos=PSP;feats=AdpType=Post;governor=2;dependency_relation=case>
کتابیں  :  <Word index=4;text=کتابیں;lemma=کتاب;upos=NOUN;xpos=NN;feats=Case=Nom|Gender=Masc|Number=Plur|Person=3;governor=5;dependency_relation=obj>
پڑھ  :  <Word index=5;text=پڑھ;lemma=پڑھ;upos=VERB;xpos=VM;feats=Voice=Act;governor=0;dependency_relation=root>
رہی  :  <Word index=6;text=رہی;lemma=رہ;upos=AUX;xpos=VAUX;feats=Aspect=Perf|Gender=Fem|Number=Sing|VerbForm=Part;governor=5;dependency_relation=aux>
تھیں  :  <Word index=7;text=تھیں;lemma=تھا;upos=AUX;xpos=VAUX;feats=Gender=Fem|Mood=Ind|Number=Plur|Person=3|Tense=Past|VerbForm=Fin;governor=6;dependency_relati
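
The `feats` field in the output above packs the morphological features into one string of `Name=Value` pairs separated by `|`. A minimal sketch of splitting such a string into a dictionary follows. `parse_feats` is a hypothetical helper, not a StanfordNLP function, and treating `_` or an empty value as "no features" is an assumption based on the CoNLL-U convention.

```python
def parse_feats(feats):
    """Turn 'Case=Nom|Number=Plur' into {'Case': 'Nom', 'Number': 'Plur'}."""
    # Assumption: '_' or an empty/None value means the word has no features.
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


# Feature string copied from the first word in the output above.
features = parse_feats("Case=Nom|Gender=Masc|Number=Plur|Person=3")
print(features["Number"])
print(features)
```

With a live pipeline result, the same helper could be applied to `w.feats` inside the loop over `s.words`.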

**Exercise:** Print the UPoS and XPoS tags of each word in the following format.

word =  لڑکیاں 		,univ pos :  NOUN 		 ,other pos :  NN

word =  لائبریری 		,univ pos :  NOUN 		 ,other pos :  NN

word =  میں 		,univ pos :  ADP 		 ,other pos :  PSP

word =  کتابیں 		,univ pos :  NOUN 		 ,other pos :  NN

word =  پڑھ 		,univ pos :  VERB 		 ,other pos :  VM

word =  رہی 		,univ pos :  AUX 		 ,other pos :  VAUX

word =  تھیں 		,univ pos :  AUX 		 ,other pos :  VAUX

word =  ۔ 		,univ pos :  PUNCT 		 ,other pos :  SYM



In [0]:
#write code here