# **Natural Language Processing**
by **Tafseer Ahmed**, Mohammad Ali Jinnah University



We will create the following for Urdu
*   Named Entity Recognizer
*   Word Embedding



# Named Entity Recognizer (NER)

NER locate and classify named entity mentions in unstructured text into pre-defined categories, such as:


*   Person
*   Location
*   Organization
*   Date
*   ...





 

**Using NER**

spaCy is a free open-source library for Natural Language Processing in Python.

In [0]:
#!pip install -U spacy
# Uncomment the above statement, if you are not in colab and spacy is not installed there
import spacy

Loading model of English language

In [0]:
nlp = spacy.load('en_core_web_sm')

Finding Named Entities 

In [0]:
doc = nlp(u'Sarfaraz Ahmed has been retained as Pakistan captain while Babar Azam has been named as the vice captain \
            for the home series against Sri Lanka, a press release by Pakistan Cricket Board (PCB) said. \
            The final squads will be named on September 23.')
[(ent.text, ent.label_) for ent in doc.ents]

[('Sarfaraz Ahmed', 'PERSON'),
 ('Pakistan', 'GPE'),
 ('Babar Azam', 'ORG'),
 ('Sri Lanka', 'GPE'),
 ('Pakistan Cricket Board', 'ORG'),
 ('PCB', 'ORG'),
 ('September 23', 'DATE')]



```
Note that it does not recognize Babar Azam correctly. Hence, we cannot always rely on the avialble libraries. We need to train NERs for different domains and languages.
Here, we will train an NER for Urdu.
```



**Creating NER for Urdu**





**Reading Data from the Google Drive**



In [0]:
from google.colab import drive

# This will prompt for authorization.
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Note about the data:**
```
The dataset is a chunk of the dataset retrieved from 
http://ltrc.iiit.ac.in/ner-ssea-08/index.cgi?topic=5
(Workshop on NER for South and South East Asian Languages in IJCNLP 2008 at Hyderabad, India)

The data is reformatted and enriched. Please refer/acknowledge in your paper/report to the original source, if you use this data.
```



In [0]:


import csv
with open('/content/drive/My Drive/pycon-tafseer/urdu-ner/conll-ner.csv', encoding = 'utf-8') as csvfile:
    data = list(csv.reader(csvfile, delimiter='\t'))
print(data[0])

['1', 'زیرتربیت', 'زيرتربیت', 'NOUN', 'NNC', '', 'O', '']




```
The format of the above output is:
word-no, word, lemma/root-word, universal-part-of-speech, other-part-of-speech,empty, IBO-tag, Entity-type

The first word of a named entity is tagged as B(eginning), 
the other words of the named entity are tagged as I(ntermediate),
and the words not belonging to named entity are tagged as O(ther)

For example
Bill ... B Person
Gates ... I Person
founded ... O
Microsoft ... B Organization

```




Extracting features

In [0]:
def extract_features(words, i):
    wid = words[i][0]

    token = words[i][1]
    upos = words[i][3]
    xpos = words[i][4]
    
    prev_token = ""
    prev_upos = ""
    prev_xpos = "" 
    
    next_token = ""
    next_upos = ""
    next_xpos = ""
    
    
    if int(wid) != 1: 
        prev_token = words[i-1][1]
        prev_upos = words[i-1][3]
        prev_xpos = words[i-1][4] 
    if  i < len(words)-1:
        if int(wid) < int(words[i+1][0]): 
            next_token = words[i+1][1]
            next_upos = words[i+1][3]
            next_xpos = words[i+1][4]

    is_number = False
    try:
        if float(token):
            is_number = True
    except:
        pass
    
    features_dict = {"token": token
           , "upos": upos
            , "xpos": xpos          
           , "prev_token": prev_token
            , "prev_upos": next_upos  
            , "prev_xpos": next_xpos
            , "next_token": next_token
           , "next_upos": next_upos
            , "next_xpos": next_xpos
        , "is_number": is_number}
    return features_dict

print(data[3:6])
print(extract_features(data, 4))


[['4', 'کی', 'کا', 'ADP', 'PSP', '', 'O', ''], ['5', 'تربیت', 'تربیت', 'NOUN', 'NN', '', 'O', ''], ['6', 'میں', 'میں', 'ADP', 'PSP', '', 'O', '']]
{'token': 'تربیت', 'upos': 'NOUN', 'xpos': 'NN', 'prev_token': 'کی', 'prev_upos': 'ADP', 'prev_xpos': 'PSP', 'next_token': 'میں', 'next_upos': 'ADP', 'next_xpos': 'PSP', 'is_number': False}


Converting feature vector for each word

In [0]:
X_features = []
Y = []


for i in range(len(data)):
    try:
        X_features.append(extract_features(data, i))
        Y.append(data[i][6])
    except:
        pass


print(X_features[4])


print(len(X_features),":",len(Y))

{'token': 'تربیت', 'upos': 'NOUN', 'xpos': 'NN', 'prev_token': 'کی', 'prev_upos': 'ADP', 'prev_xpos': 'PSP', 'next_token': 'میں', 'next_upos': 'ADP', 'next_xpos': 'PSP', 'is_number': False}
14792 : 14792


In [0]:
print(Y[2:5])

['O', 'O', 'O']


Currently many features have string data, we convert it into numeric vectors with a column for each string

In [0]:
from sklearn.feature_extraction import DictVectorizer
vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(X_features)

print("number of features: ", len(X[3]))
print(X[3])


number of features:  8714
[0. 0. 0. ... 0. 0. 0.]


Making training set and training the classifier

In [0]:
from sklearn import model_selection
from sklearn import svm            # import support vector machine

cl = svm.LinearSVC()
validation_size = 0.20
seed = 7


from sklearn.ensemble import RandomForestClassifier
#cl = RandomForestClassifier()


X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, stratify = Y, test_size=validation_size, random_state=seed)
cl.fit(X_train, Y_train)


LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
          verbose=0)

In [0]:
print(cl.predict(X_test[0:3]))

['O' 'O' 'NEA']


In [0]:
print(Y_test[0:3])

['O', 'O', 'NEA']


In [0]:
from sklearn import metrics
X_predict = cl.predict(X_test)
print(metrics.classification_report(X_predict , Y_test))
print(metrics.confusion_matrix(X_predict , Y_test))


              precision    recall  f1-score   support

         NEA       0.72      0.93      0.81        14
         NED       0.54      0.62      0.58        24
         NEL       0.92      0.91      0.91       170
         NEM       0.76      0.85      0.80        60
         NEN       0.91      0.90      0.91        70
         NEO       0.60      0.83      0.70        18
         NEP       0.89      0.89      0.89       128
        NETE       0.50      1.00      0.67         5
        NETI       0.86      0.97      0.91        37
        NETO       0.00      0.00      0.00         1
        NETP       0.83      0.83      0.83         6
           O       0.99      0.97      0.98      2426

    accuracy                           0.96      2959
   macro avg       0.71      0.81      0.75      2959
weighted avg       0.96      0.96      0.96      2959

[[  13    0    0    0    0    0    1    0    0    0    0    0]
 [   0   15    0    0    0    0    0    0    0    0    0    9]
 [   0 

# Word Embedding

The words can be represnted as feature vectors in such a way that the words having "similar" (related) meaning are present near to each other.

The (mini) Corpus consists of Hadith books and Bible.

In [0]:
import gensim
import nltk

reader = nltk.corpus.PlaintextCorpusReader("/content/drive/My Drive/pycon-tafseer/Corpus",'.*\.txt')
text = reader.raw()
print(reader.fileids())

['BibleOT-1.txt', 'BibleOT-2.txt', 'IbnMaja.txt', 'IbneKatheer-2.txt', 'Ibnekatheer-1.txt', 'Mishkat-1.txt', 'Mishkat-2.txt', 'Mishkat-3.txt', 'Sahih_Muslim.txt', 'SunanAbuDawood2.txt', 'SunanNisai 1.txt', 'SunanNisai2.txt']


Text is split into sentences

In [0]:
import re

sentences = re.split(r'[۔؟]',text)

print(len(sentences))
print(sentences[6])



108539
 3 تب خدا نے  کہا "روشنی ہو جا"تو روشنی ہو گئی


Sentences are split into words

In [0]:
from nltk.tokenize import WordPunctTokenizer
word_punct_tokenizer = WordPunctTokenizer()


s = []
for i in range(len(sentences)):
    s.append(word_punct_tokenizer.tokenize(sentences[i]))

print(s[6])

['3', 'تب', 'خدا', 'نے', 'کہا', '"', 'روشنی', 'ہو', 'جا', '"', 'تو', 'روشنی', 'ہو', 'گئی']


Word2Vec model is trained

In [0]:
model = gensim.models.Word2Vec(s, min_count=5, size = 20)
model.wv.most_similar('سورج')

  if np.issubdtype(vec.dtype, np.int):


[('آفتاب', 0.9327265024185181),
 ('طلوع', 0.924567699432373),
 ('غروب', 0.9115254282951355),
 ('گرہن', 0.8704209327697754),
 ('صبح', 0.7990034222602844),
 ('چاند', 0.7939923405647278),
 ('موسم', 0.7647169232368469),
 ('منی', 0.7602494955062866),
 ('قیام', 0.7539945840835571),
 ('عرفات', 0.7458064556121826)]

Printing related words

In [0]:
model.wv.most_similar('فجر')

  if np.issubdtype(vec.dtype, np.int):


[('عصر', 0.978093147277832),
 ('عشاء', 0.9616297483444214),
 ('مغرب', 0.9334694147109985),
 ('ظہر', 0.9296648502349854),
 ('سنتیں', 0.8735674023628235),
 ('سنتوں', 0.8643820881843567),
 ('قنوت', 0.8530187606811523),
 ('وتر', 0.8439244031906128),
 ('چاشت', 0.8286170959472656),
 ('تہجد', 0.8248888254165649)]

In [0]:
model.wv.doesnt_match(["رمضان" ,"شوال" ,"رجب" ,"جمعہ"])


  vectors = vstack(self.word_vec(word, use_norm=True) for word in used_words).astype(REAL)
  if np.issubdtype(vec.dtype, np.int):


'جمعہ'