# NLTK - Natural Language Toolkit

<b>NLTK</b> is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, and wrappers for industrial-strength NLP libraries.

URL : https://www.nltk.org/

### Importing libraries

In [12]:
import nltk, string

### Testing data

In [13]:
english_text = """Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as \"algebraic objects\". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before."""

In [14]:
english_text

'Perhaps one of the most significant advances made by Arabic mathematics began at this time with the work of al-Khwarizmi, namely the beginnings of algebra. It is important to understand just how significant this new idea was. It was a revolutionary move away from the Greek concept of mathematics which was essentially geometry. Algebra was a unifying theory which allowedrational numbers,irrational numbers, geometrical magnitudes, etc., to all be treated as "algebraic objects". It gave mathematics a whole new development path so much broader in concept to that which had existed before, and provided a vehicle for future development of the subject. Another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before.'

In [72]:
arabic_text ="""ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة, الجبر کان نظرية موحدة تتيح الأعداد الكسرية والأعداد اللا كسرية, والمقادير الهندسية وغيرها, أن تتعامل على أنها أجسام جبرية, وأعطت الرياضيات ككل مسارا جديدا للتطور بمفهوم أوسع بكثير من الذي كان موجودا من قبل, وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر وهو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل"""

In [74]:
arabic_text

'ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي وهي بدايات الجبر, ومن المهم فهم كيف كانت هذه الفكرة الجديدة مهمة, فقد كانت خطوة نورية بعيدا عن المفهوم اليوناني للرياضيات التي هي في جوهرها هندسة, الجبر کان نظرية موحدة تتيح الأعداد الكسرية والأعداد اللا كسرية, والمقادير الهندسية وغيرها, أن تتعامل على أنها أجسام جبرية, وأعطت الرياضيات ككل مسارا جديدا للتطور بمفهوم أوسع بكثير من الذي كان موجودا من قبل, وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر وهو أنه سمح بتطبيق الرياضيات على نفسها بطريقة لم تحدث من قبل'

### Data Cleaning

#### Lower Case

In [67]:
english_text= english_text.lower()

#### Eliminate Punctuation

In [18]:
#english_text_clear = ''.join([x for x in english_text if x not in string.punctuation])
#arabic_text_clear = ''.join([x for x in arabic_text if x not in string.punctuation])

In [19]:
#print(english_text_clear)
#print(arabic_text_clear)

perhaps one of the most significant advances made by arabic mathematics began at this time with the work of alkhwarizmi namely the beginnings of algebra it is important to understand just how significant this new idea was it was a revolutionary move away from the greek concept of mathematics which was essentially geometry algebra was a unifying theory which allowedrational numbersirrational numbers geometrical magnitudes etc to all be treated as algebraic objects it gave mathematics a whole new development path so much broader in concept to that which had existed before and provided a vehicle for future development of the subject another important aspect of the introduction of algebraic ideas was that it allowed mathematics to be applied to itselfin a way which had not happened before
ربما كانت أحد أهم التطورات التي قامت بها الرياضيات العربية التي بدأت في هذا الوقت بعمل الخوارزمي و هي بدايات الجبر، و من المهم فهم كيف كانت هذه الفكرة الجديدة مهمة، فقد كانت خطوة ثورية بعيدا عن المفهوم ال
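The commented-out comprehension above only checks `string.punctuation`, which covers ASCII marks, so Arabic marks such as `،` would be left in place. A minimal sketch of one alternative, based on `str.translate`, follows. The names `ARABIC_PUNCTUATION` and `strip_punctuation` are illustrative and not part of NLTK, and the set of Arabic marks is an assumption to adjust for your data.

```python
import string

# Assumption: these are the Arabic marks worth stripping; extend as needed.
ARABIC_PUNCTUATION = "،؛؟"


def strip_punctuation(text: str) -> str:
    """Remove ASCII punctuation and a few Arabic punctuation marks."""
    table = str.maketrans("", "", string.punctuation + ARABIC_PUNCTUATION)
    return text.translate(table)


sample_en = 'al-khwarizmi, namely the beginnings of "algebra".'
sample_ar = "بدايات الجبر، ومن المهم"

print(strip_punctuation(sample_en))
print(strip_punctuation(sample_ar))
```

Note that hyphens are removed too, which is why `al-khwarizmi` turns into `alkhwarizmi` in the output above.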

### Tokenization

In [76]:
from nltk.tokenize import sent_tokenize, word_tokenize

#### Tokenizing by word

In [77]:
eng_words = word_tokenize(english_text)
ar_words = nltk.word_tokenize(arabic_text)

Using word_tokenize() to split up our text into words:<br>

       - Perhaps 
       - one 
       - of 
       - the 
       - most

But NLTK also considers these strings to be words:

       - ','
       - '.'

<b>The same applies to the Arabic text</b>, e.g.:

         'أحد',
         'أهم',
         'التطورات',
         'التي',
         'قامت',
         'بها',
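Because punctuation marks come back as tokens, it can help to separate them from the real words before further processing. The sketch below works on a hand-copied sample of tokens like those shown above, so it needs no NLTK data. `split_words_and_punctuation` is a hypothetical helper name, and "contains at least one letter or digit" is only one possible definition of a word.

```python
def split_words_and_punctuation(tokens):
    """Separate tokens containing a letter or digit from pure punctuation."""
    words, punctuation = [], []
    for tok in tokens:
        if any(ch.isalnum() for ch in tok):
            words.append(tok)
        else:
            punctuation.append(tok)
    return words, punctuation


# Hand-copied sample of tokens like those word_tokenize() returned above.
tokens = ["perhaps", "one", ",", "al-khwarizmi", ".", "``", "algebraic", "''", "أحد", "أهم"]

words, punctuation = split_words_and_punctuation(tokens)
print(words)
print(punctuation)
```

Checking characters with `str.isalnum()` keeps hyphenated tokens such as `al-khwarizmi` and works for Arabic letters as well.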

#### Tokenizing by sentence

In [78]:
eng_sent = sent_tokenize(english_text)
ar_sent = sent_tokenize(arabic_text)

The <b>sent_tokenize()</b> function splits the text into sentences:

        - 'perhaps one of the most significant advances made by arabic mathematics began at this time with the work of al-khwarizmi, namely the beginnings of algebra.'
        - 'it is important to understand just how significant this new idea was.'
The same applies to the Arabic text.
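To get a feel for what sentence splitting does on the Arabic sample, here is a rough stand-in that splits on sentence-final punctuation followed by whitespace. This is not the Punkt algorithm that `sent_tokenize()` uses; it is a naive regex sketch, and `naive_sentences` is a hypothetical helper name.

```python
import re

# Split after ., !, ? or the Arabic question mark when whitespace follows.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?؟])\s+")


def naive_sentences(text):
    """Very rough splitter; unlike Punkt it knows nothing about abbreviations."""
    return [part for part in SENTENCE_BOUNDARY.split(text.strip()) if part]


sample = "وقم وسيلة للتنمية في هذا الموضوع مستقبلا. وجانب آخر مهم لإدخال أفكار الجبر"

for sentence in naive_sentences(sample):
    print(sentence)
```

Since the Arabic sample separates most clauses with commas rather than full stops, any punctuation-based splitter will return only a few long sentences for it.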
        
          

### Filtering Stop Words

Stop words are words that we want to ignore, so we filter them out of our text when we’re processing it.

In [55]:
nltk.download("stopwords")
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\BASH_TOOR\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [80]:
stop_words_eng = set(stopwords.words("english"))
stop_words_ar = set(stopwords.words("arabic"))

In [83]:
filter_english_text = []
for word in eng_words:
    if word.casefold() not in stop_words_eng:
        filter_english_text.append(word)

# Arabic Text using List Comprehension
filter_arabic_text = [
    word for word in ar_words if word not in stop_words_ar
]

filter_arabic_text

['ربما',
 'كانت',
 'أحد',
 'أهم',
 'التطورات',
 'قامت',
 'الرياضيات',
 'العربية',
 'بدأت',
 'الوقت',
 'بعمل',
 'الخوارزمي',
 'وهي',
 'بدايات',
 'الجبر',
 ',',
 'المهم',
 'فهم',
 'كانت',
 'الفكرة',
 'الجديدة',
 'مهمة',
 ',',
 'فقد',
 'كانت',
 'خطوة',
 'نورية',
 'بعيدا',
 'المفهوم',
 'اليوناني',
 'للرياضيات',
 'جوهرها',
 'هندسة',
 ',',
 'الجبر',
 'کان',
 'نظرية',
 'موحدة',
 'تتيح',
 'الأعداد',
 'الكسرية',
 'والأعداد',
 'اللا',
 'كسرية',
 ',',
 'والمقادير',
 'الهندسية',
 'وغيرها',
 ',',
 'تتعامل',
 'أنها',
 'أجسام',
 'جبرية',
 ',',
 'وأعطت',
 'الرياضيات',
 'ككل',
 'مسارا',
 'جديدا',
 'للتطور',
 'بمفهوم',
 'أوسع',
 'بكثير',
 'كان',
 'موجودا',
 'قبل',
 ',',
 'وقم',
 'وسيلة',
 'للتنمية',
 'الموضوع',
 'مستقبلا',
 '.',
 'وجانب',
 'آخر',
 'مهم',
 'لإدخال',
 'أفكار',
 'الجبر',
 'أنه',
 'سمح',
 'بتطبيق',
 'الرياضيات',
 'نفسها',
 'بطريقة',
 'تحدث',
 'قبل']
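The filtering step can be wrapped in a small reusable function. The sketch below uses a tiny hand-written stop list so that it runs without downloading anything. `DEMO_STOP_WORDS` and `remove_stop_words` are illustrative names, and in the notebook you would pass `stop_words_eng` or `stop_words_ar` instead.

```python
# Hypothetical mini stop list; the notebook uses stopwords.words("english").
DEMO_STOP_WORDS = {"of", "the", "it", "is", "to", "was", "a"}


def remove_stop_words(tokens, stop_words):
    """Drop tokens whose case-folded form appears in stop_words."""
    return [tok for tok in tokens if tok.casefold() not in stop_words]


tokens = ["It", "was", "a", "revolutionary", "move", "away", "from", "the", "Greek", "concept"]
filtered = remove_stop_words(tokens, DEMO_STOP_WORDS)
print(filtered)
```

Comparing the case-folded token keeps the original casing in the result while still matching capitalised stop words such as `It`.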

### Stemming English Text

Stemming <b>reduces words to their root</b>, which is the core part of a word.

In [141]:
from nltk.stem import PorterStemmer

In [94]:
# Instantiate a PorterStemmer object

stem = PorterStemmer()

In [117]:
english_words_stem = [stem.stem(word) for word in filter_english_text]
english_words_stem

['perhap',
 'one',
 'signific',
 'advanc',
 'made',
 'arab',
 'mathemat',
 'began',
 'time',
 'work',
 'al-khwarizmi',
 ',',
 'name',
 'begin',
 'algebra',
 '.',
 'import',
 'understand',
 'signific',
 'new',
 'idea',
 '.',
 'revolutionari',
 'move',
 'away',
 'greek',
 'concept',
 'mathemat',
 'essenti',
 'geometri',
 '.',
 'algebra',
 'unifi',
 'theori',
 'allowedr',
 'number',
 ',',
 'irrat',
 'number',
 ',',
 'geometr',
 'magnitud',
 ',',
 'etc.',
 ',',
 'treat',
 '``',
 'algebra',
 'object',
 "''",
 '.',
 'gave',
 'mathemat',
 'whole',
 'new',
 'develop',
 'path',
 'much',
 'broader',
 'concept',
 'exist',
 ',',
 'provid',
 'vehicl',
 'futur',
 'develop',
 'subject',
 '.',
 'anoth',
 'import',
 'aspect',
 'introduct',
 'algebra',
 'idea',
 'allow',
 'mathemat',
 'appli',
 'itselfin',
 'way',
 'happen',
 '.']

### Stemming Arabic Text

In [115]:
from snowballstemmer import stemmer

In [139]:
ar_stemmer = stemmer("arabic")

arabic_words_stem = [ar_stemmer.stemWord(ar_word) for ar_word in filter_arabic_text]
arabic_words_stem

['ربم',
 'كان',
 'احد',
 'اهم',
 'تطور',
 'قام',
 'رياض',
 'عرب',
 'بدء',
 'وقت',
 'عمل',
 'خوارزم',
 'وه',
 'دايا',
 'جبر',
 ',',
 'مهم',
 'فهم',
 'كان',
 'فكر',
 'جديد',
 'مهم',
 ',',
 'فقد',
 'كان',
 'خطو',
 'نور',
 'عيد',
 'مفهوم',
 'يونان',
 'رياض',
 'جوهر',
 'هندس',
 ',',
 'جبر',
 'کان',
 'نظر',
 'موحد',
 'تتيح',
 'اعداد',
 'كسر',
 'والاعداد',
 'اللا',
 'كسر',
 ',',
 'والمقادير',
 'هندس',
 'غير',
 ',',
 'تتعامل',
 'انه',
 'اجسام',
 'جبر',
 ',',
 'اعط',
 'رياض',
 'ككل',
 'مسار',
 'جديد',
 'تطور',
 'مفهوم',
 'اوسع',
 'كثير',
 'كان',
 'موجود',
 'قبل',
 ',',
 'وقم',
 'سيل',
 'تنم',
 'موضوع',
 'مستقبل',
 '.',
 'جانب',
 'اخر',
 'مهم',
 'لادخال',
 'افكار',
 'جبر',
 'انه',
 'سمح',
 'تطبيق',
 'رياض',
 'نفس',
 'طريق',
 'تحدث',
 'قبل']

After taking a look at the result, some of the output is definitely correct, e.g.:

      
      -  التطورات -  تطور
      -  الجديدة -  جديد
      - للتطور -   تطور 

On the other hand, the stemmer gives wrong results for some words:

        - ربم
        - رياض
        - خطو
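NLTK itself also ships an Arabic stemmer, `ISRIStemmer`, which needs no extra download. Running a few of the words above through it gives a second opinion to compare with the Snowball results. This is only a comparison sketch: the word list is hand-picked from the notebook's tokens, and no claim is made here about which stemmer is more accurate.

```python
from nltk.stem.isri import ISRIStemmer

isri = ISRIStemmer()

# A few tokens taken from the filtered Arabic list above.
words = ["التطورات", "الرياضيات", "خطوة", "ربما"]
isri_stems = [isri.stem(word) for word in words]

for word, stem in zip(words, isri_stems):
    print(word, "->", stem)
```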


<b>Understemming</b> happens when two related words should be reduced to the same stem but aren’t. This is a false negative.

<b>Overstemming</b> happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.
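One way to look for either problem is to group words by the stem they receive and read through the groups. Unrelated words in one group suggest overstemming, and related words split across groups suggest understemming. A minimal sketch with `PorterStemmer` follows; `group_by_stem` is a hypothetical helper and the word list is chosen only for illustration.

```python
from collections import defaultdict

from nltk.stem import PorterStemmer


def group_by_stem(words, stemmer):
    """Return a dict mapping each stem to the words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stemmer.stem(word)].append(word)
    return dict(groups)


porter = PorterStemmer()
sample_words = ["algebra", "algebraic", "mathematics", "development", "developed"]
groups = group_by_stem(sample_words, porter)

for stem, members in groups.items():
    print(stem, members)
```

The same helper works with any object that has a `stem(word)` method, so it can be reused to inspect other stemmers.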

### Tagging Parts of Speech

POS tagging is the task of labeling the words in your text according to their part of speech.

In [174]:
eng_pos_tag = nltk.pos_tag(eng_words)

In [144]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

In [168]:
ar_pos_tag = nltk.pos_tag(ar_words) 
# As widely reported, this does not work for Arabic: every token is tagged NNP.
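The tagger returns ordinary `(word, tag)` tuples, so the result can be explored with plain Python. The sketch below counts tags and pulls out the nouns. The pairs are hand-copied from the start of the tagged English text as it appears in the chunker output later in this notebook, so no tagger model is needed to run it.

```python
from collections import Counter

# First few (word, tag) pairs of the tagged English text.
tagged = [
    ("perhaps", "RB"), ("one", "CD"), ("of", "IN"), ("the", "DT"),
    ("most", "RBS"), ("significant", "JJ"), ("advances", "NNS"),
    ("made", "VBN"), ("by", "IN"), ("arabic", "JJ"),
    ("mathematics", "NNS"), ("began", "VBD"),
]

# How often each tag occurs.
tag_counts = Counter(tag for _word, tag in tagged)
print(tag_counts.most_common(3))

# All noun tags in the Penn Treebank tagset start with "NN".
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)
```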

### Lemmatizing

A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

In [157]:
from nltk.stem import WordNetLemmatizer

In [158]:
lemmatizer = WordNetLemmatizer()

In [163]:
lemmatized_words = [lemmatizer.lemmatize(word) for word in eng_words]

In [161]:
lemmatized_words

['perhaps',
 'one',
 'of',
 'the',
 'most',
 'significant',
 'advance',
 'made',
 'by',
 'arabic',
 'mathematics',
 'began',
 'at',
 'this',
 'time',
 'with',
 'the',
 'work',
 'of',
 'al-khwarizmi',
 ',',
 'namely',
 'the',
 'beginning',
 'of',
 'algebra',
 '.',
 'it',
 'is',
 'important',
 'to',
 'understand',
 'just',
 'how',
 'significant',
 'this',
 'new',
 'idea',
 'wa',
 '.',
 'it',
 'wa',
 'a',
 'revolutionary',
 'move',
 'away',
 'from',
 'the',
 'greek',
 'concept',
 'of',
 'mathematics',
 'which',
 'wa',
 'essentially',
 'geometry',
 '.',
 'algebra',
 'wa',
 'a',
 'unifying',
 'theory',
 'which',
 'allowedrational',
 'number',
 ',',
 'irrational',
 'number',
 ',',
 'geometrical',
 'magnitude',
 ',',
 'etc.',
 ',',
 'to',
 'all',
 'be',
 'treated',
 'a',
 '``',
 'algebraic',
 'object',
 "''",
 '.',
 'it',
 'gave',
 'mathematics',
 'a',
 'whole',
 'new',
 'development',
 'path',
 'so',
 'much',
 'broader',
 'in',
 'concept',
 'to',
 'that',
 'which',
 'had',
 'existed',
 'befo
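The output above contains `wa` for "was" and `a` for "as". This happens because `lemmatize()` treats every word as a noun unless told otherwise, so a trailing "s" is stripped as if it were a plural ending. Passing the part of speech usually helps. The sketch below maps Penn Treebank tags to the one-letter codes WordNet expects; `penn_to_wordnet` and `lemmatize_tagged` are hypothetical helpers, and the mapping is a common simplification rather than anything NLTK prescribes.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code, defaulting to noun."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun, and the fallback for every other tag


def lemmatize_tagged(tagged_words, lemmatizer):
    """Lemmatize (word, tag) pairs, passing each word's POS to the lemmatizer."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_words
    ]


# Usage, assuming the notebook's `lemmatizer` and `eng_pos_tag` are defined:
# lemmatized_words = lemmatize_tagged(eng_pos_tag, lemmatizer)
print(penn_to_wordnet("VBD"), penn_to_wordnet("NNS"), penn_to_wordnet("JJ"))
```

With the verb code, "was" is normally lemmatized to "be" instead of being cut down to `wa`.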

### Chunking

While <b>tokenizing</b> allows you to identify words and sentences, <b>chunking</b> allows you to identify phrases.

In [165]:
grammar = "NP: {<DT>?<JJ>*<NN>}"

<b>NP</b> stands for noun phrase. The rule says:
       
        Start with an optional (?) determiner (DT)
        Can have any number (*) of adjectives (JJ)
        End with a noun (NN)

In [167]:
chunk_parser = nltk.RegexpParser(grammar)

In [171]:
!pip install ghostscript

Collecting ghostscript
  Downloading ghostscript-0.7-py2.py3-none-any.whl (25 kB)
Installing collected packages: ghostscript
Successfully installed ghostscript-0.7


In [192]:
import ghostscript

RuntimeError: Can not find Ghostscript DLL in registry

In [190]:
chunk_parser.parse(eng_pos_tag)

The Ghostscript executable isn't found.
See http://web.mit.edu/ghostscript/www/Install.htm
If you're using a Mac, you can try installing
https://docs.brew.sh/Installation then `brew install ghostscript`


LookupError: 

Tree('S', [('perhaps', 'RB'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('significant', 'JJ'), ('advances', 'NNS'), ('made', 'VBN'), ('by', 'IN'), ('arabic', 'JJ'), ('mathematics', 'NNS'), ('began', 'VBD'), ('at', 'IN'), Tree('NP', [('this', 'DT'), ('time', 'NN')]), ('with', 'IN'), Tree('NP', [('the', 'DT'), ('work', 'NN')]), ('of', 'IN'), ('al-khwarizmi', 'JJ'), (',', ','), ('namely', 'RB'), ('the', 'DT'), ('beginnings', 'NNS'), ('of', 'IN'), Tree('NP', [('algebra', 'NN')]), ('.', '.'), ('it', 'PRP'), ('is', 'VBZ'), ('important', 'JJ'), ('to', 'TO'), ('understand', 'VB'), ('just', 'RB'), ('how', 'WRB'), ('significant', 'JJ'), Tree('NP', [('this', 'DT'), ('new', 'JJ'), ('idea', 'NN')]), ('was', 'VBD'), ('.', '.'), ('it', 'PRP'), ('was', 'VBD'), Tree('NP', [('a', 'DT'), ('revolutionary', 'JJ'), ('move', 'NN')]), ('away', 'RB'), ('from', 'IN'), Tree('NP', [('the', 'DT'), ('greek', 'JJ'), ('concept', 'NN')]), ('of', 'IN'), ('mathematics', 'NNS'), ('which', 'WDT'), ('was

In [189]:
tree = chunk_parser.parse(eng_pos_tag)  # `tree` must be assigned before it can be drawn
tree.draw()
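The Ghostscript message above appears to come from the notebook trying to render the tree as an image. Printing the tree as text avoids that dependency, and the noun phrases can be collected directly from the subtrees. The sketch below reuses the same grammar on a short hand-tagged fragment of the text, so it needs neither Ghostscript nor a tagger model.

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

# Hand-tagged fragment; the tags match those in the notebook's output.
tagged = [
    ("it", "PRP"), ("was", "VBD"), ("a", "DT"), ("revolutionary", "JJ"),
    ("move", "NN"), ("away", "RB"), ("from", "IN"), ("the", "DT"),
    ("greek", "JJ"), ("concept", "NN"),
]

tree = chunk_parser.parse(tagged)
print(tree)  # bracketed plain-text form, no image rendering involved

# Collect the words of every NP chunk.
noun_phrases = [
    " ".join(word for word, _tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

`tree.pretty_print()` is another text-only option that draws the tree as ASCII art.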

### Printing Dependencies

In [195]:
%load_ext watermark

In [202]:
%watermark -v -m -p nltk

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.19.0

nltk: 3.5

Compiler    : MSC v.1916 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 42 Stepping 7, GenuineIntel
CPU cores   : 4
Architecture: 64bit



In [197]:
%watermark --iversions

nltk: 3.5
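If the `watermark` extension is not available, similar information can be printed with the standard library alone. This is a minimal sketch that assumes Python 3.8 or newer, where `importlib.metadata` exists; `package_version` is a hypothetical helper name.

```python
import platform
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a note if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


print("Python:", platform.python_implementation(), platform.python_version())
print("nltk:", package_version("nltk"))
```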

