#Stemming and Lemmatization with Python NLTK
This is a demonstration of stemming and lemmatization for the 17 languages supported by the NLTK 2.0.4 stem package.   
Stemming is a process of removing and replacing word suffixes to arrive at a common root form of the word.    


For stemming English words with NLTK, you can choose between the **PorterStemmer** and the **LancasterStemmer**.    

The Porter stemming algorithm, published in 1980, is the oldest stemming algorithm supported in NLTK.    
The Lancaster (Paice/Husk) stemming algorithm, published in 1990, is newer and can be more aggressive than the Porter stemming algorithm.    
 
The WordNet Lemmatizer uses the WordNet database to look up lemmas. Lemmas differ from stems in that a lemma is a canonical (dictionary) form of the word, while a stem may not be a real word.
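
As a minimal sketch of the difference, the cell below runs the two English stemmers and the WordNet lemmatizer on a few words. The example words are illustrative choices, not taken from the original notebook, and the lemmatizer call is wrapped in a `try` block on the assumption that the WordNet corpus may not have been downloaded yet (`nltk.download('wordnet')`).

```python
from nltk.stem import LancasterStemmer, PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lancaster = LancasterStemmer()

# Illustrative words (an assumption, not from the notebook).
for word in ["running", "maximum", "flies"]:
    print(word, "| porter:", porter.stem(word), "| lancaster:", lancaster.stem(word))

# The lemmatizer needs the WordNet corpus; without it NLTK raises LookupError.
try:
    lemma = WordNetLemmatizer().lemmatize("geese")  # default part of speech is noun
except LookupError:
    lemma = None  # WordNet data is not installed in this environment
print("lemma of 'geese':", lemma)
```

The stemmers need no extra data, so they work immediately after installing NLTK. The lemmatizer only returns something useful once the WordNet corpus is available.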

##Non-English Stemmers
Stemming for Portuguese is available in NLTK with the **RSLPStemmer** and also with the **SnowballStemmer**. Arabic stemming is supported with the **ISRIStemmer**.

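Snowball is actually a language for creating stemmers. The NLTK Snowball stemmer currently supports the following languages:

- Danish
- Dutch
- English
- Finnish
- French
- German
- Hungarian
- Italian
- Norwegian
- Porter (the original Porter algorithm for English, exposed as a Snowball option rather than a separate language)
- Portuguese
- Romanian
- Russian
- Spanish
- Swedish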
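
A short sketch of how the Snowball stemmer is selected by language name. The German word is the example used in the NLTK documentation. The English word is an illustrative choice.

```python
from nltk.stem import SnowballStemmer

# Language names accepted by the constructor (lower-case strings).
print(SnowballStemmer.languages)

german_stemmer = SnowballStemmer("german")
german_stem = german_stemmer.stem("Autobahnen")
print(german_stem)

english_stemmer = SnowballStemmer("english")
english_stem = english_stemmer.stem("running")
print(english_stem)
```

The exact contents of `SnowballStemmer.languages` depend on the installed NLTK version, so newer releases may list more languages than the ones above.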

In [1]:
from nltk import wordpunct_tokenize
wordpunct_tokenize("That's thirty minutes away. I'll be there in ten.")

['That',
 "'",
 's',
 'thirty',
 'minutes',
 'away',
 '.',
 'I',
 "'",
 'll',
 'be',
 'there',
 'in',
 'ten',
 '.']

In [4]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [9]:
from nltk.corpus import stopwords
print(len(stopwords.fileids()))
print(stopwords.fileids())

21
['arabic', 'azerbaijani', 'danish', 'dutch', 'english', 'finnish', 'french', 'german', 'greek', 'hungarian', 'indonesian', 'italian', 'kazakh', 'nepali', 'norwegian', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish', 'turkish']


In [0]:
text = '''
    There's a passage I got memorized. Ezekiel 25:17. "The path of the righteous man is beset on all sides\
    by the inequities of the selfish and the tyranny of evil men. Blessed is he who, in the name of charity\
    and good will, shepherds the weak through the valley of the darkness, for he is truly his brother's keeper\
    and the finder of lost children. And I will strike down upon thee with great vengeance and furious anger\
    those who attempt to poison and destroy My brothers. And you will know I am the Lord when I lay My vengeance\
    upon you." Now... I been sayin' that shit for years. And if you ever heard it, that meant your ass. You'd\
    be dead right now. I never gave much thought to what it meant. I just thought it was a cold-blooded thing\
    to say to a motherfucker before I popped a cap in his ass. But I saw some shit this mornin' made me think\
    twice. See, now I'm thinking: maybe it means you're the evil man. And I'm the righteous man. And Mr.\
    9mm here... he's the shepherd protecting my righteous ass in the valley of darkness. Or it could mean\
    you're the righteous man and I'm the shepherd and it's the world that's evil and selfish. And I'd like\
    that. But that shit ain't the truth. The truth is you're the weak. And I'm the tyranny of evil men.\
    But I'm tryin', Ringo. I'm tryin' real hard to be the shepherd.
    '''

In [13]:
languages_ratios = {}
tokens = wordpunct_tokenize(text)
words = [word.lower() for word in tokens]
print(words)

['there', "'", 's', 'a', 'passage', 'i', 'got', 'memorized', '.', 'ezekiel', '25', ':', '17', '.', '"', 'the', 'path', 'of', 'the', 'righteous', 'man', 'is', 'beset', 'on', 'all', 'sides', 'by', 'the', 'inequities', 'of', 'the', 'selfish', 'and', 'the', 'tyranny', 'of', 'evil', 'men', '.', 'blessed', 'is', 'he', 'who', ',', 'in', 'the', 'name', 'of', 'charity', 'and', 'good', 'will', ',', 'shepherds', 'the', 'weak', 'through', 'the', 'valley', 'of', 'the', 'darkness', ',', 'for', 'he', 'is', 'truly', 'his', 'brother', "'", 's', 'keeper', 'and', 'the', 'finder', 'of', 'lost', 'children', '.', 'and', 'i', 'will', 'strike', 'down', 'upon', 'thee', 'with', 'great', 'vengeance', 'and', 'furious', 'anger', 'those', 'who', 'attempt', 'to', 'poison', 'and', 'destroy', 'my', 'brothers', '.', 'and', 'you', 'will', 'know', 'i', 'am', 'the', 'lord', 'when', 'i', 'lay', 'my', 'vengeance', 'upon', 'you', '."', 'now', '...', 'i', 'been', 'sayin', "'", 'that', 'shit', 'for', 'years', '.', 'and', 'if',

In [14]:
for language in stopwords.fileids():
    stopwords_set = set(stopwords.words(language))
    print(stopwords_set)

{'إنما', 'أولئك', 'كلاهما', 'إذما', 'سوى', 'بس', 'ذان', 'فإذا', 'إنه', 'لستم', 'لهما', 'أولاء', 'ذه', 'أوه', 'إلا', 'بكم', 'كأنما', 'إليكم', 'ليسوا', 'ثم', 'هنا', 'ثمة', 'مع', 'إليكما', 'ذاك', 'والذي', 'ذواتي', 'فيما', 'كم', 'هذي', 'أيها', 'بهما', 'كيف', 'متى', 'عدا', 'هن', 'يا', 'بعد', 'أي', 'حتى', 'لسن', 'إذن', 'منه', 'تينك', 'أنتم', 'ذي', 'بمن', 'لا', 'اللتان', 'لي', 'إليك', 'لكم', 'شتان', 'أن', 'هذه', 'لئن', 'لستما', 'هنالك', 'بخ', 'بكن', 'إي', 'كي', 'ما', 'هكذا', 'عسى', 'هم', 'دون', 'خلا', 'كلا', 'كأين', 'ذلك', 'ماذا', 'لن', 'إما', 'ذانك', 'أف', 'اللاتي', 'فمن', 'ليس', 'منها', 'سوف', 'ذوا', 'بهن', 'كلتا', 'وإن', 'بي', 'أما', 'تي', 'اللواتي', 'ذو', 'ته', 'بلى', 'كذا', 'تلكم', 'لنا', 'أنى', 'نحن', 'فلا', 'لهن', 'مما', 'ذلكن', 'تلك', 'هذين', 'أنتما', 'إذا', 'مذ', 'بما', 'هؤلاء', 'آي', 'لكيلا', 'حين', 'فيم', 'عليك', 'لوما', 'ولكن', 'ليسا', 'ذينك', 'غير', 'عند', 'بيد', 'هيا', 'آها', 'بهم', 'إليكن', 'هناك', 'تين', 'لكن', 'كل', 'ريث', 'بماذا', 'أو', 'ممن', 'مهما', 'وما', 'ذلكما', 'من', '

In [16]:
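words_set = set(words)  # unique lower-cased tokens from the text
for language in stopwords.fileids():
    stopwords_set = set(stopwords.words(language))
    common_elements = words_set.intersection(stopwords_set)
    # Language "score": the number of distinct stopwords of this language
    # found in the text (a raw count, not a ratio, despite the variable name).
    languages_ratios[language] = len(common_elements)
print(languages_ratios)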

{'arabic': 0, 'azerbaijani': 4, 'danish': 4, 'dutch': 6, 'english': 48, 'finnish': 3, 'french': 6, 'german': 5, 'greek': 0, 'hungarian': 3, 'indonesian': 0, 'italian': 4, 'kazakh': 0, 'nepali': 0, 'norwegian': 5, 'portuguese': 3, 'romanian': 7, 'russian': 0, 'spanish': 3, 'swedish': 3, 'turkish': 0}


In [17]:
most_rated_language = max(languages_ratios, key=languages_ratios.get)
most_rated_language

'english'
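
The scoring steps above can be gathered into one reusable function. This is a sketch rather than the notebook's own code: `detect_language` and `toy_stopwords` are hypothetical names, the tiny stopword lists are hand-written for illustration, and a plain regular expression stands in for `wordpunct_tokenize` so the example runs without any NLTK data.

```python
import re


def detect_language(text, stopwords_by_language):
    """Return (best_language, scores) based on stopword overlap counts."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {
        language: len(words & set(stopword_list))
        for language, stopword_list in stopwords_by_language.items()
    }
    # On a tie, max() returns the first language encountered.
    best_language = max(scores, key=scores.get)
    return best_language, scores


# Tiny hand-written stopword lists, for illustration only.
toy_stopwords = {
    "english": ["the", "is", "and", "of", "in"],
    "french": ["le", "est", "et", "un", "qu"],
    "german": ["der", "ist", "und", "ein", "in"],
}

best, scores = detect_language("Le temps est un grand maître", toy_stopwords)
print(best, scores)
```

With NLTK's corpus, the second argument could be built as `{lang: stopwords.words(lang) for lang in stopwords.fileids()}`.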

#N-GRAMS
An n-gram is a contiguous sequence of n items from a given sample of text or speech.

The items can be phonemes, syllables, letters, words or base pairs, depending on the application. N-grams are typically collected from a text or speech corpus.

N-grams are used extensively in text mining and natural language processing tasks. When computing n-grams, you typically slide the window forward one word at a time (although you can move X words forward in more advanced scenarios).

N-gram models are used to estimate the probability of the last word of an n-gram given the previous words, and also to assign probabilities to entire sequences. For example, an n-gram model can be used to derive the probability of a sentence.
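
To make the probability estimate concrete, here is a minimal sketch of a bigram model using maximum-likelihood counts: P(word | previous) = count(previous, word) / count(previous as a history). The toy corpus and the helper names `bigram_probability` and `sentence_probability` are assumptions for illustration. No smoothing is applied, so unseen bigrams get probability 0.

```python
from collections import Counter

# Toy corpus; the sentence is an illustrative assumption, not notebook data.
tokens = "the cat sat on the mat the cat ran".split()

# Word bigrams: slide a window of size 2 forward one word at a time.
bigrams = list(zip(tokens, tokens[1:]))

bigram_counts = Counter(bigrams)
# How often each word occurs as the first element (history) of a bigram.
history_counts = Counter(first for first, _ in bigrams)


def bigram_probability(previous, word):
    """Maximum-likelihood estimate of P(word | previous)."""
    if history_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / history_counts[previous]


def sentence_probability(words):
    """Chain-rule product of bigram probabilities (the first word is not scored)."""
    probability = 1.0
    for previous, word in zip(words, words[1:]):
        probability *= bigram_probability(previous, word)
    return probability


print(bigram_probability("the", "cat"))             # 2 of the 3 bigrams starting with "the"
print(sentence_probability(["the", "cat", "sat"]))  # (2/3) * (1/2)
```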

In [19]:
from nltk.tokenize import RegexpTokenizer
tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
print(tokenizer.tokenize("Le temps est un grand maître, dit-on, le malheur est qu'il tue ses élèves."))

['Le', 'temps', 'est', 'un', 'grand', 'maître', 'dit', 'on', 'le', 'malheur', 'est', "qu'il", 'tue', 'ses', 'élèves']


In [36]:
from nltk.util import ngrams
generated_ngrams = ngrams('TEXT', 4, pad_left=True, pad_right=True)
print(list(generated_ngrams))

[(None, None, None, 'T'), (None, None, 'T', 'E'), (None, 'T', 'E', 'X'), ('T', 'E', 'X', 'T'), ('E', 'X', 'T', None), ('X', 'T', None, None), ('T', None, None, None)]


In [50]:
from nltk.util import ngrams
generated_ngrams = ngrams('TEXT', 4, pad_left=True, pad_right=True, left_pad_symbol=' ')
print(list(generated_ngrams))

[(' ', ' ', ' ', 'T'), (' ', ' ', 'T', 'E'), (' ', 'T', 'E', 'X'), ('T', 'E', 'X', 'T'), ('E', 'X', 'T', None), ('X', 'T', None, None), ('T', None, None, None)]


In [65]:
generated_ngrams = ngrams('TEXT', 4, pad_left=True, pad_right=True, left_pad_symbol=' ', right_pad_symbol=' ')
k = list(generated_ngrams)
print(k)
print(k[4])
''.join(k[4])

[(' ', ' ', ' ', 'T'), (' ', ' ', 'T', 'E'), (' ', 'T', 'E', 'X'), ('T', 'E', 'X', 'T'), ('E', 'X', 'T', ' '), ('X', 'T', ' ', ' '), ('T', ' ', ' ', ' ')]
('E', 'X', 'T', ' ')


'EXT '

In [40]:
from functools import partial
from nltk import ngrams
x = 'TEXT'
print(len(list(ngrams(x, 2))), list(ngrams(x, 2)))

padded_ngrams = partial(ngrams, pad_left=True, pad_right=True, left_pad_symbol='_', right_pad_symbol='_')
print(list(padded_ngrams(x, 2)))
print(len(list(padded_ngrams(x, 2))))
print(len(list(padded_ngrams(x, 3))))
print(len(list(padded_ngrams(x, 4))))
print(len(list(padded_ngrams(x, 5))))

3 [('T', 'E'), ('E', 'X'), ('X', 'T')]
[('_', 'T'), ('T', 'E'), ('E', 'X'), ('X', 'T'), ('T', '_')]
5
6
7
8
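
The counts printed above follow a simple pattern. Padding both sides adds n - 1 symbols at each end, so a sequence of length L yields L + n - 1 padded n-grams, while the unpadded version yields L - n + 1 (or none if the sequence is shorter than n). The sketch below checks this for the same input. The dictionary names are illustrative.

```python
from nltk import ngrams

x = "TEXT"
unpadded_counts = {}
padded_counts = {}

for n in range(2, 6):
    unpadded_counts[n] = len(list(ngrams(x, n)))
    padded_counts[n] = len(list(ngrams(x, n, pad_left=True, pad_right=True,
                                       left_pad_symbol="_", right_pad_symbol="_")))
    # Compare the observed counts with the closed-form expressions.
    assert unpadded_counts[n] == max(len(x) - n + 1, 0)
    assert padded_counts[n] == len(x) + n - 1

print(unpadded_counts)
print(padded_counts)
```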


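In [0]:
# Count how many times each n-gram occurs.
# Assumption: the input is the bigrams of x from the previous cell; the
# original loop iterated over the ngrams function itself, which cannot work.
ngrams_statistics = {}

for ngram in ngrams(x, 2):
    # dict.has_key() no longer exists in Python 3; use the `in` operator instead.
    if ngram not in ngrams_statistics:
        ngrams_statistics[ngram] = 1
    else:
        ngrams_statistics[ngram] += 1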
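
The same tally can be written more compactly with `collections.Counter` from the standard library. This is a sketch: the input string "banana" is an illustrative choice, picked because some of its bigrams repeat.

```python
from collections import Counter

from nltk import ngrams

# "banana" is an illustrative input chosen because some bigrams repeat.
bigram_counts = Counter(ngrams("banana", 2))

# most_common() lists the n-grams from most to least frequent.
for bigram, count in bigram_counts.most_common():
    print("".join(bigram), count)
```

`Counter` returns 0 for n-grams it has never seen, which avoids the explicit membership test.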