### Hello
#### Let's have a look together at some basic NLP tasks with Leonardo Da Vinci

#### You are listening to 'Perfect Balance' - Jean-Philippe Rio-Py

In [44]:
# Sample text. The trailing backslash joins the two source lines with no space
# in between, so the string contains 'skyward,for'.
da_vinci_quote = 'When once you have tasted flight, you will forever walk the earth with your eyes turned skyward,\
for there you have been, and there you will always long to return.'
print('Leonardo Da Vinci once said: {}'.format(da_vinci_quote))

Leonardo Da Vinci once said: When once you have tasted flight, you will forever walk the earth with your eyes turned skyward,for there you have been, and there you will always long to return.


#### Convert to lowercase

In [45]:
da_vinci_quote.lower()

'when once you have tasted flight, you will forever walk the earth with your eyes turned skyward,for there you have been, and there you will always long to return.'

#### Punctuation Removal

In [46]:
import re  # regular expression

# Replace every non-alphanumeric character with a space (rather than deleting it)
# so that neighbouring words are not concatenated.

quote_no_punctuation = re.sub(r"[^a-zA-Z0-9]", " ", da_vinci_quote)
print(quote_no_punctuation)

When once you have tasted flight  you will forever walk the earth with your eyes turned skyward for there you have been  and there you will always long to return 
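
The substitution leaves a double space wherever punctuation stood next to a space. A minimal sketch, assuming single-spaced output is wanted, collapses the whitespace afterwards. The helper name `strip_punctuation` is made up for this illustration and is not part of the notebook.

```python
import re

def strip_punctuation(text):
    """Replace non-alphanumeric characters with spaces, then collapse runs of whitespace."""
    no_punct = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return re.sub(r"\s+", " ", no_punct).strip()

sample = "Art is never finished, only abandoned."
cleaned = strip_punctuation(sample)
print(cleaned)
```

If the tokens are produced with `split()` right after, this extra step is optional, since `split()` already ignores repeated whitespace.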


In [47]:
# Split text into tokens (words)
words = da_vinci_quote.split()
print('Tokens: {}'.format(words))

Tokens: ['When', 'once', 'you', 'have', 'tasted', 'flight,', 'you', 'will', 'forever', 'walk', 'the', 'earth', 'with', 'your', 'eyes', 'turned', 'skyward,for', 'there', 'you', 'have', 'been,', 'and', 'there', 'you', 'will', 'always', 'long', 'to', 'return.']
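
Note that `split()` separates on whitespace only, so punctuation stays attached to words ('flight,') and 'skyward,for' remains a single token. As a rough sketch, assuming that only runs of ASCII letters and digits count as words, `re.findall` can be used instead. The shortened sample string is chosen here just for illustration.

```python
import re

text = "eyes turned skyward,for there you have been, and there"

# str.split() separates on whitespace only
whitespace_tokens = text.split()

# re.findall() keeps runs of letters and digits, so punctuation acts as a separator
regex_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)
print(regex_tokens)
```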


### NLTK: Natural Language ToolKit

In [48]:
# !pip install nltk # if you do not have it installed yet
# !python -m nltk.downloader all
import nltk

#### Split text into a list of tokens (words) with 'word_tokenize'

In [49]:
from nltk.tokenize import word_tokenize
words = word_tokenize(da_vinci_quote)
print('Tokens: {}'.format(words))

Tokens: ['When', 'once', 'you', 'have', 'tasted', 'flight', ',', 'you', 'will', 'forever', 'walk', 'the', 'earth', 'with', 'your', 'eyes', 'turned', 'skyward', ',', 'for', 'there', 'you', 'have', 'been', ',', 'and', 'there', 'you', 'will', 'always', 'long', 'to', 'return', '.']


#### Split text into a list of sentences with 'sent_tokenize'

In [50]:
# Adjacent string literals are concatenated; the trailing space keeps the
# sentences separated so that sent_tokenize can find the boundary.
Da_Vinci_quotes = ('Simplicity is the ultimate sophistication. '
                   'Art is never finished, only abandoned. Learning never exhausts the mind.')
print("Da Vinci: {}".format(Da_Vinci_quotes))

print('----------')

from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(Da_Vinci_quotes)
print("Sentences (Tokens): {}".format(sentences))


Da Vinci: Simplicity is the ultimate sophistication. Art is never finished, only abandoned. Learning never exhausts the mind.
----------
Sentences (Tokens): ['Simplicity is the ultimate sophistication.', 'Art is never finished, only abandoned.', 'Learning never exhausts the mind.']
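
Sentence splitting relies on whitespace after the closing punctuation. The sketch below is a naive regex splitter, not NLTK's Punkt model, and it would wrongly break on abbreviations such as "Dr.". It is only meant to show how a missing space after a full stop hides a sentence boundary. The function name `split_sentences` is an assumption made for this example.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

with_space = "Simplicity is the ultimate sophistication. Art is never finished, only abandoned."
without_space = "Simplicity is the ultimate sophistication.Art is never finished, only abandoned."

print(split_sentences(with_space))
print(split_sentences(without_space))
```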


#### List of stop words

In [51]:
# List stop words
from nltk.corpus import stopwords
# In English
print(stopwords.words("english")[:50])  # first 50 English stop words (file ids are lowercase)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be']


In [52]:
# In Spanish
print(stopwords.words("spanish")[:50])  # first 50 Spanish stop words (file ids are lowercase)

['de', 'la', 'que', 'el', 'en', 'y', 'a', 'los', 'del', 'se', 'las', 'por', 'un', 'para', 'con', 'no', 'una', 'su', 'al', 'lo', 'como', 'más', 'pero', 'sus', 'le', 'ya', 'o', 'este', 'sí', 'porque', 'esta', 'entre', 'cuando', 'muy', 'sin', 'sobre', 'también', 'me', 'hasta', 'hay', 'donde', 'quien', 'desde', 'todo', 'nos', 'durante', 'todos', 'uno', 'les', 'ni']


In [53]:
# First: Normalize text
norm_leonardo = re.sub(r"[^a-zA-Z0-9]", " ", Da_Vinci_quotes.lower())

# Second: Tokenize it into a list of words
words = norm_leonardo.split()
print('Normalized text: {}'.format(words))

Normalized text: ['simplicity', 'is', 'the', 'ultimate', 'sophistication', 'art', 'is', 'never', 'finished', 'only', 'abandoned', 'learning', 'never', 'exhausts', 'the', 'mind']


In [54]:
# Remove stop words (build the set once instead of reloading the list for every word)
stop_words = set(stopwords.words("english"))
words1 = [w for w in words if w not in stop_words]
print('Text without stop words: {}'.format(words1))
# Compare words with words1. Do you see the differences?

Text without stop words: ['simplicity', 'ultimate', 'sophistication', 'art', 'never', 'finished', 'abandoned', 'learning', 'never', 'exhausts', 'mind']
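
Once stop words are gone, the remaining tokens can be counted to see which content words dominate. The sketch below uses a tiny hand-picked stop word set, which is an assumption for illustration and not NLTK's list, so that it runs without any downloads.

```python
from collections import Counter

# Tiny hand-picked stop word set, for illustration only (not NLTK's list)
stop_words = {"is", "the", "only"}

tokens = ['simplicity', 'is', 'the', 'ultimate', 'sophistication', 'art', 'is', 'never',
          'finished', 'only', 'abandoned', 'learning', 'never', 'exhausts', 'the', 'mind']

# Set membership is checked in constant time on average
content_words = [t for t in tokens if t not in stop_words]
counts = Counter(content_words)

print(content_words)
print(counts.most_common(1))
```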


### Stemming and Lemmatisation

Both terms come from linguistic morphology and information retrieval.

#### Stemming
Stemming is the process of reducing inflected words to their word stem, base, or root form, which is generally a written word form and need not be a dictionary word.

adjustable --> adjust, changing --> chang

In [55]:
from nltk.stem.porter import PorterStemmer
# Reduce words to their stems (create the stemmer once and reuse it)
words = norm_leonardo.split()
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in words]
print('words:', words)
print('------------------------')
print('stemmed:', stemmed)

words: ['simplicity', 'is', 'the', 'ultimate', 'sophistication', 'art', 'is', 'never', 'finished', 'only', 'abandoned', 'learning', 'never', 'exhausts', 'the', 'mind']
------------------------
stemmed: ['simplic', 'is', 'the', 'ultim', 'sophist', 'art', 'is', 'never', 'finish', 'onli', 'abandon', 'learn', 'never', 'exhaust', 'the', 'mind']
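
To make the idea of suffix stripping concrete, here is a deliberately simplified sketch. It is not the Porter algorithm, which applies several ordered rule phases with extra conditions. The function `toy_stem` and its suffix list are invented for this example.

```python
def toy_stem(word, suffixes=("ing", "ed", "s")):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [toy_stem(w) for w in ["finished", "abandoned", "learning", "exhausts", "mind", "is"]]
print(stems)
```

A rule set this small cannot handle cases such as 'simplicity' or 'only', which is why a full stemmer like Porter's is used in practice.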


#### Lemmatisation
Lemmatisation is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.

paintings --> painting

In [56]:
from nltk.stem.wordnet import WordNetLemmatizer
# Reduce words to their dictionary form (create the lemmatizer once and reuse it)
text = 'The Mona Lisa is one of the most valuable paintings in the world.'
text = text.split(' ')
lemmatizer = WordNetLemmatizer()
lemmed = [lemmatizer.lemmatize(w) for w in text]
print('text: {}'.format(text))
print('------------------------')
print('lemmed: {}'.format(lemmed))

text: ['The', 'Mona', 'Lisa', 'is', 'one', 'of', 'the', 'most', 'valuable', 'paintings', 'in', 'the', 'world.']
------------------------
lemmed: ['The', 'Mona', 'Lisa', 'is', 'one', 'of', 'the', 'most', 'valuable', 'painting', 'in', 'the', 'world.']


In [57]:
# Did you notice? Only 'paintings' changed: lemmatize() treats each word
# as a noun unless a pos argument is given.
print('paintings -> {}'.format(WordNetLemmatizer().lemmatize('paintings')))


paintings -> painting
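
A lemma depends on the part of speech, which is why `WordNetLemmatizer.lemmatize` accepts a `pos` argument that defaults to noun. The sketch below imitates that behaviour with a hypothetical three-entry lookup table instead of WordNet, so the names `LEMMAS` and `toy_lemmatize` are assumptions made for illustration.

```python
# Hypothetical mini-dictionary mapping (word, part of speech) to a lemma
LEMMAS = {
    ("paintings", "n"): "painting",
    ("is", "v"): "be",
    ("painting", "v"): "paint",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary form if the (word, pos) pair is known, else the word itself."""
    return LEMMAS.get((word.lower(), pos), word)

print(toy_lemmatize("paintings"))
print(toy_lemmatize("is"))
print(toy_lemmatize("is", pos="v"))
print(toy_lemmatize("painting", pos="v"))
```

Unlike a stemmer, the lookup returns real dictionary words, at the cost of needing a vocabulary and a part-of-speech hint.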


#### Follow Jabraghe: Learn with Classical Music on Facebook and YouTube.
#### Thanks for Watching and Listening