<h2><u>Machine Learning and NLP</u></h2>

# Module 5 - Natural Language Processing

<h2>In-Class Code</h2>

In this demo, you will see how to apply the techniques covered in this module using Python libraries.

##### Types Of Tokenizers

In [None]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize, wordpunct_tokenize

# Download the NLTK resources used in this notebook
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

s = "Hello! Let's use NLTK."

# NLTK's general-purpose word tokenizer (needs the 'punkt' resource)
print(word_tokenize(s))

# Splits on every run of punctuation
print(wordpunct_tokenize(s))

# Splits text into sentences
print(sent_tokenize(s))
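
The difference between the first two tokenizers can be seen without NLTK. As a rough sketch, `wordpunct_tokenize` behaves like the regular expression `\w+|[^\w\s]+`, which separates runs of word characters from runs of punctuation. The helper name `simple_wordpunct` is hypothetical and is only an approximation written for illustration, not NLTK's own code.

```python
import re

def simple_wordpunct(text):
    """Split text into runs of word characters and runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

s = "Hello! Let's use NLTK."
tokens = simple_wordpunct(s)
print(tokens)
```

Note how the apostrophe in `Let's` becomes a token of its own here, whereas `word_tokenize` typically keeps the contraction suffix together as `'s`.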

##### Creating Tokens Using NLTK

In [None]:
import nltk
from nltk.util import bigrams, trigrams, ngrams
# Example sentence used in the following cells
string = "ML can be seen as a time-saving device that allows humans to explore their more creative ambitions while ML is in the background crunching numbers"

In [None]:
ML_tokens=nltk.word_tokenize(string)
ML_tokens

##### Creating Bigrams And Trigrams

In [None]:
ML_bigrams=list(nltk.bigrams(ML_tokens))
ML_bigrams

In [None]:
ML_trigrams = list(nltk.trigrams(ML_tokens))
ML_trigrams
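
Bigrams and trigrams are both special cases of n-grams: every run of `n` consecutive tokens. The sketch below shows the idea in plain Python so the output of the cells above is easier to interpret. The helper `simple_ngrams` and the shortened, whitespace-split sentence are assumptions made for illustration.

```python
def simple_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Shortened example sentence, split on whitespace to keep the sketch NLTK-free
tokens = "ML can be seen as a time-saving device".split()

print(simple_ngrams(tokens, 2))  # bigrams
print(simple_ngrams(tokens, 3))  # trigrams
```

In NLTK itself, `ngrams(ML_tokens, n)` from `nltk.util` (already imported above) covers any value of `n`.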

##### POS Tagging - Steps

In [None]:
for token in ML_tokens:
    print(nltk.pos_tag([token]))

##### Shortcomings Of POS Tagger

In [None]:
sent = "Jim eats a banana"
tokens = word_tokenize(sent)
# Each token is tagged in isolation, so the tagger sees no sentence context
for token in tokens:
    print(nltk.pos_tag([token]))

In [None]:
from nltk.tokenize import RegexpTokenizer
# Raw string so that \W, \d and \S reach the regex engine unchanged
reg_tokenizer = RegexpTokenizer(r'(?u)\W+|\$[\d\.]+|\S+')
regextokens = reg_tokenizer.tokenize(sent)
regextags = nltk.pos_tag(regextokens)
regextags
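
To see which tokens the pattern above hands to the tagger, it can be applied directly with Python's `re` module. This is a sketch that assumes `RegexpTokenizer` in its default mode returns the same matches as `re.findall`. The second pattern, with a lowercase `\w+`, is an assumed alternative that is not part of the original cell and is shown only for comparison.

```python
import re

sent = "Jim eats a banana"

# Pattern from the cell above: uppercase \W+ matches runs of NON-word characters
source_pattern = r'(?u)\W+|\$[\d\.]+|\S+'
# Assumed alternative (not in the original): lowercase \w+ matches runs of word characters
word_pattern = r'(?u)\w+|\$[\d\.]+|\S+'

source_tokens = re.findall(source_pattern, sent)
word_tokens = re.findall(word_pattern, sent)

print(source_tokens)  # the single-space separators appear as tokens
print(word_tokens)    # words only
```

With the uppercase `\W+`, the spaces between words are emitted as tokens and are passed to the POS tagger along with the words.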

##### Stop Words

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words

##### Removing Stop Words Using NLTK

In [None]:
filtered_sentence = [w for w in ML_tokens if w not in stop_words]
print(filtered_sentence)
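
The membership test above is case-sensitive, and NLTK's English stop-word list is all lowercase, so a capitalised stop word such as `The` is not removed. The sketch below illustrates the effect with a small hypothetical stop-word set and token list, both invented for this example.

```python
# Hypothetical mini stop-word set; like NLTK's English list, it is all lowercase
stop_words = {"the", "is", "in", "a", "as", "be", "can", "that", "to"}
tokens = ["The", "model", "is", "in", "the", "background"]

# Case-sensitive comparison keeps "The"
case_sensitive = [w for w in tokens if w not in stop_words]
# Lowercasing each token before the lookup removes it
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase before filtering depends on the task; for example, it may be undesirable when capitalisation carries meaning, as with acronyms.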

##### Stemming Using NLTK – Example 

In [None]:
from nltk.stem import PorterStemmer
pst=PorterStemmer()
print(pst.stem("Measure"))
print(pst.stem("Measurement"))
print(pst.stem("Measuring"))
print(pst.stem("Measurer"))
print(pst.stem("Measures"))

##### Non-English Stemmers

In [None]:
from nltk.stem import SnowballStemmer
sbst=SnowballStemmer("spanish")
print(sbst.languages)

In [None]:
print(sbst.stem("producción"))
print(sbst.stem("producto"))

##### Lemmatization Using NLTK - Example

In [None]:
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()
print(word_lem.lemmatize("eating", pos="v"))
print(word_lem.lemmatize("eats", pos="v"))
print(word_lem.lemmatize("ate", pos="v"))

In [None]:
from nltk.stem import LancasterStemmer

print("Result of WordNetLemmatizer: ", word_lem.lemmatize("gone", pos="v"))

print("Result of PorterStemmer: ", PorterStemmer().stem("gone"))

print("Result of LancasterStemmer: ", LancasterStemmer().stem("gone"))


##### NER Using NLTK - Steps

In [None]:
import nltk
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

from nltk import ne_chunk

ex = "Ferrari is an Italian luxury sports car manufacturer based in Maranello. Founded by Enzo Ferrari in 1939, the company built its first car in 1940."

# Step 1: tokenize, Step 2: POS-tag, Step 3: chunk named entities
tokenized = nltk.word_tokenize(ex)
tagged = nltk.pos_tag(tokenized)
namedEnt = ne_chunk(tagged)

print(namedEnt)

##### WSD Using NLTK

In [None]:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize

sent1 = "She is looking for a match"
sent2 = "Yesterday's Football match was exciting"

print(lesk(word_tokenize(sent1), 'match'))
print(lesk(word_tokenize(sent2), 'match'))

##### TF-IDF Using Python

In [None]:
review_1 = 'The movie was good and we really like it'
review_2 = 'the movie was good but the ending was boring'
review_3 = 'we did not like the movie as it was too lengthy'

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tf_vect = TfidfVectorizer(min_df=1,lowercase=True,stop_words='english')
review_list = [review_1,review_2,review_3]

tf_matrix = tf_vect.fit_transform(review_list)
tf_matrix.shape

In [None]:
import pandas as pd
# get_feature_names() was removed in recent scikit-learn releases
tf_names = tf_vect.get_feature_names_out()
tf_df = pd.DataFrame(tf_matrix.toarray(),columns=tf_names)
tf_df
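
The numbers in the table can be reproduced by hand. The sketch below assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The pre-tokenized documents are hypothetical and simplified, so the values will not match the table above exactly.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized documents (not the vectorizer's real output)
docs = [
    ["movie", "good", "really", "like"],
    ["movie", "good", "ending", "boring"],
    ["did", "like", "movie", "lengthy"],
]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
# Document frequency: in how many documents each term appears
df = Counter(t for d in docs for t in set(d))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Raw term counts times IDF, scaled to unit Euclidean length."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(x, 3) for x in row])
```

A term that occurs in every document, such as `movie`, receives the lowest IDF, while a term unique to one document, such as `boring`, is weighted more heavily in that document's row.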

##### TextBlob For Sentiment Analysis

In [None]:
from textblob import TextBlob
print(TextBlob('great').sentiment)
print(TextBlob('dark').sentiment)
print(TextBlob('excellent').sentiment)
print(TextBlob('boring').sentiment)

<b><i>Conclusion</i></b>: In this demo, we applied several text pre-processing and analysis techniques using NLTK, scikit-learn, and TextBlob.