# Text preprocessing, POS tagging and NER
>  In this chapter, you will learn about tokenization and lemmatization. You will then learn how to perform text cleaning, part-of-speech tagging, and named entity recognition using the spaCy library. Upon mastering these concepts, you will proceed to make the Gettysburg address machine-friendly, analyze noun usage in fake news, and identify people mentioned in a TechCrunch article.

- toc: true 
- badges: true
- comments: true
- author: Lucas Nunes
- categories: [Python, Datacamp, Machine Learning]
- image: images/datacamp/1_supervised_learning_with_scikit_learn/2_regression.png

> Note: This is a summary of the chapter 2 exercises of the course "Feature Engineering for NLP in Python" at DataCamp. <br>[Github repo](https://github.com/lnunesAI/Datacamp/) / [Course link](https://www.datacamp.com/tracks/machine-learning-scientist-with-python)

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
plt.rcParams['figure.figsize'] = (8, 8)

## Tokenization and Lemmatization

### Identifying lemmas

<p>Identify the list of words from the choices which do not have the same lemma.</p>

<pre>
Possible Answers

He, She, I, They


Am, Are, Is, Was


Increase, Increases, Increasing, Increased


<b>Car, Bike, Truck, Bus</b>

</pre>

**Although all these words refer to vehicles, they are words with distinct base forms.**
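The reasoning behind this answer can be sketched with a small lookup table. The table below is hand-written for this quiz only and is not spaCy's lemmatizer; mapping pronouns to `-PRON-` is an assumption that follows the lemmatized output shown later in this chapter, and other model versions may treat pronouns differently.

```python
# Toy lemma table, written by hand for this quiz only (not spaCy's lemmatizer).
# Pronouns map to '-PRON-', an assumption matching the output in this chapter.
TOY_LEMMAS = {
    'he': '-PRON-', 'she': '-PRON-', 'i': '-PRON-', 'they': '-PRON-',
    'am': 'be', 'are': 'be', 'is': 'be', 'was': 'be',
    'increase': 'increase', 'increases': 'increase',
    'increasing': 'increase', 'increased': 'increase',
    'car': 'car', 'bike': 'bike', 'truck': 'truck', 'bus': 'bus',
}

def share_lemma(words):
    """Return True if every word maps to the same lemma in TOY_LEMMAS."""
    lemmas = {TOY_LEMMAS[word.lower()] for word in words}
    return len(lemmas) == 1

choices = [
    ['He', 'She', 'I', 'They'],
    ['Am', 'Are', 'Is', 'Was'],
    ['Increase', 'Increases', 'Increasing', 'Increased'],
    ['Car', 'Bike', 'Truck', 'Bus'],
]
for words in choices:
    print(words, share_lemma(words))
```

Only the vehicle list yields more than one lemma, which is why it is the answer.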

### Tokenizing the Gettysburg Address

<div class=""><p>In this exercise, you will be tokenizing one of the most famous speeches of all time: the Gettysburg Address delivered by American President Abraham Lincoln during the American Civil War.</p>
<p>The entire speech is available as a string named <code>gettysburg</code>.</p></div>

In [None]:
gettysburg = "Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It's rather for us to be here dedicated to the great task remaining before us - that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion - that we here highly resolve that these dead shall not have died in vain - that this nation, under God, shall have a new birth of freedom - and that government of the people, by the people, for the people, shall not perish from the earth."

Instructions
<ul>
<li>Load the <code>en_core_web_sm</code> model using <code>spacy.load()</code>.</li>
<li>Create a Doc object <code>doc</code> for the <code>gettysburg</code> string.</li>
<li>Using list comprehension, loop over <code>doc</code> to generate the token texts.</li>
</ul>

In [None]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(gettysburg)

# Generate the tokens
tokens = [token.text for token in doc]
print(tokens)

['Four', 'score', 'and', 'seven', 'years', 'ago', 'our', 'fathers', 'brought', 'forth', 'on', 'this', 'continent', ',', 'a', 'new', 'nation', ',', 'conceived', 'in', 'Liberty', ',', 'and', 'dedicated', 'to', 'the', 'proposition', 'that', 'all', 'men', 'are', 'created', 'equal', '.', 'Now', 'we', "'re", 'engaged', 'in', 'a', 'great', 'civil', 'war', ',', 'testing', 'whether', 'that', 'nation', ',', 'or', 'any', 'nation', 'so', 'conceived', 'and', 'so', 'dedicated', ',', 'can', 'long', 'endure', '.', 'We', "'re", 'met', 'on', 'a', 'great', 'battlefield', 'of', 'that', 'war', '.', 'We', "'ve", 'come', 'to', 'dedicate', 'a', 'portion', 'of', 'that', 'field', ',', 'as', 'a', 'final', 'resting', 'place', 'for', 'those', 'who', 'here', 'gave', 'their', 'lives', 'that', 'that', 'nation', 'might', 'live', '.', 'It', "'s", 'altogether', 'fitting', 'and', 'proper', 'that', 'we', 'should', 'do', 'this', '.', 'But', ',', 'in', 'a', 'larger', 'sense', ',', 'we', 'ca', "n't", 'dedicate', '-', 'we', '

**You now know how to tokenize a piece of text. In the next exercise, we will perform similar steps and conduct lemmatization.**
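To get a feel for what the tokenizer is doing, here is a rough regex approximation that needs no model. It is only a sketch: spaCy's tokenizer relies on language-specific rules and exceptions that this single pattern does not reproduce, and the names `TOKEN_PATTERN` and `simple_tokenize` are made up for illustration.

```python
import re

# Rough approximation only: spaCy's tokenizer uses language-specific rules
# and exceptions that this single pattern does not reproduce.
# Alternatives, in order: "n't", clitics such as "'re", a word stem that is
# followed by "n't", a plain word, and any single punctuation character.
TOKEN_PATTERN = re.compile(r"n't|'\w+|\w+(?=n't)|\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into words, contraction pieces and punctuation marks."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("But, in a larger sense, we can't dedicate."))
```

As in the spaCy output, punctuation becomes its own token and `can't` is split into `ca` and `n't`.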

### Lemmatizing the Gettysburg address

<div class=""><p>In this exercise, we will perform lemmatization on the same <code>gettysburg</code> address from before. </p>
<p>However, this time, we will also look at the speech before and after lemmatization and try to judge what kinds of changes make the piece more machine-friendly.</p></div>

Instructions 1/3
<p>Print the gettysburg address to the console.</p>

In [None]:
# Print the gettysburg address
print(gettysburg)

Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we're engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We're met on a great battlefield of that war. We've come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It's altogether fitting and proper that we should do this. But, in a larger sense, we can't dedicate - we can not consecrate - we can not hallow - this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so no

Instructions 2/3
<p>Loop over <code>doc</code> and extract the lemma for each token of <code>gettysburg</code>.</p>

In [None]:
# Create a Doc object
doc = nlp(gettysburg)

# Generate lemmas
lemmas = [token.lemma_ for token in doc]

Instructions 3/3
<p>Convert <code>lemmas</code> into a string using <code>join</code>.</p>

In [None]:
# Convert lemmas into a string
print(' '.join(lemmas))

four score and seven year ago -PRON- father bring forth on this continent , a new nation , conceive in Liberty , and dedicate to the proposition that all man be create equal . now -PRON- be engage in a great civil war , test whether that nation , or any nation so conceive and so dedicated , can long endure . -PRON- be meet on a great battlefield of that war . -PRON- have come to dedicate a portion of that field , as a final resting place for those who here give -PRON- life that that nation may live . -PRON- be altogether fitting and proper that -PRON- should do this . but , in a large sense , -PRON- can not dedicate - -PRON- can not consecrate - -PRON- can not hallow - this ground . the brave man , living and dead , who struggle here , have consecrate -PRON- , far above -PRON- poor power to add or detract . the world will little note , nor long remember what -PRON- say here , but -PRON- can never forget what -PRON- do here . -PRON- be for -PRON- the living , rather , to be dedicate her

**You're now proficient at performing lemmatization using spaCy. Observe the lemmatized version of the speech. It isn't very readable to humans but it is in a much more convenient format for a machine to process.**

## Text cleaning

### Cleaning a blog post

<div class=""><p>In this exercise, you have been given an excerpt from a blog post. Your task is to clean this text into a more machine-friendly format. This will involve converting to lowercase, lemmatization, and removing stopwords, punctuation and non-alphabetic characters.</p>
<p>The excerpt is available as a string <code>blog</code> and has been printed to the console. The list of stopwords is available as <code>stopwords</code>.</p></div>

In [None]:
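# Import the stop word set explicitly rather than relying on attribute access
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS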
blog = '\nTwenty-first-century politics has witnessed an alarming rise of populism in the U.S. and Europe. The first warning signs came with the UK Brexit Referendum vote in 2016 swinging in the way of Leave. This was followed by a stupendous victory by billionaire Donald Trump to become the 45th President of the United States in November 2016. Since then, Europe has seen a steady rise in populist and far-right parties that have capitalized on Europe’s Immigration Crisis to raise nationalist and anti-Europe sentiments. Some instances include Alternative for Germany (AfD) winning 12.6% of all seats and entering the Bundestag, thus upsetting Germany’s political order for the first time since the Second World War, the success of the Five Star Movement in Italy and the surge in popularity of neo-nazism and neo-fascism in countries such as Hungary, Czech Republic, Poland and Austria.\n'

Instructions
<ul>
<li>Using list comprehension, loop through <code>doc</code> to extract the <code>lemma_</code> of each token.</li>
<li>Remove stopwords and non-alphabetic tokens using <code>stopwords</code> and <code>isalpha()</code>.</li>
</ul>

In [None]:
# Load model and create Doc object
nlp = spacy.load('en_core_web_sm')
doc = nlp(blog)

# Generate lemmatized tokens
lemmas = [token.lemma_ for token in doc]

# Remove stopwords and non-alphabetic tokens
a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]

# Print string after text cleaning
print(' '.join(a_lemmas))



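**Take a look at the cleaned text; it is devoid of numbers, punctuation and commonly used stopwords. Note that the code relies on lemmatization for lowercasing and never calls lower(), so some tokens such as proper nouns can keep their capitals (the TED output in the next exercise shows examples). Also, note that the word U.S. was present in the original text. Since it contains periods, isalpha() returned False for it and our text cleaning process removed it completely. This may not be ideal behavior. For more nuanced cases like this, it is advisable to write a custom function in place of isalpha().**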
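A custom check of this kind could look like the sketch below. The rule itself (keep alphabetic tokens plus dotted abbreviations) and the names `ABBREVIATION` and `is_word_like` are assumptions made for illustration, and the token list is hand-picked from the excerpt rather than produced by spaCy.

```python
import re

# Hypothetical rule: keep purely alphabetic tokens, plus dotted
# abbreviations such as 'U.S.' that str.isalpha() would reject.
ABBREVIATION = re.compile(r'(?:[A-Za-z]\.){2,}')

def is_word_like(token):
    """Return True for alphabetic tokens and dotted abbreviations."""
    return token.isalpha() or ABBREVIATION.fullmatch(token) is not None

# Hand-picked tokens from the blog excerpt, for illustration only
tokens = ['populism', 'in', 'the', 'U.S.', 'and', 'Europe', '.', '2016', '12.6', '%']
print([token for token in tokens if is_word_like(token)])
```

In the cleaning step, `is_word_like(lemma)` would take the place of `lemma.isalpha()` in the list comprehension. Whether dotted abbreviations should be kept at all depends on the downstream task.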

### Cleaning TED talks in a dataframe

<div class=""><p>In this exercise, we will revisit the TED Talks from the first chapter. You have been given a dataframe <code>ted</code> consisting of 5 TED Talks. Your task is to clean these talks using techniques discussed earlier by writing a function <code>preprocess</code> and applying it to the <code>transcript</code> feature of the dataframe. </p>
<p>The stopwords list is available as <code>stopwords</code>.</p></div>

In [None]:
ted = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/13-Feature%20Engineering%20for%20NLP%20in%20Python/datasets/ted_500x1.csv')

In [None]:
ted = ted[:20]

Instructions
<ul>
<li>Generate the Doc object for <code>text</code>. Ignore the <code>disable</code> argument for now.</li>
<li>Generate lemmas using list comprehension using the <code>lemma_</code> attribute.</li>
<li>Remove non-alphabetic characters using <code>isalpha()</code> in the if condition.</li>
</ul>

In [None]:
# Function to preprocess text
def preprocess(text):
    # Create Doc object
    doc = nlp(text, disable=['ner', 'parser'])
    # Generate lemmas
    lemmas = [token.lemma_ for token in doc]
    # Remove stopwords and non-alphabetic characters
    a_lemmas = [lemma for lemma in lemmas 
            if lemma.isalpha() and lemma not in stopwords]
    
    return ' '.join(a_lemmas)
  
# Apply preprocess to ted['transcript']
ted['transcript'] = ted['transcript'].apply(preprocess)
print(ted['transcript'])

0     talk new lecture TED illusion create TED try r...
1     representation brain brain break left half log...
2     great honor today share Digital Universe creat...
3     passion music technology thing combination thi...
4     use want computer new program programming requ...
5     neuroscientist mixed background physics medici...
6     Pat Mitchell day January begin like work love ...
7     Taylor Wilson year old nuclear physicist littl...
8     grow Northern Ireland right north end absolute...
9     publish article New York Times Modern Love col...
10    Joseph Member Parliament Kenya picture Maasai ...
11    hi talk little bit music machine life specific...
12    hi let ask audience question lie child raise h...
13    historical record allow know ancient Greeks dr...
14    good morning little boy experience change life...
15    slide year ago time short slide morning time w...
16    like world like share year old love story poor...
17    fail woman fail feminist passionate opinio

**You have preprocessed all the TED talk transcripts contained in ted, and they are now in good shape for operations such as vectorization, which we will cover soon. You now have a good understanding of how text preprocessing works and why it is important. In the next lessons, we will move on to generating word-level features for our texts.**

## Part-of-speech tagging

### POS tagging in Lord of the Flies

<div class=""><p>In this exercise, you will perform part-of-speech tagging on a famous passage from one of the most well-known novels of all time, <em>Lord of the Flies</em>, authored by William Golding.</p>
<p>The passage is available as <code>lotf</code> and has already been printed to the console.</p></div>

In [None]:
lotf = 'He found himself understanding the wearisomeness of this life, where every path was an improvisation and a considerable part of one’s waking life was spent watching one’s feet.'

Instructions
<ul>
<li>Load the <code>en_core_web_sm</code> model.</li>
<li>Create a doc object for <code>lotf</code> using <code>nlp()</code>.</li>
<li>Using the <code>text</code> and <code>pos_</code> attributes, generate tokens and their corresponding POS tags.</li>
</ul>

In [None]:
# Load the en_core_web_sm model
nlp = spacy.load('en_core_web_sm')

# Create a Doc object
doc = nlp(lotf)

# Generate tokens and pos tags
pos = [(token.text, token.pos_) for token in doc]
print(pos)

[('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'), ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'), ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'), ('where', 'ADV'), ('every', 'DET'), ('path', 'NOUN'), ('was', 'AUX'), ('an', 'DET'), ('improvisation', 'NOUN'), ('and', 'CCONJ'), ('a', 'DET'), ('considerable', 'ADJ'), ('part', 'NOUN'), ('of', 'ADP'), ('one', 'NUM'), ('’s', 'PART'), ('waking', 'VERB'), ('life', 'NOUN'), ('was', 'AUX'), ('spent', 'VERB'), ('watching', 'VERB'), ('one', 'PRON'), ('’s', 'PART'), ('feet', 'NOUN'), ('.', 'PUNCT')]


**Examine the various POS tags attached to each token and evaluate if they make intuitive sense to you. Most of them follow the standard rules of English grammar, but the output is not perfect: the first "one" in "one’s" is tagged NUM while the second is tagged PRON, although both play the same role in the sentence.**

### Counting nouns in a piece of text

<div class=""><p>In this exercise, we will write two functions, <code>nouns()</code> and <code>proper_nouns()</code> that will count the number of other nouns and proper nouns in a piece of text respectively.</p>
<p>These functions will take in a piece of text and generate a list containing the POS tag of each token. They will then return the number of proper nouns/other nouns that the text contains. We will use these functions in the next exercise to generate interesting insights about fake news. </p>
<p>The <code>en_core_web_sm</code> model has already been loaded as <code>nlp</code> in this exercise.</p></div>

Instructions 1/2
<p>Using the list <code>count</code> method, count the number of proper nouns (annotated as <code>PROPN</code>) in the <code>pos</code> list.</p>

In [None]:
nlp = spacy.load('en_core_web_sm')

# Returns number of proper nouns
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of proper nouns
    return pos.count('PROPN')

print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

3


Instructions 2/2
<p>Using the list <code>count</code> method, count the number of other nouns (annotated as <code>NOUN</code>) in the <code>pos</code> list.</p>

In [None]:
# Returns number of other nouns
def nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of POS tags
    pos = [token.pos_ for token in doc]
    
    # Return number of other nouns
    return pos.count('NOUN')

print(nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

2


**You now know how to write functions that compute the number of instances of a particular POS tag in a given piece of text. In the next exercise, we will use these functions to generate features from text in a dataframe.**
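When several tags are of interest, counting them all in one pass can be more convenient than writing one function per tag. The sketch below assumes the tagging has already been done: the `(token, tag)` pairs are copied from the first clause of the Lord of the Flies output earlier in this chapter, and `count_tags` is a hypothetical helper name.

```python
from collections import Counter

# (token, tag) pairs copied from the first clause of the tagged
# Lord of the Flies passage earlier in this chapter.
tagged = [
    ('He', 'PRON'), ('found', 'VERB'), ('himself', 'PRON'),
    ('understanding', 'VERB'), ('the', 'DET'), ('wearisomeness', 'NOUN'),
    ('of', 'ADP'), ('this', 'DET'), ('life', 'NOUN'), (',', 'PUNCT'),
]

def count_tags(tagged_tokens):
    """Count how often each POS tag occurs in a list of (token, tag) pairs."""
    return Counter(tag for _, tag in tagged_tokens)

counts = count_tags(tagged)
# A Counter returns 0 for tags that never occur, such as PROPN here
print(counts['NOUN'], counts['PROPN'])
```

With a spaCy `Doc`, the same counter could be built from `token.pos_` for each token, so `nouns()` and `proper_nouns()` become two lookups on one object.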

### Noun usage in fake news

<div class=""><p>In this exercise, you have been given a dataframe <code>headlines</code> that contains news headlines that are either fake or real. Your task is to generate two new features <code>num_propn</code> and <code>num_noun</code> that represent the number of proper nouns and other nouns contained in the <code>title</code> feature of <code>headlines</code>.</p>
<p>Next, we will compute the mean number of proper nouns and other nouns used in fake and real news headlines and compare the values. If there is a remarkable difference, then there is a good chance that using the <code>num_propn</code> and <code>num_noun</code> features in fake news detectors will improve their performance.</p>
<p>To accomplish this task, the functions <code>proper_nouns</code> and <code>nouns</code> that you had built in the previous exercise have already been made available to you.</p></div>

In [None]:
headlines = pd.read_csv('https://raw.githubusercontent.com/lnunesAI/Datacamp/main/2-machine-learning-scientist-with-python/13-Feature%20Engineering%20for%20NLP%20in%20Python/datasets/headlines_100x3.csv')

Instructions 1/2
<ul>
<li>Create a new feature <code>num_propn</code> by applying <code>proper_nouns</code> to <code>headlines['title']</code>.</li>
<li>Filter <code>headlines</code> to compute the mean number of proper nouns in fake news using the <code>mean</code> method.</li>
</ul>

In [None]:
headlines['num_propn'] = headlines['title'].apply(proper_nouns)

# Compute mean of proper nouns
real_propn = headlines[headlines['label'] == 'REAL']['num_propn'].mean()
fake_propn = headlines[headlines['label'] == 'FAKE']['num_propn'].mean()

# Print results
print("Mean no. of proper nouns in real and fake headlines are %.2f and %.2f respectively"%(real_propn, fake_propn))

Mean no. of proper nouns in real and fake headlines are 2.37 and 4.35 respectively


Instructions 2/2
<p>Repeat the process for other nouns: create a feature <code>'num_noun'</code> using <code>nouns</code> and compute the mean number of other nouns.</p>

In [None]:
headlines['num_noun'] = headlines['title'].apply(nouns)

# Compute mean of other nouns
real_noun = headlines[headlines['label'] == 'REAL']['num_noun'].mean()
fake_noun = headlines[headlines['label'] == 'FAKE']['num_noun'].mean()

# Print results
print("Mean no. of other nouns in real and fake headlines are %.2f and %.2f respectively"%(real_noun, fake_noun))

Mean no. of other nouns in real and fake headlines are 2.32 and 1.84 respectively


**You now know how to construct features using POS tag information. Notice how the mean number of proper nouns is considerably higher for fake news than it is for real news. The opposite seems to be true in the case of other nouns. This fact can be put to great use in designing fake news detectors.**
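To make the comparison explicit, the gaps between the two classes can be computed from the means printed above. This is plain arithmetic on the reported values, with `fake_minus_real` as a hypothetical helper; it says nothing about whether the differences are statistically significant.

```python
# Means reported by the two print statements above
means = {
    'num_propn': {'REAL': 2.37, 'FAKE': 4.35},
    'num_noun': {'REAL': 2.32, 'FAKE': 1.84},
}

def fake_minus_real(feature_means):
    """Return the FAKE - REAL gap for each feature, rounded to 2 decimals."""
    return {name: round(m['FAKE'] - m['REAL'], 2)
            for name, m in feature_means.items()}

gaps = fake_minus_real(means)
print(gaps)
```

A positive gap means fake headlines use more of that kind of noun on average, and a negative gap means real headlines do.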

## Named entity recognition

### Named entities in a sentence

<p>In this exercise, we will identify and classify various named entities in a body of text using one of spaCy's statistical models. We will also check the accuracy of the predicted labels.</p>

In [4]:
# Load the required model
nlp = spacy.load('en_core_web_sm')

# Create a Doc instance 
text = 'Sundar Pichai is the CEO of Google. Its headquarters is in Mountain View.'
doc = nlp(text)

# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_)

Sundar Pichai PERSON
Google ORG
Mountain View GPE


**Notice that the output above labels all three entities correctly: Sundar Pichai as PERSON, Google as ORG and Mountain View as GPE. Other versions of the model may behave differently (the original course text reports Sundar Pichai being mislabeled as an organization), because the predictions of the model depend strongly on the data it is trained on. It is possible to train spaCy models on your custom data. You will learn to do this in more advanced NLP courses.**

### Identifying people mentioned in a news article

<div class=""><p>In this exercise, you have been given an excerpt from a news article published in <em>TechCrunch</em>. Your task is to write a function <code>find_persons</code> that identifies the names of people that have been mentioned in a particular piece of text. You will then use <code>find_persons</code> to identify the people of interest in the article.</p>
<p>The article is available as the string <code>tc</code> and has been printed to the console. The required spaCy model has also already been loaded as <code>nlp</code>.</p></div>

In [5]:
tc = "\nIt’s' been a busy day for Facebook  exec op-eds. Earlier this morning, Sheryl Sandberg broke the site’s silence around the Christchurch massacre, and now Mark Zuckerberg is calling on governments and other bodies to increase regulation around the sorts of data Facebook traffics in. He’s hoping to get out in front of heavy-handed regulation and get a seat at the table shaping it.\n"

Instructions
<ul>
<li>Create a Doc object for <code>text</code>.</li>
<li>Using list comprehension, loop through <code>doc.ents</code> and create a list of named entities whose label is <code>PERSON</code>.</li>
<li>Using <code>find_persons()</code>, print the people mentioned in <code>tc</code>.</li>
</ul>

In [29]:
def find_persons(text):
  # Create Doc object
  doc = nlp(text)
  
  # Identify the persons
  persons = [ent.text  for ent in doc.ents if ent.label_ == 'PERSON']
  
  # Return persons
  return persons

print(find_persons(tc))

['Sheryl Sandberg', 'Mark Zuckerberg']


**The article was related to Facebook and our function correctly identified both the people mentioned. You can now see how NER could be used in a variety of applications. Publishers may use a technique like this to classify news articles by the people mentioned in them. A question answering system could also use something like this to answer questions such as 'Who are the people mentioned in this passage?'. With this, we come to the end of this chapter. In the next, we will learn how to conduct vectorization on documents.**
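The publisher use case mentioned above can be sketched as an index from each person to the articles that mention them. The helper `index_by_person` and the article ids are hypothetical: the `'tc'` entry holds the `find_persons()` output shown above, while `'other'` is a made-up second article included only to show two articles sharing a name.

```python
from collections import defaultdict

def index_by_person(people_per_article):
    """Map each person to the sorted list of article ids that mention them."""
    index = defaultdict(list)
    for article_id, people in people_per_article.items():
        for person in set(people):  # count each person once per article
            index[person].append(article_id)
    return {person: sorted(ids) for person, ids in index.items()}

# 'tc' holds the find_persons() output shown above; 'other' is a made-up
# second article, included only to show two articles sharing a name.
people_per_article = {
    'tc': ['Sheryl Sandberg', 'Mark Zuckerberg'],
    'other': ['Mark Zuckerberg', 'Mark Zuckerberg'],
}
index = index_by_person(people_per_article)
print(index['Mark Zuckerberg'])
```

In practice the dictionary would be filled by calling `find_persons` on each article text. Names are matched as exact strings here, so variants such as "Zuckerberg" alone would end up under a separate key.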