## Getting Human Language Data

Here we use an article on Nikola Tesla from history.com.

In [None]:
import requests
from bs4 import BeautifulSoup

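def request():
    """Fetch the article page and return it as a BeautifulSoup object."""
    url = "https://www.history.com/topics/inventions/nikola-tesla"
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page

    # Parse the HTML with Python's built-in parser
    soup = BeautifulSoup(r.text, "html.parser")

    return soup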


In [11]:
def scrape_info(soup):
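    """
    :param soup: BeautifulSoup object for the article page
    :return: tuple of (article title, article body text)
    """
    article_title = soup.find("h1", class_="m-detail-header--title")
    article_title = article_title.text

    article_body = soup.find("div", class_="m-detail--body")
    # Drop the sidebar, if present, so it does not end up in the body text
    aside = article_body.find("aside")
    if aside is not None:
        aside.decompose()
    article_body = article_body.text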

    return article_title, article_body

print(scrape_info(request()))

('Nikola Tesla', 'Serbian-American engineer and physicist Nikola Tesla (1856-1943) made dozens of breakthroughs in the production, transmission and application of electric power. He invented the first alternating current (AC) motor and developed AC generation and transmission technology. Though he was famous and respected, he was never able to translate his copious inventions into long-term financial success—unlike his early employer and chief rival, Thomas Edison.Nikola Tesla’s Early Years     Nikola Tesla was born in 1856 in Smiljan, Croatia, then part of the Austro-Hungarian Empire. His father was a priest in the Serbian Orthodox church and his mother managed the family’s farm. In 1863 Tesla’s brother Daniel was killed in a riding accident. The shock of the loss unsettled the 7-year-old Tesla, who reported seeing visions—the first signs of his lifelong mental illnesses.Did you know? During the 1890s Mark Twain struck up a friendship with inventor Nikola Tesla. Twain often visited hi
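In the output above, a heading is glued to the preceding sentence ('Thomas Edison.Nikola Tesla’s Early Years') because `.text` joins the text of neighbouring elements with nothing in between. A minimal sketch of one possible workaround, using BeautifulSoup's `get_text()` with a separator. The HTML fragment here is a made-up stand-in for the article body, not the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for the article body.
html = "<div><p>First paragraph.</p><h2>A Heading</h2><p>Second paragraph.</p></div>"
body = BeautifulSoup(html, "html.parser").find("div")

# .text concatenates the element texts with no separator.
glued = body.text

# get_text() can insert a separator between elements and strip stray whitespace.
spaced = body.get_text(separator=" ", strip=True)

print(glued)
print(spaced)
```

Headings still end up inline with the body text, but with a space in between, which gives the sentence tokenizer a better chance of splitting at the right place.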

In [34]:
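import nltk
# Tagger model for nltk.pos_tag(); later cells also need 'punkt', 'stopwords', and 'wordnet' (assumed installed).
nltk.download('averaged_perceptron_tagger')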

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

## Tokenizing

In [28]:
from nltk.tokenize import sent_tokenize, word_tokenize

# Get data
title, article = scrape_info(request())

# sent_tokenize() to split up article into sentences:
sentences = sent_tokenize(article)

# tokenizing by word
article_words = word_tokenize(article)

In [20]:
sentences

['Serbian-American engineer and physicist Nikola Tesla (1856-1943) made dozens of breakthroughs in the production, transmission and application of electric power.',
 'He invented the first alternating current (AC) motor and developed AC generation and transmission technology.',
 'Though he was famous and respected, he was never able to translate his copious inventions into long-term financial success—unlike his early employer and chief rival, Thomas Edison.Nikola Tesla’s Early Years     Nikola Tesla was born in 1856 in Smiljan, Croatia, then part of the Austro-Hungarian Empire.',
 'His father was a priest in the Serbian Orthodox church and his mother managed the family’s farm.',
 'In 1863 Tesla’s brother Daniel was killed in a riding accident.',
 'The shock of the loss unsettled the 7-year-old Tesla, who reported seeing visions—the first signs of his lifelong mental illnesses.Did you know?',
 'During the 1890s Mark Twain struck up a friendship with inventor Nikola Tesla.',
 'Twain ofte

## Filtering Stop Words
Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [44]:
from nltk.corpus import stopwords

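# English stop words as a set for fast membership tests
stop_words = set(stopwords.words("english"))

# Filter the word tokens. The check is case-sensitive and NLTK's list is
# lowercase, so capitalized tokens such as 'He' and 'The' are kept.
filtered_words = [word for word in article_words if word not in stop_words]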
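The filter above compares tokens exactly as they appear, so a capitalized stop word at the start of a sentence is not removed. The sketch below contrasts that with a case-insensitive comparison. The stop list and tokens are small hand-picked examples for illustration, not NLTK's actual list:

```python
# Tiny hand-picked stop list for illustration; NLTK's English list is much longer.
stop_words = {"he", "the", "in", "his", "and", "was", "a"}

tokens = ["He", "invented", "the", "first", "AC", "motor", "in", "His", "lab"]

# Case-sensitive check: capitalized stop words slip through.
case_sensitive = [w for w in tokens if w not in stop_words]

# Lowercasing only for the comparison keeps the original spelling of kept words.
case_insensitive = [w for w in tokens if w.lower() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to lowercase for the comparison depends on the task: it removes sentence-initial stop words but leaves the capitalization of the remaining tokens untouched.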

### Removing Punctuation

Punctuation tokens such as '.', ',' and '?' should be removed as well.

In [46]:
# Keep only tokens made up entirely of letters; this also drops numbers
# and hyphenated words such as 'long-term'.
filtered_words = [word for word in filtered_words if word.isalpha()]
filtered_words

['engineer',
 'physicist',
 'Nikola',
 'Tesla',
 'made',
 'dozens',
 'breakthroughs',
 'production',
 'transmission',
 'application',
 'electric',
 'power',
 'He',
 'invented',
 'first',
 'alternating',
 'current',
 'AC',
 'motor',
 'developed',
 'AC',
 'generation',
 'transmission',
 'technology',
 'Though',
 'famous',
 'respected',
 'never',
 'able',
 'translate',
 'copious',
 'inventions',
 'financial',
 'early',
 'employer',
 'chief',
 'rival',
 'Thomas',
 'Tesla',
 'Early',
 'Years',
 'Nikola',
 'Tesla',
 'born',
 'Smiljan',
 'Croatia',
 'part',
 'Empire',
 'His',
 'father',
 'priest',
 'Serbian',
 'Orthodox',
 'church',
 'mother',
 'managed',
 'family',
 'farm',
 'In',
 'Tesla',
 'brother',
 'Daniel',
 'killed',
 'riding',
 'accident',
 'The',
 'shock',
 'loss',
 'unsettled',
 'Tesla',
 'reported',
 'seeing',
 'first',
 'signs',
 'lifelong',
 'mental',
 'know',
 'During',
 'Mark',
 'Twain',
 'struck',
 'friendship',
 'inventor',
 'Nikola',
 'Tesla',
 'Twain',
 'often',
 'visited'
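Note that 'Serbian-American' and 'long-term' are missing from the list above: `str.isalpha()` rejects any token containing a non-letter, not just pure punctuation. The sketch below shows that behaviour on a few hand-written tokens and, as one possible alternative, a looser test that keeps any token containing at least one letter:

```python
tokens = ["Serbian-American", "engineer", "(", "1856-1943", ")", "long-term", "AC", "."]

# Strict filter: every character must be a letter.
kept = [t for t in tokens if t.isalpha()]

# Looser filter (an alternative, not what the cell above does):
# keep a token if it contains at least one letter.
with_letters = [t for t in tokens if any(ch.isalpha() for ch in t)]

print(kept)
print(with_letters)
```

The looser test keeps hyphenated words while still discarding brackets, full stops, and purely numeric tokens.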

## Stemming

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.

In [47]:
from nltk.stem import PorterStemmer

# Create the stemmer
stemmer = PorterStemmer()

# Stem every filtered word; the Porter stemmer also lowercases its output
stemmed_words = [stemmer.stem(word) for word in filtered_words]

stemmed_words

['engin',
 'physicist',
 'nikola',
 'tesla',
 'made',
 'dozen',
 'breakthrough',
 'product',
 'transmiss',
 'applic',
 'electr',
 'power',
 'he',
 'invent',
 'first',
 'altern',
 'current',
 'ac',
 'motor',
 'develop',
 'ac',
 'gener',
 'transmiss',
 'technolog',
 'though',
 'famou',
 'respect',
 'never',
 'abl',
 'translat',
 'copiou',
 'invent',
 'financi',
 'earli',
 'employ',
 'chief',
 'rival',
 'thoma',
 'tesla',
 'earli',
 'year',
 'nikola',
 'tesla',
 'born',
 'smiljan',
 'croatia',
 'part',
 'empir',
 'hi',
 'father',
 'priest',
 'serbian',
 'orthodox',
 'church',
 'mother',
 'manag',
 'famili',
 'farm',
 'in',
 'tesla',
 'brother',
 'daniel',
 'kill',
 'ride',
 'accid',
 'the',
 'shock',
 'loss',
 'unsettl',
 'tesla',
 'report',
 'see',
 'first',
 'sign',
 'lifelong',
 'mental',
 'know',
 'dure',
 'mark',
 'twain',
 'struck',
 'friendship',
 'inventor',
 'nikola',
 'tesla',
 'twain',
 'often',
 'visit',
 'lab',
 'tesla',
 'photograph',
 'great',
 'american',
 'writer',
 'one'
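As the output above shows, a stem is not always a dictionary word. A small sketch that stems a handful of words on their own makes this easier to see. The word list is chosen for illustration, and the stemmer needs no downloaded NLTK data:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few words picked for illustration, three of them from the article.
words = ["helping", "transmission", "famous", "early"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

'helping' reduces to the real word 'help', while 'transmission', 'famous', and 'early' reduce to the fragments seen in the list above.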

## Tagging Parts of Speech

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

Here’s how to import the relevant parts of NLTK in order to tag parts of speech:

In [54]:
import nltk

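# Tag the filtered words. Note that the tagger sees them without the stop
# words and punctuation that originally surrounded them, which can cost accuracy.
article_pos = nltk.pos_tag(filtered_words)
article_pos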

[('engineer', 'NN'),
 ('physicist', 'NN'),
 ('Nikola', 'NNP'),
 ('Tesla', 'NNP'),
 ('made', 'VBD'),
 ('dozens', 'NNS'),
 ('breakthroughs', 'JJ'),
 ('production', 'NN'),
 ('transmission', 'NN'),
 ('application', 'NN'),
 ('electric', 'JJ'),
 ('power', 'NN'),
 ('He', 'PRP'),
 ('invented', 'VBD'),
 ('first', 'RB'),
 ('alternating', 'VBG'),
 ('current', 'JJ'),
 ('AC', 'NNP'),
 ('motor', 'NN'),
 ('developed', 'VBD'),
 ('AC', 'NNP'),
 ('generation', 'NN'),
 ('transmission', 'NN'),
 ('technology', 'NN'),
 ('Though', 'NNP'),
 ('famous', 'JJ'),
 ('respected', 'VBD'),
 ('never', 'RB'),
 ('able', 'JJ'),
 ('translate', 'NN'),
 ('copious', 'JJ'),
 ('inventions', 'NNS'),
 ('financial', 'JJ'),
 ('early', 'JJ'),
 ('employer', 'NN'),
 ('chief', 'NN'),
 ('rival', 'JJ'),
 ('Thomas', 'NNP'),
 ('Tesla', 'NNP'),
 ('Early', 'NNP'),
 ('Years', 'NNP'),
 ('Nikola', 'NNP'),
 ('Tesla', 'NNP'),
 ('born', 'VBD'),
 ('Smiljan', 'NNP'),
 ('Croatia', 'NNP'),
 ('part', 'NN'),
 ('Empire', 'NNP'),
 ('His', 'PRP$'),
 ('fath
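Because the tagger returns plain (word, tag) tuples, ordinary Python tools are enough to summarise them. The sketch below counts tags and pulls out proper nouns, using the first few pairs copied by hand from the output above rather than a live call to the tagger:

```python
from collections import Counter

# First few (word, tag) pairs copied from the output above.
tagged = [
    ("engineer", "NN"), ("physicist", "NN"), ("Nikola", "NNP"), ("Tesla", "NNP"),
    ("made", "VBD"), ("dozens", "NNS"), ("breakthroughs", "JJ"), ("production", "NN"),
]

# How often each tag occurs.
tag_counts = Counter(tag for _, tag in tagged)

# Words tagged as singular proper nouns.
proper_nouns = [word for word, tag in tagged if tag == "NNP"]

print(tag_counts.most_common(2))
print(proper_nouns)
```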

Each word in the article is now paired in a tuple with a tag that represents its part of speech. But what do the tags mean? Here’s how to get a list of tags and their meanings:

In [49]:
nltk.help.upenn_tagset()

$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## Lemmatizing

Now that you’re up to speed on parts of speech, you can move on to lemmatizing. Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'transmiss'.

**Note:** A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

For example, if you were to look up the word “blending” in a dictionary, then you’d need to look at the entry for “blend,” but you would find “blending” listed in that entry.

In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when you lemmatize a word, you are reducing it to its lemma.

Here’s how to import the relevant parts of NLTK in order to start lemmatizing:

In [50]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Without a pos argument, lemmatize() treats every word as a noun
lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
lemmatized_words

['engineer',
 'physicist',
 'Nikola',
 'Tesla',
 'made',
 'dozen',
 'breakthrough',
 'production',
 'transmission',
 'application',
 'electric',
 'power',
 'He',
 'invented',
 'first',
 'alternating',
 'current',
 'AC',
 'motor',
 'developed',
 'AC',
 'generation',
 'transmission',
 'technology',
 'Though',
 'famous',
 'respected',
 'never',
 'able',
 'translate',
 'copious',
 'invention',
 'financial',
 'early',
 'employer',
 'chief',
 'rival',
 'Thomas',
 'Tesla',
 'Early',
 'Years',
 'Nikola',
 'Tesla',
 'born',
 'Smiljan',
 'Croatia',
 'part',
 'Empire',
 'His',
 'father',
 'priest',
 'Serbian',
 'Orthodox',
 'church',
 'mother',
 'managed',
 'family',
 'farm',
 'In',
 'Tesla',
 'brother',
 'Daniel',
 'killed',
 'riding',
 'accident',
 'The',
 'shock',
 'loss',
 'unsettled',
 'Tesla',
 'reported',
 'seeing',
 'first',
 'sign',
 'lifelong',
 'mental',
 'know',
 'During',
 'Mark',
 'Twain',
 'struck',
 'friendship',
 'inventor',
 'Nikola',
 'Tesla',
 'Twain',
 'often',
 'visited',
 '
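The list above still contains verb forms such as 'made', 'invented', and 'managed', because `lemmatize()` assumes a noun when no part of speech is given. One common approach is to translate each Penn Treebank tag into the single-letter codes WordNet uses. The helper below is a hypothetical sketch of that mapping (`penn_to_wordnet` is not part of NLTK), applied to a few hand-written pairs:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # treat everything else as a noun


# A few (word, tag) pairs in the same shape as the tagger's output.
tagged = [("made", "VBD"), ("dozens", "NNS"), ("famous", "JJ"), ("never", "RB")]

# Pair each word with the WordNet POS letter derived from its tag.
pairs = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pairs)
```

Each pair could then be passed on as `lemmatizer.lemmatize(word, pos=pos)`, which requires the WordNet data to be installed.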

## Chunking

While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.

**Note:** A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.

**Here are some examples:**

- “A planet”
- “A tilting planet”
- “A swiftly tilting planet”

Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

You’ve got a list of tuples of all the words in the article, along with their POS tags. In order to chunk, you first need to define a **chunk grammar**.

**Note:** A chunk grammar is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes.

In [52]:
# Chunk grammar written as a regular expression over POS tags
grammar = "NP: {<DT>?<JJ>*<NN>}"

According to the rule you created, your chunks:

- Start with an optional (`?`) determiner (`<DT>`)
- Can have any number (`*`) of adjectives (`<JJ>`)
- End with a noun (`<NN>`)
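To see the rule at work on something small, the sketch below applies the same grammar to a short, hand-tagged sentence and collects the resulting noun phrases. The sentence and its tags are written by hand for illustration, so no tagger model is needed:

```python
import nltk

grammar = "NP: {<DT>?<JJ>*<NN>}"
chunk_parser = nltk.RegexpParser(grammar)

# Hand-tagged example sentence (not taken from the article).
tagged = [
    ("the", "DT"), ("famous", "JJ"), ("inventor", "NN"),
    ("built", "VBD"),
    ("a", "DT"), ("motor", "NN"),
]

tree = chunk_parser.parse(tagged)

# Join the words of every NP subtree back into a phrase.
noun_phrases = [
    " ".join(word for word, _ in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]

print(noun_phrases)
```

The verb 'built' matches no part of the rule, so it stays outside the chunks and separates the two noun phrases.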

In [58]:
# chunk_parser
chunk_parser = nltk.RegexpParser(grammar)

tree = chunk_parser.parse(article_pos)

tree

ModuleNotFoundError: No module named 'svgling'

Tree('S', [Tree('NP', [('engineer', 'NN')]), Tree('NP', [('physicist', 'NN')]), ('Nikola', 'NNP'), ('Tesla', 'NNP'), ('made', 'VBD'), ('dozens', 'NNS'), Tree('NP', [('breakthroughs', 'JJ'), ('production', 'NN')]), Tree('NP', [('transmission', 'NN')]), Tree('NP', [('application', 'NN')]), Tree('NP', [('electric', 'JJ'), ('power', 'NN')]), ('He', 'PRP'), ('invented', 'VBD'), ('first', 'RB'), ('alternating', 'VBG'), ('current', 'JJ'), ('AC', 'NNP'), Tree('NP', [('motor', 'NN')]), ('developed', 'VBD'), ('AC', 'NNP'), Tree('NP', [('generation', 'NN')]), Tree('NP', [('transmission', 'NN')]), Tree('NP', [('technology', 'NN')]), ('Though', 'NNP'), ('famous', 'JJ'), ('respected', 'VBD'), ('never', 'RB'), Tree('NP', [('able', 'JJ'), ('translate', 'NN')]), ('copious', 'JJ'), ('inventions', 'NNS'), Tree('NP', [('financial', 'JJ'), ('early', 'JJ'), ('employer', 'NN')]), Tree('NP', [('chief', 'NN')]), ('rival', 'JJ'), ('Thomas', 'NNP'), ('Tesla', 'NNP'), ('Early', 'NNP'), ('Years', 'NNP'), ('Nikola'