## Getting Human Language Data

Here, we use an article on Nikola Tesla from history.com.

In [None]:
import requests
from bs4 import BeautifulSoup

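def request():
    """Download the article page and return it as a BeautifulSoup object."""
    url = "https://www.history.com/topics/inventions/nikola-tesla"
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # fail early on HTTP errors instead of parsing an error page
    text = r.text

    # create a beautiful soup object
    soup = BeautifulSoup(text, "html.parser")

    return soup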


In [70]:
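def scrape_info(soup):
    """
    :param soup: BeautifulSoup object for the article page
    :return: tuple of (article title, article body text)
    """
    article_title = soup.find("h1", class_="m-detail-header--title")
    article_title = article_title.text

    article_body = soup.find("div", class_="m-detail--body")
    # drop the sidebar, if present, so its text is not mixed into the article
    aside = article_body.find("aside")
    if aside is not None:
        aside.decompose()
    article_body = article_body.text

    return article_title, article_body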


In [71]:
import nltk


## Tokenizing

In [72]:
from nltk.tokenize import sent_tokenize, word_tokenize

# Get data
title, article = scrape_info(request())

# sent_tokenize() to split up article into sentences:
sentences = sent_tokenize(article)

# tokenizing by word
article_words = word_tokenize(article)

In [74]:
sentences[:3]

['Serbian-American engineer and physicist Nikola Tesla (1856-1943) made dozens of breakthroughs in the production, transmission and application of electric power.',
 'He invented the first alternating current (AC) motor and developed AC generation and transmission technology.',
 'Though he was famous and respected, he was never able to translate his copious inventions into long-term financial success—unlike his early employer and chief rival, Thomas Edison.Nikola Tesla’s Early Years     Nikola Tesla was born in 1856 in Smiljan, Croatia, then part of the Austro-Hungarian Empire.']
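
The third sentence in the output above runs straight into a section heading ("Edison.Nikola Tesla"), most likely because the scraped text has no whitespace at that point, so the tokenizer sees no sentence boundary. The sketch below is a heuristic, not part of NLTK: it inserts a space wherever a lowercase letter and a period are followed directly by an uppercase letter. The variable names are made up for illustration, and the snippet is a short excerpt of the output above.

```python
import re

# Short excerpt of the fused text from the tokenizer output above.
raw = "chief rival, Thomas Edison.Nikola Tesla"

# Zero-width match: after "<lowercase letter>." and before an uppercase letter.
# Replacing that empty position with a space restores the missing gap.
fixed = re.sub(r"(?<=[a-z]\.)(?=[A-Z])", " ", raw)

print(fixed)  # chief rival, Thomas Edison. Nikola Tesla
```

This only repairs boundaries that end in a period. A heading with no trailing period still merges with the sentence that follows it. Another option is to pass a separator at scraping time, for example BeautifulSoup's get_text(separator=" ").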

## Filtering Stop Words
Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like 'in', 'is', and 'an' are often used as stop words since they don’t add a lot of meaning to a text in and of themselves.

In [75]:
from nltk.corpus import stopwords

# a list of English stop words (all entries are lowercase)
stop_words = stopwords.words("english")

# filter word tokens
filtered_words = [word for word in article_words if word not in stop_words]
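
NLTK's English stop words are all lowercase, so the comparison above is case-sensitive and capitalized tokens such as 'He' and 'In' are kept. The sketch below shows the difference on a small hand-made example. The stop-word set and the token list are hypothetical stand-ins, not NLTK data.

```python
# Hypothetical, hand-picked stop-word set; NLTK's English list is much longer.
sample_stop_words = {"he", "in", "the", "was", "a", "of"}

tokens = ["He", "was", "born", "in", "Smiljan", "In", "1884", "he", "moved"]

# Case-sensitive membership test: capitalized "He" and "In" slip through.
case_sensitive = [t for t in tokens if t not in sample_stop_words]

# Lowercasing only for the comparison keeps the original casing of kept tokens.
case_insensitive = [t for t in tokens if t.lower() not in sample_stop_words]

print(case_sensitive)    # ['He', 'born', 'Smiljan', 'In', '1884', 'moved']
print(case_insensitive)  # ['born', 'Smiljan', '1884', 'moved']
```

Using a set rather than a list also makes each membership test faster, which can matter on longer texts.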

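### Removing Punctuation

Punctuation tokens such as '.', ',', and '?' add nothing to word counts, so they are filtered out next. Because the filter uses isalpha(), tokens that contain hyphens or digits, such as 'Serbian-American' and '1856-1943', are dropped as well.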

In [78]:
# keep only tokens in which every character is a letter
filtered_article_words = [word for word in filtered_words if word.isalpha()]
filtered_article_words[:10]

['engineer',
 'physicist',
 'Nikola',
 'Tesla',
 'made',
 'dozens',
 'breakthroughs',
 'production',
 'transmission',
 'application']
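
To see exactly what the isalpha() filter keeps and drops, here is a small sketch on a hand-made token list (the tokens are chosen for illustration). The second filter is only one possible alternative: it keeps any token that has at least one letter or digit.

```python
tokens = ["Serbian-American", "engineer", "(", "1856-1943", ")", "AC", "Tesla", "."]

# str.isalpha() is True only when every character in the token is a letter.
alpha_only = [t for t in tokens if t.isalpha()]

# Alternative: keep tokens containing at least one letter or digit,
# so only pure punctuation is removed.
has_alnum = [t for t in tokens if any(ch.isalnum() for ch in t)]

print(alpha_only)  # ['engineer', 'AC', 'Tesla']
print(has_alnum)   # ['Serbian-American', 'engineer', '1856-1943', 'AC', 'Tesla']
```

Which filter is appropriate depends on whether hyphenated words and dates matter for the analysis.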

## Stemming

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.

In [81]:
from nltk.stem import PorterStemmer

# stemming object
stemmer = PorterStemmer()

# stemmed_words
stemmed_words = [stemmer.stem(word) for word in filtered_words]

stemmed_words[:7]

['serbian-american', 'engin', 'physicist', 'nikola', 'tesla', '(', '1856-1943']

## Tagging Parts of Speech

Part of speech is a grammatical term that deals with the roles words play when you use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in your text according to their part of speech.

Here’s how to import the relevant parts of NLTK in order to tag parts of speech:

In [83]:
import nltk

article_pos = nltk.pos_tag(filtered_words)
article_pos[:5]

[('Serbian-American', 'JJ'),
 ('engineer', 'NN'),
 ('physicist', 'NN'),
 ('Nikola', 'NNP'),
 ('Tesla', 'NNP')]

Each word in the article is now in its own tuple, paired with a tag that represents its part of speech. But what do the tags mean? Here’s how to get a list of tags and their meanings:

In [84]:
# Uncomment to print every tag with its meaning (the output is long)
# nltk.help.upenn_tagset()
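
As a quick reference, the sketch below lists a handful of Penn Treebank tags, including the ones that appear in the output above. The dictionary is a small hand-written subset for illustration, not something NLTK provides in this form.

```python
# Hand-written subset of Penn Treebank tags and their meanings.
tag_meanings = {
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "VBD": "verb, past tense",
}

tagged = [("Serbian-American", "JJ"), ("engineer", "NN"), ("Nikola", "NNP")]

# Pair each word with a readable description of its tag.
described = [(word, tag_meanings.get(tag, "unknown")) for word, tag in tagged]

for word, meaning in described:
    print(f"{word:<18}{meaning}")
```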

## Lemmatizing

Now that you’re up to speed on parts of speech, you can circle back to lemmatizing. Like stemming, lemmatizing reduces words to their core meaning, but it will give you a complete English word that makes sense on its own instead of just a fragment of a word like 'discoveri'.

**Note:** A lemma is a word that represents a whole group of words, and that group of words is called a lexeme.

For example, if you were to look up the word “blending” in a dictionary, then you’d need to look at the entry for “blend,” but you would find “blending” listed in that entry.

In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when you lemmatize a word, you are reducing it to its lemma.

Here’s how to import the relevant parts of NLTK in order to start lemmatizing:

In [88]:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatized_words = [lemmatizer.lemmatize(word) for word in filtered_words]
lemmatized_words[:5]

['Serbian-American', 'engineer', 'physicist', 'Nikola', 'Tesla']

## Chunking

While tokenizing allows you to identify words and sentences, chunking allows you to identify phrases.

**Note:** A phrase is a word or group of words that works as a single unit to perform a grammatical function. Noun phrases are built around a noun.

**Here are some examples:**

- “A planet”
- “A tilting planet”
- “A swiftly tilting planet”

Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

You’ve got a list of tuples of all the words in the article, along with their POS tags. In order to chunk, you first need to define a **chunk grammar**.

**Note:** A chunk grammar is a combination of rules on how sentences should be chunked. It often uses regular expressions, or regexes.

In [89]:
# chunk grammar with regular exp
grammar = "NP: {<DT>?<JJ>*<NN>}"

According to the rule you created, your chunks:

- Start with an optional (?) determiner (DT)
- Can have any number (*) of adjectives (JJ)
- End with a noun (NN)
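
To make the rule concrete, the sketch below applies the same pattern to a hand-tagged sentence using only Python's re module. This is an illustration of what the rule matches, not a replacement for nltk.RegexpParser. The sentence and its tags are assumed for the example rather than produced by a tagger.

```python
import re

# Hand-tagged example sentence (tags assumed for illustration).
tagged = [
    ("the", "DT"), ("brilliant", "JJ"), ("young", "JJ"), ("engineer", "NN"),
    ("built", "VBD"), ("a", "DT"), ("motor", "NN"),
]

# Encode the tag sequence as one string: "<DT><JJ><JJ><NN><VBD><DT><NN>".
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Same shape as the grammar rule: optional DT, any number of JJ, then one NN.
pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for match in pattern.finditer(tag_string):
    # Each tag starts with "<", so counting "<" converts offsets to token indexes.
    start = tag_string[:match.start()].count("<")
    length = match.group().count("<")
    chunks.append(" ".join(word for word, _ in tagged[start:start + length]))

print(chunks)  # ['the brilliant young engineer', 'a motor']
```

Note that the rule only matches the exact tag NN, so proper nouns (NNP) and plural nouns (NNS) would not end a chunk unless the grammar is extended.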

In [93]:
# chunk_parser
chunk_parser = nltk.RegexpParser(grammar)

tree = chunk_parser.parse(nltk.pos_tag(article_words))

tree.draw()

## Named Entity Recognition (NER)
Named entities are noun phrases that refer to specific locations, people, organizations, and so on. With named entity recognition, you can find the named entities in your texts and also determine what kind of named entity they are.

In [65]:
tree = nltk.ne_chunk(article_pos)

In [62]:
tree.draw()

## Frequency Distribution
With a frequency distribution, you can check which words show up most frequently in your text. You’ll need to get started with an import:

In [91]:
from nltk import FreqDist

fdist = FreqDist(filtered_article_words)
fdist

FreqDist({'Tesla': 27, 'Edison': 8, 'AC': 7, 'first': 6, 'Westinghouse': 6, 'power': 5, 'He': 5, 'In': 5, 'years': 5, 'New': 5, ...})

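FreqDist is a subclass of collections.Counter, so it inherits methods such as most_common(). Here’s how to list the ten most frequent words in the article. 'He' and 'In' show up in the counts because the stop-word filter used earlier is case-sensitive: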

In [92]:
# The top ten most frequent words in the article
top_ten_words = fdist.most_common(10)

top_ten_words

[('Tesla', 27),
 ('Edison', 8),
 ('AC', 7),
 ('first', 6),
 ('Westinghouse', 6),
 ('power', 5),
 ('He', 5),
 ('In', 5),
 ('years', 5),
 ('New', 5)]
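
Because FreqDist builds on collections.Counter, the same counting behavior can be tried with the standard library alone. The sketch below uses a short made-up token list and also shows one possible way to merge entries that differ only in case, such as 'He' and 'he'.

```python
from collections import Counter

# Made-up tokens for illustration.
tokens = ["Tesla", "AC", "Tesla", "He", "he", "AC", "Tesla", "power"]

# Count tokens as they are: "He" and "he" are separate entries.
counts = Counter(tokens)
print(counts.most_common(2))  # [('Tesla', 3), ('AC', 2)]

# Lowercasing before counting merges entries that differ only in case.
# The cost is that acronyms and names lose their casing ("AC" becomes "ac").
folded = Counter(t.lower() for t in tokens)
print(folded["he"])  # 2
```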