## Getting Human Language Data

We use an article about Nikola Tesla from history.com as the source text.

In [None]:
import requests
from bs4 import BeautifulSoup

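def request():
    """Download the article page and return it as a BeautifulSoup object."""
    url = "https://www.history.com/topics/inventions/nikola-tesla"
    r = requests.get(url, timeout=10)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    text = r.text

    # create a BeautifulSoup object
    soup = BeautifulSoup(text, "html.parser")

    return soup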


In [11]:
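def scrape_info(soup):
    """
    Extract the title and body text from the article page.

    :param soup: BeautifulSoup object for the article page
    :return: tuple (article_title, article_body), both strings
    """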
    article_title = soup.find("h1", class_="m-detail-header--title")
    article_title = article_title.text

    article_body = soup.find("div", class_="m-detail--body")
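    # remove the <aside> element, if present, before extracting the text
    aside = article_body.find("aside")
    if aside is not None:
        aside.decompose()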
    article_body = article_body.text

    return article_title, article_body

print(scrape_info(request()))

('Nikola Tesla', 'Serbian-American engineer and physicist Nikola Tesla (1856-1943) made dozens of breakthroughs in the production, transmission and application of electric power. He invented the first alternating current (AC) motor and developed AC generation and transmission technology. Though he was famous and respected, he was never able to translate his copious inventions into long-term financial success—unlike his early employer and chief rival, Thomas Edison.Nikola Tesla’s Early Years     Nikola Tesla was born in 1856 in Smiljan, Croatia, then part of the Austro-Hungarian Empire. His father was a priest in the Serbian Orthodox church and his mother managed the family’s farm. In 1863 Tesla’s brother Daniel was killed in a riding accident. The shock of the loss unsettled the 7-year-old Tesla, who reported seeing visions—the first signs of his lifelong mental illnesses.Did you know? During the 1890s Mark Twain struck up a friendship with inventor Nikola Tesla. Twain often visited hi

In [21]:
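import nltk
nltk.download('stopwords')
# Note: sent_tokenize() and word_tokenize(), used below, also need the Punkt
# tokenizer data ('punkt', or 'punkt_tab' in recent NLTK releases).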

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## Tokenizing

In [28]:
from nltk.tokenize import sent_tokenize, word_tokenize

# Get data
title, article = scrape_info(request())

# sent_tokenize() to split up article into sentences:
sentences = sent_tokenize(article)

# tokenizing by word
article_words = word_tokenize(article)

In [20]:
sentences

['Serbian-American engineer and physicist Nikola Tesla (1856-1943) made dozens of breakthroughs in the production, transmission and application of electric power.',
 'He invented the first alternating current (AC) motor and developed AC generation and transmission technology.',
 'Though he was famous and respected, he was never able to translate his copious inventions into long-term financial success—unlike his early employer and chief rival, Thomas Edison.Nikola Tesla’s Early Years     Nikola Tesla was born in 1856 in Smiljan, Croatia, then part of the Austro-Hungarian Empire.',
 'His father was a priest in the Serbian Orthodox church and his mother managed the family’s farm.',
 'In 1863 Tesla’s brother Daniel was killed in a riding accident.',
 'The shock of the loss unsettled the 7-year-old Tesla, who reported seeing visions—the first signs of his lifelong mental illnesses.Did you know?',
 'During the 1890s Mark Twain struck up a friendship with inventor Nikola Tesla.',
 'Twain ofte
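
To make the splitting rules concrete, here is a minimal sketch that uses only the standard library. It is a naive stand-in, not the Punkt model behind `sent_tokenize`: `naive_sent_split` and `naive_word_split` are hypothetical helpers, and the sample text is adapted from the article. In this sketch a sentence boundary needs whitespace after the period, which is one plausible reason why `Edison.Nikola` in the output above stays inside a single sentence.

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' when whitespace follows (not Punkt)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def naive_word_split(text):
    """Return word-like runs (inner hyphens/apostrophes kept) and single punctuation marks."""
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)

# Sample adapted from the article; the missing space after "Edison." is deliberate.
sample = (
    "He invented the first alternating current (AC) motor. "
    "He never matched his chief rival, Thomas Edison.Nikola Tesla was born in 1856."
)

sentences = naive_sent_split(sample)
words = naive_word_split(sentences[0])

print(len(sentences))  # the glued "Edison.Nikola" does not start a new sentence
print(words)
```

If the glued text matters for later steps, it is probably easier to fix it where the text is extracted than to patch the tokenizer.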

## Filtering Stop Words

Stop words are words that you want to ignore, so you filter them out of your text before processing it. Very common words like 'in', 'is', and 'an' are often used as stop words because they carry little meaning on their own.

In [29]:
from nltk.corpus import stopwords

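# English stop words (a set makes membership tests fast)
stop_words = set(stopwords.words("english"))

# drop stop words; the comparison is case-sensitive,
# so capitalized forms such as 'He' and 'The' are kept
filtered_list = [word for word in article_words if word not in stop_words]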
print(filtered_list)

['Serbian-American', 'engineer', 'physicist', 'Nikola', 'Tesla', '(', '1856-1943', ')', 'made', 'dozens', 'breakthroughs', 'production', ',', 'transmission', 'application', 'electric', 'power', '.', 'He', 'invented', 'first', 'alternating', 'current', '(', 'AC', ')', 'motor', 'developed', 'AC', 'generation', 'transmission', 'technology', '.', 'Though', 'famous', 'respected', ',', 'never', 'able', 'translate', 'copious', 'inventions', 'long-term', 'financial', 'success—unlike', 'early', 'employer', 'chief', 'rival', ',', 'Thomas', 'Edison.Nikola', 'Tesla', '’', 'Early', 'Years', 'Nikola', 'Tesla', 'born', '1856', 'Smiljan', ',', 'Croatia', ',', 'part', 'Austro-Hungarian', 'Empire', '.', 'His', 'father', 'priest', 'Serbian', 'Orthodox', 'church', 'mother', 'managed', 'family', '’', 'farm', '.', 'In', '1863', 'Tesla', '’', 'brother', 'Daniel', 'killed', 'riding', 'accident', '.', 'The', 'shock', 'loss', 'unsettled', '7-year-old', 'Tesla', ',', 'reported', 'seeing', 'visions—the', 'first',
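
The filtered output above still contains 'He', 'His', 'In', and 'The'. NLTK's English stop word list is lowercase, so a case-sensitive comparison lets capitalized forms through. The sketch below shows one possible fix, comparing a case-folded copy of each token. The small `stop_words` set here is a hand-picked stand-in for the full NLTK list, and the token list is a shortened sample, so this is an illustration rather than a drop-in replacement for the cell above.

```python
# Hand-picked stand-in for stopwords.words("english"); the real list is much longer.
stop_words = {"he", "his", "in", "the", "and", "a", "was"}

# Shortened sample of word tokens from the article.
tokens = ["He", "invented", "the", "first", "alternating", "current", "motor", "."]

# Case-sensitive check, as in the cell above: "He" is kept.
case_sensitive = [t for t in tokens if t not in stop_words]

# Case-insensitive check: compare the case-folded token, keep the original spelling.
case_insensitive = [t for t in tokens if t.casefold() not in stop_words]

print(case_sensitive)
print(case_insensitive)
```

Whether to fold case depends on the task: it removes sentence-initial stop words, but it would also drop a capitalized token that merely looks like a stop word.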

## Stemming

Stemming is a text processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help.” Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but you’ll be using the Porter stemmer.
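
As a rough illustration of the idea, the sketch below strips a few common suffixes so that related word forms collapse to one stem. It is deliberately naive and is not the Porter algorithm: `naive_stem` and the `SUFFIXES` rule list are hypothetical, chosen only for this example.

```python
# Hypothetical rule list; real stemmers use far more rules and conditions.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(word):
    """Lowercase the word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = {w: naive_stem(w) for w in ["helping", "helper", "helped", "helps", "help"]}
print(stems)
```

A rule list this short mis-stems many words (for example, it would cut "sing" nowhere but turn "bring" into "br" only if the length guard were removed), which is why an established stemmer is the better choice in practice, and its results can differ from this sketch.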

In [None]:
from nltk.stem import PorterStemmer

# stemming object
stemmer = PorterStemmer()
