# <center>Text Mining</center>

___

Text data falls into the category of unstructured data and requires some preparation before it can be used for modeling. Text preparation differs from the pre-processing of structured data.

Today we will go through the process of preparing text data and building a predictive model on it.

## Why SpaCy?

There are many different libraries that can be used for text related processing. We will work with SpaCy.

>SpaCy is a free and open-source library developed by Explosion AI. It works well for simple to complex language understanding tasks and is designed specifically for production use.

SpaCy provides trained models for 48 different languages and has a multi-language model as well. 

>Check this link for various English models : https://spacy.io/models/en

Before jumping in, let's have a look at the features provided by popular NLP libraries and how their performance compares to SpaCy.  
_*All charts are referenced from SpaCy Docs*_

### Feature Comparison

![](img/feature-comparision.png)

### Speed Comparison

![](img/speed-comparision.png)

### SpaCy Installation

To get started with SpaCy, install the package using pip in Terminal (for Mac) or Command Prompt (for Windows).

The pre-trained language model packages can be downloaded using the `spacy download` command. We will download the `en_core_web_md` package, whose name breaks down as follows:

- en = English
- core = Core (Vocab, Syntax, Entities, Vectors)
- web = Web Text
- sm/md/lg = Small/Medium/Large

!pip install spacy
!python -m spacy download en_core_web_md

Verify the library is installed

In [1]:
#! pip list | grep spacy
#! pip list | grep en-core-web-md

### Import SpaCy Module and Load the Language Model

*We load the medium model because the small model does not include the word vectors we will use later. If you do not need word vectors, the small model works too.*

In [1]:
import spacy

# Load the language model
nlp = spacy.load('en_core_web_md')

# Create SpaCy Object
doc = nlp("Hello World")

# Print the document text
print(doc)

Hello World


# Text Preprocessing

Basic steps in text pre-processing are :

- Tokenisation 
- Stop Words removal
- Matcher and PhraseMatcher
- Lemmatization
- Vectorization

## Understanding SpaCy Objects

#### I. `nlp` Object

When we load the SpaCy model, it creates the SpaCy object. We define it with the variable name `nlp`. 

This object contains the language specific vocabulary, model weights and processing pipeline like tokenisation rules, stop words, POS rules etc.

![](img/nlp.png)

Look for pipeline component names using `pipe_names` attribute

In [2]:
print(nlp.pipe_names)

nlp.pipeline

['tagger', 'parser', 'ner']


[('tagger', <spacy.pipeline.pipes.Tagger at 0x117d65c50>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x119741d08>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x119741d68>)]

When we process some text with `nlp` object, it creates a doc object, short for document.

#### II. `Doc` Object

![](img/doc.png)

Token objects represent the word tokens in the document. To get a token at a specific position, simply index the Doc object as you would a Python list.

#### III. `Span` Object

![](img/span.png)

A Span object is a slice of the document consisting of one or more tokens. To view a span, slice the Doc with start and end positions separated by `:`, as you would a Python list.

## Tokenisation

#### I. Word Tokenisation

In [3]:
doc = nlp("Have a Good Day!")

# Print the document text
print("Doc Object   : ", doc.text)

# Get the token text using .text attribute
print("Token Object : ", doc[1].text) 
print("Token Object : ", doc[-1].text) # punctuation is also a token

# Take a span of tokens
span = doc[1:4]

# Get the span text using .text attribute
print("Span Object  : ", span.text)

Doc Object   :  Have a Good Day!
Token Object :  a
Token Object :  !
Span Object  :  a Good Day


If all you want to do is tokenise, without running the rest of the pipeline, there are two options.

Let's compare them using the pos_ attribute on the token and check the output with only tokeniser and with complete pipeline.

- Default Pipeline

In [6]:
# Process the text 
doc = nlp("Hello world!")

print(doc.text)
print(doc[1].pos_)

Hello world!
NOUN


## POS Tags
* ADJ: adjective
* ADP: adposition
* ADV: adverb
* AUX: auxiliary verb
* CONJ: coordinating conjunction
* DET: determiner
* INTJ: interjection
* NOUN: noun
* NUM: numeral
* PART: particle
* PRON: pronoun
* PROPN: proper noun
* PUNCT: punctuation
* SCONJ: subordinating conjunction
* SYM: symbol
* VERB: verb
* X: other

- Doc object Only

`make_doc( )` function creates a doc object using only the tokeniser, so attributes such as `pos_` stay empty

In [6]:
# Process the text 
doc = nlp.make_doc("Hello world!")

print(doc.text)
print(doc[1].pos_)

Hello world!



- Tokeniser object ( Temporarily )

Disable the remaining pipeline components inside a `with` statement. They are restored automatically when the block exits, so only docs processed inside the block skip those components.

In [7]:
# Disable tagger and parser
with nlp.disable_pipes('tagger', 'parser'):
    
    # Process the text 
    doc = nlp("Hello world!")
    print(doc[1].pos_)




#### II. Sentence Tokenisation

In [8]:
doc = nlp("Have a Good Day! Same to you.")

[w.text for w in doc.sents]

['Have a Good Day!', 'Same to you.']

#### Custom Pipeline Component

Let's see how to add a custom component to the pipeline.

In [9]:
nlp1 = spacy.load('en_core_web_sm')


def add_comp1(doc):
    
    print("This doc has {} tokens.".format(len(doc)))
    
    if len(doc) < 10 :
        print("This is a small review")
    else :
        # covers docs with exactly 10 tokens as well
        print("This is a long review")
        
    return doc

nlp1.add_pipe(add_comp1, name = "size_info", last = True)

print(nlp1.pipe_names)  

doc = nlp1("The movie was good")

doc = nlp1("I loved the movie. It was a great experience!")

['tagger', 'parser', 'ner', 'size_info']
This doc has 4 tokens.
This is a small review
This doc has 11 tokens.
This is a long review


## Stop Words

Words that occur so frequently in documents that they add little meaning or value are called stop words. It is usually best to remove them, since they carry little information and consume a lot of resources.

##### Remember : Stop words are language & domain dependent.

In [10]:
from spacy.lang.en.stop_words import STOP_WORDS

# View the default stop words list
print(list(STOP_WORDS)[:10])


# Add some stop words to default list
stopwords = ['Hi', 'Bye'] + list(STOP_WORDS)
print(stopwords[:10])

# Remove some stop words from the default list
stopwords = set(STOP_WORDS) - {'except', 'was'}
print(list(stopwords)[:10])

['after', 'becomes', 'indeed', 'becoming', 'at', 'mine', 'without', 'cannot', 'less', 'call']
['Hi', 'Bye', 'after', 'becomes', 'indeed', 'becoming', 'at', 'mine', 'without', 'cannot']
['no', 'than', 'third', 'after', 'meanwhile', 'his', 'becomes', 'indeed', 'once', 'becoming']
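
The cell above only inspects the stop word list. As a minimal sketch of the removal step itself, the snippet below filters a whitespace-tokenised sentence against a tiny hand-written stop word set. The set and the sentence are illustrative assumptions; with SpaCy you would typically check each token against `STOP_WORDS` (or a customised copy of it) instead.

```python
# Illustrative stop word set; SpaCy's STOP_WORDS is far larger.
stop_words = {"the", "was", "a", "it", "and", "i"}

review = "I loved the movie and it was a great experience"

# Lowercase each token before the lookup so that "I" matches "i".
tokens = review.split()
filtered = [tok for tok in tokens if tok.lower() not in stop_words]

print(filtered)
```

Lowercasing before the lookup matters because stop word lists are usually stored in lower case.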


## Lexical Attributes

Attributes that don't hold any contextual information are called lexical attributes. 
Let's explore other available token attributes :

- i - index
- text - token text
- is_alpha - consists of alphabetic characters (True/False)
- is_punct - punctuation (True/False)
- like_num - resembles a number, e.g. "8" or "two" (True/False)

refer to https://github.com/explosion/spaCy/issues/1439 for all available lexical attributes

In [11]:
doc = nlp("I was stuck in traffic for two hours and reached home @ 8")

print('Index:   ', [token.i for token in doc])
print('Text:    ', [token.text for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc]) # observe both 'two' and '8'

Index:    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
Text:     ['I', 'was', 'stuck', 'in', 'traffic', 'for', 'two', 'hours', 'and', 'reached', 'home', '@', '8']
is_alpha: [True, True, True, True, True, True, True, True, True, True, True, False, False]
is_punct: [False, False, False, False, False, False, False, False, False, False, False, True, False]
like_num: [False, False, False, False, False, False, True, False, False, False, False, False, True]


## Lemmatization

A lemma is the root (dictionary) form of a word. Mapping related word forms to the same lemma helps to shrink the bag of words.

In [12]:
# Process a text
doc = nlp("I have 5 friends but my best friend is Kathie")
# Print the text and the predicted lemmas
print([(w.text, w.lemma_) for w in doc]) 

[('I', '-PRON-'), ('have', 'have'), ('5', '5'), ('friends', 'friend'), ('but', 'but'), ('my', '-PRON-'), ('best', 'good'), ('friend', 'friend'), ('is', 'be'), ('Kathie', 'Kathie')]
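
To see why lemmas shrink the bag of words, here is a small sketch that maps word forms to a shared root and counts the distinct entries before and after. The lookup table is a toy assumption made up for this example; it is not how SpaCy derives lemmas.

```python
# Toy lemma lookup for illustration only; not SpaCy's lemmatizer.
toy_lemmas = {"friends": "friend", "is": "be", "was": "be", "best": "good"}

tokens = ["friends", "friend", "is", "was", "best", "good"]

# Fall back to the token itself when it has no entry in the table.
lemmas = [toy_lemmas.get(tok, tok) for tok in tokens]

print(len(set(tokens)), "distinct tokens")
print(len(set(lemmas)), "distinct lemmas")
```

Fewer distinct entries means fewer columns once the text is vectorized.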


## Statistical Models

Now, let us look at some context based attributes. All the information to make these predictions is also loaded with the model. These models are trained on large datasets of labeled example texts. 

#### I. Part of Speech

- pos_ - Part of Speech

POS tagging means labeling the words in a sentence as nouns, adjectives, verbs etc. This is particularly helpful for telling homophones apart in speech to text analysis. _*eg. If you accidentally drank a bottle of fabric dye, you might die.*_ **[ GOOD TO KNOW 😀 ]**


refer to https://spacy.io/api/annotation#pos-tagging

In [10]:
# Process a text
doc = nlp("I am 20 years old")

# Print the text and the predicted tags
print([(w.text, w.pos_) for w in doc])


doc = nlp("pen a letter to your pen pal")

print([(w.text, w.pos_) for w in doc])

[('I', 'PRON'), ('am', 'VERB'), ('20', 'NUM'), ('years', 'NOUN'), ('old', 'ADJ')]
[('pen', 'VERB'), ('a', 'DET'), ('letter', 'NOUN'), ('to', 'ADP'), ('your', 'DET'), ('pen', 'NOUN'), ('pal', 'NOUN')]


Use spacy.explain( [tag] ) function to find the meaning of the different tags

In [11]:
print(spacy.explain('PRON'))
print(spacy.explain('NUM'))
print(spacy.explain('VERB'))
print(spacy.explain('NOUN'))
print(spacy.explain('ADJ'))
print(spacy.explain('ADP'))
print(spacy.explain('DET'))



pronoun
numeral
verb
noun
adjective
adposition
determiner


**Determiners** are one of the ingredients of noun phrases: they determine exactly which of several possible objects in the world a noun phrase refers to.

**Adpositional** phrases contain an adposition (preposition, postposition, or circumposition) as head and usually a complement such as a noun phrase.


#### II. Syntactic Dependency

- dep_ - Syntactic dependency

In [12]:
# Process a text
doc = nlp("I am 20 years old")

# Print the text and the predicted dependency labels
print([(w.text, w.dep_) for w in doc])

[('I', 'nsubj'), ('am', 'ROOT'), ('20', 'nummod'), ('years', 'npadvmod'), ('old', 'acomp')]


Find Meanings

In [13]:
print(spacy.explain('nsubj'))
print(spacy.explain('ROOT'))
print(spacy.explain('nummod'))
print(spacy.explain('npadvmod'))
print(spacy.explain('acomp'))

nominal subject
None
None
noun phrase as adverbial modifier
adjectival complement


#### Visualise using displacy

In [14]:
from spacy import displacy # for visualisation

displacy.render(doc, style="dep", jupyter=True)

## Named Entity Recognition

Named entities are "real world objects" that are assigned a name, such as people, places, things, locations, currencies, and more.

Named entities can be accessed by using `doc.ents`, which returns a sequence of Span objects.

In [15]:
# Process a text
doc = nlp("Virat Kohli (India) surpasses Sachin and Lara, becomes fastest to 20,000 international runs.")

# Iterate over the predicted entities

for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_, ent.start_char, ent.end_char)

Virat Kohli PERSON 0 11
India GPE 13 18
Sachin PERSON 30 36
Lara PERSON 41 45
20,000 CARDINAL 66 72


Find Meanings

In [16]:
print(spacy.explain('CARDINAL'))
print(spacy.explain('ORDINAL'))
print(spacy.explain('GPE'))

Numerals that do not fall under another type
"first", "second", etc.
Countries, cities, states


Visualise

In [17]:
displacy.render(doc, style="ent", jupyter=True)

## Pattern Matching

#### I. Token Matcher

Similar to regular expressions, the `Matcher` module in SpaCy allows pattern matching on Doc and Token objects.

A pattern object is a list of dictionaries that can be added to the matcher object using add( ) function.

- The first argument is a unique name for the pattern. 
- The second argument is an optional callback. We don't need it, so we set it to None. 
- The third argument is the pattern.

Matcher returns a list of tuples, each consisting of:

- hash value of pattern name
- start index
- end index

To make a token optional or repeatable, use the `OP` key:

- `!` - match 0 times (negate)
- `?` - match 0 or 1 times (optional)
- `+` - match 1 or more times 
- `*` - match 0 or more times

ref: https://spacy.io/usage/rule-based-matching
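
Since the `OP` operators mirror regular expression quantifiers, a character-level analogy with Python's `re` module may help. This is only a sketch of the quantifier semantics: the patterns below work on raw strings, whereas `Matcher` works on tokens, and the token pattern quoted in the comment is a hypothetical example rather than one used in this notebook.

```python
import re

# Character-level analogue of a token pattern such as
# [{'LOWER': 'very', 'OP': '*'}, {'LOWER': 'fast'}]
zero_or_more = re.compile(r"(?:very )*fast")

for phrase in ["fast", "very fast", "very very fast"]:
    print(phrase, "->", bool(zero_or_more.fullmatch(phrase)))

# "?" allows at most one occurrence, so a doubled "very" no longer matches.
optional = re.compile(r"(?:very )?fast")
print(bool(optional.fullmatch("very very fast")))
```

The same reading applies to `+` (at least one occurrence) and `!` (the token must not match).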

In [80]:
# Import the Matcher
from spacy.matcher import Matcher

# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)

In [83]:
# 1. match exact text for the first token, case-insensitive text for the second
pattern_1 = [{'TEXT': 'Virat'}, {'LOWER': 'kohli'}]

# 2. match digit
pattern_2 = [{'IS_DIGIT': True}]

# 3. match lemma and part of speech
# the "?" operator makes the fast lemma token optional
pattern_3 = [{'LEMMA': 'fast', 'OP': '?'}, {'POS': 'ADJ'}, {'POS': 'NOUN'}]

# 4. match a token written in upper case
pattern_4 = [{'IS_UPPER': True}]

# Add the pattern to the matcher
matcher.add('PATTERN_1', None, pattern_1)
matcher.add('PATTERN_2', None, pattern_2)
matcher.add('PATTERN_3', None, pattern_3)
matcher.add('PATTERN_4', None, pattern_4)

In [84]:
# Process some text
doc = nlp("Virat Kohli becomes FASTEST Indian cricketer to score 20000 international runs.")

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    print(nlp.vocab.strings[match_id])
    matched_span = doc[start:end]
    print(matched_span.text)

PATTERN_1
Virat Kohli
PATTERN_4
FASTEST
PATTERN_3
FASTEST Indian cricketer
PATTERN_3
Indian cricketer
PATTERN_2
20000
PATTERN_3
20000 international runs
PATTERN_3
international runs


#### II. Phrase Matcher

In case you want to match a span or phrase, use `PhraseMatcher` and pass Doc objects as patterns.

In [59]:
# Import the PhraseMatcher
from spacy.matcher import PhraseMatcher

# Initialize the matcher with the shared vocab
matcher = PhraseMatcher(nlp.vocab)

In [60]:
# 1. match the exact phrase
pattern_1 = nlp('Virat Kohli')

# 2. match the exact token text
pattern_2 = nlp('20000')

# Add the pattern to the matcher
matcher.add('NAME', None, pattern_1)
matcher.add('RUNS', None, pattern_2)

In [61]:
# Process some text
doc = nlp("Virat Kohli becomes fastest Indian cricketer to score 20000 international runs.")

# Iterate over the matches
for match_id, start, end in matcher(doc):
    
    # Get the matched span
    span = doc[start:end]
    print(span.text)

Virat Kohli
20000


## Similarity

When we load the language model, it also loads 300-dimensional vector representations for the words. These are pre-trained word embeddings; for the SpaCy v2 English models they are GloVe vectors trained on Common Crawl web text, a technique closely related to Word2Vec.

*You will learn more about Word2Vec in Deep Learning Module.*

*It's important to note that the small model doesn't contain word vectors. That is also the reason it loads faster.*

SpaCy computes a similarity score from these vectors at 3 levels 
- Doc,
- Span, and
- Token, 

using the similarity( ) function. By default, it uses cosine similarity, which can range from -1 to 1 but usually falls between 0 and 1 for word vectors.

*For Doc and Span, the vector is the average of the token vectors.*
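
As a rough illustration of what the score measures, the sketch below computes cosine similarity by hand and averages token vectors into a single vector, which is how a Doc or Span vector is formed by default. The three-dimensional vectors are made up for the example; the real vectors have 300 dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def average(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Made-up toy vectors, not taken from any SpaCy model.
toy_vectors = {
    "good": [1.0, 2.0, 0.0],
    "great": [2.0, 4.0, 0.0],   # same direction as "good"
    "bad": [-1.0, -2.0, 0.0],   # opposite direction
}

print(cosine(toy_vectors["good"], toy_vectors["great"]))
print(cosine(toy_vectors["good"], toy_vectors["bad"]))

# A span vector as the average of its token vectors.
span_vector = average([toy_vectors["good"], toy_vectors["great"]])
print(span_vector)
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why "good" and "great" score the same as identical vectors here.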

In [27]:
# Process some text
doc = nlp("Virat Kohli becomes fastest Indian cricketer to score 20000 international runs.")

# Access the vector via the token.vector attribute
print(doc[0].vector) 

[ 2.9832e-01  1.7471e-01  7.0592e-02  5.1045e-01  2.3978e-01  7.5919e-02
  2.3031e-01 -6.1956e-01  2.1394e-01 -1.1881e+00 -1.7273e-01 -3.8365e-01
 -8.6379e-01  2.4508e-02  9.3082e-02  4.1044e-01 -2.0216e-01 -1.3796e+00
  4.1507e-01  5.9532e-01  1.7056e-01  5.0616e-01  4.3366e-01  4.9059e-01
 -3.0667e-01  4.5887e-01  1.7847e-02  7.8978e-01 -3.6883e-01  5.2176e-01
  3.1528e-01 -9.4568e-03 -1.0214e-01  3.8308e-01 -7.4116e-01  2.5485e-01
 -3.8609e-01 -1.3838e-01  9.9488e-02 -4.1722e-01  4.0797e-01 -5.3534e-01
 -3.1326e-01 -1.7172e-01 -4.0924e-01  3.3197e-01 -1.9134e-01  1.1649e-02
  1.6144e-01 -2.4808e-01  4.0267e-01  2.9613e-01  6.2917e-02 -2.6367e-01
 -4.7752e-01  1.1263e-02 -1.2182e-01  3.3558e-01 -4.6654e-01 -1.8428e-01
  2.5530e-02 -1.4487e-01 -4.4719e-02 -2.5467e-02 -7.3004e-02  5.4738e-01
  7.8649e-02  2.9088e-01  1.5538e-01 -4.4414e-01  2.1993e-03 -2.1081e-01
  1.5998e-01 -7.3798e-01  2.2719e-01 -7.2435e-01 -5.0498e-01  4.7991e-02
  5.4960e-01 -2.2687e-01  4.4384e-01 -1.9989e-01  2

#### I. Token Similarity

In [62]:
doc1 = nlp("We are learning SpaCy")
doc2 = nlp("SpaCy is interesting")

token1 = doc1[2]
token2 = doc2[2]

print(token1.similarity(token2))

0.40250298


#### II. Span Similarity

In [63]:
doc1 = nlp("We are learning SpaCy")
doc2 = nlp("SpaCy is interesting")

span1 = doc1[2:]
span2 = doc2[:]

print(span1.similarity(span2))

0.42746237


#### III. Doc Similarity

In [64]:
doc1 = nlp("We are learning SpaCy")
doc2 = nlp("SpaCy is interesting")

print(doc1.similarity(doc2))

0.6511438662010053


## Vectorization

Once the bag of words is created it needs to be encoded as integers or floating point values to be used as an input to a machine learning algorithm. This is called feature extraction (or vectorization).

Let us understand Vectorization with a small example.

In [65]:
text = ['This is the first document.', 
        'This is the second second document.', 
        'And the third one.', 
        'Is this the first document?']

### I. CountVectorizer

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words and create a document-token matrix.

Let's use CountVectorizer from sklearn and create an instance of it.

In [66]:
from sklearn.feature_extraction.text import CountVectorizer

# create the transform
count_vectorizer = CountVectorizer()

Call the fit() function in order to learn a vocabulary from one or more documents.

In [67]:
# tokenize and build vocab
count_vectorizer.fit(text)

# summarize
print(count_vectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


Call the transform() function on one or more documents as needed to encode each as a vector.

In [68]:
# encode document
count_matrix = count_vectorizer.transform(text)

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

In [69]:
# summarize encoded vector
print(count_matrix.shape)

(4, 9)


Because these vectors will contain a lot of zeros, we call them sparse. 

SciPy provides an efficient way of handling sparse vectors in the `scipy.sparse` package. 

The vectors returned from a call to transform() will be sparse vectors, and you can transform them back to numpy arrays to look and understand what is going on by calling the toarray() function.

In [70]:
print(type(count_matrix))
print()
print(count_matrix.toarray())

<class 'scipy.sparse.csr.csr_matrix'>

[[0 1 1 1 0 0 1 0 1]
 [0 1 0 1 0 2 1 0 1]
 [1 0 0 0 1 0 1 1 0]
 [0 1 1 1 0 0 1 0 1]]
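
To make the idea of sparsity concrete, the sketch below stores only the non-zero cells of the matrix printed above, keyed by (row, column). This dictionary layout is an assumption chosen for readability; it is not the compressed row format that `csr_matrix` uses internally.

```python
dense = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
         [0, 1, 0, 1, 0, 2, 1, 0, 1],
         [1, 0, 0, 0, 1, 0, 1, 1, 0],
         [0, 1, 1, 1, 0, 0, 1, 0, 1]]

# Keep only the non-zero cells, keyed by (row, column).
sparse = {(i, j): value
          for i, row in enumerate(dense)
          for j, value in enumerate(row)
          if value != 0}

total_cells = len(dense) * len(dense[0])
print(len(sparse), "stored values out of", total_cells, "cells")
```

On a real corpus with tens of thousands of vocabulary entries, the share of non-zero cells is far smaller than in this toy matrix.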


The columns are all the word tokens (without punctuation) in sorted order.

In [71]:
print(count_vectorizer.get_feature_names())
import pandas as pd

pd.DataFrame(count_matrix.toarray(), columns=count_vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


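|   | and | document | first | is | one | second | the | third | this |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 0 | 2 | 1 | 0 | 1 |
| 2 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
| 3 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 |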


#### One can see that with CountVectorizer, all words were made lowercase by default and that the punctuation was ignored.
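
The lowercasing and punctuation handling can be reproduced by hand. The sketch below rebuilds the same vocabulary and count matrix with the standard library only, assuming CountVectorizer's default behaviour of lowercasing and keeping tokens of two or more word characters. It is meant to show what happens under the hood, not to replace the sklearn class.

```python
import re
from collections import Counter

corpus = ['This is the first document.',
          'This is the second second document.',
          'And the third one.',
          'Is this the first document?']

def tokenize(document):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", document.lower())

# One Counter of token frequencies per document.
counts = [Counter(tokenize(document)) for document in corpus]

# Sorted vocabulary across all documents gives the column order.
vocabulary = sorted(set().union(*counts))
matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

Because the pattern requires at least two characters, single-letter words such as "a" or "I" would be dropped as well.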

#### Bi-gram CountVectorizer

Note that in the previous corpus, the first and the last documents have exactly the same words hence are encoded in equal vectors. 

We lost the information that the last document is an interrogative form. 

To preserve some of the local ordering information we can extract 2-grams of words in addition to the 1-grams (individual words):

In [72]:
bigram_count_vectorizer = CountVectorizer(ngram_range=(1, 2))

bigram_count_vectorizer.fit_transform(text).toarray()

array([[0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0],
       [0, 0, 1, 0, 0, 1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0],
       [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0],
       [0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1]],
      dtype=int64)

In [73]:
print(bigram_count_vectorizer.get_feature_names())

['and', 'and the', 'document', 'first', 'first document', 'is', 'is the', 'is this', 'one', 'second', 'second document', 'second second', 'the', 'the first', 'the second', 'the third', 'third', 'third one', 'this', 'this is', 'this the']
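
The effect of bigrams on the first and last documents can be checked with a few lines of plain Python. This is a minimal sketch with a hypothetical helper `ngrams`; the two token lists are the lowercased, punctuation-free versions of those documents.

```python
def ngrams(tokens, n):
    """Return the n-grams in tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

first = "this is the first document".split()
last = "is this the first document".split()

# Same bag of words ...
print(sorted(first) == sorted(last))

# ... but different bigrams, so word order is partly preserved.
print(ngrams(first, 2))
print(ngrams(last, 2))
```

The bigrams "this is" and "is this" are what separate the statement from the question.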


### II. TfidfVectorizer

Word counts are a good starting point, but are very basic. One issue with simple counts is that longer documents will have more impact than shorter ones. An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”.

- Term Frequency: This summarizes how often a given term appears within a document.  
- Inverse Document Frequency: This downscales terms that appear a lot across documents.

![](img/tfidf.png)

#### TF-IDF are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. 

It is interesting to know that there are many variants of calculating tf and idf.

![](img/tfidf_formulas.png)

Let's use TfidfVectorizer from sklearn and create an instance of it.

In [74]:
from sklearn.feature_extraction.text import TfidfVectorizer

# create the transform
tfidf_vectorizer = TfidfVectorizer()

Call the fit() function in order to learn a vocabulary from one or more documents.

In [75]:
# tokenize and build vocab
tfidf_vectorizer.fit(text)

# summarize
print(tfidf_vectorizer.vocabulary_)

{'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}


Call the transform() function on one or more documents as needed to encode each as a vector.

In [76]:
# encode document
tfidf_matrix = tfidf_vectorizer.transform(text)

An encoded vector is returned with a length of the entire vocabulary and a TF-IDF weight, rather than an integer count, for each word in the document.

In [77]:
print(tfidf_matrix.toarray())

[[0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]
 [0.         0.27230147 0.         0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.43877674 0.54197657 0.43877674 0.         0.
  0.35872874 0.         0.43877674]]
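
The weights above can be recomputed by hand from the count matrix. The sketch below assumes TfidfVectorizer's default settings, namely a smoothed idf of `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation of each row; other tf and idf variants would give different numbers.

```python
import math

# Count matrix from the CountVectorizer section (columns in sorted order).
counts = [[0, 1, 1, 1, 0, 0, 1, 0, 1],
          [0, 1, 0, 1, 0, 2, 1, 0, 1],
          [1, 0, 0, 0, 1, 0, 1, 1, 0],
          [0, 1, 1, 1, 0, 0, 1, 0, 1]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents containing each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    # L2-normalise so that every document vector has unit length.
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 4) for x in row])
```

Note how "the", which appears in every document, gets the lowest idf, while "second" is boosted both by its count of 2 and by appearing in only one document.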


The columns are all the word tokens (without punctuation) in sorted order.

In [78]:
print(tfidf_vectorizer.get_feature_names())

['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']


### Web Scraping Example using Beautiful Soup

Lastly, let's see a small web scraping example using the BeautifulSoup library to extract text data from a web page. 

You will have to install three libraries 
- requests,
- beautifulsoup4, and
- html5lib (the parser passed to BeautifulSoup below)

! pip install requests

! pip install beautifulsoup4

! pip install html5lib

In [53]:
from bs4 import BeautifulSoup
import requests
import re

def url_to_string(url):
    
    res = requests.get(url)
    html = res.text
    soup = BeautifulSoup(html, 'html5lib')
    
    # remove all javascript, stylesheet and aside content
    for script in soup(["script", "style", 'aside']):
        script.extract()
    
    return " ".join(re.split(r'[\n\t]+', soup.get_text()))

# Adjacent string literals are joined, so no stray spaces end up in the URL
text = url_to_string('https://www.nytimes.com/2018/08/13/us/politics/peter-strzok-fired-fbi.html?'
                     'hp&action=click&pgtype=Homepage&clickSource=story-heading&'
                     'module=first-column-region&region=top-news&WT.nav=top-news')

text = text[:1001]
text

'     F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is Fired - The New York Times                                                                            SectionsSEARCHSkip to contentSkip to site indexPoliticsLog InLog InToday’s PaperPolitics|F.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredAdvertisementSupported byF.B.I. Agent Peter Strzok, Who Criticized Trump in Texts, Is FiredImagePeter Strzok, a top F.B.I. counterintelligence agent who was taken off the special counsel investigation after his disparaging texts about President Trump were uncovered, was fired.CreditCreditT.J. Kirkpatrick for The New York TimesBy Adam Goldman and Michael S. SchmidtAug. 13, 2018WASHINGTON — Peter Strzok, the F.B.I. senior counterintelligence agent who disparaged President Trump in inflammatory text messages and helped oversee the Hillary Clinton email and Russia investigations, has been fired for violating bureau policies, Mr. Strzok’s lawyer said Monday.Mr. Trump and h
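
In the output above, neighbouring page elements run together ("SectionsSEARCHSkip to content") because the text nodes are joined without a separator. The sketch below shows the same extraction idea with the standard library's `html.parser`, joining the pieces with spaces. The class name `TextExtractor` and the HTML snippet are made up for this example so that it runs without a network connection; it is an alternative illustration, not the BeautifulSoup approach used above.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script>, <style> and <aside> content."""

    SKIPPED = {"script", "style", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []      # text pieces in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore whitespace-only pieces and anything inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

# Made-up HTML snippet for illustration.
html = """
<html><head><style>p {color: red;}</style></head>
<body><nav>Sections</nav><h1>Headline</h1>
<script>var x = 1;</script><p>First paragraph.</p></body></html>
"""

parser = TextExtractor()
parser.feed(html)
clean_text = " ".join(parser.chunks)
print(clean_text)
```

Keeping a separator between elements makes the later tokenisation step more reliable, since words from different elements no longer fuse into one token.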

### References

- https://spacy.io/usage

### Appendix

I. Annotation tool (also developed by Explosion.AI)

- https://prodi.gy/demo?view_id=ner