In [1]:
from Cython.Build import Inline
import pytest

The first thing to do when working with spaCy is to load a language model and instantiate an `nlp` object with it.\*

The `nlp` object creates a **pipeline** of NLP processing tasks. You can add pre-defined pipes to:

- **tokenize** raw text, i.e. segment it into tokens (roughly word-level)
- **tag** your text with **part-of-speech** labels
- **parse** your raw text to assign **dependency** labels
- tag **named entities**

... and a lot of other fancy things. You can also define any kind of **custom pipe** that you want and create attributes that will be stored in the Token's metadata structure.

Basically, this means your whole NLP pipeline gets consolidated into one `nlp.pipe(docs, batch_size=batch_size)` call. Nice.

\**(Later, you will want to use different models: either spaCy's pretrained vectors from the web corpora, or GloVe, or your own, but not right now.)*
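As a minimal sketch of the points above, the cell below registers a custom Token attribute and runs several texts through a single `nlp.pipe` call. It assumes a blank English pipeline (tokenizer only, so no model download is needed). The names `blank_nlp`, `texts`, and `is_shouting` are illustrative and not part of the original notebook.

```python
import spacy
from spacy.tokens import Token

# Assumption: a blank English pipeline, which contains only the tokenizer.
blank_nlp = spacy.blank("en")

# Hypothetical custom attribute, exposed as token._.is_shouting.
# force=True lets the cell be re-run without an "already registered" error.
Token.set_extension(
    "is_shouting",
    getter=lambda t: len(t.text) > 1 and t.text.isupper(),
    force=True,
)

texts = ["WASHINGTON is a city.", "So is London.", "Moscow too."]

# nlp.pipe streams the texts through the pipeline in batches and yields Docs.
docs = list(blank_nlp.pipe(texts, batch_size=2))

shouted = [t.text for d in docs for t in d if t._.is_shouting]
print(len(docs), shouted)
```

Custom attributes live under the `._` namespace, so they cannot collide with spaCy's built-in Token attributes.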

In [2]:
import spacy
from spacy.lang.en import English # languages are coded as classes
from spacy import displacy

The thing you are loading with `spacy.load()` is (usually) a [pre-trained statistical model](https://github.com/explosion/spacy-models).

In [48]:
nlp = spacy.load('en_core_web_sm') # load the small pre-trained English model and its pipeline

In [21]:
nlp.__dir__()

['_meta',
 '_path',
 'vocab',
 'tokenizer',
 'pipeline',
 'max_length',
 '_optimizer',
 '__module__',
 'lang',
 'Defaults',
 '__doc__',
 'factories',
 '__init__',
 'path',
 'meta',
 'tensorizer',
 'tagger',
 'parser',
 'entity',
 'linker',
 'matcher',
 'pipe_names',
 'get_pipe',
 'create_pipe',
 'add_pipe',
 'has_pipe',
 'replace_pipe',
 'rename_pipe',
 'remove_pipe',
 '__call__',
 'disable_pipes',
 'make_doc',
 'update',
 'rehearse',
 'preprocess_gold',
 'begin_training',
 'resume_training',
 'evaluate',
 'use_params',
 'pipe',
 'to_disk',
 'from_disk',
 'to_bytes',
 'from_bytes',
 '__dict__',
 '__weakref__',
 '__repr__',
 '__hash__',
 '__str__',
 '__getattribute__',
 '__setattr__',
 '__delattr__',
 '__lt__',
 '__le__',
 '__eq__',
 '__ne__',
 '__gt__',
 '__ge__',
 '__new__',
 '__reduce_ex__',
 '__reduce__',
 '__subclasshook__',
 '__init_subclass__',
 '__format__',
 '__sizeof__',
 '__dir__',
 '__class__']

In [23]:
# see what's in the pipeline
print(nlp.pipeline)
for name in nlp.pipe_names:
    print(name)

[('tagger', <spacy.pipeline.pipes.Tagger object at 0x7fd9ac779390>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x7fd9aa5fe708>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x7fd9aa5fe768>)]
tagger
parser
ner


In [45]:
doc = nlp("WASHINGTON — During a night of heavy drinking at an upscale London bar in May 2016, George Papadopoulos, a young foreign policy adviser to the Trump campaign, made a startling revelation to Australia’s top diplomat in Britain: Russia had political dirt on Hillary Clinton. About three weeks earlier, Mr. Papadopoulos had been told that Moscow had thousands of emails that would embarrass Mrs. Clinton, apparently stolen in an effort to try to damage her campaign. Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear. But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians’ role. The hacking and the revelation that a member of the Trump campaign may have had inside information about it were driving factors that led the F.B.I. to open an investigation in July 2016 into Russia’s attempts to disrupt the election and whether any of President Trump’s associates conspired. If Mr. Papadopoulos, who pleaded guilty to lying to the F.B.I. and is now a cooperating witness, was the improbable match that set off a blaze that has consumed the first year of the Trump administration, his saga is also a tale of the Trump campaign in miniature. He was brash, boastful and underqualified, yet he exceeded expectations. And, like the campaign itself, he proved to be a tantalizing target for a Russian influence operation. While some of Mr. Trump’s advisers have derided him as an insignificant campaign volunteer or a “coffee boy,” interviews and new documents show that he stayed influential throughout the campaign. Two months before the election, for instance, he helped arrange a New York meeting between Mr. Trump and President Abdel Fattah el-Sisi of Egypt. The information that Mr. 
Papadopoulos gave to the Australians answers one of the lingering mysteries of the past year: What so alarmed American officials to provoke the F.B.I. to open a counterintelligence investigation into the Trump campaign months before the presidential election? It was not, as Mr. Trump and other politicians have alleged, a dossier compiled by a former British spy hired by a rival campaign. Instead, it was firsthand information from one of America’s closest intelligence allies. Interviews and previously undisclosed documents show that Mr. Papadopoulos played a critical role in this drama and reveal a Russian operation that was more aggressive and widespread than previously known. They add to an emerging portrait, gradually filled in over the past year in revelations by federal investigators, journalists and lawmakers, of Russians with government contacts trying to establish secret channels at various levels of the Trump campaign. The F.B.I. investigation, which was taken over seven months ago by the special counsel, Robert S. Mueller III, has cast a shadow over Mr. Trump’s first year in office — even as he and his aides repeatedly played down the Russian efforts and falsely denied campaign contacts with Russians. They have also insisted that Mr. Papadopoulos was a low-level figure. But spies frequently target peripheral players as a way to gain insight and leverage. F.B.I. officials disagreed in 2016 about how aggressively and publicly to pursue the Russia inquiry before the election. But there was little debate about what seemed to be afoot. John O. Brennan, who retired this year after four years as C.I.A. director, told Congress in May that he had been concerned about multiple contacts between Russian officials and Trump advisers. Russia, he said, had tried to “suborn” members of the Trump campaign. Mr. 
Papadopoulos, then an ambitious 28-year-old from Chicago, was working as an energy consultant in London when the Trump campaign, desperate to create a foreign policy team, named him as an adviser in early March 2016. His political experience was limited to two months on Ben Carson’s presidential campaign before it collapsed. Mr. Papadopoulos had no experience on Russia issues. But during his job interview with Sam Clovis, a top early campaign aide, he saw an opening. He was told that improving relations with Russia was one of Mr. Trump’s top foreign policy goals, according to court papers, an account Mr. Clovis has denied. Traveling in Italy that March, Mr. Papadopoulos met Joseph Mifsud, a Maltese professor at a now-defunct London academy who had valuable contacts with the Russian Ministry of Foreign Affairs. Mr. Mifsud showed little interest in Mr. Papadopoulos at first. But when he found out he was a Trump campaign adviser, he latched onto him, according to court records and emails obtained by The New York Times. Their joint goal was to arrange a meeting between Mr. Trump and President Vladimir V. Putin of Russia in Moscow, or between their respective aides. In response to questions, Mr. Papadopoulos’s lawyers declined to provide a statement. Before the end of the month, Mr. Mifsud had arranged a meeting at a London cafe between Mr. Papadopoulos and Olga Polonskaya, a young woman from St. Petersburg whom he falsely described as Mr. Putin’s niece. Although Ms. Polonskaya told The Times in a text message that her English skills are poor, her emails to Mr. Papadopoulos were largely fluent. “We are all very excited by the possibility of a good relationship with Mr. Trump,” Ms. Polonskaya wrote in one message. More important, Mr. Mifsud connected Mr. Papadopoulos to Ivan Timofeev, a program director for the prestigious Valdai Discussion Club, a gathering of academics that meets annually with Mr. Putin. 
The two men corresponded for months about how to connect the Russian government and the campaign. Records suggest that Mr. Timofeev, who has been described by Mr. Mueller’s team as an intermediary for the Russian Foreign Ministry, discussed the matter with the ministry’s former leader, Igor S. Ivanov, who is widely viewed in the United States as one of Russia’s elder statesmen. When Mr. Trump’s foreign policy team gathered for the first time at the end of March in Washington, Mr. Papadopoulos said he had the contacts to set up a meeting between Mr. Trump and Mr. Putin. Mr. Trump listened intently but apparently deferred to Jeff Sessions, then a senator from Alabama and head of the campaign’s foreign policy team, according to participants in the meeting. Mr. Sessions, now the attorney general, initially did not reveal that discussion to Congress, because, he has said, he did not recall it. More recently, he said he pushed back against Mr. Papadopoulos’s proposal, at least partly because he did not want someone so unqualified to represent the campaign on such a sensitive matter. If the campaign wanted Mr. Papadopoulos to stand down, previously undisclosed emails obtained by The Times show that he either did not get the message or failed to heed it. He continued for months to try to arrange some kind of meeting with Russian representatives, keeping senior campaign advisers abreast of his efforts. Mr. Clovis ultimately encouraged him and another foreign policy adviser to travel to Moscow, but neither went because the campaign would not cover the cost. Mr. Papadopoulos was trusted enough to edit the outline of Mr. Trump’s first major foreign policy speech on April 27, an address in which the candidate said it was possible to improve relations with Russia. Mr. Papadopoulos flagged the speech to his newfound Russia contacts, telling Mr. Timofeev that it should be taken as “the signal to meet.” “That is a statesman speech,” Mr. Mifsud agreed. Ms. 
Polonskaya wrote that she was pleased that Mr. Trump’s “position toward Russia is much softer” than that of other candidates. Stephen Miller, then a senior policy adviser to the campaign and now a top White House aide, was eager for Mr. Papadopoulos to serve as a surrogate, someone who could publicize Mr. Trump’s foreign policy views without officially speaking for the campaign. But Mr. Papadopoulos’s first public attempt to do so was a disaster. In a May 4, 2016, interview with The Times of London, Mr. Papadopoulos called on Prime Minister David Cameron to apologize to Mr. Trump for criticizing his remarks on Muslims as “stupid” and divisive. “Say sorry to Trump or risk special relationship, Cameron told,” the headline read. Mr. Clovis, the national campaign co-chairman, severely reprimanded Mr. Papadopoulos for failing to clear his explosive comments with the campaign in advance. From then on, Mr. Papadopoulos was more careful with the press — though he never regained the full trust of Mr. Clovis or several other campaign officials. Mr. Mifsud proposed to Mr. Papadopoulos that he, too, serve as a campaign surrogate. He could write op-eds under the guise of a “neutral” observer, he wrote in a previously undisclosed email, and follow Mr. Trump to his rallies as an accredited journalist while receiving briefings from the inside the campaign. In late April, at a London hotel, Mr. Mifsud told Mr. Papadopoulos that he had just learned from high-level Russian officials in Moscow that the Russians had “dirt” on Mrs. Clinton in the form of “thousands of emails,” according to court documents. Although Russian hackers had been mining data from the Democratic National Committee’s computers for months, that information was not yet public. Even the committee itself did not know. Whether Mr. Papadopoulos shared that information with anyone else in the campaign is one of many unanswered questions. He was mostly in contact with the campaign over emails. The day after Mr. 
Mifsud’s revelation about the hacked emails, he told Mr. Miller in an email only that he had “interesting messages coming in from Moscow” about a possible trip. The emails obtained by The Times show no evidence that Mr. Papadopoulos discussed the stolen messages with the campaign. Not long after, however, he opened up to Mr. Downer, the Australian diplomat, about his contacts with the Russians. It is unclear whether Mr. Downer was fishing for that information that night in May 2016. The meeting at the bar came about because of a series of connections, beginning with an Israeli Embassy official who introduced Mr. Papadopoulos to another Australian diplomat in London. It is also not clear why, after getting the information in May, the Australian government waited two months to pass it to the F.B.I. In a statement, the Australian Embassy in Washington declined to provide details about the meeting or confirm that it occurred. “As a matter of principle and practice, the Australian government does not comment on matters relevant to active investigations,” the statement said. The F.B.I. declined to comment. Once the information Mr. Papadopoulos had disclosed to the Australian diplomat reached the F.B.I., the bureau opened an investigation that became one of its most closely guarded secrets. Senior agents did not discuss it at the daily morning briefing, a classified setting where officials normally speak freely about highly sensitive operations. Besides the information from the Australians, the investigation was also propelled by intelligence from other friendly governments, including the British and Dutch. A trip to Moscow by another adviser, Carter Page, also raised concerns at the F.B.I. With so many strands coming in — about Mr. Papadopoulos, Mr. Page, the hackers and more — F.B.I. agents debated how aggressively to investigate the campaign’s Russia ties, according to current and former officials familiar with the debate. 
Issuing subpoenas or questioning people, for example, could cause the investigation to burst into public view in the final months of a presidential campaign. It could also tip off the Russian government, which might try to cover its tracks. Some officials argued against taking such disruptive steps, especially since the F.B.I. would not be able to unravel the case before the election. Others believed that the possibility of a compromised presidential campaign was so serious that it warranted the most thorough, aggressive tactics. Even if the odds against a Trump presidency were long, these agents argued, it was prudent to take every precaution. That included questioning Christopher Steele, the former British spy who was compiling the dossier alleging a far-ranging Russian conspiracy to elect Mr. Trump. A team of F.B.I. agents traveled to Europe to interview Mr. Steele in early October 2016. Mr. Steele had shown some of his findings to an F.B.I. agent in Rome three months earlier, but that information was not part of the justification to start an counterintelligence inquiry, American officials said. Ultimately, the F.B.I. and Justice Department decided to keep the investigation quiet, a decision that Democrats in particular have criticized. And agents did not interview Mr. Papadopoulos until late January. He was hardly central to the daily running of the Trump campaign, yet Mr. Papadopoulos continuously found ways to make himself useful to senior Trump advisers. In September 2016, with the United Nations General Assembly approaching and stories circulating that Mrs. Clinton was going to meet with Mr. Sisi, the Egyptian president, Mr. Papadopoulos sent a message to Stephen K. Bannon, the campaign’s chief executive, offering to broker a similar meeting for Mr. Trump. After days of scheduling discussions, the meeting was set and Mr. Papadopoulos sent a list of talking points to Mr. Bannon, according to people familiar with those interactions. 
Asked about his contacts with Mr. Papadopoulos, Mr. Bannon declined to comment. Mr. Trump’s improbable victory raised Mr. Papadopoulos’s hopes that he might ascend to a top White House job. The election win also prompted a business proposal from Sergei Millian, a naturalized American citizen born in Belarus. After he had contacted Mr. Papadopoulos out of the blue over LinkedIn during the summer of 2016, the two met repeatedly in Manhattan. Mr. Millian has bragged of his ties to Mr. Trump — boasts that the president’s advisers have said are overstated. He headed an obscure organization called the Russian-American Chamber of Commerce, some of whose board members and clients are difficult to confirm. Congress is investigating where he fits into the swirl of contacts with the Trump campaign, although he has said he is unfairly being scrutinized only because of his support for Mr. Trump. Mr. Millian proposed that he and Mr. Papadopoulos form an energy-related business that would be financed by Russian billionaires “who are not under sanctions” and would “open all doors for us” at “any level all the way to the top.” One billionaire, he said, wanted to explore the idea of opening a Trump-branded hotel in Moscow. “I know the president will distance himself from business, but his children might be interested,” he wrote. Nothing came of his proposals, partly because Mr. Papadopoulos was hoping that Michael T. Flynn, then Mr. Trump’s pick to be national security adviser, might give him the energy portfolio at the National Security Council. The pair exchanged New Year’s greetings in the final hours of 2016. “Happy New Year, sir,” Mr. Papadopoulos wrote. “Thank you and same to you, George. Happy New Year!” Mr. Flynn responded, ahead of a year that seemed to hold great promise. But 2017 did not unfold that way. Within months, Mr. Flynn was fired, and both men were charged with lying to the F.B.I. And both became important witnesses in the investigation Mr. 
Papadopoulos had played a critical role in starting.")

The `Doc` object holds an array of `TokenC` structs. 

`__init__` constructs a doc from:

> - `vocab` (Vocab): A vocabulary object, which must match any models you want to use (e.g. tokenizer, parser, entity recognizer).
> - `words` (list or None): A list of unicode strings to add to the document as words. If `None`, defaults to empty list.
> - `spaces` (list or None): A list of boolean values, of the same length as words. True means that the word is followed by a space, False means it is not. If `None`, defaults to `[True]*len(words)`.
> - `user_data` (dict or None): Optional extra data to attach to the Doc.



[source](https://github.com/explosion/spaCy/blob/master/spacy/tokens/doc.pyx)
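To make the constructor arguments concrete, here is a small sketch that builds a Doc by hand from `words` and `spaces`, without running any pipeline. The variable names (`manual_doc`, `words`, `spaces`) are illustrative, and the example uses an empty `Vocab` only for demonstration.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["Hello", "world", "!"]
# One flag per word: is the word followed by a space?
spaces = [True, False, False]

# Assumption: an empty Vocab is enough here because no model is applied.
manual_doc = Doc(Vocab(), words=words, spaces=spaces)

print(manual_doc.text)
print([t.text for t in manual_doc])
```

In normal use you would pass `nlp.vocab`, so that the Doc shares its vocabulary with the models that process it.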

In [50]:
print(doc.is_tagged, doc.is_parsed, doc.tensor, doc.noun_chunks_iterator)

True True [[ 8.756554   -3.1600325  -0.1917338  ...  1.118364   -3.0791821
   2.7180376 ]
 [ 0.9409258   2.2175236  -1.3412498  ... -0.2895214  -3.3159554
   5.182918  ]
 [-3.9862745  -2.5407963  -2.5230224  ...  6.717414   -1.7213043
   3.3787484 ]
 ...
 [-1.9005902  -4.581986   -2.745854   ...  0.35454035 -1.842454
   0.97634935]
 [ 0.82343036 -2.4026475   2.5932384  ...  1.3355422  -1.4711387
   3.4241068 ]
 [ 0.5511681  -1.715608    3.2748218  ... -2.561218    0.4252711
   1.0786207 ]] <function noun_chunks at 0x7fd9af5c76a8>


Tokens and Spans are only "views" of a Doc;
they don't have word-level-or-lower data 
of their own.\*

What does that mean?

- "Indexing" a Doc (`doc[i]`) is really a call to a hard-coded function, `Doc.__getitem__`, which returns a Token view of the Doc's data.
- Tokens are the most powerful information container in spaCy, in many ways. 
- They are the most manipulable, as you can use them in Python, numpy, or directly in Cython.
- But don't assume you can access or change information about them easily from the Doc level.

\**(But they do have higher-level,
syntactic or structural data, of course.)*
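A quick way to see the "view" relationship is to check that a Token and a Span both point back to the same parent Doc and only carry positions into it. This is a sketch using a blank English pipeline (an assumption, to avoid needing a model); `blank_nlp`, `small_doc`, `tok`, and `span` are illustrative names.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to show the idea.
blank_nlp = spacy.blank("en")
small_doc = blank_nlp("During a night of heavy drinking at an upscale London bar")

tok = small_doc[2]      # a single Token
span = small_doc[1:3]   # a Span covering tokens 1 and 2

# Both objects reference the same parent Doc rather than copying it.
print(tok.doc is small_doc, span.doc is small_doc)

# A Token knows its index in the Doc; a Span knows its start and end indices.
print(tok.text, tok.i)
print(span.text, span.start, span.end)
```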

In [51]:
# for ent in doc.ents:
#     print(ent, ent.label_)

from collections import Counter

# Map each entity Span to its label, then count how often each label occurs
ents = {ent: ent.label_ for ent in doc.ents}
Counter(ents.values())

Counter({'GPE': 42,
         'DATE': 44,
         'PERSON': 122,
         'ORG': 46,
         'CARDINAL': 10,
         'FAC': 1,
         'NORP': 39,
         'ORDINAL': 4,
         'LANGUAGE': 1,
         'TIME': 2,
         'LOC': 1,
         'EVENT': 3})

In [52]:
doc.__getitem__(3)
doc[3]

a

Indexing by position only takes a view on the Doc; 
nothing is copied. 

If you pass a slice, not an int, spaCy 
calls another function and interprets `i.start` 
as the start index and `i.stop` as the stop.

You cannot specify a step size other than 1; slices must be contiguous.

spaCy then returns a Span over those tokens.

```cython
    start, stop = normalize_slice(len(self), i.start, i.stop, i.step)
    return Span(self, start, stop, label=0)
```
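The snippet above relies on `normalize_slice` to turn arbitrary slice bounds into valid token indices. The function below is a pure-Python approximation of what such a helper has to do (resolve `None` and negative bounds, clamp to the document length, reject non-contiguous steps). It is a sketch for illustration, not spaCy's actual source, and the name `normalize_slice_sketch` is made up.

```python
def normalize_slice_sketch(length, start, stop, step=None):
    """Resolve slice bounds against a sequence of `length` tokens."""
    # Only contiguous slices make sense for a Span.
    if step is not None and step != 1:
        raise ValueError("only contiguous slices (step 1) are supported")
    # Resolve a missing or negative start, then clamp it to [0, length].
    if start is None:
        start = 0
    elif start < 0:
        start += length
    start = min(length, max(0, start))
    # Resolve a missing or negative stop, then clamp it to [start, length].
    if stop is None:
        stop = length
    elif stop < 0:
        stop += length
    stop = min(length, max(start, stop))
    return start, stop


print(normalize_slice_sketch(10, 3, 5))
print(normalize_slice_sketch(10, -4, None))
print(normalize_slice_sketch(10, 8, 50))
```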

In [53]:
doc.__getitem__(slice(3,5))
doc[3:5]

a night

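A **token Span** is different from a character span: it is indexed by token position, not by character offset. You can see this when slicing into a Doc object. In the spaCy version used here, indexing past the end of a Span does not raise an error; the index is simply counted from the Span's start into the parent Doc (that's what is meant by "a view"). 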

In [54]:
# Index 3 of the Span doc[3:5] lies past its end and resolves to doc[3 + 3]
doc[3:5][3] == doc[6]

True

In [55]:
token_span = doc[3:5]
for token in token_span:
    print(token)

token_span[31]

a
night


to

A Span does not have the same underlying data as its Tokens, but it is a Cython object that can hold its own data, like a label -- e.g., for named entities.

```cython
def char_span(self, int start_idx, int end_idx, label=0, kb_id=0, vector=None):
        """Create a `Span` object from the slice `doc.text[start : end]`.
        doc (Doc): The parent document.
        start (int): The index of the first character of the span.
        end (int): The index of the first character after the span.
        label (uint64 or string): A label to attach to the Span, e.g. for
            named entities.
        kb_id (uint64 or string):  An ID from a KB to capture the meaning of a named entity.
        vector (ndarray[ndim=1, dtype='float32']): A meaning representation of
            the span.
        RETURNS (Span): The newly constructed object.
        DOCS: https://spacy.io/api/doc#char_span
        """
        if not isinstance(label, int):
            label = self.vocab.strings.add(label)
        if not isinstance(kb_id, int):
            kb_id = self.vocab.strings.add(kb_id)
        cdef int start = token_by_start(self.c, self.length, start_idx)
        if start == -1:
            return None
        cdef int end = token_by_end(self.c, self.length, end_idx)
        if end == -1:
            return None
        # Currently we have the token index, we want the range-end index
        end += 1
        cdef Span span = Span(self, start, end, label=label, kb_id=kb_id, vector=vector)
        return span
```
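As a usage sketch for `char_span`, the cell below creates a labeled Span from character offsets and shows what happens when the offsets do not line up with token boundaries. It assumes a blank English pipeline; `blank_nlp`, `greeting_doc`, and the label `"GREETING"` are illustrative and not from the original notebook.

```python
import spacy

# Assumption: a blank pipeline (tokenizer only) is enough to demonstrate char_span.
blank_nlp = spacy.blank("en")
greeting_doc = blank_nlp("Hello world, this is spaCy.")

# Characters 0..11 cover exactly the tokens "Hello" and "world".
aligned = greeting_doc.char_span(0, 11, label="GREETING")
print(aligned.text, aligned.label_, aligned.start, aligned.end)

# Characters 0..3 end in the middle of "Hello", so no Span can be built.
misaligned = greeting_doc.char_span(0, 3)
print(misaligned)
```

Because `char_span` returns `None` for misaligned offsets instead of raising, it is worth checking the result before using it.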

Doc objects have [several useful attributes](https://spacy.io/api/doc):

In [56]:
# List sentences

list(doc.sents)

[WASHINGTON —,
 During a night of heavy drinking at an upscale London bar in May 2016, George Papadopoulos, a young foreign policy adviser to the Trump campaign, made a startling revelation to Australia’s top diplomat in Britain: Russia had political dirt on Hillary Clinton.,
 About three weeks earlier, Mr. Papadopoulos had been told that Moscow had thousands of emails that would embarrass Mrs. Clinton, apparently stolen in an effort to try to damage her campaign.,
 Exactly how much Mr. Papadopoulos said that night at the Kensington Wine Rooms with the Australian, Alexander Downer, is unclear.,
 But two months later, when leaked Democratic emails began appearing online, Australian officials passed the information about Mr. Papadopoulos to their American counterparts, according to four current and former American and foreign officials with direct knowledge of the Australians’ role.,
 The hacking and the revelation that a member of the Trump campaign may have had inside information about

In [57]:
# "noun chunks", based on dependency relationships

list(doc.noun_chunks)

[a night,
 heavy drinking,
 an upscale London bar,
 May,
 George Papadopoulos,
 a young foreign policy adviser,
 the Trump campaign,
 a startling revelation,
 Australia’s top diplomat,
 Britain,
 Russia,
 political dirt,
 Hillary Clinton,
 Mr. Papadopoulos,
 Moscow,
 thousands,
 emails,
 Mrs. Clinton,
 an effort,
 her campaign,
 Mr. Papadopoulos,
 the Kensington Wine Rooms,
 the Australian,
 Alexander Downer,
 leaked Democratic emails,
 Australian officials,
 the information,
 Mr. Papadopoulos,
 their American counterparts,
 four current and former American and foreign officials,
 direct knowledge,
 the Australians’ role,
 The hacking,
 the revelation,
 a member,
 the Trump campaign,
 inside information,
 it,
 factors,
 the F.B.I.,
 an investigation,
 July,
 Russia,
 ’s,
 the election,
 President Trump’s associates,
 Mr. Papadopoulos,
 who,
 the F.B.I.,
 a cooperating witness,
 the improbable match,
 a blaze,
 the first year,
 the Trump administration,
 his saga,
 a tale,
 the Trump ca

In [75]:
# entities

doc.ents

(WASHINGTON,
 London,
 May 2016,
 George Papadopoulos,
 Trump,
 Australia,
 Britain,
 Russia,
 Hillary Clinton,
 About three weeks earlier,
 Papadopoulos,
 Moscow,
 thousands,
 Clinton,
 Papadopoulos,
 the Kensington Wine Rooms,
 Australian,
 Alexander Downer,
 two months later,
 Democratic,
 Australian,
 Papadopoulos,
 American,
 four,
 American,
 Australians,
 Trump,
 July 2016,
 Russia,
 Trump’s,
 Papadopoulos,
 F.B.I.,
 the first year,
 Trump,
 Trump,
 Russian,
 Trump’s,
 Two months,
 New York,
 Trump,
 Abdel Fattah,
 Egypt,
 Papadopoulos,
 Australians,
 one,
 the past year,
 American,
 Trump,
 months,
 Trump,
 British,
 America,
 Interviews,
 Papadopoulos,
 Russian,
 the past year,
 Russians,
 Trump,
 seven months ago,
 Robert S. Mueller III,
 Trump’s,
 first year,
 Russian,
 Russians,
 Papadopoulos,
 2016,
 Russia,
 John O. Brennan,
 this year,
 four years,
 C.I.A.,
 Congress,
 May,
 Russian,
 Trump,
 Russia,
 Trump,
 Papadopoulos,
 Chicago,
 London,
 Trump,
 early March 2016,
 t

In [79]:
sent = list(doc.sents)[2]

In [80]:
displacy.render(sent, style="dep") # If *not* in a notebook, use .serve() instead

[Token attributes](https://spacy.io/api/token#attributes)

In [127]:
token = doc[10]
[token.text, token.lemma_, 
 token.pos_, token.tag_, token.dep_, 
 token.shape_, token.is_alpha, 
 token.is_stop, token.ent_type_, token.ent_id_,
 token.norm_]

['Dana', 'Dana', 'PROPN', 'NNP', 'appos', 'Xxxx', True, False, 'PERSON']

In [108]:
for i, token in enumerate(doc):
    if token.pos_ == "PROPN":
        if "|" in token.text:
            print(doc[i-1], token)

Walter Skinner|Skinner
Dana Scully|Scully


In [83]:
import nltk
nltk.download("stopwords")
from nltk.corpus import stopwords
nltk_stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [85]:
stops = nlp(" ".join(nltk_stopwords))

In [89]:
stop_counts = Counter([token.pos_ for token in stops])

In [90]:
stop_counts

Counter({'VERB': 41,
         'PRON': 31,
         'ADV': 38,
         'NOUN': 17,
         'AUX': 11,
         'ADP': 27,
         'INTJ': 3,
         'CCONJ': 4,
         'DET': 20,
         'PART': 3,
         'ADJ': 10})