# The `spaCy` pipeline
*If you haven't yet, follow the instructions to download and install spaCy here: https://spacy.io/usage*

The `spaCy` library offers powerful text processing capabilities. It processes text by adding tags, what the library's creators call "annotations," to text. These annotations are filled with [linguistic information](https://spacy.io/usage/spacy-101#features) about each word or bit of punctuation in the text. For example, they can describe parts of speech, grammatical dependency, punctuation, sentence and clause spans, and a lot more. 

How does `spaCy` know what information to attach to each piece of text? The program gets this information from a language model, such as `en_core_web_sm`, that you load before you can process the text. This language model is a statistical model that enables `spaCy` to make predictions about, for example, a word's part of speech. It has been trained on popular lexical datasets, like Princeton's [WordNet](https://wordnet.princeton.edu/), and also contains information about words, such as their root form, in data files and lookup tables. From this information, the model can make predictions about the linguistic features of new text.

For example, if it comes across the word "trans," which can be an adjective, a prefix, or a shortened form of a longer word like "transgender," it will guess how to categorize this particular usage of "trans" based on other aspects of the sentence, such as the parts of speech of the surrounding words.

The code below demonstrates how to import the library, load the language model, and save it to a variable named `nlp`.

In [116]:
import pandas as pd
import spacy

In [117]:
# loading our language model and data

nlp = spacy.load("en_core_web_sm")
df = pd.read_json('test.json')
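
If `spacy.load()` cannot find the model, it raises an `OSError`. The sketch below is one way to handle that and is not part of the original notebook: the helper name `load_model` is hypothetical, and falling back to a blank English pipeline is an assumption that only makes sense for trying out tokenization, since a blank pipeline has no tagger, parser, or entity recognizer.

```python
import spacy


def load_model(name="en_core_web_sm"):
    """Load a trained pipeline, falling back to a blank English one."""
    try:
        return spacy.load(name)
    except OSError:
        # spaCy raises OSError when the named model is not installed
        print(f"Model {name!r} not found. Install it with:")
        print(f"    python -m spacy download {name}")
        # A blank pipeline can tokenize, but it makes no predictions
        return spacy.blank("en")


nlp = load_model()

# Lists the processing components the loaded pipeline will run
print(nlp.pipe_names)
```
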

In [118]:
texts = df['text'].to_list()

In [119]:
texts[0]

'The weaponization of the U.S. Department of Justice against purported enemies of the Biden-Harris administration is nothing new. The Department’s persecution of pro-life advocates, praying grandmothers, and fathers of young children under the Freedom of Access to Clinic Entrances Act (while routinely ignoring arson and vandalism at crisis pregnancy centers chargeable under the same law) is hard to deny. But the Department’s meritless criminal prosecution of Dr. Eithan Haim, a whistleblowing physician who exposed covert gender medicine procedures inflicted on minors at Texas Children’s Hospital—performed in violation of state law—is an outrageous abuse of its law enforcement authority. The principal wrongdoer in Dr. Haim’s case—leading the charge in his criminal prosecution for alleged violation of the Health Insurance Portability and Accountability Act (“HIPAA”)—is Assistant United States Attorney Tina Ansari, a Democratic donor who targeted Haim in part during a time when her law lic

In [120]:
# using just the first article's text, because otherwise it's too much data
# texts[0] is already a string, so no join is needed

dataset = texts[0]

In [121]:
type(dataset)

str

In [122]:
dataset[:100]

'The weaponization of the U.S. Department of Justice against purported enemies of the Biden-Harris ad'

In [123]:
# processing the first article's text into a spaCy "Doc" object

docs = nlp(dataset)

In [124]:
## let's check out what kind of object this is, what can we do with it?

dir(docs)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_context',
 '_get_array_attrs',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'from_json',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment'

In [70]:
for sent in docs.sents:
    print(sent)

The weaponization of the U.S. Department of Justice against purported enemies of the Biden-Harris administration is nothing new.
The Department’s persecution of pro-life advocates, praying grandmothers, and fathers of young children under the Freedom of Access to Clinic Entrances Act (while routinely ignoring arson and vandalism at crisis pregnancy centers chargeable under the same law) is hard to deny.
But the Department’s meritless criminal prosecution of Dr. Eithan Haim, a whistleblowing physician who exposed covert gender medicine procedures inflicted on minors at Texas Children’s Hospital—performed in violation of state law—is an outrageous abuse of its law enforcement authority.
The principal wrongdoer in Dr. Haim’s case—leading the charge in his criminal prosecution for alleged violation of the Health Insurance Portability and Accountability Act (“HIPAA”)—is Assistant United States Attorney Tina Ansari, a Democratic donor who targeted Haim in part during a time when her law lice

In [132]:
for chunk in docs.noun_chunks:
    print(chunk)

The weaponization
the U.S. Department
Justice
purported enemies
the Biden-Harris administration
nothing
The Department’s persecution
pro-life advocates
grandmothers
young children
the Freedom
Access
Clinic Entrances Act
arson
vandalism
crisis pregnancy centers
the same law
the Department’s meritless criminal prosecution
Dr. Eithan Haim
a whistleblowing physician
who
covert gender medicine procedures
minors
Texas Children’s Hospital
violation
state law
an outrageous abuse
its law enforcement authority
The principal
Dr. Haim’s case
the charge
his criminal prosecution
alleged violation
the Health Insurance Portability and Accountability Act
a Democratic donor
who
Haim
part
a time
her law license
the Texas Bar
failure
her bar dues
legal scholar Ed Whelan
the only reason
Haim
he
the transgender ideology
that
the current presidential administration
pleadings
Dr. Haim’s lawyers
separate complaints
a former DOJ lawyer
both the State Bar
Texas
the Justice Department’s Office
Professional Respon

In [138]:
type(docs.noun_chunks)

generator

In [134]:
# copy the generator's items into a list so we can index and count them
chunks = list(docs.noun_chunks)

In [139]:
type(chunks)

list

In [183]:
type(chunks[0])

spacy.tokens.span.Span

In [137]:
len(chunks)

338
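
A generator yields its items one at a time and can only be walked through once, which is why the noun chunks are copied into a list before calling `len()` or indexing. The plain-Python sketch below illustrates that behavior; `fake_noun_chunks` is a hypothetical stand-in that yields strings, not spaCy's implementation.

```python
def fake_noun_chunks():
    """Hypothetical stand-in for a generator of noun chunks."""
    for chunk in ["The weaponization", "the U.S. Department", "Justice"]:
        yield chunk


gen = fake_noun_chunks()
first_pass = list(gen)   # consumes the generator
second_pass = list(gen)  # nothing left to yield

print(len(first_pass))   # 3
print(len(second_pass))  # 0

# len(gen) or gen[0] would raise a TypeError:
# generators have no length and cannot be indexed
```
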

In [125]:
for token in docs:
    print(token)

The
weaponization
of
the
U.S.
Department
of
Justice
against
purported
enemies
of
the
Biden
-
Harris
administration
is
nothing
new
.
The
Department
’s
persecution
of
pro
-
life
advocates
,
praying
grandmothers
,
and
fathers
of
young
children
under
the
Freedom
of
Access
to
Clinic
Entrances
Act
(
while
routinely
ignoring
arson
and
vandalism
at
crisis
pregnancy
centers
chargeable
under
the
same
law
)
is
hard
to
deny
.
But
the
Department
’s
meritless
criminal
prosecution
of
Dr.
Eithan
Haim
,
a
whistleblowing
physician
who
exposed
covert
gender
medicine
procedures
inflicted
on
minors
at
Texas
Children
’s
Hospital
—
performed
in
violation
of
state
law
—
is
an
outrageous
abuse
of
its
law
enforcement
authority
.
The
principal
wrongdoer
in
Dr.
Haim
’s
case
—
leading
the
charge
in
his
criminal
prosecution
for
alleged
violation
of
the
Health
Insurance
Portability
and
Accountability
Act
(
“
HIPAA”)—is
Assistant
United
States
Attorney
Tina
Ansari
,
a
Democratic
donor
who
targeted
Haim
in
part
during

In [126]:
## check out the dropdown for each token, right after you write the dot, "token."
## run a few of the options there, like .lemma_, .pos_, etc.

for token in docs:
    print(token, token.pos_)

The DET
weaponization NOUN
of ADP
the DET
U.S. PROPN
Department PROPN
of ADP
Justice PROPN
against ADP
purported VERB
enemies NOUN
of ADP
the DET
Biden PROPN
- PUNCT
Harris PROPN
administration NOUN
is AUX
nothing PRON
new ADJ
. PUNCT
The DET
Department PROPN
’s PART
persecution NOUN
of ADP
pro ADJ
- ADJ
life ADJ
advocates NOUN
, PUNCT
praying VERB
grandmothers NOUN
, PUNCT
and CCONJ
fathers NOUN
of ADP
young ADJ
children NOUN
under ADP
the DET
Freedom PROPN
of ADP
Access PROPN
to ADP
Clinic PROPN
Entrances PROPN
Act PROPN
( PUNCT
while SCONJ
routinely ADV
ignoring VERB
arson NOUN
and CCONJ
vandalism NOUN
at ADP
crisis NOUN
pregnancy NOUN
centers NOUN
chargeable ADJ
under ADP
the DET
same ADJ
law NOUN
) PUNCT
is AUX
hard ADJ
to PART
deny VERB
. PUNCT
But CCONJ
the DET
Department PROPN
’s PART
meritless ADJ
criminal ADJ
prosecution NOUN
of ADP
Dr. PROPN
Eithan PROPN
Haim PROPN
, PUNCT
a DET
whistleblowing NOUN
physician NOUN
who PRON
exposed VERB
covert ADJ
gender NOUN
medicine NOUN
pro

In [127]:
for token in docs:
    print(token, token.ent_type_)

The 
weaponization 
of 
the ORG
U.S. ORG
Department ORG
of ORG
Justice ORG
against 
purported 
enemies 
of 
the FAC
Biden FAC
- FAC
Harris FAC
administration 
is 
nothing 
new 
. 
The 
Department ORG
’s 
persecution 
of 
pro 
- 
life 
advocates 
, 
praying 
grandmothers 
, 
and 
fathers 
of 
young 
children 
under 
the LAW
Freedom LAW
of LAW
Access LAW
to LAW
Clinic LAW
Entrances LAW
Act LAW
( 
while 
routinely 
ignoring 
arson 
and 
vandalism 
at 
crisis 
pregnancy 
centers 
chargeable 
under 
the 
same 
law 
) 
is 
hard 
to 
deny 
. 
But 
the 
Department ORG
’s 
meritless 
criminal 
prosecution 
of 
Dr. 
Eithan PERSON
Haim PERSON
, 
a 
whistleblowing 
physician 
who 
exposed 
covert 
gender 
medicine 
procedures 
inflicted 
on 
minors 
at 
Texas ORG
Children ORG
’s ORG
Hospital ORG
— 
performed 
in 
violation 
of 
state 
law 
— 
is 
an 
outrageous 
abuse 
of 
its 
law 
enforcement 
authority 
. 
The 
principal 
wrongdoer 
in 
Dr. 
Haim PERSON
’s 
case 
— 
leading 
the 
charge 
in 
hi

In [128]:
for token in docs:
    if token.ent_type_ == 'PERSON':
        print(token, token.ent_type_)


Eithan PERSON
Haim PERSON
Haim PERSON
Tina PERSON
Ansari PERSON
Haim PERSON
Ed PERSON
Whelan PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Christopher PERSON
Rufo PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Haim PERSON
Marcella PERSON
Burke PERSON
Haim PERSON
Haim PERSON
Ken PERSON
Paxton PERSON
Haim PERSON


In [129]:
spacy.explain('PERSON')

'People, including fictional'
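
`spacy.explain()` covers part-of-speech tags as well as entity labels, so it can serve as a quick glossary for the abbreviations in the outputs. A small sketch; the labels listed here are simply ones that appear in this notebook's outputs.

```python
import spacy

# Entity labels and part-of-speech tags seen in the outputs
labels = ["ORG", "GPE", "NORP", "FAC", "LAW", "PROPN", "ADP", "SCONJ"]

explanations = {label: spacy.explain(label) for label in labels}

for label, meaning in explanations.items():
    print(f"{label:6} {meaning}")
```
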

In [130]:
for ent in docs.ents:
    print(ent, ent.label_)

the U.S. Department of Justice ORG
the Biden-Harris FAC
Department ORG
the Freedom of Access to Clinic Entrances Act LAW
Department ORG
Eithan Haim PERSON
Texas Children’s Hospital ORG
Haim PERSON
the Health Insurance Portability and Accountability Act ORG
United States GPE
Tina Ansari PERSON
Democratic NORP
Haim PERSON
the Texas Bar ORG
Ed Whelan PERSON
Haim PERSON
Haim PERSON
the State Bar ORG
Texas GPE
the Justice Department’s Office of Professional Responsibility ORG
Ansari NORP
Haim PERSON
Christopher Rufo PERSON
Haim PERSON
HIPAA ORG
42 CARDINAL
The Justice Department ORG
Haim PERSON
HIPAA ORG
HIPAA ORG
Up to 10 years DATE
250,000 MONEY
Haim PERSON
Texas Children’s Hospital ORG
January 2021 DATE
Haim ORG
Haim PERSON
Ansari NORP
HIPAA ORG
Haim PERSON
Superseding Indictment WORK_OF_ART
Ansari NORP
Oct. 10 DATE
Haim ORG
Haim ORG
Ansari NORP
Haim PERSON
Haim PERSON
Show Cause WORK_OF_ART
Ansari NORP
Sept. 1 DATE
Sept. 19 DATE
Haim ORG
Ansari NORP
the Texas Code of Professional Conduc
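
To get an overview of which entity types dominate a text, the labels can be tallied. The sketch below uses `collections.Counter` on a hand-copied sample of the first ten (text, label) pairs printed above, so it runs without spaCy; with the real `Doc`, the same tally would be `Counter(ent.label_ for ent in docs.ents)`.

```python
from collections import Counter

# First ten entities from the output above, copied by hand
sample_ents = [
    ("the U.S. Department of Justice", "ORG"),
    ("the Biden-Harris", "FAC"),
    ("Department", "ORG"),
    ("the Freedom of Access to Clinic Entrances Act", "LAW"),
    ("Department", "ORG"),
    ("Eithan Haim", "PERSON"),
    ("Texas Children’s Hospital", "ORG"),
    ("Haim", "PERSON"),
    ("the Health Insurance Portability and Accountability Act", "ORG"),
    ("United States", "GPE"),
]

# Count how often each label occurs
label_counts = Counter(label for _, label in sample_ents)

for label, count in label_counts.most_common():
    print(label, count)
```

Because the labels are predictions, some will be off: in the output above, "the Biden-Harris" is tagged `FAC`, a label normally used for facilities such as buildings, and "Haim" is sometimes tagged `ORG` rather than `PERSON`.
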

In [184]:
## some cool code for visualizing ents

from spacy import displacy
displacy.render(docs, style="ent")
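
Outside a notebook, `displacy.render()` can return the markup as a string instead of displaying it. The sketch below uses displaCy's manual mode, which takes a plain dictionary of text and character offsets, so it runs without a trained model; the sentence and the variable names are made up for illustration.

```python
from spacy import displacy

text = "Dr. Eithan Haim is a physician."
name = "Eithan Haim"
start = text.index(name)

# Manual mode: describe the entities by character offsets
manual_doc = {
    "text": text,
    "ents": [{"start": start, "end": start + len(name), "label": "PERSON"}],
    "title": None,
}

# jupyter=False returns the HTML; page=True wraps it in a full HTML page
html = displacy.render(
    manual_doc, style="ent", manual=True, jupyter=False, page=True
)

print(type(html))
```

The returned string can then be written to an `.html` file and opened in a browser.
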



In [None]:
## challenge: make a list of all the persons listed in the article
## then filter that list so that only full names (first and last) remain
## i.e. there are no single last names

people = []
for ent in docs.ents:
    if ent.label_ == 'PERSON':
        people.append(ent)

In [170]:
people

[Eithan Haim,
 Haim,
 Tina Ansari,
 Haim,
 Ed Whelan,
 Haim,
 Haim,
 Haim,
 Christopher Rufo,
 Haim,
 Haim,
 Haim,
 Haim,
 Haim,
 Haim,
 Haim,
 Haim,
 Marcella Burke,
 Haim,
 Haim,
 Ken Paxton,
 Haim]

In [171]:
for i in people:
    print(i, len(i))

Eithan Haim 2
Haim 1
Tina Ansari 2
Haim 1
Ed Whelan 2
Haim 1
Haim 1
Haim 1
Christopher Rufo 2
Haim 1
Haim 1
Haim 1
Haim 1
Haim 1
Haim 1
Haim 1
Haim 1
Marcella Burke 2
Haim 1
Haim 1
Ken Paxton 2
Haim 1
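
Calling `len()` on a `Span` counts tokens, not characters, which is what makes the length a usable test for full names. A small sketch, assuming a blank English pipeline (`spacy.blank("en")`) so that no model download is needed; the sentence is made up for illustration.

```python
import spacy

# A blank pipeline only tokenizes; that is enough to build spans by slicing
nlp_blank = spacy.blank("en")
doc = nlp_blank("Eithan Haim spoke.")

full_name = doc[0:2]  # slicing a Doc returns a Span
surname = doc[1:2]

# Span text, number of tokens, number of characters
print(full_name.text, len(full_name), len(full_name.text))
print(surname.text, len(surname), len(surname.text))
```
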


In [None]:
full_names = []
for i in people:
    if len(i) > 1:
        # had to come back and change this to a string object to save
        # it to a file
        full_names.append(str(i))

In [181]:
full_names

['Eithan Haim',
 'Tina Ansari',
 'Ed Whelan',
 'Christopher Rufo',
 'Marcella Burke',
 'Ken Paxton']
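
The same filter can be written as a list comprehension. The sketch below works on plain strings and treats any name with more than one whitespace-separated word as a full name, which is an assumption that stands in for the token count of a `Span`. `dict.fromkeys` then drops repeats while keeping first-seen order, in case a full name occurs more than once.

```python
# Shortened copy of the names above; the final repeat is added for illustration
people_text = [
    "Eithan Haim", "Haim", "Tina Ansari", "Haim",
    "Ed Whelan", "Haim", "Eithan Haim",
]

# Keep names with at least two words (first and last)
full = [name for name in people_text if len(name.split()) > 1]

# dict keys are unique and keep insertion order, so this de-duplicates
unique_full = list(dict.fromkeys(full))

print(unique_full)
# ['Eithan Haim', 'Tina Ansari', 'Ed Whelan']
```
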

In [182]:
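## saving to a text file, one name per line

with open('full_names.txt', 'w', encoding='utf-8') as f:
    # writelines() does not add line breaks, so append one to each name
    f.writelines(name + '\n' for name in full_names)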
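
`writelines()` writes each string exactly as given and does not add line breaks. The sketch below shows a pattern that puts one name per line and reads the names back. It writes to a temporary directory rather than the working folder, an assumption made to keep the example free of side effects.

```python
import tempfile
from pathlib import Path

names = ["Eithan Haim", "Tina Ansari", "Ed Whelan"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "full_names.txt"

    # Join with newlines so each name lands on its own line
    path.write_text("\n".join(names) + "\n", encoding="utf-8")

    # splitlines() drops the line breaks again when reading back
    restored = path.read_text(encoding="utf-8").splitlines()

print(restored == names)  # True
```
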