# the `PhraseMatcher`
The `PhraseMatcher` lets you search a dataset for specific phrases or sequences of text. It is most useful when you already know the kind of thing you want to pick out, including the exact variations of those phrases. It is less useful when you need to account for more than a few variations. For a way to handle more complex variations of phrases, see the token `Matcher` section.

The process of using the `PhraseMatcher` involves four steps, each covered in its own section below.
1. write down the exact phrase you're looking for in the text
2. code the phrase and pass it into the `PhraseMatcher` object
3. run the `PhraseMatcher` on your doc
4. print out the matches
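
Before going through each step in detail, here is a minimal sketch of all four steps on one invented sentence. It assumes spaCy v3 and uses a blank English pipeline (`spacy.blank("en")`) in place of the full model, since exact phrase matching only needs the tokenizer. The sample sentence is made up for illustration and does not come from the bills dataset.

```python
import spacy
from spacy.matcher import PhraseMatcher

# A blank English pipeline (tokenizer only) is enough for exact phrase matching.
nlp = spacy.blank("en")

# Step 1: the exact phrases we are looking for.
phrases = ["`gender", "`sex"]

# Step 2: turn each phrase into a doc and add it to the matcher under one label.
matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp(phrase) for phrase in phrases])

# Step 3: run the matcher on a doc (an invented sentence, not from the bills).
doc = nlp("In this Act, the term `sex' includes `gender identity'.")
matches = matcher(doc)

# Step 4: collect and print the text of each match.
found = [doc[start:end].text for match_id, start, end in matches]
print(found)
```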


## 1. write down the phrase
From close reading the bills dataset (in the [defining gender section](./questions.md)), we saw that the definitions include an opening single quote in the form of a backtick, terms like "gender" and "sex", and the word "means". The `PhraseMatcher` requires that we narrow this down to the element that appears in all of them: the backtick (`` ` ``) followed by a term like "gender" or "sex". I am leaving out everything after the term "gender" or "sex" because sometimes they are followed by single quotes and sometimes by double quotes, and I want to catch all of the possibilities for now.

Our patterns would therefore be the following: 

```
`gender
`sex
```
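
To see why stopping at the term catches both quoting styles, it helps to look at how the text is split into tokens, because the `PhraseMatcher` compares tokens one by one. The sketch below is an illustration only: it assumes a blank English tokenizer, and `token_texts` and `contains_sequence` are hypothetical helpers written for this example, not part of spaCy.

```python
import spacy

nlp = spacy.blank("en")

def token_texts(text):
    """Return the token strings the tokenizer produces for the given text."""
    return [token.text for token in nlp.make_doc(text)]

def contains_sequence(tokens, pattern):
    """True if the pattern tokens appear as a contiguous run inside tokens."""
    n = len(pattern)
    return any(tokens[i:i + n] == pattern for i in range(len(tokens) - n + 1))

pattern_tokens = token_texts("`gender")
single_quoted = token_texts("the term `gender' means")
double_quoted = token_texts('the term `gender" means')

print(pattern_tokens)
# The pattern's tokens occur in both variants, whatever closing quote follows.
print(contains_sequence(single_quoted, pattern_tokens))
print(contains_sequence(double_quoted, pattern_tokens))
```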

## 2. code the phrase and pass it into the `PhraseMatcher`
First, we will import the necessary libraries and run our text through the `nlp()` pipeline.

In [46]:
import spacy
from spacy.matcher import PhraseMatcher
import requests # for getting the dataset

# loading up the model in english
nlp = spacy.load("en_core_web_sm")

# loading up our sample text, which is the first million characters
# of our cleaned dataset

source = requests.get('https://bit.ly/bills_clean')
text = source.content

In [47]:
type(text)

bytes

In [48]:
decoded = text.decode('utf-8')

In [49]:
# passing our dataset into the nlp() function
doc = nlp(decoded)
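
A note on size: spaCy refuses to process a text longer than `nlp.max_length` characters, which defaults to one million. That is likely why the sample is limited to the first million characters. The sketch below shows one way to trim a longer string before passing it to `nlp()`. The `clip_to_max_length` helper is hypothetical, and the long string is a stand-in, not the real dataset.

```python
import spacy

nlp = spacy.blank("en")

def clip_to_max_length(text, limit):
    """Keep at most `limit` characters of `text` (hypothetical helper)."""
    return text[:limit]

# A stand-in for a dataset that is slightly too long for the pipeline.
long_text = "x" * (nlp.max_length + 5)

sample = clip_to_max_length(long_text, nlp.max_length)
print(len(long_text), len(sample), nlp.max_length)
```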

In [50]:
# remember list slicing?

doc[:100]

Congressional Bills 117th CongressFrom the U.S. Government Publishing OfficeH.R. 1112 Introduced in House (IH)&lt;DOC&gt;117th CONGRESS1st SessionH. R. 1112 To require a report on the military coup in Burma, and for otherpurposes.IN THE HOUSE OF REPRESENTATIVES February 18, 2021Mr. Connolly (for himself, Mr. Price of North Carolina, and Mr. Buchanan) introduced the following bill; which was referred to the Committee on Foreign Affairs A BILLTo require a report on the military coup in Burma, and for otherpurposes.Be it enacted by the Senate and House of Representatives of
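
Note that slicing a `doc` counts tokens, not characters, so `doc[:100]` returns the first 100 tokens. A small sketch of the difference, using a blank English tokenizer and a short phrase taken from the output above:

```python
import spacy

nlp = spacy.blank("en")

text = "To require a report on the military coup in Burma"
doc = nlp.make_doc(text)

# Slicing the string counts characters.
first_three_chars = text[:3]

# Slicing the doc counts tokens and returns a Span.
first_three_tokens = doc[:3]

print(repr(first_three_chars))
print(first_three_tokens.text)
print(len(text), len(doc))
```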

Then, we create the `PhraseMatcher` object, code our phrases, and pass them into the object.

In [51]:
# create a matcher object.
# we will then add phrases to the object

matcher = PhraseMatcher(nlp.vocab)

In [52]:
# add our phrases to the matcher under the label "definitions".
# each phrase is first run through nlp() to create its
# own "doc" object.
matcher.add("definitions", [
  nlp("`gender"),
  nlp("`sex")])
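
By default the `PhraseMatcher` compares the exact text of each token, so it is case-sensitive. If the bills also contain capitalized variants such as `` `Gender `` (an assumption, I have not checked), one option is to match on the lowercase form with `attr="LOWER"`. The sketch below also builds the patterns with `nlp.make_doc()`, which only runs the tokenizer and is faster than the full pipeline. The sample sentence is invented.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# attr="LOWER" compares the lowercase form of each token instead of the exact text.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# make_doc() only tokenizes, which is all the patterns need.
terms = ["`gender", "`sex"]
matcher.add("definitions", [nlp.make_doc(term) for term in terms])

# An invented sentence with one capitalized and one lowercase variant.
doc = nlp.make_doc("The term `Gender identity' and the term `sex' are defined below.")

found = [doc[start:end].text for match_id, start, end in matcher(doc)]
print(found)
```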

## 3. run the `PhraseMatcher`
Now, we run the `PhraseMatcher` on our `doc`. The results will first appear in a numeric form, but we will convert them to plain text in the next step.

In [53]:
# run the matcher on the doc
matches = matcher(doc)

# printing out the first 10 results.
# we get the hash, start and end locations
matches[:10]

[]

In [44]:
# see how many we got total
len(matches)

73
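
Each match is a tuple of three numbers: a hash for the label we gave to `matcher.add()`, the index of the first token, and the index just after the last token. The following sketch, on an invented sentence with a blank English pipeline, shows how to turn the hash back into the label through `nlp.vocab.strings` and how the two indices slice the `doc`.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp("`gender"), nlp("`sex")])

# An invented sentence, not from the bills dataset.
doc = nlp("The term `sex' is replaced by the term `gender identity'.")

rows = []
for match_id, start, end in matcher(doc):
    # The hash maps back to the label string through the vocabulary.
    label = nlp.vocab.strings[match_id]
    # start is inclusive and end is exclusive, just like list slicing.
    rows.append((label, start, end, doc[start:end].text))

for row in rows:
    print(row)
```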

## 4. print the results
Finally, we print out the plain text of our results.

In [45]:
# to see the actual text, we need to write code that slices the doc
# with the "start" and "end" locations of each match, as in
# "doc[start:end]"
# we can also print out the whole sentence, with ".sent"

for match in matches[:10]:
    number, start, end = match
    print(doc[start:end].sent)
    print('\n')


Gender reassignment medical intervention defined\n    ``For purposes of this chapter, the term `gender reassignment \nmedical intervention' means--\n            ``(1) performing a surgery that sterilizes an individual, \n        including castration, vasectomy, hysterectomy, oophorectomy, \n        metoidioplasty, penectomy, phalloplasty, and vaginoplasty, to \n        change the body of such individual to correspond to a sex that \n        is discordant with biological sex;\n            ``(2) performing a mastectomy on an individual for the \n        purpose described in paragraph (1); and\n            ``(3) administering or supplying to an individual \n        medications for the purpose described in paragraph (1), \n        including--\n                    ``(A) GnRH agonists or other puberty-blocking drugs \n                to stop or delay normal puberty;\n                    ``(B) testosterone or other androgens to biological \n                females at doses that are supraphysi

We can see that we've captured a lot here, more than the definitions of our gender terms that we wanted. For example, we captured phrases like "striking 'sex'" and "inserting 'sex'". In the token `Matcher` section, we will look at ways of writing patterns that can handle more variations in our results.
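
In the meantime, one rough way to narrow the results is to keep only the matches whose sentence contains the word "means", since that word appeared in the definitions we read closely. This is a sketch under assumptions, not the approach of the token `Matcher` section: it uses a blank English pipeline with the rule-based `sentencizer` for sentence boundaries, `as_spans=True` to get spans back instead of tuples, an invented sample text, and a hypothetical helper named `sentence_has_means`.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence boundaries, needed for .sent

matcher = PhraseMatcher(nlp.vocab)
matcher.add("definitions", [nlp.make_doc("`gender"), nlp.make_doc("`sex")])

# Invented text: one definition and one amendment instruction.
text = (
    "The term `gender identity' means the identity of a person. "
    "Section 4 is amended by striking `sex' and inserting `sex, including gender identity'."
)
doc = nlp(text)

# as_spans=True returns Span objects instead of (hash, start, end) tuples.
spans = matcher(doc, as_spans=True)

def sentence_has_means(span):
    """True if the sentence around the match contains the word 'means'."""
    return any(token.lower_ == "means" for token in span.sent)

likely_definitions = [span for span in spans if sentence_has_means(span)]

print(len(spans), "matches in total")
for span in likely_definitions:
    print(span.text, "->", span.sent.text)
```

A filter like this will miss definitions that use other wording, so it is a quick check rather than a replacement for better patterns.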

Let's save the data to a plain text file.

In [17]:
# first, create an empty list to store our definitions
defs = []

# then, write a loop that appends our data to that list with some useful labels 
for match in matches:
    number, start, end = match
    defs.append(f'Phrase: "{doc[start:end]}", ')
    defs.append('\n')
    defs.append(f"Sentence: {doc[start].sent}")
    defs.append('\n')
    defs.append(f'Starts: {start} of {len(doc)}')
    defs.append('\n')
    defs.append('\n')

# finally, save that list to a plain text file called 'definitions.txt'
# (the './out' folder must already exist)
with open('./out/definitions.txt', 'w') as f:
    for item in defs:
        f.write(str(item))
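
It can be worth reading the file back to confirm that it was written as expected. The sketch below repeats the save step with invented stand-in records (not real results) and a temporary folder in place of `./out`, creates the folder if it is missing, and then reads the file back. All of the values are hypothetical.

```python
import tempfile
from pathlib import Path

# Invented stand-ins for (phrase, sentence, start) values; not real results.
records = [
    ("`gender", "The term `gender identity' means an example.", 12),
    ("`sex", "Strike `sex' and insert an example.", 40),
]
total_tokens = 100  # hypothetical stand-in for len(doc)

# A temporary folder stands in for ./out so the sketch leaves no files behind.
out_dir = Path(tempfile.mkdtemp()) / "out"
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
out_path = out_dir / "definitions.txt"

with open(out_path, "w", encoding="utf-8") as f:
    for phrase, sentence, start in records:
        f.write(f'Phrase: "{phrase}", \n')
        f.write(f"Sentence: {sentence}\n")
        f.write(f"Starts: {start} of {total_tokens}\n\n")

# Read the file back to check what was saved.
saved = out_path.read_text(encoding="utf-8")
print(saved)
```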