# Phrasebank from PDF

This notebook extracts the most common phrases from a source document (the training data).

For example, the source can be an academic phrasebank from a popular [scientific writing guidebook](http://www.phrasebank.manchester.ac.uk/) or a high-level scientific journal.



## Workflow

### Step 1: Load the data


In [2]:
from openphrasebank import extract_text_from_pdf, clean_text

pdf_path = r"../../data/Academic_Phrasebank.pdf"

# Skip the first 6 pages (including the cover) and the last 2 pages
text = extract_text_from_pdf(pdf_path, skip_first=6, skip_last=2)
cleaned_text = clean_text(text)
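
The notebook does not show what `clean_text` does internally. As a rough illustration of the kind of cleanup PDF-extracted text usually needs, here is a minimal sketch; `clean_text_sketch` is a hypothetical name and is not the `openphrasebank` implementation.

```python
import re

def clean_text_sketch(raw: str) -> str:
    """Hypothetical cleanup step; NOT the openphrasebank implementation."""
    # Rejoin words that were hyphenated across a line break
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
    # Collapse every run of whitespace (including newlines) to one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = "This study pro-\nvides   new\ninsights."
print(clean_text_sketch(sample))  # This study provides new insights.
```

Rejoining hyphenated words before collapsing whitespace matters: once the newline is gone, a genuine hyphen and a line-break hyphen can no longer be told apart.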


### Step 2: Extract the phrases

In [3]:
import spacy
from openphrasebank import extract_verb_phrases, extract_expanded_noun_phrases, is_valid_phrase
nlp = spacy.load("en_core_web_sm")
doc = nlp(cleaned_text)
# match with the verb and noun phrases patterns
verb_phrases = extract_verb_phrases(doc)
expanded_noun_phrases = extract_expanded_noun_phrases(doc)
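
How `extract_verb_phrases` and `extract_expanded_noun_phrases` define their patterns is not shown here. The general idea of matching phrases by part-of-speech pattern can be sketched in plain Python. In this toy version the tokens are tagged by hand (spaCy assigns the tags in the notebook), and both `match_pos_pattern` and the two patterns are illustrative assumptions.

```python
# Toy illustration: tokens are pre-tagged by hand; in the notebook,
# spaCy assigns the part-of-speech tags.
tagged = [
    ("These", "DET"), ("results", "NOUN"), ("suggest", "VERB"),
    ("that", "SCONJ"), ("further", "ADJ"), ("research", "NOUN"),
    ("is", "AUX"), ("needed", "VERB"),
]

def match_pos_pattern(tagged_tokens, pattern):
    """Return every contiguous token span whose POS tags equal `pattern`."""
    n = len(pattern)
    matches = []
    for start in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[start:start + n]
        if [pos for _, pos in window] == pattern:
            matches.append(" ".join(tok for tok, _ in window))
    return matches

print(match_pos_pattern(tagged, ["ADJ", "NOUN"]))   # ['further research']
print(match_pos_pattern(tagged, ["NOUN", "VERB"]))  # ['results suggest']
```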

In [4]:
# Download spaCy's pre-trained English model (run this once, before loading the model in Step 2).
# For models in other languages, see https://spacy.io/models
! python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 58.8 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


### Step 3: Filter and export

In [4]:

# Combine lists and remove duplicates
combined_phrases = set(expanded_noun_phrases + verb_phrases)

# Keep valid phrases of two to four words, sorted alphabetically
sorted_phrases = sorted(
    phrase
    for phrase in combined_phrases
    if 1 < len(phrase.split(' ')) < 5 and len(phrase) > 2 and is_valid_phrase(phrase)
)
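
To make the length filter concrete, here is a small self-contained sketch on made-up candidates. `keep_phrase` is a hypothetical helper, and its default `is_valid` accepts everything because the rules inside `is_valid_phrase` are not shown in this notebook.

```python
def keep_phrase(phrase, is_valid=lambda p: True):
    """Apply the same length rules as above; `is_valid` is a stand-in."""
    n_words = len(phrase.split(' '))
    return 1 < n_words < 5 and len(phrase) > 2 and is_valid(phrase)

candidates = [
    "results",                     # 1 word: dropped
    "the results suggest",         # 3 words: kept
    "it has been shown that the",  # 6 words: dropped
    "the results suggest",         # duplicate: removed by the set
    "in contrast",                 # 2 words: kept
]

kept = sorted({p for p in candidates if keep_phrase(p)})
print(kept)  # ['in contrast', 'the results suggest']
```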


In [5]:
# Save the data (continuation of Step 3)
import re
# Write the sorted phrases to a Markdown file, one phrase per line
with open('../../phrasebanks/academic_phrasebank.md', 'w', encoding='utf-8') as file:
    for phrase in sorted_phrases:
        # Remove any newlines left inside a phrase
        cleaned_phrase = re.sub(r'\n+', '', phrase)
        file.write(cleaned_phrase + '\n')