# Phrasebank from PDF

This notebook extracts the most common phrases from the training data.

For example, the source can be an academic phrasebank from a popular [scientific writing guidebook](http://www.phrasebank.manchester.ac.uk/) or a high-level scientific journal.



## Workflows

In [1]:
### Step 1: Load the data
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path, skip_first=1, skip_last=2):
    """Return the plain text of a PDF, skipping leading and trailing pages."""
    doc = fitz.open(pdf_path)
    # Page indices are zero-based, so skip_first is also the first index kept
    start_page = skip_first if skip_first else 0
    end_page = len(doc) - skip_last

    pages = []
    for page_number in range(start_page, end_page):
        page = doc.load_page(page_number)
        pages.append(page.get_text("text"))

    doc.close()
    return "".join(pages)

pdf_path = r"../data/Academic_Phrasebank.pdf"

# Skip the first six pages, including the cover, and the last two pages
text = extract_text_from_pdf(pdf_path, skip_first=6, skip_last=2)
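
The page-skipping logic in `extract_text_from_pdf` returns an empty string without warning when `skip_first + skip_last` is at least the number of pages. A minimal sketch of a guard follows. It assumes a hypothetical helper, `page_range`, which is neither part of PyMuPDF nor of the original notebook; it computes the page indices separately so the check can run without opening a PDF.

```python
def page_range(n_pages, skip_first=0, skip_last=0):
    """Return the zero-based page indices left after skipping.

    Hypothetical helper for illustration: it raises instead of returning
    an empty range, so a bad skip setting is noticed early.
    """
    if skip_first < 0 or skip_last < 0:
        raise ValueError("skip counts must be non-negative")
    start, end = skip_first, n_pages - skip_last
    if start >= end:
        raise ValueError(
            f"skipping {skip_first} + {skip_last} pages leaves nothing "
            f"of a {n_pages}-page document"
        )
    return range(start, end)


# A 10-page document with 6 leading and 2 trailing pages skipped
print(list(page_range(10, skip_first=6, skip_last=2)))
```

Inside `extract_text_from_pdf`, the loop could then iterate over `page_range(len(doc), skip_first, skip_last)`.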

In [2]:
text

"6 | P a g e  \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \nMajor sections \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n7 | P a g e  \n \nWriting Introductions \n \n There are many ways to introduce an academic essay or short paper. Most academic writers, \nhowever, appear to do one or more of the following in their introductions: \n \n• \nestablish the context, background and/or importance of the topic \n• \nindicate an issue, problem, or controversy in the field of study  \n• \ndefine the topic or key terms  \n• \nstate of the purpose of the essay/writing \n• \nprovide an overview of the coverage and/or structure of the writing  \n \nIntroductions to research articles and dissertations tend to be relatively short but quite complex. \nSome of the more common moves include: \n \n• \nestablishing the context, background and/or importance of the topic \n• \ngiving a brief synopsis of the relevant literature  \n• \nindicating a problem

In [3]:
# It is recommended to check the extracted text manually

import re

def clean_text(text):
    # Remove "Page X of Y" footers; note that this pattern does not match
    # the "6 | P a g e" markers that appear in the extracted text
    text = re.sub(r'Page \d+ of \d+', '', text)
    text = re.sub(r'[\r\n]+', ' ', text)  # Replace each run of carriage returns and newlines with one space
    text = text.strip()
    return text

cleaned_text = clean_text(text)
cleaned_text

"6 | P a g e                                     Major sections                                                              7 | P a g e     Writing Introductions     There are many ways to introduce an academic essay or short paper. Most academic writers,  however, appear to do one or more of the following in their introductions:    •  establish the context, background and/or importance of the topic  •  indicate an issue, problem, or controversy in the field of study   •  define the topic or key terms   •  state of the purpose of the essay/writing  •  provide an overview of the coverage and/or structure of the writing     Introductions to research articles and dissertations tend to be relatively short but quite complex.  Some of the more common moves include:    •  establishing the context, background and/or importance of the topic  •  giving a brief synopsis of the relevant literature   •  indicating a problem, controversy or a knowledge gap in the field of study   •  establishing th

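The cleaned text still contains the `6 | P a g e` page markers and long runs of spaces, because `clean_text` only looks for `Page X of Y` footers. A minimal sketch of a stricter cleaner is given here. It assumes every footer has the same shape as the ones visible in the output; `strip_page_markers` is a hypothetical name, not a function from the original notebook. The outputs shown in the following cells were produced without this step.

```python
import re


def strip_page_markers(text):
    """Remove 'N | P a g e' footers and collapse whitespace runs.

    Assumes footers consist of a page number, a pipe, and the letters
    of 'Page' separated by spaces, as in the extracted text.
    """
    # Drop footers such as "6 | P a g e"
    text = re.sub(r'\d+\s*\|\s*P\s*a\s*g\s*e', ' ', text)
    # Collapse newlines, tabs and repeated spaces into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


sample = "6 | P a g e  \n \n Major sections \n \n7 | P a g e  \n \nWriting Introductions \n"
print(strip_page_markers(sample))
```
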
In [4]:
### Step 2: Preprocess the data

import spacy
from spacy.matcher import Matcher

# Use spaCy's pre-trained English model; models for other languages are listed at https://spacy.io/models
! python -m spacy download en_core_web_sm

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [5]:
# Run the spaCy pipeline (tokenization, tagging, parsing) on the cleaned text
doc = nlp(cleaned_text)

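def extract_verb_phrases(doc):
    """Return the lowercased dependency subtree of every verb in doc."""
    verb_phrases = []
    for token in doc:
        if token.pos_ == 'VERB':
            # The subtree is the verb plus all of its syntactic descendants
            subtree = [tok.lower_ for tok in token.subtree]
            # Re-attach commas that joining on spaces separated from the preceding word
            verb_phrase = ' '.join(subtree).replace(' ,', ',')
            verb_phrases.append(verb_phrase)
    return verb_phrases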

verb_phrases = extract_verb_phrases(doc)
print("Verb Phrases:", verb_phrases)

def extract_expanded_noun_phrases(doc):
    """Return noun chunks widened by preceding modifiers and a trailing prepositional phrase."""
    expanded_noun_phrases = []
    for chunk in doc.noun_chunks:
        # Extend to the left over adjacent adjectives and adverbs
        start = chunk.start
        while start > 0 and doc[start - 1].pos_ in ['ADJ', 'ADV']:
            start -= 1
        # If a preposition follows, extend to the right up to the next punctuation mark
        end = chunk.end
        if end < len(doc) and doc[end].pos_ == 'ADP':
            while end < len(doc) and doc[end].pos_ != 'PUNCT':
                end += 1
        expanded_phrase = doc[start:end].text
        expanded_noun_phrases.append(expanded_phrase)
    return expanded_noun_phrases

expanded_noun_phrases = extract_expanded_noun_phrases(doc)
print("Expanded Noun Phrases:", expanded_noun_phrases)


Verb Phrases: ['6 | p a g e                                      major sections                                                               7 | p a g e      writing introductions      there are many ways to introduce an academic essay or short paper .', 'to introduce an academic essay or short paper', 'most academic writers,   however, appear to do one or more of the following in their introductions :    ', 'to do one or more of the following in their introductions', 'most academic writers,   however, appear to do one or more of the following in their introductions :     •   establish the context, background and/or importance of the topic   •   indicate an issue, problem, or controversy in the field of study    •   define the topic or key terms    •   state of the purpose of the essay / writing   •   provide an overview of the coverage and/or structure of the writing      introductions to research articles and dissertations tend to be relatively short but quite complex .  ', 'indicat

In [9]:
import re

# Combine lists and remove duplicates
combined_phrases = set(expanded_noun_phrases + verb_phrases)


# Define a function that applies several filters to a phrase
def is_valid_phrase(phrase):
    # Reject phrases that contain digits
    if any(char.isdigit() for char in phrase):
        return False
    # Reject phrases that contain any of these characters or substrings
    # (the list also holds 'et al' and a double space)
    if any(char in phrase for char in ['(', ')', '-', '*', '/', '?', '=', '!', '@', '→',':', 'et al',
                                       '#', '$', '%', '^', '&', '<', '>', '[', ']', '  ', '\'',
                                       '{', '}', '|', '\\', '~', '`', '+', '_','•', ',',
                                       '‘','’', '“', '”', '.', '—', '…', '°', '€', '£', '¥']):
        return False

    # Reject phrases containing everyday words, grammar terms, or stray single letters
    words_to_match = ['women', 'man', 'to do', 'grammar','icv','noun','dog','cat','v','c','p','d','P','re']
    if any(re.search(r'\b' + re.escape(word) + r'\b', phrase) for word in words_to_match):
        return False

    # Reject phrases with a capitalized word after the first one
    if re.search(r'\s[A-Z]', phrase):
        return False
    return True

# Keep phrases of two to four words that pass the filters, sorted alphabetically
sorted_phrases = sorted(
    phrase
    for phrase in combined_phrases
    if 1 < len(phrase.split(' ')) < 5 and len(phrase) > 2 and is_valid_phrase(phrase)
)
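
Because `set()` discards duplicates, the filtered list no longer records how often each phrase occurred. If the most common phrases are wanted, one option is to count occurrences before deduplicating. A minimal sketch, assuming a hypothetical helper `most_common_phrases` (not part of the original notebook) and a toy list standing in for `expanded_noun_phrases + verb_phrases`:

```python
from collections import Counter


def most_common_phrases(phrases, top_n=10):
    """Count phrase occurrences case-insensitively and return the top_n."""
    counts = Counter(phrase.strip().lower() for phrase in phrases)
    return counts.most_common(top_n)


# Toy input standing in for expanded_noun_phrases + verb_phrases
demo = ["a further study", "A further study", "a small sample"]
print(most_common_phrases(demo, top_n=2))
```

Lowercasing before counting merges variants that differ only in capitalization; whether that is desirable depends on how the phrasebank will be used.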


In [10]:
### Step 3: Save the data

# Write the sorted phrases to a Markdown file, one phrase per line
with open('../academic_phrasebank.md', 'w', encoding='utf-8') as file:
    for phrase in sorted_phrases:
        # Drop any newline left inside a phrase
        cleaned_phrase = phrase.replace('\n', '')
        print(cleaned_phrase)
        file.write(cleaned_phrase + '\n')

A better study
A broader perspective
A case study approach
A further definition
A further study
A future study
A good recommendation
A holistic approach
A key policy priority
A minority of participants
A more comprehensive study
A pattern
A positive correlation
A projection
A quantitative approach
A reasonable approach
A relationship
A scatter diagram
A small sample
A systematic literature review
A trend
Academic writers
Adverbial phrases
All figures
All singular countable nouns
All studies
All the studies
Almost every paper
An arguable weakness
An issue
Another important finding
Another important practical implication
Another interviewee
Another noticeable difference
Another potential problem
Another question
Another reason
Appetitive stimuli
Article references
Associative learning
Background information
Blood samples
Both words
Case studies
Common problems
Comparison within one sentence
Complex sentences
Compound sentences
Considerably more work
Contracted forms
Countable words
Cycli