# Phrasebank from the [Elsevier OA CC-BY corpus](https://researchcollaborations.elsevier.com/en/datasets/elsevier-oa-cc-by-corpus)

This notebook extracts the most common phrases from the training data.

- E.g. phrasebank_pdf: generate an academic phrasebank from a popular [scientific writing guidebook](http://www.phrasebank.manchester.ac.uk/) or a high-level scientific journal.
- E.g. phrasebank_elsevier: generate an academic phrasebank from the [Elsevier OA CC-BY corpus](https://huggingface.co/datasets/orieg/elsevier-oa-cc-by).


## Workflows

In [1]:
# Step 1: Load the data
from openphrasebank import load_and_tokenize_data

# (1) The first run may take a while to download and tokenize the data (up to half an hour).
# (2) Only the 'ENVI' and 'EART' subject areas are used here. If subject_areas is not set, all subject areas are used.
tokens_gen = load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by", 
                                subject_areas=['ENVI','EART'],
                                keys=['title', 'abstract','body_text'],
                                save_cache=True,
                                cache_file='temp_tokens.json')


[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Now


Processing title: 100%|██████████| 6114/6114 [00:00<00:00, 7279.73it/s]


Now


Processing abstract: 100%|██████████| 6114/6114 [00:10<00:00, 609.06it/s]


Now


Processing body_text: 100%|██████████| 1357851/1357851 [04:30<00:00, 5027.24it/s]


In [2]:
# Step 2: Generate n-grams

from openphrasebank import tokens_generator, generate_multiple_ngrams, filter_frequent_ngrams

# Define the n values for which you want to calculate n-grams
n_values = [1,2,3,4,5,6,7,8]

# Tip: once the cached temp_tokens.json exists, you can load it into a generator
tokens_gen = tokens_generator('temp_tokens.json')  # or 'temp_tokens_p_s.json'

# Generate the n-grams and count their frequencies
ngram_freqs = generate_multiple_ngrams(tokens_gen, n_values)



[nltk_data] Downloading package punkt to /home/codespace/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
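
To make the counting step in Step 2 concrete, here is a minimal sketch of n-gram frequency counting with the standard library. It is not the `openphrasebank` implementation: `count_ngrams` is a hypothetical helper, it takes a plain token list rather than a generator, and it assumes n-grams are joined with single spaces.

```python
from collections import Counter

def count_ngrams(tokens, n_values):
    """Count n-grams of each requested length in a flat list of tokens.

    Hypothetical helper for illustration only; the real
    generate_multiple_ngrams may tokenize and batch differently.
    """
    counts = {n: Counter() for n in n_values}
    for n in n_values:
        # Slide a window of width n over the token list
        for i in range(len(tokens) - n + 1):
            counts[n][" ".join(tokens[i:i + n])] += 1
    return counts

tokens = "the impact of climate change on the impact of climate".split()
ngram_counts = count_ngrams(tokens, [1, 2, 3])
print(ngram_counts[2].most_common(3))
```

Holding one `Counter` per n-gram length mirrors how the notebook later indexes `ngram_freqs[n]`, although the actual return type of the library function is not confirmed here.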


In [3]:
# Step 3: Filter and export

# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}

# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)

# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))

# Write the sorted phrases to a plain-text file, one phrase per line
with open('../elsevier_phrasebank_ENVI_EART.txt', 'w') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
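
The filtering in Step 3 keeps at most `limit` phrases per n-gram length and drops anything seen fewer than `min_freq` times. The sketch below shows one plausible way to do that with `collections.Counter`. `top_ngrams` is a hypothetical stand-in, not the library's `filter_frequent_ngrams`; the two-list return shape is only inferred from how the notebook unpacks the result.

```python
from collections import Counter

def top_ngrams(counter, limit, min_freq=1):
    """Return up to `limit` most frequent phrases occurring at least `min_freq` times.

    Hypothetical helper; returns (phrases, frequencies) as two parallel lists.
    """
    kept = [(phrase, count)
            for phrase, count in counter.most_common(limit)
            if count >= min_freq]
    return [p for p, _ in kept], [c for _, c in kept]

# Made-up counts, for illustration only
demo_counts = Counter({"climate change": 50, "of the": 40, "in a": 25, "rare phrase": 3})
demo_phrases, demo_freqs = top_ngrams(demo_counts, limit=4, min_freq=20)
print(demo_phrases, demo_freqs)
```

With a rank cutoff applied first and the frequency floor second, a length with few frequent n-grams can return fewer than `limit` phrases.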

In [9]:

keywords = ["climate"]
# Convert all strings to lower case
lowercase_strings = [s.lower() for s in sorted_phrases]
# Filter the list by keywords and remove duplicates
filtered_strings = list(set(s for s in lowercase_strings if any(k in s for k in keywords)))
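
Because `list(set(...))` does not preserve order, the filtered phrases come back in arbitrary order even though `sorted_phrases` was sorted. If a stable order matters, an order-preserving variant might look like the sketch below; `filter_by_keywords` is a hypothetical helper, not part of `openphrasebank`.

```python
def filter_by_keywords(phrases, keywords):
    """Lower-case phrases, keep those containing any keyword, drop duplicates.

    dict.fromkeys keeps the first occurrence of each phrase, so the
    input order is preserved.
    """
    lowered = (p.lower() for p in phrases)
    matches = (p for p in lowered if any(k in p for k in keywords))
    return list(dict.fromkeys(matches))

# Made-up input, for illustration only
demo_input = ["Climate change", "a changing climate", "climate change", "of the"]
demo_filtered = filter_by_keywords(demo_input, ["climate"])
print(demo_filtered)
```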

In [10]:
filtered_strings

['nations framework convention on climate',
 'a changing climate',
 'the impacts of climate change',
 'climate services',
 'for climate change',
 'regional climate model',
 'to climate change',
 'framework convention on climate change',
 'to adapt to climate',
 'to adapt to climate change',
 'climate change on',
 'adapt to climate',
 'implications of climate change',
 'in a changing climate',
 'adapt to climate change',
 'climate change impacts on',
 'effects of climate change on',
 'the intergovernmental panel on climate',
 'impact of climate',
 'national climate',
 'the intergovernmental panel on climate change',
 'the impact of climate change on',
 'united nations framework convention on climate',
 'impact of climate change on',
 'of climate change on',
 'responses to climate change',
 'to climate change in',
 'under climate change',
 'in the context of climate change',
 'the effects of climate change on',
 'impact of climate change',
 'of climate',
 'the face of climate change',
 '

In [13]:
import json
from IPython.display import HTML

def display_word_tree(phrases, keyword):
    # Each phrase becomes its own single-column data row, e.g. ["to climate change"]
    rows = ",\n          ".join(json.dumps([p]) for p in phrases)
    # Literal JavaScript braces are doubled so that str.format leaves them alone;
    # only the two bare {} placeholders are substituted.
    js_code = """
    <script type="text/javascript" src="https://www.gstatic.com/charts/loader.js"></script>
    <script type="text/javascript">
      google.charts.load('current', {{packages:['wordtree']}});
      google.charts.setOnLoadCallback(drawChart);
    
      function drawChart() {{
        var data = google.visualization.arrayToDataTable([
          ['Phrases'],
          {}
        ]);
    
        var options = {{
          wordtree: {{
            format: 'implicit',
            word: '{}'
          }}
        }};
    
        var chart = new google.visualization.WordTree(document.getElementById('wordtree_basic'));
        chart.draw(data, options);
      }}
    </script>
    <div id="wordtree_basic" style="width: 900px; height: 500px;"></div>
    """.format(rows, keyword)
    
    return HTML(js_code)

In [14]:
# Step 4: Visualize the n-grams

display_word_tree(filtered_strings, 'climate')

## Others

In [9]:
# Read the word list from a text file, one entry per line
awl_dir = '../data/awl.txt'
with open(awl_dir, 'r') as f:
    awl = f.read().splitlines()
