--------------------
#### Meaningful bigrams

- To extract meaningful bigrams, a combination of preprocessing and filtering is generally used.
-------------------------

#### Simple Example

In [28]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures

In [54]:
# Sample text
text = '''Collocations are fundamental linguistic phenomena playing a crucial role in natural language understanding and processing. 
In essence, collocations represent the tendency of certain words or phrases to co-occur frequently within a language. 
Understanding collocations is essential for various language-related tasks including language learning, translation, information retrieval, 
and natural language processing (NLP).
At its core, a collocation is a pairing or grouping of words that appear together in a text more often than would be expected by 
chance alone. These word combinations often exhibit semantic or syntactic cohesion, reflecting the regular patterns of 
language usage by native speakers. Collocations encompass a wide range of linguistic constructions including verb-noun pairs 
such as "make a decision," adjective-noun pairs like "heavy rain," adverb-adjective pairs such as "extremely happy," 
and noun-preposition combinations like "interest in."
Collocations can be categorized based on various criteria such as lexical composition, grammatical structure, and semantic relationship. 
Lexical collocations involve words with specific lexical meanings that frequently co-occur together, while grammatical collocations focus on syntactic patterns and word order. Semantic collocations are based on the meaning and semantic relationship between words, reflecting concepts or ideas expressed in the text.
The identification and analysis of collocations are crucial for several reasons:
Semantic Understanding: Collocations provide insights into the meaning and usage of words within a language. 
By examining which words tend to occur together, we can infer semantic relationships and contextual nuances.
Language Proficiency: Knowledge of collocations is essential for language learners to achieve fluency and naturalness in their 
speech or writing. Understanding which words commonly co-occur helps learners produce more idiomatic and native-like expressions.
Text Processing: In NLP tasks such as information retrieval, text summarization, and sentiment analysis, recognizing and extracting 
collocations can improve the accuracy and relevance of results. Collocations help capture the underlying structure and semantics of text data.
Translation and Localization: Collocations pose challenges for translators due to their culture-specific and idiomatic nature. 
Translating collocations accurately requires knowledge of both source and target languages, as well as an understanding of the 
cultural context.
Several methods and techniques are employed to identify and extract collocations from text corpora. 
These include frequency-based approaches, statistical measures such as Pointwise Mutual Information (PMI) and chi-squared tests, 
as well as machine learning algorithms. Collocation extraction tools and libraries, such as NLTK (Natural Language Toolkit), 
Gensim, and spaCy, provide functionalities for detecting and analyzing collocations in text data.
In summary, collocations are essential linguistic constructs that reflect the regularities and patterns of language usage. 
Understanding collocations enhances language proficiency, facilitates text processing tasks, and contributes to the broader 
field of natural language understanding. By studying collocations, linguists and language practitioners gain valuable insights 
into the structure, semantics, and cultural aspects of language.'''

In [55]:
# Characters to strip off
characters_to_remove = [',', ';', '"', '(', ')', '.']

# Remove specified characters
for char in characters_to_remove:
    text = text.replace(char, '')

In [56]:
# Tokenize the text
tokens = word_tokenize(text)

In [57]:
# Create a BigramCollocationFinder
finder = BigramCollocationFinder.from_words(tokens)

In [58]:
# Score collocations using frequency
bigram_measures     = BigramAssocMeasures()
scored_collocations = finder.score_ngrams(bigram_measures.raw_freq)

In [59]:
# Print the top 15 collocations by raw frequency
for collocation in scored_collocations[:15]:
    print(collocation)

(('such', 'as'), 0.013215859030837005)
(('collocations', 'are'), 0.006607929515418502)
(('natural', 'language'), 0.006607929515418502)
(('of', 'language'), 0.006607929515418502)
((':', 'Collocations'), 0.004405286343612335)
(('Understanding', 'collocations'), 0.004405286343612335)
(('a', 'language'), 0.004405286343612335)
(('and', 'semantic'), 0.004405286343612335)
(('as', 'well'), 0.004405286343612335)
(('based', 'on'), 0.004405286343612335)
(('collocations', 'is'), 0.004405286343612335)
(('essential', 'for'), 0.004405286343612335)
(('information', 'retrieval'), 0.004405286343612335)
(('insights', 'into'), 0.004405286343612335)
(('into', 'the'), 0.004405286343612335)
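
The `raw_freq` scores above are relative frequencies: the number of times a bigram occurs divided by the total number of tokens (the values printed are consistent with 6, 3 and 2 occurrences out of 454 tokens). A minimal pure-Python sketch of the same calculation, using a hypothetical toy token list rather than the sample text:

```python
from collections import Counter

# Hypothetical toy tokens, chosen only to keep the arithmetic easy to follow
toy_tokens = "such as a b such as c d".split()

# Count adjacent word pairs
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# Relative frequency: bigram count divided by the number of tokens
total = len(toy_tokens)
raw_freq = {bigram: count / total for bigram, count in bigram_counts.items()}

print(raw_freq[("such", "as")])  # 2 occurrences / 8 tokens = 0.25
```

Because the score depends only on counts, function-word pairs such as ('such', 'as') or ('into', 'the') rank at the top unless they are filtered out.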


**Useful preprocessing**

`Remove Stopwords`: Often, stopwords like "and", "the", "is", etc., form very frequent bigrams that aren't useful.

`Use Meaningful POS Tags`: Consider keeping bigrams where the words are nouns, adjectives, verbs, etc., and filter out bigrams where both words are prepositions, for instance.

`Frequency Filtering`: Remove bigrams that appear only once (or below a certain threshold).

`Collocation Finders`: NLTK provides `BigramCollocationFinder` (and the corresponding `TrigramCollocationFinder`). Combined with an association measure such as PMI, it detects bigrams that occur more often than the individual word frequencies would predict.
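
The stopword and frequency steps above can also be applied on the finder itself, as in the sketch below. It assumes a hypothetical toy sentence and a tiny hand-written stopword set (not NLTK's English list), so no corpus download is needed. Filtering on the finder, rather than deleting words before building it, keeps only pairs that were actually adjacent in the text.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Hypothetical toy input and stopword set, for illustration only
toy_tokens = ("natural language processing is the study of natural language "
              "and the use of natural language tools").split()
toy_stopwords = {"is", "the", "of", "and"}

toy_finder = BigramCollocationFinder.from_words(toy_tokens)

# Drop every bigram that contains a stopword
toy_finder.apply_word_filter(lambda w: w.lower() in toy_stopwords)

# Drop bigrams seen fewer than 2 times
toy_finder.apply_freq_filter(2)

toy_bigrams = toy_finder.nbest(BigramAssocMeasures().raw_freq, 5)
print(toy_bigrams)  # [('natural', 'language')]
```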

#### Using the Brown corpus

In [60]:
import nltk
from nltk.corpus import brown, stopwords

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

In [61]:
nltk.download('brown')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [62]:
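# 1. Remove stopwords and non-alphabetic tokens.
# The stopword list is lowercase and this comparison is case-sensitive, so capitalized
# stopwords such as 'The' or 'It' are kept; compare word.lower() to drop them as well.
stopwords_set  = set(stopwords.words('english'))
filtered_words = [word for word in brown.words(categories='news') if word not in stopwords_set and word.isalpha()]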

In [63]:
len(filtered_words)

49600

In [64]:
print(filtered_words[:100])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'investigation', 'recent', 'primary', 'election', 'produced', 'evidence', 'irregularities', 'took', 'place', 'The', 'jury', 'said', 'presentments', 'City', 'Executive', 'Committee', 'charge', 'election', 'deserves', 'praise', 'thanks', 'City', 'Atlanta', 'manner', 'election', 'conducted', 'The', 'term', 'jury', 'charged', 'Fulton', 'Superior', 'Court', 'Judge', 'Durwood', 'Pye', 'investigate', 'reports', 'possible', 'irregularities', 'primary', 'Ivan', 'Allen', 'Only', 'relative', 'handful', 'reports', 'received', 'jury', 'said', 'considering', 'widespread', 'interest', 'election', 'number', 'voters', 'size', 'city', 'The', 'jury', 'said', 'find', 'many', 'registration', 'election', 'laws', 'outmoded', 'inadequate', 'often', 'ambiguous', 'It', 'recommended', 'Fulton', 'legislators', 'act', 'laws', 'studied', 'revised', 'end', 'modernizing', 'improving', 'The', 'grand', 'jury', 'commented', 'number', 'topics', 'among', 'Atla

In [65]:
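# 2. Use POS tags to keep singular nouns (NN), adjectives (JJ) and base-form verbs (VB).
# The match is exact, so plural nouns (NNS), proper nouns (NNP) and inflected verbs (VBD, VBZ, ...) are dropped.
allowed_postags = ['NN', 'JJ', 'VB']
tagged_words    = nltk.pos_tag(filtered_words)
filtered_words  = [word for word, pos in tagged_words if pos in allowed_postags]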

In [66]:
print(filtered_words[:100])

['investigation', 'recent', 'primary', 'election', 'evidence', 'place', 'jury', 'charge', 'election', 'praise', 'manner', 'election', 'term', 'jury', 'investigate', 'possible', 'primary', 'relative', 'handful', 'jury', 'widespread', 'interest', 'election', 'number', 'size', 'city', 'jury', 'many', 'registration', 'election', 'inadequate', 'ambiguous', 'revised', 'end', 'grand', 'jury', 'number', 'purchasing', 'inure', 'interest', 'jury', 'efficiency', 'reduce', 'cost', 'administration', 'jury', 'lacking', 'clerical', 'city', 'city', 'take', 'remedy', 'problem', 'automobile', 'title', 'law', 'outgoing', 'jury', 'next', 'provide', 'effective', 'date', 'orderly', 'implementation', 'law', 'effected', 'grand', 'jury', 'swipe', 'federal', 'child', 'welfare', 'major', 'general', 'assistance', 'program', 'jury', 'distribute', 'welfare', 'state', 'exception', 'none', 'money', 'realize', 'proportionate', 'distribution', 'disable', 'program', 'populous', 'future', 'receive', 'portion', 'available
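
The list above contains no plurals or proper nouns ('Fulton', 'laws' and 'said' are gone) because the tag comparison is exact. If whole tag families are wanted, one option is to match on the tag prefix. A small sketch, assuming Penn Treebank tags and a hypothetical hand-tagged list instead of real `nltk.pos_tag` output:

```python
# Hypothetical pre-tagged tokens (Penn Treebank tags), written by hand for illustration
tagged = [("jury", "NN"), ("laws", "NNS"), ("Fulton", "NNP"), ("said", "VBD"),
          ("recent", "JJ"), ("investigate", "VB"), ("often", "RB")]

# Exact match keeps only the three listed tags
exact = [w for w, pos in tagged if pos in ("NN", "JJ", "VB")]

# Prefix match keeps every noun, adjective and verb tag (NNS, NNP, JJR, VBD, ...)
by_prefix = [w for w, pos in tagged if pos.startswith(("NN", "JJ", "VB"))]

print(exact)      # ['jury', 'recent', 'investigate']
print(by_prefix)  # ['jury', 'laws', 'Fulton', 'said', 'recent', 'investigate']
```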

In [22]:
# 3. Find Bigram Collocations
bigram_measures = BigramAssocMeasures()
finder          = BigramCollocationFinder.from_words(filtered_words)

finder.apply_freq_filter(min_freq=5)  # Keep only bigrams that appear 5+ times

In [67]:
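# Top 20 bigrams by Pointwise Mutual Information (PMI).
# Cell In [22] must be re-run first so that `finder` is built from the Brown words;
# the output below contains sample-text bigrams, so it came from the earlier finder.
top_bigrams = finder.nbest(score_fn=bigram_measures.pmi, n=20)
top_bigrams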

[('At', 'its'),
 ('Collocation', 'extraction'),
 ('Information', 'PMI'),
 ('Mutual', 'Information'),
 ('NLTK', 'Natural'),
 ('Pointwise', 'Mutual'),
 ('Several', 'methods'),
 ('Text', 'Processing'),
 ('The', 'identification'),
 ('Toolkit', 'Gensim'),
 ('accurately', 'requires'),
 ('achieve', 'fluency'),
 ('algorithms', 'Collocation'),
 ('approaches', 'statistical'),
 ('both', 'source'),
 ('broader', 'field'),
 ('chance', 'alone'),
 ('chi-squared', 'tests'),
 ('context', 'Several'),
 ('contextual', 'nuances')]
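
PMI compares how often two words occur together with how often they would co-occur if they were independent: PMI(w1, w2) = log2(P(w1, w2) / (P(w1) · P(w2))). The pure-Python sketch below works through the arithmetic on a hypothetical toy token list; it estimates all probabilities by dividing counts by the number of tokens, which is an assumption made here for simplicity.

```python
import math
from collections import Counter

# Hypothetical toy tokens, for illustration only
toy_tokens = "new york new york big city old town".split()

word_counts = Counter(toy_tokens)
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
n = len(toy_tokens)

def pmi(w1, w2):
    """log2 of the observed joint probability over the product of the word probabilities."""
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * n / (word_counts[w1] * word_counts[w2]))

pmi_new_york = pmi("new", "york")  # log2(2 * 8 / (2 * 2)) = 2.0
pmi_big_city = pmi("big", "city")  # log2(1 * 8 / (1 * 1)) = 3.0
print(pmi_new_york, pmi_big_city)
```

The pair seen only once scores higher than the pair seen twice. PMI rewards rare words that happen to co-occur, which is why a frequency filter is usually applied before ranking by it, and why an unfiltered list is dominated by one-off pairs.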