# Collocations

Collocations are phrases of two or more words that tend to occur together.
They are meaningful bigrams, trigrams, or longer n-grams.
For example, in the sentence "He applied for Machine learning", 'Machine learning' is a collocation.

### How are Collocations different from regular BiGrams or TriGrams?
Some combinations of two or more words co-occur often but carry no meaning as a unit.
A bigram or trigram is any sequence of two or three adjacent words, so it is not necessarily a collocation.
For example, ‘good film’ and ‘bad man’ are bigrams but not collocations.
Collocations are words that form a single unit of meaning when they appear together, irrespective of how frequently they occur in the whole corpus, such as Las Vegas, United States, Union Territory, and New York.
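To make the distinction concrete, the following minimal sketch lists every bigram in the example sentence using only the standard library. The `ngrams` helper is an illustrative name written for this example (NLTK offers a similar `nltk.ngrams`). Every adjacent pair is a bigram, but only ('Machine', 'learning') is a collocation.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence_tokens = "He applied for Machine learning".split()

# Every pair of neighbouring words is a bigram
all_bigrams = ngrams(sentence_tokens, 2)
print(all_bigrams)
```

Nothing in the bigram list itself marks ('Machine', 'learning') as special, which is why the scoring measures later in this notebook are needed.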




In [46]:
#load all libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords 
import spacy

In [33]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Madhavi\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

In [5]:
#text data sample

documents=['A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story ',
     'This quiet , introspective and entertaining independent is worth seeking ', 'Even fans of Ismail Merchants work , I suspect , would have a hard time sitting through this one .']

## Preprocessing

In [20]:
nlp=spacy.load("en_core_web_sm")

def clean_comments(text):
    #remove punctuations
    regex = re.compile(r'[^a-zA-Z0-9\s]+')
    nopunct = regex.sub("", str(text))
    nopunct=re.sub(r'[\s]+',' ' ,nopunct)
    #use spacy to lemmatize comments
    text=str(nopunct).strip()
    doc = nlp(text, disable=['parser','ner'])
    lemma = [token.lemma_ for token in doc]
    return lemma

In [21]:
clean_text=[]

for text in documents:
    clean_text.append(clean_comments(text))
    
print(clean_text)

[['a', 'series', 'of', 'escapade', 'demonstrate', 'the', 'adage', 'that', 'what', 'be', 'good', 'for', 'the', 'goose', 'be', 'also', 'good', 'for', 'the', 'gander', 'some', 'of', 'which', 'occasionally', 'amuse', 'but', 'none', 'of', 'which', 'amount', 'to', 'much', 'of', 'a', 'story'], ['this', 'quiet', 'introspective', 'and', 'entertaining', 'independent', 'be', 'worth', 'seek'], ['even', 'fan', 'of', 'Ismail', 'Merchants', 'work', 'I', 'suspect', 'would', 'have', 'a', 'hard', 'time', 'sit', 'through', 'this', 'one']]


In [25]:
#Flattening of list of lists of tokens into single list
comment_tokens=[token for tokens in clean_text for token in tokens]

In [26]:
print(comment_tokens)

['a', 'series', 'of', 'escapade', 'demonstrate', 'the', 'adage', 'that', 'what', 'be', 'good', 'for', 'the', 'goose', 'be', 'also', 'good', 'for', 'the', 'gander', 'some', 'of', 'which', 'occasionally', 'amuse', 'but', 'none', 'of', 'which', 'amount', 'to', 'much', 'of', 'a', 'story', 'this', 'quiet', 'introspective', 'and', 'entertaining', 'independent', 'be', 'worth', 'seek', 'even', 'fan', 'of', 'Ismail', 'Merchants', 'work', 'I', 'suspect', 'would', 'have', 'a', 'hard', 'time', 'sit', 'through', 'this', 'one']
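One side effect of flattening is that n-grams can now span document boundaries: the last word of one review is paired with the first word of the next. The sketch below uses only the standard library and a shortened copy of the token lists to compare the flattened approach with counting inside each document separately. The helper name `bigrams` and the shortened `docs` list are illustrative. NLTK's finders also provide a `from_documents` constructor that takes a list of token lists, which may be worth checking for this case.

```python
from collections import Counter

# Shortened token lists taken from the end of the first cleaned document
# and the start of the second one
docs = [
    ["much", "of", "a", "story"],
    ["this", "quiet", "introspective"],
]

def bigrams(tokens):
    """Pair each token with its right-hand neighbour."""
    return list(zip(tokens, tokens[1:]))

# Flattened: one long token stream, so ('story', 'this') is counted
flat_tokens = [token for doc in docs for token in doc]
flat_counts = Counter(bigrams(flat_tokens))

# Per document: bigrams never cross a document boundary
per_doc_counts = Counter(pair for doc in docs for pair in bigrams(doc))

print(("story", "this") in flat_counts)     # True
print(("story", "this") in per_doc_counts)  # False
```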


### Initialize NLTK's Bigrams/Trigrams Finder

Some association measures for scoring and filtering bigrams and trigrams:
1. Frequency counting
2. Pointwise Mutual Information (PMI)
3. Hypothesis testing (t-test and chi-square)
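Hypothesis testing is listed here but not applied later in this notebook. As a rough illustration, a common approximation of the t-test score for a bigram compares the observed bigram count with the count expected if the two words were independent. The function name `t_score` and its arguments are illustrative, not NLTK's API (NLTK exposes a comparable measure as `BigramAssocMeasures.student_t`).

```python
import math

def t_score(pair_count, first_count, second_count, total_words):
    """Approximate t-score: (observed - expected) / sqrt(observed)."""
    expected = first_count * second_count / total_words
    return (pair_count - expected) / math.sqrt(pair_count)

# Counts for ('good', 'for') in the sample tokens: the pair occurs twice,
# each word occurs twice, and there are 61 tokens in total
frequent_pair = t_score(2, 2, 2, 61)

# Counts for ('Ismail', 'Merchants'): everything occurs exactly once
rare_pair = t_score(1, 1, 1, 61)

print(round(frequent_pair, 3), round(rare_pair, 3))
```

Under this approximation the pair seen twice scores higher than the pair seen once, so the t-score rewards repeated evidence more than PMI does.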

In [24]:
bigrams = nltk.collocations.BigramAssocMeasures()
trigrams = nltk.collocations.TrigramAssocMeasures()

In [28]:
bigramFinder = nltk.collocations.BigramCollocationFinder.from_words(comment_tokens)
trigramFinder = nltk.collocations.TrigramCollocationFinder.from_words(comment_tokens)

### Frequency Counting as scoring metric

1. The simplest method is to rank bigrams or trigrams by their number of occurrences.

2. This ranking is dominated by very frequent pairs, so pronouns, articles, and prepositions come up often.

To fix this, we drop every n-gram that contains a stop word and keep only the following part-of-speech structures:
- Bigrams: (Noun, Noun), (Adjective, Noun)
- Trigrams: (Adjective/Noun, Anything, Adjective/Noun)
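The filtering idea can be sketched without a tagger by attaching part-of-speech tags by hand. In the example below the tags and the small `STOP_WORDS` set are assumptions made for illustration (they are not the output of `nltk.pos_tag` or NLTK's stop word list), and `keep_bigram` is a hypothetical helper name. It keeps only (Adjective/Noun, Noun) pairs that contain no stop word.

```python
from collections import Counter

# Hand-assigned Penn Treebank tags (an assumption, not tagger output)
tagged_tokens = [
    ("even", "RB"), ("fan", "NN"), ("of", "IN"), ("Ismail", "NNP"),
    ("Merchants", "NNP"), ("work", "NN"), ("I", "PRP"), ("suspect", "VBP"),
    ("would", "MD"), ("have", "VB"), ("a", "DT"), ("hard", "JJ"),
    ("time", "NN"), ("sit", "VB"), ("through", "IN"), ("this", "DT"),
    ("one", "CD"),
]

# Tiny illustrative stop word set (NLTK's English list is much longer)
STOP_WORDS = {"of", "i", "would", "have", "a", "through", "this"}
FIRST_TAGS = {"JJ", "JJR", "JJS", "NN", "NNS", "NNP", "NNPS"}
SECOND_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def keep_bigram(first, second):
    """Keep (Adjective/Noun, Noun) pairs that contain no stop word."""
    (word1, tag1), (word2, tag2) = first, second
    if word1.lower() in STOP_WORDS or word2.lower() in STOP_WORDS:
        return False
    return tag1 in FIRST_TAGS and tag2 in SECOND_TAGS

filtered_counts = Counter(
    (first[0], second[0])
    for first, second in zip(tagged_tokens, tagged_tokens[1:])
    if keep_bigram(first, second)
)
print(filtered_counts)
```

Because the tags here come from the full sentence, the result can differ from tagging each n-gram in isolation, where a word such as 'sit' may be tagged differently.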

In [29]:
# bigram
bigram_freq = bigramFinder.ngram_fd.items()
bigramFreqTable = pd.DataFrame(list(bigram_freq), columns=['bigram','freq']).sort_values(by='freq', ascending=False)

bigramFreqTable.head().reset_index(drop=True)

Unnamed: 0,bigram,freq
0,"(good, for)",2
1,"(for, the)",2
2,"(of, which)",2
3,"(a, series)",1
4,"(of, Ismail)",1


In [30]:
# english stopwords
en_stopwords = set(stopwords.words('english'))

In [31]:
#function to filter for ADJ/NN bigrams
def rightTypes(ngram):
    # reject pronoun placeholders, empty or whitespace tokens, and the stray token 't'
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or 't' in ngram:
        return False
    # reject n-grams containing a stop word (the stop word list is lowercase)
    for word in ngram:
        if word.lower() in en_stopwords:
            return False
    acceptable_types = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    second_type = ('NN', 'NNS', 'NNP', 'NNPS')
    # note: the n-gram is tagged in isolation, without its sentence context
    tags = nltk.pos_tag(ngram)
    return tags[0][1] in acceptable_types and tags[1][1] in second_type


In [34]:
#filter bigrams
filtered_bi = bigramFreqTable[bigramFreqTable.bigram.map(rightTypes)]

In [35]:
filtered_bi[:10]

Unnamed: 0,bigram,freq
33,"(quiet, introspective)",1
39,"(worth, seek)",1
44,"(Ismail, Merchants)",1
52,"(hard, time)",1
53,"(time, sit)",1
3,"(escapade, demonstrate)",1


In [36]:
#Trigram
trigram_freq = trigramFinder.ngram_fd.items()
trigramFreqTable = pd.DataFrame(list(trigram_freq), columns=['trigram','freq']).sort_values(by='freq', ascending=False)
trigramFreqTable.head().reset_index(drop=True)

Unnamed: 0,trigram,freq
0,"(good, for, the)",2
1,"(a, series, of)",1
2,"(fan, of, Ismail)",1
3,"(a, story, this)",1
4,"(story, this, quiet)",1


In [37]:
#function to filter trigrams
def rightTypesTri(ngram):
    # reject pronoun placeholders, empty or whitespace tokens, and the stray token 't'
    if '-pron-' in ngram or '' in ngram or ' ' in ngram or '  ' in ngram or 't' in ngram:
        return False
    # reject n-grams containing a stop word (the stop word list is lowercase)
    for word in ngram:
        if word.lower() in en_stopwords:
            return False
    first_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    third_type = ('JJ', 'JJR', 'JJS', 'NN', 'NNS', 'NNP', 'NNPS')
    # only the first and third words are checked; the middle word can be anything
    tags = nltk.pos_tag(ngram)
    return tags[0][1] in first_type and tags[2][1] in third_type

In [38]:
filtered_tri = trigramFreqTable[trigramFreqTable.trigram.map(rightTypesTri)]

In [39]:
filtered_tri[:10]

Unnamed: 0,trigram,freq
46,"(Ismail, Merchants, work)",1
54,"(hard, time, sit)",1


### Pointwise Mutual Information (PMI) as scoring metric

PMI compares how often two words occur together with how often they would be expected to co-occur if they were independent. However, it is very sensitive to rare combinations of words. For example, if a random bigram ‘abc xyz’ appears once, and neither ‘abc’ nor ‘xyz’ appears anywhere else in the text, ‘abc xyz’ will be scored as a highly significant bigram when it could just be a misspelling or a phrase too rare to generalize. Therefore, PMI is often used together with a frequency filter.
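As a minimal sketch of the calculation, the code below computes bigram PMI by hand as log2 of the joint count times the number of tokens, divided by the product of the two word counts. It uses only the standard library and the cleaned tokens from this notebook. The helper name `pmi` is illustrative, and the formula is assumed to match what NLTK computes because it reproduces the scores NLTK reports for this data.

```python
import math
from collections import Counter

# The 61 cleaned tokens from the three sample documents
tokens = (
    "a series of escapade demonstrate the adage that what be good for the "
    "goose be also good for the gander some of which occasionally amuse but "
    "none of which amount to much of a story this quiet introspective and "
    "entertaining independent be worth seek even fan of Ismail Merchants "
    "work I suspect would have a hard time sit through this one"
).split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_words = len(tokens)

def pmi(word1, word2):
    """log2( P(word1, word2) / (P(word1) * P(word2)) ) from raw counts."""
    joint = bigram_counts[(word1, word2)]
    return math.log2(joint * n_words / (word_counts[word1] * word_counts[word2]))

print(round(pmi("good", "for"), 6))          # 4.930737
# A pair whose words each occur only once scores even higher
print(pmi("Ismail", "Merchants") > pmi("good", "for"))  # True
```

The second line of output illustrates the weakness described above: a pair seen a single time outranks a pair seen twice, which is the motivation for applying a frequency filter before scoring.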

In [42]:
#bigram
#keep only bigrams that occur at least 2 times
bigramFinder.apply_freq_filter(2)
bigramPMITable = pd.DataFrame(list(bigramFinder.score_ngrams(bigrams.pmi)), columns=['bigram','PMI']).sort_values(by='PMI', ascending=False)

In [44]:
bigramPMITable[:4]

Unnamed: 0,bigram,PMI
0,"(good, for)",4.930737
1,"(for, the)",4.345775
2,"(of, which)",3.608809


In [43]:
#trigram
trigramFinder.apply_freq_filter(2)
trigramPMITable = pd.DataFrame(list(trigramFinder.score_ngrams(trigrams.pmi)), columns=['trigram','PMI']).sort_values(by='PMI', ascending=False)

In [45]:
trigramPMITable[:4]

Unnamed: 0,trigram,PMI
0,"(good, for, the)",9.276512
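The trigram score can be checked by hand. The sketch below assumes PMI is extended to three words as log2 of the joint probability over the product of the three individual word probabilities, an assumption that is consistent with the value reported here. The counts come from the sample tokens: 61 tokens in total, 'good' and 'for' twice each, 'the' three times, and the trigram itself twice. The function name `trigram_pmi` is illustrative.

```python
import math

def trigram_pmi(trigram_count, count1, count2, count3, total_words):
    """log2( P(w1, w2, w3) / (P(w1) * P(w2) * P(w3)) ) from raw counts."""
    numerator = trigram_count * total_words ** 2
    denominator = count1 * count2 * count3
    return math.log2(numerator / denominator)

# ('good', 'for', 'the'): trigram seen 2 times; word counts 2, 2 and 3
score = trigram_pmi(2, 2, 2, 3, 61)
print(round(score, 4))  # 9.2765
```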


References:

https://github.com/nicharuc/Collocations/blob/master/Collocations.ipynb

https://www.nltk.org/howto/collocations.html

https://medium.com/@nicharuch/collocations-identifying-phrases-that-act-like-individual-words-in-nlp-f58a93a2f84a
