# 6 Corpus Exploration

This Notebook provides various tools to explore and compare corpora in more detail. The techniques described here are especially powerful in combination with the content of Notebook 5 **Corpus Creation**, particularly when different subcorpora are compared and contrasted with each other.


At the end of this Notebook

More specifically we discuss:

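- Keyword in Context analysis: similar to the concordance tool in AntConc.
- Collocations: word pairs that co-occur more often than chance would predict.
- Feature selection: finding the words that best distinguish one subcorpus from another.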

## 6.1 Keyword in Context

In [1]:
import nltk
from pathlib import Path
from nltk.tokenize import wordpunct_tokenize

In [2]:
moh_reports = list(Path('data/MOH/python').glob('*.txt'))

In [3]:
moh_reports[:10]

[PosixPath('data/MOH/python/PoplarMetropolitanBorough.1945.b18246175.txt'),
 PosixPath('data/MOH/python/CityofWestminster.1932.b18247945.txt'),
 PosixPath('data/MOH/python/CityofWestminster.1921.b18247830.txt'),
 PosixPath('data/MOH/python/PoplarandBromley.1900.b18245754.txt'),
 PosixPath('data/MOH/python/Poplar.1919.b18120878.txt'),
 PosixPath('data/MOH/python/PoplarMetropolitanBorough.1920.b18245924.txt'),
 PosixPath('data/MOH/python/CityofWestminster.1907.b18247726.txt'),
 PosixPath('data/MOH/python/CityofWestminster.1906.b18247714.txt'),
 PosixPath('data/MOH/python/CityofWestminster.1903.b18247684.txt'),
 PosixPath('data/MOH/python/PoplarMetropolitanBorough.1902.b18245778.txt')]
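
The file names listed above appear to follow a `Place.Year.Identifier.txt` pattern. Assuming this pattern holds for every report (an assumption based only on the ten names shown), a small helper could split each name into its parts, which may be useful when building subcorpora by place or period. The names `ReportInfo` and `parse_report_name` are hypothetical and not part of the original Notebook.

```python
from pathlib import Path
from typing import NamedTuple


class ReportInfo(NamedTuple):
    place: str
    year: int
    identifier: str


def parse_report_name(path):
    """Split 'Place.Year.Identifier.txt' into its three parts.

    Hypothetical helper: assumes the naming pattern seen in the listing above.
    """
    # Path.stem drops only the final '.txt' suffix.
    place, year, identifier = Path(path).stem.split('.')
    return ReportInfo(place, int(year), identifier)


info = parse_report_name('data/MOH/python/CityofWestminster.1932.b18247945.txt')
print(info.place, info.year, info.identifier)
```

A name that does not contain exactly three dot-separated parts raises a `ValueError`, which makes deviations from the assumed pattern easy to spot.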

In [4]:
from tqdm.notebook import tqdm

corpus = []

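for r in tqdm(moh_reports):
    with open(r) as in_doc:
        # Lowercase the report and split it into word and punctuation tokens.
        tokens = wordpunct_tokenize(in_doc.read().lower())
        # Keep purely alphabetic tokens; this drops numbers and punctuation.
        for token in tokens:
            if token.isalpha():
                corpus.append(token)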






In [7]:
print('collected', len(corpus),'tokens')
nltk_corpus = nltk.text.Text(corpus)

collected 3550169 tokens
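
To see what the tokenisation loop above keeps and discards, here is a minimal sketch that uses only the standard library. The regular expression is meant to mirror the pattern behind NLTK's `wordpunct_tokenize`; treat it as an approximation rather than a drop-in replacement. The function name `alpha_tokens` and the sample sentence are made up for illustration.

```python
import re

# Approximation of wordpunct_tokenize: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def alpha_tokens(text):
    """Lowercase, tokenise, and keep purely alphabetic tokens only."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    return [token for token in tokens if token.isalpha()]


sample = "Death-rate: 12.5 per cent. of 1,112 cases!"
print(alpha_tokens(sample))
```

Because all figures are removed at this stage, phrases such as "between per cent and per cent" appear in the concordance lines without their numbers.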


In [8]:
nltk_corpus.concordance('poor')

Displaying 25 of 1112 matches:
lt to arrange but the friends of the poor and the charity organisation society
in one case the milk proved to be of poor quality the work is carried out in t
uality between per cent and per cent poor quality between per cent and per cen
rict total good quality fair quality poor quality adulterated no percent no pe
e applicant is already in receipt of poor law relief or is considered ought to
reviously notified under to to total poor law institutions sanatoria poor law 
otal poor law institutions sanatoria poor law institutions sanatoria pulmonary
ulosis and the treatment of cases in poor law and other hospitals advance in s
the fat was between and per cent and poor or inferior quality in which the fat
 no per cent fair quality no percent poor quality no percent adulterated no pe
er to to total primary notifications poor law institution sanatoria pulmonary 
er to to total primary notifications poor law institutions sanatoria pulmonary
er to to total primar
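
The idea behind a concordance can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions, not how NLTK implements `concordance`: the context here is measured in tokens, whereas the NLTK output above is trimmed to a fixed character width. The function name `kwic` and the toy sentence are hypothetical.

```python
def kwic(tokens, keyword, width=4):
    """Return (left, keyword, right) context tuples for each match."""
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            # Up to `width` tokens on either side of the match.
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


toy_tokens = ('the friends of the poor and the charity organisation '
              'society reported milk of poor quality').split()

for left, key, right in kwic(toy_tokens, 'poor'):
    # Right-align the left context so the keywords line up in one column.
    print(f'{left:>25} {key} {right}')
```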

## 6.2 Collocations

In [29]:
nltk_corpus.collocations()

per cent; public health; county council; london county; medical
officer; scarlet fever; whooping cough; males females; local
government; legal proceedings; dwelling houses; poplar bromley; small
pox; ice cream; sub district; government board; child welfare; city
council; death rate; bromley bow


In [98]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# Score every adjacent word pair by pointwise mutual information (PMI).
finder = BigramCollocationFinder.from_words(nltk_corpus)
finder.nbest(bigram_measures.pmi, 10)

[('abso', 'lutely'),
 ('acidi', 'lacfc'),
 ('acquires', 'setiological'),
 ('adolph', 'mussi'),
 ('adolphus', 'massie'),
 ('adultorated', 'sanples'),
 ('adver', 'tising'),
 ('aeql', 'rrhage'),
 ('alathilde', 'christoffersen'),
 ('alio', 'wances')]

In [99]:
finder.apply_freq_filter(3)
finder.nbest(bigram_measures.pmi, 10)

[('bowers', 'gifford'),
 ('carrie', 'simuelson'),
 ('culex', 'pipiens'),
 ('heatherfield', 'ascot'),
 ('holmes', 'godson'),
 ('lehman', 'ashmead'),
 ('locum', 'tenens'),
 ('nemine', 'contradicente'),
 ('quinton', 'polyclinic'),
 ('rhesus', 'incompatibility')]
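
The two result lists above illustrate a known property of PMI: it rewards pairs whose members rarely occur apart, so words that appear only once or twice (often OCR errors such as split words) dominate the ranking until a frequency filter is applied. The sketch below computes PMI for a bigram from its marginal counts. The formula is written to match the bigram case as I understand NLTK to compute it; the counts are invented for illustration and are not taken from the MOH corpus.

```python
import math


def pmi(n_ii, n_ix, n_xi, n_xx):
    """Pointwise mutual information of a bigram from its marginal counts.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams starting with w1
    n_xi: count of bigrams ending with w2
    n_xx: total number of bigrams
    """
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)


total = 1_000_000  # illustrative corpus size, not the MOH figure

# Two words that each occur once, and only together (e.g. an OCR split).
rare = pmi(1, 1, 1, total)

# Two frequent words that co-occur often, but also appear apart.
frequent = pmi(500, 2_000, 1_000, total)

print(round(rare, 2), round(frequent, 2))
```

With these toy counts the one-off pair scores far higher than the frequent pair, even though the latter is the more interesting collocation. Raising the minimum frequency trades recall for cleaner results.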

In [None]:
# finder = BigramCollocationFinder.from_words(nltk_corpus, window_size = 20)
# finder.apply_freq_filter(3)
# finder.nbest(bigram_measures.pmi, 10)

In [117]:
finder = BigramCollocationFinder.from_words(nltk_corpus)
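# apply_ngram_filter removes every bigram for which the function returns True,
# so this keeps only the bigrams that contain 'poor'.
token_filter = lambda *w: 'poor' not in w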
finder.apply_ngram_filter(token_filter)
finder.nbest(bigram_measures.pmi, 10)

[('apprenticing', 'poor'),
 ('poor', 'gentlewomen'),
 ('poor', 'lawinstitu'),
 ('qualitj', 'poor'),
 ('regulgtions', 'poor'),
 ('poor', 'attenders'),
 ('poor', 'palatines'),
 ('poor', 'genl'),
 ('poor', 'ffour'),
 ('poor', 'packaging')]
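
The list above again contains OCR errors (`qualitj`, `regulgtions`), because the newly created finder has no frequency filter applied. The combined effect of a frequency filter and a word filter can be sketched in plain Python as follows. This is an illustration on invented tokens, with a hypothetical function name `neighbours`; in NLTK the equivalent would presumably be to call both `apply_freq_filter` and `apply_ngram_filter` on the same finder.

```python
from collections import Counter


def neighbours(tokens, target, min_count=2):
    """Count bigrams containing `target`, keeping those seen at least `min_count` times."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        bigram: count
        for bigram, count in bigrams.items()
        if target in bigram and count >= min_count
    }


toy_tokens = ('poor law relief poor quality milk poor law institutions '
              'poor quality qualitj poor').split()

print(neighbours(toy_tokens, 'poor'))
```

The one-off pair containing the misspelling `qualitj` is dropped, while the repeated pairs survive.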

In [119]:
help(bigram_measures)

Help on BigramAssocMeasures in module nltk.metrics.association object:

class BigramAssocMeasures(NgramAssocMeasures)
 |  A collection of bigram association measures. Each association measure
 |  is provided as a function with three arguments::
 |  
 |      bigram_score_fn(n_ii, (n_ix, n_xi), n_xx)
 |  
 |  The arguments constitute the marginals of a contingency table, counting
 |  the occurrences of particular events in a corpus. The letter i in the
 |  suffix refers to the appearance of the word in question, while x indicates
 |  the appearance of any word. Thus, for example:
 |  
 |      n_ii counts (w1, w2), i.e. the bigram being scored
 |      n_ix counts (w1, *)
 |      n_xi counts (*, w2)
 |      n_xx counts (*, *), i.e. any bigram
 |  
 |  This may be shown with respect to a contingency table::
 |  
 |              w1    ~w1
 |           ------ ------
 |       w2 | n_ii | n_oi | = n_xi
 |           ------ ------
 |      ~w2 | n_io | n_oo |
 |           ------ ------
 |         
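
The contingency table in the help text can be filled in from the four marginals alone. The following sketch derives the remaining cells using the naming from the help output; the function name `contingency` and the counts are illustrative assumptions.

```python
def contingency(n_ii, n_ix, n_xi, n_xx):
    """Derive the four cells of the 2x2 table from the marginals."""
    n_oi = n_xi - n_ii                # w2 preceded by a word other than w1
    n_io = n_ix - n_ii                # w1 followed by a word other than w2
    n_oo = n_xx - n_ii - n_oi - n_io  # bigrams with neither w1 first nor w2 second
    return n_ii, n_oi, n_io, n_oo


# Illustrative counts only.
n_ii, n_oi, n_io, n_oo = contingency(n_ii=30, n_ix=100, n_xi=50, n_xx=10_000)
print(n_ii, n_oi, n_io, n_oo)
```

The four cells always add up to `n_xx`, which is a quick sanity check when computing association measures by hand.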

## 6.3 Feature selection

In [121]:
from tqdm.notebook import tqdm

corpus = []
labels = []


for r in tqdm(moh_reports):
    with open(r) as in_doc:
        # Label Westminster reports 1 and all other reports 0.
        if 'westminster' in r.name.lower():
            labels.append(1)
        else:
            labels.append(0)

        corpus.append(in_doc.read().lower())
  





In [122]:
print(len(labels),len(corpus))

159 159


In [151]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_extraction.text import CountVectorizer 

In [156]:
vectorizer = CountVectorizer(min_df=5)
X = vectorizer.fit_transform(corpus)
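# get_feature_names() was removed in scikit-learn 1.2; on versions older than 1.0, use it instead.
feature_names = vectorizer.get_feature_names_out()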


In [159]:
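ch2 = SelectKBest(chi2, k=10)
# Store the reduced matrix under a new name so that X still matches feature_names
# and the cell can be re-run safely.
X_selected = ch2.fit_transform(X, labels)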


In [160]:
selected = [(feature_names[i], ch2.scores_[i])
            for i in ch2.get_support(indices=True)]
selected

[('borough', 6827.175533272762),
 ('bow', 6681.346216439548),
 ('bromley', 6861.134136366376),
 ('city', 4592.729181914567),
 ('east', 1499.0376786761663),
 ('poplar', 11888.857471790638),
 ('road', 8510.875738951223),
 ('see', 2314.6724275246893),
 ('street', 4330.436649540313),
 ('westminster', 5105.364636488248)]
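
To make the scores above less of a black box, here is a pure-Python sketch of a chi-square score per term, written to follow the calculation as I understand scikit-learn's `chi2` to perform it: for each class, compare the observed total count of a term with the count expected if the term were spread evenly over all documents. The function name `chi2_scores` and the toy counts are assumptions for illustration.

```python
def chi2_scores(X, y):
    """Chi-square score per feature for a document-term count matrix.

    X: list of rows, one per document, holding term counts.
    y: list of class labels, one per document.
    Assumes every term occurs at least once (otherwise expected is 0).
    """
    classes = sorted(set(y))
    n_docs, n_features = len(X), len(X[0])
    scores = []
    for j in range(n_features):
        total = sum(row[j] for row in X)
        score = 0.0
        for c in classes:
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected count if the term were independent of the class.
            expected = total * y.count(c) / n_docs
            score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores


# Toy counts for the terms ['poplar', 'westminster', 'health'] in four reports.
X_toy = [[9, 0, 5],
         [7, 1, 5],
         [0, 8, 5],
         [1, 6, 5]]
y_toy = [0, 0, 1, 1]

print(chi2_scores(X_toy, y_toy))
```

A term that is evenly spread over both classes scores zero, while terms concentrated in one class score high. Since the labels here were derived from place names, it is unsurprising that place names such as `poplar` and `westminster` top the list; because the scores are based on raw counts, frequent words also tend to receive larger values.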

In [165]:
!pip3 install TextFeatureSelection

Collecting TextFeatureSelection
  Downloading https://files.pythonhosted.org/packages/42/3d/351dcabf4198218a4b7421e6f6069eb089af6f5642e8fdd5d95f11904726/TextFeatureSelection-0.0.12-py3-none-any.whl
Installing collected packages: TextFeatureSelection
Successfully installed TextFeatureSelection-0.0.12


In [168]:
from TextFeatureSelection import TextFeatureSelection
fsOBJ=TextFeatureSelection(target=labels,input_doc_list=corpus)
result_df=fsOBJ.getScore()
result_df

Unnamed: 0,word list,word occurence count,Proportional Difference,Mutual Information,Chi Square,Information Gain
0,00,103,-0.009709,0.094959,2.463282,0.004326
1,000,149,0.073826,0.008605,0.150191,0.000266
2,000000,1,-1.000000,0.778445,1.185538,0.001507
3,0001,3,1.000000,-inf,2.595483,0.000000
4,000163,1,1.000000,-inf,0.854210,0.000000
...,...,...,...,...,...,...
42232,¾gallons,1,-1.000000,0.778445,1.185538,0.001507
42233,¾ths,1,-1.000000,0.778445,1.185538,0.001507
42234,ægis,1,1.000000,-inf,0.854210,0.000000
42235,æration,1,-1.000000,0.778445,1.185538,0.001507


In [173]:
result_df[result_df['word occurence count'] > 5].sort_values('Chi Square',ascending=False)[:20]

Unnamed: 0,word list,word occurence count,Proportional Difference,Mutual Information,Chi Square,Information Gain
30606,pop,59,-1.0,0.778445,110.51589,0.184282
9432,bow,89,-0.640449,0.580268,106.152339,0.0
21070,horseferry,71,0.971831,-3.484235,102.313762,0.23972
8788,bessborough,67,1.0,-inf,98.289813,0.0
42219,zymotic,94,-0.553191,0.525609,93.326942,0.0
26433,millbank,62,1.0,-inf,86.266363,0.0
15282,dock,66,-0.787879,0.666327,85.911216,0.149069
41176,wes,64,0.96875,-3.380438,84.840438,0.205713
22037,india,67,-0.761194,0.65129,82.833472,0.144441
30188,pimlico,63,0.968254,-3.36469,82.552451,0.201071
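
The `Proportional Difference` column can be read as a measure of how one-sided a word is. The values in the table are consistent with the formula (A - B) / (A + B), where A and B are the numbers of documents containing the word in class 1 (Westminster) and class 0 respectively. This reading is inferred from the table, not taken from the library's documentation, so treat the sketch below as an assumption; the function name is hypothetical.

```python
def proportional_difference(docs_pos, docs_neg):
    """(A - B) / (A + B), where A and B are document counts per class."""
    return (docs_pos - docs_neg) / (docs_pos + docs_neg)


# A word found in 16 Westminster reports and 73 other reports
# (counts inferred from the table row for 'bow', not reported by the library).
print(round(proportional_difference(16, 73), 6))

# A word that occurs in one class only scores exactly +1 or -1.
print(proportional_difference(67, 0), proportional_difference(0, 59))
```

Under this reading, values near +1 mark words typical of the Westminster reports and values near -1 mark words typical of the other reports.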


In [170]:
help(result_df.sort_values)

Help on method sort_values in module pandas.core.frame:

sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last') method of pandas.core.frame.DataFrame instance
    Sort by the values along either axis.
    
    Parameters
    ----------
            by : str or list of str
                Name or list of names to sort by.
    
                - if `axis` is 0 or `'index'` then `by` may contain index
                  levels and/or column labels
                - if `axis` is 1 or `'columns'` then `by` may contain column
                  levels and/or index labels
    
                .. versionchanged:: 0.23.0
                   Allow specifying index or column level names.
    axis : {0 or 'index', 1 or 'columns'}, default 0
         Axis to be sorted.
    ascending : bool or list of bool, default True
         Sort ascending vs. descending. Specify list for multiple sort
         orders.  If this is a list of bools, must match the length of
      

In [None]:
best_vocabulary