# Tutorial for stopword filtering

An interactive notebook that demonstrates stopword filtering.

In [1]:
## import packages

%load_ext autoreload
%autoreload 2

import os,sys
import numpy as np
import pandas as pd

# display the figure in the notebook
# %matplotlib inline
# import matplotlib.pyplot as plt
# cmap = 'tab10'
# cm = plt.get_cmap(cmap)

## custom packages
src_dir = os.path.join( 'src')
sys.path.append(src_dir)

from filter_words import run_stopword_statistics
from filter_words import make_stopwords_filter
from filter_words import remove_stopwords_from_list_texts

## 1) Load corpus

Get the 20 Newsgroups corpus. These are news articles from 20 different categories (newsgroups).

We get a list of documents, where each entry is a list of tokens.

In [2]:
corpus_name = '20NewsGroup'
filename = os.path.join(os.pardir,'data','%s_corpus.csv'%(corpus_name))
df = pd.read_csv(filename,index_col=0)
list_texts = [  [h.strip() for h in doc.split()  ] for doc in df['text']    ]
list_texts[0] ## this is the first doc

['new',
 'religion',
 'forming',
 'sign',
 'yawn',
 'the',
 'church',
 'kibology',
 'did',
 'first',
 'and',
 'better']

## 2) Get stopword statistics

We calculate several statistics for each word in order to construct different stopword filters:

- F, relative frequency
- I, information content
- tfidf, term frequency-inverse document frequency
- manual, 1 if the word occurs in the manual stopword list, otherwise nan
- H, empirical conditional entropy
- H-tilde, expected conditional entropy from a randomized null model
- N, frequency (number of counts)
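
To make these quantities concrete, here is a minimal sketch that computes N, F and H on a toy corpus. It is not the code in `filter_words`: the helper `word_statistics` is hypothetical, and it assumes that H is the entropy (in bits) of a word's distribution over documents, $p(d|w) = n(w,d)/n(w)$. The null-model value H-tilde is not computed here. In the table printed by the next cell, I equals H-tilde minus H in every row (for example 13.227127 - 12.982312 = 0.244815 for "the").

```python
import math
from collections import Counter

# Toy corpus in the same shape as list_texts: a list of token lists.
toy_texts = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
    ["cat", "cat", "cat"],
]


def word_statistics(texts):
    """Return {word: (N, F, H)} for a list of tokenized documents.

    Hypothetical helper for illustration; assumes H is the entropy in bits
    of p(d|w) = n(w, d) / n(w).
    """
    total_tokens = sum(len(doc) for doc in texts)

    # counts_per_doc[word][doc_index] = occurrences of word in that document
    counts_per_doc = {}
    for d, doc in enumerate(texts):
        for word, n in Counter(doc).items():
            counts_per_doc.setdefault(word, {})[d] = n

    stats = {}
    for word, per_doc in counts_per_doc.items():
        n_w = sum(per_doc.values())
        h_w = -sum((n / n_w) * math.log2(n / n_w) for n in per_doc.values())
        stats[word] = (n_w, n_w / total_tokens, h_w)
    return stats


stats = word_statistics(toy_texts)
for word, (n_w, f_w, h_w) in sorted(stats.items()):
    print(f"{word:>4}  N={n_w}  F={f_w:.3f}  H={h_w:.3f}")
```

Under this definition, a word spread evenly over many documents has a high H, while a word concentrated in a single document has H = 0.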



In [3]:
## path to a manual stopword list (this one is from mallet)
path_stopword_list =  os.path.join(os.pardir,'data','stopword_list_en')

## number of realizations for the random null model
N_s = 10

## get the statistics
df = run_stopword_statistics(list_texts,N_s=N_s,path_stopword_list=path_stopword_list)

## look at the entries
df.sort_values(by='F',ascending=False).head()

| word | F | I | tfidf | manual | H | H-tilde | H-tilde_std | N |
|---|---|---|---|---|---|---|---|---|
| the | 0.062401 | 0.244815 | 1.007189 | 1.0 | 12.982312 | 13.227127 | 0.004062 | 239094 |
| and | 0.024848 | 0.333264 | 1.142009 | 1.0 | 12.800792 | 13.134056 | 0.003877 | 95205 |
| that | 0.016991 | 0.29358 | 1.679582 | 1.0 | 12.76476 | 13.05834 | 0.007336 | 65103 |
| for | 0.011996 | 0.04851 | 1.088629 | 1.0 | 12.916255 | 12.964765 | 0.010105 | 45965 |
| you | 0.01162 | 0.459559 | 2.252701 | 1.0 | 12.497241 | 12.9568 | 0.005646 | 44521 |


## 3)  Construct a stopword filter

We construct different stopword filters based on different statistics.

For this we have to specify three components:

- A) method; this specifies the statistic that we use to construct the stopword list. In detail, we define a statistic $S(w)$ and assign words to the stopword list in order of increasing $S(w)$, from low to high (e.g. $S(w) = F(w)$ assigns low-frequency words to the stopword list first). Possible options are:

    - 'INFOR',  filter words with high values of Information-content I [S=-I]
    - 'BOTTOM', filter words with low values of frequency [S = F]
    - 'TOP', filter words with high values of frequency [S = 1/F]
    - 'TFIDF', filter words with low values of tfidf [S=tfidf]
    - 'TFIDF_r', filter words with high values of tfidf [S=-tfidf]
    - 'MANUAL', filter words from a manual stopword list; supply the path via path_stopword_list (S = 1 if the word is in the list, otherwise nan, i.e. the word cannot be considered for removal).
        
        
- B) cutoff_type [defines the way in which we choose the cutoff]

     - 'p', selects stopword list such that a fraction p of tokens gets removed (approximately)
     - 'n', selects stopword list such that a number n of types gets removed
     - 't', selects stopword list such that all words with S<=S_t get removed
    
    
 
- C) cutoff_val [defines the value on which to do the thresholding, see cutoff_type for details]
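
The three cutoff types can be sketched on a toy example. This is not the implementation of `make_stopwords_filter`: the helper `select_stopwords` and the numbers in `toy_stats` are hypothetical, and the 'p' rule used here (add words until the cumulative F would exceed the target) is one possible reading of "approximately".

```python
from itertools import accumulate

# Hypothetical per-word statistics: word -> (F, S)
toy_stats = {
    "the": (0.5, 0.1),
    "cat": (0.125, 0.5),
    "and": (0.25, 0.2),
    "kibology": (0.125, 0.9),
}


def select_stopwords(stats, cutoff_type, cutoff_val):
    """Pick stopwords in order of increasing S (illustrative helper only)."""
    # Rank words from low to high S; words whose S is nan are never candidates.
    ranked = sorted(
        (w for w, (_, s) in stats.items() if s == s),  # nan != nan
        key=lambda w: stats[w][1],
    )
    if cutoff_type == "p":
        # Keep adding words while the removed share of tokens stays <= cutoff_val.
        cum_f = accumulate(stats[w][0] for w in ranked)
        return [w for w, c in zip(ranked, cum_f) if c <= cutoff_val]
    if cutoff_type == "n":
        # Take the first cutoff_val types.
        return ranked[:cutoff_val]
    if cutoff_type == "t":
        # Take every word whose score is at or below the threshold.
        return [w for w in ranked if stats[w][1] <= cutoff_val]
    raise ValueError(f"unknown cutoff_type: {cutoff_type!r}")


print(select_stopwords(toy_stats, "p", 0.75))
print(select_stopwords(toy_stats, "n", 1))
print(select_stopwords(toy_stats, "t", 0.2))
```

All three rules walk the same ranking by S. They differ only in where the ranking is cut: by share of tokens, by number of types, or by score.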



Below you can select different options and inspect the result.

The resulting dataframe ```df_filter``` contains the words that were assigned to the stopword list based on the selection criteria.

In [4]:
## method-options
method = 'INFOR'
# method = 'BOTTOM'
# method = 'TOP'
# method = 'TFIDF'
# method = 'TFIDF_r'
# method = 'MANUAL'



## remove fraction of tokens
cutoff_type = 'p'
cutoff_val = 0.5

## remove number of types
# cutoff_type = 'n'
# cutoff_val = 10

## remove all words with S at or below a threshold value
# cutoff_type = 't'
# cutoff_val = 1

df_filter = make_stopwords_filter(df,
                                  method = method,
                                  cutoff_type = cutoff_type, 
                                  cutoff_val = cutoff_val, )

In [5]:
df_filter

| word | F-cumsum | S |
|---|---|---|
| writes | 0.003482 | -0.748533 |
| article | 0.006477 | -0.613892 |
| thanks | 0.007250 | -0.383101 |
| apr | 0.008511 | -0.321539 |
| anyone | 0.009608 | -0.303154 |
| appreciated | 0.009783 | -0.131501 |
| edu | 0.015199 | -0.106187 |
| wrote | 0.015612 | -0.104174 |
| just | 0.018056 | -0.101488 |
| advance | 0.018277 | -0.100900 |


## 4) Apply the stopword-filter to remove the words from the list of texts

We inspect one document to see the effect of the stopword filter.

We also report the fraction of tokens that remains in the filtered list of texts.

In [6]:
## get the list of words from df_filter and get a filtered list_of_texts
list_words_filter = list(df_filter.index)
list_texts_filter = remove_stopwords_from_list_texts(list_texts, list_words_filter)

print('Original text:', list_texts[0])
print('Filtered text:', list_texts_filter[0])
N = sum([ len(doc) for doc in list_texts ])
N_filter = sum([ len(doc) for doc in list_texts_filter ])
print('Remaining fraction of tokens',N_filter/N)

Original text: ['new', 'religion', 'forming', 'sign', 'yawn', 'the', 'church', 'kibology', 'did', 'first', 'and', 'better']

Filtered text: ['new', 'religion', 'sign', 'church', 'kibology', 'did', 'first']

Remaining fraction of tokens 0.5000040453507306
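
For reference, the removal step can be sketched in a few lines. The helper `remove_words` below is hypothetical and only illustrates the idea; it makes no claim about how `remove_stopwords_from_list_texts` is implemented.

```python
def remove_words(texts, words_to_remove):
    """Drop every token that appears in words_to_remove (illustrative helper)."""
    stop = set(words_to_remove)  # a set gives fast membership tests
    return [[tok for tok in doc if tok not in stop] for doc in texts]


toy_texts = [["new", "religion", "the", "church"], ["the", "and", "better"]]
filtered = remove_words(toy_texts, ["the", "and"])

# Fraction of tokens that survives the filter
remaining = sum(len(doc) for doc in filtered) / sum(len(doc) for doc in toy_texts)
print(filtered)
print("Remaining fraction of tokens", remaining)
```

Converting the stopword list to a set first keeps the filter fast for large corpora, because each token is checked in roughly constant time.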
