# Goal of this notebook

Show what tf-idf (term frequency, inverse document frequency) does to terms from documents - and how it is limited.




## What is it?
### For context

Say you are creating a table of terms by documents, counting how often each term appears.

You can imagine this might be useful for things like 
- compare documents, by use of the same words,
- perhaps find similar terms - if you know which documents are similar, you can guess which terms might be related.

You can also imagine that *just* counts are messy. 
- Longer documents would have higher numbers not because terms are more important but just because there are more words in the document.
- Common words would have high counts just because they are common, not because they are meaningful.
  - In fact, [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law) points out that in natural text,
    common terms are *disproportionately* common - a handful like the, be, to, of, and, in, a, that will cover maybe 20% of all text,
    while being semantically empty function words.

### Basic motivation of tf-idf

tf-idf was introduced in the 1970s as a heuristic of 'term specificity', 
based on two rough intuitions:

A term more common in a document is probably important to that document
but _at the same time_, a term common to all documents is probably less important in general.

- term frequency (tf) is often calculated as `(count of term within document / total terms in document)` - so counts expressed as a ratio.
- document frequency (df) is the count of documents that contain the term.
- inverse document frequency (idf) is often calculated as `log( count of documents / df )`. The word 'inverse' just indicates that we *divide* by the document frequency; the log dampens the result.
- the tf-idf weight of a term in a document is then `tf * idf`.


Those ways of calculating are the basics.
Specific applications may have variant calculations they still call tf-idf (we mention this because if you use someone else's code, it may well use such an alternative, and not show the same values as a basic tutorial would).
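
To make the basic calculation concrete, here is a minimal sketch in plain Python. It assumes the textbook convention (tf as a ratio within the document, idf as `log(N / number of documents containing the term)`); the function name `basic_tfidf` and the toy documents are made up for illustration, and libraries will typically give different numbers because they use variants.

```python
import math
from collections import Counter

def basic_tfidf(tokenized_docs):
    """Return one {term: weight} dict per document, using the textbook formulas."""
    n_docs = len(tokenized_docs)
    # document frequency: in how many documents does each term appear at all?
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        doc_weights = {}
        for term, count in counts.items():
            tf = count / len(tokens)                  # ratio within this document
            idf = math.log(n_docs / doc_freq[term])   # 0 when the term is in every document
            doc_weights[term] = tf * idf
        weights.append(doc_weights)
    return weights

# Made-up toy documents, already split into lowercase tokens
toy_docs = [
    "i am the duck what goes quack".split(),
    "i am the fox".split(),
    "the the is a band".split(),
]
toy_weights = basic_tfidf(toy_docs)
print(toy_weights[1])
```

Note that 'the' ends up with weight 0 in every document, however often it is repeated, because it appears in all of them; 'fox' appears in only one document and gets the largest weight there.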


## What does it give you, and what can you do with it?

TF-IDF isn't a method or goal, it's a tool.

Actually, by itself it's barely a tool, in that the table/matrix itself is,
for each term, just *data* - a weight for each document a term is in.

The use of that data lies in the application,
and different things may use that data very differently. 


Say, if you're doing scores, you probably want some logs in there.

If you are inspecting a collection of documents for what might be more unique and topical terms, 
you might wish to exclude terms appearing in most documents, just to ignore a lot of muck.

If you're creating a field-specific stopword list, or perhaps estimating how _not_ to the point a document is,
the muck is exactly what you are interested in.



### What could you do?

* score results in a search system - idf puts little weight on terms that are common throughout the document set, tf puts more weight on the terms a specific document focuses on

* make stopword lists / automatically downweigh repetitive words
  things you would consider stopwords will automatically get the lowest IDF
  and it's a sliding scale with other words that happen to be common

* try to score a document for how repetitive or overcomplete something is


### Limitations

Note that the larger and more heterogeneous the document set, the more diluted some effects are,
and some uses stand or fall with having a little control over that set.

Say you're comparing all possible media products, then terms like 'film', 'book', 'album' are very distinguishing,
and you may want those to be reported so you can partition them later.

If you're looking at all your products this will be stronger.

If you're looking just within movie reviews, or just within all book reviews, or just album reviews, 
then these are the terms you want to fall away.

IDF, when calculated for such a specific collection, will basically do that for you,
but when you want to do things based on the result, you should be aware of this context sensitivity,
and you may want to do a little more curation.
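
As a small illustration of that context sensitivity, the sketch below computes a plain `log(N / df)` IDF for the same term in two different collections. The helper name `idf` and the toy reviews are made up for this example.

```python
import math

def idf(term, tokenized_docs):
    """Plain log(N / df); assumes the term occurs in at least one document."""
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log(len(tokenized_docs) / df)

# Made-up toy collections
film_reviews = [
    "a film with a slow plot".split(),
    "the film has a great cast".split(),
]
book_reviews = [
    "a book with a slow plot".split(),
    "the book has great prose".split(),
]

# Within film reviews only, 'film' is in every document, so it carries no weight
idf_within_films = idf("film", film_reviews)
# Across both kinds of review, 'film' is in half the documents, so it distinguishes
idf_across_media = idf("film", film_reviews + book_reviews)
print(idf_within_films, idf_across_media)
```

The term did not change, only the collection it was measured against.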

<!-- -->

You should not assume its output is linear or usable as probabilities - it's ordinal, but the distribution is skewed,
so this isn't even very good input to a lot of methods.

It's better than nothing, and e.g. in result scoring the values themselves don't have to make sense.

Applying log will help, but not 

<!-- -->

You should not assume the output lets you do direct comparisons anyway -- 
the numbers you end up with in the table are based on ratios within the document,
which means that longer documents have lower values on each term by definition,
which may not match your intuition of how much a document is about certain terms.
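
A tiny sketch of that length effect, assuming tf is the plain within-document ratio (the helper `term_frequency` and both documents are made up):

```python
from collections import Counter

def term_frequency(term, tokens):
    """Count of the term as a fraction of all tokens in the document."""
    return Counter(tokens)[term] / len(tokens)

# Both made-up documents mention 'fox' exactly twice
short_doc = "fox fox den".split()
long_doc = "fox fox den and a great many other words besides".split()

tf_short = term_frequency("fox", short_doc)   # 2 of 3 tokens
tf_long = term_frequency("fox", long_doc)     # 2 of 10 tokens
print(tf_short, tf_long)
```

Same count, but the longer document scores lower, whether or not it is any less about foxes.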



### Other notes

Whether you see the results of TF-IDF calculation as a document-by-term matrix that happens to be sparsely filled, or another way,
depends a little on implementation and intent.

Tutorials out there will differ (it seems each field does its own thing).




# ...Okay, how about some examples?

## TFIDF with sklearn

In [135]:
import numpy, sklearn, sklearn.feature_extraction
import spacy
import wetsuite.helpers.spacy

nl = spacy.load('nl_core_news_lg')

In [171]:

class TfIdf:
    " This is mostly existing tfidf code, wrapped in a class to make it slightly easier to use"
    def __init__(self):
        self.docs = []
        self.docnames = []
        self.doci = 1

    def add_document(self, doc_as_term_sequence, docname=None):
        ' takes a document as a string '
        if docname is None:
            docname = 'doc%d'%self.doci
            self.doci += 1
        self.docnames.append( docname )
        self.docs.append( doc_as_term_sequence )

    def calc(self, **kwargs):
        """ mostly just calls sklearn's TfidfVectorizer.
            
            - min_df: ignore terms that appear in fewer than this amount of documents   (default is 2, rather than sklearn's 1)
            - max_df: ignore terms that appear in more than this amount of documents    (default is 1.0, meaning no upper limit)
              For both, an int is a number of documents and a float is a proportion of documents.
        """
        min_df = kwargs.pop('min_df',2)
        max_df = kwargs.pop('max_df',1.0)
        ngram_range = kwargs.pop('ngram_range',(1,1))
        sublinear_tf = kwargs.pop('sublinear_tf',False)
        if len(kwargs) > 0:
            print("Unrecognized kwargs: %s"%', '.join(kwargs.keys()))

        # see also https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn-feature-extraction-text-tfidfvectorizer
        self.vectorizer      = sklearn.feature_extraction.text.TfidfVectorizer( 
            min_df=min_df,   # terms that appear in only one document are assumed to be misspelled or otherwise unimportant
            max_df=max_df,
            ngram_range=ngram_range,
            #max_features=100, 
            stop_words=None,
            sublinear_tf=sublinear_tf,
            use_idf=True, # (true by default)
        )
        # the fit part learns the vocabulary and idf,  the transform part turns documents into document-term matrices
        self.doc_term_matrix = self.vectorizer.fit_transform( raw_documents=self.docs )
        #print(self.doc_term_matrix)

    def vocab_idf(self):
        ' returns (term, internalfeatureindex, IDF) for the entire vocabulary'
        ret = []
        for term, feati in self.vectorizer.vocabulary_.items():
            ret.append( (term, feati, self.vectorizer.idf_[feati]) )
        ret.sort(key=lambda x:x[2])
        return ret


class ShowTfIdf:
    ' ipython style notebook pretty-printer for TfIdf object'
    def __init__(self, ob):
        self.ob = ob
    
    def _repr_html_(self):
        ret = []
        if not hasattr(self.ob, 'vectorizer'):
            ret.append('<p>not calc()ulated yet</p>')
        else:
            #M = self.ob.doc_term_matrix.todense()
            M = self.ob.doc_term_matrix
            nM = M/numpy.amax(M)
            ret.append('<table>')
            try:
                names = self.ob.vectorizer.get_feature_names()  # older sklearn
            except AttributeError:
                names = self.ob.vectorizer.get_feature_names_out() 
            
            ret.append('<tr><th></th>')
            for name in names:
                ret.append('<th>%s</th>'%name) # TODO: escape
            ret.append('</tr>')

            num_docs, num_terms = nM.shape
            for doci in range( num_docs ):
                ret.append('<tr>')
                ret.append('<th>%s</th>'%self.ob.docnames[doci])
                for termi in range(num_terms):
                    p = nM[doci,termi]
                    ret.append('<td style="opacity:%.1f">%.3f</td>'%(p,p))
                ret.append('</tr>')
            
            ret.append('</table>')
        return ''.join(ret)

### Tiny example
Let's have a tiny example we can grasp, and that wouldn't be spammy if we printed it

In [172]:
docs = [
    'I am the duck what goes quack',
    'I am a computer program',
    'I am a member of The The',
    'I am the fox',
    'I am the fox in a longer sentence',
]

tfidf = TfIdf()
for doc in docs:
    tfidf.add_document( doc )

tfidf.calc(min_df=1, max_df=1.0) 
ShowTfIdf( tfidf )

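| | am | computer | duck | fox | goes | in | longer | member | of | program | quack | sentence | the | what |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| doc1 | 0.303 | 0.0 | 0.636 | 0.0 | 0.636 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.636 | 0.0 | 0.358 | 0.636 |
| doc2 | 0.433 | 0.908 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.908 | 0.0 | 0.0 | 0.0 | 0.0 |
| doc3 | 0.345 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.725 | 0.725 | 0.0 | 0.0 | 0.0 | 0.817 | 0.0 |
| doc4 | 0.591 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.698 | 0.0 |
| doc5 | 0.315 | 0.0 | 0.0 | 0.534 | 0.0 | 0.662 | 0.662 | 0.0 | 0.0 | 0.0 | 0.0 | 0.662 | 0.373 | 0.0 |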
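
These numbers are not the textbook `tf * idf`. As a sketch of where they come from, the plain-Python code below reproduces the doc4 row by hand. It assumes scikit-learn's documented defaults for `TfidfVectorizer`: raw counts as tf, a smoothed idf of `ln((1 + n) / (1 + df)) + 1`, each row scaled to unit L2 length, and tokens of two or more word characters. The last step mirrors `ShowTfIdf`, which divides everything by the largest value in the matrix. The helper name `sklearn_style_row` is made up.

```python
import math
import re
from collections import Counter

tiny_docs = [
    'I am the duck what goes quack',
    'I am a computer program',
    'I am a member of The The',
    'I am the fox',
    'I am the fox in a longer sentence',
]

# Lowercase, and keep only tokens of 2+ word characters (so 'I' and 'a' fall away)
tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in tiny_docs]
n_docs = len(tokenized)
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def sklearn_style_row(tokens):
    """Raw count times smoothed idf, then scaled to unit (L2) length."""
    counts = Counter(tokens)
    raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

rows = [sklearn_style_row(tokens) for tokens in tokenized]

# ShowTfIdf additionally divides by the largest value anywhere in the matrix
largest = max(w for row in rows for w in row.values())
doc4 = {t: round(w / largest, 3) for t, w in rows[3].items()}
print(doc4)   # {'am': 0.591, 'the': 0.698, 'fox': 1.0}
```

Because of the `+ 1`, a term that appears in every document (like 'am') is downweighted but never zeroed, unlike with the plain `log(N / df)` form.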


Note: 
- 'am' is downscored due to being in every document. ('I' would be too if it weren't removed due to length)
- 'the' in doc3 is an example of how a term being more common within a document scores it up
- min_df and max_df can be used to filter out things that appear in fewer than min_df documents, and/or in more than max_df documents 
- a max_df that is <1.0 but still relatively high can be used to remove what you'd call stopwords - e.g. 0.7 would remove 'am' and 'the'
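
The last point can be checked directly. This is a minimal sketch using sklearn's `TfidfVectorizer` on the same five sentences, comparing the learned vocabulary with and without a `max_df` of 0.7; the variable names are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tiny_docs = [
    'I am the duck what goes quack',
    'I am a computer program',
    'I am a member of The The',
    'I am the fox',
    'I am the fox in a longer sentence',
]

# Keep everything (max_df=1.0 removes nothing)...
all_terms = set(TfidfVectorizer(min_df=1, max_df=1.0).fit(tiny_docs).vocabulary_)
# ...versus dropping terms found in more than 70% of the documents
kept_terms = set(TfidfVectorizer(min_df=1, max_df=0.7).fit(tiny_docs).vocabulary_)

dropped_terms = sorted(all_terms - kept_terms)
print(dropped_terms)   # ['am', 'the']
```

'am' is in 5 of 5 documents and 'the' in 4 of 5, both above the 70% threshold; nothing else appears in more than 2.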


However, we usually don't have terms that are that unique, or a table that is that sparse. An example with more words:

## Example in unstructured text

In [None]:
import wetsuite.datasets, random
rvs = wetsuite.datasets.load('rvsadviezen')

larger_rvs_advice = []
for key, item in rvs.data.items():
    body = '\n'.join( item['body'] )
    if len(body)>500: # there's a few short ones we ignore
        larger_rvs_advice.append( (item['meta'].get('Kenmerk',None),body) )

selection = random.sample( larger_rvs_advice, 50 )

In [156]:
rvs_tfidf = TfIdf()
for kenmerk, body in selection:
    rvs_tfidf.add_document( body, docname=kenmerk )

In [None]:
# words that seem characteristic of the documents - terms that are neither rare nor everywhere - i.e. those that show up in a middling amount of documents
rvs_tfidf.calc(min_df=20, max_df=45, sublinear_tf=True)  # using actual numbers out of the 50 speaks to the imagination more 
                                      # min_df=0.4, max_df=0.9 would be equivalent in this case, and work for other amounts of documents
ShowTfIdf( rvs_tfidf )

In [None]:
# things that show up more than once (probably not a typo) but still only in a few documents
#  this is less useful,  somewhat because of the long tail in a Zipfian distribution
rvs_tfidf.calc(min_df=2, max_df=5 ) 
ShowTfIdf( rvs_tfidf )

In [None]:
# things that are very common - will generally be what you'd call stopwords,   
#   ...or, say, anything that is standard in a header because that's on every page. Or that otherwise does not distinguish much.
rvs_tfidf.calc(min_df=40, sublinear_tf=True)
#rvs_tfidf.calc(min_df=0.3, max_df=0.6)
ShowTfIdf( rvs_tfidf )

Notes:
- a max_df that is <1.0 and smaller can be used to focus just on the more interesting terms
- a min_df that is a low fraction could be used to remove rare terms
- 'things that show up in at most a few documents' is actually a mix of muck and interesting but less-common topics. So this is not that useful unless you can tell those apart.


Returning to "seen moderately much": this can be more interesting around n-grams, 
because it amounts to "which words appear together" and becomes a crude form of collocation analysis,
though due to the context of legal text, this finds more standard phrasing than it finds interesting terms.
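
To show the idea without sklearn, here is a minimal sketch that counts in how many documents each trigram occurs and keeps the ones in a middling range. The function name and the toy Dutch phrases are made up (they are not taken from the dataset), and the thresholds are arbitrary.

```python
from collections import Counter

def ngram_document_frequency(tokenized_docs, n=3):
    """Count in how many documents each n-gram (as a tuple of tokens) occurs."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        # a set, so that repeats within one document count only once
        ngrams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        doc_freq.update(ngrams)
    return doc_freq

# Made-up stand-ins for recurring legal phrasing
toy_docs = [
    "het advies van de raad van state".split(),
    "het advies van de afdeling".split(),
    "de raad van state adviseert".split(),
    "een geheel ander document".split(),
]
trigram_df = ngram_document_frequency(toy_docs)

# keep trigrams seen in at least 2 and at most 3 of the 4 documents
recurring = sorted(ng for ng, count in trigram_df.items() if 2 <= count <= 3)
for ngram in recurring:
    print(' '.join(ngram))
```

What survives is exactly the overlapping boilerplate ("het advies van", "raad van state"), which is the standard-phrasing effect described above.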

In [170]:
rvs_tfidf.calc(min_df=15, max_df=30, ngram_range=(3,3), sublinear_tf=True)  # using actual numbers out of the 50 speaks to the imagination more 
                                      # min_df=0.3, max_df=0.6 would be equivalent in this case, and work for other amounts of documents
ShowTfIdf( rvs_tfidf )


Unnamed: 0,26 vijfde lid,achterwege gebleven artikel,advies conform is,advisering van de,afdeling advisering van,artikel 26 vijfde,bij de afdeling,conform is achterwege,de afdeling advisering,de vice president,dit advies conform,gebleven artikel 26,het advies van,in de staatscourant,is achterwege gebleven,op het advies,openbaarmaking van dit,overweging aanhangig gemaakt,president van de,reactie op het,state ter overweging,ter overweging aanhangig,tot wijziging van,van de raad,van dit advies,van state ter,van wet tot,vice president van,vierde lid van,vijfde lid van,voorstel van wet
W03.15.0357/II,0.294,0.294,0.294,0.0,0.0,0.294,0.0,0.294,0.0,0.0,0.294,0.294,0.0,0.28,0.294,0.0,0.294,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.287,0.0,0.0,0.0,0.0,0.245,0.0
W05.22.0038/I,0.0,0.0,0.0,0.364,0.364,0.0,0.153,0.0,0.364,0.153,0.0,0.0,0.32,0.0,0.0,0.153,0.0,0.153,0.153,0.153,0.157,0.153,0.153,0.398,0.0,0.157,0.153,0.153,0.0,0.111,0.216
W10.08.0069/III,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
W09.07.0405/IV,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
W02.21.0005/II/K,0.0,0.0,0.0,0.394,0.394,0.0,0.188,0.0,0.394,0.188,0.0,0.0,0.188,0.0,0.0,0.188,0.0,0.188,0.188,0.188,0.0,0.188,0.0,0.447,0.0,0.0,0.0,0.188,0.0,0.137,0.0
W09.00.0484/V,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
W03.14.0477/II,0.279,0.279,0.279,0.0,0.0,0.279,0.0,0.279,0.0,0.0,0.279,0.279,0.0,0.266,0.279,0.0,0.279,0.0,0.0,0.0,0.0,0.0,0.318,0.0,0.272,0.0,0.0,0.0,0.0,0.232,0.0
W05.20.0371/I,0.0,0.0,0.0,0.383,0.383,0.0,0.16,0.0,0.383,0.16,0.0,0.0,0.272,0.0,0.0,0.16,0.0,0.16,0.16,0.16,0.165,0.16,0.0,0.418,0.0,0.165,0.0,0.16,0.0,0.0,0.227
W05.14.0263/I,0.279,0.279,0.279,0.0,0.0,0.279,0.0,0.279,0.0,0.0,0.279,0.279,0.0,0.266,0.279,0.0,0.279,0.0,0.0,0.0,0.0,0.0,0.318,0.0,0.272,0.0,0.0,0.0,0.0,0.232,0.0
W06.00.0275/IV,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


### Look at just IDF

In [134]:
rvs_tfidf.calc(min_df=2, max_df=5 ) 

for term, _, idf in rvs_tfidf.vocab_idf():
    print( '%.3f   %s'%( idf, term) )

# Note that 
# - IDF is lower for emptier terms like  am, the
# - this score isn't affected by how often the term appears in a single document or overall, 
#   just by how many documents it appears in

3.140   nadere
3.140   stb
3.140   cultuur
3.140   wetenschap
3.140   onderzoek
3.140   2000
3.140   wijzigingen
3.140   punten
3.140   vervalt
3.140   gaan
3.140   basis
3.140   belang
3.140   onderdeel
3.140   bepalen
3.140   was
3.140   hoe
3.140   verder
3.140   bijvoorbeeld
3.140   voorgaande
3.140   passen
3.140   weg
3.140   15
3.140   18
3.140   werking
3.140   19
3.140   leiden
3.140   merkt
3.140   juist
3.140   24
3.140   recht
3.140   hiervoor
3.140   buitenlandse
3.140   daarom
3.140   voorgestelde
3.140   voordat
3.140   gesteld
3.140   sociale
3.140   nieuwe
3.140   aangepast
3.140   vanwege
3.140   af
3.140   regelt
3.140   tijd
3.140   wanneer
3.140   terecht
3.140   via
3.140   sprake
3.140   verbeteringen
3.140   aanwijzing
3.140   november
3.140   eu
3.140   europees
3.140   lopen
3.140   staatssecretaris
3.140   koninkrijksrelaties
3.140   algemeen
3.140   begrotingsstaat
3.140   2018
3.140   twee
3.140   besluiten
3.140   daarna
3.140   toegevoegd
3.140   staan
3.

## Example in more structured data

Let's look at the same in some real-world data.

In [None]:
def simple_tokenize(text: str):
    ' split string into words - in a very dumb way but slightly better than just split() '
    import re
    l = re.split('[\\s!@#$%^&*":;/,?\xab\xbb\u2018\u2019\u201a\u201b\u201c\u201d\u201e\u201f\u2039\u203a\u2358\u275b\u275c\u275d\u275e\u275f\u2760\u276e\u276f\u2e42\u301d\u301e\u301f\uff02\U0001f676\U0001f677\U0001f678-]+', text)
    return list(e.strip("'")   for e in l  if len(e)>0)            


import pandas # just for presentation
df = pandas.DataFrame()

from sklearn.datasets import fetch_20newsgroups  # is easy enough to download and use
for category in ('alt.atheism', 'talk.religion.misc', 'sci.space','comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'rec.sport.baseball', 'rec.sport.hockey'):
    #print( "== %s =="%category )
    tfidf = TfIdf()
    for doc in fetch_20newsgroups(subset='train', categories=[category]).data:
        headers, body = doc.split('\n\n',1) # take off newsgroupy header
        tfidf.add_document( body )

    tfidf.calc()
    ct = []
    ii = []
    for term, _, idf in tfidf.vocab_idf()[:50]: 
        ct.append( term )
        #print( '%.3f   %s'%( idf, term) )
        ii.append( round(idf, 2) )

    df[ category ] = ct
    df[ category+'_idf' ] = ii


df


| | alt.atheism | alt.atheism_idf | talk.religion.misc | talk.religion.misc_idf | sci.space | sci.space_idf | comp.sys.ibm.pc.hardware | comp.sys.ibm.pc.hardware_idf | comp.sys.mac.hardware | comp.sys.mac.hardware_idf | rec.sport.baseball | rec.sport.baseball_idf | rec.sport.hockey | rec.sport.hockey_idf |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | __o | 5.23 | herrings | 5.14 | ocean | 5.19 | 3m | 5.77 | letter | 5.75 | response | 5.45 | pavel | 5.32 |
| 1 | tao | 5.23 | program | 5.14 | summer | 5.19 | holes | 5.77 | refreshing | 5.75 | logic | 5.45 | outlaws | 5.32 |
| 2 | bsa | 5.23 | door | 5.14 | detail | 5.19 | mutual | 5.77 | 210 | 5.75 | 99 | 5.45 | darryl | 5.32 |
| 3 | responsibility | 5.23 | population | 5.14 | 68 | 5.19 | sierra | 5.77 | starting | 5.75 | butler | 5.45 | kill | 5.32 |
| 4 | actual | 5.23 | internal | 5.14 | polar | 5.19 | ignore | 5.77 | originally | 5.75 | magowan | 5.45 | ratings | 5.32 |
| 5 | leger | 5.23 | offs | 5.14 | observe | 5.19 | ritvax | 5.77 | electronic | 5.75 | giant | 5.45 | status | 5.32 |
| 6 | nott | 5.23 | sake | 5.14 | galaxies | 5.19 | programmer | 5.77 | spend | 5.75 | furthermore | 5.45 | upon | 5.32 |
| 7 | mips | 5.23 | briefly | 5.14 | fire | 5.19 | legal | 5.77 | fixes | 5.75 | fairly | 5.45 | prospects | 5.32 |
| 8 | wainwright | 5.23 | popularity | 5.14 | comm | 5.19 | thrown | 5.77 | frank | 5.75 | bratt | 5.45 | feed | 5.32 |
| 9 | despite | 5.23 | orthodoxy | 5.14 | dept | 5.19 | relationship | 5.77 | process | 5.75 | rank | 5.45 | hill | 5.32 |


So what does that give us?

Looks like an automatic stopword generator. That doesn't need to be aware of the language - it's not considering the text at all.

...and not a whole lot more. Once you get beyond 100, 200 it's a mix of everything. A thousand or so in, it still looks that way.

And it's not about choosing a cutoff point, because "we don't think about the word, just about in how many documents it appears at all"
is just too crude to (in an admittedly not very large dataset) tell apart that, say, 'yes' and 'christian' have different value. Or 'powerbook' and 'hi'.

And (if you were asking "maybe it's not linear but still fairly ordinal?"), this mix also means it's too crude even for "this term is more specific, without telling you how much".

In [None]:
## TFIDF with gensim

In [None]:

import pprint 
import gensim 
import gensim.utils 
import gensim.models

# gensim has a specific approach to processing documents.
#   Its Dictionary is intended to 'map each new string you get to an identifying number, so that models can work purely with numbers',
#   and later to look strings up in the dictionary you made for the text you're working on.
dic = gensim.corpora.Dictionary()

# for example, dictionary.doc2bow (bag of words) returns (an unordered set of) 'this word (by id) got used this many times'
# so if we do this:
corpus_as_id_counts = []
for text in docs:
    pp = gensim.utils.simple_preprocess(text) # split, lowercase, remove short strings
    #print( pp )
    #pp = simple_preprocess(doc)
    doc_id_counts = dic.doc2bow(pp, allow_update=True)  # we're currently creating, not looking up
    #print( doc_id_counts )
    corpus_as_id_counts.append( doc_id_counts ) 
# ...then our 'corpus' consists of documents that are each just a list of id,frequency


print( 'Dictionary as (id,term) pairs:', list(dic.items()) )
print( 'Corpus as (id,count) pairs:')
pprint.pprint(corpus_as_id_counts)

print( 'Pointing out that unseen terms will fall away')
print( dic.doc2bow('i am unseen words'.split(), allow_update=False) ) # 1 count of 'am'

# https://www.tutorialspoint.com/gensim/gensim_creating_tf_idf_matrix.htm
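
If the id bookkeeping feels opaque, the idea can be sketched in a few lines of plain Python. `TinyDictionary` below is a made-up toy stand-in for the concept, not gensim's actual implementation (which, among other things, may hand out ids in a different order).

```python
from collections import Counter

class TinyDictionary:
    """Toy stand-in for the idea behind gensim's Dictionary; not its real implementation."""
    def __init__(self):
        self.token2id = {}

    def doc2bow(self, tokens, allow_update=False):
        counts = Counter(tokens)
        if allow_update:
            for token in counts:
                # hand out the next free id to each token we have not seen before
                self.token2id.setdefault(token, len(self.token2id))
        # tokens without an id are silently dropped
        return sorted((self.token2id[t], c) for t, c in counts.items() if t in self.token2id)

toy_dic = TinyDictionary()
first = toy_dic.doc2bow("am the fox".split(), allow_update=True)   # creating ids
second = toy_dic.doc2bow("am unseen words".split())                # only looking up
print(first)
print(second)
```

The second call only returns a count for 'am', because the other two tokens never got an id.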



In [None]:
# most of the above is not interesting, just showing our work. 
#  If you don't care about keeping those IDs yourself, you can write that in less code:
dupcorp = gensim.corpora.Dictionary()
dupcorp.add_documents( list( gensim.utils.simple_preprocess(text)  for text in docs ) )
#print( list(dic.items()) )
#print( list(dupcorp.items()) )



In [None]:
tfidf = gensim.models.TfidfModel(corpus_as_id_counts, smartirs='ntc')

# because the tfidf model mostly represents a matrix in which rows are documents, you can iterate over it per document
for doc in tfidf[corpus_as_id_counts]:
    print()
    for term_id, weight in doc:   # each entry is (term id, tf-idf weight)
        print('id=%-3s  %-20r  %s'%( term_id, dic[term_id], round(weight, 3)) )
#print( dic['i'] )
#for id in dic'I like the foo and the bar''
#for doc in tfidf[BoW_corpus]:
#    print( doc) 


# You could e.g. also take text and look up 

#BoW_corpus = [ for doc in doc_tokenized]
