# Term Frequency - Inverse Document Frequency
- this notebook shows how ```tf-idf``` helps to find keywords and extract information from a text

In [1]:
from dataset_loader import Dataset
from document_preprocessor import DocumentPreprocessor
from document_vectors import StatsKeeper
import random

# Load the dataset
- the [dataset](http://archives.textfiles.com/stories.zip) contains many novels and stories written by different authors; every story is written in English, although many are saved in unsupported encodings
- the dataset serves as a corpus to show the usefulness of ```tf-idf```: its ability to extract the keywords that set distinct stories apart
- because of encoding problems, about 6% of the files (30 of 467) were not included in the dataset
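The ```Dataset``` class itself is not shown in this notebook, so the following is only a sketch of how files with undecodable bytes could be skipped and counted. The function name ```decode_stories```, the input shape (a mapping from path to raw bytes) and the choice of UTF-8 as the only accepted encoding are assumptions made for illustration.

```python
def decode_stories(raw_files):
    """Decode raw byte strings, skipping those that are not valid UTF-8.

    `raw_files` maps a path to the file's bytes (hypothetical input shape).
    Returns the decoded texts and the number of skipped files.
    """
    texts, skipped = {}, 0
    for path, raw in raw_files.items():
        try:
            texts[path] = raw.decode("utf-8")
        except UnicodeDecodeError:
            skipped += 1  # count the file as "incorrect" and move on
    return texts, skipped


# Toy input: one valid file and one with a byte that is invalid in UTF-8.
sample = {"ok.txt": b"Once upon a time", "bad.txt": b"caf\xe9 story"}
texts, skipped = decode_stories(sample)
print("correct : {}\tincorrect : {}".format(len(texts), skipped))
```

Skipping undecodable files keeps the corpus clean at the cost of losing some stories; trying further encodings before giving up would be an alternative.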

In [2]:
dataset = Dataset()

Statistics of loading : correct : 437	incorrect : 30


# Load the document preprocessor
- ```DocumentPreprocessor``` performs the usual text transformations in order to remove the most common and uninteresting words
- to show the importance of preprocessing we create 2 preprocessors: one that uses all the transformations and one that uses only lower-casing
- the transformations have descriptive names, so it is easy to see which ones are used
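To make the difference between the two configurations concrete, here is a minimal standard-library sketch of a few of these transformations. It is not the ```DocumentPreprocessor``` implementation: the function ```preprocess```, its ```full``` flag and the tiny stop-word list are hypothetical, tokens are split on whitespace only, and stemming and number conversion are left out.

```python
import string

# Tiny illustrative stop-word list; a real preprocessor would use a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "it", "for", "i", "was"}


def preprocess(text, full=True):
    """Lower-case the text and, if `full` is set, also strip punctuation,
    single characters and stop words."""
    text = text.lower()
    if not full:
        return text.split()
    # Replace every punctuation character with a space before splitting.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    tokens = text.translate(table).split()
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]


sentence = "I got her a mug of beer, and watched her drink."
print(preprocess(sentence, full=False))  # lower-casing only
print(preprocess(sentence, full=True))   # all transformations of this sketch
```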

In [3]:
doc_preproc_all = DocumentPreprocessor(remove_apostrophes=True,
                                         remove_punctuation=True,
                                         remove_single_characters=True,
                                         remove_stop_words=True,
                                         stemming=True,
                                         number_converting=True,
                                         lower_case=True)

doc_not_preproc = DocumentPreprocessor(lower_case=True)

# Load the ```StatsKeeper```
- the ```StatsKeeper``` is an object that holds all the important statistics about the whole corpus of documents
- the important statistics include:
    - ```tf-idfs``` of all words in each document
    - ```tfs``` of all words in each document
    - ```idfs``` of each word in the corpus of all documents
- we create 2 ```StatsKeepers```, one for the preprocessed and one for the non-preprocessed data
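The corpus-level counts behind such statistics can be sketched with a ```Counter```. The helper name ```document_frequencies``` and the toy corpus are assumptions; the actual ```StatsKeeper``` implementation is not shown in this notebook.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Count, for every token, the number of documents it occurs in."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # set(): each document counts a token once
    return df


docs = [
    ["bar", "beer", "bar"],
    ["beer", "cat"],
    ["cat", "bar", "town"],
]
df = document_frequencies(docs)
# Tokens ordered by how many documents contain them, most widespread first.
print(df.most_common())
```

Counting each token once per document is what separates document frequency from the raw term count inside a single document.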

In [4]:
statsKeeper_all = StatsKeeper()
statsKeeper_not = StatsKeeper()

# Fill ```StatsKeeper``` and ```DocumentPreprocessor``` with data from the dataset

In [5]:
for path, (title, text) in dataset.texts.items():
    # fully preprocessed variant of the document
    preprocessed = doc_preproc_all.preprocess_document(
        path, text, title)
    statsKeeper_all.load_document(
        title, preprocessed.title, preprocessed.text)

    # lower-cased-only variant of the same document
    preprocessed = doc_not_preproc.preprocess_document(
        path, text, title)
    statsKeeper_not.load_document(
        title, preprocessed.title, preprocessed.text)

In [6]:
statsKeeper_all.compile()
statsKeeper_not.compile()

# The most common words in both datasets
- the number next to each word is the number of documents in which it occurs
- as we can see, preprocessing got rid of articles, conjunctions, prepositions and punctuation, which naturally occur in almost every document but don't carry any information value

## With preprocessing:

In [7]:
for rank, (token, frequency) in enumerate(statsKeeper_all._sorted_wdf[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. one : 414
2. time : 379
3. like : 356
4. would : 348
5. could : 347
6. look : 336
7. go : 330
8. see : 330
9. said : 327
10. back : 325


## Without preprocessing:

In [8]:
for rank, (token, frequency) in enumerate(statsKeeper_not._sorted_wdf[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. a : 437
2. the : 437
3. , : 436
4. and : 435
5. . : 434
6. to : 433
7. of : 431
8. in : 431
9. it : 422
10. for : 421


# Words providing the most information
- according to [Wikipedia](https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Inverse_document_frequency_2) the inverse document frequency is a measure of how much information a word provides

- we can see that preprocessing helped to keep numbers and compound words from sneaking into the list of the most valuable words (information-wise)

- also, all the listed words appear in exactly 1 document, so they share the same, highest possible ```idf```; the printed value is reproduced with a document count of 436, one less than the 437 loaded stories:

$\ln\left(\frac{436}{1+1}\right) + 1 \approx 6.38$
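The relationship between document frequency and ```idf``` can be checked with a few lines of Python. This is a minimal sketch that assumes the natural-log form of the equation above, $\ln\left(\frac{N}{1+df}\right) + 1$; the function name ```idf``` and its parameters are illustrative and not taken from the ```StatsKeeper``` code.

```python
import math


def idf(n_docs, doc_freq):
    """Inverse document frequency in the form ln(N / (1 + df)) + 1."""
    return math.log(n_docs / (1 + doc_freq)) + 1


n_docs = 436  # document count consistent with the idf values printed in this notebook
print(idf(n_docs, 1))    # a word found in a single document
print(idf(n_docs, 435))  # a word found in almost every document
```

The rarer a word is across the corpus, the higher its ```idf```; a word found in nearly every document ends up with a value close to 1.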

## With preprocessing:

In [9]:
for rank, (token, frequency) in enumerate(statsKeeper_all._sorted_idfs[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. freewar : 6.384495062789089
2. walley : 6.384495062789089
3. quackeri : 6.384495062789089
4. fiftieth : 6.384495062789089
5. pipelin : 6.384495062789089
6. businesss : 6.384495062789089
7. bushpilot : 6.384495062789089
8. traplin : 6.384495062789089
9. whiteout : 6.384495062789089
10. preheat : 6.384495062789089


## Without preprocessing:

In [10]:
for rank, (token, frequency) in enumerate(statsKeeper_not._sorted_idfs[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. freeware : 6.384495062789089
2. 53. : 6.384495062789089
3. rigors : 6.384495062789089
4. walleye : 6.384495062789089
5. quackery : 6.384495062789089
6. southerner : 6.384495062789089
7. fiftieth : 6.384495062789089
8. pipelines : 6.384495062789089
9. drivein : 6.384495062789089
10. businesss : 6.384495062789089


# The keywords of an example document
- the ```StatsKeeper``` keeps the statistics for each individual document as well as for the whole dataset, which is why we could list the most common words above
- next, a random document is selected from the corpus to show how keyword extraction with ```tf-idf``` works

- the exact formula for calculating ```tf```: $tf = \frac{\text{term count} \cdot \alpha}{\text{number of words in the document}}$, where $\alpha$ is a weight that depends on whether the word is present in the title, in the body or in both

- the exact formula for calculating ```tf-idf```: $tfidf[i] = tf[i] \cdot idf[i]$
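The two formulas can be sketched as follows. The notebook does not state the values of $\alpha$, nor whether title tokens count towards the document length, so the three weights and the choice to include the title in the total are hypothetical, as are the function names and the toy ```idf``` values.

```python
from collections import Counter

# Hypothetical weights: the notebook does not state the actual values of alpha.
ALPHA_BODY, ALPHA_TITLE, ALPHA_BOTH = 1.0, 1.5, 2.0


def weighted_tf(title_tokens, body_tokens):
    """tf = count * alpha / number of words in the document."""
    counts = Counter(title_tokens) + Counter(body_tokens)
    total = sum(counts.values())  # assumption: title tokens count as words too
    tf = {}
    for token, count in counts.items():
        in_title = token in title_tokens
        in_body = token in body_tokens
        if in_title and in_body:
            alpha = ALPHA_BOTH
        elif in_title:
            alpha = ALPHA_TITLE
        else:
            alpha = ALPHA_BODY
        tf[token] = count * alpha / total
    return tf


def top_keywords(tf, idf, k=3):
    """Rank tokens by tf-idf = tf * idf, highest first."""
    scores = {token: value * idf[token] for token, value in tf.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


tf = weighted_tf(["dream", "girl"], ["girl", "bar", "bar", "one", "one"])
# Toy idf values: "one" is common in the corpus, "bar" and "dream" are rarer.
idf = {"dream": 4.0, "girl": 3.0, "bar": 4.0, "one": 1.0}
for rank, (token, score) in enumerate(top_keywords(tf, idf), start=1):
    print("{}. {} : {}".format(rank, token, score))
```

In this toy example "one" is as frequent in the body as "bar", but its low ```idf``` keeps it out of the top 3.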

In [11]:
# pick random document
_, (title, text) = random.choice(list(dataset.texts.items()))

In [12]:
print("TITLE : {}".format(title))

TITLE : Dream Girl by Melina Huddy


In [13]:
print("\nTEXT : {}".format(text))


TEXT : DREAM GIRL
  by Melina Huddy

  She reminded me of someone that I'd known once, but I couldn't 
recall who. She came into the bar that first Saturday wearing red 
slacks and white high heels, looking good and smelling even better.

  "Draft."  Her voice was top shelf bourbon, deep amber, smooth and 
mellow. She sounded like someone used to being listened to.

  I got her a mug of beer and watched her drink while I tended bar. 
Saturday afternoons are pretty busy around here; folks stop in and 
have a couple, then go on about their shopping or whatever. 

  Dad bought this place from Jim Parker about forty years ago and 
let Mom name it and do the decorating. It's still called Kitty Korner 
and there are ceramic cats everywhere. I never have liked it, but who 
am I to change what's become a town institution?

  After Dad died and Mom went to the nursing home, I bought my sister 
out and Kitty Korner's all mine now. There used to be a laundromat next 
door, but I bought that, too

## Most frequent tokens in the document
- for bare term frequency we see that articles (the, a, an) and pronouns (i, you) are the most frequent, although they don't carry any information value

- preprocessing helps with getting rid of these words

### With preprocessing:

In [14]:
for rank, (token, frequency) in enumerate(statsKeeper_all.documents[title].sorted_term_frequencies[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. one : 0.014820592823712949
2. like : 0.014040561622464899
3. jack : 0.0124804992199688
4. virginia : 0.0109204368174727
5. got : 0.0093603744149766
6. know : 0.0093603744149766
7. ed : 0.0093603744149766
8. two : 0.00858034321372855
9. get : 0.00858034321372855
10. bar : 0.0070202808112324495


### Without preprocessing:

In [15]:
for rank, (token, frequency) in enumerate(statsKeeper_not.documents[title].sorted_term_frequencies[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. . : 0.07768800497203232
2. , : 0.056867619639527654
3. `` : 0.039776258545680544
4. the : 0.02952144188937228
5. i : 0.027035425730267248
6. and : 0.024549409571162212
7. a : 0.01709136109384711
8. to : 0.016469857054070853
9. 's : 0.015226848974518334
10. in : 0.014916096954630205


## Words with highest ```tf-idf``` in the document
- we see that many stop words, articles, pronouns and punctuation marks have such a high term frequency that they still sneak into the top 10, despite their low ```idf```

- hence preprocessing is really important even when using normalized ```tf-idf```

### With preprocessing

In [16]:
for rank, (token, frequency) in enumerate(statsKeeper_all.documents[title].sorted_tf_idfs[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. virginia : 0.05458253496362619
2. jack : 0.04432176872053508
3. korner : 0.039840842825516934
4. ed : 0.038688330085952516
5. parker : 0.03838832993622029
6. kitti : 0.031076002477070208
7. bud : 0.02686460855606398
8. saturday : 0.026047673728754325
9. beer : 0.022963196397978496
10. marti : 0.021326849964566824


### Without preprocessing

In [17]:
for rank, (token, frequency) in enumerate(statsKeeper_not.documents[title].sorted_tf_idfs[:10]):
    print("{}. {} : {}".format(rank + 1, token, frequency))

1. . : 0.07786639312153125
2. , : 0.05673733866699201
3. `` : 0.04390633214089837
4. the : 0.029386332079675535
5. i : 0.028962756459296225
6. and : 0.024549409571162212
7. virginia : 0.02174481349389956
8. -- : 0.020548153579432325
9. she : 0.01997826607627479
10. jack : 0.017512960506449324
