<a id="table_of_content"></a>

# Part 1: Bag-of-Words Exercises

In the following, we will convert our corpus of 10-k documents into bags-of-words, and then into document vectors using term frequency-inverse document frequency (tf-idf) re-weighting. Afterward, we will compute sentiment and similarity metrics. Since we will be reusing this notebook, the various sections are linked below:

1. <a href="#bag_of_words">Compute bag-of-words </a>: implement `bag_of_words`, which converts tokenized words into a dictionary of word counts

2. <a href="#sentiment">Sentiments </a>: using wordlists, compute positive and negative sentiments from bag-of-words. Implement `get_sentiment`

For solutions, see [bagofwords_solutions.py](./bagofwords_solutions.py). You can load the functions by simply calling 

`from bagofwords_solutions import *`

# Part 2: Document-Vector Exercises

3. <a href="#compute_idf">Compute idf </a>: computing the inverse document frequency, implement `get_idf`

4. <a href="#compute_tf">Compute tf </a>: computing the term frequency, implement `get_tf`

5. <a href="#doc_vector">Document vector </a>: using the functions `get_idf` and `get_tf`, compute a word_vector for each document in the corpus
6. <a href="#similarity">Similarities </a>: compare two vectors by computing the cosine and Jaccard similarity metrics. Implement `get_cos` and `get_jac`

For solutions, see [bagofwords_solutions.py](./bagofwords_solutions.py). You can load the functions by simply calling 

`from bagofwords_solutions import *`


## 0. Initialization

First, we read in our 10-k documents.

In [1]:
# get a list of all 10-ks in our directory
files=! ls *10k*.txt
print("10-k files: ",files)
files = [open(f,"r").read() for f in files]

10-k files:  ['apple_10k.txt', 'ebay_10k.txt', 'sears_10k.txt']


Here we define useful functions to tokenize the texts into words, remove stop-words, and lemmatize and stem the words.

In [2]:
import numpy as np

# for nice number printing
np.set_printoptions(precision=3, suppress=True)

# tokenize and clean the text
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from collections import Counter
from nltk.corpus import stopwords
from nltk import word_tokenize
from nltk.tokenize import RegexpTokenizer
# tokenize anything that is not a number and not a symbol
word_tokenizer = RegexpTokenizer(r'[^\d\W]+')

nltk.download('stopwords')
nltk.download('wordnet')


sno = SnowballStemmer('english')
wnl = WordNetLemmatizer()

# get our list of stop_words
nltk.download('stopwords')
stop_words = set(stopwords.words('english')) 
# add some extra stopwords
stop_words |= {"may", "business", "company", "could", "service", "result", "product", 
               "operation", "include", "law", "tax", "change", "financial", "require",
               "cost", "market", "also", "user", "plan", "actual", "cash", "other",
               "thereto", "thereof", "therefore"}

# useful function to print a dictionary sorted by value (largest first by default)
def print_sorted(d, ascending=False):
    factor = 1 if ascending else -1
    sorted_list = sorted(d.items(), key=lambda v: factor*v[1])
    for i, v in sorted_list:
        print("{}: {:.3f}".format(i, v))

# convert text into a list of cleaned tokens (lowercased, lemmatized, stemmed, stop-words removed)
def clean_text(txt):
    lemm_txt = [ wnl.lemmatize(wnl.lemmatize(w.lower(),'n'),'v') \
                for w in word_tokenizer.tokenize(txt) if \
                w.isalpha() and w not in stop_words ]
    return [ sno.stem(w) for w in lemm_txt if w not in stop_words and len(w) > 2 ]

corpus = [clean_text(f) for f in files]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
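
To see what the tokenizer pattern `[^\d\W]+` keeps and drops, here is a small sketch using only the standard-library `re` module. It assumes that `re.findall` with the same pattern yields the same tokens as the `RegexpTokenizer` above; the sample sentence is made up for illustration and is not taken from the 10-k files.

```python
import re

# same pattern as the tokenizer above: runs of characters that are
# neither digits (\d) nor non-word characters (\W)
pattern = re.compile(r'[^\d\W]+')

# hypothetical sample sentence, for illustration only
sample = "Revenue grew 12% in fiscal-year 2017, per the 10-K."

tokens = pattern.findall(sample)
print(tokens)
# ['Revenue', 'grew', 'in', 'fiscal', 'year', 'per', 'the', 'K']
```

Numbers and punctuation disappear, and hyphenated words are split in two. Very short leftovers such as `K` are then removed by the `len(w) > 2` filter in `clean_text`.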


<a id="bag_of_words"></a>

## 1. Bag-of-Words

Implement a function that converts a list of tokenized words into bag-of-words, i.e. a dictionary that outputs the frequency count of a word

**Note:** Python already provides the `collections.Counter` class for this, but I encourage you to implement your own function as an exercise.

<a href="#table_of_content">back to top</a>

In [3]:
from collections import defaultdict

def bag_of_words(words):
    # TO DO
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts
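
As a quick sanity check, the hand-written function can be compared against `collections.Counter` on a toy token list. This is only a sketch: the tokens below are made up, and `bag_of_words` is repeated so that the snippet runs on its own.

```python
from collections import Counter

def bag_of_words(words):
    """Count how many times each token occurs in a list of tokens."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# hypothetical toy tokens, not taken from the 10-k corpus
tokens = ["risk", "revenu", "risk", "growth", "revenu", "risk"]

bow = bag_of_words(tokens)
print(bow)

# Counter is a dict subclass, so the two results compare equal
assert bow == Counter(tokens)
```

Since `Counter` is a subclass of `dict`, either one can be used in the later exercises.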


<a id="sentiment"></a>

## 2. Sentiments
Count the fraction of words that appear in a wordlist, returning a sentiment score between 0 and 1:

$$
\textrm{score} = \frac{\textrm{Number of words in document matching wordlist}}{\textrm{Number of words in document}}
$$

Implement the score in a function `get_sentiment(words, wordlist)`, where `words` is a list of words. Feel free to use the previous `bag_of_words` function. 
(*For an extra challenge, try to write the function in one line.*)

Here, I've included a positive and negative wordlist that I constructed by hand. Due to copyright issues, we are not able to provide other commonly used wordlists. I encourage you to try out different wordlists on your own.

<a href="#table_of_content">back to top</a>

In [4]:
# load wordlist first
import pickle

with open('positive_words.pickle', 'rb') as f:
    positive_words = pickle.load(f)
    # also need to stem and lemmatize the text
    positive_words = set(clean_text(" ".join(positive_words)))
    
with open('negative_words.pickle', 'rb') as f:
    negative_words = pickle.load(f)
    negative_words = set(clean_text(" ".join(negative_words)))
    
# check out the list
# print("positive words: ", positive_words)
# print("negative words: ", negative_words)

In [7]:
def get_sentiment(txt, wordlist):
    # TO DO
    matching_words = [ w for w in txt if w in wordlist ]
    return len(matching_words)/len(txt)
#     counts = bag_of_words(txt)
#     matches = sum((w in wordlist) * counts[w] for w in counts)
#     return matches / sum(counts.values())
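
To make the score concrete, here is one possible one-line version applied to a toy example. The word list and the tiny positive and negative wordlists are made up for illustration; they are not the pickled wordlists loaded above.

```python
def get_sentiment(words, wordlist):
    """Fraction of tokens in `words` that also appear in `wordlist`."""
    return sum(w in wordlist for w in words) / len(words)

# hypothetical toy data, for illustration only
toy_words = ["strong", "growth", "loss", "store", "strong"]
toy_positive = {"strong", "growth"}
toy_negative = {"loss"}

pos = get_sentiment(toy_words, toy_positive)  # 3 of the 5 tokens match
neg = get_sentiment(toy_words, toy_negative)  # 1 of the 5 tokens match
print(pos, neg)
```

Repeated words count every time they occur, so `strong` contributes twice to the positive score. Note that an empty document would raise a `ZeroDivisionError`; this sketch does not guard against that.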

In [8]:
# test your function
positive_sentiments = np.array([ get_sentiment(c, positive_words) for c in corpus ])
print(positive_sentiments)

negative_sentiments = np.array([ get_sentiment(c, negative_words) for c in corpus ])
print(negative_sentiments)

[ 0.012  0.013  0.012]
[ 0.001  0.001  0.002]


**Before continuing to Part 2, go through the lesson material first!**

<a id="compute_idf"></a>

# Part 2: Document-Vector Exercises

## 3. Compute idf
Given a corpus, or a list of bag-of-words, we want to compute for each word $w$, the inverse-document-frequency, or ${\rm idf}(w)$. This can be done in a few steps:

1. Gather a set of all the words in all the bag-of-words (python set comes in handy, and the union operator `|` is useful here)


2. Loop over each word $w$, and compute ${\rm df}_w$, the number of documents where this word appears at least once. A dictionary is useful for keeping track of ${\rm df}_w$


3. After computing ${\rm df}_w$, we can compute ${\rm idf}(w)$. There are two common definitions. The simplest one is 
$${\rm idf}(w)=\frac{N}{{\rm df}_w}$$
where $N$ is the total number of documents in the corpus and ${\rm df}_w$ is the number of documents that contain the word $w$. Frequently, a logarithm is applied as well
$${\rm idf}(w)=\log\frac{N}{{\rm df}_w}$$
One issue with using the logarithm is that when ${\rm df}_w = N$, ${\rm idf}(w)=0$, so words common to all documents are ignored. If we don't want this behavior, we can define ${\rm idf}(w)=\log\left(1+N/{\rm df}_w\right)$ or ${\rm idf}(w)=1+\log\left(N/{\rm df}_w\right)$ instead. In this exercise, we will not use the extra +1 for ${\rm idf}$.

In the following, define a function called `get_idf(corpus, include_log=True)` that computes ${\rm idf}(w)$ for all the words in a corpus, where `corpus` for us is a processed list of bag-of-words (stemmed and lemmatized). The optional parameter `include_log` includes the logarithm in the computation.

<a href="#table_of_content">back to top</a>

In [19]:
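# compute idf
def get_idf(corpus, include_log=True):
    # TO DO
    N = len(corpus)
    freq = defaultdict(int)
    words = set()
    # reduce each document to its set of distinct words
    corpus = [set(c) for c in corpus]
    # vocabulary: union of the words across all documents
    for c in corpus:
        words |= c
        
    # document frequency: number of documents that contain each word
    for w in words:
        freq[w] = sum([ w in c for c in corpus])
    
    if include_log:
        return { w:np.log(N/freq[w]) for w in freq }
    else:
        return { w:N/freq[w] for w in freq }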

You should expect to see many idf values equal to 0! This is by design: we have ${\rm idf}(w)=\log\left(N/{\rm df}_w\right)$, and $N/{\rm df}_w=1$ for words that appear in every document.
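
With only $N=3$ documents, ${\rm df}_w$ can only be 1, 2 or 3, so the logarithmic idf can take just three values: $\log 3 \approx 1.099$, $\log 1.5 \approx 0.405$ and $\log 1 = 0$. The sketch below reproduces this on a made-up toy corpus (the tokens are illustrative, not the real 10-k contents) and follows the three steps described above.

```python
from math import log

# hypothetical toy corpus: three tiny "documents" of already-cleaned tokens
toy_corpus = [
    ["revenu", "risk", "store"],
    ["revenu", "risk", "auction"],
    ["revenu", "iphon"],
]

N = len(toy_corpus)
doc_sets = [set(doc) for doc in toy_corpus]

# step 1: vocabulary as the union of all document word sets
vocab = set().union(*doc_sets)

# step 2: df_w, the number of documents that contain w at least once
df = {w: sum(w in s for s in doc_sets) for w in vocab}

# step 3: idf(w) = log(N / df_w)
idf = {w: log(N / df[w]) for w in vocab}

for w in sorted(idf, key=idf.get):
    print("{}: df={} idf={:.3f}".format(w, df[w], idf[w]))
```

Here `revenu` appears in all three documents and gets an idf of 0, `risk` appears in two and gets 0.405, and the words unique to one document get 1.099.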

In [20]:
# test your code
idf=get_idf(corpus)
print_sorted(idf, ascending=True)

wsk: 0.000
wnd: 0.000
oyv: 0.000
fix: 0.000
standard: 0.000
atb: 0.000
wld: 0.000
tzu: 0.000
rka: 0.000
marker: 0.000
expos: 0.000
similar: 0.000
jwa: 0.000
aox: 0.000
evi: 0.000
term: 0.000
axz: 0.000
tpi: 0.000
methodolog: 0.000
jci: 0.000
wcb: 0.000
store: 0.000
dff: 0.000
mjv: 0.000
regular: 0.000
best: 0.000
moo: 0.000
vzs: 0.000
thf: 0.000
rmd: 0.000
gzv: 0.000
entitycurrentreportingstatus: 0.000
ofg: 0.000
nyn: 0.000
underli: 0.000
ejx: 0.000
muq: 0.000
percentitemtyp: 0.000
ffn: 0.000
ull: 0.000
cca: 0.000
market: 0.000
owr: 0.000
rnu: 0.000
disposit: 0.000
fine: 0.000
moot: 0.000
minor: 0.000
zfe: 0.000
influenc: 0.000
connect: 0.000
kmb: 0.000
pyu: 0.000
ced: 0.000
cci: 0.000
ttz: 0.000
sharesitemtyp: 0.000
inclus: 0.000
llc: 0.000
lawsuit: 0.000
akd: 0.000
benefici: 0.000
operatingleasesfutureminimumpaymentsdueintwoyear: 0.000
communic: 0.000
lsz: 0.000
among: 0.000
aac: 0.000
misstat: 0.000
program: 0.000
uyu: 0.000
lvo: 0.000
ufk: 0.000
highest: 0.000
boq: 0.000
tzc: 0.000
...
solicit: 0.000
vendor: 0.000
jjr: 0.000
phm: 0.000
qte: 0.000
wih: 0.000
wio: 0.000
entri: 0.000
fdt: 0.000
giw: 0.000
gvq: 0.000
unpaid: 0.000
tyq: 0.000
mxw: 0.000
mgc: 0.000
efc: 0.000
stand: 0.000
matter: 0.000
ap: 0.000
ruz: 0.000
head: 0.000
kvj: 0.000
deh: 0.000
believ: 0.000
weak: 0.000
hardwar: 0.000
eom: 0.000
elh: 0.000
nxs: 0.000
sbe: 0.000
owi: 0.000
indemnifi: 0.000
sac: 0.000
okg: 0.000
asv: 0.000
zlc: 0.000
lnx: 0.000
fbc: 0.000
xkq: 0.000
pal: 0.000
gmonthdayitemtyp: 0.000
perform: 0.000
tck: 0.000
vwx: 0.000
rwz: 0.000
wlg: 0.000
wii: 0.000
mdq: 0.000
azx: 0.000
jdk: 0.000
compris: 0.000
reconcili: 0.000
mha: 0.000
text: 0.000
tip: 0.000
scg: 0.000
hcq: 0.000
nonnum: 0.000
first: 0.000
subsidiari: 0.000
hcj: 0.000
kvq: 0.000
place: 0.000
xfn: 0.000
gge: 0.000
mzj: 0.000
ahm: 0.000
lft: 0.000
hwh: 0.000
srj: 0.000
udt: 0.000
lpw: 0.000
dic: 0.000
reform: 0.000
tco: 0.000
mel: 0.000
wja: 0.000
reflect: 0.000
knv: 0.000
mvc: 0.000
nlz: 0.000
mwo: 0.000
deem: 0.000
soy: 0.000
...
mgw: 0.000
mnl: 0.000
sqa: 0.000
closur: 0.000
reason: 0.000
mqk: 0.000
kix: 0.000
final: 0.000
akm: 0.000
ddbc: 0.000
acr: 0.000
lgx: 0.000
jpz: 0.000
qzd: 0.000
mdt: 0.000
euv: 0.000
deplet: 0.000
wmk: 0.000
orw: 0.000
rather: 0.000
open: 0.000
behavior: 0.000
yxw: 0.000
matur: 0.000
oiy: 0.000
ggp: 0.000
eac: 0.000
fml: 0.000
ipm: 0.000
commenc: 0.000
yac: 0.000
gun: 0.000
vkj: 0.000
nvq: 0.000
fov: 0.000
vvi: 0.000
roleuri: 0.000
gyearitemtyp: 0.000
tuplesreport: 0.000
azf: 0.000
hmk: 0.000
unl: 0.000
hok: 0.000
rbc: 0.000
nonvest: 0.000
aw: 0.000
hbi: 0.000
mxv: 0.000
sei: 0.000
kve: 0.000
wpm: 0.000
fpe: 0.000
ccc: 0.000
guz: 0.000
largest: 0.000
chief: 0.000
uncollect: 0.000
dollar: 0.000
ssj: 0.000
explan: 0.000
mnu: 0.000
fwg: 0.000
integr: 0.000
kxb: 0.000
hde: 0.000
xya: 0.000
right: 0.000
mnp: 0.000
pax: 0.000
xbo: 0.000
vun: 0.000
glu: 0.000
soc: 0.000
wfh: 0.000
auto: 0.000
certain: 0.000
noncompli: 0.000
enterpris: 0.000
zae: 0.000
jod: 0.000
oid: 0.000
shift: 0.000
bbf: ...
plf: 0.405
tft: 0.405
zsw: 0.405
stw: 0.405
ulua: 0.405
lnu: 0.405
tmu: 0.405
qml: 0.405
fuel: 0.405
oll: 0.405
undistributedearningsofforeignsubsidiari: 0.405
rrk: 0.405
fddd: 0.405
ocd: 0.405
kev: 0.405
researchanddevelopmentexpens: 0.405
fnvb: 0.405
kgh: 0.405
bng: 0.405
ptz: 0.405
ygv: 0.405
qgu: 0.405
zya: 0.405
efjx: 0.405
gzc: 0.405
afbf: 0.405
nda: 0.405
wmw: 0.405
jnc: 0.405
pyq: 0.405
yri: 0.405
pqw: 0.405
ajp: 0.405
jbf: 0.405
mqf: 0.405
overpay: 0.405
air: 0.405
cxg: 0.405
apic: 0.405
zhj: 0.405
round: 0.405
ngt: 0.405
uzdr: 0.405
mwn: 0.405
putat: 0.405
advic: 0.405
kfo: 0.405
zrg: 0.405
xfz: 0.405
mnbi: 0.405
eqr: 0.405
gdr: 0.405
zym: 0.405
wej: 0.405
trq: 0.405
kuu: 0.405
old: 0.405
hxg: 0.405
jjc: 0.405
fya: 0.405
iqo: 0.405
pfh: 0.405
zwq: 0.405
vzwo: 0.405
gqc: 0.405
ftr: 0.405
rym: 0.405
lxo: 0.405
jql: 0.405
vwv: 0.405
cgzo: 0.405
fun: 0.405
uyn: 0.405
ryk: 0.405
osk: 0.405
xdt: 0.405
lbi: 0.405
jyr: 0.405
vmq: 0.405
ukj: 0.405
bxr: 0.405
xvh: 0.405
shorttermdebtwe...
steal: 0.405
lqge: 0.405
czs: 0.405
mfu: 0.405
tbe: 0.405
cxf: 0.405
contact: 0.405
ncl: 0.405
zfg: 0.405
ejf: 0.405
vaa: 0.405
fyd: 0.405
unvr: 0.405
oii: 0.405
tgf: 0.405
sdh: 0.405
sys: 0.405
nwc: 0.405
wuc: 0.405
dun: 0.405
lot: 0.405
rmf: 0.405
fke: 0.405
wqc: 0.405
cbbf: 0.405
muy: 0.405
gie: 0.405
mkr: 0.405
rhw: 0.405
intuit: 0.405
tde: 0.405
jvu: 0.405
annum: 0.405
jct: 0.405
atu: 0.405
tkt: 0.405
ykx: 0.405
tjh: 0.405
xfe: 0.405
bor: 0.405
oof: 0.405
xyb: 0.405
fop: 0.405
dvh: 0.405
tzl: 0.405
hlq: 0.405
fnk: 0.405
duw: 0.405
emw: 0.405
vnc: 0.405
xxi: 0.405
ifi: 0.405
qfj: 0.405
fet: 0.405
vfi: 0.405
ogt: 0.405
lxp: 0.405
bbdc: 0.405
ffeb: 0.405
hwb: 0.405
wvwo: 0.405
gwe: 0.405
lzr: 0.405
wum: 0.405
jzm: 0.405
ogg: 0.405
nrv: 0.405
yvd: 0.405
xqo: 0.405
bns: 0.405
cgx: 0.405
njg: 0.405
zzv: 0.405
rcd: 0.405
eadf: 0.405
dbbb: 0.405
launch: 0.405
dqs: 0.405
xci: 0.405
rjj: 0.405
khu: 0.405
axu: 0.405
qdg: 0.405
qlg: 0.405
fairvaluehedgingmemb: 0.405
psp: 0.405
uza: 0.405
floo...
bqg: 0.405
scheduleofequitymethodinvestmentequitymethodinvesteenameaxi: 0.405
othercomprehensiveincomeunrealizedgainlossonderivativesarisingduringperiodtax: 0.405
nbi: 0.405
bpx: 0.405
xut: 0.405
esq: 0.405
uun: 0.405
ton: 0.405
ihcn: 0.405
poq: 0.405
kfk: 0.405
nsp: 0.405
oop: 0.405
qxd: 0.405
sbv: 0.405
hew: 0.405
gwr: 0.405
inw: 0.405
fwx: 0.405
reset: 0.405
ttj: 0.405
bwi: 0.405
earningspersharetextblock: 0.405
ljc: 0.405
rnn: 0.405
iumd: 0.405
tariff: 0.405
hpq: 0.405
rmk: 0.405
ayf: 0.405
xpz: 0.405
sso: 0.405
eventu: 0.405
uni: 0.405
intangibleassetsgrossexcludinggoodwil: 0.405
qqg: 0.405
dwx: 0.405
vat: 0.405
lle: 0.405
jjd: 0.405
xjp: 0.405
dov: 0.405
bsu: 0.405
qiw: 0.405
djl: 0.405
incometaxespaidnet: 0.405
lae: 0.405
rln: 0.405
oip: 0.405
mdexw: 0.405
xcuf: 0.405
ihh: 0.405
eql: 0.405
wxt: 0.405
gdc: 0.405
wnu: 0.405
baba: 0.405
owq: 0.405
vxk: 0.405
qfvu: 0.405
pae: 0.405
kcw: 0.405
jcg: 0.405
rrf: 0.405
movz: 0.405
zgr: 0.405
bpm: 0.405
gcli: 0.405
ufo: 0.405
debtinstrume...
omaena: 1.099
mixzu: 1.099
qldi: 1.099
ekb: 1.099
kgyi: 1.099
dkhv: 1.099
ezvr: 1.099
disclosureofsharebasedcompensationarrangementsbysharebasedpaymentawardtextblock: 1.099
mhcg: 1.099
othercomprehensiveincomelossdecreasefromdeconsolidationbeforetax: 1.099
yzpo: 1.099
jlyz: 1.099
kmzzh: 1.099
qzeur: 1.099
hdvtpb: 1.099
wlvjm: 1.099
ciut: 1.099
tgg: 1.099
yvhv: 1.099
aiqw: 1.099
shhhh: 1.099
okxblgw: 1.099
crwdf: 1.099
mtli: 1.099
xlxnoiyt: 1.099
ojvna: 1.099
leyuo: 1.099
werf: 1.099
bfceded: 1.099
kqpa: 1.099
statetaxcreditcarryforwardmemb: 1.099
bagi: 1.099
zftm: 1.099
qrcb: 1.099
shareholdersequityt: 1.099
yalf: 1.099
oyn: 1.099
madd: 1.099
qxpb: 1.099
ohflsocid: 1.099
eula: 1.099
pbbr: 1.099
vcmg: 1.099
evhvk: 1.099
cjv: 1.099
vkge: 1.099
lpzi: 1.099
qtch: 1.099
rqqf: 1.099
guarantornonguarantorsubsidiaryfinancialinform: 1.099
xjia: 1.099
iltf: 1.099
xjqivq: 1.099
euyl: 1.099
ppnx: 1.099
vkl: 1.099
ekxq: 1.099
swtl: 1.099
mrgj: 1.099
tqdb: 1.099
etl: 1.099
ewzvbu: 1.099
bupj: 1.099
...
nlrs: 1.099
cgkq: 1.099
speaker: 1.099
dissemin: 1.099
uybi: 1.099
rblq: 1.099
qxeh: 1.099
htb: 1.099
fmts: 1.099
saie: 1.099
yffwch: 1.099
izgi: 1.099
pefo: 1.099
dnxw: 1.099
uuz: 1.099
sejm: 1.099
cusfi: 1.099
iesaqdnw: 1.099
finitelivedintangibleassetsbymajorclassaxi: 1.099
wjnu: 1.099
kfvn: 1.099
zlo: 1.099
alibaba: 1.099
ccade: 1.099
urzo: 1.099
pcao: 1.099
ccedef: 1.099
orcf: 1.099
rtvvkrb: 1.099
tdj: 1.099
kcj: 1.099
afp: 1.099
odzru: 1.099
mcnfynw: 1.099
gzkn: 1.099
bvzr: 1.099
clelk: 1.099
amq: 1.099
zdxc: 1.099
mxfz: 1.099
vwl: 1.099
vwawq: 1.099
yowib: 1.099
guarantornonguarantorsubsidiaryfinancialinformationcondensedconsolidatingstatementofcomprehensiveincomedetail: 1.099
cfoj: 1.099
ncdn: 1.099
evsb: 1.099
mkzwj: 1.099
mjaxz: 1.099
uunb: 1.099
wyee: 1.099
tbbusc: 1.099
dkm: 1.099
increasedecreaseindeferredincometax: 1.099
jvst: 1.099
zcuk: 1.099
cpui: 1.099
gdxr: 1.099
sobki: 1.099
flgq: 1.099
kaa: 1.099
fcth: 1.099
ynlb: 1.099
dlbu: 1.099
iijq: 1.099
dkxk: 1.099
bhxb: 1.099
...
zdvfs: 1.099
joqk: 1.099
tppc: 1.099
abao: 1.099
gmco: 1.099
fceae: 1.099
ovlpver: 1.099
mswh: 1.099
tbhk: 1.099
mzmi: 1.099
unoa: 1.099
racfrtn: 1.099
caxsw: 1.099
tujh: 1.099
otyo: 1.099
movv: 1.099
gswm: 1.099
vax: 1.099
pdcq: 1.099
eicwnewl: 1.099
hexh: 1.099
mqlj: 1.099
sfff: 1.099
zlmbbt: 1.099
obbph: 1.099
hobi: 1.099
mtnvwz: 1.099
gsym: 1.099
xgwi: 1.099
qlskga: 1.099
ehto: 1.099
apadi: 1.099
yeydc: 1.099
kqtx: 1.099
mzts: 1.099
itfw: 1.099
iox: 1.099
ulip: 1.099
ftzr: 1.099
fxnf: 1.099
rrot: 1.099
kezbpv: 1.099
jkjv: 1.099
galv: 1.099
odbz: 1.099
adnd: 1.099
umyo: 1.099
rvuu: 1.099
fada: 1.099
ujd: 1.099
dyfp: 1.099
bovwcnn: 1.099
fixedr: 1.099
edadd: 1.099
csjv: 1.099
zumj: 1.099
bgk: 1.099
fqyd: 1.099
fcdee: 1.099
njlnk: 1.099
ccrplqris: 1.099
bnbt: 1.099
increasedecreaseinotherreceiv: 1.099
mefp: 1.099
mwat: 1.099
maxn: 1.099
cvp: 1.099
duetorelatedpartiescurr: 1.099
ocrr: 1.099
kjzw: 1.099
tzib: 1.099
lnecndjt: 1.099
bbee: 1.099
trail: 1.099
nayk: 1.099
kopb: 1.099
mjmltpi...
bfda: 1.099
eujey: 1.099
yave: 1.099
krbw: 1.099
dfk: 1.099
ajgjo: 1.099
asyl: 1.099
uyd: 1.099
hsfow: 1.099
luri: 1.099
hbap: 1.099
vhjsi: 1.099
gock: 1.099
jmebso: 1.099
oqvrr: 1.099
ijsg: 1.099
ipagopf: 1.099
eswr: 1.099
ucgn: 1.099
ddqs: 1.099
idbm: 1.099
xxue: 1.099
ibpn: 1.099
ntlozt: 1.099
hyvp: 1.099
vhuf: 1.099
mxick: 1.099
vqazl: 1.099
aobu: 1.099
maov: 1.099
tnjiz: 1.099
qcn: 1.099
ulf: 1.099
khhm: 1.099
othercomprehensiveincomelossreclassificationadjustmentfromaocionderivativestax: 1.099
qqoip: 1.099
hfa: 1.099
aoimc: 1.099
knog: 1.099
oiro: 1.099
xctv: 1.099
fhgluo: 1.099
rpn: 1.099
abqx: 1.099
hiap: 1.099
guaranteeobligationsbynatureaxi: 1.099
tczvul: 1.099
ipw: 1.099
ufdm: 1.099
rfb: 1.099
lmyqz: 1.099
zhcj: 1.099
jojsj: 1.099
dymljo: 1.099
xdcj: 1.099
owyi: 1.099
dsaa: 1.099
dzuf: 1.099
cfcab: 1.099
ifgzoi: 1.099
vpod: 1.099
cigf: 1.099
taa: 1.099
kvzn: 1.099
hume: 1.099
ceeed: 1.099
vgbsfk: 1.099
vpay: 1.099
tff: 1.099
mreo: 1.099
lawb: 1.099
eipq: 1.099
joce: 1.099
bh...
gcwja: 1.099
welfar: 1.099
mzux: 1.099
exmo: 1.099
qvx: 1.099
yoev: 1.099
mjrl: 1.099
rzyq: 1.099
sharebasedcompensationequityawardsotherthanoptionsexpectedtovestshar: 1.099
pynr: 1.099
baaff: 1.099
lrkk: 1.099
lijjw: 1.099
pavvqp: 1.099
tcmad: 1.099
rlsi: 1.099
xjfd: 1.099
wdfh: 1.099
rxbd: 1.099
zezohaof: 1.099
vyao: 1.099
tttp: 1.099
exsqigftf: 1.099
fqyysl: 1.099
ybqyz: 1.099
hbwx: 1.099
skadden: 1.099
mfnaw: 1.099
ffxu: 1.099
fbn: 1.099
jlxb: 1.099
evqp: 1.099
ebccc: 1.099
rkpa: 1.099
fmph: 1.099
vsw: 1.099
mzssno: 1.099
wnjiakq: 1.099
jurc: 1.099
definedbenefitplancontributionsbyplanparticip: 1.099
ffbfd: 1.099
crfob: 1.099
cqce: 1.099
waaua: 1.099
jtsyj: 1.099
naxm: 1.099
vdxm: 1.099
ktfm: 1.099
mkhp: 1.099
wvpnczr: 1.099
leghi: 1.099
kjvr: 1.099
zkvg: 1.099
wadj: 1.099
pwrc: 1.099
eoueqba: 1.099
mkht: 1.099
rceu: 1.099
vpujk: 1.099
jnlz: 1.099
skds: 1.099
zyw: 1.099
agdx: 1.099
clara: 1.099
cdaaa: 1.099
itnrrcnpx: 1.099
bajf: 1.099
elsho: 1.099
rubber: 1.099
qph: 1.099
ppca: 1.099
...
nuco: 1.099
ylhe: 1.099
othercomprehensiveincomelossreclassificationadjustmentfromaociforsaleofsecuritiesbeforetax: 1.099
vguw: 1.099
dfj: 1.099
mkdel: 1.099
karb: 1.099
jte: 1.099
mtte: 1.099
ghrybia: 1.099
cqcwe: 1.099
rqqel: 1.099
ebdcb: 1.099
bekl: 1.099
muff: 1.099
bhhxf: 1.099
mkhjs: 1.099
jcjd: 1.099
accumulatedothercomprehensiveincomelossdistributionattributabletodiscontinuedoperationsnetoftax: 1.099
mycbzl: 1.099
terwc: 1.099
ccbdd: 1.099
unamortizeddebtissuanceexpens: 1.099
mjohkrqrqc: 1.099
vkgf: 1.099
ivec: 1.099
bfebb: 1.099
zajsj: 1.099
majortypesofdebtandequitysecuritiesaxi: 1.099
gtohoa: 1.099
jeeqm: 1.099
okel: 1.099
isnnhtq: 1.099
owyuvsp: 1.099
vxam: 1.099
ohih: 1.099
kekp: 1.099
ssytj: 1.099
campaign: 1.099
xeii: 1.099
zcsv: 1.099
obsz: 1.099
fik: 1.099
hysu: 1.099
ffead: 1.099
vetww: 1.099
tbzg: 1.099
yiwdrt: 1.099
bgtc: 1.099
kerh: 1.099
wnoga: 1.099
gkli: 1.099
dweh: 1.099
mzvd: 1.099
tspgb: 1.099
htcz: 1.099
jomi: 1.099
ddcgu: 1.099
qkwxj: 1.099
ugncg: 1.099
lmm...
fif: 1.099
zexn: 1.099
lfvphqz: 1.099
eztgm: 1.099
wouj: 1.099
room: 1.099
kiwp: 1.099
ccbdc: 1.099
woi: 1.099
ouwz: 1.099
financialinstrumentscashandavailableforsalesecuritiesadjustedcostgrossunrealizedgainsgrossunrealizedlossesandfairvaluerecordedascashandcashequivalentsorshorttermorlongtermmarketablesecuritiesdetail: 1.099
mytz: 1.099
xtqn: 1.099
nkav: 1.099
xorw: 1.099
pkwd: 1.099
nctq: 1.099
ydhh: 1.099
ftn: 1.099
shwov: 1.099
pfef: 1.099
dawtg: 1.099
mjoo: 1.099
kaegfdm: 1.099
qtvw: 1.099
ngzbkbd: 1.099
aceb: 1.099
msiv: 1.099
mqewa: 1.099
yhfc: 1.099
mzkxbt: 1.099
dpini: 1.099
hlqlckv: 1.099
tfeq: 1.099
lijd: 1.099
qrkdko: 1.099
pcjk: 1.099
hkxr: 1.099
phish: 1.099
uotmp: 1.099
ezmpz: 1.099
ekci: 1.099
skzs: 1.099
zscf: 1.099
xvpef: 1.099
lvzn: 1.099
ktbro: 1.099
polk: 1.099
hzey: 1.099
kjtgwrudlxa: 1.099
cebec: 1.099
qgutp: 1.099
rehp: 1.099
vrwtm: 1.099
jnvt: 1.099
lmpwf: 1.099
axva: 1.099
quxb: 1.099
qeqpm: 1.099
iwzrb: 1.099
sooner: 1.099
qzqd: 1.099
vamx: 1.099
gejh: 1.099
...
uarp: 1.099
imqn: 1.099
acdic: 1.099
xxa: 1.099
mkio: 1.099
dmcm: 1.099
heizdx: 1.099
pmtv: 1.099
ear: 1.099
wind: 1.099
wcmcj: 1.099
mbkg: 1.099
mlzluyzheh: 1.099
zigtpu: 1.099
classifiedsmemb: 1.099
qimujnymm: 1.099
mtdr: 1.099
mxetc: 1.099
viov: 1.099
ecgqvh: 1.099
aswwj: 1.099
mpyhbp: 1.099
hilgev: 1.099
cajd: 1.099
hkxuqw: 1.099
isdi: 1.099
trobc: 1.099
zilng: 1.099
gcdq: 1.099
luvumz: 1.099
dzfsn: 1.099
othercomprehensiveincomeforeigncurrencytransactionandtranslationadjustmentbeforetaxportionattributabletopar: 1.099
fxvn: 1.099
pfga: 1.099
dmnntl: 1.099
hwso: 1.099
vczgx: 1.099
numberofpropertiescontributedtojointventur: 1.099
eezbi: 1.099
mvcb: 1.099
poxgg: 1.099
egcnfm: 1.099
eaectd: 1.099
oukd: 1.099
kryn: 1.099
ahkbb: 1.099
devzpf: 1.099
jzcre: 1.099
rkk: 1.099
efcdb: 1.099
kkg: 1.099
cwr: 1.099
jfgp: 1.099
fhxj: 1.099
xmlv: 1.099
fiho: 1.099
wawbrm: 1.099
swm: 1.099
ybfl: 1.099
rrmkb: 1.099
cifn: 1.099
wlhm: 1.099
ouwr: 1.099
hnk: 1.099
baqf: 1.099
zkrj: 1.099
iqhn: 1.099
mo...
vmtn: 1.099
aebp: 1.099
toakta: 1.099
ipk: 1.099
mcex: 1.099
minbz: 1.099
mckluuhj: 1.099
sbod: 1.099
tow: 1.099
gtak: 1.099
eink: 1.099
ohp: 1.099
wcblun: 1.099
vjoz: 1.099
aajm: 1.099
consolidatedstatementofcashflow: 1.099
mpiwx: 1.099
paymentstoacquireinvest: 1.099
bfeam: 1.099
dbrr: 1.099
ivsu: 1.099
zbyy: 1.099
cgn: 1.099
umhz: 1.099
lno: 1.099
iyj: 1.099
zuima: 1.099
ajj: 1.099
pywr: 1.099
gvxb: 1.099
othercomprehensiveincomelossamortizationadjustmentfromaocipensionandotherpostretirementbenefitplansfornetpriorservicecostcreditnetoftax: 1.099
beccdfd: 1.099
mezl: 1.099
yncw: 1.099
debtinstrumentinterestrateeffectivepercentageraterangeminimum: 1.099
qwhlwj: 1.099
hxom: 1.099
voaop: 1.099
cqiq: 1.099
cgrr: 1.099
wbhp: 1.099
fvkc: 1.099
vqskbg: 1.099
zfrk: 1.099
lpsg: 1.099
ogfnd: 1.099
pnv: 1.099
jeji: 1.099
hmw: 1.099
hsxek: 1.099
scnz: 1.099
jcna: 1.099
aafdd: 1.099
xmzr: 1.099
kupn: 1.099
nyay: 1.099
gnurxjij: 1.099
gainlossonsaleofbusi: 1.099
wcrs: 1.099
uepf: 1.099
vmcmzr: 1.099
...
smlnr: 1.099
classofwarrantorrightexercisepriceofwarrantsorright: 1.099
vyjr: 1.099
ynmw: 1.099
gtvzn: 1.099
oeboo: 1.099
iyyiq: 1.099
srlnk: 1.099
kfoj: 1.099
saxc: 1.099
qkxen: 1.099
kap: 1.099
xyvc: 1.099
fvvrk: 1.099
hgao: 1.099
morhyi: 1.099
quxfu: 1.099
xknm: 1.099
fnuj: 1.099
dgsdac: 1.099
fxdma: 1.099
ajyw: 1.099
woov: 1.099
hstdq: 1.099
dbdcd: 1.099
jjue: 1.099
mbyay: 1.099
mjno: 1.099
insurancepolicytextblock: 1.099
acefd: 1.099
fmgw: 1.099
odf: 1.099
mkrep: 1.099
abbd: 1.099
elpr: 1.099
lwbb: 1.099
kpuz: 1.099
dward: 1.099
eadbf: 1.099
wlmmw: 1.099
fqqb: 1.099
ser: 1.099
mlomc: 1.099
xorupc: 1.099
ivkl: 1.099
jkvim: 1.099
yche: 1.099
xsq: 1.099
hvow: 1.099
irrespect: 1.099
cqixml: 1.099
enkz: 1.099
mquq: 1.099
passu: 1.099
dvxqt: 1.099
kcgd: 1.099
fqck: 1.099
selfinsurancereservesdiscount: 1.099
nubi: 1.099
ecoq: 1.099
contamin: 1.099
hfgaa: 1.099
xttb: 1.099
ztth: 1.099
mttf: 1.099
cjcnt: 1.099
mmja: 1.099
xlpx: 1.099
waj: 1.099
gmkv: 1.099
kfos: 1.099
svpqfd: 1.099
xxzlua: 1.099
...
ebayinc: 1.099
stln: 1.099
myrdgzq: 1.099
cytp: 1.099
uqdh: 1.099
laql: 1.099
vumwq: 1.099
xfatcx: 1.099
mbob: 1.099
pldr: 1.099
psf: 1.099
cvwyee: 1.099
xkwkgzgg: 1.099
oazr: 1.099
dami: 1.099
ehcp: 1.099
uuvt: 1.099
mrow: 1.099
owtj: 1.099
oidd: 1.099
mvij: 1.099
tnpp: 1.099
vbvx: 1.099
pumsk: 1.099
zhpe: 1.099
rzgw: 1.099
tcfwp: 1.099
mczx: 1.099
lujl: 1.099
transactionrevenuenetfromexternalcustom: 1.099
sjza: 1.099
hqdd: 1.099
bnyhyjm: 1.099
oowg: 1.099
cyb: 1.099
cdcf: 1.099
gvo: 1.099
ptsan: 1.099
iulax: 1.099
teeycn: 1.099
televis: 1.099
nudqr: 1.099
nfyut: 1.099
rzdt: 1.099
ozxdoqtk: 1.099
vlb: 1.099
bgqqnhjjp: 1.099
mwaoif: 1.099
xkay: 1.099
nbzmi: 1.099
yrkwc: 1.099
trn: 1.099
nwewv: 1.099
ecbba: 1.099
efimpv: 1.099
ebedb: 1.099
lml: 1.099
qhyl: 1.099
mmoj: 1.099
qijz: 1.099
voflsu: 1.099
gqnr: 1.099
malfunct: 1.099
dbdu: 1.099
oelmu: 1.099
qwhg: 1.099
oomz: 1.099
gavj: 1.099
mdgwc: 1.099
simpli: 1.099
efbec: 1.099
gjqf: 1.099
ttfu: 1.099
bkdt: 1.099
fdwf: 1.099
vtska: 1.099
...
gzss: 1.099
lvsnw: 1.099
nkvxi: 1.099
pbvd: 1.099
bafeff: 1.099
cnhsxkpga: 1.099
fdfddeec: 1.099
kep: 1.099
deconsolid: 1.099
lrh: 1.099
idi: 1.099
kori: 1.099
cyih: 1.099
bedi: 1.099
qhkr: 1.099
rlgm: 1.099
ylb: 1.099
kpia: 1.099
wdhr: 1.099
qdkdh: 1.099
jmpw: 1.099
xlxv: 1.099
ggdj: 1.099
lccnb: 1.099
xnfixvm: 1.099
gvqbp: 1.099
dfefe: 1.099
edccf: 1.099
ogbn: 1.099
vvqwvv: 1.099
hjp: 1.099
gamo: 1.099
tydr: 1.099
ojtcd: 1.099
qlki: 1.099
rad: 1.099
vzuc: 1.099
svqh: 1.099
unjufh: 1.099
imbybmp: 1.099
heq: 1.099
sohb: 1.099
chhr: 1.099
uzj: 1.099
dmmcl: 1.099
daee: 1.099
jufr: 1.099
voup: 1.099
xkkyoy: 1.099
xshn: 1.099
ijbxrcghsarzj: 1.099
jmgsb: 1.099
ihon: 1.099
saleleasebacktransactioncurrentportionofdeferredgainnet: 1.099
mzlr: 1.099
ckphj: 1.099
txml: 1.099
fgev: 1.099
lnn: 1.099
ukork: 1.099
rgrxbabn: 1.099
kqlp: 1.099
uumxz: 1.099
tuxgef: 1.099
gmyo: 1.099
yail: 1.099
gijk: 1.099
wnhi: 1.099
srbms: 1.099
gktb: 1.099
xgki: 1.099
kqxc: 1.099
rsji: 1.099
hhlt: 1.099
liw: 1.099
a...
ygmi: 1.099
gjc: 1.099
keok: 1.099
hvyi: 1.099
mxnn: 1.099
ylerm: 1.099
kekrm: 1.099
duhjk: 1.099
unuv: 1.099
mhht: 1.099
mxcc: 1.099
gnu: 1.099
zup: 1.099
bkin: 1.099
afcc: 1.099
txr: 1.099
vxgz: 1.099
vlq: 1.099
iiq: 1.099
yvnc: 1.099
japxr: 1.099
rujkmz: 1.099
mmjtza: 1.099
rijl: 1.099
tubi: 1.099
kkreg: 1.099
pori: 1.099
dhbi: 1.099
rboqg: 1.099
lahp: 1.099
vynoum: 1.099
dreni: 1.099
gifdz: 1.099
uqpp: 1.099
xqb: 1.099
sifj: 1.099
xuop: 1.099
noncontributori: 1.099
jgag: 1.099
incometaxexaminationdescript: 1.099
auht: 1.099
carolina: 1.099
oeae: 1.099
gmse: 1.099
rcilb: 1.099
wplub: 1.099
snuvo: 1.099
kdcit: 1.099
vnnrvts: 1.099
vcaptq: 1.099
ibzkc: 1.099
uvev: 1.099
rvlvz: 1.099
lvjg: 1.099
jwlks: 1.099
vjofp: 1.099
gabe: 1.099
jdpnb: 1.099
zbk: 1.099
nngxe: 1.099
eflxnl: 1.099
appdso: 1.099
cdvl: 1.099
ubs: 1.099
mxukrl: 1.099
wppyyoc: 1.099
poell: 1.099
disposalgroupincludingdiscontinuedoperationcashandcashequival: 1.099
ykw: 1.099
yqepbu: 1.099
qdws: 1.099
kjn: 1.099
ycfeb: 1.099
...
snpn: 1.099
kbgz: 1.099
mzefi: 1.099
ydca: 1.099
ptkc: 1.099
yjyu: 1.099
kuxb: 1.099
pari: 1.099
ywuvp: 1.099
nsjoif: 1.099
fiwn: 1.099
vydj: 1.099
gmdk: 1.099
hbrnc: 1.099
ovuf: 1.099
javz: 1.099
esxeaq: 1.099
ltyo: 1.099
dcded: 1.099
longtermmarketablesecuritiesmaturitiestermminimum: 1.099
erjv: 1.099
otherproductsmemb: 1.099
cvhn: 1.099
apcow: 1.099
irlq: 1.099
myre: 1.099
dsyjvt: 1.099
ofvt: 1.099
lrbtz: 1.099
klpp: 1.099
upmvj: 1.099
mjfqt: 1.099
bpmku: 1.099
uakqhdr: 1.099
qoxf: 1.099
twbzi: 1.099
uqbodfi: 1.099
division: 1.099
sare: 1.099
lhjf: 1.099
juof: 1.099
afuh: 1.099
nnryx: 1.099
orfr: 1.099
auq: 1.099
mxsj: 1.099
nyp: 1.099
zuq: 1.099
gvnaa: 1.099
dtkg: 1.099
maxktcp: 1.099
zvhdd: 1.099
ehk: 1.099
khiy: 1.099
nyhu: 1.099
yvqwv: 1.099
clal: 1.099
vql: 1.099
owzz: 1.099
eeia: 1.099
acceleratedsharerepurchasesfinalpricepaidpershar: 1.099
qhetb: 1.099
xrlk: 1.099
ozb: 1.099
pueg: 1.099
tef: 1.099
rrogp: 1.099
nmuz: 1.099
lyvjn: 1.099
dobf: 1.099
bkd: 1.099
waterfal: 1.099
rg...
jffb: 1.099
coni: 1.099
shqs: 1.099
tlfi: 1.099
rlzk: 1.099
kwr: 1.099
sxfz: 1.099
oxxdm: 1.099
curat: 1.099
vyrio: 1.099
iqkzwbc: 1.099
pful: 1.099
mmas: 1.099
fbeq: 1.099
eaphh: 1.099
hviv: 1.099
dxk: 1.099
meze: 1.099
bfbdc: 1.099
mrwh: 1.099
uvfrv: 1.099
cjme: 1.099
cugx: 1.099
gteb: 1.099
ndllrm: 1.099
debtinstrumentinterestratestatedpercentageraterangeminimum: 1.099
rculx: 1.099
inyim: 1.099
qqxhi: 1.099
nbdw: 1.099
uizn: 1.099
ocmz: 1.099
relatedpartydisclosuredetail: 1.099
polgl: 1.099
djbd: 1.099
jcoj: 1.099
jfipmfvu: 1.099
fsjk: 1.099
mahui: 1.099
fptw: 1.099
iyrc: 1.099
httb: 1.099
charh: 1.099
bou: 1.099
qzdb: 1.099
mlwt: 1.099
availableforsalesecuritiesdebtmaturitiesafterthreethroughfouryearsfairvalu: 1.099
plxr: 1.099
lnodm: 1.099
zynjv: 1.099
consolidatedstatementofstockholdersequ: 1.099
assj: 1.099
bfo: 1.099
ahim: 1.099
pensionandotherpostretirementdefinedbenefitplansliabilitiesnoncurr: 1.099
zbxddxfi: 1.099
ytp: 1.099
kzyiv: 1.099
vxyb: 1.099
lubi: 1.099
teaxh: 1.099


<a id="compute_tf"></a>

## 4. Compute tf

Below we will compute ${\rm tf}(w,d)$, or the term frequency for a given word $w$, in a given document $d$. Since our ultimate goal is to compute a document vector, we'd like to keep a few things in mind

1. Store the ${\rm tf}(w, d)$ for each word in a document as a dictionary
2. Even when a word doesn't appear in the document $d$, we still want to keep a $0$ entry in the dictionary. This is important when we convert the dictionary to a vector, where zero entries are important


There are multiple definitions for ${\rm tf}(w,d)$. The simplest one is

$$
{\rm tf}(w,d)=\frac{f_{w,d}}{a_d}
$$

where $f_{w,d}$ is the number of occurrences of the word $w$ in the document $d$, and $a_d$ is the average number of occurrences per distinct word in that document, used for normalization. Just like ${\rm idf}(w)$, a logarithm can be added

$$
{\rm tf}(w,d)=\begin{cases}
\frac{1+\log f_{w,d}}{1+\log a_d} & f_{w,d} > 0 \\
0 & f_{w,d} = 0 \\
\end{cases}
$$

Implement the function `get_tf(txt, include_log=True)` that computes ${\rm tf}(w,d)$ for each word in the document. It should return a `defaultdict(int)`, so that looking up a word that is not in the document yields zero instead of raising an error. Also include the optional parameter `include_log`, which enables the logarithm in the computation. I suggest also adding a helper function called `_tf` that evaluates the formula above for a single word.

<a href="#table_of_content">back to top</a>

In [21]:
import numpy as np
from math import log

def _tf(freq, avg, include_log=True):
    # TO DO
    if include_log:
        return 0 if freq == 0 else (1+log(freq))/(1+log(avg))
    else:
        return freq/avg

def get_tf(txt, include_log=True):
    # TO DO
    freq = bag_of_words(txt)
    avg = np.mean(list(freq.values()))
    tf = {w:_tf(f,avg, include_log) for w,f in freq.items()}
    return defaultdict(int, tf)
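
To get a feel for the two definitions, here is a small worked example on made-up word counts. The helper name `toy_tf` and the counts are hypothetical; the formula is the same as in `_tf` above.

```python
from math import log

def toy_tf(freq, avg, include_log=True):
    """Term frequency of one word, normalized by the average count avg."""
    if include_log:
        return 0 if freq == 0 else (1 + log(freq)) / (1 + log(avg))
    return freq / avg

# hypothetical word counts for a single document
counts = {"risk": 4, "revenu": 2, "store": 1, "auction": 1}

# a_d: average count over the distinct words, here 8 / 4 = 2.0
avg = sum(counts.values()) / len(counts)

tf_log = {w: toy_tf(f, avg) for w, f in counts.items()}
tf_raw = {w: toy_tf(f, avg, include_log=False) for w, f in counts.items()}

for w in counts:
    print("{}: raw={:.3f} log={:.3f}".format(w, tf_raw[w], tf_log[w]))
```

In both versions, a word whose count equals the average (`revenu` here) gets a tf of exactly 1. The logarithm compresses the spread: `risk` occurs four times as often as `store`, but its log-scaled tf is less than four times as large.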

In [22]:
tfs = [ get_tf(c) for c in corpus ]
print_sorted(tfs[0])

font: 2.571
style: 2.473
pad: 2.433
div: 2.392
size: 2.381
bottom: 2.346
align: 2.325
famili: 2.255
top: 2.245
leav: 2.245
vertic: 2.214
text: 2.209
colspan: 2.176
right: 2.164
xlink: 2.161
rowspan: 2.158
helvetica: 2.123
san: 2.123
serif: 2.123
inherit: 2.083
color: 2.058
background: 2.045
efefef: 2.044
class: 2.030
hide: 2.013
span: 2.001
overflow: 1.999
http: 1.998
type: 1.995
border: 1.992
label: 1.974
org: 1.973
strong: 1.961
xbrl: 1.958
gaap: 1.952
href: 1.925
none: 1.918
www: 1.912
amp: 1.906
role: 1.905
xbrli: 1.898
link: 1.898
height: 1.857
arcrol: 1.849
solid: 1.845
tabl: 1.834
show: 1.822
javascript: 1.822
clear: 1.818
void: 1.813
onclick: 1.813
width: 1.806
fasb: 1.804
line: 1.782
loc: 1.776
xsd: 1.756
locat: 1.727
indent: 1.717
order: 1.713
arc: 1.711
display: 1.708
weight: 1.708
white: 1.705
period: 1.698
justifi: 1.689
elt: 1.683
cellspac: 1.681
bold: 1.677
name: 1.675
cellpad: 1.673
doubl: 1.670
togglenext: 1.666
xml: 1.660
center: 1.653
arial: 1.649
fact: 1.643

uqi: 0.416
jng: 0.416
op: 0.416
pwrr: 0.416
ixcucpsx: 0.416
pyhncsvt: 0.416
fdj: 0.416
utv: 0.416
xez: 0.416
vmov: 0.416
lox: 0.416
mlh: 0.416
bev: 0.416
bld: 0.416
emt: 0.416
moo: 0.416
ivn: 0.416
poi: 0.416
lev: 0.416
ssj: 0.416
fua: 0.416
mux: 0.416
mqx: 0.416
oxl: 0.416
yhw: 0.416
vlg: 0.416
qqw: 0.416
mbk: 0.416
aaj: 0.416
uvi: 0.416
mqp: 0.416
wec: 0.416
flm: 0.416
twq: 0.416
nnz: 0.416
xlu: 0.416
izk: 0.416
qmj: 0.416
wqu: 0.416
koa: 0.416
gpu: 0.416
oaw: 0.416
swj: 0.416
cwq: 0.416
mlk: 0.416
cwx: 0.416
bei: 0.416
kxd: 0.416
xbz: 0.416
zxi: 0.416
oqa: 0.416
ddx: 0.416
pwt: 0.416
emf: 0.416
xfo: 0.416
fbz: 0.416
mam: 0.416
mup: 0.416
emj: 0.416
xal: 0.416
bni: 0.416
ndm: 0.416
rgi: 0.416
fdr: 0.416
vnm: 0.416
exk: 0.416
akx: 0.416
mvq: 0.416
hmi: 0.416
oac: 0.416
zcd: 0.416
fnx: 0.416
lgq: 0.416
mug: 0.416
pyj: 0.416
ggx: 0.416
qvd: 0.416
qyx: 0.416
lzu: 0.416
mmo: 0.416
usi: 0.416
alm: 0.416
aom: 0.416
aoq: 0.416
xhn: 0.416
xhu: 0.416
ocg: 0.416
ctk: 0.416
rtn: 0.416
xkt: 0.416

srp: 0.336
shw: 0.336
igo: 0.336
mbg: 0.336
qqq: 0.336
ccq: 0.336
skp: 0.336
pup: 0.336
npm: 0.336
xgb: 0.336
dfi: 0.336
oiv: 0.336
jni: 0.336
rivet: 0.336
getelementbyid: 0.336
hover: 0.336
reportformat: 0.336
contextcount: 0.336
elementcount: 0.336
entitycount: 0.336
footnotesreport: 0.336
segmentcount: 0.336
scenariocount: 0.336
tuplesreport: 0.336
unitcount: 0.336
myreport: 0.336
inputfil: 0.336
haspresentationlinkbas: 0.336
hascalculationlinkbas: 0.336
gwk: 0.336
rkg: 0.336
aul: 0.336
pzo: 0.336
srg: 0.336
brm: 0.336
dyg: 0.336
kva: 0.336
miq: 0.336
vpk: 0.336
yig: 0.336
dcr: 0.336
uko: 0.336
bko: 0.336
zfl: 0.336
oud: 0.336
vni: 0.336
mpn: 0.336
xza: 0.336
dav: 0.336
ilw: 0.336
iyv: 0.336
eso: 0.336
clp: 0.336
mdb: 0.336
cyj: 0.336
wbz: 0.336
snx: 0.336
let: 0.336
rgg: 0.336
wut: 0.336
yiy: 0.336
mwt: 0.336
mwh: 0.336
mdl: 0.336
efr: 0.336
ejk: 0.336
bkp: 0.336
bxg: 0.336
vgf: 0.336
sjv: 0.336
tzs: 0.336
jml: 0.336
mew: 0.336
pwv: 0.336
eej: 0.336
sqn: 0.336
kwi: 0.336
afz: 0.336

tzx: 0.198
mxsu: 0.198
wje: 0.198
ikeui: 0.198
qkht: 0.198
rstwxo: 0.198
cpyi: 0.198
ukqix: 0.198
hlo: 0.198
bajfo: 0.198
cgxo: 0.198
ioju: 0.198
atrven: 0.198
zixf: 0.198
wbgp: 0.198
cxwtkq: 0.198
kpcx: 0.198
zdlo: 0.198
kgbwqgxnu: 0.198
xan: 0.198
vycl: 0.198
pkxj: 0.198
wprusq: 0.198
fcvga: 0.198
ack: 0.198
xzt: 0.198
wuznnz: 0.198
eoxd: 0.198
tgb: 0.198
mzyx: 0.198
aiwyh: 0.198
mqsj: 0.198
opt: 0.198
rym: 0.198
qex: 0.198
oco: 0.198
sjnl: 0.198
jgb: 0.198
ewjma: 0.198
yxgm: 0.198
ocax: 0.198
bcz: 0.198
wpzl: 0.198
qwhg: 0.198
lvfe: 0.198
butexk: 0.198
cxj: 0.198
wewm: 0.198
mwue: 0.198
cop: 0.198
bugxpz: 0.198
jkx: 0.198
mskw: 0.198
nhxhkdqh: 0.198
yrh: 0.198
fhhhh: 0.198
wmgimg: 0.198
lkjb: 0.198
pwq: 0.198
qix: 0.198
chnj: 0.198
uyxo: 0.198
hvfw: 0.198
tkxi: 0.198
zcx: 0.198
oxi: 0.198
kuk: 0.198
pgxa: 0.198
bcxz: 0.198
ovc: 0.198
gpsh: 0.198
cxw: 0.198
eixjo: 0.198
nrooaai: 0.198
bwqix: 0.198
abao: 0.198
khop: 0.198
kyp: 0.198
ioakx: 0.198
gbwxf: 0.198
amk: 0.198
obex: 0.198

jpu: 0.198
lwhb: 0.198
udk: 0.198
sudx: 0.198
exa: 0.198
fsrh: 0.198
psg: 0.198
zkze: 0.198
ywl: 0.198
wwmh: 0.198
hrfp: 0.198
rcgt: 0.198
hbpi: 0.198
iglja: 0.198
rexjfnn: 0.198
pvd: 0.198
nlpw: 0.198
cah: 0.198
pdinv: 0.198
upsl: 0.198
ipku: 0.198
jcw: 0.198
oshah: 0.198
ugk: 0.198
zsuypgmmfq: 0.198
qna: 0.198
fhki: 0.198
fgyde: 0.198
jvmo: 0.198
zxrc: 0.198
hsztz: 0.198
gpupi: 0.198
npex: 0.198
ama: 0.198
ytpwk: 0.198
jdqnzv: 0.198
dmpz: 0.198
ayr: 0.198
dqi: 0.198
puv: 0.198
hmpp: 0.198
jvl: 0.198
hugx: 0.198
ahk: 0.198
mgqf: 0.198
mrd: 0.198
pwkv: 0.198
mvtdgsuwyi: 0.198
qgutp: 0.198
hxs: 0.198
xfw: 0.198
ego: 0.198
pmj: 0.198
khso: 0.198
mtdaki: 0.198
lzhq: 0.198
jcld: 0.198
wtq: 0.198
dktx: 0.198
ncbj: 0.198
jda: 0.198
rquf: 0.198
keq: 0.198
mmkbr: 0.198
vshd: 0.198
mnqhtnm: 0.198
wrqhgvs: 0.198
pkl: 0.198
pfn: 0.198
tyq: 0.198
xbr: 0.198
wfx: 0.198
nkdae: 0.198
zyj: 0.198
iiu: 0.198
yxnq: 0.198
ufv: 0.198
akqw: 0.198
zbm: 0.198
ybqt: 0.198
kqwq: 0.198
mvxxlk: 0.198
ayaz: 0.198


lqv: 0.198
siqlzmp: 0.198
khup: 0.198
kyq: 0.198
ascca: 0.198
sks: 0.198
jyr: 0.198
bzenci: 0.198
hwmq: 0.198
mjwa: 0.198
mwwk: 0.198
ufw: 0.198
uh: 0.198
rgyv: 0.198
gwi: 0.198
fgq: 0.198
uiy: 0.198
algl: 0.198
uyj: 0.198
blx: 0.198
fxzn: 0.198
mfoat: 0.198
xqr: 0.198
ybm: 0.198
woha: 0.198
ycu: 0.198
yzf: 0.198
wti: 0.198
ircpsdjt: 0.198
hwz: 0.198
ouz: 0.198
glcd: 0.198
gtf: 0.198
xuu: 0.198
hxg: 0.198
pqpfzz: 0.198
ovmw: 0.198
nozc: 0.198
lynn: 0.198
nwqi: 0.198
umth: 0.198
hxgi: 0.198
gev: 0.198
suv: 0.198
ohkd: 0.198
gcee: 0.198
kori: 0.198
hco: 0.198
xwqelp: 0.198
zwqi: 0.198
ghjond: 0.198
nkqs: 0.198
aw: 0.198
npdxt: 0.198
zcnym: 0.198
nay: 0.198
spj: 0.198
rcilb: 0.198
ljt: 0.198
mjmcqx: 0.198
ettie: 0.198
orwr: 0.198
xuxg: 0.198
ffj: 0.198
gku: 0.198
ykoz: 0.198
klc: 0.198
zgnwujwg: 0.198
eiw: 0.198
fcla: 0.198
fgv: 0.198
atvn: 0.198
mvfn: 0.198
rxxn: 0.198
geti: 0.198
kndp: 0.198
angw: 0.198
ntv: 0.198
qrub: 0.198
paqdl: 0.198
nkn: 0.198
duiw: 0.198
owtj: 0.198
couqcm: 0.198

mdubn: 0.198
ndhl: 0.198
epq: 0.198
itq: 0.198
uyn: 0.198
myztwsq: 0.198
advqvgyxuhxgmu: 0.198
mvjm: 0.198
omhzq: 0.198
lmq: 0.198
kyqc: 0.198
rbag: 0.198
kwu: 0.198
lrwa: 0.198
qxt: 0.198
hgqf: 0.198
ism: 0.198
yav: 0.198
rff: 0.198
gwd: 0.198
icr: 0.198
teaz: 0.198
bjl: 0.198
tkd: 0.198
osx: 0.198
avmopl: 0.198
fhsfo: 0.198
qhlod: 0.198
jse: 0.198
ocxu: 0.198
udci: 0.198
wqt: 0.198
ykhi: 0.198
est: 0.198
txo: 0.198
xtoeb: 0.198
bajf: 0.198
lnl: 0.198
dpr: 0.198
ynadx: 0.198
jdp: 0.198
tfci: 0.198
hoj: 0.198
tfkx: 0.198
oyx: 0.198
fwsr: 0.198
sdub: 0.198
zdxc: 0.198
hkv: 0.198
thc: 0.198
ptt: 0.198
ieqf: 0.198
dws: 0.198
zts: 0.198
zstl: 0.198
dbhkb: 0.198
gee: 0.198
mfjow: 0.198
dauxt: 0.198
hhsm: 0.198
ihe: 0.198
saic: 0.198
jor: 0.198
mlxaa: 0.198
kgofl: 0.198
prd: 0.198
vbjj: 0.198
icb: 0.198
mixf: 0.198
qts: 0.198
zsk: 0.198
uvc: 0.198
ekuv: 0.198
ctirp: 0.198
gye: 0.198
gra: 0.198
acdh: 0.198
kkt: 0.198
okbyt: 0.198
tuwgsri: 0.198
opi: 0.198
dwi: 0.198
cyrh: 0.198
jcp: 0.198

qgfpx: 0.198
jhqi: 0.198
mdmavg: 0.198
zem: 0.198
kvycvg: 0.198
oaz: 0.198
llorf: 0.198
maz: 0.198
rfk: 0.198
wwx: 0.198
ibinx: 0.198
sbk: 0.198
ifsuaou: 0.198
oof: 0.198
ubg: 0.198
msqx: 0.198
mvxg: 0.198
gnef: 0.198
srfkep: 0.198
ksgus: 0.198
wct: 0.198
yurcim: 0.198
iqk: 0.198
jmlo: 0.198
hpx: 0.198
hibf: 0.198
gwv: 0.198
noylhm: 0.198
nfif: 0.198
pnvvl: 0.198
qpgj: 0.198
vqs: 0.198
wbis: 0.198
khpxd: 0.198
zsaa: 0.198
qlbeca: 0.198
zcuk: 0.198
ellhjn: 0.198
ykw: 0.198
jtb: 0.198
qhnkme: 0.198
gcgu: 0.198
awdac: 0.198
etlb: 0.198
iumd: 0.198
uso: 0.198
mmr: 0.198
ocmn: 0.198
qig: 0.198
nntx: 0.198
uzkti: 0.198
vgcd: 0.198
pxk: 0.198
jjpnene: 0.198
hbd: 0.198
evi: 0.198
iscip: 0.198
hsg: 0.198
wgcz: 0.198
qyf: 0.198
mrhkm: 0.198
dpt: 0.198
knbrv: 0.198
gsk: 0.198
ovip: 0.198
zmte: 0.198
zwo: 0.198
lfpdk: 0.198
ydmcwal: 0.198
twth: 0.198
lcasr: 0.198
mimb: 0.198
wert: 0.198
zshj: 0.198
pli: 0.198
qzyomd: 0.198
kaegvec: 0.198
edpk: 0.198
cru: 0.198
foyw: 0.198
nbgo: 0.198
xuad: 0.198

<a id="doc_vector"></a>

## 5. Document Vector
Combine the implementation for ${\rm tf}(w,d)$ and ${\rm idf}(w)$ to compute a word-vector for each document in a corpus. Don't forget the zero-padding that is needed when a word appears in some documents but not in others. 

Implement the function `get_vector(tf, idf)`, where `tf` is the output of `get_tf` for a single document and `idf` is the output of `get_idf` for the corpus.

(*optional challenge: implement in one line!*)

<a href="#table_of_content">back to top</a>

In [23]:
def get_vector(tf, idf):
    # TO DO
    return np.array([ tf[w]*idf[w] for w in idf ])

In [24]:
# test your code
doc_vectors = [ get_vector(tf, idf) for tf in tfs ]

for v in doc_vectors:
    print(v)

[ 0.     0.218  0.218 ...,  0.     0.218  0.298]
[ 0.  0.  0. ...,  0.  0.  0.]
[ 0.208  0.     0.    ...,  0.     0.     0.436]
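
To illustrate why the zero padding matters, here is a minimal sketch on toy inputs. All words and numbers are hypothetical, and `to_vector` and `vocab` are illustrative names, not part of the exercise. The point is that every document is indexed by the same vocabulary in the same order, so the resulting vectors have equal length and comparable positions.

```python
import numpy as np
from collections import defaultdict

# Hypothetical toy inputs: idf covers the whole corpus vocabulary,
# while each tf only covers the words of one document.
idf = {"appl": 1.1, "ebay": 1.1, "sale": 0.0, "store": 0.4}
tf_a = defaultdict(int, {"appl": 1.5, "sale": 1.0})
tf_b = defaultdict(int, {"ebay": 1.2, "sale": 0.9, "store": 0.7})

# One fixed word order shared by every document
vocab = sorted(idf)

def to_vector(tf, idf, vocab):
    # Words missing from tf contribute 0 (the zero padding);
    # .get avoids inserting them into the defaultdict.
    return np.array([tf.get(w, 0) * idf[w] for w in vocab])

vec_a = to_vector(tf_a, idf, vocab)
vec_b = to_vector(tf_b, idf, vocab)

print(vocab)
print(vec_a)
print(vec_b)
```

Sorting the vocabulary is a design choice in this sketch. Iterating over the same `idf` dictionary for every document, as in the cell above, also gives a consistent order.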


<a id ="similarity"></a>

## 6. Similarity

Given two word-vectors $\vec u$ (or $u_i$) and $\vec v$ (or $v_i$), corresponding to two documents, we want to compute different similarity metrics. 

1. Cosine similarity, defined by 
$$
{\rm Sim}_{\cos} = \frac{\vec u \cdot \vec v}{|\vec u| |\vec v|}
$$

2. Jaccard similarity, defined by
$$
{\rm Sim}_{\rm Jaccard} = \frac{\sum_i \min\{u_i, v_i\}}{\sum_i \max\{u_i, v_i\}}
$$
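
A small worked example of both formulas, using two made-up three-dimensional vectors (not real document vectors), may help as a reference when checking an implementation:

```python
import numpy as np

# Hypothetical vectors, chosen so the arithmetic is easy to follow
u = np.array([1.0, 2.0, 0.0])
v = np.array([2.0, 1.0, 1.0])

# Cosine: u.v = 1*2 + 2*1 + 0*1 = 4, |u| = sqrt(5), |v| = sqrt(6)
cos_uv = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Jaccard: sum of element-wise minima = 1 + 1 + 0 = 2,
#          sum of element-wise maxima = 2 + 2 + 1 = 5
jac_uv = np.minimum(u, v).sum() / np.maximum(u, v).sum()

print(f"cosine  = {cos_uv:.4f}")   # 4 / sqrt(30)
print(f"jaccard = {jac_uv:.4f}")   # 2 / 5
```

Both formulas divide by a quantity that is zero when both vectors are all zeros, so a document with an empty vector would need separate handling.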

Implement the similarity functions `sim_cos(u,v)` and `sim_jac(u,v)`. Use `numpy.linalg.norm` to compute the norm of a vector and `np.dot` to compute dot products. `np.minimum` and `np.maximum` are also useful: they compute the element-wise minimum and maximum of two vectors.

(*optional challenge: implement both functions in one line!*)

<a href="#table_of_content">back to top</a>

In [25]:
from numpy.linalg import norm

def sim_cos(u,v):
    # TO DO
    return np.dot(u,v)/(norm(u)*norm(v))

def sim_jac(u,v):
    # TO DO
    return np.sum(np.minimum(u,v))/np.sum(np.maximum(u,v))

In [26]:
# test your code
# compute all the pairwise similarity metrics
size = len(doc_vectors)
matrix_cos = np.zeros((size,size))
matrix_jac = np.zeros((size,size))

for i in range(size):
    for j in range(size):
        u = doc_vectors[i]
        v = doc_vectors[j]
        matrix_cos[i][j] = sim_cos(u,v)
        matrix_jac[i][j] = sim_jac(u,v)
        
print("Cosine Similarity:")
print(matrix_cos)

print()
print("Jaccard Similarity:")
print(matrix_jac)

Cosine Similarity:
[[ 1.     0.027  0.035]
 [ 0.027  1.     0.045]
 [ 0.035  0.045  1.   ]]

Jaccard Similarity:
[[ 1.     0.024  0.031]
 [ 0.024  1.     0.041]
 [ 0.031  0.041  1.   ]]
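
The double loop above can also be written without explicit Python loops. The following is a sketch of one possible vectorized alternative, not the notebook's solution. The `vectors` array is a hypothetical stand-in for `doc_vectors`, and the sketch assumes no document vector is all zeros.

```python
import numpy as np

# Hypothetical stand-in for doc_vectors: one row per document
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 3.0],
])

# Cosine: normalize each row to unit length, then take all dot products at once
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
unit = vectors / norms            # assumes no all-zero rows
pairwise_cos = unit @ unit.T

# Jaccard: broadcast to shape (n, n, dim), then sum over the word axis
mins = np.minimum(vectors[:, None, :], vectors[None, :, :]).sum(axis=2)
maxs = np.maximum(vectors[:, None, :], vectors[None, :, :]).sum(axis=2)
pairwise_jac = mins / maxs

print(pairwise_cos)
print(pairwise_jac)
```

The broadcasting step builds intermediate arrays of size `n * n * dim`, which is fine for three documents but can become memory-hungry for a large corpus with a large vocabulary.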


### Good Job! You've finished all the exercises!

Here is an optional bonus challenge. We often need to debug other people's code to figure out what's wrong. It's particularly difficult when the code doesn't give errors but computes the wrong quantity. Can you describe why the following implementations for some of the exercises above may be wrong? Highlight the words at the bottom to reveal the solutions!

In [27]:
def get_idf_wrong(corpus, include_log=True):
    freq = defaultdict(int)
    for c in corpus:
        for w in c:
            freq[w] += 1
        
    N = len(corpus)
    if include_log:
        return { w:log(N/freq[w]) for w in freq }
    else:
        return { w:N/freq[w] for w in freq }


def get_sentiment_wrong(txt, wordlist):
    matching_words = [ w for w in wordlist if w in txt ]
    return len(matching_words)/len(txt)

def get_vectors_wrong(tf, idf):
    return np.array([ tf[w]*idf[w] for w in tf ])

# Solutions

Drag your mouse over the white space below this cell, and you'll see details about the solutions.  Or, if it's easier, just double-click on the white space below this cell, which will reveal the cell with hidden text.  Also, please check out the file [bagofwords_solutions.py](bagofwords_solutions.py).

<font color="white">
Solution

get_idf_wrong: the defaultdict freq counts the total number of occurrences of each word across all documents. We only want to count a word once per document in which it appears. Otherwise the count can exceed N, leading to a negative idf! 

get_sentiment_wrong: it loops over the wordlist and checks whether each word is in the document, so a word w that appears twice in the document is only counted once!

get_vectors_wrong: tf may not contain all the words in idf. We need zero padding for words that appear in idf but not in tf! 
</font> 
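
The first pitfall can be reproduced on a toy corpus. The sketch below assumes the idf is defined as $\log(N/{\rm df}(w))$, as suggested by the structure of `get_idf_wrong`. The function names `idf_total_counts` and `idf_document_counts` and the corpus are hypothetical and only serve this illustration.

```python
from collections import defaultdict
from math import log

def idf_total_counts(corpus):
    # Mirrors get_idf_wrong: every occurrence increments the count
    freq = defaultdict(int)
    for doc in corpus:
        for w in doc:
            freq[w] += 1
    n = len(corpus)
    return {w: log(n / freq[w]) for w in freq}

def idf_document_counts(corpus):
    # Counts each word at most once per document by iterating over set(doc)
    df = defaultdict(int)
    for doc in corpus:
        for w in set(doc):
            df[w] += 1
    n = len(corpus)
    return {w: log(n / df[w]) for w in df}

# Hypothetical corpus of two tokenized documents
toy_corpus = [["cash", "cash", "cash"], ["debt"]]

wrong = idf_total_counts(toy_corpus)
fixed = idf_document_counts(toy_corpus)

print(wrong["cash"])   # log(2/3): negative, since the count 3 exceeds N = 2
print(fixed["cash"])   # log(2/1): positive
```

For a word that appears only once in the whole corpus (`debt` here), the two versions agree, which is part of why this kind of bug is easy to miss.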