# 1. Ways to represent words and their importance in a document

- Bags of words—Vectors of word counts or frequencies
- Bags of n-grams—Counts of word pairs (bigrams), triplets (trigrams), and so on 
- TF-IDF vectors—Word scores that better represent their importance

Each of these techniques can be applied separately or as part of an NLP pipeline. These are all statistical models in that they are frequency based. These “shallow” NLP machines are powerful and useful for many practical applications such as spam filtering and sentiment analysis.
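To make the difference between a bag of words and a bag of n-grams concrete, here is a minimal sketch. The helper `ngrams` and the toy sentence are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy sentence, already tokenized and lowercased.
toy_tokens = "the cat sat on the mat and the cat slept".split()

toy_bag_of_words = Counter(toy_tokens)               # single-word counts
toy_bag_of_bigrams = Counter(ngrams(toy_tokens, 2))  # word-pair counts

print(toy_bag_of_words.most_common(2))
print(toy_bag_of_bigrams.most_common(1))
```

A bag of bigrams keeps a little word order (for example, that "the" was followed by "cat"), at the cost of a much larger vocabulary.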

### TF-IDF = term frequency x inverse document frequency 

- Term frequencies are the counts of each word in a document (see CountVectorizer from sklearn.feature_extraction.text or Counter from collections) 
- Inverse document frequency means that you’ll divide each of those word counts by the number of documents in which the word occurs.
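The two bullets above can be sketched in a few lines. This follows the simple "count divided by document frequency" description given here. Widely used variants instead take a logarithm of the ratio of total documents to document frequency, so treat this as an illustration only. The toy corpus and the name `simple_tfidf` are assumptions.

```python
from collections import Counter

# Hypothetical toy corpus: three tiny, pre-tokenized documents.
toy_docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

# Term frequency: raw count of each word in each document.
term_counts = [Counter(doc) for doc in toy_docs]

# Document frequency: number of documents that contain each word.
doc_freq = Counter()
for counts in term_counts:
    doc_freq.update(counts.keys())

def simple_tfidf(term, doc_index):
    """Word count divided by the number of documents containing the word."""
    if doc_freq[term] == 0:
        return 0.0
    return term_counts[doc_index][term] / doc_freq[term]

print(simple_tfidf("the", 0))  # occurs in every document, so it is scaled down
print(simple_tfidf("mat", 0))  # occurs in only one document
```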



# 2. Bag of words

We have already used one-hot encoding of each word and then combined all those vectors with a binary OR (or clipped sum) to create a vector representation of a text. And this binary bag-of-words vector makes a great index for document retrieval when loaded into a data structure such as a Pandas DataFrame.
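As a reminder of how that binary vector is built, here is a small sketch. The vocabulary, the sentence, and the helper `one_hot` are hypothetical.

```python
# Hypothetical vocabulary and toy sentence.
vocab = ["cat", "dog", "other", "saw", "the"]
toy_tokens = "the cat saw the other cat".split()

def one_hot(token):
    """One-hot vector for a token: 1 in the token's slot, 0 elsewhere."""
    return [1 if word == token else 0 for word in vocab]

# Combine the one-hot vectors with an element-wise binary OR.
binary_bow = [0] * len(vocab)
for token in toy_tokens:
    binary_bow = [a | b for a, b in zip(binary_bow, one_hot(token))]

# A plain sum gives word counts; clipping the sum at 1 recovers the binary vector.
count_bow = [sum(column) for column in zip(*(one_hot(t) for t in toy_tokens))]
clipped_bow = [min(count, 1) for count in count_bow]

print(binary_bow)
print(count_bow)
```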

We also looked at an even more useful vector representation that counts the number of occurrences, or frequency, of each word in the given text. 

As a first approximation, we assume that the more times a word occurs, the more meaning it must contribute to that document. A document that refers to “wings” and “rudder” frequently may be more relevant to a problem involving jet airplanes or air travel than, say, a document that refers frequently to “cats” and “gravity.” Or if you have classified some words as expressing positive emotions (words like “good,” “best,” “joy,” and “fantastic”), then the more of those words a document contains, the more likely it is to have positive “sentiment.” You can imagine, though, how an algorithm that relied on these simple rules might be mistaken or led astray.
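A rough sketch of such a simple rule shows how it can be led astray. The function `positive_score` and both example sentences are made up for illustration; the word list comes from the paragraph above.

```python
POSITIVE_WORDS = {"good", "best", "joy", "fantastic"}

def positive_score(text):
    """Count tokens that appear in the positive word list (naive rule)."""
    tokens = text.lower().split()
    return sum(1 for token in tokens if token.strip('.,!?"') in POSITIVE_WORDS)

praise = "The food was good, the best!"
complaint = "This was not good, not the best, no joy at all."

print(positive_score(praise))
print(positive_score(complaint))
```

The complaint scores higher than the praise because counting words ignores negation and word order.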

## TF

In [1]:
story1 = '''The first story is about connecting the dots.

I dropped out of Reed College after the first 6 months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why did I drop out?

It started before I was born. My biological mother was a young, unwed college graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife. Except that when I popped out they decided at the last minute that they really wanted a girl. So my parents, who were on a waiting list, got a call in the middle of the night asking: “We have an unexpected baby boy; do you want him?” They said: “Of course.” My biological mother later found out that my mother had never graduated from college and that my father had never graduated from high school. She refused to sign the final adoption papers. She only relented a few months later when my parents promised that I would someday go to college.

And 17 years later I did go to college. But I naively chose a college that was almost as expensive as Stanford, and all of my working-class parents’ savings were being spent on my college tuition. After six months, I couldn’t see the value in it. I had no idea what I wanted to do with my life and no idea how college was going to help me figure it out. And here I was spending all of the money my parents had saved their entire life. So I decided to drop out and trust that it would all work out OK. It was pretty scary at the time, but looking back it was one of the best decisions I ever made. The minute I dropped out I could stop taking the required classes that didn’t interest me, and begin dropping in on the ones that looked interesting.

It wasn’t all romantic. I didn’t have a dorm room, so I slept on the floor in friends’ rooms, I returned Coke bottles for the 5¢ deposits to buy food with, and I would walk the 7 miles across town every Sunday night to get one good meal a week at the Hare Krishna temple. I loved it. And much of what I stumbled into by following my curiosity and intuition turned out to be priceless later on. Let me give you one example:

Reed College at that time offered perhaps the best calligraphy instruction in the country. Throughout the campus every poster, every label on every drawer, was beautifully hand calligraphed. Because I had dropped out and didn’t have to take the normal classes, I decided to take a calligraphy class to learn how to do this. I learned about serif and sans serif typefaces, about varying the amount of space between different letter combinations, about what makes great typography great. It was beautiful, historical, artistically subtle in a way that science can’t capture, and I found it fascinating.

None of this had even a hope of any practical application in my life. But 10 years later, when we were designing the first Macintosh computer, it all came back to me. And we designed it all into the Mac. It was the first computer with beautiful typography. If I had never dropped in on that single course in college, the Mac would have never had multiple typefaces or proportionally spaced fonts. And since Windows just copied the Mac, it’s likely that no personal computer would have them. If I had never dropped out, I would have never dropped in on this calligraphy class, and personal computers might not have the wonderful typography that they do. Of course it was impossible to connect the dots looking forward when I was in college. But it was very, very clear looking backward 10 years later.

Again, you can’t connect the dots looking forward; you can only connect them looking backward. So you have to trust that the dots will somehow connect in your future. You have to trust in something — your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life.'''


In [5]:
from nltk.tokenize import word_tokenize
tokens = word_tokenize(story1)
tokens

['The',
 'first',
 'story',
 'is',
 'about',
 'connecting',
 'the',
 'dots',
 '.',
 'I',
 'dropped',
 'out',
 'of',
 'Reed',
 'College',
 'after',
 'the',
 'first',
 '6',
 'months',
 ',',
 'but',
 'then',
 'stayed',
 'around',
 'as',
 'a',
 'drop-in',
 'for',
 'another',
 '18',
 'months',
 'or',
 'so',
 'before',
 'I',
 'really',
 'quit',
 '.',
 'So',
 'why',
 'did',
 'I',
 'drop',
 'out',
 '?',
 'It',
 'started',
 'before',
 'I',
 'was',
 'born',
 '.',
 'My',
 'biological',
 'mother',
 'was',
 'a',
 'young',
 ',',
 'unwed',
 'college',
 'graduate',
 'student',
 ',',
 'and',
 'she',
 'decided',
 'to',
 'put',
 'me',
 'up',
 'for',
 'adoption',
 '.',
 'She',
 'felt',
 'very',
 'strongly',
 'that',
 'I',
 'should',
 'be',
 'adopted',
 'by',
 'college',
 'graduates',
 ',',
 'so',
 'everything',
 'was',
 'all',
 'set',
 'for',
 'me',
 'to',
 'be',
 'adopted',
 'at',
 'birth',
 'by',
 'a',
 'lawyer',
 'and',
 'his',
 'wife',
 '.',
 'Except',
 'that',
 'when',
 'I',
 'popped',
 'out',
 'they

In [4]:
from collections import Counter
bag_of_words = Counter(tokens)
bag_of_words

Counter({'The': 2,
         'first': 4,
         'story': 1,
         'is': 1,
         'about': 4,
         'connecting': 1,
         'the': 31,
         'dots': 4,
         '.': 39,
         'I': 31,
         'dropped': 6,
         'out': 11,
         'of': 9,
         'Reed': 2,
         'College': 2,
         'after': 1,
         '6': 1,
         'months': 4,
         ',': 34,
         'but': 2,
         'then': 1,
         'stayed': 1,
         'around': 1,
         'as': 3,
         'a': 13,
         'drop-in': 1,
         'for': 4,
         'another': 1,
         '18': 1,
         'or': 2,
         'so': 3,
         'before': 2,
         'really': 2,
         'quit': 1,
         'So': 4,
         'why': 1,
         'did': 2,
         'drop': 2,
         '?': 2,
         'It': 5,
         'started': 1,
         'was': 14,
         'born': 1,
         'My': 2,
         'biological': 2,
         'mother': 3,
         'young': 1,
         'unwed': 1,
         'college': 10,
        

A collections.Counter object is an unordered collection, also called a bag or multiset. Since Python 3.7, a Counter (like a standard Python dict) remembers the order in which keys were first inserted, so it is usually displayed in the order that tokens first appeared in your text. That display order carries no meaning for the bag itself: two Counters with the same counts compare equal however their keys are ordered, so do not rely on the order of your tokens (keys) in a Counter. Use most_common() when you need a ranking.
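A short sketch of what "unordered" means in practice, using two made-up token lists that contain the same words in a different order:

```python
from collections import Counter

bag_a = Counter("the cat and the hat".split())
bag_b = Counter("hat the and cat the".split())  # same words, shuffled

print(bag_a == bag_b)          # equality ignores the order of the keys
print(bag_a.most_common(1))    # ask for an explicit ranking by count
```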

For short documents like this one, the unordered bag of words still contains a lot of information about the original intent of the sentence. And the information in a bag of words is sufficient to do some powerful things such as detect spam, compute sentiment (positivity, happiness, and so on), and even detect subtle intent, like sarcasm. It may be a bag, but it’s full of meaning and information.

In [8]:
bag_of_words.most_common(20)

[('.', 39),
 (',', 34),
 ('the', 31),
 ('I', 31),
 ('to', 19),
 ('that', 16),
 ('was', 14),
 ('and', 14),
 ('in', 14),
 ('a', 13),
 ('it', 12),
 ('out', 11),
 ('my', 11),
 ('college', 10),
 ('’', 10),
 ('of', 9),
 ('have', 9),
 ('had', 9),
 ('all', 8),
 ('on', 8)]

Specifically, **the number of times a word occurs in a given document is called the term frequency, commonly abbreviated TF**. 

```TF(t,d) = count of t in d / number of words in d```


Note. Strictly speaking, term frequency is the raw count. The formula above is the normalized form, in which the count of word occurrences is divided by the number of terms in the document. A normalized frequency is really a proportion (an estimated probability), so it should probably not be called frequency.
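The normalized formula above can be written as a small helper. The name `term_frequency` and the toy sentence are assumptions for illustration.

```python
from collections import Counter

def term_frequency(term, tokens):
    """Normalized TF: occurrences of `term` divided by the total token count."""
    if not tokens:
        return 0.0
    return Counter(tokens)[term] / len(tokens)

# Hypothetical toy document, already tokenized.
toy_tokens = "the faster harry got to the store the better".split()

print(round(term_frequency("the", toy_tokens), 4))
print(round(term_frequency("harry", toy_tokens), 4))
```

Note that the denominator is the total number of tokens in the document, not the number of distinct tokens.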

In [9]:
import nltk
nltk.download('stopwords',quiet=True)

True

In [10]:
stopwords = nltk.corpus.stopwords.words('english')
print(len(stopwords))
print(stopwords)

179
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than

In [11]:
tokens_no_stopwords = [word for word in tokens if word not in stopwords]

bag_of_words_no_stopwords = Counter(tokens_no_stopwords)
bag_of_words_no_stopwords.most_common(20)

[('.', 39),
 (',', 34),
 ('I', 31),
 ('college', 10),
 ('’', 10),
 ('never', 7),
 ('dropped', 6),
 ('later', 6),
 ('would', 6),
 ('It', 5),
 ('And', 5),
 ('life', 5),
 ('looking', 5),
 ('first', 4),
 ('dots', 4),
 ('months', 4),
 ('So', 4),
 ('decided', 4),
 ('parents', 4),
 ('every', 4)]
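Tokens such as 'I', 'It', 'And', and 'So' survive the filter above because the NLTK stop word list is all lowercase and the membership test is case sensitive. A minimal sketch of the effect, using a tiny hypothetical stand-in for the stop word list so that it runs without NLTK:

```python
# Hypothetical, tiny stand-in for the (lowercase) NLTK stop word list.
toy_stopwords = {"i", "it", "and", "so", "the", "was"}
toy_tokens = ["I", "dropped", "out", "and", "It", "was", "scary", "So", "I", "left"]

# Case-sensitive test: capitalized stop words slip through.
case_sensitive = [w for w in toy_tokens if w not in toy_stopwords]

# Lowercase each token before comparing, so 'I', 'It', and 'So' are removed too.
case_insensitive = [w.lower() for w in toy_tokens if w.lower() not in toy_stopwords]

print(case_sensitive)
print(case_insensitive)
```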

In [12]:
# Let's remove punctuation
# RegexpTokenizer = a tokenizer that splits a string using a regular expression, 
# which matches either the tokens or the separators between tokens.

from nltk.tokenize import RegexpTokenizer

# "\w" matches Unicode word characters
tokenizer = RegexpTokenizer(r"\w+")
tokens_no_punctuation = tokenizer.tokenize(story1)
tokens_no_punctuation

['The',
 'first',
 'story',
 'is',
 'about',
 'connecting',
 'the',
 'dots',
 'I',
 'dropped',
 'out',
 'of',
 'Reed',
 'College',
 'after',
 'the',
 'first',
 '6',
 'months',
 'but',
 'then',
 'stayed',
 'around',
 'as',
 'a',
 'drop',
 'in',
 'for',
 'another',
 '18',
 'months',
 'or',
 'so',
 'before',
 'I',
 'really',
 'quit',
 'So',
 'why',
 'did',
 'I',
 'drop',
 'out',
 'It',
 'started',
 'before',
 'I',
 'was',
 'born',
 'My',
 'biological',
 'mother',
 'was',
 'a',
 'young',
 'unwed',
 'college',
 'graduate',
 'student',
 'and',
 'she',
 'decided',
 'to',
 'put',
 'me',
 'up',
 'for',
 'adoption',
 'She',
 'felt',
 'very',
 'strongly',
 'that',
 'I',
 'should',
 'be',
 'adopted',
 'by',
 'college',
 'graduates',
 'so',
 'everything',
 'was',
 'all',
 'set',
 'for',
 'me',
 'to',
 'be',
 'adopted',
 'at',
 'birth',
 'by',
 'a',
 'lawyer',
 'and',
 'his',
 'wife',
 'Except',
 'that',
 'when',
 'I',
 'popped',
 'out',
 'they',
 'decided',
 'at',
 'the',
 'last',
 'minute',
 'that

In [13]:
stopwords = nltk.corpus.stopwords.words('english')
tokens_no_stopwords_no_punctuation = [word for word in tokens_no_punctuation if word not in stopwords]
bag_of_words_no_stopwords_no_punctuation = Counter(tokens_no_stopwords_no_punctuation)
bag_of_words_no_stopwords_no_punctuation.most_common(20)

[('I', 31),
 ('college', 10),
 ('never', 7),
 ('dropped', 6),
 ('later', 6),
 ('would', 6),
 ('It', 5),
 ('And', 5),
 ('life', 5),
 ('looking', 5),
 ('first', 4),
 ('dots', 4),
 ('months', 4),
 ('So', 4),
 ('decided', 4),
 ('parents', 4),
 ('every', 4),
 ('connect', 4),
 ('drop', 3),
 ('mother', 3)]

### normalized term frequency

In [14]:
# with stopwords
times_college_appears = bag_of_words['college']
times_college_appears

10

In [15]:
# The number of unique tokens from our text:
num_unique_words = len(bag_of_words)
num_unique_words

337

In [18]:
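# Caution: this divides by the number of unique tokens. The TF formula above
# uses the total number of tokens in the document, len(tokens), as the denominator.
tf = times_college_appears / num_unique_words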

# Round a number to a given precision in decimal digits.
round(tf, 4)

0.0297

In [19]:
# count percent
print('college =',round(tf*100, 2),'%')

college = 2.97 %


In [20]:
# without stopwords
times_college_appears = bag_of_words_no_stopwords['college']

# The number of unique tokens from our text:
num_unique_words = len(bag_of_words_no_stopwords)
tf = times_college_appears / num_unique_words
round(tf, 4)

0.0377

In [21]:
# count percent
print('college =',round(tf*100, 2),'%')

college = 3.77 %


In [22]:
# without stopwords and punctuation
times_college_appears = bag_of_words_no_stopwords_no_punctuation['college']

# The number of unique tokens from our text:
num_unique_words = len(bag_of_words_no_stopwords_no_punctuation)
tf = times_college_appears / num_unique_words

print(round(tf, 4))
print('college =',round(tf*100, 2),'%')

0.0394
college = 3.94 %


# Vectorizing

In [23]:
tokens_no_stopwords_no_punctuation

['The',
 'first',
 'story',
 'connecting',
 'dots',
 'I',
 'dropped',
 'Reed',
 'College',
 'first',
 '6',
 'months',
 'stayed',
 'around',
 'drop',
 'another',
 '18',
 'months',
 'I',
 'really',
 'quit',
 'So',
 'I',
 'drop',
 'It',
 'started',
 'I',
 'born',
 'My',
 'biological',
 'mother',
 'young',
 'unwed',
 'college',
 'graduate',
 'student',
 'decided',
 'put',
 'adoption',
 'She',
 'felt',
 'strongly',
 'I',
 'adopted',
 'college',
 'graduates',
 'everything',
 'set',
 'adopted',
 'birth',
 'lawyer',
 'wife',
 'Except',
 'I',
 'popped',
 'decided',
 'last',
 'minute',
 'really',
 'wanted',
 'girl',
 'So',
 'parents',
 'waiting',
 'list',
 'got',
 'call',
 'middle',
 'night',
 'asking',
 'We',
 'unexpected',
 'baby',
 'boy',
 'want',
 'They',
 'said',
 'Of',
 'course',
 'My',
 'biological',
 'mother',
 'later',
 'found',
 'mother',
 'never',
 'graduated',
 'college',
 'father',
 'never',
 'graduated',
 'high',
 'school',
 'She',
 'refused',
 'sign',
 'final',
 'adoption',
 'pape

In [24]:
bag_of_words_no_stopwords_no_punctuation

Counter({'The': 2,
         'first': 4,
         'story': 1,
         'connecting': 1,
         'dots': 4,
         'I': 31,
         'dropped': 6,
         'Reed': 2,
         'College': 2,
         '6': 1,
         'months': 4,
         'stayed': 1,
         'around': 1,
         'drop': 3,
         'another': 1,
         '18': 1,
         'really': 2,
         'quit': 1,
         'So': 4,
         'It': 5,
         'started': 1,
         'born': 1,
         'My': 2,
         'biological': 2,
         'mother': 3,
         'young': 1,
         'unwed': 1,
         'college': 10,
         'graduate': 1,
         'student': 1,
         'decided': 4,
         'put': 1,
         'adoption': 2,
         'She': 3,
         'felt': 1,
         'strongly': 1,
         'adopted': 2,
         'graduates': 1,
         'everything': 1,
         'set': 1,
         'birth': 1,
         'lawyer': 1,
         'wife': 1,
         'Except': 1,
         'popped': 1,
         'last': 1,
         'minute

In [25]:
story1_vector = []
story1_length = len(tokens_no_stopwords_no_punctuation)

for key, value in bag_of_words_no_stopwords_no_punctuation.most_common():
    story1_vector.append(value / story1_length)
    
story1_vector

[0.07542579075425791,
 0.024330900243309004,
 0.0170316301703163,
 0.014598540145985401,
 0.014598540145985401,
 0.014598540145985401,
 0.012165450121654502,
 0.012165450121654502,
 0.012165450121654502,
 0.012165450121654502,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.009732360097323601,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.0072992700729927005,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 0.004866180048661801,
 

In [26]:
story2 = '''My second story is about love and loss.

I was lucky — I found what I loved to do early in life. Woz and I started Apple in my parents’ garage when I was 20. We worked hard, and in 10 years Apple had grown from just the two of us in a garage into a $2 billion company with over 4,000 employees. We had just released our finest creation — the Macintosh — a year earlier, and I had just turned 30. And then I got fired. How can you get fired from a company you started? Well, as Apple grew we hired someone who I thought was very talented to run the company with me, and for the first year or so things went well. But then our visions of the future began to diverge and eventually we had a falling out. When we did, our Board of Directors sided with him. So at 30 I was out. And very publicly out. What had been the focus of my entire adult life was gone, and it was devastating.

I really didn’t know what to do for a few months. I felt that I had let the previous generation of entrepreneurs down — that I had dropped the baton as it was being passed to me. I met with David Packard and Bob Noyce and tried to apologize for screwing up so badly. I was a very public failure, and I even thought about running away from the valley. But something slowly began to dawn on me — I still loved what I did. The turn of events at Apple had not changed that one bit. I had been rejected, but I was still in love. And so I decided to start over.

I didn’t see it then, but it turned out that getting fired from Apple was the best thing that could have ever happened to me. The heaviness of being successful was replaced by the lightness of being a beginner again, less sure about everything. It freed me to enter one of the most creative periods of my life.

During the next five years, I started a company named NeXT, another company named Pixar, and fell in love with an amazing woman who would become my wife. Pixar went on to create the world’s first computer animated feature film, Toy Story, and is now the most successful animation studio in the world. In a remarkable turn of events, Apple bought NeXT, I returned to Apple, and the technology we developed at NeXT is at the heart of Apple’s current renaissance. And Laurene and I have a wonderful family together.

I’m pretty sure none of this would have happened if I hadn’t been fired from Apple. It was awful tasting medicine, but I guess the patient needed it. Sometimes life hits you in the head with a brick. Don’t lose faith. I’m convinced that the only thing that kept me going was that I loved what I did. You’ve got to find what you love. And that is as true for your work as it is for your lovers. Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven’t found it yet, keep looking. Don’t settle. As with all matters of the heart, you’ll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don’t settle.
'''

story3 = '''My third story is about death.

When I was 17, I read a quote that went something like: “If you live each day as if it was your last, someday you’ll most certainly be right.” It made an impression on me, and since then, for the past 33 years, I have looked in the mirror every morning and asked myself: “If today were the last day of my life, would I want to do what I am about to do today?” And whenever the answer has been “No” for too many days in a row, I know I need to change something.

Remembering that I’ll be dead soon is the most important tool I’ve ever encountered to help me make the big choices in life. Because almost everything — all external expectations, all pride, all fear of embarrassment or failure — these things just fall away in the face of death, leaving only what is truly important. Remembering that you are going to die is the best way I know to avoid the trap of thinking you have something to lose. You are already naked. There is no reason not to follow your heart.

About a year ago I was diagnosed with cancer. I had a scan at 7:30 in the morning, and it clearly showed a tumor on my pancreas. I didn’t even know what a pancreas was. The doctors told me this was almost certainly a type of cancer that is incurable, and that I should expect to live no longer than three to six months. My doctor advised me to go home and get my affairs in order, which is doctor’s code for prepare to die. It means to try to tell your kids everything you thought you’d have the next 10 years to tell them in just a few months. It means to make sure everything is buttoned up so that it will be as easy as possible for your family. It means to say your goodbyes.

I lived with that diagnosis all day. Later that evening I had a biopsy, where they stuck an endoscope down my throat, through my stomach and into my intestines, put a needle into my pancreas and got a few cells from the tumor. I was sedated, but my wife, who was there, told me that when they viewed the cells under a microscope the doctors started crying because it turned out to be a very rare form of pancreatic cancer that is curable with surgery. I had the surgery and I’m fine now.

This was the closest I’ve been to facing death, and I hope it’s the closest I get for a few more decades. Having lived through it, I can now say this to you with a bit more certainty than when death was a useful but purely intellectual concept:

No one wants to die. Even people who want to go to heaven don’t want to die to get there. And yet death is the destination we all share. No one has ever escaped it. And that is as it should be, because Death is very likely the single best invention of Life. It is Life’s change agent. It clears out the old to make way for the new. Right now the new is you, but someday not too long from now, you will gradually become the old and be cleared away. Sorry to be so dramatic, but it is quite true.

Your time is limited, so don’t waste it living someone else’s life. Don’t be trapped by dogma — which is living with the results of other people’s thinking. Don’t let the noise of others’ opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary.

When I was young, there was an amazing publication called The Whole Earth Catalog, which was one of the bibles of my generation. It was created by a fellow named Stewart Brand not far from here in Menlo Park, and he brought it to life with his poetic touch. This was in the late 1960s, before personal computers and desktop publishing, so it was all made with typewriters, scissors and Polaroid cameras. It was sort of like Google in paperback form, 35 years before Google came along: It was idealistic, and overflowing with neat tools and great notions.

Stewart and his team put out several issues of The Whole Earth Catalog, and then when it had run its course, they put out a final issue. It was the mid-1970s, and I was your age. On the back cover of their final issue was a photograph of an early morning country road, the kind you might find yourself hitchhiking on if you were so adventurous. Beneath it were the words: “Stay Hungry. Stay Foolish.” It was their farewell message as they signed off. Stay Hungry. Stay Foolish. And I have always wished that for myself. And now, as you graduate to begin anew, I wish that for you.'''

In [27]:
docs = [story1, story2, story3]

In [28]:
doc_tokens = []
for doc in docs:
    tokens_no_punctuation = tokenizer.tokenize(doc.lower())
    tokens_no_stopwords_no_punctuation = [word for word in tokens_no_punctuation if word not in stopwords]
    doc_tokens += [sorted(tokens_no_stopwords_no_punctuation)]

In [29]:
print(len(doc_tokens[0]), len(doc_tokens[1]), len(doc_tokens[2]))

345 263 375


In [30]:
all_doc_tokens = sum(doc_tokens, [])
len(all_doc_tokens)

983

In [31]:
lexicon = sorted(set(all_doc_tokens))
len(lexicon)

583

Each of your three document vectors will need to have 583 values, even if the document for that vector doesn’t contain all 583 words in your lexicon. Each token is assigned a “slot” in your vectors corresponding to its position in your lexicon.

In [32]:
lexicon

['000',
 '10',
 '17',
 '18',
 '1960s',
 '1970s',
 '2',
 '20',
 '30',
 '33',
 '35',
 '4',
 '5',
 '6',
 '7',
 'across',
 'adopted',
 'adoption',
 'adult',
 'adventurous',
 'advised',
 'affairs',
 'age',
 'agent',
 'ago',
 'almost',
 'along',
 'already',
 'always',
 'amazing',
 'amount',
 'anew',
 'animated',
 'animation',
 'another',
 'answer',
 'apologize',
 'apple',
 'application',
 'approach',
 'around',
 'artistically',
 'asked',
 'asking',
 'avoid',
 'away',
 'awful',
 'baby',
 'back',
 'backward',
 'badly',
 'baton',
 'beautiful',
 'beautifully',
 'become',
 'began',
 'begin',
 'beginner',
 'believe',
 'beneath',
 'best',
 'better',
 'bibles',
 'big',
 'billion',
 'biological',
 'biopsy',
 'birth',
 'bit',
 'board',
 'bob',
 'born',
 'bottles',
 'bought',
 'boy',
 'brand',
 'brick',
 'brought',
 'buttoned',
 'buy',
 'call',
 'called',
 'calligraphed',
 'calligraphy',
 'came',
 'cameras',
 'campus',
 'cancer',
 'capture',
 'catalog',
 'cells',
 'certainly',
 'certainty',
 'change',


In [33]:
# OrderedDict = Dictionary that remembers insertion order
from collections import OrderedDict
zero_vector = OrderedDict((token, 0) for token in lexicon)

# collections.OrderedDict
zero_vector

OrderedDict([('000', 0),
             ('10', 0),
             ('17', 0),
             ('18', 0),
             ('1960s', 0),
             ('1970s', 0),
             ('2', 0),
             ('20', 0),
             ('30', 0),
             ('33', 0),
             ('35', 0),
             ('4', 0),
             ('5', 0),
             ('6', 0),
             ('7', 0),
             ('across', 0),
             ('adopted', 0),
             ('adoption', 0),
             ('adult', 0),
             ('adventurous', 0),
             ('advised', 0),
             ('affairs', 0),
             ('age', 0),
             ('agent', 0),
             ('ago', 0),
             ('almost', 0),
             ('along', 0),
             ('already', 0),
             ('always', 0),
             ('amazing', 0),
             ('amount', 0),
             ('anew', 0),
             ('animated', 0),
             ('animation', 0),
             ('another', 0),
             ('answer', 0),
             ('apologize', 0),
            

In [34]:
len(zero_vector)

583

Now we’ll make copies of that base vector, update the values of the vector for each document, and store them in an array.

copy.copy() creates an independent copy, a separate instance of your zero vector, rather than reusing a reference (pointer) to the original object’s memory location. Otherwise you’d just be overwriting the same zero_vector with new values in each loop, and you wouldn’t have a fresh zero on each pass of the loop.
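The difference between a copy and a second reference can be seen in a small sketch. The three-word lexicon is a made-up stand-in for the real one.

```python
import copy
from collections import OrderedDict

zero = OrderedDict((token, 0) for token in ["apple", "college", "death"])

alias = zero             # a second name for the same object
fresh = copy.copy(zero)  # a new OrderedDict with the same keys and values

fresh["college"] = 10
print(zero["college"])   # unchanged, because `fresh` is a separate object

alias["apple"] = 3
print(zero["apple"])     # changed, because `alias` is `zero`
```

A shallow copy is enough here because the values are plain numbers. Nested mutable values would need copy.deepcopy().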

In [35]:
import copy

doc_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens_no_punctuation = tokenizer.tokenize(doc.lower())
    tokens_no_stopwords_no_punctuation = [word for word in tokens_no_punctuation if word not in stopwords]
    token_counts = Counter(tokens_no_stopwords_no_punctuation)
    
    for key, value in token_counts.items():
        # Scale each count by the lexicon size (a constant, so cosine similarity is unaffected).
        vec[key] = value / len(lexicon)
    doc_vectors.append(vec)

In [39]:
doc_vectors[0]

OrderedDict([('000', 0),
             ('10', 0.003430531732418525),
             ('17', 0.0017152658662092624),
             ('18', 0.0017152658662092624),
             ('1960s', 0),
             ('1970s', 0),
             ('2', 0),
             ('20', 0),
             ('30', 0),
             ('33', 0),
             ('35', 0),
             ('4', 0),
             ('5', 0.0017152658662092624),
             ('6', 0.0017152658662092624),
             ('7', 0.0017152658662092624),
             ('across', 0.0017152658662092624),
             ('adopted', 0.003430531732418525),
             ('adoption', 0.003430531732418525),
             ('adult', 0),
             ('adventurous', 0),
             ('advised', 0),
             ('affairs', 0),
             ('age', 0),
             ('agent', 0),
             ('ago', 0),
             ('almost', 0.0017152658662092624),
             ('along', 0),
             ('already', 0),
             ('always', 0),
             ('amazing', 0),
             ('amo

You have three vectors, one for each document. What can you do with them? 

Your document word-count vectors can do all the cool stuff any vector can do.

For a natural language document vector space, **the dimensionality of your vector space is the count of the number of distinct words that appear in the entire corpus**. 

For TF (and TF-IDF), we sometimes denote this dimensionality by a capital “K.” This number of distinct words is also the vocabulary size of your corpus, so in an academic paper it’ll usually be called “|V|.” 

You can then describe each document within this K-dimensional vector space by a K-dimensional vector. K = 583 in our three-document corpus.

We can measure the **Euclidean distance** between two vectors by subtracting one from the other and computing the length of the resulting difference vector. This is called the 2-norm distance.
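A minimal sketch of that calculation on plain Python lists. The function name `euclidean_distance` and the example vectors are assumptions.

```python
import math

def euclidean_distance(vec1, vec2):
    """Length (2-norm) of the difference between two equal-length vectors."""
    diff = [a - b for a, b in zip(vec1, vec2)]
    return math.sqrt(sum(d * d for d in diff))

print(euclidean_distance([3, 0], [0, 4]))
print(euclidean_distance([1, 2, 2], [0, 0, 0]))

# Doubling every count keeps the direction but moves the vector away.
print(euclidean_distance([1, 2], [2, 4]))
```

The last line illustrates why Euclidean distance is sensitive to document length: a document and a copy of it repeated twice are a nonzero distance apart.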

**Two vectors are “similar” if they share similar direction**. They might have similar magnitude (length), which would mean that the word count (term frequency) vectors are for documents of about the same length. 

**We’d like our estimate of document similarity to find use of the same words about the same number of times in similar proportions. An accurate estimate would give you confidence that the documents the vectors represent are probably talking about similar things.**

**Cosine similarity** is merely the cosine of the angle between two vectors (theta), which can be calculated from the Euclidean dot product using A·B = |A| |B| cosΘ:

cosΘ = A·B / (|A| |B|)

Cosine similarity is efficient to calculate because the dot product doesn’t require evaluation of any trigonometric functions. In addition, cosine similarity has a convenient range for most machine learning problems: -1 to +1. Because term frequencies are never negative, the similarity between two TF vectors falls between 0 and +1.

In [41]:
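import math

def cosine_sim(vec1, vec2):
    """Return the cosine similarity of two document vectors.

    Both arguments are dicts that map the same tokens, in the same order,
    to their term frequencies.
    """
    # Convert the dictionaries to lists so values can be matched by position.
    vec1 = list(vec1.values())
    vec2 = list(vec2.values())
    dot_prod = sum(v1 * v2 for v1, v2 in zip(vec1, vec2))
    mag_1 = math.sqrt(sum(x**2 for x in vec1))
    mag_2 = math.sqrt(sum(x**2 for x in vec2))
    # Note: raises ZeroDivisionError if either vector is all zeros.
    return dot_prod / (mag_1 * mag_2)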

In [38]:
[val for val in doc_vectors[0].values()]
# vec2 = [val for val in doc_vectors[1].values()]

# # dot_prod
# dot_prod = 0
# for i, v in enumerate(vec1):
#     dot_prod += v * vec2[i]
# dot_prod

[0,
 0.003430531732418525,
 0.0017152658662092624,
 0.0017152658662092624,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0.0017152658662092624,
 0.0017152658662092624,
 0.0017152658662092624,
 0.0017152658662092624,
 0.003430531732418525,
 0.003430531732418525,
 0,
 0,
 0,
 0,
 0,
 0,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0.0017152658662092624,
 0.0017152658662092624,
 0.0017152658662092624,
 0.0017152658662092624,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0.0017152658662092624,
 0.003430531732418525,
 0.003430531732418525,
 0,
 0,
 0.003430531732418525,
 0.0017152658662092624,
 0,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0.003430531732418525,
 0,
 0,
 0,
 0,
 0.003430531732418525,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0.0017152658662092624,
 0.0017152658662092624,
 0,
 0.0017152658662092624,
 0,
 0,
 0,
 0,
 0.0017152658662092624,
 0.0017152658662092624,
 0,
 0.0017152658662092624,
 0.005145797598627788,
 0.0017152658

https://medium.com/acing-ai/what-is-cosine-similarity-matrix-f0819e674ad1

So you need to take the dot product of the two vectors in question: multiply the elements of each vector pairwise, and then sum up those products. You then divide by the norm (magnitude or length) of each vector. The vector norm is the Euclidean distance from the tail to the head of the vector, that is, the square root of the sum of the squares of its elements. This normalized dot product, like the output of the cosine function, will be a value between -1 and 1. It’s the cosine of the angle between these two vectors. Once both vectors are scaled to unit length, this value is the length of one vector’s perpendicular projection onto the other. It gives you a value for how much the vectors point in the same direction.

**A cosine similarity of 1** represents normalized vectors that are identical, that is, vectors that point in exactly the same direction along all dimensions. The vectors may have different lengths or magnitudes, but they point in the same direction. Remember that you divided the dot product by the norm of each vector, and this division can happen before or after the dot product. Either way, both vectors are effectively scaled to a length of 1, so only their directions matter. **So the closer a cosine similarity value is to 1, the closer the two vectors are in angle**. 
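The claim that normalization can happen before or after the dot product can be checked on small numbers. This is a minimal sketch with hypothetical two-dimensional vectors; the helper names `norm` and `dot` are assumptions for this example.

```python
import math

def norm(vec):
    """Euclidean length (2-norm) of a vector."""
    return math.sqrt(sum(x * x for x in vec))

def dot(vec1, vec2):
    """Sum of the pairwise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(vec1, vec2))

a = [3, 4]
b = [4, 3]

# Option 1: take the dot product first, then divide by both norms.
cos_after = dot(a, b) / (norm(a) * norm(b))

# Option 2: scale each vector to unit length first, then take the dot product.
unit_a = [x / norm(a) for x in a]
unit_b = [x / norm(b) for x in b]
cos_before = dot(unit_a, unit_b)

# Scaling a vector changes its length but not its direction,
# so a vector and a longer copy of itself have a cosine similarity of 1.
scaled_a = [10 * x for x in a]
cos_scaled = dot(a, scaled_a) / (norm(a) * norm(scaled_a))

print(cos_after, cos_before, cos_scaled)
```

Both orderings give the same value up to floating-point rounding, and the scaled copy shows that differing lengths do not reduce the similarity.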

**For NLP document vectors that have a cosine similarity close to 1, you know that the documents are using similar words in similar proportion. So the documents whose document vectors are close to each other are likely talking about the same thing.**

**A cosine similarity of 0** represents two vectors that share no components. They are orthogonal: wherever one vector has a nonzero component, the other has a zero. For NLP TF vectors, this situation occurs only if the **two documents share no words in common**. Because these documents use completely different words, it is tempting to conclude that they are talking about completely different things. **But this doesn’t necessarily mean they have different meanings or topics, just that they use completely different words.**

**A cosine similarity of -1** represents two vectors that are anti-similar, completely opposite. They point in opposite directions. **This can never happen for simple word count (term frequency) vectors or even normalized TF vectors**. Counts of words can never be negative. So word count (term frequency) vectors will always be in the same “quadrant” of the vector space. None of the term frequency vectors can sneak around into one of the quadrants behind the tail of the other vectors. None of your term frequency vectors can have components (word frequencies) that are the negative of another term frequency vector, because term frequencies just can’t be negative.
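The two boundary cases for term frequency vectors can be sketched on toy data. The four-word lexicon, the count vectors, and the helper `cosine` below are hypothetical, and the random trial is only an illustrative check, not a proof.

```python
import math
import random

def cosine(vec1, vec2):
    """Cosine similarity of two equal-length, nonzero vectors."""
    dot_prod = sum(a * b for a, b in zip(vec1, vec2))
    mag_1 = math.sqrt(sum(a * a for a in vec1))
    mag_2 = math.sqrt(sum(b * b for b in vec2))
    return dot_prod / (mag_1 * mag_2)

# Hypothetical lexicon order: ["cats", "gravity", "rudder", "wings"]
doc_a = [2, 1, 0, 0]  # mentions only "cats" and "gravity"
doc_b = [0, 0, 3, 1]  # mentions only "rudder" and "wings"

# No shared words, so every pairwise product is zero.
no_overlap = cosine(doc_a, doc_b)

# Random non-negative count vectors never produce a negative similarity.
random.seed(0)
sims = []
for _ in range(1000):
    u = [random.randint(0, 5) for _ in range(4)]
    v = [random.randint(0, 5) for _ in range(4)]
    if any(u) and any(v):  # skip all-zero vectors to avoid dividing by zero
        sims.append(cosine(u, v))

print(no_overlap)
print(min(sims) >= 0)
```

Because every component is a count, every pairwise product is zero or positive, so the similarity of two term frequency vectors stays between 0 and 1.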

In [43]:
cosine_sim(doc_vectors[0],doc_vectors[2])

0.20447272394029653

# Exercise

tf(t, d) = (number of times t occurs in d) / (total number of tokens in d)

idf(t, D) = log(number of documents in D / number of documents in D containing t)

tfidf(t, d, D) = tf(t, d) * idf(t, D)
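To make the arithmetic concrete, here is a small worked example of the three formulas on made-up numbers. The counts, the corpus size, and the variable names are all hypothetical, and the natural logarithm is an assumption, since the formula above does not fix a base.

```python
import math

n_docs = 3  # hypothetical corpus size

# Hypothetical term A: occurs 3 times in a 100-token document
# and appears in only 1 of the 3 documents.
tf_a = 3 / 100
idf_a = math.log(n_docs / 1)
tfidf_a = tf_a * idf_a

# Hypothetical term B: occurs 10 times in the same document
# but appears in all 3 documents.
tf_b = 10 / 100
idf_b = math.log(n_docs / 3)  # log(1) is 0
tfidf_b = tf_b * idf_b

print(tfidf_a)
print(tfidf_b)
```

Term B is more frequent in the document, yet its score is zero because it appears in every document and so does nothing to distinguish this one.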

1) Write the functions:

- def computeTF(token_counts, lexicon)
- def computeIDF(docs)
- def computeTFIDF

2) Create TF-IDF vectors for the 3 stories.

3) Compare the stories.