Unsupervised Learning using Skip Thought Vectors (Python)

This is, in my opinion, the most novel approach we've discussed here. The high-level pipeline is as follows:

Text Cleaning -> Encoder/Decoder -> K Means Clustering -> Extract Sentences Closest to Cluster Center

Again, there are two main concepts I want to discuss before jumping into the solution:

Skip Thought Vectors

Here, we use an encoder/decoder framework to generate feature vectors. Taking it from Kushal Chauhan's post, here is how the encoder and decoder layers are defined:

Encoder Network: The encoder is typically a GRU-RNN which generates a fixed length vector representation h(i) for each sentence S(i) in the input. The encoded representation h(i) is obtained by passing the final hidden state of the GRU cell (i.e. after it has seen the entire sentence) to multiple dense layers.
Decoder Network: The decoder network takes this vector representation h(i) as input and tries to generate two sentences - S(i-1) and S(i+1), which could occur before and after the input sentence respectively. Separate decoders are implemented for generation of previous and next sentences, both being GRU-RNNs. The vector representation h(i) acts as the initial hidden state for the GRUs of the decoder networks.
Similar to how Word2Vec embeddings are trained by predicting the surrounding words, Skip Thought Vectors are trained by predicting the sentences at positions t-1 and t+1. Once the model is trained, the learned representation (hidden layer) places similar sentences closer together, which makes clustering more effective.
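
To make the training objective concrete, here is a minimal sketch of how (previous, current, next) triples could be built from an ordered list of sentences. The helper name `make_skip_thought_triples` and the toy sentences are my own and are not part of the skipthoughts library; the GRU encoder and decoders that would consume these triples are left out.

```python
def make_skip_thought_triples(sentences):
    """Return (previous, current, next) triples for every sentence that
    has a neighbor on both sides. Hypothetical helper for illustration."""
    triples = []
    for i in range(1, len(sentences) - 1):
        triples.append((sentences[i - 1], sentences[i], sentences[i + 1]))
    return triples


toy_sentences = [
    "I got back from vacation.",
    "My inbox was overflowing.",
    "I spent the morning replying.",
    "By lunch it was empty.",
]

for prev_s, cur_s, next_s in make_skip_thought_triples(toy_sentences):
    # The encoder reads cur_s; the two decoders try to generate prev_s and next_s.
    print(f"encode: {cur_s!r} -> decode: {prev_s!r}, {next_s!r}")
```

The first and last sentences have only one neighbor, so in this sketch they appear only as decoding targets, never as the encoded sentence.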

I encourage you to read the original Skip-Thought Vectors paper for more detail.

K-Means Clustering

Most of you will be familiar with this form of unsupervised learning, but I want to elaborate on how it is used here and why it is interesting.

Each cluster has a center point (centroid) which, in the vector space, best represents the theme of that cluster. With this in mind, to create a summary we only need the sentence closest to the center of each cluster. The key is choosing a number of clusters that summarizes the content well, since it also determines how many sentences end up in the summary. Kushal's post recommends setting the number of clusters to 30% of the number of sentences; note that the code below instead computes the sentence count raised to the power 0.6, rounded up.
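
The selection step can be sketched without any libraries. The example below is only an illustration: the function names, the toy 2-D vectors and the hand-computed centers are my own, and rounding the 30% figure up is an assumption. In the actual notebook, scikit-learn's KMeans finds the centers and pairwise_distances_argmin_min does the nearest-sentence lookup.

```python
import math


def n_clusters_by_ratio(n_sentences, ratio=0.3):
    """Number of clusters as a fraction of the sentence count (at least 1).
    Rounding up is an assumption made for this sketch."""
    return max(1, math.ceil(n_sentences * ratio))


def closest_to_centers(centers, vectors):
    """For each center, return the index of the nearest vector
    (squared Euclidean distance)."""
    closest = []
    for c in centers:
        dists = [sum((a - b) ** 2 for a, b in zip(c, v)) for v in vectors]
        closest.append(dists.index(min(dists)))
    return closest


# Toy 2-D "sentence vectors" forming two obvious groups.
vectors = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
# Centers as k-means would compute them: the mean of each group.
centers = [(0.1, 0.0667), (5.0333, 5.0)]

print(n_clusters_by_ratio(len(vectors)))     # 30% of 6 sentences, rounded up
print(closest_to_centers(centers, vectors))  # one representative index per cluster
```

Each index returned by `closest_to_centers` points at the sentence that would represent its cluster in the summary.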

In [None]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem.porter import *
import re

In [None]:
passage = """
It’s Back to School time here in Colorado, which means both my son and I will be hanging up the swim shorts and kayak paddles and getting back to more serious business for a while.

It has been a slow and endlessly sunny and leisurely summer, and a nice break for both of us, which has been very relaxing and a great time for bonding.

But relaxation has its limits. At some point all that Chilling Out fades its way into Complacency, and our natural Human nature starts to work against us, telling us to conserve energy and not really do much of anything. And laziness begets more laziness, and life actually becomes less fun.

You can see this effect in our activities. I’ve only completed two blog posts over the entire summer holidays, and together we have put out only two YouTube videos. Spending more time at home and less at the MMM Headquarters squat rack has caused me to lose at least five pounds of leg muscle that I had wanted to keep. Little MM has spent a lot less time practicing on the upright bass and putting out songs, and a lot more time playing video games and getting sucked into the “dank memes” and “Trove” channels on Reddit.

It has been a fun break, but as the freshly polished school buses awaken with the sunrise, it will be even more fun to get our own lives cranking into a higher gear as well. And if you’re reading this, it means I am off to a great start!

Complacency Is Expensive

This laziness was affecting my financial life, and your financial life too. I had let thousands of dollars of uninvested cash build up in my checking account, where it was sitting around earning nothing. My credit card bills had come in, been automatically paid, and filed themselves away without me even reviewing them for fraudulent transactions or wussypants spending on my part. And I had a growing mini-mountain of things I need to do regarding insurance, accounting, and legal stuff in both my personal and business domains.

And yet once I got my act together last week, I cleaned up the whole mess and set things straight in less than an hour.

It’s not Just Me, it’s You

When I talk to friends and family, I notice a common theme: they tend to set up certain “hassle” things once, and then ignore them as long as possible unless some absolute crisis comes along and forces them to make a change.

“Oh, I just do all my insurance stuff with Jim Schmidt’s Insurance office downtown, because my parents referred me to him when I first moved out for college.

Even better, his wife Jane runs a loan brokerage, so she handles all our family’s mortgage needs!”

On this surface, this sounds fun and folksy and like a nice way to do business. And that is exactly the way I like to live: keeping my business relationships as casual and fun as I can. But when it comes to money, complacency can come at a price, so at the bare minimum we should find out exactly what price we are paying.

For example, just recently a coworking member came to me and asked for some financial help. And as always, I suggested we start by looking at big recurring expenses. So we dug into the details of her insurance and other major bills streaming in from ol’ Jim and Jane, and found an interesting breakdown:

Required liability coverage on a 2010 Subaru Forester: $580 per year
Optional collision and comprehensive coverage ($500 deductible): $360 per year
Home insurance on a 2000 square foot house ($500 deductible): $1450 per year
Mortgage interest on a $300,000 loan at 4.85%:  $14,550 per year
Student Loan interest on an old $35,000 student loan at 5.5%: $1925 per year
Total: $18,865 per year.

It’s no wonder my friend was having financial stress – she had interest and insurance costs that were soaking up half of a reasonable annual budget before she could even buy her first bit of groceries or clothing.

So, right there we did a quick round of phone calls and online quotes, and streamlined a bit of the insurance coverage by increasing the deductibles. Within 90 minutes (she did most of the work while I had a beer and swept the floors of the HQ), we had the following new set of options:

Subaru liability coverage: $380 per year ($200 savings) through Geico
Removal of collision and comprehensive (in the unlikely event of a crash, they could afford to replace the car with less than two months of income) ($360 savings)
Home insurance on a 2000 square foot house ($5000 deductible): $650 per year ($800 savings) through Safeco
Refinanced mortgage to 3.375% through Credible.com*: $10,125 per year ($4,425 savings)
Refinanced Student Loan (also Credible) to 3.85%: $1347 per year ($578 savings)
New total expenses: $12,502 ($6363 per year in savings!!)

It is hard to even express the importance of what just happened here.  My friend just did two hours of work in total while drinking a glass of wine,  and dropped her annual expenses by over $500 per month, or six thousand dollars per year. And she will of course invest these savings, which will then compound to about to about $86,000 every ten years. 

Even if she has to do this annual round of phone calls and websites once per year to maintain the best rates on everything, she will be earning about $3150 per hour for this work. Hence the bold title of this article, which you can now see is very conservative.

The Optimization Council


The first Optimization Council meeting at MMM HQ

So you’re convinced. $3150 is enough to get you to pick up the phone, but how do know who to call? Who is going to be your coach if you don’t live near Longmont and thus can’t just join the HQ and have Mr. Money Mustache tell you what to do?

The great news is that all of this knowledge already exists, right in your own circle of friends. To extract it, you just need to gather them together and get them to talk about it.

Earlier this month, I floated exactly this idea with the members of my coworking space, proposing that we form a group with the witty name “The Optimization Council.”

The Council would meet every now and then to talk through life’s biggest expenses and opportunities, and harvest the wisdom of the group so we can all benefit from the best ideas in each category.

The response to this idea was overwhelmingly positive. So we called a first “test” meeting earlier this month and a small group of us talked through the first few categories, sharing not just names like “I use Schmidt Insurance”, but details like, “We have $250,000 coverage with a $1,000 deductible and our premium is $589 per year.”

The meeting was so lively that we quickly ran out of time, but resolved to meet again soon to figure out more things together. I served as the scribe using a shared google doc – here’s a snapshot of that to give you an idea of our topics:

So Yes. There is some thinking and work involved. But there’s also an opportunity to drastically improve your short term cashflow and long-term wealth, and break your friends out of their cautious shell to help them get the same benefits.

As we learned long ago in Protecting your Money Mustache from Spendy Friends, most people tend towards complacency, and following along with the group. Which leaves a big gaping void at the top of the pyramid where the leadership role waits unfilled.

If you are bold enough to climb into this spot (which really means just sending a few emails and Facebook messages, procuring a box or two of wine, and making a large tray of high-end nachos for your guests), you can all reap the rewards for decades to come.

And instead of avoiding this little chore like a hassle, dive into it like a gigantic shower of fun and wealth. After all, this is pretty much the core attitude of Mustachianism Itself.

In the comments: we can start our own Optimization Council right here. If you have found a good deal on any of the categories of life, feel free to share a quick summary of your location (state), and details of the company and product/service/price that you found is the best. To avoid spam filtering, please use names but not direct links.

A Note about Credible:

Watchful readers may have noticed I also mentioned this company on Twitter recently. After a few months of skepticism that the world needed yet another financial company, I was convinced by some conversations with the people running it and a Zoom video of the customer experience from a senior employee, with some very candid commentary on their design choices.

I like it because they import the lending models from their large supply of hooked-up finance companies, then run the rate comparisons on their own server rather than farming out your personal information to each separate lender. It saves you from filling out multiple applications when collecting rates, and also saves you from getting on everyone’s spam list (they don’t sell your contact information, which is a rare thing among loan search engines).

It was a hard model for them to get going, because the banks naturally want to have your information so they can spam you.  But now that they have a growing presence in the market, lenders are forced to come through Credible to get access to this pool of qualified people. After enough testing with people I knew, I found the experience is worth recommending.

So I also signed this blog up with their referral program  – please see my Affiliates philosophy if you are curious or skeptical about how any of that works!

With all that said, if you want to try it out, here are the links:
"""

Text standardization

In [None]:
contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}

In [None]:
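# Try longer contractions first and escape the keys, so that "can't've"
# is not cut short by a match on its prefix "can't".
# Note: the keys use the straight apostrophe ('), so contractions written
# with a typographic apostrophe (’) are left unchanged.
contractions_re = re.compile(
    '(%s)' % '|'.join(re.escape(k) for k in sorted(contractions, key=len, reverse=True)))

def expand_contractions(s, contractions_dict=contractions):
    def replace(match):
        return contractions_dict[match.group(0)]
    return contractions_re.sub(replace, s)

sentences = sent_tokenize(passage)
sentences = [expand_contractions(i) for i in sentences]
# Replace line breaks with a space so words on either side are not glued together
sentences = [re.sub(r'\s*\n\s*', ' ', i).strip() for i in sentences]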
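
One thing to watch for with an alternation regex like the one above: Python's re module tries the alternatives left to right and takes the first one that matches, so a short key such as "can't" can win over a longer key such as "can't've" if it is listed first. The sketch below uses a three-entry stand-in dictionary (`toy_contractions`, my own name) to show the effect of sorting keys longest first.

```python
import re

# A tiny stand-in for the full contractions dictionary above.
toy_contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "don't": "do not",
}

# Longest keys first, so "can't've" is tried before its prefix "can't".
pattern = re.compile(
    "|".join(re.escape(k) for k in sorted(toy_contractions, key=len, reverse=True))
)

# Same keys in dictionary order, for comparison.
naive_pattern = re.compile("|".join(re.escape(k) for k in toy_contractions))


def expand(text, regex=pattern):
    return regex.sub(lambda m: toy_contractions[m.group(0)], text)


text = "I can't've seen it, and I don't mind."
print(expand(text))                 # longest-first ordering
print(expand(text, naive_pattern))  # dictionary ordering leaves a stray 've
```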

In [None]:
import skipthoughts

# You would need to download pre-trained models first
model = skipthoughts.load_model()

encoder = skipthoughts.Encoder(model)

In [None]:
encoded = encoder.encode(sentences)

In [None]:
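import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# Choose the number of clusters (= number of summary sentences) before fitting,
# so the loop below iterates over exactly the clusters that KMeans produced.
n_clusters = int(np.ceil(len(encoded)**0.6))
print(n_clusters)

kmeans = KMeans(n_clusters=n_clusters)
kmeans = kmeans.fit(encoded)

# Average position in the passage of the sentences in each cluster
avg = []
for j in range(n_clusters):
    idx = np.where(kmeans.labels_ == j)[0]
    avg.append(np.mean(idx))
# Index of the sentence closest to each cluster center
closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, encoded)
# Order the chosen sentences by where their cluster sits in the passage
ordering = sorted(range(n_clusters), key=lambda k: avg[k])
summary = ' '.join([sentences[closest[idx]] for idx in ordering])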

In [None]:
summary
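
The ordering step in the clustering cell is easy to miss: clusters come out of K-Means in arbitrary order, so the representative sentences are sorted by the average position of each cluster's members in the passage. Here is a small pure-Python sketch of that idea; the function name and the toy `labels` and `closest` values are hypothetical, and it assumes every cluster has at least one member.

```python
def order_clusters_by_position(labels, n_clusters):
    """Sort cluster ids by the mean index of the sentences assigned to them."""
    avg_position = []
    for j in range(n_clusters):
        members = [i for i, label in enumerate(labels) if label == j]
        avg_position.append(sum(members) / len(members))
    return sorted(range(n_clusters), key=lambda k: avg_position[k])


# Toy cluster assignment for six sentences (hypothetical values).
labels = [2, 2, 0, 0, 1, 1]
# closest[j] = index of the sentence nearest to the center of cluster j.
closest = [3, 4, 0]

ordering = order_clusters_by_position(labels, n_clusters=3)
print(ordering)                        # cluster ids in reading order
print([closest[j] for j in ordering])  # sentence indices in summary order
```

Cluster 2 covers the earliest sentences here, so its representative comes first in the summary even though it has the highest cluster id.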