# Notes
- Write code to train word embedding models, but don't commit the models, just leave a stub of a file that the model would ingest (do some cbow, skip-gram, fasttext, and maybe an edit-distance for some example terms.)

# Introduction

##### Main Reference
<a href= "http://www.aclweb.org/anthology/S17-2051"> SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity between Questions for Community Question Answering </a>

##### Gensim Softcosine 
https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

## Community Question Answering

#### Example Document 1

- **URL**: https://community.spiceworks.com/topic/328118
- **Subject**: need to unlock a password protected Excel 2010 workbook
- **Body**: I have a user who had forgotten the password for an Excel 2010 file.  She cannot open it.  I tried changing the name to a zip file and opening the workbook file with an xml editor, but I can't get a readable format to come up so I can see the line of text with the password (so I can delete it).  What I'm getting is this gobbledy-gook: ºʝZܡ M/ؽ(+蝤¤7

#### Example Document 2

- **URL**: https://community.spiceworks.com/topic/1766186
- **Subject**: Password protected Excel 2016 spreadsheet---user forgot password!
- **Body**: Face palm time.  A user has forgotten their password to an Excel 2016 spreadsheet. Anybody have any slick tool that will at least let them open in read-only mode?  Something to bust through the password would be excellent.

#### Example Document 3

- **URL**: https://community.spiceworks.com/topic/1984626
- **Subject**: How secure are password protected Excel files?
- **Body**: Once in a while if we need to send credentials to a third party we will use password protected Excel files sent via secure email (third party service where they have to login to see the email). I'm curious if the Excel file is secure enough by itself? Based on this info it looks like the default for Excel 2016 will be AES 256, which should be effectively secure at least against a brute force attack. \__url\__ "Lets you configure the CNG cipher algorithm that is used. The default is AES." " Lets you configure the number of bits to use when you create the cipher key. The default is 256 bits."

# Preprocessing

**Word**, **Term**, **Token** := Collection of characters 

**Document** := Collection of words

**Corpus** := Collection of documents

**Vocabulary** := Collection of distinct terms that appear in a corpus

Which tokens, documents, and vocabulary terms are allowed depends on preprocessing choices.
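As a minimal sketch of these definitions, the snippet below builds a toy corpus and its vocabulary. It assumes naive lowercasing and whitespace tokenization, which is an illustrative choice and not the preprocessing used later; the names `toy_corpus`, `tokenized_docs`, and `vocabulary` are hypothetical.

```python
# Toy corpus: a collection of documents (hypothetical example strings)
toy_corpus = [
    "user forgot the Excel password",
    "the Excel file is password protected",
]

# Each document becomes a collection of tokens (naive lowercase + whitespace split)
tokenized_docs = [doc.lower().split() for doc in toy_corpus]

# The vocabulary is the set of distinct terms that appear anywhere in the corpus
vocabulary = sorted(set(token for doc in tokenized_docs for token in doc))
print(vocabulary)
```

Changing the tokenization rule (for example, keeping letter case) changes which terms count as distinct, and therefore changes the vocabulary.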

## Preprocessing the Example Documents

We choose to concatenate subject and body 

In [305]:
doc1 = """need to unlock a password protected Excel 2010 workbook <p>I have a user who had forgotten the password for an Excel 2010 file.&nbsp; She cannot open it.&nbsp; I tried changing the name to a zip file and opening the workbook file with an xml editor, but I can&#39;t get a readable format to come up so I can see the line of text with the password (so I can delete it).&nbsp; What I&#39;m getting is this gobbledy-gook:</p><p>ºʝZܡ M/ؽ(+蝤¤7</p>"""
doc2 = """Password protected Excel 2016 spreadsheet---user forgot password! <p>Face palm time.  A user has forgotten their password to an Excel 2016 spreadsheet.<br><br>Anybody have any slick tool that will at least let them open in read-only mode?  Something to bust through the password would be excellent.</p>"""
doc3 = """How secure are password protected Excel files? <p>Once in a while if we need to send credentials to a third party we will use password protected Excel files sent via secure email (third party service where they have to login to see the email).<br><br></p><div>I'm curious if the Excel file is secure enough by itself? Based on this info it looks like the default for Excel 2016 will be AES 256, which should be effectively secure at least against a brute force attack.<div><a href="https://technet.microsoft.com/en-us/library/cc179125%28v=office.16%29.aspx?f=255&amp;MSPPError=-2147217396">https://technet.microsoft.com/en-us/library/cc179125%28v=office.16%29.aspx?f=255&amp;MSPPError=-2147...</a><br></div><div><ul><li><i>"Lets you configure the CNG cipher algorithm that is used. </i><b><i>The default is AES</i></b><i>."</i></li><li><i>"Lets you configure the number of bits to use when you create the cipher key. </i><b><i>The default is 256 bits</i></b><i>."</i></li></ul></div> </div>"""
docs = [doc1, doc2, doc3]
doc1

'need to unlock a password protected Excel 2010 workbook <p>I have a user who had forgotten the password for an Excel 2010 file.&nbsp; She cannot open it.&nbsp; I tried changing the name to a zip file and opening the workbook file with an xml editor, but I can&#39;t get a readable format to come up so I can see the line of text with the password (so I can delete it).&nbsp; What I&#39;m getting is this gobbledy-gook:</p><p>\xc2\xba\xca\x9d\xc2\x99Z\xdc\xa1 \xc2\x84M/\xd8\xbd(+\xe8\x9d\xa4\xc2\xa47</p>'

- Do you keep punctuation?
- Do you preserve letter case?
- Do you remove stopwords? 
- Do you impose minimum word frequencies? 
- Do you stem tokens?

Results can be sensitive to these choices to varying extents. These choices will likely be different for each corpus and application. 
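To make the sensitivity concrete, here is a small sketch that tokenizes one sentence from Example Document 2 under two different sets of choices. The `tokenize` helper and the `STOPWORDS` set are hypothetical and exist only for this illustration; they are not the gensim defaults.

```python
import re

# Hypothetical stopword list, for illustration only
STOPWORDS = {"a", "an", "the", "to", "has", "their"}

def tokenize(text, lowercase=True, remove_stopwords=False):
    """Extract runs of letters/digits; optionally lowercase and drop stopwords."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

sentence = "A user has forgotten their password to an Excel 2016 spreadsheet."

# Choice 1: keep case, keep stopwords
keep_all = tokenize(sentence, lowercase=False)
# Choice 2: lowercase and remove stopwords
stripped = tokenize(sentence, lowercase=True, remove_stopwords=True)

print(len(keep_all), keep_all)
print(len(stripped), stripped)
```

The same sentence yields 11 tokens under the first set of choices and 6 under the second, so every downstream count and similarity score is affected.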

## Gensim Default Preprocessing

In [236]:
import os
import gensim
DATA_DIR = 'data'

# Helper IO Function
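def list2txt(doc_list, filename):
    """Write one document per line to DATA_DIR/filename and return the file path."""
    if not os.path.isdir(DATA_DIR):
        os.makedirs(DATA_DIR)
    filepath = os.path.join(DATA_DIR, filename)
    # Use a context manager so the file is flushed and closed before gensim reads it
    with open(filepath, 'w') as outfile:
        outfile.write("\n".join(doc_list))
    return filepath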

default_target_corpus = gensim.corpora.TextCorpus(list2txt(docs, 'raw.txt'))

In [237]:
default_target_corpus.dictionary.items()
default_target_corpus.dictionary.id2token

{0: u'changing',
 1: u'come',
 2: u'delete',
 3: u'editor',
 4: u'excel',
 5: u'file',
 6: u'forgotten',
 7: u'format',
 8: u'getting',
 9: u'gobbledy',
 10: u'gook',
 11: u'line',
 12: u'nbsp',
 13: u'need',
 14: u'open',
 15: u'opening',
 16: u'password',
 17: u'protected',
 18: u'readable',
 19: u'text',
 20: u'tried',
 21: u'unlock',
 22: u'user',
 23: u'workbook',
 24: u'xml',
 25: u'zip',
 26: u'anybody',
 27: u'bust',
 28: u'excellent',
 29: u'face',
 30: u'forgot',
 31: u'let',
 32: u'mode',
 33: u'palm',
 34: u'read',
 35: u'slick',
 36: u'spreadsheet',
 37: u'time',
 38: u'tool',
 39: u'aes',
 40: u'algorithm',
 41: u'amp',
 42: u'aspx',
 43: u'attack',
 44: u'based',
 45: u'bits',
 46: u'brute',
 47: u'cipher',
 48: u'cng',
 49: u'com',
 50: u'configure',
 51: u'create',
 52: u'credentials',
 53: u'curious',
 54: u'default',
 55: u'div',
 56: u'effectively',
 57: u'email',
 58: u'files',
 59: u'force',
 60: u'href',
 61: u'https',
 62: u'info',
 63: u'key',
 64: u'lets',
 65

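Vocabulary terms are indexed document by document, in the order the documents appear in the corpus. As the output above shows, the new terms contributed by each document are in alphabetical order.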

## Custom Preprocessing

In [238]:
import re # python regular expression library
import bs4 # python html parsing library "beautiful soup" 

# Sample custom preprocessing sequence
DELIMITER_PATTERN = u'[!?,;:\t\\\\"\\(\\)\\\'\u2026\u201c\u2013\u2019\u2026\n]|\\s\\-\\s|\.\s'
TOKEN_PATTERN = r'(?u)[\_][a-zA-Z0-9\_]*|[a-zA-Z0-9][a-zA-Z0-9.]*\b'

def html_parser(html):
    try:
        html = re.sub(r"<img.*?>", " __img__ ", html)
        html = re.sub(r"<a.*?/a>", " __url__ ", html)
        soup = bs4.BeautifulSoup(html, "html.parser")
        for br in soup.find_all("br"):
            br.replace_with("\n")
        return soup.get_text()
    except:  
        return ''
    
def split_sentences(doc):
    try:
        delimiters = re.compile(DELIMITER_PATTERN)
        sentences = delimiters.split(doc)
        return sentences
    except:
        return []

def tokenizer(sentence, token_pattern=TOKEN_PATTERN, lowercase=True):
    try:
        token_pattern = re.compile(token_pattern)
        if lowercase:
            sentence = sentence.lower()
        return token_pattern.findall(sentence)

    except:
        return []

def clean_text(html, lowercase=True):
    text = html_parser(html)
    if lowercase:
        text = text.lower()

    sents = split_sentences(text)
    sents = [' '.join(tokenizer(sent, lowercase=lowercase)) for sent in sents]
    # Strip surrounding whitespace and drop sentences left empty after tokenization
    sents = [sent.strip() for sent in sents]
    sents = [sent for sent in sents if sent]
    return ' '.join(sents)

In [239]:
doc3

'How secure are password protected Excel files? <p>Once in a while if we need to send credentials to a third party we will use password protected Excel files sent via secure email (third party service where they have to login to see the email).<br><br></p><div>I\'m curious if the Excel file is secure enough by itself? Based on this info it looks like the default for Excel 2016 will be AES 256, which should be effectively secure at least against a brute force attack.<div><a href="https://technet.microsoft.com/en-us/library/cc179125%28v=office.16%29.aspx?f=255&amp;MSPPError=-2147217396">https://technet.microsoft.com/en-us/library/cc179125%28v=office.16%29.aspx?f=255&amp;MSPPError=-2147...</a><br></div><div><ul><li><i>"Lets you configure the CNG cipher algorithm that is used. </i><b><i>The default is AES</i></b><i>."</i></li><li><i>"Lets you configure the number of bits to use when you create the cipher key. </i><b><i>The default is 256 bits</i></b><i>."</i></li></ul></div>\xc2\xa0</div>'

In [240]:
clean_text(doc3)

u'how secure are password protected excel files once in a while if we need to send credentials to a third party we will use password protected excel files sent via secure email third party service where they have to login to see the email i m curious if the excel file is secure enough by itself based on this info it looks like the default for excel 2016 will be aes 256 which should be effectively secure at least against a brute force attack __url__ lets you configure the cng cipher algorithm that is used the default is aes lets you configure the number of bits to use when you create the cipher key the default is 256 bits'

In [241]:
# Subclass the gensim TextCorpus object to apply custom preprocessing
# Preprocessing functions need to be efficient if performance is a concern!
class CustomTextCorpus(gensim.corpora.TextCorpus):
  def get_texts(self):
    for doc in self.getstream():
        yield [word for word in clean_text(doc).split()]
  def __len__(self): 
    self.length = sum(1 for _ in self.get_texts())
    return self.length

preprocessed_docs = [clean_text(doc) for doc in docs]
custom_target_corpus = CustomTextCorpus(list2txt(preprocessed_docs, 'preprocessed.txt'))

In [242]:
custom_target_corpus.dictionary.items()
custom_target_corpus.dictionary.id2token

{0: u'2010',
 1: u'7',
 2: u'a',
 3: u'an',
 4: u'and',
 5: u'but',
 6: u'can',
 7: u'cannot',
 8: u'changing',
 9: u'come',
 10: u'delete',
 11: u'editor',
 12: u'excel',
 13: u'file',
 14: u'for',
 15: u'forgotten',
 16: u'format',
 17: u'get',
 18: u'getting',
 19: u'gobbledy',
 20: u'gook',
 21: u'had',
 22: u'have',
 23: u'i',
 24: u'is',
 25: u'it',
 26: u'line',
 27: u'm',
 28: u'name',
 29: u'need',
 30: u'of',
 31: u'open',
 32: u'opening',
 33: u'password',
 34: u'protected',
 35: u'readable',
 36: u'see',
 37: u'she',
 38: u'so',
 39: u't',
 40: u'text',
 41: u'the',
 42: u'this',
 43: u'to',
 44: u'tried',
 45: u'unlock',
 46: u'up',
 47: u'user',
 48: u'what',
 49: u'who',
 50: u'with',
 51: u'workbook',
 52: u'xml',
 53: u'zip',
 54: u'2016',
 55: u'any',
 56: u'anybody',
 57: u'at',
 58: u'be',
 59: u'bust',
 60: u'excellent',
 61: u'face',
 62: u'forgot',
 63: u'has',
 64: u'in',
 65: u'least',
 66: u'let',
 67: u'mode',
 68: u'only',
 69: u'palm',
 70: u'read',
 71: u'

# Documents and their Vector Space Representations

## Word Count Vectors

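A gensim corpus object is an iterable of documents, where each document is a list of `(vocab_index, count)` pairs, i.e. `(int, num)` tuples. Only terms that occur in the document are listed.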

In [303]:
TARGET_CORPUS = default_target_corpus
doc_vectors = [word_count_vector for word_count_vector in TARGET_CORPUS]
doc_vector = doc_vectors[1]
doc_vector

[(4, 2),
 (6, 1),
 (14, 1),
 (16, 4),
 (17, 1),
 (22, 2),
 (26, 1),
 (27, 1),
 (28, 1),
 (29, 1),
 (30, 1),
 (31, 1),
 (32, 1),
 (33, 1),
 (34, 1),
 (35, 1),
 (36, 2),
 (37, 1),
 (38, 1)]
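The sparse `(vocab_index, count)` representation shown above can be sketched without gensim. This is a simplified illustration rather than gensim's implementation: the helpers `build_vocab` and `to_bow` are hypothetical, and ids are assigned here strictly in order of first appearance.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign integer ids to terms in order of first appearance."""
    token2id = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(tokens, token2id):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Hypothetical pre-tokenized documents
toy_docs = [
    ["excel", "password", "user"],
    ["password", "excel", "password", "spreadsheet"],
]
token2id = build_vocab(toy_docs)
bow = to_bow(toy_docs[1], token2id)
print(bow)  # [(0, 1), (1, 2), (3, 1)]
```

Term 2 ("user") does not occur in the second document, so it is simply absent from the list instead of being stored with a zero count.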

In [304]:
import pandas as pd
pd.DataFrame([(k,v,TARGET_CORPUS.dictionary.id2token[k]) for k,v in doc_vector], 
             columns=['vocab_index', 'word_count', 'token'])

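| | vocab_index | word_count | token |
|---|---|---|---|
| 0 | 4 | 2 | excel |
| 1 | 6 | 1 | forgotten |
| 2 | 14 | 1 | open |
| 3 | 16 | 4 | password |
| 4 | 17 | 1 | protected |
| 5 | 22 | 2 | user |
| 6 | 26 | 1 | anybody |
| 7 | 27 | 1 | bust |
| 8 | 28 | 1 | excellent |
| 9 | 29 | 1 | face |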


## TF-IDF Vectors

TF-IDF = Term Frequency-Inverse Document Frequency 

https://en.wikipedia.org/wiki/Tf-idf

Discount the frequency of terms common to many documents in the (training/target) corpus. 

In [298]:
# IDFs come from the small target corpus
target_tfidf = gensim.models.TfidfModel(TARGET_CORPUS, dictionary=TARGET_CORPUS.dictionary, normalize=True)
next(iter(target_tfidf[TARGET_CORPUS]))

[(0, 0.18027306351311892),
 (1, 0.18027306351311892),
 (2, 0.18027306351311892),
 (3, 0.18027306351311892),
 (5, 0.1996002719256453),
 (6, 0.0665334239752151),
 (7, 0.18027306351311892),
 (8, 0.18027306351311892),
 (9, 0.18027306351311892),
 (10, 0.18027306351311892),
 (11, 0.18027306351311892),
 (12, 0.5408191905393568),
 (13, 0.0665334239752151),
 (14, 0.0665334239752151),
 (15, 0.18027306351311892),
 (18, 0.18027306351311892),
 (19, 0.18027306351311892),
 (20, 0.18027306351311892),
 (21, 0.18027306351311892),
 (22, 0.0665334239752151),
 (23, 0.36054612702623784),
 (24, 0.18027306351311892),
 (25, 0.18027306351311892)]

In [310]:
# Term mapping logic

In [259]:
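# IDFs should come from the large training corpus.
# NOTE: no training corpus is loaded in this notebook; TfidfModel() without arguments
# is left uninitialized, so pass the training corpus (or its dictionary) here before use.
training_tfidf = gensim.models.TfidfModel()
training_tfidf[TARGET_CORPUS]
next(iter(training_tfidf[TARGET_CORPUS]))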

[(0, 0.18027306351311892),
 (1, 0.18027306351311892),
 (2, 0.18027306351311892),
 (3, 0.18027306351311892),
 (5, 0.1996002719256453),
 (6, 0.0665334239752151),
 (7, 0.18027306351311892),
 (8, 0.18027306351311892),
 (9, 0.18027306351311892),
 (10, 0.18027306351311892),
 (11, 0.18027306351311892),
 (12, 0.5408191905393568),
 (13, 0.0665334239752151),
 (14, 0.0665334239752151),
 (15, 0.18027306351311892),
 (18, 0.18027306351311892),
 (19, 0.18027306351311892),
 (20, 0.18027306351311892),
 (21, 0.18027306351311892),
 (22, 0.0665334239752151),
 (23, 0.36054612702623784),
 (24, 0.18027306351311892),
 (25, 0.18027306351311892)]

In [302]:
# Note, gensim will eliminate terms that appear in every document
# This probably is not a concern when you are dealing with even slightly larger corpora 
test_docs = ['foo', 'foo bar', 'foo bar baz']
test_texts = [[word for word in doc.lower().split()] for doc in test_docs]
test_dict = gensim.corpora.Dictionary(test_texts)
test_corpus = [test_dict.doc2bow(text) for text in test_texts]
test_tfidf = gensim.models.TfidfModel(test_corpus)
[doc for doc in test_tfidf[test_corpus]]

[[], [(1, 1.0)], [(1, 0.34624155305796134), (2, 0.9381453975456102)]]
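The output above can be reproduced by hand, which shows why `foo` disappears. The sketch below assumes the weighting is the raw term count times $\log_2(D/\mathrm{df})$, followed by L2 normalization; that assumption matches the numbers printed above, but the function `tfidf_vectors` is a hypothetical illustration and not gensim's code.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """TF-IDF with raw counts, idf = log2(D / df), and L2 normalization."""
    num_docs = float(len(tokenized_docs))
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    vectors = []
    for tokens in tokenized_docs:
        weights = {}
        for term, count in Counter(tokens).items():
            idf = math.log(num_docs / doc_freq[term], 2)
            if idf > 0:  # a term present in every document has idf = 0 and is dropped
                weights[term] = count * idf
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vectors

test_docs = ['foo', 'foo bar', 'foo bar baz']
vectors = tfidf_vectors([doc.split() for doc in test_docs])
print(vectors)
```

Here `foo` occurs in all three documents, so its IDF is $\log_2(3/3) = 0$ and it contributes nothing; the first document, which contains only `foo`, becomes an empty vector.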

## Weighted-Average of Word Vectors 

In [None]:
# If you had a vector representation of words...

# Words and their Vector Space Representations

We include code needed to generate the word-embedding models in the [appendix](#appendix)

# Question Similarity

## Notation
- $N$: Size of the vocabulary 
- $X_1, X_2$: Document vectors, $N\times 1$ dimensional  
- $M$: Term-similarity matrix, $N\times N$ dimensional

## Dot Product
Sum of the pairwise products,

$$X_1\cdot X_2 = \begin{pmatrix}X_{11} \\ \vdots \\ X_{1N}\end{pmatrix} \cdot \begin{pmatrix}X_{21} \\ \vdots \\ X_{2N}\end{pmatrix}= X_{11} (X_{21}) + X_{12}(X_{22}) + \cdots + X_{1N}(X_{2N}) = \sum_{i=1}^N X_{1i}(X_{2i}).$$

$$\begin{pmatrix}1 \\ 2 \\ 3 \end{pmatrix} \cdot \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = 1(0)+2(1)+3(1)=5.$$

## Cosine Similarity

\begin{eqnarray} 
\cos(X_1, X_2) &=& \frac{X_1\cdot X_2}{||X_1||\text{ }||X_2||}\\
               &=& \frac{X_1^T X_2}{\sqrt{X_1^T X_1}\sqrt{X_2^T X_2}}
\end{eqnarray}
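A minimal pure-Python sketch of both formulas, applied to the worked vectors from the dot product example; the helper names `dot` and `cosine` are chosen here for illustration.

```python
import math

def dot(x, y):
    """Sum of the pairwise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

x1 = [1, 2, 3]
x2 = [0, 1, 1]
print(dot(x1, x2))     # 1(0) + 2(1) + 3(1) = 5
print(cosine(x1, x2))  # 5 / (sqrt(14) * sqrt(2)), roughly 0.9449
```

Because the norms divide out document length, scaling either vector by a positive constant leaves the cosine unchanged.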


In [273]:
cosim_index = gensim.similarities.MatrixSimilarity(TARGET_CORPUS)
cosim_index[TARGET_CORPUS]

array([[1.        , 0.42794803, 0.19976044],
       [0.42794803, 1.        , 0.205169  ],
       [0.19976044, 0.205169  , 1.        ]], dtype=float32)

In [300]:
cosim_index = gensim.similarities.MatrixSimilarity(target_tfidf[TARGET_CORPUS])
cosim_index[target_tfidf[TARGET_CORPUS]]

array([[1.        , 0.02395141, 0.00793372],
       [0.02395141, 1.        , 0.        ],
       [0.00793372, 0.        , 1.        ]], dtype=float32)

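Notice that under TF-IDF weighting, documents 2 and 3 have zero similarity. The raw-count similarity above was nonzero, so they do share terms, but every term they share also appears in document 1 and therefore receives zero IDF weight.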

Problematic? What about these two sentences: 

1. "Manifolds are very exciting."
2. "Geometry is super interesting."

These sentences have zero term overlap, but express related meanings. They are "semantically similar". 

In [None]:
cosim_index = gensim.similarities.MatrixSimilarity(training_tfidf[TARGET_CORPUS])
cosim_index[training_tfidf[TARGET_CORPUS]]

## Soft Cosine Similarity
$$ \cos_M (X_1, X_2) = \frac{X_1^T M X_2}{\sqrt{X_1^T M X_1}\sqrt{X_2^T M X_2}}$$
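The sketch below applies this formula to the two example sentences from the cosine similarity section. The similarity values in `M` are hand-picked assumptions for illustration only, not learned from data, and the helpers `soft_dot` and `soft_cosine` are hypothetical names.

```python
import math

def soft_dot(x, m, y):
    """Compute x^T M y for dense lists x, y and a nested-list matrix m."""
    return sum(x[i] * m[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def soft_cosine(x, y, m):
    """Soft cosine: x^T M y normalized by sqrt(x^T M x) * sqrt(y^T M y)."""
    return soft_dot(x, m, y) / (math.sqrt(soft_dot(x, m, x)) * math.sqrt(soft_dot(y, m, y)))

# Vocabulary for the two sentences (stopwords "are" and "is" dropped for brevity)
vocab = ["manifolds", "very", "exciting", "geometry", "super", "interesting"]
x1 = [1, 1, 1, 0, 0, 0]  # "Manifolds are very exciting."
x2 = [0, 0, 0, 1, 1, 1]  # "Geometry is super interesting."

n = len(vocab)
identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Start from the identity and add hypothetical cross-term similarities
M = [row[:] for row in identity]
assumed_pairs = [("manifolds", "geometry", 0.6),
                 ("very", "super", 0.8),
                 ("exciting", "interesting", 0.7)]
for a, b, sim in assumed_pairs:
    i, j = vocab.index(a), vocab.index(b)
    M[i][j] = M[j][i] = sim

plain = soft_cosine(x1, x2, identity)  # M = I reduces to ordinary cosine: 0.0
soft = soft_cosine(x1, x2, M)          # (0.6 + 0.8 + 0.7) / 3, roughly 0.7
print(plain, soft)
```

With $M$ equal to the identity the soft cosine reduces to the ordinary cosine, so the two sentences score 0; once related terms are given nonzero similarity, the score becomes positive even though no term is shared.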

### Term Similarity Matrix

$M=...$

## Weighted-Average of Word Vectors

# Question Clustering

# Appendix <a name="appendix"></a>

## Imports

In [None]:
import bs4
import re
import csv
import os
import json
import pandas as pd
import gensim
from gensim.interfaces import TransformationABC
import scipy as sp
from sklearn.metrics.pairwise import linear_kernel
import sqlalchemy

<img src="images/top_10_graph.png">