# Notes
- Write code to train word-embedding models, but don't commit the models; just leave a stub of the file each model would ingest. Cover CBOW, skip-gram, fastText, and maybe an edit distance for some example terms.


# Talk Title Here

**Main Reference** SimBow at SemEval-2017 Task 3: Soft-Cosine Semantic Similarity
between Questions for Community Question Answering http://www.aclweb.org/anthology/S17-2051

**Gensim Softcosine Notebook** https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/soft_cosine_tutorial.ipynb

## Introduction

### Community Question Answering

## Documents and their Vector Space Representations

**Word**, **Term**, **Token** := Collection of characters (allowable collections depend upon preprocessing choices) 

**Document** := Collection of words

**Corpus** := Collection of documents

**Vocabulary** := Collection of distinct terms that appear in a corpus
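
The definitions above can be made concrete with a small sketch. The toy corpus and the helper name `to_count_vector` are assumptions made for illustration; they are not part of the talk's data or of gensim.

```python
# Toy corpus (hypothetical data): each document is a list of tokens.
corpus = [
    ["user", "forgot", "excel", "password"],
    ["excel", "password", "protected", "file", "password"],
]

# Vocabulary: distinct terms in the corpus, indexed in order of first appearance.
vocabulary = {}
for document in corpus:
    for term in document:
        vocabulary.setdefault(term, len(vocabulary))

def to_count_vector(document):
    """Represent a document as term counts over the vocabulary."""
    vector = [0] * len(vocabulary)
    for term in document:
        vector[vocabulary[term]] += 1
    return vector

print(vocabulary)
print([to_count_vector(document) for document in corpus])
```

Each document becomes a vector with one entry per vocabulary term, which is the representation the similarity measures later in these notes operate on.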

#### Example Document 1

- **URL**: https://community.spiceworks.com/topic/328118
- **Subject**: need to unlock a password protected Excel 2010 workbook
- **Body**: I have a user who had forgotten the password for an Excel 2010 file.  She cannot open it.  I tried changing the name to a zip file and opening the workbook file with an xml editor, but I can't get a readable format to come up so I can see the line of text with the password (so I can delete it).  What I'm getting is this gobbledy-gook: ºʝZܡ M/ؽ(+蝤¤7

#### Example Document 2

- **URL**: https://community.spiceworks.com/topic/1766186
- **Subject**: Password protected Excel 2016 spreadsheet---user forgot password!
- **Body**: Face palm time.  A user has forgotten their password to an Excel 2016 spreadsheet. Anybody have any slick tool that will at least let them open in read-only mode?  Something to bust through the password would be excellent.

#### Example Document 3

- **URL**: https://community.spiceworks.com/topic/1984626
- **Subject**: How secure are password protected Excel files?
- **Body**: \__img\__ Once in a while if we need to send credentials to a third party we will use password protected Excel files sent via secure email (third party service where they have to login to see the email). I'm curious if the Excel file is secure enough by itself? Based on this info it looks like the default for Excel 2016 will be AES 256, which should be effectively secure at least against a brute force attack. \__url\__ "Lets you configure the CNG cipher algorithm that is used. The default is AES." " Lets you configure the number of bits to use when you create the cipher key. The default is 256 bits."

We choose to concatenate the subject and body of each post into a single document.

In [67]:
doc1 = 'need to unlock a password protected Excel 2010 workbook I have a user who had forgotten the password for an Excel 2010 file.  She cannot open it.  I tried changing the name to a zip file and opening the workbook file with an xml editor, but I can\'t get a readable format to come up so I can see the line of text with the password (so I can delete it).  What I\'m getting is this gobbledy-gook: ºʝZܡ M/ؽ(+蝤¤7'
doc2 = 'Password protected Excel 2016 spreadsheet---user forgot password! Face palm time.  A user has forgotten their password to an Excel 2016 spreadsheet. Anybody have any slick tool that will at least let them open in read-only mode? Something to bust through the password would be excellent.'
doc3 = 'How secure are password protected Excel files? __img__ Once in a while if we need to send credentials to a third party we will use password protected Excel files sent via secure email (third party service where they have to login to see the email). I\'m curious if the Excel file is secure enough by itself? Based on this info it looks like the default for Excel 2016 will be AES 256, which should be effectively secure at least against a brute force attack. __url__ "Lets you configure the CNG cipher algorithm that is used. The default is AES." " Lets you configure the number of bits to use when you create the cipher key. The default is 256 bits."'
docs = [doc1, doc2, doc3]

### Preprocessing
- Do you keep punctuation?
- Do you preserve letter case?
- Do you remove stopwords? 
- Do you impose minimum word frequencies? 
- Do you stem tokens? 

How sensitive the results are to each of these choices varies, and the right choices will likely differ for each corpus and application.
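
As a rough illustration of how much these choices matter, the sketch below tokenizes one sentence from Example Document 2 under two different settings. The `tokenize` helper and the tiny `STOPWORDS` set are hypothetical and exist only for this example; they are not gensim's defaults.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "has", "the", "their", "to"}

def tokenize(text, lowercase=True, remove_stopwords=True):
    """Split text into alphanumeric tokens, optionally lowercasing and filtering."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z0-9]+", text)  # punctuation is discarded
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return tokens

sentence = "A user has forgotten their password to an Excel 2016 spreadsheet."

print(tokenize(sentence))
# ['user', 'forgotten', 'password', 'excel', '2016', 'spreadsheet']
print(tokenize(sentence, lowercase=False, remove_stopwords=False))
# ['A', 'user', 'has', 'forgotten', 'their', 'password', 'to', 'an', 'Excel', '2016', 'spreadsheet']
```

The same sentence yields 6 tokens under one set of choices and 11 under the other, so the resulting vocabulary and document vectors differ as well.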

#### Gensim Default Preprocessing

In [123]:
import os
from gensim.corpora import TextCorpus
DATA_DIR = 'data'

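# Helper IO function: write one document per line and return the file path
def list2txt(doc_list, filename):
    filepath = os.path.join(DATA_DIR, filename)
    # The context manager closes (and flushes) the file before it is read back
    with open(filepath, 'w') as outfile:
        outfile.write("\n".join(doc_list))
    return filepath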

In [124]:
default_target_corpus = TextCorpus(list2txt(docs, 'raw.txt'))

In [125]:
default_target_corpus.dictionary.items()
default_target_corpus.dictionary.id2token

{0: u'changing',
 1: u'come',
 2: u'delete',
 3: u'editor',
 4: u'excel',
 5: u'file',
 6: u'forgotten',
 7: u'format',
 8: u'getting',
 9: u'gobbledy',
 10: u'gook',
 11: u'line',
 12: u'need',
 13: u'open',
 14: u'opening',
 15: u'password',
 16: u'protected',
 17: u'readable',
 18: u'text',
 19: u'tried',
 20: u'unlock',
 21: u'user',
 22: u'workbook',
 23: u'xml',
 24: u'zip',
 25: u'anybody',
 26: u'bust',
 27: u'excellent',
 28: u'face',
 29: u'forgot',
 30: u'let',
 31: u'mode',
 32: u'palm',
 33: u'read',
 34: u'slick',
 35: u'spreadsheet',
 36: u'time',
 37: u'tool',
 38: u'__img__',
 39: u'__url__',
 40: u'aes',
 41: u'algorithm',
 42: u'attack',
 43: u'based',
 44: u'bits',
 45: u'brute',
 46: u'cipher',
 47: u'cng',
 48: u'configure',
 49: u'create',
 50: u'credentials',
 51: u'curious',
 52: u'default',
 53: u'effectively',
 54: u'email',
 55: u'files',
 56: u'force',
 57: u'info',
 58: u'key',
 59: u'lets',
 60: u'like',
 61: u'login',
 62: u'looks',
 63: u'number',
 64: 

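Vocabulary terms are indexed document by document, in the order the documents appear in the corpus; as the output above shows, the new terms from each document are indexed alphabetically.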

#### Custom Preprocessing

In [126]:
import re # python regular expression library
import bs4 # python html parsing library "beautiful soup" 

# Sample custom preprocessing sequence
DELIMITER_PATTERN = u'[!?,;:\t\\\\"\\(\\)\\\'\u2026\u201c\u2013\u2019\n]|\\s\\-\\s|\\.\\s'
TOKEN_PATTERN = r'(?u)[a-zA-Z\_][a-zA-Z0-9\_]*|[0-9][a-zA-Z0-9.]*\b'

def html_parser(html):
    # Drop non-ASCII characters and '+' signs, then strip any HTML markup
    html = re.sub(r'[^\x00-\x7f]', r'', html).strip().replace('+', '')
    text = bs4.BeautifulSoup(html, "html.parser").get_text()
    text = re.sub(r'[^\x00-\x7f]', r'', text).strip()
    return text
    
def split_sentences(doc):
    try:
        delimiters = re.compile(DELIMITER_PATTERN)
        sentences = delimiters.split(doc)
        return sentences
    except:
        return []

def tokenizer(sentence, token_pattern=TOKEN_PATTERN, lowercase=True):
    try:
        token_pattern = re.compile(token_pattern)
        if lowercase:
            sentence = sentence.lower()
        return token_pattern.findall(sentence)

    except:
        return []

def clean_text(html, lowercase=True):
    # Strip markup, split into sentences, tokenize each, and rejoin with single spaces
    text = html_parser(html)
    if lowercase:
        text = text.lower()

    sents = split_sentences(text)
    sents = [' '.join(tokenizer(sent, lowercase=lowercase)).strip() for sent in sents]
    # Drop sentences that are empty after tokenization
    sents = [sent for sent in sents if len(sent) > 0]
    return ' '.join(sents)

# Subclass gensim TextCorpus object to apply custom preprocessing
# Preprocessing functions need to be efficient if performance is a concern!
class CustomTextCorpus(TextCorpus):
    def get_texts(self):
        # Yield each document as a list of cleaned tokens
        for doc in self.getstream():
            yield clean_text(doc).split()

    def __len__(self):
        # Count documents by streaming through the corpus once
        self.length = sum(1 for _ in self.get_texts())
        return self.length

In [127]:
clean_text(doc2)

u'password protected excel 2016 spreadsheet user forgot password face palm time a user has forgotten their password to an excel 2016 spreadsheet anybody have any slick tool that will at least let them open in read only mode something to bust through the password would be excellent'

In [128]:
preprocessed_docs = [clean_text(doc) for doc in docs]
custom_target_corpus = CustomTextCorpus(list2txt(preprocessed_docs, 'preprocessed.txt'))

In [129]:
custom_target_corpus.dictionary.items()
custom_target_corpus.dictionary.id2token

{0: u'2010',
 1: u'7',
 2: u'a',
 3: u'an',
 4: u'and',
 5: u'but',
 6: u'can',
 7: u'cannot',
 8: u'changing',
 9: u'come',
 10: u'delete',
 11: u'editor',
 12: u'excel',
 13: u'file',
 14: u'for',
 15: u'forgotten',
 16: u'format',
 17: u'get',
 18: u'getting',
 19: u'gobbledy',
 20: u'gook',
 21: u'had',
 22: u'have',
 23: u'i',
 24: u'is',
 25: u'it',
 26: u'line',
 27: u'm',
 28: u'name',
 29: u'need',
 30: u'of',
 31: u'open',
 32: u'opening',
 33: u'password',
 34: u'protected',
 35: u'readable',
 36: u'see',
 37: u'she',
 38: u'so',
 39: u't',
 40: u'text',
 41: u'the',
 42: u'this',
 43: u'to',
 44: u'tried',
 45: u'unlock',
 46: u'up',
 47: u'user',
 48: u'what',
 49: u'who',
 50: u'with',
 51: u'workbook',
 52: u'xml',
 53: u'z',
 54: u'zip',
 55: u'2016',
 56: u'any',
 57: u'anybody',
 58: u'at',
 59: u'be',
 60: u'bust',
 61: u'excellent',
 62: u'face',
 63: u'forgot',
 64: u'has',
 65: u'in',
 66: u'least',
 67: u'let',
 68: u'mode',
 69: u'only',
 70: u'palm',
 71: u'rea

### Document Vectors

## Document Similarity

#### Notation
- $N$: Size of the vocabulary 
- $X_1, X_2$: Document vectors, $N\times 1$ dimensional  
- $M$: Term-similarity matrix, $N\times N$ dimensional

#### Dot Product
The dot product is the sum of the pairwise products of the entries of two vectors:

$$X_1\cdot X_2 = \begin{pmatrix}X_{11} \\ \vdots \\ X_{1N}\end{pmatrix} \cdot \begin{pmatrix}X_{21} \\ \vdots \\ X_{2N}\end{pmatrix}= X_{11} (X_{21}) + X_{12}(X_{22}) + \cdots + X_{1N}(X_{2N}) = \sum_{i=1}^N X_{1i}(X_{2i}).$$

For example,

$$\begin{pmatrix}1 \\ 2 \\ 3 \end{pmatrix} \cdot \begin{pmatrix} 0\\ 1\\ 1 \end{pmatrix} = 1(0)+2(1)+3(1)=5.$$
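
The worked example can be checked with a few lines of plain Python. The function name `dot` is chosen here for illustration only.

```python
def dot(x1, x2):
    """Sum of the pairwise products of two equal-length vectors."""
    if len(x1) != len(x2):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x1, x2))

print(dot([1, 2, 3], [0, 1, 1]))  # 1*0 + 2*1 + 3*1 = 5
```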

#### Cosine Similarity
$$
\begin{eqnarray} 
\cos(X_1, X_2) &=& \frac{X_1\cdot X_2}{||X_1||\text{ }||X_2||}\\
               &=& \frac{X_1^T X_2}{\sqrt{X_1^T X_1}\sqrt{X_2^T X_2}}
\end{eqnarray}
$$
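
A minimal sketch of the formula in plain Python follows. Returning 0.0 for a zero vector is a convention assumed here; the formula itself is undefined in that case.

```python
import math

def cosine(x1, x2):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot_product = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for empty documents
    return dot_product / (norm1 * norm2)

print(cosine([1, 2, 3], [0, 1, 1]))  # 5 / sqrt(28), roughly 0.945
print(cosine([1, 0], [0, 1]))        # no shared terms: 0.0
```

Two documents that share no vocabulary terms always score 0 under this measure, regardless of how closely related their terms are.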

#### Soft Cosine Similarity
$$ \cos_M (X_1, X_2) = \frac{X_1^T M X_2}{\sqrt{X_1^T M X_1}\sqrt{X_2^T M X_2}}$$
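
The sketch below evaluates the soft cosine formula on a toy example. The three-term vocabulary and the similarity value 0.8 between "forgot" and "forgotten" are invented for illustration; in practice $M$ would be derived from something like word embeddings or edit distance.

```python
import math

def bilinear(x1, m, x2):
    """Compute x1^T M x2 for list-based vectors and a list-of-lists matrix."""
    return sum(
        x1[i] * m[i][j] * x2[j]
        for i in range(len(x1))
        for j in range(len(x2))
    )

def soft_cosine(x1, x2, m):
    """Soft cosine similarity under the term-similarity matrix m."""
    denominator = math.sqrt(bilinear(x1, m, x1)) * math.sqrt(bilinear(x2, m, x2))
    if denominator == 0:
        return 0.0  # assumed convention for empty documents
    return bilinear(x1, m, x2) / denominator

# Hypothetical vocabulary: ["forgot", "forgotten", "excel"]
M = [
    [1.0, 0.8, 0.0],  # "forgot" is assumed to be similar to "forgotten"
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

doc_a = [1, 0, 1]  # "forgot excel"
doc_b = [0, 1, 1]  # "forgotten excel"

print(soft_cosine(doc_a, doc_b, IDENTITY))  # plain cosine, about 0.5
print(soft_cosine(doc_a, doc_b, M))         # about 0.9
```

With the identity matrix the measure reduces to ordinary cosine similarity; the off-diagonal entries of $M$ are what let related but distinct terms contribute to the score.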

The code needed to generate the word-embedding models is included in the [appendix](#appendix).

## Appendix <a name="appendix"></a>

### Imports

In [None]:
import bs4
import re
import csv
import os
import json
import pandas as pd
import gensim
from gensim.interfaces import TransformationABC
import scipy as sp
from sklearn.metrics.pairwise import linear_kernel
import sqlalchemy

<img src="images/top_10_graph.png">