# Sent2Vec Exploration
An investigation of whether sent2vec would be a good tool to use.

### sent2vec installation instructions:
- use pip to install Cython: pip install cython
- clone the sent2vec GitHub repo: git clone https://github.com/epfml/sent2vec.git
- go to the cloned folder and install: pip install .
- in Jupyter, import the package
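
Before opening the notebook, it can help to confirm that Python can actually locate the packages from the steps above. A minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of sent2vec:

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package with this name."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of the two packages the installation steps mention.
# The cython package is imported under the name "Cython".
for name in ["Cython", "sent2vec"]:
    print(name, "found" if is_installed(name) else "missing")
```

If `sent2vec` is reported as missing, the `pip install .` step was most likely run in a different environment than the one the Jupyter kernel uses.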

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None 
import numpy as np
import re
import nltk
import os
import gensim
import gzip

import sent2vec
#from gensim import sent2vec
from gensim.models import word2vec, KeyedVectors
from gensim.test.utils import common_texts, get_tmpfile

from nltk.stem import PorterStemmer
from nltk import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data needed by nltk.sent_tokenize / word_tokenize
STOP_WORDS = nltk.corpus.stopwords.words('english')

from sklearn.metrics.pairwise import cosine_similarity

import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Data Wrangling

### Import

In [None]:
#os.chdir('/Users/patrickrs/Documents/GitLab/patrick-steiner/revealapp/00_exploration/Pat')
#data_original = pd.read_csv('../../../../severin-kappeler/06-NLP/data/job_ads_eng.csv')  # .sample(50000, random_state=23)

In [None]:
doc_1 = '''I opened up a money market account at First Niagara Bank in XXXX XXXX CT on XXXX/XXXX/2016. I was told that the APY interest rate was 1.01 % based on the advertisement in the XXXX XXXX as well as a verbal discussion. There was no mention that the rate was related to compounding and this was confirmed verbally with the bank officer. And that taking out the interest monthly would not affect the rate that I was to receive. After 33 days of " interest '', from XX/XX/XXXX to XX/XX/XXXX, I received not XXXX of 1.01 % but rather XXXX of 0.81 % and my printed statement says that my APY is now 0.81 % not the 1.01 % I was expecting. 
The interest I received was XXXX % LESS that I was expecting. The difference in the amount of money is not very much but First Niagara, in my opinion, lied to me with false advertisements and false verbal discussions, apparently to get me to put my money into a seemingly high interest money market account. Multiply this by XXXX customers and XXXX dollars and the amount of money they defrauded consumers could be very substantial. In my opinion, First Niagara deceived me and its customers and violated honest banking practices.
Product: Bank account or service'''
doc_2 = '''In this post we covered different approaches for word representation in NLP tasks (BOW, TF-IDF and Word Embeddings), learnt how to learn word representation from its context using Word2Vec, saw how we can extract meaningful phrases from a given corpus (NPMI and data-driven approach) and how to transform a given corpus in order to learn similar terms/words for each one of extracted terms/words using Word2Vec algorithm. The results of this process can be used in a downstream task, like Query Expansion in Information Extraction tasks, Document Classification, Clustering, Question-Answering and many more.'''
doc_3 = '''On our 1.6 billion words corpus, it took us 1 hour to construct bi-grams and another 2 hours to train Word2Vec (with batch Skip-Gram, 300 dimension, 10 epochs, context of k=5 , negative sampling of 5, learning rate of 0.01 and minimum word count of 5) on a machine with 16 CPUs and 64 RAM using AWS Sagemaker service. A great Notebook example of how to use AWS Sagemaker service to train Word2Vec can be found here.'''

In [None]:
sample_text = pd.DataFrame(data = [doc_1, doc_2, doc_3], columns=['Content'])

In [None]:
sample_text.head(5)

### Cleaning

In [None]:
# Function for cleaning (and optionally stemming) the data
def clean_sentence(val):
    "Remove chars that are not letters, digits or whitespace, lowercase, then remove stop words."
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', val).lower()
    # Drop the XXXX / XXX / XX redaction placeholders, longest first
    sentence = re.sub("xxxx", "", sentence)
    sentence = re.sub("xxx", "", sentence)
    sentence = re.sub("xx", "", sentence)
    sentence = re.sub(r"\s\s+", " ", sentence)

    # Stemming of words (seems not to affect accuracy, but should make things faster)
    # porter = PorterStemmer()
    # words = word_tokenize(sentence)
    # sentence = " ".join([porter.stem(word) for word in words])

    # Keep only the words that are not stop words
    words = [word for word in sentence.split(" ") if word not in STOP_WORDS]
    return " ".join(words)


def clean_dataframe(data):
    "Drop NaN rows, then apply 'clean_sentence' to every sentence in the 'Content' column."
    data = data[data['Content'].notna()].copy()

    for col in ['Content']:
        # Each cell holds a list of strings, one string per sentence of the document
        data[col] = data[col].apply(lambda sentences: [clean_sentence(s) for s in sentences])

    return data

#splitting data into sentences
def tokenize_to_sentences(data):
    for col in ['Content']:
        data[col] = data[col].apply(nltk.sent_tokenize)
        
    return data
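
To see what the cleaning step does to a single sentence without loading NLTK, here is a self-contained sketch. It is a compact variant, not the notebook's function: `clean_sentence_demo` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny stand-in for NLTK's English list, and the redaction placeholders are removed with one pattern instead of three successive substitutions.

```python
import re

# Hypothetical stand-in for nltk's English stop-word list, kept tiny on purpose.
DEMO_STOP_WORDS = {"i", "was", "that", "the", "on"}


def clean_sentence_demo(text, stop_words=DEMO_STOP_WORDS):
    """Strip punctuation, lowercase, drop redaction placeholders and stop words."""
    text = re.sub(r"([^\s\w]|_)+", "", text).lower()
    # Remove runs of two or more x's (the XX / XXXX redaction placeholders)
    text = re.sub(r"x{2,}", "", text)
    # split() without arguments also discards empty tokens left by removed text
    return " ".join(word for word in text.split() if word not in stop_words)


print(clean_sentence_demo("I was told that the APY interest rate was 1.01 % on XX/XX/XXXX."))
# told apy interest rate 101
```

Note that removing the decimal point turns `1.01` into `101`, which matches the cleaned output shown further down in this notebook.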
    

In [None]:
tokenize_to_sentences(sample_text)
data = clean_dataframe(sample_text)

In [9]:
data['Content'][0]

['opened money market account first niagara bank ct 2016',
 'told apy interest rate 101 based advertisement well verbal discussion',
 'mention rate related compounding confirmed verbally bank officer',
 'taking interest monthly would affect rate receive',
 '33 days interest received 101 rather 081 printed statement says apy 081 101 expecting',
 'interest received less expecting',
 'difference amount money much first niagara opinion lied false advertisements false verbal discussions apparently get put money seemingly high interest money market account',
 'multiply customers dollars amount money defrauded consumers could substantial',
 'opinion first niagara deceived customers violated honest banking practices',
 'product bank account service']

## Embedding

We will probably need to use one of the pre-trained models that are available in the sent2vec GitHub repository.

In [11]:
model = sent2vec.Sent2vecModel()
#model.embed_sentences(data['Content'][0])
#model.load_model('model.bin')
#emb = model.embed_sentence("once upon a time .") 
#embs = model.embed_sentences(["first sentence .", "another sentence"])

In [10]:
model = sent2vec.Sent2vecModel(data['Content'], size=100, min_count = 1)

TypeError: __init__() takes exactly 0 positional arguments (1 given)
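
The `TypeError` above suggests that `Sent2vecModel()` takes no constructor arguments. Going by the commented-out calls in the previous cell, the Python wrapper seems to be meant for loading an already trained model file and embedding sentences with it, not for training on a DataFrame column. A minimal sketch under that assumption; the helper name `embed_document` is hypothetical, and `model.bin` is a placeholder path that must point to a model file you have downloaded or trained yourself:

```python
import os


def embed_document(sentences, model_path="model.bin"):
    """Embed a list of cleaned sentences with a trained sent2vec model file."""
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"sent2vec model not found: {model_path}")

    import sent2vec  # imported late so the path check works without the package

    model = sent2vec.Sent2vecModel()  # no constructor arguments
    model.load_model(model_path)
    # Expected to return one embedding per input sentence
    return model.embed_sentences(sentences)


# Usage, once a model file is available (placeholder path):
# embeddings = embed_document(data['Content'][0], "model.bin")
```

Checking the path first gives a clear error message when the model file is missing, before any model loading is attempted.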

In [None]:
model.get_vocabulary()

In [None]:
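# Assumption: `corpus` is the list of tokenized sentences from the cleaned data,
# since Word2Vec expects an iterable of token lists
corpus = [sentence.split() for doc in data['Content'] for sentence in doc]

# Model 1 is trained only on the available data
# (gensim < 4.0 API; from 4.0 on, `size` is named `vector_size`)
model_1 = word2vec.Word2Vec(corpus, size=300, min_count=1)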

In [None]:
# Start from a pre-trained model: build the vocabulary from the current dataset, extend it with the
# words of the pre-trained model, copy in the pre-trained vectors, then train (takes long to run).
# Written for gensim < 4.0; from 4.0 on, `size` -> `vector_size`, `iter` -> `epochs`, `vocab` -> `key_to_index`.
pretrained_path = '/Users/patrickrs/Documents/Gitlab/patrick-steiner/revealapp/Playground/Tag/GoogleNews-vectors-negative300.bin'
model_2 = word2vec.Word2Vec(size=300, min_count=1)
model_2.build_vocab(corpus)
total_examples = model_2.corpus_count  # stored before the vocabulary update below
# Note: this reuses the name `model`, overwriting the sent2vec model created above
model = gensim.models.KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
model_2.build_vocab([list(model.vocab.keys())], update=True)
# intersect_word2vec_format() brings vectors from an external file into a model that has already had its own vocabulary initialized
# see https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.intersect_word2vec_format.html
model_2.intersect_word2vec_format(pretrained_path, binary=True, lockf=1.0)
model_2.train(corpus, total_examples=total_examples, epochs=model_2.iter)