# Goal

In this notebook, we seek to find "hedging" bigrams by using the word2vec embeddings for unigrams that are part of hedging bigrams. Examples include:  

* "sort" coming from "sort of"
* "kind" coming from "kind of"
* "guess" coming from "I guess"

Following the ideas from Ben Schmidt's [Rejecting the gender binary: a vector-space operation](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html), we will build a vector for the collection of "hedging" unigrams and then use "vector rejection" to remove non-hedging uses of the hedging unigrams. For example: 

* "sort" as in "sort of", but not in the sense of "arrange, list, classify" or "variety, category, type"
* "kind" as in "kind of", but not in the sense of "friendly, nice, patient" or "variety, category, type"
* "guess" as in "I guess", but not in the sense of "predict, determine, explain"
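
The rejection step itself is a small piece of vector arithmetic: subtract from a vector its projection onto the direction we want removed. Below is a minimal sketch using plain Python lists. The names `dot`, `reject`, `sort_vec`, and `arrange_vec` are hypothetical, and the three-dimensional numbers are made up for illustration; they are not taken from the trained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def reject(a, b):
    """Return the part of `a` that is orthogonal to `b`
    (that is, `a` minus its projection onto `b`)."""
    scale = dot(a, b) / dot(b, b)
    return [x - scale * y for x, y in zip(a, b)]

# Toy vectors (assumed values, for illustration only)
sort_vec = [1.0, 2.0, 3.0]      # stands in for the embedding of "sort"
arrange_vec = [0.0, 1.0, 0.0]   # stands in for the "arrange" sense

hedge_like = reject(sort_vec, arrange_vec)
print(hedge_like)
# [1.0, 0.0, 3.0]
```

The result has no component left along `arrange_vec`. The same arithmetic should apply to the NumPy arrays returned by gensim, for example `word2vec.wv["sort"]`, once a model has been trained.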

See also:

* https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/
* https://medium.com/swlh/word-embedding-word2vec-with-genism-nltk-and-t-sne-visualization-43eae8ab3e2e


In [52]:
import numpy as np
import pandas as pd
import re
import csv
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import NMF
from nltk.corpus import stopwords

# https://stackoverflow.com/questions/37101114/what-to-download-in-order-to-make-nltk-tokenize-word-tokenize-work
#nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

from gensim.models import Word2Vec

# for preprocessing the text
from gensim.parsing.preprocessing import remove_stopwords
import string


import matplotlib.pyplot as plt

In [69]:
# Create a stoplist of every single letter except "a" and "i",
# plus the contraction fragments "ll", "re", and "ve"

stoplist = list(string.ascii_letters[:26])
stoplist.pop(8) # Remove 'i'
stoplist.pop(0) # Remove 'a'

stoplist.append("ll")
stoplist.append("re")
stoplist.append("ve")


['b', 'c', 'd', 'e', 'f', 'g', 'h', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', 'll', 're', 've']


In [73]:
# Custom stopword remover: drops the single-letter tokens (other than "a" and "i")
# and the contraction fragments listed in `stoplist`.
# Based on `remove_stopwords()` in gensim:
#    https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/parsing/preprocessing.py

def remove_stoplist(s):
    # Split on whitespace, filter the tokens, and rejoin into one string
    return " ".join(remove_stoplist_tokens(s.split()))

def remove_stoplist_tokens(tokens):
    # Keep only the tokens that are not in the module-level `stoplist`
    return [token for token in tokens if token not in stoplist]

In [2]:
# Load data

# Load binary gendered talks 
talks_male = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_female = pd.read_csv('talks_female.csv', index_col='Talk_ID')

# No one gender ==> NOG
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')

# Concatenate the data

all_talks = pd.concat([talks_male,talks_female,talks_nog])

In [8]:
partitioned_texts = all_talks.text.tolist()

# From https://www.geeksforgeeks.org/python-string-join-method/
#text_all = " ".join(partitioned_texts)

processed_texts = [text.lower() for text in partitioned_texts]
processed_texts = [re.sub('[^a-zA-Z]', ' ', text) for text in processed_texts]
processed_texts = [re.sub(r'\s+', ' ', text) for text in processed_texts]
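
To see what these three steps do to a single text, here is a small sketch applied to the opening of the first talk. The sample string is copied from the `partitioned_texts[0][:50]` output shown in this notebook; the helper name `clean` is hypothetical.

```python
import re

def clean(text):
    """Lowercase, replace every non-letter with a space, collapse whitespace."""
    text = text.lower()
    text = re.sub('[^a-zA-Z]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text

sample = "  Thank you so much, Chris. And it's truly a great"
cleaned = clean(sample)
print(repr(cleaned))
# ' thank you so much chris and it s truly a great'
```

Note that the apostrophe in "it's" becomes a space, leaving a stray "s" token. Leftovers like this are what the single-letter stoplist is meant to remove.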


In [38]:
partitioned_texts[0][:50]

"  Thank you so much, Chris. And it's truly a great"

In [85]:
processed_texts = [remove_stoplist(text) for text in processed_texts]
doc_texts = [nltk.word_tokenize(text) for text in processed_texts]

In [86]:
type(doc_texts)

list

In [87]:
len(doc_texts)

992

In [88]:
doc_texts[0]

['thank',
 'you',
 'so',
 'much',
 'chris',
 'and',
 'it',
 'truly',
 'a',
 'great',
 'honor',
 'to',
 'have',
 'the',
 'opportunity',
 'to',
 'come',
 'to',
 'this',
 'stage',
 'twice',
 'i',
 'extremely',
 'grateful',
 'i',
 'have',
 'been',
 'blown',
 'away',
 'by',
 'this',
 'conference',
 'and',
 'i',
 'want',
 'to',
 'thank',
 'all',
 'of',
 'you',
 'for',
 'the',
 'many',
 'nice',
 'comments',
 'about',
 'what',
 'i',
 'had',
 'to',
 'say',
 'the',
 'other',
 'night',
 'and',
 'i',
 'say',
 'that',
 'sincerely',
 'partly',
 'because',
 'mock',
 'sob',
 'i',
 'need',
 'that',
 'laughter',
 'put',
 'yourselves',
 'in',
 'my',
 'position',
 'laughter',
 'i',
 'flew',
 'on',
 'air',
 'force',
 'two',
 'for',
 'eight',
 'years',
 'laughter',
 'now',
 'i',
 'have',
 'to',
 'take',
 'off',
 'my',
 'shoes',
 'or',
 'boots',
 'to',
 'get',
 'on',
 'an',
 'airplane',
 'laughter',
 'applause',
 'i',
 'tell',
 'you',
 'one',
 'quick',
 'story',
 'to',
 'illustrate',
 'what',
 'that',
 'been

In [89]:
word2vec = Word2Vec(doc_texts, window = 10, min_count=2)

In [90]:
word2vec.wv.most_similar("guess")

[('saying', 0.7347651124000549),
 ('am', 0.7292705774307251),
 ('swear', 0.7259172201156616),
 ('sorry', 0.7208622097969055),
 ('say', 0.7200601100921631),
 ('remember', 0.7195522785186768),
 ('wonder', 0.7048721313476562),
 ('mean', 0.6956779360771179),
 ('wondering', 0.6921308040618896),
 ('oh', 0.6743535995483398)]

## Bigrams

https://stackoverflow.com/questions/19560044/how-to-concatenate-element-wise-two-lists-in-python
https://www.w3schools.com/python/ref_string_format.asp

In [94]:
doc_bigrams = []

for text_list in doc_texts[:100]:
    # Create two lists from the original text list
    # The first includes the 0 -> (n-1) words
    # The second includes the 1 -> n words
    l_text = text_list[:-1]
    r_text = text_list[1:]
    
    # Use zip and format with list comprehension to get a new list
    bi_list = ["{} {}".format(l_word,r_word) for l_word,r_word in zip(l_text,r_text)]
    
    doc_bigrams.append(bi_list)
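
The pairing step inside the loop can be pulled out into a small function and checked on a short token list. This is a minimal sketch; `to_bigrams` is a hypothetical helper name and is not defined elsewhere in the notebook.

```python
def to_bigrams(tokens):
    """Join each token with its right-hand neighbour into a 'left right' string."""
    return ["{} {}".format(left, right)
            for left, right in zip(tokens[:-1], tokens[1:])]

tokens = ['thank', 'you', 'so', 'much']
bigrams = to_bigrams(tokens)
print(bigrams)
# ['thank you', 'you so', 'so much']
```

A list of n tokens yields n - 1 bigrams, and a one-token list yields none. Each bigram string is then treated as a single "word" when the lists are passed to `Word2Vec`.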

In [95]:
type(doc_bigrams)

list

In [96]:
len(doc_bigrams)

100

In [97]:
doc_bigrams[0]

['thank you',
 'you so',
 'so much',
 'much chris',
 'chris and',
 'and it',
 'it truly',
 'truly a',
 'a great',
 'great honor',
 'honor to',
 'to have',
 'have the',
 'the opportunity',
 'opportunity to',
 'to come',
 'come to',
 'to this',
 'this stage',
 'stage twice',
 'twice i',
 'i extremely',
 'extremely grateful',
 'grateful i',
 'i have',
 'have been',
 'been blown',
 'blown away',
 'away by',
 'by this',
 'this conference',
 'conference and',
 'and i',
 'i want',
 'want to',
 'to thank',
 'thank all',
 'all of',
 'of you',
 'you for',
 'for the',
 'the many',
 'many nice',
 'nice comments',
 'comments about',
 'about what',
 'what i',
 'i had',
 'had to',
 'to say',
 'say the',
 'the other',
 'other night',
 'night and',
 'and i',
 'i say',
 'say that',
 'that sincerely',
 'sincerely partly',
 'partly because',
 'because mock',
 'mock sob',
 'sob i',
 'i need',
 'need that',
 'that laughter',
 'laughter put',
 'put yourselves',
 'yourselves in',
 'in my',
 'my position',
 'p

In [104]:
bigram2vec = Word2Vec(doc_bigrams, min_count=5, window = 20)

### 20 talks, default window
bigram2vec.wv.most_similar("kind of")

`[('is the', 0.9678242802619934),
 ('of the', 0.9669575691223145),
 ('it s', 0.9658999443054199),
 ('in the', 0.9641684889793396),
 ('and i', 0.9638648629188538),
 ('on the', 0.9616329669952393),
 ('and the', 0.9615666270256042),
 ('if you', 0.9613533616065979),
 ('you can', 0.9594166874885559),
 ('to do', 0.959026575088501)]`

### 100 talks, default window
bigram2vec.wv.most_similar("kind of")

`[('on the', 0.9999475479125977),
 ('of a', 0.9999454021453857),
 ('sort of', 0.9999436736106873),
 ('and the', 0.9999433755874634),
 ('into the', 0.9999430179595947),
 ('and he', 0.9999375343322754),
 ('it the', 0.9999349117279053),
 ('on a', 0.9999347925186157),
 ('from the', 0.9999346733093262),
 ('in this', 0.9999346733093262)]`

### 100 talks, window = 20
bigram2vec.wv.most_similar("kind of")

`[('of a', 0.9999916553497314),
 ('into the', 0.999990701675415),
 ('it a', 0.9999901056289673),
 ('on the', 0.9999899864196777),
 ('on a', 0.9999896287918091),
 ('it the', 0.9999895691871643),
 ('sort of', 0.99998939037323),
 ('like this', 0.9999883770942688),
 ('and a', 0.9999881982803345),
 ('to a', 0.9999878406524658)]`

### 100 talks, window = 20, min = 5
bigram2vec.wv.most_similar("kind of")

`[('on a', 0.9999414682388306),
 ('sort of', 0.9999383687973022),
 ('into the', 0.9999372959136963),
 ('in a', 0.9999362230300903),
 ('it the', 0.9999334812164307),
 ('it a', 0.9999327659606934),
 ('to a', 0.9999301433563232),
 ('so it', 0.9999282360076904),
 ('and a', 0.9999277591705322),
 ('like a', 0.9999269247055054)]`

In [105]:
bigram2vec.wv.most_similar("kind of")

[('on a', 0.9999414682388306),
 ('sort of', 0.9999383687973022),
 ('into the', 0.9999372959136963),
 ('in a', 0.9999362230300903),
 ('it the', 0.9999334812164307),
 ('it a', 0.9999327659606934),
 ('to a', 0.9999301433563232),
 ('so it', 0.9999282360076904),
 ('and a', 0.9999277591705322),
 ('like a', 0.9999269247055054)]

# Scratch work below here

## WANT: a list of lists 

 * Each text should be preserved 
 * Then we should create a list of words

In [45]:
test = ['one',"Two\'s",'.ThreE']
#test = partitioned_texts[0:3]
print(len(test))
#https://www.delftstack.com/howto/python/python-lowercase-list/

test = [word.lower() for word in test]
print(len(test))
test = [re.sub('[^a-zA-Z]', ' ', word) for word in test]
print(len(test))
test= [re.sub(r'\s+', ' ', word) for word in test]
print(len(test))


3
3
3
3


In [76]:
test

['one', 'two s', ' three']

In [80]:
remove_stoplist('two s and')

'two and'

In [81]:
test_texts = [remove_stoplist(text) for text in test]

In [82]:
test_texts

['one', 'two', 'three']

In [83]:
test_texts = [remove_stoplist(text) for text in test]

test_texts = [nltk.word_tokenize(text) for text in test_texts]

In [84]:
test_texts

[['one'], ['two'], ['three']]

In [None]:
constants = 

In [53]:
all_letters = string.ascii_letters

In [58]:
stops = list(all_letters[:26])

In [61]:
stops.pop(8)

AttributeError: 'str' object has no attribute 'pop'

In [62]:
stops.pop(0)

'a'

In [63]:
stops

['b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

In [None]:
test_texts = 

In [None]:
processed_texts = [text.lower() for text in partitioned_texts]
processed_texts = [re.sub('[^a-zA-Z]', ' ', text) for text in processed_texts]
processed_texts = [re.sub(r'\s+', ' ', text) for text in processed_texts]

In [19]:
test1 = test_sub[:-1]
test2 = test_sub[1:]

In [23]:
test1
test2

test3 = []
for inds in range(len(test1)):
    bigram = test1[inds] + " " + test2[inds]
    test3.append(bigram)

In [24]:
test3


['one twos', 'twos three']

In [29]:
["{} {}".format(l_word,r_word) for l_word,r_word in zip(test1,test2)]

['one twos', 'twos three']

In [28]:
for l_word,r_word in zip(test1,test2):
    print(r_word)

twos
three


In [11]:
processed_texts = text_all.lower()
processed_texts = re.sub('[^a-zA-Z]', ' ', processed_texts)
processed_texts = re.sub(r'\s+', ' ', processed_texts)

In [12]:
processed_texts[:50]

' thank you so much chris and it s truly a great ho'

In [10]:
# From https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/

all_sentences = nltk.sent_tokenize(processed_texts)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

In [11]:
type(all_sentences)

list

In [21]:
len(all_sentences[0])

10699443

In [26]:
word2vec = Word2Vec(all_words, min_count=2)

NameError: name 'Word2Vec' is not defined