# Goal

In this notebook, we look for "hedging" bigrams by way of the word2vec embeddings of the unigrams that appear in them. Examples include:

* "sort" coming from "sort of"
* "kind" coming from "kind of"
* "guess" used in "I guess"

Following the ideas from Ben Schmidt's [Rejecting the gender binary: a vector-space operation](http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html), we will build a vector for the collection of "hedging" unigrams and then use "vector rejection" to remove non-hedging uses of the hedging unigrams. For example: 

* "sort" coming from "sort of" but not "arrange, list, classify" nor "variety, category, type"
* "kind" coming from "kind of" but not "friendly, nice, patient" nor "variety, category, type"
* "guess" used in "I guess" but not "predict, determine, explain"
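
The two steps described above (average the hedging unigrams, then reject a non-hedging sense) can be sketched with plain NumPy. This is only an illustration under stated assumptions: the three-dimensional vectors are made-up stand-ins for real word2vec embeddings, and the names `reject`, `hedge_vec`, and `arrange_vec` are hypothetical, not taken from the notebook or from Schmidt's post.

```python
import numpy as np

def reject(a, b):
    """Return the part of `a` that is orthogonal to `b` (vector rejection)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    projection = (a @ b) / (b @ b) * b  # component of a that lies along b
    return a - projection

# Toy stand-ins for embeddings (assumed values, for illustration only)
sort_vec = np.array([1.0, 2.0, 0.5])
kind_vec = np.array([1.0, 0.0, 1.5])
arrange_vec = np.array([0.0, 1.0, 0.0])  # a non-hedging sense to remove

# One vector for the collection of hedging unigrams
hedge_vec = np.mean([sort_vec, kind_vec], axis=0)

# Remove the "arrange" direction from the combined vector
hedge_only = reject(hedge_vec, arrange_vec)
print(hedge_only)
```

With real embeddings, the rejected vector could then be compared against the vocabulary by cosine similarity; whether that isolates hedging uses well is an empirical question for this corpus.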

See also:

* https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/
* https://medium.com/swlh/word-embedding-word2vec-with-genism-nltk-and-t-sne-visualization-43eae8ab3e2e


In [1]:
import csv
import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# https://stackoverflow.com/questions/37101114/what-to-download-in-order-to-make-nltk-tokenize-word-tokenize-work
#nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

from gensim.models import Word2Vec

import matplotlib.pyplot as plt

In [2]:
# Load data

# Load binary gendered talks 
talks_male = pd.read_csv('talks_male.csv', index_col='Talk_ID')
talks_female = pd.read_csv('talks_female.csv', index_col='Talk_ID')

# No one gender ==> NOG
talks_nog = pd.read_csv('talks_nog.csv', index_col='Talk_ID')

# Concatenate the data

all_talks = pd.concat([talks_male,talks_female,talks_nog])

In [3]:
partitioned_texts = all_talks.text.tolist()

# From https://www.geeksforgeeks.org/python-string-join-method/
text_all = " ".join(partitioned_texts)

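# Clean each talk separately so that document boundaries are preserved:
# lowercase, replace every non-letter with a space, then collapse whitespace
processed_texts = [text.lower() for text in partitioned_texts]
processed_texts = [re.sub('[^a-z]', ' ', text) for text in processed_texts]
processed_texts = [re.sub(r'\s+', ' ', text) for text in processed_texts]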


In [4]:
doc_texts = [nltk.word_tokenize(text) for text in processed_texts]

In [5]:
type(doc_texts)

list

In [6]:
len(doc_texts)

992

In [7]:
doc_texts[0]

['thank',
 'you',
 'so',
 'much',
 'chris',
 'and',
 'it',
 's',
 'truly',
 'a',
 'great',
 'honor',
 'to',
 'have',
 'the',
 'opportunity',
 'to',
 'come',
 'to',
 'this',
 'stage',
 'twice',
 'i',
 'm',
 'extremely',
 'grateful',
 'i',
 'have',
 'been',
 'blown',
 'away',
 'by',
 'this',
 'conference',
 'and',
 'i',
 'want',
 'to',
 'thank',
 'all',
 'of',
 'you',
 'for',
 'the',
 'many',
 'nice',
 'comments',
 'about',
 'what',
 'i',
 'had',
 'to',
 'say',
 'the',
 'other',
 'night',
 'and',
 'i',
 'say',
 'that',
 'sincerely',
 'partly',
 'because',
 'mock',
 'sob',
 'i',
 'need',
 'that',
 'laughter',
 'put',
 'yourselves',
 'in',
 'my',
 'position',
 'laughter',
 'i',
 'flew',
 'on',
 'air',
 'force',
 'two',
 'for',
 'eight',
 'years',
 'laughter',
 'now',
 'i',
 'have',
 'to',
 'take',
 'off',
 'my',
 'shoes',
 'or',
 'boots',
 'to',
 'get',
 'on',
 'an',
 'airplane',
 'laughter',
 'applause',
 'i',
 'll',
 'tell',
 'you',
 'one',
 'quick',
 'story',
 'to',
 'illustrate',
 'wha

In [15]:
# One token list per talk; a wide context window and a minimum word count of 2
word2vec = Word2Vec(doc_texts, window=20, min_count=2)

In [21]:
word2vec.wv.most_similar("guess")

[('yeah', 0.7081160545349121),
 ('oh', 0.6993322968482971),
 ('wow', 0.6840176582336426),
 ('mk', 0.6799947023391724),
 ('audience', 0.6738327741622925),
 ('wondering', 0.6576591730117798),
 ('rb', 0.654322624206543),
 ('lp', 0.6464571952819824),
 ('cp', 0.6443139314651489),
 ('fiesta', 0.6426558494567871)]

## Want: a list of lists

* Each text should be preserved as its own entry
* Each entry should then become a list of that text's words
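
A minimal sketch of that target structure, using only the standard library. The function name `tokenize_texts` and the sample strings are hypothetical, and `str.split` stands in for `nltk.word_tokenize` here, so the tokens may differ slightly from NLTK's on some words.

```python
import re

def tokenize_texts(texts):
    """Return one list of lowercase, letters-only tokens per input text."""
    token_lists = []
    for text in texts:
        # Lowercase, then turn every run of non-letters into a single space
        cleaned = re.sub(r"[^a-z]+", " ", text.lower())
        token_lists.append(cleaned.split())
    return token_lists

sample = ["Thank you so much, Chris.", "It's truly a great honor!"]
sample_tokens = tokenize_texts(sample)
print(sample_tokens)
# [['thank', 'you', 'so', 'much', 'chris'], ['it', 's', 'truly', 'a', 'great', 'honor']]
```

The outer list keeps one entry per text, which is the shape gensim's `Word2Vec` expects for its input sentences.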

In [17]:
test = ['one',"Two's",'.ThreE']

#https://www.delftstack.com/howto/python/python-lowercase-list/

test = [word.lower() for word in test]

test_sub = [re.sub('[^a-zA-Z]', ' ', word) for word in test]
test_sub = [re.sub(r'\s+', ' ', word) for word in test_sub]
    
print(test_sub)


['one', 'two s', ' three']


In [4]:
processed_texts = text_all.lower()
processed_texts = re.sub('[^a-zA-Z]', ' ', processed_texts)
processed_texts = re.sub(r'\s+', ' ', processed_texts)

In [25]:
processed_texts[:50]

' thank you so much chris and it s truly a great ho'

In [10]:
# From https://stackabuse.com/implementing-word2vec-with-gensim-library-in-python/

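# Note: processed_texts has already had its punctuation stripped, so
# sent_tokenize finds no sentence boundaries and returns the corpus as one "sentence"
all_sentences = nltk.sent_tokenize(processed_texts)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]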

In [11]:
type(all_sentences)

list

In [21]:
len(all_sentences[0])

10699443
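
A first "sentence" of more than ten million characters suggests that the sentence tokenizer found no boundaries, because the punctuation was removed before it ran. One way around this, sketched below, is to split into sentences first and clean each sentence afterwards. The function name `sentences_to_tokens` and the sample text are hypothetical, and the regular-expression split is a crude stand-in for `nltk.sent_tokenize`, used so the sketch runs without the `punkt` download.

```python
import re

def sentences_to_tokens(raw_text):
    """Split raw text into sentences, then clean and tokenize each sentence."""
    # Split on whitespace that follows '.', '!' or '?'
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    token_lists = []
    for sentence in sentences:
        # Lowercase and keep letters only, as in the cleaning cells above
        cleaned = re.sub(r"[^a-z]+", " ", sentence.lower())
        tokens = cleaned.split()
        if tokens:  # skip sentences that contained no letters
            token_lists.append(tokens)
    return token_lists

raw = "Thank you so much, Chris. It's truly a great honor! (Laughter)"
sentence_tokens = sentences_to_tokens(raw)
print(len(sentence_tokens))
```

Applied to the unprocessed `text_all`, this ordering should yield many short token lists rather than a single very long one.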

In [26]:
# Word2Vec has to be imported in the current kernel session before this cell runs
from gensim.models import Word2Vec

word2vec = Word2Vec(all_words, min_count=2)