<a href="https://colab.research.google.com/github/kcalizadeh/phil_nlp/blob/master/w2v.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Imports and Mounting Drive

In [1]:
# this cell mounts drive, sets the correct directory, then imports all functions
# and relevant libraries via the functions.py file
from google.colab import drive
import sys

# install relevant libraries not included with colab
!pip install lime
!pip install symspellpy

drive.mount('/gdrive',force_remount=True)

drive_path = '/gdrive/MyDrive/Colab_Projects/Phil_NLP'

sys.path.append(drive_path)

In [2]:
from functions import *
%load_ext autoreload
%autoreload 2

np.random.seed(17)



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


### Load the Data

In [3]:
df = pd.read_csv('/gdrive/MyDrive/Colab_Projects/Phil_NLP/phil_nlp.csv')

df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered,lemmatized_str,tokenized_txt
103430,dialogues concerning natural religion,Hume,empiricism,The true system of the heavenly bodies is disc...,The true system of the heavenly bodies is disc...,69,the true system of the heavenly bodies is disc...,the true system of the heavenly body be disco...,"['The', 'true', 'system', 'of', 'the', 'heaven..."
7684,complete works,Plato,plato,"Someone made an unauthorized copy, so I didn't...","Someone made an unauthorized copy, so I didn't...",126,"someone made an unauthorized copy, so i didn't...","someone make an unauthorized copy , so -PRON-...","['Someone', 'made', 'an', 'unauthorized', 'cop..."
263345,the system of ethics,Fichte,german_idealism,It is implicit in the concept of such a symbol...,It is implicit in the concept of such a symbol...,271,it is implicit in the concept of such a symbol...,-PRON- be implicit in the concept of such a s...,"['It', 'is', 'implicit', 'in', 'the', 'concept..."
755,complete works,Plato,plato,"If it is complete lack of perception, like a d...","If it is complete lack of perception, like a d...",100,"if it is complete lack of perception, like a d...","if -PRON- be complete lack of perception , li...","['If', 'it', 'is', 'complete', 'lack', 'of', '..."
25868,complete works,Plato,plato,I thought you gave us good measure,I thought you gave us good measure,34,i thought you gave us good measure,-PRON- think -PRON- give -PRON- good measure,"['I', 'thought', 'you', 'gave', 'us', 'good', ..."


In [4]:
# using gensim's built-in tokenizer 
df['gensim_tokenized'] = df['sentence_str'].map(lambda x: simple_preprocess(x.lower(),deacc=True,
                                                        max_len=100))

In [16]:
# check how it worked
print(df.iloc[216066]['sentence_str'])
df['gensim_tokenized'][216066]

The bumble bee is a part of the reproductive system of the clover.


['the',
 'bumble',
 'bee',
 'is',
 'part',
 'of',
 'the',
 'reproductive',
 'system',
 'of',
 'the',
 'clover']
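
Note that the article "a" is missing from the token list. That is expected: `simple_preprocess` discards tokens shorter than its `min_len` argument (2 by default) or longer than `max_len`. The sketch below is a rough standard-library stand-in, not gensim's actual implementation; the name `approx_simple_preprocess` is made up for illustration.

```python
import re
import unicodedata


def approx_simple_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Rough, illustrative approximation of gensim's simple_preprocess."""
    text = text.lower()
    if deacc:
        # decompose accented characters and drop the combining marks
        decomposed = unicodedata.normalize('NFD', text)
        text = ''.join(ch for ch in decomposed
                       if unicodedata.category(ch) != 'Mn')
    # runs of word characters that contain no digits
    tokens = re.findall(r'[^\W\d]+', text)
    # keep tokens whose length falls inside [min_len, max_len]
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


sentence = 'The bumble bee is a part of the reproductive system of the clover.'
tokens = approx_simple_preprocess(sentence, deacc=True, max_len=100)
print(tokens)
```

The single-letter token is dropped by the length filter, which matches the notebook output above.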

With that beautiful little quote, we are ready to start training our w2v model! We'll begin with a single school, since a single school is more likely to use a word consistently.

Unfortunately, we didn't have much luck training on the texts alone. The code for it is left here for posterity, but it was only when we worked with GloVe as the base that we got results that were actually useful.

### Word2Vec Training

#### German Idealism as a Test Case

We start by examining the texts of German Idealism to get a feel for what kind of parameters would work best.

In [6]:
def make_w2v(series, stopwords=None, size=200, window=5, min_count=5, workers=-1,
             epochs=20, lowercase=True, sg=0, seed=17, cbow_mean=1, alpha=0.025,
             sample=0.001, use_bigrams=True, threshold=10, bigram_min=5):
    # NOTE: gensim does not treat workers=-1 as "use all cores"; a non-positive
    # value may start no worker threads at all, so pass a positive count when
    # reusing this function
    stopwords = stopwords or []

    # lowercase each sentence (optional) and drop stopwords
    cleaned = []
    for sentence in series:
        if lowercase:
            sentence = [word.lower() for word in sentence]
        cleaned.append([word for word in sentence if word not in stopwords])

    # incorporate bigrams
    if use_bigrams:
        bigram = Phrases(cleaned, min_count=bigram_min, threshold=threshold, delimiter=b' ')
        bigram_phraser = Phraser(bigram)
        cleaned = [bigram_phraser[sent] for sent in cleaned]

    # build the model (passing the corpus here already runs an initial training pass)
    model = Word2Vec(cleaned, size=size, window=window,
                     min_count=min_count, workers=workers, seed=seed, sg=sg,
                     cbow_mean=cbow_mean, alpha=alpha, sample=sample)
    # continue training on the cleaned corpus rather than the raw series, so the
    # tokens (including merged bigrams) match the vocabulary built above
    model.train(cleaned, total_examples=model.corpus_count, epochs=epochs)
    model_wv = model.wv

    # clear it to avoid unwanted transference
    del model

    return model_wv
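
The `threshold` and `bigram_min` arguments of `make_w2v` are handed to `Phrases`, which decides which adjacent word pairs get merged into a single token. As a rough illustration, here is a standard-library sketch of gensim's default scoring rule as we understand it: a pair is kept when `(count(a, b) - min_count) * vocab_size / (count(a) * count(b))` exceeds `threshold`. The function name and the toy corpus are hypothetical, and details such as vocabulary pruning are left out.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, threshold=10.0):
    """Return {(a, b): score} for adjacent pairs scoring above threshold."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences
                      for pair in zip(sent, sent[1:]))
    # assumption: as in gensim, the vocabulary counts unigrams and bigrams alike
    vocab_size = len(unigrams) + len(bigrams)

    kept = {}
    for (a, b), pair_count in bigrams.items():
        score = (pair_count - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        if score > threshold:
            kept[(a, b)] = score
    return kept


# toy corpus: 'common' and 'sense' only ever appear together,
# while 'the' and 'is' pair up with many different words
toy = ([['common', 'sense', 'is', 'the', 'guide']] * 6
       + [['the', 'law', 'is', 'the', 'rule']] * 6)
phrases = score_bigrams(toy, min_count=1, threshold=1.0)
print(sorted(phrases))
```

Raising `threshold` keeps only pairs whose words rarely occur apart; raising `min_count` penalises pairs that are simply rare.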

In [7]:
gi_wv = make_w2v(df[df['school'] == 'german_idealism']['gensim_tokenized'], threshold=12)

We can check this model by trying out a few words. For that purpose we have a testing function that tries some common word combinations.
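
The testing function itself lives in `functions.py` and is not shown in this notebook. A minimal sketch of what such a helper might look like follows, assuming it wraps the `most_similar(positive=..., negative=..., topn=...)` method that gensim's `KeyedVectors` provides. The names `show_most_similar` and `TinyVectors` are hypothetical; `TinyVectors` is only a toy stand-in so that the example runs without a trained model.

```python
import math


class TinyVectors:
    """Toy stand-in for gensim KeyedVectors, just enough to run the helper."""

    def __init__(self, vectors):
        self.vectors = {w: self._unit(v) for w, v in vectors.items()}

    @staticmethod
    def _unit(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec]

    def most_similar(self, positive=(), negative=(), topn=5):
        # add the positive unit vectors, subtract the negative ones
        dim = len(next(iter(self.vectors.values())))
        query = [0.0] * dim
        signed = [(w, 1) for w in positive] + [(w, -1) for w in negative]
        for word, sign in signed:
            query = [q + sign * x for q, x in zip(query, self.vectors[word])]
        query = self._unit(query)
        # rank every other word by cosine similarity to the query
        skip = set(positive) | set(negative)
        scores = [(w, sum(q * x for q, x in zip(query, v)))
                  for w, v in self.vectors.items() if w not in skip]
        return sorted(scores, key=lambda item: -item[1])[:topn]


def show_most_similar(wv, pairs, topn=5):
    """Print the topn neighbours for each (positive, negative) pair."""
    for positive, negative in pairs:
        print(f'Positive - {positive}\tNegative - {negative}')
        for word, score in wv.most_similar(positive=positive,
                                           negative=negative, topn=topn):
            print(f'- {word} ({round(score, 5)})')
        print()


toy = TinyVectors({'law': [1, 0], 'moral': [0, 1],
                   'ethics': [1, 1], 'steam': [-1, 0.2]})
neighbours = toy.most_similar(positive=['law', 'moral'], topn=2)
show_most_similar(toy, [(['law', 'moral'], [])], topn=2)
```

The scores printed next to each word are cosine similarities, so values near 1 mean the vectors point in nearly the same direction.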

In [8]:
pairs_to_try = [(['law', 'moral'], []),
                (['self', 'consciousness'], []),
                (['dialectic'], []),
                (['logic'], []),
]

In [9]:
test_w2v_pos_neg(gi_wv, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- tual (0.25671)
- audience (0.25406)
- keeping with (0.2485)
- con cepts (0.24737)
- patriarchal (0.23828)

Positive - ['self', 'consciousness']	Negative - []
- heres (0.2445)
- favorinus (0.23846)
- essentiality (0.22503)
- terrible (0.22255)
- instruct (0.22111)

Positive - ['dialectic']	Negative - []
- variability (0.26327)
- uniquely (0.23825)
- classical (0.23736)
- be sure (0.23542)
- galileo (0.23516)

Positive - ['logic']	Negative - []
- resolves (0.32825)
- positing (0.27255)
- jus (0.2519)
- available (0.24122)
- pleasant (0.23952)



Although some of these make a modicum of sense, a lot of them look like gibberish. Let's try adjusting some parameters.



##### Trying Skip-gram instead of CBOW
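
In `make_w2v`, `sg=0` selects CBOW and `sg=1` selects skip-gram. CBOW predicts a word from its surrounding context, while skip-gram predicts each context word from the centre word. The toy functions below (hypothetical names, standard library only) just enumerate the training examples each scheme would see for one sentence. They leave out everything else Word2Vec does, such as subsampling, negative sampling and the randomly shrunk window.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of tokens[i], excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    # one example per position: (all context words) -> centre word
    return [(context_of(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    # one example per (centre, context) pair: centre word -> one context word
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_of(tokens, i, window)]


sentence = ['the', 'bumble', 'bee', 'is', 'part']
cbow = cbow_examples(sentence, window=1)
skip = skipgram_examples(sentence, window=1)
print(cbow[1])    # context and target for 'bumble'
print(skip[:3])   # first few (centre, context) pairs
```

Skip-gram produces more, finer-grained examples from the same text, which is one reason it is often tried on smaller corpora.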

In [10]:
# make a skip-gram model (sg=1) with the remaining preset parameters
skip_gi_wv = make_w2v(series=df[df['school'] == 'german_idealism']['gensim_tokenized'],
                      stopwords=[], sg=1, seed=0)

In [11]:
test_w2v(skip_gi_wv, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- corporation (0.29213)
- suitable (0.22615)
- upon (0.22309)
- pos sible (0.22234)
- perpetual (0.2209)

Positive - ['self', 'consciousness']	Negative - []
- precepts (0.25849)
- non indifference (0.2502)
- anatomy (0.24631)
- deeper (0.23479)
- friendly (0.23428)

Positive - ['dialectic']	Negative - []
- entertaining (0.2759)
- gether (0.25991)
- extinction (0.25482)
- showed (0.25199)
- ly (0.24175)

Positive - ['logic']	Negative - []
- intensive magnitude (0.2469)
- by (0.23059)
- ultimate (0.22363)
- lot (0.22206)
- mit (0.21848)



These seem mildly more sensible. Let's tweak the other parameters.

##### Parameter Testing

In [12]:
model_v1 = make_w2v(df[df['school'] == 'german_idealism']['gensim_tokenized'],
                       stopwords=[],
                       size=500,
                       window=5,
                       min_count=25,
                       epochs=10,
                       sg=1, 
                       seed=45)

len(model_v1.vocab)

2928

In [13]:
test_w2v(model_v1, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- medium (0.16708)
- various (0.13167)
- regulative (0.13002)
- selves (0.12905)
- how much (0.12313)

Positive - ['self', 'consciousness']	Negative - []
- child (0.16376)
- assigned (0.15046)
- standard (0.14171)
- unit (0.14042)
- analytically (0.13504)

Positive - ['dialectic']	Negative - []
- species (0.22154)
- order (0.15094)
- posteriori (0.13902)
- incentive (0.13526)
- consequently (0.13099)

Positive - ['logic']	Negative - []
- experiment (0.14999)
- convictions (0.14326)
- generated (0.12487)
- common sense (0.12247)
- far (0.12127)



Despite tweaking parameters far and wide, it's difficult to get results that are compellingly sensible. In most cases one or two terms in the similarity list make some sense, but the others are strange or unconnected.

#### Trying Another School

In [17]:
cm_w2v = make_w2v(df[df['school'] == 'communism']['gensim_tokenized'],
                       stopwords=[],
                       size=700,
                       window=10,
                       min_count=10,
                       epochs=25,
                       sg=1, 
                       seed=10)

type(cm_w2v)

gensim.models.keyedvectors.Word2VecKeyedVectors

In [18]:
pairs_to_try=[(['material', 'conditions'], []),
              (['worker'], ['owner']),
              (['alienation', 'labor'], []),
              (['capital'], [])]

In [19]:
test_w2v(cm_w2v, pairs_to_try)

Positive - ['material', 'conditions']	Negative - []
- worthy (0.16433)
- dimensions (0.13099)
- denied (0.12751)
- likewise (0.12101)
- steam engine (0.12088)

Positive - ['worker']	Negative - ['owner']
- losing (0.14698)
- year (0.12498)
- takes (0.11245)
- attended (0.11156)
- prevented (0.11145)

Positive - ['alienation', 'labor']	Negative - []
- well (0.12642)
- belong to (0.1228)
- america (0.1216)
- autre (0.11858)
- we have (0.11818)

Positive - ['capital']	Negative - []
- changed into (0.14334)
- bill (0.13987)
- woman (0.12504)
- we find (0.12503)
- each other (0.11454)



Here the results were similar - a few words that made some sense and plenty that were just odd.  

### Transfer Learning with GloVe

We'll load the GloVe vectors in word2vec format, then use them as a base from which to train new vectors that are tuned to our corpus.

In [56]:
glove_file = datapath('/gdrive/MyDrive/Colab_Projects/Phil_NLP/glove.6B.50d.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)

glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
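
The `glove2word2vec` step is needed because the two text formats differ only by a header: a word2vec text file starts with a line giving the vocabulary size and the vector dimension, which GloVe files omit. The sketch below shows that conversion on a tiny in-memory example using only the standard library. `add_word2vec_header` is a hypothetical name, and the three 3-dimensional vectors are made up.

```python
import io


def add_word2vec_header(glove_lines):
    """Prepend the '<vocab_size> <dim>' header that word2vec text files expect."""
    lines = [line.rstrip('\n') for line in glove_lines if line.strip()]
    vocab_size = len(lines)
    # each line has the form: word v1 v2 ... vN
    dim = len(lines[0].split()) - 1
    return [f'{vocab_size} {dim}'] + lines


fake_glove = io.StringIO(
    'logic 0.1 0.2 0.3\n'
    'reason 0.2 0.1 0.4\n'
    'clover -0.5 0.0 0.9\n'
)
converted = add_word2vec_header(fake_glove)
print('\n'.join(converted))
```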

In [71]:
pairs_to_try = [(['law', 'moral'], []),
                (['self', 'consciousness'], []),
                (['dialectic'], []),
                (['logic'], []),
]

In [58]:
# check out how GloVe works on our test pairs
test_w2v_pos_neg(glove_vectors, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- morality (0.82654)
- legal (0.82652)
- laws (0.81529)
- constitutional (0.80616)
- fundamental (0.80217)

Positive - ['self', 'consciousness']	Negative - []
- sense (0.83446)
- mind (0.79755)
- vision (0.78202)
- belief (0.78031)
- life (0.77984)

Positive - ['dialectic']	Negative - []
- hegelian (0.88376)
- dialectical (0.83417)
- dialectics (0.80672)
- materialist (0.77674)
- metaphysics (0.77488)

Positive - ['logic']	Negative - []
- reasoning (0.81405)
- intuitionistic (0.76531)
- concepts (0.75831)
- logical (0.75604)
- theory (0.75026)



OK, these make a lot more sense right from the start. But we want them to be trained on our actual philosophical texts - that way we can see how different thinkers use different words and potentially use the vectors for classification.

So in the cells below we train the existing GloVe model on the German Idealist texts as a test.

In [59]:
# isolate the relevant school
documents = df[df['school'] == 'german_idealism']['gensim_tokenized']

# format the series to be used
stopwords = []

sentences = [sentence for sentence in documents]
cleaned = []
for sentence in sentences:
  # lowercase first, then filter stopwords from the lowercased tokens
  cleaned_sentence = [word.lower() for word in sentence]
  cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
  cleaned.append(cleaned_sentence)

# get bigrams
bigram = Phrases(cleaned, min_count=20, threshold=10, delimiter=b' ')
bigram_phraser = Phraser(bigram)

bigramed_tokens = []
for sent in cleaned:
    tokens = bigram_phraser[sent]
    bigramed_tokens.append(tokens)

# run again to get trigrams
trigram = Phrases(bigramed_tokens, min_count=20, threshold=10, delimiter=b' ')
trigram_phraser = Phraser(trigram)

trigramed_tokens = []
for sent in bigramed_tokens:
    tokens = trigram_phraser[sent]
    trigramed_tokens.append(tokens)

# build a toy model to update with
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab(trigramed_tokens)
total_examples = base_model.corpus_count

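# add GloVe's vocabulary (note: this call only extends the vocabulary;
# by itself it does not copy GloVe's pretrained weights into the model)
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)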

# train on our data
base_model.train(trigramed_tokens, total_examples=total_examples, epochs=base_model.epochs)
base_model_wv = base_model.wv
del base_model

In [67]:
test_w2v(base_model_wv, pairs_to_try)

Positive - ['perception']	Negative - ['body']
- experience (0.55736)
- except (0.52587)
- given (0.52239)
- at all (0.51327)
- us (0.49623)

Positive - ['dasein']	Negative - []
- spontaneous (0.95246)
- unrest (0.95197)
- contentment (0.94884)
- le (0.94803)
- dual (0.94744)

Positive - ['consciousness']	Negative - []
- self consciousness (0.90164)
- objectivity (0.85189)
- essence (0.84317)
- purpose (0.81389)
- reality (0.81387)

Positive - ['method']	Negative - ['science']
- remaining (0.55084)
- likeness (0.55007)
- separation (0.54241)
- attraction (0.53498)
- unlikeness (0.53303)



We can immediately see that these make a lot more sense (and the similarity scores are a lot higher). Self-consciousness is commonly associated with freedom in German idealism, logic with metaphysics, and the moral law with universality and the good. This is a massive improvement: these vectors can fairly be said to reflect how German idealists use these terms. Moreover, they differ significantly from the original GloVe vectors, which indicates that real learning took place here.

For comparison, let's check these same terms, but as used by Phenomenologists.

In [61]:
def train_glove(school, glove_vectors, threshold=10, stopwords=None,
                min_count=20):
  # isolate the relevant school (reads the global `df` loaded above)
  documents = df[df['school'] == school]['gensim_tokenized']

  # format the series to be used; respect the caller's stopwords if given
  stopwords = stopwords or []

  sentences = [sentence for sentence in documents]
  cleaned = []
  for sentence in sentences:
    cleaned_sentence = [word.lower() for word in sentence]
    cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
    cleaned.append(cleaned_sentence)

  # get bigrams
  bigram = Phrases(cleaned, min_count=min_count, threshold=threshold, 
                   delimiter=b' ')
  bigram_phraser = Phraser(bigram)

  bigramed_tokens = []
  for sent in cleaned:
      tokens = bigram_phraser[sent]
      bigramed_tokens.append(tokens)

  # run again to get trigrams
  trigram = Phrases(bigramed_tokens, min_count=min_count, threshold=threshold, 
                    delimiter=b' ')
  trigram_phraser = Phraser(trigram)

  trigramed_tokens = []
  for sent in bigramed_tokens:
      tokens = trigram_phraser[sent]
      trigramed_tokens.append(tokens)

  # build a toy model to update with
  model = Word2Vec(size=300, min_count=5)
  model.build_vocab(trigramed_tokens)
  total_examples = model.corpus_count

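  # add GloVe's vocabulary (note: this call only extends the vocabulary;
  # by itself it does not copy GloVe's pretrained weights into the model)
  model.build_vocab([list(glove_vectors.vocab.keys())], update=True)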

  # train on our data
  model.train(trigramed_tokens, total_examples=total_examples, epochs=model.epochs)
  model_wv = model.wv
  del model
  return model_wv

In [77]:
pairs_to_try = [(['perception'], []),
                (['dasein'], []),
                (['consciousness'], []),
                (['method'], []),]

In [78]:
ph_model = train_glove(school='phenomenology', glove_vectors=glove_vectors)

test_w2v(ph_model, pairs_to_try)

Positive - ['perception']	Negative - []
- act (0.94016)
- reality (0.93967)
- representation (0.93682)
- phenomenon (0.92682)
- care (0.92664)

Positive - ['dasein']	Negative - []
- being (0.89551)
- itself (0.87371)
- truth (0.86199)
- consciousness (0.8399)
- existence (0.82935)

Positive - ['consciousness']	Negative - []
- representation (0.92266)
- knowledge (0.92159)
- truth (0.91747)
- movement (0.91747)
- perception (0.91281)

Positive - ['method']	Negative - []
- spirit (0.9675)
- necessity (0.96304)
- definition (0.96258)
- foundation (0.95916)
- source (0.95879)



Using the phenomenology vectors on some central terms of phenomenology once again yields some pretty compelling results. 

These vectors seem to be an effective tool for revealing how a word is used by a school. 

As a final exploration of this method, we'll train w2v models in this way for each school and examine how each of them looks at a couple of the same words.

In [79]:
w2v_dict = {}

for school in df['school'].unique():
  w2v_dict[school] = train_glove(school, glove_vectors=glove_vectors)
  print(f'{school} completed')

plato completed
aristotle completed
empiricism completed
rationalism completed
analytic completed
continental completed
phenomenology completed
german_idealism completed
communism completed
capitalism completed


In [80]:
for school in df['school'].unique():
  print(f'\t{school.upper()}')
  print('----------------------')
  test_w2v(w2v_dict[school], [(['philosophy'], [])])

	PLATO
----------------------
Positive - ['philosophy']	Negative - []
- murder (0.9449)
- gesture (0.94414)
- tragedies (0.94357)
- relief (0.94174)
- friendship (0.94032)

	ARISTOTLE
----------------------
Positive - ['philosophy']	Negative - []
- delivery (0.8839)
- amplification (0.87914)
- mankind (0.86531)
- bodily pleasures (0.86393)
- poetry (0.86111)

	EMPIRICISM
----------------------
Positive - ['philosophy']	Negative - []
- religion (0.96312)
- treating (0.93914)
- doubtfulness (0.93277)
- history (0.93238)
- practice (0.92141)

	RATIONALISM
----------------------
Positive - ['philosophy']	Negative - []
- example (0.95553)
- discourse (0.95018)
- passage (0.94405)
- objections (0.94404)
- prejudice (0.94046)

	ANALYTIC
----------------------
Positive - ['philosophy']	Negative - []
- philosophical (0.89448)
- semantics (0.86649)
- reprinted (0.86304)
- modern (0.85659)
- carnap (0.856)

	CONTINENTAL
----------------------
Positive - ['philosophy']	Negative - []
- metaphysics 

Interestingly, many of these top words align quite strongly with the school's general attitude towards philosophy. 

The model seems solid - our next step is to train one on the entire corpus for use in classification. We do that, and export it, below.

In [81]:
documents = df['gensim_tokenized']

# format the series to be used
stopwords = []

sentences = [sentence for sentence in documents]
cleaned = []
for sentence in sentences:
  # lowercase first, then filter stopwords from the lowercased tokens
  cleaned_sentence = [word.lower() for word in sentence]
  cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
  cleaned.append(cleaned_sentence)

# get bigrams
bigram = Phrases(cleaned, min_count=30, threshold=10, 
                  delimiter=b' ')
bigram_phraser = Phraser(bigram)

bigramed_tokens = []
for sent in cleaned:
    tokens = bigram_phraser[sent]
    bigramed_tokens.append(tokens)

# run again to get trigrams
trigram = Phrases(bigramed_tokens, min_count=30, threshold=10, 
                  delimiter=b' ')
trigram_phraser = Phraser(trigram)

trigramed_tokens = []
for sent in bigramed_tokens:
    tokens = trigram_phraser[sent]
    trigramed_tokens.append(tokens)

# build a toy model to update with
all_text_model = Word2Vec(size=300, min_count=5)
all_text_model.build_vocab(trigramed_tokens)
total_examples = all_text_model.corpus_count

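# add GloVe's vocabulary (note: this call only extends the vocabulary;
# by itself it does not copy GloVe's pretrained weights into the model)
all_text_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)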

# train on our data
all_text_model.train(trigramed_tokens, total_examples=total_examples, 
                     epochs=all_text_model.epochs)
all_text_wv = all_text_model.wv


As a test case, let's compare how the philosophy corpus thinks of philosophy with how GloVe thinks of it.

In [82]:
# print the nearest neighbours of 'philosophy' for each set of vectors
for label, vectors in [('PHILOSOPHY CORPUS', all_text_wv),
                       ('BASE GLOVE', glove_vectors)]:
  print(f'\t{label}')
  print('------------------------------------')
  test_w2v(vectors, [(['philosophy'], [])])


	PHILOSOPHY CORPUS
------------------------------------
Positive - ['philosophy']	Negative - []
- metaphysics (0.79349)
- theology (0.78065)
- science (0.73188)
- philosophical (0.72342)
- psychology (0.70164)

	BASE GLOVE
------------------------------------
Positive - ['philosophy']	Negative - []
- theology (0.88151)
- philosophical (0.84362)
- mathematics (0.83389)
- psychology (0.82387)
- sociology (0.81085)



This sort of stands to reason - 'metaphysics' often has a different meaning outside of philosophical discussion, so it's not surprising to see it as the most changed term here. 
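
One simple way to put a number on how far the corpus-tuned neighbours have drifted from GloVe's is the Jaccard overlap of the two top-5 lists. This is only an illustrative measure, not something used elsewhere in this notebook; the word lists are copied from the output above, and `jaccard` is a hypothetical helper.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


corpus_top5 = ['metaphysics', 'theology', 'science', 'philosophical', 'psychology']
glove_top5 = ['theology', 'philosophical', 'mathematics', 'psychology', 'sociology']

overlap = jaccard(corpus_top5, glove_top5)
only_in_corpus = sorted(set(corpus_top5) - set(glove_top5))
print(f'Jaccard overlap: {overlap:.2f}')
print(f'Only in the corpus model: {only_in_corpus}')
```

The two lists share three of seven distinct words, and 'metaphysics' and 'science' appear only in the corpus-tuned list.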

All in all, things look good, so let's export the vectors so that they can be used in our neural networks. 

In [85]:
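# note: without binary=True this writes the plain-text word2vec format,
# despite the .bin extension
all_text_wv.save_word2vec_format('/gdrive/MyDrive/Colab_Projects/Phil_NLP/w2v_models/w2v_for_nn.bin')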

In [86]:
for school in w2v_dict.keys():
  w2v_dict[school].save_word2vec_format(f'/gdrive/MyDrive/Colab_Projects/Phil_NLP/w2v_models/{school}_w2v.bin')

And that's it! See our other notebooks for more of the modeling work. 