<a href="https://colab.research.google.com/github/kcalizadeh/PDP_data_processing/blob/master/w2v_for_imports.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Imports and Mounting Drive

In [9]:
# this cell mounts drive and adds the project directory to sys.path so that
# local modules (such as functions.py) can be imported
from google.colab import drive
import sys

drive.mount('/gdrive',force_remount=True)

drive_path = '/gdrive/MyDrive/Colab_Projects/Phil_NLP'

sys.path.append(drive_path)

Mounted at /gdrive


In [10]:
import pandas as pd
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from gensim.test.utils import datapath, get_tmpfile
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.models import KeyedVectors
from gensim.models import Phrases
from gensim.models.phrases import Phraser


# a function for quickly testing w2v models
def test_w2v(model, pairs):
  for (pos, neg) in pairs:
    math_result = model.most_similar(positive=pos, negative=neg)
    print(f'Positive - {pos}\tNegative - {neg}')
    for word, score in math_result[:5]:
      print(f"- {word} ({round(score, 5)})")
    print()
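
For readers unfamiliar with `most_similar`, here is a minimal pure-Python sketch of the ranking it performs, as we understand gensim's behavior: the positive vectors are unit-normalized and summed (negatives are subtracted), and every other word is ranked by cosine similarity to that combined query. The name `toy_most_similar` and the toy vectors are hypothetical and exist only for illustration; this is not gensim's implementation.

```python
import math

def unit(vec):
    """Scale a vector to length 1 (assumes it is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_most_similar(vectors, positive, negative=(), topn=5):
    """Rank words by cosine similarity to the combined query vector."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    # the query words themselves are left out of the ranking
    exclude = set(positive) | set(negative)
    scored = [(word, sum(q * x for q, x in zip(query, unit(vec))))
              for word, vec in vectors.items() if word not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# hypothetical 2-d vectors, chosen so that 'moral_law' lies between its parts
toy_vectors = {
    'law': [1.0, 0.0],
    'moral': [0.0, 1.0],
    'moral_law': [1.0, 1.0],
    'bee': [-1.0, 0.0],
}
print(toy_most_similar(toy_vectors, positive=['law', 'moral']))
```

With an empty negative list, as in every pair tested in this notebook, the query is simply the direction between the positive words.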

### Load the Data

In [11]:
df = pd.read_csv('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/phil_nlp.csv')

df.sample(5)

Unnamed: 0,title,author,school,sentence_spacy,sentence_str,sentence_length,sentence_lowered,tokenized_txt,lemmatized_str
13786,Plato - Complete Works,Plato,plato,"So, since I am your friend, I would never dare...","So, since I am your friend, I would never dare...",89,"so, since i am your friend, i would never dare...","['so', 'since', 'am', 'your', 'friend', 'would...","so , since -PRON- be -PRON- friend , -PRON- w..."
294204,Capital,Marx,communism,When the system attained to a certain degree o...,When the system attained to a certain degree o...,254,when the system attained to a certain degree o...,"['when', 'the', 'system', 'attained', 'to', 'c...",when the system attain to a certain degree of...
316954,The Wealth Of Nations,Smith,capitalism,"In those corrupted governments, where there is...","In those corrupted governments, where there is...",173,"in those corrupted governments, where there is...","['in', 'those', 'corrupted', 'governments', 'w...","in those corrupt government , where there be ..."
295890,Capital,Marx,communism,"improvements succeeded each other so rapidly, ...","improvements succeeded each other so rapidly, ...",189,"improvements succeeded each other so rapidly, ...","['improvements', 'succeeded', 'each', 'other',...","improvement succeed each other so rapidly , t..."
226983,The Crisis Of The European Sciences And Phenom...,Husserl,phenomenology,Our focus on the world of perception (and it i...,Our focus on the world of perception (and it i...,242,our focus on the world of perception (and it i...,"['our', 'focus', 'on', 'the', 'world', 'of', '...",-PRON- focus on the world of perception ( and...


In [12]:
# using gensim's built-in tokenizer 
df['gensim_tokenized'] = df['sentence_str'].map(lambda x: simple_preprocess(x.lower(),deacc=True,
                                                        max_len=100))

In [13]:
# check how it worked
print(df.iloc[290646]['sentence_str'])
df['gensim_tokenized'][290646]

A spider conducts operations that resemble those of a weaver, and a bee puts to shame many an architect in the construction of her cells.


['spider',
 'conducts',
 'operations',
 'that',
 'resemble',
 'those',
 'of',
 'weaver',
 'and',
 'bee',
 'puts',
 'to',
 'shame',
 'many',
 'an',
 'architect',
 'in',
 'the',
 'construction',
 'of',
 'her',
 'cells']

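An interesting observation: the tokenizer lowercases the sentence and strips punctuation, and it also drops the single-character token 'a', since `simple_preprocess` ignores tokens shorter than two characters by default.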
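
To make the tokenizer's filtering concrete, here is a rough pure-Python approximation of what `simple_preprocess` does, as we understand it: lowercase, optionally strip accents, pull out alphabetic tokens, and keep those within the length bounds. The function `approx_simple_preprocess` is hypothetical, and its regular expression is an assumption about gensim's tokenization rather than a copy of it.

```python
import re
import unicodedata

# runs of word characters that are not digits (an approximation)
TOKEN_RE = re.compile(r'(?:(?!\d)\w)+')

def approx_simple_preprocess(text, deacc=False, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess with the standard library."""
    text = text.lower()
    if deacc:
        # decompose accented characters and drop the combining marks
        decomposed = unicodedata.normalize('NFD', text)
        text = ''.join(ch for ch in decomposed
                       if unicodedata.category(ch) != 'Mn')
        text = unicodedata.normalize('NFC', text)
    return [tok for tok in TOKEN_RE.findall(text)
            if min_len <= len(tok) <= max_len and not tok.startswith('_')]

sample = 'A spider conducts operations that resemble those of a weaver'
print(approx_simple_preprocess(sample, deacc=True, max_len=100))
```

The `min_len=2` default is why the article 'a' never reaches the model's vocabulary.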

### Transfer Learning with GloVe

We'll import GloVe vectors as w2v, then use those as a base from which to train new vectors that are tuned to our corpus.

In [6]:
# load the vectors. other vector sizes were used but yielded generally less sensible models
glove_file = datapath('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/glove.6B.50d.txt')
tmp_file = get_tmpfile("test_word2vec.txt")

_ = glove2word2vec(glove_file, tmp_file)

glove_vectors = KeyedVectors.load_word2vec_format(tmp_file)
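
The conversion step is needed because the GloVe text format and the word2vec text format differ only by a header line giving the number of vectors and their dimensionality. As a minimal sketch of that idea (the function `add_word2vec_header` is hypothetical and is not gensim's code), the conversion could be written with the standard library as:

```python
def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file, prepending the '<count> <dim>' header line."""
    count, dim = 0, 0
    # first pass: count the vectors and read the dimensionality
    with open(glove_path, encoding='utf-8') as src:
        for line in src:
            if count == 0:
                # each line is: word value_1 ... value_dim
                dim = len(line.rstrip().split(' ')) - 1
            count += 1
    # second pass: write the header, then copy the vectors unchanged
    with open(glove_path, encoding='utf-8') as src, \
         open(out_path, 'w', encoding='utf-8') as out:
        out.write(f'{count} {dim}\n')
        for line in src:
            out.write(line)
    return count, dim
```

Reading the file twice keeps memory use flat, which matters for the larger GloVe files.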

In [7]:
pairs_to_try = [(['law', 'moral'], []),
                (['self', 'consciousness'], []),
                (['dialectic'], []),
                (['logic'], []),
]

In [8]:
# check out how GloVe works on our test pairs
test_w2v(glove_vectors, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- morality (0.82654)
- legal (0.82652)
- laws (0.81529)
- constitutional (0.80616)
- fundamental (0.80217)

Positive - ['self', 'consciousness']	Negative - []
- sense (0.83446)
- mind (0.79755)
- vision (0.78202)
- belief (0.78031)
- life (0.77984)

Positive - ['dialectic']	Negative - []
- hegelian (0.88376)
- dialectical (0.83417)
- dialectics (0.80672)
- materialist (0.77674)
- metaphysics (0.77488)

Positive - ['logic']	Negative - []
- reasoning (0.81405)
- intuitionistic (0.76531)
- concepts (0.75831)
- logical (0.75604)
- theory (0.75026)



Now we want these to be trained on our actual philosophical texts - that way we can see how different thinkers use different words and potentially use the vectors for classification.

So in the cells below we train the existing GloVe model on the German Idealist texts as a test.

#### German Idealism Example

In [14]:
# isolate the relevant school
documents = df[df['school'] == 'german_idealism']['gensim_tokenized']

# format the series to be used
stopwords = []

sentences = [sentence for sentence in documents]
cleaned = []
for sentence in sentences:
  cleaned_sentence = [word.lower() for word in sentence]
  cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
  cleaned.append(cleaned_sentence)

# get bigrams
bigram = Phrases(cleaned, min_count=20, threshold=10, delimiter=b' ')
bigram_phraser = Phraser(bigram)

bigramed_tokens = []
for sent in cleaned:
    tokens = bigram_phraser[sent]
    bigramed_tokens.append(tokens)

# run again to get trigrams
trigram = Phrases(bigramed_tokens, min_count=20, threshold=10, delimiter=b' ')
trigram_phraser = Phraser(trigram)

trigramed_tokens = []
for sent in bigramed_tokens:
    tokens = trigram_phraser[sent]
    trigramed_tokens.append(tokens)

# build a toy model to update with
base_model = Word2Vec(size=300, min_count=5)
base_model.build_vocab(trigramed_tokens)
total_examples = base_model.corpus_count

# add GloVe's words to the vocabulary (only the word list is passed here, so
# GloVe's weights are not loaded into the model)
base_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

# train on our data
base_model.train(trigramed_tokens, total_examples=total_examples, epochs=base_model.epochs)
base_model_wv = base_model.wv
del base_model
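
The `build_vocab(..., update=True)` call above receives only GloVe's word list. As a hedged illustration of what starting from pretrained weights would additionally involve, the sketch below copies a pretrained vector for every word the two vocabularies share and falls back to small random values otherwise. The function `seed_from_pretrained` and the toy data are hypothetical. Gensim provides `intersect_word2vec_format` for a similar purpose, but we have not checked it against this notebook's gensim version, and it would need the model's vector size to match the 50-dimensional GloVe file.

```python
import random

def seed_from_pretrained(vocab, pretrained, dim, seed=0):
    """Build initial vectors for `vocab`, copying pretrained ones where possible."""
    rng = random.Random(seed)
    vectors, seeded = {}, []
    for word in vocab:
        if word in pretrained and len(pretrained[word]) == dim:
            # shared word with a matching size: start from the pretrained vector
            vectors[word] = list(pretrained[word])
            seeded.append(word)
        else:
            # unseen word: small random initialisation
            vectors[word] = [(rng.random() - 0.5) / dim for _ in range(dim)]
    return vectors, seeded

# hypothetical toy data
toy_pretrained = {'logic': [0.1, 0.2, 0.3], 'bee': [0.4, 0.5, 0.6]}
vectors, seeded = seed_from_pretrained(['logic', 'dasein'], toy_pretrained, dim=3)
print(seeded)
```

Training would then adjust the seeded vectors toward the corpus, which is the sense of transfer learning this section aims for.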

In [15]:
test_w2v(base_model_wv, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- rule (0.82387)
- moral law (0.80709)
- freedom (0.80685)
- causality (0.79339)
- happiness (0.79102)

Positive - ['self', 'consciousness']	Negative - []
- self consciousness (0.92469)
- essence (0.87722)
- immediacy (0.87623)
- objectivity (0.87523)
- negativity (0.87135)

Positive - ['dialectic']	Negative - []
- method (0.93279)
- doctrine (0.9254)
- antinomy (0.92351)
- deduction (0.91393)
- exposition (0.89874)

Positive - ['logic']	Negative - []
- pure reason (0.85429)
- doctrine (0.84595)
- method (0.838)
- science (0.82746)
- dialectic (0.81556)



These seem to reflect the true uses of words in German Idealism. 'Self' + 'consciousness' is rightly associated with 'self consciousness' and 'moral' + 'law' with 'moral law'. It even identifies the German Idealist tendency to unify logic and metaphysics. 

These vectors can fairly be said to reflect how German Idealists use these terms. Moreover, they differ significantly from the original GloVe model, which indicates that real learning took place here.
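
Merged tokens such as 'moral law' and 'self consciousness' come from the `Phrases` passes. As a rough sketch of how we understand gensim's default scoring, a bigram's score is `(count(a b) - min_count) / count(a) / count(b) * vocabulary_size`, and the pair is merged when the score exceeds `threshold`. The function `toy_phrase_scores` and the toy sentences are hypothetical, and sizing the vocabulary as distinct unigrams plus distinct bigrams is our assumption about gensim's bookkeeping.

```python
from collections import Counter

def toy_phrase_scores(sentences, min_count=20, threshold=10.0):
    """Score adjacent word pairs and keep those above the threshold."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    # assumption: the vocabulary holds both single words and word pairs
    vocab_size = len(unigrams) + len(bigrams)
    kept = {}
    for (a, b), pair_count in bigrams.items():
        if pair_count < min_count:
            continue
        score = (pair_count - min_count) / unigrams[a] / unigrams[b] * vocab_size
        if score > threshold:
            kept[(a, b)] = score
    return kept

# hypothetical toy corpus: 'moral law' recurs, the other pairs do not
toy_sentences = [['moral', 'law']] * 5 + [['moral', 'duty'], ['natural', 'law']]
print(toy_phrase_scores(toy_sentences, min_count=1, threshold=0.5))
```

Raising `min_count` or `threshold` merges fewer pairs, which is why the full-corpus model later uses a higher `min_count`.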

For comparison, let's check these same terms, but as used by Phenomenologists.

#### Phenomenology Comparison

In [16]:
def train_glove(source_type, source, glove_vectors, threshold=10, stopwords=None,
                min_count=20):
  # isolate the relevant school or author (relies on the global `df`)
  documents = df[df[source_type] == source]['gensim_tokenized']

  # format the series to be used; default to no stopword filtering
  stopwords = stopwords or []

  sentences = [sentence for sentence in documents]
  cleaned = []
  for sentence in sentences:
    cleaned_sentence = [word.lower() for word in sentence]
    cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
    cleaned.append(cleaned_sentence)

  # get bigrams
  bigram = Phrases(cleaned, min_count=min_count, threshold=threshold, 
                   delimiter=b' ')
  bigram_phraser = Phraser(bigram)

  bigramed_tokens = []
  for sent in cleaned:
      tokens = bigram_phraser[sent]
      bigramed_tokens.append(tokens)

  # run again to get trigrams
  trigram = Phrases(bigramed_tokens, min_count=min_count, threshold=threshold, 
                    delimiter=b' ')
  trigram_phraser = Phraser(trigram)

  trigramed_tokens = []
  for sent in bigramed_tokens:
      tokens = trigram_phraser[sent]
      trigramed_tokens.append(tokens)

  # build a toy model to update with
  model = Word2Vec(size=300, min_count=5)
  model.build_vocab(trigramed_tokens)
  total_examples = model.corpus_count

  # add GloVe's words to the vocabulary (only the word list is passed here, so
  # GloVe's weights are not loaded into the model)
  model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

  # train on our data
  model.train(trigramed_tokens, total_examples=total_examples, epochs=model.epochs)
  model_wv = model.wv
  del model
  return model_wv

In [17]:
ph_model = train_glove(source_type='school', source='phenomenology', glove_vectors=glove_vectors)

In [18]:
test_w2v(ph_model, pairs_to_try)

Positive - ['law', 'moral']	Negative - []
- ideality (0.99645)
- unfolding (0.99551)
- ab (0.99506)
- class (0.99357)
- intentionality (0.99174)

Positive - ['self', 'consciousness']	Negative - []
- certainty (0.95246)
- potentiality (0.94868)
- existence (0.94029)
- nature (0.93831)
- authentic (0.93513)

Positive - ['dialectic']	Negative - []
- danger (0.98913)
- room (0.9881)
- shadow (0.9869)
- furthermore (0.9869)
- publicness (0.98683)

Positive - ['logic']	Negative - []
- definition (0.98022)
- rpretation (0.9788)
- type (0.9781)
- transcendence (0.97677)
- dispensation (0.97635)



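Applying the phenomenology vectors to some central terms of German Idealism gives plausible neighbors for 'self' + 'consciousness', but much less meaningful ones where the words are rarely used by the phenomenologists: 'dialectic', for instance, lands next to 'danger', 'room', and 'shadow' with similarities near 0.99. This is to be expected for terms with little training signal. Let's try the word vectors on some central terms of phenomenology.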

In [19]:
pairs_to_try = [(['perception'], []),
                (['dasein'], []),
                (['consciousness'], []),
                (['method'], []),]

test_w2v(ph_model, pairs_to_try)

Positive - ['perception']	Negative - []
- act (0.9331)
- death (0.93105)
- representation (0.93053)
- movement (0.92326)
- synthesis (0.91799)

Positive - ['dasein']	Negative - []
- being (0.8908)
- truth (0.87897)
- itself (0.87614)
- future (0.85072)
- consciousness (0.83773)

Positive - ['consciousness']	Negative - []
- movement (0.93418)
- representation (0.91543)
- body (0.90922)
- existence (0.90389)
- future (0.90297)

Positive - ['method']	Negative - []
- spirit (0.97806)
- necessity (0.95537)
- phenomenology (0.95236)
- ontology (0.95015)
- history (0.95015)



#### Training on every school & author

To further explore this, we'll train w2v models in this way for each school and examine how each of them looks at the same word - 'philosophy.' We can use these in our future dashboard work.

In [20]:
w2v_dict = {}

for school in df['school'].unique():
  w2v_dict[school] = train_glove('school', school, glove_vectors=glove_vectors)
  print(f'{school} completed')

plato completed
aristotle completed
empiricism completed
rationalism completed
analytic completed
continental completed
phenomenology completed
german_idealism completed
communism completed
capitalism completed
stoicism completed


In [21]:
for school in df['school'].unique():
  print(f'\t{school.upper()}')
  print('----------------------')
  test_w2v(w2v_dict[school], [(['philosophy'], [])])

	PLATO
----------------------
Positive - ['philosophy']	Negative - []
- exploits (0.94807)
- every kind (0.94536)
- nourishment (0.9443)
- construction (0.9407)
- waves (0.93759)

	ARISTOTLE
----------------------
Positive - ['philosophy']	Negative - []
- practice (0.90368)
- affairs (0.89227)
- writer (0.8858)
- mortals (0.88398)
- philosophical (0.87431)

	EMPIRICISM
----------------------
Positive - ['philosophy']	Negative - []
- subtle (0.91686)
- practice (0.913)
- religion (0.91133)
- ethics (0.88737)
- popular (0.87861)

	RATIONALISM
----------------------
Positive - ['philosophy']	Negative - []
- name (0.93098)
- prince (0.92751)
- territory (0.92547)
- intention (0.92448)
- doctrine (0.92422)

	ANALYTIC
----------------------
Positive - ['philosophy']	Negative - []
- philosophical (0.88962)
- semantics (0.85898)
- symbolism (0.83177)
- quotation (0.82118)
- modern (0.81951)

	CONTINENTAL
----------------------
Positive - ['philosophy']	Negative - []
- history (0.95461)
- notio

Interestingly, many of these top words align quite strongly with the school's general attitude towards philosophy. Continental thinkers mentioning unreason, analytic philosophers focusing on semantics, and phenomenologists associating philosophy with a method all track well. The results that make less sense come from schools that don't problematize the nature of philosophy to any great degree: capitalist thinkers, for example, rarely discuss the nature of philosophy.

We'd also like vectors trained for each individual author. We can use these in our dashboard to enable intra-school comparisons of authors and generally allow for more fine-grained data exploration.

In [22]:
#@title Glove Training Function Modified for Authors
def train_glove_author(author, glove_vectors, threshold=10, stopwords=None,
                       min_count=20):
  # isolate the relevant author (relies on the global `df`)
  documents = df[df['author'] == author]['gensim_tokenized']

  # format the series to be used; default to no stopword filtering
  stopwords = stopwords or []

  sentences = [sentence for sentence in documents]
  cleaned = []
  for sentence in sentences:
    cleaned_sentence = [word.lower() for word in sentence]
    cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
    cleaned.append(cleaned_sentence)

  # get bigrams
  bigram = Phrases(cleaned, min_count=min_count, threshold=threshold, 
                   delimiter=b' ')
  bigram_phraser = Phraser(bigram)

  bigramed_tokens = []
  for sent in cleaned:
      tokens = bigram_phraser[sent]
      bigramed_tokens.append(tokens)

  # run again to get trigrams
  trigram = Phrases(bigramed_tokens, min_count=min_count, threshold=threshold, 
                    delimiter=b' ')
  trigram_phraser = Phraser(trigram)

  trigramed_tokens = []
  for sent in bigramed_tokens:
      tokens = trigram_phraser[sent]
      trigramed_tokens.append(tokens)

  # build a toy model to update with
  model = Word2Vec(size=300, min_count=5)
  model.build_vocab(trigramed_tokens)
  total_examples = model.corpus_count

  # add GloVe's words to the vocabulary (only the word list is passed here, so
  # GloVe's weights are not loaded into the model)
  model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

  # train on our data
  model.train(trigramed_tokens, total_examples=total_examples, epochs=model.epochs)
  model_wv = model.wv
  del model
  return model_wv

In [23]:
for author in df['author'].unique():
  w2v_dict[author] = train_glove('author', author, glove_vectors=glove_vectors)
  print(f'{author} completed')

Plato completed
Aristotle completed
Locke completed
Hume completed
Berkeley completed
Spinoza completed
Leibniz completed
Descartes completed
Malebranche completed
Russell completed
Moore completed
Wittgenstein completed
Lewis completed
Quine completed
Popper completed
Kripke completed
Foucault completed
Derrida completed
Deleuze completed
Merleau-Ponty completed
Husserl completed
Heidegger completed
Kant completed
Fichte completed
Hegel completed
Marx completed
Lenin completed
Smith completed
Ricardo completed
Keynes completed
Epictetus completed
Marcus Aurelius completed


With this finished, our next step is to train a model on the entire corpus for use in classification.

#### Building a Model for the full Corpus

In [24]:
documents = df['gensim_tokenized']

# format the series to be used
stopwords = []

sentences = [sentence for sentence in documents]
cleaned = []
for sentence in sentences:
  cleaned_sentence = [word.lower() for word in sentence]
  cleaned_sentence = [word for word in cleaned_sentence if word not in stopwords]
  cleaned.append(cleaned_sentence)

# get bigrams
bigram = Phrases(cleaned, min_count=30, threshold=10, 
                  delimiter=b' ')
bigram_phraser = Phraser(bigram)

bigramed_tokens = []
for sent in cleaned:
    tokens = bigram_phraser[sent]
    bigramed_tokens.append(tokens)

# run again to get trigrams
trigram = Phrases(bigramed_tokens, min_count=30, threshold=10, 
                  delimiter=b' ')
trigram_phraser = Phraser(trigram)

trigramed_tokens = []
for sent in bigramed_tokens:
    tokens = trigram_phraser[sent]
    trigramed_tokens.append(tokens)

# build a toy model to update with
all_text_model = Word2Vec(size=300, min_count=5)
all_text_model.build_vocab(trigramed_tokens)
total_examples = all_text_model.corpus_count

# add GloVe's words to the vocabulary (only the word list is passed here, so
# GloVe's weights are not loaded into the model)
all_text_model.build_vocab([list(glove_vectors.vocab.keys())], update=True)

# train on our data
all_text_model.train(trigramed_tokens, total_examples=total_examples, 
                     epochs=all_text_model.epochs)
all_text_wv = all_text_model.wv

As a test case, let's see how the philosophy corpus thinks of 'philosophy' as compared to how GloVe thinks of it.

In [25]:
for label, vectors in [('PHILOSOPHY CORPUS', all_text_wv),
                       ('BASE GLOVE', glove_vectors)]:
  print(f'\t{label}')
  print('------------------------------------')
  test_w2v(vectors, [(['philosophy'], [])])


	PHILOSOPHY CORPUS
------------------------------------
Positive - ['philosophy']	Negative - []
- metaphysics (0.7847)
- theology (0.77706)
- religion (0.72583)
- transcendental philosophy (0.71516)
- philosophical (0.71454)

	BASE GLOVE
------------------------------------
Positive - ['philosophy']	Negative - []
- theology (0.88151)
- philosophical (0.84362)
- mathematics (0.83389)
- psychology (0.82387)
- sociology (0.81085)



This stands to reason: 'metaphysics' often has a different meaning outside of philosophical discussion, so it's not surprising that it tops the corpus list while missing from GloVe's top five.

#### Exporting for Only New Texts


In [35]:
w2v_dict['Wittgenstein'].save('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/Wittgenstein_w2v.wordvectors')
w2v_dict['analytic'].save('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/analytic_w2v.wordvectors')

In [None]:
# use this cell to build the newest author/text/school in the corpus, then export in the cell below
epictetus_wv = train_glove(source_type='author', source='Epictetus', glove_vectors=glove_vectors)
aurelius_wv = train_glove(source_type='author', source='Marcus Aurelius', glove_vectors=glove_vectors)
stoicism_wv = train_glove(source_type='school', source='stoicism', glove_vectors=glove_vectors)

In [None]:
# test out new model
stoicism_wv.most_similar('man')

[('without', 0.9999505877494812),
 ('thou art', 0.9999483823776245),
 ('anything', 0.999948263168335),
 ('neither', 0.9999476671218872),
 ('an', 0.999946117401123),
 ('but', 0.9999449253082275),
 ('him', 0.9999449253082275),
 ('thou hast', 0.9999435544013977),
 ('able', 0.9999430179595947),
 ('before', 0.9999427795410156)]

In [None]:
epictetus_wv.save('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/Epictetus_w2v.wordvectors')
stoicism_wv.save('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/stoicism_w2v.wordvectors')

#### Finalized exporting

All in all, things look good, so let's export the vectors so that they can be used in our neural networks and in our dash app. 

In [None]:
# do not run these cells if you want to keep old w2v versions
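# note: without binary=True, save_word2vec_format writes the plain-text word2vec
# format, despite the .bin extension
all_text_wv.save_word2vec_format('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/w2v_for_nn.bin')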
all_text_wv.save('/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/w2v_for_nn.wordvectors')

for source in w2v_dict.keys():
  w2v_dict[source].save(f'/gdrive/MyDrive/Colab_Projects/philosophy_data_project/w2v_models/{source}_w2v.wordvectors')
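
Since the export cell warns against overwriting older vectors, one possible guard is sketched below. It builds each path with the same `{source}_w2v.wordvectors` naming used above and returns nothing when the file already exists. The helper `export_path` and its `overwrite` flag are hypothetical, not part of the original workflow.

```python
import os

def export_path(model_dir, source, overwrite=False):
    """Return the save path for `source`, or None if it exists and must be kept."""
    path = os.path.join(model_dir, f'{source}_w2v.wordvectors')
    if os.path.exists(path) and not overwrite:
        return None
    return path

# hypothetical usage with the notebook's objects:
# for source, vectors in w2v_dict.items():
#     path = export_path(model_dir, source)
#     if path is not None:
#         vectors.save(path)
```

Passing `overwrite=True` restores the behavior of the loop above.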

And that's it! See our other notebooks for more of the modeling work. 