## Word2Vec Embedding with CA10 Opinions

I used a combination of this [tutorial](https://www.shanelynn.ie/word-embeddings-in-python-with-spacy-and-gensim/) and a little bit of [this one](https://machinelearningmastery.com/develop-word-embeddings-python-gensim/) in a desperate attempt to get it to work. I also used [this](https://radimrehurek.com/gensim/models/word2vec.html) to troubleshoot gensim syntax.

In [41]:
import os
import io
import string
import tqdm
import numpy as np
import sys
import re
import random
from gensim.models import Word2Vec, KeyedVectors
from gensim.similarities import Similarity
from gensim.models.phrases import Phraser, Phrases
import gensim
#turn on to get progress feedback; I used it when it kept failing
#import logging
#logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 

In [16]:
#import a sample of texts into one massive list
#I know I won't be able to discern which case is which but ¯\_(ツ)_/¯ that's future Keira's problem
corpus=[]
ca10='ca10/'
file_sample=random.sample(os.listdir(ca10), 5000)
for file_name in file_sample:
    file=os.path.join(ca10,file_name)
    #read inside a with block so each file handle is closed
    with open(file, encoding='latin-1') as f:
        txt=f.read()
    corpus.append(txt)
    

print('Found %s texts.' % len(corpus))


Found 5000 texts.


In [32]:
with open("ca10model_trainning_sample.txt", "w") as output:
    output.write(str(file_sample))
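
Writing `str(file_sample)` stores the Python repr of the list, which is awkward to read back later. A minimal sketch of an alternative, assuming JSON is acceptable for the sample record; `save_sample` and `load_sample` are hypothetical helper names, not part of the notebook:

```python
import json


def save_sample(file_names, path):
    """Write the sampled file names as a JSON array."""
    with open(path, "w", encoding="utf-8") as output:
        json.dump(list(file_names), output)


def load_sample(path):
    """Read the sampled file names back as a Python list."""
    with open(path, encoding="utf-8") as source:
        return json.load(source)
```

With this, `load_sample(path)` returns the same list that was saved, so the training sample can be rebuilt in a later session.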

In [17]:
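#yay data cleaning
#each text becomes a list of sentences, and each sentence a list of tokens
sentences = []
for ii in range(len(corpus)):
    #strip punctuation, then split each line on single spaces
    #NB: '\\n' matches a literal backslash followed by n, not a newline character
    sentences = [re.sub(pattern=r'[\!"#$%&\*+,-./:;<=>?@^_`()|~=]', 
                        repl='', 
                        string=x
                       ).strip().split(' ') for x in corpus[ii].split('\\n') 
                      if not x.endswith('writes:')]
    #drop lines that were empty after cleaning
    sentences = [x for x in sentences if x != ['']]
    corpus[ii] = sentences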
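
The cleaning step can also be written as two small functions. This is a sketch, not the notebook's exact behaviour: it assumes the texts contain real newline characters (so it uses `splitlines()` instead of splitting on a literal `\n`), and it splits on runs of whitespace so double spaces do not produce empty tokens. `tokenize_line` and `tokenize_document` are hypothetical names.

```python
import re

# Same punctuation set as the cleaning cell; apostrophes and backslashes are kept.
PUNCTUATION = re.compile(r'[!"#$%&*+,\-./:;<=>?@^_`()|~]')


def tokenize_line(line):
    """Remove the listed punctuation and split on runs of whitespace."""
    return PUNCTUATION.sub('', line).split()


def tokenize_document(text):
    """Return one token list per line, skipping lines with no tokens."""
    token_lists = (tokenize_line(line) for line in text.splitlines())
    return [tokens for tokens in token_lists if tokens]
```

For example, `tokenize_document("The court, however,  held.\n\nSee 10 U.S.C. (2000)")` gives one token list for each of the two non-empty lines.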

In [18]:
all_sentences = []
for text in corpus:
    all_sentences += text

In [37]:
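#connector words from the tutorial; not passed to Phrases below (gensim 4 takes them as connector_words=...)
common_terms = ["of", "with", "without", "and", "or", "the", "a"]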
phrases = Phrases(all_sentences)
bigram = Phraser(phrases)
all_sentences = list(bigram[all_sentences])

2022-04-13 20:37:28,676 : INFO : collecting all words and their counts
2022-04-13 20:37:28,680 : INFO : PROGRESS: at sentence #0, processed 0 words and 0 word types
2022-04-13 20:37:28,927 : INFO : PROGRESS: at sentence #10000, processed 152176 words and 64684 word types
2022-04-13 20:37:29,123 : INFO : PROGRESS: at sentence #20000, processed 287671 words and 106594 word types
2022-04-13 20:37:29,299 : INFO : PROGRESS: at sentence #30000, processed 408062 words and 142799 word types
2022-04-13 20:37:29,543 : INFO : PROGRESS: at sentence #40000, processed 544993 words and 179831 word types
2022-04-13 20:37:29,746 : INFO : PROGRESS: at sentence #50000, processed 663222 words and 212813 word types
2022-04-13 20:37:29,973 : INFO : PROGRESS: at sentence #60000, processed 812171 words and 254420 word types
2022-04-13 20:37:30,225 : INFO : PROGRESS: at sentence #70000, processed 959789 words and 293207 word types
2022-04-13 20:37:30,496 : INFO : PROGRESS: at sentence #80000, processed 1117006
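
Joined tokens such as `exclude_every` and `daily_living` in the results further down come from this phrase-detection step. A rough sketch of the default scoring rule as the gensim documentation describes it, `(bigram_count - min_count) / count_a / count_b * vocab_size`, keeping pairs whose score exceeds `threshold`. Assumptions: `score_bigrams` is a hypothetical helper, `vocab_size` counts only single words (a simplification), and the defaults mirror gensim's documented `min_count=5` and `threshold=10.0`.

```python
from collections import Counter


def score_bigrams(sentences, min_count=5, threshold=10.0):
    """Return {(word_a, word_b): score} for adjacent pairs scoring above threshold."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))

    vocab_size = len(unigrams)  # simplification: single words only
    kept = {}
    for (word_a, word_b), pair_count in bigrams.items():
        score = (pair_count - min_count) / unigrams[word_a] / unigrams[word_b] * vocab_size
        if score > threshold:
            kept[(word_a, word_b)] = score
    return kept
```

Pairs that co-occur often relative to how common each word is on its own score highest, which is why rare but sticky pairs get merged.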

In [45]:
# Passing the sentences here builds the vocabulary and trains the model
model = Word2Vec(all_sentences, 
                 min_count=3,   # Ignore words that appear fewer than this many times
                 workers=2,     # Number of worker threads (parallelisation)
                 window=5)      # Context window for words during training
       

2022-04-13 20:43:08,700 : INFO : collecting all words and their counts
2022-04-13 20:43:08,704 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2022-04-13 20:43:08,775 : INFO : PROGRESS: at sentence #10000, processed 142937 words, keeping 17391 word types
2022-04-13 20:43:08,859 : INFO : PROGRESS: at sentence #20000, processed 270862 words, keeping 27424 word types
2022-04-13 20:43:08,920 : INFO : PROGRESS: at sentence #30000, processed 382997 words, keeping 35735 word types
2022-04-13 20:43:08,981 : INFO : PROGRESS: at sentence #40000, processed 512059 words, keeping 43384 word types
2022-04-13 20:43:09,038 : INFO : PROGRESS: at sentence #50000, processed 623351 words, keeping 50130 word types
2022-04-13 20:43:09,103 : INFO : PROGRESS: at sentence #60000, processed 762584 words, keeping 58934 word types
2022-04-13 20:43:09,165 : INFO : PROGRESS: at sentence #70000, processed 901686 words, keeping 66959 word types
2022-04-13 20:43:09,233 : INFO : PROGRESS: at 

In [50]:
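#train on all_sentences: after the cleaning loop, `sentences` holds only the last document,
#which is why the log below (from a run on `sentences`) reports just 3311 raw words per epoch
model.train(all_sentences, total_examples=model.corpus_count, epochs=30, report_delay=1)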

2022-04-13 20:49:11,900 : INFO : Word2Vec lifecycle event {'msg': 'training model with 2 workers on 263215 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5', 'datetime': '2022-04-13T20:49:11.900278', 'gensim': '4.0.1', 'python': '3.8.8 (default, Apr 13 2021, 12:59:45) \n[Clang 10.0.0 ]', 'platform': 'macOS-10.14.6-x86_64-i386-64bit', 'event': 'train'}
2022-04-13 20:49:11,910 : INFO : worker thread finished; awaiting finish of 1 more threads
2022-04-13 20:49:11,921 : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-13 20:49:11,923 : INFO : EPOCH - 1 : training on 3311 raw words (2128 effective words) took 0.0s, 125791 effective words/s
2022-04-13 20:49:11,933 : INFO : worker thread finished; awaiting finish of 1 more threads
2022-04-13 20:49:11,947 : INFO : worker thread finished; awaiting finish of 0 more threads
2022-04-13 20:49:11,949 : INFO : EPOCH - 2 : training on 3311 raw words (2135 effective words) took 0.0s, 137392 effective

(63890, 99330)

In [51]:
model

<gensim.models.word2vec.Word2Vec at 0x10c8a3220>

In [52]:
model.vector_size

100

In [60]:
model.wv.save_word2vec_format('ca10model', binary=True)

2022-04-13 20:54:29,291 : INFO : storing 263215x100 projection weights into ca10model


In [80]:
model.wv.most_similar(positive=["man"],negative=["woman"])

[('magnitude', 0.48763507604599),
 ('sense', 0.48722323775291443),
 ('exclude_every', 0.4810146689414978),
 ('decorum', 0.4686262309551239),
 ('think', 0.4683132767677307),
 ('sort', 0.4582390785217285),
 ('invade', 0.455127090215683),
 ('endeavor', 0.45301854610443115),
 ('understanding', 0.44597065448760986),
 ('realm', 0.44386303424835205)]
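
The query above has only "man" on the positive side and only "woman" on the negative side, so it ranks words along the direction "man minus woman" rather than completing an analogy. A minimal pure-Python sketch of that ranking, assuming each input vector is unit-normalised, combined with weight +1 or -1, and compared to every other word by cosine similarity. This is a simplification for illustration, not gensim's code; `rank_by_direction` and the toy vectors are hypothetical.

```python
import math


def unit(vector):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def rank_by_direction(vectors, positive, negative, topn=10):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    weighted = [(w, 1.0) for w in positive] + [(w, -1.0) for w in negative]
    for word, weight in weighted:
        for i, x in enumerate(unit(vectors[word])):
            query[i] += weight * x
    query = unit(query)

    exclude = set(positive) | set(negative)  # the query words are not returned
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]
```

With one positive and one negative word, the top results are whatever happens to lie along that single difference direction, which may explain why the list above looks unrelated to gender.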

In [89]:
model.wv.most_similar('homosexual')

[('ethnic', 0.6522204875946045),
 ('daily_living', 0.6244298219680786),
 ("McDonnell's", 0.6004822254180908),
 ('recruiting', 0.5957875847816467),
 ('traumatic', 0.5808737277984619),
 ('minimizing', 0.5797561407089233),
 ('\\numerous', 0.5795542597770691),
 ('impulsive', 0.5721673965454102),
 ('sexual', 0.5656524896621704),
 ('antisocial', 0.5608785152435303)]