# Train word2vec from scratch

### Reasons for choosing
- `word2vec`: it is more **robust** to changes in small corpora than other standard embedding methods.
- `skip-gram architecture`: skip-gram predicts neighboring words from the central word, which generally makes it better at capturing **semantic relationships** than the alternative CBOW architecture.
- `training from scratch`: unlike fine-tuning a pre-trained model, it reflects the co-occurrence patterns found **solely in the generated datasets**, without carrying over patterns from a pre-trained model.
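
To make the skip-gram idea concrete, here is a minimal sketch of how (center, context) training pairs can be enumerated for one sentence. The function name `skipgram_pairs` and the toy sentence are illustrative assumptions, not part of this notebook, and real implementations add refinements such as subsampling and negative sampling that are not shown here.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for one tokenized sentence."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = up to `window` tokens on each side of the center word.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, only for illustration.
example_pairs = skipgram_pairs(["she", "opened", "the", "door"], window=1)
print(example_pairs)
```

Each pair asks the model to predict the second word given the first, so words that share contexts end up with similar vectors.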

In [1]:
import random

from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec

### Define global parameters

After preprocessing, the texts are stored in separate .txt files, one per gender. Because ChatGPT is frequently "at capacity" or reports "too many requests in 1 hour", and its connection through ChatGPT Wrapper is sometimes unstable, data collection can be extremely slow. The model was therefore trained in four sessions of 300 new stories each. In each session, the stories are split into sentences, tokenized, pooled, and randomly shuffled.

- PATH: Location of the files.
- FILE_NAMES: A list of file names with 100 stories for every gender.
- TRAINED_MODEL_NAME: The model saved by the previous session, which is loaded and trained further. Set it to False when training the initial model on the first 300 stories.
- SAVE_MODEL_NAME: The model to be saved after training.
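
As an alternative to commenting lines in and out for each session, the settings could be derived from a session number. The sketch below is a hypothetical helper (`session_config`, `PREFIXES`, and `ORDINALS` are assumed names, not part of this notebook) that follows the file and model naming pattern used in the next cell. It returns `None` instead of `False` for the first session; both are falsy, so the same `if` check would apply.

```python
from typing import List, Optional, Tuple

# Hypothetical constants mirroring the naming pattern used in this notebook.
PREFIXES = ["h", "s", "t"]          # one file-name prefix per gender
STORIES_PER_FILE = 100
ORDINALS = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th"}


def session_config(session: int) -> Tuple[List[str], Optional[str], str]:
    """Return (file names, model to load, model to save) for sessions 1-4."""
    first = (session - 1) * STORIES_PER_FILE + 1
    last = session * STORIES_PER_FILE
    file_names = [f"{p}{first}-{last}" for p in PREFIXES]

    def model_name(n: int) -> str:
        return f"{ORDINALS[n]}_{n * STORIES_PER_FILE}_hst"

    # The first session has no earlier model to load.
    trained = model_name(session - 1) if session > 1 else None
    return file_names, trained, model_name(session)


file_names, trained_name, save_name = session_config(4)
print(file_names, trained_name, save_name)
```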

In [2]:
PATH = ''
# FILE_NAMES = ['h1-100', 's1-100', 't1-100']
# FILE_NAMES = ['h101-200', 's101-200', 't101-200']
# FILE_NAMES = ['h201-300', 's201-300', 't201-300']
FILE_NAMES = ['h301-400', 's301-400', 't301-400']

TRAINED_MODEL_NAME = '3rd_300_hst'           # set to False for the first session
SAVE_MODEL_NAME = '4th_400_hst'              # '1st_100_hst', '2nd_200_hst', '3rd_300_hst', '4th_400_hst'

In [3]:
data = []
for f_name in FILE_NAMES:
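    # Read one file and split it into rough sentences on '.'.
    with open(PATH + f_name + '.txt') as f:
        # Replace newlines with spaces so words on adjacent lines are not merged.
        sentences = f.read().replace('\n', ' ').split('.')
        data += [s for s in sentences if s.strip()]  # skip empty fragments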

In [4]:
tok = [word_tokenize(i) for i in data]
random.shuffle(tok)
# len(tok)
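
Splitting on `'.'` alone treats text ending in `!` or `?` as part of the following sentence. The sketch below shows one possible alternative using only the standard library. The helpers `split_sentences` and `simple_tokenize` are hypothetical, and `simple_tokenize` is only a rough stand-in for `word_tokenize`, so its output will differ from NLTK's on contractions and similar cases. A seeded `random.Random` is used so the shuffle is repeatable.

```python
import random
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split text after '.', '!' or '?' and drop empty fragments."""
    text = " ".join(text.split())  # collapse newlines and repeated spaces
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if p]


def simple_tokenize(sentence: str) -> List[str]:
    """Rough stand-in for word_tokenize: words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)


# Hypothetical sample text, only for illustration.
story = "Anna was tired.\nShe went home! Did she sleep?"
sentences = [simple_tokenize(s) for s in split_sentences(story)]

rng = random.Random(0)  # fixed seed so the shuffle is repeatable
shuffled = sentences[:]
rng.shuffle(shuffled)
```

Shuffling only changes the order of the sentences; the tokens inside each sentence stay in place, which is what the context window relies on.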

### Training and saving the model




In [5]:
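if TRAINED_MODEL_NAME:
    # Continue training the model saved by the previous session.
    model = Word2Vec.load(f"{TRAINED_MODEL_NAME}.model")
    if model.epochs == 5:
        # Add words that first appear in this session to the vocabulary;
        # without this step, train() ignores words the loaded model has never seen.
        model.build_vocab(tok, update=True)
        model.train(tok, total_examples=len(tok), epochs=model.epochs)
else:
    # First session: build the vocabulary and train a skip-gram (sg=1) model.
    model = Word2Vec(sentences=tok, vector_size=100, min_count=1, sg=1, workers=4, epochs=5)

model.save(f"{SAVE_MODEL_NAME}.model")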
    