# NLP Word Embedding using Word2Vec and GloVe

A word embedding represents a word or phrase as a dense vector of real numbers. The model learns these vectors so that they reflect similarities between words. In this way, AI models can better capture the relationships between words.

Word embeddings are learned from text data. Given a text corpus (a large collection of text), the model learns, for each word, its relationship with the words around it. This relationship is based on how often the word co-occurs with its surrounding words. From this co-occurrence information, a vector is created for each word. These vectors are designed to reflect similarities between the meanings of the words.
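
To make the idea of co-occurrence concrete, the sketch below counts how often two words appear within a fixed window of each other. The toy corpus, the function name `cooccurrence_counts`, and the window size are assumptions made for illustration. Note that GloVe is built on counts of this kind, while Word2Vec learns from (word, context) pairs without storing an explicit count table.

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair appears within `window` tokens."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:  # a word is not its own neighbor
                    counts[(word, tokens[j])] += 1
    return counts

# Hypothetical, already tokenized toy corpus
toy_corpus = [
    ["teacher", "grades", "student", "exam"],
    ["student", "asks", "teacher", "question"],
]

counts = cooccurrence_counts(toy_corpus, window=2)
print(counts[("teacher", "student")])  # 2
print(counts[("teacher", "exam")])     # 0, the two words are 3 positions apart
```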

For example, the words “teacher”, “student”, and “grading” often appear in the same texts and have related meanings, so their resulting vectors will be close to each other. However, the words “teacher” and “orange” rarely appear in the same texts and have different meanings, so their vectors will be far apart.
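
"Close" and "far apart" are usually measured with cosine similarity. The following sketch uses made-up 3-dimensional vectors (real embeddings have many more dimensions and are learned, not typed by hand) to show how the comparison works.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen by hand for illustration only
toy_vectors = {
    "teacher": [0.9, 0.8, 0.1],
    "student": [0.8, 0.9, 0.2],
    "orange":  [0.1, 0.0, 0.9],
}

sim_teacher_student = cosine_similarity(toy_vectors["teacher"], toy_vectors["student"])
sim_teacher_orange = cosine_similarity(toy_vectors["teacher"], toy_vectors["orange"])
print(sim_teacher_student > sim_teacher_orange)  # True
```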

## Feature Representation (Feature Extraction for word embeddings)

Each element of a word embedding vector is called a feature representation. Its value usually lies between -1 and +1, although in some pre-trained models it can fall outside this range. The process by which the model generates these feature representations is called feature extraction.

In ML, feature extraction is a method used to identify the features or attributes present in the dataset. These features make it possible to express the data in a meaningful way, and they are selected manually from the dataset.

In DL, feature extraction is a method used to learn features from the dataset. It relies on artificial neural networks such as the Convolutional Neural Network (CNN) and the Recurrent Neural Network (RNN). These networks learn features from the dataset automatically, and the learned features let them express the data in a more meaningful way.

In summary, in ML the features are selected manually by the user, while in DL the features are detected and learned automatically by the model.
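
As a rough illustration of the manual side of this contrast, the sketch below computes a few hand-picked features for a piece of text. The function name and the three features are hypothetical choices. A learned embedding differs in that nobody names its dimensions in advance.

```python
def handcrafted_features(text):
    """Manually chosen features: the author decides in advance what to measure."""
    tokens = text.split()
    avg_length = sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
    return {
        "n_tokens": len(tokens),
        "avg_token_length": avg_length,
        "has_question_mark": "?" in text,
    }

features = handcrafted_features("who is the teacher ?")
print(features)
```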

## Word2Vec

In [3]:
!pip install gensim
# conda install -c conda-forge gensim

Collecting gensim
  Downloading gensim-4.3.2-cp311-cp311-win_amd64.whl.metadata (8.5 kB)
Collecting smart-open>=1.8.1 (from gensim)
  Downloading smart_open-7.0.4-py3-none-any.whl.metadata (23 kB)
Collecting wrapt (from smart-open>=1.8.1->gensim)
  Downloading wrapt-1.16.0-cp311-cp311-win_amd64.whl.metadata (6.8 kB)
Downloading gensim-4.3.2-cp311-cp311-win_amd64.whl (24.0 MB)

In [4]:
from nltk.tokenize import word_tokenize
import pandas as pd
from gensim.models import Word2Vec
import nltk

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\serda\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\serda\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\serda\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\serda\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [6]:
df = pd.read_csv('newspaper.zip', names = ["news"])
df

# pandas' read_csv function can also read zip files directly.
# The zipped corpus is a plain txt file with no header row, so we supply the column name ourselves with the names parameter.
# If the names parameter is not used, the text in the first line is assigned as the column name by default.

Unnamed: 0,news
0,iran devlet televizyonu ülkedeki eyaletin sind...
1,gösterilerde fitnecilere ölüm münafıklara ölüm...
2,dini lider ali hamaney ve cumhurbaşkanı mahmud...
3,musevi ye ölüm ve idam idam sloganları duyuldu
4,muhalefet liderleri kaçtı mı aşure günü yaşana...
...,...
411520,dışişleri bakanlığı ndan yapılan yazılı açıkla...
411521,açıklamada abd nin ankara büyükelçiliği ve ist...
411522,seyahat uyarısı güncelleme kararının temmuz da...
411523,amerikalı turistlerin açıkça türkiye deki ulus...


In [7]:
df.news[0]

'iran devlet televizyonu ülkedeki eyaletin sinde yapılan reformcuları protesto amaçlı yürüyüşlere milyonlarca kişinin katıldığını bildirdi '

In [8]:
word_tokenize(df.news[0])

['iran',
 'devlet',
 'televizyonu',
 'ülkedeki',
 'eyaletin',
 'sinde',
 'yapılan',
 'reformcuları',
 'protesto',
 'amaçlı',
 'yürüyüşlere',
 'milyonlarca',
 'kişinin',
 'katıldığını',
 'bildirdi']

In [9]:
%%time
corpus=[word_tokenize(i) for i in df.news]
print(corpus[:5])

# Word2Vec expects the corpus as a list of token lists (one inner list per document/line). For this reason, we loop over the documents/lines one by one and split each into word tokens.
# The word_tokenize function splits the text into word tokens and returns them as a list.

[['iran', 'devlet', 'televizyonu', 'ülkedeki', 'eyaletin', 'sinde', 'yapılan', 'reformcuları', 'protesto', 'amaçlı', 'yürüyüşlere', 'milyonlarca', 'kişinin', 'katıldığını', 'bildirdi'], ['gösterilerde', 'fitnecilere', 'ölüm', 'münafıklara', 'ölüm', 'abd', 'ye', 'ölüm', 'ingiltere', 'ye', 'ölüm', 'sloganları', 'atıldı'], ['dini', 'lider', 'ali', 'hamaney', 've', 'cumhurbaşkanı', 'mahmud', 'ahmedinejad', 'ı', 'destekleyen', 'iranlılar', 'son', 'olaylarda', 'yeğeni', 'öldürülen', 'mir', 'hüseyin', 'musevi', 'başta', 'olmak', 'üzere', 'muhalefet', 'liderlerini', 'kınadılar'], ['musevi', 'ye', 'ölüm', 've', 'idam', 'idam', 'sloganları', 'duyuldu'], ['muhalefet', 'liderleri', 'kaçtı', 'mı', 'aşure', 'günü', 'yaşanan', 'çatışmalarda', 'devlet', 'kaynaklarına', 'göre', 'u', 'terörist', 'olmak', 'üzere', 'kişi', 'ölmüştü']]
CPU times: total: 28.1 s
Wall time: 1min 26s
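
Building the whole tokenized corpus in memory works here, but it is not the only option. As far as the gensim documentation describes it, `sentences` can be any iterable of token lists that can be iterated more than once. The sketch below shows such a restartable iterable; the class name `TokenizedLines` is hypothetical and `str.split()` is only a whitespace stand-in for `word_tokenize`.

```python
class TokenizedLines:
    """Restartable iterable that yields one list of tokens per document."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            # Whitespace split used as a simple stand-in for word_tokenize
            yield line.split()

lines = ["iran devlet televizyonu", "muhalefet liderleri kaçtı mı"]
corpus_stream = TokenizedLines(lines)

# Unlike a plain generator, the object can be iterated several times,
# which matters because training makes multiple passes over the data.
first_pass = list(corpus_stream)
second_pass = list(corpus_stream)
print(first_pass)
```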


In [10]:
from gensim.utils import effective_n_jobs

effective_n_jobs(-1)

# Returns the maximum number of cores on your computer that you can use for training.

12

In [11]:
%%time
model = Word2Vec(sentences=corpus, 
                 alpha=0.025, # initial learning rate
                 vector_size=100, # dimensionality of the word embeddings
                 window=5, # how many words before and after to be considered
                 min_count=5, # ignore tokens that occur fewer than 5 times
                 sg=1, # skip-gram; default is 0 (CBOW)
                 workers=12) # number of worker threads

# vector_size: the number of dimensions of the word embeddings we want.
# window: how many tokens before and after a token are taken into account when establishing semantic relationships between that token and the other tokens.
# Values between 5 and 15 are commonly recommended.
# min_count: tokens that occur fewer than 5 times in the corpus are not included in the training. Usually values such as 3, 4, or 5 are preferred.
# sg=1: train with the skip-gram algorithm.
# sg=0: train with the CBOW algorithm.
# alpha: initial learning rate.
# workers: number of worker threads (cores) used for training.

CPU times: total: 10min 33s
Wall time: 1min 40s
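
To show what `min_count` and `window` mean in practice, here is a simplified sketch with a toy corpus. The helper names `prune_vocabulary` and `skipgram_pairs` are hypothetical and this is not gensim's implementation, which adds details such as subsampling and negative sampling that are omitted here.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    """Keep only tokens that occur at least `min_count` times in the whole corpus."""
    freq = Counter(token for tokens in sentences for token in tokens)
    return [[t for t in tokens if freq[t] >= min_count] for tokens in sentences]

def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical toy corpus; "rare" occurs only once
toy_corpus = [
    ["ankara", "is", "the", "capital"],
    ["the", "capital", "is", "ankara"],
    ["rare", "ankara"],
]

pruned = prune_vocabulary(toy_corpus, min_count=2)
pairs = skipgram_pairs(pruned[0], window=1)
print(pruned[2])  # ['ankara'], because "rare" is below min_count
print(pairs)
```

With `sg=1` the model is trained to predict the context word from the center word in pairs like these; with `sg=0` (CBOW) it predicts the center word from its context words.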


In [12]:
model.wv['ankara'] # wv = word vectors (the trained KeyedVectors object)

# word embedding with 100 elements/dimensions

array([ 0.09865374,  0.32249972,  0.08113028,  0.2481778 , -0.33505496,
        0.15236615, -0.14483552,  0.8517523 , -0.3906854 , -0.32861456,
       -0.5102848 , -0.31961584, -0.26706418,  0.47893932,  0.03734279,
        0.12921727,  0.3474915 , -0.0190613 ,  0.16443539, -1.1152455 ,
        0.14464793,  0.49279496,  0.35456955, -0.8454716 ,  0.24567877,
        0.41372973, -0.44399342,  0.1175387 , -0.38418385,  0.27414685,
        0.12816934,  0.13105349,  0.27895677,  0.16632143,  0.23622537,
       -0.13932958, -0.00644227, -0.16935098,  0.19433688, -0.32669592,
        0.40069288, -0.21965155,  0.36966822, -0.36889434,  0.57474166,
        0.6174525 , -0.28649634,  0.14898083,  0.2916678 , -0.30093798,
        0.49045202, -0.39107722, -0.15191537,  0.0565801 , -0.15135272,
       -0.11940034,  0.22841056, -0.5871908 , -0.56219393,  0.19957703,
        0.04338143, -0.1397902 ,  0.15929896, -0.05982132, -0.54558593,
       -0.20749934,  0.05791085,  0.24839656,  0.03841949, -0.35

In [13]:
model.wv.most_similar('öğretmen')

[('okuldaki', 0.7556234002113342),
 ('öğrenci', 0.73649001121521),
 ('öğretmeni', 0.7340482473373413),
 ('öğrenciye', 0.7300719618797302),
 ('öğretmenin', 0.7291283011436462),
 ('başörtülü', 0.7160113453865051),
 ('erkekten', 0.7152495384216309),
 ('hizmetli', 0.7111763954162598),
 ('üniversite', 0.7080069780349731),
 ('öğrenciyle', 0.7061008810997009)]

In [14]:
model.wv.most_similar('kırmızı')

[('çizgileri', 0.6615160703659058),
 ('sarı', 0.6613019108772278),
 ('gömlekliler', 0.6467803120613098),
 ('çizgi', 0.642117977142334),
 ('turuncu', 0.636613130569458),
 ('gömlekli', 0.6291239857673645),
 ('renkli', 0.6171365976333618),
 ('halıda', 0.6147487759590149),
 ('ışıkta', 0.6074832677841187),
 ('bültenle', 0.606572151184082)]

In [15]:
model.wv.most_similar('eve')

[('evine', 0.839626669883728),
 ('apartmana', 0.7691282629966736),
 ('dükkana', 0.76540607213974),
 ('mağazaya', 0.7547361850738525),
 ('karakola', 0.7370243668556213),
 ('arabaya', 0.7318121790885925),
 ('hapishaneye', 0.7251378297805786),
 ('restorana', 0.7166224718093872),
 ('odasına', 0.7107588052749634),
 ('arabasına', 0.7095082402229309)]

In [16]:
model.wv.most_similar('mavi')

[('marmara', 0.8875664472579956),
 ('gemisine', 0.6941417455673218),
 ('baskınıyla', 0.6774566769599915),
 ('filo', 0.6320633888244629),
 ('filosundaki', 0.6211588382720947),
 ('saldırısındaki', 0.617448627948761),
 ('baskınının', 0.6149155497550964),
 ('baskınına', 0.6054602265357971),
 ('filodaki', 0.6010278463363647),
 ('dökme', 0.5925293564796448)]

In [18]:
model.wv.most_similar(positive=['öğrenme', 'doktor'], negative=['tedavi'], topn=8)

[('kaliteli', 0.6613228917121887),
 ('dersin', 0.656999409198761),
 ('psikoloji', 0.6441785097122192),
 ('almancayı', 0.6401728391647339),
 ('dersine', 0.6384567618370056),
 ('salgılanan', 0.6371878385543823),
 ('nesilden', 0.6354333162307739),
 ('doçent', 0.6344307065010071)]

In [22]:
model.wv.most_similar(positive=['ankara', 'belçika'], negative=['brüksel'], topn=5)

[('hollanda', 0.6296952962875366),
 ('avusturya', 0.6233313083648682),
 ('fransa', 0.6201995015144348),
 ('danimarka', 0.6020408272743225),
 ('kanada', 0.5748323798179626)]

In [20]:
model.save("word2vec.model")

In [21]:
model = Word2Vec.load("word2vec.model")

## GloVe

In [24]:
from gensim.models import KeyedVectors

# We use the KeyedVectors class to load word embeddings stored in another format (here GloVe) and query them in the same way as Word2Vec vectors.

In [25]:
glove_model = 'glove.6B.100d.txt'  # trained on 6 billion tokens, 100-dimensional vectors
model2 = KeyedVectors.load_word2vec_format(glove_model, no_header=True)  # with no_header=False, the loader expects a first line giving the word count and vector size

# The word2vec text format usually has a header on the first line of the file, which contains the word count and the vector size.
# GloVe txt files do not have this header, so we need to pass no_header=True; otherwise you will get an error.
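
The difference between the two text layouts is small enough to show on a toy example. The sketch below takes text in GloVe layout (a word followed by its values on each line) and prepends the `<word count> <vector size>` header line. The helper name and the 3-dimensional sample vectors are hypothetical, and with `no_header=True` this conversion is not needed in practice.

```python
def glove_to_word2vec_text(glove_text):
    """Prepend the '<word count> <vector size>' header used by the word2vec text format."""
    lines = [line for line in glove_text.splitlines() if line.strip()]
    vector_size = len(lines[0].split()) - 1  # the first field is the word itself
    header = f"{len(lines)} {vector_size}"
    return "\n".join([header] + lines) + "\n"

# Hypothetical 3-dimensional vectors in GloVe layout
glove_sample = "teacher 0.1 0.2 0.3\nstudent 0.2 0.1 0.4\n"

converted = glove_to_word2vec_text(glove_sample)
print(converted.splitlines()[0])  # 2 3
```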

In [26]:
model2['teacher']

array([ 0.44374 ,  0.67311 , -0.51096 ,  0.20882 , -0.10662 ,  0.55098 ,
       -0.035593,  0.25126 , -0.32789 ,  1.0762  , -0.49637 , -0.4298  ,
        0.36764 ,  0.57894 , -0.25027 , -0.41021 ,  0.086998, -0.16843 ,
       -0.85764 ,  1.0404  , -1.0314  ,  0.095147,  0.30729 ,  0.12348 ,
        0.22745 , -0.52157 , -0.72478 , -1.0843  ,  0.035966,  0.62985 ,
       -1.0991  ,  0.67161 ,  0.33797 ,  0.14551 , -0.90049 , -0.064415,
       -0.75247 ,  0.21741 ,  0.51594 , -0.46291 , -0.77598 ,  0.40705 ,
        0.1889  , -0.43402 ,  0.23202 , -0.081453, -0.3882  , -0.34444 ,
        0.080225, -0.28274 , -0.38869 , -0.58152 , -0.25558 ,  1.0027  ,
       -0.11114 , -1.5402  , -0.16761 , -0.26558 ,  0.9325  ,  0.069397,
        0.96618 ,  0.15449 , -0.22905 , -0.1761  ,  0.13225 , -0.55741 ,
        0.9234  , -0.04845 ,  0.50202 ,  1.0144  , -0.1256  ,  0.30486 ,
        0.090808,  0.17642 , -0.23146 ,  0.68386 ,  0.37269 , -0.37316 ,
       -0.025728, -1.0279  , -0.33142 ,  0.036028, 

In [27]:
model2.most_similar('ankara')

[('turkey', 0.7512096762657166),
 ('istanbul', 0.6787630915641785),
 ('turkish', 0.6690374612808228),
 ('damascus', 0.6372509002685547),
 ('tbilisi', 0.6322181820869446),
 ('erdogan', 0.6258037090301514),
 ('moscow', 0.6217040419578552),
 ('brussels', 0.6181437969207764),
 ('skopje', 0.6164302229881287),
 ('cyprus', 0.606403112411499)]

In [28]:
model2.most_similar('teacher')

[('student', 0.8083399534225464),
 ('school', 0.75455641746521),
 ('teaching', 0.7521439790725708),
 ('taught', 0.741184651851654),
 ('teachers', 0.7291542887687683),
 ('graduate', 0.7134960293769836),
 ('instructor', 0.7077120542526245),
 ('students', 0.6828974485397339),
 ('teaches', 0.6552315354347229),
 ('education', 0.6528989672660828)]

In [29]:
model2.most_similar('doctor')

[('physician', 0.7673240303993225),
 ('nurse', 0.75215083360672),
 ('dr.', 0.7175194025039673),
 ('doctors', 0.7080884575843811),
 ('patient', 0.7074184417724609),
 ('medical', 0.6995992660522461),
 ('surgeon', 0.6905338764190674),
 ('hospital', 0.6900930404663086),
 ('psychiatrist', 0.658909797668457),
 ('dentist', 0.6447421312332153)]

In [30]:
model2.most_similar(positive=['woman', 'son'], negative=['man'], topn=1)

[('daughter', 0.9090957641601562)]

In [31]:
model2.most_similar(positive=['woman', 'father'], negative=['man'], topn=1)

[('mother', 0.9024618864059448)]

In [32]:
model2.most_similar(positive=['woman', 'uncle'], negative=['man'], topn=1)

[('aunt', 0.8368030190467834)]

In [33]:
model2.most_similar(positive=['ankara', 'germany'], negative=['berlin'], topn=1)

[('turkey', 0.81471186876297)]

In [34]:
model2.most_similar(positive=['teach', 'doctor'], negative=['treat'], topn=1)

[('teacher', 0.7610154151916504)]

In [35]:
model2.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)

[('queen', 0.7698541283607483)]
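
The analogy queries above can be read as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and look for the nearest remaining word. The sketch below reproduces that idea with made-up 2-dimensional vectors. It is a simplification under stated assumptions; the library's own computation differs in detail (for example, it works with normalized vectors), and the function name `analogy` is hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(vectors, positive, negative):
    """Return the word closest to sum(positive) - sum(negative), excluding the inputs."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    candidates = [w for w in vectors if w not in positive and w not in negative]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

# Hypothetical vectors: first value ~ "royalty", second value ~ "maleness"
toy_vectors = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [-0.8, 0.5],
}

result = analogy(toy_vectors, positive=["woman", "king"], negative=["man"])
print(result)  # queen
```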

In [36]:
model2.most_similar(positive=['love', 'jealous'], negative=['hate'], topn=1)

[('lover', 0.7032662630081177)]

END OF THE PROJECT