**Implementing Word2Vec with Gensim Library in Python**
---

Before we can train a Word2Vec model, we need a corpus. Here we build one from a Wikipedia article, which we fetch and parse with a couple of libraries. The first is Beautiful Soup, a very useful Python utility for web scraping; the script below also relies on the lxml parser and on NLTK. If they are not already installed, run `pip install beautifulsoup4 lxml nltk` at the command prompt before executing the script.

In [2]:
import bs4 as bs
import urllib.request
import re
import nltk

# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Artificial_intelligence')
article = scraped_data.read()

# Parse the HTML with the lxml parser
parsed_article = bs.BeautifulSoup(article, 'lxml')

paragraphs = parsed_article.find_all('p')

article_text = ""

for p in paragraphs:
    article_text += p.text

In [3]:
# Preprocessing the data

In [4]:
# Cleaning the text: lowercase, keep letters only, collapse whitespace
processed_article = article_text.lower()
processed_article = re.sub(r'[^a-zA-Z]', ' ', processed_article)
processed_article = re.sub(r'\s+', ' ', processed_article)

# Preparing the dataset
# (the tokenizers and stop word list need NLTK data; fetch it with nltk.download() if missing)
all_sentences = nltk.sent_tokenize(processed_article)

all_words = [nltk.word_tokenize(sent) for sent in all_sentences]

# Removing stop words (build the set once instead of reloading the list for every word)
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
for i in range(len(all_words)):
    all_words[i] = [w for w in all_words[i] if w not in stop_words]

In the script above, we convert all the text to lowercase and then replace every character that is not a letter (digits, punctuation, and other special characters) with a space, collapsing any extra whitespace. After preprocessing, only the words remain.

The Word2Vec model is trained on a collection of tokenized sentences, so we need to convert the article into sentences and each sentence into words. We use the `nltk.sent_tokenize` utility for the first step and the `nltk.word_tokenize` utility for the second. Note that because the cleaning step has already removed all punctuation, `nltk.sent_tokenize` finds no sentence boundaries here and returns the whole article as a single sentence. As a last preprocessing step, we remove all the stop words from the text.

After the script completes its execution, the `all_words` object is a list of token lists, one per sentence, which together hold all the remaining words in the article. We will use this list to create our Word2Vec model with the Gensim library.
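If per-sentence training data is wanted, one option is to split the text into sentences before stripping punctuation, since a sentence tokenizer depends on that punctuation. The following is only a minimal sketch of that ordering: `split_then_clean` is a hypothetical helper, the regex splitter is a naive stand-in for `nltk.sent_tokenize`, and the tiny stop word set is made up for the example.

```python
import re

def split_then_clean(text, stop_words=frozenset()):
    """Split text into sentences first, then clean each sentence separately."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    raw_sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    tokenized = []
    for sentence in raw_sentences:
        # Same cleaning as in the article: lowercase, keep letters only.
        cleaned = re.sub(r'[^a-z]', ' ', sentence.lower())
        tokens = [w for w in cleaned.split() if w not in stop_words]
        if tokens:
            tokenized.append(tokens)
    return tokenized

sample = "AI is a field. It studies intelligent agents! Is it new?"
sentences = split_then_clean(sample, stop_words={"is", "a", "it"})
print(sentences)
```

With this ordering, the result has one token list per sentence, so a context window does not run across sentence boundaries.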

### Creating Word2Vec Model

In [5]:
from gensim.models import Word2Vec

word2vec = Word2Vec(all_words, min_count=2)
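
The `min_count=2` argument tells Gensim to ignore words whose total frequency in the corpus is lower than 2. The effect on the vocabulary can be illustrated with a small sketch built on `collections.Counter`. This is not Gensim's implementation; `filter_by_min_count` and the toy sentences are hypothetical and only mimic the filtering rule.

```python
from collections import Counter

def filter_by_min_count(sentences, min_count=2):
    """Return the words that occur at least `min_count` times, most frequent first."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return [word for word, freq in counts.most_common() if freq >= min_count]

# Toy corpus (made up for illustration): a list of tokenized sentences.
toy_sentences = [
    ["machine", "learning", "uses", "data"],
    ["deep", "learning", "uses", "neural", "networks"],
    ["neural", "networks", "learn", "from", "data"],
]

kept = filter_by_min_count(toy_sentences, min_count=2)
print(kept)  # words seen only once, such as "machine" or "deep", are dropped
```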

In [10]:
# Vocabulary using .key_to_index
vocabulary = word2vec.wv.key_to_index

for word in vocabulary:
    print(word, end=", ")

ai, intelligence, artificial, learning, human, machine, problems, many, networks, research, used, search, knowledge, use, also, neural, symbolic, may, researchers, systems, logic, field, computer, general, machines, problem, would, algorithms, reasoning, data, mind, intelligent, applications, tools, solve, could, humans, since, include, example, computing, specific, system, however, based, developed, ability, decision, mathematical, optimization, number, goals, information, one, approaches, two, well, level, including, recognition, world, risk, program, theory, difficult, agent, algorithm, even, using, deep, neurons, u, term, others, several, known, input, like, widely, make, sub, first, goal, rather, whether, language, fiction, form, inputs, related, issue, behavior, successful, particular, processing, perception, objects, techniques, methods, described, turing, new, question, approach, model, people, increase, solutions, tasks, robotics, called, time, formal, future, logics, natural,

### Model Analysis
We successfully created our Word2Vec model in the last section. Now it is time to explore what we created.

### Finding Vectors for a Word

In [11]:
v1 = word2vec.wv['artificial']

The vector `v1` contains the vector representation for the word "artificial". By default, Gensim's Word2Vec creates a 100-dimensional vector. This is far smaller than what the bag of words approach would produce: there, each vector would have length 1206, since the article contains 1206 unique words with a minimum frequency of 2. If the minimum frequency of occurrence is set to 1, the bag of words vector grows even further. The size of a Word2Vec vector, on the other hand, does not depend on the size of the vocabulary.
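
To make the size comparison concrete, here is a minimal sketch of a bag of words vector whose length follows the vocabulary size. The helper `bag_of_words_vector` and both toy vocabularies are hypothetical and chosen only for illustration.

```python
def bag_of_words_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in `tokens`."""
    index = {word: i for i, word in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in index:
            vector[index[token]] += 1
    return vector

# Two made-up vocabularies of different sizes.
small_vocab = ["ai", "learning", "machine"]
large_vocab = small_vocab + ["neural", "networks", "data", "logic"]
tokens = ["machine", "learning", "ai", "learning"]

small_vector = bag_of_words_vector(tokens, small_vocab)
large_vector = bag_of_words_vector(tokens, large_vocab)
print(len(small_vector), len(large_vector))
```

The vector length tracks the vocabulary (3 versus 7 entries here), whereas the length of a Word2Vec vector is fixed when the model is created.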

### Finding Similar Words

In [12]:
sim_words = word2vec.wv.most_similar('intelligence')

In [13]:
sim_words

[('internet', 0.40523967146873474),
 ('ai', 0.34835532307624817),
 ('objects', 0.33982163667678833),
 ('researchers', 0.3379882872104645),
 ('difficult', 0.3354819416999817),
 ('algorithms', 0.33525350689888),
 ('science', 0.3136553466320038),
 ('goals', 0.3053082823753357),
 ('bias', 0.29881370067596436),
 ('may', 0.296377956867218)]
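
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of that ranking, the sketch below computes cosine similarity in plain Python. The three-dimensional `toy_vectors` are invented for the example and are not taken from the trained model, and `rank_by_similarity` is a hypothetical helper, not part of Gensim.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query, vectors, topn=3):
    """Rank every other word by its cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Invented vectors, for illustration only.
toy_vectors = {
    "intelligence": [1.0, 0.0, 1.0],
    "ai": [0.9, 0.1, 0.8],
    "logic": [0.0, 1.0, 0.2],
    "bias": [-1.0, 0.0, -1.0],
}

ranking = rank_by_similarity("intelligence", toy_vectors)
for word, score in ranking:
    print(word, round(score, 3))
```

A score close to 1 means the two vectors point in nearly the same direction, and a score close to -1 means they point in opposite directions.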