# **Import Libraries**

In [1]:
!pip install wikipedia
!pip install Wikipedia-API
import wikipedia
import wikipediaapi
import numpy as np
import pandas as pd

import gensim
from gensim.models import KeyedVectors



# **Load Data**

In [2]:
wiki = wikipediaapi.Wikipedia(
    language = 'en',
    extract_format = wikipediaapi.ExtractFormat.WIKI)

In [3]:
domains = ['Football', 'Linux', 'Health', 'Music', 'Artificial Intelligence']

In [4]:
docs_per_domain = 4
domains_dict = {}
for domain in domains:
    results = wikipedia.search(domain, results=docs_per_domain)
    domains_dict[domain] = []
    for result in results:
        domains_dict[domain].append(wiki.page(result).summary)

# **Explore data**

In [5]:
# Explore first doc from each domain
for domain in domains:
    print('{} Domain:'.format(domain))
    print('**********************\n')    
    print(domains_dict[domain][0],'\n')

Football Domain:
**********************

Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.  Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in North America and Oceania); gridiron football (specifically American football or Canadian football); Australian rules football; rugby union and rugby league; and Gaelic football. These various forms of football share to varying extent common origins and are known as football codes.
There are a number of references to traditional, ancient, or prehistoric ball games played in many different parts of the world. Contemporary codes of football can be traced back to the codification of these games at English public schools during the 19th century. The expansion and cultural influence of the British Empire allowed these rules of football to spread to areas o

In [6]:
df = pd.DataFrame.from_dict(domains_dict, orient='index')

df = df.rename(columns={0: 'DOC1', 1: 'DOC2', 2: 'DOC3', 3: 'DOC4'})

df

| | DOC1 | DOC2 | DOC3 | DOC4 |
|---|---|---|---|---|
| Football | Football is a family of team sports that invol... | A football player or footballer is a sportsper... | College football is gridiron football consisti... | Association football, more commonly known as s... |
| Linux | Linux ( (listen) LEE-nuuks or LIN-uuks) is a ... | The Linux kernel is a mostly free and open-sou... | A Linux distribution (often abbreviated as dis... | Linux Mint is a community-driven Linux distrib... |
| Health | Health, according to the World Health Organiza... | HealtH (also known as Health and H.E.A.L.T.H.)... | Mental health encompasses emotional, psycholog... | Health care or healthcare is the maintenance o... |
| Music | Music is the art of arranging sounds in time t... | Music is an art form consisting of sound and s... | Classical music generally refers to the formal... | A music video is a video of variable length, t... |
| Artificial Intelligence | Artificial intelligence (AI) is intelligence d... | A.I. Artificial Intelligence (also known as A.... | Artificial general intelligence (AGI) is the h... | The history of artificial intelligence (AI) be... |


In [7]:
raw_data = df.values
raw_data = raw_data.reshape(-1)  # flatten row by row without hard-coding the length

In [8]:
raw_data.shape

(20,)
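
The reshape flattens the 5×4 table row by row, so the four documents of each domain stay next to each other in the flat array. A minimal sketch on a hypothetical 2×3 array (the labels are placeholders, not notebook data) illustrates the ordering and how a flat index maps back to a row and column:

```python
import numpy as np

# Hypothetical stand-in for df.values: 2 domains x 3 documents.
table = np.array([["a0", "a1", "a2"],
                  ["b0", "b1", "b2"]], dtype=object)

# Row-major (C order) flatten; -1 lets NumPy infer the length.
flat = table.reshape(-1)
print(flat.tolist())

# Recover (row, column) from a flat index.
docs_per_row = table.shape[1]
row, col = divmod(4, docs_per_row)
print(row, col)
```

This row-major ordering is what later allows a flat document index to be turned into a domain with integer division.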

In [9]:
print(raw_data)

['Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.  Unqualified, the word football normally means the form of football that is the most popular where the word is used. Sports commonly called football include association football (known as soccer in North America and Oceania); gridiron football (specifically American football or Canadian football); Australian rules football; rugby union and rugby league; and Gaelic football. These various forms of football share to varying extent common origins and are known as football codes.\nThere are a number of references to traditional, ancient, or prehistoric ball games played in many different parts of the world. Contemporary codes of football can be traced back to the codification of these games at English public schools during the 19th century. The expansion and cultural influence of the British Empire allowed these rules of football to spread to areas of British influence outside the direct

# **Data Preprocessing**

In [10]:
df = df.applymap(gensim.utils.simple_preprocess)

In [11]:
df

| | DOC1 | DOC2 | DOC3 | DOC4 |
|---|---|---|---|---|
| Football | [football, is, family, of, team, sports, that,... | [football, player, or, footballer, is, sportsp... | [college, football, is, gridiron, football, co... | [association, football, more, commonly, known,... |
| Linux | [linux, listen, lee, nuuks, or, lin, uuks, is,... | [the, linux, kernel, is, mostly, free, and, op... | [linux, distribution, often, abbreviated, as, ... | [linux, mint, is, community, driven, linux, di... |
| Health | [health, according, to, the, world, health, or... | [health, also, known, as, health, and, is, ame... | [mental, health, encompasses, emotional, psych... | [health, care, or, healthcare, is, the, mainte... |
| Music | [music, is, the, art, of, arranging, sounds, i... | [music, is, an, art, form, consisting, of, sou... | [classical, music, generally, refers, to, the,... | [music, video, is, video, of, variable, length... |
| Artificial Intelligence | [artificial, intelligence, ai, is, intelligenc... | [artificial, intelligence, also, known, as, is... | [artificial, general, intelligence, agi, is, t... | [the, history, of, artificial, intelligence, a... |
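
The token lists show what `simple_preprocess` does to the text: everything is lowercased, punctuation is dropped, and one-letter words such as "a" disappear. The sketch below is a rough pure-Python approximation, assuming the default length bounds of 2 to 15 characters. `approx_preprocess` is a hypothetical name used only for illustration, not gensim's implementation.

```python
import re

def approx_preprocess(text, min_len=2, max_len=15):
    """Rough stand-in for gensim.utils.simple_preprocess (assumed defaults)."""
    # Lowercase, then keep runs of letters/underscores; digits and
    # punctuation act as separators.
    tokens = re.findall(r"[^\W\d]+", text.lower())
    # Drop tokens that are too short, too long, or start with an underscore.
    return [t for t in tokens
            if min_len <= len(t) <= max_len and not t.startswith("_")]

print(approx_preprocess("Football is a family of team sports."))
```

Note that the test queries later in the notebook are tokenized with `str.split()` instead, so they keep their capitalization and any attached punctuation.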


In [12]:
data = df.values
data = data.reshape(-1).tolist()  # flatten row by row without hard-coding the length
print(*data, sep='\n')

['football', 'is', 'family', 'of', 'team', 'sports', 'that', 'involve', 'to', 'varying', 'degrees', 'kicking', 'ball', 'to', 'score', 'goal', 'unqualified', 'the', 'word', 'football', 'normally', 'means', 'the', 'form', 'of', 'football', 'that', 'is', 'the', 'most', 'popular', 'where', 'the', 'word', 'is', 'used', 'sports', 'commonly', 'called', 'football', 'include', 'association', 'football', 'known', 'as', 'soccer', 'in', 'north', 'america', 'and', 'oceania', 'gridiron', 'football', 'specifically', 'american', 'football', 'or', 'canadian', 'football', 'australian', 'rules', 'football', 'rugby', 'union', 'and', 'rugby', 'league', 'and', 'gaelic', 'football', 'these', 'various', 'forms', 'of', 'football', 'share', 'to', 'varying', 'extent', 'common', 'origins', 'and', 'are', 'known', 'as', 'football', 'codes', 'there', 'are', 'number', 'of', 'references', 'to', 'traditional', 'ancient', 'or', 'prehistoric', 'ball', 'games', 'played', 'in', 'many', 'different', 'parts', 'of', 'the', 'w

# **Load Pretrained Model**

In [13]:
PATH = '/content/drive/MyDrive/GoogleNews-vectors-negative300.bin.gz'

# `limit` caps how many word vectors are read from the file, which keeps RAM usage manageable
model = KeyedVectors.load_word2vec_format(PATH, binary=True, limit=100000)
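
Because only the first 100,000 vectors are loaded, any word beyond that cut-off is treated as out of vocabulary. One way to gauge the effect is to measure how many tokens of a document are covered. The helper below is a hypothetical sketch (`vocab_coverage` and the toy vocabulary are not part of the notebook); it accepts any container that supports `in`.

```python
def vocab_coverage(tokens, vocab):
    """Fraction of tokens found in vocab (any container supporting `in`)."""
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in vocab)
    return known / len(tokens)

# Hypothetical toy vocabulary standing in for the loaded vectors.
toy_vocab = {"linux", "kernel", "is", "free"}
tokens = ["linux", "kernel", "is", "mostly", "free"]
print(vocab_coverage(tokens, toy_vocab))  # 4 of 5 tokens are known
```

With the notebook's objects, something like `vocab_coverage(data[0], model)` should work once the documents have been tokenized, assuming the loaded vectors support membership tests.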

# **Embedding Representation**

In [14]:
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
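
As a quick sanity check, the function can be tried on small hand-made vectors. The values below are illustrative only; they show that cosine similarity depends on direction rather than length.

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

print(cosine_sim(a, b))        # cos(45 degrees), about 0.7071
print(cosine_sim(a, 10 * b))   # same value: scaling a vector does not change the angle
print(cosine_sim(a, -a))       # opposite directions give -1.0
```

One caveat: if either vector is all zeros (for example, a text with no in-vocabulary words), the denominator is zero and the result is `nan`.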

In [15]:
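def avg_sentence(sentence, wv):
  # Sum the vectors of in-vocabulary words; unknown words add nothing
  # but still count toward the divisor.
  v = np.zeros(wv.vector_size)  # use the argument, not the global `model`
  for w in sentence:
    if w in wv:
      v += wv[w]
  return v / max(len(sentence), 1)  # guard against an empty sentence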
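
`avg_sentence` divides by the total number of tokens, including those missing from the vocabulary. That shrinks the vector but does not change its direction, so cosine rankings are unaffected. A possible variant that averages only the known tokens is sketched below with a toy dictionary in place of the pretrained vectors; `avg_known` and `toy_wv` are hypothetical names.

```python
import numpy as np

# Hypothetical 2-dimensional embeddings standing in for the pretrained vectors.
toy_wv = {"ball": np.array([1.0, 0.0]), "goal": np.array([0.0, 1.0])}

def avg_known(sentence, wv, dim):
    """Average only the tokens found in wv; return zeros if none are found."""
    vectors = [wv[w] for w in sentence if w in wv]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

tokens = ["ball", "goal", "zzz", "qqq"]  # two known, two unknown tokens
print(avg_known(tokens, toy_wv, dim=2))  # [0.5 0.5]

# Notebook-style divisor (all tokens) gives a shorter vector in the same direction.
print((toy_wv["ball"] + toy_wv["goal"]) / len(tokens))  # [0.25 0.25]
```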

In [16]:
embedding_matrix = []
for doc in data:
  embedding = avg_sentence(doc, model)  # `model` is already a KeyedVectors; `.wv` is deprecated
  embedding_matrix.append(embedding)



In [17]:
embedding_matrix = np.array(embedding_matrix)
embedding_matrix.shape

(20, 300)

In [18]:
embedding_matrix

array([[ 0.00849747,  0.03131719,  0.07107907, ..., -0.02274082,
         0.00785408,  0.00172902],
       [-0.0227956 , -0.00056978,  0.09591675, ..., -0.01887176,
        -0.02062445,  0.01835723],
       [-0.01435654,  0.0424444 ,  0.06613703, ..., -0.01653203,
         0.00780914,  0.0013567 ],
       ...,
       [ 0.02164145,  0.02290526,  0.0202538 , ..., -0.03659842,
         0.02245023, -0.019981  ],
       [ 0.03173395,  0.01690552,  0.03109745, ..., -0.0403591 ,
         0.02480696, -0.02229476],
       [ 0.04418222,  0.03954977,  0.02940984, ..., -0.0214473 ,
         0.03385666, -0.0146371 ]])

# **Save Model**

In [19]:
model.save("word2vec.model")

# **Testing Football Domain**

##### **Input Embedding Vector**

In [20]:
# Named `query` so the built-in input() is not shadowed
query = 'a team sport played with a spherical ball between two teams of 11 players'
input_vec = avg_sentence(query.split(), model)

# Take input from user
# input_vec = avg_sentence(input().split(), model)

  


In [21]:
print('input embedding vector shape:', input_vec.shape)

input embedding vector shape: (300,)


In [22]:
# input embedding vector
input_vec

array([-0.05023193,  0.03846087,  0.07135664,  0.00639561,  0.08267648,
       -0.05812291,  0.02211435, -0.12029157,  0.01212856,  0.05596488,
       -0.02593994, -0.07042585, -0.00855364,  0.00213623, -0.04576111,
        0.08850098,  0.05884824,  0.11397988,  0.01102121, -0.03002003,
       -0.06168039,  0.10042899,  0.00104196, -0.05881391,  0.033444  ,
       -0.04958235, -0.11429923,  0.08403669,  0.06378174, -0.01189314,
       -0.03216335,  0.0078125 ,  0.09334891, -0.02377755,  0.02654157,
        0.05056763,  0.04679871, -0.00772749,  0.03955078,  0.13374111,
        0.12801688, -0.05007499,  0.02551923, -0.00807408,  0.00486537,
       -0.06917844,  0.08680071, -0.08721924,  0.04819162,  0.07196481,
       -0.02589634,  0.076137  , -0.06620952, -0.10070801,  0.00634984,
       -0.04552938,  0.07743617,  0.0090332 , -0.00262015, -0.05038016,
       -0.0916748 ,  0.03256662, -0.05369786, -0.10349819, -0.02272252,
        0.0493818 , -0.01552037,  0.01740374,  0.06158447,  0.11

##### **Similarity Measure**

In [23]:
sims = []
for emb in embedding_matrix:
  sim = cosine_sim(input_vec, emb)
  sims.append(sim)

In [24]:
from operator import itemgetter
indices, most_similar = zip(*sorted(enumerate(sims), key=itemgetter(1), reverse = True))

In [25]:
most_similar

(0.7622986911105237,
 0.7135719060202561,
 0.701553122109776,
 0.6752741304679356,
 0.4928304959349676,
 0.4852626517722278,
 0.4824777974276544,
 0.47462963127517066,
 0.4702624609080954,
 0.4649263964498985,
 0.45909340337320176,
 0.42627915457931503,
 0.41777286409551123,
 0.413535576217066,
 0.4074479170149959,
 0.40165940544295586,
 0.3901727098142545,
 0.3885856492741299,
 0.37419344538841154,
 0.3459998238835243)

In [26]:
# indices of the docs, sorted from most to least similar
indices

(3, 1, 2, 0, 17, 16, 19, 9, 12, 18, 14, 15, 8, 10, 5, 7, 4, 6, 11, 13)
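
The loop and `sorted` call can also be expressed with array operations. The following is a sketch of one vectorized alternative, not the notebook's method; `rank_by_cosine` and the toy vectors are hypothetical, and rows are assumed to be non-zero.

```python
import numpy as np

def rank_by_cosine(query_vec, matrix):
    """Return (indices, similarities) sorted from most to least similar."""
    # Normalise rows and the query so a dot product equals cosine similarity.
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_unit = query_vec / np.linalg.norm(query_vec)
    sims = matrix_unit @ query_unit
    order = np.argsort(-sims)  # negate for descending order
    return order, sims[order]

# Hypothetical toy data: three 2-d "document" vectors and one query.
docs = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
query = np.array([2.0, 0.1])
order, ranked = rank_by_cosine(query, docs)
print(order.tolist())
```

With the notebook's variables, the call would presumably be `rank_by_cosine(input_vec, embedding_matrix)`.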

##### **Results**

In [27]:
# visualize first 5 most similar results to the input
for i in indices[:5]:
  print(raw_data[i])
  print('\n****************************************\n')

Association football, more commonly known as simply football or soccer, is a team sport played with a spherical ball between two teams of 11 players. It is played by approximately 250 million players in over 200 countries and dependencies, making it the world's most popular sport. The game is played on a rectangular field called a pitch with a goal at each end. The objective of the game is to score more goals than the opposition by moving the ball beyond the goal line into the opposing goal, usually within a time frame of 90 or more minutes.
Football is played in accordance with a set of rules known as the Laws of the Game. The ball is 68–70 cm (27–28 in) in circumference and known as the football. The two teams compete to get the ball into the other team's goal (between the posts and under the bar), thereby scoring a goal. Players are not allowed to touch the ball with hands or arms while it is in play, except for the goalkeepers within the penalty area. Players may use any other part

In [28]:
# Get most relevant domain to the input
print('Most relevant domain:', domains[indices[0]//docs_per_domain])

Most relevant domain: Football
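
The lookup works because the flat document index keeps four consecutive slots per domain, so integer division by `docs_per_domain` recovers the domain. The sketch below applies this to the top-5 indices printed above and adds a majority vote over those hits, which is an alternative to relying on the single best match and is not something the notebook itself does.

```python
from collections import Counter

domains = ['Football', 'Linux', 'Health', 'Music', 'Artificial Intelligence']
docs_per_domain = 4

# Top-5 flat indices reported above for the football query.
top_indices = [3, 1, 2, 0, 17]

# Integer division maps a flat document index to its domain.
top_domains = [domains[i // docs_per_domain] for i in top_indices]
print(top_domains)

# Majority vote over the top-k documents (alternative to using only the best hit).
winner, votes = Counter(top_domains).most_common(1)[0]
print(winner, votes)
```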


# **Testing Linux Domain**

##### **Input Embedding Vector**

In [29]:
# Named `query` so the built-in input() is not shadowed
query = 'The Unix operating system was conceived and implemented in 1969'
input_vec = avg_sentence(query.split(), model)

# Take input from user
# input_vec = avg_sentence(input().split(), model)

  


In [30]:
print('input embedding vector shape:', input_vec.shape)

input embedding vector shape: (300,)


In [31]:
# input embedding vector
input_vec

array([-3.83255005e-02,  3.33679199e-02,  6.50390625e-02,  3.51562500e-02,
       -8.53759766e-02, -4.84619141e-02,  5.66406250e-03, -9.63897705e-03,
        8.49853516e-02,  9.18457031e-02, -6.50085449e-02, -2.60986328e-02,
        2.74047852e-02, -2.96630859e-02, -8.60671997e-03, -1.82617187e-02,
        8.12377930e-03,  4.17480469e-03, -6.95190430e-02,  7.62939453e-03,
       -8.35815430e-02,  1.31933594e-01,  3.94805908e-02, -5.44853210e-02,
        1.02716064e-01, -4.14062500e-02, -6.14028931e-02,  6.03515625e-02,
       -2.49023438e-02, -4.35180664e-02, -5.61523437e-04, -7.71728516e-02,
       -1.15808487e-02,  2.63671875e-03,  7.96875000e-02,  1.84936523e-02,
       -3.37890625e-02, -7.67913818e-02,  9.58129883e-02,  3.03234100e-02,
        2.68554688e-02,  5.24169922e-02,  6.98303223e-02,  3.26385498e-03,
        4.69726563e-02, -6.59912109e-02, -6.66381836e-02,  9.76562500e-05,
       -3.59252930e-02,  8.39233398e-02, -5.18554688e-02, -1.14404297e-01,
       -2.12890625e-02, -

##### **Similarity Measure**

In [32]:
sims = []
for emb in embedding_matrix:
  sim = cosine_sim(input_vec, emb)
  sims.append(sim)

In [33]:
from operator import itemgetter
indices, most_similar = zip(*sorted(enumerate(sims), key=itemgetter(1), reverse = True))

In [34]:
most_similar

(0.7113737512714442,
 0.6756592851826665,
 0.6286151807296876,
 0.6116173977406468,
 0.5882216862141098,
 0.5619681925553414,
 0.5413218160537241,
 0.5040648384727285,
 0.5022811078130285,
 0.49458563560452995,
 0.45756939632329885,
 0.45170985435941113,
 0.4478685430123984,
 0.41862953569508643,
 0.4136001005666491,
 0.40948977725739666,
 0.4057676187151734,
 0.36958547191199653,
 0.35830498126165267,
 0.32553099381495615)

In [35]:
# indices of the docs, sorted from most to least similar
indices

(5, 4, 6, 7, 19, 16, 17, 14, 18, 9, 0, 3, 11, 2, 12, 15, 8, 13, 10, 1)

##### **Results**

In [36]:
# visualize first 5 most similar results to the input
for i in indices[:5]:
  print(raw_data[i])
  print('\n****************************************\n')

The Linux kernel is a mostly free and open-source, monolithic, modular, multitasking, Unix-like operating system kernel. It was originally authored in 1991 by Linus Torvalds for his i386-based PC, and it was soon adopted as the kernel for the GNU operating system, which was written to be a free (libre) replacement for UNIX.
Linux as a whole is released under the GNU General Public License version 2 only, but it contains files under other compatible licenses. However, Linux began including proprietary binary blobs in its source tree and main distribution in 1996. This led to other projects starting work to remove the proprietary blobs in order to produce a 100% libre kernel, which eventually led to the Linux-libre project being founded.Since the late 1990s, it has been included as part of a large number of operating system distributions, many of which are commonly also called Linux. However, there is a controversy surrounding the naming of such systems; some people, including Richard St

In [37]:
# Get most relevant domain to the input
print('Most relevant domain:', domains[indices[0]//docs_per_domain])

Most relevant domain: Linux


# **Testing Artificial Intelligence Domain**

##### **Input Embedding Vector**

In [38]:
# Named `query` so the built-in input() is not shadowed
query = 'the simulation of human intelligence processes by machines, especially computer systems'
input_vec = avg_sentence(query.split(), model)

# Take input from user
# input_vec = avg_sentence(input().split(), model)



In [39]:
print('input embedding vector shape:', input_vec.shape)

input embedding vector shape: (300,)


In [40]:
# input embedding vector
input_vec

array([ 3.26704545e-02, -9.21075994e-03,  6.54407848e-02,  8.80376642e-02,
       -1.18846547e-01,  3.86186080e-02, -3.48455256e-03, -3.36248224e-02,
        2.71218040e-02,  4.00945490e-02, -2.02192827e-02, -1.31558505e-01,
       -3.67868597e-02,  6.26997514e-03, -3.44357924e-02,  3.55557528e-02,
       -6.28786954e-02,  6.64395419e-02, -4.05051491e-03, -6.72607422e-02,
       -5.20907315e-02, -2.49800249e-02, -2.52109874e-02,  5.46174483e-02,
        1.32657138e-01, -2.27633390e-03, -6.36208274e-02,  1.37301358e-02,
       -1.30851052e-02, -7.58167614e-02, -2.22084739e-02, -7.62051669e-02,
       -4.92942116e-02,  2.32627175e-02, -2.87309126e-02, -6.12293590e-02,
       -4.34930975e-02, -2.74103338e-03,  9.67684659e-02,  1.71120384e-02,
        2.10515803e-02,  2.64115767e-02,  3.40021307e-02,  9.69349254e-02,
        4.12320224e-02, -2.18491988e-02, -6.97465376e-02,  2.45971680e-02,
       -5.26344993e-02,  8.44053789e-02, -4.47609641e-02, -2.43308327e-02,
       -6.58957741e-02, -

##### **Similarity Measure**

In [41]:
sims = []
for emb in embedding_matrix:
  sim = cosine_sim(input_vec, emb)
  sims.append(sim)

In [42]:
from operator import itemgetter
indices, most_similar = zip(*sorted(enumerate(sims), key=itemgetter(1), reverse = True))

In [43]:
most_similar

(0.7452344962159336,
 0.66241436536152,
 0.6502918699871371,
 0.647774131896869,
 0.645837796829403,
 0.6227774831642384,
 0.5859108592256483,
 0.5725396499064507,
 0.5717783929223123,
 0.5688797582792084,
 0.5480876828900194,
 0.5139804628413409,
 0.49445866143572254,
 0.4882912095083163,
 0.4783935623880307,
 0.46585848830367765,
 0.4501046754060713,
 0.4395174217152231,
 0.4320561292745675,
 0.37219058514974207)

In [44]:
# indices of the docs, sorted from most to least similar
indices

(16, 19, 5, 4, 6, 18, 11, 8, 7, 10, 17, 14, 0, 15, 12, 9, 3, 2, 13, 1)

##### **Results**

In [45]:
# visualize first 5 most similar results to the input
for i in indices[:5]:
  print(raw_data[i])
  print('\n****************************************\n')

Artificial intelligence (AI) is intelligence demonstrated by machines, as opposed to the natural intelligence displayed by animals including humans. Leading AI textbooks define the field as the study of "intelligent agents": any system that perceives its environment and takes actions that maximize its chance of achieving its goals.Some popular accounts use the term "artificial intelligence" to describe machines that mimic "cognitive" functions that humans associate with the human mind, such as "learning" and "problem-solving", however, this definition is rejected by major AI researchers.AI applications include advanced web search engines (e.g., Google), recommendation systems (used by YouTube, Amazon and Netflix), understanding human speech (such as Siri and Alexa), self-driving cars (e.g., Tesla), automated decision-making and competing at the highest level in strategic game systems (such as chess and Go).
As machines become increasingly capable, tasks considered to require "intellige

In [46]:
# Get most relevant domain to the input
print('Most relevant domain:', domains[indices[0]//docs_per_domain])

Most relevant domain: Artificial Intelligence
