# First Example

**Word embedding** is the collective name for a set of language-modeling and feature-learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. The name word2vec is short for "word to vector".
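To make the idea of "words mapped to vectors" concrete, here is a minimal sketch that uses hand-picked, hypothetical 3-dimensional vectors (a trained model learns its vectors from data and typically uses many more dimensions). It compares words with cosine similarity, a common way to measure how close two embedding vectors are.

```python
import math

# Hypothetical embeddings, hand-picked for illustration only.
toy_embeddings = {
    "car":   [0.9, 0.1, 0.0],
    "truck": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}

def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_car_truck = cosine_similarity(toy_embeddings["car"], toy_embeddings["truck"])
sim_car_apple = cosine_similarity(toy_embeddings["car"], toy_embeddings["apple"])

# Vectors pointing in similar directions score close to 1.
print(sim_car_truck > sim_car_apple)
```

With these toy values, "car" and "truck" point in nearly the same direction, so their similarity is much higher than that of "car" and "apple".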

In [1]:
import pandas as pd

df = pd.read_csv('data.csv')
df.head()

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
0,BMW,1 Series M,2011,premium unleaded (required),335.0,6.0,MANUAL,rear wheel drive,2.0,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,26,19,3916,46135
1,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Convertible,28,19,3916,40650
2,BMW,1 Series,2011,premium unleaded (required),300.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,High-Performance",Compact,Coupe,28,20,3916,36350
3,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,"Luxury,Performance",Compact,Coupe,28,18,3916,29450
4,BMW,1 Series,2011,premium unleaded (required),230.0,6.0,MANUAL,rear wheel drive,2.0,Luxury,Compact,Convertible,28,18,3916,34500


Since the sole purpose of this notebook is to learn about word2vec, I am skipping the exploratory data analysis (EDA) part.

Create a new column "Name" by combining 'Make' and 'Model'.

In [2]:
df['Name'] = df['Make'] + " " + df['Model']

In [3]:
# Select features from original dataset to form a new dataframe 
df1 = df[['Transmission Type', 'Engine Fuel Type', 'Driven_Wheels', 'Market Category', 'Name',
          'Vehicle Size', 'Vehicle Style']]

# For each row, join all column values into one comma-separated string
# (astype(str) converts non-string values so that join does not fail)
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1)

df2.head()

0    MANUAL,premium unleaded (required),rear wheel ...
1    MANUAL,premium unleaded (required),rear wheel ...
2    MANUAL,premium unleaded (required),rear wheel ...
3    MANUAL,premium unleaded (required),rear wheel ...
4    MANUAL,premium unleaded (required),rear wheel ...
dtype: object

In [4]:
# Store the combined strings in a single-column DataFrame
df_clean = pd.DataFrame({'clean': df2})
df_clean

Unnamed: 0,clean
0,"MANUAL,premium unleaded (required),rear wheel ..."
1,"MANUAL,premium unleaded (required),rear wheel ..."
2,"MANUAL,premium unleaded (required),rear wheel ..."
3,"MANUAL,premium unleaded (required),rear wheel ..."
4,"MANUAL,premium unleaded (required),rear wheel ..."
...,...
11909,"AUTOMATIC,premium unleaded (required),all whee..."
11910,"AUTOMATIC,premium unleaded (required),all whee..."
11911,"AUTOMATIC,premium unleaded (required),all whee..."
11912,"AUTOMATIC,premium unleaded (recommended),all w..."


In [5]:
# Build the list-of-lists corpus format that gensim expects.
# Split each row on ',' so that every column value becomes one token.
sent = [row.split(',') for row in df_clean['clean']]
sent[:2]

[['MANUAL',
  'premium unleaded (required)',
  'rear wheel drive',
  'Factory Tuner',
  'Luxury',
  'High-Performance',
  'BMW 1 Series M',
  'Compact',
  'Coupe'],
 ['MANUAL',
  'premium unleaded (required)',
  'rear wheel drive',
  'Luxury',
  'Performance',
  'BMW 1 Series',
  'Compact',
  'Convertible']]
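Note how the join-then-split step also breaks a multi-valued 'Market Category' such as "Factory Tuner,Luxury,High-Performance" into separate tokens, while a multi-word value such as "BMW 1 Series M" stays a single token. The pure-Python sketch below reproduces this on two rows taken from the output above, restricted to three columns to keep it short; the `rows` and `toy_sent` names are only illustrative.

```python
# Two rows from the data shown above, restricted to three columns for brevity.
rows = [
    {"Transmission Type": "MANUAL",
     "Market Category": "Factory Tuner,Luxury,High-Performance",
     "Name": "BMW 1 Series M"},
    {"Transmission Type": "MANUAL",
     "Market Category": "Luxury",
     "Name": "BMW 1 Series"},
]

# Join every field with ',' and split on ',' again, mirroring the cells above.
toy_sent = [",".join(str(value) for value in row.values()).split(",") for row in rows]

for tokens in toy_sent:
    print(tokens)
```

Because the split is on commas only, spaces inside a value are preserved, which is why each car name acts as one "word" in the vocabulary.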

## Model Training

In [6]:
from gensim.models import Word2Vec

# sg=1: skip-gram; window=3: context size; min_count=1: keep every token; workers=3: training threads
model = Word2Vec(sent, min_count=1, workers=3, window=3, sg=1)
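In this call, `sg=1` selects the skip-gram training algorithm (`sg=0` would select CBOW), and `window=3` sets the maximum distance between a word and the context words it is trained to predict. As a rough illustration of what that means, the sketch below enumerates the (center, context) pairs for one short row of tokens. The helper `skipgram_pairs` is hypothetical and not part of gensim, and gensim's actual training adds details, such as negative sampling, that are not shown here.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs for tokens at most `window` positions apart."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                yield center, tokens[j]

# A shortened row in the same style as the corpus above.
example = ["MANUAL", "rear wheel drive", "Luxury", "BMW 1 Series", "Compact"]

pairs = list(skipgram_pairs(example, window=1))
for center, context in pairs:
    print(center, "->", context)
```

A larger window produces more pairs per row, so tokens that appear anywhere in the same row are more likely to be pulled toward each other.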



#### **Question.** How similar are "Porsche 718 Cayman" and "Nissan Van"?

In [7]:
model.wv.similarity('Porsche 718 Cayman', 'Nissan Van')

0.81985104

#### **Question.** Which five cars are most similar to "Mercedes-Benz SLK-Class" in terms of features?

In [8]:
model.wv.most_similar('Mercedes-Benz SLK-Class', topn=5)

[('Audi S5', 0.9945767521858215),
 ('Nissan GT-R', 0.9911941885948181),
 ('BMW Z4 M', 0.9908928871154785),
 ('Audi RS 5', 0.990876317024231),
 ('BMW M6', 0.9902618527412415)]
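Conceptually, a `most_similar` query ranks every other vocabulary entry by its cosine similarity to the query vector and returns the top results. The sketch below shows that ranking on hypothetical 2-dimensional vectors; the names, the values, and the `rank_neighbors` helper are all made up for illustration and do not describe gensim's internal implementation.

```python
import heapq
import math

# Hypothetical vectors, hand-picked for illustration only.
toy_vectors = {
    "coupe_a": [1.0, 0.0],
    "coupe_b": [0.9, 0.1],
    "suv":     [0.3, 0.7],
    "van":     [0.0, 1.0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_neighbors(query, vectors, topn=2):
    """Return the `topn` entries most similar to `query`, excluding the query itself."""
    scores = (
        (name, cosine(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    )
    return heapq.nlargest(topn, scores, key=lambda pair: pair[1])

neighbors = rank_neighbors("coupe_a", toy_vectors, topn=2)
print([name for name, _ in neighbors])
```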

# Second Example

In [9]:
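# Requires the NLTK data packages: nltk.download('gutenberg') plus the Punkt sentence tokenizer data
from nltk.corpus import gutenberg
sentences = list(gutenberg.sents('shakespeare-hamlet.txt'))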

In [10]:
# First line.
print(sentences[0])

# Third line.
print(sentences[2])

['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Scoena', 'Prima', '.']


- Convert everything to lowercase.
- Remove punctuation, numbers, etc.

In [11]:
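import re

# Keep only purely alphabetic tokens and lowercase them; fullmatch rejects
# tokens that merely start with a letter (for example 'a1').
for i in range(len(sentences)):
    sentences[i] = [word.lower() for word in sentences[i] if re.fullmatch('[a-zA-Z]+', word)]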

# First line.
print(sentences[0])

# Third line.
print(sentences[2])

['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['scoena', 'prima']


### Removing Stop Words

In [12]:
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

# Drop stop words from every sentence; a sentence made up entirely of
# stop words becomes an empty list.
cleaned_sentences = [
    [word for word in sentence if word not in stop_words]
    for sentence in sentences
]

In [13]:
cleaned_sentences

[['tragedie', 'hamlet', 'william', 'shakespeare'],
 ['actus', 'primus'],
 ['scoena', 'prima'],
 ['enter', 'barnardo', 'francisco', 'two', 'centinels'],
 ['barnardo'],
 [],
 ['fran'],
 ['nay', 'answer', 'stand', 'vnfold', 'selfe'],
 ['bar'],
 ['long', 'liue', 'king'],
 ['fran'],
 ['barnardo'],
 ['bar'],
 [],
 ['fran'],
 ['come', 'carefully', 'vpon', 'houre'],
 ['bar'],
 ['tis', 'strook', 'twelue', 'get', 'thee', 'bed', 'francisco'],
 ['fran'],
 ['releefe', 'much', 'thankes', 'tis', 'bitter', 'cold', 'sicke', 'heart'],
 ['barn'],
 ['haue', 'quiet', 'guard'],
 ['fran'],
 ['mouse', 'stirring'],
 ['barn'],
 ['well', 'goodnight'],
 ['meet', 'horatio', 'marcellus', 'riuals', 'watch', 'bid', 'make', 'hast'],
 ['enter', 'horatio', 'marcellus'],
 ['fran'],
 ['thinke', 'heare'],
 ['stand'],
 ['hor'],
 ['friends', 'ground'],
 ['mar'],
 ['leige', 'men', 'dane'],
 ['fran'],
 ['giue', 'good', 'night'],
 ['mar'],
 ['farwel', 'honest', 'soldier', 'hath', 'relieu'],
 ['fra'],
 ['barnardo', 'ha', 'place'

In [14]:
model_2 = Word2Vec(cleaned_sentences, sg=1, window=3, min_count=1, workers=3)

In [15]:
model_2.wv.most_similar('hamlet', topn=5)

[('haue', 0.984722375869751),
 ('giue', 0.9832130670547485),
 ('shall', 0.9816908240318298),
 ('selfe', 0.9816685318946838),
 ('mine', 0.980940580368042)]