#### Gensim Word2Vec

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

import re  # For preprocessing

from time import time  # To time our operations
from collections import defaultdict  # For word frequency

import spacy  # For preprocessing

from gensim.models import Word2Vec

# import logging  # Setting up the loggings to monitor gensim
# logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

from sklearn.manifold import TSNE

from numpy import dot
from numpy.linalg import norm

## Dataset Description
The dataset comes from Kaggle (https://www.kaggle.com/CooperUnion/cardataset).

This cars dataset includes features such as `make`, `model`, `year`, `engine`, and other properties of the car. 

We will use these features to generate a word embedding for each make-model and then compare the similarities between different make-models.

The following dataframe shows a sample of this dataset.

In [6]:
location = 'https://github.com/gridflowai/gridflowAI-datasets-icons/raw/master/AI-DATASETS/01-MISC/word2vec-data.csv'

In [7]:
df = pd.read_csv(location)
df.sample(10)

Unnamed: 0,Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
11218,Hyundai,Veloster,2014,regular unleaded,132.0,4.0,AUTOMATED_MANUAL,front wheel drive,3.0,Hatchback,Compact,2dr Hatchback,36,28,1439,21650
10645,Ford,Transit Connect,2015,regular unleaded,169.0,4.0,AUTOMATIC,front wheel drive,4.0,,Compact,Passenger Minivan,28,20,5657,26710
268,Nissan,350Z,2007,regular unleaded,306.0,6.0,MANUAL,rear wheel drive,2.0,High-Performance,Compact,Coupe,25,18,2009,36100
1563,Cadillac,ATS,2017,premium unleaded (recommended),272.0,4.0,AUTOMATIC,rear wheel drive,4.0,"Luxury,Performance",Midsize,Sedan,31,22,1624,34595
2196,Chevrolet,Camaro,2017,premium unleaded (recommended),455.0,8.0,MANUAL,rear wheel drive,2.0,High-Performance,Midsize,Coupe,25,16,1385,36905
11160,Aston Martin,V8 Vantage,2015,premium unleaded (required),430.0,8.0,AUTOMATED_MANUAL,rear wheel drive,2.0,"Exotic,High-Performance",Compact,Coupe,21,14,259,138695
1387,BMW,ALPINA B6 Gran Coupe,2016,premium unleaded (required),600.0,8.0,AUTOMATIC,all wheel drive,4.0,"Factory Tuner,Luxury,High-Performance",Large,Sedan,24,15,3916,122200
11269,Toyota,Venza,2013,regular unleaded,268.0,6.0,AUTOMATIC,all wheel drive,4.0,"Crossover,Performance",Midsize,Wagon,25,18,2031,39020
3013,Chevrolet,Corvette,2017,premium unleaded (recommended),460.0,8.0,MANUAL,rear wheel drive,2.0,High-Performance,Compact,Coupe,25,16,1385,65450
3192,Cadillac,CT6,2017,regular unleaded,335.0,6.0,AUTOMATIC,all wheel drive,4.0,"Luxury,Performance",Large,Sedan,27,18,1624,83495


In [8]:
df.shape

(11914, 16)

#### Gensim Word2Vec input format

Gensim's Word2Vec requires training data in a `list of lists` format: __each document is a list of tokens, and the corpus is a list of those documents__.

First, we need to build this `list of lists` to train the make-model word embeddings.

To achieve this, we apply the following data preprocessing steps:

- Create a new column for __Make Model__
- Generate the list of lists, combining each __Make Model__ with the following features:
    - Engine Fuel Type
    - Transmission Type
    - Driven_Wheels
    - Market Category
    - Vehicle Size
    - Vehicle Style
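
As a rough illustration of the target format, the sketch below builds a `list of lists` from two toy rows in plain Python. The rows, the column subset, and the helper name `row_to_tokens` are assumptions made for this example only; the notebook itself does the same job with pandas in the cells that follow.

```python
# Toy rows standing in for the car dataframe (values are illustrative only)
rows = [
    {"Transmission Type": "MANUAL",
     "Market Category": "Luxury,Performance",
     "Maker_Model": "BMW 1 Series"},
    {"Transmission Type": "AUTOMATIC",
     "Market Category": "Crossover",
     "Maker_Model": "Toyota Venza"},
]
columns = ["Transmission Type", "Market Category", "Maker_Model"]


def row_to_tokens(row, columns):
    """Join the selected fields with commas, then split back into tokens."""
    joined = ",".join(str(row[col]) for col in columns)
    return joined.split(",")


# One token list per row: this is the list-of-lists corpus
corpus = [row_to_tokens(row, columns) for row in rows]
print(corpus)
```

Note that joining and then splitting on commas also breaks a multi-valued field such as `Luxury,Performance` into separate tokens.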

Create a new column for Make Model

In [9]:
df['Maker_Model'] = df['Make']+ " " + df['Model']

Generate a format of list of list for each Make Model

In [10]:
df1 = df[['Engine Fuel Type',
          'Transmission Type',
          'Driven_Wheels',
          'Market Category',
          'Vehicle Size', 
          'Vehicle Style', 
          'Maker_Model']]

df1.head()

Unnamed: 0,Engine Fuel Type,Transmission Type,Driven_Wheels,Market Category,Vehicle Size,Vehicle Style,Maker_Model
0,premium unleaded (required),MANUAL,rear wheel drive,"Factory Tuner,Luxury,High-Performance",Compact,Coupe,BMW 1 Series M
1,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,Performance",Compact,Convertible,BMW 1 Series
2,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,High-Performance",Compact,Coupe,BMW 1 Series
3,premium unleaded (required),MANUAL,rear wheel drive,"Luxury,Performance",Compact,Coupe,BMW 1 Series
4,premium unleaded (required),MANUAL,rear wheel drive,Luxury,Compact,Convertible,BMW 1 Series


In [11]:
pd.set_option('max_colwidth', 140)

In [12]:
# For each row, combine all the columns into one column
df2 = df1.apply(lambda x: ','.join(x.astype(str)), axis=1) 

df2.head()

0    premium unleaded (required),MANUAL,rear wheel drive,Factory Tuner,Luxury,High-Performance,Compact,Coupe,BMW 1 Series M
1                   premium unleaded (required),MANUAL,rear wheel drive,Luxury,Performance,Compact,Convertible,BMW 1 Series
2                    premium unleaded (required),MANUAL,rear wheel drive,Luxury,High-Performance,Compact,Coupe,BMW 1 Series
3                         premium unleaded (required),MANUAL,rear wheel drive,Luxury,Performance,Compact,Coupe,BMW 1 Series
4                               premium unleaded (required),MANUAL,rear wheel drive,Luxury,Compact,Convertible,BMW 1 Series
dtype: object

In [13]:
# Store them in the pandas dataframe
df_clean = pd.DataFrame({'clean': df2}) 
df_clean.head()

Unnamed: 0,clean
0,"premium unleaded (required),MANUAL,rear wheel drive,Factory Tuner,Luxury,High-Performance,Compact,Coupe,BMW 1 Series M"
1,"premium unleaded (required),MANUAL,rear wheel drive,Luxury,Performance,Compact,Convertible,BMW 1 Series"
2,"premium unleaded (required),MANUAL,rear wheel drive,Luxury,High-Performance,Compact,Coupe,BMW 1 Series"
3,"premium unleaded (required),MANUAL,rear wheel drive,Luxury,Performance,Compact,Coupe,BMW 1 Series"
4,"premium unleaded (required),MANUAL,rear wheel drive,Luxury,Compact,Convertible,BMW 1 Series"


In [14]:
# Create the list of list format of the custom corpus for gensim modeling 
sent = [row.split(',') for row in df_clean['clean']]

In [15]:
len(sent)

11914

In [16]:
# show the example of list of list format of the custom corpus for gensim modeling 
sent[:2]

[['premium unleaded (required)',
  'MANUAL',
  'rear wheel drive',
  'Factory Tuner',
  'Luxury',
  'High-Performance',
  'Compact',
  'Coupe',
  'BMW 1 Series M'],
 ['premium unleaded (required)',
  'MANUAL',
  'rear wheel drive',
  'Luxury',
  'Performance',
  'Compact',
  'Convertible',
  'BMW 1 Series']]

#### Gensim Word2Vec Model Training

- __vector_size__: The number of dimensions of the embeddings (named `size` in gensim versions before 4.0). The default is 100.
- __window__: The maximum distance between a target word and the words around it. The default is 5.
- __min_count__: The minimum number of occurrences a word needs to be included in training; words that occur fewer times are ignored. The default is 5.
- __workers__: The number of worker threads used for training. The default is 3.
- __sg__: The training algorithm, either CBOW (0) or skip-gram (1). The default is CBOW.
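
To make the `window` parameter more concrete, here is a minimal sketch of how skip-gram style (target, context) pairs could be enumerated for one token list. The function `skipgram_pairs` is a hypothetical helper written for this illustration; it is not gensim's implementation, which includes further details not shown here.

```python
def skipgram_pairs(tokens, window):
    """Return (target, context) pairs for contexts within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)                # left edge of the window
        hi = min(len(tokens), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                         # a token is not its own context
                pairs.append((target, tokens[j]))
    return pairs


# Toy sentence in the same shape as one row of the corpus
sentence = ["MANUAL", "rear wheel drive", "Luxury",
            "Compact", "Coupe", "BMW 1 Series"]

pairs = skipgram_pairs(sentence, window=3)

# Contexts paired with the make-model token, which sits at the end of the list
contexts_of_model = [context for target, context in pairs
                     if target == "BMW 1 Series"]
print(contexts_of_model)
```

In this toy sentence the make-model is the last token, so with `window=3` it is only paired with the three tokens directly before it.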

In [17]:
## Train the gensim Word2Vec model on our own custom corpus
model = Word2Vec(sent,
                 min_count   = 2,
                 vector_size = 50,
                 workers     = 3,
                 window      = 3,
                 sg          = 1)

In [19]:
# Get the vocabulary
vocab = model.wv.index_to_key

# Print first 10 words as an example
print(vocab[:10])

['AUTOMATIC', 'regular unleaded', 'front wheel drive', 'Compact', 'Midsize', 'nan', 'rear wheel drive', 'Luxury', 'Sedan', 'MANUAL']
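
The `'nan'` entry in this vocabulary is most likely the string form of missing values (for example, rows with no Market Category), produced when the columns are converted with `astype(str)`. If that token is not wanted, one option is to filter it out before training. The sketch below assumes toy token lists and a hypothetical helper `drop_missing`; it is not part of the original notebook.

```python
# Toy token lists; 'nan' is what str() produces for a missing pandas value
sentences = [
    ["regular unleaded", "AUTOMATIC", "nan", "Compact", "Ford Transit Connect"],
    ["premium unleaded (required)", "MANUAL", "Luxury", "Compact", "BMW 1 Series"],
]


def drop_missing(tokens, missing_marker="nan"):
    """Remove the placeholder token that stands for a missing value."""
    return [token for token in tokens if token != missing_marker]


cleaned = [drop_missing(tokens) for tokens in sentences]
print(cleaned[0])
```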


In [20]:
len(vocab)

936
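
The vocabulary only contains tokens that occur at least `min_count` times, so a make-model that appears in a single row has no vector when `min_count = 2`, and looking it up raises a `KeyError`. The sketch below imitates that filtering step with a plain counter on a toy corpus; the corpus and the token `Example Rare Model` are made up for illustration.

```python
from collections import Counter

# Toy corpus: 'Example Rare Model' is a made-up token that appears only once
toy_corpus = [
    ["MANUAL", "Coupe", "BMW 1 Series"],
    ["MANUAL", "Convertible", "BMW 1 Series"],
    ["AUTOMATIC", "Coupe", "Example Rare Model"],
]
min_count = 2

# Count every token across all sentences
counts = Counter(token for sentence in toy_corpus for token in sentence)

# Tokens at or above the threshold are kept; the rest are ignored
kept = sorted(token for token, n in counts.items() if n >= min_count)
dropped = sorted(token for token, n in counts.items() if n < min_count)

print(kept)
print(dropped)
```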

In [21]:
## We can obtain the word embedding directly from the trained model
model.wv['BMW Z8']

array([-0.11443728,  0.1334992 ,  0.00770702,  0.00082791,  0.05860547,
       -0.04584971, -0.0435245 ,  0.21960777, -0.01032996, -0.01677156,
        0.09349984, -0.04748993, -0.01885843, -0.03473523,  0.03534262,
        0.01199429,  0.11990506,  0.01819527, -0.02782521, -0.17387272,
       -0.03394634,  0.00131194,  0.16644926,  0.03219851,  0.10352757,
        0.00688926, -0.02091132,  0.13381971,  0.02607962, -0.00314559,
       -0.0640417 ,  0.03661994, -0.00969608,  0.00539498, -0.05330382,
       -0.02629333,  0.10953173,  0.00850281,  0.06331643,  0.05940827,
        0.04957276, -0.01379175, -0.08206447,  0.04063516,  0.17337848,
        0.02134152, -0.0412142 ,  0.00308212, -0.00636822,  0.08817437],
      dtype=float32)

#### Compare Similarities

We can use Word2Vec to compute the similarity between two make-models in the vocabulary by calling `model.wv.similarity()` and passing in the relevant words.

For instance, `model.wv.similarity('Porsche 718 Cayman', 'Toyota Tercel')`.

This gives us the `cosine similarity` between `Porsche 718 Cayman` and `Toyota Tercel`.

In [22]:
model.wv.similarity('Porsche 718 Cayman', 'Toyota Tercel')

0.7930343

Scores closer to 1 indicate more similar make-models. The pair above, Porsche 718 Cayman and Toyota Tercel, scores about 0.79; the next cell scores Hyundai Accent against Mercedes-Benz SLK-Class.



In [24]:
model.wv.similarity('Hyundai Accent', 'Mercedes-Benz SLK-Class')

0.76696205

#### __model.wv.most_similar()__

In [25]:
## Show the most similar vehicles for Mercedes-Benz SLK-Class, ranked by cosine similarity
model.wv.most_similar('Mercedes-Benz SLK-Class')[:5]

[('Porsche Boxster', 0.9851690530776978),
 ('Ferrari California', 0.9848737716674805),
 ('Subaru BRZ', 0.9847809076309204),
 ('BMW M', 0.9844955801963806),
 ('Scion FR-S', 0.984492838382721)]

In [26]:
## Show the most similar vehicles for Toyota Camry, ranked by cosine similarity
## (most_similar returns 10 results by default, so the [:15] slice yields 10)
model.wv.most_similar('Toyota Camry')[:15]

[('Hyundai Sonata', 0.9861137270927429),
 ('Toyota Avalon', 0.9856813549995422),
 ('Nissan Altima', 0.9830912947654724),
 ('Nissan Sentra', 0.9830172657966614),
 ('Buick Verano', 0.9784183502197266),
 ('Chevrolet Cruze', 0.976456344127655),
 ('Volkswagen Passat', 0.9762012958526611),
 ('Ford Five Hundred', 0.9731436967849731),
 ('Chevrolet Malibu', 0.9727350473403931),
 ('Oldsmobile Alero', 0.9715654253959656)]

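However, `Euclidean distance` does not work well for `high-dimensional word vectors`.

This is because Euclidean distance depends on vector magnitude and tends to grow as the number of dimensions increases, regardless of whether the embeddings are close in meaning.

Instead, we can use `cosine similarity`, which depends only on the angle between two vectors. Gensim's `similarity()` and `most_similar()` already use this measure; the function below computes it by hand so that the candidates can be restricted to make-models.

Under cosine similarity, no similarity corresponds to a 90-degree angle (a score of 0), while total similarity corresponds to a 0-degree angle (a score of 1).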
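
A small worked example may help here. The sketch below uses three hand-picked toy vectors (not embeddings from the model) and two helper functions written for this illustration, to show that cosine similarity ignores vector length while Euclidean distance does not.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot_product = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [1.0, 2.0, 2.0]
b = [3.0, 6.0, 6.0]   # same direction as a, three times the length
c = [2.0, -1.0, 0.0]  # orthogonal to a (their dot product is 0)

print(cosine_similarity(a, b))   # same direction: similarity of 1
print(euclidean_distance(a, b))  # yet the points are 6 units apart
print(cosine_similarity(a, c))   # 90-degree angle: similarity of 0
```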

In [27]:
def cosine_sim(model, word, target_list, num):
    """Return the `num` items of target_list most similar to `word` by cosine similarity."""
    cosine_dict = {}
    a = model.wv[word]  # in gensim 4 the vectors are accessed through model.wv

    for item in target_list:
        # Skip the query itself and any item dropped from the vocabulary by min_count
        if item != word and item in model.wv.key_to_index:
            b = model.wv[item]

            cos_sim = dot(a, b) / (norm(a) * norm(b))

            cosine_dict[item] = cos_sim

    # Sort by similarity in descending order
    dist_sort = sorted(cosine_dict.items(), key=lambda dist: dist[1], reverse=True)

    return dist_sort[:num]

In [28]:
list(df1.Maker_Model.unique())[:15]

['BMW 1 Series M',
 'BMW 1 Series',
 'Audi 100',
 'FIAT 124 Spider',
 'Mercedes-Benz 190-Class',
 'BMW 2 Series',
 'Audi 200',
 'Chrysler 200',
 'Nissan 200SX',
 'Nissan 240SX',
 'Volvo 240',
 'Mazda 2',
 'BMW 3 Series Gran Turismo',
 'BMW 3 Series',
 'Mercedes-Benz 300-Class']

In [37]:
Maker_Model = list(df1.Maker_Model.unique()) ## only get the unique Maker_Model values

## Show the most similar Mercedes-Benz SLK-Class by cosine similarity
cosine_sim (model, 'Mercedes-Benz SLK-Class', Maker_Model, 5)

[('BMW M6', 0.99418545),
 ('Nissan GT-R', 0.9940916),
 ('Aston Martin V12 Vanquish', 0.9935934),
 ('Audi S5', 0.9934372),
 ('Mercedes-Benz SLS AMG GT', 0.99321395)]