# About this notebook 

#### Feature: Description

This notebook uses the Word2Vec natural language processing (NLP) algorithm, as implemented in gensim, to find similarities between words in the pets' description field.

<div class="span5 alert alert-success">
<p> <I> Feature Description: </I> The "Description" data is a profile write-up for each pet.
     <br>
    <I> Source: </I> https://www.kaggle.com/c/petfinder-adoption-prediction/data  </p>
</div>

<div class="span5 alert alert-success">
<p> <I> Target (Adoption Speed) Description: </I> 

Contestants are required to predict this value. The value is determined by how quickly, if at all, a pet is adopted.   
<br> 
The values are determined in the following way:   
0 - Pet was adopted on the same day as it was listed.    
1 - Pet was adopted between 1 and 7 days (1st week) after being listed.    
2 - Pet was adopted between 8 and 30 days (1st month) after being listed.    
3 - Pet was adopted between 31 and 90 days (2nd & 3rd month) after being listed.    
4 - No adoption after 100 days of being listed.    

</p>
</div>
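
The class boundaries listed above can be written as a small lookup function. This is only an illustrative sketch: the function name `adoption_speed` and the convention that `None` means "never adopted" are assumptions made here, not part of the competition data. The list above does not say how 91 to 100 days is encoded, so the sketch treats anything beyond 90 days as class 4.

```python
from typing import Optional


def adoption_speed(days_to_adoption: Optional[int]) -> int:
    """Map days between listing and adoption to an AdoptionSpeed class (0-4).

    Assumption: None stands for a pet that was never adopted.
    """
    if days_to_adoption is None:
        return 4
    if days_to_adoption < 0:
        raise ValueError("days_to_adoption must be non-negative")
    if days_to_adoption == 0:
        return 0  # adopted on the listing day
    if days_to_adoption <= 7:
        return 1  # 1st week
    if days_to_adoption <= 30:
        return 2  # 1st month
    if days_to_adoption <= 90:
        return 3  # 2nd and 3rd month
    # Assumption: the 91-100 day range is not defined in the list above,
    # so everything past 90 days is treated as "no adoption" here.
    return 4


for days in (0, 3, 30, 45, None):
    print(days, "->", adoption_speed(days))
```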

In [20]:
import warnings
warnings.filterwarnings('ignore')

%cd C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData

C:\Users\Ken\Documents\KenP\Applications-DataScience\SpringboardCourseWork\CapstoneProject2Repository\09 PetfindersData\TrainingData


<div class="span5 alert alert-info">
<p> <B>  Imports and Data Loading: </B>  </p>
</div>

In [21]:
#Imports
import pandas as pd
import numpy as np
import gensim
import nltk

In [22]:
#Import the csv file

dfi = pd.read_csv('train.csv')

In [23]:
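#Create dataframe of pet description feature
#.copy() makes dfd independent of dfi, so adding a column later does not
#trigger a SettingWithCopyWarning
dfd = dfi[['Description']].copy()
dfd.columns = ['description']
dfd.head(1)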

Unnamed: 0,description
0,Milo went missing after a week with her new ad...


<div class="span5 alert alert-info">
<p> <B>  Tokenize and normalize the description data </B>  </p>
</div>

In [24]:
#Tokenize and normalize the description data
#(tokens are filtered and lowercased; no lemmatization is applied here)

mylist = []
#fillna('') keeps missing descriptions from turning into the literal token 'nan'
for text in dfd['description'].fillna(''):

    #split sentence into words
    tokens = nltk.word_tokenize(str(text))

    #remove all tokens that are not alphabetic
    words = [word for word in tokens if word.isalpha()]

    #convert the tokens to lowercase
    wordslc = [word.lower() for word in words]

    mylist.append(wordslc)

#one list of lowercase alphabetic tokens per description
dfd['tokenized_desc'] = mylist
dfd.head(1)

Unnamed: 0,description,tokenized_desc
0,Milo went missing after a week with her new ad...,"[milo, went, missing, after, a, week, with, he..."
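
To make the effect of the filtering steps concrete, here is a minimal standard-library sketch of the same pipeline. The regular expression is only a rough stand-in for `nltk.word_tokenize`, not a reimplementation of it, and the function name `normalize_description` and the sample sentence are made up for illustration.

```python
import re

# Rough stand-in for a word tokenizer (assumption, not NLTK's algorithm):
# words, optionally joined by hyphens or apostrophes, or single punctuation marks.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def normalize_description(text):
    """Tokenize, keep purely alphabetic tokens, and lowercase them."""
    tokens = TOKEN_PATTERN.findall(text)
    # str.isalpha() is False for punctuation, digits, and hyphenated tokens,
    # so tokens such as ',' and '2-year-old' are dropped at this step.
    words = [token for token in tokens if token.isalpha()]
    return [word.lower() for word in words]


sample = "Milo, a 2-year-old pup, went missing!"  # hypothetical description
print(normalize_description(sample))
```

Note that the `isalpha()` filter removes hyphenated and numeric tokens entirely rather than splitting them, so information such as ages written with digits does not reach the Word2Vec model.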


In [25]:
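#build vocabulary and train model
#Note: passing the corpus to the constructor already builds the vocabulary and
#trains for the default number of epochs; train() below continues for 10 more.
#In gensim >= 4.0 the 'size' argument is named 'vector_size'.
model = gensim.models.Word2Vec(mylist, size=150, window=10, min_count=2, workers=10)

model.train(mylist, total_examples=len(mylist), epochs=10)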

(6770378, 9040180)
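
The tuple returned by `train()` can be read as (effective word count, raw word count): the second number is the total number of tokens seen across all epochs, and the first is how many were actually used after rare words were ignored and frequent words were downsampled. The short calculation below, which simply reuses the two numbers from the output above, is a sketch of how to interpret them.

```python
# Values copied from the train() output above.
effective_words, raw_words = 6770378, 9040180
epochs = 10  # matches the epochs argument passed to train()

# Raw tokens in one pass over the corpus.
raw_per_epoch = raw_words / epochs

# Share of tokens that contributed to training updates.
kept_fraction = effective_words / raw_words

print(f"raw tokens per epoch: {raw_per_epoch:.0f}")
print(f"fraction of tokens used in training: {kept_fraction:.3f}")
```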

In [26]:
#find the words most similar to 'dog' (top 10 by default)
w1 = 'dog'
model.wv.most_similar(positive=w1)

[('cat', 0.6178869009017944),
 ('pet', 0.5253076553344727),
 ('puppy', 0.4892923831939697),
 ('security', 0.45897799730300903),
 ('dogs', 0.44591596722602844),
 ('pup', 0.43469223380088806),
 ('doggie', 0.43293023109436035),
 ('alerting', 0.4191069006919861),
 ('watchdog', 0.3913504481315613),
 ('children', 0.3816896080970764)]

In [27]:
#similarity between two different words
model.wv.similarity(w1='dog',w2='cat')

0.6178869

In [28]:
#similarity between two different words
model.wv.similarity(w1='dog',w2='pet')

0.52530766

In [29]:
#similarity between two different words
model.wv.similarity(w1='cat',w2='pet')

0.4153346
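
The scores above are cosine similarities between word vectors, which is why the `most_similar` score for 'cat' matches `similarity('dog', 'cat')`. The sketch below reproduces both operations in plain Python on made-up 3-dimensional vectors. The vectors, the word list, and the helper names `cosine_similarity` and `most_similar` are illustrative assumptions; they are not taken from the trained model, whose vectors have 150 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.3],
    "cat": [0.8, 0.2, 0.4],
    "pet": [0.5, 0.5, 0.5],
    "car": [-0.2, 0.9, -0.4],
}

for other, score in most_similar(toy_vectors, "dog"):
    print(f"{other}: {score:.3f}")
```

Because cosine similarity depends only on the angle between vectors, it ranges from -1 to 1 and is unaffected by vector length.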