## PorterStemmer

In [0]:
import nltk
from nltk.stem import PorterStemmer

In [2]:
stemmer = PorterStemmer()

example = "A cat was chasing a mouse"

# Keep the stems in their own list instead of overwriting the input string
stems = [stemmer.stem(token) for token in example.split(" ")]

print(" ".join(stems))

A cat wa chase a mous


In [3]:
text = "Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California."

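# Splitting on spaces leaves punctuation attached, so tokens such as
# "Motors," and "manufacturing." reach the stemmer with their punctuation
stems = [stemmer.stem(token) for token in text.split(" ")]

print(" ".join(stems))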

tesla, inc. (formerli tesla motors, inc.) is an american automot and energi compani base in palo alto, california.[7] the compani special in electr car manufactur and, through it solarc subsidiary, solar panel manufacturing. It oper multipl product and assembl plants, notabl gigafactori 1 near reno, nevada, and it main vehicl manufactur facil at tesla factori in fremont, california.
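
The output above shows tokens such as `motors,` and `manufacturing.` coming through unstemmed, because splitting on spaces leaves the punctuation attached. A minimal sketch of one way around this, assuming a simple regular-expression tokenizer (the `tokenize` helper is hypothetical and not part of NLTK), separates words from punctuation before stemming:

```python
import re

def tokenize(text):
    """Split text into word tokens and single punctuation marks (hypothetical helper)."""
    # \w+ matches a run of letters or digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Tesla Motors, Inc. is based in Palo Alto, California.")
print(tokens)
```

Each word token can then be passed to `stemmer.stem` on its own. NLTK also ships its own tokenizers (for example `nltk.tokenize.word_tokenize`), which may be a better fit for real text.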


## WordNet Lemmatizer

In [4]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

example = "A cat was chasing mice"

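# lemmatize() treats every token as a noun by default (pos='n'):
# "mice" becomes "mouse", but the verb "was" is mangled and "chasing" is left alone
example = [lemmatizer.lemmatize(token) for token in example.split(" ")]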

print(" ".join(example))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
A cat wa chasing mouse


In [5]:
example = "There was cacti around the corner"

example = [lemmatizer.lemmatize(token) for token in example.split(" ")]

print(" ".join(example))

There wa cactus around the corner


In [6]:
# With pos='a' (adjective), WordNet maps the irregular form to its base adjective
print(lemmatizer.lemmatize('better', pos='a'))

good


In [7]:
# Without a POS tag the word is looked up as a noun and returned unchanged
print(lemmatizer.lemmatize('better'))

better


In [8]:
# lemmatize() expects a single word; a whole passage has no WordNet entry, so it is returned unchanged
print(lemmatizer.lemmatize('Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California. As of June 2018, Tesla sells the Model S, Model X and Model 3 vehicles, Powerwall and Powerpack batteries, solar panels, solar roof tiles, and some related products. '))

Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California. As of June 2018, Tesla sells the Model S, Model X and Model 3 vehicles, Powerwall and Powerpack batteries, solar panels, solar roof tiles, and some related products. 


## CountVectorizer and TfidfVectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
# binary=True records presence or absence (1 or 0) instead of raw counts
vect = CountVectorizer(binary=True)

corpus = ["Tesseract is an optical character recognition engine", "optical character recognition"]
vect.fit(corpus)

print(vect.transform(corpus).toarray())

[[1 1 1 1 1 1 1]
 [0 1 0 0 1 1 0]]


In [11]:
vocab = vect.vocabulary_

for key in sorted(vocab.keys()):
    print("{}:{}".format(key, vocab[key]))

an:0
character:1
engine:2
is:3
optical:4
recognition:5
tesseract:6
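
The matrix and vocabulary above can be reproduced by hand. This is a minimal sketch rather than scikit-learn's actual implementation: it assumes the default behaviour of lowercasing and keeping tokens of two or more word characters, and the `tokenize` helper is hypothetical.

```python
import re

corpus = [
    "Tesseract is an optical character recognition engine",
    "optical character recognition",
]

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring CountVectorizer's default token pattern
    return re.findall(r"(?u)\b\w\w+\b", doc.lower())

# Sorted vocabulary: each distinct token gets a column index
all_terms = sorted({term for doc in corpus for term in tokenize(doc)})
vocabulary = {term: index for index, term in enumerate(all_terms)}

# Binary document-term matrix: 1 if the term occurs in the document, else 0
matrix = []
for doc in corpus:
    present = set(tokenize(doc))
    matrix.append([1 if term in present else 0 for term in vocabulary])

print(vocabulary)
print(matrix)
```

Each row is one document and each column one vocabulary term, so the second row only has ones in the columns for `character`, `optical` and `recognition`.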


In [12]:
# binary=True caps each term frequency at 1 before the IDF weighting is applied
vect = TfidfVectorizer(binary=True)

corpus = ["CNN is good optical character recognition", "optical character recognition"]
vect.fit(corpus)

# "today" is not in the fitted vocabulary, so it contributes nothing to the vector
print(vect.transform(["Today is good optical"]).toarray())

[[0.         0.         0.6316672  0.6316672  0.44943642 0.        ]]
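
The weights in the output above can be checked by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation) and uses pre-tokenized, lowercased word lists in place of the vectorizer's own tokenizer.

```python
import math

corpus_tokens = [
    ["cnn", "is", "good", "optical", "character", "recognition"],
    ["optical", "character", "recognition"],
]
query_tokens = ["today", "is", "good", "optical"]

n_docs = len(corpus_tokens)
vocabulary = sorted({term for doc in corpus_tokens for term in doc})

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = {}
for term in vocabulary:
    df = sum(term in doc for doc in corpus_tokens)
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

# Binary term frequency: the idf weight if the term occurs in the query, else 0
raw = [idf[term] if term in query_tokens else 0.0 for term in vocabulary]

# L2-normalise so the vector has unit length
norm = math.sqrt(sum(w * w for w in raw))
weights = [w / norm for w in raw]

for term, weight in zip(vocabulary, weights):
    print(f"{term}: {weight:.7f}")
```

`good` and `is` each appear in only one training document, so they get a higher IDF than `optical`, which appears in both. That is why they carry the larger weights in the vector.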


## Cosine Similarity

In [13]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

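# vect is the TfidfVectorizer fitted in the previous cell; words outside its
# vocabulary ("tesseract", "an", "engine") are ignored
similarity = cosine_similarity(vect.transform(["Tesseract is an optical character recognition engine"]).toarray(),
                               vect.transform(["Optical character recognition"]).toarray())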

print(similarity)

[[0.77651453]]
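
Cosine similarity is the dot product of two vectors divided by the product of their lengths. As a rough check of the value above, the sketch below writes the formula out by hand. It assumes the unnormalised binary TF-IDF weights over the vocabulary fitted on the CNN corpus (smoothed IDF); the `cosine` helper is hypothetical, not a scikit-learn function.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Columns follow the fitted vocabulary:
# [character, cnn, good, is, optical, recognition]
# "is" occurs in one of the two training documents, so its idf is ln(3/2) + 1;
# the other shared terms occur in both documents and have idf 1
idf_is = math.log(3 / 2) + 1
doc_a = [1.0, 0.0, 0.0, idf_is, 1.0, 1.0]  # "... is an optical character recognition engine"
doc_b = [1.0, 0.0, 0.0, 0.0, 1.0, 1.0]     # "Optical character recognition"

similarity = cosine(doc_a, doc_b)
print(round(similarity, 8))
```

Because cosine similarity divides by both vector lengths, it makes no difference whether the vectors have been L2-normalised beforehand.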


## spaCy

In [0]:
import spacy

In [0]:
nlp = spacy.load('en_core_web_sm')

## Finding cosine similarities for collections of text

In [0]:
str1 = "Summer is a charming flirt. Easy-going and casual. Summer doesn't huff and puff to win our affections. It has us at \"Hello.\" Winter broods like the tortured protagonist of big fat Russian novels. It is daunting and dramatic, burning with a slow intensity. The season's reputation precedes itself, and often, not in a good way. It has a way of whittling down everything to its bare bones. Even relationships not attuned to its ebbs and flows can fray. At a dinner conversation I once attended, I listened in bemusement as a recent divorcee made the case that it was the Scandinavian frost that had cooled his ex-wife's ardour. How original."

str2 = "One of the finer books I read this year was John Kaag's Hiking With Nietzsche, in which Kaag, a professor of philosophy, rekindles his passion for the German thinker while tracing picturesque hiking trails in the mountains of Switzerland. It's a near-precise rendering of the travelogue as a self-help book. A young Kaag was an avowed Nietzsche acolyte but given the ravages of responsibilities and adulthood, the writer put his affinity to test by undertaking physically enduring hikes through the Alps, revisiting haunts that the philosopher escaped to, in search of solitude and salve. The journey's demands, coupled with his own inner turmoil, are catnip for anybody feeling at cross purposes with their own life."

str3 = "If there's a phrase I would prefer to retire from online bios, personal or professional, it is, \"I love travel.\" Or some approximation of that sentiment. To clarify, I am not against travellers or those who proudly flaunt their passion for travel. On the contrary, editing a travel magazine has now made me oddly protective of travellers and their ilk. My submission is that \"love to travel,\" suggested so casually, just doesn't feel adequate to the depth of emotion it sparks in true devotees. In February, the month of love as endowed by our great gifting industrial complex, we are wrestling with what \"love for travel\" means in tangible, life-affecting terms. The early throes of discovering travel might not be too dissimilar to the beginnings of a feverish affair. A fleeting scene, sound or feeling that at first arouses, then enchants and eventually lures us into a hypnotic state, evoking wooly-eyed reveries about what could be."

In [0]:
vect = TfidfVectorizer(binary = True)

corpus = [str1, str2, str3]
vect.fit(corpus)

vecstr1 = vect.transform([str1]).toarray()
vecstr2 = vect.transform([str2]).toarray()
vecstr3 = vect.transform([str3]).toarray()

In [0]:
sim = cosine_similarity(vecstr1, vecstr2)

In [21]:
sim

array([[0.06390515]])

In [24]:
print('Cosine similarity between text 1 and 2:', cosine_similarity(vecstr1, vecstr2))

print('Cosine similarity between text 2 and 3:', cosine_similarity(vecstr2, vecstr3))

print('Cosine similarity between text 1 and 3:', cosine_similarity(vecstr1, vecstr3))

Cosine similarity between text 1 and 2: [[0.06390515]]
Cosine similarity between text 2 and 3: [[0.08754239]]
Cosine similarity between text 1 and 3: [[0.08875505]]
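
The three pairwise comparisons above can also be computed in one call. This is a sketch under stated assumptions: the short `docs` strings are placeholders standing in for `str1`, `str2` and `str3`, so the printed values will differ from the ones above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Short placeholder documents standing in for str1, str2 and str3
docs = [
    "summer is a charming flirt",
    "hiking trails in the mountains of Switzerland",
    "the early throes of discovering travel",
]

# fit_transform returns one sparse TF-IDF row per document
tfidf = TfidfVectorizer(binary=True).fit_transform(docs)

# With a single argument, cosine_similarity compares every row with every other row
sim_matrix = cosine_similarity(tfidf)

for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(f"Cosine similarity between text {i + 1} and {j + 1}: {sim_matrix[i, j]:.8f}")
```

The result is a symmetric matrix with ones on the diagonal, so only the entries above the diagonal need to be read.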
