## PorterStemmer

In [0]:
import nltk
from nltk.stem import PorterStemmer

In [2]:
stemmer = PorterStemmer()

example = "A cat was chasing a mouse"

example = [stemmer.stem(token) for token in example.split(" ")]

print(" ".join(example))

A cat wa chase a mous


In [3]:
text = "Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California."

text = [stemmer.stem(token) for token in text.split(" ")]

print(" ".join(text))

tesla, inc. (formerli tesla motors, inc.) is an american automot and energi compani base in palo alto, california.[7] the compani special in electr car manufactur and, through it solarc subsidiary, solar panel manufacturing. It oper multipl product and assembl plants, notabl gigafactori 1 near reno, nevada, and it main vehicl manufactur facil at tesla factori in fremont, california.
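
In the output above, tokens that still carry punctuation ("motors,", "subsidiary,", "manufacturing.") come back unstemmed, because splitting on spaces leaves the punctuation attached. A minimal sketch of one way around this, assuming a simple `\w+` token pattern is acceptable (the helper name `stem_words` is hypothetical, and a proper tokenizer may be preferable for real text):

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_words(text):
    """Stem only the word-like tokens of `text`, dropping punctuation."""
    # \w+ is an assumed, deliberately simple token pattern.
    tokens = re.findall(r"\w+", text)
    return [stemmer.stem(token) for token in tokens]


sample = "through its SolarCity subsidiary, solar panel manufacturing."
stems = stem_words(sample)
print(" ".join(stems))
```

Note that this discards punctuation entirely, which may or may not suit the downstream task.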


## WordNet Lemmatizer

In [4]:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

example = "A cat was chasing mice"

example = [lemmatizer.lemmatize(token) for token in example.split(" ")]

print(" ".join(example))

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
A cat wa chasing mouse


In [5]:
example = "There was cacti around the corner"

example = [lemmatizer.lemmatize(token) for token in example.split(" ")]

print(" ".join(example))

There wa cactus around the corner


In [6]:
# pos='a' marks the word as an adjective, so 'better' maps to its base form.
print(lemmatizer.lemmatize('better', pos='a'))

good


In [7]:
# Without pos, the lemmatizer treats the word as a noun, so 'better' is left as-is.
print(lemmatizer.lemmatize('better'))

better


In [8]:
# lemmatize() works on one word at a time; a whole passage is treated as a single unknown word and returned unchanged.
print(lemmatizer.lemmatize('Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California. As of June 2018, Tesla sells the Model S, Model X and Model 3 vehicles, Powerwall and Powerpack batteries, solar panels, solar roof tiles, and some related products. '))

Tesla, Inc. (formerly Tesla Motors, Inc.) is an American automotive and energy company based in Palo Alto, California.[7] The company specializes in electric car manufacturing and, through its SolarCity subsidiary, solar panel manufacturing. It operates multiple production and assembly plants, notably Gigafactory 1 near Reno, Nevada, and its main vehicle manufacturing facility at Tesla Factory in Fremont, California. As of June 2018, Tesla sells the Model S, Model X and Model 3 vehicles, Powerwall and Powerpack batteries, solar panels, solar roof tiles, and some related products. 


## CountVectorizer and TfidfVectorizer

In [0]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

In [10]:
# binary=True records presence/absence (0/1) instead of raw term counts.
vect = CountVectorizer(binary=True)

corpus = ["Tesseract is an optical character recognition engine", "optical character recognition"]
vect.fit(corpus)

print(vect.transform(corpus).toarray())

[[1 1 1 1 1 1 1]
 [0 1 0 0 1 1 0]]


In [11]:
vocab = vect.vocabulary_

for key in sorted(vocab.keys()):
  print("{}:{}".format(key, vocab[key]))

an:0
character:1
engine:2
is:3
optical:4
recognition:5
tesseract:6
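
To make the matrix and vocabulary above less opaque, here is a rough pure-Python sketch of a binary bag-of-words. It assumes lowercasing and tokens of two or more word characters, which is meant to mimic `CountVectorizer`'s defaults; the function name `binary_bag_of_words` is hypothetical and this is not how scikit-learn implements it internally:

```python
import re


def binary_bag_of_words(corpus):
    """Return (vocabulary, matrix) with one 0/1 row per document."""
    # Lowercase, then keep tokens of two or more word characters
    # (an assumption chosen to resemble CountVectorizer's defaults).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in corpus]
    # Column indices follow the alphabetical order of the distinct tokens.
    terms = sorted({token for tokens in tokenized for token in tokens})
    vocabulary = {term: index for index, term in enumerate(terms)}
    # 1 if the term occurs in the document at all, else 0.
    matrix = [[1 if term in tokens else 0 for term in terms] for tokens in tokenized]
    return vocabulary, matrix


corpus = ["Tesseract is an optical character recognition engine", "optical character recognition"]
vocabulary, matrix = binary_bag_of_words(corpus)
print(matrix)
```

The second row has ones only in the columns for "character", "optical" and "recognition", which is the same pattern as the array printed by the vectorizer.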


In [12]:
# binary=True clips each term frequency to 0/1 before the IDF weighting is applied.
vect = TfidfVectorizer(binary=True)

corpus = ["CNN is good optical character recognition", "optical character recognition"]
vect.fit(corpus)

print(vect.transform(["Today is good optical"]).toarray())

[[0.         0.         0.6316672  0.6316672  0.44943642 0.        ]]
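
The non-zero entries above can be checked by hand. The sketch below assumes scikit-learn's default settings (smoothed IDF and L2 normalisation) and that "today" is dropped because it is not in the fitted vocabulary; the helper `smoothed_idf` is a hypothetical name used only for this illustration:

```python
import math


def smoothed_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


# Document frequencies in the two-document corpus, for the query terms
# that are in the vocabulary ("today" is not, so it is ignored).
doc_freq = {"good": 1, "is": 1, "optical": 2}

# With binary=True every present term has tf = 1, so the raw weight is the IDF.
raw = {term: smoothed_idf(2, df) for term, df in doc_freq.items()}

# L2-normalise so the vector has unit length.
norm = math.sqrt(sum(w * w for w in raw.values()))
weights = {term: w / norm for term, w in raw.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:.7f}")
```

"good" and "is" each appear in only one of the two documents, so they receive a higher weight than "optical", which appears in both.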


## Cosine Similarity

In [13]:
import pandas as pd
import numpy as np

from sklearn.metrics.pairwise import cosine_similarity

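# vect is still the TfidfVectorizer fitted in the previous section, so words outside its vocabulary are ignored.
similarity = cosine_similarity(vect.transform(["Tesseract is an optical character recognition engine"]).toarray(),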
                               vect.transform(["Optical character recognition"]).toarray())

print(similarity)

[[0.77651453]]
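
The score above can be reproduced from the cosine formula, dot(u, v) / (|u| |v|). This sketch assumes the TF-IDF vectorizer fitted in the previous section is the one in scope, so only "character", "is", "optical" and "recognition" count toward the vectors. Cosine similarity is scale-invariant, so unnormalised IDF weights are used:

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# "is" occurs in one of the two fitted documents: smoothed IDF = ln(3/2) + 1.
idf_is = math.log(3 / 2) + 1

# Columns: character, is, optical, recognition (in-vocabulary terms only).
# The other three terms occur in both documents, so their IDF is 1.
first = [1.0, idf_is, 1.0, 1.0]
second = [1.0, 0.0, 1.0, 1.0]

similarity = cosine(first, second)
print(round(similarity, 8))
```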


## spaCy

In [0]:
import spacy

In [0]:
nlp = spacy.load('en_core_web_sm')