## Natural Language Processing Made Easy – using SpaCy (in Python)

https://www.analyticsvidhya.com/blog/2017/04/natural-language-processing-made-easy-using-spacy-%E2%80%8Bin-python/

In [1]:
import pandas as pd
import spacy

In [2]:
# The nlp object is used to create documents, access linguistic annotations and different nlp properties
# Load English tokenizer, tagger, parser, NER and word vectors.  This is an instance of the English class with 
# the model weights loaded in, so spaCy can predict part-of-speech tags, dependency labels and named entities
nlp_parser = spacy.load('en_core_web_sm')
nlp_parser.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1f18b6ffcf8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1f18ce96ee8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1f18ce96f48>)]

In [3]:
# Read the whole review file; the context manager closes the file handle afterwards
with open('../data/tripadvisorReviews.txt', encoding="utf8") as f:
    document = f.read()

## SpaCy Pipeline and Properties

spaCy exposes its functionality through a processing pipeline, which is created by loading a model.  Here we load 
en_core_web_sm, the small English model.

By default, sentence segmentation is performed by the DependencyParser.  The Sentencizer lets you implement a simpler, 
rule-based strategy that doesn’t require a statistical model to be loaded.
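
To make the idea of rule-based segmentation concrete, here is a minimal pure-Python sketch that splits text after sentence-final punctuation. It is an illustration only and not spaCy's Sentencizer implementation; the helper name `split_sentences` and its default punctuation set are assumptions made for this example.

```python
import re

def split_sentences(text, punct_chars=".!?"):
    """Split text after runs of sentence-final punctuation (hypothetical helper)."""
    escaped = re.escape(punct_chars)
    # Either a stretch of text ending in punctuation, or a trailing remainder without any
    pattern = re.compile(r"[^{0}]+[{0}]+|[^{0}]+$".format(escaped))
    pieces = (m.group().strip() for m in pattern.finditer(text))
    return [p for p in pieces if p]

print(split_sentences("Overall, the rooms were a bit small but nice. Everything was clean!"))
```

A rule this simple also splits abbreviations such as "U.S.", which is one reason a parser-based segmenter can give better boundaries on clean text.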

In [4]:
from spacy.pipeline import Sentencizer

# https://github.com/explosion/spaCy/issues/3569
# By default, nlp.add_pipe will add the component last in the pipeline (after the parser).  To use custom sentence boundaries, 
# you have to apply them before the parser – nlp.add_pipe(sentencizer, before="parser"). The parser will then take your 
# custom boundaries into account as well and only assign dependencies that are consistent with the sentences.  This can 
# lead to an improvement in parsing accuracy
sentencizer = Sentencizer()
nlp_parser.add_pipe(sentencizer, before = 'parser')

parsed_document = nlp_parser(document)
nlp_parser.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x1f18b6ffcf8>),
 ('sentencizer', <spacy.pipeline.pipes.Sentencizer at 0x1f18d0c3f60>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x1f18ce96ee8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x1f18ce96f48>)]

## The document is now part of spacy.english model’s class

The properties of a document (or of its tokens) can be listed with the following command:

In [5]:
dir(parsed_document)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_disk',
 'get_extension',
 'get_lca_matrix',
 'has_extension',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'merge',
 'noun_chunks',
 'noun_chunks_iterator',
 'print_tree',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_extension',
 'similarity',
 'tensor',
 'text',
 'text_with_ws',
 'to_array',
 'to_byte

## Tokenization

Every spaCy document is tokenized into sentences and further into tokens which can be accessed by iterating the document:

In [6]:
# first token of the doc 
print(parsed_document[0])

# a token near the end of the doc (fifth from last)
print(parsed_document[len(parsed_document) - 5])

# List of sentences of our doc 
list(parsed_document.sents)

﻿Nice
boston


[﻿Nice place Better than some reviews give it credit for.,
 Overall, the rooms were a bit small but nice.,
 Everything was clean, the view was wonderful and it is very well located (the Prudential Center makes shopping and eating easy and the T is nearby for jaunts out and about the city).,
 Overall, it was a good experience and the staff was quite friendly.,
 
 what a surprise What a surprise the Sheraton was after reading some of the reviews.,
 it would appear there is a massive difference in the rooms, the South tower being the best.,
 Check in was very efficient and the room was lovely, very large with the most comfortable beds ever.,
 The hotel as stated is in a fantastic location and the Wrentham Village outlet is well worth a visit for bargain shopping ( the bus picks up outside).,
 The hotel bar is a little pricey ( not helped by the current dollar rate) but is a nice place to relax after a busy day shopping.,
 There is a number of restaurants close by.,
 A cab from the airport

## Part of Speech Tagging

A part-of-speech tag describes the grammatical role a word plays in a sentence.  These tags can be used as text features 
in information filtering, statistical models, and rule-based parsing.
spaCy's coarse-grained part-of-speech tags are:

* ADJ: adjective
* ADP: adposition
* ADV: adverb
* AUX: auxiliary verb
* CCONJ: coordinating conjunction
* DET: determiner
* INTJ: interjection
* NOUN: noun
* NUM: numeral
* PART: particle
* PRON: pronoun
* PROPN: proper noun
* PUNCT: punctuation
* SCONJ: subordinating conjunction
* SYM: symbol
* VERB: verb
* X: other

https://spacy.io/api/annotation

The list of other attributes for tokens can be found at https://spacy.io/api/token

## Dependency Labels

spaCy's dependency tag scheme is based upon the ClearNLP project; the meanings of the tags can be found at https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md:

* ACL: Clausal modifier of noun
* ACOMP: Adjectival complement
* ADVCL: Adverbial clause modifier
* ADVMOD: Adverbial modifier
* AGENT: Agent
* AMOD: Adjectival modifier
* APPOS: Appositional modifier
* ATTR: Attribute
* AUX: Auxiliary
* AUXPASS: Auxiliary (passive)
* CASE: Case marker
* CC: Coordinating conjunction
* CCOMP: Clausal complement
* COMPOUND: Compound modifier
* CONJ: Conjunct
* CSUBJ: Clausal subject
* CSUBJPASS: Clausal subject (passive)
* DATIVE: Dative
* DEP: Unclassified dependent
* DET: Determiner
* DOBJ: Direct Object
* EXPL: Expletive
* INTJ: Interjection
* MARK: Marker
* META: Meta modifier
* NEG: Negation modifier
* NOUNMOD: Modifier of nominal
* NPMOD: Noun phrase as adverbial modifier
* NSUBJ: Nominal subject
* NSUBJPASS: Nominal subject (passive)
* NUMMOD: Number modifier
* OPRD: Object predicate
* PARATAXIS: Parataxis
* PCOMP: Complement of preposition
* POBJ: Object of preposition
* POSS: Possession modifier
* PRECONJ: Pre-correlative conjunction
* PREDET: Pre-determiner
* PREP: Prepositional modifier
* PRT: Particle
* PUNCT: Punctuation
* QUANTMOD: Modifier of quantifier
* RELCL: Relative clause modifier
* ROOT: Root
* XCOMP: Open clausal complement

In [7]:
# Map each coarse POS tag ID to its string name, for every tag that occurs in the document
all_tags = {w.pos: w.pos_ for w in parsed_document}
all_tags

{100: 'VERB',
 92: 'NOUN',
 86: 'ADV',
 85: 'ADP',
 90: 'DET',
 95: 'PRON',
 97: 'PUNCT',
 84: 'ADJ',
 89: 'CCONJ',
 96: 'PROPN',
 94: 'PART',
 103: 'SPACE',
 99: 'SYM',
 93: 'NUM',
 101: 'X',
 87: 'AUX',
 91: 'INTJ'}

In [8]:
# To get the meaning of a dependency tag, use explain()
print('nsubj:', spacy.explain('nsubj'))
print('dobj:', spacy.explain('dobj'))
print('pobj:', spacy.explain('pobj'))

nsubj: nominal subject
dobj: direct object
pobj: object of preposition


In [9]:
# All tags of the second sentence of our document (index 1)
for word in list(parsed_document.sents)[1]:
    print(word, word.pos_)

Overall ADV
, PUNCT
the DET
rooms NOUN
were VERB
a DET
bit NOUN
small ADJ
but CCONJ
nice ADJ
. PUNCT
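
The (token, tag) pairs printed above can be turned into simple text features. The following sketch, which uses only the standard library and the pairs copied from that output, counts the tags and keeps the content words. The choice of NOUN, ADJ and VERB as "content" tags is an assumption for illustration.

```python
from collections import Counter

# (token, coarse POS tag) pairs copied from the output above
tagged = [("Overall", "ADV"), (",", "PUNCT"), ("the", "DET"), ("rooms", "NOUN"),
          ("were", "VERB"), ("a", "DET"), ("bit", "NOUN"), ("small", "ADJ"),
          ("but", "CCONJ"), ("nice", "ADJ"), (".", "PUNCT")]

# Count how often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)

# Keep only content words by filtering on the tag (assumed tag set)
content_tags = {"NOUN", "ADJ", "VERB"}
content_words = [tok for tok, tag in tagged if tag in content_tags]

print(tag_counts)
print(content_words)
```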


In [10]:
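# Attributes without a trailing underscore are integer IDs; the underscore versions are the readable strings
for word in list(parsed_document.sents)[1]:
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_, word.pos, word.pos_)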

Overall 8578797347073582537 overall 164681854541413346 RB 86 ADV
, 2593208677638477497 , 2593208677638477497 , 97 PUNCT
the 7425985699627899538 the 15267657372422890137 DT 90 DET
rooms 14629044596807299988 room 783433942507015291 NNS 92 NOUN
were 10382539506755952630 be 17109001835818727656 VBD 100 VERB
a 11901859001352538922 a 15267657372422890137 DT 90 DET
bit 1794436035373472204 bit 15308085513773655218 NN 92 NOUN
small 16938367552274787525 small 10554686591937588953 JJ 84 ADJ
but 14560795576765492085 but 17571114184892886314 CC 89 CCONJ
nice 14121509715367036122 nice 10554686591937588953 JJ 84 ADJ
. 12646065887601541794 . 12646065887601541794 . 97 PUNCT


## unigram: An n-gram consisting of a single item from a sequence

An n-gram is a contiguous sequence of n words taken from a text, so when:
* n = 1 it is a unigram
* n = 2 it is a bigram
* n = 3 it is a trigram, and so on
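
As a small illustration of the definition above, n-grams can be produced by sliding a window of size n over a token list. This is a sketch with a hypothetical helper named `ngrams`; it is not part of spaCy.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "rooms", "were", "nice"]
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams
```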

In [11]:
from collections import Counter

# Define some parameters
# Note: 'PROP' matches none of spaCy's POS tags (proper nouns are tagged 'PROPN'),
# so as written this filter removes nothing
noisy_pos_tags = ['PROP']
min_token_length = 2

# Function to check if the token is noise or not
def isNoise(token):
    if token.pos_ in noisy_pos_tags:
        return True
    if token.is_stop:
        return True
    # token.string includes trailing whitespace, which counts toward the length
    return len(token.string) <= min_token_length

def cleanup(token, lower = True):
    if lower:
        token = token.lower()
    return token.strip()

# Top unigrams used in the reviews
cleaned_list = [cleanup(word.string) for word in parsed_document if not isNoise(word)]
Counter(cleaned_list).most_common(5)

[('hotel', 685),
 ('room', 653),
 ('great', 300),
 ('sheraton', 286),
 ('location', 272)]

## Entity Detection

spaCy can identify entity phrases in the document (such as persons, locations, organizations, dates, numerals, 
etc.), which are accessed through the .ents property.

In [12]:
# Find all the types of named entities in our document
labels = set([w.label_ for w in parsed_document.ents]) 
print('labels ', len(labels), list(labels), '\n')
for label in labels: 
    entities = [cleanup(e.string, lower = False) for e in parsed_document.ents if label == e.label_] 
    entities = list(set(entities)) 
    print(label, len(entities), entities, '\n')

labels  18 ['NORP', 'LANGUAGE', 'PRODUCT', 'GPE', 'FAC', 'PERCENT', 'CARDINAL', 'QUANTITY', 'ORG', 'EVENT', 'DATE', 'WORK_OF_ART', 'MONEY', 'PERSON', 'TIME', 'ORDINAL', 'LOC', 'LAW'] 

NORP 20 ['European', 'Catholic', 'Hispanic', 'Armani', 'Finish', '65F.We', 'Christian', 'Yahoo', 'Irish', 'French', 'Hynes', 'American', 'Americans', 'Coffee', 'Donuts', 'T.', 'Amercian', 'Brazilian', 'Starbucks', 'Italian'] 

LANGUAGE 1 ['English'] 

PRODUCT 9 ['Highly', 'the Rodeo Dr', 'Cash', 'the USS Constitution', 'The Atlantis Fitness', 'Copley', 'Suburban', "5'1", 'Speakman'] 

GPE 64 ['St Charles River', 'North of Boylston', 'Fenway', 'Logan', 'Sweetsleeper', 'Nine West', 'The North Wing', 'Expedia', 'the Ramada Hong Kong', 'T.P', 'Hines', 'California', 'NYC', 'Sheratons', 'Florentines', 'Britain', 'Cambridge', 'Lexington', 'Letdown', 'Nice', 'Lenox', 'P.S.', 'Honeymoon', 'London', 'North', 'Buffalo', 'View', 'Maine', 'Hong Kong', 'Ireland', 'Sprint', 'Sheraton', 'the United States', 'Portland', 

ORG 261 ['Ritz', 'the Club Level Express Elevator', 'Wonderful Hotel and Staff', 'Intercontinental', 'Freedom Trail', 'Floor of Sheraton Hotel', 'HVAC', 'SPG Points', 'the Prudential Shopping Centre', 'the Sheraton Boston', 'Suite', 'Prudential Centre', 'Club Floor', 'Big Pool', 'Sweet Sleeper', 'GO', 'Great Boston Hotel', 'a Holiday Inn', 'the Caesar Salad', 'the Hynes Conv', 'Trailfinders', 'Lamps', 'Sheraton Sleeper Bed', 'Great Hotel/Location', 'Copley', 'the Prudential Center', 'Club-lounge', 'Warm', 'Hilton', 'Better Options', 'Great Location', 'Red Sox', 'the Sheraton/Boston', 'MFA', 'MBTA', 'TERRIBLE', 'Virgin', 'Hotel and City', 'Hynes Conv', 'Complimentary USA', 'the Christian Science', 'Location, Location, Location', 'Gold SPG', 'Plate', 'The Lobster Ravioli', 'the Boston Sheraton', 'the Hynes Convention Center', 'the Democratic Convention', 'Expeirence', 'Hotel', 'Atlantic Fish Co.', 'the Hynes Convention', 'Government', 'Starwood (Shertaon Hotel Chain', "Sheraton Promise'"

DATE 216 ['the 29th', 'July 26-30', 'late February', 'last april', 'Aug. 23-25', 'the 23rd', 'our 7 month', 'new year', '10 days', 'a week later', '15 year old', '5 nights', 'December', 'our weekend', 'ten consecutive days', '10th March 2005', '21 year old', 'the 2nd day', '2 day', 'Patriots Day weekend', 'a working day', 'many years', 'Dec 1999', 'Dec 2001', 'November 21-25', 'the 28th', 'that day', '2665', 'June 2006', 'early December', 'a week', 'every day', 'a two week holiday', "45' later", 'the year 2008', '1965', '4 days', '2007', '3 nights', 'all weekend', 'last weekend', 'the 8th', 'one day', 'maybe day', 'the day', 'August', 'next year', 'two year old', 'mid September', 'the end of the day', 'June 13-17', 'at least a day', 'more than a month ago', 'this year - first', 'the next day', '1002', 'a convention week', 'this year', 'the final day', 'last week', '6 month', '17th', '2 nights', 'the last few years', 'the weekend', 'nine year old', 'the coming week', 'a day', 'Sept. 6-7

TIME 153 ['early evening', '5 nights', 'evening', '9pm', 'minutes', '2pm', 'two hours', 'late at night', 'the first night at 12:30', 'one evening', 'a couple of hours', 'more than one night', '5am', 'our last night', 'the night', 'evening happy hours', 'all night', 'less than a minute', 'each morning', '3 nights', 'the wee hours of the morning', 'three night', 'the minute', '3am', '7 minute', '5 walking minutes', '15-20 minute', 'almost 9pm at night', '30 minutes', '4:00 a.m.', '2 nights', '9 hours', 'over 30 minutes', 'two minute', '8:30', 'morning', '6 night', 'approximately two hours', '7:00 AM', 'forty minutes', '3 hours', 'about an hour', 'One morning', 'about two hours', 'the third night', 'around 2 p.m.', '1:30 am', '5 minutes', 'night', 'the next morning', 'a hour and a half', '4 more hours', 'about 25 minutes', 'zero at night', '5 am', '6 and 8 minutes', 'a.m. hours', 'a second night', '9am', '12 noon', '2 minute', 'only a few hours', '12 hours', '10 - 15 minute', '6:00 AM', '

## Dependency Parsing

The spaCy syntactic dependency parser is used for sentence boundary detection and phrase chunking. The relations can be accessed through properties such as .children, .root, .ancestors, etc.

In [13]:
# Extract all review sentences that contain the term - hotel
hotel = [sent for sent in parsed_document.sents if 'hotel' in sent.string.lower()]

print(hotel[2])

# Create dependency tree
sentence = hotel[2]
for word in sentence:
    print(word, ': ', str(list(word.children)))

A cab from the airport to the hotel can be cheaper than the shuttles depending what time of the day you go.
A :  []
cab :  [A, from]
from :  [airport, to]
the :  []
airport :  [the]
to :  [hotel]
the :  []
hotel :  [the]
can :  []
be :  [cab, can, cheaper, .]
cheaper :  [than]
than :  [shuttles]
the :  []
shuttles :  [the, depending]
depending :  [time]
what :  []
time :  [what, of]
of :  [day]
the :  []
day :  [the, go]
you :  []
go :  [you]
. :  []


## Parsing a dependency tree - Adjectives

Parse the dependency tree of every sentence that contains the term hotel and list the adjectives attached to it. pos_words() walks the dependency tree and extracts the children of the matching token that carry the requested POS tag.

In [14]:
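# Collect the words with a given POS tag (e.g. adjectives) attached to a token
def pos_words(sentence, token, ptag):
    # `sentence` is the parsed document; keep only the sentences that mention the token
    sentences = [sent for sent in sentence.sents if token in sent.string]
    pwrds = []
    for sent in sentences:
        for word in sent:
            # Match on the `token` argument rather than a hard-coded 'hotel'
            if token in word.string:
                pwrds.extend([child.string.strip() for child in word.children
                              if child.pos_ == ptag])
    return Counter(pwrds).most_common(10)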

pos_words(parsed_document, 'hotel', 'ADJ')

[('other', 20),
 ('great', 10),
 ('nice', 7),
 ('good', 7),
 ('better', 6),
 ('Nice', 5),
 ('different', 5),
 ('many', 5),
 ('best', 4),
 ('wonderful', 3)]

## Parsing a dependency tree - Nouns

Parse the dependency tree for noun phrases.

In [15]:
# Generate Noun Phrases 
doc = nlp_parser(u'I love data science on analytics lawrence') 
for np in doc.noun_chunks:
    print(np.text, np.root.dep_, np.root.head.text)

I nsubj love
data science dobj love
analytics lawrence pobj on


In [16]:
# To get the meaning of a dependency tag, use explain()
print('nsubj:', spacy.explain('nsubj'))
print('dobj:', spacy.explain('dobj'))
print('pobj:', spacy.explain('pobj'))

nsubj: nominal subject
dobj: direct object
pobj: object of preposition


## Word to Vectors Integration

spaCy uses [GloVe](https://nlp.stanford.edu/projects/glove/) vectors for its word representations.  GloVe is an unsupervised learning algorithm for obtaining vector representations of words.

The small en_core_web_sm model loaded above doesn't have word vectors, but [en_core_web_md does](https://spacy.io/models/en#en_core_web_md).

[Model size indicator](https://spacy.io/models#conventions): sm, md or lg.
For example, en_core_web_sm is a small English model trained on written web text (blogs, news, comments) that includes vocabulary, syntax and entities.  While the sm models don't contain actual word vectors, they take advantage of the shared context-sensitive token vectors used by the tagger, parser and NER. This means that you can still use the similarity() methods, even without word vectors.
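
Similarity between word vectors is commonly measured with cosine similarity, which is also what the next cell computes with NumPy. As a minimal sketch of the formula, the example below uses only the standard library; the three-dimensional vectors are invented for illustration and are not real spaCy vectors.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy vectors (assumed values); real word vectors have hundreds of dimensions
apple = [0.9, 0.6, 0.1]
fruit = [1.0, 0.5, 0.0]
car = [0.0, 0.1, 1.0]

print(cosine_similarity(apple, fruit))
print(cosine_similarity(apple, car))
```

Vectors pointing in similar directions score close to 1, while unrelated directions score close to 0, regardless of vector length.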

In [17]:
# !python -m spacy download en_core_web_md

In [22]:
from numpy import dot 
from numpy.linalg import norm 
from spacy.lang.en import English

# Load the medium English model
nlp = spacy.load('en_core_web_md')
print('pipeline:', nlp_parser.pipeline)

# Generate word vector of the word - apple  
apple = nlp.vocab[u'apple']

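# Cosine similarity between two vectors
def cosine(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))

# Candidate lexemes: lowercase vocabulary entries that have a vector, excluding "apple" itself
others = [w for w in nlp.vocab if w.has_vector and w.orth_.islower() and w.lower_ != 'apple']
# Caution: printing the full list can exceed Jupyter's IOPub data rate limit
print('others:', others)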

# Sort by similarity score
others.sort(key = lambda w: cosine(w.vector, apple.vector)) 
others.reverse()

print('top most similar words to apple:')
for word in others[:20]:
    print(word.orth_)

pipeline: [('tagger', <spacy.pipeline.pipes.Tagger object at 0x000001F18B6FFCF8>), ('sentencizer', <spacy.pipeline.pipes.Sentencizer object at 0x000001F18D0C3F60>), ('parser', <spacy.pipeline.pipes.DependencyParser object at 0x000001F18CE96EE8>), ('ner', <spacy.pipeline.pipes.EntityRecognizer object at 0x000001F18CE96F48>)]
others: [output omitted: printing the full list exceeded the notebook's IOPub data rate limit]

top most similar words to apple:
blackberry
honeycrisp
pears
prunes
apples
crabapples
3g/3gs
3gs
iphone4
iphone
fig
fruit
frutti
strawberry
icecream
popsicle
stawberry
shortcake
creamsicle
ipad/iphone


## Create a sklearn pipeline with components: cleaner, tokenizer, vectorizer, classifier

For tokenizer and vectorizer we build our own custom modules using spacy

In [23]:
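# ENGLISH_STOP_WORDS is exposed by sklearn.feature_extraction.text (the stop_words module is no longer importable in newer releases)
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as stopwords 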
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.metrics import accuracy_score 
from sklearn.base import TransformerMixin 
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Custom transformer using spaCy 
class predictors(TransformerMixin):
    def transform(self, X, **transform_params):
        return [clean_text(text) for text in X]
    def fit(self, X, y = None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

# Basic utility function to clean the text 
def clean_text(text):     
    return text.strip().lower()

## Create a custom tokenizer function using spacy parser and some basic cleaning

The text features can be replaced with word vectors (especially beneficial in deep learning models)

In [24]:
import string
punctuations = string.punctuation

parser = English()

# Create spacy tokenizer that parses a sentence and generates tokens.  These can also be replaced by word vectors 
def spacy_tokenizer(sentence):
    mytokens = nlp_parser(sentence)
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != '-PRON-' else word.lower_ for word in mytokens]
    mytokens = [word for word in mytokens if word not in stopwords and word not in punctuations]
    return mytokens

# Create a vectorizer object to generate feature vectors, using our custom spaCy tokenizer
vectorizer = CountVectorizer(tokenizer = spacy_tokenizer, ngram_range = (1,1))
classifier = LinearSVC()
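
To show what a count-based vectorizer produces from a tokenizer, here is a simplified pure-Python sketch. It is not scikit-learn's implementation; `simple_tokenizer` is a stand-in for spacy_tokenizer and `count_vectorize` is a hypothetical helper, both assumptions made for this example.

```python
from collections import Counter

def simple_tokenizer(sentence):
    """Stand-in for spacy_tokenizer: lowercase and split on whitespace (assumption)."""
    return sentence.lower().split()

def count_vectorize(docs, tokenizer):
    """Return (vocabulary, rows), where each row holds per-term counts for one document."""
    tokenized = [tokenizer(d) for d in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocab, matrix = count_vectorize(["good beer good view", "horrible view"], simple_tokenizer)
print(vocab)
print(matrix)
```

Each document becomes one row of term counts over a shared vocabulary, which is the kind of feature matrix the classifier is trained on.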

## Create the pipeline, load the data (sample here), and run the classifier model

In [25]:
pipe = Pipeline([("cleaner", predictors()),
                 ('vectorizer', vectorizer),
                 ('classifier', classifier)])

# Load sample data
train = [('I love this sandwich.', 'pos'),          
         ('this is an amazing place!', 'pos'),
         ('I feel very good about these beers.', 'pos'),
         ('this is my best work.', 'pos'),
         ("what an awesome view", 'pos'),
         ('I do not like this restaurant', 'neg'),
         ('I am tired of this stuff.', 'neg'),
         ("I can't deal with this", 'neg'),
         ('he is my sworn enemy!', 'neg'),          
         ('my boss is horrible.', 'neg')] 
test =   [('the beer was good.', 'pos'),     
         ('I do not enjoy my job', 'neg'),
         ("I ain't feelin dandy today.", 'neg'),
         ("I feel amazing!", 'pos'),
         ('Gary is a good friend of mine.', 'pos'),
         ("I can't believe I'm doing this.", 'neg')]

# Create model and measure accuracy
pipe.fit([x[0] for x in train], [x[1] for x in train]) 

# pos or neg
pred_data = pipe.predict([x[0] for x in test])

for (sample, pred) in zip(test, pred_data):
    print (sample, pred)

print ('LinearSVC | CountVectorizer Accuracy:', accuracy_score([x[1] for x in test], pred_data))

('the beer was good.', 'pos') pos
('I do not enjoy my job', 'neg') neg
("I ain't feelin dandy today.", 'neg') neg
('I feel amazing!', 'pos') pos
('Gary is a good friend of mine.', 'pos') pos
("I can't believe I'm doing this.", 'neg') neg
LinearSVC | CountVectorizer Accuracy: 1.0
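
The accuracy reported above is the fraction of test sentences whose predicted label equals the gold label. A minimal sketch of that arithmetic, with labels copied from the output above and a hypothetical helper named `accuracy`:

```python
# Gold labels and predictions for the six test sentences, copied from the output above
gold = ["pos", "neg", "neg", "pos", "pos", "neg"]
pred = ["pos", "neg", "neg", "pos", "pos", "neg"]

def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the gold label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

print(accuracy(gold, pred))  # 1.0, matching the score reported above
```

With only six test sentences, a perfect score says little about how the model would behave on unseen reviews.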


In [26]:
# Another random review
pipe.predict(['This was a horrible movie'])

array(['neg'], dtype='<U3')

In [28]:
example = ['I do enjoy my job', 'What a poor product!  I will have to get a new one', 'I feel amazing']
pipe.predict(example)

array(['neg', 'neg', 'pos'], dtype='<U3')

## Text Classification With Machine Learning and SpaCy - Using TF-IDF

In [30]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Using Tfidf
tfvectorizer = TfidfVectorizer(tokenizer = spacy_tokenizer)

classifier_rf = RandomForestClassifier(n_jobs = -1, max_depth = 6, n_estimators = 10)
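
TF-IDF down-weights terms that appear in many documents. The sketch below computes it in pure Python, assuming the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which scikit-learn documents as the TfidfVectorizer default. The helper name `tfidf` and the whitespace tokenization are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2-normalized rows (simplified sketch)."""
    tokenized = [d.lower().split() for d in docs]
    vocabulary = sorted({t for toks in tokenized for t in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(t for toks in tokenized for t in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, idf, rows

vocab, idf, rows = tfidf(["great hotel", "horrible hotel"])
print(vocab)
print(idf)
print(rows)
```

Here "hotel" occurs in both documents and so receives the lowest idf, while "great" and "horrible" are weighted more heavily.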

In [31]:
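# Using TF-IDF
# Note: this pipeline reuses `classifier` (the LinearSVC defined earlier);
# classifier_rf is defined above but is not used here
pipe_tfid = Pipeline([('cleaner', predictors()),
                      ('vectorizer', tfvectorizer),
                      ('classifier', classifier)])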

In [32]:
# Create model and measure accuracy
pipe_tfid.fit([x[0] for x in train], [x[1] for x in train]) 

# pos or neg
pred_data = pipe_tfid.predict([x[0] for x in test])

for (sample, pred) in zip(test, pred_data):
    print (sample, pred)

print ('Random Forest | TfidfVectorizer Accuracy:', accuracy_score([x[1] for x in test], pred_data))

('the beer was good.', 'pos') pos
('I do not enjoy my job', 'neg') neg
("I ain't feelin dandy today.", 'neg') neg
('I feel amazing!', 'pos') pos
('Gary is a good friend of mine.', 'pos') pos
("I can't believe I'm doing this.", 'neg') neg
Random Forest | TfidfVectorizer Accuracy: 1.0


In [33]:
# Another random review
pipe_tfid.predict(['This was a great movie'])

array(['neg'], dtype='<U3')

In [37]:
example = ['I do enjoy my job', 'What a poor product!  I will have to get a new one', 'I feel amazing']
pipe_tfid.predict(example)

array(['neg', 'neg', 'pos'], dtype='<U3')

In [35]:
pipe_tfid.predict(['Gary is a good friend of mine.'])

array(['pos'], dtype='<U3')

In [None]:
# randomizedsearchCV for hyperparameter tuning