<a href="https://colab.research.google.com/github/paulgureghian/SpaCy/blob/master/SpaCy1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Created in April 2019 by Paul A. Gureghian.**

### **This scientific notebook contains Python code for text analysis and text classification using the 'SpaCy' package.**

### **The first section will focus on 'word tokenization'.**

In [0]:
### Import libraries
import spacy
import string
import pandas as pd
import en_core_web_sm
from spacy import displacy
from sklearn import metrics
from spacy.lang.en import English
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin
from spacy.lang.en.stop_words import STOP_WORDS
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [0]:
### Load English tokenizer
nlp = English()

In [0]:
### Text to process
text = """When learning data science, you shoulndn't get discouraged!
        Challenges and setbacks aren't failure, they're just part of the journey. You've got this!"""

In [179]:
### Process the text 
doc = nlp(text)
print(doc)

When learning data science, you shoulndn't get discouraged!
        Challenges and setbacks aren't failure, they're just part of the journey. You've got this!


In [180]:
### Create a list of word tokens
token_list = []
for token in doc:
    token_list.append(token.text)

print(token_list)    

['When', 'learning', 'data', 'science', ',', 'you', "shoulndn't", 'get', 'discouraged', '!', '\n        ', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failure', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']
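
The token list above includes punctuation marks and one whitespace-only token. As a rough sketch (plain Python, not spaCy's own logic), the snippet below filters those out of the printed list. The helper name `is_word` is hypothetical; in spaCy itself, tokens expose the `is_punct` and `is_space` attributes for the same purpose.

```python
import string

# Token list copied from the cell output above.
token_list = ['When', 'learning', 'data', 'science', ',', 'you', "shoulndn't",
              'get', 'discouraged', '!', '\n        ', 'Challenges', 'and',
              'setbacks', 'are', "n't", 'failure', ',', 'they', "'re", 'just',
              'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']


def is_word(token):
    """Return True unless the token is only whitespace or only punctuation."""
    stripped = token.strip()
    if not stripped:
        return False
    return not all(ch in string.punctuation for ch in stripped)


# Keep only tokens that carry at least one non-punctuation character.
word_tokens = [t for t in token_list if is_word(t)]
print(len(token_list), "tokens in total,", len(word_tokens), "word tokens")
```

Contractions such as `n't` and `'re` are kept because they contain letters.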


### **This section will focus on 'sentence tokenization'.**

In [0]:
### Create the pipeline 'sentencizer' component
sbd = nlp.create_pipe('sentencizer') 

In [0]:
### Add the component to the pipeline
nlp.add_pipe(sbd)

In [183]:
### Process the text
doc = nlp(text)
print(doc)

When learning data science, you shoulndn't get discouraged!
        Challenges and setbacks aren't failure, they're just part of the journey. You've got this!


In [184]:
### Create list of sentence tokens
sentence_list = []
for sentence in doc.sents:
    sentence_list.append(sentence.text)
    
print(sentence_list)    

["When learning data science, you shoulndn't get discouraged!", "\n        Challenges and setbacks aren't failure, they're just part of the journey.", "You've got this!"]
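
The sentence list above splits on sentence-final punctuation, and the second sentence keeps its leading newline and indentation. The sketch below is a simplified stand-in written with the standard `re` module, not spaCy's implementation; the function name `split_sentences` is hypothetical. It shows the same rule-based idea and trims the surrounding whitespace.

```python
import re

text = """When learning data science, you shoulndn't get discouraged!
        Challenges and setbacks aren't failure, they're just part of the journey. You've got this!"""


def split_sentences(raw):
    """Split after runs of '.', '!' or '?' and trim surrounding whitespace.

    Trailing text without closing punctuation is ignored by this simple rule.
    """
    pieces = re.findall(r"[^.!?]+[.!?]+", raw)
    return [piece.strip() for piece in pieces if piece.strip()]


sentences = split_sentences(text)
for sentence in sentences:
    print(repr(sentence))
```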


### **Cleaning Text Data: Removing Stopwords.**

In [0]:
### Analyze the default stopwords in SpaCy
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

In [186]:
### Print total number of stopwords
print("Number of stop words: %d" % len(spacy_stopwords))

Number of stop words: 305


In [187]:
### Print the first twenty stopwords
print("The first twenty stop words: %s" % list(spacy_stopwords)[:20]) 

The first twenty stop words: ['from', 'when', 'call', 'whole', 'less', 'afterwards', 'is', 'already', 'whence', 'whereafter', 'always', 'to', 'might', 'the', 'could', 'part', 'doing', 'quite', 'whose', 'many']
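
Because the stop words are stored in a Python set, the order of the preview above is arbitrary and may differ between runs. A minimal sketch, using a hypothetical miniature set rather than the full spaCy list, shows how sorting first gives a stable preview.

```python
# Hypothetical miniature stop-word set; spaCy's STOP_WORDS is also a Python set.
sample_stopwords = {"from", "when", "call", "whole", "less", "is", "to", "the"}

# Sets have no defined order, so sort before slicing to get a stable preview.
first_five = sorted(sample_stopwords)[:5]
print(first_five)
```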


In [188]:
### Remove the stop words from source text
filtered_sentence = []

for word in doc:
    if not word.is_stop:
        filtered_sentence.append(word)

print("Filtered sentence:", filtered_sentence)

Filtered sentence: [When, learning, data, science, ,, shoulndn't, discouraged, !, 
        , Challenges, setbacks, n't, failure, ,, 're, journey, ., You, 've, got, !]
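
In the filtered output above, `When` and `You` survive even though `when` and `you` are in the stop-word set, which suggests the stop-word check in this run is case-sensitive. The sketch below uses plain Python and a small illustrative subset of the stop words (an assumption, not the full list) to show how lowercasing before the membership test changes the result.

```python
# Small illustrative subset of the stop words used in this notebook.
sample_stopwords = {"when", "you", "and", "are", "they", "just",
                    "part", "of", "the", "get", "this"}

tokens = ["When", "learning", "data", "science", "you", "get",
          "discouraged", "You", "got", "this"]

# Case-sensitive test: capitalized stop words slip through.
case_sensitive = [t for t in tokens if t not in sample_stopwords]

# Case-insensitive test: compare the lowercased form instead.
case_insensitive = [t for t in tokens if t.lower() not in sample_stopwords]

print("Case-sensitive:  ", case_sensitive)
print("Case-insensitive:", case_insensitive)
```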


### **Lexicon Normalization and Lemmatization.**

In [189]:
### Implement Lemmatization
lem = nlp("run runs running runner")

for word in lem:
    print(word.text, word.lemma_)

run run
runs run
running run
runner runner
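
One simple way to think about the output above is as a table lookup from an inflected form to its base form, with unknown words returned unchanged. The sketch below is only an illustration of that idea; the dictionary `LEMMA_TABLE` and the function `lookup_lemma` are hypothetical and are not spaCy's lemmatizer.

```python
# Hypothetical lookup table covering only the words from the cell above.
LEMMA_TABLE = {
    "runs": "run",
    "running": "run",
}


def lookup_lemma(word):
    """Return the base form from the table, or the word itself if it is unknown."""
    return LEMMA_TABLE.get(word, word)


for word in "run runs running runner".split():
    print(word, lookup_lemma(word))
```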


### **Part of Speech (POS) Tagging.**

In [190]:
### Implement POS Tagging
nlp = en_core_web_sm.load()

docs = nlp(u"All is well that ends well.")

for word in docs:
    print(word.text, word.pos_)

All DET
is VERB
well ADV
that ADJ
ends VERB
well ADV
. PUNCT
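
The (word, tag) pairs printed above can be summarized with the standard library. This small sketch copies the pairs from the output and counts how often each tag occurs; it does not call spaCy.

```python
from collections import Counter

# (word, POS tag) pairs copied from the cell output above.
tagged = [("All", "DET"), ("is", "VERB"), ("well", "ADV"), ("that", "ADJ"),
          ("ends", "VERB"), ("well", "ADV"), (".", "PUNCT")]

# Count the occurrences of each tag.
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count)
```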


### **Entity Detection.**

In [191]:
### Implement Entity Detection
nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, becoming the latest national flash point over refusals to inoculate against dangerous diseases.

At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.

The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

entities = [(i, i.label_, i.label) for i in nytimes.ents]
print(entities)

[(New York City, 'GPE', 382), (Tuesday, 'DATE', 388), (At least 285, 'CARDINAL', 394), (September, 'DATE', 388), (Brooklyn, 'GPE', 382), (Williamsburg, 'GPE', 382), (four, 'CARDINAL', 394), (Zip, 'PERSON', 378), (Bill de Blasio, 'PERSON', 378), (Tuesday, 'DATE', 388), (Orthodox, 'NORP', 379), (Jews, 'NORP', 379), (6 months old, 'DATE', 388), (1,000, 'MONEY', 391)]
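
Grouping the detected entities by label makes the output above easier to inspect. The sketch below copies the (text, label) pairs from the printed list and groups them with a `defaultdict`. It also makes one model error easy to spot: "Zip" (from "Zip codes") was labeled `PERSON`.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the cell output above.
entities = [("New York City", "GPE"), ("Tuesday", "DATE"), ("At least 285", "CARDINAL"),
            ("September", "DATE"), ("Brooklyn", "GPE"), ("Williamsburg", "GPE"),
            ("four", "CARDINAL"), ("Zip", "PERSON"), ("Bill de Blasio", "PERSON"),
            ("Tuesday", "DATE"), ("Orthodox", "NORP"), ("Jews", "NORP"),
            ("6 months old", "DATE"), ("1,000", "MONEY")]

# Collect entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label, texts in by_label.items():
    print(label, texts)
```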


In [192]:
### Visualize the entities output
displacy.render(nytimes, style="ent", jupyter=True) 

### **Dependency Parsing.**

In [193]:
### Implement Dependency Parsing
docp = nlp("In pursuit of a wall, President Trump ran into one.")

for chunk in docp.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

pursuit pursuit pobj In
a wall wall pobj of
President Trump Trump nsubj ran


In [194]:
### Visualize the Dependencies
displacy.render(docp, style="dep", jupyter=True)

### **Word Vector Representation.**

In [195]:
### Implement WVR
mango = nlp(u'mango')

print(mango.vector.shape)
print(mango.vector)

(384,)
[ 1.81898862e-01 -5.30111313e-01  2.66826463e+00  6.92421854e-01
 -1.97660947e+00  3.68705654e+00 -4.39795065e+00 -9.98801291e-01
  4.40463066e-01  2.16391638e-01 -3.65440279e-01 -7.81076103e-02
 -2.61327028e-02 -2.29889131e+00 -4.02842969e-01  2.03411388e+00
 -1.13863683e+00 -2.47938871e+00 -6.85229778e-01  2.18901730e+00
  2.21150255e+00  1.11644936e+00  1.71971738e-01  4.38696742e-01
 -1.64694774e+00 -4.35405135e-01 -3.02480489e-01  8.34272265e-01
 -1.12027764e+00  7.75548697e-01 -5.96541584e-01 -1.65593314e+00
  5.41058123e-01 -3.40727717e-01 -3.47570509e-01  5.06470382e-01
  3.71737331e-01 -9.64704275e-01 -8.57091904e-01  8.52468491e-01
 -3.29184246e+00  4.53452921e+00  2.02872545e-01 -1.16221458e-01
 -1.18046367e+00  4.02978331e-01 -5.31236291e-01 -9.04556274e-01
  1.07802963e+00  3.54202926e-01 -1.02039969e+00 -1.33428836e+00
 -3.28955364e+00  6.58581913e-01 -4.01282251e-01  3.08272779e-01
  4.82804346e+00 -1.29300499e+00 -2.84544349e+00 -1.12305379e+00
 -5.03152847e-01  
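
A common use of vectors like the one printed above is to compare two of them with cosine similarity. The sketch below is a plain-Python illustration on toy vectors (assumed values, not real spaCy output); the function name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy vectors: the first pair points in the same direction, the second is orthogonal.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```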

### **Text Classification.**

**Via 'Logistic Regression Classification' using 'Scikit-Learn'.**

In [196]:
### Load data set 
df_amazon = pd.read_csv("amazon_alexa.tsv", sep="\t")

df_amazon.head()

| | rating | date | variation | verified_reviews | feedback |
|---|---|---|---|---|---|
| 0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
| 1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
| 2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
| 3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
| 4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |


In [197]:
### Print shape of dataframe
df_amazon.shape

(3150, 5)

In [198]:
### Print data information
df_amazon.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
rating              3150 non-null int64
date                3150 non-null object
variation           3150 non-null object
verified_reviews    3150 non-null object
feedback            3150 non-null int64
dtypes: int64(2), object(3)
memory usage: 123.1+ KB


In [199]:
### Feedback column count
df_amazon.feedback.value_counts()

1    2893
0     257
Name: feedback, dtype: int64
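
The counts above show a strongly imbalanced label distribution. As a short worked example, the arithmetic below computes the accuracy of a trivial baseline that always predicts the majority class (`feedback = 1`), which is a useful reference point when reading any accuracy score on this data.

```python
# Label counts taken from the value_counts() output above.
positive = 2893
negative = 257

total = positive + negative

# A classifier that always predicts the majority class is right this often.
baseline_accuracy = max(positive, negative) / total
print("Total rows:", total)
print("Majority-class baseline accuracy: %.4f" % baseline_accuracy)
```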

In [200]:
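### Tokenize the Amazon dataset
punctuations = string.punctuation
print("Punctuations:", punctuations, "\n")

# Load the small English model by its full package name
# (the "en" shortcut link is not available in newer spaCy releases)
nlp = spacy.load("en_core_web_sm")
stop_words = STOP_WORDS
print("Stop_Words:", stop_words, "\n")

parser = English() # English tokenizer

# Define a tokenizer function
def spacy_tokenizer(sentence):
    # Split the sentence into spaCy tokens
    tokens = parser(sentence)

    # Lemmatize and lowercase each token; keep the surface form of pronouns ("-PRON-")
    tokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in tokens]

    # Drop stop words and punctuation
    tokens = [word for word in tokens if word not in stop_words and word not in punctuations]

    return tokens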

Punctuations: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ 

Stop_Words: {'from', 'when', 'call', 'whole', 'less', 'afterwards', 'is', 'already', 'whence', 'whereafter', 'always', 'to', 'might', 'the', 'could', 'part', 'doing', 'quite', 'whose', 'many', 'thru', 'each', 'thereafter', 'whoever', 'thereby', 'say', 'were', 'once', 'as', 'done', 'mine', 'my', 'alone', 'latter', 'other', 'besides', 'a', 'made', 'none', 'because', 'would', 'really', 'you', 'nine', 'or', 'moreover', 'did', 'eleven', 'so', 'too', 'else', 'should', 'at', 'never', 'becomes', 'same', 'also', 'indeed', 'latterly', 'ever', 'nowhere', 'where', 'all', 'hers', 'its', 'not', 'both', 'regarding', 'only', 'whenever', 'before', 'which', 'five', 'become', 'your', 'something', 'here', 'being', 'such', 'more', 'serious', 'seem', 'whatever', 'further', 'everything', 'please', 'seemed', 'back', 'along', 'it', 'whither', 'for', 'he', 'his', 'myself', 'whether', 'nevertheless', 'herein', 'various', 'wherever', 'was', 'some', 'below', 'themse
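
One detail of the tokenizer above is that `punctuations` is a string, so `word not in punctuations` is a substring test rather than a per-character set lookup. The sketch below, on a few made-up tokens, compares the two behaviors. With the string test, empty strings (for example, whitespace tokens after `.strip()`) are dropped as a side effect, because the empty string is a substring of every string.

```python
import string

punctuations = string.punctuation          # a str, as in the notebook
punctuation_set = set(string.punctuation)  # a set of single characters

# Made-up tokens chosen to expose the difference between the two tests.
tokens = ["great", "", "()", "!", "--"]

# Substring test: drops "", "()" and "!" but keeps "--".
by_string = [t for t in tokens if t not in punctuations]

# Set test: drops only the single character "!".
by_set = [t for t in tokens if t not in punctuation_set]

print("String membership:", by_string)
print("Set membership:   ", by_set)
```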

### **Define a Custom Transformer.**

In [0]:
### Custom transformer class
class predictors(TransformerMixin):
    
    def transform(self, X, **transform_params):
      
        return [clean_text(text) for text in X]
      
    def fit(self, X, y=None, **fit_params):
      
        return self
      
    def get_params(self, deep=True):
      
        return {}
      
# Text cleaning function
def clean_text(text):
  
    return text.strip().lower()
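
To see what the transformer above does to its input, the sketch below reproduces `clean_text` and a stand-in class with the same `fit` and `transform` methods, but without the scikit-learn base class so that it runs on its own. The class name `Predictors` and the sample reviews are illustrative assumptions.

```python
def clean_text(text):
    """Strip surrounding whitespace and lowercase the text."""
    return text.strip().lower()


class Predictors:
    """Stand-in for the notebook's transformer, without the scikit-learn base class."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so calls can be chained.
        return self

    def transform(self, X):
        # Clean every document in the input.
        return [clean_text(text) for text in X]


reviews = ["  Love my Echo!  ", "Loved it!\n"]
cleaned = Predictors().fit(reviews).transform(reviews)
print(cleaned)
```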

### **Vectorization Feature Engineering (TF-IDF).**

In [0]:
### Bag of Words vectorizer (unigrams only)
bow_vector = CountVectorizer(tokenizer=spacy_tokenizer, ngram_range=(1, 1))

In [0]:
### TF-IDF vector
tfidf_vector = TfidfVectorizer(tokenizer = spacy_tokenizer) 
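
To make the TF-IDF weighting concrete, here is a small worked sketch in plain Python. It assumes the smoothed formula used by scikit-learn's defaults, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row. The toy documents and the function name `tfidf_rows` are made up for illustration, and this is not the library's implementation.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative only).
docs = [["love", "echo"], ["love", "music"], ["echo", "broke"]]


def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    rows = []
    for doc in documents:
        term_freq = Counter(doc)
        weights = {term: count * idf[term] for term, count in term_freq.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


rows = tfidf_rows(docs)
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

In the second document, the rarer term `music` receives a higher weight than `love`, which appears in two of the three documents.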

In [0]:
### Split the Amazon data into train and test sets
X = df_amazon['verified_reviews']  # Features: the review text
y_labels = df_amazon['feedback']  # Target labels: 1 = positive, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(X, y_labels, test_size = 0.3)
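
The split above is random and has no fixed seed, so the rows that land in the test set, and therefore the scores, can change from run to run. The sketch below illustrates a seeded split in plain Python; the function `split_indices` and the seed value are assumptions for illustration. In scikit-learn, `train_test_split` accepts a `random_state` argument for the same purpose and a `stratify` argument to preserve the class ratio.

```python
import random


def split_indices(n_rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and split them into train and test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 3150 rows and a 30% test share, as in the notebook; the seed 42 is arbitrary.
train_idx, test_idx = split_indices(3150, 0.3, seed=42)
print(len(train_idx), "training rows,", len(test_idx), "test rows")

# The same seed reproduces exactly the same split.
repeat_train, repeat_test = split_indices(3150, 0.3, seed=42)
print("Reproducible:", test_idx == repeat_test)
```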

### **Create a Pipeline and Generate the Model.**

In [0]:
### Instantiate the LR Classifier
logistic_regression_classifier = LogisticRegression()

In [0]:
### Create a Pipeline 
pipe = Pipeline([("cleaner", predictors()),
                 ("vectorizer", bow_vector),
                 ("classifier", logistic_regression_classifier)])

In [207]:
### Generate the model
pipe.fit(X_train, y_train)



Pipeline(memory=None,
     steps=[('cleaner', <__main__.predictors object at 0x7f4585273a90>), ('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ng...penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False))])

### **Evaluate the model.**

In [208]:
### Get predictions from the model
y_pred = pipe.predict(X_test)

### Print model accuracy, precision, and recall
print("Logistic Regression Accuracy:", metrics.accuracy_score(y_test, y_pred))
print("Logistic Regression Precision:", metrics.precision_score(y_test, y_pred))
print("Logistic Regression Recall:", metrics.recall_score(y_test, y_pred))

Logistic Regression Accuracy: 0.9481481481481482
Logistic Regression Precision: 0.9508733624454149
Logistic Regression Recall: 0.9954285714285714
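
The three scores above can be tied back to confusion-matrix counts. The counts in the sketch below are not reported in the notebook: they are inferred from the printed ratios under the assumption that the test set holds 945 rows (30% of 3150). Under that assumption the model recovers almost every positive review but only a minority of the negative ones, and its accuracy is only modestly above always predicting the majority class.

```python
# Confusion-matrix counts INFERRED from the printed scores (an assumption,
# not notebook output): 945 test rows = 30% of 3150.
true_positive = 871
false_positive = 45
false_negative = 4
true_negative = 25

total = true_positive + false_positive + false_negative + true_negative

accuracy = (true_positive + true_negative) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

# Recall on the negative class (share of negative reviews correctly identified).
negative_recall = true_negative / (true_negative + false_positive)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("Recall on the negative class:", negative_recall)
```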


### **Over the course of this notebook, I went from simple text analysis using 'SpaCy' to building an ML model using 'Scikit-Learn'.**