# Natural Language Processing

In this exercise we will apply a variety of feature extraction methods to a news article dataset and use several classifiers to predict each article's category.

We will first use classical feature extraction methods with Naive Bayes, then move on to the more recent approach of word embeddings with a simple linear SVM model.

In [1]:
# Install spaCy and download the medium English model (comment these lines out once installed)
! pip install spacy
! python -m spacy download en_core_web_md

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_md')


In [0]:
import sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, HashingVectorizer
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.metrics import classification_report, f1_score, accuracy_score
from sklearn.svm import LinearSVC
import numpy as np
import spacy

In [0]:
data = fetch_20newsgroups(subset="all")

In [4]:
print(data.DESCR)

.. _20newsgroups_dataset:

The 20 newsgroups text dataset
------------------------------

The 20 newsgroups dataset comprises around 18000 newsgroups posts on
20 topics split in two subsets: one for training (or development)
and the other one for testing (or for performance evaluation). The split
between the train and test set is based upon a messages posted before
and after a specific date.

This module contains two loaders. The first one,
:func:`sklearn.datasets.fetch_20newsgroups`,
returns a list of the raw texts that can be fed to text feature
extractors such as :class:`sklearn.feature_extraction.text.CountVectorizer`
with custom parameters so as to extract feature vectors.
The second one, :func:`sklearn.datasets.fetch_20newsgroups_vectorized`,
returns ready-to-use features, i.e., it is not necessary to use a feature
extractor.

**Data Set Characteristics:**

    Classes                     20
    Samples total            18846
    Dimensionality               1
    Features       

In [5]:
text = data["data"]
target = data["target"]
print("The following are the 20 topics that an article can belong to:")
print(data["target_names"])

The following are the 20 topics that an article can belong to:
['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [0]:
X_train, X_test, y_train, y_test = train_test_split(text, target, random_state=0)

In [7]:
print(f"The training dataset contains {len(X_train)} articles.")
print(f"The test dataset contains {len(X_test)} articles.")

The training dataset contains 14134 articles.
The test dataset contains 4712 articles.


scikit-learn implements the bag-of-words (BoW) feature representation using [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html), and it also has implementations for [TF-IDF](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer) and [hashed vector](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) representations.

Determine the feature representations of our dataset using each of those approaches.
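
Before running on the full dataset, it can help to see what a BoW representation with n-grams looks like on a tiny corpus. The sketch below is only an illustration: `toy_docs`, `toy_counter`, and `toy_bow` are made-up names, and it uses up to bigrams rather than trigrams to keep the vocabulary small.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents (hypothetical, not part of the 20 newsgroups data)
toy_docs = ["the quick brown fox", "the lazy brown dog"]

# Remove English stop words and count unigrams and bigrams
toy_counter = CountVectorizer(stop_words="english", ngram_range=(1, 2))
toy_bow = toy_counter.fit_transform(toy_docs)

# vocabulary_ maps each n-gram to its column index in the count matrix
for ngram, column in sorted(toy_counter.vocabulary_.items()):
    print(f"{column}: {ngram}")

# One row per document, one column per n-gram
print(toy_bow.toarray())
```

Note that "the" is dropped as a stop word before the n-grams are formed, so a bigram such as "quick brown" appears in the vocabulary while "the quick" does not.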

In [78]:
%%time
# Use English stopwords and produce a BoW representation for the data using up to trigrams
# Save the vectorizer as counter and the transformed data as X_train_bow, and X_test_bow
# YOUR CODE HERE
counter = CountVectorizer(stop_words = "english", ngram_range = (1, 3))
counter.fit(X_train, y_train)

X_train_bow = counter.transform(X_train)
X_test_bow = counter.transform(X_test)

CPU times: user 36.2 s, sys: 1.13 s, total: 37.3 s
Wall time: 37.4 s


In [0]:
assert counter
assert counter.stop_words == "english"
assert counter.ngram_range == (1,3)
assert len(counter.vocabulary_) == 3034327
assert X_train_bow.shape == (14134, 3034327)
assert X_test_bow.shape == (4712, 3034327)

Note that sklearn implements a [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html). The main difference between the two is in the inputs to fitting and transforming. The [Vectorizer's fit/transform](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#sklearn.feature_extraction.text.TfidfVectorizer.fit) take an input of text whereas the [transformer's](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer.fit) take an input of a BoW vector. Given that we already determined the BoW vectors, it would be more time efficient to use TfidfTransformer.
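
The relationship between the two classes can be checked on a small example. The following is a minimal sketch with made-up documents (`toy_train` and `toy_test` are hypothetical names); it shows that counting first and then applying `TfidfTransformer` gives the same matrix as `TfidfVectorizer` on the raw text, and that held-out text is transformed with the already fitted objects.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up documents for illustration only
toy_train = ["the cat sat on the mat", "the dog chased the cat", "dogs and cats"]
toy_test = ["the cat chased the dog"]

# Two-step route: raw text -> counts -> TF-IDF
toy_count_vec = CountVectorizer()
toy_counts = toy_count_vec.fit_transform(toy_train)
toy_transformer = TfidfTransformer()
two_step = toy_transformer.fit_transform(toy_counts)

# One-step route: raw text -> TF-IDF
one_step = TfidfVectorizer().fit_transform(toy_train)

# Both routes should agree when default parameters are used
same = np.allclose(two_step.toarray(), one_step.toarray())
print(same)

# Held-out text reuses the fitted vocabulary and IDF weights (transform, not fit_transform)
toy_test_tfidf = toy_transformer.transform(toy_count_vec.transform(toy_test))
print(toy_test_tfidf.shape)
```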

In [80]:
%%time
# Use the BoW representation you just created above to produce a TFIDF representation of the data
# Save the transformer to tfidfer and the transformed data as X_train_tfidf, and X_test_tfidf
tfidfer = TfidfTransformer()
# Fit the IDF weights on the training counts only, then reuse them for the test counts
X_train_tfidf = tfidfer.fit_transform(X_train_bow)
X_test_tfidf = tfidfer.transform(X_test_bow)


CPU times: user 3.46 s, sys: 11.9 ms, total: 3.48 s
Wall time: 3.48 s


In [0]:
assert tfidfer
assert X_train_tfidf.shape  == (14134, 3034327)
assert X_test_tfidf.shape  == (4712, 3034327)

Now use the hashing vectorizer to do the same.

In [85]:
%%time 
# Use English stopwords and produce a Hashed vector representation for the data using up to trigrams
# Save the vectorizer as hasher and the transformed data as X_train_hash, and X_test_hash
# Make sure you set alternate_sign to False so we can use this representation with Multinomial Naive Bayes
# HashingVectorizer is stateless, so no fitting is needed before calling transform
hasher = HashingVectorizer(stop_words="english", ngram_range=(1, 3), alternate_sign=False)
X_train_hash = hasher.transform(X_train)
X_test_hash = hasher.transform(X_test)



CPU times: user 7.07 s, sys: 2.94 ms, total: 7.08 s
Wall time: 7.09 s


In [0]:
assert hasher
assert hasher.stop_words == "english"
assert hasher.ngram_range == (1,3)
assert X_train_hash.shape == (14134, 1048576)
assert X_test_hash.shape == (4712, 1048576)

Compare the time it took to run the count vectorizer (about 37 s) with the time for the hashing vectorizer (about 7 s), even though both iterate through all the words.
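
One plausible reason for the difference is that the hashing vectorizer never builds or stores a vocabulary: each token is hashed straight to a column index. The sketch below, using made-up sentences and the hypothetical names `toy_hasher` and `toy_hashed`, illustrates this on a small scale.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# alternate_sign=False keeps all values non-negative, as Multinomial Naive Bayes requires
toy_hasher = HashingVectorizer(alternate_sign=False)

# No fit step: the hash function alone decides each token's column
toy_hashed = toy_hasher.transform(["the quick brown fox", "the lazy dog"])

# The number of columns is fixed in advance (2**20 by default), whatever the text contains
print(toy_hashed.shape)  # (2, 1048576)

# Unlike CountVectorizer, there is no learned vocabulary to inspect
print(hasattr(toy_hasher, "vocabulary_"))  # False
```

The trade-off is that hashed columns cannot be mapped back to the n-grams that produced them, and distinct n-grams may collide in the same column.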

Now recall [Naive Bayes Classification](http://scikit-learn.org/stable/modules/naive_bayes.html) which we discussed early on in the supervised learning lectures. We will use Naive Bayes classifiers to predict the topic of the articles and compare our feature representations. Use a Multinomial Naive Bayes classifier to predict the topics.
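
As a reminder of how Multinomial Naive Bayes behaves on count features, here is a small sketch on a made-up four-document corpus. The texts, labels, and the name `toy_model` are hypothetical and only meant to illustrate the fit/predict workflow, not to stand in for the exercise solution.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data with two topics
toy_texts = [
    "the team won the game",
    "the player scored a goal",
    "the rocket reached orbit",
    "the satellite orbits earth",
]
toy_labels = ["sport", "sport", "space", "space"]

# Chain the BoW step and the classifier so raw text can be passed directly
toy_model = make_pipeline(CountVectorizer(), MultinomialNB())
toy_model.fit(toy_texts, toy_labels)

# Words seen only in one class pull the prediction towards that class
toy_pred = toy_model.predict(["rocket orbit", "team game"])
print(toy_pred)
```

With the default Laplace smoothing (`alpha=1.0`), words never seen in a class still get a small non-zero probability, so unseen combinations do not zero out the class score.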

In [89]:
feature_sets = zip(
    ["Bag of Words", "TF-IDF", "Hashing"],
    [X_train_bow, X_train_tfidf, X_train_hash],
    [X_test_bow, X_test_tfidf, X_test_hash],
)
for feat_name, train_feat, test_feat in feature_sets:
    # Create a Multinomial Naive Bayes model saved to `mnb` and fit it to train_feat
    # YOUR CODE HERE
    mnb = MultinomialNB()
    mnb.fit(train_feat, y_train)
    y_pred = mnb.predict(test_feat)
    print(f"Results for {feat_name}")
    print("-"*80)
    print(classification_report(y_test, y_pred))
    print("-"*80)


Results for Bag of Words
--------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.91      0.94      0.92       205
           1       0.78      0.87      0.82       245
           2       0.92      0.76      0.83       250
           3       0.77      0.83      0.80       243
           4       0.89      0.85      0.87       255
           5       0.84      0.91      0.88       240
           6       0.90      0.75      0.82       249
           7       0.89      0.90      0.89       219
           8       0.96      0.91      0.94       246
           9       0.92      0.97      0.94       227
          10       0.96      0.98      0.97       287
          11       0.88      0.97      0.92       234
          12       0.93      0.82      0.87       247
          13       0.93      0.92      0.93       250
          14       0.90      0.96      0.93       240
          15       0.93      

In [0]:
assert isinstance(mnb, MultinomialNB)

## Learned Embeddings

We will use [spaCy](https://spacy.io/) for more sophisticated NLP. Make sure you have downloaded the English model using the install cell at the top of the notebook before proceeding. It may take some time to download.

spaCy parses text and automatically performs the following:
- tokenization
- lemmatization
- sentence splitting
- entity recognition
- token vector representation


In [26]:
%%time
nlp = spacy.load("en_core_web_md")

CPU times: user 19.1 s, sys: 830 ms, total: 19.9 s
Wall time: 20.1 s


In [0]:
text = "This is the first sentence in this test string. The quick brown fox jumps over the lazy dog."

parsed_text = nlp(text)

In [28]:
for sent in parsed_text.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

Analyzing sentence: This is the first sentence in this test string.
Lemmatization: this be the first sentence in this test string .
Analyzing token: This
This token is the first one in the sentence
Stop word
Entity type: 
Part of speech: DET
Lemma: this
----------
Analyzing token: is
Stop word
Entity type: 
Part of speech: VERB
Lemma: be
----------
Analyzing token: the
Stop word
Entity type: 
Part of speech: DET
Lemma: the
----------
Analyzing token: first
Stop word
Entity type: ORDINAL
Part of speech: ADJ
Lemma: first
----------
Analyzing token: sentence
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: sentence
----------
Analyzing token: in
Stop word
Entity type: 
Part of speech: ADP
Lemma: in
----------
Analyzing token: this
Stop word
Entity type: 
Part of speech: DET
Lemma: this
----------
Analyzing token: test
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: test
----------
Analyzing token: string
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: string
-------

In [0]:
### Come up with a couple sentences to test out and set the text to my_text
### You can go to your favorite website or news source and copy a paragraph from there

my_text = "An apple is a sweet, edible fruit produced by an apple tree (Malus domestica). Apple trees are cultivated worldwide and are the most widely grown species in the genus Malus. The tree originated in Central Asia, where its wild ancestor, Malus sieversii, is still found today. Apples have been grown for thousands of years in Asia and Europe and were brought to North America by European colonists. Apples have religious and mythological significance in many cultures, including Norse, Greek and European Christian traditions."

In [0]:
assert len(my_text) > 10
assert my_text.count(".") > 2

In [31]:
parsed = nlp(my_text)
for sent in parsed.sents:
    print(f"Analyzing sentence: {sent}")
    print(f"Lemmatization: {sent.lemma_}")
    for token in sent:
        print(f"Analyzing token: {token}")
        if token.is_sent_start:
            print("This token is the first one in the sentence")
        if token.is_stop:
            print("Stop word")
        else:
            print("Not stop word")
        print(f"Entity type: {token.ent_type_}")
        print(f"Part of speech: {token.pos_}")
        print(f"Lemma: {token.lemma_}")
        print("-"*10)
    print("-"*50)

Analyzing sentence: An apple is a sweet, edible fruit produced by an apple tree (Malus domestica).
Lemmatization: an apple be a sweet , edible fruit produce by an apple tree ( Malus domestica ) .
Analyzing token: An
This token is the first one in the sentence
Stop word
Entity type: 
Part of speech: DET
Lemma: an
----------
Analyzing token: apple
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: apple
----------
Analyzing token: is
Stop word
Entity type: 
Part of speech: VERB
Lemma: be
----------
Analyzing token: a
Stop word
Entity type: 
Part of speech: DET
Lemma: a
----------
Analyzing token: sweet
Not stop word
Entity type: 
Part of speech: ADJ
Lemma: sweet
----------
Analyzing token: ,
Not stop word
Entity type: 
Part of speech: PUNCT
Lemma: ,
----------
Analyzing token: edible
Not stop word
Entity type: 
Part of speech: ADJ
Lemma: edible
----------
Analyzing token: fruit
Not stop word
Entity type: 
Part of speech: NOUN
Lemma: fruit
----------
Analyzing token: produced
Not sto

If we use the larger spaCy models, we get a GloVe vector representation for some words, taken from a pre-trained model. The GloVe vectors have 300 dimensions.

In [32]:
# `token` is the last token left over from the loop in the previous cell
print(token.vector)
token.vector.shape

[ 0.012001   0.20751   -0.12578   -0.59325    0.12525    0.15975
  0.13748   -0.33157   -0.13694    1.7893    -0.47094    0.70434
  0.26673   -0.089961  -0.18168    0.067226   0.053347   1.5595
 -0.2541     0.038413  -0.01409    0.056774   0.023434   0.024042
  0.31703    0.19025   -0.37505    0.035603   0.1181     0.012032
 -0.037566  -0.5046    -0.049261   0.092351   0.11031   -0.073062
  0.33994    0.28239    0.13413    0.070128  -0.022099  -0.28103
  0.49607   -0.48693   -0.090964  -0.1538    -0.38011   -0.014228
 -0.19392   -0.11068   -0.014088  -0.17906    0.24509   -0.16878
 -0.15351   -0.13808    0.02151    0.13699    0.0068061 -0.14915
 -0.38169    0.12727    0.44007    0.32678   -0.46117    0.068687
  0.34747    0.18827   -0.31837    0.4447    -0.2095    -0.26987
  0.48945    0.15388    0.05295   -0.049831   0.11207    0.14881
 -0.37003    0.30777   -0.33865    0.045149  -0.18987    0.26634
 -0.26401   -0.47556    0.68381   -0.30653    0.24606    0.31611
 -0.071098   0.030417

(300,)

One thing that spaCy makes easy with its vectors is measuring the similarity between one word and another. Let's try it out.
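
spaCy's documentation describes the default similarity estimate as the cosine similarity of the vectors involved. The sketch below computes that quantity by hand with NumPy; the function name `cosine_similarity` and the short example vectors are made up for illustration and are not real GloVe vectors.

```python
import numpy as np


def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denominator = np.linalg.norm(u) * np.linalg.norm(v)
    if denominator == 0:
        return 0.0
    return float(np.dot(u, v) / denominator)


# Made-up 3-dimensional vectors standing in for 300-dimensional word vectors
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 5.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why the measure compares direction rather than magnitude.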

In [0]:
# Try out different words here and see if the results you get seem reasonable.
first_word = "cat" 
second_word = "dog"

In [34]:
print(nlp(first_word).similarity(nlp(second_word)))

0.8016854705531046


Given that parsing text takes some time, we will only consider the first 1000 articles of our training data and split them again into smaller training and test sets.

In [0]:
new_X_train, new_X_test, new_y_train, new_y_test = train_test_split(X_train[:1000], y_train[:1000], random_state=0)

This [tweet](https://twitter.com/_inesmontani/status/1113413036985991170) may be relevant to understanding the most performant ways to use spaCy in the future.

In [71]:
%%time
# Using `nlp` from above, parse every instance of new_X_train
# save the document **vectors** to a np.array called X_train_glove
# This cell will take a long time to run. 
# Try changing the number of articles to run this on to just 10 until you pass the asserts
# Then change it back to 1000

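# Alternative that is usually faster: stream the texts through nlp.pipe and take each Doc's vector
# X_train_glove = np.array([doc.vector for doc in nlp.pipe(new_X_train)])
# X_test_glove = np.array([doc.vector for doc in nlp.pipe(new_X_test)])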

X_train_glove = np.array([nlp(text).vector for text in new_X_train])
X_test_glove = np.array([nlp(text).vector for text in new_X_test])

CPU times: user 1min 6s, sys: 435 ms, total: 1min 6s
Wall time: 1min 6s


In [66]:
print(len(new_X_train))
print(X_train_glove.shape)

7
(7, 300)


In [0]:
assert X_train_glove.shape == (len(new_X_train), 300)
assert X_test_glove.shape == (len(new_X_test), 300)
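
Each row of these arrays is a single 300-dimensional document vector. spaCy documents `Doc.vector` as defaulting to an average of the token vectors, which the following NumPy sketch imitates. The embedding table `toy_embeddings`, the helper `document_vector`, and the choice to count unknown tokens as zero vectors are all assumptions made for illustration.

```python
import numpy as np

# Made-up 3-dimensional embeddings standing in for 300-dimensional GloVe vectors
toy_embeddings = {
    "apple": np.array([1.0, 0.0, 2.0]),
    "tree": np.array([3.0, 2.0, 0.0]),
}


def document_vector(tokens, embeddings, dim):
    """Average the token vectors; tokens without an embedding count as zeros (assumption)."""
    if not tokens:
        return np.zeros(dim)
    vectors = [embeddings.get(token, np.zeros(dim)) for token in tokens]
    return np.mean(vectors, axis=0)


toy_doc_vector = document_vector(["apple", "tree"], toy_embeddings, dim=3)
print(toy_doc_vector)  # [2. 1. 1.]
```

Averaging discards word order, so two documents containing the same words in a different order end up with the same vector.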

In [73]:
svm = LinearSVC().fit(X_train_glove, new_y_train)
y_pred = svm.predict(X_test_glove)
print(classification_report(new_y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         7
           1       0.50      0.50      0.50        10
           2       0.60      0.55      0.57        11
           3       0.60      0.50      0.55        12
           4       0.50      0.50      0.50        12
           5       0.58      0.64      0.61        11
           6       0.64      0.90      0.75        10
           7       0.77      0.94      0.85        18
           8       0.88      0.82      0.85        17
           9       0.76      0.76      0.76        17
          10       0.64      0.64      0.64        14
          11       0.85      0.79      0.81        14
          12       0.69      0.47      0.56        19
          13       0.80      1.00      0.89        16
          14       0.75      0.60      0.67        10
          15       0.56      0.77      0.65        13
          16       0.85      0.79      0.81        14
          17       0.73    

Note that the results here aren't necessarily a fair comparison since we are only using a small subset of the data for both training and testing.

We will not cover LDA in this exercise but if you are interested in topic modeling, you should check out [Gensim](https://radimrehurek.com/gensim/) and its [LDA implementation](https://radimrehurek.com/gensim/models/ldamodel.html).

## Feedback

In [0]:
def feedback():
    """Provide feedback on the contents of this exercise
    
    Returns:
        string
    """
    # YOUR CODE HERE
    return "This exercise suffered from inconsistent results across machines. That is, certain sections of code worked for some of my partners but not for me, and vice versa. I did appreciate being exposed to some best practices."