### NLP Part II

**OBJECTIVES**

- Introduce `spacy` for part-of-speech tagging
- Use corpus readers to extract sentence and word features of raw text files
- Build and explore topic models
- Discuss word vectors and similarity

In [2]:
from nltk.corpus import PlaintextCorpusReader

In [13]:
heights = PlaintextCorpusReader('heights/', 'heights.*')
lemonde = PlaintextCorpusReader('le_monde/', 'le_monde.*')
community = PlaintextCorpusReader('community/', 'community.*')
amigos = PlaintextCorpusReader('amigos/', 'amigos.*')
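
To make the reader calls above concrete, here is a minimal sketch that builds a throwaway corpus in a temporary directory and reads it back. The file names (`reviews_1.txt`, `reviews_2.txt`) and their contents are made up for illustration and are not part of the course data. Only `fileids()`, `raw()` and `words()` are used, since `words()` relies on a regex tokenizer and should not need any extra NLTK downloads.

```python
import tempfile
from pathlib import Path

from nltk.corpus import PlaintextCorpusReader

# Hypothetical two-file corpus, written to a temporary directory.
with tempfile.TemporaryDirectory() as root:
    Path(root, "reviews_1.txt").write_text(
        "The food was great. We will return!", encoding="utf8"
    )
    Path(root, "reviews_2.txt").write_text("Service was slow.", encoding="utf8")

    # The second argument is a regular expression matched against file names.
    reader = PlaintextCorpusReader(root, r"reviews_.*\.txt")

    file_ids = reader.fileids()                    # files matched by the pattern
    words = list(reader.words("reviews_1.txt"))    # word and punctuation tokens
    raw_text = reader.raw()                        # all matched files as one string

print(file_ids)
print(words)
```

The views returned by `words()` are lazy, so the sketch converts them to a list before the temporary directory is removed.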

In [15]:
lemonde.raw()[:100]

"My wife and I get food from here whenever we're in town. I have relatives who live just a few blocks"

In [45]:
import spacy
nlp = spacy.load("en_core_web_sm")
text = heights.raw()
doc = nlp(text)

In [46]:
named_entities = list(doc.ents)

In [47]:
[(ent, ent.label_) for ent in named_entities[:5]]

[(The Heights, 'GPE'),
 (at least 4, 'CARDINAL'),
 (five, 'CARDINAL'),
 (Taco, 'GPE'),
 (Tuesday, 'DATE')]

In [42]:
from spacy import displacy

In [48]:
# Uncomment to highlight the named entities inline in a Jupyter notebook
# displacy.render(doc, jupyter=True, style='ent')

In [49]:
from sklearn.datasets import fetch_20newsgroups

In [50]:
news = fetch_20newsgroups(categories=['alt.atheism', 'comp.graphics'])

In [51]:
X, y = news.data, news.target

In [52]:
print(X[0])

From: frank@D012S658.uucp (Frank O'Dwyer)
Subject: Re: After 2000 years, can we say that Christian Morality is
Organization: Siemens-Nixdorf AG
Lines: 28
NNTP-Posting-Host: d012s658.ap.mchp.sni.de

In article <1993Apr15.125245.12872@abo.fi> MANDTBACKA@FINABO.ABO.FI (Mats Andtbacka) writes:
|In <1qie61$fkt@horus.ap.mchp.sni.de> frank@D012S658.uucp writes:
|> In article <30114@ursa.bear.com> halat@pooh.bears (Jim Halat) writes:
|
|> #I'm one of those people who does not know what the word objective means 
|> #when put next to the word morality.  I assume its an idiom and cannot
|> #be defined by its separate terms.
|> #
|> #Give it a try.
|> 
|> Objective morality is morality built from objective values.
|
|      "And these objective values are ... ?"
|Please be specific, and more importantly, motivate.

I'll take a wild guess and say Freedom is objectively valuable.  I base
this on the assumption that if everyone in the world were deprived utterly
of their freedom (so that their every a

In [53]:
y[0]

0

In [54]:
from sklearn.feature_extraction.text import CountVectorizer

In [55]:
from sklearn.linear_model import LogisticRegression

In [62]:
from sklearn.pipeline import make_pipeline

In [63]:
from sklearn.preprocessing import StandardScaler

In [66]:
# Sparse count matrices cannot be centered, so StandardScaler only rescales (with_mean=False)
pipe = make_pipeline(
    CountVectorizer(max_features=500),
    StandardScaler(with_mean=False),
    LogisticRegression(),
)

In [67]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('countvectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=500, min_df=1, ngram_range=(1, 1),
                                 preprocessor=None, stop_words=None,
                                 strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, voca...,
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=False, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                 

In [68]:
# Accuracy on the same data the pipeline was fit on, so this is an optimistic estimate
pipe.score(X, y)

1.0

In [69]:
from sklearn.model_selection import cross_validate

In [71]:
from sklearn.model_selection import train_test_split

In [76]:
cvect = CountVectorizer()

In [77]:
cvect.fit(X)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [79]:
cvect.transform(X)

<1064x21366 sparse matrix of type '<class 'numpy.int64'>'
	with 154964 stored elements in Compressed Sparse Row format>

In [80]:
cvect.fit_transform(X)

<1064x21366 sparse matrix of type '<class 'numpy.int64'>'
	with 154964 stored elements in Compressed Sparse Row format>
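
The sparse matrix above is hard to inspect directly, so the following sketch runs the same `fit_transform` call on a tiny corpus where the counts can be checked by hand. The two sentences and the names `docs` and `toy_vect` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical two-document corpus, small enough to verify by hand.
docs = ["the tacos were great", "great service and great tacos"]

toy_vect = CountVectorizer()
counts = toy_vect.fit_transform(docs)  # sparse matrix: one row per document

# vocabulary_ maps each token to its column index; sort tokens by that index.
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
dense = counts.toarray()               # dense copy, fine for a toy example

print(vocab)   # ['and', 'great', 'service', 'tacos', 'the', 'were']
print(dense)   # row 0: [0 1 0 1 1 1], row 1: [1 2 1 1 0 0]
```

Each column corresponds to one vocabulary term and each entry is the number of times that term appears in the document, which is why "great" gets a 2 in the second row.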

In [81]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
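# Keep the raw-text splits intact: `pipe` vectorizes internally and expects strings.
# Fit the vocabulary on the training split only, then reuse it for the test split.
X_train_counts = cvect.fit_transform(X_train)
X_test_counts = cvect.transform(X_test)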

In [73]:
cross_validate(pipe, X_train, y_train, cv = 5)

{'fit_time': array([0.12546611, 0.11617994, 0.11869383, 0.11560202, 0.10505581]),
 'score_time': array([0.01980805, 0.02043295, 0.0177772 , 0.01826596, 0.02770591]),
 'test_score': array([0.95625   , 0.98125   , 0.95625   , 0.94968553, 0.96226415])}
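
The five fold scores are easier to compare against the later test score once they are reduced to a mean and a spread. This sketch copies the `test_score` values from the output above into an array; the variable names are illustrative only.

```python
import numpy as np

# Fold accuracies copied from the cross_validate output above.
fold_scores = np.array([0.95625, 0.98125, 0.95625, 0.94968553, 0.96226415])

mean_score = fold_scores.mean()  # average accuracy across the 5 folds
std_score = fold_scores.std()    # fold-to-fold spread (population std)

print(f"cv accuracy: {mean_score:.3f} +/- {std_score:.3f}")
```

A mean of roughly 0.96 with a spread of about 0.01 suggests the folds behave similarly. In a live session the same summary can be taken directly from the dictionary returned by `cross_validate`, using its `'test_score'` entry.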

In [74]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('countvectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=500, min_df=1, ngram_range=(1, 1),
                                 preprocessor=None, stop_words=None,
                                 strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, voca...,
                ('standardscaler',
                 StandardScaler(copy=True, with_mean=False, with_std=True)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                 

In [75]:
pipe.score(X_test, y_test)

0.9736842105263158