### About
This notebook works through extracting parts of speech, entities, noun chunks, and the topic vector and topic string from a text sample. These features may help categorize arbitrary text blobs more accurately.

[Language Model](#Language-Model)  
[Corpra](#Corpra)  
[Topic Modeling](#Topic-Modeling)  
[Parts of Speech](#Parts-of-Speech)  
[Noun Chunking](#Noun-Chunking)  
[Entity Detection](#Entity-Detection)  
[Tokenize](#Tokenize)  
[Topic Vector](#Topic-Vector)  
[Topic String](#Topic-String)  

In [1]:
from tool import import_dir; import_dir("../")
%time from src.lib import nlp as libnlp

CPU times: user 524 ms, sys: 144 ms, total: 668 ms
Wall time: 503 ms


---
### Language Model
I am using the [medium English model](https://spacy.io/models/en#en_core_web_md) provided by spaCy. I could likely improve results by using one of the larger models, but I chose the medium model to work in a memory-constrained environment.

In [2]:
%time (nlp, keys, vectors) = libnlp.load_model("en_core_web_md")

CPU times: user 17 s, sys: 544 ms, total: 17.5 s
Wall time: 17.6 s


---
### Corpra

In [3]:
%%time
doc = nlp(u"""
    Explosions will continually shake the earth
    Radiated robot men will stalk each other
    The rich and the chosen will watch from space platforms
    Dante's Inferno will be made to look like a children's playground
    The sun will not be seen and it will always be night
    Trees will die
""".strip())

CPU times: user 52.2 ms, sys: 1.98 ms, total: 54.2 ms
Wall time: 54 ms


---
### Topic Modeling
The primary feature used to derive a topic is the "topic vector", which is the average of the word vectors. To convert from vector space back to text, I find the word in the model's vocabulary whose vector is closest to the topic vector. The closest word might still not be very close to the topic vector, which may be a limitation of the model.

---
### Parts of Speech
The raw part-of-speech output (a list of `(token, tag)` pairs) needs to be transformed to make it friendlier for machine learning work.

One possible solution is a part-of-speech hashmap. This could be represented as a Python dictionary where each key is the string name of a part of speech and each value is the number of times it appears in a sample.

This structure should include every possible key, even when its count is zero, so that corpora can be compared.
```
{
    "NOUN": <number of occurrences>,
    "PRON": <number of occurrences>,
    ...
}
```
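
One way to guarantee that every sample has the same keys is to start from a fixed tag set and fill in zeros. The sketch below assumes the key set is the 17 Universal POS tags plus spaCy's `SPACE` tag; `POS_TAGS` and `pos_counts` are hypothetical names for illustration and are not part of `libnlp`.

```python
from collections import Counter

# Assumed key set: the 17 Universal POS tags plus spaCy's SPACE tag.
POS_TAGS = (
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X", "SPACE",
)


def pos_counts(tags, tagset=POS_TAGS):
    """Count tags, returning an entry (possibly 0) for every tag in tagset."""
    counts = Counter(tags)
    # Tags outside the tagset are ignored so every sample has identical keys.
    return {tag: counts.get(tag, 0) for tag in tagset}


sample = pos_counts(["NOUN", "AUX", "VERB", "DET", "NOUN"])
print(sample["NOUN"], sample["PRON"])  # 2 0
```

Because every result has the same keys in the same order, the values can be read off as a fixed-length feature vector.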

In [4]:
# proof of concept for creating a pos hash
def pos_hash(pos):
    """Count how many times each part-of-speech tag appears in `pos`."""
    _hash = {}
    for tag in pos:
        # start at 0 for unseen tags, then increment
        _hash[tag] = _hash.get(tag, 0) + 1

    return _hash

In [5]:
pos = [p[1] for p in libnlp.parts_of_speech(doc)]
%time pos_hash(pos)

CPU times: user 13 µs, sys: 0 ns, total: 13 µs
Wall time: 16.7 µs


{'NOUN': 11,
 'AUX': 10,
 'ADV': 2,
 'VERB': 8,
 'DET': 6,
 'SPACE': 5,
 'ADJ': 2,
 'CCONJ': 2,
 'ADP': 1,
 'PROPN': 3,
 'PART': 4,
 'SCONJ': 1,
 'PRON': 1}

In [6]:
%time libnlp.parts_of_speech(doc)

CPU times: user 0 ns, sys: 267 µs, total: 267 µs
Wall time: 274 µs


[('Explosions', 'NOUN'),
 ('will', 'AUX'),
 ('continually', 'ADV'),
 ('shake', 'VERB'),
 ('the', 'DET'),
 ('earth', 'NOUN'),
 ('\n    ', 'SPACE'),
 ('Radiated', 'NOUN'),
 ('robot', 'NOUN'),
 ('men', 'NOUN'),
 ('will', 'AUX'),
 ('stalk', 'VERB'),
 ('each', 'DET'),
 ('other', 'ADJ'),
 ('\n    ', 'SPACE'),
 ('The', 'DET'),
 ('rich', 'ADJ'),
 ('and', 'CCONJ'),
 ('the', 'DET'),
 ('chosen', 'VERB'),
 ('will', 'AUX'),
 ('watch', 'VERB'),
 ('from', 'ADP'),
 ('space', 'NOUN'),
 ('platforms', 'NOUN'),
 ('\n    ', 'SPACE'),
 ('Dante', 'PROPN'),
 ("'s", 'PART'),
 ('Inferno', 'PROPN'),
 ('will', 'AUX'),
 ('be', 'AUX'),
 ('made', 'VERB'),
 ('to', 'PART'),
 ('look', 'VERB'),
 ('like', 'SCONJ'),
 ('a', 'DET'),
 ('children', 'NOUN'),
 ("'s", 'PART'),
 ('playground', 'NOUN'),
 ('\n    ', 'SPACE'),
 ('The', 'DET'),
 ('sun', 'NOUN'),
 ('will', 'AUX'),
 ('not', 'PART'),
 ('be', 'AUX'),
 ('seen', 'VERB'),
 ('and', 'CCONJ'),
 ('it', 'PRON'),
 ('will', 'AUX'),
 ('always', 'ADV'),
 ('be', 'AUX'),
 ('night', 'N

---
### Noun Chunking
spaCy derives noun chunks (base noun phrases) from its dependency parse, so no extra modeling is needed here.

Noun chunking seems to work well in most cases.

In [7]:
%time libnlp.noun_chunks(doc)

CPU times: user 481 µs, sys: 20 µs, total: 501 µs
Wall time: 510 µs


[('Explosions', 'NP'),
 ('the earth', 'NP'),
 ('Radiated robot men', 'NP'),
 ('space platforms', 'NP'),
 ("Dante's Inferno", 'NP'),
 ("a children's playground", 'NP'),
 ('The sun', 'NP'),
 ('it', 'NP'),
 ('night', 'NP'),
 ('Trees', 'NP')]

---
### Entity Detection

Entity detection seems to do better with larger models. `Dante's Inferno` is classified as an organization (`ORG`), which could probably be improved with a dataset that has more contextual references. The model still did reasonably well without that context.

In [8]:
%time libnlp.entities(doc)

CPU times: user 94 µs, sys: 0 ns, total: 94 µs
Wall time: 111 µs


[("Dante's Inferno", 'ORG'), ('night', 'TIME')]

---
I have wrapped some functions as lambdas. This is purely for convenience. Python has no built-in syntax for partial function application (the standard library offers `functools.partial`, but that needs an additional import), so the lambdas are just a workaround.

In [9]:
tokenize = lambda corpra: libnlp.tokenize(nlp, corpra)
topic_vector = lambda tokens: libnlp.topic_vector(nlp, tokens)
topic_string = lambda vector: libnlp.topic_string(nlp, keys, vectors, vector)
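
For comparison, the standard library's `functools.partial` can express the same binding of leading arguments. The sketch below uses a hypothetical stand-in, `topic_string_stub`, in place of `libnlp.topic_string`, whose internals are not shown in this notebook; only the argument order is taken from the lambdas above.

```python
from functools import partial


def topic_string_stub(nlp, keys, vectors, vector):
    """Hypothetical stand-in with the same argument order as libnlp.topic_string."""
    return f"{nlp}:{len(keys)}:{len(vectors)}:{vector}"


# Bind the first three positional arguments; only `vector` remains.
bound = partial(topic_string_stub, "model", ["a", "b"], [[0.0], [1.0]])

# Equivalent lambda, in the style used in this notebook.
as_lambda = lambda vector: topic_string_stub("model", ["a", "b"], [[0.0], [1.0]], vector)

print(bound([0.5]) == as_lambda([0.5]))  # True
```

Both forms behave the same here; the lambda names the remaining argument explicitly, while `partial` keeps a reference to the wrapped function in `bound.func`.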

---
### Tokenize
```
"a sentence" -> ["a", "sentence"]
```

In [10]:
%time tokens = tokenize(doc); print(tokens)

[Explosions, will, continually, shake, the, earth, 
    , Radiated, robot, men, will, stalk, each, other, 
    , The, rich, and, the, chosen, will, watch, from, space, platforms, 
    , Dante, 's, Inferno, will, be, made, to, look, like, a, children, 's, playground, 
    , The, sun, will, not, be, seen, and, it, will, always, be, night, 
    , Trees, will, die]
CPU times: user 204 ms, sys: 94 µs, total: 204 ms
Wall time: 205 ms


---
### Topic Vector
```
You shall know a word by the company it keeps - Firth
```

The topic vector is the average of the word vectors. This is not the best approximation of a topic. A neural net trained on a labeled dataset would likely outperform this method.

```
( [1,2,3] + [6,5,4] ) * [0.5,0.5,0.5] = [3.5,3.5,3.5]
```
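
The averaging step can be sketched in plain Python. This is a minimal illustration, assuming every word vector has the same length; `average_vectors` is a hypothetical helper and not the `libnlp.topic_vector` implementation.

```python
def average_vectors(vectors):
    """Return the element-wise mean of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    # zip(*vectors) groups the i-th component of every vector together.
    return [sum(component) / n for component in zip(*vectors)]


print(average_vectors([[1, 2, 3], [6, 5, 4]]))  # [3.5, 3.5, 3.5]
```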

In [11]:
%time topic = topic_vector(tokens); print(f"{topic[:5]} ... len: {len(topic)}")

[-0.03331937  0.02858579 -0.07996928 -0.0517877   0.01438936] ... len: 300
CPU times: user 2.92 ms, sys: 120 µs, total: 3.04 ms
Wall time: 6.59 ms


---
### Topic String

The topic string is the word in our vocabulary that is closest to the topic vector.

The closest word is found by calculating the cosine distance between the topic vector and every single vector in our vocabulary. The vector with the smallest cosine distance is chosen.

At this point the match is still in numeric form. It is then used as a key to look up the string value in the vocab data structure.

```
[1.1,1.9,2.7,3.8] -> [1,2,3,4] -> "joe"
```
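
The lookup can be sketched as a brute-force search. This assumes the vocabulary is available as a mapping from integer key to vector plus a mapping from key to string; `cosine_distance`, `closest_key`, and the toy data are hypothetical and do not reflect how `libnlp.topic_string` is actually written.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Undefined for zero vectors; fall back to distance 1.0 (orthogonal).
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def closest_key(topic, vectors_by_key):
    """Return the key whose vector has the smallest cosine distance to topic."""
    return min(vectors_by_key, key=lambda k: cosine_distance(topic, vectors_by_key[k]))


# Toy vocabulary: integer key -> vector, and integer key -> string.
vectors_by_key = {101: [1.0, 0.0], 102: [0.0, 1.0], 103: [1.0, 1.0]}
strings_by_key = {101: "east", 102: "north", 103: "northeast"}

key = closest_key([0.9, 1.1], vectors_by_key)
print(strings_by_key[key])  # northeast
```

A scan like this touches every vector in the vocabulary, which is consistent with the topic string step being the slowest cell in this notebook.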


In [12]:
%time topic_string(topic)

CPU times: user 886 ms, sys: 740 ms, total: 1.63 s
Wall time: 1.63 s


'one'