### OCI Data Science - Useful Tips
<details>
<summary><font size="2">Check for Public Internet Access</font></summary>

```python
import requests
response = requests.get("https://oracle.com")
assert response.status_code==200, "Internet connection failed"
```
</details>
<details>
<summary><font size="2">Helpful Documentation </font></summary>
<ul><li><a href="https://docs.cloud.oracle.com/en-us/iaas/data-science/using/data-science.htm">Data Science Service Documentation</a></li>
<li><a href="https://docs.cloud.oracle.com/iaas/tools/ads-sdk/latest/index.html">ADS documentation</a></li>
</ul>
</details>
<details>
<summary><font size="2">Typical Cell Imports and Settings for ADS</font></summary>

```python
%load_ext autoreload
%autoreload 2
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

import logging
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.ERROR)

import ads
from ads.dataset.factory import DatasetFactory
from ads.automl.provider import OracleAutoMLProvider
from ads.automl.driver import AutoML
from ads.evaluations.evaluator import ADSEvaluator
from ads.common.data import ADSData
from ads.explanations.explainer import ADSExplainer
from ads.explanations.mlx_global_explainer import MLXGlobalExplainer
from ads.explanations.mlx_local_explainer import MLXLocalExplainer
from ads.catalog.model import ModelCatalog
from ads.common.model_artifact import ModelArtifact
```
</details>
<details>
<summary><font size="2">Useful Environment Variables</font></summary>

```python
import os
print(os.environ["NB_SESSION_COMPARTMENT_OCID"])
print(os.environ["PROJECT_OCID"])
print(os.environ["USER_OCID"])
print(os.environ["TENANCY_OCID"])
print(os.environ["NB_REGION"])
```
</details>

***
# <font color=red>Part of Speech tagging with nltk and scikit-learn</font>
<p style="margin-left:10%; margin-right:10%;">by the <font color=teal> Oracle Cloud Infrastructure Data Science Team </font></p>

***

<font color=gray>ADS Sample Notebook.

Copyright (c) 2021 Oracle, Inc.  All rights reserved.
Licensed under the Universal Permissive License v 1.0 as shown at https://oss.oracle.com/licenses/upl.
</font>

# Overview
This notebook demonstrates how to develop a token classification system that tags each word with its part of speech (POS). The skills taught here also apply to related problems, such as named entity recognition.

We use `scikit-learn` and `nltk` to build an effective POS classifier in seconds. We use the Penn `treebank` corpus as our training dataset. More information about what a treebank is can be found [here](https://en.wikipedia.org/wiki/Treebank).

**Important:**

Placeholder text for required values is surrounded by angle brackets, which must be removed when adding the indicated content. For example, when adding a database name, `database_name = "<database_name>"` would become `database_name = "production"`.

---

## Prerequisites:
 - Experience with the topic: Novice
 - Professional experience: None
 
This notebook is intended for data scientists who want to learn about Natural Language Processing tasks, as well as experienced data scientists who want to add another tool to their toolbox.

---

### First, import the necessary libraries

In [None]:
import nltk
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

We will download the treebank dataset. This may take a while because the treebank dataset consists of 3914 tagged sentences and 100676 tokens.

In [None]:
nltk.download('treebank')

We load the sentences.

In [None]:
tagged_sentences = nltk.corpus.treebank.tagged_sents()

Here is an example of a sentence and its part of speech tags. 

In [None]:
tagged_sentences[1]
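
Each tagged sentence is a list of `(token, tag)` pairs. As a small illustration, the pairs can be split into parallel sequences with `zip`. The sentence below is a hand-written stand-in in the same format, not an entry from the corpus.

```python
# Hand-written stand-in for one treebank sentence (same (token, tag) format).
sample_sentence = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")]

# zip(*pairs) transposes the list of pairs into two parallel tuples.
tokens, tags = zip(*sample_sentence)

print(tokens)  # ('The', 'cat', 'sat', '.')
print(tags)    # ('DT', 'NN', 'VBD', '.')
```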

The part-of-speech codes used in the treebank dataset and their descriptions are as follows:


1. 	CC 	Coordinating conjunction
2. 	CD 	Cardinal number
3. 	DT 	Determiner
4. 	EX 	Existential there
5. 	FW 	Foreign word
6. 	IN 	Preposition or subordinating conjunction
7. 	JJ 	Adjective
8. 	JJR 	Adjective, comparative
9. 	JJS 	Adjective, superlative
10. 	LS 	List item marker
11. 	MD 	Modal
12. 	NN 	Noun, singular or mass
13. 	NNS 	Noun, plural
14. 	NNP 	Proper noun, singular
15. 	NNPS 	Proper noun, plural
16. 	PDT 	Predeterminer
17. 	POS 	Possessive ending
18. 	PRP 	Personal pronoun
19. 	PRP\$ 	Possessive pronoun
20. 	RB 	Adverb
21. 	RBR 	Adverb, comparative
22. 	RBS 	Adverb, superlative
23. 	RP 	Particle
24. 	SYM 	Symbol
25. 	TO 	to
26. 	UH 	Interjection
27. 	VB 	Verb, base form
28. 	VBD 	Verb, past tense
29. 	VBG 	Verb, gerund or present participle
30. 	VBN 	Verb, past participle
31. 	VBP 	Verb, non-3rd person singular present
32. 	VBZ 	Verb, 3rd person singular present
33. 	WDT 	Wh-determiner
34. 	WP 	Wh-pronoun
35. 	WP\$ 	Possessive wh-pronoun
36. 	WRB 	Wh-adverb 

We need to write a function that takes one of these tagged sentences and a token index, and returns a feature dictionary for the token at that position. scikit-learn's official docs have more details about this process [here](https://scikit-learn.org/stable/modules/feature_extraction.html#loading-features-from-dicts).

One could also use word embeddings from models like `word2vec`, but each vector component must be included as an independent dictionary key/value pair. `scikit-learn` doesn't support storing a whole numpy array as a single dictionary value.
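
A minimal sketch of that flattening step is shown here. The helper name `add_embedding_features`, the `emb_` key prefix, and the 3-component vector are all illustrative assumptions; a real `word2vec` vector would come from a trained model and usually has far more components.

```python
def add_embedding_features(features, vector, prefix="emb"):
    """Return a copy of `features` with one numeric key per vector component."""
    flattened = dict(features)
    for i, component in enumerate(vector):
        flattened[f"{prefix}_{i}"] = float(component)
    return flattened


# Hypothetical 3-dimensional embedding, used only to show the key layout.
toy_vector = [0.12, -0.5, 0.33]
base_features = {"token": "cat"}
features = add_embedding_features(base_features, toy_vector)

print(features)
# {'token': 'cat', 'emb_0': 0.12, 'emb_1': -0.5, 'emb_2': 0.33}
```

Because every component gets its own key, the resulting dictionary can be mixed freely with the string-valued features used in this notebook.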

In [None]:
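def extract_features(tagged_sentence, index):
    """Build the feature dictionary for the token at position `index`."""
    token, _ = tagged_sentence[index]
    # The previous token gives some context; it is empty at sentence start.
    prev_token = ""
    if index > 0:
        prev_token, _ = tagged_sentence[index - 1]
    # Treat the token as a number if float() can parse it. Testing the
    # parsed value for truthiness would wrongly reject "0".
    try:
        float(token)
        is_number = True
    except ValueError:
        is_number = False
    features_dict = {
        "token": token,
        "lower_cased_token": token.lower(),
        "prev_token": prev_token,
        "suffix1": token[-1],
        "suffix2": token[-2:],
        "suffix3": token[-3:],
        "prefix1": token[:1],
        "prefix2": token[:2],
        "prefix3": token[:3],
        # True when the token contains no lowercase letters (all-caps words,
        # but also digits and punctuation).
        "is_capitalized": token.upper() == token,
        "is_number": is_number,
    }
    return features_dict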

Here is what the output of this feature extractor looks like on part of our dataset.

In [None]:
extract_features(tagged_sentences[0], 1)

We can now use our function to generate our input data. This is a compute-intensive task, so to keep the notebook easy to run on small shapes, we downsample to only the first 100 sentences. You are free to remove this downsampling; the model will likely perform better but will take far longer to train.

In [None]:
X_features = []
for sentence in tagged_sentences[0:100]:
    for k in range(len(sentence)):
        X_features.append(extract_features(sentence, k))



We need to use a `DictVectorizer` to convert our dictionary representation into data that `scikit-learn` understands.

In [None]:
vectoriser = DictVectorizer(sparse=False)
X = vectoriser.fit_transform(X_features)
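
To make the conversion concrete, here is a small sketch on two hand-written feature dictionaries (the `toy_` names and values are illustrative and not part of the notebook's data). String values become one-hot `key=value` columns, while booleans and numbers are kept as a single numeric column.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hand-written feature dictionaries, used only for illustration.
toy_features = [
    {"token": "cat", "is_number": False},
    {"token": "42", "is_number": True},
]

toy_vectoriser = DictVectorizer(sparse=False)
toy_X = toy_vectoriser.fit_transform(toy_features)

# Column names are sorted; one column per numeric key or string key=value pair.
print(toy_vectoriser.feature_names_)  # ['is_number', 'token=42', 'token=cat']
print(toy_X)

# String values that were not seen during fit produce no active column.
unseen = toy_vectoriser.transform([{"token": "dog", "is_number": False}])
print(unseen)
```

With `sparse=False` the result is a dense array, which is convenient for inspection. For larger samples of the corpus, the default `sparse=True` may use considerably less memory, and `SGDClassifier` accepts sparse input.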

We need our labels (the POS tags).

In [None]:
y = []
for sentence in tagged_sentences[0:100]:
    for k in range(len(sentence)):
        y.append(sentence[k][1])
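
Before training, it can help to check how often each tag occurs, because tags with very few examples leave some cross-validation folds with little or no data for that class. The sketch below uses a hand-written label list as a stand-in; in the notebook the same code could be applied to `y`.

```python
from collections import Counter

# Hand-written stand-in labels; in the notebook this would be the list `y`.
toy_labels = ["DT", "NN", "VBD", "DT", "NN", ".", "NN"]

tag_counts = Counter(toy_labels)

# Print tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(f"{tag:>4} {count}")
```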

We can now train a model to do POS tagging on this corpus. POS tagging is compute intensive, so we choose the speedy `SGDClassifier`. For even larger datasets, one may want to try the `LinearSVC` model with careful hyperparameter choices.

In [None]:
clf = SGDClassifier(n_jobs = -1)
clf.fit(X, y)
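
Once a model is fitted, tagging a new sentence means extracting features for each token, vectorising them with the already fitted vectoriser, and calling `predict`. The sketch below shows this flow end to end on hand-written data. The `toy_features` function is a simplified stand-in for `extract_features`, and the training sentences are illustrative only, so the predicted tags are not meaningful.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline


def toy_features(tokens, index):
    """Simplified stand-in for the notebook's extract_features."""
    token = tokens[index]
    return {
        "lower": token.lower(),
        "prev": tokens[index - 1].lower() if index > 0 else "",
        "suffix2": token[-2:],
    }


# Hand-written training data in the treebank (token, tag) format.
train_sentences = [
    [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", ".")],
    [("A", "DT"), ("dog", "NN"), ("ran", "VBD"), (".", ".")],
]

X_toy, y_toy = [], []
for sentence in train_sentences:
    tokens = [token for token, _ in sentence]
    for i, (_, tag) in enumerate(sentence):
        X_toy.append(toy_features(tokens, i))
        y_toy.append(tag)

# The pipeline applies the same vectoriser at fit time and at predict time.
tagger = make_pipeline(DictVectorizer(), SGDClassifier(random_state=0))
tagger.fit(X_toy, y_toy)

new_tokens = ["The", "dog", "sat", "."]
new_features = [toy_features(new_tokens, i) for i in range(len(new_tokens))]
predicted = tagger.predict(new_features)
print(list(zip(new_tokens, predicted)))
```

In the notebook itself, the equivalent would be `clf.predict(vectoriser.transform(features))`. Untagged tokens could be wrapped as `(token, None)` pairs so that `extract_features` can be reused unchanged.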

Finally, we can perform cross-validation to get an idea of the performance of our model.

In [None]:
scores = cross_val_score(clf, X, y, cv=5)

In [None]:
scores

Let's print the average to estimate the performance we might expect in the real world.

In [None]:
np.mean(scores)
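
One caveat: with `cv=5`, the folds are formed at token level, so tokens from the same sentence can land in both the training and the test side of a split. Since the features include the previous token, this may make the estimate somewhat optimistic. A possible alternative, sketched below under the assumption that each token is labelled with the index of its sentence, is to split by sentence with `GroupKFold`. The sentence lengths here are hypothetical placeholders.

```python
from sklearn.model_selection import GroupKFold

# Hypothetical sentence lengths; in the notebook these would be
# len(sentence) for each sentence in tagged_sentences[0:100].
sentence_lengths = [4, 3, 5, 2, 6]

# One group id per token: the index of the sentence it came from.
groups = [s for s, n in enumerate(sentence_lengths) for _ in range(n)]

splitter = GroupKFold(n_splits=5)
placeholder_X = [[0]] * len(groups)  # only the row count matters for splitting

held_out = []
for train_idx, test_idx in splitter.split(placeholder_X, groups=groups):
    train_groups = {groups[i] for i in train_idx}
    test_groups = {groups[i] for i in test_idx}
    # No sentence contributes tokens to both sides of a split.
    assert train_groups.isdisjoint(test_groups)
    held_out.append(sorted(test_groups))

print(held_out)
```

In the notebook, the same idea could be applied by passing `cv=GroupKFold(n_splits=5)` and `groups=groups` to `cross_val_score`.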