The directory `data` contains a subdirectory named `docbin`, which contains two spaCy *DocBin* objects named `small.spacy` and `medium.spacy`.

Both *DocBin* objects contain the same texts, but the texts have been processed using different language models.

The file `small.spacy` was created using the small language model for English (`en_core_web_sm`), whereas `medium.spacy` was created using the medium language model (`en_core_web_md`).
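For orientation, the sketch below shows one way a *DocBin* can be written to disk and read back. It is not how `small.spacy` or `medium.spacy` were actually produced: the blank pipeline, the two example sentences and the file name `example.spacy` are assumptions chosen only to keep the example small and independent of any downloaded model.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin

# A blank English pipeline only tokenizes; it stands in for a trained model here.
nlp = spacy.blank("en")
texts = ["Oil prices rose by five pct.", "The year ended quietly."]

# Pack the processed Doc objects into a single DocBin.
doc_bin = DocBin(docs=[nlp(text) for text in texts])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "example.spacy"  # hypothetical file name
    doc_bin.to_disk(path)
    # get_docs() needs a Vocab in which to rebuild the Doc objects.
    restored_docs = list(DocBin().from_disk(path).get_docs(nlp.vocab))

print([doc.text for doc in restored_docs])
```

A trained pipeline such as `en_core_web_sm` would be used in the same way, in which case the stored *Doc* objects would also carry its part-of-speech tags and other annotations.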

Load the *Doc* objects contained in `small.spacy` and store them in a list under the variable `small_docs`.

Then load the *Doc* objects in `medium.spacy` and store them in a list under the variable `medium_docs`.

Run the cell below to download the *DocBin* objects on your server.

In [5]:
import spacy
from spacy.tokens import DocBin

# Load the pipelines that produced each DocBin; get_docs() needs a Vocab to rebuild the Docs.
small_nlp = spacy.load('en_core_web_sm')
medium_nlp = spacy.load('en_core_web_md')

# Deserialize each DocBin and collect its Doc objects into a list.
small_docs = list(DocBin().from_disk(path='data/docbin/small.spacy').get_docs(small_nlp.vocab))
medium_docs = list(DocBin().from_disk(path='data/docbin/medium.spacy').get_docs(medium_nlp.vocab))


Collect fine-grained part-of-speech tags for all *Doc* objects in both `small_docs` and `medium_docs`. Store the part-of-speech tags into lists named `small_tokens` and `medium_tokens`, respectively.

Next, calculate the *precision* score between `medium_tokens` and `small_tokens` to assess to what extent the models produce similar predictions for part-of-speech tags.

To do so, use the `precision_score()` function from the `metrics` module of the scikit-learn library. Use `micro` averaging and set the `zero_division` argument to 0.

Store the result under the variable `pr`.
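To make the metric concrete, here is a minimal pure-Python sketch of what micro-averaged precision computes for single-label data such as part-of-speech tags. The helper `micro_precision` and the two short tag lists are hypothetical and serve only as illustration; this is not scikit-learn's implementation. True and false positives are pooled over all tags, and because every mismatch counts as a false positive for the predicted tag, the score equals the share of positions where the two sequences agree.

```python
from collections import Counter


def micro_precision(y_true, y_pred):
    """Pool true and false positives over all classes, then divide."""
    if len(y_true) != len(y_pred):
        raise ValueError("Both tag sequences must have the same length.")
    true_pos = Counter()
    false_pos = Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            true_pos[pred] += 1
        else:
            # A wrong prediction is a false positive for the predicted tag.
            false_pos[pred] += 1
    tp = sum(true_pos.values())
    fp = sum(false_pos.values())
    # Return 0.0 when nothing was predicted, similar in spirit to zero_division=0.
    return tp / (tp + fp) if tp + fp else 0.0


# Hypothetical tag sequences: four of the five positions agree.
reference = ["NN", "VBD", "DT", "NN", "IN"]
predicted = ["NN", "VBD", "DT", "NNS", "IN"]
score = micro_precision(reference, predicted)
print(score)
```

Note that the comparison is position by position, so both lists must contain the same number of tags in the same order.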

In [6]:
from sklearn import metrics

# Flatten each list of Docs into one list of fine-grained tags (token.tag_).
small_tokens = [token.tag_ for doc in small_docs for token in doc]
medium_tokens = [token.tag_ for doc in medium_docs for token in doc]

# The medium model's tags serve as the reference (y_true), the small model's as predictions (y_pred).
pr = metrics.precision_score(medium_tokens, small_tokens, average="micro", zero_division=0)
pr

0.9613049388309923

The directory `data` contains a subdirectory named `corp`, which contains a file of comma-separated values named `pos.csv`.

Read the file contents using pandas and store them in a DataFrame under the variable `data`.

Get the five most common part-of-speech tags from the column `pos` and store this information in a pandas *Series* named `top_pos`.

Then get the five most common tokens among those tagged as `NOUN`, and store this information as a pandas *Series* under the variable `top_nouns`.
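The counting pattern described above can be tried out on a small table first. The sketch below uses a made-up, in-memory CSV with `token` and `pos` columns rather than the real `pos.csv`; the rows and the variable names `toy`, `pos_counts` and `noun_counts` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical miniature of a token/pos table; not the contents of pos.csv.
csv_text = """token,pos
oil,NOUN
prices,NOUN
rose,VERB
oil,NOUN
the,DET
year,NOUN
oil,NOUN
fell,VERB
year,NOUN
"""
toy = pd.read_csv(io.StringIO(csv_text))

# value_counts() sorts by frequency in descending order, so head(2) keeps the top two.
pos_counts = toy["pos"].value_counts().head(2)

# Select the token column for the NOUN rows only, then count the tokens.
noun_counts = toy.loc[toy["pos"] == "NOUN", "token"].value_counts().head(2)

print(pos_counts)
print(noun_counts)
```

Selecting the `token` column before counting yields a *Series* indexed by token alone, whereas calling `value_counts()` on the whole subset counts unique rows and produces an index with both `token` and `pos`.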

In [7]:
import pandas as pd

data = pd.read_csv('data/corp/pos.csv')
n = 5

# value_counts() already sorts in descending order, so head(n) gives the n most frequent tags.
top_pos = data['pos'].value_counts().head(n)

# Keep only the rows tagged as NOUN, then count the unique (token, pos) rows.
sub = data[data['pos'] == 'NOUN']
top_nouns = sub.value_counts().head(n)

In [8]:
top_nouns

token  pos 
pct    NOUN    57
year   NOUN    42
dlrs   NOUN    36
oil    NOUN    35
mln    NOUN    30
dtype: int64