# Working with textacy and spaCy

In this notebook we'll be working with textacy, spaCy, and a bit of `TODO?`.

## Useful links:
Keep these handy. The corpus we'll work with should fit in memory, and `spacy` and `textacy` provide many convenient methods for extracting meaningful representations.

1. [spaCy's Doc API reference](https://spacy.io/api/doc): our `textacy.Corpus` consists of `spacy.Doc` instances.
   
1. [textacy's corpus API reference](https://chartbeat-labs.github.io/textacy/api_reference/lang_doc_corpus.html#textacy.corpus.Corpus)

1. [textacy's document analysis quickstart](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-doc): `textacy` adds a bunch of methods on top of the `spacy.Doc` methods, e.g., keyword extraction, readability scores, etc.

1. [textacy's corpus analysis quickstart](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus): on top of that, `textacy` provides a bunch of methods for corpus-level operations, e.g., topic modeling, computing document-term matrices, etc.


In [1]:
import textacy

## Load the Wikinews corpus

We prepared a `textacy.datasets.wikimedia.Wikinews`-corpus in the `get_wikinews_data.ipynb`-notebook. 

Note that loading the `textacy.Corpus` requires a spaCy English model. To use word vectors (which we do below, for the similarity computations), this should be the medium (`en_core_web_md`) or large (`en_core_web_lg`) model. Download a model with, e.g.:

```bash
python -m spacy download en_core_web_md
```

In [2]:
corpus = textacy.Corpus.load("en_core_web_md", "./data/enwikinews/textacy_corpus.bin.gz")
print(corpus)

Corpus(21779 docs, 11752070 tokens)


Documents in the corpus can be accessed directly by index, e.g.:

In [3]:
doc = corpus[1576]  # Take one sample document
doc  # Holds the article content
doc._.meta  # Wikinews metadata is accessible through `._.meta`. Yeah (._.)
doc.has_vector

True

## Document Representations

We can extract meaningful representations and additional metadata, by using [spaCy's Doc methods](https://spacy.io/api/doc) and [`textacy`-specific additional Doc methods](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-doc), e.g.:

In [None]:
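entities = doc.ents  # spacy.tokens.Doc.ents: the named entities found in the article
sentiment_score = doc.sentiment  # spacy.tokens.Doc.sentiment: only meaningful if a pipeline component sets it
doc_vector = doc.vector  # A single document vector: by default the average of the token vectors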

In [None]:
import textacy.ke  # textacy's keyword extraction module

keywords = textacy.ke.textrank(doc, normalize="lemma", topn=10)
print("keywords: {}\n".format(", ".join([k for k, s in keywords])))

doc_stats = textacy.TextStats(doc)
f_k_grade_level = doc_stats.flesch_kincaid_grade_level
print("Flesch-Kincaid Grade Level: {}".format(f_k_grade_level))
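
To make the readability score above less of a black box, here is a minimal sketch of the standard published Flesch-Kincaid grade level formula, computed from raw counts. The function name and the counts are made up for illustration; how `textacy` counts words, sentences, and syllables is not shown here, so its value for a real article may differ from a hand count.

```python
def flesch_kincaid_grade_level(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid grade level from raw counts (standard published formula)."""
    if n_words == 0 or n_sentences == 0:
        raise ValueError("need at least one word and one sentence")
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Hypothetical counts, for illustration only: 100 words, 5 sentences, 150 syllables.
grade = flesch_kincaid_grade_level(100, 5, 150)
print("Flesch-Kincaid Grade Level: {:.2f}".format(grade))
```

Longer sentences and longer words both push the grade level up, which is worth keeping in mind when comparing short news briefs with long feature articles.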

## Corpus-level operations

Finally, `textacy` allows us to conveniently process corpus-level statistics, create similarity matrices, etc.
* See [textacy's quickstart](https://chartbeat-labs.github.io/textacy/getting_started/quickstart.html#analyze-a-corpus) for reference

In [None]:
import textacy.vsm  # vector space models

vectorizer = textacy.vsm.Vectorizer(tf_type="linear", apply_idf=True, idf_type="smooth", norm="l2", 
                                    min_df=2, max_df=0.95)
doc_term_matrix = vectorizer.fit_transform((doc._.to_terms_list(ngrams=1, entities=True, 
                                                                as_strings=True) for doc in corpus))
print(doc_term_matrix.shape)  # (n_docs, n_terms)
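
To get a feel for what a TF-IDF weighted, L2-normalized document-term matrix contains, here is a small standard-library sketch on a toy corpus. It uses the common smoothed idf formula `log((n_docs + 1) / (df + 1)) + 1`; that this matches `idf_type="smooth"` exactly is an assumption, and the `min_df` / `max_df` filtering is left out. The function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocab, rows): one L2-normalized TF-IDF row per document."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))
    vocab = sorted(df)
    # Smoothed idf (assumed formula, see the note above).
    idf = {term: math.log((n_docs + 1) / (df[term] + 1)) + 1 for term in vocab}
    rows = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)  # linear term frequency: raw counts
        row = [tf[term] * idf[term] for term in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows


toy_docs = [
    ["election", "vote", "vote"],
    ["election", "storm"],
    ["storm", "flood", "rain"],
]
vocab, rows = tfidf_rows(toy_docs)
print(len(rows), len(vocab))  # n_docs, n_terms
```

Terms that occur often in one document but in few documents overall get the largest weights, which is what makes these rows useful as document representations.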

In [None]:
import textacy.tm  # topic models

model = textacy.tm.TopicModel("nmf", n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)

In [None]:
for topic_idx, top_terms in model.top_topic_terms(vectorizer.id_to_term, top_n=10):
    print("topic", topic_idx+1, ":", ", ".join(top_terms))
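
The `doc_topic_matrix` computed above has one row per document and one column per topic, so it can also serve as a compact document representation. As a minimal sketch, the helper below (a hypothetical name, run on made-up toy weights rather than the real matrix) picks the dominant topic for each document.

```python
def dominant_topics(doc_topic_rows):
    """Return (topic_index, weight) for the largest weight in each row."""
    assignments = []
    for row in doc_topic_rows:
        best = max(range(len(row)), key=lambda i: row[i])
        assignments.append((best, row[best]))
    return assignments


# Toy weights for illustration: 2 documents, 3 topics.
toy_doc_topic_rows = [
    [0.1, 0.7, 0.2],
    [0.5, 0.0, 0.3],
]
assignments = dominant_topics(toy_doc_topic_rows)
for doc_idx, (topic_idx, weight) in enumerate(assignments):
    print("doc", doc_idx, "-> topic", topic_idx + 1, "(weight {})".format(weight))
```

On the real matrix, assuming it is a NumPy array, `doc_topic_matrix.argmax(axis=1)` should give the same topic indices in one call.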

## Other ways to manipulate the corpus
`textacy.Corpus.vectors` exposes one vector per article (by default the average of its word embeddings), which allows for quick similarity computations:

In [None]:
from scipy.spatial.distance import pdist, squareform

# Compute all pairwise cosine distances (distance = 1 - similarity).
# Note: the full matrix for 21779 docs takes roughly 3.8 GB as float64.
distances = squareform(pdist(corpus.vectors, 'cosine'))

In [None]:
print("Most similar to:\n   {}\n".format(doc._.meta['title']))

for rank, idx in enumerate(distances[1576].argsort()[1:11]):
    title = corpus[idx]._.meta['title']
    similarity = 1-distances[1576][idx]
    print("{}. {} ({})".format(rank+1, title, similarity))
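
If the full pairwise matrix is too large for the available memory, one alternative is to compare a single query article against all others. The following is a minimal standard-library sketch of that idea on toy vectors; the function names and vectors are made up for illustration, and on the real corpus a vectorized NumPy version would be much faster.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(query_idx, vectors, top_n=10):
    """Return the top_n (index, similarity) pairs, excluding the query itself."""
    query = vectors[query_idx]
    scores = [
        (idx, cosine_similarity(query, vec))
        for idx, vec in enumerate(vectors)
        if idx != query_idx
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_n]


# Toy 2-dimensional "document vectors", for illustration only.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = most_similar(0, toy_vectors, top_n=2)
for rank, (idx, similarity) in enumerate(neighbours):
    print("{}. doc {} ({:.3f})".format(rank + 1, idx, similarity))
```

This needs only one row of similarities at a time, at the cost of recomputing them for every new query.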

These are just a few examples. For the next steps, it is worth thinking about how to model similarity in different ways, using different representations or metadata.