# More feature extraction: Textacy

The Natural Language Toolkit (NLTK) is the granddaddy of text mining libraries for Python, but other tools have become available since its inception. One of them is Textacy, which is built on top of spaCy. Textacy does many of the things the NLTK does, and it does more. That extra functionality makes Textacy more expressive than the Toolkit, but also more complicated to use.

In [None]:
# configure
FILE    = './texts/shakespeare-sonnets.txt'
KEYWORD = 'love'
MODEL   = 'en_core_web_sm'


In [None]:
# require
import textacy
import os
import spacy
from textacy.ke import yake


In [None]:
# slurp up some plain text...
with open( FILE ) as handle :
    data = handle.read()

# ...and do the tiniest bit of normalization ("cleaning") against it;
# collapse every run of whitespace (newlines, tabs, repeated spaces) into a single space
data = ' '.join( data.split() )


In [None]:
# perform a keyword in context (KWIC) query against the data; concordance
#result = textacy.text_utils.KWIC( data, KEYWORD )
#print( list( result ) )
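
For readers who want to see what a keyword in context (concordance) query does, the following is a minimal pure-Python sketch. It is not Textacy's implementation; the function name `kwic`, its `width` parameter, and the sample sentence are all hypothetical and exist only for illustration.

```python
import re

def kwic( text, keyword, width=30 ) :
    """Return ( left, match, right ) tuples, one per occurrence of keyword."""
    results = []
    # match the keyword as a whole word, ignoring case
    pattern = re.compile( r'\b' + re.escape( keyword ) + r'\b', re.IGNORECASE )
    for match in pattern.finditer( text ) :
        # take up to "width" characters on either side of the match
        left  = text[ max( 0, match.start() - width ) : match.start() ]
        right = text[ match.end() : match.end() + width ]
        results.append( ( left, match.group(), right ) )
    return results

# illustration only; a made-up sample rather than the configured FILE
sample = 'Love is not love which alters when it alteration finds'
for left, word, right in kwic( sample, 'love', width=12 ) :
    print( f'{left:>12} {word} {right}' )
```

Right-aligning the left-hand context lines the keyword up in a single column, which is what makes a concordance easy to scan.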


In [None]:
# create a spaCy "doc object"; depending on the size of the input, this may take a few minutes to process
size           = os.stat( FILE ).st_size
nlp            = spacy.load( MODEL  )
nlp.max_length = size 
doc            = nlp( data )


In [None]:
# peek at the doc; output a short preview of its text along with a token count
doc._.preview

In [None]:
# compute the Flesch reading ease score; higher scores denote text that is easier to read
textacy.TextStats( doc ).flesch_reading_ease
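
To make the score less of a black box, here is a sketch of the standard Flesch reading ease formula, 206.835 - 1.015 × (words per sentence) - 84.6 × (syllables per word). The function below and the counts passed to it are made up for illustration; Textacy derives its own counts from the doc, and its syllable counting may differ from a count done by hand.

```python
def flesch_reading_ease( n_words, n_sentences, n_syllables ) :
    """Compute the standard Flesch reading ease score from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# hypothetical counts; not measured from the sonnets
score = flesch_reading_ease( n_words=100, n_sentences=5, n_syllables=130 )
print( score )
```

Longer sentences and longer words both lower the score, so dense prose scores low and plain prose scores high.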

In [None]:
# list all the bigrams (two-word phrases), sans stop words and punctuation but including numbers
list( textacy.extract.ngrams( doc, 2, filter_stops=True, filter_punct=True, filter_nums=False ) )
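
The idea behind n-gram extraction with stop word filtering can be sketched in a few lines of plain Python. This is an illustration, not Textacy's code: the `ngrams` function, the tiny stop word list, and the rule of dropping any n-gram whose first or last token is a stop word are all assumptions made for the example.

```python
def ngrams( tokens, n, stops=frozenset() ) :
    """Yield n-grams (as tuples) that neither start nor end with a stop word."""
    for i in range( len( tokens ) - n + 1 ) :
        gram = tuple( tokens[ i : i + n ] )
        # skip the n-gram if its first or last token is a stop word
        if gram[ 0 ].lower() in stops or gram[ -1 ].lower() in stops :
            continue
        yield gram

# a made-up token list and a tiny, made-up stop word list
tokens  = 'shall i compare thee to a summers day'.split()
stops   = { 'shall', 'i', 'to', 'a' }
bigrams = list( ngrams( tokens, 2, stops ) )
print( bigrams )
```

Working on spaCy tokens instead of split strings, as Textacy does, additionally gives access to each token's part of speech and lemma, which plain string splitting cannot provide.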

In [None]:
# extract keywords (key terms) using the YAKE algorithm; the result is a list of term/score pairs
yake( doc )

In [None]:
# extract named entities (people, places, organizations, dates, etc.)
list( textacy.extract.entities( doc ) )

In [None]:
# extract subject-verb-object triples; who did what to whom
list( textacy.extract.subject_verb_object_triples( doc ) )

In [None]:
# extract noun chunks (noun phrases)
list( textacy.extract.noun_chunks( doc ) )

In [None]:
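# extract statements in which the given entity is followed by a form of the cue verb; think "Jove is..."
list( textacy.extract.semistructured_statements( doc, entity='Jove', cue='be' ) )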