This version is for testing Textacy.

# Textacy Examples

In [1]:
import textacy

Efficiently stream documents from disk into a processed corpus:

In [2]:
cw = textacy.corpora.CapitolWords()
docs = cw.records(speaker_name={'Hillary Clinton', 'Barack Obama'})
content_stream, metadata_stream = textacy.fileio.split_record_fields(docs, 'text')
corpus = textacy.Corpus('en', texts=content_stream, metadatas=metadata_stream)
corpus

Corpus(1241 docs; 859498 tokens)

Represent corpus as a document-term matrix, with flexible weighting and filtering:

In [3]:
doc_term_matrix, id2term = textacy.vsm.doc_term_matrix(
     (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
      for doc in corpus),
     weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95) 
print(repr(doc_term_matrix))

<1241x11328 sparse matrix of type '<class 'numpy.float64'>'
	with 211901 stored elements in Compressed Sparse Row format>
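
The weighting and filtering options used in the call above (`tfidf`, `smooth_idf`, `min_df`, `max_df`, `normalize`) can be sketched in plain Python. This is an illustration only, not textacy's implementation: the helper `tfidf_matrix` is hypothetical, `min_df` is treated as an absolute document count and `max_df` as a fraction of documents, and the smoothed idf follows the scikit-learn convention `ln((1 + n) / (1 + df)) + 1`, which may differ from what textacy computes.

```python
import math
from collections import Counter


def tfidf_matrix(docs, min_df=1, max_df=1.0):
    """Return (rows, vocab): one {term: weight} dict per document, L2-normalized."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    # Keep terms seen in at least `min_df` docs and at most `max_df` (fraction) of docs.
    vocab = sorted(t for t, n in df.items()
                   if n >= min_df and n / n_docs <= max_df)
    # Smoothed idf (assumed convention, see the note above).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for doc in docs:
        tf = Counter(t for t in doc if t in idf)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2 normalization; an empty row keeps a norm of 1.0 to avoid dividing by zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows, vocab


toy_docs = [
    ['senate', 'vote', 'health'],
    ['senate', 'iraq', 'war'],
    ['senate', 'health', 'care'],
]
rows, vocab = tfidf_matrix(toy_docs, min_df=2, max_df=0.95)
print(vocab)
print(rows)
```

In the toy corpus, `senate` appears in every document and is dropped by `max_df=0.95`, while terms seen in only one document are dropped by `min_df=2`.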


Train and interpret a topic model:

In [4]:
model = textacy.tm.TopicModel('nmf', n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix.shape

(1241, 10)

In [5]:
for topic_idx, top_terms in model.top_topic_terms(id2term, top_n=10):
    print('topic', topic_idx, ':', '   '.join(top_terms))

topic 0 : new   's   child   people   â€   need   year   today   work   york
topic 1 : rescind   quorum   order   consent   unanimous   ask   president   mr.   madam   objection
topic 2 : dispense   reading   consent   unanimous   amendment   ask   president   mr.   madam   presiding
topic 3 : virginia   west virginia   west   senator   yield   question   thank   objection   inquiry   massachusetts
topic 4 : senators   desiring   chamber   vote   ridership   expedited   reassurance   displace   reliant   confidence
topic 5 : amendment   pending   aside   set   unanimous   ask   consent   mr.   president   desk
topic 6 : health   care   mental   patient   child   quality   medical   information   hospital   americans
topic 7 : iraq   iraqi   war   troop   iraqis   american   u.s.   afghanistan   military   policy
topic 8 : senate   thursday   wednesday   session   unanimous   consent   authorize   p.m.   committee   september
topic 9 : medicare   drug   senior   medicaid   prescription 
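
Listing the top terms per topic amounts to sorting each row of a topic-term weight matrix and mapping the column indices back to terms. A minimal sketch, assuming a plain list-of-lists weight matrix; the helper name `rank_topic_terms` and the toy weights are made up for illustration and are not part of textacy:

```python
def rank_topic_terms(topic_term_weights, id2term, top_n=3):
    """Yield (topic_idx, terms) with the highest-weighted terms first."""
    for topic_idx, weights in enumerate(topic_term_weights):
        # Column indices sorted by descending weight within this topic.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        yield topic_idx, [id2term[i] for i in ranked[:top_n]]


# Toy vocabulary and weights (illustrative values only).
id2term = {0: 'health', 1: 'care', 2: 'iraq', 3: 'war', 4: 'senate'}
toy_weights = [
    [0.9, 0.7, 0.0, 0.1, 0.2],
    [0.0, 0.1, 0.8, 0.6, 0.3],
]
for topic_idx, terms in rank_topic_terms(toy_weights, id2term, top_n=3):
    print('topic', topic_idx, ':', '   '.join(terms))
```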

Basic indexing as well as flexible selection of documents in a corpus:

In [6]:
obama_docs = list(corpus.get(
    lambda doc: doc.metadata['speaker_name'] == 'Barack Obama'))
len(obama_docs)

411

In [8]:
doc = corpus[-1]
doc

Doc(3002 tokens; "In the Federalist Papers, we often hear the ref...")

Preprocess plain text, or highlight particular terms in it:

In [9]:
textacy.preprocess_text(doc.text, lowercase=True, no_punct=True)[:70]

'in the federalist papers we often hear the reference to the senates ro'

In [10]:
textacy.text_utils.keyword_in_context(doc.text, 'America', window_width=35)

g on this tiny piece of Senate and  America n history. Some 10 years ago, I ask
o do the hard work in New York and  America , who get up every day and do the v
say: You know, you never can count  America  out. Whenever the chips are down, 
 what we know will give our fellow  America ns a better shot at the kind of fut
aith in this body and in my fellow  America ns. I remain an optimist, that Amer
ricans. I remain an optimist, that  America 's best days are still ahead of us.
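
The keyword-in-context output above matches the keyword as a substring (note the hits inside "American" and "Americans") and shows a fixed number of characters on either side. The same idea can be sketched with the standard library; the helper `kwic` below is hypothetical and only approximates the display format:

```python
import re


def kwic(text, keyword, window_width=35):
    """Return (left, match, right) context tuples for each occurrence of keyword."""
    rows = []
    # re.escape makes the keyword a literal, so this is a plain substring search.
    for m in re.finditer(re.escape(keyword), text):
        left = text[max(0, m.start() - window_width):m.start()]
        right = text[m.end():m.end() + window_width]
        rows.append((left, m.group(), right))
    return rows


sample = "I remain an optimist, that America's best days are still ahead of us."
kwic_rows = kwic(sample, 'America', window_width=20)
for left, match, right in kwic_rows:
    # Right-align the left context so the keyword lines up in a column.
    print('{:>20}  {}  {}'.format(left, match, right))
```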


Extract various elements of interest from parsed documents:

In [11]:
list(textacy.extract.ngrams(
    doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))[:25]

[Federalist Papers,
 Senate's,
 's role,
 violent passions,
 pernicious resolutions,
 everlasting credit,
 common ground,
 8 years,
 tiny piece,
 American history,
 10 years,
 years ago,
 New York,
 fellow New,
 New Yorkers,
 skeptics wondering,
 stalwart supporters,
 Senator Barbara,
 Barbara Mikulski,
 new colleagues,
 August recess,
 Pennsylvania Avenue,
 Senate went,
 extraordinary pressure,
 constituency work]

In [12]:
list(textacy.extract.ngrams(
    doc, 3, filter_stops=True, filter_punct=True,))

[hear the reference,
 Senate's role,
 avert the consequences,
 sudden and violent,
 intemperate and pernicious,
 credit and wisdom,
 years of service,
 piece of Senate,
 Senate and American,
 10 years ago,
 asked the people,
 people of New,
 economy has grown,
 grown more interconnected,
 world more interdependent,
 fellow New Yorkers,
 supporters and people,
 Senator Barbara Mikulski,
 kind of read,
 road and set,
 colleagues will eventually,
 eventually be able,
 able to enjoy,
 end of Pennsylvania,
 went on recess,
 thrilled and relieved,
 August recess roll,
 returned in 2001,
 Nation was attacked,
 attacked on 9/11,
 toll was devastating,
 devastating and New,
 New York bore,
 bore the heaviest,
 Nearly 3,000 lives,
 lives were lost,
 World Trade Center,
 Center in ruins,
 cloud of debris,
 debris and poison,
 remember the rallying,
 rallying of support,
 support and sense,
 sense of common,
 represented here showed,
 words but specific,
 Senators sent staff,
 sent staff members,


In [13]:
list(textacy.extract.named_entities(
    doc, drop_determiners=True, exclude_types='numeric'))[:100]

[Senate,
 Senate,
 American,
 New York,
 New Yorkers,
 Senate,
 Barbara Mikulski,
 Senate,
 Pennsylvania Avenue,
 Senate,
 Washington,
 States,
 New York,
 State,
 World Trade Center,
 States,
 New York,
 New Yorkers,
 Robert Byrd,
 State,
 New York,
 Chuck Schumer,
 New York,
 Manhattan,
 Chuck,
 FEMA,
 La Guardia,
 Manhattan,
 Hudson River,
 Federal,
 World Trade Center,
 Dante,
 Chuck,
 Schumer,
 Oval Office,
 Bush,
 New York's,
 Schumer,
 Senate,
 Chuck,
 New York,
 Inouye,
 Cochran,
 Appropriations Committee,
 Harkin,
 Murray â€,
 New York,
 New York,
 New York,
 Kent Conrad,
 Byron Dorgan,
 Tom Harkin,
 Max Baucus,
 New York,
 New York,
 New York,
 New York,
 America,
 North Country,
 eBay,
 New York,
 Corning,
 Jamestown,
 Rochester,
 State,
 Bioinformatics,
 Buffalo,
 Stanley Theater,
 Utica,
 New York,
 Seneca Falls,
 National Women's Hall of Fame,
 Women's Rights Convention,
 Senate,
 America,
 New Yorkers,
 Americans,
 State,
 Senate,
 Democratic,
 Republican,
 New York,
 Em

In [14]:
pattern = textacy.constants.POS_REGEX_PATTERNS['en']['NP']
pattern

'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'

In [15]:
list(textacy.extract.pos_regex_matches(doc, pattern))[:10]

[the Federalist Papers,
 the reference,
 the Senate's role,
 the consequences,
 sudden and violent passions,
 intemperate and pernicious resolutions,
 the everlasting credit,
 wisdom,
 our Founders,
 an effort]

In [16]:
list(textacy.extract.semistructured_statements(doc, 'I', cue='be'))

[(I, was, on the other end of Pennsylvania Avenue),
 (I, was, , a very new Senator, and my city and my State had been devastated),
 (I, am, grateful to have had Senator Schumer as my partner and my ally),
 (I, am, very excited about what can happen in the next 4 years),
 (I, been, a New Yorker, but I know I always will be one)]

In [20]:
textacy.keyterms.textrank(doc, n_keyterms=5)

[('new', 0.009941897243924757),
 ('senator', 0.007506064460010034),
 ('york', 0.0066900884741152135),
 ('day', 0.005906363319080454),
 ('senate', 0.005810906203697855)]
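
TextRank-style key term scores come from running PageRank over a word co-occurrence graph. The sketch below shows the general technique only: the helper `simple_textrank`, the adjacent-word window, and the damping factor of 0.85 are assumptions, and textacy's preprocessing and edge weighting are not reproduced.

```python
from collections import defaultdict


def simple_textrank(words, window=2, damping=0.85, n_iter=50):
    """Rank words by PageRank on an undirected co-occurrence graph."""
    # Link each word to the words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)
    nodes = sorted(neighbors)
    score = {w: 1.0 / len(nodes) for w in nodes}
    # Power iteration: each word shares its score equally among its neighbors.
    for _ in range(n_iter):
        score = {
            w: (1 - damping) / len(nodes)
            + damping * sum(score[v] / len(neighbors[v]) for v in neighbors[w])
            for w in nodes
        }
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


toy_words = ['new', 'york', 'senator', 'new', 'york', 'senate', 'new', 'day']
ranked_terms = simple_textrank(toy_words)
print(ranked_terms[:3])
```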

Compute common statistical attributes of a text:

In [17]:
textacy.text_stats.readability_stats(doc)

{'automated_readability_index': 12.67505954743946,
 'coleman_liau_index': 9.877273770543866,
 'flesch_kincaid_grade_level': 10.76065815799921,
 'flesch_readability_ease': 62.78013134180233,
 'gunning_fog_index': 13.601208416038112,
 'n_chars': 11504,
 'n_polysyllable_words': 222,
 'n_sents': 100,
 'n_syllables': 3528,
 'n_unique_words': 1108,
 'n_words': 2519,
 'smog_index': 11.640900244366641}
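
Several of the indices above can be recomputed from the raw counts in the same dictionary using the standard published formulas. A short sketch, assuming those formulas (the helper `readability_from_counts` is hypothetical); with the counts shown above it should agree with the reported values to several decimal places:

```python
import math


def readability_from_counts(n_words, n_sents, n_syllables, n_polysyllable_words):
    """Recompute readability indices from basic counts (standard formulas)."""
    words_per_sent = n_words / n_sents
    syllables_per_word = n_syllables / n_words
    return {
        'flesch_kincaid_grade_level':
            0.39 * words_per_sent + 11.8 * syllables_per_word - 15.59,
        'flesch_readability_ease':
            206.835 - 1.015 * words_per_sent - 84.6 * syllables_per_word,
        'gunning_fog_index':
            0.4 * (words_per_sent + 100.0 * n_polysyllable_words / n_words),
        'smog_index':
            1.0430 * math.sqrt(30.0 * n_polysyllable_words / n_sents) + 3.1291,
    }


# Counts copied from the readability_stats output above.
recomputed = readability_from_counts(
    n_words=2519, n_sents=100, n_syllables=3528, n_polysyllable_words=222)
for name, value in sorted(recomputed.items()):
    print(name, round(value, 4))
```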

Count terms individually, and represent documents as a bag-of-terms with flexible weighting and inclusion criteria:

In [18]:
doc.count('New')

22

In [19]:
bot = doc.to_bag_of_terms(ngrams={2, 3}, as_strings=True)
sorted(bot.items(), key=lambda x: x[1], reverse=True)[:10]

[('new york', 18),
 ('senate', 8),
 ('first', 6),
 ('state', 4),
 ('america', 3),
 ('look forward', 3),
 ('new yorkers', 3),
 ('chuck', 3),
 ('senator schumer', 2),
 ('world trade center', 2)]
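
A bag of terms is a mapping from each term to its count in the document. The plain-Python sketch below counts lowercased n-grams of chosen sizes with `collections.Counter`. The helper `simple_bag_of_terms` is hypothetical and leaves out the named entities and filtering that `to_bag_of_terms` applies, so it will not reproduce the counts above.

```python
from collections import Counter


def simple_bag_of_terms(tokens, ngram_sizes=(1,)):
    """Count lowercased n-grams of the requested sizes."""
    lowered = [t.lower() for t in tokens]
    bag = Counter()
    for n in ngram_sizes:
        for i in range(len(lowered) - n + 1):
            bag[' '.join(lowered[i:i + n])] += 1
    return bag


bag_tokens = 'New York and New Yorkers look forward to New York'.split()
toy_bag = simple_bag_of_terms(bag_tokens, ngram_sizes=(2, 3))
print(sorted(toy_bag.items(), key=lambda x: x[1], reverse=True)[:3])
# Counting a single exact token, comparable in spirit to doc.count('New').
print(bag_tokens.count('New'))
```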

# Data

In [21]:
import pandas as pd
import os
import codecs

intermediate_directory = os.path.join('.', 'intermediate') # Directory for storing data

In [22]:
df = pd.read_pickle('df_save.pck')
df = df[df['Language'].isin(['en'])]
# Combine title and contents
df['Text'] = df['Title'] + '. ' + df['Content']
df.head()

Unnamed: 0,Date,ArticleCode,Language,Title,Content,Text
0,2015-01-05,1,en,WTO celebrates 20 years of helping global econ...,"GENEVA, Jan 1 (KUNA) -- The World Trade Organi...",WTO celebrates 20 years of helping global econ...
2,2015-01-05,3,en,"Delegates of 45 nations,WTO Chief to be at CII...","New Delhi, Jan 2 (PTI) Over 1,000 delegates fr...","Delegates of 45 nations,WTO Chief to be at CII..."
3,2015-01-05,4,en,"Compliance Rulings On Aircraft, Tuna, COOL To ...","World Trade Online Posted: December 30, 2014 A...","Compliance Rulings On Aircraft, Tuna, COOL To ..."
4,2015-01-05,5,en,Sudan making new push at WTO membership: official,"Sudan Tribune 4 January 2015 January 3, 2015 (...",Sudan making new push at WTO membership: offic...
5,2015-01-05,6,en,China Ends Rare-Earth Minerals Export Quotas,China Ends Decade-Old Quota System Limiting Ex...,China Ends Rare-Earth Minerals Export Quotas. ...


In [23]:
# Normalize words
replacements = {
    'Text': {
        r'\'s': '', 
        'Indian': 'India', 
        'nextgeneration': 'next generation', 
        '//iconnect\.wto\.org/': '', 
        '-': ' ', 
        'U.S.': 'United States', 
        ' US ': 'United States', 
        'S.Korea': 'South Korea', 
        'S. Korea': 'South Korea', 
        'WTO': 'world trade organization', 
        '‘': '', 
        'imports': 'import', 
        'Imports': 'import', 
        'exports': 'export', 
        'Exports': 'export', 
        'NZ ': 'New Zealand ', 
        '\"': '',
        '\'': '', 
    }
}
df.replace(replacements, regex=True, inplace=True)
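
With `regex=True`, pandas treats every key in the mapping as a regular expression. This has two side effects in the cell above: the dots in `'U.S.'` match any character, and the padded key `' US '` swallows the spaces on both sides of the match. The second effect is consistent with tokens such as `byUnited Statesvisitors` in later output. The sketch below reproduces both effects with `re.sub` as a stand-in for the DataFrame call, then shows one possible stricter spelling. The sample sentence and the stricter patterns are assumptions for illustration, not part of the original pipeline.

```python
import re

raw = 'Fees paid by US visitors. UPS and U.S. firms export goods.'

# Patterns as written in the cell: '.' is a wildcard, and ' US ' consumes
# the surrounding spaces.
naive = re.sub(' US ', 'United States', raw)
naive = re.sub('U.S.', 'United States', naive)
print(naive)

# Hypothetical stricter patterns: word boundaries instead of padded spaces,
# and escaped dots so that only a literal "U.S." matches.
strict = re.sub(r'\bUS\b', 'United States', raw)
strict = re.sub(r'U\.S\.', 'United States', strict)
print(strict)
```

In the naive version, "UPS " is also rewritten, because `U.S.` matches "U", any character, "S", any character.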

In [24]:
texts = df['Text'].tolist()
titles = df['Title'].tolist()
dates = df['Date'].tolist()
articlecodes = df['ArticleCode'].tolist()
print(str(len(texts)) + ' texts')
print(str(len(titles)) + ' titles')
print(str(len(dates)) + ' dates')
print(str(len(articlecodes)) + ' articlecodes')

6409 texts
6409 titles
6409 dates
6409 articlecodes


# Processing

In [25]:
corpus = textacy.Corpus('en', texts=texts)
corpus

Corpus(6409 docs; 4253476 tokens)

Represent corpus as a document-term matrix, with flexible weighting and filtering:

In [26]:
doc_term_matrix, id2term = textacy.vsm.doc_term_matrix(
     (doc.to_terms_list(ngrams=1, named_entities=True, as_strings=True)
      for doc in corpus),
     weighting='tfidf', normalize=True, smooth_idf=True, min_df=2, max_df=0.95) 
print(repr(doc_term_matrix))

<6409x35516 sparse matrix of type '<class 'numpy.float64'>'
	with 1378866 stored elements in Compressed Sparse Row format>


Train and interpret a topic model:

In [27]:
model = textacy.tm.TopicModel('nmf', n_topics=10)
model.fit(doc_term_matrix)
doc_topic_matrix = model.transform(doc_term_matrix)
doc_topic_matrix.shape

(6409, 10)

In [28]:
for topic_idx, top_terms in model.top_topic_terms(id2term, top_n=10):
    print('topic', topic_idx, ':', '   '.join(top_terms))

topic 0 : china   chinese   beijing   steel   market   economy   status   dumping   eu   chinas
topic 1 : india   solar   indias   theunited   modi   delhi   sitharaman   visa   country   new delhi
topic 2 : eu   uk   britain   brexit   european   british   europe   trade   ttip   deal
topic 3 : tpp   japan   obama   pacific   trade   deal   congress   vietnam   asia   japanese
topic 4 : trade   organization   world   nairobi   doha   member   tfa   azevedo   country   agreement
topic 5 : growth   global   year   economy   bank   imf   cent   oecd   g20   economic
topic 6 : russia   ukraine   russian   ukrainian   moscow   kazakhstan   interfax   brics   sanction   organization
topic 7 : states   united states   united   canada   mexico   trade   cool   canadian   say   organization
topic 8 : trump   clinton   american   trade   nafta   republican   america   mexico   obama   trumps
topic 9 : brazil   mercosur   argentina   venezuela   brazilian   macri   uruguay   paraguay   bloc   ro

Basic indexing as well as flexible selection of documents in a corpus:

In [29]:
doc = corpus[-1]
doc

Doc(667 tokens; "UK mobile users face return of steep roaming bi...")

Preprocess plain text, or highlight particular terms in it:

In [30]:
textacy.preprocess_text(doc.text, lowercase=True, no_punct=True)[:70]

'uk mobile users face return of steep roaming bills after brexit decemb'

In [31]:
textacy.text_utils.keyword_in_context(doc.text, 'America', window_width=35)

Extract various elements of interest from parsed documents:

In [32]:
list(textacy.extract.ngrams(
    doc, 2, filter_stops=True, filter_punct=True, filter_nums=False))[:25]

[UK mobile,
 mobile users,
 users face,
 face return,
 steep roaming,
 roaming bills,
 DECEMBER 29,
 2016 Duncan,
 Duncan Robinson,
 Nic Fildes,
 London Financial,
 Financial Times,
 Times British,
 British mobile,
 mobile phone,
 phone users,
 users face,
 face bills,
 worst case,
 case scenario,
 comprehensive free,
 free trade,
 trade deal,
 roaming fees,
 past decade]

In [33]:
list(textacy.extract.ngrams(
    doc, 3, filter_stops=True, filter_punct=True,))

[UK mobile users,
 mobile users face,
 users face return,
 return of steep,
 steep roaming bills,
 bills after Brexit,
 2016 Duncan Robinson,
 Robinson in Brussels,
 Brussels and Nic,
 Fildes in London,
 London Financial Times,
 Financial Times British,
 Times British mobile,
 British mobile phone,
 mobile phone users,
 phone users face,
 users face bills,
 song they stream,
 stream while roaming,
 worst case scenario,
 UK can agree,
 agree a comprehensive,
 comprehensive free trade,
 free trade deal,
 deal after Brexit,
 EU has campaigned,
 campaigned against roaming,
 reducing what operators,
 operators can charge,
 abolition for nearly,
 nearly all users,
 reforms were repeatedly,
 prime minister David,
 minister David Cameron,
 quits the bloc,
 allowing continental carriers,
 carriers to charge,
 charge British consumers,
 like and potentially,
 potentially leaving consumers,
 bills of 10,
 10 per MB,
 commonly paid byUnited,
 paid byUnited Statesvisitors,
 arrange special data,
 s

In [34]:
list(textacy.extract.named_entities(
    doc, drop_determiners=True, exclude_types='numeric'))[:100]

[UK,
 Brexit,
 Duncan Robinson,
 Brussels,
 Nic Fildes,
 London Financial Times,
 British,
 EU,
 UK,
 Brexit,
 EU,
 David Cameron,
 Brexit,
 Britain,
 British,
 MB,
 EU,
 Spotify,
 EU,
 EU,
 Financial Times,
 France,
 Spain,
 UK,
 EU,
 European Commission,
 Gnther Oettinger,
 German,
 Mr Oettinger,
 European Parliament,
 British,
 UK,
 British,
 Spain,
 Spanish,
 Italy,
 Spain,
 Europe,
 UK,
 Three,
 TalkTalk,
 Sky,
 EU]

In [35]:
pattern = textacy.constants.POS_REGEX_PATTERNS['en']['NP']
pattern

'<DET>? <NUM>* (<ADJ> <PUNCT>? <CONJ>?)* (<NOUN>|<PROPN> <PART>?)+'

In [36]:
list(textacy.extract.pos_regex_matches(doc, pattern))[:10]

[UK,
 mobile users,
 return,
 steep roaming bills,
 Brexit,
 DECEMBER,
 2016 Duncan Robinson,
 Brussels,
 Nic Fildes,
 London Financial Times]

In [37]:
list(textacy.extract.semistructured_statements(doc, 'I', cue='be'))

[]

In [39]:
textacy.keyterms.textrank(doc, n_keyterms=10)

[('eu', 0.023616474047813325),
 ('uk', 0.018059274440052973),
 ('official', 0.01786864654486733),
 ('network', 0.017025303231299565),
 ('year', 0.016368067390966894),
 ('british', 0.015185226975152614),
 ('operator', 0.014821543750413845),
 ('deal', 0.013082494046337226),
 ('company', 0.012657655964019306),
 ('brexit', 0.012260144419567541)]

Compute common statistical attributes of a text:

In [40]:
textacy.text_stats.readability_stats(doc)

{'automated_readability_index': 15.315467054401019,
 'coleman_liau_index': 12.325261918433934,
 'flesch_kincaid_grade_level': 12.62947017518973,
 'flesch_readability_ease': 51.985979856727454,
 'gunning_fog_index': 15.881097950209233,
 'n_chars': 3048,
 'n_polysyllable_words': 80,
 'n_sents': 23,
 'n_syllables': 926,
 'n_unique_words': 305,
 'n_words': 613,
 'smog_index': 13.783426738976498}

Count terms individually, and represent documents as a bag-of-terms with flexible weighting and inclusion criteria:

In [41]:
doc.count('New')

22

In [42]:
bot = doc.to_bag_of_terms(ngrams={2, 3}, as_strings=True)
sorted(bot.items(), key=lambda x: x[1], reverse=True)[:10]

[('eu', 7),
 ('uk', 5),
 ('british', 4),
 ('free trade deal', 3),
 ('last year', 3),
 ('free trade', 3),
 ('spain', 3),
 ('trade deal', 3),
 ('brexit', 3),
 ('bilateral deal', 2)]