# Introduction

There is a perception that Twitter data can be used to surface insights: unexpected features of the data that have business value. In this tutorial, I will explore some of the difficulties and opportunities of turning that perception into reality. 

We will focus exclusively on _text_ analysis, and on insights represented by textual differences between documents and corpora. We will start by constructing a small, simple data set that represents a few notions of what insights _should_ be surfaced. We can then examine which techniques uncover which insights.

Next, we will move to real data, where we don't know what we might surface. We will have to address data cleaning and curation, both at the beginning and iteratively, as our insight generation surfaces artifacts of insufficient data curation. We will finish by developing and evaluating a variety of tools and techniques for comparing text-based data.

## Resources

Good further reading, and the source of some of the ideas here:
https://de.dariah.eu/tatom/feature_selection.html

# Setup

Requires Python 3.6 or greater

In [1]:
import itertools
import nltk
import operator
import numpy as np

In [2]:
import sklearn
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer

# A Synthetic Example

Let's build some intuition by creating two artificial documents, which represent textual differences that we might intend to surface. 

In [3]:
doc0,doc1 = ('bun cat cat dog bird','bun cat dog dog dog')

In terms of unigram frequency, here are 3 differences:
* 1 more "cat" in doc0 than in doc1
* 2 more "dog" in doc1 than in doc0
* "bird" only exists in doc0
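
These three differences can be checked by hand before reaching for a vectorizer. The snippet below is a minimal sketch using only `collections.Counter`; it assumes plain whitespace tokenization, which happens to match what `CountVectorizer` does on this toy input.

```python
from collections import Counter

doc0, doc1 = 'bun cat cat dog bird', 'bun cat dog dog dog'

# count whitespace-delimited tokens in each document
counts0 = Counter(doc0.split())
counts1 = Counter(doc1.split())

# difference in raw term counts over the joint vocabulary;
# positive values mean the token is more frequent in doc0
vocab = sorted(set(counts0) | set(counts1))
diff = {token: counts0[token] - counts1[token] for token in vocab}
print(diff)
```

A `Counter` returns 0 for missing tokens, so "bird" needs no special handling even though it is absent from doc1.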

Let's throw together a function that prints out the differences in term frequencies:

In [4]:
def func(doc0,doc1,vectorizer):
    """
    print the difference in term frequency between doc0 and doc1 for each token
    """
    tf = vectorizer.fit_transform([doc0,doc1])
    # this is a 2-row matrix, where the rows represent doc0 and doc1
    tfa = tf.toarray()
    # make tuples of the tokens and the difference of their doc0 and doc1 coefficients
    # if we use a basic token count vectorizer, this is the term frequency difference 
    tup = zip(vectorizer.get_feature_names(),tfa[0] - tfa[1])
    # print the top-10 tokens ranked by the difference measure
    for token,score in list(reversed(sorted(tup,key=operator.itemgetter(1))))[:10]:
        print(token,score)

In [5]:
func(doc0,doc1,CountVectorizer())

cat 1
bird 1
bun 0
dog -2


Observations:
* positive numbers are more "doc0-like"
* the "dog" score is higher in absolute value than the "bird" score
* "bird" and "cat" are indistinguishable

Let's try inverse-document frequency.

In [6]:
func(doc0,doc1,TfidfVectorizer())

bird 0.4976748316029239
cat 0.406688138613708
bun 0.05258839701797219
dog -0.5504342921375551


Observations:
* "bird" now has a larger coefficient than "cat"
* "dog" is still more significant than "cat"
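
To see where these coefficients come from, it can help to reproduce them without scikit-learn. The sketch below assumes the `TfidfVectorizer` defaults (raw term counts, smoothed idf of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector) and uses whitespace tokenization; `tfidf_vectors` is a name made up for this illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {token: weight} dict per document, mimicking the
    TfidfVectorizer defaults (raw counts, smoothed idf, L2 normalization)."""
    counts = [Counter(doc.split()) for doc in docs]
    n_docs = len(docs)
    vocab = set().union(*counts)
    # smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {}
    for token in vocab:
        df = sum(1 for c in counts if token in c)
        idf[token] = math.log((1 + n_docs) / (1 + df)) + 1
    vectors = []
    for c in counts:
        weights = {token: c[token] * idf[token] for token in vocab}
        # scale each document vector to unit Euclidean length
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({token: w / norm for token, w in weights.items()})
    return vectors

v0, v1 = tfidf_vectors(['bun cat cat dog bird', 'bun cat dog dog dog'])
scores = {token: v0[token] - v1[token] for token in v0}
for token, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(token, round(score, 4))
```

Only "bird" gets an idf above 1, because it is the only token missing from one of the two documents. That boost is what lifts it past "cat".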

How does this scale?

Let's construct two larger documents such that, relative to doc1:
* doc0 is +1 "cat"
* doc0 is +40 "bun"
* doc0 is +1 "bird"


In [7]:
doc0 = 'cat '*5 + 'dog '*3 + 'bun '*350 + 'bird '
doc1 = 'cat '*4 + 'dog '*3 + 'bun '*310 

In [8]:
func(doc0,doc1,CountVectorizer())

bun 40
cat 1
bird 1
dog 0


In [9]:
func(doc0,doc1,TfidfVectorizer())

bird 0.004015025079257322
cat 0.00138206928601671
bun -1.67582883906503e-05
dog -0.0011059905945811823


Observations:
* "bird" stands out strongly
* "cat" and "dog" are similar in absolute value
* "bun" is the least significant token
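
It may seem odd that "bun", with a raw difference of 40, ends up least significant. One way to see why, assuming the default L2 normalization of `TfidfVectorizer`: "bun" dominates both documents, so its normalized weight is close to 1 in each and the difference nearly cancels. The sketch below works this out by hand; the counts are those of the two documents above, and `bun_weight` is a helper made up for this illustration.

```python
import math

# raw counts from the scaled example
counts0 = {'cat': 5, 'dog': 3, 'bun': 350, 'bird': 1}
counts1 = {'cat': 4, 'dog': 3, 'bun': 310}

# smoothed idf with two documents: tokens in both docs get 1.0,
# "bird" (in one doc only) gets ln(3/2) + 1
idf = {'cat': 1.0, 'dog': 1.0, 'bun': 1.0, 'bird': math.log(3 / 2) + 1}

def bun_weight(counts):
    """L2-normalized tf-idf weight of "bun" in one document."""
    norm = math.sqrt(sum((n * idf[t]) ** 2 for t, n in counts.items()))
    return counts['bun'] * idf['bun'] / norm

w0, w1 = bun_weight(counts0), bun_weight(counts1)
print(w0, w1, w0 - w1)
```

Both weights come out just under 1, and their difference is on the order of 1e-05, in line with the "bun" score above.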

What about including 2-grams?

In [10]:
func(doc0,doc1,TfidfVectorizer(ngram_range=(1,2)))

bun bird 0.002843178755055885
bird 0.002843178755055885
cat cat 0.0012384810301358405
cat 0.0009769930065187133
bun bun 0.0001180047417965735
bun -0.0001434832818205667
dog bun -0.00026148802361712674
cat dog -0.00026148802361712674
dog dog -0.0005229760472342535
dog -0.0007844640708513807


That's impossible to read. Let's build better formatting into our function.

In [11]:
def func(doc0,doc1,vectorizer):
    tf = vectorizer.fit_transform([doc0,doc1])
    tfa = tf.toarray()
    tup = zip(vectorizer.get_feature_names(),tfa[0] - tfa[1])
    
    # print 
    max_token_length = 0
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:10]

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}") 

In [12]:
func(doc0,doc1,TfidfVectorizer(ngram_range=(1,2)))

bun bird 2.843e-03
bird     2.843e-03
cat cat  1.238e-03
cat      9.770e-04
bun bun  1.180e-04
bun      -1.435e-04
dog bun  -2.615e-04
cat dog  -2.615e-04
dog dog  -5.230e-04
dog      -7.845e-04


Observations:
* grams with "bird" still stand out
* scores are getting hard to interpret
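
The identical scores for "bird" and "bun bird" are less mysterious once the 2-grams are written out. The following is a small sketch of n-gram extraction over whitespace tokens; `ngrams` is a hypothetical helper, not the scikit-learn implementation.

```python
def ngrams(tokens, n):
    """Return the space-joined n-grams of a token list, in order."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# doc0 from the scaled example
tokens = ('cat ' * 5 + 'dog ' * 3 + 'bun ' * 350 + 'bird ').split()
bigrams = ngrams(tokens, 2)

# every 2-gram that contains the token "bird"
bird_bigrams = {gram for gram in bigrams if 'bird' in gram.split()}
print(bird_bigrams)
```

"bird" occurs once, at the very end of doc0, so the only 2-gram containing it is "bun bird". Both features therefore have a count of 1 in doc0 and 0 in doc1, and receive the same score.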

Let's get some real data.

In [13]:
import string
from tweet_parser.tweet import Tweet
from searchtweets import (ResultStream,
                           collect_results,
                           gen_rule_payload,
                           load_credentials)

search_args = load_credentials(filename="~/.twitter_keys.yaml",
                               account_type="enterprise")

In [14]:
_pats_rule = "#patriots OR @patriots"

In [15]:
_eagles_rule = "#eagles OR @eagles"

In [16]:
from_date="2018-01-28"
to_date="2018-01-29"
max_results = 3000

pats_rule = gen_rule_payload(_pats_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )
eagles_rule = gen_rule_payload(_eagles_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )

In [17]:
eagles_results_list = collect_results(eagles_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [18]:
pats_results_list = collect_results(pats_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

Join all tweet bodies in a corpus into one space-delimited document.

In [19]:
eagles_body_text = [tweet['body'] for tweet in eagles_results_list]
eagles_doc = ' '.join(eagles_body_text)

In [20]:
pats_body_text = [tweet['body'] for tweet in pats_results_list]
pats_doc = ' '.join(pats_body_text)

Let's have a look at the data (AS YOU ALWAYS SHOULD).

In [21]:
eagles_body_text[:10]

['RT @JClarkNBCS: #Eagles team hotel \n\n#FlyEaglesFly \n#SuperBowl https://t.co/B0ZUh4Kt5B',
 '@JC1053 @Eagles Don’t forget to mention the cowgirls haven’t won a playoff game since 1996.',
 'RT @JeffSkversky: #Eagles QB Nick Foles today - 1 week before the Super Bowl:\n\n"I feel really good right now. I feel calm, excited, this is…',
 'RT @JNels: The Wireless network is ready for the @Patriots and @Eagles at @NFL @MNSuperBowl2018 @SuperBowl ... @verizon boosted capacity 10…',
 '@BrunaDusi @NFLBrasil @Eagles @Patriots Mas n gosto deles, n vou torcer para eles',
 "RT @6abc: And they're off! #Eagles are on their way to #SuperBowl LII https://t.co/jJZzNkaQa8 https://t.co/FJVNXlrCis",
 'RT @JClarkNBCS: #Eagles fight song as buses arrive at team hotel\n#FlyEaglesFly \n#SuperBowl https://t.co/cIWtv8CqqA',
 'RT @Eagles: One week.\n\n#SBLII | #FlyEaglesFly https://t.co/3iMOwEfvjI',
 'RT @RPABreaks: We are approaching #SuperBowl next Sunday @Patriots vs @Eagles I will be giving this card away v

Whew...this is gonna take some cleaning.

Let's start with a tokenizer and a stopword list.

In [22]:
tokenizer = nltk.tokenize.TweetTokenizer()
stopwords = nltk.corpus.stopwords.words('english')
stopwords.extend(string.punctuation)

In [28]:
vectorizer = TfidfVectorizer(
    tokenizer=tokenizer.tokenize,
    stop_words=stopwords,
    ngram_range=(1,2)
)

Here are the top 10 1- and 2-grams for the Eagles corpus/document.

In [29]:
func(eagles_doc,pats_doc,vectorizer)

rt @eagles                            3.134e-01
@eagles                               2.961e-01
#flyeaglesfly                         2.444e-01
#sblii #flyeaglesfly                  1.964e-01
#sblii                                1.456e-01
@eagles one                           1.318e-01
week #sblii                           1.293e-01
https://t.co/3imowefvji               1.293e-01
#flyeaglesfly https://t.co/3imowefvji 1.293e-01
’ #sblii                              1.098e-01


Add the ability to specify `n` in top-`n`.

In [30]:
def compare_docs(doc0,doc1,vectorizer,n_to_display=10):
    tfm_sparse = vectorizer.fit_transform([doc0,doc1])
    tfm = tfm_sparse.toarray()
    tup = zip(vectorizer.get_feature_names(),tfm[0] - tfm[1])
    
    # print 
    max_token_length = 0
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:n_to_display]

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}") 

In [31]:
compare_docs(eagles_doc,pats_doc,vectorizer,n_to_display=30)

rt @eagles                            3.134e-01
@eagles                               2.961e-01
#flyeaglesfly                         2.444e-01
#sblii #flyeaglesfly                  1.964e-01
#sblii                                1.456e-01
@eagles one                           1.318e-01
week #sblii                           1.293e-01
https://t.co/3imowefvji               1.293e-01
#flyeaglesfly https://t.co/3imowefvji 1.293e-01
’ #sblii                              1.098e-01
https://t.co/uryvv4dxhv               1.092e-01
#flyeaglesfly https://t.co/uryvv4dxhv 1.092e-01
#eagles                               1.048e-01
https://t.co/3imowefvji rt            1.030e-01
https://t.co/uryvv4dxhv rt            8.232e-02
@eagles ’                             7.867e-02
’                                     7.201e-02
one week                              5.223e-02
one                                   5.189e-02
looks back                            2.942e-02
season                                2.

In [32]:
compare_docs(pats_doc,eagles_doc,vectorizer,n_to_display=30)

@patriots                  3.641e-01
#patriots                  2.046e-01
rt @patriots               1.623e-01
…                          1.120e-01
amendola                   7.496e-02
​                          7.416e-02
danny                      7.393e-02
reasons                    7.256e-02
reasons call               7.188e-02
playoff amendola           7.188e-02
lots reasons               7.188e-02
https://t.co/q7cank6hna    7.188e-02
danny playoff              7.188e-02
call danny                 7.188e-02
amendola 5                 7.188e-02
@patriots lots             7.188e-02
5 https://t.co/q7cank6hna  7.188e-02
vs                         5.939e-02
rt                         5.298e-02
playoff                    5.286e-02
5                          5.265e-02
call                       5.196e-02
https://t.co/q7cank6hna rt 5.169e-02
lots                       5.119e-02
lost                       4.733e-02
#notdone                   4.686e-02
#notdone network           4.655e-02
l

We can't really evaluate more sophisticated text comparison techniques without doing better filtering on the data.

In [33]:
# add token filtering to the TweetTokenizer
def filter_tokens(token):
    if len(token) < 2:
        return False
    if token.startswith('http'):
        return False
    if '’' in token:
        return False
    if '…' in token or '...' in token:
        return False
    return True
def custom_tokenizer(doc):
    initial_tokens = tokenizer.tokenize(doc)
    return [token for token in initial_tokens if filter_tokens(token)]
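
As a quick sanity check, the filter rules can be exercised on a few tokens taken from the output above. This sketch repeats `filter_tokens` so that it runs on its own, and it skips the `TweetTokenizer` step by using a hand-written token list. The invisible token in the Patriots output is assumed here to be a zero-width space.

```python
def filter_tokens(token):
    # same rules as the cell above, repeated so this snippet is self-contained
    if len(token) < 2:
        return False
    if token.startswith('http'):
        return False
    if '’' in token:
        return False
    if '…' in token or '...' in token:
        return False
    return True

# hand-picked tokens; '\u200b' is a zero-width space (an assumption about the data)
sample_tokens = ['rt', '@eagles', 'https://t.co/3imowefvji', '’', '…',
                 '\u200b', '5', 'vs', '#notdone']
kept = [token for token in sample_tokens if filter_tokens(token)]
print(kept)
```

Note that the length rule also drops single-character tokens such as "5", so some meaningful tokens are lost along with the noise.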

In [34]:
vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

In [35]:
compare_docs(eagles_doc,pats_doc,vectorizer,n_to_display=20)

rt @eagles           3.371e-01
@eagles              3.140e-01
#flyeaglesfly        2.625e-01
#sblii #flyeaglesfly 2.112e-01
#flyeaglesfly rt     1.755e-01
#sblii               1.535e-01
@eagles one          1.418e-01
week #sblii          1.390e-01
@eagles #sblii       1.180e-01
#eagles              1.107e-01
one week             5.366e-02
one                  5.301e-02
looks back           3.164e-02
season               3.030e-02
unscripted looks     3.012e-02
unscripted           3.012e-02
two unscripted       3.012e-02
start #eagles        3.012e-02
season including     3.012e-02
back start           3.012e-02


In [36]:
compare_docs(pats_doc,eagles_doc,vectorizer,n_to_display=20)

@patriots        4.065e-01
#patriots        2.282e-01
rt @patriots     1.803e-01
amendola         8.329e-02
danny            8.215e-02
reasons          8.063e-02
reasons call     7.987e-02
playoff amendola 7.987e-02
lots reasons     7.987e-02
danny playoff    7.987e-02
call danny       7.987e-02
@patriots lots   7.987e-02
rt               7.507e-02
vs               6.739e-02
playoff          5.878e-02
amendola rt      5.781e-02
call             5.775e-02
lots             5.688e-02
lost             5.261e-02
#notdone         5.218e-02


Retweets make a mess of a term frequency analysis on documents consisting of concatenated tweet bodies. Remove them for now.
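
The filter relies on the `verb` field of each tweet payload. The sketch below illustrates the idea on made-up, heavily trimmed records; it assumes that original tweets carry `verb == 'post'` and retweets carry `verb == 'share'`, and that the real payloads contain many more fields.

```python
# hypothetical, heavily trimmed tweet payloads
tweets = [
    {'verb': 'post', 'body': '@Eagles Fly Eagles Fly'},
    {'verb': 'share', 'body': 'RT @Eagles: One week.'},
    {'verb': 'post', 'body': 'Go birds @Eagles'},
]

# keep only original posts; retweets are assumed to carry verb == 'share'
originals = [tweet['body'] for tweet in tweets if tweet['verb'] == 'post']
doc = ' '.join(originals)
print(doc)
```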

In [37]:
eagles_body_text_noRT = [tweet['body'] for tweet in eagles_results_list if tweet['verb'] == 'post']
eagles_doc_noRT = ' '.join(eagles_body_text_noRT)

pats_body_text_noRT = [tweet['body'] for tweet in pats_results_list if tweet['verb'] == 'post']
pats_doc_noRT = ' '.join(pats_body_text_noRT)

vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

compare_docs(eagles_doc_noRT,pats_doc_noRT,vectorizer,n_to_display=20)
print("\n")
compare_docs(pats_doc_noRT,eagles_doc_noRT,vectorizer,n_to_display=20)

@eagles                     5.495e-01
#eagles                     1.709e-01
#flyeaglesfly               1.377e-01
#flyeaglesfly te            5.409e-02
eagles                      4.964e-02
fans                        3.324e-02
@thompicks                  3.245e-02
@ryanmjb @mancitypeter      3.245e-02
@ryanmjb                    3.245e-02
@mancitypeter @joebennett27 3.245e-02
@mancitypeter               3.245e-02
@jordanwillis67             3.245e-02
@joebennett27               3.245e-02
@byisportsoracle @ryanmjb   3.245e-02
@byisportsoracle            3.245e-02
philadelphia                2.712e-02
@thebenjohn                 2.596e-02
@joebennett27 @eagles       2.596e-02
@eliotevans26               2.596e-02
#flyeaglesfly @eagles       2.596e-02


@patriots                4.842e-01
#patriots                2.920e-01
#releasethememo          6.175e-02
shop                     5.832e-02
t-shirt shop             5.660e-02
t-shirt                  5.660e-02
pats t-shirt             5.6

Well, now we have clear evidence that the "#patriots" clause in our rule also matches political tweets (note "#releasethememo"). Let's simplify things by removing the hashtags from the rules.

In [38]:
_pats_rule = "@patriots"

In [39]:
_eagles_rule = "@eagles"

In [40]:
from_date="2018-01-28"
to_date="2018-01-29"
max_results = 20000

pats_rule = gen_rule_payload(_pats_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )
eagles_rule = gen_rule_payload(_eagles_rule,
                        from_date=from_date,
                        to_date=to_date,
                        )

In [41]:
eagles_results_list = collect_results(eagles_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [42]:
pats_results_list = collect_results(pats_rule, 
                               max_results=max_results, 
                               result_stream_args=search_args)

In [131]:
eagles_body_text_noRT = [tweet['body'] for tweet in eagles_results_list if tweet['verb'] == 'post']
eagles_doc_noRT = ' '.join(eagles_body_text_noRT)

pats_body_text_noRT = [tweet['body'] for tweet in pats_results_list if tweet['verb'] == 'post']
pats_doc_noRT = ' '.join(pats_body_text_noRT)

vectorizer = TfidfVectorizer(
    tokenizer=custom_tokenizer,
    stop_words=stopwords,
    ngram_range=(1,2),
)

compare_docs(eagles_doc_noRT,pats_doc_noRT,vectorizer,n_to_display=20)
print("\n")
compare_docs(pats_doc_noRT,eagles_doc_noRT,vectorizer,n_to_display=20)

@eagles                     7.461e-01
#flyeaglesfly               7.819e-02
eagles                      5.846e-02
@greengoblin                3.692e-02
@ike58reese                 3.536e-02
@nbcphiladelphia            3.423e-02
fans                        3.192e-02
@lanejohnson65              2.725e-02
@malcolmjenkins @eagles     2.567e-02
@eagles @eagles             2.538e-02
@joel9one                   2.514e-02
@greengoblin @lanejohnson65 2.481e-02
@jclarknbcs                 2.412e-02
@malcolmjenkins             2.371e-02
@nbcsphilly                 2.288e-02
fly                         2.110e-02
@joel9one @ike58reese       2.077e-02
@ike58reese @malcolmjenkins 2.077e-02
#eagles                     1.979e-02
@nbcphiladelphia @eagles    1.940e-02


@patriots           7.699e-01
@nfl                8.946e-02
@nfl @patriots      5.809e-02
@nfl @jaguars       3.519e-02
@jaguars            3.484e-02
@jaguars @patriots  3.045e-02
@patriots @patriots 2.953e-02
@thehall            2.549e-0

Things we could do:
* vectorize tweets as documents, and summarize or aggregate the coefficients
* select tokens for which the mean coefficient within a corpus is zero
* look at the difference in mean coefficient

Let's start by going back to simple corpora, and account for individual docs this time.

In [43]:
corpus0 = ["cat","cat dog"]
corpus1 = ["bun","dog","cat"]

In [44]:
# basic unigram vectorizer with Twitter-specific tokenization and stopwords
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)

In [45]:
# get the term-frequency matrix
m = vectorizer.fit_transform(corpus0+corpus1)
vocab = np.array(vectorizer.get_feature_names())
print(vocab)

m = m.toarray()
print(m)

['bun' 'cat' 'dog']
[[0 1 0]
 [0 1 1]
 [1 0 0]
 [0 0 1]
 [0 1 0]]


In [46]:
# get TF matrices for each corpus
corpus0_indices = range(len(corpus0))
corpus1_indices = range(len(corpus0),len(corpus0)+len(corpus1))
m0 = m[corpus0_indices,:]
m1 = m[corpus1_indices,:]
print(m0)

[[0 1 0]
 [0 1 1]]


In [47]:
# calculate the average term frequency within each corpus
c0_means = np.mean(m0,axis=0)
c1_means = np.mean(m1,axis=0)
print(c0_means)

[0.  1.  0.5]


In [48]:
# calculate the indices of the distinct tokens, which only occur in a single corpus
distinct_indices = c0_means * c1_means == 0
print(vocab[distinct_indices])

['bun']


In [49]:
print(m[:, np.invert(distinct_indices)])

[[1 0]
 [1 1]
 [0 0]
 [0 1]
 [1 0]]


In [155]:
# build and identify the corpora
docs = eagles_body_text_noRT + pats_body_text_noRT
eagles_indices = range(len(eagles_body_text_noRT))
pats_indices = range(len(eagles_body_text_noRT),len(eagles_body_text_noRT) + len(pats_body_text_noRT))

In [161]:
# use a single vectorizer because we care about the joint vocabulary
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)

dtm = vectorizer.fit_transform(docs).toarray()
vocab = np.array(vectorizer.get_feature_names())


eagles_dtm = dtm[eagles_indices, :]
pats_dtm = dtm[pats_indices, :]

Take the average coefficient for each vocab element, for each corpus.

In [186]:
# columns for every token in the vocab; rows for tweets in the corpus
eagles_means = np.mean(eagles_dtm,axis=0)
pats_means = np.mean(pats_dtm,axis=0)

Start by looking for _distinct_ tokens, which only exist in one corpus.

In [224]:
# get indices for any column with zero mean in either corpus
distinctive_indices = eagles_means * pats_means == 0

In [225]:
print(str(np.count_nonzero(distinctive_indices)) + " distinct tokens out of " + str(len(vocab)))

8073 distinct tokens out of 11759


In [226]:
eagles_ranking = np.argsort(eagles_means[distinctive_indices])[::-1]
pats_ranking = np.argsort(pats_means[distinctive_indices])[::-1]
total_ranking = np.argsort(eagles_means[distinctive_indices] + pats_means[distinctive_indices])[::-1]

In [227]:
vocab[distinctive_indices][total_ranking]

array(['@greengoblin', '@thehall', '@mark15_11', ..., 'craig', 'crazies',
       '##pitt'], dtype='<U52')

In [228]:
print("Top distinct Eagles tokens by average term count in Eagles corpus")
for token in vocab[distinctive_indices][eagles_ranking][:10]:
    print_str = f"{token:30s} {eagles_means[vectorizer.vocabulary_[token]]:.3g}"
    print(print_str)

Top distinct Eagles tokens by average term count in Eagles corpus
@greengoblin                   0.0269
@joslewis                      0.0133
@johnkincade                   0.00968
@johngaudreau03                0.00884
@nhlflames                     0.00673
@airbnb                        0.00673
@themightyerock                0.00652
@torreysmithwr                 0.00589
@smittybarstool                0.00589
@treyburton8                   0.00568


In [229]:
print("Top distinct Patriots tokens by average term count in Patriots corpus")
for token in vocab[distinctive_indices][pats_ranking][:10]:
    print_str = f"{token:30s} {pats_means[vectorizer.vocabulary_[token]]:.3g}"
    print(print_str)

Top distinct Patriots tokens by average term count in Patriots corpus
@thehall                       0.0186
@mark15_11                     0.0144
@tallguy2436                   0.0142
@bossportsextra                0.0142
@hoperugg                      0.0142
@lesliej23                     0.0139
@raider_35_24                  0.0137
@aolisn87                      0.0131
@jcapmany1231                  0.0126
@sarahlee626                   0.0126


How does this change if we account for inverse document frequency?

Let's build a function and encapsulate this.

In [230]:
def compare_corpora(corpus0,corpus1,vectorizer,n_to_display=10):
    corpus0_indices = range(len(corpus0))
    corpus1_indices = range(len(corpus0), len(corpus0) + len(corpus1))
    m_sparse = vectorizer.fit_transform(corpus0 + corpus1)
    m = m_sparse.toarray()

    vocab = np.array(vectorizer.get_feature_names())
    m_corpus0 = m[corpus0_indices,:]
    m_corpus1 = m[corpus1_indices,:]
    
    corpus0_means = np.mean(m_corpus0,axis=0)
    corpus1_means = np.mean(m_corpus1,axis=0)
    
    distinctive_indices = corpus0_means * corpus1_means == 0
    print(str(np.count_nonzero(distinctive_indices)) + " distinct tokens out of " + str(len(vocab)) + '\n')    
    
    corpus0_ranking = np.argsort(corpus0_means[distinctive_indices])[::-1]
    corpus1_ranking = np.argsort(corpus1_means[distinctive_indices])[::-1]

    print("Top distinct tokens from corpus0 by average term count in corpus")
    for token in vocab[distinctive_indices][corpus0_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus0_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)
    print()
    print("Top distinct tokens from corpus1 by average term count in corpus")
    for token in vocab[distinctive_indices][corpus1_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus1_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)    
    
    
    """
    tup = zip(vectorizer.get_feature_names(),tfm[0] - tfm[1])
    
    # print 
    max_token_length = 0
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:n_to_display]

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}") 
    """

In [231]:
#vectorizer = TfidfVectorizer(
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,vectorizer)

8073 distinct tokens out of 11759

Top distinct tokens from corpus0 by average term count in corpus
@greengoblin                   0.0269
@joslewis                      0.0133
@johnkincade                   0.00968
@johngaudreau03                0.00884
@nhlflames                     0.00673
@airbnb                        0.00673
@themightyerock                0.00652
@torreysmithwr                 0.00589
@smittybarstool                0.00589
@treyburton8                   0.00568

Top distinct tokens from corpus1 by average term count in corpus
@thehall                       0.0186
@mark15_11                     0.0144
@tallguy2436                   0.0142
@bossportsextra                0.0142
@hoperugg                      0.0142
@lesliej23                     0.0139
@raider_35_24                  0.0137
@aolisn87                      0.0131
@jcapmany1231                  0.0126
@sarahlee626                   0.0126


Now let's remove the distinctive tokens and look at the maximum _difference_ in means.

In [232]:
def compare_corpora(corpus0,corpus1,vectorizer,n_to_display=10):
    corpus0_indices = range(len(corpus0))
    corpus1_indices = range(len(corpus0), len(corpus0) + len(corpus1))
    m_sparse = vectorizer.fit_transform(corpus0 + corpus1)
    m = m_sparse.toarray()

    vocab = np.array(vectorizer.get_feature_names())
    m_corpus0 = m[corpus0_indices,:]
    m_corpus1 = m[corpus1_indices,:]
    
    corpus0_means = np.mean(m_corpus0,axis=0)
    corpus1_means = np.mean(m_corpus1,axis=0)
    
    distinctive_indices = corpus0_means * corpus1_means == 0
    print(str(np.count_nonzero(distinctive_indices)) + " distinct tokens out of " + str(len(vocab)) + '\n')    
    
    corpus0_ranking = np.argsort(corpus0_means[distinctive_indices])[::-1]
    corpus1_ranking = np.argsort(corpus1_means[distinctive_indices])[::-1]

    print("Top distinct tokens from corpus0 by average term count in corpus")
    for token in vocab[distinctive_indices][corpus0_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus0_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)
    print()
    print("Top distinct tokens from corpus1 by average term count in corpus")
    for token in vocab[distinctive_indices][corpus1_ranking][:n_to_display]:
        print_str = f"{token:30s} {corpus1_means[vectorizer.vocabulary_[token]]:.3g}"
        print(print_str)    
    
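    # remove distinct tokens
    shared_indices = np.invert(distinctive_indices)
    m = m[:, shared_indices]
    vocab = vocab[shared_indices]
    
    # recompute the means from the filtered matrix, so that they align with the filtered vocab
    corpus0_means = np.mean(m[corpus0_indices,:],axis=0)
    corpus1_means = np.mean(m[corpus1_indices,:],axis=0)
    keyness = corpus0_means - corpus1_means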
    
    """
    tup = zip(vectorizer.get_feature_names(),tfm[0] - tfm[1])
    
    # print 
    max_token_length = 0
    output_tuples = list(reversed(sorted(tup,key=operator.itemgetter(1))))[:n_to_display]

    for token,score in output_tuples:
        if max_token_length < len(token):
            max_token_length = len(token)
    for token,score in output_tuples:
        print(f"{token:{max_token_length}s} {score:.3e}") 
    """

In [233]:
vectorizer = CountVectorizer(
                    tokenizer=custom_tokenizer,
                    stop_words=stopwords,
                    ngram_range=(1,1)
)
compare_corpora(eagles_body_text_noRT,pats_body_text_noRT,vectorizer)

8073 distinct tokens out of 11759

Top distinct tokens from corpus0 by average term count in corpus
@greengoblin                   0.0269
@joslewis                      0.0133
@johnkincade                   0.00968
@johngaudreau03                0.00884
@nhlflames                     0.00673
@airbnb                        0.00673
@themightyerock                0.00652
@torreysmithwr                 0.00589
@smittybarstool                0.00589
@treyburton8                   0.00568

Top distinct tokens from corpus1 by average term count in corpus
@thehall                       0.0186
@mark15_11                     0.0144
@tallguy2436                   0.0142
@bossportsextra                0.0142
@hoperugg                      0.0142
@lesliej23                     0.0139
@raider_35_24                  0.0137
@aolisn87                      0.0131
@jcapmany1231                  0.0126
@sarahlee626                   0.0126
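
The function computes a `keyness` array but does not display it yet. As a rough sketch of what ranking by difference in means could look like, here is the same idea applied to the small corpora from earlier, using only the standard library. `mean_term_counts` is a hypothetical helper, and whitespace tokenization stands in for the custom tokenizer.

```python
from collections import Counter

def mean_term_counts(corpus):
    """Average count per document of each whitespace-delimited token."""
    totals = Counter()
    for doc in corpus:
        totals.update(doc.split())
    return {token: count / len(corpus) for token, count in totals.items()}

corpus0 = ["cat", "cat dog"]
corpus1 = ["bun", "dog", "cat"]
means0 = mean_term_counts(corpus0)
means1 = mean_term_counts(corpus1)

# keep only tokens that occur in both corpora, then rank by difference in means
shared = set(means0) & set(means1)
keyness = {token: means0[token] - means1[token] for token in shared}
for token, score in sorted(keyness.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{token:10s} {score:.3f}")
```

Here "bun" is excluded as a distinct token, and "cat" ranks above "dog" because it appears in every corpus0 document but in only one of three corpus1 documents.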


In [235]:
distinctive_indices

array([ True,  True, False, ...,  True,  True,  True])

In [241]:
d = np.array([False, True, False])
dtm.shape

(8559, 11759)