# Logistic Regression assignment


This is the **Code Help Notebook** for the Logistic Regression assignment.

The goal of this assignment is to introduce you to the idea of a **Maximum Entropy Language Model**,
that is, a Logistic Regression (LR) model that tries to predict the next word on the basis of features of the word's linguistic context.  We will look at two versions of the model.  One is a simple bigram model formulated as an LR model: the only context feature is the word immediately preceding the word we are predicting
(which we will call the **target**).  The other is a bigram model augmented with **trigger word features**: words that are known triggers for other words, and which may be found arbitrarily far from the target.  The idea of building an LR language model is due to [Rosenfeld (1994)](https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/roni/papers/me-csl-revised.pdf) and we follow his method for finding trigger words.
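
For reference, a multiclass LR (maximum entropy) model of this kind is usually written as below. The notation is assumed here for illustration and is not taken from the assignment: $f(h)$ is the feature vector of a history $h$, $\theta_w$ and $b_w$ are the weight vector and bias for candidate word $w$, and $W$ is the set of target words.

$$
P(w \mid h) = \frac{\exp\left(\theta_w \cdot f(h) + b_w\right)}{\sum_{w' \in W} \exp\left(\theta_{w'} \cdot f(h) + b_{w'}\right)}
$$

In the bigram version, $f(h)$ has a single nonzero entry, the one for the preceding word; the trigger version adds one entry for each trigger word found earlier in the history.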

The code in the section entitled **Preparing the filtered corpus** must all be run in order for
this notebook to work.

The code in the section entitled **Finding triggers using Mutual Information (Rosenfeld 1994)**
does not need to be run (it takes a while).  You can circumvent it by using the value assigned to
`triggers` at the end of that section. The correct value for `triggers` (a set of 150 words) is spelled out there.

## Preparing the filtered corpus

To restrict the number of parameters in our model and yet demonstrate the ideas on realistic data, we are going to build an LR language model on a **filtered** data set.  We will model only noun co-occurrences, and we will limit our model to nouns occurring over 100 times in the Brown corpus. That is, given that Brown is about 1.2 million words,
we will look at nouns with a relative frequency greater than about $100 / 1{,}200{,}000 \approx 8.3 \times 10^{-5}$.

This gives us a vocabulary of between 600 and 700 words.

In [None]:
from sklearn.metrics.cluster import contingency_matrix
from sklearn.metrics import mutual_info_score
import numpy as np

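Corpus preparation: the effect of the vocabulary filter (frequency threshold $k$ and part of speech) on the number of sentences, the vocabulary size $V$, and the number of tokens $N$: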

$$
\begin{array}{rcrrr}
k & pos & \# sents & V & N\\
\hline
200 & \text{N} & 28,468  & 258 & 21,709\\
\mathbf{100} &  \text{N}  & \mathbf{19,243} & \mathbf{615} & \mathbf{55,792} \\
 1 &        & 57,340  & 49,815 & 1,161,192\\
\end{array}
$$

In [28]:
from nltk import FreqDist
from nltk.corpus import brown

def relevant(w, tag, fd=None, k=100, pos_chars=None):
    """
    pos_chars is usually 'N' or 'V' or 'NV', to select the nominal and/or verbal pos sets of Brown
    """
    if pos_chars is not None and not tag[0] in pos_chars:
        return False
    if fd is not None:
        return fd[w] > k
    else:
        return True


bw = [w.lower() for (w,t) in brown.tagged_words()]
print(len(bw))
fd = FreqDist(bw)
tagged_sents = brown.tagged_sents()

###############   Filtered vocab #####################################################################

#k=200,pos_chars="N"
#k=100, pos_chars = "N"
k, pos_chars, filtered_vocab = 100, "N", True

if filtered_vocab:
    f_bw = [w.lower() for (w,t) in brown.tagged_words() if relevant(w,t,fd,k=k,pos_chars=pos_chars)]
    f_fd = FreqDist(f_bw)
    #print(len(bw))
    f_sents = [[w.lower() for (w,t) in sent if relevant(w, t,f_fd,k=k,pos_chars=pos_chars)] for sent in tagged_sents]
    f_sents = [s for s in f_sents if len(s)>1]
    f_vocab=set(f_fd.keys())
    f_V = len(f_vocab)
    print("Filtered info: ","Sents", len(f_sents),"Vocab", f_V, "N", sum(len(s) for s in f_sents))

###############   End filtered vocab ###################################################################

sents = [[w.lower() for (w,t) in sent] for sent in tagged_sents]
vocab=set(fd.keys())
sents = [s for s in sents if len(s)>1]
num_events = len(sents)
V = len(vocab)
num_tokens = sum(len(s) for s in sents)
#print(num_events,V)
print("Base info: ","Sents", num_events,"Vocab", V, "N", num_tokens)

1161192
Filtered info:  Sents 19243 Vocab 615 N 55792
Base info:  Sents 57013 Vocab 49815 N 1160865


For ease of computation, the corpus has been filtered to include only nouns with frequency greater than 100 in the Brown corpus:

The frequency threshold and the part of speech used for `f_vocab`:

In [None]:
k,pos_chars

(100, 'N')

In [None]:
len(f_fd)

615

In [None]:
len(fd)

49815

In [None]:
len(f_sents)

19243

In [None]:
len(tagged_sents)

57340

In [None]:
fd["time"]

1598

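Sentences of length 1 have been filtered out of `f_sents` too. Note also that `f_fd` counts only the tokens that pass the `relevant` filter (for example, only noun-tagged occurrences), so the filtered frequencies of some words are lower than their raw frequencies: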

In [None]:
f_fd["time"]

1555

In [None]:
fd["language"]

109

In [None]:
f_fd["language"]

106

Pick the 100 most frequent words in the filtered vocabulary. These are the words to predict.

In [None]:
target_list = f_fd.most_common(100)
target_set = {w for (w,ct) in target_list}
print(len(target_set))

100


In [None]:
f_sents[:5]

In [None]:
#wd="work"
#sent = class_sents[wd][0]
#if not wd== sent[0]:
#   print(sent[:sent.index(wd)])


['issue', 'state', 'sales']


Make a dictionary such that each key is a target word and
the corresponding value is a list of histories: for each Brown "filtered sentence" containing that target word, the words that precede it.
Filter out sentences in which the target word is the first word (such a sentence has no history that can be
used to predict the target word).

In [None]:
from collections import defaultdict

def make_class_sents(f_sents,target_set):
    class_sents0 = defaultdict(list)
    for sent in f_sents:
        wds = target_set.intersection(sent)
        for wd in wds:
            if not wd == sent[0]:
                class_sents0[wd].append(sent[:sent.index(wd)])
    return {wd:sents for (wd,sents) in class_sents0.items() if len(sents)>100}

#{wd:sents for (wd,sents) in class_sents0.items() if len(sents)>100}
class_sents = make_class_sents(f_sents,target_set)
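
To see what the resulting dictionary looks like, here is a minimal sketch on made-up sentences. It restates the logic of `make_class_sents`, except that the `>100` cutoff is replaced by a hypothetical `min_count` parameter (not part of the notebook) so that a tiny input survives the filter.

```python
from collections import defaultdict

def make_class_sents_toy(f_sents, target_set, min_count=0):
    # Same logic as make_class_sents, with the >100 cutoff replaced by a
    # hypothetical min_count parameter so that tiny inputs survive.
    class_sents0 = defaultdict(list)
    for sent in f_sents:
        for wd in target_set.intersection(sent):
            if wd != sent[0]:
                # History = everything before the FIRST occurrence of wd
                class_sents0[wd].append(sent[:sent.index(wd)])
    return {wd: hists for (wd, hists) in class_sents0.items() if len(hists) > min_count}

toy_sents = [["state", "tax", "time"],
             ["time", "law"],                 # "time" is first: no history for it
             ["law", "time", "tax", "time"]]  # only the first "time" is used
toy = make_class_sents_toy(toy_sents, {"time", "tax"})
print(toy["time"])
print(toy["tax"])
```

Note that a sentence can contribute a history to more than one target word, and that a repeated target word contributes only the history before its first occurrence.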

In [None]:
#len(class_sents["work"])  276
# len(class_sents) 91
# so we actually only have 91 classes to predict
# total_num_class_sents = sum(1 for sents in class_sents.values() for sent in sents)
# total_num_class_sents 16026
# so we have 16K  histories to split into

16026

We have 91 target words.

In [None]:
len(class_sents)

91

In [None]:
# number of sentences/wds in the filtered corpus

print("sents", sum(1 for sents in class_sents.values() for s in sents))
print("words", sum(1 for sents in class_sents.values() for s in sents for wd in s))

sents 16026
words 28714


In [None]:
#for s in sents[:25]:
#    print(s)

## Finding triggers  using Mutual Information (Rosenfeld 1994)

## Mutual information calculation

This code takes a while to run.  You can run it if you like, or you can just use the value
of `triggers`, the set of words being computed in this section, which is assigned the right value at the end of the **Trigger vocab calculation** section.

The central concept is that of a **trigger word**: a word whose occurrence in a sentence makes the occurrence of some other word in that sentence more likely.

For the MI calculation, each word is assigned a vector recording how many times it occurs in each sentence:

In [None]:

f_encoder = {wd:i for (i,wd) in enumerate(f_vocab)}


def get_mi_word_vecs (sents,V,encoder):
    vecs = np.zeros((V,len(sents)),dtype=int)
    for (j,sent) in enumerate(sents):
        cts = FreqDist(sent)
        for (wd,ct) in cts.items():
            vecs[encoder[wd],j] = ct
    return vecs

f_vecs = get_mi_word_vecs (f_sents,f_V,f_encoder)

The shape is `V` x `len(f_sents)`:

In [None]:
f_vecs.shape

(615, 19243)

Now we can get the mutual information of two words by taking the mutual information score
of their two word vectors.  So we compute all the pairwise MI scores for the filtered
vocab.

In [27]:
import time

f_enc_pairs = list(f_encoder.items())
#mis0 = np.zeros((V,V))
#mis = Triangular2DArray(mis0)
#vecs[0]

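def get_mis (V,enc_pairs,vecs,batch_sz=50):
    mis = np.zeros((V,V))
    # Binarize: we only care whether a word occurs in a sentence, not how often
    vecs_b = (vecs > 0)
    for (idx,(wd,i)) in enumerate(enc_pairs):
        if idx%batch_sz == 0:
            print(f"Processing word {idx} {time.ctime()}")
        # MI is symmetric, so only pairs (i,j) with j >= i are computed;
        # entries below the diagonal of mis stay 0
        for (wd2,j) in enc_pairs[idx:]:
            mis[i,j] = mutual_info_score(vecs_b[i],vecs_b[j])
    return mis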

print(time.ctime())
mis = get_mis (f_V,f_enc_pairs,f_vecs)
print(time.ctime())
#word_vecs = vecs.sum(axis=0)
#del vecs

Tue Feb 24 22:32:44 2026
Processing word 0 Tue Feb 24 22:32:44 2026
Processing word 50 Tue Feb 24 22:33:41 2026
Processing word 100 Tue Feb 24 22:34:33 2026
Processing word 150 Tue Feb 24 22:35:20 2026
Processing word 200 Tue Feb 24 22:36:02 2026
Processing word 250 Tue Feb 24 22:36:40 2026
Processing word 300 Tue Feb 24 22:37:13 2026
Processing word 350 Tue Feb 24 22:37:41 2026
Processing word 400 Tue Feb 24 22:38:04 2026
Processing word 450 Tue Feb 24 22:38:23 2026
Processing word 500 Tue Feb 24 22:38:37 2026
Processing word 550 Tue Feb 24 22:38:45 2026
Processing word 600 Tue Feb 24 22:38:49 2026
Tue Feb 24 22:38:49 2026
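
To make a single entry of `mis` concrete, here is a minimal sketch that computes the mutual information of two binary occurrence vectors by hand. The function `binary_mi` and the toy vectors are made up for illustration; the function is meant to mirror what `mutual_info_score` computes for two binary vectors (both use natural logarithms), written in plain Python so each term of the sum is visible.

```python
from collections import Counter
from math import log

def binary_mi(x, y):
    """Mutual information (in nats) between two equal-length 0/1 sequences."""
    n = len(x)
    joint = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    # Sum over the (x, y) value pairs that actually occur
    return sum((c / n) * log((c / n) / ((px[a] / n) * (py[b] / n)))
               for ((a, b), c) in joint.items())

# One entry per sentence: 1 if the word occurs in that sentence, else 0
men   = [1, 1, 0, 0, 1, 0, 0, 0]
women = [1, 1, 0, 0, 0, 0, 0, 0]   # mostly co-occurs with "men"
tax   = [1, 0, 1, 0, 1, 0, 1, 0]   # only weakly related to "men"

print(binary_mi(men, women), binary_mi(men, tax))
```

The first score comes out larger than the second, which is the property the trigger search relies on. The score of a vector with itself is just its entropy, which is why self-pairs have to be excluded before looking for the best partner.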


In [29]:
mis[:,0].max()

np.float64(0.055420455303496284)

In [30]:
mi_max_vals = mis.max(axis=0)
#mi_max_vals

In [32]:
mis[f_encoder["man"]]

array([0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
       0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00,
      

##  End MI calculations

## Trigger vocab calculation

Now, for each vocab word, find its single best "trigger", the other word most strongly associated with it according to
mutual information, and keep the pair only if its score exceeds a threshold:

In [33]:
def print_triggers (trigger_pairs, top_n=None):
    if top_n is not None:
        trigger_pairs = trigger_pairs[:top_n]
    for (wd,trig,score) in trigger_pairs:
        print(f"{wd} {trig} {score:.5f}")

def get_triggers (mis, decoder, threshhold=.0002,verbose=False):
    # Zero out self triggers for now
    for i in range(mis.shape[0]):
        mis[i,i] = 0
    # Find the best trigger of word in V
    trigger_pairs = [(decoder[idx],decoder[mis[idx].argmax()]) for idx in range(mis.shape[0])]
    # as well as the MI scores
    trigger_scores = [(decoder[idx],mis[idx].max()) for idx in range(mis.shape[0])]
    # Apply threshhold
    filtered_trigger_pairs= []
    for (i,(wd, trig)) in enumerate(trigger_pairs):
        score = trigger_scores[i][1]
        if score > threshhold:
            filtered_trigger_pairs.append((wd,trig,score))
    # Sort by score
    filtered_trigger_pairs.sort(key=lambda x: x[2])
    if verbose:
        print_triggers (filtered_trigger_pairs)
    return filtered_trigger_pairs

threshhold,verbose = .0002,False
f_decoder = {i:wd for (wd,i) in f_encoder.items()}
filtered_trigger_pairs = get_triggers (mis, f_decoder,threshhold=threshhold,verbose=verbose)
(wds,triggers,scores) = zip(*filtered_trigger_pairs)
triggers = set(triggers)
print(len(triggers))

150
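
The selection step can be sketched on a toy matrix. This is an illustration under stated assumptions rather than a restatement of `get_triggers`: the function `best_triggers`, the word list, and the scores are all made up, and the matrix is assumed to be fully symmetric. In the notebook, `mis` is filled only on and above the diagonal, so each row there only sees words with a higher index.

```python
def best_triggers(mi, words, threshold):
    """For each word, pick the other word with the highest MI, if above threshold."""
    pairs = []
    for i, row in enumerate(mi):
        # Ignore the diagonal so a word cannot be its own trigger
        candidates = [(score, j) for (j, score) in enumerate(row) if j != i]
        score, j = max(candidates)
        if score > threshold:
            pairs.append((words[i], words[j], score))
    # Strongest associations first
    return sorted(pairs, key=lambda p: p[2], reverse=True)

words = ["men", "women", "tax", "year"]
# Hypothetical symmetric MI scores, made up for illustration
mi = [[0.050, 0.006, 0.000, 0.000],
      [0.006, 0.040, 0.000, 0.000],
      [0.000, 0.000, 0.030, 0.002],
      [0.000, 0.000, 0.002, 0.060]]

for (wd, trig, score) in best_triggers(mi, words, threshold=0.001):
    print(f"{wd:<8} {trig:<8} {score:.3f}")
```

Raising the threshold shrinks the list of pairs, and therefore the set of distinct trigger words, which is how the size of `triggers` is controlled.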


In [34]:
#(wds00,triggers00,scores00) = zip(*filtered_trigger_pairs)
for (w,t,s) in filtered_trigger_pairs[::-1]:
    print(f"{w:<25} {t:<25} {s:.5f}")

women                     men                       0.00613
girls                     boys                      0.00565
wife                      husband                   0.00388
view                      point                     0.00249
eyes                      face                      0.00235
tax                       year                      0.00211
mother                    father                    0.00195
costs                     cost                      0.00193
market                    stock                     0.00188
aid                       countries                 0.00187
research                  development               0.00183
door                      room                      0.00170
college                   students                  0.00170
children                  school                    0.00167
service                   costs                     0.00164
game                      ball                      0.00162
forms                     list          

An important limitation of **mutual information**:  These words have been
discovered because the occurrence of one of them in a sentence increases the likelihood of its partner occurring in that sentence.  So they're here because they occurred in the **same** sentence often,
not because they occurred in **similar** sentences often.  We will return to this issue and offer a solution when we consider embedding models of word meaning.

## Must execute the next code cell

You can run the code to find the correct value for `triggers`, or you can just use the value for `triggers` set in the next cell:

In [35]:
triggers = {'action',
 'aid',
 'amount',
 'analysis',
 'answer',
 'areas',
 'attention',
 'bed',
 'blood',
 'board',
 'boy',
 'boys',
 'business',
 'car',
 'case',
 'cent',
 'child',
 'college',
 'community',
 'company',
 'corner',
 'cost',
 'countries',
 'couple',
 'court',
 'day',
 'defense',
 'distance',
 'door',
 'effect',
 'equipment',
 'extent',
 'face',
 'fact',
 'factors',
 'faith',
 'floor',
 'forces',
 'form',
 'forms',
 'freedom',
 'friends',
 'front',
 'function',
 'future',
 'game',
 'government',
 'growth',
 'hair',
 'hands',
 'head',
 'heart',
 'history',
 'home',
 'hours',
 'husband',
 'ideas',
 'image',
 'increase',
 'industry',
 'influence',
 'issue',
 'labor',
 'land',
 'language',
 'law',
 'leaders',
 'length',
 'letter',
 'level',
 'life',
 'line',
 'literature',
 'man',
 'market',
 'material',
 'meaning',
 'means',
 'meeting',
 'member',
 'members',
 'method',
 'mind',
 'money',
 'month',
 'months',
 'morning',
 'mother',
 'mouth',
 'nations',
 'number',
 'numbers',
 'others',
 'paper',
 'parts',
 'party',
 'persons',
 'piece',
 'plane',
 'point',
 'policy',
 'pool',
 'population',
 'pressure',
 'principle',
 'problem',
 'production',
 'program',
 'programs',
 'progress',
 'property',
 'range',
 'reaction',
 'religion',
 'research',
 'respect',
 'sales',
 'school',
 'schools',
 'science',
 'season',
 'situation',
 'society',
 'son',
 'sound',
 'spirit',
 'state',
 'statement',
 'street',
 'student',
 'summer',
 'sun',
 'surface',
 'systems',
 'tax',
 'temperature',
 'terms',
 'thing',
 'town',
 'treatment',
 'truth',
 'value',
 'values',
 'war',
 'water',
 'way',
 'ways',
 'women',
 'world',
 'years'}

This should evaluate to 150 to get the results I want you to reproduce:

In [None]:
len(triggers)

150

## Make the Korpus

To make a corpus you will call the function `prepare_korpus`, defined and called in the next few code cells.

The first thing it will do is convert the filtered corpus into a set of (history, predicted word)
pairs.  This is done in the function `make_class_sents`, which takes as its arguments
the filtered corpus and the set of class words (`target_set`),
which are all frequent words
selected to guarantee there would be enough examples of each predicted word to make reasonable training possible, even in this small dataset.

The function `make_class_sents` returns a dictionary `class_sents` which has the following structure:

```python
class_wd |-> histories
```

where each `class_wd` is a word to be predicted and `histories` is a list
of (filtered) histories for which `class_wd` is the next word.

Here's an example of the contents:

In [36]:
class_sents = make_class_sents(f_sents,target_set)
# Histories followed by the word "time"
class_sents["time"][:20]

[['law'],
 ['police', 'trial'],
 ['today', 'business'],
 ['administration', 'policy'],
 ['city'],
 ['night', 'study', 'changes'],
 ['group', 'mind'],
 ['sales', 'state', 'tax'],
 ['right'],
 ['scene'],
 ['defense'],
 ['game'],
 ['sun'],
 ['points', 'years', 'school'],
 ['boy'],
 ['efforts'],
 ['party'],
 ['home'],
 ['evening'],
 ['market', 'years']]

If we pool the histories associated with all the target words, there are 16,026 histories.  That's how many histories we will train on.

In [37]:
sum(len(sents) for sents in class_sents.values())

16026

The `class_sents` dictionary is then used to create `korpus`, the array representation of all 16,026
histories, and `Y`, the corresponding 16,026 words to predict:

In [38]:
def prepare_korpus (f_sents, target_set, active_triggers):
    # Do creation of class_sents here because this code destructively modifies the sent lists
    # class_sents is a dictionary:  class_wd |-> histories
    # where each class_wd is a wd to be predicted and histories is a list of
    # (filtered) histories for which class_wd is the next word.
    class_sents = make_class_sents(f_sents,target_set)
    korpus0 = []
    final_vocab0 = set()
    for cls_wd,sents in class_sents.items():
        for sent in sents:
            sent[-1] = sent[-1] + "_b"
            final_vocab0.add(sent[-1])
            korpus0.append((sent,cls_wd))

    final_vocab = final_vocab0 | active_triggers

    #########   FINAL  PASS: Vectorize; Create korpus and Y   ########################
    final_sample_sz, final_V = (len(korpus0),len(final_vocab))
    korpus = np.zeros((final_sample_sz, final_V))
    #final_dim = final_V + len(active_triggers)
    final_encoder = {wd:i for (i,wd) in enumerate(final_vocab)}

    trigger_ct = 0
    Y = []    #np.array((final_sample_sz,))
    for (i,(sent,cls_wd)) in enumerate(korpus0):
        bigram_wd = sent[-1]
        #print(bigram_wd,final_encoder[bigram_wd])
        korpus[i,final_encoder[bigram_wd]] += 1
        these_triggers = active_triggers.intersection(sent)
        for trig in these_triggers:
            trigger_ct += 1
            korpus[i,final_encoder[trig]] += 1
        Y.append(cls_wd)
    Y=np.array(Y)
    ##########  END FINAL  PASS  #######################
    print(f"Korpus created:: korpus shape: {korpus.shape}  Y shape: {Y.shape} triggers used: {trigger_ct}")
    return korpus, korpus0, Y


In [39]:
# Make the corpus.  Must execute this cell.
korpus, korpus0, Y = prepare_korpus (f_sents, target_set, triggers)

Korpus created:: korpus shape: (16026, 508)  Y shape: (16026,) triggers used: 5511


##  Number of features for the base model (with triggers)

The number of features for this model is 508.

The `prepare_korpus` function returns three things:

1. `korpus`:  16,026 histories from the Brown corpus encoded as a 16,026x508 array, where 508 is the number of dimensions in a history vector, the encoded representation of a history.
2. `Y`:  the 16,026 target words for those history vectors. These are the words our language model will try to predict. A target word is always a word that occurred later in the same sentence as the words in its history in the original Brown corpus.
3. `korpus0`: a sequence of (history, target) pairs represented as words.

The histories in `korpus0` differ from the histories in `class_sents` only in that they're in a flat list
and the last words have been modified to have "_b" on them.  This is because one of the best trigger words for any word is itself:  once a content word with unigram probability $p$ occurs in any document, the likelihood of its occurring in the rest of the document is higher than $p$. This property is sometimes called **burstiness.**
To allow for the possibility of the same word occurring in
the history both as the last word (the bigram prefix) and earlier (as a trigger), trigger words
and words in final position in a history have different features.  `korpus0` was a convenience
in building the training data, and will play no role in training.  It does however
help when we have questions about how a particular history gave rise to its feature representation in
`korpus`.

In [40]:
korpus0[12]

(['food', 'family_b'], 'place')

Here is part of a row from `korpus`.

It is sparse in the sense of being mostly 0s (it is stored as an ordinary dense NumPy array):

In [41]:
korpus[12,:25]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0.])

But each row will have at least one non-zero value in there because the bigram word (with a "_b" at
the end) will always be an encoded feature of the history.

In [42]:
korpus[12].sum()

np.float64(1.0)

Consider how the word sequence in `korpus0[12]` is related to the 1D array `korpus[12]`.

In [43]:
korpus0[12]

(['food', 'family_b'], 'place')

There is only one active feature in row 12, because "food" is not a trigger word. Since it's not in bigram position
and it's not a trigger, we ignore it.

In [44]:
"food"  in triggers

False
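
The encoding of a single history can be sketched as follows. This is a simplified illustration of what `prepare_korpus` does for one row, not a replacement for it: the function `encode_history` and the four-feature `encoder` are hypothetical, and plain lists stand in for the NumPy row.

```python
def encode_history(history, encoder, triggers):
    """Return a feature vector (list of floats) for one history."""
    vec = [0.0] * len(encoder)
    # Bigram feature: the last word of the history, marked with "_b"
    vec[encoder[history[-1] + "_b"]] += 1
    # Trigger features: each trigger word present BEFORE the last word.
    # (In prepare_korpus the last word already carries "_b", so it never
    # matches a trigger.)
    for trig in triggers.intersection(history[:-1]):
        vec[encoder[trig]] += 1
    return vec

# Hypothetical miniature feature space: two bigram features, two triggers
encoder = {"family_b": 0, "tax_b": 1, "state": 2, "tax": 3}
triggers = {"state", "tax"}

print(encode_history(["food", "family"], encoder, triggers))        # bigram only
print(encode_history(["state", "sales", "tax"], encoder, triggers))  # bigram + "state"
print(encode_history(["tax", "state", "tax"], encoder, triggers))    # "tax" in both roles
```

The last call shows why the "_b" suffix is needed: the same word can be active both as the bigram feature and as a trigger, and the two get separate columns.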

Let's find a more interesting example.  The row with the greatest number of active features:

In [45]:
korpus.sum(axis=1).argmax()

np.int64(1377)

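When there is no trigger word, the history for the
word prediction has only one nonzero feature, the feature
for the immediately preceding word (the bigram feature).
When there is also a single trigger word, there are two nonzero
features, and so on.  The cells below use history `1484`, which had 6 trigger words in addition to the bigram word in the run this text describes. Row order in `korpus` can change from one Python session to the next (it depends on set iteration order), so in the run shown here history `1484` has only its bigram feature; use the index returned by the previous cell to see a history with several triggers.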

In [46]:
korpus[1484].sum()

np.float64(1.0)

The history and target word (word-to-predict, or class) of sample 1484:

In [47]:
korpus0[1484]

(['study', 'end_b'], 'man')

In [48]:
ts_1484 = triggers.intersection(korpus0[1484][0])
ts_1484

set()

Alas none of these trigger words has much of an association  with the given target word:

In [49]:
#  Attn:  Re-evaluate this cell only if you have computed the MI scores in this notebook session.
man_vec = mis[f_encoder["man"]]
for t in ts_1484:
    print(f"{t} {man_vec[f_encoder[t]]}")

In [50]:
sents = [[w.lower() for (w,t) in sent] for sent in tagged_sents]

Here's the original sentence from the corpus.  The target word `man` is the last word.

In [51]:
print(" ".join([w for s in sents for w in s if len(ts_1484.intersection(s))==6]))




The average row sum is approximately 1.34,
which means there are a significant number of trigger words in `korpus`.

In [52]:
Ss = korpus.sum(axis=1)
Ss.mean()

np.float64(1.3438786971171846)

##  Training the Logistic Regression classifier (with triggers)

This takes a little time, see the wall time printouts below from my Mac.  Your mileage may vary.

In [53]:
from sklearn.linear_model import LogisticRegression
import time

#     LogisticRegression(penalty='deprecated',
#                        C=1.0, l1_ratio=0.0, dual=False,
#                         tol=0.0001, fit_intercept=True,
#                         intercept_scaling=1, class_weight=None,
#                         random_state=None, solver='lbfgs',
#                         max_iter=100, verbose=0, warm_start=False, n_jobs=None)

# We want solver='saga' ('sag' will also work; both are good for large datasets)
# and l1_ratio=0. saga is also a good multiclass algorithm.
lrc = LogisticRegression(solver="saga")
print(time.ctime())
#  Training command
lrc.fit(korpus,Y)
print(time.ctime())

Tue Feb 24 22:56:36 2026
Tue Feb 24 22:57:43 2026


In [94]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

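# Each entry of korpus0 is (history, target); the last word of each history carries the "_b" suffix
X0_tokens = [feat_list for (feat_list, _) in korpus0]

X0_text = [" ".join(feats) for feats in X0_tokens]

# \w matches "_", so a token like "family_b" stays in one piece.
# Note that this counts every word in the history, not just the final "_b" word.
vec0 = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X0 = vec0.fit_transform(X0_text)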

In [95]:
import numpy as np

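lrc_bigram = LogisticRegression(max_iter=1000)
lrc_bigram.fit(X0, Y)

probs_bigram = lrc_bigram.predict_proba(X0)

# classes_ is sorted, so searchsorted maps each label in Y to its column in probs_bigram
class_idx_bigram = np.searchsorted(lrc_bigram.classes_, Y)
# Probability assigned to the correct word for each history
correct_probs_bigram = probs_bigram[np.arange(len(Y)), class_idx_bigram]

# Perplexity: exp of the average negative log probability of the correct words
perp_bigram = np.exp(-np.mean(np.log(correct_probs_bigram)))
perp_bigram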

np.float64(30.08564140162269)

In [96]:
# perp1 is the perplexity of the trigger model, computed in cell In [87] below
trigger_model_is_better = perp1 < perp_bigram
trigger_model_is_better

np.False_

In [88]:
num_features1 = korpus.shape[1]
num_features1

508

In [87]:
import numpy as np

probs = lrc.predict_proba(korpus)

class_indices = np.searchsorted(lrc.classes_, Y)

correct_probs = probs[np.arange(len(Y)), class_indices]

perp1 = np.exp(-np.mean(np.log(correct_probs)))

perp1

np.float64(37.20636413221655)

In [97]:
num_features_bigram = X0.shape[1]
num_features_bigram

716

In [98]:
perp_uniform = len(lrc.classes_)
perp_uniform

91

In [99]:
import numpy as np
from sklearn.metrics import precision_score

y_pred = lrc.predict(korpus)

precisions = precision_score(Y, y_pred, average=None)

best_index = np.argmax(precisions)

max_precision_word = lrc.classes_[best_index]

max_precision_word

np.str_('times')

## Predictions with the model

There are two different models that arise in the assignment questions, one with trigger words, one
without (the bigrams-only model).  The model you are being given in this code help notebook is
the model with triggers.

All the questions on the assignment involve predictions made **on the training set**. We can do two kinds
of predictions with our trained `lrc` model on the training set: predicting words (the classes our classifier predicts are words)
and predicting probabilities.

In [54]:
predicted_words = lrc.predict(korpus)
predicted_probs = lrc.predict_proba(korpus)

In [55]:
predicted_words[:10]

array(['years', 'year', 'man', 'time', 'year', 'school', 'place', 'place',
       'day', 'world'], dtype='<U11')

One predicted word for every history in the training set:

In [56]:
predicted_words.shape

(16026,)

Our `predicted_probs` is a 2D array with shape:

In [57]:
predicted_probs.shape

(16026, 91)

The reason for this should become clear in the discussion of `predict_proba` in the following section.


## Example of using classifiers in scikit learn (sprinkled liberally with hints for the assignment)



In [58]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups
from inspect import signature

### The data

In [59]:
newsgroups = fetch_20newsgroups()
newsgroups['target_names']

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

In [60]:
# We want a multiclass problem, so pick three of the 20 categories
categories = ['alt.atheism', 'sci.space','comp.graphics']
newsgroups_train = fetch_20newsgroups(subset='train',
                                     categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test',
                                     categories=categories)

In [61]:
print(newsgroups_train.data[0])

From: degroff@netcom.com (21012d)
Subject: Re: Venus Lander for Venus Conditions.
Organization: Netcom Online Communications Services (408-241-9760 login: guest)
Lines: 8


  I doubt there are good prospects for  a self armoring system
for venus surface conditions (several hundred degrees, very high
pressure of CO2, possibly sulfuric and nitric acids or oxides
but it is a notion to consider for outer planets rs where you might
pick up ices under less extream upper atmosphere conditions buying
deeper penetration.  A nice creative idea, unlikly but worthy of
thinking about.



In [62]:
newsgroups_train['target_names']

['alt.atheism', 'comp.graphics', 'sci.space']

The class names we will pass to the classifier in training are integers.
The integers are aligned with the class names in the order given in `target_names`.
Therefore we can set up a simple decoder array that maps from class indices to class names:

In [63]:
first_class = newsgroups_train.target[0]
print(first_class)
decoder = np.array(newsgroups_train.target_names)
decoder[2]

2


np.str_('sci.space')

###  Training

Train a logistic regression classifier on this multiclass problem.
Also test it:

In [64]:
# Not usually prize-winning with language data
# vectorizer = CountVectorizer()

##########  Mapping from a sequence of texts to a feature representation of the data
vectorizer = TfidfVectorizer()
vectors_train = vectorizer.fit_transform(newsgroups_train.data)
vectors_test = vectorizer.transform(newsgroups_test.data)
# (1657, 29663)
# 1,657 documents. 29663 features.  Why so many features?  That's how many
# distinct vocab items cropped up in these 1,657 documents.  We use words
# as features in our language model as well but there are way fewer
# features because our training data consists of filtered document texts and therefore a filtered vocab.
print(vectors_train.shape)
########## End of feature  mapping #####################################

clf = LogisticRegression(solver="saga")
# targets are [0,1,2]  aligned with newsgroups_train.target_names
clf.fit(vectors_train, newsgroups_train.target)
#  The usual thing we do with trained classifiers
pred = clf.predict(vectors_test)
metrics.f1_score(newsgroups_test.target, pred, average="micro")
#  f1 score average = "micro"  0.9407548825982005

(1657, 29663)


0.9401088929219601

##  Predicting probabilities

For this assignment we're more interested in having the classifier produce probabilities.

We classify a fresh example using `predict_proba`.

Notice there are three probabilities.  That's because there are three classes:

In [65]:
space_text = ["Space is the final frontier."]
probs = clf.predict_proba(vectorizer.transform(space_text))
probs

array([[0.05980826, 0.18001922, 0.76017252]])

In [66]:
probs.sum()

np.float64(0.9999999999999999)

Class with the highest prob:

In [67]:
probs.argmax()

np.int64(2)

In [68]:
decoder = np.array(newsgroups_train.target_names)
decoder[probs.argmax()]

np.str_('sci.space')

Which is the same answer I could have gotten through `predict`:

In [69]:
decoder[clf.predict(vectorizer.transform(space_text))[0]]

np.str_('sci.space')

Note  that in our language modeling example you didn't need to call `vectorizer.transform`.
I wrote `prepare_korpus` to do that job, because I wanted some custom "vectorizing".
But the output was still a 2D array with the same number of rows as there were histories
to classify and the same number of columns as there were features to classify with.

In [70]:
# For a sequence of n inputs predict_proba will produce a  nx3 array.  Here n=2
texts = ["Space is the final frontier.", "Rasters are common in computer images."]
probs = clf.predict_proba(vectorizer.transform(texts))
print("Probs shape: ", probs.shape)
cls_names = newsgroups_train.target_names
cls_idxs = probs.argmax(axis=1)
print("Predicted class indices: ",cls_idxs)
decoder = np.array(cls_names)
decoder[cls_idxs]

Probs shape:  (2, 3)
Predicted class indices:  [2 1]


array(['sci.space', 'comp.graphics'], dtype='<U13')

##  Retrieving the predicted probabilities for a sequence of classes

Hint:  This discussion relates to computing the perplexity of the data:

Now suppose in the interest of finding the **hard** examples in the test set,  I want to know the probability
my trained classifier assigns to the **correct** class for each example.

In [71]:
pred_probs = clf.predict_proba(vectors_test)

In [72]:
pred_probs.shape

(1102, 3)

I need to retrieve a different column index from each row of `pred_probs`, as dictated by the correct classes
for the test data (`newsgroups_test.target`).

In [73]:
newsgroups_test.target

array([2, 1, 1, ..., 0, 1, 2])

This can be done via **fancy indexing** of the probs array. For a 1D array we pass a list containing
a sequence of the indices we want.

In [74]:
a = np.arange(4,62,3)
print(a)
print(a.shape)
a[[2,8,11,15]]

[ 4  7 10 13 16 19 22 25 28 31 34 37 40 43 46 49 52 55 58 61]
(20,)


array([10, 28, 37, 49])

For a 2D array, we pass two sequences, one for the row indices we want, the other for the column indices:

In [75]:
aa = a.reshape((5,4))
print(aa)
# The second element retrieved is aa[2,3]
aa[[1,2,4],[2,3,1]]

[[ 4  7 10 13]
 [16 19 22 25]
 [28 31 34 37]
 [40 43 46 49]
 [52 55 58 61]]


array([22, 37, 55])

Back to our original problem.  We want an array consisting of one element from each row,
the element corresponding to the correct class for that row:

In [76]:
num_rows = len(newsgroups_test.target)
all_rows_idxs = list(range(num_rows))
column_idxs = list(newsgroups_test.target)

prediction_probs_for_correct_classes = pred_probs[all_rows_idxs,column_idxs]
prediction_probs_for_correct_classes

array([0.8863373 , 0.91611412, 0.6824465 , ..., 0.90234175, 0.68914973,
       0.81725171])

To find the lowest prob assigned to a correct class on the test set we first find its index:

In [77]:
example_idx = prediction_probs_for_correct_classes.argmin()
example_idx

np.int64(548)

Here's the probability assigned to the correct class:

In [78]:
prediction_probs_for_correct_classes[example_idx]

np.float64(0.14183114567444172)

And here is that example:

In [79]:
doc548 = newsgroups_test.data[548]
print(doc548)

From: gkm@wampyr.cc.uow.edu.au (Glen K Moore)
Subject: Fax/email wanted for Louis Friedman/Planetary Society
Organization: University of Wollongong, NSW, Australia.
Lines: 7
NNTP-Posting-Host: wampyr.cc.uow.edu.au
Summary: Want to obtain fax/email address for Planetary Society
Keywords: Planetary Friedman

If available please send to
Glen Moore
Director
Science Centre
Wollongong, Australia
fax: 61 42 213151   email: gkm@cc.uow.edu.au




Confirming the probs:

In [80]:
probs = clf.predict_proba(vectorizer.transform([doc548]))
probs

array([[0.06889007, 0.78927879, 0.14183115]])
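
The per-row probabilities of the correct class are also exactly what a perplexity computation needs. Here is a minimal sketch in plain Python, using made-up probabilities; the function name `perplexity` is an assumption for illustration, and the formula is the exponential of the average negative log probability assigned to the correct classes.

```python
from math import exp, log

def perplexity(correct_probs):
    """exp of the average negative log probability of the correct classes."""
    return exp(-sum(log(p) for p in correct_probs) / len(correct_probs))

# Hypothetical per-row class probabilities and correct class indices
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]
targets = [0, 1, 2]

# Pure-Python analogue of pred_probs[all_rows_idxs, column_idxs]
correct = [row[t] for (row, t) in zip(probs, targets)]
print(perplexity(correct))

# Sanity check: a uniform model over K classes has perplexity K
K = 91
print(perplexity([1 / K] * 5))
```

The uniform case gives a useful baseline: a model that has learned anything at all about 91 classes should have a perplexity below 91.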

##  Precision by class

In [81]:
from sklearn import metrics

Here `clf` is the classifier trained above. Note the use of `average=None`.  This gets the class
by class precision results:

In [82]:
pred = clf.predict(vectors_test)
metrics.precision_score(newsgroups_test.target, pred, average=None)

array([0.97674419, 0.90510949, 0.94871795])

Note the order of the arguments matters for precision.  Swapping predicted and true labelings changes the scores.

In [83]:
metrics.precision_score(pred,newsgroups_test.target, average=None)

array([0.92163009, 0.9562982 , 0.93908629])
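
Why the order matters can be seen in a small hand-computed sketch. The function `precision_by_class` and the labels are made up for illustration; it assumes every class is predicted at least once (otherwise it would divide by zero, a case `precision_score` handles through its `zero_division` parameter).

```python
def precision_by_class(y_true, y_pred, labels):
    """Per-class precision: of the items PREDICTED as c, the fraction that truly are c."""
    out = []
    for c in labels:
        predicted_c = [t for (t, p) in zip(y_true, y_pred) if p == c]
        out.append(sum(1 for t in predicted_c if t == c) / len(predicted_c))
    return out

y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 2, 2]

print(precision_by_class(y_true, y_pred, labels=[0, 1, 2]))
# Swapping the arguments computes per-class recall instead
print(precision_by_class(y_pred, y_true, labels=[0, 1, 2]))
```

With the arguments swapped, the denominator becomes the number of items that truly belong to each class, so the numbers returned are per-class recall rather than precision.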

The first order given is correct, as the function signature shows:

In [84]:
signature(metrics.precision_score)

<Signature (y_true, y_pred, *, labels=None, pos_label=1, average='binary', sample_weight=None, zero_division='warn')>