# NGram Language Models

```yaml
Course:  DS 5001
Module:  03 Lab
Topic:   NGram Language Models
Author:  R.C. Alvarado
Date:    12 December 2023
```

## Purpose 

We now create a series of simple n-gram language models from our small corpus and evaluate them.

## Pattern

1. Import corpus &rarr; `TOKEN`, `VOCAB`.
2. Extract ngrams from training tokens &rarr; `NGRAM`.
3. Count ngrams and convert to models &rarr; `MODEL`.
4. Convert test sentences into tokens &rarr; `TEST_SENT`, `TEST_TOKEN`.
5. Extract ngrams from test tokens &rarr; `TEST_NGRAMS`.
6. Test model by joining model information `M.i` to `TEST_NGRAMS` and then summing `i` per sentence &rarr; `TEST_NGRAMS'`, `TEST_SENT'`.
7. Compute model perplexity by averaging the per-sentence mean information and exponentiating (base 2).

## Set Up

### Import libraries

In [1]:
import pandas as pd
import numpy as np

### Configure

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_dir = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

In [3]:
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
path_prefix = f"{output_dir}/austen-combo"
n = 3

## Get Data

We grab our corpus of two novels.

In [4]:
VOCAB = pd.read_csv(f"{path_prefix}-VOCAB.csv").set_index('term_str')
TOKEN = pd.read_csv(f"{path_prefix}-TOKENS.csv").set_index(OHCO)

## Generate Models

The functions below generate n-gram tables and models for every order up to the specified length $n$.

Our approach is to bind the sequence of term strings in TOKEN to itself with an offset of 1 for each value of $n$.
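
As a minimal sketch of this offset idea, the toy example below lines a short token sequence up against shifted copies of itself. The data and the names `toy` and `toy_ngrams` are hypothetical, and it uses `pd.concat` for brevity rather than the `join` used in `get_ngrams()`.

```python
import pandas as pd

# Hypothetical toy token sequence, not taken from the corpus
toy = pd.DataFrame({'term_str': ['<s>', 'the', 'cat', 'sat', '</s>']})

n = 3

# shift(-i) pulls the term i positions ahead onto the current row,
# so each row holds the n-gram that starts at that position
toy_ngrams = pd.concat(
    [toy.term_str.shift(-i).rename(f'w{i}') for i in range(n)],
    axis=1
)

print(toy_ngrams)
```

The last `n - 1` rows run off the end of the sequence and contain missing values, which is why the function fills them afterwards.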

In [5]:
TOKEN

| book_id | chap_num | para_num | sent_num | token_num | token_str | term_str |
|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | Sir | sir |
| 1 | 1 | 0 | 0 | 1 | Walter | walter |
| 1 | 1 | 0 | 0 | 2 | Elliot | elliot |
| 1 | 1 | 0 | 0 | 3 | of | of |
| 1 | 1 | 0 | 0 | 4 | Kellynch | kellynch |
| ... | ... | ... | ... | ... | ... | ... |
| 2 | 50 | 22 | 0 | 8 | and | and |
| 2 | 50 | 22 | 0 | 9 | Sensibility | sensibility |
| 2 | 50 | 22 | 0 | 10 | by | by |
| 2 | 50 | 22 | 0 | 11 | Jane | jane |


### `get_ngrams()`

In [6]:
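def get_ngrams(TOKEN, VOCAB, n=2, sent_key='sent_num'):
    """Return a table of padded n-grams, one row per token position.

    Each sentence is wrapped in <s> ... </s>; terms not in VOCAB become <unk>.
    """
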
    OHCO = TOKEN.index.names
    grouper = list(OHCO)[:OHCO.index(sent_key)+1]

    PADDED = TOKEN.groupby(grouper)\
        .apply(lambda x: '<s> ' + ' '.join(x.term_str) + ' </s>')\
        .apply(lambda x: pd.Series(x.split()))\
        .stack().to_frame('term_str')
    PADDED.index.names = grouper + ['token_num']
    
    # Handle OOV terms -- MAY NOT BE NEEDED
    PADDED.loc[~PADDED.term_str.isin(list(VOCAB.index) + ['<s>','</s>']), 'term_str'] = '<unk>'
    
    for i in range(1, n):
        PADDED = PADDED.join(PADDED.term_str.shift(-i), rsuffix=i)

    PADDED.columns = [f'w{j}' for j in range(n)]

    # The last n-1 rows of the table have no following terms; fill those gaps with '<s>'
    PADDED = PADDED.fillna('<s>')
    # PADDED = PADDED[~((PADDED.w0 == '</s>') & (PADDED[f'w{n-1}'] == '<s>'))]

    return PADDED

In [7]:
NG = get_ngrams(TOKEN, VOCAB, n=3)

In [8]:
NG.iloc[:,:3]

| book_id | chap_num | para_num | sent_num | token_num | w0 | w1 | w2 |
|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | `<s>` | sir | walter |
| 1 | 1 | 0 | 0 | 1 | sir | walter | elliot |
| 1 | 1 | 0 | 0 | 2 | walter | elliot | of |
| 1 | 1 | 0 | 0 | 3 | elliot | of | kellynch |
| 1 | 1 | 0 | 0 | 4 | of | kellynch | hall |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2 | 50 | 22 | 0 | 10 | sensibility | by | jane |
| 2 | 50 | 22 | 0 | 11 | by | jane | austen |
| 2 | 50 | 22 | 0 | 12 | jane | austen | `</s>` |
| 2 | 50 | 22 | 0 | 13 | austen | `</s>` | `<s>` |


### `get_ngram_counts()`

In [9]:
def get_ngram_counts(NGRAM):
    "Compress the sequences into counts"
    
    n = len(NGRAM.columns)
    C = [None for i in range(n)]
    
    for i in range(n):

        # Count distinct ngrams
        C[i] = NGRAM.iloc[:, :i+1].value_counts().to_frame('n').sort_index()
    
        # Get joint probabilities (MLE)
        C[i]['p'] = C[i].n / C[i].n.sum()
        C[i]['i'] = np.log2(1/C[i].p)

        # Get conditional probabilities (MLE)
        if i > 0:
            C[i]['cp'] = C[i].n / C[i-1].n
            C[i]['ci'] = np.log2(1/C[i].cp)
            
    return C
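
The conditional probabilities in `get_ngram_counts()` are maximum likelihood estimates: the count of an n-gram divided by the count of its history. The sketch below works this through for bigrams on hypothetical toy data. The names `toy_bigrams`, `bg`, and `history_n` are illustrative, and the history count is obtained here by summing bigram counts per `w0` rather than from a separate unigram table.

```python
import numpy as np
import pandas as pd

# Hypothetical toy bigrams, not drawn from the corpus
toy_bigrams = pd.DataFrame(
    [('the', 'cat'), ('the', 'cat'), ('the', 'dog'), ('a', 'dog')],
    columns=['w0', 'w1']
)

# Count each distinct bigram
bg = toy_bigrams.groupby(['w0', 'w1']).size().to_frame('n')

# Count each history (w0) and broadcast it back onto the bigram rows
history_n = bg.groupby('w0').n.transform('sum')

# MLE conditional probability: p(w1 | w0) = n(w0, w1) / n(w0)
bg['cp'] = bg.n / history_n

# Conditional information in bits
bg['ci'] = np.log2(1 / bg.cp)

print(bg)
```

A bigram whose history occurs only once, such as `('a', 'dog')` here, gets a conditional probability of 1 and therefore zero bits of information, the same pattern visible in some of the sampled trigram rows.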

Generate unigram, bigram, and trigram models.

In [10]:
M = get_ngram_counts(NG)

In [11]:
M[0].sort_values('n')

| w0 | n | p | i |
|---|---|---|---|
| journeys | 1 | 0.000004 | 17.813944 |
| edify | 1 | 0.000004 | 17.813944 |
| ponder | 1 | 0.000004 | 17.813944 |
| politics | 1 | 0.000004 | 17.813944 |
| politicians | 1 | 0.000004 | 17.813944 |
| ... | ... | ... | ... |
| and | 6290 | 0.027297 | 5.195100 |
| to | 6923 | 0.030044 | 5.056762 |
| the | 7435 | 0.032266 | 4.953827 |
| `<s>` | 12812 | 0.055601 | 4.168736 |


In [12]:
M[1].sort_values('n')

| w0 | w1 | n | p | i | cp | ci |
|---|---|---|---|---|---|---|
| 1 | 1760 | 1 | 0.000004 | 17.813944 | 0.333333 | 1.584963 |
| often | keeps | 1 | 0.000004 | 17.813944 | 0.012195 | 6.357552 |
| often | led | 1 | 0.000004 | 17.813944 | 0.012195 | 6.357552 |
| often | makes | 1 | 0.000004 | 17.813944 | 0.012195 | 6.357552 |
| often | observed | 1 | 0.000004 | 17.813944 | 0.012195 | 6.357552 |
| ... | ... | ... | ... | ... | ... | ... |
| of | the | 857 | 0.003719 | 8.070793 | 0.139440 | 2.842281 |
| `<s>` | but | 929 | 0.004032 | 7.954409 | 0.072510 | 3.785673 |
| `<s>` | i | 1000 | 0.004340 | 7.848160 | 0.078052 | 3.679424 |
| `<s>` | and | 1417 | 0.006149 | 7.345320 | 0.110599 | 3.176584 |


In [13]:
M[1].sample(5)

| w0 | w1 | n | p | i | cp | ci |
|---|---|---|---|---|---|---|
| apartment | as | 1 | 4e-06 | 17.813944 | 0.166667 | 2.584963 |
| of | no | 16 | 6.9e-05 | 13.813944 | 0.002603 | 8.585432 |
| false | or | 1 | 4e-06 | 17.813944 | 0.25 | 2.0 |
| which | brings | 1 | 4e-06 | 17.813944 | 0.000991 | 9.97871 |
| the | observant | 1 | 4e-06 | 17.813944 | 0.000134 | 12.860117 |


In [14]:
M[2].sample(5)

| w0 | w1 | w2 | n | p | i | cp | ci |
|---|---|---|---|---|---|---|---|
| said | anne | you | 1 | 4e-06 | 17.813944 | 0.055556 | 4.169925 |
| died | it | would | 1 | 4e-06 | 17.813944 | 1.0 | 0.0 |
| `<s>` | dashwood | now | 2 | 9e-06 | 16.813944 | 0.014706 | 6.087463 |
| feeling | on | his | 1 | 4e-06 | 17.813944 | 1.0 | 0.0 |
| unaffected | sincerity | `</s>` | 1 | 4e-06 | 17.813944 | 1.0 | 0.0 |


In [15]:
M[2].loc[('captain','wentworth')].sort_values('n', ascending=False).head()

| w2 | n | p | i | cp | ci |
|---|---|---|---|---|---|
| s | 26 | 0.000113 | 13.113504 | 0.132653 | 2.91427 |
| `</s>` | 24 | 0.000104 | 13.228981 | 0.122449 | 3.029747 |
| was | 15 | 6.5e-05 | 13.907053 | 0.076531 | 3.707819 |
| in | 8 | 3.5e-05 | 14.813944 | 0.040816 | 4.61471 |
| had | 7 | 3e-05 | 15.006589 | 0.035714 | 4.807355 |


In [16]:
M[2].loc[('anne','elliot')].sort_values('n', ascending=False).head()

| w2 | n | p | i | cp | ci |
|---|---|---|---|---|---|
| `</s>` | 5 | 2.2e-05 | 15.492016 | 0.217391 | 2.201634 |
| and | 3 | 1.3e-05 | 16.228981 | 0.130435 | 2.938599 |
| as | 2 | 9e-06 | 16.813944 | 0.086957 | 3.523562 |
| with | 2 | 9e-06 | 16.813944 | 0.086957 | 3.523562 |
| again | 1 | 4e-06 | 17.813944 | 0.043478 | 4.523562 |


## Predict Sentences

### Get a list of test sentences

In [17]:
test_sentences = """
I love you
I love cars
I want to
Anne said to
said to her
he said to
she said to
said to him
she read the
she went to
robots fly ufos
""".split('\n')[1:-1]

### Convert list to TEST_SENT

In [18]:
TEST_SENT = pd.DataFrame({'sent_str':test_sentences})
TEST_SENT.index.name = 'sent_num'

In [19]:
TEST_SENT

| sent_num | sent_str |
|---|---|
| 0 | I love you |
| 1 | I love cars |
| 2 | I want to |
| 3 | Anne said to |
| 4 | said to her |
| 5 | he said to |
| 6 | she said to |
| 7 | said to him |
| 8 | she read the |
| 9 | she went to |


### Convert TEST_SENT to TEST_TOKEN

In [20]:
TEST_TOKEN = TEST_SENT.sent_str.str.split(expand=True).stack().to_frame('token_str')
TEST_TOKEN.index.names = ['sent_num', 'token_num']
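# regex=True is required in pandas 2.x, where str.replace() treats the pattern as a literal by default
TEST_TOKEN['term_str'] = TEST_TOKEN.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()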
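
To illustrate what this normalization does, here is a small sketch on hypothetical tokens (the names `toy_tokens` and `toy_terms` are not part of the lab). It passes `regex=True` explicitly, since pandas 2.x treats the pattern as a literal string by default.

```python
import pandas as pd

# Hypothetical raw tokens with punctuation and mixed case
toy_tokens = pd.Series(["Anne's", 'well-bred', 'Kellynch,', 'to'])

# Strip every run of non-word characters and underscores, then lowercase
toy_terms = toy_tokens.str.replace(r'[\W_]+', '', regex=True).str.lower()

print(toy_terms.tolist())
```

Note that the pattern removes punctuation inside a token as well as around it, so a hyphenated or possessive form collapses into a single term.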

In [21]:
TEST_TOKEN

| sent_num | token_num | token_str | term_str |
|---|---|---|---|
| 0 | 0 | I | i |
| 0 | 1 | love | love |
| 0 | 2 | you | you |
| 1 | 0 | I | i |
| 1 | 1 | love | love |
| 1 | 2 | cars | cars |
| 2 | 0 | I | i |
| 2 | 1 | want | want |
| 2 | 2 | to | to |
| 3 | 0 | Anne | anne |


### Extract TEST_NGRAMS from TEST_TOKEN

In [22]:
TEST_NGRAMS = get_ngrams(TEST_TOKEN, VOCAB, n=3, sent_key='sent_num')

In [23]:
TEST_NGRAMS.head(10)

| sent_num | token_num | w0 | w1 | w2 |
|---|---|---|---|---|
| 0 | 0 | `<s>` | i | love |
| 0 | 1 | i | love | you |
| 0 | 2 | love | you | `</s>` |
| 0 | 3 | you | `</s>` | `<s>` |
| 0 | 4 | `</s>` | `<s>` | i |
| 1 | 0 | `<s>` | i | love |
| 1 | 1 | i | love | `<unk>` |
| 1 | 2 | love | `<unk>` | `</s>` |
| 1 | 3 | `<unk>` | `</s>` | `<s>` |
| 1 | 4 | `</s>` | `<s>` | i |


### Test the model

We test by joining the test ngrams with the model and then saving aggregate statistics to the sentence dataframe.

Note that testing is a special case of the split-apply-combine pattern.
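
A minimal sketch of that pattern on hypothetical data follows: look up each test ngram's information in a model table (the join), then aggregate per sentence (split-apply-combine). The names `toy_model`, `toy_test`, `joined`, and `toy_result` and all values are made up for illustration.

```python
import pandas as pd

# Hypothetical unigram model: information in bits per term
toy_model = pd.Series({'she': 3.0, 'went': 8.0, 'home': 5.0}, name='i')
toy_model.index.name = 'w0'

# Hypothetical test ngrams, indexed by sentence and token position
toy_test = pd.DataFrame(
    {'w0': ['she', 'went', 'home', 'she', 'went', 'away']},
    index=pd.MultiIndex.from_product(
        [[0, 1], [0, 1, 2]], names=['sent_num', 'token_num']
    )
)

# Join: attach the model's information value to each test ngram.
# Ngrams missing from the model come back as NaN and are given
# the model's maximum information value.
joined = toy_test.join(toy_model, on='w0').fillna(toy_model.max())

# Split by sentence, apply sum and mean, combine into one table
toy_result = joined.groupby('sent_num')['i'].agg(['sum', 'mean'])

print(toy_result)
```

The unseen term `away` is charged the largest information value in the model, so the second sentence scores worse than the first.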

### `test_model()`

In [39]:
def test_model(model, test_ngrams):

    # Get the model level and info feature
    n = len(model.index.names) - 1 
    f = 'c' * bool(n) + 'i'        

    # Do the test by join and then split-apply-combine
    # fillna() handles ngrams missing from the model (e.g. OOV terms) by assigning the maximum information value
    T = test_ngrams.join(model[f], on=model.index.names, how='left').fillna(model[f].max()).copy()
        
    R = T.groupby('sent_num')[f].agg(['sum','mean'])
    R['pp'] = np.exp2(R['mean'])
    
    return R

### Run tests and save as RESULT

In [40]:
RESULT = pd.concat(
    [test_model(M[i], TEST_NGRAMS.iloc[:,:i+1]) for i in range(len(M))],
    keys=[f"M{n}" for n in range(len(M))],
    axis=1
)

In [37]:
RESULT.style.background_gradient()

| sent_num | M0 sum | M0 mean | M0 pp | M1 sum | M1 mean | M1 pp | M2 sum | M2 mean | M2 pp |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 89.06972 | 17.813944 | 230426.0 | 21.181306 | 4.236261 | 18.846977 | 26.875379 | 5.375076 | 41.501044 |
| 1 | 89.06972 | 17.813944 | 230426.0 | 39.580865 | 7.916173 | 241.549148 | 52.580832 | 10.516166 | 1464.473595 |
| 2 | 89.06972 | 17.813944 | 230426.0 | 24.502196 | 4.900439 | 29.866147 | 42.166699 | 8.43334 | 345.691132 |
| 3 | 89.06972 | 17.813944 | 230426.0 | 27.70945 | 5.54189 | 46.588113 | 42.131175 | 8.426235 | 343.992908 |
| 4 | 89.06972 | 17.813944 | 230426.0 | 21.158476 | 4.231695 | 18.787421 | 25.335105 | 5.067021 | 33.521644 |
| 5 | 89.06972 | 17.813944 | 230426.0 | 23.756214 | 4.751243 | 26.931877 | 28.439985 | 5.687997 | 51.553444 |
| 6 | 89.06972 | 17.813944 | 230426.0 | 23.994194 | 4.798839 | 27.835204 | 31.749848 | 6.34997 | 81.570159 |
| 7 | 89.06972 | 17.813944 | 230426.0 | 21.14156 | 4.228312 | 18.743415 | 22.871142 | 4.574228 | 23.822094 |
| 8 | 89.06972 | 17.813944 | 230426.0 | 40.629998 | 8.126 | 279.363474 | 53.580832 | 10.716166 | 1682.23841 |
| 9 | 89.06972 | 17.813944 | 230426.0 | 21.641655 | 4.328331 | 20.08896 | 38.144617 | 7.628923 | 197.940544 |


Show results for a given model.

In [27]:
pd.concat([TEST_SENT, RESULT['M0']], axis=1).sort_values('pp').style.background_gradient()

| sent_num | sent_str | sum | mean | pp |
|---|---|---|---|---|
| 0 | I love you | 89.06972 | 17.813944 | 230426.0 |
| 1 | I love cars | 89.06972 | 17.813944 | 230426.0 |
| 2 | I want to | 89.06972 | 17.813944 | 230426.0 |
| 3 | Anne said to | 89.06972 | 17.813944 | 230426.0 |
| 4 | said to her | 89.06972 | 17.813944 | 230426.0 |
| 5 | he said to | 89.06972 | 17.813944 | 230426.0 |
| 6 | she said to | 89.06972 | 17.813944 | 230426.0 |
| 7 | said to him | 89.06972 | 17.813944 | 230426.0 |
| 8 | she read the | 89.06972 | 17.813944 | 230426.0 |
| 9 | she went to | 89.06972 | 17.813944 | 230426.0 |


In [28]:
pd.concat([TEST_SENT, RESULT['M2']], axis=1).sort_values('pp').style.background_gradient()

| sent_num | sent_str | sum | mean | pp |
|---|---|---|---|---|
| 7 | said to him | 22.871142 | 4.574228 | 23.822094 |
| 4 | said to her | 25.335105 | 5.067021 | 33.521644 |
| 0 | I love you | 26.875379 | 5.375076 | 41.501044 |
| 5 | he said to | 28.439985 | 5.687997 | 51.553444 |
| 6 | she said to | 31.749848 | 6.34997 | 81.570159 |
| 9 | she went to | 38.144617 | 7.628923 | 197.940544 |
| 3 | Anne said to | 42.131175 | 8.426235 | 343.992908 |
| 2 | I want to | 42.166699 | 8.43334 | 345.691132 |
| 1 | I love cars | 52.580832 | 10.516166 | 1464.473595 |
| 8 | she read the | 53.580832 | 10.716166 | 1682.23841 |


Compare a feature across models.

We use `.swaplevel()` to change the order of the column levels to make selection easy.

In [29]:
RESULT.swaplevel(axis=1).style.background_gradient()

| sent_num | sum M0 | mean M0 | pp M0 | sum M1 | mean M1 | pp M1 | sum M2 | mean M2 | pp M2 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 89.06972 | 17.813944 | 230426.0 | 21.181306 | 4.236261 | 18.846977 | 26.875379 | 5.375076 | 41.501044 |
| 1 | 89.06972 | 17.813944 | 230426.0 | 39.580865 | 7.916173 | 241.549148 | 52.580832 | 10.516166 | 1464.473595 |
| 2 | 89.06972 | 17.813944 | 230426.0 | 24.502196 | 4.900439 | 29.866147 | 42.166699 | 8.43334 | 345.691132 |
| 3 | 89.06972 | 17.813944 | 230426.0 | 27.70945 | 5.54189 | 46.588113 | 42.131175 | 8.426235 | 343.992908 |
| 4 | 89.06972 | 17.813944 | 230426.0 | 21.158476 | 4.231695 | 18.787421 | 25.335105 | 5.067021 | 33.521644 |
| 5 | 89.06972 | 17.813944 | 230426.0 | 23.756214 | 4.751243 | 26.931877 | 28.439985 | 5.687997 | 51.553444 |
| 6 | 89.06972 | 17.813944 | 230426.0 | 23.994194 | 4.798839 | 27.835204 | 31.749848 | 6.34997 | 81.570159 |
| 7 | 89.06972 | 17.813944 | 230426.0 | 21.14156 | 4.228312 | 18.743415 | 22.871142 | 4.574228 | 23.822094 |
| 8 | 89.06972 | 17.813944 | 230426.0 | 40.629998 | 8.126 | 279.363474 | 53.580832 | 10.716166 | 1682.23841 |
| 9 | 89.06972 | 17.813944 | 230426.0 | 21.641655 | 4.328331 | 20.08896 | 38.144617 | 7.628923 | 197.940544 |


In [30]:
RESULT.swaplevel(axis=1)['pp'].style.background_gradient()

| sent_num | M0 | M1 | M2 |
|---|---|---|---|
| 0 | 230426.0 | 18.846977 | 41.501044 |
| 1 | 230426.0 | 241.549148 | 1464.473595 |
| 2 | 230426.0 | 29.866147 | 345.691132 |
| 3 | 230426.0 | 46.588113 | 343.992908 |
| 4 | 230426.0 | 18.787421 | 33.521644 |
| 5 | 230426.0 | 26.931877 | 51.553444 |
| 6 | 230426.0 | 27.835204 | 81.570159 |
| 7 | 230426.0 | 18.743415 | 23.822094 |
| 8 | 230426.0 | 279.363474 | 1682.23841 |
| 9 | 230426.0 | 20.08896 | 197.940544 |


In [31]:
RESULT.swaplevel(axis=1)['mean'].style.background_gradient()

| sent_num | M0 | M1 | M2 |
|---|---|---|---|
| 0 | 17.813944 | 4.236261 | 5.375076 |
| 1 | 17.813944 | 7.916173 | 10.516166 |
| 2 | 17.813944 | 4.900439 | 8.43334 |
| 3 | 17.813944 | 5.54189 | 8.426235 |
| 4 | 17.813944 | 4.231695 | 5.067021 |
| 5 | 17.813944 | 4.751243 | 5.687997 |
| 6 | 17.813944 | 4.798839 | 6.34997 |
| 7 | 17.813944 | 4.228312 | 4.574228 |
| 8 | 17.813944 | 8.126 | 10.716166 |
| 9 | 17.813944 | 4.328331 | 7.628923 |


### Compute Model Perplexity

In [32]:
np.exp2(RESULT.swaplevel(axis=1)['mean'].mean())

M0    230426.000000
M1        56.334266
M2       231.744887
dtype: float64
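
The cell above exponentiates a mean information value, which is the same as taking the inverse geometric mean of the underlying probabilities. The sketch below checks that equivalence on hypothetical probabilities; the names `probs`, `pp`, and `pp_alt` are illustrative only.

```python
import numpy as np

# Hypothetical conditional probabilities for the ngrams of a short test sequence
probs = np.array([0.5, 0.25, 0.125, 0.125])

# Information per ngram in bits: log2(1/p)
info = np.log2(1 / probs)

# Perplexity as 2 raised to the mean information
mean_info = info.mean()
pp = np.exp2(mean_info)

# Equivalent form: inverse geometric mean of the probabilities
pp_alt = np.prod(probs) ** (-1 / len(probs))

print(mean_info, pp, pp_alt)
```

Read this way, a perplexity of about 56 means the model is, on average, as uncertain as if it were choosing uniformly among about 56 equally likely next terms.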

In language modeling, perplexity is a measure of how well a probability model predicts a test set. It is often used to compare different models: the lower the perplexity, the better the model's performance in terms of predicting the test set. When dealing with n-gram models like unigrams, bigrams, and trigrams, the relationship between the value of 'n' and the perplexity of the model on a given test corpus can vary, depending on several factors.

1. **Data Sparsity**: As 'n' increases in n-gram models (moving from unigram to bigram to trigram), the models become more sensitive to the specific sequences of words in the training data. This can lead to an issue known as data sparsity: trigram models, for example, require a lot more data to encounter all possible word sequences of length three. If your training corpus isn't sufficiently large and diverse, trigram and higher n-gram models may suffer because they haven't seen enough examples of each word sequence during training. 

2. **Generalization vs. Specificity**: Unigram models, being the simplest, have high generalizability but low specificity—they treat each word independently and don't capture the context. Trigram models, on the other hand, capture more context but might become too specific to the training data, failing to generalize well to unseen data. If the test corpus contains many word sequences not seen in the training corpus, a higher n-gram model might perform poorly, leading to higher perplexity.

3. **Model Complexity and Overfitting**: Higher n-gram models (like trigrams) are more complex and can potentially overfit the training data. Overfitting occurs when a model learns patterns that are specific to the training data, including noise and outliers, rather than capturing the underlying structure of the language. This can lead to increased perplexity on the test set, as the model is less able to generalize to unseen data.

In summary, whether perplexity increases or decreases with n depends on the characteristics of your training and test data, as well as how well the model's complexity is suited to the amount and diversity of the data. In an ideal scenario with ample, diverse training data, you might expect a bigram or trigram model to outperform a unigram model, leading to lower perplexity. However, in practical scenarios, especially when dealing with limited or highly specific datasets, the relationship might not be so straightforward.
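
To make the sparsity point concrete, here is a small sketch that measures what fraction of test n-grams never occur in training, for increasing $n$. The toy sentences and the helper `ngrams()` are hypothetical and unrelated to the Austen corpus.

```python
# Hypothetical toy training and test sequences
train = 'the cat sat on the mat and the dog sat on the rug'.split()
test = 'the mat sat on the cat'.split()

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

unseen_rate = {}
for n in (1, 2, 3):
    seen = set(ngrams(train, n))
    test_ngrams = ngrams(test, n)
    # Share of test n-grams that the training data never contained
    unseen_rate[n] = sum(g not in seen for g in test_ngrams) / len(test_ngrams)

print(unseen_rate)
```

Every test word appears in training, yet the share of unseen sequences rises with each order. Under maximum likelihood estimation each unseen sequence is charged a heavy information penalty, which is one way a trigram model can end up with higher perplexity than a bigram model on a small corpus.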

## Generate Text

We use so-called "stupid back-off" to account for missing ngrams.
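
The fallback logic can be sketched with plain dictionaries: try the longest history first and drop to a shorter one only when it has never been seen. This is a minimal illustration on a hypothetical toy corpus, not the function defined in this lab; the names `tri`, `bi`, `uni`, and `next_word()` are assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus, not the notebook's data
tokens = '<s> the cat sat </s> <s> the dog sat </s> <s> a cat ran </s>'.split()

# Successor counts keyed by two-word and one-word histories
tri = defaultdict(Counter)
bi = defaultdict(Counter)
uni = Counter(tokens)
for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
    tri[(a, b)][c] += 1
for a, b in zip(tokens, tokens[1:]):
    bi[a][b] += 1

def next_word(w1, w2, rng):
    """Sample a successor using the longest history that has been seen."""
    for table, key in ((tri, (w1, w2)), (bi, w2)):
        if key in table:
            words, counts = zip(*table[key].items())
            return rng.choices(words, weights=counts)[0]
    # Last resort: sample from the unigram counts
    words, counts = zip(*uni.items())
    return rng.choices(words, weights=counts)[0]

rng = random.Random(0)
seen = next_word('the', 'cat', rng)          # trigram history exists
backed_off = next_word('zebra', 'dog', rng)  # falls back to the bigram history
print(seen, backed_off)
```

Sampling with counts as weights is equivalent to sampling with conditional probabilities, because all candidates share the same history.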

### `generate_text()`

In [33]:
def generate_text(n=250):
    
    m1, m2, m3 = M
    start_word = m1.sample(weights='p').index[0][0]
    words = [start_word]
    
    for i in range(n):
        
        if len(words) == 1:
            next_word = m2.loc[start_word].sample(weights='p').index[0]
        
        elif len(words) > 1:

            # Get previous two words
            bg = tuple(words[-2:])
            
            # Try trigram model
            try:
                next_word = m3.loc[bg].sample(weights='cp').index[0]
            
            # If not in model, back off ...
            except KeyError:
                
                # Get the last word in the bigram
                ug = bg[1] 

                if ug == '<s>':
                    next_word = m1.sample(weights='p').index[0][0]
                else:
                    next_word = m2.loc[ug].sample(weights='cp').index[0]
            
        # Some words are returned as single item tuples
        if isinstance(next_word, tuple):
            next_word = next_word[0]
        
        words.append(next_word)
    
    text = ' '.join(words)
    lines = text.split('<s> <s>')
    for line in lines:
        print(line.strip().upper())

In [34]:
generate_text()

SAID NO MORE AND MORE PARTICULAR ACCOUNT OF THEIR DANGER THAT SHE SHOULD EXPRESS HERSELF PROPERLY BY LETTER COULD NOT FINISH HIS HALF AVERTED EYES AND WITH THE MORNING </S> <S> SOME MOTHERS MIGHT HAVE BEEN IN THE WISH OF SOCIETY </S> <S> SO THERE WAS NOTHING TO LOUISA </S> <S> THE IMPERTINENCE OF THESE SORT OF A REVIVAL OF HIS MERITS </S> <S> THERE ARE MOMENTS WHEN THE DOOR TOWARDS THE MANNERS OF ONE SISTER TO A THIRD PERSON OR DRIVING OUT IN ANNUITIES ON THE TABLE AND CONTINUED HER LETTER DIRECTLY WHILE MARIANNE STILL DEEPER BY TREATING THEIR APPREHENSIONS WERE THE WALLS COVERED WITH KISSES HER WOUND BATHED WITH LAVENDER WATER BY ONE OF THEIR LOOKS AND IN ANOTHER MOMENT HENRIETTA SINKING UNDER THE CONSTANT USE AND WITH AN ENERGY WHICH ALWAYS ADORNED HER PRAISE </S> <S> I ASSURE YOU BE SO GOOD HUMOURED AND MERRY </S> <S> HE WAS BY HER HUSBAND WAS A GREAT PLEASURE IN BEING TOGETHER AT ALL AWARE OF BEING PERFECTLY READY TO GIVE HIM MORE THAN EXCUSABLE IN THE COMPARISON IT NECESSARILY PRO

In [35]:
generate_text()

CANNOT THINK IT TOO LATE THEN I MUST HAVE </S> <S> D </S> <S> HE RECEIVED THE INFORMATION WITHOUT THE SMALLEST SUSPICION THEREFORE HAD EVER COME WITHIN THE LAST EVENING OF THEIR HAVING PASSED CLOSE BY HIM THAT I MIGHT HAVE BEEN TAKEN UP BY SOMETHING ELSE </S> <S> YOUR HOME COUNTRY FRIENDS ALL QUITTED </S> <S> WELL AS TO THE MANNER THE SILENCE AND A VERY GOOD INDEED </S> <S> HOW MUCH HE WANTED TO ANIMATE HER CURIOSITY GRATIFIED </S> <S> AT LAST RECOVERING FROM ILLNESS HAD BEEN ASHAMED TO BE INTRODUCED TO THEM CONCLUDED A SHORT RECAPITULATION OF WHAT SHE HAD TURNED FROM HIM SOME TASTE OF THE QUESTION COULD NOT LIVE ENTIRELY BY THE PRESENT WAS NOT ONLY UPON GAINING EVERY NEW PRINT OF MERIT IN MAINTAINING HER OWN COMPLETE INDEPENDENCE OF MRS </S> <S> HIS KINDNESS IN STEPPING FORWARD TO </S> <S> YES YES WE CAN MEAN NO OTHER VISITOR APPEARED THAT EVENING IN WESTGATE BUILDINGS AS ANNE S LOT TO BE READ WITH LESS RELUCTANCE THAN SHE FORESAW AND TO BE THOUGHT SO THEN HE HAD DONE IT AWAY </S> <S>

## Save

In [36]:
path_prefix = f"{output_dir}/austen-combo"
NG.to_csv(f"{path_prefix}-NG.csv")
for i in range(len(M)):
    M[i].to_csv(f"{path_prefix}-M{i}.csv", index=True)