## NER Letter Chunk Spacy

## Resources

In [1]:
import spacy
import pandas as pd
from collections import Counter
from spacy import displacy

## Get Data

In [273]:
# Sentence Data
df = pd.read_csv("20240411_PhD_Data4NER-LtrChk.csv", index_col=0) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2270 entries, 0 to 2269
Data columns (total 29 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          2270 non-null   int64  
 1   docauthorid       2270 non-null   object 
 2   docauthorname     2270 non-null   object 
 3   docid             2270 non-null   object 
 4   docyear           2235 non-null   float64
 5   docmonth          2171 non-null   float64
 6   authorgender      2270 non-null   object 
 7   agewriting        1536 non-null   float64
 8   agedeath          1525 non-null   float64
 9   relMin            1870 non-null   object 
 10  nationalOrigin    2266 non-null   object 
 11  authorLocation    2270 non-null   object 
 12  U                 2076 non-null   object 
 13  M                 2076 non-null   object 
 14  S                 2076 non-null   object 
 15  F                 2076 non-null   object 
 16  L                 2076 non-null   object 


## Test the NER of various models on the texts

In [264]:
# The next few cells are run multiple times to check the performance of various pre-trained models.
# Do not run this cell until after the first pass
nlp = spacy.load("en_core_web_md")

In [265]:
# I started with 0 and checked 
chunk = 1349

In [274]:
# Place narratives into a list representing the corpus
texts = df.text.values.tolist()
texts[chunk]

"Margaret Ellin is making a tart for Tea Willie is drawing on the slate Patty is crocheting Sarah is playing Eddie is playing with the cat and I am writing to you. Mama is trying to do without a servant now for they are more trouble than they are worth. You must excuse Mama's writing this mail as the gathering is on her forefinger of her right hand. Margaret Ellin and me have to carry on the correspondence. Yesterday was Grandmama Stretch's birthday and Mama said we might have a pudding for a treat. So Margaret Ellin made a currant dumpling and it was first rate. Besides it was her first attempt."

In [267]:
# Test on the selected chunk
item = texts[chunk]

# Run the language model on the selected narrative
narrative = nlp(item)

# Collect every mention of a person in the narrative.
# The list is simply empty when no PERSON entity is found.
mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

# Count how often each name is mentioned
counts = {}
for person in mentions:
    counts[person] = counts.get(person, 0) + 1

# Distinct names
individuals = set(mentions)

print("Number of mentions:", len(mentions), "\n")
print(counts, "\n")
print("Number of individuals:", len(individuals), "\n")
print("Individuals:", individuals, "\n")

Number of mentions: 8 

{'Margaret Ellin': 3, 'Patty': 1, 'Sarah': 1, 'Eddie': 1, 'Mama': 2} 

Number of individuals: 5 

Individuals: {'Sarah', 'Mama', 'Margaret Ellin', 'Patty', 'Eddie'} 



This isn't great because Darling Sister Justina is being broken into two entities and others aren't being found (e.g., Eugénie de Guérin, Sister Blandina). Let's highlight the labels for a sample text.

In [268]:
text = texts[chunk]
doc = nlp(text)
displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

Also, only one of the two references to Mother Josephine is being labeled. Let's see if changing the model will help. For chunk 567, this model performed well. For chunk 1135, this model performs reasonably well. It misidentifies Snow as a person, but that is because the word is capitalised. For 1703, this model is perfect. There are no named people. For 2269, this model is perfect. For 284, this model performed perfectly. For 851, this model misses Agnes, The Colonel and Tiny man (a nickname, so understandable). It is also possible that the Jessie is a boat rather than a person. However, given the number of person references, this model does reasonably well. For 1419, this model performs almost perfectly. It attaches "more than L10" to Robert's name for some reason. For 1988, this model is perfect. For 100, Sister Justina is missed and the Sister without a personal name is tagged erratically (sometimes tagged, sometimes not).

I'm attempting to use the 19th century Word2Vec language model (Hosseini, 2021). The files come as Gensim export formats .model and .npy. First, I had to load these into Gensim and then export a .txt file that I could use with Spacy. See notebook 20240412_histLM.ipynb for that step. Once I had the .txt file, I followed the steps described at https://www.youtube.com/watch?v=JmLQedi80_Y and https://github.com/wjbmattingly/spacy_custom_vectors/blob/main/spacy_word_vecs.ipynb. I have decided not to train the model on my text but rather to use the pre-trained model. These steps are reflected in the notebook spacy_word_vecs in my home directory.
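
For orientation, here is a minimal sketch of the plain-text word2vec layout that the exported .txt file is assumed to follow: a header line with the vocabulary size and vector width, then one line per word. The helper `write_word2vec_text` and the toy vectors are hypothetical and are not taken from 20240412_histLM.ipynb. In practice Gensim's `save_word2vec_format(..., binary=False)` is the usual way to produce such a file, and spaCy's `init vectors` command can read it.

```python
import io


def write_word2vec_text(vectors, handle):
    """Write {word: [floats]} in word2vec text format.

    First line: "<vocab_size> <dimensions>"; then one line per word.
    """
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    handle.write(f"{len(words)} {dim}\n")
    for word in words:
        vec = vectors[word]
        if len(vec) != dim:
            raise ValueError(f"{word!r} has {len(vec)} dimensions, expected {dim}")
        handle.write(word + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


# Toy vectors for illustration only (hypothetical values).
toy_vectors = {"sister": [0.1, 0.2, 0.3], "mother": [0.4, 0.5, 0.6]}
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
print(buffer.getvalue())
```

Checking the header line of the real export against this layout is a quick way to confirm that the vocabulary size and dimensionality survived the conversion.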

In [269]:
#nlp = spacy.load("/Users/alaynemoody/spacy_custom_vectors-main/models/01")

In [270]:
# This does not seem necessary but keeping just in case.
#nlp.initialize()

In [271]:
#text = texts[chunk]
#doc = nlp(text)
#displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

This is even worse because it's not capturing Sister Justina or Sister M Louis and is mixing up places (e.g., Mt St Vincent and Steuvenville). Let's try the large Spacy model.

For chunk 567, this performed well. For chunk 1135, this one performs ok. It does not misidentify Snow but it does miss Mr Young, which orthographically is clearly a person. For 1703, this model is also perfect. For 2269, this model performs badly, missing Captain Hale but misinterpreting Character (capitalized for emphasis) as a person. For 284, this model does ok, catching John but also tagging Sister, which refers to the narrator. For 851, this model misses Dunbar, Agnes, the Colonel, the Tiny man, Duponts and Henry -- plus it tags the potential boat. All in all, this model performs poorly. For 1419, this model performs poorly, missing multiple references to Critchlow and Mr Davies as well as to Cundall. For 1988, this model is perfect. For 100, this model misses Sister Justina and erratically tags Sister without a surname. It also incorrectly tags "Vd", so I'm calling it a poor performance.

In [155]:
#nlp = spacy.load("en_core_web_lg")

In [156]:
#text = texts[chunk]
#doc = nlp(text)
#displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

This isn't much better, as the two Mother Josephine references are being treated as two different people, and Sister Justina and Sister Blandina are being missed. I am going to re-run the cells above on some more texts to see how we go. After 10 trials, the medium and large models were tied. I am doing another 10 with just those two, focusing on the authors that tripped up the models most: Segale, Moodie and Harris (I am more concerned with Sarah Harris than Critchlow Harris because she has more letters in the corpus; see counts below).

Additional Runs: For 567, this performed poorly because it identified DeWitt and Davenport as people when in fact the quotes indicate it is a company (probably a publishing house). For 1135, this model performs best -- skipping Snow and finding Young. For 1703, again perfect. For 2269, this model is perfect. For 284, this model is perfect. For 851, this model misses the Colonel, Mrs H Traill and the Tiny man, so the same number as the medium model -- reasonably well. For 1419, this model performs perfectly. For 1988, this model is perfect. For 100, this model is better about not tagging the Sister-sans-surname references but it incorrectly tags "Por amor."

In [34]:
len(df[df["docauthorid"]=="per0001043"])

531

In [48]:
#df['docauthorname'][851]
#df['docauthorname'][567]
#len(df[df["docauthorname"]=="Moodie, Susannah Strickland, 1803-1885"])
df['docID-AT'][df["docauthorname"]=="Moodie, Susannah Strickland, 1803-1885"].describe()

count     484.000000
mean      773.500000
std       139.863028
min       532.000000
25%       652.750000
50%       773.500000
75%       894.250000
max      1015.000000
Name: docID-AT, dtype: float64

In [49]:
#df['docauthorname'][1419]
#len(df[df["docauthorname"]=="Harris, Sarah Stretch, 1818-1897"])
df['docID-AT'][df["docauthorname"]=="Harris, Sarah Stretch, 1818-1897"].describe()

count     261.000000
mean     1271.260536
std        86.756163
min      1124.000000
25%      1195.000000
50%      1270.000000
75%      1344.000000
max      1422.000000
Name: docID-AT, dtype: float64

In [44]:
#df['docauthorname'][1135]
len(df[df["docauthorname"]=="Harris, Critchlow, 1813-1899"])

52

After ten trials it became clear that the histLM performed far worse than the two Spacy models, so I dropped it before continuing with the final ten trials. The medium-sized model slightly outperformed the large model (score of 2.4 compared to 2.5), so I have decided to proceed with the medium model.
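
The scores above come from reading the tagged output by hand. As a complementary check, one could hand-label the people in each trial chunk and compare them with the model's PERSON tags using precision and recall. This is a swapped-in technique, not the scoring scheme used above; the function `score_person_tags` and the labels in the example are hypothetical.

```python
def score_person_tags(predicted, gold):
    """Return (precision, recall, f1) for predicted vs. gold name sets."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    # By convention here, an empty set scores 1.0 (nothing to get wrong).
    precision = true_pos / len(predicted) if predicted else 1.0
    recall = true_pos / len(gold) if gold else 1.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels for one chunk: the model tags a spurious "Snow"
# and misses "Mr Young".
gold = {"Mr Young", "Mama", "Sarah"}
predicted = {"Snow", "Mama", "Sarah"}
precision, recall, f1 = score_person_tags(predicted, gold)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

Averaging these per-chunk values over the trial chunks would give one number per model that separates false tags (precision) from missed people (recall).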

## Named entity extraction for the texts

In [272]:
nlp = spacy.load("en_core_web_md")

In [275]:
mentsTot = [] 
mentsDis = []
indsTot = []

for item in texts:

    # Run the language model on the current narrative
    narrative = nlp(item)

    # Collect every mention of a person in the narrative.
    # Built once per text, so a text without entities yields an empty list
    # instead of reusing the values from the previous text.
    mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

    # Count how often each name is mentioned
    counts = {}
    for person in mentions:
        counts[person] = counts.get(person, 0) + 1

    # Distinct names
    individuals = set(mentions)

    mentsTot.append(len(mentions))
    mentsDis.append(counts)
    indsTot.append(len(individuals))
    
                   
print(len(mentsTot)) 
print(len(indsTot))
print(len(mentsDis))

print(mentsTot[0]) 
print(indsTot[0])
print(mentsDis[0])


2270
2270
2270
5
5
{'Darling Sister': 1, 'Justina': 1, 'Sister M Louis': 1, "Mother Josephine's": 1, 'Mother Regina': 1}
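
The per-chunk tally used in this section could also be written with the `Counter` that is already imported at the top of the notebook. The sketch below is an illustration under stated assumptions: the helper name `summarise_person_mentions` is hypothetical, and it takes plain (text, label) pairs so that it runs without a model. With Spacy the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import Counter


def summarise_person_mentions(ents):
    """Summarise PERSON mentions from (text, label) pairs.

    `ents` is assumed to be an iterable of (entity_text, entity_label)
    tuples, e.g. [(ent.text, ent.label_) for ent in doc.ents] in spaCy.
    """
    mentions = [text for text, label in ents if label == "PERSON"]
    counts = Counter(mentions)
    return {
        "mentions_total": len(mentions),
        "counts": dict(counts),
        "individuals": set(counts),
    }


# Hypothetical pairs, chosen to be consistent with the counts printed
# earlier for chunk 1349; the DATE entry is invented for illustration.
sample_ents = [
    ("Margaret Ellin", "PERSON"), ("Patty", "PERSON"), ("Sarah", "PERSON"),
    ("Eddie", "PERSON"), ("Mama", "PERSON"), ("Mama", "PERSON"),
    ("Margaret Ellin", "PERSON"), ("Yesterday", "DATE"),
    ("Margaret Ellin", "PERSON"),
]
summary = summarise_person_mentions(sample_ents)
print(summary["mentions_total"], len(summary["individuals"]))
```

Because the helper always builds its lists from the current text, a chunk with no entities returns zero mentions and an empty set.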


## Self-references

Now for first-person singular pronouns, subjective and objective only, per Tackman, A. M., Sbarra, D. A., Carey, A. L., Donnellan, M. B., Horn, A. B., Holtzman, N. S., Edwards, T. S., Pennebaker, J. W., & Mehl, M. R. (2019). Depression, Negative Emotionality, and Self-Referential Language: A Multi-Lab, Multi-Measure, and Multi-Language-Task Research Synthesis. Journal of Personality and Social Psychology, 116(5), 817–834. https://doi.org/10.1037/pspp0000187.


In [276]:
pronounAll = ["I ", 
               "I'm ", 
               "I've ", 
               "I'll ", 
               "I'd ", 
               " me ", 
               "Me ", 
               " myself ", 
               "Myself "]
pronounAll

['I ', "I'm ", "I've ", "I'll ", "I'd ", ' me ', 'Me ', ' myself ', 'Myself ']

In [277]:
pronounSub = ["I ", "I'm ", "I've ", "I'll ", "I'd "]
pronounSub

['I ', "I'm ", "I've ", "I'll ", "I'd "]

In [278]:
pronounObj = [" me ", 
               "Me ", 
               " myself ", 
               "Myself "]
pronounObj

[' me ', 'Me ', ' myself ', 'Myself ']
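
One caveat with space-delimited substrings is that a pronoun followed by punctuation (for example "me." or "me,") is not counted. A possible alternative, sketched here as an assumption rather than the method used in this notebook, is to match on word boundaries with the standard `re` module. The names `SUBJECTIVE`, `OBJECTIVE` and `count_first_person` are hypothetical, and the sample sentence is invented.

```python
import re

# Word-boundary patterns; contractions such as "I'm" are counted under "I".
SUBJECTIVE = re.compile(r"\bI\b")
OBJECTIVE = re.compile(r"\b(?:me|myself)\b", re.IGNORECASE)


def count_first_person(text):
    """Return (subjective, objective) first-person singular pronoun counts."""
    return len(SUBJECTIVE.findall(text)), len(OBJECTIVE.findall(text))


sample = "I wrote to her, and she wrote to me. I'm well; do write to me, I beg."
print(count_first_person(sample))

# The substring pattern " me " finds neither "me." nor "me," in the sample.
print(sample.count(" me "))
```

The trade-off is that `\bI\b` also matches a standalone capital I used as an initial or Roman numeral, so the two approaches can disagree in both directions.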

## Now test

In [279]:
chunk = 600

In [280]:
#texts = [x.lower() for x in texts]

In [281]:
texts[chunk]

'magnitized by them. The expression of her face is sad even to melancholy but sweetly feminine. I do not believe that the raps are produced by spirits that have been of this world but I cannot believe that she with her pure spiritual face is capable of deceiving. She certainly does not procure these mysterious sounds by foot or hand and though I cannot help thinking that they emanate from her mind and that she is herself the spirit I believe she is perfectly unconscious of it herself. But to make you understand more about it I had better describe the scene first prefacing it with my being a great sceptic on the subject and therefore as a consequence of my doubts anxious to investigate it to the bottom. Miss Fox has near relatives in this place to whom Mr Moodie had expressed a wish to see the fair Kate should she again visit our town. One morning about three weeks since I was alone in the drawing room when my servant girl announced Miss F and her cousin. I had seen her the summer befor

In [282]:
# Subjective

Count = 0

for i in pronounSub:
    Count = texts[chunk].count(i) + Count

print(Count)

9


In [283]:
# Objective

Count = 0

for i in pronounObj:
    Count = texts[chunk].count(i) + Count

print(Count)

0


In [284]:
# All pronouns

Count = 0

for i in pronounAll:
    Count = texts[chunk].count(i) + Count

print(Count)

9


## Now run on all

In [285]:
# Now the rest

fppAll_Ct = []

for item in texts:
    Count = 0
    for i in pronounAll:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppAll_Ct.append(Count)

print(len(fppAll_Ct))
print(fppAll_Ct[0:9])

2270
[10, 16, 16, 20, 15, 18, 14, 24, 22]


In [286]:
# Now just subjective pronouns

fppSub_Ct = []

for item in texts:
    Count = 0
    for i in pronounSub:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppSub_Ct.append(Count)

print(len(fppSub_Ct))
print(fppSub_Ct[0:9])

2270
[9, 13, 13, 18, 11, 13, 12, 20, 14]


In [287]:
# Now just objective pronouns

fppObj_Ct = []

for item in texts:
    Count = 0
    for i in pronounObj:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppObj_Ct.append(Count)

print(len(fppObj_Ct))
print(fppObj_Ct[0:9])

2270
[1, 3, 3, 2, 4, 5, 2, 4, 8]
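
The three loops above differ only in the pronoun list, so they could be folded into one helper. This is a sketch of the same substring counting, with a hypothetical function name `count_patterns` and two invented toy texts standing in for the corpus. Since `pronounAll` is the subjective list followed by the objective list, the combined count is the sum of the two, which matches the printed outputs above (for example 9 + 1 = 10 for the first chunk).

```python
def count_patterns(texts, patterns):
    """For each text, sum str.count() over every pattern (substring match)."""
    return [sum(text.count(p) for p in patterns) for text in texts]


pronoun_sub = ["I ", "I'm ", "I've ", "I'll ", "I'd "]
pronoun_obj = [" me ", "Me ", " myself ", "Myself "]

# Two toy chunks standing in for the corpus (invented for illustration).
toy_texts = ["I think I'll go. Tell me so.", "Write to me soon, I beg."]
sub_counts = count_patterns(toy_texts, pronoun_sub)
obj_counts = count_patterns(toy_texts, pronoun_obj)
all_counts = [s + o for s, o in zip(sub_counts, obj_counts)]
print(sub_counts, obj_counts, all_counts)
```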


## Add new variables to metadata

In [290]:
df['mentsDis'] = [', '.join(x) for x in mentsDis]
df['mentsTot'] = mentsTot
df['indsTot'] = indsTot
df['fppAll_Ct'] = fppAll_Ct
df['fppSub_Ct'] = fppSub_Ct
df['fppObj_Ct'] = fppObj_Ct
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2270 entries, 0 to 2269
Data columns (total 35 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          2270 non-null   int64  
 1   docauthorid       2270 non-null   object 
 2   docauthorname     2270 non-null   object 
 3   docid             2270 non-null   object 
 4   docyear           2235 non-null   float64
 5   docmonth          2171 non-null   float64
 6   authorgender      2270 non-null   object 
 7   agewriting        1536 non-null   float64
 8   agedeath          1525 non-null   float64
 9   relMin            1870 non-null   object 
 10  nationalOrigin    2266 non-null   object 
 11  authorLocation    2270 non-null   object 
 12  U                 2076 non-null   object 
 13  M                 2076 non-null   object 
 14  S                 2076 non-null   object 
 15  F                 2076 non-null   object 
 16  L                 2076 non-null   object 


In [291]:
df.head()

Unnamed: 0,docID-AT,docauthorid,docauthorname,docid,docyear,docmonth,authorgender,agewriting,agedeath,relMin,...,lexicalDiversity,chunks,position,topicNumber,mentsDis,mentsTot,indsTot,fppAll_Ct,fppSub_Ct,fppObj_Ct
0,1,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872.0,11.0,F,22.0,91.0,True,...,0.634146,12,0.083333,7,"Darling Sister, Justina, Sister M Louis, Mothe...",5,5,10,9,1
1,2,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872.0,11.0,F,22.0,91.0,True,...,0.619231,12,0.166667,10,"Tait, McCann",2,2,16,13,3
2,3,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872.0,11.0,F,22.0,91.0,True,...,0.621212,12,0.25,10,,0,0,16,13,3
3,4,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872.0,11.0,F,22.0,91.0,True,...,0.610108,12,0.333333,21,"Sister Anthony, Bigelow",3,2,20,18,2
4,5,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,1872.0,11.0,F,22.0,91.0,True,...,0.590226,12,0.416667,21,"Sister Anthony, McCabe, Segale, Henry, Seminar...",7,7,15,11,4


In [292]:
df.to_csv("20240414_PhD_FinalData-LtrChk.csv")