## NER Letter Chunk Spacy

## Resources

In [1]:
import spacy
import pandas as pd
from collections import Counter
from spacy import displacy

## Get Data

In [2]:
# Sentence Data
df = pd.read_csv("20240611_PhD_Data4NER-LtrChk.csv", index_col=0) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2392 entries, 0 to 2391
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          2392 non-null   int64  
 1   docid             2392 non-null   object 
 2   docyear           2392 non-null   int64  
 3   docmonth          2364 non-null   float64
 4   authorName        2177 non-null   object 
 5   docauthorid       2392 non-null   object 
 6   authorLocation    2392 non-null   object 
 7   authorGender      2392 non-null   object 
 8   nationalOrigin    2392 non-null   object 
 9   irish             2392 non-null   bool   
 10  otherUK           2392 non-null   bool   
 11  relMin            1065 non-null   object 
 12  catholic          1065 non-null   object 
 13  otherChristian    1065 non-null   object 
 14  U                 1253 non-null   object 
 15  M                 1276 non-null   object 
 16  S                 1245 non-null   object 


## Test the NER of various models on the texts

In [3]:
# The next few cells are run multiple times to check the performance of various pre-trained models.
# Do not run this cell until after the first pass.
nlp = spacy.load("en_core_web_md")

In [4]:
# I started with 0 and checked 
chunk = 1349

In [5]:
# Place narratives into a list representing the corpus
texts = df.text.values.tolist()
texts[chunk]

"Jacques and Hays who greatly admired it. I wish they would order one of me. I had no idea that it would look so well; but it is hard work to draw with a needle so many groups out of your own head — 16 groups of flowers it took a group to every vandyke. I must take a rest now for the gay colors have made my eyes weak. I saw from my window at Weston the funeral of poor Col Richard Denison. The cemetry was just opposite our place at Weston. It was a pouring day but there was a great number of gentlemen came by train and carriages to the funeral. His death was very sudden. He had some small growth of flesh cut out of the nose that impeded his breathing and went out one cold day soon after before it was healed. Erysypelas set in and he died in a few hours. I see Dr Lister of BV is gone but have heard no particulars of his death. I saw Malcolm at Katey's last week but he looks very thin and pale and consumptive. Katey was much struck with Philip Wiggs appearance. I didn't see him but I like

In [6]:
# Test on a single chunk
item = texts[chunk]

# Run the language model on the selected chunk
narrative = nlp(item)

# Collect every PERSON mention in the chunk
mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

# Count how often each name is mentioned
counts = dict(Counter(mentions))

# Distinct names
individuals = set(mentions)

print("Number of mentions:", len(mentions), "\n")
print(counts, "\n")    
print("Number of individuals:", len(individuals), "\n")
print("Individuals:", individuals, "\n")    
    

Number of mentions: 8 

{'Jacques': 1, 'Hays': 1, 'Richard Denison': 1, 'Lister': 1, 'Malcolm': 1, 'Katey': 1, 'Philip Wiggs': 1, 'Arthur Strickland': 1} 

Number of individuals: 8 

Individuals: {'Lister', 'Hays', 'Malcolm', 'Arthur Strickland', 'Katey', 'Richard Denison', 'Philip Wiggs', 'Jacques'} 



This isn't great because Darling Sister Justina is being broken into two entities and others aren't being found (e.g., Eugénie de Guérin, Sister Blandina). Let's highlight the labels for a sample text.

In [7]:
text = texts[chunk]
doc = nlp(text)
displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

Also, only one of the two references to Mother Josephine is being labeled. Let's see if changing the model will help.

For chunk 567, this model performed well. For chunk 1135, it performs reasonably well; it misidentifies Snow as a person, but that is because the word is capitalised. For 1703, this model is perfect: there are no named people. For 2269, this model is perfect. For 284, this model performed perfectly. For 851, this model misses Agnes, The Colonel, and Tiny man (a nickname, so understandable). It is also possible that the Jessie is a boat rather than a person. However, given the number of person references, this model does reasonably well. For 1419, this model performs almost perfectly; it attaches "more than L10" to Robert's name for some reason. For 1988, this model is perfect. For 100, Sister Justina is missed and Sister without a personal name is tagged erratically (sometimes tagged, sometimes not).

I'm attempting to use the 19th-century Word2Vec language model (Hosseini, 2021). The files come in the Gensim export formats .model and .npy. First, I had to load these into Gensim and then export a .txt file that I could use with spaCy. See notebook 20240412_histLM.ipynb for that step. Once I had the .txt file, I followed the steps described at https://www.youtube.com/watch?v=JmLQedi80_Y and https://github.com/wjbmattingly/spacy_custom_vectors/blob/main/spacy_word_vecs.ipynb. I have decided not to train the model on my text but rather to use the pre-trained model. These steps are reflected in the notebook spacy_word_vecs in my home directory.
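The intermediate .txt file is presumably in the plain word2vec text layout: a header line giving the vocabulary size and vector width, then one token per line followed by its values. The sketch below writes that layout from a small in-memory dictionary using only the standard library. It illustrates the file format with made-up vectors; it is not the export code from 20240412_histLM.ipynb, and `write_word2vec_text` is a hypothetical helper.

```python
import io


def write_word2vec_text(vectors, fh):
    """Write a {token: [floats]} mapping to fh in word2vec text format."""
    widths = {len(vec) for vec in vectors.values()}
    if len(widths) != 1:
        raise ValueError("all vectors must share one non-zero-sized width")
    width = widths.pop()

    # Header line: vocabulary size, then vector width
    fh.write(f"{len(vectors)} {width}\n")

    # One token per line, followed by its space-separated values
    for token, vec in vectors.items():
        fh.write(token + " " + " ".join(f"{x:.6f}" for x in vec) + "\n")


# Made-up three-dimensional vectors, for illustration only
toy_vectors = {"letter": [0.1, 0.2, 0.3], "sister": [0.4, 0.5, 0.6]}
buffer = io.StringIO()
write_word2vec_text(toy_vectors, buffer)
vector_file_text = buffer.getvalue()
```

In Gensim, `KeyedVectors.save_word2vec_format(path, binary=False)` writes this layout, and spaCy's `init vectors` command reads it.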

In [8]:
#nlp = spacy.load("/Users/alaynemoody/spacy_custom_vectors-main/models/01")

In [9]:
# This does not seem necessary but keeping just in case.
#nlp.initialize()

In [10]:
#text = texts[chunk]
#doc = nlp(text)
#displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

This is even worse because it's not capturing Sister Justin or Sister M Louis and is mixing up places (e.g., Mt St Vincent and Steuvenville). Let's try the large spaCy model.

For chunk 567, this model performed well. For chunk 1135, this one performs OK. It does not misidentify Snow but it does miss Mr Young, which orthographically is clearly a person. For 1703, this model is also perfect. For 2269, this model performs badly, missing Captain Hale and misinterpreting Character (capitalized for emphasis) as a person. For 284, this model does OK, catching John but also tagging Sister, which refers to the narrator. For 851, this model misses Dunbar, Agnes, the Colonel, the Tiny man, Duponts and Henry -- plus it tags the potential boat. All in all, this model performs poorly. For 1419, this model performs poorly, missing multiple references to Critchlow and Mr Davies as well as to Cundall. For 1988, this model is perfect. For 100, this model misses Sister Justina and erratically tags Sister without a surname. It also incorrectly tags "Vd", so I'm calling it a poor performance.

In [11]:
#nlp = spacy.load("en_core_web_lg")

In [12]:
#text = texts[chunk]
#doc = nlp(text)
#displacy.render(doc, style="ent", options = {"ents": ["PERSON"]})

This isn't much better, as the two Mother Josephine references are being treated as two different people, and Sister Justina and Sister Blandina are being missed. I am going to re-run the cells above on some more texts to see how we go. After 10 trials, the medium and large models were tied. I am doing another 10 with just those two, focusing on the authors that tripped up the models most: Segale, Moodie, and Harris (I am more concerned with Sarah than Harris because she has more letters in the corpus; see counts below).

Additional Runs: For 567, this model performed poorly because it identified DeWitt and Davenport as people when in fact the quotes indicate that it is a company (probably a publishing house). For 1135, this model performs best -- skipping Snow and finding Young. For 1703, again perfect. For 2269, this model is perfect. For 284, this model is perfect. For 851, this model misses the Colonel, Mrs H Traill and the Tiny man, so the same number as the medium model -- reasonably well. For 1419, this model performs perfectly. For 1988, this model is perfect. For 100, this model is better about not tagging the Sister-sans-surname references but it incorrectly tags "Por amor."
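Several of the misses noted in these trials are names introduced by a title (Sister Justina, Mr Young, Captain Hale). As a rough cross-check on the model output, a regular expression can list title-plus-name spans for manual comparison. This is only a sketch and not part of the pipeline used here: the `HONORIFICS` tuple and the `find_titled_names` helper are assumptions, and the pattern will over-match when a capitalised word such as "I" directly follows a name.

```python
import re

# Titles seen in the trial notes; extend as needed (assumed list)
HONORIFICS = ("Sister", "Mother", "Mr", "Mrs", "Miss", "Dr", "Col", "Captain")

# A title, an optional full stop, then one or more capitalised words
TITLED_NAME = re.compile(
    r"\b(?:" + "|".join(HONORIFICS) + r")\.?"
    r"(?:\s+[A-Z][\w'-]*)+"
)


def find_titled_names(text):
    """Return every title-plus-name span found in text, in order."""
    return [match.group(0) for match in TITLED_NAME.finditer(text)]


# Two sentences adapted from chunk 1349
sample = (
    "I saw from my window the funeral of poor Col Richard Denison. "
    "I see Dr Lister of BV is gone."
)
titled = find_titled_names(sample)
```

Spans returned here but absent from `narrative.ents` are candidates for entities the model missed.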

In [13]:
#len(df[df["docauthorid"]=="per0001043"])

In [14]:
#df['docauthorname'][851]
#df['docauthorname'][567]
#len(df[df["docauthorname"]=="Moodie, Susannah Strickland, 1803-1885"])
#df['docID-AT'][df["docauthorname"]=="Moodie, Susannah Strickland, 1803-1885"].describe()

In [17]:
#df['docauthorname'][1419]
#len(df[df["docauthorname"]=="Harris, Sarah Stretch, 1818-1897"])
#df['docID-AT'][df["docauthorname"]=="Harris, Sarah Stretch, 1818-1897"].describe()

In [18]:
#df['docauthorname'][1135]
#len(df[df["docauthorname"]=="Harris, Critchlow, 1813-1899"])

After ten trials it became clear that the histLM model performed far worse than the two spaCy models, so I dropped it before continuing with the final ten trials. The medium-sized model slightly outperformed the large model (score of 2.4 compared to 2.5), so I have decided to proceed with the medium model (en_core_web_md).
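The trial notes above are qualitative. One way to make such comparisons easier to repeat, assuming a hand-made list of the people actually named in each chunk, is to compute precision and recall per chunk. The sketch below is not how the 2.4 and 2.5 scores were derived; `score_chunk` is a hypothetical helper and the example names are purely illustrative.

```python
def score_chunk(predicted, gold):
    """Compare predicted PERSON names with a hand-made gold list."""
    predicted, gold = set(predicted), set(gold)
    hits = predicted & gold

    # By convention here, an empty side scores 1.0 (nothing to get wrong)
    precision = len(hits) / len(predicted) if predicted else 1.0
    recall = len(hits) / len(gold) if gold else 1.0

    return {
        "precision": precision,
        "recall": recall,
        "missed": sorted(gold - predicted),
        "spurious": sorted(predicted - gold),
    }


# Illustrative names only, not an actual annotation of any chunk
example_score = score_chunk(
    predicted=["Agnes", "Jessie"],
    gold=["Agnes", "The Colonel"],
)
```

Averaging these values over the same set of chunks for each model would give a figure that can be compared directly.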

## Named entity extraction for the texts

In [19]:
nlp = spacy.load("en_core_web_md")

In [20]:
mentsTot = []  # total PERSON mentions per text
mentsDis = []  # per-text dict of name -> mention count
indsTot = []   # number of distinct names per text

for item in texts:
    # Run the language model on this text
    narrative = nlp(item)

    # Collect PERSON mentions; recomputed for every text, so a text with no
    # entities yields empty results rather than reusing the previous text's
    mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']
    counts = dict(Counter(mentions))
    individuals = set(mentions)

    mentsTot.append(len(mentions))
    mentsDis.append(counts)
    indsTot.append(len(individuals))
    
                   
print(len(mentsTot)) 
print(len(indsTot))
print(len(mentsDis))

print(mentsTot[0]) 
print(indsTot[0])
print(mentsDis[0])


2392
2392
2392
3
3
{'Isabella Moore': 1, 'Willie': 1, 'Bye Isabella Weir': 1}


## Self-references

Now for 1st person singular pronouns, subjective and objective only, per Tackman, A. M., Sbarra, D. A., Carey, A. L., Donnellan, M. B., Horn, A. B., Holtzman, N. S., Edwards, T. S., Pennebaker, J. W., & Mehl, M. R. (2019). Depression, Negative Emotionality, and Self-Referential Language: A Multi-Lab, Multi-Measure, and Multi-Language-Task Research Synthesis. Journal of Personality and Social Psychology, 116(5), 817–834. https://doi.org/10.1037/pspp0000187.


In [21]:
pronounAll = ["I ", 
               "I'm ", 
               "I've ", 
               "I'll ", 
               "I'd ", 
               " me ", 
               "Me ", 
               " myself ", 
               "Myself "]
pronounAll

['I ', "I'm ", "I've ", "I'll ", "I'd ", ' me ', 'Me ', ' myself ', 'Myself ']

In [22]:
pronounSub = ["I ", "I'm ", "I've ", "I'll ", "I'd "]
pronounSub

['I ', "I'm ", "I've ", "I'll ", "I'd "]

In [23]:
pronounObj = [" me ", 
               "Me ", 
               " myself ", 
               "Myself "]
pronounObj

[' me ', 'Me ', ' myself ', 'Myself ']

## Now test

In [24]:
chunk = 600

In [25]:
#texts = [x.lower() for x in texts]

In [26]:
texts[chunk]

"December 3 1898. My Dear Aunt Maggie I have intended writing to you for some time but I suppose you have almost given up expecting to hear from me. Bella taught me how to make Rennaissance Renaissance and I am sending you a little center-piece I made. It is the first piece I did and perhaps not quite so well done as the next will be. Len insists that he must send you something and put in one of his own cards. We are sending it inside a newspaper and hope it will reach you all right by Christmas. I was very glad to receive your letter some time ago We are all well but Mamma says to tell you something that she knows you will be sorry to hear. My cousin Uncle Joe's second son Joseph Jr. Junior died on the fourth of December out in Denver Colorado where he had gone for his health three months before. It was a great shock to us all as it was the first break in a family of nine children and the first death since his Mother's eighteen years ago. He was just twenty-six years old and had been 

In [27]:
# Subjective

Count = 0

for i in pronounSub:
    Count = texts[chunk].count(i) + Count

print(Count)

9


In [28]:
# Objective

Count = 0

for i in pronounObj:
    Count = texts[chunk].count(i) + Count

print(Count)

1


In [29]:
# All pronouns

Count = 0

for i in pronounAll:
    Count = texts[chunk].count(i) + Count

print(Count)

10
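The substring counts above depend on a trailing space, so a pronoun followed by punctuation ("to me.") is not counted. A word-boundary regular expression is one alternative. The sketch below is an assumption about how that could look, not the method used in this notebook, and switching to it would change the counts reported here. `count_first_person` is a hypothetical helper; note that `\bI\b` would also match a Roman numeral or an initial.

```python
import re

SUBJECTIVE = re.compile(r"\bI\b")  # also matches the I in I'm, I've, I'll, I'd
OBJECTIVE = re.compile(r"\b(?:me|myself)\b", re.IGNORECASE)


def count_first_person(text):
    """Count first-person singular pronouns using word boundaries."""
    subjective = len(SUBJECTIVE.findall(text))
    objective = len(OBJECTIVE.findall(text))
    return {
        "subjective": subjective,
        "objective": objective,
        "all": subjective + objective,
    }


# Made-up sentence for illustration
sample = "I hope you will write to me. I'm well, and I made it myself."
regex_counts = count_first_person(sample)

# The substring approach on the same sentence, for comparison
substrings = ["I ", "I'm ", "I've ", "I'll ", "I'd ",
              " me ", "Me ", " myself ", "Myself "]
substring_total = sum(sample.count(s) for s in substrings)
```

On this sentence the substring total is lower because "me." and "myself." are each followed by a full stop rather than a space.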


## Now run on all

In [30]:
# Now count all first-person pronouns in every text

fppAll_Ct = []

for item in texts:
    Count = 0
    for i in pronounAll:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppAll_Ct.append(Count)

print(len(fppAll_Ct))
print(fppAll_Ct[0:9])

2392
[13, 10, 3, 13, 7, 4, 16, 14, 5]


In [31]:
# Now just subjective pronouns

fppSub_Ct = []

for item in texts:
    Count = 0
    for i in pronounSub:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppSub_Ct.append(Count)

print(len(fppSub_Ct))
print(fppSub_Ct[0:9])

2392
[12, 9, 2, 9, 7, 4, 16, 13, 5]


In [32]:
# Now just objective pronouns

fppObj_Ct = []

for item in texts:
    Count = 0
    for i in pronounObj:
        #print(texts[0].count(i))
        Count = item.count(i) + Count
    
    fppObj_Ct.append(Count)

print(len(fppObj_Ct))
print(fppObj_Ct[0:9])

2392
[1, 1, 1, 4, 0, 0, 0, 1, 0]


## Add new variables to metadata

In [33]:
df['mentsDis'] = [', '.join(x) for x in mentsDis]
df['mentsTot'] = mentsTot
df['indsTot'] = indsTot
df['fppAll_Ct'] = fppAll_Ct
df['fppSub_Ct'] = fppSub_Ct
df['fppObj_Ct'] = fppObj_Ct
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2392 entries, 0 to 2391
Data columns (total 37 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          2392 non-null   int64  
 1   docid             2392 non-null   object 
 2   docyear           2392 non-null   int64  
 3   docmonth          2364 non-null   float64
 4   authorName        2177 non-null   object 
 5   docauthorid       2392 non-null   object 
 6   authorLocation    2392 non-null   object 
 7   authorGender      2392 non-null   object 
 8   nationalOrigin    2392 non-null   object 
 9   irish             2392 non-null   bool   
 10  otherUK           2392 non-null   bool   
 11  relMin            1065 non-null   object 
 12  catholic          1065 non-null   object 
 13  otherChristian    1065 non-null   object 
 14  U                 1253 non-null   object 
 15  M                 1276 non-null   object 
 16  S                 1245 non-null   object 


In [34]:
df.head()

Unnamed: 0,docID-AT,docid,docyear,docmonth,authorName,docauthorid,authorLocation,authorGender,nationalOrigin,irish,...,scoreCom,chunks,position,topicNumber,mentsDis,mentsTot,indsTot,fppAll_Ct,fppSub_Ct,fppObj_Ct
0,1,20910,1891,7.0,Isabella Weir Moore,IED0107,USA,F,Irish,True,...,0.5151,1,1.0,3,"Isabella Moore, Willie, Bye Isabella Weir",3,3,13,12,1
1,2,21062,1871,11.0,E. Rothwell,IED0179,Canada,F,Irish,True,...,0.279733,2,0.5,3,"Kate, Lydia, Maria, Bissin, Garnetts, Tom Fitz...",11,10,10,9,1
2,3,21062,1871,11.0,E. Rothwell,IED0179,Canada,F,Irish,True,...,0.081575,2,1.0,4,"Edith, Edward, Annie, Richard Garnett, Kate, E...",6,6,3,2,1
3,4,21324,1892,5.0,Isabella Weir Moore,IED0107,USA,F,Irish,True,...,0.9423,1,1.0,3,"Anna, Brotherinlaw, Husband",4,3,13,9,4
4,5,21334,1891,10.0,Mary Savage,IED0621,USA,F,Irish,True,...,0.146967,2,0.5,4,"Lizzie, Johns, James Wm, William N, Nick John,...",13,12,7,7,0
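Joining the keys of each `mentsDis` dictionary with ", " drops the per-name counts, and a name that itself contains a comma would be hard to split apart again. A minimal sketch of one alternative, assuming a JSON string is acceptable in that column; `encode_counts` and `decode_counts` are hypothetical helpers, not part of this notebook.

```python
import json


def encode_counts(counts):
    """Serialise a {name: count} dict to a JSON string for one CSV cell."""
    return json.dumps(counts, ensure_ascii=False)


def decode_counts(cell):
    """Recover the {name: count} dict from the stored JSON string."""
    return json.loads(cell)


# Dictionary taken from the first text's output earlier in the notebook
example = {"Isabella Moore": 1, "Willie": 1, "Bye Isabella Weir": 1}
cell = encode_counts(example)
restored = decode_counts(cell)
```

With this approach the column assignment would become `[encode_counts(x) for x in mentsDis]`, and the counts survive the round trip through the CSV file.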


In [35]:
df.to_csv("20240611_PhD_FinalData-LtrChk.csv")