## NER Letter spaCy

## Resources

In [1]:
import spacy
import pandas as pd
from collections import Counter

In [2]:
nlp = spacy.load("en_core_web_md")

## Get Data

In [3]:
# Sentence Data
df = pd.read_csv("20240222_PhD_Data4NER-Letter.csv", index_col=0) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 576 entries, 0 to 575
Data columns (total 31 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          576 non-null    int64  
 1   docauthorid       576 non-null    object 
 2   docauthorname     576 non-null    object 
 3   docid             576 non-null    object 
 4   sourcetitle       576 non-null    object 
 5   docyear           573 non-null    float64
 6   docmonth          519 non-null    float64
 7   docday            474 non-null    float64
 8   authorgender      576 non-null    object 
 9   agewriting        460 non-null    float64
 10  birthyear         463 non-null    float64
 11  deathyear         449 non-null    float64
 12  religionNew       450 non-null    object 
 13  relMin            474 non-null    object 
 14  nationalOrigin    575 non-null    object 
 15  britishEmpire_EU  573 non-null    object 
 16  translated        576 non-null    bool   
 1

In [4]:
# Place narratives into a list representing the corpus
texts = df.text.values.tolist()
texts[0]

' TRINIDAD On Train from Steubenville, Ohio, to Cincinnati. Nov 30, 1872. My Darling Sister Justina: How interestedly you, Sister M Louis and myself read Eugénie de Guérin\'s Journal and her daily anxieties to save her brother from being a spiritual outcast! This Journal which I propose keeping for you will deal with incidents occurring on my journey to Trinidad and happenings in that far-off land to which I am consigned. The Journal will begin with the first act. Here is Mother Josephine\'s letter: Mt St Vincent, O, Nov 27, 1872. Sister Blandina, Steubenville, O My Dear Child: You are missioned to Trinidad. You will leave Cincinnati Wednesday and alone. Mother Regina will attend to your needs. Devotedly, Mother Josephine. This letter thrilled us both. I was delighted to make the sacrifice, and you were hiding your feelings that I might not lose any merit. Neither of us could find Trinidad on the map except in the island of Cuba. So we concluded that Cuba was my destination. I was to l

In [5]:
# Test on first item
item = texts[0]

# Run the language model on the 1st narrative
narrative = nlp(item)

# Collect every PERSON entity mention in the narrative
mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

# Count how often each distinct name string is mentioned
counts = dict(Counter(mentions))

# Distinct name strings (surface forms, not resolved individuals)
individuals = set(mentions)

print("Number of mentions:", len(mentions), "\n")
print(counts, "\n")
print("Number of individuals:", len(individuals), "\n")
print("Individuals:", individuals, "\n")

Number of mentions: 38 

{'Darling Sister': 1, 'Justina': 1, "Mother Josephine's": 1, 'Sister Blandina': 1, 'Mother Regina': 3, 'Tait': 1, 'McCann': 1, 'Sister Anthony': 2, 'Bigelow': 1, 'McCabe': 1, 'Segale': 1, 'Henry': 1, 'Seminary': 1, 'Sisters Gabriella': 1, 'Gabriella': 1, "Sister Sophia's": 1, 'Benedicta': 1, 'Leverone': 1, 'Garibaldi': 2, 'Gardelli': 1, 'Mary': 2, 'John': 1, "John Leverone's": 1, 'Genoese': 1, 'St Francis Xavier': 1, 'Grace': 2, 'Rev Dr Callaghan': 1, 'monica': 1, 'Sister Benedicta': 1, 'Sisters Antonia': 1, 'Gonzaga': 1, 'Sister Antonia': 1} 

Number of individuals: 32 

Individuals: {'Genoese', 'Sisters Gabriella', 'Gonzaga', 'McCabe', 'Rev Dr Callaghan', 'St Francis Xavier', 'Sister Antonia', 'McCann', 'Sisters Antonia', 'Justina', 'monica', "Mother Josephine's", 'Sister Anthony', 'Sister Benedicta', 'Henry', 'John', "John Leverone's", 'Gabriella', 'Garibaldi', 'Leverone', 'Bigelow', 'Seminary', 'Gardelli', 'Tait', 'Benedicta', 'Mary', 'Mother Regina', 'Grac
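
The counts above treat every distinct string as a separate individual, so possessive forms such as "Mother Josephine's" or "John Leverone's" are tallied apart from the plain name. A minimal sketch of one possible cleanup step, assuming only that a trailing possessive marker should be stripped. The helper name `normalize_mention` is hypothetical and not part of this notebook, and it does not try to merge titles or resolve who is who:

```python
from collections import Counter

def normalize_mention(name):
    """Strip surrounding whitespace and a trailing possessive 's (hypothetical helper)."""
    name = name.strip()
    # Handle both the straight and the typographic apostrophe
    for suffix in ("'s", "\u2019s"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
            break
    return name

# Example surface forms taken from the output above
raw_mentions = ["Mother Josephine's", "Mother Regina", "Mother Regina",
                "John Leverone's", "John"]

# Count mentions after normalization
normalized_counts = Counter(normalize_mention(m) for m in raw_mentions)
print(dict(normalized_counts))
```

Applied to the real data, the same helper could be mapped over `mentions` before building `counts` and `individuals`; whether that is appropriate depends on how individuals should be defined for the analysis.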

In [6]:
mentsTot = []
mentsDis = []
indsTot = []

for item in texts:

    # Run the language model on each narrative
    narrative = nlp(item)

    # Collect every PERSON entity mention in the narrative
    mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

    # Count how often each distinct name string is mentioned
    counts = dict(Counter(mentions))

    # Distinct name strings (surface forms, not resolved individuals)
    individuals = set(mentions)

    mentsTot.append(len(mentions))
    mentsDis.append(counts)
    indsTot.append(len(individuals))

print(len(mentsTot))
print(len(indsTot))
print(len(mentsDis))

print(mentsTot[0])
print(indsTot[0])
print(mentsDis[0])


576
576
576
38
32
{'Darling Sister': 1, 'Justina': 1, "Mother Josephine's": 1, 'Sister Blandina': 1, 'Mother Regina': 3, 'Tait': 1, 'McCann': 1, 'Sister Anthony': 2, 'Bigelow': 1, 'McCabe': 1, 'Segale': 1, 'Henry': 1, 'Seminary': 1, 'Sisters Gabriella': 1, 'Gabriella': 1, "Sister Sophia's": 1, 'Benedicta': 1, 'Leverone': 1, 'Garibaldi': 2, 'Gardelli': 1, 'Mary': 2, 'John': 1, "John Leverone's": 1, 'Genoese': 1, 'St Francis Xavier': 1, 'Grace': 2, 'Rev Dr Callaghan': 1, 'monica': 1, 'Sister Benedicta': 1, 'Sisters Antonia': 1, 'Gonzaga': 1, 'Sister Antonia': 1}


Now for first-person singular pronouns, with a separate count of subjective forms only, per Tackman, A. M., Sbarra, D. A., Carey, A. L., Donnellan, M. B., Horn, A. B., Holtzman, N. S., Edwards, T. S., Pennebaker, J. W., & Mehl, M. R. (2019). Depression, Negative Emotionality, and Self-Referential Language: A Multi-Lab, Multi-Measure, and Multi-Language-Task Research Synthesis. Journal of Personality and Social Psychology, 116(5), 817–834. https://doi.org/10.1037/pspp0000187.


In [7]:
pronounsAll = ["I ", "I'm ", "I've ", "I'll ", "I'd ", " me ", " myself ", "my ", "My", "mine"]
pronounsAll

['I ',
 "I'm ",
 "I've ",
 "I'll ",
 "I'd ",
 ' me ',
 ' myself ',
 'my ',
 'My',
 'mine']

In [8]:
pronounSub = ["I ", "I'm ", "I've ", "I'll ", "I'd "]
pronounSub

['I ', "I'm ", "I've ", "I'll ", "I'd "]

In [9]:
# Count on first one

print(texts[0])

Count = 0

for i in pronounsAll:
    Count = texts[0].count(i) + Count

print(Count)

 TRINIDAD On Train from Steubenville, Ohio, to Cincinnati. Nov 30, 1872. My Darling Sister Justina: How interestedly you, Sister M Louis and myself read Eugénie de Guérin's Journal and her daily anxieties to save her brother from being a spiritual outcast! This Journal which I propose keeping for you will deal with incidents occurring on my journey to Trinidad and happenings in that far-off land to which I am consigned. The Journal will begin with the first act. Here is Mother Josephine's letter: Mt St Vincent, O, Nov 27, 1872. Sister Blandina, Steubenville, O My Dear Child: You are missioned to Trinidad. You will leave Cincinnati Wednesday and alone. Mother Regina will attend to your needs. Devotedly, Mother Josephine. This letter thrilled us both. I was delighted to make the sacrifice, and you were hiding your feelings that I might not lose any merit. Neither of us could find Trinidad on the map except in the island of Cuba. So we concluded that Cuba was my destination. I was to leav

In [10]:
# Now the rest

fppAll_Ct = []

for item in texts:
    # Sum the substring counts of every pattern in pronounsAll for this letter
    fppAll_Ct.append(sum(item.count(p) for p in pronounsAll))

print(len(fppAll_Ct))
print(fppAll_Ct[0])

576
161


In [11]:
# Now just subjective pronouns

fppSub_Ct = []

for item in texts:
    # Sum the substring counts of every pattern in pronounSub for this letter
    fppSub_Ct.append(sum(item.count(p) for p in pronounSub))

print(len(fppSub_Ct))
print(fppSub_Ct[0])

576
104
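
`str.count` matches substrings, so the patterns above can match inside other words ("mine" inside "determine", "my " at the end of "enemy ") and can miss a pronoun followed by punctuation ("me," or "I."). A hedged alternative sketch using word-boundary regular expressions follows. The names `FPP_SUBJECTIVE`, `FPP_ALL`, and `count_pronouns` are hypothetical, contractions such as "I'm" are counted once through their leading "I", and counts produced this way would differ from the figures reported in this notebook:

```python
import re

# Hypothetical patterns: "I" is matched case-sensitively; the other
# pronouns accept an upper- or lower-case first letter.
FPP_SUBJECTIVE = re.compile(r"\bI\b")
FPP_ALL = re.compile(r"\b(?:I|[Mm]e|[Mm]yself|[Mm]y|[Mm]ine)\b")

def count_pronouns(text, pattern):
    """Count non-overlapping whole-word matches of pattern in text."""
    return len(pattern.findall(text))

sample = "I was delighted. My enemy will determine what I'm given, not me."

print(count_pronouns(sample, FPP_SUBJECTIVE))  # whole-word "I" only
print(count_pronouns(sample, FPP_ALL))         # all first-person singular forms

# For comparison, substring counting picks up "mine" inside "determine"
print(sample.count("mine"))
```

One caveat of this sketch: `\bI\b` also matches a standalone capital "I" used as an initial or Roman numeral, so some manual checking would still be needed.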


In [12]:
df['mentsDis'] = [', '.join(x) for x in mentsDis]
df['mentsTot'] = mentsTot
df['indsTot'] = indsTot
df['fppAll_Ct'] = fppAll_Ct
df['fppSub_Ct'] = fppSub_Ct
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 576 entries, 0 to 575
Data columns (total 36 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          576 non-null    int64  
 1   docauthorid       576 non-null    object 
 2   docauthorname     576 non-null    object 
 3   docid             576 non-null    object 
 4   sourcetitle       576 non-null    object 
 5   docyear           573 non-null    float64
 6   docmonth          519 non-null    float64
 7   docday            474 non-null    float64
 8   authorgender      576 non-null    object 
 9   agewriting        460 non-null    float64
 10  birthyear         463 non-null    float64
 11  deathyear         449 non-null    float64
 12  religionNew       450 non-null    object 
 13  relMin            474 non-null    object 
 14  nationalOrigin    575 non-null    object 
 15  britishEmpire_EU  573 non-null    object 
 16  translated        576 non-null    bool   
 1

In [13]:
df.head()

Unnamed: 0,docID-AT,docauthorid,docauthorname,docid,sourcetitle,docyear,docmonth,docday,authorgender,agewriting,...,scoreNeg,scorePos,scoreNeu,scoreCompound,topicNumber,mentsDis,mentsTot,indsTot,fppAll_Ct,fppSub_Ct
0,1,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.051,0.131,0.818,0.9994,5,"Darling Sister, Justina, Mother Josephine's, S...",38,32,161,104
1,2,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D004,At the End of the Santa Fe Trail,1872.0,12.0,6.0,F,22.0,...,0.055,0.111,0.834,0.9993,1,"Sisters, Mass, Sister Gabriella, Martha, Siste...",31,18,156,99
2,3,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D005,At the End of the Santa Fe Trail,1872.0,12.0,10.0,F,22.0,...,0.056,0.101,0.843,0.9987,1,"Kit Carson, Mrs Mullen, Seller, Otero, leastwi...",27,23,127,76
3,4,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D006,At the End of the Santa Fe Trail,1872.0,12.0,21.0,F,22.0,...,0.061,0.099,0.84,0.9981,5,"Ida Chené, Mrs Chené, Mass, Et introibo, Kyrie...",34,17,88,60
4,5,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D007,At the End of the Santa Fe Trail,1873.0,3.0,1.0,F,23.0,...,0.071,0.081,0.848,0.967,1,"Sister Marcella, Bishop Salpointe, Sister Mart...",17,13,103,69
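
The `mentsDis` column stores only the joined names, so the per-name counts computed earlier are not written out, and a name that itself contains a comma could not be split back reliably. If those counts are needed downstream, one option is to serialize each dictionary as JSON before saving. A minimal sketch, assuming a small stand-in list in place of the real `mentsDis` and using the standard-library `csv` module as a stand-in for the pandas CSV round trip (the column name `mentsDis_json` is hypothetical):

```python
import csv
import io
import json

# Stand-in for the per-letter count dictionaries built earlier
ments_dis_example = [
    {"Mother Regina": 3, "Ida Chené": 1},
    {},  # a letter with no PERSON entities
]

# Write one JSON string per row into an in-memory CSV
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["mentsDis_json"])
for counts in ments_dis_example:
    writer.writerow([json.dumps(counts, ensure_ascii=False)])

# Read the CSV back and parse each cell into a dictionary again
buffer.seek(0)
restored = [json.loads(row["mentsDis_json"]) for row in csv.DictReader(buffer)]

print(restored == ments_dis_example)
```

In the notebook itself the equivalent step would be assigning the list of `json.dumps(...)` strings to a new DataFrame column before calling `df.to_csv`.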


In [15]:
df.to_csv("20240225_PhD_FinalData-Letter.csv")