## NER on Letters with spaCy

## Resources

In [1]:
import spacy
import pandas as pd
from collections import Counter

In [2]:
nlp = spacy.load("en_core_web_md")

## Get Data

In [3]:
# Sentence Data
df = pd.read_csv("20240222_PhD_Data4NER-Chunk.csv", index_col=0) 
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3785 entries, 0 to 3784
Data columns (total 34 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          3785 non-null   int64  
 1   docauthorid       3785 non-null   object 
 2   docauthorname     3785 non-null   object 
 3   docid             3785 non-null   object 
 4   sourcetitle       3785 non-null   object 
 5   docyear           3741 non-null   float64
 6   docmonth          2824 non-null   float64
 7   docday            2215 non-null   float64
 8   authorgender      3785 non-null   object 
 9   agewriting        2817 non-null   float64
 10  birthyear         2861 non-null   float64
 11  deathyear         2849 non-null   float64
 12  religionNew       2501 non-null   object 
 13  relMin            3198 non-null   object 
 14  nationalOrigin    3780 non-null   object 
 15  britishEmpire_EU  3773 non-null   object 
 16  translated        3785 non-null   bool   


In [4]:
# Place narratives into a list representing the corpus
texts = df.chunk.values.tolist()
texts[0]

"TRINIDAD On Train from Steubenville, Ohio, to Cincinnati. Nov 30, 1872. My Darling Sister Justina: How interestedly you, Sister M Louis and myself read Eugénie de Guérin's Journal and her daily anxieties to save her brother from being a spiritual outcast! This Journal which I propose keeping for you will deal with incidents occurring on my journey to Trinidad and happenings in that far-off land to which I am consigned. The Journal will begin with the first act. Here is Mother Josephine's letter: Mt St Vincent, O, Nov 27, 1872. Sister Blandina, Steubenville, O My Dear Child: You are missioned to Trinidad. You will leave Cincinnati Wednesday and alone. Mother Regina will attend to your needs. Devotedly, Mother Josephine. This letter thrilled us both. I was delighted to make the sacrifice, and you were hiding your feelings that I might not lose any merit. Neither of us could find Trinidad on the map except in the island of Cuba. So we concluded that Cuba was my destination. I was to leav

In [5]:
# Test on first item
item = texts[0]

# Run the language model on the 1st narrative
narrative = nlp(item)

# Find the mentions of people in the narrative
mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

# Tally how often each name is mentioned
counts = {}
for person in mentions:
    counts[person] = counts.get(person, 0) + 1

# Distinct names mentioned
individuals = set(mentions)

print("Number of mentions:", len(mentions), "\n")
print(counts, "\n")    
print("Number of individuals:", len(individuals), "\n")
print("Individuals:", individuals, "\n")    
    

Number of mentions: 5 

{'Darling Sister': 1, 'Justina': 1, "Mother Josephine's": 1, 'Sister Blandina': 1, 'Mother Regina': 1} 

Number of individuals: 5 

Individuals: {'Mother Regina', 'Justina', 'Darling Sister', "Mother Josephine's", 'Sister Blandina'} 
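
The tally above treats every surface string as its own individual, so a possessive such as "Mother Josephine's" would be counted separately from a plain "Mother Josephine". A minimal sketch of one way to fold such variants together, using `Counter` from the imports: the helper `normalize_name` and the sample entity list are hypothetical, and the tuples stand in for `(ent.text, ent.label_)` pairs taken from `narrative.ents`.

```python
from collections import Counter

def normalize_name(name):
    """Hypothetical helper: trim whitespace and a trailing possessive 's."""
    name = name.strip()
    for suffix in ("'s", "\u2019s"):  # straight and curly apostrophes
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name

# Illustrative stand-in for [(ent.text, ent.label_) for ent in narrative.ents]
entities = [
    ("Mother Josephine's", "PERSON"),
    ("Sister Blandina", "PERSON"),
    ("Cincinnati", "GPE"),
    ("Mother Josephine", "PERSON"),
]

# Count only PERSON entities, after normalization
counts = Counter(
    normalize_name(text) for text, label in entities if label == "PERSON"
)

print("Number of mentions:", sum(counts.values()))
print("Number of individuals:", len(counts))
print(dict(counts))
```

This only handles possessives. Titles ("Sister", "Mother") and partial names would need further, corpus-specific rules.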



In [6]:
mentsTot = []
mentsDis = []
indsTot = []

for item in texts:

    # Run the language model on the current narrative
    narrative = nlp(item)

    # Find the mentions of people in the narrative
    mentions = [ent.text for ent in narrative.ents if ent.label_ == 'PERSON']

    # Tally how often each name is mentioned
    counts = {}
    for person in mentions:
        counts[person] = counts.get(person, 0) + 1

    # Distinct names mentioned
    individuals = set(mentions)

    mentsTot.append(len(mentions))
    mentsDis.append(counts)
    indsTot.append(len(individuals))
    
                   
print(len(mentsTot)) 
print(len(indsTot))
print(len(mentsDis))

print(mentsTot[0]) 
print(indsTot[0])
print(mentsDis[0])


3785
3785
3785
5
5
{'Darling Sister': 1, 'Justina': 1, "Mother Josephine's": 1, 'Sister Blandina': 1, 'Mother Regina': 1}
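
Calling `nlp(item)` once per text works, but spaCy also provides `nlp.pipe`, which processes texts in batches and is usually faster over a few thousand chunks. A rough sketch under that assumption: `person_stats` is a hypothetical name, and the `SimpleNamespace` objects are toy stand-ins for spaCy `Doc` and `Span` objects so the example runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

def person_stats(docs):
    """Yield (total mentions, distinct names, per-name counts) for each doc."""
    for doc in docs:
        counts = Counter(ent.text for ent in doc.ents if ent.label_ == "PERSON")
        yield sum(counts.values()), len(counts), dict(counts)

def fake_ent(text, label):
    """Toy stand-in for a spaCy entity span (assumption, for illustration)."""
    return SimpleNamespace(text=text, label_=label)

# With spaCy, the docs would instead come from something like:
#   docs = nlp.pipe(texts, batch_size=50)   # batch size is an arbitrary choice
docs = [
    SimpleNamespace(ents=[fake_ent("Tait", "PERSON"),
                          fake_ent("McCann", "PERSON"),
                          fake_ent("Tait", "PERSON")]),
    SimpleNamespace(ents=[]),  # a chunk with no entities yields zeros
]

stats = list(person_stats(docs))
mentsTot = [s[0] for s in stats]
indsTot = [s[1] for s in stats]
mentsDis = [s[2] for s in stats]

print(mentsTot, indsTot, mentsDis)
```

Because each document's counts are built fresh inside the loop, a chunk with no entities cannot inherit values from the chunk before it.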


Now for first-person singular pronouns, counted both across all forms and for subjective forms only, the latter per Tackman, A. M., Sbarra, D. A., Carey, A. L., Donnellan, M. B., Horn, A. B., Holtzman, N. S., Edwards, T. S., Pennebaker, J. W., & Mehl, M. R. (2019). Depression, Negative Emotionality, and Self-Referential Language: A Multi-Lab, Multi-Measure, and Multi-Language-Task Research Synthesis. Journal of Personality and Social Psychology, 116(5), 817–834. https://doi.org/10.1037/pspp0000187.


In [7]:
pronounsAll = ["I ", "I'm ", "I've ", "I'll ", "I'd ", " me ", " myself ", "my ", "My", "mine"]
pronounsAll

['I ',
 "I'm ",
 "I've ",
 "I'll ",
 "I'd ",
 ' me ',
 ' myself ',
 'my ',
 'My',
 'mine']

In [8]:
pronounSub = ["I ", "I'm ", "I've ", "I'll ", "I'd "]
pronounSub

['I ', "I'm ", "I've ", "I'll ", "I'd "]

In [9]:
# Count on first one

print(texts[0])

Count = 0

for i in pronounsAll:
    Count = texts[0].count(i) + Count

print(Count)

TRINIDAD On Train from Steubenville, Ohio, to Cincinnati. Nov 30, 1872. My Darling Sister Justina: How interestedly you, Sister M Louis and myself read Eugénie de Guérin's Journal and her daily anxieties to save her brother from being a spiritual outcast! This Journal which I propose keeping for you will deal with incidents occurring on my journey to Trinidad and happenings in that far-off land to which I am consigned. The Journal will begin with the first act. Here is Mother Josephine's letter: Mt St Vincent, O, Nov 27, 1872. Sister Blandina, Steubenville, O My Dear Child: You are missioned to Trinidad. You will leave Cincinnati Wednesday and alone. Mother Regina will attend to your needs. Devotedly, Mother Josephine. This letter thrilled us both. I was delighted to make the sacrifice, and you were hiding your feelings that I might not lose any merit. Neither of us could find Trinidad on the map except in the island of Cuba. So we concluded that Cuba was my destination. I was to leave

In [10]:
# Now the rest

fppAll_Ct = []

for item in texts:
    # Sum the substring counts of every pronoun form in this text
    fppAll_Ct.append(sum(item.count(p) for p in pronounsAll))

print(len(fppAll_Ct))
print(fppAll_Ct[0])

3785
11


In [11]:
# Now just subjective pronouns

fppSub_Ct = []

for item in texts:
    # Sum the substring counts of every subjective form in this text
    fppSub_Ct.append(sum(item.count(p) for p in pronounSub))

print(len(fppSub_Ct))
print(fppSub_Ct[0])

3785
5
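
The counts above rely on plain substring matching, which can both over- and under-count: `"mine"` also matches inside "determine", `"my "` matches the end of "Tommy ", `"My"` matches the start of "Myself", and `"I "` misses an "I" followed by punctuation. A possible alternative, offered as a sketch and not as the method used for the figures in this notebook, is to match whole words with regular expressions. The names `SUBJECTIVE`, `ALL_FORMS`, and `count_matches`, and the sample sentence, are illustrative assumptions.

```python
import re

# Subjective forms: "I" plus the contractions I'm, I've, I'll, I'd
# (straight or curly apostrophe); \b keeps matches on word boundaries.
SUBJECTIVE = re.compile(r"\bI(?:['\u2019](?:m|ve|ll|d))?\b")

# All forms: the subjective ones plus me / my / myself / mine,
# with an upper- or lower-case first letter.
ALL_FORMS = re.compile(
    r"\b(?:I(?:['\u2019](?:m|ve|ll|d))?|[Mm](?:e|y|yself|ine))\b"
)

def count_matches(pattern, text):
    """Count non-overlapping whole-word matches of pattern in text."""
    return len(pattern.findall(text))

# Made-up sentence containing several of the traps listed above
sample = ("My dear, I'm sure I determined that Tommy gave me mine. "
          "I, myself, agree.")

fpp_sub = count_matches(SUBJECTIVE, sample)
fpp_all = count_matches(ALL_FORMS, sample)
print(fpp_sub, fpp_all)
```

Word boundaries do not resolve every ambiguity: "mine" as a noun and "I" as a Roman numeral would still be counted, so spot-checking against the corpus remains advisable.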


In [12]:
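# Joining a dict iterates over its keys, so this column keeps the distinct names
# only; the per-name counts are not stored
df['mentsDis'] = [', '.join(x) for x in mentsDis]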
df['mentsTot'] = mentsTot
df['indsTot'] = indsTot
df['fppAll_Ct'] = fppAll_Ct
df['fppSub_Ct'] = fppSub_Ct
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3785 entries, 0 to 3784
Data columns (total 39 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   docID-AT          3785 non-null   int64  
 1   docauthorid       3785 non-null   object 
 2   docauthorname     3785 non-null   object 
 3   docid             3785 non-null   object 
 4   sourcetitle       3785 non-null   object 
 5   docyear           3741 non-null   float64
 6   docmonth          2824 non-null   float64
 7   docday            2215 non-null   float64
 8   authorgender      3785 non-null   object 
 9   agewriting        2817 non-null   float64
 10  birthyear         2861 non-null   float64
 11  deathyear         2849 non-null   float64
 12  religionNew       2501 non-null   object 
 13  relMin            3198 non-null   object 
 14  nationalOrigin    3780 non-null   object 
 15  britishEmpire_EU  3773 non-null   object 
 16  translated        3785 non-null   bool   


In [13]:
df.head()

Unnamed: 0,docID-AT,docauthorid,docauthorname,docid,sourcetitle,docyear,docmonth,docday,authorgender,agewriting,...,scoreNeu,scoreCompound,chunks,position,topicNumber,mentsDis,mentsTot,indsTot,fppAll_Ct,fppSub_Ct
0,1,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.827,0.9425,15,0.066667,12,"Darling Sister, Justina, Mother Josephine's, S...",5,5,11,5
1,2,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.859,0.8625,15,0.133333,12,"Josephine, Tait, McCann",3,3,15,10
2,3,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.883,0.6977,15,0.2,12,"Tait, McCann",2,2,15,10
3,4,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.812,0.9451,15,0.266667,12,,0,0,16,12
4,5,per0001043,"Segale, Sister Blandina, 1850-1941",S1019-D002,At the End of the Santa Fe Trail,1872.0,11.0,30.0,F,22.0,...,0.793,0.9509,15,0.333333,12,"Sister Anthony, Bigelow",3,2,16,15
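
In the last row shown, `mentsDis` lists two names while `mentsTot` is 3, because the joined string keeps names but not how often each occurs. If the per-name counts are needed downstream, one option is to serialize each dict as JSON before saving. This is a sketch only; the counts below are illustrative values, not taken from the data.

```python
import json

# Illustrative per-name counts for one chunk (values are assumptions)
counts = {"Sister Anthony": 2, "Bigelow": 1}

names_only = ", ".join(counts)      # names only, as in the mentsDis column
serialized = json.dumps(counts)     # a string that also keeps the counts
restored = json.loads(serialized)   # back to a dict after reading the CSV

print(names_only)
print(serialized)
```

A column built this way could sit alongside `mentsDis`, e.g. from `[json.dumps(x) for x in mentsDis]`, and be decoded with `json.loads` after the CSV is read back in.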


In [14]:
df.to_csv("20240225_PhD_FinalData-Chunk.csv")