### EXERCISE 1.6

In [2]:
import spacy
nlp = spacy.load("en_core_web_sm")
print("spaCy loaded successfully!")


spaCy loaded successfully!


### 1. IMPORT LIBRARIES

In [17]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import matplotlib.pyplot as plt
import scipy
import re

nlp = spacy.load("en_core_web_sm")
print("All imports + spaCy model loaded!")


All imports + spaCy model loaded!


### 2. Load the Twentieth-Century Text File


In [24]:
from pathlib import Path

base = Path.home() / "Desktop" / "Achievement 7 "

In [26]:
file_path = base / "02 Data" / "Prepared Data" / "key_events_20th_century.txt"

with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
    data_20th = f.read()
print(data_20th[:500])

Historic events in the 20th century

World at the beginning of the century

Spanish flu

Between the wars

Global war: World War II (1939–1945)

The post-war world

The world at the end of the century

See also

References

Sources

External links

The20th centurychanged the world in unprecedented ways. TheWorld Warssparked tension between countries and led to the creation ofatomic bombs, theCold Warled to theSpace Raceand the creation of space-based rockets, and theWorld Wide Webwas created. Th


### 3. Text Wrangling Evaluation

The raw text has several problems that can interfere with NLP tasks such as tokenization and entity recognition: special characters, stray punctuation, inconsistent spacing (words fused together, as in "The20th centurychanged"), and numeric reference markers such as "[265]". In addition, some country names appear in forms (abbreviations or alternate spellings) that do not exactly match the lookup list, which could lead to inaccurate counts.

To address this, the text was cleaned by normalizing spacing, removing unnecessary special characters, and standardizing country name variants. The corrected text was saved as a new .txt file to ensure accurate and consistent analysis.
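
The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration, not the cleaning code actually used for this exercise: the function name `clean_text` and the `COUNTRY_VARIANTS` mapping are hypothetical, and the regular expressions are simple heuristics. The spacing rule only repairs a lowercase letter fused to a following capitalized word ("inSouth" becomes "in South"), so it cannot fix cases like "Warssparked", and it would wrongly split names such as "McDonald".

```python
import re

# Hypothetical variant -> canonical mapping (assumption: the real lookup
# list of country names is not shown in this notebook).
COUNTRY_VARIANTS = {
    "US": "United States",
    "USA": "United States",
    "UK": "United Kingdom",
}


def clean_text(text: str) -> str:
    """Apply a few heuristic cleaning rules to raw scraped text."""
    # 1. Drop bracketed numeric reference markers such as "[265]".
    text = re.sub(r"\[\d+\]", " ", text)
    # 2. Insert a space where a lowercase letter is fused to a capitalized word.
    text = re.sub(r"(?<=[a-z])(?=[A-Z][a-z])", " ", text)
    # 3. Replace whole-word country variants with their canonical form.
    for variant, canonical in COUNTRY_VARIANTS.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", canonical, text)
    # 4. Collapse runs of spaces and tabs (newlines are left alone).
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()


print(clean_text("I.[265]Since the US was   inSouth Africa"))
```

The cleaned string could then be written to a new file with `Path.write_text` and loaded in place of the raw file.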

### 4. Create NER Object

In [40]:
# Create spaCy Doc (NER object)
doc_20th = nlp(data_20th)   

# Preview first 30 entities
for ent in list(doc_20th.ents)[:30]:
    print(ent.text, "|", ent.label_)


the 20th century | DATE
the beginning of the century | DATE
Spanish | NORP
World War II | EVENT
the end of the century | DATE
The20th | PERSON
TheWorld | ORG
Warled to theSpace Raceand | ORG
today | DATE
the 20th century | DATE
The 1900s | DATE
the decade | DATE
1914 | DATE
thePanama Canal | WORK_OF_ART
the 1900s | DATE
the Congo | ORG
1914 to 1918 | DATE
the First World War | EVENT
The First World War | EVENT
WWI | ORG
The Great War | EVENT
July 1914 | DATE
November 1918 | DATE
ErzherzogFranz Ferdinand | ORG
byGavrilo Princip | PERSON
theYoung Bosnialiberation | ORG
Crisis | ORG
the end of July 1914 | DATE
theBritish Empire | GPE
France | GPE


### 5. Split Sentences

In [44]:

sentences_data = []

for sent in doc_20th.sents:
    entities = [ent.text for ent in sent.ents]
    sentences_data.append({
        "sentence": sent.text,
        "entities": entities
    })

df_sentences = pd.DataFrame(sentences_data)
df_sentences.head()


| | sentence | entities |
|---|---|---|
| 0 | Historic events in the 20th century\n\nWorld a... | [the 20th century, the beginning of the centur... |
| 1 | TheWorld Warssparked tension between countries... | [TheWorld, Warled to theSpace Raceand] |
| 2 | These advancements have played a significant r... | [today] |
| 3 | The new beginning of the 20th century marked s... | [the 20th century] |
| 4 | The 1900s saw the decade herald a series of in... | [The 1900s, the decade] |


### 6. Filter Entities

In [47]:
df_sentences[df_sentences["entities"].map(len) > 0].head(10)


| | sentence | entities |
|---|---|---|
| 0 | Historic events in the 20th century\n\nWorld a... | [the 20th century, the beginning of the centur... |
| 1 | TheWorld Warssparked tension between countries... | [TheWorld, Warled to theSpace Raceand] |
| 2 | These advancements have played a significant r... | [today] |
| 3 | The new beginning of the 20th century marked s... | [the 20th century] |
| 4 | The 1900s saw the decade herald a series of in... | [The 1900s, the decade] |
| 5 | 1914 saw the completion of thePanama Canal.\n\n | [1914, thePanama Canal] |
| 6 | TheScramble for Africacontinued in the 1900s a... | [the 1900s] |
| 7 | Theatrocities in the Congo | [the Congo] |
| 10 | From 1914 to 1918, the First World War, and it... | [1914 to 1918, the First World War] |
| 11 | The First World War (or simply WWI), termed "T... | [The First World War, WWI, The Great War, July... |


In [49]:
df_sentences_filtered = df_sentences[df_sentences["entities"].map(len) > 0]
df_sentences_filtered.tail(10)


| | sentence | entities |
|---|---|---|
| 235 | China, an ancient nation comprising a fifth of... | [China, fifth, West, East] |
| 236 | With the end of colonialism and the Cold War, ... | [the Cold War, billion, Africa, centuries] |
| 237 | The world was undergoing its second major peri... | [second, first, the 19th century, World War] |
| 238 | I.[265]Since the US was in a position of almos... | [US, China, India, West] |
| 241 | Others said that the powerful nations with lar... | [theThird World] |
| 243 | Meanwhile, inSouth Africa, the apartheid came ... | [inSouth Africa, andNelson Mandelabecame, firs... |
| 244 | InRwanda, an estimated one million people were... | [InRwanda, an estimated one million] |
| 245 | Terrorism rose in the late century; theOklahom... | [the late century, theOklahoma City, 168] |
| 248 | Despotssuch asKim Jong-ilofNorth Koreacontinue... | [Despotssuch asKim Jong-ilofNorth] |
| 254 | Like the threat of anuclear world war, the shi... | [theKyoto, the 20th century’s, thatNew Year's,... |


### 7. Create DataFrames

In [52]:
# Make a copy to avoid SettingWithCopyWarning
df_rel = df_sentences_filtered.copy()

# Keep only the first whitespace-separated token of each entity
# (e.g. "the Congo" becomes "the", "World War II" becomes "World")
df_rel["character_entities"] = df_rel["entities"].apply(
    lambda x: [item.split()[0] for item in x]
)
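
Because the first token of an entity is often an article or a number, an alternative is to keep only entities that match the country lookup list mentioned in section 3. The sketch below is a hedged illustration of that idea: the `COUNTRIES` set and the `keep_countries` helper are hypothetical stand-ins, since the actual lookup list is not shown here. It strips a leading article and then requires an exact match, so fused strings such as "InRwanda" are not matched unless the text is cleaned first.

```python
import re

# Hypothetical lookup set (assumption: stands in for the real country list).
COUNTRIES = {"France", "China", "India", "Congo"}

# Matches a leading article followed by whitespace, e.g. "the " in "the Congo".
_LEADING_ARTICLE = re.compile(r"^(the|a|an)\s+", flags=re.IGNORECASE)


def keep_countries(entities):
    """Return the entities that exactly match a country name, in order."""
    kept = []
    for ent in entities:
        name = _LEADING_ARTICLE.sub("", ent.strip())
        if name in COUNTRIES:
            kept.append(name)
    return kept


print(keep_countries(["the Congo", "France", "1914", "the 20th century"]))
```

It could be applied in place of the lambda above, for example `df_rel["entities"].apply(keep_countries)`.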


In [64]:
import numpy as np

relationships = []
window_size = 5

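# df_rel keeps the original sentence index (with gaps from filtering), and
# .loc slices by index label with an inclusive end, so each window covers
# the rows whose labels fall between i and end_i.
for i in range(len(df_rel)):
    end_i = min(i + window_size, len(df_rel) - 1)

    # Combine the entity lists of every row in the window into one flat list
    char_list = sum(df_rel.loc[i:end_i, "character_entities"], [])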
    
    # Remove consecutive duplicates
    char_unique = [char_list[j] for j in range(len(char_list))
                   if j == 0 or char_list[j] != char_list[j - 1]]
    
    # Create relationships
    if len(char_unique) > 1:
        for idx, a in enumerate(char_unique[:-1]):
            b = char_unique[idx + 1]
            relationships.append({"source": a, "target": b})
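
For comparison, the same sliding-window idea can be written over plain Python lists, where windows are defined by row position rather than by index label. This is a sketch of a variant, not a drop-in replacement: the helper name `window_pairs` is hypothetical, and because list slices exclude the end position, each window holds at most `window_size` rows, so the resulting counts would differ from those produced by the loop above.

```python
def window_pairs(entity_lists, window_size=5):
    """Pair up neighbouring entities inside each positional window."""
    pairs = []
    for start in range(len(entity_lists)):
        # Rows start .. start + window_size - 1 (fewer near the end of the list)
        window = entity_lists[start:start + window_size]
        # Flatten the per-sentence lists into one sequence
        flat = [name for names in window for name in names]
        # Drop consecutive duplicates so an entity is never paired with itself
        deduped = [n for k, n in enumerate(flat) if k == 0 or n != flat[k - 1]]
        # Pair each entity with the one that follows it
        pairs.extend(zip(deduped, deduped[1:]))
    return pairs


print(window_pairs([["A"], ["A", "B"], [], ["C"]], window_size=2))
```

With the DataFrame above, the input would be `df_rel["character_entities"].tolist()`.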


In [66]:
df_relationships = pd.DataFrame(relationships)
df_relationships.head()


| | source | target |
|---|---|---|
| 0 | the | Spanish |
| 1 | Spanish | World |
| 2 | World | the |
| 3 | the | The20th |
| 4 | The20th | TheWorld |


In [68]:
# Sort each source/target pair alphabetically so that A-B and B-A
# are treated as the same undirected edge
df_relationships = pd.DataFrame(
    np.sort(df_relationships.values, axis=1),
    columns=df_relationships.columns
)

# Add edge weight
df_relationships["value"] = 1

# Aggregate relationships
df_relationships = (
    df_relationships
    .groupby(["source", "target"], as_index=False)
    .sum()
)

df_relationships.head(10)


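| | source | target | value |
|---|---|---|---|
| 0 | 1 | Hitler | 6 |
| 1 | 1 | Japanese | 6 |
| 2 | 1 | US | 6 |
| 3 | 1 | World | 6 |
| 4 | 10 | 8 | 6 |
| 5 | 10 | Allies | 12 |
| 6 | 10 | Japanese | 6 |
| 7 | 10 | Manchukuo | 6 |
| 8 | 10 | a | 6 |
| 9 | 127.[182 | 1970 | 6 |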
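
The sort-then-group aggregation can also be expressed with the standard library, which may make the logic easier to follow. In this sketch the helper name `count_undirected` is hypothetical and the sample pairs are made up for illustration. Each pair is sorted so that both directions share one key, and a `Counter` accumulates the edge weight.

```python
from collections import Counter


def count_undirected(pairs):
    """Count how often each unordered (source, target) pair occurs."""
    counts = Counter()
    for a, b in pairs:
        # Sorting makes ("US", "China") and ("China", "US") the same key
        key = tuple(sorted((a, b)))
        counts[key] += 1
    return counts


# Illustrative input only, not taken from the notebook's data
sample = [("US", "China"), ("China", "US"), ("France", "US")]
print(count_undirected(sample))
```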


### 8. Save and Export

In [71]:
# Save and Export to CSV File

df_relationships.to_csv("character_relationships.csv", index=False)
print("File saved as character_relationships.csv")


File saved as character_relationships.csv
