# Exercise 1.6: Named Entity Recognition and Network Analysis

## Objective
Apply a Named Entity Recognition (NER) algorithm to extract relationships between the countries mentioned in the 20th-century timeline. This prepares the data for network visualization in Exercise 1.7.

## 1. Import Libraries and Setup

I'm importing all necessary libraries for NER analysis and network relationship extraction.

In [2]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import matplotlib.pyplot as plt
import re

# Load SpaCy English model
nlp = spacy.load("en_core_web_sm")

print("Libraries loaded successfully!")

Libraries loaded successfully!


## 2. Load 20th Century Text Data

I'm loading the text file I created in Exercise 1.5.

In [3]:
# Load the 20th century timeline text
with open('20th_century_events.txt', 'r', errors='ignore') as file: 
    data = file.read().replace('\n', ' ')

print(f"Text loaded successfully!")
print(f"Total characters: {len(data)}")
print(f"\nFirst 500 characters:\n{data[:500]}")

Text loaded successfully!
Total characters: 101768

First 500 characters:
    Timeline of the 20th century - Wikipedia                            Jump to content        Main menu      Main menu move to sidebar hide    		Navigation 	   Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us      		Contribute 	   HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages                    Search            Search                       Appearance                 Donate  Create account  Log in         Personal tools      Donate Create account 


## 3. Evaluate Text and Data Wrangling Needs

### Observations:

**Special Characters:**
- The text contains navigation elements from Wikipedia ("Jump to content", "Main menu", "move to sidebar")
- Multiple spaces and formatting artifacts from web scraping
- No problematic special characters that would break NER

### Wrangling Strategy:
1. Load countries list
2. Check if country names need standardization
3. Clean any formatting issues if necessary

In [4]:
# Load countries list
with open('countries_list.txt', 'r', errors='ignore') as f:
    countries = [line.strip() for line in f.readlines()]

print(f"Total countries loaded: {len(countries)}")
print(f"\nFirst 20 countries:")
for i, country in enumerate(countries[:20], 1):
    print(f"{i}. {country}")

Total countries loaded: 373

First 20 countries:
1. Afghanistan
2. Albania
3. Algeria
4. Andorra
5. Angola
6. Antigua and Barbuda
7. Argentina
8. Armenia
9. Australia
10. Austria
11. Azerbaijan
12. Bahamas
13. Bahrain
14. Bangladesh
15. Barbados
16. Belarus
17. Belgium
18. Belize
19. Benin
20. Bhutan


In [5]:
# Check sample of text for country mentions
sample_countries = ['United States', 'China', 'Russia', 'Germany', 'France', 'United Kingdom']

print("Checking how countries appear in the text:\n")
for country in sample_countries:
    count = data.count(country)
    count_lower = data.lower().count(country.lower())
    print(f"{country}: {count} exact matches, {count_lower} case-insensitive matches")

# Check if multi-word countries exist in our list
print("\n\nMulti-word countries in our list:")
multi_word_countries = [c for c in countries if ' ' in c]
print(f"Total multi-word countries: {len(multi_word_countries)}")
print(f"Examples: {multi_word_countries[:10]}")

Checking how countries appear in the text:

United States: 55 exact matches, 55 case-insensitive matches
China: 32 exact matches, 34 case-insensitive matches
Russia: 30 exact matches, 30 case-insensitive matches
Germany: 23 exact matches, 23 case-insensitive matches
France: 17 exact matches, 17 case-insensitive matches
United Kingdom: 28 exact matches, 28 case-insensitive matches


Multi-word countries in our list:
Total multi-word countries: 136
Examples: ['Antigua and Barbuda', 'Bosnia and Herzegovina', 'Burkina Faso', 'Cabo Verde', 'Central African Republic', 'Costa Rica', "Côte d'Ivoire", 'Democratic Republic of the Congo', 'Dominican Republic', 'El Salvador']


## 4. Text Wrangling Assessment

### Key Findings:

**1. Case Sensitivity:**
- Country names appear with consistent capitalization in the text
- No major case-sensitivity issues detected

**2. Multi-word Countries:**
- 136 out of 373 countries have multi-word names
- Examples: "United States", "United Kingdom", "Costa Rica"
- SpaCy's NER labels entities as multi-token spans, so it should usually keep these names together

**3. Special Characters:**
- At least one country name contains special characters: "Côte d'Ivoire" (with accent and apostrophe)
- The text contains Wikipedia navigation elements. These yield some spurious entities (see Section 6), but the country filter in Section 8 discards them

**4. Name Variations:**
- "China" appears 34 times (case-insensitive) vs 32 exact matches. Because `str.count` matches substrings, the two extra hits may be lowercase "china" or part of a longer word
- Countries are generally referred to by their common names (not official names)
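
The counts above come from `str.count`, which matches substrings. A whole-word count is one way to check whether a discrepancy comes from longer words. This is a minimal sketch and not part of the original notebook: `count_whole_word` is a hypothetical helper, and the sample sentence is invented for illustration rather than taken from the timeline.

```python
import re

def count_whole_word(name: str, text: str) -> int:
    """Count case-insensitive occurrences of `name` that are not part of a longer word."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Illustrative text only, not from the timeline
sample = "China and Indochina; Niger and Nigeria."

substring_hits = sample.lower().count("china")       # also matches inside "Indochina"
whole_word_hits = count_whole_word("China", sample)  # matches the standalone word only

print(substring_hits, whole_word_hits)
```

Running the same comparison on `data` for each sample country would show whether any of the counts are inflated by substrings.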

### Decision:
**No text wrangling needed.** The text is clean enough for NER analysis. SpaCy's NER will:
- Label multi-word names as single entity spans
- Recognize most proper nouns despite the surrounding navigation text
- Process the text as-is

I'll proceed directly to creating the NER object.
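
Should the runs of whitespace from scraping ever need removing, one option is to collapse them with a regular expression before calling `nlp`. This is a minimal sketch under that assumption, not a step the notebook performs. `collapse_whitespace` is a hypothetical helper and the sample string is modeled on the navigation text shown earlier.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Sample modeled on the scraped navigation text (illustrative only)
sample = "  Main menu      Main menu move to sidebar hide    \t\tNavigation \t   Main page"

cleaned = collapse_whitespace(sample)
print(cleaned)
```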

## 5. Create NER Object

I'm applying SpaCy's Named Entity Recognition algorithm to the 20th century timeline text.

In [6]:
# Create NER object using SpaCy
print("Processing text with NER algorithm...")
print("This may take a minute...")

book = nlp(data)

print(f"\nNER object created successfully!")
print(f"Total tokens processed: {len(book)}")
print(f"Total sentences: {len(list(book.sents))}")
print(f"Total entities found: {len(book.ents)}")


NER object created successfully!
Total tokens processed: 19561
Total sentences: 1401
Total entities found: 4040


## 6. Visualize Sample of Identified Entities

Next I'll examine a sample of what the NER algorithm detected.

In [7]:
# Display a sample of identified entities
print("Sample of identified entities (first 50):\n")
for ent in list(book.ents)[:50]:
    print(f"{ent.text:30} | {ent.label_:15} | {ent.label_}")

print("\n\nEntity type distribution:")
entity_types = {}
for ent in book.ents:
    entity_types[ent.label_] = entity_types.get(ent.label_, 0) + 1

for label, count in sorted(entity_types.items(), key=lambda x: x[1], reverse=True):
    print(f"{label:15}: {count}")

Sample of identified entities (first 50):

the 20th century               | DATE            | DATE
Search            Search                       Appearance                 Donate  Create | WORK_OF_ART     | WORK_OF_ART
Log in         Personal        | ORG             | ORG
1                              | CARDINAL        | CARDINAL
1900s                          | DATE            | DATE
1.1                            | CARDINAL        | CARDINAL
1.3                            | CARDINAL        | CARDINAL
1.4                            | CARDINAL        | CARDINAL
1910s                          | DATE            | DATE
2.1                            | CARDINAL        | CARDINAL
2.3 1912         2.4 1913         2.5 1914         2.6 1915         2.7 1916         2.8 1917         2.9 1918 | PRODUCT         | PRODUCT
3.6                            | CARDINAL        | CARDINAL
3.9                            | CARDINAL        | CARDINAL
3.10                           | CARDINAL        | CARDINAL
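
The tally loop in the cell above can also be written with `collections.Counter`. The sketch below uses a hypothetical list of labels standing in for `[ent.label_ for ent in book.ents]`, so the counts it prints are not the notebook's results.

```python
from collections import Counter

# Hypothetical labels standing in for [ent.label_ for ent in book.ents]
labels = ["DATE", "CARDINAL", "GPE", "CARDINAL", "ORG", "GPE", "CARDINAL"]

entity_types = Counter(labels)

# most_common() returns (label, count) pairs sorted by descending count
for label, count in entity_types.most_common():
    print(f"{label:15}: {count}")
```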

## 7. Split Sentence Entities

I'm creating a dataframe in which each row holds one sentence and the list of GPE (geopolitical entity) mentions found in that sentence.

In [8]:
# Create list of dictionaries with sentences and their entities
df_sentences = []

# Loop through sentences, extract GPE entities (countries/places)
for sent in book.sents:
    # Get all GPE entities from the sentence
    entity_list = [ent.text for ent in sent.ents if ent.label_ == 'GPE']
    df_sentences.append({"sentence": sent.text, "entities": entity_list})

# Convert to dataframe
df_sentences = pd.DataFrame(df_sentences)

print(f"Total sentences: {len(df_sentences)}")
print(f"\nSentences with at least one GPE entity: {len(df_sentences[df_sentences['entities'].map(len) > 0])}")
print(f"\nFirst 10 rows:")
df_sentences.head(10)

Total sentences: 1401

Sentences with at least one GPE entity: 593

First 10 rows:


|   | sentence | entities |
|---|---|---|
| 0 | Timeline of the 20th century - Wikipedia ... | [] |
| 1 | also 14 Further reading 15 Ref... | [] |
| 2 | informationCite this pageGet shortened URLDown... | [Download] |
| 3 | Please help improve this article by adding cit... | [] |
| 4 | Unsourced material may be challenged and removed. | [] |
| 5 | Find sources: "Timeline of the 20th century" –... | [] |
| 6 | (Learn how and when to remove this message) M... | [Millennia] |
| 7 | 1900s[edit] See also: Edwardian Era, Gilded Ag... | [] |
| 8 | January 22: Edward VII became King of England ... | [Emperor, India] |
| 9 | March 2: The Platt Amendment provides for Cuba... | [] |


## 8. Filter Entities Using Countries List

I'm filtering the entities to keep only those that match countries from my scraped list. This removes non-country locations like cities or regions. The match is exact, so variants such as "the United States" or "Britain" are dropped as well.

In [9]:
# Function to filter entities - keep only countries from our list
def filter_entity(ent_list, countries_list):
    """
    Filters entity list to keep only entities that match country names
    """
    return [ent for ent in ent_list if ent in countries_list]

# Apply filter to create new column with only country entities
df_sentences['country_entities'] = df_sentences['entities'].apply(
    lambda x: filter_entity(x, countries)
)

# Filter dataframe to keep only sentences with country entities
df_sentences_filtered = df_sentences[df_sentences['country_entities'].map(len) > 0]

print(f"Sentences with country entities: {len(df_sentences_filtered)}")
print(f"\nFirst 10 filtered sentences:")
df_sentences_filtered.head(10)

Sentences with country entities: 335

First 10 filtered sentences:


|   | sentence | entities | country_entities |
|---|---|---|---|
| 8 | January 22: Edward VII became King of England ... | [Emperor, India] | [India] |
| 10 | June: Emily Hobhouse reports on the poor condi... | [South Africa] | [South Africa] |
| 12 | September 7: The Eight-Nation Alliance defeats... | [the Boxer Rebellion, China] | [China] |
| 17 | May 20: Cuba given independence by the United ... | [Cuba, the United States] | [Cuba] |
| 21 | Venezuelan crisis of 1902–1903, in which Brita... | [Britain, Germany, Italy, Venezuela] | [Germany, Italy, Venezuela] |
| 23 | June 11: King Alexander I of Serbia and his wi... | [Serbia] | [Serbia] |
| 25 | In Russia the Bolsheviks and the Mensheviks fo... | [Russia, Bolsheviks] | [Russia] |
| 27 | November 18: Independence of Panama, the Hay–B... | [Panama, the United States, Panama] | [Panama, Panama] |
| 30 | April 8: Entente Cordiale signed between Brita... | [Britain, France] | [France] |
| 34 | 1905[edit] January 22: The Revolution of 1905 ... | [Russia] | [Russia] |
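
In the filtered output above, rows 17, 21 and 30 lose `the United States` and `Britain` because the filter requires an exact match. One possible remedy is to normalize each entity before the lookup. The sketch below is an assumption-laden illustration, not the notebook's method: `normalize_entity`, `filter_countries` and the `ALIASES` map are hypothetical, the small `country_set` stands in for the full list, and it assumes the list spells the name `United States`. Whether "Britain" should map to "United Kingdom" is a modeling choice.

```python
# Hypothetical alias map; extend it after inspecting the unmatched entities
ALIASES = {"Britain": "United Kingdom"}

def normalize_entity(name: str) -> str:
    """Strip a leading article and map known aliases to the list's spelling."""
    name = name.strip()
    if name.lower().startswith("the "):
        name = name[4:]
    return ALIASES.get(name, name)

def filter_countries(ent_list, country_set):
    """Keep normalized entities that appear in the set of country names."""
    normalized = (normalize_entity(ent) for ent in ent_list)
    return [ent for ent in normalized if ent in country_set]

# Small stand-in for the full countries list (a set makes lookups fast)
country_set = {"Cuba", "United States", "United Kingdom", "France"}

row_17 = filter_countries(["Cuba", "the United States"], country_set)
row_30 = filter_countries(["Britain", "France"], country_set)
print(row_17)
print(row_30)
```

Applying such a step would change the sentence and relationship counts reported in the following sections, so the results there reflect the exact-match filter only.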


## 9. Create Relationships Between Countries

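I'm analyzing which countries appear close together in the text. The window spans 5 consecutive rows of the filtered dataframe, that is, 5 sentences that each mention at least one country. If two countries are mentioned near each other multiple times, they likely have a historical relationship.

**Method:**
- Window size: 5 sentences (rows of the filtered dataframe)
- Within each window, the country mentions are listed in order and each country is linked to the next distinct one
- Relationship strength = frequency of co-occurrence (overlapping windows count the same pair more than once)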

In [10]:
# Create relationships based on country co-occurrences
relationships = []
window_size = 5

# Iterate through the filtered dataframe
for i in range(len(df_sentences_filtered)):
    # Define end of window (5 rows including the current one, or end of dataframe)
    end_i = min(i + window_size, len(df_sentences_filtered))
    
    # Get all countries mentioned in this window, in order of appearance
    indices = df_sentences_filtered.index[i:end_i]
    country_list = sum(df_sentences_filtered.loc[indices, 'country_entities'].tolist(), [])
    
    # Remove consecutive duplicates
    country_unique = [country_list[j] for j in range(len(country_list))
                      if (j == 0) or country_list[j] != country_list[j-1]]
    
    # Link each country to the next one in the window
    if len(country_unique) > 1:
        for idx, a in enumerate(country_unique[:-1]):
            b = country_unique[idx + 1]
            relationships.append({"source": a, "target": b})

print(f"Total relationships found: {len(relationships)}")
print(f"\nFirst 10 relationships:")
for i, rel in enumerate(relationships[:10], 1):
    print(f"{i}. {rel['source']} ↔ {rel['target']}")

Total relationships found: 1594

First 10 relationships:
1. India ↔ South Africa
2. South Africa ↔ China
3. China ↔ Cuba
4. Cuba ↔ Germany
5. Germany ↔ Italy
6. Italy ↔ Venezuela
7. South Africa ↔ China
8. China ↔ Cuba
9. Cuba ↔ Germany
10. Germany ↔ Italy
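
To see how the window loop behaves, it helps to run the same logic on a few rows without pandas. The sketch below is a simplified restatement, with `window_pairs` as a hypothetical helper; the input reuses the first five `country_entities` lists from the filtered output in Section 8.

```python
def window_pairs(rows, window_size=5):
    """Pair neighbouring countries inside a sliding window of rows.

    `rows` is a list of per-sentence country lists, like the
    `country_entities` column of the filtered dataframe.
    """
    pairs = []
    for i in range(len(rows)):
        # Flatten the country lists of up to `window_size` consecutive rows
        flat = [c for row in rows[i:i + window_size] for c in row]
        # Drop consecutive repeats so a country is never paired with itself
        unique = [c for j, c in enumerate(flat) if j == 0 or c != flat[j - 1]]
        # Link each country to the next one in the window
        pairs.extend(zip(unique, unique[1:]))
    return pairs

# First five country_entities lists from the filtered output
rows = [["India"], ["South Africa"], ["China"], ["Cuba"],
        ["Germany", "Italy", "Venezuela"]]

pairs = window_pairs(rows)
print(pairs[:3])
print(len(pairs))
```

Two properties are visible here. Only neighbours are linked, so India and China share a window but get no direct edge. Overlapping windows also emit the same pair again (South Africa and China appear in the windows starting at rows 0 and 1), which is why pairs repeat in the output above.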


## 10. Aggregate & Count Relationship Frequencies

I'll summarize the relationships by counting how many times each country pair appears together. The count serves as a rough measure of how strongly two countries are connected in the timeline.

In [11]:
# Convert to dataframe
relationship_df = pd.DataFrame(relationships)

# Sort the pairs alphabetically to ensure A->B and B->A are counted together
relationship_df = pd.DataFrame(
    np.sort(relationship_df.values, axis=1),
    columns=relationship_df.columns
)

# Add value column for counting
relationship_df["value"] = 1

# Group by source-target pairs and sum frequencies
relationship_df = relationship_df.groupby(
    ["source", "target"], 
    sort=False, 
    as_index=False
).sum()

# Sort by frequency (highest first)
relationship_df = relationship_df.sort_values('value', ascending=False).reset_index(drop=True)

print(f"Unique country pairs: {len(relationship_df)}")
print(f"\nTop 20 most frequent country relationships:\n")
print(relationship_df.head(20).to_string(index=False))

Unique country pairs: 326

Top 20 most frequent country relationships:

     source           target  value
     Brazil            China     20
    Austria          Germany     17
      China            Japan     13
Afghanistan             Iran     13
    Germany            Japan     12
       Iran           Poland     12
      Italy Northern Ireland     12
    Austria          Hungary     10
     France          Germany     10
  Nicaragua    United States     10
      China        Hong Kong     10
     France          Morocco     10
  Indonesia         Malaysia     10
     Israel        Palestine     10
     France          Tunisia     10
    Denmark           Norway     10
 Bangladesh         Pakistan     10
     Israel          Lebanon      9
    Germany            Italy      9
   Portugal            Spain      9
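
The same undirected counting can be sketched without pandas, which makes the role of the alphabetical sort easier to see. This is an illustrative alternative, not the notebook's code; the pairs are taken from the first relationships listed in Section 9, plus one reversed pair added on purpose.

```python
from collections import Counter

relationships = [
    ("India", "South Africa"),
    ("South Africa", "China"),
    ("China", "Cuba"),
    ("China", "South Africa"),  # reversed on purpose (illustrative)
]

# Sorting each pair makes (A, B) and (B, A) the same key
pair_counts = Counter(tuple(sorted(pair)) for pair in relationships)

for (source, target), value in pair_counts.most_common():
    print(source, target, value)
```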


## 11. Export Relationships Dataframe

I'm saving the relationships dataframe as a CSV file for use in Exercise 1.7.

In [12]:
# Export relationships dataframe to CSV
relationship_df.to_csv('country_relationships.csv', index=False)

print("Relationships dataframe exported successfully!")
print(f"File: country_relationships.csv")
print(f"Total rows: {len(relationship_df)}")
print(f"\nDataframe summary:")
print(f"- Unique countries involved: {len(set(relationship_df['source'].tolist() + relationship_df['target'].tolist()))}")
print(f"- Strongest relationship: {relationship_df.iloc[0]['source']} ↔ {relationship_df.iloc[0]['target']} ({relationship_df.iloc[0]['value']} co-occurrences)")
print(f"- Average relationship strength: {relationship_df['value'].mean():.2f}")

Relationships dataframe exported successfully!
File: country_relationships.csv
Total rows: 326

Dataframe summary:
- Unique countries involved: 105
- Strongest relationship: Brazil ↔ China (20 co-occurrences)
- Average relationship strength: 4.89
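
As a quick check that the exported format is easy to consume, the sketch below reads rows in the same `source,target,value` layout and sums the edge weights per country. It is a hedged illustration only: `io.StringIO` stands in for the real file, and the three rows are copied from the top of the table in Section 10.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the top of the exported table; StringIO stands in for the file
csv_text = """source,target,value
Brazil,China,20
Austria,Germany,17
China,Japan,13
"""

strength = defaultdict(int)
for row in csv.DictReader(io.StringIO(csv_text)):
    weight = int(row["value"])
    # Each edge contributes its weight to both endpoints
    strength[row["source"]] += weight
    strength[row["target"]] += weight

print(dict(strength))
```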
