# Importing libraries 

In [1]:
import pandas as pd
import numpy as np
import spacy
import os

In [2]:
# Load the Text file 
with open('20th_century_events.txt', 'r', encoding='utf-8', errors='ignore') as file:
    data = file.read()

In [3]:
print(data[:1000])



Key events of the 20th century - Wikipedia

Jump to content

Main menu

Main menu
move to sidebar
hide

Navigation

Main pageContentsCurrent eventsRandom articleAbout WikipediaContact us

Contribute

HelpLearn to editCommunity portalRecent changesUpload fileSpecial pages

Search

Search

Appearance

Donate
Create account
Log in

Personal tools

Donate Create account Log in

Contents
move to sidebar
hide

(Top)

1
Historic events in the 20th century

Toggle Historic events in the 20th century subsection

1.1
World at the beginning of the century

1.1.1
"The war to end all wars": World War I (1914–1918)

1.2
Spanish flu

1.2.1
Russian Revolution and communism

1.3
Between the wars

1.3.1
Economic depression

1.3.2
The rise of dictatorship

1.4
Global war: World War II

### Cleaning New Lines

In [4]:
data_clean = data.replace('\n', ' ')

### Cleaning Citations (The Regex Tool)

In [5]:
import re
# This finds all [1], [2], etc., and replaces them with nothing
data_clean = re.sub(r'\[\d+\]', '', data_clean)

### Fixing Country Names

In [6]:
# Standardize names so they match lookup list
data_clean = data_clean.replace('USA', 'United States')

### Saving the "Clean" Version

In [7]:
with open('twentieth_century_CLEAN.txt', 'w', encoding='utf-8') as f:
    f.write(data_clean)

## Data Cleaning & Wrangling Observations

**Project: Twentieth-Century Key Events NER Analysis**
During the initial evaluation of the scraped twentieth-century text file, several "dirty" data issues were identified that could interfere with the Named Entity Recognition (NER) algorithm. Below are the specific observations and the corrective actions taken.

**🔍 Initial Observations**

**Whitespace and Line Breaks**: The raw text contained numerous \n (newline) characters. These can make the sentence segmenter split a sentence early, so a relationship between two countries mentioned on either side of a line break would be missed.

**Citation Markers**: Because the data was scraped, it contained Wikipedia-style citations (e.g., [1], [15]). These are problematic because the NER algorithm might try to process them as part of a word or a "Work of Art" entity.

**Special Characters**: Unexpected symbols and non-standard spacing were present, which can confuse the Tokenization step of the NLP process.

**Entity Inconsistency**: Some countries were referred to by multiple names (e.g., "USA" vs "United States"). The filtering step uses exact string matching, so these names must be standardized to match the entries in countries_lookup.txt exactly.

**🛠️ Cleaning Steps Taken**

**Removing Line Breaks**: I used the .replace('\n', ' ') method to turn each hard return into a space, creating a continuous flow of text for better sentence segmentation. Runs of blank lines therefore become runs of spaces rather than a single space.
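
The difference between a one-for-one replacement and collapsing whitespace can be sketched as follows. This is a minimal illustration, assuming that losing the original spacing is acceptable; `collapse_whitespace` is a hypothetical helper and the sample string is invented, neither is part of the notebook.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text).strip()

# Invented sample that imitates the scraped page's blank lines and tabs.
sample = "Jump to content\n\n\n\nMain menu\n\t\tNavigation\n"

flat = sample.replace("\n", " ")       # the notebook's approach: one space per newline
compact = collapse_whitespace(sample)  # each whitespace run becomes a single space

print(repr(flat))
print(repr(compact))
```

Either form works as spaCy input; the compact form mainly makes the cleaned file smaller and easier to read.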

**Removing Citations**: I used the re (regular expression) library to find and remove numeric citation markers, meaning one or more digits enclosed in square brackets, such as [1] or [15].

Pattern used: `r'\[\d+\]'`.
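
A short check of what this pattern does and does not match may help. The sample sentence below is invented for illustration; only the pattern itself comes from the notebook.

```python
import re

CITATION = re.compile(r"\[\d+\]")  # same pattern as the notebook

# Invented sentence with numeric markers and one non-numeric marker.
sample = "The war ended in 1918.[1] The treaty followed[15][16] a year later [note 2]."
cleaned = CITATION.sub("", sample)

print(cleaned)
```

Purely numeric markers are removed, while bracketed text that is not all digits, such as "[note 2]", is left in place and would need its own pattern if it appears in the source.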

**Standardizing Country Names**: I performed a manual check against my country list and used string replacement to ensure that country names in the text (Source) match the names in my lookup list (Reference).
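
When more than one variant needs fixing, the replacements can be driven by a small alias table. The sketch below is one possible approach, not the notebook's method: `ALIASES`, `standardize_names`, and the sample sentence are all hypothetical, and the alias entries are assumptions that would need checking against the actual text and lookup list.

```python
import re

# Hypothetical alias table: variant spelling -> name used in the lookup list.
ALIASES = {
    "U.S.A.": "United States",
    "USA": "United States",
    "U.S.": "United States",
}

def standardize_names(text: str, aliases: dict[str, str]) -> str:
    """Replace each alias with its canonical name, longest alias first."""
    for alias in sorted(aliases, key=len, reverse=True):
        # Lookarounds keep a match from starting or ending inside a longer word.
        pattern = r"(?<!\w)" + re.escape(alias) + r"(?!\w)"
        text = re.sub(pattern, aliases[alias], text)
    return text

sample = "The U.S.A. and the USA are the same country; USAF is not."
result = standardize_names(sample, ALIASES)
print(result)
```

The word-boundary guards are the main design choice here: a plain `.replace('USA', ...)` would also rewrite the first three letters of "USAF".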

**Encoding Check**: The file was saved using UTF-8 encoding to ensure that any special characters or accented names (like "Côte d'Ivoire") are preserved correctly without turning into garbage text.

**✅ Final Result**

The cleaned text is now one continuous string, ready to be passed to spaCy, which tokenizes it and returns a Doc object. With the "noise" (citations and line breaks) removed, the sentence segmentation and dependency parsing steps can look at each token in its proper context.

### NER Object

In [9]:
nlp = spacy.load("en_core_web_sm")
ner_doc = nlp(data_clean)

**Summary**

**🛠️ Creating the NER Object**

- The Goal: To transform raw text into a "smart" document that understands grammar and entities.

- The Process: I loaded spaCy's small English pipeline (en_core_web_sm) and passed my cleaned text through it to create a Doc object.

- Result: Every token is now analyzed. The model can distinguish between a geopolitical entity such as a country, city, or state (labeled as GPE), a person, or a date.

### Splitting Sentence Entities

In [10]:
df_sentences = []

# Loop through every sentence identified by spaCy
for sent in ner_doc.sents:
    # Get a list of the text for every entity in that sentence
    entity_list = [ent.text for ent in sent.ents]
    # Store the sentence and its found entities in a dictionary
    df_sentences.append({"sentence": sent, "entities": entity_list})

# Turn the list into a structured table (DataFrame)
df_sentences = pd.DataFrame(df_sentences)

**Summary**

In this step, I took the "smart" ner_doc created by spaCy and broke it down to see which entities appear together in specific sentences.

**✂️ Splitting Sentence Entities**

- The Goal: To break the text down into smaller, manageable pieces (sentences).

- The Process: I used a loop to iterate through every sentence in the text.

- Technical Output: I created a table where each row contains one sentence and a list of all the "entities" (important names) found within it.

### Load lookup list

In [11]:
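# Read the lookup file and turn it into a list of country names
with open('countries_lookup.txt', 'r', encoding='utf-8') as f:
    # Skip blank lines so empty strings never end up in the lookup list
    countries = [line.strip() for line in f if line.strip()]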

### Filter the table

In [12]:
# This function keeps an entity only if it is in your country list
def filter_entities(ent_list, lookup_list):
    return [ent for ent in ent_list if ent in lookup_list]

# Create a new column containing ONLY the matched countries
df_sentences['character_entities'] = df_sentences['entities'].apply(lambda x: filter_entities(x, countries))

# Remove any sentences that now have zero countries left
df_sentences_filtered = df_sentences[df_sentences['character_entities'].map(len) > 0]

**Summary**

**🔍 Filtering with the Country Lookup**

I filtered out "noise". If a sentence mentioned "Winston Churchill" and "France," this step drops "Winston Churchill" because that name is not in the country lookup list, leaving only "France".

- The Goal: To remove all "noise" and focus only on the countries of interest.

- The Process: I loaded my countries_lookup.txt file and compared it against the entities found by the spaCy model.

- Result: Any entity not on my specific list (like names of people or dates) was discarded, leaving a clean dataset of country mentions.
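
The filtering logic can be tried on a few hand-made entity lists. In this sketch the lookup entries and the entity lists are invented for illustration, and the lookup is held in a set (an assumption, the notebook uses a list) so that each membership test is a constant-time operation.

```python
def filter_entities(ent_list, lookup):
    """Keep only the entities whose text appears in the lookup collection."""
    return [ent for ent in ent_list if ent in lookup]

# Hypothetical lookup entries and per-sentence entity lists.
countries = {"France", "Germany", "Japan"}
sentence_entities = [
    ["Winston Churchill", "France"],
    ["1939", "Germany", "Poland"],
    ["Franklin D. Roosevelt"],
]

matched = [filter_entities(ents, countries) for ents in sentence_entities]
kept = [m for m in matched if m]  # drop sentences with no matching country

print(kept)
```

Note that "Poland" is dropped here only because it is missing from the toy lookup. The match is exact, so any country absent from the list, or spelled differently in the text, disappears in the same way.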

### Creating Relationships (The Sliding Window)

In [13]:
relationships = []

# Iterate through the filtered sentences
for i in range(len(df_sentences_filtered)):
    # Look at a window of up to 5 filtered sentences, starting at the current one
    end_i = min(i + 5, len(df_sentences_filtered))
    # Combine all country entities found in this 5-sentence block
    char_list = sum((df_sentences_filtered.iloc[i:end_i].character_entities), [])

    # Remove duplicates appearing right next to each other
    char_unique = [char_list[j] for j in range(len(char_list)) 
                   if (j == 0) or char_list[j] != char_list[j-1]]

    # If more than one country is in the window, they have a relationship
    if len(char_unique) > 1:
        for idx, a in enumerate(char_unique[:-1]):
            b = char_unique[idx + 1]
            relationships.append({"source": a, "target": b})

# Create the final relationship table
relationship_df = pd.DataFrame(relationships)

**Summary**

**🤝 Creating Relationships (Sliding Window)**

- The Goal: To identify which countries interacted with each other based on how close they appear in the text.

- The Logic: I implemented a sliding window of 5 sentences over the filtered table, so each window covers 5 consecutive sentences that mention at least one country. Within a window, the country mentions are read in order, consecutive repeats are collapsed, and each country is linked to the one mentioned next, which records a "relationship" between them.

- Final Calculation: I grouped these pairs together to count the total number of interactions, which tells us the "strength" of the connection between specific countries.
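
To see exactly which pairs the window produces, the same loop can be run on a toy input. This is a minimal pure-Python sketch of the logic in the cell above; `window_pairs` is a hypothetical helper name, and the toy sentences and the window size of 2 are chosen only to keep the trace short.

```python
def window_pairs(sentence_countries, window_size=5):
    """Return (source, target) pairs the way the notebook's loop builds them."""
    pairs = []
    for i in range(len(sentence_countries)):
        # Flatten the countries from this sentence and the ones that follow it.
        window = [c for sent in sentence_countries[i:i + window_size] for c in sent]
        # Collapse consecutive repeats: A, A, B -> A, B.
        unique = [c for j, c in enumerate(window) if j == 0 or c != window[j - 1]]
        # Link each country to the one mentioned next in the window.
        pairs.extend(zip(unique, unique[1:]))
    return pairs

# Hypothetical filtered sentences, each already reduced to its country mentions.
toy = [["France"], ["France", "Germany"], ["Japan"]]
pairs = window_pairs(toy, window_size=2)

print(pairs)
```

Two properties are visible in the result: overlapping windows record the France and Germany pair twice, and France and Japan are never linked even though both fall inside the second window, because only neighbouring mentions are paired.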


### Exporting The Data 

In [15]:
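# Count how many times each pair appeared
relationship_df = pd.DataFrame(relationships)
# Give every pair a weight of 1 so that summing per group yields a count
# (the column name "value" is an assumption; any name works)
relationship_df["value"] = 1
relationship_df = relationship_df.groupby(["source", "target"], sort=False, as_index=False).sum()
# Save the weighted edge list as a CSV file
relationship_df.to_csv('country_relationships.csv', index=False)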

**Summary**

**💾 Finalizing and Exporting the Data**

- This final step transforms the raw list of interactions into a structured file ready for network visualization.

- Aggregating Relationships: I used the .groupby() function to sum up every instance where two countries appeared together, creating a "weight" for each connection.

- The Goal: This step converts hundreds of individual mentions into a summary table that shows exactly how many times each country pair interacted throughout the text.

- Exporting to CSV: The final table was saved as a CSV file to the project path: C:\Users\ANITA BOADU\Twentieth Century Project\20th_century.

- Ready for Visualization: This file now serves as the "source of truth" for building the country network graph in the next exercise.
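
Summing per pair only yields a weight if each row carries a number to sum. The sketch below shows one way to build such a weight on invented pairs. It rests on two assumptions that are not stated in the notebook: the helper column is named `value`, and the direction of a pair does not matter, so "France, Germany" and "Germany, France" are counted as the same edge.

```python
import pandas as pd

# Hypothetical pairs standing in for the notebook's `relationships` list.
relationships = [
    {"source": "France", "target": "Germany"},
    {"source": "Germany", "target": "France"},
    {"source": "Germany", "target": "Japan"},
]
edges = pd.DataFrame(relationships)

# Put each pair in alphabetical order so A-B and B-A count as the same edge.
ordered = [sorted(pair) for pair in zip(edges["source"], edges["target"])]
edges = pd.DataFrame(ordered, columns=["source", "target"])

# Give every row a weight of 1, then sum the weights per pair.
edges["value"] = 1
weights = edges.groupby(["source", "target"], sort=False, as_index=False)["value"].sum()

print(weights)
```

If direction is meaningful for the analysis, the ordering step can simply be left out, and the two directions will be counted separately.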