# 1.6 Natural Language Processing and Network Analysis

## This script contains the following:
#### 1. Import Libraries
#### 2. Import Data
    NOTE
#### 3. Create Named Entity Recognition Object
#### 4. Splitting the Sentence Entities
#### 5. Filter the Entities Using the Country List
#### 6. Create a Relationship Dataframe
#### 7. Export the Data

### 1. Import Libraries

In [5]:
import pandas as pd
import numpy as np
import spacy
from spacy import displacy
import networkx as nx
import os
import matplotlib.pyplot as plt
import scipy
import re

In [6]:
# Download English module
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [7]:
# Load spacy English module
NER = spacy.load("en_core_web_sm")

### 2. Import Data

In [9]:
# Load the article
path = os.path.join(os.path.dirname('/Users/matthewjones/Documents/CareerFoundry/Data Visualization with Python/Achievement 1/20th-Century/02. Data/'), '20th Century Events_sans_punc.txt')
    
with open(path, 'r', errors='ignore') as file:
    data = file.read().replace('\n', '')

In [10]:
# Import the list of countries as a dataframe
path2 = r'/Users/matthewjones/Documents/CareerFoundry/Data Visualization with Python/Achievement 1/20th-Century/02. Data'

countries = pd.read_csv(os.path.join(path2, 'cleaned_countries_list.csv'), index_col = 0)

In [11]:
# Check the output
countries.head()

Unnamed: 0,country_name,country_alias,clean_country_alias
0,Afghanistan,Afghanistan,Afghanistan
1,Albania,Albania,Albania
2,Algeria,Algeria,Algeria
3,Andorra,Andorra,Andorra
4,Angola,Angola,Angola


In [12]:
countries.shape

(214, 3)

#### NOTE
    Between the text mining stage and the network analysis stage, a separate script was used to clean the data. In that process, the names of countries in the text document were changed to be consistent with what spaCy was registering as an entity (e.g. it did not reliably pick up 'United States', but would pick up 'the United States'). All other mentions of 'the' were removed, as were extraneous characters (punctuation and numbers).
    
    This script also cleaned the countries list. A few countries referenced in the article no longer exist, so they were not on the countries list. It would have been inaccurate to assign those mentions to present-day countries, so the older countries were added. Some countries with multi-word names also needed an alias to match how they were referenced in the text document (e.g. 'China, The People's Republic of' became 'China', and 'Bosnia and Herzegovina' became 'Bosnia'). Finally, the extra spaces around the country names were stripped so the names would match the entity names.
    
    The additional script and cleaned data were saved and included in the project folder.
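
The alias cleanup described in the note could look roughly like the plain-Python sketch below. The `ALIASES` mapping and the `clean_alias` helper are hypothetical names used only for illustration; the actual cleaning script is not shown in this notebook and may work differently.

```python
# Hypothetical alias map: name on the countries list -> form used in the article.
# Only the two examples mentioned in the note are included here.
ALIASES = {
    "China, The People's Republic of": "China",
    "Bosnia and Herzegovina": "Bosnia",
}

def clean_alias(name):
    """Strip surrounding spaces, then swap in an alias if one is defined."""
    stripped = name.strip()
    return ALIASES.get(stripped, stripped)

raw_names = [" Albania ", "Bosnia and Herzegovina", "China, The People's Republic of "]
cleaned = [clean_alias(n) for n in raw_names]
print(cleaned)
```

Stripping before the lookup matters in this sketch: a name with a trailing space would otherwise miss its alias and also fail to match the entity text later on.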

### 3. Create Named Entity Recognition Object
Using the NLP library spaCy to apply a named entity recognition (NER) model to the article

In [15]:
# Set the NER object
article = NER(data)

In [16]:
# Visualize identified entities
displacy.render(article[273:20000], style = "ent", jupyter = True)

### 4. Splitting the Sentence Entities
Storing each sentence's entities as a list in a dataframe

In [18]:
# Create an empty shell to store results
df_sentences = [] 

# Loop through sentences, to get entity list for each sentence
for sent in article.sents:
    entity_list = [ent.text for ent in sent.ents]
    df_sentences.append({"sentence": sent, "entities": entity_list})
    
# Convert the list into a dataframe
df_sentences = pd.DataFrame(df_sentences)

In [19]:
# Check the output
df_sentences.head(10)

Unnamed: 0,sentence,entities
0,"( , Key, , events, , of, , th, , c...",[Navigation Main pageContentsCurrent ...
1,"(race1.4.5The, , end, , of, , Cold, ...",[]
2,"(informationCite, , this, , pageGet, ...","[URLDownload QR , PDFPrintable]"
3,"(World, , Wars, , sparked, , tension,...",[]
4,"(These, , advancements, , have, , pla...",[today]
5,"(Historic, , events, , in, , th, , ...",[]
6,"(s, , saw, , decade, , herald, , a...",[]
7,"(From, , to, , First, , Wo...",[]
8,"(`, `, , war, , to, , end, , all, ...","[1918, Sarajevo]"
9,"(war, , and, , by, , extension, , ...","[Gavrilo Princip of , Bosnian, Serbs]"


### 5. Filter the Entities Using the Country List
Identifying only the entities that match the countries we are analyzing

In [21]:
# Write a function to filter out entities not on the cleaned countries list
def filter_entity(ent_list, countries):
    # Build the lookup once per call instead of once per entity
    allowed = set(countries['clean_country_alias'])
    return [ent for ent in ent_list if ent in allowed]

In [22]:
# Apply the function and store the results in a new column
df_sentences['country_entities'] = df_sentences['entities'].apply(lambda x: filter_entity(x, countries))
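
The filter relies on exact string equality, which is why the aliases had to match the entity text character for character. The sketch below illustrates this with made-up values; `allowed_aliases` and `keep_exact_matches` are hypothetical stand-ins for the `clean_country_alias` column and the notebook's filter function.

```python
# Illustrative alias set; in the notebook this would come from
# countries['clean_country_alias'].
allowed_aliases = {"the United States", "Russia"}

def keep_exact_matches(ent_list, allowed):
    """Keep only entities whose text is exactly equal to an allowed alias."""
    return [ent for ent in ent_list if ent in allowed]

entities = ["United States", "the United States", "Russia", "British"]
matched = keep_exact_matches(entities, allowed_aliases)
print(matched)
```

Here 'United States' is dropped because the alias includes the leading 'the', and 'British' is dropped because nationality adjectives are not on the list.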

In [60]:
# Check the output
df_sentences.head(10)

Unnamed: 0,sentence,entities,country_entities
0,"( , Key, , events, , of, , th, , c...",[Navigation Main pageContentsCurrent ...,[]
1,"(race1.4.5The, , end, , of, , Cold, ...",[],[]
2,"(informationCite, , this, , pageGet, ...","[URLDownload QR , PDFPrintable]",[]
3,"(World, , Wars, , sparked, , tension,...",[],[]
4,"(These, , advancements, , have, , pla...",[today],[]
5,"(Historic, , events, , in, , th, , ...",[],[]
6,"(s, , saw, , decade, , herald, , a...",[],[]
7,"(From, , to, , First, , Wo...",[],[]
8,"(`, `, , war, , to, , end, , all, ...","[1918, Sarajevo]",[]
9,"(war, , and, , by, , extension, , ...","[Gavrilo Princip of , Bosnian, Serbs]",[]


In [24]:
# Filter out sentences that don't have any country entities
df_sentences_filtered = df_sentences[df_sentences['country_entities'].map(len) > 0]

# Check the output
df_sentences_filtered

Unnamed: 0,sentence,entities,country_entities
13,"(Allies, , known, , initially, ,...","[British, Russia]",[Russia]
14,"(Germany, , Austria, -, Hungary, ...","[Germany, Austria]","[Germany, Austria]"
15,"(In, , Russia, , ended, , hos...",[Russia],[Russia]
16,"(Bolsheviks, , negotiated, , Treaty, ...",[Russia],[Russia]
17,"(In, , treaty, , Bolshevik, , Ru...","[Russia, Baltic]",[Russia]
...,...,...,...
1098,"(Retrieved, , -12, -, 20.^, , `, `, , ...",[the Soviet Union],[the Soviet Union]
1135,"(`, `, , Why, , Skylab, , Was, , t...","[the United States, First]",[the United States]
1261,"(`, `, , Anti, -, the, United, Statesn, ...","[Middle East, Lebanon]",[Lebanon]
1267,"(Rise, , of, , China, , and, , Ind...","[China, India]","[China, India]"


### 6. Create a Relationship Dataframe
Calculating how much each country interacts with one another in the article

In [25]:
# Define relationships between countries that appear near each other

# window_size = 5: each window covers sentence labels i through i + 5.
# Note that .loc slicing is label-based and includes both endpoints.
window_size = 5
relationships = []  # one {"source", "target"} dict per adjacent pair

last_index = df_sentences_filtered.index[-1]

for i in range(last_index):
    end_i = min(i + window_size, last_index)
    # Concatenate the country lists of all sentences in the window
    country_list = sum((df_sentences_filtered.loc[i: end_i].country_entities), [])

    # Remove duplicated countries that are next to each other
    country_unique = [country_list[j] for j in range(len(country_list))
                      if (j == 0) or country_list[j] != country_list[j - 1]]

    # Record each pair of neighbouring countries as a relationship
    if len(country_unique) > 1:
        for idx, a in enumerate(country_unique[:-1]):
            b = country_unique[idx + 1]
            relationships.append({"source": a, "target": b})
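
To make the windowing logic easier to follow, here is a minimal plain-Python sketch of the same idea on toy data. It assumes the input is a dict mapping sentence index to that sentence's country list; `window_pairs` and the sample data are hypothetical and stand in for the dataframe used above.

```python
def window_pairs(sentence_countries, window=5):
    """Return (source, target) pairs of countries that are adjacent
    within a sliding window of sentence indices."""
    if not sentence_countries:
        return []
    last = max(sentence_countries)
    pairs = []
    for start in range(last):
        end = min(start + window, last)
        # Gather countries from sentences start..end (inclusive), in order
        merged = []
        for idx in range(start, end + 1):
            merged.extend(sentence_countries.get(idx, []))
        # Collapse consecutive repeats of the same country
        unique = [c for j, c in enumerate(merged) if j == 0 or c != merged[j - 1]]
        # Pair each country with the next one in the window
        pairs.extend(zip(unique, unique[1:]))
    return pairs

toy = {2: ["Russia"], 3: ["Russia", "Germany"]}
toy_pairs = window_pairs(toy)
print(toy_pairs)
```

Because the windows overlap, the same neighbouring pair is recorded once for every window that contains it. In the toy data the single Russia/Germany adjacency falls inside three windows, so it is counted three times; the grouped counts later in the notebook are therefore best read as a proximity weight rather than a count of distinct mentions.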

In [62]:
# Convert the list into a dataframe
relationship_df = pd.DataFrame(relationships)

In [64]:
# Sort each pair alphabetically so that a->b and b->a are treated as the same relationship
relationship_df = pd.DataFrame(np.sort(relationship_df.values, axis = 1), columns = relationship_df.columns)
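
The effect of the row-wise sort can be seen on a small example. This plain-Python sketch uses tuples instead of a dataframe, and the sample pairs are invented for illustration.

```python
# Directed pairs as they might come out of the windowing step (made-up data)
pairs = [("Russia", "Germany"), ("Germany", "Russia"), ("Italy", "Germany")]

# Sorting the two names within each pair makes the relationship undirected
normalized = [tuple(sorted(p)) for p in pairs]
print(normalized)
```

After sorting, the first two pairs become identical, so they can be counted together in the grouping step.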

In [28]:
# Summarize the interactions by giving a value for every interaction captured, then group the interactions
relationship_df["value"] = 1
relationship_df_grouped = relationship_df.groupby(["source","target"], sort=False, as_index=False).sum()

# Check the output
relationship_df_grouped.head(10)

Unnamed: 0,source,target,value
0,Germany,Russia,9
1,Austria,Germany,10
2,Austria,Russia,5
3,Austria,Hungary,6
4,Germany,Hungary,9
5,Germany,Yugoslavia,4
6,Czechoslovakia,Yugoslavia,12
7,Germany,Italy,14
8,Spain,the United Kingdom,2
9,France,the United Kingdom,6
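
As a rough cross-check of the grouping logic, the same aggregation can be sketched with `collections.Counter` from the standard library. The input pairs are illustrative only and are not taken from the article.

```python
from collections import Counter

# Already-normalized (alphabetically sorted) pairs, made up for this sketch
normalized = [("Germany", "Russia"), ("Germany", "Russia"), ("Germany", "Italy")]

# Count how many times each undirected pair occurs
counts = Counter(normalized)

# Reshape into source/target/value records, mirroring the grouped dataframe
rows = [{"source": a, "target": b, "value": n} for (a, b), n in counts.items()]
print(rows)
```

`Counter` keeps pairs in first-seen order, which corresponds to the `sort=False` argument passed to `groupby` above.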


### 7. Export the Data

In [29]:
# Save the dataframe as a csv file
relationship_df_grouped.to_csv(os.path.join(path2, 'country_relationships.csv'))