Open in notebook viewer to see the online version http://nbviewer.jupyter.org/

# Le Temps coverage of WW2

In [1]:
#import libraries
import os
import glob
import datetime as dt
import xml.etree.ElementTree as et
import pandas as pd
import os.path
import urllib.request
import json 
import errno
import folium

For this project we analyze what the dataset reports about World War II, and more specifically the Pacific Theater. The period of interest is 1937 to 1945. The data for this period, from the two newspapers that make up Le Temps, amounts to less than 1 GB, so we were able to download this subset from the cluster and work with it on our own machines.

In our initial data analysis we cleaned the data and created DataFrames that hold only the useful information. The articles are in XML format and are grouped in the dataset by month. The columns of our DataFrames are “name”, “date” and “text” (the article’s body). We found many duplicates among the existing article ids, so we decided to create a new id based on the source (newspaper initials), the date and the order in which the articles are stored in our files. None of these features is unique on its own, but their combination constitutes a unique id. Furthermore, we had to discard the articles that had some of the meta information but no article body to read.
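
The id scheme described above can be sketched in isolation as follows. This is a minimal illustration, not a replacement for the loader; the helper name `make_article_id` is hypothetical, and the input date format (`dd/mm/YYYY`) is the one used by the `issue_date` field in this notebook.

```python
import datetime as dt


def make_article_id(source, issue_date, month_order):
    """Combine newspaper code, issue date (dd/mm/YYYY) and position within the monthly file."""
    # Normalize the date to YYYYMMDD so that ids sort chronologically
    date_part = dt.datetime.strptime(issue_date, '%d/%m/%Y').strftime('%Y%m%d')
    # Zero-pad the position to three digits
    return "{}_{}_{:03d}".format(source, date_part, month_order)


print(make_article_id('gdl', '01/01/1938', 0))  # gdl_19380101_000
```
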

For every month there are thousands of articles, and obviously only a minority of them refer to the Pacific Theater. Therefore, we built a dictionary on which we base our text analysis. Our first approach was to get all the articles whose title includes some of the dictionary words. This could not be a complete solution because a large percentage of the articles are untitled, so we extended the search to the articles’ body too. By analyzing the results, we realized that although this adds articles that refer to our scenario, we get some irrelevant ones as well. This is not a problem so far because we have shrunk the dataset significantly and the irrelevant articles can be identified during the text analysis.
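
A minimal sketch of a filter that looks at both the title and the body, assuming simple case-insensitive substring matching. The function name `mentions_keyword` and the sample keywords and texts are hypothetical and only serve as an illustration.

```python
def mentions_keyword(title, body, keywords):
    """Return True if any keyword occurs in the title or the body (case-insensitive)."""
    # Missing titles or bodies are treated as empty strings
    haystack = ((title or '') + ' ' + (body or '')).lower()
    return any(word in haystack for word in keywords)


# Hypothetical sample data, for illustration only
sample_keywords = ['japon', 'chine', 'pacifique']
print(mentions_keyword(None, 'Les troupes du Japon avancent.', sample_keywords))
print(mentions_keyword('Chronique locale', 'Le marché de Genève.', sample_keywords))
print(mentions_keyword('Une nouvelle machine', '', sample_keywords))
```

Because the match is on substrings, a keyword such as 'chine' also matches inside 'machine', which is one way irrelevant articles can slip through.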

An improvement we are considering is to represent the documents’ words with TF-IDF features and retrieve documents by their similarity score against the dictionary we have built, used as the query.
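
The TF-IDF retrieval idea could look roughly like the pure-Python sketch below. It is not part of the current pipeline: the function names `tokenize` and `tfidf_rank`, the smoothed IDF formula, and the mini-corpus are all assumptions chosen for illustration. Multi-word dictionary entries are simply split into their individual words here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (hyphens are kept inside words)."""
    return re.findall(r"[\w-]+", text.lower())


def tfidf_rank(documents, query_terms):
    """Rank documents by cosine similarity between their TF-IDF vector and the query's."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (an assumed variant; others exist)
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    def vectorize(tokens):
        # Terms unseen in the corpus are ignored
        counts = Counter(t for t in tokens if t in idf)
        return {term: count * idf[term] for term, count in counts.items()}

    def cosine(a, b):
        dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query_vec = vectorize([t for term in query_terms for t in tokenize(term)])
    scores = [(cosine(vectorize(tokens), query_vec), idx)
              for idx, tokens in enumerate(tokenized)]
    # Highest similarity first
    return sorted(scores, reverse=True)


# Hypothetical mini-corpus, for illustration only
sample_docs = [
    "Le Japon et la Chine négocient",
    "Le marché de Genève est calme",
    "Tokio répond à la note sur la Chine",
]
for score, idx in tfidf_rank(sample_docs, ['japon', 'chine', 'tokio']):
    print(round(score, 3), sample_docs[idx])
```

Compared with plain keyword matching, a similarity score gives a ranking, so a threshold can be tuned instead of accepting every article that contains a single dictionary word.
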
For the New York Times dataset, we applied exactly the same methodology. In this case we had to download the files using an API key that we requested. There are two options: using the available interface of the New York Times machine, or requesting a URL that specifies the month and the year. We went for the second option and wrote a script that downloads all the JSON files. We realized that although the information is well organized, only a small part of each article’s text is available. The offered tools do not solve this problem of limited information, so we have sent an email asking for alternative ways of accessing the data and are waiting for a response.



In [2]:
# Loading dictionary
with open('./dict/dictionary_fr.txt', 'r', encoding='utf-8') as f:
    # One keyword per line; strip the newline and lowercase for matching
    dict_fr = [line.replace('\n', '').lower() for line in f]
dict_fr

['cambodge',
 'chang-haï',
 'chine',
 'chinois',
 'corée',
 'extrême-orient',
 'haipong',
 'hongkong',
 'indo-chine',
 'japon',
 'java',
 'kwantung',
 'laos',
 'macao',
 'malaisie',
 'nation blanche',
 'nations blanches',
 'nippon',
 'pacifique',
 'philippines',
 'shanghai',
 'singapour',
 'sutsugu',
 'thaïlande',
 'tokio']

In [3]:
# Function to load all data from gdl and jdg
def loader(source):
    # Make empty arrays which will contain the parsed fields
    art_name = []
    art_date = []
    art_text = []
    art_counters = []
    
    source_df = pd.DataFrame()
    
    # Build the path # ./data/gdl/ **/*.xml
    path = './data/' + source + '/' 
    doc_names = glob.glob(path + '**/*.xml', recursive=True)
    
    for file_ in doc_names:
        i = 0
        tree = et.parse(file_)
        root = tree.getroot()
        # Loop through each article entity in the file
        for entity in root.iter('entity'):
            # If the article title contains a word from the custom dictionary
            if any(word in str(entity.find('meta').find('name').text).lower() for word in dict_fr):
                # If the article text is empty then don't store it and move to the next iteration
                if(entity.find('full_text').text is None):
                    continue
                art_name.append(entity.find('meta').find('name').text)
                art_text.append(entity.find('full_text').text) 
                
                date_format = dt.datetime.strptime(entity.find('meta').find('issue_date').text,'%d/%m/%Y').strftime('%Y%m%d')
                art_date.append(date_format) 
                
                art_counters.append(i)
                i = i+1
            
    source_df['name'] = art_name
    #source_df['date'] = art_date
    source_df['text'] = art_text 
    source_df['date'] = art_date
    form_cn = ["{:03d}".format(item) for item in art_counters]
    source_df['month_order'] = form_cn
    source_df['origin'] = source
    # ids = to ensure it is unique we made it a combination of the newspaper code + the date + its ordering number in the list of each specific month  
    source_df['id'] = source_df['origin'] + "_" + source_df['date'].map(str) + "_" + source_df['month_order'].map(str)
    return source_df

In [4]:
jdg_df = loader('jdg')

In [5]:
gdl_df = loader('gdl')

In [6]:
articles = pd.DataFrame()
articles = pd.concat([gdl_df,jdg_df], ignore_index=True)
articles.head()

,name,text,date,month_order,origin,id
0,Les événements de Chine,Les événements de Chine La défense chinoise da...,19380101,0,gdl,gdl_19380101_000
1,La note britannique au Japon,"La note britannique au Japon Tokio, 31 décembr...",19380101,1,gdl,gdl_19380101_001
2,On parlé de propositions de paix à la Chine,On parlé de propositions de paix à la Chine A ...,19380101,2,gdl,gdl_19380101_002
3,Les événements de Chine,Les événements de Chine L'avance japonaise dan...,19380105,3,gdl,gdl_19380105_003
4,LA PRÉTENTION D'UN MINISTRE JAPONAIS D'ÉLIMINE...,LA PRÉTENTION D'UN MINISTRE JAPONAIS D'ÉLIMINE...,19380105,4,gdl,gdl_19380105_004


In [7]:
articles.describe()

,name,text,date,month_order,origin,id
count,796,796,796,796,796,796
unique,656,796,458,108,2,796
top,La guerre en Extrême-Orient,La guerre sino-japonaise -Les troupes japonais...,19411220,2,gdl,gdl_19390111_013
freq,25,1,10,36,456,1


We need to consider the following scenarios:<br>
-there are articles that are related to our research but have different keywords (e.g. WW2 generals, ministers) in the title<br>
-there are articles that are not related to our research but are collected because they contain our keywords<br>

We have already taken into consideration cases where the article has no name, and in most of the cases[...]


# The New York Times coverage of WW2

In [8]:
# Create a new directory if it does not already exist
def create_directory(directory):
    try:
        os.makedirs(directory)
    except OSError as e:
        if e.errno != errno.EEXIST:
            raise

The following code needs the API key to be passed in the URL. We removed it from the notebook to respect the Terms of Use set by the New York Times.
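
One way to keep the key out of the notebook is to read it from the environment and build the URL at run time. The sketch below reuses the endpoint quoted in the download cell; the helper name `build_archive_url` and the environment variable name `NYT_API_KEY` are hypothetical.

```python
import os


def build_archive_url(year, month, api_key):
    """Return the archive query URL for one month, using the endpoint quoted in this notebook."""
    return ("http://api.nytimes.com/svc/archive/v1/"
            + str(year) + "/" + str(month) + ".json?api-key=" + api_key)


# NYT_API_KEY is a hypothetical variable name; set it in the shell before starting the notebook
api_key = os.environ.get('NYT_API_KEY', '')
print(build_archive_url(1938, 1, api_key or '<missing key>'))
```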

In [None]:
# We store the files this way: './data/nyt/nyt_1938/01.json'
directory = './data/nyt'

 
for year in range(1937, 1946):
    save_path = directory + '/nyt_' + str(year)
    create_directory(save_path)
    
    for month in range(1, 13):        
        file_path =  save_path + '/'
        # Months below 10 are zero-padded, e.g. "01"
        if month < 10:
            file_path = file_path + '0' + str(month) + '.json'
        else:
            # Keep the directory prefix for two-digit months as well
            file_path = file_path + str(month) + '.json'
        
        #form the url query according to the year and month
        with urllib.request.urlopen("http://api.nytimes.com/svc/archive/v1/" + str(year) + "/" + str(month) + ".json?api-key={api-key}") as url:
            data = json.loads(url.read().decode())
        with open(file_path, 'w') as outfile:
            json.dump(data, outfile)

# Map of Japan in 1938

Here we display a map of Far-East Asia in 1938. The countries' boundaries and the areas of influence of the nations in this region changed many times during the following years, and they are completely different from the current ones. For example, this initial representation makes it clear that Japan did not have the same size as today. We also added markers for some cities that played an important role in the Pacific Theater. In our final submission we aim to show an interactive representation of all these changes over this period, displaying a different map snapshot according to the invasions that happened.

In [10]:
json_data = json.load(open(r'topojson/cntry1938.topojson'))

In [11]:
def my_color_function(geometries):
    """Return the fill color for a country based on its NAME property."""
    if geometries['properties']['NAME'] == "Manchuria" or geometries['properties']['NAME'] == "Japan":
        return 'red'
    else:
        if geometries['properties']['NAME'] == "Mongolia":
            return '#00FF00'
        if geometries['properties']['NAME'] == "China":
            return 'yellow'
        if geometries['properties']['NAME'] == "Bhutan":
            return 'maroon'
        if geometries['properties']['NAME'] == "Tibet":
            return '#FF00FF'
        if geometries['properties']['NAME'] == "Malaysia":
            return '#800000'
        if geometries['properties']['NAME'] == "Taiwan":
            return '#33FFBD'
        if geometries['properties']['NAME'] == "Thailand":
            return '#DAF7A6'
        if geometries['properties']['NAME'] == "Laos":
            return '#FFC300'
        if geometries['properties']['NAME'] == "Burma":
            return '#581845'
        if geometries['properties']['NAME'] == "Cambodia":
            return '#900C3F'
        if geometries['properties']['NAME'] == "Philippines":
            return '#3375FF'
        if geometries['properties']['NAME'] == "Dutch East Indies":
            return 'forestgreen'
        if geometries['properties']['NAME'] == "French Indo-China":
            return 'AQUA'
        if geometries['properties']['NAME'] == "Cochin China":
            return 'PURPLE'
        if geometries['properties']['NAME'] == "Papua New Guinea":
            return 'BLUE'
        return 'grey'

In [12]:
pacific_map = folium.Map(location=[30, 120], tiles='cartodbpositron', zoom_start=3)
folium.TopoJson(open('topojson/cntry1938.topojson'),
                'objects.cntry1938',
                 style_function=lambda geometries: {
                    'fillColor': my_color_function(geometries)                    
                    }
               ).add_to(pacific_map)
folium.Marker([45.8038, 126.5350], popup='Harbin').add_to(pacific_map)
folium.Marker([35.6895, 139.6917], popup='Tokyo').add_to(pacific_map)
folium.Marker([1.3521, 103.8198], popup='Singapore').add_to(pacific_map)
folium.Marker([22.3964, 114.1095], popup='Hong Kong').add_to(pacific_map)
folium.Marker([39.9042, 116.4074], popup='Beijing').add_to(pacific_map)
folium.Marker([31.2304, 121.4737], popup='Shanghai').add_to(pacific_map)
pacific_map

# Text Analysis

To retrieve meaningful information, we need to extract the different entities in the articles (persons, countries, etc.) and represent the events according to them.<br>
For that purpose, we used the Natural Language Toolkit for Python. We have not yet obtained the desired results because the offered tools must be adapted to the French language. We do not upload the code for this task because it is still at a preliminary stage. However, the first results lead us to believe that it is feasible to end up with the information we need in order to create the interactive maps of the final version.