# Code to migrate the data files in the archive to JSON

We want all the data in a modern format.

# WWII Data

- allied.xml - list of Allied planes
- axis.xml - list of Axis planes
- holtgrewe.xml - the primary data file
- index.xml - appears to be some kind of mapping file (not yet confirmed)

# Holtgrewe Models

The main data file for the WWII collection is holtgrewe.xml.  It
consists of a series of <model> elements.  Here is an example:

```xml
<model index="1" holtgrewe="001" affiliation="1" cos="Great Britain" com="Great Britain" logo="grb">
  <img name="001"></img>
  <title><![CDATA[Miles M.9A Master Mk I]]></title>
  <type type="Advanced Trainer"></type>
  <firstYR>1941</firstYR>
  <displayedAs>
    <![CDATA[This model represents an aircraft of the Royal Air Force No. ...]]>
  </displayedAs>
  <history>
    <![CDATA[As monoplane fighters began to equip Royal Air Force units, ...]]>
  </history>
  <tech>
    <powerplant>
      <![CDATA[One Rolls-Royce Kestrel XXX, liquid-cooled inline piston engine of 715 hp (533 kW)]]>
    </powerplant>
    <performance>
      <maxspeed>226 mph at 15,000 feet (364 km/h at 4,570 m)</maxspeed>
      <ceiling>28,000 feet (8,500 m)</ceiling>
      <range>500 miles (805 km)</range>
    </performance>
    <weight>4,160 pounds (1,886 kg)</weight>
    <maxweight>5,352 pounds (2,428 kg)</maxweight>
    <dimensions span="38 feet 10 inches (11.85 m)" lengths="30 feet 5 inches (9.27 m)" height="9 feet 3 inches (2.82 m)">
    </dimensions>
    <armament><![CDATA[Provisions for one fixed, forward-firing Vickers 0.303 i...]]></armament>
    <crew num="2 [one student, one instructor]"></crew>
    <comments>
      <![CDATA[Model Builder's Comments: This Master has the typical... "]]>
    </comments>
  </tech>
</model>
```

We want to convert these models into JSON for use in a web application.
Each model could be encoded as follows (some field names edited for clarity):

```json
{
  "modelNumber": "001",
  "countryOfService": "Great Britain",
  "countryOfManufacture": "Great Britain",
  "logo": "grb",
  "title": "Miles M.9A Master Mk I",
  "type": "Advanced Trainer",
  "firstYear": 1941,
  "displayedAs": "This model represents an aircraft of the Royal Air Force No. ... ",
  "history": "As monoplane fighters began to equip Royal Air Force units, ...",
  "tech": {
    "powerplant": "One Rolls-Royce Kestrel XXX, liquid-cooled inline piston engine of 715 hp (533 kW)",
    "performance": {
      "maxSpeed": "226 mph at 15,000 feet (364 km/h at 4,570 m)",
      "ceiling": "28,000 feet (8,500 m)",
      "range": "500 miles (805 km)"
    },
    "weight": "4,160 pounds (1,886 kg)",
    "maxWeight": "5,352 pounds (2,428 kg)",
    "dimensions": {
      "span": "38 feet 10 inches (11.85 m)",
      "length": "30 feet 5 inches (9.27 m)",
      "height": "9 feet 3 inches (2.82 m)"
    },
    "armament": "Provisions for one fixed, forward-firing Vickers 0.303 i... ",
    "crewSize": "2 [one student, one instructor]",
    "comments": "Model Builder's Comments: This Master has the typical... "
  }
}
```

We could make this data more generally useful by removing display formatting
from the data fields, so that the display software can convert units in a
standard way.  Since the source data appears to use English (imperial) units
as its standard, I would adopt them for the stored values: decimal feet,
miles, pounds, and mph.  That would further convert this model to:

```json
{
  "modelNumber": "001",
  "countryOfService": "Great Britain",
  "countryOfManufacture": "Great Britain",
  "logo": "grb",
  "title": "Miles M.9A Master Mk I",
  "type": "Advanced Trainer",
  "firstYear": 1941,
  "displayedAs": "This model represents an aircraft of the Royal Air Force No. ... ",
  "history": "As monoplane fighters began to equip Royal Air Force units, ...",
  "tech": {
    "powerplant": {
      "engines": 1,
      "hp": 715,
      "description": "Rolls-Royce Kestrel XXX, liquid-cooled inline piston engine"
    },
    "performance": {
      "maxSpeed": {
        "speed": 226,
        "altitude": 15000
      },
      "ceiling": 28000,
      "range": 500
    },
    "weight": 4160,
    "maxWeight": 5352,
    "dimensions": {
      "span": 38.83,
      "length": 30.42,
      "height": 9.25
    },
    "armament": "Provisions for one fixed, forward-firing Vickers 0.303 i... ",
    "crew": {
      "size": 2,
      "makeup": "one student, one instructor"
    }
  },
  "comments": "Model Builder's Comments: This Master has the typical... "
}
```
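
With plain numeric values stored, formatting becomes the display layer's job.
As a rough sketch of what that could look like (the helper names
`format_length` and `format_weight` are hypothetical, and the output format
simply imitates the strings found in the archive):

```python
def format_length(feet):
    """Render decimal feet as 'F feet I inches (M m)'."""
    total_inches = round(feet * 12)
    whole_feet, inches = divmod(total_inches, 12)
    meters = feet * 0.3048  # one international foot is exactly 0.3048 m
    return f"{whole_feet} feet {inches} inches ({meters:.2f} m)"


def format_weight(pounds):
    """Render pounds as 'P pounds (K kg)'."""
    kilograms = pounds * 0.45359237  # one avoirdupois pound in kilograms
    return f"{pounds:,.0f} pounds ({kilograms:,.0f} kg)"


span_text = format_length(38.83)
weight_text = format_weight(4160)
print(span_text)
print(weight_text)
```

Recomputed metric values will not always agree with the archive's text to the
last digit (the source lists this model's weight as 1,886 kg and its span as
11.85 m), so any comparison against the original strings should allow for
small rounding differences.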

In [34]:
!ls -al ../archive/wwii/data

total 1156
drwxrwxr-x 2 mckoss mckoss    4096 Jan 26 16:00 .
drwxrwxr-x 4 mckoss mckoss    4096 Jan 26 17:18 ..
-rw-rw-r-- 1 mckoss mckoss   14879 Jan 26 15:59 allied.xml
-rw-rw-r-- 1 mckoss mckoss   13466 Jan 26 15:59 axis.xml
-rw-rw-r-- 1 mckoss mckoss 1133333 Jan 26 15:59 holtgrewe.xml
-rw-rw-r-- 1 mckoss mckoss    6213 Jan 26 15:59 index.xml


In [None]:
!pip install xmltodict

In [2]:
DATA_DIR = '../archive/wwii/data'

In [25]:
import json
import xmltodict

with open(f"{DATA_DIR}/holtgrewe.xml", 'r') as f:
    data = f.readlines()

# Parser needs a single top-level element - so we'll wrap the data in a root element
data.insert(1, '<root>\n')
data.append('</root>\n')

models = xmltodict.parse(''.join(data))['root']['model']
print(json.dumps(models[0], indent=2))

{
  "@index": "1",
  "@holtgrewe": "001",
  "@affiliation": "1",
  "@cos": "Great Britain",
  "@com": "Great Britain",
  "@logo": "grb",
  "img": {
    "@name": "001"
  },
  "title": "Miles M.9A Master Mk I",
  "type": {
    "@type": "Advanced Trainer"
  },
  "firstYR": "1941",
  "displayedAs": "This model represents an aircraft of the Royal Air Force No. 8 Flying Training School, Ternhill, England, April 1940.",
  "history": "As monoplane fighters began to equip Royal Air Force units, it became obvious that an advanced trainer was needed to transition new young pilots from their basic elementary trainers to the high performance fighters they would be flying. To fulfill this need, the Air Ministry issued a specification for a high performance trainer. George Miles designed the \"Master\" in response. The new trainer was of smooth aeronautical design and was built of wood. Miles' first prototype, which first flew on June 3, 1937, was rejected by the Air Ministry, but Miles continued on 

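The wrapping step above inserts `<root>` after the first line of the file,
which presumably holds the XML declaration. A slightly more defensive variant
splits off an optional declaration with a regular expression instead of
relying on line positions. This sketch uses the standard library's
`xml.etree.ElementTree` rather than `xmltodict` so that it runs without extra
installs; `wrap_in_root` is a hypothetical helper, not part of either library:

```python
import re
import xml.etree.ElementTree as ET


def wrap_in_root(xml_text, root_tag="root"):
    """Wrap a multi-root XML fragment in a single top-level element."""
    # Keep an optional leading <?xml ...?> declaration in front of the wrapper.
    match = re.match(r'\s*(<\?xml[^>]*\?>)?', xml_text)
    declaration = match.group(1) or ''
    body = xml_text[match.end():]
    return f"{declaration}<{root_tag}>{body}</{root_tag}>"


fragment = '<?xml version="1.0"?>\n<model index="1"/>\n<model index="2"/>\n'
root = ET.fromstring(wrap_in_root(fragment))
indexes = [m.get("index") for m in root.findall("model")]
print(indexes)
```

The same wrapped string could be handed to `xmltodict.parse` in place of the
joined line list.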
In [318]:
import re

current_model = None
errors = []

def reshape_wwii_model(m):
  return {
    'modelNumber': parse_model(m['@holtgrewe']),
    'countryOfService': m['@cos'],
    'countryOfManufacture': m['@com'],
    'logo': m['@logo'],
    'title': m['title'],
    'type': m['type']['@type'],
    'firstYear': m['firstYR'],
    'displayedAs': m['displayedAs'],
    'history': m['history'],
    'tech': {
      'powerplant': parse_powerplant(m['tech']['powerplant']),
      'performance': {
        'maxSpeed': parse_speed(m['tech']['performance']['maxspeed']),
        'ceiling': parse_ceiling(m['tech']['performance']['ceiling']),
        'range': parse_range(m['tech']['performance']['range']),
      },
      'weight': parse_weight(m['tech']['weight']),
      'maxWeight': parse_weight(m['tech']['maxweight']),
      'dimensions': {
        'span': parse_dimension(m['tech']['dimensions']['@span']),
        'length': parse_dimension(m['tech']['dimensions']['@lengths']),
        'height': parse_dimension(m['tech']['dimensions']['@height']),
      },
      'armament': m['tech']['armament'],
      'crew': parse_crew(m['tech']['crew']['@num']),
    },
    'comments': m['tech']['comments'],
  }

def parse_model(m):
  global current_model
  current_model = m.strip()
  return current_model

def parse_powerplant(d):
  engines = 1
  if re.match(r'One ', d):
    d = d[4:]
  elif re.match(r'Two ', d):
    d = d[4:]
    engines = 2

  result = {
    'engines': engines,
    'description': d
  }

  # See if power given as thrust (jets)
  m = re.match(r'(.*) ([\d,]+) pounds', d)
  if m is not None:
    result['thrust'] = parse_int(m.group(2))

  # Check for horsepower
  m = re.match(r'(.*) of ([\d,]+) hp (.*)', d)

  if m is not None:
    result['hp'] = parse_int(m.group(2))
  
  return result

def parse_speed(s):
  if s == "Not documented":
    return s

  m = re.match(r'([\d,]+) mph( (.*)at ([\d,]+) feet)?', s)
  if m is None:
    return {
      'speed': s
    }

  result = {
    'speed': parse_int(m.group(1)),
  }

  if m.group(4) is not None:  
    result['altitude'] = parse_int(m.group(4))

  return result

def parse_ceiling(c):
  if c == "Not Documented":
    c = "Not documented"

  if c == "Not documented" or c == "Not available":
    return c

  # One altitude in miles!
  m = re.match(r'([\d]+) miles', c)
  if m is not None:
    return round(parse_int(m.group(1)) * 5280, -3)

  # A few use "." as thousands separator!
  m = re.match(r'([\d]+\.\d{3}) feet', c)
  if m is not None:
    return parse_int(m.group(1).replace('.', ''))

  m = re.match(r'(Below|below|Usually below|Over)? ?([\d,]+) feet', c)

  if m is None:
    error(f"Couldn't parse ceiling: {c}")
    return None

  return parse_int(m.group(2))

def parse_range(r):
  if r == "Not documented" or r == "Not available":
    return r

  m = re.match(r'(Up to|up to|Over)? ?([\d,]+) miles', r)
  if m is None:
    return r

  return parse_int(m.group(2))

def parse_weight(w):
  if w == "Not documented" or w == "Not applicable" or w == "Not available":
    return w

  if w is None:
    return None

  # Escape the dot so only a literal decimal point is accepted
  m = re.match(r'([\d,]+(\.\d+)?) pounds', w)
  if m is None:
    error(f"Couldn't parse weight: {w}")
    return None

  return parse_float(m.group(1))

def parse_dimension(d):
  m = re.match(r'(\d+) feet( ([\d]+(\.\d+)?) inch(es)?)?', d)
  if m is None:
    error(f"Couldn't parse dimension: {d}")
    return None

  ft = parse_int(m.group(1))
  if m.group(3) is not None:
    inches = parse_float(m.group(3))
  else:
    inches = 0

  return round(ft + inches/12, 2)

def parse_crew(c):
  mPre = re.match(r'(Usually crew of|Flight crew of)? ?(.*)', c)
  c = mPre.group(2)

  m = re.match(r'(\d+) (or|to) (\d+),? ?\[?([^\]]*)\]?', c)
  if m is not None:
    result = {
      'min': parse_int(m.group(1)),
      'max': parse_int(m.group(3)),
    }

    # Only record a comment when the trailing text is non-empty
    if m.group(4) != '':
      result['comment'] = m.group(4)

    return result

  m = re.match(r'(\d+|None|Two)( crew)?,? *(plus|and|with)? ?(.*)?', c)
  if m is None:
    error(f"Couldn't parse crew: {c}")
    return None

  if m.group(1) == 'None':
    size = 0
  elif m.group(1) == 'Two':
    size = 2
  else:
    size = parse_int(m.group(1))

  result = {
    'min': size,
    'max': size,
  }

  comment = m.group(4)

  if comment is not None and comment != '':
    m_d = re.match(r'\[(.+)\]', comment)
    if m_d is not None:
      result['comment'] = m_d.group(1)
    else:
      result['comment'] = comment

  return result

def parse_int(s):
  return int(s.replace(',', ''))

def parse_float(s):
  return float(s.replace(',', ''))

def error(s):
  errors.append(f"{current_model}: {s}")


# Test code on a sample model

print(json.dumps(reshape_wwii_model(models[0]), indent=2))

if len(errors) > 0:
  error_lines = '\n'.join(errors)
  print(f"Errors:\n{error_lines}")


{
  "modelNumber": "001",
  "countryOfService": "Great Britain",
  "countryOfManufacture": "Great Britain",
  "logo": "grb",
  "title": "Miles M.9A Master Mk I",
  "type": "Advanced Trainer",
  "firstYear": "1941",
  "displayedAs": "This model represents an aircraft of the Royal Air Force No. 8 Flying Training School, Ternhill, England, April 1940.",
  "history": "As monoplane fighters began to equip Royal Air Force units, it became obvious that an advanced trainer was needed to transition new young pilots from their basic elementary trainers to the high performance fighters they would be flying. To fulfill this need, the Air Ministry issued a specification for a high performance trainer. George Miles designed the \"Master\" in response. The new trainer was of smooth aeronautical design and was built of wood. Miles' first prototype, which first flew on June 3, 1937, was rejected by the Air Ministry, but Miles continued on with the project as a private venture. With the need for an adva

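The target format sketched earlier lists `firstYear` as a number, whereas the
output above still carries it as the string `"1941"`. A minimal sketch of a
helper that could close that gap, following the same pass-through convention
as `parse_range` (unrecognized text is returned unchanged). The name
`parse_year` is hypothetical, and it assumes that a valid year is exactly four
digits:

```python
import re


def parse_year(text):
    """Return a 4-digit year as an int, or the input unchanged."""
    if text is None:
        return None

    m = re.fullmatch(r'\s*(\d{4})\s*', text)
    if m is None:
        # Leave anything unexpected (e.g. "Not documented") as-is
        return text

    return int(m.group(1))


print(parse_year("1941"))
print(parse_year("Not documented"))
```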
In [319]:
all_models = []

for m in models:
  all_models.append(reshape_wwii_model(m))

print(f"Finished parsing with model '{current_model}'.  {len(errors)} errors.")
print('\n'.join(errors))

Finished parsing with model '426'.  0 errors.
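
Zero errors does not mean every field is numeric: several of the parse
functions deliberately pass through strings such as "Not documented". A small
audit sketch that counts which Python types appear at a given nested key. The
helper name `value_kinds` and the sample records are made up for
illustration; in the notebook it would be called on `all_models`:

```python
from collections import Counter


def value_kinds(records, path):
    """Count the Python type names found at a nested key path."""
    kinds = Counter()
    for record in records:
        value = record
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        kinds[type(value).__name__] += 1
    return kinds


sample = [
    {"tech": {"performance": {"ceiling": 28000}}},
    {"tech": {"performance": {"ceiling": "Not documented"}}},
    {"tech": {"performance": {"ceiling": None}}},
]
kinds = value_kinds(sample, ["tech", "performance", "ceiling"])
print(kinds)
```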



In [299]:
with open('../data/wwii-models.json', 'w+') as f:
  json.dump(all_models, f, indent=2)
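
A quick round-trip check can confirm that what was written reads back
identically. This is only a sketch: it writes a made-up record list to a
temporary directory rather than touching `../data/wwii-models.json`:

```python
import json
import os
import tempfile

# Stand-in data; in the notebook this would be all_models
records = [{"modelNumber": "001", "tech": {"weight": 4160.0}}]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "models.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

round_trip_ok = reloaded == records
print(round_trip_ok)
```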

In [303]:
from collections import Counter

logos = Counter([m['logo'] for m in all_models])

print(logos)

Counter({'ger': 84, 'usa': 70, 'jap': 64, 'grb': 62, 'rus': 33, 'fra': 31, 'ita': 29, 'net': 8, 'fin': 7, 'cze': 6, 'pol': 5, 'chi': 4, 'aus': 4, 'slo': 3, 'bul': 2, 'saf': 2, 'hun': 2, 'yug': 2, 'rom': 1, 'cro': 1, 'icb': 1, 'lit': 1, 'eth': 1, 'can': 1, 'bel': 1, 'tha': 1})


# WWI Collection

The World War I collection is the priority for this project, as the WWII app
has already been recreated in Intuiface.  Ideally, we can recreate the experience
of the Flash app (though we don't yet have an emulator that can run this app
in full).

The data for these 340 planes is stored in individual XML files.  It
is less structured than the WWII set: individual specs are not
broken out into individual tags, but are enumerated in one large "specs"
field (which looks like HTML).

# XML Format for WWI Collection

```xml
<?xml version="1.0" encoding="UTF-8"?>

<holtPlane>
  <mainInfo id="100" cos="Austria-Hungary">    
    <type><![CDATA[Reconnaissance Aircraft]]></type>
    <plnName><![CDATA[Fokker B.III Light Observation Biplane]]></plnName>
  </mainInfo>
  <history>
    <![CDATA[As the dominance of the Fokker monoplanes began to wane in the face of new and improved machines, Anthony Fokker instructed designer, Martin Kreutzer to begin work on a biplane fighter. One of the designs Kreutzer developed was the M.17E. While the German military showed little interest in The Fokker M.17, a number of the biplanes were sold to Austria-Hungary, which re-designated the aircraft, the B.III.

    The B.III was constructed of canvas-covered wood framing except for a metal cowling about its rotary engine. The small aircraft employed wing warping for lateral control and the vast majority of the machines were unarmed. Some evidence indicates that at least one B.III (modeled here) was equipped with a machine gun, though there are conflicting reports as to the weapon's placement and type.

    Most B.IIIs were employed as reconnaissance scouts or trainers, though the prototype for this model was reportedly flown on the Eastern Front as a fighter by Oberleutnant, Plus Moosbrugger. World War I historical records are otherwise silent about Moosbrugger's combat exploits, or lack thereof.]]>
  </history>
  <deployment>
    <![CDATA[This model represents an aircraft of Flik (Squadron) 11, Luftfahrtruppe (Austro-Hungarian Air Force), Eastern Front, June, 1916.]]>
    </deployment>
  <technicalSpecs>
    <![CDATA[<b>Manufacturer:</b> Fokker Aeroplanbau GmbH
    <b>Country of Manufacture:</b> Germany<br /><b>Power plant:</b> One Oberursel, 7 cylinder air-cooled rotary piston engine of 80 hp (60 kW)<br /><b>Performance:</b>
    Maximum Speed: 83 mph (134 km/h)
    Ceiling: undocumented
    Endurance: 2 hours 15 minutes
    Empty Weight: 602 pounds (273 kg)
    Loaded Weight: undocumented
    <b>Dimensions: </b>
    Span: 23 feet 8 inches (7.21 m)
    Length: 20 feet 4 inches (6.20 m)
    Height: 8 feet 4 inches (2.54 m)
    <b>Armament:</b> One fixed forward firing synchronized 8.00mm (.314 inch) Bergmann LMG 15nA machine gun.<b/r>
    <b>Crew:</b>1<br/>
    <br />]]>
  </technicalSpecs>
</holtPlane>
```

Target JSON format:

```json
{
  "modelNumber": "100",
  "countryOfService": "Austria-Hungary",
  "type": "Reconnaissance Aircraft",
  "title": "Fokker B.III Light Observation Biplane",
  "history": "As the dominance of the Fokker monoplanes ...",
  "deployment": "This model represents an aircraft of Flik (Squadron) 11, Luftfahrtruppe (Austro-Hungarian ...",
  "tech": "<b>Manufacturer:</b> Fokker Aeroplanbau GmbH\n<b>Country of Manufacture:</b> Germany<br /><b>Power plant:</b> One Oberursel, 7 cylinder air-cooled rotary piston engine of 80 hp (60 kW)<br /><b>Performance:</b>\nMaximum Speed: 83 mph (134 km/h)\nCeiling: undocumented\nEndurance: 2 hours 15 minutes\nEmpty Weight: 602 pounds (273 kg)\nLoaded Weight: undocumented\n<b>Dimensions: </b>\nSpan: 23 feet 8 inches (7.21 m)\nLength: 20 feet 4 inches (6.20 m)\nHeight: 8 feet 4 inches (2.54 m)\n<b>Armament:</b> One fixed forward firing synchronized 8.00mm (.314 inch) Bergmann LMG 15nA machine gun.<b/r>\n<b>Crew:</b>1<br/>\n<br />"
}
```
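
The target format keeps `tech` as one raw HTML-like string. If individual
specs are wanted later, the blob can be split into label/value pairs. The
following is a minimal sketch based only on the one sample shown here;
`parse_specs` is a hypothetical helper, and other files may use markup this
does not handle:

```python
import re


def parse_specs(html):
    """Split the technicalSpecs blob into a flat {label: value} dict."""
    # Treat <br>, <br/>, <br /> and the malformed <b/r> as line breaks
    text = re.sub(r'<br\s*/?>|<b/r>', '\n', html)
    # Drop the remaining bold tags
    text = re.sub(r'</?b>', '', text)

    specs = {}
    for line in text.splitlines():
        label, sep, value = line.partition(':')
        label, value = label.strip(), value.strip()
        # Section headers such as "Performance:" have no value and are skipped
        if sep and label and value:
            specs[label] = value
    return specs


sample = (
    "<b>Manufacturer:</b> Fokker Aeroplanbau GmbH\n"
    "<b>Country of Manufacture:</b> Germany<br /><b>Power plant:</b> One Oberursel, "
    "7 cylinder air-cooled rotary piston engine of 80 hp (60 kW)<br /><b>Performance:</b>\n"
    "Maximum Speed: 83 mph (134 km/h)\n"
    "Ceiling: undocumented\n"
    "<b>Dimensions: </b>\n"
    "Span: 23 feet 8 inches (7.21 m)\n"
    "<b>Armament:</b> One fixed forward firing synchronized 8.00mm (.314 inch) "
    "Bergmann LMG 15nA machine gun.<b/r>\n"
    "<b>Crew:</b>1<br/>\n"
    "<br />"
)
specs = parse_specs(sample)
print(specs["Maximum Speed"])
```

Flattening loses the grouping under "Performance" and "Dimensions"; keeping
the raw string alongside the parsed values would preserve the original layout
for display.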



In [309]:
from glob import glob

WWI_DATA = "../archive/wwi/data/planes"

wwi_files = list(glob(f"{WWI_DATA}/*.xml"))
print(len(wwi_files))

152


In [320]:
import os

sample_file = f"{WWI_DATA}/100.xml"

with open(sample_file, 'r') as f:
  model = xmltodict.parse(f.read())['holtPlane']

model

OrderedDict([('mainInfo',
              OrderedDict([('@id', '100'),
                           ('@cos', 'Austria-Hungary'),
                           ('type', 'Reconnaissance Aircraft'),
                           ('plnName',
                            'Fokker B.III Light Observation Biplane')])),
             ('history',
              "As the dominance of the Fokker monoplanes began to wane in the face of new and improved machines, Anthony Fokker instructed designer, Martin Kreutzer to begin work on a biplane fighter. One of the designs Kreutzer developed was the M.17E. While the German military showed little interest in The Fokker M.17, a number of the biplanes were sold to Austria-Hungary, which re-designated the aircraft, the B.III.\n\nThe B.III was constructed of canvas-covered wood framing except for a metal cowling about its rotary engine. The small aircraft employed wing warping for lateral control and the vast majority of the machines were unarmed. Some evidence indicates t

In [330]:
errors = []

def reshape_wwi_model(m):
  return {
    'modelNumber': parse_model(m['mainInfo']['@id']),
    'countryOfService': m['mainInfo']['@cos'],
    # The target format also lists the aircraft type
    'type': m['mainInfo']['type'],
    'title': m['mainInfo']['plnName'],
    'history': m['history'],
    'deployment': m['deployment'],
    'tech': m['technicalSpecs']
  }

# Test code on a sample model

print(json.dumps(reshape_wwi_model(model), indent=2))

if len(errors) > 0:
  error_lines = '\n'.join(errors)
  print(f"Errors:\n{error_lines}")

{
  "modelNumber": "100",
  "countryOfService": "Austria-Hungary",
  "type": "Reconnaissance Aircraft",
  "title": "Fokker B.III Light Observation Biplane",
  "history": "As the dominance of the Fokker monoplanes began to wane in the face of new and improved machines, Anthony Fokker instructed designer, Martin Kreutzer to begin work on a biplane fighter. One of the designs Kreutzer developed was the M.17E. While the German military showed little interest in The Fokker M.17, a number of the biplanes were sold to Austria-Hungary, which re-designated the aircraft, the B.III.\n\nThe B.III was constructed of canvas-covered wood framing except for a metal cowling about its rotary engine. The small aircraft employed wing warping for lateral control and the vast majority of the machines were unarmed. Some evidence indicates that at least one B.III (modeled here) was equipped with a machine gun, though there are conflicting reports as to the weapon's placement and type.\n\nMost B.IIIs were employed as reconnaissance scouts or trai

In [331]:
all_wwi_models = []

for file_name in wwi_files:
  with open(file_name, 'r') as f:
    model = xmltodict.parse(f.read())['holtPlane']

  all_wwi_models.append(reshape_wwi_model(model))
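
`glob` returns files in arbitrary order, so the order of records in the output
can vary between machines. A sketch of a sort key that orders files by model
number, assuming file names like `100.xml` (the helper name
`model_number_key` and the sample paths are hypothetical):

```python
import os


def model_number_key(path):
    """Sort numeric file stems numerically; anything else goes last by name."""
    stem = os.path.splitext(os.path.basename(path))[0]
    if stem.isdigit():
        return (0, int(stem), '')
    return (1, 0, stem)


paths = ["planes/100.xml", "planes/12.xml", "planes/notes.xml", "planes/7.xml"]
ordered = sorted(paths, key=model_number_key)
print(ordered)
```

In the loop above, iterating over `sorted(wwi_files, key=model_number_key)`
would make the resulting JSON stable from run to run.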

In [333]:
with open('../data/wwi-models.json', 'w+') as f:
  json.dump(all_wwi_models, f, indent=2)

In [334]:
from collections import Counter

countries = Counter([m['countryOfService'] for m in all_wwi_models])

print(countries)

Counter({'Germany': 47, 'Great Britain': 38, 'France': 25, 'Austria-Hungary': 11, 'Italy': 11, 'United States': 8, 'Russia': 7, 'Turkey': 3, 'Belgium': 1, 'Portugal': 1})
