In [222]:
import pandas as pd
import numpy as np
pd.options.mode.chained_assignment = None
import re

In [2]:
def standard_headers(df):
    cols = [col.lower().replace(' ', '_') for col in df.columns]
    df.columns = cols
    return df

In [3]:
airplanes = pd.read_csv('../data/raw/Airplane_Crashes_and_Fatalities_Since_1908.csv')
airplanes = standard_headers(airplanes)
airplanes = airplanes.rename(columns={'flight_#': 'flight_no'}) #Character can be a problem so we get rid of it.
display(airplanes.head())
display(airplanes.shape)

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
0,09/17/1908,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,07/12/1912,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,08/06/1913,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,09/09/1913,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
4,10/17/1913,10:30,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...


(5268, 13)

In [4]:
airplanes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5268 entries, 0 to 5267
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   date          5268 non-null   object 
 1   time          3049 non-null   object 
 2   location      5248 non-null   object 
 3   operator      5250 non-null   object 
 4   flight_no     1069 non-null   object 
 5   route         3562 non-null   object 
 6   type          5241 non-null   object 
 7   registration  4933 non-null   object 
 8   cn/in         4040 non-null   object 
 9   aboard        5246 non-null   float64
 10  fatalities    5256 non-null   float64
 11  ground        5246 non-null   float64
 12  summary       4878 non-null   object 
dtypes: float64(3), object(10)
memory usage: 535.2+ KB


In [5]:
airplanes.isna().sum()

date               0
time            2219
location          20
operator          18
flight_no       4199
route           1706
type              27
registration     335
cn/in           1228
aboard            22
fatalities        12
ground            22
summary          390
dtype: int64

We can observe quite a number of Nas. This could be due to multiple reasons: lack of obligation to provide that information (as many measures were implemented with time and progressively with the industry development), non-registered flights (like military or pirate flights), unclear information or damaged sensors.

For now we will not drop any rows or columns and continue to explore the data, as one missing value in a row may not be that relevant to understand the case. Date is the essential paramenter to classify the flights (also will allow us to build a time accident rate) and we have no missing values on that one.

Since date and time columns are pretty much self-explanatory, we will go ahead and jump to the location column.

<br><br><h3>Location<h3>

The location of the accident.

In [6]:
airplanes['location'].nunique(dropna = False)

4304

In [7]:
airplanes['location'].value_counts()

Sao Paulo, Brazil                    15
Moscow, Russia                       15
Rio de Janeiro, Brazil               14
Anchorage, Alaska                    13
Manila, Philippines                  13
                                     ..
Near Charana, Bolivia                 1
Monte Matto, Italy                    1
Misaki Mountain, Japan                1
Angelholm, Sweden                     1
State of Arunachal Pradesh, India     1
Name: location, Length: 4303, dtype: int64

We have 4304 different locations for the flights (we also have 20 Nas and one of them is included in this calculation, so in reality we would have 4303).

Some of these locations can be in the same country and even region or city. As this column stores info as a string, just a minor difference in the writing would count as a different value, so we will have to consider that when using the information (as we can see in the value counts - 2 locations from the same country on top, an USA state but without stating the country). We will have to get the country information from this column.

<br><br><h3>Flight operators<h3>

The airline company managing the flight.

In [8]:
airplanes['operator'].nunique()

2476

In [9]:
airplanes['operator'].value_counts()

Aeroflot                               179
Military - U.S. Air Force              176
Air France                              70
Deutsche Lufthansa                      65
Air Taxi                                44
                                      ... 
Military - Argentine Navy                1
Richland Flying Service - Air Taxii      1
Harbor Airlines - Air Taxi               1
Aerovias Venezolanas SA (Venezuela)      1
Strait Air                               1
Name: operator, Length: 2476, dtype: int64

We have 2476 unique flight operators. Nas have not been considered as the nunique function drops them by default. Here we have some considerations: this info is also stored in a string so any minor difference or typo would count as a different value. Also, some flights are labeled as private - they may be grouped but it doesn't mean they were handled by the same company.
We can already see which operators have the biggest incident rate: Aeroflot, USAF and Air France (closely followed by Lufthansa).

<br><br><h3>Flight number<h3>

This column specifies the flight number. It is expected to have NAs since only commercial flights with an established route have this code. <br> This is a combination of the airline's IATA code and 1-4 digit number.

In [10]:
airplanes['flight_no'].value_counts(dropna = False)

NaN     4199
-         67
1         10
4          7
21         6
        ... 
621        1
215        1
208B       1
158        1
447        1
Name: flight_no, Length: 725, dtype: int64

There are many Nas, it may be interesting to keep it just for extra information purposes (for example, to provide extra narrative context when talking about an specific flight when this info is available). That would be the only purpose of this column, also because flight names are usually based on aircraft routes, and these can change with relative frequence depending on many factors.

<br><br><h3>Route<h3>

This column indicates the route of the aircraft - normally specifying departure and programmed destination.

In [11]:
airplanes['route'].value_counts(dropna = False)

NaN                           1706
Training                        81
Sightseeing                     29
Test flight                     17
Test                             6
                              ... 
Manila - Lapu Lapu               1
Saint Denis - Paris              1
Cork - London                    1
Peoria, IL - St. Louis, MO       1
Mechuka for Jorhat               1
Name: route, Length: 3245, dtype: int64

As before, we have quite a number of Nas but it may be interesting to keep this column for extra context.

<br><br><h3>Type<h3>

This column indicates the aircraft model.

In [12]:
airplanes['type'].value_counts(dropna = False)

Douglas DC-3                                334
de Havilland Canada DHC-6 Twin Otter 300     81
Douglas C-47A                                74
Douglas C-47                                 62
Douglas DC-4                                 40
                                           ... 
Boeing 727-21                                 1
NAMC-YS-11-111                                1
Lockheed EC-121H                              1
Cessna 205A                                   1
Airbus A330-203                               1
Name: type, Length: 2447, dtype: int64

We can observe that the Douglas models have the highest number of accidents by far. We can later consider to cross check it with the operators.

In [13]:
airplanes_model = airplanes[(airplanes['type'].isna())]
airplanes_model

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
49,04/06/1921,,"Point Cook, Australia",Military - Royal Australian Air Force,,,,H3021,,2.0,1.0,0.0,
52,05/17/1921,,"Rock Springs, Wyoming",US Aerial Mail Service,,,,176,,1.0,1.0,0.0,
61,04/08/1922,,"Pao Ting Fou, China",,,,,,,17.0,17.0,0.0,All seventeen aboard were Chinese nationals.
86,11/06/1924,,"Cabrerolles, France",Grands Express Aeriens,,,,F-AFBD,,1.0,1.0,0.0,
97,09/07/1925,,"Toul, France",CIDNA,,,,,,3.0,3.0,0.0,
114,04/15/1927,,"King Hill, Idaho",Varney Air Lines,,,,,,1.0,1.0,0.0,Crashed after an unsuccessful attempt at fly i...
138,03/03/1928,,"Rio de Janeiro, Brazil",,,,,,,10.0,10.0,0.0,
220,09/25/1930,,"Southesk, Saskatchewan, Canada",Western Canada Airways,,Calgary - Moosejaw,,,,3.0,3.0,0.0,The air mail plane crashed in fog while en ro...
359,05/29/1935,,"San Barbra, Honduras",,,,,,,9.0,6.0,0.0,Crashed into the Ulua River.
545,11/09/1940,,"Rio de Janeiro, Brazil",,,Rio de Janeiro - Sao Paulo,,,,18.0,18.0,0.0,Midair collisioin with a private plane.


We will try to find these missing values. This is the result.

|Index| Date| Model| Source (if any)|
| :-: | :-: | --          | --: |
|49| 	04/06/1921 | Avro 504K| https://aviation-safety.net/wikibase/27223 |
|52| 	05/17/1921 | De Havilland DH-4|http://www.planecrashinfo.com/1921/1921-8.htm |
|61| 	04/08/1922 | No info|
|86| 	11/06/1924 | Breguet 14|http://www.planecrashinfo.com/1924/1924-5.htm |
|97| 	09/07/1925 | No info|
|114| 	04/15/1927 | Swallow |http://www.planecrashinfo.com/1927/1927-4.htm |
|138| 	03/03/1928 | No info|
|220|	09/25/1930 | Boeing 40 | http://www.planecrashinfo.com/1930/1930-16.htm |
|359| 	05/29/1935 | No info|
|545| 	11/09/1940 | No info|
|567| 	12/11/1941 | No info|
|632| 	11/08/1943 | No info|
|678| 	11/09/1944 | No info|
|717| 	06/29/1945 | de Havilland DH.98 Mosquito Mk I| https://aviation-safety.net/wikibase/70185 |
|767| 	03/17/1946 | Douglas C-47 (DC-3) | https://aviation-safety.net/database/record.php?id=19460317-0 |
|768| 	03/18/1946 | No info | |
|772| 	04/08/1946 | Douglas C-47B-1-DL (DC-3) | https://aviation-safety.net/database/record.php?id=19460408-0 |
|773| 	04/22/1946 | Lockheed 14 Super Electra | http://www.planecrashinfo.com/1946/1946-21.htm |
|806| 	09/20/1946 | Curtiss Wright C-46 | http://www.planecrashinfo.com/1946/1946-55.htm |
|1144| 	08/08/1951 | Douglas C-47A-20-DK (DC-3) | https://aviation-safety.net/database/record.php?id=19510808-0 |
|1190| 	03/26/1952 | No info |
|1355| 	12/29/1954 | No info |
|1386| 	08/06/1955 | Ilyushin Il-14 | https://aviation-safety.net/database/record.php?id=19550806-0 |
|2289| 	01/16/1969 | Douglas C-47 Skytrain (DC-3) | http://www.planecrashinfo.com/1969/1969-6.htm |
|4399| 	02/11/1996 | helicopter | https://aviation-safety.net/wikibase/181862 |
|4602| 	11/27/1998 | Piper PA-32R-300 | https://aviation-safety.net/wikibase/711 |
|5094| 	04/23/2006 | Antonov An-2R | https://aviation-safety.net/database/record.php?id=20060423-1 |

In [14]:
airplanes['type'].iloc[49] = '''Avro 504K'''
airplanes['type'].iloc[52] = '''De Havilland DH-4'''
airplanes['type'].iloc[86] = '''Breguet 14'''
airplanes['type'].iloc[114] = '''Swallow'''
airplanes['type'].iloc[220] = '''Boeing 40'''
airplanes['type'].iloc[717] = '''de Havilland DH.98 Mosquito Mk I'''
airplanes['type'].iloc[767] = ''' Douglas C-47 (DC-3)'''
airplanes['type'].iloc[772] = '''Douglas C-47B-1-DL (DC-3)'''
airplanes['type'].iloc[773] = '''Lockheed 14 Super Electra'''
airplanes['type'].iloc[806] = '''Curtiss Wright C-46'''
airplanes['type'].iloc[1144] = '''Douglas C-47A-20-DK (DC-3)'''
airplanes['type'].iloc[1386] = '''Ilyushin Il-14'''
airplanes['type'].iloc[2289] = '''Douglas C-47 Skytrain (DC-3)'''
airplanes['type'].iloc[4399] = '''helicopter'''
airplanes['type'].iloc[4602] = '''Piper PA-32R-300'''
airplanes['type'].iloc[5094] = '''Antonov An-2R'''
airplanes['type'].iloc[49] = '''Avro'''

In [15]:
airplanes_model = airplanes[(airplanes['type'].isna())]
airplanes_model

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
61,04/08/1922,,"Pao Ting Fou, China",,,,,,,17.0,17.0,0.0,All seventeen aboard were Chinese nationals.
97,09/07/1925,,"Toul, France",CIDNA,,,,,,3.0,3.0,0.0,
138,03/03/1928,,"Rio de Janeiro, Brazil",,,,,,,10.0,10.0,0.0,
359,05/29/1935,,"San Barbra, Honduras",,,,,,,9.0,6.0,0.0,Crashed into the Ulua River.
545,11/09/1940,,"Rio de Janeiro, Brazil",,,Rio de Janeiro - Sao Paulo,,,,18.0,18.0,0.0,Midair collisioin with a private plane.
567,12/11/1941,,"Miami, Florida",Pan American Airways,,,,NC21V,,3.0,3.0,0.0,
632,11/08/1943,,"Poona, India",Military - Indian Air Force,,,,,,1.0,1.0,37.0,Crashed into a village.
678,11/09/1944,,"Seljord, Norway",Military - U.S. Army Air Corps,,,,42-52196,,,,,
768,03/18/1946,,"Between Chungking and Shanghai, China",China National Aviation Corporation,,Chunking - Shanghai,,139,,,,,Disappeared while en route. Plane never located.
1190,03/26/1952,,"Moscow, Russia",Aeroflot,,,,,,70.0,70.0,0.0,The plane overshot the runway and collided wit...


<br><br><h3>Registration and cn/In<h3>

These columns indicate the registration number of the aircraft and the serial number given by the manufacturer. It is not relevant for our dataset and they contain quite a number or Nas, but we will keep them as we did with the flight number.

<br><br><h3>Aboard<h3>

Time to check the 'aboard' column, which indicates how many persons were aboard each flight (including crew members as the lower values indicate).

In [16]:
airplanes['aboard'].value_counts(dropna = False)

2.0      377
3.0      370
4.0      296
5.0      239
6.0      223
        ... 
240.0      1
218.0      1
192.0      1
269.0      1
228.0      1
Name: aboard, Length: 240, dtype: int64

In [17]:
airplanes['aboard'].isna().sum()

22

There are 22 flights for which we have no information on how many passengers were onboard. We can assume that there was at least the pilot on board, so we will check the type of flight to see if we can gather more information and eventually fill in the values by comparing them to similar flights.<br> As a last resort, we could also look for them on google as the information could be available, and we have enough data to do a search.<br>
We will also take advantage of the descritpion column to get more information.

In [18]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
26,10/20/1919,,English Channel,Aircraft Transport and Travel,,,De Havilland DH-4,G-EAHG,,,,,
333,08/10/1934,,"Ningbo, China",China National Aviation Corporation,,,Sikorsky S-38B,,,,,,
348,03/07/1935,,"Schievelbein, Germany",Deruluft,,,Rochrbach Roland,D-AJYP,45,,3.0,0.0,Fuselage failure.
364,08/13/1935,,"Hangow, China",China National Aviation Corporation,,,Sikorsky S-38B,NV40V,,,,,Destoryed in a storm.
423,12/26/1936,,"Nanking, China",China National Aviation Corporation,,,Douglas DC-2,NC14269,,,,,
526,09/26/1939,,North Sea,KLM Royal Dutch Airlines,,Stockholm - Amsterdam,Douglas DC-3,PH-ASM,2142,,1.0,0.0,One Swedish passenger was killed when the plan...
537,07/07/1940,,Gulf of Tonkin,Air France,,,Dewoitine D-338,F-AQBA,1,,,,Shot down by a Japanese military fighter.
570,01/24/1942,,"Near Samarinda, Borneo",KNILM,,,Douglas DC-3,PK-AFW,1982,,,,Shot down by Japanese military aircraft.
571,01/26/1942,,"Kupang, Timor",KNILM,,,Grumman G-21 Goose,PK-AFS,1081,,,,Shot down by Japanese military aircraft.
573,02/14/1942,,,China National Aviation Corporation,,,Douglas DC-2,45,,,,,


In [19]:
airplanes['summary'][526]

'One Swedish passenger was killed when the plane was attacked by German fighters. The plane was able to land safely in Amsterdam.'

In [20]:
airplanes['summary'][570]

'Shot down by Japanese military aircraft.'

In [21]:
airplanes['summary'][1479]

'Explosive decompression. A passenger was sucked out of a window at 18,000 feet when the window he was sitting next to shattered. The body was never recovered.'

In [22]:
airplanes['summary'][3369]

'A sixteen-year-old boy was killed when a bomb detonated under a seat cushion. The explosion caused minor damage and the plane landed safely at Honolulu. The bomb was placed onboard by Mohammed Rashed, a Jordanian terrorist with the May 15 Organization.'

<br>8 aircrafts were involved on incidents during the WWII period. Some of them directly involved in combat (three of them shot down by a Japanese military fighter). Other on which a passenger died after the plane was attacked by a fighter.<br><br>
One of the incidents refers to a passenger dying of cholera while en route.<br><br>
One of the incidents involves a bomb exploding on board and killing a passenger, but the plane managed to land without any further casualties.<br><br>

Since we want to focus mainly on aircraft accidents - whether these are mechanic, man-caused or weather caused, or due to hijacking/sabotage - we can drop these rows as non relevant. This also could mean that some of the casualties in the dataset are not related to airplane crashes, but on-flight health related deaths. We will have to dig in deeper but we will start by clearing these rows (7 in total).

In [23]:
airplanes = airplanes.drop([airplanes.index[526], airplanes.index[537], airplanes.index[570], airplanes.index[571], airplanes.index[587], 
                            airplanes.index[3369], airplanes.index[4080]])
airplanes = airplanes.reset_index(drop=True)

In [24]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
26,10/20/1919,,English Channel,Aircraft Transport and Travel,,,De Havilland DH-4,G-EAHG,,,,,
333,08/10/1934,,"Ningbo, China",China National Aviation Corporation,,,Sikorsky S-38B,,,,,,
348,03/07/1935,,"Schievelbein, Germany",Deruluft,,,Rochrbach Roland,D-AJYP,45.0,,3.0,0.0,Fuselage failure.
364,08/13/1935,,"Hangow, China",China National Aviation Corporation,,,Sikorsky S-38B,NV40V,,,,,Destoryed in a storm.
423,12/26/1936,,"Nanking, China",China National Aviation Corporation,,,Douglas DC-2,NC14269,,,,,
569,02/14/1942,,,China National Aviation Corporation,,,Douglas DC-2,45,,,,,
588,10/01/1942,,"Kunming, China",China National Aviation Corporation,,,Douglas C-47,69,,,,,Crashed while attempting to land after losing ...
673,11/09/1944,,"Seljord, Norway",Military - U.S. Army Air Corps,,,,42-52196,,,,,
763,03/18/1946,,"Between Chungking and Shanghai, China",China National Aviation Corporation,,Chunking - Shanghai,,139,,,,,Disappeared while en route. Plane never located.
827,12/25/1946,,"Lunghwa, Shanghai, China",China National Aviation Corporation,,,"Curtiss C-46, C-47, DC-3",115,,,87.0,4.0,Various accidents involving three aircraft una...


The last flight of the list is relatively recent so we will probably be able to find more information on google.

7 people dead. <br>
https://hemeroteca.lavanguardia.com/preview/2000/03/23/pagina-40/34056506/pdf.html?search=aviocar

In [25]:
airplanes.iloc[4698]

date                                                   03/22/2000
time                                                          NaN
location                                          Herreira, Spain
operator                             Military - Ejército del Aire
flight_no                                                     NaN
route                                          Sevilla - Herreira
type                                      CASA 212-DE Aviocar 200
registration                                            TM-12D-73
cn/in                                                         314
aboard                                                        NaN
fatalities                                                    NaN
ground                                                        NaN
summary         Crashed while attempting to land in poor weather.
Name: 4698, dtype: object

In [26]:
airplanes['aboard'][4698] = 7
airplanes['fatalities'][4698] = 7
airplanes.iloc[4698]

date                                                   03/22/2000
time                                                          NaN
location                                          Herreira, Spain
operator                             Military - Ejército del Aire
flight_no                                                     NaN
route                                          Sevilla - Herreira
type                                      CASA 212-DE Aviocar 200
registration                                            TM-12D-73
cn/in                                                         314
aboard                                                        7.0
fatalities                                                    7.0
ground                                                        NaN
summary         Crashed while attempting to land in poor weather.
Name: 4698, dtype: object

In [27]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
26,10/20/1919,,English Channel,Aircraft Transport and Travel,,,De Havilland DH-4,G-EAHG,,,,,
333,08/10/1934,,"Ningbo, China",China National Aviation Corporation,,,Sikorsky S-38B,,,,,,
348,03/07/1935,,"Schievelbein, Germany",Deruluft,,,Rochrbach Roland,D-AJYP,45.0,,3.0,0.0,Fuselage failure.
364,08/13/1935,,"Hangow, China",China National Aviation Corporation,,,Sikorsky S-38B,NV40V,,,,,Destoryed in a storm.
423,12/26/1936,,"Nanking, China",China National Aviation Corporation,,,Douglas DC-2,NC14269,,,,,
569,02/14/1942,,,China National Aviation Corporation,,,Douglas DC-2,45,,,,,
588,10/01/1942,,"Kunming, China",China National Aviation Corporation,,,Douglas C-47,69,,,,,Crashed while attempting to land after losing ...
673,11/09/1944,,"Seljord, Norway",Military - U.S. Army Air Corps,,,,42-52196,,,,,
763,03/18/1946,,"Between Chungking and Shanghai, China",China National Aviation Corporation,,Chunking - Shanghai,,139,,,,,Disappeared while en route. Plane never located.
827,12/25/1946,,"Lunghwa, Shanghai, China",China National Aviation Corporation,,,"Curtiss C-46, C-47, DC-3",115,,,87.0,4.0,Various accidents involving three aircraft una...


For the last value (05/09/1989 Near Tainjin, China), 10/10 dead: <br>
https://aviation-safety.net/wikibase/32772

In [28]:
airplanes['aboard'][3837] = 10
airplanes['fatalities'][3837] = 10

In [29]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
26,10/20/1919,,English Channel,Aircraft Transport and Travel,,,De Havilland DH-4,G-EAHG,,,,,
333,08/10/1934,,"Ningbo, China",China National Aviation Corporation,,,Sikorsky S-38B,,,,,,
348,03/07/1935,,"Schievelbein, Germany",Deruluft,,,Rochrbach Roland,D-AJYP,45.0,,3.0,0.0,Fuselage failure.
364,08/13/1935,,"Hangow, China",China National Aviation Corporation,,,Sikorsky S-38B,NV40V,,,,,Destoryed in a storm.
423,12/26/1936,,"Nanking, China",China National Aviation Corporation,,,Douglas DC-2,NC14269,,,,,
569,02/14/1942,,,China National Aviation Corporation,,,Douglas DC-2,45,,,,,
588,10/01/1942,,"Kunming, China",China National Aviation Corporation,,,Douglas C-47,69,,,,,Crashed while attempting to land after losing ...
673,11/09/1944,,"Seljord, Norway",Military - U.S. Army Air Corps,,,,42-52196,,,,,
763,03/18/1946,,"Between Chungking and Shanghai, China",China National Aviation Corporation,,Chunking - Shanghai,,139,,,,,Disappeared while en route. Plane never located.
827,12/25/1946,,"Lunghwa, Shanghai, China",China National Aviation Corporation,,,"Curtiss C-46, C-47, DC-3",115,,,87.0,4.0,Various accidents involving three aircraft una...


This is the search result:

|Index|Date|Info|Source (if any)|
|:-:|:-:|:--|---|
|26|10/20/1919|One aboard, no fatalities|http://www.planecrashinfo.com/1920/1920-31.htm |
|333|08/10/1934|Undetermined aboard & fatalities| http://www.planecrashinfo.com/1934/1934-18.htm |
|348|03/07/1935|13/13 dead| https://aviation-safety.net/database/record.php?id=19350720-0 |
|364|08/13/1935|Undetermined aboard & fatalities|http://www.planecrashinfo.com/1935/1935-24.htm |
|423|12/26/1936|Undetermined aboard & fatalities|http://www.planecrashinfo.com/1936/1936-49.htm |
|569|02/14/1942|No info|
|588|10/01/1942|No info|
|673|11/09/1944|10/10 killed + additional description|https://www.baaa-acro.com/crash/crash-consolidated-b-24-liberator-near-seljord-10-killed |
|763|03/18/1946|Undetermined aboard & fatalities|http://www.planecrashinfo.com/1946/1946-15.htm |
|827|12/25/1946|Several partial matches, inconclusive|
|1474|04/20/1957|Undetermined aboard, one fatality| http://www.planecrashinfo.com/1957/1957-21.htm |
|3002|11/03/1977|One fatality|http://www.airsafe.com/events/airlines/elal.htm |
|3318|12/16/1981|12/12 fatalities. Date incorrect.| https://aviation-safety.net/wikibase/33132 |

With this recovered information we can fill out the missing info a decide what to do with the rows for which we have no information. <br>We need to take into consideration the historic context - the missing information is all from one source, the China National Aviation Corporation. This airliner was part of the Nationalist Chinese goverment before they lost the war, so the information (especially in the last years when they were about to lose the war) is expected to be unclear. <br>Since we are talking about 7 rows, we can also drop them so we have no other missing values for fatalities in the dataset, and then fill out the new information obtained.

In [30]:
airplanes = airplanes.drop([airplanes.index[333], airplanes.index[364], airplanes.index[423], airplanes.index[569], airplanes.index[588], airplanes.index[763], 
                            airplanes.index[827]])
airplanes = airplanes.reset_index(drop=True)

In [31]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
26,10/20/1919,,English Channel,Aircraft Transport and Travel,,,De Havilland DH-4,G-EAHG,,,,,
347,03/07/1935,,"Schievelbein, Germany",Deruluft,,,Rochrbach Roland,D-AJYP,45.0,,3.0,0.0,Fuselage failure.
668,11/09/1944,,"Seljord, Norway",Military - U.S. Army Air Corps,,,,42-52196,,,,,
1467,04/20/1957,,"Jirkouk, Iraq",Air France,,"Tehran, Iran - Istanbul, Turkey",Lockheed Super Constellation,F-BGNE,4514.0,,1.0,0.0,Explosive decompression. A passenger was sucke...
2995,11/03/1977,,"Belgrade, Yugoslavia",El Al,,,Boeing B-747,,,,1.0,0.0,Explosive decompression.
3311,12/16/1981,,"Kuala Belait, Brunei",Bristow Helicopters,,,Aerospatiale Puma,9M-SSC,1481.0,,12.0,0.0,


In [32]:
airplanes['aboard'][26] = 1
airplanes['fatalities'][26] = 0
airplanes['ground'][26] = 0
airplanes['summary'][26] = '''Crashed into the sea while attempting to land in fog.'''

airplanes['aboard'][347] = 13
airplanes['fatalities'][347] = 13
airplanes['summary'][347] = '''The DC-2, named "Gaai" operated on a passenger service from Milan, Italy to Amsterdam, the Netherlands. The aircraft took off at 11:36 hours, 
bound for Frankfurt, Germany, which was the next planned stop. Cruising at 5000 m altitude, ice accretion forced the crew to descend. At 3000 m the flight was out of 
icing conditions. However the aircraft was now flying between clouds shrouded mountains. Attempting to navigate visually, the flight continued at low altitude. 
Likely the crew entered the wrong mountain pass. They circled a valley, looking for a way out but low clouds and rain made it very difficult to continue flight.
The captain likely decided to perform a gear-up forced landing in the valley. Flaps were selected down and engine power was decreased. In a left hand turn the aircraft 
stalled and impacted the ground. The four crew members and the nine passengers died.'''

airplanes['aboard'][668] = 10
airplanes['fatalities'][668] = 10
airplanes['ground'][668] = 0
airplanes['type'][668] = '''Consolidated B-24 Liberator'''
airplanes['summary'][668] = ''' En route, the crew encountered poor weather and icing conditions. While all engines and both wings were contaminated by ice, 
the aircraft was unable to maintain the prescribed altitude and hit the slope of Mt Skorve located in the region of Seljord. All ten crew members were killed. '''

airplanes['aboard'][3311] = 12
airplanes['fatalities'][3311] = 12
airplanes['summary'][3311] = '''A second stage epicyclic module planetary gear fatigue failure caused loss of the main rotor, which severed the tail, causing the aircraft 
to crash in a swamp near the border settlement of Kuala Belait. The gearbox had a recent history of metallic debris being found on the magnetic chip detector
in the main module. The aircraft was on contract to Sarawak Shell. '''

In [33]:
airplanes_nopassengers = airplanes[(airplanes['aboard'].isna())]
airplanes_nopassengers

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
1467,04/20/1957,,"Jirkouk, Iraq",Air France,,"Tehran, Iran - Istanbul, Turkey",Lockheed Super Constellation,F-BGNE,4514.0,,1.0,0.0,Explosive decompression. A passenger was sucke...
2995,11/03/1977,,"Belgrade, Yugoslavia",El Al,,,Boeing B-747,,,,1.0,0.0,Explosive decompression.


We only have 2 remaining Nas for the aboard column. Not so relevant as we are focused more on the number of casualties. Both are decompression incidents so we can also calculate the ratio of these compared to the total of accidents.

<br><br><h3>Fatalities<h3>

Without any doubt, the worst possible issue that an airline company can face, even if the deaths are related to an underlying medical condition of the passenger, the mere association to the flight industry is a grime reminder of the human body fragility.

In [34]:
airplanes['fatalities'].isna().sum()

0

We don't have any missing values. Looks like the ones already filled in manually and the rows we dropped with no information has completed this column as well. We will explore now the ground column.

<br><br><h3>Ground<h3>

Ground, when talking about aircraft accidents, is how we refer to the casualties derived from the accident that were not inside the aircraft, like ground personnel or civilians.

In [35]:
airplanes['ground'].isna().sum()

11

We have 11 missing values. Let's take a look and find out if we can fill those out too.

In [36]:
airplanes_ground = airplanes[(airplanes['ground'].isna())]
airplanes_ground

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
228,11/18/1930,c: 2:00,"Techachapi Mountains, California",PacifiAir Transport,,"Burbank, CA - Oakland, CA",Boeing 40,NC5340,,3.0,3.0,,Crashed into a mountainside at an altitude of ...
308,11/09/1933,22:35,"Portland, Oregon",United Air Lines,,"Seattle, WA - Dallas, TX",Boeing 247,NC13345,,9.0,4.0,,Crashed in a thickly wooded area upon taking o...
310,11/20/1933,,"Near Tsinan, China",China National Aviation Corporation,,Canton - Shanghi,Sinson,,,8.0,8.0,,Crashed into the Chingshan mountain range in fog.
523,11/20/1939,,"Gosport, England",British Airways,,,Airspeed Oxford,G-AFFM,,2.0,2.0,,
633,03/22/1944,,New Guinea,Military - U.S. Army,,Port Moreaby - Nadzab,Consolidated B-24 Liberator,,,21.0,21.0,,Disappeared while en route on a non-combat mis...
1165,02/07/1952,,"Kaneko, Japan",Military - U.S. Air Force,,,Boeing B-29,,,17.0,17.0,,Hit power lines and crashed into houses. Seven...
1174,03/12/1952,,"Near Sequin, Texas",Military - U.S. Air Force / U.S. Air Force,,Training,Boeing B-29 / Boeing B-29,,,15.0,15.0,,While on a training mission and flying blind o...
1810,12/20/1962,,"Kadena AB, Okinawa",Military - U.S. Air Force,,,KB-50,,,12.0,12.0,,"Twelve killed, including civilians. Two civili..."
4691,03/22/2000,,"Herreira, Spain",Military - Ejército del Aire,,Sevilla - Herreira,CASA 212-DE Aviocar 200,TM-12D-73,314,7.0,7.0,,Crashed while attempting to land in poor weather.
4695,06/23/2000,11:41,"Boca Raton, Florida",Universal Jet Aviation,,Boca Raton - Fort Pierce,Learjet 55,N220JC,55-050,3.0,3.0,,Shortly after takeoff the aircraft impacted an...


We can find very relevant information on the summary column for these (only one has no summary).

In [37]:
airplanes_ground['summary'][228]

'Crashed into a mountainside at an altitude of 4,500  feet during a snowstorm.'

We can safely assume no ground casualties at that height.

In [38]:
airplanes_ground['summary'][308]

'Crashed in a thickly wooded area upon taking off after the pilot became lost in fog.'

Same as before.

In [39]:
airplanes_ground['summary'][310]

'Crashed into the Chingshan mountain range in fog.'

We will assume 0 ground casualties

In [40]:
airplanes_ground['summary'][523]

nan

Since it is the only one missing the summary, we will try to find more info online.

https://en.wikipedia.org/wiki/List_of_accidents_and_incidents_involving_airliners_in_the_United_Kingdom#1930%E2%80%931939

Airspeed Oxford G-AFFM being operated by British Airways crashed at Gosport, Hampshire after it hit a barrage balloon cable, two crew killed.

This matches the info we have about aboard and fatalities. The date, model and registration perfectly matches so we will update the info accordingly.

In [41]:
airplanes_ground['summary'][633]

'Disappeared while en route on a non-combat mission. Wreckage found 39 years later on 4/30/1983.'

We will assume no ground casualties.

In [42]:
airplanes_ground['summary'][1165]

'Hit power lines and crashed into houses. Seventeen killed including civilians.'

Bad luck, we will have to look up more info, although B29s usually had a crew of 10-15 members. If the accident details cannot be found, we will consider 2-7 ground casualties.

https://aviation-safety.net/wikibase/85403<br>
Info found, 13/13 crew members, 4 ground casualties. We will also update description to this one as it is more complete:<br><br>
*Destroyed during combat operations 7 February 1952: Shortly after takeoff from Yokota AB, while climbing in snow falls, the heavy bomber went out of control and crashed in a huge explosion on several houses located about three miles (5 km) northwest of the airfield. All 13 crew members and five people on the ground were killed.*

In [43]:
airplanes_ground['summary'][1174]

'While on a training mission and flying blind on instruments the planes collided. One plane struck the ground and disintegrated. The other glided down several miles away, exploded and burned. Both planes crashed on ranches several miles apart about 18 miles from San Antonio . Six killed on one plane and seven on the other.'

Even though they crashed into ranches, summary reports only the aircraft casualties. We will assume 0 ground fatalities.

In [44]:
airplanes_ground['summary'][1810]

'Twelve killed, including civilians. Two civilian houses burnt.'

We need to find out either how many civilians or how many crew members died. 

https://www.town.kadena.okinawa.jp/kadena/P08_base%20digest_English.pdf<br>
*A KB50 aerial tanker crashed on take-off into Kadena’s Yara district. 2 were killed, 8 injured, and three homes completely destroyed in the ensuing fire.*<br><br>
Looks like the website refers only to civilian casualties, so we will correct them accordingly.

In [45]:
airplanes_ground['summary'][4691]

'Crashed while attempting to land in poor weather.'

We can assume 0 ground casualties.

In [46]:
airplanes_ground['summary'][4695]

"Shortly after takeoff the aircraft impacted another plane and crashed to the ground. The failure of the pilot's of both airplanes to maintain a visual lookout while climbing and maneuvering resulting in an in-flight collision and subsequent collision with residences and terrain."

We need to dig in deeper.

http://edition.cnn.com/2000/US/06/23/plane.crash.03/index.html<br>
4 dead, 3 from this aircraft and another from the other aircraft they collided with. No ground casualties.

In [47]:
airplanes_ground['summary'][5250]

'The cargo plane crashed while on approach to Isiro-Matari Airport.'

We can assume no ground fatalities, just the crew. This concludes the exploration so we will now correct the values.

In [48]:
airplanes['ground'][228] = 0

airplanes['ground'][308] = 0

airplanes['ground'][310] = 0

airplanes['ground'][523] = 0
airplanes['summary'][523] = '''Airspeed Oxford G-AFFM being operated by British Airways crashed at Gosport, Hampshire after it hit a barrage balloon cable, two crew killed.'''

airplanes['ground'][633] = 0

airplanes['aboard'][1165] = 13
airplanes['fatalities'][1165] = 13
airplanes['ground'][1165] = 5
airplanes['summary'][1165] = '''Destroyed during combat operations 7 February 1952: Shortly after takeoff from Yokota AB, while climbing in snow falls, the heavy bomber went out of control 
and crashed in a huge explosion on several houses located about three miles (5 km) northwest of the airfield. All 13 crew members and five people on the ground were killed. '''

airplanes['ground'][1174] = 0

airplanes['aboard'][1810] = 10
airplanes['fatalities'][1810] = 10
airplanes['ground'][1810] = 2

airplanes['ground'][4691] = 0

airplanes['ground'][4695] = 0

airplanes['ground'][5250] = 0

In [49]:
airplanes.reset_index(drop=True)

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
0,09/17/1908,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,07/12/1912,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,08/06/1913,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,09/09/1913,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
4,10/17/1913,10:30,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5249,05/20/2009,06:30,"Near Madiun, Indonesia",Military - Indonesian Air Force,,Jakarta - Maduin,Lockheed C-130 Hercules,A-1325,1982,112.0,98.0,2.0,"While on approach, the military transport cras..."
5250,05/26/2009,,"Near Isiro, DemocratiRepubliCongo",Service Air,,Goma - Isiro,Antonov An-26,9Q-CSA,5005,4.0,4.0,0.0,The cargo plane crashed while on approach to I...
5251,06/01/2009,00:15,"AtlantiOcean, 570 miles northeast of Natal, Br...",Air France,447,Rio de Janeiro - Paris,Airbus A330-203,F-GZCP,660,228.0,228.0,0.0,The Airbus went missing over the AtlantiOcean ...
5252,06/07/2009,08:30,"Near Port Hope Simpson, Newfoundland, Canada",Strait Air,,Lourdes de BlanSablon - Port Hope Simpson,Britten-Norman BN-2A-27 Islander,C-FJJR,424,1.0,1.0,0.0,The air ambulance crashed into hills while att...


In [50]:
airplanes.head()

Unnamed: 0,date,time,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
0,09/17/1908,17:18,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,07/12/1912,06:30,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,08/06/1913,,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,09/09/1913,18:30,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
4,10/17/1913,10:30,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...


In [51]:
airplanes.isna().sum()

date               0
time            2205
location          19
operator          18
flight_no       4187
route           1695
type               9
registration     333
cn/in           1220
aboard             2
fatalities         0
ground             0
summary          383
dtype: int64

Time column is not so relevant in this case as we have many missing values. It is simply not feasible to fill out those values by hand and we can't drop so many rows either (almost 50% of the dataset) so we will drop the column.

In [52]:
airplanes = airplanes.drop(['time'], axis=1)
airplanes.head()

Unnamed: 0,date,location,operator,flight_no,route,type,registration,cn/in,aboard,fatalities,ground,summary
0,09/17/1908,"Fort Myer, Virginia",Military - U.S. Army,,Demonstration,Wright Flyer III,,1.0,2.0,1.0,0.0,"During a demonstration flight, a U.S. Army fly..."
1,07/12/1912,"AtlantiCity, New Jersey",Military - U.S. Navy,,Test flight,Dirigible,,,5.0,5.0,0.0,First U.S. dirigible Akron exploded just offsh...
2,08/06/1913,"Victoria, British Columbia, Canada",Private,-,,Curtiss seaplane,,,1.0,1.0,0.0,The first fatal airplane accident in Canada oc...
3,09/09/1913,Over the North Sea,Military - German Navy,,,Zeppelin L-1 (airship),,,20.0,14.0,0.0,The airship flew into a thunderstorm and encou...
4,10/17/1913,"Near Johannisthal, Germany",Military - German Navy,,,Zeppelin L-2 (airship),,,30.0,30.0,0.0,Hydrogen gas which was being vented was sucked...


<br><br><h2>Cardinality<h2>

Now we will have to deal with the cardinality of the aircraft types. We will group models with small differences into the base model to allow a better visualization.

In [58]:
airplanes = airplanes.fillna('')

In [828]:
#airplanes['type'].value_counts()

The most common manufacturers are the following, so we will explore the data separately and try to group the models as much as possible:

Douglas, Antonov, Tupolev, Lockheed, Boeing, Airbus, Cessna

antonov = airplanes[(airplanes['type'].str.contains('Antonov'))]
douglas = airplanes[(airplanes['type'].str.contains('Douglas'))]
tupolev = airplanes[(airplanes['type'].str.contains('Tupolev'))]
lockheed = airplanes[(airplanes['type'].str.contains('Lockheed'))]
boeing = airplanes[(airplanes['type'].str.contains('Boeing'))]
airbus = airplanes[(airplanes['type'].str.contains('Airbus'))]
cessna = airplanes[(airplanes['type'].str.contains('Cessna'))]

In [770]:
def antonov_header(x):
    index = x.find('Antonov')
    if index != -1:
        return x[index:]
    else:
        return x
    
def cardinality_reducer_antonov(col):
    if 'Antonov' in col:
        if '140' in col:
            return re.sub('Antonov A?N?n?-?140\w?\w?[^|]*$', 'Antonov AN-140', col)
        elif '124' in col:
            return re.sub('Antonov A?N?n?-?124\w?\w?[^|]*$', 'Antonov AN-124', col)
        elif '74' in col:
            return re.sub('Antonov A?N?n?-?74\w?\w?[^|]*$', 'Antonov AN-74', col)
        elif '72' in col:
            return re.sub('Antonov A?N?n?-?72\w?\w?[^|]*$', 'Antonov AN-72', col)
        elif '32' in col:
            return re.sub('Antonov A?N?n?-?32\w?\w?[^|]*$', 'Antonov AN-32', col)
        elif '28' in col:
            return re.sub('Antonov A?N?n?-?28\w?\w?[^|]*$', 'Antonov AN-28', col)
        elif '26' in col:
            return re.sub('Antonov A?N?n?-?26\w?\w?[^|]*$', 'Antonov AN-26', col)
        elif '24' in col:
            return re.sub('Antonov A?N?n?-?24\w?\w?[^|]*$', 'Antonov AN-24', col)
        elif '12' in col:
            return re.sub('Antonov A?N?n?-?12\w?\w?[^|]*$', 'Antonov AN-12', col)
        elif '10' in col:
            return re.sub('Antonov A?N?n?-?10\w?\w?', 'Antonov AN-10', col)
        elif '8' in col:
            return re.sub('Antonov A?N?n?-?8\w?\w?', 'Antonov AN-8', col)
        elif '2' in col:
            return re.sub('Antonov A?N?n?-?2\w?\w?[^|]*$', 'Antonov AN-2', col)
        else:
            return col

In [771]:
antonov = airplanes[(airplanes['type'].str.contains('Antonov'))]

In [772]:
antonov['type'] = antonov['type'].apply(antonov_header)
antonov['type'] = antonov['type'].apply(cardinality_reducer_antonov)

In [773]:
antonov['type'].value_counts()

Antonov AN-12     64
Antonov AN-24     62
Antonov AN-26     47
Antonov AN-32     24
Antonov AN-2      18
Antonov AN-10     10
Antonov AN-28      9
Antonov AN-8       4
Antonov AN-124     3
Antonov AN-72      2
Antonov AN-140     2
Antonov AN-74      2
Antonov AN-9       1
Antonov AN-30      1
Name: type, dtype: int64

In [271]:
#douglas['type'].value_counts()

In [810]:
text = '''
Douglas DC-8-43 Douglas DC-8-63F Douglas DC-8-62  Douglas DC-8-63CF Douglas DC-8-61 Douglas DC-8-55F
'''

In [811]:
pattern = 'Douglas D?C?-?8\w?\w?[^|]*$'

In [812]:
print(re.findall(pattern, text))

['Douglas DC-8-43 Douglas DC-8-63F Douglas DC-8-62  Douglas DC-8-63CF Douglas DC-8-61 Douglas DC-8-55F\n']


In [820]:
douglas = airplanes[(airplanes['type'].str.contains('Douglas'))]

In [822]:
def douglas_header(x):
    index = x.find('Douglas')
    if index != -1:
        return x[index:]
    else:
        return x
    
def douglas_header8(x):
    index = x.find('Douglas DC-8')
    if index != -1:
        return x[index:]
    else:
        return x    

def douglas_header4(x):
    index = x.find('Douglas DC-4')
    if index != -1:
        return x[index:]
    else:
        return x    
    
def cardinality_reducer_douglas(col):
    if 'Douglas' in col:
        if '124' in col:
            return re.sub('Douglas C?-?124\w?\w?[^|]*$', 'Douglas C-124', col)
        elif '82' in col:
            return re.sub('Douglas M?D?-?82\w?\w?[^|]*$', 'Douglas MD-82', col)
        elif '74' in col:
            return re.sub('Douglas C?-?74\w?\w?[^|]*$', 'Douglas C-74', col)
        elif '54' in col:
            return re.sub('Douglas C?-?54\w?\w?[^|]*$', 'Douglas C-54', col)
        elif '53' in col:
            return re.sub('Douglas C?-?53\w?\w?[^|]*$', 'Douglas C-53', col)
        elif '47' in col:
            return re.sub('Douglas C?-?47\w?\w?[^|]*$', 'Douglas C-47', col)
        elif 'MD-11' or 'MD11' in col:
            return re.sub('Douglas D?C?-?10\w?\w?[^|]*$', 'Douglas C-10', col)
        elif '10' in col:
            return re.sub('Douglas D?C?-?10\w?\w?[^|]*$', 'Douglas C-10', col)
        elif '9' in col:
            return re.sub('Douglas D?C?-?9\w?\w?[^|]*$', 'Douglas DC-9', col)
        elif '8' in col:
            return re.sub('Douglas D?C?-?8\w?\w?[^|]*$', 'Douglas DC-8', col)
        elif '7' in col:
            return re.sub('Douglas D?C?-?7\w?\w?[^|]*$', 'Douglas DC-7', col)
        elif '6' in col:
            return re.sub('Douglas D?C?-?6\w?\w?[^|]*$', 'Douglas DC-6', col)
        elif 'MD-4' or 'MD4' in col:
            return re.sub('Douglas M?D?-?4\w?\w?[^|]*$', 'Douglas MD-4', col)
        elif '4' in col:
            return re.sub('Douglas D?C?-?4\w?\w?[^|]*$', 'Douglas DC-4', col)
        elif '3' in col:
            return re.sub('Douglas D?C?-?3\w?\w?[^|]*$', 'Douglas DC-3', col)
        elif '2' in col:
            return re.sub('Douglas D?C?-?2\w?\w?[^|]*$', 'Douglas DC-2', col)
        else:
            return col
        


In [823]:
douglas['type'] = douglas['type'].apply(douglas_header)
douglas['type'] = douglas['type'].apply(douglas_header8)
douglas['type'] = douglas['type'].apply(douglas_header4)

douglas['type'] = douglas['type'].apply(cardinality_reducer_douglas)

In [825]:
douglas['type'] = douglas['type'].apply(douglas_header4)


In [829]:
#douglas['type'].value_counts()

In [827]:
douglas['type'].unique()

array(['Douglas M-4', 'Douglas M-3', 'Douglas DC-2-115A',
       'Douglas DC-2-112', 'Douglas DC-2', 'Douglas DC-2-120',
       'Douglas DC-2-115E', 'Douglas DC-3A', 'Douglas DC-2-115L',
       'Douglas DC-3', 'Douglas DST-A-207A', 'Douglas DF-151',
       'Douglas DC 3-A-SB-3-G-14', 'Douglas DC-2-115B',
       'Douglas DC-2-115H', 'Douglas DC-3-3', 'Douglas DC-2-221',
       'Douglas DC-3-A-269', 'Douglas C-49E', 'Douglas C-39-DO  (DC-2)',
       'Douglas DC-3 / Boeing B-34', 'Douglas C-54', 'Douglas C-53',
       'Douglas C-47', 'Douglas DC3-G102', 'Douglas R4D-6',
       'Douglas DC-3-201C /  Army A-26', 'Douglas DC-2-243',
       'Douglas DC-3-201E', 'Douglas DC-3 Dakota', 'Douglas DC-3-194H',
       'Douglas DC-3-227B', 'Douglas DC-3 (C-47DL)',
       'Douglas DC-3 (C-53D-DO)', 'Douglas DC-3 (C-47-A5-DL)',
       'Douglas DC-3 ( C-47-DO)', 'Douglas DC-4-1009',
       'Douglas DC-3A-228D', 'Douglas DC-4', 'Douglas DC-3 (C-47A-90-DL)',
       'Douglas DC-3 (C-47-B-1-DK)', 'Douglas D

In [469]:
text = '''
Antonov AN-26B Antonov AN26 Antonov An26 Antonov An-26 Antonov 26BV Antonov 2R
'''

In [752]:
text = '''
Antonov AN-12B Antonov AN12 Antonov An12 Antonov An-12 Antonov 12BV Antonov 12R Antonov 12aljsaipfnaig
'''

In [531]:
pattern = 'Antonov A?N?n?-?12\w?\w?'

In [753]:
pattern = 'Antonov A?N?n?-?12\w?\w?[^|]*$'

In [461]:
pattern = 'Antonov A?\d?\d?[Nn]-?26\w?'

In [524]:
pattern = 'Antonov A?N?n?-?26\w?\w?'

In [748]:
pattern = 'Antonov A?N?n?-?24\w?\w?[^|]*$'

In [749]:
text = '''Antonov AN-24 / Soviet Air Force TU-16,  Antonov AN-24 / Yakovlev Yak-40'''

In [754]:
print(re.findall(pattern, text))

['Antonov AN-12B Antonov AN12 Antonov An12 Antonov An-12 Antonov 12BV Antonov 12R Antonov 12aljsaipfnaig\n']


<br><br><br><h2>We're done with the basic cleaning, now we will export this dataframe into a clean dataset.<h2>

In [164]:
>>> pd.set_option("display.max_rows", None)

airplanes.to_csv('../data/cleaned/airplanes_clean.csv', index = False)