# Part 1: Decision Tree

## Dataset
In this exercise we will use a dataset containing technical data for various aircraft, obtained from the Aircraft Performance Database at https://contentzone.eurocontrol.int/aircraftperformance/
Retrieving raw data from websites is not part of the workshop, so the dataset is provided in JSON format in the *data* directory.

We can use the pandas `read_json` function to load the dataset from the JSON file into a DataFrame.

Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/

In [None]:
import pandas as pd

In [None]:
df = pd.read_json('data/aircrafts.json')

To display the data types of the DataFrame columns

In [None]:
print(df.dtypes)

To display the first *n* rows

In [None]:
print(df.head(3))

To get the number of rows, or both the number of rows and the number of columns (shape)

In [None]:
print(len(df))
print(df.shape)

To get a column by name (attribute access and single brackets return a Series, a list of names returns a DataFrame)

In [None]:
col1 = df.wing_span
col2 = df["wing_span"]
col3 = df[["wing_span"]]
type(col1), type(col2), type(col3)

To get the unique values of a column

In [None]:
print(df.APC.unique())
print(df.type_code.unique())

#### First model
Take a look at the WTC and MTOW columns.
The WTC (Wake Turbulence Category) value is based on MTOW (Maximum Take-Off Weight), so a decision tree model trained on these columns should learn rules similar to the definition of the WTC categories:

* H (Heavy) aircraft types of 136 000 kg (300 000 lb) or more;
* M (Medium) aircraft types less than 136 000 kg (300 000 lb) and more than 7 000 kg (15 500 lb); and
* L (Light) aircraft types of 7 000 kg (15 500 lb) or less.
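
The category definition above can be written down directly as a rule, which is useful as a reference when inspecting the thresholds the trained tree learns. This is a minimal sketch; the function name `wtc_from_mtow` is hypothetical and not part of the dataset or of any library.

```python
def wtc_from_mtow(mtow_kg):
    """Return the WTC letter for a maximum take-off weight given in kilograms."""
    if mtow_kg >= 136000:
        return 'H'  # Heavy: 136 000 kg or more
    if mtow_kg > 7000:
        return 'M'  # Medium: more than 7 000 kg and less than 136 000 kg
    return 'L'      # Light: 7 000 kg or less


# Values on both sides of each boundary (illustrative weights, not dataset rows).
for weight in (6000, 7000, 7001, 135999, 136000):
    print(weight, wtc_from_mtow(weight))
```

A well-trained tree should split MTOW close to 7 000 kg and 136 000 kg, although the exact thresholds depend on which weights occur in the training data.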

To get a new DataFrame that contains only these two columns

In [None]:
X = df[['MTOW', 'WTC']]

To filter out all rows that have at least one missing value in any column

In [None]:
X = X.dropna()
len(X)

To train a model, we must split our data into separate arrays of features and class labels.

In [None]:
y = X['WTC']
X = X.drop(labels=['WTC'], axis=1)

In order to measure the quality of our model we can divide both arrays into training and testing data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

To create a model object

In [None]:
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(max_depth=3)

To train the model using the training part of the data

In [None]:
model.fit(X_train, y_train)

To use the trained model to classify new data

In [None]:
# predict expects 2D input: one row per sample and one column per feature
model.predict(pd.DataFrame({'MTOW': [6000]}))

To score the trained model using the testing part of the data

In [None]:
model.score(X_test, y_test)
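
For a classifier, `score` returns the mean accuracy, that is, the fraction of test samples whose predicted label equals the true label. The sketch below computes the same quantity by hand on made-up labels; the helper name `accuracy` and the example lists are assumptions for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label matches the true label."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy of empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Made-up labels: three of the four predictions are correct.
example_true = ['L', 'M', 'M', 'H']
example_pred = ['L', 'M', 'H', 'H']
example_accuracy = accuracy(example_true, example_pred)
print(example_accuracy)
```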

Compare a few models with different values of the maximum depth parameter. Watch how the test score changes: once extra depth stops improving it, a deeper tree only risks overfitting.

In [None]:
for n in range(1, 5):
    other_model = DecisionTreeClassifier(max_depth=n)
    other_model.fit(X_train, y_train)
    print(n, other_model.score(X_test, y_test))

To visualize the tree, export it to a DOT file and paste the file's contents into an online Graphviz viewer such as http://webgraphviz.com/

For offline tools that render the tree graph, see Graphviz (http://www.graphviz.org/) or the _pydotplus_ Python package

In [None]:
from sklearn.tree import export_graphviz
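# Pass the fitted estimator itself; current scikit-learn versions do not accept model.tree_ here
export_graphviz(model, out_file='wtc_aircrafts_tree.dot', feature_names=list(X.columns), class_names=list(model.classes_))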
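
If no Graphviz renderer is at hand, scikit-learn's `export_text` prints the learned rules as plain text. The sketch below is self-contained and trains a toy tree on synthetic MTOW values labelled with the WTC definition; the weights, the `wtc_label` helper and the `toy_model` name are assumptions for illustration, not data from the workshop dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic MTOW values in kg (made up for illustration, not taken from the dataset).
mtow_values = [1000, 3000, 5000, 7000, 9000, 20000, 70000, 130000, 140000, 250000, 400000]


def wtc_label(mtow_kg):
    """Label a weight according to the WTC definition."""
    if mtow_kg >= 136000:
        return 'H'
    if mtow_kg > 7000:
        return 'M'
    return 'L'


features = [[m] for m in mtow_values]        # 2D: one row per sample, one feature
labels = [wtc_label(m) for m in mtow_values]

# Two thresholds on a single feature are enough to separate three ordered classes.
toy_model = DecisionTreeClassifier(max_depth=2, random_state=0)
toy_model.fit(features, labels)

# Render the fitted tree as indented if/else style rules.
tree_text = export_text(toy_model, feature_names=['MTOW'])
print(tree_text)
```

The printed thresholds fall between neighbouring synthetic weights on either side of each category boundary. The same call should work on the notebook's `model` with `feature_names=list(X.columns)`.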

#### Second model
Based on visual characteristics, let's try to classify each aircraft into an approach category (to get a clue about its landing speed, necessary field length, and weight)

First we need to select the features and class labels (APC), and then drop any rows with missing values

In [None]:
labels = ['length', 'height', 'wing_span', 'Ceiling', 'type_code', 'Range', 'APC']
X = df[labels]
X = X.dropna()
print(X.dtypes)

The decision tree classifier can process only numeric features, so we need to convert Ceiling and Type code.

First, strip the two-character flight level prefix from the Ceiling feature and convert the remainder to a number.

In [None]:
X['Ceiling'] = pd.to_numeric(X['Ceiling'].apply(lambda fl: fl[2:]))
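
The slice `fl[2:]` silently assumes that every Ceiling value starts with a two-character prefix. A slightly more defensive version of the same conversion is sketched below, assuming the values look like `'FL410'`; the sample value and the helper name `flight_level_to_int` are hypothetical.

```python
def flight_level_to_int(ceiling):
    """Convert a flight level string such as 'FL410' to the integer 410."""
    prefix = 'FL'
    if not ceiling.startswith(prefix):
        raise ValueError(f"unexpected ceiling format: {ceiling!r}")
    return int(ceiling[len(prefix):])


# Assumed sample value; a flight level counts hundreds of feet, so FL410 is about 41 000 ft.
sample_ceiling = flight_level_to_int('FL410')
print(sample_ceiling)
```

Raising an error on unexpected input makes malformed rows visible instead of turning them into wrong numbers.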

We can split Type code into three columns (first character, number of engines, third character) and then convert the non-numeric columns into dummy variables

In [None]:
X['type_1'] = X['type_code'].apply(lambda s: s[0])
X['num_engines'] = pd.to_numeric(X['type_code'].apply(lambda s: s[1]))
X['type_2'] = X['type_code'].apply(lambda s: s[2])
X = pd.get_dummies(X, columns=['type_1', 'type_2'])
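
To make the transformation concrete, the sketch below splits a few type codes and one-hot encodes the first character in plain Python. It illustrates the idea behind dummy variables and is not how pandas implements `get_dummies`; the sample codes and the helper names `split_type_code` and `one_hot` are assumptions.

```python
def split_type_code(type_code):
    """Split a three-character type code into its three parts."""
    if len(type_code) != 3:
        raise ValueError(f"expected three characters, got {type_code!r}")
    return {
        'type_1': type_code[0],
        'num_engines': int(type_code[1]),
        'type_2': type_code[2],
    }


def one_hot(value, categories, prefix):
    """Return one 0/1 indicator per category, named like pandas dummy columns."""
    return {f'{prefix}_{category}': int(value == category) for category in categories}


# Assumed sample codes, for illustration only.
sample_codes = ['L2J', 'L4J', 'H1T']
rows = [split_type_code(code) for code in sample_codes]

# The set of categories is taken from the data, as get_dummies does.
type_1_categories = sorted({row['type_1'] for row in rows})
encoded = [
    {**one_hot(row['type_1'], type_1_categories, 'type_1'), 'num_engines': row['num_engines']}
    for row in rows
]
for row in encoded:
    print(row)
```

Each categorical value becomes a set of indicator columns, while the engine count stays a single numeric column because its values are ordered.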

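Alternatively, we could convert Type code directly to dummy variables

In [None]:
# Alternative: encode the whole type code at once (get_dummies then removes
# the original column, so the drop below would no longer be needed)
#X = pd.get_dummies(X, columns=['type_code'])
# The original string column is no longer needed once it has been split
X = X.drop(labels=['type_code'], axis=1)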

In [None]:
y = X['APC']
X = X.drop(labels=['APC'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
for n in range(1, 11):
    model = DecisionTreeClassifier(max_depth=n)
    model.fit(X_train, y_train)
    print(n, model.score(X_test, y_test))

Again, we can visualize the trained tree model.

In [None]:
model = DecisionTreeClassifier(max_depth=3)
model.fit(X_train, y_train)
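# Pass the fitted estimator itself; current scikit-learn versions do not accept model.tree_ here
export_graphviz(model, out_file='apc_aircrafts_tree.dot', feature_names=list(X.columns), class_names=list(model.classes_))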