### Common data science packages in Python ###
| Import name	| Common alias	| Description |
| :------------ | :------------ | :---------- |
| numpy	| np	| NumPy includes functions and classes that aid in numerical computation. NumPy is used in many other data science packages. |
| pandas | pd	| pandas provides methods and classes for tabular and time-series data. |
| sklearn	| sk	| scikit-learn provides implementations of many machine learning algorithms with a uniform syntax for preprocessing data, specifying models, fitting models with cross-validation, and assessing models. |
| matplotlib.pyplot | plt | matplotlib allows the creation of data visualizations in Python. The functions mostly expect NumPy arrays. |
| seaborn | sns | seaborn also allows the creation of data visualizations but works better with pandas DataFrames. |
| scipy.stats | sp.stats | SciPy provides algorithms and functions for solving computational problems that arise in science, engineering, and statistics. scipy.stats provides the statistical functions. |
| statsmodels	| sm	| statsmodels adds functionality to Python to estimate many different kinds of statistical models, make inferences from those models, and explore data. |



In [1]:
# Import packages

## Most coding styles place package imports at the top of the notebook.
## That way, a missing package surfaces immediately rather than after much of the notebook has run.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn.model_selection import train_test_split

## Load dataset

In [3]:
# Load the hawks dataset
hawks = pd.read_csv('hawks.csv')

This dataset includes data about hawks at a nature preserve in Iowa.
 * Species - CH: Cooper's Hawk, SS: Sharp-shinned Hawk, RT: Red-tailed Hawk
 * Age - A: Adult, I: Immature
 * Wing - Length of the primary wing feather in mm
 * Weight - Body weight in g
 * Culmen - Length along the top of the bill from tip to face in mm
 * Hallux - Length in mm of the killing talon
 * Tail - Aggregate measurement related to the length of the tail in mm

In [4]:
# Calculate summary statistics using .describe()
hawks.describe(include='all')

In [None]:
# Visualize the relationship between each pair of numerical variables
sns.pairplot(data=hawks, hue='Species')

In [None]:
# Calculate the mean of each feature for each species.

## pandas makes calculating group statistics easy with the groupby method.
hawks.groupby('Species').mean(numeric_only=True)
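
To see what the grouped mean returns without needing hawks.csv, here is a minimal sketch on a toy table. The values and the `toy` name are made up for illustration and are not taken from the hawks data.

```python
import pandas as pd

# Made-up measurements for illustration only; not from hawks.csv.
toy = pd.DataFrame({
    "Species": ["CH", "CH", "SS", "SS"],
    "Age": ["A", "I", "A", "I"],
    "Wing": [240.0, 260.0, 170.0, 190.0],
})

# numeric_only=True restricts the result to numeric columns,
# so the text column Age is left out of the means.
group_means = toy.groupby("Species").mean(numeric_only=True)
print(group_means)
```

The result has one row per species and one column per numeric feature, which is the same shape the hawks cell above produces.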

In [None]:
# Calculate the maximum of each feature for each species.
hawks.groupby('Species').max(numeric_only=True)

In the pairplot and the grouped summary statistics, the maximum Hallux length is 341 mm, which is over a foot long! Some Hallux values likely have data-entry errors that put the decimal point in the wrong place.
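
The unit check behind that claim is quick arithmetic. The sketch below assumes only the 341 mm maximum quoted above and the standard 25.4 mm per inch conversion.

```python
MM_PER_INCH = 25.4
INCHES_PER_FOOT = 12

max_hallux_mm = 341  # maximum quoted in the text above

# Convert to inches and compare with one foot.
inches = max_hallux_mm / MM_PER_INCH
print(f"{inches:.1f} inches, more than a foot: {inches > INCHES_PER_FOOT}")

# Shifting the decimal point one place left gives the value
# that a dividing-by-10 correction would produce.
shifted_mm = max_hallux_mm / 10
print(f"{shifted_mm} mm after shifting the decimal point")
```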

In [None]:
# Plot the distribution of hallux length to determine a good cutoff
sns.histplot(data=hawks, x='Hallux', hue='Species')

In [None]:
# Remap the extreme outliers
## Adjust the cutoff to remap obviously wrong Hallux lengths to the correct units.
## Change this value and rerun this cell and the cell below until happy with the results.
cutoff = 200
hawks.loc[hawks['Hallux'] > cutoff, 'Hallux'] = (
    hawks.loc[hawks['Hallux'] > cutoff, 'Hallux'] / 10
)
# Drop rows with missing values. dropna() returns a new DataFrame, so reassign it.
hawks = hawks.dropna()
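
The remapping pattern can be tried on a toy column first. This is a sketch with made-up values; the `toy` and `too_long` names are illustrative only.

```python
import pandas as pd

# Made-up values: two plausible lengths, one suspect entry, one missing value.
toy = pd.DataFrame({"Hallux": [14.5, 30.2, 341.0, None]})

cutoff = 200
too_long = toy["Hallux"] > cutoff  # comparisons with NaN are False

# Divide only the flagged rows by 10; other rows are untouched.
toy.loc[too_long, "Hallux"] = toy.loc[too_long, "Hallux"] / 10

# dropna() returns a new DataFrame, so the result must be assigned.
toy = toy.dropna()
print(toy)
```

Building the boolean mask once and reusing it on both sides of the assignment keeps the selection consistent and avoids evaluating the comparison twice.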

In [None]:
# Plot to see if all obvious outliers have been remapped.
sns.pairplot(data=hawks, hue='Species')

In [None]:
# Use everything but species to predict species
X = hawks.drop('Species', axis=1)
y = hawks['Species']

# Encode Age as a dummy variable.
X = pd.get_dummies(X, drop_first=True)

# Create a training/testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=20220705
)
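
To make the encoding and split steps concrete, here is a small sketch on made-up data. The `toy` table and `random_state=0` are assumptions for illustration, not values from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up table with one categorical feature, one numeric feature, and a label.
toy = pd.DataFrame({
    "Age": ["A", "I"] * 5,
    "Wing": [float(w) for w in range(200, 210)],
    "Species": ["CH", "SS"] * 5,
})

X = toy.drop("Species", axis=1)
y = toy["Species"]

# drop_first=True keeps one indicator column (Age_I) instead of two,
# because Age_A is implied whenever Age_I is 0 or False.
X = pd.get_dummies(X, drop_first=True)
print(list(X.columns))

# Hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
print(len(X_train), len(X_test))
```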

In [None]:
# Use a random forest model from scikit-learn
species_rf = RandomForestClassifier(
    max_depth=1,  # Change this value and run this cell and all those below.
    n_estimators=10,
    random_state=20220706,
)
species_rf.fit(X_train, y_train)  # Fit the model on the training set

In [None]:
# Make predictions for the test set.
y_pred = species_rf.predict(X_test)

In [None]:
# Plot the confusion matrix to check how the model did
# using a function from scikit-learn's metrics subpackage.
# The numbers of correct classifications appear on the diagonal.

## Does changing max_depth lead to more correct classifications?
metrics.ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
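
One way to explore the max_depth question without rerunning cells by hand is to loop over a few depths and count the diagonal of each confusion matrix. The sketch below uses scikit-learn's bundled iris data as a stand-in because hawks.csv may not be available. The depths tried and `random_state=0` are arbitrary choices, and results on the hawks data may differ.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in dataset (iris), not the hawks data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

correct_by_depth = {}
for depth in (1, 2, 4):
    rf = RandomForestClassifier(max_depth=depth, n_estimators=10, random_state=0)
    rf.fit(X_train, y_train)
    cm = confusion_matrix(y_test, rf.predict(X_test))
    # The diagonal holds the correct classifications, so its sum is the count of hits.
    correct_by_depth[depth] = int(np.trace(cm))
    print(f"max_depth={depth}: {correct_by_depth[depth]} of {int(cm.sum())} correct")
```

Dividing each count by the test-set size gives accuracy, which makes runs with different split sizes comparable.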