# I. Machine Learning Readability

__Source:__ https://www.kaggle.com/learn/machine-learning-explainability

The following techniques help you interpret a model's results and work out which features drive its predictions.

## 1. Permutation Importance

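Permutation Importance is a simple way of identifying the features a fitted model relies on most; the example below uses the `eli5` library, which wraps sklearn models. Each feature's column is shuffled in turn, and the resulting drop in model score is that feature's importance. This is good if the data isn't intuitive or you don't have proper labels on your data: you can work with the most important features and play with them to get even better features.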

Permutation importances are done __after a model has been fit!__

In [None]:
# obtain feature importance via the "eli5" library (it wraps sklearn models)

import eli5
from eli5.sklearn import PermutationImportance
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

base_features = ['pickup_longitude',
                 'pickup_latitude',
                 'dropoff_longitude',
                 'dropoff_latitude',
                 'passenger_count']

# fit model first! (assumes X and y are already loaded, with X holding the base_features columns)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
first_model = RandomForestRegressor(n_estimators=30, 
                                    random_state=1)
first_model.fit(train_X, train_y)

# PermutationImportance has its own .fit(), run here on the validation data,
# but the underlying model must already be fitted (done above)

perm = PermutationImportance(first_model, random_state=1).fit(val_X, val_y)

eli5.show_weights(perm, feature_names = base_features)
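
To make the idea concrete, here is a minimal from-scratch sketch of what a permutation importance calculation does, using only the standard library. It is not how `eli5` is implemented; `toy_model`, the random data, and the mean-absolute-error metric are all hypothetical stand-ins chosen for illustration.

```python
import random


def mean_abs_error(model, rows, targets):
    """Average absolute error of `model` over the given rows."""
    return sum(abs(model(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_features, seed=0):
    """Return, per feature, the increase in error caused by shuffling that column."""
    rng = random.Random(seed)
    baseline = mean_abs_error(model, rows, targets)
    importances = []
    for col in range(n_features):
        shuffled_col = [r[col] for r in rows]
        rng.shuffle(shuffled_col)
        # Rebuild the rows with only this one column shuffled.
        permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled_col)]
        importances.append(mean_abs_error(model, permuted, targets) - baseline)
    return importances


# Hypothetical "already fitted" model: leans heavily on feature 0,
# slightly on feature 1, and ignores feature 2 entirely.
def toy_model(row):
    return 10 * row[0] + 1 * row[1]


data_rng = random.Random(42)
rows = [tuple(data_rng.random() for _ in range(3)) for _ in range(200)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets, n_features=3)
print(scores)  # feature 0 has the largest score; feature 2 scores exactly 0
```

Shuffling a column the model ignores leaves the error unchanged, which is why an unimportant feature gets a score near zero.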

## 2. Partial Dependence Plots

Feature Importance above shows _what_ variables most affect predictions...but it doesn't show _how_ a feature affects a prediction.

That's where __Partial Dependence Plots__ come in. Like permutation importance, PD plots are calculated __after__ a model has been fit.

__How it Works:__ After the model has been fit, you repeatedly alter the value of _one feature_, holding everything else fixed, and see the change in predictions.
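
The procedure can be sketched by hand in a few lines. This is a simplified illustration, not the `pdpbox` implementation; `toy_model`, the rows, and the grid of values are hypothetical.

```python
def partial_dependence(model, rows, col, grid):
    """For each grid value, overwrite column `col` in every row and average the predictions."""
    curve = []
    for value in grid:
        altered = [r[:col] + (value,) + r[col + 1:] for r in rows]
        curve.append(sum(model(r) for r in altered) / len(altered))
    return curve


# Hypothetical fitted model: a step at feature 0 >= 2, plus a linear effect of feature 1.
def toy_model(row):
    return (1.0 if row[0] >= 2 else 0.0) + 0.5 * row[1]


rows = [(0, 1.0), (1, 2.0), (3, 3.0), (4, 2.0)]
grid = [0, 1, 2, 3]

curve = partial_dependence(toy_model, rows, col=0, grid=grid)
print(curve)  # [1.0, 1.0, 2.0, 2.0]
```

The curve jumps by 1.0 once feature 0 reaches 2, which is exactly the step built into the toy model; the other feature only shifts the average level.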

In [None]:
# EXAMPLE

from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(random_state=0, 
                                    max_depth=5, 
                                    min_samples_split=5).fit(train_X, train_y)


from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# Create the data that we will plot
# (feature_names is assumed to be the list of column names in val_X, defined earlier)
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=feature_names, feature='Goal Scored')

# plot it
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()

For the PDP Graph:
- The x axis shows the feature's values (price, goals scored, etc.).
- The y axis shows the change in prediction relative to the prediction at the baseline (leftmost) value. The shaded area is a confidence interval.

PDP Graphs only analyze one feature. However, you can change it to account for changes in _two_ features in a quasi-heatmap graph, where the heat is the degree of change. See below:

In [None]:
'''
The 2-feature PDP graph is similar to the regular PDP plot, except we use pdp_interact
instead of pdp_isolate and pdp_interact_plot instead of pdp_plot
'''
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1  =  pdp.pdp_interact(model=tree_model, dataset=val_X, model_features=feature_names, features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()

## 3. SHAP Values

__SHAP:__ "SHapley Additive exPlanations" breaks down a prediction and shows the impact of _each_ feature. The SHAP values of all features sum up to explain why my prediction was different from the baseline. For example, if the model predicted 0.7 and the baseline (the model's average output) is 0.5, the force plot produced below explains which features pushed the prediction higher and which features contributed to it being lower. Adding those feature influences (which total 0.2 here) to the baseline, you get 0.7.
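
The additive property can be checked on a tiny example. The sketch below computes exact Shapley values by brute force over every feature ordering; it is a simplification for illustration (the `shap` library uses faster algorithms and averages over a background dataset rather than a single reference row). `toy_model`, `background`, and `instance` are hypothetical.

```python
from itertools import permutations


def shapley_values(model, instance, background):
    """Average each feature's marginal contribution over all feature orderings.

    Features not yet "switched on" keep their value from `background`,
    a single reference row standing in for the baseline.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)
        prev = model(current)
        for feat in order:
            current[feat] = instance[feat]
            new = model(current)
            totals[feat] += new - prev
            prev = new
    return [t / len(orderings) for t in totals]


# Hypothetical model with one additive term and one interaction term.
def toy_model(x):
    return 0.5 + 0.1 * x[0] + 0.05 * x[1] * x[2]


background = [0, 0, 0]
instance = [1, 2, 1]

base_value = toy_model(background)   # baseline output
prediction = toy_model(instance)     # output for the row being explained
phi = shapley_values(toy_model, instance, background)

print(base_value, prediction, phi)
print(base_value + sum(phi))  # matches `prediction` up to floating-point rounding
```

Feature 0 receives its full additive effect, while the interaction between features 1 and 2 is split evenly between them.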

Good examples for looking into SHAP values are:
- A model says a bank shouldn't loan someone money, and the bank is legally required to provide a basis for the loan rejection.

- A healthcare provider wants to identify the factors that are driving the patient's risk of some disease, so that those factors can be targeted.

Example of SHAP value code below:

In [None]:
from sklearn.model_selection import train_test_split
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

from sklearn.ensemble import RandomForestClassifier
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

row_to_show = 5
data_for_prediction = val_X.iloc[row_to_show]  
# use 1 row of data here. Could use multiple rows if desired

data_for_prediction_array = data_for_prediction.values.reshape(1, -1)


my_model.predict_proba(data_for_prediction_array)

########################################################

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)
# Also DeepExplainer for Deep Learning models and
# KernelExplainer for all other models (although generally slower and offers only an approximation)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
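# For a binary classifier, shap_values (in the shap version used here) holds two arrays:
# the per-feature SHAP values for the negative outcome (index 0)
# and the per-feature SHAP values for the positive outcome (index 1)

# You could also pass a whole dataset (e.g. X_test) instead of "data_for_prediction"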

########################################################
# An easier way to understand the above SHAP value arrays is to visualize them
# Below is the code for that:

shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)

# can also do
shap.summary_plot(shap_values[1], val_X)

# NOTE: As stated above, shap_values has two arrays (one for outputting 0 and one for outputting 1).
# We are using the one for TRUE/1.


## 4. Using Permutation Importance, Partial Dependence Plots, and SHAP Values Together

- __Permutation Importance__ shows _which_ features are important.
- __SHAP Values__ show how _all_ features affect a given prediction.
- __Partial Dependence Plot__ shows how _a specific feature_ affects the model's predictions.

Combining SHAP values and dependence plots, we get a new kind of plot: __Dependence Contribution Plots.__ These plots use SHAP values as the y axis and the analyzed feature on the x axis. The general trend of the scatter will suggest the correlation (if any) of that feature with the outcome variable. (You can also color the scatter points for analysis of another variable).

In [2]:
# Contribution Plot example
# Do the below on X_test, ***after*** fitting the model!!!

import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(X)

# make plot.
shap.dependence_plot('Ball Possession %', shap_values[1], X, interaction_index="Goal Scored")
# HERE - Ball possession = feature analyzed in relation to its SHAP value
# Goal scored - the color-coded feature for extra-analysis


# KHAN ACADEMY NOTES: Algorithms

## Binary Search

Imagine trying to find a book in a list of 50 books, ordered numerically from 1 to 50. __Linear Search__ starts at Book \#1 and checks whether that book is the correct one; if not, it moves on to Book \#2, then to Book \#3, and so on.

In a linear search, the worst-case scenario is that Book \#50 is the correct book. That's 50 guesses! A basic alternative is to use __Binary Search.__ Binary search first checks Book \#25 and asks whether the correct book is higher or lower. If higher, it rules out Books 1 to 25, and its second guess is Book \#38, halfway between Books 26 and 50. The Binary Search algorithm repeats the high/low check until it finds the correct book.
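
A minimal sketch of binary search on the 50-book example, counting guesses as it goes. The function name and the choice to return the guess count alongside the index are illustrative, not part of the Khan Academy material.

```python
def binary_search(sorted_items, target):
    """Return (index of target or -1, number of guesses made)."""
    lo, hi = 0, len(sorted_items) - 1
    guesses = 0
    while lo <= hi:
        mid = (lo + hi) // 2          # halfway point of the remaining range
        guesses += 1
        if sorted_items[mid] == target:
            return mid, guesses
        if sorted_items[mid] < target:
            lo = mid + 1              # target is higher: discard the lower half
        else:
            hi = mid - 1              # target is lower: discard the upper half
    return -1, guesses                # target is not in the list


books = list(range(1, 51))            # Books #1 to #50
index, guesses = binary_search(books, 50)
print(index, guesses)                 # 49 6
```

Searching for Book \#50 checks Books 25, 38, 44, 47, 49, and 50 in turn, so it takes 6 guesses instead of the 50 a linear search would need.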

For a linear search, the maximum number of guesses for a list of n books is n. For a binary search, the maximum number of guesses is log<sub>2</sub>(n) rounded down, plus 1 (6 guesses for 50 books). As you can see, binary search needs _exponentially_ fewer guesses.
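
To see how differently the two worst cases grow, the sketch below counts worst-case guesses for a few list sizes. It assumes each wrong binary-search guess leaves at most half of the remaining items (rounded down) still in play; the function names are illustrative.

```python
def worst_case_linear(n):
    """Linear search may have to check every one of the n items."""
    return n


def worst_case_binary(n):
    """Count guesses when each miss leaves the larger half, floor(n / 2) items."""
    guesses = 0
    while n > 0:
        guesses += 1
        n //= 2
    return guesses


for n in (50, 1000, 1_000_000):
    print(n, worst_case_linear(n), worst_case_binary(n))
# 50 50 6
# 1000 1000 10
# 1000000 1000000 20
```

Multiplying the list size by 1000 adds only about 10 guesses to the binary search worst case.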

---

## Asymptotic Notation - Determining the Efficiency (Rate) of an Algorithm

Asymptotic notation describes _how fast_ the running time of an algorithm such as Binary Search grows as the input size n grows. There are three forms: Big-Theta, Big-O, and Big-Omega notation.

### Big Theta Notation

Used to give a _tight_ bound on runtime. If there are positive constants k1 and k2 such that (k1 x n) <= runtime function <= (k2 x n) _for every sufficiently large n_, then the runtime is sandwiched between k1 x n and k2 x n; before this sufficiently large n, the bounds need not hold. If the above inequality holds true, we say "the running time is Theta(n)." In layman's terms, this says "if the running time of the function f(n) is Theta(g(n)), then f(n) grows asymptotically _at the same rate_ as g(n)."
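
A quick numerical illustration of the sandwich, assuming a hypothetical runtime function of 6n + 8 steps and hand-picked constants k1 = 6, k2 = 7, and threshold n0 = 8. Checking a finite range is only evidence, not a proof; here the algebra (6n + 8 <= 7n exactly when n >= 8) confirms it.

```python
def runtime(n):
    """Hypothetical step count of some algorithm on an input of size n."""
    return 6 * n + 8


k1, k2, n0 = 6, 7, 8  # illustrative constants and "sufficiently large n" threshold


def within_theta_bounds(n):
    return k1 * n <= runtime(n) <= k2 * n


holds_after_n0 = all(within_theta_bounds(n) for n in range(n0, 10_000))
fails_before_n0 = [n for n in range(1, n0) if not within_theta_bounds(n)]

print(holds_after_n0)   # True
print(fails_before_n0)  # [1, 2, 3, 4, 5, 6, 7]
```

The bounds fail for small n and hold from n0 onward, which is why the definition only asks about sufficiently large n.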

### Big-O Notation

Big-Theta places the runtime between an upper-bound and a lower-bound function, but sometimes we can only bound the runtime from above. Big-O notation expressly identifies __only the upper bound function__: saying the runtime is O(n) means that it grows "at most" as fast as a constant multiple of n (pronounced "Big-O of n" or just "O of n").

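Although it gives less information than Big-Theta, a Big-O bound is a safer claim because it holds in all cases for a given function. For example, binary search can finish on its first guess, so its running time is not always Theta(log(n)), but it is always O(log(n)). In layman's terms, "If f(n)'s runtime is O(g(n)), then f(n) grows asymptotically _no faster than_ g(n)."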

__An example explaining the difference:__
For a binary search algorithm whose runtime is f(n) = k * log(n) + c, where k and c are constants:

We say that f(n) is Theta(log(n)) (f(n) grows at the same rate as log(n)).

We say that f(n) is O(log(n)), O(n), O(n^2), O(n^3), and O(2^n) (f(n) grows "no faster than" each of these functions).
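
One informal way to see the difference is to look at the ratio f(n) / g(n) as n grows. The sketch below assumes illustrative constants k = 3 and c = 5 and a base-2 logarithm; a ratio that settles near a positive constant suggests a Theta relationship, while a ratio that shrinks toward 0 suggests g(n) is only a loose upper bound.

```python
import math


def runtime(n, k=3, c=5):
    """Hypothetical binary-search-like runtime: k * log2(n) + c."""
    return k * math.log2(n) + c


sizes = [2 ** p for p in range(4, 21, 4)]  # 16, 256, 4096, 65536, 1048576

candidates = {
    "log2(n)": math.log2,
    "n": lambda n: n,
    "n^2": lambda n: n * n,
}

ratios = {
    name: [runtime(n) / g(n) for n in sizes]
    for name, g in candidates.items()
}

for name, values in ratios.items():
    print(name, [round(v, 6) for v in values])
```

Against log2(n) the ratio drifts down toward k = 3, while against n and n^2 it heads toward 0, so those larger functions are valid Big-O bounds but not Big-Theta bounds.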

### Big-Omega Notation

If Big-O is an upper bound and Big-Theta is an upper _and_ lower bound, then Big-Omega must be the lower bound: the runtime grows "_at least_ as fast as" the given rate (i.e., an asymptotic lower bound).

#### What's the deal with the above?

These notations give programmers bounds on runtime: Big-O says the runtime grows _at most_ this fast (an upper bound, typically quoted for the worst case), and Big-Omega says it grows _at least_ this fast (a lower bound). For example, Omega(1) means that the function takes _at least_ constant time; its runtime could very well grow faster (such as polynomially (n^2 or 3n^4) or exponentially (3.5^n or 2^(2n))).

---
## Sorting - Selection Sort

