#### Now let's try to predict the extent of damage.

CULL is defined in the FIA database user guide as "the percent of the cubic-foot volume in a live or dead tally tree
that is rotten or missing."

In [None]:
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Drop rows with missing CULL values: pre-burn CULL is a feature, post-burn CULL is the target
train_for_cull = train.dropna(axis=0, subset=["CULL_pre_burn", "CULL_post_burn"])

splits = 5
kfold = KFold(splits, random_state=216, shuffle=True)
# indicator_features = ["CULL_pre_burn", "DIA_pre_burn", "HT_pre_burn", "DRYBIO_AG_pre_burn",
#                       "ELEV", "SOFTWOOD", "YRS_SINCE_BURN", "NUM_BURNS", "BURN_AREA_TOTAL"]
indicator_features = ["CULL_pre_burn", "DRYBIO_AG_pre_burn", "ELEV", "SOFTWOOD", "YRS_SINCE_BURN", "BURN_AREA_TOTAL"]

# For each k, evaluate a scaled KNN regressor with 5-fold cross-validation
for k in range(2, 20):
    regressor_pipe = Pipeline([("Scaler", StandardScaler()),
                               ("KNN Regressor", KNeighborsRegressor(n_neighbors=k))])

    score_list = []
    for train_index, test_index in kfold.split(train_for_cull[indicator_features], train_for_cull["CULL_post_burn"]):
        t_train = train_for_cull.iloc[train_index]
        t_val = train_for_cull.iloc[test_index]

        regressor_pipe.fit(t_train[indicator_features], t_train["CULL_post_burn"])
        # score() returns R^2 on the held-out fold
        score_list.append(regressor_pipe.score(t_val[indicator_features], t_val["CULL_post_burn"]))

    print(f"{k} Neighbors scores: {score_list} \n Average score for {k} Neighbors: {sum(score_list)/len(score_list)} \n")

2 Neighbors scores: [-0.07088436488676653, -0.03321278000147476, -0.24458748974467648, -0.4825613727167164, -0.06990846090123459] 
 Average score for 2 Neighbors: -0.18023089365017375 

3 Neighbors scores: [-0.015149857704784608, -0.10981459752277534, -0.20195116940888913, -0.3182416494408127, -0.06487536344660039] 
 Average score for 3 Neighbors: -0.14200652750477244 

4 Neighbors scores: [0.004438737315258345, 0.015861552430050163, -0.14048850235388444, -0.33127220760280873, 0.11740778649537764] 
 Average score for 4 Neighbors: -0.0668105267432014 

5 Neighbors scores: [0.003278039185080628, 0.09490878944576364, -0.11468405643872903, -0.1947899435986875, 0.0986185356764131] 
 Average score for 5 Neighbors: -0.022533727146031836 

6 Neighbors scores: [0.02212216940282008, 0.13895653247086814, -0.11620092083625666, -0.15317237862092092, 0.13518282531630632] 
 Average score for 6 Neighbors: 0.005377645546563392 

7 Neighbors scores: [0.03058080799068541, 0.13626805768723527, -0.08697235

From sklearn's documentation: "The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a R^2 score of 0.0."
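
To make the quoted reference points concrete, here is a minimal sketch that computes R^2 by hand on a toy target. The function name `r2_score_manual` and the toy values are made up for illustration (they are not FIA data), and the function ignores edge cases such as a constant target that sklearn handles separately.

```python
def r2_score_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hand-rolled sketch of the usual definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # model's squared error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # squared error of the mean
    return 1.0 - ss_res / ss_tot


# Toy target that is mostly zeros (illustrative values only)
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 5, 15]
mean_y = sum(y_true) / len(y_true)  # 2.0

r2_mean = r2_score_manual(y_true, [mean_y] * len(y_true))  # always predict the mean
r2_zero = r2_score_manual(y_true, [0.0] * len(y_true))     # always predict 0
r2_perfect = r2_score_manual(y_true, y_true)               # predict every value exactly

print(f"mean predictor: {r2_mean:.3f}")
print(f"zero predictor: {r2_zero:.3f}")
print(f"perfect predictor: {r2_perfect:.3f}")
```

The mean predictor lands at 0, the perfect predictor at 1, and the all-zeros predictor comes out negative because it is worse than predicting the mean. That is how to read the negative fold scores above.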

Noteworthy scores (average R^2 across the 5 folds):
- Almost every average score I calculated was under 0.1, which indicates that KNN is not an effective method for predicting CULL.
- 0.083 using all features and 17 neighbors
- 0.082 using ["CULL_pre_burn", "ELEV", "SOFTWOOD", "YRS_SINCE_BURN", "BURN_AREA_TOTAL"] and 17 neighbors
    - adding DIA or HT made it worse
- 0.101 using ["CULL_pre_burn", "DRYBIO_AG_pre_burn", "ELEV", "SOFTWOOD", "YRS_SINCE_BURN", "BURN_AREA_TOTAL"] and 18 neighbors

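At least it does better than the mean regressor at the larger values of k, where the average score is above 0. At small k the average is negative, i.e. worse than always predicting the mean. This may be a difficult task because so many entries in the CULL field are 0, even after burn: about 87% of the post-burn values. See below: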

In [None]:
train_for_cull["CULL_post_burn"].value_counts(normalize = True).sort_index()

CULL_post_burn
0.0     0.871755
1.0     0.047221
2.0     0.013765
3.0     0.005576
4.0     0.001742
5.0     0.022129
6.0     0.001220
7.0     0.001220
8.0     0.001568
9.0     0.000174
10.0    0.009061
12.0    0.000523
15.0    0.006447
16.0    0.000174
18.0    0.000523
19.0    0.000174
20.0    0.004356
22.0    0.000174
25.0    0.002788
30.0    0.002614
35.0    0.000871
40.0    0.000871
45.0    0.000348
50.0    0.000697
55.0    0.000348
60.0    0.000523
65.0    0.000174
80.0    0.000523
90.0    0.000697
95.0    0.000523
99.0    0.001220
Name: proportion, dtype: float64

Human bias may also be playing a part: the recorded values show a pronounced tendency towards multiples of 5, which suggests that estimates are often rounded. The database documentation doesn't give much detail about *how* field crews estimate CULL, either.
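
The tendency towards multiples of 5 can be quantified directly from the proportions printed above. The sketch below copies those proportions into a dictionary and measures how much of the distribution sits on multiples of 5. The helper name `share_multiple_of_5` is made up for this example, and the "one in five" reference point assumes values would otherwise be spread evenly, which is only a rough yardstick.

```python
# Proportions copied from the value_counts output above (CULL value -> share of trees)
cull_share = {
    0: 0.871755, 1: 0.047221, 2: 0.013765, 3: 0.005576, 4: 0.001742,
    5: 0.022129, 6: 0.001220, 7: 0.001220, 8: 0.001568, 9: 0.000174,
    10: 0.009061, 12: 0.000523, 15: 0.006447, 16: 0.000174, 18: 0.000523,
    19: 0.000174, 20: 0.004356, 22: 0.000174, 25: 0.002788, 30: 0.002614,
    35: 0.000871, 40: 0.000871, 45: 0.000348, 50: 0.000697, 55: 0.000348,
    60: 0.000523, 65: 0.000174, 80: 0.000523, 90: 0.000697, 95: 0.000523,
    99: 0.001220,
}


def share_multiple_of_5(shares, minimum):
    """Among values >= minimum, the fraction of trees whose value is a multiple of 5."""
    total = sum(p for v, p in shares.items() if v >= minimum)
    on_fives = sum(p for v, p in shares.items() if v >= minimum and v % 5 == 0)
    return on_fives / total


share_nonzero = share_multiple_of_5(cull_share, minimum=1)  # all nonzero values
share_from_5 = share_multiple_of_5(cull_share, minimum=5)   # values of 5 and above

print(f"multiples of 5 among nonzero values: {share_nonzero:.1%}")
print(f"multiples of 5 among values >= 5:    {share_from_5:.1%}")
```

Among values of 5 and above, roughly 88% fall on a multiple of 5, far more than the one in five an even spread would give. The small values 1 to 4 pull the share among all nonzero values down to about 41%.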