In [1]:
import pandas as pd

# Load dataset
### The dataset comes from the previous assignment, in a nice, cleaned version

In [2]:
df = pd.read_csv('JO_pivoted.csv')
df.drop(["Unnamed: 0"], axis=1, inplace=True)
df

Unnamed: 0,region,year,barley,energy forest,fallow land,"field peas for cooking, fodder peas, vetches and field beans",green fodder,green peas,horticulture plants,ley for hay and forage plants,...,triticale,unspecified arable land,utilized ley for hay,utilized ley for hay and pasture,utilized pasture,white beans,winter barley,winter rape,winter turnip rape,winter wheat
0,0114 Upplands Väsby,1981,500.0,0.0,179.0,0.0,43.0,0.0,0.0,0.0,...,0.0,0.0,0.0,229.0,0.0,0.0,0.0,0.0,0.0,80.0
1,0114 Upplands Väsby,1985,586.0,0.0,30.0,11.0,63.0,0.0,0.0,0.0,...,0.0,0.0,0.0,201.0,0.0,0.0,0.0,0.0,0.0,40.0
2,0114 Upplands Väsby,1989,264.0,0.0,124.0,22.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,275.0,0.0,0.0,0.0,0.0,14.0,477.0
3,0114 Upplands Väsby,1990,213.0,0.0,57.0,38.0,2.0,0.0,0.0,0.0,...,0.0,0.0,0.0,213.0,0.0,0.0,0.0,0.0,2.0,520.0
4,0114 Upplands Väsby,1991,328.0,0.0,91.0,4.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,217.0,0.0,0.0,0.0,0.0,6.0,180.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4055,2584 Kiruna,1999,0.0,0.0,17.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,272.0,0.0,0.0,0.0,0.0,0.0,0.0
4056,2584 Kiruna,2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,109.0,0.0,151.0,0.0,0.0,0.0,0.0,0.0,0.0
4057,2584 Kiruna,2002,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,90.0,0.0,140.0,0.0,0.0,0.0,0.0,0.0,0.0
4058,2584 Kiruna,2003,0.0,0.0,15.0,0.0,0.0,0.0,0.0,0.0,...,0.0,69.0,0.0,143.0,0.0,0.0,0.0,0.0,0.0,0.0


# Feature Importance

Not all features are guaranteed to be useful for a model, so let's find out which ones matter the most (and the least!).

This will be the same Decision Tree Classifier model as before.

In [3]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = df.drop(["region"], axis=1)
y = df["region"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1337)

clf = DecisionTreeClassifier(random_state=1337)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

0.7068965517241379

The base model is done, so let's get to work on feature importance.

In [4]:
from sklearn.inspection import permutation_importance

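# Shuffle each feature 50 times on the test set and record the mean drop in accuracy
pi = permutation_importance(clf, X_test, y_test, n_repeats=50, random_state=1337)
# Column 0 holds the feature name (used as the index), column 1 the mean importance
df_fi = pd.DataFrame([X.columns, pi['importances_mean']]).T.set_index(0)
df_fi = df_fi.sort_values(by=1, ascending=False)
df_fi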

0,1
total arable land,0.652808
table potatoes,0.304778
oats,0.291034
winter wheat,0.179286
sugar beets,0.145517
ley for seeds,0.078695
potatoes for processing of starch,0.074532
fallow land,0.057709
spring rape,0.055764
utilized ley for hay and pasture,0.037956


Not surprisingly, the total arable land is highly telling of which region it is. Interestingly, the 'year' feature is so bad that it goes beyond unimportant: it is actively hurting the performance of the model.
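To make the mechanics concrete, here is a minimal hand-rolled sketch of what permutation importance measures: shuffle one column of the test set, re-score, and record how much accuracy drops. The synthetic data, the `manual_permutation_importance` helper, and its parameters are assumptions for illustration; this is not the notebook's dataset, nor scikit-learn's actual implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def manual_permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    baseline = accuracy_score(y, model.predict(X))
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between this column and the labels
            X_shuffled[:, col] = rng.permutation(X_shuffled[:, col])
            drops.append(baseline - accuracy_score(y, model.predict(X_shuffled)))
        importances[col] = np.mean(drops)
    return importances

# Synthetic data (assumption): column 0 fully determines the label, column 1 is noise
rng = np.random.default_rng(1337)
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = DecisionTreeClassifier(random_state=1337).fit(X_demo[:300], y_demo[:300])
demo_importances = manual_permutation_importance(model, X_demo[300:], y_demo[300:])
print(demo_importances)
```

A negative importance means accuracy went up after shuffling, which is what "actively hurting" refers to for a feature such as 'year'.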

Let's train a model on the two best features and see how it does.

In [5]:
feats = ["total arable land", "table potatoes"]
# Note: this overwrites the full-feature splits, so later cells re-split from X
X_train, X_test = X_train[feats], X_test[feats]
clf.fit(X_train, y_train)
y_pred_good = clf.predict(X_test)
accuracy_score(y_test, y_pred_good)

0.43349753694581283

So accuracy drops from about 71% to 43% with only 2 features, but still, 43% accuracy on 290 classes is pretty good.

Let's do the same, but with the worst 2 features. Since there is a three-way tie for second-worst feature, let's just select one of them arbitrarily.

In [6]:
# Re-split from the full X, since the previous cell narrowed X_train/X_test to two columns
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1337)
feats = ["year", "green peas"]
X_train, X_test = X_train[feats], X_test[feats]
clf.fit(X_train, y_train)
y_pred_bad = clf.predict(X_test)
accuracy_score(y_test, y_pred_bad)

0.0012315270935960591

Surprise, surprise: training the model on the worst 2 features yielded an accuracy that is literally worse than a random guess.

In [7]:
print(1/290)

0.0034482758620689655
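Because accuracy is just correct predictions divided by test rows, the scores so far can be converted back into counts. The sketch below assumes the frame has 4059 rows (the displayed index runs from 0 to 4058) and that the 20% test share is rounded up to 812 rows; the variable names are illustrative and not part of the notebook.

```python
import math

n_rows = 4059                       # assumption: index 0..4058 in the displayed frame
n_test = math.ceil(0.2 * n_rows)    # test rows for a 20% split, rounded up
n_classes = 290

accuracies = {
    "all features": 0.7068965517241379,
    "top 2 features": 0.43349753694581283,
    "worst 2 features": 0.0012315270935960591,
}

chance = 1 / n_classes
for name, acc in accuracies.items():
    correct = round(acc * n_test)   # accuracy * test rows = correct predictions
    print(f"{name}: {correct}/{n_test} correct, {acc / chance:.1f}x chance")
```

Under these assumptions, the worst-2 model got exactly one test row right.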


Just for fun, let's train on the 2 median features and see how it does.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=1337)
feats = ["triticale", "mixed grain"]
X_train, X_test = X_train[feats], X_test[feats]
clf.fit(X_train, y_train)
y_pred_median = clf.predict(X_test)
accuracy_score(y_test, y_pred_median)

0.020935960591133004

Obviously not nearly as good as the top 2 features, but at about 2% it is still roughly six times better than a random guess.
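The best, worst, and median feature pairs above were read off the sorted importance table by hand. A small sketch of doing the same by position follows; the `importances` values and feature names are made up for illustration, and in the notebook the input would be the sorted `df_fi`.

```python
import pandas as pd

# Hypothetical importances (not the notebook's values), already sorted descending like df_fi
importances = pd.Series(
    {"a": 0.60, "b": 0.30, "c": 0.10, "d": 0.0, "e": 0.0, "f": -0.01}
)

mid = len(importances) // 2

top_two = importances.index[:2].tolist()
bottom_two = importances.index[-2:].tolist()
# Two entries straddling the middle of the ranking
median_two = importances.index[mid - 1 : mid + 1].tolist()

# Ties make the choice of second-worst feature arbitrary, so it helps to list them
tied_with_second_worst = importances[importances == importances.iloc[-2]].index.tolist()

print(top_two, bottom_two, median_two, tied_with_second_worst)
```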

What about always guessing the majority class? Let's do some code magic to see what the accuracy is when just making a fixed guess for what region it is.

In [9]:
value_counts = y_train.value_counts()

value_counts_df = pd.DataFrame({
    'Region': value_counts.index,
    'Count': value_counts.values
})

value_counts_df

Unnamed: 0,Region,Count
0,1276 Klippan,14
1,1435 Tanum,14
2,0562 Finspång,14
3,1230 Staffanstorp,14
4,1884 Nora,14
...,...,...
285,1864 Ljusnarsberg,8
286,2262 Timrå,8
287,0483 Katrineholm,7
288,2580 Luleå,7


In [10]:
results_list = []

for region in value_counts_df["Region"]:
    # Predict the same region for every test row
    y_majority = [region] * len(y_test)
    acc = accuracy_score(y_test, y_majority)

    results_list.append({
        'Region': region,
        'Accuracy': acc
    })

results_df = pd.DataFrame(results_list)

merged_df = pd.merge(value_counts_df, results_df, on="Region")
merged_df.head(20)

Unnamed: 0,Region,Count,Accuracy
0,1276 Klippan,14,0.0
1,1435 Tanum,14,0.0
2,0562 Finspång,14,0.0
3,1230 Staffanstorp,14,0.0
4,1884 Nora,14,0.0
5,0563 Valdemarsvik,14,0.0
6,2584 Kiruna,14,0.0
7,2161 Ljusdal,14,0.0
8,0181 Södertälje,14,0.0
9,2260 Ånge,14,0.0


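It turns out that every one of the majority classes results in exactly 0% accuracy. A likely explanation is that a region with 14 rows in the training set has none left over for the test set, so a fixed guess of that region can never be right. It isn't until the second-largest classes that the accuracy rises above 0.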
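The accuracy of always guessing one region equals that region's share of the test set, so the loop above can be cross-checked without looping. The sketch below uses toy labels (an assumption, not the notebook's data) and compares the result against scikit-learn's `DummyClassifier` with `strategy="most_frequent"`, which always predicts the most common training label.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): "a" dominates training but is rare in the test set
y_train_demo = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 2)
y_test_demo = pd.Series(["b", "b", "c", "a"])

train_counts = y_train_demo.value_counts().rename("Count")
# Share of each label in the test set = accuracy of always guessing that label
test_share = y_test_demo.value_counts(normalize=True).rename("Accuracy")

summary = pd.concat([train_counts, test_share], axis=1).fillna(0.0)
summary.index.name = "Region"
print(summary)

# DummyClassifier ignores the features, so a placeholder column is enough
X_train_demo = [[0]] * len(y_train_demo)
X_test_demo = [[0]] * len(y_test_demo)
dummy = DummyClassifier(strategy="most_frequent").fit(X_train_demo, y_train_demo)
dummy_accuracy = dummy.score(X_test_demo, y_test_demo)
print(dummy_accuracy)
```

In this toy case the most frequent training label makes up only a quarter of the test set, so the majority-class baseline scores lower than guessing the second most frequent label would.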