In [1]:
# ! wget https://github.com/koaning/wow-avatar-datasets/raw/main/wow-full.parquet

In [2]:
import pandas as pd

In [3]:
df = pd.read_parquet("wow-full.parquet").sort_values('datetime').head(1_000_000)
df['where'] = df['where'].astype(str)

In [4]:
df.head()

| | player_id | guild | level | race | class | where | datetime |
|---|---|---|---|---|---|---|---|
| 6862733 | 0 | | 5 | Orc | Warrior | Durotar | 2005-12-31 23:59:46 |
| 6862734 | 1 | | 9 | Orc | Shaman | Durotar | 2005-12-31 23:59:46 |
| 6862738 | 5 | | 16 | Orc | Hunter | The Barrens | 2005-12-31 23:59:52 |
| 6862739 | 6 | | 18 | Orc | Warlock | The Barrens | 2005-12-31 23:59:52 |
| 6862740 | 7 | | 17 | Orc | Hunter | Silverpine Forest | 2005-12-31 23:59:52 |


<!-- Let's pretend that we want to make a model that predicts the level of the player based on where the character is as well as the day of the week. 

The thinking: certain regions are meant for higher-level characters, and maybe the weekend players are less hardcore than the weekday players. The goal isn't really to build the best model, but rather to talk about the code we write in order to build models in the first place. As you'll soon see, stuff might break unless you're careful.
 -->
 
## Making Features in Pandas

In [16]:
def get_sparse_features(dataf):
    return pd.get_dummies(dataf['where'])

def get_datetime_features(dataf):
    # Use the dataframe that was passed in, not the global `df`
    return pd.get_dummies(dataf['datetime'].dt.dayofweek)
    
X = pd.concat([get_sparse_features(df), get_datetime_features(df)], axis=1)

X.columns = X.columns.astype(str)

y = df['level']

In [17]:
from sklearn.linear_model import LinearRegression

mod = LinearRegression().fit(X, y)
mod.predict(X)

array([23.42112875, 23.42112875, 26.46307373, ..., 49.57870245,
       49.57870245, 57.97415638])



<!-- So ... this works ... but let's now pretend that we're going to run this model in production. 

What will happen? We'll increase the number of rows that we read in from the dataset. This way we can mimic the new data that the model will have to process.  -->

In [39]:
df = pd.read_parquet("wow-full.parquet").sort_values('datetime').head(1_000_000)
df['where'] = df['where'].astype(str)

In [40]:
set_train = set(df['where'].unique())

In [41]:
set_infer = set(df['where'].unique())

In [27]:
X = pd.concat([get_sparse_features(df), get_datetime_features(df)], axis=1)

# Pandas-specific step: the day-of-week dummies get integer column names,
# and scikit-learn errors on mixed str/int column names, so cast them all to str.
X.columns = X.columns.astype(str)

<!-- We repeat the code, follow the same steps ... but ...  -->

In [28]:
mod.predict(X)

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- Ahn'kahet: The Old Kingdom
- Blade's Edge Mountains
- Borean Tundra
- Crystalsong Forest
- Dalaran
- ...
Feature names seen at fit time, yet now missing:
- 0
- 1
- 2
- 3
- 4
- ...




If we use `pd.get_dummies` to build our features, we risk the whole thing breaking "in production" as soon as a new category shows up. A new category adds a new column to the dummy features, which means `X` no longer has the same columns as it had when we trained the model. 
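
To make the column mismatch concrete, here's a minimal sketch on made-up toy data (the two tiny dataframes are assumptions for illustration, not part of the WoW dataset). It just compares the dummy columns you get at training time with the ones you get later.

```python
import pandas as pd

# Toy data (hypothetical): "Dalaran" only shows up after training.
train = pd.DataFrame({"where": ["Durotar", "The Barrens", "Durotar"]})
infer = pd.DataFrame({"where": ["Durotar", "Dalaran"]})

train_cols = set(pd.get_dummies(train["where"]).columns)
infer_cols = set(pd.get_dummies(infer["where"]).columns)

# Columns the model never saw, and columns it expects but no longer gets.
unseen = sorted(infer_cols - train_cols)
missing = sorted(train_cols - infer_cols)
print("unseen at fit time:", unseen)
print("missing at predict time:", missing)
```

Both lists are non-empty here, which is exactly the situation the `ValueError` above complains about.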

One option is to rewrite the way we generate features: store the categories seen during training in pandas so that unseen categories can be ignored later. But if that's the fix, then why not use the scikit-learn components that do this directly? Sure, we could write our own, but it's a lot safer to rely on battle-tested code from established projects. 
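
For reference, the pandas-only route could look roughly like the sketch below. The helper names `fit_columns` and `make_features` and the toy dataframes are hypothetical; the idea is just to remember the training columns and `reindex` to them later.

```python
import pandas as pd

def fit_columns(dataf):
    """Remember the dummy columns seen in the training data."""
    return list(pd.get_dummies(dataf["where"]).columns)

def make_features(dataf, columns):
    """Build dummies, then force them into the training-time layout."""
    dummies = pd.get_dummies(dataf["where"])
    # reindex drops unseen categories and adds all-zero columns for missing ones
    return dummies.reindex(columns=columns, fill_value=0).astype(int)

# Toy data (hypothetical): "Dalaran" was never seen during training.
train = pd.DataFrame({"where": ["Durotar", "The Barrens", "Durotar"]})
infer = pd.DataFrame({"where": ["Durotar", "Dalaran"]})

columns = fit_columns(train)
X_infer = make_features(infer, columns)
print(X_infer)
```

It works, but now the list of columns is extra state that we have to carry around and keep in sync with the model ourselves.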

So let's rewrite the feature generation code by using scikit-learn components instead. 

In [12]:
df = pd.read_parquet("wow-full.parquet").sort_values('datetime').head(1_000_000)
df['where'] = df['where'].astype(str)

In [44]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LinearRegression
from skrub import SelectCols, DatetimeEncoder

In [47]:
pipe = make_pipeline(
    make_union(
        make_pipeline(
            SelectCols("where"),
            OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=10),
        ),
        make_pipeline(
            SelectCols("datetime"),
            DatetimeEncoder(resolution=None, add_total_seconds=False, add_day_of_the_week=True),
            OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=10),
        )
    ),
    LinearRegression()
)

In [48]:
pipe

In [49]:
y = df["level"]
X = df.drop(columns=["level"])

pipe.fit(X, y)

In [50]:
new_data = pd.DataFrame([{"where": "Megaton Dinosaurhead", "datetime": pd.to_datetime("2006-02-12 12:12:12")}])

In [51]:
pipe.predict(new_data)

array([55.80165946])
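
Why doesn't the made-up zone break anything? Here's a small sketch of the encoder on its own, using toy data and the simpler `handle_unknown="ignore"` setting rather than the `"infrequent_if_exist"` used in the pipeline above (so treat it as an illustration of the idea, not of that exact configuration).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the second inference row is a zone never seen in training.
train = pd.DataFrame({"where": ["Durotar", "The Barrens", "Durotar"]})
infer = pd.DataFrame({"where": ["Durotar", "Megaton Dinosaurhead"]})

# The encoder stores the categories at fit time ...
enc = OneHotEncoder(handle_unknown="ignore").fit(train)

# ... so transform always returns the same columns; unknowns become all zeros.
encoded = enc.transform(infer).toarray()
print(enc.categories_)
print(encoded)
```

The output always has one column per category seen during `fit`, so the model downstream keeps getting the shape it was trained on.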

The main thing I hope to drive home here is that it's usually just _way_ easier to work with scikit-learn components. If you ever need to write custom code you can still totally do that, but even then you'll probably want to wrap it in a custom scikit-learn component. 
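
As a rough idea of what such a custom component might look like, here's a minimal sketch. The `DayOfWeek` class is hypothetical (it's not from skrub or scikit-learn); it only relies on the standard `BaseEstimator` / `TransformerMixin` base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

class DayOfWeek(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: turn a datetime column into its day of the week."""

    def __init__(self, column="datetime"):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn here, but fit must return self to work in a pipeline.
        return self

    def transform(self, X):
        # Monday=0 ... Sunday=6, same convention as pandas' `.dt.dayofweek`.
        return pd.DataFrame({"dayofweek": X[self.column].dt.dayofweek}, index=X.index)

# Toy data (hypothetical): a Saturday and a Sunday.
toy = pd.DataFrame({"datetime": pd.to_datetime(["2005-12-31 23:59:46", "2006-02-12 12:12:12"])})

days = DayOfWeek().fit_transform(toy)
feats = make_pipeline(DayOfWeek(), OneHotEncoder(handle_unknown="ignore")).fit_transform(toy)
print(days)
```

Because it follows the `fit` / `transform` contract, it can be dropped into `make_pipeline` or `make_union` just like the built-in pieces.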