<div class='alert alert-warning'>

# JupyterLite warning

If you are running the current notebook in JupyterLite, you may encounter some unexpected behavior.

The main difference is that imports take longer than usual. For example, the first `import sklearn` can take 10-20 seconds.

If you notice problems, feel free to open an [issue](https://github.com/probabl-ai/youtube-appendix/issues/new/choose) about it.
</div>

In [1]:
import pandas as pd

df = pd.read_parquet("datasets/wow.parquet").sort_values('datetime')
df['where'] = df['where'].astype(str)

In [2]:
df.head()

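|      | player_id | guild | level | race   | class   | where     | datetime            |
|------|-----------|-------|-------|--------|---------|-----------|---------------------|
| 2783 | 886       | 8.0   | 49    | Tauren | Druid   | Tanaris   | 2006-01-01 11:30:47 |
| 3427 | 110       | 5.0   | 49    | Tauren | Warrior | Desolace  | 2006-01-02 02:00:48 |
| 495  | 1156      | 2.0   | 14    | Troll  | Rogue   | Mulgore   | 2006-01-02 08:31:35 |
| 4387 | 740       | 6.0   | 49    | Troll  | Rogue   | Undercity | 2006-01-02 18:52:01 |
| 3902 | 132       | 8.0   | 54    | Tauren | Warrior | Feralas   | 2006-01-02 19:30:55 |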


<!-- Let's pretend that we want to make a model that predicts the level of the player based on where the character is as well as the day of the week. 

The thinking: certain regions are meant for more high-level characters and maybe the weekend players are less hardcore than the week players. The goal isn't really to build the best model, but rather to talk about the code we write in order to build models in the first place. As you'll soon see, there's a reason why stuff might break unless you're careful.
 -->
 
## Making Features in Pandas

In [3]:
def get_sparse_features(dataf):
    return pd.get_dummies(dataf['where'])

def get_datetime_features(dataf):
    return pd.get_dummies(dataf['datetime'].dt.dayofweek)
    
X = pd.concat([get_sparse_features(df), get_datetime_features(df)], axis=1)

X.columns = X.columns.astype(str)

y = df['level']

In [4]:
from sklearn.linear_model import LinearRegression

mod = LinearRegression().fit(X, y)
mod.predict(X)

array([51.02145602, 41.47247364, 33.49723443, ..., 77.40556152,
       58.59542372, 50.0217949 ], shape=(5000,))

In [5]:
df = pd.read_parquet("datasets/wow.parquet").sort_values('datetime')
df['where'] = df['where'].astype(str)

In [6]:
set_train = set(df['where'].unique())

In [7]:
set_infer = set(df['where'].unique())

In [8]:
X = pd.concat([get_sparse_features(df), get_datetime_features(df)], axis=1)

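# pandas-specific step: get_dummies gives string column names for `where` and integer ones for
# the day of week, and scikit-learn errors on mixed column name types, so cast them all to str.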
X.columns = X.columns.astype(str)

In [9]:
mod.predict(X)

array([51.02145602, 41.47247364, 33.49723443, ..., 77.40556152,
       58.59542372, 50.0217949 ], shape=(5000,))



If we use `pd.get_dummies` to build the features that we're interested in, we risk that the whole thing breaks down "in production" as soon as we see a new category. A new category adds a new column to our dummy features, and that means that our `X` now has a different shape than it had when we trained the model.
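To make that failure mode concrete, here is a minimal sketch on toy data (three made-up rows, not the real dataset). The location names are borrowed from this notebook, and the variable names are only for illustration.

```python
import pandas as pd

# Toy stand-ins for "training" and "production" data (assumed, not the real dataset).
train = pd.DataFrame({"where": ["Tanaris", "Mulgore", "Tanaris"]})
infer = pd.DataFrame({"where": ["Tanaris", "Megaton Dinosaurhead"]})

X_train = pd.get_dummies(train["where"])
X_infer = pd.get_dummies(infer["where"])

# The dummy columns depend entirely on which categories happen to be in the data.
train_cols = list(X_train.columns)
infer_cols = list(X_infer.columns)
print(train_cols)
print(infer_cols)

# A model fitted on X_train expects train_cols, so X_infer no longer lines up.
columns_match = train_cols == infer_cols
print(columns_match)
```

Note that the problem goes both ways: a category that is missing at inference time ("Mulgore" here) drops a column just as a new one adds a column.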

One thing we could do is rewrite the way we generate features. We could write something in pandas that stores the categories seen during training, so that unseen categories can be ignored later. But if that's the fix, then why not use scikit-learn components that do this directly? Sure, we could write our own, but it's a lot safer to rely on battle-tested code from existing projects.
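For reference, a hand-rolled pandas version could look roughly like the sketch below. The helper names `fit_dummy_columns` and `transform_dummies` are hypothetical, and the data is a toy example rather than the real dataset.

```python
import pandas as pd

def fit_dummy_columns(series):
    """Remember which dummy columns exist at training time."""
    return pd.get_dummies(series, dtype=int).columns

def transform_dummies(series, columns):
    """Align dummies to the training columns.

    Categories unseen during training are dropped, and training
    categories that are missing from `series` are filled with 0.
    """
    return pd.get_dummies(series, dtype=int).reindex(columns=columns, fill_value=0)

# Toy data (assumed): one location at inference time was never seen in training.
train = pd.Series(["Tanaris", "Mulgore", "Tanaris"])
infer = pd.Series(["Tanaris", "Megaton Dinosaurhead"])

columns = fit_dummy_columns(train)
X_infer = transform_dummies(infer, columns)
print(X_infer)
```

This works, but notice that we just reinvented a `fit` step and a `transform` step, plus the bookkeeping of carrying `columns` around next to the model.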

So let's rewrite the feature generation code by using scikit-learn components instead. 

In [10]:
df = pd.read_parquet("datasets/wow.parquet").sort_values('datetime')
df['where'] = df['where'].astype(str)

In [11]:
%pip install skrub
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LinearRegression
from skrub import SelectCols, DatetimeEncoder

In [12]:
pipe = make_pipeline(
    make_union(
        make_pipeline(
            SelectCols("where"),
            OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=10),
        ),
        make_pipeline(
            make_column_transformer(
                (DatetimeEncoder(resolution=None, add_total_seconds=False, add_weekday=True), "datetime"),
                remainder="drop",
            ),
            OneHotEncoder(handle_unknown="infrequent_if_exist", min_frequency=10),
        )
    ),
    LinearRegression()
)
pipe

In [13]:
y = df["level"]
X = df.drop(columns=["level"])

pipe.fit(X, y)

In [14]:
new_data = pd.DataFrame([{"where": "Megaton Dinosaurhead", "datetime": pd.to_datetime("2006-02-12 12:12:12")}])

In [15]:
pipe.predict(new_data)

array([64.77592139])

The main point I hope to drive home here is that it's usually just _way_ easier to work with scikit-learn components. If there's ever a need to write custom code then you can still totally do that, but even then you'll probably want to write it as a custom scikit-learn component instead.
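As a rough idea of what a custom component can look like, here is a minimal sketch of a transformer that flags weekend timestamps. The class name `WeekendFlag` and the output column `is_weekend` are made up for this example; the only real requirements are the `fit` and `transform` methods and the scikit-learn base classes.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class WeekendFlag(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that marks rows whose timestamp falls on a weekend."""

    def __init__(self, column="datetime"):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn here, but fit must return self to work inside a pipeline.
        return self

    def transform(self, X):
        # dayofweek is 0 for Monday, so 5 and 6 are Saturday and Sunday.
        is_weekend = X[self.column].dt.dayofweek >= 5
        return is_weekend.astype(int).to_frame("is_weekend")

# Two timestamps from the table at the top: a Sunday and a Monday.
frame = pd.DataFrame(
    {"datetime": pd.to_datetime(["2006-01-01 11:30:47", "2006-01-02 02:00:48"])}
)
flags = WeekendFlag().fit_transform(frame)
print(flags)
```

Because it follows the same `fit`/`transform` contract, a component like this can be dropped into `make_pipeline` or `make_union` next to the built-in encoders.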