In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from datetime import datetime

# A note on indexing

In [20]:
# Let's start with a DataFrame with two columns:
df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 15, 20]})
df

   user_id  score
0        1     10
1        2     15
2        3     20


In [21]:
# A regular getitem operation on a DataFrame provides a view in most cases:
view = df["user_id"]
view

0    1
1    2
2    3
Name: user_id, dtype: int64

As a consequence, the new object view still references the parent object df and its data. Hence, writing into the view will also modify the parent object.

In [22]:
view.iloc[0] = 10
view

0    10
1     2
2     3
Name: user_id, dtype: int64

In [23]:
df #OMG! It updated the original dataframe! This is bonkers!

   user_id  score
0       10     10
1        2     15
2        3     20


The setitem operation therefore updates not only view but also df, because both objects share the same underlying data. This holds only if the column name user_id occurs exactly once in df. As soon as user_id is duplicated, the getitem operation returns a DataFrame, and the returned object is a copy instead of a view:

In [24]:
df = pd.DataFrame(
    [[1, 10, 2], [3, 15, 4]], 
    columns=["user_id", "score", "user_id"],
)
df

   user_id  score  user_id
0        1     10        2
1        3     15        4


In [25]:
not_a_view = df["user_id"]
not_a_view

   user_id  user_id
0        1        2
1        3        4


In [26]:
not_a_view.iloc[0] = 10

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  not_a_view.iloc[0] = 10


The setitem operation does not update df.

In [27]:
df #Dio mio, true!

   user_id  score  user_id
0        1     10        2
1        3     15        4


In [32]:
# We again start with a regular DataFrame:
df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 15, 20]})
df

   user_id  score
0        1     10
1        2     15
2        3     20


In [30]:
# We can update all user_ids that have a score greater than 15 through:
df["user_id"][df["score"] > 15] = 5
df

   user_id  score
0        1     10
1        2     15
2        5     20


We select the column user_id first and apply the filter afterwards. This works because the column selection creates a view, and the setitem operation updates that view. We can also swap the two operations:

In [33]:
# Re-initialize df
df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 15, 20]})
df[df["score"] > 15]["user_id"] = 5

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df["score"] > 15]["user_id"] = 5


In [34]:
df #Dio mio it didn't update!

   user_id  score
0        1     10
1        2     15
2        3     20


This execution order produces another SettingWithCopyWarning. In contrast to the previous example, nothing happens: df is not modified, so apart from the warning the statement is a no-op. The boolean mask always creates a copy of the initial DataFrame, so the initial getitem operation returns a copy. That return value is not assigned to any variable and exists only as a temporary result. The setitem operation updates this temporary copy, and the modification is lost. The fact that masks return copies while column selections return views is an implementation detail. Ideally, such implementation details should not be visible.
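
The warning message itself points to .loc[row_indexer, col_indexer] as the alternative. A minimal sketch of that form, assuming the same df as above: the row mask and the column label go into a single setitem call, so there is no intermediate getitem result for the write to get lost in.

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 15, 20]})

# Row selection (boolean mask) and column selection (label) happen
# in one .loc call, so the assignment targets df directly.
df.loc[df["score"] > 15, "user_id"] = 5

print(df)
```

Only the row with a score above 15 should end up with user_id 5; the score column is left untouched.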

Another way to attempt this is to assign the filtered result to a variable first:

In [35]:
new_df = df[df["score"] > 15]
new_df

   user_id  score
2        3     20


In [37]:
new_df["user_id"] = 10
new_df #DIO MIO now it changed!

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  new_df["user_id"] = 10


   user_id  score
2       10     20


In theory, a single setitem operation could propagate through the whole call chain, updating many DataFrames at once.
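
If the filtered result is meant to be an independent object, one option (not shown in the cells above, so treat it as a sketch) is to request the copy explicitly with .copy(). This states the intent in code, and the later assignment should then affect only new_df:

```python
import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, 3], "score": [10, 15, 20]})

# Explicit copy: new_df owns its data and is decoupled from df.
new_df = df[df["score"] > 15].copy()
new_df["user_id"] = 10

print(new_df)
print(df)  # the parent DataFrame keeps its original values
```

The design choice here is to trade a little memory for predictability: whether a selection is a view or a copy no longer matters once the copy is explicit.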

---

# A note on machine learning

- Models approximate real-life situations using limited data.
- In doing so, errors can arise due to assumptions that are overly simple (bias) or overly complex (variance).
    - A less complex model ignores relevant information, so error due to bias is high. As the model becomes more complex, error due to bias decreases.
    - Conversely, a less complex model has low error due to variance, and that error increases as complexity increases.
- Building a model means finding a balance between the two.
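
The trade-off in the list above can be made concrete with a small experiment. The sketch below is illustrative only: the sine-shaped target, the noise level, the sample sizes, and the polynomial degrees are all assumptions chosen for the demonstration, not part of the original notes. It fits polynomials of increasing degree to noisy samples and compares the error on the training points with the error on fresh points.

```python
import numpy as np

rng = np.random.default_rng(0)


def true_signal(x):
    # Hypothetical "real-life" relationship the models try to approximate.
    return np.sin(x)


# Small noisy training set and a larger set of fresh points.
x_train = np.sort(rng.uniform(0, 2 * np.pi, 30))
y_train = true_signal(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.sort(rng.uniform(0, 2 * np.pi, 200))
y_test = true_signal(x_test) + rng.normal(0, 0.2, x_test.size)


def mse(model, x, y):
    # Mean squared error of the model's predictions on (x, y).
    return float(np.mean((model(x) - y) ** 2))


# Degree controls model complexity: 1 is very simple, 12 is very flexible.
results = {}
for degree in (1, 3, 12):
    model = np.polynomial.Polynomial.fit(x_train, y_train, degree)
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))

for degree, (train_err, test_err) in results.items():
    print(f"degree={degree:2d}  train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

Training error can only go down as the degree grows, since each higher-degree polynomial contains the lower-degree ones. The error on fresh points typically falls at first (less bias) and then rises again for very flexible fits (more variance); the exact numbers depend on the random seed.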

---