
# Why you should probably never use pandas inplace = True
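Notes based on [this article](https://towardsdatascience.com/why-you-should-probably-never-use-pandas-inplace-true-9f9f211849e4).

**In short: avoid inplace = True. It modifies the original object, so a function that uses it can silently change data that other code still depends on.**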

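## How inplace works

With inplace = True, a method modifies the DataFrame it is called on. Without it, the method returns a new DataFrame and leaves the original unchanged.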

In [1]:
import pandas as pd

In [2]:
# create dataframe
ice_cream = pd.DataFrame(
    {
        'state': ['frozen'],
        'flavour': ['vanilla']
    }
)

print(ice_cream)

    state  flavour
0  frozen  vanilla


In [3]:
# replace 'frozen' with 'melted' using inplace = True
ice_cream.replace(
    to_replace = {'frozen': 'melted'},
    inplace = True
)

print(ice_cream)

    state  flavour
0  melted  vanilla


In [4]:
# recreate dataframe
ice_cream = pd.DataFrame(
    {
        'state': ['frozen'],
        'flavour': ['vanilla']
    }
)

print(ice_cream)

    state  flavour
0  frozen  vanilla


In [5]:
# replace 'frozen' with 'melted'; assign it to a new variable
melted_ice_cream = ice_cream.replace(
    to_replace = {'frozen': 'melted'}
)

print(melted_ice_cream)

    state  flavour
0  melted  vanilla


In [6]:
# ice_cream dataframe
print(ice_cream)

    state  flavour
0  frozen  vanilla
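
The difference between the two calls above can be checked directly. The following is a minimal sketch that reuses the `ice_cream` example and confirms that the non-inplace call produces a separate object while the original keeps its value.

```python
import pandas as pd

ice_cream = pd.DataFrame(
    {
        'state': ['frozen'],
        'flavour': ['vanilla']
    }
)

# without inplace, replace returns a new DataFrame
melted_ice_cream = ice_cream.replace(
    to_replace = {'frozen': 'melted'}
)

# the result is a different object from the original
print(melted_ice_cream is ice_cream)       # False

# the original still holds 'frozen'; only the new object holds 'melted'
print(ice_cream.loc[0, 'state'])           # frozen
print(melted_ice_cream.loc[0, 'state'])    # melted
```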


## Example of where inplace = True can go wrong

In [11]:
# create a pandas DataFrame
df = pd.DataFrame(
    {
        'city': ['London', 'Amsterdam', 'New York', None],
        'sales': [100, 300, 200, 400]
    }
)

print(df)

        city  sales
0     London    100
1  Amsterdam    300
2   New York    200
3       None    400


In [9]:
# function that drops missing cities and sorts from highest to lowest sales
def create_top_city_leaderboard(df):
  df.dropna(subset = ['city'], inplace = True)
  df.sort_values(by = ['sales'], ascending = False, inplace = True)
  return df

In [12]:
# function that calculates sum of sales
def calculate_total_sales(df):
  return df['sales'].sum()

In [13]:
# total sales
calculate_total_sales(df = df)

1000

In [14]:
# leaderboard
create_top_city_leaderboard(df = df)

        city  sales
1  Amsterdam    300
2   New York    200
0     London    100


In [15]:
# total sales again
# 400 is missing: create_top_city_leaderboard dropped that row from the caller's df in place
calculate_total_sales(df = df)

600
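
The total changes because the `df` inside the function and the `df` outside it are the same object, so the in-place `dropna` reaches the caller's data. A minimal sketch of that aliasing, reusing the function from above (the names `rows_before` and `leaderboard` are only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'city': ['London', 'Amsterdam', 'New York', None],
        'sales': [100, 300, 200, 400]
    }
)

# same function as above: both steps modify the DataFrame that was passed in
def create_top_city_leaderboard(df):
  df.dropna(subset = ['city'], inplace = True)
  df.sort_values(by = ['sales'], ascending = False, inplace = True)
  return df

rows_before = len(df)
leaderboard = create_top_city_leaderboard(df = df)

# the returned leaderboard is the caller's DataFrame, not a copy
print(leaderboard is df)       # True

# the caller's DataFrame lost the row with the missing city
print(rows_before, len(df))    # 4 3
```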

## A better way: the same functions without inplace = True

In [16]:
# create a pandas DataFrame
df = pd.DataFrame(
    {
        'city': ['London', 'Amsterdam', 'New York', None],
        'sales': [100, 300, 200, 400]
    }
)

print(df)

        city  sales
0     London    100
1  Amsterdam    300
2   New York    200
3       None    400


In [20]:
# function that drops missing cities and ranks by sales desc
def create_top_city_leaderboard(df):
  return (
      df
        .dropna(subset = ['city'])
        .sort_values(by = ['sales'], ascending = False)
  )

In [18]:
# function that returns total sales
def calculate_total_sales(df):
  return df['sales'].sum()

In [24]:
# leaderboard
create_top_city_leaderboard(df = df)

        city  sales
1  Amsterdam    300
2   New York    200
0     London    100


In [23]:
# total sales
calculate_total_sales(df = df)

1000
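
To confirm that the chained version leaves the input alone, one option is to compare the original DataFrame before and after the call. This is a small sketch of that check, using the same data and function as above (the name `leaderboard` is only for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {
        'city': ['London', 'Amsterdam', 'New York', None],
        'sales': [100, 300, 200, 400]
    }
)

# same function as above: each step returns a new DataFrame
def create_top_city_leaderboard(df):
  return (
      df
        .dropna(subset = ['city'])
        .sort_values(by = ['sales'], ascending = False)
  )

leaderboard = create_top_city_leaderboard(df = df)

# the leaderboard is a new object with 3 rows; the original keeps all 4
print(leaderboard is df)            # False
print(len(df), len(leaderboard))    # 4 3

# the original row order and total are unchanged
print(list(df.index))               # [0, 1, 2, 3]
print(df['sales'].sum())            # 1000
```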