# Updating data frames

## Data: Palmer Penguins

In [50]:
# import standard libraries 
import pandas as pd
import numpy as np

# import seaborn with standard abbreviation
import seaborn as sns

# use random library to create random numbers
import random

In [74]:
# import data from sns
penguins = sns.load_dataset('penguins')

# look at the data frame's head
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Adding a column 

The general syntax to add a single column is:

```
df['new_col_name'] = new_column_values
```

`new_column_values` could be:

- a `pd.Series` or `numpy.array` of the same length as the data frame
- a single scalar (single number, single string)
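As a rough sketch of these options, the block below uses a small made-up data frame (`df`, with invented values) in place of `penguins`. The column names `units` and `row_number` are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"body_mass_g": [3750.0, 3800.0, 3250.0]})

# A single scalar is repeated for every row
df["units"] = "grams"

# A numpy array needs exactly one value per row
df["row_number"] = np.array([1, 2, 3])

# A pd.Series of the same length, here computed from an existing column
df["body_mass_kg"] = df["body_mass_g"] / 1000

print(df)
```

If the array has a different length than the data frame, `pandas` raises a `ValueError` instead of adding the column.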

**Example**

We want to create a new column where the body mass is in kilograms instead of grams.

In [78]:
# add column
penguins['body_mass_kg'] = penguins.body_mass_g/1000

# check that the new column has been added - prints True if it was added successfully
print('body_mass_kg' in penguins.columns)

True


To create a new column and insert it at a particular position we use `insert()`:
```
df.insert(loc = integer_index, # location of new column
          column = 'new_column_name',
          value = new_column_values)
```

Example:

Each penguin gets a unique code as a 3-digit number. Add this column at the beginning of the data frame:

In [79]:
# create random 3-digit code 
# sample is without replacement
codes = random.sample(range(100,1000), 
                      len(penguins)) # as many rows as in the specific dataframe

penguins.insert(loc = 0,
               column = 'code', 
               value = codes)

ValueError: cannot insert code, already exists
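This `ValueError` is what `insert()` raises when a column with that name already exists, for example when the cell is run a second time. A minimal sketch of a guard that makes the cell safe to re-run is shown below. The helper name `add_codes` and the toy data frame are assumptions made for illustration, not part of the original notebook.

```python
import random
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap"]})

def add_codes(frame):
    """Insert a 'code' column at position 0 unless it already exists (hypothetical helper)."""
    if "code" not in frame.columns:
        # sample without replacement so every code is unique
        codes = random.sample(range(100, 1000), len(frame))
        frame.insert(loc=0, column="code", value=codes)
    return frame

add_codes(df)
add_codes(df)  # second call does nothing instead of raising ValueError

print(df.columns.tolist())
```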

In [54]:
# check the output - new column 'code' should be the first column 
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,797,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,492,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,999,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,396,Adelie,Torgersen,,,,,,
4,916,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


## Adding multiple columns

We can assign multiple columns in the same call using the `assign()` method:

Syntax:
```
df.assign( new_col1 = new_col1_values,
            new_col2 = new_col2_values)
```

Notice: the new column names are not strings. We pass them as keyword arguments, as if we were creating variables, so there are no quotation marks.
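One detail worth seeing in code is that `assign()` returns a new data frame and leaves the original untouched, which is why the result is usually reassigned. The sketch below assumes a small made-up data frame in place of `penguins`.

```python
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"flipper_length_mm": [181.0, 186.0, 195.0]})

# Column names are keyword arguments, so no quotation marks
new_df = df.assign(flipper_length_cm=df.flipper_length_mm / 10,
                   observer=["a", "b", "c"])

print(df.columns.tolist())      # the original data frame is unchanged
print(new_df.columns.tolist())  # the new columns are appended at the end
```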

**Example:**

Add columns:
- flipper length converted from mm to cm, and
- a code representing the observer

In [80]:
# create codes for observers
observers = random.choices(['a', 'b', 'c'], # sample from this array
                          k = len(penguins)) # get this many items

# output is a list of a, b, c values of length `k`
# observers

In [75]:
penguins = penguins.assign(flipper_length_cm = penguins.flipper_length_mm/10, # 10 mm in 1 cm
                            observer = observers) # insert column defined above

penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,flipper_length_cm,observer
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,18.1,a
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,18.6,c
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,19.5,a
3,Adelie,Torgersen,,,,,,,a
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,19.3,b


## Remove columns

We can remove columns using the `drop()` method. Syntax:
```
df = df.drop(columns = col_names)
```

where `col_names` can be a single column name (string) or a list of column names.
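A short sketch of both forms of `col_names`, on a made-up data frame with placeholder column names `a`, `b` and `c`:

```python
import pandas as pd

# Toy data frame with placeholder column names
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})

# A single column name as a string
one_less = df.drop(columns="a")

# A list of column names
two_less = df.drop(columns=["a", "b"])

print(one_less.columns.tolist())
print(two_less.columns.tolist())
print(df.columns.tolist())  # drop() returned new data frames, df is unchanged
```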

**Example**

We want to drop the `flipper_length_mm` and `body_mass_g` columns:

In [81]:
# reassign when using drop
penguins = penguins.drop(columns = ['flipper_length_mm', 'body_mass_g'])

In [69]:
print(penguins.columns)

Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'body_mass_kg', 'flipper_length_cm', 'observer'],
      dtype='object')


## Updating values

Sometimes we want to update a certain value in our data frame.

### A single value

We can access a single value in a `pd.DataFrame` using the locators:
- `at[]` : to select by labels
- `iat[]` : to select by integer index/position 

Syntax for `at[]`:
```
df.at[index, col_name]
```

Syntax for `iat[]`:
```
df.iat[index_row, index_col]
```
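To make the difference between the two locators concrete, here is a minimal sketch on a made-up data frame whose row labels (`p1`, `p2`) are strings, so that labels and positions cannot be confused:

```python
import pandas as pd

# Toy data frame with string row labels (values are made up)
df = pd.DataFrame({"bill_length_mm": [39.1, 39.5],
                   "sex": ["Male", "Female"]},
                  index=["p1", "p2"])

# at[]: row label and column name
by_label = df.at["p2", "sex"]

# iat[]: row position and column position, both starting at 0
by_position = df.iat[1, 1]

print(by_label, by_position)
```

Both calls point at the same cell. In `penguins` the row labels happen to be the integers 0, 1, 2, ..., so the row argument looks the same for `at[]` and `iat[]`, but the column argument still differs.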

In [82]:
penguins.at[3, 'bill_length_mm']

nan

We got a missing value (`nan`). Let's update it to 38.3 mm. We do this using `at[]` too:

In [83]:
# update value in place
penguins.at[3, 'bill_length_mm'] = 38.3

## Update multiple values in a column 

### By condition 

This is similar to `case_when` in R.

**Example**

We want to classify penguins such that:
- small : body mass < 3 kg
- medium : 3 kg <= body mass < 5 kg
- large : 5 kg <= body mass

One way to do this is to use `numpy.select()` to create a new column:
1. create the values of the new column from a list of conditions and a list of choices
2. assign them to the data frame

In [84]:
# create a list with conditions
conditions = [
    penguins.body_mass_kg < 3, # small condition
    (penguins.body_mass_kg >= 3) & (penguins.body_mass_kg < 5), # medium condition
    penguins.body_mass_kg >= 5 # large condition
]

# create a list of choices, one per condition
choices = [
    'small',
    'medium',
    'large'
]

# rows that match no condition get the default value
penguins['size'] = np.select(conditions,
                             choices,
                             default = np.nan)

penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,flipper_length_cm,observer,body_mass_kg,size
0,369,Adelie,Torgersen,39.1,18.7,Male,18.1,a,3.75,medium
1,923,Adelie,Torgersen,39.5,17.4,Female,18.6,c,3.8,medium
2,388,Adelie,Torgersen,40.3,18.0,Female,19.5,a,3.25,medium
3,580,Adelie,Torgersen,38.3,,,,a,,
4,537,Adelie,Torgersen,36.7,19.3,Female,19.3,b,3.45,medium


### Update a column by selecting values

Sometimes we just want to update a few values that satisfy a condition. We can do this by selecting them with `loc` (if selecting by label) and then assigning a new value:
```
# modifies data in place
df.loc[row_selection, col_name] = new_values
```

where
- `row_selection`: the rows we want to update,
- `col_name`: a single column name, and
- `new_values`: the new value or values we want to assign
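A minimal sketch of this pattern on a made-up data frame, showing both a single new value and a list of new values (the replacement masses are invented for the illustration):

```python
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"sex": ["Male", "Female", "Male"],
                   "body_mass_kg": [3.75, 3.8, 5.2]})

# A single value is assigned to every selected row
df.loc[df.sex == "Female", "sex"] = "F"

# A list also works, with one value per selected row
df.loc[df.sex == "Male", "body_mass_kg"] = [3.7, 5.0]

print(df)
```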


**Example**

We want to update the "Male" value in the sex column to "M".

In [87]:
penguins.loc[penguins.sex == "Male", # select row
            'sex'] = 'M' # select column and assign value

In [89]:
# see that 'Male' has been updated to 'M'
penguins.sex.unique()

array(['M', 'Female', nan], dtype=object)

## `SettingWithCopyWarning`

Suppose we want to update the "Female" value in the sex column to "F". This is an example of what we might try, but it won't work:

In [90]:
penguins[penguins.sex == 'Female']['sex'] = 'F'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex == 'Female']['sex'] = 'F'


When we select data with **chained indexing** `[][]` instead of `loc` and then assign to it, we get this warning. `pandas` is trying to warn us that our code is ambiguous and there might be a bug.

In this case, the `penguins` data frame was not updated.
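The sketch below checks this on a made-up data frame: the chained assignment leaves the data untouched, while the `loc` version updates it. Warnings are silenced only to keep the output short, and the exact warning class may differ between `pandas` versions.

```python
import warnings
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"sex": ["Male", "Female", "Female"]})

# Chained indexing: the boolean selection creates a copy, and the copy is what gets modified
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df.sex == "Female"]["sex"] = "F"

after_chained = df.sex.tolist()
print(after_chained)  # 'Female' is still there

# loc selects rows and column in a single step, so the original is updated
df.loc[df.sex == "Female", "sex"] = "F"

after_loc = df.sex.tolist()
print(after_loc)
```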

## Views and copies

Some `pandas` operations return a view of your data, while others return a copy of your data.

**Views**
- are actual subsets of the original data. When we update them, we are modifying the original data frame.

**Copies**
- are unique objects that are independent of our original data frame. When we update a copy, we are not modifying our original data frame.

Depending on what we are trying to do, we might want to modify a copy or a view.

### Another `SettingWithCopyWarning`

Another common situation where this warning comes up is when we try to update a subset of a data frame that is already stored in a variable.

Example:

We only want data from Biscoe Island and, after doing some analyses, we want to add a new column to it:

In [93]:
biscoe = penguins[penguins.island == 'Biscoe']

## 50 lines of code later...

biscoe['sample_column'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_column'] = 100


Essentially, what we did was:
```
penguins[penguins.island == 'Biscoe']['sample_column'] = 100
```

And this won't work :/

To fix this we can **take control of the copy-view situation and ask for a copy of the dataset when subsetting the data**.

Use the `copy()` method for this.

In [95]:
# make an explicit copy of the subset - it is now independent of penguins
biscoe = penguins[penguins.island == 'Biscoe'].copy()

# no warning because biscoe is unambiguously its own data frame
biscoe['sample_column'] = 100

In [96]:
biscoe.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,flipper_length_cm,observer,body_mass_kg,size,sample_column
20,252,Adelie,Biscoe,37.8,18.3,Female,17.4,a,3.4,medium,100
21,353,Adelie,Biscoe,37.7,18.7,M,18.0,b,3.6,medium,100
22,129,Adelie,Biscoe,35.9,19.2,Female,18.9,c,3.8,medium,100
23,982,Adelie,Biscoe,38.2,18.1,M,18.5,a,3.95,medium,100
24,113,Adelie,Biscoe,38.8,17.2,M,18.0,c,3.8,medium,100
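As a quick check that the copy is independent of the original, the sketch below repeats the same steps on a made-up data frame and confirms that the new column exists only in the subset:

```python
import pandas as pd

# Toy data frame standing in for `penguins` (values are made up)
df = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe"],
                   "code": [101, 202, 303]})

# Ask explicitly for a copy when subsetting
subset = df[df.island == "Biscoe"].copy()
subset["sample_column"] = 100

print("sample_column" in subset.columns)
print("sample_column" in df.columns)
```

The subset keeps the row labels of the original data frame (here 0 and 2), which is why the `biscoe` output above starts at row 20.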
