# Data: Palmer penguins 

In [33]:
# Standard libraries
import pandas as pd
import numpy as np

# import seaborn with its standard abbreviation
import seaborn as sns

# will use the random library 
import random


In [46]:
penguins = sns.load_dataset("penguins")

# look at dataframe's head
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Adding a column

The general syntax to add a single column is:

```
df['new_col_name'] = new_column_values
```

`new_column_values` could be:
- a `pd.Series` or `numpy.array` of the same length as the data frame
- a single scalar (a single number, a single string)
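
As a minimal sketch of both cases, the snippet below uses a small hand-made data frame. The `toy` frame, its column names, and its values are made up for illustration and are not part of the penguins data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'mass_g': [3750.0, 3800.0, 3250.0]})

# array of the same length as the data frame: one value per row
toy['rank'] = np.array([2, 1, 3])

# a pd.Series also works (here computed from an existing column)
toy['mass_kg'] = toy['mass_g'] / 1000

# single scalar: the same value is repeated in every row
toy['source'] = 'field'

print(toy)
```

An array or list whose length does not match the number of rows raises a `ValueError`, which is why the first option requires "the same length as the data frame".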

**Example**
We want to create a new column where the body mass is in kilograms instead of grams.


In [13]:
# add a new column: convert grams to kilograms (1 kg = 1000 g)
# same syntax as above
penguins['body_mass_kg'] = penguins.body_mass_g/1000

print('body_mass_kg' in penguins.columns)

True


In [14]:
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,Adelie,Torgersen,,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


To create a new column and insert it at a particular position we use `insert()`:
```
df.insert(loc = integer_index,  # location of new column
          column = 'new_column_name',
          value = new_col_values)
```

Example:
Suppose each penguin gets a unique code as a three-digit number. Add this column at the beginning of the data frame:



In [49]:
# create random 3-digit codes
# random.sample is used for random sampling without replacement
codes = random.sample(range(100,1000), len(penguins))

# insert codes at the front of data frame = index 0
penguins.insert(loc=0, 
                column = 'code',
                value = codes)
        
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,559,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,777,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,290,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,534,Adelie,Torgersen,,,,,
4,956,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


Q: what happens if we reassign with insert? 
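
One way to explore this question is the sketch below, run on a hypothetical two-row `toy` frame (not the penguins data). It relies on two documented behaviors of `DataFrame.insert()`: the method modifies the data frame in place and returns `None`, and it raises a `ValueError` if the column name already exists.

```python
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'species': ['Adelie', 'Gentoo']})

# insert() changes toy in place; the return value is None
result = toy.insert(loc=0, column='code', value=[101, 202])

print(result)
print(list(toy.columns))

# inserting a column whose name already exists raises a ValueError
try:
    toy.insert(loc=0, column='code', value=[303, 404])
except ValueError as err:
    print('ValueError:', err)
```

So writing `df = df.insert(...)` would replace `df` with `None`. Unlike `assign()` and `drop()`, `insert()` should be called without reassigning.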

## Adding multiple columns 
We can assign multiple columns in the same call using the `assign()` method. 
Syntax: 
```
df.assign(new_col1_name = new_col1_values,
          new_col2_name = new_col2_values) 
```

Notice: new column names are not strings, we declare them as if we were creating variables.
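
A small sketch of how `assign()` behaves, using a made-up `toy` frame. It shows that `assign()` returns a new data frame and leaves the original untouched, which is why the result is usually reassigned. The last call is an optional aside: because the names are keyword arguments, a name that is not a valid Python identifier can be passed by unpacking a dictionary.

```python
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'flipper_length_mm': [181.0, 186.0]})

# assign() returns a new data frame; toy itself is unchanged
toy_cm = toy.assign(flipper_length_cm=toy.flipper_length_mm / 10)

print(list(toy.columns))
print(list(toy_cm.columns))

# a column name with spaces cannot be written as a keyword,
# but it can be passed through dictionary unpacking
spaced = toy.assign(**{'flipper length (cm)': toy.flipper_length_mm / 10})
print(list(spaced.columns))
```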

Example: 
Add two columns: 

- flipper length converted from mm to cm, and 
- a code representing the observer 
        

In [50]:
# create codes for observers
# random.choices samples with replacement
observers = random.choices(['A','B','C'],    # sample from this list
                           k=len(penguins))  # number of items to draw: one per row

penguins = penguins.assign(flipper_length_cm = penguins.flipper_length_mm/10,
                           observer = observers)

penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,flipper_length_cm,observer
0,559,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,18.1,B
1,777,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,18.6,C
2,290,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,19.5,C
3,534,Adelie,Torgersen,,,,,,,B
4,956,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,19.3,B
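
The two examples above use different functions from the `random` module. The sketch below contrasts them on small inputs; the seed and the sample size of 5 are arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed so the sketch is repeatable

# sample(): without replacement, so every code is different
codes = random.sample(range(100, 1000), 5)
print(codes)

# choices(): with replacement, so repeated values are allowed
observers = random.choices(['A', 'B', 'C'], k=5)
print(observers)

# sample() cannot draw more items than the population holds
try:
    random.sample(['A', 'B', 'C'], 5)
except ValueError as err:
    print('ValueError:', err)
```

This is why `random.sample` suits the unique codes, while `random.choices` is needed to spread three observers over many penguins.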


## Removing columns
We can remove columns using the `drop()` method. Syntax: 
```
df = df.drop(columns = col_names)
```
where `col_names` can be a single column name (string) or a list of column names.
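
A minimal sketch of both forms of `col_names`, on a made-up `toy` frame with columns `a`, `b`, and `c`. It also shows that `drop()` returns a new data frame, so the original keeps its columns unless we reassign.

```python
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

no_a = toy.drop(columns='a')            # single column name
only_a = toy.drop(columns=['b', 'c'])   # list of column names

print(list(no_a.columns))
print(list(only_a.columns))
print(list(toy.columns))   # toy was not reassigned, so nothing was removed
```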

Example

We want to drop the flipper length in mm and the body mass in grams

In [51]:
# reassign when using drop
penguins = penguins.drop(columns = ['flipper_length_mm', 'body_mass_g'])
print(penguins.columns)

Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'flipper_length_cm', 'observer'],
      dtype='object')


## Updating values
Sometimes we want to update certain values in our data frame. 

### A single value
We can access a single value in a `pd.DataFrame` using the locators:

- `at[]`: to select by labels
- `iat[]`: to select by integer index / position

Syntax for `at[]`:
```
df.at[single_index_value, 'column_name']
```

`at[]` is equivalent to `loc[]` when accessing a single value.
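
To make the difference between labels and positions visible, the sketch below uses a made-up `toy` frame whose row labels (`'p1'`, `'p2'`, `'p3'`) are strings rather than the default integers. These labels and values are assumptions for illustration only.

```python
import pandas as pd

# hypothetical toy data frame with string row labels
toy = pd.DataFrame({'species': ['Adelie', 'Gentoo', 'Chinstrap'],
                    'bill_length_mm': [39.1, 47.5, 49.0]},
                   index=['p1', 'p2', 'p3'])

by_label = toy.at['p2', 'bill_length_mm']    # row label, column name
by_loc = toy.loc['p2', 'bill_length_mm']     # same value through loc[]
by_position = toy.iat[1, 1]                  # row position 1, column position 1

print(by_label, by_loc, by_position)

# the same locators can be used on the left side to update the value
toy.at['p2', 'bill_length_mm'] = 48.0
print(toy.iat[1, 1])
```

In the penguins data frame the row labels happen to be the integers 0, 1, 2, ..., so labels and positions coincide for the rows, but the column is still given by name in `at[]` and by position in `iat[]`.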

Example 
We want to know the bill length of the penguin in the fourth row. 

In [31]:
penguins.at[3, 'bill_length_mm']

nan

We got an NA. Let's update it to 38.3 mm. We do this using `at[]` too:

In [35]:
penguins.at[3, 'bill_length_mm']= 38.3
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,38.3,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


In [38]:
# iat[] selects by position: row 1, column 0
penguins.iat[1,0] = 999
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,999,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,38.3,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Update multiple values in a column

### By condition

Think of `case_when` in R.

Example 

We want to classify penguins such that: 
- small penguins: body mass < 3kg
- medium penguins: 3kg <= body mass < 5kg
- large penguins: body mass >= 5kg

One way to do this is using `numpy.select()` to create a new column

In [44]:
# create a list with the conditions
conditions = [
    penguins.body_mass_kg < 3,
    (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
    5 <= penguins.body_mass_kg
]

# create a list with the choices (one label per condition, in the same order)
choices = [
    'small',
    'medium',
    'large'
]

# default = value for anything that falls outside the conditions
# (None instead of np.nan: string choices and a float default
#  may not share a common dtype in recent NumPy versions)
penguins['size'] = np.select(conditions, choices, default = None)

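AttributeError: 'DataFrame' object has no attribute 'body_mass_kg'

This error means `penguins` had no `body_mass_kg` column when the cell ran. Create that column first, as in the body mass example above, and rerun the cell.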

In [45]:
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female
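
Because the cell above did not run to completion, here is a working sketch of the same `numpy.select()` classification on a small hand-made frame. The masses in `toy` are chosen to sit on the category boundaries, and the `'unknown'` default label is an assumption made for this sketch, not a value from the original notes.

```python
import numpy as np
import pandas as pd

# hypothetical toy data frame: boundary values plus one missing value
toy = pd.DataFrame({'body_mass_kg': [2.9, 3.0, 4.99, 5.0, np.nan]})

conditions = [
    toy.body_mass_kg < 3,
    (3 <= toy.body_mass_kg) & (toy.body_mass_kg < 5),
    5 <= toy.body_mass_kg,
]
choices = ['small', 'medium', 'large']

# rows that match no condition (here the missing value) get the default
toy['size'] = np.select(conditions, choices, default='unknown')

print(toy)
```

Comparisons with a missing value evaluate to `False`, so the last row matches none of the three conditions and falls through to the default.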


### Update a column by selecting values

Sometimes we just want to update a few values that satisfy a condition. 
We can do this by selecting with `loc` (if selecting by label) and then assigning a new value. 

```
# modifies data in place
df.loc[row_selection, col_name] = new_values
```
where

- `row_selection` = rows we want to update, 
- `col_name` = a single column name, and 
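- `new_values` = the value(s) to assign to the selected rows (a single scalar, or an array with one value per selected row).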
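
The three pieces can be sketched on a made-up `toy` frame as follows. Here `row_selection` is a boolean `pd.Series` (one common way to select rows by condition), and both kinds of `new_values` are shown. The replacement masses are invented for illustration.

```python
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Male'],
                    'body_mass_g': [3750.0, 3800.0, 3250.0]})

# row_selection: True for the rows we want to update
is_male = toy.sex == 'Male'

# new_values as an array: one value per selected row (two rows here)
toy.loc[is_male, 'body_mass_g'] = [3700.0, 3300.0]

# new_values as a single scalar: repeated for every selected row
toy.loc[is_male, 'sex'] = 'M'

print(toy)
```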

Example 

We want to update the "Male" value in the sex column to "M"

In [None]:
penguins.loc[penguins.sex == 'Male', 'sex'] = 'M'

In [47]:
penguins.sex.unique()

array(['Male', 'Female', nan], dtype=object)

## `SettingWithCopyWarning`
Suppose we want to update the "Female" value in the sex column to "F".
This is an example of something we might try, but that won't work: 

In [53]:
penguins[penguins.sex == 'Female']['sex'] = 'F'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex == 'Female']['sex'] = 'F' #[][]


When we select data with **chained indexing** (`[][]`) instead of `loc`, we get this warning. `pandas` is alerting us to a potential bug in our code. 

In this case, the bug is that we did not update our data frame:


In [54]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,flipper_length_cm,observer
0,559,Adelie,Torgersen,39.1,18.7,Male,18.1,B
1,777,Adelie,Torgersen,39.5,17.4,Female,18.6,C
2,290,Adelie,Torgersen,40.3,18.0,Female,19.5,C
3,534,Adelie,Torgersen,,,,,B
4,956,Adelie,Torgersen,36.7,19.3,Female,19.3,B
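
The sketch below reproduces the problem and the fix side by side on a made-up three-row `toy` frame. The warning is silenced only to keep the output short. The fix is the single `loc[]` call that the warning message itself recommends.

```python
import warnings
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'sex': ['Male', 'Female', 'Female']})

# chained indexing: the assignment lands on a temporary copy
with warnings.catch_warnings():
    warnings.simplefilter('ignore')   # silence the warning for this demo
    toy[toy.sex == 'Female']['sex'] = 'F'
after_chained = toy.sex.tolist()
print(after_chained)   # toy is unchanged

# a single loc[] call selects and assigns in one step
toy.loc[toy.sex == 'Female', 'sex'] = 'F'
after_loc = toy.sex.tolist()
print(after_loc)       # toy is updated
```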


## Views and copies

Some `pandas` operations return a view of your data, while others return a copy of your data. 
- **Views** are actual subsets of the original data. When we update them, we are modifying the original data frame.

- **Copies** are unique objects, independent of our original data frames. When we update a copy we are not modifying the original data frame.

Depending on what we are trying to do we might want to modify a copy or a view.
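
Whether a given selection behaves as a view or a copy depends on the operation and on the `pandas` version, so the sketch below only demonstrates the unambiguous case: an explicit `copy()`. The `toy` frame and its values are made up for illustration.

```python
import pandas as pd

# hypothetical toy data frame, only for illustration
toy = pd.DataFrame({'island': ['Biscoe', 'Dream', 'Biscoe'],
                    'code': [101, 202, 303]})

# explicit copy: an independent object
biscoe = toy[toy.island == 'Biscoe'].copy()

# updating the copy...
biscoe['code'] = 0

# ...leaves the original data frame untouched
print(biscoe.code.tolist())
print(toy.code.tolist())
```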

## Another `SettingWithCopyWarning`

Another common situation where this warning comes up is when we try updating a subset of our data frame that is already stored in a variable.

Example

Suppose we only want to use data from Biscoe island and, after doing some analyses, we want to add a new column to it:


In [56]:
biscoe = penguins[penguins.island == 'Biscoe']

# 50 lines of code

biscoe['sample_column'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_column'] = 100


Essentially what we did was 
```
penguins[penguins.island == 'Biscoe']['sample_column'] = 100
```

To fix this we can **take control of the copy-view situation and explicitly ask for a copy of the dataset when subsetting the data.** Use the `copy()` method for this:

In [58]:
biscoe = penguins[penguins.island == 'Biscoe'].copy()
biscoe['sample_column'] = 100 

In [59]:
biscoe.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,flipper_length_cm,observer,sample_column
20,512,Adelie,Biscoe,37.8,18.3,Female,17.4,B,100
21,909,Adelie,Biscoe,37.7,18.7,Male,18.0,B,100
22,703,Adelie,Biscoe,35.9,19.2,Female,18.9,A,100
23,569,Adelie,Biscoe,38.2,18.1,Male,18.5,C,100
24,346,Adelie,Biscoe,38.8,17.2,Male,18.0,C,100


In [64]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,flipper_length_cm,observer
0,559,Adelie,Torgersen,39.1,18.7,Male,18.1,B
1,777,Adelie,Torgersen,39.5,17.4,Female,18.6,C
2,290,Adelie,Torgersen,40.3,18.0,Female,19.5,C
3,534,Adelie,Torgersen,,,,,B
4,956,Adelie,Torgersen,36.7,19.3,Female,19.3,B
