# Data:Palmer Penguins

In [1]:
#standard libraries
import pandas as pd
import numpy as np

#import seaborn with its standard abbreviation
import seaborn as sns

#will use the random library to create some random numbers
import random

In [3]:
#import penguins dataset
penguins = sns.load_dataset("penguins")

# look at dataframe's head
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female


## Adding a column

General syntax to add a single columns is
```
df['new_col_name'] = new_column_values
```

`new_column_values` could be:

- a `pd.Series` or `numpy.array` of the same length as the data frame
- a single scalar (a single number, a single string)

**Example**

Want to create a new column where the body mass is in kg instead of grams

In [4]:
# add a new column body_mass_kg 
# sane syntax as adding a new key to a dictionary
penguins['body_mass_kg'] = penguins.body_mass_g/1000 #dividing by 1000 to convert from g to kg

# confirm the new column is in the data frame
print('body_mass_kg' in penguins.columns)

# take a look at the new column
penguins.head()

True


Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,Adelie,Torgersen,,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


To create a new column and inset it at a particular position we use `insert()`:
```
df.insert(loc = integers_index,
          column = 'new_column_name',
          value = new_col_values) #location of new column
```
Example:
Suppose each penguins gets a unique code as a three digit number. Add this column at the beginning of the data frame:

In [6]:
# create random 3-digit codes
# random.sample used for random sampling wo replacement
codes = random.sample(range(100,1000), len(penguins))

# insert codes at the front of data frame = index 0
penguins.insert(loc=0, 
                column = 'code',
                value = codes)

In [7]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,body_mass_kg
0,171,Adelie,Torgersen,39.1,18.7,181.0,3750.0,Male,3.75
1,586,Adelie,Torgersen,39.5,17.4,186.0,3800.0,Female,3.8
2,804,Adelie,Torgersen,40.3,18.0,195.0,3250.0,Female,3.25
3,521,Adelie,Torgersen,,,,,,
4,143,Adelie,Torgersen,36.7,19.3,193.0,3450.0,Female,3.45


## Adding multiple columns

We can assing multiple oclumns in the same call using `assign()` method.
Syntax:
```
df.assign( new_col1_name = new_col1_values, 
           new_col2_name = new_col2_values)
```
Notice: new column names are not strings, we declare them as if we were creating variables

Example: Add columns:

- flipper length converted from mm to cm, and
- a code representing the observer

In [32]:
#create codes from observers
observers = random.choice(['A', 'B', 'C'], #sample from this array
                          k = len(penguins)) #get this/these many items

# create new columns in the data frame
# random.choices used for random sampling with replacement
# need to reassign output of assign() to update the data frame
penguins = penguins.assign( flipper_length_cm = penguins.flipper_length_mm /10, 
                            observer =   random.choices(['A','B','C'], k=len(penguins)))
# look at result
penguins.head()

TypeError: choice() got an unexpected keyword argument 'k'

## Removing columns

We can remove columns using the `drop()` method. Syntax:
```
df = df.drop(columns = col_names)
```
where `col_names` can be a single column name (str) or a list of column names.

Example

We want to drop the flipper length in mm and the body mass in grams

In [9]:
# use a list of column names
# reassign output of drop() to dataframe to update it
penguins = penguins.drop(columns=['flipper_length_mm','body_mass_g'])

# check columns
print(penguins.columns)

Index(['code', 'species', 'island', 'bill_length_mm', 'bill_depth_mm', 'sex',
       'body_mass_kg'],
      dtype='object')


## Updating values

Sometimes we want to update certain values in our data frams.

### A single value
We can access a single value in a `pd.DataFrame` using the locators:

- `at[]` :to select by labels
- `iat[]`: to select by integer index/position

Syntax for `at[]`:
```
df.at[single_index_value, 'column_name']
```
`at[]` equivalent to `loc[]` when accessing a single value

Example
Want to know the bill length of the penguin in the fourth row.

In [11]:
# access value at row with index=3 and column='bill_length_mm'
penguins.at[3,'bill_length_mm']

nan

We got an NA. Let's update to 38.3 mm. We do this using `at[]` to:

In [12]:
# update NA to 38.3
penguins.at[3,'bill_length_mm'] = 38.3

# check it was updated
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg
0,171,Adelie,Torgersen,39.1,18.7,Male,3.75
1,586,Adelie,Torgersen,39.5,17.4,Female,3.8
2,804,Adelie,Torgersen,40.3,18.0,Female,3.25
3,521,Adelie,Torgersen,38.3,,,
4,143,Adelie,Torgersen,36.7,19.3,Female,3.45


In [13]:
#Q: what happens if you reverse the addingment in .at?
a = 10
b = 20

b = a
print(b)

10




## Update multiple values in a column

### By condition

Think of `case_when` in R.

Example

We want to classify penguins such that:

- small penguins: body mass < 3kg
- medium penguins: 3kg <= body mass < 5kg
- large penguins: body mass > 5kg

One way to do this is using `numpy.select()` to create a new column

In [17]:
#create a list with the conditions
conditions = [penguins.body_mass_kg < 3, 
              (3 <= penguins.body_mass_kg) & (penguins.body_mass_kg < 5),
              5 <= penguins.body_mass_kg]

# create a list with the choices
choices = ["small",
           "medium",
           "large"]

# add the selections using np.select
# default = value for anything that falls outside conditions
penguins['size'] = np.select(conditions, choices, default=np.nan)

penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,size
0,171,Adelie,Torgersen,39.1,18.7,Male,3.75,medium
1,586,Adelie,Torgersen,39.5,17.4,Female,3.8,medium
2,804,Adelie,Torgersen,40.3,18.0,Female,3.25,medium
3,521,Adelie,Torgersen,38.3,,,,
4,143,Adelie,Torgersen,36.7,19.3,Female,3.45,medium


### Update a column by selecting values

Sometimes we just want to update a few values that satify a condition.
We can do this by selecting data using `loc` (if selecting by label) and then assigning a new value
```
# modifies data in place
df.loc[row_selection, col_name] = new_values
```
Where
- `row_selection` = rows we want to update,
- `col_name` = a single column name, and 
- `new_values` = the new value or values we want to update. If using multiple values, make sure they are the same length as the data frame.

Example

We want to update the "Male" value in the sex column to "M"

In [20]:
penguins.loc[penguins.sex == 'Male', 'sex'] = 'M'

In [22]:
penguins.head()
#or
penguins.sex.unique()

array(['M', 'Female', nan], dtype=object)

## `SettingWithCopyWarning`

Suppose we want to update the "Female" value in the sex column to "F". This is an example of something we might try, but won't work:

In [23]:
penguins[penguins.sex == 'Female']['sex'] = 'F'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  penguins[penguins.sex == 'Female']['sex'] = 'F'


When we select data with **chained indexing** `[][]` instead of `loc` we get this warning. `pandas` is trying to alert us that our code is ambiguous and there might be a bug

In this case, we did not update the penguins data frame:

In [24]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,size
0,171,Adelie,Torgersen,39.1,18.7,M,3.75,medium
1,586,Adelie,Torgersen,39.5,17.4,Female,3.8,medium
2,804,Adelie,Torgersen,40.3,18.0,Female,3.25,medium
3,521,Adelie,Torgersen,38.3,,,,
4,143,Adelie,Torgersen,36.7,19.3,Female,3.45,medium


## Views and Copies

Some `pandas` operations return a view to your data, while others return a copy to your data.

- **Views** are actual subsets of the original data, when we update them, we are modifying the original data frame.

- **copies** are unique objects, independent of our original data frames. When we update a copy we are not modifying the original data frame.

### Another `SettingWithCopyWarning`

Another common situation when this warning comes up is when we try updating a subset of our data frame already stored in a variable

Example

We only want datafrom biscoe island and, after doing some analysis, we want to add a new column to it.

In [26]:
# select penguins from Biscoe island
biscoe = penguins[penguins.island=='Biscoe']

# 50 lines of code here

# add a column, we get a warning
biscoe['sample_col'] = 100

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  biscoe['sample_col'] = 100


Essentially what we did was
```
penguins[pengiuns.island=='Biscoe']['sample_column'] = 100
```

To fix this we can **take control of the copy-view situation ane explicitely ask for a copy of the dataset when subsetting the data.** Use the `copy()` method for this

In [28]:
biscoe = penguins[penguins.island=='Biscoe'].copy()

biscoe['sample_col'] = 100

In [29]:
penguins.head()

Unnamed: 0,code,species,island,bill_length_mm,bill_depth_mm,sex,body_mass_kg,size
0,171,Adelie,Torgersen,39.1,18.7,M,3.75,medium
1,586,Adelie,Torgersen,39.5,17.4,Female,3.8,medium
2,804,Adelie,Torgersen,40.3,18.0,Female,3.25,medium
3,521,Adelie,Torgersen,38.3,,,,
4,143,Adelie,Torgersen,36.7,19.3,Female,3.45,medium
