# Preparing Data: Adding Columns
If you want to type along with me, use [this notebook](https://humboldt.cloudbank.2i2c.cloud/hub/user-redirect/git-pull?repo=https%3A%2F%2Fgithub.com%2Fbethanyj0%2Fdata271_sp25&branch=main&urlpath=tree%2Fdata271_sp25%2Flectures%2Fdata271_lec24_live.ipynb) instead. 
If you would rather follow along by just executing the cells, stay in this notebook.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import seaborn as sns
import re
sns.set_style("darkgrid")
import warnings 
warnings.filterwarnings('ignore') 

## Adding independent columns

### Method 1: Add to the end

In [None]:
# Creating animals dataframe
animals_dct = {
    'Animal': ['cow', 'kitten', 'penguin', 'Puppy'],
    'Sound': ['moo', 'purr', 'chirp', 'bark'],
}

animals = pd.DataFrame(animals_dct)
animals

In [None]:
# add a column on the right (broadcasting)
animals['Cute score'] = 10
animals

In [None]:
# add a column on the right (list)
animals = pd.DataFrame(animals_dct)
animals['Cute score'] = [10,9.9,10,10]
animals
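
The difference between the two cells above matters when the lengths don't line up. As a minimal sketch (the column name `Too short` is made up for illustration): a single value is broadcast to every row, but a list needs exactly one value per row, otherwise pandas raises a `ValueError`.

```python
import pandas as pd

animals = pd.DataFrame({
    'Animal': ['cow', 'kitten', 'penguin', 'Puppy'],
    'Sound': ['moo', 'purr', 'chirp', 'bark'],
})

# A scalar is broadcast to every row, however many rows there are
animals['Cute score'] = 10

# A list must have one value per row; a list of 2 values for 4 rows is rejected
try:
    animals['Too short'] = [10, 9.9]
    length_error = None
except ValueError as err:
    length_error = str(err)

print(length_error)
```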

### Method 2: Insert the column at a specific position

In [None]:
# Insert at a specific position with .insert(loc, column, value)
animals = pd.DataFrame(animals_dct)
animals.insert(1,'Cute score',10)
animals

In [None]:
# or do it with a list
animals = pd.DataFrame(animals_dct)
animals.insert(1,'Cute score',[10,9.9,10,10])
animals
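
Two details of `.insert()` are easy to trip over, sketched here on the same toy data: it modifies the dataframe in place and returns `None`, and by default it refuses to insert a column name that already exists.

```python
import pandas as pd

animals = pd.DataFrame({
    'Animal': ['cow', 'kitten', 'penguin', 'Puppy'],
    'Sound': ['moo', 'purr', 'chirp', 'bark'],
})

# .insert() changes `animals` in place and returns None,
# so writing animals = animals.insert(...) would lose the dataframe
result = animals.insert(1, 'Cute score', 10)

# Inserting a column name that already exists raises a ValueError
try:
    animals.insert(0, 'Cute score', 0)
    duplicate_error = None
except ValueError as err:
    duplicate_error = str(err)

print(result)
print(list(animals.columns))
```

This is why the cells above rebuild `animals` from `animals_dct` before each call to `.insert()`.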

### Method 3: Add more than 1 column at once

In [None]:
# create two columns at once (.assign returns a new dataframe, so reassign it)
animals = animals.assign(
    Adjective=['adorable', 'playful', 'tough', 'cuddly'],
    Pet=[False, True, False, True],
)

In [None]:
animals
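
Unlike `.insert()`, `.assign()` leaves the original dataframe untouched and returns a new one. It also accepts a function that receives the dataframe, which is handy when a new column depends on existing ones. A small sketch (the names `with_extras` and `Is_cute` are made up for illustration):

```python
import pandas as pd

animals = pd.DataFrame({
    'Animal': ['cow', 'kitten', 'penguin', 'Puppy'],
    'Cute score': [10, 9.9, 10, 10],
})

# .assign() returns a new dataframe; `animals` itself is not changed
with_extras = animals.assign(
    Pet=[False, True, False, True],
    # a callable receives the dataframe and can use its existing columns
    Is_cute=lambda df: df['Cute score'] > 5,
)

print(list(animals.columns))
print(list(with_extras.columns))
```

Because the new column names are passed as keyword arguments, they have to be valid Python names (no spaces) when written this way.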

## Adding columns based on other columns

In [None]:
# Making a bool column based on condition
animals['Is Cute'] = animals['Cute score'] > 5
animals.head()

In [None]:
# Making a categorical column based on another categorical column with map
animals['Can Own'] = animals['Pet'].map({True:'yes',False:'no'})
animals.head()
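
One thing to watch for with `.map()` and a dictionary: any value that has no matching key becomes `NaN`. A short sketch with made-up values:

```python
import pandas as pd

pet = pd.Series([True, False, True])
can_own = pet.map({True: 'yes', False: 'no'})

# 'quack' has no key in the dictionary, so it maps to NaN
sounds = pd.Series(['moo', 'purr', 'quack'])
animal_from_sound = sounds.map({'moo': 'cow', 'purr': 'kitten'})

print(can_own.tolist())
print(animal_from_sound.isna().tolist())
```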

In [None]:
# applying a function to everything in another column
animals['Rounded Score'] = animals['Cute score'].apply(round)
animals.head()

In [None]:
# Creating new columns with element-wise arithmetic
animals['Cute & Pet'] = animals['Cute score'] + animals['Pet']
animals.head()

In [None]:
# With string methods
animals['First letter'] = animals['Animal'].str[0]
animals.head()

In [None]:
# With list comprehension
animals['starts with p'] = ['yes' if re.search('[Pp]',i) else 'no' for i in animals['First letter']]
animals.head()
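
The list comprehension works, but the same column can also be built without a Python loop. A possible alternative, assuming we only care whether the name starts with "p" in either case, combines the `.str` methods with `np.where`:

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame({'Animal': ['cow', 'kitten', 'penguin', 'Puppy']})

# Lowercase first so that 'Puppy' and 'penguin' are treated the same
starts_with_p = animals['Animal'].str.lower().str.startswith('p')

# np.where picks 'yes' where the condition is True and 'no' elsewhere
animals['starts with p'] = np.where(starts_with_p, 'yes', 'no')

print(animals)
```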

## Activity

Let's look at the Titanic dataset from the Seaborn library.

In [None]:
# load titanic data 
titanic_seaborn = sns.load_dataset('titanic')
titanic_seaborn.head()

The titanic dataset is available in Seaborn, but it originally came from [this](https://www.kaggle.com/c/titanic/data) Kaggle source.

In [None]:
titanic_kaggle = pd.read_csv('titanic.csv')
titanic_kaggle.head()

Notice that some of the columns in the Seaborn version of this dataset are not included in the Kaggle version. This is because the Seaborn creators added columns to make the data more interpretable and to make further analysis easier. We will replicate this process, then do a short EDA.

**Activity 1:** In the `titanic_kaggle` dataframe, create a new column called `alive` based on the `survived` column. It should be "no" when `survived` is 0, and "yes" when `survived` is 1. Your goal is to make it match the `alive` column in the `titanic_seaborn` dataframe, which was added by the Seaborn creators. 

In [None]:
# run this to check your work
(titanic_kaggle['alive'] == titanic_seaborn['alive']).all()

**Activity 2:** In the `titanic_kaggle` dataframe, create a new column called `alone` based on the `sibsp` and `parch` columns. It should be False if the passenger had any family members on board, True otherwise. Your goal is to make it match the `alone` column in `titanic_seaborn`, which was added by the Seaborn creators. 
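
One possible approach, sketched on a small made-up dataframe rather than the real data: a passenger is alone exactly when the number of siblings/spouses plus the number of parents/children is zero. The column names `sibsp` and `parch` follow the activity text; the values are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    'sibsp': [1, 0, 0, 2],
    'parch': [0, 0, 2, 1],
})

# Element-wise sum of the two columns, then compare with 0 to get a bool column
toy['alone'] = (toy['sibsp'] + toy['parch']) == 0

print(toy)
```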

In [None]:
# run this to check your work
(titanic_kaggle['alone'] == titanic_seaborn['alone']).all()

**Activity 3:** In the `titanic_kaggle` dataframe, create a new column called `who` based on the original `sex` and `age` columns. Passengers under 16 should be labelled "child". Passengers who are 16 or older should be labelled "man" if their `sex` is "male" and "woman" if their `sex` is "female". Your goal is to make it match the `who` column in `titanic_seaborn`, which was added by the Seaborn creators. 

*HINT*: You might find it helpful to split this one up and solve in multiple lines of code. 
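
One way to split this up, sketched on a small made-up dataframe: first label everyone by `sex`, then overwrite the label with "child" wherever `age` is under 16. The handling of missing ages is an assumption of this sketch: the comparison `NaN < 16` is `False`, so a passenger with no recorded age keeps the label based on `sex`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sex': ['male', 'female', 'male', 'female', 'male'],
    'age': [22.0, 38.0, 4.0, 15.0, np.nan],
})

# Step 1: a label based only on sex
adult_label = toy['sex'].map({'male': 'man', 'female': 'woman'})

# Step 2: use "child" where age < 16, otherwise keep the label from step 1
# (a missing age compares as False, so it falls through to the sex-based label)
toy['who'] = np.where(toy['age'] < 16, 'child', adult_label)

print(toy)
```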

In [None]:
# run this to check your work
(titanic_kaggle['who'] == titanic_seaborn['who']).all()

**Activity 4:** Use Pandas methods or visualization to determine the number of passengers that survived and the number of passengers that didn't in each class. 
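
Two Pandas options for this kind of count, shown on invented data: `pd.crosstab` gives a table with one row per class and one column per outcome, while `groupby` with `.size()` gives the same counts in long form. The column names `pclass` and `alive` are assumed to match the Titanic data; the values here are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'pclass': [1, 1, 2, 3, 3, 3],
    'alive': ['yes', 'no', 'yes', 'no', 'no', 'yes'],
})

# Wide table: rows are classes, columns are 'no' / 'yes'
counts = pd.crosstab(toy['pclass'], toy['alive'])

# Long form: one count per (class, outcome) pair that occurs in the data
same_counts = toy.groupby(['pclass', 'alive']).size()

print(counts)
print(same_counts)
```

Note that `crosstab` fills in 0 for combinations that never occur, while the `groupby` result here simply has no entry for them.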

**Activity 5:** Survival by class and sex: Create a plot that shows the number of passengers that survived and the number of passengers that didn't in each `pclass`, faceted by sex.

**Activity 6:** Create a visualization that will help you determine how the median age of individuals who survived compares to the median age of individuals who did not survive. What plot would also help you compare the age distributions between those who survived and those who didn't?

### Code for Discussion Questions

In [None]:
flights = pd.read_csv('flight_delays.csv')
flights.head()

In [None]:
# Extract the year as a string by splitting the date text on "-"
flights['Year'] = flights['Flight_Date'].str.split("-").str[0]
flights.head()

In [None]:
# Another option: convert the column to datetime, then use the .dt accessor
flights['Flight_Date'] = pd.to_datetime(flights['Flight_Date'])
flights.dtypes

In [None]:
flights['Year_dt'] = flights['Flight_Date'].dt.year
flights.head()

In [None]:
flights.dtypes
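
The two year columns look alike when printed but hold different types, which is what the `dtypes` output above shows. A small sketch with made-up dates (the year-first, dash-separated format is an assumption based on the `split("-")` call above):

```python
import pandas as pd

# Hypothetical dates for illustration only
flights_toy = pd.DataFrame({'Flight_Date': ['2023-01-15', '2024-06-30']})

# String approach: the result is text such as '2023'
flights_toy['Year'] = flights_toy['Flight_Date'].str.split('-').str[0]

# Datetime approach: the result is an integer such as 2023
flights_toy['Year_dt'] = pd.to_datetime(flights_toy['Flight_Date']).dt.year

print(flights_toy.dtypes)
print(type(flights_toy['Year'].iloc[0]))
```

The string version cannot be used for arithmetic (for example, subtracting years) without converting it first, while the integer version can.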

In [None]:
# Departure delay distribution, one panel per year (integer year from .dt.year)
sns.set(font_scale=1.8)
fig = sns.displot(data=flights, x='Departure_Delay_Minutes', col='Year_dt', binwidth=5)
fig.set(xlim=(-20, 100))
plt.tight_layout()

In [None]:
# Same plot, faceted by the string year column instead
sns.set(font_scale=1.8)
fig = sns.displot(data=flights, x='Departure_Delay_Minutes', col='Year', binwidth=5)
fig.set(xlim=(-20, 100))
plt.tight_layout()