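# The Pandas library
- One of the most powerful and popular Python libraries for data analysis
- Handles tabular data efficiently and offers strong support for analysis tasks such as EDA
- Reads data in many formats, such as Excel and CSV files, and makes it convenient to analyze and transform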


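## What is Exploratory Data Analysis (EDA)?
- The process of using basic statistics and visualization to understand the characteristics of a dataset and to find hidden patterns or outliers
- Much like exploring unknown territory, it is an important step that deepens understanding of the data and lays the groundwork for effective modeling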

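### Why EDA matters
- Understanding the data
    - Examine the distribution, central tendency, and variability to understand the data's characteristics accurately
- Cleaning the data
    - Find and handle problems such as outliers and missing values to improve data quality
- Laying the groundwork for modeling
    - Choose a suitable model based on the data's characteristics, and identify relationships between variables to improve model performance
- Forming hypotheses
    - New hypotheses emerge while exploring the data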

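### Main activities in EDA
- Summarizing the data
    - Compute basic statistics such as the mean, median, and standard deviation
- Visualizing the data
    - Use histograms, scatter plots, box plots, and similar charts to inspect distributions and correlations visually
- Analyzing relationships between variables
    - Use correlation analysis and related techniques to see how variables relate to each other
- Searching for outliers
    - Find and handle outliers to improve the accuracy of the data
- Handling missing values
    - Apply an appropriate method for treating missing values to make the data complete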
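
The activities listed above can be sketched with a handful of pandas calls. This is only a minimal illustration on a small made-up table: the `toy` frame, its column names, and its values are assumptions, and the 1.5 × IQR rule is just one common outlier convention.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy = pd.DataFrame({
    "height": [170.0, 180.0, 165.0, np.nan, 250.0],
    "weight": [65.0, 80.0, 55.0, 70.0, 72.0],
})

# Data summary: count, mean, std, and quartiles for every numeric column.
summary = toy.describe()

# Missing values: number of NaN entries per column.
missing = toy.isna().sum()

# Relationships between variables: pairwise Pearson correlation.
corr = toy.corr()

# Outlier search with the 1.5 * IQR rule.
q1, q3 = toy["height"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (toy["height"] < q1 - 1.5 * iqr) | (toy["height"] > q3 + 1.5 * iqr)
outliers = toy[is_outlier]

# Missing-value handling: fill NaN with the column median.
filled = toy["height"].fillna(toy["height"].median())
```

Visualization is left out here because it depends on a plotting library such as Matplotlib or Seaborn.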

### Tools used for EDA
- Libraries such as Pandas, NumPy, Matplotlib, and Seaborn are used together to carry out EDA efficiently

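## Key characteristics of Pandas
- Uses NumPy internally
    - Functionality provided by NumPy can be used as-is
- Well suited to loading and analyzing large amounts of data
- Provides data structures designed for data analysis
- Provides a wide range of analysis functions
- Connects easily to other systems and libraries (databases, machine learning libraries, etc.)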

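## Main features of Pandas
- Data input and output
    - Reads and writes data in many formats (CSV, Excel, SQL, etc.)
- Data selection and filtering
    - Selects and filters the data you need
- Sorting
    - Sorts data by a given criterion
- Combining data
    - Joins several datasets into one
- Grouping and aggregation
    - Splits data into groups and computes statistics for each group
- Missing-value handling
    - Offers several ways to handle missing values in the data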
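
As a rough sketch of how these features look in code, the example below uses two small made-up tables (`scores` and `teams` are hypothetical names, and the values are invented). Input and output are covered separately in the CSV section.

```python
import numpy as np
import pandas as pd

# Hypothetical tables; names and values are invented for illustration.
scores = pd.DataFrame({
    "student": ["A", "B", "C", "D"],
    "team": ["red", "blue", "red", "blue"],
    "score": [90.0, np.nan, 70.0, 80.0],
})
teams = pd.DataFrame({"team": ["red", "blue"], "coach": ["Kim", "Lee"]})

# Missing-value handling: replace NaN with 0.
scores["score"] = scores["score"].fillna(0)

# Selection and filtering: rows with a score of at least 80.
high = scores[scores["score"] >= 80]

# Sorting: highest score first.
ranked = scores.sort_values("score", ascending=False)

# Combining: attach each team's coach with a key-based join.
merged = scores.merge(teams, on="team", how="left")

# Grouping and aggregation: mean score per team.
team_mean = scores.groupby("team")["score"].mean()
```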

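## Why use Pandas
- Efficient data analysis
    - Complex analysis tasks can be written concisely and run efficiently
- Support for many data formats
    - Data in many formats can be handled with little effort
- Rich functionality
    - Provides the wide range of features that data analysis requires
- Active community
    - A large community of users and developers makes it easy to solve problems and learn new features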

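# pandas data structures
- DataFrame
    - Represents two-dimensional tabular data made up of rows and columns
    - Each column can have a different data type, and each row represents one observation
- Series
    - Similar to a one-dimensional array: a one-dimensional sequence holding a single type of data
    - Corresponds to one column of a DataFrame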


```shell
pip install pandas
```

In [75]:
import pandas as pd

- Creating a DataFrame from a list (one inner list per row)

In [76]:
data = [
    ["A군", 30, 170],
    ["B군", 25, 180]
]
df = pd.DataFrame(data, columns=["이름", "나 이", "키"])
df

Unnamed: 0,이름,나 이,키
0,A군,30,170
1,B군,25,180


In [77]:
type(df)

pandas.core.frame.DataFrame

- Series

In [78]:
df["키"]

0    170
1    180
Name: 키, dtype: int64

In [79]:
df.키  # attribute access: not recommended (it fails for names such as "나 이" that contain a space)

0    170
1    180
Name: 키, dtype: int64

In [80]:
df["나 이"]

0    30
1    25
Name: 나 이, dtype: int64

In [81]:
# from google.colab import drive
# drive.mount('/content/drive')

In [82]:
import numpy as np
DATA_PATH = "data/"

arr = np.load(f"{DATA_PATH}samsung_stock_2021.npy")

import pandas as pd
pd.DataFrame(arr, columns = ["시가", "고가", "저가", "종가"])

Unnamed: 0,시가,고가,저가,종가
0,81000,84400,80200,83000
1,81600,83900,81600,83900
2,83300,84500,82100,82200
3,82800,84200,82700,82900
4,83300,90000,83000,88800
...,...,...,...,...
243,80200,80800,80200,80500
244,80600,80600,79800,80200
245,80200,80400,79700,80300
246,80200,80200,78500,78800


- Creating a DataFrame from a dictionary

In [83]:
data = {
    # each key is a column name, each value is that column's data
    "이름": ["A군", "B군"],
    "나 이": [30, 25],
    "키": [170, 180]
}
df = pd.DataFrame(data)
df

Unnamed: 0,이름,나 이,키
0,A군,30,170
1,B군,25,180


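# Loading a CSV file
- What is a CSV file?
    - CSV stands for Comma-Separated Values: a plain-text file format that separates values with commas (,)
    - It is very widely used because, like an Excel sheet, it stores tabular data in rows and columns in a form that is simple to save and share
    - Unlike an Excel file, it carries no formatting information, so it contains nothing unrelated to the data itself
    - Excel files can also be loaded into a DataFrame, but collected data is usually saved as CSV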
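
To make the format concrete, here is a small sketch that parses CSV text held in a string. The `csv_text` content is made up, and `io.StringIO` stands in for a real file so the example runs without any data on disk.

```python
import io
import pandas as pd

# A CSV file is plain text: one header line, then one line per row.
csv_text = "name,age,height\nA,30,170\nB,25,180\n"

# read_csv accepts a file path or a file-like object such as StringIO.
people = pd.read_csv(io.StringIO(csv_text))

# to_csv with no path returns the CSV text instead of writing a file.
round_trip = people.to_csv(index=False)

print(people)
```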

- Titanic dataset link
    - https://drive.google.com/file/d/1aMh6jdtMq4RmHP-BFujZtIjvv29hxVN9/view?usp=sharing

## `read_csv` function
- Reads a CSV file and returns it as a DataFrame object

In [84]:
df = pd.read_csv(f"{DATA_PATH}titanic_train.csv")
df

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...,...,...
911,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q
913,664,0,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S
914,109,0,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S


## `to_csv` method
- Saves a DataFrame to a CSV file

In [85]:
df.to_csv(f"{DATA_PATH}titanic2.csv", index=False)
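
The reason for passing `index=False` can be shown with a short sketch. It uses a made-up two-row frame and an in-memory buffer instead of a file: when the index is written, it comes back on reload as an extra column, which pandas names `Unnamed: 0` because its header cell is empty.

```python
import io
import pandas as pd

small = pd.DataFrame({"name": ["A", "B"], "age": [30, 25]})

# Default index=True writes the row index as an unnamed first column.
with_index = small.to_csv()
reloaded = pd.read_csv(io.StringIO(with_index))

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = small.to_csv(index=False)
clean = pd.read_csv(io.StringIO(without_index))

# index_col=0 is another way to recover the index from a file that has it.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(reloaded.columns.tolist())  # ['Unnamed: 0', 'name', 'age']
print(clean.columns.tolist())     # ['name', 'age']
```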

# Checking basic DataFrame information

In [86]:
df.columns

Index(['passengerid', 'survived', 'pclass', 'name', 'gender', 'age', 'sibsp',
       'parch', 'ticket', 'fare', 'cabin', 'embarked'],
      dtype='object')

In [87]:
df.columns.tolist()

['passengerid',
 'survived',
 'pclass',
 'name',
 'gender',
 'age',
 'sibsp',
 'parch',
 'ticket',
 'fare',
 'cabin',
 'embarked']

In [88]:
df["age"].tolist()

[71.0,
 34.0,
 29.0,
 18.0,
 48.0,
 17.0,
 45.0,
 nan,
 6.0,
 24.0,
 30.5,
 nan,
 27.0,
 31.0,
 nan,
 33.0,
 nan,
 22.0,
 40.0,
 nan,
 nan,
 nan,
 71.0,
 nan,
 50.0,
 23.0,
 46.0,
 nan,
 28.0,
 29.0,
 9.0,
 41.0,
 nan,
 33.0,
 43.0,
 45.5,
 26.0,
 22.0,
 32.5,
 47.0,
 30.0,
 0.75,
 48.0,
 21.0,
 32.0,
 18.0,
 57.0,
 39.0,
 24.0,
 18.0,
 29.0,
 23.0,
 19.0,
 20.0,
 25.0,
 8.0,
 59.0,
 41.0,
 34.0,
 31.0,
 24.0,
 21.0,
 39.0,
 18.0,
 36.0,
 nan,
 18.0,
 29.0,
 33.0,
 30.0,
 nan,
 22.0,
 18.0,
 27.0,
 nan,
 45.0,
 61.0,
 nan,
 33.0,
 25.0,
 18.5,
 50.0,
 36.5,
 10.0,
 22.0,
 33.0,
 40.0,
 30.0,
 26.0,
 32.5,
 35.0,
 34.0,
 3.0,
 nan,
 33.0,
 36.0,
 65.0,
 37.0,
 19.0,
 58.0,
 26.0,
 19.0,
 24.0,
 26.0,
 42.0,
 27.0,
 nan,
 21.0,
 45.0,
 nan,
 22.0,
 11.0,
 23.0,
 10.0,
 nan,
 60.0,
 30.0,
 18.0,
 20.0,
 35.0,
 nan,
 nan,
 19.0,
 28.0,
 nan,
 50.0,
 28.0,
 19.0,
 26.0,
 13.0,
 21.0,
 nan,
 nan,
 nan,
 nan,
 nan,
 36.0,
 52.0,
 30.0,
 50.0,
 nan,
 27.0,
 30.0,
 44.0,
 40.0,
 25.0,
 36.0,
 ...]

- Checking DataFrame information

In [89]:
# verbose=True prints the full per-column summary; show_counts=True shows the non-null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 916 entries, 0 to 915
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   passengerid  916 non-null    int64  
 1   survived     916 non-null    int64  
 2   pclass       916 non-null    int64  
 3   name         916 non-null    object 
 4   gender       916 non-null    object 
 5   age          736 non-null    float64
 6   sibsp        916 non-null    int64  
 7   parch        916 non-null    int64  
 8   ticket       916 non-null    object 
 9   fare         916 non-null    float64
 10  cabin        210 non-null    object 
 11  embarked     916 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 86.0+ KB


In [90]:
df.shape

(916, 12)

In [91]:
df.head(5) # default 5

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S


In [92]:
df.tail()

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
911,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.15,,Q
913,664,0,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S
914,109,0,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S
915,146,0,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S


# Working with DataFrames

## `copy` method
- Copies a DataFrame
- Makes a deep copy by default (`deep=True`)
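
A small sketch of what the deep copy means in practice, using a made-up one-column frame: plain assignment only creates a second name for the same object, while `copy()` duplicates the data.

```python
import pandas as pd

original = pd.DataFrame({"age": [30, 25]})

alias = original        # plain assignment: both names refer to the same object
deep = original.copy()  # deep=True by default: the data is duplicated

# Modify the original after both were created.
original.loc[0, "age"] = 99

print(alias.loc[0, "age"])  # 99, because alias is the same DataFrame
print(deep.loc[0, "age"])   # 30, because the copy is independent
```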

In [93]:
df_cp = df.copy()
df_cp

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...,...,...
911,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q
913,664,0,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S
914,109,0,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S


## Renaming columns

In [94]:
cols = ["id", "생존", "객실등급", "이름", "성별", "나이", "형제자매_배우자수", "부모_자녀수",
        "티켓번호", "운임료", "객실번호", "탑승항구"]
df_cp.columns = cols
df_cp.head()

Unnamed: 0,id,생존,객실등급,이름,성별,나이,형제자매_배우자수,부모_자녀수,티켓번호,운임료,객실번호,탑승항구
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S


In [95]:
# df_cp.columns[0] = "id"  # raises TypeError: an Index is immutable, so a single name cannot be assigned this way

## `rename` method
- Renames only the columns you specify.
- Pass the mapping as a dictionary.
- Each key is the current column name and each value is the new column name.

In [96]:
df_cp = df_cp.rename(columns={"id": "아이디"})
df_cp.head()

Unnamed: 0,아이디,생존,객실등급,이름,성별,나이,형제자매_배우자수,부모_자녀수,티켓번호,운임료,객실번호,탑승항구
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S


In [97]:
# df_cp.rename(columns={"생존": "생존여부"}, inplace=True)  # not recommended

## `add_prefix` method
- Adds a common string to the front of every column name

In [98]:
df.add_prefix("pre_")

Unnamed: 0,pre_passengerid,pre_survived,pre_pclass,pre_name,pre_gender,pre_age,pre_sibsp,pre_parch,pre_ticket,pre_fare,pre_cabin,pre_embarked
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...,...,...
911,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q
913,664,0,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S
914,109,0,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S


## `add_suffix` method
- Adds a common string to the end of every column name

In [99]:
df.add_suffix("_suf")

Unnamed: 0,passengerid_suf,survived_suf,pclass_suf,name_suf,gender_suf,age_suf,sibsp_suf,parch_suf,ticket_suf,fare_suf,cabin_suf,embarked_suf
0,494,0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S
2,1286,0,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S
3,1130,1,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S
4,461,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...,...,...
911,738,1,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q
913,664,0,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S
914,109,0,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S


## Selecting specific columns

In [100]:
cols = ["age", "gender", "name"]
df[cols]

Unnamed: 0,age,gender,name
0,71.0,male,"Artagaveytia, Mr. Ramon"
1,34.0,male,"Morley, Mr. William"
2,29.0,male,"Kink-Heilmann, Mr. Anton"
3,18.0,female,"Hiltunen, Miss. Marta"
4,48.0,male,"Anderson, Mr. Harry"
...,...,...,...
911,35.0,male,"Lesurer, Mr. Gustave J"
912,,male,"Ryan, Mr. Patrick"
913,36.0,male,"Coleff, Mr. Peju"
914,38.0,male,"Rekic, Mr. Tido"


In [101]:
target = df["survived"]
target

0      0
1      0
2      0
3      1
4      1
      ..
911    1
912    0
913    0
914    0
915    0
Name: survived, Length: 916, dtype: int64

In [102]:
df[["age"]]  # a list of column names returns a 2-D DataFrame instead of a Series

Unnamed: 0,age
0,71.0
1,34.0
2,29.0
3,18.0
4,48.0
...,...
911,35.0
912,
913,36.0
914,38.0


## Dropping columns

In [103]:
df.drop("name", axis=1)  # axis=1: drop a column

Unnamed: 0,passengerid,survived,pclass,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,1,male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,3,male,34.0,0,0,364506,8.0500,,S
2,1286,0,3,male,29.0,3,1,315153,22.0250,,S
3,1130,1,2,female,18.0,1,1,250650,13.0000,,S
4,461,1,1,male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...,...
911,738,1,1,male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,3,male,,0,0,371110,24.1500,,Q
913,664,0,3,male,36.0,0,0,349210,7.4958,,S
914,109,0,3,male,38.0,0,0,349249,7.8958,,S


In [104]:
df.drop(["name", "pclass"], axis=1)

Unnamed: 0,passengerid,survived,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,male,34.0,0,0,364506,8.0500,,S
2,1286,0,male,29.0,3,1,315153,22.0250,,S
3,1130,1,female,18.0,1,1,250650,13.0000,,S
4,461,1,male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...
911,738,1,male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,male,,0,0,371110,24.1500,,Q
913,664,0,male,36.0,0,0,349210,7.4958,,S
914,109,0,male,38.0,0,0,349249,7.8958,,S


In [105]:
df.drop(columns=["name", "pclass"])

Unnamed: 0,passengerid,survived,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,494,0,male,71.0,0,0,PC 17609,49.5042,,C
1,462,0,male,34.0,0,0,364506,8.0500,,S
2,1286,0,male,29.0,3,1,315153,22.0250,,S
3,1130,1,female,18.0,1,1,250650,13.0000,,S
4,461,1,male,48.0,0,0,19952,26.5500,E12,S
...,...,...,...,...,...,...,...,...,...,...
911,738,1,male,35.0,0,0,PC 17755,512.3292,B101,C
912,518,0,male,,0,0,371110,24.1500,,Q
913,664,0,male,36.0,0,0,349210,7.4958,,S
914,109,0,male,38.0,0,0,349249,7.8958,,S


In [106]:
# df.drop(3, axis=0)  # axis=0: drop a row by index label; rarely used

In [107]:
df = df.drop(columns = ["passengerid", "survived"])
df.head()

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C
1,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S
2,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S
4,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S


## Adding a column

In [108]:
df["target"] = target
df.head()

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0
1,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S,0
2,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S,0
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S,1
4,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,1


## `sort_values` method
- Sorts the rows by the values in a column
    - Ascending order by default

In [109]:
df.sort_values("age")

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
852,3,"Dean, Miss. Elizabeth Gladys Millvina""""",female,0.17,1,2,C.A. 2315,20.5750,,S,1
480,2,"Hamalainen, Master. Viljo",male,0.67,1,1,250649,14.5000,,S,1
289,3,"Baclini, Miss. Helene Barbara",female,0.75,2,1,2666,19.2583,,C,1
41,3,"Peacock, Master. Alfred Edward",male,0.75,1,1,SOTON/O.Q. 3101315,13.7750,,S,0
587,2,"Caldwell, Master. Alden Gates",male,0.83,0,2,248738,29.0000,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
881,1,"Bradley, Mr. George (""George Arthur Brayton"")",male,,0,0,111427,26.5500,,S,1
893,3,"Rommetvedt, Mr. Knud Paust",male,,0,0,312993,7.7750,,S,0
895,3,"Moubarek, Master. Gerios",male,,1,1,2661,15.2458,,C,1
897,1,"Kenyon, Mrs. Frederick R (Marion)",female,,1,0,17464,51.8625,D21,S,1


In [110]:
df.sort_values("age", ascending=False)  # descending order

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
694,1,"Barkworth, Mr. Algernon Henry Wilson",male,80.0,0,0,27042,30.0000,A23,S,1
729,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.8500,C46,S,1
375,3,"Svensson, Mr. Johan",male,74.0,0,0,347060,7.7750,,S,0
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0
22,1,"Goldschmidt, Mr. George B",male,71.0,0,0,PC 17754,34.6542,A5,C,0
...,...,...,...,...,...,...,...,...,...,...,...
881,1,"Bradley, Mr. George (""George Arthur Brayton"")",male,,0,0,111427,26.5500,,S,1
893,3,"Rommetvedt, Mr. Knud Paust",male,,0,0,312993,7.7750,,S,0
895,3,"Moubarek, Master. Gerios",male,,1,1,2661,15.2458,,C,1
897,1,"Kenyon, Mrs. Frederick R (Marion)",female,,1,0,17464,51.8625,D21,S,1
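
In both outputs above, the rows with a missing `age` end up at the bottom. The sketch below, on a made-up four-row frame, shows that behavior together with the `na_position` parameter and sorting by more than one column.

```python
import numpy as np
import pandas as pd

# Hypothetical data; values are invented for illustration.
people = pd.DataFrame({
    "pclass": [3, 2, 3, 1],
    "age": [22.0, np.nan, 5.0, 40.0],
})

# NaN goes last by default, for ascending and descending order alike.
asc = people.sort_values("age")
desc = people.sort_values("age", ascending=False)

# na_position="first" moves the missing values to the top instead.
na_first = people.sort_values("age", na_position="first")

# Several keys: pclass ascending, then age descending within each class.
multi = people.sort_values(["pclass", "age"], ascending=[True, False])

print(asc.index.tolist())   # [2, 0, 3, 1]
print(desc.index.tolist())  # [3, 0, 2, 1]
```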


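## `sample` method
- Draws a random sample of rows
- Main parameters
    - n: number of rows to return
    - frac: fraction of rows to return
    - replace
        - True: sampling with replacement
        - False: sampling without replacement (default)
    - random_state: seed value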
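
The parameters above can be tried on a small made-up frame. This is a sketch only; `numbers` is a hypothetical ten-row table, and the `reset_index(drop=True)` step at the end is an optional addition for renumbering shuffled rows.

```python
import pandas as pd

numbers = pd.DataFrame({"value": range(10)})

# n: a fixed number of rows; the same random_state gives the same draw.
first = numbers.sample(n=3, random_state=0)
second = numbers.sample(n=3, random_state=0)

# frac: a fraction of the rows (0.5 of 10 rows gives 5 rows).
half = numbers.sample(frac=0.5, random_state=0)

# replace=True: sampling with replacement, so n may exceed the row count
# and the same row can be drawn more than once.
oversampled = numbers.sample(n=20, replace=True, random_state=0)

# frac=1 returns every row in random order; reset_index renumbers them.
shuffled = numbers.sample(frac=1, random_state=0).reset_index(drop=True)
```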


In [111]:
df.sample(5)

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
297,3,"Kiernan, Mr. Philip",male,,1,0,367229,7.75,,Q,0
193,3,"Krekorian, Mr. Neshan",male,25.0,0,0,2654,7.2292,F E57,C,0
543,3,"Andersson, Miss. Sigrid Elisabeth",female,11.0,4,2,347082,31.275,,S,0
573,1,"Brown, Mrs. John Murray (Caroline Lane Lamson)",female,59.0,2,0,11769,51.4792,C101,S,1
586,2,"Carter, Rev. Ernest Courtenay",male,54.0,1,0,244252,26.0,,S,0


In [112]:
df.sample(frac=0.1)

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
85,3,"Stankovic, Mr. Ivan",male,33.0,0,0,349239,8.6625,,C,0
79,3,"Sutehall, Mr. Henry Jr",male,25.0,0,0,SOTON/OQ 392076,7.0500,,S,0
489,1,"Lines, Miss. Mary Conover",female,16.0,0,1,PC 17592,39.4000,D28,S,1
806,1,"Rothschild, Mrs. Martin (Elizabeth L. Barrett)",female,54.0,1,0,PC 17603,59.4000,,C,1
623,2,"Jacobsohn, Mrs. Sidney Samuel (Amy Frances Chr...",female,24.0,2,1,243847,27.0000,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
869,3,"Jussila, Miss. Mari Aina",female,21.0,1,0,4137,9.8250,,S,0
195,3,"Doyle, Miss. Elizabeth",female,24.0,0,0,368702,7.7500,,Q,1
31,3,"Hansen, Mr. Claus Peter",male,41.0,2,0,350026,14.1083,,S,0
684,1,"Madill, Miss. Georgette Alexandra",female,15.0,0,1,24160,211.3375,B5,S,1


In [113]:
df.sample(frac=1, random_state=42) # shuffle

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
380,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.0500,,S,0
879,1,"Natsch, Mr. Charles H",male,37.0,0,1,PC 17596,29.7000,C118,C,0
355,1,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0,17474,57.0000,B20,S,1
357,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.3750,B57 B59 B63 B66,C,0
362,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.5500,,S,0
...,...,...,...,...,...,...,...,...,...,...,...
106,3,"Lockyer, Mr. Edward",male,,0,0,1222,7.8792,,S,0
270,3,"Nicola-Yarred, Miss. Jamila",female,14.0,1,0,2651,11.2417,,C,1
860,2,"Cook, Mrs. (Selena Rogers)",female,22.0,0,0,W./C. 14266,10.5000,F33,S,1
435,3,"Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu...",female,33.0,3,0,3101278,15.8500,,S,1


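# Working with rows and columns
- In a DataFrame, the row labels are called the index and the column labels are called the columns.
- Because a DataFrame is built on NumPy, every row and column also has an integer position.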
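
The difference between labels and positions is easiest to see when the index is not 0, 1, 2. The following sketch uses a made-up frame (`frame` and its index values are assumptions) and contrasts label-based `loc` with position-based `iloc`.

```python
import pandas as pd

# Hypothetical frame with non-default row labels.
frame = pd.DataFrame(
    {"age": [30, 25, 41], "height": [170, 180, 175]},
    index=[10, 20, 30],
)

print(frame.index.tolist())    # row labels: [10, 20, 30]
print(frame.columns.tolist())  # column labels: ['age', 'height']

# Label-based access: row labelled 20, column labelled "height".
by_label = frame.loc[20, "height"]

# Position-based access: second row, second column (both counted from 0).
by_position = frame.iloc[1, 1]

# Slices differ: loc includes the end label, iloc excludes the end position.
label_slice = frame.loc[10:20]
position_slice = frame.iloc[0:1]
```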



In [114]:
np.arange(12).reshape(3,4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

## `iloc`
- Slicing by row position and column position


In [115]:
df2 = df.sample(frac=1, random_state=42)
df2.head()

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
380,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,0
879,1,"Natsch, Mr. Charles H",male,37.0,0,1,PC 17596,29.7,C118,C,0
355,1,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0,17474,57.0,B20,S,1
357,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C,0
362,3,"Sage, Mr. John George",male,,1,9,CA. 2343,69.55,,S,0


In [116]:
df2.iloc[:3]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
380,3,"Moutal, Mr. Rahamin Haim",male,,0,0,374746,8.05,,S,0
879,1,"Natsch, Mr. Charles H",male,37.0,0,1,PC 17596,29.7,C118,C,0
355,1,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0,17474,57.0,B20,S,1


In [117]:
df2.iloc[1:5, 1:6]

Unnamed: 0,name,gender,age,sibsp,parch
879,"Natsch, Mr. Charles H",male,37.0,0,1
355,"Dick, Mrs. Albert Adrian (Vera Gillespie)",female,17.0,1,0
357,"Ryerson, Master. John Borie",male,13.0,2,2
362,"Sage, Mr. John George",male,,1,9


In [118]:
num_rows = [1,3]
num_cols = [0,4,5]
df2.iloc[num_rows, num_cols]

Unnamed: 0,pclass,sibsp,parch
879,1,0,1
357,1,2,2


In [119]:
mask = np.full(11, True)
mask[-1] = False
mask

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True, False])

In [120]:
df2.iloc[num_rows, mask]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
879,1,"Natsch, Mr. Charles H",male,37.0,0,1,PC 17596,29.7,C118,C
357,1,"Ryerson, Master. John Borie",male,13.0,2,2,PC 17608,262.375,B57 B59 B63 B66,C


## `loc`
- Slicing by index labels and column names
- Rows and columns can also be selected with a boolean mask

In [121]:
df.loc[:3]  # the end label is included

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0
1,3,"Morley, Mr. William",male,34.0,0,0,364506,8.05,,S,0
2,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.025,,S,0
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0,,S,1


In [122]:
cols = ["name", "age", "parch"]
df.loc[:, cols]

Unnamed: 0,name,age,parch
0,"Artagaveytia, Mr. Ramon",71.0,0
1,"Morley, Mr. William",34.0,0
2,"Kink-Heilmann, Mr. Anton",29.0,1
3,"Hiltunen, Miss. Marta",18.0,1
4,"Anderson, Mr. Harry",48.0,0
...,...,...,...
911,"Lesurer, Mr. Gustave J",35.0,0
912,"Ryan, Mr. Patrick",,0
913,"Coleff, Mr. Peju",36.0,0
914,"Rekic, Mr. Tido",38.0,0


In [123]:
df[cols]

Unnamed: 0,name,age,parch
0,"Artagaveytia, Mr. Ramon",71.0,0
1,"Morley, Mr. William",34.0,0
2,"Kink-Heilmann, Mr. Anton",29.0,1
3,"Hiltunen, Miss. Marta",18.0,1
4,"Anderson, Mr. Harry",48.0,0
...,...,...,...
911,"Lesurer, Mr. Gustave J",35.0,0
912,"Ryan, Mr. Patrick",,0
913,"Coleff, Mr. Peju",36.0,0
914,"Rekic, Mr. Tido",38.0,0


## Masking

In [124]:
mask = df["target"] == 1
df.loc[mask]  # df[mask] gives the same result

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S,1
4,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S,1
7,3,"McCoy, Miss. Agnes",female,,2,0,367226,23.2500,,Q,1
11,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,1
13,2,"Collyer, Mrs. Harvey (Charlotte Annie Tate)",female,31.0,1,1,C.A. 31921,26.2500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
905,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judi...",female,22.0,1,0,347072,13.9000,,S,1
906,3,"Dahl, Mr. Karl Edwart",male,45.0,0,0,7598,8.0500,,S,1
907,1,"Hoyt, Mr. Frederick Maxfield",male,38.0,1,0,19943,90.0000,C93,S,1
908,3,"Landergren, Miss. Aurora Adelia",female,22.0,0,0,C 7077,7.2500,,S,1


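## Masking with multiple conditions
- Python's logical operators for booleans do not work element-wise on a DataFrame or Series (the same is true of NumPy arrays)
- Use the bitwise operators instead
- `|` instead of `or`
- `&` instead of `and`
- `~` instead of `not`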
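
A short sketch of why the bitwise operators are needed, on a made-up four-row frame. It also shows that inline conditions must be wrapped in parentheses, because `&` binds more tightly than `==` and `<`.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration.
people = pd.DataFrame({
    "target": [1, 0, 1, 0],
    "age": [18.0, 15.0, 40.0, float("nan")],
})

survived = people["target"] == 1
young = people["age"] < 20  # a comparison with NaN is False

# `and` asks for the truth value of a whole Series, which is ambiguous.
try:
    survived and young
except ValueError as err:
    message = str(err)

both = people[survived & young]        # element-wise AND
either = people[survived | young]      # element-wise OR
neither = people[~(survived | young)]  # element-wise NOT

# Parentheses are required when the conditions are written inline.
inline = people[(people["target"] == 1) & (people["age"] < 20)]
```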

In [125]:
df.loc[(df["target"] == 1) & (df["age"] < 20)]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S,1
30,3,"Touma, Miss. Maria Youssef",female,9.0,1,1,2650,15.2458,,C,1
45,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.5250,C62 C64,C,1
55,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0000,,S,1
72,3,"Badman, Miss. Emily Louisa",female,18.0,0,0,A/4 31416,8.0500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
859,2,"Ilett, Miss. Bertha",female,17.0,0,0,SO/C 14885,10.5000,,S,1
867,3,"Aks, Mrs. Sam (Leah Rosen)",female,18.0,0,1,392091,9.3500,,S,1
874,3,"Karun, Miss. Manca",female,4.0,0,1,349256,13.4167,,C,1
889,2,"Quick, Miss. Phyllis May",female,2.0,1,1,26360,26.0000,,S,1


In [126]:
mask1 = df["target"] == 1
mask2 = df["age"] < 20
df.loc[mask1 & mask2]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S,1
30,3,"Touma, Miss. Maria Youssef",female,9.0,1,1,2650,15.2458,,C,1
45,1,"Astor, Mrs. John Jacob (Madeleine Talmadge Force)",female,18.0,1,0,PC 17757,227.5250,C62 C64,C,1
55,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0000,,S,1
72,3,"Badman, Miss. Emily Louisa",female,18.0,0,0,A/4 31416,8.0500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
859,2,"Ilett, Miss. Bertha",female,17.0,0,0,SO/C 14885,10.5000,,S,1
867,3,"Aks, Mrs. Sam (Leah Rosen)",female,18.0,0,1,392091,9.3500,,S,1
874,3,"Karun, Miss. Manca",female,4.0,0,1,349256,13.4167,,C,1
889,2,"Quick, Miss. Phyllis May",female,2.0,1,1,26360,26.0000,,S,1


In [127]:
mask = df["target"] == 1
df.loc[mask]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S,1
4,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S,1
7,3,"McCoy, Miss. Agnes",female,,2,0,367226,23.2500,,Q,1
11,2,"Williams, Mr. Charles Eugene",male,,0,0,244373,13.0000,,S,1
13,2,"Collyer, Mrs. Harvey (Charlotte Annie Tate)",female,31.0,1,1,C.A. 31921,26.2500,,S,1
...,...,...,...,...,...,...,...,...,...,...,...
905,3,"Dyker, Mrs. Adolf Fredrik (Anna Elisabeth Judi...",female,22.0,1,0,347072,13.9000,,S,1
906,3,"Dahl, Mr. Karl Edwart",male,45.0,0,0,7598,8.0500,,S,1
907,1,"Hoyt, Mr. Frederick Maxfield",male,38.0,1,0,19943,90.0000,C93,S,1
908,3,"Landergren, Miss. Aurora Adelia",female,22.0,0,0,C 7077,7.2500,,S,1


In [128]:
df.loc[~mask]

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0
1,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S,0
2,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S,0
5,3,"Dika, Mr. Mirko",male,17.0,0,0,349232,7.8958,,S,0
6,3,"Ekstrom, Mr. Johan",male,45.0,0,0,347061,6.9750,,S,0
...,...,...,...,...,...,...,...,...,...,...,...
910,3,"Cacic, Miss. Marija",female,30.0,0,0,315084,8.6625,,S,0
912,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q,0
913,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S,0
914,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S,0


## Selecting columns by data type

In [129]:
df.select_dtypes("float64")

Unnamed: 0,age,fare
0,71.0,49.5042
1,34.0,8.0500
2,29.0,22.0250
3,18.0,13.0000
4,48.0,26.5500
...,...,...
911,35.0,512.3292
912,,24.1500
913,36.0,7.4958
914,38.0,7.8958


In [130]:
df.select_dtypes(["float64", "int64"])

Unnamed: 0,pclass,age,sibsp,parch,fare,target
0,1,71.0,0,0,49.5042,0
1,3,34.0,0,0,8.0500,0
2,3,29.0,3,1,22.0250,0
3,2,18.0,1,1,13.0000,1
4,1,48.0,0,0,26.5500,1
...,...,...,...,...,...,...
911,1,35.0,0,0,512.3292,1
912,3,,0,0,24.1500,0
913,3,36.0,0,0,7.4958,0
914,3,38.0,0,0,7.8958,0


In [131]:
df.select_dtypes("number")

Unnamed: 0,pclass,age,sibsp,parch,fare,target
0,1,71.0,0,0,49.5042,0
1,3,34.0,0,0,8.0500,0
2,3,29.0,3,1,22.0250,0
3,2,18.0,1,1,13.0000,1
4,1,48.0,0,0,26.5500,1
...,...,...,...,...,...,...
911,1,35.0,0,0,512.3292,1
912,3,,0,0,24.1500,0
913,3,36.0,0,0,7.4958,0
914,3,38.0,0,0,7.8958,0


In [132]:
df.select_dtypes("object")

Unnamed: 0,name,gender,ticket,cabin,embarked
0,"Artagaveytia, Mr. Ramon",male,PC 17609,,C
1,"Morley, Mr. William",male,364506,,S
2,"Kink-Heilmann, Mr. Anton",male,315153,,S
3,"Hiltunen, Miss. Marta",female,250650,,S
4,"Anderson, Mr. Harry",male,19952,E12,S
...,...,...,...,...,...
911,"Lesurer, Mr. Gustave J",male,PC 17755,B101,C
912,"Ryan, Mr. Patrick",male,371110,,Q
913,"Coleff, Mr. Peju",male,349210,,S
914,"Rekic, Mr. Tido",male,349249,,S


In [133]:
cols = df.select_dtypes("object").columns
cols

Index(['name', 'gender', 'ticket', 'cabin', 'embarked'], dtype='object')

In [134]:
df[cols]

Unnamed: 0,name,gender,ticket,cabin,embarked
0,"Artagaveytia, Mr. Ramon",male,PC 17609,,C
1,"Morley, Mr. William",male,364506,,S
2,"Kink-Heilmann, Mr. Anton",male,315153,,S
3,"Hiltunen, Miss. Marta",female,250650,,S
4,"Anderson, Mr. Harry",male,19952,E12,S
...,...,...,...,...,...
911,"Lesurer, Mr. Gustave J",male,PC 17755,B101,C
912,"Ryan, Mr. Patrick",male,371110,,Q
913,"Coleff, Mr. Peju",male,349210,,S
914,"Rekic, Mr. Tido",male,349249,,S
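
`select_dtypes` also accepts an `exclude` argument. The sketch below uses a made-up frame with integer, float, and boolean columns to show `include` and `exclude` side by side; note that `"number"` does not cover boolean columns.

```python
import pandas as pd

# Hypothetical data; values are invented for illustration.
mixed = pd.DataFrame({
    "pclass": [1, 3],
    "fare": [49.5, 8.05],
    "survived": [True, False],
})

numeric = mixed.select_dtypes(include="number")      # int and float columns
non_numeric = mixed.select_dtypes(exclude="number")  # everything else
only_bool = mixed.select_dtypes(include="bool")

print(numeric.columns.tolist())      # ['pclass', 'fare']
print(non_numeric.columns.tolist())  # ['survived']
```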


# Operators covered so far also work between Series

In [135]:
df

Unnamed: 0,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked,target
0,1,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,0
1,3,"Morley, Mr. William",male,34.0,0,0,364506,8.0500,,S,0
2,3,"Kink-Heilmann, Mr. Anton",male,29.0,3,1,315153,22.0250,,S,0
3,2,"Hiltunen, Miss. Marta",female,18.0,1,1,250650,13.0000,,S,1
4,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.5500,E12,S,1
...,...,...,...,...,...,...,...,...,...,...,...
911,1,"Lesurer, Mr. Gustave J",male,35.0,0,0,PC 17755,512.3292,B101,C,1
912,3,"Ryan, Mr. Patrick",male,,0,0,371110,24.1500,,Q,0
913,3,"Coleff, Mr. Peju",male,36.0,0,0,349210,7.4958,,S,0
914,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S,0


In [136]:
df["ticket"] * df["target"]

0              
1              
2              
3        250650
4         19952
         ...   
911    PC 17755
912            
913            
914            
915            
Length: 916, dtype: object
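
The output above follows from ordinary Python behavior applied row by row: a string times 0 is an empty string, and a string times 1 is the string itself. A minimal sketch with made-up values (`trip` is a hypothetical frame, and the ticket column is given `object` dtype explicitly to match the output above):

```python
import pandas as pd

# Hypothetical data; values are invented for illustration.
trip = pd.DataFrame({
    "sibsp": [1, 0, 3],
    "parch": [0, 2, 1],
    "fare": [10.0, 20.0, 40.0],
    "ticket": pd.Series(["A1", "B2", "C3"], dtype=object),
    "target": [1, 0, 1],
})

# Arithmetic between Series works row by row.
family_size = trip["sibsp"] + trip["parch"] + 1
fare_per_person = trip["fare"] / family_size

# Comparison operators return a boolean Series.
expensive = trip["fare"] > 15

# str * int repeats the string: times 0 gives "", times 1 keeps it.
kept_ticket = trip["ticket"] * trip["target"]

print(kept_ticket.tolist())  # ['A1', '', 'C3']
```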

# `map` method

In [137]:
df["gender"].map({"male": 1, "female": 0})

0      1
1      1
2      1
3      0
4      1
      ..
911    1
912    1
913    1
914    1
915    1
Name: gender, Length: 916, dtype: int64

In [138]:
df["gender"].map(lambda x: 1 if x == "male" else 0)

0      1
1      1
2      1
3      0
4      1
      ..
911    1
912    1
913    1
914    1
915    1
Name: gender, Length: 916, dtype: int64
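
The two calls above give the same result on this data, but they differ when a value is not covered. A small sketch with a made-up Series that contains an unexpected category:

```python
import pandas as pd

gender = pd.Series(["male", "female", "unknown"])

# With a dict, values missing from the dict become NaN.
by_dict = gender.map({"male": 1, "female": 0})

# With a function, every value is passed through it, so the else branch
# silently turns "unknown" into 0.
by_func = gender.map(lambda x: 1 if x == "male" else 0)

print(by_dict.isna().tolist())  # [False, False, True]
print(by_func.tolist())         # [1, 0, 0]
```

The dict form makes unmapped categories visible as missing values, which can be useful when checking data quality.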