## Titanic Dataset Challenge

- The goal is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation.

- Download `train.csv` and `test.csv` from the [Titanic challenge](https://www.kaggle.com/c/titanic) on [Kaggle](https://www.kaggle.com).
- Save the two files in the `datasets` directory as `titanic_train.csv` and `titanic_test.csv`, respectively.

## 1. Data Exploration

In [247]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#### 1.1 Loading the data

In [248]:
titanic_df = pd.read_csv('./datasets/titanic_train.csv')


#### 1.2 Inspecting titanic_df

In [249]:
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


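* **Survived**: the target. 0 means the passenger did not survive; 1 means the passenger survived.
* **Pclass**: passenger class (1st, 2nd, or 3rd).
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of siblings and spouses traveling with the passenger.
* **Parch**: number of children and parents traveling with the passenger.
* **Ticket**: ticket ID.
* **Fare**: ticket fare (in pounds).
* **Cabin**: cabin number.
* **Embarked**: where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton).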


#### 1.3 Inspecting missing data

In [250]:
titanic_df.isnull()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False
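
The element-wise True/False table above is hard to scan for 891 rows. A minimal sketch of summarizing it per column, assuming a small hand-made `toy_df` (hypothetical values, not the real Titanic data) in place of `titanic_df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values, not the real data)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
})

# Summing the boolean mask counts the missing values in each column
missing_counts = toy_df.isnull().sum()

# Taking the mean of the mask gives the fraction of missing values
missing_ratio = toy_df.isnull().mean()

print(missing_counts)
print(missing_ratio)
```

The same two calls should work unchanged on `titanic_df`.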


#### 1.4 Summary statistics

In [251]:
titanic_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


#### 1.5 Frequency of values in the Survived column

In [252]:
titanic_df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
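
Raw counts can also be expressed as proportions, which is often easier to read when checking class balance. A short sketch using a hypothetical five-element `survived` Series instead of the real column:

```python
import pandas as pd

# Toy stand-in for titanic_df['Survived'] (hypothetical values)
survived = pd.Series([0, 0, 0, 1, 1], name="Survived")

# normalize=True returns relative frequencies instead of counts
ratios = survived.value_counts(normalize=True)
print(ratios)
```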

#### 1.6 Frequency of the categorical features
- **Pclass**, **Sex**, **Embarked**
- The **Embarked** feature is where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.

In [253]:
titanic_df.describe(include = 'object')

Unnamed: 0,Name,Sex,Ticket,Cabin,Embarked
count,891,891,891,204,889
unique,891,2,681,147,3
top,"Braund, Mr. Owen Harris",male,347082,B96 B98,S
freq,1,577,7,4,644


In [254]:
titanic_df['Pclass'].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [255]:
titanic_df['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [256]:
titanic_df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

#### 1.7 Viewing the Name and Age columns sorted by Age

In [257]:
df = titanic_df[['Name', 'Age']].sort_values(by='Age')
df

Unnamed: 0,Name,Age
803,"Thomas, Master. Assad Alexander",0.42
755,"Hamalainen, Master. Viljo",0.67
644,"Baclini, Miss. Eugenie",0.75
469,"Baclini, Miss. Helene Barbara",0.75
78,"Caldwell, Master. Alden Gates",0.83
...,...,...
859,"Razi, Mr. Raihed",
863,"Sage, Miss. Dorothy Edith ""Dolly""",
868,"van Melkebeke, Mr. Philemon",
878,"Laleff, Mr. Kristo",
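
In the output above, rows with a missing Age end up at the bottom, which is the default behavior of `sort_values`. A small sketch of the two options that control this, using a hypothetical three-row `toy_df`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df[['Name', 'Age']] (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [30.0, np.nan, 5.0],
})

# Descending order; NaN rows still go last by default
oldest_first = toy_df.sort_values(by="Age", ascending=False)

# Ascending order, but with the NaN rows moved to the top
nan_first = toy_df.sort_values(by="Age", na_position="first")

print(oldest_first)
print(nan_first)
```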


#### 1.8 Name and Age of passengers aged 60 or older

In [258]:
titanic_df.loc[titanic_df['Age'] >= 60, ['Name', 'Age']]

Unnamed: 0,Name,Age
33,"Wheadon, Mr. Edward H",66.0
54,"Ostby, Mr. Engelhart Cornelius",65.0
96,"Goldschmidt, Mr. George B",71.0
116,"Connors, Mr. Patrick",70.5
170,"Van der hoef, Mr. Wyckoff",61.0
252,"Stead, Mr. William Thomas",62.0
275,"Andrews, Miss. Kornelia Theodosia",63.0
280,"Duane, Mr. Frank",65.0
326,"Nysveen, Mr. Johan Hansen",61.0
366,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",60.0


#### 1.9 Female first-class passengers aged 60 or older

In [259]:
is_senior = titanic_df['Age'] >= 60
is_first_class = titanic_df['Pclass'] == 1
is_female = titanic_df['Sex'] == 'female'

titanic_df[is_senior & is_first_class & is_female]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
275,276,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S
366,367,1,1,"Warren, Mrs. Frank Manley (Anna Sophia Atkinson)",female,60.0,1,0,110813,75.25,D37,C
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,
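
The same three-condition filter can be written inline or with `DataFrame.query`. A sketch on a hypothetical four-row `toy_df`; note that each comparison needs its own parentheses when the masks are combined inline with `&`:

```python
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values)
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [63.0, 60.0, 45.0, 70.0],
    "Pclass": [1, 1, 1, 3],
    "Sex": ["female", "female", "female", "male"],
})

# Inline boolean masks: parentheses are required around each comparison
by_mask = toy_df[
    (toy_df["Age"] >= 60) & (toy_df["Pclass"] == 1) & (toy_df["Sex"] == "female")
]

# Equivalent filter expressed as a query string
by_query = toy_df.query("Age >= 60 and Pclass == 1 and Sex == 'female'")

print(by_mask)
```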


#### 1.10 Maximum and minimum Fare

In [260]:
titanic_df['Fare'].max()

512.3292

In [261]:
titanic_df['Fare'].min()

0.0
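
Both values can be obtained in a single call, and a minimum of 0.0 suggests checking how many fares are exactly zero. A sketch with a hypothetical four-element `fare` Series:

```python
import pandas as pd

# Toy stand-in for titanic_df['Fare'] (hypothetical subset of values)
fare = pd.Series([7.25, 71.2833, 0.0, 512.3292], name="Fare")

# agg with a list of function names returns one value per function
fare_range = fare.agg(["min", "max"])

# Number of entries with a fare of exactly zero
zero_fares = int((fare == 0).sum())

print(fare_range)
print(zero_fares)
```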

#### 1.11 Survival rate by passenger class (Pclass)

In [262]:
titanic_df.groupby('Pclass').mean(numeric_only=True)


Pclass,PassengerId,Survived,Age,SibSp,Parch,Fare
1,461.597222,0.62963,38.233441,0.416667,0.356481,84.154687
2,445.956522,0.472826,29.87763,0.402174,0.380435,20.662183
3,439.154786,0.242363,25.14062,0.615071,0.393075,13.67555


In [263]:
titanic_df.groupby('Pclass')['Survived'].mean()


Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64

In [264]:
# Number of rows
titanic_df.groupby('Pclass').size()

Pclass
1    216
2    184
3    491
dtype: int64
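
The survival rate and the group size can be shown side by side with named aggregation. A sketch on a hypothetical six-row `toy_df`; the output column names `survival_rate` and `passengers` are illustrative choices, not names from the notebook:

```python
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values)
toy_df = pd.DataFrame({
    "Pclass": [1, 1, 2, 3, 3, 3],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Mean of the 0/1 target is the survival rate; size is the row count
class_summary = toy_df.groupby("Pclass")["Survived"].agg(
    survival_rate="mean",
    passengers="size",
)
print(class_summary)
```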

In [265]:
# value_counts()
titanic_df.groupby("Pclass").value_counts()

Pclass  PassengerId  Survived  Name                                                 Sex     Age   SibSp  Parch  Ticket             Fare      Cabin        Embarked
1       2            1         Cumings, Mrs. John Bradley (Florence Briggs Thayer)  female  38.0  1      0      PC 17599           71.2833   C85          C           1
        660          0         Newell, Mr. Arthur Webster                           male    58.0  0      2      35273              113.2750  D48          C           1
        672          0         Davidson, Mr. Thornton                               male    31.0  1      0      F.C. 12750         52.0000   B71          S           1
        680          1         Cardeza, Mr. Thomas Drake Martinez                   male    36.0  0      1      PC 17755           512.3292  B51 B53 B55  C           1
        682          1         Hassab, Mr. Hammad                                   male    27.0  0      0      PC 17572           76.7292   D49          C          

In [266]:
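# groupby is referenced here without parentheses, so this displays the bound method rather than grouped data
titanic_df.groupby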

<bound method DataFrame.groupby of      PassengerId  Survived  Pclass  \
0              1         0       3   
1              2         1       1   
2              3         1       3   
3              4         1       1   
4              5         0       3   
..           ...       ...     ...   
886          887         0       2   
887          888         1       1   
888          889         0       3   
889          890         1       1   
890          891         0       3   

                                                  Name     Sex   Age  SibSp  \
0                              Braund, Mr. Owen Harris    male  22.0      1   
1    Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                               Heikkinen, Miss. Laina  female  26.0      0   
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                             Allen, Mr. William Henry    male  35.0      0   
..                                

## 2. Data Preprocessing (handling missing data, categorization, etc.)

#### 2.1 Cabin column: dropping the entire column

In [267]:
titanic_df.drop(columns= 'Cabin')

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,S
...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C
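
Note that `drop` returns a new DataFrame and leaves the original untouched, so the call above only displays the result. A minimal sketch of keeping the change, using a hypothetical two-row `toy_df`:

```python
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values)
toy_df = pd.DataFrame({
    "PassengerId": [1, 2],
    "Cabin": [None, "C85"],
    "Fare": [7.25, 71.2833],
})

# Calling drop without assignment does not modify toy_df
preview = toy_df.drop(columns="Cabin")
still_has_cabin = "Cabin" in toy_df.columns

# Assigning the result back makes the removal persistent
toy_df = toy_df.drop(columns="Cabin")
print(toy_df.columns.tolist())
```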


#### 2.2 Embarked column: replacing missing values with the most frequent port of embarkation

In [268]:
# Count the null values
titanic_df['Embarked'].isnull().sum() 

2

In [269]:
sr= titanic_df["Embarked"].value_counts(dropna=False)

In [270]:
# Fill the 2 missing entries with S
# Get the index label of the largest count
# numpy : argmax
# pandas : idxmax
most_freq = sr.idxmax()

In [271]:
titanic_df["Embarked"].fillna(most_freq)

0      S
1      C
2      S
3      S
4      S
      ..
886    S
887    S
888    S
889    C
890    Q
Name: Embarked, Length: 891, dtype: object
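
The `fillna` call above returns a filled copy; the column in the DataFrame still contains the missing values unless the result is assigned back. A sketch on a hypothetical `toy_df`, using `mode()` as an alternative way to find the most frequent value (it ignores NaN by default):

```python
import pandas as pd

# Toy stand-in for titanic_df['Embarked'] (hypothetical values)
toy_df = pd.DataFrame({"Embarked": ["S", "C", None, "S", "Q", None]})

# mode() returns the most frequent value(s); take the first one
most_freq = toy_df["Embarked"].mode()[0]

# Assign the filled Series back to the column to keep the change
toy_df["Embarked"] = toy_df["Embarked"].fillna(most_freq)

remaining_nulls = int(toy_df["Embarked"].isnull().sum())
print(most_freq, remaining_nulls)
```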

#### 2.3 Age column: replacing missing values with the median

In [272]:
median_age = titanic_df['Age'].median() 
median_age


28.0

In [273]:
titanic_df['Age'].fillna(median_age)

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
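
As with Embarked, the filled Series has to be assigned back for later steps to see it. A sketch on a hypothetical five-row `toy_df`; `median()` skips NaN values when computing the statistic:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df['Age'] (hypothetical values)
toy_df = pd.DataFrame({"Age": [22.0, np.nan, 26.0, 35.0, np.nan]})

# The median is computed over the three non-missing ages only
median_age = toy_df["Age"].median()

# Persist the imputed values in the DataFrame
toy_df["Age"] = toy_df["Age"].fillna(median_age)
print(toy_df["Age"].tolist())
```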

#### 2.4 Age column: binning into categories

* 0-18 years
* 18-25 years
* 25-35 years
* 35-60 years
* 60-80 years

In [274]:
bins = [0,18,25,35,60,80]
pd.cut(titanic_df['Age'], bins)

0      (18.0, 25.0]
1      (35.0, 60.0]
2      (25.0, 35.0]
3      (25.0, 35.0]
4      (25.0, 35.0]
           ...     
886    (25.0, 35.0]
887    (18.0, 25.0]
888             NaN
889    (25.0, 35.0]
890    (25.0, 35.0]
Name: Age, Length: 891, dtype: category
Categories (5, interval[int64, right]): [(0, 18] < (18, 25] < (25, 35] < (35, 60] < (60, 80]]
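
The intervals above are closed on the right, so an age of exactly 18 falls in `(0, 18]` and an age of 25 in `(18, 25]`. A sketch that attaches readable labels to the same bins, using a hypothetical `ages` Series; the label strings are illustrative choices:

```python
import numpy as np
import pandas as pd

# Toy stand-in for titanic_df['Age'] (hypothetical values)
ages = pd.Series([0.42, 18.0, 25.0, 40.0, 80.0, np.nan], name="Age")

bins = [0, 18, 25, 35, 60, 80]
labels = ["0-18", "18-25", "25-35", "35-60", "60-80"]  # one label per interval

# Right-closed bins by default; a missing age stays NaN
age_group = pd.cut(ages, bins=bins, labels=labels)
print(age_group)
```

Rows with a missing Age remain NaN after binning, as row 888 shows in the output above.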

* Converting categorical data to dummy variables (one-hot encoding)
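
A minimal sketch of one-hot encoding with `pd.get_dummies`, assuming a hypothetical three-row `toy_df` with two categorical columns:

```python
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values)
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Each category becomes its own indicator column, prefixed by the source column
dummies = pd.get_dummies(toy_df, columns=["Sex", "Embarked"])
print(dummies)
```

Depending on the pandas version, the indicator columns may be boolean or integer typed; either way each row has exactly one active indicator per original column.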

#### 2.5 Checking for duplicate data
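
One possible way to check for duplicates, sketched on a hypothetical four-row `toy_df` rather than the real data: `duplicated()` flags every row that repeats an earlier one, and `drop_duplicates()` removes those rows.

```python
import pandas as pd

# Toy stand-in for titanic_df (hypothetical values; row 2 repeats row 1)
toy_df = pd.DataFrame({
    "PassengerId": [1, 2, 2, 3],
    "Name": ["A", "B", "B", "C"],
})

# True for rows that are exact copies of an earlier row
dup_mask = toy_df.duplicated()
n_duplicates = int(dup_mask.sum())

# Keep only the first occurrence of each row
deduped = toy_df.drop_duplicates()
print(n_duplicates, len(deduped))
```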