## Titanic Dataset Challenge

- Download `train.csv` and `test.csv` from the [Titanic challenge](https://www.kaggle.com/c/titanic) on [Kaggle](https://www.kaggle.com)
- Save the two files as `titanic_train.csv` and `titanic_test.csv`, respectively

### 1. Loading the data

In [1]:
import pandas as pd
# A forward slash works on Windows, macOS, and Linux alike
titanic_df = pd.read_csv("datasets/titanic_train.csv")

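- The goal of the original Titanic challenge is to predict whether a passenger survived, based on attributes such as age, sex, passenger class, and port of embarkation
- In this exercise, however, we only go as far as analyzing the training data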

### 2. Data exploration

* Inspect `titanic_df`

In [2]:
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


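* **Survived**: the target. 0 means the passenger did not survive, 1 means the passenger survived.
* **Pclass**: passenger class (1st, 2nd, or 3rd).
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: number of siblings and spouses aboard with the passenger.
* **Parch**: number of children and parents aboard with the passenger.
* **Ticket**: ticket ID.
* **Fare**: ticket fare (in pounds).
* **Cabin**: cabin number.
* **Embarked**: where the passenger boarded. C (Cherbourg), Q (Queenstown), S (Southampton).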


* Check for missing data

In [3]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


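- Some values of the **Age**, **Cabin**, and **Embarked** attributes are null
- **Cabin** in particular is 77% null (687 of 891), so we ignore **Cabin** for now and work with the rest
- **Age** has 177 nulls (about 20%), so we need to decide how to handle them; one option is to replace the nulls with the median age
- **Name** and **Ticket** are somewhat tricky to convert to numbers, so we ignore them for now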
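The null percentages quoted above can be computed directly instead of read off `info()`. The following is a minimal sketch on a small made-up frame (`toy_df` and its values are hypothetical, not the real Titanic data); the same expression should apply to `titanic_df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# isnull() yields a boolean frame; the mean of each column is its null fraction
null_ratio = toy_df.isnull().mean().sort_values(ascending=False)
print(null_ratio)
```

Sorting in descending order puts the columns with the most missing values first, which makes candidates for dropping (such as **Cabin**) easy to spot.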

* Check summary statistics

In [4]:
titanic_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


* Only 38% of passengers **Survived**
* The mean **Fare** is 32.20 pounds
* The mean **Age** is just under 30 (29.7)

* Check that Survived (the target in machine-learning terms) contains only 0 and 1

In [5]:
titanic_df["Survived"].value_counts()

0    549
1    342
Name: Survived, dtype: int64
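As a complement to the raw counts, `value_counts(normalize=True)` returns proportions, and `isin` can verify programmatically that the target is binary. This is a minimal sketch on a made-up series (`survived` is a hypothetical stand-in for `titanic_df["Survived"]`).

```python
import pandas as pd

# Hypothetical toy target column (values are made up)
survived = pd.Series([0, 1, 1, 0, 0], name="Survived")

# Proportion of each class instead of raw counts
ratios = survived.value_counts(normalize=True)

# True only if every value is either 0 or 1
only_binary = survived.isin([0, 1]).all()

print(ratios)
print(only_binary)
```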

* Check the categorical features

In [6]:
titanic_df["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

In [7]:
titanic_df["Sex"].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [8]:
titanic_df["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

The **Embarked** feature indicates where the passenger boarded: C=Cherbourg, Q=Queenstown, S=Southampton.

### 3. Data exploration (detailed)

* View the Name and Age columns sorted by age

In [9]:
titanic_df[['Name','Age']].sort_values(by='Age')

| | Name | Age |
|---|---|---|
| 803 | Thomas, Master. Assad Alexander | 0.42 |
| 755 | Hamalainen, Master. Viljo | 0.67 |
| 644 | Baclini, Miss. Eugenie | 0.75 |
| 469 | Baclini, Miss. Helene Barbara | 0.75 |
| 78 | Caldwell, Master. Alden Gates | 0.83 |
| ... | ... | ... |
| 859 | Razi, Mr. Raihed | NaN |
| 863 | Sage, Miss. Dorothy Edith "Dolly" | NaN |
| 868 | van Melkebeke, Mr. Philemon | NaN |
| 878 | Laleff, Mr. Kristo | NaN |


* Check the names and ages of passengers aged 60 or over

In [10]:
titanic_df[titanic_df['Age'] >= 60][['Name','Age']] 

| | Name | Age |
|---|---|---|
| 33 | Wheadon, Mr. Edward H | 66.0 |
| 54 | Ostby, Mr. Engelhart Cornelius | 65.0 |
| 96 | Goldschmidt, Mr. George B | 71.0 |
| 116 | Connors, Mr. Patrick | 70.5 |
| 170 | Van der hoef, Mr. Wyckoff | 61.0 |
| 252 | Stead, Mr. William Thomas | 62.0 |
| 275 | Andrews, Miss. Kornelia Theodosia | 63.0 |
| 280 | Duane, Mr. Frank | 65.0 |
| 326 | Nysveen, Mr. Johan Hansen | 61.0 |
| 366 | Warren, Mrs. Frank Manley (Anna Sophia Atkinson) | 60.0 |


* Check for female passengers aged 60 or over who traveled in 1st class

In [11]:
titanic_df[ (titanic_df['Age'] >= 60) & (titanic_df['Pclass']==1) & (titanic_df['Sex']=='female')]

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 275 | 276 | 1 | 1 | Andrews, Miss. Kornelia Theodosia | female | 63.0 | 1 | 0 | 13502 | 77.9583 | D7 | S |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
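Chained boolean masks like the one above can get hard to read as conditions pile up. As a possible alternative, `DataFrame.query` expresses the same filter as a string. The sketch below uses a made-up frame (`toy_df` and its rows are hypothetical) and checks that both forms select the same rows.

```python
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [63.0, 62.0, 70.5, 35.0],
    "Pclass": [1, 1, 3, 1],
    "Sex": ["female", "male", "male", "female"],
})

# Boolean-mask form: each condition needs its own parentheses
mask_result = toy_df[
    (toy_df["Age"] >= 60) & (toy_df["Pclass"] == 1) & (toy_df["Sex"] == "female")
]

# Equivalent query-string form
query_result = toy_df.query("Age >= 60 and Pclass == 1 and Sex == 'female'")

print(query_result)
```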


* Check the maximum and minimum fare (Fare)

In [12]:
titanic_df['Fare'].max()

512.3292

In [13]:
titanic_df['Fare'].min()

0.0

* Check the survival rate for each passenger class (Pclass)

In [14]:
titanic_df.groupby(by='Pclass').size()

Pclass
1    216
2    184
3    491
dtype: int64

In [15]:
titanic_df.groupby(['Pclass', 'Survived']).size()

Pclass  Survived
1       0            80
        1           136
2       0            97
        1            87
3       0           372
        1           119
dtype: int64

In [16]:
titanic_df.groupby(by='Pclass')['Survived'].mean()

Pclass
1    0.629630
2    0.472826
3    0.242363
Name: Survived, dtype: float64
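The three steps above (group sizes, counts per outcome, mean of the target) can also be summarized in one table. The sketch below, on a made-up frame (`toy_df` is hypothetical), uses `pd.crosstab` with `normalize="index"` so each row sums to 1; the column for `Survived == 1` then matches the group-wise mean.

```python
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 target per group is the survival rate of that group
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

# Row-normalized contingency table: share of each outcome within a class
table = pd.crosstab(toy_df["Pclass"], toy_df["Survived"], normalize="index")

print(rate_by_class)
print(table)
```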

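* Combine features to create a new one, Family_No (family size including the passenger), to distinguish passengers traveling with family from those traveling alone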

In [17]:
titanic_df['Family_No'] = titanic_df['SibSp'] + titanic_df['Parch']+1

In [18]:
titanic_df['Family_No'].value_counts()

1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: Family_No, dtype: int64

* Check the mean survival rate for each Family_No group

In [19]:
titanic_df.groupby(by='Family_No')['Survived'].mean()

Family_No
1     0.303538
2     0.552795
3     0.578431
4     0.724138
5     0.200000
6     0.136364
7     0.333333
8     0.000000
11    0.000000
Name: Survived, dtype: float64
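Since `Family_No == 1` means the passenger traveled alone, the family-versus-alone split mentioned above can be made explicit with a binary flag. This is a minimal sketch on made-up data; the column name `Is_Alone` and the frame `toy_df` are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "SibSp":    [1, 0, 0, 3],
    "Parch":    [0, 0, 2, 2],
    "Survived": [0, 1, 1, 0],
})

# Family size including the passenger, as in the notebook
toy_df["Family_No"] = toy_df["SibSp"] + toy_df["Parch"] + 1

# Hypothetical flag: 1 if the passenger traveled alone, 0 otherwise
toy_df["Is_Alone"] = (toy_df["Family_No"] == 1).astype(int)

# Survival rate for passengers with family (0) and alone (1)
alone_rate = toy_df.groupby("Is_Alone")["Survived"].mean()
print(alone_rate)
```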

### 4. Data preprocessing (handling missing data, binning, etc.)

* Cabin column: drop it entirely

In [20]:
titanic_df.dropna(axis=1, thresh=500, inplace=True) 

In [21]:
titanic_df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked', 'Family_No'],
      dtype='object')
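The `dropna(axis=1, thresh=500)` call above removes **Cabin** indirectly: `thresh` is the minimum number of non-null values a column must have to be kept, and **Cabin** has only 204. The sketch below illustrates this on a made-up frame (`toy_df` is hypothetical) and compares it with dropping the column by name, which states the intent more directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0],       # 3 non-null values
    "Cabin": [np.nan, "C85", np.nan, np.nan],  # 1 non-null value
    "Fare":  [7.25, 71.28, 7.92, 53.1],        # 4 non-null values
})

# Keep only columns with at least 3 non-null values (Cabin falls short)
by_thresh = toy_df.dropna(axis=1, thresh=3)

# Explicit alternative: drop the column by name
by_name = toy_df.drop(columns=["Cabin"])

print(by_thresh.columns.tolist())
```

With a threshold, any other sparse column would be dropped as well, so dropping by name may be the safer choice when only one column is meant.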

* Embarked column: replace missing values with the most frequent port of embarkation

In [22]:
titanic_df['Embarked'].value_counts(dropna=False)

S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64

In [23]:
# Find the most frequent port of embarkation, to be used to replace NaN values in the Embarked column
most_freq = titanic_df['Embarked'].value_counts(dropna=True).idxmax()
most_freq

'S'

In [24]:
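# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(most_freq)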

In [25]:
titanic_df['Embarked'].value_counts(dropna=False)

S    646
C    168
Q     77
Name: Embarked, dtype: int64
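The most frequent value can also be obtained with `Series.mode()`, which ignores NaN by default. This is a minimal sketch on a made-up series (`embarked` is a hypothetical stand-in for `titanic_df['Embarked']`); note that `mode()` returns a Series because ties are possible, so the first entry is taken here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing values (values are made up)
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan], name="Embarked")

# mode() skips NaN by default and returns a Series of the most frequent value(s)
most_freq = embarked.mode()[0]

# fillna returns a new Series; the original is left unchanged
filled = embarked.fillna(most_freq)

print(most_freq)
print(filled.value_counts(dropna=False))
```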

* Age column: replace missing values with the median

In [26]:
titanic_df['Age'].isnull().sum()

177

In [27]:
median_age = titanic_df['Age'].median()
# Assign the result back instead of using inplace=True on a selected column
titanic_df['Age'] = titanic_df['Age'].fillna(median_age)

In [28]:
titanic_df['Age'].isnull().sum()

0

* Age column: split into categories

In [29]:
bins = [0,18, 25, 35, 60, 80]
group_names = ['Children','Youth', 'YoungAdult', 'MiddleAged', 'Senior']
age_cats= pd.cut(titanic_df['Age'], bins, labels=group_names)
age_cats

0           Youth
1      MiddleAged
2      YoungAdult
3      YoungAdult
4      YoungAdult
          ...    
886    YoungAdult
887         Youth
888    YoungAdult
889    YoungAdult
890    YoungAdult
Name: Age, Length: 891, dtype: category
Categories (5, object): [Children < Youth < YoungAdult < MiddleAged < Senior]

In [38]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
age_cats.value_counts()

YoungAdult    373
MiddleAged    195
Youth         162
Children      139
Senior         22
Name: Age, dtype: int64
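When reading these counts, it helps to know where the bin edges fall. By default `pd.cut` uses right-closed intervals, so with the bins above an age of exactly 18 lands in `Children` and 60 in `MiddleAged`. The sketch below illustrates this on a few made-up ages (`ages` is hypothetical) and shows how `right=False` shifts the boundaries.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 18.0, 18.5, 25.0, 60.0, 80.0])
bins = [0, 18, 25, 35, 60, 80]
group_names = ["Children", "Youth", "YoungAdult", "MiddleAged", "Senior"]

# Default: intervals are (0, 18], (18, 25], ..., (60, 80]
age_cats = pd.cut(ages, bins, labels=group_names)

# right=False: intervals are [0, 18), [18, 25), ..., [60, 80)
# The maximum age 80 then falls outside every bin and becomes NaN
left_closed = pd.cut(ages, bins, labels=group_names, right=False)

print(pd.DataFrame({"Age": ages, "default": age_cats, "right=False": left_closed}))
```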

In [39]:
Age_dummies = pd.get_dummies(age_cats)
Age_dummies= Age_dummies.add_prefix('Age_')
Age_dummies

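| | Age_Children | Age_Youth | Age_YoungAdult | Age_MiddleAged | Age_Senior |
|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... |
| 886 | 0 | 0 | 1 | 0 | 0 |
| 887 | 0 | 1 | 0 | 0 | 0 |
| 888 | 0 | 0 | 1 | 0 | 0 |
| 889 | 0 | 0 | 1 | 0 | 0 |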


* Check for duplicate rows

In [40]:
titanic_df.duplicated().sum()

0

* One-hot Encoding

In [41]:
# Perform one-hot encoding on ["Pclass", "Sex", "Embarked"]

In [42]:
Pclass_dummies = pd.get_dummies(titanic_df['Pclass'])
Pclass_dummies = Pclass_dummies.add_prefix('Pclass_')
Pclass_dummies

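| | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 886 | 0 | 1 | 0 |
| 887 | 1 | 0 | 0 |
| 888 | 0 | 0 | 1 |
| 889 | 1 | 0 | 0 |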


In [43]:
Sex_dummies = pd.get_dummies(titanic_df['Sex'])
Sex_dummies = Sex_dummies.add_prefix('Sex_')
Sex_dummies

| | Sex_female | Sex_male |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 0 |
| 3 | 1 | 0 |
| 4 | 0 | 1 |
| ... | ... | ... |
| 886 | 0 | 1 |
| 887 | 1 | 0 |
| 888 | 1 | 0 |
| 889 | 0 | 1 |


In [44]:
Embarked_dummies = pd.get_dummies(titanic_df['Embarked'])
Embarked_dummies = Embarked_dummies.add_prefix('Embarked_')
Embarked_dummies

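| | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
| ... | ... | ... | ... |
| 886 | 0 | 0 | 1 |
| 887 | 0 | 0 | 1 |
| 888 | 0 | 0 | 1 |
| 889 | 1 | 0 | 0 |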


* Concatenate the dummy DataFrames created above with the original DataFrame

In [45]:
titanic_df = pd.concat([titanic_df, Age_dummies, Pclass_dummies, Sex_dummies, Embarked_dummies], axis=1)
titanic_df.shape

(891, 25)

In [47]:
titanic_df_prepared=titanic_df.drop(['Age', 'Pclass', 'Sex', 'Embarked'], axis=1)
titanic_df_prepared.shape

(891, 21)

In [48]:
titanic_df_prepared.head()

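| | PassengerId | Survived | Name | SibSp | Parch | Ticket | Fare | Family_No | Age_Children | Age_Youth | ... | Age_MiddleAged | Age_Senior | Pclass_1 | Pclass_2 | Pclass_3 | Sex_female | Sex_male | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | Braund, Mr. Owen Harris | 1 | 0 | A/5 21171 | 7.25 | 2 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 1 | 0 | PC 17599 | 71.2833 | 2 | 0 | 0 | ... | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 2 | 3 | 1 | Heikkinen, Miss. Laina | 0 | 0 | STON/O2. 3101282 | 7.925 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 3 | 4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 1 | 0 | 113803 | 53.1 | 2 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 4 | 5 | 0 | Allen, Mr. William Henry | 0 | 0 | 373450 | 8.05 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |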
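The encode, concatenate, and drop steps for **Pclass**, **Sex**, and **Embarked** can likely be condensed into a single call: `pd.get_dummies` accepts a `columns` argument, encodes those columns with a `<column>_` prefix, and removes the originals. The sketch below runs on a made-up frame (`toy_df` is hypothetical); `dtype=int` is passed so the dummies are 0/1 as in the outputs above, since newer pandas versions default to booleans.

```python
import pandas as pd

# Hypothetical toy stand-in for titanic_df (values are made up)
toy_df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass":   [3, 1, 3],
    "Sex":      ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Encode the listed columns in one step; the source columns are dropped
# and the dummy columns are named <column>_<value>
encoded = pd.get_dummies(toy_df, columns=["Pclass", "Sex", "Embarked"], dtype=int)

print(encoded)
```

Only values present in the data get a column, so this toy frame produces no `Pclass_2` or `Embarked_Q`.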


* If there is anything else worth trying, add it here!