# EDA
- Exploratory data analysis
- The idea that, throughout the process of analyzing data and deriving results, you should keep exploring and understanding the data itself

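## Types of data
- Numeric data
    - Continuous (values occur along a continuum)
        - Real-valued data that can take any value within a range (e.g. temperature, height, fare when it has a fractional part, wind speed)
    - Discrete (such as the number of times an event occurs)
        - Integer-valued data such as counts (e.g. number of occurrences of an event, number of rooms, number of parents/children)
- Categorical data
    - Data that can only take values from a fixed set of categories (e.g. gender, major, genre, movie rating, job rank). It is either ordinal or nominal.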
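
As a rough first pass, the numeric/categorical split above can be approximated from pandas dtypes. The sketch below uses a small made-up DataFrame (the column names and values are hypothetical, not taken from the Titanic file) and `select_dtypes`. Keep in mind that dtype alone cannot distinguish a true count from an integer-coded category, so the result still needs a manual check.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "temperature": [21.5, 19.0, 23.2],        # continuous numeric
    "rooms": [2, 3, 1],                       # discrete numeric (a count)
    "genre": ["drama", "comedy", "drama"],    # nominal categorical
    "rank": ["junior", "senior", "junior"],   # ordinal categorical
})

# Columns with a numeric dtype vs. everything else
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```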

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [13]:
# Directory that holds the data files; the file name is appended when reading
DATA_PATH = "/content/drive/MyDrive/01-python/data/"

In [16]:
df = pd.read_csv(f"{DATA_PATH}titanic.csv")
df.head()

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   passengerid  1309 non-null   int64  
 1   survived     1309 non-null   int64  
 2   pclass       1309 non-null   int64  
 3   name         1309 non-null   object 
 4   gender       1309 non-null   object 
 5   age          1046 non-null   float64
 6   sibsp        1309 non-null   int64  
 7   parch        1309 non-null   int64  
 8   ticket       1309 non-null   object 
 9   fare         1308 non-null   float64
 10  cabin        295 non-null    object 
 11  embarked     1307 non-null   object 
dtypes: float64(2), int64(5), object(5)
memory usage: 122.8+ KB
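
The `info()` output shows that `age`, `fare`, `cabin`, and `embarked` have fewer non-null values than the 1309 rows. One way to see the missing counts directly is `isna()`. The sketch below runs on a small hypothetical DataFrame rather than the real file, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps, for illustration only
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "cabin": [None, "C85", None, None],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_ratio = toy.isna().mean()   # fraction of missing values per column

print(missing_count)
print(missing_ratio)
```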


# Analyzing numeric data

- Sum

In [22]:
df["fare"].sum()

43550.4869

- Mean

In [19]:
df["fare"].mean()

33.29547928134557

- Median

In [21]:
df["fare"].median()

14.4542
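
The mean of `fare` (about 33.3) equals the sum divided by the 1308 non-null values, which is consistent with pandas skipping `NaN` by default, and it sits well above the median (about 14.5). The sketch below reproduces both effects on a few made-up values: `NaN` handling and the median's robustness to a single large value.

```python
import numpy as np
import pandas as pd

# Hypothetical values: one missing entry and one very large entry
s = pd.Series([10.0, 20.0, np.nan, 30.0, 1000.0])

total = s.sum()                      # NaN is skipped
mean = s.mean()                      # divided by the 4 non-null values
median = s.median()                  # middle of [10, 20, 30, 1000]
mean_strict = s.mean(skipna=False)   # NaN propagates when skipna=False

print(total, mean, median, mean_strict)
```

The large value pulls the mean far above the median, which is the same pattern the `fare` column shows.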

- Variance

In [24]:
df["fare"].var()

2678.959737892891

- Standard deviation

In [25]:
df["fare"].std()


51.75866823917411
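
One detail worth knowing here: `Series.var()` and `Series.std()` default to `ddof=1` (the sample statistic), while `numpy.var` defaults to `ddof=0` (the population statistic). The following sketch, on made-up values, shows the difference and confirms that the standard deviation is the square root of the variance.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = s.var()                # pandas default: ddof=1, 32 / 7
population_var = s.var(ddof=0)      # 32 / 8
numpy_var = np.var(s.to_numpy())    # numpy default: ddof=0
sample_std = s.std()                # square root of sample_var

print(sample_var, population_var, numpy_var, sample_std)
```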

- Quantiles

In [28]:
df["fare"].quantile([0.25,0.5,0.75])


0.25     7.8958
0.50    14.4542
0.75    31.2750
Name: fare, dtype: float64

In [29]:
df["fare"].quantile([0.5,0.95])

0.50     14.4542
0.95    133.6500
Name: fare, dtype: float64
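
The 0.5 quantile above matches the median (14.4542). A common follow-up, not part of the original notebook, is the interquartile range (IQR), the distance between the 0.25 and 0.75 quantiles. A minimal sketch on made-up values, relying on pandas' default linear interpolation:

```python
import pandas as pd

# Hypothetical values with one large entry
s = pd.Series([1, 2, 3, 4, 5, 100])

q1, q2, q3 = s.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1   # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
print(q2 == s.median())
```

Unlike the standard deviation, the IQR is barely affected by the single large value.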

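- Skewness
    - A statistic that measures the asymmetry of a distribution
    - If the bulk of the distribution sits on the right and it has a long tail to the left, skewness is negative
    - If the bulk of the distribution sits on the left and it has a long tail to the right, skewness is positive
    - For a symmetric distribution such as the normal distribution, skewness is close to 0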

In [30]:
df["fare"].skew()

4.367709134122922
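
The large positive value for `fare` indicates a long right tail, in line with its mean being well above its median. The sign rules can be checked on small made-up series; the sketch below mirrors a right-tailed series to get a left-tailed one and compares both with a symmetric series.

```python
import pandas as pd

# Hypothetical values, for illustration only
right_tail = pd.Series([1, 1, 1, 2, 2, 3, 10])   # bulk on the left, long right tail
left_tail = 11 - right_tail                      # mirror image: long left tail
symmetric = pd.Series([1, 2, 3, 4, 5])

print(right_tail.skew())   # positive
print(left_tail.skew())    # negative, same magnitude
print(symmetric.skew())    # approximately 0
```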

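- Correlation coefficient
    - The correlation coefficient developed by Karl Pearson
    - Measures how strongly the changes in two numeric variables are linearly related
    - Takes values between -1 and +1
    - The closer to +1, the stronger the positive correlation
    - The closer to -1, the stronger the negative correlation
    - The closer to 0, the weaker the linear relationship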

In [None]:
df.head()

In [31]:
cols =["survived","age","sibsp","parch","fare"]
df[cols].corr()

Unnamed: 0,survived,age,sibsp,parch,fare
survived,1.0,-0.053695,0.00237,0.108919,0.233622
age,-0.053695,1.0,-0.243699,-0.150917,0.17874
sibsp,0.00237,-0.243699,1.0,0.373587,0.160238
parch,0.108919,-0.150917,0.373587,1.0,0.221539
fare,0.233622,0.17874,0.160238,0.221539,1.0
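
To make the definition concrete, the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up values and compares the result with `Series.corr`, whose default method is Pearson.

```python
import pandas as pd

# Hypothetical paired values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance (ddof=1, matching pandas' default std)
cov_xy = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

r_manual = cov_xy / (x.std() * y.std())
r_pandas = x.corr(y)   # method="pearson" by default

print(r_manual, r_pandas)
```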


# Analyzing categorical data

## Viewing unique values

In [33]:
df["embarked"].nunique() # number of distinct categories (relevant for one-hot encoding)

3

In [35]:
df["embarked"].unique()

array(['S', 'C', 'Q', nan], dtype=object)
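
Note that `nunique()` returned 3 while `unique()` lists four entries: `nunique()` drops `NaN` by default, whereas `unique()` keeps it. The sketch below shows this on a made-up series and, since the cell above mentions one-hot encoding, also applies `pd.get_dummies` as one possible way to encode the categories.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking the embarked column
s = pd.Series(["S", "C", "Q", np.nan, "S"], name="embarked")

n_without_nan = s.nunique()             # NaN excluded by default
n_with_nan = s.nunique(dropna=False)    # NaN counted as its own value

# One indicator column per category; a NaN row gets no indicator set
dummies = pd.get_dummies(s)

print(n_without_nan, n_with_nan)
print(dummies)
```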

- Mode

In [36]:
df["embarked"].mode()

0    S
dtype: object
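
`mode()` returns a Series rather than a single value because several values can tie for the highest frequency. A short sketch with made-up values:

```python
import pandas as pd

# Hypothetical values where "a" and "b" both appear twice
tied = pd.Series(["a", "b", "a", "b", "c"])

modes = tied.mode()          # all tied values, sorted
first_mode = modes.iloc[0]   # pick one if a single value is needed

print(modes.tolist())
print(first_mode)
```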

- Counts per category

In [38]:
df["embarked"].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

In [39]:
df["embarked"].value_counts(normalize=True)

S    0.699311
C    0.206580
Q    0.094109
Name: embarked, dtype: float64
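
The proportions above are computed over the 1307 non-null rows (914 / 1307 ≈ 0.699311), because `value_counts()` drops `NaN` by default. The sketch below, on made-up values, shows how `dropna=False` changes both the counts and the denominator.

```python
import numpy as np
import pandas as pd

# Hypothetical values with one missing entry
s = pd.Series(["S", "S", "C", np.nan])

counts = s.value_counts()                                  # NaN excluded
share_non_null = s.value_counts(normalize=True)            # divided by 3
share_all = s.value_counts(normalize=True, dropna=False)   # divided by 4

print(counts)
print(share_non_null)
print(share_all)
```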

In [41]:
df["cabin"].nunique()

186

In [47]:
df["cabin"].value_counts().head(60)

C23 C25 C27        6
G6                 5
B57 B59 B63 B66    5
C22 C26            4
F33                4
F2                 4
B96 B98            4
C78                4
F4                 4
D                  4
E34                3
B58 B60            3
A34                3
E101               3
C101               3
B51 B53 B55        3
C31                2
C55 C57            2
D37                2
C54                2
B35                2
C32                2
C7                 2
C124               2
E50                2
C6                 2
E44                2
C46                2
C92                2
D21                2
C116               2
C85                2
D20                2
B45                2
E8                 2
E121               2
E24                2
C62 C64            2
F G63              2
B20                2
B5                 2
B71                2
C126               2
D17                2
D19                2
B69                2
B41                2
C68          

In [48]:
df.tail()

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
1304,1305,0,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
1305,1306,1,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
1306,1307,0,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
1307,1308,0,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
1308,1309,0,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


- Relationship between two categorical variables

In [52]:
pd.crosstab(df["gender"],df["survived"],margins=True)

survived,0,1,All
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
female,81,385,466
male,734,109,843
All,815,494,1309


In [54]:
pd.crosstab(df["gender"],df["survived"],margins=True,normalize="index") # view as row proportions

survived,0,1
gender,Unnamed: 1_level_1,Unnamed: 2_level_1
female,0.17382,0.82618
male,0.8707,0.1293
All,0.622613,0.377387
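
With `normalize="index"`, each row is divided by its own total, so every row sums to 1 (for example, 385 / 466 ≈ 0.826 for `female`). The sketch below reproduces this on a small made-up table and shows that, for a 0/1 column, a grouped mean gives the same rate; the data and the `toy` name are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male", "male"],
    "survived": [1, 1, 0, 0, 0, 0, 1],
})

# Row-wise proportions: each gender's row sums to 1
row_share = pd.crosstab(toy["gender"], toy["survived"], normalize="index")

# Equivalent survival rate per gender for a 0/1 target
rate = toy.groupby("gender")["survived"].mean()

print(row_share)
print(rate)
```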


In [None]:
df.head()

In [66]:
pd.crosstab(df["pclass"],df["gender"],margins=True,normalize="index")

gender,female,male
pclass,Unnamed: 1_level_1,Unnamed: 2_level_1
1,0.44582,0.55418
2,0.382671,0.617329
3,0.304654,0.695346
All,0.355997,0.644003


In [86]:
pd.crosstab(df["survived"],df["ticket"],margins=True,normalize="index")



ticket,110152,110413,110465,110469,110489,110564,110813,111163,111240,111320,...,W./C. 14258,W./C. 14260,W./C. 14263,W./C. 14266,W./C. 6607,W./C. 6608,W./C. 6609,W.E.P. 5734,W/C 14208,WE/P 5735
survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0.0,0.001227,0.002454,0.001227,0.001227,0.0,0.001227,0.001227,0.001227,0.001227,...,0.0,0.001227,0.001227,0.0,0.003681,0.006135,0.001227,0.001227,0.001227,0.001227
1,0.006073,0.004049,0.0,0.0,0.0,0.002024,0.002024,0.0,0.0,0.0,...,0.002024,0.0,0.0,0.002024,0.002024,0.0,0.0,0.002024,0.0,0.002024
All,0.002292,0.002292,0.001528,0.000764,0.000764,0.000764,0.001528,0.000764,0.000764,0.000764,...,0.000764,0.000764,0.000764,0.000764,0.003056,0.00382,0.000764,0.001528,0.000764,0.001528


In [68]:
df.head()

Unnamed: 0,passengerid,survived,pclass,name,gender,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
