# Credit Card User Delinquency Prediction AI Competition

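### Data Description
- `train.csv`: personal information of credit card users, shape (26457, 20)
- `test.csv`: does not include the `credit` column, shape (10000, 19)
- `sample_submission.csv`: submission file for predictions, shape (10000, 4)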

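### Variable Description
- `index`
- `gender`: gender
- `car`: whether the user owns a car
- `reality`: whether the user owns real estate
- `child_num`: number of children
- `income_total`: annual income
- `income_type`: income category ['Commercial associate', 'Working', 'State servant', 'Pensioner', 'Student']
- `edu_type`: education level ['Higher education', 'Secondary / secondary special', 'Incomplete higher', 'Lower secondary', 'Academic degree']
- `family_type`: marital status ['Married', 'Civil marriage', 'Separated', 'Single / not married', 'Widow']
- `house_type`: living arrangement ['Municipal apartment', 'House / apartment', 'With parents', 'Co-op apartment', 'Rented apartment', 'Office apartment']
- `DAYS_BIRTH`: date of birth, counted backward in days from the data collection date (0); for example, -1 means the user was born one day before the collection date
- `DAYS_EMPLOYED`: employment start date, counted backward in days from the data collection date (0); for example, -1 means the user started working one day before the collection date. A positive value means the user is not employed
- `FLAG_MOBIL`: whether the user owns a mobile phone
- `work_phone`: whether the user has a work phone
- `phone`: whether the user has a phone
- `email`: whether the user has an email address
- `occyp_type`: occupation type
- `family_size`: family size
- `begin_month`: month the credit card was issued, counted backward in months from the data collection date (0); for example, -1 means the card was issued one month before the collection date
- `credit`: credit rating based on the user's credit card payment delinquency (a lower value means a more creditworthy user)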
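
The backward-counted columns (`DAYS_BIRTH`, `DAYS_EMPLOYED`, `begin_month`) are easier to read once they are negated into elapsed time. The following is a minimal sketch using a few sample values from `train.csv`; the names `age_years`, `employed_years`, `months_since_issue`, and the constant `DAYS_PER_YEAR` are assumptions made for illustration and are not part of the dataset.

```python
import pandas as pd

# Toy frame with the first three rows of train.csv for the offset columns.
sample = pd.DataFrame({
    "DAYS_BIRTH": [-13899, -11380, -19087],
    "DAYS_EMPLOYED": [-4709, -1540, -4434],
    "begin_month": [-6.0, -5.0, -22.0],
})

# Assumption: an average year length; exact leap-year handling is ignored.
DAYS_PER_YEAR = 365.25

# Negate the backward-counted offsets so that elapsed time is positive.
sample["age_years"] = -sample["DAYS_BIRTH"] / DAYS_PER_YEAR
sample["employed_years"] = -sample["DAYS_EMPLOYED"] / DAYS_PER_YEAR
sample["months_since_issue"] = -sample["begin_month"]

print(sample[["age_years", "employed_years", "months_since_issue"]].round(1))
```

Positive `DAYS_EMPLOYED` values (users who are not employed) would turn into negative `employed_years` here, so they need separate handling before this conversion is applied to the full data.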

In [1]:
import numpy as np
import pandas as pd

df = pd.read_csv('data/train.csv')
df.head()

Unnamed: 0,index,gender,car,reality,child_num,income_total,income_type,edu_type,family_type,house_type,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,occyp_type,family_size,begin_month,credit
0,0,F,N,N,0,202500.0,Commercial associate,Higher education,Married,Municipal apartment,-13899,-4709,1,0,0,0,,2.0,-6.0,1.0
1,1,F,N,Y,1,247500.0,Commercial associate,Secondary / secondary special,Civil marriage,House / apartment,-11380,-1540,1,0,0,1,Laborers,3.0,-5.0,1.0
2,2,M,Y,Y,0,450000.0,Working,Higher education,Married,House / apartment,-19087,-4434,1,0,1,0,Managers,2.0,-22.0,2.0
3,3,F,N,Y,0,202500.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-15088,-2092,1,0,1,0,Sales staff,2.0,-37.0,0.0
4,4,F,Y,Y,0,157500.0,State servant,Higher education,Married,House / apartment,-15037,-2105,1,0,0,0,Managers,2.0,-26.0,2.0


## Data Info

In [2]:
df.describe()

Unnamed: 0,index,child_num,income_total,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,work_phone,phone,email,family_size,begin_month,credit
count,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0,26457.0
mean,13228.0,0.428658,187306.5,-15958.053899,59068.750728,1.0,0.224742,0.294251,0.09128,2.196848,-26.123294,1.51956
std,7637.622372,0.747326,101878.4,4201.589022,137475.427503,0.0,0.41742,0.455714,0.288013,0.916717,16.55955,0.702283
min,0.0,0.0,27000.0,-25152.0,-15713.0,1.0,0.0,0.0,0.0,1.0,-60.0,0.0
25%,6614.0,0.0,121500.0,-19431.0,-3153.0,1.0,0.0,0.0,0.0,2.0,-39.0,1.0
50%,13228.0,0.0,157500.0,-15547.0,-1539.0,1.0,0.0,0.0,0.0,2.0,-24.0,2.0
75%,19842.0,1.0,225000.0,-12446.0,-407.0,1.0,0.0,1.0,0.0,3.0,-12.0,2.0
max,26456.0,19.0,1575000.0,-7705.0,365243.0,1.0,1.0,1.0,1.0,20.0,0.0,2.0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26457 entries, 0 to 26456
Data columns (total 20 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   index          26457 non-null  int64  
 1   gender         26457 non-null  object 
 2   car            26457 non-null  object 
 3   reality        26457 non-null  object 
 4   child_num      26457 non-null  int64  
 5   income_total   26457 non-null  float64
 6   income_type    26457 non-null  object 
 7   edu_type       26457 non-null  object 
 8   family_type    26457 non-null  object 
 9   house_type     26457 non-null  object 
 10  DAYS_BIRTH     26457 non-null  int64  
 11  DAYS_EMPLOYED  26457 non-null  int64  
 12  FLAG_MOBIL     26457 non-null  int64  
 13  work_phone     26457 non-null  int64  
 14  phone          26457 non-null  int64  
 15  email          26457 non-null  int64  
 16  occyp_type     18286 non-null  object 
 17  family_size    26457 non-null  float64
 18  begin_

In [4]:
df['occyp_type'].isnull().sum()

8171
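
With 8171 of 26457 rows missing (about 31%), dropping rows with a null `occyp_type` would discard a large part of the training set. One possible alternative, shown here only as a sketch, is to keep the rows and mark the missing occupation with a placeholder category. The label `"Unknown"` and the column name `occyp_type_filled` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# occyp_type values from the first five rows of train.csv (the first one is missing).
sample = pd.DataFrame({
    "occyp_type": [np.nan, "Laborers", "Managers", "Sales staff", "Managers"],
})

# Share of missing values; the mean of a boolean mask is the fraction of True entries.
missing_share = sample["occyp_type"].isnull().mean()

# Assumption: treat "missing" as its own category instead of dropping the rows.
sample["occyp_type_filled"] = sample["occyp_type"].fillna("Unknown")

print(missing_share)
print(sample["occyp_type_filled"].value_counts())
```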

## Data Values

In [5]:
df['gender'].value_counts() # gender

F    17697
M     8760
Name: gender, dtype: int64

In [6]:
df['car'].value_counts() # whether the user owns a car

N    16410
Y    10047
Name: car, dtype: int64

In [7]:
df['reality'].value_counts() # whether the user owns real estate

Y    17830
N     8627
Name: reality, dtype: int64

In [8]:
df['child_num'].value_counts() # number of children

0     18340
1      5386
2      2362
3       306
4        47
5        10
14        3
7         2
19        1
Name: child_num, dtype: int64
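
The counts above include a handful of extreme values (7, 14, and 19 children). Whether these are data errors is not stated in the source, so the following is only a sketch of how such rows could be flagged or capped. The threshold `MAX_CHILDREN` is hypothetical and chosen purely for illustration.

```python
import pandas as pd

# Distinct child_num values reported by value_counts() above.
child_num = pd.Series([0, 1, 2, 3, 4, 5, 14, 7, 19], name="child_num")

# Hypothetical threshold, not taken from the competition description.
MAX_CHILDREN = 5

# Flag values above the threshold, then cap them at the threshold.
is_outlier = child_num > MAX_CHILDREN
capped = child_num.clip(upper=MAX_CHILDREN)

print(child_num[is_outlier].tolist())
print(capped.tolist())
```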

In [11]:
df['income_total'].value_counts() # annual income

135000.0    3164
157500.0    2233
180000.0    2225
112500.0    2178
225000.0    2170
            ... 
57150.0        1
51750.0        1
87448.5        1
227250.0       1
191700.0       1
Name: income_total, Length: 249, dtype: int64

In [12]:
df['income_type'].value_counts() # income category

Working                 13645
Commercial associate     6202
Pensioner                4449
State servant            2154
Student                     7
Name: income_type, dtype: int64

In [13]:
df['edu_type'].value_counts() # education level

Secondary / secondary special    17995
Higher education                  7162
Incomplete higher                 1020
Lower secondary                    257
Academic degree                     23
Name: edu_type, dtype: int64

In [14]:
df['family_type'].value_counts() # marital status

Married                 18196
Single / not married     3496
Civil marriage           2123
Separated                1539
Widow                    1103
Name: family_type, dtype: int64

In [15]:
df['house_type'].value_counts() # living arrangement

House / apartment      23653
With parents            1257
Municipal apartment      818
Rented apartment         429
Office apartment         190
Co-op apartment          110
Name: house_type, dtype: int64

In [16]:
df['DAYS_BIRTH'].value_counts() # date of birth (relative to the data collection date)

-12676    40
-15519    38
-14667    32
-15140    26
-16768    24
          ..
-18629     1
-12786     1
-13688     1
-14543     1
-19912     1
Name: DAYS_BIRTH, Length: 6621, dtype: int64

In [17]:
df['DAYS_EMPLOYED'].value_counts() # employment start date (relative to the data collection date; positive means not employed)

 365243    4438
-401         57
-1539        47
-200         45
-2087        44
           ... 
-10475        1
-2202         1
-2552         1
-680          1
-4973         1
Name: DAYS_EMPLOYED, Length: 3470, dtype: int64
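
The value 365243 appears 4438 times and is positive, which by the variable description means "not employed" rather than an actual day count. Left as-is, it dominates the mean and standard deviation of the column. A minimal sketch of one way to separate the two meanings follows; the column names `is_unemployed` and `days_worked` are assumptions made for illustration.

```python
import pandas as pd

# The five most frequent DAYS_EMPLOYED values from the output above.
sample = pd.DataFrame({"DAYS_EMPLOYED": [365243, -401, -1539, -200, -2087]})

# Positive values mark users who are not employed.
sample["is_unemployed"] = (sample["DAYS_EMPLOYED"] > 0).astype(int)

# Negate to get days worked; the positive marker becomes negative and is clipped to 0.
sample["days_worked"] = (-sample["DAYS_EMPLOYED"]).clip(lower=0)

print(sample)
```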

In [18]:
df['FLAG_MOBIL'].value_counts() # whether the user owns a mobile phone

1    26457
Name: FLAG_MOBIL, dtype: int64
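
`FLAG_MOBIL` is 1 for all 26457 rows, so it carries no information for distinguishing users. The sketch below shows one way to detect and drop such constant columns; the variable names `constant_cols` and `reduced` are assumptions made for illustration.

```python
import pandas as pd

# FLAG_MOBIL, phone, and email values from the first five rows of train.csv.
sample = pd.DataFrame({
    "FLAG_MOBIL": [1, 1, 1, 1, 1],
    "phone": [0, 0, 1, 1, 0],
    "email": [0, 1, 0, 0, 0],
})

# A column with a single distinct value (counting NaN as a value) is constant.
constant_cols = [col for col in sample.columns
                 if sample[col].nunique(dropna=False) == 1]

reduced = sample.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```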

In [19]:
df['work_phone'].value_counts() # whether the user has a work phone

0    20511
1     5946
Name: work_phone, dtype: int64

In [20]:
df['phone'].value_counts() # whether the user has a phone

0    18672
1     7785
Name: phone, dtype: int64

In [21]:
df['email'].value_counts() # whether the user has an email address

0    24042
1     2415
Name: email, dtype: int64

In [22]:
df['occyp_type'].value_counts() # occupation type

Laborers                 4512
Core staff               2646
Sales staff              2539
Managers                 2167
Drivers                  1575
High skill tech staff    1040
Accountants               902
Medicine staff            864
Cooking staff             457
Security staff            424
Cleaning staff            403
Private service staff     243
Low-skill Laborers        127
Waiters/barmen staff      124
Secretaries                97
Realty agents              63
HR staff                   62
IT staff                   41
Name: occyp_type, dtype: int64

In [23]:
df['family_size'].value_counts() # family size

2.0     14106
1.0      5109
3.0      4632
4.0      2260
5.0       291
6.0        44
7.0         9
15.0        3
9.0         2
20.0        1
Name: family_size, dtype: int64

In [24]:
df['begin_month'].value_counts() # month the credit card was issued (relative to the data collection date)

-7.0     662
-11.0    617
-8.0     612
-3.0     593
-6.0     591
        ... 
-58.0    244
-59.0    242
-60.0    235
 0.0     231
-57.0    228
Name: begin_month, Length: 61, dtype: int64

In [25]:
df['credit'].value_counts() # credit rating based on credit card payment delinquency (lower means more creditworthy)

2.0    16968
1.0     6267
0.0     3222
Name: credit, dtype: int64
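
The three `credit` classes are clearly imbalanced. Converting the counts above into shares makes this easier to see. The sketch below rebuilds the counts by hand so that it runs without the data file; on the real frame, `df['credit'].value_counts(normalize=True)` gives the same shares directly. The names `credit_counts` and `credit_share` are assumptions made for illustration.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
credit_counts = pd.Series({2.0: 16968, 1.0: 6267, 0.0: 3222}, name="credit")

# Share of each class in the training set.
credit_share = credit_counts / credit_counts.sum()

print(credit_share.round(3))
```

Class 2.0 makes up roughly 64% of the rows, so a model that always predicts 2.0 is a useful baseline to compare against.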