# Table of Contents
## 1.[Loading the Data](#1)
## 2.[Checking Data Size and Shape](#2)
## 3.[Preprocessing to Compute Party Duration](#3)

In [1]:
import pandas as pd
import numpy as np
from datetime import datetime

## 1. Loading the Data<a name = 1></a>

In [2]:
path = 'data/final_data_rev/test/'
DF_train_party = pd.read_csv(path+'test_party.csv')
DF_train_label = pd.read_csv(path+'test_label.csv')

## 2. Checking Data Size and Shape<a name = 2></a>

In [3]:
print("DF_train_party shape : ", DF_train_party.shape)
DF_train_party.head(5)

DF_train_party shape :  (4121512, 7)


| | party_start_week | party_start_day | party_start_time | party_end_week | party_end_day | party_end_time | hashed |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 11:42:01.123 | 1 | 1 | 12:11:28.466 | 633e2b44564d93ff278716bf31db234040287de8bfaab2... |
| 1 | 3 | 3 | 11:05:05.176 | 3 | 3 | 13:07:42.515 | 7176c1516207692857535c30a4650b8e8e586af1fed0fd... |
| 2 | 3 | 6 | 02:18:43.172 | 3 | 6 | 02:28:58.177 | 8092e194a750aae539862ed4405f67a6dd5b492e7e57e3... |
| 3 | 6 | 1 | 09:12:30.447 | 6 | 1 | 09:31:51.871 | 4b33f0b6969e591bb19d7ea939af5e45e08c6799ef18e7... |
| 4 | 6 | 5 | 10:58:28.822 | 6 | 5 | 11:01:05.309 | a284744f3707f84daf525d5040191fda9a46db4c368fe6... |


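## 3. Preprocessing to Compute Party Duration<a name = 3></a>
### Steps
- Using `24 * (end_day - start_day)` and `end_time - start_time` on their own does not give the duration reliably: `day` is only a day-of-week index, and a party can run past midnight (for example, starting at 23:00 on day 1 and ending at 01:00 on day 2 is 2 hours).<br/>
- So map each `(week, day)` pair onto a calendar date from January 1 to February 25.<br/>
![img](img/calender.png)
- Next, convert the data type from str to datetime so that the times can be subtracted.<br/>
- Compute the duration as `end - start`.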
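
The steps above can be sketched end to end on a tiny hand-made frame. This is a minimal sketch rather than the notebook's own code: the `toy` frame, the `to_timestamp` helper, and the base date `1900-01-01` are assumptions made for illustration; only the column names come from the data shown earlier.

```python
import pandas as pd

# Hypothetical two-row frame that reuses the column names of the party data.
toy = pd.DataFrame({
    'party_start_week': [1, 1],
    'party_start_day': [1, 7],
    'party_start_time': ['23:00:00.000', '23:30:00.000'],
    'party_end_week': [1, 2],
    'party_end_day': [2, 1],
    'party_end_time': ['01:00:00.000', '00:15:00.000'],
})

def to_timestamp(week, day, time, base='1900-01-01'):
    """Map (week, day, time) onto a timestamp counted from an assumed base date."""
    day_offset = 7 * (week - 1) + (day - 1)   # days elapsed since week 1, day 1
    return (pd.to_timedelta(day_offset, unit='D')
            + pd.to_timedelta(time)           # 'HH:MM:SS.fff' parses as a timedelta
            + pd.Timestamp(base))

start = to_timestamp(toy['party_start_week'], toy['party_start_day'], toy['party_start_time'])
end = to_timestamp(toy['party_end_week'], toy['party_end_day'], toy['party_end_time'])

# Duration in minutes; both rows cross midnight, the second also crosses a week boundary.
minutes = (end - start).dt.total_seconds() / 60
print(minutes.tolist())
```

Adding timedeltas to a base date sidesteps the month arithmetic entirely, at the cost of not reproducing the exact `month-day` strings that the cells below build.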

In [4]:
temp = 7*(DF_train_party['party_start_week'].values - 1) + (DF_train_party['party_start_day'].values)
temp_month = np.array([1 for i in range(temp.shape[0])])

for i in range(temp.shape[0]):
    if temp[i] > 31:
        temp[i] = temp[i] % 31
        temp_month[i] += 1

temp = temp.astype('str')
temp_month = temp_month.astype('str')

for i in range(temp.shape[0]):
    temp[i] = temp_month[i]+'-'+temp[i]+' '

temp = temp + DF_train_party['party_start_time'] 

DF_train_party['start'] = temp

~~~ python
temp = 7*(DF_train_party['party_start_week'].values - 1) + (DF_train_party['party_start_day'].values)
temp_month = np.array([1 for i in range(temp.shape[0])])
~~~

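`temp` is initialized with a running day number computed from week and day (e.g. week 1, day 1 => day 1; week 8, day 7 => day 56; week 6, day 2 => day 37).<br/>
`temp_month` is initialized to 1 (January) for every row. The bracketed expression `[1 for i in range(temp.shape[0])]` builds a list that contains 1 repeated `temp.shape[0]` times.<br/>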
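
The day-number formula can be checked on a few hand-picked pairs. A small sketch with illustrative `week` and `day` arrays (not rows taken from the data); it also shows `np.ones` as an equivalent way to build the all-ones month array.

```python
import numpy as np

# Illustrative (week, day) pairs.
week = np.array([1, 8, 6])
day = np.array([1, 7, 2])

# Running day number counted from week 1, day 1.
day_number = 7 * (week - 1) + day
print(day_number.tolist())  # [1, 56, 37]

# Same result as the list comprehension: one entry of 1 per row.
month = np.ones(day_number.shape[0], dtype=int)
```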
<hr/>

~~~ python
for i in range(temp.shape[0]):
    if temp[i] > 31:
        temp[i] = temp[i] % 31
        temp_month[i] += 1
~~~
When `temp[i]` exceeds 31, the row rolls over into February: the day becomes `temp[i] % 31` and the month becomes 2. Easy to follow, right?<br/>
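
With about four million rows, the same rollover can also be written without a Python loop. The following is a sketch under the assumption that day numbers never exceed 56 (week 8, day 7), so `% 31` never yields 0; the array values are illustrative.

```python
import numpy as np

# Illustrative running day numbers, including both sides of the January boundary.
day_number = np.array([1, 31, 32, 56])

past_january = day_number > 31
month = np.where(past_january, 2, 1)                             # February once day 31 is passed
day_of_month = np.where(past_january, day_number % 31, day_number)

print(month.tolist(), day_of_month.tolist())  # [1, 1, 2, 2] [1, 31, 1, 25]
```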
<hr/>

~~~ python
temp = temp.astype('str')
temp_month = temp_month.astype('str')
~~~
Convert both arrays to str with the `astype` method so that they can be joined with the party start time.<br/>
<hr/>

~~~ python
for i in range(temp.shape[0]):
    temp[i] = temp_month[i]+'-'+temp[i]+' '
~~~
Join month and day as `month-day` and append the space that is needed between the date and the time.<br/>
<hr/>

~~~ python
temp = temp + DF_train_party['party_start_time']
~~~
Concatenate the date and the time.
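
The string assembly can also be done in one vectorized expression on pandas Series. This is a sketch with hand-made `month`, `day`, and `time` Series (hypothetical names; the values mirror two of the rows shown earlier), not a replacement the notebook itself uses.

```python
import pandas as pd

month = pd.Series([1, 2])
day = pd.Series([17, 5])
time = pd.Series(['11:05:05.176', '09:12:30.447'])

# Element-wise string concatenation: 'month-day time'.
stamp = month.astype(str) + '-' + day.astype(str) + ' ' + time
print(stamp.tolist())  # ['1-17 11:05:05.176', '2-5 09:12:30.447']
```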


In [5]:
temp = 7*(DF_train_party['party_end_week'].values - 1) + (DF_train_party['party_end_day'].values)
temp_month = np.array([1 for i in range(temp.shape[0])])

for i in range(temp.shape[0]):
    if temp[i] > 31:
        temp[i] = temp[i] % 31
        temp_month[i] += 1

temp = temp.astype('str')
temp_month = temp_month.astype('str')

for i in range(temp.shape[0]):
    temp[i] = temp_month[i]+'-'+temp[i]+' '

temp = temp + DF_train_party['party_end_time']  

DF_train_party['end'] = temp

Same as above, applied to the party end columns.

In [6]:
DF_train_party['start'] = pd.to_datetime(DF_train_party['start'], format="%m-%d %H:%M:%S.%f")
DF_train_party['end'] = pd.to_datetime(DF_train_party['end'], format="%m-%d %H:%M:%S.%f")

Convert str -> datetime. The format string has no year, so every timestamp falls in the default year 1900.
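
The effect of a year-less format can be seen on two sample strings. A minimal sketch; the `raw` Series is made up for illustration and uses the same format string as the cell above.

```python
import pandas as pd

raw = pd.Series(['1-17 11:05:05.176', '2-5 09:12:30.447'])

# No %Y in the format, so the parsed year falls back to 1900.
parsed = pd.to_datetime(raw, format="%m-%d %H:%M:%S.%f")
print(parsed.dt.year.tolist())   # [1900, 1900]
print(parsed.dt.month.tolist())  # [1, 2]
```

Because 1900 is not a leap year and the data ends on February 25, the placeholder year does not distort any duration.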

In [7]:
DF_train_party.head()

| | party_start_week | party_start_day | party_start_time | party_end_week | party_end_day | party_end_time | hashed | start | end |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 11:42:01.123 | 1 | 1 | 12:11:28.466 | 633e2b44564d93ff278716bf31db234040287de8bfaab2... | 1900-01-01 11:42:01.123 | 1900-01-01 12:11:28.466 |
| 1 | 3 | 3 | 11:05:05.176 | 3 | 3 | 13:07:42.515 | 7176c1516207692857535c30a4650b8e8e586af1fed0fd... | 1900-01-17 11:05:05.176 | 1900-01-17 13:07:42.515 |
| 2 | 3 | 6 | 02:18:43.172 | 3 | 6 | 02:28:58.177 | 8092e194a750aae539862ed4405f67a6dd5b492e7e57e3... | 1900-01-20 02:18:43.172 | 1900-01-20 02:28:58.177 |
| 3 | 6 | 1 | 09:12:30.447 | 6 | 1 | 09:31:51.871 | 4b33f0b6969e591bb19d7ea939af5e45e08c6799ef18e7... | 1900-02-05 09:12:30.447 | 1900-02-05 09:31:51.871 |
| 4 | 6 | 5 | 10:58:28.822 | 6 | 5 | 11:01:05.309 | a284744f3707f84daf525d5040191fda9a46db4c368fe6... | 1900-02-09 10:58:28.822 | 1900-02-09 11:01:05.309 |


In [8]:
'''
DF_train_party = DF_train_party.drop(['party_start_week', 'party_start_day', 'party_start_time', 
                                      'party_end_week', 'party_end_day', 'party_end_time'], axis = 1)
DF_train_party.head()
'''
DF_train_party = DF_train_party.drop(['party_start_day', 'party_start_time', 'party_end_day', 'party_end_time'], axis = 1)
DF_train_party.head()

| | party_start_week | party_end_week | hashed | start | end |
|---|---|---|---|---|---|
| 0 | 1 | 1 | 633e2b44564d93ff278716bf31db234040287de8bfaab2... | 1900-01-01 11:42:01.123 | 1900-01-01 12:11:28.466 |
| 1 | 3 | 3 | 7176c1516207692857535c30a4650b8e8e586af1fed0fd... | 1900-01-17 11:05:05.176 | 1900-01-17 13:07:42.515 |
| 2 | 3 | 3 | 8092e194a750aae539862ed4405f67a6dd5b492e7e57e3... | 1900-01-20 02:18:43.172 | 1900-01-20 02:28:58.177 |
| 3 | 6 | 6 | 4b33f0b6969e591bb19d7ea939af5e45e08c6799ef18e7... | 1900-02-05 09:12:30.447 | 1900-02-05 09:31:51.871 |
| 4 | 6 | 6 | a284744f3707f84daf525d5040191fda9a46db4c368fe6... | 1900-02-09 10:58:28.822 | 1900-02-09 11:01:05.309 |


In [9]:
DF_train_party['time'] = DF_train_party['end'].values - DF_train_party['start'].values
DF_train_party.head()

| | party_start_week | party_end_week | hashed | start | end | time |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 633e2b44564d93ff278716bf31db234040287de8bfaab2... | 1900-01-01 11:42:01.123 | 1900-01-01 12:11:28.466 | 00:29:27.343000 |
| 1 | 3 | 3 | 7176c1516207692857535c30a4650b8e8e586af1fed0fd... | 1900-01-17 11:05:05.176 | 1900-01-17 13:07:42.515 | 02:02:37.339000 |
| 2 | 3 | 3 | 8092e194a750aae539862ed4405f67a6dd5b492e7e57e3... | 1900-01-20 02:18:43.172 | 1900-01-20 02:28:58.177 | 00:10:15.005000 |
| 3 | 6 | 6 | 4b33f0b6969e591bb19d7ea939af5e45e08c6799ef18e7... | 1900-02-05 09:12:30.447 | 1900-02-05 09:31:51.871 | 00:19:21.424000 |
| 4 | 6 | 6 | a284744f3707f84daf525d5040191fda9a46db4c368fe6... | 1900-02-09 10:58:28.822 | 1900-02-09 11:01:05.309 | 00:02:36.487000 |


This step should be easy to follow.<br/>
Timedelta values are awkward to compute with later on, so the durations are converted to floating-point minutes.
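
As a sanity check, the example from the start of this section (23:00 on day 1 to 01:00 on day 2) should come out as 120 minutes once both ends are real timestamps. A minimal sketch with hand-made `start` and `end` Series; the second pair reuses the first row of the data.

```python
import pandas as pd

start = pd.to_datetime(pd.Series(['1900-01-01 23:00:00.000', '1900-01-01 11:42:01.123']))
end = pd.to_datetime(pd.Series(['1900-01-02 01:00:00.000', '1900-01-01 12:11:28.466']))

# Timedelta -> float minutes; the first pair crosses midnight.
minutes = (end - start).dt.total_seconds() / 60
print(minutes.round(6).tolist())
```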

In [10]:
# Vectorized conversion of the timedelta column to minutes (float64)
DF_train_party['minutes'] = DF_train_party['time'].dt.total_seconds() / 60

In [11]:
DF_train_party['minutes'] = DF_train_party['minutes'].astype('float64')
DF_train_party.head()

| | party_start_week | party_end_week | hashed | start | end | time | minutes |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 633e2b44564d93ff278716bf31db234040287de8bfaab2... | 1900-01-01 11:42:01.123 | 1900-01-01 12:11:28.466 | 00:29:27.343000 | 29.455717 |
| 1 | 3 | 3 | 7176c1516207692857535c30a4650b8e8e586af1fed0fd... | 1900-01-17 11:05:05.176 | 1900-01-17 13:07:42.515 | 02:02:37.339000 | 122.622317 |
| 2 | 3 | 3 | 8092e194a750aae539862ed4405f67a6dd5b492e7e57e3... | 1900-01-20 02:18:43.172 | 1900-01-20 02:28:58.177 | 00:10:15.005000 | 10.250083 |
| 3 | 6 | 6 | 4b33f0b6969e591bb19d7ea939af5e45e08c6799ef18e7... | 1900-02-05 09:12:30.447 | 1900-02-05 09:31:51.871 | 00:19:21.424000 | 19.357067 |
| 4 | 6 | 6 | a284744f3707f84daf525d5040191fda9a46db4c368fe6... | 1900-02-09 10:58:28.822 | 1900-02-09 11:01:05.309 | 00:02:36.487000 | 2.608117 |


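To speed up later computation, drop the intermediate `start`, `end`, and `time` columns now that `minutes` is available. The week columns and `hashed` stay in the frame.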

In [12]:
DF_train_party.drop(['start', 'end', 'time'], axis =1, inplace = True)
# `encoding` expects a codec name or None; omit it to use the default (utf-8)
DF_train_party.to_csv(path+'party_temp2.csv', index=False)