Data link: [Download](https://drive.google.com/file/d/10Am_lQnYbUhACh_z5HKlN-lBRPgoUAFp/view?usp=share_link)

In [2]:
import pandas as pd
from time import time

### Load Transaction data

- Read the data with pandas, then take a first look at its contents.

In [5]:
# Load data (pandas version)

start = time()
train = pd.read_csv('./data/transactions_train.csv')
end = time()
print(f'Time taken to load the data: {end - start} seconds')
train

Time taken to load the data: 19.63964581489563 seconds


Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2
...,...,...,...,...,...
31788319,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,929511001,0.059305,2
31788320,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,891322004,0.042356,2
31788321,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,918325001,0.043203,1
31788322,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,833459002,0.006763,1


In [6]:
# Check memory usage 

mem_usage = train.memory_usage(deep=True).sum() / 1024 / 1024 / 1024
print(f"Memory Usage : {mem_usage:.4} GiB")

Memory Usage : 6.276 GiB
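
To see which columns account for the memory, the same `memory_usage` call can be inspected per column instead of summed. The following is a minimal sketch on a made-up miniature of the table; the 64-character `customer_id` strings and all values are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical miniature of the transactions table (all values are made up).
df = pd.DataFrame({
    "t_dat": ["2018-09-20"] * 1000,
    "customer_id": ["a" * 64] * 1000,   # assumed: long hash-like strings
    "article_id": [663713001] * 1000,
    "price": [0.050831] * 1000,
    "sales_channel_id": [2] * 1000,
})

# deep=False reports only the fixed-size array storage of each column;
# deep=True additionally measures the string contents of text columns.
shallow = df.memory_usage(deep=False)
deep = df.memory_usage(deep=True)

per_column = pd.DataFrame({"shallow_bytes": shallow, "deep_bytes": deep})
print(per_column)
```

On a frame like this, the text columns (`customer_id`, `t_dat`) are expected to dominate the deep measurement, while the numeric columns report the same size either way.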


### First way to load data: parquet

- Convert the data to a parquet file, a format that can be loaded much faster than CSV.

In [7]:
# Let's rebuild it as parquet. First, read only part of the CSV.

part = pd.read_csv('./data/transactions_train.csv',
                   nrows=1000)      # read only the first 1,000 rows
part

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,663713001,0.050831,2
1,2018-09-20,000058a12d5b43e67d225668fa1f8d618c13dc232df0ca...,541518023,0.030492,2
2,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,505221004,0.015237,2
3,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687003,0.016932,2
4,2018-09-20,00007d2de826758b65a93dd24ce629ed66842531df6699...,685687004,0.016932,2
...,...,...,...,...,...
995,2018-09-20,05943a58bd172641b80919a9bdf14012df940800bc74d0...,661794001,0.152525,2
996,2018-09-20,05943a58bd172641b80919a9bdf14012df940800bc74d0...,661794001,0.152525,2
997,2018-09-20,05943a58bd172641b80919a9bdf14012df940800bc74d0...,661794002,0.152525,2
998,2018-09-20,05943a58bd172641b80919a9bdf14012df940800bc74d0...,661794002,0.152525,2


In [8]:
part2 = pd.read_csv('./data/transactions_train.csv',
                    usecols=['t_dat', 'sales_channel_id'])
part2

Unnamed: 0,t_dat,sales_channel_id
0,2018-09-20,2
1,2018-09-20,2
2,2018-09-20,2
3,2018-09-20,2
4,2018-09-20,2
...,...,...
31788319,2020-09-22,2
31788320,2020-09-22,2
31788321,2020-09-22,1
31788322,2020-09-22,1


In [9]:
# Check memory usage

mem_usage = part2.memory_usage(deep=True).sum() / 1024 / 1024 / 1024
print(f"Memory Usage : {mem_usage:.4} GiB")

Memory Usage : 2.22 GiB


### Second way to load data: I/O

- Let's load the data in pieces (chunks) instead of all at once.

In [18]:
print(part['sales_channel_id'].value_counts()) 

2    716
1    284
Name: sales_channel_id, dtype: int64


In [19]:
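# Read the CSV in chunks of 300,000 rows and accumulate per-channel counts

# Start from an all-zero Series that has the same index (channels 1 and 2)
sales = part['sales_channel_id'].value_counts() * 0

for chunk in pd.read_csv('./data/transactions_train.csv',
                         chunksize=300000):
    print(chunk['sales_channel_id'].value_counts())
    # Note: `+` aligns on the index, so a channel that is missing from
    # any single chunk turns into NaN in the running total
    sales = sales + chunk['sales_channel_id'].value_counts()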

sales

2    226143
1     73857
Name: sales_channel_id, dtype: int64
2    165439
1    134561
Name: sales_channel_id, dtype: int64
2    194943
1    105057
Name: sales_channel_id, dtype: int64
2    201745
1     98255
Name: sales_channel_id, dtype: int64
2    220733
1     79267
Name: sales_channel_id, dtype: int64
2    203909
1     96091
Name: sales_channel_id, dtype: int64
2    217776
1     82224
Name: sales_channel_id, dtype: int64
2    211571
1     88429
Name: sales_channel_id, dtype: int64
2    220108
1     79892
Name: sales_channel_id, dtype: int64
2    213843
1     86157
Name: sales_channel_id, dtype: int64
2    214965
1     85035
Name: sales_channel_id, dtype: int64
2    206437
1     93563
Name: sales_channel_id, dtype: int64
2    197220
1    102780
Name: sales_channel_id, dtype: int64
2    162149
1    137851
Name: sales_channel_id, dtype: int64
2    185695
1    114305
Name: sales_channel_id, dtype: int64
2    224869
1     75131
Name: sales_channel_id, dtype: int64
2    230780
1     69220


1           NaN
2    22379862.0
Name: sales_channel_id, dtype: float64
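
The `NaN` for channel 1 in the total above is what index-aligned addition produces when at least one chunk contains no rows for that channel. A minimal sketch of one way to avoid it, using `Series.add` with `fill_value=0`, is shown below; the two tiny chunks are made up for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical chunks: the second one contains no channel-1 rows.
chunks = [
    pd.Series([1, 2, 2, 1, 2], name="sales_channel_id"),
    pd.Series([2, 2, 2], name="sales_channel_id"),
]

naive = pd.Series(0, index=[1, 2])
safe = pd.Series(0, index=[1, 2])

for chunk in chunks:
    counts = chunk.value_counts()
    naive = naive + counts                 # a missing label becomes NaN
    safe = safe.add(counts, fill_value=0)  # a missing label is treated as 0

print(naive)
print(safe)
```

With `fill_value=0`, a channel absent from one chunk simply contributes zero, so the running total for the other chunks is preserved.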

(OPTIONAL) Save only part of the data to a separate file

In [21]:
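# ISO-formatted date strings compare correctly as text;
# `>` keeps rows dated after 2020-06-01 (that day itself is excluded)
train2006 = train.loc[train['t_dat'] > '2020-06-01']
train2006.to_csv('transactions_202006.csv', index=False)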

In [23]:
train = pd.read_csv('./transactions_202006.csv')
train

Unnamed: 0,t_dat,customer_id,article_id,price,sales_channel_id
0,2020-06-02,0001f8cef6b9702d54abf66fd89eb21014bf98567065a9...,855834001,0.015831,1
1,2020-06-02,0001f8cef6b9702d54abf66fd89eb21014bf98567065a9...,836130002,0.015831,1
2,2020-06-02,0015f16aa2702e2ec13d2e38052f496b9b915d3c64e82c...,832453006,0.016932,1
3,2020-06-02,0015f16aa2702e2ec13d2e38052f496b9b915d3c64e82c...,841260011,0.016932,1
4,2020-06-02,001ef7c503e5407b6b836351b0415d3a226c587d4fb17b...,822946002,0.026797,2
...,...,...,...,...,...
5108381,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,929511001,0.059305,2
5108382,2020-09-22,fff2282977442e327b45d8c89afde25617d00124d0f999...,891322004,0.042356,2
5108383,2020-09-22,fff380805474b287b05cb2a7507b9a013482f7dd0bce0e...,918325001,0.043203,1
5108384,2020-09-22,fff4d3a8b1f3b60af93e78c30a7cb4cf75edaf2590d3e5...,833459002,0.006763,1
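
The same subset can also be written without holding the full table in memory, by combining the date filter with chunked reading. The sketch below is one possible approach on a tiny made-up CSV; the file names, the chunk size of 2, and the rows are all hypothetical.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "transactions_sample.csv")   # hypothetical input
    dst = os.path.join(tmp, "transactions_subset.csv")   # hypothetical output

    # Made-up rows standing in for the full transactions file.
    pd.DataFrame({
        "t_dat": ["2020-05-30", "2020-06-01", "2020-06-02", "2020-09-22"],
        "sales_channel_id": [1, 2, 1, 2],
    }).to_csv(src, index=False)

    first = True
    for chunk in pd.read_csv(src, chunksize=2):
        kept = chunk.loc[chunk["t_dat"] > "2020-06-01"]
        # Write the header with the first chunk, then append the rest.
        kept.to_csv(dst, mode="w" if first else "a", header=first, index=False)
        first = False

    subset = pd.read_csv(dst)

print(subset)
```

Only one chunk is in memory at a time, so the peak memory use depends on `chunksize` rather than on the size of the source file.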
