# 00-1 Setup and Preprocessing

In this notebook, we prepare two derived data sets for later use. The first is a transaction log keyed by product code rather than by article ID. The second is a transaction log restricted to items that have a corresponding image in the inventory, in preparation for the Image Search feature we will implement at the end of the project.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [11]:
# load the data
transactions = pd.read_csv('../data/transactions_train.csv', parse_dates=['t_dat'])

articles = pd.read_csv('../data/articles.csv')[['article_id', 'product_code']]

In [8]:
transactions.dtypes

t_dat               datetime64[ns]
customer_id                 object
article_id                   int64
price                      float64
sales_channel_id             int64
dtype: object

In [3]:
transactions.shape

(31788324, 5)

In [4]:
len(articles['article_id'])

105542

## Transactions by Product Code

In [12]:
transactions_by_product = transactions[['t_dat','customer_id', 'article_id']]

In [13]:
transactions_by_product = transactions_by_product.merge(right=articles, on = 'article_id')

In [14]:
transactions_by_product = transactions_by_product[['t_dat', 'customer_id', 'product_code']]

In [15]:
transactions_by_product.head()

| | t_dat | customer_id | product_code |
|---|---|---|---|
| 0 | 2018-09-20 | 000058a12d5b43e67d225668fa1f8d618c13dc232df0ca... | 663713 |
| 1 | 2018-09-20 | 3681748607f3287d2c3a65e00bb5fb153de30e9becf158... | 663713 |
| 2 | 2018-09-20 | 4ef5967ff17bf474bffebe5b16bd54878e1d4105f7b4ed... | 663713 |
| 3 | 2018-09-20 | 6b7b10d2d47516c82a6f97332478dab748070f09693f09... | 663713 |
| 4 | 2018-09-20 | 8ac137752bbe914aa4ae6ad007a9a0c5b67a1ab2b2d474... | 663713 |


In [16]:
transactions_by_product.to_csv('../data/transactions-by-product.csv', index=False)
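
The merge above uses pandas' default inner join, so any transaction whose `article_id` is missing from `articles` would be dropped silently. A minimal sketch of how the join could be checked, using small made-up stand-in tables (`transactions_demo` and `articles_demo` are hypothetical names and values, not the real data):

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration)
transactions_demo = pd.DataFrame({
    't_dat': pd.to_datetime(['2018-09-20', '2018-09-20', '2018-09-21']),
    'customer_id': ['a', 'b', 'a'],
    'article_id': [663713001, 663713002, 541518023],
})
articles_demo = pd.DataFrame({
    'article_id': [663713001, 663713002, 541518023],
    'product_code': [663713, 663713, 541518],
})

# validate='many_to_one' raises MergeError if article_id is duplicated in articles_demo.
# how='left' with indicator=True keeps every transaction and flags those without a match.
merged = transactions_demo.merge(
    articles_demo, on='article_id', how='left',
    validate='many_to_one', indicator=True,
)

unmatched = (merged['_merge'] == 'left_only').sum()
by_product_demo = merged[['t_dat', 'customer_id', 'product_code']]
print(len(merged), unmatched)
```

If `unmatched` is zero and the row count equals that of the input, the inner join used in this notebook loses nothing.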

## Transactions with Item Image

In [19]:
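import os

# Collect the article_ids whose resized image exists on disk.
# read_image is not defined in this notebook, so we test for the file instead of decoding it.
has_img = []

for i in articles['article_id']:
    file = '0' + str(i)   # file name is the article_id with a leading zero
    folder = file[0:3]    # images are grouped in folders named after the first 3 characters
    if os.path.isfile('../resized_images/' + folder + '/' + file + '.jpg'):
        has_img.append(i)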
    

In [21]:
len(has_img)

105100
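
The path logic in the loop above can be isolated into a small helper, which makes it easier to test and reuse. This is only a sketch: `image_path` and `has_image` are hypothetical names, and padding to 10 digits assumes every `article_id` has 9 digits, which is what prepending a single `'0'` implies. The ID in the usage line is an illustrative value.

```python
import os

def image_path(article_id, root='../resized_images'):
    """Build the expected image path for an article_id (hypothetical helper)."""
    # Assumption: file names are the article_id zero-padded to 10 digits
    name = str(article_id).zfill(10)
    # Images are grouped in folders named after the first 3 characters of the file name
    return os.path.join(root, name[:3], name + '.jpg')

def has_image(article_id, root='../resized_images'):
    """Return True if the image file for this article exists on disk."""
    return os.path.isfile(image_path(article_id, root))

# Illustrative 9-digit ID
print(image_path(108775015))
```

With these helpers, the list could be built as `has_img = [i for i in articles['article_id'] if has_image(i)]`.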

In [24]:
transactions_with_img = transactions[ transactions['article_id'].isin(has_img) ]

In [25]:
transactions_with_img.shape

(31651678, 5)
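
To put the filter in perspective, the share of articles and transactions lost can be computed directly from the counts printed in the cells above. A small sketch with those counts copied in by hand:

```python
# Counts copied from the cell outputs above
n_articles, n_articles_with_img = 105542, 105100
n_tx, n_tx_with_img = 31788324, 31651678

articles_dropped = n_articles - n_articles_with_img
tx_dropped = n_tx - n_tx_with_img

print(f"articles without image: {articles_dropped} ({articles_dropped / n_articles:.2%})")
print(f"transactions dropped:   {tx_dropped} ({tx_dropped / n_tx:.2%})")
```

Both shares come out at roughly 0.4%, so restricting to items with images removes only a small part of the log.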

In [28]:
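# index=False keeps the row index out of the file, matching the export of transactions-by-product.csv
transactions_with_img.to_csv('../data/transactions_cleaned.csv', index=False)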