## Column Descriptions
**event_time:** Time at which the event happened (in UTC).

**event_type:** Events can be:<br>
$\;\;\;\;\;\;$<font color='blue'>view</font> - a user viewed a product<br>
$\;\;\;\;\;\;$<font color='blue'>cart</font> - a user added a product to the shopping cart<br>
$\;\;\;\;\;\;$<font color='blue'>remove_from_cart</font> - a user removed a product from the shopping cart<br>
$\;\;\;\;\;\;$<font color='blue'>purchase</font> - a user purchased a product<br>

$\;\;\;\;\;\;$Typical funnel: view => cart => purchase.
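
As a rough illustration of the funnel above, the sketch below counts events per step and the share of each step relative to the previous one. The toy `events` frame and its values are made up for the example; note that it counts events, not distinct users or sessions.

```python
import pandas as pd

# Toy event log; the values are invented for illustration only.
events = pd.DataFrame({
    "event_type": ["view", "view", "view", "view", "cart", "cart", "purchase"],
})

# Count events at each funnel step, in funnel order.
funnel_steps = ["view", "cart", "purchase"]
funnel = events["event_type"].value_counts().reindex(funnel_steps, fill_value=0)

# Ratio of each step to the step before it (NaN for the first step).
step_rate = funnel / funnel.shift(1)

print(funnel)
print(step_rate)
```

The same two lines should work on `df_oct` or `df_nov` in place of `events`, assuming `event_type` holds the values listed above.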

**product_id:** ID of a product

**category_id:** Product's category ID

**category_code:** Product's category taxonomy (code name), when one could be assigned. Usually present for meaningful categories and omitted for various kinds of accessories.

**brand:** Lowercase brand name. May be missing.

**price:** Price of the product, as a float. Always present.

**user_id:** Permanent user ID.

**user_session:** Temporary session ID. It stays the same throughout a user's session and changes each time the user returns to the online store after a long pause.
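
To make the relationship between `user_id` and `user_session` concrete, here is a minimal sketch that counts distinct sessions per user. The toy frame and its IDs are hypothetical; only the two column names come from the dataset.

```python
import pandas as pd

# Hypothetical events: user 1 comes back in a second session, user 2 has one.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "user_session": ["s1", "s1", "s2", "s3", "s3"],
})

# One permanent user_id can map to many temporary user_session values.
sessions_per_user = events.groupby("user_id")["user_session"].nunique()
print(sessions_per_user)
```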

## Basic Info
**Rows:** 67,501,979 (November); 42,448,764 (October)<br>
**Columns:** 9<br>
<br>
**event_time:**        object (datetime64[ns, UTC] when read with `parse_dates`)<br>
**event_type:**        object<br>
**product_id:**         int64<br>
**category_id:**        int64<br>
**category_code:**     object<br>
**brand:**             object<br>
**price:**            float64<br>
**user_id:**            int64<br>
**user_session:**      object<br>
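
Several of the `object` columns listed above hold a small set of repeated strings, so one option worth trying is to read them as pandas `category` columns. The sketch below shows the `dtype` argument of `pd.read_csv` on a two-row in-memory sample; the sample rows are invented, and whether this noticeably reduces memory on the full files has not been measured here.

```python
import io
import pandas as pd

# Two made-up rows with the same nine columns as the dataset.
sample_csv = io.StringIO(
    "event_time,event_type,product_id,category_id,category_code,brand,price,user_id,user_session\n"
    "2019-10-01 00:00:00 UTC,view,1,10,electronics.smartphone,acme,100.5,7,s1\n"
    "2019-10-01 00:00:05 UTC,cart,2,11,,,20.0,8,s2\n"
)

# Read the low-cardinality string columns as categoricals instead of object.
df_sample = pd.read_csv(
    sample_csv,
    dtype={
        "event_type": "category",
        "category_code": "category",
        "brand": "category",
    },
)

print(df_sample.dtypes)
```

Empty fields are still read as missing values, so `brand` and `category_code` keep their NaN entries.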

In [1]:
import pandas as pd
import numpy as np
import functions

In [2]:
df_nov = pd.read_csv('2019-Nov.csv', parse_dates=['event_time'])
df_oct = pd.read_csv('2019-Oct.csv', parse_dates=['event_time'])

In [3]:
print(df_oct.info())
print('-'*100)
print(df_nov.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42448764 entries, 0 to 42448763
Data columns (total 9 columns):
 #   Column         Dtype              
---  ------         -----              
 0   event_time     datetime64[ns, UTC]
 1   event_type     object             
 2   product_id     int64              
 3   category_id    int64              
 4   category_code  object             
 5   brand          object             
 6   price          float64            
 7   user_id        int64              
 8   user_session   object             
dtypes: datetime64[ns, UTC](1), float64(1), int64(3), object(4)
memory usage: 2.8+ GB
None
----------------------------------------------------------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67501979 entries, 0 to 67501978
Data columns (total 9 columns):
 #   Column         Dtype              
---  ------         -----              
 0   event_time     datetime64[ns, UTC]
 1   event_type     

In [4]:
print('{}\n{}\n'.format('NOVEMBER SALES', df_nov.describe()))
print('{}\n{}\n'.format('OCTOBER SALES', df_oct.describe()))

NOVEMBER SALES
         product_id   category_id         price       user_id
count  6.750198e+07  6.750198e+07  6.750198e+07  6.750198e+07
mean   1.251406e+07  2.057898e+18  2.924593e+02  5.386397e+08
std    1.725741e+07  2.012549e+16  3.556745e+02  2.288516e+07
min    1.000365e+06  2.053014e+18  0.000000e+00  1.030022e+07
25%    1.305977e+06  2.053014e+18  6.924000e+01  5.164762e+08
50%    5.100568e+06  2.053014e+18  1.657700e+02  5.350573e+08
75%    1.730075e+07  2.053014e+18  3.603400e+02  5.610794e+08
max    1.000286e+08  2.187708e+18  2.574070e+03  5.799699e+08

OCTOBER SALES
         product_id   category_id         price       user_id
count  4.244876e+07  4.244876e+07  4.244876e+07  4.244876e+07
mean   1.054993e+07  2.057404e+18  2.903237e+02  5.335371e+08
std    1.188191e+07  1.843926e+16  3.582692e+02  1.852374e+07
min    1.000978e+06  2.053014e+18  0.000000e+00  3.386938e+07
25%    1.005157e+06  2.053014e+18  6.598000e+01  5.159043e+08
50%    5.000470e+06  2.053014e+18  1.629

In [8]:
functions.split_categories(df_oct, df_oct.category_code)

KeyError: "None of [Index([          nan,  'appliances',   'furniture',   'computers',\n       'electronics',   'computers',           nan,           nan,\n           'apparel', 'electronics',\n       ...\n           'apparel',           nan,           nan, 'electronics',\n        'appliances', 'electronics',           nan,        'auto',\n       'electronics',           nan],\n      dtype='object', length=42412522)] are in the [columns]"
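
The message has the shape pandas produces when a list of values is used as a column selector. The body of `functions.split_categories` is not shown here, so the following is only a likely cause, reproduced on a hypothetical two-row frame: `df[some_list]` looks up the list's values as column labels, while plain assignment of the list creates the column.

```python
import pandas as pd

df_demo = pd.DataFrame({"category_code": ["electronics.audio", "apparel"]})
top_level = ["electronics", "apparel"]

# df_demo[top_level] asks for columns named "electronics" and "apparel".
# They do not exist, so pandas raises KeyError.
try:
    df_demo["category_1"] = df_demo[top_level]
    raised = False
except KeyError:
    raised = True

# Assigning the list itself works, provided its length matches the row count.
df_demo["category_1"] = top_level

print(raised)
print(df_demo)
```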

In [9]:
category_1 = []
category_2 = []
category_3 = []

for x in df_oct.category_code:
    # missing codes are read as NaN (a float), so fill every level with NaN
    if not isinstance(x, str):
        category_1.append(float('NaN'))
        category_2.append(float('NaN'))
        category_3.append(float('NaN'))
        continue
    # append each level to its list and fill NaN for the absent levels
    split = x.split('.')
    if len(split) == 1:
        category_1.append(split[0])
        category_2.append(float('NaN'))
        category_3.append(float('NaN'))
    elif len(split) == 2:
        category_1.append(split[0])
        category_2.append(split[1])
        category_3.append(float('NaN'))
    else:
        # three or more levels: keep the first three (deeper levels are
        # dropped) so that every row adds exactly one value to each list;
        # the earlier KeyError listed 42,412,522 values for 42,448,764 rows,
        # which suggests some codes have more than three levels
        category_1.append(split[0])
        category_2.append(split[1])
        category_3.append(split[2])

# assign the lists directly; df_oct[category_1] would look up the list's
# values as column labels and raise a KeyError
df_oct['category_1'] = category_1
df_oct['category_2'] = category_2
df_oct['category_3'] = category_3

In [None]:
# splits up categories by their sublevels into separate columns
# (`df` is assumed to be one of the frames loaded above, e.g. df_oct)
category_1 = []
category_2 = []
category_3 = []

for x in df.category_code:
    # if nan then appends nan to all categories
    if isinstance(x, float):
        category_1.append(np.nan)
        category_2.append(np.nan)
        category_3.append(np.nan)
        continue
    # appends to appropriate list and fills
    # nan for remaining categories
    split = x.split('.')
    if len(split) == 1:
        category_1.append(split[0])
        category_2.append(np.nan)
        category_3.append(np.nan)
    elif len(split) == 2:
        category_1.append(split[0])
        category_2.append(split[1])
        category_3.append(np.nan)
    else:
        # three or more levels: keep the first three so each list
        # ends up with one value per row
        category_1.append(split[0])
        category_2.append(split[1])
        category_3.append(split[2])

# assign the lists themselves; df[category_1] would treat the list's
# values as column labels and raise a KeyError
df['category_1'] = category_1
df['category_2'] = category_2
df['category_3'] = category_3
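
For frames of this size, a vectorized split may be worth trying instead of the Python loop. The sketch below uses `Series.str.split` with `expand=True` on a hypothetical four-row frame. It is not the notebook's method, and it behaves differently in one respect: with `n=2`, any levels beyond the third stay joined inside `category_3` rather than being dropped.

```python
import numpy as np
import pandas as pd

# Hypothetical sample covering three, two, one and zero category levels.
df_demo = pd.DataFrame(
    {"category_code": ["electronics.audio.headphone", "apparel.shoes", "auto", np.nan]}
)

# Split on "." at most twice, giving up to three parts per row.
parts = df_demo["category_code"].str.split(".", n=2, expand=True)
# Guarantee exactly three columns even if no code in the data has three levels.
parts = parts.reindex(columns=range(3))

# Missing levels and missing codes both end up as null values.
for i in range(3):
    df_demo[f"category_{i + 1}"] = parts[i]

print(df_demo)
```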