## Data Information

The data set has 5 columns and more than 100 million rows. It records the behavior of about 1 million users (clicks, purchases, adding items to the shopping cart, and favoriting items) between November 25 and December 3, 2017. Each line represents one user-item interaction and consists of the user ID, item ID, the item's category ID, behavior type, and timestamp, separated by commas.

## 1. Exploratory Data Analysis

In [1]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # the plotting API lives in matplotlib.pyplot, not the top-level package
import warnings
warnings.filterwarnings('ignore')

In [2]:
data = pd.read_csv('/Users/paxton615/Downloads/UserBehavior.csv')

In [3]:
print(data.shape)
print(data.isnull().sum())
data.head()

(100150806, 5)
1             0
2268318       0
2520377       0
pv            0
1511544070    0
dtype: int64


Unnamed: 0,1,2268318,2520377,pv,1511544070
0,1,2333346,2520771,pv,1511561733
1,1,2576651,149192,pv,1511572885
2,1,3830808,4181361,pv,1511593493
3,1,4365585,2520377,pv,1511596146
4,1,4606018,2735466,pv,1511616481


In [4]:
data.columns = ['user_id', 'item_id','category_id','status',"timestamp"]
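
The column labels in the output of cell 3 (`1`, `2268318`, `2520377`, `pv`, `1511544070`) suggest that the file has no header row, so `read_csv` used the first record as column names and that record is no longer part of `data`. Below is a minimal sketch of one way to keep it, assuming the file really is header-less. `load_behavior`, `COLUMNS`, and the in-memory sample are hypothetical names introduced only for illustration.

```python
import io
import pandas as pd

# Column names as assigned in the notebook.
COLUMNS = ["user_id", "item_id", "category_id", "status", "timestamp"]

def load_behavior(source):
    """Read a header-less behavior CSV so the first record stays in the data."""
    return pd.read_csv(source, header=None, names=COLUMNS)

# Tiny in-memory stand-in for UserBehavior.csv, built from the first two
# records visible in the notebook output.
sample_csv = io.StringIO(
    "1,2268318,2520377,pv,1511544070\n"
    "1,2333346,2520771,pv,1511561733\n"
)
demo = load_behavior(sample_csv)
print(demo.shape)
print(demo.head())
```

With this approach the separate `data.columns = [...]` assignment is no longer needed, and the row count would be one higher than the shape reported above.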

### Select 3 million rows for analysis, calling them 'users'

In [5]:
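# Take rows 5,000,000 to 7,999,999 as the working sample;
# .copy() makes it an independent frame so later column assignments are safe
users = data.iloc[5000000:8000000, :].copy()
users.reset_index(drop=True, inplace=True)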
print(users.shape)
print(users.nunique())
users.head()

(3000000, 5)
user_id         29233
item_id        806118
category_id      6911
status              4
timestamp      676049
dtype: int64


Unnamed: 0,user_id,item_id,category_id,status,timestamp
0,309818,4710383,1792277,pv,1511959603
1,309818,1421743,4069500,pv,1511959759
2,309818,800137,1216617,pv,1511959828
3,309818,2493122,1216617,pv,1511959953
4,309818,1461532,3102419,pv,1511998449


In [6]:
# How well does this sample, users, represent the population?

# data.nunique()  # takes 2-3 minutes to complete; run with caution
# The results of data.nunique() are shown below

total_unq = ['1 987994','2268318 4162024','2520377 9439','pv 4','1511544070 815859']

# Keep only the count (the second token) from each "label count" string
total_unq = [int(i.split(' ')[1]) for i in total_unq]

sample_unq = [i for i in users.nunique()]

# Share of the population's unique values that also appear in the sample
percentage = []
for i,j in zip(sample_unq,total_unq):
    percentage.append('{:2%}'.format(i/j))

print(percentage)

['2.958824%', '19.368413%', '73.217502%', '100.000000%', '82.863460%']


#### The subset, users, contains 3% of all users, 19% of all items, and 73% of all categories
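
The coverage figures above can also be computed with labelled pandas Series, which keeps each percentage attached to its column name. This is only a sketch: `coverage`, `total_unique`, and `sample_unique` are hypothetical names, and the counts are copied from the outputs quoted in this notebook rather than recomputed.

```python
import pandas as pd

# Unique counts for the full data set, copied from the data.nunique() results
# quoted in the notebook.
total_unique = pd.Series({
    "user_id": 987994,
    "item_id": 4162024,
    "category_id": 9439,
    "status": 4,
    "timestamp": 815859,
})

# Unique counts for the 3-million-row sample, copied from users.nunique().
sample_unique = pd.Series({
    "user_id": 29233,
    "item_id": 806118,
    "category_id": 6911,
    "status": 4,
    "timestamp": 676049,
})

def coverage(sample, total):
    """Return sample/total per column, formatted as a percentage with 2 decimals."""
    return (sample / total).map("{:.2%}".format)

pct = coverage(sample_unique, total_unique)
print(pct)
```

Note the format spec `{:.2%}` (two decimal places); the `{:2%}` used in the cell above sets a minimum width of 2 instead, which is why its output shows six decimals.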

### Unpack timestamp data and EDA

In [7]:
# The timezone must be set correctly (Asia/Shanghai here), otherwise hour and day will be shifted
users['hour']=[pd.Timestamp(i, unit='s',tz='Asia/Shanghai').hour for i in users.timestamp]

users['year']=[pd.Timestamp(i, unit='s',tz='Asia/Shanghai').year for i in users.timestamp]

users['day']=[pd.Timestamp(i, unit='s',tz='Asia/Shanghai').day for i in users.timestamp]

users['month']=[pd.Timestamp(i, unit='s',tz='Asia/Shanghai').month for i in users.timestamp]
# 0 is Monday, 6 is Sunday
users['dayofweek']=[pd.Timestamp(i, unit='s',tz='Asia/Shanghai').dayofweek for i in users.timestamp]
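
The cell above builds a `pd.Timestamp` for every row five times over, which is slow on 3 million rows. A vectorized alternative might look like the sketch below. It assumes the `timestamp` column holds Unix seconds, as in the notebook; `add_time_parts` is a hypothetical helper name, and the two demo timestamps are taken from the sample rows shown earlier.

```python
import pandas as pd

def add_time_parts(df, ts_col="timestamp", tz="Asia/Shanghai"):
    """Return a copy of df with hour/year/day/month/dayofweek in local time."""
    # Parse Unix seconds as UTC once, then convert to the local timezone.
    local = pd.to_datetime(df[ts_col], unit="s", utc=True).dt.tz_convert(tz)
    out = df.copy()
    out["hour"] = local.dt.hour
    out["year"] = local.dt.year
    out["day"] = local.dt.day
    out["month"] = local.dt.month
    out["dayofweek"] = local.dt.dayofweek  # 0 is Monday, 6 is Sunday
    return out

# Two timestamps from the sample rows shown in the notebook.
demo = add_time_parts(pd.DataFrame({"timestamp": [1511959603, 1511998449]}))
print(demo)
```

In the notebook this would be used as `users = add_time_parts(users)`; the resulting columns should match the ones produced by the list comprehensions.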

In [8]:
users.head()

Unnamed: 0,user_id,item_id,category_id,status,timestamp,hour,year,day,month,dayofweek
0,309818,4710383,1792277,pv,1511959603,20,2017,29,11,2
1,309818,1421743,4069500,pv,1511959759,20,2017,29,11,2
2,309818,800137,1216617,pv,1511959828,20,2017,29,11,2
3,309818,2493122,1216617,pv,1511959953,20,2017,29,11,2
4,309818,1461532,3102419,pv,1511998449,7,2017,30,11,3


In [9]:
print(users.year.unique()) # year should only be 2017

print(np.sort(users.hour.unique())) 

print(users.month.unique()) # should only be 11,12

print(users.day.unique()) # should fall between Nov 25 and Dec 3

[2017 2020 1919 2021]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[11 12 10  4  9  5  8]
[29 30  1  2  3 27 28 25 26 24 20 23 22 18 19 17 21 14  4  5 16 11 12 13
 15 10]


In [10]:
users = users[users['year']==2017] 
users = users[users['month'].isin([11,12])]
users = users[users['day'].isin([25,26,27,28,29,30,1,2,3])] 
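
One caveat with the filter above: it checks month and day independently, so a row dated November 1-3 or December 25-30, 2017 would also pass. Whether such rows exist in this sample is not shown here. A stricter alternative is to filter on the raw timestamp against the intended window, sketched below under the assumption that the window is November 25 through December 3, 2017 in Asia/Shanghai time. `filter_window` and `TZ` are hypothetical names.

```python
import pandas as pd

TZ = "Asia/Shanghai"

def filter_window(df, start="2017-11-25", end="2017-12-04", ts_col="timestamp"):
    """Keep rows whose Unix-second timestamp lies in [start, end) local time."""
    lo = int(pd.Timestamp(start, tz=TZ).timestamp())
    hi = int(pd.Timestamp(end, tz=TZ).timestamp())  # exclusive upper bound
    mask = (df[ts_col] >= lo) & (df[ts_col] < hi)
    return df[mask]

demo = pd.DataFrame({"timestamp": [
    1509465600,  # 2017-11-01 00:00:00 local: passes the month/day check, outside the window
    1511539199,  # 2017-11-24 23:59:59 local: just before the window
    1511539200,  # 2017-11-25 00:00:00 local: first second of the window
    1512316799,  # 2017-12-03 23:59:59 local: last second of the window
    1512316800,  # 2017-12-04 00:00:00 local: just after the window
    1514131200,  # 2017-12-25 00:00:00 local: passes the month/day check, outside the window
]})
kept = filter_window(demo)
print(kept)
```

Because it works on the original column, this filter could also be applied before the hour/day/month columns are derived.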

In [11]:
print(users.shape)
users.head()

(2998310, 10)


Unnamed: 0,user_id,item_id,category_id,status,timestamp,hour,year,day,month,dayofweek
0,309818,4710383,1792277,pv,1511959603,20,2017,29,11,2
1,309818,1421743,4069500,pv,1511959759,20,2017,29,11,2
2,309818,800137,1216617,pv,1511959828,20,2017,29,11,2
3,309818,2493122,1216617,pv,1511959953,20,2017,29,11,2
4,309818,1461532,3102419,pv,1511998449,7,2017,30,11,3


In [12]:
# Drop duplicates: same user, same item, same behavior type, on the same day and hour
users = users.drop_duplicates(subset=['user_id','item_id','status','hour','day'])
print(users.shape)

(2748966, 10)


In [13]:
# Form users['date'] from year, month, day
date = []
for i,j,k in zip(users['year'], users['month'], users['day']):
    date.append(str(i)+"-"+str(j)+"-"+str(k))

users['date']=pd.to_datetime(date)

users.head()

Unnamed: 0,user_id,item_id,category_id,status,timestamp,hour,year,day,month,dayofweek,date
0,309818,4710383,1792277,pv,1511959603,20,2017,29,11,2,2017-11-29
1,309818,1421743,4069500,pv,1511959759,20,2017,29,11,2,2017-11-29
2,309818,800137,1216617,pv,1511959828,20,2017,29,11,2,2017-11-29
3,309818,2493122,1216617,pv,1511959953,20,2017,29,11,2,2017-11-29
4,309818,1461532,3102419,pv,1511998449,7,2017,30,11,3,2017-11-30
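
The date column can also be assembled without a Python loop: `pd.to_datetime` accepts a DataFrame whose columns are named `year`, `month`, and `day`. A small sketch with made-up rows (`parts` and `dates` are hypothetical names):

```python
import pandas as pd

# Stand-in for the year/month/day columns of users.
parts = pd.DataFrame({
    "year": [2017, 2017],
    "month": [11, 12],
    "day": [29, 3],
})

# to_datetime assembles one timestamp per row from the three columns.
dates = pd.to_datetime(parts[["year", "month", "day"]])
print(dates)
```

In the notebook the equivalent call would be `users['date'] = pd.to_datetime(users[['year', 'month', 'day']])`.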


In [14]:
# tidy cols and reset_index
users = users[['user_id', 'item_id' , 'category_id', 'status', 'date','dayofweek' ,'hour']]
users.reset_index(drop=True, inplace=True)

In [15]:
print(users.shape)
print(users.isnull().sum())
print(users.dtypes)
users.head()

(2748966, 7)
user_id        0
item_id        0
category_id    0
status         0
date           0
dayofweek      0
hour           0
dtype: int64
user_id                 int64
item_id                 int64
category_id             int64
status                 object
date           datetime64[ns]
dayofweek               int64
hour                    int64
dtype: object


Unnamed: 0,user_id,item_id,category_id,status,date,dayofweek,hour
0,309818,4710383,1792277,pv,2017-11-29,2,20
1,309818,1421743,4069500,pv,2017-11-29,2,20
2,309818,800137,1216617,pv,2017-11-29,2,20
3,309818,2493122,1216617,pv,2017-11-29,2,20
4,309818,1461532,3102419,pv,2017-11-30,3,7


In [16]:
users.to_csv(r'/Users/paxton615/Github_Personal/Alibaba_UserBehavior_Analysis/drafts/users_2m.csv')
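
By default `to_csv` also writes the index as an unnamed first column, which reappears as `Unnamed: 0` when the file is read back. If that column is not wanted, passing `index=False` avoids it. The sketch below uses an in-memory buffer instead of the real path; `demo`, `buf`, and `reloaded` are hypothetical names.

```python
import io
import pandas as pd

# One-row stand-in for the cleaned users frame.
demo = pd.DataFrame({
    "user_id": [309818],
    "item_id": [4710383],
    "date": pd.to_datetime(["2017-11-29"]),
})

buf = io.StringIO()
demo.to_csv(buf, index=False)  # do not write the index column
buf.seek(0)

# parse_dates restores the datetime dtype, which CSV does not preserve.
reloaded = pd.read_csv(buf, parse_dates=["date"])
print(reloaded.dtypes)
```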

#### Data is ready for analysis.