### Features

We divide the features into the following subgroups, each with its own source of data. For example, the "User" dataset is derived from user information and the order records in our system. Each subgroup has a specific target whose patterns we want to explore. In "User", we analyze the relationship between the profit of an order and the information about the user who placed it. In "Order", we analyze the relationship between the profit of an order and the quantities of the different inventory items in it. In "Duration", we analyze the relationship between the time spent in transit and the running data of the robots. In the last dataset, we combine several typical features from each subgroup into a single dataset to look for further patterns.

#### User
customer_id              
name            
age       
sex       
city         
state            
country          
income             
credit              
education    
occupation        
orderCount       
totalProfit   

#### Order
customer_id

green,blue,black,yellow,red,white

**amount** = green + blue + black + yellow + red + white
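
As a minimal sketch of this definition, assuming a pandas DataFrame with one count column per color, the total can be computed as a row-wise sum. The two rows below are made-up values for illustration, not records from the dataset.

```python
import pandas as pd

# The six color columns listed above.
COLORS = ["green", "blue", "black", "yellow", "red", "white"]

# Hypothetical order rows; the counts are illustrative only.
orders = pd.DataFrame(
    [[1, 0, 2, 0, 3, 1], [0, 0, 0, 4, 0, 0]],
    columns=COLORS,
)

# amount = green + blue + black + yellow + red + white
orders["amount"] = orders[COLORS].sum(axis=1)
```

Summing over a named column list keeps the definition in one place, so a color cannot be dropped from the total by accident.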

#### Duration

orderDate,takenDate,shipDate

**transitDuration** = shipDate - takenDate

**fulfillDuration** = shipDate - orderDate

shipVisitCount
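
The two durations can be sketched on a toy example as follows. This assumes the timestamps are strings in the `%Y-%m-%d %H:%M:%S` layout and that durations are expressed in seconds; the single row below is invented for illustration.

```python
import pandas as pd

# Hypothetical timestamps; the values are made up.
orders = pd.DataFrame({
    "orderDate": ["2017-01-01 10:00:00"],
    "takenDate": ["2017-01-01 10:05:00"],
    "shipDate": ["2017-01-01 10:07:30"],
})

# Parse the string columns into datetime64 values.
for col in ["orderDate", "takenDate", "shipDate"]:
    orders[col] = pd.to_datetime(orders[col], format="%Y-%m-%d %H:%M:%S")

# Durations in seconds.
orders["transitDuration"] = (orders["shipDate"] - orders["takenDate"]).dt.total_seconds()
orders["fulfillDuration"] = (orders["shipDate"] - orders["orderDate"]).dt.total_seconds()
```

Here the order is taken 5 minutes after it is placed and shipped 2.5 minutes later, so the transit duration is 150 seconds and the fulfillment duration is 450 seconds.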

#### Cost, revenue, profit
productSales,shipHandleCost

**totalRevenue** = productSales + shipHandleCost

productCOGS, *orderProcessCost(=10)*, shipVisitCost

**totalCost** = productCOGS + *orderProcessCost* + shipVisitCost

**profit** = **totalRevenue** - **totalCost**
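
The definitions above can be checked on a single worked example. This is a sketch under stated assumptions: the monetary values are invented, and only the fixed `orderProcessCost` of 10 comes from the definitions.

```python
import pandas as pd

ORDER_PROCESS_COST = 10.0  # fixed per-order processing cost, as stated above

# Hypothetical order; the monetary values are made up for illustration.
orders = pd.DataFrame({
    "productSales": [100.0],
    "shipHandleCost": [5.0],
    "productCOGS": [60.0],
    "shipVisitCost": [8.0],
})
orders["orderProcessCost"] = ORDER_PROCESS_COST

# totalRevenue = productSales + shipHandleCost
orders["totalRevenue"] = orders["productSales"] + orders["shipHandleCost"]
# totalCost = productCOGS + orderProcessCost + shipVisitCost
orders["totalCost"] = (
    orders["productCOGS"] + orders["orderProcessCost"] + orders["shipVisitCost"]
)
# profit = totalRevenue - totalCost
orders["profit"] = orders["totalRevenue"] - orders["totalCost"]
```

For this row the revenue is 105, the cost is 78, and the profit is 27.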

In [2]:
from datetime import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import visuals as vs
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

plt.style.use('fivethirtyeight')

df = pd.read_csv('../data/exp2/orderWithProfit.csv', header=0)
# filtered_df = df[df['orderdate'].isnull()]
df = df.dropna()
# Parse the timestamp columns into datetime64 values
for col in ["orderDate", "takenDate", "shipDate"]:
    df[col] = pd.to_datetime(df[col], format='%Y-%m-%d %H:%M:%S')
df["transitDuration"] = (df["shipDate"]-df["takenDate"])/ np.timedelta64(1, 's')
df["fulfillDuration"] = (df["shipDate"]-df["orderDate"])/ np.timedelta64(1, 's')

# amount is the total item count over all six colors, including green
df["amount"] = df["green"]+df["blue"]+df["black"]+df["yellow"]+df["red"]+df["white"]

dic = {}

# Combine with customer info
df_tmp = pd.read_csv('../data/exp2/orderWithCustomer.csv', header=0)
df = pd.merge(df, df_tmp, how='inner', left_on="customer", right_on="name",suffixes=('_x', '_y'),)

# Encode each customer attribute as an integer ID (IDs start from 1)
for key in ["customer", "name", "age", "sex", "city", "state", "country",
            "income", "credit", "education", "occupation", "orderCount", "totalProfit"]:
    # Map each distinct value to an ID in order of first appearance
    dic[key] = {value: idx for idx, value in enumerate(df[key].drop_duplicates(), start=1)}
    df[key] = df[key].map(dic[key])

print("dictionary keys:", list(dic.keys()))
df.info()

dictionary keys: ['customer', 'city', 'name', 'orderCount', 'country', 'age', 'totalProfit', 'sex', 'credit', 'state', 'income', 'education', 'occupation']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 150 entries, 0 to 149
Data columns (total 34 columns):
customer            150 non-null int64
green               150 non-null int64
blue                150 non-null int64
black               150 non-null int64
yellow              150 non-null int64
red                 150 non-null int64
white               150 non-null int64
orderDate           150 non-null datetime64[ns]
takenDate           150 non-null datetime64[ns]
shipDate            150 non-null datetime64[ns]
shipVisitCount      150 non-null int64
productSales        150 non-null float64
shipHandleCost      150 non-null float64
totalRevenue        150 non-null float64
productCOGS         150 non-null float64
orderProcessCost    150 non-null float64
shipVisitCost       150 non-null float64
totalCost           150 non-null floa