In [196]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [197]:
os.listdir('../../data')

['categories.csv',
 'products.csv',
 'orders.csv',
 'customer.csv',
 'department.csv',
 'order_items.csv']

In [198]:
Categories = pd.read_csv('../../data/categories.csv', delimiter=',')
Products = pd.read_csv('../../data/products.csv', delimiter=';')
Orders = pd.read_csv('../../data/orders.csv', delimiter=';')
Customer = pd.read_csv('../../data/customer.csv', delimiter=';')
Department = pd.read_csv('../../data/department.csv', delimiter=';')
Order_Items = pd.read_csv('../../data/order_items.csv', delimiter=';')

In [199]:
del Customer['customer_email']
del Customer['customer_password']
Customer.head(2)

Unnamed: 0,customer_id,customer_fname,customer_lname,customer_street,customer_city,customer_state,customer_zipcode
0,1,Richard,Hernandez,6303 Heather Plaza,Brownsville,TX,78521
1,2,Mary,Barrett,9526 Noble Embers Ridge,Littleton,CO,80126


In [232]:
Department.columns.values

array(['department_id', 'department_name'], dtype=object)

In [214]:
Order_Items.columns.values

array(['order_item_id', 'order_item_order_id', 'order_item_product_id',
       'order_item_quantity', 'order_item_subtotal',
       'order_item_product_price'], dtype=object)

In [310]:
def strip_prefix(prefix=''):
    """Build a renamer that drops `prefix` from a column name."""
    def rename(col):
        # Keep the table's own key (e.g. 'category_id') fully qualified
        if col == prefix + 'id':
            return col
        return col.replace(prefix, '')
    return rename

# Each table paired with the prefix its columns share
tables = {
    'Categories': (Categories, 'category_'),
    'Products': (Products, 'product_'),
    'Orders': (Orders, 'order_'),
    'Customer': (Customer, 'customer_'),
    'Department': (Department, 'department_'),
    'Order_Items': (Order_Items, 'order_item_'),
}

# One row per table; shorter rows are padded with '-'
FeatureMatrix = pd.DataFrame.from_dict({
    name: [strip_prefix(prefix)(col) for col in df.columns]
    for name, (df, prefix) in tables.items()
}, orient='index').fillna(value='-')
FeatureMatrix.head(10)

Unnamed: 0,0,1,2,3,4,5,6
Categories,category_id,department_id,name,-,-,-,-
Products,product_id,category_id,name,description,price,image,-
Orders,order_id,date,customer_id,status,-,-,-
Customer,customer_id,fname,lname,street,city,state,zipcode
Department,department_id,name,-,-,-,-,-
Order_Items,order_item_id,order_id,product_id,quantity,subtotal,product_price,-


We want to pick the attributes that answer any of the following questions:
* Who purchased?
* What was purchased?
* When was it purchased?
* How much was purchased?

For these questions we pick the relevant features for our feature matrix. The selection is based entirely on business knowledge; any attribute we are doubtful about should be kept `in`.
* **customer_id**
* **city** 
* **street** 
* **zipcode**
* **state**
* **ordered_product_name**
* **ordered_product_description**
* **ordered_product_price**
* **ordered_product_has_image** (As of now, all products have an image; we include this attribute but won't use it) - `Products['product_image'].isnull().sum()`
* **order_date**
* **order_status**
* **department_name** - Not sure how helpful this will be; we give it the benefit of the doubt and include it.
* **quantity**
* **subtotal**
* **product_price**

#### Biz problem
 - The CEO wants to launch an email marketing campaign and, to get the most out of it, wants to use our knowledge.
 
#### Thought process
 As a data analyst (DA), let's play out a conversation between the DA and a consumer:
 - **DA**: I have sent you an email. Would you like to check it?
 - User: Why should I?
 - **DA**: It may contain items you are interested in.
 - User: May, or does it?
 - **DA**: `We need to figure out the list of items the user is interested in`
 - **DA**: We have some amazing offers for you!
 - User: Okay!!
 - **DA**: `We need to find the best combination of interesting product and offer` [Out of our scope, as the dataset has no attribute for offers, and how much to offer depends on the business.]
 - User: This list seems to be pretty good.

Now we have the following task in hand:
  - We need to figure out the list of items the user is interested in
    - **Approach#1** - The user might be interested in items they showed interest in recently.
    - **Approach#2** - The user might be interested in items that other, similar users have bought. Ex: if I bought a TV, which items have other people purchased along with a TV, or around the same time?
    - **Approach#3** - The user might be addicted to offers. Ex: some users end up buying a lot of unnecessary stuff during a sale.
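
To make the first two approaches concrete, here is a rough sketch on toy data. The `purchases` frame, its values, and the helper names `recent_items` and `bought_together` are all hypothetical; "bought together" is simplified to "appears in the same order", which ignores the "around the same time" part of Approach#2.

```python
import pandas as pd

# Toy purchase log: one row per ordered item (invented values).
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "order_id": [1, 1, 2, 3, 3, 4],
    "order_date": pd.to_datetime([
        "2014-01-01", "2014-01-01", "2014-03-01",
        "2014-02-01", "2014-02-01", "2014-02-10",
    ]),
    "product_name": ["TV", "HDMI cable", "Soundbar", "TV", "HDMI cable", "TV"],
})

def recent_items(purchases, customer_id, n=1):
    """Approach#1: the customer's n most recently purchased distinct products."""
    rows = purchases[purchases["customer_id"] == customer_id]
    rows = rows.sort_values("order_date", ascending=False)
    return rows["product_name"].drop_duplicates().head(n).tolist()

def bought_together(purchases, product_name):
    """Approach#2 (simplified): count products sharing an order with product_name."""
    order_ids = purchases.loc[purchases["product_name"] == product_name, "order_id"]
    same_order = purchases["order_id"].isin(order_ids)
    other_product = purchases["product_name"] != product_name
    return purchases.loc[same_order & other_product, "product_name"].value_counts()

latest = recent_items(purchases, customer_id=1)
companions = bought_together(purchases, "TV")
```

On this toy log, `latest` holds customer 1's latest purchase and `companions` counts how often each other product shared an order with a TV. Approach#3 would need offer or sale data, which the dataset does not contain.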

## Approach#1 - Understanding user's interest