### Data Documentation:
<br>**Description**: Synthetic dataset modeled on Gap Inc., representing individual purchases from Q1 FY2020. Each row is a unique item purchased in an order.
<br><br>

| **Feature** | **Description**    | **Sample Value**  |
| ------- | -----------    | ------------- |
| OrderID | Unique order identifier per transaction | Q13Fa20xP   |
| ItemID  | 8-digit identifier per item in order | 57-583-745 |
| ItemName | Name of item associated with item identifier | Blue khakis |
| ItemSize | Size of item | XS, S, M, L, XL |
| Collection | Which part of store | Mens Denim, Mens Outerwear, Mens Tees, Kids Tees |
| PriceTag | Listed price of item | $9.95 |
| ClearanceType | Type of clearance | Retail, Clearance, Final Sale |
| DiscountType | Type of discount applied (e.g. whether Gap Card rewards were used) | Reward points, Promotion, GapCash, Other |
| TimeStamp | Date purchased as YYYY-MM-DD | 2020-02-05 |
| BranchID | 4-digit store number | #1253, #0531, #4176 |
| StoreName | Physical location of store | San Diego, CA |
| OnlinePurchase | Whether the purchase was made online | 1 if yes, 0 otherwise |

<br>

## Workflow:


**For products:**
- have a list of products
- map each product to a randomly generated ID
- map each product to a collection
- map each product to a business segment
- map each product to a price
    * price selected from a distribution

**For stores:**
- generate a list of storeIDs
- map each storeID to a location

<br><br>
Generate a list of orderIDs from n_orders, then create different order lengths using a Poisson distribution with lambda=1.

**For each order:**
- choose a timestamp (rng)
- choose online or not (rng)
- choose storeID (and corresponding location) (by binom)
- choose discountType (by dist)
- randomly select k products, where k is the order length
- for each product:
    - randomly assign ClearanceType, and subtract the corresponding amount from PriceTag

Append all to a dataframe.
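
One detail worth checking before picking the order-length distribution: a Poisson draw with lambda=1 can be zero, which would produce empty orders, while the code cells further down draw `ItemsBought` from a geometric distribution, which is always at least 1. A minimal sketch of the difference, assuming 10,000 simulated orders and an illustrative seed (neither value comes from the notebook):

```python
import numpy as np

rng = np.random.default_rng(3)   # illustrative seed
n_orders = 10_000                # illustrative size, not the notebook's n_orders

poisson_len = rng.poisson(lam=1, size=n_orders)       # can be 0
geometric_len = rng.geometric(p=0.4, size=n_orders)   # always >= 1

# Share of empty orders under Poisson(1); in theory exp(-1), about 0.37
print("Poisson zero-length share:", (poisson_len == 0).mean())
# Geometric(0.4) has minimum 1 and theoretical mean 1/0.4 = 2.5
print("Geometric min / mean:", geometric_len.min(), geometric_len.mean())
```

If Poisson lengths are preferred, shifting by one (`1 + rng.poisson(...)`) is one way to rule out empty orders.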

In [1]:
import pandas as pd
import numpy as np
np.random.seed(137)  # `np.seed = 137` only sets an attribute and does not seed the generator

from sklearn.datasets import make_classification
from sklearn.datasets import make_regression

import random 
import string 

import matplotlib.pyplot as plt
import seaborn as sns

execution_ct = 0

In [3]:
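def genClassesByFreq(n_obs=100, classes=(), freq=()):
    """
    Draw n_obs labels from classes, where freq[i] is the probability of classes[i]
    (freq should sum to 1)
    """
    RNGs = np.random.uniform(0,1,n_obs)
    cutoffs = np.array(freq).cumsum()  # cumulative frequencies act as bin edges
    labels = np.digitize(RNGs, cutoffs, right=True)  # index of the bin each draw falls in

    labels_dict = dict(zip(range(len(classes)), classes))
    data = np.vectorize(labels_dict.get)(labels)
    return data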
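
As a quick sanity check of the idea behind `genClassesByFreq`: the cumulative frequencies serve as bin edges, and `np.digitize` maps each uniform draw to the bin it lands in. The sketch below is a standalone illustration, assuming `freq` sums to 1 and has one entry per class; the class labels and the seed are made up.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed

classes = ['A', 'B', 'C']        # hypothetical class labels
freq = [0.5, 0.3, 0.2]           # one probability per class, summing to 1

cutoffs = np.cumsum(freq)        # bin edges, here 0.5, 0.8, 1.0
draws = rng.uniform(0, 1, 10_000)
labels = np.digitize(draws, cutoffs, right=True)

# Guard against floating-point rounding in the cumulative sum,
# which could otherwise push a draw past the last edge
labels = np.minimum(labels, len(classes) - 1)

sampled = np.array(classes)[labels]
observed = {c: float(np.mean(sampled == c)) for c in classes}
print(observed)  # shares should land close to freq
```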

In [4]:
def genStringID(length=8, chars=string.ascii_uppercase + string.digits): 
    """
    Generate a random string of the given length, choosing characters from chars only
    """
    return ''.join(random.choice(chars) for x in range(length))

def genFloatID(n_sections=2): 
    """
    Generate a random numeric ID with n_sections sections separated by hyphens, each section a 3-digit number
    """
    return '-'.join(str(np.random.randint(100,999)) for x in range(n_sections))
   
def generateIDs(n_obs, n_unique=10, freq=[], id_type='string', verbose=False, **kwargs):
    """
    Generate an array of n_obs IDs drawn from n_unique classes, with the desired frequency given as a weight per class
    Can make either numeric ('float') or string IDs
    """
    if 'str' in id_type:
        classes = [genStringID(chars="1234567890QZFSCWZTKRD",**kwargs) for i in range(n_unique)]
    else: 
        classes = [genFloatID(**kwargs) for i in range(n_unique)]
        
    data = genClassesByFreq(n_obs, classes, freq)
    if verbose:
        print(pd.Series(data).value_counts())
    
    return data
# pd.Series(generateIDs(n_obs=100, n_unique=10, freq=[.5,.4,.06, .04], id_type='string')).value_counts()

In [5]:
def random_dates(start, end, n=10):
    """
    Generate n random timestamps between start (inclusive) and end (exclusive)
    """
    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

Q1_start = pd.to_datetime('2020-01-01')
Q1_end = pd.to_datetime('2020-03-31')
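
A small check of the range `random_dates` actually covers may be useful here. `np.random.randint` excludes its upper bound, and `'2020-03-31'` parses to midnight at the start of that day, so with these bounds no timestamp falls on March 31 itself. The sketch below restates the function so it runs on its own; the sample size is arbitrary.

```python
import numpy as np
import pandas as pd

def random_dates(start, end, n=10):
    # Same approach as the notebook: draw whole seconds since the epoch
    start_u = start.value // 10**9
    end_u = end.value // 10**9
    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

start = pd.to_datetime('2020-01-01')
end = pd.to_datetime('2020-03-31')   # midnight at the start of March 31

ts = random_dates(start, end, n=1000)
print(ts.min(), ts.max())

# The upper bound is exclusive, so this count is zero
print("draws on or after the end bound:", int((ts >= end).sum()))
```

If purchases on March 31 are wanted, passing `'2020-04-01'` as the end bound would be one way to include them; that is an assumption about intent, not something the notebook states.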

## Make Product Data

In [13]:
# ORIGIN = pd.read_clipboard()
products = ORIGIN.copy()

In [14]:
products.columns=['ProductName','Collection','Brand','PriceIndex','RelPopularity']
products.sample(5)

Unnamed: 0,ProductName,Collection,Brand,PriceIndex,RelPopularity
29,Lightly Used Boxer Briefs,Accessories,Gap,1,0.001
6,White Camisole with Regina George Cutouts,Women's Tops,Banana Republic,2,0.01
45,Mom-Bought-Me-These Brown Slacks,Kids Bottoms,Gap,2,0.1
17,Pink Polo by Kanye Plain Tee,Men's Tops,Gap,2,0.001
40,Father's Day Dark Dress Socks,Accessories,Banana Republic,1,0.01


In [15]:
def mapPricesFromIndex(x=1):
    """
    x  target range   distribution actually sampled
    1  $1-20          1 + gamma(shape=3, scale=3)       (mean 10)
    2  $30-60         |normal(mean=45, sd=10)|
    3  $90+           60 + 5*gamma(shape=2, scale=7)    (mean 130)
    """
    if x==1:
        return 1+np.abs(np.random.gamma(3,3))
    elif x==2:
        return np.abs(np.random.normal(45,10))
    elif x==3:
        return 60+5*np.random.gamma(2,7)
    else:
        return 9999.
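
To see how closely each price tier matches its target range, the three distributions can be sampled in bulk and summarized. This is a rough sketch that mirrors the expressions in `mapPricesFromIndex` with vectorized draws; the seed and sample size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)   # illustrative seed
n = 50_000                       # illustrative sample size

tier1 = 1 + rng.gamma(3, 3, n)          # theoretical mean 1 + 3*3 = 10
tier2 = np.abs(rng.normal(45, 10, n))   # mean close to 45
tier3 = 60 + 5 * rng.gamma(2, 7, n)     # theoretical mean 60 + 5*(2*7) = 130

for name, draws in [('1', tier1), ('2', tier2), ('3', tier3)]:
    p5, p50, p95 = np.percentile(draws, [5, 50, 95])
    print(f"PriceIndex {name}: mean={draws.mean():.1f}, "
          f"5%={p5:.1f}, median={p50:.1f}, 95%={p95:.1f}")
```

The percentiles give a feel for how often a draw lands outside the stated target range, for example tier 3 prices between $60 and $90.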

In [16]:
products['Price'] = np.around(products.PriceIndex.apply(mapPricesFromIndex),decimals=2)
products['RelFreq'] = 4+np.log10(products.RelPopularity)+np.abs(np.random.normal(0,.15,size=len(products)))
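# genFloatID(3) gives 'ddd-ddd-ddd'; dropping the first character leaves an 8-digit ID of the form 'dd-ddd-ddd'
products['ProductID'] = [genFloatID(3)[1:] for i in range(len(products.Price))]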

In [17]:
products = products[['ProductID','ProductName','Collection','Brand', 'Price', 'RelFreq']]
products.sample(5)

Unnamed: 0,ProductID,ProductName,Collection,Brand,Price,RelFreq
19,57-583-745,Super Skinny Fit with GapFlex Max Jeans,Denim Shop,Gap,37.12,3.11362
37,00-893-818,"Beach Week Linen Button Down, Off-White",Men's Tops,Banana Republic,64.38,2.191056
39,47-230-154,"SpaceX Branded Socks, by NASA™",Kids Tops,Gap,5.99,2.0056
8,10-966-490,Tan Slacks for Serious Press Conference,Men's Bottoms,Banana Republic,138.83,3.026137
28,91-951-829,Wrinkled Gray Sweatpants for Work-From-Home,Men's Bottoms,Gap,8.51,3.065687


In [18]:
products['SampleWeight'] = (products.RelFreq**1.3) / (products.RelFreq**1.3).sum()
products

Unnamed: 0,ProductID,ProductName,Collection,Brand,Price,RelFreq,SampleWeight
0,31-880-165,Edgar Allen Poe's Snotty Hankerchief,Accessories,Banana Republic,6.56,1.121818,0.008366
1,77-374-679,"Baby Shoes, Never Worn",Accessories,Gap,10.63,2.055596,0.018384
2,36-682-186,Adam Levine Choker,Accessories,Gap,11.39,1.157228,0.008711
3,10-113-846,My Wife Borat Graphic Tee,Men's Tops,Gap,50.95,2.046423,0.018278
4,75-833-369,Checkered Cloth Mask with Drinking Straw Hole,Accessories,Gap,7.31,3.234156,0.033137
5,93-552-710,Acid-Washed Low-Rise Jeans with LSD-tab-sized ...,Women's Bottoms,Banana Republic,109.48,3.216709,0.032905
6,77-870-213,White Camisole with Regina George Cutouts,Women's Tops,Banana Republic,53.6,2.141382,0.019388
7,82-161-128,Mullet-Cut Midi Fur Skirt,Women's Bottoms,Banana Republic,88.94,2.16752,0.019696
8,10-966-490,Tan Slacks for Serious Press Conference,Men's Bottoms,Banana Republic,138.83,3.026137,0.030394
9,97-602-297,Tan Suit Jacket for Casual Press Conference,Men's Tops,Banana Republic,97.93,3.037863,0.030547


In [19]:
products['RetailPrice'] = np.ceil(products.Price)

In [20]:
products.sample(5)

Unnamed: 0,ProductID,ProductName,Collection,Brand,Price,RelFreq,SampleWeight,RetailPrice
36,45-570-528,That-Hoodie-You'll-Wear-to-Every-Class Sweater,Men's Tops,Gap,49.55,3.005548,0.030125,50.0
39,47-230-154,"SpaceX Branded Socks, by NASA™",Kids Tops,Gap,5.99,2.0056,0.017805,6.0
25,79-411-382,Beautiful Mind Tweed Jacket with Shoulder Pads,Men's Tops,Banana Republic,73.08,1.011961,0.007317,74.0
21,26-434-264,Bootleg Boot Cut,Denim Shop,Gap,39.35,2.077511,0.018639,40.0
35,24-868-681,The Conjuring Graphic Tee Collection,Kids Tops,Gap,9.46,2.149633,0.019485,10.0


## Make Store Data

In [21]:
stores_va = pd.read_csv('va_stores.txt', header=None, sep=r'\n',engine='python')
stores = stores_va.iloc[::3,0].str.split('located in').str[-1].drop_duplicates().tolist()
stores = [i[1:] for i in stores]

stores_df1 = pd.DataFrame(columns=['StoreName','StoreID'])
stores_df1['StoreName'] = stores
# stores_df1['StoreID'] = '#'+pd.Series(np.random.randint(1000,9999,size=len(stores))).astype(str)+'G'
stores_df1['Location'] = 'VA'
stores_df1['Brand'] = 'Gap'

stores_df1

Unnamed: 0,StoreName,StoreID,Location,Brand
0,Charlottesville Fashion Square,,VA,Gap
1,Dulles Town Center,,VA,Gap
2,Fair Oaks Mall,,VA,Gap
3,Fashion Centre at Pentagon City,,VA,Gap
4,Leesburg Premium Outlets,,VA,Gap
5,Norfolk Premium Outlets,,VA,Gap
6,Potomac Mills,,VA,Gap
7,Short Pump Town Center,,VA,Gap
8,Stony Point Fashion Park,,VA,Gap
9,Tysons Corner Center,,VA,Gap


In [22]:
stores_br = pd.read_csv('stores_br.txt', header=None, sep=r'\n',engine='python')
stores = stores_br.iloc[::2,0].str.split('located in').str[-1].drop_duplicates().tolist()
stores = [i[1:] for i in stores]

stores_df2 = pd.DataFrame(columns=['StoreName','StoreID'])
stores_df2['StoreName'] = stores
# stores_df2['StoreID'] = '#'+pd.Series(np.random.randint(1000,9999,size=len(stores))).astype(str) + 'B'
stores_df2['Location'] = 'VA'
stores_df2['Brand'] = 'Banana Republic'

stores_df2

Unnamed: 0,StoreName,StoreID,Location,Brand
0,Barracks Road Shopping Center,,VA,Banana Republic
1,Dulles Town Center,,VA,Banana Republic
2,Fair Oaks Mall,,VA,Banana Republic
3,Fashion Centre at Pentagon City,,VA,Banana Republic
4,Leesburg Premium Outlets,,VA,Banana Republic
5,MacArthur Center,,VA,Banana Republic
6,Norfolk Premium Outlets,,VA,Banana Republic
7,Potomac Mills,,VA,Banana Republic
8,Reston Town Center,,VA,Banana Republic
9,Short Pump Town Center,,VA,Banana Republic


In [23]:
stores_df = pd.concat([stores_df1, stores_df2]).reset_index(drop=True)  # DataFrame.append was removed in pandas 2.0

stores_df['StoreID'] = '#'+pd.Series(np.random.choice(range(1000,9999), 
                                                      size=len(stores_df),
                                                      replace=False)).astype(str) # Generate unique storeIDs

# stores_df['Brand'] = stores_df.Brand.str.replace('Gap','Zap').replace('Banana Republic','Zanana Republic')
stores_df

Unnamed: 0,StoreName,StoreID,Location,Brand
0,Charlottesville Fashion Square,#9901,VA,Gap
1,Dulles Town Center,#8425,VA,Gap
2,Fair Oaks Mall,#8289,VA,Gap
3,Fashion Centre at Pentagon City,#6802,VA,Gap
4,Leesburg Premium Outlets,#5101,VA,Gap
5,Norfolk Premium Outlets,#7104,VA,Gap
6,Potomac Mills,#9837,VA,Gap
7,Short Pump Town Center,#9367,VA,Gap
8,Stony Point Fashion Park,#8803,VA,Gap
9,Tysons Corner Center,#9981,VA,Gap


## Create customers

In [24]:
n_customers = np.random.randint(600,1000)
n_customers

855

In [25]:
CustomerIDs = [genStringID(2,chars='ABCDEFHIJKLMNOPQRSTUVWXYZ') + genFloatID(1) for i in range(n_customers)]
CustomerIDs[:5]

['KD477', 'EP365', 'IV908', 'JX401', 'QM447']

## Creating Orders

In [26]:
n_orders = np.random.randint(2000,2200)
n_orders

2172

In [27]:
order_frames = []  # collect one-row frames and concatenate once (DataFrame.append was removed in pandas 2.0)
for i in range(n_orders):
    order_i = pd.DataFrame(columns=['OrderID'])
    order_i['OrderID'] = [genStringID(length=7,chars="1234567890QZFSCWZTKRD")]
    # Choose customer
    order_i['CustomerID'] = np.random.choice(CustomerIDs,size=1)
    # Choose Brand
    brand = np.random.choice(['Gap','Banana Republic'],p=[.4,.6])
    order_i['Brand'] = [brand]
    # Choose corresponding StoreID
    order_i['StoreID'] = np.random.choice(stores_df[stores_df.Brand==brand].StoreID)
    # Choose order type (in store, home delivery, or store pickup); odds depend on brand
    order_i['OrderType'] = np.random.choice(a=['InStore', 'HomeDelivery', 'StorePickup'], size=1,
                                            p=[.76,.12,.12] if brand == 'Gap' else [.53,.37,.1])
    # Choose one size for the whole order
    order_i['ItemSize'] = np.random.choice(a=['XS', 'S', 'M', 'L', 'XL'], size=1,
                                           p=[.1,.24,.32,.24,.1])
    # Give Timestamp
    order_i['Timestamp'] = random_dates(Q1_start, Q1_end, 1)
    
    # Choose number of items in order (geometric, so always at least 1)
    order_i['ItemsBought'] = np.random.geometric(.4 if brand == 'Gap' else .7, size=1)
    
    # Add order to the list
    order_frames.append(order_i)

df_orders = pd.concat(order_frames)
  

In [28]:
df_orders.reset_index(drop=True,inplace=True)
df_orders.head()

Unnamed: 0,OrderID,CustomerID,Brand,StoreID,OrderType,ItemSize,Timestamp,ItemsBought
0,DRW7C20,QK848,Gap,#6802,InStore,M,2020-02-11 01:09:56,1.0
1,7T8QZ38,DP336,Gap,#7104,InStore,M,2020-01-13 13:35:27,3.0
2,DR3SRRR,KP441,Gap,#8289,HomeDelivery,S,2020-02-01 21:02:14,6.0
3,4371891,BX487,Banana Republic,#4479,InStore,M,2020-03-08 13:00:49,1.0
4,TD06SWS,CP331,Banana Republic,#9033,HomeDelivery,M,2020-03-26 16:33:04,1.0


### Select items per order

In [29]:
# Create items for each order
item_frames = []  # one frame per order, concatenated after the loop (DataFrame.append was removed in pandas 2.0)
for orderID, order in df_orders.set_index('OrderID').iterrows():
    productsByBrand = products[products.Brand== order.Brand].copy() # Make sure only choose from products in store
    productsByBrand['SampleWeight'] /= productsByBrand.SampleWeight.sum() # Renormalizes probabilities
    
    n_items = int(order.ItemsBought)
    items = pd.DataFrame()
    items['OrderID'] = [orderID]*n_items
    items['CustomerID'] = [order.CustomerID]*n_items
    # replace=False: each product appears at most once per order
    # (raises ValueError if n_items exceeds the number of products for the brand)
    items['ProductID'] = np.random.choice(a=productsByBrand.ProductID, 
                                          size=n_items,replace=False,
                                          p=productsByBrand.SampleWeight)
    items['StoreID'] = [order.StoreID]*n_items
    items['OrderType'] = [order.OrderType]*n_items
    items['Timestamp'] = [order.Timestamp]*n_items
    items['Brand'] = [order.Brand]*n_items
    items['ItemSize'] = [order.ItemSize]*n_items
    item_frames.append(items)

df_items = pd.concat(item_frames)

df_items.head()

Unnamed: 0,OrderID,CustomerID,ProductID,StoreID,OrderType,Timestamp,Brand,ItemSize
0,DRW7C20,QK848,36-682-186,#6802,InStore,2020-02-11 01:09:56,Gap,M
0,7T8QZ38,DP336,60-444-763,#7104,InStore,2020-01-13 13:35:27,Gap,M
1,7T8QZ38,DP336,91-951-829,#7104,InStore,2020-01-13 13:35:27,Gap,M
2,7T8QZ38,DP336,29-761-664,#7104,InStore,2020-01-13 13:35:27,Gap,M
0,DR3SRRR,KP441,64-360-768,#8289,HomeDelivery,2020-02-01 21:02:14,Gap,S


### Merge back to store & product data

In [30]:
df = pd.merge(df_items, products.drop(columns=['Brand','RelFreq','SampleWeight','Price']), 
              how='left', on='ProductID')
df.head()

Unnamed: 0,OrderID,CustomerID,ProductID,StoreID,OrderType,Timestamp,Brand,ItemSize,ProductName,Collection,RetailPrice
0,DRW7C20,QK848,36-682-186,#6802,InStore,2020-02-11 01:09:56,Gap,M,Adam Levine Choker,Accessories,12.0
1,7T8QZ38,DP336,60-444-763,#7104,InStore,2020-01-13 13:35:27,Gap,M,Human Rights (except for the children who make...,Kids Tops,17.0
2,7T8QZ38,DP336,91-951-829,#7104,InStore,2020-01-13 13:35:27,Gap,M,Wrinkled Gray Sweatpants for Work-From-Home,Men's Bottoms,9.0
3,7T8QZ38,DP336,29-761-664,#7104,InStore,2020-01-13 13:35:27,Gap,M,Dishwasher-Safe Jean Shorts,Denim Shop,42.0
4,DR3SRRR,KP441,64-360-768,#8289,HomeDelivery,2020-02-01 21:02:14,Gap,S,Lightly Used Boxer Briefs,Accessories,10.0


In [31]:
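# Price endings encode clearance status: .95 = full retail, .99 = clearance, .97 = final sale
df['Price'] = (df.RetailPrice - np.random.choice(a=[.05,.01,.03],
                                                 p=[.55,.35,.1],
                                                 size=len(df.RetailPrice))).round(2)  # round so the string ending is exact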
df = df.drop(columns='RetailPrice')
df['ClearanceType'] = df.Price.astype(str).str[-2:].map({'95':'FullRetail',
                                                         '99':'Clearance',
                                                         '97':'FinalSale'})
df.head()

Unnamed: 0,OrderID,CustomerID,ProductID,StoreID,OrderType,Timestamp,Brand,ItemSize,ProductName,Collection,Price,ClearanceType
0,DRW7C20,QK848,36-682-186,#6802,InStore,2020-02-11 01:09:56,Gap,M,Adam Levine Choker,Accessories,11.95,FullRetail
1,7T8QZ38,DP336,60-444-763,#7104,InStore,2020-01-13 13:35:27,Gap,M,Human Rights (except for the children who make...,Kids Tops,16.95,FullRetail
2,7T8QZ38,DP336,91-951-829,#7104,InStore,2020-01-13 13:35:27,Gap,M,Wrinkled Gray Sweatpants for Work-From-Home,Men's Bottoms,8.95,FullRetail
3,7T8QZ38,DP336,29-761-664,#7104,InStore,2020-01-13 13:35:27,Gap,M,Dishwasher-Safe Jean Shorts,Denim Shop,41.97,FinalSale
4,DR3SRRR,KP441,64-360-768,#8289,HomeDelivery,2020-02-01 21:02:14,Gap,S,Lightly Used Boxer Briefs,Accessories,9.95,FullRetail
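
Reading `ClearanceType` back out of the last two characters of the price string relies on every float printing with exactly two decimals. A possible alternative, sketched below under the same probabilities, is to draw the clearance type first and derive the price from it, so the label never depends on string formatting. The small `df` here is a hypothetical stand-in for the merged item table, and the seed is illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)   # illustrative seed

# Hypothetical stand-in for the merged item table
df = pd.DataFrame({'RetailPrice': [12.0, 17.0, 9.0, 42.0, 10.0]})

types = np.array(['FullRetail', 'Clearance', 'FinalSale'])
cents_off = np.array([0.05, 0.01, 0.03])   # gives endings .95, .99, .97

# Draw one type index per row, then use it for both the label and the price
idx = rng.choice(len(types), size=len(df), p=[.55, .35, .10])
df['ClearanceType'] = types[idx]
df['Price'] = (df['RetailPrice'] - cents_off[idx]).round(2)

print(df)
```

Because the label and the price come from the same index array, the two columns cannot disagree.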


In [32]:
df = pd.merge(df, stores_df.drop(columns='Brand'), how='left', on='StoreID')
df.head(5)

Unnamed: 0,OrderID,CustomerID,ProductID,StoreID,OrderType,Timestamp,Brand,ItemSize,ProductName,Collection,Price,ClearanceType,StoreName,Location
0,DRW7C20,QK848,36-682-186,#6802,InStore,2020-02-11 01:09:56,Gap,M,Adam Levine Choker,Accessories,11.95,FullRetail,Fashion Centre at Pentagon City,VA
1,7T8QZ38,DP336,60-444-763,#7104,InStore,2020-01-13 13:35:27,Gap,M,Human Rights (except for the children who make...,Kids Tops,16.95,FullRetail,Norfolk Premium Outlets,VA
2,7T8QZ38,DP336,91-951-829,#7104,InStore,2020-01-13 13:35:27,Gap,M,Wrinkled Gray Sweatpants for Work-From-Home,Men's Bottoms,8.95,FullRetail,Norfolk Premium Outlets,VA
3,7T8QZ38,DP336,29-761-664,#7104,InStore,2020-01-13 13:35:27,Gap,M,Dishwasher-Safe Jean Shorts,Denim Shop,41.97,FinalSale,Norfolk Premium Outlets,VA
4,DR3SRRR,KP441,64-360-768,#8289,HomeDelivery,2020-02-01 21:02:14,Gap,S,Lightly Used Boxer Briefs,Accessories,9.95,FullRetail,Fair Oaks Mall,VA


## Exploration

In [33]:
df.shape

(4031, 14)

In [34]:
df.to_csv('gap.csv',sep='|',index=False)
execution_ct += 1
print('Exported {} times'.format(execution_ct))

Exported 1 times
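
When the exported file is loaded again, the pipe separator has to be passed explicitly, and `Timestamp` comes back as plain text unless it is parsed. A minimal round-trip sketch, using two rows taken from the `df_orders.head()` output above (trimmed to three columns) and a temporary file rather than `gap.csv`:

```python
import os
import tempfile

import pandas as pd

# Two rows from the df_orders.head() output, trimmed to three columns
sample = pd.DataFrame({
    'OrderID': ['DRW7C20', '4371891'],
    'StoreID': ['#6802', '#4479'],
    'Timestamp': pd.to_datetime(['2020-02-11 01:09:56', '2020-03-08 13:00:49']),
})

# Temporary path so the sketch does not overwrite the real export
path = os.path.join(tempfile.mkdtemp(), 'gap_sample.csv')
sample.to_csv(path, sep='|', index=False)

naive = pd.read_csv(path, sep='|')                               # Timestamp stays text
loaded = pd.read_csv(path, sep='|', parse_dates=['Timestamp'])   # Timestamp parsed

print(naive.dtypes)
print(loaded.dtypes)
```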


----

In [8]:
import pandas as pd
import numpy as np
# df2 = pd.read_csv('../gap.csv',sep='|')

In [13]:
# df2[['StoreID','StoreName','Location']].drop_duplicates().to_csv('gap_stores.csv',sep='|',index=False)

In [9]:
# df2.drop(columns=['StoreName','Location']).to_csv('gap.csv',sep='|',index=False)