## Notes on the Data

#### Data Documentation:
<br>**Description**: Synthetic dataset from Gap Inc., representing a random sample of individual purchases from Q1 FY2020. Each row is a unique item purchased in an order.
<br><br>

| **Feature** | **Description**    | **Sample Value(s)**  |
| ------- | -----------    | ------------- |
| OrderID | Unique identifier per transaction (7-character) | DRW7C20   |
| CustomerID | Unique identifier per customer (5-character) | KP441   |
| ProductID  | Unique identifier per item (8-digit) | 13-817-239 |
| StoreID | Unique identifier per store (4-digit) | #4176 |
| OrderType | How purchase was completed  | InStore, HomeDelivery, Online |
| Timestamp | Timestamp of transaction (YYYY-MM-DD HH:MM:SS) | 2020-01-18 10:13:56 |
| Brand | Reporting segment of Gap Inc. the item was bought from | Banana Republic |
| ItemSize | Size of item | XS, S, M, L, X, XL |
| ProductName | Name of item associated with item identifier | Pink Polo by Kanye |
| Collection | Which part of store | Denim Shop |
| Price | Listed price of item | $29.95 |
| ClearanceType | Type of clearance | Retail, Clearance, Final Sale |
| DiscountType | If Gap Card rewards was used | Reward points, Promotion, GapCash, Other |
| StoreName | Store name (e.g. the mall), or the facility an online order was shipped from | Fair Oaks Mall |
| Location | State of store location | VA |

<br>

**Some comments about the data**: 

<br>In the real world, data comes from databases. Each transaction gets recorded as an observation: what item was purchased, at which store, by which customer. We can refer to these items by name, but there can be confusion if there's overlap (e.g. two customers with the same name, or two products with the same name).

<br>Instead, we can assign **unique identifiers** to each item, store, and customer. Even if they have the same name, they'll be distinct in the eyes of the data. Since these are randomly assigned, they can look pretty ugly, but just know that **each ID is associated with just one thing**.
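
Here's a tiny sketch of why IDs matter. The rows below are made up for illustration (the `CustomerName` column and the ID values are hypothetical and not part of gap.csv):

```python
import pandas as pd

# Hypothetical toy data: two different customers who happen to share a name
toy = pd.DataFrame({
    'CustomerID': ['AA111', 'BB222', 'AA111'],
    'CustomerName': ['Alex Smith', 'Alex Smith', 'Alex Smith'],
    'Price': [10.0, 20.0, 5.0],
})

# Counting by name merges the two people into one...
n_by_name = toy.CustomerName.nunique()
# ...while counting by ID keeps them distinct
n_by_id = toy.CustomerID.nunique()

print(n_by_name, n_by_id)
```

Counting by name sees one customer, while counting by ID correctly sees two.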

**Why we're using GroupBys**:

<br> Sometimes information in observations can be duplicated. Each row is an item that was purchased by some customer, on some day, at some store. That customer could have purchased *multiple items* in that transaction. To show that, we give each observation from the same transaction the *same OrderID*.

<br> This means that if we wanted some information at the transaction level, e.g. how many items were purchased per transaction, we could **first groupby each OrderID**, and then get some summary statistic, e.g. 'count', to capture how many items are in each order.

<br> Now we have data where each row is a different order. To get the average number of items purchased per transaction, we now **average the grouped data**, and come up with a single number that summarizes that series. 
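
The two steps (group, then average) can be sketched on a toy frame. The rows here are hypothetical and only meant to show the shape of the result at each step:

```python
import pandas as pd

# Hypothetical toy data: order 'A' has 3 items, order 'B' has 1 item
toy = pd.DataFrame({
    'OrderID': ['A', 'A', 'A', 'B'],
    'Price': [10.0, 20.0, 30.0, 40.0],
})

# Step 1: group by OrderID and count rows -> one value per order
items_per_order = toy.groupby('OrderID').Price.count()

# Step 2: average the grouped series -> a single summary number
avg_items = items_per_order.mean()

print(items_per_order.to_dict())
print(avg_items)
```

Averaging the raw rows directly would not work here, since the rows are items and not orders; the groupby is what moves us to the order level first.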

### Package & Data Imports

In [4]:
import pandas as pd
import numpy as np

In [5]:
df = pd.read_csv('gap.csv', sep='|') # We're using a pipe instead of a comma as the separator here
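
To see what the `sep` argument changes, here's a small sketch that reads a pipe-delimited string instead of a file. The two-row text is a hypothetical stand-in for gap.csv, assumed only for illustration:

```python
import io
import pandas as pd

# Hypothetical two-row, pipe-delimited text standing in for a file on disk
raw = "OrderID|Price\nDRW7C20|29.95\nSR15680|53.95\n"

# With sep='|', each field becomes its own column
toy = pd.read_csv(io.StringIO(raw), sep='|')

# With the default comma separator, each whole line lands in a single column
wrong = pd.read_csv(io.StringIO(raw))

print(toy.shape, wrong.shape)
```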

In [6]:
print(df.shape)
df.sample(2)

(4031, 14)


|  | OrderID | CustomerID | ProductID | StoreID | OrderType | Timestamp | Brand | ItemSize | ProductName | Collection | Price | ClearanceType | StoreName | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1574 | 8K5C065 | DC242 | 45-570-528 | #8803 | InStore | 2020-02-27 14:00:58 | Gap | XL | That-Hoodie-You'll-Wear-to-Every-Class Sweater | Men's Tops | 49.95 | FullRetail | Stony Point Fashion Park | VA |
| 680 | SR15680 | QE430 | 77-870-213 | #4291 | HomeDelivery | 2020-01-17 08:35:34 | Banana Republic | S | White Camisole with Regina George Cutouts | Women's Tops | 53.95 | FullRetail | Potomac Mills | VA |


## Collection Trends

#### Questions (Easy):
Let's take a look at some breakdowns by each collection: 
1. Subset the data frame for purchases in the Kids Tops collection. What were total sales here?
2. What about total sales for the Denim Shop collection? For Accessories? 
3. Use `groupby()` instead to get total sales per every collection... do your answers line up from before?

In [108]:
df[df.Collection == 'Kids Tops'].Price.sum() #1

13095.460000000001

In [109]:
df[df.Collection == 'Denim Shop'].Price.sum() #2

19796.47

In [110]:
df[df.Collection == 'Accessories'].Price.sum() #2

6139.860000000001

In [8]:
df.groupby('Collection').Price.sum() # 3

Collection
Accessories         6139.86
Denim Shop         19796.47
Kids Bottoms        4746.85
Kids Tops          13095.46
Men's Bottoms      28907.40
Men's Tops         38469.18
Women's Bottoms    36413.43
Women's Tops       30972.46
Name: Price, dtype: float64

## Product Trends

#### Questions (Easy-ish):
Let's take a look at some product trends:
1. How many unique products are there? What different collections are there?
2. Take a look at the top 10 best selling items, and save these items to a list (not pd.Series!)
3. Looking only at transactions for these items, which of the items had prices over $100?
<br>

#### Tips:
Functions you'll probably wanna use:
- Selecting columns as series
- `.value_counts()`, `.unique()`, `.nunique()`
- Filtering rows using `.isin()`, `=='string'`, or `> float`
- Chaining functions / subsets together
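
As a rough sketch of how these tools fit together, here's a toy example. The product names and prices are hypothetical and not taken from gap.csv:

```python
import pandas as pd

# Hypothetical toy data, not taken from gap.csv
toy = pd.DataFrame({
    'ProductName': ['Hat', 'Hat', 'Hat', 'Coat', 'Coat', 'Sock'],
    'Price': [15.0, 15.0, 15.0, 120.0, 120.0, 5.0],
})

# value_counts() returns a Series: index = product names, values = row counts
top = toy.ProductName.value_counts().head(2)

# The names live in the index, so convert the index (not the values) to a list
top_names = top.index.tolist()

# Keep rows for those products and apply a price filter in the same mask
pricey = toy[toy.ProductName.isin(top_names) & (toy.Price > 100)]

print(top_names)
print(pricey.ProductName.unique())
```

Combining both conditions with `&` (each wrapped in parentheses where needed) is one way to chain subsets; filtering in two separate steps would work as well.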

In [80]:
print(df.ProductName.nunique()) # 1 
print(df.Collection.unique()) # 1 

46
['Accessories' 'Kids Tops' "Men's Bottoms" 'Denim Shop' "Men's Tops"
 "Women's Bottoms" "Women's Tops" 'Kids Bottoms']


In [24]:
top_items = df.ProductName.value_counts().head(10)
top_items #2

# Quick comment here: top_items is a pd.Series, which means the actual values are 173, 172 etc.
# The index, on the other hand, is the actual ProductName
# We can extract this to a list by doing <pd.Series>.index.values.tolist()

Tan Slacks for Serious Press Conference                  173
Acid-Washed Low-Rise Jeans with LSD-tab-sized pockets    172
Tan Suit Jacket for Casual Press Conference              150
Let's-Wear-White-After-Labor-Day 3/4 Sleeve Blazer       133
Sun.png Summer Collection Sundress                       126
Checkered Cloth Mask with Drinking Straw Hole            123
Wireframe Glasses, Without plastic lenses                122
Mullet-Cut Midi Fur Skirt                                121
Vintage Surfboard Graphic Tee, Cotton                    121
Dishwasher-Safe Jean Shorts                              118
Name: ProductName, dtype: int64

In [22]:
df[df.ProductName.isin(top_items.index.values.tolist()) & (df.Price > 100)].ProductName.unique() #3

array(['Acid-Washed Low-Rise Jeans with LSD-tab-sized pockets',
       'Tan Slacks for Serious Press Conference',
       "Let's-Wear-White-After-Labor-Day 3/4 Sleeve Blazer"], dtype=object)

## Segment Trends

#### Questions (Easy - Medium):
Let's take a look at some breakdowns by business segment, or brand:
1. What were total sales by brand (in $)? Which had the most sales?
2. What about number of *unique* orders per brand? Does the brand with the most orders have the most sales?
3. How many orders of each type (OrderType) were completed, by brand? <br>Looking at the data, does Banana Republic or Gap do more StorePickup orders?
<br>

#### Tips:
- First GroupBy, then Aggregate! `.groupby(by=)`, `.agg(func=)`
    - Google possible aggregation arguments, like sum, average, or number of unique items
- For question 3, use **multi-level groupbys**, i.e. pass a `list` of columns on which to groupby, instead of just one string.  
- If you want to see number of unique orders, use the OrderID to capture information about the unique transaction, otherwise the metric will get duplicated for every item in that transaction.
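
Here's a minimal sketch of a multi-level groupby, and of how `'count'` and `'nunique'` can disagree when one order spans several rows. The rows are hypothetical toy data:

```python
import pandas as pd

# Hypothetical toy data: order 'A' contains two items, so it spans two rows
toy = pd.DataFrame({
    'Brand': ['Gap', 'Gap', 'Gap', 'Banana Republic'],
    'OrderType': ['InStore', 'InStore', 'HomeDelivery', 'InStore'],
    'OrderID': ['A', 'A', 'B', 'C'],
})

# Passing a list of columns gives one group per (OrderType, Brand) pair
grouped = toy.groupby(['OrderType', 'Brand']).OrderID

rows_per_group = grouped.agg('count')      # counts rows, i.e. items
orders_per_group = grouped.agg('nunique')  # counts distinct orders

# Selecting the outer level leaves a Series indexed by Brand
print(rows_per_group['InStore'].to_dict())
print(orders_per_group['InStore'].to_dict())
```

For Gap's InStore group, `'count'` reports 2 (two item rows) while `'nunique'` reports 1 (a single order).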

In [84]:
df.groupby('Brand').Price.agg('sum')

Brand
Banana Republic    120011.82
Gap                 58529.29
Name: Price, dtype: float64

In [85]:
df.groupby('Brand').OrderID.agg('nunique')

Brand
Banana Republic    1349
Gap                 823
Name: OrderID, dtype: int64

In [86]:
# 'count' tallies rows (items); swap in 'nunique' to count distinct orders instead
df.groupby(['OrderType','Brand']).OrderID.agg('count')['StorePickup']

Brand
Banana Republic    182
Gap                248
Name: OrderID, dtype: int64

## Customer Trends

#### Questions (Hard):
Let's look at the customers. A couple of questions that management wants us to answer:
1. (At Gap only) How many different stores did each customer shop at? 
2. On average, how much does each customer spend per transaction? 
3. Does this value differ when broken down by brand?
<br>

#### Tips:
- Think about count vs. nunique vs. value_counts: There's a bunch of ways to do this, but you wanna make sure to count at the order level (and not double count multiple items in the same order) 
- For questions 2 & 3, use **multi-level groupbys**, i.e. pass a `list` of columns on which to groupby, instead of just one string. 
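
A small sketch of the count vs. nunique distinction, plus an order-level average, on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical toy data: customer 'C1' placed two orders, one with two items
toy = pd.DataFrame({
    'CustomerID': ['C1', 'C1', 'C1', 'C2'],
    'OrderID': ['A', 'A', 'B', 'C'],
    'Price': [10.0, 30.0, 20.0, 60.0],
})

# count: rows per customer (double counts items in the same order)
rows = toy.groupby('CustomerID').OrderID.count()
# nunique: distinct orders per customer
orders = toy.groupby('CustomerID').OrderID.nunique()

# Order-level spend: sum prices within each (customer, order), then average
order_totals = toy.groupby(['CustomerID', 'OrderID']).Price.sum()
avg_per_order = order_totals.mean()

print(rows.to_dict(), orders.to_dict(), avg_per_order)
```

Summing within each order before averaging is what keeps a three-item order from being treated as three separate purchases.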

In [132]:
df[df.Brand == 'Gap'].groupby('CustomerID').StoreID.nunique().value_counts(sort=False)

1    322
2    162
3     43
4      3
5      1
Name: StoreID, dtype: int64

In [137]:
# 2
df.groupby(['CustomerID', 'OrderID']).Price.agg('sum').mean()

82.20124769797421

In [138]:
# 3 
df_byCustomerOrder = df.groupby(['Brand','CustomerID', 'OrderID']).Price.agg('sum').reset_index()
df_byCustomerOrder.groupby('Brand').Price.agg('mean').map('${:,.2f}'.format)

Brand
Banana Republic    $88.96
Gap                $71.12
Name: Price, dtype: object

## Store Trends

#### Questions (Medium):
Almost done! Think at the store level:
1. Which store has the lowest sales (in $)? How many orders does that store have?
2. What is the maximum number of HomeDelivery orders that any one store got?
<br>

#### Tips:
- `value_counts()`, `unique()`
- `groupby(by=)`, `agg(func=)`
- Subsetting dataframes by conditions
- Selecting columns as series
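
As a sketch of these pieces on hypothetical toy data: the example uses `idxmin()` to pull the label of the smallest total, which is just one option (sorting and taking the last row would get to the same place).

```python
import pandas as pd

# Hypothetical toy data, not taken from gap.csv
toy = pd.DataFrame({
    'StoreID': ['#1', '#1', '#2', '#2', '#2'],
    'OrderID': ['A', 'A', 'B', 'C', 'D'],
    'OrderType': ['InStore', 'InStore', 'HomeDelivery', 'HomeDelivery', 'InStore'],
    'Price': [50.0, 25.0, 10.0, 10.0, 10.0],
})

# Total sales per store, then the label of the smallest total
sales = toy.groupby('StoreID').Price.sum()
lowest_store = sales.idxmin()

# Subset to one order type first, then count distinct orders per store
home = toy[toy.OrderType == 'HomeDelivery']
max_home_orders = home.groupby('StoreID').OrderID.nunique().max()

print(lowest_store, max_home_orders)
```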

In [158]:
# 1
df.groupby('StoreID').Price.sum().sort_values(ascending=False).tail(1)

StoreID
#1812    4325.72
Name: Price, dtype: float64

In [29]:
# 1 
bad_stores = df.groupby('StoreID').Price.sum().sort_values(ascending=False).tail(1).index
df.groupby('StoreID').OrderID.agg('nunique')[bad_stores]

StoreID
#1812    68
Name: OrderID, dtype: int64

In [180]:
# 2
df.groupby(['OrderType','StoreID']).OrderID.nunique().loc['HomeDelivery'].max()

49