## Import Data

#### Data Documentation:
<br>**Description**: Synthetic dataset from Gap Inc., representing a random sample of individual purchases from Q1 FY2020. Each row is a unique item purchased in an order.
<br><br>

| **Feature** | **Description**    | **Sample Value(s)**  |
| ------- | -----------    | ------------- |
| OrderID | Unique identifier per transaction (7-character) | DRW7C20   |
| CustomerID | Unique identifier per customer (5-character) | KP441   |
| ProductID  | Unique identifier per item (8-digit) | 13-817-239 |
| StoreID | Unique identifier per store (4-digit) | #4176 |
| | | | 
| OrderType | How purchase was completed  | InStore, HomeDelivery, Online |
| Timestamp | Timestamp of transaction (YYYY-MM-DD HH:MM:SS) | 2020-01-18 10:13:56 |
| | | | 
| Brand | Which reporting segment of Gap Inc. bought from | Banana Republic |
| ItemSize | Size of item | XS, S, M, L, X, XL |
| ProductName | Name of item associated with item identifier | Pink Polo by Kanye |
| Collection | Which part of store | Denim Shop |
| Price | Listed price of item | $29.95 |
| ClearanceType | Type of clearance | Retail, Clearance, Final Sale |
| DiscountType | If Gap Card rewards was used | Reward points, Promotion, GapCash, Other |
| | | | 
| StoreName | Store name (e.g. a mall), or facility where an online order was shipped from | Fair Oaks Mall |
| Location | State of store location | VA |

<br>

**Quick note on IDs**: 

<br>IDs are a really important part of many, if not most, datasets. Each unique *thing*, whether that's a product, a store, or a customer, gets assigned **its own unique identifier**. 

This is important in case two stores have the same name (e.g. Gap and Banana Republic at the Fair Oaks Mall). When we group by Store Name, for example, we want to make sure we're not accidentally lumping both stores together, and instead keep the two separate.
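
To make the point concrete, here is a minimal sketch on a tiny made-up table (the store IDs and prices are hypothetical, not taken from `gap.csv`). It assumes two different stores share the same `StoreName` and compares grouping by name with grouping by `StoreID`.

```python
import pandas as pd

# Toy data (hypothetical values): two different stores share one StoreName.
toy = pd.DataFrame({
    "StoreID": ["#1001", "#1001", "#2002"],
    "StoreName": ["Fair Oaks Mall", "Fair Oaks Mall", "Fair Oaks Mall"],
    "Brand": ["Gap", "Gap", "Banana Republic"],
    "Price": [10.0, 20.0, 50.0],
})

# Grouping by name alone merges both stores into a single row.
by_name = toy.groupby("StoreName")["Price"].sum()

# Grouping by the identifier keeps the two stores separate.
by_id = toy.groupby("StoreID")["Price"].sum()

print(by_name)
print(by_id)
```

Grouping by name returns one combined total, while grouping by ID returns one total per store.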

```python
import pandas as pd
import numpy as np
```

```python
# The file is pipe-delimited, so pass sep='|'
df = pd.read_csv('gap.csv', sep='|')
```

```python
print(df.shape)
df.sample(2)
```

(4031, 14)


| Unnamed: 0 | OrderID | CustomerID | ProductID | StoreID | OrderType | Timestamp | Brand | ItemSize | ProductName | Collection | Price | ClearanceType | StoreName | Location |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 3514 | R4DZZW0 | LH413 | 60-444-763 | #1812 | InStore | 2020-03-17 10:20:12 | Gap | XL | Human Rights (except for the children who make... | Kids Tops | 16.95 | FullRetail | Williamsburg Premium Outlets | VA |
| 25 | 199QQ1K | PY118 | 76-558-812 | #4479 | InStore | 2020-03-15 07:50:08 | Banana Republic | M | Yes-I-am-Experiencing-Menopause-Thank-You-for-... | Women's Bottoms | 34.99 | Clearance | Williamsburg Premium Outlets | VA |
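
By default, `read_csv` loads a column like `Timestamp` as plain strings. The sketch below is an optional variation, not the notebook's own loading step: it reads a two-row, pipe-delimited stand-in (built with `io.StringIO` from the sample values shown above, so the layout is an assumption about `gap.csv`) and asks pandas to parse the timestamps via `parse_dates`.

```python
import io
import pandas as pd

# Two-row stand-in for gap.csv (assumed layout; only three columns kept).
raw = io.StringIO(
    "OrderID|Timestamp|Price\n"
    "R4DZZW0|2020-03-17 10:20:12|16.95\n"
    "199QQ1K|2020-03-15 07:50:08|34.99\n"
)

# sep='|' handles the pipe delimiter; parse_dates converts Timestamp to datetimes.
toy = pd.read_csv(raw, sep="|", parse_dates=["Timestamp"])

print(toy.dtypes)
print(toy["Timestamp"].dt.day_name())
```

With a real datetime column, the `.dt` accessor (day of week, month, hour) becomes available for later breakdowns.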


## Product Trends

#### Questions (Easy-ish):
Let's take a look at some product trends:
1. How many unique products are there? What different collections are there?
2. Take a look at the top 10 best-selling items, and save these items to a list (not a pd.Series!)
3. Looking only at transactions for these top 10 items, which of them had prices over $100?
4. There's one product, `09-875-876`, whose name seems to indicate it is sold in one size. What's its name, and what sizes were actually sold?
<br>

#### Tips:
Functions you'll probably wanna use:
- Selecting columns as series
- `.value_counts()`, `.unique()`, `.nunique()`
- Filtering rows using `.isin()`, `=='string'`, or `> float`
- Chaining functions / subsets together
- `.drop_duplicates()`
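
As a rough illustration of how these pieces chain together, here is a sketch on a small made-up table. The product IDs, names, and prices are hypothetical, and it takes a top 2 rather than a top 10 to keep the example short.

```python
import pandas as pd

# Toy data (hypothetical values), one row per item purchased.
toy = pd.DataFrame({
    "OrderID": ["A1", "A1", "A2", "A3", "A4", "A5"],
    "ProductID": ["11-111-111", "22-222-222", "11-111-111",
                  "11-111-111", "33-333-333", "22-222-222"],
    "ProductName": ["Tee", "Jacket", "Tee", "Tee", "Scarf", "Jacket"],
    "ItemSize": ["S", "M", "M", "S", "L", "L"],
    "Price": [15.0, 120.0, 15.0, 15.0, 25.0, 120.0],
})

# Number of distinct products.
n_products = toy["ProductID"].nunique()

# value_counts() sorts by frequency (descending); .index.tolist() gives a plain list.
top_ids = toy["ProductID"].value_counts().head(2).index.tolist()

# Keep only rows for the top sellers, then filter on price and de-duplicate.
top_rows = toy[toy["ProductID"].isin(top_ids)]
pricey = top_rows[top_rows["Price"] > 100][
    ["ProductID", "ProductName", "Price"]
].drop_duplicates()

# Which sizes were actually sold for one product?
sizes = toy.loc[toy["ProductID"] == "11-111-111", "ItemSize"].unique()

print(n_products, top_ids)
print(pricey)
print(sizes)
```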


## Collection Trends

#### Questions (Easy):
Let's take a look at some breakdowns by each collection: 
1. Subset the data frame for purchases in the Kids Tops collection. What were total sales here?
2. What about total sales for the Denim Shop collection? For Accessories? 
3. Use `groupby()` instead to get total sales for every collection. Do these totals match your answers from before?
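
The two approaches can be compared on a toy table first. This is only a sketch: the collection names come from the questions above, but the prices are made up.

```python
import pandas as pd

# Toy data (hypothetical prices).
toy = pd.DataFrame({
    "Collection": ["Kids Tops", "Denim Shop", "Kids Tops", "Accessories"],
    "Price": [16.95, 59.95, 12.00, 9.50],
})

# Approach 1: subset the rows for one collection, then sum.
kids_total = toy[toy["Collection"] == "Kids Tops"]["Price"].sum()

# Approach 2: one groupby gives the total for every collection at once.
totals = toy.groupby("Collection")["Price"].sum()

print(kids_total)
print(totals)
```

Both routes should agree for any single collection; the `groupby` version just avoids repeating the filter once per collection.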

## Segment Trends

#### Questions (Easy):
Let's take a look at some breakdowns by business segment, or brand:
1. What were total sales by brand (in $)? Which had the most sales?
2. What about number of *unique* orders per brand? Does the brand with the most orders have the most sales?
3. How many orders of each type (OrderType) were completed, by brand? <br>Looking at the data, does Banana Republic or Gap do more StorePickup orders?
<br>

#### Tips:
- Use GroupBy and Agg functions! `.groupby(by=)`, `.agg(func=)`
    - Look up the available aggregation functions, such as `sum`, `mean`, or `nunique` (number of unique items)
- For question 3, use **multi-level groupbys**, i.e. pass a `list` of columns to group by, instead of just one string.  
- If you want the number of unique orders, count unique OrderID values. Otherwise each order is counted once for every item in it.
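
A minimal sketch of these tips on made-up data (the order IDs and prices are hypothetical). It uses pandas named aggregation, which is one of several ways to call `.agg()`, and shows how counting rows instead of unique `OrderID` values inflates the order count.

```python
import pandas as pd

# Toy data (hypothetical values), one row per item purchased.
toy = pd.DataFrame({
    "OrderID": ["A1", "A1", "A2", "B1", "B2", "B2"],
    "Brand": ["Gap", "Gap", "Gap",
              "Banana Republic", "Banana Republic", "Banana Republic"],
    "OrderType": ["InStore", "InStore", "Online", "InStore", "Online", "Online"],
    "Price": [10.0, 20.0, 30.0, 100.0, 40.0, 60.0],
})

# Total sales and number of unique orders per brand.
by_brand = toy.groupby(by="Brand").agg(
    total_sales=("Price", "sum"),
    unique_orders=("OrderID", "nunique"),
)

# Counting rows instead double counts orders that contain several items.
row_counts = toy.groupby(by="Brand")["OrderID"].count()

# Multi-level groupby: pass a list of columns.
orders_by_type = toy.groupby(by=["Brand", "OrderType"])["OrderID"].nunique()

print(by_brand)
print(row_counts)
print(orders_by_type)
```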

## Customer Trends

#### Questions (Hard):
Let's look at the customers. A couple of questions that management wants us to answer:
1. (At Gap only) How many different stores did each customer shop at? 
2. On average, how much does each customer spend per transaction? 
3. Does this value differ when broken down by brand?
<br>

#### Tips:
- Think about count vs. nunique vs. value_counts: There's a bunch of ways to do this, but you wanna make sure to count at the order level (and not double count multiple items in the same order) 
- For questions 2 and 3, use **multi-level groupbys**, i.e. pass a `list` of columns to group by, instead of just one string. 
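
One possible way to count at the order level is to aggregate twice: first to one total per order, then to an average per customer. The sketch below uses made-up data (all IDs and prices are hypothetical) and assumes each `OrderID` belongs to a single brand.

```python
import pandas as pd

# Toy data (hypothetical values), one row per item purchased.
toy = pd.DataFrame({
    "CustomerID": ["C1", "C1", "C1", "C1", "C2", "C2"],
    "OrderID": ["A1", "A1", "A2", "A3", "B1", "B1"],
    "Brand": ["Gap", "Gap", "Banana Republic", "Gap", "Gap", "Gap"],
    "StoreID": ["#1", "#1", "#2", "#3", "#1", "#1"],
    "Price": [10.0, 30.0, 60.0, 20.0, 5.0, 15.0],
})

# Distinct Gap stores visited by each customer.
gap_only = toy[toy["Brand"] == "Gap"]
stores_per_customer = gap_only.groupby("CustomerID")["StoreID"].nunique()

# Step 1: one total per order. Step 2: average those totals per customer.
order_totals = toy.groupby(["CustomerID", "OrderID"])["Price"].sum()
avg_per_customer = order_totals.groupby(level="CustomerID").mean()

# Same idea, broken down by brand as well.
order_totals_brand = toy.groupby(["Brand", "CustomerID", "OrderID"])["Price"].sum()
avg_by_brand = order_totals_brand.groupby(level=["Brand", "CustomerID"]).mean()

print(stores_per_customer)
print(avg_per_customer)
print(avg_by_brand)
```

Taking `.mean()` of `Price` directly per customer would average over items rather than orders, which answers a different question.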

## Store Trends

#### Questions (Medium):
Almost done! Think at the store level:
1. Which store has the lowest sales (in $)? How many orders does that store have?
2. What is the maximum number of HomeDelivery orders that any one store got?
<br>

#### Tips:
- `value_counts()`, `unique()`
- `groupby(by=)`, `agg(func=)`
- Subsetting dataframes by conditions
- Selecting columns as series
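
Here is a sketch of how these tools might combine at the store level, again on made-up data (store IDs, order IDs, and prices are hypothetical). It groups by `StoreID` rather than `StoreName`, following the note on IDs at the top.

```python
import pandas as pd

# Toy data (hypothetical values), one row per item purchased.
toy = pd.DataFrame({
    "StoreID": ["#1", "#1", "#2", "#2", "#2", "#3"],
    "OrderID": ["A1", "A1", "B1", "B2", "B3", "C1"],
    "OrderType": ["InStore", "InStore", "HomeDelivery",
                  "HomeDelivery", "InStore", "HomeDelivery"],
    "Price": [10.0, 5.0, 40.0, 25.0, 30.0, 50.0],
})

# Sales and unique orders per store.
store_stats = toy.groupby("StoreID").agg(
    total_sales=("Price", "sum"),
    unique_orders=("OrderID", "nunique"),
)

# idxmin() returns the index label (StoreID) of the smallest total.
lowest_store = store_stats["total_sales"].idxmin()
lowest = store_stats.loc[lowest_store]

# Subset to HomeDelivery, count unique orders per store, then take the max.
home = toy[toy["OrderType"] == "HomeDelivery"]
home_orders = home.groupby("StoreID")["OrderID"].nunique()
max_home = home_orders.max()

print(lowest_store, lowest["unique_orders"])
print(max_home)
```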