## Notes on the Data

#### Data Documentation:
<br>**Description**: Synthetic dataset from Gap Inc., representing a random sample of individual purchases from Q1 FY2020. Each row is a unique item purchased in an order.
<br><br>

| **Feature** | **Description**    | **Sample Value(s)**  |
| ------- | -----------    | ------------- |
| OrderID | Unique identifier per transaction (7-character) | DRW7C20   |
| CustomerID | Unique identifier per customer (5-character) | KP441   |
| ProductID  | Unique identifier per item (8-digit) | 13-817-239 |
| StoreID | Unique identifier per store (4-digit) | #4176 |
| OrderType | How purchase was completed  | InStore, HomeDelivery, Online |
| Timestamp | Timestamp of transaction (YYYY-MM-DD HH:MM:SS) | 2020-01-18 10:13:56 |
| Brand | Which reporting segment of Gap Inc. bought from | Banana Republic |
| ItemSize | Size of item | XS, S, M, L, X, XL |
| ProductName | Name of item associated with item identifier | Pink Polo by Kanye |
| Collection | Which part of store | Denim Shop |
| Price | Listed price of item | $29.95 |
| ClearanceType | Type of clearance | Retail, Clearance, Final Sale |
| DiscountType | Type of discount applied, if any (e.g. Gap Card rewards) | Reward points, Promotion, GapCash, Other |
| StoreName | Store name (e.g. Mall), or facility where online order was shipped from | Fair Oaks Mall |
| Location | State of store location | VA |

<br>

**Some comments about the data**: 

<br>In the real world, data comes from databases. Each transaction gets recorded as an observation: what item was purchased, at which store, by which customer. We can refer to these items by name, but there can be confusion if there's overlap (e.g. two customers with the same name, or two products with the same name).

<br>Instead, we can assign **unique identifiers** to each item, store, and customer. Even if they have the same name, they'll be distinct in the eyes of the data. Since these are randomly assigned, they can look pretty ugly, but just know that **each ID is associated with just one thing**.

**Why we're using GroupBys**:

<br> Sometimes information in observations can be duplicated. Each row is an item that was purchased by some customer, on some day, at some store. That customer could have purchased *multiple items* in that transaction. To show that, we give each observation from the same transaction the *same OrderID*.

<br> This means that if we wanted some information at the transaction level, e.g. how many items were purchased per transaction, we could **first groupby each OrderID**, and then get some summary statistic, e.g. 'count', to capture how many items are in each order.

<br> Now we have data where each row is a different order. To get the average number of items purchased per transaction, we now **average the grouped data**, and come up with a single number that summarizes that series. 
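
To make the two steps concrete, here's a minimal sketch on a tiny made-up dataframe. The rows, the variable names (`toy`, `items_per_order`, `avg_items`), and the choice of `ProductName` as the column to count are all just for illustration; they are not taken from the Gap data.

```python
import pandas as pd

# Toy data invented for illustration; not taken from the Gap dataset.
toy = pd.DataFrame({
    "OrderID": ["A1", "A1", "A1", "B2", "C3", "C3"],
    "ProductName": ["Polo", "Jeans", "Socks", "Jeans", "Hat", "Polo"],
})

# Step 1: group by OrderID and count rows, giving one value per order.
items_per_order = toy.groupby("OrderID")["ProductName"].count()

# Step 2: average the grouped series down to a single summary number.
avg_items = items_per_order.mean()

print(items_per_order)
print(avg_items)
```

Here order `A1` has 3 items, `B2` has 1, and `C3` has 2, so the average comes out to 2 items per order.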

### Package & Data Imports

1. Import our usual packages, pandas and numpy. Use pandas to read in the CSV.

2. Note that we'll have to pass in a new parameter, `sep='|'`, to reflect that the raw data is stored a bit differently than usual: columns are separated by pipes instead of commas.

3. Check out the dimensions of the dataset, then take a look at the dataframe itself.

URL = `https://raw.githubusercontent.com/ishaandey/node/master/week-2/lab/gap.csv`
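
As a rough sketch of how `sep` works, the snippet below reads a couple of made-up pipe-delimited rows from an in-memory string instead of the real file, so it runs without a network connection. The rows and the name `raw` are assumptions for illustration. `pd.read_csv` also accepts a URL string as its first argument, so the same call pattern should carry over to the URL above.

```python
import io

import numpy as np
import pandas as pd

# Two made-up rows in pipe-delimited form, standing in for the real file.
raw = io.StringIO(
    "OrderID|Brand|Price\n"
    "A1|Gap|29.95\n"
    "B2|Old Navy|12.50\n"
)

# sep="|" tells pandas that columns are split on pipes, not commas.
toy = pd.read_csv(raw, sep="|")

print(toy.shape)   # (number of rows, number of columns)
print(toy.head())  # first few rows of the dataframe
```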

## Collection Trends

#### Questions (Easy):
Let's take a look at total sales in various collections:
1. First, subset down to the *Kids Tops* collection. Grab the price column, then take the sum of that. How much did Kids Tops sales total to?
2. What about total sales for the Denim Shop collection? For Accessories? 


<br>Since it's getting pretty tedious to look for total sales in each collection, let's automate this by taking the sum of sales in each collection. To do so:
3. Group the dataframe by Collection, grab the price column, then aggregate the rows by taking a sum across each collection.
4. Do your answers here line up with what we found by the subset method?
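
For reference, here's a small sketch comparing the two approaches on invented data. The rows and variable names are hypothetical, and it assumes `Price` is already numeric, which may or may not hold for the real file.

```python
import pandas as pd

# Toy data invented for illustration; assumes Price is already numeric.
toy = pd.DataFrame({
    "Collection": ["Kids Tops", "Denim Shop", "Kids Tops", "Accessories"],
    "Price": [10.0, 40.0, 15.0, 5.0],
})

# Subset method: filter rows with a boolean mask, then sum one column.
kids_total = toy[toy["Collection"] == "Kids Tops"]["Price"].sum()

# GroupBy method: one sum per collection in a single call.
totals = toy.groupby("Collection")["Price"].sum()

print(kids_total)
print(totals)
```

Both routes give the same total for Kids Tops; the groupby just does every collection at once.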

## Product Trends

#### Questions (Easy):
Let's take a look at some product trends:
1. How many unique products are there? 
2. What different collections are there?
3. What are the top 10 best-selling items (in terms of sale frequency)?
4. How many times was the `N-A-P Logo Branded Sweater` purchased?
<br>

#### Tips:
Functions you'll probably wanna use: `.unique()`, `.nunique()`, `.value_counts()` 
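
Here's a quick sketch of what each of those functions returns, using a made-up dataframe (the product and collection values are invented for illustration):

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "ProductName": ["Polo", "Jeans", "Polo", "Hat", "Polo", "Jeans"],
    "Collection": ["Tops", "Denim Shop", "Tops", "Accessories", "Tops", "Denim Shop"],
})

# .nunique() returns how many distinct values a column has.
n_products = toy["ProductName"].nunique()

# .unique() returns the distinct values themselves, in order of appearance.
collections = toy["Collection"].unique()

# .value_counts() counts rows per value, sorted from most to least frequent.
counts = toy["ProductName"].value_counts()
top_two = counts.head(2)      # the two most frequent products
polo_count = counts["Polo"]   # look up a single product by name

print(n_products, collections)
print(top_two)
print(polo_count)
```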

## Segment Trends

#### Questions (Easy - Medium):
Let's take a look at some breakdowns by business segment, or brand.
1. What were *total sales* by brand (in terms of $)?


<br>Since a customer can buy multiple items in one transaction, or "order", use *OrderID* to prevent double counting.
2. What was the total number of *unique* orders per brand?
3. How many orders of each type (`OrderType`) were completed?
4. Could we break this down by brand?


#### Tips:
- First GroupBy, then Aggregate! `.groupby(by=)`, `.agg(func=)`
    - Google possible aggregation arguments, like sum, average, or number of unique items
- For question 4, use **multi-level groupbys**, i.e. pass a `list` of columns on which to groupby, instead of just one string.  
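
A minimal sketch of the groupby-then-aggregate pattern, including a multi-level groupby, on invented rows. The data and variable names are hypothetical; `"sum"` and `"nunique"` are two of the string names that `.agg` accepts.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "OrderID": ["A1", "A1", "B2", "C3", "C3", "D4"],
    "Brand": ["Gap", "Gap", "Gap", "Old Navy", "Old Navy", "Old Navy"],
    "OrderType": ["InStore", "InStore", "Online", "Online", "Online", "InStore"],
    "Price": [20.0, 30.0, 10.0, 5.0, 15.0, 25.0],
})

# Total sales per brand: group, pull a column, aggregate with a sum.
sales_by_brand = toy.groupby(by="Brand")["Price"].agg(func="sum")

# Unique orders per brand: nunique counts each OrderID once,
# even when an order spans several rows.
orders_by_brand = toy.groupby(by="Brand")["OrderID"].agg(func="nunique")

# Multi-level groupby: pass a list of columns to group on both at once.
orders_by_brand_type = (
    toy.groupby(by=["Brand", "OrderType"])["OrderID"].agg(func="nunique")
)

print(sales_by_brand)
print(orders_by_brand)
print(orders_by_brand_type)
```

The multi-level result is indexed by (Brand, OrderType) pairs, so a single value can be pulled out with a tuple, e.g. `orders_by_brand_type[("Gap", "Online")]`.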

## Store Trends

#### Questions (Medium):
Almost done! Think at the store level:
1. Which store has the lowest sales (in $)? Save its StoreID to a variable, `bad_store`.
2. How many orders does that store have? (Hint: Use `.index` to capture the key of a pd.Series object)
3. What is the maximum number of HomeDelivery orders that any one store got?
<br>

#### Tips:
- We're just chaining things together. Subset, then groupby, then pull a column, then aggregate, then apply a function, etc.
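
One way the chaining might look, sketched on made-up rows. The store IDs, the variable names (`store_sales`, `lowest_store`, and so on), and the choice to sort and take the first index entry are all assumptions for illustration; other routes get to the same place.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "OrderID": ["A1", "A1", "B2", "C3", "D4", "D4"],
    "StoreID": ["#1001", "#1001", "#1002", "#1002", "#1001", "#1001"],
    "OrderType": ["HomeDelivery", "HomeDelivery", "InStore",
                  "HomeDelivery", "HomeDelivery", "HomeDelivery"],
    "Price": [20.0, 30.0, 10.0, 5.0, 15.0, 25.0],
})

# Groupby, pull a column, aggregate, then sort ascending.
store_sales = toy.groupby("StoreID")["Price"].sum().sort_values()

# The index of the grouped series holds the StoreIDs (the group keys),
# so the first index entry is the store with the lowest sales.
lowest_store = store_sales.index[0]

# Subset to that store, then count its distinct orders.
n_orders = toy[toy["StoreID"] == lowest_store]["OrderID"].nunique()

# Subset, groupby, pull a column, aggregate, then apply a function.
max_delivery = (
    toy[toy["OrderType"] == "HomeDelivery"]
    .groupby("StoreID")["OrderID"]
    .nunique()
    .max()
)

print(lowest_store, n_orders, max_delivery)
```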

## Customer Trends

#### Questions (Medium - Hard):
Let's look at the customers. A couple of questions that management wants us to answer:
1. At Gap, how many different stores did each customer shop at, on average? 
2. On average, how much does each customer spend per transaction? 
3. Does this value differ when broken down by brand? Which brand do customers spend more at?
<br>

#### Tips:
- Same chaining workflow: Subset, then groupby, then pull a column, then aggregate, then apply a function, etc.
- For #1: Since each customer is assigned a unique ID, we wanna get our data at a *per-customer* level. In other words, we'll group by CustomerID first.
- For #2: Try to sum up how much *each customer*, at *each order*, spent in total. Then take the average of this.
- For #3: Add another level to your previous solution, but then *group* up to the brand level again to consolidate all the customer-orders into each brand
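
Here's a hedged sketch of how those three tips could translate into code, using invented rows. All names and values are hypothetical, and grouping back up with `.groupby(level="Brand")` is just one of several ways to consolidate to the brand level.

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "CustomerID": ["K1", "K1", "K1", "K2", "K2"],
    "OrderID": ["A1", "A1", "B2", "C3", "C3"],
    "StoreID": ["#1001", "#1001", "#1002", "#1001", "#1001"],
    "Brand": ["Gap", "Gap", "Old Navy", "Gap", "Gap"],
    "Price": [20.0, 30.0, 10.0, 5.0, 15.0],
})

# Tip 1: per-customer count of distinct stores, then the average of those.
stores_per_customer = toy.groupby("CustomerID")["StoreID"].nunique()
avg_stores = stores_per_customer.mean()

# Tip 2: total spend for each customer-order pair, then the average of those.
order_totals = toy.groupby(["CustomerID", "OrderID"])["Price"].sum()
avg_order_spend = order_totals.mean()

# Tip 3: add Brand as another level, then group back up to the brand level
# to average the customer-order totals within each brand.
by_brand = (
    toy.groupby(["Brand", "CustomerID", "OrderID"])["Price"]
    .sum()
    .groupby(level="Brand")
    .mean()
)

print(avg_stores)
print(avg_order_spend)
print(by_brand)
```

In this toy data, customer `K1` shopped at 2 stores and `K2` at 1, so the average is 1.5 stores per customer.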