# Pandas Guided Practice - Improving Store Profits

A Superstore wants your help understanding what works best for its business. It would like you to provide a few concrete recommendations for how it can maximize its profits.

The data provided is stored in three different files inside of the `data` folder:
- `orders.csv`: General order information (date, shipping method, quantity ordered, profit, etc.)
- `customers.csv`: Unique customers who placed orders with the store
- `products.csv`: Unique products ordered from the store

It covers all orders made to the Superstore over the past few years.

The original dataset can be found on [Kaggle](https://www.kaggle.com/datasets/vivek468/superstore-dataset-final).

### Tasks

There are a few high-level tasks we will need to complete:
- Formatting the data
- Cleaning the data
- Exploring what the Superstore should sell (or not sell)
- Determining what they should be selling, and *when*

#### Pair Programming

There are specific instructions for completing each of these tasks below. You and a partner will be given ten minutes in a breakout room to work through each task. We recommend you both work on the same notebook by having one person share their screen. This also allows the person who is not screen sharing to do the searching on Google when necessary. After ten minutes we will get back together and work through the task as a group.

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

## Format Data

Take a few minutes in a breakout room to format the data into a single dataframe.

Tasks:
- Begin by reading each individual file into its own DataFrame and finding the common columns (what you will join on)
- Drop the `Unnamed: 0` column from each of the three DataFrames
- Combine `orders.csv`, `customers.csv`, and `products.csv` into a single DataFrame, `df_base`
- Create a copy of `df_base` and call it `df`. This will be the DataFrame we work with moving forward.
- Clean the column names so spaces are replaced with underscores and all text is lowercase

All three CSV files are stored in the `data` folder, so the path to each file follows the same pattern (`./data/<filename>`).
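
If you get stuck, here is a minimal sketch of the same steps on small made-up DataFrames rather than the real files. The column names (`Order ID`, `Customer ID`, `Product ID`, and so on), the values, and the choice of a left join are assumptions for illustration; check the actual columns in your files before deciding what to join on.

```python
import pandas as pd

# Toy stand-ins for the three CSV files; real column names may differ.
orders = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Order ID": ["A-1", "A-2", "A-3"],
    "Customer ID": ["C1", "C2", "C1"],
    "Product ID": ["P1", "P1", "P2"],
    "Profit": [10.0, -2.5, 4.0],
})
customers = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Customer ID": ["C1", "C2"],
    "Customer Name": ["Ann Lee", "Bo Chen"],
})
products = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "Product ID": ["P1", "P2"],
    "Sub-Category": ["Tables", "Phones"],
})

# Drop the leftover index column from each frame.
orders, customers, products = [
    frame.drop(columns="Unnamed: 0") for frame in (orders, customers, products)
]

# Join on the shared key columns (assumed here to be the two ID columns).
# A left join keeps every order row even when a lookup has no match.
df_base = (
    orders
    .merge(customers, on="Customer ID", how="left")
    .merge(products, on="Product ID", how="left")
)

# Work on a copy so df_base stays untouched.
df = df_base.copy()

# Lowercase the column names and replace spaces with underscores.
df.columns = df.columns.str.lower().str.replace(" ", "_")
print(df.columns.tolist())
```

With the real data, the three toy DataFrames would be replaced by `pd.read_csv` calls on the files in `./data/`.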

In [None]:
# Your code here (add as many cells as needed)

## Cleaning

Now that we have combined all of our data into a single DataFrame, let's look at what data we actually have and if any cleaning is required.

- Use the `.info()` method to check the size of the DataFrame and the datatypes of each column.
- Check for [duplicates](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html) using `['order_id', 'product_id', 'customer_id']` as the `subset` of columns. 
- If any, [drop duplicate rows](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop_duplicates.html#pandas.DataFrame.drop_duplicates) and [reset the index](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html). 
- Convert the `sales` column to type `float64`. 
- Convert `order_date` and `ship_date` to [pandas datetime](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)
- Show the number of missing values in each column. Filter the DataFrame to show the rows missing data. What seems to be the underlying problem?
- Drop the rows with missing values. 

**`df` should have 9970 rows when you are done**
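
The sketch below walks through the same cleaning steps on a small invented DataFrame, so the counts it produces say nothing about the real data. It assumes `sales` holds plain numeric text; if the real column contains other characters, `astype` will raise an error and the text will need tidying first.

```python
import pandas as pd

# Toy data: one duplicated order line, text-typed sales, one missing value.
df = pd.DataFrame({
    "order_id": ["A-1", "A-1", "A-2", "A-3"],
    "product_id": ["P1", "P1", "P2", "P3"],
    "customer_id": ["C1", "C1", "C2", "C3"],
    "sales": ["10.5", "10.5", "20.0", "7.25"],
    "order_date": ["2017-01-05", "2017-01-05", "2017-02-10", "2017-03-15"],
    "customer_name": ["Ann Lee", "Ann Lee", None, "Cy Diaz"],
})

key_cols = ["order_id", "product_id", "customer_id"]

# Count rows that repeat an earlier row on the key columns.
n_duplicates = df.duplicated(subset=key_cols).sum()

# Keep the first occurrence of each key and rebuild a clean 0..n-1 index.
df = df.drop_duplicates(subset=key_cols).reset_index(drop=True)

# Convert the text columns to numeric and datetime dtypes.
df["sales"] = df["sales"].astype("float64")
df["order_date"] = pd.to_datetime(df["order_date"])

# Missing values per column, then the rows that contain any missing value.
missing_per_column = df.isna().sum()
rows_with_missing = df[df.isna().any(axis=1)]

# Drop the incomplete rows.
df = df.dropna().reset_index(drop=True)
print(n_duplicates, len(df))
```

Looking at `rows_with_missing` before dropping anything is the step that helps answer the "underlying problem" question above.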

In [None]:
# Your code here

## Time to Explore

The Superstore wants us to provide them with some recommendations on how they can maximize their profits. For this guided practice, we want to provide some general insight into: 
- What to sell?
- When to sell *what*?

### What to sell? (Or what NOT to sell?)

Steps:
- Create a new column, `profit_per_unit`, by dividing `profit` by `quantity`
- Create a visual showing how many orders fall within each `category`
- Find the `sub-category` with the highest median `profit_per_unit` and lowest median `profit_per_unit`
- The Tables sub-category has a *negative* median `profit_per_unit`. Find the total number of Tables orders with a negative `profit_per_unit`
- Of all Tables orders, what *percentage* had a negative `profit_per_unit`?
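
As a rough guide, the steps above could look like this on a handful of invented order lines. The values and the `Tables` label are assumptions chosen for illustration, so the results will not match the real data.

```python
import pandas as pd

# Toy order lines; all values are invented.
df = pd.DataFrame({
    "category": ["Furniture", "Furniture", "Furniture", "Technology", "Technology"],
    "sub-category": ["Tables", "Tables", "Chairs", "Phones", "Phones"],
    "profit": [-30.0, 12.0, 8.0, 40.0, 18.0],
    "quantity": [3, 2, 4, 2, 3],
})

# Profit earned on each unit sold in an order line.
df["profit_per_unit"] = df["profit"] / df["quantity"]

# Number of order lines in each category (the counts behind a bar chart).
orders_per_category = df["category"].value_counts()

# Median profit per unit for each sub-category, sorted from lowest to highest.
median_ppu = df.groupby("sub-category")["profit_per_unit"].median().sort_values()
lowest, highest = median_ppu.index[0], median_ppu.index[-1]

# Tables orders only: how many lose money per unit, and what share of them.
tables = df[df["sub-category"] == "Tables"]
is_negative = tables["profit_per_unit"] < 0
n_negative = is_negative.sum()
pct_negative = is_negative.mean() * 100

print(lowest, highest, n_negative, pct_negative)
```

Taking the mean of a boolean Series gives the fraction of `True` values, which is why `is_negative.mean() * 100` yields a percentage. For the visual, calling `orders_per_category.plot(kind="bar")` is one option; `sns.countplot` is another.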

In [None]:
# Your code here

### When to sell *what*?

- Using the [datetime functionality](https://pandas.pydata.org/docs/reference/api/pandas.Series.dt.month.html) of `order_date`, create a `month` column
- Create a bar plot showing the median profit of each `category` for each `month` of the year
- There is one `category` which consistently has the highest median profit. Create a new DataFrame of only these orders.
- Show the *total* profit for every `sub-category` of tech for each month of the year. What should the Superstore be selling when?
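
One possible shape for these steps, sketched on invented data: the dates, profits, and the `Technology` category label are assumptions for illustration, and the filter below hard-codes that label instead of reading it off a plot.

```python
import pandas as pd

# Toy order lines; all values are invented.
df = pd.DataFrame({
    "order_date": pd.to_datetime([
        "2017-01-05", "2017-01-20", "2017-01-25",
        "2017-02-10", "2017-02-14", "2017-02-20",
    ]),
    "category": ["Technology", "Technology", "Furniture",
                 "Technology", "Technology", "Furniture"],
    "sub-category": ["Phones", "Copiers", "Chairs",
                     "Phones", "Phones", "Tables"],
    "profit": [30.0, 100.0, 5.0, 20.0, 10.0, -8.0],
})

# Month number (1-12) taken from the datetime column.
df["month"] = df["order_date"].dt.month

# Median profit per month and category: one row per month, one column per category.
median_profit = (
    df.groupby(["month", "category"])["profit"].median().unstack("category")
)

# Keep only the orders from the category of interest.
df_tech = df[df["category"] == "Technology"].copy()

# Total profit per month and sub-category; months without sales show 0.
total_profit = (
    df_tech.groupby(["month", "sub-category"])["profit"]
    .sum()
    .unstack("sub-category", fill_value=0)
)
print(total_profit)
```

Both `median_profit` and `total_profit` are wide tables, so `median_profit.plot(kind="bar")` would draw one group of bars per month.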

In [None]:
# Your code here