## Week 1's Challenge

Challenge by Carl Allchin

This week we are focusing on cleaning data so it is ready to answer some questions from our stakeholders. In the requirements I will add links to some useful resources in case you get stuck on a particular requirement.

### Input

![sample data](https://1.bp.blogspot.com/-m11G7WH-ftY/X_NSuNX2e6I/AAAAAAAACFc/Zs4l5XIgQv8p3OflKifXhN8Y924iGJpHgCLcBGAsYHQ/w640-h240/Screenshot%2B2021-01-04%2Bat%2B17.38.47.png)

### Requirements

Here's what we need you to do:
- Connect and load the csv file 
- Split the 'Store-Bike' field into 'Store' and 'Bike' 
- Clean up the 'Bike' field to leave just three values in the 'Bike' field (Mountain, Gravel, Road) 
- Create two different cuts of the date field: 'quarter' and 'day of month' 
- Remove the first 10 orders as they are test values 
- Output the data as a csv 

8 Data Fields
- Quarter
- Day of Month
- Store
- Bike
- Order ID
- Customer Age
- Bike Value
- Existing Customer?

### Output

![output](https://1.bp.blogspot.com/-5dcGmoprvTg/X_NVYmeoxEI/AAAAAAAACFo/U38HGpwxTIwzdBPyOQspdLX9rRnLsfjGwCLcBGAsYHQ/w640-h258/Screenshot%2B2021-01-04%2Bat%2B17.50.16.png)

### Bonus task

We are conscious this shouldn't be too much of a stretch for our Preppin' regulars so we've set you a Desktop task too. 

Our stakeholder wants to know the average monthly bike value sold by each day in the month. Yeah, they even want the running total to see where they should be in terms of sales by that point. The stakeholder knows each quarter is significantly different so the running totals should be separated by quarter. 

Build this view using the output if you fancy taking on the extra task.

![bonus](https://1.bp.blogspot.com/-fA_a39JBRGg/X_NWG9PKLrI/AAAAAAAACF0/ZJRb6OhBSC0_KPlppEewDX3Sodj93OHIACLcBGAsYHQ/w400-h359/Screenshot%2B2021-01-04%2Bat%2B11.54.29.png)
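
The bonus view boils down to two steps: average the bike value per day of month, then accumulate that average within each quarter. A minimal pandas sketch of the idea follows, assuming a tiny made-up sample rather than the real output file; the names `sample`, `daily_avg`, and `Running Total` are illustrative and not part of the challenge.

```python
import pandas as pd

# Made-up rows in the shape of the challenge output (values are illustrative only)
sample = pd.DataFrame({
    "Quarter":      [1, 1, 1, 1, 2, 2],
    "Day of Month": [1, 1, 2, 3, 1, 2],
    "Bike":         ["Road", "Road", "Road", "Road", "Road", "Road"],
    "Bike Value":   [100, 300, 400, 50, 10, 30],
})

# Step 1: average bike value for each quarter, bike type and day of month
daily_avg = (
    sample.groupby(["Quarter", "Bike", "Day of Month"], as_index=False)["Bike Value"]
    .mean()
    .sort_values(["Quarter", "Bike", "Day of Month"])
)

# Step 2: running total that restarts for every (quarter, bike type) pair
daily_avg["Running Total"] = daily_avg.groupby(["Quarter", "Bike"])["Bike Value"].cumsum()
print(daily_avg)
```

Grouping the cumulative sum by quarter is what keeps the running totals separate, as the stakeholder asked.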

In [383]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import seaborn as sns

In [384]:
df = pd.read_csv("./PD 2021 Wk 1 Input - Bike Sales.csv")
df.head()

Unnamed: 0,Order ID,Customer Age,Bike Value,Existing Customer?,Date,Store - Bike
0,1,22,481,No,25/04/2021,York - Road
1,2,28,1825,No,23/01/2021,York - Road
2,3,51,1903,No,03/07/2021,York - Rood
3,4,59,1059,No,24/01/2021,York - Road
4,5,44,1764,Yes,12/08/2021,York - Mountain


In [385]:
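# Split the "Store - Bike" field into "Store" and "Bike".
# Splitting on " - " (including the surrounding spaces) avoids stray whitespace in both new columns.
store_bike = df["Store - Bike"].str.split(" - ", expand=True)
store_bike = store_bike.rename(columns={0: "Store", 1: "Bike"})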
store_bike.head()

Unnamed: 0,Store,Bike
0,York,Road
1,York,Road
2,York,Rood
3,York,Road
4,York,Mountain


In [386]:
# Clean up the 'Bike' field to leave just three values in the 'Bike' field (Mountain, Gravel, Road)
store_bike["Bike"].value_counts()

 Mountain    390
 Road        375
 Gravel      177
 Rood         18
 Graval       12
 Gravle       11
 Mountaen     10
 Rowd          7
Name: Bike, dtype: int64

In [387]:
# Map each misspelling to the correct bike type (substring match, so any surrounding whitespace is left untouched)
typo_fixes = {"Mountaen": "Mountain", "Graval": "Gravel", "Gravle": "Gravel",
              "Rood": "Road", "Rowd": "Road"}
store_bike["Bike"] = store_bike["Bike"].replace(typo_fixes, regex=True)
store_bike["Bike"].value_counts()

 Road        400
 Mountain    400
 Gravel      200
Name: Bike, dtype: int64
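
Listing every misspelling works for this file, but it has to be updated whenever a new typo appears. Since the three valid bike types start with different letters, one alternative is to key the cleanup on the first letter. This is only a sketch under that assumption; the sample values and the `first_letter_to_bike` mapping are made up for illustration.

```python
import pandas as pd

# Illustrative values, including the leading space seen in the value counts above
bikes = pd.Series([" Road", " Rood", " Rowd", " Mountaen", " Gravle", " Graval", " Mountain"])

# Assumption: each valid bike type is uniquely identified by its first letter
first_letter_to_bike = {"M": "Mountain", "G": "Gravel", "R": "Road"}

# Strip whitespace, take the first character, and map it to the clean name
cleaned = bikes.str.strip().str[0].str.upper().map(first_letter_to_bike)
print(cleaned.value_counts())
```

Any value whose first letter is not in the mapping becomes `NaN`, so a quick `cleaned.isna().sum()` check would flag unexpected entries instead of silently keeping them.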

In [388]:
df = pd.concat([df, store_bike], axis=1)
df = df.drop(["Store - Bike"], axis=1)
df.head()

Unnamed: 0,Order ID,Customer Age,Bike Value,Existing Customer?,Date,Store,Bike
0,1,22,481,No,25/04/2021,York,Road
1,2,28,1825,No,23/01/2021,York,Road
2,3,51,1903,No,03/07/2021,York,Road
3,4,59,1059,No,24/01/2021,York,Road
4,5,44,1764,Yes,12/08/2021,York,Mountain


In [389]:
# Create two different cuts of the date field: "Quarter" and "Day of Month".
# The dates are day-first (e.g. 25/04/2021), so the format is given explicitly and the column is parsed once.
order_date = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df["Quarter"] = order_date.dt.quarter
df["Day of Month"] = order_date.dt.day
df.head()

Unnamed: 0,Order ID,Customer Age,Bike Value,Existing Customer?,Date,Store,Bike,Quarter,Day of Month
0,1,22,481,No,25/04/2021,York,Road,2,25
1,2,28,1825,No,23/01/2021,York,Road,1,23
2,3,51,1903,No,03/07/2021,York,Road,3,3
3,4,59,1059,No,24/01/2021,York,Road,1,24
4,5,44,1764,Yes,12/08/2021,York,Mountain,3,12
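
The explicit `format="%d/%m/%Y"` matters because several dates in this file are ambiguous: 03/07/2021 is 3 July when read day-first but 7 March when read month-first, which lands in a different quarter. A small sketch using three dates from the sample rows above (the variable names are illustrative):

```python
import pandas as pd

dates = pd.Series(["03/07/2021", "25/04/2021", "12/08/2021"])

# Day-first parsing, as used in the notebook
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
quarters = parsed.dt.quarter.tolist()
days = parsed.dt.day.tolist()

# The same string read month-first would be 7 March, i.e. quarter 1 instead of 3
misread = pd.to_datetime("03/07/2021", format="%m/%d/%Y")

print(quarters, days, misread.quarter)
```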


In [390]:
# Remove the first 10 orders (rows 0-9) as they are test values
df = df.drop(range(0, 10), axis=0)
df = df.reset_index(drop=True)
# Drop the raw date now that Quarter and Day of Month exist, then keep the 8 output fields
df = df.drop(["Date"], axis=1)
df = df.loc[:, ["Quarter", "Store", "Bike", "Order ID", "Customer Age",
                "Bike Value", "Existing Customer?", "Day of Month"]]
df.head(10)

Unnamed: 0,Quarter,Store,Bike,Order ID,Customer Age,Bike Value,Existing Customer?,Day of Month
0,4,Birmingham,Road,11,57,902,No,4
1,1,Leeds,Road,12,31,946,Yes,17
2,4,Birmingham,Road,13,17,1296,Yes,25
3,3,Manchester,Road,14,59,1166,Yes,18
4,4,Manchester,Mountain,15,24,1781,No,10
5,4,York,Mountain,16,59,1074,No,6
6,3,York,Mountain,17,57,1188,No,14
7,4,York,Mountain,18,56,544,No,23
8,4,York,Gravel,19,34,579,Yes,24
9,2,York,Gravel,20,17,1021,Yes,24
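
Dropping rows 0 to 9 by position relies on the file being sorted by Order ID. If that ordering is not guaranteed, filtering on the ID itself is a safer way to express the same requirement. A minimal sketch, assuming the test orders are exactly those with Order ID 1 to 10 (the sample frame below is made up and deliberately unsorted):

```python
import pandas as pd

# Made-up, unsorted sample to show that the filter does not depend on row position
orders = pd.DataFrame({
    "Order ID":   [12, 3, 11, 10, 1],
    "Bike Value": [946, 1903, 902, 500, 481],
})

# Keep only real orders, i.e. those after the first 10 test orders
real_orders = orders[orders["Order ID"] > 10].reset_index(drop=True)
print(real_orders)
```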


In [391]:
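# Average bike value per quarter, day of month and bike type; missing combinations become 0
group = (df.groupby(["Quarter", "Day of Month", "Bike"])["Bike Value"]
           .mean().astype(int).unstack(fill_value=0).reset_index())
# Normalise the bike column names in case they carry a leading space, then melt back to long format
group = group.rename(columns={" Gravel" : "Gravel", " Mountain" : "Mountain", " Road": "Road" })
group = group.melt(id_vars=["Quarter", "Day of Month"],
           value_name="Amount")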
group

Unnamed: 0,Quarter,Day of Month,Bike,Amount
0,1,1,Gravel,1988
1,1,2,Gravel,2461
2,1,3,Gravel,992
3,1,4,Gravel,1113
4,1,5,Gravel,0
...,...,...,...,...
367,4,27,Road,1726
368,4,28,Road,1205
369,4,29,Road,1017
370,4,30,Road,1057


In [392]:
# First pass: running total of the daily averages for Quarter 1 only, one line per bike type
traces = []

for bike, grouped in group.groupby("Bike"):
    tmp = grouped[(grouped["Quarter"] == 1)]
    trace = go.Scatter(x = tmp["Day of Month"],
                       y = tmp["Amount"].cumsum(),
                       mode = "lines",
                       name=bike,
                       hovertemplate="Bike=%s<br>Day=%%{x}<br>Amount=%%{y}<extra></extra>"% bike
                       )
    traces.append(trace)
layout = go.Layout(title = "Monthly Sales in Quarter 1")
fig = go.Figure(traces, layout)
fig.show()

In [403]:
from plotly.subplots import make_subplots

fig = make_subplots(rows = 4, cols = 1, shared_yaxes = "all")

quarters = [1, 2, 3, 4]
bikes = ["Gravel", "Mountain", "Road"]
color = ["Orange", "Red", "Grey"]
traces_list = []

for i, quarter in enumerate(quarters):
    traces = []
    j = 0
    for bike, grouped in group.groupby("Bike"):
        tmp = grouped[(grouped["Quarter"] == quarter)]
        trace = go.Scatter(x = tmp["Day of Month"],
                           y = tmp["Amount"].cumsum(),
                           mode = "lines",
                           line= dict(color=color[j]),
                           name=bike,
                           hovertemplate="Bike=%s<br>Day=%%{x}<br>Amount=%%{y}<extra></extra>"% bike,
                           legendgroup = i
                       )
        traces.append(trace)
        j += 1
    traces_list.append(traces)

In [404]:
# Add the three bike traces for each quarter to that quarter's subplot row (rows are 1-based)
for row, quarter_traces in enumerate(traces_list, start=1):
    for trace in quarter_traces:
        fig.add_trace(trace, row=row, col=1)

In [405]:
fig.update_layout(title = "Monthly Sales in each Quarter",
                  height = 1200, width = 700,
                  plot_bgcolor = "rgb(250, 250, 250)",
                  legend_tracegroupgap = 220,
                  xaxis1_title = "Day of Month",
                  xaxis2_title = "Day of Month",                  
                  xaxis3_title = "Day of Month",
                  xaxis4_title = "Day of Month",
                  yaxis1_title = "Q1 Total of Sales",
                  yaxis2_title = "Q2 Total of Sales",
                  yaxis3_title = "Q3 Total of Sales",
                  yaxis4_title = "Q4 Total of Sales"
                  )
fig.show()

In [396]:
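# index=False keeps the row index out of the file, so the output has exactly the 8 required fields
df.to_csv("Output.csv", index=False)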