# Week 12 Preppin Data Challenge
## Chin & Beard Suds Co
**Background**
Would you believe that Chin & Beard Suds Co have encountered yet more messy data? It seems someone was trying to be helpful by creating an aggregated view of sales per week for each scent of soap. However, in doing so we've lost the lower level of detail of the product sizes that make up these sales for each scent. We really need this for other analysis we've been carrying out!

Fortunately, we know what percentage of sales each product size makes up for each product in each week. Unfortunately, the data isn't stored in a way that will make it easy to join all the necessary information together.

**Requirements**
* Input data
* Our final output requires the Date to be in the Year Week Number format.
* We don't care about any product sizes that make up 0% of sales.
* In the Lookup Table, it seems the Product ID and Size have been erroneously concatenated. These need to be separated.  
* You'll need to do some cleaning of the Scent fields to join together the Total Sales and the Lookup Table.
* Calculate the sales per week for each scent and product size.
* Output the data
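
The solution below avoids the separation step by building the same concatenated key on the Percentage of Sales side. If the Lookup Table field does need to be split, one possible approach is a regular expression anchored on the known size values. This is a minimal sketch, assuming the only sizes are the four that appear in this notebook's sample output (50g, 100g, 250ml, 0.5l); `KNOWN_SIZES`, `SIZE_PATTERN` and `split_product` are illustrative names, not part of the original solution.

```python
import re
import pandas as pd

# Assumption: these four sizes are the only ones present in the data.
KNOWN_SIZES = ["50g", "100g", "250ml", "0.5l"]

# Lazily capture the ID, then require one of the known sizes at the very end.
SIZE_PATTERN = r"^(.*?)(" + "|".join(re.escape(s) for s in KNOWN_SIZES) + r")$"


def split_product(products: pd.Series) -> pd.DataFrame:
    """Split a concatenated 'Product' column into 'Product ID' and 'Size'."""
    parts = products.str.extract(SIZE_PATTERN)
    parts.columns = ["Product ID", "Size"]
    return parts


demo = pd.Series(["0c60c1260.5l", "0c60c126100g", "0c60c126250ml", "0c60c12650g"])
split_demo = split_product(demo)
print(split_demo)
```

Because the IDs themselves contain digits and can end in one, splitting at the first digit would not work; anchoring on the known sizes avoids that ambiguity.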

**Output**
5 Data Fields:
* Year Week Number
* Scent
* Product Type
* Size
* Sales
* 307 rows (308 including headers)

In [1]:
# import packages
import pandas as pd

In [2]:
# load data
dfTotSales = pd.read_excel('PD week 12 input.xlsx', sheet_name = 'Total Sales')
dfPctSales = pd.read_excel('PD week 12 input.xlsx', sheet_name = 'Percentage of Sales')
dfLookup = pd.read_excel('PD week 12 input.xlsx', sheet_name = 'Lookup Table')

In [3]:
# create product lookup column that concatenates product id and size
dfPctSales['Product'] = dfPctSales['Product ID'] + dfPctSales['Size']
dfPctSales.head()

Unnamed: 0,Product ID,Week Commencing,Size,Product Type,Percentage of Sales,Product
0,0c60c126,2020-01-06,0.5l,Liquid,0.33,0c60c1260.5l
1,0c60c126,2020-01-06,100g,Bar,0.13,0c60c126100g
2,0c60c126,2020-01-06,250ml,Liquid,0.2,0c60c126250ml
3,0c60c126,2020-01-06,50g,Bar,0.34,0c60c12650g
4,0c60c126,2020-01-13,0.5l,Liquid,0.8,0c60c1260.5l


In [4]:
# join pct sales and lookups to bring in scent
dfPctSales = pd.merge(dfLookup, dfPctSales, how = 'inner', on = 'Product')
dfPctSales.head()

Unnamed: 0,Scent,Product,Product ID,Week Commencing,Size,Product Type,Percentage of Sales
0,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-06,50g,Bar,0.2
1,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-13,50g,Bar,0.5
2,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-20,50g,Bar,0.34
3,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-27,50g,Bar,0.16
4,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-02-03,50g,Bar,0.04
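
An inner join silently drops any key that appears on only one side. One way to check for that is an outer merge with `indicator=True`, then a look at the rows that did not match on both sides. The sketch below uses small made-up frames (the `aaa`, `bbb` and `ccc` keys and their values are invented for illustration) rather than the challenge data.

```python
import pandas as pd

# Toy stand-ins for the lookup and percentage tables (all values invented).
lookup = pd.DataFrame({"Product": ["aaa50g", "bbb100g"],
                       "Scent": ["Mint", "Honey"]})
pct = pd.DataFrame({"Product": ["aaa50g", "ccc250ml"],
                    "Percentage of Sales": [0.34, 0.13]})

# The outer merge keeps every key; _merge records which side each came from.
checked = pd.merge(lookup, pct, how="outer", on="Product", indicator=True)
unmatched = checked[checked["_merge"] != "both"]
print(unmatched[["Product", "_merge"]])
```

An empty `unmatched` frame would suggest the inner join loses nothing.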


In [5]:
# clean up scent field
dfTotSales['Scent'] = dfTotSales['Scent'].str.replace(' ','').str.capitalize()

# correct tea tree
dfTotSales.loc[dfTotSales['Scent'] == 'Teatree', 'Scent'] = 'Tea Tree'
dfTotSales.head(10)

Unnamed: 0,Year Week Number,Scent,Total Scent Sales
0,202002,Coconut,20.37
1,202002,Honey,3459.07
2,202002,Lavendar,1869.73
3,202002,Lemongrass,6987.85
4,202002,Mint,5895.54
5,202002,Orange,3537.73
6,202002,Tea Tree,7629.48
7,202002,Vanilla,3365.6
8,202003,Coconut,872.54
9,202003,Honey,5223.75
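
Stripping spaces and capitalising collapses spelling variants to a single token, but it also removes the space that `Tea Tree` needs for the join, which is why the extra correction is applied. A small sketch with made-up messy inputs (the raw values in the workbook are not shown here, so these are assumptions):

```python
import pandas as pd

# Invented examples of inconsistent casing and spacing.
raw = pd.Series(["TEA TREE", "tea  tree", " Mint", "lemon grass", "COCONUT "])

# Remove every space, then upper-case the first letter and lower-case the rest.
cleaned = raw.str.replace(" ", "", regex=False).str.capitalize()

# Restore the space in the one two-word scent.
cleaned = cleaned.replace({"Teatree": "Tea Tree"})
print(cleaned.tolist())
```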


In [6]:
# check field type (appears that year week number is integer)
dfTotSales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 80 entries, 0 to 79
Data columns (total 3 columns):
Year Week Number     80 non-null int64
Scent                80 non-null object
Total Scent Sales    80 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.0+ KB


In [7]:
# change year week number to string
dfTotSales['Year Week Number'] = dfTotSales['Year Week Number'].astype(str)

In [8]:
# format dfPctSales so that we have year and week to join with dfTotSales
dfPctSales['Year'] = pd.DatetimeIndex(dfPctSales['Week Commencing']).year.astype(str)
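# weekofyear was removed in pandas 2.0; isocalendar().week gives the same ISO week number
dfPctSales['Week Number'] = pd.to_datetime(dfPctSales['Week Commencing']).dt.isocalendar().week.astype(int).astype(str).str.rjust(2,'0')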
dfPctSales['Year Week Number'] = dfPctSales['Year'] + dfPctSales['Week Number']
dfPctSales.head()

Unnamed: 0,Scent,Product,Product ID,Week Commencing,Size,Product Type,Percentage of Sales,Year,Week Number,Year Week Number
0,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-06,50g,Bar,0.2,2020,2,202002
1,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-13,50g,Bar,0.5,2020,3,202003
2,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-20,50g,Bar,0.34,2020,4,202004
3,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-01-27,50g,Bar,0.16,2020,5,202005
4,Lavendar,78773f6b96549676530187b50g,78773f6b96549676530187b,2020-02-03,50g,Bar,0.04,2020,6,202006
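
The week number here follows ISO 8601, where a week belongs to the year that contains its Thursday. Around New Year the ISO year can differ from the calendar year, so pairing the calendar year with the ISO week could produce a mismatched key. The rows shown above fall in early 2020, away from that boundary, so they are unaffected. The sketch below only illustrates the edge case, using `Series.dt.isocalendar()` (available in pandas 1.1 and later); the 2019-12-30 date is an invented example, not taken from the data.

```python
import pandas as pd

weeks = pd.Series(pd.to_datetime(["2019-12-30", "2020-01-06", "2020-01-13"]))
iso = weeks.dt.isocalendar()  # DataFrame with columns: year, week, day

# Build the key from the ISO year and the zero-padded ISO week.
year_week = (iso["year"].astype(int).astype(str)
             + iso["week"].astype(int).astype(str).str.zfill(2))

# Calendar year, for comparison with the ISO year.
calendar_year = weeks.dt.year.astype(str)

print(year_week.tolist())
print(calendar_year.tolist())
```

For 2019-12-30 the calendar year is 2019 while the ISO year is 2020, so taking both parts from `isocalendar()` keeps the key consistent.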


In [9]:
# join together dfTotSales and dfPctSales
dfSummary = pd.merge(dfTotSales, dfPctSales, how = 'inner', on = ['Scent','Year Week Number'])
dfSummary.head()

Unnamed: 0,Year Week Number,Scent,Total Scent Sales,Product,Product ID,Week Commencing,Size,Product Type,Percentage of Sales,Year,Week Number
0,202002,Coconut,20.37,1426aacc6733c9c50g,1426aacc6733c9c,2020-01-06,50g,Bar,0.2,2020,2
1,202002,Coconut,20.37,1426aacc6733c9c100g,1426aacc6733c9c,2020-01-06,100g,Bar,0.04,2020,2
2,202002,Coconut,20.37,1426aacc6733c9c250ml,1426aacc6733c9c,2020-01-06,250ml,Liquid,0.75,2020,2
3,202002,Coconut,20.37,1426aacc6733c9c0.5l,1426aacc6733c9c,2020-01-06,0.5l,Liquid,0.01,2020,2
4,202002,Honey,3459.07,388d62b984eaff95962e8ec2e9c350g,388d62b984eaff95962e8ec2e9c3,2020-01-06,50g,Bar,0.34,2020,2


In [10]:
# calculate sales
dfSummary['Sales'] = dfSummary['Total Scent Sales'] * dfSummary['Percentage of Sales']
dfSummary.head()

Unnamed: 0,Year Week Number,Scent,Total Scent Sales,Product,Product ID,Week Commencing,Size,Product Type,Percentage of Sales,Year,Week Number,Sales
0,202002,Coconut,20.37,1426aacc6733c9c50g,1426aacc6733c9c,2020-01-06,50g,Bar,0.2,2020,2,4.074
1,202002,Coconut,20.37,1426aacc6733c9c100g,1426aacc6733c9c,2020-01-06,100g,Bar,0.04,2020,2,0.8148
2,202002,Coconut,20.37,1426aacc6733c9c250ml,1426aacc6733c9c,2020-01-06,250ml,Liquid,0.75,2020,2,15.2775
3,202002,Coconut,20.37,1426aacc6733c9c0.5l,1426aacc6733c9c,2020-01-06,0.5l,Liquid,0.01,2020,2,0.2037
4,202002,Honey,3459.07,388d62b984eaff95962e8ec2e9c350g,388d62b984eaff95962e8ec2e9c3,2020-01-06,50g,Bar,0.34,2020,2,1176.0838
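
Since the percentages for a scent in a given week should add up to 100%, the size-level sales should sum back to the weekly total. A hedged sanity check, using the four Coconut rows shown above; `check` and `max_gap` are illustrative names.

```python
import pandas as pd

# The four Coconut rows for week 202002 from the output above.
coconut = pd.DataFrame({
    "Year Week Number": ["202002"] * 4,
    "Scent": ["Coconut"] * 4,
    "Size": ["50g", "100g", "250ml", "0.5l"],
    "Total Scent Sales": [20.37] * 4,
    "Percentage of Sales": [0.2, 0.04, 0.75, 0.01],
})
coconut["Sales"] = coconut["Total Scent Sales"] * coconut["Percentage of Sales"]

# Compare the weekly total with the sum of the size-level sales.
check = coconut.groupby(["Year Week Number", "Scent"]).agg(
    total=("Total Scent Sales", "first"),
    summed=("Sales", "sum"),
)
max_gap = (check["total"] - check["summed"]).abs().max()
print(max_gap < 1e-9)
```

The same `groupby` could be run on the full `dfSummary` before the unneeded columns are dropped; a tolerance is used because the products are floating-point values.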


In [11]:
# keep only the output fields, in the order listed under Output
dfSummary = dfSummary[['Year Week Number','Scent','Product Type','Size','Sales']]

In [12]:
# filter out sizes that make up 0% of sales (their calculated Sales are zero)
dfSummary = dfSummary[dfSummary['Sales'] != 0]

In [13]:
# write to csv
dfSummary.to_csv('Soap Sales.csv', index = False)
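
One thing to watch when reading the file back: `read_csv` infers `Year Week Number` as an integer unless told otherwise, which undoes the earlier string conversion. A small sketch using an in-memory buffer in place of `Soap Sales.csv`, with a single row based on the Coconut example above:

```python
import io
import pandas as pd

out = pd.DataFrame({
    "Year Week Number": ["202002"],
    "Scent": ["Coconut"],
    "Product Type": ["Bar"],
    "Size": ["50g"],
    "Sales": [4.074],
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
out.to_csv(buffer, index=False)

# Default read: the key column is inferred as an integer.
buffer.seek(0)
inferred = pd.read_csv(buffer)

# Passing dtype keeps the key as text.
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype={"Year Week Number": str})

print(inferred["Year Week Number"].dtype, as_text["Year Week Number"].iloc[0])
```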