# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work.

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per supplier that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `warehouse_and_retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per supplier.
    - A table for the aggregate per item.

## Instructions
* Read the CSV file you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
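
The steps above can be sketched end to end as follows. This is a minimal sketch, not the reference solution: it assumes SQLite as the local database, the table names `cleaned_data`, `supplier_aggregate` and `item_aggregate` are hypothetical, and the only cleaning rule applied (labelling missing suppliers) is an assumption. The small `sample` frame stands in for the daily file.

```python
import sqlite3

import pandas as pd


def run_daily_process(df: pd.DataFrame, con: sqlite3.Connection) -> None:
    """Clean the daily data, build both aggregates, and write three tables."""
    cleaned = df.copy()
    # Assumed cleaning rule: label missing suppliers instead of dropping rows.
    cleaned["SUPPLIER"] = cleaned["SUPPLIER"].fillna("NO SUPPLIER")

    value_cols = ["RETAIL SALES", "WAREHOUSE SALES"]
    per_supplier = cleaned.groupby("SUPPLIER", as_index=False)[value_cols].sum()
    per_item = cleaned.groupby("ITEM DESCRIPTION", as_index=False)[value_cols].sum()

    # if_exists="replace" recreates each table on every run (an assumption;
    # "append" would accumulate the daily loads instead).
    cleaned.to_sql("cleaned_data", con, if_exists="replace", index=False)
    per_supplier.to_sql("supplier_aggregate", con, if_exists="replace", index=False)
    per_item.to_sql("item_aggregate", con, if_exists="replace", index=False)


# Tiny stand-in for the daily file, using column names from the dataset.
sample = pd.DataFrame({
    "SUPPLIER": ["ROYAL WINE CORP", "ROYAL WINE CORP", None],
    "ITEM DESCRIPTION": ["GAMLA CAB - 750ML", "RAMON CORDOVA RIOJA - 750ML", "BEER CREDIT"],
    "RETAIL SALES": [0.0, 0.0, 0.0],
    "WAREHOUSE SALES": [0.0, 0.0, -35.0],
})

con = sqlite3.connect(":memory:")  # in-memory database, for illustration only
run_daily_process(sample, con)
supplier_rows = pd.read_sql("SELECT * FROM supplier_aggregate", con)
```

For a persistent database, a file path could be passed to `sqlite3.connect` instead of `":memory:"`.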

In [2]:
## your code here
import pandas as pd

# Import the dataset
data = pd.read_csv('/Users/vpavandijk/LEARNING/Ironhack/copy of google drive class AMS0520/Copy of Warehouse_and_Retail_Sales.csv')
data


Unnamed: 0,YEAR,MONTH,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
0,2017,4,ROYAL WINE CORP,100200,GAMLA CAB - 750ML,WINE,0.00,1.0,0.0
1,2017,4,SANTA MARGHERITA USA INC,100749,SANTA MARGHERITA P/GRIG ALTO - 375ML,WINE,0.00,1.0,0.0
2,2017,4,JIM BEAM BRANDS CO,10103,KNOB CREEK BOURBON 9YR - 100P - 375ML,LIQUOR,0.00,8.0,0.0
3,2017,4,HEAVEN HILL DISTILLERIES INC,10120,J W DANT BOURBON 100P - 1.75L,LIQUOR,0.00,2.0,0.0
4,2017,4,ROYAL WINE CORP,101664,RAMON CORDOVA RIOJA - 750ML,WINE,0.00,4.0,0.0
...,...,...,...,...,...,...,...,...,...
128350,2018,2,ANHEUSER BUSCH INC,9997,HOEGAARDEN 4/6NR - 12OZ,BEER,66.46,59.0,212.0
128351,2018,2,COASTAL BREWING COMPANY LLC,99970,DOMINION OAK BARREL STOUT 4/6 NR - 12OZ,BEER,9.08,7.0,35.0
128352,2018,2,BOSTON BEER CORPORATION,99988,SAM ADAMS COLD SNAP 1/6 KG,KEGS,0.00,0.0,32.0
128353,2018,2,,BC,BEER CREDIT,REF,0.00,0.0,-35.0
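
The rows displayed above already include cases where RETAIL TRANSFERS is positive while RETAIL SALES is zero (rows 0 to 4). A minimal sketch for counting such rows is shown below; it uses a few of the displayed rows as stand-in data, and on the full dataset `data` would take the place of the hypothetical `rows` frame.

```python
import pandas as pd

# Rows 0-4 and 128350-128351 from the displayed output, two columns only.
rows = pd.DataFrame({
    "RETAIL SALES": [0.00, 0.00, 0.00, 0.00, 0.00, 66.46, 9.08],
    "RETAIL TRANSFERS": [1.0, 1.0, 8.0, 2.0, 4.0, 59.0, 7.0],
})

# Rows with a positive RETAIL TRANSFERS value but zero RETAIL SALES.
mask = (rows["RETAIL TRANSFERS"] > 0) & (rows["RETAIL SALES"] == 0)
n_transfer_without_sale = int(mask.sum())
share = n_transfer_without_sale / len(rows)

print(f"{n_transfer_without_sale} of {len(rows)} rows ({share:.0%})")
```

A high share on the full dataset would support treating the two columns as measuring different things.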


In [16]:
## I tried to work out what the RETAIL TRANSFERS column means. My first idea was that it holds the number of items sold, but some rows have a positive value here and zero in the 'RETAIL SALES' column, so that does not make sense.
## DECISION: disregard the column RETAIL TRANSFERS until we have more information (e.g. after interviews).

# #Check for null values in the dataset
# data.isnull().sum()

# #Get the rows where SUPPLIER is NaN
# data2 = data[data['SUPPLIER'].isnull()]
# data2
# #DECISION: fill the NaNs with 'NO SUPPLIER', since all of these rows appear to relate to some kind of return from warehouse to supplier (probably a deposit)
# data['SUPPLIER'] = data['SUPPLIER'].fillna('NO SUPPLIER')

# #Check if filling went ok
# data.isnull().sum()

# #Get the rows where ITEM TYPE is NaN
# data3 = data[data['ITEM TYPE'].isnull()]
# data3
# #DECISION: not important; leave the NaNs in.

# #Find low-variance columns
# import numpy as np
# low_variance = []

# for col in data.select_dtypes('number').columns:
#     minimum = min(data[col])
#     ninety_perc = np.percentile(data[col], 90)
#     if ninety_perc == minimum:
#         low_variance.append(col)

# print(low_variance)

# #DECISION: There are no low-variance columns

# #Check datatypes
# data.dtypes

# #DECISION: change the data types of YEAR and MONTH to object so they are excluded from the outlier check
# data['YEAR'] = data['YEAR'].astype('object')
# data['MONTH'] = data['MONTH'].astype('object')

# # Check if change went ok
# data.dtypes

# #check for outliers
# data.describe().transpose()
# stats = data.describe().transpose()
# stats['IQR'] = stats['75%'] - stats['25%']
# stats

# # Search for outliers using the cutoff from the lesson (1.5 * IQR)
# outliers = pd.DataFrame(columns=data.columns)

# for col in stats.index:
#     iqr = stats.at[col,'IQR']
#     cutoff = iqr * 1.5
#     lower = stats.at[col,'25%'] - cutoff
#     upper = stats.at[col,'75%'] + cutoff
#     results = data[(data[col] < lower) | 
#                    (data[col] > upper)].copy()
#     results['Outlier'] = col
#     outliers = pd.concat([outliers, results])  # DataFrame.append was removed in pandas 2.0

# outliers
# #DECISION: this gives 55606 outliers (43% of all rows), which is too many; check again with new values.

# # Search for outliers with a wider cutoff (3 * IQR)
# outliers = pd.DataFrame(columns=data.columns)

# for col in stats.index:
#     iqr = stats.at[col,'IQR']
#     cutoff = iqr * 3
#     lower = stats.at[col,'25%'] - cutoff
#     upper = stats.at[col,'75%'] + cutoff
#     results = data[(data[col] < lower) | 
#                    (data[col] > upper)].copy()
#     results['Outlier'] = col
#     outliers = pd.concat([outliers, results])  # DataFrame.append was removed in pandas 2.0

# outliers

# #Repeated the outlier calculation with iqr*2 and iqr*3. These gave 49058 (38%) and 39616 (31%) outliers respectively. These numbers are still so high that I cannot draw any conclusions from them.
# #So DECISION: do not do anything with the outliers.

# #checking for duplicates
# before = len(data)
# data4 = data.drop_duplicates()
# after = len(data4)
# print('Number of duplicate records dropped: ', str(before - after))
# #REMARK : no duplicate records

Number of duplicate records dropped:  0
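
The IQR rule explored in the cell above can also be wrapped in a small helper so that several multipliers are compared in one pass. This is a sketch under assumptions: the function name `count_iqr_outliers` is hypothetical, the `toy` frame is made-up data, and the helper counts each row at most once, whereas the loop in the cell collects a row once per offending column, so the totals need not match.

```python
import pandas as pd


def count_iqr_outliers(df: pd.DataFrame, columns: list, k: float = 1.5) -> int:
    """Count rows with at least one value outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[columns].quantile(0.25)
    q3 = df[columns].quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the DataFrame columns.
    outside = (df[columns] < q1 - k * iqr) | (df[columns] > q3 + k * iqr)
    return int(outside.any(axis=1).sum())


# Made-up values, chosen so that the multiplier changes the result.
toy = pd.DataFrame({
    "RETAIL SALES": [0.0, 1.0, 2.0, 3.0, 7.0],
    "WAREHOUSE SALES": [0.0, 1.0, 2.0, 3.0, 4.0],
})
cols = ["RETAIL SALES", "WAREHOUSE SALES"]

counts = {k: count_iqr_outliers(toy, cols, k) for k in (1.5, 2, 3)}
print(counts)
```

With heavily skewed sales data, where most values sit near zero, a large share of rows can fall outside the fences for any reasonable multiplier, which is consistent with the percentages noted in the cell.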


In [17]:
#making the aggregates

supplier_aggregate = data.groupby(['SUPPLIER'])[['RETAIL SALES', 'WAREHOUSE SALES']].agg(['sum'])
supplier_aggregate

#This shows the total sales per supplier

SUPPLIER,RETAIL SALES (sum),WAREHOUSE SALES (sum)
8 VINI INC,2.78,1.00
A HARDY USA LTD,0.40,0.00
A I G WINE & SPIRITS,12.52,134.00
A VINTNERS SELECTIONS,8640.57,29776.67
A&E INC,11.52,0.00
...,...,...
WINEBOW INC,1.24,0.00
YOUNG WON TRADING INC,1058.65,2528.90
YUENGLING BREWERY,9628.35,53805.32
Z WINE GALLERY IMPORTS LLC,8.83,16.00
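
Because `.agg(['sum'])` is called with a list, the result has two column levels (the value column and the function name), which is awkward to write to a database table. One possible way to flatten it is sketched below; the `toy` frame holds made-up rows, and the `_sum` naming scheme is an assumption rather than something the notebook prescribes.

```python
import pandas as pd

# Made-up rows using the dataset's column names.
toy = pd.DataFrame({
    "SUPPLIER": ["8 VINI INC", "8 VINI INC", "A&E INC"],
    "RETAIL SALES": [0.33, 0.08, 11.52],
    "WAREHOUSE SALES": [0.0, 0.0, 0.0],
})

agg = toy.groupby("SUPPLIER")[["RETAIL SALES", "WAREHOUSE SALES"]].agg(["sum"])

# Join the two header levels, e.g. ("RETAIL SALES", "sum") -> "RETAIL SALES_sum",
# then turn the SUPPLIER index back into an ordinary column.
flat = agg.copy()
flat.columns = ["_".join(col) for col in flat.columns]
flat = flat.reset_index()

print(flat.columns.tolist())
```

Calling `.sum()` directly instead of `.agg(['sum'])` would avoid the second level altogether.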


In [18]:
supplier_aggregate_per_month = data.groupby(['SUPPLIER', 'YEAR', 'MONTH'])[['RETAIL SALES', 'WAREHOUSE SALES']].agg(['sum'])
supplier_aggregate_per_month

#This shows the sales per supplier per month per year (to see the development of sales)

SUPPLIER,YEAR,MONTH,RETAIL SALES (sum),WAREHOUSE SALES (sum)
8 VINI INC,2017,5,0.33,0.0
8 VINI INC,2017,6,0.08,0.0
8 VINI INC,2017,8,0.24,0.0
8 VINI INC,2017,9,0.08,0.0
8 VINI INC,2017,10,0.16,1.0
...,...,...,...,...
ZURENA LLC,2017,10,0.97,0.0
ZURENA LLC,2017,11,1.05,0.0
ZURENA LLC,2017,12,2.35,0.0
ZURENA LLC,2018,1,0.56,0.0


In [19]:
item_aggregate = data.groupby(['ITEM DESCRIPTION'])[['RETAIL SALES', 'WAREHOUSE SALES']].agg(['sum'])
item_aggregate

#This shows the total sales per item

ITEM DESCRIPTION,RETAIL SALES (sum),WAREHOUSE SALES (sum)
! EA ! - 750ML,0.00,8.0
!ZAZIN - 750ML,0.00,1.0
10 BARREL JOE IPA 1/6K,0.00,35.0
10 BARREL JOE IPA 4/6PK NR,76.50,225.0
10 SPAN CC CAB - 750ML,0.81,17.0
...,...,...
ZYR RUSSIAN VODKA - 750ML,27.74,3.0
ZYWIEC 20/16.9.OZ NR,2.90,21.0
ZYWIEC 4/6 16.9OZ CAN,15.60,32.0
ZYWIEC 4/6NR - 11.2OZ,5.00,97.0


In [20]:
item_aggregate_per_month = data.groupby(['ITEM DESCRIPTION', 'YEAR', 'MONTH'])[['RETAIL SALES', 'WAREHOUSE SALES']].agg(['sum'])
item_aggregate_per_month

#This shows the sales per item per month per year (the idea was to see the development over time)
#In practice it adds little value, since it is more or less the input data, as the number of rows shows.

ITEM DESCRIPTION,YEAR,MONTH,RETAIL SALES (sum),WAREHOUSE SALES (sum)
! EA ! - 750ML,2017,5,0.0,6.0
! EA ! - 750ML,2017,11,0.0,2.0
!ZAZIN - 750ML,2017,5,0.0,1.0
10 BARREL JOE IPA 1/6K,2017,8,0.0,11.0
10 BARREL JOE IPA 1/6K,2017,9,0.0,8.0
...,...,...,...,...
ZYWIEC PORTER 5/4 NR - 16.9OZ,2017,5,0.0,4.0
ZYWIEC PORTER 5/4 NR - 16.9OZ,2017,9,0.0,5.0
ZYWIEC PORTER 5/4 NR - 16.9OZ,2017,11,0.0,2.0
ZYWIEC PORTER 5/4 NR - 16.9OZ,2017,12,0.0,1.0
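
The remark that the per-item, per-month aggregate is close to the input data can be checked by comparing the number of groups with the number of input rows. The sketch below uses a made-up `toy` frame; on the full dataset, `data` would take its place. A ratio near 1 means the grouping barely reduces the data.

```python
import pandas as pd

# Made-up rows: three distinct (item, year, month) combinations in four rows.
toy = pd.DataFrame({
    "ITEM DESCRIPTION": ["! EA ! - 750ML", "! EA ! - 750ML", "!ZAZIN - 750ML", "!ZAZIN - 750ML"],
    "YEAR": [2017, 2017, 2017, 2017],
    "MONTH": [5, 11, 5, 5],
    "WAREHOUSE SALES": [6.0, 2.0, 0.5, 0.5],
})

n_groups = toy.groupby(["ITEM DESCRIPTION", "YEAR", "MONTH"]).ngroups
ratio = n_groups / len(toy)

print(f"{n_groups} groups from {len(toy)} rows (ratio {ratio:.2f})")
```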
