# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: you just got hired by a major retail company! So, let's get prepared for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per supplier that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `warehouse_and_retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per supplier.
    - A table for the aggregate per item.

## Instructions
* Read the CSV file, which you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [148]:
# Read the sample file that a daily process will save in your folder.

import pandas as pd

data = pd.read_csv('/Users/vilmastasiute/Ironhack/Warehouse/Warehouse_and_Retail_Sales.csv')
data


Unnamed: 0,YEAR,MONTH,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
0,2017,4,ROYAL WINE CORP,100200,GAMLA CAB - 750ML,WINE,0.00,1.0,0.0
1,2017,4,SANTA MARGHERITA USA INC,100749,SANTA MARGHERITA P/GRIG ALTO - 375ML,WINE,0.00,1.0,0.0
2,2017,4,JIM BEAM BRANDS CO,10103,KNOB CREEK BOURBON 9YR - 100P - 375ML,LIQUOR,0.00,8.0,0.0
3,2017,4,HEAVEN HILL DISTILLERIES INC,10120,J W DANT BOURBON 100P - 1.75L,LIQUOR,0.00,2.0,0.0
4,2017,4,ROYAL WINE CORP,101664,RAMON CORDOVA RIOJA - 750ML,WINE,0.00,4.0,0.0
...,...,...,...,...,...,...,...,...,...
128350,2018,2,ANHEUSER BUSCH INC,9997,HOEGAARDEN 4/6NR - 12OZ,BEER,66.46,59.0,212.0
128351,2018,2,COASTAL BREWING COMPANY LLC,99970,DOMINION OAK BARREL STOUT 4/6 NR - 12OZ,BEER,9.08,7.0,35.0
128352,2018,2,BOSTON BEER CORPORATION,99988,SAM ADAMS COLD SNAP 1/6 KG,KEGS,0.00,0.0,32.0
128353,2018,2,,BC,BEER CREDIT,REF,0.00,0.0,-35.0


In [149]:
# Clean up the data.

# Checking data types - Do they seem logical for the values contained in the rows?

print(data.dtypes)

# Checking if there are NaN values in the dataset

null_cols = data.isnull().sum()
print("NaN values:\n", null_cols[null_cols > 0])


YEAR                  int64
MONTH                 int64
SUPPLIER             object
ITEM CODE            object
ITEM DESCRIPTION     object
ITEM TYPE            object
RETAIL SALES        float64
RETAIL TRANSFERS    float64
WAREHOUSE SALES     float64
dtype: object
NaN values:
 SUPPLIER     24
ITEM TYPE     1
dtype: int64
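
The `object` dtype of `ITEM CODE` is consistent with codes such as `BC` appearing next to numeric-looking codes in the rows displayed earlier. A minimal sketch of how the non-numeric codes could be listed, using a small hypothetical sample rather than the full file:

```python
import pandas as pd

# Hypothetical sample; the values imitate codes seen in the displayed rows.
sample = pd.DataFrame({"ITEM CODE": ["100200", "10103", "BC", "9997", "BC"]})

# Coerce to numbers: anything that is not purely numeric becomes NaN.
as_number = pd.to_numeric(sample["ITEM CODE"], errors="coerce")

# Keep the distinct codes that failed the conversion.
non_numeric_codes = sample.loc[as_number.isna(), "ITEM CODE"].unique().tolist()
print(non_numeric_codes)
```

Because of such codes, keeping the column as text (rather than casting it to an integer) seems the safer choice here.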


In [150]:
# Checking what the rows with NaN values look like

null_displ_supl = data[data['SUPPLIER'].isnull()]
null_displ_itype = data[data['ITEM TYPE'].isnull()]

null_displ_supl

Unnamed: 0,YEAR,MONTH,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
19483,2017,6,,1279,EMPTY WINE KEG - KEGS,DUNNAGE,0.0,0.0,-9.0
20056,2017,8,,1279,EMPTY WINE KEG - KEGS,DUNNAGE,0.0,0.0,-5.0
32282,2017,6,,BC,BEER CREDIT,REF,0.0,0.0,-58.0
32283,2017,6,,WC,WINE CREDIT,REF,0.0,0.0,-8.0
45871,2017,8,,BC,BEER CREDIT,REF,0.0,0.0,-699.0
45872,2017,8,,WC,WINE CREDIT,REF,0.0,0.0,-5.0
46518,2017,9,,1279,EMPTY WINE KEG - KEGS,DUNNAGE,0.0,0.0,-9.0
59259,2017,9,,BC,BEER CREDIT,REF,0.0,0.0,-502.0
59260,2017,9,,WC,WINE CREDIT,REF,0.0,0.0,-15.0
59920,2017,10,,1279,EMPTY WINE KEG - KEGS,DUNNAGE,0.0,0.0,-6.0


In [152]:
null_displ_itype

Unnamed: 0,YEAR,MONTH,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
66439,2017,10,REPUBLIC NATIONAL DISTRIBUTING CO,347939,FONTANAFREDDA BAROLO SILVER LABEL 750 ML,,0.0,0.0,1.0


In [153]:
# Since I still want the values related to undocumented suppliers and
# item types in my final calculations, I will replace NaN with 'Unknown'.

data['SUPPLIER'] = data['SUPPLIER'].fillna('Unknown')
data['ITEM TYPE'] = data['ITEM TYPE'].fillna('Unknown')
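
As an alternative sketch, both columns can be filled in one call by passing a dictionary to `DataFrame.fillna`. The small `sample` frame below is hypothetical and only imitates the shape of the real data:

```python
import pandas as pd

# Hypothetical sample with missing SUPPLIER and ITEM TYPE values.
sample = pd.DataFrame({
    "SUPPLIER": ["ROYAL WINE CORP", None],
    "ITEM TYPE": ["WINE", None],
    "WAREHOUSE SALES": [0.0, -35.0],
})

# The dictionary maps each column name to its own replacement value.
cleaned = sample.fillna({"SUPPLIER": "Unknown", "ITEM TYPE": "Unknown"})
print(cleaned)
```

This returns a new DataFrame and leaves `sample` unchanged, so the raw data stays available for comparison.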

In [154]:
# I see no reason to drop extreme values from this particular dataset, because I want to
# know the exact situation for each supplier and each item.

stats = data.describe()
stats


Unnamed: 0,YEAR,MONTH,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
count,128355.0,128355.0,128355.0,128355.0,128355.0
mean,2017.20603,7.079303,6.563037,7.188161,22.624213
std,0.404454,3.645826,28.924944,30.640156,239.693277
min,2017.0,1.0,-6.49,-27.66,-4996.0
25%,2017.0,5.0,0.0,0.0,0.0
50%,2017.0,8.0,0.33,0.0,1.0
75%,2017.0,10.0,3.25,4.0,4.0
max,2018.0,12.0,1616.6,1587.99,16271.75


In [155]:
# Checking the consistency of year, month values.

print(set(data['YEAR']))
print(set(data['MONTH']))

{2017, 2018}
{1, 2, 4, 5, 6, 8, 9, 10, 11, 12}
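
The month set above does not cover the whole calendar. A short sketch of how the gaps could be listed, with the observed months copied by hand from the output above (an assumption of this example; in the notebook one would use `set(data['MONTH'])`):

```python
# Months observed in the dataset, copied from the printed set.
observed_months = {1, 2, 4, 5, 6, 8, 9, 10, 11, 12}

# Compare against the full calendar to find months with no rows at all.
missing_months = sorted(set(range(1, 13)) - observed_months)
print(missing_months)  # [3, 7]
```

This check pools both years, so it only shows months that are absent from the file entirely, not months missing from one particular year.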


In [156]:
# Create the aggregates.
# One aggregate per supplier that adds up the rest of the values. (supplier)
# Omitting YEAR and MONTH, because summing these columns is not meaningful.

columns = ['RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES']
aggr_per_suppl = data.groupby(['SUPPLIER'], as_index=False)[columns].sum()
aggr_per_suppl


Unnamed: 0,SUPPLIER,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
0,8 VINI INC,2.78,2.00,1.00
1,A HARDY USA LTD,0.40,0.00,0.00
2,A I G WINE & SPIRITS,12.52,5.92,134.00
3,A VINTNERS SELECTIONS,8640.57,8361.10,29776.67
4,A&E INC,11.52,2.00,0.00
...,...,...,...,...
329,WINEBOW INC,1.24,-1.58,0.00
330,YOUNG WON TRADING INC,1058.65,1047.40,2528.90
331,YUENGLING BREWERY,9628.35,10851.17,53805.32
332,Z WINE GALLERY IMPORTS LLC,8.83,11.25,16.00
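
One possible sanity check for an aggregate like this is to confirm that the grouped totals add up to the totals of the source rows. The sketch below uses a small hypothetical sample, not the real file. Note that `groupby` drops rows whose key is NaN by default, which is one reason filling missing suppliers before aggregating matters:

```python
import numpy as np
import pandas as pd

columns = ["RETAIL SALES", "RETAIL TRANSFERS", "WAREHOUSE SALES"]

# Hypothetical sample; one row has a missing supplier.
sample = pd.DataFrame({
    "SUPPLIER": ["ROYAL WINE CORP", "ROYAL WINE CORP", None],
    "RETAIL SALES": [0.0, 0.0, 0.0],
    "RETAIL TRANSFERS": [1.0, 4.0, 0.0],
    "WAREHOUSE SALES": [0.0, 0.0, -35.0],
})

# Without filling, the NaN-supplier row is silently left out of the groups.
raw_groups = sample.groupby("SUPPLIER", as_index=False)[columns].sum()
raw_match = np.allclose(raw_groups[columns].sum(), sample[columns].sum())

# After filling, every row belongs to a group and the totals agree.
filled = sample.fillna({"SUPPLIER": "Unknown"})
per_supplier = filled.groupby("SUPPLIER", as_index=False)[columns].sum()
totals_match = np.allclose(per_supplier[columns].sum(), filled[columns].sum())

print(raw_match, totals_match)
```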


In [157]:
# One aggregate per item that adds up the rest of the values. (item code)

aggr_per_item = data.groupby(['ITEM CODE'], as_index=False)[columns].sum()
aggr_per_item


Unnamed: 0,ITEM CODE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
0,100003,0.00,0.0,1.0
1,100007,0.00,0.0,1.0
2,100008,0.00,0.0,1.0
3,100009,0.00,0.0,12.0
4,100011,0.00,0.0,3.0
...,...,...,...,...
23551,99970,118.24,118.0,456.0
23552,99988,0.00,0.0,70.0
23553,99990,68.50,90.0,985.5
23554,BC,0.00,0.0,-6022.0


In [158]:
# Write three tables (saved here as CSV files):
# 1. A table for the cleaned data.
# 2. A table for the aggregate per supplier.
# 3. A table for the aggregate per item.

# to_csv returns None when given a path, so the result is not assigned back;
# this keeps the DataFrames available for further use.
data.to_csv('/Users/vilmastasiute/Ironhack/Warehouse/Warehouse_and_Retail_Sales_cleaned.csv', index=False)
aggr_per_suppl.to_csv('/Users/vilmastasiute/Ironhack/Warehouse/Warehouse_and_Retail_Sales_aggregate_per_supplier.csv', index=False)
aggr_per_item.to_csv('/Users/vilmastasiute/Ironhack/Warehouse/Warehouse_and_Retail_Sales_aggregate_per_item.csv', index=False)
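
Since the task asks for tables in a local database, here is a minimal sketch of how the same three DataFrames could be written with `DataFrame.to_sql`. Everything specific in it is an assumption for illustration: the in-memory SQLite database, the table names (`sales_cleaned`, `sales_per_supplier`, `sales_per_item`), the snake_case column renaming, and the tiny sample data.

```python
import sqlite3

import pandas as pd

columns = ["RETAIL SALES", "RETAIL TRANSFERS", "WAREHOUSE SALES"]

# Hypothetical stand-in for the cleaned data.
sample = pd.DataFrame({
    "SUPPLIER": ["ROYAL WINE CORP", "ROYAL WINE CORP", "Unknown"],
    "ITEM CODE": ["100200", "101664", "BC"],
    "RETAIL SALES": [0.0, 0.0, 0.0],
    "RETAIL TRANSFERS": [1.0, 4.0, 0.0],
    "WAREHOUSE SALES": [0.0, 0.0, -35.0],
})
per_supplier = sample.groupby("SUPPLIER", as_index=False)[columns].sum()
per_item = sample.groupby("ITEM CODE", as_index=False)[columns].sum()


def snake_case(df):
    """Return a copy with column names like 'RETAIL SALES' -> 'retail_sales'."""
    return df.rename(columns=lambda name: name.lower().replace(" ", "_"))


# Hypothetical table names mapped to the frames they should hold.
tables = {
    "sales_cleaned": sample,
    "sales_per_supplier": per_supplier,
    "sales_per_item": per_item,
}

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
try:
    for table_name, frame in tables.items():
        # if_exists="replace" overwrites the table on every run.
        snake_case(frame).to_sql(table_name, conn, if_exists="replace", index=False)
    row_count = conn.execute("SELECT COUNT(*) FROM sales_cleaned").fetchone()[0]
    supplier_count = conn.execute("SELECT COUNT(*) FROM sales_per_supplier").fetchone()[0]
finally:
    conn.close()

print(row_count, supplier_count)
```

For a daily process, `if_exists="append"` on the cleaned table may be preferable to `"replace"`, with the aggregates recomputed from the accumulated data; which one fits depends on how the daily files overlap.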
