# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per supplier that sums the remaining numeric values.
* One aggregate per item that sums the remaining numeric values.

You can import the dataset `warehouse_and_retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per supplier.
    - A table for the aggregate per item.

## Instructions
* Read the CSV file you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.
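
The read, clean and aggregate steps above can be sketched as small functions that a daily job calls in order. This is only a minimal sketch on a made-up three-row sample: the function names (`clean`, `aggregate`, `build_tables`) and the table names are assumptions, not part of the assignment, and the final database write is left out.

```python
import io
import pandas as pd

# Tiny made-up sample with the same columns as the real file.
SAMPLE_CSV = """YEAR,MONTH,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
2017,4,ROYAL WINE CORP,100200,GAMLA CAB - 750ML,WINE,0.0,1.0,0.0
2017,4,ROYAL WINE CORP,101664,RAMON CORDOVA RIOJA - 750ML,WINE,0.0,4.0,0.0
2017,4,JIM BEAM BRANDS CO,10103,KNOB CREEK BOURBON 9YR - 100P - 375ML,LIQUOR,0.0,8.0,0.0
"""

VALUE_COLS = ["RETAIL SALES", "RETAIL TRANSFERS", "WAREHOUSE SALES"]


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing suppliers, drop rows without an item type, add a row total."""
    out = df.copy()
    out["SUPPLIER"] = out["SUPPLIER"].fillna("UNAVAILABLE")
    out = out.dropna(subset=["ITEM TYPE"])
    out["Total_sales"] = out[VALUE_COLS].sum(axis=1)
    return out


def aggregate(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Sum every value column per `key`, keeping the key as a regular column."""
    return df.groupby(key, as_index=False)[VALUE_COLS + ["Total_sales"]].sum()


def build_tables(source) -> dict:
    """Read -> clean -> aggregate; returns one DataFrame per target table."""
    cleaned = clean(pd.read_csv(source))
    return {
        "cleaned_sales": cleaned,
        "supplier_sales": aggregate(cleaned, "SUPPLIER"),
        "item_sales": aggregate(cleaned, "ITEM TYPE"),
    }


tables = build_tables(io.StringIO(SAMPLE_CSV))
print(tables["supplier_sales"])
```

In a real run, `build_tables` would receive the path of the daily file instead of the in-memory sample.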

In [17]:
import pandas as pd
import numpy as np
import pymysql
from sqlalchemy import create_engine

In [114]:
engine = create_engine('mysql+pymysql://root:asdfgh@localhost:3306/sales')
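
The connection string above embeds the database password in the notebook. One common alternative is to read the credentials from environment variables. The sketch below builds the same kind of URL that way; the variable names (`SALES_DB_USER`, `SALES_DB_PASSWORD`, `SALES_DB_HOST`, `SALES_DB_PORT`) and the helper `build_db_url` are hypothetical and not part of the original notebook.

```python
import os
from urllib.parse import quote_plus


def build_db_url(database: str = "sales") -> str:
    """Assemble a mysql+pymysql URL from environment variables (names are assumptions)."""
    user = os.environ.get("SALES_DB_USER", "root")
    password = os.environ.get("SALES_DB_PASSWORD", "")
    host = os.environ.get("SALES_DB_HOST", "localhost")
    port = os.environ.get("SALES_DB_PORT", "3306")
    # quote_plus escapes characters such as '@' or '/' that would break the URL.
    return f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{database}"


url = build_db_url()  # the result can be passed to create_engine(url)
```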


In [115]:
engine.connect()

<sqlalchemy.engine.base.Connection at 0x12fd7b6c860>

In [20]:
df = pd.read_csv('Warehouse_and_Retail_Sales.csv')

In [21]:
print(len(df))

128355


In [22]:
df['DATE'] = pd.to_datetime(df[['YEAR', 'MONTH']].assign(DAY=1))
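
`pd.to_datetime` can assemble a date from separate year, month and day columns, which is why a constant `DAY=1` is added before the call. A small sketch on made-up rows (the frame `sample` is illustrative only):

```python
import pandas as pd

# Two made-up rows with the same YEAR and MONTH columns as the sales file.
sample = pd.DataFrame({"YEAR": [2017, 2017], "MONTH": [4, 5]})

# to_datetime needs year, month and day; every row is pinned to the first of the month.
sample["DATE"] = pd.to_datetime(sample[["YEAR", "MONTH"]].assign(DAY=1))
print(sample["DATE"].dt.strftime("%Y-%m-%d").tolist())
```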

In [23]:
df.drop(['YEAR', 'MONTH'], axis=1, inplace=True)

In [24]:
df.columns

Index(['SUPPLIER', 'ITEM CODE', 'ITEM DESCRIPTION', 'ITEM TYPE',
       'RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES', 'DATE'],
      dtype='object')

In [35]:
df.isna().sum()

SUPPLIER            0
ITEM CODE           0
ITEM DESCRIPTION    0
ITEM TYPE           1
RETAIL SALES        0
RETAIL TRANSFERS    0
WAREHOUSE SALES     0
DATE                0
dtype: int64

In [96]:
df['SUPPLIER'] = df['SUPPLIER'].fillna('UNAVAILABLE')

In [97]:
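# Drop the rows whose ITEM TYPE is missing (dropna on the column alone leaves df unchanged)
df.dropna(subset=['ITEM TYPE'], inplace=True)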
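
Rows with a missing `ITEM TYPE` have to be removed at the DataFrame level with `dropna(subset=...)`; calling `dropna` on the selected column alone only affects that temporary Series and leaves the DataFrame as it was. A sketch on toy data (the frame `toy` is made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "ITEM TYPE": ["WINE", None, "BEER"],
    "RETAIL SALES": [1.0, 2.0, 3.0],
})

# Drop whole rows where ITEM TYPE is missing; missing values in other columns are ignored.
toy_clean = toy.dropna(subset=["ITEM TYPE"])
print(len(toy), "->", len(toy_clean))
```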

In [98]:
df.head()

,SUPPLIER,ITEM CODE,ITEM DESCRIPTION,ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES,DATE,Total_sales
0,ROYAL WINE CORP,100200,GAMLA CAB - 750ML,WINE,0.0,1.0,0.0,2017-04-01,1.0
1,SANTA MARGHERITA USA INC,100749,SANTA MARGHERITA P/GRIG ALTO - 375ML,WINE,0.0,1.0,0.0,2017-04-01,1.0
2,JIM BEAM BRANDS CO,10103,KNOB CREEK BOURBON 9YR - 100P - 375ML,LIQUOR,0.0,8.0,0.0,2017-04-01,8.0
3,HEAVEN HILL DISTILLERIES INC,10120,J W DANT BOURBON 100P - 1.75L,LIQUOR,0.0,2.0,0.0,2017-04-01,2.0
4,ROYAL WINE CORP,101664,RAMON CORDOVA RIOJA - 750ML,WINE,0.0,4.0,0.0,2017-04-01,4.0


In [99]:
len(df['SUPPLIER'].unique())

334

In [100]:
len(df['ITEM CODE'].unique())

23556

In [101]:
# Sum only the numeric columns per item type; text and date columns cannot be summed
item_sales = df.groupby('ITEM TYPE').sum(numeric_only=True)

In [102]:
item_sales

ITEM TYPE,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES,Total_sales
BEER,209763.11,234924.44,2437617.32,2882304.87
DUNNAGE,0.0,0.0,-45331.0,-45331.0
KEGS,0.0,0.0,43558.0,43558.0
LIQUOR,309847.85,334176.41,33173.32,677197.58
NON-ALCOHOL,8109.97,9058.37,8656.72,25825.06
REF,281.34,171.92,-6754.0,-6300.74
STR_SUPPLIES,995.98,3594.7,0.0,4590.68
WINE,313400.42,340710.51,433009.47,1087120.4


In [103]:
df['Total_sales'] = df['RETAIL SALES']+ df['WAREHOUSE SALES']+df['RETAIL TRANSFERS']

In [104]:
# One row per supplier; the string 'sum' avoids the deprecation warning for the builtin sum
sales_per_supplier = df.pivot_table(values=['RETAIL SALES', 'WAREHOUSE SALES', 'RETAIL TRANSFERS', 'Total_sales'], index=['SUPPLIER'], aggfunc='sum')

In [121]:
# Keep the ten suppliers with the highest total sales
sales_per_supplier = sales_per_supplier.sort_values('Total_sales', ascending=False).head(10)
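
Keeping the suppliers with the highest `Total_sales` can also be written with `DataFrame.nlargest`, which sorts and truncates in one call. A sketch on made-up totals (top 2 instead of top 10 to keep it short):

```python
import pandas as pd

totals = pd.DataFrame(
    {"Total_sales": [5.0, 8.0, 2.0]},
    index=pd.Index(
        ["ROYAL WINE CORP", "JIM BEAM BRANDS CO", "HEAVEN HILL DISTILLERIES INC"],
        name="SUPPLIER",
    ),
)

# Same rows here as sort_values("Total_sales", ascending=False).head(2).
top = totals.nlargest(2, "Total_sales")
print(top)
```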

In [108]:
# One row per (supplier, item type) pair
sales_per_supplier_per_item = df.pivot_table(index=['SUPPLIER', 'ITEM TYPE'], values=['RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES', 'Total_sales'], aggfunc='sum')

In [118]:
# Keep the ten (supplier, item type) pairs with the highest total sales
sales_per_supplier_per_item = sales_per_supplier_per_item.sort_values('Total_sales', ascending=False).head(10)

In [125]:
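# index=True writes the SUPPLIER and ITEM TYPE index levels as columns;
# with index=False the table would contain only the sums, without their keys
sales_per_supplier_per_item.to_sql("Supplier_item",
          engine,
          index=True,
          if_exists='append',
          chunksize=25000,
          method=None)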

In [127]:
# index=True writes the SUPPLIER index as a column so each row keeps its supplier name
sales_per_supplier.to_sql("Supplier_sales",
          engine,
          index=True,
          if_exists='append',
          chunksize=25000,
          method=None)



In [128]:
# index=True writes the ITEM TYPE index as a column so each row keeps its item type
item_sales.to_sql("item_sales",
          engine,
          index=True,
          if_exists='append',
          chunksize=25000,
          method=None)
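
After a `groupby` or `pivot_table`, the grouping key (for example `SUPPLIER`) lives in the index, so it only reaches the database if the index is written or first turned into a column. The sketch below shows this round trip. It uses an in-memory SQLite connection from the standard library as a stand-in for the MySQL engine, and the data and table name are made up. It also uses `if_exists='replace'` so that re-running does not duplicate rows, which is an assumption about the desired behaviour rather than something the task specifies.

```python
import sqlite3
import pandas as pd

sales = pd.DataFrame({
    "SUPPLIER": ["ROYAL WINE CORP", "ROYAL WINE CORP", "JIM BEAM BRANDS CO"],
    "Total_sales": [1.0, 4.0, 8.0],
})
per_supplier = sales.groupby("SUPPLIER").sum()  # SUPPLIER is now the index

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL engine

# reset_index() turns SUPPLIER back into a column, so index=False loses nothing.
per_supplier.reset_index().to_sql(
    "supplier_sales", conn, index=False, if_exists="replace"
)

# Read the table back to confirm the supplier names were stored.
stored = pd.read_sql("SELECT * FROM supplier_sales ORDER BY SUPPLIER", conn)
print(stored)
conn.close()
```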