# Mandatory Challenge
## Context
You work in the data analysis team of a very important company. On Monday, the company shares some good news with you: it has just been hired by a major retail company! So get ready for a huge amount of work!

Then you get to work with your team and define the following tasks to perform:   
1. You need to start your analysis using data from the past.  
2. You need to define a process that takes your daily data as an input and integrates it.  

You are in charge of the second part, so you are provided with a sample file that you will have to read daily. To complete your task, you need the following aggregates:
* One aggregate per supplier that adds up the rest of the values.
* One aggregate per item that adds up the rest of the values.

You can import the dataset `warehouse_and_retail_sales` from Ironhack's database. 

## Your task
Therefore, your process will consist of the following steps:
1. Read the sample file that a daily process will save in your folder. 
2. Clean up the data.
3. Create the aggregates.
4. Write three tables in your local database: 
    - A table for the cleaned data.
    - A table for the aggregate per supplier.
    - A table for the aggregate per item.
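
The four steps above can be sketched as a single function. This is only a minimal illustration, not the notebook's actual implementation: the function name `run_daily_process`, the three-row sample (its last row is made up to show a missing supplier), and the in-memory SQLite database used in place of the local MySQL database are all assumptions, chosen so the example runs on its own.

```python
import sqlite3
from io import StringIO

import pandas as pd

# Hypothetical sample standing in for the daily file; the last row is invented
SAMPLE_CSV = """SUPPLIER,ITEM CODE,ITEM DESCRIPTION,RETAIL SALES,RETAIL TRANSFERS,WAREHOUSE SALES
ROYAL WINE CORP,100200,GAMLA CAB - 750ML,0.0,1.0,0.0
ROYAL WINE CORP,101664,RAMON CORDOVA RIOJA - 750ML,0.0,4.0,0.0
,999999,INTERNAL RECORD,0.0,0.0,0.0
"""

VALUE_COLS = ['RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES']


def run_daily_process(csv_source, con):
    """Read, clean, aggregate, and store one daily file (illustrative only)."""
    raw = pd.read_csv(csv_source)                    # 1. read the file
    clean = raw.dropna(subset=['SUPPLIER']).copy()   # 2. drop rows without a supplier
    # 3. create the aggregates
    per_supplier = clean.groupby('SUPPLIER')[VALUE_COLS].sum().reset_index()
    per_item = clean.groupby('ITEM CODE')[VALUE_COLS].sum().reset_index()
    # 4. write the three tables
    clean.to_sql('cleaned_data', con, if_exists='replace', index=False)
    per_supplier.to_sql('aggregated_per_supplier', con, if_exists='replace', index=False)
    per_item.to_sql('aggregated_per_item', con, if_exists='replace', index=False)
    return clean, per_supplier, per_item


con = sqlite3.connect(':memory:')  # assumption: SQLite instead of MySQL
clean, per_supplier, per_item = run_daily_process(StringIO(SAMPLE_CSV), con)
print(per_supplier)
```

Wrapping the steps in one function makes it easier to rerun the process each day against a new file.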

## Instructions
* Read the CSV file you can find in Ironhack's database.
* Clean the data and create the aggregates as you see fit.
* Create the tables in your local database.
* Populate them with your process.

In [None]:
# your code here
import numpy as np
import pandas as pd
import requests
from io import StringIO
from collections import Counter
import pymysql
from sqlalchemy import create_engine

In [None]:
orig_url = 'https://drive.google.com/file/d/1ZsHSCYciWkUd8y9mvEV5RZQiO1Qk0WLh/view?usp=sharing'

file_id = orig_url.split('/')[-2]
dwn_url = 'https://drive.google.com/uc?export=download&id=' + file_id
url = requests.get(dwn_url).text
csv_raw = StringIO(url)
raw_data = pd.read_csv(csv_raw)
raw_data.head()
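
The link rewriting in this cell can be isolated into a small helper and checked without touching the network. The function name `drive_download_url` is an assumption made for illustration; the logic is the same string manipulation used above, and it only applies to share links of the form `/file/d/<id>/view`.

```python
def drive_download_url(share_url):
    """Turn a Google Drive '/file/d/<id>/view' share link into a direct-download link."""
    # In this link format the file id is the second-to-last path segment
    file_id = share_url.split('/')[-2]
    return 'https://drive.google.com/uc?export=download&id=' + file_id


share = 'https://drive.google.com/file/d/1ZsHSCYciWkUd8y9mvEV5RZQiO1Qk0WLh/view?usp=sharing'
print(drive_download_url(share))
```

If the share link uses a different layout, the `[-2]` index would pick the wrong segment, so it is worth checking the result before downloading.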

In [None]:
# Check if there are null values
raw_data.isnull().sum()

In [None]:
# Check why they may be null. Should they be removed or replaced?
check = raw_data.loc[raw_data['SUPPLIER'].isnull()]
check
# These seem to be internal technical records that are irrelevant to sales. In real life I would ask the owner of this data
# for clarification, but here I have to assume I'm right. This means I can remove these rows from the table.
# .copy() makes clean_data an independent DataFrame, so later assignments do not raise SettingWithCopyWarning.
clean_data = raw_data.loc[raw_data['SUPPLIER'].notnull()].copy()
clean_data.head()
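
As a side note, filtering with a boolean mask and calling `dropna` on a single column should give the same result. The sketch below uses a small made-up frame (the name `toy` and its values are assumptions) to compare the two.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with one missing supplier
toy = pd.DataFrame({
    'SUPPLIER': ['ROYAL WINE CORP', np.nan, 'JIM BEAM BRANDS CO'],
    'RETAIL SALES': [0.0, 0.0, 0.0],
})

# Option 1: keep the rows where SUPPLIER is present, using a boolean mask
by_mask = toy.loc[toy['SUPPLIER'].notnull()].copy()

# Option 2: drop the rows where SUPPLIER is missing
by_dropna = toy.dropna(subset=['SUPPLIER'])

print(by_mask.equals(by_dropna))
```

`dropna(subset=...)` returns a new DataFrame, so it also avoids modifying a view of the raw data.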

In [None]:
# There was also one row where the ITEM TYPE column had a null value. Let's check if it's still there and what we can do about it.
check = clean_data[clean_data['ITEM TYPE'].isnull()]
check

In [None]:
# I know that Barolo is a wine, and this is also easy to verify with a Google search.
clean_data.loc[clean_data['ITEM TYPE'].isnull(), 'ITEM TYPE'] = 'WINE'
check = clean_data[clean_data['ITEM TYPE'].isnull()]
check

In [None]:
print(len(clean_data['ITEM CODE'].unique()))
print(len(clean_data['ITEM DESCRIPTION'].unique()))
print(len(clean_data['ITEM CODE'].unique()) - len(clean_data['ITEM DESCRIPTION'].unique()))
# Some items share the same code but have different descriptions
check = clean_data[['ITEM CODE', 'ITEM DESCRIPTION']]
check = check.groupby('ITEM CODE')['ITEM DESCRIPTION'].nunique()
item_code_list = check[check > 1].index.values.tolist()
item_code_list

In [None]:
check_dict = {}
for item in item_code_list:
    lst = clean_data.loc[clean_data['ITEM CODE'] == item]['ITEM DESCRIPTION'].unique().tolist()
    check_dict[item] = lst
# Multiple descriptions per code come from spelling variations, hence they can be aligned.
# I will take the first spelling by default.
check_dict

In [None]:
codes = clean_data['ITEM CODE'].unique().tolist()
print(len(codes))  # Self-check: the number of unique codes has not changed since it was printed above
# Map each code to the first description that appears for it
code_dic = clean_data.drop_duplicates('ITEM CODE').set_index('ITEM CODE')['ITEM DESCRIPTION'].to_dict()
code_dic

In [None]:
len(code_dic) # Self check again

In [None]:
clean_data['ITEM DESCRIPTION'] = clean_data['ITEM CODE'].map(code_dic)
clean_data.head()
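
The same alignment can be written without building a dictionary, using `groupby(...).transform('first')`. This is an alternative sketch, not the method used in this notebook; the frame `toy` and the codes `A1` and `B2` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: code A1 appears with two spellings of the same description
toy = pd.DataFrame({
    'ITEM CODE': ['A1', 'A1', 'B2'],
    'ITEM DESCRIPTION': ['GAMLA CAB - 750ML', 'GAMLA CAB-750ML', 'RAMON CORDOVA RIOJA - 750ML'],
})

# Replace every description with the first one seen for its code
toy['ITEM DESCRIPTION'] = toy.groupby('ITEM CODE')['ITEM DESCRIPTION'].transform('first')

# Re-run the check: no code should have more than one description now
multi = toy.groupby('ITEM CODE')['ITEM DESCRIPTION'].nunique()
print(multi[multi > 1].empty)
```

One difference to keep in mind: `'first'` picks the first non-null description in each group, whereas the dictionary approach takes whatever value is in the first row.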
# I also re-ran the duplicate-description check above to confirm that item_code_list is now empty: each code has a single description.

In [32]:
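import os
# Credentials are read from environment variables (DB_USER and DB_PASSWORD are assumed names) instead of being hardcoded
engine = create_engine(f"mysql+pymysql://{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}@localhost/warehouse")
clean_data.to_sql(con=engine, name='cleaned_data', if_exists='replace')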

| Unnamed: 0 | index | YEAR | MONTH | SUPPLIER | ITEM CODE | ITEM DESCRIPTION | ITEM TYPE | RETAIL SALES | RETAIL TRANSFERS | WAREHOUSE SALES |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2017 | 4 | ROYAL WINE CORP | 100200 | GAMLA CAB - 750ML | WINE | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 2017 | 4 | SANTA MARGHERITA USA INC | 100749 | SANTA MARGHERITA P/GRIG ALTO - 375ML | WINE | 0.0 | 1.0 | 0.0 |
| 2 | 2 | 2017 | 4 | JIM BEAM BRANDS CO | 10103 | KNOB CREEK BOURBON 9YR - 100P - 375ML | LIQUOR | 0.0 | 8.0 | 0.0 |
| 3 | 3 | 2017 | 4 | HEAVEN HILL DISTILLERIES INC | 10120 | J W DANT BOURBON 100P - 1.75L | LIQUOR | 0.0 | 2.0 | 0.0 |
| 4 | 4 | 2017 | 4 | ROYAL WINE CORP | 101664 | RAMON CORDOVA RIOJA - 750ML | WINE | 0.0 | 4.0 | 0.0 |
| 5 | 5 | 2017 | 4 | REPUBLIC NATIONAL DISTRIBUTING CO | 101680 | MANISCHEWITZ CREAM WH CONCORD - 1.5L | WINE | 0.0 | 1.0 | 0.0 |
| 6 | 6 | 2017 | 4 | ROYAL WINE CORP | 101753 | BARKAN CLASSIC PET SYR - 750ML | WINE | 0.0 | 1.0 | 0.0 |
| 7 | 7 | 2017 | 4 | JIM BEAM BRANDS CO | 10197 | KNOB CREEK BOURBON 9YR - 100P - 1.75L | LIQUOR | 0.0 | 32.0 | 0.0 |
| 8 | 8 | 2017 | 4 | STE MICHELLE WINE ESTATES | 101974 | CH ST MICH P/GRIS - 750ML | WINE | 0.0 | 26.0 | 0.0 |
| 9 | 9 | 2017 | 4 | MONSIEUR TOUTON SELECTION | 102083 | CH DE LA CHESNAIE MUSCADET - 750ML | WINE | 0.0 | 1.0 | 0.0 |


In [52]:
col = ['RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES']
agg_supplier = clean_data.groupby('SUPPLIER')[col].sum().reset_index()
agg_supplier.to_sql(con=engine, name='aggregated_per_supplier', if_exists='replace')

In [53]:
agg_item = clean_data.groupby('ITEM CODE')[col].sum().reset_index()
agg_item.to_sql(con=engine, name='aggregated_per_item', if_exists='replace')
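
After writing the aggregates, a quick sanity check is to read a table back and confirm that its column totals match the totals of the cleaned data. The sketch below shows the idea on a small made-up frame; the name `toy` and the in-memory SQLite connection (used here instead of the MySQL engine) are assumptions so that it runs on its own.

```python
import sqlite3

import pandas as pd

col = ['RETAIL SALES', 'RETAIL TRANSFERS', 'WAREHOUSE SALES']

# Hypothetical cleaned data with two suppliers
toy = pd.DataFrame({
    'SUPPLIER': ['ROYAL WINE CORP', 'ROYAL WINE CORP', 'JIM BEAM BRANDS CO'],
    'ITEM CODE': ['100200', '101664', '10103'],
    'RETAIL SALES': [0.0, 0.0, 0.0],
    'RETAIL TRANSFERS': [1.0, 4.0, 8.0],
    'WAREHOUSE SALES': [0.0, 0.0, 0.0],
})
agg_supplier = toy.groupby('SUPPLIER')[col].sum().reset_index()

# Write the aggregate, then read it back from the database
con = sqlite3.connect(':memory:')  # assumption: SQLite instead of MySQL
agg_supplier.to_sql('aggregated_per_supplier', con, if_exists='replace', index=False)
stored = pd.read_sql('SELECT * FROM aggregated_per_supplier', con)

# Grouping only redistributes rows, so the column totals must be unchanged
totals_match = bool((stored[col].sum() == toy[col].sum()).all())
print(totals_match)
```

Passing `index=False` keeps the DataFrame index from being stored as an extra `index` column in the table.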