#  Introduction

## Problem Statement
* Trends shift under uncertain conditions, which leads to excess stock. Goods that have been purchased sometimes cannot be resold because they have sat in inventory for too long.

* Stock levels are usually set by analyzing sales data from the previous month or year, on the assumption that past sales influence future sales. Actual sales, however, do not always follow the previous month's data.

## Data Provided
The data used in this project are:
* **sale_items.csv** - Contains the line-item details of a customer's historical daily sales
* **sales.csv** - Contains the historical daily sales of a customer
* **sma_categories.csv** - Contains information about product categories
* **sma_companies.csv** - Contains information about the customers of company X
* **sma_products.csv** - Contains the list of products sold by the companies

We will take a sneak peek at the dataset below.

# Content:
1. Data Preprocessing
     - Read the datasets
     - Drop null values and several columns
     - Join the datasets into one dataset
2. Exploratory Data Analysis
3. Search for correlations in the data
4. Implement the model
     - Use sales data only
     - Use sales data together with external data such as the consumer price index (IHK), inflation (Inflasi), and the producer price index (IHP)
5. Use product-specific data for product trends
6. Use time lags in the analysis
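
Item 6 refers to time lags. As a rough sketch of what a lagged view of sales could look like (the monthly totals below are invented, and `Series.shift` is one common way to build lagged columns, not necessarily the approach used later in this notebook):

```python
import pandas as pd

# Invented monthly sales totals, indexed by month start.
monthly = pd.Series(
    [100.0, 120.0, 90.0, 130.0, 150.0],
    index=pd.date_range("2017-03-01", periods=5, freq="MS"),
    name="total",
)

# Pair each month's total with the totals from 1 and 2 months earlier.
# The first two months have no complete history, so dropna() removes them.
lagged = pd.DataFrame({
    "total": monthly,
    "lag_1": monthly.shift(1),
    "lag_2": monthly.shift(2),
}).dropna()

print(lagged)
```

Columns such as `lag_1` can then be correlated with `total` to check how strongly the previous month's sales relate to the current month's.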

In [1]:
# Load libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)

import os  # operating-system-dependent functionality
from datetime import datetime  # classes for manipulating dates and times
import math  # mathematical functions
from IPython.display import display, HTML

#For Plotting
# Using plotly + cufflinks in offline mode
import plotly as py
import plotly.graph_objs as go
import plotly
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
import cufflinks as cf
cf.set_config_file(offline=True)
init_notebook_mode(connected=True)

import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

#For time series decomposition
from matplotlib import pyplot
from statsmodels.tsa.seasonal import seasonal_decompose

from datetime import timedelta

#Pandas option
pd.options.display.float_format = '{:.2f}'.format

# 1. Data Preprocessing

## 1.1 Read Dataset

In [2]:
#read dataset
df_sales = pd.read_csv('proyek_skripsi/sales.csv')
df_sale_items = pd.read_csv('proyek_skripsi/sale_items.csv')
df_products = pd.read_csv('proyek_skripsi/sma_products.csv')
df_category = pd.read_csv('proyek_skripsi/sma_categories.csv')
df_companies = pd.read_csv('proyek_skripsi/sma_companies.csv')
df_ihk_inflasi = pd.read_csv("proyek_skripsi/Data_IHK_Inflasi_Nasional.csv")
df_ihp = pd.read_csv("proyek_skripsi/DataIHPKhususPlastikKaret.csv")


Columns (10) have mixed types.Specify dtype option on import or set low_memory=False.


Columns (14) have mixed types.Specify dtype option on import or set low_memory=False.
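
The two mixed-type warnings above are raised when pandas infers a column's type chunk by chunk and the chunks disagree. A minimal sketch of the two remedies the message itself suggests, using a made-up in-memory CSV (the column names `id` and `code` are hypothetical and not taken from the project files):

```python
import io
import pandas as pd

# Hypothetical CSV in which the "code" column mixes numbers and text.
csv_text = "id,code\n1,100\n2,A7\n3,250\n"

# Option 1: declare the dtype so pandas never has to guess.
df_typed = pd.read_csv(io.StringIO(csv_text), dtype={"code": str})

# Option 2: infer types from the whole file in one pass instead of in chunks.
df_whole = pd.read_csv(io.StringIO(csv_text), low_memory=False)

print(df_typed.dtypes)
```

Declaring the dtype is usually the more explicit choice when the affected columns are known. `low_memory=False` is simpler but uses more memory on large files.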



## 1.2 Clean Dataset

In [4]:
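def clean_data(dropped_column, df, negative_column=None):
    # Drop unused columns, keep rows whose checked columns are all >= 0, then fill NaN with 0.
    negative_column = list(negative_column or [])  # avoid a mutable default argument
    df = df.drop(dropped_column, axis=1)
    # NaN >= 0 is False, so rows with NaN in a checked column are removed here, before fillna.
    df = df[(df[negative_column] >= 0).all(axis=1)]
    df = df.fillna(0)
    return df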
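
To make the behaviour of `clean_data` concrete, here is a small sketch on a made-up frame. The column names and values are illustrative only, and the function is restated so the block runs on its own:

```python
import numpy as np
import pandas as pd

def clean_data(dropped_column, df, negative_column=None):
    negative_column = list(negative_column or [])
    df = df.drop(dropped_column, axis=1)
    df = df[(df[negative_column] >= 0).all(axis=1)]
    return df.fillna(0)

# Illustrative data: one negative total, one missing total, one missing customer_id.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "note": ["a", None, "c", "d"],
    "customer_id": [10.0, np.nan, 30.0, 40.0],
    "total": [100.0, 50.0, -5.0, np.nan],
})

cleaned = clean_data(["note"], toy, ["total"])
print(cleaned)
```

The row with `id` 3 is removed for its negative `total`, and the row with `id` 4 is removed because `NaN >= 0` evaluates to `False`. The missing `customer_id` in the row with `id` 2 is filled with 0.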

In [5]:
dropped_sales = ['note', 'staff_note', 'order_discount_id', 'updated_by', 'updated_at', 'return_id', 
                 'attachment', 'return_sale_ref', 'sale_id', 'rounding', 'suspend_note', 'address_id', 
                 'reserve_id', 'salesman_id', 'salesman_commission', 'total_commission', 'product_discount', 
                 'total_discount', 'grand_total', 'payment_term', 'due_date', 'paid', 'customer', 'quote_id', 
                 'reference_no', 'biller_id', 'biller', 'warehouse_id', 'order_discount', 'product_tax', 
                 'order_tax_id', 'order_tax', 'total_tax', 'shipping', 'pos', 'surcharge', 'api', 'shop', 'hash', 
                 'created_by', 'sale_status', 'payment_status', 'return_sale_total', 'total_items']

df_sales = clean_data(dropped_sales, df_sales, ['total'])
df_sales = df_sales.rename(columns={'id': 'sale_id'})
# Keep sales dated 2017-03-01 or later; 'date' is still a string column at this point
df_sales = df_sales.loc[df_sales['date'] >= '2017-03-01']

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17396 entries, 0 to 17496
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   id           17396 non-null  int64  
 1   date         17396 non-null  object 
 2   customer_id  17396 non-null  int64  
 3   total        17396 non-null  float64
dtypes: float64(1), int64(2), object(1)
memory usage: 679.5+ KB
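
The summary above shows that `date` has dtype `object`, so the date filter compares text rather than real dates. A hedged sketch of a datetime-based alternative follows. The rows are invented and the `YYYY-MM-DD HH:MM:SS` format is an assumption about the file. Only the column names `date` and `total` mirror `df_sales`:

```python
import pandas as pd

# Made-up rows; the timestamp format is assumed, not confirmed by the source data.
toy_sales = pd.DataFrame({
    "date": ["2017-02-27 09:15:00", "2017-03-01 00:00:00", "2017-04-10 16:40:00"],
    "total": [120.0, 80.0, 45.5],
})

# Convert text to real timestamps; errors="coerce" turns unparseable values into NaT.
toy_sales["date"] = pd.to_datetime(toy_sales["date"], errors="coerce")

# Filter with an actual Timestamp instead of a string.
cutoff = pd.Timestamp("2017-03-01")
recent = toy_sales.loc[toy_sales["date"] >= cutoff]
print(recent)
```

A `Timestamp` cutoff has to be a valid calendar date, which guards against typos in the threshold, and a datetime column also enables resampling by month or year later on.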


In [None]:
dropped_saleitems = ['serial_no', 'sale_item_id', 'comment', 'id', 'product_type', 'option_id', 'unit_price', 
                     'net_unit_price', 'warehouse_id', 'item_tax', 'tax_rate_id', 'tax', 'product_unit_id', 
                     'product_unit_code', 'unit_quantity']

df_sale_items = clean_data(dropped_saleitems, df_sale_items, ['quantity', 'item_discount', 'real_unit_price', 'subtotal'])