# **Group 3 Project: store_data.csv**

## Table of Contents

<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangle">Data Wrangling</a></li>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclude">Conclusion</a></li>
</ul>

## Introduction

`store_data.csv` is a dataset that records the weekly sales of five stores: storeA, storeB, storeC, storeD, and storeE. The data were collected over roughly four years, from 2014 to 2018. Some questions that can be explored with this dataset are:
1. What are the total sales for the last month?
2. What are the average sales across all stores?
3. What were the sales in the week of March 13, 2016?
4. Which week was the worst for store C?
5. What are the total sales for the most recent 3 months?

## Data Wrangling

### 1. Importing packages

In [12]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()

### 2. Gathering the data

In [13]:
df = pd.read_csv('store_data.csv')
df.head()

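|   | week | storeA | storeB | storeC | storeD | storeE |
|---|------|--------|--------|--------|--------|--------|
| 0 | 2014-05-04 | 2643 | 8257 | 3893 | 6231 | 1294 |
| 1 | 2014-05-11 | 6444 | 5736 | 5634 | 7092 | 2907 |
| 2 | 2014-05-18 | 9646 | 2552 | 4253 | 5447 | 4736 |
| 3 | 2014-05-25 | 5960 | 10740 | 8264 | 6063 | 949 |
| 4 | 2014-06-01 | 7412 | 7374 | 3208 | 3985 | 3023 |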


### 3. Assessing the data for possible problems

In [11]:
df.shape

(200, 6)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
week      200 non-null object
storeA    200 non-null int64
storeB    200 non-null int64
storeC    200 non-null int64
storeD    200 non-null int64
storeE    200 non-null int64
dtypes: int64(5), object(1)
memory usage: 9.5+ KB


> Several observations follow from this step. (1) The dataset has 200 rows and 6 columns. Because consecutive 'week' values are seven days apart, each store has 200 weekly sales records. (2) The 'week' column holds the date on which the sales were recorded, and its data type is 'object'. (3) The remaining columns hold the sales recorded at each store, and their data type is 'int64'.
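
Whether the rows are daily or weekly can be checked from the gaps between consecutive dates. The sketch below is only illustrative: it rebuilds the five rows printed by `df.head()` as an in-memory sample (the names `sample`, `weeks`, `gaps`, and `is_weekly` are made up for this example), so it says nothing about the remaining 195 rows. Running the same check on the full `df` would be needed to confirm the spacing for the whole file.

```python
import io

import pandas as pd

# The five rows printed by df.head() above, standing in for the full file.
sample_csv = io.StringIO(
    "week,storeA,storeB,storeC,storeD,storeE\n"
    "2014-05-04,2643,8257,3893,6231,1294\n"
    "2014-05-11,6444,5736,5634,7092,2907\n"
    "2014-05-18,9646,2552,4253,5447,4736\n"
    "2014-05-25,5960,10740,8264,6063,949\n"
    "2014-06-01,7412,7374,3208,3985,3023\n"
)
sample = pd.read_csv(sample_csv)

# Parse the dates only for this check; the sample DataFrame is not modified.
weeks = pd.to_datetime(sample["week"], format="%Y-%m-%d")

# Gap between consecutive rows; the first row has no predecessor, so drop its NaT.
gaps = weeks.diff().dropna()

# True only if every gap is exactly seven days.
is_weekly = (gaps == pd.Timedelta(days=7)).all()

print(gaps.unique())
print("All gaps are 7 days:", is_weekly)
```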

> Another important observation is that every column has 200 non-null entries, which means the dataset has no missing values. The dataset is also free of duplicates, although `df.info()` does not show this and it has to be checked separately (for example with `df.duplicated()`).
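
The missing-value and duplicate claims can be checked explicitly instead of being read off `df.info()`. The following is a minimal sketch on the five sample rows shown by `df.head()`, not on the full file. The variable names are hypothetical, and treating a repeated 'week' date as the relevant kind of duplicate is an assumption made for this example.

```python
import io

import pandas as pd

# The five rows printed by df.head() above, standing in for the full file.
sample_csv = io.StringIO(
    "week,storeA,storeB,storeC,storeD,storeE\n"
    "2014-05-04,2643,8257,3893,6231,1294\n"
    "2014-05-11,6444,5736,5634,7092,2907\n"
    "2014-05-18,9646,2552,4253,5447,4736\n"
    "2014-05-25,5960,10740,8264,6063,949\n"
    "2014-06-01,7412,7374,3208,3985,3023\n"
)
sample = pd.read_csv(sample_csv)

# Number of missing values in each column.
missing_per_column = sample.isnull().sum()

# Number of rows that repeat an earlier row in every column.
duplicate_rows = sample.duplicated().sum()

# Number of rows whose date already appeared earlier (assumed to be the
# duplicate that matters most for a time series).
duplicate_weeks = sample["week"].duplicated().sum()

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
print("Duplicate weeks:", duplicate_weeks)
```

On the real data, the same three calls would be applied to `df` directly.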

> There is one potential problem: the 'week' column has the data type 'object', which means the dates are stored as plain strings. For time-series analysis, it is more convenient to convert this column to a datetime type so that pandas treats its values as dates.

### 4. Cleaning the dataset

In [19]:
# Work on a copy so that the raw DataFrame stays untouched; the 'week' column is converted to datetime in the next cell.
df_copy = df.copy()
df_copy.head()

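|   | week | storeA | storeB | storeC | storeD | storeE |
|---|------|--------|--------|--------|--------|--------|
| 0 | 2014-05-04 | 2643 | 8257 | 3893 | 6231 | 1294 |
| 1 | 2014-05-11 | 6444 | 5736 | 5634 | 7092 | 2907 |
| 2 | 2014-05-18 | 9646 | 2552 | 4253 | 5447 | 4736 |
| 3 | 2014-05-25 | 5960 | 10740 | 8264 | 6063 | 949 |
| 4 | 2014-06-01 | 7412 | 7374 | 3208 | 3985 | 3023 |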


In [20]:
df_copy['week'] = pd.to_datetime(df_copy['week'], format='%Y-%m-%d')
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 6 columns):
week      200 non-null datetime64[ns]
storeA    200 non-null int64
storeB    200 non-null int64
storeC    200 non-null int64
storeD    200 non-null int64
storeE    200 non-null int64
dtypes: datetime64[ns](1), int64(5)
memory usage: 9.5 KB


> Now, the 'week' column is in 'datetime' format.
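
As an alternative to converting after loading, the dates can be parsed while the file is read by passing `parse_dates` to `pd.read_csv`. The sketch below assumes the same column layout as the rows shown earlier and uses those five rows as an in-memory sample; `parsed`, `from_mid_may`, and `may_total_store_a` are names invented for this illustration. It also shows two things a datetime column makes possible: filtering by date and grouping by calendar month.

```python
import io

import pandas as pd

# The five rows printed by df.head() earlier, standing in for the full file.
csv_text = (
    "week,storeA,storeB,storeC,storeD,storeE\n"
    "2014-05-04,2643,8257,3893,6231,1294\n"
    "2014-05-11,6444,5736,5634,7092,2907\n"
    "2014-05-18,9646,2552,4253,5447,4736\n"
    "2014-05-25,5960,10740,8264,6063,949\n"
    "2014-06-01,7412,7374,3208,3985,3023\n"
)

# parse_dates converts the listed columns to datetime while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["week"])
print(parsed["week"].dtype)

# Rows can now be selected by comparing against a date.
from_mid_may = parsed[parsed["week"] >= pd.Timestamp("2014-05-15")]

# Calendar fields are available through the .dt accessor,
# here used to total storeA sales for the May rows of the sample.
may_total_store_a = parsed.loc[parsed["week"].dt.month == 5, "storeA"].sum()

print(from_mid_may)
print("storeA total for May rows in the sample:", may_total_store_a)
```

Both approaches should give the same data type for 'week'; parsing at load time simply removes the separate conversion step.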

## Exploratory Data Analysis