## Exploratory Analysis using Jupyter Notebook
For further reading, we recommend:
- [the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#getting) for information about using DataFrames
- [this blog post](https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed) for a jump start on visualizations
- [the matplotlib documentation](https://matplotlib.org/users/pyplot_tutorial.html) for more information about visualizations

In [1]:
import pandas as pd
import s3fs



#### Loading data from our S3 bucket

In [53]:
import os
def download(f):
    # Fetch the file from S3 only if there is no local copy yet
    data_file = f'../data/{f}.csv'
    if not os.path.isfile(data_file):
        s3 = s3fs.S3FileSystem(anon=True)
        s3.get(f'twde-datalab/raw/{f}.csv',
               data_file)
    return pd.read_csv(data_file)

In [2]:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)
s3.ls('twde-datalab/raw')

s3.get('twde-datalab/raw/quito_stores_sample2016-2017.csv', 
       '../data/quito_stores_sample2016-2017.csv')

In [56]:
items = download('items')
items.head()

In [78]:
holidays = download('holidays_events')
holidays
# holidays['locale'].unique()

| | date | type | locale | locale_name | description | transferred |
|---|---|---|---|---|---|---|
| 0 | 2012-03-02 | Holiday | Local | Manta | Fundacion de Manta | False |
| 1 | 2012-04-01 | Holiday | Regional | Cotopaxi | Provincializacion de Cotopaxi | False |
| 2 | 2012-04-12 | Holiday | Local | Cuenca | Fundacion de Cuenca | False |
| 3 | 2012-04-14 | Holiday | Local | Libertad | Cantonizacion de Libertad | False |
| 4 | 2012-04-21 | Holiday | Local | Riobamba | Cantonizacion de Riobamba | False |
| 5 | 2012-05-12 | Holiday | Local | Puyo | Cantonizacion del Puyo | False |
| 6 | 2012-06-23 | Holiday | Local | Guaranda | Cantonizacion de Guaranda | False |
| 7 | 2012-06-25 | Holiday | Regional | Imbabura | Provincializacion de Imbabura | False |
| 8 | 2012-06-25 | Holiday | Local | Latacunga | Cantonizacion de Latacunga | False |
| 9 | 2012-06-25 | Holiday | Local | Machala | Fundacion de Machala | False |


In [70]:
items.head()

| | item_nbr | family | class | perishable |
|---|---|---|---|---|
| 0 | 96995 | GROCERY I | 1093 | 0 |
| 1 | 99197 | GROCERY I | 1067 | 0 |
| 2 | 103501 | CLEANING | 3008 | 0 |
| 3 | 103520 | GROCERY I | 1028 | 0 |
| 4 | 103665 | BREAD/BAKERY | 2712 | 1 |


In [3]:
train = pd.read_csv('../data/quito_stores_sample2016-2017.csv')

In [23]:
data = train
train.head()

| | id | date | store_nbr | item_nbr | unit_sales | onpromotion | city | state | cluster |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 88211471 | 2016-08-16 | 44 | 103520 | 7.0 | True | Quito | Pichincha | 5 |
| 1 | 88211472 | 2016-08-16 | 44 | 103665 | 7.0 | False | Quito | Pichincha | 5 |
| 2 | 88211473 | 2016-08-16 | 44 | 105574 | 13.0 | False | Quito | Pichincha | 5 |
| 3 | 88211474 | 2016-08-16 | 44 | 105575 | 18.0 | False | Quito | Pichincha | 5 |
| 4 | 88211475 | 2016-08-16 | 44 | 105577 | 8.0 | False | Quito | Pichincha | 5 |


#### With just this glimpse, you can start to fill out your list of assumptions, hypotheses, and questions. Some of mine are:
- Question: What is the span of dates we are provided?
- Question: How many distinct store_nbr values are there?
- Question: How many distinct item_nbr values are there?
- Hypothesis: unit_sales are always positive
- Hypothesis: onpromotion is always either True or False
- Hypothesis: city and state are always going to be Quito and Pichincha
- Hypothesis: cluster is always 5
- Question: What does cluster mean and is it important to know?
- Question: How many records does the data contain?
- Question: What other data files are available?

### Here are some examples of how to address those first questions

In [6]:
# Access an entire DataFrame column the way you would
# look up a value in a Python dictionary.
# (The returned Series has many of the same built-in
# pandas methods, such as 'head' and 'max'.)
print(train['date'].min())
print(train['date'].max())

2016-08-16
2017-08-15
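Taking `min` and `max` of the raw strings works here because dates written as `YYYY-MM-DD` sort alphabetically in chronological order. To measure the span itself, the column can be parsed into real timestamps first. The following is a minimal sketch; the `dates` Series is a small hypothetical stand-in for `train['date']`, built from values that appear in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for train['date'] (values that appear in this notebook)
dates = pd.Series(['2016-08-16', '2017-01-25', '2017-08-15'], name='date')

# Parse the strings into timestamps so that date arithmetic works
parsed = pd.to_datetime(dates, format='%Y-%m-%d')

first_day, last_day = parsed.min(), parsed.max()
span_days = (last_day - first_day).days

print("Data runs from {} to {} ({} days apart)".format(
    first_day.date(), last_day.date(), span_days))
```

In the notebook itself, `dates` would simply be replaced by `train['date']`.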


In [8]:
# Dataframe columns also have a 'unique' method,
# which can answer several of our questions from above
train['onpromotion'].unique()

array([ True, False])
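The same method can test the hypotheses about `city`, `state`, and `cluster`. Here is a sketch that uses a few rows copied from this notebook as a stand-in for `train`; the `sample` name is an assumption made for illustration.

```python
import pandas as pd

# Small stand-in for `train`, copied from rows shown in this notebook
sample = pd.DataFrame({
    'store_nbr': [44, 46, 49],
    'city': ['Quito', 'Quito', 'Quito'],
    'state': ['Pichincha', 'Pichincha', 'Pichincha'],
    'cluster': [5, 14, 11],
})

# A hypothesis of the form "column X is always Y" can only hold
# if the column has exactly one distinct value
for column in ['city', 'state', 'cluster']:
    values = sample[column].unique()
    verdict = "constant" if len(values) == 1 else "varies"
    print(column, list(values), "->", verdict)

cluster_is_constant = len(sample['cluster'].unique()) == 1
```

On these rows, `cluster` takes more than one value, so the hypothesis that it is always 5 would be rejected.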

In [10]:
print(train['item_nbr'].unique())
print("There are too many item numbers to display, so let's just count them for now:")
print("\n{} different item_nbr values in our data"
          .format(len(train['item_nbr'].unique())))

[ 103520  103665  105574 ... 2011468 2011448 2123839]
There are too many item numbers to display, so let's just count them for now:

3717 different item_nbr values in our data
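pandas also offers `nunique`, which counts distinct values directly and could replace `len(...unique())`. A brief sketch on hypothetical stand-in rows (not the full `train` DataFrame):

```python
import pandas as pd

# Hypothetical stand-in for a few rows of `train`
sample = pd.DataFrame({
    'store_nbr': [44, 44, 45, 46],
    'item_nbr': [103520, 103665, 103520, 105574],
})

# nunique counts the distinct values in a column
n_items = sample['item_nbr'].nunique()
n_stores = sample['store_nbr'].nunique()
print("{} different item_nbr values, {} different store_nbr values"
      .format(n_items, n_stores))

# Called on the whole DataFrame, nunique returns one count per column
print(sample.nunique())
```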


#### It might be helpful to know the 'shape' of our data. We could count the columns by hand for now, but how many rows do we have altogether?

In [18]:
data = train
print(train.shape)
print("There are {} rows and {} columns in our data".format(data.shape[0], data.shape[1]))

(5877318, 9)
There are 5877318 rows and 9 columns in our data


#### Moving along to answer our initial questions... Let's have a look at unit_sales. Keep in mind that unit_sales is the variable we want to predict with our science.

Each row in our data is essentially telling us a `unit_sales` number for a given `item_nbr` at a given `store_nbr` on a given `date`. That is, "how many units of an item were sold at a store on a day".
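That reading of the data can itself be treated as a hypothesis: if it holds, no combination of `date`, `store_nbr`, and `item_nbr` should appear more than once. A minimal sketch of the check, run on stand-in rows rather than the full `train` DataFrame:

```python
import pandas as pd

# Stand-in rows copied from the first lines of `train` shown in this notebook
sample = pd.DataFrame({
    'date': ['2016-08-16', '2016-08-16', '2016-08-16'],
    'store_nbr': [44, 44, 44],
    'item_nbr': [103520, 103665, 105574],
    'unit_sales': [7.0, 7.0, 13.0],
})

key_columns = ['date', 'store_nbr', 'item_nbr']

# duplicated() marks every repeat of a key after its first occurrence
n_repeats = int(sample.duplicated(subset=key_columns).sum())
print("{} rows repeat an existing (date, store_nbr, item_nbr) key"
      .format(n_repeats))
```

A count of zero on the real data would support the "one row per item, store, and day" interpretation.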

In [31]:
train.sort_values(by='unit_sales', ascending=True).head()


| | id | date | store_nbr | item_nbr | unit_sales | onpromotion | city | state | cluster |
|---|---|---|---|---|---|---|---|---|---|
| 2570881 | 104180529 | 2017-01-25 | 46 | 269029 | -290.0 | False | Quito | Pichincha | 14 |
| 5196106 | 121081534 | 2017-07-04 | 49 | 315178 | -274.0 | False | Quito | Pichincha | 11 |
| 1910439 | 99947658 | 2016-12-14 | 45 | 940664 | -200.0 | False | Quito | Pichincha | 11 |
| 1824743 | 99424349 | 2016-12-09 | 44 | 363889 | -187.0 | False | Quito | Pichincha | 5 |
| 3351107 | 109126746 | 2017-03-13 | 48 | 1336459 | -175.0 | False | Quito | Pichincha | 14 |
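The sorted view shows that the hypothesis "unit_sales are always positive" does not hold. To see how common such rows are, a boolean mask can count them. This is a sketch on stand-in values: the negative numbers are copied from the output above, but the mix is illustrative and does not reflect the real proportion.

```python
import pandas as pd

# Stand-in for train['unit_sales']; the proportions are illustrative only
unit_sales = pd.Series([7.0, 13.0, 18.0, -290.0, -274.0], name='unit_sales')

# Comparing a Series to a number gives a True/False mask, one entry per row
is_negative = unit_sales < 0

n_negative = int(is_negative.sum())   # True counts as 1
share_negative = is_negative.mean()   # fraction of rows that are True
print("{} of {} rows are negative ({:.1%})"
      .format(n_negative, len(unit_sales), share_negative))
```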


In [39]:
train.unit_sales = train.unit_sales.astype(int)
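Note that `astype(int)` drops any fractional part rather than rounding, so it may be worth checking first whether every value is whole. A hedged sketch follows; the `1.5` is a hypothetical value added for illustration and is not taken from the data.

```python
import pandas as pd

# Stand-in for train['unit_sales']; 1.5 is a hypothetical fractional value
unit_sales = pd.Series([7.0, 13.0, -290.0, 1.5])

# A value is whole when dividing by 1 leaves no remainder
has_fraction = unit_sales % 1 != 0
n_fractional = int(has_fraction.sum())
print("{} values would lose their fractional part".format(n_fractional))

# astype(int) truncates toward zero, so 1.5 becomes 1
as_int = unit_sales.astype(int)
print(as_int.tolist())
```

If the count is zero on the real column, the conversion is lossless.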

In [46]:
pd.options.display.float_format = '{:.2f}'.format
train.describe()

| | id | store_nbr | item_nbr | unit_sales | cluster |
|---|---|---|---|---|---|
| count | 5877318.0 | 5877318.0 | 5877318.0 | 5877318.0 | 5877318.0 |
| mean | 106616439.99 | 46.46 | 1166542.46 | 13.81 | 11.43 |
| std | 10816746.32 | 1.71 | 572611.4 | 30.85 | 3.23 |
| min | 88211471.0 | 44.0 | 96995.0 | -290.0 | 5.0 |
| 25% | 97206128.25 | 45.0 | 724360.0 | 3.0 | 11.0 |
| 50% | 106494793.5 | 46.0 | 1228319.0 | 7.0 | 11.0 |
| 75% | 115973034.75 | 48.0 | 1502392.0 | 14.0 | 14.0 |
| max | 125486927.0 | 49.0 | 2127114.0 | 6932.0 | 14.0 |
