## Exploratory Analysis using Jupyter Notebook
For further reading, we recommend: 
- [the pandas documentation](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#getting) for information about using DataFrames
- [this blog post](https://towardsdatascience.com/introduction-to-data-visualization-in-python-89a54c97fbed) for a jumpstart into visualizations
- [the matplotlib documentation](https://matplotlib.org/users/pyplot_tutorial.html) for more info about visualizations

In [2]:
import pandas as pd

#### Loading data from our S3 bucket

In [27]:
import s3fs
s3 = s3fs.S3FileSystem(anon=True)
s3.ls('twde-datalab/raw')

s3.get('twde-datalab/raw/items.csv', 
       '../data/item.csv')

In [28]:
items = pd.read_csv('../data/item.csv')
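
Re-running the download cell fetches the file every time. A minimal sketch of one way to skip the download when the local copy already exists; `load_csv_cached` and its `fetch` parameter are hypothetical names introduced here for illustration, not part of pandas or s3fs:

```python
from pathlib import Path

import pandas as pd


def load_csv_cached(remote_path, local_path, fetch):
    """Read local_path as a CSV, calling fetch(remote_path, local_path) first if it is missing."""
    local = Path(local_path)
    if not local.exists():
        # Create the target directory so the download has somewhere to land.
        local.parent.mkdir(parents=True, exist_ok=True)
        fetch(remote_path, str(local))
    return pd.read_csv(local)
```

In this notebook it could be called as `load_csv_cached('twde-datalab/raw/items.csv', '../data/item.csv', s3.get)`, assuming the `s3` object created above.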

In [32]:
items.sample(n=5)

| Unnamed: 0 | item_nbr | family | class | perishable |
|---|---|---|---|---|
| 1824 | 1222571 | PERSONAL CARE | 4114 | 0 |
| 2817 | 1503855 | PRODUCE | 2004 | 1 |
| 1999 | 1246784 | GROCERY I | 1074 | 0 |
| 2287 | 1397797 | GROCERY I | 1016 | 0 |
| 2643 | 1464092 | BEVERAGES | 1148 | 0 |


In [38]:
help(pd.DataFrame.join)
# join() matches the caller's `on` column against the index of `other`,
# so index `items` by item_nbr before joining.
joined_table = train.join(items.set_index('item_nbr'), on='item_nbr', rsuffix='_items')
joined_table.head()
# items[items['item_nbr'] == 103520]

Help on function join in module pandas.core.frame:

join(self, other, on=None, how='left', lsuffix='', rsuffix='', sort=False)
    Join columns of another DataFrame.
    
    Join columns with `other` DataFrame either on index or on a key
    column. Efficiently join multiple DataFrame objects by index at once by
    passing a list.
    
    Parameters
    ----------
    other : DataFrame, Series, or list of DataFrame
        Index should be similar to one of the columns in this one. If a
        Series is passed, its name attribute must be set, and that will be
        used as the column name in the resulting joined DataFrame.
    on : str, list of str, or array-like, optional
        Column or index level name(s) in the caller to join on the index
        in `other`, otherwise joins index-on-index. If multiple
        values given, the `other` DataFrame must have a MultiIndex. Can
        pass an array as the join key if it is not already contained in
        the calling DataFrame.
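
The help text says `on` names a column in the caller that is looked up in the index of `other`. A minimal sketch of what that means in practice, using toy stand-ins for `train` and `items` (the frame names and all values here are invented for illustration):

```python
import pandas as pd

# Toy stand-ins for the notebook's `train` and `items` frames (values are made up).
train_toy = pd.DataFrame({
    "item_nbr": [103520, 103665, 103520],
    "unit_sales": [2.0, 1.0, 5.0],
})
items_toy = pd.DataFrame({
    "item_nbr": [103520, 103665],
    "family": ["GROCERY I", "BREAD/BAKERY"],
})

# join() looks up the caller's `on` column in the *index* of the other frame,
# so the lookup table is indexed by item_nbr first.
joined = train_toy.join(items_toy.set_index("item_nbr"), on="item_nbr")

# merge() matches column to column and gives the same rows here.
merged = train_toy.merge(items_toy, on="item_nbr", how="left")

print(joined)
```

`join` is an instance method, so it is called on the left-hand frame rather than on `pd.DataFrame` with two frames as arguments. `merge` is often the simpler choice when both frames carry the key as an ordinary column.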

#### With just this glimpse, you can start to fill out your list of assumptions, hypotheses, and questions. Some of mine are:
- Question: What is the span of dates we are provided?
- Question: How many distinct store_nbr values are there?
- Question: How many distinct item_nbr values are there?
- Hypothesis: unit_sales are always positive
- Hypothesis: onpromotion is always either True or False
- Hypothesis: city and state are always going to be Quito and Pichincha
- Hypothesis: cluster is always 5
- Question: What does cluster mean and is it important to know?
- Question: How many records does the data contain?
- Question: What other data files are available?
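
Each hypothesis in the list above translates into a boolean check over a column. A minimal sketch on toy rows, assuming the column names used in the list; the values are invented so that some checks fail:

```python
import pandas as pd

# Toy rows using the column names from the list above; the values are invented.
toy = pd.DataFrame({
    "unit_sales": [3.0, 1.0, -2.0],
    "onpromotion": [True, False, None],
    "city": ["Quito", "Quito", "Quito"],
    "state": ["Pichincha", "Pichincha", "Pichincha"],
    "cluster": [5, 5, 14],
})

# Each entry is True only if the hypothesis holds for every row.
checks = {
    "unit_sales always positive": bool((toy["unit_sales"] > 0).all()),
    "onpromotion always True/False": bool(toy["onpromotion"].isin([True, False]).all()),
    "city always Quito": bool((toy["city"] == "Quito").all()),
    "state always Pichincha": bool((toy["state"] == "Pichincha").all()),
    "cluster always 5": bool((toy["cluster"] == 5).all()),
}
for name, holds in checks.items():
    print(f"{name}: {holds}")
```

A single `False` is enough to reject a hypothesis, which is why `.all()` is used rather than inspecting a sample of rows.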

### Here are some examples of how to address those first questions

In [13]:
# Access an entire dataframe column like you would
# the value in a python dictionary:
# (The returned object has similar pandas built-in 
# functions, like 'head' and 'max')
data = train
print(data['date'].min())
print(data['date'].max())

2016-08-16
2017-08-15
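
Taking `min` and `max` of date strings works here because ISO-formatted dates sort correctly as text. To measure the span itself, the column can be parsed first. A minimal sketch, assuming the `date` column holds ISO strings like the ones printed above (the middle value is invented):

```python
import pandas as pd

# Toy date column; the notebook's data['date'] is assumed to hold ISO strings like these.
dates = pd.Series(["2016-08-16", "2017-01-01", "2017-08-15"])

# Parsing to datetime allows date arithmetic, which plain strings do not.
parsed = pd.to_datetime(dates)
span = parsed.max() - parsed.min()

print(parsed.min().date(), "to", parsed.max().date())
print("Span in days:", span.days)
```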


In [12]:
# Dataframe columns also have a 'unique' method,
# which can answer several of our questions from above
data['store_nbr'].unique()

array([44, 45, 46, 47, 48, 49])

In [9]:
print(data['item_nbr'].unique())
print("There are too many item numbers to display, so let's just count them for now:")
print("\n{} different item_nbr values in our data"
          .format(len(data['item_nbr'].unique())))

[ 103520  103665  105574 ... 2011468 2011448 2123839]
There are too many item numbers to display, so let's just count them for now:

3717 different item_nbr values in our data
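
When only the count matters, `nunique` avoids building the full array of unique values. A minimal sketch on toy data (the values are invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "store_nbr": [44, 44, 45, 46],
    "item_nbr": [103520, 103665, 103520, 105574],
})

# Count distinct values in one column...
n_items = toy["item_nbr"].nunique()
# ...or in several columns at once (returns a Series indexed by column name).
per_column = toy[["store_nbr", "item_nbr"]].nunique()

print(f"{n_items} different item_nbr values")
print(per_column)
```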


#### It might be helpful to know the 'shape' of our data. We could count by hand (for now) the columns, but how many rows do we have altogether?

In [10]:
print(data.shape)
print("There are {} rows and {} columns in our data".format(data.shape[0], data.shape[1]))

(5877318, 9)
There are 5877318 rows and 9 columns in our data


#### Moving along to answer our initial questions... Let's have a look at unit_sales. Keep in mind that unit_sales is the variable we want to predict.

Each row in our data essentially gives us a `unit_sales` number for a given `item_nbr` at a given `store_nbr` on a given `date`. That is, "how many units of an item were sold at a store on a day".
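
If each row really is one item at one store on one day, no combination of those three columns should appear twice. A minimal sketch of that check on toy data, assuming the column names mentioned above (the values are invented):

```python
import pandas as pd

# The three columns assumed to identify a row.
key = ["date", "store_nbr", "item_nbr"]
toy = pd.DataFrame({
    "date": ["2017-08-15", "2017-08-15", "2017-08-15"],
    "store_nbr": [44, 44, 45],
    "item_nbr": [103520, 103665, 103520],
    "unit_sales": [2.0, 1.0, 5.0],
})

# duplicated() flags every row whose key was already seen in an earlier row.
n_duplicates = int(toy.duplicated(subset=key).sum())
print("Rows repeating an existing key:", n_duplicates)

# A quick statistical summary of the target column.
print(toy["unit_sales"].describe())
```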

In [21]:
# Count the rows belonging to each cluster
print({n: len(data[data['cluster'] == n]) for n in data['cluster'].unique()})
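
Counting rows per cluster can also be done with `value_counts`, which scans the column once instead of filtering the frame once per cluster. A minimal sketch on toy data (the cluster values are invented):

```python
import pandas as pd

toy = pd.DataFrame({"cluster": [5, 5, 14, 11, 14, 5]})

# value_counts() returns a Series mapping each distinct value to its row count.
counts = toy["cluster"].value_counts().to_dict()
print(counts)
```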