Version 1.0.1

# Pandas basics 

Hi! In this programming assignment you will refresh your `pandas` knowledge. You will need to do several [`groupby`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html)s and [`join`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html)s to solve the task.
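As a quick refresher, here is a minimal sketch of the two operations on toy data. The frames and column names (`sales`, `item_info`, `amount`, `category`) are invented for illustration and are not part of the assignment data.

```python
import pandas as pd

# Toy data, invented for illustration only
sales = pd.DataFrame({
    'item_id': [1, 1, 2, 3],
    'amount':  [10.0, 5.0, 7.0, 1.0],
})
item_info = pd.DataFrame({
    'item_id':  [1, 2, 3],
    'category': ['a', 'a', 'b'],
}).set_index('item_id')

# groupby: total amount per item
per_item = sales.groupby('item_id')['amount'].sum()

# join: attach the category to each item total via the shared item_id index
joined = per_item.to_frame().join(item_info)

# groupby again: total amount per category
per_category = joined.groupby('category')['amount'].sum()
print(per_category)
```

`DataFrame.join` aligns on the index by default, which is why `item_info` is indexed by `item_id` here; `DataFrame.merge` with `on='item_id'` is an alternative when the key is an ordinary column.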

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
%matplotlib inline 

from grader import Grader

In [None]:
from pathlib import Path

The dataset we are going to use is taken from the competition that serves as the final project for this course. You can find the complete data description on the [competition web page](https://www.kaggle.com/c/competitive-data-science-final-project/data). To join the competition, use [this link](https://www.kaggle.com/t/1ea93815dca248e99221df42ebde3540).

In [None]:
DATA_FOLDER = Path('.').absolute().parent.joinpath('readonly', 'final_project_data')

transactions    = pd.read_csv(DATA_FOLDER.joinpath('sales_train.csv.gz'))
items           = pd.read_csv(DATA_FOLDER.joinpath('items.csv'))
item_categories = pd.read_csv(DATA_FOLDER.joinpath('item_categories.csv'))
shops           = pd.read_csv(DATA_FOLDER.joinpath('shops.csv'))

## Grading

We will create a grader instance below and use it to collect your answers. When the `submit_tag` function is called, the grader stores your answer *locally*. The answers will *not* be submitted to the platform immediately, so you can call `submit_tag` as many times as you need.

When you are ready to push your answers to the platform, fill in your credentials and run the `submit` function in the <a href="#Authorization-&-Submission">last paragraph</a> of the assignment.

In [None]:
grader = Grader()

# Task

Let's start with a simple task. 

<ol start="0">
  <li><b>Print the shape of the loaded dataframes and use [`df.head`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html) function to print several rows. Examine the features you are given.</b></li>
</ol>

In [None]:
print(transactions.shape)
transactions.head()

In [None]:
print(items.shape)
items.head()

In [None]:
print(item_categories.shape)
item_categories.head()

In [None]:
print(shops.shape)
shops.head()

Now use your `pandas` skills to get answers for the following questions. 
The first question is:

1. **What was the maximum total revenue among all the shops in September, 2014?**


* Hereinafter *revenue* refers to total sales minus the value of goods returned.

*Hints:*

* Sometimes items are returned; find such examples in the dataset.
* It is handy to split the `date` field into [`day`, `month`, `year`] components and use `df.year == 14` and `df.month == 9` in order to select the target subset of dates.
* You may work with the `date` feature as with strings, or you may first convert it to a datetime type with the `pd.to_datetime` function, but do not forget to set the correct `format` argument.
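The hints above can be sketched on a toy frame. The rows below are made up; only the column names and the day.month.year date layout are assumed to mirror `sales_train.csv.gz`. Note that after conversion with `pd.to_datetime` the year is the full `2014`, not `14`.

```python
import pandas as pd

# Toy frame mimicking the `date` column layout (day.month.year); values are made up
toy = pd.DataFrame({
    'date':         ['02.09.2014', '15.09.2014', '01.10.2014', '30.09.2013'],
    'item_price':   [100.0, 100.0, 50.0, 20.0],
    'item_cnt_day': [2.0, -1.0, 1.0, 3.0],   # a negative count marks a returned item
})

# An explicit format avoids any day/month ambiguity
toy['date'] = pd.to_datetime(toy['date'], format='%d.%m.%Y')

# Select September 2014 via the .dt accessor
mask = (toy['date'].dt.year == 2014) & (toy['date'].dt.month == 9)
september = toy[mask]

# Revenue = price * count; returns (negative counts) are subtracted automatically
toy_revenue = (september['item_price'] * september['item_cnt_day']).sum()
print(toy_revenue)
```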

In [None]:
# Check whether each date occurs only once
if transactions['date'].is_unique:
    print('Dates can be used as unique indices')
else:
    print('Dates repeat, so a date index will not be unique')

In [None]:
# Casting index to date

transactions.set_index('date', inplace=True)
transactions.index = pd.to_datetime(transactions.index, format='%d.%m.%Y')

In [None]:
# Sort by date for readability
transactions.sort_index(inplace=True)

In [None]:
# All the transactions in September
september_trans = transactions[(transactions.index >= '2014-09-01') & (transactions.index < '2014-10-01')]

In [None]:
# Calculate revenue
revenue = september_trans.loc[:,'item_price']*september_trans.loc[:,'item_cnt_day']
september_trans = september_trans.assign(revenue=revenue)

In [None]:
# Sum over the rows with the same shop_id
shop_revenue_sum = september_trans.loc[:, ['shop_id', 'revenue']].groupby('shop_id').sum()

In [None]:
max_revenue = shop_revenue_sum.max().values.item()
grader.submit_tag('max_revenue', max_revenue)

Great! Let's move on and answer another question:

<ol start="2">
  <li><b>What item category generated the highest revenue in summer 2014?</b></li>
</ol>

* Submit `id` of the category found.
    
* Here we call "summer" the period from June to August.

*Hints:*

* Note that for an object `x` of type `pd.Series`, `x.idxmax()` returns the **index label** of the maximum element, while in recent pandas versions `x.argmax()` returns its integer position. A `pd.Series` can have a non-trivial index (not `[1, 2, 3, ... ]`).
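A small sketch of looking up the index label of the maximum on a Series with a non-trivial index. The values and labels are invented; the comment on `argmax` describes recent pandas versions (1.0 and later), while older releases behaved differently.

```python
import pandas as pd

# Toy Series with a non-trivial index (made-up category ids as labels)
x = pd.Series([5.0, 42.0, 17.0], index=[20, 37, 11])

label = x.idxmax()      # index label of the maximum element
position = x.argmax()   # integer position of the maximum in recent pandas versions
print(label, position)
```

When the answer has to be an id stored in the index, `idxmax` is the safer choice because its meaning does not depend on the pandas version.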

In [None]:
# All the transactions in the summer months
summer_trans = transactions[(transactions.index >= '2014-06-01') & (transactions.index < '2014-09-01')]

In [None]:
# Calculate revenue
revenue = summer_trans.loc[:,'item_price']*summer_trans.loc[:,'item_cnt_day']
summer_trans = summer_trans.assign(revenue=revenue)

In [None]:
# Join the items table to get item categories
summer_trans = summer_trans.merge(items, on='item_id')

In [None]:
# Sum over the rows with the same item_category_id
item_category_revenue_sum = summer_trans.loc[:, ['item_category_id', 'revenue']].groupby('item_category_id').sum()

In [None]:
# Find the entry with the max revenue
max_revenue = item_category_revenue_sum.loc[item_category_revenue_sum.loc[:, 'revenue'] == item_category_revenue_sum.loc[:, 'revenue'].max()]

In [None]:
category_id_with_max_revenue = max_revenue.index.values.item()
grader.submit_tag('category_id_with_max_revenue', category_id_with_max_revenue)

<ol start="3">
  <li><b>How many items are there, such that their price stays constant (to the best of our knowledge) during the whole period of time?</b></li>
</ol>

* Let's assume that the items are returned for the same price as they had been sold.
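One way to think about "constant price" is that an item has exactly one distinct `item_price` across all its rows. The sketch below illustrates that idea on made-up rows; it is only one possible approach, and it treats a sale and its return at the same price as the same price, in line with the assumption above.

```python
import pandas as pd

# Toy transactions (made-up values):
#   item 1 keeps one price, item 2 changes price,
#   item 3 is sold and then returned at the same price
toy = pd.DataFrame({
    'item_id':      [1, 1, 2, 2, 3, 3],
    'item_price':   [9.9, 9.9, 5.0, 6.0, 3.0, 3.0],
    'item_cnt_day': [1, 2, 1, 1, 1, -1],
})

# Number of distinct prices observed per item
prices_per_item = toy.groupby('item_id')['item_price'].nunique()

# Items with exactly one distinct price
n_constant = int((prices_per_item == 1).sum())
print(n_constant)
```

Counting distinct values avoids floating-point tolerance questions that arise when testing a standard deviation against zero.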

In [None]:
# Stated differently: how many item_ids have an item_price with zero variance?
# NOTE: The divisor in std is N - ddof (delta degrees of freedom).
#       For the sample standard deviation ddof = 1 (one degree of freedom is used to estimate the mean).
#       For the population standard deviation ddof = 0.
#       As we treat the data as the whole population, we use ddof = 0.
item_price_std = transactions.loc[:, ['item_id', 'item_price']].groupby('item_id').std(ddof=0)

In [None]:
# Find zero variance entries
zero_variance = item_price_std.loc[np.isclose(item_price_std.loc[:, 'item_price'], 0)]

In [None]:
num_items_constant_price = zero_variance.shape[0]
grader.submit_tag('num_items_constant_price', num_items_constant_price)

Remember, the data can sometimes be noisy.

<ol start="4">
  <li><b>What was the variance of the number of sold items per day sequence for the shop with `shop_id = 25` in December, 2014?</b></li>
</ol>

* Fill `total_num_items_sold` and `days` arrays, and plot the sequence with the code below.
* Then compute the variance. Remember, there can be differences in how you normalize variance (biased or unbiased estimate, see [link](https://math.stackexchange.com/questions/496627/the-difference-between-unbiased-biased-estimator-variance)). Compute the ***unbiased*** estimate (use the right value for the `ddof` argument in `pd.Series.var` or `np.var`).
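The difference between the two normalizations can be seen on a tiny made-up sequence. Note that the defaults differ: `np.var` uses `ddof=0`, while `pd.Series.var` uses `ddof=1`.

```python
import numpy as np
import pandas as pd

# Made-up daily counts, for illustration only
daily = pd.Series([2.0, 4.0, 4.0, 6.0])

biased = np.var(daily.values, ddof=0)     # divides by N
unbiased = np.var(daily.values, ddof=1)   # divides by N - 1
pandas_default = daily.var()              # pandas defaults to ddof=1

print(biased, unbiased, pandas_default)
```

Passing `ddof` explicitly, as in this sketch, keeps the result independent of which library's default applies.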

In [None]:
december_trans = transactions.loc[(transactions.index >= '2014-12-01') & (transactions.index < '2015-01-01'), ['shop_id', 'item_id', 'item_cnt_day']]

In [None]:
shop_id = 25
shop_25 = december_trans.loc[december_trans.loc[:, 'shop_id'] == shop_id]

In [None]:
item_per_day_group = shop_25.loc[:, 'item_cnt_day'].groupby(shop_25.index)

In [None]:
total_items = item_per_day_group.sum()

total_num_items_sold = total_items.values
days = total_items.index

# Plot it
plt.plot(days, total_num_items_sold)
plt.ylabel('Num items')
plt.xlabel('Day')
plt.title("Daily number of items sold for shop_id = 25")
plt.show()

# Unbiased means ddof = 1
total_num_items_sold_var = total_items.var(ddof=1)
grader.submit_tag('total_num_items_sold_var', total_num_items_sold_var)

## Authorization & Submission
To submit the assignment to the Coursera platform, please enter your e-mail and token into the variables below. You can generate a token on the programming assignment page. *Note:* The token expires 30 minutes after generation.

In [None]:
STUDENT_EMAIL = ''
STUDENT_TOKEN = ''
grader.status()

In [None]:
grader.submit(STUDENT_EMAIL, STUDENT_TOKEN)

Well done! :)