Yay! Exercises!

# Imports

In [2]:
import pandas as pd
#from sklearn.linear_model import LinearRegression
import numpy as np
import hashlib # for grading purposes
%matplotlib inline
from sklearn.impute import KNNImputer
import os

### You're hired

You've been hired as the data scientist for a supermarket chain that wants to start extracting insights from their data. First, you'll start with a specific store.

Let's get our data: `store.csv` under `data/` folder

> Important: save the dataframe into the `store` variable

In [12]:
store = pd.read_csv(os.path.join('data/store.csv'))
store.head(3)

Unnamed: 0,date,customers
0,09-09-2013,1781
1,16-08-2015,456
2,13-09-2015,459


## Exercise 1: Index and datetime

#### 1.1) Make the index the datetime of the dates. 

In [13]:
#We expect the solution to be a dataframe

store['date'] = pd.to_datetime(store['date'],format='%d-%m-%Y')
store = store.set_index('date')
store = store.sort_index()        # Don't forget best practices

# YOUR CODE HERE
#raise NotImplementedError()

In [8]:
expected_hash = '660d9054fe3a6cbcfb77e2647932e3c41ff5acab9fc4d162fdc448c7c8e6ccc2'
assert hashlib.sha256(str(store.iloc[28].name).encode()).hexdigest() == expected_hash
assert hashlib.sha256(str(store.index.dtype).encode()).hexdigest() == '261738f2e43a1c47a16f043b46deb993943d61f4a2bbe5ef4b03c3fb1af362b5'


# clue: if this assert is failing, and your iloc[25].name is '2017-01-7', 
# then you are missing the "best practices" part. 
# What did we say in the Learning notebook about this? 

## Exercise 2: Time series preprocessing

In [14]:
store

Unnamed: 0_level_0,customers
date,Unnamed: 1_level_1
2013-01-02,2111
2013-01-03,1833
2013-01-04,1863
2013-01-05,1509
2013-01-06,520
...,...
2017-07-27,1729
2017-07-28,1848
2017-07-29,1251
2017-07-30,519


#### 2.1) Accounting for missing days

Sometimes datasets don't have rows corresponding to all timestamps, as a data scientist you should know if this is the case. Copy `store` to a new variable called `store_complete`, with no gap days. Fill the missing data with nulls.

In [19]:
# copy store to store_complete
store_complete = store.copy()

# change store_complete index so that it contains each of the days in the time range [2013-01-01, 2017-07-31]
store_complete = store_complete.loc['2013-01-01': '2017-07-31']
store_complete
# fill missing data with nulls
#store_complete = store_complete.fillna(np.nan)

# YOUR CODE HERE
#raise NotImplementedError()

Unnamed: 0_level_0,customers
date,Unnamed: 1_level_1
2013-01-02,2111
2013-01-03,1833
2013-01-04,1863
2013-01-05,1509
2013-01-06,520
...,...
2017-07-27,1729
2017-07-28,1848
2017-07-29,1251
2017-07-30,519


In [16]:
#### check number of nulls
assert store_complete.isnull().sum()[0] != 0, "You have 0 null values in the dataset! Remember that each missing day should correspond to a null in customers."
assert store_complete.isnull().sum()[0] == 11, "You should have found 11 days with missing data, no more no less."

#### check store_complete dataframe integrity
assert store_complete.shape[0] != 1676, "Did you fill the index with the missing days?"
assert store_complete.shape[0] == 1672, "The number of rows is not the expected."
assert store_complete.shape[1] == 1, "You shouldn't change the number of columns."
assert str(store_complete[store_complete.customers.isnull()].index[6])[:10] == '2016-01-01', "Do you have all the missing days? Is the index is ordered?"

AssertionError: You have 0 null values in the dataset! Remember that each missing day should correspond to a null in customers.

## Exercise 3: Working with timestamps

#### 3.1) Worst day in 2016

What was the worst day in terms of customers in 2016?

In [21]:
# hint: the answer should be a timestamp

worst_day_2016 = store.loc['2016'].min()

worst_day_2016
# YOUR CODE HERE
#raise NotImplementedError()

In [None]:
expected_hash = '54ca8373016eeb8acb093f60f9d909b8fc1bcc8e37d9f762530df4053bb83a1d'
assert hashlib.sha256(str(worst_day_2016).encode()).hexdigest() == expected_hash

print(f"The worst day in 2016 was {worst_day_2016.day} of {worst_day_2016.month_name()}. Talk about new year's blues !")

#### 3.2) Best Friday

Last Friday there were 3000 customers, and your boss said he's never seen such a high count of customers on a Friday. To check if your boss is correct, can you find the maximum number of customers that we've ever had on a Friday?

- _hint #1: you can use the methods at the bottom of this [page](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DatetimeIndex.html) directly on the index_  
- _hint #2: when operating directly on the index, you do not need to use `.dt` to use the methods_

In [None]:
#max_customers_Friday =

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = 'b134ce47a896876fe3111bfed26cbe06363ede8a60ada5f70fe285d04fc4b7e9'
assert hashlib.sha256(str(int(max_customers_Friday)).encode()).hexdigest() == expected_hash

print(f"Yep! The highest count we ever had on a Friday was {int(max_customers_Friday)} customers. Don't tell your boss.")

## Exercise 4: Time series methods

#### 4.1) Shopping rush

A new pandemic has started, and everyone came to buy soap and isopropyl alcohol. Your boss swears to have never seen such an absolute increase in customers from one day to the next - "Yesterday there were 100 customers, today there were 5000."

To confirm if what your boss is saying is true, can you find the maximum increase in customers from one day to the next?

In [None]:
# hint: the solution expects a float

#max_increase = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = 'aa4b0d224e2b4488c6e3c5692347a0e26322d86dcb6bf01ab937e15d76037ee4'
assert hashlib.sha256(str(int(max_increase)).encode()).hexdigest() == expected_hash

#### 4.2) Bad month

Despite the shopping rush of the last few days, we had a bad month , with a monthly sum of customers < 45000 . What was the last month we had less than 45000 customers  (`last_bad_month`)?

In [None]:
# We expect the answer to be a monthly time period (freq='M'), so answers  
# which are of the form "the second month of the year" will not pass the grader. 
# hint: by default pandas uses freq='M'.

#sum_monthly_customers = 
#last_bad_month = 

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
expected_hash = '50aa98c99f36b286c0269526800c2fc49253f75fcdef0cf02f4db4aee064ddbc'
assert hashlib.sha256(str(last_bad_month).encode()).hexdigest() == expected_hash

---