## Get last 3 months of article sales and store in file

We want a dataframe containing each article's average monthly units sold across July, August, and September 2018, alongside its units sold in October 2018, to use as a benchmark when testing our model results.

## Notebook setup

Import packages and load data

In [5]:
# Navigate to directory with cleaned data
#%cd /data/p_dsi/teams2023/team7

/gpfs52/data/p_dsi/teams2023/team7


In [70]:
import pandas as pd
import numpy as np
import seaborn as sns
#from janitor import clean_names
from sklearn.preprocessing import OneHotEncoder

In [3]:
pd.options.display.max_columns = 100

In [57]:
# Read-in data
sales = pd.read_csv("final_sales.csv", parse_dates=["DATE"], 
                    usecols=["ARTICLE_ID","DATE","UNITS"], 
                    dtype={"ARTICLE_ID":"category","UNITS":np.float64},
                    low_memory=False)

# (1) Store shape, (2) total sum of units and (3) 2018 total sum of units for sanity checks
# 1
sales_shape = sales.shape
# 2
total_units_sold = sales["UNITS"].sum()
#3
sales = sales.set_index("DATE")
total_units_sold_in_2018 = sales.loc['2018']['UNITS'].sum()

# Print completion message
print("Data read-in complete. \nShape of dataframe: ", sales.shape, "\n Total unit sales across all articles: ", total_units_sold)

Data read-in complete. 
Shape of dataframe:  (13499278, 2) 
 Total unit sales across all articles:  30299003.0


In [59]:
sales.head(3)

DATE,ARTICLE_ID,UNITS
2018-07-22,3431,1.0
2018-07-05,3448,1.0
2018-07-02,147127,1.0


## Yearly tire sales:

In [63]:
# Information about the articles sold
print("\n Total units sold in 2018 across all articles: ", get_total_units_sold(sales, 2018))
print("\n Total units sold in 2017 across all articles: ", get_total_units_sold(sales, 2017))
print("\n Total units sold in 2016 across all articles: ", get_total_units_sold(sales, 2016))
print("\n Total units sold in 2015 across all articles: ", get_total_units_sold(sales, 2015))


 Total units sold in 2018 across all articles:  7907212.0

 Total units sold in 2017 across all articles:  9008568.0

 Total units sold in 2016 across all articles:  8191490.0

 Total units sold in 2015 across all articles:  5191733.0
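
`get_total_units_sold` is called above, but its definition does not appear in this notebook. Below is a minimal sketch of what such a helper might look like, assuming `sales` is indexed by `DATE` (as set in the read-in cell) and that `month` and `article_id` are optional filters. The signature is inferred from the calls in this notebook, not taken from the original definition.

```python
import pandas as pd

def get_total_units_sold(sales, year, month=None, article_id=None):
    """Hypothetical reconstruction: sum UNITS for a year, optionally
    narrowed to one month and/or one article."""
    # Assumes a DatetimeIndex (DATE) on the sales dataframe
    mask = sales.index.year == year
    if month is not None:
        mask &= sales.index.month == month
    if article_id is not None:
        # ARTICLE_ID is read in as a category of strings, so compare as str
        mask &= (sales["ARTICLE_ID"].astype(str) == str(article_id)).to_numpy()
    return sales.loc[mask, "UNITS"].sum()

# Toy data for illustration only (not taken from final_sales.csv)
toy = pd.DataFrame(
    {"ARTICLE_ID": ["3438", "3438", "97912"], "UNITS": [1.0, 3.0, 5.0]},
    index=pd.to_datetime(["2017-08-01", "2017-08-15", "2017-09-01"]).rename("DATE"),
)
toy_total = get_total_units_sold(toy, year=2017, month=8, article_id=3438)
```

Casting both sides to `str` lets the helper accept `article_id=3438` as well as `"3438"`, which matches how the function is called later in the notebook.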


In [12]:
# Sort
sales.sort_values(by=["DATE","ARTICLE_ID"], inplace=True)

# Group the rows by ARTICLE_ID and month of DATE, and sum the UNITS values
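# DATE was set as the index at read-in, so reset it to a column before grouping
grouped_sales = sales.reset_index().groupby(['ARTICLE_ID', pd.Grouper(key='DATE', freq='M')])['UNITS'].sum().reset_index()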

# Create new month and year columns
grouped_sales['MONTH'] = grouped_sales['DATE'].dt.month
grouped_sales['YEAR'] = grouped_sales['DATE'].dt.year
grouped_sales = grouped_sales.drop(["DATE"],axis=1)

print("Shape of grouped_sales: ", grouped_sales.shape)
print("Simple average of monthly units sold per article: ", grouped_sales.shape[0]/420, "\n")

print("Some stats: ", grouped_sales.describe(), "\n")

print(grouped_sales.sample(10, random_state=1))

Shape of grouped_sales:  (18705, 4)
Simple average of monthly units sold per article:  44.535714285714285 

Some stats:                UNITS         MONTH          YEAR
count  18705.000000  18705.000000  18705.000000
mean    1619.834429      6.581395   2016.534884
std     1910.748041      3.265356      1.064220
min       -2.000000      1.000000   2015.000000
25%      452.000000      4.000000   2016.000000
50%     1000.000000      7.000000   2017.000000
75%     2130.000000      9.000000   2017.000000
max    23338.000000     12.000000   2018.000000 

      ARTICLE_ID   UNITS  MONTH  YEAR
11602       3928   701.0      3  2018
4785       15029  6789.0      4  2016
3581        1428     6.0      4  2016
11999       4022     0.0      6  2015
14920       6410   867.0     10  2018
15295       6896   548.0     10  2017
8126        2360  1020.0     10  2018
7160        2012   781.0      2  2017
3552        1427  3040.0      6  2017
14417       6240   623.0      4  2016


In [13]:
# Sanity check 1 PASS
print("Sum of units in grouped_sales: ", sum(grouped_sales['UNITS']))
print("Sum of units sold before grouping transformation: ", total_units_sold)

# Sanity check 2
grouped_sales_2018 = grouped_sales[grouped_sales['YEAR'] == 2018]['UNITS'].sum()
print("Sum of units sold in 2018 in grouped_sales: ", grouped_sales_2018)
print("Sum of units sold in 2018 before grouping transformation: ", total_units_sold_in_2018)


Sum of units in grouped_sales:  30299003.0
Sum of units sold before grouping transformation:  30299003.0
Sum of units sold in 2018 in grouped_sales:  7907212.0
Sum of units sold in 2018 before grouping transformation:  7907212.0
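
The two sanity checks above are compared by eye. Below is a minimal sketch of the same checks written as assertions, so that a mismatch stops the notebook instead of passing unnoticed. `check_totals_preserved` is a hypothetical helper name, and the toy frame stands in for `grouped_sales`.

```python
import numpy as np
import pandas as pd

def check_totals_preserved(before_total, grouped, year=None):
    """Raise AssertionError if grouping changed the sum of UNITS."""
    if year is None:
        units = grouped["UNITS"]
    else:
        units = grouped.loc[grouped["YEAR"] == year, "UNITS"]
    after_total = units.sum()
    # isclose tolerates floating-point rounding in large sums
    assert np.isclose(before_total, after_total), (before_total, after_total)
    return after_total

# Toy stand-in for grouped_sales (illustrative values only)
toy_grouped = pd.DataFrame({
    "ARTICLE_ID": ["a", "a", "b"],
    "UNITS": [2.0, 3.0, 5.0],
    "MONTH": [9, 10, 10],
    "YEAR": [2017, 2018, 2018],
})
all_years_total = check_totals_preserved(10.0, toy_grouped)
total_2018 = check_totals_preserved(8.0, toy_grouped, year=2018)
```

In the notebook this would be called as `check_totals_preserved(total_units_sold, grouped_sales)` and `check_totals_preserved(total_units_sold_in_2018, grouped_sales, year=2018)`.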


In [14]:
# pivot the dataframe
sales_pivoted = grouped_sales.pivot(index=['YEAR', 'ARTICLE_ID'], columns='MONTH', values='UNITS').reset_index()

# print the pivoted dataframe
sales_pivoted

MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12
0,2015,106242,,,,149.0,147.0,126.0,144.0,132.0,136.0,212.0,209.0,210.0
1,2015,106259,,,,869.0,1096.0,1185.0,1278.0,1122.0,1239.0,1029.0,1486.0,1253.0
2,2015,106310,,,,969.0,1435.0,1470.0,1694.0,1938.0,2140.0,1752.0,1885.0,1436.0
3,2015,106497,,,,4011.0,3735.0,4022.0,3808.0,4941.0,3136.0,2640.0,3981.0,3034.0
4,2015,106650,,,,734.0,723.0,976.0,1053.0,1507.0,1770.0,1293.0,1549.0,1017.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1735,2018,98031,559.0,457.0,418.0,381.0,402.0,553.0,408.0,410.0,321.0,295.0,,
1736,2018,98048,767.0,722.0,1037.0,782.0,951.0,1002.0,979.0,922.0,733.0,653.0,,
1737,2018,98065,2237.0,2250.0,2384.0,2153.0,2197.0,2604.0,2457.0,2799.0,2478.0,1887.0,,
1738,2018,98099,2490.0,2343.0,2700.0,2352.0,2703.0,3224.0,2986.0,2823.0,2929.0,3042.0,,


In [15]:

# .copy() gives an independent dataframe, which avoids SettingWithCopyWarning
temp = sales_pivoted[['ARTICLE_ID','YEAR', 7, 8, 9, 10]].copy()
temp['AvgLast3Months'] = temp[[7, 8, 9]].mean(axis=1)
temp = temp[['ARTICLE_ID','YEAR', 7, 8, 9,'AvgLast3Months',10]]
final = temp[temp['YEAR'] == 2018].copy()
final


MONTH,ARTICLE_ID,YEAR,7,8,9,AvgLast3Months,10
1305,106242,2018,431.0,552.0,564.0,515.666667,507.0
1306,106259,2018,709.0,722.0,645.0,692.000000,687.0
1307,106310,2018,1059.0,1024.0,534.0,872.333333,417.0
1308,106497,2018,1457.0,1713.0,1765.0,1645.000000,1412.0
1309,106650,2018,555.0,676.0,743.0,658.000000,573.0
...,...,...,...,...,...,...,...
1735,98031,2018,408.0,410.0,321.0,379.666667,295.0
1736,98048,2018,979.0,922.0,733.0,878.000000,653.0
1737,98065,2018,2457.0,2799.0,2478.0,2578.000000,1887.0
1738,98099,2018,2986.0,2823.0,2929.0,2912.666667,3042.0


In [16]:
quant = sales_pivoted.query('YEAR == 2017')[11].quantile(0.9)

sales_pivoted[sales_pivoted[11] >= quant].query('YEAR == 2017').sort_values(by=11, ascending=False)

## TODO - check end of this notebook and around here 
## Want to have a notebook going that addresses articles with the biggest gaps between Oct-Nov and Sept-Nov
## because those articles will really hurt our score. 
## Also note issue with articles that are newer and so have harder sales patterns to predict

MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12
1296,2017,97946,7344.0,6237.0,7825.0,7617.0,7659.0,8163.0,8821.0,8367.0,9757.0,8774.0,12071.0,10569.0
1294,2017,97912,7209.0,6858.0,9239.0,8319.0,8317.0,9129.0,9493.0,8201.0,9110.0,9872.0,10899.0,9238.0
886,2017,114674,7738.0,10347.0,14452.0,12580.0,12806.0,11262.0,14099.0,14687.0,11148.0,12606.0,10753.0,9362.0
1287,2017,97759,8163.0,7318.0,8219.0,7749.0,7689.0,8017.0,9362.0,8580.0,8995.0,8744.0,10475.0,9114.0
911,2017,123888,6152.0,6651.0,12303.0,10001.0,10806.0,10125.0,12770.0,13273.0,11226.0,11556.0,9956.0,8095.0
1208,2017,6251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,5970.0,9847.0,12378.0
903,2017,122596,6873.0,7638.0,10878.0,9652.0,9914.0,8831.0,10942.0,10886.0,9905.0,10311.0,9666.0,8726.0
1234,2017,7099618,7665.0,7209.0,8663.0,8358.0,8865.0,9196.0,8733.0,9527.0,8400.0,8648.0,8876.0,8663.0
986,2017,15131,8230.0,6973.0,7584.0,6825.0,8243.0,8462.0,7969.0,7581.0,7018.0,8926.0,8787.0,6548.0
987,2017,15148,8748.0,8074.0,7800.0,7413.0,8795.0,9273.0,7565.0,10026.0,8239.0,9937.0,8559.0,5362.0



In [18]:
final.to_csv("last_3_mo_articles_sold.csv", index=False)

In [19]:
def get_wmape(df, actuals='actual', predictions='prediction'):
  '''
  Function takes a dataframe with two columns for actual and predicted values.
  actuals = string of the column name with actual values
  predictions = string of the column name with predicted values
  Returns the WMAPE score.
  '''
  return sum(abs(df[actuals]-df[predictions]))/sum(df[actuals])

In [20]:
get_wmape(final, actuals=10, predictions='AvgLast3Months')

0.13183069311862147
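
To make the metric concrete: WMAPE divides the summed absolute errors by the summed actuals, so high-volume articles dominate the score. Below is a small worked sketch on the first three 2018 rows shown above (October actuals against `AvgLast3Months`). The function body mirrors `get_wmape`, and the three-row subset is for illustration only, so its score differs from the full-data value.

```python
import pandas as pd

def get_wmape(df, actuals="actual", predictions="prediction"):
    """Sum of absolute errors divided by sum of actuals."""
    return (df[actuals] - df[predictions]).abs().sum() / df[actuals].sum()

# First three 2018 rows copied from the output above
subset = pd.DataFrame({
    "ARTICLE_ID": ["106242", "106259", "106310"],
    7: [431.0, 709.0, 1059.0],
    8: [552.0, 722.0, 1024.0],
    9: [564.0, 645.0, 534.0],
    10: [507.0, 687.0, 417.0],
})
subset["AvgLast3Months"] = subset[[7, 8, 9]].mean(axis=1)

# Per-article absolute error, then the weighted score for the subset
abs_errors = (subset[10] - subset["AvgLast3Months"]).abs()
subset_wmape = get_wmape(subset, actuals=10, predictions="AvgLast3Months")
```

In this subset article 106310 accounts for nearly all of the absolute error, which illustrates how a single badly predicted article can move the score.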

## WMAPE for average most recent 2 months

In [21]:
# assign returns a new dataframe, which avoids SettingWithCopyWarning
final = final.assign(AvgLast2Months=final[[8, 9]].mean(axis=1))
final


MONTH,ARTICLE_ID,YEAR,7,8,9,AvgLast3Months,10,AvgLast2Months
1305,106242,2018,431.0,552.0,564.0,515.666667,507.0,558.0
1306,106259,2018,709.0,722.0,645.0,692.000000,687.0,683.5
1307,106310,2018,1059.0,1024.0,534.0,872.333333,417.0,779.0
1308,106497,2018,1457.0,1713.0,1765.0,1645.000000,1412.0,1739.0
1309,106650,2018,555.0,676.0,743.0,658.000000,573.0,709.5
...,...,...,...,...,...,...,...,...
1735,98031,2018,408.0,410.0,321.0,379.666667,295.0,365.5
1736,98048,2018,979.0,922.0,733.0,878.000000,653.0,827.5
1737,98065,2018,2457.0,2799.0,2478.0,2578.000000,1887.0,2638.5
1738,98099,2018,2986.0,2823.0,2929.0,2912.666667,3042.0,2876.0


In [22]:
get_wmape(final, actuals=10, predictions='AvgLast2Months')

0.1187478107079956

In [23]:
# assign returns a new dataframe, which avoids SettingWithCopyWarning
final = final.assign(
    AbsDiff_Last2=(final[10] - final['AvgLast2Months']).abs(),
    AbsDiff_Last3=(final[10] - final['AvgLast3Months']).abs(),
)

final['AbsDiff_Last2'].sum(), final['AbsDiff_Last3'].sum()


(96954.5, 107636.33333333334)

In [100]:
final

MONTH,level_0,index,ARTICLE_ID,YEAR,7,8,9,AvgLast3Months,10,AvgLast2Months,AbsDiff_Last2,AbsDiff_Last3
0,0,1305,106242,2018,431.0,552.0,564.0,515.666667,507.0,558.0,51.0,8.666667
1,1,1306,106259,2018,709.0,722.0,645.0,692.000000,687.0,683.5,3.5,5.000000
2,2,1307,106310,2018,1059.0,1024.0,534.0,872.333333,417.0,779.0,362.0,455.333333
3,3,1308,106497,2018,1457.0,1713.0,1765.0,1645.000000,1412.0,1739.0,327.0,233.000000
4,4,1309,106650,2018,555.0,676.0,743.0,658.000000,573.0,709.5,136.5,85.000000
...,...,...,...,...,...,...,...,...,...,...,...,...
430,430,1735,98031,2018,408.0,410.0,321.0,379.666667,295.0,365.5,70.5,84.666667
431,431,1736,98048,2018,979.0,922.0,733.0,878.000000,653.0,827.5,174.5,225.000000
432,432,1737,98065,2018,2457.0,2799.0,2478.0,2578.000000,1887.0,2638.5,751.5,691.000000
433,433,1738,98099,2018,2986.0,2823.0,2929.0,2912.666667,3042.0,2876.0,166.0,129.333333


## Research: Which months are good predictors for November, and which are not? 

In [25]:
# Group the rows by ARTICLE_ID and month of DATE, and sum the UNITS values
grouped_sales = sales.groupby(['ARTICLE_ID', pd.Grouper(key='DATE', freq='M')])['UNITS'].sum().reset_index()

# Create new month and year columns
grouped_sales['MONTH'] = grouped_sales['DATE'].dt.month
grouped_sales['YEAR'] = grouped_sales['DATE'].dt.year
grouped_sales = grouped_sales.drop(["DATE"],axis=1)

# Filter for 2017
grouped_sales2017 = grouped_sales[grouped_sales['YEAR'] == 2017]
sales_pivoted2017 = grouped_sales2017.pivot(index=['YEAR', 'ARTICLE_ID'], columns='MONTH', values='UNITS').reset_index()

# Add comparison columns
sales_pivoted2017['Oct-Nov'] = sales_pivoted2017[11]-sales_pivoted2017[10]
sales_pivoted2017['Sept-Oct'] = sales_pivoted2017[10]-sales_pivoted2017[9]
sales_pivoted2017['Perc_Oct-Nov'] = round((sales_pivoted2017[11]-sales_pivoted2017[10])/sales_pivoted2017[10]*100,1)
sales_pivoted2017['Perc_Sept-Oct'] = round((sales_pivoted2017[10]-sales_pivoted2017[9])/sales_pivoted2017[9]*100,1)
sales_pivoted2017['AvgSeptOct'] = sales_pivoted2017[[9, 10]].apply(lambda x: x.mean(), axis=1)
sales_pivoted2017['Nov-AvgSeptOct'] = sales_pivoted2017[11]-sales_pivoted2017['AvgSeptOct']
sales_pivoted2017['Abs(Nov-AvgSeptOct)'] = abs(sales_pivoted2017['Nov-AvgSeptOct'])
sales_pivoted2017[5:11]


MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12,Oct-Nov,Sept-Oct,Perc_Oct-Nov,Perc_Sept-Oct,AvgSeptOct,Nov-AvgSeptOct,Abs(Nov-AvgSeptOct)
5,2017,108860,441.0,431.0,530.0,503.0,455.0,478.0,468.0,574.0,664.0,598.0,657.0,763.0,59.0,-66.0,9.9,-9.9,631.0,26.0,26.0
6,2017,108894,1016.0,810.0,997.0,1025.0,1229.0,1067.0,1007.0,1080.0,1330.0,1623.0,1940.0,1637.0,317.0,293.0,19.5,22.0,1476.5,463.5,463.5
7,2017,108911,575.0,550.0,608.0,596.0,632.0,675.0,698.0,803.0,905.0,1021.0,1165.0,1049.0,144.0,116.0,14.1,12.8,963.0,202.0,202.0
8,2017,11187,640.0,657.0,736.0,757.0,739.0,759.0,792.0,679.0,582.0,671.0,492.0,601.0,-179.0,89.0,-26.7,15.3,626.5,-134.5,134.5
9,2017,112345,752.0,705.0,951.0,717.0,1472.0,1303.0,1682.0,2113.0,3087.0,3036.0,2998.0,2553.0,-38.0,-51.0,-1.3,-1.7,3061.5,-63.5,63.5
10,2017,113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,27.0,52.0,78.0,25.0,27.0,92.6,inf,13.5,38.5,38.5


In [26]:
sales_pivoted2017.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False).head(20)

MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12,Oct-Nov,Sept-Oct,Perc_Oct-Nov,Perc_Sept-Oct,AvgSeptOct,Nov-AvgSeptOct,Abs(Nov-AvgSeptOct)
393,2017,85876,9869.0,10156.0,13401.0,11604.0,13748.0,13643.0,18826.0,23338.0,15807.0,5268.0,1200.0,789.0,-4068.0,-10539.0,-77.2,-66.7,10537.5,-9337.5,9337.5
338,2017,6251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,5970.0,9847.0,12378.0,3877.0,5938.0,64.9,18556.2,3001.0,6846.0,6846.0
339,2017,6252,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,79.0,3075.0,6960.0,7073.0,3885.0,2996.0,126.3,3792.4,1577.0,5383.0,5383.0
264,2017,3816,0.0,0.0,12.0,12.0,47.0,44.0,78.0,1173.0,3342.0,6113.0,7895.0,7820.0,1782.0,2771.0,29.2,82.9,4727.5,3167.5,3167.5
426,2017,97946,7344.0,6237.0,7825.0,7617.0,7659.0,8163.0,8821.0,8367.0,9757.0,8774.0,12071.0,10569.0,3297.0,-983.0,37.6,-10.1,9265.5,2805.5,2805.5
278,2017,4020,0.0,0.0,1.0,20.0,26.0,49.0,122.0,618.0,794.0,2007.0,4190.0,4150.0,2183.0,1213.0,108.8,152.8,1400.5,2789.5,2789.5
281,2017,4030,0.0,0.0,4.0,12.0,17.0,8.0,62.0,544.0,548.0,2459.0,3836.0,3376.0,1377.0,1911.0,56.0,348.7,1503.5,2332.5,2332.5
344,2017,6258,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.0,131.0,2082.0,3403.0,3140.0,1321.0,1951.0,63.4,1489.3,1106.5,2296.5,2296.5
279,2017,4022,0.0,0.0,1.0,50.0,77.0,82.0,166.0,657.0,1881.0,3047.0,4550.0,4009.0,1503.0,1166.0,49.3,62.0,2464.0,2086.0,2086.0
273,2017,4002,0.0,0.0,21.0,34.0,34.0,40.0,180.0,902.0,1039.0,2155.0,3392.0,3418.0,1237.0,1116.0,57.4,107.4,1597.0,1795.0,1795.0


In [27]:
# Group the rows by ARTICLE_ID and month of DATE, and sum the UNITS values
grouped_sales = sales.groupby(['ARTICLE_ID', pd.Grouper(key='DATE', freq='M')])['UNITS'].sum().reset_index()

# Create new month and year columns
grouped_sales['MONTH'] = grouped_sales['DATE'].dt.month
grouped_sales['YEAR'] = grouped_sales['DATE'].dt.year
grouped_sales = grouped_sales.drop(["DATE"],axis=1)

# Filter for 2016
grouped_sales2016 = grouped_sales[grouped_sales['YEAR'] == 2016]
sales_pivoted2016 = grouped_sales2016.pivot(index=['YEAR', 'ARTICLE_ID'], columns='MONTH', values='UNITS').reset_index()
sales_pivoted2016[5:11]

# Add comparison columns
sales_pivoted2016['Oct-Nov'] = sales_pivoted2016[11]-sales_pivoted2016[10]
sales_pivoted2016['Sept-Oct'] = sales_pivoted2016[10]-sales_pivoted2016[9]
sales_pivoted2016['Perc_Oct-Nov'] = round((sales_pivoted2016[11]-sales_pivoted2016[10])/sales_pivoted2016[10]*100,1)
sales_pivoted2016['Perc_Sept-Oct'] = round((sales_pivoted2016[10]-sales_pivoted2016[9])/sales_pivoted2016[9]*100,1)
sales_pivoted2016['AvgSeptOct'] = sales_pivoted2016[[9, 10]].apply(lambda x: x.mean(), axis=1)
sales_pivoted2016['Nov-AvgSeptOct'] = sales_pivoted2016[11]-sales_pivoted2016['AvgSeptOct']
sales_pivoted2016['Abs(Nov-AvgSeptOct)'] = abs(sales_pivoted2016['Nov-AvgSeptOct'])
sales_pivoted2016[5:11]


MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12,Oct-Nov,Sept-Oct,Perc_Oct-Nov,Perc_Sept-Oct,AvgSeptOct,Nov-AvgSeptOct,Abs(Nov-AvgSeptOct)
5,2016,108860,398.0,385.0,369.0,287.0,327.0,416.0,284.0,462.0,541.0,553.0,528.0,545.0,-25.0,12.0,-4.5,2.2,547.0,-19.0,19.0
6,2016,108894,864.0,729.0,652.0,662.0,830.0,1023.0,866.0,906.0,1281.0,1188.0,1265.0,1548.0,77.0,-93.0,6.5,-7.3,1234.5,30.5,30.5
7,2016,108911,743.0,609.0,650.0,491.0,616.0,689.0,548.0,582.0,813.0,724.0,696.0,860.0,-28.0,-89.0,-3.9,-10.9,768.5,-72.5,72.5
8,2016,11187,1086.0,1050.0,751.0,934.0,745.0,879.0,908.0,802.0,808.0,733.0,757.0,706.0,24.0,-75.0,3.3,-9.3,770.5,-13.5,13.5
9,2016,112345,881.0,619.0,409.0,498.0,589.0,542.0,933.0,995.0,989.0,1102.0,1376.0,1444.0,274.0,113.0,24.9,11.4,1045.5,330.5,330.5
10,2016,113,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,0.0,0.0,0.0


In [28]:
sales_pivoted2016.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False).head(20)

MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12,Oct-Nov,Sept-Oct,Perc_Oct-Nov,Perc_Sept-Oct,AvgSeptOct,Nov-AvgSeptOct,Abs(Nov-AvgSeptOct)
384,2016,832,3468.0,3572.0,4011.0,3571.0,3199.0,3267.0,3319.0,3595.0,5250.0,7416.0,10280.0,3611.0,2864.0,2166.0,38.6,41.3,6333.0,3947.0,3947.0
312,2016,4971,1293.0,1443.0,1424.0,1274.0,1165.0,876.0,822.0,931.0,1220.0,1625.0,4972.0,5412.0,3347.0,405.0,206.0,33.2,1422.5,3549.5,3549.5
187,2016,23240,0.0,0.0,705.0,1918.0,3172.0,3264.0,3319.0,2342.0,2572.0,2920.0,5466.0,5009.0,2546.0,348.0,87.2,13.5,2746.0,2720.0,2720.0
393,2016,85876,5173.0,4992.0,5919.0,5200.0,4926.0,6192.0,9414.0,10163.0,10246.0,9906.0,12119.0,12335.0,2213.0,-340.0,22.3,-3.3,10076.0,2043.0,2043.0
348,2016,6505,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,542.0,1570.0,1948.0,1028.0,542.0,189.7,inf,271.0,1299.0,1299.0
382,2016,830,2078.0,2108.0,1898.0,1846.0,1896.0,1822.0,1851.0,1951.0,2393.0,2630.0,3805.0,2611.0,1175.0,237.0,44.7,9.9,2511.5,1293.5,1293.5
135,2016,15556,0.0,0.0,181.0,303.0,631.0,910.0,761.0,884.0,1319.0,1390.0,2610.0,2723.0,1220.0,71.0,87.8,5.4,1354.5,1255.5,1255.5
374,2016,81303,1252.0,981.0,1237.0,1702.0,1452.0,1211.0,1583.0,2084.0,1766.0,2069.0,776.0,1163.0,-1293.0,303.0,-62.5,17.2,1917.5,-1141.5,1141.5
316,2016,53967,3395.0,3682.0,3602.0,4052.0,3758.0,5072.0,5491.0,5174.0,5237.0,5675.0,4453.0,3993.0,-1222.0,438.0,-21.5,8.4,5456.0,-1003.0,1003.0
131,2016,15471,0.0,0.0,127.0,289.0,865.0,1034.0,1001.0,1142.0,1436.0,1243.0,2333.0,1979.0,1090.0,-193.0,87.7,-13.4,1339.5,993.5,993.5


In [29]:
# Group the rows by ARTICLE_ID and month of DATE, and sum the UNITS values
grouped_sales = sales.groupby(['ARTICLE_ID', pd.Grouper(key='DATE', freq='M')])['UNITS'].sum().reset_index()

# Create new month and year columns
grouped_sales['MONTH'] = grouped_sales['DATE'].dt.month
grouped_sales['YEAR'] = grouped_sales['DATE'].dt.year
grouped_sales = grouped_sales.drop(["DATE"],axis=1)

# Filter for 2015
grouped_sales2015 = grouped_sales[grouped_sales['YEAR'] == 2015]
sales_pivoted2015 = grouped_sales2015.pivot(index=['YEAR', 'ARTICLE_ID'], columns='MONTH', values='UNITS').reset_index()

# Add comparison columns
sales_pivoted2015['Oct-Nov'] = sales_pivoted2015[11]-sales_pivoted2015[10]
sales_pivoted2015['Sept-Oct'] = sales_pivoted2015[10]-sales_pivoted2015[9]
sales_pivoted2015['Perc_Oct-Nov'] = round((sales_pivoted2015[11]-sales_pivoted2015[10])/sales_pivoted2015[10]*100,1)
sales_pivoted2015['Perc_Sept-Oct'] = round((sales_pivoted2015[10]-sales_pivoted2015[9])/sales_pivoted2015[9]*100,1)
sales_pivoted2015['AvgSeptOct'] = sales_pivoted2015[[9, 10]].apply(lambda x: x.mean(), axis=1)
sales_pivoted2015['Nov-AvgSeptOct'] = sales_pivoted2015[11]-sales_pivoted2015['AvgSeptOct']
sales_pivoted2015['Abs(Nov-AvgSeptOct)'] = abs(sales_pivoted2015['Nov-AvgSeptOct'])


In [30]:
sales_pivoted2015.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False).head(20)

MONTH,YEAR,ARTICLE_ID,4,5,6,7,8,9,10,11,12,Oct-Nov,Sept-Oct,Perc_Oct-Nov,Perc_Sept-Oct,AvgSeptOct,Nov-AvgSeptOct,Abs(Nov-AvgSeptOct)
314,2015,5148,0.0,0.0,0.0,0.0,0.0,862.0,3849.0,5151.0,4276.0,1302.0,2987.0,33.8,346.5,2355.5,2795.5,2795.5
417,2015,97759,10183.0,11861.0,11774.0,8448.0,8527.0,8884.0,10551.0,11865.0,10094.0,1314.0,1667.0,12.5,18.8,9717.5,2147.5,2147.5
313,2015,5130,0.0,0.0,0.0,0.0,0.0,692.0,3216.0,3997.0,3514.0,781.0,2524.0,24.3,364.7,1954.0,2043.0,2043.0
426,2015,97946,5657.0,7102.0,7610.0,7553.0,6195.0,6290.0,7183.0,8614.0,7977.0,1431.0,893.0,19.9,14.2,6736.5,1877.5,1877.5
424,2015,97912,8125.0,11819.0,11616.0,10019.0,9574.0,9105.0,9734.0,11123.0,9303.0,1389.0,629.0,14.3,6.9,9419.5,1703.5,1703.5
50,2015,136094,4239.0,5558.0,5941.0,5020.0,4457.0,4078.0,4947.0,5840.0,5094.0,893.0,869.0,18.1,21.3,4512.5,1327.5,1327.5
37,2015,122817,2278.0,2718.0,2720.0,2653.0,3415.0,4231.0,3296.0,2507.0,1383.0,-789.0,-935.0,-23.9,-22.1,3763.5,-1256.5,1256.5
315,2015,5151,0.0,0.0,0.0,0.0,0.0,321.0,2047.0,2401.0,1933.0,354.0,1726.0,17.3,537.7,1184.0,1217.0,1217.0
239,2015,3438,3244.0,2680.0,3064.0,4442.0,5140.0,3766.0,2460.0,4269.0,4394.0,1809.0,-1306.0,73.5,-34.7,3113.0,1156.0,1156.0
3,2015,106497,4011.0,3735.0,4022.0,3808.0,4941.0,3136.0,2640.0,3981.0,3034.0,1341.0,-496.0,50.8,-15.8,2888.0,1093.0,1093.0
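
The 2017, 2016, and 2015 cells repeat the same steps with only the year changed. Below is one way this could be factored out, as a sketch. `november_comparison` is a hypothetical helper, not part of the original notebook. It reuses the column names from the cells above and additionally replaces the `inf` percentages that appear when the earlier month had zero sales with `NaN`, which is an assumption about the desired behaviour.

```python
import numpy as np
import pandas as pd

def november_comparison(grouped_sales, year):
    """Pivot one year to wide format and add Sept/Oct/Nov comparison columns."""
    one_year = grouped_sales[grouped_sales["YEAR"] == year]
    wide = one_year.pivot(
        index=["YEAR", "ARTICLE_ID"], columns="MONTH", values="UNITS"
    ).reset_index()
    # Note: "Oct-Nov" holds November minus October, as in the cells above
    wide["Oct-Nov"] = wide[11] - wide[10]
    wide["Sept-Oct"] = wide[10] - wide[9]
    wide["Perc_Oct-Nov"] = (wide["Oct-Nov"] / wide[10] * 100).round(1)
    wide["Perc_Sept-Oct"] = (wide["Sept-Oct"] / wide[9] * 100).round(1)
    wide["AvgSeptOct"] = wide[[9, 10]].mean(axis=1)
    wide["Nov-AvgSeptOct"] = wide[11] - wide["AvgSeptOct"]
    wide["Abs(Nov-AvgSeptOct)"] = wide["Nov-AvgSeptOct"].abs()
    # Dividing by a zero-sales month yields +/-inf; treat it as missing (assumption)
    return wide.replace([np.inf, -np.inf], np.nan)

# Long-format toy input built from two articles in the 2017 output above
toy = pd.DataFrame({
    "ARTICLE_ID": ["108860"] * 3 + ["113"] * 3,
    "UNITS": [664.0, 598.0, 657.0, 0.0, 27.0, 52.0],
    "MONTH": [9, 10, 11] * 2,
    "YEAR": [2017] * 6,
})
result = november_comparison(toy, 2017)
```

With such a helper the three yearly cells would reduce to calls like `november_comparison(grouped_sales, 2016)`, and the monthly grouping would only need to be computed once.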


In [37]:
# Get top 20 most impactful articles (from above) for each year
x = sales_pivoted2017.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False)['ARTICLE_ID'].head(20)
y = sales_pivoted2016.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False)['ARTICLE_ID'].head(20)
z = sales_pivoted2015.sort_values(['Abs(Nov-AvgSeptOct)'], ascending=False)['ARTICLE_ID'].head(20)

x, y, z = set(x), set(y), set(z)

print("Articles with worst Nov predictions from 2015-2017: ", x.intersection(y,z))
print("Articles with worst Nov predictions from 2016-2017: ", x.intersection(y))

Articles with worst Nov predictions from 2015-2017:  {'97912', '85876'}
Articles with worst Nov predictions from 2016-2017:  {'97912', '85876'}


In [45]:
sales_pivoted[(sales_pivoted["ARTICLE_ID"] == "97912")] # | (sales_pivoted["ARTICLE_ID"] == "85876")]

MONTH,YEAR,ARTICLE_ID,1,2,3,4,5,6,7,8,9,10,11,12
424,2015,97912,,,,8125.0,11819.0,11616.0,10019.0,9574.0,9105.0,9734.0,11123.0,9303.0
859,2016,97912,8335.0,8175.0,9305.0,8466.0,9196.0,10182.0,10017.0,8272.0,9870.0,9406.0,8795.0,9807.0
1294,2017,97912,7209.0,6858.0,9239.0,8319.0,8317.0,9129.0,9493.0,8201.0,9110.0,9872.0,10899.0,9238.0
1729,2018,97912,7579.0,7251.0,8030.0,6842.0,7689.0,8930.0,8622.0,8113.0,8384.0,8175.0,,


In [51]:
# get units sold from Jan-Oct each year for article 97912
grouped_sales[(grouped_sales["ARTICLE_ID"] == "97912") & (grouped_sales["MONTH"] != 11) & (grouped_sales["MONTH"] != 12)][["YEAR","UNITS"]].groupby(["YEAR"]).sum()


YEAR,UNITS
2015,69992.0
2016,91224.0
2017,85747.0
2018,79615.0


The above suggests that the Sept-Oct average cannot reliably predict November sales for two articles, 97912 and 85876.

In fact, 85876 shows dwindling sales throughout 2018. It would be reasonable to predict fewer than 12 units for this article given the above. 

Article 97912 consistently has a high sales volume. In 2015 and 2017, this article sold more units in Nov than in Sept or Oct, but the opposite was true in 2016. Meanwhile, its Jan-Oct total for 2018 (79,615 units) is lower than in 2016 (91,224) and 2017 (85,747).
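
The month-by-month comparison for article 97912 can be checked directly from the pivoted rows shown above. Below is a short sketch, with the September to November figures copied by hand from that output into a small frame. In the notebook itself one would filter `sales_pivoted` instead.

```python
import pandas as pd

# Sept, Oct, Nov units for article 97912, copied from the pivoted output above
art_97912 = pd.DataFrame(
    {
        9: [9105.0, 9870.0, 9110.0],
        10: [9734.0, 9406.0, 9872.0],
        11: [11123.0, 8795.0, 10899.0],
    },
    index=pd.Index([2015, 2016, 2017], name="YEAR"),
)

# True where November outsold both September and October
nov_is_peak = art_97912[11] > art_97912[[9, 10]].max(axis=1)

# Signed gap between November and the Sept-Oct average
nov_minus_avg = art_97912[11] - art_97912[[9, 10]].mean(axis=1)
```

The sign of `nov_minus_avg` flips in 2016, which is why a fixed Sept-Oct average is an unreliable November predictor for this article.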

Questions: 

- Why does article 106497 show such a big jump in sales in November 2015, then in October 2016, and a fairly large one in September 2017?

- Similarly, for article 106650, September is a much better predictor of November than October is from 2015-2017.

- Article 98150 seems to make big leaps at the end of the year.

- What happens if we focus on the articles with the highest volumes of transactions?

- Article 3438: how did this article have almost no sales in some months in 2017?

## Article Drill Down

In [66]:
# Units of article 3438 sold in August 2017
get_total_units_sold(sales, year=2017, month=8, article_id=3438)

4.0