In [162]:
import numpy as np
import pandas as pd
import csv
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import seaborn as sns

import scipy.signal as sgn
import scipy.stats as stat
import matplotlib.gridspec as grds


import statsmodels.tsa.stattools as sta

%matplotlib qt


sns.set_style("whitegrid")

In [170]:
# Loading the data
data = pd.read_csv('challenge-data-v2.csv', index_col='event_date',parse_dates = True)


# Taking a look into the first elements of the data to know more about the structure
data.head()

# Renaming the columns to shorter, more convenient names
data.columns = ['sups','moff','mon','hd']

In [74]:
data.describe()

,sups,moff,mon
count,1155.0,282.0,1155.0
mean,8054.577489,98884.865284,334811.893541
std,2093.419767,71107.830804,152979.099583
min,3973.0,0.0,40243.09
25%,6657.5,55285.4875,234232.175
50%,7639.0,84413.085,314120.8
75%,8967.5,123233.0,430718.99
max,26348.0,446191.35,975192.3


In [85]:
for feature in data.columns:
    if data[feature].values.dtype != 'object':
        print feature
        scale_factor = np.round(np.log10(np.mean(data[feature])))
        data[feature] = data[feature] / 10**scale_factor
        print 'The feature: '+feature+' was rescaled by a factor of '+str(10**scale_factor)

sups
The feature: sups was rescaled by a factor of 10000.0
moff
The feature: moff was rescaled by a factor of 100000.0
mon
The feature: mon was rescaled by a factor of 1000000.0


In [82]:
# Some statistics
print data.columns[data.columns!='hd']

data.head()

Index([u'sups', u'moff', u'mon'], dtype='object')


event_date,sups,moff,mon,hd
2014-01-01,0.4246,,0.041934,NewYearsDay
2014-01-02,0.6569,,0.054117,
2014-01-03,0.7466,,0.051634,
2014-01-04,0.6911,,0.047323,
2014-01-05,0.5929,,0.048324,


## Q1) CORRELATION
    -Q1.1) Using Python and the libraries and packages of your own choice, write a well-structured and readable program that can be used to analyse the data set for correlations between the time series. The program should work with data files that are similar, but not necessarily identical to the sample provided. Your response should include the program itself and brief instructions about any dependencies that should be installed before running it.


### Proposal

To answer this question I wrote a Python file called "BW7_toolbox.py", which is attached to my submission. It contains two functions, correlation_study and seasons_study, one for each question in the case study. The project also includes an __init__ file so that it can be imported as a Python package.

To run the functions, unzip the file into a folder that is on your Python path and add it to your project with "import BW7_toolbox as BW7". You can then call BW7.correlation_study(filename) or BW7.seasons_study(filename). By default the data for this case study is loaded, but any csv file located in the same folder as the Python files can be used by passing its filename (including the csv extension).

The idea behind the solution to this first question is to give the user the most useful plots and information for spotting any correlation. With that in mind, running "correlation_study" shows the following plots on screen (each in its own window):
* Temporal distribution by year. This plot gives a broad overview of the activity of the different features. Each feature is assigned one column, while the rows show different distributions:
    * First row: distribution of the features across the year, for each year.
    * Second row: distribution of the features per month, for each year.
    * Third row: distribution of the features over the days of the week, for each year.

* Monthly distribution of features. This plot shows the monthly distribution of the different features. It is meant to give some insight into the data before the next plot.
* Monthly features cross-correlations. This plot explicitly addresses the correlation between the different time series of the data set on the time scale of a month. I did it this way because the data is non-stationary, so the relationship depends strongly on the period of the year (the month). Lags are measured between the first time series named in the plot title and the second, with zero lag at the centre. A peak away from zero means that one series tends to follow the other by that many days: with scipy.signal.correlate(first, second), as used here, a peak on the positive side means the first series tends to follow the second by that lag, and a peak on the negative side means the second tends to follow the first. The cross-correlations are computed with the same number of samples as the original time series (I find the 'full' version uninformative on this time scale). An empty plot means there was not enough collected data to calculate any statistics.

* Rolling correlation coefficient and weekly activity of the different features. In my opinion this plot is more informative than the previous one, although it approaches the problem slightly differently. The figure shows the moving correlation coefficient (with a window of 14 days, i.e. 2 weeks) together with the dynamics of the features (resampled to weekly activity to smooth them). This makes it possible to extract some statistics on how the coefficient behaves in each year, which is shown in the last row of the plot. That row presents the distribution of the correlations within each year and gives a picture of the business choices taken during that year. It also gives an indication of the effect of any event, because an event will change the shape and the moments of those distributions. The median (assuming that in general the distribution will not be Gaussian) is shown as a parameter for decision making (or as an index of correlation for a year). When the skewness of the distribution is close to that of a Gaussian (absolute value below 0.65 in the code), a normal distribution is fitted to the data and its mean and standard deviation are shown. In addition, the moments of the distributions for each year are printed to the console.
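
The rolling-correlation summary described in the last item can be sketched on synthetic data as follows. This is only a minimal sketch, not the code from BW7_toolbox.py: the function name rolling_corr_summary, its parameters and the random series are assumptions made for illustration, while the 14-day window and the 0.65 skewness threshold are the values used in the cells below.

```python
import numpy as np
import pandas as pd
import scipy.stats as stat


def rolling_corr_summary(first, second, window=14, skew_limit=0.65):
    """Rolling correlation of two daily series plus summary statistics.

    Hypothetical helper written for this example only.
    """
    # The first (window - 1) values are NaN because the window is incomplete
    rolling = first.rolling(window=window).corr(second).dropna()
    summary = {
        'median': rolling.median(),
        'skew': rolling.skew(),
        'kurtosis': rolling.kurtosis(),  # excess kurtosis: 0 for a normal distribution
    }
    # Fit a normal distribution only when the distribution is roughly symmetric
    if abs(summary['skew']) < skew_limit:
        summary['mu'], summary['sigma'] = stat.norm.fit(rolling.values)
    return rolling, summary


# Synthetic daily data (assumed for illustration, not the PetFood data)
rng = np.random.RandomState(0)
index = pd.date_range('2016-01-01', periods=120, freq='D')
first = pd.Series(rng.randn(120), index=index)
second = 0.5 * first + pd.Series(rng.randn(120), index=index)

rolling, summary = rolling_corr_summary(first, second)
print(summary)
```

The summary dictionary only contains 'mu' and 'sigma' when the skewness test passes, which mirrors the decision of whether to draw the fitted normal curve.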

These plots are just tools: a window that gives the observer insight into what is happening with the correlation of the features.
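
When reading the monthly cross-correlation plots, it helps to check the sign convention of the lag axis on a case where the answer is known. The sketch below does this with scipy.signal.correlate on synthetic data; the helper name lagged_xcorr and the test series are assumptions made for illustration, and it uses mode='full' with an explicit lag axis rather than the mode='same' setting of the cells below.

```python
import numpy as np
import scipy.signal as sgn


def lagged_xcorr(first, second):
    """Return (lags, normalised cross-correlation) for two equal-length series.

    Hypothetical helper written for this example only.
    """
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    n = len(first)
    # Centre and scale so that the zero-lag value is the Pearson coefficient
    a = (first - first.mean()) / first.std()
    b = (second - second.mean()) / second.std()
    xcorr = sgn.correlate(a, b, mode='full') / n
    # In 'full' mode the output index k corresponds to lag k - (n - 1)
    lags = np.arange(-(n - 1), n)
    return lags, xcorr


# Synthetic check: 'delayed' repeats 'base' three samples later (circular shift)
rng = np.random.RandomState(0)
base = rng.randn(200)
delayed = np.roll(base, 3)

lags, xcorr = lagged_xcorr(base, delayed)
peak_lag = lags[np.argmax(xcorr)]
print(peak_lag)  # -3: the second series follows the first by 3 samples
```

A peak at a negative lag therefore means that the second argument trails the first, and a peak at a positive lag means the opposite.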

Non-numeric features are not considered in this study. That is why the holidays are discarded (not used).
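
One way to drop non-numeric columns without hard-coding the name 'hd' is to select columns by dtype. This is a sketch under assumptions rather than the approach taken in the cells below (which filter on the column name): the helper numeric_features and the two-row sample frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def numeric_features(frame):
    """Return only the numeric columns of a DataFrame (hypothetical helper)."""
    return frame.select_dtypes(include=[np.number])


# Made-up sample with the same column layout as the renamed data set
sample = pd.DataFrame(
    {'sups': [0.42, 0.66], 'mon': [0.04, 0.05], 'hd': ['NewYearsDay', None]},
    index=pd.to_datetime(['2014-01-01', '2014-01-02']),
    columns=['sups', 'mon', 'hd'],
)

numeric = numeric_features(sample)
print(list(numeric.columns))
```

Selecting by dtype keeps the analysis working for data files whose non-numeric columns have different names.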

In the next cells I present the same code that is used in the function, which will help in reporting on this specific case study.

There is some redundancy in the code, but this was done with reusability of the code blocks in mind, not only for me but for other developers too.


In [198]:
# *****************************************************
# Computation of the distributions per year for the different features
years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure('Temporal distributions by year', figsize=(20, 10))

# Definition of the grid of plots: three rows (daily, monthly and day-of-week views)
# and one column per numeric feature.
col = len(data.columns[data.columns!='hd'])
gs = grds.GridSpec(3,col)

for i in range(col):
    
    ax1 = plt.subplot(gs[0, i])
    ax2 = plt.subplot(gs[1,i])
    
    legend =[]
    for y in years:
        dat = data[data.columns[i]][str(y)].values
        ax1.plot(np.arange(len(dat)),dat,'-')
        legend.append(str(y))
    ax1.legend(legend)
    ax1.set_title('Daily '+data.columns[i])

    legend =[]
    for y in years:
        # Aggregate explicitly: .resample() is deferred and needs .mean() to give monthly averages
        dat = data[data.columns[i]][str(y)].resample('M').mean().values
        ax2.plot(np.arange(len(dat))+1,dat,'-o')
        legend.append(str(y))
    ax2.legend(legend)
    ax2.set_title('Monthly '+data.columns[i])
    plt.xlim([1,12])

    ax3 = plt.subplot(gs[2,i])
    legend =[]
    for y in years:
        dat = data[data.columns[i]][str(y)]
#         dat = dat.groupby(data.index.dayofweek).mean()
        dat = dat.groupby(dat.index.dayofweek).mean()
        dat.index=['Mon','Tues','Wed','Thurs','Fri','Sat','Sun']
        dat.plot(style='-o')
        legend.append(str(y))
    ax3.legend(legend)
    ax3.set_title('Day week '+data.columns[i])

# print data.index.strftime('%A')

# sup['2015'].plot(ax=ax)
# sup['2016'].plot(ax=ax)
# sup['2017'].plot(ax=ax)



In [199]:
# *****************************************************
# Computation of the distribution of activity per month for the different time series that are present in the data
years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure('Monthly distribution of features',figsize=(20, 10))

# Definition of the grid of plots: one row per year and one column per numeric feature.
col = len(data.columns[data.columns!='hd'])
rows = len(years) 
gs = grds.GridSpec(rows,col)
months=['Jan','Feb','Mar','Aprl','May','Jun','Jul','Agst','Sep','Oct','Nov','Dec']
colors = sns.hls_palette(12, l=.5, s=.6)

for c in range(col):
    for r in range(rows):
        ax1 = plt.subplot(gs[r, c])

        dat_year = data[data.columns[c]][str(years[r])]

        for m in range(1,13):
#             print m
            dat = dat_year[dat_year.index.month==m].values
            ax1.plot(np.arange(len(dat)),dat,'-', color=colors[m-1])
        if r==0 and c==col-1:
            ax1.legend(months, bbox_to_anchor=(1, 1), loc='upper left', borderaxespad=0.,ncol=2, fancybox=True,frameon=True)
  
        if c==0:
            ax1.set_ylabel('Year: '+str(years[r]))
            
        if r==0:
            ax1.set_title('Feature: '+str(data.columns[c]))

In [200]:
# *****************************************************
# Computation of the distribution of cross correlations per month for the different time series that are present in the data
years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure('Monthly features cross-correlations', figsize=(20, 10))

# Definition of the matrix of plots.
feature1 =[0,0,2]
feature2= [2,1,1]

col = len(data.columns[data.columns!='hd'])
rows = len(years) 

gs = grds.GridSpec(rows,col)
months=['Jan','Feb','Mar','Aprl','May','Jun','Jul','Agst','Sep','Oct','Nov','Dec']
# colors = sns.color_palette("Set2", 12)
colors = sns.hls_palette(12, l=.5, s=.6)

for c in range(col):
    for r in range(rows):
        ax1 = plt.subplot(gs[r, c])

        dat_year_feat1 = data[data.columns[feature1[c]]][str(years[r])]
        dat_year_feat2 = data[data.columns[feature2[c]]][str(years[r])]
        
        for m in range(1,13):
            dat_feat1 = dat_year_feat1[dat_year_feat1.index.month==m].values
            dat_feat1= np.subtract(dat_feat1,np.mean(dat_feat1))
            dat_feat2 = dat_year_feat2[dat_year_feat2.index.month==m].values
            dat_feat2= np.subtract(dat_feat2,np.mean(dat_feat2))
            dat = sgn.correlate(dat_feat1,dat_feat2,mode='same')
            ax1.plot(np.linspace(-15,15,len(dat)),dat,'-', color=colors[m-1])
        if c==0:
            ax1.set_ylabel('Year: '+str(years[r]))
        if r==0 and c==col-1:
            ax1.legend(months, bbox_to_anchor=(1, 1), loc='upper left', borderaxespad=0.,ncol=2, fancybox=True,frameon=True)
        if r==0:
            ax1.set_title('Xcorr: '+str(data.columns[feature1[c]])+' and '+str(data.columns[feature2[c]]))

In [201]:
# Xcorr normalizing the variance

years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure(5,figsize=(20, 10))

# Definition of the matrix of plots.
feature1 =[0,0,2]
feature2= [2,1,1]

col = len(data.columns[data.columns!='hd'])
rows = len(years) 

gs = grds.GridSpec(rows,col)
months=['Jan','Feb','Mar','Aprl','May','Jun','Jul','Agst','Sep','Oct','Nov','Dec']
colors = sns.hls_palette(12, l=.5, s=.6)

for c in range(col):
    for r in range(rows):
        ax1 = plt.subplot(gs[r, c])

        dat_year_feat1 = data[data.columns[feature1[c]]][str(years[r])]
        dat_year_feat2 = data[data.columns[feature2[c]]][str(years[r])]
        
        for m in range(1,13):
            dat_feat1 = dat_year_feat1[dat_year_feat1.index.month==m].values
            dat_feat2 = dat_year_feat2[dat_year_feat2.index.month==m].values
            # Centre each series and scale it by its own standard deviation, so that the
            # zero-lag value equals len(dat_feat1) times the Pearson correlation coefficient
            dat_feat1= np.subtract(dat_feat1,np.mean(dat_feat1))/np.std(dat_feat1)
            dat_feat2= np.subtract(dat_feat2,np.mean(dat_feat2))/np.std(dat_feat2)
            dat = sgn.correlate(dat_feat1,dat_feat2,mode='same')
            ax1.plot(np.linspace(-15,15,len(dat)),dat,'-', color=colors[m-1])
        if c==0:
            ax1.set_ylabel('Year: '+str(years[r]))
        if r==0 and c==col-1:
            ax1.legend(months, bbox_to_anchor=(1, 1), loc='upper left', borderaxespad=0.,ncol=2, fancybox=True,frameon=True)
        if r==0:
            ax1.set_title('Xcorr: '+str(data.columns[feature1[c]])+' and '+str(data.columns[feature2[c]]))



In [86]:
# Monthly correlations Xcorr

years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure(6,figsize=(20, 10))

# Definition of the matrix of plots.
feature1 =[0]
feature2= [2]

col = len(feature1)
rows = len(years) 


gs = grds.GridSpec(rows,col)
months=['Jan','Feb','Mar','Aprl','May','Jun','Jul','Agst','Sep','Oct','Nov','Dec']
# colors = sns.color_palette("Set2", 12)
colors = sns.hls_palette(12, l=.5, s=.6)


for c in range(col):
    for r in range(rows):

        dat_year_feat1 = data[data.columns[feature1[c]]][str(years[r])]
        dat_year_feat2 = data[data.columns[feature2[c]]][str(years[r])]
        
        for m in range(1,13):
            dat_feat1 = dat_year_feat1[dat_year_feat1.index.month==m].values
#             dat_feat1= np.subtract(dat_feat1,np.mean(dat_feat1))
            dat_feat2 = dat_year_feat2[dat_year_feat2.index.month==m].values
#             dat_feat2= np.subtract(dat_feat2,np.mean(dat_feat2))
#             print type(dat_feat1)

In [381]:
x=np.arange(100)
y=np.sin(x)

dat_year_feat1 = data['sups']['2014']
dat_year_feat2 = data['mon']['2014']
            
dat_feat1 = dat_year_feat1[dat_year_feat1.index.month==1].values
dat_feat2 = dat_year_feat2[dat_year_feat2.index.month==1].values

y=dat_feat1


plt.figure(1)
# statsmodels.tsa.stattools was imported as `sta` in the first cell
plt.plot(sta.ccovf(dat_feat1,dat_feat2))


temp =sgn.correlate(np.subtract(dat_feat1,np.mean(dat_feat1)),np.subtract(dat_feat2,np.mean(dat_feat2)),mode='full')
plt.figure(2)
plt.plot(temp)

print data['sups']['2014'].corr(data['mon']['2014'])

0.268050524256


In [197]:
years = np.unique(data.index.year)

# Definition of the matrix of plots.
feature1 = [0,0,2]
feature2 = [2,1,1]

for f in range(len(feature1)):

    figid = plt.figure(
        'Rolling correlation coefficient and weekly activity of ' + data.columns[feature1[f]] + ' and ' + data.columns[
            feature2[f]], figsize=(20, 10))
    rows = len(years)

    gs = grds.GridSpec(4, 4)

    for r in range(rows):
        ax1 = plt.subplot(gs[0, :])
        ax2 = plt.subplot(gs[1, :])
        ax3 = plt.subplot(gs[2, :])

        dat_year_feat1 = data[data.columns[feature1[f]]][str(years[r])]
        dat_year_feat2 = data[data.columns[feature2[f]]][str(years[r])]

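        # Weekly means smooth the daily activity; .resample() needs an explicit aggregation
        ref = ax1.plot(dat_year_feat1.resample('W').mean())
        ax2.plot(dat_year_feat2.resample('W').mean())

        # 14-day rolling correlation (replaces the deprecated pd.rolling_corr)
        xcorr = dat_year_feat1.rolling(window=14).corr(dat_year_feat2)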
        ax3.plot(xcorr)

        ax4 = plt.subplot(gs[3, r])
        # Keep only the defined correlation values for the histogram and the median
        valid = xcorr.values[np.logical_not(np.isnan(xcorr.values))]
        n, bins, patches = ax4.hist(valid, bins=max(1, len(xcorr) // 6),
                                    facecolor=ref[0].get_color(), edgecolor=ref[0].get_color())
        mediana = ax4.axvline(np.median(valid), color='r', linestyle='--')
        ax4.set_xlim([-1, 1])
        ax4.set_xlabel('CorrCoef year '+str(years[r]))

        print '-------------------------------------------------'
        print 'Correlation distribution for year ' + str(years[r])
        print 'Mean:', xcorr.mean()
        print 'Median:', xcorr.median()
        print 'Standard deviation:', xcorr.std()
        print 'Kurtosis:', xcorr.kurtosis()  # Excess (Fisher) kurtosis: 0 for a normal distribution; driven mainly by the tails/outliers rather than the central peak
        print 'Skewness:', xcorr.skew()  # 0 for a symmetric distribution such as the normal

        if (np.abs(xcorr.skew())) < 0.65:
            mu, sigma = stat.norm.fit(xcorr.values[np.logical_not(np.isnan(xcorr.values))])
            print 'Normal distribution fitted!'
            print 'mu=' + str(mu)
            print 'sigma=' + str(sigma)

            # Scale the fitted density to histogram counts: pdf * number of samples * bin width
            valid = xcorr.values[np.logical_not(np.isnan(xcorr.values))]
            fitted_normal = stat.norm.pdf(bins, mu, sigma) * len(valid) * (bins[1] - bins[0])
            normfit = ax4.plot(bins, fitted_normal, '--', linewidth=2, color="#3498db")
            ax4.legend(['Median '+str(round(xcorr.median(),2)), 'N(' + str(round(mu, 1)) + ',' + str(round(sigma, 2)) + ')'], loc='best')
        else:
            ax4.legend(['Median '+str(round(xcorr.median(),2))], loc='best')


    ax1.set_ylabel(data.columns[feature1[f]])
    ax2.set_ylabel(data.columns[feature2[f]])
    ax3.set_ylabel('Rolling correlation, 14 days period')


    ax1.legend(years, bbox_to_anchor=(1, 1), loc='upper left', borderaxespad=0., ncol=1, fancybox=True, frameon=True)




-------------------------------------------------
Correlation distribution for year 2014
Mean: 0.532746314132
Median: 0.557949258656
Standard deviation: 0.215720671611
Kurtosis: -0.0421423527986
Skewness: -0.638196099596
Normal distribution fitted!
mu=0.532746314132
sigma=0.215414032264
-------------------------------------------------
Correlation distribution for year 2015
Mean: 0.56389935887
Median: 0.635166988321
Standard deviation: 0.259349079152
Kurtosis: 0.449585874115
Skewness: -1.00510390293
-------------------------------------------------
Correlation distribution for year 2016
Mean: 0.378524397416
Median: 0.417440043375
Standard deviation: 0.326511914571
Kurtosis: 0.0600812000593
Skewness: -0.535276902702
Normal distribution fitted!
mu=0.378524397416
sigma=0.326049105104
-------------------------------------------------
Correlation distribution for year 2017
Mean: 0.407220906323
Median: 0.435383118459
Standard deviation: 0.197383216262
Kurtosis: -0.00269736097952
Skewness: -0



-------------------------------------------------
Correlation distribution for year 2014
Mean: nan
Median: nan
Standard deviation: nan
Kurtosis: nan
Skewness: nan
-------------------------------------------------
Correlation distribution for year 2015
Mean: nan
Median: nan
Standard deviation: nan
Kurtosis: nan
Skewness: nan
-------------------------------------------------
Correlation distribution for year 2016
Mean: 0.0667764147975
Median: 0.0597625619621
Standard deviation: 0.305620922964
Kurtosis: 0.359789556375
Skewness: -0.277127492116
Normal distribution fitted!
mu=0.0667764147975
sigma=0.304892385758
-------------------------------------------------
Correlation distribution for year 2017
Mean: 0.0586522881211
Median: 0.00137535800231
Standard deviation: 0.249063699487
Kurtosis: -0.205005171739
Skewness: 0.743828219699
-------------------------------------------------
Correlation distribution for year 2014
Mean: nan
Median: nan
Standard deviation: nan
Kurtosis: nan
Skewness: nan


In [127]:
x= xcorr.values[np.logical_not(np.isnan(xcorr.values))]

plt.figure()
# density=True normalises the histogram so it is comparable with the fitted pdf
n, bins, patches = plt.hist(x, density=True, facecolor='green', alpha=0.75)

# add a 'best fit' line; mu and sigma come from an earlier stat.norm.fit call
y = stat.norm.pdf(bins, mu, sigma)
l = plt.plot(bins, y, 'r--', linewidth=1)

In [13]:
years = np.unique(data.index.year)


# Definition of the figure.
figid = plt.figure(8,figsize=(20, 10))

# Definition of the matrix of plots.
feature1 = 0
feature2= 1

rows = len(years) 

gs = grds.GridSpec(3,1)


for r in range(rows):
    ax1 = plt.subplot(gs[0, 0])
    ax2 = plt.subplot(gs[1, 0])
    ax3 = plt.subplot(gs[2, 0])
    
    dat_year_feat1 = data[data.columns[feature1]][str(years[r])]
    dat_year_feat2 = data[data.columns[feature2]][str(years[r])]
    
    if np.all(np.isnan(dat_year_feat2)):
        continue

    # Weekly means for the activity plots and a 14-day rolling correlation
    ax1.plot(dat_year_feat1.resample('W').mean())
    ax2.plot(dat_year_feat2.resample('W').mean())
    ax3.plot(dat_year_feat1.rolling(window=14).corr(dat_year_feat2))
    

ax1.set_ylabel(data.columns[feature1])
ax2.set_ylabel(data.columns[feature2])
ax3.set_ylabel('Rolling correlation, 14 days period')



<matplotlib.text.Text at 0x7fb4786b1c10>

In [412]:
print type(data.columns[0])

<type 'str'>


In [None]:
# statsmodels.tsa.stattools was imported as `sta` in the first cell
print sta.periodogram(dat_feat1)
plt.figure(3)
plt.plot(sta.periodogram(dat_feat1))

spe = np.fft.fft(dat_feat1)
spe = np.abs(spe)
plt.figure(4)
plt.plot(spe)

## Q1) CORRELATION
    -Q1.1) Using Python and the libraries and packages of your own choice, write a well-structured and readable program that can be used to analyse the data set for correlations between the time series. The program should work with data files that are similar, but not necessarily identical to the sample provided. Your response should include the program itself and brief instructions about any dependencies that should be installed before running it.

    -Q1.2) Run the program on the provided data sample from PetFood and comment on the output.

    -Q1.3) Comment on additional approaches that could be used to search for various types of correlations in the data set.

## Q2) SEASONS AND CYCLES
    -Q2.1) Using Python and the libraries and packages of your own choice, write a well-structured and readable program that can be used to identify periodic behaviour in the “signups” time series. The program should work with data files that are similar, but not necessarily identical to the sample provided. Your response should include the program itself and brief instructions about any dependencies that should be installed to run it.

    -Q2.2) Run the program on the data sample from PetFood, and comment on the output

    -Q2.3) Discuss any additional methods and data sources that would be useful to improve the detection of cycles in the number of signups.

    -Q2.4) Discuss to what degree this same code solution can be expected to work for a completely different customer, selling a completely different product, in a different market. Would the approach need to be adjusted to accommodate such a general setting?