# Ice cream dataset - behind the scenes

The purpose of this notebook is to build a toy dataset for use with the cross-correlation and r-squared statistical vignette notebook. The initial temperature and precipitation data for Raleigh, NC were obtained via the National Weather Service and show historical averages from 1981 to 2010 (https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND).

The goal is to use these average values to generate a (pretend) year's worth of ice cream sales data. But we might also meander through some data exploration along the way, just for fun!

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hvplot.pandas
import holoviews as hv
from holoviews import opts

In [None]:
data_url = "https://michw.com/DATA/NC_temp_data.csv"

# Read the temperature and precipitation data into a data frame
df = pd.read_csv(data_url)

# Here's a peek at the first few days of data
df.head(3) # The first three rows

In [None]:
# Let's remove some of the less-useful columns so our dataset is easier to look at
df = df.drop(columns=['STATION','PRCP_ATTRIBUTES','TAVG_ATTRIBUTES','TMAX_ATTRIBUTES','TMIN_ATTRIBUTES'])

# Let's also convert the DATE column to the datetime format that pandas understands and uses
df['DATE']=pd.to_datetime(df['DATE'])

df.tail(3) # The last three rows

In [None]:
# We can also look at data type for each of the columns. 
# Note that our DATE column is in "datetime64" format.
df.dtypes

## Exploring through plots

The best way to do a quick exploration of data is to plot it, so let's do that with holoviews and hvplot tools (those plotting tools allow for interactivity).

In [None]:
dfplot = df.hvplot.area('DATE','TMIN','TMAX',alpha=0.2,grid=True) *\
df.hvplot('DATE','TAVG',kind='line')
# Adjust the appearance of the plot using holoviews
dfplot.opts(xlabel='Date',ylabel='Air Temp (F)')

## Making some (fake) data!

The next step is to invent some data, but make it at least somewhat believable. We're going to base the dataset on a tweaked version of a simple linear function. Let's describe the underlying linear trend based on a line that intersects two specific points. 

The first point will be at 40 degrees, where 100 ice cream cones per day will be sold. The second point will be at 70 degrees, where 400 cones will be sold. Below 40 degrees we'll assume an average of 100 cones, and above 70 we'll assume an average of 400 cones. Between the two, sales follow the straight line connecting those points.
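
As a quick check on the arithmetic, the line through those two points works out as follows (the symbols $m$, $b$, and $T$ are just labels introduced here for slope, intercept, and temperature):

$$m = \frac{400 - 100}{70 - 40} = 10 \ \text{cones per degree}, \qquad b = 100 - 10 \cdot 40 = -300$$

$$\text{cones}(T) = 10\,T - 300 \quad \text{for } 40 \le T < 70$$

Plugging in the endpoints gives $10 \cdot 40 - 300 = 100$ and $10 \cdot 70 - 300 = 400$, so the line meets the two flat sections without a jump.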

The idea behind this model is that, sure, ice cream sales increase as a function of temperature, but at some point it levels off. For example, Dax is equally likely to want ice cream at 75 degrees and 100 degrees. And he's equally unlikely to want ice cream at 20 degrees or 35 degrees.

Let's tackle this by making a function that is in three sections:

- $<$ 40 degrees
- between 40 and 70 degrees
- $>$ 70 degrees
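
For comparison, here is a minimal vectorized sketch of the same three-section idea. It leans on `np.interp`, which holds the end values constant outside the given range. The name `get_cones_vectorized` is made up for this illustration and is not used elsewhere in the notebook.

```python
import numpy as np

def get_cones_vectorized(temps):
    """Piecewise-linear cone sales: flat at 100 below 40 F, flat at 400 above 70 F."""
    # np.interp interpolates between (40, 100) and (70, 400) and
    # returns the end values for temperatures outside that range.
    return np.interp(temps, [40.0, 70.0], [100.0, 400.0])

# One temperature from each section, plus the two break points
sample = get_cones_vectorized([20, 40, 55, 70, 100])
print(sample)
```

The loop-based version is easier to read line by line; this one is mostly handy when the temperatures are already in an array or a pandas column.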

In [None]:
# A simple function that generates daily cone sales for a given temperature
def getCones(temps):
    cones = []
    x1,y1 = 40.,100.
    x2,y2 = 70.,400.
    slope = (y2-y1)/(x2-x1)
    intercept = y1-(slope*x1)
    #print('slope=' + str(slope) + ', intercept = ' + str(intercept))
    for temp in temps:
        if temp < 40:
            cones.append(100)
        elif temp < 70:  # temp >= 40 is already guaranteed by the branch above
            cones.append((slope*temp)+intercept)
        else:
            cones.append(400)
    return cones

In [None]:
# Create a new column in our pandas dataframe for our model of the number of cones
# sold as a function of average temperature.
df['conesMod'] = getCones(df.TAVG)

# Plot the Cone sales data we just created.
df.hvplot.scatter(x='TAVG',y='conesMod',grid=True) 

The plot of cone sales as a function of temperature looks pretty silly. It doesn't look like real data at all. So let's add a bit of randomness.
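
Here is a small standalone sketch of what "adding randomness" can look like, assuming zero-mean Gaussian scatter around the modeled values. The fixed seed, the flat stand-in array `model`, and the optional floor at zero are all assumptions made for this illustration, not part of the dataset recipe.

```python
import numpy as np

# Fixed seed (an assumption) so the same "random" numbers come out every run
rng = np.random.default_rng(seed=42)

model = np.full(365, 250.0)   # stand-in for a column of modeled cone sales
sdev = 150

# Zero-mean noise: scatter around the model without shifting it up or down
noise = rng.normal(loc=0.0, scale=sdev, size=model.size)
noisy = model + noise

# Optional (an assumption): negative sales make no sense, so floor at zero
noisy_floored = np.clip(noisy, 0, None)
```

Keeping the noise centered on zero means the underlying temperature trend is still there on average, just buried in scatter.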

In [None]:
sdev = 150
# Draw zero-mean noise; centering it on each modeled value would count the model twice
error = np.random.normal(0, sdev, size=len(df))
df['conesErr'] = df['conesMod'] + error

errConesPlot = df.hvplot.scatter('TAVG','conesErr',grid=True)
errConesPlot.opts(xlabel='Average Air Temp (F)',ylabel='Cones sold per day')

Ah, this looks a bit more believable. Tons of variation. The number of cones sold does depend on temperature, but of course it depends on loads of other stuff too. Speaking of which, let's add one more layer of quirkiness, just for fun. There's way more ice cream sold on certain days of the year. Why don't we add some outliers in manually?

In [None]:
data = [('4-July-2018',350),
        ('3-September-2018',200),
        ('28-May-2018',250)]
dfHolidays = pd.DataFrame(data,columns=['Date','offset'])
dfHolidays['Date'] = pd.to_datetime(dfHolidays['Date'])
dfHolidays

In [None]:
# iterrows() is a method and yields (index, row) pairs
for _, row in dfHolidays.iterrows():
    print(row['Date'], row['offset'])
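
One way the holiday offsets might then be folded into the sales numbers is sketched below. This is a guess at the intended next step, not the notebook's own code: it uses a toy frame in place of `df`, assumes the dates fall in 2018 to match the holiday table, and the names `toy`, `holidays`, and `conesHoliday` are made up for the illustration.

```python
import pandas as pd

# Toy stand-in for df: one row per day of 2018 (the year is an assumption)
toy = pd.DataFrame({'DATE': pd.date_range('2018-01-01', '2018-12-31', freq='D')})
toy['conesErr'] = 250.0  # placeholder sales values

holidays = pd.DataFrame(
    [('4-July-2018', 350), ('3-September-2018', 200), ('28-May-2018', 250)],
    columns=['Date', 'offset'])
holidays['Date'] = pd.to_datetime(holidays['Date'])

# Start from the noisy sales, then bump each holiday by its offset
toy['conesHoliday'] = toy['conesErr']
for _, row in holidays.iterrows():
    is_holiday = toy['DATE'] == row['Date']
    toy.loc[is_holiday, 'conesHoliday'] += row['offset']

# Show only the days that were changed
print(toy[toy['conesHoliday'] != toy['conesErr']])
```

Writing to a new column keeps the un-bumped values around, which makes it easy to compare the data with and without the outliers.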
    