# Ice cream data 

This notebook builds a toy dataset for use with the cross-correlation and r-squared statistical vignette notebook. The initial temperature and precipitation data for Raleigh, NC were obtained via the National Weather Service (https://www.ncdc.noaa.gov/cdo-web/datasets#GHCND) and show a historical average from 1981 to 2010.

The goal is to use these average values to generate a (pretend) year's worth of ice cream sales data. We might also meander through some data exploration along the way, just for fun!

In [107]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import hvplot.pandas
import holoviews as hv
from holoviews import opts

In [84]:
data_url = "https://michw.com/DATA/NC_temp_data.csv"

# Read the temperature and precipitation data into a data frame
df = pd.read_csv(data_url)

# Make column names easier to work with
#df.columns = ['Month','dayOfMonth','JulDay','NormalMaxTemp',\
#              'NormalMinTemp','MeanTemp','NormalPrecip']

In [85]:
# Here's a peek at the first few days of data
df.head(3) # The first three rows

Unnamed: 0,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,PRCP,PRCP_ATTRIBUTES,TAVG,TAVG_ATTRIBUTES,TMAX,TMAX_ATTRIBUTES,TMIN,TMIN_ATTRIBUTES
0,USW00013722,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-01-01,0.0,",,W,2400",22,"H,,S",28,",,W",13,",,W"
1,USW00013722,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-01-02,0.0,",,W,2400",20,"H,,S",32,",,W",12,",,W"
2,USW00013722,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-01-03,0.06,",,W,2400",21,"H,,S",32,",,W",9,",,W"


In [86]:
# Let's remove some of the less-useful columns so our dataset is easier to look at
df = df.drop(columns=['STATION','PRCP_ATTRIBUTES','TAVG_ATTRIBUTES','TMAX_ATTRIBUTES','TMIN_ATTRIBUTES'])

# Let's also convert the DATE column to the datetime format that pandas understands and uses
df['DATE'] = pd.to_datetime(df['DATE'])

df.tail(3) # The last three rows

Unnamed: 0,NAME,LATITUDE,LONGITUDE,ELEVATION,DATE,PRCP,TAVG,TMAX,TMIN
362,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-12-29,0.0,58,65,45
363,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-12-30,0.01,50,59,45
364,"RALEIGH AIRPORT, NC US",35.8923,-78.7819,126.8,2018-12-31,0.13,53,67,47
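
One payoff of converting `DATE` to a datetime is that calendar fields become easy to pull out. The following is a minimal sketch, not part of the original notebook: it uses a tiny stand-in frame called `sample` (a hypothetical name) built from three of the rows printed above, and the new column names `month` and `day_of_year` are illustrative choices.

```python
import pandas as pd

# A tiny stand-in for the notebook's df, using rows shown in the outputs above
sample = pd.DataFrame({
    "DATE": ["2018-01-01", "2018-01-02", "2018-12-31"],
    "TAVG": [22, 20, 53],
})

# Same conversion as in the cell above
sample["DATE"] = pd.to_datetime(sample["DATE"])

# Once DATE is a datetime column, the .dt accessor exposes calendar fields
sample["month"] = sample["DATE"].dt.month
sample["day_of_year"] = sample["DATE"].dt.dayofyear

print(sample)
```

Fields like these can be handy later for grouping by month or for plotting against day of year.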


In [87]:
# We can also look at the data type of each column.
# Note that our DATE column is in "datetime64" format.
df.dtypes

NAME                 object
LATITUDE            float64
LONGITUDE           float64
ELEVATION           float64
DATE         datetime64[ns]
PRCP                float64
TAVG                  int64
TMAX                  int64
TMIN                  int64
dtype: object
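
Beyond data types, a couple of quick sanity checks can be worth running before building anything on top of the data. This is only a sketch under stated assumptions: `sample` is a hypothetical stand-in built from the last three rows shown earlier, and the check that `TAVG` falls between `TMIN` and `TMAX` is an illustrative test chosen here, not something the original notebook performs.

```python
import pandas as pd

# Hypothetical stand-in for df, using the last three rows printed earlier
sample = pd.DataFrame({
    "DATE": pd.to_datetime(["2018-12-29", "2018-12-30", "2018-12-31"]),
    "PRCP": [0.0, 0.01, 0.13],
    "TAVG": [58, 50, 53],
    "TMAX": [65, 59, 67],
    "TMIN": [45, 45, 47],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Flag rows where the average falls outside the min-max range
out_of_order = (sample["TAVG"] < sample["TMIN"]) | (sample["TAVG"] > sample["TMAX"])

print(missing_per_column)
print("Rows with TAVG outside [TMIN, TMAX]:", int(out_of_order.sum()))
```

Any rows flagged on the full dataset would deserve a closer look rather than automatic removal.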

In [117]:
# Shade the daily TMIN-TMAX range and overlay the daily average as a line
dfplot = (
    df.hvplot.area('DATE', 'TMIN', 'TMAX', alpha=0.2, grid=True)
    * df.hvplot('DATE', 'TAVG', kind='line')
)
# Adjust the appearance of the plot using holoviews
dfplot.opts(xlabel='Date', ylabel='Air Temp (F)')
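
With the temperatures in hand, the stated goal of generating pretend ice cream sales could be approached in many ways. The sketch below shows one simple option, a linear relationship with temperature plus random noise. Everything in it is an assumption made for illustration: the `sample` frame (six rows taken from the outputs above), the constants `BASE_SALES`, `SALES_PER_DEGREE`, and `NOISE_SD`, and the column name `ice_cream_sales` are all hypothetical and do not come from the original notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, using rows printed in the outputs above
sample = pd.DataFrame({
    "DATE": pd.to_datetime([
        "2018-01-01", "2018-01-02", "2018-01-03",
        "2018-12-29", "2018-12-30", "2018-12-31",
    ]),
    "TAVG": [22, 20, 21, 58, 50, 53],
})

# Seeded generator so the pretend data is reproducible
rng = np.random.default_rng(seed=42)

# Made-up parameters (assumptions, not fitted to any real sales data)
BASE_SALES = 20.0        # sales on a hypothetical 0 F day
SALES_PER_DEGREE = 3.0   # extra sales per degree F of average temperature
NOISE_SD = 10.0          # standard deviation of the day-to-day noise

# Linear trend in temperature plus Gaussian noise
noise = rng.normal(loc=0.0, scale=NOISE_SD, size=len(sample))
sales = BASE_SALES + SALES_PER_DEGREE * sample["TAVG"] + noise

# Sales cannot be negative, and whole units are easier to read
sample["ice_cream_sales"] = sales.clip(lower=0).round().astype(int)

print(sample)
```

Raising `NOISE_SD` weakens the relationship between temperature and sales, which is a convenient knob for a dataset meant to demonstrate correlation and r-squared.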