### Statistical significance of the relationships between cost and size and cost and time

The previous notebook explored the relationships between installation date and cost/watt, and between installation size and cost/watt.
Here we examine the statistical significance of those relationships.

In [1]:
# set up
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.rc('figure', figsize=(10, 8))
np.set_printoptions(precision=8, suppress=False)
# please show all columns
pd.set_option("display.max_columns", 60)
import seaborn as sns
sns.set()

In [2]:
# read cleaned data
dftts = pd.read_csv('../local/data/LBNL_openpv_tts_data/ttsclean20180127.csv',
                    encoding='iso-8859-1', # avoids windows encoding issue
                    index_col='row_id',
                    parse_dates=['install_date'],
                    dtype={'zipcode': object})  # np.object was removed in NumPy 1.24

##### Convert the date to number of months since data begins

In [3]:
# capture the installation month for each row
month = dftts.install_date.apply(lambda x: x.to_period('M'))

# save in a new column
dftts = dftts.assign(install_month=month)

# the first installation date
month0 = dftts.install_month.values[0]

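# convert to number of months since the first installation month
# (np.float was removed in NumPy 1.24; plain year/month arithmetic avoids Period subtraction)
nMonths = ((dftts.install_date.dt.year - month0.year) * 12
           + (dftts.install_date.dt.month - month0.month)).astype(float)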

dftts = dftts.assign(nMonths=nMonths)

##### Or number of days since data begins

In [4]:
# capture the installation day for each row
day = dftts.install_date.apply(lambda x: x.to_period('D'))

# save in a new column
dftts = dftts.assign(install_day=day)

# the first installation date
day0 = dftts.install_day.values[0]

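# convert to number of days since the first installation day
# (np.float was removed in NumPy 1.24; use the builtin float)
nDays = (dftts.install_date - day0.to_timestamp()).dt.days.astype(float)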

dftts = dftts.assign(nDays=nDays)

In [5]:
dftts.head(3)

row_id,file_row,data_provider,sysid_dp,sysid_tts,install_date,size_kw,price,appraised_value,cust_type,new_const,tracking,ground_mounted,battery,zipcode,city,county,state,third-party,uinverter,dc_optimizer,cost_per_watt,num_days,install_month,nMonths,install_day,nDays
1,10108220,California Public Utilities Commission (Curren...,PGE-INT-11328 & CA_ERP_24698,CA-NEM-12257,1998-01-09,2.2824,24500.0,False,RES,False,False,False,False,94107.0,San Francisco,San Francisco,CA,False,False,False,10.734315,0,1998-01,0.0,1998-01-09,0.0
2,10107162,California Public Utilities Commission (Curren...,PGE-INT-11220 & CA_ERP_24687,CA-NEM-11180,1998-01-30,1.8504,20555.54,False,RES,False,False,False,False,95949.0,Nevada City,Nevada,CA,False,False,False,11.108701,21,1998-01,0.0,1998-01-30,21.0
4,10107618,California Public Utilities Commission (Curren...,PGE-INT-11268 & CA_ERP_24540,CA-NEM-11641,1998-02-04,2.3076,20000.0,False,RES,False,False,False,False,94518.0,Concord,Contra Costa,CA,False,False,False,8.667013,26,1998-02,1.0,1998-02-04,26.0
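As a quick sanity check on the two elapsed-time columns, the sketch below recomputes them for the three installation dates shown above using vectorized `.dt` accessors. The `toy` frame is a hypothetical stand-in for `dftts`, and taking the earliest date with `.min()` (rather than the first row) is an assumption that the data may not be sorted.

```python
import pandas as pd

# Hypothetical three-row frame holding the install dates shown in the output above
toy = pd.DataFrame({
    'install_date': pd.to_datetime(['1998-01-09', '1998-01-30', '1998-02-04'])
})

# Assumption: the epoch is the earliest date in the data
first = toy.install_date.min()

# Whole days elapsed since the first installation
toy['nDays'] = (toy.install_date - first).dt.days.astype(float)

# Calendar months elapsed since the month of the first installation
toy['nMonths'] = ((toy.install_date.dt.year - first.year) * 12
                  + (toy.install_date.dt.month - first.month)).astype(float)

print(toy)
```

The resulting `nDays` and `nMonths` values should agree with the corresponding columns in the rows displayed above.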


#### Get the correlation matrix for the relevant variables.  Pandas provides a method for the correlation matrix.  

The correlation coefficient for cost and time is -0.61, a moderate negative correlation.  This implies that as time increases, cost decreases moderately.

The correlation coefficient for cost and size is -0.04, a very weak negative correlation.  This implies that as size increases, cost decreases only slightly.

In [6]:
theCorrelationMatrix = dftts[['cost_per_watt', 'size_kw', 'nDays']].corr(); theCorrelationMatrix

,cost_per_watt,size_kw,nDays
cost_per_watt,1.0,-0.044252,-0.611365
size_kw,-0.044252,1.0,-0.006031
nDays,-0.611365,-0.006031,1.0
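For reference, the default method of `DataFrame.corr` is the Pearson coefficient: the covariance of two columns divided by the product of their standard deviations. The following is a minimal sketch on synthetic data (the `toy` frame, its columns `x` and `y`, and the helper `pearson_r` are all hypothetical, not part of the notebook) showing that a hand-rolled version agrees with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic data with a built-in negative linear relationship (illustrative only)
rng = np.random.default_rng(42)
toy = pd.DataFrame({'x': rng.normal(size=500)})
toy['y'] = -0.5 * toy.x + rng.normal(size=500)

def pearson_r(a, b):
    """Pearson correlation: centered cross-product over the product of centered norms."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

r_manual = pearson_r(toy.x, toy.y)
r_pandas = toy.corr().loc['x', 'y']
print('manual: {:.6f}, pandas: {:.6f}'.format(r_manual, r_pandas))
```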


#### Scipy.stats provides a procedure for the correlation coefficient (Pearson r-value) and the associated p value.

__Null Hypothesis__: $H_0: \rho = 0$:  The population correlation coefficient IS NOT significantly different from zero. There IS NOT a significant linear relationship (correlation) between time and cost in the population.

__Alternate Hypothesis__: $H_a: \rho \neq 0$: The population correlation coefficient is significantly different from zero. There is a significant linear relationship (correlation) between time and cost in the population.

The *p*-value, which prints as 0.00000000 (zero to eight decimal places), is less than the significance level (*α* = 0.01).  This indicates that a correlation this strong would be extremely unlikely to arise by chance if the population correlation were zero.

__Decision__: Reject the null hypothesis.

__Conclusion__: There is sufficient evidence to conclude that there is a significant linear relationship between cost and time, because the correlation coefficient is significantly different from zero and the relationship is unlikely to be due to randomness.

The same argument applies to cost and size, since the p-value of the correlation coefficient between cost and size also prints as 0.00000000.  The relationship is, however, much weaker (i.e. size accounts for much less of the variance in cost).

In [7]:

from scipy.stats import pearsonr  # scipy.stats.stats is a deprecated private namespace
cost_time_corr, cost_time_p_value = pearsonr(dftts.nDays, dftts.cost_per_watt)
print('cost ~ time ==> correlation coefficient: {:.6f}, p value: {:.8f}'.format(cost_time_corr, cost_time_p_value))

cost ~ time ==> correlation coefficient: -0.611365, p value: 0.00000000
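To see why the p-value prints as zero, it may help to recall that the two-sided p-value for a Pearson coefficient can be obtained from the statistic $t = r\sqrt{(n-2)/(1-r^2)}$, which follows a Student t distribution with $n-2$ degrees of freedom under the null hypothesis. The sketch below is illustrative only: `pearson_p_value` is a hypothetical helper, and the sample size of 775694 is taken from the `No. Observations` field of the regression summary later in this notebook, on the assumption that the same rows enter the correlation.

```python
import numpy as np
from scipy import stats

def pearson_p_value(r, n):
    """Two-sided p-value for a Pearson r from n paired observations (t-test form)."""
    t_stat = r * np.sqrt((n - 2) / (1.0 - r ** 2))
    return 2.0 * stats.t.sf(abs(t_stat), df=n - 2)

# Check the helper against scipy on small synthetic data
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = -0.3 * x + rng.normal(size=200)
r, p_scipy = stats.pearsonr(x, y)
p_manual = pearson_p_value(r, len(x))
print('scipy: {:.3e}, manual: {:.3e}'.format(p_scipy, p_manual))

# Apply it to the coefficient reported above (n is an assumption, see lead-in)
p_notebook = pearson_p_value(-0.611365, 775694)
print('cost ~ time p value: {:.8f}'.format(p_notebook))
```

With a sample this large, the t statistic is in the hundreds, so the p-value is far smaller than eight decimal places can show.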


In [8]:
from scipy.stats import pearsonr
# use size-specific names so the stale cost ~ time coefficient is not printed by mistake
cost_size_corr, cost_size_p_value = pearsonr(dftts.size_kw, dftts.cost_per_watt)
print('cost ~ size ==> correlation coefficient: {:.6f}, p value: {:.8f}'.format(cost_size_corr, cost_size_p_value))

cost ~ size ==> correlation coefficient: -0.044252, p value: 0.00000000
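The statement that size accounts for much less of the variance than time can be made concrete: for a single predictor, the squared Pearson coefficient equals the share of variance explained by a simple linear fit. A small worked calculation, using the coefficients from the correlation matrix above (the variable names are illustrative):

```python
# Coefficients copied from the correlation matrix shown earlier
r_cost_time = -0.611365
r_cost_size = -0.044252

# Squared correlation = share of variance in cost_per_watt explained by one predictor
r2_time = r_cost_time ** 2
r2_size = r_cost_size ** 2

print('time explains {:.1%} of the variance in cost'.format(r2_time))
print('size explains {:.1%} of the variance in cost'.format(r2_size))
```

So although both correlations are statistically significant, time alone explains roughly 37% of the variance, while size alone explains well under 1%.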


### Here we use statsmodels `ols` to fit a linear regression model $cost \sim time, size$

The report from ```ols``` indicates:

* $R^2$ = 0.376: This indicates that approximately 38% of the variance in `cost_per_watt` is accounted for by the variables `nDays` and `size_kw`.
* `Prob (F-statistic) = 0.00`: This indicates that the probability of obtaining an F-statistic this large by chance, if neither variable were related to cost, is extremely small ($< 1\%$).  Thus the null hypothesis ('There is no relationship between cost and time and size') can be rejected.

Thus we conclude that the correlations and the regression are statistically significant.
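The link between $R^2$ and the F-statistic can be checked by hand. For a model with $k$ predictors and $n$ observations, $F = \dfrac{R^2 / k}{(1 - R^2) / (n - k - 1)}$, and `Prob (F-statistic)` is the upper tail of the F distribution with $(k,\ n-k-1)$ degrees of freedom. The sketch below plugs in the rounded values from the regression summary ($R^2 = 0.376$, $n = 775694$, $k = 2$); because $R^2$ is rounded to three decimals, the result is only expected to approximate the reported F-statistic of 2.338e+05.

```python
from scipy import stats

# Values read from the OLS summary (R-squared is rounded, so F is approximate)
r_squared = 0.376
n_obs = 775694
k = 2                        # predictors: nDays and size_kw
df_resid = n_obs - k - 1     # residual degrees of freedom

# F-statistic for the overall significance of the regression
f_stat = (r_squared / k) / ((1.0 - r_squared) / df_resid)

# Upper-tail probability of the F distribution under the null hypothesis
p_value = stats.f.sf(f_stat, k, df_resid)

print('F-statistic: {:.4g}'.format(f_stat))
print('Prob (F-statistic): {:.2f}'.format(p_value))
```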

In [9]:
from statsmodels.formula.api import ols
m1 = ols('cost_per_watt ~ nDays + size_kw', dftts).fit()
print(m1.summary())

                            OLS Regression Results                            
Dep. Variable:          cost_per_watt   R-squared:                       0.376
Model:                            OLS   Adj. R-squared:                  0.376
Method:                 Least Squares   F-statistic:                 2.338e+05
Date:                Mon, 26 Mar 2018   Prob (F-statistic):               0.00
Time:                        11:12:13   Log-Likelihood:            -1.4034e+06
No. Observations:              775694   AIC:                         2.807e+06
Df Residuals:                  775691   BIC:                         2.807e+06
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     11.5260      0.009   1215.715      0.0