### Introduction

For Assignment 5, we test whether Berkeley and Livermore have the same weather data. We first import the necessary functions from past assignments: get_alameda_county_points, filter_ranson_criteria, clean_data, and get_weather_data.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from scipy import stats
from permute.core import two_sample

import os
os.chdir('../')
print(os.getcwd())
from Assignment2.assignment2 import get_alameda_county_points
from Assignment3.assignment3 import filter_ranson_criteria, clean_data, get_weather_data

/Users/brandonhuang/Desktop/Education/UC Berkeley/STAT 159


### Capturing Data and Binning


We bin the maximum temperature for the weather stations using Ranson's categories: 0°F-10°F, 10°F-20°F, 20°F-30°F, 30°F-40°F, 40°F-50°F, 50°F-60°F, 60°F-70°F, 70°F-80°F, 80°F-90°F, and 90°F-100°F. We first retrieve the data with the functions get_data() and get_city_data(df, id, element='TMAX'). Then we convert the data points to Fahrenheit using convert_to_fahrenheit(df) and bin them with bin_data(df).

In [5]:
def get_data():
    """ Return DataFrame containing weather data in Alameda county based on Ranson's criteria
    return:
        DataFrame: weather data
    """
    points = get_alameda_county_points()
    return filter_ranson_criteria(clean_data(get_weather_data(points)))

def get_city_data(df, id, element='TMAX'):
    """ Return a DataFrame that is filtered by element and ID
    return:
        DataFrame: weather data
    """
    hcn_city = df[df['ID'] == id]
    return hcn_city[hcn_city['ELEMENT'] == element]

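def convert_to_fahrenheit(df):
    """ Return a copy of the DataFrame with a FAHRENHEIT column computed from DATA VALUE
    return:
        DataFrame: weather data
    """
    # copy first so that a filtered slice is not modified in place
    df = df.copy()
    # DATA VALUE is divided by 10 to get degrees Celsius before converting
    df['FAHRENHEIT'] = ((df['DATA VALUE'] / 10) * 1.8) + 32
    return df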

def bin_data(df):
    """ Return a DataFrame that groups the data by year and month and bins FAHRENHEIT
    return:
        DataFrame: binned weather data
    """
    grouped_df = df.groupby(['year', 'month', pd.cut(df['FAHRENHEIT'], range(0, 110, 10))])
    return grouped_df.size().unstack().sort_index(axis = 1).fillna(0.0)
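
As a rough illustration of the conversion and binning steps, the sketch below runs them on a small made-up table. The readings are hypothetical, and the bin edges are passed as a list rather than a range purely as a precaution; it is not the assignment's actual data.

```python
import pandas as pd

# Hypothetical toy readings: DATA VALUE is divided by 10 to get degrees Celsius
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2000],
    'month': [1, 1, 7, 7],
    'DATA VALUE': [120, 150, 300, 350],
})

# Same arithmetic as convert_to_fahrenheit
toy['FAHRENHEIT'] = (toy['DATA VALUE'] / 10) * 1.8 + 32

# Same bin edges as bin_data: (0, 10], (10, 20], ..., (90, 100]
toy['bin'] = pd.cut(toy['FAHRENHEIT'], list(range(0, 110, 10)))

# Count observations in each (year, month, bin) cell that actually occurs
counts = toy.groupby(['year', 'month', 'bin'], observed=True).size()
print(counts)
```

Note that pd.cut uses right-closed intervals by default, so a reading of exactly 60°F falls in the 50°F-60°F bin.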

### Intersecting Livermore's and Berkeley's DataFrame

The functions above create the initial data frame that bins the maximum temperature. For the two-sample test, we want both data sets to cover the same dates. Since Livermore contains more data than Berkeley, we use get_intersected to find the dates that appear in both cities' data frames and keep only those rows in each. The function returns the two filtered data frames and is defined below.

In [6]:
def get_intersected(df1, df2, intersect_on='date'):
    """ Returns two DataFrames that include only rows which have the same intersect_on values
    return:
        DataFrame: filtered weather data
        DataFrame: filtered weather data
    """
    intersection = np.intersect1d(df1[intersect_on], df2[intersect_on])
    return df1[df1[intersect_on].isin(intersection)], df2[df2[intersect_on].isin(intersection)]
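
To make the intersection step concrete, here is a minimal sketch on two made-up frames. The dates and temperatures are hypothetical; only the `date` column name follows the default used by get_intersected.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames with partially overlapping dates
berkeley_toy = pd.DataFrame({
    'date': ['2000-01-01', '2000-01-02', '2000-01-03'],
    'FAHRENHEIT': [55.0, 57.0, 60.0],
})
livermore_toy = pd.DataFrame({
    'date': ['2000-01-02', '2000-01-03', '2000-01-04'],
    'FAHRENHEIT': [58.0, 63.0, 61.0],
})

# Dates present in both frames (sorted, unique)
common = np.intersect1d(berkeley_toy['date'], livermore_toy['date'])

# Keep only the rows whose date is in the common set
b_common = berkeley_toy[berkeley_toy['date'].isin(common)]
l_common = livermore_toy[livermore_toy['date'].isin(common)]

print(list(common))
print(len(b_common), len(l_common))
```

After filtering, both toy frames contain the same two dates, which is the property the later stratified comparison relies on.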

### Stratification 

**What did you stratify on? Why is that a good choice? Why stratify at all?**

After we have both data frames, we stratify the data by year and month. We believe these are good strata because they are mutually exclusive and collectively exhaustive. Stratifying also keeps temperature trends over time from confounding the comparison between cities: at the yearly scale it accounts for long-term change such as global warming, and at the monthly scale it accounts for seasonality.

In [7]:
def stratify(df, by=['year', 'month']):
    """ Returns a list of DataFrame that are the groups from the by criteria
    return:
        List[DataFrame]: grouped weather data
    """
    df_grouped = df.groupby(by)
    return [df_grouped.get_group(group) for group in df_grouped.groups]
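
The sketch below shows what year-and-month stratification produces on a small made-up table. The values are hypothetical, and a dictionary keyed by (year, month) is used here instead of a list only to make the strata easy to inspect.

```python
import pandas as pd

# Hypothetical toy data covering three (year, month) strata
toy = pd.DataFrame({
    'year': [2000, 2000, 2000, 2001],
    'month': [1, 1, 2, 1],
    'FAHRENHEIT': [55.0, 57.0, 60.0, 54.0],
})

# Each group is one stratum: all readings from the same year and month
grouped = toy.groupby(['year', 'month'])
strata = {key: group['FAHRENHEIT'].tolist() for key, group in grouped}

for key, values in strata.items():
    print(key, values)
```

Because the test pairs stratum i of one city with stratum i of the other, both cities need the same set of (year, month) groups in the same order; restricting both frames to common dates beforehand is what makes that hold.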

### Chi-Squared Distribution 

**Can you use the chi-square distribution to calibrate the test? Why or why not?**

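After stratifying the data, we run a two-sample permutation test within each stratum and combine the resulting p-values with Fisher's combining function: -2 times the sum of the natural logs of the stratum p-values. If the null hypothesis is true and the stratum p-values are independent, this statistic is approximately chi-square distributed with 2k degrees of freedom, where k is the number of strata, so we can use the chi-square distribution to calibrate the test. The calibration is only approximate because permutation p-values are discrete and are themselves estimated by simulation.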

From there, we calculate the p-value. Below is the function for the above process.

In [8]:
def permutation_test(a, b):
    """ Returns p-value for a two-sided two sample permutation test using Fisher combining function
    return:
        float: p-value of the chi-squared statistic
    """
    # floor on each stratum p-value so that log(0) never occurs
    min_p = 10 ** -20
    fisher_statistic = 0.0
    
    # a and b must hold the same strata in the same order
    for stratum in range(len(a)):
        p, _ = two_sample(a[stratum]['FAHRENHEIT'], b[stratum]['FAHRENHEIT'], alternative='two-sided')
        print('Finished stratum %d got p-value %f' % (stratum, p))
        fisher_statistic += -2 * np.log(max(p, min_p))

    # survival function equals 1 - cdf but keeps precision for very small p-values
    return stats.chi2.sf(fisher_statistic, 2 * len(a))
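
For readers who want to check the combining step by hand, here is a minimal standard-library sketch. The helper name fisher_combine is hypothetical and not part of the assignment code; it relies on the closed form of the chi-square tail probability for an even number of degrees of freedom, and with many strata a library routine such as SciPy's chi2.sf is the safer choice numerically.

```python
import math

def fisher_combine(p_values, min_p=1e-20):
    """Combine independent p-values; return (Fisher statistic, combined p-value)."""
    k = len(p_values)
    # Fisher statistic: -2 * sum of log p-values, with a floor to avoid log(0)
    statistic = -2.0 * sum(math.log(max(p, min_p)) for p in p_values)
    # Chi-square tail for 2k degrees of freedom:
    # P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    half = statistic / 2.0
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= half / (i + 1)
    return statistic, math.exp(-half) * total

# With a single stratum, the combined p-value is just the original p-value
print(fisher_combine([0.05]))
print(fisher_combine([0.5, 0.5]))
```

The single-stratum case is a useful sanity check: combining one p-value should return that same p-value.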

### Taking into Account Simulation Uncertainty 

**Discuss how to take into account simulation uncertainty in estimating the overall P-value**

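Each stratum p-value is estimated from a finite number of PRNG-generated permutations, so the overall p-value carries simulation error. To reduce this uncertainty, we can increase the number of permutations until the estimated p-values converge, or use a bias correction as Ranson mentioned. In practice, we rely on the two_sample function provided by Professor Stark to generate the permutations.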
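
One way to see the size of the simulation error is to attach a standard error to an estimated permutation p-value. The sketch below is a standard-library stand-in, not the two_sample implementation; the function name perm_p_value, the default number of replications, and the toy samples are all assumptions made for illustration.

```python
import math
import random

def perm_p_value(x, y, reps=2000, seed=0):
    """Estimate a two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)  # seeded so the estimate is reproducible
    pooled = list(x) + list(y)
    n = len(x)
    observed = abs(sum(x) / n - sum(y) / len(y))
    hits = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(y))
        # small tolerance guards against floating-point ties being missed
        if diff >= observed - 1e-12:
            hits += 1
    # the +1 terms count the observed arrangement, so the estimate is never zero
    p_hat = (hits + 1) / (reps + 1)
    # binomial standard error of the estimate; shrinks roughly like 1/sqrt(reps)
    std_err = math.sqrt(p_hat * (1 - p_hat) / reps)
    return p_hat, std_err

# Hypothetical toy samples
x = [55.0, 57.0, 60.0, 58.0]
y = [58.0, 63.0, 61.0, 66.0]
print(perm_p_value(x, y, reps=500, seed=1))
print(perm_p_value(x, y, reps=5000, seed=1))
```

Comparing runs with different numbers of replications or different seeds gives a rough sense of whether an estimate has stabilized.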

### Conclusion

**Discuss what your findings mean for Ranson's approach**

Because our p-value is lower than alpha, we reject the null hypothesis and conclude that the weather in Livermore and in Berkeley is not the same. In his paper, Ranson assumed that the weather within a county is the same. According to our findings, this is not the case.

Since Ranson assumed that the weather within a county is the same, his results could be misleading. In particular, the positive correlation between crime rate and high temperature could be exaggerated, since cities within a county can have differing temperatures. The correlation could equally be understated.

### Appendix

In [None]:
import os
os.chdir('../')
print(os.getcwd())

if __name__ == '__main__':
    # get data from ranson's criteria
    df = get_data()
    
    # get data for berkeley and livermore
    berkeley = get_city_data(df, 'USC00040693')
    livermore = get_city_data(df, 'USC00044997')

    # add fahrenheit column
    berkeley = convert_to_fahrenheit(berkeley)
    livermore = convert_to_fahrenheit(livermore)

    # bin data
    berkeley_bins = bin_data(berkeley)
    livermore_bins = bin_data(livermore)

    # H0: Berkeley and Livermore have the same weather.
    # H1: Berkeley and Livermore have different weather.
    
    # select only dates that exist in both cities
    berkeley_intersected, livermore_intersected = get_intersected(berkeley, livermore)

    # stratify by year and month
    berkeley_stratified = stratify(berkeley_intersected)
    livermore_stratified = stratify(livermore_intersected)

    # get p-value from permutation test using fisher combining function
    p = permutation_test(berkeley_stratified, livermore_stratified)
    print('P-value under H0: Berkeley and Livermore have the same weather is %f' % p)

/Users/brandonhuang/Desktop/Education/UC Berkeley
