# Using statistical tests
This Notebook will show some examples of using statistical tests to get qualitative results from questions asked of the accidents dataset.

In [None]:
# Import the required libraries

import pymongo
import datetime
import collections

import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt

In [None]:
# Open a connection to the Mongo server, open the accidents database and name the collections of accidents and labels
# client = pymongo.MongoClient('mongodb://localhost:27017/')
client = pymongo.MongoClient('mongodb://localhost:27351/')

db = client.accidents
accidents = db.accidents
labels = db.labels

In [None]:
# Load the expanded names of keys and human-readable codes into memory

expanded_name = collections.defaultdict(str)
for e in labels.find({'expanded': {"$exists": True}}):
    expanded_name[e['label']] = e['expanded']
    
label_of = collections.defaultdict(str)
for l in labels.find({'codes': {"$exists": True}}):
    for c in l['codes']:
        try:
            label_of[l['label'], int(c)] = l['codes'][c]
        except ValueError: 
            label_of[l['label'], c] = l['codes'][c]

## Pearson's *r*
### Comparing the number of casualties and vehicles
This is the same investigation as in Notebook `14.2 Introduction to accidents`.

In [None]:
# Build a DataFrame, one row for each accident
cas_veh_unrolled_df = pd.DataFrame(list(accidents.find({}, ['Number_of_Casualties', 'Number_of_Vehicles'])))

# Count the accidents for each combination of casualties and vehicles
cas_veh_df = pd.crosstab(cas_veh_unrolled_df['Number_of_Casualties'],
                         cas_veh_unrolled_df['Number_of_Vehicles'])
# Reshape
cas_veh_long_df = cas_veh_df.stack().reset_index()
cas_veh_long_df

In [None]:
regressionline = scipy.stats.linregress(cas_veh_unrolled_df['Number_of_Casualties'],
                                       cas_veh_unrolled_df['Number_of_Vehicles'])

# The regression line is of the form y = m x + b
m = regressionline[0]
b = regressionline[1]
(m, b)
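
The result of `scipy.stats.linregress()` can also be read by name rather than by position. The following is a minimal sketch on made-up data: `toy_x` and `toy_y` are invented for illustration and do not come from the accidents dataset.

```python
import numpy as np
import scipy.stats

# Invented data lying exactly on the line y = 2x + 1
toy_x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
toy_y = 2.0 * toy_x + 1.0

toy_line = scipy.stats.linregress(toy_x, toy_y)

# The named fields hold the same values as toy_line[0] and toy_line[1]
print(toy_line.slope, toy_line.intercept)

# rvalue is the correlation coefficient of the fitted line
print(toy_line.rvalue)
```

Reading the fields by name avoids having to remember the order of the returned values.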

In [None]:
plt.scatter(cas_veh_long_df['Number_of_Casualties'], 
            cas_veh_long_df['Number_of_Vehicles'],
            s=np.sqrt(cas_veh_long_df[0])*1.5,
            alpha=0.5
            )

x = np.linspace(0, 30, 20)
plt.plot(x, m*x + b)

plt.xlabel('Number of casualties')
plt.ylabel('Number of vehicles')
plt.show()

The `pearsonr` function calculates Pearson's correlation coefficient, *r*. The function takes two lists of numbers of equal length. It pairs the values at the same index in both lists and measures how closely the values in one list vary linearly with those in the other.

Note that we have to give each accident on its own row: if there are 145,000 accidents, the `pearsonr` function must be passed lists with 145,000 items.

Recall that values near +1 show good positive correlation, values near -1 show good negative correlation, and values near 0 show no particular correlation. The `scipy` function returns a second value, the *p* value of the result. 
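
To make the coefficient less of a black box, this sketch computes it by hand for a small pair of lists and compares the result with `pearsonr`. The values in `toy_a` and `toy_b` are invented for illustration only.

```python
import numpy as np
import scipy.stats

# Invented paired observations, one pair per index position
toy_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
toy_b = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

# Deviations of each value from the mean of its own list
dev_a = toy_a - toy_a.mean()
dev_b = toy_b - toy_b.mean()

# Sum of products of paired deviations, normalised by the size of the deviations
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a ** 2).sum() * (dev_b ** 2).sum())

r_scipy, p_scipy = scipy.stats.pearsonr(toy_a, toy_b)
print(r_manual, r_scipy, p_scipy)
```

Both calculations should give the same coefficient, because `pearsonr` only adds the *p* value to the hand calculation.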

In [None]:
scipy.stats.pearsonr(cas_veh_unrolled_df['Number_of_Casualties'], 
                     cas_veh_unrolled_df['Number_of_Vehicles'])

This result shows a small, positive correlation with a very small *p* value. In other words, the correlation is weak, but the result is statistically significant. This means we can reject the null hypothesis that the number of casualties in an accident is unrelated to the number of vehicles.

Looking at the data, it seems that most accidents result in very few casualties, and that the accidents with the most casualties involve few vehicles.

Can you think of a reason for this?

You'll look at this in more detail in Notebook `14.4 Regression on subgroups`.

### Activity 1
Ages of people in the accidents dataset are stored as bands, not continuous values. This means that correlations between them must use Spearman's *r*.

Calculate the Spearman rank-order correlation coefficient between the age of a vehicle's driver and the age of the passengers. 

Similar to the Pearson function above, the `scipy.stats.spearmanr()` function takes two parameters, each a list of values for the two variables being compared. 
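
As a rough illustration of the call itself (not of the activity's solution), the sketch below applies `spearmanr` to two short lists of invented band codes. The names `toy_driver_band` and `toy_passenger_band` are hypothetical and the values are made up.

```python
import scipy.stats

# Invented ordinal band codes (higher code = older band); not real accident data
toy_driver_band = [3, 5, 6, 6, 8, 9]
toy_passenger_band = [2, 5, 4, 7, 7, 9]

rho, p = scipy.stats.spearmanr(toy_driver_band, toy_passenger_band)
print(rho, p)

# Spearman's coefficient is Pearson's coefficient applied to the ranks,
# with tied values sharing their average rank
ranks_driver = scipy.stats.rankdata(toy_driver_band)
ranks_passenger = scipy.stats.rankdata(toy_passenger_band)
r_on_ranks, _ = scipy.stats.pearsonr(ranks_driver, ranks_passenger)
print(r_on_ranks)
```

Because only the ranks matter, the coefficient depends on the order of the bands and not on the width of each band.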

You'll need to create an `unrolled` DataFrame with one row for each injured passenger. Each row should have two values: one for the age band of the driver, and one for the age band of a passenger. If the vehicle has multiple passengers, the DataFrame should have one row for each passenger. (Each element of `Casualties` has a `Vehicle_Reference` that relates it to the vehicle the casualty was in.)

We're interested in the relationship between drivers and passengers, so don't include the driver as a casualty. (Use the `Casualty_Class` to find out.)

Don't include any driver-passenger pairs where the age of one of them is unknown (code -1).

**Hint**

Each accident document contains a list of vehicles and a list of casualties. For each accident, you'll need to iterate through both of these to find the information for each individual casualty.

The solution is in the [`14.3solutions`](14.3solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## Chi-squared example 1: hypothetical voting intention
This is the same example as used in the teaching material, showing how the chi-squared statistic is calculated.

Note this way of creating a DataFrame. It's a `dict`, where each entry is a column in the DataFrame. The key is the column name, the value is the items in the column. Each set of column values is itself a `dict`, with one key for each index entry and the value being the contents of that cell in the DataFrame.

In [None]:
actual_survey_results = pd.DataFrame({'Conservative': {'Men': 170, 'Women': 220},
                      'Labour': {'Men': 240, 'Women': 190},
                      'Other': {'Men': 80, 'Women': 100}})
actual_survey_results

We could find the expected counts manually, or we could use the `scipy.stats.contingency.expected_freq()` function to do it for us. Note that this returns an array, rather than a DataFrame, but it's the same shape as the original.

In [None]:
scipy.stats.contingency.expected_freq(actual_survey_results)

In [None]:
def expected_of_df(actual_df):
    df = pd.DataFrame(
        {c: 
         {r: actual_df[c].sum() * actual_df.loc[r].sum() / actual_df.sum().sum()
                  for r in actual_df[c].index} 
              for c in actual_df})
    # Fix the order of columns and rows
    df = df[actual_df.columns]
    df = df.reindex(actual_df.index)
    return df

In [None]:
expected_survey_results = expected_of_df(actual_survey_results)
expected_survey_results

As we're using a table of several rows and columns, we use the `scipy.stats.chi2_contingency()` function to find the $\chi ^ 2$ statistic and the _p_ value. 

Note that the function returns $\chi ^ 2$, the _p_ value, the number of degrees of freedom, and the matrix of expected frequencies. We're generally after just the second returned value, the _p_ value.

In [None]:
scipy.stats.chi2_contingency(actual_survey_results)

In [None]:
chi2, p, _, _ = scipy.stats.chi2_contingency(actual_survey_results)
chi2, p

The *p* value of 0.0009 means that we can reject the null hypothesis that voting intention is independent of gender: for this example, it seems that we can say that men and women vote differently.
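
The statistic and *p* value for the survey table can also be reproduced step by step. This is a sketch of the standard calculation, assuming no continuity correction (`chi2_contingency` applies one only when there is a single degree of freedom); the variable names are illustrative.

```python
import numpy as np
import scipy.stats

# Same counts as the survey table: rows are Men, Women;
# columns are Conservative, Labour, Other
observed = np.array([[170, 240, 80],
                     [220, 190, 100]])

# Expected count for a cell = row total * column total / grand total
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected = np.outer(row_totals, col_totals) / observed.sum()

# Chi-squared statistic: sum over cells of (observed - expected)^2 / expected
chi2_manual = ((observed - expected) ** 2 / expected).sum()

# Degrees of freedom for a table with r rows and c columns: (r - 1) * (c - 1)
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# p value: upper-tail probability of the chi-squared distribution
p_manual = scipy.stats.chi2.sf(chi2_manual, dof)
print(chi2_manual, dof, p_manual)
```

The printed values should agree with those returned by `scipy.stats.chi2_contingency()` for the same table.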

If we adjust the numbers slightly, we can get a very different result.

In [None]:
actual_survey_results_2 = pd.DataFrame({'Conservative': {'Men': 170, 'Women': 220},
                      'Labour': {'Men': 220, 'Women': 210},
                      'Other': {'Men': 80, 'Women': 100}})
actual_survey_results_2

In [None]:
expected_survey_results_2 = expected_of_df(actual_survey_results_2)
expected_survey_results_2

In [None]:
chi2, p, _, _ = scipy.stats.chi2_contingency(actual_survey_results_2)
chi2, p

The *p* value of 0.07 means that we *cannot* reject the null hypothesis that voting intention is independent of gender: for this modified example, we can't say that men and women vote differently.

## Chi-squared example 2: accident frequency by day of week
Let's see whether the number of accidents differs by day of the week.

In [None]:
# Build a DataFrame, one row for each accident
count_by_day_unrolled_df = pd.DataFrame(list(accidents.find({}, ['Day_of_Week'])))

# Find counts for each day
count_by_day_ss = count_by_day_unrolled_df['Day_of_Week'].value_counts()

# Reorder by day of week, add labels.
count_by_day_ss.sort_index(inplace=True)
count_by_day_ss.index = [label_of['Day_of_Week', r] for r in count_by_day_ss.index]

count_by_day_ss

In [None]:
count_by_day_ss.plot(kind='bar')

There are differences, but are they significant?

We run into a slight problem here, though: the functions we used in the voting example above assume the data is in a table of at least two rows and columns. They don't work on one-dimensional series:

In [None]:
scipy.stats.contingency.expected_freq(count_by_day_ss)

In [None]:
scipy.stats.chi2_contingency(count_by_day_ss)

This means we have to use less convenient functions to calculate the $\chi^2$ and _p_ values. First, we explicitly find the expected values, then we use the `scipy.stats.chisquare()` function to find the test results. Note that the _p_ value is nicely labelled for us.

In [None]:
expected_count_by_day_ss = pd.Series(count_by_day_ss.sum() / len(count_by_day_ss), 
                                  index=count_by_day_ss.index)
expected_count_by_day_ss

In [None]:
scipy.stats.chisquare(count_by_day_ss, expected_count_by_day_ss)

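The *p* value is reported as zero (in practice, too small to be distinguished from zero), so we can reject the null hypothesis that accidents are equally likely on every day: the variation in the number of accidents by day is statistically significant.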
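
When the expected counts are uniform, as they are here, `scipy.stats.chisquare()` can work them out itself: if the expected frequencies are omitted, it assumes every category is equally likely. A small sketch with invented counts (`toy_counts` is not taken from the accidents dataset):

```python
import scipy.stats

# Invented counts for four categories; not taken from the accidents dataset
toy_counts = [18, 22, 35, 25]

# Explicit uniform expectation: the total shared equally across categories
toy_expected = [sum(toy_counts) / len(toy_counts)] * len(toy_counts)
explicit = scipy.stats.chisquare(toy_counts, toy_expected)

# With the expected frequencies omitted, all categories are assumed equally likely
default = scipy.stats.chisquare(toy_counts)

print(explicit.statistic, explicit.pvalue)
print(default.statistic, default.pvalue)
```

Passing the expected counts explicitly is still worthwhile when the null hypothesis is anything other than equal frequencies.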

### Activity 2
We might expect there to be more accidents in bad weather. We might also expect that weather conditions will affect different roads differently, with bad conditions on high-speed roads having more of an impact on accident likelihood than on low-speed (typically urban) roads.

If the weather affects all roads equally, we would expect the proportions of accidents in different weather conditions to be the same across road speed limits.

Use a chi-squared test to determine if the proportion of accidents in different weather conditions is independent of road speed.

Note that this activity will require several stages, following the pattern above: finding the values for the different ranges of `Weather_Conditions`, extracting the data from the database into a DataFrame, calculating the expected values for each combination of speed limit and weather, and finally calculating the chi-squared *p* value.
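
The general shape of such a test, from one-row-per-event data to a *p* value, might look like the sketch below. The data and the column names `group` and `outcome` are entirely hypothetical; they are not fields of the accidents dataset and this is not the activity's solution.

```python
import pandas as pd
import scipy.stats

# Invented observations, one row per event; column names are hypothetical
toy_df = pd.DataFrame({
    'group':   ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b', 'b'] * 10,
    'outcome': ['x', 'x', 'y', 'y', 'x', 'x', 'x', 'y', 'y', 'y'] * 10,
})

# Count the rows falling into each combination of the two categories
toy_table = pd.crosstab(toy_df['group'], toy_df['outcome'])

# Test whether outcome is independent of group
toy_chi2, toy_p, toy_dof, toy_expected = scipy.stats.chi2_contingency(toy_table)
print(toy_table)
print(toy_chi2, toy_p, toy_dof)
```

In this invented table the two groups have identical proportions of each outcome, so the test gives no evidence against independence.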

The solution is in the [`14.3solutions`](14.3solutions.ipynb) Notebook.

In [None]:
# Insert your solution here.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to `14.4 Regression on subgroups`.