# Using statistical tests — regression (optional)
This notebook looks in more detail at some factors that affect the connection between the number of casualties and the number of vehicles in an accident. It does this by splitting the accidents dataset into two groups, those involving bus-like vehicles and those that don't, and comparing the regression results for the two groups.

*As with the previous notebook, this notebook demonstrates how we can apply statistics functions from powerful Python statistical analysis packages to our own datasets.*

*You are reminded that __this module is not intended to teach you statistical methods.__*

*The examples provided here should be regarded primarily as examples of how to apply powerful statistics-based Python packages to datasets, rather than examples of how to "do statistics".*

Let's start in the normal way, loading in the required libraries:

In [None]:
# Standard imports

import pandas as pd
import numpy as np
import scipy.stats

import matplotlib.pyplot as plt
import seaborn as sns


## Setting up the document database 

In the notebooks for parts 14, 15 and 16, you will be using a document database to manage data. As with the relational database you looked at in previous sections, the data in the database is *persistent*. The document database, MongoDB, is described as "NoSQL" to reflect that it does not use the tabular format of the relational database to store data. However, many of the properties of a formal RDBMS apply to MongoDB, including the need to connect to the database server.

As with PostgreSQL, the MongoDB database server runs independently from the Jupyter notebook server. To interact with it, you need to set up an explicit connection.

### Setting your database credentials

In order to work with a database, we need to create a *connection* to the database. A connection allows us to manipulate the database, and query its contents (depending on what usage rights you have been granted). For the SQL notebooks in TM351, the details of your connection will depend upon whether you are using the OU-hosted server, accessed via [tm351.open.ac.uk](https://tm351.open.ac.uk), or whether you are using a version hosted on your own computer, which you should have set up using either Vagrant or Docker.

To set up the connection, you need a login name and a password. We will use the variables `DB_USER` and `DB_PWD` to hold the user name and password, respectively, that you will use to connect to the database. Run the appropriate one of the following cells to set your credentials.

#### Connecting to the database on [tm351.open.ac.uk](https://tm351.open.ac.uk)

If you are using the Open University hosted server, you should execute the following cell, using your OUCU as the value of `DB_USER`, and the password you were given at the beginning of the module. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

The variables `DB_USER` and `DB_PWD` are strings, and so you need to put them in quotes.

In this case, note that the connection string contains an additional option at the end: `?authsource=user-data`. For the MongoDB setup that we are using here, this option tells Mongo where to look for the authentication database.

#### Connecting to the database on a locally hosted machine

If you are running the Jupyter server on your own machine, via Docker or Vagrant, you should execute the following cell. Note that if the cell is in RAW NBconvert style, you will need to change its type to Code in order to execute it.

Note that the locally hosted versions of the environment give you full administrator rights, which is why you do not need to specify a user name or password. Obviously, such rights would not generally be granted on a multi-user database unless you are the database administrator.

### Connecting to the database

We can now set up a connection to the database. As with PostgreSQL, we use a connection string:

In [None]:
print(MONGO_CONNECTION_STRING)

The connection string is made up of several parts:

- `mongodb` : tells `pymongo` that we will use MongoDB as our database engine
- Your user name and (character-escaped) password, separated by a colon, if you are using the remote server. If you are using a local server, you will be logged on as an administrator, and do not need to specify a name or password.
- `localhost:27017` : the host and port on which the database engine is listening.
- A reference to the authentication database (`?authsource=user-data`), if you are using the remote server.
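
As a rough sketch of how these parts fit together, the cell below assembles connection strings of both styles by hand. The user name and password are made up for illustration, and the exact host name used on the remote server is an assumption; the point is only to show where each part goes and how `urllib.parse.quote_plus` escapes characters in a password that would otherwise be misread as URI punctuation.

```python
from urllib.parse import quote_plus

# Hypothetical credentials, for illustration only
DB_USER = "abc123"
DB_PWD = "p@ss:word/1"

# Escape characters such as '@', ':' and '/' that have a special meaning in a URI
escaped_pwd = quote_plus(DB_PWD)

# Remote style: user name, escaped password and authentication database
# (the host shown here is an assumption)
remote_style = f"mongodb://{DB_USER}:{escaped_pwd}@localhost:27017/?authsource=user-data"

# Local style: no credentials needed
local_style = "mongodb://localhost:27017/"

print(remote_style)
print(local_style)
```

In practice the notebook's own `MONGO_CONNECTION_STRING` should be used; this sketch is only meant to make its structure easier to read.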

We now connect to the database with a `pymongo.MongoClient` object.

In [None]:
from pymongo import MongoClient

In [None]:
mongo_client=MongoClient(MONGO_CONNECTION_STRING)

You should now be connected to the MongoDB database server.

## The accidents database

The accidents database takes a long time to set up, so we have already imported it into a MongoDB database so that you can work with it. Note that on the remote VCE, the database is read-only, so you will not be able to alter its contents, although you can copy the contents into your own database space as discussed in the previous MongoDB notebooks, and alter that.

The cells in the earlier section, Setting up the document database, put the name of the accidents database into the variable `ACCIDENTS_DB_NAME`. Use this value to set up the connection to the `accidents` database and collections within it:

In [None]:
accidents_db=mongo_client[ACCIDENTS_DB_NAME]

We can look at the names of the collections in the database:

In [None]:
accidents_db.list_collection_names()

We will introduce some of the different collections in the rest of the materials, but let's start with the `accidents` collection:

In [None]:
accidents_collection=accidents_db['accidents']

This collection contains information on individual accidents. We can see how many examples it contains with the `.count_documents()` method:

In [None]:
accidents_collection.count_documents({})

In [None]:
labels=accidents_db['labels']

In case we feel the need to generate some charts, let's make sure they're big enough to see:

In [None]:
sns.set(rc={'figure.figsize':(11.7,8.27)})

## Linear Regression Lines
Let's look again at the number of casualties and the number of vehicles recorded for each accident. Note that the query below restricts the data to accidents in a single highway authority (code `E06000042`) rather than using the whole dataset.

In [None]:
# Build a DataFrame, one row for each accident
cas_veh_df = pd.DataFrame(accidents_collection.find({'Local_Authority_(Highway)': 'E06000042'},
                                                ['Number_of_Casualties', 'Number_of_Vehicles']))
cas_veh_df.head()
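
If it is not obvious why a query result can be passed straight to `pd.DataFrame()`, the following sketch may help. A `pymongo` cursor yields one dictionary per document, and `pandas` turns a sequence of dictionaries into rows. The records here are made up for illustration; note that `pymongo` includes the `_id` field in results unless it is explicitly excluded, which is why an `_id` column appears.

```python
import pandas as pd

# Made-up records standing in for the documents a cursor would yield
records = [
    {'_id': 'a1', 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2},
    {'_id': 'a2', 'Number_of_Casualties': 3, 'Number_of_Vehicles': 1},
    {'_id': 'a3', 'Number_of_Casualties': 1, 'Number_of_Vehicles': 2},
]

# pandas turns each dictionary into a row and each key into a column
example_df = pd.DataFrame(records)

# Drop the identifier column if it is not needed for the analysis
example_df = example_df.drop(columns='_id')
print(example_df)
```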

Create a scatterplot overplotted with the best fit line to help tune us in (or not!) to any obvious relationship:

In [None]:
sns.lmplot(x="Number_of_Casualties", y="Number_of_Vehicles",
           data=cas_veh_df,
           x_jitter=0.2, y_jitter=0.2, scatter_kws={'s':1}
          );

A linear regression over the raw data (one row per accident) provides us with the equation of a best fit line for the data:

In [None]:
regressionline = scipy.stats.linregress(cas_veh_df['Number_of_Casualties'],
                                        cas_veh_df['Number_of_Vehicles'])

# The regression line is of the form y = m x + c
m = regressionline[0]
c = regressionline[1]

print(f'm: {m}, c: {c}')
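
The result returned by `linregress` can also be read by name rather than by position, which some readers may find clearer. The sketch below uses a handful of made-up points (not the accident data) to show the named fields and how the fitted line can be used to predict a value.

```python
import numpy as np
import scipy.stats

# Made-up points that lie close to a straight line
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = np.array([1.1, 2.9, 5.2, 6.8, 9.0])

result = scipy.stats.linregress(x_demo, y_demo)

# Named fields are equivalent to result[0], result[1], ...
print(f'slope: {result.slope}, intercept: {result.intercept}')
print(f'r: {result.rvalue}, p: {result.pvalue}')

# Predicted y value at x = 5 from the fitted line y = m x + c
predicted = result.slope * 5 + result.intercept
print(f'predicted y at x = 5: {predicted}')
```

Note that the result also carries a correlation coefficient (`rvalue`) and a *p* value, so a single call gives both the line and a measure of how well it fits.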

We can also overplot the original `.lmplot()` chart with points along the regression line calculated above:

In [None]:
sns.lmplot(x="Number_of_Casualties", y="Number_of_Vehicles",
           data=cas_veh_df,
           x_jitter=0.2, y_jitter=0.2, scatter_kws={'s':1}
          );


# Generate an array of points to plot the regression line against
x = np.linspace(0, 10, 20)

# Plot points along the calculated regression line
plt.plot(x, m*x + c, '.',  color='red');

In [None]:
# Count the accidents for each combination of casualties and vehicles
cas_veh_crosstab = pd.crosstab(cas_veh_df['Number_of_Casualties'], 
                               cas_veh_df['Number_of_Vehicles'])

cas_veh_crosstab.head(3)

Create a long dataframe from the crosstab result:

In [None]:
cas_veh_crosstab_long = cas_veh_crosstab.reset_index().melt(id_vars='Number_of_Casualties',
                                                            value_name="Count")

cas_veh_crosstab_long.head()
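
To see what the wide-to-long step does, here is a minimal sketch on five made-up accident rows. It follows the same `crosstab` then `melt` pattern as above, and then keeps only the combinations that actually occur; the data and the `demo_` names are illustrative only.

```python
import pandas as pd

# Made-up accident rows, for illustration only
demo_df = pd.DataFrame({'Number_of_Casualties': [1, 1, 2, 1, 3],
                        'Number_of_Vehicles':   [1, 2, 2, 2, 1]})

# Wide form: one row per casualty count, one column per vehicle count
demo_crosstab = pd.crosstab(demo_df['Number_of_Casualties'],
                            demo_df['Number_of_Vehicles'])

# Long form: one row per (casualties, vehicles) pair, with its count
demo_long = demo_crosstab.reset_index().melt(id_vars='Number_of_Casualties',
                                             value_name='Count')

# Keep only the combinations that actually occur
demo_long = demo_long[demo_long['Count'] > 0]
print(demo_long)
```

The counts in the long table add up to the number of original rows, which is a quick way to check that nothing has been lost in the reshaping.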

Plot the scatterplot using transparency, so that heavily overplotted points appear darker than sparsely populated ones:

In [None]:
sns.lmplot(x="Number_of_Casualties", y="Number_of_Vehicles",
           data=cas_veh_df, scatter_kws={'s':100, 'alpha': 0.1})

# Set the x axis limits (None means set automatically)
plt.xlim(0, None);

## Pearson's *R*

The `pearsonr` function calculates Pearson's *R* correlation. Recall that values near +1 show good positive correlation, values near -1 show good negative correlation, and values near 0 show no particular correlation. The `scipy` function returns a second value, the *p* value of the result. 

In [None]:
scipy.stats.pearsonr(cas_veh_df['Number_of_Casualties'], 
                     cas_veh_df['Number_of_Vehicles'])

This result shows a small, positive correlation with a very small *p* value. In other words, there's not much correlation, but the result is statistically significant (it's not likely that the observed values occurred by chance). This means we can reject the null hypothesis that the number of casualties in an accident is unrelated to the number of vehicles, and claim that the number of casualties associated with an accident *is* related to the number of vehicles involved.
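
It can seem odd that a weak correlation comes with such a convincing *p* value. The sketch below, which uses synthetic data rather than the accident records, suggests why: with many observations, even a faint relationship is very unlikely to be a chance artefact. The sample size and the strength of the simulated effect are arbitrary choices made for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable
n = 10_000

# Synthetic data: y depends only weakly on x, plus a lot of noise
x_sim = rng.normal(size=n)
y_sim = 0.1 * x_sim + rng.normal(size=n)

r, p = scipy.stats.pearsonr(x_sim, y_sim)

# r squared is the fraction of the variance in y accounted for by the line
print(f'r: {r:.3f}, r squared: {r**2:.4f}, p-value: {p:.2e}')
```

The *p* value speaks to whether a relationship exists at all; the size of *R* (or of *R* squared) speaks to how much that relationship matters.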


### Activity 1
In this activity we will continue to explore the relationship between the number of casualties and the number of vehicles across two different classes of accident: those that involve bus-like vehicles and those that don't.

As part of your investigation, you should calculate the regression lines for the subgroups and visualise them appropriately. Then calculate Pearson's R / correlation coefficient for each group to see how significant any detected effect might be. In each case, comment on what you learn.


__Hint:__ To simplify your database query, note that MongoDB queries can search over multiple values in various ways. For example, the following constructions are all legitimate in the filter (selection) part of the query:

- `{'$or':[ TERM1, TERM2]}`
- `{FIELD: {'$in': [10, 11]} }`

We can also search for "not" things by negating an operator expression on a field: `{FIELD: {'$not': OPERATOR_EXPRESSION}}`.

Also remember that you can use the dot notation to query elements in subdocuments.

What are the vehicle types?

In [None]:
# Enter your code in this cell


What are the codes for the bus-like vehicles?

In [None]:
# Enter your code in this cell


Get the data for bus-like vehicles.


In [None]:
# Enter your code in this cell


Get the data for non-bus-like vehicles.


In [None]:
# Enter your code in this cell


Find the regression for bus-like vehicles.


In [None]:
# Enter your code in this cell


Visualise the regression line for bus-like vehicles.


In [None]:
# Enter your code in this cell


Find Pearson's R for the bus-like vehicles data.


In [None]:
# Enter your code in this cell


Comment on what you discovered for the bus related incidents.

Write your comments in this cell

Find the regression for non-bus-like vehicles.


In [None]:
# Enter your code in this cell


Visualise the regression line for non-bus-like vehicles.


In [None]:
# Enter your code in this cell


Find Pearson's R for the non-bus-like vehicles data.


In [None]:
# Enter your code in this cell


Comment on what you discovered for the non-bus related incidents.

Write your comments in this cell

#### Our solution

To reveal our solution, run this cell or click on the triangle symbol on the left-hand side of the cell.

Display the different vehicle types, using the `labels` collection to make them human-readable:

In [None]:
labels=accidents_db['labels']

In [None]:
labels.find_one({'label': 'Vehicle_Type'}, {'codes'})

"Bus like" vehicles are vehicles with codes 10 (*Minibus (8 - 16 passenger seats)*) and 11 (*Bus or coach (17 or more pass seats)*).

We can now pull accidents involving bus-like vehicles:

In [None]:
# Build a DataFrame, one row for each accident
coach_df = pd.DataFrame(accidents_collection.find({'$or':[{'Vehicles.Vehicle_Type': 10},
                                                      {'Vehicles.Vehicle_Type': 11}]},
                                              ['Number_of_Casualties', 'Number_of_Vehicles']))
len(coach_df)

And accidents not involving bus-like vehicles:

In [None]:
# Build a DataFrame, one row for each accident
non_coach_df = pd.DataFrame(accidents_collection.find({'Vehicles.Vehicle_Type': {'$not': {'$in': [10, 11]}}}, 
                                                       ['Number_of_Casualties', 'Number_of_Vehicles']))
len(non_coach_df)

Check we've got them all:

In [None]:
len(coach_df) + len(non_coach_df) == accidents_collection.count_documents({})
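
The check above relies on the two filters being complementary. On an array field, `$in` matches a document if *any* element matches, and wrapping it in `$not` matches only documents where *no* element matches, so every accident should land in exactly one group. The plain-Python sketch below imitates that logic on made-up documents; it assumes, as the dot-notation queries suggest, that each accident holds a list of vehicle subdocuments, and the `id` field is purely illustrative.

```python
BUS_CODES = {10, 11}

# Made-up documents; the real ones are assumed to hold a list of vehicles
sample_accidents = [
    {'id': 1, 'Vehicles': [{'Vehicle_Type': 9}, {'Vehicle_Type': 11}]},
    {'id': 2, 'Vehicles': [{'Vehicle_Type': 9}]},
    {'id': 3, 'Vehicles': [{'Vehicle_Type': 10}]},
    {'id': 4, 'Vehicles': [{'Vehicle_Type': 1}, {'Vehicle_Type': 9}]},
]

def involves_bus(accident):
    """True if any vehicle in the accident has a bus-like type code."""
    return any(v['Vehicle_Type'] in BUS_CODES for v in accident['Vehicles'])

bus_ids = [a['id'] for a in sample_accidents if involves_bus(a)]
non_bus_ids = [a['id'] for a in sample_accidents if not involves_bus(a)]

# Every accident falls into exactly one of the two groups
print(bus_ids, non_bus_ids)
print(len(bus_ids) + len(non_bus_ids) == len(sample_accidents))
```

Note that accident 1 counts as bus-related even though it also involves another type of vehicle: one bus-like vehicle is enough.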

__Bus related incidents__

Find the regression for the bus related incidents:

In [None]:
coach_regressionline = scipy.stats.linregress(coach_df['Number_of_Casualties'],
                                              coach_df['Number_of_Vehicles'])

# The regression line is of the form y = m x + c
coach_m = coach_regressionline[0]
coach_c = coach_regressionline[1]

print(f'm: {coach_m}, c: {coach_c}')

Generate a crosstab report to give us a sense of how many bus-related accidents there are for each combination of casualty and vehicle numbers:

In [None]:
# Count the accidents for each combination of casualties and vehicles
coach_crosstab = pd.crosstab(coach_df['Number_of_Casualties'],
                             coach_df['Number_of_Vehicles'])
coach_crosstab.head(3)

Display a scatter plot with the regression line from the original accident line item data:

In [None]:
sns.lmplot(x="Number_of_Casualties", y="Number_of_Vehicles",
           data=coach_df, scatter_kws={'s':10, 'alpha': 0.1})

# Generate an array of points to plot the regression line against
x = np.linspace(0, 10, 20)

# Plot points along the calculated regression line
plt.plot(x, coach_m*x + coach_c, '.',  color='red');

# Set the x axis limits (None means set automatically)
plt.xlim(0, None);

Calculate the correlation coefficient and the p-value of the result:

In [None]:
(corr, p) = scipy.stats.pearsonr(coach_df['Number_of_Casualties'],
                                 coach_df['Number_of_Vehicles'])

print(f'Correlation: {corr}, p-value: {p}')

This shows a slight degree of positive correlation. The very small *p* value means it is unlikely that the detected relationship was caused by chance, so we can reject the null hypothesis that the number of casualties is independent of the number of vehicles. In other words, there's a very slight, but likely real, positive correlation between the number of vehicles and the number of casualties. Nothing to make a story out of, though...

__Non-bus Incidents__

Find the regression for non-bus accidents:

In [None]:
non_coach_regressionline = scipy.stats.linregress(non_coach_df['Number_of_Casualties'],
                                                  non_coach_df['Number_of_Vehicles'])

# The regression line is of the form y = m x + c
non_coach_m = non_coach_regressionline[0]
non_coach_c = non_coach_regressionline[1]


print(f'm: {non_coach_m}, c: {non_coach_c}')

Generate a crosstab report to give us a sense of how many non-bus accidents there are for each combination of casualty and vehicle numbers:

In [None]:
# Count the accidents for each combination of casualties and vehicles
non_coach_crosstab = pd.crosstab(non_coach_df['Number_of_Casualties'],
                                 non_coach_df['Number_of_Vehicles'])
non_coach_crosstab.head()

Display the scatterplot and overplot the calculated regression line (this may take some time to calculate):

In [None]:
sns.lmplot(x="Number_of_Casualties", y="Number_of_Vehicles",
           data=non_coach_df, scatter_kws={'s':10, 'alpha': 0.1})

# Generate an array of points to plot the regression line against
x = np.linspace(0, 10, 20)

# Plot points along the calculated regression line
plt.plot(x, non_coach_m*x + non_coach_c, '.',  color='red');

# Set the x axis limits (None means set automatically)
plt.xlim(0, None);

Calculate the correlation coefficient and the p-value of the result:

In [None]:
(corr, p) = scipy.stats.pearsonr(non_coach_df['Number_of_Casualties'],
                                 non_coach_df['Number_of_Vehicles'])

print(f'Correlation: {corr}, p-value: {p}')

This shows slightly more correlation than for the bus related accidents. Once again, the extremely small *p* value means that the observed result was unlikely to occur by chance, which suggests we can reject the null hypothesis that the number of casualties is independent of the number of vehicles. In other words, there's a very slight, but likely real, positive correlation. Nothing really to make a story out of this either, though...
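
Because the same steps were repeated for both groups, it may be tidier to wrap them in a small helper. The sketch below is one possible way to do that, not part of the module's own solution: the function name `summarise_relationship` and the made-up rows are illustrative. It relies on the fact that `linregress` already reports a correlation coefficient alongside the slope and intercept.

```python
import pandas as pd
import scipy.stats

def summarise_relationship(df, x_col='Number_of_Casualties',
                           y_col='Number_of_Vehicles'):
    """Return slope, intercept, Pearson's r and p-value for two columns."""
    fit = scipy.stats.linregress(df[x_col], df[y_col])
    return {'n': len(df), 'm': fit.slope, 'c': fit.intercept,
            'r': fit.rvalue, 'p': fit.pvalue}

# Made-up rows standing in for coach_df or non_coach_df
demo_df = pd.DataFrame({'Number_of_Casualties': [1, 2, 1, 3, 2, 1],
                        'Number_of_Vehicles':   [1, 2, 2, 3, 1, 1]})

summary = summarise_relationship(demo_df)
print(summary)

# linregress reports the same correlation coefficient as pearsonr
r_check, _ = scipy.stats.pearsonr(demo_df['Number_of_Casualties'],
                                  demo_df['Number_of_Vehicles'])
print(f"r from linregress: {summary['r']}, r from pearsonr: {r_check}")
```

With the real data, calling the helper once on `coach_df` and once on `non_coach_df` would put the two sets of figures side by side for comparison.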

#### End of Activity 1

------------------------------------------------------

## What next?

This is one of two optional notebooks. If you have time available, consider reviewing the other optional notebook. Otherwise, return to the module materials now.