# Module 1: Basics of statistics


## Statistics of nerve conduction velocities
An inquisitive Duke BME student decides to measure the nerve conduction velocities (NCVs) of fellow students on campus. After ten grueling hours of recording, the student accumulates velocity readings for a random sample of 50 students, stored in a .csv file.

In [3]:
# Import relevant packages
import scipy.stats as stats     # Comprehensive stats package
import numpy as np              # Mathematical operations
import plotly.express as px     # Plotting
import pandas as pd             # Data reading and processing

# Import data as pandas dataframe
df = pd.read_csv("../data/ncv_data.csv") # Make sure this is the correct path to the .csv file!

# It is good practice to look at your data frame before doing any work
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   NCV     50 non-null     int64
dtypes: int64(1)
memory usage: 528.0 bytes


## Visualizing the data

Make a histogram of the raw data. What information does a histogram give you?

In [None]:
fig = px.histogram(df,x="NCV",                  # Call on the NCV tag in your data frame
                   title='Histogram of NCVs',   # Give your plot a title
                   labels={'NCV':'NCV (m/s)'})  # Change the x-axis label to include units
fig.show()

## Calculating basic measures

Calculate the sample mean and standard deviation.

In [None]:
sample_mean = 
sample_std = 

# Get in the habit of printing your results
print('Sample mean: %.2f' % sample_mean)
print('Sample standard deviation: %.2f' % sample_std)
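
One detail worth checking when filling in the cell above is which denominator the standard deviation uses. The sketch below uses a small set of hypothetical readings (not the values in `ncv_data.csv`) and `toy_` names that are assumptions for illustration only. It shows that pandas divides by n - 1 by default, while NumPy divides by n unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical readings in m/s, made up for illustration (NOT the course data)
toy = pd.DataFrame({"NCV": [48, 50, 52, 49, 51]})

toy_mean = toy["NCV"].mean()

# pandas: Series.std() defaults to ddof=1 (sample standard deviation)
toy_std = toy["NCV"].std()

# NumPy: np.std() defaults to ddof=0, so ddof=1 must be requested explicitly
toy_std_np = np.std(toy["NCV"].to_numpy(), ddof=1)

# Population-style estimate (divides by n), shown only for comparison
toy_std_pop = np.std(toy["NCV"].to_numpy())

print('Sample mean: %.2f' % toy_mean)
print('Sample standard deviation (pandas): %.2f' % toy_std)
print('Sample standard deviation (NumPy, ddof=1): %.2f' % toy_std_np)
print('Standard deviation with ddof=0: %.2f' % toy_std_pop)
```

For a sample of 50 the two denominators give similar numbers, but the n - 1 version is the one that matches the usual definition of the sample standard deviation.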

## The sampling distribution

Estimate the standard deviation of the sampling distribution of the mean NCV for Duke students. Be able to explain what the sampling distribution represents. Why is it acceptable to use the t-distribution to model the sampling distribution of the mean NCV of Duke students? How many degrees of freedom does the t-distribution have when it is estimated from the sample data?

In [None]:
n = df['NCV'].count()     # This is just one of several useful pandas operations
sampling_distribution_std = 
df_ncv = 

# Print your results
print('Sampling distribution standard deviation: %.2f' % sampling_distribution_std)
print('Degrees of freedom: %d' % df_ncv)
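
As a rough illustration of the quantities asked for above, the sketch below assumes the usual estimate of the sampling distribution's standard deviation, $s/\sqrt{n}$, and $n - 1$ degrees of freedom. The readings are hypothetical and the `toy_` names are not part of the notebook.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical readings in m/s, made up for illustration (NOT the course data)
toy_ncv = np.array([48, 50, 52, 49, 51])

toy_n = toy_ncv.size

# Standard error of the mean: sample standard deviation divided by sqrt(n)
toy_sem = np.std(toy_ncv, ddof=1) / np.sqrt(toy_n)

# scipy.stats.sem computes the same quantity (it also defaults to ddof=1)
toy_sem_scipy = stats.sem(toy_ncv)

# Degrees of freedom for a t-distribution estimated from one sample
toy_dof = toy_n - 1

print('Sampling distribution standard deviation: %.2f' % toy_sem)
print('Degrees of freedom: %d' % toy_dof)
```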

## Probabilities

Assume that the true population (Duke students) mean NCV is known to be 51 m/s. Perform the calculations necessary to find the region of the t-distribution (i.e. the cut-off t-value) that corresponds to the probability of collecting a sample with a mean less than or equal to the one found in the data provided. Calculate the probability with Python and compare it to the value given in the t-table provided.

In [None]:
pop_mean = 51
t = 
print('The region less than t-statistic = %.2f' % t)

# Look up how to use this function - what inputs do you need?
p = stats.t.cdf()

print('p = %.3f' % p)
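
For reference, `stats.t.cdf` takes the t-value first and the degrees of freedom second, and returns the area to the left of that t-value. The sketch below uses hypothetical summary values (not those from the NCV data) and assumes the t-statistic is the distance between the sample mean and the population mean, measured in standard errors.

```python
import scipy.stats as stats

# Hypothetical summary values, made up for illustration (NOT the course data)
toy_pop_mean = 50.0      # assumed population mean (m/s)
toy_sample_mean = 49.0   # assumed sample mean (m/s)
toy_sem = 0.5            # assumed standard error of the mean (m/s)
toy_dof = 24             # assumed degrees of freedom (n - 1 for n = 25)

# t-statistic: how many standard errors the sample mean lies from the population mean
toy_t = (toy_sample_mean - toy_pop_mean) / toy_sem

# Lower-tail probability P(T <= toy_t)
toy_p = stats.t.cdf(toy_t, df=toy_dof)

print('The region less than t-statistic = %.2f' % toy_t)
print('p = %.3f' % toy_p)
```

A t-table typically lists upper-tail areas for positive t-values, so a negative t-statistic is looked up by symmetry.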


What is the probability that your next random sample of 50 Duke students will have a mean greater than 51.5 m/s?

In [None]:
new_sample_mean = 51.5
t = 

# It's the same function as before. How will you change your inputs?
p = stats.t.cdf()
print('p = %.2f' % p)
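
Because `stats.t.cdf` returns the area to the left of a t-value, a "greater than" probability is its complement. The sketch below uses hypothetical numbers to show two equivalent ways of getting the upper tail: subtracting the CDF from 1, or calling the survival function `stats.t.sf`.

```python
import scipy.stats as stats

# Hypothetical values, made up for illustration (NOT the course data)
toy_t = 2.0     # assumed t-statistic
toy_dof = 24    # assumed degrees of freedom

# Upper-tail probability P(T > toy_t) as the complement of the CDF
toy_p_upper = 1 - stats.t.cdf(toy_t, df=toy_dof)

# The survival function gives the same area directly
toy_p_sf = stats.t.sf(toy_t, df=toy_dof)

print('p (1 - cdf) = %.3f' % toy_p_upper)
print('p (sf)      = %.3f' % toy_p_sf)
```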

## Working backwards

Let's think about this problem in reverse. Instead of determining the probability of finding a given sample mean, let's find the sample mean that yields a desired probability, e.g. $P(\bar{x} > ?) = 0.05$. We will complete the following statement: "There is a 5% chance of collecting a sample mean greater than _______."

First, find the unknown t-statistic in the following statement: $P(t \leq ?) = 0.95$. This value is called the critical t-value, or t-critical.

In [None]:
# Another function from stats.t. Always look up documentation if you don't recognize a function!
t_crit = stats.t.ppf()
print('t-critical = %.2f' % t_crit)
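
`stats.t.ppf` is the inverse of `stats.t.cdf`: it takes a lower-tail probability and the degrees of freedom, and returns the t-value with that much area to its left. The sketch below uses a hypothetical 24 degrees of freedom (not the value for the NCV sample) and checks the round trip.

```python
import scipy.stats as stats

toy_dof = 24  # assumed degrees of freedom, for illustration only

# t-value with 95% of the distribution to its left
toy_t_crit = stats.t.ppf(0.95, df=toy_dof)

# Feeding the result back into the CDF should recover the probability
toy_check = stats.t.cdf(toy_t_crit, df=toy_dof)

print('t-critical = %.2f' % toy_t_crit)
print('cdf(t-critical) = %.2f' % toy_check)
```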

Using this t-critical value, find the sample mean that completes the following statement: "There is a 5% chance of collecting a sample mean greater than _______."

In [None]:
new_sample_mean = 
print('There is a 5%% chance of collecting a sample mean greater than %.2f' % new_sample_mean)
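
The last step is the t-statistic formula solved for the sample mean. As a sketch with hypothetical numbers (not the NCV results), and assuming $t = (\bar{x} - \mu) / SE$, the cut-off mean is $\bar{x} = \mu + t_{crit} \cdot SE$.

```python
import scipy.stats as stats

# Hypothetical values, made up for illustration (NOT the course data)
toy_pop_mean = 50.0   # assumed population mean (m/s)
toy_sem = 0.5         # assumed standard error of the mean (m/s)
toy_dof = 24          # assumed degrees of freedom

# Critical t-value with 5% of the distribution to its right
toy_t_crit = stats.t.ppf(0.95, df=toy_dof)

# Rearranged t-statistic: convert the critical t-value back to units of m/s
toy_cutoff_mean = toy_pop_mean + toy_t_crit * toy_sem

# Sanity check: the upper-tail probability at the cut-off should be 5%
toy_tail = stats.t.sf((toy_cutoff_mean - toy_pop_mean) / toy_sem, df=toy_dof)

print('Cut-off sample mean: %.2f' % toy_cutoff_mean)
print('Upper-tail probability at the cut-off: %.2f' % toy_tail)
```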