# Data exploration: Summary Statistics

This notebook shows how to summarize and visualize data. It computes a few summary statistics (e.g., max, min, and median) and draws a visualization.

For those of you interested in the code, it uses the NumPy library to compute the statistics, pandas to load the data, and Matplotlib to generate the plot.

Run the cell below to load the definitions of two functions. The function "display_summary_statistics" takes a list of numbers as its argument and returns the min, 1st quartile, median, 3rd quartile and max of these numbers. The function "plot_5_summary" takes a list of numbers as its argument and generates a boxplot for these numbers.

In [None]:
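import matplotlib.pyplot as plt
import numpy as np


def display_summary_statistics(numbers):
    """Return the min, Q1, median, Q3 and max of `numbers`."""
    # The 0th and 100th percentiles are the minimum and maximum.
    Min = np.percentile(numbers, 0)
    Q1 = np.percentile(numbers, 25)
    Median = np.percentile(numbers, 50)
    Q3 = np.percentile(numbers, 75)
    Max = np.percentile(numbers, 100)
    return Min, Q1, Median, Q3, Max


def plot_5_summary(numbers):
    """Draw a boxplot of `numbers`."""
    plt.boxplot(numbers)
    plt.show()
    plt.close('all')  # release the figure once it has been shown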


The following code downloads the traffic data from the repository and loads it into a pandas DataFrame.

In [None]:
!wget https://raw.githubusercontent.com/msoley/CSCI549/master/In-class%20exercises/Practicum%201/traffic_data.csv

import pandas as pd
ds=pd.read_csv('traffic_data.csv',index_col=0)
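
The later cells select the data with `ds["0"]`, a string key. One plausible reason, sketched below with a made-up in-memory CSV (the real file's layout is an assumption here), is that the file has an unnamed index column followed by a column whose header is the text `0`.

```python
import io
import pandas as pd

# Hypothetical CSV text mimicking the assumed layout of traffic_data.csv:
# an unnamed first column (the row index) and a data column labelled 0.
csv_text = ",0\n0,12\n1,15\n2,9\n"

demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Header names are read as strings, so the column key is "0", not 0.
print(demo.columns.tolist())  # ['0']
print(demo["0"].tolist())     # [12, 15, 9]
```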

## Generating Summary Statistics for a Dataset

Run the cell below. It outputs five summary statistics for the data in the CSV file: min, Q1, median, Q3 and max.

* The max represents the maximum value in the data.
* The min represents the minimum value in the data.
* The median is the value separating the higher 50% from the lower 50% of a data sample. 
* Q1 (1st quartile) is the value separating the higher 75% from the lower 25% of a data sample. 
* Q3 (3rd quartile) is the value separating the higher 25% from the lower 75% of a data sample.
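
As a quick illustration of these definitions, here is a minimal sketch on a small made-up list (the values in `sample` are hypothetical, not taken from the traffic data). It uses `numpy.percentile` with its default linear interpolation.

```python
import numpy as np

# Hypothetical sample, chosen so each statistic falls exactly on a data point.
sample = [1, 2, 3, 4, 5]

# Percentiles 0, 25, 50, 75 and 100 give min, Q1, median, Q3 and max.
five_numbers = np.percentile(sample, [0, 25, 50, 75, 100])
print(five_numbers)  # [1. 2. 3. 4. 5.]
```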

In [None]:
Min,Q1,Median,Q3,Max=display_summary_statistics(ds["0"].values)
print('Min:',Min)
print('Q1:', Q1)
print('Median:',Median)
print('Q3:',Q3)
print('Max:',Max)
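
If you want to cross-check these numbers, pandas offers `Series.describe()`, which reports the same five values alongside the count, mean and standard deviation. The sketch below uses a small made-up series (`values` is hypothetical); on the real data you would presumably call `ds["0"].describe()` instead.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for one column of the traffic data.
values = pd.Series([4, 8, 15, 16, 23, 42])

summary = values.describe()
print(summary[["min", "25%", "50%", "75%", "max"]])

# describe() and numpy.percentile both interpolate linearly by default,
# so the quartiles should agree.
q1_numpy = np.percentile(values, 25)
print(np.isclose(summary["25%"], q1_numpy))
```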

## Generating a Plot

Run the cell below to generate a boxplot of your numbers. 

In [None]:
%matplotlib inline
plot_5_summary(ds["0"])

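How to interpret this plot: The line inside the box (orange by default in recent Matplotlib versions) represents the median. The lower and upper edges of the box represent the 1st and 3rd quartiles. By default, each whisker extends to the most extreme data point within 1.5 times the interquartile range (Q3 minus Q1) of the box edge, so the whiskers reach the minimum and maximum only when there are no outliers; any points beyond the whiskers are drawn individually.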

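The boxplot also gives a visual indication of the skewness of the data: a median that sits off-center in the box, or one whisker much longer than the other, suggests a skewed distribution.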
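
Matplotlib's `boxplot` by default (`whis=1.5`) ends each whisker at the most extreme data point lying within 1.5 times the interquartile range of the box, and plots anything beyond that as a separate point. The sketch below works this out by hand on made-up data; the variable names are only illustrative.

```python
import numpy as np

# Hypothetical sample with one unusually large value.
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker ends at the most extreme data point inside the fences.
inside = data[(data >= lower_fence) & (data <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(whisker_low, whisker_high, outliers)  # 1 9 [30]
```

Here the upper whisker stops at 9 rather than at the maximum of 30, which would appear as a lone point above the whisker.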