In [1]:
%pylab inline
import warnings
warnings.filterwarnings('ignore')

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


# Summary Statistics

Summary statistics are numbers that summarize properties of a dataset, such as frequency, location, and spread. Most of them can be calculated in a single pass through the data. There are multiple ways to obtain summary statistics for your data in Python. Below, we demonstrate how to do so using pandas and NumPy. First, we import both of these libraries:

In [2]:
import pandas as pd
import numpy as np
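To illustrate the remark that most summary statistics need only one pass over the data, here is a minimal sketch based on Welford's online algorithm. The function name `single_pass_stats` and the sample values are hypothetical and chosen only for illustration; in practice the built-in NumPy and pandas methods shown later are the usual choice.

```python
import numpy as np

def single_pass_stats(data):
    """Return count, min, max, mean and population variance in one pass."""
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    lowest = float("inf")
    highest = float("-inf")
    for x in data:
        count += 1
        delta = x - mean
        mean += delta / count          # update the running mean
        m2 += delta * (x - mean)       # Welford update for the variance
        lowest = min(lowest, x)
        highest = max(highest, x)
    variance = m2 / count if count else float("nan")
    return count, lowest, highest, mean, variance

sample = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
count, lowest, highest, mean, variance = single_pass_stats(sample)
print(count, lowest, highest, mean, variance)
```

The result can be cross-checked against `sample.mean()` and `sample.var()`, which compute the same quantities.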

Next, we generate two illustrative datasets from which we can generate summary statistics:

In [3]:
# Defines an n-dimensional array (ndarray) with 10,000 random integers in the range [0, 500).
values1 = np.random.randint(500, size=10000)

# Defines a pandas Series holding a second, independent sample drawn the same way.
values2 = pd.Series(np.random.randint(500, size=10000))

A pandas Series is a labeled one-dimensional array built on top of an ndarray, so most statistical methods available for ndarrays can be used with it as well (a few, such as *ptp()*, are not available on a Series in recent pandas versions). Below, see the first 10 elements of *values1* and the last 10 elements of *values2* (note the explicit index displayed for the pandas Series):

In [4]:
values1[:10]

array([385, 319, 159, 361,  14, 400, 206,  35, 238, 401])

In [5]:
values2[-10:]

9990    416
9991     31
9992    343
9993     77
9994    119
9995    461
9996    250
9997     70
9998     36
9999    150
dtype: int64

Built-in methods are available for all basic statistics. Some are demonstrated below:

In [6]:
print('MIN(values1) = ' + str(values1.min()) + '\t\t\tMIN(values2) = ' + str(values2.min())) # minimum value
print('MAX(values1) = ' + str(values1.max()) + '\t\t\tMAX(values2) = ' + str(values2.max())) # maximum value
# Series.ptp() no longer exists in recent pandas, and ndarray.ptp() was removed in NumPy 2.0,
# so the range is computed with np.ptp() and with max() - min() instead.
print('RANGE(values1) = ' + str(np.ptp(values1)) + '\t\t\tRANGE(values2) = ' + str(values2.max() - values2.min())) # the range of the values
print('MEAN(values1) = ' + str(values1.mean()) + '\t\tMEAN(values2) = ' + str(values2.mean())) # the mean of the values
print('STD(values1) = ' + str(values1.std()) + '\t\tSTD(values2) = ' + str(values2.std())) # the standard deviation of the values
print('VARIANCE(values1) = ' + str(values1.var()) + '\tVARIANCE(values2) = ' + str(values2.var())) # the variance of the values
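One detail worth keeping in mind when comparing the two libraries: NumPy's *std()* and *var()* default to the population formula (`ddof=0`), while the pandas versions default to the sample formula (`ddof=1`). The sketch below uses a small made-up array to show the difference and how to make the two agree.

```python
import numpy as np
import pandas as pd

data = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative values only
series = pd.Series(data)

population_std = data.std()    # NumPy default: ddof=0 (divide by n)
sample_std = series.std()      # pandas default: ddof=1 (divide by n - 1)
print(population_std, sample_std)

# Either library reproduces the other when ddof is set explicitly.
print(data.std(ddof=1), series.std(ddof=0))
```

For large samples such as the 10,000 values used above the two results are close, but they are not identical.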

Additionally, pandas series have a method called *describe()* that returns a nice summary of these basic statistics.

In [None]:
values2.describe()

For non-numerical series objects, *describe()* will return a simple summary of the number of unique values and most frequently occurring ones.

In [None]:
s = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

In [None]:
s

In [None]:
s.describe()
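As a rough guide to what *describe()* returns, the sketch below inspects the labels of the result for a numeric Series and for a Series of strings. The variable names `numbers` and `letters` are made up for this example; `letters` holds the same values as *s* above.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])
letters = pd.Series(['a', 'a', 'b', 'b', 'a', 'a', np.nan, 'c', 'd', 'a'])

numeric_summary = numbers.describe()
print(list(numeric_summary.index))   # count, mean, std, min, quartiles, max

text_summary = letters.describe()
# count excludes the missing value; top is the most frequent entry.
print(text_summary['count'], text_summary['unique'],
      text_summary['top'], text_summary['freq'])
```

Individual entries can be read with ordinary label indexing, for example `numeric_summary['50%']` for the median.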

*np.nan* is used to denote missing values. By default, the statistical methods implemented in pandas skip these values, which is not always the case when we are dealing with ndarrays. This behavior can be altered by including the *skipna=False* flag when calling a method.
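The difference in how missing values are treated can be seen with a small made-up example. Plain NumPy reductions propagate *np.nan*, NumPy's nan-aware variants such as *np.nanmean()* skip it, and pandas skips it unless *skipna=False* is passed.

```python
import numpy as np
import pandas as pd

raw = np.array([1.0, np.nan, 3.0])   # illustrative values only
series = pd.Series(raw)

print(np.mean(raw))                # nan: plain NumPy reductions propagate NaN
print(np.nanmean(raw))             # 2.0: the nan-aware variant skips NaN
print(series.mean())               # 2.0: pandas skips NaN by default
print(series.mean(skipna=False))   # nan: skipping disabled
```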

## Handling missing data with pandas

pandas has great support for missing data. For full documentation, [check this page](http://pandas.pydata.org/pandas-docs/dev/missing_data.html). Below are a few examples of how to work with missing data using pandas. First, we create a pandas DataFrame with 5 rows and 3 columns and fill it with random numbers:

In [None]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])

In [None]:
df

Next, we add two more columns, named 'four' and 'five':

In [None]:
df['four'] = 'bar'
df['five'] = df['one'] > 0

In [None]:
df

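Adding new rows is also simple. Below, *reindex()* returns a new DataFrame with three extra rows ('b', 'd' and 'g'); because these labels do not exist in *df*, their values are filled with NaN: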

In [None]:
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2
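A smaller, self-contained sketch of the same idea follows. The frame `small` and its values are hypothetical; the point is that *reindex()* leaves the original untouched and that the optional *fill_value* argument can replace the default NaN.

```python
import pandas as pd

small = pd.DataFrame({'x': [1.0, 2.0]}, index=['a', 'c'])

expanded = small.reindex(['a', 'b', 'c'])               # 'b' is new, so it becomes NaN
filled = small.reindex(['a', 'b', 'c'], fill_value=0)   # 'b' is new, filled with 0

print(expanded)
print(filled)
print(small.shape)   # the original frame still has two rows
```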

In [None]:
# This is one way to index a column in pandas
df2['one']

In [None]:
# This is one way to index a row in pandas
df2.loc['a']

pandas has two functions, *isnull()* and *notnull()*, that return a boolean object of the same shape as the input, marking which entries are missing (or present).

In [None]:
pd.isnull(df2['one'])

In [None]:
pd.notnull(df2['one'])
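Because the result is a boolean mask, it can be summed to count missing entries or used to filter them out. The sketch below assumes a small made-up Series named `column` rather than the random data above.

```python
import numpy as np
import pandas as pd

column = pd.Series([0.5, np.nan, -1.2, np.nan],
                   index=['a', 'b', 'c', 'd'])

missing_mask = pd.isnull(column)
n_missing = int(missing_mask.sum())     # True counts as 1
present = column[pd.notnull(column)]    # keep only the non-missing entries

print(n_missing)
print(present)
```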

Missing values propagate naturally through arithmetic operations between pandas objects.

In [None]:
a = df[['one', 'two']].copy()  # copy, so that the assignment below does not touch df
a.loc[['a', 'c'], 'one'] = float('nan')
a

In [None]:
b = df[['one','two']]
b

In [None]:
a * b

In [None]:
a['one'].dropna()

In pandas, summary statistics all account for missing values.

*   When summing data, NA (missing) values are treated as zero
*   If the data are all NA, the sum is 0 in pandas 0.22 and later (pass *min_count=1* to get NA instead); statistics such as the mean of all-NA data are still NA
*   Methods like cumsum and cumprod ignore NA values, but preserve them in the resulting arrays
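These rules can be checked on two small made-up Series. Note that the all-NA behaviour shown here is that of pandas 0.22 and later, where the sum of all-NA data is 0 unless *min_count* is given.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 2.0])     # illustrative values only
all_missing = pd.Series([np.nan, np.nan])

print(values.sum())                   # 3.0: the NaN is treated as zero
print(all_missing.sum())              # 0.0 in pandas 0.22 and later
print(all_missing.sum(min_count=1))   # nan: at least one valid value is required
print(values.cumsum().tolist())       # the NaN is skipped but stays in place
```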

In [None]:
a

In [None]:
a['one'].sum()

In [None]:
a.mean(axis=1)  # row-wise mean; NaN entries are skipped

*This is just the bare minimum. pandas has a lot more missing data functionality.*

## Histograms

#### Using pandas

Plotting histograms using pandas is quite straightforward. Using the above *values2* series, we can simply call the *hist()* method.

In [None]:
pdhist = values2.hist()

Parameters can be used to change the number of bins, color, transparency, etc.

In [None]:
pdhist2 = values2.hist(bins=20, color='r',alpha=0.4, figsize=(10,6))

#### Using NumPy + matplotlib

While the pandas data structure has a method that automatically wraps around a call to the *hist()* method of the plotting library Matplotlib, we can achieve the same result by performing that call manually on our *values1* ndarray.

In [None]:
import matplotlib.pyplot as plt # Required for plotting

In [None]:
nphist = plt.hist(values1)
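The numbers behind such a plot can also be obtained without drawing anything. The sketch below uses *np.histogram()* on freshly generated data (the seed and the name `sample` are arbitrary choices for this example) with 10 bins, which matches Matplotlib's default bin count.

```python
import numpy as np

rng = np.random.default_rng(0)                # seeded for reproducibility
sample = rng.integers(0, 500, size=10_000)    # integers in [0, 500)

counts, edges = np.histogram(sample, bins=10)

print(counts)   # number of values falling into each bin
print(edges)    # 11 edges delimit the 10 bins
```

Every value falls into exactly one bin, so the counts add up to the sample size.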

## Boxplots

#### Using pandas

pandas DataFrames have a boxplot method that allows you to visualize the distribution of values within each column.

In [None]:
df = pd.DataFrame(np.random.rand(10, 2), columns=['Col1', 'Col2'])  # rand() qualified so it works without %pylab

In [None]:
df.head()

In [None]:
box = df.boxplot(grid=False, return_type='axes')

#### Using NumPy + matplotlib

In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

x1 = np.random.normal(0,1,50)
x2 = np.random.normal(1,1,50)

npbox = ax.boxplot([x1,x2])
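The quantities a boxplot draws can be computed directly. The sketch below assumes Matplotlib's default whisker setting of 1.5 times the interquartile range, and uses a seeded sample named `x` that is separate from *x1* and *x2* above.

```python
import numpy as np

rng = np.random.default_rng(0)    # seeded for reproducibility
x = rng.normal(0, 1, 50)

q1, median, q3 = np.percentile(x, [25, 50, 75])   # box edges and centre line
iqr = q3 - q1

# Whiskers reach the most extreme data points inside these fences;
# anything beyond them is drawn as an individual outlier point.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = x[(x < lower_fence) | (x > upper_fence)]

print(q1, median, q3)
print(outliers)
```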

## Scatterplots

#### Using pandas

Let's define a DataFrame containing 2 columns, each with 200 random numbers drawn uniformly from [0, 1).

In [None]:
df = pd.DataFrame(np.random.rand(200, 2))  # rand() qualified so it works without %pylab

In [None]:
df.head()

In [None]:
pdscatter = df.plot.scatter(x=0, y=1)  # pandas' own scatter method; columns 0 and 1 as x and y

#### Using NumPy + matplotlib

In [None]:
x = np.random.randn(200)
y = np.random.randn(200)

fig = plt.figure()
ax = fig.add_subplot(111)

npscatter = ax.scatter(x,y,color='r')