# Practical Statistics for Data Scientists (Python)
# Chapter 1. Exploratory Data Analysis
> (c) 2019 Peter C. Bruce, Andrew Bruce, Peter Gedeck

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import trim_mean
%matplotlib inline

In [11]:
# Define paths to the data sets
AIRLINE_STATS_CSV = '../data/airline_stats.csv'
KC_TAX_CSV = '../data/kc_tax.csv.gz'
LC_LOANS_CSV = '../data/lc_loans.csv'
AIRPORT_DELAYS_CSV = '../data/dfw_airline.csv'
SP500_DATA_CSV = '../data/sp500_data.csv.gz'
SP500_SECTORS_CSV = '../data/sp500_sectors.csv'
STATE_CSV = '../data/state.csv'

# 1. Estimates of Location
## Example: Location Estimates of Population and Murder Rates

Compute the mean, trimmed mean, and median for Population. For `mean` and `median` we can use the _pandas_ methods of the data frame. The trimmed mean requires the `trim_mean` function in _scipy.stats_.

In [12]:
state = pd.read_csv(STATE_CSV)
state.info()
state.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50 entries, 0 to 49
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   State         50 non-null     object 
 1   Population    50 non-null     int64  
 2   Murder.Rate   50 non-null     float64
 3   Abbreviation  50 non-null     object 
dtypes: float64(1), int64(1), object(2)
memory usage: 1.7+ KB


| | State | Population | Murder.Rate | Abbreviation |
|---|---|---|---|---|
| 0 | Alabama | 4779736 | 5.7 | AL |
| 1 | Alaska | 710231 | 5.6 | AK |
| 2 | Arizona | 6392017 | 4.7 | AZ |
| 3 | Arkansas | 2915918 | 5.6 | AR |
| 4 | California | 37253956 | 4.4 | CA |


In [32]:
def trim_mean(s, f):
    '''
    Return the trimmed mean of a series.

    Note: this definition shadows the `trim_mean` imported from scipy.stats.

    Args:
        s: numeric pandas Series
        f (float): fraction to trim from each end, 0 <= f < 0.5
    Returns:
        mean of the values left after trimming
    '''
    assert 0 <= f < 0.5, 'f must be in [0, 0.5)'
    n = int(len(s) * f)
    # Sort first so that the n smallest and n largest values are dropped
    sorted_s = s.sort_values()
    # Slice with an explicit end index; s[n:-n] would be empty when n == 0
    trimmed_s = sorted_s.iloc[n:len(s) - n]
    return trimmed_s.mean()

In [38]:
trim_mean(state['Population'],f=0.1)

4783697.125

In [36]:
# mean of population
print('mean of population',state['Population'].mean())

# trimmed mean of population
print('trimmed mean population',trim_mean(state['Population'],0.1))

# median of population
print('median of population',state['Population'].median())

mean of population 6162876.3
trimmed mean population 4783697.125
median of population 4436369.5


# 2. Estimates of Variability

## Percentiles and Boxplots
_Pandas_ has the `quantile` method for data frames.
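
As a minimal sketch of `quantile`, the cell below rebuilds only the five rows shown by `state.head()` earlier in this notebook, so the numbers describe those five states and not the full data set. The variable names `state_head`, `percentiles`, and `murder_quantiles` are introduced here for illustration.

```python
import pandas as pd

# First five rows of the state data, copied from the state.head() output above
state_head = pd.DataFrame({
    'State': ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California'],
    'Murder.Rate': [5.7, 5.6, 4.7, 5.6, 4.4],
})

# quantile accepts a single fraction or a list of fractions between 0 and 1
percentiles = [0.05, 0.25, 0.5, 0.75, 0.95]
murder_quantiles = state_head['Murder.Rate'].quantile(percentiles)
print(murder_quantiles)
```

The result is a series indexed by the requested fractions, so `murder_quantiles.loc[0.5]` is the median of the five values.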

## Frequency Table and Histograms
The _pandas_ function `cut` splits a column into bins. It takes a number of arguments; passing an integer as `bins` creates that many equal-width bins. The method `value_counts` then returns a frequency table.
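
The binning step might look like the sketch below. It uses only the five population values visible in the `state.head()` output, and the choice of three bins is arbitrary, picked to keep the output short.

```python
import pandas as pd

# Population of the first five states, copied from the state.head() output above
population = pd.Series([4779736, 710231, 6392017, 2915918, 37253956],
                       name='Population')

# An integer `bins` asks pd.cut for that many equal-width intervals
binned = pd.cut(population, bins=3)

# Count the values per interval and list the intervals in order
freq = binned.value_counts().sort_index()
print(freq)
```

Because the bins have equal width and not equal counts, California's large population sits alone in the top interval while the other four states share the lowest one.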

## Density Estimates
A density plot is an alternative to a histogram that shows the distribution of the data points as a smooth curve. Use the argument `bw_method` to control the smoothness of the density curve.
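
A rough sketch of overlaying a density curve on a histogram follows. It again uses just the five murder rates from the `state.head()` output, and the values `bins=5` and `bw_method=0.5` are illustrative choices, not settings taken from this notebook. `plot.density` relies on _scipy_ being installed.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Murder rates of the first five states, copied from the state.head() output above
murder_rate = pd.Series([5.7, 5.6, 4.7, 5.6, 4.4], name='Murder.Rate')

fig, ax = plt.subplots(figsize=(4, 4))
# density=True rescales the histogram so it shares a y-axis with the density curve
murder_rate.plot.hist(density=True, bins=5, ax=ax)
# A larger bw_method gives a smoother curve, a smaller one follows the data more closely
murder_rate.plot.density(bw_method=0.5, ax=ax)
ax.set_xlabel('Murder.Rate')
```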

# 3. Exploring Binary and Categorical Data

# 4. Correlation
First, read the required data sets.

## Scatterplots
Simple scatterplots are supported by _pandas_. Specifying the marker as `$\u25EF$` uses an open circle for each point.
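
The marker option can be tried on any two numeric columns. The sketch below uses a small made-up data frame (the columns `x` and `y` and their values are invented for illustration and do not come from the data files).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy values made up for illustration; not taken from the data files
df = pd.DataFrame({'x': [0.1, -0.4, 0.3, 0.8, -0.2],
                   'y': [0.2, -0.3, 0.1, 0.9, -0.5]})

# '$\u25EF$' is rendered through matplotlib's mathtext as an open circle
ax = df.plot.scatter(x='x', y='y', figsize=(4, 4), marker='$\u25EF$')
ax.set_xlabel('x')
ax.set_ylabel('y')
```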

# 5. Exploring Two or More Variables
Load the `kc_tax` dataset and filter it based on a variety of criteria.

## Hexagonal binning and Contours 
### Plotting numeric versus numeric data

If the number of data points gets large, scatterplots become hard to read because the points overlap. Methods that visualize density are more useful here. The `hexbin` method for _pandas_ data frames is one powerful approach.
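
To show the call without depending on the data files, the sketch below runs `hexbin` on synthetic data. The columns `x` and `y`, the random seed, and `gridsize=30` are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic, correlated data generated only for illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=1000)})
df['y'] = df['x'] + rng.normal(scale=0.5, size=1000)

# Each hexagon is colored by the number of points that fall inside it;
# gridsize sets the number of hexagons along the x direction
ax = df.plot.hexbin(x='x', y='y', gridsize=30, figsize=(5, 4))
ax.set_xlabel('x')
ax.set_ylabel('y')
```

A smaller `gridsize` produces larger hexagons and a coarser picture, which can help when the data are sparse.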

## Two Categorical Variables
Load the `lc_loans` dataset.

## Categorical and Numeric Data
_Pandas_ boxplots of a column can be grouped by a different column.
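
A grouped boxplot can be sketched as follows. The data frame, with its `group` and `value` columns, is made up for illustration and stands in for whichever categorical and numeric columns are being compared.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Toy values made up for illustration; not taken from the data files
df = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b'],
    'value': [1, 2, 3, 4, 6, 8],
})

# `column` is the numeric variable, `by` the categorical variable to group on
ax = df.boxplot(column='value', by='group')
ax.set_xlabel('group')
ax.set_ylabel('value')
# Remove the automatic "Boxplot grouped by ..." figure title
plt.suptitle('')
```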

## Visualizing Multiple Variables