# Exploratory Data Analysis

## Setup

The following cell sets up the Colab environment. No changes are made if run locally.

In [None]:
# If running on Colab, set up the environment
import sys
if 'google.colab' in sys.modules:
    !pip install requests
    !mkdir -p /content/data
    %cd /content
    !wget -q https://raw.githubusercontent.com/olearydj/INSY7120/refs/heads/main/notebooks/common.py -O common.py

In all environments, use a function from the `common` module (downloaded above or already available locally) to download the required data files. They are placed in `./data`.

In [None]:
import common

STATE_CSV = common.download_gh_file('https://raw.githubusercontent.com/gedeck/practical-statistics-for-data-scientists/refs/heads/master/data/state.csv')

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
from scipy.stats import trim_mean
from statsmodels import robust
import wquantiles

import seaborn as sns
import matplotlib.pyplot as plt

## Estimates of Location

### Location Estimates of Population and Murder Rates

Load the data using the _pandas_ `read_csv` function and look at the first eight rows.

In [None]:
state = pd.read_csv(STATE_CSV)
print(state.head(8))

Get a sense of the data: its dimensions, column types, and summary statistics.

In [None]:
# shape and size
print("ndim:", state.ndim)
print("shape:", state.shape)

In [None]:
# data types
state.dtypes

In [None]:
# general information
state.info()

In [None]:
print(state.describe())

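Compute the mean, trimmed mean, and median of `Population`. For the mean and median we can use the _pandas_ methods `mean` and `median` on the data frame column. The trimmed mean requires the `trim_mean` function from _scipy.stats_; its second argument (here 0.1) is the fraction of values cut from each end of the sorted data.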

In [None]:
print(state['Population'].mean())

In [None]:
print(trim_mean(state['Population'], 0.1))

In [None]:
print(state['Population'].median())
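To make the trimming step concrete, here is a minimal sketch of a trimmed mean written by hand and compared against `trim_mean`. The helper `manual_trim_mean` and the toy data are hypothetical and are not part of the course code or of `state.csv`.

```python
import numpy as np
from scipy.stats import trim_mean


def manual_trim_mean(values, proportion):
    """Drop the lowest and highest `proportion` of the sorted values, then average."""
    x = np.sort(np.asarray(values, dtype=float))
    k = int(proportion * len(x))  # number of values cut from each end
    return x[k:len(x) - k].mean()


# Hypothetical toy data with one large outlier (not taken from state.csv)
toy = [2, 3, 4, 5, 6, 7, 8, 9, 10, 1000]

print("plain mean:   ", np.mean(toy))
print("manual trim:  ", manual_trim_mean(toy, 0.1))
print("scipy trim:   ", trim_mean(toy, 0.1))
```

With ten values and a proportion of 0.1, one value is removed from each end, so the outlier no longer affects the result.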

For `Population`, the three estimates are ordered: mean > trimmed mean > median.

Why?
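One way to explore the question is with a small right-skewed sample, where a few large values pull the mean upward more than they pull the trimmed mean or the median. The sample below is made up for illustration and only assumes that the population data has a similar shape.

```python
import numpy as np
from scipy.stats import trim_mean

# Hypothetical right-skewed sample: many small values and a few very large ones
sample = np.array(
    [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 10, 15, 20, 30, 40, 60, 90, 150, 400],
    dtype=float,
)

mean = sample.mean()
trimmed = trim_mean(sample, 0.1)  # cuts the 2 smallest and 2 largest values
median = np.median(sample)

print("mean:        ", mean)
print("trimmed mean:", trimmed)
print("median:      ", median)
```

The mean uses every value, the trimmed mean ignores the extremes, and the median depends only on the middle of the sorted data, so each is progressively less sensitive to the long right tail.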

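The weighted mean is available in NumPy through `np.average` with the `weights` argument. For the weighted median, we can use the specialised package `wquantiles` (https://pypi.org/project/wquantiles/). Here each state's murder rate is weighted by its population.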

In [None]:
print(state['Murder.Rate'].mean())

In [None]:
print(np.average(state['Murder.Rate'], weights=state['Population']))

In [None]:
print(wquantiles.median(state['Murder.Rate'], weights=state['Population']))
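The following sketch shows what the weighted estimates compute, using plain NumPy. The helpers `weighted_mean` and `weighted_median` and the toy numbers are hypothetical. The weighted median here uses a simple rule (the smallest value whose cumulative weight reaches half the total), so it may differ slightly from `wquantiles.median` if that package interpolates between neighbouring values.

```python
import numpy as np


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return (values * weights).sum() / weights.sum()


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half the total weight."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(values)  # sort values, carrying weights along
    values, weights = values[order], weights[order]
    cum = np.cumsum(weights)
    idx = np.searchsorted(cum, 0.5 * cum[-1])
    return values[idx]


# Hypothetical rates and weights (not taken from state.csv)
rates = [1.0, 2.0, 3.0, 10.0]
pops = [1, 1, 1, 7]

print("unweighted mean:", np.mean(rates))
print("weighted mean:  ", weighted_mean(rates, pops))
print("np.average:     ", np.average(rates, weights=pops))
print("weighted median:", weighted_median(rates, pops))
```

Because the last value carries most of the weight, both weighted estimates move toward it, just as large states dominate the population-weighted murder rate.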