# Exploratory Data Analysis in Python

## What we'll cover

* Statistical Concepts
* Arrays and Matrices
* Dataframes and Series
* Statistical Modeling


We'll extend our work on ETL processes from Part 3 into the realm of analysis. In practice, bringing in data and analyzing it overlap significantly, and both are often done in as near real time as possible. Below we lean heavily on the work of Wes McKinney. If you have a chance, reach out to Wes and thank him, and if you're feeling generous, buy his book (see the syllabus; I highly recommend it).

<hr>

## Statistical Concepts

<hr>

An understanding of statistical modeling is essential before using Python to process your data. Here are a few key concepts to review:

* Hypothesis - a testable but as yet unproven statement about the problem being investigated
* Model - a simplification of the problem's reality so that relationships between variables can be studied
* Population - the collection of all data we are interested in studying
* Sample - a subset of a population
* Descriptive Stats - summaries of the patterns that exist in the data
* Inferential Stats - estimates or predictions about the population based on analysis of a sample
* Accuracy - how close a measurement is to the true or accepted value
* Precision - how closely repeated measurements agree with each other
* Validity - how well a variable measures what it is intended to measure
* Reliability - how consistent a measurement is across repeated observations or over time
* Mean - the arithmetic average of the values
* Median - the middle value when the data are sorted
* Mode - the most frequent value
* Central Tendency - the clustering of values around a typical value (such as the mean, median or mode)
* Variability - the spread and dispersion of data values
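
To make a few of these terms concrete, here is a minimal sketch that uses Python's built-in `statistics` module. The `sample` values are made up for illustration and do not come from any data set used in this lesson.

```python
import statistics

# Hypothetical sample: made-up daily visitor counts, for illustration only
sample = [12, 15, 15, 18, 21, 24, 15, 30]

mean = statistics.mean(sample)        # arithmetic average
median = statistics.median(sample)    # middle value of the sorted data
mode = statistics.mode(sample)        # most frequent value
spread = statistics.stdev(sample)     # sample standard deviation (variability)

print(f"mean={mean}, median={median}, mode={mode}, stdev={spread:.2f}")
```

The mean, median and mode are three different measures of central tendency, and they disagree here because a single large value (30) pulls the mean upward.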

<hr>

## NumPy

<hr>

In [None]:
# import numpy

import numpy as np

### NumPy ndarray

In [None]:
# Here is a sum of a simple list

a = list(range(1000000))

%timeit sum(a)

In [None]:
# now here is the advantage of using a NumPy array

b = np.array(a)

%timeit np.sum(b)
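
`%timeit` is an IPython magic, so it only works inside a notebook or IPython shell. As a rough sketch of the same comparison in a plain Python script, the standard library's `timeit` module can be used instead. The variable names and the repeat count of 20 are arbitrary choices for this example, and the timings will vary by machine.

```python
import timeit

import numpy as np

values = list(range(100_000))              # plain Python list
array = np.array(values, dtype=np.int64)   # same data as an ndarray

# Run each summation 20 times and report the total elapsed seconds
list_seconds = timeit.timeit(lambda: sum(values), number=20)
array_seconds = timeit.timeit(lambda: np.sum(array), number=20)

print(f"list sum:  {list_seconds:.4f} s")
print(f"array sum: {array_seconds:.4f} s")
```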

In [None]:
# create a NumPy array with the np.array() function
# then broadcast a math operation across every element

a = np.array([1, 2, 3, 4])

a * 2

In [None]:
# an ndarray is a homogeneous data container: every element shares one dtype

a.dtype

In [None]:
# determine the dimensions of your array with .shape
# .shape returns a tuple with one entry per dimension
# in this case a 1D array with 4 elements, so the tuple is (4,)

a.shape
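
As a small illustration of how `.shape` reads for different dimensions, the sketch below compares a 1D and a 2D array. The array names are made up for this example.

```python
import numpy as np

one_d = np.array([1, 2, 3, 4])
two_d = np.array([[1, 2, 3], [4, 5, 6]])

print(one_d.shape)   # one entry: the number of elements
print(two_d.shape)   # two entries: (rows, columns)
print(two_d.ndim)    # number of dimensions
```

For a 2D array the tuple is ordered rows first, then columns.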

In [None]:
# you can create empty or zero filled arrays with a tuple

array_zeros = np.zeros((4, 4))
array_zeros

In [None]:
# np.empty allocates an array without initializing it, so the values are arbitrary
# here we get a 3D array: 2 blocks, each with 2 rows and 2 columns

array_empty = np.empty((2, 2, 2))
array_empty

In [None]:
# numpy can handle many data types

array_dtype = np.array([1, 2, 3, 4], dtype=np.int64)
array_dtype.dtype

In [None]:
# vectorization allows numpy to express operations without for loops

array_demo = np.array([[1, 2, 3], [4, 5, 6]])
array_demo * array_demo
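
Element-wise operations also work between arrays of different shapes, as long as the shapes are compatible. A minimal sketch of this broadcasting behavior, with made-up values:

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([1, 2, 3, 4])           # shape (4,)

# The shapes (3, 1) and (4,) broadcast to a common shape of (3, 4):
# every value in `column` is added to every value in `row`
grid = column + row

print(grid.shape)
print(grid)
```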

### Indexing and Slicing

In [None]:
# numpy has its own range function

array_range = np.arange(10)
array_range

In [None]:
# we can slice and index just like we learned in Part 2

array_range[::2]

In [None]:
# slices are views, so changes applied to a slice also change the original array

array_range_slice = array_range[5:8]
array_range_slice[:] = 42
array_range
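
Because a slice shares memory with the array it came from, it is worth knowing how to get an independent copy. The sketch below contrasts the two cases; the names `view` and `independent` are just for this example.

```python
import numpy as np

original = np.arange(10)

view = original[5:8]    # a slice is a view onto the same memory
view[:] = 42            # this also changes `original`

independent = original[:3].copy()   # .copy() makes a separate array
independent[:] = -1                 # `original` is not affected

print(original)
```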

In [None]:
# we can apply this to higher dimensional arrays

array_demo_2D = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
array_demo_2D[0][2]  # access item in 3rd column, 1st row

In [None]:
array_demo_3D = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
array_demo_3D[0, 1, 0]  # access the 1st element of the 2nd row in the 1st block

In [None]:
# comparison operators return a Boolean array of the same shape

array_demo_3D > 6
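
A Boolean array like this is most useful as a mask for selecting or changing elements. Here is a short sketch with a made-up 2D array:

```python
import numpy as np

values = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask = values > 6          # Boolean array with the same shape as `values`

selected = values[mask]    # 1D array of the elements where the mask is True
count = mask.sum()         # True counts as 1, so this counts the matches

# assignment through a mask changes only the matching elements
capped = values.copy()
capped[capped > 6] = 6

print(selected, count)
print(capped)
```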

In [None]:
# we can use reshape to provide dimension for an array

array_reshape = np.arange(24).reshape((8, 3))
array_reshape

In [None]:
# we can then transpose our array
# this is very useful for matrix math

array_reshape.T

In [None]:
# we can use .dot() for inner matrix products
# ORDER MATTERS! try swapping the order of the transpose

np.dot(array_reshape.T, array_reshape)
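
To see why the order matters, it helps to look at the shapes involved. This sketch rebuilds the same 8 x 3 array under a new name and checks both orderings; the `@` operator is used as an equivalent of `np.dot` for 2D arrays.

```python
import numpy as np

a = np.arange(24).reshape((8, 3))

small = np.dot(a.T, a)   # (3, 8) times (8, 3) gives (3, 3)
large = a @ a.T          # (8, 3) times (3, 8) gives (8, 8)

print(small.shape, large.shape)

# np.dot(a, a) would raise a ValueError because the inner
# dimensions (3 and 8) do not match
```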

### Functions

In [None]:
# NumPy has many ufuncs (universal functions) for fast element-wise operations
# unary ufuncs take one array, much like the functions in the standard math module

array_funcs = np.arange(10)
np.sqrt(array_funcs)

In [None]:
# binary ufuncs take two arrays and return one

array_funcs_x = np.random.randn(10)
array_funcs_y = np.random.randn(10)
# return largest number per element between each array index
np.maximum(array_funcs_x, array_funcs_y)

### Data Processing

In [None]:
# let's continue with examples of vectorization and broadcasting
# these are two reasons why numerical computation with NumPy is great

# create an array of 1000 equally spaced points
points = np.arange(-5, 5, 0.01)

# create a meshgrid using our 1D points array twice
xs, ys = np.meshgrid(points, points)

# now we can evaluate the function sqrt(x^2 + y^2)
z = np.sqrt(xs ** 2 + ys ** 2)
# time the computation itself; timing the bare name z would only measure a variable lookup
%timeit np.sqrt(xs ** 2 + ys ** 2)
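
The output of `np.meshgrid` is easier to follow on a tiny grid. The sketch below uses just three made-up points so the resulting arrays can be read by eye.

```python
import numpy as np

points = np.array([-1.0, 0.0, 1.0])
xs, ys = np.meshgrid(points, points)

# xs repeats `points` in every row; ys repeats `points` down every column
print(xs)
print(ys)

# distance of each grid point from the origin, computed without a loop
z = np.sqrt(xs ** 2 + ys ** 2)
print(z)
```

Each position `[i, j]` in `xs` and `ys` together gives the coordinates of one grid point, which is what lets the expression for `z` cover the whole grid at once.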

In [None]:
# numpy.where can replace `x if condition else y` statements for any scale array

cond = np.array([True, True, False, False, False, True, False, False, True, False])
# create a resultant array
# where a value is taken from x where condition is True
np.where(cond, array_funcs_x, array_funcs_y)

In [None]:
# we can also call math functions on arrays, such as mean and standard deviation

rand_2d_array = np.random.randn(5, 4)
print("mean: " + str(rand_2d_array.mean()) + " - std: " + str(rand_2d_array.std()))

In [None]:
# you can apply these along a given axis (axis=1 gives one mean per row)

rand_2d_array.mean(axis=1)
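
The `axis` argument is easier to check with fixed numbers than with random ones. A minimal sketch, using a made-up 2 x 4 array:

```python
import numpy as np

table = np.array([[1, 2, 3, 4],
                  [5, 6, 7, 8]])   # 2 rows, 4 columns

overall = table.mean()             # one value for the whole array
per_column = table.mean(axis=0)    # collapse the rows: one mean per column
per_row = table.mean(axis=1)       # collapse the columns: one mean per row

print(overall, per_column, per_row)
```

A useful way to remember it: the axis you pass is the one that disappears from the result.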

In [None]:
# for Boolean arrays, you can use .any() and .all() to check for True

cond.any()

In [None]:
# we can sort our values using .sort()

array_funcs_x.sort(0)  # sort our 1D array in ascending order
array_funcs_x

In [None]:
# We can use np.unique() to create a sorted array of unique values from an array

array_funcs_unique = np.array([1, 2, 3, 3, 2, 3, 1, 2, 2, 1])
np.unique(array_funcs_unique)

In [None]:
# you can use np.isin to test membership of values in one array
# (np.in1d does the same for 1D arrays but is deprecated in NumPy 2.0)

np.isin(array_funcs_unique, [1, 3, 4, 5])

### File I/O

In [None]:
# use np.save() and np.load() to write and read data
# data is saved as raw binary data files with the .npy extension

np.save('data/meshgrid_proof_out', z)

In [None]:
np.load('data/meshgrid_proof_out.npy')
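
The cells above assume that a `data/` folder already exists. As a self-contained sketch of the same round trip, the example below writes to a temporary folder instead; the file name `example_array` is made up for this example.

```python
import os
import tempfile

import numpy as np

data = np.arange(6).reshape((2, 3))

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example_array")   # hypothetical file name
    np.save(path, data)                    # NumPy appends the .npy extension
    restored = np.load(path + ".npy")      # read the array back from disk

print(np.array_equal(data, restored))
```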

In [None]:
# You can also use np.loadtxt and np.genfromtxt...
# but this is much easier to do in Pandas :)

<hr>

## Pandas

<hr>

In [None]:
# import pandas

import pandas as pd

### Series

In [None]:
# A Series is like a 1D array with an index (or data labels)

series_example = pd.Series([10, 12, 13, 24])
series_example  # index is on the left, data values on the right

In [None]:
# if you want to see the array representation, use .values
series_example.values

In [None]:
# use .index to see the index
series_example.index

In [None]:
# you can assign an index
series_example_2 = pd.Series([10, 11, 12, 13], index=['a', 'b', 'c', 'd'])
series_example_2

In [None]:
# use math operators
series_example_2 * 3

In [None]:
# use Boolean operators to examine Series
series_example_2 > 11

In [None]:
# create a Series from a dictionary
series_example_3 = pd.Series({'Pat': 100, 'Tad': 100, 'Anna': 101})
series_example_3

In [None]:
# check for null values in your data with .isnull() and .notnull()
pd.notnull(series_example_3)

In [None]:
# assign a name to the Series
series_example_3.name = 'names'
series_example_3

In [None]:
# assign a new index to Series
series_example_3.index = ['Matt', 'Lynn', 'Lori']
series_example_3

### DataFrame

In [None]:
# a dataframe can be thought of as a dictionary of series
# each series is homogeneous
# a common method of building dataframes is from a dictionary of equal-length lists

data = {
    'places': ['Lexington', 'Louisville', 'Bowling Green'],
    'established': [1782, 1778, 1798],
    'population': [321959, 771158, 58067]
}
# create dataframe from data. there is the option to identify columns and an index
ky_cities = pd.DataFrame(data, columns=['places', 'established', 'population'], index=[1, 2, 3])
ky_cities

In [None]:
# You can examine the columns of a dataframe
ky_cities.columns

In [None]:
# you can examine columns as a series using ['column_name'] or .column_name
ky_cities['places']

In [None]:
# you can pull rows using .loc[label] and .iloc[position]
ky_cities.iloc[0]
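
The difference between labels and positions matters here because the index starts at 1 while positions start at 0. The sketch below rebuilds a smaller version of the dataframe under the name `cities` to show both lookups side by side.

```python
import pandas as pd

cities = pd.DataFrame(
    {
        'places': ['Lexington', 'Louisville', 'Bowling Green'],
        'population': [321959, 771158, 58067],
    },
    index=[1, 2, 3],
)

by_label = cities.loc[1]        # the row whose index label is 1
by_position = cities.iloc[0]    # the first row, whatever its label

print(by_label['places'], by_position['places'])

# .loc also accepts a row label and a column name together
print(cities.loc[3, 'population'])
```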

In [None]:
# add a column; assigning a single scalar fills every row with that value
ky_cities = pd.DataFrame(ky_cities, columns=['places', 'established', 'population', 'rand'])
ky_cities['rand'] = np.random.randn()
ky_cities

In [None]:
# delete a column
del ky_cities['rand']
ky_cities

In [None]:
# we can transpose the dataframe
ky_cities.T

In [None]:
# you can also pull the dataframe's values
ky_cities.values

### Indexing

In [None]:
# you can create an index object for a dataframe
test_index = pd.Index(np.arange(10))
test_index

In [None]:
# once created, indexes are immutable
# the assignment below raises a TypeError
ky_cities.index[0] = 'one'

### Functionality

In [None]:
# you can reindex a series
places_ky = ky_cities['places']
places_ky = places_ky.reindex([2, 3, 1])
places_ky

In [None]:
# you can drop entries with .drop
places_ky_2 = places_ky.drop(3)
places_ky_2

In [None]:
# you can index much like NumPy, using either positions or labels
# .iloc slices by position (endpoint excluded); .loc slices by label (endpoint included)
places_ky.iloc[1:2]
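
Label-based and position-based slices treat the endpoint differently, which is easy to miss. This sketch uses a made-up Series with string labels so that the two kinds of slice cannot be confused.

```python
import pandas as pd

places = pd.Series(
    ['Louisville', 'Bowling Green', 'Lexington'],
    index=['b', 'c', 'a'],   # hypothetical labels for illustration
)

label_slice = places.loc['b':'c']    # by label: the endpoint 'c' is included
position_slice = places.iloc[0:1]    # by position: the endpoint 1 is excluded

print(list(label_slice))
print(list(position_slice))
```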

In [None]:
# indexing into a dataframe works much the same way
ky_cities.loc[1, 'population']  # population for Lexington (row label 1)

In [None]:
# you can use math operators on the dataframe columns
ky_cities.population * 2

In [None]:
# you can also use lambdas and apply() to apply changes across a series
# you can also use map() and applymap()
lambda_func = lambda x: x + 200
ky_cities.established.apply(lambda_func)

In [None]:
# we can sort a series by its index with sort_index()
ky_cities.places.sort_index()

In [None]:
# we can order values using sort_values(by='column_name')
ky_cities.sort_values(by='population')

In [None]:
# we can establish rank in data using .rank()
ky_cities.rank()

In [None]:
# you can check whether all values in a series are unique
# (use ky_cities.index.is_unique to check the index itself)
ky_cities.places.is_unique

### Descriptive Stats

In [None]:
# we can use .describe to get stats for a dataframe or series
ky_cities.describe()

There are several summary stats for dataframes and series. Refer to the Pandas documentation for further details.
Here are a few:

* sum (sum of values)
* cumsum (cumulative sum of values)
* mean (average)
* median (median)
* var (variance)
* std (standard deviation)
* skew (skewness)
* kurt (kurtosis)
* pct_change (percent changes)

In [None]:
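# we can calculate correlation (strength of the linear relationship between two variables, from -1 to 1)
ky_census = pd.read_csv('data/census_2010_ky.csv')
# numeric_only=True skips text columns, which recent pandas versions no longer drop automatically
ky_census.corr(numeric_only=True)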

In [None]:
# we can also calculate covariance (how two variables vary together)
ky_census.cov(numeric_only=True)
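
Correlation and covariance are easier to interpret on data where the answer is known in advance. The sketch below uses a made-up dataframe in which `y` rises exactly with `x` and `z` falls exactly with `x`.

```python
import pandas as pd

frame = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],   # y = 2 * x, a perfect positive relationship
    'z': [5, 4, 3, 2, 1],    # z = 6 - x, a perfect negative relationship
})

corr = frame.corr()   # unit-free, always between -1 and 1
cov = frame.cov()     # depends on the scale of each variable

print(corr)
print(cov)
```

The correlation between `x` and `y` comes out at 1 and between `x` and `z` at -1, while the covariances change if you rescale a column. That scale dependence is why correlation is usually the easier of the two to compare across variables.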

In [None]:
# There is also a .unique method for series
ky_census.geoid.unique()

In [None]:
# value_counts tallies how often each value occurs
ky_census.label.value_counts()

In [None]:
# we can also test for membership with isin
ky_census.geoid.isin(['21005', '21023'])

### Missing Data

In [None]:
# we can check for missing data with isnull() or notnull()
missing_data = pd.Series([345, 567, np.nan, 890])
missing_data.isnull()

In [None]:
# to drop null values, use dropna()
missing_data.dropna()

In [None]:
# we can also fill null values with fillna()
missing_data.fillna(0)
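
Filling with 0 is not always appropriate, since it can drag down averages. One common alternative is to fill with the mean of the known values, sketched here on the same four numbers under a new variable name. Whether this is a sensible choice depends on your data.

```python
import numpy as np
import pandas as pd

readings = pd.Series([345, 567, np.nan, 890])

# .mean() skips NaN by default, so this is the mean of the three known values
known_mean = readings.mean()
filled = readings.fillna(known_mean)

# dropna() returns a new Series and leaves `readings` unchanged
dropped = readings.dropna()

print(filled.tolist())
print(len(dropped), len(readings))
```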

<hr>

## StatsModels

<hr>

There is much to cover with statistical modeling in Python. StatsModels builds on SciPy and NumPy functionality. Here we will fit a simple regression on our census data to see how strongly the single-father variable (`sindads`) for Kentucky counties depends on the single-mother variable (`sinmoms`) for the same counties, with `totpop` and `medage` included as additional predictors.

In [None]:
# import Pandas and statsmodels
import pandas as pd
import statsmodels.formula.api as smf

In [None]:
# read in our census data set
census_2010_ky = pd.read_csv('data/census_2010_ky.csv')
census_2010_ky.head(2)

In [None]:
# examine our descriptive stats
census_2010_ky.describe()

In [None]:
# use an ordinary least squares (OLS) model to fit the relationship between single moms and single dads
# totpop and medage are included as additional predictors
model = smf.ols('sindads ~ sinmoms + totpop + medage', census_2010_ky).fit()
model.summary()
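
To get a feel for what an ordinary least squares fit does under the hood, here is a rough sketch that uses plain NumPy (`np.linalg.lstsq`) rather than StatsModels. The data are synthetic, generated with a known intercept of 3 and slope of 2, and have nothing to do with the census file. This is only an illustration of the idea, not how StatsModels is implemented.

```python
import numpy as np

# Synthetic data (an assumption for this example): y = 3 + 2x plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 + 2.0 * x + rng.normal(0, 0.5, size=50)

# Design matrix with a column of ones for the intercept
design = np.column_stack([np.ones_like(x), x])

# Least squares picks the coefficients that minimize the squared residuals
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slope = coeffs

# R-squared: the share of the variance in y explained by the fit
predicted = design @ coeffs
ss_res = np.sum((y - predicted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"intercept={intercept:.2f}, slope={slope:.2f}, R^2={r_squared:.3f}")
```

The estimated intercept and slope should land close to the values used to generate the data. The coefficient table and R-squared in `model.summary()` report the same kinds of quantities, along with standard errors and test statistics.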

<hr>

## Additional Materials (for future versions)

<hr>

* [Pandas Crosstab](http://pbpython.com/pandas-crosstab.html)


<hr>

## Resources

<hr>

**Note:** A lot of the open-source materials are provided by people who develop those materials for a living. So please consider sending them a thank you and, if you can, a few bucks to support their efforts. Thanks! :)

* [NumPy Docs](https://docs.scipy.org/doc/numpy/)
* [Pandas Docs](https://pandas.pydata.org/pandas-docs/stable/)
* [Pandas Tricks and Features](https://realpython.com/python-pandas-tricks/)
* [Speed Up Pandas](https://realpython.com/fast-flexible-pandas/)
* [statsmodels](http://www.statsmodels.org/stable/index.html)
* [SciPy Lectures](http://www.scipy-lectures.org/)
* [SciKit and Pandas](https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62)
* [Python for Data Analysis by Wes McKinney](http://wesmckinney.com/pages/book.html)
* [An Introduction to Statistics with Python by Thomas Haslwanter](https://github.com/thomas-haslwanter/statsintro_python)