## Pandas

Python's built-in data structures (lists, dictionaries, etc.) are suitable for many purposes, but complex calculations repeated over and over again will often be slow in pure Python. *Numpy* was developed to provide data structures (called arrays) that are much faster in mathematical operations, in large part because all the elements in an array have the same data type. These numbers often lack the context to make the content meaningful, though. *Pandas* was developed on top of numpy to create data structures that facilitate rapid calculation on numerical data while preserving the context (such as labels for rows and columns) that allows for the selection and manipulation of that data.

Pandas is not included by default in Python, but can be installed and used like any other module in Python. In the documentation, pandas is usually abbreviated as pd in the import statement.

In [2]:
import pandas as pd

### Series

The fundamental one-dimensional unit of pandas data is the *Series*. A series will have values that are all the same data type, and will also have index values associated with the data values. The index defaults to a series of integers (0,1,2...), but alternate values can be provided.

In [None]:
numbers = pd.Series([12.3,-4,8,2.7])
print(numbers)

In [None]:
ages = pd.Series([24,18,31,25],index=["Rick","George","Bob","Freddie"])
print(ages)

Series can be subsetted by index, by slice or by boolean series.

In [None]:
print(numbers[1])
print(ages["Bob"])
print(ages[["George","Rick","George"]])
print(ages["George":"Freddie"])
print(ages[ages<30])
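
The difference between selecting by label and selecting by position can be made explicit with the *loc* and *iloc* attributes. The following sketch rebuilds the ages series from above. Note that a slice by label includes both endpoints, while a slice by position excludes the stop position, as with a list.

```python
import pandas as pd

ages = pd.Series([24, 18, 31, 25], index=["Rick", "George", "Bob", "Freddie"])

# Label-based slice: both endpoints ("George" and "Freddie") are included
by_label = ages.loc["George":"Freddie"]

# Position-based slice: the stop position (3) is excluded, as with a list
by_position = ages.iloc[1:3]

# Boolean selection keeps only the entries where the condition is True
under_30 = ages.loc[ages < 30]

print(by_label)
print(by_position)
print(under_30)
```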

Series values can be used in mathematical operations. Each element of the series will be used in the operation, without the need to set up an explicit loop.

In [None]:
print(numbers*0.3)

In [None]:
print(ages+10)
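
To make the "no explicit loop" point concrete, the following sketch compares the vectorized multiplication with an equivalent list comprehension. The loop version is shown only for comparison and is not something you would normally write.

```python
import pandas as pd

numbers = pd.Series([12.3, -4, 8, 2.7])

# Vectorized form: the multiplication is applied to every element at once
scaled = numbers * 0.3

# Equivalent explicit loop, shown only for comparison
looped = pd.Series([value * 0.3 for value in numbers], index=numbers.index)

# Both approaches should produce the same series
print(scaled.equals(looped))
```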

Series can be used in mathematical operations with other series. In this case, the operations are aligned by index value, meaning the operation will be performed for each index value in the pair of series. If the two series have different values in their indices, this will lead to NaN (not a number) values for every index value that is unique to one of the series. As such, it's helpful when performing mathematical operations with multiple series if they have a common index.

In [None]:
print(numbers*numbers)

In [None]:
print(numbers+ages)
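
As a smaller illustration of index alignment, the sketch below uses two hypothetical series whose indices only partly overlap. The names and values are made up for the example. The *add* method with the fill_value argument is one way to avoid NaN results when a label is missing from one of the series.

```python
import pandas as pd

# Hypothetical scores: "Bob" is missing from round2, "Freddie" from round1
round1 = pd.Series([10, 7, 3], index=["Rick", "George", "Bob"])
round2 = pd.Series([4, 6, 9], index=["George", "Rick", "Freddie"])

# Values are matched by index label, not by position
total = round1 + round2

# fill_value treats a label missing from one series as 0 instead of giving NaN
total_filled = round1.add(round2, fill_value=0)

print(total)
print(total_filled)
```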

### Data Frames

It will often be convenient, when performing calculations on different sets of numbers, to create sets of series that share a common index. The series will then have the same length, and mathematical operations between the series will be aligned appropriately. In pandas, this structure is called a *data frame*, and is the most common data structure used in data analyses in pandas. In addition to the common index, each series will also have a unique name and its own data type. This arrangement of data fits nicely into the tidy data paradigm, with each series representing a column of values corresponding to a particular variable (usually the name of the series), and each value in the index corresponding to a different observation or row.

While there is a constructor, *DataFrame*, that can be used to create a data frame from individual series, it's more common to compile data in another file and read it into Python. There are pandas functions for reading data from different file types and other sources. Reading of data from text files is generally based on the CSV (comma-separated values) format, which can be written by hand or exported from many different programs.
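
For completeness, here is a minimal sketch of building a data frame directly from series with *DataFrame*. The heights series and the column names are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical ages and heights (in cm) sharing the same index labels
ages = pd.Series([24, 18, 31, 25], index=["Rick", "George", "Bob", "Freddie"])
heights = pd.Series([180.5, 175.0, 169.2, 177.8],
                    index=["Rick", "George", "Bob", "Freddie"])

# Each dictionary key becomes a column name; rows are aligned on the shared index
people = pd.DataFrame({"age": ages, "height": heights})

print(people)
print(people.dtypes)  # each column keeps its own data type
```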

The following code block reads data from a file that is *not* in CSV format, but still uses the read_csv function to read it. Several optional arguments are provided to indicate how the data is formatted.

In [None]:
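# sep=r"\s+" splits columns on whitespace; it replaces the older
# delim_whitespace=True argument, which recent pandas versions deprecate
gas = pd.read_csv("gasoline.data", sep=r"\s+",
                  index_col=0, parse_dates=True)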
print(gas)
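
Since the data file itself may not be at hand, the following sketch reads a small whitespace-delimited table from a string instead. The column names follow those used later in this section, but the rows are invented and are not taken from the actual gasoline data. Passing sep=r"\s+" is one way to tell read_csv to split on whitespace.

```python
import io
import pandas as pd

# Invented whitespace-delimited text standing in for a data file
text = """date        fuel        station  ppg    mpg
2021-01-05  Unleaded87  A        2.199  27.4
2021-01-12  E10         B        2.099  26.1
2021-01-19  Unleaded87  B        2.249  28.0
"""

# sep splits on runs of whitespace; index_col=0 uses the first column as the
# index, and parse_dates=True converts that index to dates
sample = pd.read_csv(io.StringIO(text), sep=r"\s+", index_col=0, parse_dates=True)

print(sample)
print(sample.index.dtype)
```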

### Subsetting

When analyzing data in a pandas data frame, it is often helpful to generate subsets of the data for further processing. Pandas series and data frames can be treated much like lists or dictionaries, in that the same subsetting notation (square brackets) is used for all of them. Series and data frames also have additional syntax that allows for more sophisticated selection of data.

Providing a data frame with the name of a series in the data frame will return the series.

In [None]:
print(gas["fuel"])

Subsets of data can be selected by row identifiers (the index), column identifiers (names), and/or by boolean series for rows or columns. When using boolean series, rows or columns that have a True value in the series will be returned in the subset. The first code block below shows the boolean series on its own; the second uses the *loc* attribute with that series to select the particular subset we are interested in.

In [None]:
print(gas["mpg"]>27)

In [None]:
print(gas.loc[gas["mpg"]>27,:])
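
The *loc* attribute can also restrict the columns that are returned and can combine several conditions. The sketch below uses a small invented data frame as a stand-in for the gasoline data, so the values are assumptions rather than real measurements.

```python
import pandas as pd

# Invented stand-in for the gasoline data
sample = pd.DataFrame({
    "fuel": ["Unleaded87", "E10", "Unleaded87", "E10"],
    "station": ["A", "A", "B", "B"],
    "ppg": [2.199, 2.099, 2.249, 2.149],
    "mpg": [27.4, 26.1, 28.0, 26.8],
})

# Rows where mpg exceeds 27, restricted to two named columns
efficient = sample.loc[sample["mpg"] > 27, ["fuel", "ppg"]]

# Conditions are combined with & (and) or | (or); each needs its own parentheses
cheap_e10 = sample.loc[(sample["fuel"] == "E10") & (sample["ppg"] < 2.1), :]

print(efficient)
print(cheap_e10)
```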

### Grouping and Aggregating Data

Data in a "tidy" format, with categories listed in different columns, often need to be separated into categories in the process of analyzing the data. This can be accomplished fairly readily in pandas data frames with a *groupby* method, which takes as its main argument the column (or columns, if provided with a list) that contain the categories of the data. The method returns an object that contains the groups into which the data is collected. This grouped object can be used like a list in looping structures, or it can be analyzed with *aggregating* methods that perform simple calculations on the data.

The following code block groups the gasoline data by fuel and station, and calculates the mean of the price per gallon for every combination of fuel and station. To showcase another feature of pandas data frames, the resulting aggregated prices will also be displayed in a "wide" form, which is common for display of certain types of data. This analysis will be performed in steps, but the steps could be combined into a single line to avoid unnecessary creation of variable names.

In [None]:
pricegroups = gas[["fuel","station","ppg"]].groupby(["fuel","station"])
pricemean = pricegroups.mean()
print(pricemean)

In [None]:
print(pricemean.unstack())
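
The two remaining ideas from the text, looping over a grouped object and combining the steps into a single line, can be sketched as follows. The data frame is an invented stand-in for the gasoline data. Selecting the "ppg" column before aggregating is a choice made here so that the wide table has simple column labels.

```python
import pandas as pd

# Invented stand-in for the gasoline data
sample = pd.DataFrame({
    "fuel": ["Unleaded87", "E10", "Unleaded87", "E10", "E10"],
    "station": ["A", "A", "B", "B", "B"],
    "ppg": [2.20, 2.10, 2.30, 2.00, 2.20],
})

# Looping over a grouped object yields (group label, sub-frame) pairs
for fuel, rows in sample.groupby("fuel"):
    print(fuel, len(rows))

# The grouping, aggregation and reshaping steps chained into one statement
wide = sample.groupby(["fuel", "station"])["ppg"].mean().unstack()
print(wide)
```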

## Statistical Analysis

Pandas data frames are objects with methods for rudimentary analyses of data, such as finding averages, sums or extreme values within the collection of data. To perform more rigorous statistical analyses, though, other Python modules can be used. The *statsmodels* module facilitates statistical analysis of data assembled in pandas data frames. The module provides several functions for doing different types of data analysis.

In [1]:
import statsmodels.api as sm

In [2]:
import statsmodels.formula.api as smf

In [4]:
from scipy import stats

In [8]:
stats.ttest_ind(gas["ppg"][gas["fuel"]=="Unleaded87"],
            gas["ppg"][gas["fuel"]=="E10"])

Ttest_indResult(statistic=2.4538705315908325, pvalue=0.020869556705461444)

In [None]:
stationsmodel = smf.ols('ppg~station', data=gas)
stationlm = stationsmodel.fit()
print(stationlm.summary())

## Visualization of Data

Several Python modules have been developed to produce plots of numerical data. Pandas is designed to work particularly well with matplotlib, which, like numpy, is a core part of the scientific Python ecosystem.

In [12]:
from matplotlib import pyplot as plt

Pandas series and data frames include methods for plotting data. By default, they will use matplotlib for the functions required for creating plots.

Plotting a Series will plot the values in the series as a function of the index values.

In [None]:
gas["ppg"].plot()

Pandas data frames also have a plot method associated with them.

In [None]:
gas.plot()

Plots can be created by specifying the x and y values by column title.

In [None]:
gas.plot(x="ppg", y="mpg", kind="scatter")

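Creating plots within a program allows you to manage plotting while leveraging other aspects of data management in Python. For example, the boxplot method can split the values of one column into groups defined by another column.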

In [None]:
gas.boxplot(column="ppg", by="fuel")

## Seaborn

It's possible to create fairly sophisticated plots with Matplotlib, but this generally requires quite a bit of code. Other Python modules have been developed to create more sophisticated plots with a smaller amount of code. Seaborn is a module that defines several functions for generating plots with Matplotlib while specifying relatively few parameters.

In [27]:
import seaborn as sns

In [None]:
sns.scatterplot(x="ppg",y="mpg",hue="station",data=gas)