# Pandas

The Pandas library is a one-stop shop for common data science workflows. It provides the basic tools for data cleaning, feature engineering, statistical analysis, and visualization in one place. Sometimes the basic toolset is enough; when it is not, pandas is built on top of NumPy, which makes it easy to integrate with more specialized libraries like scikit-learn. As long as it makes sense to represent your data in tables, you should consider using pandas.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Getting Data
The main data structure of Pandas is the DataFrame, a 2-D labeled, table-like structure. Each of its columns is a Series, a 1-D labeled array whose values share a single data type. Understanding how to work with these two structures is the core of working with Pandas. The first step is acquiring a DataFrame (or Series) to work with.
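
As a minimal sketch of the DataFrame/Series relationship, the cell below builds a small table and pulls out one column. The name `toy` and its values are made up for illustration.

```python
import pandas as pd

# Toy table (values are made up for illustration).
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "length": [1.5, 2.0, 3.25],
    "count": [1, 2, 3],
})

column = toy["length"]          # selecting one column returns a Series
print(type(column).__name__)    # the class name of the selected column
print(toy.dtypes)               # each column carries its own dtype
```

The Series keeps the same row labels as the DataFrame it came from, which is what lets pandas align data automatically.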

One way to get a DataFrame is to import it from another data source. Pandas can read many common file formats (CSV, Excel, JSON, HDF5, ...) as well as databases. The cell below pulls a CSV from the specified URL and loads it as a Pandas DataFrame.

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
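
`read_csv` is not limited to URLs: it also accepts a local path or any file-like object. The sketch below, which uses a hypothetical in-memory CSV string rather than a real file, shows the same call working without network access.

```python
import io
import pandas as pd

# Hypothetical CSV content held in memory instead of a file or URL.
csv_text = """sepal_length,sepal_width,species
5.1,3.5,setosa
7.0,3.2,versicolor
6.3,3.3,virginica
"""

# read_csv accepts a path, a URL, or any file-like object.
small = pd.read_csv(io.StringIO(csv_text))
print(small.shape)
print(small.dtypes)
```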

### Exercise 1
Let's experiment with creating a DataFrame from JSON files. The simulated datasets are available in the module folder as `simulated_form<id>.json` files. Each file contains the same dataset, but in a different JSON layout. Use the Pandas `read_json` function to load three or more of the files.

In [None]:
df1 = None # fill in these lines
df2 = None
df3 = None

for df in [df1, df2, df3]:
    print(df.head())

Another way to acquire DataFrames or Series is to literally construct them from existing collections. Base collections could be native Python lists, numpy arrays, or dictionaries.

In [None]:
pd.Series([1, 3, 9])

In [None]:
pd.DataFrame(np.random.randn(10,4), columns = ['A', 'B', 'C', 'D'])

In [None]:
pd.DataFrame({
    "PI": 3.14,
    "Radius": np.arange(5),
    "Size": pd.Categorical(["S", "S", "S", "L", "L"])
})

## Exploring Data

After you've created a DataFrame, the next natural step is exploring its contents. There are several helpful attributes and methods for viewing different aspects of it. Below are just a few.

In [None]:
print(f'Shape = {iris.shape}')
print(f'Index = {iris.index}')
print(f'Columns = {iris.columns}')

In [None]:
print(iris.head())
print(iris.tail())
print(iris.describe())
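
A few more inspection tools are often useful alongside the ones above. This is a small sketch on a made-up table (`toy` and its values are illustrative only) showing `dtypes`, `info()`, and `describe(include="all")`, which also summarizes non-numeric columns.

```python
import pandas as pd

# Made-up data standing in for a real dataset.
toy = pd.DataFrame({
    "width": [0.2, 0.4, 1.3, 2.1],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

print(toy.dtypes)   # data type of each column
toy.info()          # index, dtypes, non-null counts, memory use

# include="all" adds count/unique/top/freq rows for non-numeric columns.
summary = toy.describe(include="all")
print(summary)
```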

### Exercise 2

Import the (red) wine quality dataset and answer the following questions.
1. How many columns and rows are there?
2. How many columns deal with measures of acidity?
3. What's the most alcoholic wine available?

In [None]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv'
wine = None # insert your code here

## Selecting Data

Once the dataframe is created and a basic understanding of its contents established, the next most common steps are to examine subsets or perform transformations. In both cases, the ability to select specific elements of the DataFrame is key.

There are four primary ways to select elements: `.loc[]`, `.iloc[]`, `[]`, and, in some cases, direct attribute access with `.`. The most efficient and explicit are `.loc[]` and `.iloc[]`, for label-based and integer-position-based selection, respectively. Using `[]` and `.` is great for interactive work but is generally not recommended for production-quality code. Here are a few examples with attribute access and indexing/slicing.

In [None]:
iris.species.head()

In [None]:
iris['species'].head() # this will work

In [None]:
iris[['petal_length', 'petal_width']].head() # so will this

In [None]:
iris['petal_length':] # this will not work: a slice inside [] selects rows, and the row labels here are integers

In [None]:
iris[0] # this will fail with a KeyError: a single value inside [] is looked up as a column name

In [None]:
iris[:5] # this works: a bare slice inside [] selects rows

In [None]:
iris['species'][:5] # combining works

From the above you can see how the shorthand syntax might be useful. Below we'll flesh out the use of the selection by label and index methods. Both methods essentially take the same set of five possible input types:
1. a single label/index
2. a list of labels/indices
3. a slice of labels/indices
4. a boolean array
5. a callable object

They also can accept the row and column labels simultaneously. If only one is specified, it is assumed to identify the rows.
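
The five input types can be sketched on a small table with string row labels. The DataFrame `toy` and its values are made up for illustration; the same patterns apply to `.iloc[]` with integer positions in place of labels.

```python
import pandas as pd

# Made-up table with string row labels.
toy = pd.DataFrame(
    {"length": [1.4, 4.7, 6.0, 5.1], "width": [0.2, 1.4, 2.5, 1.9]},
    index=["a", "b", "c", "d"],
)

single = toy.loc["b"]                          # 1. single label -> that row as a Series
listed = toy.loc[["a", "d"]]                   # 2. list of labels
sliced = toy.loc["b":"c"]                      # 3. slice of labels (both ends included)
masked = toy.loc[toy["length"] > 5]            # 4. boolean array
called = toy.loc[lambda df: df["width"] < 1]   # 5. callable that returns a valid indexer

# Rows and columns together: row selector first, then column selector.
cell = toy.loc["c", "width"]
print(cell)
```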

In [None]:
iris.loc[:5, 'species']

In [None]:
iris.loc[:5, ['petal_length', 'petal_width']]

In [None]:
iris.loc[:5, 'petal_length':]

In [None]:
iris.loc[iris.species == 'versicolor', :].head()

In [None]:
(iris.loc[lambda df : df.species == 'versicolor', :]
    .loc[lambda df : df.petal_length < 4.5, :]
    .head())

In [None]:
iris.iloc[-5:,-3:]
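
One detail worth checking for yourself is how the two indexers treat the end of a slice: `.loc[]` slices by label and includes the endpoint, while `.iloc[]` slices by position and excludes it, like ordinary Python slicing. A minimal sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({"value": range(10)})   # default integer row labels 0..9

by_label = toy.loc[:5]       # labels 0 through 5, endpoint included
by_position = toy.iloc[:5]   # positions 0 through 4, endpoint excluded

print(len(by_label), len(by_position))   # 6 5
```

This is why `iris.loc[:5, ...]` in the cells above returns six rows rather than five.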

## Operating on Data

To process your data, you'll need to know how to perform common statistical operations, apply custom functions, and create new DataFrames (or modify the existing one). This includes handling missing data, computing statistics, and reshaping (pivot, melt, transpose, etc.). Most of the operations are straightforward to use if you're familiar with the routine you're trying to apply. We'll explore a few examples here.

In [None]:
poor_quality = pd.DataFrame([[1, np.nan], [2, 3.14]], columns = ['A', 'B'])
poor_quality.isna()

In [None]:
poor_quality.fillna(value = poor_quality.B.mean())

In [None]:
poor_quality.dropna()

In [None]:
iris.species.value_counts()

In [None]:
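iris.mean(numeric_only=True) # skip the non-numeric species column, which recent pandas versions refuse to average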

In [None]:
iris.corr(numeric_only=True) # correlations are only defined for the numeric columns

In [None]:
iris.nunique()

In [None]:
iris.iloc[:,:-1].apply(np.sum, axis=0)

In [None]:
iris.iloc[:,:-1].apply(np.sum, axis=1)

In [None]:
iris.T
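
Grouped statistics and reshaping are mentioned above but not demonstrated, so here is a short sketch of `groupby`, `melt`, and `pivot_table`. The table `toy` is made up for illustration, loosely shaped like the iris data.

```python
import pandas as pd

# Made-up measurements, two rows per species.
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "petal_length": [1.4, 1.6, 6.0, 5.0],
    "petal_width": [0.2, 0.4, 2.5, 1.5],
})

# Per-group statistics: mean of each numeric column within each species.
means = toy.groupby("species").mean()

# melt: wide -> long, one row per (species, measurement, value).
long = toy.melt(id_vars="species", var_name="measurement", value_name="value")

# pivot_table: long -> wide again, averaging the duplicate entries.
wide = long.pivot_table(
    index="species", columns="measurement", values="value", aggfunc="mean"
)
print(means)
print(wide)
```

In this example `wide` ends up holding the same per-species means as `means`, just reached by a different route.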

## Visualizing Data

Under the hood, pandas is set up to take advantage of matplotlib's basic functionality. It then exposes that functionality through convenient methods of the DataFrame class.

In [None]:
ax = iris.plot()

In [None]:
ax = iris.plot(kind = 'scatter', x = 'petal_length', y='petal_width')

In [None]:
fig, ax = plt.subplots()
species = iris.groupby('species')
for name, df in species:
    ax.scatter(df.petal_length, df.petal_width, label=name)
legend = ax.legend()
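
Because the plotting methods hand back ordinary matplotlib objects, the two libraries can be mixed freely. The sketch below, on made-up data, draws a pandas histogram onto an Axes created with matplotlib and then customizes it with a regular matplotlib call.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up measurements for illustration.
toy = pd.DataFrame({
    "petal_length": [1.4, 1.6, 4.7, 6.0, 5.0],
    "petal_width": [0.2, 0.4, 1.4, 2.5, 1.5],
})

fig, ax = plt.subplots()

# Passing ax= tells pandas to draw on an existing Axes instead of creating one.
returned = toy.plot(kind="hist", alpha=0.5, bins=5, ax=ax)

# The result is a plain matplotlib Axes, so the usual methods apply.
ax.set_xlabel("centimeters")
```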

### Exercise 3

We've been using the iris dataset to acquaint ourselves with Pandas, but have only just started with the wine dataset. Your task now is to use all the methods you know to inspect the wine dataset. Your goal is to gather ideas for what features or feature combinations might be useful for training a machine learning model to predict the quality of wine.