# Pandas

The Pandas library is a one-stop shop for common data science workflows. It provides the basic toolset for data cleaning, feature engineering, statistical analysis, and visualization in one place. When that toolset is not enough, it helps that Pandas is built on top of NumPy, which makes it easy to integrate with more specialized libraries such as scikit-learn. As long as it makes sense to represent your data as tables, you should consider using Pandas.

In [3]:
import numpy as np
import pandas as pd

## Getting Data
The main data structure of Pandas is the DataFrame, a 2-D, labeled, table-like structure. Each column of a DataFrame is a Series, a 1-D labeled array whose values share a single data type. Understanding how to work with these two structures is the core of working with Pandas. The first step is acquiring a DataFrame (or Series) to work with.
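To make the relationship between the two structures concrete, here is a minimal sketch. The column names and values are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "value": [1.0, 2.5, 4.0],
})

col = df["value"]                  # selecting a single column yields a Series
print(type(col).__name__)          # the class name of the selected column
print(df.dtypes)                   # each column carries its own data type
print(col.index.equals(df.index))  # the Series shares the DataFrame's row labels
```

Selecting a column does not lose the row labels, which is what lets Pandas align data automatically when Series are combined.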

One way to get a DataFrame is to import it from another data source. Pandas can read many common file formats (CSV, Excel, JSON, HDF5, ...) as well as databases. The cell below downloads a CSV file from the specified URL and loads it as a Pandas DataFrame.

In [6]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
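
Besides URLs and local paths, `read_csv` also accepts file-like objects, which is handy for experimenting without a network connection. The sketch below parses a small in-memory string; the CSV content is hypothetical and only stands in for a real file.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file or URL
csv_text = """sepal_length,species
5.1,setosa
7.0,versicolor
6.3,virginica
"""

# Wrap the text in StringIO so read_csv can treat it like an open file
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # types inferred from the parsed values
```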

### Exercise 1
Let's experiment with creating a DataFrame from JSON files. The simulated datasets are available in the module folder as `simulated_form<id>.json` files. Each file contains the same dataset in a different JSON layout. Use the Pandas `read_json` function to load three or more of the files.

In [None]:
df1 = None # fill in these lines
df2 = None
df3 = None

for df in [df1, df2, df3]:
    print(df.head())
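
As a hint for the exercise, the `orient` argument of `read_json` tells Pandas how the JSON is laid out. The sketch below is not a solution: the JSON strings are hypothetical and are not the contents of the simulated files. It only shows that two different layouts can describe the same table.

```python
import io
import pandas as pd

# Hypothetical JSON strings; not taken from the simulated_form files
records_json = '[{"id": 1, "score": 0.5}, {"id": 2, "score": 0.75}]'
columns_json = '{"id": {"0": 1, "1": 2}, "score": {"0": 0.5, "1": 0.75}}'

# "records" is a list of row objects; "columns" maps column -> {row label -> value}
from_records = pd.read_json(io.StringIO(records_json), orient="records")
from_columns = pd.read_json(io.StringIO(columns_json), orient="columns")

print(from_records)
print(from_columns)
```

If a file loads with an unexpected shape, trying a different `orient` value is a reasonable first step.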

Another way to acquire DataFrames or Series is to construct them directly from existing collections, such as native Python lists, NumPy arrays, or dictionaries.

In [None]:
pd.Series([1, 3, 9])

In [None]:
pd.DataFrame(np.random.randn(10,4), columns = ['A', 'B', 'C', 'D'])

In [None]:
pd.DataFrame({
    "PI": 3.14,
    "Radius": np.arange(5),
    "Size": pd.Categorical(["S", "S", "S", "L", "L"])
})

## Exploring Data

After you've created a DataFrame, the next natural step is exploring its contents. Several helpful attributes and methods are available for inspecting it. Below are just a few.

In [27]:
print(f'Shape = {iris.shape}')
print(f'Index = {iris.index}')
print(f'Columns = {iris.columns}')

Shape = (150, 5)
Index = RangeIndex(start=0, stop=150, step=1)
Columns = Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
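
Beyond these attributes, a few methods give a quick overview of the values themselves. The sketch below uses a hypothetical miniature table loosely modelled on the iris columns so that it runs without downloading anything; the same calls should work on `iris`.

```python
import pandas as pd

# Hypothetical miniature table, not the real iris data
flowers = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.3],
    "species": ["setosa", "setosa", "versicolor", "virginica"],
})

print(flowers.head(2))                    # first two rows
print(flowers.dtypes)                     # data type of each column
print(flowers.describe())                 # summary statistics for numeric columns
print(flowers["species"].value_counts())  # how often each species appears
```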
