# Pandas: Python Data Analysis Library

**pandas** is a Python package providing fast, flexible, and expressive data structures designed to work with tabular data. To store a table of data, pandas provides the **DataFrame** object. Each DataFrame contains one or more columns, also called attributes, and each column holds one category of data:

| (index) | Name | Age | Height | LikesIceCream |
| :---: | :--: | :--: | :--: | :--: |
| 0     | "Nick" | 22 | 3.4 | True |
| 1     | "Jenn" | 55 | 1.2 | True |
| 2     | "Joe"  | 25 | 2.2 | True |

Importantly, in contrast to NumPy arrays, a DataFrame can hold objects of different types. Within one column (i.e. one specific attribute) the type should be the same, but different columns can have different types.
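To make the contrast concrete, here is a small sketch that reuses the values from the table above. The variable names `arr` and `people` are only illustrative.

```python
import numpy as np
import pandas as pd

# A NumPy array has a single dtype, so mixing text and numbers
# forces every element to a common type (here: strings).
arr = np.array(["Nick", 22, 3.4])
print(arr.dtype.kind)  # 'U' stands for unicode string

# A DataFrame keeps one dtype per column instead.
people = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [22, 55, 25],
    "Height": [3.4, 1.2, 2.2],
    "LikesIceCream": [True, True, True],
})
print(people.dtypes)
```

The exact dtype reported for the `Name` column depends on the pandas version, but the numeric and boolean columns keep their own types.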

## Making DataFrames Directly

There are different ways to create a DataFrame with pandas. Let's go through some of them.

### From a List of Dicts

A dict is a collection of named values. If you have a list of dicts that share the same keys, the DataFrame constructor can convert it to a DataFrame:

In [2]:
import pandas as pd

In [3]:
friends = [
    {'Name': "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
    {'Name': "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe", "Height": 1.2, "Age": 25, },
]
pd.DataFrame(friends)

Unnamed: 0,Name,Age,Height,Weight
0,Nick,31,2.9,20.0
1,Jenn,55,1.2,
2,Joe,25,1.2,
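Notice the empty `Weight` cells in the output above: only the first dict has a `"Weight"` key. As a rough sketch of what pandas does with the gaps (the name `friends_df` is just for illustration):

```python
import pandas as pd

friends = [
    {"Name": "Nick", "Age": 31, "Height": 2.9, "Weight": 20},
    {"Name": "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe", "Height": 1.2, "Age": 25},
]
friends_df = pd.DataFrame(friends)

# Keys that are missing from a dict are filled in with NaN.
print(friends_df["Weight"].isna().tolist())

# NaN is a float, so the Weight column is stored as floats
# even though the only value supplied (20) was an integer.
print(friends_df["Weight"].dtype)
```

This is why `20` is displayed as `20.0` in the table.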


### From a Dict of Lists

In [4]:
df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe'], 
    'Age': [31, 55, 25], 
    'Height': [2.9, 1.2, 1.2],
})
df

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2
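One thing to keep in mind with this approach is that every list must have the same length, since each list becomes a full column. A minimal sketch of what happens otherwise (the shortened `Age` list is a deliberate mistake for illustration):

```python
import pandas as pd

error_message = None
try:
    pd.DataFrame({
        "Name": ["Nick", "Jenn", "Joe"],
        "Age": [31, 55],  # one value short on purpose
    })
except ValueError as err:
    # pandas refuses to guess which row the missing value belongs to.
    error_message = str(err)

print("Could not build the DataFrame:", error_message)
```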


### From a List of Lists

If you have a number of same-length collections (e.g. lists), you essentially have a rectangular data structure already! Each inner list becomes one row, so all that's needed is to add some column labels.

In [5]:
friends = [
    ['Nick', 31, 2.9],
    ['Jenn', 55, 1.2],
    ['Joe',  25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2
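If you are wondering what happens when the `columns` argument is left out, the sketch below compares both cases. The names `rows`, `unlabeled`, and `labeled` are only illustrative.

```python
import pandas as pd

rows = [
    ["Nick", 31, 2.9],
    ["Jenn", 55, 1.2],
    ["Joe", 25, 1.2],
]

# Without labels, pandas numbers the columns 0, 1, 2, ...
unlabeled = pd.DataFrame(rows)
print(list(unlabeled.columns))

# With the columns argument, each position gets a readable name.
labeled = pd.DataFrame(rows, columns=["Name", "Age", "Height"])
print(list(labeled.columns))
```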


### From an Empty DataFrame

If you prefer, you can also add columns one at a time, starting with an empty DataFrame:

In [6]:
df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2


## Exercise (quick)

Make a DataFrame from Scratch

Please use pandas to recreate the table below as a DataFrame, using one of the approaches detailed above:

| Year | Product | Cost |
| :--: | :----:  | :--: |
| 2015 | Apples  | 0.35 |
| 2016 | Apples  | 0.45 |
| 2015 | Bananas | 0.75 |
| 2016 | Bananas | 1.10 |

## Reading data from files into a DataFrame

pandas provides `read_xxx()` functions that read data from a variety of sources into a **DataFrame**, and DataFrame objects come with matching `to_xxx()` methods that write the data back out:

| File Format | File Extension | `read_xxx()` function | DataFrame Write Method |
| :--:  | :--: | :--: | :--: |
| Comma-Separated Values      | .csv           | `pd.read_csv()` | `df.to_csv()` |
| Tab-Separated Values        | .tsv, .tabular, .csv | `pd.read_csv(sep='\t')`, `pd.read_table()` | `df.to_csv(sep='\t')` |
| Excel Spreadsheet           |  .xls | `pd.read_excel()`                    | `df.to_excel()`  |
| Excel Spreadsheet 2010      | .xlsx | `pd.read_excel(engine='openpyxl')`   | `df.to_excel(engine='openpyxl')` |
| JSON                        | .json | `pd.read_json()`                     | `df.to_json()` |
| Tables in a Web Page (HTML) | .html | `pd.read_html()[0]`                  | `df.to_html()` |
| HDF5 | .hdf5, .h5 | `pd.read_hdf()` |  `df.to_hdf()` |
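As a small illustration of how a read function and a write method pair up, the sketch below writes a DataFrame to tab-separated text and reads it back, entirely in memory so that no file is needed. The names `friends_df`, `tsv_text`, and `restored` are only for illustration.

```python
import io
import pandas as pd

friends_df = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [31, 55, 25],
    "Height": [2.9, 1.2, 1.2],
})

# With no path argument, to_csv() returns the text instead of writing a file.
# index=False leaves out the row labels.
tsv_text = friends_df.to_csv(sep="\t", index=False)
print(tsv_text)

# read_csv() accepts a file-like object in place of a file path.
restored = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(restored)
```

The same pattern should carry over to the other formats in the table, although some of them (Excel, HDF5, HTML) rely on optional packages being installed.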

## Exercise (quick)

1. Run the code below to download the Titanic passengers dataset, and transform it into different file formats

**Note**: Yep, that's right, you can supply a web URL and pandas reads it like a normal file!

In [4]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


2. Now run the code below to save the DataFrame to a comma-separated file using the `DataFrame.to_csv()` method:

In [5]:
df.to_csv('titanic.csv')

3. Run the code below to save the DataFrame to a tab-separated file, using the .tsv file extension:

In [7]:
df.to_csv("titanic2001.tsv", sep='\t')

... Did it save correctly?  Check by reading the TSV file into Pandas again.
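One thing to watch for when checking the round trip: by default `to_csv()` also writes the row index as a first column without a name. The sketch below shows the effect on a tiny stand-in DataFrame written to a temporary folder. The file name `example.tsv` and the variable names are hypothetical.

```python
import os
import tempfile
import pandas as pd

small = pd.DataFrame({"survived": [0, 1], "sex": ["male", "female"]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tsv")  # hypothetical file name
    small.to_csv(path, sep="\t")

    # Read back naively: the saved index shows up as an extra column.
    naive = pd.read_csv(path, sep="\t")

    # Tell pandas that the first column holds the index.
    restored = pd.read_csv(path, sep="\t", index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv()` is another option if the row labels are not worth saving.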

## Inspecting a DataFrame

DataFrame objects come with methods (such as `head()` and `describe()`) and attributes (such as `shape`, `columns`, and `dtypes`) that allow quick inspection of the data.

In [15]:
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [16]:
df.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [17]:
df.shape

(891, 15)

In [18]:
df.columns

Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare',
       'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town',
       'alive', 'alone'],
      dtype='object')

In [19]:
df.dtypes

survived         int64
pclass           int64
sex             object
age            float64
sibsp            int64
parch            int64
fare           float64
embarked        object
class           object
who             object
adult_male        bool
deck            object
embark_town     object
alive           object
alone             bool
dtype: object

In [21]:
df.describe()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292
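Note that the summary above only covers the numeric columns; text and boolean columns such as `sex` and `alone` are left out by default. A minimal sketch of how to include them, using the first four Titanic rows as a stand-in (the variable names are only illustrative):

```python
import pandas as pd

passengers = pd.DataFrame({
    "sex": ["male", "female", "female", "female"],
    "age": [22.0, 38.0, 26.0, 35.0],
    "alone": [False, False, True, False],
})

# Default behaviour: only numeric columns are summarised.
numeric_summary = passengers.describe()
print(numeric_summary)

# include="all" adds the remaining columns, with statistics such as
# the number of unique values and the most frequent value.
full_summary = passengers.describe(include="all")
print(full_summary)
```

Statistics that do not apply to a column (for example the mean of `sex`) are shown as `NaN` in the combined summary.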


## Exercise

1. Load the `tips.csv` file into Python as a DataFrame.

2. How many observations (or rows) does the data have?

3. How many attributes (or columns) does it have?

... What are the attributes?

4. What is the data type of each attribute?