In [7]:
#  !pip install pandas  # Run this to install pandas if imports below don't work.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline

In [2]:
___ = ''

In [3]:
pd.__version__

'1.2.4'

# Pandas DataFrames

## What is a DataFrame?

A DataFrame, simply put, is a **table** of data.  It contains multiple rows, and every row holds the same labelled set of fields, each field with its own data type.  For example, a DataFrame might look like this:

| (index) | Name | Age | Height | LikesIceCream |
| :---: | :--: | :--: | :--: | :--: |
| 0     | "Nick" | 22 | 3.4 | True |
| 1     | "Jenn" | 55 | 1.2 | True |
| 2     | "Joe"  | 25 | 2.2 | True |

Because each row contains the same data, DataFrames can also be thought of as a collection of same-length columns!
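
To make the row view and the column view concrete, here is a minimal sketch. The variable name `people` is made up for illustration, and the values simply mirror the example table above.

```python
import pandas as pd

# Hypothetical table, mirroring the example above.
people = pd.DataFrame({
    "Name": ["Nick", "Jenn", "Joe"],
    "Age": [22, 55, 25],
    "Height": [3.4, 1.2, 2.2],
    "LikesIceCream": [True, True, True],
})

# Column view: one same-length Series per label.
ages = people["Age"]
print(ages.tolist())        # [22, 55, 25]

# Row view: one labelled record per index entry.
first_row = people.iloc[0]
print(first_row["Name"])    # Nick

# Every column is as long as the table has rows.
print(people.shape)         # (3, 4)
```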

**Pandas** is a Python package that has a DataFrame class.  Using either the **DataFrame** class constructor or one of Pandas' many **read_()** functions, you can make your own DataFrame from a variety of sources.  

## Making DataFrames Directly

#### From a List of Dicts

Dicts are named collections.  If you have a list of dicts that all share the same keys, the DataFrame constructor can convert it to a DataFrame:

In [4]:
friends = [
    {'Name': "Nick", "Age": 31, "Height": 2.9},
    {'Name': "Jenn", "Age": 55, "Height": 1.2},
    {"Name": "Joe",  "Age": 25, "Height": 1.2},
]
pd.DataFrame(friends)

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2


#### From a Dict of Lists

In [5]:
df = pd.DataFrame({
    'Name': ['Nick', 'Jenn', 'Joe'], 
    'Age': [31, 55, 25], 
    'Height': [2.9, 1.2, 1.2],
})
df

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2


#### From a List of Lists

If you have a collection of same-length sequences, you essentially have a rectangular data structure already!  All that's needed is to add some column labels.

In [6]:
friends = [
    ['Nick', 31, 2.9],
    ['Jenn', 55, 1.2],
    ['Joe',  25, 1.2],
]
pd.DataFrame(friends, columns=["Name", "Age", "Height"])

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2


#### From an empty DataFrame
If you prefer, you can also add columns one at a time, starting with an empty DataFrame:

In [7]:
df = pd.DataFrame()
df['Name'] = ['Nick', 'Jenn', 'Joe']
df['Age'] = [31, 55, 25]
df['Height'] = [2.9, 1.2, 1.2]
df

Unnamed: 0,Name,Age,Height
0,Nick,31,2.9
1,Jenn,55,1.2
2,Joe,25,1.2
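
One thing to keep in mind when adding columns one at a time: once the first column exists, every later list has to have the same number of entries. The sketch below (with a deliberately too-short `Age` list, chosen only for illustration) shows that Pandas refuses a list of the wrong length with a `ValueError`.

```python
import pandas as pd

df = pd.DataFrame()
df["Name"] = ["Nick", "Jenn", "Joe"]   # the first column fixes the number of rows (3)

try:
    df["Age"] = [31, 55]               # only 2 values for 3 rows
except ValueError as err:
    print("Could not add column:", err)

# The failed assignment leaves the DataFrame unchanged.
print(list(df.columns))
print(len(df))
```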


### Exercise: Making DataFrames from Scratch

Please recreate the table below as a DataFrame using one of the approaches detailed above:

| Year | Product | Cost |
| :--: | :----:  | :--: |
| 2015 | Apples  | 0.35 |
| 2016 | Apples  | 0.45 |
| 2015 | Bananas | 0.75 |
| 2016 | Bananas | 1.10 |

In [11]:
rows = [
    [2015, "Apples", 0.35],
    [2016, "Apples", 0.45],
    [2015, "Bananas", 0.75],
    [2016, "Bananas", 1.10],
]
pd.DataFrame(rows, columns=["Year", "Product", "Cost"])

Unnamed: 0,Year,Product,Cost
0,2015,Apples,0.35
1,2016,Apples,0.45
2,2015,Bananas,0.75
3,2016,Bananas,1.1


Discuss: Which approach did you choose?  What did you like about it?

### Reading Data from Files into a DataFrame


| File Format | File Extension | `read_xxx()` function | DataFrame Write Method | 
| :--:  | :--: | :--: | :--: |
| Comma-Separated Values      | .csv           | `pd.read_csv()` | `df.to_csv()` |
| Tab-Separated Values        | .tsv, .tabular, .csv | `pd.read_csv(sep='\t')`, `pd.read_table()` | `df.to_csv(sep='\t')` |
| Excel Spreadsheet           |  .xls | `pd.read_excel()`                    | `df.to_excel()`  |
| Excel Spreadsheet (2007 and later) | .xlsx | `pd.read_excel(engine='openpyxl')`   | `df.to_excel(engine='openpyxl')` |
| JSON                        | .json | `pd.read_json()`                     | `df.to_json()` |
| Tables in a Web Page (HTML) | .html | `pd.read_html()[0]`                  | `df.to_html()` |
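
Each write method in the table pairs with a read function. As a small sketch of that pairing, the snippet below writes a made-up two-row table (`df_small`, not part of the original notebook) to an in-memory text buffer as tab-separated values and reads it straight back. Using `io.StringIO` instead of a file on disk is just a convenience for the demonstration.

```python
import io
import pandas as pd

# Hypothetical toy data for the demonstration.
df_small = pd.DataFrame({"Name": ["Nick", "Jenn"], "Age": [31, 55]})

# Write tab-separated text into an in-memory buffer instead of a file.
buffer = io.StringIO()
df_small.to_csv(buffer, sep="\t", index=False)
print(repr(buffer.getvalue()))   # shows the tabs between the fields

# Rewind the buffer and read the same text back with the matching separator.
buffer.seek(0)
restored = pd.read_csv(buffer, sep="\t")
print(restored)
```

The same `sep` value has to be used on both sides; reading tab-separated text with the default comma separator would put every line into a single column.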

In [12]:
import pandas as pd

### File Format Exercises: "Roundtripping" write-read

Run the code below to download the Titanic passengers dataset, then convert it into different file formats.

In [15]:
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
df = pd.read_csv(url)
df[:5]

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


#### Tab-Separated Values

Save the dataframe to a TSV file.

In [23]:
df.to_csv("test.tsv", sep='\t')

Read the TSV file into Pandas again.  

In [24]:
pd.read_csv("test.tsv", sep='\t')

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


Open the file in Jupyter (right-click on it in the file browser, click open with editor).  What does the file look like?
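
Notice that the round trip above produced an extra `Unnamed: 0` column: the index was written to the file as a first column without a header, then read back as ordinary data. The sketch below, using a made-up two-row table and in-memory text rather than a file, shows two common ways to avoid that. Both `index=False` and `index_col=0` are standard Pandas parameters; which one fits depends on whether the index carries information worth keeping.

```python
import io
import pandas as pd

# Hypothetical toy data for the demonstration.
df_small = pd.DataFrame({"survived": [0, 1], "sex": ["male", "female"]})

# Default behaviour: the index is written as an unnamed first column...
text = df_small.to_csv(sep="\t")
# ...so reading the text back yields an extra "Unnamed: 0" column.
with_extra = pd.read_csv(io.StringIO(text), sep="\t")
print(list(with_extra.columns))   # ['Unnamed: 0', 'survived', 'sex']

# Option 1: do not write the index at all.
text_no_index = df_small.to_csv(sep="\t", index=False)
clean_1 = pd.read_csv(io.StringIO(text_no_index), sep="\t")
print(list(clean_1.columns))      # ['survived', 'sex']

# Option 2: keep the index in the file and restore it when reading.
clean_2 = pd.read_csv(io.StringIO(text), sep="\t", index_col=0)
print(list(clean_2.columns))      # ['survived', 'sex']
```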

#### JSON 

Save the dataframe to a JSON file.

In [25]:
df.to_json("test.json")

Read the JSON file into Pandas again.  

In [26]:
pd.read_json("test.json")

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


Open the file in Jupyter (right-click on it in the file browser, click open with editor).  What does the file look like?
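
The layout of the JSON text depends on the `orient` parameter of `to_json()`. As a rough sketch on a made-up two-row table, the default layout groups values by column, while `orient="records"` produces one object per row. The text is wrapped in `io.StringIO` before reading it back, since `read_json()` accepts file-like objects.

```python
import io
import json
import pandas as pd

# Hypothetical toy data for the demonstration.
df_small = pd.DataFrame({"Name": ["Nick", "Jenn"], "Age": [31, 55]})

# Default layout ("columns"): {column -> {index label -> value}}
by_column = json.loads(df_small.to_json())
print(by_column)   # {'Name': {'0': 'Nick', '1': 'Jenn'}, 'Age': {'0': 31, '1': 55}}

# "records" layout: a list with one dict per row
records_text = df_small.to_json(orient="records")
by_row = json.loads(records_text)
print(by_row)      # [{'Name': 'Nick', 'Age': 31}, {'Name': 'Jenn', 'Age': 55}]

# Reading the "records" text back into a DataFrame
restored = pd.read_json(io.StringIO(records_text), orient="records")
print(restored)
```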

#### HTML 

Save the dataframe to an HTML file.

In [27]:
df.to_html("test.html")

Read the HTML file into Pandas again.  

In [29]:
pd.read_html("test.html")[0]

Unnamed: 0.1,Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


Open the file in Jupyter (right-click on it in the file browser, click open with editor).  What does the file look like?

Open it in your web browser by double-clicking on it in your file explorer. What does it look like in your web browser?
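
If you would rather inspect the markup from within Python, `to_html()` returns the HTML as a string when called without a file path. The sketch below uses a made-up two-row table and only covers the writing side, because `pd.read_html()` relies on an optional HTML parser package (such as lxml) that may not be installed.

```python
import pandas as pd

# Hypothetical toy data for the demonstration.
df_small = pd.DataFrame({"Name": ["Nick", "Jenn"], "Age": [31, 55]})

# With no path argument, to_html() returns the markup as a string.
html = df_small.to_html(index=False)
print(html)
```

Passing `index=False` leaves the index out of the table, which also avoids the extra `Unnamed: 0` column when the file is read back.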

#### Excel 

Note: Pandas relies on optional packages to read and write XLS and XLSX files, so you may need to install an extra package for this to work (code below).

In [30]:
!pip install openpyxl



Save the dataframe to an Excel file.

In [32]:
df.to_excel("test.xlsx", engine='openpyxl')

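Read the Excel file into Pandas again.  

In [33]:
pd.read_excel("test.xlsx", engine='openpyxl')

Note: `pd.read_table("test.xlsx")` does not work here.  It fails with `ParserError: Error tokenizing data. C error: Expected 2 fields in line 14, saw 3`, because an XLSX file is not delimited text.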


Open the file in Jupyter (right-click on it in the file browser, click open with editor).  What does the file look like?

Open it in your spreadsheet program.  What does it look like?