# import pandas as pd

## Workshop: Pandas and Data Manipulation

**pandas** is a Python package providing fast, flexible, and expressive data
structures designed to make working with "relational" or "labeled" data both
easy and intuitive. It aims to be the fundamental high-level building block for
doing practical, **real world** data analysis in Python. Additionally, it has
the broader goal of becoming **the most powerful and flexible open source data
analysis / manipulation tool available in any language**. It is already well on
its way toward this goal.

### Main Features
Here are just a few of the things that pandas does well:

  - Easy handling of missing data in floating point as well as non-floating
    point data
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets
  - Intuitive merging and joining data sets
  - Flexible reshaping and pivoting of data sets
  - Hierarchical labeling of axes (possible to have multiple labels per tick)
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, moving window linear regressions,
    date shifting and lagging, etc.
    
Source: official pandas documentation. In IPython or Jupyter, run `pd?` after importing pandas to display this text.
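
A few of these features can be sketched in a handful of lines. The example below is a minimal illustration with made-up values and labels (none of them come from the workshop data): it shows automatic alignment by label, missing-data handling, and a split-apply-combine group-by.

```python
import pandas as pd

# Two Series with partially overlapping labels (made-up values).
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "z"])

# Automatic alignment: values are matched by label, not by position.
# "x" has no partner in b, so the result at "x" is NaN (missing).
total = a + b

# Missing data handling: fill the gap with a default value.
filled = total.fillna(0.0)

# Split-apply-combine: split rows by "group", take the mean of each group.
df = pd.DataFrame({"group": ["a", "a", "b"], "value": [1.0, 3.0, 5.0]})
means = df.groupby("group")["value"].mean()

print(total)
print(filled)
print(means)
```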

### Cheatsheet

https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf

### References

1. https://pandas.pydata.org/pandas-docs/stable/  
2. Python Data Science Handbook by Jake VanderPlas
3. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython by Wes McKinney

### Installation

Windows: Start Button -> "Anaconda Prompt"

Ubuntu / macOS: open a terminal; `conda` should already be on your path.

Activate the environment:

```
conda activate module1
```

Pandas should already be installed. If not, install it:

```
conda install pandas
```

Tip: You can check the installed version by running a one-line script with Python:
```
python -c "import pandas; print(pandas.__version__)"
```

### SGD to USD Exchange Rate Data

Instead of constructing sample arrays, we'll be using real data to play with pandas concepts.

We'll use some data from data.gov.sg.

```
from IPython.display import IFrame

# embed the data.gov.sg preview of the data set in the notebook
IFrame('https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual/resource/f927c39b-3b44-492e-8b54-174e775e0d98/view/43207b9f-1554-4afb-98fe-80dfdd6bb4f6', width=600, height=400)
```

### Download Instructions
1. Go to https://data.gov.sg/dataset/exchange-rates-sgd-per-unit-of-usd-average-for-period-annual
2. Click on the `Download` button
3. Unzip and extract the `.csv` file. Note the path for use below.

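```
import pandas as pd

# index_col=0 uses the first column (dates) as the index; parse_dates=True parses it
sgd_usd = pd.read_csv('D:/tmp/exchange-rates/exchange-rates-sgd-per-unit-of-usd-daily.csv',
                      parse_dates=True, index_col=0)
# squeeze the one-column DataFrame into a Series (read_csv's squeeze= was removed in pandas 2.0)
sgd_usd = sgd_usd.squeeze('columns')

# look at the raw values: a plain NumPy array
sgd_usd.values
```

```
array([ 2.0443,  2.0313,  2.0205, ...,  1.3763,  1.3834,  1.3827])
```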

Where did the dates go? They are part of the pandas Series:

```
# inspect the first 5 entries of the Series
sgd_usd.head(5)
```

```
date
1988-01-08    2.0443
1988-01-15    2.0313
1988-01-22    2.0205
1988-01-29    2.0182
1988-02-05    2.0160
Name: exchange_rate_usd, dtype: float64
```
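
To try this without the downloaded file, here is a minimal sketch that reads the same kind of data from an in-memory CSV. The three rows are the first three shown above. The header names `date` and `exchange_rate_usd` are assumed from the Series output and have not been checked against the actual file.

```python
import io
import pandas as pd

# Assumed file layout: a date column followed by one rate column.
csv_text = """date,exchange_rate_usd
1988-01-08,2.0443
1988-01-15,2.0313
1988-01-22,2.0205
"""

# index_col=0 makes the date column the index; parse_dates=True converts it
# to datetimes; squeeze("columns") turns the one-column DataFrame into a Series.
rates = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True).squeeze("columns")

print(rates.values)   # the rates alone, as a NumPy array
print(rates.index)    # the dates, kept as a DatetimeIndex
print(rates.loc[pd.Timestamp("1988-01-15")])  # look up a rate by its date label
```

The values and the index travel together in the Series, which is why a rate can be looked up by date rather than by position.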

### Merging datasets

Suppose we also need the Singapore Dollar to Renminbi (CNY) exchange rate, which comes from a different data set.

This dataset is already downloaded for you in the `data` folder.

```
# data source: https://www.exchangerates.org.uk
# squeeze('columns') replaces read_csv's squeeze= argument, removed in pandas 2.0
sgd_cny = pd.read_csv('data/sgd_cny_rates_daily.csv',
                      parse_dates=True, index_col=0).squeeze('columns')
print('First 5 entries:')
sgd_cny.head(5)
```

```
First 5 entries:

Date
2018-05-27    4.7499
2018-05-26    4.7620
2018-05-25    4.7610
2018-05-24    4.7618
2018-05-23    4.7553
Name: Singapore Dollar to Chinese Yuan, dtype: float64
```
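
Note that this data set lists the newest date first, while the USD series starts with the oldest. A minimal sketch of putting it in chronological order with `sort_index`, using three rows copied from the output above (the variable names are illustrative):

```python
import pandas as pd

# Rows copied from the output above, newest date first.
sgd_cny_sample = pd.Series(
    [4.7499, 4.7620, 4.7610],
    index=pd.to_datetime(["2018-05-27", "2018-05-26", "2018-05-25"]),
    name="Singapore Dollar to Chinese Yuan",
)

# sort_index returns a new Series ordered by date, oldest first;
# the original Series is left unchanged.
ascending = sgd_cny_sample.sort_index()

print(ascending)
print(ascending.index.is_monotonic_increasing)
```

Index-based operations such as `join` match rows by label, so they work regardless of row order. Sorting mainly helps when reading or plotting the data.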

```
# join matches rows on the date index; dropna() keeps only dates present in both series
df_all = pd.DataFrame(sgd_usd)
df_all.join(sgd_cny).dropna()
```


|  | exchange_rate_usd | Singapore Dollar to Chinese Yuan |
|---|---|---|
| 2009-10-06 | 1.4046 | 4.8734 |
| 2009-10-07 | 1.4031 | 4.8754 |
| 2009-10-08 | 1.3949 | 4.9082 |
| 2009-10-09 | 1.3918 | 4.9039 |
| 2009-10-12 | 1.4014 | 4.8791 |
| 2009-10-13 | 1.3972 | 4.8871 |
| 2009-10-14 | 1.3939 | 4.9050 |
| 2009-10-15 | 1.3861 | 4.9157 |
| 2009-10-16 | 1.3884 | 4.9051 |
| 2009-10-19 | 1.3953 | 4.9037 |
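
The join-then-`dropna` step can be sketched on a small scale. The rates below are copied from the table above, but each series is deliberately trimmed to three dates so that the two only partly overlap. The trimming and the variable names are illustrative assumptions, not part of the workshop data.

```python
import pandas as pd

usd = pd.Series(
    [1.4046, 1.4031, 1.3949],
    index=pd.to_datetime(["2009-10-06", "2009-10-07", "2009-10-08"]),
    name="exchange_rate_usd",
)
cny = pd.Series(
    [4.8754, 4.9082, 4.9039],
    index=pd.to_datetime(["2009-10-07", "2009-10-08", "2009-10-09"]),
    name="Singapore Dollar to Chinese Yuan",
)

# DataFrame.join defaults to a left join on the index: every USD date is kept,
# and 2009-10-06 gets NaN because this CNY sample has no rate for that date.
left = usd.to_frame().join(cny)

# dropna removes the incomplete rows, leaving only dates present in both series.
both = left.dropna()

# how="inner" keeps only the shared dates in a single step.
inner = usd.to_frame().join(cny, how="inner")

print(left)
print(both)
```

Here `dropna()` after a left join and `how="inner"` give the same rows. The inner join states the intent more directly, while `dropna()` would also drop rows whose values were already missing in either source.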


# import matplotlib.pyplot as plt

## Workshop: Matplotlib and Data Visualization

# Putting everything together

## Workshop: Data Workflow

## Assessment 1: Data Workflow