# Explore, Visualize, and Predict using Pandas & Jupyter

### Learn to import, explore, and tweak your data

Matt Harrison (@\_\_mharrison\_\_)

The pandas library is very popular among data scientists, quants, Excel junkies, and Python developers because it allows you to perform data ingestion, exporting, transformation, and visualization with ease. But if you are only familiar with Python, pandas may present some challenges. Since pandas is inspired by NumPy, its syntax conventions can be confusing to Python developers.

If you have questions on Python syntax, check out https://github.com/mattharrison/Tiny-Python-3.6-Notebook

Much of this content is based on my Pandas book, [*Learning the Pandas Library*](https://www.amazon.com/Learning-Pandas-Library-Munging-Analysis/dp/153359824X/ref=sr_1_3?ie=UTF8&qid=1505448275&sr=8-3&keywords=python+pandas)

# Jupyter Intro

Jupyter Notebook is an environment for combining interactive coding and text in a web browser. This allows us to easily share code as well as the narrative around that code. An example that was popular in the scientific community was [the discovery of gravitational waves.](https://losc.ligo.org/s/events/GW150914/GW150914_tutorial.html)

The name Jupyter is a rebranding of an open source project previously known as IPython Notebook. The rebranding was to emphasize that although the backend is written in Python, it supports various *kernels* to run other languages, including Julia (the "Ju" portion), Python ("pyt"), and R ("er"), all popular *data science* programming languages.

The architecture of Jupyter includes a server running various kernels. Using a *notebook* we can interact with a kernel. Typically we use a web browser to do this, but there are other interfaces, such as an Emacs mode (ein).

## Using Jupyter

After we create a notebook, we are presented with a page with an empty cell. The cell will have a blue outline, and the text:

    In [ ]: 
    
on the side. The blue outline indicates that we are in *command mode*. There are two modes in Jupyter, command mode and *edit mode*.

To enter edit mode simply hit the enter or return key. You will notice that the outline will change to green. In edit mode, with a Python kernel, we can type Python code. Type:

    print("hello world")
    
You will notice that, unlike a normal Python REPL, this will not print anything after you hit return again. To *execute* the cell, hold down control and hit enter (``Ctrl-Enter``). This will run the code, print the results of the cell, and put you back into command mode.

## Edit Mode

To enter *Edit Mode* you need to click on a cell or hit enter when it is surrounded by the blue outline. You will see that it goes green if you are in edit mode. In edit mode you have basic editing functionality. A few keys to know:

* Ctrl-Enter - Run cell (execute Python code, render Markdown)
* ESC - Go back to command mode
* TAB - Tab completion
* Shift-TAB - Bring up tooltip (ESC to dismiss)


## Command Mode

*Command Mode* gives you the ability to create, copy, paste, move, and execute cells. A few keys to know:

* h - Bring up help (ESC to dismiss)
* b - Create cell below
* a - Create cell above
* c - Copy cell
* v - Paste cell below
* Enter - Go into Edit Mode
* m - Change cell type to Markdown
* y - Change cell type to code
* ii - Interrupt kernel
* 00 - Restart kernel (press zero twice)

## Cell Types

* Code
* Markdown


## Markdown

Can make *italicized*, **bold**, and ``monospaced text``:

    Can make *italicized*, **bold**, and ``monospaced text``


Headers:

    # H1
    ## H2
    ### H3
 
Lists:

    * First item
    * Second item
    
Code:

    If you indent by four spaces you have code:
    
        def add(x, y):
            return x + y
    
## Cell Magic

Type and run ``%lsmagic`` in a cell to list the available magics.

Common magics include:

* ``%%time`` - time how long it takes to run cell
* ``%%!`` - run shell command
* ``%matplotlib inline`` - show matplotlib plots


## IPython Help

Add ``?`` after a function, method, etc. for documentation (you can also hit Shift-TAB four times in a notebook). Add ``??`` after a function, method, etc. to see the source.

# Setup

In [1]:
import pandas as pd
import matplotlib
import numpy as np

pd.__version__, matplotlib.__version__, np.__version__

('0.20.3', '2.0.2', '1.12.1')

In [2]:
# test for unicode
'\N{SNAKE}'

'🐍'

In [3]:
import sys
sys.getdefaultencoding() 

'utf-8'

In [4]:
sys.version

'3.6.2 | packaged by conda-forge | (default, Jul 23 2017, 22:59:30) \n[GCC 4.8.2 20140120 (Red Hat 4.8.2-15)]'

# Pandas Intro

## Installation

Presumably, you have pandas installed if you ran the cell after **Setup** successfully. The Anaconda distribution is a common way to get the Python scientific stack up and running quickly on most platforms. Running ``pip install pandas`` works as well.

In [5]:
# pandas has two main datatypes: a Series and a DataFrame
# A Series is like a column from a spreadsheet

s = pd.Series([0, 4, 6, 7])

In [6]:
# A DataFrame is like a spreadsheet

df = pd.DataFrame({'name': ['Fred', 'Johh', 'Joe', 'Abe'], 'age': s})

In [9]:
# We can do tab completion on objects that exist (shift tab brings up tooltip)
# ?? brings up source
df.age

0    0
1    4
2    6
3    7
Name: age, dtype: int64
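
As a small self-contained sketch of how the two datatypes relate (the variable names and values here are illustrative, not part of the class data), a column pulled out of a DataFrame is itself a Series, and attribute access is a shortcut for bracket access:

```python
import pandas as pd

# A Series is a single labelled column of values
ages = pd.Series([0, 4, 6, 7], name='age')

# A DataFrame is a collection of Series that share one index
people = pd.DataFrame({'name': ['Fred', 'John', 'Joe', 'Abe'], 'age': ages})

# Attribute access and bracket access return the same column
same = people.age.equals(people['age'])

# Each column of a DataFrame is itself a Series
col_type = type(people['name']).__name__
print(same, col_type)
```

Bracket access works for any column name, while attribute access only works when the name is a valid Python identifier that does not clash with an existing DataFrame attribute.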

# Datasets

For this class we will look at some time series data. The class will look at Central Park weather. The assignments will deal with El Nino data.

## Central Park


https://pastebin.com/vaB6QQGp

## El Nino

https://archive.ics.uci.edu/ml/datasets/El+Nino

In [10]:
%matplotlib inline
# I typically start with imports like this including the matplotlib magic 
# for most notebooks
import pandas as pd
import numpy as np 

# Getting Data
There are various ``pd.read_*`` functions for ingesting data, such as ``pd.read_csv`` and ``pd.read_excel``.

In [11]:
# not necessary if you started jupyter from the project directory
%ls data/
# should have central-park-raw.csv

[0m[01;32mcentral-park-raw.csv[0m*  [01;32mtao-all2.dat.gz[0m*  [01;32mvehicles.csv.zip[0m*


In [None]:
# if you execute this cell it will bring up a tooltip due to
# the ? at the end. You can also hit shift-tab 4 times
# if your cursor is after the v
# Hit escape to dismiss the tooltip
pd.read_csv?

In [12]:
# let's load the data and treat column 0 as a date
nyc = pd.read_csv('data/central-park-raw.csv', parse_dates=[0])
# Jupyter will print the result of the last command
nyc

Unnamed: 0,EST,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2006-01-01,42.0,37.0,32.0,32.0,30.0,28.0,85.0,74.0,62.0,...,10.0,10.0,8.0,9.0,3.0,10.0,0.00,8.0,,276.0
1,2006-01-02,48.0,44.0,39.0,38.0,34.0,29.0,92.0,71.0,49.0,...,10.0,8.0,4.0,18.0,5.0,24.0,0.63,5.0,Rain,76.0
2,2006-01-03,40.0,37.0,33.0,38.0,33.0,26.0,92.0,84.0,75.0,...,10.0,7.0,2.0,28.0,15.0,41.0,1.13,8.0,Rain,39.0
3,2006-01-04,38.0,34.0,29.0,36.0,26.0,19.0,85.0,72.0,59.0,...,10.0,10.0,4.0,15.0,7.0,20.0,0.00,3.0,,70.0
4,2006-01-05,50.0,44.0,37.0,38.0,35.0,32.0,92.0,71.0,50.0,...,10.0,6.0,2.0,15.0,5.0,21.0,0.05,6.0,Rain,251.0
5,2006-01-06,43.0,37.0,30.0,33.0,24.0,14.0,73.0,60.0,47.0,...,10.0,10.0,10.0,17.0,6.0,25.0,0.00,7.0,,317.0
6,2006-01-07,35.0,30.0,25.0,19.0,14.0,11.0,60.0,51.0,41.0,...,10.0,10.0,10.0,15.0,7.0,23.0,0.00,2.0,,267.0
7,2006-01-08,46.0,40.0,34.0,35.0,25.0,19.0,70.0,56.0,41.0,...,10.0,10.0,10.0,13.0,5.0,17.0,0.00,3.0,,192.0
8,2006-01-09,60.0,52.0,43.0,39.0,36.0,30.0,76.0,60.0,44.0,...,10.0,10.0,10.0,15.0,8.0,24.0,0.00,1.0,,249.0
9,2006-01-10,49.0,45.0,41.0,31.0,28.0,26.0,62.0,52.0,42.0,...,10.0,10.0,10.0,10.0,6.0,16.0,0.00,1.0,,261.0


In [13]:
# dataframes can get big, so only show the first bit
nyc.head()

Unnamed: 0,EST,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2006-01-01,42.0,37.0,32.0,32.0,30.0,28.0,85.0,74.0,62.0,...,10.0,10.0,8.0,9.0,3.0,10.0,0.0,8.0,,276.0
1,2006-01-02,48.0,44.0,39.0,38.0,34.0,29.0,92.0,71.0,49.0,...,10.0,8.0,4.0,18.0,5.0,24.0,0.63,5.0,Rain,76.0
2,2006-01-03,40.0,37.0,33.0,38.0,33.0,26.0,92.0,84.0,75.0,...,10.0,7.0,2.0,28.0,15.0,41.0,1.13,8.0,Rain,39.0
3,2006-01-04,38.0,34.0,29.0,36.0,26.0,19.0,85.0,72.0,59.0,...,10.0,10.0,4.0,15.0,7.0,20.0,0.0,3.0,,70.0
4,2006-01-05,50.0,44.0,37.0,38.0,35.0,32.0,92.0,71.0,50.0,...,10.0,6.0,2.0,15.0,5.0,21.0,0.05,6.0,Rain,251.0
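
If you do not have the data file at hand, the same ``pd.read_csv`` call can be tried on an in-memory string. This is a minimal sketch: the two rows below are a made-up sample shaped like the weather file, not the real data.

```python
import io
import pandas as pd

# Hypothetical two-row sample; note the leading space in ' Mean Humidity'
csv_text = """EST,Max TemperatureF, Mean Humidity,PrecipitationIn
2006-01-01,42,74,0.00
2006-01-02,48,71,T
"""

# parse_dates=[0] asks pandas to parse column 0 as dates
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=[0])

# The stray 'T' keeps PrecipitationIn from being inferred as a number
print(sample.dtypes)
```

The leading space in the header survives the load, which is the kind of messiness that has to be cleaned up by hand later.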


## Getting Data Assignment

For your assignment, you will look at El Nino data.

The [website](https://archive.ics.uci.edu/ml/datasets/El+Nino)  states:

    The data is stored in an ASCII files with one observation per line. Spaces separate fields and periods (.) denote missing values.


Load the ``data/tao-all2.dat.gz`` file into a data frame using ``pd.read_csv``.
Use the ``names`` variable for the initial column names (taken from website).
Replace missing values (``.``) with ``NaN``. Pull the year, month, and day columns into a single column using the ``parse_dates`` parameter (see the ``pd.read_csv`` docs for info on this).

In [31]:
names = '''obs
year
month
day
date
latitude
longitude
zon.winds
mer.winds
humidity
air temp.
s.s.temp.'''.split('\n')

In [39]:
nino1 = pd.read_csv('data/tao-all2.dat.gz', header=None)
nino1.head()

Unnamed: 0,0
0,1 80 3 7 800307 -0.02 -109.46 -6.8 0.7 . 26.14...
1,2 80 3 8 800308 -0.02 -109.46 -4.9 1.1 . 25.66...
2,3 80 3 9 800309 -0.02 -109.46 -4.5 2.2 . 25.69...
3,4 80 3 10 800310 -0.02 -109.46 -3.8 1.9 . 25.5...
4,5 80 3 11 800311 -0.02 -109.46 -4.2 1.5 . 25.3...


In [32]:
nino = pd.read_csv('data/tao-all2.dat.gz', sep=' ', names=names, na_values='.', parse_dates=[[1, 2, 3]])

In [33]:
nino.head()

Unnamed: 0,year_month_day,obs,date,latitude,longitude,zon.winds,mer.winds,humidity,air temp.,s.s.temp.
0,1980-03-07,1,800307,-0.02,-109.46,-6.8,0.7,,26.14,26.24
1,1980-03-08,2,800308,-0.02,-109.46,-4.9,1.1,,25.66,25.97
2,1980-03-09,3,800309,-0.02,-109.46,-4.5,2.2,,25.69,25.28
3,1980-03-10,4,800310,-0.02,-109.46,-3.8,1.9,,25.57,24.31
4,1980-03-11,5,800311,-0.02,-109.46,-4.2,1.5,,25.3,23.19
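
Depending on your pandas version, the nested-list form of ``parse_dates`` may emit a deprecation warning or be unavailable. A sketch of one alternative is to load the columns as numbers and assemble the date afterwards with ``pd.to_datetime``. The two sample rows and the assumption that two-digit years mean 19xx are mine, chosen to mirror the output shown for the real file.

```python
import io
import pandas as pd

# Hypothetical two-line sample in the same layout as tao-all2.dat
raw = io.StringIO(
    "1 80 3 7 800307 -0.02 -109.46 -6.8 0.7 . 26.14 26.24\n"
    "2 80 3 8 800308 -0.02 -109.46 -4.9 1.1 . 25.66 25.97\n"
)
names = ['obs', 'year', 'month', 'day', 'date', 'latitude', 'longitude',
         'zon.winds', 'mer.winds', 'humidity', 'air temp.', 's.s.temp.']

# Spaces separate fields and a lone period marks a missing value
demo = pd.read_csv(raw, sep=' ', names=names, na_values='.')

# Assumption: two-digit years belong to the 1900s
parts = demo[['year', 'month', 'day']].assign(year=lambda d: d['year'] + 1900)

# to_datetime can assemble dates from columns named year, month, day
demo['year_month_day'] = pd.to_datetime(parts)
print(demo[['year_month_day', 'humidity']])
```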


# Inspecting Data

In [20]:
# Interesting aside, the columns are actually an Index 
nyc.columns

Index(['EST', 'Max TemperatureF', 'Mean TemperatureF', 'Min TemperatureF',
       'Max Dew PointF', 'MeanDew PointF', 'Min DewpointF', 'Max Humidity',
       ' Mean Humidity', ' Min Humidity', ' Max Sea Level PressureIn',
       ' Mean Sea Level PressureIn', ' Min Sea Level PressureIn',
       ' Max VisibilityMiles', ' Mean VisibilityMiles', ' Min VisibilityMiles',
       ' Max Wind SpeedMPH', ' Mean Wind SpeedMPH', ' Max Gust SpeedMPH',
       'PrecipitationIn', ' CloudCover', ' Events', ' WindDirDegrees'],
      dtype='object')

In [21]:
# It is good to know whether each column has the [correct] type (object could mean string)
nyc.dtypes

EST                           datetime64[ns]
Max TemperatureF                     float64
Mean TemperatureF                    float64
Min TemperatureF                     float64
Max Dew PointF                       float64
MeanDew PointF                       float64
Min DewpointF                        float64
Max Humidity                         float64
 Mean Humidity                       float64
 Min Humidity                        float64
 Max Sea Level PressureIn            float64
 Mean Sea Level PressureIn           float64
 Min Sea Level PressureIn            float64
 Max VisibilityMiles                 float64
 Mean VisibilityMiles                float64
 Min VisibilityMiles                 float64
 Max Wind SpeedMPH                   float64
 Mean Wind SpeedMPH                  float64
 Max Gust SpeedMPH                   float64
PrecipitationIn                       object
 CloudCover                          float64
 Events                               object
 WindDirDe
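
With many columns it can be easier to ask pandas which ones are not numeric than to scan the ``dtypes`` listing by eye. A minimal sketch on a toy frame (the values are hypothetical):

```python
import pandas as pd

# Toy frame with one column that failed numeric inference
toy = pd.DataFrame({
    'Max TemperatureF': [42.0, 48.0],
    'PrecipitationIn': ['0.00', 'T'],
})

# Columns whose dtype is not numeric are candidates for cleanup
suspects = list(toy.select_dtypes(exclude='number').columns)
print(suspects)
```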

In [22]:
# we can also see how much space is taken up
nyc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3287 entries, 0 to 3286
Data columns (total 23 columns):
EST                           3287 non-null datetime64[ns]
Max TemperatureF              3285 non-null float64
Mean TemperatureF             3285 non-null float64
Min TemperatureF              3285 non-null float64
Max Dew PointF                3285 non-null float64
MeanDew PointF                3285 non-null float64
Min DewpointF                 3285 non-null float64
Max Humidity                  3285 non-null float64
 Mean Humidity                3285 non-null float64
 Min Humidity                 3285 non-null float64
 Max Sea Level PressureIn     3275 non-null float64
 Mean Sea Level PressureIn    3275 non-null float64
 Min Sea Level PressureIn     3275 non-null float64
 Max VisibilityMiles          3277 non-null float64
 Mean VisibilityMiles         3277 non-null float64
 Min VisibilityMiles          3277 non-null float64
 Max Wind SpeedMPH            3245 non-null float64
 M

In [25]:
nyc.describe()

Unnamed: 0,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,Mean Sea Level PressureIn,Min Sea Level PressureIn,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,CloudCover,WindDirDegrees
count,3285.0,3285.0,3285.0,3285.0,3285.0,3285.0,3285.0,3285.0,3285.0,3275.0,3275.0,3275.0,3277.0,3277.0,3277.0,3245.0,3244.0,3177.0,3277.0,3285.0
mean,62.930898,56.042314,48.701674,47.334855,41.676712,35.374125,78.707458,62.108676,45.009132,30.11374,30.006202,29.900043,9.939274,8.663717,6.523955,14.487827,5.826449,22.75543,3.231614,193.028919
std,18.006236,16.953644,16.303976,17.901201,18.706095,19.586057,15.652513,14.548359,15.90128,0.209992,0.222218,0.239583,0.406343,2.041796,3.910295,4.355743,2.996004,7.064674,2.745582,104.107605
min,16.0,12.0,4.0,-8.0,-12.0,-16.0,28.0,20.0,6.0,29.26,28.84,28.53,5.0,1.0,0.0,3.0,0.0,5.0,0.0,-1.0
25%,48.0,42.0,36.0,34.0,27.0,20.0,67.0,51.0,34.0,29.97,29.87,29.76,10.0,8.0,2.0,12.0,4.0,18.0,1.0,78.0
50%,64.0,57.0,49.0,50.0,43.0,36.0,80.0,62.0,43.0,30.1,30.01,29.91,10.0,10.0,9.0,14.0,5.0,22.0,3.0,236.0
75%,79.0,71.0,63.0,63.0,58.0,52.0,93.0,73.0,54.0,30.25,30.15,30.06,10.0,10.0,10.0,17.0,7.0,26.0,6.0,279.0
max,104.0,94.0,84.0,77.0,75.0,72.0,100.0,97.0,93.0,30.77,30.69,30.59,10.0,10.0,10.0,99.0,99.0,137.0,8.0,360.0


In [23]:
# just view the first 10 rows
nyc.head(10)

Unnamed: 0,EST,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2006-01-01,42.0,37.0,32.0,32.0,30.0,28.0,85.0,74.0,62.0,...,10.0,10.0,8.0,9.0,3.0,10.0,0.0,8.0,,276.0
1,2006-01-02,48.0,44.0,39.0,38.0,34.0,29.0,92.0,71.0,49.0,...,10.0,8.0,4.0,18.0,5.0,24.0,0.63,5.0,Rain,76.0
2,2006-01-03,40.0,37.0,33.0,38.0,33.0,26.0,92.0,84.0,75.0,...,10.0,7.0,2.0,28.0,15.0,41.0,1.13,8.0,Rain,39.0
3,2006-01-04,38.0,34.0,29.0,36.0,26.0,19.0,85.0,72.0,59.0,...,10.0,10.0,4.0,15.0,7.0,20.0,0.0,3.0,,70.0
4,2006-01-05,50.0,44.0,37.0,38.0,35.0,32.0,92.0,71.0,50.0,...,10.0,6.0,2.0,15.0,5.0,21.0,0.05,6.0,Rain,251.0
5,2006-01-06,43.0,37.0,30.0,33.0,24.0,14.0,73.0,60.0,47.0,...,10.0,10.0,10.0,17.0,6.0,25.0,0.0,7.0,,317.0
6,2006-01-07,35.0,30.0,25.0,19.0,14.0,11.0,60.0,51.0,41.0,...,10.0,10.0,10.0,15.0,7.0,23.0,0.0,2.0,,267.0
7,2006-01-08,46.0,40.0,34.0,35.0,25.0,19.0,70.0,56.0,41.0,...,10.0,10.0,10.0,13.0,5.0,17.0,0.0,3.0,,192.0
8,2006-01-09,60.0,52.0,43.0,39.0,36.0,30.0,76.0,60.0,44.0,...,10.0,10.0,10.0,15.0,8.0,24.0,0.0,1.0,,249.0
9,2006-01-10,49.0,45.0,41.0,31.0,28.0,26.0,62.0,52.0,42.0,...,10.0,10.0,10.0,10.0,6.0,16.0,0.0,1.0,,261.0


In [26]:
# Transposing the data often makes it easier to view
nyc.T  # nyc.transpose()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3277,3278,3279,3280,3281,3282,3283,3284,3285,3286
EST,2006-01-01 00:00:00,2006-01-02 00:00:00,2006-01-03 00:00:00,2006-01-04 00:00:00,2006-01-05 00:00:00,2006-01-06 00:00:00,2006-01-07 00:00:00,2006-01-08 00:00:00,2006-01-09 00:00:00,2006-01-10 00:00:00,...,2014-12-22 00:00:00,2014-12-23 00:00:00,2014-12-24 00:00:00,2014-12-25 00:00:00,2014-12-26 00:00:00,2014-12-27 00:00:00,2014-12-28 00:00:00,2014-12-29 00:00:00,2014-12-30 00:00:00,2014-12-31 00:00:00
Max TemperatureF,42,48,40,38,50,43,35,46,60,49,...,44,46,58,62,50,55,54,44,34,32
Mean TemperatureF,37,44,37,34,44,37,30,40,52,45,...,40,45,51,53,45,50,49,39,31,30
Min TemperatureF,32,39,33,29,37,30,25,34,43,41,...,35,43,44,44,40,44,43,34,28,27
Max Dew PointF,32,38,38,36,38,33,19,35,39,31,...,42,44,57,60,29,35,43,25,17,12
MeanDew PointF,30,34,33,26,35,24,14,25,36,28,...,35,42,47,40,28,31,37,19,13,8
Min DewpointF,28,29,26,19,32,14,11,19,30,26,...,29,41,43,27,27,29,26,15,8,5
Max Humidity,85,92,92,85,92,73,60,70,76,62,...,89,96,100,100,64,53,92,53,58,55
Mean Humidity,74,71,84,72,71,60,51,56,60,52,...,82,91,96,69,53,47,73,42,47,43
Min Humidity,62,49,75,59,50,47,41,41,44,42,...,75,86,92,38,42,41,53,31,36,30


In [27]:
# Here is the size (num rows, num cols)
nyc.shape

(3287, 23)

In [28]:
# We can inspect the index
nyc.index

RangeIndex(start=0, stop=3287, step=1)

In [29]:
# We can use the .set_index method to use another column as the index
nyc.set_index('EST')

EST,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
2006-01-01,42.0,37.0,32.0,32.0,30.0,28.0,85.0,74.0,62.0,30.20,...,10.0,10.0,8.0,9.0,3.0,10.0,0.00,8.0,,276.0
2006-01-02,48.0,44.0,39.0,38.0,34.0,29.0,92.0,71.0,49.0,30.24,...,10.0,8.0,4.0,18.0,5.0,24.0,0.63,5.0,Rain,76.0
2006-01-03,40.0,37.0,33.0,38.0,33.0,26.0,92.0,84.0,75.0,30.05,...,10.0,7.0,2.0,28.0,15.0,41.0,1.13,8.0,Rain,39.0
2006-01-04,38.0,34.0,29.0,36.0,26.0,19.0,85.0,72.0,59.0,30.09,...,10.0,10.0,4.0,15.0,7.0,20.0,0.00,3.0,,70.0
2006-01-05,50.0,44.0,37.0,38.0,35.0,32.0,92.0,71.0,50.0,29.81,...,10.0,6.0,2.0,15.0,5.0,21.0,0.05,6.0,Rain,251.0
2006-01-06,43.0,37.0,30.0,33.0,24.0,14.0,73.0,60.0,47.0,29.82,...,10.0,10.0,10.0,17.0,6.0,25.0,0.00,7.0,,317.0
2006-01-07,35.0,30.0,25.0,19.0,14.0,11.0,60.0,51.0,41.0,29.99,...,10.0,10.0,10.0,15.0,7.0,23.0,0.00,2.0,,267.0
2006-01-08,46.0,40.0,34.0,35.0,25.0,19.0,70.0,56.0,41.0,30.10,...,10.0,10.0,10.0,13.0,5.0,17.0,0.00,3.0,,192.0
2006-01-09,60.0,52.0,43.0,39.0,36.0,30.0,76.0,60.0,44.0,30.25,...,10.0,10.0,10.0,15.0,8.0,24.0,0.00,1.0,,249.0
2006-01-10,49.0,45.0,41.0,31.0,28.0,26.0,62.0,52.0,42.0,30.50,...,10.0,10.0,10.0,10.0,6.0,16.0,0.00,1.0,,261.0


In [30]:
# undo .set_index with .reset_index
nyc.set_index('EST').reset_index()

Unnamed: 0,EST,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,MeanDew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
0,2006-01-01,42.0,37.0,32.0,32.0,30.0,28.0,85.0,74.0,62.0,...,10.0,10.0,8.0,9.0,3.0,10.0,0.00,8.0,,276.0
1,2006-01-02,48.0,44.0,39.0,38.0,34.0,29.0,92.0,71.0,49.0,...,10.0,8.0,4.0,18.0,5.0,24.0,0.63,5.0,Rain,76.0
2,2006-01-03,40.0,37.0,33.0,38.0,33.0,26.0,92.0,84.0,75.0,...,10.0,7.0,2.0,28.0,15.0,41.0,1.13,8.0,Rain,39.0
3,2006-01-04,38.0,34.0,29.0,36.0,26.0,19.0,85.0,72.0,59.0,...,10.0,10.0,4.0,15.0,7.0,20.0,0.00,3.0,,70.0
4,2006-01-05,50.0,44.0,37.0,38.0,35.0,32.0,92.0,71.0,50.0,...,10.0,6.0,2.0,15.0,5.0,21.0,0.05,6.0,Rain,251.0
5,2006-01-06,43.0,37.0,30.0,33.0,24.0,14.0,73.0,60.0,47.0,...,10.0,10.0,10.0,17.0,6.0,25.0,0.00,7.0,,317.0
6,2006-01-07,35.0,30.0,25.0,19.0,14.0,11.0,60.0,51.0,41.0,...,10.0,10.0,10.0,15.0,7.0,23.0,0.00,2.0,,267.0
7,2006-01-08,46.0,40.0,34.0,35.0,25.0,19.0,70.0,56.0,41.0,...,10.0,10.0,10.0,13.0,5.0,17.0,0.00,3.0,,192.0
8,2006-01-09,60.0,52.0,43.0,39.0,36.0,30.0,76.0,60.0,44.0,...,10.0,10.0,10.0,15.0,8.0,24.0,0.00,1.0,,249.0
9,2006-01-10,49.0,45.0,41.0,31.0,28.0,26.0,62.0,52.0,42.0,...,10.0,10.0,10.0,10.0,6.0,16.0,0.00,1.0,,261.0
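
Note that ``.set_index`` and ``.reset_index`` return new dataframes rather than changing the original. A small sketch of the round trip on a toy frame (hypothetical values) makes this visible:

```python
import pandas as pd

toy = pd.DataFrame({
    'EST': pd.to_datetime(['2006-01-01', '2006-01-02']),
    'Max TemperatureF': [42.0, 48.0],
})

# set_index returns a new frame; toy itself keeps its RangeIndex
indexed = toy.set_index('EST')

# With dates as the index we can look rows up by label
jan2_max = indexed.loc[pd.Timestamp('2006-01-02'), 'Max TemperatureF']

# reset_index turns the index back into a regular column
restored = indexed.reset_index()
print(jan2_max, restored.equals(toy))
```

To keep the new index you would assign the result back, for example ``nyc = nyc.set_index('EST')``.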


## Inspecting Data Assignment

Now it is your turn to inspect the El Nino data.
 
* What are the columns of the dataframe?
* What are the types of the columns?
* How would you print the first 10 rows of data?
* How would you transpose the data?
* What is the shape of the data?
* How would we inspect the index?

# Tweak Data

  *In Data Science, 80% of time spent prepare data, 20% of time spent complain about need for  prepare data.*
  
  -@bigdataborat
  
Let's see how we spend 80% of our time.  


In [None]:
# I like to start by inspecting the columns. Pandas will try to 
# infer types from CSV files, but doesn't always do the right thing.
# Sometimes the data is just messy.
nyc.dtypes

In [None]:
# See those spaces in front of some of the Columns?
# Remove spaces from front/end of column names
nyc.columns = [x.strip() for x in nyc.columns]

In [None]:
# Use underscores to enable attribute access/jupyter completion
nyc.columns = [x.replace(' ', '_') for x in nyc.columns]
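
The same cleanup can also be written with the ``.str`` accessor on the column Index instead of list comprehensions. This is just an alternative sketch on a toy frame; the column names are copied from the weather data.

```python
import pandas as pd

toy = pd.DataFrame(columns=['EST', ' Mean Humidity', ' Max Wind SpeedMPH'])

# Strip surrounding whitespace, then swap inner spaces for underscores
toy.columns = toy.columns.str.strip().str.replace(' ', '_')

cleaned = list(toy.columns)
print(cleaned)
```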

In [None]:
# For non-numeric columns, .value_counts gives us 
# counts of the data. One would think that 
# PrecipitationIn should be numeric....
nyc.PrecipitationIn.value_counts()

In [None]:
# There is a "T" in there. Trace? 
# Convert "T" to 0.001
nyc.PrecipitationIn.replace("T", '0.001')
# Convert to numeric data
nyc.PrecipitationIn = pd.to_numeric(nyc.PrecipitationIn.replace("T", '0.001'))
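
Here is the same conversion on a tiny hypothetical Series, so you can see what each step produces. It also shows ``errors='coerce'``, an option of ``pd.to_numeric`` that turns unparseable values into ``NaN`` instead of raising, which is handy when you would rather treat trace amounts as missing.

```python
import pandas as pd

precip = pd.Series(['0.00', 'T', '0.63', '1.13'])

# Replace the trace marker first, then convert the strings to floats
as_numbers = pd.to_numeric(precip.replace('T', '0.001'))

# Alternative: let anything unparseable become NaN
coerced = pd.to_numeric(precip, errors='coerce')

print(as_numbers.tolist())
print(coerced.isna().tolist())
```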

In [None]:
nyc.Events.value_counts()

In [None]:
# can perform string operations on string columns off of the "str" attribute
nyc.Events.str.upper()

In [None]:
# Looks like the type of this column is mixed
type(nyc.Events[0])

In [None]:
set(nyc.Events.apply(type))

In [None]:
# Replace nan with ''
nyc['Events'] = nyc.Events.fillna('')

In [None]:
set(nyc.Events.apply(type))
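
The mixed types come from missing values: ``NaN`` is a float, so an object column of strings with gaps holds both ``str`` and ``float`` values. A minimal sketch with made-up event values:

```python
import numpy as np
import pandas as pd

# Object column with a missing value in the middle
events = pd.Series(['Rain', np.nan, 'Fog-Rain'], dtype=object)

# NaN is a float, so two Python types show up
before = {type(v).__name__ for v in events}

# After filling with an empty string every value is a str
after = {type(v).__name__ for v in events.fillna('')}

print(before, after)
```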

In [None]:
# convert inches to cm
# If we multiply a column (Series), we are *broadcasting*
# the operation to every cell
nyc.PrecipitationIn * 2.54

In [None]:
# can also apply an arbitrary function, though this will be slow as it is not vectorized
#   map - works with a dictionary (mapping value to new value),  series (like dict), function
#   apply - only works with function as a parameter. Allows extra parameters
#   aggregate (agg) - works with function or list of functions. If reducing function, returns a scalar.
#   transform - wraps agg and won't do a reduction
def to_cm(val):
    return val * 2.54

nyc.PrecipitationIn.transform(to_cm)

In [None]:
%%timeit
nyc.PrecipitationIn.map(to_cm)

In [None]:
%%timeit
nyc.PrecipitationIn.transform(to_cm)

In [None]:
%%timeit
nyc.PrecipitationIn*2.54
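
Outside of a notebook, the ``%%timeit`` comparison can be approximated with the standard library ``timeit`` module. The sketch below uses synthetic data rather than the weather file, and the timings will vary by machine, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic precipitation values in inches
inches = pd.Series(np.linspace(0, 2, 10_000))

def to_cm(val):
    return val * 2.54

# Both approaches compute the same values
vectorized = inches * 2.54
mapped = inches.map(to_cm)
same = np.allclose(vectorized, mapped)

# Time 20 runs of each; the vectorized form is usually much faster
t_vec = timeit.timeit(lambda: inches * 2.54, number=20)
t_map = timeit.timeit(lambda: inches.map(to_cm), number=20)
print(same, t_vec, t_map)
```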

In [None]:
# can add and drop columns (axis=1 means along the columns axis)
nyc['State'] = 'NYC'
nyc = nyc.drop(['State'], axis=1)
nyc

## Tweak Data Assignment
* Replace the periods and spaces in the column names with underscores
* The temperatures are stored as Celsius. Create a new column, ``air_temp_F``, using Fahrenheit
  (Tf = Tc*9/5 + 32)
* The wind speeds are in meters per second. Create new columns, with ``_mph`` appended to the names, that use miles per hour (1 m/s = 2.237 mph)
* Convert the ``date`` column to a date type.
* Drop the obs column
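
To check your arithmetic before touching the real data, here is a sketch of the two unit conversions on a toy frame. The column names ``air_temp_`` and ``zon_winds`` are assumptions about what the names look like after the periods and spaces are replaced, and the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({'air_temp_': [25.0, 30.0], 'zon_winds': [-6.8, 10.0]})

# Celsius to Fahrenheit: Tf = Tc*9/5 + 32, broadcast over the column
toy['air_temp_F'] = toy['air_temp_'] * 9 / 5 + 32

# Meters per second to miles per hour
toy['zon_winds_mph'] = toy['zon_winds'] * 2.237

print(toy)
```

As a sanity check, 25 degrees Celsius should come out as 77 degrees Fahrenheit.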