<!--BOOK_INFORMATION-->
*This notebook contains both adapted and unmodified material from:
- the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*
- data-science-ipython-notebooks by Donne Martin; the content is available [on GitHub](https://github.com/donnemartin/data-science-ipython-notebooks)**

\* The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT).

\*\* The text is released under an open source license.

# 08 Data  
## CLASS MATERIAL

<br> <a href='#InstallPandas'>1. Install Pandas</a>
<br> <a href='#PandasDataStructures'>2. Pandas Data Structures</a> 
<br> <a href='#DataIndexingSelection'>3. Data Indexing and Selection</a> 
<br> <a href='#PerformingOperationsDataPandas'>4. Performing Operations on Data in Pandas</a> 
<br><a href='#DataCleaning'>5. Data Cleaning</a>
<br><a href='#CombiningDatasets'>6. Combining Datasets</a>
<br><a href='#AggregationGrouping'>7. Summarising Data with Aggregation and Grouping</a>
<br><a href='#StringData'>8. Working with `String` Data</a>
<br><a href='#TimeSeries'>9. Working with Time Series</a>
<br><a href='#ReviewExercises'>10. Review Exercises</a>

# Download the new class notes.
__Navigate to the directory where your files are stored.__

__Update the course notes by downloading the changes__




##### Windows
Search for __Git Bash__ in the programs menu.

Select __Git Bash__; a terminal will open.

Use `cd` to navigate to *inside* the __ILAS_PyEng2019__ repository you downloaded. 

Run the command:
>`./automerge`



##### Mac
Open a terminal. 

Use `cd` to navigate to *inside* the __ILAS_PyEng2019__ repository you downloaded. 

Run the command:
>`sudo ./automerge`

Enter your password when prompted. 

This week we begin the __Applications of Programming__ part of the course. 

In this part of the course, we will explore some uses of programming in which you can apply the fundamental techniques we have studied so far and your specialism as engineers. 

The first topic we will cover is handling of data, specifically large and unstructured data sets. 

__Data Science__ :  
A field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.

It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and informatics. 

Data sets are modelled and curated to find patterns and make predictions about the future. 

__Machine Learning__ :  
An important tool used with Data Science is machine learning. 

In machine learning, algorithms acquire knowledge or skill through experience. 

Therefore, machine learning relies on big data sets to identify patterns.

__Artificial Intelligence (AI):__
<br>AI is the study of enabling machines to make decisions independently without the need for human interference. 

Therefore, AI tends to be used in situations where adapting to new scenarios is important.
<br>As this often involves acquiring knowledge and learning to apply it, Machine Learning is a widely used approach for AI. 

AI has broad application ranging from robotics to text analysis.

<img src="img/machine_learning_AI.png" alt="Drawing" style="width: 300px;"/>

In Data Science, Machine Learning and AI, typical problems involve large amounts of __unstructured data__.

__Unstructured data__ :  
Information that is not organised in a pre-defined manner (e.g. does not fit nicely into a NumPy array). 
- text-heavy
- mixed data types (dates, numbers, text info ...)  
- missing data (transmitted data packet, sensor data)
- noise (sensor data, experimental results) 

### Lesson Goal

- Data structures for data science : Pandas `DataFrame` and `Series` 
- Data cleaning: remove missing values, filtering rows or columns by some criteria
- Calculate statistics and analytics 
    - e.g. average, median, max, min of each column
    - does column A correlate with column B?
    - distribution of data in column C

- Data visualization 



### Fundamental programming concepts
 
We will use Pandas, a powerful Python tool for data handling. 

Pandas is built on top of NumPy and enables fast analysis, data cleaning and preparation. 

We can think of Pandas as Python’s version of Microsoft’s Excel.

Unlike Excel, Pandas works well with data from a wide variety of sources, such as an Excel sheet, a csv file, or even a webpage.

## Pandas
<a id='Pandas'></a>
Pandas stands for “Python Data Analysis Library”.

<img src="img/panda.jpg" alt="Drawing" style="width: 300px;"/>






<a id='InstallPandas'></a>
# 1. Install Pandas



##### Windows 

1. Open the Anaconda Prompt from the Start menu.
<p align="center">
  <img src="img/anaconda_prompt.png" alt="Drawing" style="width: 300px;"/>
</p>

1. The window that opens will look like the command line. In the window type the following code then press 'Enter':
>`conda install -c anaconda pip`

1. When the installation completes type the following code then press 'Enter':
>`pip install pandas`





##### Mac

1. Open a terminal. 

1. Type the following code then press 'Enter':
>`conda install -c anaconda pip`

1. When the installation completes type the following code then press 'Enter':
>`pip install pandas`

To check that the installation has worked, try importing pandas in Spyder or Jupyter notebook and run the code. If no error is generated, you have installed pandas successfully. 
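One quick way to run this check is sketched below. It assumes only that pandas is installed in the environment you are running; the variable name `installed_version` is chosen here for illustration.

```python
import pandas as pd

# If this line is reached without an ImportError, pandas is installed.
installed_version = pd.__version__
print("pandas version:", installed_version)
```

The printed version number is also useful when reading the online documentation, as some features differ between versions.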

Just as we generally import Numpy as ``np``, we will import Pandas as ``pd``:

In [204]:
import pandas as pd

We will also use numpy and matplotlib in this class.

In [205]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from IPython.display import display

## Documentation

In Jupyter Notebook you can quickly view the contents of a package (by using the tab-completion feature) as well as the documentation of various functions (using the ``?`` character). 

For example, to display all the contents of the pandas namespace, you can type

```ipython
In [3]: pd.<TAB>
```

And to display Pandas's built-in documentation, you can use this:

```ipython
In [4]: pd?
```

More detailed documentation, along with tutorials and other resources, can be found at http://pandas.pydata.org/.

In [206]:
#pd?

<a id='PandasDataStructures'></a>
# 2. Pandas Data Structures

<br> <a href='#Series'>2.1 `Series`</a>
<br> <a href='#ConstructingSeriesObjects'>2.2 Constructing `Series` Objects</a>
<br> <a href='#DataFrame'>2.3 `DataFrame`</a>
<br> <a href='#ConstructingDataFrame'>2.4 Constructing a `DataFrame`</a>
<br> <a href='#ConstructingDataFrameImportingData'>2.5 Constructing a `DataFrame` by Importing Data</a>


We will first look at the data structures provided by the Pandas library.




The NumPy `array` data structure is useful for clean, well-organized data typically seen in numerical computing tasks.

It is limited when more flexibility is needed:
 - attaching labels to data
 - working with missing data
 - grouping data 

``DataFrame``s are essentially multidimensional arrays with row and column labels.

They often contain heterogeneous types and/or missing data.
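The contrast can be sketched as follows. The labels and values are made up for illustration; the point is only that a `DataFrame` can hold labels, mixed types and missing values at the same time.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single data type and carries no labels.
heights = np.array([1.82, 1.77, 1.68])

# A DataFrame can label rows and columns, mix types, and mark missing values.
# The names and values here are illustrative only.
people = pd.DataFrame(
    {"height_m": [1.82, 1.77, np.nan],
     "sex": ["M", "M", "F"]},
    index=["JW-1", "JW-2", "JW-3"],
)

print(heights.dtype)
print(people.dtypes)          # one data type per column
print(people.isna().sum())    # number of missing values per column
```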

<a id='Series'></a>
## 2.1 `Series`

A Pandas ``Series`` is a one-dimensional array of indexed data.

It can be created from a list or array:

In [207]:
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64

Like with a NumPy array, data can be accessed with square-bracket *implicitly defined integer* indexing:

In [208]:
data[1]

data[1:3]

1    0.50
2    0.75
dtype: float64

### ``Series`` as generalized NumPy array

... or, unlike a numpy array, data can be accessed with square-bracket *explicitly defined* index associated with the values:

In [209]:
# strings as indices
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

data['b']

0.5

In [210]:
# non-contiguous or non-sequential indices
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64

<a id='ConstructingSeriesObjects'></a>
## 2.2 Constructing Series Objects



A Pandas ``Series`` constructed from scratch always has the following format:

```python
>>> pd.Series(data, index=index)
```

``index`` is an optional argument (default : integer array)

``data`` can be given in various ways e.g.  list or NumPy array.

In [211]:
print(pd.Series([2, 4, 6]), 
      end='\n\n')

print(pd.Series(5, index=[100, 200, 300]), # data is repeated scalar
      end='\n\n') 

print(pd.Series({2:'a', 1:'b', 3:'c'}), # data is dictionary (see Supplementary material)
      end='\n\n') 

print(pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2]), # index is explicitly set
      end='\n\n')

0    2
1    4
2    6
dtype: int64

100    5
200    5
300    5
dtype: int64

2    a
1    b
3    c
dtype: object

3    c
2    a
dtype: object



<a id='DataFrame'></a>
## 2.3 `DataFrame` 

`Series` :  behaves as a 1D array with user-specifiable row indices. 

`DataFrame` : behaves as a 2D array with both user-specifiable row indices *and* column names.



2D array : ordered sequence of aligned one-dimensional columns.

`DataFrame` :  sequence of `Series` objects with the same index order.

Example : Construct two series...

In [212]:
# Construct two series
area_dict = {'California': 423967, 
             'Texas': 695662, 
             'New York': 141297,
             'Florida': 170312, 
             'Illinois': 149995}

population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

population = pd.Series(population_dict)
area = pd.Series(area_dict)

Combine two series as data frame

In [213]:
states = pd.DataFrame({'population': population,
                       'area': area})

states

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


__Potential source of confusion:__
<br>In a two-dimensional NumPy array called `data`, ``data[0]`` will return the first *row*. 

For a ``DataFrame`` called `data`, ``data['col0']`` will return the first *column*.
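A minimal sketch of this difference, using a small made-up array. The column names `col0` and `col1` are illustrative, and `iloc` (covered in section 3) is used only to show how the first row of a `DataFrame` can still be reached.

```python
import numpy as np
import pandas as pd

values = np.array([[1, 2],
                   [3, 4]])

arr = values
df = pd.DataFrame(values, columns=["col0", "col1"])

first_row = arr[0]           # NumPy: square brackets select a row
first_col = df["col0"]       # pandas: square brackets select a column by name
first_row_df = df.iloc[0]    # pandas: first row, selected by position

print(first_row)
print(first_col.tolist())
print(first_row_df.tolist())
```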

## 2.4 Constructing a `DataFrame`




<a id='ConstructingDataFrame'></a>

A Pandas ``DataFrame`` can be constructed in a variety of ways, always with the following format:

```python
>>> pd.DataFrame(data, columns=columns)
```

``columns`` is an optional argument (default : labels taken from ``data`` where it provides them, e.g. dictionary keys; otherwise integers 0, 1, 2, ...)

``data`` can be given in various ways ...

#### From a single `Series` object

A ``DataFrame`` is a collection of ``Series`` objects, and a single-column ``DataFrame`` can be constructed from a single ``Series``:

In [214]:
# These two lines can be used interchangeably

population_df = pd.DataFrame(population, columns=['population']) # Series, columns = series name

population_df = pd.DataFrame({'population': population}) # Series name, series. dict structure

population_df

Unnamed: 0,population
California,38332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


#### From a dictionary of Series objects

In [215]:
pd.DataFrame({'population': population,
              'area': area})

Unnamed: 0,population,area
California,38332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


#### From a python `dict` (dictionary)

In [216]:
df = pd.DataFrame({'keys':['A', 'B', 'C', 'D'],
                   'vals':[[0,0,0],
                         [0,0,0,0],
                         [0,0,0,0,0],
                         [0,0,0,0,0,0]]},
                  
                   columns=['keys', 'vals'])

df

Unnamed: 0,keys,vals
0,A,"[0, 0, 0]"
1,B,"[0, 0, 0, 0]"
2,C,"[0, 0, 0, 0, 0]"
3,D,"[0, 0, 0, 0, 0, 0]"


#### From a list of dicts

Any list of dictionaries can be made into a ``DataFrame``.


In [217]:
data = [{'a': 1, 'b': 1},
        {'a': 2, 'b': 10},
        {'a': 3, 'b': 1000}]


pd.DataFrame(data)

Unnamed: 0,a,b
0,1,1
1,2,10
2,3,1000


If some keys are missing from one of the dictionaries, Pandas will fill the gaps with ``NaN`` (i.e., "not a number"):

In [218]:
pd.DataFrame([{'a': 1, 'b': 2}, 
              {'b': 3, 'c': 4}])

Unnamed: 0,a,b,c
0,1.0,2,
1,,3,4.0
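A side effect visible in the output above is that columns containing ``NaN`` are stored as floating point numbers, while complete integer columns stay as integers. The sketch below repeats the example and inspects the result; `df_missing` is a name chosen here for illustration.

```python
import pandas as pd

df_missing = pd.DataFrame([{'a': 1, 'b': 2},
                           {'b': 3, 'c': 4}])

print(df_missing.dtypes)   # 'a' and 'c' contain NaN, 'b' does not
print(df_missing.isna())   # True wherever a value is missing
```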


#### From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.

In [219]:
pd.DataFrame(np.random.rand(3, 2),   # 3 x 2 array
             columns=['foo', 'bar'], # column names
             index=['a', 'b', 'c'])  # row indices

Unnamed: 0,foo,bar
a,0.34376,0.198665
b,0.157869,0.276743
c,0.125465,0.681511


## 2.5 Constructing a `DataFrame` by Importing Data



Often we want to use data from an external source such as a website or external data file. 

Pandas' IO tools (input-output tools) make it easy to import data from almost any data source. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

The choice of function to use depends on the file format:

Examples:

`read_csv` : for delimited files including .txt files

`read_json` : JavaScript Object Notation primarily used for transmitting data between a web application and a server.

The function argument is the location of the file to import.

This can be a file location on your computer...

In [220]:
students = pd.read_csv('sample_data/sample_student_data.csv')
students.head()



Unnamed: 0,Student,Sex,DOB,Height,Weight,BP
0,(ID),M/F,dd/mm/yy,m,kg,mmHg
1,JW-1,M,19/12/1995,1.82,92.4,119/76
2,JW-2,M,11/01/1996,1.77,80.9,114/73
3,JW-3,F,02/10/1995,1.68,69.7,124/79
4,JW-4,M,11/01/1996,1.77,80.9,114/73


.. or an online location

In [221]:
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'  
chipotle = pd.read_csv(url, sep = '\t')  # tsv file (tab separated): the separator is specified when importing
chipotle.head(10)

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98
5,3,1,Chicken Bowl,"[Fresh Tomato Salsa (Mild), [Rice, Cheese, Sou...",$10.98
6,3,1,Side of Chips,,$1.69
7,4,1,Steak Burrito,"[Tomatillo Red Chili Salsa, [Fajita Vegetables...",$11.75
8,4,1,Steak Soft Tacos,"[Tomatillo Green Chili Salsa, [Pinto Beans, Ch...",$9.25
9,5,1,Steak Burrito,"[Fresh Tomato Salsa, [Rice, Black Beans, Pinto...",$9.25
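The reader functions also accept file-like objects, which is convenient for trying them out without a file on disk. The sketch below uses made-up text held in memory (`csv_text` and `json_text` are hypothetical stand-ins for file contents) and wraps it in `io.StringIO`.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory instead of on disk.
csv_text = "name,height\nA,1.82\nB,1.77\n"
json_text = '[{"name": "A", "height": 1.82}, {"name": "B", "height": 1.77}]'

from_csv = pd.read_csv(io.StringIO(csv_text))     # comma-delimited text
from_json = pd.read_json(io.StringIO(json_text))  # list of JSON records

print(from_csv)
print(from_json)
```

Both calls give a two-row `DataFrame` with the columns `name` and `height`.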


### Assigning Names to Unlabelled Data.
<a id='AssigningNamesUnlabelledData'></a>
Sometimes your data doesn't have column or row names.

`pandas` will try to create them.

`sample_data/noHeader_noIndex.csv` contains the data:

```Python
 1.053,  2.105,  3.158,  4.211,  6.056
48.348, 68.733, 59.796, 54.123, 74.452
```


The first row is automatically used as the column headings.

In [222]:
pd.read_csv('sample_data/noHeader_noIndex.csv')

Unnamed: 0,1.053,2.105,3.158,4.211,6.056
0,48.348,68.733,59.796,54.123,74.452


Therefore, if there are no headers in the imported file, this should be stated on import, in which case integer column names are created...

In [223]:
pd.read_csv('sample_data/noHeader_noIndex.csv', header=None)


Unnamed: 0,0,1,2,3,4
0,1.053,2.105,3.158,4.211,6.056
1,48.348,68.733,59.796,54.123,74.452


... or assigned by the user:

In [224]:
headers = ["U","V","X","Y","Z"]

pd.read_csv('sample_data/noHeader_noIndex.csv', names=headers)


Unnamed: 0,U,V,X,Y,Z
0,1.053,2.105,3.158,4.211,6.056
1,48.348,68.733,59.796,54.123,74.452


Note, this can also be used to *replace* existing column names on import as we will see later...
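A minimal sketch of replacing existing column names, assuming a small made-up file with a header row. Passing `header=0` tells `read_csv` that row 0 holds the existing names, and `names` supplies the replacements.

```python
import io
import pandas as pd

csv_text = "a,b,c\n1,2,3\n4,5,6\n"   # made-up data with a header row

# header=0 marks row 0 as the existing header; names= replaces it.
renamed = pd.read_csv(io.StringIO(csv_text),
                      header=0,
                      names=["U", "V", "X"])
print(renamed)
```

Without `header=0`, the original header row would be read in as a row of data.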

### Import Options
<a id='ImportOptions'></a>
The read_csv function has a large number of keyword arguments:
http://pandas.pydata.org/pandas-docs/version/0.23/generated/pandas.read_csv.html





Here are two examples. 

`skiprows`: rows to omit

`sample_data/sample_student_data.csv` has some unnecessary information (the units of each column) in row 0.

In [225]:
pd.read_csv('sample_data/sample_student_data.csv').head()

Unnamed: 0,Student,Sex,DOB,Height,Weight,BP
0,(ID),M/F,dd/mm/yy,m,kg,mmHg
1,JW-1,M,19/12/1995,1.82,92.4,119/76
2,JW-2,M,11/01/1996,1.77,80.9,114/73
3,JW-3,F,02/10/1995,1.68,69.7,124/79
4,JW-4,M,11/01/1996,1.77,80.9,114/73


In [226]:
pd.read_csv('sample_data/sample_student_data.csv', skiprows=[1]).head()

Unnamed: 0,Student,Sex,DOB,Height,Weight,BP
0,JW-1,M,19/12/1995,1.82,92.4,119/76
1,JW-2,M,11/01/1996,1.77,80.9,114/73
2,JW-3,F,02/10/1995,1.68,69.7,124/79
3,JW-4,M,11/01/1996,1.77,80.9,114/73
4,JW-5,F,02/10/1995,1.68,69.7,124/79


`index_col`: the values in any column can be used as the row index, i.e. the labels used to select a row.

In [227]:
# DOB as index
pd.read_csv('sample_data/sample_student_data.csv', 
            skiprows = [1], 
            index_col = 2).head()

# Student (JW- code) as index
pd.read_csv('sample_data/sample_student_data.csv', 
            skiprows = [1], 
            index_col = 0).head()


Student,Sex,DOB,Height,Weight,BP
JW-1,M,19/12/1995,1.82,92.4,119/76
JW-2,M,11/01/1996,1.77,80.9,114/73
JW-3,F,02/10/1995,1.68,69.7,124/79
JW-4,M,11/01/1996,1.77,80.9,114/73
JW-5,F,02/10/1995,1.68,69.7,124/79
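Once a column has been set as the index, its values can be used to select rows by label. The sketch below uses a made-up, shortened stand-in for the student file (`csv_text`) so that it runs without the data file; `loc` is covered in section 3.

```python
import io
import pandas as pd

# Made-up stand-in for the student file: a header row, a units row, then data.
csv_text = ("Student,Sex,Height\n"
            "(ID),M/F,m\n"
            "JW-1,M,1.82\n"
            "JW-2,M,1.77\n")

students_demo = pd.read_csv(io.StringIO(csv_text),
                            skiprows=[1],    # drop the units row
                            index_col=0)     # use 'Student' as the row index

row = students_demo.loc["JW-2"]              # select a row by its label
print(row)
```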


<a id='DataIndexingSelection'></a>
# 3. Data Indexing and Selection

<br> <a href='#DataSelectionSeries'>3.1 Data Selection in `Series`</a>
<br> <a href='#SeriesIndexers'>3.2 `Series` Indexers: loc & iloc</a> 
<br> <a href='#DataSelectionDataFrame'>3.3 Data Selection in  `DataFrame`</a> 
<br> <a href='#DataFrameIndexers'>3.4  `DataFrame` Indexers: loc & iloc</a> 




Indexing (accessing individual elements by index) works a little differently than in other data structures. 

However, if we keep in mind how to access the elements of a numpy array, the process will hopefully seem logical. 

<a id='DataSelectionSeries'></a>
## 3.1 Data Selection in `Series`



A ``Series`` object acts in many ways like a 1D NumPy array (or a standard Python dictionary).

In [228]:
# Implicit indexing
data = np.array([0.25, 0.5, 0.75, 1.0])

print(data)
print(data[0])
print(data[3])

[0.25 0.5  0.75 1.  ]
0.25
1.0


In [229]:
# Explicit indexing
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
print(data)
print(data['a'])
print(data['c'])

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
0.25
0.75


The same basic mechanisms of NumPy array-style item selection can be used:
- slicing
- masking
- fancy indexing

The only difference is that the indices may be *explicitly* defined.  


In [230]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [231]:
# fancy indexing
data[['a', 'c']]

a    0.25
c    0.75
dtype: float64

In [232]:
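# positional fancy indexing; recent pandas versions deprecate this form in favour of data.iloc[[0, 2]]
data[[0, 2]]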

a    0.25
c    0.75
dtype: float64

In [233]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [234]:
# slicing by explicit (user-defined) index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

__Potential source of confusion:__

When slicing with an explicit index (e.g. ``data['a':'c']``), the final index is *included* in the slice.

When slicing with an implicit index (e.g. ``data[0:2]``), the final index is *excluded* from the slice.
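The two slicing rules can be compared side by side. This is a small sketch using the same values as the examples above; the names `s`, `by_label` and `by_position` are chosen here for illustration.

```python
import pandas as pd

s = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = s['a':'c']      # label slice: the end label 'c' is included
by_position = s[0:2]       # positional slice: position 2 is excluded

print(list(by_label.index))
print(list(by_position.index))
```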

If a ``Series`` has an *explicit integer* index:
- an indexing operation e.g. ``data[1]`` will use the *explicit* index
- a slicing operation e.g. ``data[1:3]`` will automatically use the *implicit* array-style index.

In [235]:
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[1, 2, 3, 4])

# explicit index when indexing
print(data[1], end='\n\n')

# implicit index when slicing
print(data[1:3], end='\n\n')

0.25

2    0.50
3    0.75
dtype: float64





<a id='SeriesIndexers'></a>
## 3.2 `Series` Indexers: loc & iloc

To avoid the ambiguity that arises with integer indexes, Pandas has special *indexers* that explicitly show which indexing scheme is being used.

In [236]:
data

1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

## loc

Always references the explicit index

In [237]:
data.loc[1]

0.25

In [238]:
data.loc[1:3]

1    0.25
2    0.50
3    0.75
dtype: float64

## iloc
Always references the implicit Python array-style index

In [239]:
data.iloc[1]

0.5

In [240]:
data.iloc[1:3]

2    0.50
3    0.75
dtype: float64

One principle of Python is that "explicit is better than implicit."

The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code and avoiding bugs due to mixed indexing/slicing.

<a id='DataSelectionDataFrame'></a>
## 3.3 Data Selection in  `DataFrame`



A ``DataFrame`` acts in many ways like:
- a dictionary (see supplementary)
- a 2D array. 

In [241]:
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'New York': 141297, 
                  'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 
                 'Texas': 26448193,
                 'New York': 19651127, 
                 'Florida': 19552860,
                 'Illinois': 12882135})

data = pd.DataFrame({'area':area, 'pop':pop})
data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


### DataFrame as a dictionary

The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing using the column name:

In [242]:
data['area'] # dictionary style

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [243]:
data.area # . shorthand

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [244]:
data.area is data['area']

True

Shorthand e.g. `data.area`, does not work in some cases:
- if the column names are not strings
- if the column names conflict with methods of the ``DataFrame``<br>(e.g. the ``DataFrame`` has a ``pop()`` method. ``data.pop`` will run the method not the ``"pop"`` column)
- column assignment <br>(i.e. use ``data['pop'] = z``, not ``data.pop = z``).
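The first two cases can be checked directly. In this sketch the frame `demo` and the integer column name `2020` are made up for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'area': [423967, 695662],
                     'pop': [38332521, 26448193],
                     2020: [1, 2]})      # 2020 is a made-up, non-string column name

print(callable(demo.pop))       # demo.pop is the DataFrame method, not the column
print(type(demo['pop']))        # square brackets return the column as a Series
print(demo[2020].tolist())      # non-string names need square brackets
```

Using square brackets everywhere avoids having to remember which names are safe.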





Example : Editing a column

In [245]:
data.pop is data['pop']

data.pop = 3        # does not change the 'pop' column (it overwrites the pop method on this object instead)
#data['pop'] = 3   # changes all values in pop column to 3

data

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


Example : Adding a new column

In [246]:
data['density'] = data['pop'] / data['area']
data

Unnamed: 0,area,pop,density
California,423967,38332521,90.413926
Texas,695662,26448193,38.01874
New York,141297,19651127,139.076746
Florida,170312,19552860,114.806121
Illinois,149995,12882135,85.883763


### DataFrame as two-dimensional array

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.

Data structures with implicit indices may be indexed as numpy arrays:

In [247]:
data = pd.DataFrame(np.random.rand(3, 2))

print(data, end='\n\n')

print(data[1:2][0]) # row 1 only (the slice 1:2 excludes row 2), column 0

          0         1
0  0.371865  0.656336
1  0.424145  0.855127
2  0.364807  0.356233

1    0.424145
Name: 0, dtype: float64


For *explicitly indexed* ``DataFrame`` objects, limited indexing is available.



In [248]:
data = pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

data

Unnamed: 0,foo,bar
a,0.725975,0.188078
b,0.227132,0.761094
c,0.643218,0.964354


Access a column:

In [249]:
data['foo']

a    0.725975
b    0.227132
c    0.643218
Name: foo, dtype: float64

Access a row:

In [250]:
data.values[0] # row 0

array([0.72597513, 0.18807846])

Access a region, as with a NumPy array:

In [251]:

data.values[::2, 0] # every other row, column 0

array([0.72597513, 0.6432177 ])

In [252]:
# Some more examples

data = pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

print(data, 
      end='\n\n')

data['foo'] # column

data.values[0] # row

data[1:2]  # row slice (here only row 'b', as the end position is excluded)

data[1:2].foo # single column, row slice

data[['foo']][:2] # single column, multiple rows

data[['foo', 'bar']][1:3] # multiple columns, multiple rows

        foo       bar
a  0.735100  0.122799
b  0.135115  0.347658
c  0.771025  0.259579



Unnamed: 0,foo,bar
b,0.135115,0.347658
c,0.771025,0.259579


In [253]:
# These commands produce the same result : first two rows, column 'foo'

print(data[:2].foo)

print(data['foo'][:2])

print(data['foo'].head(2))

a    0.735100
b    0.135115
Name: foo, dtype: float64
a    0.735100
b    0.135115
Name: foo, dtype: float64
a    0.735100
b    0.135115
Name: foo, dtype: float64


The components of a ``DataFrame`` can be accessed individually:

In [254]:
print(data.columns)
print(data.index)
print(data.values)

Index(['foo', 'bar'], dtype='object')
Index(['a', 'b', 'c'], dtype='object')
[[0.73509985 0.12279886]
 [0.13511471 0.34765834]
 [0.77102479 0.25957862]]




<a id='DataFrameIndexers'></a>
## 3.4 `DataFrame` Indexers: loc & iloc

Again, Pandas uses the ``loc`` and ``iloc`` *indexers*, which relate to the explicit and implicit indexing schemes respectively. 

This makes indexing much less messy.

In [255]:
area = pd.Series({'California': 423967, 
                  'Texas': 695662,
                  'New York': 141297, 
                  'Florida': 170312,
                  'Illinois': 149995})

pop = pd.Series({'California': 38332521, 
                 'Texas': 26448193,
                 'New York': 19651127, 
                 'Florida': 19552860,
                 'Illinois': 12882135})

US = pd.DataFrame({'area':area, 'pop':pop})
US




Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


## loc

Always references the explicit index

In [256]:
US.loc['Illinois', 'pop'] # rows = Illinois, columns = pop

12882135

In [257]:
US.loc[:'Florida', 'area'] # rows up to Florida, columns = area

California    423967
Texas         695662
New York      141297
Florida       170312
Name: area, dtype: int64

In [258]:
US.loc[:'Illinois', :'pop'] # rows up to Illinois, columns up to pop

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


## iloc
Always references the implicit Python array-style index

In [259]:
US.iloc[:3, :2]    # rows, columns    
US.iloc[:3, [0,1]]
US.iloc[2, 1]  

19651127

Any of the familiar NumPy-style data access patterns can be used within these indexers.

Example
<br>``loc`` indexer with masking and fancy indexing:

In [260]:
US.loc[US.area > 200000, # rows where the area column > 200,000
       ['pop', 'area']]  # columns pop and area

Unnamed: 0,pop,area
California,38332521,423967
Texas,26448193,695662


Example
<br>Modify a value

In [261]:
print(US)
US.iloc[2, 1] = 90 # row, column
print(US)

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135
              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297        90
Florida     170312  19552860
Illinois    149995  12882135


### Additional indexing conventions

Slicing by explicit index

In [262]:
US['Florida':'Illinois']

Unnamed: 0,area,pop
Florida,170312,19552860
Illinois,149995,12882135


Slicing by implicit index

In [263]:
US[1:3]

Unnamed: 0,area,pop
Texas,695662,26448193
New York,141297,90


Direct masking operations are interpreted row-wise 

In [264]:
US[US.area > 400000]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
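Masks can also be combined. The sketch below rebuilds the state data under the hypothetical name `US_demo` so that it runs on its own; the thresholds are arbitrary. Each condition needs its own parentheses, and the operators `&` (and), `|` (or) and `~` (not) are used in place of the Python keywords.

```python
import pandas as pd

US_demo = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                        'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                       index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

# Rows satisfying both conditions
big_and_populous = US_demo[(US_demo['area'] > 150000) & (US_demo['pop'] > 20000000)]

# Rows satisfying at least one condition
small_or_populous = US_demo[(US_demo['area'] < 150000) | (US_demo['pop'] > 30000000)]

print(big_and_populous)
print(small_or_populous)
```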


<a id='PerformingOperationsDataPandas'></a>
# 4. Performing Operations on Data in Pandas

<br> <a href='#VectorizedFunctions'>4.1 Vectorized Functions</a>
<br> <a href='#Non-VectorizedFunctions'>4.2 Non-Vectorized Functions</a> 
<br> <a href='#ModifyingData'>4.3 Modifying Data</a> 
<br> <a href='#IndexAlignmentSeries'>4.4 Index Alignment in a `Series`</a> 
<br> <a href='#IndexAlignmentDataFrame'>4.4 Index alignment in a `DataFrame`</a> 
<br><a href='#OperationsBetweenDataFrameSeries'>4.5 Operations Between `DataFrame` and `Series`</a>




NumPy is useful for performing quick element-wise operations, e.g. basic arithmetic (addition, subtraction, multiplication...), trigonometric functions, and exponential and logarithmic functions. 

Pandas is designed to work with NumPy so any NumPy elementwise function (ufunc) will work on Pandas ``Series`` and ``DataFrame`` objects.

Pandas builds on elementwise application by *preserving index and column labels* when implementing such functions. 

Indices are automatically *aligned* when passing the `Series`/`DataFrame` to the function.

Operations between 1D ``Series`` structures and 2D ``DataFrame`` structures are similar to those in Numpy.

Main advantages:
- keeping the context of data
- combining data from different sources

These tasks are error-prone with NumPy but are automated when using Pandas. 
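As a small sketch of index alignment, the two made-up `Series` below share only one label. When they are added, values are matched by label rather than by position, and labels present in only one `Series` give ``NaN``.

```python
import pandas as pd

a = pd.Series({'Alaska': 1.0, 'Texas': 2.0})
b = pd.Series({'Texas': 10.0, 'California': 20.0})

total = a + b     # values are matched by label, not by position
print(total)
```

Only `Texas` appears in both inputs, so it is the only label with a numeric result.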

Let's do some examples to illustrate this:

<a id='VectorizedFunctions'></a>
## 4.1 Vectorized Functions




These functions automatically apply elementwise and can be used as with Numpy arrays.

Example: `Series`

In [265]:
rng = np.random.RandomState(42) # set random seed to give same random number each time

ser = pd.Series(rng.randint(0, 10, 4)) # low = 0, high = 10, size = 4

ser

0    6
1    3
2    7
3    4
dtype: int64

Example: `DataFrame`

In [266]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,6,9,2,6
1,7,4,3,7
2,7,2,5,4


The numpy function applies to all values in the `Series`:

In [267]:
np.exp(ser)

0     403.428793
1      20.085537
2    1096.633158
3      54.598150
dtype: float64

The numpy function applies to all values in the `DataFrame`:

In [268]:
np.sin(df * np.pi / 4)

Unnamed: 0,A,B,C,D
0,-1.0,0.7071068,1.0,-1.0
1,-0.707107,1.224647e-16,0.707107,-0.7071068
2,-0.707107,1.0,-0.707107,1.224647e-16


<a id='Non-VectorizedFunctions'></a>
## 4.2 Non-Vectorized Functions






For functions that do not automatically apply elementwise we can use:
- `Series.apply(function)` 
- `DataFrame.apply(function)` 

Example : Apply the `len` function to a series.

In [269]:
# Series
ser = pd.Series([[0,0,0],
                 [0,0,0,0],
                 [0,0,0,0,0],
                 [0,0,0,0,0,0]])

ser

0             [0, 0, 0]
1          [0, 0, 0, 0]
2       [0, 0, 0, 0, 0]
3    [0, 0, 0, 0, 0, 0]
dtype: object

In [270]:
ser.apply(len) 

0    3
1    4
2    5
3    6
dtype: int64

Example : Apply the `max` function to a data frame

In [271]:
# Data Frame
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

Unnamed: 0,A,B,C,D
0,1,7,5,1
1,4,0,9,5
2,8,0,9,2


In [272]:
df.apply(max)

A    8
B    7
C    9
D    5
dtype: int64
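By default `DataFrame.apply` passes each *column* to the function. The `axis` argument switches this to rows. The sketch below uses fixed values (under the made-up name `df_demo`) so that the results are reproducible.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 4, 8],
                        'B': [7, 0, 0],
                        'C': [5, 9, 9],
                        'D': [1, 5, 2]})

col_max = df_demo.apply(max)           # default axis=0: one result per column
row_max = df_demo.apply(max, axis=1)   # axis=1: one result per row

print(col_max)
print(row_max)
```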

Example : Apply the `max` function to a data frame or a Series within a DataFrame

In [273]:
df = pd.DataFrame({'keys' : ['A', 'B', 'C', 'D'],
                   'ser'  : ser,
                   'vals' : list(range(4))},
                  columns=['keys', 'ser', 'vals'])
df

Unnamed: 0,keys,ser,vals
0,A,"[0, 0, 0]",0
1,B,"[0, 0, 0, 0]",1
2,C,"[0, 0, 0, 0, 0]",2
3,D,"[0, 0, 0, 0, 0, 0]",3


In [274]:
df.apply(max)

keys                     D
ser     [0, 0, 0, 0, 0, 0]
vals                     3
dtype: object

In [275]:
df.ser.apply(len) 

0    3
1    4
2    5
3    6
Name: ser, dtype: int64

This is a very good example of where `lambda` functions are used.

Apply the function $x^2$ to each value, $x$ in the column `vals`

In [276]:
df.vals.apply(lambda x: x**2)

0    0
1    1
2    4
3    9
Name: vals, dtype: int64

Initial values can be overwritten.

In [277]:
df.vals = df.vals.apply(lambda x: x**2)
#df['vals']= df.vals.apply(lambda x: x**2)
df

Unnamed: 0,keys,ser,vals
0,A,"[0, 0, 0]",0
1,B,"[0, 0, 0, 0]",1
2,C,"[0, 0, 0, 0, 0]",4
3,D,"[0, 0, 0, 0, 0, 0]",9


New columns (and rows) can be added

In [278]:
df['vals_orig'] = df.vals.apply(lambda x: x**(1/2))
df

Unnamed: 0,keys,ser,vals,vals_orig
0,A,"[0, 0, 0]",0,0.0
1,B,"[0, 0, 0, 0]",1,1.0
2,C,"[0, 0, 0, 0, 0]",4,2.0
3,D,"[0, 0, 0, 0, 0, 0]",9,3.0


<a id='ModifyingData'></a>
## 4.3 Modifying Data




It can be useful to modify individual values of a DataFrame.

For example, replacing/removing unwanted features of each item, or changing the data type. 

### Replacing an Item / Part of an Item
The `replace` method is not limited to string data but is particularly useful when dealing with strings.

In [279]:
tax_1 = {'Alaska': '$172',  
         'Texas': '$695',
         'California': '$423'
        }


tax_2 = {'California': '$3.8', 
         'Texas': '$2.6',
         'New York': '$1.9'
        }



US_tax = pd.DataFrame({'tax_1' : tax_1,
                       'tax_2' : tax_2})

US_tax

Unnamed: 0,tax_1,tax_2
Alaska,$172,
Texas,$695,$2.6
California,$423,$3.8
New York,,$1.9


In [280]:
US_tax.replace(r'\$', '', regex=True) # regex=True : treat the pattern as a regular expression

Unnamed: 0,tax_1,tax_2
Alaska,172.0,
Texas,695.0,2.6
California,423.0,3.8
New York,,1.9


### Changing Data Type
Values can be cast as a different data type using  `astype`. 

In [281]:
US_tax.replace(r'\$', '', regex=True).astype(float)

Unnamed: 0,tax_1,tax_2
Alaska,172.0,
Texas,695.0,2.6
California,423.0,3.8
New York,,1.9


### Sorting Items 
The `sort_values` method sorts the rows of a `DataFrame` by the values in a chosen column. Here the values are still strings, so they are compared as text rather than as numbers.

In [282]:
US_tax.sort_values(by = "tax_1", ascending = False)

Unnamed: 0,tax_1,tax_2
Texas,$695,$2.6
California,$423,$3.8
Alaska,$172,
New York,,$1.9
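Because the tax values are stored as strings, the sort compares them character by character. A minimal sketch, using invented values (the `'$95'` entry and the column name `tax_1_num` are hypothetical) chosen so that text order and numeric order differ:

```python
import pandas as pd

# Invented values, for illustration only
tax = pd.DataFrame({'tax_1': ['$172', '$695', '$95']},
                   index=['Alaska', 'Texas', 'Utah'])

# Sorting the raw strings compares them character by character
as_text = tax.sort_values(by='tax_1', ascending=False)

# Strip the dollar sign, cast to float, then sort numerically
tax['tax_1_num'] = tax['tax_1'].str.replace('$', '', regex=False).astype(float)
as_number = tax.sort_values(by='tax_1_num', ascending=False)

print(as_text.index.tolist())
print(as_number.index.tolist())
```

Cleaning and casting the column before sorting gives the order most readers would expect.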


<a id='IndexAlignmentSeries'></a>
## 4.4 Index Alignment in a `Series`



Consider combining two different data sources
- the top three US states by *area*
- the top three US states by *population*

In [283]:
area = pd.Series({'Alaska': 1723337, 
                  'Texas': 695662,
                  'California': 423967
                 }, 
                 name='area')


population = pd.Series({'California': 38332521, 
                        'Texas': 26448193,
                        'New York': 19651127
                       }, 
                       name='population')

Let's use them to compute the population density:

In [284]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

The result contains the *union* of the indices of the two `Series`.

In [285]:
# union of the two indexes (older Pandas versions also accepted area.index | population.index)
area.index.union(population.index)

Index(['Alaska', 'California', 'New York', 'Texas'], dtype='object')

In [286]:
# Set operations on area.index and population.index
# union                : ['Alaska', 'California', 'New York', 'Texas']
# intersection         : ['California', 'Texas']
# difference           : ['Alaska']
# symmetric_difference : ['Alaska', 'New York']

In the result, any index missing from one of the `Series` is marked with ``NaN``, or "Not a Number".

This is how Pandas marks missing data. 

This index matching applies to all of Python's built-in arithmetic operators:

In [287]:
A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If NaN values are undesirable in the result, the filler value can be modified by using an __object method__ for an operation. 

Addition : ``A + B``

__Object method__ for addition :``A.add(B)`` <br>allows a *single* fill value to be specified

In [288]:
A.add(B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64

<a id='IndexAlignmentDataFrame'></a>
## 4.4 Index alignment in a `DataFrame`



Index alignment takes place for *both* columns and indices in ``DataFrame``s:

2 x 2 `DataFrame`:

In [289]:
frame1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
                      columns=list('AB'))
frame1

Unnamed: 0,A,B
0,11,19
1,2,4


3 x 3 `DataFrame`:

In [290]:
frame2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
                      columns=list('BAC'))
frame2

Unnamed: 0,B,A,C
0,2,6,4
1,8,6,1
2,3,8,1


Normal addition

In [291]:
frame1 + frame2

Unnamed: 0,A,B,C
0,17.0,21.0,
1,8.0,12.0,
2,,,


Example

Fill with the mean of *all* values in ``frame1`` 

This is found by first stacking the rows of ``frame1``:

In [292]:
# frame1.mean() # mean of each column

fill = frame1.stack().mean() # stack() pivots the columns into the index, giving a single Series of all values

frame1.add(frame2, 
           fill_value=fill)

Unnamed: 0,A,B,C
0,17.0,21.0,13.0
1,8.0,12.0,10.0
2,17.0,12.0,10.0


Python operators and their equivalent Pandas object methods:

| Python Operator | Pandas Method(s)                      |
|-----------------|---------------------------------------|
| ``+``           | ``add()``                             |
| ``-``           | ``sub()``, ``subtract()``             |
| ``*``           | ``mul()``, ``multiply()``             |
| ``/``           | ``truediv()``, ``div()``, ``divide()``|
| ``//``          | ``floordiv()``                        |
| ``%``           | ``mod()``                             |
| ``**``          | ``pow()``                             |
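As a brief sketch of two other methods from the table, the example below reuses the two small `Series` from the addition example. The fill values (0 for subtraction, 1 for multiplication) are choices made for this illustration:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

diff = A.sub(B, fill_value=0)   # like A - B, but a missing entry counts as 0
prod = A.mul(B, fill_value=1)   # like A * B, but a missing entry counts as 1

print(diff)
print(prod)
```

The fill value is substituted only where one of the two inputs has no entry for that index label.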


<a id='OperationsBetweenDataFrameSeries'></a>
## 4.5 Operations Between `DataFrame` and `Series`



Index and column alignment is maintained.

Operations between a ``DataFrame`` and a ``Series`` are similar to operations between a 2D and 1D NumPy array.

Consider the following operation: 

Find the difference of a two-dimensional array and one of its rows:

In [293]:
A = np.random.randint(10, size=(3, 4))
A

array([[6, 5, 3, 0],
       [5, 6, 5, 3],
       [7, 3, 2, 3]])

In [294]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-1,  1,  2,  3],
       [ 1, -2, -1,  3]])

According to NumPy's broadcasting rules (see [05_Algebra_SympyScipyNumpy__ClassMaterial](05_Algebra_SympyScipyNumpy__ClassMaterial.ipynb)):

- subtraction between a 2D array and one of its rows is applied row-wise.

- subtraction between a 2D array and one of its columns is applied column-wise, provided the column keeps its 2D shape (e.g. `A[:, [0]]`).
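The two cases can be sketched with the array values shown above. The choice of column 1 is arbitrary and made only for this illustration:

```python
import numpy as np

A = np.array([[6, 5, 3, 0],
              [5, 6, 5, 3],
              [7, 3, 2, 3]])

row_wise = A - A[0]        # A[0] has shape (4,) : matched against every row
col_wise = A - A[:, [1]]   # A[:, [1]] has shape (3, 1) : matched against every column

print(row_wise)
print(col_wise)
```

Indexing with `A[:, 1]` instead would give shape `(3,)`, which cannot be broadcast against the shape `(3, 4)` of `A`.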

In Pandas, broadcast operations are *row-wise* when using normal operators

In [295]:
import pandas as pd
df = pd.DataFrame(A, columns=list('QRST'))
df

Unnamed: 0,Q,R,S,T
0,6,5,3,0
1,5,6,5,3
2,7,3,2,3


In [296]:
df - df.values[0]

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-1,1,2,3
2,1,-2,-1,3


In [297]:
df - df.iloc[0] # subtract row 0 (implicit indexing)

Unnamed: 0,Q,R,S,T
0,0,0,0,0
1,-1,1,2,3
2,1,-2,-1,3


To operate column-wise:
- use __object methods__ mentioned earlier
- specify the ``axis`` keyword:

In [298]:
df.subtract(df['R']) # all NaN: the index of df['R'] (0, 1, 2) is matched against the column labels (Q, R, S, T)

Unnamed: 0,Q,R,S,T,0,1,2
0,,,,,,,
1,,,,,,,
2,,,,,,,


In [299]:
# axis=0 : match the Series index to the row index, so the Series is subtracted from every column
df.subtract(df['R'], axis=0) 

Unnamed: 0,Q,R,S,T
0,1,0,-2,-5
1,-1,0,-1,-3
2,4,0,-1,0


<a id='DataCleaning'></a>
# 5. Data Cleaning


<br> <a href='#HandlingMissingData'>5.1 Handling Missing Data</a>
<br> <a href='#DuplicateData'>5.2 Duplicate Data</a>




<a id='HandlingMissingData'></a>
## 5.1 Data Cleaning : Handling Missing Data

Real-world data is rarely clean and homogeneous.

In particular, many interesting datasets will have some amount of data missing.

To make things even more complicated, different data sources may indicate missing data in different ways.

Pandas has useful tools for handling missing data.

### Missing Data Conventions

A number of schemes have been developed to indicate the presence of missing data in a table or DataFrame.
Generally, they revolve around one of two strategies: 
- using a *mask* (a separate Boolean array to indicate missing values)
- choosing a *sentinel value* that indicates a missing entry (e.g. `NaN`).
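The two strategies can be sketched with NumPy alone. The data values and the position of the missing entry are invented for this illustration:

```python
import numpy as np

raw = np.array([1.0, 2.0, 3.0, 4.0])

# Strategy 1: a separate Boolean mask marks the missing entry (here, position 1)
missing = np.array([False, True, False, False])
masked = np.ma.masked_array(raw, mask=missing)

# Strategy 2: a sentinel value (NaN) is stored in the data itself
with_nan = raw.copy()
with_nan[1] = np.nan

mask_total = masked.sum()            # masked entries are skipped
nan_total = np.nansum(with_nan)      # NaN entries are skipped by the nan-aware function

print(mask_total, nan_total)
```

The mask needs a second array of the same size, while the sentinel uses up one value of the data type itself.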

### Missing Data in Pandas

Pandas uses two already-existing Python null values as sentinels for missing data:
- floating-point ``NaN`` (*Not a Number*) value :  a special floating-point value recognized by all systems that use the standard floating-point representation.
- Python ``None`` object

In [300]:
print(1 + np.nan)
print(0 *  np.nan)

nan
nan


In [301]:
vals2 = np.array([1, np.nan, 3, 4])

print(vals2.sum(), vals2.min(), vals2.max())

print(np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2))

nan nan nan
8.0 1.0 4.0


In [302]:
vals1 = np.array([1, None, 3, 4])
print(vals1)
vals1.sum()

[1 None 3 4]


TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

### `NaN` and `None` in Pandas

Pandas is built to handle ``NaN`` and ``None`` interchangeably, converting between them where appropriate:

In [None]:
# pandas automatically converts int-->float, None-->NaN
pd.Series([1, np.nan, 2, None])
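A short sketch of how this conversion can be checked; the variable names `numeric` and `text` are hypothetical:

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, np.nan, 2, None])
print(numeric.dtype)               # the integers have been upcast to floating point
print(numeric.isnull().tolist())   # both NaN and None are reported as null

text = pd.Series(['a', None])      # None in string data is also treated as null
print(text.isnull().tolist())
```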

<a id='OperatingNullValues'></a>
### Operating on Null Values



There are several useful methods for detecting, removing, and replacing null values in Pandas data structures.
They are:

- ``isnull()``: Generate a boolean mask indicating missing values
- ``notnull()``: Opposite of ``isnull()``
- ``dropna()``: Return a filtered version of the data
- ``fillna()``: Return a copy of the data with missing values filled or imputed


### Detecting null values : ``isnull()`` and ``notnull()``
Generate a boolean mask indicating missing/not missing values

In [None]:
data = pd.Series([1, np.nan, 'hello', None])
data

In [None]:
data.isnull()

Boolean masks can be used as a ``Series`` or ``DataFrame`` index

In [None]:
data[data.notnull()]

In [3]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
display(df.isnull())
display(df[df.notnull()])


We can also see how many null values appear in each column. 

In [4]:
print(df.isnull().sum())


### Dropping null values :  ``dropna()``

`Series` example

In [None]:
data.dropna()

``DataFrame`` example

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

We cannot drop single values from a ``DataFrame``; we can only drop full rows or full columns.

By default, ``dropna()`` will drop all rows in which *any* null value is present:

In [None]:
df.dropna()

Alternatively, you can drop NA values along a different axis; ``axis=1`` drops all columns containing a null value:

In [None]:
df.dropna(axis='columns')

In [None]:
df.dropna(axis=1)

In doing this we lose some of our data set. 

We can instead drop rows or columns with:
- *all* NA values 
- a majority of NA values

In [None]:
# Example DataFrame
df[3] = np.nan
df

In [None]:
# Drop columns with all NaN values
df.dropna(axis='columns', how='all')

In [None]:
# Drop rows where thresh sets the minimum required number of non-NA values.
df.dropna(axis='rows', thresh=3)

Here the first and last row have been dropped, because they contain only two non-null values.
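Another option, sketched below, is to restrict the check to particular columns with the `subset` argument of `dropna()`. The choice of column `0` is arbitrary and made only for this illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# Only rows with a null value in column 0 are dropped;
# the null value in column 1 is ignored
kept = df.dropna(subset=[0])
print(kept)
```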

### Filling null values : ``fillna()``

`Series` example

In [None]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

In [None]:
display(data.fillna(0))  # fill NA entries with a single value

In [None]:
display(data.ffill())  # forward-fill: propagate the previous value forward (fillna(method='ffill') in older Pandas)

In [None]:
display(data.bfill())  # back-fill: propagate the next value backward (fillna(method='bfill') in older Pandas)

``DataFrame`` example

We can also specify an ``axis`` along which the fills take place:

In [None]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

In [None]:
# fill NaN with previous value in column
df.ffill(axis=0) # forward-fill down each column

In [None]:
# fill NaN with previous value in row
df.ffill(axis=1) # forward-fill along each row

We can see the use of this if we consider the operation methods we used previously.

In [None]:
df = pd.DataFrame(np.random.randint(10, size=(3, 4)), columns=list('QRST'))
display(df)

halfcols = df.iloc[0, ::2]
display(halfcols)

df = df.subtract(halfcols)
df

In [None]:
df.ffill(axis=1) # fill along each row with the value from the previous column

In [None]:
df.fillna(1) # fill NaN with single value
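A fill value can also be given per column. As a minimal sketch, the example below fills each null value with the mean of its own column; using the mean is a choice made for this illustration, not a general recommendation:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])

# df.mean() gives one mean per column (null values are skipped);
# fillna matches these means to the columns by label
filled = df.fillna(df.mean())
print(filled)
```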

<a id='DuplicateData'></a>
## 5.2 Data Cleaning : Duplicate Data

 


Duplicates can be removed from the data set based on a particular column using `drop_duplicates`

In [None]:
header = ['Item 1', 'Item 2', 'Item 3', 'Item 4']
df = pd.read_csv('sample_data/data_with_holes.csv', names=header)
df

In [None]:
df.drop_duplicates('Item 4')

By default, the first item of the duplicates is kept in the data set.

Optionally, you can specify:
- which value to keep (default keeps first)
- to apply the change to the original data frame

In [None]:
df.drop_duplicates('Item 3',     # column to drop duplicates from
                   keep='last',  # which instance to keep in data set
                   inplace=True) # apply to original data set

print(df)
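The behaviour of `keep` can be sketched without the class CSV file. The frame `items` below is hypothetical data standing in for it:

```python
import pandas as pd

# Hypothetical data, for illustration only
items = pd.DataFrame({'Item 1': ['a', 'b', 'c', 'd'],
                      'Item 2': [1, 2, 2, 1]})

flags = items.duplicated('Item 2')                    # True for a repeat of an earlier value
first = items.drop_duplicates('Item 2')               # keeps rows 0 and 1
last = items.drop_duplicates('Item 2', keep='last')   # keeps rows 2 and 3

print(flags.tolist())
print(first.index.tolist(), last.index.tolist())
```

`duplicated()` is useful for inspecting which rows would be removed before actually dropping them.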

<a id='CombiningDatasets'></a>
# 6. Combining Datasets



Some of the most interesting studies of data come from combining different data sources.

These operations can involve:
- straightforward concatenation of datasets
- database-style joins and merges that handle any overlaps between the datasets.

Please refer to [08_Data__Supplementary](08_Data__Supplementary.ipynb) for details of these operations. 

<a id='AggregationGrouping'></a>
# 7. Summarising Data with Aggregation and Grouping

<br> <a href='#SimpleAggregationPandas'>7.1 Simple Aggregation in Pandas</a>
<br> <a href='#GroupBySplitApplyCombine'>7.2 `GroupBy`: `Split`, `Apply`, `Combine`</a> 
<br> <a href='#aggregatefiltertransformapply'>7.3 `aggregate`, `filter`, `transform`, `apply`</a> 



Essential analysis of large data includes efficient summarization: e.g. computing aggregations like ``sum()``, ``mean()``, ``median()``, ``min()``, and ``max()``.

In each case a single number gives insight into the nature of a potentially large dataset.

## Planets Data

Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/).

It gives information on planets that astronomers have discovered around stars other than our sun <br>(known as *extrasolar planets* or *exoplanets* for short). 

It can be downloaded with a simple Seaborn command:

In [None]:
import seaborn as sns
planets = sns.load_dataset('planets')

planets.shape  # there are 1035 exoplanets in the data set and 6 fields

In [None]:
planets.head(10)

<a id='SimpleAggregationPandas'></a>
## 7.1 Simple Aggregation in Pandas



As with a one-dimensional NumPy array, for a Pandas ``Series`` the aggregates return a single value:

In [None]:
rng = np.random.RandomState(42) # set a random seed
ser = pd.Series(rng.rand(5))    # Series with 5 random values
ser

In [None]:
ser.sum()

In [None]:
ser.mean()

For a ``DataFrame``, by default the aggregates return results within each *column*:

In [None]:
df = pd.DataFrame({'A': rng.rand(5),
                   'B': rng.rand(5)})
df

In [None]:
df.mean()

By specifying the ``axis`` argument, you can instead aggregate within each row:

In [None]:
df.mean(axis='columns')

Pandas ``Series`` and ``DataFrame``s include many common aggregates.

The following table summarizes some other built-in Pandas aggregations:

| Aggregation              | Description                     |
|--------------------------|---------------------------------|
| ``count()``              | Total number of items           |
| ``first()``, ``last()``  | First and last item             |
| ``mean()``, ``median()`` | Mean and median                 |
| ``min()``, ``max()``     | Minimum and maximum             |
| ``std()``, ``var()``     | Standard deviation and variance |
| ``mad()``                | Mean absolute deviation (removed in Pandas 2.0) |
| ``prod()``               | Product of all items            |
| ``sum()``                | Sum of all items                |

These are all methods of ``DataFrame`` and ``Series`` objects.


In addition, there is a convenience method ``describe()`` that computes several common aggregates for each column and returns the result.

Let's use this on the Planets data, for now dropping rows with missing values:

In [None]:
planets.dropna().describe()

This can be a useful way to begin understanding the overall properties of a dataset.


We see in the ``year`` column that although exoplanets were discovered as far back as 1989, half of all known exoplanets were not discovered until 2010 or after.

To go deeper into the data, however, simple aggregates are often not enough.

The next level of data summarization is the ``groupby`` operation, which allows you to quickly and efficiently compute aggregates on subsets of data.

<a id='GroupBySplitApplyCombine'></a>
## 7.2 `GroupBy`: `Split`, `Apply`, `Combine`



``groupby`` : aggregates conditionally on some label or index.

- __split__ : breaking up and grouping a ``DataFrame`` depending on the value of the specified key.
- __apply__ : computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- __combine__ : merging the results of these operations into an output array.

### Split, apply, combine

In the example below, "apply" is a summation aggregation:

![](img/split_apply_combine.png)

This could be achieved manually using a combination of the masking, aggregation, and merging commands we covered earlier.  

``GroupBy`` can do this in a single step so the user need not think about *how* the computation is done, and can focus on the *operation as a whole*.

__Example__ : computation shown in this diagram.

Creating the input ``DataFrame``:

In [None]:
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'C', 'B', 'B', 'A'],
                   'col2': [1, 2, 3, 4, 2, 5, 3]}, 
                  columns=['col1', 'col2'])
df

The most basic split-apply-combine operation can be computed with the ``groupby()`` method of ``DataFrame``s, passing the name of the desired key column.

Notice that what is returned is not a set of ``DataFrame``s, but a ``DataFrameGroupBy`` object.

In [None]:
df = pd.DataFrame({'col1': ['A', 'B', 'C', 'C', 'B', 'B', 'A'],
                   'col2': [1, 2, 3, 4, 2, 5, 3]}, 
                  columns=['col1', 'col2'])

df.groupby('col1')



To produce a result, apply an aggregate to this ``DataFrameGroupBy`` object.

This will perform the appropriate apply/combine steps to produce the desired result:

In [None]:
df.groupby('col1').sum() # A DataFrame showing the sum of all items in groups A, B, C for each field

The ``sum()`` method is just an example.

Apply any:
- Pandas or NumPy aggregation function
- ``DataFrame`` operation...
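For comparison, the manual route (masking, aggregating, then merging) can be sketched next to the `groupby` version. The dictionary comprehension is just one possible way to write the manual steps:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['A', 'B', 'C', 'C', 'B', 'B', 'A'],
                   'col2': [1, 2, 3, 4, 2, 5, 3]})

# Manual version: mask (split), sum (apply), collect into a Series (combine)
manual = pd.Series({key: df.loc[df['col1'] == key, 'col2'].sum()
                    for key in ['A', 'B', 'C']})

# GroupBy version: the same three steps in one expression
grouped = df.groupby('col1')['col2'].sum()

print(manual)
print(grouped)
```

Both give the same totals, but the `groupby` version does not need the list of keys to be written out by hand.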

### The `GroupBy` object

The ``GroupBy`` object can be thought of as a collection of ``DataFrame``s.

![](img/split_apply_combine.png)



#### Iteration over groups

The ``GroupBy`` object supports direct iteration over the groups, returning each group as a ``Series`` or ``DataFrame``:

In [None]:
for (method, group) in planets.groupby('method'):    
    # {0:30s} pads the method name to a width of 30 characters so that the shapes line up
    print("DataFrame name : {0:30s}, shape={1}".format(method, group.shape))

Let's also introduce some of the other functionality that can be used with the basic ``GroupBy`` operation.

#### Column indexing

The ``GroupBy`` object supports column indexing in the same way as the ``DataFrame``, and returns a modified ``GroupBy`` object.

In [None]:
planets.groupby('method') # choose a column name to group by e.g. method

In [None]:
planets.groupby('method')['orbital_period'] # choose a column name we are interested in e.g. orbital period

As with the ``GroupBy`` object, no computation is done until we call some aggregate on the object:

In [None]:
# Find the median orbital period of each measurement method
planets.groupby('method')['orbital_period'].median()

This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.

<a id='aggregatefiltertransformapply'></a>
## 7.3 `aggregate`, `filter`, `transform`, `apply`



In these examples we focused on aggregation for the combine operation, but there are more options available.

![](img/split_apply_combine.png)

``GroupBy`` objects have the following methods:
- ``aggregate()``
- ``filter()``
- ``transform()``
- ``apply()`` 

Example ``DataFrame``:

In [None]:
rng = np.random.RandomState(0)

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)},
                   columns = ['key', 'data1', 'data2'])
df

#### Aggregation

The ``aggregate()`` method extends the range of available aggregations beyond ``sum()``, ``median()`` etc.

It can take a string/function/list and compute all the aggregates at once:

In [None]:
# group items using column 'key' and find the min, median and max  of each column
df.groupby('key').aggregate(['min', 'median', 'max'])

Column names can be mapped to operations to be applied on that column:

In [None]:
df.groupby('key').aggregate({'data1': 'min',
                             'data2': 'max'})
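The output columns can also be given their own names. The sketch below uses keyword arguments of the form `new_name=(column, aggregation)`; the names `data1_min` and `data2_max` are hypothetical and chosen for this illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6),
                   'data2': rng.randint(0, 10, 6)})

# Each keyword names an output column: (input column, aggregation)
summary = df.groupby('key').agg(data1_min=('data1', 'min'),
                                data2_max=('data2', 'max'))
print(summary)
```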

#### Filtering

Filter data based on the group properties.

Example : keep all groups in which the standard deviation is larger than some critical value (e.g. 4)

In [None]:
display(df)       # data frame
display(df.groupby('key').std()) # standard deviation 

In [None]:
def filter_func(x):
    """ 
    Return True to keep the group x (a DataFrame): kept if the standard deviation of its column 'data2' is greater than 4
    """
    return x['data2'].std() > 4


display(df.groupby('key').filter(filter_func))

display(df.groupby('key').filter(filter_func).groupby('key').std())

Group A items have a standard deviation less than 4, so are dropped from the result.
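The ``transform()`` method listed above returns a result with the same shape as the input, rather than one row per group. A minimal sketch, in which each value is centred on the mean of its group (the choice of operation is an assumption made for this illustration):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data1': range(6)})

# Subtract each group's mean from its members;
# the output has one value for every input row
centered = df.groupby('key')['data1'].transform(lambda x: x - x.mean())
print(centered)
```

Because the index of the result matches the original frame, it can be assigned straight back as a new column.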

#### The apply() method

The ``apply()`` method lets you apply an arbitrary function to the group results.

The function should have:
- input : a ``DataFrame``
- return : a Pandas object (e.g., ``DataFrame``, ``Series``) or a scalar

Example, here is an ``apply()`` that normalizes the first column to the sum (for that group) of the second:

In [None]:
def norm_by_data2(x):
    # x is a DataFrame of group values
    x = x.copy()  # work on a copy to avoid mutating the group passed in
    x['data1'] = x['data1'] / x['data2'].sum()
    return x

display(df)
display(df.groupby('key').apply(norm_by_data2))

``apply()`` within a ``GroupBy`` is quite flexible: 

the only criterion is that the function takes a ``DataFrame`` and returns a Pandas object or scalar

What else the function does is up to you!

### Specifying the split key

In the simple examples presented before, we split the ``DataFrame`` on a single column name.

This is just one of many options by which the groups can be defined.

More are given in the supplementary material for you to explore independently.

In [None]:
planets.head()

### Grouping example

Let's use grouping to count the number of  planets discovered during each decade:

In [None]:
planets['decade'] = 10 * (planets['year'] // 10) # new column : floor divide by 10 then multiply by 10

planets.groupby('decade')['number'].sum()

Let's use grouping to count the number of  planets discovered __by each method__ during each decade:

The returned values are stacked as a single column with each item a combination of decade and method using `groupby(['method', 'decade'])`. 

In [None]:
planets.groupby(['method', 'decade'])['number'].sum().unstack().fillna(0)

To display this in a more readable way, use `unstack()`.

Finally, replace all missing values with zeros using `fillna(0)`.

This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets.

We can see when and how planets have been discovered over the past several decades!
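The individual steps can be followed on a miniature table. The frame `mini` below is hypothetical, with values invented for this illustration rather than taken from the Planets data:

```python
import pandas as pd

# Invented miniature of the planets table
mini = pd.DataFrame({'method': ['Transit', 'Transit', 'Imaging', 'Transit'],
                     'year':   [2008, 2011, 2009, 2012],
                     'number': [1, 2, 1, 1]})
mini['decade'] = 10 * (mini['year'] // 10)

# One row per (method, decade) combination
stacked = mini.groupby(['method', 'decade'])['number'].sum()

# Decades become columns; combinations with no data become 0
table = stacked.unstack().fillna(0)

print(stacked)
print(table)
```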

<a id='StringData'></a>
# 8. Working with `String` Data




We have seen how tools like NumPy and Pandas generalize arithmetic operations.

This allows us to easily and quickly perform the same operation on many array elements. 

For example:

In [None]:
import numpy as np

x = np.array([2, 3, 5, 7, 11, 13])

x * 2

This *vectorization* of operations simplifies the syntax of operating on arrays of data.

We no longer have to think about the size or shape of the array. 

We only need to consider what we want done.

For arrays of strings, NumPy does not provide such simple access.

In [None]:
data = ['peter', 'Paul', 'MARY', 'gUIDO']

# List comprehension to loop through each item in list.
[s.capitalize() for s in data]

This will break if there are any missing values.

For example:

In [None]:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

Pandas provides useful string operations for cleaning up real-world data.

Pandas includes features for:
- vectorized string operations
- handling missing data via the ``str`` attribute of Pandas Series and Index objects containing strings.

So, for example, suppose we create a Pandas Series with this data:

In [None]:
names = pd.Series(data)
names

We can now call a single method that will capitalize all the entries, while skipping over any missing values:

In [None]:
names.str.capitalize()

Using tab completion on this ``str`` attribute will list all the vectorized string methods available to Pandas.

## Tables of Pandas String Methods

The examples in this section use the following series of names.

In [None]:
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

### Methods similar to Python string methods
Nearly all Python's built-in string methods have an equivalent Pandas vectorized string method. 

|             |                  |                  |                  |
|-------------|------------------|------------------|------------------|
|``len()``    | ``lower()``      | ``translate()``  | ``islower()``    | 
|``ljust()``  | ``upper()``      | ``startswith()`` | ``isupper()``    | 
|``rjust()``  | ``find()``       | ``endswith()``   | ``isnumeric()``  | 
|``center()`` | ``rfind()``      | ``isalnum()``    | ``isdecimal()``  | 
|``zfill()``  | ``index()``      | ``isalpha()``    | ``split()``      | 
|``strip()``  | ``rindex()``     | ``isdigit()``    | ``rsplit()``     | 
|``rstrip()`` | ``capitalize()`` | ``isspace()``    | ``partition()``  | 
|``lstrip()`` |  ``swapcase()``  |  ``istitle()``   | ``rpartition()`` |



### Methods using regular expressions

Methods that accept regular expressions (regex) to examine the content of each string element.

These follow the conventions of Python's built-in ``re`` module:

| Method | Description |
|--------|-------------|
| ``match()`` | Call ``re.match()`` on each element, returning a boolean. |
| ``extract()`` | Extract the capture groups of the first match in each element, returning them as strings.|
| ``findall()`` | Call ``re.findall()`` on each element |
| ``replace()`` | Replace occurrences of pattern with some other string|
| ``contains()`` | Call ``re.search()`` on each element, returning a boolean |
| ``count()`` | Count occurrences of pattern|
| ``split()``   | Equivalent to ``str.split()``, but accepts regexps |
| ``rsplit()`` | Equivalent to ``str.rsplit()`` |
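A few of these methods can be sketched on the `monte` series defined earlier. The patterns used are simple ones chosen for this illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

first_names = monte.str.extract('([A-Za-z]+)', expand=False)  # first run of letters
has_ll = monte.str.contains('ll')                             # True where 'll' appears
n_vowels = monte.str.count('[aeiou]')                         # lower-case vowels only

print(first_names.tolist())
print(has_ll.tolist())
print(n_vowels.tolist())
```

With `expand=False` and a single capture group, `extract()` returns a `Series` rather than a `DataFrame`.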

### Miscellaneous methods

| Method | Description |
|--------|-------------|
| ``get()`` | Index each element |
| ``slice()`` | Slice each element|
| ``slice_replace()`` | Replace slice in each element with passed value|
| ``cat()``      | Concatenate strings|
| ``repeat()`` | Repeat values |
| ``normalize()`` | Return Unicode form of string |
| ``pad()`` | Add whitespace to left, right, or both sides of strings|
| ``wrap()`` | Split long strings into lines with length less than a given width|
| ``join()`` | Join strings in each element of the Series with passed separator|
| ``get_dummies()`` | extract dummy variables as a dataframe |

Notice that these have various return values. Some, like ``lower()``, return a series of strings:

In [None]:
# make all letters lower case
monte.str.lower()

Others return numbers:

In [None]:
# length of each string
monte.str.len()

Or Boolean values:

In [None]:
monte.str.startswith('T') # first letter is T

monte.str.endswith('e') # last letter is e

monte.str.contains('in') # contains the character sequence 'in' 

Others return lists or other compound values for each element:

In [None]:
monte.str.split()

And some can be used to modify the string data:

In [None]:
tax_1 = {'Alaska': '$172K',  
         'Texas': '$695K',
         'California': '$423K'
        }


tax_2 = {'California': '$3.8M', 
         'Texas': '$2.6M',
         'New York': '$1.9M'
        }



US_tax = pd.DataFrame({'tax_1' : tax_1,
                       'tax_2' : tax_2})

US_tax

In [None]:
US_tax['tax_1'].str.replace(r'[\$,]', '', regex=True) # the str accessor works on one column (a Series) at a time

In [None]:
US_tax.replace(r'[\$,]', '', regex=True) # DataFrame.replace works on every column

#### Vectorized slicing

In [None]:
# These two expressions are equivalent

monte.str[0:3]

monte.str.slice(0,3)

#### Indexing

``ser.str.get(i)`` and ``ser.str[i]`` both return element `i` of each string in the `Series` `ser`.

In [None]:
monte.str.get(3)

Example : 

``split()`` : splits a string into lists of words.

``get()`` and ``slice()`` : let you access elements of a list, e.g. the lists returned by ``split()``

Example: extract the last name of each entry:

In [None]:
monte.str.split() # use str.split() to split name strings into first and last name

In [None]:
monte.str.split().str.get(-1) # use str.get() to get the last item in each list (index [-1])
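An alternative sketch uses the `expand=True` option of `split()`, which puts the pieces into separate columns. The column names `first` and `last` are hypothetical labels added for this illustration:

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# expand=True returns a DataFrame with one column per piece
parts = monte.str.split(expand=True)
parts.columns = ['first', 'last']

print(parts)
```

This works neatly here because every name splits into exactly two pieces.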

## Example: Recipe Database

Let's use these vectorized string operations to clean up messy, real-world data.

This example uses a real open recipe database compiled from various sources on the Web.

Goal : 
- parse the recipe data into ingredient list
- quickly find a recipe based on some ingredients we have 



The database `db-recipes.json` is about 900 kB, and can be downloaded by uncommenting the command in the cell below and running the cell. 

A copy of the database can also be found in the `sample_data` folder of the `ILAS_PyEng2019` repository. 

In [None]:
#!curl -O https://raw.githubusercontent.com/tabatkins/recipe-db/master/db-recipes.json

We will import the data using the url...

In [303]:
# Create URL to JSON file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/tabatkins/recipe-db/master/db-recipes.json'

# Load the JSON file into a data frame
# recipes = pd.read_json('sample_data/db-recipes.json') # use local copy
recipes = pd.read_json(url) # use online version

# View the first five rows
recipes.head()

# Transpose the DataFrame so that each row is one recipe
recipes = recipes.T

# View the first five rows
recipes.head()

Unnamed: 0,id,name,source,preptime,waittime,cooktime,servings,comments,calories,fat,satfat,carbs,fiber,sugar,protein,instructions,ingredients,tags
2,2,Baked Shrimp Scampi,Ina Garten: Barefoot Contessa Back to Basics,0,0,0,6,Modified by reducing butter and salt. Substit...,2565,159,67,76,4,6,200,Preheat the oven to 425 degrees F.\r\n\r\nDefr...,"[2/3 cup panko\r, 1/4 teaspoon red pepper flak...","[seafood, shrimp, main]"
4,4,Strawberries Romanov (La Madeleine copycat),http://cookeatshare.com/recipes/la-madeleine-s...,0,0,0,4,,0,0,0,0,0,0,0,Wash strawberries and cut the tops off. Let st...,"[2 tbsp powdered sugar\r, 1/2 pt heavy whippin...","[fruit, dessert, strawberries, copycat, untried]"
5,5,Tomato Basil Soup (La Madeleine copycat),http://cookeatshare.com/recipes/la-madeleine-s...,0,0,0,4,,0,0,0,0,0,0,0,"Combine tomatoes, juice/and or possibly stock ...","[4 c tomatoes, minced, peeled, and cored\r, 4 ...","[soup, tomatoes, copycat, main]"
7,7,John Thorne's Pecan Pie,http://chowhound.chow.com/topics/281175#1495442,0,0,0,12,Does not use corn syrup,0,0,0,0,0,0,0,"Preheat oven to 350F. In a large saucepan, hea...","[1/4 tsp salt \r, 3 eggs \r, 4 tbsp butter \r,...","[dessert, pie, untried]"
8,8,Smoked Salmon Ebelskivers,Ebelskivers by Kevin Crafts,900,0,1080,3,"If dill is not available, add 1 tsp of Old Bay...",0,0,0,0,0,0,0,"Preheat oven to 200F.\r\n\r\nIn a large bowl, ...","[1 cup all-purpose flour\r, 1 1/2 tsp sugar\r,...","[salmon, ebelskivers, main]"


In [304]:
recipes.shape

(476, 18)

We see there are nearly 500 recipes, and 18 parameters.

Let's take a look at one row, the first recipe:

In [305]:
recipes.iloc[0]

id                                                              2
name                                          Baked Shrimp Scampi
source               Ina Garten: Barefoot Contessa Back to Basics
preptime                                                        0
waittime                                                        0
cooktime                                                        0
servings                                                        6
comments        Modified by reducing butter and salt.  Substit...
calories                                                     2565
fat                                                           159
satfat                                                         67
carbs                                                          76
fiber                                                           4
sugar                                                           6
protein                                                       200
instructio

There is a lot of information there, but much of it is in a very messy form, as is typical of data scraped from the Web.

In particular, the ingredients of each recipe are stored as a list of free-form strings.

Let's start by taking a closer look at the ingredients:

In [306]:
recipes.iloc[0]['ingredients']

['2/3 cup panko\r',
 '1/4 teaspoon red pepper flakes\r',
 '1/2 lemon, zested and juiced\r',
 '1 extra-large egg yolk\r',
 '1 teaspoon rosemary, minced\r',
 '3 tablespoon parsley, minced\r',
 '4 clove garlic, minced\r',
 '1/4 cup shallots, minced\r',
 '8 tablespoon unsalted butter, at room temperature\r',
 '2 tablespoon dry white wine\r',
 'Freshly ground black pepper\r',
 'Kosher salt\r',
 '3 tablespoon olive oil\r',
 '2 pound frozen shrimp']

Let's see which recipe has the longest ingredient list.

We can `apply` the function `len` to find the length of each list.

In [307]:
recipes.ingredients.apply(len).head()

2    14
4     4
5     7
7     8
8    13
Name: ingredients, dtype: int64

Find index of the recipe with the longest list of ingredients using `idxmax`.

The index of the maximum row (`axis=0`) is returned. 

(If there is more than one item with the maximum value, the first is returned)

In [308]:
recipes.ingredients.map(len).idxmax(axis=0)

'id420'

This can be used as an index to find the name of the recipe

In [309]:
recipes.name[recipes.ingredients.map(len).idxmax(axis=0)]

'Wild mushroom and cauliflower lasagna'
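
The tie-breaking behaviour of `idxmax` noted above can be checked on a small example. This is a minimal sketch using a made-up `Series` (the labels and counts are hypothetical, not taken from the recipe data); it also shows one way to recover *every* label that shares the maximum value.

```python
import pandas as pd

# Hypothetical ingredient counts; 'b' and 'c' tie for the maximum
counts = pd.Series([3, 7, 7, 2], index=['a', 'b', 'c', 'd'])

# idxmax returns the label of the first maximum only
first_max = counts.idxmax()

# A boolean mask recovers the labels of every maximum
all_max = counts[counts == counts.max()].index.tolist()

print(first_max)
print(all_max)
```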

Let's see how many recipes contain pepper.

In [310]:
# join ingredients list to form a string
recipes['ingredients'] = recipes['ingredients'].str.join("") 

Search for the character string 'Pepper' using `str.contains()`

We can search for multiple terms by separating them with the regular expression *or* operator `|` : `'Pepper|pepper'`

Count number of instances using `.sum()`.

In [311]:
recipes.ingredients.str.contains('Pepper|pepper').sum()

244

Case insensitive search

In [312]:
recipes.ingredients.str.contains('pepper', case=False)

#import re # a library for regex handling
#recipes.ingredients.str.contains('pepper', flags=re.IGNORECASE).sum() # case insensitive search

2         True
4        False
5         True
7        False
8         True
         ...  
id536    False
id537    False
id538    False
id539    False
id540    False
Name: ingredients, Length: 476, dtype: bool

Select the matching recipes using the result of `str.contains()`.

The result `selection` is a boolean `Series`, so it can be used as an index. 

In [313]:
selection = recipes.ingredients.str.contains('pepper', case=False)
#selection = recipes.ingredients.str.contains('pepper', flags=re.IGNORECASE)

# names of selected recipes (those containing pepper)
recipes.name[selection].head()


### A simple recipe recommender

Goal : given a list of ingredients, find a recipe that uses all the ingredients.

While conceptually straightforward, the task is complicated by the heterogeneity of the data: there is no easy operation, for example, to extract a clean list of ingredients from each row.


In [None]:
spice_list = ['salt', 'pepper', 'oregano', 'parsley']

__Example 1 :__

Find recipes containing *any* of the spices in the list. 

Join the list together into a single search pattern, with the terms separated by `|`:

In [None]:
pattern = '|'.join(spice_list)   # 'salt|pepper|oregano|parsley'

pattern

Find recipes containing spices using `str.contains()`

In [None]:
spice_index = recipes.ingredients.str.contains(pattern) 

recipes.name[spice_index].head()

__Example 2 :__

Find recipes containing *all* of the spices in the list. 

This code shows what we want to achieve:

In [None]:
spice_recipes = recipes[recipes.ingredients.str.contains('salt') &
                        recipes.ingredients.str.contains('pepper') &
                        recipes.ingredients.str.contains('oregano') &
                        recipes.ingredients.str.contains('parsley')]

spice_recipes.name

This code is repetitive and error prone. 

We can improve it with a for loop.

In [None]:
spice_recipes = recipes

for spice in spice_list:
    # filter the rows that remain, so the boolean mask always matches the current index
    spice_recipes = spice_recipes[spice_recipes.ingredients.str.contains(spice)]

spice_recipes.name 


Now that we have narrowed down our recipe selection from nearly 500 to 2, we can make a more informed decision about what we'd like to cook for dinner.
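
The loop above could be wrapped in a reusable function. The sketch below is one possible way to do this, not part of the original example: the function name `recipes_with_all` and the tiny `toy` table are hypothetical, and it assumes the ingredients column already holds a single string per recipe (as it does after the `str.join` step earlier). Matching is made case-insensitive so that 'Salt' and 'salt' are treated alike.

```python
import pandas as pd

def recipes_with_all(df, ingredients, column='ingredients'):
    """Return the rows of df whose `column` text mentions every item in `ingredients`."""
    mask = pd.Series(True, index=df.index)
    for item in ingredients:
        # regex=False treats the term as plain text; case=False ignores capitalisation
        mask &= df[column].str.contains(item, case=False, regex=False)
    return df[mask]

# Tiny made-up table standing in for the real recipe data
toy = pd.DataFrame({
    'name': ['Soup', 'Salad', 'Stew'],
    'ingredients': ['Salt, Pepper, oregano, parsley',
                    'salt, lemon',
                    'pepper, salt, parsley, oregano, beef'],
})

matches = recipes_with_all(toy, ['salt', 'pepper', 'oregano', 'parsley'])
print(matches['name'].tolist())
```

Building a single boolean mask, rather than repeatedly slicing the `DataFrame`, keeps the mask aligned with the original index throughout.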

<a id='TimeSeries'></a>
# 9. Working with Time Series

<br> <a href='#IndexingTime'>9.1 Pandas Time Series: Indexing by Time</a>
<br> <a href='#DataStructures'>9.2 Pandas Time Series : Data Structures</a> 
<br> <a href='#FrequenciesOffsets'>9.3 Frequencies and Offsets</a> 

Pandas was developed in the context of financial modeling, so it contains a fairly extensive set of tools for working with dates, times, and time-indexed data.

- *Time stamps* : moments in time (e.g., July 4th, 2015 at 7:00am).
- *Time intervals* and *periods* : length of time between a particular beginning and end point; for example, the year 2015. 
- *Time deltas* or *durations* : an exact length of time (e.g., a duration of 22.56 seconds).

This is a broad overview of how you as a user should approach working with time series.
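
The three kinds of time data listed above map onto three pandas scalar types. The following is a minimal sketch with illustrative values chosen for this example (a month-long period is used rather than a full year).

```python
import pandas as pd

stamp = pd.Timestamp('2015-07-04 07:00')   # time stamp: a single moment
period = pd.Period('2015-07', freq='M')    # period: the whole of July 2015
duration = pd.Timedelta(seconds=22.56)     # time delta: an exact length of time

# A period has a start and an end; a time stamp either falls inside it or not
print(period.start_time, period.end_time)
print(period.start_time <= stamp <= period.end_time)

# Adding a time delta to a time stamp gives another time stamp
print(stamp + duration)
```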

### Dates and Times in Python

More generally, Python has a number of available representations of dates, times, deltas, and timespans.

While the time series tools provided by Pandas tend to be the most useful for data science applications, it is helpful to see their relationship to other packages used in Python.

### Native Python dates and times: ``datetime`` and ``dateutil``

Manually building a date using the ``datetime`` type:

In [None]:
from datetime import datetime
datetime(year=2015, month=7, day=4)

Parse dates from a variety of string formats with `dateutil`:

In [None]:
from dateutil import parser
date = parser.parse("4th of July, 2015")
date

### Dates and times in pandas

Pandas provides a ``Timestamp`` object, which combines the ease of use of ``datetime`` and ``dateutil`` with the efficient storage and vectorized interface of NumPy's ``datetime64``. 

From a group of these ``Timestamp`` objects, Pandas can construct a ``DatetimeIndex`` that can be used to index data in a ``Series`` or ``DataFrame``.

Example : We can parse a flexibly formatted string date...

In [None]:
date = pd.to_datetime("4th of July, 2015")
date

Additionally, we can do NumPy-style vectorized operations directly on this same object:

In [None]:
date + pd.to_timedelta(np.arange(12), 'D')

In [None]:
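# 'M' (month) is not a fixed-length timedelta unit; use a calendar-aware DateOffset instead
pd.DatetimeIndex([date + pd.DateOffset(months=n) for n in range(12)])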

<a id='IndexingTime'></a>
## 9.1 Pandas Time Series: Indexing by Time



Where the Pandas time series tools really become useful is when you begin to *index data by timestamps*.

For example, we can construct a ``Series`` object that has time indexed data:

In [None]:
index = pd.DatetimeIndex(['2014-07-04', 
                          '2014-08-04',
                          '2015-07-04', 
                          '2015-08-04'])
data = pd.Series([0, 1, 2, 3], index=index)
data

Now that we have this data in a ``Series``, we can make use of any of the ``Series`` indexing patterns we discussed in previous sections:

In [None]:
data['2014-07-04':'2015-07-04']

There are additional special date-only indexing operations, such as passing a year to obtain a slice of all data from that year:

In [None]:
data['2015']

<a id='DataStructures'></a>
## 9.2 Pandas Time Series : Data Structures



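The fundamental Pandas data structures for working with time series data are:
- ``Timestamp`` : for time stamps (index structure: ``DatetimeIndex``)
- ``Period`` : for time periods (index structure: ``PeriodIndex``)
- ``Timedelta`` : for time deltas or durations (index structure: ``TimedeltaIndex``)

Passing a list of dates in a variety of formats to ``pd.to_datetime()`` yields a ``DatetimeIndex``: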

In [None]:
from datetime import datetime   # pd.datetime has been removed from pandas
dates = pd.to_datetime([datetime(2015, 7, 3), 
                        '4th of July, 2015',
                        '2015-Jul-6', 
                        '07-07-2015', 
                        '20150708'])
dates

Any ``DatetimeIndex`` can be converted to a ``PeriodIndex`` with the ``to_period()`` function with the addition of a frequency code; here we'll use ``'D'`` to indicate daily frequency:

In [None]:
dates.to_period('D')

A ``TimedeltaIndex`` is created, for example, when a date is subtracted from another:

In [None]:
dates - dates[0]

### Regular sequences: ``pd.date_range()``

Generating a regular date sequence.  

We've seen that Python's ``range()`` and NumPy's ``np.arange()`` turn a startpoint, endpoint, and optional stepsize into a sequence.

Similarly, ``pd.date_range()`` accepts a start date, an end date, and an optional frequency code to create a regular sequence of dates.

By default, the frequency is one day.

In [None]:
pd.date_range('2015-07-03', '2015-07-10')

Alternatively, the date range can be specified with just a startpoint and a number of periods:

In [None]:
pd.date_range('2015-07-03', periods=8)

The spacing can be modified by altering the ``freq`` argument, which defaults to ``D``.
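
As a brief illustration of changing ``freq``, the sketch below builds one sequence with a two-day spacing and one that only includes business days. The start date is the one used above; the choice of frequencies is just for illustration.

```python
import pandas as pd

# Every second day, starting on Friday 3 July 2015
every_other_day = pd.date_range('2015-07-03', periods=4, freq='2D')

# Business days only: the weekend of 4-5 July is skipped
business_days = pd.date_range('2015-07-03', periods=4, freq='B')

print(every_other_day)
print(business_days)
```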


<a id='FrequenciesOffsets'></a>
## 9.3 Frequencies and Offsets



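The following table summarizes the main frequency codes available to set the frequency or offset of time series data. (Note that pandas 2.2 and later deprecate several of these spellings in favour of new ones such as ``ME``, ``h`` and ``min``.)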

| Code   | Description         | Code   | Description          |
|--------|---------------------|--------|----------------------|
| ``D``  | Calendar day        | ``B``  | Business day         |
| ``W``  | Weekly              |        |                      |
| ``M``  | Month end           | ``BM`` | Business month end   |
| ``Q``  | Quarter end         | ``BQ`` | Business quarter end |
| ``A``  | Year end            | ``BA`` | Business year end    |
| ``H``  | Hours               | ``BH`` | Business hours       |
| ``T``  | Minutes             |        |                      |
| ``S``  | Seconds             |        |                      |
| ``L``  | Milliseconds        |        |                      |
| ``U``  | Microseconds        |        |                      |
| ``N``  | Nanoseconds         |        |                      |

The monthly, quarterly, and annual frequencies are all marked at the end of the specified period.
By adding an ``S`` suffix to any of these, they instead will be marked at the beginning:

| Code    | Description            | Code    | Description            |
|---------|------------------------|---------|------------------------|
| ``MS``  | Month start            | ``BMS`` | Business month start   |
| ``QS``  | Quarter start          | ``BQS`` | Business quarter start |
| ``AS``  | Year start             | ``BAS`` | Business year start    |
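
The difference between marking the end and the start of a period can be seen in a short sketch. The start date here is arbitrary, and the month-end sequence is built from the offset object ``pd.offsets.MonthEnd()`` rather than a string code, purely so that the example does not depend on which spelling of the month-end code a given pandas version accepts.

```python
import pandas as pd

# Month start: the first day of each month on or after the start date
month_starts = pd.date_range('2015-01-15', periods=3, freq='MS')

# Month end: the last day of each month on or after the start date
month_ends = pd.date_range('2015-01-15', periods=3, freq=pd.offsets.MonthEnd())

print(month_starts)
print(month_ends)
```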

For example, here we will construct a range of hourly timestamps:

In [None]:
pd.date_range('2015-07-03', periods=8, freq='H')

To create regular sequences of ``Period`` or ``Timedelta`` values, the very similar ``pd.period_range()`` and ``pd.timedelta_range()`` functions are useful.

A period with increments increasing by month:

In [None]:
pd.period_range('2015-07', periods=8, freq='M')

And a sequence of durations increasing by an hour:

In [None]:
pd.timedelta_range(0, periods=10, freq='H')

## Example: Visualizing Bicycle Counts

Let's take a look at data from an automated bicycle counter, installed on a bridge in 2012.

The data contains the hourly count of bicycle traffic on the east and west side of a bridge to the centre of the city.

In [None]:
#!curl -o FremontBridge.csv https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

We can use Pandas to read the CSV output into a ``DataFrame``.

We must specify that:
- the `Date` column is the index
- we want these dates to be automatically parsed

In [None]:
# Load the csv file
data = pd.read_csv('sample_data/FremontBridge.csv', index_col='Date', parse_dates=True)

data.head()

For convenience, we'll further process this dataset by shortening the column names 

In [None]:
data.columns=['Total', 'East', 'West']

Now let's take a look at the summary statistics for this data:

In [None]:
data.dropna().describe()

### Visualizing the data

We can gain some insight into the dataset by visualizing it.

Let's start by plotting the raw data using pandas `plot` function. 

In [None]:
data.plot()
plt.ylabel('Hourly Bicycle Count');

### Resampling and converting frequencies

One common need for time series data is resampling at a higher or lower frequency.

This can be done using:
- ``resample()`` (data aggregation method)
- ``asfreq()`` (data selection method)
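
The difference between aggregation and selection is easiest to see on a small example. The sketch below uses a made-up daily `Series` (not the bicycle data) and converts it to a two-day frequency both ways.

```python
import pandas as pd

# Six made-up daily values: 0, 1, 2, 3, 4, 5
index = pd.date_range('2015-01-01', periods=6, freq='D')
s = pd.Series(range(6), index=index)

# resample: every value in each 2-day bin contributes to the result
aggregated = s.resample('2D').sum()

# asfreq: only the value at the start of each 2-day step is kept
selected = s.asfreq('2D')

print(aggregated.tolist())
print(selected.tolist())
```

Both results share the same index (every second day); only the values differ.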


Let's resample the bicycle count data by week:

In [None]:
weekly = data.resample('W').sum()     # data resampled weekly. all values summed (other options available)

weekly.plot()                         # plot the DataFrame

plt.ylabel('Weekly bicycle count');

This shows us some interesting seasonal trends: 
- people bicycle more in the summer than in the winter
- within a particular season the bicycle use varies from week to week 

### Rolling Window Functions

__Rolling Function__ : Applies a function to different subsets of the full data set.

This rolling view makes available a number of aggregation operations by default.

__Rolling Mean__ : Smooths data by creating a series of averages of different subsets of the full data set.
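
A small sketch of a rolling mean on made-up numbers, with a window of 3 so that the arithmetic can be checked by hand: each output value is the mean of the current value and the two before it, and the first two positions are `NaN` because the window is not yet full.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6])

# Mean of each run of 3 consecutive values
smoothed = values.rolling(3).mean()

print(smoothed.tolist())
```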

We can use the ``rolling()`` method followed by an aggregation such as ``mean()``. 

Example: a 30 day rolling mean of our data (daily counts, window size of 30). 

In [None]:
daily = data.resample('D').sum()                     # resampled to daily measurements, all values summed

daily.rolling(30).mean().plot() # 30 day rolling mean (window size of 30)

plt.ylabel('mean daily count');

### Looking deeper into the data

While these smoothed data views are useful to get an idea of the general trend in the data, they hide much of the interesting structure.

Looking closer reveals this structure.

We can look at the average traffic as a function of the time of day.

We will use `GroupBy` where the grouping key is taken from the index, in the form `data.index.time`, `data.index.day`, `data.index.dayofweek`, etc.

In [None]:
by_time = data.groupby(data.index.time).mean() # group by time of day, mean value for each time

hourly_ticks = 4 * 60 * 60 * np.arange(6)      # 6 tick positions (given in seconds), one every 4 hours

by_time.plot(xticks=hourly_ticks);

The hourly traffic has a clear distribution.

The peaks are around 8:00 in the morning and 5:00 in the evening.

This is evidence of commuter traffic crossing the bridge.

Further evidence : differences between: 
- west side of the bridge (generally used going into the city), which peaks more strongly in the morning
- east side of the bridge (generally used coming out of the city), which peaks more strongly in the evening.



Let's look at how things change based on the day of the week. 

Again, we can do this with `groupby`:

In [None]:
by_weekday = data.groupby(data.index.dayofweek).mean() # group by weekday, mean value each weekday

by_weekday.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'] # re-name the index

weekday_ticks = np.arange(7)

by_weekday.plot(xticks=weekday_ticks)



There is a strong distinction between weekday and weekend totals.

Around twice as many riders cross the bridge on Monday through Friday as on Saturday and Sunday.

In [None]:
weekdays = data[data.index.weekday < 5]
weekdays_by_time = weekdays.groupby(weekdays.index.time).mean() 
weekdays_by_time.plot()

weekend = data[data.index.weekday >= 5]
weekend_by_time = weekend.groupby(weekend.index.time).mean() 
weekend_by_time.plot()

We see:
- a commute pattern during the work week
- a recreational pattern during the weekends


Note that we can also plot the data with `matplotlib` directly when we want finer control over how it is presented.

Pandas' `plot` is useful for producing plots quickly, with features such as a legend and a choice of plot style built in. 

In [None]:
weekdays_by_time.plot(kind="bar")
weekend_by_time.plot(kind="bar")

In [None]:
plt.subplot(1, 2, 1)
plt.plot(weekdays_by_time)

plt.subplot(1, 2, 2)
plt.plot(weekend_by_time)

<a id='SavingDataFrame'></a>
### Saving a Data Frame
The `to_csv` and `to_excel` methods can be used to save a `DataFrame`.

The `to_csv` method can be used to save the data as almost *any* text-based file type. 

In [None]:
weekend_by_time.to_csv('sample_data/hourly_cycle_count_weekend.csv')
weekend_by_time.to_csv('sample_data/hourly_cycle_count_weekend.txt')

weekend_by_time.to_excel('sample_data/hourly_cycle_count_weekend.xlsx')  # recent pandas versions write .xlsx, not .xls (requires openpyxl)

The full list of optional function arguments can be found here: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html

Five useful optional function arguments as examples:
- `header` : new headers can be assigned as a list of strings OR the header can be omitted using `header=False`
- `float_format` : the format string for floating point numbers, e.g. the number of decimal places
- `sep` : delimiter (default = ",")
- `mode` : the Python mode specifier (default = `w`)
- `index` : write row names (default = True)
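
The sketch below shows several of these arguments together. It uses a small made-up `DataFrame` in place of `weekend_by_time`, and omits the file path so that `to_csv` returns the text as a string instead of writing a file, which makes the effect of each argument easy to inspect.

```python
import pandas as pd

# Small made-up table standing in for weekend_by_time
df = pd.DataFrame({'East': [1.23456, 2.5], 'West': [3.0, 4.98765]},
                  index=['00:00', '01:00'])

text = df.to_csv(header=['E', 'W'],     # rename the columns on output
                 float_format='%.2f',   # two decimal places
                 sep='\t',              # tab-delimited
                 index=False)           # leave out the row labels

print(text)
```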

<a id='ReviewExercises'></a>
# 10. Review Exercises



## Review Exercise: Restaurant Data

Import the .tsv data set from https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv to a Pandas `DataFrame`.

The data is a record of orders placed at the restaurant 'Chipotle'.

__(a)__ 
What was the most popular item (item ordered the most times)? <br>*Hint : Use `groupby`*

__(b)__ 
What was the average sum of money spent on a single order? <br>*Hint : Convert prices to numerical data*

__(c)__ 
For each bowl, burrito and tacos item (`Chicken Bowl` etc) there is a  `choice description`.  
<br>Produce a table showing the number of times each `choice description` is selected for each bowl, burrito and tacos dish.
<br>*Hint : Use `str.contains` then use `groupby` with __two__ arguments (see planets example)* 
<br>Your answer should resemble the table shown below:

<img src="img/panda_table.png" alt="Drawing" style="width: 500px;"/>

In [None]:
# Review Exercise: Restaurant Data
# Example Solution
import pandas as pd

# import the dataset 
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv'  
chipo = pd.read_csv(url, sep = '\t')

# look at the data 
display(chipo.head())



# a) Most popular item
best = chipo.groupby('item_name')['quantity'].sum().idxmax()
print(f'most popular item is {best}')

# idxmax returns only the first maximum. 
# Are there any others? Print number of items ordered (high to low)
orders = chipo.groupby('item_name')['quantity'].sum().sort_values(ascending = False)
#display(orders.head())

# Verify that the totals of the first two items in the Series are different.
orders.iloc[0] != orders.iloc[1]



# b) Average money spent on single order
# remove symbols and convert string to float
chipo['item_price'] = chipo.item_price.replace(r'\$', '', regex=True).astype(float)
chipo.head()
bill = chipo.groupby('order_id')['item_price'].sum()     # sum of each order
print(f'mean amount spent is ${round(bill.mean(), 2)}')  # average sum



# c) 
choice_index = chipo.item_name.str.contains('Bowl|Burrito|Tacos')  # no spaces around |, otherwise they become part of the search terms
choice_items = chipo[choice_index]

choices = choice_items.groupby(['item_name', 'choice_description'])['quantity'].sum().unstack()
choices.fillna(0, inplace= True)
choices

## Review Exercise: Time Series Data

Import the multiTimeline.csv from the sample_data folder to a Pandas `DataFrame`.

The data is Google Trends data for the search keywords 'diet', 'gym' and 'finance', showing how their popularity varies over time. 

Plot the data.
*Hint : Use `skiprows` when importing the data to remove the first line.*

 

__(a)__ 
Can you see a yearly repeating pattern in the data?<br>
Plot the __average monthly instances of each search keyword__ for all years in the data set.<br>
*Hint : Use `groupby`*<br>
What conclusions could be drawn from this plot? 

__(b)__ 
Are there any noticeable long-term trends over the period 2004-2017? 
<br>It's difficult to see because of the annual fluctuations.
<br>Smooth the data to reveal any longer-term trends in the search keyword data.
<br>*Hint : use rolling average or downsampling* 

In [None]:
# Review Exercise: Time Series Data

resolutions = pd.read_csv('sample_data/multiTimeline.csv', 
                          skiprows=1, 
                          index_col='Month', 
                          parse_dates=True)

resolutions.columns = ['diet', 'gym', 'finance']

resolutions.plot()

# a)
by_month = resolutions.groupby(resolutions.index.month).mean()
by_month.plot()

# b)
# rolling average
for r in list(resolutions.columns):
    data = resolutions[[r]]         # select a single column (as a DataFrame)
    data.rolling(12).mean().plot()  # 12 month rolling average
    plt.xlabel('Year');             # original x series
    
# resampling    
yearly = resolutions.resample('Y').sum()     # data resampled yearly. all values summed (other options available)
yearly.plot()                                # plot the DataFrame