# Problem Set #2 

Python is open source, which means that it is free, and anyone can alter its source code. You can't do that with [Matlab](https://www.mathworks.com/products/matlab.html), for example, which is kind of like a ["walled garden"](https://twitter.com/gallantlab/status/1014680265711501312) that only wealthy people can enjoy. 

In the last problem set, we explored basic data types and functions embedded within Python. Because Python is open source, smart folks have taken the language and written a bunch of highly-optimized functions and methods to do certain tasks. These people then compile these functions into things called "libraries."

Libraries are therefore collections of functions and methods, often organized into [modules](https://docs.python.org/3/library/), that allow you to perform lots of actions without having to write your own code. In nearly all of the code in the last problem set, we were using the libraries that come with Python, called the standard library. 
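
To make the idea of a library concrete, here is a minimal sketch that uses two modules from the standard library. The modules `math` and `statistics` and the numbers are chosen only for illustration; they are not part of the problem set.

```python
# Modules from the standard library ship with Python, so nothing needs to be installed
import math
import statistics

# The "dot" (.) reaches a function that lives inside a module
root = math.sqrt(16)                      # square root of 16
average = statistics.mean([2, 4, 6, 8])   # arithmetic mean of a list

print(root)
print(average)
```

Someone else wrote and tested `sqrt` and `mean`, so we can call them instead of writing our own.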

### **Learning objectives:**
1. Learn how to use open source packages
2. Explore additional data structures in Python

In the lectures, we learned that data from different disaster reporting agencies can disagree because the agencies have different motivations for collecting data. In this problem set, we will be using data from those agencies to explore additional data structures in Python. 


**Total points for this problem set: 27 pts**
*   Example codes executed: 5 pts
*   Correct answers to problems: 17 pts
*   Comments added to responses: 5 pts

**Please do not forget to add comments (with the # sign) next to your code for all of the problems to explain what you are doing.**

## #1. Numpy

[Numpy](https://docs.scipy.org/doc/numpy-1.13.0/user/whatisnumpy.html) is a library. Click on the link and take a look. The documentation for numpy, which is one of the most important Python libraries, is crucial. Part of learning coding is building your confidence navigating the documentation associated with coding libraries.

In [None]:
# import numpy as a Python object named np
import numpy as np

## 1.1 NDArrays and additional data types

The core class (i.e., the "blueprint") of Numpy, which is just a library of code produced by people to do things, is the ndarray (n-dimensional array). 

Here, the "n" just means a positive whole number: it can be 1, 2, 3, and so on. The word [array](https://en.wikipedia.org/wiki/Array_data_structure) just refers to the way the data is arranged, kind of like in an Excel table. 

For example, that weather map we want to plot, if it's just something like surface temperature, is a 2-dimensional array, one dimension for latitude, one dimension for longitude. So each row in the array might correspond to a different line of longitude, and all the columns might represent different lines of latitude. Then the values in the table would be the temperature at that latitude and longitude pair.

If we want to make multiple maps showing how temperature evolves over time, it could be stored as a 3-dimensional array: latitude, longitude, and time. So this ndarray thing is just a new type of data class, just like the string class from the first problem set. See the image below.

In [None]:
from IPython.display import Image
Image(url='https://cdn-images-1.medium.com/max/2400/1*Ikn1J6siiiCSk4ivYUhdgw.png')
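
The two- and three-dimensional cases described above can be sketched as follows. The grid sizes are hypothetical and the arrays are filled with zeros rather than real temperatures; the point is only to see how the number of dimensions shows up in `ndim` and `shape`.

```python
import numpy as np

# Hypothetical grid sizes, chosen only for illustration
n_lat, n_lon, n_time = 4, 8, 3

# A single map: one value per (latitude, longitude) pair
temperature_map = np.zeros((n_lat, n_lon))

# A stack of maps: one value per (latitude, longitude, time) triple
temperature_series = np.zeros((n_lat, n_lon, n_time))

print(temperature_map.ndim, temperature_map.shape)
print(temperature_series.ndim, temperature_series.shape)
```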


In [None]:
# Let's create an array from a list
a = np.array([3,5,2,8,1])

# Notice the square brackets [] creating the list inside the function np.array(),
# where np is the name of the object containing the numpy library and the "dot" (.) allows us to access functions or methods from that library.
# So here it's saying, use the "array" function in the numpy library

In [None]:
# What is the shape?
a.shape

In [None]:
# What is the shape type?
type(a.shape)

In [None]:
# What is the datatype?
a.dtype

Data types are pretty crucial. Essentially, a data type just limits the values a given variable or object can take on. We've met a few data types already: 1. strings, 2. numbers, 3. dictionaries, 4. booleans, 5. tuples. 

Some other specific datatypes include:
1. Integers (`int`)
2. Booleans (`bool`)
3. Real (`float`)
4. Complex (`complex`)

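Above, it says the data type is 'int64'. What exactly does this mean? 

Because the list contained only whole numbers, `numpy` stored the elements of the array (ndarray) as 64-bit integers. If the list had contained decimal numbers, the default would have been `float64`, a double-precision (64-bit) float. 

The number of bits specifies how much memory each number gets and, for floats, how precise the computer is with the numbers.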

Precision is crucial in climate science. One of the interesting challenges in climate modeling
is that if I take a climate model, which is based on a set of deterministic equations (mass, momentum, energy), I will get a different answer on one computer than another computer, even if I provide the model with the same starting point (what we call _initial conditions_). This is because different computers generate round-off errors in their storage of floating point numbers. 
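
Round-off error is easy to see for yourself. The sketch below is only an illustration (it is not a climate model): it shows that a simple decimal sum is not stored exactly, and that a 32-bit float keeps fewer correct digits than a 64-bit float.

```python
import numpy as np

# 0.1 and 0.2 cannot be stored exactly in binary, so their sum is not exactly 0.3
total = 0.1 + 0.2
print(total == 0.3)             # False
print(np.isclose(total, 0.3))   # True: equal within a small tolerance

# The same value stored with fewer bits keeps fewer correct digits
one_third_32 = np.float32(1) / np.float32(3)
one_third_64 = np.float64(1) / np.float64(3)
print(one_third_32)
print(one_third_64)

# An integer array and a float array report different dtypes
ints = np.array([3, 5, 2], dtype=np.int64)
floats = np.array([3, 5, 2], dtype=np.float64)
print(ints.dtype, floats.dtype)
```

This is why comparing floats with `==` is risky; a tolerance-based check such as `np.isclose` is usually the safer choice.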

[Watch the great (and patient) Salman Khan explain binary](https://www.youtube.com/watch?v=ku4KOFQ-bB4), and once you've done that, watch the [fantastic Computerphile piece on floating point numbers](https://www.youtube.com/watch?v=PZRI1IfStY0). 

In [None]:
# Here is an array with a different datatype and shape
b = np.array([[3,4,8,5], [0,2,6,1]], dtype= np.float64)

# View
b

In [None]:
# Let's check datatype and shape
b.dtype, b.shape

In [None]:
# Arrays can be easily created with numpy
# Let's create a 9 x 9 array with just 0's 
zero_array = np.zeros((9,9))

print(zero_array)

In [None]:
# We can also create an array with ranges
np.arange(10)
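
`np.arange` also accepts a start, a stop, and a step. Here is a short sketch with made-up ranges; `np.ones` is included as a companion to `np.zeros`.

```python
import numpy as np

# Start at 0, stop before 10, step by 2
evens = np.arange(0, 10, 2)
print(evens)                     # [0 2 4 6 8]

# The stop value is excluded, so use 2017 to include 2016
years = np.arange(1996, 2017)
print(years[0], years[-1], years.size)

# Like np.zeros, but filled with 1.0
ones = np.ones((2, 3))
print(ones.shape)
```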

### Problem #1

Let's create arrays on disaster data from 1996 to 2016 as reported by Sigma.

Here is the table of # of disasters and # of victims:

|   Year    | # of disasters| # of victims |
|:------:|:------:|:------: |
| 1996 | 	312	|	21276 |
| 1997 |	303	|	23323 |
| 1998 | 297 | 	45416 |
| 1999 | 296 | 62846 |
| 2000 | 	299	 |	14950 |
| 2001 | 	298	 |	35609 |
| 2002 | 	287	|	22311 |
| 2003 | 	322	|	78894 |
| 2004 | 	355	|	242519 |
| 2005 | 	421	|	101563 |
| 2006 | 	367	|	32532 |
| 2007 | 	360	|	22199 |
| 2008 | 	334	|	240612 |
| 2009 | 	308	|	14948 |
| 2010 | 	345	|	304054 |
| 2011 | 	341	|	34072 |
| 2012 | 	326	|	14007 |
| 2013 | 	327	|	27063 |
| 2014 | 	344	|	12914 |
| 2015 | 	357	|	26543 |
| 2016 | 	355	|	10841 |

We want to save the data from this table as a Python object we can work with.

1A - Create one array with two separate lists for the number of disasters and number of victims. Print the array.

Hint: Think about what datatype you would like to use for this array.

In [None]:
# ENTER CODE HERE

1B - What is the shape of the data?

In [None]:
# ENTER CODE HERE

## 1.2 Indexing and Slicing

Indexing is how we pull individual data items out of an array. Slicing extends this process to pulling out a regular set of the items.

Keep in mind that Python uses a zero-based system.

In [None]:
# Let's create a practice array
# I am creating an array with the numbers 0 through 11, arranged in 3 rows and 4 columns
A = np.arange(12).reshape(3,4)

A

In [None]:
# How can we find the 2nd item along the first dimension (row) and the 3rd along the second dimension (column)
# Is the below code correct? 
A[2, 3]

In [None]:
# No, it is not! Because it is a zero-based system, we have to adjust accordingly
# The correct answer will be
A[1,2]

In [None]:
# We can also index one whole dimension
# Here is the first row (index 0 along the first dimension):
A[0]

Negative indices are also allowed, which permit indexing relative to the end of the array.

In [None]:
A[0,-1]

Slicing syntax is written as `start:stop[:step]`, where all numbers are optional.
- defaults: 
  - start = 0
  - stop = len(dim)
  - step = 1
- The second colon is also optional if no step is used.

Note that stop is one past the last item selected; one can also think of a slice as a half-open interval: `[start, stop)`
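
The slicing rules are easiest to see on a one-dimensional array. This is a small sketch on a made-up array `x`, not on the problem set data.

```python
import numpy as np

x = np.arange(10)   # [0 1 2 3 4 5 6 7 8 9]

print(x[2:5])       # [2 3 4]      start=2, stop=5 (5 is excluded)
print(x[:3])        # [0 1 2]      start defaults to 0
print(x[7:])        # [7 8 9]      stop defaults to the length
print(x[::2])       # [0 2 4 6 8]  every second item
print(x[1:8:3])     # [1 4 7]      start=1, stop=8, step=3
print(x[::-1])      # a negative step walks backwards
```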

In [None]:
# Get the 2nd and 3rd rows
A[1:3]

In [None]:
# All rows and 3rd column
A[:, 2]

In [None]:
# ... can be used to replace one or more full slices
A[..., 2]

### Problem #2

Let's practice indexing and shaping data from the array you made in Problem #1. 

In [None]:
# Print the array you produced in Problem #1

# ENTER CODE HERE

2A - What is the # of disasters that is located in the 10th year on record?



In [None]:
# ENTER CODE HERE

2B - What is the # of victims in each of the last five years?



In [None]:
# ENTER CODE HERE

2C - Reshape the array into a new object, so that it has seven years of only the number of disasters per list in the array. Print the new object to see that it is formatted correctly.


In [None]:
# ENTER CODE HERE

## 1.3 Indexing Arrays with Boolean Values

Numpy can easily create arrays of boolean values and use those to select certain values to extract from an array.

Sigma also reports data on the cost of the insured losses of disasters:

In [None]:
# Here is the array that represents the cost of all insured losses (in billions) per year from 1996 - 2016

disaster_cost = np.array([7.63, 7.03, 6.39, 8.91, 6.54, 37.25, 4.34, 
                          4.5, 5.06, 7.12, 6.46, 7.12, 9.92, 4.68, 
                          5.6, 7.72, 6.55, 8.47, 7.6, 10.3, 8.74])

In [None]:
# We want to know which years had insured losses of more than $5 billion
disaster_cost > 5

In [None]:
# Let's see the actual insured losses that were greater than $5 billion
print(disaster_cost[disaster_cost > 5])

In [None]:
# We want to know all the years that cost more than $5 billion
# We first need to create an array of all the years
years = np.array([1996, 1997, 1998, 1999, 2000, 2001, 
                  2002, 2003, 2004, 2005, 2006, 
                  2007, 2008, 2009, 2010, 2011,
                  2012, 2013, 2014, 2015, 2016])

# Let's find the years
print(years[disaster_cost > 5])

In [None]:
# Let's find the years that cost more than $5 billion, but less than $7 billion

print(years[(disaster_cost > 5) & (disaster_cost < 7)])
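
Besides `&` ("and"), boolean arrays can be combined with `|` ("or") and flipped with `~` ("not"). Here is a minimal sketch on made-up values. Each comparison needs its own parentheses, because `&` and `|` are evaluated before `>` and `<` in Python.

```python
import numpy as np

# Made-up values, for illustration only
values = np.array([2.0, 9.5, 4.1, 7.3, 12.8])

low_or_high = (values < 3) | (values > 10)   # "or": either condition is True
not_high = ~(values > 10)                    # "not": flips True and False

print(values[low_or_high])
print(values[not_high])

# True counts as 1, so summing a boolean array counts the matches
n_above_5 = np.sum(values > 5)
print(n_above_5)
```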

### Problem #3

You will be working with the data you created in Problem #1. 

In [None]:
# Print the data you created in Problem #1

# ENTER CODE HERE

3A - Create two separate Numpy arrays for total number of disasters and total number of victims from your original array.

In [None]:
# ENTER CODE HERE

3B - In which years were there more than 100,000 victims?

Hint: Will we need to use the array for the years?

In [None]:
# ENTER CODE HERE

3C - Are there years where there were more than 400 disasters and had more than 75,000 victims? If so, which year(s)?

In [None]:
# ENTER CODE HERE

## 1.4 Understanding the axis

I want to introduce an important concept when working with NumPy: the axis. This indicates the particular dimension along which a function should operate (provided the function takes multiple values and reduces them to a single value). 

Let's look at a concrete example with `sum`, which just adds elements together:

In [None]:
# Here is object A again
A

In [None]:
# Using the array A we created above, let's add all the elements
np.sum(A)

In [None]:
# Double check if that function accurately added all the elements
np.sum(A) == (0+1+2+3+4+5+6+7+8+9+10+11)

In [None]:
# Let's sum along axis 0 (down the rows), which gives one total per column
np.sum(A, axis = 0)

In [None]:
# Let's sum along axis 1 (across the columns), which gives one total per row
np.sum(A, axis = 1)
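
A useful way to remember the axis is that it names the dimension that disappears. The sketch below uses a small made-up array `M` so the totals are easy to check by hand.

```python
import numpy as np

M = np.arange(6).reshape(2, 3)
# [[0 1 2]
#  [3 4 5]]

col_totals = np.sum(M, axis=0)   # collapses the 2 rows: totals 3, 5, 7
row_totals = np.sum(M, axis=1)   # collapses the 3 columns: totals 3, 12

print(col_totals, col_totals.shape)   # shape (3,): one value per column
print(row_totals, row_totals.shape)   # shape (2,): one value per row

# Other reductions, such as the mean, take the same axis argument
row_means = np.mean(M, axis=1)
print(row_means)
```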

### Problem #4

From the object you created in problem #2C, find the total number of disasters in seven year intervals:

In [None]:
# View object from problem 2C

In [None]:
# ENTER CODE HERE

## #2. Pandas
[Pandas](http://pandas.pydata.org/) is an open source library providing high-performance, easy-to-use data structures and data analysis tools. Pandas is particularly suited to the analysis of _tabular_ data, i.e. data that can go into a table. In other words, if you can imagine the data in an Excel spreadsheet, then Pandas is the tool for the job.

### Pandas capabilities (from the Pandas website):

* A fast and efficient DataFrame object for data manipulation with integrated indexing;
* Flexible reshaping and pivoting of data sets;
* Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
* Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
* Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

### Problem #5

Import the `pandas` library named as a `pd` object.

In [None]:
# ENTER CODE HERE

## 2.1 Data Structures

### 2.1.1 Series

A Series represents a one-dimensional array of data. The main difference between a Series and numpy array is that a Series has an _index_. The index contains the labels that we use to access the data, like a date, for example.

There are many ways to [create a Series](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#series). We will just show a few.
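
For instance, a Series can be built from a dictionary, in which case the keys become the index. This is a sketch with made-up rainfall numbers, not one of the disaster datasets.

```python
import pandas as pd

# Made-up rainfall totals, for illustration only
rainfall = pd.Series({'Mon': 1.2, 'Tue': 0.0, 'Wed': 3.4})

# Built from a plain list, a Series gets a default 0, 1, 2, ... index
default_index = pd.Series([1.2, 0.0, 3.4])

print(rainfall)
print(default_index.index)
```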

In [None]:
# Let's create a series with another disaster reporting data source, Munich RE
# Here is the list of years to be the index
year = ['1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003',
        '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
        '2012', '2013', '2014', '2015', '2016']

# Here is the number of disasters that occurred each year
disaster_counts = [448, 411, 469, 449, 523, 446, 443, 431, 380, 449, 554,
                   602, 486, 531, 565, 528, 648, 585, 679, 745, 748]

# Create a Series of the number of disasters, indexed by the year it occurred
MunichRE_disasters = pd.Series(disaster_counts, index= year)

# View the time series data
MunichRE_disasters

What's also nice is that arithmetic operations and most `numpy` functions can be applied to Series.

An important point is that the Series keep their index during such operations.



In [None]:
# Let's log transform the counts
np.log(MunichRE_disasters)
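
Keeping the index matters most when two Series are combined: pandas lines the values up by label, not by position. Here is a small sketch with two made-up Series whose labels are in a different order.

```python
import pandas as pd

a = pd.Series([10, 20, 30], index=['x', 'y', 'z'])
b = pd.Series([1, 2, 3], index=['z', 'y', 'x'])   # same labels, different order

doubled = a * 2   # the index is kept
total = a + b     # values are matched by label: x with x, y with y, z with z

print(doubled)
print(total)
```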

We can also access the underlying index object if we need to:

In [None]:
# We double check the indexes we set
MunichRE_disasters.index

In [None]:
# How many disasters occurred in 2007?
MunichRE_disasters['2007']

In [None]:
# Similar to the above, we can also index by label using the 'loc' indexer
MunichRE_disasters.loc['2007']

In [None]:
# We can also index by the raw position, using the 'iloc' indexer.
# What was the number of disasters at position 12 (the 13th observation, since positions start at 0)?
MunichRE_disasters.iloc[12]
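
The difference between `loc` (labels) and `iloc` (positions) is easiest to see on a tiny made-up Series. Note that a label slice includes its end point, while a position slice does not.

```python
import pandas as pd

s = pd.Series([5, 6, 7, 8], index=['a', 'b', 'c', 'd'])

by_label = s.loc['b']       # the value stored under label 'b'
by_position = s.iloc[1]     # the value at position 1 (the 2nd item)
print(by_label, by_position)

label_slice = s.loc['a':'c']    # includes 'c'
position_slice = s.iloc[0:2]    # excludes position 2
print(label_slice)
print(position_slice)
```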

### Problem #6

Here is additional data from MunichRE:

|   Year    | Insured Losses ($bn) | 
|:------:|:------:|
| 1996 | 	17.05	|	
| 1997 |	7.56	|	
| 1998 | 25.60 | 
| 1999 | 36.04 |
| 2000 | 	13.25	 |
| 2001 | 	15.15	 |
| 2002 | 	21.80	|
| 2003 | 	20.85	|
| 2004 | 	49.33	|	
| 2005 | 	105.35	|	
| 2006 | 	18.00	|	
| 2007 | 	26.55	|	
| 2008 | 	42.69	|	
| 2009 | 	21.80	|
| 2010 | 	41.74	|	
| 2011 | 	105.35	|
| 2012 | 	59.78	|
| 2013 | 	33.19	|
| 2014 | 	28.45	|	
| 2015 | 	30.34	|
| 2016 | 	45.53	|	

6A - Create a new time series array object of the insured losses per year using Pandas. Print to see the time series array.


In [None]:
# ENTER CODE HERE

6B - What was the total insured losses in 2011? (Please use functions from above to show how you got the answer)



In [None]:
# ENTER CODE HERE

### 2.1.2 DataFrame

There is a lot more to Series, but they are limited to a single "column". A more useful Pandas data structure is the DataFrame. 

A DataFrame is basically a bunch of series that share the same index. It's a lot like a table in a spreadsheet.
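
To see the "bunch of Series" idea directly, here is a sketch that builds a small DataFrame from two made-up Series sharing one index. The column names `Count` and `Cost` are hypothetical.

```python
import pandas as pd

# Two made-up Series that share the same index
idx = ['2001', '2002', '2003']
count = pd.Series([4, 7, 5], index=idx)
cost = pd.Series([1.5, 2.0, 0.8], index=idx)

# Each Series becomes one column of the DataFrame
df = pd.DataFrame({'Count': count, 'Cost': cost})
print(df)

# Pulling a column back out gives a Series again
print(type(df['Count']))
```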

Let's create a DataFrame using the Munich RE disaster counts from the Series example and the insured losses from Problem #6:

|   Year    | # of total disasters| Insured losses ($bn) |
|:------:|:------:|:------: |
| 1996 | 	448	|	17.05 |
| 1997 |	411	|	7.56 |
| 1998 | 469 | 	25.60 |
| 1999 | 449 | 36.04 |
| 2000 | 	523	 |	13.25 |
| 2001 | 	446	 |	15.15 |
| 2002 | 	443	|	21.80 |
| 2003 | 	431	|	20.85 |
| 2004 | 	380	|	49.33 |
| 2005 | 	449	|	105.35 |
| 2006 | 	554	|	18.00 |
| 2007 | 	602	|	26.55 |
| 2008 | 	486	|	42.69 |
| 2009 | 	531	|	21.80 |
| 2010 | 	565	|	41.74 |
| 2011 | 	528	|	105.35 |
| 2012 | 	648	|	59.78 |
| 2013 | 	585	|	33.19 |
| 2014 | 	679	|	28.45 |
| 2015 | 	745	|	30.34 |
| 2016 | 	748	|	45.53 |

In [None]:
# First, we create a dictionary
Year = ['1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003',
                 '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
                 '2012', '2013', '2014', '2015', '2016']
MunichRE_data = {'Total_Disasters': [448, 411, 469, 449, 523, 446, 443, 431, 380, 449,
                           554, 602, 486, 531, 565, 528, 648, 585, 679, 745, 748],
        'Insured_Losses': [17.05, 7.56, 25.6, 36.04, 13.25, 15.15, 21.8, 20.85,
                           49.33, 105.35, 18, 26.55, 42.69, 21.80, 41.74, 
                           105.35, 59.78, 33.19, 28.45, 30.34, 45.53]} 

# Then, we change the dictionary into a DataFrame
MunichRE_data = pd.DataFrame(MunichRE_data, index= Year)

# View
MunichRE_data

Doesn't the Pandas dataframe look like the table I provided you with?

Pandas handles missing data very elegantly, keeping track of it through all calculations. We don't have any missing data in the above DataFrame, but if we thought we did, we could check as follows:

In [None]:
MunichRE_data.info()
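
Since the Munich RE table has no gaps, here is a sketch on a made-up DataFrame with one missing value (`np.nan`) to show what pandas does with it.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, for illustration only
df = pd.DataFrame({'Count': [4, np.nan, 5],
                   'Cost': [1.5, 2.0, 0.8]},
                  index=['2001', '2002', '2003'])

missing_per_column = df.isna().sum()   # how many values are missing in each column
print(missing_per_column)

count_mean = df['Count'].mean()        # missing values are skipped: (4 + 5) / 2
print(count_mean)

complete_rows = df.dropna()            # keep only rows with no missing values
print(complete_rows)
```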

We can also determine some summary measures using Pandas.

In [None]:
# Let's view the minimum values per column
MunichRE_data.min()

In [None]:
# Let's calculate the averages per column
MunichRE_data.mean()

In [None]:
# Let's calculate the standard deviations per column
MunichRE_data.std()

In [None]:
# To make it even easier, let's calculate summary statistics for each column all at once
MunichRE_data.describe()

How do we index dataframes through pandas?

In [None]:
# We can get the data of one column in two ways. One way:
MunichRE_data['Total_Disasters']

In [None]:
# Or with attribute (dot) syntax
MunichRE_data.Total_Disasters

We can also index for a specific position in the dataframe. 

For example, let's see how many total disasters there were in 2008. 

In [None]:
MunichRE_data.Total_Disasters['2008']
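
Another way to reach a single cell is to give `loc` both the row label and the column name. This is a sketch on a made-up DataFrame; the names `Count` and `Cost` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Count': [4, 7, 5], 'Cost': [1.5, 2.0, 0.8]},
                  index=['2001', '2002', '2003'])

cell = df.loc['2002', 'Count']   # row label first, then column name
row = df.loc['2002']             # a whole row, returned as a Series
same_cell = df.iloc[1, 0]        # the same cell, addressed by position

print(cell)
print(row)
print(same_cell)
```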

In [None]:
# We can also create new columns
# Let's calculate average loss per disaster for each year by dividing total insured losses by total disasters
MunichRE_data['AvgLossPerDisaster'] = MunichRE_data.Insured_Losses/MunichRE_data.Total_Disasters

# View
MunichRE_data

### Problem #7

Here is the Sigma data provided in problem 1 again: 

|   Year    | # of disasters| # of victims |
|:------:|:------:|:------: |
| 1996 | 	312	|	21276 |
| 1997 |	303	|	23323 |
| 1998 | 297 | 	45416 |
| 1999 | 296 | 62846 |
| 2000 | 	299	 |	14950 |
| 2001 | 	298	 |	35609 |
| 2002 | 	287	|	22311 |
| 2003 | 	322	|	78894 |
| 2004 | 	355	|	242519 |
| 2005 | 	421	|	101563 |
| 2006 | 	367	|	32532 |
| 2007 | 	360	|	22199 |
| 2008 | 	334	|	240612 |
| 2009 | 	308	|	14948 |
| 2010 | 	345	|	304054 |
| 2011 | 	341	|	34072 |
| 2012 | 	326	|	14007 |
| 2013 | 	327	|	27063 |
| 2014 | 	344	|	12914 |
| 2015 | 	357	|	26543 |
| 2016 | 	355	|	10841 |

Let's practice working with pandas.

7A - Create a pandas DataFrame with Sigma data, with columns named TotalManMadeDisasters and TotalVictims. Print the dataframe to see what it looks like.

In [None]:
# ENTER CODE HERE

7B - In a new column, calculate the average number of victims per disaster for each year



In [None]:
# ENTER CODE HERE

7C - What was the average number of victims per disaster in 2014? (Use functions we learned in this problem set to find the answer.)



In [None]:
# ENTER CODE HERE

## 2.2 Merging data

Pandas supports a wide range of methods for merging different datasets. These are described extensively in the [documentation](https://pandas.pydata.org/pandas-docs/stable/merging.html). Here we just give a few examples.

Let's now work with disaster data from EM-Dat. 

In [None]:
# We'll create a pandas DataFrame of # of total disasters per year from 1996 - 2016
Year = ['1996', '1997', '1998', '1999', '2000', '2001', '2002', '2003',
                 '2004', '2005', '2006', '2007', '2008', '2009', '2010', '2011',
                 '2012', '2013', '2014', '2015', '2016']
EMDat_disasters = {'Total_Disasters': [483, 524, 585, 719, 893, 772, 893, 726, 764,
                                     870, 747, 728, 663, 614, 676, 601, 561,
                                     544, 553, 600, 522]} 
EMDat_disasters = pd.DataFrame(EMDat_disasters, index= Year)

# View
EMDat_disasters

In [None]:
# Here is another pandas DataFrame from EM-Dat that reports insured losses per year from 1996 - 2016
EMDat_insuredlosses = {'Insured_Losses': [4.93, 3.76, 11.04, 23.47, 5.71, 6.57, 10.97,
                                          12.32, 43.15, 92.27, 7.08, 22.70, 30.92,12.53,
                                          29.05, 90.38, 35.57, 23.85, 16.17, 20.38, 36.58]} 
EMDat_insuredlosses = pd.DataFrame(EMDat_insuredlosses, index= Year)

# View
EMDat_insuredlosses

In [None]:
# Let's merge these two data sets together
merged_EMdat = EMDat_disasters.join(EMDat_insuredlosses)

# View
merged_EMdat
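
In the example above both DataFrames have exactly the same index. When they do not, the `how` argument of `join` decides which rows survive. Here is a sketch with two made-up DataFrames whose years only partly overlap.

```python
import pandas as pd

# Made-up data with partly overlapping years
left = pd.DataFrame({'Count': [4, 7, 5]}, index=['2001', '2002', '2003'])
right = pd.DataFrame({'Cost': [2.0, 0.8, 3.1]}, index=['2002', '2003', '2004'])

left_join = left.join(right)                 # default: keep every row of `left`
inner_join = left.join(right, how='inner')   # only years present in both
outer_join = left.join(right, how='outer')   # years present in either

print(left_join)    # 2001 has no Cost, so pandas fills in NaN
print(inner_join)
print(outer_join)
```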

### Problem #8

We want to compare the total number of disasters from different reporting agencies. In order to do so, we need to have the datasets in one object. 

Produce one dataframe with the total number of disasters from both MunichRE and EM-Dat by year by first creating a pandas DataFrame for each dataset, then merging the two. Print the merged dataframe to see what it looks like. 

Hint: Columns cannot have the same names. Please name the columns so that they indicate the proper disaster reporting source. 

In [None]:
# ENTER CODE HERE

## 2.3 Indexing using Boolean series

Like with Numpy arrays, we can also use boolean values to select certain values to extract from a pandas object.

Let's use the Munich RE data to see how we can index using Boolean values.

In [None]:
# Here is the MunichRE data again:
MunichRE_data

In [None]:
# Let's subset the data to show only information for years that had greater than $50 billion insured losses
costly = MunichRE_data[MunichRE_data.Insured_Losses > 50]

# View
costly
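
It can help to look at the boolean Series on its own before using it. Here is a sketch on a made-up DataFrame; the names `Count` and `Cost` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Count': [4, 7, 5], 'Cost': [1.5, 2.0, 0.8]},
                  index=['2001', '2002', '2003'])

mask = df['Cost'] > 1.0            # a boolean Series with the same index as df
print(mask)

counts = df.loc[mask, 'Count']     # rows where mask is True, one column only
print(counts)

matching_years = df[mask].index.tolist()   # just the row labels that matched
print(matching_years)
```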

### Problem #9

In the MunichRE data, which years had more than 450 disasters and more than $20 billion in insured losses? Please use code to answer this question. Write the code so that only the years that fit these criteria are shown. 

In [None]:
# ENTER CODE HERE