
## P4DS: Assignment 2 (Autumn 2020)

# Data and Algorithms

### Outline

* PART I
   * Question 1: World cities (10 marks)
   * Question 2: Earthquakes   (10 marks)
* PART II (in a separate file)
   * Question 3: Spell checker (10 marks)


## Question 1: World Cities

In this coursework exercise, you will download data from the provided link and read it in as a `CSV` file using the Pandas data analysis package for Python.

The data we will use contains a variety of information about cities from around the world.

**Questions Overview**

* __Question 1a__ --- Read a `CSV` file into a `pandas` `DataFrame`. __[1 Mark]__

* __Question 1b__ --- Generate a list of the largest cities (by population) in the world. __[2 Marks]__

* __Question 1c__ --- Find all the cities of a given country. __[2 Marks]__

* __Question 1d__ --- Create a DataFrame that contains
    the largest cities in a given country. __[3 Marks]__

* __Question 1e__ --- Find the total population of people living in cities in a given country. __[4 Marks]__

Pandas provides many useful functions for accessing and manipulating information.
For this question you are recommended to use the following:

* ```pandas.read_csv(source)``` ---
The  [read_csv function](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html#pandas.read_csv) can accept a filename or URL as an argument.

* ```df.head()``` ---
  For a DataFrame object, ```df```, this method extracts the first 5 rows of data, so you can easily check what the data looks like.
  
* ```df.describe()``` --- for a DataFrame, ```df```, this method provides a table giving an overview of some basic statistical properties of the DataFrame.

Note that the ```head()``` and ```describe()``` methods are actually operations that 
return a new DataFrame object. If this value is returned by the last line of a cell
it will be displayed as a table, but if it is generated elsewhere in the code
you will not see any output unless you use the ```display``` function from the
```IPython.display``` module.
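
As a minimal sketch of these three functions, the following reads a tiny, made-up CSV from an in-memory string (rather than from `worldcities.csv`) and inspects it. The column names and values are illustrative assumptions, and `print` is used in place of `display` so that the example also runs outside a notebook:

```python
import io
import pandas as pd

# Toy CSV text standing in for a real file or URL (illustrative values only)
csv_text = """city,country,population
Tokyo,Japan,35676000
Mumbai,India,18978000
Delhi,India,15926000
"""

# read_csv accepts a filename, a URL or a file-like object such as StringIO
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)    # a new DataFrame holding the first 2 rows
summary = df.describe()    # a new DataFrame of statistics for numeric columns

# Neither value is the last line of a cell, so show them explicitly
print(first_rows)
print(summary)
```

In a notebook you could replace each `print` call with `display` to get the usual table rendering.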

### Question 1a
Store the ```worldcities.csv``` file in a Pandas DataFrame. This `CSV` file can be downloaded from the 
module's [data repository](https://teaching.bb-ai.net/P4DS/data/index.html).

More specifically, you need to modify the following cell in order to set the 
global variable ```WC_DF``` to a DataFrame containing the information from ```worldcities.csv```. 
Be sure to keep the same variable name `WC_DF` for this, otherwise you will 
break the autograder. **[1 Mark]** 

In [1]:
# Complete Question 1a in this cell.
import pandas as pd ## This is the module for creating and manipulating DataFrames
import numpy as np

# Modify this line to import the data using Pandas
WC_DF = pd.read_csv('https://teaching.bb-ai.net/P4DS/data/worldcities.csv')
WC_DF[:10]
# Note: the same city name can occur more than once within a country,
# and even within a single large state or province.

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
1,New York,New York,40.6943,-73.9249,United States,US,USA,New York,,19354922.0,1840034016
2,Mexico City,Mexico City,19.4424,-99.131,Mexico,MX,MEX,Ciudad de México,primary,19028000.0,1484247881
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
5,Delhi,Delhi,28.67,77.23,India,IN,IND,Delhi,admin,15926000.0,1356872604
6,Shanghai,Shanghai,31.2165,121.4365,China,CN,CHN,Shanghai,admin,14987000.0,1156073548
7,Kolkata,Kolkata,22.495,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520
8,Los Angeles,Los Angeles,34.1139,-118.4068,United States,US,USA,California,,12815475.0,1840020491
9,Dhaka,Dhaka,23.7231,90.4086,Bangladesh,BD,BGD,Dhaka,primary,12797394.0,1050529279


### Examples of  Working with `DataFrame`s

To complete these tasks you will need to access and filter a `DataFrame`. 
The `DataFrame` data structure has many convenient features for extracting and ordering information. Although conceptually it can be thought of as a computational
representation of a table, it is quite a complex data structure and takes a while
to master. The following questions can be done with only a small but powerful
set of `DataFrame` operations, and the following examples should be useful 
for coding your answers.

#### Accessing DataFrame columns and rows

Each column of a `DataFrame` is a list-like object called a
`Series`. Elements and slices of a `Series` can be accessed in a similar
fashion to those of a list. The following illustrates how to get the `Series` containing
the first 5 elements of the `city` column of `WC_DF`:

In [2]:
top_5_cities = WC_DF["city"][:5]   ## selects the first 5 items of the "city" column.
top_5_cities

0          Tokyo
1       New York
2    Mexico City
3         Mumbai
4      São Paulo
Name: city, dtype: object

In the above output, the left-hand column of the displayed value of `top_5_cities` 
shows the index label of each element. One of the differences between a `Series` and an ordinary list is that, whereas a list is always indexed by integer positions, a `Series` can have different kinds of values as its index labels. For instance (though there
is no reason to do this for the current assignment) we could set the index labels to alphabetic letters, as follows:

In [3]:
top_5_cities.index = list("abcde")
top_5_cities

a          Tokyo
b       New York
c    Mexico City
d         Mumbai
e      São Paulo
Name: city, dtype: object

In [4]:
top_5_cities.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

You can also use ```.values``` to return an `array` of the column values without
the index:

In [5]:
WC_DF["city"][:5].values

array(['Tokyo', 'New York', 'Mexico City', 'Mumbai', 'São Paulo'],
      dtype=object)

An `array` is also a list-like data structure. It does not have an index. The main difference between a list and an `array` is that the `array` is optimised for
storing large amounts of information and for efficiently applying numerical and
other operations to all elements of the array. Hence, `array`s are usually preferred
to lists when handling large amounts of information, or when storing numerical
vectors.
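
The difference is easy to see with arithmetic. The following sketch uses three population values copied from the table above; the variable names are only illustrative:

```python
import numpy as np

populations_list = [35676000, 19354922, 19028000]
populations_array = np.array(populations_list)

# Multiplying a list repeats it: the result has 6 items
repeated = populations_list * 2

# Arithmetic on an array is applied to every element at once
in_millions = populations_array / 1_000_000

print(len(repeated))
print(in_millions)
```

To do the same division with a plain list you would need an explicit loop or a list comprehension.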

You can also easily find the column names of the DataFrame using ```.columns```, for example:

In [6]:
WC_DF.columns

Index(['city', 'city_ascii', 'lat', 'lng', 'country', 'iso2', 'iso3',
       'admin_name', 'capital', 'population', 'id'],
      dtype='object')

__Note:__ The `Index` returned here is yet another type of list-like object. It is similar to an array,
except that it is used for indexing a `Series` or `DataFrame`. You do not usually
need to create or deal with `Index` objects directly, since this is done automatically when you create and manipulate `DataFrame`s. So you will normally only see one when
you want to look at the columns or rows of a `DataFrame`. But what you should be
aware of, when dealing with `DataFrame`s, is that the word _index_ can refer to
several different types of thing.

In many cases you can treat `Series`, `array` and `Index` objects like lists, and if you want to change them to an ordinary list you can just use the built-in `list` function,
as in the following:

In [7]:
list(WC_DF.columns)

['city',
 'city_ascii',
 'lat',
 'lng',
 'country',
 'iso2',
 'iso3',
 'admin_name',
 'capital',
 'population',
 'id']

We can refer to rows of a `DataFrame` either by the expression `DF.loc[label]`, where `label` is the index label of the row we want, or by `DF.iloc[n]`,
where `n` is an `int` giving the position of the row in the `DataFrame`.
In the case of `WC_DF`, the labels are integers that coincide with the row positions, so we would get the same result using either. You could test this. You could also see the difference if you try accessing an element of the `top_5_cities` `Series` defined above, after its index labels have been replaced by letters. In this case you could access elements either by letter, using `loc`, or by `int` position, using `iloc`.

In [27]:
WC_DF.iloc[4]

city           São Paulo
city_ascii     Sao Paulo
lat             -23.5587
lng              -46.625
country           Brazil
iso2                  BR
iso3                 BRA
admin_name     São Paulo
capital            admin
population    1.8845e+07
id            1076532519
Name: 4, dtype: object
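
The distinction between labels and positions can be sketched on a small example. The `Series` and `DataFrame` below are made up for illustration, with index labels chosen so that they do not coincide with the row positions:

```python
import pandas as pd

# A Series with letters as index labels
cities = pd.Series(["Tokyo", "New York", "Mexico City"], index=list("abc"))
by_label = cities.loc["b"]       # look up by index label
by_position = cities.iloc[1]     # look up by integer position

# A DataFrame whose integer labels differ from the row positions
table = pd.DataFrame({"city": ["Tokyo", "New York", "Mexico City"]},
                     index=[10, 20, 30])
row_by_label = table.loc[20]     # the row labelled 20
row_by_position = table.iloc[1]  # the second row (position 1)

print(by_label, by_position)
print(row_by_label["city"], row_by_position["city"])
```

Here `table.loc[1]` would raise a `KeyError`, because no row has the label `1`, even though there is a row at position `1`.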

### Iterating through the rows of a DataFrame
A convenient way of going through the rows of a `DataFrame` to perform some operation is by using the `iterrows` method in a `for` loop. This enables you to get both the index label and the row itself, for each successive row of the `DataFrame`. The following code is a simple example:

In [28]:
for i, row in WC_DF.iterrows():
    print(i, row['city_ascii'], row['lat'], row['lng'], row['id'])
    if i > 3:   # stop once the row labelled 4 has been printed
        break

0 Tokyo 35.685 139.7514 1392685764
1 New York 40.6943 -73.9249 1840034016
2 Mexico City 19.4424 -99.131 1484247881
3 Mumbai 19.017 72.857 1356226629
4 Sao Paulo -23.5587 -46.625 1076532519


### Sorting the rows of a DataFrame

It is easy, and often very useful, to sort the DataFrame by column values using ```.sort_values```, for example:

In [10]:
WC_DF.sort_values(by=["country"], ascending=True)[:10] # Sorts rows alphabetically by country

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
10235,Karukh,Karukh,34.4868,62.5918,Afghanistan,AF,AFG,Herāt,minor,17484.0,1004546127
8527,Kōṯah-ye ‘As̲h̲rō,Kotah-ye `Ashro,34.45,68.8,Afghanistan,AF,AFG,Wardak,,35008.0,1004450357
3341,Shibirghān,Shibirghan,36.658,65.7383,Afghanistan,AF,AFG,Jowzjān,admin,93241.0,1004805783
6050,Khōst,Khost,33.3395,69.9204,Afghanistan,AF,AFG,Khōst,admin,,1004919977
5141,Maḩmūd-e Rāqī,Mahmud-e Raqi,35.0167,69.3333,Afghanistan,AF,AFG,Kāpīsā,admin,7407.0,1004151943
2088,Lashkar Gāh,Lashkar Gah,31.583,64.36,Afghanistan,AF,AFG,Helmand,admin,201546.0,1004765445
3210,Gardēz,Gardez,33.6001,69.2146,Afghanistan,AF,AFG,Paktiyā,admin,103601.0,1004468894
6249,Maīdān Shahr,Maidan Shahr,34.3956,68.8662,Afghanistan,AF,AFG,Wardak,admin,,1004798735
2105,Maīmanah,Maimanah,35.9302,64.7701,Afghanistan,AF,AFG,Fāryāb,admin,199795.0,1004622920
7399,Andkhōy,Andkhoy,36.9317,65.1015,Afghanistan,AF,AFG,Fāryāb,minor,71730.0,1004472345


#### Note on __encodings__ of the city name
There are two columns that hold the city name. The first column name is `city` and
the second is `city_ascii`. There are various different ways in which textual
information can be encoded into bytes. These days [Unicode characters](https://home.unicode.org/)
encoded using [UTF-8](https://en.wikipedia.org/wiki/UTF-8) are pretty standard.
But the older [ASCII](https://en.wikipedia.org/wiki/ASCII) code, which uses
a single byte per character, is still commonly used. Unicode provides a huge
variety of text characters and other symbols, whereas ASCII is quite 
limited (mainly to characters and symbols found in standard English). 
But ASCII is simpler and in 
some ways easier to deal with than UTF-8. In the following questions you
will be asked to use the ASCII version of the city name (from the `city_ascii` column). This is mainly just to make you aware that there are different encodings
of text strings, but it will also prevent certain problems that could occur in the
Autograder if different people used different encodings.
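
The difference between the two encodings can be seen with a single city name. The sketch below uses only the Python standard library. The helper `to_ascii` is hypothetical: it shows one common way of stripping accents and is not necessarily how the `city_ascii` column was produced.

```python
import unicodedata

name = "S\u00e3o Paulo"   # 'São Paulo', with the accented letter written as an escape

# In UTF-8 the accented letter takes two bytes, so there are more bytes than characters
utf8_bytes = name.encode("utf-8")
print(len(name), len(utf8_bytes))

# ASCII cannot represent the accented letter, so a direct conversion fails
try:
    name.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

def to_ascii(text):
    """Hypothetical helper: decompose accented characters (NFKD) and then
    drop everything that is not an ASCII character."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(ascii_ok, to_ascii(name))
```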

### Filtering DataFrames
By _filtering_ we mean keeping some parts that we want and throwing away others.
Typically, we look for rows that match some condition; and the filter condition
is often some constraint involving the values for that row in one or more
columns.
`pandas` `DataFrame`s can be filtered according to the values of a column by using a boolean expression, for example:

In [30]:
filtered_DF = WC_DF[ WC_DF['capital'] == 'admin'] # This keeps only administrative capitals
filtered_DF.head()

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
3,Mumbai,Mumbai,19.017,72.857,India,IN,IND,Mahārāshtra,admin,18978000.0,1356226629
4,São Paulo,Sao Paulo,-23.5587,-46.625,Brazil,BR,BRA,São Paulo,admin,18845000.0,1076532519
5,Delhi,Delhi,28.67,77.23,India,IN,IND,Delhi,admin,15926000.0,1356872604
6,Shanghai,Shanghai,31.2165,121.4365,China,CN,CHN,Shanghai,admin,14987000.0,1156073548
7,Kolkata,Kolkata,22.495,88.3247,India,IN,IND,West Bengal,admin,14787000.0,1356060520


This way of filtering is a very powerful and useful aspect of `DataFrames`. 
However, the
syntax of the filter operation is rather unusual and a bit difficult to understand.

What is happening can be explained by these steps in the way a filter expression
is evaluated:
*  `DF['label']` (where `DF` is any `DataFrame`) gives a `Series` corresponding
    to the `'label'` column of `DF`.
    
*  `a_series == val` is a special use of `==`. When a comparison operator
    (such as `==`, `<` etc.) is applied to a `Series` object
   the result is actually a `Series` of Boolean values (not a single Boolean).
   The new `Series` obtained will have the value `True` for each element where
   the original `Series` satisfies `element == val`, and `False` for the rest.
   
* `DF[ bool_series ]` is a special kind of slice-like operation, where a Boolean
   series is given as a selection argument to the `DataFrame`. It will return 
   a new `DataFrame`, containing all the rows of `DF` for which `bool_series` has the 
   value `True`. (These rows can be quickly found because the `DataFrame` and
   the Boolean series both have the same `Index`.)
   
You do not necessarily need to follow all of that precise description of filtering,
but it will be extremely helpful if you are able to construct filtering
operations similar to the above example. You will see another example below,
in relation to the earthquake data you will be processing.
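
The three steps can be followed one at a time on a small, made-up `DataFrame` (the rows below are copied from the tables above, and the variable names are only illustrative). The last line also shows how two conditions can be combined with `&`, where each condition needs its own parentheses:

```python
import pandas as pd

cities = pd.DataFrame({
    "city_ascii": ["Tokyo", "Mumbai", "Delhi", "Osaka"],
    "country": ["Japan", "India", "India", "Japan"],
    "population": [35676000, 18978000, 15926000, 11294000],
})

# Steps 1 and 2: comparing a column with a value gives a Series of Booleans
mask = cities["country"] == "India"
print(list(mask))

# Step 3: indexing with the Boolean Series keeps only the rows marked True
indian_cities = cities[mask]

# Two conditions combined with & (element-wise "and")
big_japanese = cities[(cities["country"] == "Japan") &
                      (cities["population"] > 20_000_000)]

print(list(indian_cities["city_ascii"]))
print(list(big_japanese["city_ascii"]))
```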

### Question 1b 
Write a function `find_largest_cities` that takes an `int` argument `n` and uses the
pandas DataFrame `WC_DF` to derive and return a `list` of the `n` largest cities, in terms of population size. The list should contain the `ascii` names of the cities in order of population size, with the largest first.  

In [12]:
# Question 1b answer cell
def find_largest_cities(n):
    """Return the ASCII names of the n most populous cities, largest first."""
    # If n is None or less than 1, return an empty list
    if n is None or n < 1:
        return []

    # Sort cities by population, largest first, and keep the first n names
    df = WC_DF.sort_values(by=["population"], ascending=False)
    top_n_largest_cities = list(df[:n]['city_ascii'])
    return top_n_largest_cities

city = find_largest_cities(6)
city

['Tokyo', 'New York', 'Mexico City', 'Mumbai', 'Sao Paulo', 'Delhi']

__NOTE:__ In answering __1b__ you may assume that no two cities have exactly the same population, which is almost but not quite certain when dealing with large numbers like this. But, of course, when dealing with quantities where multiple data records could have the same value, we need to be careful, because this may not be the case.
For example, if we are interested in what equipment students own, we might think it would be informative
to find 'the top 10 students owning the most laptops'. In this case there could be: 1 student with 3 laptops, 23 students with 2 laptops, 160 with 1 laptop and 3 who do not own a laptop. In such a case it is not meaningful to pick the 'top 10' in terms of laptop ownership. A similar problem could potentially occur with the earthquake data that we will look at later, because the earthquake magnitudes are only recorded to 1 decimal place.
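
The laptop example can be reproduced in a few lines. This is only a sketch built from the numbers in the note above; it shows that a naive 'top 10' cuts through a group of tied students:

```python
import pandas as pd

# 1 student with 3 laptops, 23 with 2, 160 with 1 and 3 with none
laptops = pd.Series([3] + [2] * 23 + [1] * 160 + [0] * 3, name="laptops")

top_10 = laptops.sort_values(ascending=False).head(10)
cutoff = top_10.iloc[-1]                     # value held by the 10th student
tied_at_cutoff = int((laptops == cutoff).sum())
included = int((top_10 == cutoff).sum())

print(cutoff, tied_at_cutoff, included)
```

Only 9 of the 23 students who own 2 laptops appear in the 'top 10', and which 9 are chosen is arbitrary.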

### Question 1c
Define a function that returns a `list` _in alphabetical order_ of all the cities of a certain country. Make sure this function is case insensitive to the string input and that the output list contains
the `ascii` version of the city names. 

In [33]:
# Question 1c answer cell
def find_cities_in_country(country):
    """country is a string input; matching is case insensitive"""
    if country is None:
        return []

    # Compare lower-cased names, so that 'china' and 'CHINA' both match 'China'
    # (this also works for names such as 'Isle of Man', where title-casing would not)
    in_country = WC_DF['country'].str.lower() == country.lower()
    df = WC_DF[in_country].sort_values(by=["city_ascii"], ascending=True)

    # Remove repeated names while keeping the alphabetical order
    list_cities_of_country = list(dict.fromkeys(df['city_ascii']))
    return list_cities_of_country

cities = find_cities_in_country('China')
cities[:20]

['Aksu',
 'Altay',
 'Anda',
 'Ankang',
 'Anlu',
 'Anqing',
 'Anshan',
 'Anshun',
 'Baicheng',
 'Baiquan',
 'Baishan',
 'Baoding',
 'Baojishi',
 'Baoshan',
 'Baotou',
 'Bayan Hot',
 'Beian',
 'Beichengqu',
 'Beidao',
 'Beihai']

### Question 1d
Use the function you created in Question 1c and some of the code from Question 1b to create a new function that returns a `DataFrame` containing the `n` largest cities (by population) of a given country, in descending order of population (i.e. highest population first).

__Note:__ For this question, you must return a `DataFrame` (not a list). You could test this function with a command such as ```display(largest_cities(5, 'Japan'))```, which should show a table of city data for these
cities. Note also that the `display` function produces the same output as you get from its argument expression when it is on its own as the last line of a code cell. 

In [14]:
# Complete question 1d in this cell
def largest_cities(n, country): # country is a string argument
    # Return an empty DataFrame for missing or invalid arguments,
    # so that the result is always a DataFrame
    if country is None or not country.strip() or n is None or n < 1:
        return pd.DataFrame()

    # Select the rows of the country (case insensitive),
    # sort by population with the largest first, and keep the first n rows
    in_country = WC_DF['country'].str.lower() == country.strip().lower()
    df = WC_DF[in_country].sort_values(by=["population"], ascending=False)
    return df[:n]

largest_cities(6, 'Japan')

Unnamed: 0,city,city_ascii,lat,lng,country,iso2,iso3,admin_name,capital,population,id
0,Tokyo,Tokyo,35.685,139.7514,Japan,JP,JPN,Tōkyō,primary,35676000.0,1392685764
14,Ōsaka,Osaka,34.75,135.4601,Japan,JP,JPN,Ōsaka,admin,11294000.0,1392419823
82,Yokohama,Yokohama,35.32,139.58,Japan,JP,JPN,Kanagawa,admin,3697894.0,1392118339
103,Nagoya,Nagoya,35.155,136.915,Japan,JP,JPN,Aichi,admin,3230000.0,1392407472
133,Fukuoka,Fukuoka,33.595,130.41,Japan,JP,JPN,Fukuoka,admin,2792000.0,1392576294
156,Sapporo,Sapporo,43.075,141.34,Japan,JP,JPN,Hokkaidō,admin,2544000.0,1392000195


### Question 1e
Create a function that, given a country name, returns an `int` giving the total population 
living in the cities of that country, as recorded in `WC_DF`. 

**Hints:** 
* Use the function you created in Question 1c to find the cities of a country.
* You can ignore cities for which no population value is recorded in `WC_DF`.

In [15]:
## Question 1e Answer Code Cell
def country_city_population(country):
    if country is None or not country.strip():
        return None

    # Select the rows of the country (case insensitive)
    df = WC_DF[WC_DF['country'].str.lower() == country.strip().lower()]
    # Ignore cities with no recorded population (NaN fails the comparison)
    df = df[df['population'] >= 0]
    # Combine rows that share both a city name and an administrative region
    per_city = df.groupby(['city_ascii', 'admin_name'])['population'].sum()
    # Add up the populations as whole numbers
    return sum(int(p) for p in per_city)

num = country_city_population('India')
num

204338075

## Question 2: Earthquakes - Web Access and Pandas DataFrames

In this coursework exercise, you will learn how to download live information
from the web and process it using the Pandas data analysis package for Python.

The data we will use as an example is from the 
[United States Geological Survey (USGS)](https://www.usgs.gov/), which 
provides a wide range of geographic and geological information and data.
We shall be using their data relating to seismological 
events (i.e. Earthquakes) from around the world, which is published in the 
form of continually updated CSV files.

**Questions Overview**

* __Q2a__ --- Initialise a Pandas DataFrame by downloading earthquake data from the web. __[1 mark]__
* __Q2b__ --- Find earthquakes of a given magnitude or higher.           __[2 marks]__
* __Q2c__ --- Make a DataFrame showing the most powerful quakes          __[3 marks]__
* __Q2d__ --- Identify large cities endangered by powerful earthquakes   __[4 marks]__

### Question 2a: Read in data file
Read earthquake data from the USGS live feed CSV ```all_day.csv``` into a Pandas DataFrame.
The data can be obtained directly from http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv.

__Note:__ For this question you do not need to download and save the file ```all_day.csv```. It should
be loaded directly from the web feed. However, while testing, if you have no internet connection or
a bad connection you could download a copy of the file. But remember to put it back to downloading
the current one before you submit. Note also that ```all_day.csv``` is a live file, which lists
quakes recorded during the past 24 hours, and is updated every minute, so of course,
you will not always get the same file or the same results. More information about this and other
earthquake feeds provided by USGS can be found [here](https://earthquake.usgs.gov/earthquakes/feed/v1.0/csv.php).

In [16]:
# Q2a answer code cell
import pandas   ## This is the module for creating and manipulating DataFrames

# Here we have assigned the url of the quake datasource to the global variable 
# 'QUAKE_SOURCE' for your convenience.
QUAKE_SOURCE = ( "http://earthquake.usgs.gov/" +
                 "earthquakes/feed/v1.0/summary/all_day.csv" )
# Modify this line to import the data using Pandas
QUAKE_DF = pandas.read_csv(QUAKE_SOURCE, sep=',') 

#### You can use the following cell to test if you have read the quake data into `QUAKE_DF`

In [35]:
## If QUAKE_DF is a DataFrame, show the first 5 rows
if isinstance(QUAKE_DF, pandas.DataFrame):
    display(QUAKE_DF.head())
else:
    print("QUAKE_DF is not a DataFrame")

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
0,2020-11-20T10:40:32.080Z,38.813168,-122.79567,2.53,0.55,md,8.0,116.0,0.0052,0.01,...,2020-11-20T10:55:05.178Z,"5km NW of The Geysers, CA",earthquake,0.54,1.02,,1.0,automatic,nc,nc
1,2020-11-20T10:35:58.730Z,38.5402,-119.526,17.1,1.6,ml,12.0,98.31,0.197,0.13,...,2020-11-20T10:42:39.470Z,"3 km SSW of Coleville, California",earthquake,,0.7,,,automatic,nn,nn
2,2020-11-20T10:24:09.810Z,38.838001,-122.829666,1.96,0.57,md,7.0,83.0,0.004584,0.01,...,2020-11-20T10:39:07.084Z,"9km NW of The Geysers, CA",earthquake,0.48,0.81,,1.0,automatic,nc,nc
3,2020-11-20T10:10:05.600Z,38.1853,-117.7762,14.6,3.9,ml,62.0,56.82,0.034,0.36,...,2020-11-20T10:39:27.831Z,"36 km SE of Mina, Nevada",earthquake,,1.6,,,automatic,nn,nn
4,2020-11-20T10:04:12.550Z,35.587667,-117.4295,4.47,1.25,ml,19.0,144.0,0.02668,0.19,...,2020-11-20T10:07:58.473Z,"20km S of Searles Valley, CA",earthquake,0.41,0.52,0.194,17.0,automatic,ci,ci


### More examples of useful `pandas` functions

Here we show you some more pandas functions that you may find useful in this exercise. 

As we have seen, versatile filtering and sorting capabilities are provided by pandas. To get more understanding of these, you should look at tutorials of using Pandas DataFrames. But the following example illustrates how you can find and display quakes whose depth is greater than or equal to a given threshold:

In [18]:
def show_deep_quakes( depth ):
    # make deep_quakes DataFrame by selecting rows from QUAKE_DF
    deep_quakes = QUAKE_DF[ QUAKE_DF["depth"] >= depth ]  ## This is how you select rows by a condition
                                                          ## on one of the column values.
        
    print("Number of quakes of depth {} or deeper:".format(depth), 
           len(deep_quakes.index))     ## This finds the number of rows of the deep_quakes DataFrame
    
    display(deep_quakes.sort_values("depth", ascending=False))  ## Sort by descending depth value
    
show_deep_quakes(150)

Number of quakes of depth 150 or deeper: 4


Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
227,2020-11-19T14:50:40.785Z,-23.3986,-179.7895,540.05,4.9,mb,,42.0,5.977,0.89,...,2020-11-19T15:08:10.040Z,south of the Fiji Islands,earthquake,10.1,7.8,0.049,132.0,reviewed,us,us
170,2020-11-19T17:21:09.675Z,43.8986,141.9748,205.03,4.2,mb,,91.0,0.497,0.68,...,2020-11-19T21:13:57.040Z,"21 km NNW of Fukagawa, Japan",earthquake,9.7,4.9,0.079,45.0,reviewed,us,us
196,2020-11-19T15:49:16.775Z,-20.8692,-67.3827,195.62,4.4,mb,,77.0,1.811,0.83,...,2020-11-19T22:13:10.040Z,"73 km SW of Uyuni, Bolivia",earthquake,8.9,7.9,0.087,40.0,reviewed,us,us
102,2020-11-20T00:11:09.987Z,-14.9984,-71.3498,151.22,4.8,mb,,100.0,3.14,1.25,...,2020-11-20T00:22:12.040Z,"23 km SSE of Yauri, Peru",earthquake,8.6,7.2,0.048,132.0,reviewed,us,us


You can also find the ```max``` and ```min``` values in a column. For example:

In [36]:
QUAKE_DF["depth"].max()

540.05

In [37]:
QUAKE_DF["mag"].max()

5.5
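
If you want the whole row that holds the maximum, rather than just the value, one option is the `idxmax` method, which returns the index label of the largest value. The sketch below uses a small made-up table (the `place` names are placeholders) instead of the live `QUAKE_DF`:

```python
import pandas as pd

quakes = pd.DataFrame({
    "place": ["A", "B", "C"],          # placeholder place names
    "depth": [10.0, 540.05, 85.51],
    "mag": [5.5, 4.9, 4.7],
})

strongest_label = quakes["mag"].idxmax()   # index label of the largest magnitude
strongest = quakes.loc[strongest_label]    # the full row for that quake
deepest = quakes.loc[quakes["depth"].idxmax()]

print(strongest["place"], strongest["mag"])
print(deepest["place"], deepest["depth"])
```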

### Question 2b: Find Powerful Quakes

Write a function `powerful_quakes` that takes a numerical argument and returns a `DataFrame` including
all the quakes in `QUAKE_DF` that have a magnitude greater than or equal to the given argument.

In [38]:
# Complete question 2b answer cell
def powerful_quakes(mag):
    ## Returns a DataFrame with all quakes of magnitude greater than or equal to mag
    if mag is None:
        return pandas.DataFrame()

    # Keep only the rows whose magnitude is at least mag
    return QUAKE_DF[QUAKE_DF["mag"] >= mag]

powerful_quakes(4.2)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
11,2020-11-20T08:43:54.310Z,16.247,-94.345,85.51,4.7,mb,,132.0,2.121,0.71,...,2020-11-20T08:59:01.040Z,"16 km WSW of Chahuites, Mexico",earthquake,8.5,9.0,0.088,39.0,reviewed,us,us
47,2020-11-20T05:12:31.828Z,-21.9418,-68.5083,130.93,4.4,mb,,72.0,0.696,0.62,...,2020-11-20T05:38:39.040Z,"71 km NE of Calama, Chile",earthquake,6.1,9.1,0.218,6.0,reviewed,us,us
65,2020-11-20T03:19:15.545Z,-53.9263,140.5994,10.0,5.5,mww,,120.0,11.887,1.06,...,2020-11-20T05:23:13.069Z,west of Macquarie Island,earthquake,13.5,1.9,0.098,10.0,reviewed,us,us
80,2020-11-20T02:07:44.605Z,32.3388,141.8387,14.59,4.6,mb,,128.0,1.874,0.7,...,2020-11-20T02:59:27.040Z,"Izu Islands, Japan region",earthquake,8.9,5.1,0.107,26.0,reviewed,us,us
92,2020-11-20T00:57:00.415Z,6.8661,-73.0722,146.0,4.7,mb,,43.0,0.967,0.67,...,2020-11-20T01:25:27.492Z,"13 km S of Piedecuesta, Colombia",earthquake,6.0,7.0,0.031,317.0,reviewed,us,us
102,2020-11-20T00:11:09.987Z,-14.9984,-71.3498,151.22,4.8,mb,,100.0,3.14,1.25,...,2020-11-20T00:22:12.040Z,"23 km SSE of Yauri, Peru",earthquake,8.6,7.2,0.048,132.0,reviewed,us,us
164,2020-11-19T18:10:57.336Z,-1.4843,99.6307,53.98,4.7,mb,,201.0,2.284,0.64,...,2020-11-19T21:29:46.040Z,"99 km SW of Padang, Indonesia",earthquake,10.3,6.9,0.129,18.0,reviewed,us,us
170,2020-11-19T17:21:09.675Z,43.8986,141.9748,205.03,4.2,mb,,91.0,0.497,0.68,...,2020-11-19T21:13:57.040Z,"21 km NNW of Fukagawa, Japan",earthquake,9.7,4.9,0.079,45.0,reviewed,us,us
196,2020-11-19T15:49:16.775Z,-20.8692,-67.3827,195.62,4.4,mb,,77.0,1.811,0.83,...,2020-11-19T22:13:10.040Z,"73 km SW of Uyuni, Bolivia",earthquake,8.9,7.9,0.087,40.0,reviewed,us,us
227,2020-11-19T14:50:40.785Z,-23.3986,-179.7895,540.05,4.9,mb,,42.0,5.977,0.89,...,2020-11-19T15:08:10.040Z,south of the Fiji Islands,earthquake,10.1,7.8,0.049,132.0,reviewed,us,us


### Question 2c: Find the `n+` most powerful earthquakes

Produce a DataFrame of the `n` (or maybe more) most powerful quakes. The `DataFrame` should show at least `n`
quakes and may sometimes show more, since we do not want to leave out any quake that is equally
powerful as the last quake listed in the `DataFrame`.
More specifically, we want the function to return a `DataFrame` that:
* lists quakes in descending order of magnitude 
* contains all and only those quakes in `QUAKE_DF`, such that there are fewer than `n` other quakes in
  `QUAKE_DF` that have a higher magnitude.
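
To see what this rule means in practice, here is a small sketch on made-up magnitudes (the `place` letters are placeholders). It uses the pandas method `nlargest` with `keep="all"`, which keeps every row tied with the last selected value. This is just one possible way to illustrate the rule, not the required solution:

```python
import pandas as pd

quakes = pd.DataFrame({
    "place": list("ABCDEFG"),                      # placeholder names
    "mag": [4.7, 5.5, 4.4, 4.7, 4.9, 4.8, 4.7],
})

# Ask for the 4 most powerful quakes, keeping all ties with the 4th one
top = quakes.nlargest(4, "mag", keep="all")

print(list(top["mag"]))
print(len(top))
```

Three quakes share the fourth-highest magnitude of 4.7, so asking for 4 quakes gives back 6 rows.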

In [41]:
# Question 2c answer cell
def most_powerful_n_quakes(n):
    # Returns a DataFrame of the 'top n' quakes of the all_day.csv file,
    # including any quakes tied with the n-th one
    if n is None or n < 1:
        return pandas.DataFrame()

    # Sort by magnitude, largest first, and ignore quakes with no recorded magnitude
    df = QUAKE_DF.sort_values(by='mag', ascending=False).dropna(subset=['mag'])

    # If there are no more than n quakes, all of them qualify
    if n >= len(df):
        return df

    # Magnitude of the n-th quake; keep every quake at least this powerful
    threshold = df['mag'].iloc[n - 1]
    return df[df['mag'] >= threshold]

most_powerful_n_quakes(6)

Unnamed: 0,time,latitude,longitude,depth,mag,magType,nst,gap,dmin,rms,...,updated,place,type,horizontalError,depthError,magError,magNst,status,locationSource,magSource
65,2020-11-20T03:19:15.545Z,-53.9263,140.5994,10.0,5.5,mww,,120.0,11.887,1.06,...,2020-11-20T05:23:13.069Z,west of Macquarie Island,earthquake,13.5,1.9,0.098,10.0,reviewed,us,us
227,2020-11-19T14:50:40.785Z,-23.3986,-179.7895,540.05,4.9,mb,,42.0,5.977,0.89,...,2020-11-19T15:08:10.040Z,south of the Fiji Islands,earthquake,10.1,7.8,0.049,132.0,reviewed,us,us
102,2020-11-20T00:11:09.987Z,-14.9984,-71.3498,151.22,4.8,mb,,100.0,3.14,1.25,...,2020-11-20T00:22:12.040Z,"23 km SSE of Yauri, Peru",earthquake,8.6,7.2,0.048,132.0,reviewed,us,us
164,2020-11-19T18:10:57.336Z,-1.4843,99.6307,53.98,4.7,mb,,201.0,2.284,0.64,...,2020-11-19T21:29:46.040Z,"99 km SW of Padang, Indonesia",earthquake,10.3,6.9,0.129,18.0,reviewed,us,us
11,2020-11-20T08:43:54.310Z,16.247,-94.345,85.51,4.7,mb,,132.0,2.121,0.71,...,2020-11-20T08:59:01.040Z,"16 km WSW of Chahuites, Mexico",earthquake,8.5,9.0,0.088,39.0,reviewed,us,us
92,2020-11-20T00:57:00.415Z,6.8661,-73.0722,146.0,4.7,mb,,43.0,0.967,0.67,...,2020-11-20T01:25:27.492Z,"13 km S of Piedecuesta, Colombia",earthquake,6.0,7.0,0.031,317.0,reviewed,us,us


### Question 2d: Identifying Endangered Cities

The general idea of this question is to identify possible emergency situations by finding
cities over a given population that are within a certain distance of an earthquake of a given
magnitude or higher. (The precise specification is after the next code cell.)

To help answer this question you are provided with the function ```haversine_distance```, 
which will find the distance in metres between two locations that are specified in terms of
latitude and longitude values. When finding distances between points on the surface of the
Earth we need to use this formula, rather than the simpler Pythagorean distance formula,
because the Earth's surface is curved (approximately a sphere).


In [23]:
# Function to compute distance between locations (metres) 
# Returns the distance in metres, according to the Haversine formula,
# between two locations given as (latitude, longitude) coordinate pairs.

import math
def haversine_distance( loc1 , loc2 ):
    '''finds the distance (m) between 2 locations, where each location is a
    (latitude, longitude) pair given in degrees'''
    lat1, lon1 = loc1
    lat2, lon2 = loc2
    radius = 6371000  # radius of the Earth in metres
    dlat = math.radians(lat2 - lat1)
    dlon = math.radians(lon2 - lon1)
    a = (math.sin(dlat / 2) * math.sin(dlat / 2) +
         math.cos(math.radians(lat1)) * math.cos(math.radians(lat2)) *
         math.sin(dlon / 2) * math.sin(dlon / 2))
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))
    d = radius * c
    return d
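
For reference, the calculation performed by `haversine_distance` can be written as follows. The symbols are introduced here for convenience: $\varphi_1, \varphi_2$ are the two latitudes and $\lambda_1, \lambda_2$ the two longitudes (converted to radians), and $R = 6371000$ m is the Earth radius used in the code.

$$
a = \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right)
$$

$$
c = 2 \, \operatorname{atan2}\!\left(\sqrt{a}, \sqrt{1 - a}\right), \qquad d = R \, c
$$

As a sanity check, for two points on the same meridian ($\lambda_1 = \lambda_2$) the second term of $a$ vanishes and the formula reduces to $d = R \, |\varphi_2 - \varphi_1|$, the length of the arc between them.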

Using the `haversine_distance` function, write a function `endangered_cities` that takes 
three numerical arguments (`population`, `distance` and `magnitude`) and 
returns a `list` in alphabetical order of the `ascii` names of all cities listed in 
`WC_DF` such that:
* they have a population greater than or equal to the given population 
* their location is at a distance (in metres) of less than or equal to the given distance from
  an earthquake of magnitude greater than or equal to the given magnitude.

In [50]:
## 2d Answer Code Cell
def endangered_cities(population, distance, magnitude):
    
    # A set avoids listing a city twice when it is close to more than one earthquake.
    endangered = set()
    # earthquakes of magnitude greater than or equal to the given magnitude
    df_mag = QUAKE_DF[QUAKE_DF['mag'] >= magnitude][['latitude','longitude']]
    
    # cities with a population greater than or equal to the given population
    df_population = WC_DF[WC_DF['population'] >= population][['city_ascii','lat','lng','population','country']]
     
    for i in range(0, len(df_mag)):
        loc1 = [df_mag.iloc[i]['latitude'],df_mag.iloc[i]['longitude']]
        for j in range(0, len(df_population)):
            loc2 = [df_population.iloc[j]['lat'],df_population.iloc[j]['lng']]
            dis_loc1_to_loc2 = haversine_distance(loc1, loc2)
            if dis_loc1_to_loc2 <= distance:
                endangered.add(df_population.iloc[j]['city_ascii'])
                      
    # Sort once at the end so the result is alphabetical across all earthquakes.
    return sorted(endangered)

endangered_cities(200040, 5000000, 5.0)

['Adelaide',
 'Auckland',
 'Brisbane',
 'Canberra',
 'Christchurch',
 'Cranbourne',
 'Gold Coast',
 'Manukau City',
 'Melbourne',
 'Newcastle',
 'Northcote',
 'Perth',
 'Port Moresby',
 'Sydney',
 'Waitakere',
 'Wellington',
 'Wollongong']
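
The nested loops in the answer cell call `haversine_distance` once for every pair of city and earthquake, which can be slow on large tables. As an alternative, here is a minimal sketch that computes all the distances at once with NumPy broadcasting. It assumes the same column names as `WC_DF` and `QUAKE_DF`, but takes the two DataFrames as arguments so that it can be tried on small made-up data. The function names `haversine_matrix` and `endangered_cities_vectorised`, and the toy cities and quakes, are hypothetical and not part of the assignment.

```python
import numpy as np
import pandas as pd

def haversine_matrix(lat1, lon1, lat2, lon2, radius=6371000):
    """Distances in metres between every point of set 1 (rows) and set 2 (columns)."""
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against rounding slightly outside [0, 1]
    return 2 * radius * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

def endangered_cities_vectorised(cities, quakes, population, distance, magnitude):
    """Sorted, duplicate-free names of large cities near at least one strong quake."""
    big = cities[cities["population"] >= population]
    strong = quakes[quakes["mag"] >= magnitude]
    dist = haversine_matrix(big["lat"], big["lng"],
                            strong["latitude"], strong["longitude"])
    near_any = (dist <= distance).any(axis=1)  # one flag per city
    return sorted(set(big.loc[near_any, "city_ascii"]))

# Made-up toy data, for illustration only.
toy_cities = pd.DataFrame({
    "city_ascii": ["Alpha", "Beta", "Gamma"],
    "lat": [0.0, 0.0, 50.0],
    "lng": [0.0, 1.0, 0.0],
    "population": [500000, 100, 500000],
})
toy_quakes = pd.DataFrame({
    "latitude": [0.0, 0.5],
    "longitude": [0.5, 0.5],
    "mag": [6.0, 2.0],
})

print(endangered_cities_vectorised(toy_cities, toy_quakes, 1000, 100000, 5.0))
```

In the notebook, the same function could be called with `WC_DF` and `QUAKE_DF` as the first two arguments. The trade-off is memory: the distance matrix has one entry per pair of city and earthquake.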

## Optional Exercises

Having got this far, you may find it interesting and informative to do some more processing
of the city and earthquake information.

### Constructing a city risk status alert `DataFrame`

A government or other organisation may want to monitor a certain list of cities with regard to whether
they may be at risk of earthquake damage. To answer this question you should create a function
that uses the `endangered_cities` function you have defined above to create such a `DataFrame`.

Your function `city_risk_alert` should take a list of city names and return a pandas DataFrame containing the city name, country and a status of ```'ENDANGERED'``` or ```'SAFE'``` for each city in the list. You could also extend this to add more columns showing
things like the distance and magnitude of the nearest earthquake. And you could perhaps make it
so that any endangered cities are put at the top of the list.

For example:
```
display( city_risk_alert( ['Rome', 'Milan', 'Pisa'] ) )
```

might give the following output:

 
| city  | country | status |
|-------|---------|--------|
| Pisa  | Italy | ENDANGERED |
| Rome  | Italy | SAFE |
| Milan | Italy | SAFE |
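
One possible shape for such a function is sketched below. To keep the example self-contained it takes the city table and the list of endangered city names as arguments, rather than reading `WC_DF` and calling `endangered_cities` directly. The name `city_risk_alert_sketch`, its extra parameters and the toy data are assumptions made for illustration, and the exercise does not say which population, distance and magnitude thresholds to use.

```python
import pandas as pd

def city_risk_alert_sketch(city_names, cities, endangered):
    """Return city, country and status for each requested city, ENDANGERED rows first."""
    endangered = set(endangered)
    rows = cities[cities["city_ascii"].isin(city_names)][["city_ascii", "country"]]
    rows = rows.rename(columns={"city_ascii": "city"})
    rows["status"] = ["ENDANGERED" if c in endangered else "SAFE" for c in rows["city"]]
    # "ENDANGERED" sorts before "SAFE", so an ascending sort puts alerts at the top.
    return rows.sort_values(["status", "city"]).reset_index(drop=True)

# Made-up toy data, for illustration only.
toy_cities = pd.DataFrame({
    "city_ascii": ["Rome", "Milan", "Pisa", "Leeds"],
    "country": ["Italy", "Italy", "Italy", "United Kingdom"],
})
toy_endangered = ["Pisa"]  # stands in for the list returned by endangered_cities

alert = city_risk_alert_sketch(["Rome", "Milan", "Pisa"], toy_cities, toy_endangered)
print(alert)
```

In the notebook this could be called as `city_risk_alert_sketch(names, WC_DF, endangered_cities(p, d, m))` for thresholds `p`, `d` and `m` of your choice. Note that matching on the name alone treats cities that share a name (in different countries, for example) as the same city, so a fuller solution would need to distinguish them.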
 

## Visualisation Exercise: display endangered cities on a map

The code below creates a Map object using the ```ipyleaflet``` module and uses this to
display powerful quakes on the map. If you have coded the `powerful_quakes` function for
Question 2b above, the code in the cell below the map should draw the detected powerful
quakes onto the map at their correct locations.

To install the ```ipyleaflet``` module use ```!pip3 install ipyleaflet```. If the map does not display after installation, be sure to restart the kernel, then close and reopen this file. We provide the ```draw_circle_on_map``` function, which adds a circle at a specified location on the map, where the location is given as a (latitude, longitude) pair.



In [25]:
from ipyleaflet import Map, basemaps, basemap_to_tiles, Circle, Polyline
from ipywidgets import Layout

LEEDS_LOC  = ( 53.8008,  -1.5491  ) # Here we define the (latitude, longitude) of Leeds
WORLD_MAP = Map(basemap=basemaps.OpenTopoMap, center=LEEDS_LOC, zoom=1.5,
                layout=Layout(height="500px")) # Here we create a map object centred on Leeds

WORLD_MAP

Map(center=[53.8008, -1.5491], controls=(ZoomControl(options=['position', 'zoom_in_text', 'zoom_in_title', 'zo…

In [26]:
def draw_circle_on_map( a_map, location, radius = 1000, color="red", fill_color=None ):
    if not fill_color:
        fill_color = color
    circle = Circle()
    circle.location = location
    circle.radius = radius
    circle.color = color
    circle.fill_color = fill_color
    a_map.add_layer(circle)
    
draw_circle_on_map(WORLD_MAP, LEEDS_LOC, color="green" ) # This will edit your previous map rather than produce a new one

def display_powerful_quakes_on_map(mag):
    powerful = powerful_quakes(mag)  # use the argument rather than a hard-coded threshold
    for i, quake in powerful.iterrows():
        draw_circle_on_map( WORLD_MAP,
                            (quake["latitude"],quake["longitude"]), 
                            radius= 20000*int(quake["mag"]) )

display_powerful_quakes_on_map(3)

### More Ideas for Graphical Display
It would be nice to also see the endangered cities on the map. For an ambitious exercise you
could see if you can draw lines on the map running from the locations of powerful
earthquakes to the cities that are endangered by them.
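
As a starting point, here is a rough sketch of how such lines might be drawn. It separates the pairing logic from the drawing: `quake_city_links` finds which cities lie within a given distance of each quake, and `draw_links_on_map` adds one `Polyline` per pair to an `ipyleaflet` map. Both function names are hypothetical. The distance function is passed in as an argument, so in the notebook you could supply `haversine_distance`; the toy distance used below is only there so that the example runs without the rest of the notebook.

```python
def quake_city_links(quake_locs, city_locs, max_distance, distance_fn):
    """Pair each quake location with every city location within max_distance of it."""
    return [(q, c) for q in quake_locs for c in city_locs
            if distance_fn(q, c) <= max_distance]

def draw_links_on_map(a_map, links, color="blue"):
    """Add one straight line to the map for each (quake_location, city_location) pair."""
    from ipyleaflet import Polyline  # imported here so the helper above works without it
    for quake_loc, city_loc in links:
        line = Polyline(locations=[list(quake_loc), list(city_loc)],
                        color=color, fill=False)
        a_map.add_layer(line)

# Toy check with a simple stand-in distance (not a real geographic distance).
toy_distance = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
links = quake_city_links([(0, 0)], [(0, 1), (10, 10)], 2, toy_distance)
print(links)
```

In the notebook, something like `draw_links_on_map(WORLD_MAP, quake_city_links(quake_locs, city_locs, 500000, haversine_distance))` would then draw the lines, where `quake_locs` and `city_locs` are lists of (latitude, longitude) pairs that you build from the earthquake and city DataFrames.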