# Data Wrangling With Pandas

## Task One - Series
In the cell below, create a `pandas` Series that contains the populations of the ten most populous British cities. The necessary data is in `data/urban.md`. Print out the series.

In [8]:
import pandas as pd
import numpy as np

populations = pd.Series(
    data = [
        10803000,
        2517000,
        2449000,
        1659000,
        1100000,
        835000,
        805000,
        719000,
        719000,
        603000
    ],
    index = [
        'London', 
        'Birmingham', 
        'Manchester', 
        'Leeds-Bradford', 
        'Glasgow', 
        'Liverpool', 
        'Southampton-Portsmouth', 
        'Newcastle', 
        'Nottingham', 
        'Sheffield'
    ], 
    name = 'Population'
)
populations

London                    10803000
Birmingham                 2517000
Manchester                 2449000
Leeds-Bradford             1659000
Glasgow                    1100000
Liverpool                   835000
Southampton-Portsmouth      805000
Newcastle                   719000
Nottingham                  719000
Sheffield                   603000
Name: Population, dtype: int64
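
The `data` and `index` lists above are matched purely by position, so reordering one without the other would silently mislabel a city. As a minimal sketch of an alternative, a Series can also be built from a dict that keeps each city next to its population. Only the first three cities are shown, and the names `city_populations` and `populations_from_dict` are illustrative, not part of the exercise.

```python
import pandas as pd

# Subset of the data above, keyed by city so labels and values cannot drift apart.
city_populations = {
    'London': 10803000,
    'Birmingham': 2517000,
    'Manchester': 2449000,
}

# Dict keys become the index; insertion order is preserved.
populations_from_dict = pd.Series(city_populations, name='Population')
print(populations_from_dict)
```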

Sort the series alphabetically by city name, storing the result in a new series, and print it out.

In [4]:
alphabetical_populations = populations.sort_index()
alphabetical_populations

Birmingham                 2517000
Glasgow                    1100000
Leeds-Bradford             1659000
Liverpool                   835000
London                    10803000
Manchester                 2449000
Newcastle                   719000
Nottingham                  719000
Sheffield                   603000
Southampton-Portsmouth      805000
Name: Population, dtype: int64
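
`sort_index` orders a Series by its labels (here, the city names) and returns a new Series, leaving the original untouched. The sketch below, using a three-city subset and illustrative variable names, contrasts it with `sort_values`, which orders by the data instead.

```python
import pandas as pd

sample = pd.Series(
    {'London': 10803000, 'Birmingham': 2517000, 'Manchester': 2449000},
    name='Population',
)

by_name = sample.sort_index()                        # ordered by city name
by_population = sample.sort_values(ascending=False)  # largest population first

print(by_name)
print(by_population)
print(sample.index.tolist())  # the original order is unchanged
```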

Create and print a series consisting of the second half of the alphabetically-sorted series.

In [6]:
second_half = alphabetical_populations.iloc[len(alphabetical_populations)//2:]
second_half

Manchester                2449000
Newcastle                  719000
Nottingham                 719000
Sheffield                  603000
Southampton-Portsmouth     805000
Name: Population, dtype: int64
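
The slice above relies on integer division: `len(...) // 2` gives the position where the second half starts, and `.iloc` slices by position rather than by label. A small sketch with an odd-length Series (made-up values, illustrative names) shows that the extra element ends up in the second half.

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

midpoint = len(letters) // 2           # 5 // 2 == 2
front_half = letters.iloc[:midpoint]   # positions 0 and 1
back_half = letters.iloc[midpoint:]    # positions 2, 3 and 4

print(front_half.index.tolist())
print(back_half.index.tolist())
```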

## Task Two - Dataframes

In the cell below, create three new Series for
- the top attractions in each city
- whether the city is a port
- the geographical area of the city in square km

From all the series, create a DataFrame that contains all the information about the cities. Include a unique id for each city. Print out the resulting dataframe.

In [44]:
attractions = pd.Series(
    data = [
        'Big Ben',
        'Cadbury World',
        'Northcoders',
        'Armouries',
        'Kelvingrove',
        'Albert Dock',
        'Naval Dockyard',
        'Bigg Market',
        'Nottingham Castle',
        'Botanical Gardens'
    ],
    index = [
        'London', 
        'Birmingham', 
        'Manchester', 
        'Leeds-Bradford', 
        'Glasgow', 
        'Liverpool', 
        'Southampton-Portsmouth', 
        'Newcastle', 
        'Nottingham', 
        'Sheffield'
    ], 
    name = 'Top Attraction'
)

is_port = pd.Series(
    data = [
        'Yes',
        'No',
        'No',
        'No',
        'Yes',
        'Yes',
        'Yes',
        'Yes',
        'No',
        'No'
    ],
    index = [
        'London', 
        'Birmingham', 
        'Manchester', 
        'Leeds-Bradford', 
        'Glasgow', 
        'Liverpool', 
        'Southampton-Portsmouth', 
        'Newcastle', 
        'Nottingham', 
        'Sheffield'
    ],
    name = 'Is a port?'
)

area = pd.Series(
    data = [
        1737.9,
        598.9,
        630.3,
        487.8,
        368.5,
        199.6,
        192,
        180.0,
        176.4,
        167.5
    ],
    index = [
        'London', 
        'Birmingham', 
        'Manchester', 
        'Leeds-Bradford', 
        'Glasgow', 
        'Liverpool', 
        'Southampton-Portsmouth', 
        'Newcastle', 
        'Nottingham', 
        'Sheffield'
    ],
    name = 'Area (square km)'
)

cities = populations.index.to_list()
data = [populations, attractions, is_port, area]
df = (pd
    .DataFrame(data, columns=cities)
    .transpose()
)
df.insert(0, 'city_id', range(1, len(df) + 1))

df

| | city_id | Population | Top Attraction | Is a port? | Area (square km) |
|---|---|---|---|---|---|
| London | 1 | 10803000 | Big Ben | Yes | 1737.9 |
| Birmingham | 2 | 2517000 | Cadbury World | No | 598.9 |
| Manchester | 3 | 2449000 | Northcoders | No | 630.3 |
| Leeds-Bradford | 4 | 1659000 | Armouries | No | 487.8 |
| Glasgow | 5 | 1100000 | Kelvingrove | Yes | 368.5 |
| Liverpool | 6 | 835000 | Albert Dock | Yes | 199.6 |
| Southampton-Portsmouth | 7 | 805000 | Naval Dockyard | Yes | 192.0 |
| Newcastle | 8 | 719000 | Bigg Market | Yes | 180.0 |
| Nottingham | 9 | 719000 | Nottingham Castle | No | 176.4 |
| Sheffield | 10 | 603000 | Botanical Gardens | No | 167.5 |
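
Building the frame from a list of Series and transposing it works, but the transpose can leave every column with the generic `object` dtype. As a hedged alternative, `pd.concat` with `axis=1` lines the Series up by their shared index and keeps each column's own dtype. The sketch uses a three-city subset of the data above; all variable names in it are illustrative.

```python
import pandas as pd

sample_cities = ['London', 'Birmingham', 'Glasgow']
sample_populations = pd.Series([10803000, 2517000, 1100000], index=sample_cities, name='Population')
sample_is_port = pd.Series(['Yes', 'No', 'Yes'], index=sample_cities, name='Is a port?')
sample_area = pd.Series([1737.9, 598.9, 368.5], index=sample_cities, name='Area (square km)')

# Each Series becomes one column, named after the Series.
cities_df = pd.concat([sample_populations, sample_is_port, sample_area], axis=1)
cities_df.insert(0, 'city_id', range(1, len(cities_df) + 1))

print(cities_df.dtypes)
```

With numeric dtypes preserved, later arithmetic on the population and area columns produces a numeric column directly.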


Create and print a dataframe containing the top attractions in cities that are ports. Include only the city id and the name of the attraction.

In [45]:
attractions_in_port_cities = df[df['Is a port?'] == 'Yes']
attractions_in_port_cities = attractions_in_port_cities[['city_id', 'Top Attraction']]
attractions_in_port_cities

| | city_id | Top Attraction |
|---|---|---|
| London | 1 | Big Ben |
| Glasgow | 5 | Kelvingrove |
| Liverpool | 6 | Albert Dock |
| Southampton-Portsmouth | 7 | Naval Dockyard |
| Newcastle | 8 | Bigg Market |
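
The two-step selection above (filter the rows, then pick the columns) can also be written as a single `.loc` call that takes a boolean row mask and a list of column labels. A minimal sketch on a three-city subset, with illustrative names:

```python
import pandas as pd

sample_df = pd.DataFrame(
    {
        'city_id': [1, 2, 5],
        'Top Attraction': ['Big Ben', 'Cadbury World', 'Kelvingrove'],
        'Is a port?': ['Yes', 'No', 'Yes'],
    },
    index=['London', 'Birmingham', 'Glasgow'],
)

is_port_mask = sample_df['Is a port?'] == 'Yes'  # True for port cities
port_attractions = sample_df.loc[is_port_mask, ['city_id', 'Top Attraction']]

print(port_attractions)
```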


Create a dataframe that includes all the original data plus the population density (population per square km) of each city, ordered from lowest to highest density.

In [48]:
df['Population Density'] = df['Population'] / df['Area (square km)']
df = df.sort_values('Population Density')
df

| | city_id | Population | Top Attraction | Is a port? | Area (square km) | Population Density |
|---|---|---|---|---|---|---|
| Glasgow | 5 | 1100000 | Kelvingrove | Yes | 368.5 | 2985.074627 |
| Leeds-Bradford | 4 | 1659000 | Armouries | No | 487.8 | 3400.98401 |
| Sheffield | 10 | 603000 | Botanical Gardens | No | 167.5 | 3600.0 |
| Manchester | 3 | 2449000 | Northcoders | No | 630.3 | 3885.451372 |
| Newcastle | 8 | 719000 | Bigg Market | Yes | 180.0 | 3994.444444 |
| Nottingham | 9 | 719000 | Nottingham Castle | No | 176.4 | 4075.963719 |
| Liverpool | 6 | 835000 | Albert Dock | Yes | 199.6 | 4183.366733 |
| Southampton-Portsmouth | 7 | 805000 | Naval Dockyard | Yes | 192.0 | 4192.708333 |
| Birmingham | 2 | 2517000 | Cadbury World | No | 598.9 | 4202.704959 |
| London | 1 | 10803000 | Big Ben | Yes | 1737.9 | 6216.122907 |
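
The cell above adds the density column to `df` and then rebinds `df` to the sorted result, so the original ordering is lost. If the unsorted frame is still needed, one option is to chain `assign` and `sort_values`, both of which return new frames. A sketch on a three-city subset, with illustrative names:

```python
import pandas as pd

sample_df = pd.DataFrame(
    {
        'Population': [10803000, 2517000, 1100000],
        'Area (square km)': [1737.9, 598.9, 368.5],
    },
    index=['London', 'Birmingham', 'Glasgow'],
)

# The column name contains a space, so it is passed to assign via ** unpacking.
density = sample_df['Population'] / sample_df['Area (square km)']
by_density = (
    sample_df
    .assign(**{'Population Density': density})
    .sort_values('Population Density')  # ascending by default: low to high
)

print(by_density.index.tolist())
print(sample_df.columns.tolist())  # the source frame has no density column
```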
