# Exercise 1
 <p><div class="lev1"><a href="#Task-A.-Another-LEGO-brick-in-the-wall"><span class="toc-item-num">Task A.&nbsp;&nbsp;</span>Another LEGO brick in the wall</a></div>
 <p><div class="lev1"><a href="#Task-B.-Drop-the-Bike"><span class="toc-item-num">Task B.&nbsp;&nbsp;</span>Drop the Bike</a></div>

In [1]:
# Add your imports here
import pandas as pd

In [2]:
DATA_FOLDER = 'data'

## Task A. Another LEGO brick in the wall

LEGO is a popular brand of toy building bricks. They are often sold in sets in order to build a specific object. Each set contains a number of parts in different shapes, sizes and colors. This database contains information on which parts are included in different LEGO sets. It was originally compiled to help people who owned some LEGO sets already figure out what other sets they could build with the pieces they had.

This dataset contains the official LEGO colors, parts, inventories (i.e., sets of LEGO parts which, when assembled, create an object in the LEGO world) and sets (i.e., sets of LEGO inventories which, when assembled, create a LEGO ecosystem). The schema of the dataset is shown in the following UML diagram:

![lego-schema](lego-schema.png)

In this task you have to apply the following Data Wrangling pipeline:
1. Load your data into `Pandas`
2. Explore it and clean its dirty parts
3. Use it to answer a set of queries

Each of these subtasks is described in detail below.

### A1. Loading phase
Load all the csv files into different `DataFrames`. Use meaningful names for your `DataFrames` (e.g., the respective filenames).

*Hint: You can load files without first unzipping them (for `Pandas` version >= 0.18.1).*

In [3]:
LEGO_DATA_FOLDER = DATA_FOLDER + '/lego'

In [17]:
# Write your code here
themes = pd.read_csv(LEGO_DATA_FOLDER + '/themes.csv.zip')
colors = pd.read_csv(LEGO_DATA_FOLDER + '/colors.csv.zip')
inventories = pd.read_csv(LEGO_DATA_FOLDER + '/inventories.csv.zip')
inventory_parts = pd.read_csv(LEGO_DATA_FOLDER + '/inventory_parts.csv.zip')
inventory_sets = pd.read_csv(LEGO_DATA_FOLDER + '/inventory_sets.csv.zip')
part_categories = pd.read_csv(LEGO_DATA_FOLDER + '/part_categories.csv.zip')
parts = pd.read_csv(LEGO_DATA_FOLDER + '/parts.csv.zip')
sets = pd.read_csv(LEGO_DATA_FOLDER + '/sets.csv.zip')
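
The eight near-identical calls above could also be written as a loop. The following is only a sketch under assumptions: the helper `load_tables`, the two demo table names and the temporary folder are hypothetical and stand in for the real `data/lego` folder, which is assumed to hold one `<name>.csv.zip` file per table.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical subset of table names, used only for this demo.
TABLE_NAMES = ['colors', 'sets']


def load_tables(folder, names):
    """Read '<folder>/<name>.csv.zip' for every name into a dict of DataFrames."""
    # pandas infers zip compression from the '.zip' suffix.
    return {name: pd.read_csv(Path(folder) / f'{name}.csv.zip') for name in names}


# Self-contained demo: write two tiny zipped CSVs, then load them back.
demo_folder = Path(tempfile.mkdtemp())
pd.DataFrame({'id': [1, 2], 'name': ['Black', 'Blue']}).to_csv(
    demo_folder / 'colors.csv.zip', index=False)
pd.DataFrame({'set_num': ['00-1'], 'year': ['70s']}).to_csv(
    demo_folder / 'sets.csv.zip', index=False)

tables = load_tables(demo_folder, TABLE_NAMES)
print({name: df.shape for name, df in tables.items()})
```

A dict keeps all tables addressable by file name (`tables['sets']`), at the cost of losing the separate variable names used in the cell above.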

### A2. Cleaning phase
Explore the following columns from your dataset:

1. sets: year
2. inventory_parts: quantity

What is the time range of the sets? 
What is the average quantity of the inventory parts? 
Do you see any inconsistencies? 
Provide code that detects and cleans such inconsistencies and validates the coherence of your dataset. 

Time range of the sets:

In [37]:
print(sets.year.iloc[[0, 7, 11645, 11643]])

0           70s
7         19788
11645    -20122
11643     -2014
Name: year, dtype: object


In [42]:
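def read_year(year_s):
    """Recover a plausible year from a dirty string such as '70s', '19788' or '-2014'."""
    try:
        # A leading minus sign is treated as a typo.
        year = abs(int(year_s))
        if year > 2018:
            # Too large to be real: assume the last digit was duplicated and drop it.
            return read_year(year_s[:-1])
        return year
    except ValueError:
        # Non-numeric suffix (e.g. '70s'): strip it and read the rest as 19xx.
        return 1900 + read_year(year_s[:-1])

correct_year = sets.year.apply(read_year)
print('year min:', correct_year.min(), 'year max:', correct_year.max())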

year min: 1950 year max: 2017


In [46]:
print(inventory_parts.quantity.iloc[[9]])

9   -inf
Name: quantity, dtype: float64


In [49]:
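def read_quantity(quantity):
    # -inf cannot be a real quantity; it is counted as 0 here, which lowers
    # the mean compared with dropping those rows.
    if quantity == float('-inf'):
        return 0
    return int(quantity)

correct_quantity = inventory_parts.quantity.apply(read_quantity)
print(correct_quantity.mean())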

2.7670542575540584
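
The task also asks for code that validates the coherence of the dataset after cleaning. One possible shape for such a check is sketched below on toy data. The column names (`set_num`, `inventory_id`, `id`), the `validate` helper and the lower bound `min_year=1900` are assumptions for illustration; the upper bound 2018 is the one used in `read_year` above.

```python
import pandas as pd

# Toy stand-ins for the cleaned tables; column names are assumptions.
sets_demo = pd.DataFrame({'set_num': ['a-1', 'b-1', 'c-1'],
                          'year': [1970, 1978, 2012]})
inventories_demo = pd.DataFrame({'id': [1, 2], 'set_num': ['a-1', 'zz-9']})
inventory_parts_demo = pd.DataFrame({'inventory_id': [1, 1, 2],
                                     'quantity': [4, 0, 2]})


def validate(sets_df, inventories_df, inventory_parts_df,
             min_year=1900, max_year=2018):
    """Return a dict {problem: row count}; an empty dict means all checks passed."""
    problems = {}
    n_bad_years = int((~sets_df['year'].between(min_year, max_year)).sum())
    if n_bad_years:
        problems['year out of range'] = n_bad_years
    # -inf also compares as negative, so uncleaned placeholders are caught here.
    n_negative = int((inventory_parts_df['quantity'] < 0).sum())
    if n_negative:
        problems['negative quantity'] = n_negative
    # Referential check: every inventory should point at a known set.
    n_orphans = int((~inventories_df['set_num'].isin(sets_df['set_num'])).sum())
    if n_orphans:
        problems['inventories without a known set'] = n_orphans
    return problems


report = validate(sets_demo, inventories_demo, inventory_parts_demo)
print(report)
```

Returning counts rather than raising on the first failure makes it easier to see every inconsistency in one pass.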


__\* Briefly explain your approach here \*__

### A3. Querying phase
Answer the following queries using the functionality of `Pandas`:

1. List the ids of the inventories that belong to sets that contain cars. (*Hint: Find a smart way to distinguish which sets contain cars based on the sets' names*).
2. Plot the distribution of part categories as a (horizontal) bar chart. Restrict yourself to the 20 largest part categories (in terms of the number of parts belonging to the category).
3. Find the dominant color of each set. Then, plot using a (horizontal) bar chart, the number of sets per dominant color. Color each bar with the respective color that it represents.
4. Create a scatter plot of the *luminance*\* of the sets vs their publishing year. What do you observe for the years 1980-1981? How do you interpret what you see?

\*The luminance of a color is a [measure of brightness](https://en.wikipedia.org/wiki/Luminance) which, given its RGB representation, can be computed as follows:

$\text{luminance} = \sqrt{0.299\,R^2 + 0.587\,G^2 + 0.114\,B^2}$
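
The formula can be turned into a small helper. This sketch assumes that colors are stored as six-digit hexadecimal strings (`'RRGGBB'`); the function name `luminance` and that storage format are assumptions, not something stated in the task text.

```python
import math


def luminance(rgb_hex):
    """Luminance of a color given as an 'RRGGBB' hex string (assumed format)."""
    # Split the string into its red, green and blue components (0-255 each).
    r, g, b = (int(rgb_hex[i:i + 2], 16) for i in (0, 2, 4))
    return math.sqrt(0.299 * r**2 + 0.587 * g**2 + 0.114 * b**2)


for name, rgb in [('black', '000000'), ('red', 'FF0000'), ('green', '00FF00')]:
    print(name, round(luminance(rgb), 1))
```

Applied to a column of hex strings with `Series.apply(luminance)`, this gives one value per color. How to combine the colors of a set into a single luminance is left open by the task; a quantity-weighted mean is one option.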

In [55]:
# Write your code here
from pprint import pprint
pprint(set(sets.name))

{' 1 stud Blue Storage Brick',
 ' Scenery and Dagger Trap polybag',
 ' Spectre',
 ' White Spaceman Key Chain',
 "'Where Are My Pants?' Guy",
 '1 stud Red Storage Brick',
 '1 x 1 Bricks',
 '1 x 1 Bricks with Letters (System)',
 '1 x 1 Bricks with Numbers (System)',
 '1 x 1 Round Bricks',
 '1 x 1 and 1 x 2 Plates (cardboard box version)',
 '1 x 1 and 1 x 2 Plates - Black (architectural hobby und modelbau version)',
 '1 x 1 and 1 x 2 Plates - Blue (architectural hobby und modelbau version)',
 '1 x 1 and 1 x 2 Plates - Light Gray (architectural hobby und modelbau '
 'version)',
 '1 x 1 and 1 x 2 Plates - Red (architectural hobby und modelbau version)',
 '1 x 1 and 1 x 2 Plates - Trans-Clear (architectural hobby und modelbau '
 'version)',
 '1 x 1 and 1 x 2 Plates - White (architectural hobby und modelbau version)',
 '1 x 1 and 1 x 2 Plates - Yellow (architectural hobby und modelbau version)',
 '1 x 1 x 1 Window Frame',
 '1 x 1 x 1 Window, Red or White',
 '1 x 1 x 2 Window Frame',
 ...

 'Brick Box',
 'Brick Calendar',
 'Brick Hinges',
 'Brick Pack 100',
 'Brick Pack 300',
 'Brick Separator',
 'Brick Separator Orange',
 'Brick Separator, Gray',
 'Brick Separators',
 'Brick Street Customs',
 'Brick Street Getaway',
 "Brick Tub 'Die Lego Show' - Limited Edition",
 'Brick USB Flash Drive',
 'Brick Vac',
 'Brick Yard',
 'BrickMaster',
 "Brickbeard's Bounty",
 'Bricklayer Oscar Orangutan',
 'Brickley',
 'Brickmaster Kit (with Digital Designer CD)',
 'Brickmaster Star Wars',
 'Brickmaster: Ninjago 2 - Fight The Power Of The Snakes',
 'Bricks',
 'Bricks Assorted, Green',
 'Bricks and Creations Tub',
 'Bricks and Creations Tub (Bottom Tub and its contents only)',
 'Bricks and Creations Tub - (TRU Exclusive)',
 'Bricks and Creations Tub - (TRU Exclusive) (Bottom Tub and its contents '
 'only)',
 'Bricks with Cross (axle) Holes (Pack of 50)',
 'Bricks with Groove and Garage Panels',
 "Bricks'n Motor Set",
 'Bricks, Black',
 'Bricks, Blue',
 'Bricks, Gray',
 'Bricks, Red',
 ...

 'Deep Freeze Defender',
 'Deep Reef Refuge',
 'Deep Sea Bounty',
 'Deep Sea Diver - Complete Set',
 'Deep Sea Exploration Vessel',
 'Deep Sea Helicopter',
 'Deep Sea Operation Base',
 'Deep Sea Predator',
 'Deep Sea Predators',
 'Deep Sea Raider',
 'Deep Sea Scuba Scooter',
 'Deep Sea Starter Set',
 'Deep Sea Striker',
 'Deep Sea Submarine',
 'Deep Sea Treasure Hunter',
 'Defense Archer',
 'Defilak',
 'Dekar',
 'Delivery Center',
 'Delivery Truck',
 'Delivery Truck Set',
 'Delivery Van',
 'Delivery Vehicle',
 'Deluxe Box of Fun',
 'Deluxe Brick Box',
 'Deluxe High Speed Train Collection',
 'Deluxe Hogwarts Kit',
 'Deluxe Motorized Train Set',
 'Deluxe Track Kit',
 'Deluxe Track for RC Trains',
 'Deluxe Train Set',
 'Demolition Driller',
 'Demolition Dummy - Complete Set',
 'Demolition Site',
 'Demolition Starter Set',
 'Demon Destroyer',
 'Denken mit Lego (Thinking with Lego 250pcs)',
 'Denken mit Lego (Thinking with Lego 900pcs)',
 'Derby Trotter',
 'Desert Attack',
 'Desert Biplane'

 'Hulk vs. Red Hulk',
 'Hulk’s Helicarrier Breakout',
 'Humans vs. Robots Battle Machine Collection',
 'Humungousaur',
 'Hun Warrior',
 'Hurricane Harbor',
 'Hurricane Heist',
 'Hybrid Rescue Tank',
 'Hydraxon',
 'Hydro',
 'Hydro Crystalization Station',
 'Hydro Racer',
 'Hydro Racer / Swamp Boat',
 'Hydro Reef Wrecker',
 'Hydro Search Sub',
 'Hydrofoil',
 'Hydrofoil 7 / Powerboat Columbia',
 'Hydroplane Racer',
 'Hyena Droid Bomber',
 'Hyperspeed Pursuit',
 'Hypno Cruiser',
 "I 'L brick' Anaheim Figure Magnet",
 'I Love LEGOLAND Magnet [Female]',
 'I Love LEGOLAND Magnet [Male]',
 'I Love Malaysia Magnet [Male]',
 'I Love Paris Magnet [Male]',
 'I Love Tokyo Magnet [Male]',
 'IR Speed Remote Control',
 'Ice Bear Mech',
 'Ice Bear Tribe Pack',
 'Ice Blade',
 'Ice Brick Tray - Yellow',
 'Ice Cannon',
 'Ice Cream Cart',
 'Ice Cream Machine',
 'Ice Cream Seller',
 'Ice Cream Stand',
 'Ice Cream Truck',
 'Ice Cream with Scooter',
 'Ice Crossbow',
 'Ice Cube Tray',
 'Ice Dragon Attack',
 ...

 'Medium Bucket',
 'Medium Bulk Bucket',
 'Medium Creative Brick Box',
 'Medium Gray Storage Bin',
 'Medium House Set',
 'Medium Ship Set',
 'Medium set of Soft Bricks',
 'Medusa - Complete Set',
 'Mega Core Magnetizer',
 'Mega Tack',
 'Melon - Hong Kong Lego Show Promotional',
 'Melon - Suntory Promotional',
 'Meltdown',
 'Melting Room',
 'Meltus',
 'Mercedes Benz Arocs 3245',
 'Mercedes-AMG GT3',
 'Merida’s Highland Games',
 "Merlok's Library 2.0",
 'Mermaid - Complete Set',
 'Mermaid Key Chain',
 'Merry-Go-Round',
 'Mesmo',
 'Message Decoder',
 'Message Intercept Base',
 'Mesut Özil (8)',
 "Metal Beard's Duel",
 'MetalBeard’s Sea Cow',
 'Meteor Monitor',
 'Meteor Strike',
 'Metro PD Station',
 'Metro Park & Service Tower',
 'Metro Station',
 'Metroliner',
 'Metroliner Kit',
 'Metru Nui Matoran Kit',
 'Metus',
 'Mezmo',
 "Mia's Beach Scooter",
 "Mia's Bedroom",
 "Mia's Farm Suitcase",
 "Mia's Lemonade Stand",
 "Mia's Magic Tricks",
 "Mia's Roadster",
 "Mia's Vet Clinic",
 ...

 'Red Planet Protector',
 'Red Plates 1 x n',
 'Red Plates 2 x n',
 'Red Plates Parts Pack',
 'Red Player & Goal',
 'Red Racer',
 'Red Racer Polybag',
 'Red Recon Flyer',
 'Red Ridge Tiles',
 'Red Roof Bricks Parts Pack, 33 Degree',
 'Red Roof Bricks Parts Pack, 45 Degree',
 'Red Roof Bricks, Shallow Pitch',
 'Red Roof Bricks, Steep Pitch',
 'Red Roof Tiles',
 'Red Rotors',
 'Red Rover Tires and Hubs (4 tires 4 hubs)',
 'Red Storage Brick',
 'Red Thunder',
 'Red/Black Plates',
 'Red/Blue Beams',
 'Red/Blue Bricks',
 'Red’s Water Rescue',
 'Refrigerated Car with Forklift',
 'Refrigerator Truck and Trailer',
 'Refuse Collection Truck',
 'Refuse Truck',
 'Regular & Transparent Bricks Bucket',
 'Rehearsal Stage',
 'Reidak',
 'Reindeer',
 'Reindeer (Legoland California)',
 'Relay Runner - Team GB Complete Set with Stand and Accessories',
 'Remote Control',
 'Remote Control for Crossing',
 'Remote Control for Electric Points',
 'Remote Control for Signal',
 'Remote Control for Switch',
 ...

 'TV Crew',
 "Ta-Metru Collector's Pack",
 'Table and Chairs',
 'Tacho Wheels (Pack of 20)',
 'Taco Tuesday Man',
 'Tactical Patrol Truck',
 'Tactical Tennis Player - Team GB Complete Set with Stand and Accessories',
 'Tahnok',
 'Tahnok Va',
 'Tahnok Va (Kabaya Promotional)',
 'Tahnok-Kal',
 'Tahu',
 'Tahu - Master of Fire',
 'Tahu - With mini CD-ROM',
 'Tahu Mask - New York Comic-Con 2014 Exclusive',
 'Tahu Nuva',
 'Tahu Uniter of Fire',
 'Taj Mahal',
 'Takadox',
 'Takanuva',
 'Takeshi Walker 1',
 'Takeshi Walker 2',
 'Takua and Pewku',
 'Takutanuva',
 'Takutanuva Kit',
 'Tall Classic Windows/Door (with Glass)',
 'Tank Truck',
 'Tanker',
 'Tanker Truck',
 'Tanker Truck Takedown',
 'Tanker Waggon (Shell)',
 'Tanma',
 'Tantive IV',
 'Tantive IV & Planet Alderaan',
 'Tapsy',
 'Tarakava',
 'Tarduk',
 'Target Lego Gift Card 2011 3 in 1 Set',
 'Target Practice',
 'Tarix',
 'Tatooine Mini-build - Star Wars Celebration Exclusive',
 'Taxi',
 'Taxi Station',
 "Tea Garden Cafe with Baker's Van",
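
Scanning the names suggests one possible heuristic for the first query: match the whole word "car" or "cars", so that names such as 'Ice Cream Cart' are not picked up. The sketch below runs on toy data; the column names `set_num` and `id`, the set numbers and the name 'Police Cars' are assumptions for illustration.

```python
import pandas as pd

# Toy data; two of the names are taken from the listing above.
sets_demo = pd.DataFrame({
    'set_num': ['s1', 's2', 's3', 's4'],
    'name': ['Refrigerated Car with Forklift', 'Ice Cream Cart',
             'Police Cars', 'Taxi'],
})
inventories_demo = pd.DataFrame({'id': [10, 11, 12, 13],
                                 'set_num': ['s1', 's2', 's3', 's3']})

# \b word boundaries keep 'Cart' or 'Carousel' from matching.
car_pattern = r'\bcars?\b'
is_car = sets_demo['name'].str.contains(car_pattern, case=False, regex=True)
car_sets = sets_demo.loc[is_car, 'set_num']

# Keep the inventories whose set is one of the matched sets.
car_inventory_ids = inventories_demo.loc[
    inventories_demo['set_num'].isin(car_sets), 'id'].tolist()
print(car_inventory_ids)
```

A name-based rule is necessarily approximate: it accepts railway cars such as 'Refrigerated Car with Forklift' and misses sets like 'Taxi' or "Mia's Roadster", so the pattern may need extra keywords or exclusions.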

__\* Briefly explain your approach for every query here \*__
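
For the dominant-color query, one reading is "the color with the largest total part quantity in the set". The sketch below follows that reading on toy data; the definition itself, the column names and the color values are assumptions.

```python
import pandas as pd

# Toy tables; column names are assumptions about the schema.
inventories_demo = pd.DataFrame({'id': [1, 2], 'set_num': ['s1', 's2']})
inventory_parts_demo = pd.DataFrame({
    'inventory_id': [1, 1, 1, 2, 2],
    'color_id': [0, 4, 4, 0, 4],
    'quantity': [5, 2, 4, 3, 1],
})
colors_demo = pd.DataFrame({'id': [0, 4], 'name': ['Black', 'Red'],
                            'rgb': ['05131D', 'C91A09']})

# Total quantity of each color within each set.
totals = (inventory_parts_demo
          .merge(inventories_demo, left_on='inventory_id', right_on='id')
          .groupby(['set_num', 'color_id'], as_index=False)['quantity'].sum())

# Keep, per set, the row with the largest total (ties are broken arbitrarily).
dominant = (totals.sort_values('quantity', ascending=False)
            .drop_duplicates('set_num')[['set_num', 'color_id']])

# Number of sets per dominant color, joined with the color description.
sets_per_color = (dominant.groupby('color_id').size().rename('n_sets')
                  .reset_index()
                  .merge(colors_demo, left_on='color_id', right_on='id'))
print(sets_per_color[['name', 'rgb', 'n_sets']])
```

For the chart, the `rgb` column prefixed with `'#'` should be usable as the list of bar colors in a horizontal bar plot.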

## Task B. Drop the Bike

*Los Angeles Metro* has been sharing publicly [anonymized *Metro Bike Share* trip data](https://bikeshare.metro.net/about/data/) under the [Open Database License (ODbL)](http://opendatacommons.org/licenses/odbl/1.0/).

In this task you will again perform data wrangling and interpretation.

### B1. Loading phase
Load the json file into a `DataFrame`.
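
A minimal sketch of the loading step, assuming the file is a JSON array of records. The file name `trips_demo.json`, the field names and the values are hypothetical and only serve to make the example self-contained.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical records written to a temporary file for the demo.
records = [
    {'Trip ID': 1, 'Duration': 180, 'Start Time': '2016-07-07T04:17:00'},
    {'Trip ID': 2, 'Duration': 1980, 'Start Time': '2016-07-07T06:00:00'},
]
demo_path = Path(tempfile.mkdtemp()) / 'trips_demo.json'
demo_path.write_text(json.dumps(records))

trips = pd.read_json(demo_path)
# Timestamps are parsed explicitly rather than relying on automatic detection.
trips['Start Time'] = pd.to_datetime(trips['Start Time'])
print(trips.dtypes)
```

If the real file uses a different layout (for example one object per column), the `orient` parameter of `pd.read_json` would need to be set accordingly.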


In [None]:
BIKES_DATA_FOLDER = DATA_FOLDER + '/bikes'

In [None]:
# Write your code here

### B2. Cleaning phase
Describe the type and the value range of each attribute. Indicate and transform the attributes that are `Categorical`. Are there redundant columns in the dataset (i.e., are there columns whose value depends only on the value of another column)? What are the possible pitfalls of having such columns? Reduce *data redundancy* by extracting such columns to separate `DataFrames`. Which of the two formats (the initial one or the one with reduced data redundancy) is more susceptible to inconsistencies? At the end, print for each `DataFrame` the *type of each column* and its *shape*.
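
The redundancy check and the extraction step can be sketched as follows. The column names, the values and the helper `depends_only_on` are hypothetical; the idea is that a column is redundant with respect to a key if each key value maps to at most one distinct value of that column.

```python
import pandas as pd

# Toy trips table with a station latitude repeated on every trip.
trips_demo = pd.DataFrame({
    'trip_id': [1, 2, 3, 4],
    'start_station_id': [3014, 3014, 3016, 3016],
    'start_lat': [34.05, 34.05, 34.04, 34.04],
    'passholder_type': ['Monthly Pass', 'Walk-up', 'Monthly Pass', 'Monthly Pass'],
})

# Attributes with a small fixed set of values are stored as categoricals.
trips_demo['passholder_type'] = trips_demo['passholder_type'].astype('category')


def depends_only_on(df, key, column):
    """True if every value of `key` maps to at most one distinct value of `column`."""
    return bool((df.groupby(key)[column].nunique(dropna=False) <= 1).all())


if depends_only_on(trips_demo, 'start_station_id', 'start_lat'):
    # Move the dependent column into its own table, keyed by the station id.
    stations = (trips_demo[['start_station_id', 'start_lat']]
                .drop_duplicates().set_index('start_station_id'))
    trips_demo = trips_demo.drop(columns='start_lat')

for name, frame in {'trips': trips_demo, 'stations': stations}.items():
    print(name, frame.shape)
    print(frame.dtypes)
```

After the split, a station's latitude is stored once, so it can no longer disagree between two trips from the same station.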

In [None]:
# Write your code here

__\* Briefly explain your approach here \*__

### B3. Querying phase
Answer the following queries using the functionality of `Pandas`.

1. Plot the *distribution* of the number of outgoing trips from each station in a histogram with 20 bins (Hint: each bin describes a range of counts, not stations).
2. Plot histograms for the *duration* and *trip starting hour in the day* attributes. For both the *duration* and the *trip starting hour* use *discrete 1-hour intervals*. What do you observe in each plot? What are some popular values in the *duration* plot? Explain the local maxima and the trends you observe on the *trip starting hour* plot based on human behavior.
3. For each *trip route category*, calculate the proportion of trips by *passholder type* and present your results in *a stacked bar chart with normalized height*.
4. Considering only trips that begin in the morning hours (before noon), plot in *a single bar chart* the proportion of trips by *passholder type* and *trip route category*. Explain any outliers you observe.
5. Separate the hours of the day into two intervals that have (approximately) the same number of bikes leaving the stations. For each of the two intervals calculate the proportion of trips by *passholder type* and *trip route category*. Present your results in a `DataFrame` which has a unique, non-composite index. Does the proportion of trips depend on whether it is the first or second hour interval? Would the company have any significant benefit by creating a more complex pricing scheme where monthly pass users would pay less in the first interval and (equally) more in the second one? Assume that the number of trips per interval will not change if the scheme changes.
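
Two of these queries rest on the same building blocks: row-normalized proportions and a split of the day into two halves with equal trip counts. The sketch below shows both on toy data; the column names and values are hypothetical, and taking the first hour at which the cumulative count reaches half of all trips is only one way to choose the split.

```python
import pandas as pd

# Toy trips; column names and values are assumptions.
trips_demo = pd.DataFrame({
    'start_hour': [7, 8, 8, 9, 12, 17, 17, 18, 18, 22],
    'passholder_type': ['Monthly Pass', 'Monthly Pass', 'Walk-up', 'Flex Pass',
                        'Walk-up', 'Monthly Pass', 'Walk-up', 'Monthly Pass',
                        'Walk-up', 'Walk-up'],
    'route_category': ['One Way', 'One Way', 'Round Trip', 'One Way',
                       'Round Trip', 'One Way', 'One Way', 'One Way',
                       'Round Trip', 'One Way'],
})

# Each row sums to 1, which is what a normalized stacked bar chart needs.
proportions = pd.crosstab(trips_demo['route_category'],
                          trips_demo['passholder_type'], normalize='index')

# Cumulative number of trips started at or before each hour.
cumulative = trips_demo['start_hour'].value_counts().sort_index().cumsum()
# First hour at which at least half of all trips have started.
split_hour = int(cumulative[cumulative * 2 >= len(trips_demo)].index[0])
trips_demo['interval'] = (trips_demo['start_hour'] > split_hour).map(
    {False: 'first', True: 'second'})

print(proportions)
print(split_hour, trips_demo['interval'].value_counts().to_dict())
```

Calling `proportions.plot.bar(stacked=True)` would draw the normalized stacked chart, and grouping by `interval` before computing the same proportions gives the comparison between the two halves of the day.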

In [None]:
# Write your code here

__\* Briefly explain your approach for every query here \*__