# SEN163A – Assignment 2 - Large-scale Internet Data Analysis
**DEADLINE**: Friday 18 February 2022 before 18:00

## Group 14
- Antonio Sanchez Martin - ''5245834''
- Felix Unger - ''5673631''
- Jeroen van Paassen - ''4720970''
- Yunus Emre Torlak - ''5597668''

## Remarks from the lab
- The BGP routing might be very important! Lecturer hinted at this.
- Need to learn about JSON objects.
- Use the wayback machine for getting older info.
- Look at `pickle` in python for storing the data.
  - Not human-readable
  - For the `.pkl` files.
- Question 1: 
  - I.e. is the data enough?
  - You don't have to answer the questions sequentially
- Use the library `time`
  - Remember to repeat measurements to get an accurate time (as every time the code is run, it can give different results)
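
The advice on repeating measurements can be sketched as follows. This is a minimal sketch rather than a prescribed method: `time_repeated` and `work` are hypothetical names introduced for illustration, and the sketch uses `time.perf_counter`, the high-resolution clock in the `time` library.

```python
import statistics
import time

def time_repeated(func, repeats=5):
    """Run func() `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations

def work():
    # Hypothetical stand-in for the code being measured
    return sum(i * i for i in range(100_000))

durations = time_repeated(work, repeats=5)
print(f"mean {statistics.mean(durations):.4f} s, min {min(durations):.4f} s")
```

Reporting both the mean and the minimum shows how much the runs vary; the minimum is usually the least affected by other processes on the machine.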

## Questions
1. Evaluate if there are *limitations* in the provided **datasets** (AS and probe data set). If you find limitations, describe these and conjecture possible reasons, supported with data.
   - No need to do it at the beginning! 
2. With the *AS and probe data set*, **find the number *m* of ASes that can be used for hosting** in the EU and have probes in the *RIPE data set*. Sort the **ASNs in ascending order** and include the **first three and last three** in your report (number, name and country).
3. For a **single hour** in the *RIPE data set*: find all valid entries where the probe has hosting *type AS* and the *target IPv4 is from an EU country*. Implement this in an efficient way.
4. Move from using only an hour to the **full day**. It is advisable to store the raw results of each file. Then, using all processed files, *calculate the average latencies for each country-AS* combination and store the results in one $$n_{countries}\times m$$ matrix. If we could place one server in each country, what would the *minimum average latency be for each country*? (include in your report)
5. Since we are only allowed to place four servers, determine the **best four datacenters** based on the total latency for all countries. Report your findings and your procedure to obtain them. Also include the *average latency for each country*.
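
Questions 4 and 5 can be sketched on toy data as follows. This is one possible approach, not the required one. The column names `country`, `asn` and `latency`, the helper `best_k`, and all numbers are assumptions made for illustration. The sketch also assumes that "total latency" means each country is served by whichever selected AS gives it the lowest average latency.

```python
from itertools import combinations

import pandas as pd

# Made-up per-measurement records: one row per (country, AS) latency sample in ms
records = pd.DataFrame({
    "country": ["NL", "NL", "NL", "NL", "DE", "DE", "DE", "ES", "ES", "ES"],
    "asn":     ["AS1", "AS1", "AS2", "AS3", "AS1", "AS2", "AS3", "AS1", "AS2", "AS3"],
    "latency": [8, 12, 50, 80, 40, 15, 45, 70, 60, 20],
})

# Question 4: countries x ASes matrix of average latencies
latency = records.pivot_table(
    index="country", columns="asn", values="latency", aggfunc="mean"
)

# Best achievable average latency per country if every AS could be used
min_per_country = latency.min(axis=1)

def best_k(latency, k):
    """Brute-force search for the k columns with the lowest total latency."""
    best_cols, best_total = None, float("inf")
    for cols in combinations(latency.columns, k):
        # Each country uses the closest of the selected ASes
        total = latency[list(cols)].min(axis=1).sum()
        if total < best_total:
            best_cols, best_total = cols, total
    return best_cols, best_total

# k=2 here because the toy matrix has only three columns; the assignment uses k=4
best_cols, best_total = best_k(latency, k=2)
print(best_cols, best_total)
```

The brute-force search evaluates every combination, so its cost grows quickly with *m*. Missing country-AS pairs appear as `NaN` in the matrix and are skipped by `min` by default.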

## Dataset description and loading

### Probe dataset
- It only has two columns:
  - `prb_id` -> Used to check if the probe is also in the RIPE dataset.
  - `ASN` -> Autonomous System Number (i.e. the ID of the network the probe is connected to)
- Why? -> To join the RIPE and AS datasets


In [1]:
import pickle

with open("./datasets/probe_dataset.pkl", 'rb') as file:
    probe_df = pickle.load(file)

print(probe_df.shape)
probe_df.head()

(11008, 2)


| | prb_id | ASN |
|---|---|---|
| 0 | 1 | AS3265 |
| 1 | 2 | AS1136 |
| 2 | 3 | AS3265 |
| 3 | 6 | AS6830 |
| 4 | 8 | AS3265 |


### AS Dataset
- 5 columns:
  - `ASN`
  - Country code
  - Network name
  - Total number of IPs in network
  - Type of network
- Why? -> Can give us the number of IPs and location

In [3]:
import pickle

with open("./datasets/AS_dataset.pkl", 'rb') as file:
    AS_df = pickle.load(file)

print(AS_df.shape)
AS_df.head()

(60122, 5)


| | ASN | Country | Name | NumIPs | type |
|---|---|---|---|---|---|
| 0 | AS55330 | AF | AFGHANTELECOM GOVERNMENT COMMUNICATION NETWORK | 50432 | hosting |
| 1 | AS17411 | AF | Io Global Services Pvt. Limited | 13568 | business |
| 2 | AS55424 | AF | Instatelecom Limited | 13312 | business |
| 3 | AS38742 | AF | AWCC | 11520 | isp |
| 4 | AS131284 | AF | Etisalat Afghan | 10240 | isp |


In [4]:
# Joining the Probe and AS datasets based on the ASN
AS_probe_df = AS_df.set_index('ASN').join(probe_df.set_index('ASN'))
AS_probe_df.reset_index().dropna().head()

| | ASN | Country | Name | NumIPs | type | prb_id |
|---|---|---|---|---|---|---|
| 10 | AS10010 | JP | TOKAI Communications Corporation | 1430016 | business | 33002 |
| 11 | AS10010 | JP | TOKAI Communications Corporation | 1430016 | business | 33022 |
| 12 | AS10010 | JP | TOKAI Communications Corporation | 1430016 | business | 53282 |
| 82 | AS10075 | BD | Fiber@Home Global Limited | 3328 | business | 10977 |
| 106 | AS10098 | HK | Towngas Telecommunications Fixed Network Ltdet... | 19456 | isp | 21744 |
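
Building on the joined data, Question 2 can be sketched as a filter followed by a numeric sort. This is a sketch on made-up data, not the final answer. `AS_toy`, `probe_toy` and `asn_number` are hypothetical names, the ASNs and network names are invented, and `EU_COUNTRIES` assumes the 27 EU member states with ISO 3166 two-letter codes. Note that sorting the `ASN` strings directly is lexicographic (`AS100` before `AS20`), so the sketch sorts on the numeric part.

```python
import pandas as pd

# Assumption: EU-27 membership, ISO 3166-1 alpha-2 codes
EU_COUNTRIES = {
    "AT", "BE", "BG", "HR", "CY", "CZ", "DK", "EE", "FI", "FR", "DE", "GR", "HU", "IE",
    "IT", "LV", "LT", "LU", "MT", "NL", "PL", "PT", "RO", "SK", "SI", "ES", "SE",
}

# Made-up stand-ins that reuse the column names of AS_df and probe_df
AS_toy = pd.DataFrame({
    "ASN": ["AS100", "AS20", "AS3", "AS7", "AS55"],
    "Country": ["NL", "DE", "JP", "FR", "ES"],
    "Name": ["Host A", "Host B", "Host C", "ISP D", "Host E"],
    "type": ["hosting", "hosting", "hosting", "isp", "hosting"],
})
probe_toy = pd.DataFrame({
    "prb_id": [1, 2, 3, 4, 5],
    "ASN": ["AS100", "AS100", "AS20", "AS3", "AS7"],
})

# Keep hosting ASes located in the EU that have at least one probe
mask = (
    (AS_toy["type"] == "hosting")
    & AS_toy["Country"].isin(EU_COUNTRIES)
    & AS_toy["ASN"].isin(probe_toy["ASN"])
)
hosting_eu = AS_toy[mask].copy()

# Sort on the numeric part of the ASN, not on the string
hosting_eu["asn_number"] = hosting_eu["ASN"].str[2:].astype(int)
hosting_eu = hosting_eu.sort_values("asn_number")

m = len(hosting_eu)
print(m)
print(hosting_eu[["ASN", "Name", "Country"]])
```

This checks probes against the probe dataset only; restricting further to probes that actually appear in the RIPE files is a separate step.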


### RIPE

In [None]:
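import time
import bz2

# Open the .bz2 file directly in text mode
bz2Filename = './datasets/ping-2020-02-20T0000.bz2'
n_lines = 100_000  # number of lines to read for the timing estimate

# Read the first n_lines lines to estimate the total loading time
count = 0
st = time.time()
with bz2.open(bz2Filename, 'rt') as bz2File:  # the file is closed even if an error occurs
    for line in bz2File:
        count += 1
        if count >= n_lines:
            break
elapsed = time.time() - st
print(f"Read {count} lines in {elapsed:.2f} s")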

FileNotFoundError: [Errno 2] No such file or directory: 'ping-2020-02-20T0000.bz2'
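
Once the file opens, each line is expected to hold one JSON object. The sketch below shows how a single line could be parsed and checked. It rests on assumptions that should be verified against the real data: the field names `prb_id`, `dst_addr`, `af` and `avg`, the rule that a non-positive `avg` marks a failed measurement, the hypothetical helper `parse_ping`, and the hand-written `sample_line`.

```python
import json

# Hand-written sample line; real records are assumed to carry many more fields
sample_line = '{"prb_id": 6, "dst_addr": "1.0.1.5", "af": 4, "avg": 23.4}'

def parse_ping(line):
    """Return (prb_id, dst_addr, avg) for a usable IPv4 result, else None."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None  # skip malformed lines
    prb_id = record.get("prb_id")
    dst_addr = record.get("dst_addr")
    avg = record.get("avg")
    if prb_id is None or dst_addr is None or avg is None:
        return None  # a required field is missing
    if record.get("af") != 4 or avg <= 0:
        return None  # not IPv4, or no successful reply
    return prb_id, dst_addr, avg

print(parse_ping(sample_line))
```

Returning only the three needed fields keeps the per-file results small, which helps when the raw results of each file are stored with `pickle`.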

### IPv4 dataset
The IP addresses are in integer format.

In [None]:
import pandas as pd
ipv4_df = pd.read_csv("IP2LOCATION-LITE-DB1.CSV", names=["ip_from", "ip_to", "country_code", "country_name"])
ipv4_df.drop(index=ipv4_df.index[0], axis=0, inplace=True) # Drop the first line as it is not data
ipv4_df.head()

| | ip_from | ip_to | country_code | country_name |
|---|---|---|---|---|
| 1 | 16777216 | 16777471 | US | United States of America |
| 2 | 16777472 | 16778239 | CN | China |
| 3 | 16778240 | 16779263 | AU | Australia |
| 4 | 16779264 | 16781311 | CN | China |
| 5 | 16781312 | 16785407 | JP | Japan |


In [None]:
# Convert the integer IPs to IPv4 address objects (dotted notation)
import ipaddress
for ip in ('ip_to', 'ip_from'):
    ipv4_df[ip] = ipv4_df[ip].apply(ipaddress.ip_address)
ipv4_df.head()

| | ip_from | ip_to | country_code | country_name |
|---|---|---|---|---|
| 1 | 1.0.0.0 | 1.0.0.255 | US | United States of America |
| 2 | 1.0.1.0 | 1.0.3.255 | CN | China |
| 3 | 1.0.4.0 | 1.0.7.255 | AU | Australia |
| 4 | 1.0.8.0 | 1.0.15.255 | CN | China |
| 5 | 1.0.16.0 | 1.0.31.255 | JP | Japan |
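
For Question 3, each target IPv4 address has to be mapped to a country efficiently. One option is a binary search over the sorted range starts, sketched below. This assumes the ranges are sorted and non-overlapping; `ranges` and `country_of` are hypothetical names, and the five rows are the ones shown in the output above. The lookup works on the integer form, so a copy of the integer columns would need to be kept alongside the converted ones.

```python
import ipaddress

import numpy as np
import pandas as pd

# First five ranges from the IPv4 dataset, in integer format
ranges = pd.DataFrame({
    "ip_from": [16777216, 16777472, 16778240, 16779264, 16781312],
    "ip_to": [16777471, 16778239, 16779263, 16781311, 16785407],
    "country_code": ["US", "CN", "AU", "CN", "JP"],
})

starts = ranges["ip_from"].to_numpy()
ends = ranges["ip_to"].to_numpy()
codes = ranges["country_code"].to_numpy()

def country_of(ip_string):
    """Return the country code of an IPv4 address, or None if no range contains it."""
    ip_int = int(ipaddress.ip_address(ip_string))
    # Index of the last range whose start is <= ip_int
    pos = np.searchsorted(starts, ip_int, side="right") - 1
    if pos < 0 or ip_int > ends[pos]:
        return None
    return str(codes[pos])

print(country_of("1.0.1.5"))
```

Each lookup costs a logarithmic number of comparisons instead of a scan over the whole table, which matters when the same lookup is repeated for every line of a RIPE file.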
