# Notebook for getting a nice table for graphing data

This notebook takes in the dataset and returns a table of all unique pairs of people calling each other, together with the relevant data needed to build a graph-like object visualized over a map of Canada.
That is, it filters the CDR table for unique a_saddr and b_saddr pairs (unique RTP pairings). It then feeds this table through GeoLite2 (and ipaddress) to obtain human-readable information such as ASN details and latitude/longitude.

It also creates smaller tables centered on Vancouver, Toronto, and Montreal (where CloudPBX has servers), in case smaller graph-like objects and maps are wanted.
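
The `a_saddr` and `b_saddr` columns store IPv4 addresses as integers, which is why `ipaddress` is involved before any GeoLite2 lookup. A minimal sketch of that conversion, using two address values that appear in the `df_geo_unique.head()` output of this notebook (the helper name `int_to_dotted` is hypothetical and not part of the notebook):

```python
import ipaddress

def int_to_dotted(value):
    """Convert an integer-encoded IP address to its dotted string form."""
    # int() normalizes NumPy integer scalars to a plain Python int.
    return str(ipaddress.ip_address(int(value)))

print(int_to_dotted(3227975250))  # 192.102.254.82
print(int_to_dotted(41732323))    # 2.124.200.227
```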

In [1]:
import numpy as np
import pandas as pd
import geoip2.database
import ipaddress
import os

In [2]:
#DATA_ROOT = 'data/workshop-content18/5-cloudpbx/data/cloudpbx_sample_data_10k/'
CSV_FILE_PATH = os.path.join('pims_cloudpbx_subset_201806051550_1million.csv')
#CSV_FILE_PATH = os.path.join('locn-filtered.csv')
GEOLITE_ASN_PATH = os.path.join('GeoLite2-ASN.mmdb')
GEOLITE_CITY_PATH = os.path.join('GeoLite2-City.mmdb')

In [3]:
# read dataframe
df = pd.read_csv(CSV_FILE_PATH,delim_whitespace=True)



In [4]:
DESCRIBED_COLUMNS = ["a_saddr", "b_saddr"]

In [5]:
df_geo = df[DESCRIBED_COLUMNS]

In [6]:
df_geo_unique = df_geo.drop_duplicates()
df_geo_unique.shape

(2739, 2)

## Link IP addresses to ASN

In [7]:
# initiate geoip client
readerASN = geoip2.database.Reader(GEOLITE_ASN_PATH)
readerCITY = geoip2.database.Reader(GEOLITE_CITY_PATH)

In [8]:
# functions to get AS info; a failed lookup returns an error-message string instead
def getASobject(x):
    # x may be an integer or a string; ipaddress normalizes both
    ip = ipaddress.ip_address(x)
    try:
        return readerASN.asn(str(ip))
    except Exception:
        return "The address {} is not in the database.".format(ip)

def getIP(x):
    # error strings from getASobject pass through unchanged
    if isinstance(x, str):
        return x
    return x.ip_address

def getASN(x):
    if isinstance(x, str):
        return x
    return x.autonomous_system_number

def getASorg(x):
    if isinstance(x, str):
        return x
    return x.autonomous_system_organization

def getLat(x):
    try:
        return readerCITY.city(str(x)).location.latitude
    except Exception:
        return "The address {} is not in the database.".format(str(x))

def getLong(x):
    try:
        return readerCITY.city(str(x)).location.longitude
    except Exception:
        return "The address {} is not in the database.".format(str(x))
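
Because the helpers in this cell return an error string when an address is missing, the latitude and longitude columns can end up mixing floats and strings. One possible alternative, sketched here under assumptions, returns `None` so the columns stay numeric. `FakeCityReader` and `get_lat_long` are hypothetical names; the fake reader only imitates the `.city(ip).location` shape so the sketch runs without the GeoLite2 database. With the real reader, the exception to catch would be `geoip2.errors.AddressNotFoundError`.

```python
from types import SimpleNamespace

class FakeCityReader:
    """Hypothetical stand-in for geoip2.database.Reader, for illustration only."""
    def __init__(self, table):
        self._table = table  # maps IP string -> (latitude, longitude)

    def city(self, ip):
        if ip not in self._table:
            raise KeyError(ip)  # the real reader raises AddressNotFoundError
        lat, lon = self._table[ip]
        return SimpleNamespace(location=SimpleNamespace(latitude=lat, longitude=lon))

def get_lat_long(reader, ip):
    """Return (latitude, longitude), or (None, None) if the lookup fails."""
    try:
        location = reader.city(str(ip)).location
    except Exception:
        return (None, None)
    return (location.latitude, location.longitude)

reader = FakeCityReader({"192.102.254.82": (43.6319, -79.3716)})
print(get_lat_long(reader, "192.102.254.82"))
print(get_lat_long(reader, "10.0.0.1"))
```

Returning a pair also means one lookup per address instead of separate calls for latitude and longitude.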

In [9]:
####
#
# Decided not to include SIP
#
# making a vector of AS objects for sipcaller
#V = df['sipcallerip'].apply(getASobject)
# adding columns to the data frame
#df['sipcallerasn'] = V.apply(getASN)
#df['sipcallerasorg'] = V.apply(getASorg)

In [10]:
####
#
# Decided not to include SIP
#
# making a vector of AS objects for sipcalled
#V = df['sipcalledip'].apply(getASobject)
# adding columns to the data frame
#df['sipcalledasn'] = V.apply(getASN)
#df['sipcalledasorg'] = V.apply(getASorg)

In [11]:
# making a vector of AS objects for a_saddr
V = df_geo_unique['a_saddr'].apply(getASobject)
# adding columns to the data frame
df_geo_unique['a_saddr_asn'] = V.apply(getASN)
df_geo_unique['a_saddr_asorg'] = V.apply(getASorg)
df_geo_unique['a_saddr_as_ip'] = V.apply(getIP)
df_geo_unique['a_saddr_lat'] = df_geo_unique['a_saddr_as_ip'].apply(getLat)
df_geo_unique['a_saddr_long'] = df_geo_unique['a_saddr_as_ip'].apply(getLong)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [12]:
# making a vector of AS objects for b_saddr
V = df_geo_unique['b_saddr'].apply(getASobject)
# adding columns to the data frame
df_geo_unique['b_saddr_asn'] = V.apply(getASN)
df_geo_unique['b_saddr_asorg'] = V.apply(getASorg)
df_geo_unique['b_saddr_as_ip'] = V.apply(getIP)
df_geo_unique['b_saddr_lat'] = df_geo_unique['b_saddr_as_ip'].apply(getLat)
df_geo_unique['b_saddr_long'] = df_geo_unique['b_saddr_as_ip'].apply(getLong)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
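
The warning above appears because `df_geo_unique` is derived from a column selection of `df`, so pandas cannot tell whether the new columns are meant to land in the original frame. A common remedy is to take an explicit copy before adding columns. The following is a minimal sketch on toy data, not the notebook's actual run; the `other` column is a hypothetical placeholder for the remaining CDR fields.

```python
import ipaddress
import pandas as pd

df = pd.DataFrame({
    "a_saddr": [3227975250, 3227975250, 3227975251],
    "b_saddr": [41732323, 41732323, 41733976],
    "other": [1, 2, 3],  # hypothetical placeholder for the other CDR fields
})

# .copy() makes the result an independent DataFrame, so adding columns
# to it does not trigger SettingWithCopyWarning.
pairs = df[["a_saddr", "b_saddr"]].drop_duplicates().copy()
pairs["a_saddr_as_ip"] = pairs["a_saddr"].apply(
    lambda v: str(ipaddress.ip_address(int(v)))
)

print(pairs.shape)
```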

## df_geo_unique holds the table

In [13]:
df_geo_unique.head()

Unnamed: 0,Unnamed: 1,a_saddr,b_saddr,a_saddr_asn,a_saddr_asorg,a_saddr_as_ip,a_saddr_lat,a_saddr_long,b_saddr_asn,b_saddr_asorg,b_saddr_as_ip,b_saddr_lat,b_saddr_long
71483861,2018-03-15,3227975250,41732323,395152,CloudPBX,192.102.254.82,43.6319,-79.3716,5607,Sky UK Limited,2.124.200.227,53.2738,-2.6087
70783533,2018-03-09,3227975251,41733976,395152,CloudPBX,192.102.254.83,43.6319,-79.3716,5607,Sky UK Limited,2.124.207.88,53.2738,-2.6087
76384583,2018-05-03,1654599250,69076145,395766,CloudPBX,98.159.46.82,40.7432,-75.2242,3356,"Level 3 Parent, LLC",4.30.4.177,32.8072,-117.165
65221709,2018-01-11,3227975251,70715842,395152,CloudPBX,192.102.254.83,43.6319,-79.3716,3356,"Level 3 Parent, LLC",4.55.9.194,37.751,-97.822
65444938,2018-01-15,3227975250,70715842,395152,CloudPBX,192.102.254.82,43.6319,-79.3716,3356,"Level 3 Parent, LLC",4.55.9.194,37.751,-97.822


In [14]:
df_geo_unique.shape

(2739, 12)

## df_geo_unique_asn holds the table filtered for unique ASN pairs (keeping each distinct latitude/longitude when an ASN maps to several locations)

In [15]:
df_geo_unique_asn = df_geo_unique[['a_saddr_asn','a_saddr_asorg','a_saddr_lat','a_saddr_long',
                                   'b_saddr_asn','b_saddr_asorg','b_saddr_lat','b_saddr_long']].drop_duplicates()

In [16]:
df_geo_unique_asn.head()

Unnamed: 0,Unnamed: 1,a_saddr_asn,a_saddr_asorg,a_saddr_lat,a_saddr_long,b_saddr_asn,b_saddr_asorg,b_saddr_lat,b_saddr_long
71483861,2018-03-15,395152,CloudPBX,43.6319,-79.3716,5607,Sky UK Limited,53.2738,-2.6087
76384583,2018-05-03,395766,CloudPBX,40.7432,-75.2242,3356,"Level 3 Parent, LLC",32.8072,-117.165
65221709,2018-01-11,395152,CloudPBX,43.6319,-79.3716,3356,"Level 3 Parent, LLC",37.751,-97.822
77648504,2018-05-15,393755,CloudPBX,43.6319,-79.3716,3356,"Level 3 Parent, LLC",37.751,-97.822
77851961,2018-05-17,395766,CloudPBX,40.7432,-75.2242,3356,"Level 3 Parent, LLC",37.751,-97.822


In [17]:
df_geo_unique_asn.shape

(499, 8)
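
The drop from 2739 to 499 rows comes from discarding the per-IP columns before calling `drop_duplicates()`. As a small illustration on toy data built from the first two rows of `df_geo_unique.head()` (only a subset of the columns is reproduced here), two distinct IP pairs collapse into one row once only the ASN and location columns remain:

```python
import pandas as pd

ip_level = pd.DataFrame({
    "a_saddr": [3227975250, 3227975251],
    "a_saddr_asn": [395152, 395152],
    "b_saddr": [41732323, 41733976],
    "b_saddr_asn": [5607, 5607],
    "b_saddr_lat": [53.2738, 53.2738],
    "b_saddr_long": [-2.6087, -2.6087],
})

# Keep only ASN and location columns, then drop rows that became identical.
asn_cols = ["a_saddr_asn", "b_saddr_asn", "b_saddr_lat", "b_saddr_long"]
asn_level = ip_level[asn_cols].drop_duplicates()

print(len(ip_level), "->", len(asn_level))
```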

## Split into cities

In [18]:
# split on the a_saddr ASN: 395152 (van), 393755 (tor), 395766 (mtl)
df_van_ip = df_geo_unique[df_geo_unique['a_saddr_asn']==395152]
df_van_asn = df_geo_unique_asn[df_geo_unique_asn['a_saddr_asn']==395152]
df_tor_ip = df_geo_unique[df_geo_unique['a_saddr_asn']==393755]
df_tor_asn = df_geo_unique_asn[df_geo_unique_asn['a_saddr_asn']==393755]
df_mtl_ip = df_geo_unique[df_geo_unique['a_saddr_asn']==395766]
df_mtl_asn = df_geo_unique_asn[df_geo_unique_asn['a_saddr_asn']==395766]
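
The six assignments in this cell repeat one filter with three ASN values. A possible way to express the same split with a lookup table is sketched below on toy data. The `CITY_ASN` mapping and the `by_city` dictionary are hypothetical names, and the city labels simply follow the notebook's variable names.

```python
import pandas as pd

# City labels follow the notebook's variable names (van, tor, mtl).
CITY_ASN = {"van": 395152, "tor": 393755, "mtl": 395766}

# Toy stand-in for df_geo_unique_asn.
table = pd.DataFrame({
    "a_saddr_asn": [395152, 395766, 395152, 393755],
    "b_saddr_asn": [5607, 3356, 3356, 3356],
})

# One filtered sub-table per city, keyed by the city label.
by_city = {
    city: table[table["a_saddr_asn"] == asn]
    for city, asn in CITY_ASN.items()
}

for city, subset in by_city.items():
    print(city, subset.shape)
```

The same comprehension could be run once over the IP-level table and once over the ASN-level table.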

In [19]:
print(df_van_ip.shape,
df_tor_ip.shape,
df_mtl_ip.shape)

(932, 12) (1330, 12) (477, 12)


In [20]:
print(df_van_asn.shape,
df_tor_asn.shape,
df_mtl_asn.shape)

(168, 8) (217, 8) (114, 8)


In [22]:
df_van_asn.head()

Unnamed: 0,Unnamed: 1,a_saddr_asn,a_saddr_asorg,a_saddr_lat,a_saddr_long,b_saddr_asn,b_saddr_asorg,b_saddr_lat,b_saddr_long
71483861,2018-03-15,395152,CloudPBX,43.6319,-79.3716,5607,Sky UK Limited,53.2738,-2.6087
65221709,2018-01-11,395152,CloudPBX,43.6319,-79.3716,3356,"Level 3 Parent, LLC",37.751,-97.822
64743723,2018-01-04,395152,CloudPBX,43.6319,-79.3716,32308,"8x8, Inc.",37.751,-97.822
65528895,2018-01-15,395152,CloudPBX,43.6319,-79.3716,16504,Granite Telecommunications LLC,37.751,-97.822
78144650,2018-05-23,395152,CloudPBX,43.6319,-79.3716,7018,"AT&T Services, Inc.",37.751,-97.822
