# Hitcount Map

**Authors:** Lorraine Hwang, Denise Kwong

**Previous Version:** Based on plot.py by Eric Heien.

revised uses 

This site or product includes IP2Location LITE data available from <a href="https://lite.ip2location.com">https://lite.ip2location.com</a>.

---

**Purpose**

This Jupyter notebook creates hitcount maps from both the legacy data from the UC Davis servers and the current data now hosted by Hubzero.

Users can use the legacy database, stored as SQLite, and/or import data generated from Tool Stats on the software landing pages at geodynamics.org.

The legacy database includes data roughly from 2012 - 2021.

The new database roughly begins in March 2023.

**Contributing**

We welcome your contributions to help create better functionality and prettier maps. Please consider contributing your changes to the repository by submitting a pull request.

**Known Issues**

The IP list for bots is known to be incomplete. When using the legacy data, inspect ip_nums and the map. Note that ip_nums is NOT the same as the dotted IP address; it holds the address in decimal format.

You can use the following tool to convert:

https://www.ipaddressguide.com/ip
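
The same conversion can also be done locally with Python's standard `ipaddress` module. This is a minimal sketch; the helper names `decimal_to_ip` and `ip_to_decimal` are illustrative and are not part of the notebook.

```python
import ipaddress


def decimal_to_ip(ip_num):
    """Convert a decimal IPv4 number (as stored in ip_nums) to dotted-quad form."""
    return str(ipaddress.IPv4Address(ip_num))


def ip_to_decimal(ip_str):
    """Convert a dotted-quad IPv4 address to its decimal form."""
    return int(ipaddress.IPv4Address(ip_str))


print(decimal_to_ip(1311203070))      # 78.39.94.254
print(ip_to_decimal("78.39.94.254"))  # 1311203070
```

The second helper is handy when adding a newly identified bot to filter_ips, which expects the decimal form.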

**Example**

The following example plots both the legacy data and the JSON data through roughly May 2023 for the package ASPECT.

Note the large number of downloads from China. Some IPs have already been added to filter_ips, but more need to be identified before this data can be trusted. Other suspect data is located in Iran.

e.g. 1311203070 is 78.39.94.254, which is located in Kerman, Iran.

In the U.S., a large number of downloads are associated with CIG Headquarters at UC Davis and with a server in Kansas whose owner has not been identified (possibly government/USGS).

<img title="ASPECT" alt="Alt text" src="../images/aspect.png">

---

## Import needed libraries

In [1]:
import math
import os
import sqlite3
import datetime
import pandas as pd
import json
import plotly.express as px

# Initialize array. This is necessary if not using legacy data
locs = []
PACKAGE_NAM = ""


### Legacy data
Skip if NOT using legacy data.

#### IP numbers to filter

This is used to filter out known bots from the legacy data.  

CAUTION: This list is NOT comprehensive. More work is needed to clean bot IPs out of the legacy data.

In [12]:
filter_ips = [
    2155411043,
    1368427042,  # crawl-81-144-138-34.wotbox.com, crawler
    2025873270,  # 120.192.95.118, unknown site in China
    2026569611,  # 120.202.255.139, unknown site in China
    2025868405,  # 120.192.76.117, unknown site in China
    2026569613,  # 120.202.255.141, unknown site in China
    3548981000,  # 211.137.39.8, unknown site in China
    1862796174,  # 111.8.3.142, unknown site in China
    3548981003,  # 211.137.39.11, unknown site in China
    3548980999,  # 211.137.39.7, unknown site in China
    3719653427,  # 221.181.104.51, unknown site in China
    2025868387,  # 120.192.76.99, unknown site in China
    # Added 15 August 2023 L.J. Hwang
    3658880259,  # 218.22.21.3 Suspected bot based on number of hits ChinaNet Anhui Province / blacklisted by some
    3525924814,  # 210.41.87.206 Suspected bot based on number of hits China Education and Research Network
    2682452965,  # 159.226.251.229 Suspected bot based on number of hits China Science and Technology Network
    3740312921,  # 222.240.165.89 Suspected bot based on number of hits ChinaNet Hunan Province / blacklisted by some
    1843404020,  # 109.224.28.244 Suspected bot based on number of hits Hulum Almustakbal Iraq / blacklisted by few
    289039475    # 17.58.100.115 Apple Data Center
]

#### Define functions
These functions are needed to manipulate the legacy data and were created by Eric Heien.

In [3]:
def find_ip_lat_lon(db_conn, ip_num):
    curs = db_conn.cursor()
    curs.execute(
        "SELECT location.latitude, location.longitude FROM location, block WHERE block.loc_id = location.loc_id AND ? BETWEEN block.start_ip AND block.end_ip limit 1;", (ip_num,))
    return curs.fetchone()


def ip_nums_to_locations(db_name, ip_num_list):
    """Map each decimal IP to a (lat, lon) tuple; IPs with no match are skipped."""
    db_conn = sqlite3.connect(db_name)

    cache = {}  # Avoid repeated database queries for IPs already looked up
    result = []
    for check_ip in ip_num_list:
        if check_ip in cache:
            result.append(cache[check_ip])
        else:
            res = find_ip_lat_lon(db_conn, check_ip)
            if res is not None:
                result.append(res)
                cache[check_ip] = res
    db_conn.close()

    return result


def lookup_hits(db_name, package_name, start_time, end_time):
    db_conn = sqlite3.connect(db_name)
    curs = db_conn.cursor()
    result = []
    if package_name == "comprehensive":
        curs.execute(
            "SELECT hit.ip_num FROM hit WHERE hit.time >= ? AND hit.time <= ?;", (start_time, end_time,))
    else:
        curs.execute(
            "SELECT hit.ip_num FROM hit, dist_file, package WHERE hit.time >= ? AND hit.time <= ? AND hit.file_id = dist_file.id AND dist_file.package_id = package.id AND package.package_name = ?;",
            (start_time, end_time, package_name,))
    # Skip hits from known bots; a set makes the membership test fast
    known_bots = set(filter_ips)
    for row in curs:
        ip_int_val = int(row[0])
        if ip_int_val not in known_bots:
            result.append(ip_int_val)
    db_conn.close()
    return result
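
To see how the block-range lookup in `find_ip_lat_lon` behaves, here is a self-contained sketch against an in-memory SQLite database. The table layout is only inferred from the query above (the real `ip_lookup_db` schema may contain more columns), and the rows and the helper name `lookup` are made up for illustration.

```python
import sqlite3

# Hypothetical minimal schema, inferred from the SELECT statement above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE location (loc_id INTEGER PRIMARY KEY, latitude REAL, longitude REAL);
    CREATE TABLE block (start_ip INTEGER, end_ip INTEGER, loc_id INTEGER);
""")
# Made-up rows: one location and one IP block that maps to it.
conn.execute("INSERT INTO location VALUES (1, 38.5, -121.7)")
conn.execute("INSERT INTO block VALUES (1000, 1999, 1)")


def lookup(db_conn, ip_num):
    """Return (latitude, longitude) for the block containing ip_num, or None."""
    curs = db_conn.cursor()
    curs.execute(
        "SELECT location.latitude, location.longitude "
        "FROM location JOIN block ON block.loc_id = location.loc_id "
        "WHERE ? BETWEEN block.start_ip AND block.end_ip LIMIT 1;",
        (ip_num,),
    )
    return curs.fetchone()


inside = lookup(conn, 1500)   # falls inside the block
outside = lookup(conn, 5000)  # no block contains this number
print(inside, outside)
conn.close()
```

IP numbers that fall outside every block return `None` and are skipped by `ip_nums_to_locations`, which is why the notebook can report fewer locations than hits.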


#### Set filenames for data

hit_database is the legacy database. It contains download data for all packages downloaded from CIG servers.


In [4]:
HIT_DB_NAME = '../database/hit_database'
LOCATION_DB_NAME = '../database/ip_lookup_db'
INPUT_COUNT_LIST = '../json/monthlyCountList.json'

#### Specify name of package 

Specify a code name.

For specfem, use a "-" dash and not an "_" underscore.

You can also specify "comprehensive" to include all packages. However, this currently takes a long time to run, and more bots still need to be filtered out.

In [5]:
# PACKAGE_NAM = "aspect"
# PACKAGE_NAM = "pylith"
PACKAGE_NAM = "comprehensive"

# Print some useful information
print(HIT_DB_NAME,LOCATION_DB_NAME, PACKAGE_NAM)

../database/hit_database ../database/ip_lookup_db comprehensive


## Set parameters

### Date Range
Set start and end times for the data to be plotted. 

If none is specified, the default range runs from the start of UNIX epoch time (January 1, 1970) to the current date.

Database contains download counts from approximately 2012-2021.
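
For reference, a `datetime` value can be converted to and from UNIX epoch seconds with the standard library. This sketch assumes UTC; whether the legacy database stores times in UTC or local time is not stated here, so treat the time zone as an assumption.

```python
import datetime

# Midnight on 1 January 2016, interpreted as UTC (an assumption for this example).
start = datetime.datetime(2016, 1, 1, tzinfo=datetime.timezone.utc)

# datetime -> seconds since 1 January 1970 UTC
start_epoch = int(start.timestamp())
print(start_epoch)  # 1451606400

# epoch seconds -> datetime
round_trip = datetime.datetime.fromtimestamp(start_epoch, tz=datetime.timezone.utc)
print(round_trip)   # 2016-01-01 00:00:00+00:00
```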


In [6]:
# START_TIME and END_TIME must be in UNIX epoch format (seconds since Jan 1 1970)

# Set default time to the beginning of time Jan 1 1970
#START_TIME = datetime.datetime.fromtimestamp(0)

# To change the default start time, replace MM/DD/YY HH:MM:SS with target time
#START_TIME = datetime.datetime.strptime('MM/DD/YY HH:MM:SS', '%m/%d/%y %H:%M:%S')
START_TIME = datetime.datetime.strptime('01/01/16 00:00:01', '%m/%d/%y %H:%M:%S')


# Set end time to current time. 
#END_TIME = datetime.datetime.now()

# To change the default end time, replace MM/DD/YY HH:MM:SS with target time
# END_TIME = datetime.datetime.strptime('MM/DD/YY HH:MM:SS', '%m/%d/%y %H:%M:%S')
END_TIME = datetime.datetime.strptime('12/31/22 23:59:59', '%m/%d/%y %H:%M:%S')


# Print some useful information
print(START_TIME, END_TIME)

2016-01-01 00:00:01 2022-12-31 23:59:59


## Read and create map data

###  Legacy data

Skip this section if you are NOT using legacy data.

#### Read in raw data

In [13]:
# Get the IP numbers associated with a given package
ip_nums = lookup_hits(HIT_DB_NAME, PACKAGE_NAM, START_TIME, END_TIME)

# Inspect ip_nums if you are looking for more bots
# print (ip_nums)

print("Found", len(ip_nums), "hits associated with package", PACKAGE_NAM)
if len(ip_nums) == 0:
    print("Cannot generate plot for", PACKAGE_NAM)

# Locate the corresponding lat/lon points
locs = ip_nums_to_locations(LOCATION_DB_NAME, ip_nums)
print("Checked", len(ip_nums), "IPs, found", len(locs), "locations.")

if len(locs) == 0:
    print("Cannot generate plot for", PACKAGE_NAM)


Found 8027 hits associated with package comprehensive
Checked 8027 IPs, found 7845 locations.


####  Create  map data


In [14]:
# Count how often each (lat, lon) location occurs: key is the location, value is its frequency
from collections import Counter
dictionary = dict(Counter(locs))

# Convert the dictionary to a dataframe
df = pd.DataFrame(dictionary.items(), columns=['latlon', 'freq'])

# Create a new dataframe to separate the latitude and the longitude, which are in a tuple together
df2 = pd.DataFrame(df['latlon'].tolist(), columns=['lat', 'lon'])

# Append the frequency to the new dataframe
df2['freq'] = df['freq']

### For data beginning in 2023

The following will import JSON data from Tool Stats on https://geodynamics.org and, if legacy data has been loaded, append it to the legacy data.

Modify filename as needed.

Note that this step does not depend on the code package name. We recommend renaming the data file to include the code package name.
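
As a rough guide to the expected input, the sketch below builds a tiny stand-in for the JSON file and aggregates it by location. The field names are taken from the code in the next cell; the records themselves are invented for illustration, and the real export may contain additional fields.

```python
import json
import pandas as pd

# Invented sample with the fields the notebook reads from 'world_map_list'.
sample = json.loads("""
{
  "world_map_list": [
    {"ip_lat": 38.54, "ip_long": -121.74, "city": "Davis", "region": "California",
     "date_download": "2023-03-15 10:00:00"},
    {"ip_lat": 38.54, "ip_long": -121.74, "city": "Davis", "region": "California",
     "date_download": "2023-04-02 08:30:00"},
    {"ip_lat": 52.52, "ip_long": 13.40, "city": "Berlin", "region": "Berlin",
     "date_download": "2023-04-20 17:45:00"}
  ]
}
""")

records = pd.DataFrame(sample["world_map_list"])

# Count the records per location; each group becomes one marker on the map.
counts = (
    records.groupby(["ip_lat", "ip_long"])
    .size()
    .reset_index(name="freq")
    .rename(columns={"ip_lat": "lat", "ip_long": "lon"})
)
print(counts)
```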

In [None]:
# monthlyCountList used for Tool Statistics (Downloads, Redirect Counts)

# Open and load the data; the file is closed automatically when the block ends
with open(INPUT_COUNT_LIST) as f:
    monthly_count_list_data = json.load(f)

# Convert the original data to a dataframe
dat = monthly_count_list_data['world_map_list']
df = pd.DataFrame(dat)

# Keep only the records whose download date falls within the specified date range
dates = pd.to_datetime(df['date_download'], format='%Y-%m-%d %H:%M:%S')
df = df[(dates >= START_TIME) & (dates <= END_TIME)]

# Group the entries that share the same ip_lat, ip_long, city, and region, and count the entries in each group
df3 = df.pivot_table(index = ['ip_lat', 'ip_long', 'city', 'region'], aggfunc ='size')

# Convert the pivot_table to a dataframe for easier manipulation
df3 = df3.reset_index()
df3.rename(columns={'ip_lat': 'lat', 'ip_long': 'lon', 0:'freq'}, inplace=True)
df3 = df3.drop('city', axis=1)
df3 = df3.drop('region', axis=1)

# Check whether the data from the database is empty; if not, combine both sets of data
if len(locs) == 0:
    df2 = df3
else:
    df2 = pd.concat([df2, df3], axis=0, ignore_index=True)

if len(df2) == 0:
    print("Cannot generate plot. No data found.")

## Plot

Several ways are presented below to plot your map data. We could not decide what we liked best but we know they all could use improvements.

### scattergeo

Use this one for formatted hovertext.

In [15]:
# define some map parameters
max = df2['freq'].max()   # This is used to normalize the size of the marker
scale = 50                # The marker needs to be scaled up to be visible after normalizing
min_size = 5              # If the range of values is too large, the smallest marker needs a minimum size
max_value = 20            # Maximum value for color scale



In [16]:

import plotly.graph_objects as go

fig =  go.Figure(
                data = go.Scattergeo(
                         lat=df2['lat'], 
                         lon=df2['lon'], 
                         hovertext = df2['freq'],
                         marker = dict(
                             colorscale = 'Earth', #Blackbody,Bluered,Blues,Cividis,Earth,Electric,Greens,Greys,Hot,Jet,Picnic,Portland,Rainbow,RdBu,Reds,Viridis,YlGnBu,YlOrRd
                             cmax = max_value,
                             cmin = 0,
                             color = df2['freq'],
                             size=df2['freq'] / max * scale,
                             colorbar_title = "Count",
                             line = dict(color='gray', width=0)   #symbol border color and width.
                         )                     
                        )
                )

fig.update_traces(marker_sizemin=min_size) 

# Note that START_TIME and END_TIME are user specified and do not necessarily correspond
# to the actual range of data in your input file
#
# fig.update_layout(
#        title = 'Downloads: ' +  PACKAGE_NAM + ' ' + str(START_TIME) + ' to ' +str(END_TIME)
#   )
#
fig.update_layout(
        title = 'Downloads: ' +  PACKAGE_NAM
    )
fig.show()
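
The marker sizes passed to `Scattergeo` can be inspected without drawing a map. The sketch below applies the same normalization to a few invented counts. Plotly applies `sizemin` when it draws the markers, so it is only imitated here with `clip`, as an approximation for illustration.

```python
import pandas as pd

freq = pd.Series([1, 10, 100, 1000])  # invented counts
scale = 50     # same role as `scale` in the notebook
min_size = 5   # same role as `min_size` in the notebook

# Normalize by the largest count, then scale up so the markers are visible.
raw_size = freq / freq.max() * scale

# Approximate the effect of marker_sizemin: small markers are raised to min_size.
shown_size = raw_size.clip(lower=min_size)
print(pd.DataFrame({"freq": freq, "raw": raw_size, "shown": shown_size}))
```

With a wide range of counts, most markers end up at the minimum size, which is why the choice of `scale` and `min_size` matters for readability.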



### Map Options

The next two options use plotly.express.

Note that using hovertemplate in the following fashion caused lat/lon to be mapped incorrectly 
into the hover box in some cases, e.g.

    fig.update_traces(hovertemplate = "(" + df2['lat'].apply(str) + ", " + df2['lon'].apply(str) + "): " + df2['freq'].apply(str));

Legend:
1. Range - choose the interval length (ex. 200 => 0-200, 200-400, etc.); discrete colors, uses `color_discrete_sequence`
2. Frequency - heat gradient; continuous color, uses `color_continuous_scale`

The color sequence can be chosen in one of the following ways:
1. A list of colors (ex. `['orange', 'red', '#00D']`)
2. Plotly's built-in color sequences. See [Color Sequences in Plotly Express](https://plotly.com/python/discrete-color/) (ex. `px.colors.qualitative.G10`)
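
The range labels follow from integer division: a count is floored to the nearest multiple of the interval, which gives the lower bound of its bin. A small sketch of that arithmetic (the helper name `range_label` is illustrative only):

```python
def range_label(freq, interval=200):
    """Return the label of the interval-wide bin that contains freq."""
    lower = (freq // interval) * interval
    return f"{lower} - {lower + interval}"


for count in (5, 199, 200, 450):
    print(count, "->", range_label(count))
# 5 -> 0 - 200
# 199 -> 0 - 200
# 200 -> 200 - 400
# 450 -> 400 - 600
```

A count exactly equal to a bin edge falls into the upper bin.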


In [None]:
# Map Options

# Set as either 'range' or 'freq'. The default is 'range' with an interval of 200.
legend = 'range'
#legend = 'freq'
interval = 200

# Set colors to preferred color palette. Set colors to a list of colors or a built-in Plotly sequence.
#colors = px.colors.qualitative.Pastel1
colors = px.colors.qualitative.Plotly

# Assign each location to a bin of width `interval` and store the bin label (e.g. "0 - 200") in a new column 'range'
lower = (df2['freq'] // interval * interval).astype(int)
df2['range'] = lower.astype(str) + " - " + (lower + interval).astype(str)


In [None]:
# Plot using scatter_geo


if (legend == 'range'):
    fig = px.scatter_geo(df2,
                         lat='lat', 
                         lon='lon', 
                         size='freq', 
                         title='Hitmap (Geo)', 
                         color=legend, 
                         color_discrete_sequence=colors
                        )
else:
    fig = px.scatter_geo(df2,
                         lat='lat', 
                         lon='lon', 
                         size='freq', 
                         title='Hitmap (Geo)', 
                         color=legend, 
                         color_continuous_scale=colors
                        )

fig.show()

In [None]:
# Plot using scatter_mapbox, with open-street-map as the default.

if (legend == 'range'):
    fig = px.scatter_mapbox(df2,lat='lat', lon='lon', size='freq', zoom=0.5, center=dict(lon=0, lat=0), mapbox_style="open-street-map", title='Hitmap (Mapbox)', color=legend, color_discrete_sequence=colors)
else:
    fig = px.scatter_mapbox(df2,lat='lat', lon='lon', size='freq', zoom=0.5, center=dict(lon=0, lat=0), mapbox_style="open-street-map", title='Hitmap (Mapbox)', color=legend, color_continuous_scale=colors)

# To change the style of the mapbox, uncomment any of the following lines:
#fig.update_layout(mapbox_style="carto-positron")
#fig.update_layout(mapbox_style="carto-darkmatter")
#fig.update_layout(mapbox_style="stamen-terrain")
#fig.update_layout(mapbox_style="stamen-toner")

# Enforce a minimum marker size on all traces (a selector of type 'scatter' would not match the mapbox traces)
fig.update_traces(marker_sizemin=10)
fig.show()

## Some useful Python for debugging

In [None]:
# Display all the rows of a dataframe
with pd.option_context('display.max_rows', None,
                       'display.max_columns', None,
                       'display.precision', 3):
    print(df2)


---

You can use this to create an ordered list of IP addresses in decimal format and the associated number of downloads.

You can convert here:
    
   [https://www.ipaddressguide.com/ip](https://www.ipaddressguide.com/ip)
    
And look up here:

   [https://whatismyipaddress.com/ip-lookup](https://whatismyipaddress.com/ip-lookup)
    
Note that more than one IP address may geolocate to the same location.
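
To make such a list easier to read, the decimal values can be shown next to their dotted-quad form. This is a minimal sketch using the standard `ipaddress` module; `sample_ip_nums` is a stand-in for `ip_nums`, built from two values that appear in filter_ips.

```python
import ipaddress
import pandas as pd

# Stand-in for ip_nums; both values are taken from the filter_ips list.
sample_ip_nums = [289039475, 1368427042, 289039475]

# Count the hits per decimal IP, most frequent first.
counts = pd.Series(sample_ip_nums).value_counts()

# Add the dotted-quad form so each address can be pasted into a lookup tool.
table = pd.DataFrame({
    "ip_num": counts.index,
    "ip": [str(ipaddress.IPv4Address(int(n))) for n in counts.index],
    "hits": counts.values,
})
print(table)
```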

In [17]:
pd.set_option('display.max_rows', None)
pd.Series(ip_nums).value_counts()

621369284     182
1422492997    139
1152705754     93
874592842      93
873244901      89
875052205      89
875006579      87
1152706093     87
873205036      86
597865614      78
908366907      76
908367026      67
1979230134     58
856624950      57
3058057942     55
222464821      46
1920440367     46
1920437900     44
1920437595     43
1920439641     41
3589265602     40
1920437870     36
908367027      36
1920436254     33
1920436732     33
1920435243     32
1920435211     32
1920435365     31
1920435238     31
1920440248     31
1920440956     31
222464866      30
989329164      30
1920435470     29
3475901736     29
1920436337     29
222464861      29
1920441488     27
3639886569     27
1920438167     26
1920435036     26
1920434718     26
1920436400     25
1920438882     25
1920437485     25
1920436935     25
1920436544     25
1920436583     25
1920439079     24
1920436761     24
1920439309     24
1920441850     24
1920441892     24
1920437034     24
222464911      24
1920441298