In [1]:
# imports
import pandas as pd
import re
import data_utils as du
import json

### A. Demographic areas

We use the OntoNotes tag set because the `MISC` tag of the regular tag set is too general. Of the OntoNotes tags, we are interested in `GPE` (countries, cities, states) and `NORP` (nationalities, religious or political groups).

In [2]:
# loading the onto-notes tags
ner_onto_df = pd.read_parquet("data/ner_tagged_data_onto.parquet")
ner_onto_df.head()

Unnamed: 0,message_ids,text,label
0,1,OSINT,ORG
1,1,Cyberknow20,PERSON
2,1,pro-Russian,NORP
3,2,Today,DATE
4,2,Poland,GPE


In [3]:
# loading the json files to obtain dictionaries on languages and territories
file_paths = {"languages": "json-files/languages.json",
              "territories": "json-files/territories.json"}

display_names = {}
for section, file_path in file_paths.items():
    with open(file_path, "r", encoding="utf-8") as json_file:
        data = json.load(json_file)
    display_names[section] = data["main"]["en-GB"]["localeDisplayNames"][section]

# code -> name dictionaries and their reversed (name -> code) versions
language_dict = display_names["languages"]
territories_dict = display_names["territories"]
language_dict_reversed = {value: key for key, value in language_dict.items()}
territories_dict_reversed = {value: key for key, value in territories_dict.items()}

# combining the language and territories dict into a single dictionary
demographics_dict = {**language_dict_reversed, **territories_dict_reversed}

# lower casing all the keys and values of the demographics dict
demographics_dict_lower = {key.lower(): value.lower() for key, value in demographics_dict.items()}
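
One caveat of merging the two reversed dictionaries: a nationality such as "Polish" resolves to a language code, not a territory code, and the two only coincide for some countries. The sketch below illustrates this with a small hand-written excerpt. The entries are typed in for illustration from the ISO 639-1 and ISO 3166-1 codes and are an assumption about the file contents, not values loaded from the JSON files.

```python
# illustrative excerpts only; the real dictionaries come from the json files
language_excerpt = {"pl": "Polish", "sv": "Swedish", "da": "Danish"}
territory_excerpt = {"PL": "Poland", "SE": "Sweden", "DK": "Denmark"}

# same construction as above: reverse both, merge, lower-case keys and values
reversed_excerpt = {
    **{name: code for code, name in language_excerpt.items()},
    **{name: code for code, name in territory_excerpt.items()},
}
lookup = {name.lower(): code.lower() for name, code in reversed_excerpt.items()}
territory_codes = {code.lower() for code in territory_excerpt}

# check whether each entity resolves to a code that is also a territory code
for entity in ["Polish", "Poland", "Swedish", "Sweden", "Danish"]:
    code = lookup.get(entity.lower(), "unknown")
    print(entity, code, code in territory_codes)
```

In this excerpt "Swedish" resolves to "sv" while Sweden's territory code is "se", so `NORP` mentions of such nationalities would not be counted under the intended country.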

In [15]:
# keeping only the GPE and NORP tags (copy to avoid SettingWithCopyWarning)
tags_of_interest = ["GPE", "NORP"]
tag_mask = ner_onto_df['label'].isin(tags_of_interest)
filtered_df = ner_onto_df[tag_mask].copy()

# adding the country column based on the demographics dictionary
filtered_df["country"] = filtered_df["text"].apply(lambda x: demographics_dict_lower.get(x.lower(), "unknown"))

# filtering out all the rows with unknown country
known_country_df = filtered_df[filtered_df["country"] != "unknown"]

# selecting only the message_id and country columns
known_country_df = known_country_df[["message_ids", "country"]]

# only selecting the unique countries per message
known_country_df = known_country_df.drop_duplicates()

# keeping only the EU and Nordic countries (copy to avoid SettingWithCopyWarning)
eu_nordic_country_df = known_country_df[known_country_df["country"].isin(du.eu_nordic_countries.keys())].copy()

# obtaining the country name from the country abbreviation
eu_nordic_country_df["country_name"] = eu_nordic_country_df['country'].map(du.eu_nordic_countries).fillna('unknown')

# counts per country
print(eu_nordic_country_df["country_name"].value_counts())

# percentage of (message, country) mentions that refer to EU and Nordic countries
total_targeted = len(known_country_df)
eu_nordic_targeted = len(eu_nordic_country_df)
eu_nordic_perc = round(eu_nordic_targeted / total_targeted * 100, 2)
print(f"{eu_nordic_perc}% of the attacks were targeting EU and Nordic countries")

# how many EU and Nordic countries have been targeted?
unique_eu_nordic_targeted = len(set(eu_nordic_country_df["country_name"]))
total_eu_nordic = len(du.eu_nordic_countries)
print(f"{unique_eu_nordic_targeted} out of {total_eu_nordic} EU and Nordic countries have been targeted")

country_name
poland            244
spain             171
lithuania         171
italy             158
germany           115
latvia             90
finland            69
france             67
netherlands        48
sweden             44
denmark            43
estonia            40
norway             30
romania            26
belgium            23
slovakia           19
luxembourg         15
austria            12
greece             11
slovenia            9
croatia             6
iceland             5
hungary             4
ireland             3
malta               1
cyprus              1
portugal            1
czech republic      1
Name: count, dtype: int64
38.68% of the attacks were targeting EU and Nordic countries
28 out of 28 EU and Nordic countries have been targeted




### B. Infrastructure sectors

In [96]:
# keeping only the ORG tags
tags_of_interest = ["ORG"]
tag_mask = ner_onto_df['label'].isin(tags_of_interest)
filtered_df = ner_onto_df[tag_mask]

# selecting the text column of the filtered df
text_set = set(filtered_df["text"])

# function to categorize organization
def categorize_organization(name):
    for sector, pattern in du.sectors_patterns.items():
        if re.search(pattern, name, re.IGNORECASE):
            return sector
    return 'Unknown'

# dictionary to store assigned sectors
sector_dict = {"organization": [],
               "sector": []
               }

# assigning organizations to a sector
for org in text_set:
    sector = categorize_organization(org)
    sector_dict["organization"].append(org)
    sector_dict["sector"].append(sector)

# viewing the assigned sectors qualitatively
sector_df = pd.DataFrame.from_dict(sector_dict)
print(sector_df)

                                  organization                 sector
0                 the Hampshire County Council                Unknown
1                                         ECAA                Unknown
2                          the Court of Appeal                Unknown
3                                       Isdefe                Unknown
4                 Ministry of National Defense  public administration
...                                        ...                    ...
2063                            InsanePakistan                Unknown
2064                       Vocational Training                Unknown
2065                                BLRT Grupp                Unknown
2066                                Latvenergo                Unknown
2067  Finnish Chamber of Commerce and Industry                Unknown

[2068 rows x 2 columns]


In [97]:
# counts for each sector
print(sector_df["sector"].value_counts())

sector
Unknown                            1639
public administration               165
banking                             106
transport                            86
financial market infrastructure      31
energy                               28
digital infrastructure                9
space                                 4
Name: count, dtype: int64
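
Roughly 79% of the organizations (1639 of 2068) remain `Unknown`. The contents of `du.sectors_patterns` are not shown in this notebook, so the patterns below are purely hypothetical. The sketch only illustrates how first-match regex categorization behaves and how one might count tokens in the unmatched names to find candidate keywords.

```python
import re
from collections import Counter

# hypothetical patterns, NOT the contents of du.sectors_patterns
example_patterns = {
    "public administration": r"\b(ministry|council)\b",
    "banking": r"\bbank\b",
    "energy": r"energo|\benergy\b",
}

def categorize(name, patterns):
    # return the first sector whose pattern matches, in dictionary order
    for sector, pattern in patterns.items():
        if re.search(pattern, name, re.IGNORECASE):
            return sector
    return "Unknown"

names = ["the Hampshire County Council", "Latvenergo", "BLRT Grupp",
         "Ministry of National Defense", "the Court of Appeal"]
assigned = {name: categorize(name, example_patterns) for name in names}

# count tokens in the names that stayed Unknown to spot candidate keywords
unknown_tokens = Counter(
    token.lower()
    for name, sector in assigned.items() if sector == "Unknown"
    for token in re.findall(r"[A-Za-z]+", name)
)
print(assigned)
print(unknown_tokens.most_common(5))
```

Because the first matching pattern wins, the order of the dictionary matters whenever a name could fit more than one sector.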


### C. Security properties (CIA)

In [98]:
# loading the dataset and viewing some messages that end with redundant information
df = pd.read_csv("data/hacktivist_messages.csv", sep=";")
pd.set_option('display.max_colwidth', None)
df[130:140]

Unnamed: 0,Message Id,Datetime,Text
130,131,2022-12-21 19:12:25,The Latvian portal of the financial intelligence service is not working still🔥❌https://check-host.net/check-report/df61e8dk343🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
131,132,2022-12-22 11:02:56,🔥 Since yesterday the authorization service of the portal of grant projects of the State Agency for the Development of Education of Latvia haven't rehabilitated 🇱🇻 :❌ https://check-host.net/check-report/df78a8fk3ba🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
132,133,2022-12-23 11:07:07,"🔥Ziedot, a Latvian Russophobic charitable organization, started collecting donations to the Armed Forces of Ukraine, but we quickly reacted and the portal stopped working due to our DDoS attacks:❌https://check-host.net/check-report/df9cc89k288🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!"
133,134,2022-12-23 11:28:13,"🔥As advised by subscribers, we are now conducting ""stress tests"" of sites😁The portal of the Court of Appeal in Rzeszow collapsed from stress:❌https://check-host.net/check-report/df9ce27k3a5🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!"
134,135,2022-12-23 11:46:58,🔥The subdomain (job portal) of British munitions company Bae Systems did not pass our stress test:❌https://check-host.net/check-report/df9ce27k3a5🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
135,136,2022-12-24 11:23:08,📦Our DDoS-surprise was first accepted by the Polish portal of the Public Procurement Administration:❌https://check-host.net/check-report/dfc0281ka8e🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
136,137,2022-12-25 11:50:48,🔥There's again non-flying weather today in Poland due to ddos-hail:❌Civil Aviation Administration:https://check-host.net/check-report/dfe19c5k176❌Central database of reports of the Civil Aviation Authority:https://check-host.net/check-report/dfe1926k36c🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
137,138,2022-12-26 09:38:33,🔥The Latvian website of the Public Services Commission is not working today: ❌https://check-host.net/check-report/e00ae52kea4🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
138,139,2022-12-26 10:48:49,🚂The portal of the management company of Latvian Railways is also feeling bad today:❌https://check-host.net/check-report/e00e301k300🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!
139,140,2022-12-26 13:40:04,"🚂The portal of the Latvian railway, as well as its subdomains, are feeling bad today:❌Latvian Railway:https://check-host.net/check-report/e012debk430❌Latvian Railway infrastructure:https://check-host.net/check-report/e012e6dk76❌Logistics Service:https://check-host.net/check-report/e012ec9kef8❌ Freight service:https://check-host.net/check-report/e012f3bka3f❌Rolling stock service:https://check-host.net/check-report/e012fe6kb5❌Security service:https://check-host.net/check-report/e01305eka35❌Electronic maintenance service of the railway system:https://check-host.net/check-report/e0130b5kca6❌Training Center:https://check-host.net/check-report/e0130f4kd05🐻Subscribe to NoName057(16)🐻Join our DDoS-project🇷🇺Victory will be ours!"


In [99]:
# function to cut off the redundant part of each message
def shorten_string(input_string):
    # Check if the input is a string
    if isinstance(input_string, str):
        pattern = r'❌.*?check-host'
        
        # Search for the pattern in the input string
        match = re.search(pattern, input_string)
        
        if match:
            # Cut off the string from the start of the match
            return input_string[:match.start()]
        else:
            return input_string
    else:
        # If not a string, return it unchanged (e.g., for NaN values)
        return input_string

# trimming the texts in the df
df["Text"] = df["Text"].apply(shorten_string)
df[130:140]

Unnamed: 0,Message Id,Datetime,Text
130,131,2022-12-21 19:12:25,The Latvian portal of the financial intelligence service is not working still🔥
131,132,2022-12-22 11:02:56,🔥 Since yesterday the authorization service of the portal of grant projects of the State Agency for the Development of Education of Latvia haven't rehabilitated 🇱🇻 :
132,133,2022-12-23 11:07:07,"🔥Ziedot, a Latvian Russophobic charitable organization, started collecting donations to the Armed Forces of Ukraine, but we quickly reacted and the portal stopped working due to our DDoS attacks:"
133,134,2022-12-23 11:28:13,"🔥As advised by subscribers, we are now conducting ""stress tests"" of sites😁The portal of the Court of Appeal in Rzeszow collapsed from stress:"
134,135,2022-12-23 11:46:58,🔥The subdomain (job portal) of British munitions company Bae Systems did not pass our stress test:
135,136,2022-12-24 11:23:08,📦Our DDoS-surprise was first accepted by the Polish portal of the Public Procurement Administration:
136,137,2022-12-25 11:50:48,🔥There's again non-flying weather today in Poland due to ddos-hail:
137,138,2022-12-26 09:38:33,🔥The Latvian website of the Public Services Commission is not working today:
138,139,2022-12-26 10:48:49,🚂The portal of the management company of Latvian Railways is also feeling bad today:
139,140,2022-12-26 13:40:04,"🚂The portal of the Latvian railway, as well as its subdomains, are feeling bad today:"
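
The pattern in `shorten_string` relies on the ❌ marker, so messages that put a different emoji in front of the check-host link stay untrimmed (one message shown later in this notebook uses 👋). A possible variant, offered as a sketch and not as a replacement, cuts at the first check-host URL and strips the symbols left dangling before it. The name `shorten_at_first_link` is made up for this illustration.

```python
import re

# hypothetical helper: cut at the first check-host link, whatever precedes it
LINK_PATTERN = re.compile(r"https?://check-host\.net")

def shorten_at_first_link(text):
    # leave non-string values (e.g. NaN) untouched
    if not isinstance(text, str):
        return text
    match = LINK_PATTERN.search(text)
    if match is None:
        return text
    head = text[:match.start()]
    # drop marker emoji and whitespace left dangling before the link
    return re.sub(r"[^\w:.!?)]+$", "", head)

example = ("We shut down the portal of the Swedish privacy protection:"
           "👋https://check-host.net/check-report/fbe2fd3k6e👉Subscribe")
print(shorten_at_first_link(example))
```

Unlike `shorten_string`, this variant keeps any label that sits between the ❌ marker and the link (for example a target name), which may or may not be desirable for the later categorization.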


In [100]:
# set of all the trimmed messages
message_set = set(df["Text"])

# function to categorize messages
def categorize_message(message):
    message = str(message)
    for principle, pattern in du.cia_principles_patterns.items():
        if re.search(pattern, message, re.IGNORECASE):
            return principle
    return 'Unknown'

# dictionary to store the assigned principles
principle_dict = {"message": [],
                  "principle": []
                  }

# assigning each message to a CIA principle
for message in message_set:
    principle = categorize_message(message)
    principle_dict["message"].append(message)
    principle_dict["principle"].append(principle)

# viewing the assigned principles qualitatively
principle_df = pd.DataFrame.from_dict(principle_dict)
print(principle_df)



In [101]:
# counts for each principle
print(principle_df["principle"].value_counts())

principle
Unknown            2265
availability        548
confidentiality       3
integrity             2
Name: count, dtype: int64


In [102]:
# taking a closer look at the confidentiality and integrity messages
principles_of_interest = ["confidentiality", "integrity"]
principle_mask = principle_df['principle'].isin(principles_of_interest)
filtered_df = principle_df[principle_mask]
filtered_df

Unnamed: 0,message,principle
448,We killed the website of the Swedish Privacy Protection Authority:,confidentiality
541,We shut down the portal of the Swedish privacy protection:👋https://check-host.net/check-report/fbe2fd3k6e👉Subscribe to NoName057(16)🐻Join our DDoS-project⚠️Subscribe to reserve channel🇷🇺Victory will be ours!,confidentiality
689,"Russia🇷🇺 is almost single-handedly standing up to the so-called deep state forces.Let’s explain what this beast is and what it entails👨🏻‍💻In the US, the term ""deep state"" gained prominence in 2007. It was then used to describe the military-industrial complex of the United States, which repeatedly lobbied for the country’s involvement in wars across various regions of the globe.Currently, the ""deep state"" essentially controls American finances and media, certain intelligence agencies, the leadership of the European Union, the political elites of the Baltic States, Ukraine, Moldova, and partially Poland. It controls Macron, lobbies for the interests of the Democratic Party, and, consequently, doesn’t like Trump😁The goals pursued by this hegemon have long been clear and stated: global control in all regions of the world, control over the world’s natural resources, a radical reduction of the human population to 1.5–2 billion people, and the alteration of the very nature of human existence.In reality, as grandiose as it may sound, the fate of the world order and human civilization is being decided in the confrontation between Russia and the deep state💪🏻And we are part of this struggle, friends! Naturally, we are on Russia’s side😈Spoiler: Tomorrow we will talk about a FAILED STATE. We look forward to your comments with the name of a country that fits this description😉Follow us➡️Russian version|DDoSia Project|Reserve channel",integrity
1156,We continue to punish the Swedish russophobes🇸🇪 - we shut down the portal of the Swedish Privacy Protection Authority:,confidentiality
1783,"🔻French President Emmanuel Macron accused Russia of violating the territorial integrity of Armenia, and also promised his ward Zelensky “support until victory.”Well, our team is ready to violate the territorial integrity of the French segment of the Internet today!😉🇫🇷",integrity
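
The five messages above appear to be keyword hits, not confidentiality or integrity attacks: the Swedish Privacy Protection Authority messages describe a site being taken down, and "territorial integrity" is a political phrase. Since `du.cia_principles_patterns` is not shown in this notebook, the patterns below are hypothetical. The sketch returns every matching principle instead of only the first, so ambiguous messages can be flagged for manual review.

```python
import re

# hypothetical patterns, NOT the contents of du.cia_principles_patterns
example_cia_patterns = {
    "confidentiality": r"\b(privacy|leak\w*|confidential\w*)\b",
    "integrity": r"\b(integrity|deface\w*)\b",
    "availability": r"\b(ddos|shut down|not working|killed)\b",
}

def matching_principles(message, patterns):
    # return every principle whose pattern matches, not just the first one
    text = str(message)
    return [principle for principle, pattern in patterns.items()
            if re.search(pattern, text, re.IGNORECASE)]

message = "We killed the website of the Swedish Privacy Protection Authority:"
print(matching_principles(message, example_cia_patterns))
```

A message that matches more than one principle, as this one does, is a natural candidate for a manual check before it is counted.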
