# COGS 108 - Data Checkpoint

## Authors

Megan Yu: Data Overview + Instruction questions for Dataset #1

Olivia Huard: Instruction questions 1 and 2 for Dataset #2

Shanmukhi Nandiraju: Data Overview + Instruction questions 3, 4, 5 for Dataset #2 + Project proposal section modifications

Harshatha Prasanna: Data Overview + Instruction questions for Dataset #1

Jazely Tong: Data Overview + Instruction questions for Dataset #1

### Project Proposal Fixes:

Hi! Thank you so much for taking the time to go through our project proposal and give us feedback on it! Here are the changes we made to our project proposal to accommodate the suggestions (we also included this message in a comment under the GitHub issue):

-> To address the feedback on our Research Question, we specified that we will use a Random Forest model and listed the specific lexical features (length, entropy, and special characters) we will use for classification. We also moved our 90% accuracy target to the Hypothesis section.

-> We expanded our Hypothesis to include a broader range of features, specifically character entropy and URL length, as suggested. We also incorporated our 90% accuracy goal into this section.

-> We have fully discussed and completed our project timeline, adding specific tasks and goals for each individual and group meeting, along with meeting dates, to ensure we stay on track for the checkpoints and the overall project.

Please do let us know if there is anything else lacking. Thank you again!

## Research Question

Can we develop a Random Forest classification model to predict whether a URL is benign, phishing, malware or defacement based on lexical features such as URL length, character entropy, and the frequency of special characters? This predictive task aims to evaluate which combination of these structural attributes most effectively identifies maliciousness across various cybersecurity threat categories.

## Background and Prior Work

Malicious URLs are a major vector for phishing, malware delivery, and website defacement, and attackers increasingly embed harmful intent in the structure of a URL itself. These URLs often imitate legitimate sites, mislead users into entering sensitive information, or redirect them to compromised resources. Recent studies highlight their growing prevalence: Mittal (2023) notes that anti-phishing systems confronted over half a billion phishing attempts in 2022, underscoring the persistence and escalation of URL-based attacks.<a name="cite1"></a><a href="#ref1">1</a> Similarly, Omolara and Alawida (2025) report that malicious links rose by 144% in a single year, driven by techniques such as social engineering, obfuscation, and automated URL generation.<a name="cite2"></a><a href="#ref2">2</a> Collectively, this research shows that understanding URL characteristics is essential for mitigating harms associated with phishing and related threats.

Traditional blacklist-based defenses, such as browser-integrated URL safety checks, were among the earliest strategies for blocking malicious links. However, blacklist systems fundamentally depend on previously identified threats. As Omolara and Alawida explain, attackers frequently rotate domain names, use URL shorteners, manipulate HTTP/HTTPS presentation, or rely on fast-flux hosting, techniques that allow newly created malicious URLs to evade blacklist detection long enough to cause damage.<a href="#ref2">2</a> This limitation has motivated a shift toward automated, feature-based analysis that can evaluate a URL’s structure without requiring prior knowledge of whether it is malicious.

Prior work has analyzed the internal structure and lexical attributes of URLs to uncover which characteristics most reliably distinguish malicious links from benign ones. The literature identifies several strong predictors: the presence of an IP address instead of a domain name, abnormal anchor tags, unusually long URLs, excessive special characters, multi-subdomain patterns, and the use of misleading prefix/suffix tokens such as “-secure” or “-verify.” Mittal (2023) outlines dozens of such features, demonstrating how attributes like redirection counts, “@” symbols, URL entropy, and delimiter frequency can signal phishing activity.<a href="#ref1">1</a> The recent survey by Tian et al. (2025) further categorizes these features into lexical, host-based, and content-based groups, emphasizing the predictive power of purely lexical features derived from characters, symbols, and token patterns within the URL string.<a name="cite3"></a><a href="#ref3">3</a> This aligns directly with our project’s focus on character-level URL analysis.

In addition to feature engineering research, many studies have evaluated machine learning models for malicious URL classification. Mittal (2023) demonstrates that interpretable “glass box” models such as Logistic Regression and Decision Trees can achieve 90-95% accuracy using around 30 lexical and reputation-based features.<a href="#ref1">1</a> More advanced approaches, such as the ensemble techniques evaluated in Omolara and Alawida’s DaE2 framework, reach up to 98% accuracy using boosting, bagging, and stacked models trained on large malicious-URL datasets.<a href="#ref2">2</a> Tian et al. (2025) corroborate these findings, noting that character-level and token-level feature extraction remain among the strongest signals for ML-based URL detection.<a href="#ref3">3</a> Together, this body of work establishes a strong foundation for predicting URL maliciousness using structural properties alone.

Our project builds directly on these findings by focusing specifically on correlations between URL characters, symbols, and structural patterns, and whether those characteristics can be used to distinguish malicious URLs (e.g., phishing, malware, defacement) from benign ones. While prior studies have explored broad sets of lexical and host-based features, fewer have isolated the predictive value of character-level traits such as symbol frequency, delimiters, suspicious token patterns, and URL length variations. By quantifying these relationships and applying predictive modeling, our study aims to evaluate how well these standalone URL characteristics can determine URL authenticity and contribute to early malicious-URL detection.

References  
<a name="ref1"></a>  
Mittal, S. (2023). Explaining URL Phishing Detection by Glass Box Models. IC3 2023.  
https://doi.org/10.1145/3607947.3608059  
<a href="#cite1">^</a>

<a name="ref2"></a>  
Omolara, O. E., & Alawida, M. (2025). DaE2: Unmasking Malicious URLs via Consensus From Diverse Techniques.  
https://doi.org/10.1016/j.cose.2024.104170  
<a href="#cite2">^</a>

<a name="ref3"></a>  
Tian, Y., et al. (2025). From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets, and Code Repositories. arXiv:2504.16449. https://arxiv.org/abs/2504.16449  
<a href="#cite3">^</a>


## Hypothesis


Our group predicts that URLs categorized as malicious will exhibit higher character entropy, much longer string lengths, and a higher frequency of special characters (such as '$', '@', '!', and '%') compared to benign URLs. We also predict that a classification model trained on these lexical features will achieve at least 90% accuracy in distinguishing between threat categories.

## Data

### Dataset #1

  - **Dataset Name**: Malicious URLs dataset
  - **Link to the dataset**: https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset
  - **Number of observations**: 651,191 total URLs are included in the dataset.
  - **Number of variables**: The dataset contains 2 variables: `url` and `type`.

  - **Description of the variables most relevant to this project:**
    - **url** (string): The web address itself
    - **type** (categorical label): The kind of URL:
      - benign: legitimate URL
      - defacement: website appearance has been altered
      - phishing: designed to trick users into giving private info
      - malware: URL leads to harmful downloads
  - **Descriptions of any shortcomings this dataset has with respect to the project**
    - This dataset has a limited number of variables and no numerical features, so features must be engineered from the URL strings before a model can be trained.
    - This dataset consists of 66% "benign" URLs. This imbalance can make training models challenging because most models may simply learn to predict "benign" due to the skew in the data.
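
The class imbalance noted above can be quantified with a short calculation. The sketch below takes the per-class counts printed by the Dataset #1 cleaning cell (after de-duplication) and computes the accuracy of always guessing the majority class, along with "balanced" class weights defined as n_samples / (n_classes * class_count). The helper name `balanced_class_weights` is ours and purely illustrative; it is one possible way to counter the skew, not a decision we have made yet.

```python
# Class counts after de-duplication, copied from the value_counts() output
# of the Dataset #1 cleaning cell.
counts = {"benign": 428080, "defacement": 95308, "phishing": 94092, "malware": 23645}


def balanced_class_weights(class_counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {label: total / (n_classes * n) for label, n in class_counts.items()}


total = sum(counts.values())
majority_label = max(counts, key=counts.get)
# Accuracy of a "model" that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

weights = balanced_class_weights(counts)
print(f"Majority class: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
for label, weight in weights.items():
    print(f"{label:>10}: weight {weight:.2f}")
```

Rare classes such as malware receive the largest weights, so mistakes on them would cost more during training. Any accuracy figure we report should also be read against the majority-class baseline computed here.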

### Dataset #2

- **Dataset Name:** URL-Phish: A Feature-Engineered Dataset for Phishing Detection
- **Link to the dataset:** [https://data.mendeley.com/datasets/65z9twcx3r/1](https://data.mendeley.com/datasets/65z9twcx3r/1)
- **Number of observations:** The dataset contains 111,660 unique URLs (the raw file we loaded has 116,600 rows; see the shape printed by the loading cell below)
- **Number of variables:** There are 26 total columns. This includes 22 numerical feature columns, 3 string reference columns (url, dom, tld), and 1 binary label column
- **Description of the variables most relevant to this project:**
    - **label:** The target variable, where 0 represents benign and 1 represents phishing
    - **url_len:** The total length of the URL in characters, which has a mean of 32.95 but reaches a maximum of 1,202
    - **entropy:** A measure of the randomness of characters in the URL, ranging from 2.65 to 6.03 bits
    - **digit_ratio:** The proportion of numerical digits relative to the total URL length. Most samples have a low ratio (mean 0.013), but some reach as high as 0.826
    - **is_https:** A binary flag (0 or 1) indicating if the URL uses a secure connection; approximately 43.1% of the dataset uses HTTPS
- **Descriptions of any shortcomings this dataset has with respect to the project:**
    - The dataset is heavily skewed, with phishing samples making up only 14.2% of the data compared to 85.8% for benign samples. This imbalance can cause models to be biased toward predicting "benign" by default
    - There is a sampling bias in the benign data, because benign URLs were sourced exclusively from "trusted sources" such as educational (.edu), governmental (.gov), and top-ranked domains. This may not accurately represent the full diversity of legitimate URLs across the broader internet
    - There are also temporal limitations: the phishing samples were collected between November 2024 and September 2025, and since phishing tactics change constantly, the features may become weaker predictors of a site's legitimacy over time


**Plan to combine these datasets:** By combining Datasets 1 and 2, we hope to combat the limited feature set in Dataset 1 by incorporating the additional numerical variables recorded in Dataset 2. Both datasets contain a variable that identifies whether the URL is benign or malicious, so they will be standardized for a common binary format following Dataset 2’s structure where 0 represents benign and 1 represents malicious. To combine them, the numerical features in Dataset 2 will be engineered for Dataset 1 to have both datasets share the same column structure before being aligned. We will also remove duplicated URLs as necessary to avoid any bias.
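
The plan above can be sketched in a few lines of plain Python. The function names (`to_binary_label`, `lexical_features`, `build_rows`) are hypothetical, and the feature definitions are our assumption about how Dataset #2's columns were computed; they would need to be checked against that dataset's documentation before the two sources are aligned.

```python
def to_binary_label(type_label):
    """Map Dataset #1's four-class label to Dataset #2's scheme (0 = benign, 1 = malicious)."""
    return 0 if type_label.strip().lower() == "benign" else 1


def lexical_features(url):
    """Compute a few Dataset #2-style columns from a raw URL string.

    Assumption: these definitions only approximate how Dataset #2 was built.
    """
    n = len(url)
    digit_cnt = sum(ch.isdigit() for ch in url)
    return {
        "url_len": n,
        "digit_cnt": digit_cnt,
        "digit_ratio": digit_cnt / n if n else 0.0,
        "dot_cnt": url.count("."),
        "dash_cnt": url.count("-"),
        "under_cnt": url.count("_"),
        "is_https": int(url.lower().startswith("https://")),
    }


def build_rows(records):
    """records: iterable of (url, type) pairs; returns de-duplicated feature rows."""
    seen = set()
    rows = []
    for url, type_label in records:
        if url in seen:  # skip repeated URLs so they are not double-counted
            continue
        seen.add(url)
        row = lexical_features(url)
        row["label"] = to_binary_label(type_label)
        rows.append(row)
    return rows


sample = [
    ("br-icloud.com.br", "phishing"),
    ("mp3raid.com/music/krizz_kaliko.html", "benign"),
    ("br-icloud.com.br", "phishing"),  # duplicate, will be skipped
]
rows = build_rows(sample)
print(rows)
```

The same per-URL function could be applied to the `url` column of Dataset #1 with pandas, after which both tables would share column names and could be concatenated.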

In [1]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

In [2]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
#%pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
] 

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv


### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you


In [3]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

import pandas as pd
import numpy as np
import os

#loading dataset 
df1 = pd.read_csv("data/00-raw/malicious_urls.csv")

#data snippet  
print("First 5 rows of the dataset:")
display(df1.head())

# dataset information 
print("\nDataset info:")
df1.info()

# dataset size
print("\nDataset size:")
print("Rows:", df1.shape[0])
print("Columns:", df1.shape[1])

# check for missing data, there are no missing values 
print("\nMissing Data By Col:")
print(df1.isnull().sum())

# getting rid of duplicates
df1 = df1.drop_duplicates()

# clean label column
df1['type'] = df1['type'].str.lower().str.strip()

# add URL length as a new feature
df1['url_length'] = df1['url'].str.len()

# extremely long URLs are potential outliers (~0.004%) , we will keep these outliers as they will help us determine commonalities between benign and non-benign urls
outliers = df1[df1['url_length'] > 1000]
print("\nNumber of suspiciously long URLs (>1000 chars):", outliers.shape[0])
df1['very_long_url'] = df1['url_length'] > 1000

# summary statistics
print("\nURL length summary statistics:")
print(df1['url_length'].describe())

print("\nCounts per URL type:")
print(df1['type'].value_counts())

df1.head()

# saving the cleaned dataset to the processed folder
processed_path = "data/02-processed/malicious_urls_clean.csv"
os.makedirs(os.path.dirname(processed_path), exist_ok=True)
df1.to_csv(processed_path, index=False)

First 5 rows of the dataset:


|   | url | type |
|---|-----|------|
| 0 | br-icloud.com.br | phishing |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement |



Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651191 entries, 0 to 651190
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     651191 non-null  object
 1   type    651191 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB

Dataset size:
Rows: 651191
Columns: 2

Missing Data By Col:
url     0
type    0
dtype: int64

Number of suspiciously long URLs (>1000 chars): 24

URL length summary statistics:
count    641125.000000
mean         59.762232
std          44.894451
min           1.000000
25%          32.000000
50%          47.000000
75%          76.000000
max        2175.000000
Name: url_length, dtype: float64

Counts per URL type:
type
benign        428080
defacement     95308
phishing       94092
malware        23645
Name: count, dtype: int64



### Dataset #2: Mix of Benign and Phishing URLs (Specified Characters and Measures of Characters)


The most important metrics in this dataset for answering our research question are the counts of different characters, the label, URL length, entropy, whether the URL uses HTTPS, and digit ratio. Most of these are integers: the character counts, the HTTPS flag, the label, and URL length. The exceptions are entropy and digit ratio, which are both floats.

The character counts have no physical unit; each is simply the number of times a given character appears in a URL. For example, if a URL is “yay12_”, then the underscore count (`under_cnt`) is 1. These counts will help us see whether particular characters are associated with a URL being benign or phishing.

The label also has no unit. Its value is either 0 or 1 and tells the reader whether a URL is benign (0) or phishing (1).

URL length likewise has no special unit; it is the number of characters in a given URL. For instance, the URL “fake_url” has a length of 8. This will help us see whether benign URLs are shorter or longer on average than phishing URLs.

Entropy measures how random the characters in a URL are. It is a float measured in bits: a URL that repeats a few characters has low entropy, while a URL whose characters are spread evenly over many different symbols has high entropy. In this dataset it ranges from 2.65 to 6.03 bits. This lets us see whether benign URLs are more or less random than phishing URLs.
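
To make the entropy metric concrete, here is a minimal sketch of Shannon entropy over the characters of a string. We are assuming that Dataset #2's `entropy` column follows this standard definition (its values are reported in bits, which is consistent with it), but we have not confirmed the exact formula the dataset authors used. The function name `shannon_entropy` is ours.

```python
import math
from collections import Counter


def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    # Sum of p * log2(1 / p) over each distinct character's relative frequency p.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


for example in ["aaaa", "aabb", "abcd", "mp3raid.com/music/krizz_kaliko.html"]:
    print(f"{example!r}: {shannon_entropy(example):.2f} bits")
```

A string made of one repeated character scores 0 bits, and a string of four distinct characters scores 2 bits, so longer URLs with many different symbols land at the higher end of the range.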

Digit ratio is the proportion of a URL's characters that are digits. It is a float between 0 and 1; in this dataset it averages 0.013 but reaches as high as 0.826. This lets us see whether benign URLs use more digits on average than phishing URLs.

Finally, `is_https` records whether a URL uses HTTPS. It is a binary flag; following the usual convention for an `is_` flag, we read 1 as using HTTPS and 0 as not using it. This will allow us to see whether HTTPS use is related to the legitimacy of a URL.

The major concern with this dataset is that it is mostly made up of benign URLs, so we have less data on malicious URLs. As a result, our model may be quite good at predicting that a URL is benign but less adept at identifying phishing URLs. This dataset also cannot answer the other part of our research question, whether a URL is malware or defacement, because it only labels URLs as benign or phishing; we will need the other dataset to fill that blind spot. Finally, the phishing samples come from a fixed collection window, so the model may work best on URLs from that timeframe and become less accurate as phishing tactics change, unless it is retrained on accurate, up-to-date data. These are issues to keep in mind as we attempt to answer our research question.


In [4]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE

import pandas as pd
import numpy as np
import os

# loading the raw dataset
df = pd.read_csv('data/00-raw/URL_Phish_FeatureSet.csv')

# demonstrating size and tidiness
# --> the dataset is tidy because every column is a lexical variable and every row is a unique url
print(f"Dataset Shape: {df.shape}")
df.head() 

# finding missing data and assessing systematic bias
# --> if nulls are rare and scattered, they are likely missing at random
null_values = df.isnull().sum()
print("Missing Values per Column:\n", null_values)

# summary statistics, used to spot outliers (values more than 3 sd from the mean)
stats = df.describe()
print(stats)

# cleaning the data with dropna() because with 116,600 rows, dropping a few rows will not hurt
processed_df = df.dropna(how='any').copy()

# final wrangling: dropping string reference columns (url, dom, tld)
# --> because these are preserved for interpretability but not used for model training
processed_df = processed_df.drop(columns=['url', 'dom', 'tld'])

# writing the fully wrangled version of the data to data/02-processed
processed_df.to_csv('data/02-processed/URL_Phish_Final.csv', index=False)

# summary statistics for target variable distribution (0 = benign, 1 = phishing)
print("Final Label Distribution:\n", processed_df['label'].value_counts())

Dataset Shape: (116600, 26)
Missing Values per Column:
 url              0
url_len          0
dom              0
dom_len          0
is_ip            0
tld             14
tld_len          0
subdom_cnt       0
letter_cnt       0
digit_cnt        0
special_cnt      0
eq_cnt           0
qm_cnt           0
amp_cnt          0
dot_cnt          0
dash_cnt         0
under_cnt        0
letter_ratio     0
digit_ratio      0
spec_ratio       0
is_https         0
slash_cnt        0
entropy          0
path_len         0
query_len        0
label            0
dtype: int64
             url_len        dom_len          is_ip        tld_len  \
count  116600.000000  116600.000000  116600.000000  116600.000000   
mean       32.952521      12.845763       0.000111       3.320223   
std        29.369989       5.140745       0.010558       1.418215   
min        12.000000       4.000000       0.000000       0.000000   
25%        22.000000       9.000000       0.000000       2.000000   
50%        27.000000   
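
The printed statistics hint that the 14 missing `tld` values may not be missing at random. Before `dropna`, `is_ip` has a nonzero mean and `tld_len` has a minimum of 0, while in the processed file summarized in the next cell `is_ip` is 0 everywhere and the minimum `tld_len` is 2. One possible check, sketched below on an invented toy frame, is to cross-tabulate the missingness of `tld` against `is_ip`. The helper `missingness_by_flag` is hypothetical, and the toy values are not taken from the dataset.

```python
import pandas as pd


def missingness_by_flag(df, missing_col, flag_col):
    """Cross-tabulate whether `missing_col` is null against the values of `flag_col`."""
    return pd.crosstab(
        df[missing_col].isnull(),
        df[flag_col],
        rownames=[f"{missing_col}_missing"],
        colnames=[flag_col],
    )


# Toy stand-in for the raw file; the values are invented for illustration only.
toy = pd.DataFrame({
    "url": ["a.com/x", "http://10.0.0.1/login", "b.org", "http://192.168.1.5/"],
    "tld": ["com", None, "org", None],
    "is_ip": [0, 1, 0, 1],
})
table = missingness_by_flag(toy, "tld", "is_ip")
print(table)
```

If the real data shows the same pattern, the rows removed by `dropna` would be exactly the IP-address URLs. In that case `is_ip` becomes a constant column, and it may be better to fill the missing `tld` with a placeholder than to drop those rows.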

In [5]:
# this cell is just to learn some more about the dataset!

df2 = pd.read_csv('data/02-processed/URL_Phish_Final.csv')

print("feature columns:", df2.columns.tolist())

print("\nsummary statistics:\n", df2.describe())

print("\nclass distribution:\n", df2['label'].value_counts(normalize=True))

print("\ndata types:\n", df2.dtypes)

feature columns: ['url_len', 'dom_len', 'is_ip', 'tld_len', 'subdom_cnt', 'letter_cnt', 'digit_cnt', 'special_cnt', 'eq_cnt', 'qm_cnt', 'amp_cnt', 'dot_cnt', 'dash_cnt', 'under_cnt', 'letter_ratio', 'digit_ratio', 'spec_ratio', 'is_https', 'slash_cnt', 'entropy', 'path_len', 'query_len', 'label']

summary statistics:
              url_len        dom_len     is_ip        tld_len     subdom_cnt  \
count  116586.000000  116586.000000  116586.0  116586.000000  116586.000000   
mean       32.952061      12.845728       0.0       3.320622       0.845745   
std        29.371173       5.141025       0.0       1.417833       0.480524   
min        12.000000       4.000000       0.0       2.000000       0.000000   
25%        22.000000       9.000000       0.0       2.000000       1.000000   
50%        27.000000      11.000000       0.0       3.000000       1.000000   
75%        34.000000      16.000000       0.0       3.000000       1.000000   
max      1202.000000      62.000000       0.0   
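
The cleaning cell above mentions outliers more than 3 standard deviations from the mean, but it only prints summary statistics. A minimal sketch of actually flagging such values is shown below, using only the standard library so it runs without the data files. The function name `flag_outliers` is ours, and the list contains nineteen invented "typical" URL lengths plus the reported maximum of 1,202.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return a list of booleans: True where |z-score| exceeds `threshold`."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, as in DataFrame.describe()
    if sigma == 0:
        return [False] * len(values)
    return [abs(v - mu) / sigma > threshold for v in values]


# Nineteen invented typical URL lengths plus one extreme value.
url_lens = [22, 27, 34, 25, 30, 28, 24, 26, 31, 29,
            23, 27, 33, 22, 28, 30, 25, 26, 29, 1202]
flags = flag_outliers(url_lens)
outliers = [v for v, is_out in zip(url_lens, flags) if is_out]
print(outliers)
```

As with Dataset #1, flagged rows could be marked with a boolean column instead of being removed, since unusually long URLs may themselves be informative about phishing.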

## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Most of the datasets we are currently looking at are publicly available, so the people whose URLs are included have not consented to their data being used in our specific project.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The datasets we are currently looking at were published between 1 and 5 years ago, so we have temporal bias (unless we find more recent datasets to pull from): a model trained on older data may fail to recognize more recent threats or patterns because it would prioritize characteristics that are outdated.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Yes, our project specifically needs to focus on this ethical concern because URLs can frequently contain PII like emails or user IDs. We'd need to consider anonymizing these URLs or stripping these types of parameters to avoid exposing any user data.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We've considered that malicious URLs may be more frequently associated with certain regions or languages, and that there's a possibility our model could unfairly flag benign sites if they have structural similarities with known malicious domains, which is something we should try to minimize if possible.
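
As a rough illustration of the parameter stripping mentioned under A.3, the sketch below removes the query string and fragment from a URL and masks anything that looks like an email address. The function name `redact_url`, the regular expression, and the example URL are all our own assumptions; a real redaction step would need broader patterns (for example user IDs embedded in paths).

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Simple email-like pattern; an assumption for illustration, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_url(url):
    """Drop the query string and fragment, then mask email-like substrings."""
    parts = urlsplit(url)
    stripped = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return EMAIL_RE.sub("<email>", stripped)


print(redact_url("http://example.com/reset/jane.doe@example.org?uid=12345&token=abc#top"))
```

Because stripping the query string changes lexical features such as query length, a step like this fits best when URLs are displayed in reports, not before feature extraction.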

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Yes, we realize that storing a large database of malicious URLs could be a security risk if it is accessed by people who intend to study existing malware and use what they learn to cause more attacks, so we plan to keep the data secured within the repo and datahub.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> We've considered this, but since we are using third-party datasets we don't really have a way to remove someone's URL upon request. We don't know if there is a workaround for this ethical concern in our project's case.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Yes, we will delete the data once we finish the project and the quarter ends.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Yes, we think one main blind spot is that we aren't cybersecurity experts, so we lack a lot of perspective when it comes to understanding certain aspects of phishing. The main thing we can do to address this is to research and learn more about it before we start.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We talked about how the model may become biased toward over-represented classes if the training data is imbalanced, for example if we use more malware URLs to train it than defacement URLs. To mitigate this, we need to make sure the model doesn't just "learn" to guess the most common class.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Yes, our goal is for our visualizations and reports to define what makes up a malicious URL as clearly as we can, to avoid misleading audiences in any way.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We have considered this, and we will ensure that raw URLs containing any PII are not displayed in any visualizations or in the final report.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> Yes, all of our process will be documented in Jupyter Notebooks and pushed to our GitHub repo to make sure the results are reproducible and accessible if we discover any issues in the future.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Yes, we acknowledge this, and we will research the variables thoroughly enough to ensure they aren't acting as proxies for discrimination against legitimate websites.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will test for disparate error rates. For example, if the model is far more accurate for ".com" domains than for ".org" domains, then the model is not fair, and we will either try to address this bias or report that failure in the final report.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We've set a goal of 90% accuracy, but we also realize the cost of false positives (blocking a benign site) versus false negatives (letting malware pass through undetected), since a model with 90% accuracy that blocks 10% of safe sites across the world would be a big problem realistically.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> Yes. We can be transparent about our features and highlight which ones led a URL to be marked as malicious, and on that basis provide a technical justification in understandable terms.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Yes, in the final report and in any other documentation where it's important to mention the shortcomings, we will make it clear that this is a class project and not a legitimate security tool in any way.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> No, we do not have a long-term deployment plan after the final project is completed.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Yes, we would need a plan for when our model wrongly labels legitimate sites as malicious. We discussed ways to handle this, for example proactively updating the model as the digital landscape changes or whenever we see a way to improve it to avoid harmful results.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Yes, we agreed there should be a defined mechanism to turn off or roll back the model if it starts to cause unintended harm or badly incorrect results.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Yes, as mentioned before, we realize the model could be misused: someone could study the algorithm, or test their own malicious URLs against it until they slip past our 90% accuracy, so our project could potentially become a tool for creating better scams. We acknowledge that if this were to be deployed we would need some sort of monitoring plan to prevent such misuse.

## Team Expectations 

These are the expectations that the team agrees to:

* If it's directed at a part of the project you're working on, make sure to respond to the group messages within 24 hours.
* Make sure to be open-minded when hearing others' ideas.
* Messaging about the project will be conducted in the group chat.
* Make sure the feedback we give to others is kind and not overall negative, and phrase things in a non-judgmental way.
* We will have a rough meeting day and time every week, but we'll send out a WhenToMeet form weekly to see if that slot works best for a given week, or if a better day/time in a week works best to meet
* We will communicate decisions that are important for the full group to know. If it's something less important, say we're working on a section with another member, and we're changing up a paragraph, then that doesn't necessarily have to be mentioned, unless the members making the changes want to share.
* Everyone will do a mix of everything. We will split up different things to work on, making an effort to make sure the work is evenly divided.
* We will check in every week to see if people need support in doing a task, and to see what tasks need to get done for the week.
* If we get off task while working together and are being noisy, please no shushing. Instead, say something like "let's lock in guys." "Let's lock in" will be the phrase we use to signal to the whole team (or ourselves) that we need to get back on task.
* We will make a list of the tasks to do per week on the shared Google Doc. We will go over it every week together, see what's been completed, and decide how to move forward on the project for the upcoming week to make sure we're all on the same page.
* If someone is struggling on their section, they can reach out as soon as they would like to chat it out so we can help as a group, as that can make work go faster.
* The person struggling should text
* If someone knows they won't have something done by a deadline we set, they can communicate that without fear of judgement. We will work as a team to see what needs to get done, and help if needed. If it takes an extra day, or more time than expected, keep open communication with the group, and tell us if you need help!
* If things get rough, keep an optimistic/not harshly negative attitude about the project during meetings
* If there's conflict, then there can be multiple things we do:

  - If comfortable, communicate with the other person directly, and keep this cordial and open to hearing the other. Make sure to let the other say what they would like to, and truly hear what each other is saying. We don't have to become best friends (while that would be awesome), but we do have to work together toward a common goal. If it's a truly bad argument, see if you and the other person can work toward neutral, if not friends, as we do all have the same goal.
  - If unsure about how to approach someone with conflict, go see Olivia, as she knows many different conflict techniques and has a camp counselor book that can give advice. She can help hold a conversation like a restorative circle if people are unsure about how to communicate between themselves and want help. The two who are in conflict can ask anyone to sit in on a conversation to make sure that both voices are being heard, or if they want extra support.
  - In the event everything goes terribly and there's a big fallout we can't work toward neutral with, go to the prof.

* Overall, make sure to communicate with the group, be understanding of others, be open to hearing others' ideas, deal with any conflict in a healthy way, meet weekly, and use the group Google Doc to divvy out work.

## Project Timeline Proposal

We will be building a machine learning model. Because of that, we will make sure to keep good communication between members and ask for support when needed.

We will send out a WhenToMeet form weekly to check whether the chosen days still work. If they don't, we will adjust to a different day/time early in the week so we can make sure to be on the same page for the week moving forward.

We will meet on Zoom or in person, depending on people's availability.


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/18  |  12:00-1:00 PM | Finalize datasets to use, divide up work for the checkpoint | Observe the datasets, note down important information to use for the checkpoint, clean/tidy the datasets, and finish the data and data overview sections. Make sure the Data Checkpoint is complete (due tonight) |
| 2/23-25  |  Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Write Python functions to extract lexical features like URL length, character entropy, and frequency of special characters ($, !, @, %, etc). Start initial Exploratory Data Analysis (EDA) using histograms to check feature distributions | Discuss progress and questions if any |
| 3/2-4  | Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Create box plots to compare feature frequencies across categories (Malware vs. Benign). Fit the initial Random Forest model. Check for overfitting by comparing training vs. validation accuracy | Discuss progress, questions, and fill in the Jupyter notebook for the EDA Checkpoint for submission |
| 3/9-11  | Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Fine-tune model parameters to reach the 90% accuracy goal. Perform a final check of the "Condition Number" to ensure features aren't redundant | Draft the "Limitations" section regarding cybersecurity expertise, address any questions, concerns, modifications needed, and plan any additional meetings we would need to have |
| 3/16-18  | Individual work on the Final Report (once we divide the work up), but will meet on Wednesday from 12:00-1:00 PM to record the presentation video and to wrap up any loose ends | Complete the technical Final Report | Record the 3-5 minute video presentation intended for a non-technical audience, and submit individual Team Evaluation surveys |
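
As a starting point for the feature extraction task planned for the week of 2/23, here is a rough sketch of what those Python functions might look like. The function names, the dictionary keys, and the exact set of special characters are placeholders we have not finalized, and character entropy is computed here as Shannon entropy (in bits) over the characters of the URL string.

```python
import math
from collections import Counter

# Placeholder set of special characters; the final list is still to be decided.
SPECIAL_CHARS = set("$!@%&=?#-_~+")


def char_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def lexical_features(url):
    """Return the three lexical features named in our research question."""
    return {
        "url_length": len(url),
        "char_entropy": char_entropy(url),
        "special_char_count": sum(ch in SPECIAL_CHARS for ch in url),
    }


# Example call on a made-up URL.
features = lexical_features("http://example.com/login?user=a@b.c")
```

Each URL becomes one small dictionary, so a list of these can be passed to `pandas.DataFrame` to get one row per URL for the EDA histograms and box plots.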