# COGS 108 - Project Proposal

## Authors

Megan Yu: Data section

Olivia Huard: Team Expectations, Hypothesis, and Timeline

Shanmukhi Nandiraju: Ethics section

Harshatha Prasanna: Background and Prior Work

Jazely Tong: Background and Prior Work

## Research Question

Can we develop a machine learning classification model that predicts whether a URL is benign, phishing, malware, or defacement with at least 90% accuracy, using only characteristics of the URL string itself (such as its length, symbols, and delimiters)? We will evaluate whether these characteristics are sufficient to determine a URL's authenticity.
This predictive task aims to identify which combinations of URL attributes most effectively predict maliciousness across these cybersecurity threat categories.

## Background and Prior Work

Malicious URLs are a major vector for phishing, malware delivery, and website defacement, and attackers increasingly embed harmful intent in the structure of a URL itself. These URLs often imitate legitimate sites, mislead users into entering sensitive information, or redirect them to compromised resources. Recent studies highlight their growing prevalence: Mittal (2023) notes that anti-phishing systems confronted over half a billion phishing attempts in 2022, underscoring the persistence and escalation of URL-based attacks.<a name="cite1"></a><a href="#ref1">1</a> Similarly, Omolara and Alawida (2025) report that malicious links rose by 144% in a single year, driven by techniques such as social engineering, obfuscation, and automated URL generation.<a name="cite2"></a><a href="#ref2">2</a> Collectively, this research shows that understanding URL characteristics is essential for mitigating harms associated with phishing and related threats.

Traditional blacklist-based defenses, such as browser-integrated URL safety checks, were among the earliest strategies for blocking malicious links. However, blacklist systems fundamentally depend on previously identified threats. As Omolara and Alawida explain, attackers frequently rotate domain names, use URL shorteners, manipulate HTTP/HTTPS presentation, or rely on fast-flux hosting, techniques that allow newly created malicious URLs to evade blacklist detection long enough to cause damage.<a href="#ref2">2</a> This limitation has motivated a shift toward automated, feature-based analysis that can evaluate a URL’s structure without requiring prior knowledge of whether it is malicious.

Prior work has analyzed the internal structure and lexical attributes of URLs to uncover which characteristics most reliably distinguish malicious links from benign ones. The literature identifies several strong predictors: the presence of an IP address instead of a domain name, abnormal anchor tags, unusually long URLs, excessive special characters, multi-subdomain patterns, and the use of misleading prefix/suffix tokens such as “-secure” or “-verify.” Mittal (2023) outlines dozens of such features, demonstrating how attributes like redirection counts, “@” symbols, URL entropy, and delimiter frequency can signal phishing activity.<a href="#ref1">1</a> The recent survey by Tian et al. (2025) further categorizes these features into lexical, host-based, and content-based groups, emphasizing the predictive power of purely lexical features derived from characters, symbols, and token patterns within the URL string.<a name="cite3"></a><a href="#ref3">3</a> This aligns directly with our project’s focus on character-level URL analysis.
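
To make the lexical features above concrete, here is a minimal sketch of how a few of them could be computed from a raw URL string. The feature names, the `SPECIAL_CHARS` set, and the IPv4 pattern are our own illustrative assumptions, not the exact definitions used in the cited papers.

```python
import math
import re
from collections import Counter

# Assumption: a rough IPv4-host pattern and an illustrative symbol set.
IPV4_HOST = re.compile(r"^(?:https?://)?\d{1,3}(?:\.\d{1,3}){3}(?:[:/]|$)")
SPECIAL_CHARS = set("@$!%&=?-_~")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Compute a few character-level features from the raw URL string."""
    return {
        "length": len(url),
        "num_dots": url.count("."),
        "num_at": url.count("@"),
        "num_special": sum(ch in SPECIAL_CHARS for ch in url),
        "has_ip_host": bool(IPV4_HOST.match(url)),
        "entropy": shannon_entropy(url),
    }


# Made-up example URL, for illustration only.
print(lexical_features("http://192.168.0.1/login-verify?user=a@b.com"))
```

Each URL becomes one row of numeric features, which is the form most classifiers expect as input.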

In addition to feature engineering research, many studies have evaluated machine learning models for malicious URL classification. Mittal (2023) demonstrates that interpretable “glass box” models such as Logistic Regression and Decision Trees can achieve 90-95% accuracy using around 30 lexical and reputation-based features.<a href="#ref1">1</a> More advanced approaches, such as the ensemble techniques evaluated in Omolara and Alawida’s DaE2 framework, reach up to 98% accuracy using boosting, bagging, and stacked models trained on large malicious-URL datasets.<a href="#ref2">2</a> Tian et al. (2025) corroborate these findings, noting that character-level and token-level feature extraction remain among the strongest signals for ML-based URL detection.<a href="#ref3">3</a> Together, this body of work establishes a strong foundation for predicting URL maliciousness using structural properties alone.

Our project builds directly on these findings by focusing specifically on correlations between URL characters, symbols, and structural patterns, and whether those characteristics can be used to distinguish malicious URLs (e.g., phishing, malware, defacement) from benign ones. While prior studies have explored broad sets of lexical and host-based features, fewer have isolated the predictive value of character-level traits such as symbol frequency, delimiters, suspicious token patterns, and URL length variations. By quantifying these relationships and applying predictive modeling, our study aims to evaluate how well these standalone URL characteristics can determine URL authenticity and contribute to early malicious-URL detection.

References  
<a name="ref1"></a>  
Mittal, S. (2023). Explaining URL Phishing Detection by Glass Box Models. IC3 2023.  
https://doi.org/10.1145/3607947.3608059  
<a href="#cite1">^</a>

<a name="ref2"></a>  
Omolara, O. E., & Alawida, M. (2025). DaE2: Unmasking Malicious URLs via Consensus From Diverse Techniques.  
https://doi.org/10.1016/j.cose.2024.104170  
<a href="#cite2">^</a>

<a name="ref3"></a>  
Tian, Y., et al. (2025). From Past to Present: A Survey of Malicious URL Detection Techniques, Datasets, and Code Repositories. arXiv:2504.16449. https://arxiv.org/abs/2504.16449  
<a href="#cite3">^</a>


## Hypothesis


* Our group predicts that a URL containing more symbols such as "$" or "!" will be more likely to be malicious than a URL containing fewer of them.

We expect this outcome because the URLs people typically see throughout the day rarely contain many exclamation points or dollar signs. For this reason, we think our data will show that malicious URLs contain more of the symbols that are uncommon in non-malicious URLs.
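
One way to check this prediction is to count the symbols in each URL and compare the average count per class. The sketch below assumes the data is available as `(url, label)` pairs; the `SYMBOLS` set and the toy rows are made up for illustration and are not results.

```python
from statistics import mean

SYMBOLS = "$!"  # symbols named in the hypothesis; extend as needed


def symbol_count(url: str, symbols: str = SYMBOLS) -> int:
    """Count how many characters of the URL belong to the symbol set."""
    return sum(url.count(s) for s in symbols)


def mean_symbols_by_label(rows):
    """rows: iterable of (url, label) pairs -> {label: mean symbol count}."""
    groups = {}
    for url, label in rows:
        groups.setdefault(label, []).append(symbol_count(url))
    return {label: mean(counts) for label, counts in groups.items()}


# Made-up toy rows for illustration only; not drawn from any real dataset.
toy_rows = [
    ("https://example.com/about", "benign"),
    ("https://example.org/docs/index.html", "benign"),
    ("http://example.test/win$$$now!", "malicious"),
    ("http://example.test/free!prize", "malicious"),
]
print(mean_symbols_by_label(toy_rows))
```

A difference in means on real data would still need a statistical test before we could treat it as support for the hypothesis.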

## Data

The ideal dataset for answering our research question would include a variable that labels each link as malicious or safe, plus a variable that distinguishes malicious links by type (malware, defacement, phishing, and so on). The links should vary in length and format and come from many different domains and servers so that the sample is diverse. We would need at least a few thousand observations, ideally collected from across the web. The data should be stored in a CSV file, with each URL's legitimacy clearly labeled as dangerous or safe.

<u>*Potential datasets we could use for this project include the following:*</u>
- This dataset is located at https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset. It is licensed under CC0: Public Domain, which means there is no copyright and we can use it without asking for permission. It contains over 650,000 URLs, each labeled as one of four classes: benign, defacement, phishing, or malware.
- This dataset is located at https://archive.ics.uci.edu/dataset/967/phiusiil+phishing+url+dataset. It is licensed under CC BY 4.0, which means we can use it freely if we give proper credit by citing the authors' work. It contains over 230,000 URLs with real, categorical, and integer features, such as URL length and letter ratios, and labels each URL as legitimate or phishing with a value of 0 or 1.
- This dataset is located at https://www.kaggle.com/datasets/harisudhan411/phishing-and-legitimate-urls. It is licensed under CC0: Public Domain. It contains over 800,000 URLs with a nearly equal number of harmful and safe URLs. Each URL is marked with a value of 0 or 1 to identify it as phishing or legitimate.
- This dataset is located at https://www.kaggle.com/datasets/hassaanmustafavi/phishing-urls-dataset. It is licensed under CC0: Public Domain. Similar to the one above, this dataset of over 450,000 URLs labels each URL as either legitimate or phishing.
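
As a first look at any of these files, we could load the URL and label columns and count how many rows fall in each class. The sketch below is a hedged example: the column names `url` and `type` are assumptions that should be checked against the actual CSV header, and `SAMPLE_CSV` is a made-up stand-in for the real file.

```python
import csv
import io
from collections import Counter

# Assumption: a header row with `url` and `type` columns. Made-up sample rows.
SAMPLE_CSV = """url,type
https://example.com/,benign
http://example.test/login,phishing
http://example.test/a.exe,malware
https://example.org/,benign
"""


def load_rows(handle, url_col="url", label_col="type"):
    """Read (url, label) pairs, skipping rows with a missing URL or label."""
    rows = []
    for record in csv.DictReader(handle):
        url = (record.get(url_col) or "").strip()
        label = (record.get(label_col) or "").strip().lower()
        if url and label:
            rows.append((url, label))
    return rows


rows = load_rows(io.StringIO(SAMPLE_CSV))
class_counts = Counter(label for _, label in rows)
print(class_counts)
```

For a real file, the same function can be called with `open(path, newline="", encoding="utf-8")` in place of the in-memory sample.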

## Ethics 

Instructions: Keep the contents of this cell. For each item on the checklist
-  put an X there if you've considered the item
-  IF THE ITEM IS RELEVANT place a short paragraph after the checklist item discussing the issue.
  
Items on this checklist are meant to provoke discussion among good-faith actors who take their ethical responsibilities seriously. Your teams will document these discussions and decisions for posterity using this section.  You don't have to solve these problems, you just have to acknowledge any potential harm no matter how unlikely.

Here is a [list of real world examples](https://deon.drivendata.org/examples/) for each item in the checklist that can refer to.

[![Deon badge](https://img.shields.io/badge/ethics%20checklist-deon-brightgreen.svg?style=popout-square)](http://deon.drivendata.org/)

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Most of the datasets we are considering are publicly available, so the people whose URLs are included have not consented to their data being used in our specific project.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The datasets we are considering were published between 1 and 5 years ago, so they carry temporal bias unless we find more recent data. A model trained on older data may fail to recognize newer threats or patterns because it prioritizes characteristics that are now outdated.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Yes, our project needs to focus on this concern specifically because URLs frequently contain PII such as emails or user IDs. We would need to consider anonymizing these URLs or stripping these kinds of parameters to avoid exposing any user data.
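
A minimal sketch of the stripping step, assuming we only want to keep the scheme, host, and path when a URL is displayed. The function name `strip_pii` is hypothetical, and this does not catch PII that appears inside the path itself.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_pii(url: str) -> str:
    """Drop userinfo, query string, and fragment; keep scheme, host, and path."""
    parts = urlsplit(url)
    host = parts.hostname or ""  # note: hostname is returned lowercased
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, "", ""))


# Made-up example URL, for illustration only.
print(strip_pii("https://user:pw@example.com:8080/reset?email=a@b.com#top"))
```

Because stripping removes symbols that our features would count, one option is to compute features on the raw URL and apply this step only to URLs that are shown in figures or the report.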

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We've considered that malicious URLs may be more frequently associated with certain regions or languages, and that our model could unfairly flag benign sites that share structural similarities with known malicious domains. We should minimize this where possible.

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Yes, we realize that storing a large collection of malicious URLs could be a security risk if it is accessed by people who intend to study existing malware and use what they learn to launch more attacks, so we plan to keep the data secured within the repo and datahub.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> Yes, we've considered this, but because we are using third-party datasets we don't have a way to remove someone's URL upon request. We don't know of a workaround for this concern in our project's case.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Yes, we will delete the data once we finish the project and the quarter ends.

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Yes, we think one main blind spot is that we aren't cybersecurity experts, so we lack perspective on certain aspects of phishing. The main way we can address this is by researching and learning more about it before we start.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We discussed how the model may become biased toward the most common class if the training data is imbalanced, for example if we train it on more malware URLs than defacement URLs. To mitigate this, we need to make sure the model doesn't just "learn" to guess the most common class.
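
Two quick checks can make this concern measurable: the accuracy of always guessing the most common class, and per-class weights that count rare classes more heavily during training. The sketch below uses the common "balanced" heuristic `n_samples / (n_classes * class_count)`; the label counts are made up for illustration.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always guessing the most common class."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


# Made-up label counts for illustration only.
toy_labels = ["benign"] * 6 + ["phishing"] * 3 + ["malware"] * 1
print(balanced_class_weights(toy_labels))
print(majority_baseline_accuracy(toy_labels))
```

Any model we train should beat the majority baseline by a clear margin; otherwise its accuracy mostly reflects the class imbalance.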

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Yes, our goal is for our visualizations and summaries to show what characterizes a malicious URL as clearly and accurately as we can, so that we avoid misleading audiences in any way.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We have considered this, and we will ensure that raw URLs containing any PII are not displayed in any visualizations or in the final report.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> Yes, our entire process will be documented in Jupyter Notebooks and pushed to our GitHub repo to make sure the results are reproducible and accessible if we discover any issues in the future.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Yes, we acknowledge this, and we will research the variables thoroughly enough to ensure they aren't acting as proxies for discrimination against legitimate websites.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will test thoroughly for disparate error rates. For example, if the model is much more accurate for ".com" domains than for ".org" domains, then the model is not fair, and we will either try to correct this bias or document the failure in the final report.
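
A hedged sketch of this check: group the test URLs by their last hostname label and compute accuracy within each group. The helper names are hypothetical, the toy URLs and predictions are made up, and taking the last dot-separated label is only a rough stand-in for a proper TLD lookup.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.\-]*://")


def tld_of(url: str) -> str:
    """Return the last dot-separated label of the hostname (a rough TLD)."""
    # URLs without a scheme need a leading "//" for urlsplit to find the host.
    normalized = url if SCHEME.match(url) else "//" + url
    host = urlsplit(normalized).hostname or ""
    return host.rsplit(".", 1)[-1] if "." in host else ""


def accuracy_by_tld(urls, y_true, y_pred):
    """Map each TLD to the fraction of its URLs that were predicted correctly."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for url, truth, pred in zip(urls, y_true, y_pred):
        tld = tld_of(url)
        totals[tld] += 1
        hits[tld] += int(truth == pred)
    return {tld: hits[tld] / totals[tld] for tld in totals}


# Made-up URLs, labels, and predictions for illustration only.
toy_urls = ["https://a.example.com/x", "http://b.example.com/y",
            "https://c.example.org/z", "d.example.org/w"]
toy_true = ["benign", "phishing", "benign", "malware"]
toy_pred = ["benign", "phishing", "phishing", "malware"]
print(accuracy_by_tld(toy_urls, toy_true, toy_pred))
```

Groups with very few URLs will give noisy accuracy estimates, so group sizes should be reported alongside the rates.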

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We've set a goal of 90% accuracy, but we also recognize the different costs of false positives (blocking a benign site) and false negatives (letting malware pass through undetected). A model with 90% accuracy that blocks 10% of safe sites across the world would be a big problem in practice.
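
To look beyond accuracy, we could collapse the four classes to malicious versus benign and report the false positive and false negative rates directly. This is a minimal sketch with a hypothetical helper `binary_rates` and made-up labels. It shows how a model can reach 90% accuracy while missing every malicious URL.

```python
def binary_rates(y_true, y_pred, benign_label="benign"):
    """Collapse labels to malicious-vs-benign and compute error rates."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        truth_bad = truth != benign_label
        pred_bad = pred != benign_label
        if truth_bad and pred_bad:
            tp += 1  # malicious URL correctly flagged
        elif pred_bad:
            fp += 1  # benign URL wrongly blocked
        elif truth_bad:
            fn += 1  # malicious URL missed
        else:
            tn += 1  # benign URL correctly allowed

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Made-up labels: a model that calls everything benign.
toy_true = ["benign"] * 9 + ["malware"]
toy_pred = ["benign"] * 10
print(binary_rates(toy_true, toy_pred))
```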

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> Yes. We can be transparent about our features and highlight which ones caused a URL to be marked as malicious, and from that we can provide a technical justification in understandable terms.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Yes, in the final report and in any other documentation where the shortcomings are relevant, we will make it clear that this is a class project and not a legitimate security tool in any way.

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> No, we do not have a long-term deployment plan after the final project is completed.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Yes, we would need a plan for when our model wrongly labels legitimate sites as malicious. We discussed ways to go about this, for example proactively updating the model as the digital landscape changes or whenever we see a way to improve it and avoid harmful results.

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Yes, we agreed our model should have a defined mechanism to turn it off or roll it back if it starts to cause unintended harm or very incorrect results.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Yes, as mentioned before, we realize that someone could study the model, or use it to test their own malicious URLs until they slip past our 90% accuracy, so our project could potentially become a tool for creating better scams. We acknowledge that if this were deployed we would need some sort of monitoring plan to prevent such misuse.

## Team Expectations 

These are the expectations that the team agrees to:

* If it's directed at a part of the project you're working on, make sure to respond to the group messages within 24 hours.
* Make sure to be open-minded when hearing others' ideas.
* Messaging about the project will be conducted in the group chat.
* Make sure that the feedback we give to others is kind and not overall negative, and phrase things in a non-judgmental way.
* We will have a rough meeting day and time every week, but we'll send out a WhenToMeet form weekly to see if that slot works best for a given week, or if a better day/time in a week works best to meet
* We will communicate decisions that are important for the full group to know. If it's something less important, say we're working on a section with another member, and we're changing up a paragraph, then that doesn't necessarily have to be mentioned, unless the members making the changes want to share.
* Everyone will do a mix of everything. We will split up the different things to work on, making an effort to divide the work evenly.
* We will check in every week to see if people need support in doing a task, and to see what tasks need to get done for the week.
* If we are working together and get off task or noisy, please no shushing. Instead, say something like "let's lock in guys." "Let's lock in" will be the phrase we use to signal to the whole team (or ourselves) that we need to get back on task.
* We will make a list of the tasks to do per week on the shared Google Doc. We will go over it every week together, see what's been completed, and decide how to move forward on the project for the upcoming week to make sure we're all on the same page.
* If someone is struggling on their section, they can reach out as soon as they would like to chat it out so we can help as a group, as that can make work go faster.
* The person struggling should text the group chat.
* If someone knows they won't have something done by a deadline we set, they can communicate that without fear of judgement. We will work as a team to see what needs to get done, and help if needed. If it takes an extra day, or more time than expected, keep open communication with the group, and tell us if you need help!
* If things get rough, keep an optimistic/not harshly negative attitude about the project during meetings
* If there's conflict, there are multiple things we can do:
    - If comfortable, communicate with the other person directly, and keep it cordial and open to hearing the other person. Make sure to let the other person say what they would like to, and truly hear what each other is saying. We don't have to become best friends (while that would be awesome), but we do have to work together toward a common goal. If it's a truly bad argument, see if you and the other person can work toward neutral, if not friends, as we all have the same goal.
    - If unsure about how to approach someone about a conflict, go see Olivia, as she knows many different conflict techniques and has a camp counselor book that can give advice. She can help hold a conversation like a restorative circle if people are unsure about how to communicate between themselves and want help. The two who are in conflict can ask anyone to sit in on a conversation to make sure that both voices are being heard, or if they want extra support.
    - In the event everything goes terribly and there's a big fallout that we can't work toward neutral on, go to the professor.

* Overall, make sure to communicate with the group, be understanding of others, be open to hearing others' ideas, deal with any conflict in a healthy way, meet weekly, and use the group Google Doc to divvy up work.

## Project Timeline Proposal

We will be building a machine learning model. Because of that, we will make sure to keep good communication between members and ask for support when needed.

We will send out a WhenToMeet form weekly to check that the chosen days still work. If they don't, we will adjust to a different day/time early in the week so we can make sure to be on the same page for the week moving forward.

We will meet on Zoom or in person, depending on people's availability.


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/3  |  3:30-4:30 PM | Look over project proposal | Go over work to do for project proposal, work on group sections, divvy up work. |
| 2/10  |  3:30-4:30 PM  |  Think about what the overall project will involve | Make a game plan for what we want to get done, and divvy up work for the week. Make sure we're on track to submit for the first checkpoint |
| 2/17  | 3:30-4:30 PM   | Think about what worked well, what didn't, and what you did  | Discuss what you did over the week. See what went well, and what could improve or stay the same (individually and/or group). Make game plan for the week and divide up work. Make sure to finish up work to submit for Checkpoint 1 |
| 2/24  | 3:30-4:30 PM   | Think about what worked well, what didn't, and what you did  | Discuss what you did over the week. See what went well, and what could improve or stay the same (individually and/or group). Make game plan for the week and divide up work. Make sure to be completing what's necessary for Checkpoint 2 |
| 3/3  | 3:30-4:30 PM   | Think about what worked well, what didn't, and what you did  | Discuss what you did over the week. See what went well, and what could improve or stay the same (individually and/or group). Make game plan for the week and divide up work. Make sure to submit for Checkpoint 2 |
| 3/10  | 3:30-4:30 PM   | Think about what worked well, what didn't, and what you did  | Discuss what you did over the week. See what went well, and what could improve or stay the same (individually and/or group). Make game plan for the week and divide up work. Make sure we will be able to submit all finished products. |
| 3/14  | 3:30-4:30 PM   | Think about what worked well, what didn't, and what you did  | Discuss what you did over the week. See what final things need to be done before submission |

