## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Can we develop a Random Forest classification model to predict whether a URL is benign, phishing, malware or defacement based on lexical features such as URL length, character entropy, and the frequency of special characters? This predictive task aims to evaluate which combination of these structural attributes most effectively identifies maliciousness across various cybersecurity threat categories.
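
The lexical features named in the question can be computed directly from the URL string. The block below is a minimal sketch, not our final feature pipeline: the function names and the `SPECIAL_CHARS` set are illustrative assumptions, and the output keys simply mirror the kinds of columns we expect to use.

```python
import math
from collections import Counter

# Assumption: this set of "special" characters is illustrative, not a final choice.
SPECIAL_CHARS = set("$@!%&=?#-_~+")


def shannon_entropy(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def lexical_features(url: str) -> dict:
    """Return a small dictionary of lexical features for one URL string."""
    length = len(url)
    digits = sum(ch.isdigit() for ch in url)
    specials = sum(ch in SPECIAL_CHARS for ch in url)
    return {
        "url_len": length,
        "entropy": shannon_entropy(url),
        "digit_ratio": digits / length if length else 0.0,
        "special_count": specials,
    }


# Hypothetical example URL, used only to show the call.
features = lexical_features("http://example.com/login?user=a1")
print(features)
```

Entropy here is measured per character, so a string made of one repeated character scores 0 bits and a string with many equally frequent characters scores higher.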

## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Hypothesis


Our group predicts that URLs categorized as malicious will exhibit higher character entropy, much longer string lengths, and a higher frequency of special characters (such as '$', '@', '!', and '%') compared to benign URLs. We also predict that a classification model trained on these lexical features will achieve at least 90% accuracy in distinguishing between threat categories.

## Data

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project

### Dataset 2

- **Dataset Name:** URL-Phish: A Feature-Engineered Dataset for Phishing Detection
- **Link to the dataset:** [https://data.mendeley.com/datasets/65z9twcx3r/1](https://data.mendeley.com/datasets/65z9twcx3r/1)
- **Number of observations:** The dataset contains 111,660 unique URLs
- **Number of variables:** There are 26 total columns. This includes 22 numerical feature columns, 3 string reference columns (url, dom, tld), and 1 binary label column
- **Description of the variables most relevant to this project:**
    - **label:** The target variable, where 0 represents benign and 1 represents phishing
    - **url_len:** The total length of the URL in characters, which has a mean of 32.95 but reaches a maximum of 1,202
    - **entropy:** A measure of the randomness of characters in the URL, ranging from 2.65 to 6.03 bits
    - **digit_ratio:** The proportion of numerical digits relative to the total URL length. Most samples have a low ratio (mean 0.013), but some reach as high as 0.826
    - **is_https:** A binary flag (0 or 1) indicating if the URL uses a secure connection; approximately 43.1% of the dataset uses HTTPS
- **Descriptions of any shortcomings this dataset has with respect to the project:**
    - The dataset is heavily imbalanced: phishing samples make up only about 10.4% of the data (11,660 observations), compared to about 89.6% for benign samples (100,000 observations). This imbalance can bias models toward predicting "benign" by default
    - The benign data has a sampling bias: benign URLs were sourced exclusively from "trusted sources" such as educational (.edu), governmental (.gov), and top-ranked domains. This may not accurately represent the full diversity of legitimate URLs across the broader internet
    - There are also temporal limitations: the phishing samples were collected between November 2024 and September 2025, and because phishing tactics change constantly, these features may become weaker predictors of a site's legitimacy over time
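
To make the class-imbalance concern concrete, the sketch below computes each class's share, the accuracy of a model that always predicts "benign", and inverse-frequency class weights using the formula `n_samples / (n_classes * class_count)`. The class counts come from the description above; the function name is a hypothetical helper, and weighting is only one possible mitigation, not a decision we have made.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Counts from the dataset description: 100,000 benign (0) and 11,660 phishing (1).
labels = [0] * 100_000 + [1] * 11_660

shares = {cls: n / len(labels) for cls, n in Counter(labels).items()}
weights = balanced_class_weights(labels)

# A model that always predicts "benign" reaches this accuracy
# while catching no phishing URLs at all.
majority_baseline_accuracy = max(shares.values())

print(shares)
print(weights)
print(round(majority_baseline_accuracy, 3))
```

Any accuracy figure we report should be read against this baseline, since a model can score well on accuracy here without detecting phishing.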

#### IMPORTANT!!!! If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets. ---> NEED TO DO

In [5]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [6]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/airline-safety/airline-safety.csv', 'filename':'airline-safety.csv'},
    { 'url': 'https://raw.githubusercontent.com/fivethirtyeight/data/refs/heads/master/bad-drivers/bad-drivers.csv', 'filename':'bad-drivers.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Successfully downloaded: airline-safety.csv
Successfully downloaded: bad-drivers.csv

### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
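
The wrangling steps listed above could look roughly like the following. This is a sketch under assumptions: it uses a tiny made-up DataFrame in place of a file in `data/00-raw/`, the column names `url` and `label` are placeholders, and the 1.5 × IQR rule is just one common way to flag outliers.

```python
import pandas as pd

# Tiny stand-in for a raw file; real data would be loaded with pd.read_csv
# from data/00-raw/. All URLs here are made up.
raw = pd.DataFrame(
    {
        "url": [
            "http://a.com",
            "http://b.org/x",
            "http://a.com",               # exact duplicate of the first row
            None,                         # missing URL
            "http://c.net/y",
            "http://d.io/" + "z" * 300,   # unusually long URL
            "http://e.edu/p",
            "http://f.gov/q",
        ],
        "label": [0, 1, 0, 1, None, 1, 0, 0],
    }
)

# 1. Size of the dataset.
n_rows, n_cols = raw.shape

# 2. How much is missing, and in which columns.
missing_per_column = raw.isna().sum()

# 3. Drop exact duplicate rows, then rows missing the URL or the label,
#    since neither can be imputed meaningfully for a classification target.
clean = raw.drop_duplicates().dropna(subset=["url", "label"]).copy()

# 4. Flag (rather than drop) unusually long URLs using the 1.5 * IQR rule.
clean["url_len"] = clean["url"].str.len()
q1, q3 = clean["url_len"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
clean["len_outlier"] = clean["url_len"] > upper_fence

print(n_rows, n_cols)
print(missing_per_column)
print(clean)
```

Flagging outliers instead of dropping them keeps very long URLs available for modeling, which matters here because extreme length may itself be a signal of maliciousness.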


In [7]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 

See instructions above for Dataset #1.  Feel free to keep adding as many more datasets as you need.  Put each new dataset in its own section just like these. 

Lastly if you do have multiple datasets, add another section where you demonstrate how you will join, align, cross-reference or whatever to combine data from the different datasets

Please note that you can always keep adding more datasets in the future if these datasets you turn in for the checkpoint aren't sufficient.  The goal here is demonstrate that you can obtain and wrangle data.  You are not tied down to only use what you turn in right now.

In [8]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


## Ethics

### A. Data Collection
 - [X] **A.1 Informed consent**: If there are human subjects, have they given informed consent, where subjects affirmatively opt-in and have a clear understanding of the data uses to which they consent?

> Most of the datasets we are currently looking at are publicly available datasets, so the people whose URLs are included have not consented to their data being used in our specific project.

 - [X] **A.2 Collection bias**: Have we considered sources of bias that could be introduced during data collection and survey design and taken steps to mitigate those?

> The datasets we are currently looking at were published between 1 and 5 years ago, so we have temporal bias (unless we find more recent datasets to pull from). A model trained on older data may fail to recognize more recent threats or patterns because it would prioritize characteristics that are outdated.

 - [X] **A.3 Limit PII exposure**: Have we considered ways to minimize exposure of personally identifiable information (PII) for example through anonymization or not collecting information that isn't relevant for analysis?

> Yes, our project specifically needs to focus on this ethical concern because URLs can frequently contain PII like emails or user IDs. We'd need to consider anonymizing these URLs or stripping these types of parameters to avoid exposing any user data.

 - [X] **A.4 Downstream bias mitigation**: Have we considered ways to enable testing downstream results for biased outcomes (e.g., collecting data on protected group status like race or gender)?

> We've considered that malicious URLs may be more frequently associated with certain regions or languages, and that there's a possibility our model could unfairly flag benign sites if they have structural similarities with known malicious domains, which is something we should try to minimize if possible.
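
For item A.3, one way to strip PII-bearing parts from a URL is sketched below using the standard library's `urllib.parse`. This is an illustration under assumptions, not a complete PII detector: the function name `strip_pii`, the `<email>` placeholder, and the email regular expression are all hypothetical choices.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Rough pattern for email-like substrings; an assumption, not a complete PII detector.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def strip_pii(url: str) -> str:
    """Drop userinfo, query string, and fragment, and mask email-like text in the path."""
    parts = urlsplit(url)
    # .hostname excludes any "user:password@" prefix (and lowercases the host).
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"
    path = EMAIL_RE.sub("<email>", parts.path)
    return urlunsplit((parts.scheme, host, path, "", ""))


# Made-up URL containing userinfo, an email in the path, a user ID, and a fragment.
print(strip_pii("https://alice:secret@example.com/reset/bob@mail.com?uid=12345#top"))
```

Stripping the query string changes URL length and special-character counts, so lexical features would need to be computed before this step, or the stripping applied only to URLs that are displayed or stored.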

### B. Data Storage
 - [X] **B.1 Data security**: Do we have a plan to protect and secure data (e.g., encryption at rest and in transit, access controls on internal users and third parties, access logs, and up-to-date software)?

> Yes, we realize that storing a large database of malicious URLs could be a security risk if it's accessed by people who intend to study existing malware and use it to carry out more attacks, so we plan to keep the data secured within the repo and datahub.

 - [X] **B.2 Right to be forgotten**: Do we have a mechanism through which an individual can request their personal information be removed?

> We've considered this, but since we are using third-party datasets we don't really have a way to remove someone's URL upon request. We don't know if there is a workaround for this ethical concern in our project's case.

 - [X] **B.3 Data retention plan**: Is there a schedule or plan to delete the data after it is no longer needed?

> Yes, we will delete the data once we finish the project and the quarter ends

### C. Analysis
 - [X] **C.1 Missing perspectives**: Have we sought to address blindspots in the analysis through engagement with relevant stakeholders (e.g., checking assumptions and discussing implications with affected communities and subject matter experts)?

> Yes, we think one main blindspot is that we aren't cybersecurity experts, so there is a lot of perspective that we would lack when it comes to understanding certain aspects of phishing. The main thing we could do to address this is to research and learn more about it before we start.

 - [X] **C.2 Dataset bias**: Have we examined the data for possible sources of bias and taken steps to mitigate or address these biases (e.g., stereotype perpetuation, confirmation bias, imbalanced classes, or omitted confounding variables)?

> We talked about how the model may develop a confirmation bias if the training data is imbalanced, for example if we use more malware URLs than defacement URLs to train it. To mitigate these possible biases, we'd need to make sure the model doesn't just "learn" to guess the most common class.

 - [X] **C.3 Honest representation**: Are our visualizations, summary statistics, and reports designed to honestly represent the underlying data?

> Yes, our goal for our representations will be that they clearly define what makes up a malicious URL as best as we can and as clearly as we can to avoid misleading audiences in any way.

 - [X] **C.4 Privacy in analysis**: Have we ensured that data with PII are not used or displayed unless necessary for the analysis?

> We have considered this, and we will ensure that raw URLs containing any PII are not displayed in any visualizations or in the final report.

 - [X] **C.5 Auditability**: Is the process of generating the analysis well documented and reproducible if we discover issues in the future?

> Yes, all of our process will be documented in Jupyter Notebooks and pushed to our GitHub repo to make sure the results are reproducible and accessible if we discover any issues in the future.

### D. Modeling
 - [X] **D.1 Proxy discrimination**: Have we ensured that the model does not rely on variables or proxies for variables that are unfairly discriminatory?

> Yes, we acknowledge this, and we will research the variables thoroughly enough to ensure they aren't acting as proxies for discrimination against legitimate websites.

 - [X] **D.2 Fairness across groups**: Have we tested model results for fairness with respect to different affected groups (e.g., tested for disparate error rates)?

> We will thoroughly test for disparate error rates, e.g. if the model is way more accurate for ".com" domains and way less accurate for ".org" domains then the model is not fair, and we will either try to address this bias or address that failure in the final report.

 - [X] **D.3 Metric selection**: Have we considered the effects of optimizing for our defined metrics and considered additional metrics?

> We've set a goal of 90% accuracy, but we also realize the cost of false positives (blocking a benign site) versus false negatives (letting malware pass through undetected), since a model with 90% accuracy that blocks 10% of safe sites across the world would be a big problem realistically.

 - [X] **D.4 Explainability**: Can we explain in understandable terms a decision the model made in cases where a justification is needed?

> Yes we can, we can be transparent with our features and highlight which one triggered the URL to be marked as malicious, and based on that we can provide a technical justification in understandable terms.

 - [X] **D.5 Communicate limitations**: Have we communicated the shortcomings, limitations, and biases of the model to relevant stakeholders in ways that can be generally understood?

> Yes, in the final report and in any other documentation where it's important to mention the shortcomings, we will make it clear that it is a class project and not a legitimate security tool in any way.
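
The disparate-error-rate check from D.2 and the false positive versus false negative trade-off from D.3 can be examined together by computing both rates per group. The sketch below assumes a binary framing (malicious = 1, benign = 0) and groups by top-level domain; the helper name and the toy labels are made up purely to show the shape of the output.

```python
from collections import defaultdict


def error_rates_by_group(groups, y_true, y_pred, positive=1):
    """Return {group: (false_positive_rate, false_negative_rate)}."""
    tallies = defaultdict(lambda: {"fp": 0, "fn": 0, "neg": 0, "pos": 0})
    for group, truth, pred in zip(groups, y_true, y_pred):
        t = tallies[group]
        if truth == positive:
            t["pos"] += 1
            t["fn"] += int(pred != positive)   # malicious URL let through
        else:
            t["neg"] += 1
            t["fp"] += int(pred == positive)   # benign URL wrongly flagged
    rates = {}
    for group, t in tallies.items():
        fpr = t["fp"] / t["neg"] if t["neg"] else float("nan")
        fnr = t["fn"] / t["pos"] if t["pos"] else float("nan")
        rates[group] = (fpr, fnr)
    return rates


# Made-up labels and predictions, only to illustrate the comparison.
tlds = ["com", "com", "com", "com", "org", "org", "org", "org"]
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 0, 1]

rates = error_rates_by_group(tlds, y_true, y_pred)
print(rates)
```

In this toy data the overall accuracy is 75%, yet every error falls on the ".org" group, which is the kind of gap a single accuracy number would hide.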

### E. Deployment
 - [X] **E.1 Monitoring and evaluation**: Do we have a clear plan to monitor the model and its impacts after it is deployed (e.g., performance monitoring, regular audit of sample predictions, human review of high-stakes decisions, reviewing downstream impacts of errors or low-confidence decisions, testing for concept drift)?

> No, we do not have a long-term deployment plan after the final project is completed.

 - [X] **E.2 Redress**: Have we discussed with our organization a plan for response if users are harmed by the results (e.g., how does the data science team evaluate these cases and update analysis and models to prevent future harm)?

> Yes, we would need a plan for when our model wrongly labels legitimate sites as malicious, and we discussed ways which we could go about this, like for example proactively updating the model as the digital landscape changes or if we see a way to improve the model to avoid harmful results. 

 - [X] **E.3 Roll back**: Is there a way to turn off or roll back the model in production if necessary?

> Yes, we agreed there should be a defined mechanism to turn off or roll back the model if it starts to cause unintended harm or very incorrect results.

 - [X] **E.4 Unintended use**: Have we taken steps to identify and prevent unintended uses and abuse of the model and do we have a plan to monitor these once the model is deployed?

> Yes, as mentioned before, we realize that if the model is used in a negative way to study the algorithm, or use it to test someone's malicious URLs so that they could maybe bypass our 90% accuracy, that our project could potentially become a tool to create better scams. So we acknowledge that if this were to be deployed we would need some sort of monitoring plan to prevent such actions.

## Team Expectations 

These are the team expectations that the team agrees to

* If it's directed at a part of the project you're working on, make sure to respond to the group messages within 24 hours.
* Make sure to be open-minded when hearing others' ideas
* Messaging about the project will be conducted in the groupchat
* Making sure that the feedback we give to others is kind, and is not overall negative. Making sure to phrase things in a non-judgemental way
* We will have a rough meeting day and time every week, but we'll send out a WhenToMeet form weekly to see if that slot works best for a given week, or if a better day/time in a week works best to meet
* We will communicate decisions that are important for the full group to know. If it's something less important, say we're working on a section with another member, and we're changing up a paragraph, then that doesn't necessarily have to be mentioned, unless the members making the changes want to share.
* Everyone will do a mix of everything. We will split up different things to work on, making an effort to make sure the work is evenly divided.
* We will check in every week to see if people need support in doing a task, and to see what tasks need to get done for the week.
* If we get off task while working together and are being noisy, please no shushing; instead say something like "let's lock in guys." "Let's lock in" will be the phrase we use to signal to the whole team (or ourselves) that we need to get back on task
* We will make a list of the tasks to do per week on the shared google doc. We will go over it every week together, see what's been completed, and how to move forward on the project for the upcoming week to make sure we're all on the same page
* If someone is struggling on their section, they can reach out as soon as they would like to chat it out so we can help as a group, as that can make work go faster.
* The person struggling should text
* If someone knows they won't have something done by a deadline we set, they can communicate that without fear of judgement. We will work as a team to see what needs to get done, and help if needed. If it takes an extra day, or more time than expected, keep open communication with the group, and tell us if you need help!
* If things get rough, keep an optimistic/not harshly negative attitude about the project during meetings
* If there's conflict, then there can be multiple things we do:

    - If comfortable, communicate with the other person directly, and keep this cordial and open to hearing the other. Make sure to let the other say what they would like to, and truly hear what each other is saying. We don't have to become best friends (while that would be awesome), but we do have to work together toward a common goal. If it's a truly bad argument, see if you and the other person can work toward neutral, if not friends, as we do all have the same goal

    - If unsure about how to approach someone with conflict, you can go see Olivia, as she knows many different conflict techniques and has a camp counselor book that can give advice. She can help hold a conversation like a restorative circle if people are unsure about how to communicate between themselves and want help. The two who are in conflict can ask anyone to sit in on a conversation to make sure that both voices are being heard, or if they want extra support

    - In the event everything goes terribly and there's a big fallout we can't work toward neutral with, go to the prof

* Overall, make sure to communicate with the group, be understanding of others, be open to hearing others' ideas, deal with any conflict in a healthy way, meet weekly, and use the group Google Doc to divvy out work.

## Project Timeline Proposal

We will be building a machine learning model. Because of that, we will make sure to keep good communication between members and ask for support when needed.

We will send out a WhenToMeet form weekly to check that the chosen days still work. If they don't, we will adjust to a different day/time early in the week so we can make sure to be on the same page for the week moving forward.

We will meet on Zoom or in person, depending on people's availability.


| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/18  |  12:00-1:00 PM | Finalize data acquisition from Kaggle and UCI | Clean and tidy the URLs into a single rectangular dataframe. Priority: Strip PII (emails/IDs) from URL strings to satisfy ethics requirements. Data Checkpoint #1 Due | 
| 2/23-25  |  Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Write Python functions to extract lexical features like URL length, character entropy, and frequency of special characters ($, !, @, %, etc). Start initial Exploratory Data Analysis (EDA) using histograms to check feature distributions | Discuss progress and questions if any |
| 3/2-4  | Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Create box plots to compare feature frequencies across categories (Malware vs. Benign). Fit the initial Random Forest model. Check for overfitting by comparing training vs. validation accuracy | Discuss progress, questions, and fill in the Jupyter notebook for the EDA Checkpoint for submission |
| 3/9-11  | Individual work, but will meet on Wednesday from 12:00-1:00 PM to check in with each other | Fine-tune model parameters to reach the 90% accuracy goal. Perform a final check of the "Condition Number" to ensure features aren't redundant | Draft the "Limitations" section regarding cybersecurity expertise, address any questions, concerns, modifications needed, and plan any additional meetings we would need to have |
| 3/16-18  | Individual work on the Final Report (once we divide the work up), but will meet on Wednesday from 12:00-1:00 PM to record the presentation video and to wrap up any ends | Complete the technical Final Report | Record the 3-5 minute video presentation intended for a non-technical audience, and submit individual Team Evaluation surveys |
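
The overfitting check planned in the timeline (comparing training and validation accuracy for the initial Random Forest) might be sketched as follows. This assumes scikit-learn is available and uses synthetic data from `make_classification` as a stand-in for our real feature matrix; the hyperparameters shown are placeholders, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lexical feature matrix; the real X would hold
# columns such as URL length, entropy, and special-character counts.
X, y = make_classification(
    n_samples=1000,
    n_features=6,
    n_informative=4,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)

# Stratify so each class keeps the same share in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

train_acc = accuracy_score(y_train, model.predict(X_train))
val_acc = accuracy_score(y_val, model.predict(X_val))

# A training score far above the validation score suggests overfitting.
print(f"train={train_acc:.3f}  validation={val_acc:.3f}  gap={train_acc - val_acc:.3f}")
```

Random Forests with fully grown trees often score close to 100% on their own training data, so the validation score, not the training score, is the number to compare against the 90% goal.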