**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Haoyu Fu
- Qianxia Hui
- Arianna Morris
- Michael Tang
- Bofu Zou

# Research Question

Does parking occupancy on the UCSD campus predict traffic incidents on major roads near the UCSD campus?

## Background and Prior Work

Traffic incidents are a significant concern in modern cities, especially in densely populated areas. As a large institution with a growing population, the University of California, San Diego (UCSD) faces the problems brought on by heavy vehicular traffic and parking. As the campus community grows, parking occupancy rates may reflect the number of traffic incidents around the campus area. Our study seeks to find out whether there is a correlation between UCSD campus parking occupancy rates and traffic incidents in the area surrounding campus.

While our topic focuses on how parking occupancy may predict traffic collisions around UCSD, prior studies have analyzed the likelihood of car accidents in a given geographic region. For example, Forbes Advisor compiled a ranking of 50 U.S. cities by the likelihood of getting into a car accident, with San Diego being one of them.<a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1) The ranking was based on the fatality rate per 100,000 people from the National Highway Traffic Safety Administration, together with the average years between collisions and the relative collision likelihood from the Allstate Best Drivers Index. The analysis covers only urban areas, and the dataset is relatively small because only the 50 largest cities by census population were chosen. A general understanding of a particular city's collision danger is approximated by comparison to the other major cities in the dataset. Forbes Advisor's primary goal is providing financial advice, which this analysis supports by offering information about collisions that is relevant to insurance, legal, and risk assessment situations.

Another study investigated the causal relationship between road density and parking occupancy.<a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2) Researchers in Tunisia examined this relationship in the Tunis city center using Granger causality tests based on vector error correction modeling. The authors collected data using video cameras around a major street in Tunis, the capital of Tunisia, and found a causal relationship between the two variables, with road density Granger-causing parking occupancy. This suggests that increasing road density may lead to an increase in parking occupancy, which in turn may lead to an increase in road congestion. The authors suggest that their findings can be used to develop more sustainable parking policies that reduce road congestion, and can be incorporated into parking models to improve their accuracy and effectiveness. This study provides important insights into the relationship between road density and parking occupancy, and gives us some ideas for our own project, since road density and traffic incidents are both significant features of a city's traffic conditions.

Similarly, we want to assess the general level of safety with respect to traffic collisions by exploring a more confined geographical area that is of interest to us: UCSD. Building on these prior works, and using the UCSD campus parking occupancy data and nearby traffic incident data, our project aims to identify patterns specific to the UCSD campus and its surrounding areas. We will conduct a descriptive and exploratory data analysis of these datasets, which will allow us to identify their important features and how they may relate to one another. We aim to contribute to the existing literature on road safety and how it impacts us on the UCSD campus.

**References:**

1. <a name="cite_note-1"></a> [^](#cite_ref-1) Bieber, C. (2023, October 25). The cities where you’re most likely to get in a car accident. Forbes Advisor. https://www.forbes.com/advisor/legal/auto-accident/cities-most-car-accidents/  
2. <a name="cite_note-2"></a> [^](#cite_ref-2) Hassine, S. B., Kooli, E., & Mraihi, R. (2022). The causal relationship between road density and parking occupancy. World Journal of Advanced Research and Reviews, 15(3), 125–134. https://wjarr.com/sites/default/files/WJARR-2022-0896.pdf


# Hypothesis


We hypothesize that there will be a positive correlation between the number of cars parked on campus and the number of traffic incidents occurring on and around the UCSD campus. We believe this because more congestion within the campus area could lead to more traffic incidents.

# Data

## Data overview

For each dataset include the following information
- Dataset #1
  - Dataset Name: Survey of Parking Space Occupancy Levels
  - Link to the dataset: https://rmp-wapps.ucsd.edu/TS/Survey/Survey%20of%20Parking%20Space%20Occupancy%20Levels/Quarterly%20Tables/Contents.html 
  - Number of observations: 93 (quarterly tables)
  - Number of variables: 15 ("University-Wide" data)

This dataset contains a total of 93 observations, each giving the parking occupancy levels of UCSD parking lots for a given quarter, from Summer 2000 to Summer 2023. However, we plan to use only 29 of the observations, from Summer 2016 to Summer 2023, to stay consistent with the data we have in Dataset #2. We scraped the 29 quarterly tables we wanted from the dataset link and combined them into one csv file (QbyQ UCSD Parking Occupancy.csv) to make our data analysis easier. Each row (observation) in our csv file has 15 columns (variables), which cover the total number of parking spaces, the number of empty parking spaces listed by hour, and the occupancy proportion at peak time (an important feature that could be used along with the other dataset). The variables all have the same datatype: numerical. Because we assembled the csv file ourselves from the web tables, we didn't need to do much data cleaning. Since we aren't interested in the columns that display parking occupancy by hour, we dropped those columns from the dataset.

- Dataset #2
  - Dataset Name: Traffic collisions details (2015 through year-to-date)
  - Link to the dataset: https://data.sandiego.gov/datasets/police-collisions-details/
  - Number of observations: 123708
  - Number of variables: 22

This dataset contains traffic collision details for the San Diego area (including the vicinity of UCSD) from 2015 to the present, with a total of 123,708 observations and 22 variables. The important variables include the timestamp of the collision, location details, violation type, and the number of injuries and fatalities. The data types include categorical variables (violation type), numerical variables (number of injuries and fatalities), and time variables. The dataset can serve as a proxy for understanding traffic conditions and traffic safety in the area. To clean the dataset, we filter on location (street names) to find collisions that happen on the major streets in the area surrounding UCSD, then handle missing values, convert data types, and remove irrelevant variables.
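The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on made-up rows, not the project's actual pipeline: the function name `clean_collisions`, the toy values, and the `report_id` column are hypothetical stand-ins, and only the four kept column names are taken from the collision dataset.

```python
import pandas as pd

# Toy rows standing in for the collision data (hypothetical values)
raw = pd.DataFrame({
    "date_time": ["2016-07-06 00:01:00", "2016-07-29 15:27:00", None],
    "address_road_primary": ["LA JOLLA VILLAGE", None, "NOBEL"],
    "injured": ["0", "1", "2"],
    "killed": ["0", "0", "0"],
    "report_id": ["a", "b", "c"],  # hypothetical column we do not need
})

def clean_collisions(df):
    """Keep relevant columns, drop incomplete rows, and fix data types."""
    keep = ["date_time", "address_road_primary", "injured", "killed"]
    # Remove irrelevant variables and rows missing a timestamp or street name
    out = df[keep].dropna(subset=["date_time", "address_road_primary"]).copy()
    # Convert data types: timestamps to datetime, counts to numbers
    out["date_time"] = pd.to_datetime(out["date_time"])
    out["address_road_primary"] = out["address_road_primary"].str.lower()
    for col in ["injured", "killed"]:
        out[col] = pd.to_numeric(out[col])
    return out

cleaned = clean_collisions(raw)
```

Dropping rows is only one way to handle missing values; whether that is appropriate depends on how many rows are affected in the real data.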

## Dataset #1: Quarter by Quarter UCSD Parking Occupancy Dataset

In [1]:
## Import pandas to read csv files
import pandas as pd

In [2]:
## Import csv file with parking data
parking_data = pd.read_csv('QbyQ UCSD Parking Occupancy.csv', 
                            usecols = ['Quarter', 'Year', 'Parking Spaces', 
                                    'Empty Spaces', 'Occupied Spaces', '% Occupied'])

## Renaming the column names
parking_data = parking_data.rename(columns={'Quarter':'quarter', 'Year':'year', 
                     'Parking Spaces':'parking_spaces', 'Empty Spaces':'empty_spaces', 
                     'Occupied Spaces':'occupied_spaces', '% Occupied':'percent_occupied'})
parking_data.head(5)

| | quarter | year | parking_spaces | empty_spaces | occupied_spaces | percent_occupied |
|---|---|---|---|---|---|---|
| 0 | Summer | 2016 | 19297 | 6567 | 12730 | 65.97% |
| 1 | Fall | 2016 | 19245 | 3578 | 15667 | 81.41% |
| 2 | Winter | 2017 | 18316 | 2691 | 15625 | 85.31% |
| 3 | Spring | 2017 | 18334 | 3096 | 15238 | 83.11% |
| 4 | Summer | 2017 | 18082 | 5050 | 13032 | 72.07% |
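The `percent_occupied` values in the preview above are text ending in `%`, so they would need to be converted before any numerical analysis. A minimal sketch of one way to do this, using the first three rows of the preview; the new column name `occupancy_rate` is an assumption introduced here for illustration.

```python
import pandas as pd

# First three rows copied from the preview of parking_data
parking_preview = pd.DataFrame({
    "quarter": ["Summer", "Fall", "Winter"],
    "year": [2016, 2016, 2017],
    "percent_occupied": ["65.97%", "81.41%", "85.31%"],
})

# Strip the trailing '%' and convert to a proportion between 0 and 1
parking_preview["occupancy_rate"] = (
    parking_preview["percent_occupied"].str.rstrip("%").astype(float) / 100
)
```

Keeping the original text column alongside the numeric one makes it easy to check the conversion by eye.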


## Dataset #2: San Diego PD Traffic Collision Details Dataset

In [3]:
## Import csv file with collision data
collision_data = pd.read_csv('https://seshat.datasd.org/traffic_collision_details/pd_collisions_details_datasd.csv',
                             usecols = ['date_time', 'address_road_primary', 'injured', 'killed'], 
                             parse_dates = ['date_time'])
collision_data.head(5)

| | date_time | address_road_primary | injured | killed |
|---|---|---|---|---|
| 0 | 2015-01-14 20:00:00 | JUNIPER | 0 | 0 |
| 1 | 2015-03-19 12:00:00 | LINDA VISTA | 0 | 0 |
| 2 | 2015-03-24 03:05:00 | WASHINGTON | 2 | 0 |
| 3 | 2015-03-27 23:56:00 | WORDEN | 1 | 0 |
| 4 | 2015-07-06 11:45:00 | EL CAJON | 0 | 0 |


In [4]:
## Filter the dataset to include only the street names relevant to our analysis
collision_data['address_road_primary'] = collision_data['address_road_primary'].astype(str).str.lower()
street_names = ['genesee', 'gilman', 'hopkins', 'la jolla farms', 
                'la jolla scenic', 'la jolla scenic dr north', 
                'la jolla scenic n', 'la jolla village', 'lebon', 
                'nobel', 'north torrey pines', 'regents', 
                'villa la jolla', 'voigt']
## .copy() makes the filtered result an independent DataFrame, so adding
## columns to it later does not trigger a SettingWithCopyWarning
collision_data = collision_data[collision_data['address_road_primary'].isin(street_names)].copy()
collision_data.head(5)

| | date_time | address_road_primary | injured | killed |
|---|---|---|---|---|
| 50 | 2016-07-06 00:01:00 | la jolla village | 0 | 0 |
| 87 | 2016-07-29 15:27:00 | genesee | 0 | 0 |
| 88 | 2016-07-29 15:27:00 | genesee | 0 | 0 |
| 134 | 2016-08-05 07:30:00 | nobel | 1 | 0 |
| 149 | 2016-08-06 20:39:00 | nobel | 1 | 0 |


In [5]:
## Extract the calendar year of each collision
collision_data['date_time'] = pd.to_datetime(collision_data['date_time'])
collision_data['year'] = collision_data['date_time'].dt.year

## Load the start and end date of each UCSD quarter from UCSD_Quarter_Dates
quarter_dates = pd.read_csv('UCSD_Quarter_Dates.csv')
quarter_dates['Start_date'] = pd.to_datetime(quarter_dates['Start'])
quarter_dates['End_date'] = pd.to_datetime(quarter_dates['End'])

def quarter_check(date):
    """Return the quarter label (first column of quarter_dates) whose
    date range contains `date`, or None if it falls between quarters."""
    ## Compare calendar days so that collisions later in the day on a quarter's
    ## last day are still included (assumes Start/End hold dates without times)
    day = date.normalize()
    for _, row in quarter_dates.iterrows():
        if row['Start_date'] <= day <= row['End_date']:
            return row.iloc[0]
    return None

## Assign each collision to a quarter and keep only the columns we need
collision_data['quarter'] = collision_data['date_time'].apply(quarter_check)
collision_data = collision_data[['year', 'quarter', 'address_road_primary', 'injured', 'killed']]
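With both tables keyed by year and quarter, one possible next step toward testing the hypothesis is to count collisions per quarter, join the counts to the parking table, and compute a correlation. The sketch below uses small made-up tables rather than the real data. It assumes the quarter labels produced by `quarter_check` match the `quarter` column of the parking data, and the names `counts`, `merged`, and `n_collisions` are hypothetical.

```python
import pandas as pd

# Toy stand-ins shaped like the cleaned tables (values are made up)
parking = pd.DataFrame({
    "quarter": ["Summer", "Fall", "Winter", "Spring"],
    "year": [2016, 2016, 2017, 2017],
    "occupied_spaces": [100, 200, 300, 400],
})
collisions = pd.DataFrame({
    "year": [2016, 2016, 2016, 2017, 2017, 2017],
    "quarter": ["Summer", "Fall", "Fall", "Winter", "Winter", None],
    "injured": [0, 1, 0, 2, 0, 1],
})

# Count collisions per (year, quarter); rows without a quarter label
# (collisions between quarters) are dropped by groupby
counts = (
    collisions.groupby(["year", "quarter"])
    .size()
    .reset_index(name="n_collisions")
)

# Left merge keeps every parking quarter; quarters with no recorded
# collisions get a count of 0 instead of a missing value
merged = parking.merge(counts, on=["year", "quarter"], how="left")
merged["n_collisions"] = merged["n_collisions"].fillna(0).astype(int)

# Pearson correlation between occupied spaces and collision counts
r = merged["occupied_spaces"].corr(merged["n_collisions"])
```

A left merge is used so that the number of rows always matches the parking table, which makes it easy to spot quarters that received no collisions.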

# Ethics & Privacy

Our proposed project involves the analysis of data related to parking occupancy on the UCSD campus and traffic incidents on nearby major roads. As we conduct our analysis, we are aware of the importance of ethical standards and privacy concerns. [Deon’s ethics checklist](https://deon.drivendata.org/#data-science-ethics-checklist) provides a useful framework for investigating potential ethics or privacy issues. 

To avoid potential bias within our datasets, we are only using datasets that are in the public domain and published by reputable sources. For instance, our proposed dataset providing accident information is published by the San Diego government, and our proposed dataset with UCSD parking data is published by UCSD campus authorities. The traffic incident information is not collected directly from individuals, but is instead published to the public domain after being collected at the time of the accident. We acknowledge that some information available in the dataset may be sensitive, though not personally identifying information (PII), and we are excluding this information from our analysis. The parking occupancy data is collected by [“cameras embedded with artificial intelligence”](https://today.ucsd.edu/story/parking-on-campus-theres-an-app-for-that#:~:text=IT%20Services%20enlisted%20student%20developers,which%20parking%20spaces%20are%20available). The parking occupancy data does not collect PII, and no PII is published with the dataset.

We acknowledge that there may be unintended consequences as a result of our data analysis. While inferences may be drawn from our conclusions, our analysis does not aim to generalize about the safety of the roads surrounding UCSD. It is intended to be informative about one facet that may contribute to road safety. Correlation does not imply causation; we are simply exploring a potential correlation between our datasets. Furthermore, confounds may be present within our data analysis. Holidays and events may affect both parking and collisions, which could influence our results. Since the data from our first dataset is grouped by UCSD academic quarter, there may be outlier data points that we cannot account for: the data is aggregated over a period of roughly three months, so we cannot see any variation that month-to-month events might cause. Lastly, our intent is not to influence drivers to shift their driving habits, but instead to provide insight into potential relationships between driving habits.

We aim to be transparent in our methodology, and will provide a summary and details of the steps we complete in our analysis, as well as any concerns that may arise throughout our process. Our project's primary goal is to contribute positively to public safety and traffic efficiency while minimizing any potential harm.


# Team Expectations 

1. Communicate through group messages when we are making changes to the project (on GitHub documents).
2. Meet weekly at a time we all agree on
3. Equitable contribution
    - Each team member works through their portion(s) of the project equally 
    - If issues arise, communicate sooner rather than later
    - Ask another team member for help/advice if you run into any issues

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
|  10/23  |  5 PM  | Read & think about previous projects given to review; brainstorm topics/questions  | Determine best form of communication; Complete project review; Discuss possible final project topics |
|  10/30  |  5 PM  |  Brainstorm final project topics | Decide final project topic and split up work for project proposal |
|  11/1  |  5 PM  | Work on project proposal | Finalize and submit final project proposal |
|  11/11  |  6 PM  | Have project proposal submitted; Search for datasets  | Discuss wrangling and possible analytical approaches; Assign group members to lead each specific part |
|  11/13  |  5 PM  | Work on data descriptions | Start cleaning/wrangling datasets |
|  11/15  |  5 PM  | Finish data descriptions and initial cleaning/wrangling for data checkpoint | Finalize data checkpoint; Discuss EDA and analysis plan |
|  11/20  |  5 PM  | Finalize wrangling/EDA; Begin analysis | Discuss/edit analysis; Complete project check-in |
|  11/27  |  5 PM  | Complete analysis; Draft results/conclusion/discussion | Finalize EDA checkpoint; Discuss/edit full project |
|  12/4  |  5 PM  | NA | Record the final project video |
