# COGS 108 - Data Checkpoint

# Names
- Alan Miyazaki
- Alex Guan
- Nathan Ahmann
- Renaldy Herlim

<a id='research_question'></a>
# Research Question

Does crime happen more or less frequently around police stations, and can that knowledge be used to distribute police station locations more effectively?

# Dataset(s)

### Dataset 1: Los Angeles Police Station Location
[Link to dataset](https://geohub.lacity.org/datasets/lapd-police-stations)  
Number of observations: 21

This dataset contains information on all 21 of LA City's Police Stations. If we find that our crime data extends past the city, or if we want more precise locations, we might use the extended LA County dataset, which includes more Police Stations and additionally includes Sheriff Stations.

Each row of the dataset contains information for one police station.  
The relevant columns are:
* DIVISION - the division the police station is under
* LOCATION - an address for the police station. Since we would rather have a more precise location, we might convert these addresses to latitude/longitude or use the alternate dataset.
* PREC - the precinct each station is in charge of 


#### Alternate Dataset covering all of LA County, including Sheriff Stations:
[Link to alternate dataset](https://geohub.lacity.org/datasets/lacounty::sheriff-and-police-stations)  
Number of observations: 105

This dataset contains information on 105 of LA County's Police and Sheriff Stations. It contains more data than the above dataset, but some of the information might not be necessary so we listed it as an alternative.

Each row of the dataset contains information for one police or sheriff station.  
The relevant columns are:
* cat3 - Category 3 has a distinction between sheriff and police stations
* latitude, longitude - the coordinates of the station
* addrln1 - address line 1 contains an address we could use for location instead of latitude, longitude
* city - Contains the city so we could narrow it down to LA city
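
If we end up using the alternate dataset, narrowing it to police stations inside LA city could look roughly like the sketch below. The sample rows are made up, and the label values (`"Police Stations"`, `"Sheriff Stations"`, `"Los Angeles"`) are assumptions; we would need to check the actual values in `cat3` and `city` before filtering.

```python
import pandas as pd

# Hypothetical sample rows; the real labels in cat3 and city may differ.
stations = pd.DataFrame({
    "cat3": ["Police Stations", "Sheriff Stations", "Police Stations"],
    "city": ["Los Angeles", "Los Angeles", "Pasadena"],
    "addrln1": ["100 Example St", "200 Example Ave", "300 Example Blvd"],
    "latitude": [34.05, 34.06, 34.15],
    "longitude": [-118.24, -118.25, -118.14],
})

# Normalize text before comparing so stray spaces or casing do not drop rows.
is_police = stations["cat3"].str.strip().str.lower() == "police stations"
in_la_city = stations["city"].str.strip().str.lower() == "los angeles"

# Keep only police stations located in LA city.
la_police = stations[is_police & in_la_city].reset_index(drop=True)
print(la_police[["addrln1", "latitude", "longitude"]])
```

Normalizing case and whitespace first is a defensive choice, since hand-entered category and city fields often contain small inconsistencies.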


### Dataset 2: Los Angeles Crime Data from 2020 to Present (March 1st 2023)  
[Link to dataset](https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8) 
  
Number of observations: 673,367 

This dataset includes information on crimes that took place in Los Angeles between 2020 and March 1st 2023.
Since Datasets 2 and 3 have the same columns, the row information and relevant columns for both are listed under Dataset 3 below.

### Dataset 3: Los Angeles Crime Data from 2010 to 2019
[Link to dataset](https://data.lacity.org/Public-Safety/Crime-Data-from-2010-to-2019/63jg-8b9z)
  
Number of observations: 2,119,797 

This dataset includes information on crimes that took place in Los Angeles between 2010 and 2019.  

Each row represents a single crime that took place.
The relevant columns are:
* DR_NO - Division of Records number. Acts as a unique ID for the crime
* DATE_OCC - The date that the crime occurred
* AREA - contains the geographic area code for the police station. These are 1-21 and each corresponds to one of the 21 police stations.
* Crm Cd Desc - description of the criminal code. Essentially a human readable crime category
* LOCATION - street address the crime took place at
* LAT, LONG - latitude and longitude
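
Before using the date and coordinate columns, we will probably need to parse and validate them. The sketch below uses made-up rows and the column names as listed above. It assumes that dates are stored as `MM/DD/YYYY hh:mm:ss AM/PM` strings and that records with no known location are stored as (0, 0); both assumptions should be checked against the real files.

```python
import pandas as pd

# Hypothetical sample rows using the column names listed above.
crimes = pd.DataFrame({
    "DR_NO": [1, 2, 3],
    "DATE_OCC": ["01/08/2020 12:00:00 AM", "02/15/2021 12:00:00 AM", "not a date"],
    "LAT": [34.0141, 0.0, 34.0459],
    "LONG": [-118.2978, 0.0, -118.2545],
})

# Parse dates; values that do not match the assumed format become NaT.
crimes["DATE_OCC"] = pd.to_datetime(
    crimes["DATE_OCC"], format="%m/%d/%Y %I:%M:%S %p", errors="coerce"
)

# Assumption: a coordinate of (0, 0) marks a missing location.
has_location = (crimes["LAT"] != 0) & (crimes["LONG"] != 0)
located = crimes[has_location].reset_index(drop=True)
print(located)
```

Using `errors="coerce"` keeps malformed dates as `NaT` instead of raising, so we can count and inspect them before deciding whether to drop those rows.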

### Combining the Datasets

Dataset 1 will be our police station information, so it will stay as its own dataset. In our analysis we will use it in conjunction with the other datasets via location data, which will likely be latitude/longitude.
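
One way we might relate the two kinds of data by latitude/longitude is to compute the great-circle distance from each crime to its nearest station. The sketch below uses the haversine formula; the function names, station names, and coordinates are all hypothetical and only illustrate the idea, not our final method.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


def nearest_station(crime, stations):
    """Return (station_name, distance_km) for the station closest to the crime."""
    return min(
        ((name, haversine_km(crime[0], crime[1], lat, lon))
         for name, (lat, lon) in stations.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical coordinates for illustration only.
stations = {"Station A": (34.05, -118.25), "Station B": (34.20, -118.45)}
crime = (34.06, -118.26)

name, dist = nearest_station(crime, stations)
print(name, round(dist, 2))
```

With roughly 2.8 million crimes and 21 stations, a plain loop like this would be slow, so for the full data we would likely vectorize the same formula over dataframe columns.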

Datasets 2 and 3 contain our crime data. Because they come from the same source and share the same columns, we can simply concatenate them.

# Setup

In [1]:
import pandas as pd

# Data Cleaning

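We read both crime CSV files, fix a column name in the 2010-2019 file that contains a trailing space (`"AREA "`), and concatenate the two tables into a single dataframe. We then drop the code columns, since they are for internal use and their descriptions are available in separate columns.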

In [2]:
# Reading datasets from two different time periods

# https://data.lacity.org/Public-Safety/Crime-Data-from-2010-to-2019/63jg-8b9z
past_df = pd.read_csv("Crime_Data_from_2010_to_2019.csv")
# https://data.lacity.org/Public-Safety/Crime-Data-from-2020-to-Present/2nrs-mtv8
present_df = pd.read_csv("Crime_Data_from_2020_to_Present.csv")

# past dataset has column name typo
past_df = past_df.rename(columns={"AREA ": "AREA"})

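# Both datasets use the same columns, so we stack them row-wise.
# ignore_index=True gives the combined dataframe unique row labels.
df = pd.concat([past_df, present_df], ignore_index=True)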

In [6]:
# Dropping code columns since these are for internal use and we don't get much
# value from them given we have their description in a separate column
df = df.drop(columns=[
    "Crm Cd", "Crm Cd 1", "Crm Cd 2", "Crm Cd 3",
    "Crm Cd 4", "Premis Cd", "Weapon Used Cd",
])