<a href="https://colab.research.google.com/github/mariap13/CMSC320-FinalProject/blob/main/CMSC320_Final_Project_MP_KRT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analysis of Baltimore City Crime Data
Fall 2024 Data Science Project
Maria Pacheco, Kelly Turner
### Contributions:
* *Project Idea:* Kelly Turner
* *Dataset Curation and Preprocessing:* Kelly Turner, Maria Pacheco
* *Data Exploration and Summary Statistics:* Maria Pacheco, Kelly Turner
* *ML Algorithm Design/Development:*
* *ML Algorithm Training and Test Data Analysis:*
* *Visualization, Result Analysis, Conclusion:*
* *Final Tutorial Report Creation:*
* *Additional:*

## Introduction
This project aims to reveal trends in major crimes committed in Baltimore City and to examine how the nature of the crimes, the locations of incidents, and the gender of the perpetrators relate to the frequency of crime in the city. <br />

## Questions (brainstorm)
* What are the projected rates of major crime in Baltimore's districts over the next 5 to 10 years, based on the given data? Model to use: Linear Regression
* Visualize concentrations of crimes for each 5(?)-year span until 2024, then maybe create visualizations of the ML predictions? See https://developers.arcgis.com/python/latest/guide/visualizing-data-with-the-spatially-enabled-dataframe/
* What are the hourly breakdowns for the most prevalent crimes? (This is done in the example crime project, so it might look like plagiarism.)



## Data Preprocessing

For this project we will be pulling data from the Open Data Baltimore API, specifically the major crime dataset featured in the link below, and loading it into a dataframe for further analysis.<br /> https://data.baltimorecity.gov/datasets/baltimore::part-1-crime-data/about <br> **Please download the CSV file under download. (file is too large to place in repo)**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import requests

# Query the ArcGIS FeatureServer and keep the list of GeoJSON features
url = "https://services1.arcgis.com/UWYHeuuJISiGmgXx/arcgis/rest/services/Part1_Crime_Beta/FeatureServer/0/query?where=1%3D1&outFields=CrimeDateTime,Description,Weapon,Post,Gender,Age,Race,Ethnicity,Old_District,New_District,Neighborhood,Latitude,Longitude,GeoLocation,PremiseType,CCNumber&outSR=4326&f=geojson"
response = requests.get(url).json()['features']

# json_normalize already returns a dataframe with one row per feature
df = pd.json_normalize(response)
df.head(20)
#crime_data = pd.read_csv('/content/Part1_Crime_Beta_5307206680000182585.csv')
#pd.options.display.max_columns = None
#crime_data.info()

Unnamed: 0,type,geometry.type,geometry.coordinates,properties.CrimeDateTime,properties.Description,properties.Weapon,properties.Post,properties.Gender,properties.Age,properties.Race,properties.Ethnicity,properties.Old_District,properties.New_District,properties.Neighborhood,properties.Latitude,properties.Longitude,properties.GeoLocation,properties.PremiseType,properties.CCNumber,geometry
0,Feature,Point,"[-76.565836164, 39.3191443600001]",1306348800000,ROBBERY - COMMERCIAL,FIREARM,432,,,UNKNOWN,,NORTHEAST,,BELAIR-EDISON,39.31914435961279,-76.56583616424851,"(39.319144359612793,-76.565836164248523)",SPECIALTY STORE,11E12857,
1,Feature,Point,"[-76.598549885, 39.3053536640001]",1306364400000,BURGLARY,,314,M,22.0,BLACK_OR_AFRICAN_AMERICAN,,EASTERN,,OLIVER,39.305353663659936,-76.5985498853239,"(39.305353663659936,-76.598549885323905)",ROW/TOWNHOUSE-OCC,11E13250,
2,Feature,Point,"[-76.641410623, 39.3132686930001]",1306332000000,ROBBERY,,733,M,21.0,UNKNOWN,,WESTERN,,PENN NORTH,39.313268693324495,-76.64141062280643,"(39.313268693324503,-76.641410622806433)",STREET,11E13040,
3,Feature,Point,"[-76.596738527, 39.303750237]",1306300200000,AGG. ASSAULT,HANDS,323,F,21.0,BLACK_OR_AFRICAN_AMERICAN,,EASTERN,,GAY STREET,39.303750236998845,-76.59673852736083,"(39.303750236998845,-76.596738527360841)",ROW/TOWNHOUSE-OCC,11E12729,
4,Feature,Point,"[-76.665071166, 39.295408579]",1306346400000,RAPE,OTHER,814,F,20.0,BLACK_OR_AFRICAN_AMERICAN,,SOUTHWEST,,FRANKLINTOWN ROAD,39.295408579160274,-76.66507116570659,"(39.295408579160274,-76.665071165706578)",ALLEY,11E12890,
5,Feature,Point,"[-76.595887778, 39.2993266290001]",1306365300000,COMMON ASSAULT,,323,F,24.0,BLACK_OR_AFRICAN_AMERICAN,,EASTERN,,GAY STREET,39.29932662911089,-76.5958877779155,"(39.299326629110887,-76.595887777915493)",APT/CONDO - OCCUPIED,11E12997,
6,Feature,Point,"[-76.635572371, 39.3630236400001]",1306328400000,AUTO THEFT,,521,,,UNKNOWN,,NORTHERN,,NORTH ROLAND PARK/POPLAR HILL,39.36302364011421,-76.63557237056443,"(39.363023640114214,-76.635572370564432)",ALLEY,11E12713,
7,Feature,Point,"[-76.5587177639999, 39.361831174]",1306339200000,BURGLARY,,423,M,60.0,WHITE,,NORTHEAST,,HAMILTON HILLS,39.36183117433617,-76.55871776420942,"(39.361831174336167,-76.558717764209433)",OTHER/RESIDENTIAL,11E13429,
8,Feature,Point,"[-76.627703159, 39.323402747]",1306324800000,LARCENY,,511,F,30.0,WHITE,,NORTHERN,,HAMPDEN,39.32340274732665,-76.62770315920882,"(39.323402747326647,-76.627703159208821)",ROW/TOWNHOUSE-OCC,11F07160,
9,Feature,Point,"[-76.64609175, 39.299973811]",1306288140000,AUTO THEFT,,724,F,37.0,BLACK_OR_AFRICAN_AMERICAN,,WESTERN,,SANDTOWN-WINCHESTER,39.29997381095836,-76.64609175014702,"(39.299973810958363,-76.646091750147036)",ALLEY,11E04021,
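
To make the flattening step concrete, here is a minimal sketch of how `pd.json_normalize` turns nested GeoJSON features into dotted column names such as `properties.Description`. The two sample features are hand-built to mirror the first rows of the output above, with most fields left out; the names `sample_features` and `sample_df` are illustrative assumptions, not part of the project code.

```python
import pandas as pd

# Two hand-built GeoJSON features mirroring the structure of the API response
sample_features = [
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-76.565836164, 39.3191443600001]},
        "properties": {"Description": "ROBBERY - COMMERCIAL", "Weapon": "FIREARM"},
    },
    {
        "type": "Feature",
        "geometry": {"type": "Point", "coordinates": [-76.598549885, 39.3053536640001]},
        "properties": {"Description": "BURGLARY", "Weapon": None},
    },
]

# Nested keys are joined with "." to form the flattened column names
sample_df = pd.json_normalize(sample_features)
print(sorted(sample_df.columns))
```

This is why every attribute column in the real dataframe starts with `properties.` and the coordinate columns start with `geometry.`.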


In [None]:
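# Load the downloaded CSV (same path as the commented-out line in the previous cell)
crime_data = pd.read_csv('/content/Part1_Crime_Beta_5307206680000182585.csv')
pd.options.display.max_columns = None
crime_data.head(20)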

### Description of the Crime Data columns:
*   `RowID` - The unique ID for each entry in the dataset
*   `CCNumber` -
*   `CrimeCode` - The police Crime Code assigned to the crime
*   `Description` - The type of crime that was committed
*   `Inside_Outside` - Describes whether the crime occurred indoors or outdoors
*   `Weapon` - Describes what weapon was used, if any, in the crime
*   `Post` - The police post area the crime occurred in
*   `Gender` - The gender ('Female', 'Male', 'Undefined') of the perpetrator
*   `Age` - The age of the perpetrator
*   `Race` - The race of the perpetrator
*   `Ethnicity` - The ethnicity of the perpetrator
*   `Location` - The address of the crime
*   `Old_District` - The district where the crime occurred, according to the system used by Baltimore city before July 2023
*   `New_District` - The district where the crime occurred, according to the system used by Baltimore city after July 2023
*   `Neighborhood` - The neighborhood where the crime occurred
*   `Latitude` - Latitude coordinate of the crime location
*   `Longitude` - Longitude coordinate of the crime location
*   `GeoLocation` - Coordinates of the crime location to be used by ArcGIS
*   `PremiseType` - Brief description of the setting where the crime occurred, for example, "Convenience store"
*   `Total_Incidents` - The number of incidents covered by the entry (this is '1' for every entry)




## Data Parsing
To avoid errors in later calculations and manipulations, the 'CrimeDateTime' column was converted to datetime. Dates that could not be converted because they were out of range (for example, a crime recorded as happening in 1557) became null values and were removed. The column was then split into a 'Date' column and a 'Time' column.<br />


First, the column names of the dataframe were adjusted, because flattening the JSON response prepends "properties." to each attribute name.

In [None]:
# Show the column names as loaded, before any changes
print(df.columns)
# drop() returns a new dataframe, so the result must be assigned back
df = df.drop(columns=['type', 'geometry.type'])
properties = "properties."
geometry = "geometry."
# Return the column name without its "properties." or "geometry." prefix
def strip_prefix(name):
  if name.startswith(properties):
    return name[len(properties):]
  elif name.startswith(geometry):
    return name[len(geometry):]
  return name
# rename() applies the function to every column name
df = df.rename(columns=strip_prefix)
print(df.columns)

Index(['type', 'geometry.type', 'geometry.coordinates',
       'properties.CrimeDateTime', 'properties.Description',
       'properties.Weapon', 'properties.Post', 'properties.Gender',
       'properties.Age', 'properties.Race', 'properties.Ethnicity',
       'properties.Old_District', 'properties.New_District',
       'properties.Neighborhood', 'properties.Latitude',
       'properties.Longitude', 'properties.GeoLocation',
       'properties.PremiseType', 'properties.CCNumber', 'geometry'],
      dtype='object')

The 'Age' column was converted from float to integer values, and the frequency of each unique age was printed to show the distribution of ages. Outliers in the 'Age' column are removed later on.

In [None]:
# Convert all records in Age column to integer type
crime_data.Age = crime_data.Age.convert_dtypes(convert_integer=True)
# Convert all ages to positive values
crime_data.Age = crime_data.Age.abs()
# Look at the frequency of each age that is listed in this dataset
print(crime_data.Age.value_counts().sort_index())
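
As a small illustration of the conversion above, the sketch below runs the same two calls on a hypothetical series of ages (the values and the names `sample_ages`, `converted`, and `cleaned` are made up for this example). It shows that `convert_dtypes` chooses the nullable `Int64` dtype, so missing ages survive as `<NA>` rather than forcing the column back to float.

```python
import pandas as pd

# Hypothetical ages: floats with a missing value and one negative entry
sample_ages = pd.Series([22.0, None, -21.0, 60.0])

# Every non-missing float is a whole number, so the nullable Int64 dtype is chosen
converted = sample_ages.convert_dtypes(convert_integer=True)
print(converted.dtype)

# abs() flips negative ages to positive and leaves the missing value as <NA>
cleaned = converted.abs()
print(cleaned.tolist())
```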

In [None]:
# Convert all records in CrimeDateTime column to datetime format
crime_data.CrimeDateTime = pd.to_datetime(crime_data.CrimeDateTime, errors='coerce', format='mixed')
# Drop rows with null values in CrimeDateTime
crime_data.dropna(axis=0, subset=['CrimeDateTime'], inplace=True)
# Check result of the dropna
print("Null 'CrimeDateTime' values:", str(crime_data.CrimeDateTime.isna().sum()))
# Create a 'Date' column with just the date, and a 'Time' column with just the time, from each record in CrimeDateTime
crime_data['Date'] = [d.date() for d in crime_data['CrimeDateTime']]
crime_data['Time'] = [d.time() for d in crime_data['CrimeDateTime']]
# Check these new columns
print("'Date' Column: ", crime_data['Date'].head())
print("'Time' Column: ", crime_data['Time'].head())
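
The API output shown earlier stores `CrimeDateTime` as integers such as 1306348800000. Their magnitude suggests Unix timestamps in milliseconds, though that is an assumption and not something confirmed in this notebook. If it holds, a sketch of the conversion and the date/time split for the API dataframe could look like this (the names `raw`, `stamps`, `dates`, and `times` are illustrative):

```python
import pandas as pd

# Values copied from the API output above; treating them as epoch milliseconds is an assumption
raw = pd.Series([1306348800000, 1306364400000])

# Interpret the integers as milliseconds since 1970-01-01 UTC
stamps = pd.to_datetime(raw, unit='ms')

# Split each timestamp into a date part and a time part with the .dt accessor
dates = stamps.dt.date
times = stamps.dt.time
print(stamps.iloc[0], dates.iloc[0], times.iloc[0])
```

Under that assumption the first record falls in May 2011, which is consistent with its `CCNumber` beginning with `11`.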

In [None]:
# Drop the original CrimeDateTime column
crime_data = crime_data.drop('CrimeDateTime', axis=1)
crime_data.columns

In [None]:
# Select crime records dated on or after 2023-07-01
crime_data[pd.to_datetime(crime_data['Date']) >= '2023-07-01']

We removed the x and y columns since we don't plan on using them to map data points **(may change)**. The `Total_Incidents` column was also removed, as it only redundantly recorded a single instance for each crime event in the dataset.

In [None]:
# Remove the Total_Incidents column
crime_data.drop(['Total_Incidents'], axis=1, inplace=True)
crime_data.head()

## Data Exploration and Summary Statistics
After cleaning the features of the crime data, we examined trends within and across them.


In [None]:
# Columns to summarize (each listed once)
col_names = ['Description', 'Inside_Outside', 'Weapon', 'Gender', 'Age', 'Race', 'Ethnicity', 'Old_District', 'New_District', 'Neighborhood', 'PremiseType']
for i in col_names:
  print("Unique values in", i, ": \t", crime_data[i].unique())
  print("Frequency of unique values in", i, ": ", crime_data[i].value_counts())

### Detecting outliers in the data
A box plot of the 'Age' column showed that ages above 80 fall beyond the whiskers and are therefore treated as outliers. We then removed these records from the data.<br />


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
sns.boxplot(crime_data['Age'])
print(str(crime_data['Age'].max()))
plt.ylim(-10, 200)
plt.title(f'Age Column with Outliers')
plt.show()

In [None]:
removed_outliers = crime_data[crime_data['Age'] <= 80]
sns.boxplot(removed_outliers['Age'])
plt.title(f'Age Column without Outliers')
plt.show()
index_names = crime_data[crime_data['Age'] > 80].index
df = crime_data.copy()
df.drop(index_names, inplace=True)#drop outlier ages from the dataset
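
For context on what a box plot flags as an outlier: by default, matplotlib and seaborn draw the whiskers out to the last data point within 1.5 times the interquartile range (IQR) of the box, and anything beyond is plotted as an outlier. The sketch below computes those fences directly on hypothetical ages; the data and the variable names are made up, and the resulting fence is not the cutoff of 80 used in this project, which was read from the plot.

```python
import pandas as pd

# Hypothetical ages with one implausible value
ages = pd.Series([18, 20, 22, 25, 30, 35, 40, 45, 50, 120])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Points outside these fences are drawn as outliers by a default box plot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(upper_fence, outliers.tolist())
```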

In [None]:
descrip_counts = df['Description'].value_counts()

plt.bar(descrip_counts.index, descrip_counts, color='purple')
plt.title('Count Plot of Crime Descriptions')
plt.xlabel('Description')
plt.xticks(rotation=80)
plt.ylabel('Count')
plt.show()

### Hypothesis Test 1: Chi-Squared Test
**Null Hypothesis:** The district where a crime was committed has no impact on the likelihood of the nature (description) of the crime.<br />
**Alternate Hypothesis:** The district where a crime was committed has an impact on the likelihood of the nature (description) of the crime.<br />
For this test we used a Chi-Squared test, since both variables are categorical.

In [None]:
#Columns: district, description
contingency = pd.crosstab(df.Old_District, df.Description)
contingency

In [None]:
# Keep the Axes in its own variable; assigning the result to `plt` would shadow matplotlib.pyplot
ax = contingency.plot.bar(rot=45, xlabel="Old District", ylabel="Count", stacked=True)
ax.legend(prop={'size': 5})

In [None]:
import scipy.stats as stats
chi2_res = stats.chi2_contingency(contingency)
if chi2_res.pvalue < 0.05:
  print("Reject, p-value is", str(chi2_res.pvalue))
else:
  print("Fail to Reject, p-value is", str(chi2_res.pvalue))

Since the p-value is less than the 5% level of significance, the null hypothesis can be rejected. District is significantly associated with the type of crime that occurs.<br /><br />
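
A chi-squared test of independence takes "the two variables are unrelated" as its null hypothesis and compares the observed counts with the counts expected if that were true. The sketch below shows this on a small hypothetical table; the district and crime labels and all counts are invented for illustration and are not taken from the Baltimore data.

```python
import pandas as pd
from scipy import stats

# Hypothetical counts: rows are districts, columns are crime types
table = pd.DataFrame(
    [[30, 10, 20],
     [10, 30, 20]],
    index=["DISTRICT_A", "DISTRICT_B"],
    columns=["BURGLARY", "ROBBERY", "LARCENY"],
)

# expected holds the counts we would see if district and crime type were independent
chi2, p, dof, expected = stats.chi2_contingency(table)
print(chi2, p, dof)
print(expected)
```

Here every expected count is 20, so the large gaps in the first two columns drive the statistic up and the p-value down, which is evidence against independence.
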
### Hypothesis Test 2: ANOVA
**Null Hypothesis:** The description of the crime and the weapon used do not significantly impact one another.<br />
**Alternate Hypothesis:** The description of the crime and the weapon used do significantly impact one another.

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Group by 'Description' and 'Weapon'
grouped_description_weapon = df.groupby(['Description', 'Weapon']).size().reset_index(name='Count')
print(grouped_description_weapon)

In [None]:
sns.barplot(x='Description', y='Count', hue='Weapon', data=grouped_description_weapon)

plt.title('Group by Description and Weapon')
plt.xticks(rotation=45, horizontalalignment='right')
plt.legend(loc='upper right', bbox_to_anchor=(1.8, 1))
plt.show()

In [None]:
anova_data = [group['Count'].values for name, group in grouped_description_weapon.groupby('Description')]

#ANOVA
f_statistic, p_value = stats.f_oneway(*anova_data)

print("P-value:", p_value)

In [None]:
if p_value < 0.05:
    print("Reject")
else:
    print("Fail to reject")

Since the p-value is less than the 5% level of significance, the null hypothesis can be rejected. Therefore, the description of the crime and the weapon used do significantly impact one another.
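
To show what `stats.f_oneway` computes, the sketch below works out the F statistic by hand on three small hypothetical groups and compares it with SciPy's result. The numbers are made up; only the formula (between-group variance divided by within-group variance) carries over to the analysis above.

```python
import numpy as np
from scipy import stats

# Hypothetical counts for three groups
groups = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0]), np.array([5.0, 6.0, 7.0])]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Sum of squares between group means and within each group
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: (number of groups - 1) and (observations - number of groups)
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_manual = (ss_between / df_between) / (ss_within / df_within)

# SciPy should agree with the hand computation
f_scipy, p_value = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_value)
```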

### Hypothesis Test 3: T-Test
A t-test is a statistical test used to determine whether there is a significant difference between the means of two groups. It is typically used when the sample size is small (n < 30) or when the population standard deviation (σ) is unknown and must be estimated from the sample.

**Null Hypothesis:**  There is no difference in the average age of male and female perpetrators. <br>
**Alternate Hypothesis:** There is a difference in the average age of male and female perpetrators.

In [None]:
male_ages = df[(df['Gender'] == 'M')  & (df['Age'].notna())]['Age']
female_ages = df[(df['Gender'] == 'F')  & (df['Age'].notna())]['Age']
print(male_ages)
print(female_ages)

# t-test
t_statistic, p_value = stats.ttest_ind(male_ages, female_ages, equal_var=False)  # Use equal_var=False for Welch's t-test

print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

In [None]:
plt.figure(figsize=(12, 6))

# Box plot
plt.subplot(1, 2, 1)
plt.boxplot([male_ages, female_ages], labels=['Male', 'Female'])
plt.title('Box Plot of Ages by Gender')
plt.ylabel('Age')
plt.grid()

plt.show()

In [None]:
if p_value < 0.05:
    print("Reject")
else:
    print("Fail to reject")

Since the p value is less than a 5% level of significance, the null hypothesis can be rejected. The t-test shows that there is a significant difference in the average ages of male and female perpetrators in this data.
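
For reference, Welch's t-test (what `equal_var=False` selects) divides the difference in sample means by a standard error built from each group's own variance. The sketch below checks that formula against SciPy on two hypothetical age samples; the values and variable names are illustrative only.

```python
import numpy as np
from scipy import stats

# Hypothetical age samples for two groups
a = np.array([20.0, 22.0, 24.0, 26.0, 28.0])
b = np.array([30.0, 35.0, 40.0, 45.0, 50.0])

# Standard error from each group's sample variance (ddof=1) and size
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = (a.mean() - b.mean()) / se

# SciPy's Welch t-test should give the same statistic
t_scipy, p_value = stats.ttest_ind(a, b, equal_var=False)
print(t_manual, t_scipy, p_value)
```

The sign of the statistic only reflects which group was passed first; the two-sided p-value is what the 5% threshold is compared against.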

# Primary Analysis