#   Crime Classification Task
### Project Purpose
The "Crime Classification Task" project applies data science and machine learning to the efficient and accurate classification of criminal activity, a pressing challenge in law enforcement and public safety. The goal is a robust system that assigns each crime to its appropriate category.

###  Import necessary libraries
-   These libraries provide various functions and tools needed for data manipulation and analysis.

In [None]:
#Import Libraries
%pip install plotnine
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display, Markdown
from plotnine import *
theme_set(theme_gray())
%matplotlib inline

### Load the dataset into a pandas dataframe and preview it

In [None]:
#Load the dataset into a pandas dataframe and preview the first rows
crime= pd.read_csv('crimedata.csv')
crime.head(5)

In [None]:
#view Crime shape
crime.shape

In [None]:
#View crime Info
crime.info()

#### Data preprocessing and cleaning
This section prepares the crime dataset for analysis. Preprocessing ensures the quality and reliability of the data, which matters especially in a complex domain like crime classification. The steps include handling missing values, refining data types, and dropping unnecessary columns.

In [None]:
#View the missing values in the dataset
crime.isna().sum()
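
Raw counts of missing values are easier to judge next to the share of rows they represent. The following is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from the crime dataset); it assumes only that missing entries are stored as NaN/None.

```python
import pandas as pd

# Toy frame standing in for the crime data (values are invented for illustration)
toy = pd.DataFrame({
    "Mocodes": ["0344", None, "1822", None],
    "Vict Sex": ["F", "M", None, "X"],
    "Vict Age": [31, 0, 45, 27],
})

# isna() gives a boolean frame; the column mean is the fraction of missing rows
missing_pct = toy.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))
```

Columns with a high share of missing values are candidates for a default value or for dropping, which is how the steps below treat them.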

#### Handling Missing Values in 'Mocodes': 
-  Assign a default value of '1501 Other MO' to missing entries in the 'Mocodes' column, as per the MO codes document. This step ensures that our dataset does not have gaps in this critical field, which could impact the analysis.

In [None]:
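#DATA CLEANSING
#assign '1501 Other MO' to missing values for unknown MO
#taken from the MO codes document
#assign the result back: fillna(inplace=True) on a selected column is unreliable in recent pandas
crime['Mocodes'] = crime['Mocodes'].fillna('1501 Other MO')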

#### Filling Missing Values for Victim Sex and Descent
-   For the 'Vict Sex' and 'Vict Descent' columns, missing values are assigned 'X' to indicate unknown data. This approach maintains data integrity while acknowledging the absence of specific information. 

In [None]:
#assign X to missing values for unknown victim sex
crime['Vict Sex'] = crime['Vict Sex'].fillna('X')

In [None]:
#assign X to missing values for unknown Vict Descent
crime['Vict Descent'] = crime['Vict Descent'].fillna('X')

####    Dropping Missing Rows in 'Premis Desc'
-   Remove rows with missing values in the 'Premis Desc' column. This assumes that the premise description is vital for the analysis and that missing values here could lead to inaccurate classifications.

In [None]:
#drop the missing rows for premise description
crime = crime.dropna(subset=['Premis Desc'])

####    Addressing Missing Weapon Information
-   For the 'Weapon Used Cd' and 'Weapon Desc' columns, missing values are filled with the default code '500' and the description 'UNKNOWN WEAPON/OTHER WEAPON', respectively, based on the data lacity crime numerical codes. This step keeps the weapon-related data consistent.

In [None]:
#Assign 500 = UNKNOWN WEAPON/OTHER WEAPON for missing weapon used cd
crime['Weapon Used Cd'] = crime['Weapon Used Cd'].fillna(500)

In [None]:
#Assign UNKNOWN WEAPON/OTHER WEAPON for missing weapon description
crime['Weapon Desc'] = crime['Weapon Desc'].fillna('UNKNOWN WEAPON/OTHER WEAPON')
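
Several per-column defaults can also be applied in one call by passing a dictionary to `DataFrame.fillna`. This is a sketch under stated assumptions: the toy rows and the weapon labels `EXAMPLE WEAPON A/B` are invented, while the two default values are the ones described above.

```python
import pandas as pd

# Toy rows; the non-missing weapon values are placeholders, not real codes
toy = pd.DataFrame({
    "Weapon Used Cd": [400.0, None, 102.0],
    "Weapon Desc": ["EXAMPLE WEAPON A", None, "EXAMPLE WEAPON B"],
})

# One default per column; columns not listed here are left untouched
defaults = {
    "Weapon Used Cd": 500,
    "Weapon Desc": "UNKNOWN WEAPON/OTHER WEAPON",
}

# fillna returns a new frame, so the original stays unchanged
filled = toy.fillna(value=defaults)
print(filled)
```

Keeping the defaults in a single dictionary makes it easier to see, in one place, which placeholder each column receives.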

####    Streamlining the Dataset by Dropping Columns
-   We remove several columns ('Part 1-2', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4') from our dataset. This is done to focus on the most relevant data and to simplify our analysis framework.

In [None]:
#drop the columns Part 1-2 and Crm Cd 1 to Crm Cd 4 from the dataframe
crime.drop(columns=['Part 1-2', 'Crm Cd 1', 'Crm Cd 2', 'Crm Cd 3', 'Crm Cd 4'], inplace=True)

####    Handling Missing Values in 'Cross Street'
-   Missing values in the 'Cross Street' column are filled with 'None', ensuring a complete dataset for geographical analysis.

In [None]:
#assign 'None' to missing values for unknown Cross Street
crime['Cross Street'] = crime['Cross Street'].fillna('None')


####    Refining the 'Date Rptd' Column
-   The 'Date Rptd' column is converted to a datetime format, and we extract 'Year', 'Month', and 'Day' as separate columns. This step enhances our capability to perform time-based analysis on the crime data.

In [None]:
#Convert 'Date Rptd' to a datetime dtype
#separate Year, Month, Day from the date the crime was reported
crime['Date Rptd'] = pd.to_datetime(crime['Date Rptd'])
crime['Year'] = crime['Date Rptd'].dt.year
crime['Month'] = crime['Date Rptd'].dt.month
crime['Day'] = crime['Date Rptd'].dt.day
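
When the date strings follow a known layout, passing an explicit format makes the parsing unambiguous. The sketch below assumes a hypothetical `%m/%d/%Y` layout and invented values; the real file's format should be checked before reusing it.

```python
import pandas as pd

# Hypothetical date strings, including one that cannot be parsed
toy = pd.DataFrame({"Date Rptd": ["01/08/2020", "12/31/2021", "not a date"]})

# An explicit format avoids day/month ambiguity;
# errors="coerce" turns unparseable entries into NaT instead of raising
toy["Date Rptd"] = pd.to_datetime(toy["Date Rptd"], format="%m/%d/%Y", errors="coerce")

# The .dt accessor exposes the individual date parts
toy["Year"] = toy["Date Rptd"].dt.year
toy["Month"] = toy["Date Rptd"].dt.month
toy["Day"] = toy["Date Rptd"].dt.day
print(toy)
```

Rows that end up as NaT can then be counted with `isna()` and handled like any other missing value.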

####    Column Renaming for Clarity
-   Rename the columns to snake_case to improve the readability and consistency of the dataset. This aligns with Python's naming conventions and makes the code that follows more intuitive.



In [None]:
#rename column
new_column_names = {'Date Rptd': 'date_rptd', 
                    'DATE OCC': 'date_occ', 
                    'TIME OCC': 'time_occ', 
                    'AREA NAME': 'area_name', 
                    'Rpt Dist No': 'rpt_dist_no', 
                    'Crm Cd': 'crm_cd', 
                    'Crm Cd Desc': 'crm_cd_desc', 
                    'Vict Age': 'vict_age', 
                    'Vict Sex': 'vict_sex', 
                    'Vict Descent': 'vict_descent', 
                    'Premis Cd': 'premis_cd', 
                    'Premis Desc': 'premis_desc', 
                    'Weapon Used Cd': 'weapon_used_cd', 
                    'Weapon Desc': 'weapon_desc', 
                    'Status Desc': 'status_desc',  
                    'Cross Street': 'cross_street'
                   }

crime.rename(columns=new_column_names, inplace=True)
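
As an alternative to an explicit mapping, snake_case names can be derived programmatically. This is only a sketch on an empty toy frame with a few of the column names listed above. Note that it would also lowercase columns such as 'AREA' and 'LAT', which later cells in this notebook reference in uppercase, so it is not a drop-in replacement.

```python
import pandas as pd

# Empty toy frame carrying a few of the original column names
toy = pd.DataFrame(columns=["Date Rptd", "AREA NAME", "Crm Cd Desc", "LAT"])

# Strip whitespace, lowercase, and replace spaces with underscores
toy.columns = (
    toy.columns.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
)
print(toy.columns.tolist())
# → ['date_rptd', 'area_name', 'crm_cd_desc', 'lat']
```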


####    Categorizing Crimes Based on UCR Reporting
-   Crimes are categorised according to the Uniform Crime Reporting (UCR) standards. This involves defining categories like Homicide, Rape, Robbery, etc., and mapping them to their respective crime codes. This classification is essential for a structured analysis and for understanding the nature and severity of different crimes in the dataset.

In [None]:
#Assign Categories based on UCR reporting
# Define the crime code categories
# Based on UCR REPORTING – Return A (Based on date of reporting) in pdf file UCR.COMPSTAT062618
HOMICIDE = [110, 113]
RAPE = [121, 122, 815, 820, 821]
ROBBERY = [210, 220]
AGG_ASSAULTS = [230, 231, 235]
DOMESTIC_VIOLENCE = [626, 627, 647, 763, 928, 930, 236, 250, 251, 761, 926]
SIMPLE_ASSAULT = [435, 436, 437, 622, 623, 624, 625]
BURGLARY = [310, 320]
MVT = [510, 520, 433]
BTFV = [330, 331, 410, 420, 421]
PERSONAL_THFT = [350, 351, 352, 353, 450, 451, 452, 453]


# Define a function to map crime codes to categories
def map_category(code):
    if code in HOMICIDE:
        return 'HOMICIDE'
    elif code in RAPE:
        return 'RAPE'
    elif code in ROBBERY:
        return 'ROBBERY'
    elif code in AGG_ASSAULTS:
        return 'AGG.ASSAULTS'
    elif code in DOMESTIC_VIOLENCE:
        return 'Domestic.Violence'
    elif code in SIMPLE_ASSAULT:
        return 'SIMPLE.ASSAULT'
    elif code in BURGLARY:
        return 'BURGLARY'
    elif code in MVT:
        return 'MVT'
    elif code in BTFV:
        return 'BTFV'
    elif code in PERSONAL_THFT:
        return 'PERSONAL.THFT'
    else:
        return 'OTHER.THEFT'

# Check if the 'crm_cd' column exists in the DataFrame
if 'crm_cd' in crime.columns:
    # Map the crime codes to categories
    crime['crm_cd'] = crime['crm_cd'].apply(map_category)
else:
    print("The 'crm_cd' column does not exist in the DataFrame.")
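
The same lookup can be expressed with a flat dictionary and `Series.map`, which avoids the long if/elif chain. The sketch below uses only three of the code lists defined above to stay short; the names `CATEGORY_CODES` and `code_to_category` are introduced here for illustration.

```python
import pandas as pd

# Subset of the code lists defined above, kept short for illustration
CATEGORY_CODES = {
    "HOMICIDE": [110, 113],
    "ROBBERY": [210, 220],
    "BURGLARY": [310, 320],
}

# Invert to a flat {code: category} lookup
code_to_category = {
    code: name for name, codes in CATEGORY_CODES.items() for code in codes
}

codes = pd.Series([110, 220, 310, 999])

# map() yields NaN for codes missing from the lookup; fillna supplies the default
categories = codes.map(code_to_category).fillna("OTHER.THEFT")
print(categories.tolist())
# → ['HOMICIDE', 'ROBBERY', 'BURGLARY', 'OTHER.THEFT']
```

A vectorized `map` is usually faster than `apply` with a Python function on large frames, and the dictionary doubles as documentation of the mapping.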

####    Feature Engineering for Crime Categories
-   Engineer a new feature, crime_category, based on the type of crime (e.g., Assault, Larceny) to gain deeper insights. This categorization is derived from the earlier classification and helps in further breaking down the analysis into more specific crime types indicated by the UCR code.

In [None]:
#Feature engineering for crime category (assault or larceny)
# Create a dictionary to map crime codes to crime categories
crime_dict = {
    "BTFV": "Larceny",
    "BURGLARY": "Larceny",
    "ROBBERY": "Larceny",
    "MVT": "Larceny",
    "OTHER.THEFT": "Larceny",
    "PERSONAL.THFT": "Larceny",
    "RAPE": "Assault",
    "AGG.ASSAULTS": "Assault",
    "Domestic.Violence": "Assault",
    "SIMPLE.ASSAULT": "Assault",
    "HOMICIDE": "Unknown"
}

# Define a function to apply the dictionary to the "crm_cd" column
def categorize_crime(crm_cd):
    if crm_cd in crime_dict:
        return crime_dict[crm_cd]
    else:
        return None

# Apply the function to create a new column called "crime_category"
crime["crime_category"] = crime["crm_cd"].apply(categorize_crime)


####    Feature Engineering for Crime Types (Violent or Property)
Another layer of the analysis distinguishes between violent and property crimes. A new feature, crime_type, categorizes each crime as either 'Violent' or 'Property' based on the UCR document. This distinction is crucial for understanding the nature of crimes and for potential policy and resource allocation decisions.

In [None]:
#Feature Engineering for crime type (violent or Property)
# Create a dictionary to map crime codes to crime types
crime_dict = {
    "BTFV": "Property",
    "MVT": "Property",
    "BURGLARY": "Property",
    "OTHER.THEFT": "Property",
    "PERSONAL.THFT": "Property",
    "HOMICIDE": "Violent",
    "RAPE": "Violent",
    "ROBBERY": "Violent",
    "AGG.ASSAULTS": "Violent",
    "Domestic.Violence": "Violent",
    "SIMPLE.ASSAULT": "Violent"
}

# Define a function to apply the dictionary to the "crm_cd" column
def categorize_crime_type(crm_cd):
    if crm_cd in crime_dict:
        return crime_dict[crm_cd]
    else:
        return None

# Apply the function to create a new column called "crime_type"
crime["crime_type"] = crime["crm_cd"].apply(categorize_crime_type)


In [None]:
#drop the column date_rptd
crime.drop('date_rptd', axis=1, inplace=True)

In [None]:
crime.isna().sum()

In [None]:
crime.head(5)

In [None]:
# save the DataFrame to a CSV file (to_csv returns None when given a path, so nothing is assigned)
crime.to_csv('allcleanedcrimedata', index=False)
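
Because the file is written with `index=False`, how it is read back determines what becomes the index. The round trip below is a sketch using an in-memory buffer and a hypothetical `record_id` column; it shows that `index_col=0` promotes the first data column to the index.

```python
import io
import pandas as pd

# Toy frame; "record_id" is a hypothetical first column
toy = pd.DataFrame({"record_id": [101, 102], "crm_cd": ["ROBBERY", "MVT"]})

# Write without the index, as in the cell above
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default read: every column stays a regular column
plain = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column becomes the index and leaves the column list
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(plain.columns.tolist())
print(indexed.columns.tolist(), indexed.index.name)
```

Whether the first column should act as the index depends on whether later steps need it as a feature.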

### Data Analysis
-   Analyse and visualise the data

In [None]:
#Load in the cleaned dataset into a pandas dataframe, print statistics
crime_data = pd.read_csv('allcleanedcrimedata', index_col=0)
#make a copy of the dataframe
crime_data_copy = crime_data.copy()
crime_data.head(5)

In [None]:
#print the shape explicitly; only the last expression in a cell is displayed automatically
print(crime_data.shape)
crime_data.info()

####    Statistical Summary of Crime Data
Gain a quantitative understanding of the dataset through the describe() function. This function provides a statistical summary of all numeric columns in the crime_data DataFrame. Key statistics include:

-   count: The number of non-null entries in each column.
-   mean: The mean value of each column.
-   std: The standard deviation, which measures the spread of the data.
-   min: The minimum value in each column.
-   25% (first quartile): The value below which 25% of the data falls.
-   50% (median): The middle value of the dataset.
-   75% (third quartile): The value below which 75% of the data falls.
-   max: The maximum value in each column.

These statistics are invaluable for understanding the distribution, variability, and central tendencies of our data, which are critical for informed data analysis and modeling.

In [None]:
#DESCRIPTIVE ANALYSIS
crime_data.describe()
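
To make the summary statistics concrete, the sketch below runs `describe()` on five invented ages whose statistics are easy to verify by hand.

```python
import pandas as pd

# Five invented ages, evenly spaced so the quartiles are easy to check
ages = pd.Series([20, 30, 40, 50, 60], name="vict_age")

summary = ages.describe()
print(summary)

# The mean and median coincide because the values are symmetric around 40
print(summary["mean"], summary["50%"])
```

Here the quartiles are 30, 40, and 50, and the standard deviation uses the sample formula (dividing by n - 1), which is the pandas default.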

In [None]:
# select only numeric columns
numeric_cols = crime_data.select_dtypes(include='number').columns

# calculate the skewness of the numeric columns
crime_data[numeric_cols].skew()

In [None]:
# select only numeric columns
numeric_cols = crime_data.select_dtypes(include='number').columns

# calculate the kurtosis of the numeric columns
crime_data[numeric_cols].kurtosis()
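
As a rough guide to reading these numbers, the sketch below compares a symmetric series with a right-tailed one. The values are invented. pandas reports excess kurtosis, so a normal distribution scores about 0.

```python
import pandas as pd

# Symmetric values: skewness should be (numerically) zero
symmetric = pd.Series([1, 2, 3, 4, 5])

# One large value creates a long right tail: skewness should be positive
right_tailed = pd.Series([1, 1, 1, 2, 10])

print("skew (symmetric):   ", symmetric.skew())
print("skew (right-tailed):", right_tailed.skew())

# Flat, evenly spread data has negative excess kurtosis (lighter tails than normal)
print("kurtosis (symmetric):", symmetric.kurtosis())
```

Strongly positive skew in a column such as victim age would suggest a long tail of high values worth inspecting before modeling.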

####    Distribution of Crimes Over Years
A histogram plot of the 'Year' column from our crime_data dataset is used to visualize how crime incidents are distributed over the years. A histogram is an excellent tool for showing the frequency distribution of numerical data. It will provide insights into:

-   Trends Over Time: Identifying which years had higher or lower crime rates.
-   Data Skewness: Understanding if the data is skewed towards certain years.
-   Data Outliers: Spotting any unusual spikes or drops in crime incidents in specific years.

This histogram will help us understand the temporal patterns in our crime data, which can be crucial for trend analysis and forecasting future crime rates.

In [None]:
crime_data.Year.plot.hist()

In [None]:
#Univariate Analysis

In [None]:
crime_data.crm_cd_desc.value_counts()

In [None]:
crime_data.area_name.value_counts()
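
`value_counts` must be called with parentheses to return counts, and `normalize=True` turns the counts into shares. A small sketch with invented area labels:

```python
import pandas as pd

# Invented area labels for illustration
areas = pd.Series(
    ["North", "North", "South", "North", "South", "East"], name="area_name"
)

# Absolute frequency of each label, sorted in descending order
counts = areas.value_counts()

# Relative frequency as a percentage
shares = areas.value_counts(normalize=True).mul(100).round(1)

print(counts)
print(shares)
```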

####    Use boxplots to analyze the distribution of various key variables in our crime_data dataset.
Generate boxplots for the following variables:

-   Victim Age (vict_age): To observe the age distribution of crime victims.
-   Latitude (LAT) and Longitude (LON): To analyze the geographical spread of crime incidents.
-   Year, Month, and Day: These plots will help us understand the temporal distribution of crimes over years, months, and days.
-   Weapon Used Code (weapon_used_cd): To explore the distribution of different types of weapons used in crimes.
-   Area (AREA): To see the distribution of crimes across different areas.
-   Time of Occurrence (time_occ): To examine the times when crimes are most likely to occur.
-   Reporting District Number (rpt_dist_no): To analyze the distribution of crimes across different police reporting districts.

In [None]:
crime_data.vict_age.plot.box(vert=False)


In [None]:
crime_data.LAT.plot.box(vert=False)

In [None]:
crime_data.LON.plot.box(vert=False)

In [None]:
crime_data.Year.plot.box(vert=False)

In [None]:
crime_data.Month.plot.box(vert=False)

In [None]:
crime_data.Day.plot.box(vert=False)

In [None]:
crime_data.weapon_used_cd.plot.box(vert=False)

In [None]:
crime_data.AREA.plot.box(vert=False)

In [None]:
crime_data.time_occ.plot.box(vert=False)

In [None]:
crime_data.rpt_dist_no.plot.box(vert=False)
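
The points drawn beyond the whiskers of a boxplot follow the 1.5 × IQR rule, which is matplotlib's default. The sketch below reproduces that rule by hand on invented ages, so the flagged values can be listed rather than only seen in the plot.

```python
import pandas as pd

# Invented ages with one implausibly high value
ages = pd.Series([18, 22, 25, 30, 34, 41, 47, 99])

# First and third quartiles, and the interquartile range between them
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside these fences are drawn as individual points in a boxplot
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = ages[(ages < lower_fence) | (ages > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
# → [99]
```

For code-like columns such as weapon_used_cd or rpt_dist_no, these fences describe the spread of the codes themselves, not a meaningful numeric scale, so the outliers there should be read with care.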

In [None]:
#Exploratory data analysis

In [None]:
crime_type = crime_data.groupby('crm_cd').size()
print(crime_type)

####    In-Depth Crime Data Analysis and Visualization
-   This section delves into a thorough analysis and visualization of the crime data, focusing on different aspects such as crime types, crime categories, geographical distribution, and temporal patterns. Various groupings, categorization, and plotting techniques are employed to extract meaningful insights from the dataset.

-   Analyzing Crime Types and Categories
Distribution of Crime Types: Using a histogram, the distribution of different crime types (crm_cd) was examined. This visualization helps identify which crime types are most prevalent in our dataset.

In [None]:
#Type of Crime：Identify the type of crime that occurs the most
crime_grouped = crime_data.groupby("crm_cd").size().reset_index(name="n")
crime_grouped["prop"] = round(crime_grouped["n"]/crime_grouped["n"].sum() * 100, 2)

fig, ax = plt.subplots(figsize=(15, 5))

sns.histplot(data=crime_grouped, x="crm_cd", weights="prop", kde=False, ax=ax, color="blue")

sns.set_style("white")
ax.set_title("Distribution of Crime Types")
ax.set_xlabel("Type of Crime")
ax.set_ylabel("Proportion (%)")

plt.show()
##Other theft accounts for most of the crime, as shown in the plot below

-   Most Common Crime Categories: Analyze the crime categories to determine the most frequent categories. This is visualized using a histogram, offering a clear view of the proportional representation of each category.

In [None]:
#Identify the crime category that occurs the most
##Type of Crime
crime_grouped = crime_data.groupby("crime_category").size().reset_index(name="n")
crime_grouped["prop"] = round(crime_grouped["n"]/crime_grouped["n"].sum() * 100, 2)

fig, ax = plt.subplots(figsize=(6, 4))

sns.histplot(data=crime_grouped, x="crime_category", weights="prop", kde=False, ax=ax, color="blue")

sns.set_style("white")
ax.set_title("Distribution of Crime Category")
ax.set_xlabel("Category of Crime")
ax.set_ylabel("Proportion (%)")

plt.show()

-   Distribution of Crime Types: Similarly, we examine the overall split between violent and property crimes (crime_type). This is visualized using a histogram of each type's proportion.

In [None]:
#Type of Crime：Identify the type of crime that occurs the most
crime_grouped = crime_data.groupby("crime_type").size().reset_index(name="n")
crime_grouped["prop"] = round(crime_grouped["n"]/crime_grouped["n"].sum() * 100, 2)

fig, ax = plt.subplots(figsize=(6, 4))

sns.histplot(data=crime_grouped, x="crime_type", weights="prop", kde=False, ax=ax, color="blue")

sns.set_style("white")
ax.set_title("Distribution of Crime Types")
ax.set_xlabel("Type of Crimes")
ax.set_ylabel("Proportion (%)")

plt.show()

-   Geographical Distribution of Crimes
Distribution by Area Name: A bar chart is used to visualize the distribution of crimes across different areas. This helps us identify areas with higher crime incidences.

In [None]:
# Count the occurrences of each area name
area_counts = crime_data["area_name"].value_counts()

# Create a bar chart to visualize the distribution of area names
fig, ax = plt.subplots(figsize=(12,6))
sns.barplot(x=area_counts.index, y=area_counts.values, ax=ax, color="blue")
sns.set_style("white")
ax.set_title("Distribution of Crime by Area Name")
ax.set_xlabel("Area Name")
ax.set_ylabel("Count")
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")

plt.show()

-   Top 10 High Crime Areas: We further narrow down our analysis to the top 10 areas with the highest number of crimes, both for the larger areas and subareas. This is visualized using horizontal bar charts, providing a clear picture of where the majority of crimes are concentrated.

In [None]:
#Location where the crime took place：The top ten areas where most crime took place
# Find the Top 10 Regions where the most crimes happened
top10 = (crime_data.groupby(['area_name', 'AREA'])
         .size()
         .reset_index(name='n')
         .sort_values(by='n', ascending=False)
         .head(10))

# Create a new column 'Area' by concatenating 'area_name' and 'AREA'
top10['Area'] = top10['area_name'] + '_' + top10['AREA'].astype(str)

# Map each unique area_name to a unique color
color_dict = {area: plt.cm.tab20(i) for i, area in enumerate(top10['area_name'].unique())}

# Map area_name column to colors using the color_dict
colors = [color_dict[area] for area in top10['area_name']]

# Plot the Top 10 Regions
plt.figure(figsize=(8,4))
plt.barh(top10['Area'], top10['n'], color=colors)
plt.title('Top ten areas with the most crimes committed')
plt.xlabel('Number of crimes committed')
plt.ylabel('Area')
plt.show()

-   Categorization of Crimes in High Crime Areas
Crime Categories in Top Areas: For the top crime-prone areas, we analyze the distribution of different crime categories. This is depicted through stacked bar charts, offering insights into the types of crimes that are most common in these areas.

In [None]:
#DISTRIBUTION OF TOP TEN HIGH CRIME AREAS AND CRIME CATEGORY
# Assuming pandas library is imported and crime_data is the data frame
top_10_areas = crime_data['AREA'].value_counts().nlargest(10).index.tolist()

# Create a new column 'Area' by concatenating 'area_name' and 'AREA'
crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['AREA'].astype(str)

crime_in_top_10 = crime_data[crime_data['AREA'].isin(top_10_areas)]
crime_categories_in_top_10 = crime_in_top_10['crime_category'].unique().tolist()

crime_counts = crime_in_top_10.groupby(['Area', 'crime_category']).size().unstack()

# Sort the bars in descending order of total crime counts for each area
crime_counts = crime_counts.loc[crime_counts.sum(axis=1).sort_values(ascending=False).index]

# Pass figsize to DataFrame.plot: it creates its own figure, so a preceding plt.figure() stays empty
crime_counts.plot(kind='barh', stacked=True, alpha=0.7, figsize=(8, 4))
plt.xlabel('Number of Crimes')
plt.title('Top 10 High Crime Areas and Crime Categories')
plt.legend(loc='upper right')
plt.show()
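
The stacked bars are drawn from a wide count table with one row per area and one column per category. The sketch below builds such a table from invented rows. `unstack(fill_value=0)` is an addition to the code above that replaces missing combinations with 0 instead of NaN.

```python
import pandas as pd

# Invented rows; labels mimic the 'Area' and 'crime_category' columns
toy = pd.DataFrame({
    "Area": ["A_1", "A_1", "A_1", "B_2", "B_2", "C_3"],
    "crime_category": ["Larceny", "Larceny", "Assault", "Larceny", "Assault", "Larceny"],
})

# Long counts -> wide table (rows: areas, columns: categories)
counts = (
    toy.groupby(["Area", "crime_category"])
    .size()
    .unstack(fill_value=0)
)

# Order rows by total count so the busiest area comes first
counts = counts.loc[counts.sum(axis=1).sort_values(ascending=False).index]
print(counts)

# Plotting (not run here): counts.plot(kind="barh", stacked=True, figsize=(8, 4))
```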


In [None]:
#DISTRIBUTION OF TOP TEN HIGH CRIME AREAS AND CRIME TYPES

# Assuming pandas library is imported and crime_data is the data frame
top_10_areas = crime_data['AREA'].value_counts().nlargest(10).index.tolist()

# Create a new column 'Area' by concatenating 'area_name' and 'AREA'
crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['AREA'].astype(str)

crime_in_top_10 = crime_data[crime_data['AREA'].isin(top_10_areas)]
crime_categories_in_top_10 = crime_in_top_10['crime_type'].unique().tolist()

crime_counts = crime_in_top_10.groupby(['Area', 'crime_type']).size().unstack()

# Sort the bars in descending order of total crime counts for each area
crime_counts = crime_counts.loc[crime_counts.sum(axis=1).sort_values(ascending=False).index]

# Pass figsize to DataFrame.plot: it creates its own figure, so a preceding plt.figure() stays empty
crime_counts.plot(kind='barh', stacked=True, alpha=0.7, figsize=(8, 4))
plt.xlabel('Number of Crimes')
plt.title('Top 10 High Crime Areas and Crime Types')
plt.legend(loc='upper right')
plt.show()


In [None]:
#Location where the crime took place：The top ten subareas where most crime took place
# Find the Top 10 Regions where the most crimes happened
top10 = (crime_data.groupby(['area_name', 'rpt_dist_no'])
         .size()
         .reset_index(name='n')
         .sort_values(by='n', ascending=False)
         .head(10))

# Create a new column 'Area' by concatenating 'AREA NAME' and 'Rpt Dist No'
top10['Area'] = top10['area_name'] + '_' + top10['rpt_dist_no'].astype(str)

# Map each unique area_name to a unique color
color_dict = {area: plt.cm.tab20(i) for i, area in enumerate(top10['area_name'].unique())}

# Map area_name column to colors using the color_dict
colors = [color_dict[area] for area in top10['area_name']]

# Plot the Top 10 Regions
plt.figure(figsize=(10,5))
plt.barh(top10['Area'], top10['n'], color=colors)
plt.title('Top ten subareas with the most crimes committed')
plt.xlabel('Number of crimes committed')
plt.ylabel('Area')
plt.show()

In [None]:
#Location where the crime took place：The top ten subareas with crime Categories
top_10_areas = crime_data['rpt_dist_no'].value_counts().nlargest(10).index.tolist()


# Create a new column 'Area' by concatenating 'AREA NAME' and 'Rpt Dist No'
crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['rpt_dist_no'].astype(str)

# Create a new column 'Area' by concatenating 'area_name' and 'AREA'
#crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['AREA'].astype(str)

crime_in_top_10 = crime_data[crime_data['rpt_dist_no'].isin(top_10_areas)]
crime_categories_in_top_10 = crime_in_top_10['crime_category'].unique().tolist()

crime_counts = crime_in_top_10.groupby(['Area', 'crime_category']).size().unstack()

# Sort the bars in descending order of total crime counts for each area
crime_counts = crime_counts.loc[crime_counts.sum(axis=1).sort_values(ascending=False).index]

# Pass figsize to DataFrame.plot: it creates its own figure, so a preceding plt.figure() stays empty
crime_counts.plot(kind='barh', stacked=True, alpha=0.7, figsize=(10, 6))
plt.xlabel('Number of Crimes')
plt.title('Top 10 High Crime SubAreas and Crime Categories')
plt.legend(loc='upper right')
plt.show()

In [None]:

#Location where the crime took place：The top ten subareas where most crime took place with Crime types
# Assuming pandas library is imported and crime_data is the data frame
top_10_areas = crime_data['rpt_dist_no'].value_counts().nlargest(10).index.tolist()


# Create a new column 'Area' by concatenating 'AREA NAME' and 'Rpt Dist No'
crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['rpt_dist_no'].astype(str)

# Create a new column 'Area' by concatenating 'area_name' and 'AREA'
#crime_data['Area'] = crime_data['area_name'] + '_' + crime_data['AREA'].astype(str)

crime_in_top_10 = crime_data[crime_data['rpt_dist_no'].isin(top_10_areas)]
crime_categories_in_top_10 = crime_in_top_10['crime_type'].unique().tolist()

crime_counts = crime_in_top_10.groupby(['Area', 'crime_type']).size().unstack()

# Sort the bars in descending order of total crime counts for each area
crime_counts = crime_counts.loc[crime_counts.sum(axis=1).sort_values(ascending=False).index]

# Pass figsize to DataFrame.plot: it creates its own figure, so a preceding plt.figure() stays empty
crime_counts.plot(kind='barh', stacked=True, alpha=0.7, figsize=(10, 6))
plt.xlabel('Number of Crimes')
plt.title('Top 10 High Crime SubAreas and Crime Types')
plt.legend(loc='upper right')
plt.show()

-   Temporal Analysis of Crime Data
Time Series Analysis of Crime Reports: Finally, we conduct a time series analysis by plotting the number of crime reports over time. This graph helps us understand trends and patterns in crime occurrences over the years.

In [None]:
#Number of crime reports over time, shown as a time series of daily report counts
#The date is rebuilt from the Year, Month and Day columns (derived from the reporting date)
crime_data['Date'] = pd.to_datetime(crime_data[['Year', 'Month', 'Day']])

# Group the crime reports by date and count the number of reports per day
daily_counts = crime_data.groupby('Date').size()

# Plot the daily crime report counts as a time series
plt.figure(figsize=(8, 5))
plt.plot(daily_counts.index, daily_counts.values)
plt.title('Crime Reports Over Time')
plt.xlabel('Date')
plt.ylabel('Number of Reports')
plt.show()
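
Daily counts are often noisy, and days without any report are simply absent after a `groupby`. The sketch below, on invented dates, fills those gaps with 0 and adds a rolling mean. The 2-day window is an arbitrary choice for the toy data; a longer window would suit the real series.

```python
import pandas as pd

# Invented report dates; 2020-01-03 has no reports at all
dates = pd.to_datetime([
    "2020-01-01", "2020-01-01", "2020-01-02",
    "2020-01-04", "2020-01-04", "2020-01-04",
])
toy = pd.DataFrame({"Date": dates})

# Reports per day (days without reports are missing from the result)
daily = toy.groupby("Date").size()

# Reindex onto a full daily calendar so empty days count as 0
full_range = pd.date_range(daily.index.min(), daily.index.max(), freq="D")
daily = daily.reindex(full_range, fill_value=0)

# Rolling mean to smooth day-to-day noise (first value is NaN by construction)
smoothed = daily.rolling(window=2).mean()
print(daily.tolist())
print(smoothed.tolist())
```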


####    Crime Reports Analysis by Year
-   This section visualizes the number of crime reports for each year in the dataset. Convert the existing 'Year', 'Month', and 'Day' columns into a single datetime column 'Date', and then create a bar plot to display the frequency of crime reports per year. This visualization helps in understanding trends and changes in crime rates over time.

In [None]:
#Get the Crime report by Year
crime_data['Date'] = pd.to_datetime(crime_data[['Year', 'Month', 'Day']])

# Extract the year from the date column and plot a bar plot of the counts
crime_data['Date'].dt.year.value_counts().sort_index().plot(kind='bar')

plt.title('Crime Reports by Year')
plt.xlabel('Year')
plt.ylabel('Number of Reports')
plt.show()


####    Crime Reports Analysis by Month and Year
Next, delve deeper into the temporal distribution of crime by examining crime reports on a month-by-year basis. By grouping the data by both year and month, create a comprehensive bar chart that reveals the monthly crime trends across different years. This detailed view aids in identifying seasonal patterns or specific months with unusually high or low crime rates.

In [None]:
#Crime report by Month/year
# Assuming the crime_data DataFrame has columns called "Year", "Month", and "Day"
crime_data['Date'] = pd.to_datetime(crime_data[['Year', 'Month', 'Day']])

# Group the crime reports by year and month and count the number of reports per month
monthly_counts = crime_data.groupby([crime_data['Date'].dt.year, crime_data['Date'].dt.month]).size()

# Plot the monthly crime report counts as a bar chart
fig, ax = plt.subplots(figsize=(15, 5))
monthly_counts.plot(kind='bar', ax=ax)
ax.set_title('Crime Reports by Year and Month')
ax.set_xlabel('Year, Month')
ax.set_ylabel('Number of Reports')
plt.show()

In [None]:
#Crime occurrence by region and month, 2020
# Filter the data to only include the year 2020
crime_data_2020 = crime_data[crime_data['Year'] == 2020]

# Group the data by region, year, and month to get the total number of crimes in each region for each month and year in 2020
crime_data_grouped = crime_data_2020.groupby(['AREA', 'Year', 'Month', 'crm_cd_desc']).size().reset_index(name='crime_count')  

# Plot the data using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(crime_data_grouped.pivot_table(values='crime_count', index='AREA', columns=['Month', 'Year'], aggfunc='sum', fill_value=0), cmap='YlOrRd', annot=True, fmt='g')
plt.title('Crime Occurrences by Region and Month in 2020')
plt.xlabel('Month')
plt.ylabel('Region')
plt.show()
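
The matrix behind such a heatmap is a plain area-by-month count table. As a sketch of an alternative to grouping and then pivoting, `pd.crosstab` builds it in one step. The rows below are invented, and the column names mirror 'AREA' and 'Month'.

```python
import pandas as pd

# Invented incidents: one row per crime, with its area code and month
toy = pd.DataFrame({
    "AREA": [1, 1, 2, 2, 2, 3],
    "Month": [1, 2, 1, 1, 3, 2],
})

# Rows: areas, columns: months, cells: number of incidents (0 where none occurred)
table = pd.crosstab(toy["AREA"], toy["Month"])
print(table)

# The table can be passed to a heatmap as is, e.g. sns.heatmap(table, annot=True, fmt="g")
```

Since every cell is a simple count, the table total equals the number of rows in the filtered data, which is a quick sanity check before plotting.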


In [None]:
#Crime occurrence by region and month, 2021
# Filter the data to only include the year 2021
crime_data_2021 = crime_data[crime_data['Year'] == 2021]

# Group the data by region, year, and month to get the total number of crimes in each region for each month and year in 2021
crime_data_grouped = crime_data_2021.groupby(['AREA', 'Year', 'Month', 'crm_cd_desc']).size().reset_index(name='crime_count')  

# Plot the data using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(crime_data_grouped.pivot_table(values='crime_count', index='AREA', columns=['Month', 'Year'], aggfunc='sum', fill_value=0), cmap='YlOrRd', annot=True, fmt='g')
plt.title('Crime Occurrences by Region and Month in 2021')
plt.xlabel('Month')
plt.ylabel('Region')
plt.show()

In [None]:
#Crime occurrence by region and month, 2022
# Filter the data to only include the year 2022
crime_data_2022 = crime_data[crime_data['Year'] == 2022]

# Group the data by region, year, and month to get the total number of crimes in each region for each month and year in 2022
crime_data_grouped = crime_data_2022.groupby(['AREA', 'Year', 'Month', 'crm_cd_desc']).size().reset_index(name='crime_count')  

# Plot the data using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(crime_data_grouped.pivot_table(values='crime_count', index='AREA', columns=['Month', 'Year'], aggfunc='sum', fill_value=0), cmap='YlOrRd', annot=True, fmt='g')
plt.title('Crime Occurrences by Region and Month in 2022')
plt.xlabel('Month')
plt.ylabel('Region')
plt.show()

In [None]:
# Use groupby() and first() methods to get the corresponding AREA for each area_name
area_mapping = crime_data.groupby('area_name')['AREA'].first()

# Print the area_mapping Series
print(area_mapping)



In [None]:
#Crime occurrence by region and month, 2023
# Filter the data to only include the year 2023
crime_data_2023 = crime_data[crime_data['Year'] == 2023]

# Group the data by region, year, and month to get the total number of crimes in each region for each month and year in 2023
crime_data_grouped = crime_data_2023.groupby(['AREA', 'Year', 'Month', 'crm_cd_desc']).size().reset_index(name='crime_count')  

# Plot the data using a heatmap
plt.figure(figsize=(10, 5))
sns.heatmap(crime_data_grouped.pivot_table(values='crime_count', index='AREA', columns=['Month', 'Year'], aggfunc='sum', fill_value=0), cmap='YlOrRd', annot=True, fmt='g')
plt.title('Crime Occurrences by Region and Month in 2023')
plt.xlabel('Month')
plt.ylabel('Region')
plt.show()

In [None]:
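#astype returns a new Series, so assign it back for the conversion to take effect
crime_data['vict_age'] = crime_data['vict_age'].astype(int)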

In [None]:
#distribution of victim ages for different areas

In [None]:
#Victim age distribution: this plot allows us to visualize the distribution of victim ages 
#in the crime_data dataset, and to identify any patterns or trends in the data. 
#We can see how many victims fall into each age range, and whether there are any age
#groups that are overrepresented or underrepresented in the dataset.

In [None]:
#Victim Age by Areas: show the distribution of victim ages across the selected areas. 
#each histogram represents a different selected area, 
#and the color of each histogram will be based on the selected color palette.

#The purpose of the code is to provide a visual representation of the distribution of victim ages 
#across selected areas to help understand the crime data.

####    Victim Age Distribution Analysis
-   Analyze the age distribution of crime victims by plotting a histogram of the 'vict_age' column. This visualization gives insights into which age groups are most frequently affected by crimes. Following this, categorize the ages into groups like 'Children', 'Youth', 'Adult', and 'Senior' to better understand the age demographics of the victims. Also create a pie chart to visualize the overall distribution of age categories.

In [None]:
#Victim Age Distribution
plt.figure(figsize=(10,5))
plt.hist(crime_data['vict_age'], bins=range(0, 100, 5), color='lightblue', edgecolor='grey')
plt.title('Victim Age Distribution')
plt.xlabel('Age')
plt.ylabel('Number of Victims')
plt.show()

In [None]:
#Define the age category from the vict_age column
def Age_Category(col):
    # Ages of 0 or below are treated as unknown
    if col <= 0:
        return 'Unknown Age'
    elif col <= 14:
        return 'Children'
    elif col <= 24:
        return 'Youth'
    elif col <= 64:
        return 'Adult'
    else:
        return 'Senior'


In [None]:
#Get the age distribution
age_distribution = crime_data['vict_age'].apply(Age_Category)
age_distribution.value_counts()
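
-   The same grouping can also be expressed with `pd.cut`, which bins the whole column at once instead of calling a function per row. The sketch below is a minimal illustration on made-up integer ages (the `ages` Series is hypothetical and stands in for `crime_data['vict_age']`); the labels follow the group names used in the text.

```python
import pandas as pd

# Hypothetical sample ages; in the notebook this would be crime_data['vict_age'].
ages = pd.Series([0, 7, 14, 15, 24, 25, 64, 65, 90, -1])

# Bin edges are right-inclusive: (0, 14], (14, 24], (24, 64], (64, inf].
bins = [0, 14, 24, 64, float('inf')]
labels = ['Children', 'Youth', 'Adult', 'Senior']

age_category = pd.cut(ages, bins=bins, labels=labels)

# Ages <= 0 fall outside every bin and come back as NaN; label them explicitly.
age_category = age_category.cat.add_categories('Unknown Age').fillna('Unknown Age')

print(age_category.value_counts())
```

Because the result is categorical, the groups keep a fixed order (Children, Youth, Adult, Senior, Unknown Age), which can make later plots easier to read.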

In [None]:
#Plot bar chart to see the age distribution
plt.figure(figsize=(10,5))
age_distribution = crime_data['vict_age'].apply(Age_Category)
age_counts = age_distribution.value_counts()

sns.countplot(x=age_distribution)
plt.title('Age Distribution')
plt.xlabel('Age Category')
plt.ylabel('Number of Victims')
plt.show()

In [None]:
#Distribution of age category
crime_data['Age_Category'] = crime_data['vict_age'].apply(Age_Category)

age_count = crime_data.groupby(['Age_Category', 'area_name']).size().reset_index(name='counts')

age_count.groupby('Age_Category')['counts'].sum().plot(kind='pie', figsize=(10, 5), autopct='%1.1f%%')
plt.title('Victim Age distribution')
plt.ylabel('')
plt.show()

####    Victim Age Distribution by Region
-   Explore the intersection of victim age and geography by calculating and plotting the distribution of victim ages across different regions. Stacked bar charts depict the distribution of victim ages in all areas and, separately, in the top 10 high-crime areas. These visualizations show how victim age profiles vary across regions.

In [None]:
#Get the Victim age by Area Distribution
# Calculate the crime count by age category and area name
age_count = crime_data.groupby(['Age_Category', 'area_name']).size().reset_index(name='counts')

# Calculate the total crime count for each area name
total_counts = age_count.groupby('area_name')['counts'].sum().reset_index(name='total_counts')

# Sort the data by total crime count for each area name in ascending order
total_counts = total_counts.sort_values(by='total_counts', ascending=True)

# Use the sorted area names to sort the stacked bars in the plot
sorted_area_names = total_counts['area_name'].tolist()

# Pivot the data to create a stacked bar plot with sorted area names
age_count_pivot = age_count.pivot(index='area_name', columns='Age_Category', values='counts')
age_count_pivot = age_count_pivot.loc[sorted_area_names]
age_count_pivot.plot(kind='bar', stacked=True, figsize=(10, 4))

# Add title and axis labels
plt.title('Victim Age and Area distribution')
plt.xlabel('Area Name')
plt.ylabel('Number of Victims')

# Show the plot
plt.show()


In [None]:
#Get the Victim Age by Area Distribution in the top 10 Areas
# Calculate the total crime count for each area name
total_counts = age_count.groupby('area_name')['counts'].sum().reset_index(name='total_counts')

# Sort the data by total crime count for each area name in descending order and get top 10 areas
top10_area_names = total_counts.sort_values(by='total_counts', ascending=False).head(10)['area_name'].tolist()

# Filter the age_count dataframe to keep only the top 10 areas
age_count_top10 = age_count[age_count['area_name'].isin(top10_area_names)]

# Use the sorted area names to sort the stacked bars in the plot
sorted_area_names = age_count_top10.groupby('area_name')['counts'].sum().sort_values().index.tolist()

# Pivot the data to create a stacked bar plot with sorted area names
age_count_top10_pivot = age_count_top10.pivot(index='area_name', columns='Age_Category', values='counts')
age_count_top10_pivot = age_count_top10_pivot.loc[sorted_area_names]
age_count_top10_pivot.plot(kind='bar', stacked=True, figsize=(10, 4))

# Add title and axis labels
plt.title('Victim Age and Area distribution in Top 10 Areas')
plt.xlabel('Area Name')
plt.ylabel('Number of Victims')

# Show the plot
plt.show()
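
-   Stacked counts are dominated by how many crimes each area has in total, so it can help to compare shares as well. The sketch below is one possible way to do that with `pd.crosstab`; the `sample` frame and its values are made up for illustration and only reuse the `area_name` and `Age_Category` column names from the notebook.

```python
import pandas as pd

# Hypothetical records; in the notebook these columns come from crime_data.
sample = pd.DataFrame({
    'area_name': ['Central', 'Central', 'Central', 'Central', 'Hollywood', 'Hollywood'],
    'Age_Category': ['Adult', 'Adult', 'Youth', 'Senior', 'Adult', 'Youth'],
})

# Rows are areas, columns are age categories.
# normalize='index' divides each row by its total, so every row sums to 1.
age_share = pd.crosstab(sample['area_name'], sample['Age_Category'], normalize='index')

print(age_share.round(2))
```

Calling `age_share.plot(kind='bar', stacked=True)` on the result would give bars of equal height, which makes differences in age profile between areas easier to see.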


-   Victim Sex Distribution: this plot visualizes the distribution of victim sex in the crime_data dataset and helps identify any imbalances in the data. It shows how many victims are male, how many are female, and whether any other categories of sex are present.

####    Victim Sex Distribution Analysis
-   Visualize the distribution of the victims' sex with a bar chart that displays the number of crimes affecting each sex. This analysis helps in understanding the gender-based dynamics of crime victimization.

In [None]:
#Get the Victim Sex Distribution
plt.figure(figsize=(10,5))

# Count victims per sex once so that the labels and the bar heights stay aligned
# (unique() and value_counts() return their values in different orders)
sex_counts = crime_data['vict_sex'].value_counts()

# Create a bar plot of victim sex distribution
plt.bar(sex_counts.index, sex_counts.values, color='lightblue', edgecolor='grey')

# Add title and axis labels
plt.title('Victim Sex Distribution')
plt.xlabel('Sex')
plt.ylabel('Number of Victims')

# Display the plot
plt.show()

-   Victim sex distribution by region. The chart shows that, except for the Southeast region, male victims are more common than female victims in every region. For example, in the Central region, about 60% of the victims are male and 40% are female. In the Hollywood region, about 70% of the victims are male and 30% are female. In the Southeast region, the proportion of male and female victims is roughly equal.
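
-   Per-region shares like these can be computed by grouping on the area and normalizing the counts within each group. The sketch below is a minimal example on made-up rows: the `sample` frame is hypothetical, reuses only the `area_name` and `vict_sex` column names, and does not reproduce the figures quoted above.

```python
import pandas as pd

# Hypothetical records; the real values would come from crime_data.
sample = pd.DataFrame({
    'area_name': ['Central', 'Central', 'Central', 'Central', 'Southeast', 'Southeast'],
    'vict_sex': ['M', 'M', 'M', 'F', 'M', 'F'],
})

# value_counts(normalize=True) within each area gives the share of each sex;
# unstack turns the sex values into columns, filling absent combinations with 0.
sex_share = (
    sample.groupby('area_name')['vict_sex']
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(sex_share)
```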

####   Victim Descent Distribution Analysis
-   Examine the distribution of victims' descent to understand the ethnic or racial backgrounds most affected by crimes. A bar chart is utilized to represent the frequency of different descents in the dataset, providing insights into which groups are more commonly victimized.

In [None]:
#Analyzing and visualizing the distribution of victim descent across different areas
#Victim distribution in the dataset

plt.figure(figsize=(10, 5))

# Count the number of occurrences of each victim descent
counts = crime_data['vict_descent'].value_counts()

# Create a bar chart of the counts using seaborn
sns.barplot(x=counts.index, y=counts.values)

# Set the title and axis labels
plt.title('Victim Descent Distribution')
plt.xlabel('Victim Descent')
plt.ylabel('Count')

# Show the chart
plt.show()


#####   Descent Codes
-   A: Other Asian
-   B: Black
-   C: Chinese
-   D: Cambodian
-   F: Filipino
-   G: Guamanian
-   H: Hispanic/Latin/Mexican
-   I: American Indian/Alaskan Native
-   J: Japanese
-   K: Korean
-   L: Laotian
-   O: Other
-   P: Pacific Islander
-   S: Samoan
-   U: Hawaiian
-   V: Vietnamese
-   W: White
-   X: Unknown
-   Z: Asian Indian
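
-   To make the chart readable without looking up each letter, the codes can be translated into labels before plotting. The sketch below is one way to do this with `Series.map`; the dictionary is copied from the descent code list above, while the `codes` Series and the fallback label for unmapped values are assumptions made for illustration.

```python
import pandas as pd

# Code-to-label mapping taken from the descent code list.
descent_labels = {
    'A': 'Other Asian', 'B': 'Black', 'C': 'Chinese', 'D': 'Cambodian',
    'F': 'Filipino', 'G': 'Guamanian', 'H': 'Hispanic/Latin/Mexican',
    'I': 'American Indian/Alaskan Native', 'J': 'Japanese', 'K': 'Korean',
    'L': 'Laotian', 'O': 'Other', 'P': 'Pacific Islander', 'S': 'Samoan',
    'U': 'Hawaiian', 'V': 'Vietnamese', 'W': 'White', 'X': 'Unknown',
    'Z': 'Asian Indian',
}

# Hypothetical sample; in the notebook this would be crime_data['vict_descent'].
# The '-' entry stands in for any value that is not in the mapping.
codes = pd.Series(['H', 'W', 'B', 'X', '-'])

# map() returns NaN for codes missing from the dictionary; treat those as 'Unknown'.
descent_names = codes.map(descent_labels).fillna('Unknown')

print(descent_names.value_counts())
```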

####    Analysis of Weapon Usage in Crimes
-   Analyze the types and proportions of weapons used in crimes. By counting the occurrences of each weapon description and calculating their proportions, we identify the most commonly used weapons. A bar chart is then used to illustrate these proportions, highlighting the prevalence of different types of weapons in criminal activities.

In [None]:
#Visualising the weapons used and their distribution across the data
# Count the number of occurrences of each weapon description
weapon_counts = crime_data['weapon_desc'].value_counts()

# Calculate the proportion of each weapon description out of all records with a weapon description
# (value_counts() skips missing values by default)
weapon_props = weapon_counts / weapon_counts.sum()

# Filter out any weapon descriptions that have a proportion less than 0.5%
weapon_props = weapon_props[weapon_props >= 0.005]

# Create a bar chart to show the proportion of different types of weapons used in crimes
plt.figure(figsize=(15, 3))
plt.bar(weapon_props.index, weapon_props)
plt.xticks(rotation=90)
plt.xlabel('Weapon used')
plt.ylabel('Proportion')
plt.title('Weapon Proportions in Crime Data')
plt.show()
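
-   The denominator matters when reading these proportions. If `weapon_desc` still contains missing values at this point, the shares above are relative to crimes with a recorded weapon, not to all crimes. The sketch below shows the difference using `value_counts(normalize=True)`; the `weapons` Series and its weapon names are made-up placeholders.

```python
import pandas as pd

# Hypothetical sample; None marks records with no weapon description.
weapons = pd.Series(['HAND GUN', 'KNIFE', None, 'HAND GUN', None, 'HAND GUN', None, None])

# Share among records that have a weapon recorded (missing values are dropped by default).
share_recorded = weapons.value_counts(normalize=True)

# Share among all records, counting missing values as their own group.
share_all = weapons.value_counts(normalize=True, dropna=False)

print(share_recorded)
print(share_all)
```

In this toy sample the same weapon accounts for three quarters of the recorded weapons but well under half of all records, so the chart title or axis label should state which denominator is used.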