# Project Assignment B: "The Viz and the Notebook"

The deliverables for the final project are:

A website with your visualizations and accompanying text. I recommend you structure it as one of the key types of data story (cf. the Segel paper). The website should tell the story about the data that you're interested in getting across.

* The website should contain visualizations that let the reader explore the data you're interested in getting across. Some of them should be interactive.


Your behind-the-scenes analysis can be as technical and advanced as you like (in fact, the goal is to show that you can combine data analysis, machine learning, and data visualization). The website itself, however, should not be technical: it should use visualization and explanation to get your data-driven insights across to a non-scientific reader (think of a friend from DTU who hasn't taken this class as the audience).


The idea is that you can create much more complex, dynamic, and interactive analyses (and visualizations) using the possibilities that a website affords. It is a way to present your work so that everyone can understand it (like something you could show your parents).


An explainer Jupyter Notebook. The explainer notebook should contain all the behind-the-scenes data analysis, details on the dataset, why you've selected these particular visualizations, explanations of methodology, etc.


## More about the website
The main point of the website is to present your idea/analyses to the world in a way that showcases your use of what you've learned in class. The website should be self-contained and tell the story without the need for the details in the explainer notebook (the purpose of the explainer notebook is to provide additional details for interested/scientific readers).

## More on the explainer notebook
The notebook should contain your analysis and code. Please structure it into the following sections:

1.  Motivation.
* What is your dataset?
* Why did you choose this/these particular dataset(s)?
* What was your goal for the end user's experience?
2.  Basic stats. Let's understand the dataset better
* Write about your choices in data cleaning and preprocessing
* Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.
3. Data Analysis
* Describe your data analysis and explain what you've learned about the dataset.
* If relevant, talk about your machine-learning.
4. Genre. Which genre of data story did you use?
* Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer)? Why?
* Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer)? Why?
5. Visualizations.
* Explain the visualizations you've chosen.
* Why are they right for the story you want to tell?
6. Discussion. Think critically about your creation
* What went well?
* What is still missing? What could be improved? Why?
7. Contributions. Who did what?
* Write (just briefly) which group member was mainly responsible for which elements of the assignment. (I want you all to understand every part of the assignment, but usually someone takes the lead on certain portions of the work. That is what you should explain.)
* It is not OK simply to write "All group members contributed equally".
8. Make sure that you use references when they're needed and follow academic standards.

## Section 5 - Visualizations

We have initially chosen to visualize the yearly development of domestic crimes of the assault and battery type, to see whether anything changed around the pandemic outbreak in 2020. This period is interesting because it shows the general trend in the years before the Covid-19 outbreak as well as in the years that followed. It can help show whether the change that occurred in connection with the 2020 pandemic has stabilized.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np

#First read in the csv file 
data = pd.read_csv("Crimes_-_2001_to_Present.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'Crimes_-_2001_to_Present.csv'
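
The `FileNotFoundError` above means the CSV was not in the working directory when the cell ran. A minimal sketch of a more defensive load follows. The helper name `load_crimes` and the constant `USED_COLUMNS` are hypothetical and not part of the original analysis; the column selection is an assumption based on the columns this notebook uses later.

```python
from pathlib import Path

import pandas as pd

# Columns used later in this notebook; reading only these keeps memory down
# on a file with millions of rows. (Assumed selection, adjust as needed.)
USED_COLUMNS = ["Date", "Year", "Primary Type", "Description", "Domestic"]


def load_crimes(csv_path):
    """Read the crime CSV, failing early with a readable message (hypothetical helper)."""
    path = Path(csv_path)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected the dataset at {path.resolve()}; download it or fix the path."
        )
    return pd.read_csv(path, usecols=USED_COLUMNS)
```

With the file in place, `data = load_crimes("Crimes_-_2001_to_Present.csv")` would stand in for the plain `pd.read_csv` call.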

In [2]:
data['Year'] = pd.to_datetime(data['Year'], format='%Y')
print(data['Year'].dtype)

datetime64[ns]


In [3]:
data['Date'] = pd.to_datetime(data['Date'])
data['Date']

0         2007-08-25 09:22:18
1         2021-05-24 15:06:00
2         2021-06-26 09:24:00
3         2023-11-09 07:30:00
4         2023-11-12 07:59:00
                  ...        
8044461   2023-04-18 08:00:00
8044462   2023-08-07 18:00:00
8044463   2023-06-20 19:00:00
8044464   2023-08-26 00:00:00
8044465   2023-07-01 19:29:00
Name: Date, Length: 8044466, dtype: datetime64[ns]

In [19]:
# Keep 2015-2023 and only domestic incidents of type BATTERY or ASSAULT
data_filtered = data[(data['Date'].dt.year >= 2015) & (data['Date'].dt.year < 2024)]
data_filtered = data_filtered[(data_filtered['Domestic'] == True) & (data_filtered['Primary Type'].isin(['BATTERY', 'ASSAULT']))]

In [20]:
plot_data = data_filtered.groupby(data_filtered['Date'].dt.year)['Domestic'].count()
x = plot_data.index
y = list(plot_data)

In [21]:
import plotly.express as px
fig = px.bar(x=x,
             y=y, 
             orientation='v', 
             title='Domestic assault and battery crimes per year',
             labels={'y': 'Domestic assault and battery count', 'x': 'Year'},
             )
fig.update_layout(showlegend=False)
fig.show()

The outbreak of Covid-19 in early 2020 led to a nationwide lockdown: public and private establishments were closed and people were advised to stay home. This societal shift caused concern for particularly vulnerable families, as it risked escalating already fragile situations with potentially harmful outcomes.
While this concern became reality in several big cities in the US, other cities such as Chicago did not see an increase in domestic violence reports or arrests. Chicago did experience an increase in calls to Domestic Violence Hotlines (insert source), yet fewer domestic violence related arrests were made in 2020. It is tempting to read this decline as fewer people experiencing domestic violence during the pandemic, but that was not necessarily the case, as Chicago Domestic Violence Hotlines received a large spike in calls after the outbreak (insert source). People working in the field have weighed in on this unexpected outcome: they stated that the lack of privacy for victims of domestic violence could limit their ability to safely ask for help and report these crimes.

### Change in distribution over the years

In [5]:
import pandas as pd
import matplotlib.pyplot as plt 
import numpy as np

#First read in the csv file 
data = pd.read_csv("Crimes_-_2001_to_Present.csv")

In [6]:
data['Date'] = pd.to_datetime(data['Date'])

In [7]:
data_filtered = data[(data['Date'].dt.year >= 2015) & (data['Date'].dt.year < 2024)]

First, we look at the distribution of domestic battery crimes by description.

In [None]:
# For simplicity's sake, we only keep the crime descriptions that explicitly contain "DOMESTIC"
# and that occur more than 1000 times.

# See below:
data_filtered[(data_filtered['Primary Type'] == 'BATTERY') & (data_filtered["Domestic"] == True)].groupby("Description").size().sort_values(ascending=False)

This gives us:

| Description | Count |
|---|---|
| DOMESTIC BATTERY SIMPLE | 199921 |
| AGGRAVATED DOMESTIC BATTERY: OTHER DANG WEAPON | 5745 |
| AGGRAVATED DOMESTIC BATTERY - OTHER DANGEROUS WEAPON | 4038 |
| AGGRAVATED DOMESTIC BATTERY: KNIFE/CUTTING INST | 3773 |
| AGGRAVATED DOMESTIC BATTERY - KNIFE / CUTTING INSTRUMENT | 2247 |
| AGG. DOMESTIC BATTERY - HANDS, FISTS, FEET, SERIOUS INJURY | 3228 |
| AGGRAVATED DOMESTIC BATTERY: HANDS/FIST/FEET SERIOUS INJURY | 2419 |

Some categories are written slightly differently but mean the same thing. We cannot tell whether the records under the smaller variant are also counted under the larger one, so we treat them as separate records and sum their counts.
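
One way to probe whether two spellings overlap or are simply a renaming is to tabulate the counts per year for each spelling: if one spelling stops where the other starts, the records are disjoint and summing them is safe. A minimal sketch on made-up rows follows; the frame `toy` and its dates are invented for illustration and are not values from the dataset.

```python
import pandas as pd

# Toy records, invented for illustration only.
toy = pd.DataFrame({
    "Date": pd.to_datetime([
        "2016-03-01", "2017-06-12", "2018-01-20",
        "2021-02-03", "2022-09-14", "2023-11-30",
    ]),
    "Description": [
        "AGGRAVATED DOMESTIC BATTERY: KNIFE/CUTTING INST",
        "AGGRAVATED DOMESTIC BATTERY: KNIFE/CUTTING INST",
        "AGGRAVATED DOMESTIC BATTERY: KNIFE/CUTTING INST",
        "AGGRAVATED DOMESTIC BATTERY - KNIFE / CUTTING INSTRUMENT",
        "AGGRAVATED DOMESTIC BATTERY - KNIFE / CUTTING INSTRUMENT",
        "AGGRAVATED DOMESTIC BATTERY - KNIFE / CUTTING INSTRUMENT",
    ],
})

# Rows: year, columns: spelling, cells: number of records.
per_year = pd.crosstab(toy["Date"].dt.year, toy["Description"])

# Years in which both spellings occur; an empty list would suggest a renaming.
both_used = per_year[(per_year > 0).all(axis=1)].index.tolist()

print(per_year)
print("Years with both spellings:", both_used)
```

Running the same `pd.crosstab` on `data_filtered` restricted to one pair of spellings would show whether this pattern holds in the real data.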


In [11]:
# Gather all descriptions, to filter out the relevant ones
battery_domestic_desc = data_filtered[(data_filtered['Primary Type'] == 'BATTERY') & (data_filtered["Domestic"] == True)]["Description"].unique()

In [74]:
crime_dict = {
    'Simple': [],
    'Aggravated: Other Dangerous Weapon': [],
    'Aggravated: Knife/Cutting Instrument': [],
    'Aggravated: Hands/Fist/Feet Serious Injury': []
}

for description in battery_domestic_desc:
    if "DOMESTIC" not in description:
        continue
    if "SIMPLE" in description:
        crime_dict['Simple'].append(description)
    elif "OTHER DANG WEAPON" in description or "- OTHER DANGEROUS WEAPON" in description:
        crime_dict['Aggravated: Other Dangerous Weapon'].append(description)
    elif "KNIFE/CUTTING INST" in description or "KNIFE / CUTTING INSTRUMENT" in description:
        crime_dict['Aggravated: Knife/Cutting Instrument'].append(description)
    elif "HANDS, FISTS, FEET, SERIOUS INJURY" in description or "HANDS/FIST/FEET SERIOUS INJURY" in description:
        crime_dict['Aggravated: Hands/Fist/Feet Serious Injury'].append(description)


In [75]:
bookehData = pd.DataFrame(columns=["Date"])

bookehData["Date"] = range(2015,2024)


# For each merged category, sum the yearly counts of all its description variants
for category, descriptions in crime_dict.items():
    yearly_counts = data_filtered[data_filtered["Description"] == descriptions[0]].groupby(data_filtered["Date"].dt.year).size()
    for description in descriptions[1:]:
        variant_counts = data_filtered[data_filtered["Description"] == description].groupby(data_filtered["Date"].dt.year).size()
        yearly_counts = yearly_counts.add(variant_counts, fill_value=0)

    # One column per category, aligned on the year
    yearly_counts = yearly_counts.reset_index(name=category).fillna(0)
    bookehData = pd.merge(bookehData, yearly_counts, on=["Date"], how="left")

In [76]:
bookehData.fillna(0,inplace=True)
bookehData

| | Date | Simple | Aggravated: Other Dangerous Weapon | Aggravated: Knife/Cutting Instrument | Aggravated: Hands/Fist/Feet Serious Injury |
|---|---|---|---|---|---|
| 0 | 2015 | 24619 | 1019.0 | 691.0 | 152.0 |
| 1 | 2016 | 24752 | 1089.0 | 728.0 | 255.0 |
| 2 | 2017 | 23832 | 1132.0 | 718.0 | 462.0 |
| 3 | 2018 | 24291 | 1175.0 | 757.0 | 590.0 |
| 4 | 2019 | 23608 | 1210.0 | 773.0 | 860.0 |
| 5 | 2020 | 20686 | 1127.0 | 722.0 | 673.0 |
| 6 | 2021 | 19625 | 1077.0 | 609.0 | 745.0 |
| 7 | 2022 | 18689 | 973.0 | 488.0 | 806.0 |
| 8 | 2023 | 19819 | 981.0 | 534.0 | 1104.0 |
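
Because "Simple" dominates the raw counts, a change in the distribution is easier to see as shares. The sketch below copies the yearly counts from the table above and computes each category's share per year together with the relative change in the yearly total. The names `counts`, `shares`, and `total_change` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Yearly counts copied from the table above.
counts = pd.DataFrame(
    {
        "Simple": [24619, 24752, 23832, 24291, 23608, 20686, 19625, 18689, 19819],
        "Aggravated: Other Dangerous Weapon": [1019, 1089, 1132, 1175, 1210, 1127, 1077, 973, 981],
        "Aggravated: Knife/Cutting Instrument": [691, 728, 718, 757, 773, 722, 609, 488, 534],
        "Aggravated: Hands/Fist/Feet Serious Injury": [152, 255, 462, 590, 860, 673, 745, 806, 1104],
    },
    index=pd.Index(range(2015, 2024), name="Year"),
)

# Share of each category within a year (each row sums to 1).
shares = counts.div(counts.sum(axis=1), axis=0)

# Relative change in the yearly total compared with the previous year
# (the first year has no predecessor and is NaN).
total_change = counts.sum(axis=1).pct_change()

print(shares.round(3))
print(total_change.round(3))
```

In these numbers, the yearly total across the four categories falls from 26451 in 2019 to 23208 in 2020.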


In [77]:
bookehData['Date'] = pd.to_datetime(bookehData['Date'], format='%Y')


In [78]:
bookehData['Date']

0   2015-01-01
1   2016-01-01
2   2017-01-01
3   2018-01-01
4   2019-01-01
5   2020-01-01
6   2021-01-01
7   2022-01-01
8   2023-01-01
Name: Date, dtype: datetime64[ns]

In [79]:
# Set 'Date' as index
bookehData.set_index('Date', inplace=True)

# Resample data by year and sum values for each column
df_grouped = bookehData.resample('Y').sum().reset_index()

In [80]:
df_melted = df_grouped.melt(id_vars='Date', var_name='Column', value_name='Value')


In [81]:
df_melted["Date"] = df_melted["Date"].dt.year

In [83]:
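import plotly.express as px

# Horizontal stacked bars: one bar per year, one colored segment per crime category.
# The earlier color_discrete_map was dropped because its keys ('Column1', ...) matched no category.
fig = px.bar(df_melted, x='Value', y='Date', color='Column', orientation='h',
             title='Domestic battery crimes per year, by category',
             labels={'Value': 'Total', 'Date': 'Year', 'Column': 'Category'},
             barmode='stack')

fig.show()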