# **Information Visualization**
## **First project:** Analysis of Mass Shootings in the US

Authors: **Raquel Jolis Carné and Martina Massana Massip**

## **Data Cleaning with OpenRefine**

For both documents *MassShootings.csv* and *SchoolIncidents.csv*, we applied the following data cleaning procedure.

1. Changing the type of **numerical data columns from *strings* to *integers*** and setting the ***Incident Date*** column as a ***timestamp***.
2. Combining the columns ***State, City or County* and *Address* into a single *Complete_Address*** with the three fields.
3. Extracting **OpenStreetMap coordinates** for the complete addresses into a new *Coordinates* column.  
4. **Erasing rows** where **coordinates** were **not found.**
5. **Separating the *Coordinates*** values into two columns: ***Longitude* and *Latitude.***
6. Extracting ***FIPS* codes and *Population*** for each state by **fetching information from Wikidata** using a Reconciling facet in OpenRefine.
7. Manually renaming **District of Columbia Federal Voting rights** of *MassShootings.csv* to **District of Columbia**
8. Manually **filling in the rows for District of Columbia** with values: 11 for *FIPS* and 670587 for *Population*.
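
The steps above that do not depend on external services (type conversion, address concatenation, dropping rows without coordinates and splitting the coordinates) have a rough pandas equivalent. The sketch below is illustrative only and is not the procedure that was actually run: the sample rows are made up, the *Victims Killed* column name is hypothetical, and the `"lat,lon"` format and the field order inside *Complete_Address* are assumptions.

```python
import pandas as pd

# Made-up rows that mimic the raw export; values are placeholders, not real data.
raw = pd.DataFrame({
    "Incident Date": ["2023-01-05", "2023-03-12"],
    "State": ["Illinois", "Texas"],
    "City or County": ["Chicago", "Houston"],
    "Address": ["100 Example St", "200 Sample Ave"],
    "Victims Killed": ["1", "0"],            # hypothetical numeric column stored as text
    "Coordinates": ["41.88,-87.63", None],   # assumed "lat,lon" format; None = not found
})

# Step 1: numeric strings to integers, incident date to timestamp.
raw["Victims Killed"] = raw["Victims Killed"].astype(int)
raw["Incident Date"] = pd.to_datetime(raw["Incident Date"])

# Step 2: a single address field built from the three location columns.
raw["Complete_Address"] = raw["Address"] + ", " + raw["City or County"] + ", " + raw["State"]

# Step 4: drop rows where no coordinates were found.
clean = raw.dropna(subset=["Coordinates"]).copy()

# Step 5: split the coordinate string into two numeric columns.
clean[["Latitude", "Longitude"]] = (
    clean["Coordinates"].str.split(",", expand=True).astype(float)
)
```

Steps 3 and 6 (geocoding and Wikidata reconciliation) rely on external services and are left out of the sketch.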


Additional transformations needed to answer each question are described in the corresponding exercises, where *Pandas* dataframes are created by joining multiple datasets and selecting the relevant columns.

In [274]:
!pip install altair==5.4.1
!pip install geopandas



In [275]:
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import altair as alt
from vega_datasets import data

# for County choropleths
import geopandas as gpd
from shapely.geometry import Point
from shapely.geometry import Polygon, MultiPolygon

In [276]:
mass_shootings = pd.read_csv('MassShootings.csv')
county_population = pd.read_csv('CountyPopulation.csv')
counties_gdf = gpd.read_file('Counties.geojson')
school_incidents = pd.read_csv('SchoolIncidents.csv')

## **First question**
#### **What are the states with a large number of mass shootings per citizen?**

To visualize the states with the highest number of mass shootings per capita, we chose a **bar chart** as it effectively represents quantitative data. The x-axis displays the number of shootings per million residents for each state, while the state names are listed along the y-axis for easier reading. To enhance clarity, we placed a label with the exact x-axis value to the right of each bar, using black text to make it more legible, especially since the differences between consecutive bars are minimal in several cases.

Since the task focuses on states with high mass shooting rates, we sorted the data in **descending order** and highlighted the **Top 10 states** with a distinctive dark blue shade that aligns with the color scheme used across all visualizations. This particular shade, along with other blue tones in subsequent charts, is color-blind friendly, ensuring accessibility for viewers with color vision deficiencies.
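
The per-capita rate and the Top 10 flag described above can also be derived with a `groupby`. The following is a minimal sketch on made-up numbers (the states `A`, `B`, `C` and their populations are placeholders), meant as an alternative illustration rather than the code used for the chart.

```python
import pandas as pd

# One row per incident; made-up placeholder data.
incidents = pd.DataFrame({
    "State": ["A", "A", "A", "B", "C", "C"],
    "Population": [2_000_000, 2_000_000, 2_000_000, 500_000, 4_000_000, 4_000_000],
})

# Count incidents per state and keep that state's population.
rates = (
    incidents.groupby("State")["Population"]
    .agg(total_shootings="size", population="first")
    .reset_index()
)

# Scale to shootings per million residents and rank in descending order.
rates["per_million"] = rates["total_shootings"] / rates["population"] * 1e6
rates = rates.sort_values("per_million", ascending=False).reset_index(drop=True)

top_n = 2  # the report highlights 10; 2 keeps the toy example small
rates["top"] = rates.index < top_n
```

States with no incidents never appear in a per-incident table, so they would still have to be added by hand with a count of zero.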


**Data preparation**

In [277]:
state_shootings = {state: [0, 0] for state in set(mass_shootings['State'])}
                        # ['Shootings', 'Population']

# data for states not appearing in the original dataset
state_shootings['Montana'] = [0, 1122878]
state_shootings['Wyoming'] = [0, 584057]
state_shootings['Vermont'] = [0, 643077]

for _, row in mass_shootings.iterrows():
    current_state = row['State']
    current_population = row['Population']

    state_shootings[current_state][0] += 1 # occurrence count
    state_shootings[current_state][1] = current_population

state_shootings_Q1 = pd.DataFrame.from_dict(state_shootings, orient='index',
                                          columns=['Total Shootings', 'Population'])
state_shootings_Q1 = state_shootings_Q1.reset_index().rename(columns={'index': 'State'})
state_shootings_Q1['Shootings per 1M Habitants'] = state_shootings_Q1['Total Shootings'] / state_shootings_Q1['Population'] * 10**6

# sort state values in descending order
state_shootings_Q1 = state_shootings_Q1.sort_values(by = 'Shootings per 1M Habitants', ascending = False).reset_index(drop = True)
state_shootings_Q1['Top_10'] = state_shootings_Q1.index < 10

**Barchart plotting**

In [278]:
# the chart must be built from the DataFrame (state_shootings_Q1), not from the state_shootings dict
shootings_bars = alt.Chart(state_shootings_Q1).mark_bar().encode(
        alt.X('Shootings per 1M Habitants:Q', axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        alt.Y('State:N', sort='-x', axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        color = alt.condition(
            alt.datum.Top_10,
            alt.value('#1f78b4'),  # color for top 10
            alt.value('#b2b2b2')   # color for the rest
        ),
        order = alt.Order('Shootings per 1M Habitants:Q', sort = 'descending')
).properties(title= alt.TitleParams(
        text = 'Mass shootings per million citizens by State',
        fontSize = 18,
        color = 'black')
)

shootings_text = alt.Chart(state_shootings_Q1).mark_text(
    align = 'left', baseline = 'middle', dx = 3, color = 'black', fontSize = 12
).encode(
    alt.X('Shootings per 1M Habitants:Q'),
    alt.Y('State:N', sort='-x'),
    alt.Text('Shootings per 1M Habitants:Q', format='.2f')
)

shootings_barchart = shootings_bars + shootings_text
shootings_barchart

From the plot above we know the Top 10 states are: **District of Columbia, Mississippi, Louisiana, Illinois, Alabama, Maryland, Tennessee, Missouri, Pennsylvania and South Carolina**, in descending order.

## **Second question**
#### **How is the number of mass shootings per citizen distributed across the different counties in the US? And across states?**



To answer this question we use **two** separate **choropleths.** This type of plot is well suited to geospatial data, since it encodes a quantity across geographic regions, and the color encoding conveys information without taking up extra space. We selected a colorblind-friendly palette **ranging from light yellow (for lower values) to dark blue (for higher values).**

To save further space and improve readability we have used the **tooltip feature**. For the first map, the tooltip shows the **state name and the number of shootings per million residents**. For the second map, it displays the **county name and the number of shootings per hundred thousand residents**.

In both representations we marked the **outline of each state in dark gray** to improve understandability, as well as the **outline of each county in a light gray** in the second chart.

**Data preparation for the plot by States**

In [279]:
state_shootings = {state: [0, 0, 0, 0, 0, 0] for state in set(mass_shootings['State'])}
                        # ['FIPS', 'Shootings', 'Population', 'Suspects Injured', 'Suspects Killed', 'Total Suspects']

# data for states not appearing in the original dataset
state_shootings['Montana'] = [30, 0, 1122878, 0, 0, 0]
state_shootings['Wyoming'] = [56, 0, 584057, 0, 0, 0]
state_shootings['Vermont'] = [50, 0, 643077, 0, 0, 0]

for _, row in mass_shootings.iterrows():
    current_state = row['State']
    current_FIPS = row['FIPS']
    current_population = row['Population']
    suspects_injured = row['Suspects Injured']
    suspects_killed = row['Suspects Killed']
    suspects_arrested = int(row['Suspects Arrested'])

    state_shootings[current_state][0] = current_FIPS
    state_shootings[current_state][1] += 1 # occurrence count
    state_shootings[current_state][2] = current_population
    state_shootings[current_state][3] += suspects_injured
    state_shootings[current_state][4] += suspects_killed
    state_shootings[current_state][5] += suspects_injured + suspects_killed + suspects_arrested

state_shootings = pd.DataFrame.from_dict(state_shootings, orient='index',
                                          columns=['FIPS', 'Total Shootings', 'Population', 'Suspects Injured', 'Suspects Killed', 'Suspects'])
state_shootings = state_shootings.reset_index().rename(columns={'index': 'State'})
state_shootings['Shootings per 1M Habitants'] = state_shootings['Total Shootings'] / state_shootings['Population'] * 10**6 # 10**6 is a scaling factor
state_shootings['% of Suspects Injured'] = state_shootings['Suspects Injured'] / state_shootings['Suspects'] * 100
state_shootings['% of Suspects Killed'] = state_shootings['Suspects Killed'] / state_shootings['Suspects'] * 100

# there are three states where Total Shootings = 0; for them the last two columns evaluate to 0 / 0 (NaN)
state_shootings.fillna(0, inplace=True)

# excluding the District of Columbia to expand the color range; it is represented separately
shootings_notcolumbia = state_shootings[state_shootings['FIPS'] != 11]


**Choropleth plotting by States**

In [280]:
USA_states = alt.topo_feature(data.us_10m.url, 'states')

state_shootings_map = alt.Chart(USA_states
).transform_lookup(
    lookup = 'id',
    from_ = alt.LookupData(shootings_notcolumbia, 'FIPS', list(shootings_notcolumbia.columns))
).mark_geoshape(stroke='darkgray'
).encode(
    color=alt.Color(
        'Shootings per 1M Habitants:Q',
        legend=alt.Legend(
            title='Shootings per 1M Habitants',
            titleColor='black',
            labelColor='black',
            orient='bottom'
        ),
        scale=alt.Scale(scheme='lighttealblue')
    ),
    tooltip = ['State:N', 'Shootings per 1M Habitants:Q']
).properties(
    title = alt.TitleParams(
        text = 'Distribution of shootings per million habitants, by state',
        fontSize = 18,
        color = 'black'),
    width = 500,
    height = 300
).project(
    type = 'albersUsa'
)


# to highlight the District of Columbia in the map
columbia_data = pd.DataFrame({
    'Latitude': [38.89511],
    'Longitude': [-77.03637],
    'State': ['District of Columbia'],
    'Shootings per 1M Habitants': [state_shootings[state_shootings['FIPS'] == 11]['Shootings per 1M Habitants'].iloc[0]]
})

columbia_zoom = alt.Chart(columbia_data).mark_circle(
    size = 50,
    opacity = 0.7
).encode(
    color = alt.Color('State:N', scale=alt.Scale(scheme = 'reds'),
                      legend=alt.Legend(labelColor='black', titleColor='black', orient='bottom')),
    longitude = 'Longitude:Q',
    latitude = 'Latitude:Q',
    tooltip = ['State:N', 'Shootings per 1M Habitants:Q']
).properties(
    width = 500,
    height = 300
)

Q2_state_map_final = state_shootings_map + columbia_zoom
Q2_state_map_final

Because we are working with proportions and the **District of Columbia** is very densely populated, we have chosen to represent its value with an **additional encoding**, overlaying a **semi-opaque red circle** to mark it.  This way, we could **"expand" the color range** for the original choropleth, being able to distinguish the differences between the other states more easily.
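
To see why removing one extreme value helps, consider how a linear color scale maps values to the interval [0, 1]. The sketch below uses invented rates (not the values from the dataset) and a plain min-max normalization, which only approximates what a continuous color scheme does.

```python
def normalize(values):
    """Min-max normalize values to [0, 1], as a linear color scale would."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Invented rates per million: four ordinary states and one extreme outlier.
rates = [2.0, 4.0, 6.0, 8.0, 40.0]

with_outlier = normalize(rates)
without_outlier = normalize(rates[:-1])

# Share of the color range occupied by the four ordinary states.
spread_with = max(with_outlier[:-1]) - min(with_outlier[:-1])
spread_without = max(without_outlier) - min(without_outlier)
print(round(spread_with, 3), round(spread_without, 3))
```

With the outlier present, the four ordinary values occupy about 16% of the scale; without it they span the full range, which is what makes neighboring states distinguishable.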

From the plot above, we can see that the **District of Columbia shows the highest proportion**, and that **Louisiana, Mississippi and Illinois closely follow** with dark blue hues, where gun laws are less strict. Overall, there seem to be more shootings on the Eastern side of the country, where the coloring appears bluer. On the Western side, **Utah, Montana and Wyoming stand out with near-zero values.** Montana especially stands out, since it is one of the most permissive states regarding gun laws yet has no recorded shootings in the last four years.

**Data preparation for the plot by Counties**

In [281]:
# to perform spatial join and intersect shooting coordinates with actual counties
geometry = [Point(lon_lat) for lon_lat in zip(mass_shootings['Longitude'], mass_shootings['Latitude'])]
mass_shootings_gdf = gpd.GeoDataFrame(mass_shootings[['State', 'FIPS']], geometry=geometry)

counties_gdf = counties_gdf[['STATEFP', 'GEOID', 'NAME', 'geometry']] # reducing dimensionality

# swapping coordinates
def swap_coordinates(geometry):
    if isinstance(geometry, Polygon):
        return Polygon([(lon, lat) for lat, lon in geometry.exterior.coords])
    elif isinstance(geometry, MultiPolygon):
        # swap each member polygon too (the original rebuilt them without swapping)
        return MultiPolygon([Polygon([(lon, lat) for lat, lon in polygon.exterior.coords]) for polygon in geometry.geoms])
    return geometry  # leave any other geometry type unchanged

counties_gdf['geometry'] = counties_gdf['geometry'].apply(swap_coordinates) # swapping each row of the geometry column

# setting and ensuring the use of the same coordinate system
mass_shootings_gdf.set_crs(epsg=4326, inplace=True)
counties_gdf = counties_gdf.to_crs(mass_shootings_gdf.crs)

# dropping Puerto Rico, because it is outside of the North America region
counties_gdf = counties_gdf[counties_gdf['STATEFP'] != '72']

coordinates_w_counties = mass_shootings_gdf.sjoin(counties_gdf, how='right', predicate='within')
# 'how=right' to ensure we keep all counties and their FIPS, even if there's no coordinate data for them in the mass_shootings dataframe
coordinates_w_counties = coordinates_w_counties[['GEOID', 'FIPS']]
coordinates_w_counties = coordinates_w_counties.set_axis(['FIPS', 'STATEFIPS'], axis=1)
coordinates_w_counties['FIPS'] = coordinates_w_counties['FIPS'].astype(int)

county_population = county_population[['FIPStxt', 'Area_Name', 'State', 'POP_ESTIMATE_2023']]
# to take into account same County names in different States
county_population['County'] = county_population['Area_Name'] + ', ' + county_population['State']
county_population = county_population[['FIPStxt', 'County', 'POP_ESTIMATE_2023']]
county_population = county_population.set_axis(['FIPS', 'COUNTYNAME', 'COUNTYPOPULATION'], axis=1)

coordinates_w_counties = coordinates_w_counties.merge(county_population, on='FIPS')

county_shootings = {county: [0, 0, 0] for county in set(coordinates_w_counties['COUNTYNAME'])}
                          # ['County FIPS', 'Shootings', 'County Population']
for _, row in coordinates_w_counties.iterrows():
    current_county = row['COUNTYNAME']
    current_county_FIPS = row['FIPS']
    current_county_population = row['COUNTYPOPULATION']
    if current_county_population is None:
        current_county_population = 0
    elif isinstance(current_county_population, str):
        current_county_population = int(current_county_population.replace(',', ''))

    county_shootings[current_county][0] = current_county_FIPS

    # we only want to sum occurrences if we have shooting data for it
    if pd.notna(row['STATEFIPS']):
        county_shootings[current_county][1] += 1 # occurrence count
    # if there's no data for this county in the original dataset, we keep the 'count' at 0

    county_shootings[current_county][2] = current_county_population

county_shootings = pd.DataFrame.from_dict(county_shootings, orient='index',
                                          columns=['County FIPS', 'Total Shootings', 'County Population'])

county_shootings['Shootings per 100K habitants'] = county_shootings['Total Shootings'] / county_shootings['County Population'] * 10**5 # 10**5 is a scaling factor
county_shootings.fillna(0, inplace=True)

full_FIPS_list = data.unemployment() # contains all FIPS inside the USA area in the column 'id', plottable in Altair
# adding the remaining ids from full_FIPS_list['id'] so that every plottable county FIPS is present
for f in full_FIPS_list['id']:
    if f not in set(county_shootings['County FIPS']):
        num_of_rows = county_shootings.shape[0]
        county_shootings.loc[num_of_rows] = [f, 0, 0, 0]

# only keeping County FIPS
county_shootings = county_shootings[county_shootings['County FIPS'].isin(set(full_FIPS_list['id']))]

county_shootings = county_shootings.reset_index().rename(columns={'index': 'County'})
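
The `within` predicate of the spatial join assigns each shooting to the county polygon that contains it. GeoPandas delegates this test to a geometry engine; the sketch below swaps in a simple ray-casting test in pure Python, with two made-up square "counties", only to illustrate the idea. The county identifiers and coordinates are placeholders, not real GEOIDs.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: toggle 'inside' each time a rightward ray from (x, y) crosses an edge."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical counties as (lon, lat) rings.
counties = {
    "county_a": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "county_b": [(2, 0), (4, 0), (4, 2), (2, 2)],
}

def assign(point):
    """Return the first county containing the point, or None if there is none."""
    for name, ring in counties.items():
        if point_in_polygon(point[0], point[1], ring):
            return name
    return None

points = [(1.0, 1.0), (3.0, 0.5), (5.0, 5.0)]
assigned = [assign(p) for p in points]
print(assigned)
```

The last point falls outside both polygons and gets no county, which mirrors why counties without any matching point keep a count of zero.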

**Choropleth plotting by Counties**


In [282]:
USA_counties = alt.topo_feature(data.us_10m.url, 'counties')

county_shootings_map = alt.Chart(USA_counties
).transform_lookup(
    lookup = 'id',
    from_ = alt.LookupData(county_shootings, 'County FIPS', list(county_shootings.columns))
).mark_geoshape().encode(
    color = alt.Color(
        'Shootings per 100K habitants:Q',
        legend=alt.Legend(
            title='Shootings per 100K habitants',
            titleColor='black',
            labelColor='black',
            orient='bottom'
        ),
        scale=alt.Scale(scheme='lighttealblue')
    ),
    tooltip = ['County:N', 'Shootings per 100K habitants:Q']
).properties(
    title = alt.TitleParams(
        text = 'Distribution of shootings per 100k habitants, by county',
        fontSize = 18,
        color = 'black'),
    width = 500,
    height = 300
).project(
    type = 'albersUsa'
)

state_shape_overlay = alt.Chart(USA_states
).transform_lookup(
    lookup = 'id',
    from_ = alt.LookupData(state_shootings, 'FIPS', list(state_shootings.columns))
).mark_geoshape(
    stroke = 'gray',
    fill = 'transparent'
).properties(
    width = 500,
    height = 300
).project(
    type = 'albersUsa'
)

county_shootings_overlay = alt.Chart(USA_counties
).transform_lookup(
    lookup = 'id',
    from_ = alt.LookupData(county_shootings, 'County FIPS', list(county_shootings.columns))
).mark_geoshape(
    stroke='lightgray',
    fill = 'transparent'
).encode(
    color = alt.Color(
        'Shootings per 100K habitants:Q',
        legend=alt.Legend(
            title="Shootings per 100K habitants",
            titleColor='black',
            labelColor='black')
    ),
    tooltip = ['County:N', 'Shootings per 100K habitants:Q'],
).properties(
    title = alt.TitleParams(
        text = 'Distribution of shootings per 100k habitants, by county',
        fontSize = 18,
        color = 'black'),
    width = 500,
    height = 300
).project(
    type = 'albersUsa'
)

Q2_county_map_final = county_shootings_map + state_shape_overlay + county_shootings_overlay
Q2_county_map_final

From the plot above, we can see that **Quitman County (Mississippi)** shows the highest proportion, and **Mississippi County (Missouri) and Dallas County (Arkansas) closely follow** with dark blue hues. This resonates with the proportions by state, where Mississippi and Missouri were among the states with the highest proportions after the District of Columbia. Overall, proportions seem to be low across the majority of counties, because the map appears mainly colored with light yellow and light green hues.

### **Additional Question**
#### **How is police brutality distributed across the different states in the US?**

To visualize this information clearly we have again opted for **two** separate **choropleths**. We chose this type of plot for the same reasons mentioned in question two, and we used the same palette and state outline marking.

The value encoded with this palette is the **percentage of injured and killed suspects**, respectively, in each state; that is, the proportion of injured or killed suspects over the total number of suspects involved in the shootings (the sum of injured, killed and arrested suspects, having checked that these are disjoint quantities). To improve readability, the tooltip shows these values together with the name of the state they belong to.
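
The percentages described above can be written as a small helper. This is a sketch with invented counts; the function name `suspect_shares` is hypothetical, and states with no suspects are mapped to 0, in line with how missing values are filled in the data preparation.

```python
def suspect_shares(injured, killed, arrested):
    """Return (% injured, % killed) over all suspects; (0.0, 0.0) when there are none."""
    total = injured + killed + arrested
    if total == 0:
        return 0.0, 0.0
    return injured / total * 100, killed / total * 100

# One injured, one killed and two arrested suspects: a quarter each for injured and killed.
print(suspect_shares(1, 1, 2))
# No suspects recorded at all: both shares default to zero instead of dividing by zero.
print(suspect_shares(0, 0, 0))
```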

**Choropleth plotting for Injured Suspects**

In [283]:
Qextra_injured_map_final = alt.Chart(USA_states
    ).transform_lookup(
        lookup = 'id',
        from_ = alt.LookupData(state_shootings, 'FIPS', list(state_shootings.columns))
    ).mark_geoshape(stroke='darkgray'
    ).encode(
        color=alt.Color(
        '% of Suspects Injured:Q',
        legend=alt.Legend(
            title='% of Suspects Injured',
            titleColor='black',
            labelColor='black',
            orient='bottom'
        ),
        scale=alt.Scale(scheme='lighttealblue')
    ),
        tooltip = ['State:N', '% of Suspects Injured:Q']
    ).properties(
        title = alt.TitleParams(
            text = 'Percentage of Suspects Injured per shooting, by state',
            fontSize = 18,
            color = 'black'),
        width = 500,
        height = 300
    ).project(
        type = 'albersUsa'
    )

Qextra_injured_map_final

In the plot above, **Utah clearly stands out** at 50%, followed by Connecticut and Nebraska at 25% and 20%, respectively. We can observe a **majority of light yellow and light green colored regions**, which represent a low index of police brutality in terms of injured suspects.

**Choropleth plotting for Killed Suspects**

In [284]:
Qextra_killed_map_final = alt.Chart(USA_states
    ).transform_lookup(
        lookup = 'id',
        from_ = alt.LookupData(state_shootings, 'FIPS', list(state_shootings.columns))
    ).mark_geoshape(stroke='darkgray'
    ).encode(
        color=alt.Color(
        '% of Suspects Killed:Q',
        legend=alt.Legend(
            title='% of Suspects Killed',
            titleColor='black',
            labelColor='black',
            orient='bottom'
        ),
        scale=alt.Scale(scheme='lighttealblue')
    ),
        tooltip = ['State:N', '% of Suspects Killed:Q']
    ).properties(
        title = alt.TitleParams(
            text = 'Percentage of suspects killed per shooting, by state',
            fontSize = 18,
            color = 'black'),
        width = 500,
        height = 300
    ).project(
        type = 'albersUsa'
    )

Qextra_killed_map_final

In the plot above, **Hawaii and North Dakota clearly stand out at 100%**, as do Oregon and Idaho (at 50%) and Nevada (at 44.5%). While these rates are certainly alarming, we can observe a **majority of light yellow and light green colored regions**, which represent a low index of police brutality in terms of killed suspects.

## **Third Question**
#### **Are the mass shootings correlated with gun violence incidents in schools?**

To explore any potential correlation between mass shootings and gun violence incidents in schools, we used a **scatter plot with a regression line**. For each state, we calculated the number of mass shootings and of school gun violence incidents per capita to account for population differences. This allows us to gain insight into the relative frequency of incidents across different states. Given that the raw ratios were quite small, we chose to display the data as incidents per million inhabitants for clearer interpretation.

Additionally, we included a regression line to illustrate the trend and correlation in the data. Despite some outliers, such as New Mexico and South Carolina, which have a relatively high number of mass shootings compared to school gun violence incidents, a **positive correlation** is evident, suggesting a relationship between population-adjusted rates of mass shootings and school gun violence incidents.
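
The strength of such a trend can be quantified with Pearson's correlation coefficient, and the regression line with an ordinary least-squares fit. The sketch below uses invented per-million rates, not the values from the datasets, and plain Python arithmetic rather than Altair's `transform_regression`; the function name `pearson_and_line` is hypothetical.

```python
from math import sqrt

def pearson_and_line(xs, ys):
    """Return Pearson's r plus the slope and intercept of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, slope, intercept

# Invented rates per million inhabitants for four placeholder states.
mass_rate = [1.0, 2.0, 3.0, 4.0]
school_rate = [0.4, 1.1, 1.4, 2.1]

r, slope, intercept = pearson_and_line(mass_rate, school_rate)
print(f"r = {r:.3f}, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

A value of `r` close to 1 indicates a strong positive linear relationship; it says nothing about causation.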

On the axes, we encoded the ratio values and added **gridlines** to improve readability and comprehension. Finally, we included tooltips displaying the state name and the ratio values, allowing users to see precisely what each point represents.







**Data preparation**

In [285]:
mass_shootings['Incident Date'] = pd.to_datetime(mass_shootings['Incident Date'])
mass_shootings_reduced = mass_shootings[mass_shootings['Incident Date'] > '2022-11-01']


shootings_count = mass_shootings_reduced.groupby('State').size().reset_index(name="Shootings_count")
school_count = school_incidents.groupby('State').size().reset_index(name="School_count")
total_count = pd.merge(shootings_count, school_count, on='State', how='outer')
total_count = total_count.fillna(0)

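# one population row per state, so the merge does not repeat each state once per incident
total_count = total_count.merge(mass_shootings[['State','Population']].drop_duplicates(subset='State'), on = "State", how = "left")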
total_count.loc[total_count['State'] == "Wyoming", 'Population'] = 584057.0
total_count.loc[total_count['State'] == "Montana", 'Population'] = 1122878.0
total_count.loc[total_count['State'] == "Vermont", 'Population'] = 643077.0

total_count['Ratio Mass Shootings'] = (total_count['Shootings_count']/total_count['Population'])*10**6
total_count['Ratio School Incidents'] = (total_count['School_count']/total_count['Population'])*10**6
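
When attaching populations to per-state counts, the lookup table needs exactly one row per state; merging against a table with one row per incident repeats each state once per incident. A minimal sketch with made-up numbers shows the difference:

```python
import pandas as pd

counts = pd.DataFrame({"State": ["A", "B"], "Shootings_count": [3, 1]})

# One row per incident, so state A appears three times (placeholder data).
incidents = pd.DataFrame({
    "State": ["A", "A", "A", "B"],
    "Population": [2_000_000, 2_000_000, 2_000_000, 500_000],
})

# Merging against the per-incident table multiplies the rows of state A.
repeated = counts.merge(incidents, on="State", how="left")

# Keeping one population row per state preserves one row per state.
unique = counts.merge(incidents.drop_duplicates(subset="State"), on="State", how="left")

print(len(repeated), len(unique))
```

In a scatter plot the repeated rows would be drawn on top of each other, and a regression fitted on them would weight each state by its number of incidents.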

**Scatter plot plotting**

In [286]:
scatter_plot = alt.Chart(total_count).mark_circle(color='#1f78b4').encode(
        alt.X('Ratio Mass Shootings:Q', title = "Mass Shootings per million citizens", axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        alt.Y('Ratio School Incidents:Q', title = "School Incidents per million citizens", axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        tooltip=['State', 'Ratio Mass Shootings', 'Ratio School Incidents']
    ).properties(
        title = alt.TitleParams(
            text = 'Relationship Between Mass Shootings and School Incidents',
            fontSize = 18,
            color = 'black'),
        width = 400,
        height = 200
    )


linear_regression = scatter_plot.transform_regression(
      'Ratio Mass Shootings',
      'Ratio School Incidents'
).mark_line(color ='#a6cee3')

Q3_scatterplot = scatter_plot + linear_regression
Q3_scatterplot

## **Fourth Question**
#### **How have mass shootings evolved the last years in the US?**

To see how mass shootings have evolved over the last years in the US, we chose a **line chart** with additional marks for the **minimum, maximum and average values.** This type of plot is well suited to numerical data with temporal continuity. On the **x-axis** we display the **months by year**, with labels angled slightly to improve legibility. On the **y-axis** we show the number of **shootings in the US.**

The marks for the **minimum and maximum** values give additional insight into outlier cases. We encoded the former in **green** and the latter in **orange**, and also distinguished them by a **downward and an upward triangle**, respectively, to aid users with colorblindness. The horizontal line for the mean value, 38.59, shows that there is an **average of more than one mass shooting per day** in the US, given that a month has about 30 days.

Looking at the data as a whole we can identify certain yearly patterns. The **highest peak usually** occurs in **July**, with a **lower second peak around October.** From **January to April**, **fewer shootings** are recorded.
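
The monthly series behind such a chart comes from counting incidents per calendar month. The following is a minimal sketch of that aggregation on a handful of invented dates, together with the per-day reading of a monthly mean (assuming a 30-day month):

```python
import pandas as pd

# Invented incident dates; the first and last months are only partially covered.
dates = pd.to_datetime([
    "2023-01-30",
    "2023-02-03", "2023-02-17", "2023-02-25",
    "2023-03-08", "2023-03-21",
    "2023-04-02",
])
incidents = pd.DataFrame({"Incident Date": dates})

# Count incidents per calendar month.
monthly = (
    incidents.groupby(incidents["Incident Date"].dt.to_period("M"))
    .size()
    .reset_index(name="Count")
)

# Drop the incomplete first and last months before computing statistics.
monthly = monthly.iloc[1:-1]

mean_per_month = monthly["Count"].mean()
mean_per_day = mean_per_month / 30  # 30-day month assumed
```

Applied to the reported mean of 38.59 shootings per month, the same division gives roughly 1.29 shootings per day.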

**Data preparation**

In [287]:
mass_shootings['Incident Date'] = pd.to_datetime(mass_shootings['Incident Date'], errors= 'coerce')
mass_shootings['Year_Month'] = mass_shootings['Incident Date'].dt.to_period('M')

# to have monthly total shootings
total_shootings = mass_shootings.groupby('Year_Month').size().reset_index(name = 'Count')
total_shootings['Year_Month'] = total_shootings['Year_Month'].dt.to_timestamp()
total_shootings = total_shootings.iloc[1:-1] # dropping the first and last months, which are incomplete


max_value = total_shootings['Count'].max()
min_value = total_shootings['Count'].min()
mean_value = total_shootings['Count'].mean().round(decimals = 2)

max_point = total_shootings[total_shootings['Count'] == max_value]
min_point = total_shootings[total_shootings['Count'] == min_value]

**Linechart plotting by Month and Year**

In [288]:
shootings_linechart = alt.Chart(total_shootings).mark_line().encode(
        alt.X('Year_Month:T', title = 'Month - Year', axis = alt.Axis(labelColor = 'black', labelAngle = 45, format = '%b-%Y', titleColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        alt.Y('Count:Q', title = 'Mass shootings', axis = alt.Axis(labelColor = 'black', titleColor = 'black', titleFontSize = 14, labelFontSize = 12)),
    ).properties(
        title = alt.TitleParams(
            text = 'Mass shootings during the last four years in the USA',
            fontSize = 18,
            color = 'black'),
            width = 400,
            height = 200
    )

max_point_chart = alt.Chart(max_point).mark_point(
    size = 170, color = '#d95f02', filled = True, shape = 'triangle-up'
).encode(
    alt.X('Year_Month:T'),
    alt.Y('Count:Q'),
)

min_point_chart = alt.Chart(min_point).mark_point(
    size = 170, color = '#1b9e77', filled = True, shape = 'triangle-down'
).encode(
    alt.X('Year_Month:T'),
    alt.Y('Count:Q'),
)

min_text = alt.Chart(min_point).mark_text(
    align = 'left', dx = 12, dy = 15, fontSize = 14, color = '#1b9e77'
).encode (
    alt.X('Year_Month:T'),
    alt.Y('Count:Q'),
    alt.Text('Count:Q')
)

max_text = alt.Chart(max_point).mark_text(
    align = 'left', dx = 7, dy = -10, fontSize = 14, color = '#d95f02'
).encode (
    alt.X('Year_Month:T'),
    alt.Y('Count:Q'),
    alt.Text('Count:Q')
)

mean_line = alt.Chart(total_shootings).mark_rule(
    color = '#a6cee3'
).encode(
    alt.Y('mean_value:Q')
).transform_calculate(mean_value=str(mean_value))

mean_text = alt.Chart(total_shootings).mark_text(
    align = "left", dx = 200, dy = -10, color = '#a6cee3', size = 18
).encode(
    alt.Y('mean_value:Q'),
    text = alt.value(f'Mean: {mean_value:.2f}')
).transform_calculate(
    mean_value = str(mean_value)
)

Q4_linechart_final = shootings_linechart + max_point_chart + min_point_chart + mean_line + mean_text + max_text + min_text
Q4_linechart_final

In [289]:
shootings_bars = alt.Chart(state_shootings_Q1).mark_bar().encode(
        alt.X('Shootings per 1M Habitants:Q', axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        alt.Y('State:N', sort='-x', axis = alt.Axis(titleColor = 'black', labelColor = 'black', titleFontSize = 14, labelFontSize = 12)),
        color = alt.value('#1f78b4'),
        order = alt.Order('Shootings per 1M Habitants', sort = 'descending')
).transform_filter(
    alt.datum.Top_10 == True
).properties(title = alt.TitleParams(
    text = 'Top 10 States with most mass shootings per 1M citizens',
    fontSize = 18,
    color = 'black',
    offset = 12.5)
).properties(
    width = 400,
    height = 200
)

shootings_text = alt.Chart(state_shootings_Q1).mark_text(
    align = 'left', baseline = 'middle', dx = 3, color = 'black', fontSize = 12
).encode(
    alt.X('Shootings per 1M Habitants:Q'),
    alt.Y('State:N', sort='-x'),
    alt.Text('Shootings per 1M Habitants:Q', format='.2f')
).transform_filter(
    alt.datum.Top_10 == True
).properties(
    width = 400,
    height = 200
)

Q1_barchart_final = shootings_bars + shootings_text

In [290]:
# Chart preparation
Q1_Q4_charts = (Q1_barchart_final | Q4_linechart_final).properties(spacing = 50)
Q2_charts = (Q2_state_map_final | Q2_county_map_final).resolve_scale(color='independent').properties(spacing = 80)
Qextra_charts = (Qextra_injured_map_final | Qextra_killed_map_final).resolve_scale(color='independent').properties(spacing = 80)

Q1_Q4_charts.display()
Q2_charts.display()
Qextra_charts.display()
Q3_scatterplot.display()