Questions about the dataset

1. What is in your data?

Our dataset covers homicides reported throughout the U.S. It spans 1976 to 2023, but for the purposes of this project we focus on data from 2010 onward, which subsets the data to a more manageable range. Some columns in the original dataset were dropped because they are not relevant to our questions, such as FileDate and MSA. The columns in the current dataset are:
- State - The state of the Crime.
- Agency - The city of the Crime.
- Source - Where this data came from (i.e. the FBI)
- Solved - Was the case solved?
- Year - Year of Crime.
- Month - Month of Crime.
- Homicide - Type of Crime.
- Situation - Situation in which the crime occurred.
- VicAge - Age of Victim.
- VicSex - Sex of Victim.
- VicRace - Race of Victim.
- VicEthnic - Ethnicity of Victim.
- OffAge - Age of Offender.
- OffSex - Sex of Offender.
- OffRace - Race of Offender.
- OffEthnic - Ethnicity of Offender.
- Weapon - Weapon Used.
- Relationship - Relationship between Victim and Offender.
- VicCount - Amount of Victims.
- OffCount - Amount of Offenders.





2. How will these data be useful for studying the phenomenon you're interested in?



Some trends we can use the data to analyze include: the age of offenders by location; how the COVID years affected crime rates (including who is committing the crimes and where they happen); whether certain places are more inclined to use a certain type of weapon; which states have the highest crime frequencies; which seasons and months have the highest crime rates; whether crime rates have decreased since the start of data gathering; and which types of relationships most commonly incite violence. Many other questions can be explored using this dataset.
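
As a minimal sketch of one of these questions (which months see the most incidents), the counts can be tabulated with a groupby. The small frame below is made-up illustration data, not rows from the dataset; only the column names `Year` and `Month` are taken from it.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real data; only the column names match.
toy = pd.DataFrame({
    "Year":  [2019, 2019, 2020, 2020, 2020, 2021],
    "Month": ["July", "July", "July", "March", "March", "July"],
})

# Number of incidents for each (Year, Month) pair.
per_year_month = toy.groupby(["Year", "Month"]).size().rename("Incidents")

# Number of incidents per month across all years, largest first.
per_month = toy["Month"].value_counts()

print(per_year_month)
print(per_month)
```

The same pattern should carry over to other groupings, such as `State` or `Weapon`, by swapping the column names.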

3. What are the challenges you've resolved or expect to face in using them?


There are many missing or placeholder values that would interfere with analysis, so cleaning them and converting values to integers where needed will be a longer process. If an observation has 4+ missing values, it will be dropped. If a column has 20%+ of its entries missing, it will be removed. Before deciding whether to replace missing data with means, medians, a range, or common values, or simply to leave them as unknowns, we first need to determine what the missing values are. At the same time, categories like offender race shouldn't be assumed. Missing values will also make it hard to tell whether any preventive or safety measures were put in place, and whether they have helped at all. So far, similar inputs such as firearms are grouped together, but more complex relations could be kept separate depending on the specific question to answer.
The possible values are all listed in the tables below, after replacing unknown values and grouping other entries together.
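
The two dropping rules described above could be sketched roughly as follows. The function name `drop_sparse`, its parameter names, and the choice to filter columns before rows are assumptions made for illustration; the thresholds are written as parameters so they can be adjusted.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, max_missing_per_row=3, max_missing_col_share=0.20):
    """Drop columns, then rows, that have too many missing values.

    Hypothetical helper: a column is removed when its share of missing
    entries is at or above max_missing_col_share, and a row is removed
    when it has more than max_missing_per_row missing values
    (so the default of 3 drops rows with 4 or more).
    """
    col_share = df.isna().mean()                  # share of NaN per column
    kept_cols = df.loc[:, col_share < max_missing_col_share]
    row_missing = kept_cols.isna().sum(axis=1)    # NaN count per row
    return kept_cols[row_missing <= max_missing_per_row]

# Made-up data: column "f" is 30% missing, and row 0 is missing four values.
toy = pd.DataFrame({c: np.arange(10, dtype=float) for c in "abcde"})
toy.loc[0, ["a", "b", "c", "d"]] = np.nan
toy["f"] = [np.nan] * 3 + [1.0] * 7

cleaned = drop_sparse(toy)
print(cleaned.shape)
```

Filtering columns first means a sparse column does not count against otherwise complete rows; the opposite order is equally defensible and may give a different result.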



In [27]:
import pandas as pd
import seaborn as sns
import numpy as np

# ignore warnings about "Setting with copy" and keep all others
import warnings
warnings.simplefilter(action="ignore", category=pd.errors.SettingWithCopyWarning)

In [28]:
df = pd.read_csv("SHR65_23.csv")

In [29]:
print(df.columns)

Index(['ID', 'CNTYFIPS', 'Ori', 'State', 'Agency', 'Agentype', 'Source',
       'Solved', 'Year', 'Month', 'Incident', 'ActionType', 'Homicide',
       'Situation', 'VicAge', 'VicSex', 'VicRace', 'VicEthnic', 'OffAge',
       'OffSex', 'OffRace', 'OffEthnic', 'Weapon', 'Relationship',
       'Circumstance', 'Subcircum', 'VicCount', 'OffCount', 'FileDate', 'MSA'],
      dtype='object')


In [30]:
#subset for columns that are important
new_df = df[['State', "Agency", "Source", "Solved", "Year", "Month", "Homicide", "Situation", "VicAge", "VicSex", "VicRace", "VicEthnic", "OffAge", "OffSex", "OffRace", "OffEthnic", "Weapon", "Relationship", "VicCount", "OffCount"  ]]
new_df

Unnamed: 0,State,Agency,Source,Solved,Year,Month,Homicide,Situation,VicAge,VicSex,VicRace,VicEthnic,OffAge,OffSex,OffRace,OffEthnic,Weapon,Relationship,VicCount,OffCount
0,Alaska,Anchorage,FBI,Yes,1976,March,Murder and non-negligent manslaughter,Single victim/single offender,48,Male,Unknown,Unknown or not reported,68,Male,Black,Unknown or not reported,"Handgun - pistol, revolver, etc",Relationship not determined,0,0
1,Alaska,Anchorage,FBI,Yes,1976,April,Murder and non-negligent manslaughter,Single victim/single offender,33,Female,White,Unknown or not reported,44,Male,White,Unknown or not reported,"Handgun - pistol, revolver, etc",Girlfriend,0,0
2,Alaska,Anchorage,FBI,Yes,1976,June,Murder and non-negligent manslaughter,Single victim/single offender,38,Male,White,Unknown or not reported,27,Male,Black,Unknown or not reported,"Handgun - pistol, revolver, etc",Stranger,0,0
3,Alaska,Anchorage,FBI,Yes,1976,June,Murder and non-negligent manslaughter,Single victim/single offender,41,Male,White,Unknown or not reported,34,Male,White,Unknown or not reported,"Handgun - pistol, revolver, etc",Other - known to victim,0,0
4,Alaska,Anchorage,FBI,Yes,1976,July,Murder and non-negligent manslaughter,Single victim/single offender,33,Male,American Indian or Alaskan Native,Unknown or not reported,37,Female,American Indian or Alaskan Native,Unknown or not reported,Knife or cutting instrument,Brother,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
894631,Wyoming,Wind River Agency,FBI,No,2018,August,Murder and non-negligent manslaughter,Single victim/unknown offender(s),29,Male,American Indian or Alaskan Native,Not of Hispanic origin,999,Unknown,Unknown,Unknown or not reported,Shotgun,Other - known to victim,0,0
894632,Wyoming,Wind River Agency,FBI,Yes,2019,August,Murder and non-negligent manslaughter,Single victim/single offender,29,Male,American Indian or Alaskan Native,Not of Hispanic origin,30,Male,American Indian or Alaskan Native,Not of Hispanic origin,"Firearm, type not stated",Acquaintance,0,0
894633,Wyoming,Wind River Agency,FBI,Yes,2023,March,Murder and non-negligent manslaughter,Single victim/single offender,44,Male,American Indian or Alaskan Native,Not of Hispanic origin,33,Male,American Indian or Alaskan Native,Not of Hispanic origin,"Firearm, type not stated",Other - known to victim,0,0
894634,Wyoming,Wind River Agency,FBI,Yes,2023,March,Murder and non-negligent manslaughter,Single victim/single offender,37,Male,American Indian or Alaskan Native,Not of Hispanic origin,29,Male,American Indian or Alaskan Native,Not of Hispanic origin,Other or type unknown,Relationship not determined,0,0


Now we can check if there are any missing values

In [31]:
#Checking for any missing values in the data

for col in new_df.columns:
  print(col)
  print(new_df[col].isna().sum())

State
0
Agency
0
Source
0
Solved
0
Year
0
Month
0
Homicide
0
Situation
0
VicAge
0
VicSex
0
VicRace
0
VicEthnic
0
OffAge
0
OffSex
0
OffRace
0
OffEthnic
0
Weapon
0
Relationship
0
VicCount
0
OffCount
0





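isna() reports no missing values, but the dataset stores unknowns as placeholder values (for example 999 for ages, or "Unknown" for sex and race), so these still need to be found and converted.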
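
One way to see how many placeholders each column holds, before replacing anything, is sketched below. The helper `count_placeholders` and the `PLACEHOLDERS` dictionary are hypothetical names; the placeholder codes themselves are the ones that appear in this notebook's outputs, and the toy frame is made-up data.

```python
import pandas as pd

# Placeholder codes per column, as seen in this notebook's outputs.
PLACEHOLDERS = {
    "VicAge": [999],
    "OffAge": [999],
    "VicSex": ["Unknown"],
    "OffSex": ["Unknown"],
    "VicRace": ["Unknown"],
    "OffRace": ["Unknown"],
    "VicEthnic": ["Unknown or not reported"],
    "OffEthnic": ["Unknown or not reported"],
    "Relationship": ["Relationship not determined"],
    "Weapon": ["Weapon Not Reported"],
}

def count_placeholders(df, placeholders=PLACEHOLDERS):
    """Return the number of placeholder entries in each column of df."""
    # DataFrame.isin with a dict checks each column against its own list.
    return df.isin(placeholders).sum()

# Made-up rows for illustration.
toy = pd.DataFrame({
    "VicAge": [44, 999, 29],
    "OffSex": ["Unknown", "Male", "Unknown"],
    "Solved": ["Yes", "No", "Yes"],
})
counts = count_placeholders(toy)
print(counts)
```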

In [32]:
#subset the data based on year
new_df = new_df[new_df['Year']>=2010]
#check for weird values
print(new_df['Year'].unique())

[2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2022 2023 2021]


In [33]:
#check for weird values
print(new_df['State'].unique())

['Alaska' 'Alabama' 'Arkansas' 'Arizona' 'California' 'Colorado'
 'Connecticut' 'District of Columbia' 'Delaware' 'Florida' 'Georgia'
 'Hawaii' 'Iowa' 'Idaho' 'Illinois' 'Indiana' 'Kansas' 'Kentucky'
 'Louisiana' 'Massachusetts' 'Maryland' 'Maine' 'Michigan' 'Minnesota'
 'Missouri' 'Mississippi' 'Montana' 'Nebraska' 'North Carolina'
 'North Dakota' 'New Hampshire' 'New Jersey' 'New Mexico' 'Nevada'
 'New York' 'Ohio' 'Oklahoma' 'Oregon' 'Pennsylvania' 'Rhodes Island'
 'South Carolina' 'South Dakota' 'Tennessee' 'Texas' 'Utah' 'Virginia'
 'Vermont' 'Washington' 'Wisconsin' 'West Virginia' 'Wyoming']


In [34]:
#check for weird values
print(new_df['Source'].unique())

['FBI' 'MAP']


In [35]:
#check for weird values
print(new_df['Solved'].unique())

['No' 'Yes']


In [36]:
#check for weird values
print(new_df['Situation'].unique())

['Single victim/unknown offender(s)' 'Single victim/single offender'
 'Single victim/multiple offenders' 'Multiple victims/single offender'
 'Multiple victims/unknown offender(s)'
 'Multiple victims/multiple offenders']


In [37]:
#check for weird values
print(new_df['VicCount'].unique())

[ 0  1  2  3  4  9  5  6 10  7  8 11 21 20 52]


In [38]:
#check for weird values
print(new_df['OffCount'].unique())

[ 0  1  2  3  4  5  8  7  6  9 11 15 12 10 17 40 13 14 21]


In [39]:
#check for weird values
print(new_df['VicAge'].unique())
# the value 999 is a placeholder for an unknown age, so replace it with NaN
new_df["VicAge"] = new_df['VicAge'].replace(999, np.nan)
print(new_df['VicAge'].unique())

[ 44  29  26  40  34  48  19   0  24  30  57  23  54  42  58  52  20  28
  55  35  38   1  33  14  50  67  18  27  47  43  59  31  21  73  71  53
  36  22  15  51  63  64  74  56  39   4  11  41  60   2  49  32  17  25
  69  37  70  92  76  65  46   6  45  61   3  13  16  66   5  62  12  88
  77  75  10 999   7  68   8  81   9  72  80  79  78  83  90  86  85  89
  82  84  91  99  87  95  94  93  96  97  98]
[44. 29. 26. 40. 34. 48. 19.  0. 24. 30. 57. 23. 54. 42. 58. 52. 20. 28.
 55. 35. 38.  1. 33. 14. 50. 67. 18. 27. 47. 43. 59. 31. 21. 73. 71. 53.
 36. 22. 15. 51. 63. 64. 74. 56. 39.  4. 11. 41. 60.  2. 49. 32. 17. 25.
 69. 37. 70. 92. 76. 65. 46.  6. 45. 61.  3. 13. 16. 66.  5. 62. 12. 88.
 77. 75. 10. nan  7. 68.  8. 81.  9. 72. 80. 79. 78. 83. 90. 86. 85. 89.
 82. 84. 91. 99. 87. 95. 94. 93. 96. 97. 98.]


In [40]:
#check for weird values
print(new_df['OffAge'].unique())
# the value 999 is a placeholder for an unknown age, so replace it with NaN
new_df["OffAge"] = new_df['OffAge'].replace(999, np.nan)
print(new_df['OffAge'].unique())

[999  28  26  35  38  21  23  45  27  24  62  43  17  66  22  33  31  47
  25  34  30  55  40  29  50  41  61  20  49  18  73  64  19  39  51  16
  46  15  32  44  71  37  57  63  91  75  42  68  36  12  48  53  56  13
  59  14  67   9   8  69   6  11  99  54  58  52   7  80  60  76  82  10
   1  70  72  89  81  74  84  65  77  78  79   5  83  90  85  87  86  93
  96  92  88   4  95  97   0   3  94  98   2]
[nan 28. 26. 35. 38. 21. 23. 45. 27. 24. 62. 43. 17. 66. 22. 33. 31. 47.
 25. 34. 30. 55. 40. 29. 50. 41. 61. 20. 49. 18. 73. 64. 19. 39. 51. 16.
 46. 15. 32. 44. 71. 37. 57. 63. 91. 75. 42. 68. 36. 12. 48. 53. 56. 13.
 59. 14. 67.  9.  8. 69.  6. 11. 99. 54. 58. 52.  7. 80. 60. 76. 82. 10.
  1. 70. 72. 89. 81. 74. 84. 65. 77. 78. 79.  5. 83. 90. 85. 87. 86. 93.
 96. 92. 88.  4. 95. 97.  0.  3. 94. 98.  2.]
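
Replacing 999 with NaN turns the age columns into floats, as the outputs above show. If whole-number ages are preferred, one option is pandas' nullable integer dtype `Int64`. The snippet below is only a sketch on made-up data, not a step taken in this notebook.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration; 999 is the placeholder for "unknown".
toy = pd.DataFrame({"VicAge": [44, 999, 29], "OffAge": [999, 28, 26]})

for col in ["VicAge", "OffAge"]:
    # Replace the placeholder, then cast to the nullable integer dtype so
    # ages stay whole numbers and unknown ages are stored as <NA>.
    toy[col] = toy[col].replace(999, np.nan).astype("Int64")

print(toy.dtypes)
print(toy)
```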


In [41]:
#check for weird values
print(new_df['Month'].unique())

['January' 'February' 'March' 'April' 'May' 'June' 'July' 'August'
 'September' 'October' 'December' 'November']


In [42]:
#check for weird values
print(new_df['Homicide'].unique())

['Murder and non-negligent manslaughter' 'Manslaughter by negligence']


In [43]:
#check for weird values
print(new_df['VicSex'].unique())
new_df["VicSex"] = new_df['VicSex'].replace("Unknown", np.nan)
print(new_df['VicSex'].unique())

['Male' 'Female' 'Unknown']
['Male' 'Female' nan]


In [44]:
#check for weird values
print(new_df['OffSex'].unique())
new_df["OffSex"] = new_df['OffSex'].replace("Unknown", np.nan)
print(new_df['OffSex'].unique())

['Unknown' 'Male' 'Female']
[nan 'Male' 'Female']


In [45]:
#check for weird values
print(new_df['VicRace'].unique())
new_df["VicRace"] = new_df['VicRace'].replace("Unknown", np.nan)
print(new_df['VicRace'].unique())

['White' 'Black' 'Asian' 'American Indian or Alaskan Native' 'Unknown'
 'Native Hawaiian or Pacific Islander']
['White' 'Black' 'Asian' 'American Indian or Alaskan Native' nan
 'Native Hawaiian or Pacific Islander']


In [46]:
#check for weird values
print(new_df['OffRace'].unique())
new_df["OffRace"] = new_df['OffRace'].replace("Unknown", np.nan)
print(new_df['OffRace'].unique())

['Unknown' 'Black' 'White' 'Asian' 'American Indian or Alaskan Native'
 'Native Hawaiian or Pacific Islander']
[nan 'Black' 'White' 'Asian' 'American Indian or Alaskan Native'
 'Native Hawaiian or Pacific Islander']


In [47]:
#check for weird values
print(new_df['VicEthnic'].unique())
new_df["VicEthnic"] = new_df['VicEthnic'].replace("Unknown or not reported", np.nan)
print(new_df['VicEthnic'].unique())

['Unknown or not reported' 'Not of Hispanic origin' 'Hispanic origin']
[nan 'Not of Hispanic origin' 'Hispanic origin']


In [48]:
#check for weird values
print(new_df['OffEthnic'].unique())
new_df["OffEthnic"] = new_df['OffEthnic'].replace("Unknown or not reported", np.nan)
print(new_df['OffEthnic'].unique())

['Unknown or not reported' 'Not of Hispanic origin' 'Hispanic origin']
[nan 'Not of Hispanic origin' 'Hispanic origin']


In [49]:
#check for weird values
print(new_df['Relationship'].unique())
new_df["Relationship"] = new_df['Relationship'].replace(["Wife", "Girlfriend", "Boyfriend", "Common-law husband", "Common-law wife", "Homosexual relationship", "Ex-husband", "Ex-wife", "Husband" ], 'Romantic Relation')
new_df["Relationship"] = new_df['Relationship'].replace(["Son", "Daughter"], 'Offspring')
new_df["Relationship"] = new_df['Relationship'].replace(["Father", "Mother"], 'Parents')
new_df["Relationship"] = new_df['Relationship'].replace(["Stepmother", "Stepfather", "Stepdaughter", "Stepson"], 'Step Family')
new_df["Relationship"] = new_df['Relationship'].replace(["Brother", "Sister"], 'Sibling')
new_df["Relationship"] = new_df['Relationship'].replace(["Friend", "Acquaintance"], 'Friend')
new_df["Relationship"] = new_df['Relationship'].replace(["Other family", "In-law"], 'Distant Family')
new_df["Relationship"] = new_df['Relationship'].replace(["Stranger", "Employer", "Neighbor", "Other - known to victim", "Employee"], 'Other')
new_df["Relationship"] = new_df['Relationship'].replace(["Relationship not determined"], np.nan)
print(new_df['Relationship'].unique())

['Relationship not determined' 'Acquaintance' 'Stranger' 'Daughter'
 'Other family' 'Other - known to victim' 'Friend' 'Wife' 'Neighbor'
 'Employer' 'Son' 'Girlfriend' 'Boyfriend' 'Mother' 'Ex-husband' 'Sister'
 'Ex-wife' 'Brother' 'Common-law wife' 'Father' 'Homosexual relationship'
 'In-law' 'Husband' 'Stepfather' 'Employee' 'Stepson' 'Stepmother'
 'Common-law husband' 'Stepdaughter']
[nan 'Friend' 'Other' 'Offspring' 'Distant Family' 'Romantic Relation'
 'Parents' 'Sibling' 'Step Family']
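
The grouping above can also be written as a single lookup table, which keeps all categories in one place. This is an alternative sketch, not the code that produced the output above; `GROUPS` and `LABEL_TO_GROUP` are hypothetical names, and the example uses `Series.map` rather than repeated `replace` calls.

```python
import pandas as pd

# Group name -> original labels, copied from the replace calls above.
GROUPS = {
    "Romantic Relation": ["Wife", "Girlfriend", "Boyfriend", "Common-law husband",
                          "Common-law wife", "Homosexual relationship",
                          "Ex-husband", "Ex-wife", "Husband"],
    "Offspring": ["Son", "Daughter"],
    "Parents": ["Father", "Mother"],
    "Step Family": ["Stepmother", "Stepfather", "Stepdaughter", "Stepson"],
    "Sibling": ["Brother", "Sister"],
    "Friend": ["Friend", "Acquaintance"],
    "Distant Family": ["Other family", "In-law"],
    "Other": ["Stranger", "Employer", "Neighbor", "Other - known to victim", "Employee"],
}

# Invert to original label -> group name.
LABEL_TO_GROUP = {label: group for group, labels in GROUPS.items() for label in labels}

# Made-up values for illustration.
toy = pd.Series(["Wife", "Stranger", "Acquaintance", "Relationship not determined"])

# map() returns NaN for any label missing from the table, which covers
# "Relationship not determined" but would also hide unexpected labels.
unexpected = set(toy.unique()) - set(LABEL_TO_GROUP) - {"Relationship not determined"}
grouped = toy.map(LABEL_TO_GROUP)
print(grouped)
print(unexpected)
```

Checking `unexpected` before mapping guards against new labels silently turning into missing values.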


In [50]:
#check for weird values
print(new_df['Weapon'].unique())
new_df["Weapon"] = new_df['Weapon'].replace(["Firearm, type not stated", "Handgun - pistol, revolver, etc", "Other gun", "Rifle", "Shotgun"], 'Firearm')
new_df["Weapon"] = new_df['Weapon'].replace(["Personal weapons, includes beating", "Blunt object - hammer, club, etc", "Pushed or thrown out window", "Strangulation - hanging"], 'Interpersonal Violence')
new_df["Weapon"] = new_df['Weapon'].replace(["Narcotics or drugs, sleeping pills", "Poison - does not include gas"], 'Substance')
new_df["Weapon"] = new_df['Weapon'].replace(["Weapon Not Reported"], np.nan)
new_df["Weapon"] = new_df['Weapon'].replace(["Other or type unknown", "Asphyxiation - includes death by gas"], "Other")

['Firearm, type not stated' 'Personal weapons, includes beating'
 'Knife or cutting instrument' 'Handgun - pistol, revolver, etc'
 'Blunt object - hammer, club, etc' 'Narcotics or drugs, sleeping pills'
 'Strangulation - hanging' 'Asphyxiation - includes death by gas'
 'Other or type unknown' 'Fire' 'Other gun' 'Rifle' 'Shotgun' 'Drowning'
 'Weapon Not Reported' 'Poison - does not include gas' 'Explosives'
 'Pushed or thrown out window']


In [51]:
# Removing any rows that are missing more than 5 values

# create drop_df to keep only rows with 5 or fewer missing values
drop_df = new_df[new_df.isnull().sum(axis=1) <= 5]

# Count how many rows were dropped -> new_df length - drop_df length
rows_dropped = len(new_df) - len(drop_df)

# print how many rows were dropped
print(f"Rows dropped: {rows_dropped}")

Rows dropped: 29069
