**Prediction Question**: Were Asian Americans more likely to be attacked in cities and towns with more Covid cases, owing to the rise in xenophobia during the pandemic?

**Importance for client**: This research question can guide law enforcement in the future. Even if the same form of anti-Asian hate does not recur, xenophobia will most likely surface again. By analyzing this data, we can check whether attacks on Asian Americans actually rose during the pandemic, and law enforcement can then use the findings to protect targeted groups in a similar scenario.

**Setting Up Data:**

In [1]:
# Setting up environment:
import pandas as pd
import numpy as np

In [2]:
# loading in dataset as dataframe:
df = pd.read_csv("cleaned_df.csv")
print(df.shape)

# previewing the data:
df.head(5)
df.columns

(13794, 21)


Index(['Unnamed: 0', 'State', 'Agency', 'Source', 'Solved', 'Year', 'Month',
       'Homicide', 'Situation', 'VicAge', 'VicSex', 'VicRace', 'VicEthnic',
       'OffAge', 'OffSex', 'OffRace', 'OffEthnic', 'Weapon', 'Relationship',
       'VicCount', 'OffCount'],
      dtype='object')

In [3]:
# selecting relevant columns to use for model analysis:
df_new = df[['Year', 'Month', 'State', 'OffRace', 'OffAge', 'Situation', 'VicRace']]
df_new.head(5)

,Year,Month,State,OffRace,OffAge,Situation,VicRace
0,2016,January,Alaska,American Indian or Alaskan Native,21.0,Single victim/single offender,American Indian or Alaskan Native
1,2016,January,Alaska,White,15.0,Multiple victims/single offender,White
2,2016,January,Alaska,White,15.0,Multiple victims/single offender,White
3,2016,January,Alaska,White,34.0,Single victim/multiple offenders,American Indian or Alaskan Native
4,2016,January,Alaska,American Indian or Alaskan Native,33.0,Single victim/single offender,American Indian or Alaskan Native


In [4]:
# checking for NaNs:
print(df_new.isna().sum())

# filling missing values for 'OffRace' with 'Unknown', since a categorical column cannot be median-imputed
# (alternative: drop those rows instead):
# df_new = df_new.dropna(subset=['OffRace'], axis=0)
df_new['OffRace'] = df_new['OffRace'].fillna('Unknown')


# imputing missing values for 'OffAge' with median:
df_new['OffAge'] = df_new['OffAge'].fillna(df_new['OffAge'].median())

# imputing missing values for 'VicRace' with 'Unknown':
df_new['VicRace'] = df_new['VicRace'].fillna('Unknown')

# previewing new/cleaned dataframe:
print(df_new.head())
print(df_new.isna().sum())

Year            0
Month           0
State           0
OffRace      4413
OffAge       4441
Situation       0
VicRace       210
dtype: int64
   Year    Month   State                            OffRace  OffAge  \
0  2016  January  Alaska  American Indian or Alaskan Native    21.0   
1  2016  January  Alaska                              White    15.0   
2  2016  January  Alaska                              White    15.0   
3  2016  January  Alaska                              White    34.0   
4  2016  January  Alaska  American Indian or Alaskan Native    33.0   

                          Situation                            VicRace  
0     Single victim/single offender  American Indian or Alaskan Native  
1  Multiple victims/single offender                              White  
2  Multiple victims/single offender                              White  
3  Single victim/multiple offenders  American Indian or Alaskan Native  
4     Single victim/single offender  American Indian or Alaskan Native

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['OffRace'] = df_new["OffRace"].fillna('Unknown')
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['OffAge'] = df_new['OffAge'].fillna(df_new['OffAge'].median())
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_new['VicRace'] = df_new['VicRace'].fillna('Unknown')
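
The three warnings above come from assigning into `df_new`, which pandas treats as a possible slice of `df`. One common way to avoid them is to take an explicit copy when selecting the columns. The sketch below shows the idea on a small made-up frame; the rows and the extra `Weapon` values are illustrative assumptions, not records from `cleaned_df.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the full dataframe (values are illustrative only).
df = pd.DataFrame({
    "Year": [2016, 2016, 2017, 2017],
    "OffRace": ["White", None, "Black", None],
    "OffAge": [21.0, np.nan, 34.0, 15.0],
    "VicRace": ["White", "Asian", None, "Black"],
    "Weapon": ["Handgun", "Knife", "Handgun", "Rifle"],
})

# .copy() makes df_new an independent frame, so the assignments below
# modify df_new itself and pandas has no reason to warn.
df_new = df[["Year", "OffRace", "OffAge", "VicRace"]].copy()

df_new["OffRace"] = df_new["OffRace"].fillna("Unknown")
df_new["OffAge"] = df_new["OffAge"].fillna(df_new["OffAge"].median())
df_new["VicRace"] = df_new["VicRace"].fillna("Unknown")

# Every column should now report zero missing values.
print(df_new.isna().sum())
```

The original `df` is left untouched, which also makes it easy to compare the imputed columns against the raw ones later.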


**One Hot Encoding Data for Model:**

In [5]:
# separating numerical and categorical columns:

# numerical columns:
num_cols = df_new.select_dtypes(include=[np.number]).columns.tolist()
print(num_cols)

# categorical columns:
cat_cols = df_new.select_dtypes(exclude=[np.number]).columns.tolist()
print(cat_cols)

['Year', 'OffAge']
['Month', 'State', 'OffRace', 'Situation', 'VicRace']


In [6]:
from sklearn.model_selection import train_test_split

# one-hot encoding the categorical feature columns; the target 'VicRace' is
# left out of the features so that it does not leak into X:
feature_cat_cols = [col for col in cat_cols if col != 'VicRace']
df_encoded = pd.get_dummies(df_new.drop(columns=['VicRace']),
                            columns=feature_cat_cols, drop_first=True)

# get_dummies keeps the numerical columns, so no extra concat is needed:
X = df_encoded
y = df_new['VicRace']

# splitting data into test/train:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2, random_state=42)
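
As a small illustration of what `pd.get_dummies(..., drop_first=True)` produces, and of why the target column should stay out of the feature matrix, the following sketch uses a made-up three-column frame. The rows and the names `toy`, `X_toy` and `y_toy` are assumptions for the example only.

```python
import pandas as pd

# Made-up rows, for illustration only.
toy = pd.DataFrame({
    "Year": [2019, 2020, 2020, 2021],
    "State": ["Alaska", "Texas", "Alaska", "Ohio"],
    "VicRace": ["White", "Asian", "Unknown", "White"],
})

target = "VicRace"

# Drop the target before encoding so no VicRace_* dummy ends up in X.
features = toy.drop(columns=[target])

# drop_first=True omits the first category (here "Alaska") to avoid a
# redundant column; numeric columns such as Year pass through unchanged.
X_toy = pd.get_dummies(features, columns=["State"], drop_first=True)
y_toy = toy[target]

print(X_toy.columns.tolist())  # ['Year', 'State_Ohio', 'State_Texas']
```

If a `VicRace_*` column were present in the features, a tree could read the answer directly from it, and any score it reported would say nothing about the other predictors.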

In [7]:
# importing all necessary libraries:
from sklearn import tree
from sklearn.tree import plot_tree
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score

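# NOTE: Y_train holds string race labels, so the regressor cannot fit it and this
# cell stops with the ValueError shown below; the target needs a numeric encoding first.
cart = tree.DecisionTreeRegressor(min_samples_leaf=500)
cart = cart.fit(X_train, Y_train)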

var_names = cart.feature_names_in_
plot_tree(cart, filled=True, feature_names=var_names)

# testing through min_samples_leaf from 1 - 25
r2_list = []
leaf_values = list(range(1, 26))

for leaf in leaf_values:
    model = DecisionTreeRegressor(min_samples_leaf=leaf, random_state=42)
    model.fit(X_train, Y_train)
    y_hat = model.predict(X_test)
    r2 = r2_score(Y_test, y_hat)
    r2_list.append(r2)

best_leaf = leaf_values[r2_list.index(max(r2_list))]
best_r2 = max(r2_list)

print(f'Best min_samples_leaf value: {best_leaf} with R^2 of {best_r2:.3f}')

ValueError: could not convert string to float: 'White'
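
The error arises because `DecisionTreeRegressor` expects a numeric target, while `VicRace` contains strings such as `'White'`. One possible way forward, given the prediction question, is to recast the target as a binary indicator (Asian victim or not) and fit a `DecisionTreeClassifier`. The sketch below does this on randomly generated toy data. It assumes Asian victims are labelled exactly `"Asian"` in `VicRace`, which should be checked against `df_new['VicRace'].unique()`, and it swaps in cross-validated balanced accuracy for R², which is a substitution and not the notebook's original metric.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Randomly generated toy data; it carries no real signal.
rng = np.random.default_rng(42)
n = 400
toy = pd.DataFrame({
    "Year": rng.integers(2016, 2022, size=n),
    "OffAge": rng.integers(15, 70, size=n).astype(float),
    "State": rng.choice(["Alaska", "Texas", "Ohio"], size=n),
    "VicRace": rng.choice(["White", "Black", "Asian", "Unknown"], size=n),
})

# Assumption: Asian victims are labelled exactly "Asian" in VicRace.
y = (toy["VicRace"] == "Asian").astype(int)

# Features exclude the target column to avoid leakage.
X = pd.get_dummies(toy.drop(columns=["VicRace"]), columns=["State"], drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Pick min_samples_leaf by cross-validation on the training split only,
# so the test split stays untouched until the final check.
scores = {}
for leaf in (1, 5, 25, 50):
    clf = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=42)
    scores[leaf] = cross_val_score(
        clf, X_train, y_train, cv=5, scoring="balanced_accuracy"
    ).mean()

best_leaf = max(scores, key=scores.get)
final = DecisionTreeClassifier(min_samples_leaf=best_leaf, random_state=42)
final.fit(X_train, y_train)
print(f"best min_samples_leaf: {best_leaf}, test accuracy: {final.score(X_test, y_test):.3f}")
```

Because the toy labels are random, the scores here are meaningless; the point is only that a numeric (0/1) target lets the tree fit without the conversion error. On the real data the positive class is likely to be rare, so balanced accuracy or a similar metric is usually more informative than plain accuracy.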