# Naïve Bayes Classification

Given a set of observed features \(X\), naïve Bayes picks the class \(C\) that maximizes the posterior probability of the target variable \(y\).

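The naïve Bayes decision rule is posterior maximization: among all possible classes \(C\), choose the one for which the posterior probability \(P(y=C \mid X)\) is largest.

$$
\hat{y} = \mathrm{argmax}_C \left(P(y=C \mid X = \{x_1, x_2, \ldots, x_n\})\right)
$$

By Bayes' rule the posterior is proportional to the prior times the likelihood, and the "naïve" assumption that the features are conditionally independent given the class lets the likelihood factor into one term per feature. The rule is therefore equivalent to maximizing the product of the prior probability \(P(y=C)\) and the conditional probabilities \(P(x_i \mid y=C)\) of all features under that class.

$$
\hat{y} = \mathrm{argmax}_C \left( P(y = C) \cdot \prod_{i=1}^n P(x_i \mid y = C) \right )
$$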
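As a small worked instance of the product rule above, the sketch below scores two classes with made-up priors and conditional probabilities (none of these numbers come from the crime data) and picks the larger score. It also repeats the comparison in log space, a common way to avoid underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical numbers, for illustration only (not estimated from any dataset).
priors = {"arrest": 0.2, "no_arrest": 0.8}
# P(x_i | y) for two observed feature values under each class
conditionals = {
    "arrest": [0.10, 0.30],
    "no_arrest": [0.05, 0.20],
}

def score(label):
    """Unnormalized posterior: P(y) * prod_i P(x_i | y)."""
    return priors[label] * math.prod(conditionals[label])

def log_score(label):
    """Same quantity in log space: log P(y) + sum_i log P(x_i | y)."""
    return math.log(priors[label]) + sum(math.log(p) for p in conditionals[label])

scores = {label: score(label) for label in priors}
y_hat = max(scores, key=scores.get)
y_hat_log = max(priors, key=log_score)

print(scores)            # arrest: 0.2*0.03, no_arrest: 0.8*0.01
print(y_hat, y_hat_log)  # both pick the same class
```

Even though each individual feature is more likely under "arrest" here, the larger prior for "no_arrest" wins the comparison.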

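Common choices for modeling \(P(x_i \mid y)\):

Gaussian (normal) distribution: suited to continuous features.

Bernoulli distribution: suited to binary (0/1) features.

Multinomial distribution: suited to count or discrete features.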
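To make the three choices concrete, here is a minimal sketch that fits scikit-learn's `GaussianNB`, `BernoulliNB`, and `MultinomialNB` to synthetic features of the matching type. The data generation (sample size, means, probabilities, Poisson rates) is entirely made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
n = 300
y = rng.integers(0, 2, size=n)     # synthetic binary target
is_pos = (y == 1)[:, None]         # column vector for broadcasting

# Continuous features: class-dependent mean
X_cont = rng.normal(loc=np.where(is_pos, 2.0, 0.0), scale=1.0, size=(n, 3))
# Binary features: class-dependent probability of a 1
X_bin = (rng.random((n, 3)) < np.where(is_pos, 0.8, 0.2)).astype(int)
# Count features: class-dependent mix of counts across the three columns
lam = np.where(is_pos, [5.0, 3.0, 1.0], [1.0, 3.0, 5.0])
X_cnt = rng.poisson(lam=lam)

results = {}
for name, model, X in [
    ("GaussianNB", GaussianNB(), X_cont),
    ("BernoulliNB", BernoulliNB(), X_bin),
    ("MultinomialNB", MultinomialNB(), X_cnt),
]:
    results[name] = model.fit(X, y).score(X, y)  # training accuracy

print(results)
```

Note that `MultinomialNB` models how counts are distributed across features, so the synthetic count columns are given different rates per class rather than a uniformly higher rate.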

In [2]:
# Import Pandas and Numpy
import numpy as np
import pandas as pd

# Import Plotting Libraries
import matplotlib.pyplot as plt
%matplotlib inline

In [3]:
df = pd.read_csv("C:/Users/zhaoj/OneDrive - The University of Chicago/Desktop/uchicago/Q3/Machine Learning for public policy/ML_2021/data/chicago-crimes-2019.csv")
df.head(1)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,11922110,JC547456,12/15/2019 03:40:00 AM,039XX W NORTH AVE,1020,ARSON,BY FIRE,VEHICLE NON-COMMERCIAL,False,False,...,26.0,23.0,9,1149951.0,1910348.0,2019,04/27/2020 03:48:23 PM,41.909907,-87.724578,"(41.909907002, -87.724577987)"


### Clean the Features

In [4]:
print("Found {} NaN community area records.".format(df['Community Area'].isna().sum()))
# Note: dropna() without `subset` drops rows with a NaN in any column, not only 'Community Area'.
df.dropna(inplace=True)

Found 4 NaN community area records.
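The cell above calls `dropna()` with no arguments, which removes a row if any column is missing. If the intent is to drop only rows lacking a community area, `subset` restricts the check. The toy frame below is hypothetical and only illustrates the difference in row counts.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame with NaNs in different columns
toy = pd.DataFrame({
    "Community Area": [41.0, np.nan, 25.0],
    "Ward": [np.nan, 5.0, 29.0],
    "Arrest": [False, True, True],
})

all_cols = toy.dropna()                           # drops rows with a NaN anywhere
only_ca = toy.dropna(subset=["Community Area"])   # drops rows missing the area only

print(len(all_cols), len(only_ca))
```

Which variant is appropriate depends on whether the other columns are needed later; the notebook keeps only `Hour`, `Community Area`, and `Arrest`.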


### Transform the Features


In [5]:
df['Hour'] = pd.to_datetime(df['Date']).dt.hour
df['Community Area'] = df['Community Area'].astype(int)
df['Hour'] = df['Hour'].astype(int)
df['Arrest'] = df['Arrest'].astype(int)
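The hour extraction can be tried on a couple of strings shaped like the `Date` value in the sample row. Passing an explicit `format` is optional; the format string below is an assumption based on that one sample and would need adjusting if other rows differ.

```python
import pandas as pd

# Two timestamps in the same layout as the sample row's Date value
toy = pd.DataFrame({"Date": ["12/15/2019 03:40:00 AM", "07/04/2019 10:05:00 PM"]})

# %I is the 12-hour clock and %p the AM/PM marker; .dt.hour returns 0..23
toy["Hour"] = pd.to_datetime(toy["Date"], format="%m/%d/%Y %I:%M:%S %p").dt.hour

print(toy["Hour"].tolist())
```

An explicit format avoids per-row format inference and makes a malformed timestamp fail loudly instead of being parsed differently.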

In [6]:
df_backup = df.copy()
df = df.loc[:,['Hour', 'Community Area', 'Arrest']]

## Bayes Classification by Hand

In [7]:
arrests = df[df['Arrest']==1]['Arrest'].count()
no_arrests = df[df['Arrest']==0]['Arrest'].count()
total = df['Arrest'].count()

# Class priors, indexed by label: p_y[0] = P(no arrest), p_y[1] = P(arrest)
p_y = [no_arrests / total,
       arrests / total]

print("P(y=0) = {:10.4f}\nP(y=1) = {:10.4f}".format(p_y[0],p_y[1]))

P(y=0) =     0.7861
P(y=1) =     0.2139
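One way to keep priors tied to their class labels is to let pandas compute them, so that the entry for label `a` is always \(P(y=a)\). The sketch below uses a hypothetical ten-row target column.

```python
import pandas as pd

# Hypothetical target column: 8 non-arrests (0) and 2 arrests (1)
toy = pd.DataFrame({"Arrest": [0, 0, 0, 1, 0, 1, 0, 0, 0, 0]})

# Relative frequency of each label, sorted so the index reads 0, 1
priors = toy["Arrest"].value_counts(normalize=True).sort_index()

print(priors.loc[0], priors.loc[1])
```

Looking values up by label with `.loc` avoids relying on the order in which a list was built.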


In [8]:
# Derive the observed area codes and hours from the data rather than assuming ranges.
ca = np.sort(df['Community Area'].unique())
hr = np.sort(df['Hour'].unique())

# Class counts, indexed by label: arrest[0] = no arrest, arrest[1] = arrest
arrest = [no_arrests,arrests]
# likelihood[c][h][a]: area code c (0..77), hour h (0..23), class a (0/1)
likelihood = [[[0 for _ in range(2)] for _ in range(24)] for _ in range(78)]

In [9]:
# Estimate P(Community Area = c, Hour = h | Arrest = a) as a relative frequency within class a
for c in ca:
    for h in hr:
        for a in (0,1):
            mask = (df['Community Area']==c) & (df['Hour']==h) & (df['Arrest']==a)
            likelihood[c][h][a] = mask.sum() / arrest[a]

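This tabulates, for every combination of community area and hour, the conditional probability of that combination given an arrest and given no arrest. Note that the table stores the joint likelihood \(P(\text{area}, \text{hour} \mid y)\) directly rather than the factored product \(P(\text{area} \mid y) \cdot P(\text{hour} \mid y)\), so this by-hand classifier does not itself rely on the naïve independence assumption.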
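The same per-class relative frequencies can be computed without explicit loops. The sketch below is one possible vectorized version using `groupby` on a hypothetical six-row frame; the column names match the notebook, but the rows are invented.

```python
import pandas as pd

# Hypothetical incidents, for illustration only
toy = pd.DataFrame({
    "Community Area": [41, 41, 25, 25, 25, 41],
    "Hour":           [10, 10, 10, 11, 10, 11],
    "Arrest":         [0,  1,  1,  0,  1,  0],
})

# Count incidents per (class, area, hour) cell
counts = toy.groupby(["Arrest", "Community Area", "Hour"]).size()

# Divide each cell by the total for its class to get P(area, hour | class)
lik = counts / counts.groupby(level="Arrest").transform("sum")

print(lik)
```

Combinations that never occur in a class are simply absent from `lik`, which corresponds to a likelihood of zero in the nested-list version.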

### Sanity Check

In [10]:
s = 0
for c in ca:
    for h in hr:
        s = s + likelihood[c][h][0]
s

np.float64(0.9999999999999976)

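In theory, the conditional probabilities over all community area and hour combinations should sum to 1 within a class, because together they cover every no-arrest incident.
A sum very close to 1 (here it differs only by floating-point rounding) indicates a valid probability distribution.
A sum far from 1 would point to a problem in the probability calculation or the data processing.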
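Because the sum is rarely exactly 1 in floating point, a tolerance-based comparison is a convenient way to automate this check for both classes at once. The array below is filled with random hypothetical counts, and its shape (5 areas, 24 hours, 2 classes) is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical incident counts indexed as [area, hour, class]
counts = rng.integers(1, 20, size=(5, 24, 2)).astype(float)

# Normalize within each class so that each class's cells sum to 1
lik = counts / counts.sum(axis=(0, 1), keepdims=True)

class_sums = lik.sum(axis=(0, 1))        # one sum per class
ok = np.allclose(class_sums, 1.0)        # tolerant of rounding error

print(class_sums, ok)
```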

### Predictions

#### Hyde Park, 10 a.m.

Hyde Park is area 41. If the incident is at 1000h, will an arrest take place?

In [11]:
likelihood[41][10][0] * p_y[0] < likelihood[41][10][1] * p_y[1]

np.False_

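In [12]:
# Normalized posterior P(arrest | area n, hour h) via Bayes' rule:
# likelihood * prior for the arrest class, divided by the sum over both classes
h = 10
n = 41
s = 0
for a in (0,1):
    s = s + likelihood[n][h][a] * p_y[a]

likelihood[n][h][1] * p_y[1] / s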

#### Austin, 10 a.m.

Austin is area 25. If the incident is at 1000h, will an arrest take place?

In [13]:
likelihood[25][10][0] * p_y[0] < likelihood[25][10][1] * p_y[1]

np.True_
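The comparisons above use a likelihood estimated jointly over area and hour. For contrast, the sketch below computes a normalized posterior both ways on a hypothetical six-row frame: once with the joint likelihood and once with the naïve factorization \(P(\text{area} \mid y) \cdot P(\text{hour} \mid y)\). The helper `posterior_arrest` is an illustrative name, not part of the notebook.

```python
import pandas as pd

# Hypothetical incidents, for illustration only
toy = pd.DataFrame({
    "Community Area": [41, 41, 25, 25, 25, 41],
    "Hour":           [10, 10, 10, 11, 10, 11],
    "Arrest":         [0,  1,  1,  0,  1,  0],
})

def posterior_arrest(df, area, hour, naive):
    """P(arrest | area, hour), using a joint or a factored likelihood."""
    scores = {}
    for a in (0, 1):
        cls = df[df["Arrest"] == a]
        prior = len(cls) / len(df)
        in_area = cls["Community Area"] == area
        at_hour = cls["Hour"] == hour
        if naive:
            lik = in_area.mean() * at_hour.mean()   # P(area|a) * P(hour|a)
        else:
            lik = (in_area & at_hour).mean()        # P(area, hour | a)
        scores[a] = prior * lik
    # Assumes at least one class has a nonzero score for this query
    return scores[1] / (scores[0] + scores[1])

post_joint = posterior_arrest(toy, 25, 10, naive=False)
post_naive = posterior_arrest(toy, 25, 10, naive=True)
print(post_joint, post_naive)
```

On this toy data the joint estimate is 1.0 because no non-arrest incident falls in that exact cell, while the factored estimate (6/7) still assigns some probability to the other class. The joint table needs far more data per cell, which is the usual motivation for the naïve factorization.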

## Naïve Bayes with Scikit-Learn

In [14]:
# Import a naïve Bayes classifier. ComplementNB is a variant of multinomial naïve Bayes
# aimed at imbalanced classes; like MultinomialNB, it treats feature values as counts.
from sklearn.naive_bayes import ComplementNB
nb = ComplementNB()

features = df.loc[:,['Community Area', 'Hour']].values
target = df['Arrest'].values

#### Hyde Park, 10 a.m.

In [15]:
nb.fit(features,target)
nb.predict([[41,10]])[0]

np.int64(0)

#### Austin, 10 a.m.

In [16]:
nb.predict([[25,10]])[0]

np.int64(1)
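Community area codes and hours are category labels rather than counts, so one alternative worth comparing is scikit-learn's `CategoricalNB`, which estimates a separate categorical distribution per feature and class. The sketch below runs on synthetic data: the sample size and the rule generating the target are invented, and nothing here reproduces the notebook's results.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

rng = np.random.default_rng(0)
n = 500
area = rng.integers(1, 78, size=n)   # hypothetical area codes 1..77
hour = rng.integers(0, 24, size=n)   # hours 0..23

# Made-up rule: arrests are more likely for area codes <= 30
p_arrest = np.where(area <= 30, 0.7, 0.1)
arrest = (rng.random(n) < p_arrest).astype(int)

X = np.column_stack([area, hour])

# min_categories reserves a slot for every possible code, even unseen ones
clf = CategoricalNB(min_categories=[78, 24])
clf.fit(X, arrest)

proba = clf.predict_proba([[25, 10], [41, 10]])  # columns: P(y=0), P(y=1)
print(proba)
```

`predict_proba` returns normalized posteriors, which are often more informative than the hard 0/1 label from `predict`.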

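In summary, naïve Bayes works as follows.

Estimate the prior probability of each class (the overall shares of arrest and no-arrest incidents).

Estimate the conditional probability of each feature within each class (the distribution over community areas and hours given an arrest and given no arrest).

For a new sample, compute the posterior probability of each class and predict the class with the highest one.

Naïve Bayes tends to perform poorly when features are strongly correlated, since the conditional independence assumption is then violated.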
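The effect of correlated features can be seen in an extreme case: copying one informative binary feature three times. The copies carry no new information, but naïve Bayes multiplies their likelihoods as if they were independent evidence, which pushes the posterior toward the extremes. The data below is synthetic and the probabilities 0.7 and 0.3 are arbitrary.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)

# One informative binary feature: P(x=1 | y=1) = 0.7, P(x=1 | y=0) = 0.3
x = (rng.random(n) < np.where(y == 1, 0.7, 0.3)).astype(int)

X_single = x.reshape(-1, 1)
X_dup = np.column_stack([x, x, x])   # perfectly correlated copies

# Posterior P(y=1 | x=1) with one copy versus three copies of the feature
p_single = BernoulliNB().fit(X_single, y).predict_proba([[1]])[0, 1]
p_dup = BernoulliNB().fit(X_dup, y).predict_proba([[1, 1, 1]])[0, 1]

print(p_single, p_dup)
```

The duplicated model reports a noticeably higher probability for the same underlying evidence, so its probabilities are overconfident even when the predicted label is unchanged.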