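In general, EDA (exploratory data analysis) consists of the following steps:

1. Define the goal of the analysis task;

2. Filter and clean the data;

   - Detect outliers and missing values

3. Analyze and visualize the data;

   - Explore the relationships among features

   - Explore the relationships between the features and the target variable

4. Build a model based on the results of the previous step;

5. Draw the final conclusions.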

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import random
from plotly import tools
import plotly.express as px
from plotly.offline import init_notebook_mode,iplot,plot
import plotly.graph_objs as go

In [3]:
df_train = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_train

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities
0,1,,Afghanistan,2020-01-22,0.0,0.0
1,2,,Afghanistan,2020-01-23,0.0,0.0
2,3,,Afghanistan,2020-01-24,0.0,0.0
3,4,,Afghanistan,2020-01-25,0.0,0.0
4,5,,Afghanistan,2020-01-26,0.0,0.0
...,...,...,...,...,...,...
35990,35991,,Zimbabwe,2020-05-11,36.0,4.0
35991,35992,,Zimbabwe,2020-05-12,36.0,4.0
35992,35993,,Zimbabwe,2020-05-13,37.0,4.0
35993,35994,,Zimbabwe,2020-05-14,37.0,4.0


In [4]:
df_test

Unnamed: 0,ForecastId,Province_State,Country_Region,Date
0,1,,Afghanistan,2020-04-02
1,2,,Afghanistan,2020-04-03
2,3,,Afghanistan,2020-04-04
3,4,,Afghanistan,2020-04-05
4,5,,Afghanistan,2020-04-06
...,...,...,...,...
13454,13455,,Zimbabwe,2020-05-10
13455,13456,,Zimbabwe,2020-05-11
13456,13457,,Zimbabwe,2020-05-12
13457,13458,,Zimbabwe,2020-05-13


In [5]:
df_train.isna().sum()

Id                    0
Province_State    20700
Country_Region        0
Date                  0
ConfirmedCases        0
Fatalities            0
dtype: int64

In [6]:
df_train = df_train.fillna("")
df_train

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities
0,1,,Afghanistan,2020-01-22,0.0,0.0
1,2,,Afghanistan,2020-01-23,0.0,0.0
2,3,,Afghanistan,2020-01-24,0.0,0.0
3,4,,Afghanistan,2020-01-25,0.0,0.0
4,5,,Afghanistan,2020-01-26,0.0,0.0
...,...,...,...,...,...,...
35990,35991,,Zimbabwe,2020-05-11,36.0,4.0
35991,35992,,Zimbabwe,2020-05-12,36.0,4.0
35992,35993,,Zimbabwe,2020-05-13,37.0,4.0
35993,35994,,Zimbabwe,2020-05-14,37.0,4.0
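
The `fillna("")` call above is applied to the whole frame. Since only `Province_State` contains missing values here, the result is the same, but a call like that could turn a numeric column into strings if one ever had gaps. A minimal sketch, on a made-up toy frame that reuses the notebook's column names, of checking the missing share per column and filling only the affected column:

```python
import numpy as np
import pandas as pd

# Toy frame reusing the notebook's column names; the values are made up.
toy = pd.DataFrame({
    "Province_State": [np.nan, "Hubei", np.nan, "New York"],
    "Country_Region": ["Afghanistan", "China", "Zimbabwe", "US"],
    "ConfirmedCases": [0.0, 10.0, 3.0, 7.0],
})

# Fraction of missing values per column.
missing_share = toy.isna().mean()

# Fill only the column that is known to contain missing values.
toy["Province_State"] = toy["Province_State"].fillna("")
```

With this approach the numeric columns keep their dtype even if they contain gaps later on.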


In [7]:
df_train.describe()

Unnamed: 0,Id,ConfirmedCases,Fatalities
count,35995.0,35995.0,35995.0
mean,17998.0,3683.508737,243.560217
std,10391.005806,18986.978708,1832.966999
min,1.0,0.0,0.0
25%,8999.5,0.0,0.0
50%,17998.0,19.0,0.0
75%,26996.5,543.0,7.0
max,35995.0,345813.0,33998.0


In [8]:
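# Sum confirmed cases per (country, province, date).
df_countries = df_train.groupby(["Country_Region","Province_State","Date"])["ConfirmedCases"].sum()
# Take the peak value over all dates for each province.
df_countries = df_countries.groupby(["Country_Region","Province_State"]).max()
# Add up the province peaks per country and sort in descending order.
df_countries = df_countries.groupby(["Country_Region"]).sum().sort_values(ascending=False)
# Keep the 20 countries with the most confirmed cases.
df_countries = df_countries.head(20)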
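
The chained `groupby` calls above can be hard to follow on the full dataset. The sketch below replays the same idea on a made-up toy frame: take the peak value per province, then sum those peaks per country. It assumes, as the data appears to show, that `ConfirmedCases` is a running total.

```python
import pandas as pd

# Toy cumulative counts; the numbers are made up for illustration.
toy = pd.DataFrame({
    "Country_Region": ["US", "US", "US", "US", "Italy", "Italy"],
    "Province_State": ["New York", "New York", "Texas", "Texas", "", ""],
    "Date": ["2020-05-13", "2020-05-14"] * 3,
    "ConfirmedCases": [90.0, 100.0, 40.0, 50.0, 20.0, 30.0],
})

# Peak value over all dates for each province.
per_province = toy.groupby(["Country_Region", "Province_State"])["ConfirmedCases"].max()

# Sum the province peaks to get one total per country, largest first.
per_country = per_province.groupby(level="Country_Region").sum().sort_values(ascending=False)
print(per_country)
```

The empty-string province for Italy survives the grouping. Had it been left as NaN, `groupby` would drop those rows by default, which is why the missing values are filled before this step.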

In [9]:
fig = px.bar(df_countries,x=df_countries.index,y='ConfirmedCases',labels={"x":"Country"},
             color='ConfirmedCases',color_continuous_scale=px.colors.sequential.Bluered)
fig.update_layout(title_text='Peak confirmed cases by country')
fig.show()

In [12]:
df_usa_records = df_train.loc[df_train["Country_Region"]=="US",["Province_State","Date","ConfirmedCases","Fatalities"]]
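# Sum only the numeric columns so the string column Province_State is not aggregated.
df_usa_records = df_usa_records.groupby("Date")[["ConfirmedCases","Fatalities"]].sum()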
df_usa_records = df_usa_records.reset_index()
df_usa_records

Unnamed: 0,Date,ConfirmedCases,Fatalities
0,2020-01-22,0.0,0.0
1,2020-01-23,0.0,0.0
2,2020-01-24,0.0,0.0
3,2020-01-25,0.0,0.0
4,2020-01-26,0.0,0.0
...,...,...,...
110,2020-05-11,1347710.0,80677.0
111,2020-05-12,1369403.0,82371.0
112,2020-05-13,1390235.0,84114.0
113,2020-05-14,1417603.0,85893.0
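
The totals in this table appear to be cumulative, so a bar chart of them can only grow from day to day. Where daily increments are of interest, they can be derived with `diff()`. A minimal sketch on made-up numbers, where `NewCases` is a hypothetical column name not used elsewhere in the notebook:

```python
import pandas as pd

# Toy cumulative series; the numbers are made up for illustration.
toy = pd.DataFrame({
    "Date": ["2020-05-11", "2020-05-12", "2020-05-13", "2020-05-14"],
    "ConfirmedCases": [100.0, 130.0, 150.0, 190.0],
})

# Day-over-day difference; the first day has no predecessor,
# so its own cumulative value is used as the increment.
toy["NewCases"] = toy["ConfirmedCases"].diff().fillna(toy["ConfirmedCases"])
```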


In [14]:
fig = px.bar(df_usa_records,x="Date",y="ConfirmedCases",color="ConfirmedCases",color_continuous_scale=px.colors.sequential.Magma)
fig.update_layout(title_text="Confirmed cases in the US over time")
fig.show()

In [16]:
fig = px.bar(df_usa_records,x="Date",y="Fatalities",color="Fatalities",color_continuous_scale=px.colors.sequential.Magma)
fig.update_layout(title_text="Fatalities in the US over time")
fig.show()

In [18]:
df_brz_records = df_train.loc[df_train["Country_Region"]=="Brazil",["Province_State","Date","ConfirmedCases","Fatalities"]]
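# Sum only the numeric columns so the string column Province_State is not aggregated.
df_brz_records = df_brz_records.groupby("Date")[["ConfirmedCases","Fatalities"]].sum()
df_brz_records = df_brz_records.reset_index()
fig = px.bar(df_brz_records,x="Date",y="ConfirmedCases",color="ConfirmedCases",color_continuous_scale=px.colors.sequential.Magma)
fig.update_layout(title_text="Confirmed cases in Brazil over time")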
fig.show()

In [19]:
fig = px.bar(df_brz_records,x="Date",y="Fatalities",color="Fatalities",color_continuous_scale=px.colors.sequential.Magma)
fig.update_layout(title_text="Fatalities in Brazil over time")
fig.show()

In [11]:
def get_year(date_str):
    comps = date_str.split("-")
    return int(comps[0])

def get_month(date_str):
    comps = date_str.split("-")
    return int(comps[1])

def get_day(date_str):
    comps = date_str.split("-")
    return int(comps[2])

df_train["Year"] = df_train.Date.apply(get_year)
df_train["Month"] = df_train.Date.apply(get_month)
df_train["Day"] = df_train.Date.apply(get_day)
df_train

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,Year,Month,Day
0,1,,Afghanistan,2020-01-22,0.0,0.0,2020,1,22
1,2,,Afghanistan,2020-01-23,0.0,0.0,2020,1,23
2,3,,Afghanistan,2020-01-24,0.0,0.0,2020,1,24
3,4,,Afghanistan,2020-01-25,0.0,0.0,2020,1,25
4,5,,Afghanistan,2020-01-26,0.0,0.0,2020,1,26
...,...,...,...,...,...,...,...,...,...
35990,35991,,Zimbabwe,2020-05-11,36.0,4.0,2020,5,11
35991,35992,,Zimbabwe,2020-05-12,36.0,4.0,2020,5,12
35992,35993,,Zimbabwe,2020-05-13,37.0,4.0,2020,5,13
35993,35994,,Zimbabwe,2020-05-14,37.0,4.0,2020,5,14
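
The three helper functions above split the date string by hand. An alternative sketch, assuming every date follows the `YYYY-MM-DD` pattern seen in the data, parses the column once with `pd.to_datetime` and reads the parts through the `.dt` accessor:

```python
import pandas as pd

# Toy dates in the same format as the notebook's Date column.
toy = pd.DataFrame({"Date": ["2020-01-22", "2020-04-02", "2020-05-14"]})

# Parse once, then read the components through the .dt accessor.
parsed = pd.to_datetime(toy["Date"], format="%Y-%m-%d")
toy["Year"] = parsed.dt.year
toy["Month"] = parsed.dt.month
toy["Day"] = parsed.dt.day
```

Unlike the string split, parsing raises an error on malformed dates instead of passing bad values through silently.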


In [12]:
df_train["Country_Region"] = df_train["Country_Region"] + df_train["Province_State"]
df_train["Country_Region"].value_counts()

Ireland                             115
USLouisiana                         115
USPennsylvania                      115
ChinaQinghai                        115
Saint Lucia                         115
                                   ... 
Antigua and Barbuda                 115
Saint Vincent and the Grenadines    115
Grenada                             115
Holy See                            115
Indonesia                           115
Name: Country_Region, Length: 313, dtype: int64

In [13]:
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
df_train["Country_Region"] = encoder.fit_transform(df_train["Country_Region"])
df_train

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,Year,Month,Day
0,1,,0,2020-01-22,0.0,0.0,2020,1,22
1,2,,0,2020-01-23,0.0,0.0,2020,1,23
2,3,,0,2020-01-24,0.0,0.0,2020,1,24
3,4,,0,2020-01-25,0.0,0.0,2020,1,25
4,5,,0,2020-01-26,0.0,0.0,2020,1,26
...,...,...,...,...,...,...,...,...,...
35990,35991,,312,2020-05-11,36.0,4.0,2020,5,11
35991,35992,,312,2020-05-12,36.0,4.0,2020,5,12
35992,35993,,312,2020-05-13,37.0,4.0,2020,5,13
35993,35994,,312,2020-05-14,37.0,4.0,2020,5,14
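
The integer codes are only meaningful relative to the encoder that produced them. A minimal sketch, on made-up region labels, of fitting one `LabelEncoder`, reusing it for a second frame, and mapping codes back to names with `inverse_transform`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy region labels in the concatenated country+province style used above.
train_regions = pd.Series(["USTexas", "Ireland", "ChinaQinghai", "Ireland"])
test_regions = pd.Series(["Ireland", "USTexas"])

encoder = LabelEncoder()
# fit_transform learns the (sorted) label set and encodes in one step.
train_codes = encoder.fit_transform(train_regions)
# transform reuses the learned mapping, so the codes match across frames.
test_codes = encoder.transform(test_regions)

# Recover the original name for a given code.
name_for_zero = encoder.inverse_transform([0])[0]
```

Note that `transform` raises an error for labels it did not see during fitting, which makes a mismatch between the two frames visible early.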


In [16]:
df_train_final = df_train[["Country_Region","Year","Month","Day"]]
labels = df_train.ConfirmedCases

In [18]:
from xgboost import XGBRegressor
xgb = XGBRegressor(n_estimators=2500, random_state=0, max_depth=27)
xgb.fit(df_train_final, labels)

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=27,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=2500, n_jobs=8, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [20]:
df_test = df_test.fillna("")
df_test["Year"] = df_test.Date.apply(get_year)
df_test["Month"] = df_test.Date.apply(get_month)
df_test["Day"] = df_test.Date.apply(get_day)
df_test["Country_Region"] = df_test["Country_Region"] +df_test["Province_State"]
df_test


Unnamed: 0,ForecastId,Province_State,Country_Region,Date,Year,Month,Day
0,1,,Afghanistan,2020-04-02,2020,4,2
1,2,,Afghanistan,2020-04-03,2020,4,3
2,3,,Afghanistan,2020-04-04,2020,4,4
3,4,,Afghanistan,2020-04-05,2020,4,5
4,5,,Afghanistan,2020-04-06,2020,4,6
...,...,...,...,...,...,...,...
13454,13455,,Zimbabwe,2020-05-10,2020,5,10
13455,13456,,Zimbabwe,2020-05-11,2020,5,11
13456,13457,,Zimbabwe,2020-05-12,2020,5,12
13457,13458,,Zimbabwe,2020-05-13,2020,5,13


In [25]:
# Reuse the encoder fitted on the training set so each region keeps the same integer code.
df_test["Country_Region"] = encoder.transform(df_test["Country_Region"])
df_test

Unnamed: 0,ForecastId,Province_State,Country_Region,Date,Year,Month,Day
0,1,,0,2020-04-02,2020,4,2
1,2,,0,2020-04-03,2020,4,3
2,3,,0,2020-04-04,2020,4,4
3,4,,0,2020-04-05,2020,4,5
4,5,,0,2020-04-06,2020,4,6
...,...,...,...,...,...,...,...
13454,13455,,312,2020-05-10,2020,5,10
13455,13456,,312,2020-05-11,2020,5,11
13456,13457,,312,2020-05-12,2020,5,12
13457,13458,,312,2020-05-13,2020,5,13


In [26]:
df_test_final = df_test[["Country_Region","Year","Month","Day"]]
df_test["predict_confirm"] = xgb.predict(df_test_final)
df_test

Unnamed: 0,ForecastId,Province_State,Country_Region,Date,Year,Month,Day,predict_confirm
0,1,,0,2020-04-02,2020,4,2,272.999054
1,2,,0,2020-04-03,2020,4,3,281.000397
2,3,,0,2020-04-04,2020,4,4,299.000580
3,4,,0,2020-04-05,2020,4,5,349.000519
4,5,,0,2020-04-06,2020,4,6,367.000305
...,...,...,...,...,...,...,...,...
13454,13455,,312,2020-05-10,2020,5,10,36.000092
13455,13456,,312,2020-05-11,2020,5,11,35.999012
13456,13457,,312,2020-05-12,2020,5,12,36.000759
13457,13458,,312,2020-05-13,2020,5,13,36.999180


In [27]:
df_train[df_train.Date >= '2020-04-02']

Unnamed: 0,Id,Province_State,Country_Region,Date,ConfirmedCases,Fatalities,Year,Month,Day
71,72,,0,2020-04-02,273.0,6.0,2020,4,2
72,73,,0,2020-04-03,281.0,6.0,2020,4,3
73,74,,0,2020-04-04,299.0,7.0,2020,4,4
74,75,,0,2020-04-05,349.0,7.0,2020,4,5
75,76,,0,2020-04-06,367.0,11.0,2020,4,6
...,...,...,...,...,...,...,...,...,...
35990,35991,,312,2020-05-11,36.0,4.0,2020,5,11
35991,35992,,312,2020-05-12,36.0,4.0,2020,5,12
35992,35993,,312,2020-05-13,37.0,4.0,2020,5,13
35993,35994,,312,2020-05-14,37.0,4.0,2020,5,14
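
The test dates (2020-04-02 to 2020-05-13) also appear in the training data, so the predictions above reproduce the training targets almost exactly and say little about how well the model forecasts unseen days. One way to get a more honest estimate is a time-based holdout. The sketch below uses made-up numbers, a hypothetical cutoff date, a naive "last known value" baseline in place of the model, and RMSLE as the metric; the choice of metric and the column name `Region` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy cumulative counts for two regions; the numbers are made up.
toy = pd.DataFrame({
    "Region": ["A"] * 4 + ["B"] * 4,
    "Date": ["2020-03-30", "2020-03-31", "2020-04-02", "2020-04-03"] * 2,
    "ConfirmedCases": [10.0, 20.0, 30.0, 40.0, 1.0, 2.0, 3.0, 4.0],
})

cutoff = "2020-04-02"  # hypothetical cutoff date
# ISO-formatted date strings sort chronologically, so string comparison works.
train_part = toy[toy["Date"] < cutoff]
valid_part = toy[toy["Date"] >= cutoff].copy()

# Naive baseline: carry forward each region's last training value.
last_seen = train_part.groupby("Region")["ConfirmedCases"].last()
valid_part["pred"] = valid_part["Region"].map(last_seen)

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error."""
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

score = rmsle(valid_part["ConfirmedCases"], valid_part["pred"])
```

A model fitted only on `train_part` can be scored on `valid_part` in the same way and compared against this baseline.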


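The concept of EDA and its basic steps.

Following the basic steps of EDA, we worked through a hands-on case study of the spread of COVID-19, which mainly involved:

filling in missing data with fillna;

aggregating with several rounds of groupby to obtain the data we wanted;

plotting bar charts with plotly to analyze the trends;

applying helper functions to a column to split the date into its components;

converting the countries to numeric values with LabelEncoder;

fitting data with nonlinear relationships using xgboost.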