# Exploratory Data Analysis

1. Generate questions about your data.

2. Search for answers by visualising, transforming, and modelling your data.

3. Use what you learn to refine your questions and/or generate new questions.


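EDA is a question-driven way of understanding and exploring data: you pose questions and then use whatever tools and methods help you answer them. There is no rule about which questions to ask, but two types are almost always useful:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?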
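
As a rough illustration of the two question types, the sketch below uses a small made-up table (the values and the name `toy` are hypothetical, not taken from any dataset in this chapter). Summarising a single column speaks to variation; comparing one column across the levels of another, or correlating two numeric columns, speaks to covariation.

```python
import pandas as pd

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame(
    {
        "cut": ["Fair", "Fair", "Ideal", "Ideal", "Ideal"],
        "carat": [1.0, 1.2, 0.3, 0.4, 0.5],
        "price": [4000, 5000, 600, 800, 1000],
    }
)

# Variation: how do values differ within a single variable?
cut_counts = toy["cut"].value_counts()
carat_spread = toy["carat"].std()

# Covariation: how do two variables move together?
median_price_by_cut = toy.groupby("cut")["price"].median()
carat_price_corr = toy["carat"].corr(toy["price"])
```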

In [3]:
from skimpy import skim
from pandas_profiling import ProfileReport
import pandas as pd
from pandas.api.types import CategoricalDtype
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

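import numpy as np  # required by the type annotation below


def hasna(x: np.ndarray) -> bool:
    # The body is missing in the source; assumed here to test for any missing value.
    return bool(pd.isna(x).any())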


# Variation

In [4]:
diamonds = pd.read_csv(
    "https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv"
)
diamonds["cut"] = diamonds["cut"].astype(
    CategoricalDtype(
        categories=["Fair", "Good", "Very Good", "Premium", "Ideal"], ordered=True
    )
)
diamonds["color"] = diamonds["color"].astype(
    CategoricalDtype(categories=["D", "E", "F", "G", "H", "I", "J"], ordered=True)
)
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## Typical values

In [5]:
(ggplot(diamonds, aes(x="carat"))
 + geom_histogram(binwidth=0.5)
 )

- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?
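
To answer "which values are most common" numerically, the counts behind a histogram can be tabulated directly. The sketch below is a minimal version on hypothetical carat values: it assigns each value to the left edge of its 0.5-wide bin. The bin edges that a plotting library chooses may be aligned differently (for example, centred on multiples of the bin width), so the numbers need not match a plotted histogram bar for bar.

```python
import numpy as np
import pandas as pd

# Hypothetical carat values, for illustration only.
carat = pd.Series([0.23, 0.31, 0.70, 0.71, 1.01, 1.02, 1.50, 2.01])
binwidth = 0.5

# Left edge of the bin [k * binwidth, (k + 1) * binwidth) that holds each value.
left_edge = np.floor(carat / binwidth) * binwidth

# Number of observations per bin, ordered by bin position.
counts = left_edge.value_counts().sort_index()
```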

In [6]:
smaller_diamonds = diamonds.query("carat < 3").copy()

(ggplot(smaller_diamonds, aes(x="carat"))
 + geom_histogram(binwidth=0.01)
 )

- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?


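Visualisations can also reveal clusters of similar values, which suggest that subgroups exist in the data. To understand the subgroups, ask: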
- How are the observations within each subgroup similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?

## Unusual values

In [7]:
# Outliers are hard to spot in a histogram: their bars are so short that they may be invisible to the naked eye
(ggplot(diamonds, aes(x="y"))
 + geom_histogram(binwidth=0.5)
 )

In [8]:
# Adding coord_cartesian(ylim=[0, 50]) limits the range of the y axis so that the short bars become visible
(ggplot(diamonds, aes(x="y"))
 + geom_histogram(binwidth=0.5)
 + coord_cartesian(ylim=[0, 50])
 )

In [9]:
unusual = diamonds.query("y < 3 or y > 20").loc[:, ["x", "y", "z", "price"]]
unusual

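# The rows below include diamonds with a width of 0, which cannot be seen in the histogram. When cleaning the data, we set values like these to NA.
# Other unusual values are simply implausible: for example, a y of 58.9 whose price is nonetheless in line with the other diamonds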

Unnamed: 0,x,y,z,price
11963,0.0,0.0,0.0,5139
15951,0.0,0.0,0.0,6381
24067,8.09,58.9,8.06,12210
24520,0.0,0.0,0.0,12800
26243,0.0,0.0,0.0,15686
27429,0.0,0.0,0.0,18034
49189,5.15,31.8,5.12,2075
49556,0.0,0.0,0.0,2130
49557,0.0,0.0,0.0,2130
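
The cell above uses hand-picked cut-offs (`y < 3 or y > 20`). When no obvious thresholds suggest themselves, a common heuristic is the 1.5 × IQR rule. The sketch below applies it to hypothetical values. It is a different technique from the one used in this notebook, and the multiplier 1.5 is a convention rather than a law.

```python
import pandas as pd

# Hypothetical measurements, for illustration only.
y = pd.Series([0.0, 4.7, 5.2, 5.7, 6.1, 6.5, 7.0, 58.9])

# Interquartile range: distance between the 25th and 75th percentiles.
q1, q3 = y.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged for inspection.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
flagged = y[(y < lower) | (y > upper)]
```

Flagged values are candidates for a closer look, not automatically errors.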


### Handling unusual values

In [10]:
# 1. Drop the rows that contain unusual values (not recommended)
condition = ((diamonds["y"] < 3) | (diamonds["y"] > 20))
diamonds2 = diamonds.loc[~condition, :]

In [11]:
# 2. Replace the unusual values with NA
diamonds2 = diamonds.copy()
condition = (diamonds2["y"] < 3) | (diamonds2["y"] > 20)
diamonds2.loc[condition, "y"] = pd.NA
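
The same replacement can be written without mutating a copy in place, using `Series.mask`, which sets entries to missing wherever a condition is true. The sketch below runs on a hypothetical four-row frame. It keeps every row and leaves the other columns untouched, which is the point of preferring replacement over dropping.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
df = pd.DataFrame(
    {"x": [3.95, 0.0, 8.09, 5.15], "y": [3.98, 0.0, 58.9, 31.8]}
)

# Same thresholds as in the cells above.
out_of_range = (df["y"] < 3) | (df["y"] > 20)

# mask() replaces values with NaN where the condition is True.
cleaned = df.assign(y=df["y"].mask(out_of_range))

# Missing values are skipped by default in aggregations such as mean().
mean_y = cleaned["y"].mean()
```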

In [12]:
(ggplot(diamonds2, aes(x="x", y="y"))
 + geom_point()
 )

In [13]:
# Compare observations that have missing values with those that do not
url = "https://raw.githubusercontent.com/byuidatascience/data4python4ds/master/data-raw/flights/flights.csv"
flights = pd.read_csv(url)
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [14]:
flights2 = flights.assign(
    cancelled=lambda x: pd.isna(x["dep_time"]),
    sched_hour=lambda x: x["sched_dep_time"] // 100,
    sched_min=lambda x: x["sched_dep_time"] % 100,
    sched_dep_time=lambda x: x["sched_hour"] + x["sched_min"] / 60,
)

(
        ggplot(flights2, aes(x="sched_dep_time"))
        + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
)

# This plot is not very informative, because non-cancelled flights vastly outnumber cancelled ones
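
One way around the imbalance is to compare shares instead of raw counts, so that each group sums to 1 regardless of its size. The sketch below does this on a hypothetical six-row stand-in for the flights table. Only the column names `dep_time` and `sched_dep_time` come from the data above; the rows themselves are made up.

```python
import pandas as pd

# Hypothetical stand-in for the flights data; a missing dep_time means cancelled.
flights_toy = pd.DataFrame(
    {
        "dep_time": [517.0, None, 542.0, 1830.0, None, 1905.0],
        "sched_dep_time": [515, 530, 540, 1815, 1845, 1900],
    }
)

summary = (
    flights_toy.assign(
        cancelled=lambda d: d["dep_time"].isna(),
        sched_hour=lambda d: d["sched_dep_time"] // 100,  # HHMM -> hour
    )
    .groupby("cancelled")["sched_hour"]
    .value_counts(normalize=True)  # share of flights per hour within each group
    .rename("share")
    .reset_index()
)
```

The resulting `share` column can be plotted per group on a common scale.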

# Covariation

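Covariation is the tendency for the values of two or more variables to vary together in a related way. Note, however, that covariation does not imply a causal relationship between the variables.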

## A categorical and a numerical variable

In [15]:
(ggplot(diamonds, aes(x="price"))
 + geom_freqpoly(aes(color="cut"), binwidth=500, size=0.75)
 )

# The overall picture is still hard to read, because the heights (counts) differ so much between cuts

In [16]:
(ggplot(diamonds, aes(x="price"))
 + geom_density(aes(color="cut", fill="cut"), size=1, alpha=0.2)
 )

# Surprisingly, Fair diamonds appear to have the highest average price

In [17]:
(ggplot(diamonds, aes(x="cut", y="price"))
 + geom_boxplot()
 )
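
The box in a boxplot spans the 25th to 75th percentiles, with a line at the median. Those quartiles can be computed per group and read as a table, which helps when boxes overlap on the plot. A minimal sketch on hypothetical prices:

```python
import pandas as pd

# Hypothetical prices, for illustration only.
toy = pd.DataFrame(
    {
        "cut": ["Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"],
        "price": [1000, 2000, 3000, 500, 700, 900],
    }
)

# One row per cut, one column per quantile (0.25, 0.5, 0.75).
quartiles = toy.groupby("cut")["price"].quantile([0.25, 0.5, 0.75]).unstack()
```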

In [23]:
# The boxes can also be ordered by their median
(ggplot(diamonds, aes(x="cut", y="price"))
 + geom_boxplot(aes(as_discrete("cut", order_by="..middle.."), "price"))
 )
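
The ordering can also be prepared on the data side. The sketch below computes each group's median with pandas and stores the resulting order in an ordered categorical. It uses hypothetical prices, and it is a pandas-side reordering, not a description of what `as_discrete(..., order_by=...)` does internally.

```python
import pandas as pd

# Hypothetical prices, for illustration only.
toy = pd.DataFrame(
    {
        "cut": ["Fair", "Fair", "Good", "Good", "Ideal", "Ideal"],
        "price": [4000, 5000, 3000, 3500, 600, 800],
    }
)

# Group labels sorted from lowest to highest median price.
median_order = toy.groupby("cut")["price"].median().sort_values().index.tolist()

# Store that order in the column itself as an ordered categorical.
toy["cut"] = pd.Categorical(toy["cut"], categories=median_order, ordered=True)
```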

## pandas

In [24]:
diamonds.describe()

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


In [25]:
sum_table = diamonds.describe().round(1)
sum_table

Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,0.8,61.7,57.5,3932.8,5.7,5.7,3.5
std,0.5,1.4,2.2,3989.4,1.1,1.1,0.7
min,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,0.4,61.0,56.0,950.0,4.7,4.7,2.9
50%,0.7,61.8,57.0,2401.0,5.7,5.7,3.5
75%,1.0,62.5,59.0,5324.2,6.5,6.5,4.0
max,5.0,79.0,95.0,18823.0,10.7,58.9,31.8


In [26]:
sum_table = sum_table.T
sum_table

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
carat,53940.0,0.8,0.5,0.2,0.4,0.7,1.0,5.0
depth,53940.0,61.7,1.4,43.0,61.0,61.8,62.5,79.0
table,53940.0,57.5,2.2,43.0,56.0,57.0,59.0,95.0
price,53940.0,3932.8,3989.4,326.0,950.0,2401.0,5324.2,18823.0
x,53940.0,5.7,1.1,0.0,4.7,5.7,6.5,10.7
y,53940.0,5.7,1.1,0.0,4.7,5.7,6.5,58.9
z,53940.0,3.5,0.7,0.0,2.9,3.5,4.0,31.8


In [30]:
(
    diamonds.groupby(["cut", "color"])["price"]
    .mean()
    .unstack()
    .apply(lambda x: x / 1e3)
    .fillna("-")
    .style.format(precision=2)
    .set_caption("Sale price (thousands)")
)

cut / color,D,E,F,G,H,I,J
Fair,4.29,3.68,3.83,4.24,5.14,4.69,4.98
Good,3.41,3.42,3.5,4.12,4.28,5.08,4.57
Very Good,3.47,3.21,3.78,3.87,4.54,5.26,5.1
Premium,3.63,3.54,4.32,4.5,5.22,5.95,6.29
Ideal,2.63,2.6,3.37,3.72,3.89,4.45,4.92
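
The `groupby` + `unstack` pattern above has a one-call equivalent in `DataFrame.pivot_table`. The sketch below shows both on hypothetical rows. Plain string columns are used here to sidestep categorical-specific grouping behaviour.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame(
    {
        "cut": ["Fair", "Fair", "Fair", "Ideal", "Ideal", "Ideal"],
        "color": ["D", "D", "E", "D", "E", "E"],
        "price": [4000, 5000, 3000, 2000, 2500, 3500],
    }
)

# Mean price per (cut, color), with colors spread across the columns.
via_groupby = toy.groupby(["cut", "color"])["price"].mean().unstack() / 1e3

# The same table built with pivot_table.
via_pivot = (
    toy.pivot_table(index="cut", columns="color", values="price", aggfunc="mean")
    / 1e3
)
```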


In [31]:
pd.crosstab(diamonds["color"], diamonds["cut"]).style.background_gradient(cmap="plasma")

color / cut,Fair,Good,Very Good,Premium,Ideal
D,163,662,1513,1603,2834
E,224,933,2400,2337,3903
F,312,909,2164,2331,3826
G,314,871,2299,2924,4884
H,303,702,1824,2360,3115
I,175,522,1204,1428,2093
J,119,307,678,808,896
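
Raw counts are dominated by how common each color is. Passing `normalize="index"` to `pd.crosstab` turns each row into proportions, which makes it easier to compare the mix of cuts across colors. A minimal sketch on hypothetical labels:

```python
import pandas as pd

# Hypothetical labels, for illustration only.
toy = pd.DataFrame(
    {
        "color": ["D", "D", "D", "E", "E"],
        "cut": ["Fair", "Ideal", "Ideal", "Fair", "Fair"],
    }
)

# Each row sums to 1: the share of each cut within a color.
shares = pd.crosstab(toy["color"], toy["cut"], normalize="index")
```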


In [32]:
taxis = pd.read_csv("https://github.com/mwaskom/seaborn-data/raw/master/taxis.csv")
# turn the pickup time column into a datetime
taxis["pickup"] = pd.to_datetime(taxis["pickup"])
# set some other columns types
taxis = taxis.astype(
    {
        "dropoff": "datetime64[ns]",
        "pickup": "datetime64[ns]",
        "color": "category",
        "payment": "category",
        "pickup_zone": "string",
        "dropoff_zone": "string",
        "pickup_borough": "category",
        "dropoff_borough": "category",
    }
)
taxis.head()

Unnamed: 0,pickup,dropoff,passengers,distance,fare,tip,tolls,total,color,payment,pickup_zone,dropoff_zone,pickup_borough,dropoff_borough
0,2019-03-23 20:21:09,2019-03-23 20:27:24,1,1.6,7.0,2.15,0.0,12.95,yellow,credit card,Lenox Hill West,UN/Turtle Bay South,Manhattan,Manhattan
1,2019-03-04 16:11:55,2019-03-04 16:19:00,1,0.79,5.0,0.0,0.0,9.3,yellow,cash,Upper West Side South,Upper West Side South,Manhattan,Manhattan
2,2019-03-27 17:53:01,2019-03-27 18:00:25,1,1.37,7.5,2.36,0.0,14.16,yellow,credit card,Alphabet City,West Village,Manhattan,Manhattan
3,2019-03-10 01:23:59,2019-03-10 01:49:51,1,7.7,27.0,6.15,0.0,36.95,yellow,credit card,Hudson Sq,Yorkville West,Manhattan,Manhattan
4,2019-03-30 13:27:42,2019-03-30 13:37:14,3,2.16,9.0,1.1,0.0,13.4,yellow,credit card,Midtown East,Yorkville West,Manhattan,Manhattan


In [33]:
taxis.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6433 entries, 0 to 6432
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   pickup           6433 non-null   datetime64[ns]
 1   dropoff          6433 non-null   datetime64[ns]
 2   passengers       6433 non-null   int64         
 3   distance         6433 non-null   float64       
 4   fare             6433 non-null   float64       
 5   tip              6433 non-null   float64       
 6   tolls            6433 non-null   float64       
 7   total            6433 non-null   float64       
 8   color            6433 non-null   category      
 9   payment          6389 non-null   category      
 10  pickup_zone      6407 non-null   string        
 11  dropoff_zone     6388 non-null   string        
 12  pickup_borough   6407 non-null   category      
 13  dropoff_borough  6388 non-null   category      
dtypes: category(4), datetime64[ns](2), float
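
`info()` reports non-null counts. The complementary view, missing values per column, is often quicker to scan. The sketch below computes it on a hypothetical three-row frame that borrows two column names from the taxis data.

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame(
    {
        "payment": ["cash", None, "credit card"],
        "fare": [7.0, 5.0, 7.5],
    }
).astype({"payment": "category"})

# Number and share of missing values in each column.
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()
```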

In [34]:
skim(taxis)

In [35]:
profile = ProfileReport(taxis, minimal=True, title="Profiling Report: Taxis Dataset")
profile.to_notebook_iframe()

TypeCheckError: argument "config_file" (None) did not match any element in the union:
  pathlib.Path: is not an instance of pathlib.Path
  str: is not an instance of str