# Exploratory Data Analysis

1. Generate questions about your data.

2. Search for answers by visualising, transforming, and modelling your data.

3. Use what you learn to refine your questions and/or generate new questions.


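EDA is a question-driven way to understand and explore your data; in the course of answering those questions you draw on a wide range of tools and methods. There is no rule about which questions to ask, but two types are almost always useful:
- What type of variation occurs within my variables?
- What type of covariation occurs between my variables?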
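
Both question types can be probed numerically before plotting. The following is a minimal sketch on a small made-up DataFrame (the `toy` data and its values are assumptions for illustration, not the diamonds data): a frequency table and summary statistics describe variation within one variable, and a grouped summary describes covariation between two.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only.
toy = pd.DataFrame(
    {
        "cut": ["Fair", "Good", "Good", "Ideal", "Ideal", "Ideal"],
        "price": [300, 400, 500, 600, 700, 800],
    }
)

# Variation within a categorical variable: how often does each value occur?
cut_counts = toy["cut"].value_counts()

# Variation within a numeric variable: centre, spread, and range.
price_summary = toy["price"].describe()

# Covariation between a categorical and a numeric variable:
# does the typical price differ from one cut to another?
median_price_by_cut = toy.groupby("cut")["price"].median()

print(cut_counts)
print(price_summary)
print(median_price_by_cut)
```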

In [3]:
from skimpy import skim
from pandas_profiling import ProfileReport
import pandas as pd
from pandas.api.types import CategoricalDtype
from lets_plot import *
from lets_plot.mapping import as_discrete

LetsPlot.setup_html()

import numpy as np


def hasna(x: np.ndarray) -> bool:
    # Body inferred from the function name: True if x contains any missing value.
    return bool(pd.isna(x).any())


# Variation

In [4]:
diamonds = pd.read_csv(
    "https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv"
)
diamonds["cut"] = diamonds["cut"].astype(
    CategoricalDtype(
        categories=["Fair", "Good", "Very Good", "Premium", "Ideal"], ordered=True
    )
)
diamonds["color"] = diamonds["color"].astype(
    CategoricalDtype(categories=["D", "E", "F", "G", "H", "I", "J"], ordered=True)
)
diamonds.head()

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75


## Typical values

In [5]:
(ggplot(diamonds, aes(x="carat"))
 + geom_histogram(binwidth=0.5)
 )

- Which values are the most common? Why?
- Which values are rare? Why? Does that match your expectations?
- Can you see any unusual patterns? What might explain them?
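
The first of these questions can also be checked numerically. A minimal sketch, assuming a made-up `carat` series rather than the real data: assign each value to a bin of the chosen width and count the observations per bin. The bin edges here start at 0, which may not match the edges the plotting library picks.

```python
import numpy as np
import pandas as pd

# Hypothetical carat values, made up for illustration only.
carat = pd.Series([0.23, 0.31, 0.52, 0.55, 0.70, 0.71, 1.01, 1.02, 1.50, 2.10])

binwidth = 0.5
# Left edge of the bin each value falls in: [0, 0.5), [0.5, 1.0), ...
bin_left = np.floor(carat / binwidth) * binwidth

# Count the observations per bin, ordered from the smallest bin to the largest.
bin_counts = bin_left.value_counts().sort_index()
print(bin_counts)

# The most common bin corresponds to the tallest bar of the histogram.
most_common_bin = bin_counts.idxmax()
print(most_common_bin)
```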

In [6]:
smaller_diamonds = diamonds.query("carat < 3").copy()

(ggplot(smaller_diamonds, aes(x="carat"))
 + geom_histogram(binwidth=0.01)
 )

- Why are there more diamonds at whole carats and common fractions of carats?
- Why are there more diamonds slightly to the right of each peak than there are slightly to the left of each peak?


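Visualisations can also reveal clusters of similar values, which suggest that subgroups exist in your data. To understand the subgroups, ask: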
- How are the observations within each subgroup similar to each other?
- How are the observations in separate clusters different from each other?
- How can you explain or describe the clusters?
- Why might the appearance of clusters be misleading?
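
One rough way to describe clusters in a single variable is to split the sorted values wherever the gap between neighbours is large. The sketch below is a heuristic under stated assumptions, not a method taken from this chapter: the `values` series and the `gap_threshold` of 0.5 are made up for illustration.

```python
import pandas as pd

# Hypothetical measurements with two visible groups.
values = pd.Series([1.0, 1.1, 1.2, 1.3, 4.0, 4.1, 4.3])

gap_threshold = 0.5  # assumed cut-off; in practice, choose it from the histogram

sorted_values = values.sort_values().reset_index(drop=True)
# A new cluster starts wherever the jump from the previous value exceeds the threshold.
starts_new_cluster = sorted_values.diff() > gap_threshold
cluster_id = starts_new_cluster.cumsum()

# Summarise each cluster: size, range, and centre.
cluster_summary = sorted_values.groupby(cluster_id).agg(["count", "min", "max", "mean"])
print(cluster_summary)
```

A different threshold can merge or split the groups, which is one reason the appearance of clusters can mislead.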

## Unusual values

In [7]:
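# Outliers are hard to see in a histogram: their bars are so short compared with the common bins that they can be invisible to the naked eye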
(ggplot(diamonds, aes(x="y"))
 + geom_histogram(binwidth=0.5)
 )

In [8]:
# Adding coord_cartesian(ylim=[0, 50]) limits the range of the y-axis, which zooms in on the bins with small counts
(ggplot(diamonds, aes(x="y"))
 + geom_histogram(binwidth=0.5)
 + coord_cartesian(ylim=[0, 50])
 )
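
Zooming makes the rare bins visible; they can also be listed numerically. A minimal sketch with NumPy, assuming a made-up array `y` of widths and an arbitrary cut-off of 50 observations per bin (mirroring the `ylim` used above):

```python
import numpy as np

# Hypothetical widths: many ordinary values plus two suspicious ones.
y = np.concatenate([np.full(100, 4.2), np.full(80, 5.1), [0.0, 31.8]])

binwidth = 0.5
# Bin edges from 0 up to just past the largest value.
edges = np.arange(0.0, y.max() + binwidth, binwidth)
counts, edges = np.histogram(y, bins=edges)

# Bins that contain at least one observation but fewer than 50.
rare = (counts > 0) & (counts < 50)
rare_left_edges = edges[:-1][rare]
print(rare_left_edges)
```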

In [9]:
unusual = diamonds.query("y < 3 or y > 20").loc[:, ["x", "y", "z", "price"]]
unusual

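# The output below includes diamonds with a width (y) of 0, which are invisible in the histogram. When cleaning the data, values like these should also be set to NA
# Other unusual values are simply implausible: for example, a width of 58.9 whose price is much the same as that of ordinary diamonds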

Unnamed: 0,x,y,z,price
11963,0.0,0.0,0.0,5139
15951,0.0,0.0,0.0,6381
24067,8.09,58.9,8.06,12210
24520,0.0,0.0,0.0,12800
26243,0.0,0.0,0.0,15686
27429,0.0,0.0,0.0,18034
49189,5.15,31.8,5.12,2075
49556,0.0,0.0,0.0,2130
49557,0.0,0.0,0.0,2130


### Handling unusual values

In [10]:
# 1. Drop the rows that contain unusual values (not recommended)
condition = ((diamonds["y"] < 3) | (diamonds["y"] > 20))
diamonds2 = diamonds.loc[~condition, :]

In [11]:
# 2. Replace the unusual values with NA
diamonds2 = diamonds.copy()
condition = (diamonds2["y"] < 3) | (diamonds2["y"] > 20)
diamonds2.loc[condition, "y"] = pd.NA
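
The same replacement can be written without an explicit boolean assignment. A minimal sketch on made-up data (the `toy` frame is an assumption for illustration), using `Series.where`, which keeps the values where the condition holds and substitutes NaN elsewhere:

```python
import pandas as pd

# Hypothetical dimensions: two ordinary rows, one zero width, one implausibly large width.
toy = pd.DataFrame({"x": [3.9, 0.0, 8.1, 4.3], "y": [4.0, 0.0, 58.9, 4.4]})

# Keep y only where 3 <= y <= 20; every other y becomes NaN.
# assign() returns a new frame, so the original data stays untouched.
toy2 = toy.assign(y=lambda d: d["y"].where(d["y"].between(3, 20)))

n_missing = toy2["y"].isna().sum()
print(toy2)
print(n_missing)
```

Only `y` is replaced, so the other measurements in the affected rows remain available.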

In [12]:
(ggplot(diamonds2, aes(x="x", y="y"))
 + geom_point()
 )

In [13]:
# Compare the observations that have missing values with those that do not
url = "https://raw.githubusercontent.com/byuidatascience/data4python4ds/master/data-raw/flights/flights.csv"
flights = pd.read_csv(url)
flights.head()

Unnamed: 0,year,month,day,dep_time,sched_dep_time,dep_delay,arr_time,sched_arr_time,arr_delay,carrier,flight,tailnum,origin,dest,air_time,distance,hour,minute,time_hour
0,2013,1,1,517.0,515,2.0,830.0,819,11.0,UA,1545,N14228,EWR,IAH,227.0,1400,5,15,2013-01-01T10:00:00Z
1,2013,1,1,533.0,529,4.0,850.0,830,20.0,UA,1714,N24211,LGA,IAH,227.0,1416,5,29,2013-01-01T10:00:00Z
2,2013,1,1,542.0,540,2.0,923.0,850,33.0,AA,1141,N619AA,JFK,MIA,160.0,1089,5,40,2013-01-01T10:00:00Z
3,2013,1,1,544.0,545,-1.0,1004.0,1022,-18.0,B6,725,N804JB,JFK,BQN,183.0,1576,5,45,2013-01-01T10:00:00Z
4,2013,1,1,554.0,600,-6.0,812.0,837,-25.0,DL,461,N668DN,LGA,ATL,116.0,762,6,0,2013-01-01T11:00:00Z


In [14]:
flights2 = flights.assign(
    cancelled=lambda x: pd.isna(x["dep_time"]),
    sched_hour=lambda x: x["sched_dep_time"] // 100,
    sched_min=lambda x: x["sched_dep_time"] % 100,
    sched_dep_time=lambda x: x["sched_hour"] + x["sched_min"] / 60,
)

(ggplot(flights2, aes(x="sched_dep_time"))
 + geom_freqpoly(aes(color="cancelled"), binwidth=1 / 4)
 )

# This plot is not very informative, because there are far more non-cancelled flights than cancelled ones
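
One way around the size imbalance is to compare proportions within each group rather than raw counts. A minimal sketch in plain pandas, assuming a made-up `toy_flights` frame (the real `flights2` data is not used here):

```python
import pandas as pd

# Hypothetical flights: scheduled hour and whether the flight was cancelled.
toy_flights = pd.DataFrame(
    {
        "sched_hour": [6, 6, 6, 6, 12, 12, 12, 18, 18, 18, 6, 18],
        "cancelled": [False] * 10 + [True] * 2,
    }
)

# Raw counts: the cancelled group is dwarfed by the non-cancelled group.
counts = toy_flights.groupby(["cancelled", "sched_hour"]).size()

# Share of each group's flights that fall in each hour; every group sums to 1.
share = counts / counts.groupby(level="cancelled").transform("sum")
print(share.unstack("cancelled"))
```

On this scale the two groups are directly comparable, whatever their sizes.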