## What Is Data Visualization  
The techniques used to communicate data or information by encoding it as visual objects.  
**Turning data into images that are easy to understand**  

## Why Data Visualization  
You will never be an expert on the data you are working with, and you will always need to explore the variables in depth before you can move on to building a model or doing something else with the data.  
**Visualization helps us understand the data better, so that we can design better models and solve problems more precisely**  

## Content Of Kaggle Data Visualization Tutorial  
1. Plot in pandas  
2. Plot using seaborn  
3. Plot using plotly  
4. Plot using matplotlib  

## Univariate Plotting with Pandas  

### Types of charts  
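1. Bar chart  
`df.plot.bar` - good for nominal and small ordinal categorical data  
2. Line chart  
`df.plot.line` - good for ordinal categorical and interval data  
3. Area chart  
`df.plot.area` - good for ordinal categorical and interval data  
4. Histogram  
`df.plot.hist` - good for interval data  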

### Bar chart and categorical data  
**nominal categories: "pure" categories that don't make a lot of sense to order (like countries, ZIP codes, types of cheese, and lunar landers)**    
**ordinal categories: things that do make sense to compare, like earthquake magnitudes, housing complexes with certain numbers of apartments, and the sizes of bags of chips at your local deli**  
```python  
reviews['province'].value_counts().head(10).plot.bar()
(reviews['province'].value_counts().head(10) / len(reviews)).plot.bar()
reviews['points'].value_counts().sort_index().plot.bar()
```  
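
The difference between nominal and ordinal categories shows up in how the counts are ordered before plotting. The sketch below uses a small made-up DataFrame (`toy`, a hypothetical stand-in for `reviews`) to illustrate one reasonable convention: rank nominal categories by frequency, and keep ordinal categories in their natural order.

```python
import pandas as pd

# Hypothetical toy data standing in for the `reviews` DataFrame
toy = pd.DataFrame({
    "province": ["California", "Oregon", "California", "Tuscany", "California", "Oregon"],
    "points": [88, 90, 88, 85, 90, 87],
})

# Nominal: there is no natural order, so rank the categories by frequency
province_counts = toy["province"].value_counts()

# Ordinal: keep the natural order of the values by sorting on the index
points_counts = toy["points"].value_counts().sort_index()

print(province_counts.index.tolist())
print(points_counts.index.tolist())
```

Either result can be passed straight to `.plot.bar()`, as in the block above.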

### Line chart  
**the tool of first choice for distributions with many unique values or categories**  
**weakness: unlike bar charts, they're not appropriate for nominal categorical data**  
```python
reviews['points'].value_counts().sort_index().plot.line()
```

### Area Chart  
**just line charts, but with the bottom shaded in**  
```python  
reviews['points'].value_counts().sort_index().plot.area()
```  
#### What is interval data  
Examples of interval variables are the wind speed in a hurricane, shear strength in concrete, and the temperature of the sun. An interval variable goes beyond an ordinal categorical variable: it has a meaningful order, and the difference between any two entries can be quantified and is itself an interval variable.  


### Histogram
A histogram looks, trivially, like a bar plot. And it basically is! In fact, a histogram is a special kind of bar plot that splits your data into even intervals and displays how many rows are in each interval with bars. The only analytical difference is that instead of each bar representing a single value, it represents a range of values.  
**a special kind of bar chart in which each bar shows the data within a range**  
**weakness: they don't deal very well with skewed data**  

```python  
reviews[reviews['price'] < 200]['price'].plot.hist()
```  
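
To make the "even intervals" idea concrete, here is a minimal sketch of the binning step a histogram performs, done by hand with NumPy. The `prices` values are invented for illustration and do not come from the wine dataset.

```python
import numpy as np

# Hypothetical prices; not taken from the wine dataset
prices = np.array([4, 8, 15, 16, 23, 42, 55, 61, 78, 99])

# Split the range [0, 100] into 5 even intervals and count the rows in each.
# Every bin is half-open [left, right) except the last, which includes its right edge.
counts, edges = np.histogram(prices, bins=5, range=(0, 100))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {n} rows")
```

Each count is the height of one histogram bar, which is why a single extreme value can stretch the range and squeeze most of the data into the first bin.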

## Bivariate Plotting with Pandas  
**Many pandas multivariate plots expect input data to be in this format, with one categorical variable in the columns, one categorical variable in the rows, and counts of their intersections in the entries.**  
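
These notes do not show how a table in that format is built, so the following is only a sketch under assumptions: it uses `pd.crosstab` on a made-up DataFrame (`toy`) to produce a small table shaped like the `wine_counts` used later, with one categorical variable in the rows, one in the columns, and counts in the cells.

```python
import pandas as pd

# Hypothetical record-per-row data: one review per row
toy = pd.DataFrame({
    "points": [85, 85, 86, 86, 86, 87],
    "variety": ["Pinot Noir", "Chardonnay", "Pinot Noir",
                "Pinot Noir", "Chardonnay", "Chardonnay"],
})

# Rows: points, columns: variety, cells: number of reviews at that intersection
wine_counts_toy = pd.crosstab(toy["points"], toy["variety"])
print(wine_counts_toy)
```

A table like this can be handed directly to `plot.bar(stacked=True)`, `plot.area()`, or `plot.line()`.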

### Types of charts  
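1. Scatter plot  
`df.plot.scatter` - good for continuous data and some nominal categorical data  
2. Hex plot  
`df.plot.hexbin` - good for continuous data and some nominal categorical data    
3. Stacked Bar chart  
`df.plot.bar(stacked=True)` - good for continuous and ordinal categorical data  
4. Bivariate Line chart  
`df.plot.line()` - good for continuous and ordinal categorical data  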

### Scatter plot  
simply maps each variable of interest to a point in two-dimensional space  
**To cope with overplotting when the data is too dense, we usually need to downsample**  

```python  
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')  
```  

### Hex plot  
aggregates points in space into hexagons, and then colors those hexagons  
```python
reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)
```  

### Stacked plot  
plots the variables one on top of the other    
**weaknesses:**  
1. the second variable in a stacked plot must be a variable with a very limited number of possible values (probably an ordinal categorical, as here)  
2. interpretability: values in the upper layers are hard to read off
```python  
wine_counts.plot.bar(stacked=True)  
wine_counts.plot.area()  
```  

### Bivariate line chart  
```python
wine_counts.plot.line()  
```

## Styling plot  
pandas data visualization tools are built on top of another, lower-level graphics library called matplotlib. Anything that you build in pandas can be built using matplotlib directly; pandas merely makes it easier to get that work done.
### Resize Figure    
`figsize` controls the size of the image, in inches. It expects a tuple of `(width, height)` values.
```python
reviews['points'].value_counts().sort_index().plot.bar(figsize=(12, 6))  
```  

### Colorize Figure   
```python  
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred'
)
```  
### Font Size  
```python  
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
```  
### Title  
```python  
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16,
    title='Rankings Given by Wine Magazine',
)
# to also control the title font size, call set_title on the matplotlib Axes that pandas returns  
import matplotlib.pyplot as plt

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
```  
### Subplotting  
**What?** - a technique for creating multiple plots that live side-by-side in one overall figure  
**How?**:  
```python  
import matplotlib.pyplot as plt
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))  
reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0]
)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1]
)

```  
**Why?**:  
1. In some cases, combining several plots makes logical sense  
2. It gives more attractive and informative panel displays  
3. Faceting (breaking data variables up across multiple subplots, and combining those subplots into a single figure)  


## Plot With Seaborn  

### What?  
`seaborn` is a standalone data visualization package that provides many extremely valuable data visualizations in a single package  

### Why?  
"record-oriented"/"tidy data" format - Each individual row is a single record (a review)  
`seaborn` is designed to work with this kind of data out-of-the-box, for all of its plot types, with minimal fuss. This makes it an incredibly convenient workbench tool.  

`pandas` is not designed this way. In `pandas`, every plot we generate is tied very directly to the input data. In essence, `pandas` expects your data to be in exactly the right *output* shape, regardless of what the input is.  

### Plots  
1. Count (Bar) Plot - pandas bar  
    `seaborn.countplot()` - good for nominal and small ordinal categorical data
2. KDE Plot - pandas line  
    `seaborn.kdeplot()` - good for continuous data  
3. Dist Plot - pandas histogram  
    `seaborn.distplot()` - good for interval data  
4. Joint Plot - pandas scatter/hex  
    `seaborn.jointplot()` - good for continuous data and some nominal data  
5. Violin Plot
    `seaborn.violinplot()` - good for continuous data and some nominal data  
    
### Count Plot  
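`countplot` aggregates the data for us, which means we do not need to call `value_counts` before plotting.  
```python  
import seaborn as sns

# pass the column as x=; newer seaborn releases treat the first positional argument as `data`
sns.countplot(x=reviews['points'])  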
```  

### KDE Plot  
Kernel Density Estimate - a statistical technique for smoothing out data noise  
It addresses an important fundamental weakness of a line chart: it will buff out outlier or "in-betweener" values which would cause a line chart to suddenly dip.
**A KDE plot is better than a line chart for getting the "true shape" of interval data**  
**a worse choice for ordinal categorical data**  
```python  
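# 1d
sns.kdeplot(reviews.query('price < 200').price)
# 2d: name both columns via x= and y= (seaborn >= 0.11) to get a bivariate density
sns.kdeplot(
    data=reviews[reviews['price'] < 200].loc[:, ['price', 'points']].dropna().sample(5000),
    x='price', y='points'
)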
```  
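
To show what the smoothing actually does, here is a minimal hand-rolled sketch of a one-dimensional Gaussian KDE in NumPy. It is not seaborn's implementation: the helper name `gaussian_kde_1d`, the sample values, and the hand-picked `bandwidth` are all assumptions made for illustration (seaborn chooses a bandwidth automatically).

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    # z[i, j] = distance from grid point i to sample j, in units of bandwidth
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical scores; not taken from the wine dataset
samples = np.array([85, 86, 86, 87, 88, 88, 88, 90, 92, 95])
grid = np.linspace(70, 110, 401)
density = gaussian_kde_1d(samples, grid, bandwidth=1.5)

# A density should integrate to roughly 1 over a wide enough grid
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A larger bandwidth gives a smoother curve and a smaller one follows individual points more closely, which is why a KDE can hide the gaps that make a line chart dip.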

### Dist Plot  
```python
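# distplot is deprecated as of seaborn 0.11; sns.histplot(reviews['points'], bins=10) is the newer equivalent
sns.distplot(reviews['points'], bins=10, kde=False)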
```

### Joint Plot  
Draws the bivariate plot with a histogram of each variable on the sides; older seaborn versions also annotate the correlation coefficient.  
```python  
# scatter
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100])  
# hex  
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100], kind='hex', 
              gridsize=20)
```  

### Violin Plot  
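**What is a boxplot?** A boxplot summarizes a distribution with a box spanning the first to third quartile, a line at the median, and whiskers for the remaining range, with outliers drawn as individual points.  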

<img src="https://gss0.bdstatic.com/94o3dSag_xI4khGkpoWK1HF6hhy/baike/c0%3Dbaike80%2C5%2C5%2C80%2C26/sign=4e5ee1bdacaf2eddc0fc41bbec796a8c/aa18972bd40735fade9ad1029e510fb30f240826.jpg" />  

A `violinplot` cleverly replaces the box in the boxplot with a kernel density estimate for the data. It shows basically the same data, but is harder to misinterpret and much prettier than the utilitarian boxplot.  

```python  
# box plot  
df = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]

sns.boxplot(
    x='variety',
    y='points',
    data=df
)

# violin plot  
sns.violinplot(
    x='variety',
    y='points',
    data=reviews[reviews.variety.isin(reviews.variety.value_counts()[:5].index)]
)
```  
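
The numbers a boxplot draws can be computed directly. The sketch below does so with NumPy on invented values, assuming the common convention (also the matplotlib and seaborn default) that whiskers reach the most extreme points within 1.5 times the interquartile range of the box.

```python
import numpy as np

# Hypothetical values with one obvious outlier
points = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

q1, median, q3 = np.percentile(points, [25, 50, 75])
iqr = q3 - q1

# Whiskers extend to the most extreme points within 1.5 * IQR of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = points[(points >= lower_fence) & (points <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything beyond the fences is drawn as an individual outlier point
outliers = points[(points < lower_fence) | (points > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, outliers.tolist())
```

A violin plot keeps the same quartile information but replaces the box outline with a KDE of the values.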

### Facet  
Faceting is the act of breaking data variables up across multiple subplots, and combining those subplots into a single figure  

#### Why?
1. the easiest way to make your data visualization multivariate  
2. easy to use  

#### Limitation  
It can only be used to break data out across single or paired categorical variables with very low numeracy: any more than five or so dimensions in the grid, and the plots become too small (or involve a lot of scrolling). Additionally, it involves choosing an order to plot in (or letting Python choose one), but with nominal categorical variables that choice is distractingly arbitrary.  


#### Facet Grid  
`seaborn.FacetGrid` - good for data with several categorical variables  
A `FacetGrid` is an object which stores some information on how you want to break up your data visualization.
```python
# one column data
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
## use col_wrap to limit the number of plots in one row
g = sns.FacetGrid(df, col="Position", col_wrap=2)
g.map(sns.kdeplot, "Overall")  

# row & column  
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
df = df[df['Club'].isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]
## use row_order/col_order to specify the order
g = sns.FacetGrid(df, 
                  row="Position", 
                  row_order=['GK', 'ST'], 
                  col="Club",
                  col_order=['Atlético Madrid', 'FC Barcelona', 'Real Madrid CF'])
g.map(sns.boxplot, "Overall")
```  

### Pair Plot  
`pairplot` is a very useful and widely used `seaborn` method for faceting *variables* (as opposed to *variable values*). You pass it a `pandas` `DataFrame` in the right shape, and it returns you a gridded result of your variable values:  
```python  
sns.pairplot(footballers[['Overall', 'Potential', 'Value']])  
```  
By default `pairplot` will return scatter plots in the main entries and a histogram in the diagonal. `pairplot` is oftentimes the first thing that a data scientist will throw at their data, and it works fantastically well in that capacity, even if sometimes the scatter-and-histogram approach isn't quite appropriate, given the data types.  


## Multivariate Plotting  
### Plots  
1. Scatter Plot  
    `df.plot.scatter`
2. Grouped Box Plot  
    `df.plot.box`  
3. Heatmap  
    `sns.heatmap`  
4. Parallel Coordinates  
    `pd.plotting.parallel_coordinates`  
5. Scatter matrix  
    `pd.plotting.scatter_matrix`  

### Visual Variable  
any visual dimension or marker that we can use to perceptually distinguish two data elements from one another. Examples include size, color, shape, and one, two, and even three dimensional position  

### More Visual Variable: Scatter Plot  
`lmplot` - Plot data and regression model fits across a FacetGrid.  
```python  

sns.lmplot(x='Value', y='Overall', hue='Position', 
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])], 
           fit_reg=False)  
## use markers to set the shape of the points for each hue level  
sns.lmplot(x='Value', y='Overall', markers=['o', 'x', '*'], hue='Position',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False
          )  
```  

### More Visual Variable: Grouped Box Plot  
```python  
f = (footballers
         .loc[footballers['Position'].isin(['ST', 'GK'])]
         .loc[:, ['Value', 'Overall', 'Aggression', 'Position']]
    )
f = f[f["Overall"] >= 80]
f = f[f["Overall"] < 85]
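f['Aggression'] = f['Aggression'].astype(float)

# one box per Overall score, split side by side by Position
sns.boxplot(x="Overall", y="Aggression", hue='Position', data=f)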
```  

### Summarization: Heatmap  
Summarization is the creation and addition of new variables by mixing and matching the information provided in the old ones.
It allows us to "boil down" potentially very complicated relationships into simpler ones.  
```python  
import numpy as np

f = (
    footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
).corr()

sns.heatmap(f, annot=True)
```  
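
The table that `sns.heatmap` colors here is just a correlation matrix. As a minimal sketch on invented numbers (the `toy` values are hypothetical, only the column names echo the footballers data), this is what `DataFrame.corr()` returns:

```python
import pandas as pd

# Hypothetical attribute scores for five players
toy = pd.DataFrame({
    "Acceleration": [60, 70, 80, 90, 95],
    "Agility": [62, 69, 83, 88, 97],
    "Aggression": [80, 75, 60, 50, 45],
})

# Pairwise Pearson correlation: one row and one column per variable
corr = toy.corr()
print(corr.round(2))
```

Each cell summarizes a whole scatter plot in a single number between -1 and 1, which is the "boiling down" that summarization refers to.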

###  Summarization: Parallel Coordinates  
```python  
from pandas.plotting import parallel_coordinates

f = (
    footballers.iloc[:, 12:17]
        .loc[footballers['Position'].isin(['ST', 'GK'])]
        .applymap(lambda v: int(v) if str.isdecimal(v) else np.nan)
        .dropna()
)
f['Position'] = footballers['Position']
f = f.sample(200)

parallel_coordinates(f, 'Position')
```  

### Summarization: Scatter Matrix  
[Doc](https://pandas.pydata.org/pandas-docs/stable/visualization.html#scatter-matrix-plot)  
A scatter matrix plots every feature against every other feature, so the relationship between each pair is visible at a glance.
```python  
color_list = ['red' if i == 'Abnormal' else 'green' for i in data.loc[:, 'class']]
pd.plotting.scatter_matrix(
    data.loc[:, data.columns != 'class'],
    c=color_list,
    figsize=[15, 15],
    diagonal='hist',
    alpha=0.5,
    s=200,
    marker='*',
    edgecolor="black"
)
```  


## Plotly  
an open-source plotting library that provides more interactivity and animations  

### Mode: Offline & Online  
On **Kaggle** you can only use offline mode  

### Limitation  
Interactive graphics are much, much more resource-intensive than static ones.

### Scatter Plot  
```python  
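import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

# offline mode: render inside the notebook without a Plotly account
init_notebook_mode(connected=True)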

iplot([go.Scatter(x=reviews.head(1000)['points'], y=reviews.head(1000)['price'], mode='markers')])  
```  
### Histogram Plot  
```python  
iplot([go.Histogram2dContour(x=reviews.head(500)['points'], 
                             y=reviews.head(500)['price'], 
                             contours=dict(coloring='heatmap')),
       go.Scatter(x=reviews.head(1000)['points'], y=reviews.head(1000)['price'], mode='markers')])
```  
### Surface  
```python  
df = reviews.assign(n=0).groupby(['points', 'price'])['n'].count().reset_index()
df = df[df["price"] < 100]
v = df.pivot(index='price', columns='points', values='n').fillna(0).values.tolist()  
iplot([go.Surface(z=v)])    
```  

### Choropleth  
```python  
df = reviews['country'].replace("US", "United States").value_counts()

iplot([go.Choropleth(
    locationmode='country names',
    locations=df.index.values,
    text=df.index,
    z=df.values
)])
```  

## Time Series Plotting    
Time-series variables are populated by values which are specific to a point in time. Time is linear and infinitely fine-grained, so really time-series values are a kind of special case of interval variables.  
Time-series data tends to exhibit a behavior called **periodicity**: rises and peaks in the data that are correlated with time.

### [Pandas Period](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Period.html)    
Represents a period of time  

### Parse Dates While Reading  
```python  
shelter_outcomes = pd.read_csv(
    "../input/austin-animal-center-shelter-outcomes-and/aac_shelter_outcomes.csv", 
    parse_dates=['date_of_birth', 'datetime']
)  
```  

### Resample Data  
Resampling groups the data by a time period, such as a year.  
**By choosing certain periods you can more clearly visualize certain aspects of the dataset.**  
```python  
shelter_outcomes['date_of_birth'].value_counts().resample('Y').sum().plot.line()  
```  
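
As a small illustration of what resampling does to the counts, the sketch below uses a handful of invented dates. It resamples by month start (`"MS"`) rather than by year, simply because the toy data spans only three months; that choice is an assumption for the example.

```python
import pandas as pd

# Hypothetical birth dates; not taken from the shelter dataset
dates = pd.to_datetime([
    "2017-01-03", "2017-01-03", "2017-01-20",
    "2017-02-11",
    "2017-03-05", "2017-03-05", "2017-03-30",
])

# Count rows per exact date, then put the dates in time order
daily_counts = pd.Series(dates).value_counts().sort_index()

# Regroup the per-date counts into one total per calendar month
monthly_counts = daily_counts.resample("MS").sum()
print(monthly_counts)
```

The total number of rows is unchanged; only the width of the time buckets differs, which is what makes coarser trends easier to see.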

### Lag Plot  
Compares each observation in the dataset against a previous observation.  
```python  
from pandas.plotting import lag_plot

# keep consecutive rows: a random .sample() would shuffle the time order the lag relies on
lag_plot(stocks['volume'].tail(250))  
```  

### Auto Correlation Plot  
The autocorrelation plot is a multivariate summarization-type plot that lets you check *every* periodicity at the same time. It does this by computing a summary statistic, the correlation score, across every possible lag in the dataset. This is known as autocorrelation.  
In an autocorrelation plot the lag is on the x-axis and the autocorrelation score is on the y-axis. The farther away the autocorrelation is from 0, the greater the influence that records that far away from each other exert on one another.  
```python  
from pandas.plotting import autocorrelation_plot

autocorrelation_plot(stocks['volume'])
```  
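
To see why periodicity shows up as peaks in such a plot, here is a minimal sketch on a synthetic sine wave with a period of 20 steps (an invented signal, not the `stocks` data). It uses `Series.autocorr`, which computes a plain Pearson correlation between the series and a shifted copy; the normalization used inside `autocorrelation_plot` may differ slightly.

```python
import numpy as np
import pandas as pd

# Synthetic signal that repeats every 20 steps
t = np.arange(200)
signal = pd.Series(np.sin(2 * np.pi * t / 20))

# Shifting by one full period lines the peaks up again
same_phase = signal.autocorr(lag=20)

# Shifting by half a period lines peaks up with troughs
opposite_phase = signal.autocorr(lag=10)

print(round(same_phase, 3), round(opposite_phase, 3))
```

Real data is noisier, so its peaks are lower, but a lag whose score sits far from 0 points to the same kind of repeating pattern.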
