# pandas-profiling

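## What the report covers
- Essentials: type, unique values, missing values
- Quantile statistics: minimum, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics: mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness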
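
As a rough illustration of what these statistics mean, the sketch below computes several of them with plain pandas on a small made-up series. The sample values and the `summarize` helper are assumptions for illustration only; this is not necessarily how pandas-profiling computes its report internally.

```python
import pandas as pd


def summarize(s: pd.Series) -> dict:
    """Compute a handful of the statistics listed above for one numeric column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "dtype": str(s.dtype),
        "distinct": s.nunique(),                 # NaN is not counted as a value
        "missing": int(s.isna().sum()),
        "min": s.min(),
        "Q1": q1,
        "median": s.median(),
        "Q3": q3,
        "max": s.max(),
        "range": s.max() - s.min(),
        "IQR": q3 - q1,                          # interquartile range
        "mean": s.mean(),
        "mode": s.mode().iloc[0],                # smallest value if several modes tie
        "std": s.std(),                          # sample standard deviation (ddof=1)
        "sum": s.sum(),
        "MAD": (s - s.median()).abs().median(),  # median absolute deviation
        "CV": s.std() / s.mean(),                # coefficient of variation
        "kurtosis": s.kurt(),
        "skewness": s.skew(),
    }


# Hypothetical sample data with one missing value.
sample = pd.Series([2, 4, 4, 4, 5, 5, 7, 9, None], name="score")
stats = summarize(sample)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Missing values are skipped by the pandas reductions used here, so the one `None` entry only shows up in the `missing` count.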

## Installation commands
- pip install pandas-profiling
- conda install pandas-profiling

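## Things to watch for when installing pandas-profiling
##### 1. Minimum package versions are required, e.g. numpy >= 1.11, pandas >= 0.22.0, matplotlib >= 2.2.3, scipy >= 1.1.0
- To upgrade a package: pip install <package-name> --upgrade

##### 2. If llvmlite cannot be uninstalled, skip the uninstall step with: pip install --ignore-installed llvmlite
##### 3. After completing the steps above, run pip install pandas-profiling again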
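
Before upgrading anything, it can help to see which of the packages named in step 1 already meet their minimum. The following is a minimal sketch using only the standard library (`importlib.metadata`, Python 3.8+). The helper names `as_tuple` and `meets_minimum` are hypothetical, and the version parsing is deliberately simple: it ignores suffixes such as `rc1`.

```python
from importlib.metadata import PackageNotFoundError, version

# Minimum versions quoted in step 1 above.
MINIMUMS = {
    "numpy": "1.11",
    "pandas": "0.22.0",
    "matplotlib": "2.2.3",
    "scipy": "1.1.0",
}


def as_tuple(ver: str) -> tuple:
    """Turn '1.21.6' into (1, 21, 6), stopping at the first non-numeric part."""
    parts = []
    for piece in ver.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed: str, required: str) -> bool:
    """Compare versions numerically, padding the shorter one with zeros."""
    a, b = as_tuple(installed), as_tuple(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


for name, required in MINIMUMS.items():
    try:
        installed = version(name)
    except PackageNotFoundError:
        print(f"{name}: not installed (needs >= {required})")
        continue
    status = "OK" if meets_minimum(installed, required) else "upgrade needed"
    print(f"{name}: {installed} (needs >= {required}) -> {status}")
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.9" sorts after "1.11", which would give the wrong answer for numpy.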

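## Garbled Chinese text in plots
#### 1. Style file pandas_profiling.mplstyle
- Path: C:\ProgramData\Anaconda3\Lib\site-packages\pandas_profiling\view\pandas_profiling.mplstyle
- To fix Chinese character display, set a Chinese font on line 29 of pandas_profiling.mplstyle
- Close and restart Jupyter Notebook

#### 2. Alternatively, apply the usual matplotlib.pyplot settings for Chinese text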
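
For the second approach, one possible sketch is to pick the first Chinese-capable font that is actually installed and then set the matplotlib rcParams accordingly. The helper names `pick_font` and `apply_chinese_font` and the candidate font list are assumptions for illustration; which fonts exist depends on the machine.

```python
def pick_font(candidates, available):
    """Return the first candidate font name that is installed, or None."""
    installed = set(available)
    for name in candidates:
        if name in installed:
            return name
    return None


def apply_chinese_font(candidates=("Microsoft JhengHei", "Microsoft YaHei", "SimHei")):
    """Point matplotlib at the first installed candidate font and return its name."""
    # Imported here so that pick_font stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib import font_manager

    available = [font.name for font in font_manager.fontManager.ttflist]
    chosen = pick_font(candidates, available)
    if chosen is not None:
        plt.rcParams["font.sans-serif"] = [chosen]
        plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign rendering
    return chosen
```

In a notebook, calling `apply_chinese_font()` before building the report would apply the settings. A return value of `None` means none of the candidate fonts was found, in which case a Chinese font has to be installed first.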

References:
- https://www.twblogs.net/a/5d52ffaebd9eee5327fcd6ee
- https://www.it610.com/article/1177198813606113280.htm

In [1]:
!pip install pandas-profiling



# Example 1: Titanic

In [4]:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load the Titanic sample dataset via seaborn (downloaded on first use, then cached)
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [5]:
import pandas_profiling as ppf

# Build the profiling report; a bare `profile` on the last line renders it inline in the notebook
profile = ppf.ProfileReport(df)
profile



# Example 2: Boston housing

In [1]:
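# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
from sklearn.datasets import load_boston
import pandas as pd

boston = load_boston()  # load the dataset once and reuse it
df = pd.DataFrame(data=boston["data"], columns=boston["feature_names"])
df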

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.0900,1.0,296.0,15.3,396.90,4.98
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.90,9.14
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.90,5.33
...,...,...,...,...,...,...,...,...,...,...,...,...,...
501,0.06263,0.0,11.93,0.0,0.573,6.593,69.1,2.4786,1.0,273.0,21.0,391.99,9.67
502,0.04527,0.0,11.93,0.0,0.573,6.120,76.7,2.2875,1.0,273.0,21.0,396.90,9.08
503,0.06076,0.0,11.93,0.0,0.573,6.976,91.0,2.1675,1.0,273.0,21.0,396.90,5.64
504,0.10959,0.0,11.93,0.0,0.573,6.794,89.3,2.3889,1.0,273.0,21.0,393.45,6.48


In [3]:
import webbrowser
import pandas_profiling as ppf

profile = ppf.ProfileReport(df)
profile.to_file(output_file="output.html")  # write the report as a standalone HTML file
webbrowser.open('output.html')  # open the report in the default browser

True
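
`webbrowser.open` is documented as taking a URL, and the Python documentation notes that passing a bare filename is neither supported nor portable. If the report does not open on a given platform, one option is to pass an absolute `file://` URI instead. A minimal sketch, assuming the report was written to the current working directory (the `report_uri` helper is hypothetical):

```python
import webbrowser
from pathlib import Path


def report_uri(filename: str) -> str:
    """Return an absolute file:// URI for a file in the current working directory."""
    return (Path.cwd() / filename).as_uri()


uri = report_uri("output.html")
print(uri)
# webbrowser.open(uri)  # uncomment to open the report in the default browser
```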

# Example 3: Student grades

In [1]:
import pandas as pd
df = pd.read_csv('成績.csv', index_col=0, engine='python')  # without engine='python', pandas could not read a file with a Chinese name in this environment
df.head()

座號,姓名,國文,英文,數學,地理,歷史,理化,電腦
1,劉德華,50,56,65,78,45,89,75
2,黎明,56,78,78,98,65,65,85
3,郭富城,78,63,96,65,78,78,65
4,張學友,96,25,65,45,98,96,32
5,張惠妹,21,98,45,23,14,36,65
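
If `engine='python'` is not an option, another way to work around the Chinese filename is to let Python open the file and hand the open handle to `pd.read_csv`. The sketch below builds a tiny stand-in file in a temporary folder so that it runs on its own. The sample rows and the UTF-8 encoding are assumptions; the `encoding` argument should match however the real 成績.csv was saved.

```python
import os
import tempfile

import pandas as pd

# Build a tiny stand-in for 成績.csv in a temporary folder (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "成績.csv")
with open(csv_path, "w", encoding="utf-8", newline="") as f:
    f.write("座號,姓名,國文,英文\n1,甲,50,56\n2,乙,56,78\n")

# Let Python open the file, then pass the handle to pandas instead of the path.
with open(csv_path, encoding="utf-8", newline="") as f:
    df_demo = pd.read_csv(f, index_col=0)

print(df_demo)
```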


In [2]:
import webbrowser
import pandas_profiling as ppf
import matplotlib.pyplot as plt

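plt.rcParams['font.sans-serif'] = 'Microsoft YaHei'  # use Microsoft YaHei so Chinese labels render correctly; ['SimHei'] is an alternative
plt.rcParams['axes.unicode_minus'] = False  # keep the minus sign on negative axis values rendering correctly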
profile = ppf.ProfileReport(df)
profile.to_file(output_file="output.html")
webbrowser.open('output.html')

True