## Seaborn

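So how do Pandas and Seaborn differ?

Both draw their plots with matplotlib, but they are designed very differently:

1. For quick, simple plots, use Pandas directly; for more attractive and richer plots, use Seaborn.
2. Pandas plotting functions expose relatively few parameters for adjusting a figure, so fine-tuning requires a solid understanding of matplotlib.
3. Seaborn plotting functions expose many parameters for adjusting a figure, so you need far less matplotlib knowledge.
4. Seaborn API: https://stanford.edu/~mwaskom/software/seaborn/api.html#style-frontend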
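
As a rough illustration of the first three points, the sketch below draws the same survival-rate bar chart twice, once through Pandas and once through Seaborn. The small `toy` DataFrame is made up for this example and is not the Titanic data.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Made-up toy data (not the real Titanic records)
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male", "female"],
    "Survived": [0, 1, 0, 1, 1, 0],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Pandas: aggregate first, then plot the aggregated result
rates = toy.groupby("Sex")["Survived"].mean()
rates.plot(kind="bar", ax=ax1, title="Pandas")

# Seaborn: pass the raw rows; barplot computes the mean per category itself
sns.barplot(x="Sex", y="Survived", data=toy, ax=ax2)
ax2.set_title("Seaborn")

plt.tight_layout()
```

With Pandas the aggregation is a separate step, while Seaborn takes the raw rows and handles the statistics and styling inside the plotting call.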

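## Titanic Data Analysis

The sinking of the Titanic is one of the most famous maritime disasters in history. Many passengers died in the accident, while some were rescued. We have a dataset describing a group of passengers (name, age, sex, fare, and so on) together with whether each one survived. The task is to build a model from this data and then predict whether another group of passengers survived. Let's take a look.

## Getting an Overview of the Data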

In [None]:
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

# Hide deprecation warnings raised by the libraries used below
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Replace warnings.warn with a no-op so that no warning is printed at all
def warn(*args, **kwargs):
    pass
warnings.warn = warn

In [None]:
df_train = pd.read_csv("titanic/train.csv")
df_test = pd.read_csv("titanic/test.csv")  # left for you to analyze as an exercise

In [None]:
df_train.head()

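- PassengerId => passenger ID
- Survived => whether the passenger survived
- Pclass => passenger class (1st/2nd/3rd class)
- Name => passenger name
- Sex => sex
- Age => age
- SibSp => number of siblings/spouses aboard
- Parch => number of parents/children aboard
- Ticket => ticket information
- Fare => ticket fare
- Cabin => cabin
- Embarked => port of embarkation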

In [None]:
df_train.info()

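A quick look at the data shows that Age has quite a few missing values, Cabin (the cabin number) is mostly missing, and a few other columns have isolated missing values.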
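
To put numbers on that impression, the missing cells can be counted per column. The sketch below uses a made-up `toy` frame with a similar missing-value pattern; on the real data the same two calls would be applied to `df_train`.

```python
import numpy as np
import pandas as pd

# Made-up rows that mimic the missing-value pattern (not the real data)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

missing_count = toy.isnull().sum()    # number of missing cells per column
missing_share = toy.isnull().mean()   # fraction of missing cells per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```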

In [None]:
fig, ax = plt.subplots(figsize=(9,5))
sns.heatmap(df_train.isnull(), cbar=False, cmap="YlGnBu_r")
plt.show()

In [None]:
# These are the categorical columns
cols = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']

In [None]:
nr_rows = 2
nr_cols = 3

fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))

for r in range(nr_rows):
    for c in range(nr_cols):
        i = r * nr_cols + c
        ax = axs[r][c]
        # Name the column and pass data= explicitly; seaborn 0.12+ accepts only `data` positionally
        sns.countplot(x=cols[i], hue="Survived", data=df_train, ax=ax)
        ax.set_title(cols[i])
        ax.legend()

plt.tight_layout()

### Understanding the Data

- First plot: ?
- Second plot: ?
- Third plot: ?
- Fourth and fifth plots: ?
- Sixth plot: ?

### The Age Factor

In [None]:
bins = np.arange(0, 80, 5)
# height= replaces the older size= argument of FacetGrid
g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, height=3, aspect=1.1)
# sns.histplot replaces the deprecated sns.distplot(kde=False)
g.map(sns.histplot, 'Age', bins=bins, alpha=0.6)
g.add_legend()
plt.show()

In [None]:
# Write your analysis here

### The Fare Factor

In [None]:
df_train['Fare'].max()

In [None]:
bins = np.arange(0, 550, 20)
# height= replaces the older size= argument of FacetGrid
g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, height=3, aspect=1.1)
# sns.histplot replaces the deprecated sns.distplot(kde=False)
g.map(sns.histplot, 'Fare', bins=bins, alpha=0.6)
g.add_legend()
plt.show()

### The Passenger Class Factor

In [None]:
sns.barplot(x='Pclass', y='Survived', data=df_train)
plt.ylabel("Survival Rate")
plt.title("Survival as function of Pclass")
plt.show()

In [None]:
sns.barplot(x='Sex', y='Survived', hue='Pclass', data=df_train)
plt.ylabel("Survival Rate")
plt.title("Survival as function of Pclass and Sex")
plt.show()

### The Embarkation Port Factor

In [None]:
sns.barplot(x='Embarked', y='Survived', data=df_train)
plt.ylabel("Survival Rate")
plt.title("Survival as function of Embarked Port")
plt.show()

In [None]:
sns.boxplot(x='Embarked', y='Fare', data=df_train)
plt.title("Fare distribution as function of Embarked Port")
plt.show()

## Adding New Features

### Family Size, Traveling Alone, Name Length, and Title

In [None]:
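for df in [df_train, df_test]:
    # Family members aboard, not counting the passenger
    df['FamilySize'] = df['SibSp'] + df['Parch']

    # Flag passengers traveling without any family
    df['Alone'] = 0
    df.loc[df.FamilySize == 0, 'Alone'] = 1

    # Name length, binned in steps of 5 characters (bin i covers lengths up to i*5)
    df['NameLen'] = df.Name.str.len()
    df['NameLenBin'] = np.nan
    for i in range(20, 0, -1):
        df.loc[df['NameLen'] <= i*5, 'NameLenBin'] = i

    # Extract the title: the word immediately before the first period in the name
    df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
    # Merge rare titles into broader groups; assign the result back instead of using inplace=True
    df['Title'] = df['Title'].replace(
        ['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],
        ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])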
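
To see what the title extraction does, here is a minimal sketch run on a handful of sample names written in the same "Surname, Title. Given names" format as the `Name` column. The short replacement dictionary is only a subset of the mapping in the cell above, chosen for illustration.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" format of the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Reuchlin, Jonkheer. John George",
])

# The title is the first run of letters that is directly followed by a period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)

# Collapse rare titles into broader groups (illustrative subset of the mapping)
grouped = titles.replace({"Countess": "Mrs", "Jonkheer": "Other"})
print(grouped.tolist())  # ['Mr', 'Miss', 'Mrs', 'Other']
```

Passing `expand=False` makes `str.extract` return a Series rather than a one-column DataFrame, which is convenient when the result is assigned to a single column.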

In [None]:
plt.subplots(figsize=(10,6))
sns.barplot(x='NameLenBin' , y='Survived' , data = df_train)
plt.ylabel("Survival Rate")
plt.title("Survival as function of NameLenBin")
plt.show()

### Conclusions?

In [None]:
# sns.catplot replaces the older sns.factorplot; height= replaces size=
g = sns.catplot(x="NameLenBin", y="Survived", col="Sex", data=df_train, kind="bar", height=5, aspect=1.2)

### Conclusions?

### The Title Factor

In [None]:
plt.subplots(figsize=(10,6))
sns.barplot(x='Title' , y='Survived' , data = df_train)
plt.ylabel("Survival Rate")
plt.title("Survival as function of Title")
plt.show()

In [None]:
pd.crosstab(df_train.FamilySize,df_train.Survived).apply(lambda r: r/r.sum(), axis=1).style.background_gradient(cmap='summer_r')
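
The row-wise division in the crosstab above turns counts into per-row survival shares. As a sketch on made-up data, the same table can also be obtained with the `normalize="index"` option of `pd.crosstab`; the `toy` frame here is an assumption for illustration only.

```python
import pandas as pd

# Made-up family sizes and outcomes (not the real data)
toy = pd.DataFrame({
    "FamilySize": [0, 0, 0, 0, 1, 1, 2, 2],
    "Survived":   [0, 0, 0, 1, 0, 1, 1, 1],
})

# Manual version: divide every row of the count table by its row total
manual = pd.crosstab(toy.FamilySize, toy.Survived).apply(lambda r: r / r.sum(), axis=1)

# Built-in version: normalize="index" makes each row sum to 1
builtin = pd.crosstab(toy.FamilySize, toy.Survived, normalize="index")
print(builtin)
```

Each row then reads as "of the passengers with this family size, which share died (column 0) and which share survived (column 1)".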

### Conclusions?

## Data Cleaning

### Step 1: Fill in Missing Data

In [None]:
# Fill missing titles with the most common title
df_train['Title'] = df_train['Title'].fillna(df_train['Title'].mode().iloc[0])

# Fill missing ages with the mean age of passengers who share the same title
for title in ['Mr', 'Mrs', 'Master', 'Miss', 'Other']:
    mean_age = df_train.Age[df_train.Title == title].mean()
    df_train.loc[df_train.Age.isnull() & (df_train.Title == title), 'Age'] = mean_age
# The Name column is no longer needed once Title has been extracted
df_train = df_train.drop('Name', axis=1)
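
The per-title mean imputation can also be written without listing the titles by hand. The following is a minimal sketch on made-up data using `groupby(...).transform("mean")`; it is an alternative formulation, not the notebook's own code.

```python
import numpy as np
import pandas as pd

# Made-up titles and ages with two gaps (not the real data)
toy = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Master"],
    "Age":   [30.0, 40.0, np.nan, 20.0, np.nan, 4.0],
})

# Mean age of each title group, aligned row by row with the original frame
group_mean = toy.groupby("Title")["Age"].transform("mean")

# Only the missing ages are replaced; known ages stay untouched
toy["Age"] = toy["Age"].fillna(group_mean)
print(toy)
```

Here the missing "Mr" age becomes 35.0 (the mean of 30 and 40) and the missing "Miss" age becomes 20.0.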

In [None]:
# Fill missing embarkation ports with the most common port (the mode)
df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode().iloc[0])
# Fill missing fares with the mean fare
df_train['Fare'] = df_train['Fare'].fillna(df_train['Fare'].mean())

In [None]:
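# Bin age into 10-year bands and fare into bands of 50 to make patterns easier to spot
df = df_train  # alias, not a copy: the new columns are added to df_train itself
df['Age_bin'] = np.nan
for i in range(8, 0, -1):
    df.loc[df['Age'] <= i*10, 'Age_bin'] = i

df['Fare_bin'] = np.nan
for i in range(12, 0, -1):
    df.loc[df['Fare'] <= i*50, 'Fare_bin'] = i

# Encode the text titles as integers so that models can work with them
df['Title'] = df['Title'].map({'Other': 0, 'Mr': 1, 'Master': 2, 'Miss': 3, 'Mrs': 4})
# Titles outside the mapping become NaN; fill them with the most common code
df['Title'] = df['Title'].fillna(df['Title'].mode().iloc[0])
df['Title'] = df['Title'].astype(int)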
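
The descending loop assigns bin `i` to every value in the interval from `(i-1)*10` (exclusive) to `i*10` (inclusive). As a sketch on made-up ages, the same bins can be produced with `pd.cut`; the two versions agree for non-negative ages, which is the assumption made here.

```python
import numpy as np
import pandas as pd

# Made-up ages, including edge cases at bin boundaries and beyond the last bin
ages = pd.Series([0.42, 10, 10.5, 35, 80, 85])

# Loop-based binning, following the same pattern as the cell above
loop_bin = pd.Series(np.nan, index=ages.index)
for i in range(8, 0, -1):
    loop_bin[ages <= i * 10] = i

# pd.cut with right-closed intervals (0, 10], (10, 20], ..., (70, 80]
cut_bin = pd.cut(
    ages,
    bins=np.arange(0, 90, 10),
    labels=list(range(1, 9)),
    include_lowest=True,   # treat 0 itself as part of the first bin
).astype(float)

print(pd.DataFrame({"age": ages, "loop": loop_bin, "cut": cut_bin}))
```

Values above the last edge (85 in this example) stay NaN in both versions.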

In [None]:
# Work on a copy so the original data stays intact
df_train_ml = df_train.copy()

In [None]:
df_train_ml.info()

In [None]:
# One-hot encode the categorical columns into new 0/1 indicator columns
df_train_ml = pd.get_dummies(df_train_ml, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)
df_train_ml.drop(['PassengerId','Ticket','Cabin','Age', 'Fare_bin'],axis=1,inplace=True)
df_train_ml.dropna(inplace=True)
# Drop further columns that will not be used as model features
df_train_ml.drop(['NameLen', 'SibSp', 'Parch', 'Alone'], axis=1, inplace=True)
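
To make the effect of `pd.get_dummies(..., drop_first=True)` concrete, here is a minimal sketch on a made-up three-row frame. The values are illustrative only.

```python
import pandas as pd

# Made-up rows (not the real data)
toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Each listed column is replaced by one indicator column per category;
# drop_first=True omits the first category of each column (here "female" and "C")
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())
# ['Survived', 'Sex_male', 'Embarked_Q', 'Embarked_S']
```

The dropped category is implied when all remaining indicators of that column are 0, so no information is lost while one redundant column per feature is avoided.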

In [None]:
df_train_ml.head()

## The next step is machine learning; if you are interested, take a look at sklearn

Official site: https://scikit-learn.org/stable/