#### Data Pre-Cleaning:
Data pre-cleaning is the process of detecting, correcting, or removing inaccurate or improper records before model building.
- Problems include: inconsistent data types (date, string, text, continuous, discrete...), data quality (noise, anomalies, missing values, errors, duplicates), and too much or too little data.

- Goal: make the data fit the model and satisfy the requirements for building it.
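
A minimal sketch of how a few of the problems listed above might be detected with pandas. The toy table and its column names are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: one missing value, one duplicated row, one text column
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 38.0],
    "sex": ["male", "female", "female", "female"],
})

dtypes = df.dtypes                          # data type of each column
n_missing = df.isna().sum()                 # number of missing values per column
n_duplicates = int(df.duplicated().sum())   # number of fully duplicated rows

print(dtypes)
print(n_missing)
print(n_duplicates)
```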

#### Feature Engineering: 
The process of converting raw data into features that better represent the underlying problem for a predictive model. It can be done by selecting the most relevant features, or by reducing dimensionality with algorithms such as PCA, LDA, LLE, or Laplacian eigenmaps.
- https://chenrudan.github.io/blog/2016/04/01/dimensionalityreduction.html
- https://zhuanlan.zhihu.com/p/43225794

- Problems include: correlated features, features unrelated to the label, too many or too few features
- Goal: lower computational cost, increase model accuracy

#### Sklearn: (6 major types) 
1. Classification: SVM, nearest neighbors, random forest, decision tree...
2. Regression: SVR, Ridge, Lasso, Linear
3. Clustering: K-means, spectral clustering, mean-shift...
4. Dimensionality Reduction: PCA, feature selection, non-negative matrix factorization.
5. Model Selection: grid search, CV, metrics
6. **Preprocessing: preprocessing, feature extraction**
    (modules: preprocessing, impute (for NaN), feature_selection, decomposition)

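## Nondimensionalize: bringing features on different scales to the same scale
Nondimensionalization (feature scaling) can speed up solvers and improve model accuracy. (Exception: decision trees and tree ensembles do not need it, because tree splits depend only on the ordering of the values within each feature, not on their scale.) Scaling can be linear or nonlinear; the linear kind includes centering (zero-centering, i.e. mean subtraction) and scaling.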

### preprocessing.MinMaxScaler
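When the data x is centered on its minimum and then scaled by its range (max - min), it is shifted by the minimum and compressed into [0, 1]. This process is called normalization (also Min-Max Scaling). Note that normalization is not regularization; regularization is not a data preprocessing technique. Normalization is a linear rescaling, so it does not change the shape of the distribution; in particular, it does not make the data normally distributed.

In sklearn, preprocessing.MinMaxScaler implements this. **MinMaxScaler has one important parameter, feature_range, which controls the range the data is compressed to; the default is (0, 1)**.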

In [4]:
from sklearn.preprocessing import MinMaxScaler 
import pandas as pd

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

pd.DataFrame(data)

|   | 0 | 1 |
|---|---|---|
| 0 | -1.0 | 2 |
| 1 | -0.5 | 6 |
| 2 | 0.0 | 10 |
| 3 | 1.0 | 18 |


In [11]:
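# Normalize
scaler = MinMaxScaler()  # instantiate
scaler = scaler.fit(data)  # fit: here this computes min(x) and max(x) for each column
result = scaler.transform(data)  # export the result through the transform interface
result  # every column is now compressed into [0, 1]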

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [8]:
# Or do both steps at once with fit_transform; the result is the same
result_ = scaler.fit_transform(data)  # fit and export the result in one step

In [12]:
scaler.inverse_transform(result)  # invert the normalization (returns the data as it was before scaling)

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])

In [14]:
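# Use the feature_range parameter of MinMaxScaler to normalize into a range other than [0, 1]

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = MinMaxScaler(feature_range=(5, 10))  # instantiate again; feature_range is a (min, max) tuple, and rescaling to it does not change the shape of the distribution
result = scaler.fit_transform(data)  # fit_transform exports the result in one step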
result

array([[ 5.  ,  5.  ],
       [ 6.25,  6.25],
       [ 7.5 ,  7.5 ],
       [10.  , 10.  ]])

In [16]:
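# When X is too large to process in a single call (for example, a very large number of samples that does not fit in memory), fit may fail
# In that case, partial_fit can be used as the training interface, feeding the data in batches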
#scaler = scaler.partial_fit(data)
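
A minimal sketch of batch-wise fitting with `partial_fit`, assuming the data arrives in chunks. Here the chunks are simulated by splitting the small array from the cells above with `np.array_split`; in a real setting each batch would be read from disk or a stream.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]], dtype=float)

# Update the running per-column min and max one batch at a time
scaler = MinMaxScaler()
for batch in np.array_split(data, 2):
    scaler.partial_fit(batch)

result_batched = scaler.transform(data)

# For comparison: fitting on all of the data at once
result_full = MinMaxScaler().fit_transform(data)
same = np.allclose(result_batched, result_full)
print(same)
```

After all batches have been seen, the scaler holds the same minimum and maximum as a single `fit` on the full data, so both paths give the same transform.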

#### Bonus: use numpy to achieve Normalization

In [17]:
import numpy as np

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]]) 

# Normalize
X_nor = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0)) 
X_nor

array([[0.  , 0.  ],
       [0.25, 0.25],
       [0.5 , 0.5 ],
       [1.  , 1.  ]])

In [18]:
# Invert the normalization
X_returned = X_nor * (X.max(axis=0) - X.min(axis=0)) + X.min(axis=0) 
X_returned

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])
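
The same NumPy approach can be extended to an arbitrary target range, mirroring what `feature_range` does. This is a sketch; the names `low` and `high` are chosen here for illustration only.

```python
import numpy as np

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])
low, high = 5, 10  # illustrative target range

# First scale each column to [0, 1], then stretch and shift it to [low, high]
X_unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_range = X_unit * (high - low) + low
print(X_range)
```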

### preprocessing.StandardScaler
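When the data x is centered on its mean (μ) and then scaled by its standard deviation (σ), it has mean 0 and variance 1. This process is called standardization (also Z-score normalization). It does not make the data normally distributed; the result is standard normal only if the original feature was already normal. The formula is: x* = (x - μ) / σ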

In [27]:
from sklearn.preprocessing import StandardScaler 

data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]

scaler = StandardScaler() #实例化
scaler.fit(data)  # fit: here this computes the mean and variance of each column

scaler.mean_    # the mean_ attribute holds the mean of each feature column

array([-0.125,  9.   ])

In [28]:
scaler.var_     # the var_ attribute holds the variance of each feature column

array([ 0.546875, 35.      ])

In [29]:
x_std = scaler.transform(data)   # export the result through the transform interface
x_std.mean()          # the result is an array; mean() gives the mean over all entries

0.0

In [30]:
x_std.std()           # std() gives the standard deviation over all entries

1.0

In [31]:
scaler.fit_transform(data)       # fit_transform(data) gets the result in one step

array([[-1.18321596, -1.18321596],
       [-0.50709255, -0.50709255],
       [ 0.16903085,  0.16903085],
       [ 1.52127766,  1.52127766]])

In [32]:
scaler.inverse_transform(x_std)  # inverse_transform undoes the standardization

array([[-1. ,  2. ],
       [-0.5,  6. ],
       [ 0. , 10. ],
       [ 1. , 18. ]])
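
As with normalization, standardization can be sketched in plain NumPy. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is what the variance values printed by `scaler.var_` above correspond to.

```python
import numpy as np

X = np.array([[-1, 2], [-0.5, 6], [0, 10], [1, 18]])

mu = X.mean(axis=0)      # per-column mean
sigma = X.std(axis=0)    # per-column population standard deviation (ddof=0)

X_std = (X - mu) / sigma       # standardize
X_back = X_std * sigma + mu    # invert the standardization
print(X_std)
```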

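Which one to choose, StandardScaler or MinMaxScaler?

- It depends. Most machine learning algorithms use StandardScaler for feature scaling, because MinMaxScaler is very sensitive to outliers.
- For PCA, clustering, logistic regression, support vector machines, and neural networks, StandardScaler is often the best choice. MinMaxScaler is widely used when no distance measures, gradients, or covariance computations are involved and the data has to be compressed into a specific range; for example, when quantizing pixel intensities in digital image processing, MinMaxScaler is used to compress the data into [0, 1].
- Suggestion: try StandardScaler first, and switch to MinMaxScaler if the results are poor.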
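
To illustrate the outlier sensitivity of MinMaxScaler mentioned above, here is a small sketch with made-up numbers: a single extreme value stretches the range so much that the ordinary values are squeezed into a tiny interval near 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data, without and with one extreme value
x_clean = np.array([[1.0], [2.0], [3.0], [4.0]])
x_outlier = np.vstack([x_clean, [[1000.0]]])

scaled_clean = MinMaxScaler().fit_transform(x_clean)      # spread evenly over [0, 1]
scaled_outlier = MinMaxScaler().fit_transform(x_outlier)  # first four values end up close to 0

print(scaled_clean.ravel())
print(scaled_outlier.ravel())
```

StandardScaler is not immune to outliers either, since the mean and standard deviation also shift, but its output is not forced into a fixed interval.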

## NAN, missing value

In [110]:
#index_col = 0

In [111]:
tantani = {'Age':[22.0, 38.0, 26.0, 35.0, 0],
                   'Sex':['male', 'female','female','female','male'],
                   'Embarked':['S','C','S','S','S'],
                   'Survived': ['No','Yes','Yes','Yes','No']}
data = pd.DataFrame.from_dict(tantani)

In [112]:
data.head()

|   | Age | Sex | Embarked | Survived |
|---|---|---|---|---|
| 0 | 22.0 | male | S | No |
| 1 | 38.0 | female | C | Yes |
| 2 | 26.0 | female | S | Yes |
| 3 | 35.0 | female | S | Yes |
| 4 | 0.0 | male | S | No |


### impute.SimpleImputer
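sklearn.impute.SimpleImputer(missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True)
- missing_values: tells SimpleImputer what a missing value looks like in the data; the default is np.nan
- strategy: the strategy used to fill missing values; the default is the mean.
    "mean": fill with the mean (numeric features only)
    "median": fill with the median (numeric features only)
    "most_frequent": fill with the mode (works for both numeric and string features)
    "constant": fill with the value given in the fill_value parameter
- fill_value: used when strategy is "constant"; a string or number giving the value to fill with, commonly 0
- copy: defaults to True, which creates a copy of the feature matrix; otherwise the missing values are filled into the original feature matrix in place.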
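
A minimal sketch of these parameters on a column that really contains `np.nan`. The values are illustrative and the variable names are chosen for this example only. The last call shows `missing_values=0`, for data where 0 is used as the placeholder for a missing entry.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Illustrative age column with one true missing value
age = np.array([[22.0], [38.0], [26.0], [35.0], [np.nan]])

filled_mean = SimpleImputer().fit_transform(age)                    # default strategy: mean
filled_median = SimpleImputer(strategy="median").fit_transform(age)
filled_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(age)

# Same column, but with 0 standing in for the missing entry
age_zero = np.array([[22.0], [38.0], [26.0], [35.0], [0.0]])
filled_from_zero = SimpleImputer(missing_values=0, strategy="median").fit_transform(age_zero)

print(filled_mean[-1], filled_median[-1], filled_zero[-1], filled_from_zero[-1])
```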

In [113]:
data.info()   #check data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       5 non-null      float64
 1   Sex       5 non-null      object 
 2   Embarked  5 non-null      object 
 3   Survived  5 non-null      object 
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


In [114]:
# Fill Age (assume it has missing values; in this toy frame the placeholder is 0, not NaN)
Age = data.loc[:,'Age'].values.reshape(-1,1)  # feature matrices in sklearn must be 2-dimensional
Age[:20]

array([[22.],
       [38.],
       [26.],
       [35.],
       [ 0.]])

In [115]:
# Set up the imputers:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer()   # instantiate; fills with the mean by default
imp_median = SimpleImputer(strategy = 'median')  # instantiate; fill with the median
imp_0 = SimpleImputer(strategy = 'constant', fill_value = 0)  # instantiate; fill with 0

In [116]:
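# Apply each of the three imputers to the data
# (each name is rebound from the imputer to its filled array; Age contains no np.nan here, so the values come back unchanged)

imp_mean = imp_mean.fit_transform(Age)   # fit and return the filled array
imp_median = imp_median.fit_transform(Age)
imp_0 = imp_0.fit_transform(Age)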

In [117]:
imp_mean[:5]

array([[22.],
       [38.],
       [26.],
       [35.],
       [ 0.]])

In [118]:
imp_median[:5]

array([[22.],
       [38.],
       [26.],
       [35.],
       [ 0.]])

In [119]:
imp_0[:5]

array([[22.],
       [38.],
       [26.],
       [35.],
       [ 0.]])

In [120]:
# If we decide to fill with the median:
data.loc[:,'Age'] = imp_median

In [121]:
# Fill Embarked with the mode

Embarked = data.loc[:,'Embarked'].values.reshape(-1,1)
imp_mode = SimpleImputer(strategy = 'most_frequent')
data.loc[:,'Embarked'] = imp_mode.fit_transform(Embarked)

In [122]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Age       5 non-null      float64
 1   Sex       5 non-null      object 
 2   Embarked  5 non-null      object 
 3   Survived  5 non-null      object 
dtypes: float64(1), object(3)
memory usage: 288.0+ bytes


#### BONUS: filling with pandas and NumPy is actually simpler

In [123]:
data.head()
data.loc[:,"Age"] = data.loc[:,"Age"].fillna(data.loc[:,"Age"].median())
# .fillna fills the missing values directly in the DataFrame

In [124]:
data.dropna(axis=0,inplace=True) 
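# .dropna(axis=0) drops every row that has a missing value; .dropna(axis=1) drops every column that has one
# inplace=True modifies the original dataset; inplace=False returns a modified copy and leaves the original untouched (default False)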
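
Since the toy frame above has no actual NaN entries, here is a small sketch of `fillna` and `dropna` on a hypothetical frame that does. The column names follow the example, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Age and one missing Embarked
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Fill the numeric column with its median (the median ignores NaN)
df["Age"] = df["Age"].fillna(df["Age"].median())

# Drop the rows that still contain a missing value; returns a new frame
df_dropped = df.dropna(axis=0)
print(df_dropped)
```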

## Dummy Variable (text -> numerical)

In [125]:
tantani = {'Age':[22.0, 38.0, 26.0, 35.0, 35.0],
                   'Sex':['male', 'female','female','female','male'],
                   'Embarked':['S','C','S','S','S'],
                   'Survived': ['No','Yes','Yes','Unkown','No']}
data1 = pd.DataFrame.from_dict(tantani)

### preprocessing.LabelEncoder: for labels only, 1-D input allowed

In [126]:
from sklearn.preprocessing import LabelEncoder

y = data1.iloc[:,-1]   # the input is the label, not the feature matrix, so 1-D is allowed

le = LabelEncoder()   # instantiate
le = le.fit(y)        # fit on the data
label = le.transform(y)  # get the result through the transform interface

In [127]:
label

array([0, 2, 2, 1, 0])

In [128]:
le.classes_  # the classes_ attribute shows which classes the label contains

array(['No', 'Unkown', 'Yes'], dtype=object)

In [129]:
le.fit_transform(y)   # or do it in one step with fit_transform

array([0, 2, 2, 1, 0])

In [106]:
le.inverse_transform(label)  # inverse_transform maps the codes back to the original labels

array(['No', 'Yes', 'Yes', 'Unkown', 'No'], dtype=object)

In [130]:
data1.iloc[:,-1] = label  # replace the label column with the encoded result

In [132]:
data1.head()

|   | Age | Sex | Embarked | Survived |
|---|---|---|---|---|
| 0 | 22.0 | male | S | 0 |
| 1 | 38.0 | female | C | 2 |
| 2 | 26.0 | female | S | 2 |
| 3 | 35.0 | female | S | 1 |
| 4 | 35.0 | male | S | 0 |


### preprocessing.OrdinalEncoder: for features only; converts categorical features to integer codes

In [136]:
from sklearn.preprocessing import OrdinalEncoder

data2 = data1.copy()
data2.head()

|   | Age | Sex | Embarked | Survived |
|---|---|---|---|---|
| 0 | 22.0 | male | S | 0 |
| 1 | 38.0 | female | C | 2 |
| 2 | 26.0 | female | S | 2 |
| 3 | 35.0 | female | S | 1 |
| 4 | 35.0 | male | S | 0 |


In [137]:
OrdinalEncoder().fit(data2.iloc[:,1:-1]).categories_

[array(['female', 'male'], dtype=object), array(['C', 'S'], dtype=object)]

In [139]:
data2.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data2.iloc[:,1:-1])

In [140]:
data2.head()

|   | Age | Sex | Embarked | Survived |
|---|---|---|---|---|
| 0 | 22.0 | 1.0 | 1.0 | 0 |
| 1 | 38.0 | 0.0 | 0.0 | 2 |
| 2 | 26.0 | 0.0 | 1.0 | 2 |
| 3 | 35.0 | 0.0 | 1.0 | 1 |
| 4 | 35.0 | 1.0 | 1.0 | 0 |
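
By default OrdinalEncoder assigns codes in sorted order of the category values, as in the `categories_` output above. When a feature has a meaningful order, that order can be passed explicitly through the `categories` parameter. A sketch with a hypothetical `Size` column that is not part of the dataset above:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered feature
df = pd.DataFrame({"Size": ["small", "large", "medium", "small"]})

# One list of categories per feature column, in the desired order
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
codes = enc.fit_transform(df[["Size"]])  # 2-D input: a DataFrame with one column
print(codes.ravel())
```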


### preprocessing.OneHotEncoder: creates dummy variables, for categories whose values may have no relationship to each other

In [141]:
#"S":0, "C":1 --> S:[1,0], C:[0,1]

data3 = data.copy()

In [142]:
from sklearn.preprocessing import OneHotEncoder

X = data3.iloc[:,1:-1]
enc = OneHotEncoder(categories = 'auto').fit(X)  # instantiate + fit
result = enc.transform(X).toarray()   # convert the sparse result to a dense array
result

array([[0., 1., 0., 1.],
       [1., 0., 1., 0.],
       [1., 0., 0., 1.],
       [1., 0., 0., 1.],
       [0., 1., 0., 1.]])

In [143]:
# The encoding can still be inverted
pd.DataFrame(enc.inverse_transform(result))

|   | 0 | 1 |
|---|---|---|
| 0 | male | S |
| 1 | female | C |
| 2 | female | S |
| 3 | female | S |
| 4 | male | S |


In [145]:
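enc.get_feature_names()  # important interface: returns the name of each generated dummy column (deprecated in scikit-learn 1.0 and removed in 1.2 in favor of get_feature_names_out())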

array(['x0_female', 'x0_male', 'x1_C', 'x1_S'], dtype=object)

In [146]:
result
result.shape

(5, 4)

In [147]:
# axis=1 concatenates along columns, i.e. joins the tables side by side; axis=0 would stack them top to bottom
newdata = pd.concat([data3,pd.DataFrame(result)],axis=1)
newdata.head()

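|   | Age | Sex | Embarked | Survived | 0 | 1 | 2 | 3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | male | S | No | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 38.0 | female | C | Yes | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 26.0 | female | S | Yes | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 35.0 | female | S | Yes | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | male | S | No | 0.0 | 1.0 | 0.0 | 1.0 |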


In [148]:
newdata.drop(["Sex","Embarked"],axis=1,inplace=True)

In [150]:
newdata.columns = ["Age","Survived","Female","Male","Embarked_C","Embarked_S"]

In [151]:
newdata.head()

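|   | Age | Survived | Female | Male | Embarked_C | Embarked_S |
|---|---|---|---|---|---|---|
| 0 | 22.0 | No | 0.0 | 1.0 | 0.0 | 1.0 |
| 1 | 38.0 | Yes | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 26.0 | Yes | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 35.0 | Yes | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | No | 0.0 | 1.0 | 0.0 | 1.0 |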


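Features can be turned into dummy variables; can labels too? Yes: the class sklearn.preprocessing.LabelBinarizer creates dummy variables for labels. Many algorithms can handle multi-label problems (decision trees, for example), but this is uncommon in practice, so it is not covered in detail here.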
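
For completeness, a minimal sketch of LabelBinarizer on a made-up three-class label. Each class becomes one indicator column, in sorted class order.

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up label with three classes
y = ["No", "Yes", "Unknown", "Yes"]

lb = LabelBinarizer()
Y = lb.fit_transform(y)            # one indicator column per class
classes = list(lb.classes_)        # column order of Y
y_back = list(lb.inverse_transform(Y))

print(classes)
print(Y)
```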

## Sklearn - PCA, SVD