# Machine Learning Matrix

### Classification:
- logistic regression
- linear regression
- decision tree
- LDA
- Naive Bayes
- SVM
- KNN
- AdaBoost
- XGBoost

### Clustering
- K-Means
- EM cluster
- Mean-shift 
- DBSCAN
- hierarchical clustering
- PCA

### Recommendation System
##### Labeling:
- SimpleTagBased
- NormTagBased
- TagBased-TFIDF

##### Content based (static)

##### Collaborative filtering:
- User-CF
- Item-CF

##### CTR prediction:
- GBDT+LR 
- Wide&Deep
- FM
- FFM
- DeepFM
- NFM
- Deep & Cross
- xDeepFM
- DIN
- DIEN
- DSIN
---

## Personas:
- Personas are fictional characters that you create from your research to represent the different user types that might use your service, product, site, or brand in a similar way.

#### Role personas:
- unified user_id
- sex
- age
- location
- income
- education
- career
#### Engage personas:
- shopping preference
- sensitivity to big sales
#### Function personas:
- using time period
- frequency
- collections
- clicking
- rating

## Customer lifetime value (LTV)
* engagement
* acquisition
* retention
---
### Algorithm: KMeans
- K-Means is one of the most popular "clustering" algorithms. K-means stores $k$ centroids that it uses to define clusters. A point is considered to be in a particular cluster if it is closer to that cluster's centroid than any other centroid. 

- Implementation:
    - 1. Initialize $k$ centroids randomly: $ \{ \mu_1, \mu_2, \dots, \mu_k \} $
    - 2. Repeat until the assignments stop changing:
        - Assignment step: for each point $i$, set $ c^{(i)} = \arg\min_j ||x^{(i)} - \mu_j||^2 $
        - Update step: for each cluster $j$, set $ \mu_j = \frac{\sum_{i=1}^{m} 1\{ c^{(i)} = j \} x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j \}} $
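
The two alternating steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation: the function name `kmeans`, the `n_iters` and `seed` parameters, and the toy data are assumptions made for the example. The sketch draws the initial centroids from the data points and keeps the old centroid when a cluster ends up empty.

```python
import numpy as np


def kmeans(X, k, n_iters=100, seed=0):
    """Plain k-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iters):
        # Assignment step: squared distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving, so assignments are stable
        centroids = new_centroids
    return centroids, labels


# Two well-separated toy blobs (made-up data for illustration).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids, labels = kmeans(X, k=2)
```

The cluster numbers themselves are arbitrary: which blob is called 0 and which is called 1 depends on the random initialization.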
            
---
### Algorithm: Gaussian Mixture Model (GMM)
- A Gaussian mixture model is a probabilistic model for representing normally distributed subpopulations within an overall population.

- Implementation:
    - One-dimensional model:
    
        $ p(x) = \sum_{i=1}^{K}\phi_i N(x|\mu_i, \sigma_i) $
    
        $ N(x|\mu_i,\sigma_i) = \frac{1}{\sigma_i\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma^2_i}\right) $
 
        $ \sum_{i=1}^{K} \phi_i = 1 $
    
    - Multi-dimensional Model:
        
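        $ p(\overrightarrow{x}) = \sum_{i=1}^{K}\phi_i N( \overrightarrow{x} \  | \ \overrightarrow{\mu_i}, \Sigma_i) $
        
        $ N( \overrightarrow{x} \  | \ \overrightarrow{\mu_i}, \Sigma_i) = \frac{1}{\sqrt{(2\pi)^d|\Sigma_i|}} \exp\left(-\frac{1}{2}(\overrightarrow{x}-\overrightarrow{\mu_i})^T \Sigma_i^{-1} (\overrightarrow{x}-\overrightarrow{\mu_i})\right) $
        
        $ \sum_{i=1}^{K} \phi_i = 1 $, where $d$ is the dimension of $ \overrightarrow{x} $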
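
As a quick check of the one-dimensional formulas, the mixture density can be evaluated numerically. The sketch below is only an illustration: the helper names `gaussian_pdf` and `gmm_pdf` and the two-component parameters are hypothetical, chosen so that the weights sum to 1.

```python
import numpy as np


def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(x | mu, sigma); sigma is the standard deviation."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))


def gmm_pdf(x, phi, mu, sigma):
    """Mixture density p(x) = sum_i phi_i * N(x | mu_i, sigma_i)."""
    x = np.asarray(x, dtype=float)[..., None]  # add an axis to broadcast over components
    return (phi * gaussian_pdf(x, mu, sigma)).sum(axis=-1)


# Hypothetical two-component mixture; the weights phi must sum to 1.
phi = np.array([0.3, 0.7])
mu = np.array([-2.0, 3.0])
sigma = np.array([1.0, 0.5])

grid = np.linspace(-10.0, 10.0, 20001)
density = gmm_pdf(grid, phi, mu, sigma)
# Riemann sum as a sanity check that the mixture integrates to roughly 1.
area = density.sum() * (grid[1] - grid[0])
```

Because each component integrates to 1 and the weights sum to 1, `area` should come out very close to 1.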
        
---

### Algorithm: Expectation–Maximization (EM) 
- EM is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variables. 
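- E-step: compute the responsibilities $ \gamma(i,k) $, the posterior probability that point $i$ was generated by component $k$ under the current parameters
- M-step: re-estimate $ \pi_k , \mu_k, \Sigma_k $ from those responsibilities ($ \pi_k $ is the mixing weight, written $ \phi_k $ in the GMM section)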
- Log-likelihood: ![function](https://blog.pluskid.org/latexrender/pictures/9921aae9b012b629ab6e96a945de39be.png)
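
For a one-dimensional Gaussian mixture, the E and M steps can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name `em_gmm_1d`, the fixed iteration count, the quantile-based initialization of the means, and the synthetic data are all choices made for the example, not part of the notes above.

```python
import numpy as np


def em_gmm_1d(x, k, n_iters=200):
    """EM for a one-dimensional Gaussian mixture (illustrative sketch)."""
    n = len(x)
    pi = np.full(k, 1.0 / k)                           # mixing weights pi_k
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)      # assumed init: spread-out quantiles
    var = np.full(k, x.var())                          # variances sigma_k^2
    for _ in range(n_iters):
        # E-step: gamma[i, j] = P(component j | x_i) under the current parameters.
        dens = np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = pi * dens
        gamma = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the responsibilities.
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    # Log-likelihood of the data under the final parameters.
    dens = np.exp(-((x[:, None] - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    log_lik = np.log((pi * dens).sum(axis=1)).sum()
    return pi, mu, var, log_lik


# Synthetic data drawn from two Gaussians (made-up parameters for the demo).
rng = np.random.default_rng(42)
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(4.0, 1.0, 700)])
pi, mu, var, log_lik = em_gmm_1d(x, k=2)
```

EM only guarantees a local optimum, so a different initialization can lead to different estimates; on well-separated data like this the recovered means should land near the true ones.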

---

#### Project 1: clustering football teams

In [2]:

    from sklearn.cluster import KMeans
    from sklearn import preprocessing
    import pandas as pd
    import numpy as np

    # Load the rankings; the file is GBK-encoded (Chinese headers and country names).
    data = pd.read_csv('team_cluster_data.csv', encoding='gbk')
    data

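| Unnamed: 0 | 国家 (country) | 2019国际排名 (2019 international ranking) | 2018世界杯排名 (2018 World Cup ranking) | 2015亚洲杯排名 (2015 Asian Cup ranking) |
|---|---|---|---|---|
| 0 | 中国 (China) | 73 | 40 | 7 |
| 1 | 日本 (Japan) | 60 | 15 | 5 |
| 2 | 韩国 (South Korea) | 61 | 19 | 2 |
| 3 | 伊朗 (Iran) | 34 | 18 | 6 |
| 4 | 沙特 (Saudi Arabia) | 67 | 26 | 10 |
| 5 | 伊拉克 (Iraq) | 91 | 40 | 4 |
| 6 | 卡塔尔 (Qatar) | 101 | 40 | 13 |
| 7 | 阿联酋 (UAE) | 81 | 40 | 6 |
| 8 | 乌兹别克斯坦 (Uzbekistan) | 88 | 40 | 8 |
| 9 | 泰国 (Thailand) | 122 | 40 | 17 |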


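In [7]:

    # Scale the three ranking columns to [0, 1] so that no column dominates the distance.
    train_x = data[["2019国际排名", "2018世界杯排名", "2015亚洲杯排名"]]
    min_max_scaler = preprocessing.MinMaxScaler()
    train_x = min_max_scaler.fit_transform(train_x)

    # Cluster the teams into 3 groups; label numbers can vary between runs without a fixed random_state.
    kmeans = KMeans(n_clusters=3)
    kmeans.fit(train_x)
    predict_y = kmeans.predict(train_x)

    # Append the labels as a new column named '聚类结果' (cluster result).
    result = pd.concat((data, pd.DataFrame(predict_y)), axis=1)
    result.rename({0: '聚类结果'}, axis=1, inplace=True)
    result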

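| Unnamed: 0 | 国家 (country) | 2019国际排名 (2019 international ranking) | 2018世界杯排名 (2018 World Cup ranking) | 2015亚洲杯排名 (2015 Asian Cup ranking) | 聚类结果 (cluster result) |
|---|---|---|---|---|---|
| 0 | 中国 (China) | 73 | 40 | 7 | 0 |
| 1 | 日本 (Japan) | 60 | 15 | 5 | 2 |
| 2 | 韩国 (South Korea) | 61 | 19 | 2 | 2 |
| 3 | 伊朗 (Iran) | 34 | 18 | 6 | 2 |
| 4 | 沙特 (Saudi Arabia) | 67 | 26 | 10 | 2 |
| 5 | 伊拉克 (Iraq) | 91 | 40 | 4 | 0 |
| 6 | 卡塔尔 (Qatar) | 101 | 40 | 13 | 1 |
| 7 | 阿联酋 (UAE) | 81 | 40 | 6 | 0 |
| 8 | 乌兹别克斯坦 (Uzbekistan) | 88 | 40 | 8 | 0 |
| 9 | 泰国 (Thailand) | 122 | 40 | 17 | 1 |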
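
To make the result easier to interpret and to reproduce, a variant of the cell above might fix the random seed and map the cluster centers back to the original ranking scale. The sketch below is an assumption-laden illustration: it rebuilds the data inline from the table rather than reading `team_cluster_data.csv`, uses English column names that are not in the original file, and the choices `random_state=0` and `n_init=10` are arbitrary.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Rankings copied from the table above; the English column names are
# illustrative and are not the names used in team_cluster_data.csv.
data = pd.DataFrame({
    "country": ["China", "Japan", "South Korea", "Iran", "Saudi Arabia",
                "Iraq", "Qatar", "UAE", "Uzbekistan", "Thailand"],
    "rank_2019": [73, 60, 61, 34, 67, 91, 101, 81, 88, 122],
    "world_cup_2018": [40, 15, 19, 18, 26, 40, 40, 40, 40, 40],
    "asian_cup_2015": [7, 5, 2, 6, 10, 4, 13, 6, 8, 17],
})

features = ["rank_2019", "world_cup_2018", "asian_cup_2015"]
scaler = MinMaxScaler()
scaled = scaler.fit_transform(data[features])

# random_state fixes the initialization; n_init=10 runs 10 restarts and keeps the best.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
data["cluster"] = kmeans.fit_predict(scaled)

# Map the centroids back to the original ranking scale for interpretation.
centers = pd.DataFrame(scaler.inverse_transform(kmeans.cluster_centers_),
                       columns=features)
```

Cluster numbers are arbitrary labels, so the groups found here may be numbered differently from the output shown above even when the grouping itself matches.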
