# Are you a fan of Mayor Han?

<img src="https://i.imgur.com/GgPPfk4.png" width="900" />

(image source: [link](https://www.storm.mg/article/525105?srcid=gAAAAABcsea_WZX1btdtDfPbAuBP8p_-9GcnyFvKDxXCZIjJg9dNW26Gskj7oZGDlmsQQjZQq_6v0Pqx57FbQhbMy7_D2Mveyf8pecA6NS_0Kxhj9N1P1CI%253D))

## 1. Define Question
Han Kuo-yu (韓國瑜), the Mayor of Kaohsiung, has recently been in fashion, and even his name has become a buzzword.
In this section, I'll **determine whether a person is a fan of Mayor Han**.

## 2. Generate mock dataset 

This dataset contains 20000 rows. The attributes are:
- **gender**: male/female
- **age**: int
- **party**: KMT/Democratic Progressive Party/none
- **hasMCTfamily**: 0/1
    - 1 if anyone in your family works in the military, as a civil servant, or as a teacher
- **loveFerrisWheel**: 0/1
- **loveChina**: 0/1
- **willToReunification**: 0%~100%

The label indicates whether this person is a fan of Mayor Han or not:
- **fanOfHan**: 0/1

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import random
import math

### gender
- according to [wikipedia](https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3%E4%BA%BA%E5%8F%A3#%E4%BA%BA%E5%8F%A3), the average ratio of males to females is about 0.99
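
As a rough sketch of how the 0.99 ratio could be used directly: assuming it means 0.99 males per female, the probability of drawing 'male' is 0.99 / 1.99, or about 0.497. The names `SEX_RATIO`, `p_male`, and `sample_gender` are illustrative and not part of the notebook, which simply uses a 50/50 split.

```python
import numpy as np

SEX_RATIO = 0.99  # assumption: males per female
p_male = SEX_RATIO / (1 + SEX_RATIO)  # share of males in the population

def sample_gender(n, seed=0):
    """Draw n gender labels with P(male) = p_male."""
    rng = np.random.default_rng(seed)
    return np.where(rng.random(n) < p_male, 'male', 'female')

genders = sample_gender(20000)
print(round(p_male, 4), (genders == 'male').mean())
```

With a ratio this close to 1, the difference from a fair coin flip is negligible for a mock dataset.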

In [2]:
def decideGender(n):
    # n is always the placeholder value 1, so this is a fair coin flip
    return 'male' if n*random.uniform(0, 1) < 0.5 else 'female'

pdata = pd.DataFrame(1, index=range(0,20000), columns=['gender'])
pdata['gender'] = pdata['gender'].apply(decideGender)
pdata.head()

| | gender |
|---|---|
| 0 | female |
| 1 | male |
| 2 | male |
| 3 | male |
| 4 | female |


### age
- again, according to [wikipedia](https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3%E4%BA%BA%E5%8F%A3#%E4%BA%BA%E5%8F%A3), we can treat age as roughly normally distributed
- I simply set mean = 50 and standard deviation = 15
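
A minimal sketch of the same idea in vectorized form, assuming we want to keep ages inside a plausible range. The helper `sample_ages` and the bounds `low=15` and `high=90` are assumptions for illustration. Note that `np.clip` maps out-of-range values to the bound itself.

```python
import numpy as np

def sample_ages(n, mu=50, sigma=15, low=15, high=90, seed=0):
    """Draw n ages from N(mu, sigma), round them, and clip to [low, high]."""
    rng = np.random.default_rng(seed)
    ages = rng.normal(mu, sigma, n)
    return np.clip(np.rint(ages), low, high).astype(int)

ages = sample_ages(20000)
print(ages.min(), ages.max(), ages.mean())
```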

In [3]:
def decideAge(n):
    # pull outliers back into a plausible range; other ages pass through
    if n < 15:
        return 20
    elif n > 90:
        return 85
    return n

mu, sigma = 50, 15 # mean and standard deviation
pdata['age'] = np.random.normal(mu, sigma, 20000)
pdata['age'] = pdata['age'].apply(round).apply(decideAge)
pdata.head()

| | gender | age |
|---|---|---|
| 0 | female | 39 |
| 1 | male | 34 |
| 2 | male | 64 |
| 3 | male | 25 |
| 4 | female | 53 |


### party
- from the survey results for [1992/06~2018/12](https://esc.nccu.edu.tw/app/news.php?Sn=165#), we can read off the party preference distribution.
- random_percent = random(0, 1.0)
- if random_percent <= 0.246, then 'Democratic Progressive Party'
- else if 0.246 < random_percent <= 0.561, then 'KMT'
- else 'none'

![](https://i.imgur.com/2XfwH3V.png)
(image source: [link](https://esc.nccu.edu.tw/app/news.php?Sn=165#))
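
The thresholds above correspond to probabilities of 0.246, 0.315 (= 0.561 - 0.246), and 0.439 (= 1 - 0.561). As a sketch, the same draw can be done in one call with NumPy's `Generator.choice`; the names `PARTIES`, `PROBS`, and `sample_party` are illustrative and not part of the notebook.

```python
import numpy as np

PARTIES = ['Democratic Progressive Party', 'KMT', 'none']
# Interval widths implied by the thresholds 0.246 and 0.561
PROBS = [0.246, 0.561 - 0.246, 1 - 0.561]

def sample_party(n, seed=0):
    """Draw n party preferences with the probabilities in PROBS."""
    rng = np.random.default_rng(seed)
    return rng.choice(PARTIES, size=n, p=PROBS)

parties = sample_party(20000)
for name in PARTIES:
    print(name, (parties == name).mean())
```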

In [4]:
def decideParty(n):
    n = n*random.uniform(0, 1)
    if n <= 0.246:
        return 'Democratic Progressive Party'
    elif n > 0.246 and n <= 0.561:
        return 'KMT'
    else:
        return 'none'

pdata['party'] = 1
pdata['party'] = pdata['party'].apply(decideParty)
pdata.head()

| | gender | age | party |
|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party |
| 1 | male | 34 | Democratic Progressive Party |
| 2 | male | 64 | none |
| 3 | male | 25 | Democratic Progressive Party |
| 4 | female | 53 | KMT |


### willToReunification
- one's will to reunify with China
- if party == 'KMT', the value is drawn uniformly from 60%~100%
- else it is drawn uniformly from 0%~100%, although I believe that no one wants this to happen
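
A vectorized sketch of this step, using the bounds from the code cell below (a lower bound of 0.6 for KMT and 0 for everyone else). The helper `sample_will` and the small `demo` frame are hypothetical and only there to make the example self-contained.

```python
import numpy as np
import pandas as pd

def sample_will(party, seed=0):
    """Draw willToReunification per row; KMT rows get a higher lower bound."""
    rng = np.random.default_rng(seed)
    is_kmt = (party == 'KMT').to_numpy()
    low = np.where(is_kmt, 0.6, 0.0)  # per-row lower bound
    return np.round(rng.uniform(low, 1.0), 2)

demo = pd.DataFrame({'party': ['KMT', 'none', 'Democratic Progressive Party', 'KMT']})
demo['willToReunification'] = sample_will(demo['party'])
print(demo)
```

This avoids the row-by-row `iterrows` loop, which matters little at 20000 rows but grows slow on larger datasets.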

In [5]:
pdata['willToReunification'] = np.nan
for index, row in pdata.iterrows():
    pdata.at[index,'willToReunification'] = round(random.uniform(0.6, 1), 2) if row['party'] == 'KMT' else round(random.uniform(0, 1), 2)

pdata.head()

| | gender | age | party | willToReunification |
|---|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party | 0.33 |
| 1 | male | 34 | Democratic Progressive Party | 0.44 |
| 2 | male | 64 | none | 0.01 |
| 3 | male | 25 | Democratic Progressive Party | 0.67 |
| 4 | female | 53 | KMT | 0.79 |


### hasMCTfamily
- if someone in one's family works as a **military personnel**, a **civil servant** or a **teacher** (軍公教人員), there's a good chance that he/she won't like the Democratic Progressive Party because of its "pension reform (年金改革)" policy. He/she might therefore feel forced to support Han in order to keep Chi-Mai Chen, the Democratic Progressive Party candidate, from winning the election.
- from [this report](http://www.fund.gov.tw/public/data/6851558871.pdf), I found that around 3% of the population works as a military personnel, a civil servant or a teacher. Let's assume that each of them has a family of 6 people, all of whom sympathize and decide to vote for Mayor Han. That's approximately 18% in total.
- so randomly mark 18% of the dataset
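
The arithmetic behind the 18% figure, written out as a sketch. The constant names are illustrative; the family size of 6 is the assumption stated above, and families are assumed not to overlap.

```python
import numpy as np

MCT_SHARE = 0.03   # share of the population working as MCT, from the report above
FAMILY_SIZE = 6    # assumed number of people per family
p_mct_family = MCT_SHARE * FAMILY_SIZE  # 0.03 * 6 = 0.18

rng = np.random.default_rng(0)
has_mct_family = (rng.random(20000) <= p_mct_family).astype(int)
print(p_mct_family, has_mct_family.mean())
```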

In [6]:
def decideMCTfamily(_):
    n = random.uniform(0, 1)
    return 1 if n <= 0.18 else 0

pdata['hasMCTfamily'] = 1
pdata['hasMCTfamily'] = pdata['hasMCTfamily'].apply(decideMCTfamily)
pdata.head()

| | gender | age | party | willToReunification | hasMCTfamily |
|---|---|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party | 0.33 | 1 |
| 1 | male | 34 | Democratic Progressive Party | 0.44 | 1 |
| 2 | male | 64 | none | 0.01 | 0 |
| 3 | male | 25 | Democratic Progressive Party | 0.67 | 0 |
| 4 | female | 53 | KMT | 0.79 | 0 |


### loveChina
- it depends, but if one's preferred party is KMT, he/she is more likely (70%) to love China
- if party == 'KMT' and random_percent > 0.3, then 1
- else 0
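
The rule above can be sketched without a loop: a row gets 1 only when it is KMT and an independent uniform draw exceeds 0.3. The helper `sample_love_china` and the `demo_party` series are hypothetical names used for illustration.

```python
import numpy as np
import pandas as pd

def sample_love_china(party, p_kmt=0.7, seed=0):
    """Return 1 with probability p_kmt for KMT rows and 0 for all other rows."""
    rng = np.random.default_rng(seed)
    is_kmt = (party == 'KMT').to_numpy()
    draw = rng.random(len(party))
    return (is_kmt & (draw > 1 - p_kmt)).astype(int)

demo_party = pd.Series(['KMT'] * 5000 + ['none'] * 5000)
love_china = sample_love_china(demo_party)
print(love_china[:5000].mean(), love_china[5000:].mean())
```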

In [7]:
pdata['loveChina'] = 1
for index, row in pdata.iterrows():
    pdata.at[index,'loveChina'] = 1 if row['party'] == 'KMT' and random.uniform(0, 1) > 0.3 else 0

pdata.head()

| | gender | age | party | willToReunification | hasMCTfamily | loveChina |
|---|---|---|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party | 0.33 | 1 | 0 |
| 1 | male | 34 | Democratic Progressive Party | 0.44 | 1 | 0 |
| 2 | male | 64 | none | 0.01 | 0 | 0 |
| 3 | male | 25 | Democratic Progressive Party | 0.67 | 0 | 0 |
| 4 | female | 53 | KMT | 0.79 | 0 | 1 |


### loveFerrisWheel
- haters gonna hate.
- randomly selected (50%)

In [8]:
def decideLoveFerrisWheel(_):
    return 1 if random.uniform(0, 1) < 0.5 else 0

pdata['loveFerrisWheel'] = 1
pdata['loveFerrisWheel'] = pdata['loveFerrisWheel'].apply(decideLoveFerrisWheel)
pdata.head()

| | gender | age | party | willToReunification | hasMCTfamily | loveChina | loveFerrisWheel |
|---|---|---|---|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party | 0.33 | 1 | 0 | 0 |
| 1 | male | 34 | Democratic Progressive Party | 0.44 | 1 | 0 | 0 |
| 2 | male | 64 | none | 0.01 | 0 | 0 | 0 |
| 3 | male | 25 | Democratic Progressive Party | 0.67 | 0 | 0 | 1 |
| 4 | female | 53 | KMT | 0.79 | 0 | 1 | 0 |


## 3. Define rules:


### fanOfHan
Now I'm going to label the dataset. The rules below are checked in order, as implemented in the next cell:
1. if one's party is KMT => fan of Mayor Han
2. if one's will to reunify with China is 90% or higher => fan of Mayor Han
3. if gender = 'female' and age >= 55 => fan of Mayor Han
4. if one matches 2 or more of the rules listed below => fan of Mayor Han
    - loveChina = 1
    - hasMCTfamily = 1
    - loveFerrisWheel = 1
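
A vectorized sketch of the labeling logic as it is implemented in the cell below: KMT, or willToReunification >= 0.9, or female aged 55 and over, or at least two of the three 0/1 flags. The function name `label_fans` is hypothetical; the `sample` frame reuses the five rows shown in the outputs above.

```python
import pandas as pd

def label_fans(df):
    """Return 1 for rows that satisfy any of the labeling rules, else 0."""
    is_kmt = df['party'] == 'KMT'
    wants_reunification = df['willToReunification'] >= 0.9
    aged_female = (df['gender'] == 'female') & (df['age'] >= 55)
    # number of matched 0/1 flags per row
    score = df['loveChina'] + df['hasMCTfamily'] + df['loveFerrisWheel']
    return (is_kmt | wants_reunification | aged_female | (score > 1)).astype(int)

sample = pd.DataFrame({
    'gender': ['female', 'male', 'male', 'male', 'female'],
    'age': [39, 34, 64, 25, 53],
    'party': ['Democratic Progressive Party', 'Democratic Progressive Party',
              'none', 'Democratic Progressive Party', 'KMT'],
    'willToReunification': [0.33, 0.44, 0.01, 0.67, 0.79],
    'hasMCTfamily': [1, 1, 0, 0, 0],
    'loveChina': [0, 0, 0, 0, 1],
    'loveFerrisWheel': [0, 0, 0, 1, 0],
})
sample['fanOfHan'] = label_fans(sample)
print(sample['fanOfHan'].tolist())
```

Only the last row (KMT) is labeled as a fan, which agrees with the `fanOfHan` column in the notebook output.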

In [9]:
pdata['fanOfHan'] = 0
for index, row in pdata.iterrows():
    if (row['party'] == 'KMT') or (row['willToReunification'] >= 0.9):
        pdata.at[index,'fanOfHan'] = 1
        continue
    if (row['gender'] == 'female') and (row['age'] >= 55):
        pdata.at[index,'fanOfHan'] = 1
        continue
        
    match = 0
    match = match + row['loveChina'] + row['hasMCTfamily'] + row['loveFerrisWheel']

    pdata.at[index,'fanOfHan'] = 1 if match > 1 else 0

# save the dataset as csv
pdata.to_csv('mock_data.csv')
pdata.head()

| | gender | age | party | willToReunification | hasMCTfamily | loveChina | loveFerrisWheel | fanOfHan |
|---|---|---|---|---|---|---|---|---|
| 0 | female | 39 | Democratic Progressive Party | 0.33 | 1 | 0 | 0 | 0 |
| 1 | male | 34 | Democratic Progressive Party | 0.44 | 1 | 0 | 0 | 0 |
| 2 | male | 64 | none | 0.01 | 0 | 0 | 0 | 0 |
| 3 | male | 25 | Democratic Progressive Party | 0.67 | 0 | 0 | 1 | 0 |
| 4 | female | 53 | KMT | 0.79 | 0 | 1 | 0 | 1 |


## 4. Construct Decision Tree with tools:

In [10]:
# from scipy.stats import logistic
# def sigmoid(x):
#     return 1 / (1 + math.exp(-x))

# pdata['fanOfHan'] = 0
# for index, row in pdata.iterrows():
#     isAgedFemale = 1 if row['gender'] == 'Female' and row['age'] > 50 else 0
#     isKMT = 1 if row['party'] == 'KMT' else 0
#     percentage = (isAgedFemale*0.1) + (isKMT*0.99) + (row['hasMCTfamily']*0.8) + (row['loveFerrisWheel']*0.7) + (row['loveChina']*0.5) + (row['wantToReunification']*0.99)
# #     percentage = logistic.cdf(percentage)
#     percentage = sigmoid(percentage)
# #     print(percentage)
#     pdata.at[index,'fanOfHan'] = 1 if percentage > 0.9 else 0

# pdata.head()

### Deal with string columns

In [11]:
# gender
gender_map = {'male': 1, 'female': 0}
pdata['gender'] = pdata['gender'].map(gender_map)

# party
party_map = {'KMT': 1, 'Democratic Progressive Party': 2, 'none': 3}
pdata['party'] = pdata['party'].map(party_map)

pdata.head()

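| | gender | age | party | willToReunification | hasMCTfamily | loveChina | loveFerrisWheel | fanOfHan |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 39 | 2 | 0.33 | 1 | 0 | 0 | 0 |
| 1 | 1 | 34 | 2 | 0.44 | 1 | 0 | 0 | 0 |
| 2 | 1 | 64 | 3 | 0.01 | 0 | 0 | 0 | 0 |
| 3 | 1 | 25 | 2 | 0.67 | 0 | 0 | 1 | 0 |
| 4 | 0 | 53 | 1 | 0.79 | 0 | 1 | 0 | 1 |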
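
The integer codes 1/2/3 for party imply an ordering that does not really exist. A decision tree can still split on them, but as an alternative sketch (not what this notebook does), `pd.get_dummies` produces one 0/1 column per party. The `demo` frame is illustrative.

```python
import pandas as pd

demo = pd.DataFrame({'party': ['KMT', 'Democratic Progressive Party', 'none', 'KMT']})
# One indicator column per distinct party value
one_hot = pd.get_dummies(demo['party'], prefix='party')
print(one_hot.columns.tolist())
```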


### Separate into training set and testing set

In [12]:
pdata_train = pdata[:15000]
pdata_test = pdata[15000:]
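
Slicing works here because the rows were generated independently, so their order carries no information. For data where order might matter, a possible alternative is scikit-learn's `train_test_split`, which shuffles and can stratify on the label. The sketch below uses a small hypothetical `demo` frame with the same 75/25 proportion as the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

demo = pd.DataFrame({'x': range(20), 'fanOfHan': [0, 1] * 10})
train, test = train_test_split(
    demo,
    test_size=0.25,             # same proportion as 15000 / 5000
    random_state=0,             # fixed seed for reproducibility
    stratify=demo['fanOfHan'],  # keep the label balance in both parts
)
print(len(train), len(test))
```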

### Separate attributes and answer

In [13]:
y = pdata_train['fanOfHan'].values
pdata_train = pdata_train.drop('fanOfHan', axis=1)

### Draw Decision Tree

In [14]:
from sklearn.tree import DecisionTreeClassifier
from io import StringIO  # sklearn.externals.six was removed from recent scikit-learn releases
from sklearn.tree import export_graphviz
import pydotplus

dtree=DecisionTreeClassifier(max_depth=4)
dtree.fit(pdata_train, y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(pdata_train),
                class_names=['Chi Mai','Mayor Han'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# graph.write_pdf("tree.pdf")
graph.write_png("tree.png")

True

<img src="tree.png" width="900" />
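
If Graphviz or pydotplus is not available, one option is scikit-learn's `export_text`, which prints the learned splits as plain text. The sketch below trains on a tiny synthetic dataset; the feature names `isKMT` and `agedFemale` and the OR-style label are assumptions for illustration, not the notebook's data.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 2))  # two hypothetical 0/1 features
y = X[:, 0] | X[:, 1]                  # label is 1 if either feature is 1

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
text = export_text(tree, feature_names=['isKMT', 'agedFemale'])
print(text)
```

The same call should work on the notebook's `dtree` with `feature_names=list(pdata_train)`.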

## 5. Compare with rules and then calculate accuracy

In [15]:
y_test = pdata_test['fanOfHan'].values
X_test = pdata_test.drop('fanOfHan', axis=1)

y_predict = dtree.predict(X_test)

y_predict

array([1, 1, 0, ..., 1, 1, 0])

In [16]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

0.9576
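
An accuracy of 0.9576 on 5000 test rows means 4788 predictions agree with the rule-based label. As a small sketch of what `accuracy_score` computes, the example below compares it with a manual calculation and adds a confusion matrix to show where the errors fall. The arrays `y_true` and `y_pred` are made-up toy values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])

# Accuracy is the fraction of positions where prediction equals label
manual_accuracy = (y_true == y_pred).mean()
# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(manual_accuracy, accuracy_score(y_true, y_pred))
print(cm)
```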