# Are you a fan of Mayor Han?

<img src="https://i.imgur.com/GgPPfk4.png" width="900" />

(image source: [link](https://www.storm.mg/article/525105?srcid=gAAAAABcsea_WZX1btdtDfPbAuBP8p_-9GcnyFvKDxXCZIjJg9dNW26Gskj7oZGDlmsQQjZQq_6v0Pqx57FbQhbMy7_D2Mveyf8pecA6NS_0Kxhj9N1P1CI%253D))

## 1. Define Question
Mayor of Kaohsiung Han Kuo-yu (韓國瑜) has recently been in fashion, and even his name has become a buzzword.
In this section, I'll **determine whether a person is a fan of Mayor Han**.

## 2. Design attributes (features)
This dataset contains 20000 rows. The attributes are:
- **gender**: male/female
- **age**: int
- **party**: KMT/Democratic Progressive Party/None
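- **hasMCTfamily**: 0/1
    - 1 if any member of your family works as military personnel, a civil servant or a teacher
- **loveMoney**: 0/1
- **loveFerrisWheel**: 0/1
- **loveChina**: 0/1 (part of the initial design; not generated in the mock dataset below)
- **willToReunification**: 0%~100% (part of the initial design; not generated in the mock dataset below)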

and the label indicates whether this person loves mayor Han or not:
- **fanOfHan**: 0/1

## 3. Define rules:
This is a simple decision tree that defines the rules.
<img src="imgs/rules_simple.png" width="700" />


## 4. Generate mock dataset 

In [1]:
import pandas as pd
import numpy as np
import scipy as sp
import random
import math

### gender
- according to [wikipedia](https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3%E4%BA%BA%E5%8F%A3#%E4%BA%BA%E5%8F%A3), the ratio of males to females is about 0.99, so each gender is drawn with a probability of roughly 0.5

In [2]:
def decideGender(n):
    # n is the placeholder value 1, so this is a fair coin flip
    return 'male' if n*random.uniform(0, 1) < 0.5 else 'female'

pdata = pd.DataFrame(1, index=range(0,20000), columns=['gender'])
pdata['gender'] = pdata['gender'].apply(decideGender)
pdata.head()

Unnamed: 0,gender
0,male
1,female
2,male
3,male
4,female
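
The cell above uses a fair coin. As a minimal sketch of how the 0.99 ratio could be used directly, assuming it means 0.99 males per female, the probability of drawing a male becomes 0.99 / 1.99. The names `MALE_PER_FEMALE`, `P_MALE` and `draw_gender` are hypothetical and not part of the notebook.

```python
import random

# Assumption: 0.99 is the number of males per female, so
# P(male) = 0.99 / (1 + 0.99), slightly below 0.5.
MALE_PER_FEMALE = 0.99
P_MALE = MALE_PER_FEMALE / (1 + MALE_PER_FEMALE)

def draw_gender(rng):
    """Return 'male' with probability P_MALE, else 'female'."""
    return 'male' if rng.random() < P_MALE else 'female'

rng = random.Random(0)  # fixed seed so the sketch is repeatable
genders = [draw_gender(rng) for _ in range(20000)]
male_share = genders.count('male') / len(genders)
print(round(P_MALE, 4), round(male_share, 3))
```

The difference from 0.5 is small, so the fair coin used in the notebook is a reasonable simplification.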


### age
- again, according to [wikipedia](https://zh.wikipedia.org/wiki/%E8%87%BA%E7%81%A3%E4%BA%BA%E5%8F%A3#%E4%BA%BA%E5%8F%A3), we can treat age as normally distributed
- I simply set mean = 50 and standard deviation = 15

In [3]:
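def decideAge(n):
    # Pull out-of-range ages back into range.
    # Note: this helper is defined here but never applied to the column below.
    if n < 15:
        return 20
    elif n > 90:
        return 85
    return n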

mu, sigma = 50, 15 # mean and standard deviation
pdata['age'] = np.random.normal(mu, sigma, 20000)
pdata['age'] = pdata['age'].apply(round)
pdata.head()

Unnamed: 0,gender,age
0,male,59
1,female,45
2,male,45
3,male,51
4,female,60
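
A normal distribution can produce implausible ages at the tails, and the `decideAge` helper is not applied to the column in the cell above. The following is a minimal sketch of one way to clamp the draws, assuming bounds of 20 and 90; `MIN_AGE`, `MAX_AGE` and `draw_age` are hypothetical names chosen for illustration.

```python
import random

MU, SIGMA = 50, 15          # same mean and standard deviation as above
MIN_AGE, MAX_AGE = 20, 90   # hypothetical bounds, chosen for illustration

def draw_age(rng):
    """Draw a rounded normal age and clamp it into [MIN_AGE, MAX_AGE]."""
    age = round(rng.gauss(MU, SIGMA))
    return max(MIN_AGE, min(MAX_AGE, age))

rng = random.Random(0)  # fixed seed so the sketch is repeatable
ages = [draw_age(rng) for _ in range(20000)]
mean_age = sum(ages) / len(ages)
print(min(ages), max(ages), round(mean_age, 1))
```

Clamping piles a small amount of probability mass onto the two bounds, which is usually acceptable for mock data.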


### party
- from the research [1992/06~2018/12](https://esc.nccu.edu.tw/app/news.php?Sn=165#), we can tell the party preference distribution.
- draw random_percent uniformly from [0, 1]
- if random_percent <= 0.246, then 'Democratic Progressive Party'
- else if 0.246 < random_percent <= 0.561, then 'KMT'
- else 'None'
![](https://i.imgur.com/2XfwH3V.png)
(image source: [link](https://esc.nccu.edu.tw/app/news.php?Sn=165#))

In [4]:
def decideParty(n):
    n = n*random.uniform(0, 1)
    if n <= 0.246:
        return 'Democratic Progressive Party'
    elif n > 0.246 and n <= 0.561:
        return 'KMT'
    else:
        return 'None'

pdata['party'] = 1
pdata['party'] = pdata['party'].apply(decideParty)
pdata.head()

Unnamed: 0,gender,age,party
0,male,59,None
1,female,45,Democratic Progressive Party
2,male,45,Democratic Progressive Party
3,male,51,Democratic Progressive Party
4,female,60,KMT
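
The thresholds above imply a share for each party. As a rough sketch, the same draw can be written with explicit weights using `random.choices` from the standard library; `PARTY_SHARES` is a hypothetical name, and the shares are simply the differences between the thresholds.

```python
import random

# Shares implied by the thresholds above:
# DPP = 0.246, KMT = 0.561 - 0.246, None = the remainder.
PARTY_SHARES = {
    'Democratic Progressive Party': 0.246,
    'KMT': 0.561 - 0.246,
    'None': 1 - 0.561,
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
parties = rng.choices(
    population=list(PARTY_SHARES),
    weights=list(PARTY_SHARES.values()),
    k=20000,
)
observed = {p: parties.count(p) / len(parties) for p in PARTY_SHARES}
print(observed)
```

Writing the shares out makes it easy to check that they sum to 1 and to compare them with the observed frequencies.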


### hasMCTfamily
- if someone in one's family works as **military personnel**, a **civil servant** or a **teacher** (軍公教人員), there's a good chance that he/she won't like the Democratic Progressive Party because of its "pension reform (年金改革)" policy. He/she might therefore feel compelled to support Han in order to keep Chi-Mai Chen, the Democratic Progressive Party candidate, from winning the election.
- from [this report](http://www.fund.gov.tw/public/data/6851558871.pdf), I found that around 3% of the population works as military personnel, civil servants or teachers. Let's assume that everyone in their families sympathizes with them and therefore supports Mayor Han, and that each family has 6 people. That's approximately 3% × 6 = 18% in total.
- randomly choose 18% of the dataset

In [5]:
def decideMCTfamily(_):
    n = random.uniform(0, 1)
    return 1 if n <= 0.18 else 0

pdata['hasMCTfamily'] = 1
pdata['hasMCTfamily'] = pdata['hasMCTfamily'].apply(decideMCTfamily)
pdata.head()

Unnamed: 0,gender,age,party,hasMCTfamily
0,male,59,None,0
1,female,45,Democratic Progressive Party,0
2,male,45,Democratic Progressive Party,0
3,male,51,Democratic Progressive Party,0
4,female,60,KMT,0
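
The 18% figure is a back-of-the-envelope product, and the flag is then an independent draw per person. A minimal sketch of that arithmetic and the resulting share, where `MCT_WORKER_SHARE`, `FAMILY_SIZE` and `P_MCT_FAMILY` are hypothetical names for the numbers quoted above:

```python
import random

MCT_WORKER_SHARE = 0.03   # share of the population in these jobs (from the report)
FAMILY_SIZE = 6           # assumed family size, as in the text
P_MCT_FAMILY = MCT_WORKER_SHARE * FAMILY_SIZE   # roughly 0.18

rng = random.Random(0)  # fixed seed so the sketch is repeatable
has_mct_family = [1 if rng.random() <= P_MCT_FAMILY else 0 for _ in range(20000)]
share = sum(has_mct_family) / len(has_mct_family)
print(round(P_MCT_FAMILY, 2), round(share, 3))
```

Because each row is drawn independently, the observed share only approximates 18% and varies slightly from run to run without a fixed seed.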


### loveMoney
- "Sell Goods, welcome people, prosperous Kaohsiung." --- by Mayor Han
- most people love money.

In [6]:
# assign loveMoney at random: 0 or 1 with equal probability
pdata['loveMoney'] = np.random.randint(0, 2, size=len(pdata))

pdata.head()

Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney
0,male,59,None,0,0
1,female,45,Democratic Progressive Party,0,1
2,male,45,Democratic Progressive Party,0,1
3,male,51,Democratic Progressive Party,0,0
4,female,60,KMT,0,1


### loveFerrisWheel
- haters' gonna hate.
- randomly selected (50%)

In [7]:
def decideLoveFerrisWheel(_):
    return random.randint(0, 1) # 1 if random.uniform(0, 1) < 0.5 else 0

pdata['loveFerrisWheel'] = 1
pdata['loveFerrisWheel'] = pdata['loveFerrisWheel'].apply(decideLoveFerrisWheel)
pdata.head()

Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney,loveFerrisWheel
0,male,59,None,0,0,1
1,female,45,Democratic Progressive Party,0,1,1
2,male,45,Democratic Progressive Party,0,1,1
3,male,51,Democratic Progressive Party,0,0,1
4,female,60,KMT,0,1,0


### fanOfHan
Now, label the dataset based on the rules defined in section 3. The code below implements them as follows:
1. if one's party is KMT and loveFerrisWheel = 1 => fan of Mayor Han
2. if one's party is KMT, loveFerrisWheel = 0, gender = 'female' and age >= 55 => fan of Mayor Han
3. if one's party is not KMT, hasMCTfamily = 1 and loveMoney = 1 => fan of Mayor Han
4. otherwise => not a fan of Mayor Han

In [8]:
pdata['fanOfHan'] = 0
for index, row in pdata.iterrows():
    if (row['party'] == 'KMT'):
        if (row['loveFerrisWheel'] == 1):
            pdata.at[index,'fanOfHan'] = 1
        else:
            if (row['gender'] == 'female') and (row['age'] >= 55):
                pdata.at[index,'fanOfHan'] = 1
    else:
        if (row['hasMCTfamily'] == 1):
            if (row['loveMoney'] == 1):
                pdata.at[index,'fanOfHan'] = 1

# save the dataset as csv
pdata.to_csv('mock_data.csv')
pdata.head()

Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney,loveFerrisWheel,fanOfHan
0,male,59,None,0,0,1,0
1,female,45,Democratic Progressive Party,0,1,1,0
2,male,45,Democratic Progressive Party,0,1,1,0
3,male,51,Democratic Progressive Party,0,0,1,0
4,female,60,KMT,0,1,0,1
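
The nested loop above can also be read as a single function of one row. The sketch below restates it that way and checks it against the five rows printed by `pdata.head()`; `is_fan` is a hypothetical helper, and row 0's party is taken to be `'None'`, which matches its later encoding as 3.

```python
def is_fan(row):
    """Label one person with the same nested rules as the loop above."""
    if row['party'] == 'KMT':
        if row['loveFerrisWheel'] == 1:
            return 1
        return 1 if row['gender'] == 'female' and row['age'] >= 55 else 0
    return 1 if row['hasMCTfamily'] == 1 and row['loveMoney'] == 1 else 0

# The five rows printed by pdata.head() above (row 0's party assumed to be 'None').
columns = ['gender', 'age', 'party', 'hasMCTfamily', 'loveMoney', 'loveFerrisWheel']
head_rows = [
    ('male', 59, 'None', 0, 0, 1),
    ('female', 45, 'Democratic Progressive Party', 0, 1, 1),
    ('male', 45, 'Democratic Progressive Party', 0, 1, 1),
    ('male', 51, 'Democratic Progressive Party', 0, 0, 1),
    ('female', 60, 'KMT', 0, 1, 0),
]
labels = [is_fan(dict(zip(columns, values))) for values in head_rows]
print(labels)
```

Since the function only indexes the row by column name, it should also work with `pdata.apply(is_fan, axis=1)` in place of the explicit loop.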


## 5. Construct Decision Tree:

### Deal with string

In [9]:
# gender
gender_map = {'male': 1, 'female': 0}
pdata['gender'] = pdata['gender'].map(gender_map)

# party
party_map = {'KMT': 1, 'Democratic Progressive Party': 2, 'None': 3}
pdata['party'] = pdata['party'].map(party_map)

pdata.head()

Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney,loveFerrisWheel,fanOfHan
0,1,59,3,0,0,1,0
1,0,45,2,0,1,1,0
2,1,45,2,0,1,1,0
3,1,51,2,0,0,1,0
4,0,60,1,0,1,0,1


### Separate into training set and testing set

In [10]:
pdata_train = pdata[:15000]
pdata_test = pdata[15000:]

### Separate attributes and answer

In [11]:
y = pdata_train['fanOfHan'].values
pdata_train = pdata_train.drop('fanOfHan', axis=1)

### Draw Decision Tree

In [12]:
from sklearn.tree import DecisionTreeClassifier
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from sklearn.tree import export_graphviz
import pydotplus

dtree=DecisionTreeClassifier(max_depth=6)
dtree.fit(pdata_train, y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(pdata_train),
                class_names=['Chi Mai','Mayor Han'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# graph.write_pdf("tree.pdf")
graph.write_png("imgs/tree.png")

True

<img src="imgs/tree.png" width="700" />

### Calculate accuracy

In [13]:
y_test = pdata_test['fanOfHan'].values
X_test = pdata_test.drop('fanOfHan', axis=1)

y_predict = dtree.predict(X_test)

y_predict

array([1, 1, 1, ..., 0, 0, 0])

In [14]:
from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_predict)

1.0
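
For reference, accuracy here is just the fraction of test rows whose predicted label equals the true label. A minimal sketch of that computation on made-up labels (the helper `accuracy` and the demo lists are hypothetical, not taken from the dataset):

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels, only to show the arithmetic.
y_true_demo = [1, 0, 1, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0]
demo_accuracy = accuracy(y_true_demo, y_pred_demo)
print(demo_accuracy)  # 4 of 5 match
```

A score of 1.0 therefore means that every one of the 5000 held-out rows was predicted correctly.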

# Report
The following part is the report, which can also be found [here](https://github.com/jwang0306/fundamentals-data-analytics/tree/master/HW2/report.md).

## 6. Compare the rules

### (a). Original rules vs. rules generated by decision tree

#### Original rules:
<img src="imgs/rules_simple.png" width="500" />

#### Generated rules:
<img src="imgs/tree.png" width="500" />


### (b). Observation
- The generated rules are exactly what I originally had in mind, though the order of the rules is slightly different.
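
Besides comparing the two images, the learned rules can be printed as text. The sketch below assumes a scikit-learn version that provides `sklearn.tree.export_text`, and it trains on a small, uniformly random sample rather than the notebook's dataset; `label`, `FEATURES` and the sample itself are hypothetical.

```python
import random
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ['gender', 'age', 'party', 'hasMCTfamily', 'loveMoney', 'loveFerrisWheel']

def label(gender, age, party, has_mct, love_money, love_ferris):
    """Same rules as the labelling loop, on encoded values (KMT = 1, female = 0)."""
    if party == 1:
        return 1 if love_ferris == 1 or (gender == 0 and age >= 55) else 0
    return 1 if has_mct == 1 and love_money == 1 else 0

# Small hypothetical sample, much smaller than the 20000-row dataset.
rng = random.Random(0)
X = [[rng.randint(0, 1), rng.randint(20, 90), rng.randint(1, 3),
      rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1)]
     for _ in range(2000)]
y = [label(*row) for row in X]

clf = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X, y)
rules_text = export_text(clf, feature_names=FEATURES)
train_accuracy = clf.score(X, y)
print(rules_text)
print(train_accuracy)
```

Reading the printed thresholds (for example a split on `party` around 1.5 separating KMT from the rest) makes it easier to line the learned splits up with the hand-written rules.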

## 7. Further Discussion
### (a). If we try more complicated rules, will the tree still look similar?
<img src="imgs/complicated_rules.png" width="500" />

In [15]:
test_data = pdata.copy()
test_data = test_data.drop('fanOfHan', axis=1)
test_data['fanOfHan'] = 0
# complicated rules
for index, row in test_data.iterrows():
    if (row['party'] == 1):
        test_data.at[index,'fanOfHan'] = 1
    else:
        if (row['loveMoney'] == 1):
            if (row['loveFerrisWheel'] == 1):
                test_data.at[index,'fanOfHan'] = 1
        else:
            if (row['hasMCTfamily'] == 1):
                if (row['party'] == 2):
                    test_data.at[index,'fanOfHan'] = 1
            else:
                if (row['age'] >= 55):
                    if (row['gender'] == 0):
                        test_data.at[index,'fanOfHan'] = 1
                        
test_data.head()

Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney,loveFerrisWheel,fanOfHan
0,1,59,3,0,0,1,0
1,0,45,2,0,1,1,1
2,1,45,2,0,1,1,1
3,1,51,2,0,0,1,0
4,0,60,1,0,1,0,1


In [16]:
# split to train, test, y
test_data_train = test_data[:15000]
test_data_test = test_data[15000:]
test_y = test_data_train['fanOfHan'].values
test_data_train = test_data_train.drop('fanOfHan', axis=1)

# draw the tree
dtree=DecisionTreeClassifier(max_depth=10)
# fit on the labels produced by the complicated rules
dtree.fit(test_data_train, test_y)
dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(test_data_train),
                class_names=['Chi Mai','Mayor Han'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("imgs/test_complicated.png")

True

<img src="imgs/test_complicated.png" width="700" />

### Observation
- The tree looks very different. It still reaches 100% accuracy on the training data; however, some errors occur when predicting new data.

### (b). If we add some noise to the dataset, what will the tree look like?

In [17]:
test_data = pdata.copy()

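# randomly select 49 row indices to overwrite with noise (duplicates are possible)
idx = list(random.randint(0, 20000-1) for x in range(1, 50))
print(idx)
for i in idx:
    test_data.loc[i] = [
        random.randint(0, 1),    # gender
        random.randint(20, 90),  # age
        random.randint(1, 3),    # party
        random.uniform(0, 1),    # hasMCTfamily: a float in [0, 1], so this column becomes float
        random.randint(0, 1),    # loveMoney
        random.randint(0, 1),    # loveFerrisWheel
        random.randint(0, 1)     # fanOfHan
    ]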

test_data_train = test_data[:15000]
test_data_test = test_data[15000:]
test_y = test_data_train['fanOfHan'].values
test_data_train = test_data_train.drop('fanOfHan', axis=1)
test_data_train.head()

[2095, 13901, 18483, 14646, 15133, 1781, 6143, 5537, 8648, 1981, 18088, 4168, 10014, 9675, 1346, 5999, 14989, 18901, 9862, 8693, 8772, 10683, 19428, 1574, 2864, 12686, 9445, 19876, 15735, 17067, 1469, 1141, 2399, 6448, 14589, 19300, 14990, 17562, 8494, 8840, 11465, 4433, 19916, 10909, 1542, 2040, 3332, 6062, 7412]


Unnamed: 0,gender,age,party,hasMCTfamily,loveMoney,loveFerrisWheel
0,1,59,3,0.0,0,1
1,0,45,2,0.0,1,1
2,1,45,2,0.0,1,1
3,1,51,2,0.0,0,1
4,0,60,1,0.0,1,0


In [18]:
dtree.fit(test_data_train, test_y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(test_data_train),
                class_names=['Chi Mai','Mayor Han'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("imgs/test_noise.png")

True

<img src="imgs/test_noise.png" width="900" />

### What about the accuracy?

In [19]:
y_test = test_data_test['fanOfHan'].values
X_test = test_data_test.drop('fanOfHan', axis=1)

y_predict = dtree.predict(X_test)

accuracy_score(y_test, y_predict)

0.999

### Observation

- The tree became much more complicated.
- It shows that a decision tree is sensitive to noise, even though only about 0.25% of the dataset (49 of 20000 rows) was randomly generated.
- However, the accuracy isn't affected much (0.999). The tree still functions well even though it looks complicated.
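
The noise fraction quoted above follows from the 49 indices drawn out of 20000 rows. Since `random.randint` can return the same index twice, a minimal sketch that guarantees distinct rows would use `random.sample` instead; `N_ROWS`, `N_NOISY` and `noisy_idx` are hypothetical names.

```python
import random

N_ROWS = 20000
N_NOISY = 49   # range(1, 50) in the noise cell yields 49 draws

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# random.sample draws without replacement, so every index is distinct
noisy_idx = rng.sample(range(N_ROWS), N_NOISY)
noise_share = N_NOISY / N_ROWS
print(f'{noise_share:.3%} of rows receive random values')
```

With so few rows affected, most of the noisy rows land in the training slice, which is consistent with the tree growing extra branches while the test accuracy barely changes.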

### (c). If we reduce the depth of the Decision Tree, what will the result be?

In [20]:
test_data = pdata.copy()
test_data_train = test_data[:15000]
test_data_test = test_data[15000:]
test_y = test_data_train['fanOfHan'].values
test_data_train = test_data_train.drop('fanOfHan', axis=1)

In [21]:
dtree=DecisionTreeClassifier(max_depth=3) # reduce the depth
dtree.fit(pdata_train, y)

dot_data = StringIO()
export_graphviz(dtree, 
                out_file=dot_data,  
                filled=True, 
                feature_names=list(pdata_train),
                class_names=['Chi Mai','Mayor Han'],
                special_characters=True)

graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("imgs/test_depth.png")

True

<img src="imgs/test_depth.png" width="600" />

In [22]:
y_test = test_data_test['fanOfHan'].values
X_test = test_data_test.drop('fanOfHan', axis=1)

y_predict = dtree.predict(X_test)

accuracy_score(y_test, y_predict)

0.9658

### Observation
- At the bottom of the tree, many samples still couldn't be separated into a single class.
- The accuracy drops to 0.9658.
- This is because a tree of depth 3 cannot hold as many rules as the original ones need to represent the results.
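
One way to see why depth 3 falls short is to count the tests along each positive path of the labelling rules. The sketch below assumes each test corresponds to one binary split in the tree (which holds for the encoded features, e.g. "party is KMT" becomes a single threshold on `party`); `POSITIVE_PATHS` is a hypothetical restatement of the labelling loop.

```python
# Each positive rule from the labelling loop, written as the list of
# tests a tree must apply along one root-to-leaf path.
POSITIVE_PATHS = [
    ['party == KMT', 'loveFerrisWheel == 1'],
    ['party == KMT', 'loveFerrisWheel == 0', 'gender == female', 'age >= 55'],
    ['party != KMT', 'hasMCTfamily == 1', 'loveMoney == 1'],
]

MAX_DEPTH = 3  # the reduced depth used in this experiment

needed_depth = max(len(path) for path in POSITIVE_PATHS)
too_deep = [path for path in POSITIVE_PATHS if len(path) > MAX_DEPTH]
print(needed_depth)
print(too_deep)
```

The path for older female KMT members who do not love the Ferris wheel needs four tests, so a depth-3 tree has to stop one split early on that branch and misclassify part of it.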