# CSC311 Decision Trees and Accuracy-based Diagnostics

As before, we will import `matplotlib` and `numpy` for plotting and linear algebra
manipulations.

In [3]:
import matplotlib.pyplot as plt # For plotting
import numpy as np              # Linear algebra library

In addition to `numpy`, we will use `pandas` to view the data, `re` to clean it,
and `sklearn`'s `DecisionTreeClassifier` to build our classifier. We will also use the helper functions provided in `challenge_basic.py` to clean the data before using it to train our classifier.

In [2]:
import re
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree as treeViz
from sklearn.metrics import accuracy_score

Use pandas to read the dataset.

In [4]:
# read the csv file as a *pandas data frame*
data = pd.read_csv("clean_dataset.csv")

# display the dataframe in the notebook
data

,id,Q1,Q2,Q3,Q4,Q5,Q6,Q7,Q8,Q9,Q10,Label
0,42659,4.0,1.0,2.0,1.0,Co-worker,"Skyscrapers=>6,Sport=>4,Art and Music=>2,Carni...",25,7.0,100,Slavery,Dubai
1,508149,4.0,3.0,5.0,2.0,Co-worker,"Skyscrapers=>6,Sport=>1,Art and Music=>2,Carni...",20,3.0,4,"Wherever there is great property, there is gre...",Dubai
2,496935,5.0,4.0,5.0,1.0,"Partner,Friends","Skyscrapers=>6,Sport=>2,Art and Music=>3,Carni...",32,5.0,4,Futuristic land,Dubai
3,502824,5.0,4.0,4.0,1.0,"Partner,Friends","Skyscrapers=>6,Sport=>5,Art and Music=>1,Carni...",23,10.0,3,The city where anything is possible,Dubai
4,523028,4.0,3.0,3.0,3.0,"Partner,Friends,Siblings","Skyscrapers=>6,Sport=>4,Art and Music=>1,Carni...",20,10.0,5,"If you can think of a high building, it probab...",Dubai
...,...,...,...,...,...,...,...,...,...,...,...,...
1463,394579,4.0,2.0,5.0,3.0,"Partner,Friends,Siblings","Skyscrapers=>2,Sport=>1,Art and Music=>4,Carni...",5,2.0,2,"""The average piece of junk is probably more me...",Paris
1464,499389,5.0,3.0,4.0,2.0,"Partner,Co-worker","Skyscrapers=>1,Sport=>3,Art and Music=>6,Carni...",7,2.0,4,Oui oui baguette,Paris
1465,522120,2.0,1.0,3.0,4.0,Friends,"Skyscrapers=>2,Sport=>3,Art and Music=>6,Carni...",9,87.0,67,E Oh,Paris
1466,519777,5.0,4.0,5.0,3.0,"Partner,Friends","Skyscrapers=>5,Sport=>6,Art and Music=>2,Carni...",15,1.0,15,croissants and cigarettes,Paris


## Part 1. Data

We will focus on the survey responses and use the answers to each question to predict which city a response describes.

We will be looking at 10 features from the ML challenge data set. The definitions of the features are as follows:

- `Q1`: the answer to the question "From a scale 1 to 5, how popular is this city?". Can be 1 to 5, where 1 is the least popular and 5 is the most popular.
- `Q2`: the answer to the question "On a scale of 1 to 5, how efficient is this city at turning everyday occurrences into potential viral moments on social media?". Can be 1 to 5, where 1 is the least efficient and 5 is the most efficient.
- `Q3`: the answer to the question "Rate the city's architectural uniqueness from 1 to 5, with 5 being a blend of futuristic wonder and historical charm.". Can be 1 to 5, where 1 is the least unique and 5 is the most unique.
- `Q4`: the answer to the question "Rate the city's enthusiasm for spontaneous street parties on a scale of 1 to 5, with 5 being the life of the celebration.". Can be 1 to 5, where 1 is the least enthusiastic about street parties and 5 is the most enthusiastic.
- `Q5`: the answer to the question "If you were to travel to this city, who would be likely with you?". Can be any combination of Co-worker, Partner, Friends, and Siblings.
- `Q6`: the answer to the question "Rank the following words from the least to most relatable to this city. Each area should have a different number assigned to it.". Each word receives a rank from 1 to 6, where 1 is the least relatable and 6 is the most relatable.
- `Q7`: the answer to the question "In your opinion, what is the average temperature of this city over the month of January?". Can be any number.
- `Q8`: the answer to the question "How many different languages might you overhear during a stroll through the city?". Can be any number.
- `Q9`: the answer to the question "How many different fashion styles might you spot within a 10-minute walk in the city?". Can be any number.
- `Q10`: the answer to the question "What quote comes to mind when you think of this city?". Can be any string representing a quote.

We will be using these features to predict the column `Label`:

- `Label`: The city that the answers to these questions correspond to.


Let's start by exploring the data that we have in hand. Pandas has a nice function to summarize the mean and dispersion of each feature in our data frame:

In [5]:
data.describe()

,id,Q1,Q2,Q3,Q4,Q8
count,1468.0,1462.0,1461.0,1461.0,1462.0,1461.0
mean,459092.092643,4.321477,3.521561,3.707734,3.391929,4.79603
std,114308.717089,0.884464,1.192031,1.142037,1.297189,22.142305
min,5978.0,1.0,1.0,1.0,1.0,1.0
25%,406181.0,4.0,3.0,3.0,2.0,2.0
50%,503173.0,5.0,4.0,4.0,3.0,3.0
75%,522169.0,5.0,5.0,5.0,5.0,5.0
max,727099.0,5.0,5.0,5.0,5.0,800.0
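The `count` row is not the same for every column (1468 for `id`, 1461 or 1462 for the others), which suggests that some answers are missing. As a minimal sketch of how one might count missing entries per column, here is `isna().sum()` applied to a tiny made-up frame. The name `toy` and its values are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the survey data; the values are made up.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "Q1": [5.0, np.nan, 4.0, 3.0],
    "Q8": [2.0, 3.0, np.nan, np.nan],
})

# isna() marks missing cells; summing the booleans counts them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

The same call on the real `data` frame would show which questions have unanswered rows.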


For the categorical features, we can also tabulate how often each category
occurs in the data set:

In [6]:
data['Q1'].value_counts()

5.0    798
4.0    405
3.0    205
2.0     39
1.0     15
Name: Q1, dtype: int64

In [7]:
data['Q2'].value_counts()

5.0    391
4.0    376
3.0    366
2.0    260
1.0     68
Name: Q2, dtype: int64

In [8]:
data['Q3'].value_counts()

5.0    446
4.0    439
3.0    337
2.0    181
1.0     58
Name: Q3, dtype: int64

Answers skew toward the higher values on all three questions: 4 and 5 are the two most common responses each time.
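To put a number on that skew, the counts can be turned into shares. The sketch below rebuilds the `Q1` counts from the output above into a series (the names `q1_counts`, `q1_share`, and `high_share` are introduced here for illustration) and computes the fraction of responses that are 4 or 5.

```python
import pandas as pd

# Q1 counts copied from the value_counts() output above
q1_counts = pd.Series({5.0: 798, 4.0: 405, 3.0: 205, 2.0: 39, 1.0: 15})

# Divide by the total to get the share of each answer
q1_share = q1_counts / q1_counts.sum()

# Combined share of the two highest ratings
high_share = q1_share.loc[[4.0, 5.0]].sum()

print(q1_share.round(3))
print("Share of 4s and 5s:", round(high_share, 3))
```

On the data frame itself, `data['Q1'].value_counts(normalize=True)` should give the same shares directly.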

Finally, let's take a look at the distribution of our target variable.

In [9]:
data['Label'].value_counts()

Dubai             367
Rio de Janeiro    367
New York City     367
Paris             367
Name: Label, dtype: int64

Note that there is an equal number of cases for each city. This balance is helpful because no class dominates the data set, which reduces the risk of a classifier that is biased toward one city. Now let us look at pandas' `crosstab` function, which shows how the answers to each question are distributed across the cities.

In [10]:
pd.crosstab(data["Label"], data["Q1"])

Label \ Q1,1.0,2.0,3.0,4.0,5.0
Dubai,5,5,42,162,153
New York City,2,5,6,35,317
Paris,1,5,12,72,275
Rio de Janeiro,7,24,145,136,53


In [11]:
pd.crosstab(data["Label"], data["Q2"])

Label \ Q2,1.0,2.0,3.0,4.0,5.0
Dubai,22,58,103,117,67
New York City,4,18,23,83,237
Paris,17,65,121,94,68
Rio de Janeiro,25,119,119,82,19


In [12]:
pd.crosstab(data["Label"], data["Q3"])

Label \ Q3,1.0,2.0,3.0,4.0,5.0
Dubai,14,25,51,124,153
New York City,19,58,98,112,78
Paris,8,12,48,119,178
Rio de Janeiro,17,86,140,84,37
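Raw counts can be hard to compare across rows, so it may help to look at row-wise shares instead. The following is a small sketch using the `normalize="index"` option of `pd.crosstab` on a made-up frame. The name `toy` and its contents are assumptions for illustration only.

```python
import pandas as pd

# Toy data; labels and scores are illustrative only.
toy = pd.DataFrame({
    "Label": ["Dubai", "Dubai", "Paris", "Paris", "Paris", "Paris"],
    "Q1": [4.0, 5.0, 5.0, 5.0, 5.0, 3.0],
})

# normalize="index" divides each row by its total, giving the
# share of each score within a city.
row_shares = pd.crosstab(toy["Label"], toy["Q1"], normalize="index")
print(row_shares)
```

Each row now sums to 1, so a column with very different shares across cities points to a question that could help separate them.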


Before we can use our data to train a decision tree model,
we need to transform how some of our features are encoded.

To do this we will use the cleanup tools given to us in `challenge_basic.py`. We will also drop `Q10`: the quotes are free text and vary widely, so unless they mention the city by name they are unlikely to help and may actually harm performance.

In [13]:
def to_numeric(s):
    """Converts string `s` to a float."""
    if isinstance(s, str):
        s = s.replace(",", '')
    return pd.to_numeric(s, errors="coerce")

def get_number_list(s):
    """Get a list of integers contained in string `s`."""
    return [int(n) for n in re.findall(r"(\d+)", str(s))]

def get_number_list_clean(s):
    """Return a clean list of numbers contained in `s`."""
    n_list = get_number_list(s)
    # Ensure the list has a fixed size of 6, fill missing values with -1
    return n_list + [-1] * (6 - len(n_list))

def get_number(s):
    """Get the first number contained in string `s`."""
    n_list = get_number_list(s)
    return n_list[0] if n_list else -1

def find_area_at_rank(l, i):
    """Return the area at a certain rank in list `l`."""
    return l.index(i) + 1 if i in l else -1

def cat_in_s(s, cat):
    """Return if a category is present in string `s`."""
    return int(cat in str(s))

# Apply preprocessing to numeric fields
data['Q7'] = data['Q7'].apply(to_numeric).fillna(0)
data['Q8'] = data['Q8'].apply(to_numeric).fillna(0)
data['Q9'] = data['Q9'].apply(to_numeric).fillna(0)

# Convert Q1-Q4 to their first number (-1 if no number is found)
data['Q1'] = data['Q1'].apply(get_number)
data['Q2'] = data['Q2'].apply(get_number)
data['Q3'] = data['Q3'].apply(get_number)
data['Q4'] = data['Q4'].apply(get_number)

# Process Q6 to create area rank categories
data['Q6'] = data['Q6'].apply(get_number_list_clean)

temp_names = []
for i in range(1, 7):
    col_name = f"rank_{i}"
    temp_names.append(col_name)
    data[col_name] = data["Q6"].apply(lambda l: find_area_at_rank(l, i))
del data["Q6"]

# Create category indicators and dummy variables
new_names = []
for col in ["Q1", "Q2", "Q3", "Q4", "Q8", "Q9"] + temp_names:
    indicators = pd.get_dummies(data[col], prefix=col)
    new_names.extend(indicators.columns)
    data = pd.concat([data, indicators], axis=1)
    del data[col]

# Create multi-category indicators
for cat in ["Partner", "Friends", "Siblings", "Co-worker"]:
    cat_name = f"Q5_{cat}"
    new_names.append(cat_name)
    data[cat_name] = data["Q5"].apply(lambda s: cat_in_s(s, cat))
del data["Q5"]


# Preparing the features and labels
data = data[new_names + ["Q7", "Label"]]
data = data.sample(frac=1, random_state=42)
# features = data.drop("Label", axis=1)
# labels = pd.get_dummies(data["Label"].values)
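To make the `Q6` processing concrete, here is a short worked example of the helper functions on a single answer string. The helpers are copied from the cell above so the snippet runs on its own. The answer string is hypothetical: the area names `AreaD`, `AreaE`, and `AreaF` are placeholders, since the real names are truncated in the table shown earlier.

```python
import re

def get_number_list(s):
    """Get a list of integers contained in string `s`."""
    return [int(n) for n in re.findall(r"(\d+)", str(s))]

def get_number_list_clean(s):
    """Pad the list of numbers in `s` with -1 up to length 6."""
    n_list = get_number_list(s)
    return n_list + [-1] * (6 - len(n_list))

def find_area_at_rank(l, i):
    """Return the 1-based position of the area given rank `i`, or -1."""
    return l.index(i) + 1 if i in l else -1

# Hypothetical answer; AreaD/AreaE/AreaF are placeholder names.
answer = "Skyscrapers=>6,Sport=>4,Art and Music=>2,AreaD=>1,AreaE=>3,AreaF=>5"

ranks = get_number_list_clean(answer)
areas_by_rank = {f"rank_{i}": find_area_at_rank(ranks, i) for i in range(1, 7)}

# An incomplete answer is padded with -1
partial = get_number_list_clean("Skyscrapers=>6,Sport=>4")

print(ranks)
print(areas_by_rank)
print(partial)
```

Here `rank_6` is 1 because the first area listed (Skyscrapers) was given rank 6, and `rank_1` is 4 because the fourth area was given rank 1. Each `rank_i` column is then one-hot encoded by `pd.get_dummies`.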

Finally, let's separate our data into a training set and a held-out set.
We will use the first 1200 data points for training and hold out the remaining 268 for validation (stored in `x_test` and `y_test`).

Because the rows were already shuffled with `data.sample(frac=1, random_state=42)`, slicing the arrays gives a random split. The `train_test_split` function provided by `sklearn` is an alternative that shuffles and splits the data in one call.

In [14]:
x = data.drop("Label", axis=1).values
y = pd.get_dummies(data["Label"]).values

n_train = 1200
x_train = x[:n_train]
y_train = y[:n_train]

x_test = x[n_train:]
y_test = y[n_train:]
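The cell above slices the shuffled arrays by hand. As a rough sketch of how `train_test_split` could produce a split of the same sizes, the example below runs it on synthetic arrays. The names `x_demo` and `y_demo`, the feature count, and the use of `stratify` are assumptions made for illustration, not part of the notebook's pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and integer class labels
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(1468, 5))
y_demo = rng.integers(0, 4, size=1468)

# train_size=1200 leaves the remaining 268 rows for the held-out set;
# stratify keeps the class proportions similar in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, train_size=1200, random_state=42, stratify=y_demo
)

print(x_tr.shape, x_te.shape)
```

Fixing `random_state` makes the split reproducible, in the same way that `random_state=42` does for `data.sample`.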

Next, we will use sklearn's `DecisionTreeClassifier` to fit a decision tree to our data.

Fit a `DecisionTreeClassifier` to our dataset, then print the training and validation scores (accuracy). I also used this code block to test many different values for the hyperparameters `criterion`, `max_depth`, and `min_samples_split`, and kept the values that performed best.


In [80]:
# Creating a DecisionTreeClassifier
tree = DecisionTreeClassifier(criterion="gini", max_depth=8, min_samples_split=10)
# tree = DecisionTreeClassifier(criterion="entropy", max_depth=8, min_samples_split=11)
# Fit it to our data
tree.fit(x_train, y_train)

# Print the training and validation scores (accuracy)
print("Training Accuracy:", tree.score(x_train, y_train))
print("Validation Accuracy:", tree.score(x_test, y_test))

Training Accuracy: 0.88
Validation Accuracy: 0.832089552238806
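The hyperparameter values above were chosen by trying different combinations by hand. One possible way to automate that search is a simple loop over a small grid, sketched below on synthetic data from `make_classification`. The grid values, the variable names, and the data are all assumptions for illustration, and this is not necessarily how the values above were found.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the survey features
x_demo, y_demo = make_classification(
    n_samples=400, n_features=10, n_informative=5, n_classes=4, random_state=0
)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_demo, y_demo, test_size=0.25, random_state=0
)

# Illustrative grid; the values are assumptions, not tuned choices
criteria = ["gini", "entropy"]
depths = [4, 8, 12]
min_splits = [2, 10, 20]

results = []
for criterion, max_depth, min_samples_split in product(criteria, depths, min_splits):
    clf = DecisionTreeClassifier(
        criterion=criterion,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        random_state=0,
    )
    clf.fit(x_tr, y_tr)
    # Record validation accuracy together with the settings that produced it
    results.append((clf.score(x_val, y_val), criterion, max_depth, min_samples_split))

# Tuples compare by their first element first, so max() picks the best accuracy
best = max(results)
print("Best validation accuracy and settings:", best)
```

Note that when hyperparameters are selected on the same held-out set that is used to report accuracy, the reported number tends to be optimistic. Keeping a separate test set that is never used for tuning gives a fairer estimate.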
