# CSE 5243 - Introduction to Data Mining
## Homework 2: Classification
- Semester: Spring 2023
- Instructor: Tom Bihari
- Section Days/Time: Wednesday/Friday 9:35 AM
- Student Name: Jiyong Kwag
- Student Email: kwag.3@osu.edu
- Student ID: 500165290
***

***
# Section: Overview

### Assignment Overview

This assignment covers **steps 4 and 5 of the six steps** of the **CRISP-DM process model** (Modeling and Evaluation). (See the CRISP-DM materials on CARMEN.)

The **objectives** of this assignment are:
- Solve a business problem by creating, evaluating, and comparing three classification models, and produce the outputs needed to provide business value for your stakeholders.
- Experiment with built-in classification models in **scikit-learn**.

### Dataset
**NOTE: Since you already have pre-processed this dataset in the previous assignment, you may choose to use your "cleaned up" dataset from that assignment instead of re-doing the work here.**

In this assignment, you will analyze an ALTERED copy of the “Hotel Booking Demand” dataset.
- This dataset was pulled on 4/8/22 from: https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand
- The dataset file is named: **hotel_bookings_with_errors_V1.csv**

**The data has been altered slightly for use in course assignments, etc.:**
- A unique ROW attribute has been added.
- Errors have been added, such as: duplicated records, deleted records, deleted attribute values, erroneous attribute values.
**DO NOT PUBLISH THIS DATASET - it contains intentionally wrong data!**

### Problem Statement
Assume that you are the Director of Data Science for Buckeye Resorts, Inc. (BRI), an international hotel chain.  As is the case for all hotel chains, reservation cancellations cause significant impacts to BRI, in profitability, logistics, and other areas.  Approximately **20%** of reservations are cancelled, and the cost to BRI of a cancelled reservation is **$500** on average. 

- BRI wants to improve (decrease) the cancellation rates at its hotels, using more tailored interventions, based on newly available detailed data.  BRI processes **100,000** reservations per year, so an incremental improvement in cancellation rates would have a significant impact.

- One intervention being considered is to offer a special financial incentive to customers who have reservations, but who are “at risk” of cancellation.  BRI has performed a small pilot test, and has found that offering a **$80** discount to a customer who is planning to cancel is effective **35%** of the time in inducing the customer not to cancel (by locking in a “no cancellation” clause).

- BRI leadership has asked your team to analyze the new data, and determine if it is suitable for developing analyses and models that would be effective in predicting which future reservations are likely to be at risk of cancellation, so the aforementioned financial incentive could be offered.

- The head of BRI would then like you to attend the upcoming BRI Board of Directors meeting.  She has asked you to present your findings to her and to the BOD, to help them decide whether to go forward with the planned tailored intervention approach, and/or to adjust or abandon the approach.  Your goal is to support the BOD in making a decision. 

**In the previous assignment**, you completed the sections for the first three steps of CRISP-DM.  You **explored** the dataset, and **prepared** a clean dataset from it that contains the kind of information you think might be useful.  You now will make use of the dataset.

### Things To Do
You now must **develop** and **evaluate** specific models for predicting the cancellations.  You will try the **off-the-shelf KNN classifier**, and **two other classifiers of your choice**.

Some initial guidance / suggestions:

- You must develop a cost model from the problem statement above.  Consider creating a table that lists the benefit and cost dollar amounts for a decision on a **single customer**.  Note that the incentive will be "offered" if Predicted is True, and the incentive is "needed" if Actual is True:

| Actual "At Risk" | Predicted "At Risk" | Incentive Benefit | Incentive Cost | Net Benefit (Benefit-Cost) |
|---|---|---|---|---|
| False | False | 0 | 0 | 0 |
| False | True  | 0 | 80 | -80 |
| True  | False | 0 | 0 | 0 |
| True  | True  | 175 | 80 | 95 |

- Much of the code below may be repetitive.  Consider creating a few reusable functions that can be called for each of the models you build (e.g., evaluation functions).

- **Follow the instructions** in each of the sections below.

It is essential that you **communicate** your goals, thought process, actions, results, and conclusions to the **audience** that will consume this work.  It is **not enough** to show just the code.  It is not appropriate to show long sections of **unexplained printout**, etc.  Be kind to your readers and provide value to them!

**ALWAYS follow this pattern** when doing **each portion** of the work.  This allows us to give feedback and assign scores, and to give partial credit.  Make it easy for the reader to understand your work.
- Say (briefly) **what** you are trying to do, and **why**.
- Do it (code).
- Show or describe the **result** clearly (and briefly as needed), and explain the significant **conclusions or insights** derived from the results. 

### Collaboration
For this assignment, you should work as an individual. You may informally discuss ideas with classmates, but your work should be your own.

### What you need to turn in:
1)	Code

-	For this homework, the code is the Jupyter Notebook.  Use the provided Jupyter Notebook template, and fill in the appropriate information.
-	You may use common Python libraries for I/O, data manipulation, data visualization, etc. (e.g., NumPy, Pandas, MatPlotLib,…) 
-	You may not use library operations that perform, in effect, the “core” computations for this homework (e.g., If the assignment is to write a K-Means algorithm, you may not use a library operation that, in effect, does the core work needed to implement a K-Means algorithm.).  When in doubt, ask the grader or instructor.  (**Note: For this assignment, you *will* be using built-in library functions, so you are permitted to do so.  You may not, however, make use of a single function that does *all* of the work for you.**)
-	The code must be written by you, and any significant code snips you found on the Internet and used to understand how to do your coding for the core functionality must be attributed.  (You do not need to attribute basic functionality – matrix operations, IO, etc.)
-	The code must be commented sufficiently to allow a reader to understand the algorithm without reading the actual Python, step by step.
-	**When in doubt, ask the grader or instructor.**

2)	Written Report
-	For this homework, the report is the Jupyter Notebook.  The report should be well-written.  Please proof-read and remove spelling and grammar errors and typos.
-	The report should discuss your analysis and observations. Key points and findings must be written in a style suitable for consumption by non-experts.  Present charts and graphs to support your observations. If you performed any data processing, cleaning, etc., please discuss it within the report.

### Grading

1.	Overall readability and organization of your report (5%)
> - Is it well organized and does the presentation flow in a logical manner?
> - Are there no grammar and spelling mistakes?
> - Do the charts/graphs relate to the text?
> - Are the summarized key points and findings understandable by non-experts?
> - Do the Overview and Conclusions provide context for the entire exercise?
2.	Evaluation Method (10%)
> - Does your evaluation method meet the needs of the developer (you) as well as the needs of your business stakeholders?
> - Is the evaluation method sound?
> - Did you describe both the method itself and why you chose it?
3.	Pre-Processing of the Dataset (10%)
> - Did you make reasonable choices for pre-processing, and explain why you made them?
4.	Evaluation of the KNN Classifier (20%)
> - Is your algorithm design and coding correct?
> - Is it well documented?
> - Have you made an effort to tune it for good performance?
> - Is the evaluation sound?
5.	Evaluation of the Second Classifier (20%)
> - Is your algorithm design and coding correct?
> - Is it well documented?
> - Have you made an effort to tune it for good performance?
> - Is the evaluation sound?
6.	Evaluation of the Third Classifier (20%)
> - Is your algorithm design and coding correct?
> - Is it well documented?
> - Have you made an effort to tune it for good performance?
> - Is the evaluation sound?
7.	Comparison of the Three Classifiers (10%)
> - Is the comparison sound?
> - Did you choose a specific classifier as best and explain why?
8.  Conclusions (5%)
> - Did you summarize appropriately your critical findings?
> - Did you provide appropriate conclusions and next steps?

### How to turn in your work on Carmen:

Submit to Carmen the Jupyter Notebook. You do not need to include the input data.

**HAVE FUN!**
***

***
# Section: Overview
- Insert a short description of the scope of this exercise, any supporting information, etc.
***

In [None]:
# N/A See above.

***
# Section: Setup
- Add any needed imports, helper functions, etc., here.
***

In [14]:
import numpy as np
import pandas as pd

#!pip install matplotlib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
#matplotlib.use('Qt5Agg')

import seaborn as sns

pd.set_option('display.max_columns', 50) #include to avoid ... in middle of display

#Turning off the futureWarning
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

***
# Section: 1 - Evaluation Method
- Define measures for evaluating the classification models you develop.  Explain why the measures you choose provide a useful view into the value and usefulness of the model you eventually chose for the company to use.  **Note: In this section, you should define and explain your measures.  You may create a reusable function here if you like.  You then will use the functions in later sections.**
- Define two types:
***

***
## Section: 1.1 - Define measures that **do not** include the cost information
- (e.g., confusion matrices, accuracy, precision, recall, F-measures, etc.).
- Consider using: from sklearn.metrics import classification_report, confusion_matrix
***

### **Measure**

In [15]:
from sklearn import metrics

def confusion_matrices(y_test, y_pred):
    c_matrix = metrics.confusion_matrix(y_test, y_pred)
    accuracy = metrics.accuracy_score(y_test, y_pred)
    precision = metrics.precision_score(y_test, y_pred)
    recall = metrics.recall_score(y_test, y_pred)
    f1 = metrics.f1_score(y_test, y_pred)
    
    print("Confusion_Matrix:\n", c_matrix)
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)     
    print("F-Measure:", f1)

### **Confusion Matrix**

A confusion matrix is a table that counts the true positives, false negatives, false positives, and true negatives of a classifier.

**True-Positive (TP)**: The predicted class is 1 and the actual class is 1. In our model, the customer is at risk of canceling and we predicted that the customer will cancel.<br>
**False-Negative (FN)**: The predicted class is 0 but the actual class is 1. In our model, the customer is at risk of canceling but we did not predict that the customer will cancel.<br>
**False-Positive (FP)**: The predicted class is 1 but the actual class is 0. In our model, the customer is not at risk of canceling but we predicted that the customer will cancel.<br>
**True-Negative (TN)**: The predicted class is 0 and the actual class is 0. In our model, the customer is not at risk of canceling and we predicted that the customer will not cancel.<br>

With these four counts, we can measure not only accuracy but also precision, recall, and F-measure, and use them to judge the quality of each classification model we build. 

### **Accuracy**

Accuracy is the number of cases the model got right (true positives plus true negatives) divided by the total number of samples: (TP + TN) / (TP + TN + FP + FN). It gives a simple summary of how often the classification model predicts correctly. However, accuracy does not show where the model is correct. For example, if 90 percent of the data is negative and the model predicts every case as negative, accuracy is still 90 percent even though the model never identifies a true positive.

### **Precision**

Precision gives more detail than accuracy. It is the fraction of the cases predicted as 1 that are actually 1. The formula for precision is TP / (TP + FP).

### **Recall**

Recall is another way to analyze the correctness of the model. It is the fraction of the cases that are actually 1 that the model also predicted as 1.
The formula for recall is TP / (TP + FN).

### **F-measure**

The F-measure (F1) is the harmonic mean of precision and recall, 2 × Precision × Recall / (Precision + Recall), rather than their simple arithmetic average. It weights precision and recall equally and is high only when both measures are high.
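
To make the four formulas concrete, the following minimal sketch computes them directly from the confusion-matrix counts. The function name `metrics_from_counts` and the example counts are hypothetical and chosen only for illustration; the notebook itself relies on `sklearn.metrics` for these values.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts, not taken from the hotel dataset.
example = metrics_from_counts(tp=35, fp=10, fn=5, tn=50)
print(example)
```

With these counts, accuracy is 85/100, precision is 35/45, and recall is 35/40, which shows how the three measures can diverge on the same predictions.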

***
## Section: 1.2 - Define measures that **do** include the cost information
- (e.g., using cost matrices).
- Consider creating a function that takes a confusion matrix and calculates the cost.
***

### **Cost Matrices**

In [13]:
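# Net benefit per customer; rows = actual class (0, 1), columns = predicted class (0, 1).
cost_matrix = [[0,-80],[0,95]]

def cost(cost_matrix, y_test, predict):
    # Multiply each confusion-matrix cell by its net benefit and sum the results.
    total = 0
    c_matrix = metrics.confusion_matrix(y_test, predict)
    for r in (0,1):
        for c in (0,1):
            total = total + cost_matrix[r][c] * c_matrix[r][c]
    return total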

The cost function calculates the total net benefit from the combination of whether each customer actually cancels the reservation and what the model predicted. <br>

For each customer predicted to be at risk, we offer the incentive. The incentive benefit is the amount the company gains by preventing a cancellation, and the only case in which a cancellation can be prevented is a true positive. So, we multiply the cancellation cost of 500 by 35 percent, the probability that the customer will not cancel after receiving the incentive, which gives an expected benefit of 175. <br>

Finally, we subtract the incentive cost from the incentive benefit to get the net benefit per customer for each prediction outcome.

| Actual "At Risk" | Predicted "At Risk" | Incentive Benefit | Incentive Cost | Net Benefit (Benefit-Cost) |
|---|---|---|---|---|
| False | False | 0 | 0 | 0 |
| False | True  | 0 | 80 | -80 |
| True  | False | 0 | 0 | 0 |
| True  | True  | 175 | 80 | 95 |

Thus, using the cost matrix and the confusion matrix, we can calculate the total net benefit a model produces by multiplying each pair of corresponding cells and summing the results. A higher total means the model is more valuable to the business than a model with a lower total. 
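
The entries of the cost matrix follow from the figures in the problem statement. The sketch below derives them and also computes a break-even precision, which is an added observation rather than part of the original analysis: under these assumptions, offering the incentive is profitable only when the share of correct "at risk" predictions exceeds 80 / 175. The constant and variable names are hypothetical.

```python
CANCELLATION_COST = 500        # average cost of one cancelled reservation
INCENTIVE_COST = 80            # discount offered to a predicted "at risk" customer
INCENTIVE_SUCCESS_RATE = 0.35  # share of would-be cancellers who accept and stay

# Expected benefit of offering the incentive to a customer who really is at risk.
incentive_benefit = INCENTIVE_SUCCESS_RATE * CANCELLATION_COST   # 0.35 * 500
net_true_positive = incentive_benefit - INCENTIVE_COST           # benefit minus cost
net_false_positive = -INCENTIVE_COST                             # cost with no benefit

# Rows = actual class (0, 1), columns = predicted class (0, 1),
# matching the layout of sklearn's confusion_matrix.
derived_cost_matrix = [[0, net_false_positive], [0, net_true_positive]]

# With precision p, the expected value of one offer is
# p * net_true_positive + (1 - p) * net_false_positive, which is zero at:
break_even_precision = INCENTIVE_COST / incentive_benefit

print(derived_cost_matrix)
print(round(break_even_precision, 3))
```

A model whose precision is below this break-even value would lose money on average for every incentive it triggers, regardless of its accuracy.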

***
# Section: 2 - Pre-Processing of the Dataset
- Load the dataset.
- Split the dataset into a Training dataset and a Test dataset based on the class attribute.  Keep them separate and use the Training dataset for training/tuning and the Test dataset for testing. For consistency, use the **train_test_split** operation available in SciKit Learn (use a specific random seed, so it is reproducible).
  - from sklearn.model_selection import train_test_split
  - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
- **NOTE: You have done much of the data preprocessing in the previous assignment, so you don't have to re-do it here.  You can either copy the necessary code from the previous assignment, or generate the clean dataset from the previous assignment and load it here.**
***

***
## Section: 2.1 - Explore the attributes
- As in Homework 1, explore the attributes briefly. Reference the website listed in the Introduction.
- Provide basic statistics for the attributes.
- List which attributes are Nominal (even though they are encoded as numbers), Ordinal, Interval, Ratio.
- **NOTE: Just summarize here.  You will need to know which attributes are Nominal, etc., so it would be useful to list them here.**
***

**Discussion:**

| Num | Name | Data Type | Type | Meaning |
| :--- | :--- | :--- | :--- | :--- |
| 0 | row | Integer | Ordinal | Counting the number of records |
| 1 | hotel | String | Nominal | Name of the hotel |
| 2 | is_canceled | Float | Nominal | If booking is canceled (1) or not canceled (0) |
| 3 | lead_time | Integer | Ratio | Number of days between entering into PMS and customer arrival |
| 4 | arrival_date_year | Integer | Nominal* | Customer arrival year |
| 5 | arrival_date_month | String | Nominal* | Customer arrival month |
| 6 | arrival_date_week_number | Integer | Nominal* | Customer arrival week number |
| 7 | arrival_date_day_of_month | Integer | Nominal* | Customer arrival day of month |
| 8 | stays_in_weekend_nights | Integer | Ratio | Number of weekend nights customer stayed |
| 9 | stays_in_week_nights | Integer | Ratio | Number of weekday nights customer stayed |
| 10 | adults | Integer | Ratio | Number of adults |
| 11 | children | Integer | Ratio | Number of children |
| 12 | babies | Integer | Ratio | Number of babies |
| 13 | meal | String | Nominal | Meal type |
| 14 | country | String | Nominal | Customer country |
| 15 | market_segment | String | Nominal | Market segment: Travel Agency (TA) or Tour Operator (TO) |
| 16 | distribution_channel | String | Nominal | Booking distribution: Travel Agency (TA) or Tour Operator (TO) |
| 17 | is_repeated_guest | Integer | Ordinal | If customer is a repeat guest (1), else (0) |
| 18 | previous_cancellations | Integer | Ratio | Number of previous cancellations by customer |
| 19 | previous_bookings_not_canceled | Integer | Ratio | Number of bookings not canceled by customer |
| 20 | reserved_room_type | String | Nominal | Reserved room |
| 21 | assigned_room_type | String | Nominal | Actual room assigned to customer |
| 22 | booking_changes | Integer | Ratio | Number of changes by customer after entering into PMS |
| 23 | deposit_type | String | Nominal | Deposit type of customer: no deposit, non refund, refundable |
| 24 | agent | Float | Nominal | ID of travel agency |
| 25 | company | Float | Nominal | ID of company that books the room |
| 26 | days_in_waiting_list | Integer | Ratio | Days between waiting list and booking confirmation |
| 27 | customer_type | String | Nominal | Type of booking: Contract, Group, Transient, Transient-party |
| 28 | adr | Integer | Ratio | Average daily rate: sum of lodging transactions divided by number of staying nights |
| 29 | required_car_parking_spaces | Integer | Ratio | Number of parking spaces needed |
| 30 | total_of_special_requests | Integer | Ratio | Number of special requests |
| 31 | reservation_status | String | Nominal | Reservation status: canceled, no show, check out |
| 32 | reservation_status_date | String | Nominal* | Date on which the reservation status was set |

***
## Section: 2.2 - Revise the dataset
- Review the meanings of the attributes and consider removing redundant or (likely) irrelevant attributes, combining attributes, etc., to reduce the number of attributes.
- (You may choose to use techniques such as those you used in Homework 1 to analyze the impacts of individual attributes on the CLASS attribute.)
- Describe what you chose to do (and not do), and why.
-**NOTE: You can just load your cleaned-up dataset here if you like.**
***

### **Loading the Clean-Up Data**

In Homework 1, we used various techniques to find errors in the data. As I mentioned in HW1, the dataset contains a number of NA values, wrong values, and outliers that cannot be resolved with the given data. For example, adr is the average daily rate that the customer pays to the company. It is calculated as the sum of lodging transactions divided by the number of staying nights. However, the data contains negative adr values, which is impossible. Furthermore, we cannot verify adr with the given attributes because there is no lodging transaction attribute. Moreover, there are several NA values for the number of children. This can happen because some employees entered zero children as NA while others entered the number 0. Lastly, the maximum lead_time is an outlier that is far larger than the other values. However, we are not given the PMS entry date, so we cannot verify this value either. 

For the reasons above, the errors that I handled in the cleaned-up data are duplicate rows, NA values in is_canceled, and a typo in the market segment (Onlin_ta to Online_ta). In particular, NA values in is_canceled can be inferred from reservation_status, which provides the latest reservation status of the customer: canceled, no show, or check out. 

Thus, in this homework, I will use the cleaned-up data from Homework 1 to build the three classification models below.

In [16]:
try:
    # Reuse the DataFrame if an earlier run of this cell already loaded it.
    _ = data_from_source_file_df
    print("Reusing source data")
except NameError:
    print("Loading source data")
    data_from_source_file_df = pd.read_csv("hotel_clean_up.csv")
data_df = data_from_source_file_df

Reusing source data


In [17]:
prev_cancels = data_df["previous_cancellations"]
prev_not_cancels = data_df["previous_bookings_not_canceled"]
adults = data_df["adults"]
booking_changes = data_df["booking_changes"]

is_canceled = data_df["is_canceled"]

### **Previous Cancelation vs Previous booking not canceled**

In [18]:
p_pn_data = [[l,p] for l,p in zip(prev_cancels, prev_not_cancels)]

In [19]:
from sklearn.model_selection import train_test_split
ppn_x_train, ppn_x_test, ppn_y_train, ppn_y_test = train_test_split(p_pn_data, is_canceled, shuffle=True, test_size=0.2, random_state=20)
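
As an alternative sketch of the same split, the feature columns can be selected directly from the DataFrame instead of being zipped into a list. The `stratify=y` argument is an optional addition that is not used in the original cells; it keeps the cancellation ratio the same in the training and test sets. `toy_df` is a hypothetical stand-in for `data_df` with the same column names.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for data_df; values are made up for illustration.
toy_df = pd.DataFrame({
    "previous_cancellations":         [0, 1, 0, 2, 0, 0, 3, 0, 1, 0],
    "previous_bookings_not_canceled": [2, 0, 5, 0, 1, 3, 0, 4, 0, 6],
    "is_canceled":                    [0, 1, 0, 1, 0, 0, 1, 0, 1, 1],
})

# Select the two feature columns and the class attribute by name.
X = toy_df[["previous_cancellations", "previous_bookings_not_canceled"]]
y = toy_df["is_canceled"]

# Same test_size and random_state as the cell above, plus stratification on the class.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=20, stratify=y)

print(y_train.value_counts().to_dict(), y_test.value_counts().to_dict())
```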

### **Previous booking not canceled vs Adults**

In [8]:
pn_a_data = [[l,p] for l,p in zip(prev_not_cancels, adults)]

In [9]:
from sklearn.model_selection import train_test_split
pna_x_train, pna_x_test, pna_y_train, pna_y_test = train_test_split(pn_a_data, is_canceled, shuffle=True, test_size=0.2, random_state=20)

### **Previous booking not canceled vs Booking changes**

In [10]:
pn_b_data = [[l,p] for l,p in zip(prev_not_cancels, booking_changes)]

In [11]:
from sklearn.model_selection import train_test_split
pnb_x_train, pnb_x_test, pnb_y_train, pnb_y_test = train_test_split(pn_b_data, is_canceled, shuffle=True, test_size=0.2, random_state=20)

### **Using Attributes to make Model**

In Homework 1, we ran several tests to determine which attributes most likely affect the cancellation rate. The tests included chi-square, Pearson correlation, quartile box plots, and ANOVA. From these tests, especially the ANOVA test, we found four attributes that have the greatest impact on cancellations.

**Adults**: The more adults on a reservation, the more likely the reservation is to be canceled. 

**Previous_cancelation**: If a customer has more than 5 previous cancellations, there is little likelihood that the customer will cancel the reservation.

**Previous_booking_not_canceled**: If a customer has more than 30 previous bookings that were not canceled, the customer is likely not to cancel the current booking.

**Booking_changes**: If a customer changes a reservation more than 10 times, there is little likelihood that the customer will not cancel the reservation.

Thus, I will mainly use these four attributes to build models that find customers who are highly likely to cancel.

***
## Section: 2.3 - Transform the attributes
- Consider transforming the remaining attributes (e.g., using the data dictionary to replace the numbers with text values for some attributes – this might or might not be useful), normalizing / scaling values, encoding labels (if necessary), etc.
- Describe what you chose to do (and not do), and why.
-**NOTE: If you want to do any additional transformations, you can do them here.**
  - You also may need to do some specific transformations below for each of the classification models you choose.
-**IMPORTANT: Any transformations you do to the datasets (particularly the Test dataset) must not artificially impact the evaluation measures.  We want the chosen classification model to work "in the real world", and the Test dataset is an approximation of the real world.**
***

In Homework 2, I have decided to use four attributes: adults, previous_cancellations, previous_bookings_not_canceled, and booking_changes. These are numerical attributes measured as counts. Furthermore, their maximum values are relatively close: 55, 26, 72, and 21 respectively. Thus, I chose not to normalize or standardize the values before comparing them. 
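
If scaling were wanted later (for example, to check whether it changes the KNN distances), one way to do it without letting the Test dataset influence the transformation is to fit the scaler on the training rows only. This is a minimal sketch using scikit-learn's `MinMaxScaler`; the small arrays are hypothetical and are not the hotel data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical count-valued features (two columns), not the real dataset.
X_train_demo = np.array([[0, 0], [2, 1], [10, 4]], dtype=float)
X_test_demo = np.array([[5, 2], [20, 1]], dtype=float)

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train_demo)  # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test_demo)        # reuse the training min/max

print(X_train_scaled)
print(X_test_scaled)  # test values may fall outside [0, 1]; that is expected
```

Fitting on the training rows only mirrors the real world, where the range of future bookings is not known when the model is built.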

***
# Section: 3 - Evaluation of the Off-The-Shelf KNN Classifier
- Select the KNN classifier from the SciKit Learn library and run it on the dataset.
***

***
## Section: 3.1 - Configure the off-the-shelf KNN classifier
- Use the KNeighborsClassifier from the SciKit Learn library
- Explain all setup, parameters and execution options you chose to set, and why.
***

KNN classification is an algorithm that predicts the class attribute of a test record from the classes of the training records closest to it. In our case, the class attribute is is_canceled, so our task is to build a KNN model that predicts whether a test case (a test customer) will cancel or not. <br>
KNN places the training records as points in the feature space. For a given test record, it finds the K closest training points. If most of those nearest neighbors are canceled reservations, the model predicts that the test record is also a canceled reservation.  <br>
Choosing K, the number of nearest neighbors used for the vote, is an important step because it affects prediction accuracy. Thus, in the following section, we will use small functions to find the K with the best accuracy and the K with the best net benefit.

Moreover, I will use the measure function defined above to evaluate the correctness of the KNN model.
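
The nearest-neighbor vote described above can be illustrated with a few lines of plain Python. This is only a teaching sketch of the idea, not how scikit-learn implements `KNeighborsClassifier`; the function `knn_predict` and the sample bookings are hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k):
    """Predict the label of `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count their labels.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical bookings: (previous_cancellations, previous_bookings_not_canceled).
points = [(0, 5), (0, 3), (1, 4), (3, 0), (2, 0), (4, 1)]
labels = [0, 0, 0, 1, 1, 1]   # 1 = canceled, 0 = not canceled

prediction_a = knn_predict(points, labels, query=(3, 1), k=3)
prediction_b = knn_predict(points, labels, query=(0, 4), k=3)
print(prediction_a, prediction_b)
```

The first query sits among the canceled bookings and the second among the non-canceled ones, so the three nearest neighbors vote unanimously in each case.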

***
## Section: 3.2 - Run and evaluate the classifier
- Try several values of the K parameter and compare the results.
- Evaluate the performance of the classifier, using the evaluation methods you defined above.
***

In [9]:
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier

def findBestK(x_train, x_test, y_train, y_test):
    acc = []   # accuracy for each K
    ben = []   # net benefit (from the cost matrix) for each K
    
    # Try K = 1 through 9.
    for i in range(1,10):
        neigh = KNeighborsClassifier(n_neighbors = i).fit(x_train,y_train)
        result = neigh.predict(x_test)
        acc.append(neigh.score(x_test,y_test))

        # Cast labels to int so the confusion matrix has classes 0 and 1.
        predict = np.asarray(result, dtype = 'int')
        actual = np.asarray(y_test, dtype = 'int')
        
        net = cost(cost_matrix, actual, predict)
        ben.append(net)
    
    # List index 0 corresponds to K = 1, hence the +1.
    print("Maximum Accuracy:",max(acc),"at K =",acc.index(max(acc))+1)
    print("Maximum Cost:",max(ben),"at K =",ben.index(max(ben))+1)

In [10]:
def printMeasure(x_train, x_test, y_train, y_test, i):
    neigh = KNeighborsClassifier(n_neighbors = i).fit(x_train,y_train)
    result = neigh.predict(x_test)

    predict = np.asarray(result, dtype = 'int')
    actual = np.asarray(y_test, dtype = 'int')
    
    confusion_matrices(actual, predict)

In findBestK, we fit KNN for each K from 1 through 9 to find the K that produces the highest score and the highest net benefit. I used the score function of the sklearn KNN classifier. According to sklearn, score will "return the mean accuracy on the given test data and labels." Thus, we can use this value as the accuracy of the model. <br>
I wanted to use a higher ceiling for the loop. At first, I set the range to 40 instead of 10. However, fitting and predicting with KNN for each K took hours. If you have a more powerful computer, it might be a good idea to extend the range to search for a better K.
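
One possible alternative for the K search is scikit-learn's `GridSearchCV`, which tunes K with cross-validation on the training split only and leaves the test split for the final evaluation. The sketch below uses synthetic data from `make_classification` as a stand-in, and the range 1 to 15 is an arbitrary choice; it is not the procedure used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the real notebook would pass its own training split.
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=20)

search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 16))},
    cv=5,                 # 5-fold cross-validation on the training split only
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
held_out_accuracy = search.score(X_te, y_te)   # evaluated once, with the chosen K
print("Best K by cross-validation:", best_k)
print("Held-out accuracy:", held_out_accuracy)
```

`GridSearchCV` also accepts an `n_jobs` argument to run the fits in parallel, which may help with the long run times mentioned above.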

### **Previous Cancelation vs Previous booking not canceled**

In [54]:
findBestK(ppn_x_train, ppn_x_test, ppn_y_train, ppn_y_test)

Maximum Accuracy: 0.6747633805176313 at K = 8
Maximum Cost: 109210 at K = 8


In [55]:
printMeasure(ppn_x_train, ppn_x_test, ppn_y_train, ppn_y_test,8)

Confusion_Matrix:
 [[14954    10]
 [ 7756  1158]]
Accuracy:  0.6747633805176313
Precision:  0.9914383561643836
Recall:  0.1299080098721113
F-Measure: 0.22971632612576873


Both the maximum accuracy and the maximum net benefit occur at the same K value. Thus, I used K=8 to compute the more specific measures from the confusion matrix. <br>

**Accuracy** and **Precision**: The accuracy of the previous cancellations vs previous bookings not canceled model is about 67 percent and its precision is about 99 percent. This gives the model some credibility for estimating whether a new customer is likely to cancel. <br>

**Recall** and **F-Measure**: The model has very low recall (about 13 percent) and F-measure (about 0.23), so it misses most of the customers who actually cancel. <br>

Under the cost matrix, it is important to reduce false positives (cases where there is no risk of canceling but the model predicts a cancellation) because they have a negative net benefit. Moreover, increasing the true positives increases the total net benefit. <br>

The main room for improvement is that approximately 32 percent of the test cases are false negatives. These are cases where the customer is actually at risk of cancellation but the model predicted no. They do not produce a negative net benefit, but each one is a missed opportunity to prevent a cancellation.

Thus, this model does not predict the class attribute with high performance. However, under the cost matrix, it is an acceptable model for prediction. 
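
To put the net benefit in business terms, the test-set total can be converted to a per-reservation figure and scaled to the 100,000 reservations per year from the problem statement. This is a rough sketch only: it assumes future reservations look like the test set, where about 37 percent of bookings are cancellations rather than the 20 percent quoted in the problem statement.

```python
# Confusion matrix reported above for K = 8 (rows = actual, columns = predicted).
tn, fp, fn, tp = 14954, 10, 7756, 1158

net_benefit = tp * 95 + fp * (-80)             # same arithmetic as cost()
n_test = tn + fp + fn + tp
benefit_per_reservation = net_benefit / n_test

RESERVATIONS_PER_YEAR = 100_000                # from the problem statement
annual_estimate = benefit_per_reservation * RESERVATIONS_PER_YEAR

print(f"Net benefit on the test set: ${net_benefit:,}")
print(f"Per reservation: ${benefit_per_reservation:.2f}")
print(f"Rough annual estimate: ${annual_estimate:,.0f}")
```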



### **Previous booking not canceled vs Adults**

In [56]:
findBestK(pna_x_train, pna_x_test, pna_y_train, pna_y_test)

Maximum Accuracy: 0.6268112907278667 at K = 2
Maximum Cost: 285 at K = 2


In [58]:
printMeasure(pna_x_train, pna_x_test, pna_y_train, pna_y_test,2)

Confusion_Matrix:
 [[14964     0]
 [ 8911     3]]
Accuracy:  0.6268112907278667
Precision:  1.0
Recall:  0.0003365492483733453
F-Measure: 0.0006728720421666479


### **Previous booking not canceled vs Booking changes**

In [57]:
findBestK(pnb_x_train, pnb_x_test, pnb_y_train, pnb_y_test)

Maximum Accuracy: 0.626685652064662 at K = 4
Maximum Cost: 0 at K = 4


In [59]:
printMeasure(pnb_x_train, pnb_x_test, pnb_y_train, pnb_y_test,4)

Confusion_Matrix:
 [[14964     0]
 [ 8914     0]]
Accuracy:  0.626685652064662
Precision:  0.0
Recall:  0.0
F-Measure: 0.0




Neither the previous bookings not canceled vs adults model nor the previous bookings not canceled vs booking changes model produced a good result. They predicted almost no true positives, which we need in order to increase the net benefit under the cost matrix.

For the previous bookings not canceled vs adults model, precision is 1.0. However, this number is meaningless because it is based on only 3 true positives.

For the previous bookings not canceled vs booking changes model, precision, recall, and F-measure are all zero because the model predicted no positives at all, so there are no true positives or false positives. 

As a result, for KNN classification, the previous cancellations vs previous bookings not canceled model seems to be the best because it produces the highest net benefit, even though it has false negatives.
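
The precision values of 1.0 and 0.0 discussed above both come from models that make almost no positive predictions. The sketch below shows the edge case explicitly: when TP + FP is zero, precision is undefined and a fallback value has to be chosen. The helper `safe_precision` is hypothetical; recent scikit-learn versions expose a similar choice through the `zero_division` argument of `precision_score`.

```python
def safe_precision(tp, fp, zero_division=0.0):
    """Return TP / (TP + FP), or `zero_division` when nothing was predicted positive."""
    predicted_positive = tp + fp
    if predicted_positive == 0:
        return zero_division
    return tp / predicted_positive

# Counts taken from the two confusion matrices above.
adults_precision = safe_precision(tp=3, fp=0)    # 3 of 3 positive predictions correct
changes_precision = safe_precision(tp=0, fp=0)   # no positive predictions at all
print(adults_precision, changes_precision)
```

Reporting the number of positive predictions next to precision makes it clear when the value rests on only a handful of cases.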

***
## Section: 3.3 - Evaluate the choice of the KNN classifier
- What characteristics of the problem and data made KNN a good or bad choice?
***

Across several choices of attributes and values of K, the KNN model achieves good precision and fair accuracy. However, it has the downside of low recall. As I mentioned above, increasing the true positives and reducing the false positives is the main way to increase the net benefit for the company. Because KNN achieves high precision, it is a reasonable choice as a prediction model for customer cancellations.

***
# Section: 4 - Evaluation of Off-The-Shelf Classifier #2
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 4.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

### **SVM**

In this section, I use an SVM to predict cancellations. An SVM is a classification model that predicts the class attribute by finding a boundary between the two classes in attribute space: canceled customers and not canceled customers. In the ideal situation, the two classes are far away from each other and the boundary sits in the middle, with the largest possible margin between the boundary and the nearest points of each class. The sklearn SVM classifier finds the boundary that best separates the two classes.<br>

Then, when new test data comes in, we use the boundary to decide its class attribute.

When running an SVM in sklearn, choosing the parameter C is a very important step. C controls how much classification error the boundary will tolerate on the training data. A higher C means the model tolerates fewer errors, and a smaller C means it tolerates more.
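For reference, the role of C can be seen in the standard soft-margin SVM objective. This is the textbook form, not something copied from the sklearn source, and $w$, $b$, $\phi$, and the slack variables $\xi_i$ are the usual textbook symbols rather than names used elsewhere in this notebook:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to}\quad y_i\left(w^\top \phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0$$

Each $\xi_i$ measures how far training point $i$ violates the margin, so a larger C makes violations more expensive and a smaller C lets the boundary ignore more of them.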

Another important parameter is the kernel, which determines the shape of the boundary drawn between the classes. For example, in our case, in the homework 1 section 2.5 graph, most of the data is spread throughout the graph and it is hard to see a linear relationship. sklearn lets us choose kernels such as poly or rbf, which allow a more flexible boundary. So, in our case I used rbf when searching for C.

Lastly, gamma is another parameter of the SVM. It plays a role similar to C: it controls how tightly the boundary follows the training data. A gamma that is too high causes overfitting, and a gamma that is too low causes underfitting. I used 0.5 as gamma.
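To illustrate what gamma does, here is a small sketch of the RBF kernel, $K(x, x') = \exp(-\gamma \lVert x - x' \rVert^2)$, which is the formula behind `kernel='rbf'`. The function name `rbf_kernel_value` and the two sample points are made up for illustration.

```python
import math

def rbf_kernel_value(x, y, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

point_a = (0.0, 0.0)
point_b = (1.0, 1.0)   # squared distance to point_a is 2

# Compare how similar the two points look under different gamma values.
for gamma in (0.05, 0.5, 5.0):
    print(gamma, rbf_kernel_value(point_a, point_b, gamma))
```

A larger gamma makes the similarity drop off faster with distance, so each training point influences only its close neighbors and the boundary can bend more tightly around the data.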

***
## Section: 4.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation methods you defined above.
***

In [12]:
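from sklearn import metrics
from sklearn.svm import SVC
import math
import numpy as np

def findBestC(x_train, x_test, y_train, y_test):
    acc = []          # accuracy of each fitted model
    ben = []          # net benefit of each fitted model
    num = 1           # current value of C, starting at 1.0
    increase = 0.1    # amount subtracted from C after each run
    for i in range(1,5):   # 4 runs, with C dropping by 0.1 each time
        classifier = SVC(kernel = 'rbf', C=num, gamma=0.5).fit(x_train, y_train)
        result = classifier.predict(x_test)
        acc.append(classifier.score(x_test,y_test))
        
        predict = np.asarray(result, dtype = 'int')
        actual = np.asarray(y_test, dtype = 'int')
        
        # cost and cost_matrix are defined earlier in the notebook
        net = cost(cost_matrix, actual, predict)
        ben.append(net)
        num -= increase
    # NOTE: at this point num holds the value left after the last decrement (about 0.6),
    # and the printed C is math.pow(num, index + 1). A printed 0.6 therefore means
    # index 0, which is the run that was fitted with C = 1.0.
    print("Maximum Accuracy:",max(acc),"at C =",math.pow(num,acc.index(max(acc))+1))
    print("Maximum Cost:",max(ben),"at C =",math.pow(num,ben.index(max(ben))+1))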

In [13]:
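def printSVMMeasure(x_train, x_test, y_train, y_test, num):
    # NOTE: this fits a linear kernel, while findBestC searches with kernel='rbf'.
    # SVC ignores gamma when kernel='linear'.
    classifier = SVC(kernel = 'linear', C=num, gamma=0.5).fit(x_train, y_train)
    result = classifier.predict(x_test)
        
    predict = np.asarray(result, dtype = 'int')
    actual = np.asarray(y_test, dtype = 'int')
    
    # Print the confusion matrix, accuracy, precision, recall, and F-measure
    confusion_matrices(actual, predict)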

findBestC is similar to findBestK. Starting from C = 1.0 and decreasing by 0.1 each time, the function fits one SVM classifier per value of C and looks for the C with the highest score and net benefit. In the function, I used the score method of the sklearn SVM classifier. According to the sklearn documentation, score will "return the mean accuracy on the given test data and labels." Thus, we can use this value as the accuracy of the model. <br>

In this function, I wanted to use a higher ceiling for the loop. At first, I set the range to 10 instead of 5, because I used 10 as the maximum range for the KNN algorithm. However, fitting and predicting with each SVM took many more hours than KNN. If you have a higher-performance computer, it might be a good idea to extend the range further to find a better C.
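One way to make such a sweep easier to read is to store each C next to the results it produced. The sketch below is an assumption about how this could be organized, not the code that produced the outputs in this notebook. `sweep_c` and `toy_evaluate` are hypothetical names, and `toy_evaluate` stands in for fitting an `SVC` and computing its accuracy and net benefit.

```python
def sweep_c(c_values, evaluate):
    """Evaluate every C and return the rows with the best accuracy and best net benefit.

    evaluate(c) must return an (accuracy, net_benefit) tuple.
    """
    rows = []
    for c in c_values:
        accuracy, net_benefit = evaluate(c)
        rows.append((c, accuracy, net_benefit))
    best_by_accuracy = max(rows, key=lambda row: row[1])
    best_by_benefit = max(rows, key=lambda row: row[2])
    return best_by_accuracy, best_by_benefit

# Build the grid from integers to avoid accumulating floating-point error.
c_grid = [step / 10 for step in range(10, 6, -1)]   # 1.0, 0.9, 0.8, 0.7

def toy_evaluate(c):
    # Placeholder scores for illustration only; these are not real model results.
    return 0.6 + 0.05 * c, 1000 * c

best_acc, best_ben = sweep_c(c_grid, toy_evaluate)
print("Best accuracy:", best_acc[1], "at C =", best_acc[0])
print("Best net benefit:", best_ben[2], "at C =", best_ben[0])
```

Keeping the C value in the same row as its scores means the reported C is always the one that was actually used to fit the model.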

### **Previous Cancelation vs Previous booking not canceled**

In [79]:
findBestC(ppn_x_train, ppn_x_test, ppn_y_train, ppn_y_test)

Maximum Accuracy: 0.6743864645280174 at C = 0.6000000000000001
Maximum Cost: 108535 at C = 0.6000000000000001


In [17]:
printSVMMeasure(ppn_x_train, ppn_x_test, ppn_y_train, ppn_y_test, 0.6)

Confusion_Matrix:
 [[14942    22]
 [ 7753  1161]]
Accuracy:  0.6743864645280174
Precision:  0.981403212172443
Recall:  0.13024455912048463
F-Measure: 0.22996929781123104


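As a result, findBestC reported 0.6 as the C with both the maximum accuracy and the maximum net benefit, so I used C = 0.6 to compute the confusion matrix. One caveat: findBestC prints math.pow(num, index + 1) with num left at about 0.6 after the loop, so a printed value of 0.6 corresponds to the first run of the loop, which was fitted with C = 1.0.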

**Accuracy** and **Precision**: Like the KNN model, the SVM produced fairly good accuracy and precision. Good precision means a high number of true positives relative to false positives, which increases the net benefit.

**Recall** and **F-measure**: Like the KNN model, the SVM shows low recall, which pulls the F-measure down. In this case, we have a fairly high number of false negatives. False negatives do not cost anything. However, our goal is to reduce the false negatives and increase the true positives to increase the net benefit.
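The link between recall and the F-measure follows from its definition as the harmonic mean of precision $P$ and recall $R$. Plugging in the rounded values printed above:

$$F = \frac{2PR}{P + R} = \frac{2 \times 0.9814 \times 0.1302}{0.9814 + 0.1302} \approx 0.230$$

Because the harmonic mean is dominated by the smaller of the two values, the F-measure stays low even though precision is almost 1.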

This model is also a good model in terms of producing net benefit. However, its downside, the high share of false negatives (low recall), must be addressed to increase the net benefit further.
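The net benefit can be recomputed from the confusion matrix above. This is a rough sketch rather than the notebook's `cost` function: the payoffs of +95 per true positive and -80 per false positive are assumptions inferred from the reported numbers (they reproduce the printed 108535), and `net_benefit` is a hypothetical helper.

```python
def net_benefit(tp, fp, tp_value=95, fp_cost=80):
    """Net benefit when only true positives earn money and only false positives cost money.

    tp_value and fp_cost are inferred assumptions, not values read from cost_matrix.
    """
    return tp * tp_value - fp * fp_cost

# True positives and false positives from the SVM confusion matrix above.
svm_net = net_benefit(tp=1161, fp=22)
print("SVM net benefit:", svm_net)
```

Under these assumed payoffs, false negatives and true negatives contribute nothing, so the only ways to raise the total are more true positives or fewer false positives.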

### **Previous booking not canceled vs Adults**

In [14]:
findBestC(pna_x_train, pna_x_test, pna_y_train, pna_y_test)

Maximum Accuracy: 0.626685652064662 at C = 0.6000000000000001
Maximum Cost: 0 at C = 0.6000000000000001


In [18]:
printSVMMeasure(pna_x_train, pna_x_test, pna_y_train, pna_y_test, 0.6)

Confusion_Matrix:
 [[14964     0]
 [ 8914     0]]
Accuracy:  0.626685652064662
Precision:  0.0
Recall:  0.0
F-Measure: 0.0




### **Previous booking not canceled vs Booking changes**

In [16]:
findBestC(pnb_x_train, pnb_x_test, pnb_y_train, pnb_y_test)

Maximum Accuracy: 0.626685652064662 at C = 0.6000000000000001
Maximum Cost: 0 at C = 0.6000000000000001


In [19]:
printSVMMeasure(pnb_x_train, pnb_x_test, pnb_y_train, pnb_y_test, 0.6)

Confusion_Matrix:
 [[14964     0]
 [ 8914     0]]
Accuracy:  0.626685652064662
Precision:  0.0
Recall:  0.0
F-Measure: 0.0




Like the KNN model, both previous booking not canceled vs. adults and previous booking not canceled vs. booking changes did not produce any meaningful result according to the maximum cost and the confusion matrix.

***
## Section: 4.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

Like KNN, the SVM did a good job of producing a meaningful result from previous cancellations and previous booking not canceled. The positive net balance shows that the SVM model's predictions bring a positive result to the company. However, as I mentioned above, it has more false negatives than KNN. I think this is because of the values of the parameters C and gamma. Furthermore, the SVM takes much longer than KNN to produce a result, which forced me to lower the loop ceiling from 10 to 5.


***
# Section: 5 - Evaluation of Off-The-Shelf Classifier #3
- As with the KNN classifier above, choose another classifier from the SciKit Learn library (Decision Tree, SVM, Logistic Regression, etc.) and run it on the dataset.
***

***
## Section: 5.1 - Configure the classifier
- Use the appropriate classifier from the SciKit Learn library.
- Explain all setup, parameters and execution options you chose to set, and why.
***

### **Logistic Regression**

Logistic regression is a classification model that outputs a probability between 0 and 1 for each test record. For example, in our case, if the model gives a test record a probability greater than 0.5, we assume that the customer has a high likelihood of canceling the reservation. <br>

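Logistic regression is a somewhat more complicated idea than KNN and SVM. It uses log-odds and the sigmoid function to calculate the probability of customer cancellation. The log-odds is the logarithm of the odds of cancellation. In our case, 20 percent of customers cancel their reservation and 80 percent do not, so the odds of cancellation are 0.2/0.8 = 0.25 (equivalently, the odds of not canceling are 0.8/0.2 = 4), and the log-odds is ln(0.25), about -1.39. The model expresses the log-odds as a linear function of the attributes and then uses the sigmoid function to map the log-odds back into a probability between 0 and 1.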
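The log-odds and sigmoid steps can be sketched with the 20 percent cancellation rate from the problem statement. This only illustrates the two transformations, not a fitted model, and `log_odds` and `sigmoid` are illustrative helper names.

```python
import math

def log_odds(p):
    """Logarithm of the odds p / (1 - p) for a probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Map a log-odds value z back to a probability between 0 and 1."""
    return 1 / (1 + math.exp(-z))

cancel_rate = 0.2
odds = cancel_rate / (1 - cancel_rate)   # odds of cancellation
z = log_odds(cancel_rate)                # log-odds of cancellation

print("odds:", odds)
print("log-odds:", z)
print("back to probability:", sigmoid(z))
```

The sigmoid is the inverse of the log-odds, so applying it to the log-odds recovers the original probability. A log-odds of 0 corresponds to a probability of 0.5, the threshold used to predict a cancellation.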

***
## Section: 5.2 - Run and evaluate the classifier
- Try several values of the parameters (if appropriate) and compare the results.
- Evaluate the performance of the classifier, using the evaluation methods you defined above.
***

For logistic regression, I did not set any specific parameters; I used the sklearn default settings.

### **Previous Cancelation vs Previous booking not canceled**

In [25]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(ppn_x_train, ppn_y_train)
result = model.predict(ppn_x_test)

predict = np.asarray(result, dtype = 'int')
actual = np.asarray(ppn_y_test, dtype = 'int')

net = cost(cost_matrix, actual, predict);
print("Net Balance is ", net)
confusion_matrices(actual, predict)

Net Balance is  108565
Confusion_Matrix:
 [[14940    24]
 [ 7751  1163]]
Accuracy:  0.6743864645280174
Precision:  0.9797809604043808
Recall:  0.13046892528606685
F-Measure: 0.23027423027423027


Because logistic regression was run with the default settings, there are no parameter values to compare here.

**Accuracy** and **Precision**: Like the SVM model, logistic regression produced fairly good accuracy and precision. As I mentioned above, good precision means a high number of true positives relative to false positives, which increases the net benefit.

**Recall** and **F-measure**: Like the KNN and SVM models, logistic regression shows low recall, which pulls the F-measure down. In this case, we have a fairly high number of false negatives. False negatives do not cost anything. However, our goal is to reduce the false negatives and increase the true positives to increase the net benefit.

This model is also a good model in terms of producing net benefit. However, its downside, the high share of false negatives (low recall), must be addressed to increase the net benefit further.

### **Previous booking not canceled vs Adults**

In [24]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(pna_x_train, pna_y_train)
result = model.predict(pna_x_test)

predict = np.asarray(result, dtype = 'int')
actual = np.asarray(pna_y_test, dtype = 'int')

net = cost(cost_matrix, actual, predict);
print("Net Balance is ", net)
confusion_matrices(actual, predict)

Net Balance is  190
Confusion_Matrix:
 [[14964     0]
 [ 8912     2]]
Accuracy:  0.6267694111734651
Precision:  1.0
Recall:  0.0002243661655822302
F-Measure: 0.00044863167339614175


### **Previous booking not canceled vs Booking changes**

In [23]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit(pnb_x_train, pnb_y_train)
result = model.predict(pnb_x_test)

predict = np.asarray(result, dtype = 'int')
actual = np.asarray(pnb_y_test, dtype = 'int')

net = cost(cost_matrix, actual, predict);
print("Net Balance is ", net)
confusion_matrices(actual, predict)

Net Balance is  0
Confusion_Matrix:
 [[14964     0]
 [ 8914     0]]
Accuracy:  0.626685652064662
Precision:  0.0
Recall:  0.0
F-Measure: 0.0




Similar to the KNN and SVM cases, these attributes again did not produce any meaningful result.

***
## Section: 5.3 - Evaluate the choice of the classifier
- What characteristics of the problem and data made the classifier a good or bad choice?
***

Like SVM and KNN, the positive side of logistic regression is that it produces a meaningful net benefit for previous cancellations and previous booking not canceled. It shows good precision, which increases the net benefit of the model, while keeping false positives low, the only case that reduces the total benefit. <br>

Even though logistic regression did not produce the highest net benefit, I believe it is the fastest of the three algorithms at producing a result, compared with KNN and SVM.
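The speed comparison above is based on observation while running the notebook. A sketch of how it could be measured is shown below. `time_call` is a hypothetical helper, and `slow_sum` is only a stand-in for fitting a model.

```python
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

def slow_sum(n):
    # Stand-in workload for illustration; a real comparison would time each model's fit call.
    return sum(range(n))

value, seconds = time_call(slow_sum, 100000)
print("result:", value, "elapsed seconds:", seconds)
```

Timing each classifier on the same training data in this way would turn the impression about speed into a number that can be reported next to the net benefit.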

***
# Section: 6 - Comparison of the Three Classifiers
***

***
## Section: 6.1 - Compare the performance of these classifiers to each other
- What are their strong and weak points?
***

If a model were able to classify all of the customers correctly, the maximum net benefit for the test set would be 846830. This is about 8 times greater than the net benefit of the models that I created with KNN, SVM, and logistic regression.
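The figure of 846830 is consistent with a payoff of 95 per correctly predicted cancellation applied to the 8914 actual cancellations in the test set. The value 95 is inferred from the reported numbers rather than read from the cost matrix itself:

$$8914 \times 95 = 846830, \qquad \frac{846830}{108565} \approx 7.8$$

The ratio uses the logistic regression net balance of 108565; the SVM value of 108535 gives nearly the same ratio.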


KNN: Overall, KNN is the classification algorithm that produced the highest net benefit compared with the other two algorithms, SVM and logistic regression. In KNN, we look for the K that produces the highest accuracy (from .score) and the highest net benefit. K is the number of nearest training records that are used to classify a test record. I tested K with a loop ceiling of 10. During the test, K did not have any linear relationship with the net benefit or the accuracy. In other words, even when K increased, the net benefit and accuracy did not actually increase. The downside of KNN is that, like SVM, it took some time to produce a result, and finding an appropriate K is not easy: we have to loop through each value, and there does not seem to be any correlation between K and the accuracy or the net benefit.

SVM: SVM was the slowest of the three algorithms. SVM is a classification model that draws a boundary between the classes and checks which side of the boundary the test data falls on. In SVM, there are two main parameters that the user can set: C and gamma. C sets how much error or how many outliers the boundary is allowed to tolerate, and gamma sets how tightly the boundary follows the training data. While I tested KNN with a loop ceiling of 10, I was only able to use a ceiling of 5 with SVM because of the speed of the algorithm. The difference in net benefit between KNN and SVM is only 70 dollars, which is not a big difference. Furthermore, if my computer had higher performance, I would be able to test more values of C and gamma to see the difference.

Logistic regression: Logistic regression is based on the odds of the class attribute. It is the only classification algorithm here for which I did not set any parameters, and it produced its result the fastest. The result is similar to KNN and SVM: high precision and low recall.

***
## Section: 6.2 - Choose a Best Classifier
- Choose one of the three classifiers as best and explain why.
***

For this homework, the algorithm I recommend most is KNN. Even though KNN is not the fastest, it produced the best net benefit and accuracy. Furthermore, its parameter K could be tuned further, possibly increasing the accuracy and net benefit, if my computer had better performance. The algorithm I recommend least is SVM. In SVM, I was able to set the C and gamma values, which control how much error the model will accept. As I said, testing only the C parameter took almost days to finish. Thus, I would not recommend using this SVM model.

***
# Section: 7 - Conclusions
- Write a paragraph on what you discovered or learned from this homework.
- What are your overall conclusions about the data?
- What did you learn? What would you explore further with additional data, time or resources. What might "future research" require to gain deeper insight? 
***

Through homework 2, I learned the details of running KNN, SVM, and logistic regression in code. When we learned classification in lecture, I only understood what the models are. Now I am able to explain their features and how those features interact with the data to produce the test result. <br>

As a result, I believe the model that I created with KNN is neither bad nor very good. It produces some benefit from true positives. However, the true positive rate is not high, which limits the net benefit. On the other hand, the model keeps false positives low, the case where a customer is not at risk of canceling but the model predicts a cancellation. This is the only case that can produce a negative net balance, and the KNN model shows only approximately 10 false positives, which is really good. <br>

In future homework, I might try to combine more attributes into one and normalize the data to see how the performance changes. During homework 1, I decided that previous cancellations and previous booking not canceled were the most effective attributes after using chi-square and Pearson tests to compare them with other attributes. So, next time I might dig deeper into other attributes to build the model.
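As a starting point for the normalization idea, here is a minimal min-max scaling sketch. It is one possible choice of normalization, not something that was run in this homework, and `min_max_normalize` and the sample values are made up for illustration.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lowest = min(values)
    highest = max(values)
    if highest == lowest:
        # All values are equal, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lowest) / (highest - lowest) for v in values]

example_counts = [0, 1, 2, 26]   # made-up attribute values
print(min_max_normalize(example_counts))
```

Putting attributes on the same scale matters most for distance-based models such as KNN and the RBF-kernel SVM, because otherwise an attribute with a large range dominates the distance.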

***
### END-OF-SUBMISSION
***