# Lesson 1: Classification

## Review

**Question:** What is classification? 

**Question:** Is classification a form of supervised or unsupervised learning? Why? 

Let's review the example we saw last time. The code below reads our data points from a file and then creates a scatterplot.

In [None]:
%matplotlib inline
# import necessary packages
import matplotlib.pyplot as plt 
import pandas
import numpy as np

# Read the data, and put it into a variable. 
data = pandas.read_csv("../assets/classification-synthetic.csv") 

# show the first five data points
# in the data, "r" stands for red, and "b" stands for blue
print(data.head()) 

In [None]:
# Create a scatter plot. 
# "c" stands for color. We color our points by class.
plt.scatter(x=data["x"], y=data["y"], c=data["class"]) 

The red points are in group Red, and the blue points are in group Blue. The two classes have different characteristics; the Red class is in the top left corner of the plot, and the Blue class is in the bottom right.

If we are given a new point, we can classify it by how similar it is to the other data points. For example, the new point shown below belongs in the Red class.

In [None]:
# Create a new point at (3,7) and color it black.
new_point = pandas.DataFrame({"x":[3], "y":[7], "class":["k"]}) 
# Add the new point to the dataset. 
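# (DataFrame.append was removed in pandas 2.0, so we use pandas.concat.)
new_data = pandas.concat([data, new_point], ignore_index=True)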
plt.scatter(x=new_data["x"], y=new_data["y"], c=new_data["class"]) 

And the point below belongs to the Blue class since it is closer to the Blue points.

In [None]:
# Create and add a new point to the dataset. 
new_point = pandas.DataFrame({"x":[6], "y":[5], "class":["k"]}) 
new_data = pandas.concat([data, new_point], ignore_index=True)
plt.scatter(x=new_data["x"], y=new_data["y"], c=new_data["class"]) 
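
The idea of classifying a new point by how close it is to existing points can be written out directly. The sketch below is only an illustration: the coordinates are made-up stand-ins for the lesson's CSV, and `classify_by_nearest_point` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical stand-in points; the lesson's CSV is not reproduced here.
red_points = np.array([[2.0, 8.0], [3.0, 9.0], [1.0, 7.0]])
blue_points = np.array([[8.0, 2.0], [7.0, 1.0], [9.0, 3.0]])

def classify_by_nearest_point(new_point, red, blue):
    """Return "r" or "b" depending on which class holds the closest point."""
    new_point = np.asarray(new_point, dtype=float)
    # Euclidean distance from the new point to the closest point in each class
    red_distance = np.linalg.norm(red - new_point, axis=1).min()
    blue_distance = np.linalg.norm(blue - new_point, axis=1).min()
    return "r" if red_distance <= blue_distance else "b"

print(classify_by_nearest_point((3, 7), red_points, blue_points))  # r
print(classify_by_nearest_point((6, 5), red_points, blue_points))  # b
```

With these stand-in points, (3, 7) lands in the Red class and (6, 5) in the Blue class, matching the two plots above.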

*Support vector machines* (SVMs) separate classes with a boundary; in two dimensions, that boundary is a line. We can draw a line between the Red and Blue groups. New points on one side of the line belong to the Red group, and points on the other side belong to the Blue group.

In [None]:
# Create an equidistant line between the classes
plt.scatter(x=data["x"], y=data["y"], c=data["class"]) 
x = np.linspace(0, 10, 1000)
plt.plot(x, 1.5*x-3, color='black')
plt.xlim(0,11)
plt.ylim(0,11)
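
Once a line is drawn, deciding which side a point falls on is simple arithmetic. The sketch below uses the hand-picked line from the plot, y = 1.5x - 3, and assumes Red lies above it (top left) and Blue below it (bottom right); `side_of_line` is a hypothetical helper for illustration.

```python
def side_of_line(x, y, slope=1.5, intercept=-3.0):
    """Classify a point by which side of y = slope * x + intercept it lies on."""
    # Positive when the point is above the line, negative when it is below
    offset = y - (slope * x + intercept)
    # Assumption: Red sits above the line, Blue below it.
    # Points exactly on the line are arbitrarily assigned to Blue here.
    return "Red" if offset > 0 else "Blue"

print(side_of_line(3, 7))  # Red:  7 - (1.5 * 3 - 3) = 5.5
print(side_of_line(6, 5))  # Blue: 5 - (1.5 * 6 - 3) = -1.0
```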

## Classification for the World Happiness Dataset

We're going to take the World Happiness Dataset we examined last time and apply classification to it. 

Today, we are going to classify countries into three classes: **low, medium, and high happiness**. Low happiness is defined as a happiness score below 4, medium happiness is a score from 4 to 6, and high happiness is a score above 6.
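
As a rough sketch of this labelling rule, the function below maps scores to the numeric labels 1 (low), 2 (medium), and 3 (high). It assumes the cut-offs of 4 and 6 used in this lesson's labelling code; `happiness_class` and the example scores are made up for illustration.

```python
import numpy as np

def happiness_class(scores, low_cutoff=4, high_cutoff=6):
    """Label each score as 1 (low), 2 (medium), or 3 (high)."""
    scores = np.asarray(scores, dtype=float)
    # Scores exactly equal to a cut-off fall into the medium class
    return np.select(
        [scores < low_cutoff, scores > high_cutoff],
        [1, 3],
        default=2,
    )

example_scores = [3.2, 4.0, 5.5, 6.0, 7.4]  # made-up scores
print(happiness_class(example_scores).tolist())  # [1, 2, 2, 2, 3]
```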

In [None]:
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from matplotlib import rcParams
from PIL import Image
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn import svm

%matplotlib inline
# figure size 
rcParams['figure.figsize'] = 20,20

In [None]:
# Load World Happiness Data
df = pd.read_csv("../assets/happinessDataset/2015.csv")

In [None]:
df.columns

In [None]:
classification_data = df[["Happiness Score", 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual']]
classification_data

In [None]:
# normalize (standardize) the data: each column gets mean 0 and standard deviation 1
ss = StandardScaler()
transformed_data = ss.fit_transform(classification_data)
transformed_df = pd.DataFrame(transformed_data, index=classification_data.index, columns=classification_data.columns)
transformed_df
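
To make the scaling step less of a black box, the sketch below checks on a small made-up table that `StandardScaler` subtracts each column's mean and divides by its standard deviation (the population standard deviation, `ddof=0`). The `toy` DataFrame is an assumption for illustration, not the happiness data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up table standing in for the happiness columns
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})

# What the lesson's cell does
scaled = StandardScaler().fit_transform(toy)

# The same computation by hand: (value - column mean) / column standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(np.allclose(scaled, manual.to_numpy()))  # True
print(scaled.mean(axis=0))                     # approximately 0 for each column
print(scaled.std(axis=0))                      # approximately 1 for each column
```

Scaling matters for an SVM because columns measured on larger scales would otherwise dominate the distance calculations.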

In [None]:
# add a new column with the class label

# silence warning
import warnings
warnings.filterwarnings('ignore')

# set scores >6 to high, <4 to low, and everything else to medium
# 1 is low, 2 is medium, 3 is high
classification_data["Class"] = 2
# use a single .loc call; chained assignment (df["Class"].loc[...] = ...) may silently fail
classification_data.loc[classification_data["Happiness Score"] > 6, "Class"] = 3
classification_data.loc[classification_data["Happiness Score"] < 4, "Class"] = 1

classification_data

In [None]:
# create an SVM
clf = svm.LinearSVC()
# apply the SVM to the data
clf.fit(transformed_df[["Happiness Score", 'Economy (GDP per Capita)', 'Family',
       'Health (Life Expectancy)', 'Freedom', 'Trust (Government Corruption)',
       'Generosity', 'Dystopia Residual']], 
        classification_data["Class"])
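
To see what `fit` and `predict` do in isolation, here is a minimal sketch on two synthetic, well-separated clusters. The cluster centers, the random seed, and the name `toy_clf` are assumptions chosen for illustration; they are not taken from the lesson's data.

```python
import numpy as np
from sklearn import svm

# Two synthetic clusters: "r" near (2, 8) and "b" near (8, 2)
rng = np.random.default_rng(0)
red = rng.normal(loc=[2.0, 8.0], scale=0.5, size=(20, 2))
blue = rng.normal(loc=[8.0, 2.0], scale=0.5, size=(20, 2))
X = np.vstack([red, blue])
y = np.array(["r"] * 20 + ["b"] * 20)

# Fit a linear SVM, then ask it to classify two new points
toy_clf = svm.LinearSVC(random_state=0)
toy_clf.fit(X, y)
print(toy_clf.predict([[3, 7], [7, 3]]))

# The learned boundary is the line coef_[0][0] * x + coef_[0][1] * y + intercept_[0] = 0
print(toy_clf.coef_, toy_clf.intercept_)
```

With only two classes, the model learns a single line; with the three happiness classes, `LinearSVC` fits one boundary per class (one-vs-rest by default).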

In [None]:
# Plot the SVM
xx, yy = np.meshgrid(np.arange(transformed_df["Happiness Score"].min(), 
                               transformed_df["Happiness Score"].max(), 0.02),
                     np.arange(transformed_df["Economy (GDP per Capita)"].min(), 
                               transformed_df["Economy (GDP per Capita)"].max(), 0.02)
                    )
length = xx.ravel().shape[0]

Z = clf.predict(np.c_[xx.ravel(), yy.ravel(), [0]*length, [0]*length, [0]*length, [0]*length, [0]*length, [0]*length])

Z = Z.reshape(xx.shape)

plt.contourf(xx, yy, Z, cmap=plt.cm.coolwarm, alpha=0.8)

plt.scatter(transformed_df['Happiness Score'],
            transformed_df['Economy (GDP per Capita)'], 
            c=classification_data['Class'], 
            edgecolors='k',
            cmap=plt.cm.coolwarm)

plt.title('Support Vector Machine')
plt.xlabel('Happiness Score - Normalized')
plt.ylabel('Economy (GDP per Capita) - Normalized')

**Question**: Why do you think so many "low" and "high" happiness countries get misclassified as "medium" happiness? 

### Activity
Change the code in the cell above to plot a variable other than "Economy (GDP per Capita)." How is it different from the "Economy (GDP per Capita)" plot?