### Multiclass Classification with Logistic Regression

1. <b>Good data management</b>
2. <b>One-versus-all classification</b>
3. <b>Model testing</b>
4. <b>Prediction</b>
5. <b>Accuracy</b>

In [1]:
import json
import matplotlib
import warnings
import pandas as pd
import numpy as np
import re 
from IPython.core.pylabtools import figsize
from sklearn.linear_model import LogisticRegression

root = r"/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/styles/bmh_matplotlibrc.json"
# load the style settings, closing the file afterwards
with open(root) as fp:
    s = json.load(fp)
warnings.simplefilter("ignore")
%matplotlib inline

The dataset we will be working with contains information on various cars. The task is to predict the origin of a vehicle,
*North America*, *Europe*, or *Asia*, based on some technical parameters.

In [2]:
# import the data
headers = ["mpg", "cylindres", "displacement", "horsepower", "weight", "acceleration",
                            "year", "origin", "name"]


def handle(s):
    """
    Convenience function: convert one raw field.
        '?' becomes None, numeric strings become float or int,
        anything else is stripped of whitespace and double quotes.
    """
    if s == '?':
        return None
    elif re.match(r'^\d+\.\d+$', s):
        return float(s)
    elif re.match(r'^\d+$', s):
        return int(s, 10)
    else:
        return s.strip().strip('"')

    
with open('/Users/Kenneth-Aristide/anaconda3/bin/python_prog/ML/data/auto.csv') as fp:
    data = [
        tuple(map(handle, row))
        for row in (line.split(maxsplit=8) for line in fp)]
    
# read into pandas DataFrame
data = pd.DataFrame(data, columns = headers)
data = data.drop("name", axis = 1)

# clean the data
data["horsepower"] = data["horsepower"].fillna(data["horsepower"].median())
data.tail()

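|     | mpg  | cylindres | displacement | horsepower | weight | acceleration | year | origin |
|-----|------|-----------|--------------|------------|--------|--------------|------|--------|
| 393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
| 394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
| 395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
| 396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
| 397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |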


In [3]:
# cylindres takes 5 distinct values: split it into five binary (dummy) columns, and do the same for year
dummy_cylindres = pd.get_dummies(data["cylindres"], prefix = "cycl")
dummy_year = pd.get_dummies(data["year"], prefix = "year")

car_data = pd.concat([data, dummy_cylindres, dummy_year], axis = 1)
car_data.head()

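|   | mpg | cylindres | displacement | horsepower | weight | acceleration | year | origin | cycl_3 | cycl_4 | ... | year_73 | year_74 | year_75 | year_76 | year_77 | year_78 | year_79 | year_80 | year_81 | year_82 |
|---|-----|-----------|--------------|------------|--------|--------------|------|--------|--------|--------|-----|---------|---------|---------|---------|---------|---------|---------|---------|---------|---------|
| 0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |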
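
To make the dummy encoding concrete, here is a minimal sketch on a hypothetical four-row column (the values are invented for illustration and are not taken from the car data). It shows how `pd.get_dummies` turns each distinct value into its own binary column, named with the given prefix.

```python
import pandas as pd

# Hypothetical toy column, not the notebook's data.
cylinders = pd.Series([4, 8, 4, 6], name="cylindres")

# One binary column per distinct value; the prefix names the new columns.
dummies = pd.get_dummies(cylinders, prefix="cycl")

print(list(dummies.columns))                     # ['cycl_4', 'cycl_6', 'cycl_8']
print(dummies.astype(int).to_numpy().tolist())   # [[1, 0, 0], [0, 0, 1], [1, 0, 0], [0, 1, 0]]
```

Each row has exactly one 1, in the column matching its original value.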


### One versus all classification
It's a technique where we choose a single category as the positive case and group the remaining categories as the negative case.
We're essentially splitting the problem into multiple binary classification problems.<br>
For each observation, each model will then output the probability of belonging to its category.

In [4]:
# first let's split into train and test
np.random.seed(0)
shuffled_index = np.random.permutation(car_data.index)
shuffled_car_data = car_data.loc[shuffled_index]

# holdout validation: 70% train, 30% test
split_line = int(car_data.shape[0] * .70)
train = shuffled_car_data[:split_line]
test = shuffled_car_data[split_line:]

# useful container
unique_origin = car_data.origin.unique()
unique_origin.sort()

<b>In the one-vs-all approach we're essentially converting an $n$-class ($n = 3$) classification problem into $n$ binary classification problems.</b> In our case we need to train 3 models:

1. A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
2. A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
3. A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).

Each of these models is a binary classification model that returns a probability between 0 and 1. When we apply the models to new data, each one returns a probability value (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.
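
The relabeling and the final "pick the highest probability" step can be sketched with NumPy alone. The labels and probabilities below are made up for illustration; in particular, the probability matrix stands in for the output of three fitted binary models and does not come from the car data.

```python
import numpy as np

# Hypothetical origin labels for five cars (1 = North America, 2 = Europe, 3 = Asia).
origins = np.array([1, 3, 2, 1, 3])
classes = np.unique(origins)

# One binary target vector per class: True for that class, False for the rest.
binary_targets = {k: origins == k for k in classes}

# Made-up positive-class probabilities, one column per binary model.
probabilities = np.array([
    [0.80, 0.15, 0.30],
    [0.20, 0.10, 0.65],
    [0.40, 0.55, 0.35],
])

# For each row, pick the class whose model reported the highest probability.
predicted = classes[np.argmax(probabilities, axis=1)]
print(predicted)  # [1 3 2]
```

Note that the rows need not sum to 1, because each column comes from a separately trained model.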


In [7]:
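# Train one binary model per origin, using the cylindres and year columns as features.
# Note: "cyl" and "year" also match the raw "cylindres" and "year" columns,
# so both are used alongside the dummy columns.

models = {}

features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]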

for origin in unique_origin:
    model = LogisticRegression()
    
    X_train = train[features]
    y_train = train["origin"] == origin
    
    model.fit(X_train, y_train)
    
    models[origin] = model
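
As a possible cross-check on the manual loop, scikit-learn also provides `OneVsRestClassifier` in `sklearn.multiclass`, which wraps the same one-model-per-class idea. The sketch below uses synthetic, well-separated 2-D points rather than the car data; the cluster centers, noise scale, and sample sizes are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic, well-separated 2-D data for three classes (not the car data).
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])
y = np.repeat([1, 2, 3], 30)

# Fits one LogisticRegression per class, each treating that class as positive.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X, y)

print(len(ovr.estimators_))   # number of fitted binary models
print(ovr.predict(centers))   # predicted label for each cluster center
```

The manual dictionary of models keeps each step visible, while the wrapper is shorter; both rely on the same relabeling of the target.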

### Testing the models
Now that we have a model for each category, we can run the test set through the models and evaluate how well they perform.

In [8]:
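# one column of positive-class probabilities per origin, indexed like the test set
test_results = pd.DataFrame(index = test.index, columns = unique_origin)

for k, v in models.items():
    # column 1 of predict_proba is the probability of the positive (True) class
    test_results[k] = (v.predict_proba(test[features]))[:, 1]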
    

Each column in our dataframe `test_results` represents an origin, so we just need to choose the one with <b>the largest probability</b>. We can use the DataFrame method `.idxmax()` to return a Series where each value is the label of the column where the maximum value occurs for that observation. We need to set the axis parameter to 1 since we want the maximum across columns. Since each column maps directly to an origin, the resulting Series is the classification from our model.
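
A small, self-contained illustration of `idxmax(axis=1)` follows. The probabilities and the row labels (`car_a` and so on) are invented for the example; only the column labels mirror the origin codes used above.

```python
import pandas as pd

# Made-up probabilities; columns are origin codes, rows are observations.
probs = pd.DataFrame(
    {1: [0.7, 0.2, 0.3], 2: [0.1, 0.5, 0.3], 3: [0.2, 0.3, 0.4]},
    index=["car_a", "car_b", "car_c"],
)

# axis=1 scans across the columns of each row and returns the winning column label.
labels = probs.idxmax(axis=1)
print(labels.tolist())  # [1, 2, 3]
```

The result keeps the row index of the input, so it lines up with the observations it was computed from.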

In [9]:
predicted_origins = test_results.idxmax(axis = 1)

In [10]:
# compare by position: share of test rows whose predicted origin equals the true origin
accuracy = (predicted_origins.values == test["origin"].values).mean()
print(accuracy)

0.6166666666666667
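
A single accuracy figure does not show which origins are confused with each other. One way to look closer, sketched here on hypothetical labels (not the notebook's predictions), is to cross-tabulate true against predicted origins with `pd.crosstab`.

```python
import pandas as pd

# Hypothetical true and predicted origin codes, not the notebook's results.
actual = pd.Series([1, 1, 2, 3, 3, 1], name="actual")
predicted = pd.Series([1, 1, 1, 3, 2, 1], name="predicted")

# Overall accuracy: share of positions where prediction equals the true label.
accuracy = (actual == predicted).mean()

# Rows are true labels, columns are predictions; the diagonal counts correct calls.
confusion = pd.crosstab(actual, predicted)

print(round(accuracy, 3))  # 0.667
print(confusion)
```

Off-diagonal cells show the mistakes: in this toy case, the one European car was predicted as North American.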
