In [8]:
import pandas as pd
import os
os.chdir("C:/Users/Lenovo 4/Desktop/Data Quest Folder/Logistic Regression/Multiclass Classification")

cars = pd.read_csv("auto.csv")
print(cars.head())
unique_regions = cars["origin"].unique()
print(unique_regions)

    mpg  cylinders  displacement  horsepower  weight  acceleration  year  \
0  18.0          8         307.0       130.0  3504.0          12.0    70   
1  15.0          8         350.0       165.0  3693.0          11.5    70   
2  18.0          8         318.0       150.0  3436.0          11.0    70   
3  16.0          8         304.0       150.0  3433.0          12.0    70   
4  17.0          8         302.0       140.0  3449.0          10.5    70   

   origin  
0       1  
1       1  
2       1  
3       1  
4       1  
[1 3 2]


In [9]:
# Setting the prefix parameter to "cyl" makes pandas prepend "cyl_" to each dummy column name (cyl_3, cyl_4, ...)

dummy_cylinders = pd.get_dummies(cars["cylinders"], prefix="cyl")

cars = pd.concat([cars, dummy_cylinders], axis=1)
dummy_years = pd.get_dummies(cars["year"], prefix="year")
cars = pd.concat([cars, dummy_years], axis=1)
cars = cars.drop("year", axis=1)
cars = cars.drop("cylinders", axis=1)
print(cars.head())

    mpg  displacement  horsepower  weight  acceleration  origin  cyl_3  cyl_4  \
0  18.0         307.0       130.0  3504.0          12.0       1      0      0   
1  15.0         350.0       165.0  3693.0          11.5       1      0      0   
2  18.0         318.0       150.0  3436.0          11.0       1      0      0   
3  16.0         304.0       150.0  3433.0          12.0       1      0      0   
4  17.0         302.0       140.0  3449.0          10.5       1      0      0   

   cyl_5  cyl_6  ...  year_73  year_74  year_75  year_76  year_77  year_78  \
0      0      0  ...        0        0        0        0        0        0   
1      0      0  ...        0        0        0        0        0        0   
2      0      0  ...        0        0        0        0        0        0   
3      0      0  ...        0        0        0        0        0        0   
4      0      0  ...        0        0        0        0        0        0   

   year_79  year_80  year_81  year_82  
0   

For this dataset, categorical variables exist in three columns: cylinders, year, and origin. The cylinders and year columns must be converted to numeric dummy columns so we can use them to predict the label, origin. Even though the year column is stored as a number, we're going to treat it as a category. The year 71 is unlikely to relate to the year 70 in the way the two numbers relate numerically; the two values act as different labels. In cases like this, it is usually safer to treat discrete values as categorical variables.

We must use dummy variables for columns containing categorical values. 
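
As a small illustration of what dummy variables look like, here is a minimal sketch on a made-up column of years. The toy values and the names toy and toy_dummies are assumptions for this example and do not come from auto.csv.

```python
import pandas as pd

# Hypothetical toy column of model years (not taken from auto.csv).
toy = pd.DataFrame({"year": [70, 71, 70, 72]})

# One column per distinct year, holding 1 where the row has that year and 0 otherwise.
# astype(int) keeps the 0/1 display on pandas versions where get_dummies returns booleans.
toy_dummies = pd.get_dummies(toy["year"], prefix="year").astype(int)
print(toy_dummies)
```

Each row has exactly one 1 across the year_70, year_71, and year_72 columns, so the model sees three separate labels instead of one ordered number.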

In [11]:
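import numpy as np
# Shuffle the rows. The index is the default 0..n-1, so its labels double as positions for iloc.
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
# Use the first 70% of the shuffled rows for training and the rest for testing.
highest_train_row = int(cars.shape[0] * .70)
train = shuffled_cars.iloc[0:highest_train_row]
test = shuffled_cars.iloc[highest_train_row:]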


There are a few different methods of doing multiclass classification; in this mission, we'll focus on the one-versus-all method.

In [12]:
from sklearn.linear_model import LogisticRegression

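# Sorted array of the distinct origin codes.
unique_origins = cars["origin"].unique()
unique_origins.sort()

# One fitted binary model per origin, keyed by the origin code.
models = {}
# Use only the dummy columns created from cylinders and year.
features = [c for c in train.columns if c.startswith("cyl") or c.startswith("year")]
X_train = train[features]

for origin in unique_origins:
    model = LogisticRegression()

    # Binary target: True for cars from this origin, False for all others.
    y_train = train["origin"] == origin

    model.fit(X_train, y_train)
    models[origin] = model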

In the one-vs-all approach, we're essentially converting an n-class (in our case n is 3) classification problem into n binary classification problems. For our case, we'll need to train 3 models:

A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
A model where all cars built in Asia are labeled Positive (1) and those built in North America and Europe are considered Negative (0).
Each of these models is a binary classification model that returns a probability between 0 and 1. When we apply the models to new data, each one returns a probability value (3 in total). For each observation, we choose the label corresponding to the model that predicted the highest probability.

We'll use the dummy variables we created from the cylinders and year columns to train 3 models using the LogisticRegression class from scikit-learn.
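
The one-vs-all bookkeeping described above can be sketched without any model at all. The labels and probabilities below are made up for illustration, and the mapping of codes 1, 2, 3 to North America, Europe, and Asia is an assumption.

```python
# Hypothetical origin codes for six cars (assumed: 1 = North America, 2 = Europe, 3 = Asia).
origins = [1, 3, 2, 1, 3, 1]

# Step 1: turn the 3-class labels into three binary targets, one per class.
binary_targets = {
    origin: [label == origin for label in origins]
    for origin in sorted(set(origins))
}

# Step 2: made-up probabilities that each binary model might return for two new cars.
probs = {1: [0.7, 0.2], 2: [0.1, 0.3], 3: [0.2, 0.5]}

# Step 3: for each new car, pick the class whose model gave the highest probability.
predicted = [max(probs, key=lambda o: probs[o][i]) for i in range(2)]
print(binary_targets[1])
print(predicted)
```

The three probabilities for one car come from separate models, so they are not required to sum to 1; only their ranking matters for the final label.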

In [13]:
testing_probs = pd.DataFrame(columns=unique_origins)

for origin in unique_origins:
    # Select testing features.
    X_test = test[features]   
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]

In [14]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      1
1      1
2      1
3      1
4      1
      ..
113    1
114    1
115    1
116    1
117    1
Length: 118, dtype: int64


Each column in our testing_probs dataframe represents an origin, so we just need to choose the one with the largest probability. We can use the DataFrame method .idxmax() to return a Series where each value is the label of the column where the maximum value occurs for that observation. We need to set the axis parameter to 1, since we want to find the maximum across columns. Because each column maps directly to an origin, the resulting Series is the classification from our model.
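
To see what .idxmax(axis=1) returns, here is a minimal sketch on a hand-made table of probabilities. The numbers and the name toy_probs are assumptions for this example, not output from the models above.

```python
import pandas as pd

# Hypothetical probabilities for three observations; columns are origin codes.
toy_probs = pd.DataFrame({
    1: [0.7, 0.2, 0.3],
    2: [0.1, 0.3, 0.6],
    3: [0.2, 0.5, 0.1],
})

# For each row, return the label of the column that holds the largest value.
toy_predicted = toy_probs.idxmax(axis=1)
print(toy_predicted)
```

With axis=0 the method would instead return, for each column, the row label where that column peaks, which is not what we want here.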