# Introduction To The Data
For each car, we have technical information about the vehicle, such as the engine's displacement, the weight of the car, the miles per gallon, and how fast the car accelerates.

Using this information, we will predict the origin of the vehicle: North America, Europe, or Asia.

In [18]:
import pandas as pd
import numpy as np

columns_name = ["mpg","cylinders","displacement","horsepower","weight","acceleration","year","origin","car_name"]
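# sep=r'\s+' splits on runs of whitespace; it is equivalent to delim_whitespace=True,
# which is deprecated in recent pandas releases
cars = pd.read_csv('auto-mpg.data',sep=r'\s+',names=columns_name)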
unique_regions = cars['origin'].unique()
print(unique_regions)

[1 3 2]
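
The origin column holds integer codes rather than region names. As a small sketch, assuming the coding commonly documented for this dataset (1 = North America, 2 = Europe, 3 = Asia), the codes can be mapped to readable labels. The `origin_names` dictionary and the toy Series are illustrative and not part of the original notebook.

```python
import pandas as pd

# Assumed coding, not stated in the notebook itself:
# 1 = North America, 2 = Europe, 3 = Asia.
origin_names = {1: "North America", 2: "Europe", 3: "Asia"}

# Toy stand-in for cars['origin'].
toy_origin = pd.Series([1, 3, 2, 1], name="origin")

# Series.map looks up each code in the dictionary.
toy_region = toy_origin.map(origin_names)
print(toy_region.tolist())
```

The first rows of the data (Chevrolet, Buick, Plymouth, all coded 1) are consistent with this reading.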


# Dummy Variables
In many cases, you'll have to create a numeric representation of categorical values yourself.

We must use dummy variables for columns containing categorical values. Whenever we have more than 2 categories, we need to create more columns to represent the categories. The cylinders column takes 5 distinct values (3, 4, 5, 6, and 8), but these numbers act as category labels rather than quantities for our purposes. Instead of feeding them to the model as a single numeric column, we can split the column into separate binary columns, one per value.

We can use the pandas function get_dummies to return a DataFrame containing one binary column for each distinct value in the cylinders column.

We then use the pandas function concat to add the columns from this DataFrame back to cars.

In [19]:
dummy_cylinders = pd.get_dummies(cars['cylinders'],prefix='cyl')
cars = pd.concat([cars,dummy_cylinders],axis=1)

dummy_years = pd.get_dummies(cars['year'],prefix='year')
cars = pd.concat([cars,dummy_years],axis=1)
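# The original year and cylinders columns are kept alongside the dummy columns.
# (Calling cars.drop(...) without assigning the result would not remove them,
# because drop returns a new DataFrame.)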
print(cars.head())

    mpg  cylinders  displacement horsepower  weight  acceleration  year  \
0  18.0          8         307.0      130.0  3504.0          12.0    70   
1  15.0          8         350.0      165.0  3693.0          11.5    70   
2  18.0          8         318.0      150.0  3436.0          11.0    70   
3  16.0          8         304.0      150.0  3433.0          12.0    70   
4  17.0          8         302.0      140.0  3449.0          10.5    70   

   origin                   car_name  cyl_3   ...     year_73  year_74  \
0       1  chevrolet chevelle malibu    0.0   ...         0.0      0.0   
1       1          buick skylark 320    0.0   ...         0.0      0.0   
2       1         plymouth satellite    0.0   ...         0.0      0.0   
3       1              amc rebel sst    0.0   ...         0.0      0.0   
4       1                ford torino    0.0   ...         0.0      0.0   

   year_75  year_76  year_77  year_78  year_79  year_80  year_81  year_82  
0      0.0      0.0      0.0
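
To make the transformation easier to see, here is a minimal sketch on a toy frame. The `toy` and `dummies` names are illustrative only, and `dtype=int` is passed so the columns hold 0/1 regardless of the pandas version's default dummy type.

```python
import pandas as pd

# Toy stand-in for the cylinders column.
toy = pd.DataFrame({"cylinders": [4, 8, 6, 4]})

# One binary column per distinct value, named cyl_<value>.
dummies = pd.get_dummies(toy["cylinders"], prefix="cyl", dtype=int)

# Attach the new columns next to the original one.
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one of the cyl_ columns set to 1, which is the property the models rely on later.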

# Multiclass Classification
When we have 3 or more categories, we call the problem a multiclass classification problem. There are a few different methods of doing multiclass classification, and in this mission we'll focus on the one-versus-all method.

The one-versus-all method is a technique where we choose a single category as the Positive case and group the rest of the categories as the Negative case. We're essentially splitting the problem into multiple binary classification problems, one per category. For each observation, each binary model outputs the probability of belonging to its category.

In [22]:
shuffled_rows = np.random.permutation(cars.index)
shuffled_cars = cars.iloc[shuffled_rows]
train = shuffled_cars[0:int(cars.shape[0]*0.7)]
test = shuffled_cars[int(cars.shape[0]*0.7):]
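
The shuffle above is not seeded, so every run produces a different train/test split. A minimal sketch of a reproducible 70/30 split on a toy frame follows; the seed value and the `toy_*` names are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"x": range(10)})

# A fixed seed makes the permutation repeatable (the value 1 is arbitrary).
rng = np.random.RandomState(1)
shuffled_index = rng.permutation(toy.index.to_numpy())
shuffled = toy.loc[shuffled_index]

# First 70% of the shuffled rows for training, the rest for testing.
split = int(toy.shape[0] * 0.7)
toy_train = shuffled.iloc[:split]
toy_test = shuffled.iloc[split:]
print(len(toy_train), len(toy_test))
```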

# Training A Multiclass Logistic Regression Model
In the one-vs-all approach, we're essentially converting an n-class (in our case n is 3) classification problem into n binary classification problems. For our case, we'll need to train 3 models:
- A model where all cars built in North America are considered Positive (1) and those built in Europe and Asia are considered Negative (0).
- A model where all cars built in Europe are considered Positive (1) and those built in North America and Asia are considered Negative (0).
- A model where all cars built in Asia are considered Positive (1) and those built in North America and Europe are considered Negative (0).

For each observation, we choose the label corresponding to the model that predicted the highest probability.
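
The relabeling described above can be sketched on a toy label column. Each class gets its own binary target where that class is 1 and everything else is 0. The `toy_origin` and `binary_targets` names are illustrative.

```python
import pandas as pd

# Toy stand-in for the origin column.
toy_origin = pd.Series([1, 2, 3, 1, 3], name="origin")

# One binary target per class: 1 where the row belongs to that class, else 0.
binary_targets = {
    label: (toy_origin == label).astype(int)
    for label in sorted(toy_origin.unique())
}

for label, target in binary_targets.items():
    print(label, target.tolist())
```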

We'll use the binary features we created from the cylinders and year columns to train 3 models using the LogisticRegression class from scikit-learn.

In [28]:
from sklearn.linear_model import LogisticRegression

unique_origins = cars['origin'].unique() # origin codes present in the data
unique_origins.sort()

models = {}
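# Match the dummy-column prefixes 'cyl_' and 'year_' so that the raw
# 'cylinders' and 'year' columns are not picked up as features
features = [c for c in train.columns if c.startswith('cyl_') or c.startswith('year_')]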

for origin in unique_origins:
    model = LogisticRegression()
    
    X_train = train[features]
    y_train = train['origin'] == origin
    
    model.fit(X_train,y_train)
    models[origin] = model
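
For comparison, scikit-learn also provides a `OneVsRestClassifier` wrapper in `sklearn.multiclass` that fits one binary estimator per class. The sketch below uses synthetic data rather than the cars dataset, and it is an alternative to the manual loop above, not the approach this notebook takes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic data: three clusters of 20 points each, labeled 1, 2, 3.
rng = np.random.RandomState(0)
centers = np.array([[0, 0], [5, 5], [0, 5]])
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
y_toy = np.repeat([1, 2, 3], 20)

# The wrapper trains one LogisticRegression per class internally.
ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_toy, y_toy)

print(len(ovr.estimators_))          # one fitted binary model per class
print(ovr.predict_proba(X_toy).shape)
```

Writing the loop by hand, as done above, keeps each binary model accessible by origin code, which makes the per-class probabilities easy to inspect.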

# Testing The Models

In [37]:
testing_probs = pd.DataFrame(columns=unique_origins)
for origin in unique_origins:
    X_test = test[features]
    # predict_proba returns one column per class; column 1 is the probability of the positive class
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]

# Choose The Origin
To classify each observation, we select the origin with the highest predicted probability for that observation.

Since each column in our DataFrame testing_probs represents an origin, we just need to choose the column with the largest probability in each row. We can use the DataFrame method .idxmax() to return a Series where each value is the label of the column in which the maximum value occurs for that observation. We need to set the axis parameter to 1, since we want to find the maximum across columns.
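
A toy example shows the behavior of idxmax. The probabilities below are made up for illustration, and because each column comes from a separate binary model, a row does not have to sum to 1.

```python
import pandas as pd

# Made-up probabilities: one column per origin code, one row per observation.
toy_probs = pd.DataFrame({
    1: [0.9, 0.2, 0.4],
    2: [0.3, 0.6, 0.1],
    3: [0.2, 0.3, 0.8],
})

# For each row, return the column label holding the largest value.
toy_pred = toy_probs.idxmax(axis=1)
print(toy_pred.tolist())
```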

In [40]:
predicted_origins = testing_probs.idxmax(axis=1)
print(predicted_origins)

0      3
1      1
2      1
3      1
4      1
5      3
6      1
7      3
8      1
9      1
10     3
11     1
12     2
13     1
14     1
15     3
16     3
17     2
18     1
19     1
20     1
21     1
22     3
23     1
24     1
25     1
26     2
27     2
28     3
29     1
      ..
90     1
91     1
92     3
93     1
94     1
95     1
96     1
97     1
98     3
99     2
100    1
101    1
102    3
103    2
104    2
105    1
106    2
107    2
108    1
109    1
110    2
111    3
112    1
113    1
114    1
115    3
116    1
117    3
118    1
119    1
dtype: int64
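
A natural next step, not shown in the notebook, is to compare the predictions with the true labels. One detail to watch: testing_probs is indexed 0 to n-1, while test keeps its shuffled index, so the two should be compared by position rather than by index label. The sketch below uses toy values to illustrate this; the names are hypothetical.

```python
import pandas as pd

# Toy stand-ins: true labels keep a shuffled index, predictions use 0..n-1.
toy_test_origin = pd.Series([1, 3, 2, 1], index=[17, 4, 9, 2])
toy_predicted = pd.Series([1, 3, 1, 1])

# Compare position by position using the underlying arrays,
# which sidesteps pandas index alignment.
correct = toy_predicted.to_numpy() == toy_test_origin.to_numpy()
accuracy = correct.mean()
print(accuracy)
```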
