# Supervised and Unsupervised Learning

Depending on the type of the data and the model to be built, you can separate the learning problems into two broad categories:

### Supervised learning.
These are the methods in which the training set contains an additional attribute that you want to predict (the target).
##### Classification:
The data in the training set belong to two or more classes or categories. Because the data are already labeled, they allow us to teach the system to recognize the characteristics that distinguish each class. When the system is later given a new, unseen value, it assigns a class to it according to its characteristics.
##### Regression:
The value to be predicted is a continuous variable. The simplest case to understand is finding the line that describes the trend in a series of points on a scatterplot.

### Unsupervised learning.
These are the methods in which the training set consists of a series of input values x without any corresponding target value.
##### Clustering:
The goal of these methods is to discover groups of similar examples in a dataset.
##### Dimensionality reduction:
Reducing a high-dimensional dataset to only two or three dimensions is useful for data visualization. More generally, converting data of very high dimensionality into data of much lower dimensionality lets each of the remaining dimensions convey much more information.
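
As a rough illustration of both ideas, the sketch below clusters a small synthetic dataset and then projects it onto two dimensions. The data, the variable names, and the choice of `KMeans` and `PCA` are assumptions made for this example; the notebook itself does not use them.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two synthetic, well-separated groups in 5 dimensions (illustrative data only)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
points = np.vstack([group_a, group_b])

# Clustering: assign each row to one of two groups without using any labels
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Dimensionality reduction: project the 5 columns onto 2 principal components
points_2d = PCA(n_components=2).fit_transform(points)

print(labels[:5], labels[-5:])
print(points_2d.shape)
```

Neither step sees a target value: the cluster labels and the two components are derived from the inputs alone.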

In addition to these two main categories, there is a further group of methods whose purpose is to validate and evaluate models.

# Training Set and Testing Set
Machine learning is about learning some properties of a data set with a model and then applying them to new data. For this reason, a common practice when evaluating an algorithm is to split the data into two parts: the training set, on which the model learns the properties of the data, and the testing set, on which those learned properties are tested.
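
A minimal sketch of such a split, assuming scikit-learn's `train_test_split` and a made-up array in place of real data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 observations with 2 features each
features = np.arange(40).reshape(20, 2)
target = np.arange(20)

# Hold out 25% of the rows for testing. Rows are shuffled by default;
# random_state makes the shuffle reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.25, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Every observation ends up in exactly one of the two parts, so the testing set stays unseen while the model is fitted.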

In [3]:
import pandas as pd
from sklearn import linear_model as lm
from sklearn.feature_selection import RFE #recursive feature elimination
import numpy as np
import matplotlib.pyplot as plt
import os

# Functions to calculate R-squared, MAE, and MSE (RMSE is the square root of MSE)
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

print(os.getcwd())

os.chdir('C:\\Analytics\\Personal\\Machine Learning\\Training\\R\\Dataset')

C:\Users\manish.khati\Python\Class\Day5


In [4]:
# Read the car price data
df = pd.read_csv('carPrice.csv')


# View the first few rows
df.head()

Unnamed: 0,car_ID,symboling,carCompany,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,wheelbase,...,enginesize,fuelsystem,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950.0
4,5,2,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450.0


In [5]:
df.columns.values 

array(['car_ID', 'symboling', 'carCompany', 'fueltype', 'aspiration',
       'doornumber', 'carbody', 'drivewheel', 'enginelocation',
       'wheelbase', 'carlength', 'carwidth', 'carheight', 'curbweight',
       'enginetype', 'cylindernumber', 'enginesize', 'fuelsystem',
       'boreratio', 'stroke', 'compressionratio', 'horsepower', 'peakrpm',
       'citympg', 'highwaympg', 'price'], dtype=object)

In [7]:
#Describe data

df.describe()

Unnamed: 0,car_ID,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
count,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0,205.0
mean,103.0,0.834146,98.756585,174.049268,65.907805,53.724878,2555.565854,126.907317,3.329756,3.255415,10.142537,104.117073,5125.121951,25.219512,30.75122,13276.710571
std,59.322565,1.245307,6.021776,12.337289,2.145204,2.443522,520.680204,41.642693,0.270844,0.313597,3.97204,39.544167,476.985643,6.542142,6.886443,7988.852332
min,1.0,-2.0,86.6,141.1,60.3,47.8,1488.0,61.0,2.54,2.07,7.0,48.0,4150.0,13.0,16.0,5118.0
25%,52.0,0.0,94.5,166.3,64.1,52.0,2145.0,97.0,3.15,3.11,8.6,70.0,4800.0,19.0,25.0,7788.0
50%,103.0,1.0,97.0,173.2,65.5,54.1,2414.0,120.0,3.31,3.29,9.0,95.0,5200.0,24.0,30.0,10295.0
75%,154.0,2.0,102.4,183.1,66.9,55.5,2935.0,141.0,3.58,3.41,9.4,116.0,5500.0,30.0,34.0,16503.0
max,205.0,3.0,120.9,208.1,72.3,59.8,4066.0,326.0,3.94,4.17,23.0,288.0,6600.0,49.0,54.0,45400.0


In [8]:
# Create dummy variables for the categorical columns
dummy_cols = """carCompany fueltype aspiration doornumber carbody 
drivewheel enginelocation enginetype 
cylindernumber fuelsystem""".split()

dummies = pd.get_dummies(df[dummy_cols])

# Alternatively, scikit-learn provides encoder classes. Note that LabelEncoder is
# intended for target labels; OneHotEncoder is the usual choice for input features.
#from sklearn.preprocessing import LabelEncoder
#le = LabelEncoder()
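
With the default settings, `pd.get_dummies` creates one column per level, so the dummy columns of each variable always sum to one. The sketch below, on a made-up two-column frame, compares that with the `drop_first=True` option, which omits one level per variable. Using `drop_first` is a suggestion for illustration, not something the notebook does.

```python
import pandas as pd

# Hypothetical miniature of two categorical columns
toy = pd.DataFrame({
    "fueltype": ["gas", "diesel", "gas"],
    "doornumber": ["two", "four", "four"],
})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level of each variable dropped

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding the dummies of one variable sum to 1 in every row,
# which duplicates the information carried by a model's intercept.
row_sums = full["fueltype_diesel"].astype(int) + full["fueltype_gas"].astype(int)
print(row_sums.tolist())
```

The dropped level becomes the baseline against which the remaining coefficients are interpreted.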

In [9]:
dummies.columns.values 

array(['carCompany_alfa-romero', 'carCompany_audi', 'carCompany_bmw',
       'carCompany_chevrolet', 'carCompany_dodge', 'carCompany_honda',
       'carCompany_isuzu', 'carCompany_jaguar', 'carCompany_mazda',
       'carCompany_mercedes-benz', 'carCompany_mercury',
       'carCompany_mitsubishi', 'carCompany_nissan', 'carCompany_peugot',
       'carCompany_plymouth', 'carCompany_porsche', 'carCompany_renault',
       'carCompany_saab', 'carCompany_subaru', 'carCompany_toyota',
       'carCompany_volkswagen', 'carCompany_volvo', 'fueltype_diesel',
       'fueltype_gas', 'aspiration_std', 'aspiration_turbo',
       'doornumber_four', 'doornumber_two', 'carbody_convertible',
       'carbody_hardtop', 'carbody_hatchback', 'carbody_sedan',
       'carbody_wagon', 'drivewheel_4wd', 'drivewheel_fwd',
       'drivewheel_rwd', 'enginelocation_front', 'enginelocation_rear',
       'enginetype_dohc', 'enginetype_dohcv', 'enginetype_l',
       'enginetype_ohc', 'enginetype_ohcf', 'enginetype_ohcv'

In [10]:
#drop the original column
df = df.drop(dummy_cols, axis=1)

In [11]:
df.columns.values 

array(['car_ID', 'symboling', 'wheelbase', 'carlength', 'carwidth',
       'carheight', 'curbweight', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg',
       'highwaympg', 'price'], dtype=object)

In [12]:
#add dummy variables
df = df.join(dummies)

In [13]:
df.columns.values

array(['car_ID', 'symboling', 'wheelbase', 'carlength', 'carwidth',
       'carheight', 'curbweight', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg',
       'highwaympg', 'price', 'carCompany_alfa-romero', 'carCompany_audi',
       'carCompany_bmw', 'carCompany_chevrolet', 'carCompany_dodge',
       'carCompany_honda', 'carCompany_isuzu', 'carCompany_jaguar',
       'carCompany_mazda', 'carCompany_mercedes-benz',
       'carCompany_mercury', 'carCompany_mitsubishi', 'carCompany_nissan',
       'carCompany_peugot', 'carCompany_plymouth', 'carCompany_porsche',
       'carCompany_renault', 'carCompany_saab', 'carCompany_subaru',
       'carCompany_toyota', 'carCompany_volkswagen', 'carCompany_volvo',
       'fueltype_diesel', 'fueltype_gas', 'aspiration_std',
       'aspiration_turbo', 'doornumber_four', 'doornumber_two',
       'carbody_convertible', 'carbody_hardtop', 'carbody_hatchback',
       'carbody_sedan', 'carbody_wagon', 'drive

In [14]:
# Separate the predictors from the response

# Create our predictor/independent variables
# and our response/dependent variable

X = df.drop(['price','car_ID'],axis = 1)
y = df['price']

col = X.columns.values

In [15]:
# Rescaling the data

# The units and scales of measurement differ from one variable to another, so an analysis of the raw
# measurements could be artificially skewed toward the variables with larger absolute values.
# Bringing all variables to the same order of magnitude keeps any one of them from dominating
# purely because of its units. Two widely used methods for rescaling data are
# normalization and standardization.

# Normalization is usually done with Min-Max scaling, which maps each variable onto the range [0, 1].
# Standardization transforms each variable to have zero mean and a standard deviation of one.

from sklearn import preprocessing
std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)

X = pd.DataFrame(X_std,columns = col)

#minmax_scale = preprocessing.MinMaxScaler().fit(X)
#X_minmax = minmax_scale.transform(X)
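
The cell above fits the scaler on all of `X` before any split is made. A common alternative, sketched here on made-up numbers, is to fit the scaler on the training rows only and reuse the fitted object on new rows, so that no information from the test data enters the preprocessing. The arrays and names are assumptions for this example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical training rows and one new row, with columns on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
new = np.array([[4.0, 400.0]])

# Standardization: learn mean and standard deviation from the training rows only
std = StandardScaler().fit(train)
train_std = std.transform(train)
new_std = std.transform(new)      # reuses the training mean and std

# Normalization: learn min and max from the training rows only
mm = MinMaxScaler().fit(train)
train_mm = mm.transform(train)

print(train_std.mean(axis=0), train_std.std(axis=0))
print(train_mm.min(axis=0), train_mm.max(axis=0))
print(new_std)
```

After standardization both columns have the same values, even though the raw columns differ by a factor of 100.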

In [16]:
X.columns.values

array(['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
       'curbweight', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg',
       'highwaympg', 'carCompany_alfa-romero', 'carCompany_audi',
       'carCompany_bmw', 'carCompany_chevrolet', 'carCompany_dodge',
       'carCompany_honda', 'carCompany_isuzu', 'carCompany_jaguar',
       'carCompany_mazda', 'carCompany_mercedes-benz',
       'carCompany_mercury', 'carCompany_mitsubishi', 'carCompany_nissan',
       'carCompany_peugot', 'carCompany_plymouth', 'carCompany_porsche',
       'carCompany_renault', 'carCompany_saab', 'carCompany_subaru',
       'carCompany_toyota', 'carCompany_volkswagen', 'carCompany_volvo',
       'fueltype_diesel', 'fueltype_gas', 'aspiration_std',
       'aspiration_turbo', 'doornumber_four', 'doornumber_two',
       'carbody_convertible', 'carbody_hardtop', 'carbody_hatchback',
       'carbody_sedan', 'carbody_wagon', 'drivewheel_4wd',
       

In [17]:
X.head()

Unnamed: 0,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,...,cylindernumber_twelve,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,1.74347,-1.690772,-0.426521,-0.844782,-2.020417,-0.014566,0.074449,0.519071,-1.839377,-0.288349,...,-0.070014,-0.141069,-0.23812,-0.689072,-0.121867,-0.328798,-0.070014,1.08667,-0.214286,-0.070014
1,1.74347,-1.690772,-0.426521,-0.844782,-2.020417,-0.014566,0.074449,0.519071,-1.839377,-0.288349,...,-0.070014,-0.141069,-0.23812,-0.689072,-0.121867,-0.328798,-0.070014,1.08667,-0.214286,-0.070014
2,0.133509,-0.708596,-0.231513,-0.190566,-0.543527,0.514882,0.604046,-2.40488,0.685946,-0.288349,...,-0.070014,-0.141069,-0.23812,-0.689072,-0.121867,-0.328798,-0.070014,1.08667,-0.214286,-0.070014
3,0.93849,0.173698,0.207256,0.136542,0.235942,-0.420797,-0.431076,-0.517266,0.462183,-0.035973,...,-0.070014,-0.141069,-0.23812,-0.689072,-0.121867,-0.328798,-0.070014,1.08667,-0.214286,-0.070014
4,0.93849,0.10711,0.207256,0.230001,0.235942,0.516807,0.218885,-0.517266,0.462183,-0.540725,...,-0.070014,-0.141069,-0.23812,-0.689072,-0.121867,-0.328798,-0.070014,1.08667,-0.214286,-0.070014


In [18]:
# Feature construction or generation

# Machine learning algorithms give their best results only when we provide them with features that
# capture the underlying structure of the problem we are trying to address.

# Correlation matrix

corr = X.corr()
print(corr)

                          symboling  wheelbase  carlength  carwidth  \
symboling                  1.000000  -0.531954  -0.357612 -0.232919   
wheelbase                 -0.531954   1.000000   0.874587  0.795144   
carlength                 -0.357612   0.874587   1.000000  0.841118   
carwidth                  -0.232919   0.795144   0.841118  1.000000   
carheight                 -0.541038   0.589435   0.491029  0.279210   
curbweight                -0.227691   0.776386   0.877728  0.867032   
enginesize                -0.105790   0.569329   0.683360  0.735433   
boreratio                 -0.130051   0.488750   0.606454  0.559150   
stroke                    -0.008735   0.160959   0.129533  0.182942   
compressionratio          -0.178515   0.249786   0.158414  0.181129   
horsepower                 0.070873   0.353294   0.552623  0.640732   
peakrpm                    0.273606  -0.360469  -0.287242 -0.220012   
citympg                   -0.035823  -0.470414  -0.670909 -0.642704   
highwa
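
A full correlation matrix is hard to read when there are many columns. One possible way to pull out only the strongly correlated pairs is sketched below on synthetic data; the 0.8 threshold and the column names are arbitrary choices for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)

# Synthetic frame: "b" is almost a multiple of "a", "c" is unrelated noise
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

corr = toy.corr().abs()

# Keep only the part above the diagonal so each pair is listed once
mask = np.triu(np.ones(corr.shape), k=1).astype(bool)
pairs = corr.where(mask).stack()

# Pairs whose absolute correlation exceeds the chosen threshold
strong = pairs[pairs > 0.8]
print(strong)
```

Strongly correlated predictors carry overlapping information, which is one reason to consider dropping or combining some of them before fitting a linear model.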

In [19]:
# Create our training data from the first 150 observations
X_train = X[0:150]
y_train = y[0:150]

# Create our test data from the remaining observations
X_test = X[150:]
y_test = y[150:]
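
Slicing by position assumes the rows are in random order. The first rows printed by `df.head()` suggest that the file may be sorted by `carCompany`, although that is an assumption about the data. The sketch below uses a made-up sorted frame to show how a positional split can leave whole categories out of the training set, and how shuffling first avoids it.

```python
import pandas as pd

# Hypothetical frame sorted by brand, 10 rows per brand
toy = pd.DataFrame({
    "brand": ["audi"] * 10 + ["toyota"] * 10 + ["volvo"] * 10,
    "price": range(30),
})

# Positional split on the sorted rows
train_seq, test_seq = toy[:20], toy[20:]
unseen_seq = set(test_seq["brand"]) - set(train_seq["brand"])

# Same split after shuffling the rows reproducibly
shuffled = toy.sample(frac=1, random_state=0)
train_shuf, test_shuf = shuffled[:20], shuffled[20:]
unseen_shuf = set(test_shuf["brand"]) - set(train_shuf["brand"])

print(unseen_seq)   # brands that appear only in the test rows
print(unseen_shuf)
```

A dummy column for a brand that never occurs in the training rows is constant there, so its coefficient cannot be estimated from the data.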

In [20]:
X_test.columns.values 

array(['symboling', 'wheelbase', 'carlength', 'carwidth', 'carheight',
       'curbweight', 'enginesize', 'boreratio', 'stroke',
       'compressionratio', 'horsepower', 'peakrpm', 'citympg',
       'highwaympg', 'carCompany_alfa-romero', 'carCompany_audi',
       'carCompany_bmw', 'carCompany_chevrolet', 'carCompany_dodge',
       'carCompany_honda', 'carCompany_isuzu', 'carCompany_jaguar',
       'carCompany_mazda', 'carCompany_mercedes-benz',
       'carCompany_mercury', 'carCompany_mitsubishi', 'carCompany_nissan',
       'carCompany_peugot', 'carCompany_plymouth', 'carCompany_porsche',
       'carCompany_renault', 'carCompany_saab', 'carCompany_subaru',
       'carCompany_toyota', 'carCompany_volkswagen', 'carCompany_volvo',
       'fueltype_diesel', 'fueltype_gas', 'aspiration_std',
       'aspiration_turbo', 'doornumber_four', 'doornumber_two',
       'carbody_convertible', 'carbody_hardtop', 'carbody_hatchback',
       'carbody_sedan', 'carbody_wagon', 'drivewheel_4wd',
       

In [21]:
#Train The Linear Model

# Create an object that is an ols regression
ols = lm.LinearRegression()

In [22]:
# Train the model using our training data
model = ols.fit(X_train, y_train)

In [23]:
model.intercept_

2136354725835088.0

In [24]:
# View the training model's coefficients
model.coef_

array([  2.46679194e+02,   1.25298855e+03,  -1.34252536e+03,
         5.90677935e+02,  -1.88472271e+01,   4.58976109e+03,
         3.75442605e+03,  -1.02803934e+03,  -9.32682519e+01,
         5.33594866e+03,   1.76871805e+03,   1.70414599e+03,
         4.45534200e+01,   1.51899060e+03,  -1.02803812e+15,
        -1.55472836e+15,  -1.65787206e+15,  -1.02803812e+15,
        -1.75397015e+15,  -2.08638858e+15,  -1.18413422e+15,
        -1.02803812e+15,  -2.36089293e+15,  -1.65787206e+15,
        -5.96469155e+14,  -2.08638858e+15,  -2.42286916e+15,
        -1.83688948e+15,  -1.55472836e+15,  -1.32060491e+15,
        -8.41464742e+14,  -1.44302904e+15,  -3.20949945e+13,
        -2.55783518e+14,   7.04514849e+13,  -2.85864813e+14,
        -7.05067255e+15,   7.66421026e+15,  -5.18541480e+14,
        -5.18541480e+14,  -1.67019484e+14,  -1.67019484e+14,
        -2.83785290e+14,  -3.26036201e+14,  -7.98368919e+14,
        -8.40110828e+14,  -5.50926894e+14,   1.02159021e+15,
         2.45656080e+15,
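
Coefficients on the order of 1e15 next to others on the order of 1e3 are a typical symptom of a design matrix that does not have full column rank. One way this can arise is a complete set of dummy columns alongside an intercept. The NumPy sketch below shows the effect on made-up data; it illustrates the general mechanism and is not a diagnosis confirmed on `carPrice.csv`.

```python
import numpy as np

# Hypothetical two-level variable encoded with both of its dummy columns
intercept = np.ones(6)
is_gas = np.array([1.0, 1.0, 0.0, 1.0, 0.0, 0.0])
is_diesel = 1.0 - is_gas

# Intercept plus both dummies: the two dummies add up to the intercept column
design_full = np.column_stack([intercept, is_gas, is_diesel])

# Intercept plus one dummy: no column is a combination of the others
design_reduced = np.column_stack([intercept, is_gas])

rank_full = np.linalg.matrix_rank(design_full)
rank_reduced = np.linalg.matrix_rank(design_reduced)

print(design_full.shape[1], rank_full)        # 3 columns, rank 2
print(design_reduced.shape[1], rank_reduced)  # 2 columns, rank 2
```

When the rank is lower than the number of columns, infinitely many coefficient vectors fit the training data equally well, and the ones a solver returns can be arbitrarily large.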

### How Good Is Your Model?
There are three metrics widely used for evaluating linear model performance: R-squared, RMSE (root mean squared error), and MAE (mean absolute error).


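R-squared is the most popular metric for evaluating how well your model fits the data. It is the proportion of the total variance in the dependent variable that is explained by the independent variables. On the data used to fit an ordinary least squares model with an intercept, it lies between 0 and 1, and values closer to 1 indicate a better fit. On held-out data it can be negative, which means the model predicts worse than simply using the mean of the observed values.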
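
The definition can be checked by hand: R-squared equals one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses made-up numbers to compare a manual calculation with `r2_score`, whose signature is `r2_score(y_true, y_pred)`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical observed values and two sets of predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])
bad_pred = np.array([9.0, 7.0, 5.0, 3.0])

# R-squared from its definition
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot

r2_library = r2_score(y_true, y_pred)   # same value as r2_manual
r2_bad = r2_score(y_true, bad_pred)     # negative: worse than predicting the mean
r2_swapped = r2_score(y_pred, y_true)   # arguments reversed: a different number

print(r2_manual, r2_library, r2_bad, r2_swapped)
```

Because the total sum of squares is computed from the first argument, swapping the two arguments changes the result.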

In [25]:
# Run the model on X_test and show the first five results
list(model.predict(X_test)[0:5])

[7856305644828697.0,
 7856305644828933.0,
 7856305644828249.0,
 7856305644829106.0,
 7856305644828494.0]

In [30]:
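# View the R-squared score on the test set
# r2_score expects (y_true, y_pred); the order matters because the score is not symmetric.
# The value recorded below came from a run with the two arguments swapped.

Price_Pred = model.predict(X_test)
r_squared = r2_score(y_test, Price_Pred)
r_squared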

-232.64083596473213

In [31]:
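# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# with n observations and p predictors (both taken from the training set here)
1 - (1-r_squared)*(len(y_train)-1)/(len(y_train)-X_train.shape[1]-1)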

-463.16646078326789
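
The adjustment only makes sense when n and p describe the same sample on which R-squared was computed, and when there are more observations than predictors plus one. A small helper that makes both points explicit might look like this; the function name and the example numbers are hypothetical.

```python
def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R-squared for a model with an intercept.

    r2 must have been computed on the same n_obs observations.
    """
    dof = n_obs - n_features - 1
    if dof <= 0:
        raise ValueError("need more observations than features + 1")
    return 1 - (1 - r2) * (n_obs - 1) / dof


# Example: R-squared of 0.90 from 50 observations and 4 predictors
print(adjusted_r2(0.90, n_obs=50, n_features=4))
```

The penalty grows as the number of predictors approaches the number of observations, which is why a model with many dummy columns can see its adjusted value drop sharply.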

In [80]:
# View the first five test Y values
list(y_test)[0:5]

[5348.0, 6338.0, 6488.0, 6918.0, 7898.0]

In [81]:
# The difference between the model's predicted values and the actual values is how we judge a model's
# accuracy, because a perfectly accurate model would have residuals of zero.

# The most common statistic used for quantitative Ys is the residual sum of squares (RSS)

# Apply the model we created using the training data 
# to the test data, and calculate the RSS.
((y_test - model.predict(X_test)) **2).sum()

5.112358972390869e+30

In [82]:
# Note: You can also use the mean squared error (MSE), which is the RSS divided by the number of observations

# Calculate the MSE
np.mean((model.predict(X_test) - y_test) **2)

9.295198131619762e+28
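
The error metrics mentioned in this section are closely related. The sketch below, on made-up numbers, derives MSE, RMSE, and MAE from the residuals and compares two of them with the scikit-learn functions imported at the top of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical observed and predicted values
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.0, 17.0, 18.0])

residuals = y_true - y_pred

rss = np.sum(residuals ** 2)        # residual sum of squares
mse = rss / len(y_true)             # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error, in the units of y
mae = np.mean(np.abs(residuals))    # mean absolute error, in the units of y

print(rss, mse, rmse, mae)
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```

RMSE penalizes large residuals more heavily than MAE does, so comparing the two gives a rough sense of whether a few large errors dominate.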