## Auto MPG Data Set
- https://archive.ics.uci.edu/ml/datasets/Auto+MPG
- The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
- This is a regression problem: the aim is to predict a continuous value, i.e., fuel efficiency.

## Steps performed in this SAS notebook:
- Importing Raw Data Files
- Check Data Types of Variables
- Check for Missing Data
- Handle Missing Values
- Check for Duplicate Entries
- Check for Outliers
- Check for Normal Distribution of Variables
- Handle Outliers using Multiple Regression Model
- Check Correlation between Variables
- Log Transformation
- Additional Visualizations
- Prediction (Multiple Linear Regression Model)

## Importing Raw Data Files

In [31]:
options nosource nonotes errors=0;

libname auto '/folders/myfolders/Project';

proc import Datafile= "~/Project/auto_mpg.csv"
    out= auto.original
    dbms=csv
    replace;
run;

*Ignore the error raised during the data import. It is caused by '?' values in the horsepower column;
proc print data= auto.original (obs=5);
run;

Obs,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
1,18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
2,15,8,350,165,3693,11.5,70,1,buick skylark 320
3,18,8,318,150,3436,11.0,70,1,plymouth satellite
4,16,8,304,150,3433,12.0,70,1,amc rebel sst
5,17,8,302,140,3449,10.5,70,1,ford torino
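
The import error mentioned above comes from `?` placeholders in a numeric column. As a minimal sketch (not the notebook's method), a DATA step can read such a column with the `??` input modifier, which sets unreadable values to missing without raising errors. The dataset name `work.hp_demo`, the length of `car_name`, and both data lines are assumptions made for illustration.

```sas
/* Sketch: read a '?' placeholder in a numeric column as missing */
data work.hp_demo;
    infile datalines dsd truncover;
    length car_name $ 40;               /* hypothetical length for the demo */
    input mpg cylinders displacement
          horsepower ??                 /* ?? suppresses invalid-data errors */
          weight acceleration model_year origin car_name $;
    datalines;
18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
25,4,98,?,2046,19.0,71,1,illustrative car
;
run;

/* The second row should show horsepower as missing (.) */
proc print data=work.hp_demo;
run;
```

The same INPUT statement could be pointed at the CSV file with `infile "..." dsd firstobs=2 truncover;` if a DATA step import is preferred over PROC IMPORT.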


## Check Data Types of Variables

In [2]:
*Check the variables type, length and formatting;
proc contents data= auto.original varname;
run;

*Check mean, minimum and maximum values of all numeric variables;
proc means data = auto.original;
run;

Data Set Name,AUTO.ORIGINAL,Observations,398
Member Type,DATA,Variables,9
Engine,V9,Indexes,0
Created,12/05/2020 15:18:44,Observation Length,96
Last Modified,12/05/2020 15:18:44,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information
Data Set Page Size,65536
Number of Data Set Pages,1
First Data Page,1
Max Obs per Page,681
Obs in First Data Page,398
Number of Data Set Repairs,0
Filename,/folders/myfolders/Project/original.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,270

Variables in Creation Order
#,Variable,Type,Len,Format,Informat
1,mpg,Num,8,BEST12.,BEST32.
2,cylinders,Num,8,BEST12.,BEST32.
3,displacement,Num,8,BEST12.,BEST32.
4,horsepower,Num,8,BEST12.,BEST32.
5,weight,Num,8,BEST12.,BEST32.
6,acceleration,Num,8,BEST12.,BEST32.
7,model_year,Num,8,BEST12.,BEST32.
8,origin,Num,8,BEST12.,BEST32.
9,car_name,Char,25,$25.,$25.

Variable,N,Mean,Std Dev,Minimum,Maximum
mpg,398,23.5145729,7.8159843,9.0000000,46.6000000
cylinders,398,5.4547739,1.7010042,3.0000000,8.0000000
displacement,398,193.4258794,104.2698382,68.0000000,455.0000000
horsepower,392,104.4693878,38.4911599,46.0000000,230.0000000
weight,398,2970.42,846.8417742,1613.00,5140.00
acceleration,398,15.5680905,2.7576889,8.0000000,24.8000000
model_year,398,76.0100503,3.6976266,70.0000000,82.0000000
origin,398,1.5728643,0.8020549,1.0000000,3.0000000


## Check for Missing Data

In [3]:
*Output shows 6 missing values in horsepower variable;

proc means data=auto.original n nmiss;
run;

Variable,N,N Miss
mpg,398,0
cylinders,398,0
displacement,398,0
horsepower,392,6
weight,398,0
acceleration,398,0
model_year,398,0
origin,398,0


In [4]:
*Output shows no missing values in character variables;

proc format;
    value $car
    ' ' = 'Missing'
    other = 'Non Missing';

proc freq data=auto.original;
    tables _character_ /nocum missing;
    format _character_ $car.;
run;

car_name,Frequency,Percent
Non Missing,398,100.0


In [5]:
*Print the row numbers for which horsepower is missing;
*horsepower is numeric, so the '?' entries were read as missing values (.);

data _null_;
    set auto.original;
    file print;
    if missing(horsepower) then put _n_= horsepower=;
run;

## Handle Missing Values

In [6]:
*Replace the 6 missing values in horsepower with the mean value of 104.47;
proc means data=auto.original mean median;
    var horsepower;
run;

Analysis Variable : horsepower
Mean,Median
104.4693878,93.5


In [7]:
*Replacing missing values with mean value of horsepower variable;

data auto.updated;
    set auto.original;
    if missing(horsepower) then horsepower=104.47;
run;

* Re-check if there are any missing values now;
proc means data=auto.updated n nmiss;
    var horsepower;
run;

Analysis Variable : horsepower
N,N Miss
398,0
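
The mean imputation above hard-codes the value 104.47. As a hedged alternative, PROC STDIZE with the `REPONLY` option can fill missing values with the column mean computed from the data itself. This is a sketch that assumes `auto.original` exists as created earlier; the output name `work.updated_alt` is hypothetical.

```sas
/* Sketch: replace missing horsepower with the column mean, no hard-coded constant */
proc stdize data=auto.original out=work.updated_alt
            reponly          /* only replace missing values, do not standardize */
            method=mean;     /* use the mean as the replacement value */
    var horsepower;
run;

/* Re-check: N Miss should now be 0 for horsepower */
proc means data=work.updated_alt n nmiss mean;
    var horsepower;
run;
```

Switching to `method=median` would impute with the median instead, which may be worth considering because the mean (104.47) and median (93.5) of horsepower differ noticeably.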


## Check for Duplicate Entries

In [8]:
*Log shows "0 duplicate observations were deleted". So all observations are unique;

options source notes errors=4;
proc sort data=auto.updated out=temp_1 noduprecs;
    by mpg;
run;
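
One caveat with `NODUPRECS` is that it compares each observation only with the previous one written, so identical rows that are not adjacent after sorting by `mpg` could go undetected. A stricter check, sketched here under the assumption that `auto.updated` exists, sorts on every variable and writes any duplicates to a separate dataset. The names `work.dedup` and `work.dups` are hypothetical.

```sas
/* Sketch: flag rows that are identical on every variable */
proc sort data=auto.updated out=work.dedup
          dupout=work.dups   /* removed duplicates are written here */
          nodupkey;
    by _all_;                /* every variable is part of the sort key */
run;

/* Count the duplicate rows that were removed */
proc sql;
    select count(*) as n_duplicates from work.dups;
quit;
```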

## Check for Outliers

In [9]:
*Three different methods have been used here to detect the outliers;

*Method 1: Print outlier values more than 3 standard deviations away from the trimmed mean;
*Trim 5% of the values from the top and bottom of the data;

ods output TrimmedMeans=auto.trimmed (keep= Varname mean stdmean df);
proc univariate data = auto.updated trim=0.05 nextrobs=10;
run;
ods output close;

proc print data=auto.trimmed;
run;

Obs,VarName,Mean,StdMean,DF
1,acceleration,15.5198,0.1375,357
2,cylinders,5.405,0.0942,357
3,displacement,187.3,5.6646,357
4,horsepower,101.7,1.9251,357
5,model_year,76.0112,0.2061,357
6,mpg,23.2229,0.4094,357
7,origin,1.5251,0.0447,357
8,weight,2937.2,45.1982,357


In [10]:
*Restructuring the dataset;

data temp_2;
set auto.updated;
array vars[*] _numeric_;
length VarName $ 32;
do i=1 to dim(vars);
    Varname=vname(vars[i]);
    Value=vars[i];
    output;
end;
keep Varname Value;
run;

proc print data=temp_2 (obs=10);
run;

Obs,VarName,Value
1,mpg,18
2,cylinders,8
3,displacement,307
4,horsepower,130
5,weight,3504
6,acceleration,12
7,model_year,70
8,origin,1
9,mpg,15
10,cylinders,8


In [11]:
*Check for values more than 3 standard deviations away from the mean;

proc sort data=temp_2;
    by varname;
run;

proc sort data=auto.trimmed;
    by varname;
run;

data auto.outlier;
merge temp_2 auto.trimmed;
by varname;
std_dev=stdmean*sqrt(df+1);
length Reason $12.;
if value lt mean-3*std_dev then do;
    reason='Low';
    output;
end;
else if value gt mean+3*std_dev then do;
    reason='High';
    output;
end;
run;

* Print the outlier values and the reason;
proc print data=auto.outlier;
    var varname value reason;
run;

Obs,VarName,Value,Reason
1,acceleration,23.5,High
2,acceleration,24.8,High
3,acceleration,23.7,High
4,acceleration,24.6,High
5,horsepower,220.0,High
6,horsepower,215.0,High
7,horsepower,225.0,High
8,horsepower,225.0,High
9,horsepower,215.0,High
10,horsepower,215.0,High


In [12]:
*Method 2: Check outliers using box plot method;

proc sgplot data= auto.updated;
    vbox acceleration /datalabel;
run;

In [13]:
proc sgplot data= auto.updated;
    vbox horsepower /datalabel;
run;

In [14]:
proc sgplot data= auto.updated;
    vbox mpg /datalabel;
run;

In [15]:
proc sgplot data= auto.updated;
    vbox displacement /datalabel;
run;

In [16]:
proc sgplot data= auto.updated;
    vbox weight /datalabel;
run;

In [17]:
*Method 3: Detect outliers via IQR method;

proc means data=auto.updated noprint;
    var acceleration horsepower mpg;
    output out=Tmpo (drop=_type_ _freq_)
    Q1=
    Q3=
    QRange= / autoname;
run;

data _null_;
    file print;
    set auto.updated(keep=acceleration horsepower mpg car_name);
    if _n_ = 1 then set Tmpo;
    if acceleration le acceleration_Q1 - 1.5*acceleration_QRange and not missing(acceleration) 
        or acceleration ge acceleration_Q3 + 1.5*acceleration_QRange then
        put "Possible Outlier for acceleration in " car_name "is " acceleration;
    else if horsepower le horsepower_Q1 - 1.5*horsepower_QRange and not missing(horsepower) 
        or horsepower ge horsepower_Q3 + 1.5*horsepower_QRange then
        put "Possible Outlier for horsepower in " car_name "is " horsepower;
    else if mpg le mpg_Q1 - 1.5*mpg_QRange and not missing(mpg) 
        or mpg ge mpg_Q3 + 1.5*mpg_QRange then
        put "Possible Outlier for mpg in " car_name "is " mpg;
run;
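
The rule applied in Method 3 can be restated as a formula (this only rewrites the code above; the inclusive comparisons mirror the `le` and `ge` operators it uses). A value $x$ is flagged when

```latex
x \le Q_1 - 1.5\,\mathrm{IQR}
\quad\text{or}\quad
x \ge Q_3 + 1.5\,\mathrm{IQR},
\qquad \mathrm{IQR} = Q_3 - Q_1 .
```

As a worked example, take the mpg quartiles reported in the PROC UNIVARIATE output of this notebook, $Q_1 = 17.5$ and $Q_3 = 29.0$. Those figures come from the cleaned data, so treating them as the quartiles of `auto.updated` is an assumption and the exact fences may differ slightly. Then $\mathrm{IQR} = 11.5$, the lower fence is $17.5 - 17.25 = 0.25$, and the upper fence is $29.0 + 17.25 = 46.25$. The raw maximum of 46.6 lies above that upper fence, which is consistent with a single mpg outlier being found.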

## Handle Outliers using Multiple Regression Model

In [18]:
*Only 18 of the 398 observations contain outlier values;
*Deleting these 18 rows is one option, but the dataset is small, so the rows are kept;
*Instead, the outliers found by the IQR method are first set to missing;
*Multiple regression models are then fit to predict those missing values;
*The predicted values then replace the outliers in the respective variables;


*Replacing the outliers found by IQR method with null value;

proc means data=auto.updated noprint;
    var acceleration horsepower mpg;
    output out=Tmpo (drop=_type_ _freq_)
    Q1=
    Q3=
    QRange= / autoname;
run;

data auto.outlier_null;
    set auto.updated;
    if _n_ = 1 then set Tmpo;
    if acceleration le acceleration_Q1 - 1.5*acceleration_QRange and not missing(acceleration) 
        or acceleration ge acceleration_Q3 + 1.5*acceleration_QRange then acceleration=.;
    else if horsepower le horsepower_Q1 - 1.5*horsepower_QRange and not missing(horsepower) 
        or horsepower ge horsepower_Q3 + 1.5*horsepower_QRange then horsepower=.;
    else if mpg le mpg_Q1 - 1.5*mpg_QRange and not missing(mpg) 
        or mpg ge mpg_Q3 + 1.5*mpg_QRange then mpg=.;
    
    keep mpg cylinders displacement horsepower weight acceleration model_year origin car_name;
run;

*10 null values in horsepower, 1 in mpg and 7 in acceleration;
proc means data=auto.outlier_null n nmiss;
run;

Variable,N,N Miss
mpg,397,1
cylinders,398,0
displacement,398,0
horsepower,388,10
weight,398,0
acceleration,391,7
model_year,398,0
origin,398,0
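
Because the DATA step above uses an `IF ... ELSE IF` chain, at most one variable per row is set to missing: a row flagged for acceleration is never tested for horsepower or mpg. The sketch below tests each variable independently by looping over parallel arrays. It assumes `auto.updated` and the quartile dataset `Tmpo` from the cell above exist. The output name `work.outlier_null_alt` is hypothetical, and the missing counts it produces may differ from those shown above.

```sas
/* Sketch: test each variable independently against its own IQR fences */
data work.outlier_null_alt;
    set auto.updated;
    if _n_ = 1 then set Tmpo;           /* quartiles are retained on every row */

    array x[3]  acceleration horsepower mpg;
    array q1[3] acceleration_Q1 horsepower_Q1 mpg_Q1;
    array q3[3] acceleration_Q3 horsepower_Q3 mpg_Q3;
    array qr[3] acceleration_QRange horsepower_QRange mpg_QRange;

    do i = 1 to 3;
        /* same inclusive fences as the original code */
        if not missing(x[i]) and
           (x[i] le q1[i] - 1.5*qr[i] or x[i] ge q3[i] + 1.5*qr[i])
        then x[i] = .;
    end;

    drop i acceleration_: horsepower_: mpg_:;   /* drop helper variables */
run;

proc means data=work.outlier_null_alt n nmiss;
run;
```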


In [19]:
*Model 1: Predict horsepower values;

proc reg data = auto.outlier_null noprint outest=betas;
    model horsepower = mpg cylinders displacement weight acceleration model_year origin;
run;
quit;

proc print data=betas noobs;
run;

data need_predictions;
    set auto.outlier_null;
    where horsepower = .;
run;

_MODEL_,_TYPE_,_DEPVAR_,_RMSE_,Intercept,mpg,cylinders,displacement,weight,acceleration,model_year,origin,horsepower
MODEL1,PARMS,horsepower,10.7732,121.338,-0.37394,0.5999,0.049125,0.022206,-4.38577,-0.32894,3.18011,-1


In [20]:
proc score data=need_predictions score=betas out=predictions type=parms;
    var mpg cylinders displacement weight acceleration model_year origin;
run;

data temp_3;
    set predictions;
    horsepower=Model1;
    drop model1;
run;

proc sort data=temp_3;
    by car_name weight;
run;

proc sort data=auto.outlier_null out=temp_4;
    by car_name weight;
run;

data temp_corrected;
    merge temp_4(in=l) temp_3(in=r);
    by car_name weight;
    if l=1;
run;

data auto.outlier_null;
    set temp_corrected;
run;

*Check if the horsepower values are updated. Nmiss is 0 for horsepower now;
proc means data=auto.outlier_null n nmiss;
run;

Variable,N,N Miss
mpg,397,1
cylinders,398,0
displacement,398,0
horsepower,398,0
weight,398,0
acceleration,391,7
model_year,398,0
origin,398,0
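
The score, sort, and merge sequence above can be shortened. PROC REG computes predicted values for observations whose response is missing, provided all regressors are present, so an `OUTPUT` statement is enough to obtain the imputed values. This is a sketch rather than the notebook's method. It assumes `auto.outlier_null` is in the state it had before the cell above overwrote it (horsepower still missing in 10 rows), and the names `work.hp_scored`, `work.hp_filled`, and `hp_hat` are hypothetical.

```sas
/* Sketch: let PROC REG score the rows whose response is missing */
proc reg data=auto.outlier_null noprint;
    model horsepower = mpg cylinders displacement weight
                       acceleration model_year origin;
    output out=work.hp_scored p=hp_hat;     /* hp_hat holds the predictions */
run;
quit;

data work.hp_filled;
    set work.hp_scored;
    /* keep observed values and fill only the missing ones */
    horsepower = coalesce(horsepower, hp_hat);
    drop hp_hat;
run;
```

This keeps the original row order and avoids merging on `car_name` and `weight`, a key that could match more than one row if it is not unique.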


In [21]:
*Replace the single mpg outlier with the mean value (23.5);

data auto.outlier_null;
    set auto.outlier_null;
    if mpg eq . then mpg=23.5;
run;

*Check if the mpg values are updated. Nmiss is 0 for mpg now;
proc means data=auto.outlier_null n nmiss;
run;

Variable,N,N Miss
mpg,398,0
cylinders,398,0
displacement,398,0
horsepower,398,0
weight,398,0
acceleration,391,7
model_year,398,0
origin,398,0


In [22]:
*Model 2: Predict acceleration values;

proc reg data = auto.outlier_null noprint outest=betas2;
    model acceleration = mpg cylinders displacement weight horsepower model_year origin;
run;
quit;

proc print data=betas2 noobs;
run;

data need_predictions2;
    set auto.outlier_null;
    where acceleration = .;
run;

_MODEL_,_TYPE_,_DEPVAR_,_RMSE_,Intercept,mpg,cylinders,displacement,weight,horsepower,model_year,origin,acceleration
MODEL1,PARMS,acceleration,1.56961,19.8988,-0.021623,0.15407,-0.014269,0.003302349,-0.095911,-0.022818,-0.080295,-1


In [23]:
proc score data=need_predictions2 score=betas2 out=predictions2 type=parms;
    var mpg cylinders displacement weight horsepower model_year origin;
run;

data temp_5;
    set predictions2;
    acceleration=Model1;
    drop model1;
run;

proc sort data=temp_5;
    by car_name weight;
run;

proc sort data=auto.outlier_null out=temp_6;
    by car_name weight;
run;

data temp_corrected2;
    merge temp_6(in=l) temp_5(in=r);
    by car_name weight;
    if l=1;
run;

data auto.outlier_removed;
    set temp_corrected2;
run;

*Check if the acceleration values are updated. Nmiss is 0 for all variables now;
proc means data=auto.outlier_removed n nmiss;
run;

Variable,N,N Miss
mpg,398,0
cylinders,398,0
displacement,398,0
horsepower,398,0
weight,398,0
acceleration,398,0
model_year,398,0
origin,398,0


## Check for Normal Distribution of Variables

In [24]:
*Check histograms, Q-Q plots and P-P plots to assess the distribution of each variable;

proc univariate data=auto.outlier_removed;
    var mpg displacement horsepower weight acceleration;
    ppplot;
    qqplot;
    histogram/ normal kernel;
run;

Variable: mpg
Moments
N,398.0,Sum Weights,398.0
Mean,23.4565327,Sum Observations,9335.7
Std Deviation,7.7294129,Variance,59.7438237
Skewness,0.4277896,Kurtosis,-0.5871874
Uncorrected SS,242701.45,Corrected SS,23718.298
Coeff Variation,32.9520693,Std Error Mean,0.38744046

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,23.45653,Std Deviation,7.72941
Median,23.0,Variance,59.74382
Mode,13.0,Range,35.6
,,Interquartile Range,11.5

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,60.54229,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,44.6
99%,43.4
95%,37.0
90%,34.3
75% Q3,29.0
50% Median,23.0
25% Q1,17.5
10%,14.0
5%,13.0
1%,11.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
9,221,43.1,382
10,176,43.4,393
10,103,44.0,394
11,268,44.3,397
11,255,44.6,231

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,23.45653
Std Dev,Sigma,7.729413

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.07842462,Pr > D,<0.010
Cramer-von Mises,W-Sq,0.51447954,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,3.40439002,Pr > A-Sq,<0.005

Quantiles for Normal Distribution
Percent,Observed Quantile,Estimated Quantile
1.0,11.0,5.47523
5.0,13.0,10.74278
10.0,14.0,13.55089
25.0,17.5,18.24312
50.0,23.0,23.45653
75.0,29.0,28.66994
90.0,34.3,33.36217
95.0,37.0,36.17029
99.0,43.4,41.43784

Variable: displacement
Moments
N,398.0,Sum Weights,398.0
Mean,193.425879,Sum Observations,76983.5
Std Deviation,104.269838,Variance,10872.1992
Skewness,0.71964516,Kurtosis,-0.7465966
Uncorrected SS,19206864.3,Corrected SS,4316263.06
Coeff Variation,53.9068704,Std Error Mean,5.22657472

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,193.4259,Std Deviation,104.26984
Median,148.5,Variance,10872.0
Mode,97.0,Range,387.0
,,Interquartile Range,158.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,37.00815,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,455.0
99%,454.0
95%,400.0
90%,350.0
75% Q3,262.0
50% Median,148.5
25% Q1,104.0
10%,90.0
5%,85.0
1%,70.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
68,165,440,296
70,246,454,82
70,245,455,44
70,236,455,45
71,353,455,317

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,193.4259
Std Dev,Sigma,104.2698

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.1830796,Pr > D,<0.010
Cramer-von Mises,W-Sq,3.0834214,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,17.8988361,Pr > A-Sq,<0.005

Quantiles for Normal Distribution
Percent,Observed Quantile,Estimated Quantile
1.0,70.0,-49.142
5.0,85.0,21.9173
10.0,90.0,59.7987
25.0,104.0,123.0969
50.0,148.5,193.4259
75.0,262.0,263.7548
90.0,350.0,327.0531
95.0,400.0,364.9345
99.0,454.0,435.9938

Variable: horsepower
Moments
N,398.0,Sum Weights,398.0
Mean,103.245892,Sum Observations,41091.8648
Std Deviation,35.2549053,Variance,1242.90835
Skewness,0.82380494,Kurtosis,-0.1450159
Uncorrected SS,4736000.83,Corrected SS,493434.615
Coeff Variation,34.1465455,Std Error Mean,1.76716872

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,103.2459,Std Deviation,35.25491
Median,95.0,Variance,1243.0
Mode,150.0,Range,169.0
,,Interquartile Range,49.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,58.42447,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,215
99%,193
95%,175
90%,155
75% Q3,125
50% Median,95
25% Q1,76
10%,67
5%,60
1%,48

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
46,385,190,110
46,372,193,221
48,397,198,186
48,393,198,256
48,382,215,296

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,103.2459
Std Dev,Sigma,35.25491

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.1447187,Pr > D,<0.010
Cramer-von Mises,W-Sq,2.0267791,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,11.2076512,Pr > A-Sq,<0.005

Quantiles for Normal Distribution
Percent,Observed Quantile,Estimated Quantile
1.0,48.0,21.2307
5.0,60.0,45.2567
10.0,67.0,58.0649
25.0,76.0,79.4668
50.0,95.0,103.2459
75.0,125.0,127.025
90.0,155.0,148.4269
95.0,175.0,161.2351
99.0,193.0,185.2611

Variable: weight
Moments
N,398.0,Sum Weights,398.0
Mean,2970.42462,Sum Observations,1182229.0
Std Deviation,846.841774,Variance,717140.991
Skewness,0.53106251,Kurtosis,-0.7855289
Uncorrected SS,3796427105.0,Corrected SS,284704973.0
Coeff Variation,28.5091151,Std Error Mean,42.4483425

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,2970.425,Std Deviation,846.84177
Median,2803.5,Variance,717141.0
Mode,1985.0,Range,3527.0
,,Interquartile Range,1386.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,69.9774,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,5140.0
99%,4952.0
95%,4464.0
90%,4278.0
75% Q3,3609.0
50% Median,2803.5
25% Q1,2223.0
10%,1985.0
5%,1915.0
1%,1760.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
1613,111,4951,44
1649,357,4952,256
1755,367,4955,157
1760,230,4997,83
1773,352,5140,328

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,2970.425
Std Dev,Sigma,846.8418

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.09343449,Pr > D,<0.010
Cramer-von Mises,W-Sq,1.13653228,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,7.3015623,Pr > A-Sq,<0.005

Quantiles for Normal Distribution
Percent,Observed Quantile,Estimated Quantile
1.0,1760.0,1000.38
5.0,1915.0,1577.49
10.0,1985.0,1885.15
25.0,2223.0,2399.24
50.0,2803.5,2970.42
75.0,3609.0,3541.61
90.0,4278.0,4055.7
95.0,4464.0,4363.36
99.0,4952.0,4940.47

Variable: acceleration
Moments
N,398.0,Sum Weights,398.0
Mean,15.5183385,Sum Observations,6176.29872
Std Deviation,2.64478844,Variance,6.99490592
Skewness,0.02957328,Kurtosis,0.03437618
Uncorrected SS,98622.8717,Corrected SS,2776.97765
Coeff Variation,17.0429872,Std Error Mean,0.13257127

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,15.51834,Std Deviation,2.64479
Median,15.5,Variance,6.99491
Mode,14.5,Range,15.70838
,,Interquartile Range,3.4

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,117.0566,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,22.2
99%,21.9
95%,19.9
90%,19.0
75% Q3,17.2
50% Median,15.5
25% Q1,13.8
10%,12.0
5%,11.2
1%,9.5

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
6.49162,296,21.8,248
8.05554,2,21.9,280
9.0,82,22.1,101
9.5,321,22.2,72
9.5,87,22.2,265

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,15.51834
Std Dev,Sigma,2.644788

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.04081069,Pr > D,0.104
Cramer-von Mises,W-Sq,0.07851793,Pr > W-Sq,0.223
Anderson-Darling,A-Sq,0.46181707,Pr > A-Sq,>0.250

Quantiles for Normal Distribution
Percent,Observed Quantile,Estimated Quantile
1.0,9.5,9.36564
5.0,11.2,11.16805
10.0,12.0,12.12891
25.0,13.8,13.73446
50.0,15.5,15.51834
75.0,17.2,17.30222
90.0,19.0,18.90777
95.0,19.9,19.86863
99.0,21.9,21.67104


## Check Correlation between Variables

In [25]:
*Compute Pearson correlation (linear association) and Spearman rank correlation (monotonic, possibly non-linear association);

proc corr data=auto.outlier_removed pearson spearman plots=matrix(nvar=all histogram);
run;

8 Variables:,mpg cylinders displacement horsepower weight acceleration model_year origin

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Variable,N,Mean,Std Dev,Median,Minimum,Maximum
mpg,398,23.45653,7.72941,23.0,9.0,44.6
cylinders,398,5.45477,1.701,4.0,3.0,8.0
displacement,398,193.42588,104.26984,148.5,68.0,455.0
horsepower,398,103.24589,35.25491,95.0,46.0,215.0
weight,398,2970.0,846.84177,2804.0,1613.0,5140.0
acceleration,398,15.51834,2.64479,15.5,6.49162,22.2
model_year,398,76.01005,3.69763,76.0,70.0,82.0
origin,398,1.57286,0.80205,1.0,1.0,3.0

"Pearson Correlation Coefficients, N = 398; each cell shows r followed by Prob > |r| under H0: Rho=0"
Variable,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.00000,-0.77764 <.0001,-0.80545 <.0001,-0.78800 <.0001,-0.83341 <.0001,0.40710 <.0001,0.57763 <.0001,0.55637 <.0001
cylinders,-0.77764 <.0001,1.00000,0.95072 <.0001,0.85694 <.0001,0.89602 <.0001,-0.50993 <.0001,-0.34875 <.0001,-0.56254 <.0001
displacement,-0.80545 <.0001,0.95072 <.0001,1.00000,0.89786 <.0001,0.93282 <.0001,-0.55234 <.0001,-0.37016 <.0001,-0.60941 <.0001
horsepower,-0.78800 <.0001,0.85694 <.0001,0.89786 <.0001,1.00000,0.87861 <.0001,-0.69577 <.0001,-0.39843 <.0001,-0.46671 <.0001
weight,-0.83341 <.0001,0.89602 <.0001,0.93282 <.0001,0.87861 <.0001,1.00000,-0.42528 <.0001,-0.30656 <.0001,-0.58102 <.0001
acceleration,0.40710 <.0001,-0.50993 <.0001,-0.55234 <.0001,-0.69577 <.0001,-0.42528 <.0001,1.00000,0.28611 <.0001,0.20412 <.0001
model_year,0.57763 <.0001,-0.34875 <.0001,-0.37016 <.0001,-0.39843 <.0001,-0.30656 <.0001,0.28611 <.0001,1.00000,0.18066 0.0003
origin,0.55637 <.0001,-0.56254 <.0001,-0.60941 <.0001,-0.46671 <.0001,-0.58102 <.0001,0.20412 <.0001,0.18066 0.0003,1.00000

"Spearman Correlation Coefficients, N = 398; each cell shows r followed by Prob > |r| under H0: Rho=0"
Variable,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.00000,-0.82145 <.0001,-0.85281 <.0001,-0.83944 <.0001,-0.87339 <.0001,0.43311 <.0001,0.57088 <.0001,0.57633 <.0001
cylinders,-0.82145 <.0001,1.00000,0.91188 <.0001,0.80854 <.0001,0.87331 <.0001,-0.47278 <.0001,-0.33501 <.0001,-0.60455 <.0001
displacement,-0.85281 <.0001,0.91188 <.0001,1.00000,0.86640 <.0001,0.94599 <.0001,-0.49486 <.0001,-0.30526 <.0001,-0.70720 <.0001
horsepower,-0.83944 <.0001,0.80854 <.0001,0.86640 <.0001,1.00000,0.86955 <.0001,-0.64586 <.0001,-0.37699 <.0001,-0.50736 <.0001
weight,-0.87339 <.0001,0.87331 <.0001,0.94599 <.0001,0.86955 <.0001,1.00000,-0.40328 <.0001,-0.27701 <.0001,-0.62843 <.0001
acceleration,0.43311 <.0001,-0.47278 <.0001,-0.49486 <.0001,-0.64586 <.0001,-0.40328 <.0001,1.00000,0.27279 <.0001,0.21833 <.0001
model_year,0.57088 <.0001,-0.33501 <.0001,-0.30526 <.0001,-0.37699 <.0001,-0.27701 <.0001,0.27279 <.0001,1.00000,0.16655 0.0009
origin,0.57633 <.0001,-0.60455 <.0001,-0.70720 <.0001,-0.50736 <.0001,-0.62843 <.0001,0.21833 <.0001,0.16655 0.0009,1.00000
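
Since mpg is the target, the first row of each matrix is the part that matters most for modelling: weight, displacement, horsepower, and cylinders are strongly negatively correlated with mpg, while model_year and origin are moderately positively correlated. As a sketch (assuming `auto.outlier_removed` exists), the `WITH` statement restricts the output to exactly those correlations.

```sas
/* Sketch: correlations of each predictor with mpg only */
proc corr data=auto.outlier_removed pearson spearman nosimple;
    var cylinders displacement horsepower weight
        acceleration model_year origin;
    with mpg;       /* one row per method instead of the full 8 x 8 matrix */
run;
```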


## Log Transformation

In [26]:
*A log transformation can make skewed distributions less skewed;
*displacement and mpg are right-skewed, so the log transformation is applied to both;

*Apply logarithmic transformation on displacement variable;

data auto.log_data;
    set auto.outlier_removed;
    disp_log = log(displacement);
run;

*Compare the new graphs;
ods select Histogram;
proc univariate data= auto.log_data noprint;
    var disp_log displacement;
    histogram / kernel normal;
run;

In [27]:
*Apply logarithmic transformation on mpg variable;

data auto.log_data;
    set auto.log_data;
    mpg_log = log(mpg);
run;

*Compare the new graphs;
ods select Histogram;
proc univariate data= auto.log_data noprint;
    var mpg_log mpg;
    histogram / kernel normal;
run;
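
Besides comparing histograms by eye, the effect of the transformation can be checked numerically. The sketch below assumes `auto.log_data` contains both `disp_log` and `mpg_log` as created above, and prints the skewness of each variable next to its log-transformed version. Values closer to 0 indicate a more symmetric distribution.

```sas
/* Sketch: compare skewness before and after the log transformation */
proc means data=auto.log_data n mean skewness maxdec=3;
    var displacement disp_log mpg mpg_log;
run;
```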

## Additional Visualizations

In [28]:
*Plot scatter plots to better visualize the relation between all the variables;

proc sgscatter data=auto.log_data;
    matrix mpg cylinders displacement horsepower weight model_year acceleration/ diagonal=(histogram);
run;

In [29]:
*Create 3D visualizations for checking the impact of different variables on mpg;

proc kde data=auto.log_data;
    bivar mpg cylinders / noprint plots=(histogram surface);
    bivar mpg displacement / noprint plots=(histogram surface);
    bivar mpg horsepower / noprint plots=(histogram surface);
    bivar mpg weight / noprint plots=(histogram surface);
    bivar mpg model_year / noprint plots=(histogram surface);
    bivar mpg acceleration / noprint plots=(histogram surface);
run;

## Prediction (Multiple Linear Regression Model)

In [32]:
*Using the multiple regression model below, the predictors explain about 81.80% of the variance in mpg (R-square);

proc reg data = auto.log_data plots=all;
    model mpg = cylinders horsepower displacement weight acceleration model_year origin / selection = rsquare cp adjrsq best=4;
run; 
quit;

Number of Observations Read,398
Number of Observations Used,398

Model Index,Number in Model,R-Square,Adjusted R-Square,C(p),Variables in Model
1,1,0.6946,0.6938,271.972,weight
2,1,0.6488,0.6479,371.8624,displacement
3,1,0.6209,0.62,432.5105,horsepower
4,1,0.6047,0.6037,467.8635,cylinders
5,2,0.8091,0.8081,24.2298,weight model_year
6,2,0.7393,0.7379,176.5068,displacement model_year
7,2,0.7116,0.7102,236.7694,cylinders model_year
8,2,0.7082,0.7067,244.2448,horsepower weight
9,3,0.8168,0.8154,9.5273,weight model_year origin
10,3,0.8096,0.8081,25.2177,horsepower weight model_year
