## Auto MPG Data Set
- https://archive.ics.uci.edu/ml/datasets/Auto+MPG
- The data concerns city-cycle fuel consumption in miles per gallon, to be predicted in terms of 3 multivalued discrete and 5 continuous attributes.
- This is a regression problem: the goal is to predict a continuous value, i.e. fuel efficiency (mpg).

## Steps performed in this SAS notebook:
- Importing Raw Data Files
- Check Data Types of Variables
- Check for Missing Data
- Handle Missing Values
- Check for Duplicate Entries
- Check for Outliers
- Handle Outliers using Multiple Linear Regression (Pending)
- Check for Normal Distribution of Variables
- Check Correlation between Variables
- Log Transformation (Pending)
- Feature Construction: Extracting name of the brand (Pending)
- Final Visualizations (Pending)

## Importing Raw Data Files

In [1]:
libname auto '/folders/myfolders/Project';

proc import Datafile= "~/Project/auto_mpg.csv"
out= auto.original
dbms=csv
replace;
run;

* Ignore the error in the import log. It is caused by '?' values in the horsepower column, which SAS reads as missing;
proc print data=auto.original (obs=10);
run;

Obs,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
1,18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu
2,15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320
3,18.0,8,318.0,150,3436,11.0,70,1,plymouth satellite
4,16.0,8,304.0,150,3433,12.0,70,1,amc rebel sst
5,17.0,8,302.0,140,3449,10.5,70,1,ford torino
6,15.0,8,429.0,198,4341,10.0,70,1,ford galaxie 500
7,14.0,8,454.0,220,4354,9.0,70,1,chevrolet impala
8,14.0,8,440.0,215,4312,8.5,70,1,plymouth fury iii
9,14.0,8,455.0,225,4425,10.0,70,1,pontiac catalina
10,15.0,8,390.0,190,3850,8.5,70,1,amc ambassador dpl
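
The import error mentioned above can also be avoided by reading the file with a DATA step. The following is a minimal sketch, not the method used in this notebook. It assumes the CSV is comma-delimited, has a single header row, and lists its columns in the order shown in the printed output. The data set name `work.auto_raw` and the `$40.` length for `car_name` are hypothetical choices.

```sas
/* Sketch: read the CSV with a DATA step instead of PROC IMPORT.        */
/* Assumes one header row and the column order shown in the output.     */
/* work.auto_raw is a hypothetical name used only for illustration.     */
data work.auto_raw;
    infile "~/Project/auto_mpg.csv" dsd firstobs=2 truncover;
    input mpg cylinders displacement
          horsepower ??   /* ?? suppresses invalid-data messages; '?' is stored as missing */
          weight acceleration model_year origin
          car_name :$40.;
run;
```

With this approach the `?` entries still end up as missing numeric values, but the log stays free of invalid-data errors.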


## Check Data Types of Variables

In [2]:
proc contents data=auto.original varnum;
run;

Data Set Name,AUTO.ORIGINAL,Observations,398
Member Type,DATA,Variables,9
Engine,V9,Indexes,0
Created,11/12/2020 21:29:40,Observation Length,96
Last Modified,11/12/2020 21:29:40,Deleted Observations,0
Protection,,Compressed,NO
Data Set Type,,Sorted,NO
Label,,,
Data Representation,"SOLARIS_X86_64, LINUX_X86_64, ALPHA_TRU64, LINUX_IA64",,
Encoding,utf-8 Unicode (UTF-8),,

Engine/Host Dependent Information,Engine/Host Dependent Information.1
Data Set Page Size,65536
Number of Data Set Pages,1
First Data Page,1
Max Obs per Page,681
Obs in First Data Page,398
Number of Data Set Repairs,0
Filename,/folders/myfolders/Project/original.sas7bdat
Release Created,9.0401M6
Host Created,Linux
Inode Number,230

Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order,Variables in Creation Order
#,Variable,Type,Len,Format,Informat
1,mpg,Num,8,BEST12.,BEST32.
2,cylinders,Num,8,BEST12.,BEST32.
3,displacement,Num,8,BEST12.,BEST32.
4,horsepower,Num,8,BEST12.,BEST32.
5,weight,Num,8,BEST12.,BEST32.
6,acceleration,Num,8,BEST12.,BEST32.
7,model_year,Num,8,BEST12.,BEST32.
8,origin,Num,8,BEST12.,BEST32.
9,car_name,Char,25,$25.,$25.
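
When only the variable names and types are needed, the same metadata can be queried directly. This is a small sketch that assumes the `auto` libref from the import step is still assigned. `dictionary.columns` stores library and member names in upper case, which is why the literals below are capitalised.

```sas
/* Sketch: list variable names, types and lengths without the full PROC CONTENTS report */
proc sql;
    select name, type, length, format
    from dictionary.columns
    where libname = 'AUTO' and memname = 'ORIGINAL';
quit;
```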


## Check for Missing Data

In [3]:
proc means data=auto.original n nmiss;
run;
* Output shows 6 missing values in horsepower variable;

Variable,N,N Miss
mpg,398,0
cylinders,398,0
displacement,398,0
horsepower,392,6
weight,398,0
acceleration,398,0
model_year,398,0
origin,398,0


In [4]:
proc format;
value $car
' ' = 'Missing'
other = 'Non Missing';
run;

proc freq data=auto.original;
tables _character_ /nocum missing;
format _character_ $car.;
run;

*Output shows no missing values in character variables;

car_name,Frequency,Percent
Non Missing,398,100.0


In [5]:
*Print the row numbers for which horsepower is missing;
*horsepower is numeric, so the '?' entries were read as missing values (.);
data _null_;
set auto.original;
file print;
if missing(horsepower) then put _n_= horsepower=;
run;
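
To see which cars are affected, and not only their row numbers, the same condition can be used as a filter. This is a minimal sketch, and the choice of columns in the `var` statement is arbitrary.

```sas
/* Sketch: list the observations whose horsepower is missing */
proc print data=auto.original;
    where missing(horsepower);
    var car_name model_year horsepower;
run;
```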

In [6]:
proc means data=auto.original mean median;
var horsepower;
run;

Analysis Variable : horsepower,Analysis Variable : horsepower
Mean,Median
104.4693878,93.5


## Handle Missing Values

In [7]:
data auto.updated;
set auto.original;
*Replace missing values with the median of the horsepower variable (93.5);
if missing(horsepower) then horsepower=93.5;
run;

* Re-check if there are any missing values now;
proc means data=auto.updated n nmiss;
var horsepower;
run;

Analysis Variable : horsepower,Analysis Variable : horsepower
N,N Miss
398,0
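
The median does not have to be typed in by hand. As a hedged alternative, `PROC STDIZE` with the `REPONLY` option replaces only the missing values and leaves the rest of the variable unscaled. The sketch below assumes SAS/STAT is available, and `work.updated_alt` is a hypothetical output name.

```sas
/* Sketch: impute missing horsepower values with the median computed by SAS */
proc stdize data=auto.original out=work.updated_alt
            reponly method=median;
    var horsepower;
run;

/* Confirm that no missing values remain */
proc means data=work.updated_alt n nmiss;
    var horsepower;
run;
```

This keeps the imputed value in sync with the data if the input file ever changes.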


## Check for Duplicate Entries

In [8]:
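*NODUPRECS only compares adjacent records, so sort by all variables to make identical rows adjacent;
proc sort data=auto.updated out=auto.temp3 noduprecs;
by _all_;
run;

*Log shows 0 duplicates were deleted, so all observations are unique;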
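
Instead of reading the duplicate count from the log, any duplicate rows can be written to their own data set for inspection. This is a sketch in which `work.dedup` and `work.dups` are hypothetical names. `NODUPKEY` combined with `by _all_` treats a row as a duplicate only when every variable matches.

```sas
/* Sketch: keep one copy of each row and send exact duplicates to work.dups */
proc sort data=auto.updated out=work.dedup
          dupout=work.dups nodupkey;
    by _all_;
run;

/* An empty listing here means no exact duplicates were found */
proc print data=work.dups;
run;
```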

## Check for Outliers

In [9]:
ods output TrimmedMeans=auto.trimmed (keep= Varname mean stdmean df);

*Trimming 5% of the values from each tail of the data (trim=0.05);
proc univariate data = auto.updated trim=0.05 nextrobs=10;
run;

ods output close;

proc print data=auto.trimmed;
run;

Variable: mpg

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,23.5145729,Sum Observations,9358.8
Std Deviation,7.81598431,Variance,61.0896108
Skewness,0.45706634,Kurtosis,-0.5107813
Uncorrected SS,244320.76,Corrected SS,24252.5755
Coeff Variation,33.2388955,Std Error Mean,0.39177989

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,23.51457,Std Deviation,7.81598
Median,23.0,Variance,61.08961
Mode,13.0,Range,37.6
,,Interquartile Range,11.5

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,60.01986,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,23.22291,0.40944,22.41769,24.02812,357,56.71873,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,46.6
99%,44.0
95%,37.2
90%,34.4
75% Q3,29.0
50% Median,23.0
25% Q1,17.5
10%,14.0
5%,13.0
1%,11.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
9,29,39.4,248
10,27,40.8,325
10,26,40.9,331
11,125,41.5,310
11,104,43.1,245
11,68,43.4,327
11,28,44.0,395
12,107,44.3,326
12,105,44.6,330
12,96,46.6,323

Variable: cylinders

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,5.45477387,Sum Observations,2171.0
Std Deviation,1.70100424,Variance,2.89341544
Skewness,0.52692155,Kurtosis,-1.3766622
Uncorrected SS,12991.0,Corrected SS,1148.68593
Coeff Variation,31.183772,Std Error Mean,0.08526364

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,5.454774,Std Deviation,1.701
Median,4.0,Variance,2.89342
Mode,4.0,Range,5.0
,,Interquartile Range,4.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,63.97538,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,5.405028,0.094154,5.219861,5.590195,357,57.40598,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,8
99%,8
95%,8
90%,8
75% Q3,8
50% Median,4
25% Q1,4
10%,4
5%,4
1%,3

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
3,335,8,287
3,244,8,288
3,112,8,289
3,72,8,290
4,398,8,291
4,397,8,292
4,396,8,293
4,395,8,299
4,394,8,301
4,393,8,365

Variable: displacement

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,193.425879,Sum Observations,76983.5
Std Deviation,104.269838,Variance,10872.1992
Skewness,0.71964516,Kurtosis,-0.7465966
Uncorrected SS,19206864.3,Corrected SS,4316263.06
Coeff Variation,53.9068704,Std Error Mean,5.22657472

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,193.4259,Std Deviation,104.26984
Median,148.5,Variance,10872.0
Mode,97.0,Range,387.0
,,Interquartile Range,158.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,37.00815,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,187.3282,5.664627,176.188,198.4684,357,33.06982,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,455.0
99%,454.0
95%,400.0
90%,350.0
75% Q3,262.0
50% Median,148.5
25% Q1,104.0
10%,90.0
5%,85.0
1%,70.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
68,118,400,232
70,335,429,6
70,112,429,68
70,72,429,91
71,132,440,8
71,54,440,95
72,55,454,7
76,145,455,9
78,247,455,14
79,344,455,96

Variable: horsepower

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,104.30402,Sum Observations,41513.0
Std Deviation,38.2226249,Variance,1460.96905
Skewness,1.10622429,Kurtosis,0.76358327
Uncorrected SS,4909977.5,Corrected SS,580004.714
Coeff Variation,36.6453995,Std Error Mean,1.91592706

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,104.304,Std Deviation,38.22262
Median,93.5,Variance,1461.0
Mode,150.0,Range,184.0
,,Interquartile Range,49.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,54.4405,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,101.5559,1.926297,97.76755,105.3442,357,52.72078,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,230.0
99%,225.0
95%,180.0
90%,158.0
75% Q3,125.0
50% Median,93.5
25% Q1,76.0
10%,67.0
5%,60.0
1%,48.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
46,103,208,68
46,20,210,28
48,327,215,8
48,326,215,26
48,245,215,95
49,118,220,7
52,395,225,9
52,247,225,14
52,196,225,96
52,145,230,117

Variable: weight

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,2970.42462,Sum Observations,1182229.0
Std Deviation,846.841774,Variance,717140.991
Skewness,0.53106251,Kurtosis,-0.7855289
Uncorrected SS,3796427105.0,Corrected SS,284704973.0
Coeff Variation,28.5091151,Std Error Mean,42.4483425

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,2970.425,Std Deviation,846.84177
Median,2803.5,Variance,717141.0
Mode,1985.0,Range,3527.0
,,Interquartile Range,1386.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,69.9774,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,2937.168,45.1982,2848.279,3026.056,357,64.98417,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,5140.0
99%,4952.0
95%,4464.0
90%,4278.0
75% Q3,3609.0
50% Median,2803.5
25% Q1,2223.0
10%,1985.0
5%,1915.0
1%,1760.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
1613,55,4699,138
1649,145,4732,29
1755,344,4735,95
1760,346,4746,44
1773,54,4906,105
1795,199,4951,96
1795,182,4952,91
1800,249,4955,43
1800,246,4997,104
1825,219,5140,45

Variable: acceleration

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,15.5680905,Sum Observations,6196.1
Std Deviation,2.75768893,Variance,7.60484823
Skewness,0.27877684,Kurtosis,0.41949688
Uncorrected SS,99480.57,Corrected SS,3019.12475
Coeff Variation,17.7137263,Std Error Mean,0.13823046

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,15.56809,Std Deviation,2.75769
Median,15.5,Variance,7.60485
Mode,14.5,Range,16.8
,,Interquartile Range,3.4

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,112.6242,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,15.51983,0.137505,15.24941,15.79025,357,112.8678,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,24.8
99%,23.5
95%,20.5
90%,19.0
75% Q3,17.2
50% Median,15.5
25% Q1,13.8
10%,12.0
5%,11.2
1%,9.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
8.0,12,21.7,326
8.5,10,21.8,329
8.5,8,21.9,210
9.0,7,22.1,197
9.5,117,22.2,196
9.5,13,22.2,301
10.0,14,23.5,60
10.0,11,23.7,327
10.0,9,24.6,395
10.0,6,24.8,300

Variable: model_year

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,76.0100503,Sum Observations,30252.0
Std Deviation,3.69762665,Variance,13.6724428
Skewness,0.01153459,Kurtosis,-1.1812317
Uncorrected SS,2304884.0,Corrected SS,5427.9598
Coeff Variation,4.86465492,Std Error Mean,0.18534528

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,76.01005,Std Deviation,3.69763
Median,76.0,Variance,13.67244
Mode,73.0,Range,12.0
,,Interquartile Range,6.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,410.0997,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,76.01117,0.206083,75.60588,76.41646,357,368.8372,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,82
99%,82
95%,82
90%,81
75% Q3,79
50% Median,76
25% Q1,73
10%,71
5%,70
1%,70

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
70,29,82,389
70,28,82,390
70,27,82,391
70,26,82,392
70,25,82,393
70,24,82,394
70,23,82,395
70,22,82,396
70,21,82,397
70,20,82,398

Variable: origin

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,1.57286432,Sum Observations,626.0
Std Deviation,0.80205488,Variance,0.64329203
Skewness,0.9237763,Kurtosis,-0.8175968
Uncorrected SS,1240.0,Corrected SS,255.386935
Coeff Variation,50.9932654,Std Error Mean,0.04020338

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,1.572864,Std Deviation,0.80205
Median,1.0,Variance,0.64329
Mode,1.0,Range,2.0
,,Interquartile Range,1.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,39.12269,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means,Trimmed Means
Percent Trimmed in Tail,Number Trimmed in Tail,Trimmed Mean,Std Error Trimmed Mean,95% Confidence Limits,95% Confidence Limits.1,DF,t for H0: Mu0=0.00,Pr > |t|
5.03,20,1.52514,0.044702,1.437228,1.613051,357,34.11818,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,3
99%,3
95%,3
90%,3
75% Q3,2
50% Median,1
25% Q1,1
10%,1
5%,1
1%,1

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
1,398,3,363
1,397,3,377
1,396,3,378
1,394,3,381
1,393,3,382
1,392,3,383
1,390,3,384
1,389,3,385
1,388,3,386
1,387,3,391

Obs,VarName,Mean,StdMean,DF
1,mpg,23.2229,0.4094,357
2,cylinders,5.405,0.0942,357
3,displacement,187.3,5.6646,357
4,horsepower,101.6,1.9263,357
5,weight,2937.2,45.1982,357
6,acceleration,15.5198,0.1375,357
7,model_year,76.0112,0.2061,357
8,origin,1.5251,0.0447,357


In [10]:
*Restructure the dataset from wide to long format, with one row per variable name and value;
data auto.temp;
set auto.updated;
array vars[*] _numeric_;
length VarName $ 32;
do i=1 to dim(vars);
    Varname=vname(vars[i]);
    Value=vars[i];
    output;
end;
keep Varname Value;
run;

proc print data=auto.temp (obs=10);
run;

Obs,VarName,Value
1,mpg,18
2,cylinders,8
3,displacement,307
4,horsepower,130
5,weight,3504
6,acceleration,12
7,model_year,70
8,origin,1
9,mpg,15
10,cylinders,8
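
The same wide-to-long reshaping can be sketched with `PROC TRANSPOSE`. This is an alternative under stated assumptions, not the approach used above. The helper variable `row_id` and the data set names `work.with_id` and `work.long` are hypothetical, and the variable list is taken from the printed output.

```sas
/* Sketch: add a row identifier so each observation forms its own BY group */
data work.with_id;
    set auto.updated;
    row_id = _n_;
run;

/* Transpose the numeric variables into VarName / Value pairs */
proc transpose data=work.with_id
               out=work.long (rename=(_name_=VarName col1=Value));
    by row_id;
    var mpg cylinders displacement horsepower weight acceleration model_year origin;
run;
```

The result has one extra column (`row_id`), which makes it possible to trace an outlying value back to its original observation.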


In [11]:
proc sort data=auto.temp;
by varname;
run;

proc sort data=auto.trimmed;
by varname;
run;

data auto.outlier;
merge auto.temp auto.trimmed;
by varname;

* Declare the length first so that High is not truncated to Hig;
length reason $ 4;

* Approximate standard deviation recovered from the standard error of the trimmed mean;
std_dev=stdmean*sqrt(df+1);

* Checking values 3 Standard Deviation away from the mean;
if value lt mean-3*std_dev then do;
reason='Low';
output;
end;

else if value gt mean+3*std_dev then do;
reason='High';
output;
end;
run;

* Print the outlier values and the reason;
proc print data=auto.outlier;
var varname value reason;
run;


Obs,VarName,Value,reason
1,acceleration,23.5,High
2,acceleration,24.8,High
3,acceleration,23.7,High
4,acceleration,24.6,High
5,horsepower,220.0,High
6,horsepower,215.0,High
7,horsepower,225.0,High
8,horsepower,225.0,High
9,horsepower,215.0,High
10,horsepower,215.0,High
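
The factor `sqrt(df+1)` in `std_dev` can be motivated as follows. This sketch assumes that PROC UNIVARIATE derives the standard error of the trimmed mean from the Winsorized sum of squared deviations, as the SAS documentation describes. The symbols $n$ (sample size), $k$ (number trimmed per tail) and $S_w^2$ (Winsorized sum of squared deviations) are introduced here for illustration only.

$$
\text{StdMean} = \sqrt{\frac{S_w^2}{(n-2k)(n-2k-1)}}, \qquad \text{DF} = n-2k-1
$$

$$
\text{std\_dev} = \text{StdMean}\cdot\sqrt{\text{DF}+1} = \sqrt{\frac{S_w^2}{n-2k-1}}
$$

With $n = 398$ and $k = 20$, $\text{DF} = 357$. For acceleration this gives approximately

$$
\text{std\_dev} \approx 0.1375 \cdot \sqrt{358} \approx 2.60, \qquad 15.5198 + 3 \cdot 2.60 \approx 23.32,
$$

so an acceleration of 23.5 falls just above the upper limit and is flagged as high.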


In [12]:
data auto.outlier;
merge auto.temp auto.trimmed;
by varname;

* Declare the length first so that High is not truncated to Hig;
length reason $ 4;

* Approximate standard deviation recovered from the standard error of the trimmed mean;
std_dev=stdmean*sqrt(df+1);

* Checking values 2 Standard Deviation away from the mean;
if value lt mean-2*std_dev then do;
reason='Low';
output;
end;

else if value gt mean+2*std_dev then do;
reason='High';
output;
end;
run;

proc print data=auto.outlier;
var varname value reason;
run;

Obs,VarName,Value,reason
1,acceleration,10.0,Low
2,acceleration,9.0,Low
3,acceleration,8.5,Low
4,acceleration,10.0,Low
5,acceleration,8.5,Low
6,acceleration,10.0,Low
7,acceleration,8.0,Low
8,acceleration,9.5,Low
9,acceleration,10.0,Low
10,acceleration,23.5,High


In [13]:
*Check outliers using box plot;
proc sgplot data= auto.updated;
hbox acceleration /datalabel;
run;

In [14]:
proc sgplot data= auto.updated;
hbox horsepower /datalabel;
run;

In [15]:
proc sgplot data= auto.updated;
hbox mpg /datalabel;
run;

In [16]:
proc sgplot data= auto.updated;
hbox displacement /datalabel;
run;

In [17]:
proc sgplot data= auto.updated;
hbox weight /datalabel;
run;
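
The points that a box plot marks as outliers can also be listed numerically. The sketch below assumes the usual 1.5 × IQR fence rule and uses acceleration as the example. The data set names `work.acc_q` and `work.acc_outliers` are hypothetical. With the quartiles reported earlier (Q1 = 13.8, Q3 = 17.2) the fences would be 8.7 and 22.3, although the plot's own quartile method may differ slightly.

```sas
/* Sketch: compute the quartiles of acceleration */
proc means data=auto.updated noprint;
    var acceleration;
    output out=work.acc_q q1=q1 q3=q3;
run;

/* Keep only the observations outside the 1.5 * IQR fences */
data work.acc_outliers;
    if _n_ = 1 then set work.acc_q (keep=q1 q3);
    set auto.updated;
    iqr = q3 - q1;
    if acceleration < q1 - 1.5*iqr or acceleration > q3 + 1.5*iqr;
run;

proc print data=work.acc_outliers;
    var car_name acceleration q1 q3 iqr;
run;
```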

## Handle Outliers

In [18]:
*Pending;

## Check for Normal Distribution of Variables

In [19]:
*Check P-P plots, histograms with a fitted normal curve, and normal probability plots to assess the distribution of each variable;
proc univariate data=auto.updated;
var mpg displacement horsepower weight acceleration;
ppplot;
histogram/normal;
probplot/normal;
run;

Variable: mpg

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,23.5145729,Sum Observations,9358.8
Std Deviation,7.81598431,Variance,61.0896108
Skewness,0.45706634,Kurtosis,-0.5107813
Uncorrected SS,244320.76,Corrected SS,24252.5755
Coeff Variation,33.2388955,Std Error Mean,0.39177989

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,23.51457,Std Deviation,7.81598
Median,23.0,Variance,61.08961
Mode,13.0,Range,37.6
,,Interquartile Range,11.5

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,60.01986,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,46.6
99%,44.0
95%,37.2
90%,34.4
75% Q3,29.0
50% Median,23.0
25% Q1,17.5
10%,14.0
5%,13.0
1%,11.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
9,29,43.4,327
10,27,44.0,395
10,26,44.3,326
11,125,44.6,330
11,104,46.6,323

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,23.51457
Std Dev,Sigma,7.815984

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.07887911,Pr > D,<0.010
Cramer-von Mises,W-Sq,0.52226726,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,3.44258042,Pr > A-Sq,<0.005

Quantiles for Normal Distribution,Quantiles for Normal Distribution,Quantiles for Normal Distribution
Percent,Quantile,Quantile
Percent,Observed,Estimated
1.0,11.0,5.33187
5.0,13.0,10.65842
10.0,14.0,13.49799
25.0,17.5,18.24277
50.0,23.0,23.51457
75.0,29.0,28.78637
90.0,34.4,33.53116
95.0,37.2,36.37072
99.0,44.0,41.69727

Variable: displacement

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,193.425879,Sum Observations,76983.5
Std Deviation,104.269838,Variance,10872.1992
Skewness,0.71964516,Kurtosis,-0.7465966
Uncorrected SS,19206864.3,Corrected SS,4316263.06
Coeff Variation,53.9068704,Std Error Mean,5.22657472

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,193.4259,Std Deviation,104.26984
Median,148.5,Variance,10872.0
Mode,97.0,Range,387.0
,,Interquartile Range,158.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,37.00815,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,455.0
99%,454.0
95%,400.0
90%,350.0
75% Q3,262.0
50% Median,148.5
25% Q1,104.0
10%,90.0
5%,85.0
1%,70.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
68,118,440,95
70,335,454,7
70,112,455,9
70,72,455,14
71,132,455,96

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,193.4259
Std Dev,Sigma,104.2698

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.1830796,Pr > D,<0.010
Cramer-von Mises,W-Sq,3.0834214,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,17.8988361,Pr > A-Sq,<0.005

Quantiles for Normal Distribution,Quantiles for Normal Distribution,Quantiles for Normal Distribution
Percent,Quantile,Quantile
Percent,Observed,Estimated
1.0,70.0,-49.142
5.0,85.0,21.9173
10.0,90.0,59.7987
25.0,104.0,123.0969
50.0,148.5,193.4259
75.0,262.0,263.7548
90.0,350.0,327.0531
95.0,400.0,364.9345
99.0,454.0,435.9938

Variable: horsepower

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,104.30402,Sum Observations,41513.0
Std Deviation,38.2226249,Variance,1460.96905
Skewness,1.10622429,Kurtosis,0.76358327
Uncorrected SS,4909977.5,Corrected SS,580004.714
Coeff Variation,36.6453995,Std Error Mean,1.91592706

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,104.304,Std Deviation,38.22262
Median,93.5,Variance,1461.0
Mode,150.0,Range,184.0
,,Interquartile Range,49.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,54.4405,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,230.0
99%,225.0
95%,180.0
90%,158.0
75% Q3,125.0
50% Median,93.5
25% Q1,76.0
10%,67.0
5%,60.0
1%,48.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
46,103,220,7
46,20,225,9
48,327,225,14
48,326,225,96
48,245,230,117

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,104.304
Std Dev,Sigma,38.22262

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.1679433,Pr > D,<0.010
Cramer-von Mises,W-Sq,2.4379162,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,13.1432211,Pr > A-Sq,<0.005

Quantiles for Normal Distribution,Quantiles for Normal Distribution,Quantiles for Normal Distribution
Percent,Quantile,Quantile
Percent,Observed,Estimated
1.0,48.0,15.3849
5.0,60.0,41.4334
10.0,67.0,55.3198
25.0,76.0,78.5233
50.0,93.5,104.304
75.0,125.0,130.0848
90.0,158.0,153.2883
95.0,180.0,167.1746
99.0,225.0,193.2231

Variable: weight

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,2970.42462,Sum Observations,1182229.0
Std Deviation,846.841774,Variance,717140.991
Skewness,0.53106251,Kurtosis,-0.7855289
Uncorrected SS,3796427105.0,Corrected SS,284704973.0
Coeff Variation,28.5091151,Std Error Mean,42.4483425

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,2970.425,Std Deviation,846.84177
Median,2803.5,Variance,717141.0
Mode,1985.0,Range,3527.0
,,Interquartile Range,1386.0

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,69.9774,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,5140.0
99%,4952.0
95%,4464.0
90%,4278.0
75% Q3,3609.0
50% Median,2803.5
25% Q1,2223.0
10%,1985.0
5%,1915.0
1%,1760.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
1613,55,4951,96
1649,145,4952,91
1755,344,4955,43
1760,346,4997,104
1773,54,5140,45

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,2970.425
Std Dev,Sigma,846.8418

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.09343449,Pr > D,<0.010
Cramer-von Mises,W-Sq,1.13653228,Pr > W-Sq,<0.005
Anderson-Darling,A-Sq,7.3015623,Pr > A-Sq,<0.005

Quantiles for Normal Distribution,Quantiles for Normal Distribution,Quantiles for Normal Distribution
Percent,Quantile,Quantile
Percent,Observed,Estimated
1.0,1760.0,1000.38
5.0,1915.0,1577.49
10.0,1985.0,1885.15
25.0,2223.0,2399.24
50.0,2803.5,2970.42
75.0,3609.0,3541.61
90.0,4278.0,4055.7
95.0,4464.0,4363.36
99.0,4952.0,4940.47

Variable: acceleration

Moments,Moments.1,Moments.2,Moments.3
N,398.0,Sum Weights,398.0
Mean,15.5680905,Sum Observations,6196.1
Std Deviation,2.75768893,Variance,7.60484823
Skewness,0.27877684,Kurtosis,0.41949688
Uncorrected SS,99480.57,Corrected SS,3019.12475
Coeff Variation,17.7137263,Std Error Mean,0.13823046

Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures,Basic Statistical Measures
Location,Location.1,Variability,Variability.1
Mean,15.56809,Std Deviation,2.75769
Median,15.5,Variance,7.60485
Mode,14.5,Range,16.8
,,Interquartile Range,3.4

Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0,Tests for Location: Mu0=0
Test,Statistic,Statistic.1,p Value,p Value.1
Student's t,t,112.6242,Pr > |t|,<.0001
Sign,M,199.0,Pr >= |M|,<.0001
Signed Rank,S,39700.5,Pr >= |S|,<.0001

Quantiles (Definition 5),Quantiles (Definition 5)
Level,Quantile
100% Max,24.8
99%,23.5
95%,20.5
90%,19.0
75% Q3,17.2
50% Median,15.5
25% Q1,13.8
10%,12.0
5%,11.2
1%,9.0

Extreme Observations,Extreme Observations,Extreme Observations,Extreme Observations
Lowest,Lowest,Highest,Highest
Value,Obs,Value,Obs
8.0,12,22.2,301
8.5,10,23.5,60
8.5,8,23.7,327
9.0,7,24.6,395
9.5,117,24.8,300

Parameters for Normal Distribution,Parameters for Normal Distribution,Parameters for Normal Distribution
Parameter,Symbol,Estimate
Mean,Mu,15.56809
Std Dev,Sigma,2.757689

Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution,Goodness-of-Fit Tests for Normal Distribution
Test,Statistic,Statistic.1,p Value,p Value.1
Kolmogorov-Smirnov,D,0.05083745,Pr > D,0.014
Cramer-von Mises,W-Sq,0.14226261,Pr > W-Sq,0.031
Anderson-Darling,A-Sq,0.80907962,Pr > A-Sq,0.038

Quantiles for Normal Distribution,Quantiles for Normal Distribution,Quantiles for Normal Distribution
Percent,Quantile,Quantile
Percent,Observed,Estimated
1.0,9.0,9.15275
5.0,11.2,11.0321
10.0,12.0,12.03397
25.0,13.8,13.70806
50.0,15.5,15.56809
75.0,17.2,17.42812
90.0,19.0,19.10221
95.0,20.5,20.10409
99.0,23.5,21.98343
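
All goodness-of-fit p-values above are below 0.05, so normality is rejected for each of the five variables at that level, with acceleration being the closest to normal. As a hedged complement, the `NORMAL` option of `PROC UNIVARIATE` adds the Shapiro-Wilk test, and `ODS SELECT` limits the output to the normality table. This is a sketch only; no results are shown because it was not part of the original run.

```sas
/* Sketch: print only the normality tests (including Shapiro-Wilk) */
proc univariate data=auto.updated normal;
    var mpg displacement horsepower weight acceleration;
    ods select TestsForNormality;
run;
```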


## Check Correlation between Variables

In [20]:
*Pearson correlation captures linear association, while Spearman rank correlation captures monotonic (possibly non-linear) association;
proc corr data=auto.updated pearson spearman plots=matrix(nvar=all histogram);
run;

8 Variables:,mpg cylinders displacement horsepower weight acceleration model_year origin

Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics,Simple Statistics
Variable,N,Mean,Std Dev,Median,Minimum,Maximum
mpg,398,23.51457,7.81598,23.0,9.0,46.6
cylinders,398,5.45477,1.701,4.0,3.0,8.0
displacement,398,193.42588,104.26984,148.5,68.0,455.0
horsepower,398,104.30402,38.22262,93.5,46.0,230.0
weight,398,2970.0,846.84177,2804.0,1613.0,5140.0
acceleration,398,15.56809,2.75769,15.5,8.0,24.8
model_year,398,76.01005,3.69763,76.0,70.0,82.0
origin,398,1.57286,0.80205,1.0,1.0,3.0

"Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Pearson Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0"
Variable,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.00000,-0.77540 <.0001,-0.80420 <.0001,-0.77345 <.0001,-0.83174 <.0001,0.42029 <.0001,0.57927 <.0001,0.56345 <.0001
cylinders,-0.77540 <.0001,1.00000,0.95072 <.0001,0.84128 <.0001,0.89602 <.0001,-0.50542 <.0001,-0.34875 <.0001,-0.56254 <.0001
displacement,-0.80420 <.0001,0.95072 <.0001,1.00000,0.89578 <.0001,0.93282 <.0001,-0.54368 <.0001,-0.37016 <.0001,-0.60941 <.0001
horsepower,-0.77345 <.0001,0.84128 <.0001,0.89578 <.0001,1.00000,0.86244 <.0001,-0.68659 <.0001,-0.41373 <.0001,-0.45210 <.0001
weight,-0.83174 <.0001,0.89602 <.0001,0.93282 <.0001,0.86244 <.0001,1.00000,-0.41746 <.0001,-0.30656 <.0001,-0.58102 <.0001
acceleration,0.42029 <.0001,-0.50542 <.0001,-0.54368 <.0001,-0.68659 <.0001,-0.41746 <.0001,1.00000,0.28814 <.0001,0.20587 <.0001
model_year,0.57927 <.0001,-0.34875 <.0001,-0.37016 <.0001,-0.41373 <.0001,-0.30656 <.0001,0.28814 <.0001,1.00000,0.18066 0.0003
origin,0.56345 <.0001,-0.56254 <.0001,-0.60941 <.0001,-0.45210 <.0001,-0.58102 <.0001,0.20587 <.0001,0.18066 0.0003,1.00000

"Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0","Spearman Correlation Coefficients, N = 398 Prob > |r| under H0: Rho=0"
Variable,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin
mpg,1.00000,-0.82186 <.0001,-0.85569 <.0001,-0.84797 <.0001,-0.87495 <.0001,0.43868 <.0001,0.57347 <.0001,0.58069 <.0001
cylinders,-0.82186 <.0001,1.00000,0.91188 <.0001,0.81179 <.0001,0.87331 <.0001,-0.47419 <.0001,-0.33501 <.0001,-0.60455 <.0001
displacement,-0.85569 <.0001,0.91188 <.0001,1.00000,0.86994 <.0001,0.94599 <.0001,-0.49651 <.0001,-0.30526 <.0001,-0.70720 <.0001
horsepower,-0.84797 <.0001,0.81179 <.0001,0.86994 <.0001,1.00000,0.87215 <.0001,-0.65136 <.0001,-0.38566 <.0001,-0.50441 <.0001
weight,-0.87495 <.0001,0.87331 <.0001,0.94599 <.0001,0.87215 <.0001,1.00000,-0.40455 <.0001,-0.27701 <.0001,-0.62843 <.0001
acceleration,0.43868 <.0001,-0.47419 <.0001,-0.49651 <.0001,-0.65136 <.0001,-0.40455 <.0001,1.00000,0.27463 <.0001,0.22057 <.0001
model_year,0.57347 <.0001,-0.33501 <.0001,-0.30526 <.0001,-0.38566 <.0001,-0.27701 <.0001,0.27463 <.0001,1.00000,0.16655 0.0009
origin,0.58069 <.0001,-0.60455 <.0001,-0.70720 <.0001,-0.50441 <.0001,-0.62843 <.0001,0.22057 <.0001,0.16655 0.0009,1.00000
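
For mpg against displacement, horsepower and weight, the Spearman coefficients are larger in magnitude than the Pearson ones (for example -0.84797 against -0.77345 for horsepower). A monotonic but non-linear relationship tends to produce this pattern. The following pure-Python sketch shows the effect on made-up illustrative data (not taken from the Auto MPG file); `pearson`, `ranks` and `spearman` are hypothetical helper names, not part of the notebook.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative data only: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(f"Pearson  = {r_pearson:.4f}")
print(f"Spearman = {r_spearman:.4f}")
```

Because the ranks of `x` and `y` agree exactly, Spearman reaches 1 while Pearson stays below it. A similar gap in the tables above suggests checking the scatter plots for curvature before fitting a linear model.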


In [21]:
* Draw a scatter plot matrix to visualize the relationships between variables;
proc sgscatter data=auto.updated;
    matrix mpg cylinders displacement horsepower weight model_year acceleration/ diagonal=(histogram);
run;

## Feature Construction: Extracting name of the brand

In [None]:
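# Note: this cell is pandas (Python) code, not SAS. It assumes `df` is a
# DataFrame holding the raw data with a 'car name' column.
# The brand is the first word of the car name.
df['brand'] = df['car name'].str.split().str[0]
df['brand'].unique()


# Some brands are abbreviated or misspelled, so map each variant to one common name.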

df['brand'] = df['brand'].replace(['volkswagen','vokswagen','vw'],'volkswagen')
df['brand'] = df['brand'].replace('maxda','mazda')
df['brand'] = df['brand'].replace('toyouta','toyota')
df['brand'] = df['brand'].replace('mercedes-benz','mercedes')
df['brand'] = df['brand'].replace('nissan','datsun')
df['brand'] = df['brand'].replace('capri','ford')
df['brand'] = df['brand'].replace(['chevroelt','chevy'],'chevrolet')

df['brand'].value_counts()
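
The cell above depends on a `df` that this notebook does not define. As a self-contained sketch of the same idea without pandas, the following uses an alias table built from the `replace()` calls above. `BRAND_ALIASES` and `extract_brand` are hypothetical names. The first two sample names come from the data listing at the top of the notebook; the last two are illustrative inputs chosen to exercise the aliases.

```python
# Alias table taken from the replace() calls in the cell above.
BRAND_ALIASES = {
    "vokswagen": "volkswagen",
    "vw": "volkswagen",
    "maxda": "mazda",
    "toyouta": "toyota",
    "mercedes-benz": "mercedes",
    "nissan": "datsun",
    "capri": "ford",
    "chevroelt": "chevrolet",
    "chevy": "chevrolet",
}


def extract_brand(car_name):
    """Return the first word of the car name, mapped to its canonical brand."""
    parts = car_name.split()
    if not parts:
        return ""  # empty or whitespace-only name
    token = parts[0].lower()
    return BRAND_ALIASES.get(token, token)


sample_names = [
    "chevrolet chevelle malibu",  # from the data listing
    "buick skylark 320",          # from the data listing
    "vw rabbit",                  # illustrative input
    "chevy nova",                 # illustrative input
]

brands = [extract_brand(name) for name in sample_names]
print(brands)
```

A single dictionary keeps all spelling fixes in one place, so a newly discovered variant is one added entry rather than another `replace()` call.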