Let's begin with statistics mainly for univariate data analysis, covering some basic concepts like descriptive and inferential statistics and distributions. 

## Loading data

In [2]:
auto_mpg <- read.csv("datasets/auto-mpg.csv")

In [3]:
head(auto_mpg)

mpg,cylinders,displacement,horsepower,weight,acceleration,model.year,origin,car.name
18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11.0,70,1,plymouth satellite
16,8,304,150,3433,12.0,70,1,amc rebel sst
17,8,302,140,3449,10.5,70,1,ford torino
15,8,429,198,4341,10.0,70,1,ford galaxie 500


In [4]:
names(auto_mpg)

In [5]:
column_names <- names(auto_mpg)

In [6]:
column_names

In [7]:
# modify the names of the columns of auto_mpg dataset by assigning new names
names(auto_mpg) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i")

In [8]:
names(auto_mpg)

### Changing a specific column name

In [9]:
names(auto_mpg)[1] <- "z"

In [10]:
names(auto_mpg)

In [11]:
names(auto_mpg) <- column_names

In [12]:
names(auto_mpg)
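
Renaming by position works, but it breaks silently if the column order changes. As a minimal sketch, a column can also be renamed by matching its current name. The `cars_demo` data frame and the new name `miles_per_gallon` are made up for illustration.

```r
# Toy data frame standing in for auto_mpg (hypothetical, two columns only)
cars_demo <- data.frame(mpg = c(18, 15, 18), cylinders = c(8, 8, 8))

# Select the column whose current name is "mpg" and assign it a new name
names(cars_demo)[names(cars_demo) == "mpg"] <- "miles_per_gallon"

names(cars_demo)
```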

The summary() command gives a summary of each variable in the data frame. As shown below, the command is very informative. For numeric variables it reports the minimum, 1st quartile, median (2nd quartile), mean, 3rd quartile, and maximum. If a variable has NA values, the number of NA values is displayed too. For factor variables it shows counts of the most frequent levels instead. You can use this output to quickly tell whether a variable is being treated as qualitative (categorical) or quantitative (numeric). For example, in the auto_mpg dataset mpg, displacement, weight, and acceleration are continuous, while cylinders, model.year, and origin are stored as integers and take only a few distinct values.

In [13]:
summary(auto_mpg)

      mpg          cylinders      displacement     horsepower      weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22   Min.   :1613  
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20   1st Qu.:2224  
 Median :23.00   Median :4.000   Median :148.5   88     : 19   Median :2804  
 Mean   :23.51   Mean   :5.455   Mean   :193.4   110    : 18   Mean   :2970  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17   3rd Qu.:3608  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14   Max.   :5140  
                                                 (Other):288                 
  acceleration     model.year        origin                car.name  
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   ford pinto    :  6  
 1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000   amc matador   :  5  
 Median :15.50   Median :76.00   Median :1.000   ford maverick :  5  
 Mean   :15.57   Mean   :76.01   Mean   :1.573   toyota corolla:  5  
 3rd Qu.:17.18   3rd Qu.:7

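Horsepower has no min, max, mean, etc. because its values were read as strings and stored as a factor. The column contains a non-numeric "?" entry, which is visible as the first factor level in the str() output below.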

In [14]:
str(auto_mpg)

'data.frame':	398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : Factor w/ 94 levels "?    ","100",..: 17 35 29 29 24 42 47 46 48 40 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car.name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...


Horsepower is stored as a factor. Factors represent categorical variables and are not treated as continuous, so the summary() function does not calculate a mean, median, min/max, etc. for them. str() tells you the data type of each variable and the dimensions of the data frame, and it gives an overview of the kind of values each variable contains.

### Column access

In [15]:
summary(auto_mpg$weight)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1613    2224    2804    2970    3608    5140 

## Descriptive Statistics
Descriptive statistics are used to summarize and describe data. If we are analyzing miles per gallon data, for example, a descriptive statistic might be the percentage of cars with each number of cylinders or the average miles per gallon for all cars. Several descriptive statistics are often used together to give a full picture of the data.

There are mainly 2 categories of descriptive statistics: measures of central tendency (or averages) and measures of dispersion (which summarize how spread out the data points are). A variable can have many observations (data points or values), and a small set of numbers that describes those observations, such as the output of the summary() command, is a set of descriptive statistics.

There are 3 important measures of central tendency used to summarize data: the mean, the median, and the mode. When we talk about the mean, we'll be referring to the arithmetic mean, as opposed to other means such as the geometric mean or the harmonic mean, which are not used as frequently. The mean of a set of data is simply the sum of the observations divided by the total number of observations.

The median of a set of ordered observations is the middle value that divides the data into 2 parts, with half of the data points at or below it and the other half at or above it.

The mean is influenced to a greater extent by extreme observations. So, if you notice extreme observations in your data, then perhaps a median is a better summary of data than a mean. Income and price data generally follow this pattern, which is why census organizations report median incomes and median prices. 
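
To make the effect of an extreme observation concrete, here is a small sketch with made-up income values (in thousands). The numbers are hypothetical and not taken from any dataset.

```r
# Hypothetical incomes, in thousands
incomes <- c(30, 35, 40, 45, 50)

# The same incomes plus one extreme observation
incomes_outlier <- c(incomes, 1000)

mean(incomes)            # 40
median(incomes)          # 40

mean(incomes_outlier)    # 200: pulled far upward by a single value
median(incomes_outlier)  # 42.5: barely moves
```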

Descriptive statistics are just descriptive. They cannot generalize anything beyond the data at hand. Generalizing from our data to another set of cases is dealt with in inferential statistics. R has built-in functions to calculate the mean, median, and standard deviation and other descriptive statistics. 

In [16]:
head(auto_mpg)

mpg,cylinders,displacement,horsepower,weight,acceleration,model.year,origin,car.name
18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11.0,70,1,plymouth satellite
16,8,304,150,3433,12.0,70,1,amc rebel sst
17,8,302,140,3449,10.5,70,1,ford torino
15,8,429,198,4341,10.0,70,1,ford galaxie 500


### Mean
The mean() function gives the average value of the column. The mean is the most basic statistic to help you understand the distribution of observations (data points) of a variable.

    Mean = Sum of all Observations / No. of Observations

In [17]:
paste("Average auto displacement is:", mean(auto_mpg$displacement))

In [18]:
mean(auto_mpg$displacement)
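
As a quick check of the formula, the mean can be computed by hand and compared with mean(). This sketch uses only the six displacement values visible in the head() output, so the result differs from the mean of the full column.

```r
# First six displacement values from the head() output
displacement_head <- c(307, 350, 318, 304, 302, 429)

# Sum of all observations divided by the number of observations
manual_mean <- sum(displacement_head) / length(displacement_head)

manual_mean              # 335
mean(displacement_head)  # 335
```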

### Mode
Mode is the value that has been repeated most frequently in a set of values and is especially useful when dealing with discrete variables. R does not have any built-in function to compute Mode as you would expect. Instead, the mode() function returns the type or storage mode of the object.

In [19]:
paste("Datatype of mpg is:", mode(auto_mpg$mpg))

However, you can use a command like the one below to find the most frequently occurring value. The table() command gives the count of each distinct value of a variable, so the command below prints how often each value occurs in the dataset.

In [20]:
table(auto_mpg$mpg)


   9   10   11   12   13   14 14.5   15 15.5   16 16.2 16.5 16.9   17 17.5 17.6 
   1    2    4    6   20   19    1   16    5   13    1    3    1    7    5    2 
17.7   18 18.1 18.2 18.5 18.6   19 19.1 19.2 19.4 19.8 19.9   20 20.2 20.3 20.5 
   1   17    2    1    3    1   12    1    3    2    1    1    9    4    1    3 
20.6 20.8   21 21.1 21.5 21.6   22 22.3 22.4 22.5   23 23.2 23.5 23.6 23.7 23.8 
   2    1    8    1    3    1   10    1    1    1   10    1    1    1    1    1 
23.9   24 24.2 24.3 24.5   25 25.1 25.4 25.5 25.8   26 26.4 26.5 26.6 26.8   27 
   2   11    1    1    2   11    1    2    2    1   14    1    1    2    1    9 
27.2 27.4 27.5 27.9   28 28.1 28.4 28.8   29 29.5 29.8 29.9   30 30.5 30.7 30.9 
   3    1    1    1   10    1    1    1    8    2    2    1    7    2    1    1 
  31 31.3 31.5 31.6 31.8 31.9   32 32.1 32.2 32.3 32.4 32.7 32.8 32.9   33 33.5 
   7    1    2    1    1    1    6    1    1    1    2    1    1    1    3    3 
33.7 33.8   34 34.1 34.2 34

**Note:** The output above wraps across several lines in the display. Each pair of rows lists the values on the first row and their counts on the second:
```
Value->   9   10   11   12   13   14 14.5   15 15.5   16 16.2 16.5 16.9   17 17.5 17.6 
Count->   1    2    4    6   20   19    1   16    5   13    1    3    1    7    5    2 

Value->   17.7   18 18.1 18.2 18.5 18.6   19 19.1 19.2 19.4 19.8 19.9   20 20.2 20.3 20.5
Count->      1   17    2    1    3    1   12    1    3    2    1    1    9    4    1    3

...
```

We see that 13 MPG occurs 20 times (13 is at position 5 in the table). table() is effectively computing the frequency of each distinct value in the column. This tells us that 13 is the most commonly occurring mileage (mpg) among the vehicles: of the 398 vehicles, 20 get 13 miles per gallon.

Then we can use which.max() to ask: which index holds the greatest value?

In [21]:
which.max(table(auto_mpg$mpg))

In [22]:
paste("The mode using which.max():", names(which.max(table(auto_mpg$mpg))))
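
One caveat: which.max() returns only the first maximum, so ties are hidden. A minimal sketch of a helper that returns every tied mode is shown below. The name `all_modes` is made up for illustration and is not a built-in R function.

```r
# Hypothetical helper: return every value that attains the highest count
all_modes <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# First six mpg values from the head() output: 15 and 18 both occur twice
mpg_head <- c(18, 15, 18, 16, 17, 15)

all_modes(mpg_head)                # "15" "18"
names(which.max(table(mpg_head)))  # "15" (first maximum only)
```

Like names(which.max(...)), the helper returns the values as character strings because they come from the names of the table.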

### Median 
The median value divides the dataset into 2 equal halves. One half lies to the left of the median and the other to the right. The median is less affected than the mean by outliers (extreme values). Therefore, the median is usually a better choice for measuring central tendency when the data is skewed or contains outliers.

In [23]:
median(auto_mpg$acceleration)

In [24]:
paste("median:", median(auto_mpg$acceleration))

Of the 398 observations in the dataset, at least half have acceleration less than or equal to 15.5 and at least half have acceleration greater than or equal to 15.5.
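
The following sketch shows how the median is found by hand: sort the values and take the middle one, or the average of the two middle ones when the count is even. It uses the six acceleration values from the head() output rather than the full column.

```r
# First six acceleration values from the head() output
accel <- c(12.0, 11.5, 11.0, 12.0, 10.5, 10.0)

sorted <- sort(accel)
n <- length(sorted)

# Odd n: the middle value. Even n: the average of the two middle values.
manual_median <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]
} else {
  mean(sorted[c(n / 2, n / 2 + 1)])
}

manual_median  # 11.25
median(accel)  # 11.25
```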

### Range
The range() function returns the minimum and maximum values of a variable. The distance between them is a simple measure of spread.

In [25]:
range(auto_mpg$model.year)

So the model years of the cars in the dataset range from year 70 to year 82.
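
Since range() returns two numbers, the width of the range can be obtained with diff(). A small sketch, using a few made-up model years between 70 and 82:

```r
# Hypothetical sample of model years
years <- c(70, 72, 76, 82)

range(years)        # 70 82
diff(range(years))  # 12: the width of the range
```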

### Quantile
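The quantile() function returns the cut points that divide the sorted data into 4 parts containing roughly equal numbers of observations. The three cut points are the quartiles: Q1 (25th percentile), Q2 (50th percentile, the median), and Q3 (75th percentile). By default, quantile() also reports the minimum (0%) and the maximum (100%). Quartiles are easiest to understand alongside boxplots, which summarize the range (min and max), Q1, Q2, and Q3 of a variable.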

In [26]:
quantile(auto_mpg$displacement)

In [27]:
quantile(auto_mpg$model.year)

The command is very informative as it gives the min, max, 25th, 50th (median), and 75th percentiles of the variable. Quantiles are useful for describing the spread of a variable because they are less sensitive to outliers than measures such as the variance.
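
To illustrate that robustness, the sketch below compares the quartiles and the interquartile range (IQR, the distance Q3 - Q1) of a toy vector with and without an extreme value. The data is made up, and quantile() is used with its default settings.

```r
x <- 1:9                  # toy data
x_outlier <- c(1:8, 900)  # same data with the largest value replaced by an outlier

q <- quantile(x)              # 0%, 25%, 50%, 75%, 100%
q_out <- quantile(x_outlier)

q[c("25%", "50%", "75%")]      # 3 5 7
q_out[c("25%", "50%", "75%")]  # 3 5 7: the quartiles are unchanged

IQR(x)          # 4
IQR(x_outlier)  # 4

# The standard deviation, by contrast, grows sharply with the outlier
sd(x)
sd(x_outlier)
```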

### Variance
Variance measures how widely the values in the variable are spread around the mean. If the observations vary greatly from the variable mean, the variance will be big and vice-versa. 

In [28]:
var(auto_mpg$displacement)

In [29]:
paste("variance", var(auto_mpg$displacement))

The above value is the average squared deviation of the displacement values from their mean (R's var() divides by n - 1 rather than n). Variance is often hard to interpret as a measure of spread because its units are the square of the units of the original data. The standard deviation, which is the square root of the variance, is in the original units and gives a clearer idea of how the data is spread.
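
The calculation behind var() can be sketched by hand. The example uses the six displacement values from the head() output, so the result is not the variance of the full column.

```r
# First six displacement values from the head() output
displacement_head <- c(307, 350, 318, 304, 302, 429)

# Squared deviations from the mean, summed and divided by n - 1
deviations <- displacement_head - mean(displacement_head)
manual_var <- sum(deviations^2) / (length(displacement_head) - 1)

manual_var              # 2436.8
var(displacement_head)  # 2436.8
```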

### Standard deviation

In [30]:
sd(auto_mpg$displacement)

In [31]:
paste("standard deviation", sd(auto_mpg$displacement))

The values in the displacement variable have a standard deviation of 104.27. Recall that the mean was 193.42. Together, the mean and standard deviation can be used to model the distribution of a population.
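
One common way to use the mean and standard deviation together is to standardize values into z-scores, which express each observation as a number of standard deviations from the mean. A minimal sketch on the six displacement values from the head() output:

```r
# First six displacement values from the head() output
displacement_head <- c(307, 350, 318, 304, 302, 429)

# The standard deviation is the square root of the variance
sd_value <- sd(displacement_head)
sqrt(var(displacement_head))  # same value as sd_value

# z-score: distance from the mean in units of standard deviation
z <- (displacement_head - mean(displacement_head)) / sd_value

mean(z)  # approximately 0
sd(z)    # 1
```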

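### Max, Min, Median, and Median Absolute Deviation
We'll first do this using a for loop (verbose and slow). c() combines its arguments into a vector. Min, max, etc. cannot be calculated for factor data, so let's create a new data frame called numeric_data in which horsepower is a numeric variable. We'll also exclude the car name. Note that R's mad() function computes the median absolute deviation (scaled by a constant of 1.4826 by default), not the mean absolute deviation.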

In [32]:
numeric_data <- auto_mpg[-ncol(auto_mpg)] # take all columns except for the last column

In [33]:
head(numeric_data)

mpg,cylinders,displacement,horsepower,weight,acceleration,model.year,origin
18,8,307,130,3504,12.0,70,1
15,8,350,165,3693,11.5,70,1
18,8,318,150,3436,11.0,70,1
16,8,304,150,3433,12.0,70,1
17,8,302,140,3449,10.5,70,1
15,8,429,198,4341,10.0,70,1


In [34]:
numeric_data$horsepower <- as.numeric(numeric_data$horsepower) # caution: on a factor this returns the integer level codes, not the original horsepower values

In [35]:
str(numeric_data)

'data.frame':	398 obs. of  8 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : num  17 35 29 29 24 42 47 46 48 40 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
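
Note that the horsepower values in this str() output (17, 35, 29, ...) are the same integers that appeared as level codes when horsepower was a factor. They are not the original horsepower values (130, 165, 150, ...). Calling as.numeric() directly on a factor returns its internal codes. A minimal sketch of the usual workaround, converting to character first, is shown on a small made-up factor. The values are illustrative, and the "?" entry becomes NA.

```r
# Small stand-in for the horsepower column, including a "?" entry
hp <- factor(c("130", "165", "?", "150"))

# Direct conversion returns the integer level codes, not the values
codes <- as.numeric(hp)

# Converting through character recovers the values; "?" becomes NA
# (suppressWarnings hides the expected "NAs introduced by coercion" warning)
hp_values <- suppressWarnings(as.numeric(as.character(hp)))

hp_values                     # 130 165 NA 150
max(hp_values, na.rm = TRUE)  # 165
```

With NA values present, functions such as max(), min(), median(), and mad() need na.rm = TRUE to return a number.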


In [41]:
ncol(numeric_data) # number of columns

In [42]:
print(sprintf("%15s %10s %10s %10s %30s", "Column", "Maximum", "Minimum", "Median", "Median Absolute Deviation"))
for (i in 1:ncol(numeric_data)){
    # one formatted row per column; mad() is the median absolute deviation
    print(sprintf("%15s %10.1f %10.1f %10.1f %10.1f", 
                    names(numeric_data[i]), 
                    max(numeric_data[, i]),
                    min(numeric_data[, i]),
                    median(numeric_data[, i]),
                    mad(numeric_data[, i])
                 ))
}

[1] "         Column    Maximum    Minimum     Median      Median Absolute Deviation"
[1] "            mpg       46.6        9.0       23.0        8.9"
[1] "      cylinders        8.0        3.0        4.0        0.0"
[1] "   displacement      455.0       68.0      148.5       86.7"
[1] "     horsepower       94.0        1.0       60.5       35.6"
[1] "         weight     5140.0     1613.0     2803.5      945.2"
[1] "   acceleration       24.8        8.0       15.5        2.5"
[1] "     model.year       82.0       70.0       76.0        4.4"
[1] "         origin        3.0        1.0        1.0        0.0"
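
The last column comes from R's mad() function. As a sketch of what it computes by default: the median of the absolute deviations from the median, multiplied by the constant 1.4826. The example uses the six mpg values from the head() output.

```r
# First six mpg values from the head() output
mpg_head <- c(18, 15, 18, 16, 17, 15)

# Absolute deviations from the median
abs_dev <- abs(mpg_head - median(mpg_head))

# Default mad(): scaled median of those deviations
manual_mad <- 1.4826 * median(abs_dev)

manual_mad     # 2.2239
mad(mpg_head)  # 2.2239
```

The same relationship holds in the table above: the mpg entry of 8.9 corresponds to 1.4826 * 6.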


The above code works but is verbose. Let's do the same thing with apply(). The second argument selects the margin: 2 applies the function over the columns of the array, while 1 applies it over the rows.

In [44]:
cbind(Max = apply(numeric_data, 2, max), 
      Min = apply(numeric_data, 2, min),
      Median = apply(numeric_data, 2, median),
      Median_Absolute_Dev = apply(numeric_data, 2, mad)
     )

Column,Max,Min,Median,Median_Absolute_Dev
mpg,46.6,9,23.0,8.8956
cylinders,8.0,3,4.0,0.0
displacement,455.0,68,148.5,86.7321
horsepower,94.0,1,60.5,35.5824
weight,5140.0,1613,2803.5,945.1575
acceleration,24.8,8,15.5,2.52042
model.year,82.0,70,76.0,4.4478
origin,3.0,1,1.0,0.0
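
apply() first converts the data frame to a matrix, which turns every column into character if any column is non-numeric. As an alternative sketch, sapply() iterates over the columns of a data frame directly. The `demo` data frame below is a small stand-in built from the head() output, not the full dataset.

```r
# Small stand-in for numeric_data: two columns from the head() output
demo <- data.frame(
  mpg    = c(18, 15, 18, 16, 17, 15),
  weight = c(3504, 3693, 3436, 3433, 3449, 4341)
)

# One named vector of statistics per column; t() puts variables in rows
stats <- t(sapply(demo, function(col) {
  c(Max = max(col), Min = min(col), Median = median(col), MAD = mad(col))
}))

stats
```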


## Types of Variables
The most important distinction between variables is whether they are qualitative or quantitative. 

* Qualitative variables: Variables that express a qualitative attribute, such as religion, favorite movie, or gender. They are sometimes referred to as categorical or nominal variables. 

* Quantitative variables: Variables that are measured in terms of numbers. 

### Flavors of quantitative data:
* Discrete variables: Some measures in data are discrete and cannot be made more precise. For example, the number of children in a family is discrete because you are counting indivisible entities. You can't have 1.3 children. 
* Continuous variables: Data that can be measured at ever finer levels of precision. For example, you can measure your own weight to the nearest pound, gram, or milligram. 

### Levels of measurement:
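Both qualitative and quantitative variables can be classified by their level of measurement. There are 4 levels: nominal, ordinal, interval, and ratio. 
* Nominal: Categories with no inherent order. Car name, marital status, gender, religion, etc.
* Ordinal: Categories with a meaningful order but no fixed distance between them. Soldier rankings, student grades, etc. The number of cylinders in a car also has an order.
* Interval: Numeric scales where differences are meaningful but there is no true zero, so ratios are not meaningful. Temperature in Celsius or Fahrenheit, IQ, etc.
* Ratio: Numeric scales with a true zero, so ratios are meaningful. Daily calorie intake, weight, etc.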

http://onlinestatbook.com/2/introduction/levels_of_measurement.html
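
In R, nominal and ordinal variables are typically represented as factors and ordered factors. The sketch below uses a few values resembling the dataset's car names and cylinder counts. Treating cylinders as an ordered factor is one possible choice for illustration, not something the dataset requires.

```r
# Nominal: an unordered factor of car names
car_name <- factor(c("ford torino", "buick skylark 320", "ford torino"))
levels(car_name)   # the distinct categories
nlevels(car_name)  # 2

# Ordinal: an ordered factor, so comparisons between levels are meaningful
cylinders <- factor(c(8, 4, 6, 4), levels = c(3, 4, 5, 6, 8), ordered = TRUE)
cylinders[1] > cylinders[2]  # TRUE: 8 cylinders ranks above 4

# Counts per level, including levels that do not occur in the sample
table(cylinders)
```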

In [46]:
summary(auto_mpg)

      mpg          cylinders      displacement     horsepower      weight    
 Min.   : 9.00   Min.   :3.000   Min.   : 68.0   150    : 22   Min.   :1613  
 1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   90     : 20   1st Qu.:2224  
 Median :23.00   Median :4.000   Median :148.5   88     : 19   Median :2804  
 Mean   :23.51   Mean   :5.455   Mean   :193.4   110    : 18   Mean   :2970  
 3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   100    : 17   3rd Qu.:3608  
 Max.   :46.60   Max.   :8.000   Max.   :455.0   75     : 14   Max.   :5140  
                                                 (Other):288                 
  acceleration     model.year        origin                car.name  
 Min.   : 8.00   Min.   :70.00   Min.   :1.000   ford pinto    :  6  
 1st Qu.:13.82   1st Qu.:73.00   1st Qu.:1.000   amc matador   :  5  
 Median :15.50   Median :76.00   Median :1.000   ford maverick :  5  
 Mean   :15.57   Mean   :76.01   Mean   :1.573   toyota corolla:  5  
 3rd Qu.:17.18   3rd Qu.:7

In [47]:
str(auto_mpg)

'data.frame':	398 obs. of  9 variables:
 $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
 $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
 $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
 $ horsepower  : Factor w/ 94 levels "?    ","100",..: 17 35 29 29 24 42 47 46 48 40 ...
 $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
 $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ model.year  : int  70 70 70 70 70 70 70 70 70 70 ...
 $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ car.name    : Factor w/ 305 levels "amc ambassador brougham",..: 50 37 232 15 162 142 55 224 242 2 ...
