# Summary Statistics Lecture Practical

This notebook can be used to follow along with the lecture notes. 

In [22]:
url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank.zip'
download.file(url, destfile = 'bank.zip')
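unzip('bank.zip', list = TRUE) # list the archive contents without extracting
unzip('bank.zip')              # extract the files, including bank-full.csv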

## Data Description

### Numeric Data

1. age
2. balance
3. contact day
4. duration
5. campaign contacts
6. previous contacts
7. previous days (-1, no previous contact)


### Categorical Data

1. job (11) { 'admin', 'entrepreneur', ... }
2. marital (3) { 'divorced', 'married', 'single' }
3. education (3) { 'primary', 'secondary', ... }
4. default (2) { 'yes', 'no' }
5. housing loan (2) { 'yes', 'no' }
6. personal loan (2) { 'yes', 'no' }
7. contact type (2) { 'telephone', 'cellular' }
8. contact month (12) { 'jan', 'feb', ... }
9. previous outcome (3) { 'success', 'failure', ... }
10. subscribed (2) { 'yes', 'no' }


## Task 1: Import the data into R

Make sure you run the initial steps in the cell above to download and unzip the dataset. Then import the data into R. You may need to inspect the file manually to get an idea of its format (for example, which separator it uses). Save the dataframe in a variable called bankData.

In [68]:
bankData <- read.csv('bank-full.csv', sep=";", na.strings=c("","unknown","NA"))
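
One way to inspect the format without opening the file in an editor is to print its first few raw lines. The sketch below is only an illustration: it writes a small, made-up sample to a temporary file (`samplePath` and `sampleData` are hypothetical names) so that it runs without the download. For the real data, the same idea would be `readLines('bank-full.csv', n = 3)`.

```R
# Hypothetical three-line sample standing in for the real file
samplePath <- tempfile(fileext = ".csv")
writeLines(c("age;job;balance",
             "58;management;2143",
             "33;unknown;1"), samplePath)

# Print the raw text: the separator (here ';') is visible at a glance
rawLines <- readLines(samplePath, n = 3)
print(rawLines)

# Import with the separator found above, treating "unknown" as missing
sampleData <- read.csv(samplePath, sep = ";",
                       na.strings = c("", "unknown", "NA"))
str(sampleData)
```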

## Task 2: Basic Checks

Before we get to work on the dataframe, we want to perform some basic checks. First, get an idea of the size of the dataset by checking the number of columns and rows. Then check the first few and last few rows of the dataframe to make sure there were no problems with the import.

In [69]:
ncol(bankData)
nrow(bankData)
head(bankData)

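| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | 58 | management | married | tertiary | no | 2143 | yes | no | NA | 5 | may | 261 | 1 | -1 | 0 | NA | no |
| 2 | 44 | technician | single | secondary | no | 29 | yes | no | NA | 5 | may | 151 | 1 | -1 | 0 | NA | no |
| 3 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NA | 5 | may | 76 | 1 | -1 | 0 | NA | no |
| 4 | 47 | blue-collar | married | NA | no | 1506 | yes | no | NA | 5 | may | 92 | 1 | -1 | 0 | NA | no |
| 5 | 33 | NA | single | NA | no | 1 | no | no | NA | 5 | may | 198 | 1 | -1 | 0 | NA | no |
| 6 | 35 | management | married | tertiary | no | 231 | yes | no | NA | 5 | may | 139 | 1 | -1 | 0 | NA | no |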
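
The task also asks for the last few rows, which `tail()` returns, and `dim()` reports rows and columns in a single call. A minimal sketch, using a made-up `toyData` frame as a stand-in for bankData so it runs on its own:

```R
# Hypothetical toy data frame standing in for bankData
toyData <- data.frame(age = c(58, 44, 33, 47, 33, 35, 28, 42),
                      balance = c(2143, 29, 2, 1506, 1, 231, 447, 2))

dim(toyData)          # number of rows, then number of columns
head(toyData, n = 3)  # first three rows
tail(toyData, n = 3)  # last three rows
```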


## Task 3: Type Conversion

By default, read.csv (in R 4.0 and later) reads text columns as characters, but all of our text columns should be factors. Manually convert columns 2, 3, 4 and 5 to factors below.

In [70]:
bankData[,2] <- as.factor(bankData[,2])
bankData[,3] <- as.factor(bankData[,3])
bankData[,4] <- as.factor(bankData[,4])
bankData[,5] <- as.factor(bankData[,5])
head(bankData)

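| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | 58 | management | married | tertiary | no | 2143 | yes | no | NA | 5 | may | 261 | 1 | -1 | 0 | NA | no |
| 2 | 44 | technician | single | secondary | no | 29 | yes | no | NA | 5 | may | 151 | 1 | -1 | 0 | NA | no |
| 3 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NA | 5 | may | 76 | 1 | -1 | 0 | NA | no |
| 4 | 47 | blue-collar | married | NA | no | 1506 | yes | no | NA | 5 | may | 92 | 1 | -1 | 0 | NA | no |
| 5 | 33 | NA | single | NA | no | 1 | no | no | NA | 5 | may | 198 | 1 | -1 | 0 | NA | no |
| 6 | 35 | management | married | tertiary | no | 231 | yes | no | NA | 5 | may | 139 | 1 | -1 | 0 | NA | no |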


Alternatively, use a loop to convert the same four columns:

In [71]:
for(factIndex in 2:5){
    bankData[,factIndex] <- as.factor(bankData[,factIndex])
}
head(bankData)

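| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | &lt;int&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;fct&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;int&gt; | &lt;chr&gt; | &lt;chr&gt; |
| 1 | 58 | management | married | tertiary | no | 2143 | yes | no | NA | 5 | may | 261 | 1 | -1 | 0 | NA | no |
| 2 | 44 | technician | single | secondary | no | 29 | yes | no | NA | 5 | may | 151 | 1 | -1 | 0 | NA | no |
| 3 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NA | 5 | may | 76 | 1 | -1 | 0 | NA | no |
| 4 | 47 | blue-collar | married | NA | no | 1506 | yes | no | NA | 5 | may | 92 | 1 | -1 | 0 | NA | no |
| 5 | 33 | NA | single | NA | no | 1 | no | no | NA | 5 | may | 198 | 1 | -1 | 0 | NA | no |
| 6 | 35 | management | married | tertiary | no | 231 | yes | no | NA | 5 | may | 139 | 1 | -1 | 0 | NA | no |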
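
The loop above only touches columns 2 to 5, so the remaining character columns stay as `<chr>`. One possible way to convert every character column without listing indices is sketched below. The `toyData` frame and the `charCols` name are made up for illustration.

```R
# Hypothetical toy data frame with a mix of column types
toyData <- data.frame(age = c(58, 44, 33),
                      job = c("management", "technician", "entrepreneur"),
                      marital = c("married", "single", "married"),
                      stringsAsFactors = FALSE)

# Flag the character columns, then convert each flagged column to a factor
charCols <- sapply(toyData, is.character)
toyData[charCols] <- lapply(toyData[charCols], as.factor)
str(toyData)
```

Numeric columns are left alone because `is.character()` returns FALSE for them.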


## Task 4: Finding NAs

1. How many NAs are there for the education attribute?
2. What percentage of the total is this?

In [72]:
sum(is.na(bankData$education)) # "unknown" was mapped to NA at import (na.strings), so this already counts them
bankData$education[bankData$education == "unknown"] <- NA # only needed if the file was imported without na.strings
qtyOfNAs <- sum(is.na(bankData$education))
qtyOfNAs # 1857 NAs
(qtyOfNAs * 100 / nrow(bankData)) # percentage of NAs
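
Because `is.na()` returns TRUE/FALSE values, its mean is the proportion of missing entries, which gives the percentage in one step. A small sketch on made-up values (the `education` vector here is hypothetical, not the real column):

```R
# Hypothetical education values; "unknown" marks a missing answer
education <- c("primary", "unknown", "tertiary", "secondary", "unknown")
education[education == "unknown"] <- NA

naCount <- sum(is.na(education))           # number of NAs
naPercent <- mean(is.na(education)) * 100  # share of NAs as a percentage
naCount
naPercent
```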

## Finding NAs in all columns

See the lecture notes for how to do this. Get a list of all columns containing at least one NA and find the count for each.

In [99]:

# Earlier attempt using which(), kept for reference:
# colsAndRowsWithNA <- which(is.na(bankData), arr.ind = TRUE)
# colsWithNA <- unique(colsAndRowsWithNA[, 2])
# for (col in colsWithNA) {
#     cat("Num NA rows for col", col, "is:", sum(colsAndRowsWithNA[, 2] == col), "\n")
# }

# Simpler: count the NAs per column, then express them as a percentage of all rows
allMissing <- is.na(bankData)
counts <- colSums(allMissing)
print(round(counts / nrow(bankData) * 100, 2))

      age       job   marital education   default   balance   housing      loan 
     0.00      0.64      0.00      4.11      0.00      0.00      0.00      0.00 
  contact       day     month  duration  campaign     pdays  previous  poutcome 
    28.80      0.00      0.00      0.00      0.00      0.00      0.00     81.75 
        y 
     0.00 
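
The output above lists every column as a percentage. To get only the columns with at least one NA, together with their raw counts, the per-column counts can be filtered. This is a sketch on a made-up `toyData` frame, not the real dataset:

```R
# Hypothetical toy data frame with NAs in two of its three columns
toyData <- data.frame(age = c(58, 44, 33, 47),
                      job = c("management", NA, "entrepreneur", NA),
                      contact = c(NA, "cellular", "cellular", "telephone"))

naCounts <- colSums(is.na(toyData))     # NA count for every column
colsWithNA <- naCounts[naCounts > 0]    # keep only columns that contain an NA
colsWithNA
names(colsWithNA)
```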


## Task 5: Finding the Average

What is the average value of the following columns?
1. age
2. duration
3. pdays

In [109]:
mean(bankData$age)
mean(bankData$duration)
mean(bankData$pdays)


## Fixing the pdays Column
See the lecture notes for how to do this. Implement option 1 and option 2 in the cells below.

In [None]:
bankData$pdays[bankData$pdays == -1] <- NA # replace the -1 placeholder (no previous contact) with NA
mean(bankData$pdays, na.rm = TRUE) # the mean is higher once the -1 placeholders are excluded
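
To see why the placeholder matters, compare the two means on a handful of made-up values. The numbers below are hypothetical and only illustrate the effect; they are not taken from the dataset.

```R
# Hypothetical pdays values: three customers never contacted (-1), two contacted
pdays <- c(-1, -1, -1, 90, 180)

rawMean <- mean(pdays)  # treats -1 as a real number of days

# Replace the placeholder with NA and ignore NAs when averaging
pdaysFixed <- replace(pdays, pdays == -1, NA)
fixedMean <- mean(pdaysFixed, na.rm = TRUE)

c(raw = rawMean, fixed = fixedMean)
```

The raw mean mixes a code for "no previous contact" with genuine day counts, so it does not describe either group.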

## Task 6 - Omitting Rows Containing NA
Create a copy of the dataset omitting any rows with NAs. How much data do we lose doing this? (express as a percentage of total rows)

In [115]:
cpBankData <- na.omit(bankData) # copy of the data without any rows containing NA
fullSetRows <- nrow(bankData)
naOmitSetRows <- nrow(cpBankData)
cat("Percentage of the dataset lost after NA omit: ", round(100 - ((naOmitSetRows * 100) / fullSetRows), 2))


Percentage of the dataset lost after NA omit:  82.65
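
An equivalent route is `complete.cases()`, which returns one TRUE/FALSE per row, so the share of rows lost falls out of its mean. A minimal sketch on a made-up `toyData` frame (all names here are hypothetical):

```R
# Hypothetical toy data frame: only the third row has no NA
toyData <- data.frame(age = c(58, 44, 33, 47),
                      job = c("management", NA, "entrepreneur", "blue-collar"),
                      pdays = c(NA, 90, 180, NA))

keep <- complete.cases(toyData)                  # TRUE for rows without any NA
completeRows <- toyData[keep, ]                  # same rows that na.omit() keeps
percentLost <- round(100 * (1 - mean(keep)), 2)  # share of rows dropped
percentLost
```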

## Summary Statistics

### Task 7 - Calculating Quantiles

1. Use the first cell to write a script to find the tenth percentile of the following data.

```R
c(14, 15, 9, 14, 8, 9, 6, 7, 8, 12)
```

2. Then extend the script to find the 30th, 50th and 75th percentiles.
3. In the second cell, use the R **quantile()** function to calculate each of the percentiles above.

In [126]:
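arr <- sort(c(14, 15, 9, 14, 8, 9, 6, 7, 8, 12))
arrLen <- length(arr)
# Nearest-rank convention: the p-th percentile is the value at position ceiling(p / 100 * n)
arr[ceiling(10 * arrLen / 100)] # 10th percentile
arr[ceiling(30 * arrLen / 100)] # 30th percentile
arr[ceiling(50 * arrLen / 100)] # 50th percentile (median under this convention)
arr[ceiling(75 * arrLen / 100)] # 75th percentile; ceiling() is needed because 7.5 is not a whole index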

In [125]:
quantile(arr, 0.1)
quantile(arr, 0.3)
quantile(arr, 0.5)
quantile(arr, 0.75)
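
The results of `quantile()` need not match a script that picks an element by position, because the default (`type = 7`) interpolates linearly between neighbouring values. The sketch below compares the default with `type = 1`, which always returns an observed value. The variable names `interpolated` and `observed` are introduced here for illustration.

```R
arr <- sort(c(14, 15, 9, 14, 8, 9, 6, 7, 8, 12))
probs <- c(0.1, 0.3, 0.5, 0.75)

# Default (type = 7): linear interpolation between order statistics
interpolated <- quantile(arr, probs)

# type = 1: inverse of the empirical CDF, always an element of arr
observed <- quantile(arr, probs, type = 1)

interpolated
observed
```

Neither answer is wrong; they follow different definitions of a sample quantile, so it is worth stating which one is in use.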

Get the 0th, 25th, 50th, 75th and 100th percentiles of the *age* feature using the fivenum function.

In [128]:
fivenum(bankData$age)

Use the summary function to get the mean as well as the fivenum quantiles of the bankData age feature.
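
A sketch of what this looks like on a small made-up vector (the `age` values below are hypothetical). Note that `fivenum()` reports Tukey's hinges while `summary()` uses the default `quantile()` method, so the quartiles can differ slightly on small samples.

```R
# Hypothetical ages standing in for bankData$age
age <- c(58, 44, 33, 47, 33, 35, 28, 42)

ageFive <- fivenum(age)     # minimum, lower hinge, median, upper hinge, maximum
ageSummary <- summary(age)  # same idea, plus the mean

ageFive
ageSummary
```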

Get the counts of each level of the *education* feature.
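
One way to count the levels of a factor is `table()`. By default it drops NAs, so the sketch below also shows the `useNA` argument. The `education` values are made up for illustration.

```R
# Hypothetical education factor with one missing value
education <- factor(c("primary", "tertiary", NA, "secondary", "tertiary"))

eduCounts <- table(education)                       # NAs are left out
eduCountsNA <- table(education, useNA = "ifany")    # adds an <NA> count if any exist

eduCounts
eduCountsNA
```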

### Task 8 - Calculate the Trimmed Mean

1. Calculate the mean and median of the *balance* feature of our bankData dataframe
2. Calculate the trimmed mean with T = 0.1
3. Show that the trimmed mean is more representative of the centre than the raw mean 

In [138]:
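# By hand (approximation): keep only the values between the 10th and 90th percentiles
sortedBalance <- sort(bankData$balance)
qt10 <- quantile(sortedBalance, 0.1)
qt90 <- quantile(sortedBalance, 0.9)
qt10[[1]]
qt90[[1]]
trimmedBalance <- sortedBalance[sortedBalance >= qt10[[1]] & sortedBalance <= qt90[[1]]]
mean(bankData$balance)             # raw mean
median(bankData$balance)           # median, for comparison
mean(trimmedBalance)               # by-hand trimmed mean
mean(bankData$balance, trim = 0.1) # built-in trimmed mean: 767.212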
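
R's `mean(x, trim = 0.1)` drops `floor(n * 0.1)` observations from each end of the sorted data rather than cutting at quantile values. The sketch below reproduces that rule by hand on made-up balances and compares the three measures of centre. All values and names are hypothetical.

```R
# Hypothetical balances with one very large outlier
balance <- c(-200, 0, 50, 120, 300, 450, 800, 1500, 4000, 60000)
trim <- 0.1

n <- length(balance)
k <- floor(n * trim)                 # observations to drop at each end
sortedBalance <- sort(balance)
manualTrimmed <- mean(sortedBalance[(k + 1):(n - k)])
builtinTrimmed <- mean(balance, trim = trim)

c(mean = mean(balance), median = median(balance),
  manual = manualTrimmed, builtin = builtinTrimmed)
```

On this toy data the trimmed mean sits much closer to the median than the raw mean does, which is one way to argue that it is more representative of the centre.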

### Measures of Spread

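The range and the inter-quartile range (IQR) both estimate the spread of a column as the difference between two of its order statistics: the range is the maximum minus the minimum, while the IQR is the third quartile minus the first quartile.

The standard deviation is a measure of the average distance of each column value from the mean. For a sample of $n$ values $x_i$ with mean $\bar{x}$, the sample form (the one R's `sd()` computes) is:

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

The **variance** is simply the square of the standard deviation:

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$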
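
These measures can be computed with base R functions. The sketch below uses the small vector from Task 7 and also checks `sd()` against a by-hand calculation with the sample (n - 1) denominator. `manualSd` is a name introduced here for illustration.

```R
x <- c(14, 15, 9, 14, 8, 9, 6, 7, 8, 12)

xRange <- diff(range(x))  # maximum minus minimum
xIQR <- IQR(x)            # third quartile minus first quartile
xSd <- sd(x)              # sample standard deviation
xVar <- var(x)            # sample variance

# By-hand standard deviation using the n - 1 denominator
manualSd <- sqrt(sum((x - mean(x))^2) / (length(x) - 1))

c(range = xRange, IQR = xIQR, sd = xSd, var = xVar, manualSd = manualSd)
```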