# Mitchell Pudil
## 2016 NFL Rushing Statistics
## For Advanced R Programming Class
## Topics covered: Webscraping, regular expressions

In [2]:
install.packages('XML')
library(XML)

Installing package into ‘/srv/rlibs’
(as ‘lib’ is unspecified)


### Read in the first page of 2016 NFL rushing statistics. When sorted by decreasing average yards per game, there should be 50 players.

In [3]:
url <- "http://www.nfl.com/stats/categorystats?tabSeq=0&season=2016&seasonType=REG&experience=&Submit=Go&archive=true&conference=null&statisticCategory=RUSHING&d-447263-p=1/"
download.file(url,destfile = "nfl.html", method = "curl")
d <- readHTMLTable("nfl.html",header=TRUE,stringsAsFactors=FALSE)
e <- readHTMLTable("nfl.html",header=FALSE,stringsAsFactors=FALSE)[[1]]

### Changing column names and edit columns using regular expressions

In [4]:
colnames(e) <- colnames(data.frame(d))  #Changing column names
colnames(e) <- gsub("result", "", colnames(e))
colnames(e) <- gsub("\\.", "", colnames(e))
colnames(e) <- gsub("AttG", "Att/G", colnames(e))
colnames(e) <- gsub("YdsG", "Yds/G", colnames(e))
myTable <- e

Now we will create an indicator variable named "longRunTD" if the longest run (Lng) resulted in a TD.  This is indicated by a "T" at the end of the variable's value (For example, Jay Ajayi's long run of 62 yards was a TD, but Le'Veon Bell's long run of 44 was not a TD.)

In [5]:
mytable2 <- as.integer(myTable$Lng)
na.rows <- which(is.na(mytable2))
myTable$longRunTD <- 1
myTable$longRunTD[c(na.rows)] <- 0

“NAs introduced by coercion”

For all variables except "Player", "Team", and "Pos": Make sure to remove "T" and "," from the variable values and make sure the variable is treated by R as a numeric variable.

In [6]:
fix <- function(x) {
  x <- gsub("T","",x)
  x <- gsub(",","",x)
  as.numeric(x)
}

myTable$'Att/G' <- fix(myTable$'Att/G')
myTable$'Yds' <- fix(myTable$'Yds')
myTable$'Yds/G' <- fix(myTable$'Yds/G')
myTable$'Lng' <- fix(myTable$'Lng')

### How many QB's had a long run of at least 20 yards?

In [10]:
qb20yds <- dim(subset(myTable, myTable$Lng >= 20))[1]
print(noquote(paste(qb20yds, 'QBs had a long run of at least 20 yds.')))

[1] 43 QBs had a long run of at least 20 yds.


### How many RB's scored at least 5 rushing TDs?

In [9]:
rb <- subset(myTable, myTable$Pos=="RB")
touchrb <- subset(rb, TD >= 5)
rb5td <- dim(touchrb)[1]
print(noquote(paste(rb5td, 'RBs scored at least 5 rushing TDs.')))

[1] 17 RBs scored at least 5 rushing TDs.


### Among players all players with no fumbles, who scored the most rushing TDs?

In [11]:
nofumble <- subset(myTable, myTable$FUM==0)
maxtd <- which.max(nofumble$TD)
most <- nofumble[maxtd,2]  #Answer: Jeremy Hill 
print(noquote(paste(most, 'scored the most rushing TDs')))


[1] Jeremy Hill scored the most rushing TDs


#### Create the variable AbbrevName which contains the first initial (of the first name), a space, and then the entire last name of the player. For example, for Le'Veon Bell, the AbbrevName entry should be "L Bell"

In [18]:
for(i in 1:dim(myTable)[1]) {
  myTable$AbbrevName[i] <- paste(substr(myTable$Player[i], 1,1), tail(strsplit(myTable$Player[i],split=" ")[[1]],1), sep=" ")
}
head(myTable[,c(1:5,18)])

Rk,Player,Team,Pos,Att,AbbrevName
1,Ezekiel Elliott,DAL,RB,322,E Elliott
2,Le'Veon Bell,PIT,RB,261,L Bell
3,Jordan Howard,CHI,RB,252,J Howard
4,Andre Williams,SD,RB,18,A Williams
5,Jay Ajayi,MIA,RB,260,J Ajayi
6,LeSean McCoy,BUF,RB,234,L McCoy
