Results from current season in engsoccerdata #16

Closed
JoGall opened this issue Jan 12, 2017 · 9 comments

Comments

@JoGall
Contributor

JoGall commented Jan 12, 2017

Do you plan to add a function to obtain up-to-date results for the current season at any point?

It's something I regularly need, so I made this script to fetch results for the current season from http://football-data.co.uk and reformat them for use with engsoccerdata. The website has freely-available CSVs of historical results that are updated twice weekly (and also results from several other European leagues if you ever want to expand the data included with the package).

Feel free to use / adapt this if you can find a way to implement it in the package.

Thanks for all your work!

URLs <- c("http://www.football-data.co.uk/mmz4281/1617/E0.csv",
	"http://www.football-data.co.uk/mmz4281/1617/E1.csv",
	"http://www.football-data.co.uk/mmz4281/1617/E2.csv",
	"http://www.football-data.co.uk/mmz4281/1617/E3.csv")

convertFromFD <- function(df = NULL, tier = NULL) {
	data.frame("Date" = as.factor(as.Date(df$Date, "%d/%m/%y")),
		"Season" = 2016,
		"home" = df$HomeTeam,
		"visitor" = df$AwayTeam,
		"FT" = paste0(df$FTHG, "-", df$FTAG),
		"hgoal" = df$FTHG,
		"vgoal" = df$FTAG,
		"division" = tier,
		"tier" = tier,
		"totgoal" = df$FTHG + df$FTAG,
		"goaldif" = df$FTHG - df$FTAG,
		"result" = ifelse(df$FTHG > df$FTAG, "H",
			ifelse(df$FTHG < df$FTAG, "A", "D"))
	)
}

currentSeasonEng <- function() {
	rbind(convertFromFD(df = read.csv(URLs[1]), tier = 1), 
		convertFromFD(df = read.csv(URLs[2]), tier = 2),
		convertFromFD(df = read.csv(URLs[3]), tier = 3),
		convertFromFD(df = read.csv(URLs[4]), tier = 4)
	)
}

## Example: All EPL data including current season
#require(engsoccerdata)
# EPL <- rbind(subset(england, Season %in% 1992:2015 & tier == 1),
# 	subset(currentSeasonEng(), tier==1)
# )
@jalapic
Owner

jalapic commented Jan 12, 2017

I will try to implement this - it's a good idea and something that I frequently have to do. I have to make sure that the data are ok to take and include in a package. Also, the big issue is team names - my package may use slightly different versions of names for two or three teams. I have a dataframe of all possible versions of team names for each team, e.g. "Man Utd", "Manchester Utd", "Manchester United", "Man United", "Newton Heath", "Newton H", etc. That should allow us to ensure that the most recent season's data can be used with historical data by team.
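
The lookup idea described above could be sketched roughly as follows. This is only an illustration: the table `name_lookup`, its column names (`variant`, `canonical`) and the helper `standardise_names()` are all hypothetical and are not taken from the package's own teamnames dataframe.

```r
# Hypothetical lookup table: one row per known spelling of a team name.
# Column names are assumptions for illustration only.
name_lookup <- data.frame(
  variant   = c("Man Utd", "Manchester Utd", "Man United", "Newton Heath"),
  canonical = "Manchester United",
  stringsAsFactors = FALSE
)

# Replace every known variant with its canonical name;
# names that are not in the lookup are returned unchanged.
standardise_names <- function(x, lookup = name_lookup) {
  x <- as.character(x)
  idx <- match(x, lookup$variant)
  ifelse(is.na(idx), x, lookup$canonical[idx])
}

standardise_names(c("Man United", "Bristol Rvs", "Newton Heath"))
```

Names missing from the lookup (here "Bristol Rvs") pass through untouched, which makes it easy to spot the spellings that still need adding by hand.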

@JoGall
Contributor Author

JoGall commented Jan 12, 2017

Football-Data.co.uk claim on their website that data are "FREE" and that "You are free experiment with the data yourselves", but might be worth checking with them just in case. Hadn't thought about teamname variations so a separate dataframe is a good idea. Good luck with the implementation!

@jalapic
Owner

jalapic commented Jan 12, 2017

I altered the function a bit - there was a typo and I've stripped it down slightly. I think the best thing is to have one function that brings all the data in from England and then puts it into engsoccerdata format. This is what I have:

england_current <- function(){
  df <- rbind(read.csv("http://www.football-data.co.uk/mmz4281/1617/E0.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E1.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E2.csv"),
              read.csv("http://www.football-data.co.uk/mmz4281/1617/E3.csv")
  ) 
  return(
    data.frame("Date" = as.factor(as.Date(df$Date, "%d/%m/%y")),
               "Season" = 2016,
               "home" = df$HomeTeam,
               "visitor" = df$AwayTeam,
               "FT" = paste0(df$FTHG, "-", df$FTAG),
               "hgoal" = df$FTHG,
               "vgoal" = df$FTAG,
               # factor() maps E0..E3 to level codes 1..4; as.numeric() on plain strings would give NA
               "division" = as.numeric(factor(df$Div)),
               "tier" = as.numeric(factor(df$Div)),
               "totgoal" = df$FTHG + df$FTAG,
               "goaldif" = df$FTHG - df$FTAG,
               "result" = ifelse(df$FTHG > df$FTAG, "H",
                                 ifelse(df$FTHG < df$FTAG, "A", "D"))
    )
  )
}



england_current()

I will add this as a function to the package and leave it on GitHub. If I have time, I'd like to add this for the other leagues too.

As a note - if you're interested in collating data / helping, I have other leagues and competitions going all the way back to their origins, e.g. League Cup, French League - I just haven't had time to check and add them to the package yet.

@jalapic
Owner

jalapic commented Jan 12, 2017

oops.... forgot about the teamnames fix. ugh - that will take time. I notice a lot of inconsistencies with my data, e.g. Bristol Rvs - I'll have to manually add that to the teamname df. Might take me a day or two to get that together.

@JoGall
Contributor Author

JoGall commented Jan 13, 2017

I had a bit of spare time this morning so wrote code to fetch data for the five other leagues available on Football-Data.co.uk. Their data only goes back to '94/'95 but better than nothing. I've left the 'division' variable as a factor for now rather than numeric, e.g. Scotland's divisions are defined as SC0, SC1, etc...

I'm happy to help collate more data whenever I get a chance if you add them to the repo. Where did you obtain your data from by the way? It would be great to have European Cup fixtures too for completeness but can't find an archive of them anywhere.

## FUNs
##-------
#make season codes for URLs
makeSeasons <- function(start = NULL) {
	paste0(substr(start:2016, 3, 4), substr((start+1):2017, 3, 4))
}

#get CSVs
getCSVs <- function(x) {
	df <- read.csv(x)
	Sys.sleep(sample(seq(1, 2, by=0.001), 1))
	df$Season <- format(as.Date(df$Date[1], format="%d/%m/%y"), "%Y") #extract Season as year of first fixture
	df
}

#reformat to engsoccerdata
convertEngSoccerData <- function(df){
	return(data.frame(
		"Date" = as.factor(as.Date(df$Date, "%d/%m/%y")),
		"home" = df$HomeTeam,
		"visitor" = df$AwayTeam,
		"FT" = paste0(df$FTHG, "-", df$FTAG),
		"hgoal" = df$FTHG,
		"vgoal" = df$FTAG,
		"division" = df$Div,
		"tier" = as.numeric(df$Div),
		"totgoal" = df$FTHG + df$FTAG,
		"goaldif" = df$FTHG - df$FTAG,
		"result" = ifelse(df$FTHG > df$FTAG, "H",
		ifelse(df$FTHG < df$FTAG, "A", "D"))
	)
	)
}

## URLs
##------
#Scotland
sco_urls <- c(
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/SC0.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/SC1.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1997), "/SC2.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1997), "/SC3.csv")
)

#Belgium
bel_urls <- paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1995), "/B1.csv")

#Portugal
por_urls <- paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/P1.csv")

#Turkey
turk_urls <- paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/T1.csv")

#Greece
grc_urls <- paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/G1.csv")


## Example
##---------
library(magrittr) #provides the %>% pipe
scotland <- sco_urls %>%
	lapply(getCSVs) %>%
	lapply(convertEngSoccerData) %>%
	do.call(rbind.data.frame, .)
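
For reference, the season codes used in the URLs are just the last two digits of the start year followed by the last two digits of the next year. A small sketch of the same idea with an explicit end year; the name `make_seasons` and the `end` parameter are illustrative additions, not part of the script above:

```r
# Build football-data.co.uk style season codes, e.g. 1998 -> "9899".
# `end` is a hypothetical extra argument; the script above hard-codes 2016.
make_seasons <- function(start, end = 2016) {
  years <- start:end
  paste0(substr(as.character(years), 3, 4),
         substr(as.character(years + 1), 3, 4))
}

make_seasons(1998, 2000)
# "9899" "9900" "0001"
```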

@jalapic
Owner

jalapic commented Jan 13, 2017

@JoGall Thanks Joe. This is great. Will take a closer look at it - happy to add more leagues. I do have European Cup / Champions League data - it's in the champs dataframe that comes in the engsoccerdata package from CRAN.

The data come from everywhere - all open source. I believe there are a bunch in the ReadMe. I did notice when I collated this a few years ago that a lot of the online websites with soccer data had copied each other and there were a few errors that they made. Only about 0.1% of the data, but annoying nonetheless.

@jalapic
Owner

jalapic commented Jan 14, 2017

@JoGall Hi Joe - I had a look at importing the other leagues. On my first pass, the CSVs imported for the Greek league would not all convert to tidy data using the convert function. Also, the CSVs for other leagues sometimes contained NAs. I think these will work, but we'd have to check each file in turn before adding them to the package. Also, adding a "Season" variable to each CSV would be super useful - that would keep the data consistent with the other dataframes.
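
A per-file check along these lines might look like the sketch below. The helper `check_csv()` and the `mock` data are hypothetical; the required column names are the ones the conversion code in this thread relies on.

```r
# Columns the conversion step relies on (names as used in the code in this thread).
required_cols <- c("Div", "Date", "HomeTeam", "AwayTeam", "FTHG", "FTAG", "FTR")

# Hypothetical helper: list required columns that are absent and count
# rows with an NA in any of the required columns that are present.
check_csv <- function(df, required = required_cols) {
  missing_cols <- setdiff(required, names(df))
  present <- intersect(required, names(df))
  incomplete <- sum(!complete.cases(df[, present, drop = FALSE]))
  list(missing_cols = missing_cols, incomplete_rows = incomplete)
}

# Mock data standing in for a downloaded file: team columns are named
# differently and the second row has missing scores.
mock <- data.frame(
  Div = "G1", Date = c("14/08/94", "21/08/94"),
  HT = c("A", "B"), AT = c("C", "D"),
  FTHG = c(1, NA), FTAG = c(0, 2), FTR = c("H", NA),
  stringsAsFactors = FALSE
)

check_csv(mock)
```

Running this over each downloaded file before conversion would flag which files need manual attention, without stopping the whole import.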

@JoGall
Contributor Author

JoGall commented Jan 14, 2017

Ok, I've updated this and tested it properly now. There were some inconsistencies in the CSVs (e.g. some files had 'HT' instead of 'HomeTeam') and annoyingly the division names for some leagues are zero-indexed and some aren't, making it hard to parse 'tier' properly. The convert function seems to work now for all the leagues; I've added a boolean parameter to help create a tier number from the division data (e.g. when 'zeroIndexed' = TRUE, 'division' SC0 becomes 'tier' 1). Added a 'Season' variable too.

## FUNs
##-------
#make season codes for URLs
makeSeasons <- function(start = NULL) {
	paste0(substr(start:2016, 3, 4), substr((start+1):2017, 3, 4))
}

#get CSVs
getCSVs <- function(x) {
	#read the CSV, treating empty strings as NA
	df <- read.csv(x, na.strings = c("NA", ""))
	#keep the first seven columns and drop incomplete rows
	df <- na.omit(df[,1:7])
	#extract Season as year of first fixture
	df$Season <- format(as.Date(df$Date[1], format="%d/%m/%y"), "%Y")
	#change column names if required
	if(any(names(df)=="HT")) colnames(df)[which(names(df) == "HT")] <- "HomeTeam"
	if(any(names(df)=="AT")) colnames(df)[which(names(df) == "AT")] <- "AwayTeam"
	
	Sys.sleep(sample(seq(1, 2, by=0.001), 1))
	
	df
}

#reformat to engsoccerdata
convertToESD <- function(df, zeroIndexed = FALSE){

	dfx <- data.frame(
		"Date" = as.factor(as.Date(df$Date, "%d/%m/%y")),
		"Season" = as.numeric(as.character(df$Season)),
		"home" = df$HomeTeam,
		"visitor" = df$AwayTeam,
		"FT" = paste0(df$FTHG, "-", df$FTAG),
		"hgoal" = df$FTHG,
		"vgoal" = df$FTAG,
		"division" = df$Div,
		#strip non-digits from the division code, e.g. "SC0" -> 0
		"tier" = as.numeric(gsub("[^0-9]", "", as.character(df$Div))),
		"totgoal" = df$FTHG + df$FTAG,
		"goaldif" = df$FTHG - df$FTAG,
		"result" = df$FTR
	)
	
	if(zeroIndexed) dfx$tier <- dfx$tier + 1
	
	dfx
}

## Example: Scotland
##-------
urls <- c(
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/SC0.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1994), "/SC1.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1997), "/SC2.csv"),
	paste0("http://www.football-data.co.uk/mmz4281/", makeSeasons(1997), "/SC3.csv")
)

library(magrittr) #provides the %>% pipe
scotland <- urls %>%
	lapply(getCSVs) %>%
	lapply(function(x) convertToESD(x, zeroIndexed=TRUE)) %>%
	do.call(rbind.data.frame, .)
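
getCSVs() above takes the year of the first fixture in each file as the Season. A possible alternative, sketched here under the assumption that a season never starts before July, is to derive the season from each match date instead. The function name `season_from_date` and the `start_month` cut-off are illustrative choices, not something the data source specifies.

```r
# Derive the season (year in which it started) from each match date.
# Assumption: fixtures from `start_month` onwards belong to a new season.
season_from_date <- function(date, start_month = 7) {
  d <- as.Date(date, format = "%d/%m/%y")
  year <- as.numeric(format(d, "%Y"))
  month <- as.numeric(format(d, "%m"))
  ifelse(month >= start_month, year, year - 1)
}

season_from_date(c("13/08/16", "02/01/17", "21/05/17"))
# 2016 2016 2016
```

This does not depend on the rows being in date order within a file, at the cost of relying on the month cut-off being right for every league.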

@jalapic
Owner

jalapic commented Jan 22, 2017

Sorry for slow action on this - when adding I was running checks to ensure CRAN compatibility etc. which always take longer.

All these data for these seasons have been added. Thanks for your help. I'd love to get more data going further back, but this is a great addition.

I noted one error with the function - it assigns tier = 2 for Belgium/Portugal/Turkey/Greece rather than tier = 1. It does get the correct tier for Scotland. I've corrected that.
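
One way a tier of 2 could arise is if the zero-indexed adjustment is applied to division codes that already start at 1. The sketch below illustrates that with a hypothetical helper, `tier_from_div()`; it is not the fix that went into the package.

```r
# Hypothetical helper: extract the digit from a division code and
# add 1 only when the league's codes start at 0 (e.g. SC0, SC1, ...).
tier_from_div <- function(div, zero_indexed = FALSE) {
  tier <- as.numeric(gsub("[^0-9]", "", as.character(div)))
  if (zero_indexed) tier + 1 else tier
}

tier_from_div(c("SC0", "SC1"), zero_indexed = TRUE)  # 1 2
tier_from_div("B1")                                  # 1
tier_from_div("B1", zero_indexed = TRUE)             # 2, i.e. one tier too high
```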

I like to release high quality proofed and checked data like I have for England, Germany, Spain etc. However, realistically I don't have time to do that level of checking for all leagues. Therefore, I've added these "as is" and hopefully if people find errors or additions they can file issues and/or pull-requests. Also, if these leagues had playoff games, those aren't included just yet. I've added that as a thing to do in the ReadMe.

Also, all teamnames are added to the teamnames dataframe. Going forward if other Seasons are added to each league, we should pick a teamname for each team to stay with.

Finally, I don't know if teams changed team names between 1994-2016 in each league. If they did, I've not noted that in the teamnames- I'm assuming unique teamnames in the data are unique teams not just those who changed name. I'll let others find that out and let me know if I need to fix.

jalapic closed this as completed Jan 22, 2017