Create a team names dataframe to standardize variations #17

Closed
aqsmith08 opened this issue Jan 12, 2017 · 12 comments

@aqsmith08
Contributor

Hi @jalapic ,

I wanted to start an issue to track the team names dataframe work since it has come up a couple times. For example, you have it listed in the README:

Team Names. Consistency in team names is very hard. A dataframe showing the various variations of team names for each team would be great (this is a particular problem in the French league).

and it's also been mentioned in your discussion with @JoGall in issue #16.

Before any work is done, it'd be great to clarify what your ideal outcome is. For example, are you looking for a dataframe like this --

team_name         standardized_team_name   most_recent
st. louis rams    los angeles rams         false
los angeles ram   los angeles rams         true
oakland raiders   oakland raiders          true

Yes, I apologize for using NFL teams here but it was the quickest example I could think of.

By having these three columns, folks will only have to check a subset of team names each year (e.g. only where most_recent == true) and can create a new row if something has changed.
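As a rough sketch of that yearly check, assuming the table lives in an R data frame (the object name `team_names` and the use of logical TRUE/FALSE values are assumptions for illustration, and the rows are just the NFL examples above):

```r
# Illustrative lookup table; the rows are the NFL examples above, not package data.
team_names <- data.frame(
  team_name              = c("st. louis rams", "los angeles ram", "oakland raiders"),
  standardized_team_name = c("los angeles rams", "los angeles rams", "oakland raiders"),
  most_recent            = c(FALSE, TRUE, TRUE),
  stringsAsFactors       = FALSE
)

# Only rows flagged as current need to be re-checked each year.
to_review <- team_names[team_names$most_recent, ]
print(to_review$team_name)
```

If a team changes its name, the old row would be switched to most_recent = FALSE and a new row added for the new name.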

What do you think?

@jalapic
Owner

jalapic commented Jan 12, 2017

Here's what the CSV for England looks like - here it is as a file

As you can see it has two columns. The first is the team name that appears in england as the home team or away team. The second column has all variants I've come across. The data for the 2016 season appears to have even more variants, which will need to be added. I think this is a safer option than doing closest-matching or things like that, which will typically lead to errors.

I don't think I started a similar file for the other leagues. A most_recent column variable would be useful too.

@jalapic
Owner

jalapic commented Jan 12, 2017

In terms of goals - I think it's definitely useful to have an index of all team names and their variants. This will help a lot in future data collection.

@aqsmith08
Contributor Author

@jalapic - I sent pull request #18 using La Liga as an example. Please take a look!

@JoGall
Contributor

JoGall commented Jan 13, 2017

Could you perhaps automate adding new variants to the team name index and making it available on the repo? Then anyone can pitch in and manually annotate them. Might be useful if/when you add other European leagues whose teams you know less about!

@jalapic
Owner

jalapic commented Jan 13, 2017

Hi @JoGall - can you expand on what you mean by automate adding new variants? Not sure how to implement a way of editing it by other users (maybe I'm just missing something obvious).

I've merged the pull request of @aqsmith08 which has created a new teamname df. I will add England names to this - I might modify it a bit later today.

@aqsmith08
Contributor Author

Hey @JoGall - I'd love to hear more too. Are you looking for something like --

oldsoccerdata.csv

team_name         standardized_team_name   most_recent
st. louis rams    los angeles rams         false
los angeles ram   los angeles rams         true
oakland raiders   oakland raiders          true

newsoccerdata.csv

team_name            standardized_team_name   most_recent
las vegas gamblers   las vegas gamblers       true

We could create a function that takes all the team_name values from oldsoccerdata.csv and compares them to all the team_name values in newsoccerdata.csv. Any new team name that doesn't exist in oldsoccerdata.csv would be flagged for review. So in the example given above, las vegas gamblers would be flagged. From there, any name review would still need to be manual. Is that what you're thinking?
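A minimal sketch of that comparison, assuming both files have already been read into data frames with a team_name column. The helper name `flag_new_teams` is hypothetical, and the new data here has an extra already-known team added purely to show that it is not flagged:

```r
# Hypothetical helper: return team names in `new` that are absent from `old`.
flag_new_teams <- function(old, new) {
  setdiff(unique(as.character(new$team_name)), as.character(old$team_name))
}

# Stand-ins for oldsoccerdata.csv and newsoccerdata.csv (illustrative rows only).
old <- data.frame(
  team_name = c("st. louis rams", "los angeles ram", "oakland raiders"),
  stringsAsFactors = FALSE
)
new <- data.frame(
  team_name = c("oakland raiders", "las vegas gamblers"),
  stringsAsFactors = FALSE
)

# Names returned here still need a manual review.
flagged <- flag_new_teams(old, new)
print(flagged)
```

This is an exact string comparison, so a new spelling of an existing team is flagged as well, which is what the manual review step is for.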


Also, I should add that I chose not to change any team names that @jalapic already had in the CSV files (e.g. spain.csv). I figured it was best to use the name that existed already as the standardized_team_name for consistency.

@jalapic
Owner

jalapic commented Jan 14, 2017

@aqsmith08 @JoGall - ok, in the new GitHub version of the package there is a .rda file in ./data (and a corresponding csv file in ./data-raw) called teamnames. If you reinstall the package, you can load this by typing teamnames. You'll see I went with four columns -

country - country of the league the team plays in
name - the team name I've used across all seasons for that team
name_other - other versions of the team name that have appeared in various data sources
most_recent - is this the current name of the team?

The most_recent column is woefully incomplete.

name_other contains lots of weird versions of some teams. I use this in the new england_current() function to find the imported team name and correct it to the name version that is used in other seasons.

If you have any ideas on improving this, let me know. It definitely helps me in cleaning up the data.

@JoGall
Contributor

JoGall commented Jan 14, 2017

Yes @aqsmith08 that's exactly what I was thinking. Maybe a snippet of code that could run when importing new leagues that automatically adds unknown names to teamnames. Something like this for a new 'scotland' dataset from the current season:

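#team names in the new data that are not yet indexed
#(a vector condition inside if() fails in current R, and an if() without else would set teamnames to NULL)
new_names <- setdiff(unique(as.character(scotland$home)), teamnames$name)
if(length(new_names) > 0) {
	teamnames <- rbind(teamnames, data.frame(
		country = "Scotland",
		name = new_names,
		name_other = NA_character_,
		most_recent = NA,
		stringsAsFactors = FALSE)
	)
}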

which returns:

> tail(teamnames, 14)

847    spain Villarreal CF Villarreal CF        TRUE
848    spain      Xerez CD      Xerez CD        TRUE
849 Scotland    Kilmarnock            NA          NA
850 Scotland       Partick            NA          NA
851 Scotland       Rangers            NA          NA
852 Scotland   Ross County            NA          NA
853 Scotland        Hearts            NA          NA
854 Scotland  St Johnstone            NA          NA
855 Scotland      Aberdeen            NA          NA
856 Scotland        Dundee            NA          NA
857 Scotland      Hamilton            NA          NA
858 Scotland   Inverness C            NA          NA
859 Scotland    Motherwell            NA          NA
860 Scotland        Celtic            NA          NA

Of course we'd have to review them manually from this point but at least we'd know what teams need doing. @jalapic you could update the 'teamnames' dataframe each time you add new data and add it to the repo (perhaps as a CSV) so anyone can pitch in and help?

@JoGall
Contributor

JoGall commented Jan 14, 2017

Made this function to add unindexed team names to the teamnames dataframe. Seems to work using the Scotland '16/'17 dataset I've been practicing with.

library(magrittr) #provides %>%

updateTeamnames <- function(df, country = NULL) {

	#keep the argument under another name: inside subset(teamnames, country==country)
	#both sides refer to the column, so the filter was always TRUE
	ctry <- country
	indexed <- if(is.null(ctry)) teamnames else teamnames[teamnames$country == ctry, ]

	#find unindexed teamnames (home and away)
	unique(c(as.character(df$home), as.character(df$visitor))) %>%
	setdiff(indexed$name_other) %>%

	#make new entry if team can't be approximately matched in the existing teamnames df
	lapply(function(x) {
		idx_name <- unique(teamnames[agrep(x, teamnames$name),]$name)
		if(length(idx_name)==0) {
			data.frame(
			country = if(is.null(ctry)) NA_character_ else ctry,
			name = x,
			name_other = x,
			most_recent = NA,
			stringsAsFactors = FALSE)
		}
	}) %>%

	#flatten list (NULL if nothing new was found)
	do.call("rbind", .) %>%

	#bind to teamnames df
	rbind(teamnames, .)
}

> tail(updateTeamnames(scotland, "Scotland"), 14)

# A tibble: 14 × 4
    country          name    name_other most_recent
      <chr>         <chr>         <chr>       <chr>
1     spain     UE Lleida     UE Lleida       FALSE
2     spain   Valencia CF   Valencia CF        TRUE
3     spain Villarreal CF Villarreal CF        TRUE
4     spain      Xerez CD      Xerez CD        TRUE
5  Scotland    Kilmarnock    Kilmarnock          NA
6  Scotland       Partick       Partick          NA
7  Scotland   Ross County   Ross County          NA
8  Scotland        Hearts        Hearts          NA
9  Scotland  St Johnstone  St Johnstone          NA
10 Scotland      Aberdeen      Aberdeen          NA
11 Scotland        Dundee        Dundee          NA
12 Scotland      Hamilton      Hamilton          NA
13 Scotland   Inverness C   Inverness C          NA
14 Scotland    Motherwell    Motherwell          NA

@JoGall
Contributor

JoGall commented Jan 14, 2017

TODO: another function to replace alternative team names with their preferred name.

e.g. if 'home' = Tottenham, look up 'name' in teamnames df:

> subset(teamnames, name_other=="Tottenham")$name
"Tottenham Hotspur"
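One possible shape for that function, as a sketch only. The name `standardizeTeamnames` and the toy lookup rows are made up for illustration; the column names follow the teamnames df described above. It uses an exact match on name_other and, as a simplifying assumption, ignores country:

```r
# Hypothetical helper: map each name in `x` to its preferred name via an exact
# match on name_other; names with no match are returned unchanged.
standardizeTeamnames <- function(x, lookup) {
  x <- as.character(x)
  idx <- match(x, lookup$name_other)
  ifelse(is.na(idx), x, as.character(lookup$name[idx]))
}

# Toy lookup table with the same column names as teamnames (illustrative rows).
lookup <- data.frame(
  name       = c("Tottenham Hotspur", "Tottenham Hotspur", "Arsenal"),
  name_other = c("Tottenham", "Tottenham Hotspur", "Arsenal"),
  stringsAsFactors = FALSE
)

fixed <- standardizeTeamnames(c("Tottenham", "Arsenal", "Unknown FC"), lookup)
print(fixed)
```

Leaving unmatched names untouched keeps them visible, so they can be picked up and added to the index rather than silently dropped.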

@aqsmith08
Contributor Author

Hey @jalapic & @JoGall -

Just wanted to check in and see how we're doing here. It seems like we've successfully completed the focus of this issue -- Create a team names dataframe to standardize variations. I double-checked all the leagues we have in teamnames.csv and it seems like every country league listed in the data-raw folder is covered. Is that correct or am I missing anything?

I do think there are other things we can do (e.g. adding new functions, updating the most_recent column, etc.) but I vote we keep issues tightly scoped on a single item and open a new issue for anything outside this use case. What do you think?

> table(teams$country)

England  France Germany Holland   Italy   spain 
    430      82     135      68      73      60

@jalapic
Owner

jalapic commented Jan 22, 2017

@aqsmith08 @JoGall Agreed - thanks both. I'll close this issue. We could open a new issue to check the most_recent variable of the teamnames df. For the new league data added today (Scotland, Portugal, Turkey, Greece, Belgium), I have added unique teamnames. As I said in that issue thread - we do need to double-check whether any teams changed name during that period but are in fact the same team.
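A possible starting point for that most_recent check, sketched under the assumption that each team should have exactly one variant flagged as current. The function name `checkMostRecent` and the toy rows are hypothetical:

```r
# Hypothetical check: list teams whose variants do not include exactly one
# row flagged as most_recent (NA counts as "not flagged").
checkMostRecent <- function(lookup) {
  flagged <- tapply(lookup$most_recent %in% TRUE, lookup$name, sum)
  names(flagged)[flagged != 1]
}

# Toy table with the same column names as teamnames (illustrative rows only).
lookup <- data.frame(
  name        = c("Team A", "Team A", "Team B", "Team B", "Team C"),
  name_other  = c("Team A", "A United", "Team B", "B City", "Team C"),
  most_recent = c(TRUE, FALSE, NA, NA, TRUE),
  stringsAsFactors = FALSE
)

needs_review <- checkMostRecent(lookup)
print(needs_review)
```

Teams returned by the check either have no current name recorded yet or have more than one, and both cases need a manual look.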

I have added the function for checking whether a teamname is unique and adding it to the teamnames dataframe to /data-raw, based on @hadley's recommendations here.

@jalapic jalapic closed this as completed Jan 22, 2017
3 participants