Player names in Baseball Reference stats contain mis-encoded non-ASCII characters #393

Closed
AndrewsOR opened this issue Nov 30, 2023 · 5 comments

@AndrewsOR

AndrewsOR commented Nov 30, 2023

The FanGraphs functions pitching_stats() and batting_stats() appear to convert names from that site such as Ronald Acuña Jr. and José Abreu to Ronald Acuna Jr., Jose Abreu etc., which are recognizable if not entirely correct.

On the other hand, the Baseball Reference functions batting_stats_bref() and pitching_stats_bref() return what seems like mis-converted HTML encodings of those names, resulting in lower readability, although the names on the site itself appear correct.

For example:

import pybaseball as pb
df_bref_batting = pb.batting_stats_bref(2023).sort_values("mlbID")
df_bref_pitching = pb.pitching_stats_bref(2023).sort_values("mlbID")

for side, df in zip(["batting","pitching"],[df_bref_batting,df_bref_pitching]):
    print(side)
    print(df[df["Name"].str.contains("x")][["Name","mlbID"]].head().to_string(index=False)+"\n")

prints:

batting
                        Name  mlbID
           Manny Pi\xc3\xb1a 444489
     Mart\xc3\xadn Maldonado 455117
           Sandy Le\xc3\xb3n 506702
Avisa\xc3\xadl Garc\xc3\xada 541645
         Carlos P\xc3\xa9rez 542208

pitching
                   Name  mlbID
           Max Scherzer 453286
Mart\xc3\xadn Maldonado 455117
     Luis Garc\xc3\xada 472610
   Jos\xc3\xa9 Quintana 500779
              Alex Cobb 502171

My current workaround is to use playerid_reverse_lookup to bridge to FanGraphs names and use those instead. (I like to use the Baseball Reference batting stats because of how they label players who played for multiple teams or in multiple leagues in a given season, providing both team names instead of "---".)

I love pybaseball... thank you!

@BrayanMnz
Contributor

Hi, these are accent marks ("tildes") on letters in Spanish names.
I have a workaround for this in an internal project that may be useful.

I will try to replicate your example and apply my workaround.

@BrayanMnz
Contributor

BrayanMnz commented Dec 1, 2023

The issue is that we are wrongly encoding a bytes object that has been parsed to a string.
What we need to do instead is decode the bytes object directly.
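
To illustrate the difference described above, here is a minimal sketch in plain Python. It is not the pybaseball code itself; the variable names and the sample name are assumptions chosen to mirror the output in the report. Converting bytes with str() keeps the escape sequences as literal text, while decoding the bytes as UTF-8 recovers the accented characters.

```python
# Hypothetical sample: UTF-8 bytes such as a scraper might receive from a web page.
raw = "Manny Piña".encode("utf-8")

# Pitfall: str() on a bytes object returns its repr, so the two UTF-8 bytes
# of "ñ" end up as the literal text "\xc3\xb1" inside the string.
mangled = str(raw)
print(mangled)

# Decoding the bytes directly yields the intended text.
fixed = raw.decode("utf-8")
print(fixed)
```

The first print shows the name with literal `\xc3\xb1` text, matching the pattern in the reported output; the second shows the name with its accented character intact.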

I'll submit a PR to fix this.

Here is how it looks after my fix:
[image]

@BrayanMnz
Contributor

BrayanMnz commented Dec 18, 2023

Hi @AndrewsOR, this has now been merged into master.
You just need to wait for the next pybaseball release or use the project directly from the GitHub master branch.
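
For anyone still on a release without the fix, a possible interim workaround is to repair the names after the fact. The sketch below is an assumption-based helper, not part of pybaseball: `repair_name` is a hypothetical function, and it assumes the Name values contain literal backslash-x sequences exactly as printed in the report.

```python
import re

# Matches literal text such as "\xc3" (a backslash, an "x", two hex digits).
_ESCAPED_BYTE = re.compile(r"\\x([0-9a-fA-F]{2})")


def repair_name(name: str) -> str:
    """Hypothetical helper: turn literal '\\xNN' text back into UTF-8 characters."""
    if not _ESCAPED_BYTE.search(name):
        return name  # nothing to repair, e.g. "Max Scherzer"
    try:
        # Map each "\xNN" to the code point NN, then reinterpret the code
        # points as raw bytes via latin-1 and decode those bytes as UTF-8.
        as_text = _ESCAPED_BYTE.sub(lambda m: chr(int(m.group(1), 16)), name)
        return as_text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return name  # leave the value untouched if it is not valid UTF-8


print(repair_name(r"Manny Pi\xc3\xb1a"))
print(repair_name(r"Avisa\xc3\xadl Garc\xc3\xada"))
print(repair_name("Max Scherzer"))
```

With a DataFrame like the ones in the report, this could presumably be applied as `df["Name"] = df["Name"].map(repair_name)`. Names without escape sequences pass through unchanged.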

Feel free to close the issue.

@BrayanMnz
Contributor

This issue can be closed since the solution was merged. @schorrm

@AndrewsOR
Author

Thank you @BrayanMnz !
