<a href="https://www.kaggle.com/code/mikedelong/off-to-histogram-city?scriptVersionId=161071577" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
import pandas as pd

CHESS = '/kaggle/input/top-chess-players-in-the-world/top_chess_players.csv'

# Some rows are missing both Title and Rating/Ranking; drop them.
df = pd.read_csv(filepath_or_buffer=CHESS).dropna()
# Split 'Rating | Ranking' (e.g. '2830 | #1') into two integer columns.
df['Rating'] = df['Rating | Ranking'].apply(func=lambda x: int(x.split('|')[0]))
df['Ranking'] = df['Rating | Ranking'].apply(func=lambda x: int(x.split('|')[1][2:]))
df.head()

Unnamed: 0,Title,Player,Rating | Ranking,Federation,Rating,Ranking
0,GM,Magnus Carlsen,2830 | #1,Norway,2830,1
1,GM,Fabiano Caruana,2804 | #2,United States,2804,2
2,GM,Hikaru Nakamura,2788 | #3,United States,2788,3
3,GM,Ding Liren,2780 | #4,China,2780,4
4,GM,Alireza Firouzja,2759 | #5,France,2759,5
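
The `[2:]` slice in the cell above relies on every value having exactly one space and a `#` after the pipe. As a minimal sketch, assuming the values always look like `2830 | #1`, a regular expression is one way to make that assumption explicit. The helper name `parse_rating_ranking` is hypothetical and not part of the notebook; the sample strings are copied from the table above.

```python
import re

# Matches strings shaped like "2830 | #1": rating, a pipe, then "#" and the ranking.
# Whitespace around the pipe is optional, which the fixed [2:] slice does not allow.
PATTERN = re.compile(r"^\s*(\d+)\s*\|\s*#(\d+)\s*$")


def parse_rating_ranking(text):
    """Return (rating, ranking) as ints, or None if the text does not match."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Sample values copied from the table above.
samples = ["2830 | #1", "2804 | #2", "2759 | #5"]
parsed = [parse_rating_ranking(s) for s in samples]
print(parsed)
```

Returning `None` for unexpected input, rather than raising, makes it easy to count how many rows would fail before converting the whole column.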


In [2]:
from plotly.express import scatter
scatter(data_frame=df, y='Rating', x='Ranking', color='Title', hover_name='Player', log_y=False)



It's amazing how far a handful of players stand out at the top of the ratings/rankings; below them, the ratings follow a very smooth curve with only a little noise.

In [3]:
from plotly.express import histogram
histogram(data_frame=df, x='Rating', color='Title')





It is surprising that the right tail would be so thin, isn't it?

In [4]:
from plotly.express import histogram
histogram(data_frame=df, y='Federation', height=1500, color='Title')





This isn't simply a map of population, is it? Chess players aren't uniformly distributed.

In [5]:
scatter(data_frame=df, y='Federation', x='Rating', color='Title', height=1500, hover_name='Player')





Maybe a Federation x Rating scatter plot does a better job of representing the data we saw in the bar chart above: it includes more data, and the Player data adds a lot of richness.

In [6]:
scatter(data_frame=df, x='Ranking', y='Rating', trendline='ols', color='Title', trendline_scope='overall')





It is kind of surprising how close to linear our data looks, though: the R² for our OLS fit is 0.88. The data obviously isn't linear, so let's fit some polynomials.
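
To illustrate how a straight line can score a fairly high R² on data that is clearly curved, here is a minimal sketch on synthetic data. The curve below is an invented stand-in, not the Kaggle dataset, and its constants are arbitrary assumptions. It computes R² by hand as 1 - SS_res / SS_tot and then looks at the sign of the residuals, which is where the curvature shows up.

```python
import numpy as np

# Synthetic stand-in for a rating-vs-ranking curve (NOT the Kaggle data):
# a steep drop at the top of the list followed by a flatter, nearly linear tail.
ranking = np.arange(1, 201, dtype=float)
rating = 2500.0 + 300.0 * np.exp(-ranking / 40.0) - 0.5 * ranking

# Ordinary least-squares straight line; polyfit returns the slope first.
slope, intercept = np.polyfit(ranking, rating, deg=1)
fitted = slope * ranking + intercept

# R^2 = 1 - SS_res / SS_tot
residuals = rating - fitted
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((rating - rating.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot
print("R^2 of the straight line:", round(r_squared, 3))

# On a convex curve the line sits below the data at both ends and above it
# somewhere in between, so the residuals change sign in a systematic way.
print("first residual:", residuals[0])
print("smallest residual:", residuals.min())
print("last residual:", residuals[-1])
```

A residual pattern of positive, negative, positive is a sign that the misfit is structural rather than noise, which is a reasonable motivation for trying higher-degree fits.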

In [7]:
from sklearn.metrics import r2_score
from numpy import polyfit, polyval

# Fit polynomials of degree 2 through 9, store each fit as a new column,
# and print the R² of each fit in order of degree.
for degree in range(2, 10):
    z = polyfit(x=df['Ranking'], y=df['Rating'], deg=degree)
    yfit = polyval(z, df['Ranking'])
    name = '{} degree'.format(degree)
    df[name] = yfit
    print(r2_score(y_true=df['Rating'], y_pred=yfit))

0.9631129618196576
0.9842332877311866
0.9917168430412739
0.9957507685974248
0.9975504674137983
0.99849797762111
0.9988304934775372
0.9988912805764524
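
The scores above rise with every added degree. For least-squares polynomials scored on the same points they were fitted to, that is expected: each higher degree contains the lower one as a special case, so the in-sample R² cannot go down. The sketch below checks this on synthetic data. The data, the noise level, and the helper name `r_squared` are assumptions made for illustration and are not taken from the notebook.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic, noisy, decaying curve (NOT the Kaggle data).
# x is scaled to [0, 1] to keep the polynomial fits well conditioned.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = 2500.0 + 300.0 * np.exp(-5.0 * x) + rng.normal(scale=5.0, size=x.size)

# Fit degrees 1 through 6 and score each fit on the data it was fitted to.
scores = {}
for degree in range(1, 7):
    coeffs = np.polyfit(x, y, deg=degree)
    scores[degree] = r_squared(y, np.polyval(coeffs, x))

for degree, score in scores.items():
    print(degree, round(score, 4))
```

Because of this, a rising in-sample R² on its own may not say much about which degree is the better description of the data; comparing the fitted curves visually, as the next cell does, is one complementary check.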


In [8]:
df.head()

Unnamed: 0,Title,Player,Rating | Ranking,Federation,Rating,Ranking,2 degree,3 degree,4 degree,5 degree,6 degree,7 degree,8 degree,9 degree
0,GM,Magnus Carlsen,2830 | #1,Norway,2830,1,2697.951472,2724.653389,2742.614628,2757.123567,2767.541472,2775.622169,2780.679288,2782.948858
1,GM,Fabiano Caruana,2804 | #2,United States,2804,2,2697.474602,2723.874179,2741.496422,2755.59474,2765.599332,2773.253186,2777.966843,2780.043676
2,GM,Hikaru Nakamura,2788 | #3,United States,2788,3,2696.998224,2723.096883,2740.383003,2754.076112,2763.675184,2770.913039,2775.294546,2777.186627
3,GM,Ding Liren,2780 | #4,China,2780,4,2696.522337,2722.3215,2739.274356,2752.567627,2761.768883,2768.601413,2772.661849,2774.376951
4,GM,Alireza Firouzja,2759 | #5,France,2759,5,2696.046941,2721.548027,2738.170465,2751.06923,2759.880287,2766.317998,2770.068211,2771.6139


In [9]:
scatter(data_frame=df, x='Ranking', y=['Rating', '2 degree', '3 degree', '9 degree'])



