In [1]:
from ibm_watsonx_ai.foundation_models.utils.enums import ModelTypes
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames as GenParams
from ibm_watsonx_ai.foundation_models.utils.enums import DecodingMethods
from ibm_watsonx_ai.foundation_models import Model
import time

In [2]:
from IPython.display import display, HTML
import pandas as pd

def pretty_print(df):
    return display( HTML( df.to_html().replace("\\n","<br>").replace("\\t","  ").replace("<td", "<td style='text-align:left'") ) )

# Document Generation

This notebook walks throught the process of using the meeting notes for a business need. In this use case we use them to modify a document. In any case we believe that the first step is to summarize the meeting notes with the final objective in mind.

In [3]:
# IBM Cloud Credentials
my_credentials = { 
    "url"    : "https://us-south.ml.cloud.ibm.com", 
    "apikey" : '< Your IBM Cloud API KEY >'
}      

## Summarization

This section generates a summary in form of a bullet list for each chunk of the transcript, a set of changes to be applied to the document. We are splitting and pasting the chunks manually. This could be done automatically in an MVP.

In [8]:
transcript_chunks = [
    '''Let's go with the abstract section. Let's see. Ok, is the objective still temporal visualization? Yeah, yeah, it's still still the same. That's that's okay. Ok, perfect. And what all the keywords? So, keywords, we have only these 4. I think I think we should add more specific visualizations into the keyword. So, let's add dendrogram, animated bubbles, animated lines, bars, and lines as well. Ok, sounds good. Let's go with that. now let's move on to the introduction section. Tell me a little bit about it. Ok, so let's go paragraph by paragraph. I think this first one, it's okay. You think these are the correct references that we are using? This date is correct. Ok, I think I think that that is good. 3. Revolution, okay. Yeah, I'll double-check that, but I think that that is fine as well. This Dupree, I think I think this second one, it's also ok. Let's jump into this last one. Uh, okay, there is something that we should be able to change here, right? 
    So, I think this wording is is a little bit off, where it says that the truth is that there is that it is very hard to compare players who played in different areas and different positions. I think we can do a little bit better here. So, I will change this last paragraph and these wording to say something around, you know, comparison between errors require a deeper analytics into the change of rules and players positions. Ok, I think I think that makes more sense. All right, yeah, ok. Let's let's go with that wording, making sure that this well now it looks more like a comparison between errors require a deeper analytics into the change of rules and players, yeah. Sounds good.'''
    '''Now, this the next is ok, background. Ok, background. Ok, this first paragraph looks ok to me. Yeah, this hasn't changed. Second paragraph, traditional positions, although well known, have never been strictly defined nor enforced. This is true, although not everyone might know the traditional positions. So, here, let's add the 5 traditional positions, Steve. So, that will be center, point guard, shooting guard, small forward, and forward. Yeah, that that will help a lot. And where should we add those? I think you just after it says, yeah, and background, second paragraph, just after it says, "m the well-known positions." Well-known, yeah. You can maybe at the parentheses or something that you can figure it out. Ok, perfect. So, we will add those 5 traditional positions there. Sounds good. And now, the next part will be, yes, approach. Ok, the approach section, yeah. This small section, yeah. This is this is ok. This is still the same. 
    Perfect. And now, Section datasets. Yeah, datasets, we did have a change here, a big change. We are not using anymore the basketball reference data set. So, what we should do to Steve is, in all the document, in all, in all the paper, we should change, sorry, we should change where it says, "back basketball reference" to "ESPN analytics." Ok, all right. So, good. Let's go with that change. Make sure to put the ESPN analytics instead. Yeah, yeah. Let's let's go with that. Other than that, let me, let me scroll down this first table. I think there's a change, since we change data sets. There is no offensive rebounds or defensive rebounds. So, we should delete these 2 rows and add one that says just total rebounds. Ok, perfect. Let's do that addition to that section, yeah. So, that table one, table 2, yeah. This one is still the same thing. Yeah, there is no change there.''',
    '''Perfect. Alright, now how about a section 3.3? 3.2, I think it's 3.2 first. Oh yeah, you're right. Tell me a little bit about that one. OK, 3.2 first, OK 3 point evolution per position. OK, in this section. I think I would add a paragraph. OK, let me let me see where. Where was that? OK, this chart is OK. Uh, this wording is is fine. This is what matters. Umm, this cluster names are still relevant. Here are here we are saying that we're gonna use these two algorithms. SARIMAX and AIC, and we should spend maybe a paragraph explaining what these two algorithms are. OK, so add a paragraph here at the end of the section and explaining what SARIMAX and AIC is OK. Perfect. We'll make that definition to the paragraph, so it helps out. Perfect. Ohh. and, let's move now to section. Now section 3.3. Sure. Yeah. OK. So this is the LeBron versus Michael umm inlet analysis. 
    OK, here we are good with the this assessment and I think there is something maybe we can change here Steve and that is. Umm, explain why the first 15 seasons here at the end of the first paragraph we could add a sentence that says something around that we only choose this 15 seasons because that's the amount of seasons that Michael Jordan played, OK. All right. OK, so in the first paragraph in section 3.3. We will explain that we use 15 season because that is ammount of season that Michael played. OK, perfect. Sounds good. Yeah, that is correct. OK. Results. Right moving section results perfect. OK. Yeah, Section 4 and four one. OK, Section 4.1, I think this is fine. This is just a link to the animation would change there. UM now to section 4.2 is there is other one. This is the Michael Jordan visualization again. We have the same. Two Visualization, I think I think I'm seeing here a lot of PER mentions, OK, yeah. 
    So let's let's modify this. So here where it says PER. Let's change it to player efficiency rating and then in parentheses PER OK. OK. We'll make that change sound good. Perfect.'''
]

In [43]:
def llama2_summary_prompt(transcript):
    return ( '[INST]<<SYS>>\nYou are a helpful, and honest assistant. You are asked to write a summary of the meeting transcripts including all the discussed changes to a document. Include the details necessary to make the discussed changes.\n\n'
            +'For example:\n\n'
            +'Text:\n'
            +'''In general, we are not using brand A, now we are using brand B. So, can we change all references from brand A to brand B? Yes. Now moving to the introduction, it is fine. We can leave it as it is. Now to the discussion section, we have changed the data upload process. We are no longer uploading the CSV files manually. Instead, the data is loaded to the database directly from the website. The cost section is all right, although we didn't have a workforce cost. What should we add for workforce? Let's put 15 USD per hour. Include the break down, 10 USD base salary, 2 USD for gas and 3 USD for food support. Tips not included. Ok. Moving to the constraints sections, we don't have any now. So we can delete the entire section? Yes.\n\n'''
            +'Output:\n'
            +'* Replace all references from brand A to brand B.\n'
            +'* Replace the references to manual uploading with data being directly loaded to the database from the website in the discussion section of the data upload process\n'
            +'* Add 15 USD per hour of workforce in the cost section with the following breakdown:\n'
            +'  + 10 USD as base salary.\n'
            +'  + 2 USD for gas.\n'
            +'  + 3 USD for food support.\n'
            +'  + Tips not included.\n'
            +'* Delete the constraints section.\n<</SYS>>\n\n'
            +f"Text:\n{transcript}\n\n"
            +'Output:[/INST]\n')

In [44]:
# This prompt template is an example for LLAMA3. You will need to tune it for your needs.
def llama3_summary_prompt(transcript):
    return ( '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are a helpful, and honest assistant. You are asked to write a summary of the meeting transcripts including all the discussed changes to a document. Include the details necessary to make the discussed changes.\n\n'
            +'For example:\n\n'
            +'Text:\n'
            +'''In general, we are not using brand A, now we are using brand B. So, can we change all references from brand A to brand B? Yes. Now moving to the introduction, it is fine. We can leave it as it is. Now to the discussion section, we have changed the data upload process. We are no longer uploading the CSV files manually. Instead, the data is loaded to the database directly from the website. The cost section is all right, although we didn't have a workforce cost. What should we add for workforce? Let's put 15 USD per hour. Include the break down, 10 USD base salary, 2 USD for gas and 3 USD for food support. Tips not included. Ok. Moving to the constraints sections, we don't have any now. So we can delete the entire section? Yes.\n\n'''
            +'Output:\n'
            +'* Replace all references from brand A to brand B.\n'
            +'* Replace the references to manual uploading with data being directly loaded to the database from the website in the discussion section of the data upload process\n'
            +'* Add 15 USD per hour of workforce in the cost section with the following breakdown:\n'
            +'  + 10 USD as base salary.\n'
            +'  + 2 USD for gas.\n'
            +'  + 3 USD for food support.\n'
            +'  + Tips not included.\n'
            +'* Delete the constraints section.\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n'
            +f"Text:\n{transcript}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n"
            +'Output:\n')

Now we use GenAI to summarize the chunks

In [15]:
model_id    = ModelTypes.LLAMA_3_70B_INSTRUCT
gen_parms   = None
project_id  = '< Your watsonx.ai project ID >'
space_id    = None
verify      = False

model = Model( model_id, my_credentials, gen_parms, project_id, space_id, verify )   

These are all the changes summarized:

In [16]:
parameters = {
    GenParams.MIN_NEW_TOKENS: 50,
    GenParams.MAX_NEW_TOKENS: 800,
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.REPETITION_PENALTY: 1
}

changes = []
for _chunk in transcript_chunks:
    prompt_txt = llama3_summary_prompt(_chunk)
    generated_response = model.generate_text( prompt_txt, parameters )
    changes.append(generated_response.strip())
    time.sleep(1)

print("".join(changes))

* In the abstract section:
  + Add the following keywords: dendrogram, animated bubbles, animated lines, bars, and lines.
* In the introduction section:
  + Revise the last paragraph to read: "Comparison between errors require a deeper analytics into the change of rules and players positions."
* In the background section:
  + Add the 5 traditional positions (center, point guard, shooting guard, small forward, and forward) in parentheses after the phrase "well-known positions" in the second paragraph.
* In the datasets section:
  + Replace all references to "basketball reference" with "ESPN analytics" throughout the document.
  + In table 1, delete the rows for "offensive rebounds" and "defensive rebounds" and add a new row for "total rebounds".* Add a paragraph to the end of section 3.2 explaining what SARIMAX and AIC algorithms are.
* In section 3.3, add a sentence to the first paragraph explaining that the first 15 seasons were chosen because that is the amount of seasons Michael Jor

These are the set of changes for each chunk of the transcript. The goal is to validate the accuracy of the changes (Corrects, Inclusions, Omissions, etc)

In [17]:
df = pd.DataFrame(zip(transcript_chunks, changes), columns=["Transcript", "Changes"])
pretty_print(df)

Unnamed: 0,Transcript,Changes
0,"Let's go with the abstract section. Let's see. Ok, is the objective still temporal visualization? Yeah, yeah, it's still still the same. That's that's okay. Ok, perfect. And what all the keywords? So, keywords, we have only these 4. I think I think we should add more specific visualizations into the keyword. So, let's add dendrogram, animated bubbles, animated lines, bars, and lines as well. Ok, sounds good. Let's go with that. now let's move on to the introduction section. Tell me a little bit about it. Ok, so let's go paragraph by paragraph. I think this first one, it's okay. You think these are the correct references that we are using? This date is correct. Ok, I think I think that that is good. 3. Revolution, okay. Yeah, I'll double-check that, but I think that that is fine as well. This Dupree, I think I think this second one, it's also ok. Let's jump into this last one. Uh, okay, there is something that we should be able to change here, right? So, I think this wording is is a little bit off, where it says that the truth is that there is that it is very hard to compare players who played in different areas and different positions. I think we can do a little bit better here. So, I will change this last paragraph and these wording to say something around, you know, comparison between errors require a deeper analytics into the change of rules and players positions. Ok, I think I think that makes more sense. All right, yeah, ok. Let's let's go with that wording, making sure that this well now it looks more like a comparison between errors require a deeper analytics into the change of rules and players, yeah. Sounds good.Now, this the next is ok, background. Ok, background. Ok, this first paragraph looks ok to me. Yeah, this hasn't changed. Second paragraph, traditional positions, although well known, have never been strictly defined nor enforced. This is true, although not everyone might know the traditional positions. So, here, let's add the 5 traditional positions, Steve. So, that will be center, point guard, shooting guard, small forward, and forward. Yeah, that that will help a lot. And where should we add those? I think you just after it says, yeah, and background, second paragraph, just after it says, ""m the well-known positions."" Well-known, yeah. You can maybe at the parentheses or something that you can figure it out. Ok, perfect. So, we will add those 5 traditional positions there. Sounds good. And now, the next part will be, yes, approach. Ok, the approach section, yeah. This small section, yeah. This is this is ok. This is still the same. Perfect. And now, Section datasets. Yeah, datasets, we did have a change here, a big change. We are not using anymore the basketball reference data set. So, what we should do to Steve is, in all the document, in all, in all the paper, we should change, sorry, we should change where it says, ""back basketball reference"" to ""ESPN analytics."" Ok, all right. So, good. Let's go with that change. Make sure to put the ESPN analytics instead. Yeah, yeah. Let's let's go with that. Other than that, let me, let me scroll down this first table. I think there's a change, since we change data sets. There is no offensive rebounds or defensive rebounds. So, we should delete these 2 rows and add one that says just total rebounds. Ok, perfect. Let's do that addition to that section, yeah. So, that table one, table 2, yeah. This one is still the same thing. Yeah, there is no change there.","* In the abstract section:  + Add the following keywords: dendrogram, animated bubbles, animated lines, bars, and lines. * In the introduction section:  + Revise the last paragraph to read: ""Comparison between errors require a deeper analytics into the change of rules and players positions."" * In the background section:  + Add the 5 traditional positions (center, point guard, shooting guard, small forward, and forward) in parentheses after the phrase ""well-known positions"" in the second paragraph. * In the datasets section:  + Replace all references to ""basketball reference"" with ""ESPN analytics"" throughout the document.  + In table 1, delete the rows for ""offensive rebounds"" and ""defensive rebounds"" and add a new row for ""total rebounds""."
1,"Perfect. Alright, now how about a section 3.3? 3.2, I think it's 3.2 first. Oh yeah, you're right. Tell me a little bit about that one. OK, 3.2 first, OK 3 point evolution per position. OK, in this section. I think I would add a paragraph. OK, let me let me see where. Where was that? OK, this chart is OK. Uh, this wording is is fine. This is what matters. Umm, this cluster names are still relevant. Here are here we are saying that we're gonna use these two algorithms. SARIMAX and AIC, and we should spend maybe a paragraph explaining what these two algorithms are. OK, so add a paragraph here at the end of the section and explaining what SARIMAX and AIC is OK. Perfect. We'll make that definition to the paragraph, so it helps out. Perfect. Ohh. and, let's move now to section. Now section 3.3. Sure. Yeah. OK. So this is the LeBron versus Michael umm inlet analysis. OK, here we are good with the this assessment and I think there is something maybe we can change here Steve and that is. Umm, explain why the first 15 seasons here at the end of the first paragraph we could add a sentence that says something around that we only choose this 15 seasons because that's the amount of seasons that Michael Jordan played, OK. All right. OK, so in the first paragraph in section 3.3. We will explain that we use 15 season because that is ammount of season that Michael played. OK, perfect. Sounds good. Yeah, that is correct. OK. Results. Right moving section results perfect. OK. Yeah, Section 4 and four one. OK, Section 4.1, I think this is fine. This is just a link to the animation would change there. UM now to section 4.2 is there is other one. This is the Michael Jordan visualization again. We have the same. Two Visualization, I think I think I'm seeing here a lot of PER mentions, OK, yeah. So let's let's modify this. So here where it says PER. Let's change it to player efficiency rating and then in parentheses PER OK. OK. We'll make that change sound good. Perfect.","* Add a paragraph to the end of section 3.2 explaining what SARIMAX and AIC algorithms are. * In section 3.3, add a sentence to the first paragraph explaining that the first 15 seasons were chosen because that is the amount of seasons Michael Jordan played. * In section 4.2, replace all instances of ""PER"" with ""player efficiency rating (PER)""."


## Classify Changes
In this section we classify the changes to be executed in the proper order.

In [18]:
def mixtral_classification_prompt(action):
    return ( '[INST]Classify this action into one of four classes: Delete, Add, Replace, No Change.\n\n'
            +'Class name: Delete\n'
            +'Description: The action asks to delete or remove an entire section or a portion.\n\n'
            +'Class name: Add\n'
            +'Description: The action asks to add a new section or paragraph.\n\n'
            +'Class name: Replace\n'
            +'Description: The action asks to replace one reference with another.\n\n'
            +'Class name: No Change\n'
            +'Description: The action states that no changes are to be made.\n\n'
            +f"Action:{action}\n"
            +'Just generate the class name without explanations:\n'
            +'[/INST]')

In [19]:
all_changes = []
for _changes in changes:
    for _change in _changes.split('*'):
        if len(_change)>5:
            all_changes.append(_change)

print(f"There are {len(all_changes)} changes")


There are 7 changes


Now we use GenAI to classify the changes into: Delete, Add, Replace, or No Change

In [20]:
model_id    = ModelTypes.MIXTRAL_8X7B_INSTRUCT_V01
gen_parms   = None
project_id  = '1dbd8eb5-588c-4f7e-98a8-6c65a179fc9d'
space_id    = None
verify      = False

model_mix = Model( model_id, my_credentials, gen_parms, project_id, space_id, verify )   

In [21]:
parameters_mix = {
    GenParams.MIN_NEW_TOKENS: 0,
    GenParams.MAX_NEW_TOKENS: 5,
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.REPETITION_PENALTY: 1
}


_delete = []
_add = []
_replace = []
_nochange = []
for _change in all_changes:
    prompt_txt = mixtral_classification_prompt(_change)
    generated_response = model_mix.generate_text( prompt_txt, parameters_mix ).lower().strip()
    if 'delete' in generated_response:
        _delete.append(_change)
    elif 'add' in generated_response:
        _add.append(_change)
    elif 'replace' in generated_response:
        _replace.append(_change)
    elif 'no change' in generated_response:
        _nochange.append(_change)
    else:
        print('Misclassifications')
        print('Action: '+_change)
        print(generated_response)
        _replace.append(_change) # Default to replace
    time.sleep(1)

These are the classified changes.

In [22]:
_nochange

[]

In [23]:
_delete

[' In the datasets section:\n  + Replace all references to "basketball reference" with "ESPN analytics" throughout the document.\n  + In table 1, delete the rows for "offensive rebounds" and "defensive rebounds" and add a new row for "total rebounds".']

In [24]:
_add

[' In the abstract section:\n  + Add the following keywords: dendrogram, animated bubbles, animated lines, bars, and lines.\n',
 ' In the background section:\n  + Add the 5 traditional positions (center, point guard, shooting guard, small forward, and forward) in parentheses after the phrase "well-known positions" in the second paragraph.\n',
 ' Add a paragraph to the end of section 3.2 explaining what SARIMAX and AIC algorithms are.\n',
 ' In section 3.3, add a sentence to the first paragraph explaining that the first 15 seasons were chosen because that is the amount of seasons Michael Jordan played.\n']

In [25]:
_replace

[' In the introduction section:\n  + Revise the last paragraph to read: "Comparison between errors require a deeper analytics into the change of rules and players positions."\n',
 ' In section 4.2, replace all instances of "PER" with "player efficiency rating (PER)".']

## Generation by Sets

The objective of this experiment is to concatenate the output of one set of changes into the generation of the next set of changes.
The goal is to have at the end a document with all the changes applied and to be able to measure its performance.

In [41]:
def mix_gen_prompt_1(changes, text):
    return ( '[INST] <<SYS>>\nYou are a helpful, and honest assistant. Modify the text according only to the listed changes.\n'
            +f'Changes:\n{changes}\n'
            +'<</SYS>>\n\n'
            +f"Text:\n\n{text}\n[/INST]\nHere's the modified text with the requested changes:\n" )

In [42]:
def llama3_gen_prompt(changes, text):
    return ( '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\nYou are an honest assistant. Modify the text according only to the listed changes.\n'
            +f'Changes:\n{changes}\n'
            +'<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n'
            +f"Text:\n\n{text}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\nHere's the modified text with the requested changes:\n" )

Now we apply the changes in order, first all the delete, then replace and finally the addition.

In [38]:
parameters = {
    GenParams.MIN_NEW_TOKENS: 500,
    GenParams.MAX_NEW_TOKENS: 4000,
    GenParams.DECODING_METHOD: DecodingMethods.GREEDY,
    GenParams.REPETITION_PENALTY: 1
}

prev_doc_text = ''

with open('document.txt') as f:
    # Read the contents of the file into a variable
    prev_doc_text = f.read()


doc_text_sets = prev_doc_text
doc_changed_sets = []

_delete_changes = '*' + "*".join(_delete)
_replace_changes = '*' + "*".join(_replace)
_add_changes = '*' + "*".join(_add)

prompt_txt_delete = llama3_gen_prompt(_delete_changes, doc_text_sets)
generated_response = model.generate_text( prompt_txt_delete, parameters )
doc_text_sets = generated_response.strip()
doc_changed_sets.append(doc_text_sets)
time.sleep(1)

prompt_txt_replace = llama3_gen_prompt(_replace_changes, doc_text_sets)
parameters[GenParams.MIN_NEW_TOKENS] = int(model.tokenize(prompt=doc_text_sets, return_tokens=True)['result']['token_count'])
generated_response = model.generate_text( prompt_txt_replace, parameters )
doc_text_sets = generated_response.strip()
doc_changed_sets.append(doc_text_sets)
time.sleep(1)

prompt_txt_add = llama3_gen_prompt(_add_changes, doc_text_sets)
parameters[GenParams.MIN_NEW_TOKENS] = int(model.tokenize(prompt=doc_text_sets, return_tokens=True)['result']['token_count'])
generated_response = model.generate_text( prompt_txt_add, parameters )
doc_text_sets = generated_response.strip()
doc_changed_sets.append(doc_text_sets)
time.sleep(1)    


These are the results after applying each set of changes. Remember that these changes are cumulative, you should see the changes hold at each step.

In [39]:
df = pd.DataFrame(zip([_delete_changes,_replace_changes,_add_changes], doc_changed_sets), columns=['Changes', 'Modified'])
pretty_print(df)

Unnamed: 0,Changes,Modified
0,"* In the datasets section:  + Replace all references to ""basketball reference"" with ""ESPN analytics"" throughout the document.  + In table 1, delete the rows for ""offensive rebounds"" and ""defensive rebounds"" and add a new row for ""total rebounds"".","Visual Historical Analysis of the NBA: Final Project Abstract This document describes the implementation of several temporal visualizations to analyze time-sensitive data from the NBA. It contains a description of the machine learning models implemented to analyze the data and prepare it for the visual analysis. In particular it covers the implementation of a clustering algorithm and a time series applied to the produced clusters. Keywords: neural clustering, time series, visualizations, NBA. 1 Introduction The game of basketball was first invented in 1891 in Springfield, Massachusetts in a small college, fast forward to today is a global game with more than 200 countries with national teams (McCuaig, 2016). The National Basketball Association (NBA) is the most popular basketball league, with more than 400 players and a 10 billion revenue in 2022 (Ozanian & Teitelbaum, 2022). The history of the NBA started in 1949 when the Basketball Association of America (BAA) and the National Basketball League (NBL) merged into one single league (History.com Editors, 2022). Since then, not only the teams and league has evolved but also how the game is played. It is common to refer to the eras of the NBA, the dominance of the big men, and currently the 3-point evolution. The 3-point line was introduced in 1979 as a response to the dominance of big men around the hoop, clogging the painted area and making the pace of the game slow and unattracted to some (DuPree, 1979). The main idea was to space the player around the court to allow for better tactical styles. Grating an extra point compared to any other field goal, the idea was to entice players to take more 3-points to gain the extra point. The height or the position of the player now doesn’t play as big as a factor in the ability to score. Besides the eras in the NBA, the history of the league can’t be written without its stars. Arguably the two biggest stars in modern history are Michael Jordan and Lebron James. Their curriculum is well-known by every fan of the NBA, often posting the question: who is the greatest of all time? Although the question is interesting to spark debates, the truth is that it is very hard to compare players who played in different eras and different positions. Basketball is a team game, having contributions from starters and bench players. The second objective of this paper is to present a comparison of Michael Jordan's and Lebron James' careers in terms of the contribution of their teams, trying to answer, who had more help. 2 Background The purpose of this project is to analyze historical data on the NBA, proposing visualization and machine learning techniques to analyze data that has latent variables related to the game. This latent variable refers to intangibles that cannot be measured like style, rules, and the evolution of the game from season to season. The first analysis is focused on the 3-point revolution effects on the roles and positions of basketball from the 1979-198 season to the 2021-2022 season and beyond. The second analysis focuses on comparing the career and team performance of two superstars playing in two different eras: Michael Jordan from 1984 to 2003 and LeBron James from 2004 to 2018. Both analyses pose similar challenges in having temporal data, statistics, and relative comparisons. Analyzing the player's performance per position poses several difficulties. The traditional positions although well-known have never been strictly defined nor enforced. For instance, a center may be known to play close to the basket and to rebound and score, it is not enforced that a player listed as a center cannot break that model. This is more prevalent as for the last two decades players have been more skilled and can play differently. In addition, teams can list their players without any control or consistency. In fact, many players are listed differently from season to season. For example, LeBron James has been listed as both a point guard and a small forward while Stephen Curry was classified as a point guard for the first three seasons, then listed as a shooting guard. There are several works done to address the role or position of a player in the NBA using statistics. One approach is to use the regular statistics used in the box score to measure players to create clusters, however, since the analysis in this project is temporal a decision must be made to create clusters per season, or one single mega-cluster as proposed by Hedquist (Hedquist, 2022). Clustering is an unsupervised machine-learning technique used to find patterns in the data and classify them. There are hierarchical and non-hierarchical models that can be used, each with advantages and disadvantages to this project. On the other hand, there is a myriad of comparisons between Michael Jordan versus LeBron James, most of them focus on single-point statistics to summarize their entire career. This could be simple statistics like average points per game, or advanced statistics like field goal percentage in winning shots. These statistics are not doing service to the complexity of the game, the situations each faced, the teammates, or the opponents. Although these statistics are useful for this project analysis, the purpose is to create a side-by-side comparison of their career and their teammates. This is achieved using a temporal visualization of advanced statistics while making a distinction between the superstar and their teammates. 3 Approach Section 3.1 presents a description of the datasets used for this project. Section 3.2 and 3.3 describes the data preprocessing and machine learning techniques applied to the datasets to achieve the desired visualizations. 3.1 Datasets This project uses data from the 1979-1980 season to the 2021-2022 season from ESPN analytics. Three datasets were used, the player's season totals, the player's advanced season totals, and the NBA league index. Table 1 describes the features used from the player's season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name positions Player Position games_played Games Played in the season FG Field Goals made FGA Field Goals attempts 3P 3-Points made 3PA 3-points attempts FT Free Throws made FTA Free Throws attempts AST Assists STL Steals BLK Blocks TOV Turnovers PF Personal Fouls PTS Points season Year season ended TRB Total Rebounds Table 1. The player's season totals dataset. Table 2 describes the features used from the player's advanced season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name PER Player Efficiency Rating: A measure of per-minute production standardized such that the league average is 15. OBPM Offensive Box Plus/Minus: box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team. DBPM Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team. BPM Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. season Year season ended Table 2. The player's advanced season totals dataset. Finally, the NBA league index dataset was used only to cross-reference the championship winners for each season. 3.2 3-Point evolution per Position. The first step in the pipeline implemented was to transform the player's season totals to per-game statistics to normalize each player's production to the games he played. Additionally, since players can be traded between a season, those would have several inputs in the dataset within a season. An aggregation of these players' statistics was done to add up their performance in a season to have a single input per player per season. This proposed dataset is then used for the first analysis of the 3-point impact in the positions. The second step is to define a consistent position in the NBA. As discussed in Section 2, clustering presents an alternative to classify players according to their production on the court. A hierarchical clustering algorithm was chosen due to its ability to consider outliers without skewing the clusters. To apply these models, the number of clusters must be determined, to do so a dendrogram can be used. Figure 1 shows the dendrogram for all the players from 1979 to 2022. The obvious option would be to use 5 clusters, however, upon testing this configuration produced cluster describing primarily the role of bench, star, or superstar. Since the idea is to have a cluster that also describes the way of contributing, 9 clusters were chosen just as Hedquist proposed. The algorithm used was a Ward, an agglomerative hierarchical clustering used as a mega-cluster for all seasons. Figure 2 shows the cluster's percentile per each of the statistics used for the 2021-2022 season. A descriptive name was given for each cluster using both Figure 2 and the prominent players in each cluster for the 2021-2022 season. Table 3 shows the names assigned and the prominent players per cluster in the 2021-2022 season. Id Cluster Name Prominent Players 0 Score First Carmelo Anthony, Andrew Wiggins, Bogdan Bogdanović 1 Superstars LeBron James, Stephen Curry, Giannis Antetokounmpo, Joel Embiid 2 Two-way players Alex Caruso, Patrick Beverle 3 Defensive Big Men Steven Adams, Dwight Howard, JaVale McGee 4 Bench Role Players Thanasis Antetokounmpo, Santi Aldama 5 Interior Big Men LaMarcus Aldridge, Rudy Gobert, Deandre Ayton 6 Deep Bench Bol Bol, Moses Wright 7 Scoring Big Men Bam Adebayo, Evan Mobley, Jusuf Nurkić 8 Bench Perimeter Scorers Lou Williams, Avery Bradley, Trevor Ariza Table 3. Cluster naming and prominent players. To predict the trend and forecast the 3-point evolution in the upcoming seasons the time series algorithm SARIMAX was implemented. This algorithm can handle both seasonal and non-seasonal data. The parameters and seasonal parameters for this model were chosen per cluster performing a grid search minimizing the Akaike information criteria (AIC). 3.3 Michael Jordan vs LeBron James Career and Teams. To produce this visual analysis the Players advanced totals dataset was preprocessed to filter only Michael’s and LeBron’s teams. Both superstars were subtracted from their teams, to have a clear comparison on how much their teams were dependent on them to thrive. It's worth noting that this analysis was limited to the top 6 teammates with the most minutes played per season. This was done to avoid skewing the statistics with bench players since there are more bench players in the NBA now than when Michael played. It is also true that in playoffs the rotation of players tends to close on 6 to 9 players depending on the coach. To be able to compare their career, this analysis focuses only on the first 15 seasons of each player. The advanced statistics are useful to sum up the player's impact on the team in one aspect of the game. For example, the Player Efficiency Rating (PER) “sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."" According to its creator John Hollinger (ESPN analytics, 2022). These statistics are adjusted around a league average, giving a direct indication of whether the players are above or below the average for each season. This allows us to compare the teams across different eras, having as a base their own contemporary players. 4 Results The results show visualizations proposed as models to compare temporal data in different analyses for the NBA. 4.1 3-Point evolution per Position. Figure 3 shows the first visualization proposed is an animated bubble graph on a cartesian plane. The x-axis is depicting the 3-point made while the y-axis depicts the 3-point attempted. In addition, the size of the bubble reflects the number of players in each position. As time advances the bubbles move along the plane and increase/decrease in size as the positions evolve. The purpose of this visualization is to allow the data to tell a visual story of the evolution of these clusters. This should serve as an exploratory visualization to move into a more specialized visualization. The user can hover over the visualization to gain exact information about the cluster. In addition, the slider below the graph can be used to control de animation, pause, resume, and move to a specific year of interest. The labels on the right are also interactive, as they serve as a filter to activate/deactivate series in the visualization. Figure 4 shows the second type of visualization proposed for this analysis, an animated line graph with a forecast of the expected evolution of the 3-point made in the next 10 years. The purpose of this visualization is to provide a specialized view of the evolution of the 3-point per cluster position as well as to provide an estimate of the projections for the next 10 years. The user can hover over the lines to get a unified detail of the exact values of all cluster positions for a specific season. In addition, the range slider below the graph allows to zoom into a specific timeframe of interest. 4.2 Michael Jordan vs LeBron James Career and Teams. Figure 5 shows the visualization type proposed for this analysis applied to the PER statistic. This visualization is a column chart combined with a line graph, the columns are used for the teammates while the line is used for the superstars. In addition, markers can be used to mark relevant events in their careers, for instance, the championships are coded as stars over the line graphs. A deactivated series per superstar is added to provide an exact depiction of the difference between the PER of the superstar and the average PER of his teammates. The user can activate/deactivate any series by clicking on the label's name. The user can hover over the elements to get detailed information, for example, the columns provide extra information on the name of the team, the year of the season, and the exact PER value. positions for a specific season. It can be noted that the 1995-1997 Bulls teams had a high efficiency, while only the 2015 Cavalier had a comparable high efficiency. Figure 6 shows the same type of visualization used to analyze the Defensive box plus minus. It is evident that the 1995-1998 Bulls were great defensive teams, and that Michael relied on them on the defensive side for the second three-peat. On the other hand, LeBron had several good defensive teams, although he was far and beyond the average of his teammates being a solid defensive contributor for all his championships. Figure 7 shows the performance of the superstars and their teams using the Offensive box plus minus. Figure 7 shows that both superstars were solid offensive contributors to their teams. It can be argued that Michael had the better offensive teammates, from 1995 to 1997. LeBron’s best offensive teammates were those Cavaliers from 2015 to 2017. Figure 8 shows the same type of visualization used to analyze the overall Box plus-minus. Overall, Michael had the best teammates according to the league’s BMP. Michael was still a strong contributor on his second three-peat, however, his impact was diminishing toward the end of his career, while LeBron’s impact remained high at the end of his first 15 years."
1,"* In the introduction section:  + Revise the last paragraph to read: ""Comparison between errors require a deeper analytics into the change of rules and players positions."" * In section 4.2, replace all instances of ""PER"" with ""player efficiency rating (PER)"".","Visual Historical Analysis of the NBA: Final Project Abstract This document describes the implementation of several temporal visualizations to analyze time-sensitive data from the NBA. It contains a description of the machine learning models implemented to analyze the data and prepare it for the visual analysis. In particular it covers the implementation of a clustering algorithm and a time series applied to the produced clusters. Keywords: neural clustering, time series, visualizations, NBA. 1 Introduction The game of basketball was first invented in 1891 in Springfield, Massachusetts in a small college, fast forward to today is a global game with more than 200 countries with national teams (McCuaig, 2016). The National Basketball Association (NBA) is the most popular basketball league, with more than 400 players and a 10 billion revenue in 2022 (Ozanian & Teitelbaum, 2022). The history of the NBA started in 1949 when the Basketball Association of America (BAA) and the National Basketball League (NBL) merged into one single league (History.com Editors, 2022). Since then, not only the teams and league has evolved but also how the game is played. It is common to refer to the eras of the NBA, the dominance of the big men, and currently the 3-point evolution. The 3-point line was introduced in 1979 as a response to the dominance of big men around the hoop, clogging the painted area and making the pace of the game slow and unattracted to some (DuPree, 1979). The main idea was to space the player around the court to allow for better tactical styles. Grating an extra point compared to any other field goal, the idea was to entice players to take more 3-points to gain the extra point. The height or the position of the player now doesn’t play as big as a factor in the ability to score. Besides the eras in the NBA, the history of the league can’t be written without its stars. Arguably the two biggest stars in modern history are Michael Jordan and Lebron James. Their curriculum is well-known by every fan of the NBA, often posting the question: who is the greatest of all time? Although the question is interesting to spark debates, the truth is that it is very hard to compare players who played in different eras and different positions. Basketball is a team game, having contributions from starters and bench players. The second objective of this paper is to present a comparison of Michael Jordan's and Lebron James' careers in terms of the contribution of their teams, trying to answer, who had more help. Comparison between errors require a deeper analytics into the change of rules and players positions. 2 Background The purpose of this project is to analyze historical data on the NBA, proposing visualization and machine learning techniques to analyze data that has latent variables related to the game. This latent variable refers to intangibles that cannot be measured like style, rules, and the evolution of the game from season to season. The first analysis is focused on the 3-point revolution effects on the roles and positions of basketball from the 1979-198 season to the 2021-2022 season and beyond. The second analysis focuses on comparing the career and team performance of two superstars playing in two different eras: Michael Jordan from 1984 to 2003 and LeBron James from 2004 to 2018. Both analyses pose similar challenges in having temporal data, statistics, and relative comparisons. Analyzing the player's performance per position poses several difficulties. The traditional positions although well-known have never been strictly defined nor enforced. For instance, a center may be known to play close to the basket and to rebound and score, it is not enforced that a player listed as a center cannot break that model. This is more prevalent as for the last two decades players have been more skilled and can play differently. In addition, teams can list their players without any control or consistency. In fact, many players are listed differently from season to season. For example, LeBron James has been listed as both a point guard and a small forward while Stephen Curry was classified as a point guard for the first three seasons, then listed as a shooting guard. There are several works done to address the role or position of a player in the NBA using statistics. One approach is to use the regular statistics used in the box score to measure players to create clusters, however, since the analysis in this project is temporal a decision must be made to create clusters per season, or one single mega-cluster as proposed by Hedquist (Hedquist, 2022). Clustering is an unsupervised machine-learning technique used to find patterns in the data and classify them. There are hierarchical and non-hierarchical models that can be used, each with advantages and disadvantages to this project. On the other hand, there is a myriad of comparisons between Michael Jordan versus LeBron James, most of them focus on single-point statistics to summarize their entire career. This could be simple statistics like average points per game, or advanced statistics like field goal percentage in winning shots. These statistics are not doing service to the complexity of the game, the situations each faced, the teammates, or the opponents. Although these statistics are useful for this project analysis, the purpose is to create a side-by-side comparison of their career and their teammates. This is achieved using a temporal visualization of advanced statistics while making a distinction between the superstar and their teammates. 3 Approach Section 3.1 presents a description of the datasets used for this project. Section 3.2 and 3.3 describes the data preprocessing and machine learning techniques applied to the datasets to achieve the desired visualizations. 3.1 Datasets This project uses data from the 1979-1980 season to the 2021-2022 season from ESPN analytics. Three datasets were used, the player's season totals, the player's advanced season totals, and the NBA league index. Table 1 describes the features used from the player's season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name positions Player Position games_played Games Played in the season FG Field Goals made FGA Field Goals attempts 3P 3-Points made 3PA 3-points attempts FT Free Throws made FTA Free Throws attempts AST Assists STL Steals BLK Blocks TOV Turnovers PF Personal Fouls PTS Points season Year season ended TRB Total Rebounds Table 1. The player's season totals dataset. Table 2 describes the features used from the player's advanced season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name PER Player Efficiency Rating (PER): A measure of per-minute production standardized such that the league average is 15. OBPM Offensive Box Plus/Minus: box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team. DBPM Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team. BPM Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. season Year season ended Table 2. The player's advanced season totals dataset. Finally, the NBA league index dataset was used only to cross-reference the championship winners for each season. 3.2 3-Point evolution per Position. The first step in the pipeline implemented was to transform the player's season totals to per-game statistics to normalize each player's production to the games he played. Additionally, since players can be traded between a season, those would have several inputs in the dataset within a season. An aggregation of these players' statistics was done to add up their performance in a season to have a single input per player per season. This proposed dataset is then used for the first analysis of the 3-point impact in the positions. The second step is to define a consistent position in the NBA. As discussed in Section 2, clustering presents an alternative to classify players according to their production on the court. A hierarchical clustering algorithm was chosen due to its ability to consider outliers without skewing the clusters. To apply these models, the number of clusters must be determined, to do so a dendrogram can be used. Figure 1 shows the dendrogram for all the players from 1979 to 2022. The obvious option would be to use 5 clusters, however, upon testing this configuration produced cluster describing primarily the role of bench, star, or superstar. Since the idea is to have a cluster that also describes the way of contributing, 9 clusters were chosen just as Hedquist proposed. The algorithm used was a Ward, an agglomerative hierarchical clustering used as a mega-cluster for all seasons. Figure 2 shows the cluster's percentile per each of the statistics used for the 2021-2022 season. A descriptive name was given for each cluster using both Figure 2 and the prominent players in each cluster for the 2021-2022 season. Table 3 shows the names assigned and the prominent players per cluster in the 2021-2022 season. Id Cluster Name Prominent Players 0 Score First Carmelo Anthony, Andrew Wiggins, Bogdan Bogdanović 1 Superstars LeBron James, Stephen Curry, Giannis Antetokounmpo, Joel Embiid 2 Two-way players Alex Caruso, Patrick Beverle 3 Defensive Big Men Steven Adams, Dwight Howard, JaVale McGee 4 Bench Role Players Thanasis Antetokounmpo, Santi Aldama 5 Interior Big Men LaMarcus Aldridge, Rudy Gobert, Deandre Ayton 6 Deep Bench Bol Bol, Moses Wright 7 Scoring Big Men Bam Adebayo, Evan Mobley, Jusuf Nurkić 8 Bench Perimeter Scorers Lou Williams, Avery Bradley, Trevor Ariza Table 3. Cluster naming and prominent players. To predict the trend and forecast the 3-point evolution in the upcoming seasons the time series algorithm SARIMAX was implemented. This algorithm can handle both seasonal and non-seasonal data. The parameters and seasonal parameters for this model were chosen per cluster performing a grid search minimizing the Akaike information criteria (AIC). 3.3 Michael Jordan vs LeBron James Career and Teams. To produce this visual analysis the Players advanced totals dataset was preprocessed to filter only Michael’s and LeBron’s teams. Both superstars were subtracted from their teams, to have a clear comparison on how much their teams were dependent on them to thrive. It's worth noting that this analysis was limited to the top 6 teammates with the most minutes played per season. This was done to avoid skewing the statistics with bench players since there are more bench players in the NBA now than when Michael played. It is also true that in playoffs the rotation of players tends to close on 6 to 9 players depending on the coach. To be able to compare their career, this analysis focuses only on the first 15 seasons of each player. The advanced statistics are useful to sum up the player's impact on the team in one aspect of the game. For example, the player efficiency rating (PER) “sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."" According to its creator John Hollinger (ESPN analytics, 2022). These statistics are adjusted around a league average, giving a direct indication of whether the players are above or below the average for each season. This allows us to compare the teams across different eras, having as a base their own contemporary players. 4 Results The results show visualizations proposed as models to compare temporal data in different analyses for the NBA. 4.1 3-Point evolution per Position. Figure 3 shows the first visualization proposed is an animated bubble graph on a cartesian plane. The x-axis is depicting the 3-point made while the y-axis depicts the 3-point attempted. In addition, the size of the bubble reflects the number of players in each position. As time advances the bubbles move along the plane and increase/decrease in size as the positions evolve. The purpose of this visualization is to allow the data to tell a visual story of the evolution of these clusters. This should serve as an exploratory visualization to move into a more specialized visualization. The user can hover over the visualization to gain exact information about the cluster. In addition, the slider below the graph can be used to control de animation, pause, resume, and move to a specific year of interest. The labels on the right are also interactive, as they serve as a filter to activate/deactivate series in the visualization. Figure 4 shows the second type of visualization proposed for this analysis, an animated line graph with a forecast of the expected evolution of the 3-point made in the next 10 years. The purpose of this visualization is to provide a specialized view of the evolution of the 3-point per cluster position as well as to provide an estimate of the projections for the next 10 years. The user can hover over the lines to get a unified detail of the exact values of all cluster positions for a specific season. In addition, the range slider below the graph allows to zoom into a specific timeframe of interest. 4.2 Michael Jordan vs LeBron James Career and Teams. Figure 5 shows the visualization type proposed for this analysis applied to the player efficiency rating (PER) statistic. This visualization is a column chart combined with a line graph, the columns are used for the teammates while the line is used for the superstars. In addition, markers can be used to mark relevant events in their careers, for instance, the championships are coded as stars over the line graphs. A deactivated series per superstar is added to provide an exact depiction of the difference between the player efficiency rating (PER) of the superstar and the average player efficiency rating (PER) of his teammates. The user can activate/deactivate any series by clicking on the label's name. The user can hover over the elements to get detailed information, for example, the columns provide extra information on the name of the team, the year of the season, and the exact player efficiency rating (PER) value. positions for a specific season. It can be noted that the 1995-1997 Bulls teams had a high efficiency, while only the 2015 Cavalier had a comparable high efficiency. Figure 6 shows the same type of visualization used to analyze the Defensive box plus minus. It is evident that the 1995-1998 Bulls were great defensive teams, and that Michael relied on them on the defensive side for the second three-peat. On the other hand, LeBron had several good defensive teams, although he was far and beyond the average of his teammates being a solid defensive contributor for all his championships. Figure 7 shows the performance of the superstars and their teams using the Offensive box plus minus. Figure 7 shows that both superstars were solid offensive contributors to their teams. It can be argued that Michael had the better offensive teammates, from 1995 to 1997. LeBron’s best offensive teammates were those Cavaliers from 2015 to 2017. Figure 8 shows the same type of visualization used to analyze the overall Box plus-minus. Overall, Michael had the best teammates according to the league’s BMP. Michael was still a strong contributor on his second three-peat, however, his impact was diminishing toward the end of his career, while LeBron’s impact remained high at the end of his first 15 years."
2,"* In the abstract section:  + Add the following keywords: dendrogram, animated bubbles, animated lines, bars, and lines. * In the background section:  + Add the 5 traditional positions (center, point guard, shooting guard, small forward, and forward) in parentheses after the phrase ""well-known positions"" in the second paragraph. * Add a paragraph to the end of section 3.2 explaining what SARIMAX and AIC algorithms are. * In section 3.3, add a sentence to the first paragraph explaining that the first 15 seasons were chosen because that is the amount of seasons Michael Jordan played.","Abstract This document describes the implementation of several temporal visualizations to analyze time-sensitive data from the NBA. It contains a description of the machine learning models implemented to analyze the data and prepare it for the visual analysis. In particular it covers the implementation of a clustering algorithm and a time series applied to the produced clusters. Keywords: neural clustering, time series, visualizations, NBA, dendrogram, animated bubbles, animated lines, bars, lines. 1 Introduction The game of basketball was first invented in 1891 in Springfield, Massachusetts in a small college, fast forward to today is a global game with more than 200 countries with national teams (McCuaig, 2016). The National Basketball Association (NBA) is the most popular basketball league, with more than 400 players and a 10 billion revenue in 2022 (Ozanian & Teitelbaum, 2022). The history of the NBA started in 1949 when the Basketball Association of America (BAA) and the National Basketball League (NBL) merged into one single league (History.com Editors, 2022). Since then, not only the teams and league has evolved but also how the game is played. It is common to refer to the eras of the NBA, the dominance of the big men, and currently the 3-point evolution. The 3-point line was introduced in 1979 as a response to the dominance of big men around the hoop, clogging the painted area and making the pace of the game slow and unattracted to some (DuPree, 1979). The main idea was to space the player around the court to allow for better tactical styles. Grating an extra point compared to any other field goal, the idea was to entice players to take more 3-points to gain the extra point. The height or the position of the player now doesn’t play as big as a factor in the ability to score. Besides the eras in the NBA, the history of the league can’t be written without its stars. Arguably the two biggest stars in modern history are Michael Jordan and Lebron James. Their curriculum is well-known by every fan of the NBA, often posting the question: who is the greatest of all time? Although the question is interesting to spark debates, the truth is that it is very hard to compare players who played in different eras and different positions (center, point guard, shooting guard, small forward, and forward). Basketball is a team game, having contributions from starters and bench players. The second objective of this paper is to present a comparison of Michael Jordan's and Lebron James' careers in terms of the contribution of their teams, trying to answer, who had more help. Comparison between errors require a deeper analytics into the change of rules and players positions. 2 Background The purpose of this project is to analyze historical data on the NBA, proposing visualization and machine learning techniques to analyze data that has latent variables related to the game. This latent variable refers to intangibles that cannot be measured like style, rules, and the evolution of the game from season to season. The first analysis is focused on the 3-point revolution effects on the roles and positions of basketball from the 1979-198 season to the 2021-2022 season and beyond. The second analysis focuses on comparing the career and team performance of two superstars playing in two different eras: Michael Jordan from 1984 to 2003 and LeBron James from 2004 to 2018. Both analyses pose similar challenges in having temporal data, statistics, and relative comparisons. Analyzing the player's performance per position poses several difficulties. The traditional positions although well-known (center, point guard, shooting guard, small forward, and forward) have never been strictly defined nor enforced. For instance, a center may be known to play close to the basket and to rebound and score, it is not enforced that a player listed as a center cannot break that model. This is more prevalent as for the last two decades players have been more skilled and can play differently. In addition, teams can list their players without any control or consistency. In fact, many players are listed differently from season to season. For example, LeBron James has been listed as both a point guard and a small forward while Stephen Curry was classified as a point guard for the first three seasons, then listed as a shooting guard. There are several works done to address the role or position of a player in the NBA using statistics. One approach is to use the regular statistics used in the box score to measure players to create clusters, however, since the analysis in this project is temporal a decision must be made to create clusters per season, or one single mega-cluster as proposed by Hedquist (Hedquist, 2022). Clustering is an unsupervised machine-learning technique used to find patterns in the data and classify them. There are hierarchical and non-hierarchical models that can be used, each with advantages and disadvantages to this project. On the other hand, there is a myriad of comparisons between Michael Jordan versus LeBron James, most of them focus on single-point statistics to summarize their entire career. This could be simple statistics like average points per game, or advanced statistics like field goal percentage in winning shots. These statistics are not doing service to the complexity of the game, the situations each faced, the teammates, or the opponents. Although these statistics are useful for this project analysis, the purpose is to create a side-by-side comparison of their career and their teammates. This is achieved using a temporal visualization of advanced statistics while making a distinction between the superstar and their teammates. 3 Approach Section 3.1 presents a description of the datasets used for this project. Section 3.2 and 3.3 describes the data preprocessing and machine learning techniques applied to the datasets to achieve the desired visualizations. 3.1 Datasets This project uses data from the 1979-1980 season to the 2021-2022 season from ESPN analytics. Three datasets were used, the player's season totals, the player's advanced season totals, and the NBA league index. Table 1 describes the features used from the player's season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name positions Player Position games_played Games Played in the season FG Field Goals made FGA Field Goals attempts 3P 3-Points made 3PA 3-points attempts FT Free Throws made FTA Free Throws attempts AST Assists STL Steals BLK Blocks TOV Turnovers PF Personal Fouls PTS Points season Year season ended TRB Total Rebounds Table 1. The player's season totals dataset. Table 2 describes the features used from the player's advanced season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name PER Player Efficiency Rating (PER): A measure of per-minute production standardized such that the league average is 15. OBPM Offensive Box Plus/Minus: box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team. DBPM Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team. BPM Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. season Year season ended Table 2. The player's advanced season totals dataset. Finally, the NBA league index dataset was used only to cross-reference the championship winners for each season. 3.2 3-Point evolution per Position. The first step in the pipeline implemented was to transform the player's season totals to per-game statistics to normalize each player's production to the games he played. Additionally, since players can be traded between a season, those would have several inputs in the dataset within a season. An aggregation of these players' statistics was done to add up their performance in a season to have a single input per player per season. This proposed dataset is then used for the first analysis of the 3-point impact in the positions. The second step is to define a consistent position in the NBA. As discussed in Section 2, clustering presents an alternative to classify players according to their production on the court. A hierarchical clustering algorithm was chosen due to its ability to consider outliers without skewing the clusters. To apply these models, the number of clusters must be determined, to do so a dendrogram can be used. Figure 1 shows the dendrogram for all the players from 1979 to 2022. The obvious option would be to use 5 clusters, however, upon testing this configuration produced cluster describing primarily the role of bench, star, or superstar. Since the idea is to have a cluster that also describes the way of contributing, 9 clusters were chosen just as Hedquist proposed. The algorithm used was a Ward, an agglomerative hierarchical clustering used as a mega-cluster for all seasons. Figure 2 shows the cluster's percentile per each of the statistics used for the 2021-2022 season. A descriptive name was given for each cluster using both Figure 2 and the prominent players in each cluster for the 2021-2022 season. Table 3 shows the names assigned and the prominent players per cluster in the 2021-2022 season. Id Cluster Name Prominent Players 0 Score First Carmelo Anthony, Andrew Wiggins, Bogdan Bogdanović 1 Superstars LeBron James, Stephen Curry, Giannis Antetokounmpo, Joel Embiid 2 Two-way players Alex Caruso, Patrick Beverle 3 Defensive Big Men Steven Adams, Dwight Howard, JaVale McGee 4 Bench Role Players Thanasis Antetokounmpo, Santi Aldama 5 Interior Big Men LaMarcus Aldridge, Rudy Gobert, Deandre Ayton 6 Deep Bench Bol Bol, Moses Wright 7 Scoring Big Men Bam Adebayo, Evan Mobley, Jusuf Nurkić 8 Bench Perimeter Scorers Lou Williams, Avery Bradley, Trevor Ariza Table 3. Cluster naming and prominent players. To predict the trend and forecast the 3-point evolution in the upcoming seasons the time series algorithm SARIMAX was implemented. This algorithm can handle both seasonal and non-seasonal data. The parameters and seasonal parameters for this model were chosen per cluster performing a grid search minimizing the Akaike information criteria (AIC). SARIMAX is a type of time series algorithm that combines the features of ARIMA (AutoRegressive Integrated Moving Average) and Exogenous variables. It is used to forecast future values in a time series based on past values. AIC (Akaike information criterion) is a measure of the relative quality of a statistical model for a given set of data. It is used to select the best model among a set of candidate models. 3.3 Michael Jordan vs LeBron James Career and Teams. To produce this visual analysis the Players advanced totals dataset was preprocessed to filter only Michael’s and LeBron’s teams. Both superstars were subtracted from their teams, to have a clear comparison on how much their teams were dependent on them to thrive. It's worth noting that this analysis was limited to the top 6 teammates with the most minutes played per season. This was done to avoid skewing the statistics with bench players since there are more bench players in the NBA now than when Michael played. It is also true that in playoffs the rotation of players tends to close on 6 to 9 players depending on the coach. To be able to compare their career, this analysis focuses only on the first 15 seasons of each player, as Michael Jordan played for 15 seasons and LeBron James played for 18 seasons, but the first 15 seasons were chosen because that is the amount of seasons Michael Jordan played. The advanced statistics are useful to sum up the player's impact on the team in one aspect of the game. For example, the player efficiency rating (PER) “sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."" According to its creator John Hollinger (ESPN analytics, 2022). These statistics are adjusted around a league average, giving a direct indication of whether the players are above or below the average for each season. This allows us to compare the teams across different eras, having as a base their own contemporary players. 4 Results The results show visualizations proposed as models to compare temporal data in different analyses for the NBA. 4.1 3-Point evolution per Position. Figure 3 shows the first visualization proposed is an animated bubble graph on a cartesian plane. The x-axis is depicting the 3-point made while the y-axis depicts the 3-point attempted. In addition, the size of the bubble reflects the number of players in each position. As time advances the bubbles move along the plane and increase/decrease in size as the positions evolve. The purpose of this visualization is to allow the data to tell a visual story of the evolution of these clusters. This should serve as an exploratory visualization to move into a more specialized visualization. The user can hover over the visualization to gain exact information about the cluster. In addition, the slider below the graph can be used to control de animation, pause, resume, and move to a specific year of interest. The labels on the right are also interactive, as they serve as a filter to activate/deactivate series in the visualization. Figure 4 shows the second type of visualization proposed for this analysis, an animated line graph with a forecast of the expected evolution of the 3-point made in the next 10 years. The purpose of this visualization is to provide a specialized view of the evolution of the 3-point per cluster position as well as to provide an estimate of the projections for the next 10 years. The user can hover over the lines to get a unified detail of the exact values of all cluster positions for a specific season. In addition, the range slider below the graph allows to zoom into a specific timeframe of interest. 4.2 Michael Jordan vs LeBron James Career and Teams. Figure 5 shows the visualization type proposed for this analysis applied to the player efficiency rating (PER) statistic. This visualization is a column chart combined with a line graph, the columns are used for the teammates while the line is used for the superstars. In addition, markers can be used to mark relevant events in their careers, for instance, the championships are coded as stars over the line graphs. A deactivated series per superstar is added to provide an exact depiction of the difference between the player efficiency rating (PER) of the superstar and the average player efficiency rating (PER) of his teammates. The user can activate/deactivate any series by clicking on the label's name. The user can hover over the elements to get detailed information, for example, the columns provide extra information on the name of the team, the year of the season, and the exact player efficiency rating (PER) value. positions for a specific season. It can be noted that the 1995-1997 Bulls teams had a high efficiency, while only the 2015 Cavalier had a comparable high efficiency. Figure 6 shows the same type of visualization used to analyze the Defensive box plus minus. It is evident that the 1995-1998 Bulls were great defensive teams, and that Michael relied on them on the defensive side for the second three-peat. On the other hand, LeBron had several good defensive teams, although he was far and beyond the average of his teammates being a solid defensive contributor for all his championships. Figure 7 shows the performance of the superstars and their teams using the Offensive box plus minus. Figure 7 shows that both superstars were solid offensive contributors to their teams. It can be argued that Michael had the better offensive teammates, from 1995 to 1997. LeBron’s best offensive teammates were those Cavaliers from 2015 to 2017. Figure 8 shows the same type of visualization used to analyze the overall Box plus-minus. Overall, Michael had the best teammates according to the league’s BMP. Michael was still a strong contributor on his second three-peat, however, his impact was diminishing toward the end of his career, while LeBron’s impact remained high at the end of his first 15 years."


This is the comparision between the orignal document and the final modified text after applying all the changes.

In [40]:
df = pd.DataFrame([(prev_doc_text, doc_text_sets)], columns=['Original', 'Modified'])
pretty_print(df)

Unnamed: 0,Original,Modified
0,"Visual Historical Analysis of the NBA: Final Project Abstract This document describes the implementation of several temporal visualizations to analyze time-sensitive data from the NBA. It contains a description of the machine learning models implemented to analyze the data and prepare it for the visual analysis. In particular it covers the implementation of a clustering algorithm and a time series applied to the produced clusters. Keywords: neural clustering, time series, visualizations, NBA. 1 Introduction The game of basketball was first invented in 1891 in Springfield, Massachusetts in a small college, fast forward to today is a global game with more than 200 countries with national teams (McCuaig, 2016). The National Basketball Association (NBA) is the most popular basketball league, with more than 400 players and a 10 billion revenue in 2022 (Ozanian & Teitelbaum, 2022). The history of the NBA started in 1949 when the Basketball Association of America (BAA) and the National Basketball League (NBL) merged into one single league (History.com Editors, 2022). Since then, not only the teams and league has evolved but also how the game is played. It is common to refer to the eras of the NBA, the dominance of the big men, and currently the 3-point evolution. The 3-point line was introduced in 1979 as a response to the dominance of big men around the hoop, clogging the painted area and making the pace of the game slow and unattracted to some (DuPree, 1979). The main idea was to space the player around the court to allow for better tactical styles. Grating an extra point compared to any other field goal, the idea was to entice players to take more 3-points to gain the extra point. The height or the position of the player now doesn’t play as big as a factor in the ability to score. Besides the eras in the NBA, the history of the league can’t be written without its stars. Arguably the two biggest stars in modern history are Michael Jordan and Lebron James. Their curriculum is well-known by every fan of the NBA, often posting the question: who is the greatest of all time? Although the question is interesting to spark debates, the truth is that it is very hard to compare players who played in different eras and different positions. Basketball is a team game, having contributions from starters and bench players. The second objective of this paper is to present a comparison of Michael Jordan's and Lebron James' careers in terms of the contribution of their teams, trying to answer, who had more help. 2 Background The purpose of this project is to analyze historical data on the NBA, proposing visualization and machine learning techniques to analyze data that has latent variables related to the game. This latent variable refers to intangibles that cannot be measured like style, rules, and the evolution of the game from season to season. The first analysis is focused on the 3-point revolution effects on the roles and positions of basketball from the 1979-198 season to the 2021-2022 season and beyond. The second analysis focuses on comparing the career and team performance of two superstars playing in two different eras: Michael Jordan from 1984 to 2003 and LeBron James from 2004 to 2018. Both analyses pose similar challenges in having temporal data, statistics, and relative comparisons. Analyzing the player's performance per position poses several difficulties. The traditional positions although well-known have never been strictly defined nor enforced. For instance, a center may be known to play close to the basket and to rebound and score, it is not enforced that a player listed as a center cannot break that model. This is more prevalent as for the last two decades players have been more skilled and can play differently. In addition, teams can list their players without any control or consistency. In fact, many players are listed differently from season to season. For example, LeBron James has been listed as both a point guard and a small forward while Stephen Curry was classified as a point guard for the first three seasons, then listed as a shooting guard. There are several works done to address the role or position of a player in the NBA using statistics. One approach is to use the regular statistics used in the box score to measure players to create clusters, however, since the analysis in this project is temporal a decision must be made to create clusters per season, or one single mega-cluster as proposed by Hedquist (Hedquist, 2022). Clustering is an unsupervised machine-learning technique used to find patterns in the data and classify them. There are hierarchical and non-hierarchical models that can be used, each with advantages and disadvantages to this project. On the other hand, there is a myriad of comparisons between Michael Jordan versus LeBron James, most of them focus on single-point statistics to summarize their entire career. This could be simple statistics like average points per game, or advanced statistics like field goal percentage in winning shots. These statistics are not doing service to the complexity of the game, the situations each faced, the teammates, or the opponents. Although these statistics are useful for this project analysis, the purpose is to create a side-by-side comparison of their career and their teammates. This is achieved using a temporal visualization of advanced statistics while making a distinction between the superstar and their teammates. 3 Approach Section 3.1 presents a description of the datasets used for this project. Section 3.2 and 3.3 describes the data preprocessing and machine learning techniques applied to the datasets to achieve the desired visualizations. 3.1 Datasets This project uses data from the 1979-1980 season to the 2021-2022 season from Basketball-Reference. Three datasets were used, the player's season totals, the player's advanced season totals, and the NBA league index. Table 1 describes the features used from the player's season totals dataset from Basketball-Reference. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name positions Player Position games_played Games Played in the season FG Field Goals made FGA Field Goals attempts 3P 3-Points made 3PA 3-points attempts FT Free Throws made FTA Free Throws attempts ORB Offensive Rebounds DRB Defensive Rebounds AST Assists STL Steals BLK Blocks TOV Turnovers PF Personal Fouls PTS Points season Year season ended Table 1. The player's season totals dataset. Table 2 describes the features used from the player's advanced season totals dataset from Basketball-Reference. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name PER Player Efficiency Rating: A measure of per-minute production standardized such that the league average is 15. OBPM Offensive Box Plus/Minus: box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team. DBPM Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team. BPM Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. season Year season ended Table 2. The player's advanced season totals dataset. Finally, the NBA league index dataset was used only to cross-reference the championship winners for each season. 3.2 3-Point evolution per Position. The first step in the pipeline implemented was to transform the player's season totals to per-game statistics to normalize each player's production to the games he played. Additionally, since players can be traded between a season, those would have several inputs in the dataset within a season. An aggregation of these players' statistics was done to add up their performance in a season to have a single input per player per season. This proposed dataset is then used for the first analysis of the 3-point impact in the positions. The second step is to define a consistent position in the NBA. As discussed in Section 2, clustering presents an alternative to classify players according to their production on the court. A hierarchical clustering algorithm was chosen due to its ability to consider outliers without skewing the clusters. To apply these models, the number of clusters must be determined, to do so a dendrogram can be used. Figure 1 shows the dendrogram for all the players from 1979 to 2022. The obvious option would be to use 5 clusters, however, upon testing this configuration produced cluster describing primarily the role of bench, star, or superstar. Since the idea is to have a cluster that also describes the way of contributing, 9 clusters were chosen just as Hedquist proposed. The algorithm used was a Ward, an agglomerative hierarchical clustering used as a mega-cluster for all seasons. Figure 2 shows the cluster's percentile per each of the statistics used for the 2021-2022 season. A descriptive name was given for each cluster using both Figure 2 and the prominent players in each cluster for the 2021-2022 season. Table 3 shows the names assigned and the prominent players per cluster in the 2021-2022 season. Id Cluster Name Prominent Players 0 Score First Carmelo Anthony, Andrew Wiggins, Bogdan Bogdanović 1 Superstars LeBron James, Stephen Curry, Giannis Antetokounmpo, Joel Embiid 2 Two-way players Alex Caruso, Patrick Beverle 3 Defensive Big Men Steven Adams, Dwight Howard, JaVale McGee 4 Bench Role Players Thanasis Antetokounmpo, Santi Aldama 5 Interior Big Men LaMarcus Aldridge, Rudy Gobert, Deandre Ayton 6 Deep Bench Bol Bol, Moses Wright 7 Scoring Big Men Bam Adebayo, Evan Mobley, Jusuf Nurkić 8 Bench Perimeter Scorers Lou Williams, Avery Bradley, Trevor Ariza Table 3. Cluster naming and prominent players. To predict the trend and forecast the 3-point evolution in the upcoming seasons the time series algorithm SARIMAX was implemented. This algorithm can handle both seasonal and non-seasonal data. The parameters and seasonal parameters for this model were chosen per cluster performing a grid search minimizing the Akaike information criteria (AIC). 3.3 Michael Jordan vs LeBron James Career and Teams. To produce this visual analysis the Players advanced totals dataset was preprocessed to filter only Michael’s and LeBron’s teams. Both superstars were subtracted from their teams, to have a clear comparison on how much their teams were dependent on them to thrive. It's worth noting that this analysis was limited to the top 6 teammates with the most minutes played per season. This was done to avoid skewing the statistics with bench players since there are more bench players in the NBA now than when Michael played. It is also true that in playoffs the rotation of players tends to close on 6 to 9 players depending on the coach. To be able to compare their career, this analysis focuses only on the first 15 seasons of each player. The advanced statistics are useful to sum up the player's impact on the team in one aspect of the game. For example, the Player Efficiency Rating (PER) “sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."" According to its creator John Hollinger (Basketball Reference, 2022). These statistics are adjusted around a league average, giving a direct indication of whether the players are above or below the average for each season. This allows us to compare the teams across different eras, having as a base their own contemporary players. 4 Results The results show visualizations proposed as models to compare temporal data in different analyses for the NBA. 4.1 3-Point evolution per Position. Figure 3 shows the first visualization proposed is an animated bubble graph on a cartesian plane. The x-axis is depicting the 3-point made while the y-axis depicts the 3-point attempted. In addition, the size of the bubble reflects the number of players in each position. As time advances the bubbles move along the plane and increase/decrease in size as the positions evolve. The purpose of this visualization is to allow the data to tell a visual story of the evolution of these clusters. This should serve as an exploratory visualization to move into a more specialized visualization. The user can hover over the visualization to gain exact information about the cluster. In addition, the slider below the graph can be used to control de animation, pause, resume, and move to a specific year of interest. The labels on the right are also interactive, as they serve as a filter to activate/deactivate series in the visualization. Figure 4 shows the second type of visualization proposed for this analysis, an animated line graph with a forecast of the expected evolution of the 3-point made in the next 10 years. The purpose of this visualization is to provide a specialized view of the evolution of the 3-point per cluster position as well as to provide an estimate of the projections for the next 10 years. The user can hover over the lines to get a unified detail of the exact values of all cluster positions for a specific season. In addition, the range slider below the graph allows to zoom into a specific timeframe of interest. 4.2 Michael Jordan vs LeBron James Career and Teams. Figure 5 shows the visualization type proposed for this analysis applied to the PER statistic. This visualization is a column chart combined with a line graph, the columns are used for the teammates while the line is used for the superstars. In addition, markers can be used to mark relevant events in their careers, for instance, the championships are coded as stars over the line graphs. A deactivated series per superstar is added to provide an exact depiction of the difference between the PER of the superstar and the average PER of his teammates. The user can activate/deactivate any series by clicking on the label's name. The user can hover over the elements to get detailed information, for example, the columns provide extra information on the name of the team, the year of the season, and the exact PER value. positions for a specific season. It can be noted that the 1995-1997 Bulls teams had a high efficiency, while only the 2015 Cavalier had a comparable high efficiency. Figure 6 shows the same type of visualization used to analyze the Defensive box plus minus. It is evident that the 1995-1998 Bulls were great defensive teams, and that Michael relied on them on the defensive side for the second three-peat. On the other hand, LeBron had several good defensive teams, although he was far and beyond the average of his teammates being a solid defensive contributor for all his championships. Figure 7 shows the performance of the superstars and their teams using the Offensive box plus minus. Figure 7 shows that both superstars were solid offensive contributors to their teams. It can be argued that Michael had the better offensive teammates, from 1995 to 1997. LeBron’s best offensive teammates were those Cavaliers from 2015 to 2017. Figure 8 shows the same type of visualization used to analyze the overall Box plus-minus. Overall, Michael had the best teammates according to the league’s BMP. Michael was still a strong contributor on his second three-peat, however, his impact was diminishing toward the end of his career, while LeBron’s impact remained high at the end of his first 15 years.","Abstract This document describes the implementation of several temporal visualizations to analyze time-sensitive data from the NBA. It contains a description of the machine learning models implemented to analyze the data and prepare it for the visual analysis. In particular it covers the implementation of a clustering algorithm and a time series applied to the produced clusters. Keywords: neural clustering, time series, visualizations, NBA, dendrogram, animated bubbles, animated lines, bars, lines. 1 Introduction The game of basketball was first invented in 1891 in Springfield, Massachusetts in a small college, fast forward to today is a global game with more than 200 countries with national teams (McCuaig, 2016). The National Basketball Association (NBA) is the most popular basketball league, with more than 400 players and a 10 billion revenue in 2022 (Ozanian & Teitelbaum, 2022). The history of the NBA started in 1949 when the Basketball Association of America (BAA) and the National Basketball League (NBL) merged into one single league (History.com Editors, 2022). Since then, not only the teams and league has evolved but also how the game is played. It is common to refer to the eras of the NBA, the dominance of the big men, and currently the 3-point evolution. The 3-point line was introduced in 1979 as a response to the dominance of big men around the hoop, clogging the painted area and making the pace of the game slow and unattracted to some (DuPree, 1979). The main idea was to space the player around the court to allow for better tactical styles. Grating an extra point compared to any other field goal, the idea was to entice players to take more 3-points to gain the extra point. The height or the position of the player now doesn’t play as big as a factor in the ability to score. Besides the eras in the NBA, the history of the league can’t be written without its stars. Arguably the two biggest stars in modern history are Michael Jordan and Lebron James. Their curriculum is well-known by every fan of the NBA, often posting the question: who is the greatest of all time? Although the question is interesting to spark debates, the truth is that it is very hard to compare players who played in different eras and different positions (center, point guard, shooting guard, small forward, and forward). Basketball is a team game, having contributions from starters and bench players. The second objective of this paper is to present a comparison of Michael Jordan's and Lebron James' careers in terms of the contribution of their teams, trying to answer, who had more help. Comparison between errors require a deeper analytics into the change of rules and players positions. 2 Background The purpose of this project is to analyze historical data on the NBA, proposing visualization and machine learning techniques to analyze data that has latent variables related to the game. This latent variable refers to intangibles that cannot be measured like style, rules, and the evolution of the game from season to season. The first analysis is focused on the 3-point revolution effects on the roles and positions of basketball from the 1979-198 season to the 2021-2022 season and beyond. The second analysis focuses on comparing the career and team performance of two superstars playing in two different eras: Michael Jordan from 1984 to 2003 and LeBron James from 2004 to 2018. Both analyses pose similar challenges in having temporal data, statistics, and relative comparisons. Analyzing the player's performance per position poses several difficulties. The traditional positions although well-known (center, point guard, shooting guard, small forward, and forward) have never been strictly defined nor enforced. For instance, a center may be known to play close to the basket and to rebound and score, it is not enforced that a player listed as a center cannot break that model. This is more prevalent as for the last two decades players have been more skilled and can play differently. In addition, teams can list their players without any control or consistency. In fact, many players are listed differently from season to season. For example, LeBron James has been listed as both a point guard and a small forward while Stephen Curry was classified as a point guard for the first three seasons, then listed as a shooting guard. There are several works done to address the role or position of a player in the NBA using statistics. One approach is to use the regular statistics used in the box score to measure players to create clusters, however, since the analysis in this project is temporal a decision must be made to create clusters per season, or one single mega-cluster as proposed by Hedquist (Hedquist, 2022). Clustering is an unsupervised machine-learning technique used to find patterns in the data and classify them. There are hierarchical and non-hierarchical models that can be used, each with advantages and disadvantages to this project. On the other hand, there is a myriad of comparisons between Michael Jordan versus LeBron James, most of them focus on single-point statistics to summarize their entire career. This could be simple statistics like average points per game, or advanced statistics like field goal percentage in winning shots. These statistics are not doing service to the complexity of the game, the situations each faced, the teammates, or the opponents. Although these statistics are useful for this project analysis, the purpose is to create a side-by-side comparison of their career and their teammates. This is achieved using a temporal visualization of advanced statistics while making a distinction between the superstar and their teammates. 3 Approach Section 3.1 presents a description of the datasets used for this project. Section 3.2 and 3.3 describes the data preprocessing and machine learning techniques applied to the datasets to achieve the desired visualizations. 3.1 Datasets This project uses data from the 1979-1980 season to the 2021-2022 season from ESPN analytics. Three datasets were used, the player's season totals, the player's advanced season totals, and the NBA league index. Table 1 describes the features used from the player's season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name positions Player Position games_played Games Played in the season FG Field Goals made FGA Field Goals attempts 3P 3-Points made 3PA 3-points attempts FT Free Throws made FTA Free Throws attempts AST Assists STL Steals BLK Blocks TOV Turnovers PF Personal Fouls PTS Points season Year season ended TRB Total Rebounds Table 1. The player's season totals dataset. Table 2 describes the features used from the player's advanced season totals dataset from ESPN analytics. Although there are more features, these were not included as they did not provide information relevant to the analysis done in this project. Column Description name Player’s name PER Player Efficiency Rating (PER): A measure of per-minute production standardized such that the league average is 15. OBPM Offensive Box Plus/Minus: box score estimate of the offensive points per 100 possessions a player contributed above a league-average player, translated to an average team. DBPM Defensive Box Plus/Minus: A box score estimate of the defensive points per 100 possessions a player contributed above a league-average player, translated to an average team. BPM Box Plus/Minus: A box score estimate of the points per 100 possessions a player contributed above a league-average player, translated to an average team. season Year season ended Table 2. The player's advanced season totals dataset. Finally, the NBA league index dataset was used only to cross-reference the championship winners for each season. 3.2 3-Point evolution per Position. The first step in the pipeline implemented was to transform the player's season totals to per-game statistics to normalize each player's production to the games he played. Additionally, since players can be traded between a season, those would have several inputs in the dataset within a season. An aggregation of these players' statistics was done to add up their performance in a season to have a single input per player per season. This proposed dataset is then used for the first analysis of the 3-point impact in the positions. The second step is to define a consistent position in the NBA. As discussed in Section 2, clustering presents an alternative to classify players according to their production on the court. A hierarchical clustering algorithm was chosen due to its ability to consider outliers without skewing the clusters. To apply these models, the number of clusters must be determined, to do so a dendrogram can be used. Figure 1 shows the dendrogram for all the players from 1979 to 2022. The obvious option would be to use 5 clusters, however, upon testing this configuration produced cluster describing primarily the role of bench, star, or superstar. Since the idea is to have a cluster that also describes the way of contributing, 9 clusters were chosen just as Hedquist proposed. The algorithm used was a Ward, an agglomerative hierarchical clustering used as a mega-cluster for all seasons. Figure 2 shows the cluster's percentile per each of the statistics used for the 2021-2022 season. A descriptive name was given for each cluster using both Figure 2 and the prominent players in each cluster for the 2021-2022 season. Table 3 shows the names assigned and the prominent players per cluster in the 2021-2022 season. Id Cluster Name Prominent Players 0 Score First Carmelo Anthony, Andrew Wiggins, Bogdan Bogdanović 1 Superstars LeBron James, Stephen Curry, Giannis Antetokounmpo, Joel Embiid 2 Two-way players Alex Caruso, Patrick Beverle 3 Defensive Big Men Steven Adams, Dwight Howard, JaVale McGee 4 Bench Role Players Thanasis Antetokounmpo, Santi Aldama 5 Interior Big Men LaMarcus Aldridge, Rudy Gobert, Deandre Ayton 6 Deep Bench Bol Bol, Moses Wright 7 Scoring Big Men Bam Adebayo, Evan Mobley, Jusuf Nurkić 8 Bench Perimeter Scorers Lou Williams, Avery Bradley, Trevor Ariza Table 3. Cluster naming and prominent players. To predict the trend and forecast the 3-point evolution in the upcoming seasons the time series algorithm SARIMAX was implemented. This algorithm can handle both seasonal and non-seasonal data. The parameters and seasonal parameters for this model were chosen per cluster performing a grid search minimizing the Akaike information criteria (AIC). SARIMAX is a type of time series algorithm that combines the features of ARIMA (AutoRegressive Integrated Moving Average) and Exogenous variables. It is used to forecast future values in a time series based on past values. AIC (Akaike information criterion) is a measure of the relative quality of a statistical model for a given set of data. It is used to select the best model among a set of candidate models. 3.3 Michael Jordan vs LeBron James Career and Teams. To produce this visual analysis the Players advanced totals dataset was preprocessed to filter only Michael’s and LeBron’s teams. Both superstars were subtracted from their teams, to have a clear comparison on how much their teams were dependent on them to thrive. It's worth noting that this analysis was limited to the top 6 teammates with the most minutes played per season. This was done to avoid skewing the statistics with bench players since there are more bench players in the NBA now than when Michael played. It is also true that in playoffs the rotation of players tends to close on 6 to 9 players depending on the coach. To be able to compare their career, this analysis focuses only on the first 15 seasons of each player, as Michael Jordan played for 15 seasons and LeBron James played for 18 seasons, but the first 15 seasons were chosen because that is the amount of seasons Michael Jordan played. The advanced statistics are useful to sum up the player's impact on the team in one aspect of the game. For example, the player efficiency rating (PER) “sums up all a player's positive accomplishments, subtracts the negative accomplishments, and returns a per-minute rating of a player's performance."" According to its creator John Hollinger (ESPN analytics, 2022). These statistics are adjusted around a league average, giving a direct indication of whether the players are above or below the average for each season. This allows us to compare the teams across different eras, having as a base their own contemporary players. 4 Results The results show visualizations proposed as models to compare temporal data in different analyses for the NBA. 4.1 3-Point evolution per Position. Figure 3 shows the first visualization proposed is an animated bubble graph on a cartesian plane. The x-axis is depicting the 3-point made while the y-axis depicts the 3-point attempted. In addition, the size of the bubble reflects the number of players in each position. As time advances the bubbles move along the plane and increase/decrease in size as the positions evolve. The purpose of this visualization is to allow the data to tell a visual story of the evolution of these clusters. This should serve as an exploratory visualization to move into a more specialized visualization. The user can hover over the visualization to gain exact information about the cluster. In addition, the slider below the graph can be used to control de animation, pause, resume, and move to a specific year of interest. The labels on the right are also interactive, as they serve as a filter to activate/deactivate series in the visualization. Figure 4 shows the second type of visualization proposed for this analysis, an animated line graph with a forecast of the expected evolution of the 3-point made in the next 10 years. The purpose of this visualization is to provide a specialized view of the evolution of the 3-point per cluster position as well as to provide an estimate of the projections for the next 10 years. The user can hover over the lines to get a unified detail of the exact values of all cluster positions for a specific season. In addition, the range slider below the graph allows to zoom into a specific timeframe of interest. 4.2 Michael Jordan vs LeBron James Career and Teams. Figure 5 shows the visualization type proposed for this analysis applied to the player efficiency rating (PER) statistic. This visualization is a column chart combined with a line graph, the columns are used for the teammates while the line is used for the superstars. In addition, markers can be used to mark relevant events in their careers, for instance, the championships are coded as stars over the line graphs. A deactivated series per superstar is added to provide an exact depiction of the difference between the player efficiency rating (PER) of the superstar and the average player efficiency rating (PER) of his teammates. The user can activate/deactivate any series by clicking on the label's name. The user can hover over the elements to get detailed information, for example, the columns provide extra information on the name of the team, the year of the season, and the exact player efficiency rating (PER) value. positions for a specific season. It can be noted that the 1995-1997 Bulls teams had a high efficiency, while only the 2015 Cavalier had a comparable high efficiency. Figure 6 shows the same type of visualization used to analyze the Defensive box plus minus. It is evident that the 1995-1998 Bulls were great defensive teams, and that Michael relied on them on the defensive side for the second three-peat. On the other hand, LeBron had several good defensive teams, although he was far and beyond the average of his teammates being a solid defensive contributor for all his championships. Figure 7 shows the performance of the superstars and their teams using the Offensive box plus minus. Figure 7 shows that both superstars were solid offensive contributors to their teams. It can be argued that Michael had the better offensive teammates, from 1995 to 1997. LeBron’s best offensive teammates were those Cavaliers from 2015 to 2017. Figure 8 shows the same type of visualization used to analyze the overall Box plus-minus. Overall, Michael had the best teammates according to the league’s BMP. Michael was still a strong contributor on his second three-peat, however, his impact was diminishing toward the end of his career, while LeBron’s impact remained high at the end of his first 15 years."
