# **2022 NFL Combine - Project 1**
### Analyzing qualitative and quantitative variables.

# **Importing Necessary Python Modules**

Python incorporates a variety of open-source add-ins called **modules** that add extra features to the basic setup. Each module's name follows the `import` statement, and its purpose is given in a comment after the hash symbol (#).



In [1]:
import pandas as pd                 #Data analysis
import numpy as np                  #Calculations
import plotly.express as px         #Graphing
import matplotlib.pyplot as plt     #Graphing
from IPython.display import Image   #Display images
import warnings                     #Ignore version warnings
warnings.simplefilter('ignore', FutureWarning)


In [2]:
# Replace 'image_url' with the URL of the image you want to display
image_url = 'https://as2.ftcdn.net/v2/jpg/04/31/74/37/1000_F_431743763_in9BVVzCI36X304StR89pnxyUYzj1dwa.jpg'

# Display the image
Image(url=image_url)

# **Context**

National Invitational Camp (NIC), more commonly known as the NFL Scouting Combine, began in 1982 when National Football Scouting, Inc. first conducted a camp for its member NFL clubs in Tampa, Florida. The key purpose then, same as it is today, was to ascertain medical information on the top draft eligible prospects in college football. The inaugural NIC was attended by a total of 163 players and established a foundation for future expansion.

As football and the art of evaluating players has evolved, so has the NFL Scouting Combine. While medical examinations remain the number one priority of the event, athletes will also participate in a variety of psychological and physical tests, as well as, formal and informal interviews with top executives, coaches and scouts from all 32 NFL teams. NIC is the ultimate four day job interview for the top college football players eligible for the upcoming NFL Draft.

Attribution: NFLCombine.net


# **About the Dataset**

This dataset contains 133 rows corresponding to a random sample of drafted players. A total of 9 variables are provided as listed below:

| Variable Name(s)      | Description                            |
|:----------------------|:--------------------------------------|
| Player                | Player ID, which is the player's name  |
| Pos                   | Position the player plays              |
| School                | College the player attended            |
| Ht                    | Player height (inches)                 |
| Wt                    | Player weight (lbs)                    |
| 40yd                  | Time to run the 40-yard dash (seconds) |
| Vertical              | Vertical jump height (inches)          |
| Broad Jump            | Horizontal distance covered (inches) (aka long jump) |
| Drafted (tm/rnd/yr)   | Team, round, and year the player was drafted |



Let's take a look at the data. To do this, we first import it directly from the URL below.



# **A Snippet of the Data**

In [3]:
url='https://raw.githubusercontent.com/thamilton562/STAT108_Projects_Students/main/DataSets/NFLCombine.csv'
df=pd.read_csv(url)

Next, we can display the data by *typing the name* of the DataFrame. To ensure we can see all columns, we'll use the `pd.set_option` function.

In [4]:
# Set display options to show all columns
pd.set_option('display.max_columns', None)
df

|     | Player | Pos | School | Ht | Wt | 40yd | Vertical | Broad Jump | Drafted (tm/rnd/yr) |
|:----|:-------|:----|:-------|:---|:---|:-----|:---------|:-----------|:--------------------|
| 0   | Myjai Sanders | EDGE | Cincinnati | 77 | 228 | 4.67 | 33.0 | 120 | Arizona Cardinals / 3rd / 100th pick / 2022 |
| 1   | Keaontay Ingram | RB | USC | 73 | 221 | 4.53 | 34.5 | 122 | Arizona Cardinals / 6th / 201st pick / 2022 |
| 2   | Jesse Luketa | LB | Penn St. | 75 | 253 | 4.89 | 37.5 | 114 | Arizona Cardinals / 7th / 256th pick / 2022 |
| 3   | Marquis Hayes | OG | Oklahoma | 77 | 318 | 5.30 | 23.5 | 102 | Arizona Cardinals / 7th / 257th pick / 2022 |
| 4   | Troy Andersen | LB | Montana St. | 76 | 243 | 4.42 | 36.0 | 128 | Atlanta Falcons / 2nd / 58th pick / 2022 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 128 | Jahan Dotson | WR | Penn St. | 71 | 178 | 4.43 | 36.0 | 121 | Washington Commanders / 1st / 16th pick / 2022 |
| 129 | Brian Robinson | RB | Alabama | 74 | 225 | 4.53 | 30.0 | 119 | Washington Commanders / 3rd / 98th pick / 2022 |
| 130 | Percy Butler | S | Louisiana | 73 | 194 | 4.36 | 31.5 | 123 | Washington Commanders / 4th / 113th pick / 2022 |
| 131 | Cole Turner | TE | Nevada | 78 | 246 | 4.76 | 27.0 | 120 | Washington Commanders / 5th / 149th pick / 2022 |


# **INSTRUCTIONS**

* Use Python to analyze the data set and complete each of the following.
* Replace each ellipsis (...) with the relevant names or code.
* For problems that require a written response, double click the text box to start typing.
* Refer to the 3 tutorials from the activity for assistance.
* Attend office hours if you still need help.

## **QUESTION 1**
Determine whether the four variables below are qualitative or quantitative. If they are quantitative, specify whether they are continuous or discrete.

| Variable | Classification |
|:---------|:----------------------------|
|School    | Qualitative                 |
|Weight    | Quantitative, continuous    |
|Vertical  | Quantitative, continuous    |
|Position  | Qualitative                 |
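One way to sanity-check these classifications is to look at how pandas stores each column. The sketch below is only an illustration: it uses five rows copied from the data snippet above so that it runs without downloading the file, and the names `sample` and `is_numeric` are hypothetical.

```python
import pandas as pd

# Five rows copied from the data snippet above (not the full dataset).
sample = pd.DataFrame({
    'Pos': ['EDGE', 'RB', 'LB', 'OG', 'LB'],
    'School': ['Cincinnati', 'USC', 'Penn St.', 'Oklahoma', 'Montana St.'],
    'Wt': [228, 221, 253, 318, 243],
    'Vertical': [33.0, 34.5, 37.5, 23.5, 36.0],
})

# Columns stored as text are qualitative; columns stored as numbers are quantitative.
is_numeric = {col: pd.api.types.is_numeric_dtype(sample[col]) for col in sample.columns}
print(is_numeric)
```

The data type alone does not settle continuous versus discrete. Weight is recorded in whole pounds here, yet it is still a measurement and is classified as continuous.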

## **QUESTION 2**

Construct a frequency table, relative frequency table, and relative frequency bar chart to describe the distribution of player position. State any fact that jumps out to you.

**2a)** Construct a table that contains the frequency and relative frequency distribution for the player's position. Round relative frequency to 3 decimal places.

In [16]:
# Define the name of the variable to be analyzed
variable = df['Pos']                                                        #variable = df['...']

# Create the frequency table.
# .sort_index() sorts the Pos categories alphabetically
freq_table = variable.value_counts().sort_index()

# Rename "count" to "Frequency", and "Pos" to "Position"
freq_table = freq_table.rename('Frequency')
freq_table = freq_table.rename_axis('Position')

# Create the relative frequency table, and rename the counts column to
#   Relative Frequency.
relative_freq_table = freq_table/len(df)                                            #relative_freq_table = freq_table/... #HINT: look back at Project 0 or Tutorial 1.
relative_freq_table = relative_freq_table.rename('Relative Frequency').round(3)     # relative_freq_table = relative_freq_table.rename('...').round(...)

# Combine both tables
# axis=1 says to put the tables together as columns
combined_table = pd.concat([freq_table, relative_freq_table], axis=1)                 # combined_table = pd.concat([..., ...], axis=1)

# Print the combined table.
combined_table                                                          # ...


| Position | Frequency | Relative Frequency |
|:---------|:----------|:-------------------|
| C        | 5         | 0.038              |
| CB       | 5         | 0.038              |
| DE       | 6         | 0.045              |
| DT       | 5         | 0.038              |
| EDGE     | 10        | 0.075              |
| LB       | 15        | 0.113              |
| OG       | 17        | 0.128              |
| OT       | 15        | 0.113              |
| P        | 1         | 0.008              |
| QB       | 4         | 0.03               |
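As a cross-check on the relative frequencies, pandas can compute proportions directly with `value_counts(normalize=True)`. The following is a minimal sketch, assuming only the positions of the first five players shown in the data snippet; the names `pos`, `freq`, `rel_freq`, and `check` are hypothetical.

```python
import pandas as pd

# Positions of the first five players shown in the data snippet.
pos = pd.Series(['EDGE', 'RB', 'LB', 'OG', 'LB'], name='Pos')

# Counts per position, sorted alphabetically.
freq = pos.value_counts().sort_index()

# Proportions per position: each count divided by the number of players.
rel_freq = pos.value_counts(normalize=True).sort_index()

# Same result computed by hand, as in (2a).
check = (freq / len(pos)).round(3)
print(rel_freq.round(3))
```

Both approaches give the same numbers, and the relative frequencies always sum to 1.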


**2b)** Construct a relative frequency bar chart to describe the distribution of player position.

In [17]:
# Create the relative frequency DataFrame from the relative_freq_table
# .reset_index() is needed because we changed it in (2a).
# Replace the ... to complete the name of the table needed to create
#    the required bar chart.
dfrf = relative_freq_table.reset_index()            #dfrf = ..._freq_table.reset_index()

# Rename the columns for clarity
dfrf.columns = ['Position', 'Relative Frequency']

# Create the bar graph
fig = px.bar(x=dfrf['Position'],y=dfrf['Relative Frequency'],
             title='Relative Frequency Distribution Bar Chart') #title='...')

# Update axis labels
fig.update_layout(xaxis_title = 'Position')        #xaxis_title='...')
fig.update_layout(yaxis_title = 'Relative Frequency')             #yaxis_title='...')

# Display the bar graph
fig.show()



**2c)** Describe the distribution of positions. Use the full name of the position (this link can help: https://en.wikipedia.org/wiki/American_football_positions). For ex: rather than "...least likely to be P", write "...least likely to be a Punter"

Players are most likely to be wide receivers (WR). The next most likely positions are linebacker, offensive guard, offensive tackle, and running back. The least likely position is Punter.

## **Question 3**

For question 3 you will analyze a quantitative variable. Find your variable based on your last name and use that variable when answering all parts of question 3.  

Once you find your variable description, scroll up to "About the Dataset" to find the variable name. Then look at the "Snippet of Data" to get the exact variable name, especially since variable names are case sensitive.

| **Last Name** | **Variable Description**    |
|:--------------|:----------------------------|
| A-L           | Vertical jump height        |
| M-Z           | Broad jump distance         |





**3a)** Construct a histogram for your variable. Use number of bins = 18.

# **A-L**

In [18]:
# Create the histogram, with the x-axis being the variable specified in the
#   table based on your last name.
fig = px.histogram(x=df['Vertical'], nbins = 18,                        #fig = px.histogram(x=df['...'],nbins = ...,
             title='Histogram of Vertical Jump Height',                 #title='...',
             labels={'x':'Vertical Jump Height'})                       #labels={'x':'...'})

# Update the vertical axis title.
fig.update_layout(yaxis_title = 'Frequency')                       #fig.update_layout(yaxis_title = '...')

# Print the histogram.
fig.show()


# **M-Z**

In [None]:

fig = px.histogram(x=df['Broad Jump'], nbins = 18,
             title='Histogram of Broad Jump Distance',
             labels={'x':'Broad Jump Distance', 'y':'Frequency'})

fig.update_layout(yaxis_title='Frequency')

fig.show()

**3b)** Construct a boxplot for your variable.  

# **A-L**

In [None]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['Vertical'],                               #px.box(x=df['...'],
       title='Boxplot of Vertical Jump Height',        #title='...',
       labels={'x':'Vertical Jump Height'})            #labels={'x':'...'})

# **M-Z**

In [None]:
# Create the boxplot, with a title, and specify horizontal axis label.
px.box(x=df['Broad Jump'],
       title='Boxplot of Broad Jump Distance',
       labels={'x':'Broad Jump Distance'})

**3c)** Calculate the following summary statistics for your variable: 5 number summary, mean, and standard deviation. Round to three decimal places.

In [None]:
# Calculate the numerical summaries
# Indicate your variable.
descriptive_stats = df[['Vertical','Broad Jump']].describe().round(3)       #descriptive_stats = df[['...']].describe().round(...)

# Print the results.
descriptive_stats                                                                               #...


|       | Vertical | Broad Jump |
|:------|:---------|:-----------|
| count | 133.0    | 133.0      |
| mean  | 32.684   | 118.541    |
| std   | 4.545    | 8.723      |
| min   | 20.5     | 99.0       |
| 25%   | 29.5     | 111.0      |
| 50%   | 33.0     | 121.0      |
| 75%   | 36.0     | 125.0      |
| max   | 42.0     | 136.0      |


**3d)** Use information from (3a), (3b), and (3c) to describe your variable in terms of shape, center, spread, and outliers.
* Use the correct center and the correct spread based on the shape of the distribution.
* Specify which center and which spread you are using. For ex: Say "The mean is ..." or "The median is ...", rather than "The center is ..."
* When addressing outliers, if any, list the values of **all** outliers.
* Include units, if any, for all numbers.

**NOTE:** Students must use the center and spread that match their chosen shape. Skewed or with outliers = median/IQR; symmetric with no outliers = mean/standard deviation.

**A-L:** (The shape can be interpreted as skewed left or approximately symmetric.)

The distribution of vertical jump height is skewed left. There are no outliers. The median is 33 inches. The IQR is 6.5 inches.  

-OR-

The distribution of vertical jump height is approximately symmetric. There are no outliers. The mean is 32.684 inches. The standard deviation is 4.545 inches.  

**M-Z:**

The distribution of broad jump distance is skewed left. There are no outliers. The median is 121 inches. The IQR is 14 inches.  
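The "no outliers" statements can be checked numerically. The sketch below assumes the common 1.5 × IQR rule for flagging outliers and uses only the quartiles, minimum, and maximum reported in (3c); the names `summary` and `results` are hypothetical.

```python
# Quartiles, minimum, and maximum copied from the describe() output in (3c).
summary = {
    'Vertical':   {'min': 20.5, 'q1': 29.5,  'q3': 36.0,  'max': 42.0},
    'Broad Jump': {'min': 99.0, 'q1': 111.0, 'q3': 125.0, 'max': 136.0},
}

results = {}
for name, s in summary.items():
    iqr = s['q3'] - s['q1']                 # interquartile range
    lower_fence = s['q1'] - 1.5 * iqr       # values below this are outliers
    upper_fence = s['q3'] + 1.5 * iqr       # values above this are outliers
    has_outliers = s['min'] < lower_fence or s['max'] > upper_fence
    results[name] = (iqr, lower_fence, upper_fence, has_outliers)
    print(name, results[name])
```

For vertical jump the fences are 19.75 and 45.75 inches, and for broad jump they are 90 and 146 inches. The minimum and maximum of each variable fall inside its fences, which agrees with the answers above.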

**3e)** Interpret the standard deviation in context.

**A-L:** The typical vertical jump height falls within 4.545 inches of the mean height.

**M-Z:** The typical broad jump distance falls within 8.723 inches of the mean distance.
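One detail worth knowing when reading the `std` row: pandas reports the sample standard deviation (dividing by n - 1), while NumPy's `np.std` divides by n unless told otherwise. The sketch below illustrates the difference using the vertical jump heights of the first five players in the data snippet; the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Vertical jump heights (inches) of the first five players in the data snippet.
vertical = pd.Series([33.0, 34.5, 37.5, 23.5, 36.0])

sample_sd = vertical.std()                    # pandas default: ddof=1 (divide by n - 1)
population_sd = np.std(vertical.to_numpy())   # NumPy default: ddof=0 (divide by n)
describe_sd = vertical.describe()['std']      # describe() reports the sample version

print(round(sample_sd, 3), round(population_sd, 3))
```

With 133 players the two versions differ only slightly, but the sample version is the one shown in (3c).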

**3f)** Interpret the IQR in context.

**A-L:** The range of the middle half (50%) of vertical jump heights is 6.5 inches.

**M-Z:** The range of the middle half (50%) of broad jump distances is 14 inches.

## **QUESTION 4**

How do linebackers (LB) and running backs (RB) compare in height, weight and 40-yard dash times?

Calculate the mean height, mean weight, and mean 40-yard dash time for linebackers and running backs. Then compare the results and answer a question about the code.

**4a)** Calculate the mean height, mean weight, and mean 40-yard dash time for linebackers and running backs. Round to two decimal places.

In [None]:
# LB_means = df[df['Qual var'] == 'LB'][['Quant1', 'Quant2', 'Quant3']]
LB_means = df[df['Pos'] == 'LB'][['Ht', 'Wt', '40yd']].mean().round(2)      #LB_means = df[df['...'] == 'LB'][['...', '...', '...']].mean().round(...)
RB_means = df[df['Pos'] == 'RB'][['Ht', 'Wt', '40yd']].mean().round(2)      #RB_means = df[df['Pos'] == '...'][['...', '...', '...']].mean().round(...)

# Combine the two tables and specify labels.
combined_means = pd.DataFrame({'Height': [LB_means['Ht'], RB_means['Ht']],
                               'Weight': [LB_means['Wt'], RB_means['Wt']],
                               '40-yd Dash': [LB_means['40yd'], RB_means['40yd']]},
                               index=['Linebacker', 'Running Back'])

#Print the table
combined_means        #...

|              | Height | Weight | 40-yd Dash |
|:-------------|:-------|:-------|:-----------|
| Linebacker   | 74.27  | 237.6  | 4.56       |
| Running Back | 71.06  | 212.0  | 4.47       |
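The same comparison can also be produced in one step with `groupby`. This is an alternative sketch, not the approach used in the cell above. It assumes a small `sample` DataFrame built from five rows of the data snippet, so its means differ from the full-dataset values in the table.

```python
import pandas as pd

# Five rows copied from the data snippet (two LB, two RB, one WR).
sample = pd.DataFrame({
    'Player': ['Keaontay Ingram', 'Jesse Luketa', 'Troy Andersen',
               'Brian Robinson', 'Jahan Dotson'],
    'Pos': ['RB', 'LB', 'LB', 'RB', 'WR'],
    'Ht': [73, 75, 76, 74, 71],
    'Wt': [221, 253, 243, 225, 178],
    '40yd': [4.53, 4.89, 4.42, 4.53, 4.43],
})

# Keep only linebackers and running backs, then average each variable by position.
means = (
    sample[sample['Pos'].isin(['LB', 'RB'])]
    .groupby('Pos')[['Ht', 'Wt', '40yd']]
    .mean()
    .round(2)
)
print(means)
```

Grouping scales more easily than writing one filter per position: adding a third position only requires extending the list passed to `isin`.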


**4b)** Compare the results for these two positions.

Linebackers tend to be taller (have a larger mean height), heavier (have a larger mean weight), and be slower in the 40-yard dash (have a higher mean 40-yard dash time).

**4c)** In the table below are some snippets of code.

| **Last Name Initial** | **Code**                               |
|:----------------------|:---------------------------------------|
| A-L                   | df['Pos'] == 'LB'                  |
| M-Z                   | df[df['Pos'] == 'RB'][['Ht', 'Wt', '40yd']]|

Based on your last name, interpret the snippet of code.

**A-L:** `df['Pos'] == 'LB'` checks, row by row, whether each player is a linebacker, returning True or False for each player.

**M-Z:** `df[df['Pos'] == 'RB'][['Ht', 'Wt', '40yd']]` creates a new DataFrame containing only the running backs and only the variables height, weight, and 40-yard dash time.
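To see what each piece of these snippets returns, it can help to run them on a tiny table. The sketch below assumes a three-row `sample` DataFrame copied from the data snippet; `mask` and `subset` are hypothetical names.

```python
import pandas as pd

# First three rows of the data snippet.
sample = pd.DataFrame({
    'Player': ['Myjai Sanders', 'Keaontay Ingram', 'Jesse Luketa'],
    'Pos': ['EDGE', 'RB', 'LB'],
    'Ht': [77, 73, 75],
    'Wt': [228, 221, 253],
    '40yd': [4.67, 4.53, 4.89],
})

# Boolean Series: True where the player's position is LB, False otherwise.
mask = sample['Pos'] == 'LB'

# Keep only the rows where mask is True, then keep only three columns.
subset = sample[mask][['Ht', 'Wt', '40yd']]

print(mask.tolist())
print(subset)
```

Here `mask` is True only for Jesse Luketa, so `subset` contains a single row with his height, weight, and 40-yard dash time.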


## **QUESTION 5**

Generate a paragraph of at least 100 words to address one of the following questions. That is, answer only 5a or 5b, but not both.

**5a)** Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major?

...

--OR--

**5b)** Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career?

...


<br><br>
### Once you are done and ready to submit, follow the instructions below to save as a PDF and submit to GradeScope.

### Save as PDF
1. Run all code one last time
2. File -> Print Preview opens in a new browser window
3. Verify you can see all graphs. If not, go back to step 1.
4. File -> Print (or Ctrl-P/Cmd-P)
5. Change destination to PDF (don't save, yet)
6. Scroll through preview to make sure you can see your graphs entirely. If not, click Cancel. Make the browser window narrower. Go back to step 4.
7. Repeat steps 4-6 until you can see your graphs completely. But do not make them too narrow.
8. Save the PDF, taking note of where it is saved.

### Submit to GradeScope
1. Login to the Canvas course
2. Click on GradeScope in the course navigation.
3. If you see multiple courses in GradeScope, click on the STAT 108 course
4. Click on the "Tutorial 2 Practice Upload" assignment
5. Click on "Submit Work", select PDF
6. Select the PDF you just created
7. You need to tell GradeScope which page each problem answer/output is on. You should see a list of problems on the left, and a display of pages (thumbnails) on the right.
Assign pages to questions by clicking on the question number on the left, then clicking on all pages that question is on.
8. After ALL questions have been assigned to their respective page(s), click "Submit"
