**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Vishaal Gaddipati
- Ryan Chen
- Georolyn Ngo
- Thanh Trinh
- Armaan Shaikh

# Research Question

What is the correlation between a state's average temperature and the number of NCAA Division I basketball players it produced between the 2018-2019 and 2023-2024 seasons? Furthermore, how does a state's average temperature relate to the skill level of its Division I basketball players, as indicated by the average of points, rebounds, and assists per season, within that same timeframe?

## Background and Prior Work

The debate over which state produces the best basketball players has long been a point of contention among Americans, with popular contenders including New York, California, and Florida. Previous research on this question has focused mainly on socioeconomic factors. For example, James Tompsett and Chris Knoester studied the likelihood of high school athletes advancing to the collegiate level.<a name="cite_ref_1"/>[<sup>1</sup>](#cite_note_1) They used indicators of family socioeconomic status, athletic development and merit, academic expectations and knowledge, and school context to predict the likelihood of becoming a college athlete, and showed that high school athletes from stronger socioeconomic backgrounds are much more likely to become collegiate athletes. That relationship is clear, but it does not answer our question. Thus, we aim to investigate a less conventional variable that could help us identify the best basketball state: the average temperature of the state.

There have been studies in the past that have looked at the effect of environmental factors on active participation in physical activities. The paper titled "Assessing the Effects of Weather Conditions on Physical Activity Participation Using Objective Measures" by Catherine B. Chan and Daniel A. Ryan examines how different weather conditions influence physical activity levels.<a name="cite_ref_2"/>[<sup>2</sup>](#cite_note_2) The study identifies weather as a significant barrier to physical activity, focusing in particular on how adverse conditions such as low temperatures, rain, and snow decrease participation. Using data from various demographic groups, the study finds that physical activity levels are generally higher on warmer days and lower during inclement weather. We intend to extend this line of work to the context of basketball. Further, a study by David Hancock, Matthew Vierimaa, and Ashley Newman revealed that communities with populations of 250,000-499,999 are talent hotspots for high-level competitive basketball. They attribute this to favorable infrastructure and social structure, but call for more research into other factors such as crime rates and green spaces.<a name="cite_ref_3"/>[<sup>3</sup>](#cite_note_3)

We aim to examine the correlation between a state's average temperature and the number of Division I NCAA basketball players it generates between the 2018-2019 and 2023-2024 seasons. We expect that environmental factors such as average temperature give residents of different areas unequal opportunities for physical activity. Another factor that could come into play is sports culture. Because sports are strongly integrated into local culture, we may also observe differences in the popularity of particular sports depending on the weather and on how accessible each sport is. The accessibility of a sport due to climate could affect opportunities for training and skill development, which in turn could affect on-court performance down the line. Investigating how average temperatures in different states correlate with the emergence of Division I basketball players can offer valuable insights into the broader determinants of athletic success.

<a name="cite_note_1"/> [1](#cite_ref_1) Tompsett, J., & Knoester, C. (2022). The Making of a College Athlete: High School Experiences, Socioeconomic Advantages, and the Likelihood of Playing College Sports. Sociology of Sport Journal, 39(2), 129–140. https://doi.org/10.1123/ssj.2020-0142

<a name="cite_note_2"/> [2](#cite_ref_2) Chan, C. B., & Ryan, D. A. (2009). Assessing the effects of weather conditions on physical activity participation using objective measures. International journal of environmental research and public health, 6(10), 2639–2654. https://doi.org/10.3390/ijerph6102639

<a name="cite_note_3"/> [3](#cite_ref_3) Hancock, D. J., Vierimaa, M., & Newman, A. (2022). The geography of talent development. Frontiers in sports and active living, 4, 1031227. https://doi.org/10.3389/fspor.2022.1031227

# Hypothesis


We hypothesize that there will be a positive correlation between the average temperature of a state and the quantity of Division I NCAA basketball players generated between the 2018-2019 and 2023-2024 seasons. Our rationale is that warmer climates may facilitate greater participation in outdoor sports activities, including basketball, thereby potentially leading to an increased number of talented athletes pursuing collegiate basketball opportunities.

Additionally, we anticipate that states with higher average temperatures will exhibit a positive correlation with the skill level of Division I basketball players. Specifically, we expect that players hailing from states with warmer climates will demonstrate a higher combined average of points, rebounds, and assists over the season. This hypothesis is grounded in the notion that favorable weather conditions may provide athletes with more opportunities for training and skill development, ultimately enhancing their performance on the court.
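
The first hypothesis amounts to expecting a positive Pearson correlation between per-state temperature and player counts. A minimal sketch of that check is shown here, using made-up illustrative numbers rather than our data; the column names `average_temperature` and `player_count` are placeholders, not names from our final tables.

```python
import pandas as pd

# Illustrative values only (not taken from our datasets).
toy = pd.DataFrame({
    "average_temperature": [45.0, 52.0, 58.0, 64.0, 71.0],
    "player_count": [10, 14, 13, 20, 24],
})

# Pearson correlation coefficient; r > 0 would be consistent with the hypothesis.
r = toy["average_temperature"].corr(toy["player_count"], method="pearson")
print(f"Pearson r = {r:.3f}")
```

The same call could be reused for the second hypothesis by swapping the count column for a skill score.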

# Data

## Data overview

(Datasets #2 and #3 required some web scraping) 
- Dataset #1
  - Dataset Name: Average Temperature by State
  - Link to the dataset: https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/statewide/time-series/1/tavg/1/3/1895-2024?base_prd=true&begbaseyear=1901&endbaseyear=2000
  - Number of observations: 12579 rows (163527 values across the 13 raw columns, as reported by `weather_df.size`)
  - Number of variables: 4
- Dataset #2
  - Dataset Name: Players
  - Link to the dataset: https://basketball.realgm.com/ncaa/players/2024/
  - Number of observations: 33858 rows before cleaning (372438 values across 11 columns, as reported by `player_df.size`)
  - Number of variables: 11
- Dataset #3
  - Dataset Name: Points per Game
  - Link to the dataset: https://stats.ncaa.org/rankings/change_sport_year_div
  - Number of observations: 1905
  - Number of variables: 12

Now write 2 - 5 sentences describing each dataset here. Include a short description of the important variables in the dataset; what the metrics and datatypes are, what concepts they may be proxies for. Include information about how you would need to wrangle/clean/preprocess the dataset
- Dataset #1: This dataset covers average temperature by state, measured in degrees Fahrenheit, on a monthly basis from 1895 to 2024. Each row is identified by a code, specified in the documentation on the NOAA page, that encodes key fields such as the state and the year. The key variables are the state, year, month, and temperature, all of which are read as type object (strings). After filtering, 49 states remain: the contiguous 48 plus Alaska, since Hawaii does not appear in the data. This dataset is crucial because it lets us compute the yearly average temperature for each state from 2018 to 2023 via data wrangling/cleaning. This involves using a dictionary to interpret the state codes, dropping the years we are not analyzing, and creating a column for the mean temperature over all 12 months. The result can then be used to analyze the role temperature plays in our research question.
- Dataset #2:
- Dataset #3:
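
The code-based wrangling described for Dataset #1 can be sketched as follows. This is a minimal illustration, assuming each record code is a 10-character string whose first three characters are the state code and whose last four are the year (the same slicing used later in this notebook). The helper name `parse_code`, the truncated `STATE_CODES` excerpt, and the sample code values are hypothetical.

```python
# Minimal sketch: split a NOAA record code into its state and year parts.
# Assumption: the code is 10 digits long, state code first, year last.
STATE_CODES = {"001": "Alabama", "004": "California", "050": "Alaska"}  # excerpt only


def parse_code(code: str) -> tuple[str, int]:
    """Return (state name, year) for a record code such as '0040022023'."""
    if len(code) != 10 or not code.isdigit():
        raise ValueError(f"unexpected code format: {code!r}")
    state = STATE_CODES[code[:3]]   # first three characters: state code
    year = int(code[6:])            # last four characters: year
    return state, year


print(parse_code("0040022023"))
```

Validating the length up front makes a malformed row fail loudly instead of silently producing a wrong year.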

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.

We plan to combine these datasets by... [FILL IN]

## Dataset #1: Average Temperature By State

In [10]:
import pandas as pd

In [11]:
# The data cleaning will require converting some of the conventions used by the organization into equivalents. Ex: 001 at the beginning of a code indicates Alabama.
# sep=r'\s+' splits on whitespace (replaces the deprecated delim_whitespace=True)
weather_df = pd.read_csv('https://www.ncei.noaa.gov/pub/data/cirs/climdiv/climdiv-tmpcst-v1.0.0-20240506', sep=r'\s+', dtype=str)
# Note: .size counts every cell (rows x columns), not the number of rows
weather_df.size

163527

In [12]:
# code is an indicator based on a sequential standard found in detail here: https://www.ncei.noaa.gov/pub/data/cirs/climdiv/state-readme.txt
weather_df.columns = ["code", "January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]

In [13]:
# Dictionary for state code mappings
state_codes = {
    '001': 'Alabama',
    '002': 'Arizona',
    '003': 'Arkansas',
    '004': 'California',
    '005': 'Colorado',
    '006': 'Connecticut',
    '007': 'Delaware',
    '008': 'Florida',
    '009': 'Georgia',
    '010': 'Idaho',
    '011': 'Illinois',
    '012': 'Indiana',
    '013': 'Iowa',
    '014': 'Kansas',
    '015': 'Kentucky',
    '016': 'Louisiana',
    '017': 'Maine',
    '018': 'Maryland',
    '019': 'Massachusetts',
    '020': 'Michigan',
    '021': 'Minnesota',
    '022': 'Mississippi',
    '023': 'Missouri',
    '024': 'Montana',
    '025': 'Nebraska',
    '026': 'Nevada',
    '027': 'New Hampshire',
    '028': 'New Jersey',
    '029': 'New Mexico',
    '030': 'New York',
    '031': 'North Carolina',
    '032': 'North Dakota',
    '033': 'Ohio',
    '034': 'Oklahoma',
    '035': 'Oregon',
    '036': 'Pennsylvania',
    '037': 'Rhode Island',
    '038': 'South Carolina',
    '039': 'South Dakota',
    '040': 'Tennessee',
    '041': 'Texas',
    '042': 'Utah',
    '043': 'Vermont',
    '044': 'Virginia',
    '045': 'Washington',
    '046': 'West Virginia',
    '047': 'Wisconsin',
    '048': 'Wyoming',
    '050': 'Alaska'
}

In [14]:
# Remove data points not from 2018 to 2023
weather_df = weather_df[weather_df["code"].str[6:].isin(["2018", "2019", "2020", "2021", "2022", "2023"])]

In [15]:
# Keep only codes found in state_codes (001-048 and 050); Hawaii is not in this dataset, so 49 states remain
weather_df = weather_df[weather_df["code"].str[0:3].isin(state_codes.keys())].reset_index(drop=True)

In [16]:
# Create 2 new columns to represent the year and the state of a 12-month average temperature period
weather_df.insert(1, "year", weather_df["code"].str[6:], True)
weather_df.insert(1, "state", weather_df["code"].str[0:3],True)

In [17]:
# Convert state codes into their state name strings
weather_df["state"] = weather_df["state"].map(state_codes)

In [18]:
# Drop code as it is no longer needed for data analysis
weather_df = weather_df.drop(["code"], axis=1)

In [19]:
weather_df

Unnamed: 0,state,year,January,February,March,April,May,June,July,August,September,October,November,December
0,Alabama,2018,40.40,58.00,55.40,59.50,74.40,79.30,80.70,79.50,79.40,67.40,50.40,49.00
1,Alabama,2019,46.50,56.10,55.00,63.50,74.20,78.10,80.60,80.60,80.20,67.80,50.60,51.30
2,Alabama,2020,49.50,51.00,63.50,62.00,68.90,76.90,80.90,80.20,74.30,66.90,58.40,45.90
3,Alabama,2021,46.60,46.70,59.60,61.60,69.40,77.00,79.30,80.00,73.90,67.30,52.20,56.80
4,Alabama,2022,43.90,50.10,57.40,62.60,73.50,80.40,81.70,79.30,73.70,61.20,55.30,49.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
289,Alaska,2019,7.50,15.50,26.50,28.50,43.00,54.00,58.10,51.50,44.40,30.80,19.80,6.70
290,Alaska,2020,-6.40,1.50,12.00,27.50,43.30,50.50,53.20,52.10,42.00,28.90,14.50,10.60
291,Alaska,2021,10.60,1.00,9.10,23.90,40.30,50.90,53.90,49.40,39.40,27.90,4.60,6.80
292,Alaska,2022,2.60,8.20,16.70,24.90,40.10,52.70,53.30,50.10,42.70,28.80,16.30,7.10


In [20]:
# Check for any missing data
weather_df.isna().any().sum()

0

In [21]:
weather_df.dtypes

state        object
year         object
January      object
February     object
March        object
April        object
May          object
June         object
July         object
August       object
September    object
October      object
November     object
December     object
dtype: object

In [23]:
# Check California's 12 monthly average temperatures for 2023
weather_df[(weather_df['state'] == 'California') & (weather_df['year'] == '2023')]

Unnamed: 0,state,year,January,February,March,April,May,June,July,August,September,October,November,December
23,California,2023,42.6,42.7,44.1,55.3,61.9,66.7,78.9,75.8,68.0,62.2,51.9,48.0


In [24]:
# Convert month columns to floats (assign by column name so the dtype actually changes)
month_cols = weather_df.columns[2:]
weather_df[month_cols] = weather_df[month_cols].astype(float)

# Calculate the average of each row across the 12 months
weather_df['average_temperature'] = weather_df[month_cols].mean(axis=1)

# Drop the January to December columns
weather_df = weather_df.drop(columns=month_cols)

In [25]:
weather_df

Unnamed: 0,state,year,average_temperature
0,Alabama,2018,64.450000
1,Alabama,2019,65.375000
2,Alabama,2020,64.866667
3,Alabama,2021,64.200000
4,Alabama,2022,64.016667
...,...,...,...
289,Alaska,2019,32.191667
290,Alaska,2020,27.475000
291,Alaska,2021,26.483333
292,Alaska,2022,28.625000


In [26]:
# Test to get the California 1 year average temperature in 2023
weather_df[(weather_df['state'] == 'California') & (weather_df['year'] == '2023')]

Unnamed: 0,state,year,average_temperature
23,California,2023,58.175
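
If a single temperature value per state is needed for the whole 2018 to 2023 window, the yearly averages can be collapsed with a group-by. The sketch below assumes a frame shaped like `weather_df`; the sample rows are copied from the outputs above, and the names `sample`, `state_avg`, and `mean_temperature` are ours for illustration.

```python
import pandas as pd

# A few rows in the shape of weather_df (values copied from the output above).
sample = pd.DataFrame({
    "state": ["Alabama", "Alabama", "Alaska", "Alaska"],
    "year": ["2018", "2019", "2020", "2021"],
    "average_temperature": [64.45, 65.375, 27.475, 26.483333],
})

# Mean of the yearly averages for each state.
state_avg = (
    sample.groupby("state", as_index=False)["average_temperature"]
    .mean()
    .rename(columns={"average_temperature": "mean_temperature"})
)
print(state_avg)
```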


In [27]:
# Hawaii is the only state missing from the data (49 states x 6 years each)
weather_df['state'].value_counts()

Alabama           6
Nevada            6
New Jersey        6
New Mexico        6
New York          6
North Carolina    6
North Dakota      6
Ohio              6
Oklahoma          6
Oregon            6
Pennsylvania      6
Rhode Island      6
South Carolina    6
South Dakota      6
Tennessee         6
Texas             6
Utah              6
Vermont           6
Virginia          6
Washington        6
West Virginia     6
Wisconsin         6
Wyoming           6
New Hampshire     6
Nebraska          6
Arizona           6
Montana           6
Arkansas          6
California        6
Colorado          6
Connecticut       6
Delaware          6
Florida           6
Georgia           6
Idaho             6
Illinois          6
Indiana           6
Iowa              6
Kansas            6
Kentucky          6
Louisiana         6
Maine             6
Maryland          6
Massachusetts     6
Michigan          6
Minnesota         6
Mississippi       6
Missouri          6
Alaska            6
Name: state, dtype: int64
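
The value counts above list 49 states. One way to confirm which state is absent is a set difference against a list of expected state names. In this sketch `expected` is a deliberately short, hypothetical list and `present` is a stand-in for `weather_df["state"]`; a real check would list all 50 names.

```python
import pandas as pd

# Hypothetical short list for illustration; a real check would list all 50 states.
expected = {"Alabama", "Alaska", "Hawaii"}

# Stand-in for weather_df["state"].
present = pd.Series(["Alabama", "Alaska", "Alabama", "Alaska"], name="state")

# States we expected but did not find in the data.
missing = sorted(expected - set(present.unique()))
print(missing)
```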

## Dataset #2 Players

In [28]:
# Import csv
player_df = pd.read_csv("./players.csv")
print(player_df.size)
player_df.head()

372438


Unnamed: 0,Player,Pos,HT,WT,School,Class,Birth Date,Birth City,High School/Prep School,Unnamed: 9,Year
0,Shaqquan Aaron,GF,6-7,200,USC,Sr,"Jul 21, 1995",Seattle (WA),Rainier Beach High School,,2019
1,Eli Abaev,F,6-8,210,Austin Peay,RS-Jr,"Dec 13, 1997",Coral Springs (FL),Zion Lutheran High School,,2019
2,Huthifah Abdel Jawad,G,6-0,160,Hawaii,So,-,Honolulu (HI),McKinley High School,,2019
3,Hasan Abdullah,PG,6-0,195,Navy,Sr,-,Trussville (AL),Clay-Chalkville High School,,2019
4,Mohammed Abdusalam,C,6-9,255,UNC Greensboro,Fr,"Sep 7, 1997",Decatur (GA),Greenforest Christian High School,,2019


In [29]:
# Some rows are duplicates of the header row
# Find these rows and remove them from the dataframe
player_df = player_df[player_df["Player"] != "Player"]
player_df.head()

Unnamed: 0,Player,Pos,HT,WT,School,Class,Birth Date,Birth City,High School/Prep School,Unnamed: 9,Year
0,Shaqquan Aaron,GF,6-7,200,USC,Sr,"Jul 21, 1995",Seattle (WA),Rainier Beach High School,,2019
1,Eli Abaev,F,6-8,210,Austin Peay,RS-Jr,"Dec 13, 1997",Coral Springs (FL),Zion Lutheran High School,,2019
2,Huthifah Abdel Jawad,G,6-0,160,Hawaii,So,-,Honolulu (HI),McKinley High School,,2019
3,Hasan Abdullah,PG,6-0,195,Navy,Sr,-,Trussville (AL),Clay-Chalkville High School,,2019
4,Mohammed Abdusalam,C,6-9,255,UNC Greensboro,Fr,"Sep 7, 1997",Decatur (GA),Greenforest Christian High School,,2019


In [30]:
# We only need players' birth city's state
player_df = player_df[["Birth City", "Year"]]
player_df.head()

Unnamed: 0,Birth City,Year
0,Seattle (WA),2019
1,Coral Springs (FL),2019
2,Honolulu (HI),2019
3,Trussville (AL),2019
4,Decatur (GA),2019


In [31]:
# Keep only entries whose birth city ends with ")" (e.g. "Seattle (WA)"), which we treat as US birthplaces
player_df = player_df[player_df["Birth City"].astype(str).str.endswith(")")]

In [32]:
# Keep only the two-letter state abbreviation, e.g. "Seattle (WA)" -> "WA"
player_df = player_df.copy()  # work on a copy to avoid SettingWithCopyWarning
player_df["Birth City"] = player_df["Birth City"].astype(str).str[-3:-1]
player_df = player_df.reset_index(drop=True)
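
Slicing the last characters assumes every value ending in `)` is a US city. A stricter alternative is to extract a two-letter code with a regular expression and keep only known state abbreviations. This is a sketch on sample strings: `US_STATES` is truncated for brevity, and the non-US and placeholder sample values are hypothetical.

```python
import pandas as pd

# Truncated for brevity; a full version would list every state abbreviation.
US_STATES = {"WA", "FL", "HI", "AL", "GA"}

birth_city = pd.Series(["Seattle (WA)", "Decatur (GA)", "Toronto (ON)", "-", None])

# Capture exactly two capital letters inside trailing parentheses.
state = birth_city.str.extract(r"\(([A-Z]{2})\)$", expand=False)

# Keep only rows whose code is a known US state; non-matches are NaN and drop out.
state = state[state.isin(US_STATES)]
print(state.tolist())
```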

In [33]:
# Reorganize into sum by state per year
player_df = player_df.groupby(["Birth City", "Year"]).size().reset_index(name="Count")

In [34]:
# Preview the player counts per state and year (the "Birth City" column now holds the state abbreviation)
player_df.head()

Unnamed: 0,Birth City,Year,Count
0,AK,2019,8
1,AK,2020,8
2,AK,2021,8
3,AK,2022,9
4,AK,2023,4
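
The counts above are keyed by two-letter abbreviations, while the temperature table uses full state names and string years, so combining them needs a mapping step. One possible sketch follows. It assumes the grouping column has been renamed to `state`, that both tables are matched on the same calendar year (an analysis choice that is not settled here), and it uses a truncated, hypothetical `ABBREV_TO_NAME` dictionary. The sample counts and temperatures are copied from the outputs above.

```python
import pandas as pd

ABBREV_TO_NAME = {"AK": "Alaska", "AL": "Alabama"}  # truncated, for illustration

counts = pd.DataFrame({"state": ["AK", "AK"], "Year": [2019, 2020], "Count": [8, 8]})
temps = pd.DataFrame({
    "state": ["Alaska", "Alaska"],
    "year": ["2019", "2020"],            # years are stored as strings in weather_df
    "average_temperature": [32.191667, 27.475],
})

# Bring both tables to the same keys: full state name and integer year.
counts = counts.assign(
    state=counts["state"].map(ABBREV_TO_NAME),
    year=counts["Year"].astype(int),
).drop(columns="Year")
temps = temps.assign(year=temps["year"].astype(int))

# validate="one_to_one" raises if either side has duplicate (state, year) keys.
combined = counts.merge(temps, on=["state", "year"], how="inner", validate="one_to_one")
print(combined)
```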


## Dataset #3 Points per Game

In [35]:
# Import points per game and players csv
ppg_df = pd.read_csv("./points.csv")
players = pd.read_csv("./players.csv")
print(ppg_df.size)
ppg_df.head()

22860


Unnamed: 0,Rank,Player,Cl,Ht,Pos,G,FGM,3FG,FT,PTS,PPG,Year
0,1,"Chris Clemons, Campbell (Big South)",Sr.,5-9,G,33,304,139,246,993,30.1,2019
1,2,"Justin Wright-Foreman, Hofstra (CAA)",Sr.,6-2,G,35,330,110,178,948,27.1,2019
2,3,"Antoine Davis, Detroit Mercy (Horizon)",Fr.,6-1,G,30,263,132,126,784,26.1,2019
3,4,"Mike Daum, South Dakota St. (Summit League)",Sr.,6-9,F,33,286,67,196,835,25.3,2019
4,5,"Markus Howard, Marquette (Big East)",Jr.,5-11,G,34,252,120,227,851,25.0,2019


In [36]:
# Clean player names to include only name
ppg_df["Player"] = ppg_df["Player"].apply(lambda s: s.split(",")[0])
ppg_df.head()

Unnamed: 0,Rank,Player,Cl,Ht,Pos,G,FGM,3FG,FT,PTS,PPG,Year
0,1,Chris Clemons,Sr.,5-9,G,33,304,139,246,993,30.1,2019
1,2,Justin Wright-Foreman,Sr.,6-2,G,35,330,110,178,948,27.1,2019
2,3,Antoine Davis,Fr.,6-1,G,30,263,132,126,784,26.1,2019
3,4,Mike Daum,Sr.,6-9,F,33,286,67,196,835,25.3,2019
4,5,Markus Howard,Jr.,5-11,G,34,252,120,227,851,25.0,2019


In [37]:
## Filtering players dataset to include birth state

# Remove repeated header rows
players = players[players["Player"] != "Player"]

# Drop any players with NaN birth city
players = players.dropna(subset=['Birth City'])

In [38]:
# Drop players born outside the US
players = players[players["Birth City"].apply(lambda city: '(' in city and ')' in city)]

In [39]:
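# Extract the two-letter state abbreviation from the city string, e.g. "Seattle (WA)" -> "WA"
players = players.copy()  # work on a copy to avoid SettingWithCopyWarning
players["Birth City"] = players["Birth City"].str[-3:-1]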

In [40]:
# Rename the "Birth City" to "Birth State"
players.rename(columns={"Birth City": "Birth State"}, inplace=True)

In [41]:
players.head(10)

Unnamed: 0,Player,Pos,HT,WT,School,Class,Birth Date,Birth State,High School/Prep School,Unnamed: 9,Year
0,Shaqquan Aaron,GF,6-7,200,USC,Sr,"Jul 21, 1995",WA,Rainier Beach High School,,2019
1,Eli Abaev,F,6-8,210,Austin Peay,RS-Jr,"Dec 13, 1997",FL,Zion Lutheran High School,,2019
2,Huthifah Abdel Jawad,G,6-0,160,Hawaii,So,-,HI,McKinley High School,,2019
3,Hasan Abdullah,PG,6-0,195,Navy,Sr,-,AL,Clay-Chalkville High School,,2019
4,Mohammed Abdusalam,C,6-9,255,UNC Greensboro,Fr,"Sep 7, 1997",GA,Greenforest Christian High School,,2019
7,DeAndre Abram,GF,6-8,210,Milwaukee,Jr,"Jun 1, 1997",TX,Creekview High School,,2019
8,Aleks Abrams,F,6-8,240,Miami (OH),Sr,"Aug 23, 1995",CA,Oaks Christian High School,,2019
9,Shane Acoveno,F,6-6,200,Lehigh,So,-,PA,Delaware Valley High School,,2019
10,Milan Acquaah,SG,6-3,195,California Baptist,So,"Dec 22, 1997",CA,Cathedral High School,,2019
11,Kani Acree,F,6-5,185,Ball State,RS-Fr,-,IL,Carbondale High School,,2019


In [42]:
players.shape

(29259, 11)

In [43]:
ppg_df.shape

(1905, 12)

In [44]:
ppg_df[ppg_df['Player'].isin(players['Player'].unique())]

Unnamed: 0,Rank,Player,Cl,Ht,Pos,G,FGM,3FG,FT,PTS,PPG,Year
0,1,Chris Clemons,Sr.,5-9,G,33,304,139,246,993,30.1,2019
1,2,Justin Wright-Foreman,Sr.,6-2,G,35,330,110,178,948,27.1,2019
2,3,Antoine Davis,Fr.,6-1,G,30,263,132,126,784,26.1,2019
3,4,Mike Daum,Sr.,6-9,F,33,286,67,196,835,25.3,2019
4,5,Markus Howard,Jr.,5-11,G,34,252,120,227,851,25,2019
...,...,...,...,...,...,...,...,...,...,...,...,...
1899,344,Sahvir Wheeler,Sr.,5-9,G,31,175,24,69,443,14.3,2024
1900,346,Jared McCain,Fr.,6-3,G,36,175,87,77,514,14.3,2024
1901,347,Jlynn Counter,Jr.,6-3,G,29,150,25,89,414,14.3,2024
1903,349,Jonathan Mogbo,Jr.,6-8,F,34,206,0,72,484,14.2,2024


In [45]:
ppg_df.shape

(1905, 12)

In [46]:
# Attach each scorer's birth state by left-merging the points per game data with the players dataset on player name
merged_df = pd.merge(ppg_df, players[['Player', 'Birth State']], on='Player', how='left')

In [47]:
# Drop duplicate rows in merged_df and rows with no Birth State
merged_df.drop_duplicates(inplace=True)
merged_df.dropna(subset=['Birth State'], inplace=True)
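
Because `players` holds one row per player per season, merging on `Player` alone repeats each scorer once per roster year and can also match different players who share a name. A possible refinement is to merge on both name and season and let pandas verify the key. This is a sketch on toy rows (names and values are invented for illustration), assuming `Year` refers to the same season in both tables.

```python
import pandas as pd

ppg = pd.DataFrame({
    "Player": ["A. Example", "B. Sample"],
    "PPG": [20.1, 18.4],
    "Year": [2019, 2019],
})
roster = pd.DataFrame({
    "Player": ["A. Example", "A. Example", "B. Sample"],
    "Birth State": ["CA", "CA", "TX"],
    "Year": [2019, 2020, 2019],
})

# Keep one roster row per (Player, Year) before merging.
roster = roster.drop_duplicates(subset=["Player", "Year"])

# validate="many_to_one" raises if the roster key is not unique.
merged = ppg.merge(roster, on=["Player", "Year"], how="left", validate="many_to_one")
print(merged)
```

With this key, the later `drop_duplicates` call becomes a safety net rather than a required cleanup step.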

In [49]:
merged_df.head(10)

Unnamed: 0,Rank,Player,Cl,Ht,Pos,G,FGM,3FG,FT,PTS,PPG,Year,Birth State
0,1,Chris Clemons,Sr.,5-9,G,33,304,139,246,993,30.1,2019,NC
1,2,Justin Wright-Foreman,Sr.,6-2,G,35,330,110,178,948,27.1,2019,NY
2,3,Antoine Davis,Fr.,6-1,G,30,263,132,126,784,26.1,2019,IN
7,4,Mike Daum,Sr.,6-9,F,33,286,67,196,835,25.3,2019,NE
8,5,Markus Howard,Jr.,5-11,G,34,252,120,227,851,25.0,2019,AZ
11,7,Ja Morant,So.,6-3,G,33,265,57,221,808,24.5,2019,SC
12,8,Jermaine Marrow,Jr.,6-0,G,35,274,88,218,854,24.4,2019,VA
14,9,Carsen Edwards,Jr.,6-0,G,36,277,135,185,874,24.3,2019,TX
15,10,Nick Mayo,Sr.,6-9,F,31,245,54,190,734,23.7,2019,ME
16,11,Jordan Davis,Sr.,6-2,G,32,256,54,186,752,23.5,2019,NV

## Dataset #4 Assists per Game

In [None]:
apg_df = pd.read_csv("./assists.csv")

## Dataset #5 Rebounds per Game

In [None]:
rpg_df = pd.read_csv("./rebounds.csv")
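
The research question defines skill as the average of points, rebounds, and assists. Once the three per-game tables are joined, that score could be computed row-wise. This is a sketch with invented numbers; the column names `RPG` and `APG` are assumptions, since the rebounds and assists files have not been inspected in this notebook.

```python
import pandas as pd

stats = pd.DataFrame({
    "Player": ["A. Example", "B. Sample"],
    "PPG": [20.0, 15.0],
    "RPG": [7.0, 3.0],   # assumed column name
    "APG": [3.0, 6.0],   # assumed column name
})

# Unweighted mean of the three per-game averages.
stats["skill"] = stats[["PPG", "RPG", "APG"]].mean(axis=1)
print(stats)
```

An unweighted mean lets points dominate the score because scoring averages are numerically larger; standardizing each column first would be one alternative to consider.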

# Ethics & Privacy

To identify biases or issues in the datasets, our group will conduct a comprehensive review of the data sources, including the list of Division I basketball players, their respective stats in different categories, and weather data from various sources such as Kaggle or NOAA. During this review, we will pay particular attention to any potential biases in the composition of the datasets, such as underrepresentation of certain demographic groups or regions, as well as any privacy concerns related to the inclusion of sensitive or potentially identifiable information.

Our group will address these biases or issues by employing several strategies throughout the research process:

Transparency: We will openly acknowledge any biases or limitations in the datasets, ensuring our analysis is conducted with full transparency and accountability. By clearly documenting our data sources, methodologies, and any assumptions made during the analysis, we will enable stakeholders to critically evaluate the findings and understand the context in which they were derived. We will scrutinize our datasets and methods for potential biases, particularly socioeconomic biases that might skew the analysis, and we will check that the data represent all demographics so that lower-income populations are not excluded. Machine learning algorithms tuned only for accuracy can easily overlook bias; however, leaving out bias-prone factors often costs little accuracy while making the predictions less biased. We will therefore aim to represent our dataset fairly.

Equity in Analysis: During analysis, we will keep in mind the relevant stakeholders of this project (such as athletes and colleges/universities) and address blind spots in our analysis by considering perspectives from different stakeholders. Further, we hold ourselves to the standard of setting aside our own biases while performing data analysis, so that analytical strategies are not misused to propagate existing stereotypes or systemic biases. Our visualizations, summary statistics, and reports should include every valid data point, and we will try our best to avoid dropping any data points.

Data Validation: We will validate the accuracy and integrity of the datasets through rigorous quality assurance measures, including data cleaning and verification procedures. This will involve checking for inconsistencies or errors and implementing appropriate corrections to ensure the reliability of our analysis. We will consult multiple sources where possible to cross-check the data we obtain and verify its correctness. Furthermore, we will confirm that the collected data are usable by checking column headers, data types, and null fields.

Privacy Protection: We will prioritize safeguarding individuals' privacy rights by adhering to relevant privacy regulations and guidelines, ensuring that any personally identifiable information is handled with care and only used in accordance with ethical standards. Visualizations, summary statistics, and reports should avoid private information where possible. One way we will do so is by removing personal information that is irrelevant to our project, such as the name, height, weight, and birth date of players in the datasets.
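
The validation and privacy steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not our final pipeline: the helper name `validate_and_strip` is hypothetical, the toy row is invented, and the list of personal columns simply mirrors the column names seen in the players table.

```python
import pandas as pd


def validate_and_strip(df: pd.DataFrame, required: list[str], personal: list[str]) -> pd.DataFrame:
    """Check required columns and nulls, then drop personal columns."""
    missing = [c for c in required if c not in df.columns]
    if missing:
        raise KeyError(f"missing columns: {missing}")
    null_counts = df[required].isna().sum()
    if null_counts.any():
        raise ValueError(f"null values found:\n{null_counts[null_counts > 0]}")
    # errors="ignore" skips personal columns that are already absent.
    return df.drop(columns=personal, errors="ignore")


toy = pd.DataFrame({"Player": ["A. Example"], "HT": ["6-2"], "Birth State": ["CA"], "Year": [2019]})
clean = validate_and_strip(
    toy,
    required=["Birth State", "Year"],
    personal=["Player", "HT", "WT", "Birth Date"],
)
print(clean.columns.tolist())
```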

By maintaining transparency in our methodology and being vigilant about ethical considerations, we aim to uphold the integrity of our research while contributing valuable insights into the environmental factors affecting athletic opportunities in NCAA basketball.


# Team Expectations 

* *Team Expectation 1*: Communicate in a timely manner with respect (via Discord or text group chat).
* *Team Expectation 2*: Ensure that completed work is high quality and meets all specified requirements.
* *Team Expectation 3*: Set deadlines for divided tasks and reach out whenever help is needed as early as possible.
* *Team Expectation 4*: Actively participate in group meetings and in the various aspects of the project.
* *Team Expectation 5*: Keep track of current tasks and discussion points using meeting notes (via Google Docs).

# Project Timeline Proposal

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 4/25  |  11 AM | Read & Think about COGS 108 expectations; brainstorm topics/questions  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 4/30  |  8 PM |  Do background research on topic | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 5/3  | 8 PM  | Edit, finalize, and submit proposal; Search for datasets  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 5/12  | 12 PM  | Import & Wrangle Data (Ryan); EDA (Georolyn) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 5/13  | 12 PM  | Finalize wrangling/EDA; Begin Analysis (Vishaal; Thanh) | Discuss/edit Analysis; Complete project check-in |
| 5/20  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Armaan)| Discuss/edit full project |
| 6/10  | Before 11:59 PM  | NA | Turn in Final Project & Group Project Surveys |