# PROJECT 4 - Justin Clifton, Julia Januchowski, Zac Laney

---
## Table of Contents

1. [Introduction](#intro)
2. [The Parsing Problem](#parsing)
3. [The Kaggle Dataset](#kaggle_data)
4. [Combining The Data](#combining)
5. [References](#refrences)
---

## Introduction <a class="anchor" id="intro"></a>

The most prized individual accolade for a player in the NBA is the Most Valuable Player award. This award signifies that coaches, the media, and former players thought that the winner was the most important player in the entire league. Through the 2019-20 season, the NBA has handed out this award 65 times. The winners played different positions, had different statistics, and came from different eras. What, then, do they have in common? Can we predict who will have an MVP season? What do a player's statistics indicate about their chances of winning the award? Given the right dataset, these are all questions that could potentially be answered. Unfortunately, obtaining the data to answer these sorts of questions is not always as straightforward as downloading a dataset from Kaggle. In this case, there exists a dataset containing player statistics for each season, but it does not record which player won the MVP. Can we solve this problem? Can we pull data from Basketball-Reference, combine it with data on player statistics, and then use that data to make a simple model? These are the questions that we seek to answer. In the process, we will use libraries like BeautifulSoup to parse Basketball-Reference, and various pandas operations to combine the datasets.

There are three key skills we would like to demonstrate. The first is scraping data from the internet, which is useful because a ready-made dataset is not always available for the problem at hand. We will accomplish this using the Python library BeautifulSoup. Next, we will combine our scraped dataset with data that already exists. Finally, we will fit a simple model to try to predict who will be an MVP.

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
import requests
import numpy as np

## The Parsing Problem <a class="anchor" id="parsing"></a>
Basketball-Reference is a website that contains basketball statistics. We will begin by using BeautifulSoup to parse the MVP awards page on this website and extract the information that we need. The first step is to create a BeautifulSoup object from the page's HTML and use it to find every table cell that holds a player's name.

In [2]:
url = "https://www.basketball-reference.com/awards/mvp.html"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')
players = soup.find_all('td',{'data-stat' : 'player'})

The code above produces a list of the player cells contained in the MVP tables. However, the format is not exactly what we want: each element is a full `<td>` tag, as the cell below shows, so we have to extract the player's name using `.text`. The names are then stored in a list, which will later be used to create a dataset of MVP winners. It is worth noting that Giannis Antetokounmpo is removed from this list, as the dataset containing season statistics only goes up to 2017.

In [3]:
players[2]

<td class="left" csk="Harden,James" data-append-csv="hardeja01" data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td>
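
To illustrate what `.text` returns for a cell like the one above, here is a minimal sketch that parses the same markup from a string rather than from the live page. The variable names (`html`, `cell`, `player_id`, `link`) are ours, and the surrounding `<table>` wrapper is only there to keep the snippet valid HTML.

```python
from bs4 import BeautifulSoup

# The same cell shown above, embedded in a minimal table so it parses cleanly.
html = (
    '<table><tr>'
    '<td class="left" csk="Harden,James" data-append-csv="hardeja01" '
    'data-stat="player"><a href="/players/h/hardeja01.html">James Harden</a></td>'
    '</tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
cell = soup.find("td", {"data-stat": "player"})

name = cell.text                      # visible text of the cell and its children
player_id = cell["data-append-csv"]   # attributes are read like dictionary keys
link = cell.a["href"]                 # the nested <a> tag holds the player page path

print(name, player_id, link)
```

The player identifier in `data-append-csv` is not used in this project, but it could serve as a less ambiguous key than the display name if two players ever shared a name.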

In [4]:
player_list = []
for player in players:
    if player.text != "Giannis Antetokounmpo":
        player_list.append(player.text)

Below is an example of a name after it has been parsed.

In [5]:
player_list[0]

'James Harden'

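After the players' names have been collected, we also extract the season in which they won the award. There is one important note, though. The seasons are extracted as strings (e.g. '2019-20'), while the years in the season statistics dataset are stored as floats (e.g. 2019.0), so we have to extract a year from each string. This is done below using `apply` and a lambda function that slices the first four characters of the string, which give the year in which the season started, and casts them to a float. Whether the start year is the right key depends on how the season statistics dataset labels its seasons; if it labels a season by the year in which it ended, the extracted year would need to be incremented by one before matching.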

In [6]:
years = soup.find_all('th',{'data-stat' : 'season'})
year_list = []
for year in years:
    year_list.append(year.text)
year_list = [year for year in year_list if year != 'Season' and year != '2019-20' and year != '2018-19']

In [7]:
data = np.array(player_list)
mvps = pd.DataFrame(data , columns = ['Names'])

In [8]:
data = np.array(year_list)
year_series = pd.DataFrame(data = data, columns = ['Year'])
mvps = pd.concat([mvps, year_series], axis = 1)


Our MVP dataset in its current form is shown below. It can be seen that we have successfully extracted the names and years corresponding to the MVP.

In [9]:
start_years = mvps['Year'].apply(lambda year: float(year[:4]))
mvps['Year'] = start_years
mvps

| | Names | Year |
|---|---|---|
| 0 | James Harden | 2017.0 |
| 1 | Russell Westbrook | 2016.0 |
| 2 | Stephen Curry | 2015.0 |
| 3 | Stephen Curry | 2014.0 |
| 4 | Kevin Durant | 2013.0 |
| ... | ... | ... |
| 68 | Artis Gilmore | 1971.0 |
| 69 | Mel Daniels | 1970.0 |
| 70 | Spencer Haywood | 1969.0 |
| 71 | Mel Daniels | 1968.0 |
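
Because a season label spans two calendar years, there are two candidate years to match on. The sketch below uses a hypothetical helper, `season_to_years`, which is not part of the original notebook. It returns both the start year and the end year of a label such as '2017-18'. Which one is appropriate depends on how the season statistics dataset labels its `Year` column, so it is worth checking against one known player-season before joining.

```python
def season_to_years(season):
    """Split a label like '2017-18' into (start_year, end_year) as floats."""
    start = int(season[:4])   # the first four characters are the start year
    end = start + 1           # the season ends in the following calendar year
    return float(start), float(end)


for label in ["2017-18", "1999-00", "1968-69"]:
    start, end = season_to_years(label)
    print(label, start, end)
```

Computing the end year from the start year, rather than from the two-digit suffix, avoids special handling for labels such as '1999-00'.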


## The Kaggle Dataset <a class="anchor" id="kaggle_data"></a>

As mentioned before, some data on this topic already exists on Kaggle. We will now import a dataset containing player stats from 1950 to 2017. However, since the NBA did not begin tracking blocks and steals until the 1973-74 season, rows with a year of 1973 or earlier will be removed.

In [10]:
season_stats = pd.read_csv('Seasons_Stats.csv')

In [11]:
# .copy() makes stats_df an independent DataFrame rather than a slice of season_stats,
# which avoids the SettingWithCopyWarning when stats_df is modified later
stats_df = season_stats[season_stats['Year'] > 1973.0].copy()

In [12]:
# 'blanl' is not a typo here: drop() would raise a KeyError if the column did not exist
stats_df = stats_df.drop(columns = ['Unnamed: 0', 'blanl', 'blank2'])
stats_df

| | Year | Player | Pos | Age | Tm | G | GS | MP | PER | TS% | ... | FT% | ORB | DRB | TRB | AST | STL | BLK | TOV | PF | PTS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3851 | 1974.0 | Zaid Abdul-Aziz | C | 27.0 | HOU | 79.0 | NaN | 2459.0 | 15.9 | 0.516 | ... | 0.804 | 259.0 | 664.0 | 923.0 | 166.0 | 80.0 | 104.0 | NaN | 227.0 | 865.0 |
| 3852 | 1974.0 | Kareem Abdul-Jabbar* | C | 26.0 | MIL | 81.0 | NaN | 3548.0 | 24.4 | 0.564 | ... | 0.702 | 287.0 | 891.0 | 1178.0 | 386.0 | 112.0 | 283.0 | NaN | 238.0 | 2191.0 |
| 3853 | 1974.0 | Don Adams | SF | 26.0 | DET | 74.0 | NaN | 2298.0 | 10.9 | 0.457 | ... | 0.761 | 133.0 | 315.0 | 448.0 | 141.0 | 110.0 | 12.0 | NaN | 242.0 | 759.0 |
| 3854 | 1974.0 | Rick Adelman | PG | 27.0 | CHI | 55.0 | NaN | 618.0 | 10.0 | 0.447 | ... | 0.711 | 16.0 | 53.0 | 69.0 | 56.0 | 36.0 | 1.0 | NaN | 63.0 | 182.0 |
| 3855 | 1974.0 | Lucius Allen | PG | 26.0 | MIL | 72.0 | NaN | 2388.0 | 18.8 | 0.536 | ... | 0.788 | 89.0 | 202.0 | 291.0 | 374.0 | 137.0 | 22.0 | NaN | 215.0 | 1268.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 24686 | 2017.0 | Cody Zeller | PF | 24.0 | CHO | 62.0 | 58.0 | 1725.0 | 16.7 | 0.604 | ... | 0.679 | 135.0 | 270.0 | 405.0 | 99.0 | 62.0 | 58.0 | 65.0 | 189.0 | 639.0 |
| 24687 | 2017.0 | Tyler Zeller | C | 27.0 | BOS | 51.0 | 5.0 | 525.0 | 13.0 | 0.508 | ... | 0.564 | 43.0 | 81.0 | 124.0 | 42.0 | 7.0 | 21.0 | 20.0 | 61.0 | 178.0 |
| 24688 | 2017.0 | Stephen Zimmerman | C | 20.0 | ORL | 19.0 | 0.0 | 108.0 | 7.3 | 0.346 | ... | 0.600 | 11.0 | 24.0 | 35.0 | 4.0 | 2.0 | 5.0 | 3.0 | 17.0 | 23.0 |
| 24689 | 2017.0 | Paul Zipser | SF | 22.0 | CHI | 44.0 | 18.0 | 843.0 | 6.9 | 0.503 | ... | 0.775 | 15.0 | 110.0 | 125.0 | 36.0 | 15.0 | 16.0 | 40.0 | 78.0 | 240.0 |
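
Filtering a DataFrame and then modifying the result in place is what typically triggers pandas' SettingWithCopyWarning. The sketch below shows one way to sidestep it on a made-up miniature of the table; the column values are placeholders, not real data. It takes an explicit copy after filtering and lets `drop` return a new frame instead of using `inplace = True`.

```python
import pandas as pd

# Hypothetical miniature of the season statistics table (placeholder values).
season_stats = pd.DataFrame({
    "Year": [1972.0, 1973.0, 1974.0, 2017.0],
    "Player": ["A", "B", "C", "D"],
    "blank2": [None, None, None, None],
})

# Boolean filtering followed by .copy() yields an independent DataFrame,
# so later modifications cannot be mistaken for writes to season_stats.
stats_df = season_stats[season_stats["Year"] > 1973.0].copy()

# Letting drop() return a new frame avoids in-place mutation altogether.
stats_df = stats_df.drop(columns=["blank2"])

print(stats_df)
print(season_stats.columns.tolist())  # the original still has all its columns
```

Note that the strict comparison `> 1973.0` also excludes rows labeled 1973 itself, so the first year kept is 1974.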


During the process of exploring this data, we noticed that some names contain a '*'. After some research, we concluded that this star indicates that a player was an All-Star that season. (On Basketball-Reference itself, an asterisk next to a player's name marks a member of the Hall of Fame, so the flag may be better read as a career honor than as a per-season one.) We will use this information to engineer a new feature called 'is_allstar', which is 1 if the player's name carries the star and 0 otherwise. After creating the new column, we remove the '*' from the names to clean the data.

In [13]:
allstars = stats_df['Player'].apply(lambda name: 1 if '*' in name else 0)

In [14]:
allstars.rename("is_allstar", inplace = True)

3851     0
3852     1
3853     0
3854     0
3855     0
        ..
24686    0
24687    0
24688    0
24689    0
24690    0
Name: is_allstar, Length: 20797, dtype: int64

In [15]:
stats_df = pd.concat([stats_df, allstars], axis = 1)

Next, we replace the names that contain a '*' with names that do not.

In [16]:
no_star_names = stats_df['Player'].apply(lambda name: name.replace('*', "") if '*' in name else name)
no_star_names.rename("Name", inplace = True)

3851         Zaid Abdul-Aziz
3852     Kareem Abdul-Jabbar
3853               Don Adams
3854            Rick Adelman
3855            Lucius Allen
                ...         
24686            Cody Zeller
24687           Tyler Zeller
24688      Stephen Zimmerman
24689            Paul Zipser
24690            Ivica Zubac
Name: Name, Length: 20797, dtype: object

In [17]:
stats_df.drop(columns = ['Player'], inplace = True)

In [18]:
stats_df['Name'] = no_star_names
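
The same flag-and-clean step can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a minimal sketch on a made-up three-row sample; the names `players`, `has_star`, and `clean_names` are ours, not the notebook's.

```python
import pandas as pd

# Hypothetical sample of the Player column, asterisk included where present.
players = pd.Series(["Zaid Abdul-Aziz", "Kareem Abdul-Jabbar*", "Don Adams"])

# regex=False treats '*' as a literal character rather than a regex quantifier.
has_star = players.str.contains("*", regex=False).astype(int)
clean_names = players.str.replace("*", "", regex=False)

flags = pd.DataFrame({"Name": clean_names, "has_star": has_star})
print(flags)
```

Passing `regex=False` matters here, because a bare '*' is not a valid regular expression on its own.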

## Combining The Data <a class="anchor" id="combining"></a>
At this point, we begin the process of combining our two datasets. This proved to be no small task. Initially, we attempted to simply find any player who appeared in both datasets and give them a 1 to indicate that they were the MVP. However, this idea was highly flawed: it flagged every season of any player who had ever won the award. The result was a dataset that implied Steve Nash was the MVP in 1996, when he averaged 3 points a game. After much pondering, we determined that the best solution would be to compare both the name of the player and the year of the season. To accomplish this, some manipulation of the dataset was required. First, the index of the dataset was changed. Instead of being indexed by numbers, the dataset is temporarily indexed by the player's name concatenated with the year. For example, James Harden's 2017 season is temporarily indexed by 'James Harden2017.0'. Manipulating the dataset in this way enabled us to put a 1 in locations where the name and year matched and a 0 in locations where they did not. Finally, we added a new column to the dataset denoting whether the player was an MVP.

In [19]:
year_name_list = []
for i, j in stats_df.iterrows(): 
    year_name_list.append(j.Name + str(j.Year))


data = np.array(year_name_list)

name_year_series = pd.DataFrame(data = data, columns = ['name_year'])
stats_df.index = name_year_series['name_year']
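
The composite key can also be built without looping over rows, since pandas can concatenate whole string columns at once. The sketch below assumes a small stand-in for `stats_df` with made-up player names and years; only the `Name` and `Year` column names are taken from the notebook.

```python
import pandas as pd

# Hypothetical rows standing in for stats_df (placeholder names and years).
stats = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player B"],
    "Year": [2017.0, 1997.0, 2005.0],
})

# Column-wise string concatenation replaces the row-by-row loop.
stats["name_year"] = stats["Name"] + stats["Year"].astype(str)
stats = stats.set_index("name_year")

print(stats.index.tolist())
```

Because `Year` is a float column, the string form keeps the trailing '.0', which matches the 'James Harden2017.0' style of key described above.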

In [20]:
# Collect a 'name + year' key for every stats row that matches an MVP season
mvps_list = []
for name, year in zip(mvps['Names'], mvps['Year']):
    for name2, year2 in zip(stats_df['Name'], stats_df['Year']):
        if name == name2 and year == year2:
            mvps_list.append(name + str(year))


In [21]:
# Flag each row whose index key appears among the MVP keys
mvp_keys = set(mvps_list)  # a set makes each membership test fast
was_mvp_list = []
for key in stats_df.index:
    was_mvp_list.append(1 if key in mvp_keys else 0)

In [22]:
data = np.array(was_mvp_list)

was_mvp = pd.DataFrame(data = data, columns = ['was_mvp'])

In [23]:
stats_df.index = was_mvp.index
stats_df = pd.concat([stats_df, was_mvp], axis = 1)
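
As an alternative to the temporary index, the same labeling can be sketched with a left merge on both the name and the year. This is not what the notebook does; it is a hedged illustration on made-up rows, using the `indicator` option of `DataFrame.merge` to report which rows found a match. It also reproduces the flaw of matching on the name alone.

```python
import pandas as pd

# Hypothetical miniatures of the two tables (placeholder names and numbers).
stats = pd.DataFrame({
    "Name": ["Player A", "Player A", "Player B"],
    "Year": [1997.0, 2005.0, 2005.0],
    "PTS": [200.0, 1200.0, 2000.0],
})
mvps = pd.DataFrame({"Names": ["Player A"], "Year": [2005.0]})

# Matching on the name alone flags every season the player ever had.
by_name_only = stats["Name"].isin(mvps["Names"]).astype(int)

# A left merge on (name, year) keeps every stats row; the indicator column
# reports whether a matching MVP row was found.
keys = mvps.rename(columns={"Names": "Name"}).drop_duplicates()
merged = stats.merge(keys, on=["Name", "Year"], how="left", indicator=True)
merged["was_mvp"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

print(by_name_only.tolist())
print(merged)
```

The merge keeps the original row order and row count as long as the key table has no duplicate (name, year) pairs, which is why `drop_duplicates` is applied first.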

After all that work, let us bask in the glory of our newly formed dataset.

In [24]:
stats_df

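| | Year | Pos | Age | Tm | G | GS | MP | PER | TS% | 3PAr | ... | TRB | AST | STL | BLK | TOV | PF | PTS | is_allstar | Name | was_mvp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1974.0 | C | 27.0 | HOU | 79.0 | NaN | 2459.0 | 15.9 | 0.516 | NaN | ... | 923.0 | 166.0 | 80.0 | 104.0 | NaN | 227.0 | 865.0 | 0 | Zaid Abdul-Aziz | 0 |
| 1 | 1974.0 | C | 26.0 | MIL | 81.0 | NaN | 3548.0 | 24.4 | 0.564 | NaN | ... | 1178.0 | 386.0 | 112.0 | 283.0 | NaN | 238.0 | 2191.0 | 1 | Kareem Abdul-Jabbar | 0 |
| 2 | 1974.0 | SF | 26.0 | DET | 74.0 | NaN | 2298.0 | 10.9 | 0.457 | NaN | ... | 448.0 | 141.0 | 110.0 | 12.0 | NaN | 242.0 | 759.0 | 0 | Don Adams | 0 |
| 3 | 1974.0 | PG | 27.0 | CHI | 55.0 | NaN | 618.0 | 10.0 | 0.447 | NaN | ... | 69.0 | 56.0 | 36.0 | 1.0 | NaN | 63.0 | 182.0 | 0 | Rick Adelman | 0 |
| 4 | 1974.0 | PG | 26.0 | MIL | 72.0 | NaN | 2388.0 | 18.8 | 0.536 | NaN | ... | 291.0 | 374.0 | 137.0 | 22.0 | NaN | 215.0 | 1268.0 | 0 | Lucius Allen | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20792 | 2017.0 | PF | 24.0 | CHO | 62.0 | 58.0 | 1725.0 | 16.7 | 0.604 | 0.002 | ... | 405.0 | 99.0 | 62.0 | 58.0 | 65.0 | 189.0 | 639.0 | 0 | Cody Zeller | 0 |
| 20793 | 2017.0 | C | 27.0 | BOS | 51.0 | 5.0 | 525.0 | 13.0 | 0.508 | 0.006 | ... | 124.0 | 42.0 | 7.0 | 21.0 | 20.0 | 61.0 | 178.0 | 0 | Tyler Zeller | 0 |
| 20794 | 2017.0 | C | 20.0 | ORL | 19.0 | 0.0 | 108.0 | 7.3 | 0.346 | 0.000 | ... | 35.0 | 4.0 | 2.0 | 5.0 | 3.0 | 17.0 | 23.0 | 0 | Stephen Zimmerman | 0 |
| 20795 | 2017.0 | SF | 22.0 | CHI | 44.0 | 18.0 | 843.0 | 6.9 | 0.503 | 0.448 | ... | 125.0 | 36.0 | 15.0 | 16.0 | 40.0 | 78.0 | 240.0 | 0 | Paul Zipser | 0 |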


In [25]:
stats_df[stats_df["was_mvp"] == 1]

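| | Year | Pos | Age | Tm | G | GS | MP | PER | TS% | 3PAr | ... | TRB | AST | STL | BLK | TOV | PF | PTS | is_allstar | Name | was_mvp |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 137 | 1974.0 | C | 22.0 | BUF | 74.0 | NaN | 3185.0 | 24.7 | 0.594 | NaN | ... | 1117.0 | 170.0 | 88.0 | 246.0 | NaN | 252.0 | 2261.0 | 1 | Bob McAdoo | 1 |
| 246 | 1975.0 | C | 27.0 | MIL | 65.0 | NaN | 2747.0 | 26.4 | 0.55 | NaN | ... | 912.0 | 264.0 | 65.0 | 212.0 | NaN | 205.0 | 1949.0 | 1 | Kareem Abdul-Jabbar | 1 |
| 522 | 1976.0 | C | 28.0 | LAL | 82.0 | NaN | 3379.0 | 27.2 | 0.567 | NaN | ... | 1383.0 | 413.0 | 119.0 | 338.0 | NaN | 292.0 | 2275.0 | 1 | Kareem Abdul-Jabbar | 1 |
| 1115 | 1977.0 | C | 24.0 | POR | 65.0 | NaN | 2264.0 | 22.9 | 0.563 | NaN | ... | 934.0 | 245.0 | 66.0 | 211.0 | NaN | 174.0 | 1210.0 | 1 | Bill Walton | 1 |
| 1363 | 1978.0 | C | 22.0 | HOU | 59.0 | NaN | 2107.0 | 21.2 | 0.559 | NaN | ... | 886.0 | 31.0 | 48.0 | 76.0 | 220.0 | 179.0 | 1144.0 | 1 | Moses Malone | 1 |
| 1526 | 1979.0 | C | 31.0 | LAL | 80.0 | NaN | 3157.0 | 25.5 | 0.612 | NaN | ... | 1025.0 | 431.0 | 76.0 | 316.0 | 282.0 | 230.0 | 1903.0 | 1 | Kareem Abdul-Jabbar | 1 |
| 1974 | 1980.0 | SF | 29.0 | PHI | 78.0 | NaN | 2812.0 | 25.4 | 0.568 | 0.012 | ... | 576.0 | 355.0 | 170.0 | 140.0 | 284.0 | 208.0 | 2100.0 | 1 | Julius Erving | 1 |
| 2440 | 1981.0 | C | 25.0 | HOU | 80.0 | NaN | 3245.0 | 25.1 | 0.585 | 0.002 | ... | 1180.0 | 141.0 | 83.0 | 150.0 | 308.0 | 223.0 | 2222.0 | 1 | Moses Malone | 1 |
| 2813 | 1982.0 | C | 26.0 | HOU | 81.0 | 81.0 | 3398.0 | 26.8 | 0.576 | 0.003 | ... | 1188.0 | 142.0 | 76.0 | 125.0 | 294.0 | 208.0 | 2520.0 | 1 | Moses Malone | 1 |
| 2984 | 1983.0 | PF | 26.0 | BOS | 79.0 | 79.0 | 2982.0 | 24.1 | 0.561 | 0.052 | ... | 870.0 | 458.0 | 148.0 | 71.0 | 240.0 | 197.0 | 1867.0 | 1 | Larry Bird | 1 |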
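
A quick way to sanity-check labels like these is to count the flagged rows per season. The sketch below runs on made-up rows, with placeholder names and team codes. The "TOT" row mimics the combined row that a traded player can have alongside per-team rows in season tables, which is one way a single season can end up with more than one flagged row.

```python
import pandas as pd

# Made-up labelled rows; names and team codes are placeholders.
labelled = pd.DataFrame({
    "Year": [2001.0, 2001.0, 2002.0, 2002.0, 2002.0, 2003.0],
    "Name": ["Player A", "Player B", "Player C", "Player C", "Player C", "Player D"],
    "Tm": ["AAA", "BBB", "TOT", "AAA", "BBB", "CCC"],
    "was_mvp": [1, 0, 1, 1, 1, 0],
})

# Count flagged rows per season; anything other than 1 deserves a closer look.
per_season = labelled.groupby("Year")["was_mvp"].sum()
suspicious = per_season[per_season != 1]

print(suspicious)
```

Seasons with a count of zero point to a name or year mismatch between the two sources, while counts above one point to duplicate rows for the same player-season.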


Finally, we have the answer to our 'big question'. To recap: first, data was parsed from Basketball-Reference to obtain information on which players won the MVP and in what year. Then, a dataset containing season stats was imported from Kaggle. Next, a feature called 'is_allstar' was engineered based on the '*' contained in a player's name. Lastly, a new dataset was created by combining our MVP dataset and our season stats dataset. This new dataset contains season stats along with flags for whether a player won the MVP or was marked as an All-Star that season.

## References <a class="anchor" id="refrences"></a>

1. https://stackoverflow.com/questions/45746426/data-scraping-basketball-data-using-beautiful-soup
2. https://www.crummy.com/software/BeautifulSoup/bs4/doc/
3. https://medium.com/hardwood-convergence/intro-to-virtual-environments-and-scraping-nba-data-with-beautifulsoup-6ce745f8c26e
4. https://stackoverflow.com/questions/3559559/how-to-delete-a-character-from-a-string-using-python
5. https://www.kite.com/python/answers/how-to-create-pandas-dataframe-from-a-numpy-array-in-python
6. https://thispointer.com/python-how-to-use-if-else-elif-in-lambda-functions/
7. https://thispointer.com/pandas-check-if-a-value-exists-in-a-dataframe-using-in-not-in-operator-isin/
8. https://www.geeksforgeeks.org/iterating-over-rows-and-columns-in-pandas-dataframe/
9. https://www.basketball-reference.com/awards/mvp.html