# Assignment 4
## Description
In this assignment you must read in a file of metropolitan regions and associated sports teams from [assets/wikipedia_data.html](assets/wikipedia_data.html) and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in [assets/nfl.csv](assets/nfl.csv)), MLB (baseball, in [assets/mlb.csv](assets/mlb.csv)), NBA (basketball, in [assets/nba.csv](assets/nba.csv)) or NHL (hockey, in [assets/nhl.csv](assets/nhl.csv)). Please keep in mind that all questions are from the perspective of the metropolitan region, and that the wikipedia_data.html file is the "source of authority" for the location of a given sports team. Thus teams which are commonly known by a different area (e.g. "Oakland Raiders") need to be mapped into the metropolitan region given (e.g. San Francisco Bay Area). This will require some human understanding of the data beyond what you've been given (e.g. you will have to hand-code some names, and might need to search online to find out where teams are)!
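
As a rough illustration of the mapping step described above, the sketch below combines a hand-coded override table with a simple prefix match. The helper `team_to_region`, the `MANUAL_OVERRIDES` table and the toy `regions` list are hypothetical names introduced here; the only real entry is the "Oakland Raiders" example from the text, and a full solution would need its own hand-coded cases.

```python
# Hypothetical override table: holds only the example named in the assignment text.
MANUAL_OVERRIDES = {"Oakland Raiders": "San Francisco Bay Area"}


def team_to_region(team, regions, overrides=MANUAL_OVERRIDES):
    """Return the metropolitan region for `team`, or None when no rule applies."""
    # Hand-coded cases win over any automatic rule.
    if team in overrides:
        return overrides[team]
    # Fall back to a prefix match: "Boston Bruins" starts with "Boston".
    for region in regions:
        if team.startswith(region):
            return region
    return None


# Toy region list for illustration only.
regions = ["San Francisco Bay Area", "Boston", "Denver"]
print(team_to_region("Oakland Raiders", regions))
print(team_to_region("Boston Bruins", regions))
print(team_to_region("Nowhere Nomads", regions))  # made-up team, no match
```

Returning `None` for unmatched teams makes it easy to list the names that still need a manual entry.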

For each sport I would like you to answer the question: **what is the win/loss ratio's correlation with the population of the city it is in?** Win/loss ratio refers to the number of wins divided by the number of wins plus the number of losses. Remember that to calculate the correlation with [`pearsonr`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html) you are going to send in two ordered lists of values: the populations from the wikipedia_data.html file and the win/loss ratios for a given sport, in the same order. Average the win/loss ratios for those cities which have multiple teams of a single sport. Each sport is worth an equal amount in this assignment (20%\*4=80% of the grade). You should only use data **from year 2018** for your analysis -- this is important!
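
A minimal sketch of that calculation on toy numbers (the populations, wins and losses below are invented and do not come from the assignment files):

```python
import scipy.stats as stats

# Toy numbers, not taken from the assignment data.
populations = [1_000_000, 2_500_000, 4_000_000, 8_000_000]
wins = [30, 45, 41, 50]
losses = [52, 37, 41, 32]

# Ratio as defined above: wins / (wins + losses).
ratios = [w / (w + l) for w, l in zip(wins, losses)]

# pearsonr pairs values by position, so both lists must share one ordering.
corr, pval = stats.pearsonr(populations, ratios)
print(corr, pval)
```

Because the pairing is positional, sorting one list without the other would silently change the result.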

## Notes

1. Do not include data about the MLS or CFL in any of the work you are doing, we're only interested in the Big 4 in this assignment.
2. I highly suggest that you first tackle the four correlation questions in order, as they are all similar and worth the majority of grades for this assignment. This is by design!
3. It's fair game to talk with peers about high level strategy as well as the relationship between metropolitan areas and sports teams. However, do not post code solving aspects of the assignment (including dictionaries mapping areas to teams, or regexes which will clean up names).
4. There may be more teams than the assert statements test for. Remember to collapse multiple teams in one city into a single value!
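
The collapsing step in note 4 can be sketched with a `groupby` on the region. The frame below is a toy example with made-up regions, teams and ratios; the column names are assumptions for illustration.

```python
import pandas as pd

# Toy frame with made-up ratios; two teams share one region.
teams = pd.DataFrame({
    "Region": ["Region A", "Region A", "Region B"],
    "team": ["Team 1", "Team 2", "Team 3"],
    "ratio": [0.60, 0.40, 0.55],
})

# One row per region: the mean ratio of all teams located there.
per_region = teams.groupby("Region")["ratio"].mean()
print(per_region)
```

After this step the number of rows equals the number of regions, which is what the assert statements count.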

## Question 1
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NHL** using **2018** data.

In [24]:
# Import
import pandas as pd
import numpy as np
import scipy.stats as stats
import re


# Functions
def remove_square_brackets(data):
    # Find the first bracketed marker such as "[note 1]" and remove it; return the string unchanged if there is none
    match = re.search(r"\[(.*?)\]", data)
    if match is None:
        return data
    return data.replace(match.group(), '')

    
def remove_asterisk(data):
    # Remove a trailing '*' if there is one; otherwise return the string unchanged
    return data.rstrip('*')


def extract_region(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nhl_city_population['NHL'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return nhl_city_population[nhl_city_population['NHL'] == teams].index[0]


def nhl_return_merged_data():
    return merged_df_nhl


# Variables
nhl_df=pd.read_csv("assets/nhl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]
        
# Data cleaning in the cities DataFrame
cities['NHL'] = cities['NHL'].apply(lambda x: remove_square_brackets(x))
# Select Year = 2018 and drop irrelevant rows
nhl_df = nhl_df[nhl_df['year'] == 2018].drop([0, 9, 18, 26], axis=0)

# Select the required columns, create the nhl_city_population DataFrame and clean the data
# (.copy() makes it an independent frame, so the in-place edits below do not warn about a slice of cities)
nhl_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NHL']].copy()
nhl_city_population.set_index('Metropolitan area', inplace=True)
nhl_city_population['NHL'] = nhl_city_population['NHL'].apply(lambda x: remove_square_brackets(x))
nhl_city_population['NHL'] = nhl_city_population['NHL'].replace(to_replace=['—', ''], value=np.nan)
nhl_city_population.dropna(inplace=True, axis=0)

# Clean the data and split the team column to extract the region of the team 
nhl_df['team'] = nhl_df['team'].apply(lambda x: remove_asterisk(x))
nhl_df['team_last_word'] = nhl_df['team'].apply(lambda x: x.split(' ')[-1].strip())
nhl_df['Region'] = nhl_df['team_last_word'].apply(lambda x: extract_region(x))

# Transform the Win and Loss Columns from object to numeric
nhl_df['W'] = nhl_df['W'].astype(int)
nhl_df['L'] = nhl_df['L'].astype(int)

# Estimate the Win/(Win+Loss) Ratio column
nhl_df['Win/Loss Ratio'] = nhl_df['W'] / (nhl_df['W'] + nhl_df['L'])
nhl_df.set_index('Region', inplace=True)

# groupby() the region column and estimate the mean ratio per region
report_list = []
for group, frame in nhl_df.groupby(level=nhl_df.index.name):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nhl_city_population DataFrames
merged_df_nhl = pd.merge(report_df, nhl_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df_nhl['Population (2016 est.)[8]'] = merged_df_nhl['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_nhl = list(merged_df_nhl['Population (2016 est.)[8]'].values)
win_loss_by_region_nhl = list(merged_df_nhl['Win-Loss Ratio'].values)

# ANSWER FUNCTION 1
def nhl_correlation(): 
    global population_by_region_nhl
    global win_loss_by_region_nhl
    
    assert len(population_by_region_nhl) == len(win_loss_by_region_nhl), "Q1: Your lists must be the same length"
    assert len(population_by_region_nhl) == 28, "Q1: There should be 28 teams being analysed for NHL"
    
    corr, pval = stats.pearsonr(population_by_region_nhl, win_loss_by_region_nhl)
    return corr


In [25]:
nhl_correlation()

0.012486162921209907

### NHL - National Hockey League

Variable Description

- **GP**: Number of games played
- **W**: Number of games won
- **L**: Number of games lost
- **OL**: How many games were lost in overtime or in a shootout
- **PTS**: Total points
- **PTS%**: Percentage of total points earned from the points available
- **GF**: Number of goals scored by the team
- **GA**: Number of goals scored against the team
- **SRS**: Simple Rating System; a rating that takes into account average goal differential and strength of schedule. The rating is denominated in goals above/below average, where zero is average.
- **SOS**: Strength of Schedule; a rating of strength of schedule. The rating is denominated in goals above/below average, where zero is average. 
- **RPt%**: "Real" Points percentage. It counts 0 points for an OT loss, and any shootout game as a tie (1 point). In other words, the pre-2000 situation.
- **ROW**: Team's total number of regulation and overtime wins.
- **Year**: Year that the season occurred. Since the NHL season is split over two calendar years, the year given is the last year for that season. For example, the year for the 2008-09 season would be 2009.
- **League**: League of the Big 4 sport.
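
To show how these columns relate, here is a small sketch on an invented season line. It assumes the usual NHL scoring of 2 points per win and 1 point per overtime or shootout loss, which is not stated in the table above, and contrasts **PTS%** with the win/loss ratio used in the correlation questions.

```python
# Toy season line, not taken from assets/nhl.csv.
games_played, wins, losses, overtime_losses = 82, 50, 25, 7

# Assumed NHL scoring: 2 points per win, 1 point per overtime/shootout loss.
points = 2 * wins + overtime_losses
points_pct = points / (2 * games_played)  # share of the 2 points available per game

# Ratio used for the correlation questions: OL games are left out entirely.
win_loss_ratio = wins / (wins + losses)

print(points, round(points_pct, 3), round(win_loss_ratio, 3))
```

The two measures differ because W / (W + L) ignores the **OL** column, so it is not interchangeable with **PTS%**.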


In [3]:
def remove_square_brackets(data):
    # Find the first bracketed marker such as "[note 1]" and remove it; return the string unchanged if there is none
    match = re.search(r"\[(.*?)\]", data)
    if match is None:
        return data
    return data.replace(match.group(), '')
    
    
def remove_asterisk(data):
    # Remove a trailing '*' if there is one; otherwise return the string unchanged
    return data.rstrip('*')


def extract_region(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nhl_city_population['NHL'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return nhl_city_population[nhl_city_population['NHL'] == teams].index[0]

In [4]:
# Data cleaning in the cities DataFrame
cities['NHL'].apply(lambda x: remove_square_brackets(x))
# Select Year = 2018 and drop irrelevant rows
nhl_df = nhl_df[nhl_df['year'] == 2018].drop([0, 9, 18, 26], axis=0)
nhl_df.head()

KeyError: '[0 9 18 26] not found in axis'

In [16]:
# Select the required columns and create nhl_city_population DataFrame and clean the data
nhl_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NHL']]
nhl_city_population.set_index('Metropolitan area', inplace=True)
nhl_city_population['NHL'] = nhl_city_population['NHL'].apply(lambda x: remove_square_brackets(x))
nhl_city_population['NHL'] = nhl_city_population['NHL'].replace(to_replace=['—', ''], value=np.nan)
nhl_city_population.dropna(inplace=True, axis=0)
nhl_city_population.head()

TypeError: expected string or bytes-like object

In [None]:
# Clean the data and split the team column to extract the region of the team 
nhl_df['team'] = nhl_df['team'].apply(lambda x: remove_asterisk(x))
nhl_df['team_last_word'] = nhl_df['team'].apply(lambda x: x.split(' ')[-1].strip())
nhl_df['Region'] = nhl_df['team_last_word'].apply(lambda x: extract_region(x))

# Transform the Win and Loss Columns from object to numeric
nhl_df['W'] = nhl_df['W'].astype(int)
nhl_df['L'] = nhl_df['L'].astype(int)

# Estimate the Win/(Win+Loss) Ratio column
nhl_df['Win/Loss Ratio'] = nhl_df['W'] / (nhl_df['W'] + nhl_df['L'])
nhl_df.set_index('Region', inplace=True)
nhl_df.head()

In [None]:
# groupby() the region column and estimate the mean ratio per region
report_list = []
for group, frame in nhl_df.groupby(level=nhl_df.index.name):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nhl_city_population DataFrames
merged_df = pd.merge(report_df, nhl_city_population, how='inner', left_index=True, right_index=True)

In [None]:
# Transform Population Columns from object to numeric
merged_df['Population (2016 est.)[8]'] = merged_df['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region = list(merged_df['Population (2016 est.)[8]'].values)
win_loss_by_region = list(merged_df['Win-Loss Ratio'].values)

# Estimate the correlation and the p-value using scipy.stats
corr, pval = stats.pearsonr(population_by_region, win_loss_by_region)
print('corr:', round(corr, 1))
print('p-value:', pval)

## Question 2
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NBA** using **2018** data.

In [5]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re

nba_df=pd.read_csv("assets/nba.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]


def extract_region_two(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nba_city_population['NBA'].values):
        # If the team is found, return the Metropolitan area (index of the DataFrame)
        if team_last_word in teams:
            return nba_city_population[nba_city_population['NBA'] == teams].index[0]

        
def nba_return_merged_data():
    return merged_df_nba        


# NBA_DF
# Remove the rows that only hold division information
nba_df = nba_df[~nba_df['team'].str.contains('Division')]

# Keep only the text before '\xa0(' in the team column
nba_df['team'] = nba_df['team'].apply(lambda x: x.split('\xa0(')[0])

# Remove * character from the end of the title column data if any
nba_df['team'] = nba_df['team'].apply(lambda x: remove_asterisk(x).strip())

# Select only the year = 2018
nba_df = nba_df[nba_df['year'] == 2018]

# Transform the Win and Loss Columns from object to numeric
nba_df['W'] = nba_df['W'].astype(int)
nba_df['L'] = nba_df['L'].astype(int)

# Estimate the Win/(Win + Loss) Ratio Column
nba_df['Win/Loss Ratio'] = nba_df['W'] / (nba_df['W'] + nba_df['L'])

# Select the required columns, create the nba_city_population DataFrame and clean the data
# (.copy() makes it an independent frame, so the in-place edits below do not warn about a slice of cities)
nba_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NBA']].copy()
nba_city_population.set_index('Metropolitan area', inplace=True)
nba_city_population['NBA'] = nba_city_population['NBA'].apply(lambda x: remove_square_brackets(x))
nba_city_population['NBA'] = nba_city_population['NBA'].replace(to_replace=['—', ''], value=np.nan)
nba_city_population.dropna(inplace=True, axis=0)

# Extract the team last name from the title column
nba_df['team_last_word'] = nba_df['team'].apply(lambda x: x.split(' ')[-1].strip())

# Create the region data from the title column
nba_df['Region'] = nba_df['team_last_word'].apply(lambda x: extract_region_two(x))

# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in nba_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nba_city_population DataFrames
merged_df_nba = pd.merge(report_df, nba_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df_nba['Population (2016 est.)[8]'] = merged_df_nba['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_nba = list(merged_df_nba['Population (2016 est.)[8]'].values)
win_loss_by_region_nba = list(merged_df_nba['Win-Loss Ratio'].values)


def nba_correlation():
    # YOUR CODE HERE
    global population_by_region_nba
    global win_loss_by_region_nba
    
    assert len(population_by_region_nba) == len(win_loss_by_region_nba), "Q2: Your lists must be the same length"
    assert len(population_by_region_nba) == 28, "Q2: There should be 28 teams being analysed for NBA"

    corr, pval = stats.pearsonr(population_by_region_nba, win_loss_by_region_nba)
    return corr


In [6]:
nba_correlation()

-0.17657160252844617

In [None]:
# NBA_DF
# Remove the rows that only hold division information
nba_df = nba_df[~nba_df['team'].str.contains('Division')]

# Keep only the text before '\xa0(' in the team column
nba_df['team'] = nba_df['team'].apply(lambda x: x.split('\xa0(')[0])

# Remove * character from the end of the title column data if any
nba_df['team'] = nba_df['team'].apply(lambda x: remove_asterisk(x).strip())

# Select only the year = 2018
nba_df = nba_df[nba_df['year'] == 2018]

# Transform the Win and Loss Columns from object to numeric
nba_df['W'] = nba_df['W'].astype(int)
nba_df['L'] = nba_df['L'].astype(int)

# Estimate the Win/(Win + Loss) Ratio Column
nba_df['Win/Loss Ratio'] = nba_df['W'] / (nba_df['W'] + nba_df['L'])

In [None]:
# Select the required columns, create the nba_city_population DataFrame and clean the data
# (.copy() makes it an independent frame, so the in-place edits below do not warn about a slice of cities)
nba_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NBA']].copy()
nba_city_population.set_index('Metropolitan area', inplace=True)
nba_city_population['NBA'] = nba_city_population['NBA'].apply(lambda x: remove_square_brackets(x))
nba_city_population['NBA'] = nba_city_population['NBA'].replace(to_replace=['—', ''], value=np.nan)
nba_city_population.dropna(inplace=True, axis=0)

In [None]:
# Extract the team last name from the title column
nba_df['team_last_word'] = nba_df['team'].apply(lambda x: x.split(' ')[-1].strip())

def extract_region_two(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nba_city_population['NBA'].values):
        # If the team is found, return the Metropolitan area (index of the DataFrame)
        if team_last_word in teams:
            return nba_city_population[nba_city_population['NBA'] == teams].index[0]

# Create the region data from the title column
nba_df['Region'] = nba_df['team_last_word'].apply(lambda x: extract_region_two(x))

In [None]:
# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in nba_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nba_city_population DataFrames
merged_df = pd.merge(report_df, nba_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df['Population (2016 est.)[8]'] = merged_df['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region = list(merged_df['Population (2016 est.)[8]'].values)
win_loss_by_region = list(merged_df['Win-Loss Ratio'].values)


In [None]:
# Estimate the correlation and the p-value using scipy.stats
corr, pval = stats.pearsonr(population_by_region, win_loss_by_region)
print('corr:', corr)
print('p-value:', pval)

## Question 3
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **MLB** using **2018** data.

In [15]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re


# Functions
def extract_last_words_mlb(data):
    # The last word is usually enough to identify a team, but 'Sox' alone is ambiguous
    # (Red Sox vs White Sox), so keep the last two words for those teams
    words = data.split(' ')
    if words[-1] == 'Sox':
        return ' '.join(words[-2:])
    return words[-1]


def extract_region_three(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(mlb_city_population['MLB'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return mlb_city_population[mlb_city_population['MLB'] == teams].index[0]

        
def mlb_return_merged_data():
    return merged_df_mlb


######################
# Dataframes
mlb_df=pd.read_csv("assets/mlb.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

# Select only the year = 2018
mlb_df = mlb_df[mlb_df['year'] == 2018]

# Estimate the Win/(Win + Loss) Ratio Column
mlb_df['Win/Loss Ratio'] = mlb_df['W'] / (mlb_df['W'] + mlb_df['L'])

# Data cleaning in the cities DataFrame
cities['MLB'] = cities['MLB'].apply(lambda x: remove_square_brackets(x))
cities.replace(to_replace=['—', ''], value=np.nan, inplace=True)

# Select the required columns, create the mlb_city_population DataFrame and drop regions without an MLB team
mlb_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'MLB']]
mlb_city_population.set_index('Metropolitan area', inplace=True)
mlb_city_population.dropna(inplace=True, axis=0)

# Extract the team name or last name from the team column
mlb_df['team_last_name'] = mlb_df['team'].apply(lambda x: extract_last_words_mlb(x))

# Create the region data from the title column and set it as the index of the df
mlb_df['Region'] = mlb_df['team_last_name'].apply(lambda x: extract_region_three(x))
mlb_df.set_index('Region', inplace=True)

# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in mlb_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and mlb_city_population DataFrames
merged_df_mlb = pd.merge(report_df, mlb_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df_mlb['Population (2016 est.)[8]'] = merged_df_mlb['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_mlb = list(merged_df_mlb['Population (2016 est.)[8]'].values)
win_loss_by_region_mlb = list(merged_df_mlb['Win-Loss Ratio'].values)


def mlb_correlation(): 
    # YOUR CODE HERE
    global population_by_region_mlb
    global win_loss_by_region_mlb

    assert len(population_by_region_mlb) == len(win_loss_by_region_mlb), "Q3: Your lists must be the same length"
    assert len(population_by_region_mlb) == 26, "Q3: There should be 26 teams being analysed for MLB"

    corr, pval = stats.pearsonr(population_by_region_mlb, win_loss_by_region_mlb)
    
    return corr


In [16]:
mlb_correlation()

0.15027698302669307

In [23]:
# MLB_DF

mlb_df=pd.read_csv("assets/mlb.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

# Select only the year = 2018
mlb_df = mlb_df[mlb_df['year'] == 2018]

# Estimate the Win/(Win + Loss) Ratio Column
mlb_df['Win/Loss Ratio'] = mlb_df['W'] / (mlb_df['W'] + mlb_df['L'])

# Data cleaning in the cities DataFrame
cities['MLB'] = cities['MLB'].apply(lambda x: remove_square_brackets(x))
cities.replace(to_replace=['—', ''], value=np.nan, inplace=True)

# Select the required columns, create the mlb_city_population DataFrame and drop regions without an MLB team
mlb_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'MLB']]
mlb_city_population.set_index('Metropolitan area', inplace=True)
mlb_city_population.dropna(inplace=True, axis=0)


def extract_last_words_mlb(data):
    # The last word is usually enough to identify a team, but 'Sox' alone is ambiguous
    # (Red Sox vs White Sox), so keep the last two words for those teams
    words = data.split(' ')
    if words[-1] == 'Sox':
        return ' '.join(words[-2:])
    return words[-1]


# Extract the team name or last name from the team column
mlb_df['team_last_name'] = mlb_df['team'].apply(lambda x: extract_last_words_mlb(x))

#print('DF length: ', len(mlb_city_population))
#print(mlb_city_population)
#print()
#print('DF length: ', len(mlb_df))
print(mlb_df)

                     team    W    L   W-L%    GB  year League  Win/Loss Ratio  \
0          Boston Red Sox  108   54  0.667    --  2018    MLB        0.666667   
1        New York Yankees  100   62  0.617   8.0  2018    MLB        0.617284   
2          Tampa Bay Rays   90   72  0.556  18.0  2018    MLB        0.555556   
3       Toronto Blue Jays   73   89  0.451  35.0  2018    MLB        0.450617   
4       Baltimore Orioles   47  115  0.290  61.0  2018    MLB        0.290123   
5       Cleveland Indians   91   71  0.562    --  2018    MLB        0.561728   
6         Minnesota Twins   78   84  0.481  13.0  2018    MLB        0.481481   
7          Detroit Tigers   64   98  0.395  27.0  2018    MLB        0.395062   
8       Chicago White Sox   62  100  0.383  29.0  2018    MLB        0.382716   
9      Kansas City Royals   58  104  0.358  33.0  2018    MLB        0.358025   
10         Houston Astros  103   59  0.636    --  2018    MLB        0.635802   
11      Oakland Athletics   

In [31]:
def extract_region_three(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(mlb_city_population['MLB'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return mlb_city_population[mlb_city_population['MLB'] == teams].index[0]

        
# Create the region data from the title column and set it as the index of the df
mlb_df['Region'] = mlb_df['team_last_name'].apply(lambda x: extract_region_three(x))
mlb_df.set_index('Region', inplace=True)

In [33]:
# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in mlb_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and mlb_city_population DataFrames
merged_df = pd.merge(report_df, mlb_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df['Population (2016 est.)[8]'] = merged_df['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_mlb = list(merged_df['Population (2016 est.)[8]'].values)
win_loss_by_region_mlb = list(merged_df['Win-Loss Ratio'].values)

(0.15027698302669307, 0.46370703378875583)

## Question 4
For this question, calculate the win/loss ratio's correlation with the population of the city it is in for the **NFL** using **2018** data.

In [9]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re

# Imports
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]


# Functions
def remove_character(data, char='*'):
    # Remove any trailing characters listed in `char` (e.g. '*' or '+'); otherwise return the string unchanged
    return data.rstrip(char)


def extract_region_four(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nfl_city_population['NFL'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return nfl_city_population[nfl_city_population['NFL'] == teams].index[0]

        
def nfl_return_merged_data():
    return merged_df_nfl


# Only include data from year 2018
nfl_df = nfl_df[nfl_df['year'] == 2018]

# Remove the rows that contain either 'AFC' or 'NFC' (any three capital letters in the DSRS column)
nfl_df = nfl_df[~nfl_df['DSRS'].str.contains(pat='[A-Z]{3}', regex=True)]

# Remove the characters ['*', '+'] from the team name
nfl_df['team'] = nfl_df['team'].apply(lambda x: remove_character(x, '*+'))

# Transform nfl_df Win and Loss from str() to int()
nfl_df['W'] = nfl_df['W'].astype(int)
nfl_df['L'] = nfl_df['L'].astype(int)

# Estimate the Win/(Win + Loss) Ratio Column
nfl_df['Win/Loss Ratio'] = nfl_df['W'] / (nfl_df['W'] + nfl_df['L'])

# Remove [] from cities dataframe
cities['NFL'] = cities['NFL'].apply(lambda x: remove_square_brackets(x))
cities.replace(to_replace=['—', ''], value=np.nan, inplace=True)

# Select the required columns, create the nfl_city_population DataFrame and drop regions without an NFL team
nfl_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NFL']]
nfl_city_population.set_index('Metropolitan area', inplace=True)
nfl_city_population.dropna(inplace=True, axis=0)
nfl_city_population.drop('Toronto', inplace=True)

# Create the region data from the title column and set it as the index of the df
nfl_df['team_last_name'] = nfl_df['team'].apply(lambda x: x.split(' ')[-1])
nfl_df['Region'] = nfl_df['team_last_name'].apply(lambda x: extract_region_four(x))
nfl_df.set_index('Region', inplace=True)

# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in nfl_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nfl_city_population DataFrames
merged_df_nfl = pd.merge(report_df, nfl_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df_nfl['Population (2016 est.)[8]'] = merged_df_nfl['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_nfl = list(merged_df_nfl['Population (2016 est.)[8]'].values)
win_loss_by_region_nfl = list(merged_df_nfl['Win-Loss Ratio'].values)


def nfl_correlation(): 
    # YOUR CODE HERE
    global population_by_region_nfl
    global win_loss_by_region_nfl

    assert len(population_by_region_nfl) == len(win_loss_by_region_nfl), "Q4: Your lists must be the same length"
    assert len(population_by_region_nfl) == 29, "Q4: There should be 29 teams being analysed for NFL"

    corr, pval = stats.pearsonr(population_by_region_nfl, win_loss_by_region_nfl)
    
    return corr


In [10]:
nfl_correlation()

0.004922112149349393

In [92]:
# Imports
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]


# Functions
def remove_character(data, char='*'):
    # Remove any trailing characters listed in `char` (e.g. '*' or '+'); otherwise return the string unchanged
    return data.rstrip(char)


def extract_region_four(team_last_word):
    # Search for the team last name in all the teams
    for teams in list(nfl_city_population['NFL'].values):
        # If the team is found, return the Metropolitan area
        if team_last_word in teams:
            return nfl_city_population[nfl_city_population['NFL'] == teams].index[0]


# Only include data from year 2018
nfl_df = nfl_df[nfl_df['year'] == 2018]

# Remove the rows that contain either 'AFC' or 'NFC' (any three capital letters in the DSRS column)
nfl_df = nfl_df[~nfl_df['DSRS'].str.contains(pat='[A-Z]{3}', regex=True)]

# Remove the characters ['*', '+'] from the team name
nfl_df['team'] = nfl_df['team'].apply(lambda x: remove_character(x, '*+'))

# Transform nfl_df Win and Loss from str() to int()
nfl_df['W'] = nfl_df['W'].astype(int)
nfl_df['L'] = nfl_df['L'].astype(int)

# Estimate the Win/(Win + Loss) Ratio Column
nfl_df['Win/Loss Ratio'] = nfl_df['W'] / (nfl_df['W'] + nfl_df['L'])

# Remove [] from cities dataframe
cities['NFL'] = cities['NFL'].apply(lambda x: remove_square_brackets(x))
cities.replace(to_replace=['—', ''], value=np.nan, inplace=True)

# Select the required columns, create the nfl_city_population DataFrame and drop regions without an NFL team
nfl_city_population = cities[['Metropolitan area', 'Population (2016 est.)[8]', 'NFL']]
nfl_city_population.set_index('Metropolitan area', inplace=True)
nfl_city_population.dropna(inplace=True, axis=0)
nfl_city_population.drop('Toronto', inplace=True)

# Create the region data from the title column and set it as the index of the df
nfl_df['team_last_name'] = nfl_df['team'].apply(lambda x: x.split(' ')[-1])
nfl_df['Region'] = nfl_df['team_last_name'].apply(lambda x: extract_region_four(x))
nfl_df.set_index('Region', inplace=True)

# groupby() the region column and estimate the mean ratio per region
report_list = []

for group, frame in nfl_df.groupby(by='Region'):
    report_list.append({'Region': group, 'Win-Loss Ratio': frame['Win/Loss Ratio'].mean()})

# Create DataFrame from the report
report_df = pd.DataFrame(report_list).set_index('Region')

# Merge report_df and nfl_city_population DataFrames
merged_df = pd.merge(report_df, nfl_city_population, how='inner', left_index=True, right_index=True)

# Transform Population Columns from object to numeric
merged_df['Population (2016 est.)[8]'] = merged_df['Population (2016 est.)[8]'].astype(int)

# Create the lists of the population and the ratio to estimate the correlation and p-value
population_by_region_nfl = list(merged_df['Population (2016 est.)[8]'].values)
win_loss_by_region_nfl = list(merged_df['Win-Loss Ratio'].values)

## Question 5
In this question I would like you to explore the hypothesis that **given that an area has two sports teams in different sports, those teams will perform the same within their respective sports**. I would like to see this explored with a series of paired t-tests (so use [`ttest_rel`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ttest_rel.html)) between all pairs of sports. Are there any pairs of sports for which we can reject the null hypothesis? Again, average values where a sport has multiple teams in one region. Remember that, for each sport, you should include only the cities which have teams engaged in that sport; drop the others as appropriate. This question is worth 20% of the grade for this assignment.
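
To make the pairing concrete, here is a minimal sketch of one such test for a single pair of sports. The regions and ratios are invented toy values, and the `_nba`/`_nhl` suffixes are an illustrative choice; the point is that an inner join on the region index keeps only the regions with teams in both sports before `ttest_rel` is applied.

```python
import pandas as pd
from scipy import stats

# Hypothetical per-region ratios indexed by metropolitan area (toy values)
nba = pd.DataFrame({'Win-Loss Ratio': [0.55, 0.40, 0.62, 0.48]},
                   index=['Boston', 'Chicago', 'Denver', 'Phoenix'])
nhl = pd.DataFrame({'Win-Loss Ratio': [0.61, 0.45, 0.58]},
                   index=['Boston', 'Chicago', 'Denver'])

# An inner join on the index keeps only regions present in both sports,
# so Phoenix (no NHL entry in this toy data) is dropped
paired = pd.merge(nba, nhl, how='inner', left_index=True, right_index=True,
                  suffixes=('_nba', '_nhl'))

# Paired t-test: each region contributes one (NBA, NHL) pair of ratios
t_stat, p_value = stats.ttest_rel(paired['Win-Loss Ratio_nba'],
                                  paired['Win-Loss Ratio_nhl'])
print(paired)
print(f"p = {p_value:.2f}")
```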

In [48]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re


# Import data
mlb_df=pd.read_csv("assets/mlb.csv")
nhl_df=pd.read_csv("assets/nhl.csv")
nba_df=pd.read_csv("assets/nba.csv")
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

#####################################################################################
# Functions
def remove_square_brackets(data):
    # Find the first bracketed span, e.g. "[note 1]", and remove it;
    # if there is none, return the string unchanged
    match = re.search(r"\[(.*?)\]", data)
    if match is None:
        return data
    return data.replace(match.group(), '')
        
        
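def estimate_pvalue_for_sport(data):
    # Return the paired t-test p-values of sport `data` against every sport in `sports`.
    # Relies on the module-level `dfs` and `sports` defined below.
    p_values_list = []
    for sport in sports:
        # The inner join keeps only the regions that have teams in both sports
        df = pd.merge(left=dfs[data], right=dfs[sport], how='inner', left_index=True, right_index=True)
        # ttest_rel returns (statistic, p-value); comparing a sport with itself yields NaN
        p_value = stats.ttest_rel(df['Win-Loss Ratio_x'], df['Win-Loss Ratio_y'])[1]
        p_value = round(p_value, 2)
        p_values_list.append(p_value)
    return p_values_list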
        
# Import the merged data from Q1 - Q4 and drop the population column
nba_merged_df = nba_return_merged_data().drop(columns='Population (2016 est.)[8]')
nfl_merged_df = nfl_return_merged_data().drop(columns='Population (2016 est.)[8]')
nhl_merged_df = nhl_return_merged_data().drop(columns='Population (2016 est.)[8]')
mlb_merged_df = mlb_return_merged_data().drop(columns='Population (2016 est.)[8]')

# Clean the columns from cities dataframe
# Remove the square brackets '[]'
cities['NFL'] = cities['NFL'].apply(lambda x: remove_square_brackets(x))
cities['MLB'] = cities['MLB'].apply(lambda x: remove_square_brackets(x))
cities['NHL'] = cities['NHL'].apply(lambda x: remove_square_brackets(x))
cities['NBA'] = cities['NBA'].apply(lambda x: remove_square_brackets(x))

# Replace the placeholder values ['—', '— ', ''] with NaN
cities.replace(to_replace=['—', '— ', ''], value=np.nan, inplace=True)

# Create a dictionary to hold the DataFrames with their name as key
dfs = {'NFL': nfl_merged_df,
       'NBA': nba_merged_df,
       'MLB': mlb_merged_df,
       'NHL': nhl_merged_df}

sports = ['NFL', 'NBA', 'NHL', 'MLB']


def sports_team_performance():
    # YOUR CODE HERE
    #raise NotImplementedError()
    
    # Note: p_values is a full dataframe, so df.loc["NFL","NBA"] should be the same as df.loc["NBA","NFL"] and
    # df.loc["NFL","NFL"] should return np.nan
    sports = ['NFL', 'NBA', 'NHL', 'MLB']
    p_values = pd.DataFrame({k: estimate_pvalue_for_sport(k) for k in sports}, index=sports)
    
    assert abs(p_values.loc["NBA", "NHL"] - 0.02) <= 1e-2, "The NBA-NHL p-value should be around 0.02"
    assert abs(p_values.loc["MLB", "NFL"] - 0.80) <= 1e-2, "The MLB-NFL p-value should be around 0.80"
    return p_values


In [49]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import re


# Import data
mlb_df=pd.read_csv("assets/mlb.csv")
nhl_df=pd.read_csv("assets/nhl.csv")
nba_df=pd.read_csv("assets/nba.csv")
nfl_df=pd.read_csv("assets/nfl.csv")
cities=pd.read_html("assets/wikipedia_data.html")[1]
cities=cities.iloc[:-1,[0,3,5,6,7,8]]

#####################################################################################
# Functions
def remove_square_brackets(data):
    # Find the first bracketed span, e.g. "[note 1]", and remove it;
    # if there is none, return the string unchanged
    match = re.search(r"\[(.*?)\]", data)
    if match is None:
        return data
    return data.replace(match.group(), '')
        
# Import the merged data from Q1 - Q4 and drop the population column
nba_merged_df = nba_return_merged_data().drop(columns='Population (2016 est.)[8]')
nfl_merged_df = nfl_return_merged_data().drop(columns='Population (2016 est.)[8]')
nhl_merged_df = nhl_return_merged_data().drop(columns='Population (2016 est.)[8]')
mlb_merged_df = mlb_return_merged_data().drop(columns='Population (2016 est.)[8]')

# Clean the columns from cities dataframe
# Remove the square brackets '[]'
cities['NFL'] = cities['NFL'].apply(lambda x: remove_square_brackets(x))
cities['MLB'] = cities['MLB'].apply(lambda x: remove_square_brackets(x))
cities['NHL'] = cities['NHL'].apply(lambda x: remove_square_brackets(x))
cities['NBA'] = cities['NBA'].apply(lambda x: remove_square_brackets(x))
# Replace the placeholder values ['—', '— ', ''] with NaN
cities.replace(to_replace=['—', '— ', ''], value=np.nan, inplace=True)

# Create a dictionary to hold the DataFrames with their name as key
dfs = {'NFL': nfl_merged_df,
      'NBA': nba_merged_df,
      'MLB': mlb_merged_df,
      'NHL': nhl_merged_df}

sports = ['NFL', 'NBA', 'NHL', 'MLB']


def estimate_pvalue_for_sport(data):
    p_values_list = []
    for sport in sports:
        df = pd.merge(left=dfs[data], right=dfs[sport], how='inner', left_index=True, right_index=True)
        p_value = stats.ttest_rel(df['Win-Loss Ratio_x'], df['Win-Loss Ratio_y'])[1]
        p_value = round(p_value, 2)
        p_values_list.append(p_value)
    
    return p_values_list

sports_team_performance()

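|     | NFL  | NBA  | NHL  | MLB  |
|-----|------|------|------|------|
| NFL | NaN  | 0.94 | 0.03 | 0.80 |
| NBA | 0.94 | NaN  | 0.02 | 0.95 |
| NHL | 0.03 | 0.02 | NaN  | 0.00 |
| MLB | 0.80 | 0.95 | 0.00 | NaN  |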
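
To read the result against the question "are there any sports where we can reject the null hypothesis?", one option is to list the pairs whose p-value falls below a chosen threshold. The sketch below re-enters the rounded p-values reported above; the significance level `alpha = 0.05` is an assumption, since the assignment does not fix one.

```python
import itertools
import numpy as np
import pandas as pd

sports = ['NFL', 'NBA', 'NHL', 'MLB']

# Rounded p-values copied from the output above
p_values = pd.DataFrame(
    [[np.nan, 0.94, 0.03, 0.80],
     [0.94, np.nan, 0.02, 0.95],
     [0.03, 0.02, np.nan, 0.00],
     [0.80, 0.95, 0.00, np.nan]],
    index=sports, columns=sports)

alpha = 0.05  # assumed significance level, not specified by the assignment

# The matrix is symmetric, so visit each unordered pair only once
rejected = [(a, b) for a, b in itertools.combinations(sports, 2)
            if p_values.loc[a, b] < alpha]
print(rejected)
# [('NFL', 'NHL'), ('NBA', 'NHL'), ('NHL', 'MLB')]
```

Under that assumed threshold, every pair for which the null hypothesis is rejected involves the NHL. Because the values are rounded to two decimals, the 0.00 entry should be read as "below 0.005" rather than exactly zero.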
