## Description
In this assignment we will read a file of metropolitan regions and associated sports teams from [assets/wikipedia_data.html](assets/wikipedia_data.html) and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in [assets/nfl.csv](assets/nfl.csv)), MLB (baseball, in [assets/mlb.csv](assets/mlb.csv)), NBA (basketball, in [assets/nba.csv](assets/nba.csv)), or NHL (hockey, in [assets/nhl.csv](assets/nhl.csv)). Keep in mind that all of the analysis is done from the perspective of the metropolitan region, and that the Wikipedia file is the "source of authority" for the location of a given sports team. Thus teams that are commonly associated with a different area (e.g. "Oakland Raiders") need to be mapped to the metropolitan region given (e.g. San Francisco Bay Area). This will require some human understanding of the data beyond what we have been given.

For each sport we are going to answer the question: **what is the win/loss ratio's correlation with the population of the city it is in?** Here, win/loss ratio means the number of wins divided by the number of wins plus the number of losses.
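To make the definition concrete, here is a minimal sketch of the ratio, W / (W + L), and of a Pearson correlation against population. The numbers in `toy` are made up for illustration and are not taken from the assignment data. The sketch uses `Series.corr`, which is Pearson by default; `scipy.stats.pearsonr`, from the library the notebook imports, is another option.

```python
import pandas as pd

# Made-up numbers for illustration only; not taken from the assignment files.
toy = pd.DataFrame({
    "population": [1_000_000, 2_000_000, 3_000_000, 4_000_000],
    "w": [30, 40, 50, 45],
    "l": [50, 40, 30, 35],
})

# Win/loss ratio as defined above: wins over wins plus losses.
toy["win_ratio"] = toy["w"] / (toy["w"] + toy["l"])

# Pearson correlation between population and win/loss ratio.
r = toy["population"].corr(toy["win_ratio"])
print(toy[["population", "win_ratio"]])
print(r)
```

In the real analysis each row would be a metropolitan area, so areas with several teams in one league would need their ratios combined (for example averaged) before correlating.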



### Importing and cleaning cities data from Wikipedia HTML
In this section we imported the cities dataframe from the Wikipedia HTML file, reduced it to the 6 relevant columns, deleted the totals row, and cleaned the column names.
Additionally, we removed the footnote markers (e.g. `[note 1]`) from the team columns.

In [44]:
# Importing required libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import re

# Importing cities dataset from wikipedia
cities = pd.read_html('assets/wikipedia_data.html')[1]

cities.head()

Unnamed: 0,Metropolitan area,Country,Pop.rank,Population (2016 est.)[8],B4,NFL,MLB,NBA,NHL,B6,MLS,CFL
0,New York City,United States,1,20153634,9,GiantsJets[note 1],YankeesMets[note 2],KnicksNets,RangersIslandersDevils[note 3],11,Red BullsNew York City FC,—
1,Los Angeles,United States,2,13310447,8,RamsChargers[note 4],DodgersAngels,LakersClippers,KingsDucks,10,GalaxyLos Angeles FC[note 5],—
2,San Francisco Bay Area,United States,6,6657982,6,49ersRaiders[note 6],GiantsAthletics,Warriors,Sharks[note 7],7,Earthquakes,—
3,Chicago,United States,3,9512999,5,Bears[note 8],CubsWhite Sox,Bulls[note 9],Blackhawks,6,Fire,—
4,Dallas–Fort Worth,United States,4,7233323,4,Cowboys,Rangers,Mavericks,Stars,5,FC Dallas,—


In [45]:
# Reducing rows and columns from cities dataset (removing last row 'totals', and keeping relevant columns)
# .copy() gives an independent dataframe, so the later assignments do not act on a slice
cities = cities.iloc[:-1, [0, 3, 5, 6, 7, 8]].copy()

# Cleaning column names
cities.columns = [x.lower().strip() for x in cities.columns]
cities.rename({'population (2016 est.)[8]': 'population'}, axis=1, inplace=True)

# Cleaning notes (e.g. '[note 1]') from every column, including the team columns
for col in cities.columns:
    cities[col] = cities[col].replace(r'\[.+\]', '', regex=True)

cities.head()

Unnamed: 0,metropolitan area,population,nfl,mlb,nba,nhl
0,New York City,20153634,GiantsJets,YankeesMets,KnicksNets,RangersIslandersDevils
1,Los Angeles,13310447,RamsChargers,DodgersAngels,LakersClippers,KingsDucks
2,San Francisco Bay Area,6657982,49ersRaiders,GiantsAthletics,Warriors,Sharks
3,Chicago,9512999,Bears,CubsWhite Sox,Bulls,Blackhawks
4,Dallas–Fort Worth,7233323,Cowboys,Rangers,Mavericks,Stars
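The preview above shows that each league cell holds several team nicknames run together (e.g. `RangersIslandersDevils`). Here is a minimal sketch of one way to split them and look up the metropolitan area for a full team name. `split_teams`, `metro_area_for`, and the regex are hypothetical helpers, not part of the notebook. The pattern assumes every nickname starts with a capital letter or digit and that multi-word nicknames are separated by single spaces.

```python
import re
import pandas as pd

# Sample rows copied from the cleaned `cities` preview (NHL column only).
cities_sample = pd.DataFrame({
    "metropolitan area": ["New York City", "Los Angeles", "San Francisco Bay Area"],
    "nhl": ["RangersIslandersDevils", "KingsDucks", "Sharks"],
})

# Assumed pattern: a capital letter or digit, then lowercase letters/digits,
# optionally followed by more capitalized words separated by single spaces.
NICKNAME = re.compile(r"[A-Z0-9][a-z0-9]+(?: [A-Z][a-z]+)*")

def split_teams(cell):
    """Split a run-together cell such as 'CubsWhite Sox' into nicknames."""
    return NICKNAME.findall(cell)

# One row per (metropolitan area, nickname) pair.
nhl_by_area = (
    cities_sample.assign(nickname=cities_sample["nhl"].apply(split_teams))
    .explode("nickname")
    .dropna(subset=["nickname"])[["metropolitan area", "nickname"]]
    .reset_index(drop=True)
)

def metro_area_for(team_name, lookup):
    """Return the area whose nickname ends the full team name, or None."""
    for area, nickname in zip(lookup["metropolitan area"], lookup["nickname"]):
        if team_name.endswith(nickname):
            return area
    return None

print(metro_area_for("San Jose Sharks", nhl_by_area))
```

Matching on the nickname rather than the city is what lets a team such as "San Jose Sharks" land in the San Francisco Bay Area. The lookup has to be built per league, since a nickname like "Giants" appears under different areas in different leagues.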


### Importing and cleaning teams dataframes

In [46]:
# Importing league teams datasets
nhl_df = pd.read_csv('assets/nhl.csv')
nfl_df = pd.read_csv('assets/nfl.csv')
nba_df = pd.read_csv('assets/nba.csv')
mlb_df = pd.read_csv('assets/mlb.csv')

nhl_df.head()

Unnamed: 0,team,GP,W,L,OL,PTS,PTS%,GF,GA,SRS,SOS,RPt%,ROW,year,League
0,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,Atlantic Division,2018,NHL
1,Tampa Bay Lightning*,82,54,23,5,113,.689,296,236,0.66,-0.07,.634,48,2018,NHL
2,Boston Bruins*,82,50,20,12,112,.683,270,214,0.62,-0.07,.610,47,2018,NHL
3,Toronto Maple Leafs*,82,49,26,7,105,.640,277,232,0.49,-0.06,.567,42,2018,NHL
4,Florida Panthers,82,44,30,8,96,.585,248,246,-0.01,-0.04,.537,41,2018,NHL


### Exploring & Cleaning NHL dataset. Creating functions for the rest of the dfs

In [47]:
# Cleaning column names and filtering year function
def cols_year_cleaning(df, year=2018):
    '''
    Function that takes a dataset and cleans its column names by converting them to lower case and
    stripping surrounding whitespace. In addition, it keeps only the rows for the desired year.
    '''
    # Working on a copy so the caller's dataframe is not modified
    df = df.copy()
    # Cleaning column names
    df.columns = [x.strip().lower() for x in df.columns]
    # Keeping only the rows for the requested year
    df = df[df['year'] == year]
    return df

# Applying column cleaning and year filtering to nhl df
nhl_df = cols_year_cleaning(nhl_df, 2018)

# Reducing to desired columns function
def select_columns(df, cols):
    df = df[cols]
    return df

nhl_df = select_columns(nhl_df, ['team', 'w', 'l'])
# Cleaning any notes or symbols at the end of the team names



Unnamed: 0,team,w,l
0,Atlantic Division,Atlantic Division,Atlantic Division
1,Tampa Bay Lightning*,54,23
2,Boston Bruins*,50,20
3,Toronto Maple Leafs*,49,26
4,Florida Panthers,44,30
5,Detroit Red Wings,30,39
6,Montreal Canadiens,29,40
7,Ottawa Senators,28,43
8,Buffalo Sabres,25,45
9,Metropolitan Division,Metropolitan Division,Metropolitan Division
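The preview above still contains division header rows and a trailing `*` on some team names. Here is a minimal sketch of one way to clean both, using a few rows copied from that preview. `clean_team_rows` is a hypothetical helper, not part of the notebook. It assumes the only note symbol is a trailing run of `*` and that header rows are exactly those whose `w` and `l` values are not numeric.

```python
import pandas as pd

# Sample rows copied from the nhl_df preview; values are strings because the
# division header rows force the columns to object dtype.
nhl_sample = pd.DataFrame({
    "team": ["Atlantic Division", "Tampa Bay Lightning*", "Boston Bruins*", "Florida Panthers"],
    "w": ["Atlantic Division", "54", "50", "44"],
    "l": ["Atlantic Division", "23", "20", "30"],
})

def clean_team_rows(df):
    """Strip trailing '*' from team names and drop non-numeric header rows."""
    out = df.copy()
    # Assumption: notes appear only as a trailing run of '*' characters.
    out["team"] = out["team"].str.replace(r"\*+$", "", regex=True).str.strip()
    # Non-numeric entries (division headers) become NaN and are dropped.
    out["w"] = pd.to_numeric(out["w"], errors="coerce")
    out["l"] = pd.to_numeric(out["l"], errors="coerce")
    out = out.dropna(subset=["w", "l"]).reset_index(drop=True)
    out[["w", "l"]] = out[["w", "l"]].astype(int)
    return out

nhl_clean = clean_team_rows(nhl_sample)
print(nhl_clean)
```

Converting `w` and `l` to integers at this stage means the win/loss ratio can be computed directly afterwards. Other leagues may use different note symbols, so the pattern would need checking against each file.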
