## Description
In this assignment we will read a file of metropolitan regions and associated sports teams from [assets/wikipedia_data.html](assets/wikipedia_data.html) and answer some questions about each metropolitan region. Each of these regions may have one or more teams from the "Big 4": NFL (football, in [assets/nfl.csv](assets/nfl.csv)), MLB (baseball, in [assets/mlb.csv](assets/mlb.csv)), NBA (basketball, in [assets/nba.csv](assets/nba.csv)) or NHL (hockey, in [assets/nhl.csv](assets/nhl.csv)). Keep in mind that all of the analysis is done from the perspective of the metropolitan region, and that the Wikipedia file is the "source of authority" for the location of a given sports team. Teams that are commonly known by a different area (e.g. "Oakland Raiders") therefore need to be mapped to the metropolitan region given in that file (e.g. San Francisco Bay Area). This requires some human understanding of the data beyond what the files themselves provide.

For each sport we are going to answer the question: **what is the win/loss ratio's correlation with the population of the city it is in?** Here, win/loss ratio means the number of wins divided by the number of wins plus the number of losses.
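The definition above can be sketched in a few lines. This is only an illustration: the numbers are made up (they are not taken from the assignment's data files), the column names `w` and `l` are hypothetical, and it assumes Pearson's r is the intended correlation measure, which this section does not state.

```python
import pandas as pd
import scipy.stats as stats

# Made-up illustrative numbers, NOT values from the assignment's data files.
# Column names 'w' and 'l' are hypothetical placeholders for wins and losses.
toy = pd.DataFrame({
    'region': ['A', 'B', 'C', 'D'],
    'population': [1_000_000, 2_000_000, 4_000_000, 8_000_000],
    'w': [30, 45, 40, 50],
    'l': [52, 37, 42, 32],
})

# Win/loss ratio as defined above: wins / (wins + losses)
toy['win_loss_ratio'] = toy['w'] / (toy['w'] + toy['l'])

# Assumption: Pearson's r is used as the correlation measure
r, p_value = stats.pearsonr(toy['population'], toy['win_loss_ratio'])
print(f"correlation: {r:.3f}")
```

Note that the ratio is a share of decided games, so it always lies between 0 and 1, while the correlation lies between -1 and 1.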



### Importing and cleaning cities data from Wikipedia HTML
In this section we import the cities DataFrame from the Wikipedia HTML file, keep only the 6 relevant columns, delete the totals row, and clean the column names.
We also remove the footnote markers from the team columns.

In [30]:
# Import required libraries
import pandas as pd
import numpy as np
import scipy.stats as stats
import re

# Import the cities dataset from the Wikipedia HTML file (second table on the page)
cities = pd.read_html('assets/wikipedia_data.html')[1]

# Drop the last row ('totals') and keep only the relevant columns;
# .copy() avoids modifying a view of the original frame
cities = cities.iloc[:-1, [0, 3, 5, 6, 7, 8]].copy()

# Clean column names
cities.columns = [x.lower().strip() for x in cities.columns]
cities = cities.rename(columns={'population (2016 est.)[8]': 'population'})

# Remove bracketed notes from the team columns only
leagues = ['nfl', 'mlb', 'nba', 'nhl']
for league in leagues:
    cities[league] = cities[league].replace(r'\[.+\]', '', regex=True)

cities.head()

| | metropolitan area | population | nfl | mlb | nba | nhl |
|---|---|---|---|---|---|---|
| 0 | New York City | 20153634 | GiantsJets | YankeesMets | KnicksNets | RangersIslandersDevils |
| 1 | Los Angeles | 13310447 | RamsChargers | DodgersAngels | LakersClippers | KingsDucks |
| 2 | San Francisco Bay Area | 6657982 | 49ersRaiders | GiantsAthletics | Warriors | Sharks |
| 3 | Chicago | 9512999 | Bears | CubsWhite Sox | Bulls | Blackhawks |
| 4 | Dallas–Fort Worth | 7233323 | Cowboys | Rangers | Mavericks | Stars |
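
In the output above, regions with several teams in one league have the team names run together (e.g. `49ersRaiders`, `CubsWhite Sox`). The following is a minimal sketch of one way to split them and to look up the region of a team known by another area, such as "Oakland Raiders". The helper names `split_teams` and `region_for` and the regular expression are assumptions made for illustration, not part of the assignment, and the pattern is only checked against the rows shown above.

```python
import re
import pandas as pd

# NFL column copied from the cities.head() output above
sample = pd.DataFrame({
    'metropolitan area': ['New York City', 'Los Angeles', 'San Francisco Bay Area',
                          'Chicago', 'Dallas–Fort Worth'],
    'nfl': ['GiantsJets', 'RamsChargers', '49ersRaiders', 'Bears', 'Cowboys'],
})

# Assumption: a team name is one or more space-separated words, each starting
# with an uppercase letter or a digit (covers '49ers' and 'White Sox').
TEAM_PATTERN = re.compile(r'[A-Z0-9][a-z0-9]*(?: [A-Z0-9][a-z0-9]*)*')

def split_teams(names):
    """Hypothetical helper: split run-together team names into a list."""
    return TEAM_PATTERN.findall(names)

# One row per (region, team) pair
sample['team'] = sample['nfl'].apply(split_teams)
teams = sample.explode('team')[['metropolitan area', 'team']].reset_index(drop=True)

def region_for(full_name, teams):
    """Hypothetical helper: find the region whose team name ends full_name."""
    for area, team in zip(teams['metropolitan area'], teams['team']):
        if full_name.endswith(team):
            return area
    return None

print(split_teams('CubsWhite Sox'))
print(region_for('Oakland Raiders', teams))
```

Matching on the end of the full team name is a simplification. Cells that contain no capitalized name yield an empty list, so they would need separate handling before the lookup.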


### NHL
In this part we are going to calculate the win/loss ratio's correlation with the population of the city it is in for the **NHL** using **2018** data.