
# Finding a Sampling Frame

### Introduction

In this lesson, we'll see how we can make inferences about the underlying population, once we draw a representative sample.  Remember that when we are doing a survey, it is rare for us to have all of the information we need on the underlying population.  So we must rely on inferences about the underlying population from sampling.

### Constructing a Sample

With our NBA players, we purport to have the underlying population.  We gathered the data from Sports Reference.  Let's load our data and check whether it lines up with what we know about the underlying population.

In [0]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/sampling-statistics/master/nba_combined.csv"
players_df = pd.read_csv(url, index_col = 0)

In [0]:
players_df[:3]

Unnamed: 0,player_id,name,weight,birth_date,height,nationality,team_abbreviation,most_recent_season,box_plus_minus,games_played,...,total_rebounds,turnovers,position,points,three_pointers,free_throw_percentage,assists,three_point_attempts,steals,blocks
0,klebima01,Maxi Kleber,240,1992-01-29,82,Germany,DAL,2019.0,0.3,209.0,...,355.0,48.0,C,604.0,107.0,0.863,77.0,286.0,20.0,78.0
1,wrighde01,Delon Wright,183,1992-04-26,77,United States of America,DAL,2019.0,2.2,263.0,...,259.0,68.0,SG,474.0,45.0,0.789,220.0,117.0,76.0,21.0
2,finnedo01,Dorian Finney-Smith,220,1993-05-04,79,United States of America,DAL,2019.0,-1.6,247.0,...,352.0,63.0,PF,597.0,99.0,0.722,97.0,265.0,44.0,37.0


In [0]:
players_df.shape

(496, 29)

We check the shape of the data and see that there are 496 observations.  Let's see if that lines up with the underlying population.  We can estimate the size of the underlying population by taking the number of teams in the league and multiplying by the number of players per team.

In [0]:
dfs = pd.read_html('https://www.espn.com/nba/standings/_/group/league', header = None)

In [0]:
standings = dfs[0]
standings[:3]

Unnamed: 0,x --MILMilwaukee Bucks
0,x --LALLos Angeles Lakers
1,x --TORToronto Raptors
2,LACLA Clippers


Here is a list of teams from the NBA standings.  Now let's look at the shape to see how many teams are in the league.

In [0]:
standings.shape

(29, 1)

Here, we see 29 teams in the league, but a Google search will tell us there are 30.  Do you see the error?

In [0]:
standings[:3]

Unnamed: 0,x --MILMilwaukee Bucks
0,x --LALLos Angeles Lakers
1,x --TORToronto Raptors
2,LACLA Clippers


Yes, the Milwaukee Bucks were read in as the column header instead of as a row.  Let's change that.

In [0]:
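import numpy as np
# The first column holds the 29 teams that were parsed as rows
vals = standings.iloc[:, 0].values
# Prepend the team that was read in as the column header
updated_vals = np.insert(vals, 0, 'Milwaukee Bucks')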

In [0]:
teams = pd.Series(updated_vals, name = 'Teams')
teams.shape

(30,)
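
The fix above hard-codes the missing team name. Since that name was read in as the column label, it could also be recovered from the columns of the table itself. A minimal sketch on a toy table (the toy data and the name `toy_standings` are assumptions standing in for the scraped table):

```python
import pandas as pd

# Hypothetical stand-in for the scraped table: the first team became the column name.
toy_standings = pd.DataFrame(
    {"Milwaukee Bucks": ["Los Angeles Lakers", "Toronto Raptors", "LA Clippers"]}
)

# The column label is the missing first row, so prepend it to the column's values.
first_team = toy_standings.columns[0]
toy_teams = pd.Series([first_team, *toy_standings.iloc[:, 0]], name="Teams")

print(toy_teams.shape)
```

This avoids typing the team name by hand, which matters if the standings order changes between scrapes.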

Now we see that there are thirty teams in the NBA.  Let's sample three of them and check the number of players.

In [0]:
teams.sample(n=3, random_state=1)

17    NONew Orleans Pelicans
21     WSHWashington Wizards
10         INDIndiana Pacers
Name: Teams, dtype: object

* Pelicans: 17 players
* Wizards: 17 players
* Pacers: 17 players

OK, now let's estimate the number of players in the league by multiplying the roster size by the number of teams.

In [0]:
17*30

510

In [0]:
496/510

0.9725490196078431

It looks like we have about 97 percent of the estimated population, so we can use this dataset as a sampling frame.  For most studies this is quite good, and it is certainly enough to move forward with a preliminary analysis.  If we continue with the analysis, we may wish to see (1) whether data really is missing, (2) where the missing data is and whether it matters, and (3) whether we can correct for it.
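
The arithmetic above can be collected into one short calculation. This is only a sketch using the numbers from this lesson; the variable names are made up for illustration.

```python
# Roster sizes for the three sampled teams, as listed above.
roster_sizes = {"Pelicans": 17, "Wizards": 17, "Pacers": 17}
n_teams = 30   # teams in the league
n_rows = 496   # rows in our players dataset

# Average roster size in the sample, scaled up to the whole league.
mean_roster = sum(roster_sizes.values()) / len(roster_sizes)
estimated_population = mean_roster * n_teams

# Share of the estimated population that our dataset covers.
coverage = n_rows / estimated_population
print(f"{estimated_population:.0f} players, coverage {coverage:.1%}")
# prints: 510 players, coverage 97.3%
```

Averaging the sampled rosters, rather than assuming a single roster size, keeps the estimate sensible if the sampled teams differ in size.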

### Sampling From our Dataset

Now that we have a sampling frame, we can take a random sample from the data.  

> Now currently we have a lot of underlying data about our population, so we do not need to take a sample.  We would just verify that the population data is accurate and comprehensive, and perform the analysis.

But let's keep going, and simulate sampling our underlying population by just taking a sample of our data.

In [0]:
idcs = players_df.index

In [0]:
idcs[:3]

Int64Index([0, 1, 2], dtype='int64')

Here we have a list of all of the indices assigned to each row in our data.
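
Note that these index values are row labels rather than row positions. The two coincide for a default 0-based index, but not in general, which is worth keeping in mind when choosing between `.loc` and `.iloc`. A small sketch on a made-up frame (the frame `toy_df` and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical frame whose labels (10, 20, 30) differ from positions (0, 1, 2).
toy_df = pd.DataFrame({"name": ["a", "b", "c"]}, index=[10, 20, 30])
labels = toy_df.index[:2]

# .loc selects by label, so values taken from the index work directly.
by_label = toy_df.loc[labels, "name"].tolist()

# .iloc selects by position; the labels 10 and 20 are out of bounds here.
try:
    toy_df.iloc[list(labels)]
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(by_label, iloc_failed)
```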

In [0]:
import numpy as np
np.random.seed(3)
rand_idcs = np.random.choice(idcs, size=150, replace=False)

In [0]:
selected_players = players_df.loc[rand_idcs, :]

In [0]:
selected_players.shape

(150, 29)

Now we have a random sample of 150 players to contact for our survey.
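
As an alternative to drawing index values with NumPy, pandas offers `DataFrame.sample`, which samples rows without replacement by default. The sketch below uses a toy frame as a stand-in for `players_df` (the toy data is an assumption). The same seed is not guaranteed to select the same rows as the `np.random.choice` approach above.

```python
import pandas as pd

# Hypothetical stand-in for players_df with the same number of rows.
toy_players = pd.DataFrame({"name": [f"player_{i}" for i in range(496)]})

# Draw 150 distinct rows; random_state makes the draw reproducible.
toy_sample = toy_players.sample(n=150, replace=False, random_state=3)

print(toy_sample.shape)
print(toy_sample.index.is_unique)
```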

### Resources

[Sports Reference](https://sportsreference.readthedocs.io/en/stable/)