# Lesson 05: Data Sampling

We will use this lesson to introduce the `pandas` module. `pandas` is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

The main object we will use is the `DataFrame`. This is similar in structure to the `Table` object from the `datascience` library we used in the Foundations of Data Science course.

In [None]:
from datascience import *
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## Fake Election Data Set

Load the `gotham.csv` data set into a `Table`. The file is located in the `data` folder. This fake data set contains information on every registered voter in Gotham City.

In [None]:
gotham_table = Table().read_table('data/gotham.csv')
gotham_table

Use the code cells below (add more if you like) to explore the `gotham` table. You'll be required to give a brief synopsis of your initial EDA (Exploratory Data Analysis).

In [None]:
gotham_table.sort('age', descending=False)

In [None]:
np.sum(gotham_table[1])/gotham_table.num_rows

In [None]:
gotham_table.group('vote')

In [None]:
np.average(gotham_table.where('vote', are.equal_to('Rep')).column('age'))

In this course we'll be using `pandas`. Let's load the `gotham.csv` file into a `pandas` `DataFrame`. The `pandas` User Guide is available [here](https://pandas.pydata.org/docs/user_guide/index.html#user-guide).

In [None]:
gotham_table.num_rows

In [None]:
np.unique(gotham_table.column('vote'))

In [None]:
np.average(gotham_table.column('age'))

In [None]:
gotham_table.select('age').hist()

In [None]:
gotham_table.hist('age')

In [None]:
gotham_table.column('vote')

In [None]:
import pandas as pd

In [None]:
gotham = pd.read_csv('data/gotham.csv')
gotham

## The Basics of a `pandas DataFrame`

In [None]:
gotham.describe()

In [None]:
gotham.shape

In [None]:
gotham.shape[0]

In [None]:
gotham['vote']

In [None]:
gotham.vote

In [None]:
np.average(gotham['age'])

In [None]:
gotham.info

In [None]:
gotham.info()

In [None]:
type(gotham.info)

In [None]:
type(gotham.info())
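The `datascience` calls used earlier each have a close `pandas` counterpart. The sketch below lines them up on a tiny, made-up `DataFrame` (the name `toy` and its values are hypothetical and stand in for `gotham.csv`, which is not reproduced here).

```python
import pandas as pd

# Hypothetical stand-in for gotham.csv (values invented for illustration)
toy = pd.DataFrame({
    "age": [23, 67, 45, 71, 34],
    "vote": ["Dem", "Rep", "Dem", "Rep", "Rep"],
})

n_rows = len(toy)                           # Table: t.num_rows
by_age = toy.sort_values("age")             # Table: t.sort('age')
vote_counts = toy["vote"].value_counts()    # Table: t.group('vote')
# Table: np.average(t.where('vote', are.equal_to('Rep')).column('age'))
rep_mean_age = toy.loc[toy["vote"] == "Rep", "age"].mean()
# Series.unique() keeps order of first appearance; np.unique() sorts
parties = toy["vote"].unique()
```

Note that `value_counts` orders its result by count rather than by label, so the row order may differ from what `Table.group` shows.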

**Example 1.** What percentage of voters are planning to vote for the Democratic candidate?

In [None]:
# Compare the string column to "Dem" first, then count the matches
np.sum(gotham['vote'].to_numpy() == "Dem")/len(gotham)

In [None]:
np.sum(np.array(gotham['vote.dem']))/len(np.array(gotham['vote.dem']))

In [None]:
np.mean(gotham['vote.dem'])

In [None]:
len(gotham[gotham['vote'] == 'Dem'])/len(gotham)

In [None]:
gotham['vote.dem'].to_numpy()

In [None]:
gotham[gotham['vote'] == 'Dem'].shape[0]
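The cells above reach the same number several ways because the mean of a 0/1 indicator is a proportion. A minimal sketch on made-up data, assuming `vote.dem` is 1 when `vote` is `'Dem'` and 0 otherwise (this coding is an assumption about the file, and `toy` is a hypothetical name):

```python
import pandas as pd

# Hypothetical toy data; `vote.dem` is ASSUMED to be the 0/1 indicator of vote == "Dem"
toy = pd.DataFrame({"vote": ["Dem", "Rep", "Dem", "Dem", "Rep"]})
toy["vote.dem"] = (toy["vote"] == "Dem").astype(int)

prop_from_indicator = toy["vote.dem"].mean()        # mean of 0/1 values
prop_from_mask = (toy["vote"] == "Dem").mean()      # mean of True/False values
prop_from_counts = toy["vote"].value_counts(normalize=True)["Dem"]

# The question asks for a percentage, so scale the proportion by 100
pct_dem = 100 * prop_from_indicator
```

All three proportions agree; multiplying by 100 converts the proportion into the percentage the question asks for.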

**Example 2.** How many people are retired? (Here we treat anyone aged 65 or older as retired.)

In [None]:
len(gotham.query('age >= 65'))

In [None]:
gotham.query('age >= 65')

**Example 3.** Suppose we take a convenience sample of everyone who is retired. How large is that sample, and what fraction of all voters does it cover?

In [None]:
len(gotham.query('age >= 65'))

In [None]:
len(gotham.query('age >= 65'))/gotham.shape[0]
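`query` is one of two common ways to filter rows; a boolean mask gives the same result. The sketch below uses made-up data (`toy` and `cutoff` are hypothetical names) and shows how `@` lets a query string refer to a Python variable.

```python
import pandas as pd

# Hypothetical toy data standing in for gotham.csv
toy = pd.DataFrame({"age": [23, 67, 45, 71, 65], "vote.dem": [1, 0, 1, 0, 1]})
cutoff = 65  # assumed retirement age, matching the queries above

retired_query = toy.query("age >= @cutoff")   # string expression; @ reads a local variable
retired_mask = toy[toy["age"] >= cutoff]      # equivalent boolean-mask filter

share_retired = len(retired_query) / len(toy)
```

Both filters keep the original row labels, so the two results compare equal with `retired_query.equals(retired_mask)`.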

**Example 4.** What percentage of voters are planning to vote Democrat, broken down by age and sex?

In [None]:
# Select the column before aggregating so the string column `vote` is not summed
gotham.groupby('is.male')['vote.dem'].sum()

In [None]:
gotham.head(2)

In [None]:
# numeric_only=True skips the string column `vote`, which cannot be averaged
gotham.groupby(["age","is.male"]).mean(numeric_only=True)

In [None]:
# reset_index() turns the group keys back into ordinary columns
gotham.groupby(["age","is.male"]).mean(numeric_only=True).reset_index()

In [None]:
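# Mean of the numeric columns per (age, is.male) group; `vote` is a string column and is skipped
votes_by_demo = gotham.groupby(["age","is.male"]).mean(numeric_only=True)
votes_by_demo = votes_by_demo.reset_index()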

In [None]:
votes_by_demo
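To see what the grouped mean produces, here is a small sketch on made-up rows (the `toy` frame is hypothetical). It selects `vote.dem` before aggregating and passes `as_index=False`, which is one alternative to calling `reset_index()` afterwards.

```python
import pandas as pd

# Hypothetical toy data with the same column names used above
toy = pd.DataFrame({
    "age": [30, 30, 30, 70, 70, 70],
    "is.male": [0, 1, 1, 0, 0, 1],
    "vote": ["Dem", "Rep", "Dem", "Rep", "Rep", "Dem"],
    "vote.dem": [1, 0, 1, 0, 0, 1],
})

# One row per (age, is.male) pair; the mean of the 0/1 column is the share voting Dem
dem_share = toy.groupby(["age", "is.male"], as_index=False)["vote.dem"].mean()
```

Here the group with `age` 30 and `is.male` 1 has one Democratic vote out of two, so its share is 0.5.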

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns

In [None]:
fig = plt.figure()
red_blue = ["#bf1518", "#397eb7"]
# Temporarily switch to a two-color palette for the two values of is.male
with sns.color_palette(red_blue):
    ax = sns.pointplot(data = votes_by_demo, x = "age", y = "vote.dem", hue = "is.male")

ax.set_title("Voting preferences by demographics")
fig.canvas.draw()
# Keep every tenth age label so the x-axis stays readable
new_ticks = [i.get_text() for i in ax.get_xticklabels()]
plt.xticks(range(0, len(new_ticks), 10), new_ticks[::10]);

In [None]:
np.mean(gotham.query('age >= 65')['vote.dem'])

**Example 5.** What if we instead took a simple random sample? How big would it need to be to outperform our convenience sample?

In [None]:
np.mean(gotham['vote.dem'])

In [None]:
np.mean(gotham.sample(1000, replace = False)['vote.dem'])

In [None]:
# Repeat the simple random sample 100 times and record each estimate (in percent)
x = []
for i in range(100):
    random_sample = gotham.sample(1000, replace = False)
    x.append(100 * np.mean(random_sample["vote.dem"]))

In [None]:
plt.hist(x);
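One way to approach the "how big" question is to compare the typical error of a simple random sample (SRS) at several sizes against the error of the convenience sample. The sketch below does this on a synthetic population, not on `gotham.csv`: the age range, the falling support curve, and the helper name `srs_rmse` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic population (an assumption): support for the Democrat falls with age
n = 100_000
age = rng.integers(18, 91, size=n)
p_dem = 0.75 - 0.006 * (age - 18)
population = pd.DataFrame({"age": age, "vote.dem": rng.binomial(1, p_dem)})
votes = population["vote.dem"].to_numpy()

true_pct = 100 * votes.mean()
convenience_pct = 100 * population.query("age >= 65")["vote.dem"].mean()
convenience_error = abs(convenience_pct - true_pct)  # percentage points

def srs_rmse(size, reps=200):
    """Root-mean-square error, in percentage points, of repeated SRS estimates."""
    estimates = np.array([
        100 * rng.choice(votes, size=size, replace=False).mean()
        for _ in range(reps)
    ])
    return float(np.sqrt(np.mean((estimates - true_pct) ** 2)))

for size in [10, 50, 100, 500, 1000]:
    print(size, round(srs_rmse(size), 2), "convenience:", round(convenience_error, 2))
```

The convenience sample's error is a bias, so it does not shrink no matter how many retirees are surveyed, while the SRS error falls roughly in proportion to one over the square root of the sample size. The smallest SRS that beats the convenience sample is the size at which its typical error first drops below that bias.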

**Example 6.** What if older voters are more likely to answer the phone?
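One way to explore this question is to sample with unequal probabilities through the `weights` argument of `DataFrame.sample`. The sketch below uses a synthetic population and an assumed response model in which the chance of answering grows in proportion to age. Neither comes from `gotham.csv`, and the name `answer_weight` is hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic population (an assumption): support for the Democrat falls with age
n = 50_000
age = rng.integers(18, 91, size=n)
population = pd.DataFrame({
    "age": age,
    "vote.dem": rng.binomial(1, 0.75 - 0.006 * (age - 18)),
})

# ASSUMED response model: probability of answering is proportional to age
answer_weight = population["age"] / population["age"].sum()

phone_sample = population.sample(1000, replace=False, weights=answer_weight, random_state=0)
srs_sample = population.sample(1000, replace=False, random_state=0)

true_pct = 100 * population["vote.dem"].mean()
phone_pct = 100 * phone_sample["vote.dem"].mean()
srs_pct = 100 * srs_sample["vote.dem"].mean()
mean_age_phone = phone_sample["age"].mean()
```

Comparing `phone_pct` and `srs_pct` against `true_pct` shows how much the age-skewed sample drifts from the population value. Under this assumed model the phone sample is older on average than the population, so any preference that varies with age is misestimated.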