# Section 1.2 — Data in practice

This notebook contains all the code from [Section 1.2 Data in practice](https://minireference.com/static/excerpts/noBSstats/noBSstats_ch01_DATA.pdf#page=34) of the **No Bullshit Guide to Statistics**.

In [1]:
2 + 3

5

### Note for Windows users

If you're on macOS or Linux, you can ignore this section and skip ahead to **Data management with Pandas**.

File paths on Windows use the backslash character (`\`) as the path separator,
while UNIX operating systems like Linux and macOS use the forward slash (`/`).

If you're on Windows, you'll need to manually edit the code examples below to make them work, replacing all occurrences of "`/`" with "`\\`". The double backslash is required to get a literal backslash because the character `\` has special meaning as an escape character in Python strings.

In [2]:
import os

if os.path.sep == "/":
    print("You're on a UNIX system (Linux or macOS).")
    print("Enjoy civilization!")
elif os.path.sep == "\\":
    print("You're on Windows so you should use \\ as path separator.")
    print("Replace any occurrence of / (forward slash) in paths with \\\\ (double-backslash).")

You're on a UNIX system (Linux or macOS).
Enjoy civilization!


The current working directory is a path on your computer where this notebook is running.
The code cell below shows how you can **get** your **c**urrent **w**orking **d**irectory.

In [3]:
os.getcwd()

'/Users/ivan/Projects/Minireference/STATSbook/noBSstatsnotebooks/notebooks'

You're in the `notebooks/` directory, which is inside the parent directory `noBSstatsnotebooks/`.

The datasets we'll be using in this notebook are located in the `datasets/` directory, which is a sibling of the `notebooks/` directory, inside the parent `noBSstatsnotebooks/`. To access the data file `minimal.csv` in the `datasets/` directory from the current directory, we must specify a path that includes the `..` directive (go to parent), then goes into the `datasets` directory, then opens the file `minimal.csv`.

This combination of "directions" for getting to the file will look different if you're on a Windows system or on a UNIX system. The code below shows the correct path you should access.

In [4]:
if os.path.sep == "/":
    # UNIX path separators
    path = "../datasets/minimal.csv"
else:
    # Windows path separators
    path = "..\\datasets\\minimal.csv"

print("The path to the file minimal.csv in the datasets/ directory is")
path

The path to the file minimal.csv in the datasets/ directory is


'../datasets/minimal.csv'
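To see how the `..` step resolves, here is a minimal sketch using the standard library's `posixpath` module, which applies UNIX path rules on any operating system. The directory `/home/user/noBSstatsnotebooks/notebooks` is a hypothetical location chosen for illustration, not a path from this notebook.

```python
import posixpath

# Hypothetical location of the notebooks/ directory (UNIX-style path).
notebooks_dir = "/home/user/noBSstatsnotebooks/notebooks"

# Append the relative path to the current directory...
joined = posixpath.join(notebooks_dir, "../datasets/minimal.csv")

# ...then collapse the ".." step, which moves up to the parent directory.
resolved = posixpath.normpath(joined)

print(joined)
print(resolved)
```

The resolved path points into the `datasets/` directory that sits next to `notebooks/`.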

All the code examples provided below assume you're on a UNIX system,
so if you're on Windows, you'll need to manually modify them to use double backslashes in path strings for the code to work.
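As an alternative to editing paths by hand, the sketch below builds the same path from its components with `os.path.join`, which inserts whichever separator the current operating system uses. It also shows a raw string (`r"..."`) as another way to write literal backslashes. The variable names are chosen for illustration only.

```python
import os

# Build the path from its components; os.path.join inserts the
# separator appropriate for the current operating system.
portable_path = os.path.join("..", "datasets", "minimal.csv")

# In a raw string, backslashes are not treated as escape characters,
# so a single backslash is enough.
windows_path = r"..\datasets\minimal.csv"

print(portable_path)
print(windows_path == "..\\datasets\\minimal.csv")
```

Paths built this way could be passed to `pd.read_csv` in place of the hard-coded strings used in the rest of this notebook.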

## Hands-on data management

### Getting started with JupyterLab Desktop

Follow instructions in the [Python tutorial](./python_tutorial.ipynb) to install JupyterLab Desktop on your computer.


### Download the interactive notebooks for this book

## Data management with Pandas

First let's make sure pandas is installed using the `%pip` Jupyter command.

In [5]:
# %pip install pandas

We then import the `pandas` library as the alias `pd`.

In [6]:
import pandas as pd

### Series

Pandas `pd.Series` objects are similar to Python lists `[3,5,7,9]`.
They are containers for series of values.

In [7]:
s = pd.Series([3, 5, 7, 9])
s

0    3
1    5
2    7
3    9
dtype: int64
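Series and lists are similar as containers, but they differ in how they respond to arithmetic. The short sketch below contrasts the two: multiplying a list by 2 repeats it, while multiplying a series by 2 doubles each value.

```python
import pandas as pd

values = [3, 5, 7, 9]
s = pd.Series(values)

# For a Python list, * means repetition.
doubled_list = values * 2

# For a Pandas series, * is applied to each value.
doubled_series = s * 2

print(doubled_list)
print(doubled_series.tolist())
```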

Python lists use integers for identifying the elements of the list
(first = index `0`, second = index `1`, last = index `len(self)-1`).

Pandas series support the same functionality.
Here are some examples of accessing individual values of the series using the default 0-based indexing.

In [8]:
print("First:  index =", 0, " value =", s[0])
print("Second: index =", 1, " value =", s[1])
print("Last:   index =", len(s)-1, " value =", s[len(s)-1])

First:  index = 0  value = 3
Second: index = 1  value = 5
Last:   index = 3  value = 9


The series `index` attribute tells you all the possible indices for the series.

In [9]:
s.index

RangeIndex(start=0, stop=4, step=1)

The series `s` uses the default index `[0, 1, 2, 3]`, 
which consists of a range of integers, starting at 0,
just like the index of a Python list with four elements.
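The index does not have to be the default range of integers. As a small illustration, the sketch below builds a series with string labels (the labels `"a"` through `"d"` and the name `s_labeled` are made up for this example) and accesses values by label.

```python
import pandas as pd

# Same values as before, but with a custom index of string labels.
s_labeled = pd.Series([3, 5, 7, 9], index=["a", "b", "c", "d"])

# Values are now looked up by label instead of by integer.
first = s_labeled["a"]
last = s_labeled["d"]

print(s_labeled.index)
print(first, last)
```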

In [10]:
s.values

array([3, 5, 7, 9])

In [11]:
type(s.values)

numpy.ndarray

In addition to accessing individual elements like this,

In [12]:
s[0]

3

we can also "slice" a series to obtain a new series that contains indices and values of the slice:

In [13]:
s[0:3]

0    3
1    5
2    7
dtype: int64
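Position-based selection can also be written explicitly with the `.iloc` accessor. The sketch below selects the same first three elements and uses a negative position to get the last value.

```python
import pandas as pd

s = pd.Series([3, 5, 7, 9])

# .iloc selects by integer position, using the same slice rules as lists.
first_three = s.iloc[0:3]

# Negative positions count from the end, as with Python lists.
last_value = s.iloc[-1]

print(first_three)
print(last_value)
```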

We can also perform arithmetic operations on the series.

In [14]:
s.sum()

24

In [15]:
s / s.sum()

0    0.125000
1    0.208333
2    0.291667
3    0.375000
dtype: float64

In [16]:
s.mean()

6.0

In [17]:
s.std()

2.581988897471611
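By default, the `.std()` method computes the sample standard deviation, which divides by $n-1$ instead of $n$. The sketch below reproduces the calculation step by step to show where the number comes from.

```python
import math
import pandas as pd

s = pd.Series([3, 5, 7, 9])
n = len(s)

# Step 1: the mean of the values.
mean = s.sum() / n

# Step 2: squared deviations from the mean.
squared_devs = (s - mean) ** 2

# Step 3: divide the sum of squared deviations by n - 1 (sample variance).
sample_var = squared_devs.sum() / (n - 1)

# Step 4: take the square root.
sample_std = math.sqrt(sample_var)

print(sample_std)
print(s.std())
```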

We can also use arbitrary functions from `numpy` on a series,
and Pandas will apply the function to the values in the series.

In [18]:
import numpy as np
np.log(s)

0    1.098612
1    1.609438
2    1.945910
3    2.197225
dtype: float64

### Data frames

We load the dataset `minimal.csv` into a data frame using the function `pd.read_csv`.

In [19]:
# !cat "../datasets/minimal.csv"

In [20]:
df = pd.read_csv("../datasets/minimal.csv")
df

     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3


In [21]:
df.dtypes

x        float64
y        float64
team      object
level      int64
dtype: object
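If the file `minimal.csv` is not available, the same loading step can be tried on text held in memory. In the sketch below, the CSV text is an assumption reconstructed from the data frame displayed above, not a copy of the actual file, and the name `df_demo` is used only for this example.

```python
import io
import pandas as pd

# Assumed contents of minimal.csv, reconstructed from the data frame output.
csv_text = """x,y,team,level
1.0,2.0,a,3
1.5,1.0,a,2
2.0,1.5,a,1
2.5,2.0,b,3
3.0,1.5,b,3
"""

# io.StringIO wraps the string so read_csv can treat it like a file.
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo.shape)
print(df_demo.dtypes)
```

Pandas infers the type of each column from the text: decimal numbers become floats and whole numbers become integers.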

#### Other ways to create data frames

There are other ways to create a `pd.DataFrame` from Python data containers like dicts and lists.

Creating a data frame from a dictionary of columns:

In [22]:
dict_of_columns = {
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
}

df2 = pd.DataFrame(dict_of_columns)
df2

     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3


In [23]:
# df2 is identical to df loaded from minimal.csv
df2.equals(df)

True

Creating a data frame from a list of records (lists or tuples):

In [24]:
list_records = [
    [1.0, 2.0, "a", 3],
    [1.5, 1.0, "a", 2],
    [2.0, 1.5, "a", 1],
    [2.5, 2.0, "b", 3],
    [3.0, 1.5, "b", 3],
]
columns = ["x", "y", "team", "level"]

df3 = pd.DataFrame(list_records, columns=columns)
df3

     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3


In [25]:
# df3 is identical to df loaded from minimal.csv
df3.equals(df)

True

Creating a data frame from a list of `dict`s:

In [26]:
dict_records = [
    dict(x=1.0, y=2.0, team="a", level=3),
    dict(x=1.5, y=1.0, team="a", level=2),
    dict(x=2.0, y=1.5, team="a", level=1),
    dict(x=2.5, y=2.0, team="b", level=3),
    dict(x=3.0, y=1.5, team="b", level=3),
]
df4 = pd.DataFrame(dict_records)
df4

     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2
2  2.0  1.5    a      1
3  2.5  2.0    b      3
4  3.0  1.5    b      3


In [27]:
# df4 is identical to df loaded from minimal.csv
df4.equals(df)

True

In [28]:
# Note dict(key="val") is just an alternative syntax for {"key":"val"}
dict(x=1.0, y=2.0, group="a", level=3) == {"x":1.0, "y":2.0, "group":"a", "level":3}

True

### Data frame properties

In [29]:
type(df)

pandas.core.frame.DataFrame

In [30]:
df.index

RangeIndex(start=0, stop=5, step=1)

In [31]:
df.columns

Index(['x', 'y', 'team', 'level'], dtype='object')

In [32]:
df.shape

(5, 4)

In [33]:
df.dtypes

x        float64
y        float64
team      object
level      int64
dtype: object

In [34]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       5 non-null      float64
 1   y       5 non-null      float64
 2   team    5 non-null      object 
 3   level   5 non-null      int64  
dtypes: float64(2), int64(1), object(1)
memory usage: 538.0 bytes


In [35]:
# df.axes
# df.memory_usage()
# df.values

### Accessing values in a DataFrame

In [36]:
df.loc[2, "y"]

1.5
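The `.loc` accessor selects by row label and column name. Its counterpart `.iloc` selects by integer position. The sketch below rebuilds the same data frame so it runs on its own, then reads the same cell both ways.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
})

# Label-based: the row labeled 2, the column named "y".
by_label = df.loc[2, "y"]

# Position-based: third row (position 2), second column (position 1).
by_position = df.iloc[2, 1]

print(by_label, by_position)
```

With the default index, row labels and row positions coincide, so both expressions return the same value.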

#### Entire rows

In [37]:
row2 = df.loc[2,:]
row2

x        2.0
y        1.5
team       a
level      1
Name: 2, dtype: object

In [38]:
# Rows of the dataframe are Series objects
type(row2)

pandas.core.series.Series

In [39]:
row2.index

Index(['x', 'y', 'team', 'level'], dtype='object')

In [40]:
row2.values

array([2.0, 1.5, 'a', 1], dtype=object)

In [41]:
row2["y"]

1.5

#### Entire columns

In [42]:
ys = df["y"]
ys

0    2.0
1    1.0
2    1.5
3    2.0
4    1.5
Name: y, dtype: float64

In [43]:
df["y"].equals( df.loc[:,"y"] )

True

In [44]:
df["y"].equals( df.y )

True

In [45]:
type(ys)

pandas.core.series.Series

In [46]:
ys.index

RangeIndex(start=0, stop=5, step=1)

In [47]:
ys.values

array([2. , 1. , 1.5, 2. , 1.5])

In [48]:
ys[2]

1.5

In [49]:
print(df[["x", "y"]])

     x    y
0  1.0  2.0
1  1.5  1.0
2  2.0  1.5
3  2.5  2.0
4  3.0  1.5


#### Selecting subsets of rows


In [50]:
df.head(2)
# df.tail(2)
# df.sample(3)

     x    y team  level
0  1.0  2.0    a      3
1  1.5  1.0    a      2


To select only rows where `team` is `b`, we first build the boolean selection mask...

In [51]:
mask = df["team"] == "b"
mask

0    False
1    False
2    False
3     True
4     True
Name: team, dtype: bool

... then select the rows using the mask.

In [52]:
df[mask]

     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3


The above two-step process can be combined into a more compact expression:

In [53]:
df[df["team"]=="b"]

     x    y team  level
3  2.5  2.0    b      3
4  3.0  1.5    b      3


In [54]:
df[(df["team"] == "b") & (df["x"] >= 3)]

     x    y team  level
4  3.0  1.5    b      3


In [55]:
df["level"].isin([2,3])

0     True
1     True
2    False
3     True
4     True
Name: level, dtype: bool
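Any boolean mask can be used to select rows, and masks can be combined with `&` (and), `|` (or), and `~` (not). The sketch below rebuilds the data frame so it runs on its own, and the variable names are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
})

# Use the .isin() mask to keep only rows where level is 2 or 3.
selected = df[df["level"].isin([2, 3])]

# Combine two conditions with | (or); each needs its own parentheses.
either = df[(df["team"] == "b") | (df["level"] == 1)]

# Invert a mask with ~ (not) to get rows where team is not "b".
not_b = df[~(df["team"] == "b")]

print(selected)
print(either)
print(not_b)
```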

#### Grouping and aggregation

In [56]:
df.groupby("team")

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11d7ad490>

In [57]:
df.groupby("team")["x"]

<pandas.core.groupby.generic.SeriesGroupBy object at 0x11d7ad3d0>

In [58]:
df.groupby("team")["x"].mean()

team
a    1.50
b    2.75
Name: x, dtype: float64

In [59]:
df.groupby("team")["x"].aggregate(["sum", "count", "mean"])

      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75


In [60]:
df.groupby("team")["x"] \
  .agg(["sum", "count", "mean"])

      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75


In [61]:
(df
  .groupby("team")["x"]
  .agg(["sum", "count", "mean"])
)

      sum  count  mean
team
a     4.5      3  1.50
b     5.5      2  2.75
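To see what grouping and aggregation do, the sketch below carries out the same calculation by hand: it loops over the groups, then computes the sum, count, and mean of the `x` values in each one. This is only an illustration of the idea, not how Pandas implements `.agg()` internally.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
    "level": [3, 2, 1, 3, 3],
})

summary = {}
# Iterating over a groupby object yields (group name, sub-data-frame) pairs.
for team, group in df.groupby("team"):
    xs = group["x"]
    summary[team] = {
        "sum": xs.sum(),
        "count": len(xs),
        "mean": xs.sum() / len(xs),
    }

for team, stats in summary.items():
    print(team, stats)
```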


## Datasets for the book

### Website visitors conversion rates

### Electricity prices

### Students effort and scores

In [62]:
students = pd.read_csv("../datasets/students.csv")

In [63]:
students.head()

   student_ID background curriculum  effort  score
0           1       arts     debate   10.96   75.0
1           2    science    lecture    8.69   75.0
2           3       arts     debate    8.60   67.0
3           4       arts    lecture    7.92   70.3
4           5    science     debate    9.90   76.1


### Apple weights

### Kombucha volumes

### Doctors' sleep study

## Data pre-processing tasks

- *Extract* the "raw" data from various data source formats
  (spreadsheet, databases, files, web servers).
- *Transform* the data by reshaping and cleaning it.
- *Load* the data into the system used for statistical analysis.
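
The three steps above can be sketched in a few lines of Pandas code. The raw data below is hypothetical (made up for this illustration, with one missing score), and the in-memory text buffers stand in for real source and destination files.

```python
import io
import pandas as pd

# Hypothetical raw data with inconsistent column names and a missing score.
raw_csv = """Student ID,Score
1,75.0
2,
3,67.0
"""

# Extract: read the raw data from its source format.
raw = pd.read_csv(io.StringIO(raw_csv))

# Transform: rename the columns and drop rows with missing scores.
clean = (raw
    .rename(columns={"Student ID": "student_ID", "Score": "score"})
    .dropna(subset=["score"])
    .reset_index(drop=True)
)

# Load: save the cleaned data where the analysis code will read it.
out = io.StringIO()
clean.to_csv(out, index=False)

print(out.getvalue())
```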