In [3]:
import pandas as pd

In [15]:
s = pd.Series([1,2,3,4])

In [16]:
type(s.values)

numpy.ndarray

<!-- NUMPY DIGRESSION -->
Under the hood,
the values of the series are stored in a `numpy` array.
<!-- Robyn said: Needs definition? -->
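
As a small illustration of this point, the sketch below pulls the NumPy array out of a series and applies NumPy functions to it. The name `arr` is used only for this example, and `.to_numpy()` is shown as an alternative to `.values` that returns the same kind of object for a numeric series like this one.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# For a numeric series, .to_numpy() (like .values) gives a NumPy array
arr = s.to_numpy()
print(type(arr))

# NumPy functions work directly on the extracted array
print(np.sum(arr), np.mean(arr))
```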


In [131]:
# OTHER ATTRIBUTES
# df.axes
# df.memory_usage()
# df.values

The column `y` can also be obtained by accessing the attribute `df.y`.

In [44]:
df["y"].equals(df.y)

True

Note that accessing columns as attributes only works for column names that do not contain spaces or special characters.
<!-- TODO: say we won't use in this book, but you need to know since other authors may use this syntax -->
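
The following sketch shows where attribute access breaks down. The data frame `demo` and its column names are made up for this example. Besides names with spaces, a column whose name matches an existing data frame method (here `count`) is also not reachable as an attribute, because the method takes precedence.

```python
import pandas as pd

# Hypothetical data frame with one "safe" name, one name containing a space,
# and one name that collides with the DataFrame.count method
demo = pd.DataFrame({
    "y": [2.0, 1.0],
    "team name": ["a", "b"],
    "count": [3, 4],
})

# Attribute access and bracket access agree for a simple column name
same = demo["y"].equals(demo.y)
print(same)

# A name with a space can only be reached with square brackets
spaced = demo["team name"]
print(list(spaced))

# demo.count is the DataFrame method, not the column named "count"
is_method = callable(demo.count)
print(is_method)
print(list(demo["count"]))
```

Square brackets work in every case, which is one reason to prefer them.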


In [45]:
# MAYBE
ys.values

array([2. , 1. , 1.5, 2. , 1.5])

In [46]:
# MAYBE
ys[2]

1.5

#### Selecting subsets of the data frame

We can use a combined selection expression to choose
an arbitrary subset of the rows and columns of the data frame.

We rarely need to do this,
but to illustrate the `loc` syntax,
here is the code for selecting the `y` and `team` columns
from the last two rows of the data frame.

In [48]:
df.loc[3:5, ["y","team"]]

|    | y   | team |
|----|-----|------|
| 3  | 2.0 | b    |
| 4  | 1.5 | b    |
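
To make the behaviour of `loc` more concrete, here is a minimal sketch on a stand-in data frame. The name `df_demo`, its `x` values, and the team labels of the first three rows are assumptions made for this example; only the `y` values and the last two team labels come from the outputs shown in this section. The sketch compares label-based selection with `loc` against position-based selection with `iloc`.

```python
import pandas as pd

# Hypothetical stand-in for df (x values and first three team labels are made up)
df_demo = pd.DataFrame({
    "x": [1.0, 1.5, 2.0, 2.5, 3.0],
    "y": [2.0, 1.0, 1.5, 2.0, 1.5],
    "team": ["a", "a", "a", "b", "b"],
})

# .loc selects by label: the slice 3:5 includes both endpoints when they exist,
# and labels beyond the end of the index (here 5) are simply ignored
by_label = df_demo.loc[3:5, ["y", "team"]]

# .iloc selects by integer position: the last two rows, columns 1 and 2
by_position = df_demo.iloc[-2:, [1, 2]]

print(by_label)
print(by_label.equals(by_position))
```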


In [67]:
# add the columns to the index; result is series with a multi-index
# df.stack()

In [68]:
# ALT. way to do transpose
# df.stack().reorder_levels([1,0]).unstack()
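
The two commented-out expressions above can be tried on a small numeric data frame. This is only a sketch: the data frame `nums` and its values are invented for this example, and recent pandas versions may print a warning about the `stack` implementation.

```python
import pandas as pd

# Hypothetical all-numeric data frame
nums = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 1.0, 1.5]})

# stack() moves the column labels into the index,
# giving a series indexed by (row label, column label) pairs
stacked = nums.stack()
print(stacked.index.nlevels)

# Swapping the two index levels and unstacking the row labels
# produces a frame with the original columns as rows
alt_T = stacked.reorder_levels([1, 0]).unstack()
print(alt_T)
```

In practice `nums.T` is the simpler way to transpose; the longer form is mainly useful for seeing how `stack` and `unstack` relate.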

### Datasets for the book

In [5]:
students = pd.read_csv("../../datasets/students.csv")

In [6]:
students.head()

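|    | student_ID | background | curriculum | effort | score |
|----|------------|------------|------------|--------|-------|
| 0  | 1          | arts       | debate     | 10.96  | 75.0  |
| 1  | 2          | science    | lecture    | 8.69   | 75.0  |
| 2  | 3          | arts       | debate     | 8.6    | 67.0  |
| 3  | 4          | arts       | lecture    | 7.92   | 70.3  |
| 4  | 5          | science    | debate     | 9.9    | 76.1  |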


In [130]:
xD = students[students["curriculum"]=="debate"]["score"].values
xL = students[students["curriculum"]=="lecture"]["score"].values

In [131]:
import numpy as np
np.mean(xD), np.mean(xL), np.mean(xD) - np.mean(xL)

(76.4625, 68.14285714285714, 8.319642857142867)
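
The same group means can also be computed without extracting arrays, using `groupby`. The sketch below is an alternative under one stated assumption: it uses only the five rows displayed by `students.head()` above (stored in the made-up variable `first5`), so its numbers differ from the full-dataset results.

```python
import pandas as pd

# Only the first five rows of the students data, as displayed by students.head()
first5 = pd.DataFrame({
    "curriculum": ["debate", "lecture", "debate", "lecture", "debate"],
    "score": [75.0, 75.0, 67.0, 70.3, 76.1],
})

# Mean score within each curriculum group
means = first5.groupby("curriculum")["score"].mean()
print(means)

# Difference between the two group means
diff = means["debate"] - means["lecture"]
print(diff)
```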

In [132]:
from scipy.stats import ttest_ind

ttest_ind(xD, xL)

TtestResult(statistic=1.7197867420465698, pvalue=0.10917234443214315, df=13.0)
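
By default, `ttest_ind` assumes the two groups have equal variances and pools them, so the degrees of freedom equal the total number of observations minus two. The sketch below checks this against a manual calculation of the pooled statistic, and shows the `equal_var=False` option (Welch's t-test) for comparison. It uses only the five scores visible in `students.head()`, so the numbers are not those of the full dataset; the variable names are chosen for this example.

```python
import numpy as np
from scipy.stats import ttest_ind

# Scores from the first five rows of the students data only
a = np.array([75.0, 67.0, 76.1])   # debate
b = np.array([75.0, 70.3])         # lecture

# Default: pooled-variance (equal variances assumed) t-test
res = ttest_ind(a, b)

# Manual computation of the same statistic
n1, n2 = len(a), len(b)
pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
print(res.statistic, t_manual)

# Welch's t-test does not assume equal variances
welch = ttest_ind(a, b, equal_var=False)
print(welch.statistic, welch.pvalue)
```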

## Data pre-processing tasks

- *Extract* the "raw" data from various data source formats
  (spreadsheets, databases, files, web servers).
- *Transform* the data by reshaping and cleaning it.
- *Load* the data into the system used for statistical analysis.
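
The three steps above can be sketched in a few lines of pandas. Everything in this example is an assumption made for illustration: the raw CSV text, the column names, and the choice of cleaning steps. In-memory text buffers stand in for real files so the example leaves nothing on disk.

```python
import io
import pandas as pd

# Hypothetical raw data standing in for a file, spreadsheet, or database export
raw_csv = io.StringIO(
    "Student ID,Score\n"
    "1,75.0\n"
    "2,\n"
    "3,67.0\n"
)

# Extract: read the raw data from its source format
rawdf = pd.read_csv(raw_csv)

# Transform: rename the columns and drop rows with missing values
tidydf = rawdf.rename(columns={"Student ID": "student_ID", "Score": "score"}).dropna()

# Load: write the cleaned data as CSV text that analysis code can read back
out = io.StringIO()
tidydf.to_csv(out, index=False)
print(out.getvalue())
```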

## Extract


### Extract data from different source formats

On UNIX systems (Linux and macOS) the command line program `head` can be used to show the first few lines of any file. The command `head` is very useful for exploring files—by printing the first few lines, you can get an idea of the format it is in.

Unfortunately, on Windows the command `head` is not available, so instead of relying on command line tools, we'll write a simple Python function called `head` that does the same thing as the command line tool: it prints the first few lines of a file. By default this function prints the first seven lines of the file, but users can override the `count` argument to request a different number of lines.

In [7]:
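import os

def head(path, count=7):
    """
    Print the first `count` lines of the file at `path`.
    """
    # Use the platform's path separator (converts `/` to `\` on Windows)
    path = path.replace("/", os.sep)
    # The `with` block closes the file once the lines have been read
    with open(path, "r") as datafile:
        lines = datafile.readlines()
    for line in lines[0:count]:
        print(line, end="")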


The function `head` contains some special handling for Windows users.
If the path is specified using the UNIX path separator `/`,
it will be auto-corrected to use the Windows path separator `\`.
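
Here is a hedged variant of the same idea that can be run on its own. The function name `head_lines` and the throwaway file `demo.txt` are made up for this example. Unlike `head`, it returns the lines as a list instead of printing them, and it uses `itertools.islice` so that only the first `count` lines are read, which matters for very large files.

```python
import os
import tempfile
from itertools import islice

def head_lines(path, count=7):
    """Return the first `count` lines of the file at `path` as a list."""
    with open(path, "r") as datafile:
        # islice stops after `count` lines without reading the rest of the file
        return list(islice(datafile, count))

# Create a small throwaway file with ten lines to try the function on
tmpdir = tempfile.mkdtemp()
demo_path = os.path.join(tmpdir, "demo.txt")
with open(demo_path, "w") as f:
    for i in range(10):
        f.write(f"line {i}\n")

first3 = head_lines(demo_path, count=3)
for line in first3:
    print(line, end="")
```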

## Load

When working with data,
it's important to follow good practices for storing data and metadata associated with your analysis.
Make sure you always have the dataset in a format
that is easy to load into the software system you'll be using for data analysis.

It's a good idea to save the cleaned dataset to a new CSV file in order to separate
the extraction, transformation, and cleaning steps from the subsequent statistical analysis.
Saving the dataset in a general-purpose format like CSV will also make it easy to share the data with collaborators,
or experiment with other statistical software like [RStudio](https://www.rstudio.com/),
[JASP](https://jasp-stats.org/), and [Jamovi](https://www.jamovi.org/).

To save the data frame `cleandf` as the CSV file `mydata.csv`,
we can use its `.to_csv()` method.

Let's save the cleaned data to the file `mydata.csv` in a directory `mydataset`.

In [11]:
# cleandf.to_csv("mydataset/mydata.csv", index=False)
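
The commented-out call above can be tried with the sketch below. Note that `.to_csv()` does not create missing directories, so the `mydataset` directory has to exist before the call. The small data frame here is only a stand-in for `cleandf`, and a temporary directory is used so the example leaves no files behind.

```python
import os
import tempfile
import pandas as pd

# Hypothetical stand-in for the cleaned data frame
cleandf = pd.DataFrame({"student_ID": [1, 2], "score": [75.0, 67.0]})

with tempfile.TemporaryDirectory() as workdir:
    # Create the target directory first; .to_csv() will not create it
    outdir = os.path.join(workdir, "mydataset")
    os.makedirs(outdir, exist_ok=True)
    outpath = os.path.join(outdir, "mydata.csv")

    # index=False leaves the row labels out of the file
    cleandf.to_csv(outpath, index=False)

    # Read the file back to check what was written
    reloaded = pd.read_csv(outpath)

print(reloaded)
```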

We can verify the data was successfully saved to disk using the `head` function,
which prints the first few lines from the file.

In [12]:
# head("mydataset/mydata.csv")

We should also create a short text file `README.txt` that describes the data file,
and provides the codebook information for the variables.
<!-- Robyn said: Encourage structured/standardized metadata formats -->

<!-- Information about the dataset is provided in a text file `README.txt`. -->

Include in the README any information about the dataset that might be relevant for the research project.
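
One possible way to get started on the codebook is to generate a skeleton from the data frame itself and then fill in the descriptions by hand. This is only a sketch: the data frame `demo`, the layout of the text, and the placeholder wording are all assumptions, not a standard README format.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned data frame
demo = pd.DataFrame({
    "student_ID": [1, 2],
    "curriculum": ["debate", "lecture"],
    "score": [75.0, 75.0],
})

# One line per variable, listing its name and data type,
# with a placeholder where the description should go
lines = ["mydata.csv", "", "Variables:"]
for name, dtype in demo.dtypes.items():
    lines.append(f"- {name} ({dtype}): TODO describe this variable")

readme_text = "\n".join(lines) + "\n"
print(readme_text)
```

The resulting text could then be written to `README.txt` with a regular `open(..., "w")` call and edited to add the descriptions.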

Remember to document all the steps you followed to obtain the dataset.
This includes information about how you obtained the data
and the transformation and cleaning steps you performed.
It's important to record all these details in order to make the ETL pipeline reproducible,
meaning someone else could run the same procedure to obtain the same dataset.
Ideally,
you should include a Jupyter notebook or Python script with the data transformations.

<!-- Robyn said: Consider explaining the importance of reproducibility - why we don't just manually change values in an excel spreadsheet, for example. -->

In [14]:
# head("mydataset/README.txt", count=11)