Copyright 2024 Luiz Barboza, Natasha A. Sahr, Andrew M. Olney and made available under [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0) for text and [Apache-2.0](http://www.apache.org/licenses/LICENSE-2.0) for code.

# Educational Material: Data Science and the Nature of Data

This notebook introduces some foundational concepts in data science.
As a result, this notebook will have more reading and fewer practical exercises than normal.
But don't worry, we'll have some practical exercises at the end.
<!-- for getting data from files. -->

We have organized this notebook around **big ideas** in data science.
You may wish to refer to this notebook throughout the course when these ideas come up.
It's OK if you don't completely understand them today.
Some of these ideas are quite subtle and take time to master.

Let's get started!

## The goal of this educational material

This teaching material serves as an introductory guide for educators, addressing fundamental concepts in data analysis and preparing students to navigate the complexities of different data types and dimensions. Understanding these principles is crucial for students to make informed decisions in their analyses and draw meaningful conclusions from diverse datasets. It focuses on the diversity and properties of data, emphasizing the need for teachers to understand and convey these concepts to students. The material highlights that not all data is the same and introduces key dimensions to consider when working with data.


## The proposed practice

In this educational module, educators focus on practical activities that teach students how to explore and analyze data with the Python programming language. The core tool is the dataframe, as provided by the pandas library.

The practical activities involve reading a CSV file into a dataframe, with step-by-step instructions using Blockly, a visual programming language. Students learn to import the pandas library, read a CSV file into a dataframe, and explore ways to store and manipulate data efficiently. This hands-on approach allows students to understand the importance of dataframes in data science and gain proficiency in using Python for data analysis.

Furthermore, educators introduce data visualization using Plotly, focusing on essential plots such as line plots and bar plots. Students learn to import Plotly, create line plots to analyze trends over time, and generate bar plots for comparing and summarizing data. The emphasis is on understanding the significance of data visualization in uncovering patterns, trends, and correlations within datasets.

The experiment extends to statistical analysis, introducing concepts like quartiles and boxplots. Students utilize the pandas library to calculate quartiles for a specific dataset and create a boxplot to visualize the distribution of the data, helping them understand central tendencies and variability.

Lastly, educators introduce histograms as a tool for exploring probability distributions of a variable. Students learn to interpret histograms to identify patterns, outliers, and overall distribution characteristics.

In summary, the educational practices aim to equip students with practical skills in data exploration, analysis, and visualization using Python, emphasizing the importance of dataframes and various plotting techniques for effective data science. The hands-on activities enhance students' understanding of data manipulation and interpretation, preparing them for real-world applications in the field.

# Descriptive Data Analysis



## Loading data

Data exploration and analysis is at the core of data science. Data scientists require skills in programming languages like Python to explore, visualize, and manipulate data.

### Dataframes

Data scientists often load tabular data into a **dataframe** that they can manipulate in a program.
In other words, tabular data from a file is brought into the computational notebook in a variable that represents rows, columns, headers, etc., just as they are stored in the tabular data file.
Because dataframes match tabular data in files, they are very intuitive to work with, which may explain their popularity.

We're now at the practical portion of this notebook, so let's work with dataframes!

**If you haven't seen a demonstration of Blockly, [see this short video tutorial](https://youtu.be/ovCJln08mG8?vq=hd720) or [this long video tutorial](https://youtu.be/-luPzplPDI0?vq=hd720).**

#### Read CSV into dataframe

First, let's read a CSV file into a dataframe.
To do that, we need to import a dataframe library called `pandas`.
**If it isn't already open**, open up the Blockly extension by clicking on the painter's palette icon, then clicking on `Blockly Jupyterlab Extension`.

![screenshot_7.png](https://pbs.twimg.com/media/GC3YsK4XUAAoO4J?format=png&name=small)

Using the IMPORT menu in the Blockly palette, click on an import block `import some library as variable name`:

![screenshot_8.png](https://pbs.twimg.com/media/GC3YvjkW0AAZ5UT?format=png&name=small)

When you click on the block, it drops onto the Blockly workspace.
Change `some library` to `pandas` by typing into that box.
Click on the `variable name` dropdown, choose `Rename variable...`, and type `pd` into the box that pops up.
This imports the `pandas` dataframe library and gives it the variable name, or alias, `pd`.

Make sure the code cell below is selected (has a blue bar next to it) and press the `Blocks to Code` button below the Blockly workspace.
This will insert the code corresponding to the blocks into the **active cell** in Jupyter, which is the cell that has a blue bar next to it.

Once the code appears in that Jupyter cell, you must **execute** or **run** it by either pressing the &#9658; button at the top of the window or by pressing Shift + Enter on your keyboard.

In [None]:
import pandas as pd



We can now do things with `pd`, like load datasets!

Our file is called `covid.csv` and it is in the `datasets` folder.
That means the **path** from this notebook (the one you're reading) to the data is `https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/covid.csv`.

To read this file into a dataframe, we will use `pd`.
Go to the VARIABLES menu in the Blockly palette and click on the `with pd do ...` block.

![screenshot_9.png](https://pbs.twimg.com/media/GC3b8ioWQAAl9mq?format=png&name=small)

After it drops into the Blockly workspace, wait a second until the dropdown stops loading, and then click on it and select `read_csv`.
Then get a `" "` block from TEXT, drop it on the workspace, drag it to the `using` part of the first block, and type the file path `https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/covid.csv` into it.
Your blocks should look like this:

![pd](https://pbs.twimg.com/media/GHhzTvbXAAA-bkA?format=png&name=900x900)

Make sure the cell below is selected, then press `Blocks to Code`, and execute the cell to run the code by pressing the &#9658; button.


In [None]:
import pandas as pd
pd.read_csv('https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/covid19%20-%20cases.csv')



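|     | date    | Brazil | India | US     |
|-----|---------|--------|-------|--------|
| 0   | 1/23/20 | 0      | 0     | 0      |
| 1   | 1/24/20 | 0      | 0     | 1      |
| 2   | 1/25/20 | 0      | 0     | 0      |
| 3   | 1/26/20 | 0      | 0     | 3      |
| 4   | 1/27/20 | 0      | 0     | 0      |
| ... | ...     | ...    | ...   | ...    |
| 581 | 8/26/21 | 31024  | 44658 | 161331 |
| 582 | 8/27/21 | 27345  | 46759 | 322934 |
| 583 | 8/28/21 | 24699  | 45083 | 53069  |
| 584 | 8/29/21 | 13210  | 42909 | 38473  |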


When you run the cell, it will display the dataframe directly below it.
This is one of the nice things about Jupyter - **it will display the output of the last line of code in a cell**, whether that output is text, a table, or a plot.

Right now, we haven't actually stored the dataframe anywhere.
We used `pd` to read the csv file, and then Jupyter output that so we could see it.
But if we wanted to do anything with the dataframe, we'd have to read the file again.

Instead of reading the file every time we want to access the data, we can **store it in a variable**.
In other words, we will create a variable and set it to be the dataframe we created from the file.

Using the VARIABLES menu in the Blockly palette, click on `Create variable...` and type `dataframe` into the pop-up window.
Then click on the `set dataframe to` block so that your blocks below look like this:

![img](https://pbs.twimg.com/media/GC3eK0oWUAA71Dy?format=png&name=240x240)

Then go get the same blocks you used before to read the file and connect them to the `set dataframe to` block.
You can do this from scratch or you can use the following procedure:

- Press `Blocks to Code` to save your intermediate work (the `set dataframe to` block)
- Go back to the previous cell, click on the block you want, and copy it using Ctrl+c
- Click on the cell below to select it, click the Blockly workspace, and paste the block using Ctrl+v

*Tip: If you don't save your intermediate work, you'll lose it because `Notebook Sync` will clear the Blockly workspace when it loads the blocks in the previous cell.*

After you've added the blocks to read the dataframe, drop a variable block for the `dataframe` underneath it to display the dataframe.

In the future, we will abbreviate these steps as:

- Create `dataframe` and set it to `with pd do read_csv using "covid19%20-%20cases.csv"`
- `print(dataframe)`

As always, you need to hit the &#9658; button or press Shift + Enter to run the code.

In [None]:
dataframe = pd.read_csv('https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/covid19%20-%20cases.csv')

print(dataframe)



        date  Brazil  India      US
0    1/23/20       0      0       0
1    1/24/20       0      0       1
2    1/25/20       0      0       0
3    1/26/20       0      0       3
4    1/27/20       0      0       0
..       ...     ...    ...     ...
581  8/26/21   31024  44658  161331
582  8/27/21   27345  46759  322934
583  8/28/21   24699  45083   53069
584  8/29/21   13210  42909   38473
585  8/30/21   10466  30941  258532

[586 rows x 4 columns]


You should see the same output as before - the only difference is that we've read the csv and stored the data into the `dataframe` block, so we will use the `dataframe` block whenever we want to work with the data.
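Once the data is stored in a variable, a few quick checks help confirm that it loaded as expected. The sketch below is only an illustration: it uses a tiny hand-typed dataframe (the first five rows of the output above) so it runs without a network connection, and the name `sample` is hypothetical rather than part of the notebook.

```python
import pandas as pd

# A small stand-in for the COVID dataframe, typed in from the first rows shown above
sample = pd.DataFrame({
    "date": ["1/23/20", "1/24/20", "1/25/20", "1/26/20", "1/27/20"],
    "Brazil": [0, 0, 0, 0, 0],
    "India": [0, 0, 0, 0, 0],
    "US": [0, 1, 0, 3, 0],
})

n_rows, n_cols = sample.shape        # (number of rows, number of columns)
column_names = list(sample.columns)  # the header of the table
first_rows = sample.head(3)          # only the first three rows
us_total = sample["US"].sum()        # a single column can be used on its own

print(n_rows, n_cols)
print(column_names)
print(first_rows)
print(us_total)
```

The same calls (`shape`, `columns`, `head`) should work on the full `dataframe` variable once it has been read from the file.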


## Data visualization

Data visualization is the discipline of trying to understand data by using graphic context so patterns, trends, and correlations that might not otherwise be detected can be exposed.

Data visualization is an important tool to understand data.

Charts, plots, graphs, and maps (and many more) are all types of data visualizations.

There are many facets involved in data visualization; this tutorial is just the introduction to your Python plotting journey.

Today we will focus on the most often used plots:

- Scatter plots
- Bar plots
- Line plots
- Histograms

**Each type of plot requires a specific type of data and has a specific purpose.**

<!-- By the end of this introduction, you will have mastered:

- plot basics and how to read a graph
- data transformation
- a normal vs. not normal distribution
- misleading graphs -->

### Plotly

In Python, there are many options for visualizing data, and it is often challenging to choose which library to use.

<!-- The most common libraries used for plotting include:

- <a href="https://altair-viz.github.io" target="_top">`altair`</a>
- <a href="https://docs.bokeh.org/en/latest/" target="_top">`bokeh`</a>
- <a href="http://ggplot.yhathq.com" target="_top">`ggplot`</a>
- <a href="https://matplotlib.org" target="_top">`matplotlib`</a>
- <a href="https://pandas.pydata.org" target="_top">`pandas`</a>
- <a href="https://plotly.com" target="_top">`plotly`</a>
- <a href="http://www.pygal.org/en/stable/" target="_top">`pygal`</a>
- <a href="http://seaborn.pydata.org" target="_top">`seaborn`</a>

The documentation for each library can be found in the links above.  -->

For the purpose of this tutorial, we will focus on understanding, programming, and interpreting plots from `plotly`.

`plotly` is a Python library that produces interactive plots.

That means you can use your mouse to interact with the plot after you've created it.

<!-- It has a robust API, including one for python. Versions of `plotly`  $< 4$ are online. Versions of `plotly`  $\geq 4$ are offline, and the online functionality has been moved to the `chart-studio` library.  -->

To use `plotly`,

- `import plotly.express` as `px`

**Make sure you run the cell using the &#9658; button or Shift + Enter**

### Line plots

Line plots are virtually identical to bar plots in usage because they:

- Require the x to be discrete values
- Require the y to be a single number per x

To make a line plot:

- Get a `with px do line using` block

In our case, we have the `date` column as the X axis and the `US` column as the Y axis.

![line chart blocks](https://pbs.twimg.com/media/GDAuB2-XoAA3sQm?format=png&name=360x360)

Which generates the following code:

In [None]:
import plotly.express  as px
import pandas as pd
grades = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/quant/master/covid19%20-%20cases.csv')
(px.line(grades,'date','US'))



The line chart is an excellent tool for trend analysis. As we can see above, COVID-19 cases in the US peaked during the winter of 2020-21, with a second ramp-up toward the end of summer 2021.

### Bar plots

Bar plots are very commonly used in both science and the business world.

Bar plots:

- Require the x to be discrete values
- Require the y to be a single number per x
- Are best for showing summary values like averages

In other words, while scatterplots show all the datapoints, bar plots only show a summary value of y for each x.

Let's make a bar plot using the average, or `mean` of the variables as a summary value.

First, let's look at the mean by itself:

- with `grades` do `mean`


In [None]:
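import pandas as pd
grades = pd.read_csv('https://raw.githubusercontent.com/memphis-iis/datawhys-workshop-notebooks-2024/main/datasets/grades.csv')
# numeric_only=True averages only the numeric columns; recent pandas versions raise an error on text columns otherwise
grades.mean(numeric_only=True)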





AP1      7.041667
AP2      7.000000
AP3      7.458333
Grade    7.108333
dtype: float64

We can see the mean for each variable, but notice this output is not formatted like a dataframe, because there are no column labels.

Instead, this is something `pandas` calls a **series**, which is like a single column in a dataframe.
The difference here is that the variable names, e.g. `AP1`, serve as the row labels instead of the numeric row labels (0, 1, 2, ...) we've seen previously.
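To make the dataframe/series distinction concrete, here is a minimal sketch. The scores and the names `scores` and `means` are made up for illustration and are not the course data.

```python
import pandas as pd

# Made-up scores for three students; not the course dataset
scores = pd.DataFrame({
    "Student": ["A", "B", "C"],
    "AP1": [6.0, 7.0, 8.0],
    "AP2": [5.0, 7.0, 9.0],
})

# numeric_only=True skips the text column "Student"
means = scores.mean(numeric_only=True)

print(type(means).__name__)   # the result is a Series, not a DataFrame
print(list(means.index))      # its labels are the original column names
print(means["AP1"])           # a value can be looked up by its label
```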

Since `grades.mean()` is a series, it doesn't have column names we can use for x and y in our plot.
However, `plotly` is smart enough to plot it anyways, like this:

- with `px` do `bar` using with `grades` do `mean`

**If you have trouble connecting these blocks, move your mouse slowly as you make the final connection, and try letting go even if you don't hear the click.**

In [None]:
import plotly.express as px
# numeric_only=True keeps only the numeric columns before averaging
px.bar(grades.mean(numeric_only=True))



In case you want to compare individual columns throughout different rows, you can have the numerical column on the Y axis, and the column that identifies each individual on the X axis.

In our case, we want to compare each `Student's` final numerical `Grade` side by side in a bar chart, as follows:

![bar chart blocks](https://pbs.twimg.com/media/GDA6C8tWEAAmEBz?format=png&name=360x360)

Which generates the following code:

In [None]:
import plotly.express as px
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
(px.bar(df,'Student','Grade'))



This kind of graph, a bar chart, is an excellent option to compare values between different entities or categories. In our example above it is possible to analyze the different grades between each student.

It is also possible to summarize data when you have a significant number of rows. In that situation, you can aggregate the data by a categorical column, in our case `Course`, and then apply a mathematical function, in our case `mean`, to a numerical column, `Grade`, as seen in the blocks below.

![aggregated bar chart](https://pbs.twimg.com/media/GHhhlhCWIAAzK4y?format=png&name=small)

This will generate the following code, used to compare summarized information. In our case, the average grade between different courses.


In [None]:
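import plotly.express as px
import pandas as pd
grades = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
pivot = grades.groupby('Course')
# numeric_only=True averages only the numeric columns (text columns such as Student are skipped)
agg = pivot.mean(numeric_only=True)
print(agg.Grade)
# The first argument is the data; x is the course (the index) and y is the mean grade
(px.bar(agg,agg.index,agg.Grade))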





Course
ADM    6.64
ECO    7.95
LAW    7.24
Name: Grade, dtype: float64


As shown above, a college dean could use this chart to compare how students in different courses are performing on average.
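The group-then-summarize idea can be sketched on a handful of rows. The grades below are made up (only the course codes mirror the output above), and `toy` and `course_means` are hypothetical names.

```python
import pandas as pd

# Made-up grades; the course codes mirror the output above but the numbers do not
toy = pd.DataFrame({
    "Course": ["ADM", "ADM", "ECO", "LAW", "LAW"],
    "Grade": [6.0, 8.0, 9.0, 5.0, 7.0],
})

# Split the rows by Course, then average Grade within each group
course_means = toy.groupby("Course")["Grade"].mean()
print(course_means)
```

Selecting the `Grade` column before calling `mean` is one way to avoid averaging columns that are not numeric.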


Another graph that can be used on summarized data is the pie chart.

This is an excellent option when you want to analyze proportions. Just make sure that the numerical data you are analyzing corresponds to parts of a whole, so that the parts add up to 100%. In our case, we want to understand the percentage distribution of students by course.

To do that, we again group by `Course`, but instead of calculating averages we `count` the rows in each group, as shown in the blocks below. For the pie chart, we use the aggregated table's `index` as the pie categories and the `Student` count as the numerical column to be displayed.

<html><img src=https://pbs.twimg.com/media/GFnFdojWYAAkhDC?format=png&name=small height="333"/></html>

These blocks will generate the following code for us:




In [None]:
import plotly.express as px
import pandas as pd
students = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
pivot = students.groupby('Course')
agg = pivot.count()
(px.pie(agg,agg.index,agg.Student))

This graph shows that Business Administration and Law each account for 41.7% of the students, while Economics accounts for 16.7%.
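The percentages in a pie chart are simply each group's count divided by the total. As a rough check, the sketch below assumes a hypothetical enrollment of 5, 5, and 2 students (12 in total), which is one split consistent with the percentages quoted above; `courses` and `shares` are illustrative names.

```python
import pandas as pd

# Hypothetical enrollment: 5 ADM, 5 LAW and 2 ECO students (12 in total)
courses = pd.Series(["ADM"] * 5 + ["LAW"] * 5 + ["ECO"] * 2)

# normalize=True returns each group's share of the whole instead of raw counts
shares = courses.value_counts(normalize=True) * 100
print(shares.round(1))
```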

## Statistical Analysis

### Quartiles and Boxplot

Quartiles are statistical measures that divide a data set into four equal parts, each representing 25% of the total observations. These values are denoted as Q1, Q2 (the median), and Q3. Q1 is the median of the lower half of the data, Q2 is the overall median, and Q3 is the median of the upper half of the data. Quartiles are particularly useful in analyzing the distribution and central tendency of a dataset, providing insights into its spread and variability.

An easy-to-use function available in pandas is `describe`, which, as the name suggests, produces a statistical description of a dataframe or a column. That description includes the quartiles, as shown in the blocks below:

![quartiles blocks](https://pbs.twimg.com/media/GDBLeb8WYAIt7Q5?format=png&name=360x360)

In our case, we want to calculate the quartiles for the Grade column, which is shown in the code below.






In [None]:
import pandas as pd
dff = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
grd = dff.Grade
print(grd.describe())



count    12.000000
mean      7.108333
std       1.925172
min       3.600000
25%       5.850000
50%       7.500000
75%       8.600000
max       9.700000
Name: Grade, dtype: float64


As shown above, the minimum grade is 3.6. The lowest 25% of grades fall below the first quartile (Q1), 5.85. The median grade, which divides the lower half of the grades from the upper half, is 7.5 (different from the average grade of about 7.1). The third quartile (Q3), 8.6, marks the lower limit of the top 25% of grades. Lastly, the maximum grade is 9.7.
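If only the quartiles are needed, they can also be requested directly. This is a minimal sketch on made-up numbers (`values` is a hypothetical name); note that pandas interpolates linearly between data points by default, so other tools may report slightly different quartiles for the same data.

```python
import pandas as pd

values = pd.Series([2, 4, 6, 8, 10])   # made-up numbers

# Ask for the 25th, 50th and 75th percentiles in one call
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

print(q1, q2, q3)
print(q2 == values.median())   # Q2 and the median are the same thing
```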

These quartiles can be shown in a graph called a boxplot. A boxplot, also known as a box-and-whisker plot, is a graphical representation that visually depicts the distribution of a dataset. It is constructed using the quartiles and displays key statistics such as the median, interquartile range (IQR), and potential outliers. The box in the plot represents the interquartile range, with the central line indicating the median. Whiskers extend from the box to the minimum and maximum values within a defined range, and any data points beyond the whiskers are considered potential outliers. Boxplots are valuable tools for quickly assessing the distribution, variability, and skewness of a dataset.

The blocks to generate the boxplot can be seen below:

![boxplot blocks](https://pbs.twimg.com/media/GDBN-BtXUAEVfr8?format=png&name=360x360)

Which generates the following code:

In [None]:
import plotly.express as px
import pandas as pd
dd = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
gg = dd.Grade
print(gg.describe())
(px.box(gg))



count    12.000000
mean      7.108333
std       1.925172
min       3.600000
25%       5.850000
50%       7.500000
75%       8.600000
max       9.700000
Name: Grade, dtype: float64
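The "defined range" for the whiskers is often taken to be 1.5 times the IQR beyond the box. That multiplier is a common convention assumed here for illustration; plotting libraries may compute the quartiles or whiskers slightly differently. The numbers and names below are made up.

```python
import pandas as pd

data = pd.Series([2, 4, 6, 8, 10, 30])   # made-up numbers; 30 looks suspiciously large

q1 = data.quantile(0.25)
q3 = data.quantile(0.75)
iqr = q3 - q1                            # interquartile range: the height of the box

# Assumed convention: points more than 1.5 * IQR beyond the box are flagged
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(iqr, lower_fence, upper_fence)
print(list(outliers))
```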


### Histograms

Histograms introduce a new idea, **probability distributions**, into the discussion.
A probability distribution is simply a table listing the probability that a variable will have a particular value.

In our work, you can think in terms of **count distributions** or the number of times a variable has a particular value.
We will use the term **distribution** to refer to either count or probability distributions interchangeably.

There are many different types of distributions - as many as there are types of animals in the zoo!
For our purposes, we highlight five general shapes of distributions:

- **Uniform:** a flat distribution where every value is equally likely
- **Normal:** a bell curve distribution where values toward the middle are most likely
- **Skewed right:** a declining distribution where small values are likely and large values unlikely
- **Skewed left:** the opposite of skewed right
- **Mixtures:** appear as two or more of the above distributions
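One way to get a feel for these shapes is to generate random samples and look at their skewness. This is only a sketch: the parameters, the sample size, and the fixed seed are arbitrary choices, and the exponential distribution is used merely as one convenient example of a right-skewed shape.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

uniform = pd.Series(rng.uniform(0, 10, 10_000))       # flat
normal = pd.Series(rng.normal(5, 1, 10_000))          # bell curve
skewed_right = pd.Series(rng.exponential(2, 10_000))  # many small values, few large ones

# Skewness near 0 means roughly symmetric; positive means a long right tail
for name, s in [("uniform", uniform), ("normal", normal), ("skewed right", skewed_right)]:
    print(name, round(s.skew(), 2))
```

Passing any of these series to a histogram should show the corresponding shape.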

The purpose of generating histograms is to visually determine the approximate distribution of a variable.
Histograms can reveal extreme values, missing ranges, or skew, that may require special care in later analysis.

Histograms:

- Require x
- Automatically determine bar widths for x
- Automatically define y as the count of values for x
- Are used to show the distribution of a **single** variable
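Behind the scenes, a histogram is just binning and counting. The sketch below does that counting by hand with NumPy on made-up grades. The bin edges are fixed here for clarity, whereas a plotting library would normally choose them automatically; `grades_demo` is a hypothetical name.

```python
import numpy as np

grades_demo = [3.6, 5.5, 6.0, 7.5, 7.5, 8.0, 8.6, 9.7]   # made-up grades

# Four equal-width bins between 2 and 10: [2, 4), [4, 6), [6, 8), [8, 10]
counts, edges = np.histogram(grades_demo, bins=4, range=(2, 10))

print(edges)    # the boundaries of the bars on the x axis
print(counts)   # the height of each bar on the y axis
```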

In our case, we can observe the histogram for the `Grade` column, using the following blocks:

![histogram blocks](https://pbs.twimg.com/media/GDBikuBW0AApWZy?format=png&name=small)

These will generate the following code:

In [None]:
import plotly.express as px
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/lcbjrrr/data/main/grades%20-%20okk.csv')
(px.histogram(df.Grade))



# APPENDIX

This appendix can be used as additional content for students of this practice. If you find it applicable, you may use it, provided you properly reference this original material.

## Not all data is the same

When you hear people talk about "data," you may get the impression that all data is the same.
However, there are *many* different kinds of data.
Just like we can compare animals by how many legs they have, whether they have fur, or whether they have tails, we can compare data along various *dimensions* that affect how we work with the data and what kinds of conclusions we can draw from it.

### Is the data structured or unstructured?

One of the most basic properties of data is whether it is **structured.**
It might surprise you to hear that data can be unstructured; after all, why would someone collect data that wasn't structured?
And you'd be right: normally when people plan to collect data, they structure it.
Structure means that the data is organized and ready for analysis.
The most common kind of structure is **tabular data**, like you'd see in a spreadsheet.
We'll talk about this more in a little bit.
Databases are another common source of structured data.

Unstructured data usually comes about when the data collection wasn't planned or if people didn't know how to structure it in the first place.
For example, imagine that you have a million photographs - how would you structure them as data?
Textual data is another common example of unstructured data.
If textual data were structured, in the way we're talking about, we wouldn't need search engines like Google to find things for us!
When we work with unstructured data, we must take an extra step of structuring it somehow for our analysis, i.e. we have to turn unstructured data into structured data to do something with it.
Again, images, audio, and text are common examples that need this extra step.

### Is the data clean or dirty?

Another basic property of data is **cleanliness**.
If data is clean, then we don't need to correct it or process it to remove garbage values or correct noisy values.
Just like with structured data, you might expect all data to be clean.
However, even carefully collected data can have problems that require correction before it can be used properly.
Dirty data is the norm for unplanned data collection.
So unplanned data collection is more likely to result in *both* unstructured data and dirty data.

There are many ways that data can be dirty, but to make the idea more concrete, let's consider a few examples.
Imagine that you are interested in the weather in your backyard, so you put out a battery operated thermometer that records the temperature every hour.
You then leave it there for a month.
Now imagine that it worked fine for the first two weeks, but since you didn't change the batteries, the measurements for the last two weeks become increasingly unreliable, e.g. reporting up to 10 degrees above or below the actual temperature, until it finally shuts off leaving you with no data for the remaining days.
This kind of problem, an **instrument failure** leading to **unreliable measurement**, is actually quite common and can take a lot of planning to avoid.
<!-- Another example of dirty data is at the recording stage.
Imagine a computer is recording audio data by writing it to the hard disk.
If the computer suddenly becomes active with another task (say streaming a video or installing an operating system update) the audio data may "glitch". -->

While *very* dirty data is usually obvious, it can sometimes be hard to recognize.
For this reason, it is important to check your data for problems (e.g. crazy values, like a person being 1 ft tall or 200 years old) and think very seriously about how problems should be corrected.
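A simple plausibility check can be sketched as a filter. The records below are made up, and the limits (older than 120 years, shorter than 2 ft) are assumptions chosen for illustration, not rules from this course.

```python
import pandas as pd

# Made-up records; two of them contain implausible values
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe", "Dev"],
    "age_years": [34, 200, 27, 45],
    "height_ft": [5.4, 5.9, 1.0, 6.1],
})

# Flag rows whose values fall outside the assumed plausible limits
suspicious = people[(people["age_years"] > 120) | (people["height_ft"] < 2)]
print(suspicious)
```

Flagging rows like this only finds candidates; deciding how to correct them still requires judgment.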
Data cleaning is such a tricky topic that we will delay it until much later in the course.

### Is the data experimental or nonexperimental?

The last major dimension of data we'll talk about is whether the data came from an **experiment** or not.
Why is this important?
Knowing whether the data came from an experiment is important because it tells you if you can draw causal conclusions from it.
When we talk about experiments here, what we mean are randomized controlled trials (RCTs) or an equivalent method of constructing a counterfactual.
The basic idea with an RCT is that you **randomly** assign what you are studying (i.e. people, animals, etc) into **two or more groups**.
In one of those groups, you do nothing - this is the control group.
In one of the other groups, you **do something** to what you are studying - this is the treatment group.
After the experiment, you can see what happened when you **did something** by comparing the treatment group to the control group.
Since the two groups are the same in every other respect, you know that any differences are a result of what you did.
This is why experiments allow you to draw causal conclusions - **because you only changed one thing, you know that change caused the difference you see.**
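Random assignment itself is easy to sketch: shuffle the participants and split the list. The participant names, the group sizes, and the seed below are all hypothetical, and a real trial would involve far more care than this.

```python
import random

participants = [f"person_{i}" for i in range(1, 11)]   # ten hypothetical participants

rng = random.Random(42)      # fixed seed so the split is repeatable
shuffled = participants[:]   # copy so the original list is left untouched
rng.shuffle(shuffled)

half = len(shuffled) // 2
treatment = shuffled[:half]  # this group receives the intervention
control = shuffled[half:]    # this group does not

print("treatment:", treatment)
print("control:  ", control)
```

Because chance alone decides who lands in which group, the groups should be similar on average in every other respect.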

Let's take a common example, vaccines.
To discover if a vaccine against coronavirus is effective, I would randomly assign people to two groups.
The treatment group would receive the vaccine, and the control group wouldn't.
I would then follow up with both groups 1-2 months later and see which of them had gotten sick and which hadn't.
If there was no difference in illness between the two groups, I'd say that the vaccine had no effect.
Otherwise, I'd say the vaccine had some effectiveness.
There's some subtlety we're skipping over here about *reliable differences*, but this is the basic idea.

Let's take another example of a non-experimental study.
Some researchers sent a survey to a million people in Europe and asked them how much coffee they drank and how old they were.
After analyzing the results, the researchers found that older people drank more coffee.
Can we infer that coffee makes people live longer?
No, we can't, because we have no control group to compare to.
Without that control group, there are many other reasons that older people could be drinking more coffee.
It could be that coffee is less popular now than 10 years ago, it could be that older people have more money to spend on coffee, or it could be some other reason we haven't thought of yet.
When we have a non-experimental result like this, we have to be **very** careful about interpretation.
The best we can say is that there seems to be an **association** between drinking coffee and being older, but we can't say what the cause is.
We'll talk more about associations like this later on in the course.

### Missing variables and misspecified models

Let's return to the example of growing taller with age.
If we collected a lot of data, we'd see this is a pretty strong relationship.
However, is it the case that there are no other variables that determine height?
Thinking about it more, we realize that nutrition is also an important factor.
Are there other important factors?
It turns out that air pollution is associated with stunted growth.
We could go on and on here, but the basic idea is this: you may have identified some of the important variables in your model, and you may have identified the most important ones, but it is unlikely that you've identified *all* of them.

### Measurement error

Even if your model is perfectly specified, your data might be subject to measurement error.
For example, let's say I'm interested in how many squirrels get run over in December vs. June.
I might send out teams of students to walk up and down streets looking for dead squirrels.
Some people on those teams might be very diligent and accurately count the squirrels, but others may not pay as much attention and only count about half of them.
As a result, my model will be based on inaccurate data, which may lead me to draw the wrong conclusion.
Almost all data has *some* measurement error, so this can be a real issue.

### Generalization

Finally, my model may be specified well, and my data may be free of measurement error, but I may not be **sampling** my data in a way that allows for **generalization**.
Suppose I'm trying to predict the outcome of the next election with survey data, and I only send surveys to farmers in Iowa.
Will that help me predict how people in Chicago will vote?
Or the U.S. as a whole?
Probably not, because I have not captured the diversity of the U.S. in my sample -- I've only captured one occupation in one area of the country.
If you want your model to generalize to new situations, which we almost always want, it's important to think about whether your data captures the complexity and diversity of the real world or only a small slice of it.

## Types of variables

We've talked about structured vs. unstructured data already, but we haven't gone into detail about how structured data is created.
Structured data begins with **measurements** of some type of thing in the real world, which we call a **variable**.
Let's return to the example of height.
I may measure 10 people and find that their heights in centimeters are:

| Height |
|--------|
| 165    |
| 188    |
| 153    |
| 164    |
| 150    |
| 190    |
| 169    |
| 163    |
| 165    |
| 190    |

Each of these values (e.g. 165) is a measurement of the variable *height*.
We call *height* a variable because its value isn't constant.
If everyone in the world were the same height, we wouldn't call height a variable, and we also wouldn't bother measuring it, because we'd know everyone is the same.
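
As a small sketch, we can store the ten measurements above in a Python list and confirm that the values vary.
The name `heights_cm` is just a label chosen for this example.

```python
from statistics import mean

# The ten height measurements from the table above, in centimeters.
heights_cm = [165, 188, 153, 164, 150, 190, 169, 163, 165, 190]

print("Number of measurements:", len(heights_cm))
print("Distinct values:", len(set(heights_cm)))
print("Smallest:", min(heights_cm))
print("Largest:", max(heights_cm))
print("Mean:", mean(heights_cm))
```

If height were constant, there would be only one distinct value and the smallest and largest would be equal.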

Variables have different **types** that can affect your analysis.

### Nominal

A nominal variable consists of unordered categories, like *male* or *female* for biological sex.
Notice that these categories are not numbers, and there is no order to the categories.
We do not say that male comes before female or is smaller than female.
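
One way to represent a nominal variable in pandas is an unordered categorical.
This is a sketch with made-up values; the variable name `sex` is chosen for illustration.

```python
import pandas as pd

# Made-up measurements of a nominal variable.
sex = pd.Series(["male", "female", "female", "male", "female"], dtype="category")

# The categories are labels only; pandas records that they have no order.
print(sex.cat.categories.tolist())
print("Ordered?", sex.cat.ordered)

# Counting how often each category occurs is a meaningful operation.
print(sex.value_counts())
```

Counting categories makes sense for nominal data, but sorting or averaging them does not.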

### Ordinal

Ordinal variables consist of ordered categories.
You can think of them as nominal variables but with an ordering from first to last or smallest to largest.
A common example of ordinal data is a Likert question like:

```
(1) Strongly disagree
(2) Disagree
(3) Neither agree nor disagree
(4) Agree
(5) Strongly agree
```

Even though these options are numbered 1 to 5, those numbers only indicate which comes before the others, not how "big" an option is.
For example, we wouldn't say that the difference between *Agree* and *Disagree* is the same as the difference between *Neither agree nor disagree* and *Strongly agree*.
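
In pandas, an ordinal variable can be sketched as an *ordered* categorical.
The responses below are made up, and the names `likert_scale` and `responses` are chosen for this example.

```python
import pandas as pd

# The five Likert options, listed from lowest to highest.
levels = [
    "Strongly disagree",
    "Disagree",
    "Neither agree nor disagree",
    "Agree",
    "Strongly agree",
]
likert_scale = pd.CategoricalDtype(categories=levels, ordered=True)

# Made-up survey responses stored with that ordering.
responses = pd.Series(
    ["Agree", "Disagree", "Strongly agree", "Agree"], dtype=likert_scale
)

# Order-based operations are meaningful for ordinal data.
print(responses.min())
print(responses.max())
print((responses > "Disagree").tolist())
```

Notice that we compare and rank the responses but never subtract them, because the spacing between options is unknown.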

### Interval

Interval variables are ordered *and* their measurement scales are evenly spaced.
A classic example is temperature in Fahrenheit.
In degrees Fahrenheit, the difference between 70 and 71 is the same as the difference between 90 and 91: in either case it is one degree.
The other key characteristic of interval variables, and the most confusing one, is that they don't have a meaningful zero value.
Degrees Fahrenheit is an example of this because there's nothing special about 0 degrees.
0 degrees doesn't mean there's no temperature or no heat energy; it's just an arbitrary point on the scale.

### Ratio

Ratio variables are like interval variables but with meaningful zeros.
Age and height are good examples because 0 age means you have no age, and 0 height means you have no height.
The name *ratio* reflects that you can form a ratio with these variables, which means that you can say age 20 is twice as old as age 10.
Notice you can't say that about degrees Fahrenheit: 100 degrees is not really twice as hot as 50 degrees, because 0 degrees Fahrenheit doesn't mean "no temperature."
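
To see why ratios of Fahrenheit values are misleading, here is a short sketch that converts to kelvins, a temperature scale whose zero is absolute zero.
The helper name `fahrenheit_to_kelvin` is chosen for this example.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert degrees Fahrenheit to kelvins (zero kelvins is absolute zero)."""
    return (deg_f - 32) * 5 / 9 + 273.15


# Differences are meaningful on an interval scale: both are one degree.
print(71 - 70, 91 - 90)

# A naive ratio of Fahrenheit values suggests "twice as hot"...
naive_ratio = 100 / 50

# ...but on a scale with a meaningful zero, the ratio is much smaller.
kelvin_ratio = fahrenheit_to_kelvin(100) / fahrenheit_to_kelvin(50)

print("Fahrenheit ratio:", naive_ratio)
print("Kelvin ratio:    ", round(kelvin_ratio, 2))
```

The ratio changes when the zero point moves, which is why ratios only make sense for ratio variables.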

## Measurement

We previously said that structured data begins with measurement of a variable, but we haven't explained what measurement really is.
Measurement is, quite simply, the assignment of a value to a variable.
In the context of a categorical variable like biological sex, we would say the assignment of *male* or *female* is a measurement.
Similarly for height, we would say that 180 cm is a measurement.
Notice that in these two examples, the measurement depends closely on the type of variable (e.g. categorical or ratio).

How we measure is tightly connected to how we've defined the variable.
This makes sense, because our measurements serve as a way of defining the variable.
For some variables, this is more obvious than for other variables.
For example, we all know what *length* is.
It is a measure of distance that we can see with our eyes, and we can measure it in different units like centimeters or inches.
However, some variables are not as obvious, like *justice*.
How do we measure *justice*?
One way would be to ask people, e.g. to ask them how just or unjust they thought a situation was.
There are two problems with this approach.
First, different people will tell you different things.
Second, you may not really be measuring *justice* when you ask this question; you could end up measuring something else by accident, like people's religious beliefs.

When we talk about measurement, especially of things we can't directly observe, there are two important properties of measurement that we want, **validity** and **reliability**.
The picture below presents a conceptual illustration of these ideas using a target.

<!-- Attribution: © Nevit Dilmen -->
<!-- https://commons.wikimedia.org/wiki/File:Reliability_and_validity.svg -->
![Targets illustrating reliability and validity](https://pbs.twimg.com/media/GC3XEykXEAArIAC?format=png&name=small)

Simply stated, **validity means we are measuring what we intend to measure**.
In the picture, validity is being "on target," so that our measurements are *centered* on what we are trying to measure.

In contrast, **reliability means our measurements are consistent**.
Our measurements could be consistently wrong, which would make them reliable but not valid (lower left).
Ideally, our measurements will be both valid and reliable (lower right).

When it comes to validity and reliability, the most important thing to understand is that **validity is not optional.**
If you don't have validity, your variable is wrong: you're not measuring what you think you're measuring.
Reliability is optional to a certain extent, but if the reliability is very low, we won't be able to get much information out of the variable.
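
As a rough sketch, we can treat validity as how close the average measurement is to the true value, and reliability as how tightly the measurements cluster, which is how the target picture portrays them.
This is a simplification, and the true value, instrument names, and readings below are all made up for illustration.

```python
from statistics import mean, stdev

TRUE_VALUE = 180  # hypothetical true height in cm

# Made-up repeated measurements from three hypothetical instruments.
instruments = {
    "reliable, not valid": [190, 191, 189, 190, 190],
    "valid, not reliable": [170, 190, 175, 185, 180],
    "valid and reliable": [180, 181, 179, 180, 180],
}

for name, readings in instruments.items():
    # Bias far from zero means the readings are off target (a validity problem).
    bias = mean(readings) - TRUE_VALUE
    # A large spread means the readings are inconsistent (a reliability problem).
    spread = stdev(readings)
    print(f"{name:22s} bias={bias:+.1f} spread={spread:.1f}")
```

The first instrument is very consistent but centered 10 cm off target, so its readings are reliable but not valid.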

## Tabular data

The most common type of structured data is **tabular data**, which is what you find in spreadsheets.
If you've ever used a spreadsheet, you know something about tabular data!

Here's an example of tabular data, with *height* in centimeters, *age* in years, and *weight* in kilograms:

| Height | Age | Weight |
|--------|-----|--------|
| 161    | 50  | 53     |
| 161    | 17  | 53     |
| 155    | 33  | 84     |
| 180    | 51  | 84     |
| 186    | 18  | 88     |

In tabular data like this, each **row** is a person.
More generically, we would say each row is an **observation** or **datapoint** (in statistics terminology) or an **item** (in machine learning terminology).
In each row, we have measurements for each of our variables for that particular person.
Since we have five rows of measurements, we know that there are five people in this dataset.

We can also think about tabular data in terms of **columns**.
Each column represents a variable, with the name of that variable in the **column header**.
For example, *height* is at the top of the first column and is the name of the variable for that column.
Importantly, the header is not an observation but rather a description of our data.
This is why we don't count the header when we are counting the rows in our data.
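
The same table can be sketched as a pandas dataframe, which makes the row and column structure easy to inspect.
The variable name `df` is chosen for this example.

```python
import pandas as pd

# Each key is a column header (a variable); each list holds that column's measurements.
df = pd.DataFrame({
    "Height": [161, 161, 155, 180, 186],
    "Age": [50, 17, 33, 51, 18],
    "Weight": [53, 53, 84, 84, 88],
})

# shape reports (rows, columns); the header is not counted as a row.
n_rows, n_cols = df.shape
print("Rows (observations):", n_rows)
print("Columns (variables):", n_cols)
print("Column headers:", list(df.columns))

# The first row holds all the measurements for one person.
print(df.iloc[0])
```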

### Delimited tabular data - CSV and TSV

You are probably familiar with spreadsheet files, e.g. Microsoft Excel has files that end in `.xls` or `.xlsx`.
However, in data science, it is more common to have tabular data files that are **delimited**.
A delimited file is just a plain text file where column boundaries are represented by a specific character, usually a comma or a tab.

Here's what the data above looks like in **comma separated value (CSV)** form:

```
Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
```

and here's what the data looks like in **tab separated value (TSV)** form:

```
Height	Age	Weight
161	50	53
161	17	53
155	33	84
180	51	84
186	18	88
```

The choice of delimiter (comma, tab, or something else) is largely arbitrary, but it's best to pick one that doesn't appear in your data, so that a value is never mistaken for a column boundary.
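
Here is a minimal sketch of reading both forms with pandas.
To keep the example self-contained, it reads from in-memory strings instead of files; with a real file you would pass the file path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# The CSV text from above, held in a string instead of a file.
csv_text = """Height,Age,Weight
161,50,53
161,17,53
155,33,84
180,51,84
186,18,88
"""

# The same data with tabs as the delimiter.
tsv_text = csv_text.replace(",", "\t")

# read_csv assumes commas by default; sep tells it to split on another character.
csv_df = pd.read_csv(io.StringIO(csv_text))
tsv_df = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(csv_df)
print("Same data either way?", csv_df.equals(tsv_df))
```

Both calls produce the same dataframe, since only the delimiter differs between the two texts.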
