# Tabular Data

What does data look like? For most people, the first image that comes to mind is a spreadsheet, where each row represents something for which information is being measured and each column a type of measurement. This stereotype exists for a reason; many real-world data sets can indeed be organized this way. Data that can be represented using rows and columns is called **tabular data**. The rows are also called **observations** or **records**, while the columns are called **variables** or **fields**. The different terms reflect the diverse communities within data science, and their origins are summarized in the table below.

|                     | Rows           | Columns     |
|---------------------|----------------|-------------|
| Statisticians       | "observations" | "variables" |
| Computer Scientists | "records"      | "fields"    |

The table below is an example of a data set that can be represented in tabular form. This is a sample of user profiles in the San Francisco Bay Area from the online dating website OKCupid. In this case, each observation is an OKCupid user, and the variables include age, body type, height, and (relationship) status. Although a data table can contain values of all types, the values within a column are typically all of the same type: the age and height columns store numbers, while the body type and status columns store strings. Some values may be missing, such as body type for the first user and diet for the second.

| age | body type |        diet       | ... | smokes | height | status |
|-----|-----------|-------------------|-----|--------|--------|--------|
| 31  |           | mostly vegetarian | ... |   no   |   67   | single |
| 31  |  average  |                   | ... |   no   |   66   | single |
| 43  |   curvy   |                   | ... | trying to quit | 65 | single |
| ... |    ...    |       ...         | ... |  ...   |  ...   | ... |
| 60  |    fit    |                   | ... |   no   |   57   | single |



# Introduction to Pandas

Tabular data is essential for doing data science, but Python has no built-in structure for it, so we need to import a library. That library is [Pandas](https://pandas.pydata.org/), which essentially does one thing: it defines a data structure called a `DataFrame` for storing tabular data. This data structure is so fundamental to data science that importing `pandas` is the very first line of many Colab notebooks and Python scripts.

Let's import `pandas`. The standard abbreviation for Pandas is `pd`.

In [None]:
import pandas as pd

A Pandas `DataFrame` is optimized for storing tabular data; for example, it uses the fact that the values within a column are all the same type to save memory and speed up computations.
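
To see the "one type per column" idea in action, here is a minimal sketch that builds a tiny `DataFrame` by hand from three of the sample rows shown earlier. The name `df_small` and the choice of columns are just for illustration; they are not part of the OKCupid data file.

```python
import pandas as pd

# A tiny hand-built table using three of the sample rows shown earlier.
# `df_small` is an illustrative name, not part of the OKCupid data set.
df_small = pd.DataFrame({
    "age": [31, 31, 43],
    "body type": [None, "average", "curvy"],  # None marks a missing value
    "height": [67, 66, 65],
    "status": ["single", "single", "single"],
})

# Each column has a single data type. The numeric columns get an integer
# dtype; how text columns are reported depends on the pandas version.
print(df_small.dtypes)
```

The `dtypes` attribute lists one type per column, which is what lets Pandas store and process each column efficiently.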

Many data sets are stored as files on disk, such as in **comma-separated values (CSV)** files. How do we get data into a `pandas` `DataFrame`? Pandas provides a function called `read_csv()` for reading in files in CSV format.
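
Before reading a real file, it can help to see what CSV text looks like. The sketch below is only an illustration: the string `csv_text` is a hand-typed, two-row excerpt of the sample table above, and `io.StringIO` wraps it so that `read_csv()` can treat it like a file.

```python
import io

import pandas as pd

# Hand-typed CSV text (an illustrative excerpt, not the real data file).
# The first line holds the column names; empty fields are missing values.
csv_text = """age,body type,diet,height,status
31,,mostly vegetarian,67,single
31,average,,66,single
"""

# read_csv() accepts file-like objects as well as paths and URLs.
df_demo = pd.read_csv(io.StringIO(csv_text))
print(df_demo)
```

Each line of the text becomes one row, and the commas separate the values that go into each column.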

## Reading in Data from a URL

If the data file already lives on the Internet, then you can simply pass in the URL to `read_csv()`.

In [None]:
# Read in the OKCupid data set using the Pandas `read_csv` function.
df_okcupid = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/okcupid.csv")

In [None]:
# Display the first few and last few rows and columns
df_okcupid

Notice above how missing values are represented in a `pandas` `DataFrame`. Each missing value is represented by a `NaN`, which is short for "not a number". As we will see, most `pandas` operations simply ignore `NaN` values.
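
As a small illustration of how `NaN` values are skipped, the sketch below uses a made-up `Series` of four heights, one of which is missing. The values and the name `heights` are hypothetical and chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical heights with one missing value (np.nan).
heights = pd.Series([67, 66, np.nan, 57])

# mean() skips the NaN, so this averages the three non-missing values.
print(heights.mean())

# isna() flags missing values; summing the flags counts them.
print(heights.isna().sum())
```

Here the mean is computed from 67, 66, and 57 only; the missing entry does not turn the result into `NaN`.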

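The `info()` method prints a summary of the data frame, including the name, non-null count, and data type of each column. Because it is a method, it must be called with parentheses.

In [None]:
df_okcupid.info()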

The `shape` attribute stores the dimensions of the data frame as a tuple: (number of rows, number of columns).

In [None]:
df_okcupid.shape
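
One detail worth noticing is that `info()` is a method (called with parentheses), while `shape` is an attribute (no parentheses). Here is a minimal sketch on a small hand-built data frame; `df_tiny` is an illustrative name, not the OKCupid data.

```python
import pandas as pd

# A small hand-built data frame, used only for illustration.
df_tiny = pd.DataFrame({"age": [31, 31, 43], "height": [67, 66, 65]})

# info() is a method: the parentheses are needed to actually run it.
df_tiny.info()

# shape is an attribute: a (rows, columns) tuple that can be unpacked.
n_rows, n_cols = df_tiny.shape
print(n_rows, n_cols)
```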

## Reading in Data from a File

If you instead want to read in a data file on your computer, you can pass in the path to the file (e.g., `"/home/data/mydata.csv"`) to `read_csv()`.

There's just one catch. Colab is a cloud service; it can't read files on your computer. In order to read in a data file from Colab, you have to upload the file to the Colab file system.

Instructions:

1. Click on the folder icon in the left toolbar. This will open up a pane that allows you to interact with the Colab file system.
2. Click on the upload icon and find the file that you want to upload.

Now the data file is on the Colab file system, so we can read it in using `read_csv()`. By default, files get uploaded to `/content/`. If you get a `FileNotFoundError`, double check where you uploaded the file.

In [None]:
# First download the okcupid.csv file to your computer.
# Then upload the file to Colab using the instructions above.
# Now you can read in the OKCupid data set using the Pandas `read_csv` function.
df_okcupid = pd.read_csv("/content/okcupid.csv")

df_okcupid
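
If you are not sure whether the upload worked, one option is to check for the file before reading it. The helper below, `read_csv_if_present`, is a hypothetical convenience function written for this example (it is not part of Pandas); it is a sketch that assumes the file was uploaded to `/content/`.

```python
from pathlib import Path

import pandas as pd


def read_csv_if_present(path):
    """Return a DataFrame if `path` exists; otherwise print a hint and return None."""
    path = Path(path)
    if path.is_file():
        return pd.read_csv(path)
    print(f"No file found at {path}")
    if path.parent.is_dir():
        # Show which CSV files are in that folder, to help spot a typo.
        csv_names = sorted(p.name for p in path.parent.glob("*.csv"))
        print("CSV files in that folder:", csv_names)
    return None


# Assumes the default Colab upload location mentioned above.
df_maybe = read_csv_if_present("/content/okcupid.csv")
```

If the file is missing, the function prints where it looked instead of raising a `FileNotFoundError`, which can make it easier to see where the upload actually went.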