# Tabular Data

What does data look like? For most people, the first image that comes to mind is a spreadsheet, where each row represents something for which information is being measured and each column a type of measurement. This stereotype exists for a reason; many real-world data sets can indeed be organized this way. Data that can be represented using rows and columns is called **tabular data**. The rows are also called **observations** or **records**, while the columns are called **variables** or **fields**. The different terms reflect the diverse communities within data science, and their origins are summarized in the table below.

|                     | Rows           | Columns     |
|---------------------|----------------|-------------|
| Statisticians       | "observations" | "variables" |
| Computer Scientists | "records"      | "fields"    |

The table below is an example of a data set that can be represented in tabular form. This is a sample of user profiles in the San Francisco Bay Area from the online dating website OKCupid. In this case, each observation is an OKCupid user, and the variables include age, body type, height, and (relationship) status. Although a data table can contain values of all types, the values within a column are typically all of the same type: the age and height columns store numbers, while the body type and status columns store strings. Some values may be missing, such as body type for the first user and diet for the second.

| age | body type |        diet       | ... | smokes | height | status |
|-----|-----------|-------------------|-----|--------|--------|--------|
| 31  |           | mostly vegetarian | ... |   no   |   67   | single |
| 25  |  average  |                   | ... |   no   |   66   | single |
| 43  |   curvy   |                   | ... | trying to quit | 65 | single |
| ... |    ...    |       ...         | ... |  ...   |  ...   | ... |
| 60  |    fit    |                   | ... |   no   |   57   | single |



# Introduction to Pandas

Tabular data is essential for doing data science. But a structure for tabular data is not built into Python, so we need to import a library. That library is [Pandas](https://pandas.pydata.org/), which essentially does one thing: it defines a data structure called a `DataFrame` for storing tabular data. This data structure is so fundamental to data science that importing `pandas` is the very first line of many Colab notebooks and Python scripts.

Let's import `pandas`. The standard abbreviation for Pandas is `pd`.

In [1]:
import pandas as pd

A Pandas `DataFrame` is optimized for storing tabular data; for example, it uses the fact that the values within a column are all the same type to save memory and speed up computations.
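To make the "one type per column" idea concrete, here is a minimal sketch that builds a tiny data frame by hand. The name `df_toy` and its values are hypothetical, loosely modeled on the first few OKCupid rows; they are not read from the real data set.

```python
import pandas as pd

# A hypothetical toy table; each dictionary key becomes a column.
df_toy = pd.DataFrame({
    "age": [31, 25, 43],
    "body_type": [None, "average", "curvy"],  # None marks a missing value
    "height": [67.0, 66.0, 65.0],
})

# Every column has a single data type (dtype) shared by all of its values.
print(df_toy.dtypes)
```

The `dtypes` attribute lists one type per column, which is the property that lets Pandas store and process each column efficiently.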

Many data sets are stored as files on disk, such as in **comma-separated values (CSV)** files. How do we get data into a `pandas` `DataFrame`? Pandas provides a function called `read_csv()` for reading in files in CSV format.
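As a rough illustration of what `read_csv()` does, the sketch below parses a few lines of CSV text held in memory instead of a file on disk. The text in `csv_text` is made up for this example, and wrapping it in `io.StringIO` is just a convenient way to avoid needing a real file.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file; an empty field means "missing".
csv_text = """age,body_type,height
31,,67
25,average,66
43,curvy,65
"""

# read_csv() accepts a file path, a URL, or a file-like object such as StringIO.
df_small = pd.read_csv(io.StringIO(csv_text))
print(df_small)
```

The first line of the CSV supplies the column names, and each later line becomes one row of the data frame.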

## Reading in Data from a URL

If the data file already lives on the Internet, then you can simply pass in the URL to `read_csv()`.

In [2]:
# Read in the OKCupid data set using the Pandas `read_csv` function.
df_okcupid = pd.read_csv("https://raw.githubusercontent.com/kevindavisross/data301/main/data/okcupid.csv")

In [3]:
# Display the first few and last few rows and columns
df_okcupid

Unnamed: 0,age,body_type,diet,drinks,drugs,education,essay0,essay1,essay2,essay3,...,location,offspring,orientation,pets,religion,sex,sign,smokes,height,status
0,31,,mostly vegetarian,socially,sometimes,graduated from college/university,"75% nice, 45% shy, 80% stubborn, 100% charming...",i'm a new nurse. it rules.,"multiple-choice questions, dancing.",it depends on the people.,...,"san francisco, california",might want kids,gay,likes cats,buddhism,f,taurus and it&rsquo;s fun to think about,no,67.0,single
1,25,average,,socially,,working on college/university,"i like trees, spending long periods of time co...","studying landscape horticulture, beekeeping, g...","wasting time, making breakfast, nesting",i have a lot of freckles,...,"oakland, california",,gay,,,m,sagittarius and it&rsquo;s fun to think about,no,66.0,single
2,43,curvy,,rarely,never,graduated from masters program,,,,,...,"san francisco, california",has a kid,straight,likes dogs and has cats,other and laughing about it,f,leo and it&rsquo;s fun to think about,trying to quit,65.0,single
3,31,average,,socially,never,,"i am a seeker of laughs ,music ,magick good pe...",i strive to live life to the fullest and to tr...,i am good at my magic and weaving a world of i...,i am guessing y'all would notice my jewelry an...,...,"san francisco, california",doesn&rsquo;t want kids,gay,,other and very serious about it,m,capricorn and it&rsquo;s fun to think about,trying to quit,70.0,single
4,34,,,socially,,graduated from ph.d program,i've just moved here from london after finishi...,i'm doing a postdoc in psychology at stanford,,,...,"san francisco, california",,gay,,,m,cancer but it doesn&rsquo;t matter,,71.0,single
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2995,24,athletic,mostly anything,socially,sometimes,graduated from college/university,recent relocatee to san francisco. i'm writing...,cranking out two years of private equity so i ...,writing. criticizing. partying partying yeah.....,i've been described as 'all-american.' not sur...,...,"san francisco, california",,straight,,catholicism,m,,no,70.0,single
2996,50,fit,,rarely,never,graduated from college/university,i'm generally happy and typically spend my tim...,"i was raised with left-wing politics, pbs, bal...","i'm great with kids, dogs, cats and recycling....","gee, i don't know. . . that i'm smiling, that ...",...,"oakland, california",,straight,has dogs,agnosticism,f,scorpio but it doesn&rsquo;t matter,no,63.0,single
2997,31,thin,vegetarian,socially,sometimes,,,"i like to move around, therefore my life entai...","having fun and not taking life too seriously, ...",my smile and hair.,...,"san francisco, california",,straight,,,f,,no,64.0,single
2998,31,athletic,mostly vegetarian,socially,sometimes,graduated from college/university,"i work with seniors and i love it, so believe ...",going down in flames.,being a charming first date.,i dress like an adorable idiot.,...,"walnut creek, california",,straight,likes dogs and has cats,catholicism and laughing about it,f,aries and it&rsquo;s fun to think about,when drinking,62.0,single


Notice above how missing values are represented in a `pandas` `DataFrame`. Each missing value is represented by `NaN`, which is short for "not a number". As we will see, most `pandas` operations, such as computing a mean, skip `NaN` values by default.
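The behavior of `NaN` can be sketched on a small, made-up column. The `heights` values below are hypothetical; `isna()` and the `skipna` parameter of `mean()` are standard Pandas features.

```python
import pandas as pd

# A hypothetical column of heights with one missing value.
heights = pd.Series([67.0, float("nan"), 65.0])

n_missing = heights.isna().sum()           # count the missing entries
mean_default = heights.mean()              # NaN is skipped by default
mean_strict = heights.mean(skipna=False)   # NaN propagates into the result

print(n_missing, mean_default, mean_strict)
```

With the default settings the mean is computed over the two known heights only; asking Pandas not to skip missing values makes the result itself `NaN`.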

The `info()` method prints a summary of the data frame, including each column's name, its number of non-missing values, and its data type. Note the parentheses: without them, Python displays the method object itself instead of calling it.

In [4]:
df_okcupid.info()
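As a small sketch of how `info` behaves, assuming a hypothetical toy data frame `df_toy`: the method prints its summary rather than returning it, and its `buf` parameter can redirect that text into a string.

```python
import io
import pandas as pd

# Hypothetical toy data frame with one missing value.
df_toy = pd.DataFrame({
    "age": [31, 25, 43],
    "body_type": [None, "average", "curvy"],
})

# info() prints the summary and returns None.
result = df_toy.info()

# To keep the summary as text, pass a buffer through the `buf` parameter.
buffer = io.StringIO()
df_toy.info(buf=buffer)
summary = buffer.getvalue()
```

The captured `summary` contains one line per column with its non-null count, which is a quick way to spot columns with many missing values.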

The `shape` attribute gives the dimensions of the data frame as a tuple of (rows, columns).

In [5]:
df_okcupid.shape

(3000, 31)
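Because `shape` is a tuple, it can be unpacked into separate row and column counts. The sketch below uses a hypothetical toy data frame rather than the OKCupid data.

```python
import pandas as pd

# Hypothetical toy data frame with 3 rows and 2 columns.
df_toy = pd.DataFrame({
    "age": [31, 25, 43],
    "height": [67.0, 66.0, 65.0],
})

# shape is an attribute, not a method, so it takes no parentheses.
n_rows, n_cols = df_toy.shape
print(n_rows, n_cols)

print(len(df_toy))           # len() of a data frame is its number of rows
print(list(df_toy.columns))  # the column names
```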

## Reading in Data from a File

If you instead want to read in a data file on your computer, you can pass in the path to the file (e.g., `"/home/data/mydata.csv"`) to `read_csv()`.

There's just one catch. Colab is a cloud service; it can't read files on your computer. In order to read in a data file from Colab, you have to upload the file to the Colab file system.

Instructions:

1. Click on the folder icon in the left toolbar. This will open up a pane that allows you to interact with the Colab file system.
2. Click on the upload icon and find the file that you want to upload.

Now the data file is on the Colab file system, so we can read it in using `read_csv()`. By default, files get uploaded to `/content/`. If you get a `FileNotFoundError`, double check where you uploaded the file.

In [6]:
# First download the okcupid.csv file to your computer.
# Then upload the file to Colab using the instructions above.
# Now you can read in the OKCupid data set using the Pandas `read_csv` function.
df_okcupid = pd.read_csv("/content/okcupid.csv")

df_okcupid

FileNotFoundError: [Errno 2] No such file or directory: '/content/okcupid.csv'
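One way to avoid being surprised by a `FileNotFoundError` is to check that the file exists before reading it. The sketch below is only an illustration: the helper `read_if_present`, the temporary directory, and the file `okcupid_sample.csv` are all hypothetical stand-ins, so that the example runs without the real data file.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a small stand-in CSV in a temporary directory (hypothetical data).
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "okcupid_sample.csv"
csv_path.write_text("age,height\n31,67\n25,66\n")


def read_if_present(path):
    """Return a DataFrame if the file exists, otherwise None."""
    path = Path(path)
    if not path.exists():
        print(f"File not found: {path}")
        return None
    return pd.read_csv(path)


df_found = read_if_present(csv_path)                       # file exists
df_missing = read_if_present(tmp_dir / "no_such_file.csv")  # file does not
```

In Colab, the same check could be applied to a path such as `/content/okcupid.csv` after uploading the file.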