# Introduction to tabular data

In data science, tabular data is information organized into rows and columns, like a table. It is one of the most common data representations you will encounter. Its key characteristics are:

* **Rows and Columns:** Tabular data is essentially a grid-like structure where each row represents a single data point or record, and each column represents a specific attribute or variable associated with that data point.
* **Consistent Data Types:** Ideally, each column should contain data of a consistent type (e.g., numbers for loan amounts, text for names, dates for application dates). This consistency allows for efficient analysis and manipulation using data science tools.
* **Headers:** Often, the first row in a table contains headers that define the meaning or description of each column. This improves readability and understanding of the data.

Here are some real-world examples of tabular data:

* **Loan application data:** Each row represents a loan applicant, with columns for attributes like name, loan amount, credit score, and property type.
* **Sales data:** Each row represents a sales transaction, with columns for product details (e.g., ID, name, price), customer information, and sales quantity.
* **Scientific data:** Each row represents a measurement or observation, with columns for different properties or variables being measured (e.g., temperature, pressure, time).

**Advantages of Tabular Data:**

* **Readability and Interpretability:** The structured format makes it easy for humans to understand the data by visually scanning rows and columns.
* **Flexibility:** Tabular data can accommodate various data types and can be easily extended to include new attributes or data points.
* **Interoperability:** Many data analysis tools and libraries are designed to work seamlessly with tabular data, making it a widely supported format.

**Common ways to store and work with tabular data:**

* **Spreadsheets (Excel, Google Sheets):** Popular tools for creating, editing, and manipulating tabular data.
* **CSV (Comma-Separated Values):** A plain text file format where data is separated by commas, suitable for storing and sharing tabular data.
* **Databases:** Relational databases store large amounts of structured data in tables with defined relationships between them.
* **pandas DataFrames (Python):** The DataFrame is the core data structure of pandas, a Python library designed for working with tabular data. It offers extensive data manipulation and analysis capabilities.
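
As a rough illustration of the CSV option in the list above, the sketch below parses a small CSV string with Python's standard `csv` module. The column names and values are borrowed from the housing example used later in this lesson. The in-memory string `csv_text` is an assumption made for the example and stands in for a real file.

```python
import csv
import io

# Hypothetical CSV content; a real project would read this from a file.
csv_text = """price_approx_usd,surface_covered_in_m2,rooms
115910.26,128,4
48718.17,210,3
"""

# DictReader treats the first row as headers and yields one dict per data row.
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# CSV is plain text, so every value arrives as a string until we convert it.
print(rows[0]["price_approx_usd"])
print(type(rows[0]["price_approx_usd"]))
```

Because the values come back as strings, a conversion step such as `float(rows[0]["price_approx_usd"])` would be needed before doing any arithmetic.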

Overall, tabular data plays a crucial role in data science by providing a well-organized and versatile foundation for various data analysis tasks.


In the housing dataset used throughout this lesson, the same ideas apply:

* Each row corresponds to a single house in our dataset. We'll call each of these houses an **observation**.
* Each column corresponds to a characteristic of each house. We'll call these **features**.
* Each cell contains only one value.

# Working with lists

Python comes with several data structures that we can use to organize tabular data. Let's start by putting a single observation in a list.

In [2]:
# Declare variable 'house_0_list'

house_0_list = [115910.26, 128, 4]
house_0_list, type(house_0_list)

([115910.26, 128, 4], list)

In [3]:
# Declare variable `house_0_price_m2`
# Price per square meter = price (index 0) divided by area (index 1)
house_0_price_m2 = house_0_list[0] / house_0_list[1]

# Print object type of `house_0_price_m2`
print("house_0_price_m2 type:", type(house_0_price_m2))

# Get output of `house_0_price_m2`
house_0_price_m2

house_0_price_m2 type: <class 'float'>


905.54890625

We've explored working with data for a single house. Now, let's organize the entire dataset. One approach is to create one list for each observation and then combine those lists into a larger list. This structure is called a nested list.

In [4]:
# Declare variable `houses_nested_list`
houses_nested_list = [
    [115910.26, 128.0, 4.0],
    [48718.17, 210.0, 3.0],
    [28977.56, 58.0, 2.0],
    [36932.27, 79.0, 3.0],
    [83903.51, 111.0, 3.0],
]
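
To make the nested structure more concrete, here is a small sketch of how values can be pulled out of it with indexing. It assumes the column order that the rest of this section uses (price, area, rooms). The variable names `first_house`, `first_house_area`, and `num_houses` are made up for this example.

```python
houses_nested_list = [
    [115910.26, 128.0, 4.0],
    [48718.17, 210.0, 3.0],
    [28977.56, 58.0, 2.0],
    [36932.27, 79.0, 3.0],
    [83903.51, 111.0, 3.0],
]

# The first index selects an observation (a row).
first_house = houses_nested_list[0]

# A second index selects a single value inside that row.
# Index 1 is assumed to be the area in square meters.
first_house_area = houses_nested_list[0][1]

# len() on the outer list counts the observations.
num_houses = len(houses_nested_list)

print(first_house)       # [115910.26, 128.0, 4.0]
print(first_house_area)  # 128.0
print(num_houses)        # 5
```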


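# Working with for loops

A for loop repeats a block of code once for each item in a sequence. Here we use one to visit every house in `houses_nested_list`, calculate its price per square meter, and append that value to the house's list.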

In [8]:
# create for loop to iterate through 'houses_nested_list'

for house in houses_nested_list:
    price_m2 = house[0]/house[1]
    house.append(price_m2)
    print(house)

[115910.26, 128.0, 4.0, 905.54890625]
[48718.17, 210.0, 3.0, 231.9912857142857]
[28977.56, 58.0, 2.0, 499.61310344827587]
[36932.27, 79.0, 3.0, 467.4970886075949]
[83903.51, 111.0, 3.0, 755.8874774774774]
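
Note that the loop above changes each inner list in place, so running the cell a second time would append another value to every house. One possible alternative, sketched below, builds a new nested list with a list comprehension and leaves the original untouched. The name `houses_with_price_m2` is made up for this example.

```python
houses_nested_list = [
    [115910.26, 128.0, 4.0],
    [48718.17, 210.0, 3.0],
    [28977.56, 58.0, 2.0],
    [36932.27, 79.0, 3.0],
    [83903.51, 111.0, 3.0],
]

# `house + [...]` creates a new list instead of modifying `house`,
# so the original data keeps its three values per observation.
houses_with_price_m2 = [
    house + [house[0] / house[1]] for house in houses_nested_list
]

for house in houses_with_price_m2:
    print(house)
```

The trade-off is a second copy of the data in memory, which is negligible for five houses but worth keeping in mind for large datasets.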


# Working with dictionaries

While lists are useful for storing data, they hold only bare values, which makes it hard to tell what each value represents. Imagine seeing a list like `[115910.26, 128.0, 4]`: it's unclear which number is the price, which is the area, and which is something else. Dictionaries offer a solution. They let us associate each value with a label (a key), which makes the data much clearer. Here is the same house expressed as a dictionary.

In [9]:
# Declare variable `house_0_dict`
house_0_dict = {
    "price_approx_usd": 115910.26,
    "surface_covered_in_m2": 128,
    "rooms": 4,
}

# Print `house_0_dict` type
print("house_0_dict type:", type(house_0_dict))

# Get output of `house_0_dict`
house_0_dict

house_0_dict type: <class 'dict'>


{'price_approx_usd': 115910.26, 'surface_covered_in_m2': 128, 'rooms': 4}

In [10]:
# Add "price_per_m2" key-value pair to `house_0_dict`
house_0_dict["price_per_m2"] = (
    house_0_dict["price_approx_usd"] / house_0_dict["surface_covered_in_m2"]
)

# Get output of `house_0_dict`
house_0_dict

{'price_approx_usd': 115910.26,
 'surface_covered_in_m2': 128,
 'rooms': 4,
 'price_per_m2': 905.54890625}

In [11]:
# Declare variable `houses_rowwise`
houses_rowwise = [
    {
        "price_approx_usd": 115910.26,
        "surface_covered_in_m2": 128,
        "rooms": 4,
    },
    {
        "price_approx_usd": 48718.17,
        "surface_covered_in_m2": 210,
        "rooms": 3,
    },
    {
        "price_approx_usd": 28977.56,
        "surface_covered_in_m2": 58,
        "rooms": 2,
    },
    {
        "price_approx_usd": 36932.27,
        "surface_covered_in_m2": 79,
        "rooms": 3,
    },
    {
        "price_approx_usd": 83903.51,
        "surface_covered_in_m2": 111,
        "rooms": 3,
    },
]

# Print `houses_rowwise` object type
print("houses_rowwise type:", type(houses_rowwise))

# Print `houses_rowwise` length
print("houses_rowwise length:", len(houses_rowwise))

# Get output of `houses_rowwise`
houses_rowwise

houses_rowwise type: <class 'list'>
houses_rowwise length: 5


[{'price_approx_usd': 115910.26, 'surface_covered_in_m2': 128, 'rooms': 4},
 {'price_approx_usd': 48718.17, 'surface_covered_in_m2': 210, 'rooms': 3},
 {'price_approx_usd': 28977.56, 'surface_covered_in_m2': 58, 'rooms': 2},
 {'price_approx_usd': 36932.27, 'surface_covered_in_m2': 79, 'rooms': 3},
 {'price_approx_usd': 83903.51, 'surface_covered_in_m2': 111, 'rooms': 3}]

In [12]:
# Create for loop to iterate through `houses_rowwise`
for house in houses_rowwise:

    # For each observation, add "price_per_m2" key-value pair
    house["price_per_m2"] = house["price_approx_usd"] / house["surface_covered_in_m2"]

# Print `houses_rowwise` object type
print("houses_rowwise type:", type(houses_rowwise))

# Print `houses_rowwise` length
print("houses_rowwise length:", len(houses_rowwise))

# Get output of `houses_rowwise`
houses_rowwise

houses_rowwise type: <class 'list'>
houses_rowwise length: 5


[{'price_approx_usd': 115910.26,
  'surface_covered_in_m2': 128,
  'rooms': 4,
  'price_per_m2': 905.54890625},
 {'price_approx_usd': 48718.17,
  'surface_covered_in_m2': 210,
  'rooms': 3,
  'price_per_m2': 231.9912857142857},
 {'price_approx_usd': 28977.56,
  'surface_covered_in_m2': 58,
  'rooms': 2,
  'price_per_m2': 499.61310344827587},
 {'price_approx_usd': 36932.27,
  'surface_covered_in_m2': 79,
  'rooms': 3,
  'price_per_m2': 467.4970886075949},
 {'price_approx_usd': 83903.51,
  'surface_covered_in_m2': 111,
  'rooms': 3,
  'price_per_m2': 755.8874774774774}]

In [13]:
# Declare `house_prices` as empty list
house_prices = []

# Iterate through `houses_rowwise`
for house in houses_rowwise:
    # For each house, append "price_approx_usd" to `house_prices`
    house_prices.append(house['price_approx_usd'])

# Calculate `mean_house_price` using `house_prices`
mean_house_price = sum(house_prices) / len(house_prices)

# Print `mean_house_price` object type
print("mean_house_price type:", type(mean_house_price))

# Get output of `mean_house_price`
mean_house_price

mean_house_price type: <class 'float'>


62888.35399999999
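
The same calculation can be written more compactly. The sketch below is one possible variant, not the approach used in the cell above: it collects the prices with a list comprehension and uses `mean` from Python's standard `statistics` module. Because `statistics.mean` handles rounding differently from `sum(...) / len(...)`, the result may differ from the output above in the last few digits.

```python
from statistics import mean

houses_rowwise = [
    {"price_approx_usd": 115910.26, "surface_covered_in_m2": 128, "rooms": 4},
    {"price_approx_usd": 48718.17, "surface_covered_in_m2": 210, "rooms": 3},
    {"price_approx_usd": 28977.56, "surface_covered_in_m2": 58, "rooms": 2},
    {"price_approx_usd": 36932.27, "surface_covered_in_m2": 79, "rooms": 3},
    {"price_approx_usd": 83903.51, "surface_covered_in_m2": 111, "rooms": 3},
]

# Pull one feature out of every observation in a single expression.
house_prices = [house["price_approx_usd"] for house in houses_rowwise]

# Let the standard library compute the average.
mean_house_price = mean(house_prices)

print("mean_house_price:", mean_house_price)
```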

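With row-wise data, we had to loop over every house just to collect the prices before we could average them. To streamline these calculations, let's reorganize our data. Instead of grouping values by observation, we'll group them by feature, so that each feature is already a single list. We'll still use dictionaries and lists, but in a slightly different way.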

In [15]:
# Declare variable `houses_columnwise`
houses_columnwise = {
    "price_approx_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    "surface_covered_in_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
    "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
}

# Print `houses_columnwise` object type
print("houses_columnwise type:", type(houses_columnwise))

# Get output of `houses_columnwise`
houses_columnwise

houses_columnwise type: <class 'dict'>


{'price_approx_usd': [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
 'surface_covered_in_m2': [128.0, 210.0, 58.0, 79.0, 111.0],
 'rooms': [4.0, 3.0, 2.0, 3.0, 3.0]}
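
The column-wise dictionary above was typed out by hand. If the data already exists in row-wise form, it could also be converted programmatically. The following is a minimal sketch of one way to do that, assuming every observation has exactly the same keys. It uses the integer values from `houses_rowwise`, so the resulting lists contain integers where the hand-typed version has floats.

```python
houses_rowwise = [
    {"price_approx_usd": 115910.26, "surface_covered_in_m2": 128, "rooms": 4},
    {"price_approx_usd": 48718.17, "surface_covered_in_m2": 210, "rooms": 3},
    {"price_approx_usd": 28977.56, "surface_covered_in_m2": 58, "rooms": 2},
    {"price_approx_usd": 36932.27, "surface_covered_in_m2": 79, "rooms": 3},
    {"price_approx_usd": 83903.51, "surface_covered_in_m2": 111, "rooms": 3},
]

# Start with one empty list per feature, taking the keys from the first house.
# Assumption: all observations share these keys.
houses_columnwise = {key: [] for key in houses_rowwise[0]}

# Walk through each observation and file each value under its feature.
for house in houses_rowwise:
    for key, value in house.items():
        houses_columnwise[key].append(value)

print(houses_columnwise)
```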

In [16]:
# Calculate `mean_house_price` using `houses_columnwise`
prices = houses_columnwise["price_approx_usd"]
mean_house_price = sum(prices) / len(prices)

# Print `mean_house_price` object type
print("mean_house_price type:", type(mean_house_price))

# Get output of `mean_house_price`
mean_house_price

mean_house_price type: <class 'float'>


62888.35399999999

In [17]:
# Add "price_per_m2" key-value pair for `houses_columnwise`

price = houses_columnwise["price_approx_usd"]
area = houses_columnwise["surface_covered_in_m2"]

# zip() pairs each price with the area at the same position
price_per_m2 = []
for p, a in zip(price, area):
    price_m2 = p / a
    price_per_m2.append(price_m2)
houses_columnwise["price_per_m2"] = price_per_m2

# Print `houses_columnwise` object type
print("houses_columnwise type:", type(houses_columnwise))

# Get output of `houses_columnwise`
houses_columnwise

houses_columnwise type: <class 'dict'>


{'price_approx_usd': [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
 'surface_covered_in_m2': [128.0, 210.0, 58.0, 79.0, 111.0],
 'rooms': [4.0, 3.0, 2.0, 3.0, 3.0],
 'price_per_m2': [905.54890625,
  231.9912857142857,
  499.61310344827587,
  467.4970886075949,
  755.8874774774774]}
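
Sometimes it is useful to go back the other way and recover one dictionary per house. The sketch below shows one possible approach using `zip`, assuming every list in the column-wise dictionary has the same length. The names `keys` and `houses_rebuilt` are made up for this example.

```python
houses_columnwise = {
    "price_approx_usd": [115910.26, 48718.17, 28977.56, 36932.27, 83903.51],
    "surface_covered_in_m2": [128.0, 210.0, 58.0, 79.0, 111.0],
    "rooms": [4.0, 3.0, 2.0, 3.0, 3.0],
}

keys = list(houses_columnwise)

# zip(*columns) takes the i-th value from every column, giving one tuple per house.
# dict(zip(keys, values)) then re-attaches the feature names to those values.
houses_rebuilt = [
    dict(zip(keys, values)) for values in zip(*houses_columnwise.values())
]

for house in houses_rebuilt:
    print(house)
```

Note that `zip` stops at the shortest input, so columns of unequal length would silently drop values. Checking the lengths first would be a sensible safeguard in real code.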