> ### Note on Labs and Assignments:
>
> 🔧 Look for the **wrench emoji** 🔧 — it highlights where you're expected to take action!
>
> These sections are graded and are not optional.
>

# IS 4487 Lab 4: Data Understanding

## Outline

- Load and preview a real-world dataset
- Inspect structure and preview data
- Inspect distributions and identify missing or unusual data
- Deal with outliers or skew
- Perform basic grouped summaries

<a href="https://colab.research.google.com/github/vandanara/UofUtah_IS4487/blob/main/Labs/lab_04_data_understanding.ipynb" target="_parent">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

If you have any questions about Colab, you can read more here:  
https://research.google.com/colaboratory/faq.html


## Dataset Overview

This week we will use data on Bay Area Craigslist rental housing posts.

Source: Pennington, Kate (2018). Bay Area Craigslist Rental Housing Posts, 2000-2018. Retrieved from https://github.com/katepennington/historic_bay_area_craigslist_housing_posts/blob/master/clean_2000_2018.csv.zip

**Dataset:** `rent.csv`  
Source: [TidyTuesday – 2022-07-05](https://github.com/rfordatascience/tidytuesday/blob/main/data/2022/2022-07-05/rent.csv)

| Variable       | Type       | Description |
|----------------|------------|-------------|
| `post_id`      | Categorical| Unique listing ID |
| `date`         | Numeric    | Listing date (numeric format) |
| `year`         | Integer    | Year of listing |
| `nhood`        | Categorical| Neighborhood |
| `city`         | Categorical| City |
| `county`       | Categorical| County |
| `price`        | Numeric    | Listing price (USD) |
| `beds`         | Numeric    | Number of bedrooms |
| `baths`        | Numeric    | Number of bathrooms |
| `sqft`         | Numeric    | Square footage |
| `room_in_apt`  | Binary     | Whether the listing is for an entire apartment (0) or a single room within an apartment (1) |
| `address`      | Categorical| Street address |
| `lat`          | Numeric    | Latitude |
| `lon`          | Numeric    | Longitude |
| `title`        | Text       | Listing title |
| `descr`        | Text       | Listing description |
| `details`      | Text       | Additional details |


## Part 1: Importing the Libraries and Data

### Instructions:
- Import the `pandas`, `matplotlib` and `seaborn` libraries.
- Import the data from `rent.csv` into a DataFrame using the TidyTuesday link.
- Use `.info()` and `.head()` to inspect the structure and preview the data.

In [None]:
# import any libraries that you wish to use
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# import data
# pandas provides read_csv() to read CSV files.
# Similar functions exist for other formats: read_excel(), read_json(), read_html(), etc.
url = 'https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2022/2022-07-05/rent.csv'
df = pd.read_csv(url)

In [None]:
df.info()

In [None]:
df.head()

### 🔧 Task 1 - Try It Yourself

1. Add one line of code to print the number of rows and columns. To get the size of the table, use the `.shape` attribute of the DataFrame.

In [None]:
# Add code here 🔧
print(df.shape)

## Part 2: Inspecting, Cleaning & Removing Outliers

### Instructions:
- Identify whether variables have missing values.
- Check data types (e.g., dates, numeric columns). These are inferred. Do they look appropriate?
- Check for outliers in key numeric variables like `price`, `sqft`, `beds`, or `baths`. Outliers are values that are unusually low or high compared to most others.



In [None]:
# Check for missing values - get a count for each column
df.isnull().sum()
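
Raw counts are easier to judge next to the share of rows they represent. The sketch below uses a small hypothetical DataFrame (`toy`, not the real `rent.csv`) to show how `.isnull().mean()` gives the fraction of missing values per column; the column values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rent.csv columns
toy = pd.DataFrame({
    "price": [1800, 2400, 1500, 2100],
    "sqft": [650.0, None, None, 800.0],
    "address": [None, None, None, "123 Main St"],
})

# Count of missing values per column
missing_count = toy.isnull().sum()

# Fraction of missing values per column (mean of True/False flags)
missing_share = toy.isnull().mean()

# Combine both into one report, most incomplete columns first
report = pd.DataFrame({"missing": missing_count, "share": missing_share})
report = report.sort_values("share", ascending=False)
print(report)
```

Running the same two calls on `df` would show which columns are too sparse to rely on.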

In [None]:
# Basic summary statistics
df[['price', 'beds', 'baths', 'sqft']].describe()

In [None]:
# Check data types
df.dtypes

pandas infers data types when it reads the file.
- Numeric values are either `int` (whole numbers) or `float` (decimal numbers). The 64 in `int64` and `float64` is the number of bits used to store each value, so both types have a finite range.
- String/text is stored as `object`.
- The inferred data types are not always ideal. For instance, the `category` type is not assigned automatically; we have to set it ourselves. `category` works best when a column has only a limited set of allowed values. This can be a numeric variable (0 or 1) or a text variable with a fixed set of choices (`nhood`, `city`, `county`).
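
As a minimal sketch of what type inference and the `category` conversion do, the example below builds a small hypothetical DataFrame (`toy`, with made-up values) rather than loading the real data. It shows that a whole-number column is inferred as `int64`, that a column containing a missing value becomes `float64`, and that a categorical column stores one integer code per row plus a short list of distinct values.

```python
import pandas as pd

# Hypothetical toy data standing in for a few rent.csv columns
toy = pd.DataFrame({
    "city": ["oakland", "berkeley", "oakland", "san jose", "berkeley", "oakland"],
    "price": [1800, 2400, 1500, 2100, 2600, 1700],
    "sqft": [650.0, 900.0, None, 800.0, 1100.0, 700.0],
})

# Inferred types: whole numbers -> int64, decimals or missing values -> float64
inferred = toy.dtypes.astype(str).to_dict()
print(inferred)

# Convert a text column with few distinct values to category
city_cat = toy["city"].astype("category")
print(city_cat.cat.categories.tolist())  # the distinct allowed values
print(city_cat.cat.codes.tolist())       # the integer code stored for each row
```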



In [None]:
# Change the following object and int variables to categorical
df['nhood'] = df['nhood'].astype('category')
df['city'] = df['city'].astype('category')
df['county'] = df['county'].astype('category')
df['room_in_apt'] = df['room_in_apt'].astype('category')


### Outlier analysis
To see whether a variable contains outliers (points that are very different in value from the rest), we can create a **boxplot** of the variable.

In a boxplot, the box covers the interquartile range (IQR), from the 25th to the 75th percentile of the data. The whiskers (the T-shaped lines) extend to the most extreme data points within a set distance of the box (often 1.5 times the IQR). Any data points that fall outside this whisker range are considered outliers and are plotted individually as points.
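
The 1.5 × IQR rule can also be computed by hand. The sketch below uses a short hypothetical list of prices (not the real data) and pandas' default quantile interpolation; the names `lower_fence` and `upper_fence` are illustrative, and a plotting library may compute its quartiles slightly differently.

```python
import pandas as pd

# Hypothetical prices; 9000 is deliberately extreme
prices = pd.Series([1200, 1500, 1600, 1800, 2000, 2200, 2500, 9000])

# The box: 25th and 75th percentiles, and the distance between them
q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

# The fences: 1.5 * IQR beyond each edge of the box
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the ones a boxplot would draw individually
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(f"IQR: {iqr}, fences: [{lower_fence}, {upper_fence}]")
print(outliers.tolist())
```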

In [None]:
# Boxplot of price
sns.boxplot(x=df['price'])
plt.title("Boxplot of Rental Price")
plt.show()

In [None]:
# Remove price outliers (keep 1st–99th percentile)
q_low = df['price'].quantile(0.01)
q_high = df['price'].quantile(0.99)
df = df[(df['price'] >= q_low) & (df['price'] <= q_high)]

In [None]:
#check the shape to see how many rows were removed/lost when we dropped outliers in price
df.shape
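
To make the effect of percentile trimming concrete, here is a minimal sketch on hypothetical data: 100 made-up prices with the values 1 through 100. It applies the same 1st to 99th percentile filter and counts how many rows are lost. The variable names `toy`, `trimmed`, and `rows_removed` are illustrative.

```python
import pandas as pd

# Hypothetical data: 100 listings with prices 1, 2, ..., 100
toy = pd.DataFrame({"price": list(range(1, 101))})

# Same cutoffs as the lab: 1st and 99th percentiles
q_low = toy["price"].quantile(0.01)
q_high = toy["price"].quantile(0.99)

# Keep only rows whose price lies between the two cutoffs (inclusive)
trimmed = toy[(toy["price"] >= q_low) & (toy["price"] <= q_high)]

# Compare row counts before and after trimming
rows_removed = len(toy) - len(trimmed)
print(f"Kept {len(trimmed)} of {len(toy)} rows; removed {rows_removed}")
```

Rows with a missing price would also be dropped by this filter, because comparisons against a missing value evaluate to `False`.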

In [None]:
# the plot should have fewer outliers now - see for yourself
sns.boxplot(x=df['price'])
plt.title("Boxplot of Rental Price with outliers removed")
plt.show()

### 🔧 Try It Yourself – Part 2

1. Use `.describe()` and a boxplot to check for outliers in **square footage (`sqft`)**.

2. What patterns or issues do you see with square footage values? Is there anything unusual? You don't need to remove the outliers here; leave them in.


In [None]:
# Add code here 🔧
# print() is needed here: only the last expression in a cell is displayed automatically
print(df['sqft'].describe())

sns.boxplot(x=df['sqft'])
plt.title("Boxplot of Sqft")
plt.show()

🔧 Add comment here:

There are some very large outliers that lie far beyond the rest of the data, most likely very large homes.

## Part 3: Basic Exploration

Use `groupby` and `value_counts` to summarize trends across neighborhoods and cities.
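
As a small illustration of both methods, the sketch below runs them on a hypothetical six-row DataFrame (`toy`, with made-up values). It uses `.agg()` to compute several statistics per group at once and `value_counts(normalize=True)` to return shares instead of raw counts.

```python
import pandas as pd

# Hypothetical toy listings
toy = pd.DataFrame({
    "city": ["oakland", "oakland", "berkeley", "berkeley", "san jose", "oakland"],
    "price": [1500, 2500, 2000, 3000, 1800, 2000],
})

# Several summary statistics per city in one call
summary = toy.groupby("city")["price"].agg(["mean", "median", "count"])
print(summary)

# Share of listings per city instead of raw counts
shares = toy["city"].value_counts(normalize=True)
print(shares)
```

Reporting the count next to the mean helps flag groups whose average rests on only a handful of listings.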


In [None]:
# Average price by neighborhood
df.groupby('nhood')['price'].mean().sort_values(ascending=False).head(10)

In [None]:
# Top cities by count
df['city'].value_counts().head(10)

### 🔧 Try It Yourself – Part 3

Explore the data by completing all of the following:

1. Group the listings by `year` and calculate the average price for each year.
2. Use `.value_counts()` on the `room_in_apt` column to see how common room rentals are.
3. Add a short comment or markdown cell describing any trends or insights you found.


In [None]:
# Add code here 🔧
display(df.groupby('year')['price'].mean())

display(df['room_in_apt'].value_counts())

🔧 Add comment here:

Average rental prices trend upward over the years, but not consistently: in some years the average price dipped.

Whole-home rentals are far more common than room rentals.


## Export Your Notebook to Submit in Canvas
- Use the instructions from Lab 1

In [None]:
!jupyter nbconvert --to html "lab_04_LastnameFirstname.ipynb"