# Assess Data Structure Programmatically

In this exercise, you will perform a brief structural assessment of two datasets.

In [1]:
#Imports - DO NOT MODIFY
import pandas as pd

## Dataset context

Our dataset is the "Hospital Annual Utilization Report" data from the California Health and Human Services Open Data Portal, containing data on hospital buildings.

Column descriptions (taken from https://data.chhs.ca.gov/dataset/hospital-building-data/resource/cefc10e5-5071-4ca4-8b03-2249caf0d294):

- **County Code:** County number (set by State of California) and County Name
- **Perm ID:** Facility number per Facilities Development Division
- **Facility Name:** Name of the General Acute Care Hospital
- **City:** City
- **Building Nbr:** Unique building number assigned to seismically separate building in a hospital campus.
- **Building Name:** Building name provided by the Facility.
- **Building Status:** Building status.
> If currently in service, status is "In Service". If under construction, status is "Under Construction". Other statuses are used to identify buildings that may be located in general acute care facility but do not provide general acute care services.
- **SPC Rating\*:** Structural Performance Category (SPC) rating.
> It rates the building structure on a scale of 1 to 5; an “s” is added where the rating is not confirmed by HCAI. SPC 1 is assigned to buildings that may be at risk of collapse in a strong earthquake, and SPC 5 is assigned to buildings reasonably capable of providing services to the public following strong ground motion. N/A = Not Applicable and NYA = Not Yet Available.
- **Building URL:** A URL that opens the page associated with the Building Nbr in the eServices Portal, which provides access to related projects being constructed in the building.
- **Height (ft):** Height of the building in feet, where available

Let's take a look at the first few rows of the data.

In [None]:
# DO NOT MODIFY - read the .csv file
ospd_data = pd.read_csv('ca-oshpd-gachospital-hospitalbuildingdata-03092023.csv')

In [None]:
# FILL IN - show the first few rows of the dataframe

## 1. Inspect the data tidiness

During our data tidiness investigation, we will look at the application of the following rules:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.

At first glance, our dataset looks cleanly formatted: the column names serve as the header, each row holds a single observation, and each cell holds a single value. A closer look, however, reveals a number of issues. Below, you will find three issues related to data tidiness and investigate them programmatically.

### 1.1 Investigate the `County Code`

Investigate the `County Code` column programmatically using the `.describe()` and `.value_counts()` methods. What kinds of values does this column contain, and how could we tidy it? Hint: Think about the "Multiple variables stored in one column" guideline.
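As a minimal sketch of what these two methods report, the example below uses a small made-up Series (the name `combined`, its values, and the `" - "` separator are all hypothetical and not taken from the hospital data) in which each cell packs a code and a label together. It also shows one possible way to split such a column with `str.split`; the real column may need a different separator or approach.

```python
import pandas as pd

# Hypothetical toy column: each cell stores a code and a label in one string.
combined = pd.Series(
    ["01 - Alpha", "02 - Beta", "01 - Alpha", "03 - Gamma"],
    name="combined",
)

# For string data, describe() reports count, unique, top, and freq.
summary = combined.describe()
print(summary)

# value_counts() lists each distinct value with the number of times it occurs.
counts = combined.value_counts()
print(counts)

# One way to tidy: split on the assumed separator into two separate columns.
parts = combined.str.split(" - ", n=1, expand=True)
parts.columns = ["code", "label"]
print(parts)
```

Looking at the distinct values first helps confirm that every cell follows the same pattern before a split is applied.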

In [None]:
#FILL IN - describe the data

In [None]:
#FILL IN - use .value_counts() on the data

*FILL IN explanation:* ...

### 1.2 Investigate unnecessary values
Which variable adds no value to the data and could be removed from the dataframe? Hint: Use the `.describe()` method.
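The sketch below illustrates, on a made-up dataframe (the column names `name`, `link`, and `height` are hypothetical), how `describe(include="all")` summarises non-numeric columns as well as numeric ones. One pattern it can reveal is a column with a single distinct value; whether a given column is actually removable still depends on what the analysis needs.

```python
import pandas as pd

# Hypothetical toy dataframe; the column names are illustrative only.
df = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "link": ["x", "x", "x", "x"],  # same value in every row
        "height": [10.0, 12.5, 9.0, 30.0],
    }
)

# include="all" adds count, unique, top, and freq for non-numeric columns.
summary = df.describe(include="all")
print(summary)

# Columns with exactly one distinct value do not distinguish any rows.
constant_cols = [col for col in df.columns if df[col].nunique(dropna=False) == 1]
print(constant_cols)
```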

In [None]:
#Fill in
#Describe the dataset

*FILL IN explanation*: ...

### 1.3 Investigate different observational units

Do you see cases of multiple observational units being stored in a single table in the dataset? Inspect the dataframe visually by looking at its first few rows, and programmatically by checking the number of unique values in each column. Explain how we could mitigate this duplication by having two separate tables.

**Note:** Here, we are asking you to think about how you would separate the data into two separate entities/dataframes/tables, rather than looking at repetitive values across columns in the original dataframe.
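To make the idea of two observational units concrete, here is a minimal sketch on a made-up table (the `site_*` and `room_*` columns are hypothetical and unrelated to the hospital data). It counts distinct values per column with `nunique()` and then shows one possible split into two tables linked by a shared key.

```python
import pandas as pd

# Hypothetical toy table mixing two observational units: sites and rooms.
df = pd.DataFrame(
    {
        "site_id": [1, 1, 2, 2, 2],
        "site_name": ["North", "North", "South", "South", "South"],
        "room_id": [10, 11, 20, 21, 22],
        "room_size": [30, 45, 25, 60, 40],
    }
)

# nunique() counts distinct values per column: site columns repeat, room_id does not.
unique_counts = df.nunique()
print(unique_counts)

# Table 1: one row per site, with the repeated rows dropped.
sites = df[["site_id", "site_name"]].drop_duplicates().reset_index(drop=True)

# Table 2: one row per room, keeping site_id as the key back to the sites table.
rooms = df[["room_id", "site_id", "room_size"]]

# Merging on the key recovers the information held in the original table.
restored = rooms.merge(sites, on="site_id")
print(restored)
```

Columns with far fewer distinct values than there are rows, and that always change together, are candidates for a table of their own.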

In [None]:
#FILL IN - print first 10 rows of dataframe

In [None]:
#FILL IN
#Find the number of unique values per column, e.g. with .nunique() (.unique() lists the distinct values of a single column)

*FILL IN explanation*: ...