# Assignment 05

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment, you'll be introduced to Jupyter Notebooks and learn how to:

- use `pandas` `DataFrame` and `Series` methods.

- perform data moves, such as filtering and grouping.

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

During the lecture, we discussed some data entry errors in the `skyscrapers_world.csv` dataset. In this notebook, we will further correct these errors and apply data moves, such as filtering, summarizing, and grouping, to gain deeper insights into the buildings contained in the dataset.

**Question 1.** Import the `pandas` library with the appropriate alias and load the `skyscrapers_world.csv` dataset from the `data` folder into a `pandas` `DataFrame` named `ss`. Display the first five rows to verify that the data loaded correctly.

**Note:** Use separate code cells, one for importing the library and another for loading the dataset.

In [None]:
## Import the pandas library
import pandas as pd

In [None]:
## Load the dataset
ss = pd.read_csv('data/skyscrapers_world.csv')
ss.head()

Let's examine the metadata for the `ss` DataFrame. In the code cell below, run the `ss.info()` command to display column names, data types, and non-null counts.

**Note:** Metadata refers to data about the data, such as column names, data types, missing values, and overall structure. Reviewing metadata helps ensure data integrity and guides preprocessing steps.

In [None]:
ss.info()
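As a complement to `ss.info()`, the sketch below shows two other ways to inspect metadata: the `.dtypes` attribute and `.isna().sum()`. The `toy` `DataFrame` and its values are made up for illustration and are not taken from `skyscrapers_world.csv`.

```python
import pandas as pd

# Hypothetical toy data; the real dataset's columns and values may differ
toy = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C"],
    "floors": ["101", "eighty-eight", None],  # numbers stored as text, plus one missing value
    "height in meters": ["1,002.5", "452", "300.1"],
})

# Data type of each column (text columns are not numeric types)
column_types = toy.dtypes

# Number of missing values in each column
missing_counts = toy.isna().sum()

print(column_types)
print(missing_counts)
```

Columns that should hold numbers but are reported as text are good candidates for the cleaning steps in the questions that follow.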

**Question 2.** Refer to **Examples 1 and 2** in the `Week 8 lecture notebook` to:  

- Remove the `Unnamed: 0` column.  

- Rename the columns `status.started`, `status.completed`, and `height in meters`, converting them to `snake_case`.

In separate code cells, confirm that your processing has produced the expected results.

**Note:** Feel free to add separate code cells for each step. Be sure to include comments in each cell to clearly describe the purpose of the code.

In [None]:
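## Remove the 'Unnamed: 0' column
ss = ss.drop(columns = ['Unnamed: 0'])

In [None]:
## Confirm that 'Unnamed: 0' no longer appears
ss.columns

In [None]:
## Rename the columns using snake_case
ss = ss.rename(columns = {'status.completed' : 'year_completed',
                          'status.started' : 'year_started',
                          'height in meters' : 'height_meters'
                         }
              )

In [None]:
## Confirm the new column names
ss.columns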

**Question 3.** Refer to **Example 3** in the `Week 8 lecture notebook` to convert the `height_meters` column (renamed from `height in meters` in Question 2) from a string to a float. Then, add the cleaned values to the `ss` `DataFrame` as a new column named `clean_height_meters`.

In a separate code cell, confirm that your processing has produced the expected results.

**Note:** Feel free to add separate code cells for each step. Be sure to include comments in each cell to clearly describe the purpose of the code.

In [None]:
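## Remove the thousands separators, then convert the strings to floats
ss['clean_height_meters'] = ss['height_meters'].str.replace(",", "", regex = False).astype(float)

In [None]:
## Confirm that clean_height_meters has a float data type
ss.info()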
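The conversion can be checked in isolation on a small example. This is a minimal sketch using made-up height strings, not values from the dataset: commas are removed as literal text (`regex=False`), and the result is cast to `float`.

```python
import pandas as pd

# Hypothetical height strings with thousands separators
heights = pd.Series(["1,002.5", "452", "300.1"], name="height_meters")

# Strip the commas, then convert the remaining text to floats
clean = heights.str.replace(",", "", regex=False).astype(float)

print(clean.dtype)  # float64
print(clean.tolist())
```

If any string still contained non-numeric characters after the commas were removed, `.astype(float)` would raise a `ValueError`, which is a useful signal that more cleaning is needed.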

**Question 4.** Refer to **Example 5** in the `Week 8 lecture notebook` to correct any text entries in the `floors` column. Replace these incorrect values with the appropriate numerical values based on the dataset. Then, convert the `floors` column to a numeric data type.

In a separate code cell, confirm that your processing has produced the expected results.

**Note:** Feel free to add separate code cells for each step. Be sure to include comments in each cell to clearly describe the purpose of the code.

In [None]:
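## Replace the text entries in rows 48 and 61 with numeric values
ss.loc[48, 'floors'] = 103
ss.loc[61, 'floors'] = 73

In [None]:
## Convert the floors column to a numeric data type
ss['floors'] = pd.to_numeric(ss['floors'])

In [None]:
## Confirm that floors now has a numeric data type
ss.info()

In [None]:
## Confirm the corrected values
print(ss.loc[48, 'floors'])
print(ss.loc[61, 'floors'])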
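One possible way to locate text entries like these, rather than scanning the column by eye, is to let `pd.to_numeric` mark anything it cannot parse. The sketch below uses a made-up `floors` column; the values and row positions are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical floors column with two text entries
floors = pd.Series(["101", "eighty-eight", "73", "one hundred three"])

# errors="coerce" turns unparseable entries into NaN instead of raising an error
as_numbers = pd.to_numeric(floors, errors="coerce")

# Rows that had a value but could not be parsed as a number
bad_rows = floors[as_numbers.isna() & floors.notna()]
print(bad_rows)

# Correct the text entries by hand, then convert the whole column
fixed = floors.copy()
fixed.loc[1] = "88"
fixed.loc[3] = "103"
fixed = pd.to_numeric(fixed)
print(fixed.tolist())
```

Coercion is used here only to find the problem rows. The replacement values still have to be chosen by looking at the data, as in the cells above.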

**Question 5.** Use the `.value_counts()` method to generate a frequency table of country names, showing the count of occurrences for each unique country.

In [None]:
## Count the occurrences of each country label
ss['country'].value_counts()

The code

```python
ss['country'].value_counts()
```

runs without errors, but the counts are misleading because the same country has been entered in different ways, which splits its total across several labels. For instance:

- USA, US, and United States of America are listed separately, though they refer to the same country.

- United Arab Emirates also appears as United Arab Emirates (UAE) and UAE.

- Malaysia is misspelled as Malasya.

- Saudi Arabia is listed as saudi Arabia (with a lowercase "s").

- Shenzhen is a city in China, not a country.

- New York is a city in the United States of America, not a country.

- Dubai is a city in the United Arab Emirates, not a country.
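The effect of these inconsistencies can be seen on a small made-up example. The labels below are chosen for illustration only and are not counts from the dataset: `value_counts()` treats every distinct spelling as its own category, so the total for one country is spread across several rows.

```python
import pandas as pd

# Hypothetical country labels; not taken from skyscrapers_world.csv
countries = pd.Series(["UAE", "Dubai", "United Arab Emirates (UAE)", "UAE", "China"])

# Each distinct spelling is counted separately
label_counts = countries.value_counts()
print(label_counts)

# Adding up the variants gives the total that a single consistent label would show
uae_variants = ["UAE", "Dubai", "United Arab Emirates (UAE)"]
true_uae_total = label_counts.loc[uae_variants].sum()
print(true_uae_total)
```

In this toy example the label `UAE` is counted twice, even though four of the five entries refer to the United Arab Emirates.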

We aim to compare summary statistics for the top three countries in the dataset: China, the United Arab Emirates, and the United States. To ensure an accurate comparison, we must first standardize the country names.

## Making Choices

In this section, you will choose a categorical label of your preference. For brevity and clarity, I will demonstrate two methods to label all buildings in the United Arab Emirates as **UAE**. You can then select your preferred method to apply the same labeling approach for buildings in China and the United States.

### Filter

First, we need to find all the rows that should be labeled UAE.

1. A list named `uae` is created that contains different variations of how the United Arab Emirates may appear in the dataset.

```python
uae = ["United Arab Emirates", "United Arab Emirates (UAE)", "Dubai", "UAE"]
```

2. The `.isin(uae)` method checks whether each value in the `country` column appears in the `uae` list. This produces a boolean mask, a `Series` of `True` and `False` values:

   - `True`: The country name is in the `uae` list.

   - `False`: The country name is not in the `uae` list.
   
```python
mask = ss['country'].isin(uae)
```

3. Filter the `DataFrame` to include only the rows where `mask` is `True` (i.e., where `country` matches one of the values in `uae`), then select only the `country` column from these rows. This displays the results but does not modify the original `ss` `DataFrame`.

```python
ss[mask]['country']
```

In [None]:
uae = ["United Arab Emirates", "United Arab Emirates (UAE)", "Dubai", "UAE"]
mask = ss['country'].isin(uae)
ss[mask]['country']
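Filtering only displays the matching rows. One way the same mask could be used to relabel them is assignment through `.loc`, sketched here on a made-up `toy` `DataFrame` (the building names and rows are hypothetical). This is an illustration under those assumptions, not necessarily the method used in the lecture.

```python
import pandas as pd

# Hypothetical rows; not taken from skyscrapers_world.csv
toy = pd.DataFrame({
    "name": ["Tower A", "Tower B", "Tower C", "Tower D"],
    "country": ["UAE", "Dubai", "China", "United Arab Emirates (UAE)"],
})

uae = ["United Arab Emirates", "United Arab Emirates (UAE)", "Dubai", "UAE"]
mask = toy["country"].isin(uae)

# Overwrite the country value in every row where the mask is True
toy.loc[mask, "country"] = "UAE"

print(toy["country"].value_counts())
```

Assigning with `.loc[mask, column]` modifies the `DataFrame` in place, whereas chained selection such as `toy[mask]['country'] = "UAE"` may operate on a temporary copy and leave the original unchanged.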

In [None]:
## Filter the rows whose country label is a variation of the United States
usa = ["USA", "US", "United States of America", "New York"]
mask = ss['country'].isin(usa)
ss[mask]

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.