# Assignment 06

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment, you'll be introduced to data moves using Pandas. You'll learn how to:

- group data to compare summaries across categorical labels

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

## Data Moves & Wrangling with Pandas

The main goal of this assignment is to use Pandas and data moves to compare the number of floors of skyscrapers across different countries. To achieve this, you will perform some data wrangling and apply grouping.

Import Pandas using the alias `pd`.

In [None]:
... 

Read the `skyscrapers_world.csv` file from the `data` subfolder and save it as a `pandas` `DataFrame` named `ss`. Verify that the file was read correctly by:

- Displaying the dataframe’s information (metadata)
- Previewing the first 5 rows of the dataframe

**Note:** You can accomplish all of this within a single code cell, using `print()` statements.

In [None]:
ss = ...

print(f'Dataframe information\n{"=" * 21}')
print(ss.info())
print(f'\nFirst 5 rows of the dataframe\n{"=" * 63}')
print(ss.head())

## Last Week's Class

In last week’s class, we did some data wrangling by dropping a column, renaming columns, and converting the floors column from type `object` (string) to type `int` (integer). Use the code from the `week-7-lecture-notebook.ipynb` notebook to fill in the missing sections, then run the cell(s) needed to complete each of the steps listed below.

**Step 1.** Drop the `Unnamed: 0` column.

In [None]:
# The .drop() method removes specified rows or columns from a dataframe
ss = ...

**Step 2.** Convert the `status.started`, `status.completed`, and `height in meters` column names to `snake_case`.

In [None]:
# The .rename() method changes the labels of rows or columns in a dataframe
ss = ...(
    ... = {
        ... : 'status_started',
        ... : 'status_completed',
        ... : 'height_meters'
    }
)

**Step 3.** Display the column names of the `ss` dataframe to confirm that **Steps 1** and **2** were completed correctly.

In [None]:
# Verify that the column names have changed
ss.columns

**Step 4.** Run the code below to display the rows where the `floors` values contain non-digit characters.

In [None]:
ss.loc[[48, 61], ]

**Step 5.** Update the floors values so they contain only the numeric count (e.g., change `103 floors` to `103`, and `73 (68 Above Ground and 5 Below Ground)` to `73`).

In [None]:
# .loc[] accesses rows and columns in a dataframe by their labels.
... = 103
... = 73

# Verify that the values have been updated
ss.loc[[48, 61], 'floors']

**Step 6.** Convert the `floors` column of the `ss` dataframe into integers.

In [None]:
ss['floors'] = ss['floors'].astype(int)

# Verify the data type of the column has been updated
ss.info()
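
To see why Steps 5 and 6 have to happen in that order, here is a minimal sketch on made-up data. The `towers` dataframe and its values are hypothetical and are not taken from `skyscrapers_world.csv`; the column is built with `dtype=object` to mirror the `object` (string) type described above.

```python
import pandas as pd

# Hypothetical toy data: one entry carries extra text, like '103 floors'
towers = pd.DataFrame({
    'name': ['Tower A', 'Tower B', 'Tower C'],
    'floors': pd.Series(['88', '103 floors', '73'], dtype=object),
})

# Converting now fails because '103 floors' is not a valid integer
try:
    towers['floors'].astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True
print(f'Conversion failed before cleaning: {conversion_failed}')

# Replace the offending value using its row label and column label
towers.loc[1, 'floors'] = 103

# With only numeric content left, the conversion succeeds
towers['floors'] = towers['floors'].astype(int)
print(towers['floors'].to_list())
```

The same idea applies to the `ss` dataframe: the conversion can only succeed once every value in the column can be read as a whole number.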

Run the code cells below to verify that the column `floors` has been properly converted.

**Note:** You should see the five-number summary and a histogram.

In [None]:
# Display the five-number summary and the mean of the floors variable
ss['floors'].describe()

In [None]:
# Plot a histogram to visualize the distribution of the floors variable
ss['floors'].hist(edgecolor = 'white', grid = False);

## Country Comparisons

Next, we need to review the country names. Our focus will be on comparing skyscrapers in China, the United Arab Emirates (UAE), and the United States of America (USA).

**Question 1.** Create a frequency table showing the number of skyscrapers in each country.

In [None]:
...

## Data Wrangling for Categorical Labels

First, we need to correct the entry that lists New York as a country. Then, we want to standardize all labels referring to the United States of America so they use the same format—specifically, the ISO 3166-1 alpha-3 country code.

**Note:** The ISO 3166-1 alpha-3 country code for the United States of America is USA.

Execute the code in the following cells to complete the steps outlined below.

**Step 1.** Complete the conditional statement to find the row where `New York` is listed as a country.

In [None]:
mask = ss['country'] == ...
ss[mask]

**Step 2.** Complete the code to update the value from `New York` to `USA`.

In [None]:
... = 'USA'

# Verify that the value has been updated
...

Next, we'll update any remaining values that refer to the USA but appear under different labels. We can use the same procedure as in the previous steps. 

Execute the code in the following cells to complete the steps outlined below.

**Step 3.** List all the unique values in the `country` column.

In [None]:
ss['country'].unique()

**Step 4.** Create a list of all the different label variations that refer to the USA.

**Note:** There should be three items in your list.

In [None]:
usa = [..., ..., ...]

**Step 5.** Use the `.isin()` method to check whether each column value matches any of the items in the `usa` list.

**Note:** The `.isin()` method in Pandas checks whether each value in a `Series` is contained in a given list (or other sequence). It returns a `Series` of `True`/`False` values.

In [None]:
ss['country'].isin(usa)
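
As a small illustration of what `.isin()` returns, the sketch below uses a made-up `Series` of fruit labels. The names `fruit`, `apple_variants`, and `is_apple` are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical labels, not taken from the skyscraper data
fruit = pd.Series(['apple', 'Apple', 'pear', 'APPLE', 'plum'])
apple_variants = ['apple', 'Apple', 'APPLE']

# One True/False value per element of the Series
is_apple = fruit.isin(apple_variants)
print(is_apple.to_list())   # [True, True, False, True, False]

# Because True counts as 1, summing the mask counts the matches
print(is_apple.sum())       # 3
```

Note that the comparison is exact, so `'Apple'` matches only because it appears in the list with that capitalization.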

**Step 6.** Run the cell below to display all rows where the country label matches any value in the `usa` list.

In [None]:
mask = ss['country'].isin(usa)
ss[mask]

**Step 7.** Next, we need the index values. We can retrieve them using the `.index` attribute, and then convert the returned Index object into a list with `.to_list()`. 

**Note:** Converting to a list is not required, but we’ll do it here since we’re already familiar with lists and have used them before.

In [None]:
ss[mask].index.to_list()
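
The sketch below, on a made-up `pets` dataframe, shows the difference between the `Index` object and the list, and suggests why the conversion is optional: `.loc` selects the same rows either way. All names here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data
pets = pd.DataFrame({'species': ['cat', 'dog', 'kitty', 'cat', 'bird']})
mask = pets['species'].isin(['cat', 'kitty'])

matching_index = pets[mask].index          # a pandas Index object
matching_list = matching_index.to_list()   # a plain Python list
print(matching_list)                       # [0, 2, 3]

# .loc accepts the Index object or the list; both select the same rows
from_index = pets.loc[matching_index, 'species']
from_list = pets.loc[matching_list, 'species']
print(from_index.equals(from_list))        # True
```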

**Step 8.** Finally, we can use either a `for` loop or the `.loc` accessor with the index labels to update all values that refer to the United States of America so they consistently use USA.

### `for` loop

```python
# Create the Boolean mask using the .isin() method
mask = ss['country'].isin(usa)

# Create the list using the .index accessor and the .to_list() method
usa_index = ss[mask].index.to_list()

print(f"{'Index':<5} {'Previous Value':<25} {'Updated Value':<15}")
print(f"{'=' * 5} {'=' * 25} {'=' * 15}")

for i in usa_index:
    previous_label = ss.loc[i, 'country']
    updated_label = ss.loc[i, 'country'] = 'USA'
    print(f"{i:<5} {previous_label:<25} {updated_label:<15}")
```

### `.loc` 
```python
# Create the Boolean mask using the .isin() method
mask = ss['country'].isin(usa)

# Create the list using the .index accessor and the .to_list() method
usa_index = ss[mask].index.to_list()

print(f"The country name before standardization")
print(ss.loc[usa_index, 'country'])

ss.loc[usa_index, 'country'] = 'USA'

print(f"\nThe country name after standardization")
print(ss.loc[usa_index, 'country'])
```

**Note:** Using `.loc` is the preferred method, so the cell below uses it. You'll apply the `for` loop version in **Question 3** to revisit the concept of iteration.

In [None]:
usa_index = ss[mask].index.to_list()

print(f"The country name before standardization")
print(ss.loc[usa_index, 'country'])

ss.loc[usa_index, 'country'] = 'USA'

print(f"\nThe country name after standardization")
print(ss.loc[usa_index, 'country'])

**Question 2.** Run the cell below to view all the unique labels. Then, in the next cell, create a list of all label variations that refer to the UAE.

**Note:** There should be four items in your list.

In [None]:
ss['country'].unique()

In [None]:
# Create the uae list
uae = [..., ..., ..., ...]

**Question 3.** Fill in the missing sections of the code below to update all values referring to the United Arab Emirates (this includes Dubai) so they consistently use the label `UAE`.

In [None]:
mask = ss['country'].isin(...)
uae_index = ...

print(f"{'Index':<5} {'Previous Value':<25} {'Updated Value':<15}")
print(f"{'=' * 5} {'=' * 25} {'=' * 15}")

for ... in ...:
    previous_label = ss.loc[..., 'country']
    updated_label = ss.loc[..., 'country'] = 'UAE'
    print(f"{...:<5} {previous_label:<25} {updated_label:<15}")

**Question 4.** Update all values referring to China (this includes Shenzhen) using the `.loc` accessor so they consistently use the label `CHN`.

In [None]:
#  Create the chn list
chn = [..., ...]

# Create the Boolean mask using the .isin() method
mask = ss['country'].isin(chn)

# Create the list using the .index accessor and the .to_list() method
chn_index = ss[mask].index.to_list()

print(f"The country name before standardization")
print(ss.loc[chn_index, 'country'])

# Assign 'CHN' to the 'country' column for the rows listed in chn_index
ss.loc[chn_index, 'country'] = 'CHN'

print(f"\nThe country name after standardization")
print(ss.loc[chn_index, 'country'])

**Question 5.** Run the code below to filter the `ss` dataframe so that it only includes rows for `USA`. Store the output in a new `DataFrame` object named `usa`.

In [None]:
mask = ss['country'] == 'USA'
usa = ss[mask]

# Verify that the filter worked correctly
print(usa['country'].unique())

**Question 6.** Use the code from the previous question to filter the `ss` dataframe so that it only includes rows for `UAE`. Store the output in a new `DataFrame` object named `uae`.

In [None]:
...

# Verify that the filter worked correctly
print(uae['country'].unique())

**Question 7.** Use the code from the previous question to filter the `ss` dataframe so that it only includes rows for `CHN`. Store the output in a new `DataFrame` object named `chn`.

In [None]:
...

# Verify that the filter worked correctly
print(chn['country'].unique())

**Question 8.** Fill in the missing sections to display the five-number summary of the `floors` variable for skyscrapers located in the USA, UAE, and CHN.

**Note:** You can accomplish all of this within a single code cell, using `print()` statements.

In [None]:
# Print the five-number summary for the USA
print(f'The five-number summary for the USA\n{"=" * 35}')
print(...)

# Print the five-number summary for the UAE
print(f'\nThe five-number summary for the UAE\n{"=" * 35}')
print(...)

# Print the five-number summary for CHN
print(f'\nThe five-number summary for China\n{"=" * 33}')
print(...)

## Grouping

With consistent country labels, we can now analyze each country separately or use `.groupby()` to compare them directly. The `.groupby()` method groups observations by label, allowing us to aggregate the data and calculate measures like the five-number summary or other statistics.
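
To make the idea concrete, here is a minimal sketch of `.groupby()` on made-up data. The `scores` dataframe, its `team` and `points` columns, and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue', 'red'],
    'points': [10, 4, 20, 8, 30],
})

# One row of summary statistics per team label
summary = scores.groupby('team')['points'].describe()
print(summary[['count', 'min', '50%', 'max']])

# A single statistic works the same way
median_by_team = scores.groupby('team')['points'].median()
print(median_by_team)
```

Each distinct label in the grouping column becomes one row of the result, so the groups can be compared side by side.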

**Question 9.** Create a list named `usa_uae_chn` that stores the country labels for `USA`, `UAE`, and `CHN` as strings.

In [None]:
usa_uae_chn = [..., ..., ...]

**Question 10.** Complete the code cell below, then run it to see how we can combine methods like `.isin()` and `.groupby()` to group the data for USA, UAE, and CHN, and display their five-number summary of the `floors` variable in a single output.

**Note:** The `.reset_index()` method moves the current index (here, the `country` labels) into a regular column and resets the DataFrame’s index back to the default (0, 1, 2, …). To see what the output looks like **without using** `.reset_index()`, insert a new code cell, copy and paste the command below, then run it.

```python
mask = ss['country'].isin(usa_uae_chn)
ss[mask].groupby('country')['floors'].describe()
``` 

The output will look like this:

|           | count | mean      | std       | min  | 25%  | 50%  | 75%   | max   |
|-----------|-------|-----------|-----------|------|------|------|-------|-------|
|**country**|       |           |           |      |      |      |       |       |
|**CHN**    | 51.0  | 81.313725 | 15.556979 | 60.0 | 70.5 | 77.0 | 88.00 | 128.0 |
|**UAE**    | 16.0  | 86.687500 | 23.742279 | 54.0 | 74.5 | 84.0 | 88.75 | 163.0 |
|**USA**    | 15.0  | 86.066667 | 16.786332 | 59.0 | 72.5 | 88.0 | 100.00| 110.0 |
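
The effect of `.reset_index()` can also be seen in isolation. The sketch below uses a small made-up `scores` dataframe (hypothetical names and values) and compares the index before and after the call.

```python
import pandas as pd

# Hypothetical toy data
scores = pd.DataFrame({
    'team': ['red', 'blue', 'red', 'blue'],
    'points': [10, 4, 20, 8],
})

# After grouping, the group labels become the index
grouped = scores.groupby('team')['points'].mean()
print(grouped.index.to_list())    # ['blue', 'red']

# reset_index() moves the labels into a column and restores 0, 1, 2, ...
flat = grouped.reset_index()
print(flat.columns.to_list())     # ['team', 'points']
print(flat.index.to_list())       # [0, 1]
```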

In [None]:
# Create a Boolean mask for usa_uae_chn
mask = ...

# Filter the ss dataframe using a Boolean mask, groupby country, select the floors column, 
# use the describe method to generate the five-number summary, and reset the index
...

Verify that the output from the previous code cell looks like this:
    
|     | **country**|**count**|**mean**|**std**|**min**|**25%**|**50%**|**75%**|**max**|
|:----|:-----------|--------:|-------:|------:|------:|------:|------:|------:|------:|
|**0**| CHN| 51.0  | 81.313725 | 15.556979 | 60.0 | 70.5 | 77.0 | 88.00  | 128.0 |
|**1**| UAE| 16.0  | 86.687500 | 23.742279 | 54.0 | 74.5 | 84.0 | 88.75  | 163.0 |
|**2**| USA| 15.0  | 86.066667 | 16.786332 | 59.0 | 72.5 | 88.0 | 100.00 | 110.0 |

## Communicating the Results

Before submitting, review the output from the previous cell and imagine how you would present it at a conference if this table were on a slide. Consider how you would guide the audience’s attention to the most important points. What insights do you want them to notice, and how would you highlight them? Think about the story the data is telling and how to frame it clearly and concisely. Finally, picture yourself giving this presentation in under five minutes, focusing on clarity, key takeaways, and impact.

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.