# Assignment 08

## Due: See Date in Moodle

In this assignment you will use intermediate and advanced features in Python.

To receive **full credit** for this assignment, you must complete **all** questions.

## This Week's Assignment

In this week's assignment, you will perform data wrangling. This includes, but is not limited to, the following:
    
- converting strings to numbers

- dropping a column from a dataframe

- adding a new column to a dataframe

- designing user defined functions

- plotting a histogram and a line chart

### Notes

- Adhere to good programming practices, utilizing descriptive variable names, appropriate spacing for readability, and adding comments to your code. 

- Ensure written responses maintain correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

In this notebook we will be working with the skyscraper data that your class collected. The dataset can be accessed [**here**](https://docs.google.com/spreadsheets/d/1W0uRGIU43sMvQ1pANUtlSkFKeY_TMl9d3QZyegEhxOo/edit#gid=1105865786).


Let's get started!

**Question 1.** Import the `pandas` library using the appropriate alias. 

In [None]:
...

In [None]:
# Load the dataset into a dataframe named skyscrapers
skyscrapers = pd.read_csv('data/skyscrapers.csv')

# Show the first 5 rows of the dataset
skyscrapers.head()

We can select columns from a `pandas` dataframe by the name of the column.

**Question 2.** Select the first and last columns (by name), and return a `DataFrame` object.

In [None]:
...

We can also select specific columns by their index position using `.iloc`.

```
skyscrapers.iloc[:, 1:9]
```

Let's break down the code `skyscrapers.iloc[:, 1:9]`:

- `skyscrapers`: This is the `DataFrame`.

- `.iloc[]`: This is an indexer used to select data from the `DataFrame` by integer location.

- `[:, 1:9]`: This part specifies the rows and columns to select. 

    - `:` indicates that we want to select all rows. 
    
    - `1:9` is a slice of integer positions for the columns we want to select. It starts at index position 1 and stops before index position 9, so it selects the columns at index positions 1 through 8.

So, `skyscrapers.iloc[:, 1:9]` selects all rows and the columns at index positions 1 through 8 from the `DataFrame` `skyscrapers`. Because index positions start at 0, this skips the first column.
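A slice and a list of positions behave differently inside `.iloc`. The minimal sketch below shows the difference on a small made-up `DataFrame`; the name `toy` and its columns are assumptions for illustration and are not part of the skyscraper data.

```python
import pandas as pd

# A small made-up DataFrame with five columns (hypothetical data)
toy = pd.DataFrame({
    'a': [1, 2],
    'b': [3, 4],
    'c': [5, 6],
    'd': [7, 8],
    'e': [9, 10],
})

# A slice selects a range of positions; the right endpoint is excluded
by_slice = toy.iloc[:, 1:3]

# A list selects exactly the positions it names
by_list = toy.iloc[:, [1, 3]]

print(list(by_slice.columns))  # ['b', 'c']
print(list(by_list.columns))   # ['b', 'd']
```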

Instead of hard-coding all the column index values, let's do it programmatically. 

**Question 3.** Fill in the missing portions of the code cell below, replacing the ellipses with the appropriate code segments to achieve the intended functionality.

In [None]:
# Assign the total number of columns to num_columns
# .shape returns a tuple (rows, columns)
# where the second item - index position 1 is selected
num_columns = ...

# Subtract 1 from num_columns
# The .iloc method does not include the endpoint
last_column_index = ...


# Use .iloc to select all columns 
# excluding the first and last
skyscrapers.iloc[:, ...:...]

Another way to do this is to use -1 for the right endpoint. In the example below,

```
skyscrapers.iloc[:, 1:-1]
```

the negative index (-1) represents the position of the last column. Because a slice excludes its right endpoint, the last column itself is excluded from the selection.

In [None]:
skyscrapers.iloc[:, 1:-1]

Yet another way to do this is with the `.drop()` method. First, we'll use the `.columns` attribute to list all the columns.

In [None]:
skyscrapers.columns

**Question 4.** Use the `.drop()` method to remove both the first and last columns from the `DataFrame` and assign the result to `df`.

**Note:** Be sure to use -1 for the last column.

In [None]:
df = skyscrapers.drop(columns=[..., ...])
df.head()

## Review 

Previously, we used R to perform some [**data moves**](https://escholarship.org/uc/item/0mg8m7g6). Specifically, we

- added a column using data from the `status.started` and `status.completed` columns

- generated a distribution showcasing the count of skyscrapers in each country

- created a new dataframe containing only the countries with over 10 skyscrapers

- converted a string from the `height` column into a numeric data type 


Here's an equivalent Python code snippet accomplishing the same task.

In [None]:
# Returns the status.completed as a Series and
# saves it to an object named completed
completed = df['status.completed']

# Use the .head method to show the first 5 observations
completed.head()

In [None]:
# Returns the status.started as a Series and 
# saves it to an object named started
started = df['status.started']

# Use the .head method to show the first 5 observations
started.head()

In [None]:
# Calculates the duration by doing elementwise
# arithmetic and saves the results to an 
# object named duration
duration = completed - started 

# Use bracket notation [ ] (i.e., slicing)
# to show the first 5 observations
duration[:5]

In [None]:
df['duration'] = df['status.completed'] - df['status.started']

df.head()

The number of skyscrapers in each country can be found by using the command

```
df['country'].value_counts()
```

where `df['country']` selects the column and `.value_counts()` is a `Series` method that tallies the frequency of each country name. 
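As a minimal sketch of what `.value_counts()` returns, the example below uses a short made-up `Series`; the name `toy_countries` and its values are assumptions for illustration only.

```python
import pandas as pd

# A made-up Series of country names (hypothetical data)
toy_countries = pd.Series(['China', 'Japan', 'China', 'China', 'Japan', 'Qatar'])

# Tally how often each distinct value appears,
# sorted from most to least frequent
counts = toy_countries.value_counts()

print(counts['China'])  # 3
print(counts.index[0])  # China
```

The result is itself a `Series`: the distinct values form the index and the tallies are the values.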

**Question 5.** Use the `.value_counts()` method to show the frequency of each country name.

In [None]:
...

Then we used the **filter** [**data move**](https://escholarship.org/uc/item/0mg8m7g6) to create a new dataframe containing only the countries with over 10 skyscrapers. 

**Question 6.** Filter the `df` dataframe to include only rows where the country is China.

In [None]:
# Filter by country China
...

Below are some more examples of how we do this in Python using `pandas`.

In [None]:
# Filter by country China and
# United States of America
(...) | (...)

In [None]:
# Filter by country China, United States of America,
# United Arab Emirates, and United Arab Emarites

# The backslash is used for line continuation, allowing a 
# single statement to span multiple lines, which can enhance 
# code readability, especially for long lines of code. 

(df['country'] == 'China') | \
(df['country'] == 'United States of America') | \
(df['country'] == 'United Arab Emirates') | \
(df['country'] == 'United Arab Emarites')

We could also use the `.query` method

In [None]:
q = "country == 'China' or \
     country == 'United States of America' or \
     country == 'United Arab Emirates' or \
     country == 'United Arab Emarites'"
df.query(q)

or we could use the `.isin` `pandas` `Series` method. For example,

```
df['country'].isin(countries)
```

where `df['country']` is the column from the dataframe, `.isin` is the method, and `countries` is the list of countries we want to keep.
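The three filtering approaches described in this section can be compared side by side. The sketch below does so on a small made-up `DataFrame`; the names `toy`, `floors`, and `wanted` are assumptions for illustration and do not come from the skyscraper data.

```python
import pandas as pd

# A made-up DataFrame (hypothetical data)
toy = pd.DataFrame({
    'country': ['China', 'Japan', 'Qatar', 'China'],
    'floors': [101, 60, 55, 88],
})

# 1. Combine comparisons with | (each comparison needs parentheses)
mask_or = (toy['country'] == 'China') | (toy['country'] == 'Qatar')

# 2. Write the condition as a string for .query
by_query = toy.query("country == 'China' or country == 'Qatar'")

# 3. Test membership in a list with .isin
wanted = ['China', 'Qatar']
mask_isin = toy['country'].isin(wanted)

# All three approaches keep the same rows
print(toy[mask_or].equals(by_query))        # True
print(toy[mask_or].equals(toy[mask_isin]))  # True
```

With only two values the approaches are about equally readable; as the list of values grows, `.isin` tends to stay the shortest.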

In [None]:
# A list of countries
countries = ['China', \
             'United States of America', \
             'United Arab Emirates', \
             'United Arab Emarites']

# A Boolean mask using the .isin method
mask = ...

# Print the df dataframe using the Boolean mask
df[mask]

Now we can assign the filtered dataframe to a new object named `dat`.

**Note:** The `.copy()` `pandas` dataframe method is used to create a copy of `df` to ensure that the new `DataFrame` `dat` is a completely separate object from the original `df`. You can read more about this process [**here**](https://docs.google.com/document/d/1cWeXQw9a9uApNaPc4zoShqU-R_X8tPQ0MouSrRsncRY/edit?usp=sharing).

In [None]:
dat = df[mask].copy()
dat.head()
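To illustrate why the copy matters, here is a minimal sketch on made-up data; the names `original`, `subset`, and `tall` are assumptions for illustration. Changes made to the copy do not reach the original, and copying also avoids the `SettingWithCopyWarning` that some versions of `pandas` emit when a filtered dataframe is modified.

```python
import pandas as pd

# A made-up DataFrame (hypothetical data)
original = pd.DataFrame({'country': ['China', 'Japan', 'China'],
                         'floors': [101, 60, 88]})

# Filter the rows and make an independent copy
subset = original[original['country'] == 'China'].copy()

# Adding a column to the copy leaves the original untouched
subset['tall'] = subset['floors'] > 100

print('tall' in subset.columns)    # True
print('tall' in original.columns)  # False
```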

## Data Cleaning (Continued)

Data cleaning is the process of preparing data for analysis by identifying and correcting errors, inconsistencies, and inaccuracies in the data. This includes removing duplicates, correcting typos, handling missing values, and ensuring that the data is in a consistent format. A specific component of data cleaning is called "data parsing" or "string parsing".


## String Parsing

String parsing is the process of analyzing a string of characters, extracting relevant information, and converting it into a format suitable for further processing. This often involves separating the string into components based on specific patterns or delimiters, removing unwanted characters, and transforming the data into a different type, such as converting a numeric string into an integer or decimal number.
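The steps named above (separating on a delimiter, removing unwanted characters, and converting the type) can be sketched on a made-up string. The string `'Weight: 1,250 kg'` and the variable names are assumptions for illustration; they are not taken from the dataset.

```python
# A made-up string with a label, a number, and a unit (hypothetical)
raw = 'Weight: 1,250 kg'

# Separate the string into pieces wherever there is a space
pieces = raw.split(' ')        # ['Weight:', '1,250', 'kg']

# Pick the piece that holds the number
number_text = pieces[1]        # '1,250'

# Remove the thousands separator, then convert to a number
weight = float(number_text.replace(',', ''))

print(weight)  # 1250.0
```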

Suppose we want to construct a histogram and a boxplot for the distribution of heights. The issue we face is that the `height` column is formatted as a string and contains two distinct units of measurement: meters and feet.

Let's tackle this issue one step at a time.

We'll begin by addressing a single value to establish a correction method. Once successful, we'll extend this technique to the entire column. 

Display the first value in the `height` column using the `dat` dataframe.

In [None]:
s = ...
s

Now, we'll use the `.split` string method to split the string into individual substrings wherever there are spaces, which results in a data structure known as a list. If you want to learn more about string methods in Python, [**read the Geeks for Geeks webpage**](https://www.geeksforgeeks.org/python-string-methods/).

The method call `s.split(' ')` will return the following output (which is a list):

```
['528', 'm', '/', '1,732', 'ft']
```

In [None]:
s.split(' ')

**Question 7.** Access the next-to-last element in the list.

In [None]:
...

**Question 8.** Use the `.replace` string method to remove the comma (`,`) from the string.

In [None]:
...

**Question 9.** Use the `float` function to coerce the string into a float.

In [None]:
float(...)

## User-defined Functions in Python

A user-defined function in Python is a function that a programmer creates to perform a specific action or to process data in a way that is not provided by Python's built-in functions. User-defined functions allow for code reusability, better organization, and more readable and maintainable code. They enable you to encapsulate a task into a single unit of code that can be used repeatedly throughout your program.

Here's a basic structure of a user-defined function in Python:

```
def function_name(parameters):
    """
    Docstring explaining the function's purpose and usage.
    """
    # Function body
    # Perform actions and optionally return a value
    return result
```
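As a concrete instance of this structure, here is a minimal sketch of a small function. The name `feet_to_meters` is hypothetical and chosen only for illustration; it is not part of the assignment.

```python
def feet_to_meters(feet):
    """
    Converts a length in feet to meters.

    Parameters:
    feet (float): The length in feet.

    Returns:
    float: The length in meters.
    """
    # One foot is defined as exactly 0.3048 meters
    meters = feet * 0.3048
    return meters


# Call the function like any built-in function
print(round(feet_to_meters(100), 2))  # 30.48
```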

If you want to learn more about user-defined functions in Python [**read my conversation with ChatGPT**](https://docs.google.com/document/d/1hEBJqjZW7gFhC1N3ofNUZYj_6z6vh11jYDLYw3qhjZ4/edit?usp=sharing).

In [None]:
def string_to_feet(s):
    """
    Converts a height string to a float. The function expects the height to be 
    in a specific format, e.g., '528 m / 1,732 ft', where the actual height value 
    precedes the last space-separated segment.
    
    Parameters: 
    s (str): The string containing the height to be processed.
        
    Returns:
    float: The height extracted from the string, converted to float.
        
    Example:
    string_to_feet('528 m / 1,732 ft') returns 1732.0
    """
    feet = float(s.split(' ')[-2].replace(',', ''))
    return feet

The `.apply()` method in `pandas` is a powerful tool that allows you to apply a function along an axis of a `DataFrame` or to each value of a `Series`. It can be used for a wide range of operations, including data transformation, aggregation, and applying custom functions row-wise or column-wise in a `DataFrame`. 
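For a `Series`, `.apply()` calls the function once per value and collects the results in a new `Series`. The sketch below shows this with a hypothetical helper named `first_word` and a made-up `Series` named `names`; both are assumptions for illustration.

```python
import pandas as pd


def first_word(text):
    """Returns the part of a string before the first space."""
    return text.split(' ')[0]


# A made-up Series of strings (hypothetical data)
names = pd.Series(['Burj Khalifa', 'Shanghai Tower', 'Taipei 101'])

# .apply calls first_word once per value and
# collects the results in a new Series
first_words = names.apply(first_word)

print(first_words.tolist())  # ['Burj', 'Shanghai', 'Taipei']
```

Note that the function is passed by name, without parentheses, so that `pandas` can call it for each value.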

In [None]:
# Use the .apply method to call the function
# string_to_feet on the values in the height 
# column in the dat dataframe
dat['height'].apply(string_to_feet)

In [None]:
# Assign the Series to the variable ft
ft = dat['height'].apply(string_to_feet)

In [None]:
# Add the `ft` `Series` as a column to the `dat` dataframe
dat['ft'] = ...

# Display the first 5 rows
dat.head()

## Visualizations

In [None]:
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [12, 6] 
plt.rcParams['figure.dpi'] = 100

**Question 10.** Construct a histogram to visually represent the distribution of values in the `ft` column. 

In [None]:
...
plt.grid(False);

**Question 11.** Construct a histogram to visually represent the distribution of values in the `duration` column. 

In [None]:
...
plt.grid(False);

###  Construction Duration

Has the average time required to construct a skyscraper increased or decreased as time has progressed?

In [None]:
# The .groupby method is used to group skyscrapers
# by year, the ['duration'] notation is used to access
# the values in the duration column, and the .mean()
# Series method is used to find the mean for each group
# of years

# The mean duration for each year is returned as 
# a Series
md = ...

md

**Question 12.** Does anything stand out? What do you notice? What do you wonder?

_Click here to type your answer replacing this text._ 

**Question 13.** Construct a line chart to depict the average construction duration of skyscrapers on a yearly basis, using the entire span of years available in the dataset. 

In [None]:
...
plt.xticks(ticks=md.index, labels=md.index, rotation=270, fontsize=6);

**Question 14.** Construct a line chart to depict the average construction duration of skyscrapers on a yearly basis, excluding 1930 (the year that the Empire State Building was built). 

In [None]:
...
year_ticks = ...
year_tick_labels = ...
plt.xticks(ticks=..., labels=..., rotation=270, fontsize=6);

Now, we'll apply a similar analytical approach, this time concentrating on the heights of the skyscrapers, measured in feet. This will allow us to explore and understand variations in average skyscraper heights over time.

In [None]:
# The .groupby method is used to group skyscrapers
# by year, the ['ft'] notation is used to access
# the values in the ft column, and the .mean()
# Series method is used to find the mean for each group
# of years

# The mean height for each year is returned as 
# a Series
mh = dat.groupby('status.started')['ft'].mean()

mh
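The group-then-average pattern is easier to follow on a few rows where the means can be checked by hand. The sketch below uses a made-up `DataFrame`; the names `toy`, `year`, and `mean_ft` are assumptions for illustration.

```python
import pandas as pd

# A made-up DataFrame (hypothetical data)
toy = pd.DataFrame({
    'year': [2000, 2000, 2005, 2005, 2005],
    'ft': [900.0, 1100.0, 1200.0, 1500.0, 1800.0],
})

# Group the rows by year, select the ft column,
# and average the values within each group
mean_ft = toy.groupby('year')['ft'].mean()

print(mean_ft.index.tolist())  # [2000, 2005]
print(mean_ft.tolist())        # [1000.0, 1500.0]
```

The grouping column becomes the index of the resulting `Series`, which is why the index can later be reused for tick positions and labels.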

The line chart below depicts the average height (ft) of skyscrapers on a yearly basis, using the entire span of years available in the dataset. 

In [None]:
mh.plot(kind='line')
year_ticks = mh.index
year_tick_labels = mh.index
plt.xticks(ticks=year_ticks, labels=year_tick_labels, rotation=270, fontsize=6);

The line chart below depicts the average height (ft) of skyscrapers on a yearly basis, excluding 1930 (the year that the Empire State Building was built). 

In [None]:
mh[5:].plot(kind='line')
year_ticks = mh[5:].index
year_tick_labels = mh[5:].index
plt.xticks(ticks=year_ticks, labels=year_tick_labels, rotation=270, fontsize=6);

The line chart below depicts the average construction duration and average height (ft) of skyscrapers on a yearly basis, using the entire span of years available in the dataset. 

In [None]:
# Combine both the md and mh Series into a DataFrame
mdh = pd.DataFrame({'Duration': md, 'Height': mh})

mdh.plot(kind='line');
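When two `Series` are combined into a `DataFrame` through a dictionary, `pandas` matches their values by index label rather than by position. The sketch below shows this on two made-up `Series`; the names `duration`, `height`, and `combined` and all values are assumptions for illustration.

```python
import pandas as pd

# Two made-up Series indexed by year (hypothetical data)
duration = pd.Series([3.0, 4.0], index=[2000, 2005])
height = pd.Series([1000.0, 1500.0, 1700.0], index=[2000, 2005, 2010])

# Values are matched by index label, not by position
combined = pd.DataFrame({'Duration': duration, 'Height': height})

# 2010 has a height but no duration, so Duration is NaN there
print(combined.index.tolist())               # [2000, 2005, 2010]
print(combined['Duration'].isna().tolist())  # [False, False, True]
```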

The line chart below depicts the average construction duration and average height (ft) of skyscrapers on a yearly basis, excluding 1930 (the year that the Empire State Building was built). 

In [None]:
# Plot one on top of the other
fig, ax = plt.subplots(2, 1,)  # 2 rows, 1 column

# Plot for duration
mdh['Duration'][5:].plot(kind='line', ax=ax[0], color='blue', title='Duration')
md_year_ticks = md[5:].index
md_year_tick_labels = md[5:].index
ax[0].set_xticks(ticks=md_year_ticks)
ax[0].set_xticklabels(md_year_tick_labels, rotation=45)
ax[0].set_ylabel('Duration')

# Plot for height
mdh['Height'][5:].plot(kind='line',ax=ax[1], color='red', title='Height(ft)')
mh_year_ticks = mh[5:].index
mh_year_tick_labels = mh[5:].index
ax[1].set_xticks(ticks=mh_year_ticks)
ax[1].set_xticklabels(mh_year_tick_labels, rotation=45)
ax[1].set_ylabel('Height')

# Adjust vertical spacing between the subplots
# hspace value is a fraction of the average 
# axis height
plt.subplots_adjust(hspace=0.75);

**Question 15.** Does anything stand out? What do you notice? What do you wonder? Explain your response in fewer than 300 words.

_Click here to type your answer replacing this text._ 