# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button on the ribbon underneath the name of the notebook, which looks like ▶|, or hold down `Shift` and press `Return`.

Before you begin, run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc201_001_003_a8.ipynb")

**Name:** 

**Section:** 

**Date:**

## This Week's Assignment

In this week's assignment, you'll learn how to:

- slice data that is stored in a `pandas` `DataFrame`.

- write a user-defined function.

- manipulate string data.

- apply a user-defined function to a dataframe column.

Let's get started!

## Questions about Skyscrapers

In a previous lesson you were asked to write down two questions you thought could be answered by exploring the `skyscrapers` dataset. 

[Here is a compilation of the questions](https://docs.google.com/document/d/1u_fY4LZr1tqrHQGlaC3g50xg6PVyk_Lbz54Ne_uAq8I/edit?usp=sharing) from both sections (001 and 003) of Intro to R/Python for Data Science that I teach.

Today in class we will answer a few of these questions and explicitly mention the associated data move(s). As a reminder, here is a list of the core data moves we discussed in last week's class:

1. filtering

1. grouping

1. summarizing

1. calculating

1. merging/joining

1. making hierarchy

and [here is the paper](https://escholarship.org/uc/item/0mg8m7g6) from which the data moves are referenced.

**Question 1.** Let's import `pandas` as `pd`, then use `pd.read_csv` to load the `skyscrapers.csv` file into a `pandas` `DataFrame` named `skyscrapers`.

**Note:** The `skyscrapers.csv` file is located in the data folder.

In [None]:
# Import the pandas module as pd
import ... as ...

# Read the .csv file
skyscrapers = ...

In [None]:
grader.check("q1")

<!-- BEGIN QUESTION -->

**Question 2.** Display the first 10 rows of the `skyscrapers` dataframe.

In [None]:
# Display the first 10 rows in the dataframe
...

<!-- END QUESTION -->

**Question 3.** What is the average height for skyscrapers made with **concrete**? Save this value to `avg_height_concrete`.

In [None]:
avg_height_concrete = ...
avg_height_concrete

In [None]:
grader.check("q3")

This is one way we can find the average height for buildings made from concrete. But what about the other materials? Let's check to see how many different types of materials are in the data set.

<!-- BEGIN QUESTION -->

**Question 4.** What are all the types of materials in the dataset and what is the distribution of the different materials in the dataset?

In [None]:
...

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 5.** There are three types of materials. We want the average height for each category. To accomplish this task, which data move(s) do we need to use?

_Type your answer here, replacing this text._

<!-- END QUESTION -->

### `.groupby`

`.groupby` is a simple concept: we split the rows into groups based on the categories in a column, then apply a function to each group. Simple as it is, it’s an extremely valuable technique that’s widely used in data science.

Click [here](https://www.geeksforgeeks.org/pandas-groupby/) to read more about the concept and to see an example.

Common aggregation functions that can be applied after groupby include:

- `sum()`: Sum of values in each group.

- `mean()`: Mean (average) of values in each group.

- `count()`: Count of values in each group.

- `max()`: Maximum value in each group.

- `min()`: Minimum value in each group.

- `agg()`: Apply custom aggregation functions to grouped data.

You can also group by multiple columns and apply aggregation functions to create more complex summaries of your data.

The `groupby` method is useful for tasks like data exploration, summarizing data, and creating pivot tables, making it an essential tool in data analysis with pandas.

**Source:** [ChatGPT generated response](https://chat.openai.com/share/b34a68b0-9e33-49eb-9acc-f25196291b03)
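
The aggregation functions listed above can be tried on a small made-up table before using them on real data. The sketch below assumes a hypothetical `fruit` dataframe with `kind`, `store`, and `price` columns (none of these come from `skyscrapers.csv`). It shows a single aggregation, several aggregations through `.agg()`, and grouping by two columns.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration (not the skyscrapers dataset)
fruit = pd.DataFrame({
    "kind":  ["apple", "apple", "pear", "pear", "pear"],
    "store": ["north", "south", "north", "north", "south"],
    "price": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Split the rows by kind, then average the price within each group
mean_price = fruit.groupby("kind")["price"].mean()

# Several aggregations at once with .agg()
price_summary = fruit.groupby("kind")["price"].agg(["count", "min", "max"])

# Group by two columns: one mean per (kind, store) pair
mean_by_kind_store = fruit.groupby(["kind", "store"])["price"].mean()

print(mean_price)
print(price_summary)
print(mean_by_kind_store)
```

Each result is indexed by the group label(s), so `mean_price["pear"]` looks up the average for one group.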

<!-- BEGIN QUESTION -->

**Question 6.** Use `.groupby()` to find the average height for each material type in the dataset.

In [None]:
...

<!-- END QUESTION -->

So what is `.groupby` actually doing? To see how this method works, let's start by making a `GroupBy` object.

Run the cell below.

In [None]:
grps = skyscrapers.groupby('material')
grps

The variable `grps` is a `GroupBy` object, which acts like a container that stores information about how the data is grouped. Let's explore some of the features/attributes of a `GroupBy` object.

**Example 1.** Show the size of each group

In [None]:
# Show the size of each group
grps.size()

**Example 2.** Show the contents of the concrete group

In [None]:
# Show the contents of the concrete group
grps.get_group('concrete')

**Example 3.** Show all the groups

In [None]:
# Show all the groups
grps.groups

**Example 4.** Show the keys in the groups

In [None]:
# Show the keys in the groups
grps.groups.keys()

**Example 5.** Show all the index values of the rows in the `concrete` group

In [None]:
# Show all the index values of the rows in the 'concrete' group
grps.groups['concrete']

We can also group on multiple categories. For example, let's group on `material` and `city`.

In [None]:
grps = skyscrapers.groupby(['material', 'city'])
grps

**Example 6.** Show the size of each group

In [None]:
# Show the size of each group
grps.size()

**Example 7.** Show all the groups

In [None]:
# Show all the groups
grps.groups

**Example 8.** Show the keys for all the groups

In [None]:
# Show the keys for all the groups
grps.groups.keys()

**Example 9.** Show the `concrete`, `New York City` group

In [None]:
# Show the 'concrete' 'New York City' group
grps.get_group(('concrete', 'New York City'))

<!-- BEGIN QUESTION -->

**Question 7.** Create a dataframe that contains only the concrete skyscrapers in New York City that are taller than 200 meters.

**Note:** To earn all the points you must use a Boolean mask.

In [None]:
...

<!-- END QUESTION -->

## What's in a name?

Another fun question mentioned by several of your classmates was _"What are the lengths of the skyscraper names?"_. In addition, a few students asked _"Which skyscraper has the longest name?"_. Names are important and can provide a lot of insight into relevant events happening in society (as we mentioned last week with our exploration of the `babynames` dataset).

Suppose we want to know which skyscraper has the longest name. How can we accomplish this task programmatically? We aren't going to count the letters in each name by hand. So what can we do?

<!-- BEGIN QUESTION -->

**Question 8.** Write down the steps that you think need to be taken to determine which skyscraper in the dataset has the longest name.

_Type your answer here, replacing this text._

<!-- END QUESTION -->

## Think like a data scientist

Now that you've had a moment to think about how you would determine which skyscraper in the dataset has the longest name, let's do it together (like a data scientist). Now, to be transparent, this is probably not how a professional data scientist would complete this task; nonetheless, after we implement our solution we'll know the answer to our question (and knowing is half the battle).

In [None]:
from IPython.display import IFrame

# YouTube video ID
video_id = 'pele5vptVgc'

# Embed the YouTube video with custom parameters
url = f'https://www.youtube.com/embed/{video_id}?rel=0'

IFrame(url, width=560, height=315)

First, let's see if we can count the letters in a single name. We'll use the second name (with index value 1) in the `skyscrapers` dataframe, 'Willis Tower'.

In [None]:
skyscrapers.name[1]

Here are two things to consider when you are working with text data.

- When processing text for tasks like searching, sorting, or text analysis, it's often helpful to have the text in a consistent case to avoid issues caused by case differences.

- When comparing strings, you want to ensure that the comparison is case-insensitive. For example, if you're searching for a specific word in a block of text, you might want to convert both the search term and the text to lowercase (or uppercase) to ensure that you find matches regardless of the case used. This ensures consistency in your comparisons.
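
The effect of case on a comparison can be seen in a minimal sketch. The search term and text below are hypothetical values chosen only for illustration.

```python
# Hypothetical search term and text, chosen only for illustration
search_term = "TOWER"
text = "Willis Tower"

# A case-sensitive check misses the match because "TOWER" is not "Tower"
found_case_sensitive = search_term in text

# Lowercasing both sides makes the check case-insensitive
found_case_insensitive = search_term.lower() in text.lower()

print(found_case_sensitive)    # False
print(found_case_insensitive)  # True
```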

Let's use lowercase.

In [None]:
skyscrapers.name[1].lower()

We don't want to count the spaces, so we need to remove them.

In [None]:
skyscrapers.name[1].lower().replace(' ', '')

We could loop over the characters (i.e. use a `for` loop) to count the letters, but the `len` command will do that for us so a `for` loop isn't necessary.

In [None]:
len(skyscrapers.name[1].lower().replace(' ', ''))
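
To check that `len` really does the same job as a loop, here is a small sketch that counts the characters both ways. It uses the name 'Willis Tower' written out as a plain string, so it runs without the dataframe. The variable names `cleaned` and `count` are just illustrative choices.

```python
# The example name from the text, written out as a plain string
name = "Willis Tower"
cleaned = name.lower().replace(" ", "")

# Manual count: add 1 for every character in the cleaned string
count = 0
for character in cleaned:
    count += 1

print(count)         # 11
print(len(cleaned))  # 11
```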

Now, if we can do it for one name, we can use the power of computing and programming to do the same thing to all the names. But how? We could use a `for` loop, but since the names we need are in a `pandas` `DataFrame`, it'll be more efficient and effective to use a `pandas` method called `.apply`.

### `.apply()`

ChatGPT says that the `.apply()` method in pandas is used to apply a function along the axis of a `DataFrame` or `Series`. It allows you to perform custom operations on the elements of the `DataFrame` or `Series`. What?!?!

**Source:** [ChatGPT generated response](https://chat.openai.com/share/95b23d87-bc6c-4442-bdf7-01dd87e471a4)

Let's see if we can simplify that language.

In simple terms, the `.apply()` method in pandas is used to apply a function to each element or column of a `DataFrame`. It allows you to perform a custom operation on your data, element by element, or column by column. This explanation seems to be a bit better ... perhaps. 

**Source:** [ChatGPT generated response](https://chat.openai.com/share/de27e63c-8a24-447a-907d-16f6c764e732)

What does ChatGPT mean by "apply a function", and what does this part of the response mean?

```
# Define a custom function to double a number
def double_number(x):
    return x * 2
```

What does the comment `# Define a custom function to double a number` mean? I thought ChatGPT was supposed to just give me the answers, but now I have more questions. Is this what ChatGPT means when it says **"Here are some general tips for using me effectively:"**

- Be clear and specific in your queries.

- Provide context when necessary.

- Review and verify the information I provide, especially if accuracy is critical.

- Experiment and iterate to get the desired results.

- Remember that while I can provide valuable information and generate text, I may not always have real-time data or up-to-date information, as my knowledge is based on data up to September 2021.

Ultimately, the best way to use me depends on your unique needs and goals. Feel free to ask questions or request assistance in any area where you think a language model can be helpful.

## The Human in AI

For a beginning programmer, ChatGPT may not be so helpful for solving complex problems. Not to say that the responses given from ChatGPT aren't correct, but if you don't know what a *custom* or *user-defined* function is then you'll need to continue your conversation (i.e. **iterate** until you get enough information to understand **how** the solution is being implemented so that you can make sure the solution is **accurate**).

Fortunately, you have a human (me) to help you understand how the ChatGPT solution works so you can apply (no pun intended) it in the future to solve similar problems.

### User-defined Function

A user-defined Python function is a custom function created by a programmer to perform a specific task or a set of tasks within a Python program. These functions are not built-in or part of base or standard Python; rather, they are defined by the user to meet their specific needs (like figuring out the length of each skyscraper name).

Below is an example:

```
def double_value(x):
    result = 2*x
    return result
```

Here's a breakdown of the components:

- `def`: This is the keyword that tells Python you are defining a function.

- `double_value`: This is the name *you* choose for *your* function. It should follow the rules for naming variables in Python. This is the name you'll use to call the function in your code.

- `x`: This is called a parameter (i.e. input). Parameters are optional and are listed inside the parentheses. They represent the input(s) or argument(s) that the function accepts.

- `result = 2*x`: This is called the function body. This is where you write the code that defines what the function does. It's a block of indented statements, typically using 4 spaces.

- `return result`: This is an optional part of the function definition. It specifies the value that the function should return when called. Not all functions need to return a value. In fact, you can have functions that don't use the `return` statement.
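
The pieces in the breakdown above can be seen in action by defining the function and then calling it. The second function, `say_hello`, is a hypothetical example added here to show a function with no parameters and no `return` statement.

```python
# Define the function from the example above
def double_value(x):
    result = 2 * x
    return result

# Hypothetical function with no parameters and no return statement
def say_hello():
    print("hello")

# Call each function by name, passing an argument where one is expected
doubled = double_value(21)
nothing = say_hello()

print(doubled)  # 42
print(nothing)  # None, because say_hello has no return statement
```

Defining a function does not run it. The code in the body only executes when the function is called.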

In the code cell below, we define a function called `name_length` that takes the name of a skyscraper as a parameter named `bldg`.

Run the code cell below to load the function into the notebook.

In [None]:
# User-defined function 
## function name: name_length
## parameter name: bldg (The name of the skyscraper)
## parameter object type: The name of the skyscraper as a string

def name_length(bldg):
    
    ## Find the length of the skyscraper name and save the value
    ## to the variable length
    
    length = len(bldg.lower().replace(' ', ''))
    
    ## Return the variable length - the number of letters in
    ## the skyscraper's name
    
    return length

Now let's test our function with a single value to make sure we get the results we expect. 

**Note:** This is a common practice in data science and computer science/programming.

In [None]:
name_length(skyscrapers.name[1])

Looks like our function works!

### Applying the Function to the Data

Now that we are sure that our function works for one name, we can apply it to every name in our dataframe. 

We already know that the column of a dataframe is a `Series`. We can select the column that contains the data we want to work with, then apply our function, `name_length`, to each name in the column (`Series`). Meaning, each name in the column (`Series`) will be used as the parameter (`bldg`) in the `name_length` function.

For example, if we take the name 'Willis Tower', it will be assigned to the parameter variable `bldg` like this: `bldg = 'Willis Tower'`. Then, in the body of the function, everywhere we see the variable `bldg`, the value of the variable will be `'Willis Tower'`.

This is what's actually happening in the body of the function 

```
length = len('Willis Tower'.lower().replace(' ', ''))
``` 

Now the value of `length` is 11, so when we execute the **return** statement, the number 11 will be returned.

```
return 11
```

This is **how** we use the `.apply` method and call the function to execute on the `skyscrapers.name` `Series`.

```
skyscrapers.name.apply(name_length)
```

Type this into the blank code cell below, run the cell and verify the output.

What do you notice? What do you think is the next step? Right now we have the lengths of the names, but the correspondence between building and name length isn't clear.
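
To see what `.apply` does without loading the data, here is a minimal sketch on a small `Series`. The three names are hypothetical (they are not taken from `skyscrapers.csv`), and the function is a shortened copy of `name_length`.

```python
import pandas as pd

# Shortened copy of the name_length function defined earlier
def name_length(bldg):
    return len(bldg.lower().replace(" ", ""))

# Hypothetical names, not taken from skyscrapers.csv
names = pd.Series(["Alpha Tower", "Big Blue Building", "Zed"])

# .apply calls name_length once per element and collects the results
lengths = names.apply(name_length)

# The same work written as an explicit for loop
lengths_loop = []
for n in names:
    lengths_loop.append(name_length(n))

print(lengths)
print(lengths_loop)  # [10, 15, 3]
```

The result of `.apply` keeps the index of the original `Series` (0, 1, 2 here), which is what lets each length be matched back to its name.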

### Adding a `Series` to `DataFrame`

The output from the command `skyscrapers.name.apply(name_length)` is a `Series`. To check that, run the cell below.

In [None]:
type(skyscrapers.name.apply(name_length))

If we add this `Series` back to the `skyscrapers` dataframe then we can view the correspondence between the actual building name and the length of the name. 

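Run the cell below to save the `Series` to a variable called `name_lengths`. Giving the `Series` its own name keeps the `name_length` function available if you rerun the cell.

In [None]:
name_lengths = skyscrapers.name.apply(name_length)

Now we can add the `name_lengths` `Series` to the `skyscrapers` dataframe as a column named `name_length`.

In [None]:
skyscrapers['name_length'] = name_lengths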
skyscrapers
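
The same pattern can be sketched on a tiny table. The `pets` dataframe and the `n_letters` column below are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe, made up for illustration
pets = pd.DataFrame({"name": ["Rex", "Mittens", "Bo"]})

# A Series computed from the name column shares the dataframe's index
n_letters = pets["name"].apply(len)

# Assigning to a new column label lines the values up by index
pets["n_letters"] = n_letters

print(pets)
```

Because the `Series` and the dataframe share an index, each length lands in the same row as the name it was computed from.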

### Are we there yet?

Finally, we can select the columns we want from the `skyscrapers` dataframe that will help us answer our question. There are multiple ways you can do this. Below are a few examples:

- Using **.iloc**

```
skyscrapers.iloc[:, [0, 5]]
```

- Using **.loc**

```
skyscrapers.loc[:, ['name', 'name_length']]
```

- Using square brackets `[ ]`

```
skyscrapers[['name', 'name_length']]
```
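
As a rough sketch of how the three forms relate, the example below builds a hypothetical three-column dataframe (the `name`, `kind`, and `weight` columns are made up) and selects the same two columns each way.

```python
import pandas as pd

# Hypothetical toy dataframe with three columns
df = pd.DataFrame({
    "name":   ["Rex", "Mittens", "Bo"],
    "kind":   ["dog", "cat", "dog"],
    "weight": [30, 4, 12],
})

# All rows, columns at integer positions 0 and 2
by_position = df.iloc[:, [0, 2]]

# All rows, columns with the labels 'name' and 'weight'
by_label = df.loc[:, ["name", "weight"]]

# A list of column labels inside square brackets
by_brackets = df[["name", "weight"]]

print(by_position.equals(by_label))   # True
print(by_label.equals(by_brackets))   # True
```

`.iloc` depends on the order of the columns, so it selects different data if the columns are rearranged, while the label-based forms do not.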

See if you can explain what each command is doing. Talk to a classmate or a friend (if you're not in class). Then use the blank cells below (with comments) to try each one.

**Example 10.** Using `.iloc`

In [None]:
# Using .iloc


**Example 11.** Using `.loc`

In [None]:
# Using .loc


**Example 12.** Using square brackets `[ ]`

In [None]:
# Using square brackets [ ]


Now, all that's left to do is to find the longest name. 

<!-- BEGIN QUESTION -->

**Question 9.** One way we can find the longest name is to order the `skyscrapers` dataframe by the `name_length` column. In the cell below, enter a command that will sort the values in the `name_length` column in descending order. The output should be a dataframe that contains two columns: **name** and **name_length**. Use the output from the commands in **Example 10**, **Example 11**, or **Example 12** as a guide.

**Hint:** Use an appropriate method based on the object type that can sort a dataframe by the values in a column. You may need to use additional parameters of the method to obtain the result in a format that is useful.

In [None]:
...

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file on the left side of the screen, right-click it, and select **Download**. You'll submit this .zip file for the assignment in Moodle to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)