# Run the cell below

To run a code cell (i.e., execute the Python code inside a Jupyter notebook), you can click the play button (▶|) on the ribbon underneath the name of the notebook, or press `Shift` + `Return`.

Before you begin, run the code cell below.

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("dsc201_001_003_a6.ipynb")

## This Week's Assignment

In this week's assignment, you'll learn how to:

- load data into a `pandas` `DataFrame`.

- access and manipulate data that is stored in a `pandas` `DataFrame`.

Let's get started!

**Name:** 

**Section:** 

**Date:**

## Python Basics

### Built-in Functions

- A function that is already available in a programming language/application that can be accessed by end users.

- Returns some value based on its arguments.

- `print`, `abs`, `max`, `min`, `pow`, `round`, etc.

In [None]:
abs(-3)

In [None]:
abs(2-5)

In [None]:
max(3, 10**2, 100.1)

### Nesting Functions

In [None]:
round(abs(1.6002-1.688), 4)

In [None]:
1.6002-1.688

In [None]:
abs(1.6002-1.688)

In [None]:
round(abs(1.6002-1.688), 4)

## `pandas`

Pandas is an open-source Python package that is widely used for data science, data analysis, and machine learning tasks. It is built on top of another library named `NumPy`, which provides support for arrays. Because we already know how to perform operations on `NumPy` arrays, we can apply the same kinds of operations to the columns of a `pandas` dataframe.

Pandas is a fast, powerful, flexible and (sometimes) easy to use open source data analysis and manipulation tool. Click the `Cheat Sheet` below to access the Data Wrangling with `pandas` [Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf).
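As a minimal sketch of how a dataframe column behaves like a `NumPy` array, the cell below uses a small made-up dataframe named `towers` (a hypothetical example, not the assignment's dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, made up for illustration only
towers = pd.DataFrame({
    "name": ["A", "B", "C"],
    "height_m": [100.0, 250.0, 400.0],
})

# .to_numpy() exposes a column's values as a NumPy array
heights = towers["height_m"].to_numpy()
print(type(heights))

# Arithmetic on a column is element-wise, just like on a NumPy array
height_ft = towers["height_m"] * 3.28084
print(height_ft)
```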

**Question 1.** Let's import `pandas` as `pd`, then use `pd.read_csv` to load the `skyscrapers.csv` file into a `pandas` `DataFrame` named `skyscrapers`.

**Note:** The `skyscrapers.csv` file is located in the data folder.
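For a feel of what `pd.read_csv` does, here is a small sketch that reads CSV text held in memory instead of a file. The column names and values are made up for illustration, and `io.StringIO` is used only so the example runs without any file on disk. When reading a real file, you would pass its path as a string instead.

```python
import io
import pandas as pd

# Hypothetical CSV content: the first line holds the column names
csv_text = """name,city,height_m
Tower A,Springfield,100.0
Tower B,Shelbyville,250.5
"""

# read_csv parses the text into a DataFrame with one row per line
towers = pd.read_csv(io.StringIO(csv_text))
print(towers)
```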

In [None]:
# Import the pandas module as pd
import ... as ...

# Read the .csv file
skyscrapers = ...

In [None]:
grader.check("q1")

<!-- BEGIN QUESTION -->

**Question 2.** Display the first 10 rows of the `skyscrapers` dataframe.

In [None]:
# Display the first 10 rows in the dataframe
...

<!-- END QUESTION -->

### Common `pandas` `DataFrame` Methods and Attributes

- `.head()`
- `.shape`
- `.info()`
- `.describe()`
- `.columns`
- `.sample()`

Apply the method or attribute in each **Example** to the `skyscrapers` dataframe.
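Before trying these on `skyscrapers`, it may help to see them on a tiny made-up dataframe. The sketch below uses a hypothetical `towers` dataframe. Note that `.shape` and `.columns` are attributes, so they take no parentheses.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
towers = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "floors": [10, 20, 30, 40, 50, 60],
})

print(towers.head())      # first 5 rows by default
print(towers.head(2))     # first 2 rows
print(towers.shape)       # attribute: (number of rows, number of columns)
towers.info()             # prints a summary of columns and data types
print(towers.describe())  # summary statistics for the numerical columns
print(towers.columns)     # attribute: the column labels
print(towers.sample(3))   # 3 random rows, drawn without replacement
```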

**Example 1.** `.head()`

In [None]:
# Returns the first 5 rows by default
# Can specify the number of rows by
# head(<number of rows to return>)


**Example 2.** `.shape`

In [None]:
# Returns the number of rows and 
# columns as a tuple


**Example 3.** `.info()`

In [None]:
# Returns information about the dataframe


**Example 4.** `.describe()`

In [None]:
# Returns basic statistical details
# from numerical columns 


**Example 5.** `.columns`

In [None]:
# Returns the names of the columns


**Example 6.** `.sample()`

In [None]:
# Returns a random sample of rows (one row by default)
# By default the sample is drawn without replacement
# Can specify the number of rows by
# sample(<number of rows to return>)


### Accessing columns from a `pandas` `DataFrame`

**Question 3.** Access the `name` column from the `skyscrapers` dataframe and return a `Series` type object.

In [None]:
# Returns the values from a column
# as a Series
...

## Series

A pandas `Series` is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

**Source:** [Geeks for Geeks](https://www.geeksforgeeks.org/python-pandas-series/)
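As a small illustration of "one-dimensional labeled array", the sketch below builds a `Series` by hand. The values, labels, and the name `floors` are made up for this example.

```python
import pandas as pd

# Hypothetical values with explicit labels (the index) and a name
floors = pd.Series([10, 20, 30], index=["A", "B", "C"], name="floors")
print(floors)

print(floors["B"])     # look up a value by its label
print(floors.iloc[0])  # look up a value by its position
```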

**Question 4.** Access a categorical column from the `skyscrapers` dataframe and return a `Series` type object. Save this to an object named `cat_column`.

In [None]:
# Returns the values from a column
# as a Series
cat_column = ...
cat_column

In [None]:
grader.check("q4")

**Question 5.** Access a numerical column from the `skyscrapers` dataframe and return a `Series` type object. Save this to an object named `num_column`.

In [None]:
# Returns the values from a column
# as a Series
num_column = ...
num_column

In [None]:
grader.check("q5")

Since a `Series` is a 1-dimensional `ndarray` with axis labels (including time series), we can pass it as an argument to `NumPy` functions.

Let's import `NumPy` and see!
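Here is a quick sketch of the idea on a made-up `Series` named `heights` (hypothetical values, not taken from the dataset). Functions that reduce to one number, such as `np.max`, return a single value, while element-wise functions, such as `np.sqrt`, return a new `Series`.

```python
import numpy as np
import pandas as pd

# Hypothetical values, made up for illustration only
heights = pd.Series([100.0, 225.0, 400.0], name="height_m")

# A reducing function returns a single number
tallest = np.max(heights)
print(tallest)

# An element-wise function returns a Series with the same labels
roots = np.sqrt(heights)
print(roots)
```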

In [None]:
import numpy as np

In [None]:
# Returns the values from a column
# as a Series
skyscrapers["height"]

In [None]:
# Returns the values from a column
# as a Series (dot notation)
skyscrapers.height

<!-- BEGIN QUESTION -->

**Question 6.** What is the average height for all skyscrapers in the dataset?

In [None]:
...

<!-- END QUESTION -->

### `Series` Attributes and Methods

**Attribute**
- An attribute of a `Series` is a property or characteristic that provides information about the `Series` itself.

- Attributes are accessed without parentheses, simply by referencing the attribute name.

- They provide metadata, statistics, or information about the Series but do not perform operations or transformations on the data within the `Series`.

- Examples of Series attributes include `dtype` (data type of the `Series`), `name` (name of the Series), `index` (index labels), and `shape` (shape of the Series).

- Accessing an attribute doesn't require invoking it as a function/method; you access it directly.

**Method**
- A method of a `Series` is a **function** that performs an operation or computation on the data within the `Series`.

- Methods are accessed with parentheses and often accept arguments or parameters to control their behavior.

- Methods manipulate or transform the data and return a result based on the operation performed.

- Examples of `Series` methods include `.sum()` (calculates the sum of elements), `.mean()` (calculates the mean), `.unique()` (returns unique values), and `.apply()` (applies a custom function to each element).

- Accessing a method requires invoking it as a function with parentheses.

**Source:** [ChatGPT generated response](https://chat.openai.com/share/5a13dec1-af6c-40aa-93b2-504d2b37b4a0)
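The difference can be tried out on a small made-up `Series`. In the sketch below, the name `floors` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical values, made up for illustration only
floors = pd.Series([10, 20, 20, 30], name="floors")

# Attributes: no parentheses
print(floors.dtype)  # data type of the values
print(floors.shape)  # (4,)
print(floors.name)   # floors

# Methods: called with parentheses
print(floors.sum())     # 80
print(floors.unique())  # the distinct values, in order of first appearance

# A method name without parentheses refers to the method itself,
# not to its result
print(callable(floors.sum))  # True
```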

In [None]:
skyscrapers.height.mean

In [None]:
skyscrapers.height.mean()

**Question 7.** What are the unique material types of skyscrapers in the dataset? Save this to an object named `material`.

In [None]:
material = ...
material

In [None]:
grader.check("q7")

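We can also access columns from a pandas dataframe and get back a pandas dataframe instead of a `Series`. To do this, pass a list of column names inside the square brackets (double square brackets).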
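To see the difference between the two kinds of access, here is a sketch on a made-up dataframe (`towers` and its columns are hypothetical). Single brackets give a 1-dimensional `Series`, while a list of column names gives a 2-dimensional `DataFrame`.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
towers = pd.DataFrame({"name": ["A", "B"], "floors": [10, 20]})

as_series = towers["floors"]   # single brackets
as_frame = towers[["floors"]]  # a list of column names

print(type(as_series).__name__, as_series.shape)  # Series (2,)
print(type(as_frame).__name__, as_frame.shape)    # DataFrame (2, 1)
```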

In [None]:
# Returns the values from a column
# as a DataFrame
skyscrapers[["material"]]

## Data Moves

Data moves is a phrase coined by the authors of the academic paper entitled ["Data Moves"](https://escholarship.org/uc/item/0mg8m7g6) by Tim Erickson, Michelle Wilkerson, William Finzer, and Frieda Reichsman (2019).

> When novices have access to large, rich datasets, a variety of different questions and interests emerge for them to pursue. Thus no matter how data are originally structured, novices often need to manipulate the data in order to work effectively (this has been called “data wrangling” in the Information Sciences; e.g., Kandel, et al. 2011). For example, a dataset may require filtering because it contains extraneous information unrelated to the students’ goals. It may need to be merged with other datasets. Or, students may wish to use the available data to create new groupings or construct new measures in order to conduct their analysis. Though such actions are common, they are not typically taught as an essential component of data analysis. We call these actions data moves.

The core data moves as defined by the authors are:

1. filtering

1. grouping

1. summarizing

1. calculating

1. merging/joining

1. making hierarchy

We will learn more about data moves in next week's class. If you are interested in reading the paper click [here](https://escholarship.org/uc/item/0mg8m7g6).
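As a rough preview, the sketch below shows one possible way some of these moves could look in `pandas`. The dataframes `towers` and `cities` are made up, and pairing each move with a particular `pandas` operation is an illustrative assumption, not something taken from the paper. Making hierarchy is not shown.

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
towers = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "city": ["X", "X", "Y", "Y"],
    "floors": [10, 40, 25, 55],
    "height_m": [40.0, 160.0, 100.0, 220.0],
})
cities = pd.DataFrame({"city": ["X", "Y"], "country": ["P", "Q"]})

# Filtering: keep only the rows that meet a condition
tall = towers[towers["floors"] >= 25]

# Calculating: build a new column from existing ones
towers["m_per_floor"] = towers["height_m"] / towers["floors"]

# Grouping and summarizing: one summary value per group
mean_height = towers.groupby("city")["height_m"].mean()

# Merging/joining: bring in columns from another table
merged = towers.merge(cities, on="city", how="left")

print(tall)
print(mean_height)
print(merged)
```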

<!-- BEGIN QUESTION -->

**Question 8.** Write 2 questions that you think can be answered by exploring the `skyscrapers` dataframe. What would you need to do to the dataframe (i.e., which _"data move"_) in order to answer each question?

For example, if my question is "How many skyscrapers were built in each year?", then I would need to collect all the different years and count how many skyscrapers were built in each year. So, I would need to group by year and count each label for that year. To group by year I would need the year to be a categorical label.

**Note:** To earn all the points for this question you do not need to know *all* the details required to complete your data move. If you get stuck, ask ChatGPT to explain it to you in plain, simple English. If you use ChatGPT, make a note of it in your response.

_Type your answer here, replacing this text._

<!-- END QUESTION -->



## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**

When done exporting, download the .zip file by `SHIFT`-clicking on the file name and selecting **Save Link As**. Or, find the .zip file in the left side of the screen and right-click and select **Download**. You'll submit this .zip file for the assignment in Moodle to Gradescope for grading.

In [None]:
# Save your notebook first, then run this cell to export your submission.
grader.export(pdf=False)