# Assignment 04

## Due: See Date in Moodle

## This Week's Assignment

In this week's assignment, you'll be introduced to Jupyter Notebooks. You'll learn how to:

- navigate the Jupyter Notebook interface

- access values in a `pandas` `DataFrame`

## Guidelines

- Follow good programming practices by using descriptive variable names, maintaining appropriate spacing for readability, and adding comments to clarify your code.

- Ensure written responses use correct spelling, complete sentences, and proper grammar.

**Name:**

**Section:**

**Date:**

Let's get started!

## What is a Jupyter Notebook?

A Jupyter Notebook is an interactive coding environment that allows you to write and run code, display visualizations, and document your work with text, equations, and images, all in the same place. It's especially popular in data science because of its ability to combine code execution and explanation.

### Edit Mode vs. Command Mode 

Jupyter Notebook has a modal user interface. This means that the keyboard does different things depending on which mode the Notebook is in. There are two modes: **"Edit mode"** and **"Command mode"**. 

Edit mode allows you to type and modify the content within a cell, similar to a text editor.

Command mode lets you manage the notebook as a whole (e.g., adding, deleting, or moving cells), but it doesn't allow you to type directly into individual cells.

### The Kernel

The **kernel** is the program that runs the code inside your notebook and displays the results. In the top-right corner of your window, you'll see the name `(DSC201)` and a circle that indicates the status of your kernel. An empty circle (⚪) means the kernel is idle and ready to execute code, while a filled circle (⚫) means the kernel is busy running code.

### Cells

Jupyter Notebooks are made up of cells. As covered in lecture, there are two types of cells in a Jupyter Notebook: code cells and Markdown cells. 

- Code cells are where we write all of our Python code.

- Markdown cells allow us to write text, like the text you're reading right now. In Homework 1, you'll get a chance to learn a bit of basic Markdown.

### Code cells

Running a code cell executes all the code within it, and any output will appear directly below the cell. Notice the brackets `[ ]:` on the left side of the cells.

Before running the cell, this will be empty (`[ ]`). While the cell is running, you'll see `[*]`, indicating that the code is still processing. If the asterisk (`*`) remains for too long, it may mean the code is taking longer than expected, and you might need to interrupt the kernel (explained below). Once the cell has finished running, a number will appear inside the brackets, such as `[1]`, representing the order in which cells have been executed. The first cell you run will display a `1`, the second will display a `2`, and so on.

If your kernel becomes unresponsive, your notebook slows down significantly, or the kernel disconnects, you can try the following steps:

1. At the top of your screen, click **Kernel**, then **Interrupt Kernel**. Try running your code again.

1. If that doesn't help, click **Kernel**, then **Restart Kernel**. If you do this, you will have to run your code cells from the start of your notebook up until where you paused your work.

After you run a cell, the brackets will be populated with a number, indicating how many times you've executed code cells in this session, including the current one.

To run the code in a code cell, first click on that cell to activate it.  It'll be highlighted with a little green or blue rectangle.  Next, hold down the `shift` key and press `return` or `enter`. You could also click the "Run cell" button ( ▶| ) in the cell toolbar above.

**Note:** After you make changes to a text cell, don't forget to click the "Run cell" button ( ▶| ) at the top or hold down `shift` + `return` to view the changes.

### Errors

Whenever you write code, you'll make mistakes.  When you run a code cell that has errors, Python will sometimes produce error messages to tell you what you did wrong. Errors are okay; even experienced programmers make many errors.  When you make an error, you just have to find the source of the problem, fix it, and move on.

There is an error in the next cell. 

Run it and see what happens.

In [None]:
print("This line is missing something."

The error message you're seeing:

```python
  File "<ipython-input-1>", line 1
    print("This line is missing something."
                                           ^
SyntaxError: incomplete input
```

#### Explanation

1. This part indicates where the error occurred:

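   - `File "<ipython-input-1>"` identifies the cell that was run; the number is that cell's execution count. Newer versions of Jupyter display this as `Cell In[1]`.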
   
   - `line 1` refers to the first line within that cell.

1. The caret (`^`) is pointing to the end of the line, indicating the exact spot where Python expects additional input, in this case, a closing parenthesis `)`.

1. A `SyntaxError` means there is something wrong with the structure of your code, and Python cannot understand or execute it as written. 

1. The message "incomplete input" is telling you that Python reached the end of the code (or the end of the line) but didn't find what it expected to complete the `print()` function call. Here, the missing piece is the closing parenthesis `)`.

To fix the error, you simply need to add the closing parenthesis to complete the function call:

```python
print("This line is missing something.")
```

**Question 1.** Complete the input below to resolve the syntax error.

In [None]:
print("This line is missing something."

### Markdown cells

To edit an existing Markdown cell, simply double-click on it and make your changes. After editing, "run" the cell just like you would with a code cell to render the Markdown correctly.

**Question 2.** Create an ordered list of steps for conducting a simple data analysis process. Your list should include these four steps: Import the dataset, Clean the data, Perform exploratory data analysis, Communicate findings.

**Hint:** Click [here](https://www.markdownguide.org/basic-syntax/#ordered-lists) to learn what an ordered list is and how to create one in Markdown.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

**Note:** Make sure to review your response to the previous question to ensure it is properly formatted.

**Question 3.** You can format mathematical equations and expressions using Markdown with LaTeX. In the text cell below, write the LaTeX code to represent the equation for the normal (Gaussian) distribution curve.

**Hints:** 

- Click [here](https://assets.ctfassets.net/nrgyaltdicpt/4e825etqMUW8vTF8drfRbw/d4f3d9adcb2980b80818f788e36316b2/A_quick_guide_to_LaTeX__Overleaf_version.pdf) to learn how to properly format mathematical expressions in LaTeX $\left(\LaTeX \right)$.

- Click [here](https://en.wikipedia.org/wiki/Normal_distribution) to view the equation for the normal (Gaussian) distribution curve.
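
As an illustration of the syntax only (this is a different formula from the one the question asks for), the sample mean could be written as shown here. Wrapping the expression in double dollar signs displays it on its own line in a Markdown cell.

```latex
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
```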

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

**Note:** Make sure to review your response to the previous question to ensure it is properly formatted.

## Importing Python Modules

In Python, modules are similar to packages in R. Both are collections of code that provide additional functionality. 

- In Python you import modules to access their functionality. For example, you might use the `math` module for mathematical functions or `numpy` for numerical computing.
  
- In R, you load packages using `library()` to access functions and datasets. For example, `ggplot2` is an R package for data visualization, and `dplyr` is used for data manipulation.

Here’s how it works in Python:

```python
import numpy as np
np.array([1, 2, 3])
```

while in R it looks like:

```r
library(dplyr)
```
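
As a small illustration of the import pattern described above, the sketch below imports the `math` module and reaches a constant and a function through dot notation. The variable names are chosen for this example only.

```python
import math

# Access a constant from the module with dot notation.
circle_radius = 2
circle_area = math.pi * circle_radius ** 2
print(round(circle_area, 2))  # 12.57

# Call a function from the module: 5 * 4 * 3 * 2 * 1.
five_factorial = math.factorial(5)
print(five_factorial)  # 120
```

Everything a module provides is reached the same way: the module name (or its alias, such as `np`), a dot, and then the name you want.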

#### Importing the `math` Module in Python

To demonstrate how modules work in Python, let's import the `math` module and use it:

**Question 4.** Import the `math` module. Then use the square root function in the math module to find the square root of 25, save it to a variable, and finally print the value of the variable.

In [None]:
...

## `pandas`

First, we need to import `pandas`. Then, we'll load the `babynames.csv` file into a `pandas` `DataFrame` named `baby_names` for use throughout this notebook.

In [None]:
...

In [None]:
baby_names = ...

Now let's get a general overview of the data using the `.info()` method.

In [None]:
baby_names.info()

**Question 5.** Using the data sheet for the `babynames.csv` file and the output from the previous cell, write down three questions that you believe can be answered using this data.  

Use an ordered list to format your questions. Each question should be clear and relevant to the dataset.  

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

### Selecting Rows and Columns 

#### Column Selection Using `.loc`

To select a column from a `pandas` `DataFrame` by its label, a reliable and explicit approach is to use the `.loc` indexer. The general syntax is

```python
df.loc[row_label, column_label]
```
where a colon (`:`) indicates selection of all rows or columns.

For example, to select the `year` column from the `baby_names` `DataFrame`, use:

```python
baby_names.loc[:, 'year']
```

You can also slice across multiple columns. For instance,  
```python
baby_names.loc[:, 'name':]
```
selects the `name` column and all columns that follow.

While `.loc` is ideal for production code, it can be verbose in interactive use. A more concise option is the bracket notation:  
  
```python
baby_names['count']
```
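
The selections above can be tried on a small, made-up `DataFrame`. This is only a sketch: `toy_names` and its values are hypothetical, and the real `babynames.csv` file may have different columns or a different column order.

```python
import pandas as pd

# Hypothetical toy data; not taken from babynames.csv.
toy_names = pd.DataFrame({
    'year': [2000, 2000, 2001],
    'name': ['Ava', 'Liam', 'Mia'],
    'count': [120, 95, 80],
})

# Select one column by label; the result is a Series.
year_column = toy_names.loc[:, 'year']

# Slice from 'name' through the last column; the result is a DataFrame.
name_onward = toy_names.loc[:, 'name':]

# Bracket notation selects the same data as .loc for a single column.
count_column = toy_names['count']

print(list(name_onward.columns))  # ['name', 'count']
print(count_column.equals(toy_names.loc[:, 'count']))  # True
```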

#### Row Selection

To select a row by its label, use `.loc` with the row index. Here, the row "label" refers to the `DataFrame`'s index (which often serves as a primary key).  

**Example 1.** Select the first five values from the `name` column in the `baby_names` `DataFrame`.

**Note:** The range `0:4` starts at index `0` (the first row) and includes index `4` (the fifth row). This command returns a `pandas` `Series`.

In [None]:
baby_names.loc[0:4, 'name']

**Example 2.** Select the first five values from the `name` column in the `baby_names` `DataFrame`, this time as a one-column `DataFrame`.

Notice the difference between this method and the method in **Example 1**. Passing in `'name'` returns a `Series`, while passing in `['name']` returns a `DataFrame`.

In [None]:
baby_names.loc[0:4, ['name']]

**Note:** The `.loc` indexer selects rows based on the `pandas` row index, not the row's position in the `DataFrame`. Unlike standard Python slicing, `.loc[0:4]` includes the end label (`4`), so the selection includes the row with index `4`.
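
The two points above (the inclusive end label, and `Series` versus `DataFrame` results) can be checked on a small, made-up `DataFrame`. The data and variable names below are hypothetical and chosen for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the default integer index 0 through 5.
toy_names = pd.DataFrame({'name': ['Ava', 'Ben', 'Cy', 'Di', 'Eli', 'Fay']})

# Plain Python slicing excludes the end position: 4 items.
python_slice = list(toy_names['name'])[0:4]

# .loc slicing includes the end label: 5 rows (labels 0 through 4).
loc_series = toy_names.loc[0:4, 'name']    # a Series
loc_frame = toy_names.loc[0:4, ['name']]   # a one-column DataFrame

print(len(python_slice), len(loc_series))  # 4 5
print(type(loc_series).__name__, type(loc_frame).__name__)  # Series DataFrame
```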

#### Column Selection Using `.iloc`

Another key feature in `pandas` is `.iloc[]`, which allows you to slice a `DataFrame` based on row and column positions, rather than row index and column labels as with `.loc[]`. This is the primary difference between the two, and it’s important to understand when to use each one.

With `.iloc[]`, slicing follows standard Python behavior, meaning the end index is not included.  

**Note:** To remember this distinction, think of the _**i**_ in `.iloc` as standing for _integer-based indexing_.

Below, we have sorted the `baby_names` `DataFrame` by the `name` column. Notice that a row's position in the `DataFrame` does not necessarily match its index (the bold values in the far-left column). For example, the first row in the sorted `DataFrame` may not have index `0`.

Understanding this distinction is key to differentiating between `.loc[]` (which selects by index) and `.iloc[]` (which selects by position).  
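
A minimal sketch of this distinction, using a made-up four-row `DataFrame` (the names and variables are hypothetical and not taken from `babynames.csv`):

```python
import pandas as pd

# Hypothetical toy data; index labels are 0, 1, 2, 3 before sorting.
toy_names = pd.DataFrame({'name': ['Mia', 'Ava', 'Zoe', 'Liam']})
sorted_toy = toy_names.sort_values(by='name')
# Sorted order: Ava (label 1), Liam (label 3), Mia (label 0), Zoe (label 2).

# .iloc uses positions and excludes the end: positions 0 and 1.
by_position = sorted_toy.iloc[0:2, 0]

# .loc uses index labels and includes the end: every row from label 0
# through label 2, in the sorted order.
by_label = sorted_toy.loc[0:2, 'name']

print(list(by_position))  # ['Ava', 'Liam']
print(list(by_label))     # ['Mia', 'Zoe']
```

The same slice, `0:2`, returns different rows because sorting changes each row's position but not its index label.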

**Example 3.**

In [None]:
sorted_baby_names = baby_names.sort_values(by = ['name'])
sorted_baby_names.head()

**Example 4.** Here’s an example of how to select the 2nd, 3rd, and 4th rows from the `name` column in the `sorted_baby_names` `DataFrame` using both `.iloc[]` and `.loc[]`. Pay close attention to the difference between the two now that the rows are sorted by name.

In [None]:
sorted_baby_names.iloc[1:4, 1]

Notice that using `.loc[]` with `1:4` gives different results, since it selects by index label rather than by position.

In [None]:
sorted_baby_names.loc[1:4, 'name']

**Question 6.** List the unique years included in the dataset.

In [None]:
...

**Question 7.** Selecting multiple columns is straightforward—simply provide a list of column names.

Use both bracket notation and `.loc[]` to select the `name` and `year` columns (in that order) from the `baby_names` `DataFrame`, then display only the first five rows.

In [None]:
...

In [None]:
...

## Filtering Data

### Filtering with Boolean arrays

Filtering is the process of refining data by removing unwanted or irrelevant information. When working with data, you will inevitably need to filter it—whether to remove missing values, eliminate unusual outliers, or focus on specific subgroups for analysis.  

When using compound expressions, be sure to group conditions with parentheses to ensure proper evaluation.  

**Example usage:**  

```python
df[df['column name'] < 5]
```

where `df` is the name of the `DataFrame`, `column name` is the name of the column, and `< 5` is the comparison. Only the rows whose value in that column is less than 5 are displayed.


For your reference, some commonly used comparison operators are given below.

Symbol   | Usage      | Meaning 
------   | ---------- | -------------------------------------
`==`     | `a == b`     | Does a equal b?
`<=`     | `a <= b`     | Is a less than or equal to b?
`>=`     | `a >= b`     | Is a greater than or equal to b?
`<`    | `a < b`      | Is a less than b?
`>`    | `a > b`      | Is a greater than b?
`~`        | `~p`         | Returns negation of p
`\|`   | `p \| q` | `p OR q`
`&`        | `p & q`      | `p AND q`
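
The operators in the table can be combined as in the sketch below. The `weather` `DataFrame` and its columns are made up for illustration and are unrelated to `babynames.csv`.

```python
import pandas as pd

# Hypothetical toy data, unrelated to babynames.csv.
weather = pd.DataFrame({
    'city': ['Oslo', 'Cairo', 'Lima', 'Pune'],
    'temp_c': [3, 35, 19, 31],
    'rainy': [True, False, False, True],
})

# Each comparison produces a Boolean Series, one value per row.
is_hot = weather['temp_c'] > 30

# Wrap each condition in parentheses before combining with & or |.
hot_and_rainy = weather[(weather['temp_c'] > 30) & (weather['rainy'])]
cold_or_rainy = weather[(weather['temp_c'] < 5) | (weather['rainy'])]

# ~ negates a Boolean Series.
not_hot = weather[~is_hot]

print(list(hot_and_rainy['city']))  # ['Pune']
print(list(cold_or_rainy['city']))  # ['Oslo', 'Pune']
print(list(not_hot['city']))        # ['Oslo', 'Lima']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. Without them, `weather['temp_c'] > 30 & weather['rainy']` would be read as `weather['temp_c'] > (30 & weather['rainy'])`.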

**Question 8.** Filter the `baby_names` DataFrame to create a new `DataFrame` that includes only the names registered in the year you were born.

Then display the first 5 rows.

In [None]:
...

**Question 9.** How many births were registered in your birth year? What is the distribution of male and female births for that year?

In [None]:
...

In [None]:
...

**Question 10.** Write a summary of the output from **Question 8** and **Question 9** in a paragraph. Compare the birth data from your birth year with your instructor's birth year, highlighting any similarities or differences. Consider possible reasons for these differences, such as historical trends, cultural shifts, or population growth.

_TYPE YOUR ANSWER HERE REPLACING THIS TEXT_

## Submission

Make sure that all cells in your assignment have been executed to display all output, images, and graphs in the final document.

**Note:** Save the assignment before proceeding to download the file.

After downloading, locate the `.ipynb` file and upload **only** this file to Moodle. The assignment will be automatically submitted to Gradescope for grading.