# Denison CS181/DA210 SW Lab #4 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [None]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

---

## Part A: Data Frame - Creation

### Creation from Native Data Structure

You can create a `DataFrame` given a variety of different 2D data representations.  For example, we can use our hard-coded DoL snippet from `topnames.csv`.  (Note that it is customary to refer to `pandas` as `pd`.)

In [None]:
import pandas as pd

topnamesDoL = {'year':  [2018, 2018, 2017, 2017, 2016, 2016],
               'sex':   ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
               'name':  ['Liam', 'Emma', 'Liam', 'Emma', 'Noah', 'Emma'],
               'count': [19837, 18688, 18798, 19800, 19117, 19496]}

topnames = pd.DataFrame(topnamesDoL)

topnames

In the display of this data frame, above, we see that the columns are labeled, as are the rows.  By default, the row labels take values 0, 1, 2, ....

The row labels are also called the _index_ of the data set.
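As a quick illustration of the default row labels, the sketch below builds a tiny one-column frame (the variable name `small` is made up for this example) and inspects its index.

```python
import pandas as pd

# Hypothetical three-row frame, used only to show the default row labels
small = pd.DataFrame({'name': ['Liam', 'Emma', 'Noah']})

# With no index specified, pandas labels the rows with consecutive integers
print(small.index)
print(list(small.index))
```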

Similarly to creating a `DataFrame` from a DoL, we can do so from an LoL or LoD.  For an LoL, we need to specify the columns:

In [None]:
topnamesLoL = [[2018, 'Male', 'Liam', 19837],
               [2018, 'Female', 'Emma', 18688],
               [2017, 'Male', 'Liam', 18798],
               [2017, 'Female', 'Emma', 19800],
               [2016, 'Male', 'Noah', 19117],
               [2016, 'Female', 'Emma', 19496]]
columns = ['year', 'sex', 'name', 'count']

topnames = pd.DataFrame(topnamesLoL, columns=columns)

topnames
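The LoD case mentioned above is not shown in the cells, so here is a minimal sketch of it. It assumes an LoD holds one dictionary per row, keyed by column name. The names `topnamesLoD` and `topnames_from_lod` are made up for this example, and the rows are the first three from the snippet above.

```python
import pandas as pd

# List of dictionaries (LoD): one dictionary per row, keyed by column name
topnamesLoD = [
    {'year': 2018, 'sex': 'Male',   'name': 'Liam', 'count': 19837},
    {'year': 2018, 'sex': 'Female', 'name': 'Emma', 'count': 18688},
    {'year': 2017, 'sex': 'Male',   'name': 'Liam', 'count': 18798},
]

# No columns= argument is needed here: the dictionary keys supply the labels
topnames_from_lod = pd.DataFrame(topnamesLoD)

print(topnames_from_lod.columns.tolist())
print(len(topnames_from_lod))
```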

---

### Creation from a CSV file

If you have a function that can read a CSV file and convert it to a DoL, LoL, or LoD representation, you could use that to create a `DataFrame`, as discussed above.  However, `pandas` provides some handy functionality to create a `DataFrame` directly from a CSV file using the `read_csv()` function.

In [None]:
filepath = os.path.join(datadir, "topnames.csv")
topnames0 = pd.read_csv(filepath)

topnames0.head()

If we know that a given column (or set of columns) should be the index, we can specify that when parsing the CSV using the `index_col` parameter.

In [None]:
filepath = os.path.join(datadir, "topnames.csv")
topnames = pd.read_csv(filepath, index_col=["year", "sex"])

topnames.head()
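To see the effect of `index_col` without needing the file on disk, the sketch below parses a small in-memory CSV through `io.StringIO`. The CSV text is a stand-in that reuses the column names and a few rows from the Part A snippet. It is an assumption for illustration, not the actual contents of `topnames.csv`.

```python
import io
import pandas as pd

# Stand-in for topnames.csv (assumed layout: same four column names)
csv_text = """year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
"""

# Default parse: all four fields become columns, rows are labeled 0, 1, 2, ...
df_default = pd.read_csv(io.StringIO(csv_text))

# With index_col, year and sex become a two-level row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=["year", "sex"])

print(df_default.columns.tolist())   # year and sex are ordinary columns
print(df_indexed.columns.tolist())   # year and sex are no longer columns
print(list(df_indexed.index.names))  # they are now the index levels
```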

In the previous examples, we have used the `head()` method to return, by default, the first 5 rows of data.  We could specify `n` rows by providing `n` as a parameter.

In [None]:
topnames.head(10)

Similarly, we can view the last `n` rows of the data frame using `tail(n)`.

In [None]:
topnames.tail(4)
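The behavior of `head()` and `tail()` is easy to check on a frame whose contents are predictable. The sketch below uses a made-up ten-row frame (`numbers`, not part of the lab data) whose single column `n` holds 0 through 9.

```python
import pandas as pd

# Hypothetical ten-row frame: column n holds 0, 1, ..., 9
numbers = pd.DataFrame({"n": range(10)})

first_default = numbers.head()   # no argument: first 5 rows
first_three = numbers.head(3)    # first 3 rows
last_two = numbers.tail(2)       # last 2 rows, original row labels kept

print(first_default["n"].tolist())
print(first_three["n"].tolist())
print(last_two["n"].tolist())
```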

---

## Part B: Data Frame - Basic Access

Now that we have a `DataFrame` object, we can view relevant metadata.  For example, we can find the number of rows in the `DataFrame`:

In [None]:
len(topnames)

Alternatively, we can access both the row and column dimensions using the `shape` attribute of the `DataFrame` object:

In [None]:
# Look at the shape when the index is *not* specified
topnames0.shape

In [None]:
# Look at the shape when the index *is* specified
topnames.shape

The number of columns reported by `shape` depends on how many columns were used for the index: columns that make up the index are not counted as data columns.  We can get more information about these columns using the `info()` method.

In [None]:
# Get info about the data frame without indices specified
topnames0.info()

In [None]:
# Get info about the data frame with the (year, sex) index specified
topnames.info()

In the examples above, notice that one has a `RangeIndex` and the other has a `MultiIndex` (because we specified two columns in the index).
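The same contrast can be reproduced on the small hard-coded snippet. This sketch assumes the index is set after construction with `set_index()`, which here returns a new frame, rather than at parse time with `index_col`. The names `plain` and `indexed` are made up for this example.

```python
import pandas as pd

# Four rows from the Part A snippet, with default 0, 1, 2, ... row labels
plain = pd.DataFrame({'year':  [2018, 2018, 2017, 2017],
                      'sex':   ['Male', 'Female', 'Male', 'Female'],
                      'name':  ['Liam', 'Emma', 'Liam', 'Emma'],
                      'count': [19837, 18688, 18798, 19800]})

# Move year and sex into the index; set_index returns a new frame here
indexed = plain.set_index(['year', 'sex'])

print(plain.shape, type(plain.index).__name__)
print(indexed.shape, type(indexed.index).__name__)
```

The row count is the same in both frames. Only the column count changes, because `year` and `sex` have moved into the index.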

We can get the column labels using the `columns` attribute:

In [None]:
# Get column names: no index specified
topnames0.columns

In [None]:
# Get column names: (year, sex) index
topnames.columns

We can inspect the indices using the `index` attribute:

In [None]:
# Get index info: no index specified
topnames0.index

If we have specified an index, the `index` attribute lists every combination.  (This should correspond to every combination of independent variables for Tidy Data.)

In [None]:
# Get index info: (year, sex) index
topnames.index

---

## Part C: Data Frame - Try it yourself

In the next couple of weeks, we'll work with a new dataset of country-based indicators, such as population (`pop`), gross domestic product (`gdp`), and life expectancy (`life`).

**Q1** Use the `pandas` module to load the data from `indicators2016.csv` in the `datadir` directory into a `DataFrame` object called `indicators2016`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Q2** This is a big dataset.  Write an expression to display the first 8 rows of data.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Q3** We can use the `code` column to represent the row labels (the _index_) of this dataset.  Re-read in the data in `indicators2016.csv`, but this time use `code` as the index.  Again, display the first 8 rows.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
indicators2016.info()

**Q4** In a single assignment line, using an attribute of the data frame object, assign to `nrows` and `ncols` the number of rows and columns in the `indicators2016` data set.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

print(nrows, ncols)

> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: How many columns are in the `indicators2016` dataset?  Which correspond to independent variables?  Which correspond to dependent variables?  Finally, in the information listed above, why are the count values different for each column?

---

## Part D: Sorting a DataFrame

A `DataFrame` keeps its rows in the order in which they were created or read.  The `sort_index()` method sorts the rows by the index, in ascending order by default.  For `indicators2016`, that means sorting alphabetically by the 3-letter `code`.  For `topnames`, it sorts by `year` and then `sex`.

To sort in reverse order, we pass `ascending=False`.  (Adding `inplace=True` sorts the existing data frame rather than producing a new one.)

In [None]:
topnames_sorted = topnames.sort_index(ascending=False)
topnames_sorted.head(8) # most recent year first

If we want to sort on a different column, we could sort by that column's values using `sort_values()`:

In [None]:
# Find most popular names since 1880: sort 'count' largest->smallest
topnames_sorted.sort_values(by=["count"], inplace=True, ascending=False)

topnames_sorted.head(8)
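The two kinds of sorting can be compared side by side on a tiny frame. The sketch below uses three rows from the Part A snippet, deliberately placed out of order with `year` as the index. The names `males`, `by_index_desc`, and `by_count_desc` are made up for this example.

```python
import pandas as pd

# Three 'Male' rows from the Part A snippet, deliberately out of order
males = pd.DataFrame({'name':  ['Liam', 'Liam', 'Noah'],
                      'count': [18798, 19837, 19117]},
                     index=[2017, 2018, 2016])

# Sort by the row labels (the index), largest label first
by_index_desc = males.sort_index(ascending=False)

# Sort by the values in the count column, largest count first
by_count_desc = males.sort_values(by='count', ascending=False)

print(by_index_desc.index.tolist())
print(by_count_desc['count'].tolist())
print(males.index.tolist())   # without inplace=True, the original order is kept
```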

**Q5** Sort the `indicators2016` `DataFrame` by GDP, with highest GDP listed first.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# View highest-GDP countries in 2016
indicators2016_sorted.head(8)

**Q6** Sort the `indicators2016` `DataFrame` by life expectancy, with highest life expectancy listed first.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# View highest-life-expectancy countries in 2016
indicators2016_sorted.head(8)

**Q7** We can always re-sort the data using the index with `sort_index`.  Imagine you've forgotten whether you already set the index for the `indicators2016` `DataFrame`, so you try to set it again.

Why does the following give an error?

In [None]:
indicators2016_v2 = indicators2016.set_index(["code"], inplace=True)

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: Why does updating the index, above, cause an error?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

YOUR ANSWER HERE