# Denison CS181/DA210 SW Lab #4 - Step 1

Before you turn this problem in, make sure everything runs as expected. This is a combination of **restarting the kernel** and then **running all cells** (in the menubar, select Kernel$\rightarrow$Restart And Run All).

Make sure you fill in any place that says `# YOUR CODE HERE` or "YOUR ANSWER HERE".

---

In [1]:
import os
import os.path
import pandas as pd

datadir = "publicdata"

---

## Part A: Data Frame - Creation

### Creation from Native Data Structure

You can create a `DataFrame` given a variety of different 2D data representations.  For example, we can use our hard-coded DoL snippet from `topnames.csv`.  (Note that it is customary to refer to `pandas` as `pd`.)

In [2]:
import pandas as pd

topnamesDoL = {'year':  [2018, 2018, 2017, 2017, 2016, 2016],
               'sex':   ['Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
               'name':  ['Liam', 'Emma', 'Liam', 'Emma', 'Noah', 'Emma'],
               'count': [19837, 18688, 18798, 19800, 19117, 19496]}

topnames = pd.DataFrame(topnamesDoL)

topnames

,year,sex,name,count
0,2018,Male,Liam,19837
1,2018,Female,Emma,18688
2,2017,Male,Liam,18798
3,2017,Female,Emma,19800
4,2016,Male,Noah,19117
5,2016,Female,Emma,19496


In the display of this data frame, above, we see that the columns are labeled, as are the rows.  By default, the row labels take values 0, 1, 2, ....

The row labels are also called the _index_ of the data set.

Similarly to creating a `DataFrame` from a DoL, we can do so from an LoL or LoD.  For an LoL, we need to specify the columns:

In [3]:
topnamesLoL = [[2018, 'Male', 'Liam', 19837],
               [2018, 'Female', 'Emma', 18688],
               [2017, 'Male', 'Liam', 18798],
               [2017, 'Female', 'Emma', 19800],
               [2016, 'Male', 'Noah', 19117],
               [2016, 'Female', 'Emma', 19496]]
columns = ['year', 'sex', 'name', 'count']

topnames = pd.DataFrame(topnamesLoL, columns=columns)

topnames

,year,sex,name,count
0,2018,Male,Liam,19837
1,2018,Female,Emma,18688
2,2017,Male,Liam,18798
3,2017,Female,Emma,19800
4,2016,Male,Noah,19117
5,2016,Female,Emma,19496
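The LoD case mentioned above is not shown in the lab, so here is a minimal sketch of it. The name `topnamesLoD` and the three-row subset are illustrative assumptions, not part of the original data files.

```python
import pandas as pd

# Hypothetical LoD version of the same snippet: one dictionary per row.
topnamesLoD = [
    {'year': 2018, 'sex': 'Male',   'name': 'Liam', 'count': 19837},
    {'year': 2018, 'sex': 'Female', 'name': 'Emma', 'count': 18688},
    {'year': 2017, 'sex': 'Male',   'name': 'Liam', 'count': 18798},
]

# The column labels come from the dictionary keys, so no columns= argument is needed.
topnames_from_lod = pd.DataFrame(topnamesLoD)

print(list(topnames_from_lod.columns))  # ['year', 'sex', 'name', 'count']
print(topnames_from_lod.shape)          # (3, 4)
```

Unlike the LoL case, each row carries its own labels, which is why the `columns` argument can be left out here.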


---

### Creation from a CSV file

If you have a function that can read a CSV file and convert it to a DoL, LoL, or LoD representation, you could use that to create a `DataFrame`, as discussed above.  However, `pandas` provides some handy functionality to create a `DataFrame` directly from a CSV file using the `read_csv()` function.

In [4]:
filepath = os.path.join(datadir, "topnames.csv")
topnames0 = pd.read_csv(filepath)

topnames0.head()

,year,sex,name,count
0,1880,Female,Mary,7065
1,1880,Male,John,9655
2,1881,Female,Mary,6919
3,1881,Male,John,8769
4,1882,Female,Mary,8148


If we know that a given column (or set of columns) should be the index, we can specify that when parsing the CSV using the `index_col` parameter.

In [5]:
filepath = os.path.join(datadir, "topnames.csv")
topnames = pd.read_csv(filepath, index_col=["year", "sex"])

topnames.head()

year,sex,name,count
1880,Female,Mary,7065
1880,Male,John,9655
1881,Female,Mary,6919
1881,Male,John,8769
1882,Female,Mary,8148
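As a rough illustration of what `index_col` does, the sketch below parses a tiny in-memory CSV twice: once choosing the index while reading, and once moving the columns into the index afterwards with `set_index()`. The three-row `csv_text` string is a made-up stand-in for `topnames.csv`, and the variable names are assumptions for this example only.

```python
import io
import pandas as pd

# A three-row stand-in for topnames.csv, held in memory (illustrative only).
csv_text = (
    "year,sex,name,count\n"
    "1880,Female,Mary,7065\n"
    "1880,Male,John,9655\n"
    "1881,Female,Mary,6919\n"
)

# Option 1: choose the index columns while parsing.
at_read = pd.read_csv(io.StringIO(csv_text), index_col=["year", "sex"])

# Option 2: parse first, then move the two columns into the index.
after_read = pd.read_csv(io.StringIO(csv_text)).set_index(["year", "sex"])

print(list(at_read.index.names))   # ['year', 'sex']
print(list(at_read.columns))       # ['name', 'count']
print(at_read.index.equals(after_read.index))
```

Either route should leave `name` and `count` as the only data columns, with `year` and `sex` acting as row labels.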


In the previous examples, we have used the `head()` method to return, by default, the first 5 rows of data.  We could specify `n` rows by providing `n` as a parameter.

In [6]:
topnames.head(10)

year,sex,name,count
1880,Female,Mary,7065
1880,Male,John,9655
1881,Female,Mary,6919
1881,Male,John,8769
1882,Female,Mary,8148
1882,Male,John,9557
1883,Female,Mary,8012
1883,Male,John,8894
1884,Female,Mary,9217
1884,Male,John,9388


Similarly, we can view the last `n` rows of the data frame using `tail(n)`.

In [7]:
topnames.tail(4)

year,sex,name,count
2017,Female,Emma,19800
2017,Male,Liam,18798
2018,Female,Emma,18688
2018,Male,Liam,19837


---

## Part B: Data Frame - Basic Access

Now that we have a `DataFrame` object, we can view relevant metadata.  For example, we can find the number of rows in the `DataFrame`:

In [8]:
len(topnames)

278

Alternatively, we can access both the row and column dimensions using the `shape` attribute of the `DataFrame` object:

In [9]:
# Look at the shape when the index is *not* specified
topnames0.shape

(278, 4)

In [10]:
# Look at the shape when the index *is* specified
topnames.shape

(278, 2)
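The difference between the two shapes can be reproduced on a small frame. This is only a sketch: the two-row `small` frame and the names `indexed` and `restored` are assumptions made for illustration.

```python
import pandas as pd

small = pd.DataFrame({'year': [2018, 2018],
                      'sex': ['Male', 'Female'],
                      'name': ['Liam', 'Emma'],
                      'count': [19837, 18688]})

# Moving 'year' and 'sex' into the index removes them from the column count.
indexed = small.set_index(['year', 'sex'])

# reset_index() moves the index levels back into ordinary columns.
restored = indexed.reset_index()

print(small.shape)     # (2, 4)
print(indexed.shape)   # (2, 2)
print(restored.shape)  # (2, 4)
```

The number of rows never changes here; only the split between index levels and data columns does.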

The column count reported by `shape` excludes any columns that were moved into the index, which is why `topnames` has two fewer columns than `topnames0`.  We can get more information about the remaining columns using the `info()` method.

In [11]:
# Get info about the data frame without indices specified
topnames0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 278 entries, 0 to 277
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    278 non-null    int64 
 1   sex     278 non-null    object
 2   name    278 non-null    object
 3   count   278 non-null    int64 
dtypes: int64(2), object(2)
memory usage: 8.8+ KB


In [12]:
# Get info about the data frame with the (year, sex) index specified
topnames.info()

<class 'pandas.core.frame.DataFrame'>
MultiIndex: 278 entries, (1880, 'Female') to (2018, 'Male')
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   name    278 non-null    object
 1   count   278 non-null    int64 
dtypes: int64(1), object(1)
memory usage: 10.5+ KB


In the examples above, notice that one has a `RangeIndex` and the other has a `MultiIndex` (because we specified two columns in the index).

We can get the column labels using the `columns` attribute:

In [13]:
# Get column names: no index specified
topnames0.columns

Index(['year', 'sex', 'name', 'count'], dtype='object')

In [14]:
# Get column names: (year, sex) index
topnames.columns

Index(['name', 'count'], dtype='object')

We can inspect the indices using the `index` attribute:

In [15]:
# Get index info: no index specified
topnames0.index

RangeIndex(start=0, stop=278, step=1)

If we have specified an index, the `index` attribute lists every combination.  (This should correspond to every combination of independent variables for Tidy Data.)

In [16]:
# Get index info: (year, sex) index
topnames.index

MultiIndex([(1880, 'Female'),
            (1880,   'Male'),
            (1881, 'Female'),
            (1881,   'Male'),
            (1882, 'Female'),
            (1882,   'Male'),
            (1883, 'Female'),
            (1883,   'Male'),
            (1884, 'Female'),
            (1884,   'Male'),
            ...
            (2014, 'Female'),
            (2014,   'Male'),
            (2015, 'Female'),
            (2015,   'Male'),
            (2016, 'Female'),
            (2016,   'Male'),
            (2017, 'Female'),
            (2017,   'Male'),
            (2018, 'Female'),
            (2018,   'Male')],
           names=['year', 'sex'], length=278)
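Beyond printing the whole index, a few attributes summarize a `MultiIndex` more compactly. The sketch below builds a four-row frame by hand (the name `sample` is an assumption for this example) and inspects the level names, the number of levels, and the distinct values in one level.

```python
import pandas as pd

sample = pd.DataFrame(
    {'name': ['Liam', 'Emma', 'Liam', 'Emma'],
     'count': [19837, 18688, 18798, 19800]},
    index=pd.MultiIndex.from_tuples(
        [(2018, 'Male'), (2018, 'Female'), (2017, 'Male'), (2017, 'Female')],
        names=['year', 'sex']))

# Names of the index levels, in order.
print(list(sample.index.names))   # ['year', 'sex']

# How many columns make up the index.
print(sample.index.nlevels)       # 2

# Distinct values found in the 'year' level, in order of first appearance.
print(sample.index.get_level_values('year').unique().tolist())  # [2018, 2017]
```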

---

## Part C: Data Frame - Try it yourself

In the next couple of weeks, we'll work with a new dataset of country-based indicators, such as population (`pop`), gross domestic product (`gdp`), and life expectancy (`life`).

**Q1** Use the `pandas` module to load the data from `indicators2016.csv` in the `datadir` directory into a `DataFrame` object called `indicators2016`.

In [17]:
filepath = os.path.join(datadir, "indicators2016.csv")
indicators2016 = pd.read_csv(filepath)

**Q2** This is a big dataset.  Write an expression to visualize the first 8 rows of data.

In [18]:
indicators2016.head(8)

,code,country,pop,gdp,life,cell
0,ABW,Aruba,0.1,,75.87,
1,AFG,Afghanistan,34.66,19.47,63.67,21.6
2,AGO,Angola,28.81,95.34,61.55,13.0
3,ALB,Albania,2.88,11.86,78.34,3.37
4,AND,Andorra,0.08,2.86,,0.07
5,ARE,United Arab Emirates,9.27,348.74,77.26,19.91
6,ARG,Argentina,43.85,545.48,76.58,63.72
7,ARM,Armenia,2.92,10.57,74.62,3.43


**Q3** We can use the `code` column to represent the row labels (the _index_) of this dataset.  Re-read in the data in `indicators2016.csv`, but this time use `code` as the index.  Again, display the first 8 rows.

In [19]:
indicators2016 = pd.read_csv(filepath, index_col = ["code"])
indicators2016.head(8)

code,country,pop,gdp,life,cell
ABW,Aruba,0.1,,75.87,
AFG,Afghanistan,34.66,19.47,63.67,21.6
AGO,Angola,28.81,95.34,61.55,13.0
ALB,Albania,2.88,11.86,78.34,3.37
AND,Andorra,0.08,2.86,,0.07
ARE,United Arab Emirates,9.27,348.74,77.26,19.91
ARG,Argentina,43.85,545.48,76.58,63.72
ARM,Armenia,2.92,10.57,74.62,3.43


In [20]:
indicators2016.info()

<class 'pandas.core.frame.DataFrame'>
Index: 220 entries, ABW to ZWE
Data columns (total 5 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   country  220 non-null    object 
 1   pop      219 non-null    float64
 2   gdp      194 non-null    float64
 3   life     202 non-null    float64
 4   cell     203 non-null    float64
dtypes: float64(4), object(1)
memory usage: 10.3+ KB


**Q4** In a single assignment line, using an attribute of the data frame object, assign to `nrows` and `ncols` the number of rows and columns in the indicators 2016 data set.

In [21]:
(nrows, ncols) = indicators2016.shape

print(nrows, ncols)

220 5


> You've reached the first checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 1: How many columns are in the `indicators2016` dataset?  Which correspond to independent variables?  Which correspond to dependent variables?  Finally, in the information listed above, why are the count values different for each column?

---

## Part D: Sorting a DataFrame

A `DataFrame` keeps its rows in the order in which they were loaded; `pandas` does not sort them automatically.  Both of our CSV files happen to be stored in ascending index order, so `indicators2016` appears alphabetically by the 3-letter `code`, and `topnames` appears ordered by `year` and then `sex`.

The `sort_index()` method sorts by the index.  Passing `ascending=False` reverses the order, and passing `inplace=True` modifies the `DataFrame` itself instead of returning a sorted copy:

In [22]:
topnames_sorted = topnames.sort_index(ascending=False)
topnames_sorted.head(8) # most recent year first

year,sex,name,count
2018,Male,Liam,19837
2018,Female,Emma,18688
2017,Male,Liam,18798
2017,Female,Emma,19800
2016,Male,Noah,19117
2016,Female,Emma,19496
2015,Male,Noah,19635
2015,Female,Emma,20455
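The cell above assigns the result of `sort_index()` to a new name rather than using `inplace=True`. The sketch below contrasts the two styles on a three-row frame; `sample`, `newest_first`, and `result` are names assumed for this illustration.

```python
import pandas as pd

sample = pd.DataFrame({'count': [18798, 19837, 19117]},
                      index=pd.Index([2017, 2018, 2016], name='year'))

# Without inplace, sort_index() returns a new frame and leaves `sample` untouched.
newest_first = sample.sort_index(ascending=False)
print(sample.index.tolist())        # [2017, 2018, 2016]
print(newest_first.index.tolist())  # [2018, 2017, 2016]

# With inplace=True, the frame itself is reordered and the call returns None.
result = sample.sort_index(ascending=False, inplace=True)
print(result)                       # None
print(sample.index.tolist())        # [2018, 2017, 2016]
```

Because an in-place call returns `None`, assigning its result to a variable is rarely what is intended.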


If we want to sort on a different column, we could sort by that column's values using `sort_values()`:

In [23]:
# Find most popular names since 1880: sort 'count' largest->smallest
topnames_sorted.sort_values(by=["count"], inplace=True, ascending=False)

topnames_sorted.head(8)

year,sex,name,count
1947,Female,Linda,99689
1948,Female,Linda,96211
1947,Male,James,94757
1957,Male,Michael,92704
1949,Female,Linda,91016
1956,Male,Michael,90656
1958,Male,Michael,90517
1948,Male,James,88584
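Since `by` accepts a list, `sort_values()` can also sort on several columns at once, with one `ascending` flag per column. The following is a small sketch on a hand-built frame; the name `sample` and the choice of sort keys are assumptions for illustration.

```python
import pandas as pd

sample = pd.DataFrame({'year': [2018, 2018, 2017, 2017],
                       'sex': ['Male', 'Female', 'Male', 'Female'],
                       'name': ['Liam', 'Emma', 'Liam', 'Emma'],
                       'count': [19837, 18688, 18798, 19800]})

# Sort by name A->Z, then by count largest->smallest within each name.
by_name_then_count = sample.sort_values(by=['name', 'count'],
                                        ascending=[True, False])

print(by_name_then_count['name'].tolist())   # ['Emma', 'Emma', 'Liam', 'Liam']
print(by_name_then_count['count'].tolist())  # [19800, 18688, 19837, 18798]
```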


**Q5** Sort the `indicators2016` `DataFrame` by GDP, with highest GDP listed first.

In [27]:
indicators2016_sorted = indicators2016.sort_index(ascending=False)
indicators2016_sorted.sort_values(by=["gdp"], inplace = True, ascending = False)

In [28]:
# View highest-GDP countries in 2016
indicators2016_sorted.head(8)

code,country,pop,gdp,life,cell
WLD,World,7444.03,75871.74,72.04,7508.99
USA,United States,323.13,18624.47,78.69,395.88
CHN,China,1378.66,11199.15,76.25,1364.93
JPN,Japan,126.99,4949.27,83.98,166.85
DEU,Germany,82.49,3477.8,80.64,103.47
TSA,South Asia (IDA & IBRD),1766.39,2892.48,68.71,1481.31
GBR,United Kingdom,65.6,2650.85,80.96,78.93
FRA,France,66.89,2465.45,82.27,67.57


**Q6** Sort the `indicators2016` `DataFrame` by life expectancy, with highest life expectancy listed first.

In [29]:
indicators2016_sorted = indicators2016.sort_index(ascending=False)
indicators2016_sorted.sort_values(by=["life"], inplace = True, ascending = False)

In [30]:
# View countries with the highest life expectancy in 2016
indicators2016_sorted.head(8)

code,country,pop,gdp,life,cell
HKG,"Hong Kong SAR, China",7.34,320.91,84.23,17.58
JPN,Japan,126.99,4949.27,83.98,166.85
MAC,"Macao SAR, China",0.61,45.31,83.85,1.97
CHE,Switzerland,8.37,668.85,82.9,11.24
ESP,Spain,46.48,1237.26,82.83,51.52
SGP,Singapore,5.61,296.98,82.8,8.46
LIE,Liechtenstein,0.04,,82.66,0.04
ITA,Italy,60.63,1859.38,82.54,90.93


**Q7** We can always re-sort the data by its index using `sort_index()`.  Imagine you've forgotten whether you already set the index for the `indicators2016` `DataFrame`, so you try to set it again.

Why does the following give an error?

In [31]:
indicators2016_v2 = indicators2016.set_index(["code"], inplace=True)

KeyError: "None of ['code'] are in the columns"

> You've reached the second checkpoint in the lab.  Make sure to have it signed off by the instructor or TA.
>
> Checkpoint 2: Why does updating the index, above, cause an error?

---

## Part E

How much time (in minutes/hours) did you spend on this lab outside of class?

30 minutes