# Lab 04: Pandas

Welcome to Advanced Topics in Data Science for High School! Throughout the course you will complete assignments like this one. You can't learn technical subjects without hands-on practice, so these assignments are an important part of the course.

**Collaboration Policy:**

Collaborating on labs is more than okay -- it's encouraged! You should rarely remain stuck for more than a few minutes on questions in labs, so ask a neighbor or an instructor for help. Explaining things is beneficial, too -- the best way to solidify your knowledge of a subject is to explain it. You should **not** _just_ copy/paste someone else's code, but rather work together to gain understanding of the task you need to complete. 

**Due Date:**

## Today's Assignment 

[Pandas](https://pandas.pydata.org/) is one of the most widely used Python libraries in data science. In this lab, you will learn commonly used data wrangling operations/tools in Pandas. 

In today's assignment, you'll learn about:

* Creating dataframes

* Slicing data frames (i.e. selecting rows and columns)

* Filtering data (using boolean arrays)

**Note:** If you are familiar with the `datascience` library used in Foundations of Data Science, the `activity09.ipynb` notebook, Mapping from `datascience` to Pandas, may serve as a useful guide. It can be found in the activities folder.


In this lab you are going to use several pandas methods, such as `drop` and `loc`. You may need to read the documentation for the methods to see all the parameters. The pandas interface is notoriously confusing, and the documentation is not consistently great. Throughout the semester, you will have to search through Pandas documentation and experiment, but remember it is part of the learning experience and will help shape you as a data scientist.

First, set up the imports by running the cell below.

In [3]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt

## 1. Creating DataFrames & Basic Manipulations

A [dataframe](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe) is a table in which each column has a type; there is an index over the columns (typically string labels) and an index over the rows (typically ordinal numbers).

The [documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) for the pandas `DataFrame` class provides at least two syntaxes to create a data frame.

**Syntax 1:** You can create a data frame by specifying the columns and values using a dictionary as shown below. 

The keys of the dictionary are the column names, and the values of the dictionary are lists containing the row entries.

In [4]:
fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
            'color': ['red', 'orange', 'yellow', 'pink']})
fruit_info

Unnamed: 0,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


**Syntax 2:** You can also define a dataframe by specifying the rows like below. 

Each row corresponds to a distinct tuple, and the columns are specified separately.

In [5]:
fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")], 
    columns = ["color", "fruit"])
fruit_info2

Unnamed: 0,color,fruit
0,red,apple
1,orange,orange
2,yellow,banana
3,pink,raspberry


You can obtain the dimensions of a dataframe by using the shape attribute `dataframe.shape`.

In [6]:
fruit_info.shape

(4, 2)

You can also convert the entire dataframe into a two-dimensional numpy array.

In [7]:
fruit_info.values

array([['apple', 'red'],
       ['orange', 'orange'],
       ['banana', 'yellow'],
       ['raspberry', 'pink']], dtype=object)

**Question 1.** For a `DataFrame` called `df`, you can add a column with 

`df['new column name'] = ...` 

and assign a list or array of values to the column. Use `pd.Series` to add a column of integers containing 1, 2, 3, and 4 called `rank1` to the `fruit_info` table, expressing your personal preference about the taste ordering for each fruit (1 is tastiest; 4 is least tasty).

**Note:** To earn all the points for this question you **must** use `pd.Series`. 

In [8]:
fruit_info['rank1'] = pd.Series([2, 3, 4, 1]) # SOLUTION
fruit_info

Unnamed: 0,fruit,color,rank1
0,apple,red,2
1,orange,orange,3
2,banana,yellow,4
3,raspberry,pink,1


In [9]:
fruit_info.shape

(4, 3)

In [10]:
'rank1' in fruit_info.columns

True

In [11]:
1 in fruit_info['rank1'].values  # .values checks the column's values, not its index labels

True

In [12]:
2 in fruit_info['rank1'].values  # .values checks the column's values, not its index labels

True

In [13]:
3 in fruit_info['rank1'].values  # .values checks the column's values, not its index labels

True

In [14]:
np.sum(fruit_info['rank1'])

10
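
One subtlety worth knowing when checking a column's contents: Python's `in` operator applied to a `Series` tests the index labels, not the stored values. The sketch below uses a hypothetical Series (not the lab data) whose values and labels differ, so the two kinds of check give different answers.

```python
import pandas as pd

# Hypothetical values, chosen so that they differ from the index labels 0, 1, 2.
ranks = pd.Series([10, 20, 30])

label_hit = 1 in ranks            # 1 is an index label
value_miss = 10 in ranks          # 10 is a value, not an index label
value_hit = 10 in ranks.values    # .values exposes the underlying array of values

print(label_hit, value_miss, value_hit)
```

To test membership among the values, check against `.values` (or use `.isin`) rather than the Series itself.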

**Question 2.** You can also add a column to `df` with 

`df.loc[:, 'new column name'] = ...`. 

As discussed in class, the first parameter is for the rows and second is for columns. The `:` means change all rows and the `new column name` indicates the column you are modifying (or in this case, adding). 

Make a copy of the `fruit_info` dataframe named `fruit_info_copy` using the `.copy` method. Then add a column called `rank2` to the `fruit_info_copy` table which contains the same values in the same order as the `rank1` column.

**Note:** To earn all the points for this question you **must** use the `.copy` method and `.loc`.
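
To see why `.copy` matters, here is a minimal sketch on a hypothetical toy table (the names `original`, `alias`, and `independent` are made up for illustration). Plain assignment only creates a second name for the same dataframe, while `.copy` creates a separate one.

```python
import pandas as pd

# Hypothetical toy table, unrelated to the lab data.
original = pd.DataFrame({'x': [1, 2, 3]})

alias = original                # a second name for the SAME dataframe
independent = original.copy()   # a separate dataframe with its own data

# Adding a column to the copy with .loc leaves `original` untouched.
independent.loc[:, 'y'] = [10, 20, 30]

# Adding a column through the alias also changes `original`.
alias['z'] = [7, 8, 9]

print(list(original.columns))
print(list(independent.columns))
```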

In [15]:
fruit_info_copy = fruit_info.copy() # SOLUTION
fruit_info_copy.loc[:, 'rank2'] = pd.Series([2, 3, 4, 1])  # SOLUTION
fruit_info_copy

Unnamed: 0,fruit,color,rank1,rank2
0,apple,red,2,2
1,orange,orange,3,3
2,banana,yellow,4,4
3,raspberry,pink,1,1


In [16]:
isinstance(fruit_info_copy, pd.core.frame.DataFrame)

True

In [17]:
fruit_info_copy.shape

(4, 4)

In [18]:
'rank2' in fruit_info_copy.columns

True

In [19]:
1 in fruit_info_copy['rank2']

True

In [20]:
2 in fruit_info_copy['rank2']

True

In [21]:
3 in fruit_info_copy['rank2']

True

In [22]:
np.sum(fruit_info_copy['rank2'])

10

**Question 3.** Use the `.drop` method to [drop](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) both the `rank1` and `rank2` columns you created in `fruit_info_copy` (make sure to use the `axis` parameter correctly). Save the output to an object named `fruit_info_original`.

**Note:** `.drop` does not change a table, but instead returns a new table with fewer columns or rows unless you set the optional `inplace` parameter.

**Hint:** Look through the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html) to see how you can drop multiple columns of a Pandas dataframe at once using a list of column names.
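
As a small illustration of both points above, the following sketch drops two columns at once from a hypothetical toy table (the column names `a`, `b`, `c` are made up) and shows that the original table is left unchanged.

```python
import pandas as pd

# Hypothetical toy table; the column names are made up for this sketch.
toy = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Passing a list of labels removes several columns in one call.
# axis=1 says the labels refer to columns (axis=0 would refer to rows).
trimmed = toy.drop(['b', 'c'], axis=1)

print(list(trimmed.columns))  # columns of the new table
print(list(toy.columns))      # the original table keeps all of its columns
```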

In [23]:
fruit_info_original = fruit_info_copy.drop(['rank1', 'rank2'], axis=1) # SOLUTION
fruit_info_original

Unnamed: 0,fruit,color
0,apple,red
1,orange,orange
2,banana,yellow
3,raspberry,pink


In [24]:
isinstance(fruit_info_original, pd.core.frame.DataFrame)

True

In [25]:
fruit_info_original.shape

(4, 2)

In [26]:
'rank1' not in fruit_info_original.columns

True

In [27]:
'rank2' not in fruit_info_original.columns

True

**Question 4.** Use the `.rename()` method to [rename](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rename.html) the columns of `fruit_info_copy` so they begin with capital letters. Set this new dataframe to `fruit_info_caps`.

In [28]:
fruit_info_caps = fruit_info_copy.rename(columns={'fruit': 'Fruit', 'color': 'Color', 'rank1': 'Rank1', 'rank2': 'Rank2'}) # SOLUTION
fruit_info_caps

Unnamed: 0,Fruit,Color,Rank1,Rank2
0,apple,red,2,2
1,orange,orange,3,3
2,banana,yellow,4,4
3,raspberry,pink,1,1


In [29]:
isinstance(fruit_info_caps, pd.core.frame.DataFrame)

True

In [30]:
fruit_info_caps.shape

(4, 4)

In [31]:
'Rank1' in fruit_info_caps.columns

True

In [32]:
'Rank2' in fruit_info_caps.columns

True

## Babynames

Now that we have learned the basics, let's move on to the `babynames` dataset. The `babynames` dataset contains a record of the given names of babies born in the United States each year.

First let's run the following cell to load the dataframe `babynames`.

In [33]:
babynames = pd.read_csv('data/baby_names.csv', index_col = 0)
babynames

Unnamed: 0,State,Sex,Year,Name,Count
0,GA,F,1910,Mary,841
1,GA,F,1910,Annie,553
2,GA,F,1910,Mattie,320
3,GA,F,1910,Ruby,279
4,GA,F,1910,Willie,275
...,...,...,...,...,...
890622,VA,M,2019,Yerik,5
890623,VA,M,2019,Yosef,5
890624,VA,M,2019,Zakari,5
890625,VA,M,2019,Zakariya,5


**Question 5.** Use 3 different `DataFrame` methods to explore the `babynames` dataframe.

In [32]:
babynames.shape # SOLUTION

(890627, 5)

In [33]:
babynames.info() # SOLUTION

<class 'pandas.core.frame.DataFrame'>
Int64Index: 890627 entries, 0 to 890626
Data columns (total 5 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   State   890627 non-null  object
 1   Sex     890627 non-null  object
 2   Year    890627 non-null  int64 
 3   Name    890627 non-null  object
 4   Count   890627 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 40.8+ MB


In [34]:
np.unique(babynames.State.to_list()) # SOLUTION

array(['GA', 'KY', 'NC', 'SC', 'TN', 'VA'], dtype='<U2')

**Question 6.** In the markdown cell below, explain the output from each of the commands you ran in the previous question.

**SOLUTION:** Answers may vary

## Selecting Rows and Columns (Slicing)

### Selection Using Label/Index (using `.loc`)

#### Column Selection 

To select a column of a `DataFrame` by column label, one way is to use the `.loc` [method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html). General usage of `.loc` looks like `df.loc[rowname, colname]` (**reminder:** the colon `:` means "everything"). For example, if we want the `color` column of the `fruit_info` data frame, we would use: `fruit_info.loc[:, 'color']`

- You can also slice across columns. For example, `babynames.loc[:, 'Name':]` would select the column `Name` and all columns after `Name`.

- **Alternative:** While `.loc` is invaluable when writing production code, it may be a little too verbose for interactive use. One recommended alternative is the `[ ]` method, which takes on the form `df['colname']`.
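
The three column-selection forms described above can be compared side by side. This is a minimal sketch on a hypothetical toy table (the `price` column is made up for illustration).

```python
import pandas as pd

# Hypothetical toy table; 'price' is an invented column.
toy = pd.DataFrame({'fruit': ['apple', 'banana'],
                    'color': ['red', 'yellow'],
                    'price': [1.0, 0.5]})

by_loc = toy.loc[:, 'color']          # one column by label, returned as a Series
by_brackets = toy['color']            # shorthand for the same column
from_color_on = toy.loc[:, 'color':]  # label slice: 'color' and every column after it

print(by_loc.equals(by_brackets))
print(list(from_color_on.columns))
```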

#### Row Selection

Similarly, if we want to select a row by its label, we can use the same `.loc` method. In this case, the "label" of each row refers to the index (i.e. primary key) of the dataframe.

**Example 1.**

In [40]:
babynames.loc[2:5, 'Name']

2    Mattie
3      Ruby
4    Willie
5    Louise
Name: Name, dtype: object

**Example 2.** Notice the difference between this method and the method in **Example 1**.

Just passing in `'Name'` returns a `Series` while `['Name']` returns a `DataFrame`.

In [41]:
babynames.loc[2:5, ['Name']]

Unnamed: 0,Name
2,Mattie
3,Ruby
4,Willie
5,Louise


**Question 7.** Explain what is different between the output from running the commands in **Example 1** and **Example 2**.

**SOLUTION:** Answers may vary

**Note:** `.loc` actually uses the Pandas row index rather than row id/position of rows in the dataframe to perform the selection. Also, notice that if you write `2:5` with `.loc[ ]`, contrary to normal Python slicing functionality, the end index is included, so you get the row with index 5.

### Selection using Integer location (using `.iloc`)

Another pandas feature is `.iloc[ ]` which lets you slice the dataframe by row position and column position instead of by row index and column label (which is the case for `.loc[ ]`). This is really the main difference between the two functions and it is **important** that you remember the difference and why you might want to use one over the other. In addition, with `.iloc[ ]`, the end index is **not** included, like with normal Python slicing.

**Note:** As a mnemonic, remember that the *i* in `.iloc` means "integer". 
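
The label-versus-position distinction is easiest to see when the index labels are out of order. The sketch below uses a hypothetical four-row table (not the lab data) whose index is deliberately shuffled.

```python
import pandas as pd

# Hypothetical toy table whose index labels are deliberately out of order.
toy = pd.DataFrame({'letter': ['a', 'b', 'c', 'd']}, index=[3, 1, 0, 2])

# .loc slices by index LABEL and includes the end label:
# it runs from the row labeled 3 through the row labeled 0.
by_label = toy.loc[3:0, 'letter']

# .iloc slices by POSITION and excludes the end position:
# it takes the rows at positions 0 and 1.
by_position = toy.iloc[0:2, 0]

print(by_label.tolist())
print(by_position.tolist())
```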

Below, we have sorted the `babynames` dataframe. Notice how the **position** of a row is not necessarily equal to the **index** of a row. For example, the first row is not necessarily the row associated with index 0. This distinction is important in understanding the difference between `.loc[ ]` and `.iloc[ ]`.

In [42]:
sorted_babynames = babynames.sort_values(by = ['Name'])
sorted_babynames.head()

Unnamed: 0,State,Sex,Year,Name,Count
594983,SC,M,2014,Aaden,6
296217,KY,M,2008,Aaden,24
590207,SC,M,2008,Aaden,15
297787,KY,M,2010,Aaden,8
470397,NC,M,2012,Aaden,7


**Example 3.** Here is an example of how we would get the 2nd, 3rd, and 4th rows with only the `Name` column of the `babynames` dataframe using both `.iloc[ ]` and `.loc[ ]`. Observe the difference, especially after sorting `babynames` by name.

In [43]:
sorted_babynames.iloc[1:4, 3]

296217    Aaden
590207    Aaden
297787    Aaden
Name: Name, dtype: object

**Example 4.** Notice that using `.loc[ ]` with 1:4 gives different results, since it selects using the **index**.

In [44]:
sorted_babynames.loc[1:4, 'Name']

1          Annie
793024     Annie
514923     Annie
613780     Annie
815197     Annie
           ...  
272670    Willie
734369    Willie
749611    Willie
764536    Willie
4         Willie
Name: Name, Length: 820497, dtype: object

**Question 8.** Explain what the difference is between the output from running the commands in **Example 3** and **Example 4**.

**SOLUTION:** Answers may vary

**Example 5.** Lastly, we can change the index of a dataframe using the `set_index` method. We change the index from $0,1,2,\ldots$ to the `Name` column.

In [45]:
idx_babynames = babynames[:5].set_index("Name") 
idx_babynames

Unnamed: 0_level_0,State,Sex,Year,Count
Name,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Mary,GA,F,1910,841
Annie,GA,F,1910,553
Mattie,GA,F,1910,320
Ruby,GA,F,1910,279
Willie,GA,F,1910,275


**Example 6.** However, if we still want to access rows by location we will need to use the integer location (`.iloc`) accessor.

**Note:** We can't use `idx_babynames.loc[1:4, 'Year']` here, because the index now holds names rather than integers, but we can still use the integer position.

In [41]:
idx_babynames.iloc[1:4, 2:3]

Unnamed: 0_level_0,Year
Name,Unnamed: 1_level_1
Annie,1910
Mattie,1910
Ruby,1910


**Question 9.** List the different names of the states that are in the `babynames` data set. Do this using a pandas method.

**Note:** To earn all the points for the question you must do it programmatically.

In [50]:
babynames.State.unique() # SOLUTION

array(['GA', 'KY', 'NC', 'SC', 'TN', 'VA'], dtype=object)

**Question 10.** Selecting multiple columns is easy.  You just need to supply a list of column names.  Use `.loc` to select the `Name` and `Year` **in that order** from the `babynames` table.

**Note:** To earn all the points for this question you **must** use the `.loc` method. 

In [51]:
name_and_year = babynames.loc[:, ['Name', 'Year']] # SOLUTION
name_and_year[:5]

Unnamed: 0,Name,Year
0,Mary,1910
1,Annie,1910
2,Mattie,1910
3,Ruby,1910
4,Willie,1910


In [52]:
isinstance(name_and_year, pd.core.frame.DataFrame)

True

In [53]:
name_and_year.shape[1]

2

In [54]:
name_and_year.columns

Index(['Name', 'Year'], dtype='object')

## Filtering Data

### Filtering with boolean arrays

Filtering is the process of removing unwanted material.  In your quest for cleaner data, you will undoubtedly filter your data at some point: whether it be for clearing up cases with missing values, for culling out fishy outliers, or for analyzing subgroups of your data set.  Note that compound expressions have to be grouped with parentheses. 

Example usage looks like `df[df['column name'] < 5]`.

For your reference, some commonly used comparison operators are given below.

Symbol   | Usage      | Meaning 
------   | ---------- | -------------------------------------
$==$     | a == b     | Does a equal b?
$\lt =$  | a <= b     | Is a less than or equal to b?
$\gt =$  | a >= b     | Is a greater than or equal to b?
$\lt$    | a < b      | Is a less than b?
$\gt$    | a > b      | Is a greater than b?
~        | ~p         | Returns negation of p
&#124;   | p &#124; q | p OR q
&        | p & q      | p AND q
^        | p ^ q      | p XOR q (exclusive or)
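
The operators in the table can be tried out on a small example. This is a minimal sketch on a hypothetical toy table (the names, ages, and states are made up); each comparison produces a boolean Series, and the logical operators combine those Series element by element.

```python
import pandas as pd

# Hypothetical toy table; all values are invented for illustration.
toy = pd.DataFrame({'name': ['Ann', 'Bob', 'Cy', 'Di'],
                    'age': [3, 7, 5, 9],
                    'state': ['NC', 'VA', 'NC', 'NC']})

is_nc = toy['state'] == 'NC'   # boolean Series: one True/False per row
is_old = toy['age'] >= 5

nc_and_old = toy[is_nc & is_old]   # rows where both conditions hold
nc_or_old = toy[is_nc | is_old]    # rows where at least one condition holds
not_nc = toy[~is_nc]               # rows where the first condition fails

# Written inline, each condition needs its own parentheses.
inline = toy[(toy['state'] == 'NC') & (toy['age'] >= 5)]

print(nc_and_old['name'].tolist())
print(not_nc['name'].tolist())
```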

In the following we construct the DataFrame containing only names registered in North Carolina.

**Question 11.** Make a `DataFrame` containing only names registered in North Carolina.

In [55]:
nc = babynames[babynames['State'] == 'NC'] # SOLUTION
nc.head()

Unnamed: 0,State,Sex,Year,Name,Count
304336,NC,F,1910,Mary,837
304337,NC,F,1910,Annie,401
304338,NC,F,1910,Ruth,235
304339,NC,F,1910,Ethel,199
304340,NC,F,1910,Elizabeth,191


In [56]:
isinstance(nc, pd.core.frame.DataFrame)

True

In [57]:
nc.shape

(175577, 5)

In [58]:
nc.State.unique()[0]

'NC'

**Question 12.** Using a boolean array, select the names in Year 2019 from the `nc` dataframe that have at least 500 counts. Keep all columns from the original `nc` dataframe.

**Hint:** Any time you use `p & q` to filter the dataframe, make sure to use 

`df[(p) & (q)]` or `df.loc[(p) & (q)]`, where `p` and `q` are conditions such as `df['Year'] == 2019`. 


That is, make sure to wrap conditions with parentheses.

**Note:** Both `[ ]` indexing and `.loc` will achieve the same result here. The `.loc` form is more explicit, which is why it is often preferred in production code. You are free to use whichever one you would like.

In [59]:
names_2019 = nc[(nc['Year'] == 2019) & (nc['Count'] >= 500)] # SOLUTION
names_2019.head()

Unnamed: 0,State,Sex,Year,Name,Count
398340,NC,F,2019,Ava,595
398341,NC,F,2019,Olivia,535
398342,NC,F,2019,Emma,531
478548,NC,M,2019,Liam,685
478549,NC,M,2019,Noah,605


In [60]:
isinstance(names_2019, pd.core.frame.DataFrame)

True

In [61]:
names_2019.shape

(7, 5)

In [62]:
np.sort(names_2019['Name'].unique())

array(['Ava', 'Emma', 'James', 'Liam', 'Noah', 'Olivia', 'William'],
      dtype=object)

In [63]:
np.sum(names_2019.loc[:,'Count'])

4050

### Counting Labels



**Question 13.** Use the `.value_counts()` `Series` method to count the number of rows (name records) for each sex in North Carolina.

**Hint:** Use the `nc` dataframe.

In [64]:
num_of_names_m_f = nc['Sex'].value_counts() # SOLUTION
num_of_names_m_f

F    95653
M    79924
Name: Sex, dtype: int64

In [65]:
isinstance(num_of_names_m_f, pd.Series)

True

In [66]:
num_of_names_m_f.shape

(2,)

In [67]:
num_of_names_m_f.iloc[0]  # first entry by position (the index labels are 'F' and 'M')

95653

In [68]:
num_of_names_m_f.iloc[1]  # second entry by position (the index labels are 'F' and 'M')

79924
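
Note that `.value_counts()` counts rows, so on data laid out like `babynames` it reports how many name records each label has, not how many babies were born. A minimal sketch on hypothetical toy records (the counts are made up) contrasts the two quantities.

```python
import pandas as pd

# Hypothetical toy records in the same layout as the lab data.
toy = pd.DataFrame({'Sex': ['F', 'F', 'M'],
                    'Name': ['Mary', 'Annie', 'James'],
                    'Count': [10, 5, 8]})

# Number of rows (name records) per label.
rows_per_sex = toy['Sex'].value_counts()

# Number of babies per label: add up the Count column within each group.
babies_per_sex = toy.groupby('Sex')['Count'].sum()

print(rows_per_sex['F'], rows_per_sex['M'])
print(babies_per_sex['F'], babies_per_sex['M'])
```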

### Using `.groupby`

The `.groupby()` method is used to split the data into groups based on the different labels in a column. Pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names.
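
To make the "mapping of labels to groups" idea concrete, here is a minimal sketch on a hypothetical toy table (names and counts are made up). It shows which rows land in each group and then aggregates within the groups.

```python
import pandas as pd

# Hypothetical toy table; all values are invented for illustration.
toy = pd.DataFrame({'Name': ['Ann', 'Bob', 'Ann', 'Bob', 'Cy'],
                    'Year': [2000, 2000, 2001, 2001, 2001],
                    'Count': [5, 7, 6, 2, 9]})

grouped = toy.groupby('Name')

# .groups maps each group label to the index labels of the rows in that group.
group_rows = {name: list(rows) for name, rows in grouped.groups.items()}

# Aggregating combines the rows of each group into a single value.
totals = grouped['Count'].sum()

print(group_rows)
print(totals.to_dict())
```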

For example, we can group the `nc` dataframe on `Name`, sum the `Count` values for each name across all years, and display the results in descending order.

Run the cell below.

In [69]:
nc.groupby('Name')['Count'].sum().sort_values(ascending=False)

Name
James      210596
William    159417
Mary       134170
John       124723
Robert     116095
            ...  
Avan            5
Avarie          5
Avary           5
Jonthan         5
Autum           5
Name: Count, Length: 9128, dtype: int64

If we wanted the dataframe that only contained observations for the name James we could use the `.get_group()` method on the `groupby` object.

Run the cell below.

In [62]:
grps = nc.groupby('Name')
grps.get_group('James')

Unnamed: 0,State,Sex,Year,Name,Count
304965,NC,F,1911,James,6
305302,NC,F,1912,James,8
305794,NC,F,1913,James,6
306160,NC,F,1914,James,10
306665,NC,F,1915,James,11
...,...,...,...,...,...
473311,NC,M,2015,James,537
474616,NC,M,2016,James,479
475913,NC,M,2017,James,541
477226,NC,M,2018,James,524


If we wanted to see all the different group labels we can use `.groups.keys()` on the `groupby` object `grps`. Since `.groups` is a dictionary (and it's long) we will cast its keys into a list and only show the first 10 items.

Run the cell below.

In [63]:
list(grps.groups.keys())[:10]

['Aaden',
 'Aadhya',
 'Aadya',
 'Aahana',
 'Aaiden',
 'Aalayah',
 'Aaleyah',
 'Aaliyah',
 'Aalyiah',
 'Aamir']

**Question 14.** Use `groupby` to count the number of unique names for each Year in North Carolina. 

**Note:** We are **not** computing the number of babies but instead the number of unique names for each year. Your output should look like the following:

   
```
Year Count
---- -----
1910 563
1911 572
1912 739
1913 774
1914 913
1915 1056
1916 1062
1917 1071
1918 1141
1919 1166
1920 1199
1921 1210
1922 1215
1923 1161
1924 1225
1925 1183
1926 1129
1927 1172
1928 1130
1929 1135
1930 1096
1931 1078
1932 1095
1933 1064
1934 1098
1935 1086
1936 1056
1937 1063
1938 1101
1939 1077
1940 1066
1941 1079
1942 1140
1943 1139
1944 1127
1945 1095
1946 1174
1947 1254
1948 1229
1949 1203
1950 1194
1951 1240
1952 1230
1953 1279
1954 1244
1955 1295
1956 1316
1957 1302
1958 1287
1959 1284
1960 1291
1961 1324
1962 1314
1963 1315
1964 1341
1965 1257
1966 1233
1967 1242
1968 1270
1969 1262
1970 1322
1971 1347
1972 1317
1973 1310
1974 1320
1975 1300
1976 1304
1977 1368
1978 1375
1979 1363
1980 1399
1981 1358
1982 1381
1983 1355
1984 1355
1985 1420
1986 1464
1987 1480
1988 1544
1989 1646
1990 1710
1991 1740
1992 1796
1993 1759
1994 1766
1995 1828
1996 1884
1997 1985
1998 2091
1999 2140
2000 2251
2001 2318
2002 2327
2003 2345
2004 2407
2005 2517
2006 2616
2007 2764
2008 2783
2009 2821
2010 2763
2011 2728
2012 2704
2013 2712
2014 2737
2015 2745
2016 2755
2017 2783
2018 2803
2019 2849
```

In [64]:
# BEGIN SOLUTION NO PROMPT
# Group once, then print each year with its number of distinct names.
for year, group in nc.groupby('Year'):
    print(year, group['Name'].nunique())
# END SOLUTION
""" # BEGIN PROMPT
...
"""; # END PROMPT

1910 563
1911 572
1912 739
1913 774
1914 913
1915 1056
1916 1062
1917 1071
1918 1141
1919 1166
1920 1199
1921 1210
1922 1215
1923 1161
1924 1225
1925 1183
1926 1129
1927 1172
1928 1130
1929 1135
1930 1096
1931 1078
1932 1095
1933 1064
1934 1098
1935 1086
1936 1056
1937 1063
1938 1101
1939 1077
1940 1066
1941 1079
1942 1140
1943 1139
1944 1127
1945 1095
1946 1174
1947 1254
1948 1229
1949 1203
1950 1194
1951 1240
1952 1230
1953 1279
1954 1244
1955 1295
1956 1316
1957 1302
1958 1287
1959 1284
1960 1291
1961 1324
1962 1314
1963 1315
1964 1341
1965 1257
1966 1233
1967 1242
1968 1270
1969 1262
1970 1322
1971 1347
1972 1317
1973 1310
1974 1320
1975 1300
1976 1304
1977 1368
1978 1375
1979 1363
1980 1399
1981 1358
1982 1381
1983 1355
1984 1355
1985 1420
1986 1464
1987 1480
1988 1544
1989 1646
1990 1710
1991 1740
1992 1796
1993 1759
1994 1766
1995 1828
1996 1884
1997 1985
1998 2091
1999 2140
2000 2251
2001 2318
2002 2327
2003 2345
2004 2407
2005 2517
2006 2616
2007 2764
2008 2783
2009 2821
2010 