# Data Aggregation and Grouping
## Using World Flags Data

<img src='flags.JPG'>

A critical task in data analysis is aggregating or transforming groups of data. After preparing your data, you may need to compute group statistics or possibly pivot tables for reporting or visualization purposes. The pandas `groupby` method is a flexible way to perform these aggregations and summarize datasets.

For this module, we will be working with data that contains details of various nations and their flags. It was originally collected from the 'Collins Gem Guide to Flags' (Collins Publishers, 1986). Note that this data is out-of-date. For instance, it still includes 'USSR' as a country.

Here is some basic information about the dataset:

- There are 194 instances (aka rows).
- There are 30 attributes in total (aka columns).
- 10 attributes are numeric-valued.  The remainder are either Boolean or nominal-valued.
- There are no missing values.

**Attribute Information**

1. name: Name of the country concerned
2. landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania
3. zone: Geographic quadrant, based on Greenwich and the Equator (1=NE, 2=SE, 3=SW, 4=NW)
4. area: in thousands of square km
5. population: in round millions
6. language: 1=English, 2=Spanish, 3=French, 4=German, 5=Slavic, 6=Other Indo-European, 7=Chinese, 8=Arabic, 9=Japanese/Turkish/Finnish/Magyar, 10=Others
7. religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others
8. bars: Number of vertical bars in the flag
9. stripes: Number of horizontal stripes in the flag
10. colors: Number of different colors in the flag
11. red: 0 if red absent, 1 if red present in the flag
12. green: same for green
13. blue: same for blue
14. gold: same for gold (also yellow)
15. white: same for white
16. black: same for black
17. orange: same for orange (also brown)
18. mainhue: predominant colour in the flag (tie-breaks decided by taking the topmost hue, if that fails then the most central hue, and if that fails the leftmost hue)
19. circles: Number of circles in the flag
20. crosses: Number of (upright) crosses
21. saltires: Number of diagonal crosses
22. quarters: Number of quartered sections
23. sunstars: Number of sun or star symbols
24. crescent: 1 if a crescent moon symbol present, else 0
25. triangle: 1 if any triangles present, 0 otherwise
26. icon: 1 if an inanimate image present (e.g., a boat), otherwise 0
27. animate: 1 if an animate image (e.g., an eagle, a tree, a human hand) present, 0 otherwise
28. text: 1 if any letters or writing on the flag (e.g., a motto or slogan), 0 otherwise
29. topleft: color in the top-left corner (moving right to decide tie-breaks)
30. botright: color in the bottom-right corner (moving left to decide tie-breaks)

## Initial Imports

In [5]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 50)

## Initial Exploration of Flags Dataset

In [6]:
# create list of column names
columns = ['name','landmass','zone','area','population','language','religion','num_bars','num_stripes','num_colors',
           'red','green','blue','gold','white','black','orange','mainhue','num_circles','num_crosses','num_saltires',
           'num_quarters','num_sunstars','crescent','triangle','icon','animate','text','topleft_color','botright_color']

# import data and show first five rows
flags = pd.read_csv('flag.data', names=columns)
flags.head()

Unnamed: 0,name,landmass,zone,area,population,language,religion,num_bars,num_stripes,num_colors,red,green,blue,gold,white,black,orange,mainhue,num_circles,num_crosses,num_saltires,num_quarters,num_sunstars,crescent,triangle,icon,animate,text,topleft_color,botright_color
0,Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,1,0,1,1,1,0,1,blue,0,0,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,1,0,1,1,0,0,0,gold,0,0,0,0,0,0,0,0,0,0,blue,red


In [7]:
# check size of dataset
flags.shape

(194, 30)

In [8]:
# check general information about dataset
flags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 194 entries, 0 to 193
Data columns (total 30 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   name            194 non-null    object
 1   landmass        194 non-null    int64 
 2   zone            194 non-null    int64 
 3   area            194 non-null    int64 
 4   population      194 non-null    int64 
 5   language        194 non-null    int64 
 6   religion        194 non-null    int64 
 7   num_bars        194 non-null    int64 
 8   num_stripes     194 non-null    int64 
 9   num_colors      194 non-null    int64 
 10  red             194 non-null    int64 
 11  green           194 non-null    int64 
 12  blue            194 non-null    int64 
 13  gold            194 non-null    int64 
 14  white           194 non-null    int64 
 15  black           194 non-null    int64 
 16  orange          194 non-null    int64 
 17  mainhue         194 non-null    object
 18  num_circle

In [9]:
# check general statistical information
flags.describe()

Unnamed: 0,landmass,zone,area,population,language,religion,num_bars,num_stripes,num_colors,red,green,blue,gold,white,black,orange,num_circles,num_crosses,num_saltires,num_quarters,num_sunstars,crescent,triangle,icon,animate,text
count,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0,194.0
mean,3.572165,2.21134,700.046392,23.268041,5.340206,2.190722,0.453608,1.551546,3.463918,0.78866,0.469072,0.510309,0.469072,0.752577,0.268041,0.134021,0.170103,0.149485,0.092784,0.149485,1.386598,0.056701,0.139175,0.252577,0.201031,0.082474
std,1.553018,1.308274,2170.927932,91.934085,3.496517,2.061167,1.038339,2.328005,1.300154,0.409315,0.500334,0.501187,0.500334,0.432631,0.444085,0.341556,0.463075,0.385387,0.290879,0.43586,4.396186,0.231869,0.347025,0.435615,0.401808,0.275798
min,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,3.0,1.0,9.0,0.0,2.0,1.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.0,2.0,111.0,4.0,6.0,1.0,0.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,5.0,4.0,471.25,14.0,9.0,4.0,0.0,3.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.75,0.0,0.0
max,6.0,4.0,22402.0,1008.0,10.0,7.0,5.0,14.0,8.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,2.0,1.0,4.0,50.0,1.0,1.0,1.0,1.0,1.0


## GroupBy Mechanics
### Basic Grouping

Hadley Wickham, an author of many popular packages for the R programming language, coined the term split-apply-combine for describing group operations. 

- First, the data is split into groups.
- Second, a function is applied to each group.
- Finally, the results are combined into a result object.

Here is a mockup of a simple group aggregation.

<img src='split_apply_combine.JPG'>
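
To make the three steps concrete, here is a minimal sketch that performs them by hand and then compares the result with a single `groupby` call. The `toy` DataFrame and its `key` and `value` columns are made up for illustration and are not part of the flags data.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the three steps
toy = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'a'],
                    'value': [1, 2, 3, 4, 5]})

# Split: collect the values that belong to each distinct key
pieces = {k: toy.loc[toy['key'] == k, 'value'] for k in toy['key'].unique()}

# Apply: reduce each piece to a single number
sums = {k: piece.sum() for k, piece in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(sums).sort_index()

# The same three steps expressed as one groupby call
auto = toy.groupby('key')['value'].sum()

print(manual)
print(auto)
```

Both Series hold one sum per key, which is why `groupby` can be read as shorthand for the manual loop.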

To get started, let's create a small dataset.

In [10]:
# create a small sample dataset
df = pd.DataFrame({'studio_key' : ['Marvel', 'Marvel', 'DC', 'DC', 'Marvel'],
     'department_key' : ['Production', 'Advertising', 'Production', 'Advertising', 'Production'],
     'data1' : [10,6,2,7,5],
     'data2' : [-1,4,-6,5,11]})
df

Unnamed: 0,studio_key,department_key,data1,data2
0,Marvel,Production,10,-1
1,Marvel,Advertising,6,4
2,DC,Production,2,-6
3,DC,Advertising,7,5
4,Marvel,Production,5,11


Suppose we wanted to compute the mean of `data1` grouped by the `studio_key` column.

In [11]:
# create Series GroupBy object using 'studio_key'
grouped = df['data1'].groupby(df['studio_key'])
grouped

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F363842AD0>

Now we can simply call the `mean` method on the GroupBy object.

In [12]:
# produces new Series
grouped.mean()

studio_key
DC        4.5
Marvel    7.0
Name: data1, dtype: float64

In [13]:
# chaining it all together
grouped = df['data1'].groupby(df['studio_key']).mean()
grouped

studio_key
DC        4.5
Marvel    7.0
Name: data1, dtype: float64

We can also easily pass multiple keys to be used by the GroupBy object. This actually creates a multi-index Series.

In [14]:
# passing multiple Series as a list
means = df['data1'].groupby([df['studio_key'], df['department_key']]).mean()
means

studio_key  department_key
DC          Advertising       7.0
            Production        2.0
Marvel      Advertising       6.0
            Production        7.5
Name: data1, dtype: float64

Remember that you can use `unstack()` to turn the multi-index Series into a DataFrame.

In [15]:
# using unstack() to create a DataFrame
means.unstack()

department_key,Advertising,Production
studio_key,Unnamed: 1_level_1,Unnamed: 2_level_1
DC,7.0,2.0
Marvel,6.0,7.5


Frequently the grouping information is found in the same DataFrame as the data you want to work on. In that case, you can pass column names as the group keys.

In [16]:
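# passing DataFrame column names as group keys
# numeric_only=True skips the non-numeric department_key column (required in pandas 2.0+)
df.groupby('studio_key').mean(numeric_only=True)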

# df['data1'].groupby('studio_key').mean() #produces error
# df['data1'].groupby(df['studio_key']).mean() #produces Series

Unnamed: 0_level_0,data1,data2
studio_key,Unnamed: 1_level_1,Unnamed: 2_level_1
DC,4.5,-0.5
Marvel,7.0,4.666667


Notice that there is no `department_key` in the above result. Because that column is not numeric, it is left out of the mean. Before pandas 2.0, non-numeric columns were dropped silently; from 2.0 onward you need to pass `numeric_only=True` to get the same result, otherwise an error is raised. It is also possible to filter down to a subset of columns, as we will soon see.

A very useful GroupBy method is `size`, which returns a Series containing group sizes.

In [17]:
# show group sizes
df.groupby(['studio_key', 'department_key']).size()

studio_key  department_key
DC          Advertising       1
            Production        1
Marvel      Advertising       1
            Production        2
dtype: int64
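
It is easy to confuse `size` with `count`. As a small sketch on made-up data (the `toy` frame and its `team` and `score` columns are hypothetical), `size` reports the number of rows in each group, while `count` reports only the non-null values:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
toy = pd.DataFrame({'team': ['x', 'x', 'y'],
                    'score': [1.0, np.nan, 3.0]})

# size: number of rows per group, missing values included
sizes = toy.groupby('team').size()

# count: number of non-null scores per group
counts = toy.groupby('team')['score'].count()

print(sizes)
print(counts)
```

Because the flags data has no missing values, the two would agree there; the difference only shows up once NaNs are present.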

### Student Practice
Try to perform the following tasks on the `flags` dataset. Then check your answers as I walk through the solutions. 

**Exercise:** Instantiate (create) a SeriesGroupBy object called `grouped_flags` that selects `population` from the `flags` dataset and groups it by `landmass`

*Note: landmass: 1=N.America, 2=S.America, 3=Europe, 4=Africa, 5=Asia, 6=Oceania*

In [18]:
grouped_flags = flags['population'].groupby(flags['landmass'])


**Exercise:** Using the `grouped_flags` object, what is the average population by landmass? What is the minimum population by landmass? What is the maximum population by landmass?

In [19]:
# average population by landmass
avg_pop = grouped_flags.mean()

# minimum population by landmass
min_pop = grouped_flags.min()

# maximum population by landmass
max_pop = grouped_flags.max()


print(f"average population by landmass: {avg_pop}")
print(f"minimum population by landmass: {min_pop}")
print(f"maximum population by landmass: {max_pop}")


average population by landmass: landmass
1    12.290323
2    15.705882
3    13.857143
4     8.788462
5    69.179487
6    11.300000
Name: population, dtype: float64
minimum population by landmass: landmass
1    0
2    0
3    0
4    0
5    0
6    0
Name: population, dtype: int64
maximum population by landmass: landmass
1     231
2     119
3      61
4      56
5    1008
6     157
Name: population, dtype: int64


**Exercise:** Instantiate an object called `flag_means` that selects the `population` and groups it by `zone`, then by `landmass` and calculates the mean of the population.

*Note: zone: Geographic quadrant, based on Greenwich and the Equator (1=NE, 2=SE, 3=SW, 4=NW)*

In [20]:
flag_means = flags['population'].groupby([flags['zone'], flags['landmass']]).mean()
flag_means

zone  landmass
1     3           13.500000
      4           12.789474
      5           69.179487
      6            9.600000
2     4            7.315789
      6           17.800000
3     2           22.000000
      4            0.000000
      6            0.000000
4     1           12.290323
      2            6.714286
      3           15.285714
      4            5.769231
Name: population, dtype: float64

**Exercise:** Turn `flag_means` into a DataFrame with `zone` as the rows and `landmass` as the columns. You should be able to do this using one pandas method. Take note of the missing values in the new DataFrame.

In [21]:
flag_means_df = flag_means.unstack()
flag_means_df

landmass,1,2,3,4,5,6
zone,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,,,13.5,12.789474,69.179487,9.6
2,,,,7.315789,,17.8
3,,22.0,,0.0,,0.0
4,12.290323,6.714286,15.285714,5.769231,,
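
The missing values appear because not every (`zone`, `landmass`) pair occurs in the data. If a placeholder is preferable to NaN, `unstack` accepts a `fill_value` argument. The sketch below uses a small hypothetical Series rather than `flag_means`; whether 0 is a sensible placeholder depends on the analysis.

```python
import pandas as pd

# Hypothetical two-level Series with one missing combination: (2, 'b')
idx = pd.MultiIndex.from_tuples([(1, 'a'), (1, 'b'), (2, 'a')],
                                names=['zone', 'kind'])
s = pd.Series([10.0, 20.0, 30.0], index=idx)

# Default: the missing combination becomes NaN
wide = s.unstack()

# With fill_value, the missing combination is replaced instead
filled = s.unstack(fill_value=0)

print(wide)
print(filled)
```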


**Exercise:** Group the entire `flags` dataset by `landmass` and compute the median of each numeric column.

In [22]:
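# numeric_only=True skips text columns such as name and mainhue (required in pandas 2.0+)
median_by_landmass = flags.groupby('landmass').median(numeric_only=True)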

median_by_landmass

Unnamed: 0_level_0,zone,area,population,language,religion,num_bars,num_stripes,num_colors,red,green,blue,gold,white,black,orange,num_circles,num_crosses,num_saltires,num_quarters,num_sunstars,crescent,triangle,icon,animate,text
landmass,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1
1,4.0,9.0,0.0,1.0,1.0,0.0,0.0,3.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3.0,407.0,6.0,2.0,0.0,0.0,3.0,3.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,92.0,8.0,6.0,1.0,0.0,0.0,3.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2.0,298.5,5.0,8.0,5.0,0.0,1.0,3.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,1.0,185.0,10.0,8.0,2.0,0.0,0.0,3.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6,2.0,2.0,0.0,1.0,1.0,0.0,0.0,4.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0,0.0


**Exercise:** How many countries are represented in each group if you group by `landmass` and then `zone`?

In [23]:
countries_per_group = flags.groupby(['landmass', 'zone']).size()
print(f"Countries represented: {countries_per_group}")

Countries represented: landmass  zone
1         4       31
2         3       10
          4        7
3         1       28
          4        7
4         1       19
          2       19
          3        1
          4       13
5         1       39
6         1        5
          2       10
          3        5
dtype: int64


### Iterating Over Groups

The GroupBy object supports iteration, generating a sequence of 2-tuples containing the group name along with the chunk of data.

Let's remind ourselves of our sample dataset.

In [24]:
# view DataFrame
df

Unnamed: 0,studio_key,department_key,data1,data2
0,Marvel,Production,10,-1
1,Marvel,Advertising,6,4
2,DC,Production,2,-6
3,DC,Advertising,7,5
4,Marvel,Production,5,11


In [25]:
# reminder: this creates a DataFrameGroupBy object
df.groupby('studio_key')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F363843280>

You can iterate through this object to find the name and group data.

In [26]:
# loop through object
for name, group in df.groupby('studio_key'):
    print(name)
    print(group)
    print('----')

DC
  studio_key department_key  data1  data2
2         DC     Production      2     -6
3         DC    Advertising      7      5
----
Marvel
  studio_key department_key  data1  data2
0     Marvel     Production     10     -1
1     Marvel    Advertising      6      4
4     Marvel     Production      5     11
----


In the case of multiple keys, the first element in the tuple will be a tuple of key values.

In [27]:
# loop through object
for (key1, key2), group in df.groupby(['studio_key', 'department_key']):
    print(f'key1: {key1}')
    print(f'key2: {key2}')
    print(group)
    print('----') 

key1: DC
key2: Advertising
  studio_key department_key  data1  data2
3         DC    Advertising      7      5
----
key1: DC
key2: Production
  studio_key department_key  data1  data2
2         DC     Production      2     -6
----
key1: Marvel
key2: Advertising
  studio_key department_key  data1  data2
1     Marvel    Advertising      6      4
----
key1: Marvel
key2: Production
  studio_key department_key  data1  data2
0     Marvel     Production     10     -1
4     Marvel     Production      5     11
----
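
Since iteration yields (name, group) pairs, one convenient pattern is to collect the pieces into a dictionary, or to pull out a single piece with `get_group`. This is a sketch on a trimmed copy of the sample data, so the values shown are only a subset of the `df` used above.

```python
import pandas as pd

# Trimmed copy of the sample data
df = pd.DataFrame({'studio_key': ['Marvel', 'Marvel', 'DC', 'DC'],
                   'data1': [10, 6, 2, 7]})

# Build a dictionary that maps each group name to its chunk of rows
pieces = {name: group for name, group in df.groupby('studio_key')}
dc = pieces['DC']

# get_group retrieves a single chunk without looping
same = df.groupby('studio_key').get_group('DC')

print(dc)
```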


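By default `groupby` groups on `axis=0` (the rows), but you can also group along the columns with `axis=1`. Note that `axis=1` is deprecated as of pandas 2.1, so newer versions emit a warning. For example, we could group the columns of our example `df` by `dtype` like so: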

In [28]:
# check data types
df.dtypes

studio_key        object
department_key    object
data1              int64
data2              int64
dtype: object

In [29]:
# print data type and respective group data
grouped = df.groupby(df.dtypes, axis=1)
for dtype, group in grouped:
    print(dtype)
    print(group)
    print('----')

int64
   data1  data2
0     10     -1
1      6      4
2      2     -6
3      7      5
4      5     11
----
object
  studio_key department_key
0     Marvel     Production
1     Marvel    Advertising
2         DC     Production
3         DC    Advertising
4     Marvel     Production
----
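
If the goal is simply to separate numeric columns from everything else, `select_dtypes` gives a similar split without relying on `axis=1`. This is a swapped-in technique rather than a column-wise `groupby`, shown here on a trimmed copy of the sample data.

```python
import pandas as pd

# Trimmed copy of the sample data
df = pd.DataFrame({'studio_key': ['Marvel', 'DC'],
                   'data1': [10, 2],
                   'data2': [-1, -6]})

# Keep only the numeric columns
numeric = df.select_dtypes(include='number')

# Keep everything that is not numeric
other = df.select_dtypes(exclude='number')

print(numeric)
print(other)
```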


### Selecting a Column or Subset of Columns

In [30]:
# groupby studio_key, average of 'data1', returns Series
df['data1'].groupby(df['studio_key']).mean()

studio_key
DC        4.5
Marvel    7.0
Name: data1, dtype: float64

In [31]:
# syntactic sugar for above
df.groupby('studio_key')['data1'].mean()

studio_key
DC        4.5
Marvel    7.0
Name: data1, dtype: float64

In [32]:
# groupby 'studio_key', average of data2, returns DataFrame
df[['data2']].groupby(df['studio_key']).mean()

Unnamed: 0_level_0,data2
studio_key,Unnamed: 1_level_1
DC,-0.5
Marvel,4.666667


In [33]:
# syntactic sugar for above
df.groupby('studio_key')[['data2']].mean()

Unnamed: 0_level_0,data2
studio_key,Unnamed: 1_level_1
DC,-0.5
Marvel,4.666667


Especially for large datasets, it may be desirable to aggregate only a few columns. For example, in the preceding dataset, to compute means for just the `data2` column and get the result as a DataFrame, we could write:

In [34]:
# grouped DataFrame
df.groupby(['studio_key', 'department_key'])[['data2']].mean()

Unnamed: 0_level_0,Unnamed: 1_level_0,data2
studio_key,department_key,Unnamed: 2_level_1
DC,Advertising,5.0
DC,Production,-6.0
Marvel,Advertising,4.0
Marvel,Production,5.0


In [35]:
# grouped Series
df.groupby(['studio_key', 'department_key'])['data2'].mean()

studio_key  department_key
DC          Advertising       5.0
            Production       -6.0
Marvel      Advertising       4.0
            Production        5.0
Name: data2, dtype: float64

### Grouping with Dictionaries and Series

To see different ways to work with grouping, let's create a DataFrame of quiz grades for students who have taken two courses.

In [36]:
# create another sample DataFrame
students = pd.DataFrame(np.random.randint(80,100,(5,5)),
                       columns=[1,2,3,4,5],
                       index=['Joe','Steve','Beth','Jim','Sue'])
# add some NA values
students.iloc[2:3,[1,2]] = np.nan 

students

Unnamed: 0,1,2,3,4,5
Joe,82,83.0,94.0,82,85
Steve,85,85.0,85.0,92,99
Beth,82,,,82,83
Jim,89,83.0,81.0,99,90
Sue,87,99.0,96.0,83,98


Let's say that we want to map each quiz to the course it belongs to. We can do this with a dictionary and then group the columns by that mapping.

In [37]:
# create mapping
mapping = {1:670,2:670,3:520,4:520,5:670,6:680}

In [38]:
# passing the mapping to the groupby object
by_column = students.groupby(mapping, axis=1)

# averaging by the grouped mapping; notice the unused key (6) in the mapping is OK
by_column.mean()

Unnamed: 0,520,670
Joe,88.0,83.333333
Steve,88.5,89.666667
Beth,82.0,82.5
Jim,90.0,87.333333
Sue,89.5,94.666667
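
On pandas versions where `axis=1` grouping warns or is unavailable, one workaround is to transpose, group the rows, and transpose back. The sketch below uses fixed scores for two students (a hypothetical stand-in for the random `students` data) with the same `mapping`.

```python
import numpy as np
import pandas as pd

# Fixed scores standing in for the randomly generated `students` data
students = pd.DataFrame(
    [[82, 83, 94, 82, 85],
     [82, np.nan, np.nan, 82, 83]],
    columns=[1, 2, 3, 4, 5],
    index=['Joe', 'Beth'])

mapping = {1: 670, 2: 670, 3: 520, 4: 520, 5: 670, 6: 680}

# Transpose so the quiz numbers become the index, group them by course,
# average within each course, then transpose back to one row per student
by_course = students.T.groupby(mapping).mean().T

print(by_course)
```

Missing scores are skipped when averaging, so Beth's course 520 average is based on her single recorded quiz.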


We can also work with Series:

In [39]:
# create Series from the mapping
map_series = pd.Series(mapping)
map_series

1    670
2    670
3    520
4    520
5    670
6    680
dtype: int64

In [40]:
# pass Series to groupby object
students.groupby(map_series, axis=1).count()

Unnamed: 0,520,670
Joe,2,3
Steve,2,3
Beth,1,2
Jim,2,3
Sue,2,3


### Grouping with Functions

Any function passed as a group key will be called once per index value.

As an example, let's say we wanted to group students based on how many letters were in their name and find their median score. (Why would we ever want to do this? Who knows. Just go along with me here.)

In [41]:
# group by index length (in this case student name)
students.groupby(len).median()

Unnamed: 0,1,2,3,4,5
3,87.0,83.0,94.0,83.0,90.0
4,82.0,,,82.0,83.0
5,85.0,85.0,85.0,92.0,99.0
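
Any callable that accepts an index label works as a key, not just built-ins like `len`. As a sketch with made-up scores (the `scores` frame and its `quiz` column are hypothetical), a lambda can group students by the first letter of their name:

```python
import pandas as pd

# Hypothetical scores indexed by student name
scores = pd.DataFrame({'quiz': [90, 80, 70, 60]},
                      index=['Joe', 'Jim', 'Sue', 'Steve'])

# The lambda is called once per index label; its return value is the group key
by_initial = scores.groupby(lambda name: name[0])['quiz'].mean()

print(by_initial)
```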


In [42]:
# let's rename the columns
students = students.rename(columns=mapping)
students

Unnamed: 0,670,670,520,520,670
Joe,82,83.0,94.0,82,85
Steve,85,85.0,85.0,92,99
Beth,82,,,82,83
Jim,89,83.0,81.0,99,90
Sue,87,99.0,96.0,83,98


In [43]:
# create a second key list
key_list = ['MA','NY','NY','MA','NY']

# groupby function, then by key_list
students.groupby([len, key_list]).mean()

Unnamed: 0,Unnamed: 1,670,670,520,520,670
3,MA,85.5,83.0,87.5,90.5,87.5
3,NY,87.0,99.0,96.0,83.0,98.0
4,NY,82.0,,,82.0,83.0
5,NY,85.0,85.0,85.0,92.0,99.0


### Grouping by Index Levels

You can easily aggregate using one of the levels of a multi-index. Let's add a `gender` column to our `students` data and create a multi-index DataFrame.

In [44]:
# adding gender column
students['gender'] = ['M','M','F','M','F']
students

Unnamed: 0,670,670,520,520,670,gender
Joe,82,83.0,94.0,82,85,M
Steve,85,85.0,85.0,92,99,M
Beth,82,,,82,83,F
Jim,89,83.0,81.0,99,90,M
Sue,87,99.0,96.0,83,98,F


In [45]:
# creating multi-index
students = students.reset_index().set_index(['index','gender'])
students

Unnamed: 0_level_0,Unnamed: 1_level_0,670,670,520,520,670
index,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Joe,M,82,83.0,94.0,82,85
Steve,M,85,85.0,85.0,92,99
Beth,F,82,,,82,83
Jim,M,89,83.0,81.0,99,90
Sue,F,87,99.0,96.0,83,98


In [46]:
# grouping by 'gender' index and counting number of quizzes taken by gender
students.groupby(level='gender').count()

Unnamed: 0_level_0,670,670,520,520,670
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
F,2,1,1,2,2
M,3,3,3,3,3


### Student Practice

Try to perform the following tasks on the `flags` dataset. Then check your answers as I walk through the solutions. 

In [47]:
# let's remind ourselves of our DataFrame
flags.head()

Unnamed: 0,name,landmass,zone,area,population,language,religion,num_bars,num_stripes,num_colors,red,green,blue,gold,white,black,orange,mainhue,num_circles,num_crosses,num_saltires,num_quarters,num_sunstars,crescent,triangle,icon,animate,text,topleft_color,botright_color
0,Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,1,0,1,1,1,0,1,blue,0,0,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,1,0,1,1,0,0,0,gold,0,0,0,0,0,0,0,0,0,0,blue,red


**Exercise:** What is the sum of the `num_crosses` grouped by `religion`?

*Note: religion: 0=Catholic, 1=Other Christian, 2=Muslim, 3=Buddhist, 4=Hindu, 5=Ethnic, 6=Marxist, 7=Others*

In [49]:
sum_crosses_by_religion = flags.groupby('religion')['num_crosses'].sum()
sum_crosses_by_religion

religion
0     2
1    26
2     0
3     1
4     0
5     0
6     0
7     0
Name: num_crosses, dtype: int64

**Exercise:** What is the sum of `crescent` grouped by `religion`?

In [51]:
sum_crescent = flags.groupby('religion')['crescent'].sum()
sum_crescent

religion
0    0
1    0
2    8
3    1
4    1
5    0
6    1
7    0
Name: crescent, dtype: int64

**Exercise:** What are the total value counts of all the colors in `mainhue` grouped by `religion`?

In [54]:
color_counts_by_religion = flags.groupby('religion')['mainhue'].value_counts()

color_counts_by_religion

religion  mainhue
0         red        15
          blue        9
          gold        6
          white       6
          green       3
          black       1
1         blue       24
          red        16
          white       9
          green       6
          gold        3
          black       1
          orange      1
2         red        15
          green      12
          gold        3
          black       2
          blue        2
          brown       1
          orange      1
3         red         4
          blue        1
          gold        1
          orange      1
          white       1
4         brown       1
          green       1
          orange      1
          red         1
5         red        10
          green       8
          gold        5
          blue        2
          black       1
          white       1
6         red        10
          blue        2
          white       2
          gold        1
7         white       3
          green       

**Exercise:** What is the maximum `area` for each group that is grouped by `zone` and then `religion`? What is the minimum area?

In [57]:
max_area = flags.groupby(['zone', 'religion']).max()
min_area = flags.groupby(['zone', 'religion']).min()


print(f"max area: {max_area}")
print(f"min area: {min_area}")



max area:                         name  landmass   area  population  language  num_bars  \
zone religion                                                                   
1    0          Vatican-City         6    547          57        10         3   
     1           Switzerland         6   1222          61        10         3   
     2                   UAE         5   2506          90        10         3   
     3              Thailand         5    678          49        10         2   
     4                 Nepal         5   3268         684        10         0   
     5                Uganda         4   1284          17        10         3   
     6            Yugoslavia         5  22402        1008        10         3   
     7                  Togo         5    372         118        10         0   
2    1               Vanuatu         6   7690          29        10         1   
     2             Indonesia         6   1904         157        10         0   
     4            
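
The cell above takes the maximum and minimum of every column, so `area` is only one of many columns in the printout. To answer the question for `area` alone, select that column before aggregating. The sketch below uses a tiny made-up frame (`toy`) in place of `flags`, and it borrows `agg`, which is covered later in this module, to get both statistics at once.

```python
import pandas as pd

# Hypothetical stand-in for the flags data
toy = pd.DataFrame({'name': ['A', 'B', 'C', 'D'],
                    'zone': [1, 1, 2, 2],
                    'religion': [0, 0, 1, 1],
                    'area': [10, 30, 5, 7]})

# Select `area` first, then compute max and min per (zone, religion) group
area_range = toy.groupby(['zone', 'religion'])['area'].agg(['max', 'min'])

print(area_range)
```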

**Exercise:** Let's try to determine if there are more colors or shapes on the country flags.
1. Create a subset of the `flags` data and call it `flags_subset`. This subset should include the following attributes: 'red', 'green', 'blue', 'gold', 'white', 'black', 'orange', 'num_circles', 'num_crosses', 'num_saltires', 'num_sunstars', 'crescent',  and 'triangle'
2. Create a dictionary that maps all colors to the string `color` and all shapes to the string `shape`
3. Use the mapping dictionary from step 2 to calculate the sum of the colors and shapes for each instance.
4. *Bonus:* Sum up all the colors and shapes for all instances to determine if there are more colors or shapes on all the flags.

In [None]:
### ENTER CODE HERE ###

### Data Aggregation

You can use aggregations of your own devising and additionally call any method that is also defined on the grouped object. 

Let's look at another simple example. First, let's create a DataFrame similar to `students` that holds the grades eight students received on five quizzes in one class.

In [None]:
# create another simple DataFrame
quiz_df = pd.DataFrame(np.random.randint(70,100,(8,5)),
                       columns=[1,2,3,4,5],
                       index=['Joe','Steve','Beth','Jim','Sue','James','Amy','Monika'])

# add a gender column for grouping
quiz_df['gender'] = ['M','M','F','M','F','M','F','F']
quiz_df

Unnamed: 0,1,2,3,4,5,gender
Joe,79,72,88,79,82,M
Steve,75,86,81,99,73,M
Beth,70,78,97,93,88,F
Jim,83,96,75,97,87,M
Sue,98,95,94,90,89,F
James,95,96,76,80,78,M
Amy,79,90,78,70,99,F
Monika,94,97,99,72,79,F


In [None]:
# groupby gender
grouped = quiz_df.groupby('gender')

In [None]:
# use agg and pass 'mean'
# notice you pass this as a string
grouped.agg('mean')

# notice that this is the same as passing the following:
grouped.mean()

Unnamed: 0_level_0,1,2,3,4,5
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
F,85.25,90.0,92.0,81.25,88.75
M,83.0,87.5,80.0,88.75,80.0


Now comes the fun part. Let's say that you wanted to know the range between the top score and the bottom score for each quiz, broken down by gender. We can create our own custom function to do this.

In [None]:
# create a simple custom function
def range_scores(arr):
    return arr.max() - arr.min()

# pass function to agg
grouped.agg(range_scores)

Unnamed: 0_level_0,1,2,3,4,5
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
F,28,19,21,23,20
M,20,24,13,20,14


Note that you can also call the `describe` method on a grouped object.

In [None]:
grouped[1].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
gender,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
F,4.0,85.25,13.047988,70.0,76.75,86.5,95.0,98.0
M,4.0,83.0,8.640988,75.0,78.0,81.0,86.0,95.0


Let's add another column that lists the student's level.

In [None]:
quiz_df['level'] = ['Senior','Junior','Senior','Senior','Junior','Junior','Senior','Junior']
quiz_df

Unnamed: 0,1,2,3,4,5,gender,level
Joe,79,72,88,79,82,M,Senior
Steve,75,86,81,99,73,M,Junior
Beth,70,78,97,93,88,F,Senior
Jim,83,96,75,97,87,M,Senior
Sue,98,95,94,90,89,F,Junior
James,95,96,76,80,78,M,Junior
Amy,79,90,78,70,99,F,Senior
Monika,94,97,99,72,79,F,Junior


Next, let's add an average quiz score for each student.

In [None]:
quiz_df['avg'] = quiz_df[[1,2,3,4,5]].mean(axis=1)
quiz_df

Unnamed: 0,1,2,3,4,5,gender,level,avg
Joe,79,72,88,79,82,M,Senior,80.0
Steve,75,86,81,99,73,M,Junior,82.8
Beth,70,78,97,93,88,F,Senior,85.2
Jim,83,96,75,97,87,M,Senior,87.6
Sue,98,95,94,90,89,F,Junior,93.2
James,95,96,76,80,78,M,Junior,85.0
Amy,79,90,78,70,99,F,Senior,83.2
Monika,94,97,99,72,79,F,Junior,88.2


Now, let's group by student level and then by gender. We will select only the `avg` column and see what the mean score is for the respective groupings.

In [None]:
# groupby level and gender
grouped = quiz_df.groupby(['level','gender'])

# selecting only the avg column
grouped_avg = grouped['avg']

# aggregating the mean
grouped_avg.agg('mean')


level   gender
Junior  F         90.7
        M         83.9
Senior  F         84.2
        M         83.8
Name: avg, dtype: float64

In [None]:
# same as above
grouped['avg'].mean()

level   gender
Junior  F         90.7
        M         83.9
Senior  F         84.2
        M         83.8
Name: avg, dtype: float64

If you pass a list of functions or function names instead, you get back a DataFrame with column names taken from the functions:

In [None]:
grouped_avg.agg(['mean','std',range_scores])

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,range_scores
level,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Junior,F,90.7,3.535534,5.0
Junior,M,83.9,1.555635,2.2
Senior,F,84.2,1.414214,2.0
Senior,M,83.8,5.374012,7.6


You can also store the functions in a list first and pass that list to `agg`.

In [None]:
# list of functions
functions = ['mean','std',range_scores]

grouped_avg.agg(functions)

Unnamed: 0_level_0,Unnamed: 1_level_0,mean,std,range_scores
level,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Junior,F,90.7,3.535534,5.0
Junior,M,83.9,1.555635,2.2
Senior,F,84.2,1.414214,2.0
Senior,M,83.8,5.374012,7.6


You can also change the name of the column when you aggregate like this:

In [None]:
grouped_avg.agg([('Average', 'mean'), ('Std Dev', 'std'), ('Range', range_scores)])

Unnamed: 0_level_0,Unnamed: 1_level_0,Average,Std Dev,Range
level,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Junior,F,90.7,3.535534,5.0
Junior,M,83.9,1.555635,2.2
Senior,F,84.2,1.414214,2.0
Senior,M,83.8,5.374012,7.6


Finally, let's say that we want to apply different functions to different columns of the DataFrame. You can pass a dictionary that maps each column name to the function to use for it:

In [None]:
grouped.agg({1:'mean', 2:'median', 3:'count'})

Unnamed: 0_level_0,Unnamed: 1_level_0,1,2,3
level,gender,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Junior,F,96.0,96.0,2
Junior,M,85.0,91.0,2
Senior,F,74.5,84.0,2
Senior,M,81.0,84.0,2
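A related option is pandas' named aggregation (available in pandas 0.25 and later), which chooses the column, the function, and the output column name in one step. The sketch below uses a small made-up DataFrame; the names `scores`, `quiz_1`, `quiz_2`, and the output column names are assumptions for illustration, not part of the quiz data above.

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only
scores = pd.DataFrame({
    'level':  ['Junior', 'Junior', 'Senior', 'Senior'],
    'quiz_1': [98, 95, 79, 70],
    'quiz_2': [95, 96, 72, 78],
})

# Each keyword is an output column name mapped to (input column, function)
summary = scores.groupby('level').agg(
    quiz_1_mean=('quiz_1', 'mean'),
    quiz_2_median=('quiz_2', 'median'),
    n_students=('quiz_1', 'count'),
)
print(summary)
```

Compared with the dictionary form, this avoids a separate renaming step when the result is headed for a report.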


### Apply (split-apply-combine)

The most general-purpose GroupBy method is `apply`. `apply` splits the object being manipulated into pieces, invokes the passed function on each piece, and then attempts to concatenate the pieces together.
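The three steps can be spelled out by hand to see roughly what `apply` does. This is only a sketch of the idea on a made-up DataFrame (`df`, `smallest`, and the column names are hypothetical), not how pandas implements `apply` internally.

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'key': ['a', 'b', 'a', 'b', 'a'],
    'value': [5, 1, 3, 4, 2],
})

def smallest(group, n=2):
    # return the n rows of this group with the lowest 'value'
    return group.sort_values(by='value')[:n]

# split: iterating over a GroupBy yields (key, sub-DataFrame) pairs
# apply: call the function on each piece
pieces = {key: smallest(group) for key, group in df.groupby('key')}

# combine: concatenate the pieces, using the group keys as an outer index level
manual = pd.concat(pieces, names=['key'])
print(manual)
```

The result carries a two-level index: the group key on the outside and the original row label on the inside.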

Let's start by creating a new sample DataFrame of flights from Vienna to Charlotte.

In [None]:
# create sample flights data
flights = pd.DataFrame({
    'airline': ['Delta','Delta','Delta','Delta','Delta','United','United','United','United','Lufthansa','Lufthansa',
                'Lufthansa','Lufthansa','Lufthansa','Lufthansa','British Airways','British Airways','British Airways'],
    'price': np.random.randint(600,1000,18),
    'time': np.random.randint(8,16,18)
})

flights

Unnamed: 0,airline,price,time
0,Delta,975,10
1,Delta,730,10
2,Delta,611,14
3,Delta,952,11
4,Delta,941,15
5,United,658,12
6,United,929,8
7,United,705,9
8,United,725,9
9,Lufthansa,734,14


Now, let's create a custom function that selects `n` rows with the lowest values in a particular column.

In [None]:
# create custom function
def best(df, n=3, column='price'):
    return df.sort_values(by=column)[:n]

In [None]:
# test on full data
best(flights)

Unnamed: 0,airline,price,time
2,Delta,611,14
5,United,658,12
14,Lufthansa,693,10


Now, let's group by `airline` and call `apply` with this function.

In [None]:
flights.groupby('airline').apply(best)

Unnamed: 0_level_0,Unnamed: 1_level_0,airline,price,time
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
British Airways,16,British Airways,709,11
British Airways,17,British Airways,913,13
British Airways,15,British Airways,983,9
Delta,2,Delta,611,14
Delta,1,Delta,730,10
Delta,4,Delta,941,15
Lufthansa,14,Lufthansa,693,10
Lufthansa,11,Lufthansa,715,12
Lufthansa,9,Lufthansa,734,14
United,5,United,658,12


You can also pass extra arguments to your function through the `apply` method. What if we wanted only the two shortest flights by airline?

In [None]:
# two shortest flights grouped by airline
flights.groupby('airline').apply(best, n=2, column='time')

Unnamed: 0_level_0,Unnamed: 1_level_0,airline,price,time
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
British Airways,15,British Airways,983,9
British Airways,16,British Airways,709,11
Delta,0,Delta,975,10
Delta,1,Delta,730,10
Lufthansa,13,Lufthansa,854,9
Lufthansa,12,Lufthansa,924,10
United,6,United,929,8
United,7,United,705,9


### Student Practice

Try to perform the following tasks on the `flags` dataset. Then check your answers as I walk through the solutions. 

In [None]:
flags.head()

Unnamed: 0,name,landmass,zone,area,population,language,religion,num_bars,num_stripes,num_colors,red,green,blue,gold,white,black,orange,mainhue,num_circles,num_crosses,num_saltires,num_quarters,num_sunstars,crescent,triangle,icon,animate,text,topleft_color,botright_color
0,Afghanistan,5,1,648,16,10,2,0,3,5,1,1,0,1,1,1,0,green,0,0,0,0,1,0,0,1,0,0,black,green
1,Albania,3,1,29,3,6,6,0,0,3,1,0,0,1,0,1,0,red,0,0,0,0,1,0,0,0,1,0,red,red
2,Algeria,4,1,2388,20,8,2,2,0,3,1,1,0,0,1,0,0,green,0,0,0,0,1,1,0,0,0,0,green,white
3,American-Samoa,6,3,0,0,1,1,0,0,5,1,0,1,1,1,0,1,blue,0,0,0,0,0,0,1,1,1,0,blue,red
4,Andorra,3,1,0,0,6,0,3,0,3,1,0,1,1,0,0,0,gold,0,0,0,0,0,0,0,0,0,0,blue,red


**Exercise:** Group the data by `zone` and determine the difference between the zone's largest area and its smallest area. Do this by creating a custom function and passing it to the `agg` method.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Add a new column called `pop_den` to the DataFrame that represents the respective country's population density. Population density is defined as the population divided by the area.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Group by `landmass` and determine the mean, median, standard deviation and range for the population density column. Call the columns 'Avg', 'Median', 'Std Dev' and 'Range' respectively.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Grouping the data by landmass, what is the max number of bars for each group, the average number of stripes, and the median value for number of colors?

In [None]:
### ENTER CODE HERE ###

**Exercise:** 
1. Create a custom function called `top` that returns the top 2 rows with the **largest** values in the `pop_den` column. 
2. Make sure that you do not include any rows with NaNs in the `pop_den` column.
3. Set up your function arguments so that you can change the number of rows to show and which column to sort by.
4. Group the `flags` data by `landmass` and use the apply function with your custom function.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Using the above custom function (`top`), return the top 3 rows with the highest `population` grouped by `zone`.

In [None]:
### ENTER CODE HERE ###

**Exercise:** 
1. Create a second custom function called `bottom` that returns the 2 rows with the smallest values in the `pop_den` column. Do not include any rows with a `pop_den` of `0`.
2. Group the `flags` data by `landmass` and use the apply function with your custom function.

In [None]:
### ENTER CODE HERE ###

### More Pivot Tables and Cross-Tabulation

A pivot table aggregates a table of data by one or more keys, arranging the data in a rectangle with some of the group keys along the rows and some along the columns. Pivot tables in Python with pandas are made possible through groupby combined with reshape operations utilizing hierarchical indexing.

To show this, let's add a new column to our `flights` data that shows if the flight is in the morning or afternoon/evening.

In [None]:
# setup list
part_of_day = ['AM','PM']

# use random choice to select AM/PM for each row
time_of_day = np.random.choice(part_of_day,18)
time_of_day

array(['PM', 'AM', 'PM', 'AM', 'PM', 'PM', 'AM', 'PM', 'AM', 'AM', 'AM',
       'PM', 'PM', 'PM', 'PM', 'PM', 'PM', 'AM'], dtype='<U2')

In [None]:
# add new column to data
flights['time_of_day'] = time_of_day
flights

Unnamed: 0,airline,price,time,time_of_day
0,Delta,975,10,PM
1,Delta,730,10,AM
2,Delta,611,14,PM
3,Delta,952,11,AM
4,Delta,941,15,PM
5,United,658,12,PM
6,United,929,8,AM
7,United,705,9,PM
8,United,725,9,AM
9,Lufthansa,734,14,AM


Suppose you wanted to compute a table of price and time averages arranged by airline and time of day.

In [None]:
# average of price/time by airline/time of day
flights.pivot_table(index=['airline','time_of_day'])

Unnamed: 0_level_0,Unnamed: 1_level_0,price,time
airline,time_of_day,Unnamed: 2_level_1,Unnamed: 3_level_1
British Airways,AM,913.0,13.0
British Airways,PM,846.0,10.0
Delta,AM,841.0,10.5
Delta,PM,842.333333,13.0
Lufthansa,AM,744.0,13.0
Lufthansa,PM,796.5,10.25
United,AM,827.0,8.5
United,PM,681.5,10.5


You can also restrict the table to a single column or a subset of columns.

In [None]:
# select only price
flights.pivot_table('price', index=['airline','time_of_day'])

Unnamed: 0_level_0,Unnamed: 1_level_0,price
airline,time_of_day,Unnamed: 2_level_1
British Airways,AM,913.0
British Airways,PM,846.0
Delta,AM,841.0
Delta,PM,842.333333
Lufthansa,AM,744.0
Lufthansa,PM,796.5
United,AM,827.0
United,PM,681.5


Or you can compute the average price and time broken down by time of day.

In [None]:
# average price/time broken down by time of day
flights.pivot_table(index=['airline'],columns='time_of_day')

Unnamed: 0_level_0,price,price,time,time
time_of_day,AM,PM,AM,PM
airline,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
British Airways,913.0,846.0,13.0,10.0
Delta,841.0,842.333333,10.5,13.0
Lufthansa,744.0,796.5,13.0,10.25
United,827.0,681.5,8.5,10.5
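To make the "groupby combined with reshape" description concrete, the sketch below builds one small pivot table two ways: with `pivot_table`, and by hand with `groupby`, `mean`, and `unstack`. The `trips` DataFrame is made up for illustration.

```python
import pandas as pd

# Hypothetical sample data
trips = pd.DataFrame({
    'airline': ['Delta', 'Delta', 'United', 'United', 'United'],
    'time_of_day': ['AM', 'PM', 'AM', 'PM', 'PM'],
    'price': [700, 900, 650, 800, 600],
})

# pivot table: mean price with airlines as rows and time of day as columns
pivoted = trips.pivot_table('price', index='airline', columns='time_of_day')

# the same numbers by hand: group, aggregate, then move the inner
# index level (time_of_day) into the columns
by_hand = (trips.groupby(['airline', 'time_of_day'])['price']
                .mean()
                .unstack('time_of_day'))

print(pivoted)
print(by_hand)
```

Both tables should show the same averages; `pivot_table` is mainly a convenience that bundles these steps and adds options such as `margins`.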


If you include `margins=True`, the table gains an `All` row and an `All` column that hold the statistic computed over all the data in that row or column, ignoring the other grouping key.

In [None]:
# include margins=True
flights.pivot_table('price', index='airline', columns='time_of_day', margins=True)

time_of_day,AM,PM,All
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
British Airways,913.0,846.0,868.333333
Delta,841.0,842.333333,841.8
Lufthansa,744.0,796.5,779.0
United,827.0,681.5,754.25
All,819.571429,797.090909,805.833333


In [None]:
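# verify the `All` margins by hand; a notebook cell displays only its last expression
flights[flights['airline'] == 'British Airways']['price'].mean()  # `All` column, British Airways row
flights[flights['time_of_day'] == 'AM']['price'].mean()           # `All` row, AM column
flights['price'].mean()                                           # overall mean (bottom-right cell)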

805.8333333333334

The default function for a pivot table is `mean` although you can change it with the `aggfunc` argument.

In [None]:
# using count
flights.pivot_table(['price'], index=['airline'], columns='time_of_day', margins=True, aggfunc='count')

Unnamed: 0_level_0,price,price,price
time_of_day,AM,PM,All
airline,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
British Airways,1,2,3
Delta,2,3,5
Lufthansa,2,4,6
United,2,2,4
All,7,11,18


A cross-tabulation (or crosstab) is a special case of a pivot table that computes group frequencies.

In [None]:
# using cross tab
pd.crosstab(flights['airline'], flights['time_of_day'], margins=True)

time_of_day,AM,PM,All
airline,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
British Airways,1,2,3
Delta,2,3,5
Lufthansa,2,4,6
United,2,2,4
All,7,11,18
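The frequencies that `crosstab` reports are simply group sizes. As a rough illustration on made-up data (the `trips` DataFrame is hypothetical), the same counts can be produced with `groupby`, `size`, and `unstack`:

```python
import pandas as pd

# Hypothetical sample data
trips = pd.DataFrame({
    'airline': ['Delta', 'Delta', 'United', 'United', 'United'],
    'time_of_day': ['AM', 'PM', 'AM', 'PM', 'PM'],
})

# crosstab counts how often each (airline, time_of_day) pair occurs
freq = pd.crosstab(trips['airline'], trips['time_of_day'])

# same counts by hand: group sizes, reshaped so time of day runs along the columns;
# fill_value=0 covers combinations that never occur
by_hand = trips.groupby(['airline', 'time_of_day']).size().unstack(fill_value=0)

print(freq)
print(by_hand)
```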


### Student Practice

Try to perform the following tasks on the `flags` dataset. Then check your answers as I walk through the solutions. 

**Exercise:** Create a pivot table using the `flags` data. Group the rows by `landmass` and use `median` as the aggregation function.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Create a pivot table grouping by `religion` in the index and summing the `crescent`, `num_crosses`, and `num_saltires`  columns.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Create a pivot table grouping by `religion` in the index and `zone` in the columns. Use the `sum` function and select the `crescent`, `num_crosses`, and `num_saltires` columns. Add a total for each row and for each column. 

In [None]:
### ENTER CODE HERE ###

**Exercise:** Notice that the output above contains many NaNs. Consult the [pandas documentation](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html) to find out how to compute the same pivot table with every NaN replaced by `0`.

In [None]:
### ENTER CODE HERE ###

**Exercise:** Using `crosstab`, compute the group frequencies for `landmass` vs `religion`.

In [None]:
### ENTER CODE HERE ###

## Building a Machine Learning Model

Now that we have learned about various aggregation and grouping operations, let's finish this module with a model to determine if we can predict a country's main religion based mostly on its flag's details.

Please note that I do not expect you to understand the rest of this code. You will learn more about this in future classes. This is meant as motivation for what you can do with this data. See the last cell for the results.

**Note:** If you attempt to run this code yourself, you will need to install the `scikit-learn` package, which is imported as `sklearn`.

In [None]:
# standard imports
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer

# dropping name column as this provides no additional data
flags = flags.drop('name', axis=1)

# creating features and response
X = flags.drop('religion', axis=1)
y = flags[['religion']]

# splitting data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# create list of numeric columns
num_col = ['area','population','num_bars','num_stripes','num_colors','num_circles','num_crosses',
          'num_saltires','num_quarters','num_sunstars','pop_den'] 

X_train_num = X_train[num_col]

# create pipeline for numeric columns to impute and scale data
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler())
])

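**Note:** `pop_den` is the population density column created in the Student Practice exercise above. If that column is missing from `flags`, this cell fails with `KeyError: "['pop_den'] not in index"`.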

In [None]:
# create list of numeric attributes
num_attribs = list(X_train_num)

# create list of attributes to be One-Hot-Encoded
OHE_attribs = ['landmass','zone','language','mainhue','topleft_color','botright_color']

# create full pipeline 
full_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('OHE', OneHotEncoder(), OHE_attribs),
    ], remainder='passthrough')

In [None]:
# run training and testing data through pipeline
X_train_prepared = full_pipeline.fit_transform(X_train)
X_test_prepared = full_pipeline.transform(X_test)

In [None]:
from sklearn.ensemble import RandomForestClassifier

# just using mostly default values for random forest, no grid search
rf = RandomForestClassifier(n_estimators=1000, random_state=42)
rf.fit(X_train_prepared, np.array(y_train).ravel())

In [None]:
# RandomForestClassifier performance on the test set
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score

y_preds = rf.predict(X_test_prepared)
acc_score_forest = accuracy_score(y_test, y_preds)
prec_score_forest = precision_score(y_test, y_preds, average='micro')
recall_score_forest = recall_score(y_test, y_preds, average='micro')

model_name = type(rf).__name__

print(f'Model {model_name} | Accuracy: {acc_score_forest}')
print(f'Model {model_name} | Precision: {prec_score_forest}')
print(f'Model {model_name} | Recall: {recall_score_forest}')

The bottom line is that we can predict a country's main religion with 69% accuracy based mostly on the details of its flag. This is not bad considering what little data we have to work with.

Thanks!