# A Brief Introduction to Pandas
### Part 2

## 3.1 Selection
Using .loc[] and .iloc[] (both are indexers, so they take square brackets rather than parentheses)
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

In [None]:
# Read in the complete version of the europe data, using the first column as the index
eur_data_final = pd.read_csv('./data/complete/eur_data_final.csv', index_col=0)
eur_data_final.head()

In [None]:
# Data from the eur_data_final df, represented as a python dictionary
countries_dict = {
    15: {
        'country': 'Italy', 
         'unemp_rate': 11.7, 
         'gdp': 1689824, 
         'median_income': 16237, 
         'total_pop': 59433744
        }
}

In [None]:
# With vanilla python, how do we get the word 'Italy' from a dictionary?
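
With a plain nested dictionary, each level is reached by chaining key lookups. A minimal sketch, repeating the `countries_dict` structure from the cell above so that it runs on its own:

```python
# Nested dictionary mirroring one row of the europe data
countries_dict = {
    15: {
        'country': 'Italy',
        'unemp_rate': 11.7,
        'gdp': 1689824,
        'median_income': 16237,
        'total_pop': 59433744,
    }
}

# The first key picks the "row" (15), the second key picks the "column"
country_name = countries_dict[15]['country']
print(country_name)
```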

In [None]:
# How do we do this with a dataframe?

In [None]:
# We can also get multiple columns

In [None]:
# Or an entire row/entry

In [None]:
# Or multiple rows and columns
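
The four selections above can be sketched with `.loc[row_labels, column_labels]`. The frame below is a hypothetical stand-in for `eur_data_final`: only the Italy row (label 15) comes from the dictionary earlier in this section, and the other two rows are made-up placeholders.

```python
import pandas as pd

# Toy stand-in for eur_data_final; rows 14 and 16 are invented placeholders
toy = pd.DataFrame(
    {
        'country': ['Placeholder A', 'Italy', 'Placeholder B'],
        'unemp_rate': [5.0, 11.7, 8.0],
        'gdp': [100, 1689824, 200],
    },
    index=[14, 15, 16],
)

# Single value: row label first, then column label
single = toy.loc[15, 'country']

# Multiple columns for one row (returns a Series)
several_cols = toy.loc[15, ['country', 'gdp']]

# An entire row/entry (returns a Series indexed by column name)
whole_row = toy.loc[15]

# Multiple rows and columns (returns a DataFrame)
block = toy.loc[[14, 15], ['country', 'unemp_rate']]
```

Passing a list of labels on either axis always keeps that axis, which is why the last selection comes back as a DataFrame.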

In [None]:
# We can also use python's index slicing syntax

In [None]:
# Select by column value (Pandas is smart!)
# DF must be indexed by country name
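
A small sketch of label slicing and selecting by value, assuming a frame indexed by country name. The country names and numbers here are fictional placeholders, not values from the europe data.

```python
import pandas as pd

# Toy frame indexed by (fictional) country name
toy = pd.DataFrame(
    {'unemp_rate': [4.0, 9.0, 6.0, 12.0], 'gdp': [10, 20, 30, 40]},
    index=pd.Index(['Arendelle', 'Borduria', 'Cascadia', 'Dorne'], name='country'),
)

# Label slices with .loc include BOTH endpoints, unlike positional slices
label_slice = toy.loc['Borduria':'Cascadia', ['gdp']]

# A boolean Series can be used as the row selector
high_unemp = toy.loc[toy['unemp_rate'] > 8]
```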

### Exercise - Use .loc[] to create a new dataframe with all countries from Cyprus to France (alphabetically) with gdp and total_pop columns.

### Exercise - What countries have a higher unemployment rate than Slovenia and have a lowercase 't' in their name?

In [None]:
# Select Slovenia unemployment values

# Generate comparison query

# Generate 'contains' query

# Make selection using queries
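
The same four steps can be sketched on fictional data, so the exercise itself stays open. The country names and rates below are invented; the pattern (a reference value, two boolean masks, then `&`) is the part that carries over.

```python
import pandas as pd

# Toy frame indexed by (fictional) country name
toy = pd.DataFrame(
    {'unemp_rate': [4.0, 9.0, 6.0, 12.0]},
    index=pd.Index(['Arendelle', 'Freedonia', 'Latveria', 'Ruritania'], name='country'),
)

# Reference value to compare against
reference_rate = toy.loc['Latveria', 'unemp_rate']

# Comparison query: True where the rate is higher than the reference
higher = toy['unemp_rate'] > reference_rate

# 'Contains' query: case-sensitive substring test on the index labels
has_t = toy.index.str.contains('t')

# Combine the masks element-wise with & and select
selected = toy.loc[higher & has_t]
```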


In [None]:
# Explore .iloc[]
# https://stackoverflow.com/questions/31593201/how-are-iloc-and-loc-different
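
A minimal sketch of the difference, using a made-up frame whose index labels deliberately do not match the row positions: `.loc[]` looks up labels, `.iloc[]` looks up integer positions.

```python
import pandas as pd

# Index labels (10, 20, 30) differ from positions (0, 1, 2) on purpose
toy = pd.DataFrame({'value': ['a', 'b', 'c']}, index=[10, 20, 30])

by_label = toy.loc[20, 'value']   # row whose LABEL is 20
by_position = toy.iloc[0, 0]      # first row, first column

label_rows = toy.loc[10:20]       # label slice: both endpoints included
position_rows = toy.iloc[0:1]     # positional slice: end excluded
```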


## 4.1 MultiIndexes (hierarchical indexes)
Pandas also supports MultiIndexes, which allow users to index by multiple values or groups of values.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html

In [None]:
parent_array = ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']
child_array = ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']
# Add arrays to one array (outer level first, inner level second)
arrays = [parent_array, child_array]

We want a multidimensional array of random numbers. Numpy for the win!

In [None]:
# Set the seed for number generation
np.random.seed(42)
# Create multidimensional array of pseudorandom numbers with shape 8,4
md_array = np.random.randn(8, 4)

In [None]:
# Convert md array to dataframe
multi_df = pd.DataFrame(md_array)

# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html
multi_df.index

In [None]:
# Add MultiIndex
multi_df = pd.DataFrame(md_array, index=arrays)
multi_df.index

In [None]:
# Update column names of multi_df
col_names = ['var1', 'var2', 'var3', 'var4']
multi_df.columns = col_names

In [None]:
# Select all columns and rows for 'bar'

In [None]:
# Select all rows for index foo and column var1
# Using loc, fancy indexing

# Using loc, bracket notation

# Using loc, dot notation

# Without loc
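
The selections named in the comments above can be sketched on a smaller, self-contained frame. The name `demo` and its values are assumptions for illustration; the same patterns should apply to `multi_df`.

```python
import numpy as np
import pandas as pd

# Two-level index: outer level first, inner level second
arrays = [
    ['bar', 'bar', 'foo', 'foo'],
    ['one', 'two', 'one', 'two'],
]
demo = pd.DataFrame(
    np.arange(8).reshape(4, 2), index=arrays, columns=['var1', 'var2']
)

# All columns and rows for 'bar' (the outer level is dropped from the result)
bar_rows = demo.loc['bar']

# All rows for index 'foo' and column 'var1', four equivalent spellings
foo_var1 = demo.loc['foo', 'var1']      # loc with row and column labels
foo_var1_b = demo.loc['foo']['var1']    # loc, then bracket notation
foo_var1_c = demo.loc['foo'].var1       # loc, then dot notation
foo_var1_d = demo['var1']['foo']        # without loc: column first, then row

# A tuple addresses one specific (outer, inner) row
one_cell = demo.loc[('foo', 'two'), 'var2']
```

The first spelling does the lookup in a single step, which is usually the safer habit; chained lookups such as the bracket and dot forms are fine for reading but can misbehave when assigning.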


## 4.2 Groupby
From the documentation: "A groupby operation involves some combination of splitting the object, applying a function, and combining the results."
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

This is closely tied to the split-apply-combine strategy: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html. The approach was outlined by Hadley Wickham in this paper: https://www.jstatsoft.org/article/view/v040i01.
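
To make split-apply-combine concrete, here is a sketch that does the three steps by hand and then with `groupby`. The frame and its column names are made up for illustration.

```python
import pandas as pd

# Made-up data: two regions, two rows each
toy = pd.DataFrame(
    {
        'region': ['north', 'south', 'north', 'south'],
        'area': [10.0, 20.0, 30.0, 60.0],
    }
)

# By hand: split into subsets, apply a function, combine the results
pieces = {}
for region in toy['region'].unique():
    subset = toy[toy['region'] == region]     # split
    pieces[region] = subset['area'].mean()    # apply
manual = pd.Series(pieces).sort_index()       # combine

# With groupby: the same three steps in one expression
grouped = toy.groupby('region')['area'].mean()
```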


In [None]:
# Read in the UN world data
un_data = pd.read_csv('./data/complete/un_world_data.csv')

This data is wide! Let's get rid of some of the columns.

In [None]:
# Columns to keep
columns_to_keep = [
    'country',
    'Region',
    'Surface area (km2)',
    'GDP: Gross domestic product (million current US$)', 
    'Population in thousands (2017)', 
    'Population density (per km2, 2017)'
    ]

In [None]:
# Rename the columns 
columns = {
    "Surface area (km2)": "", 
    "GDP: Gross domestic product (million current US$)": "",
    "Population in thousands (2017)": "",
    "Population density (per km2, 2017)": ''
    }

In [None]:
# Group the data by region
un_region = un_data.groupby('Region', as_index=False)
# The groupby object is iterable

Note: generally speaking, you want to avoid iteration with Pandas. It's best to leverage the power of vectorized operations. If you find yourself looping through a dataframe or a series, you might be doing unnecessary work. https://towardsdatascience.com/you-dont-always-have-to-loop-through-rows-in-pandas-22a970b347ac
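
A short sketch of both points, on made-up data: iterating a groupby object yields `(group name, sub-dataframe)` pairs, and the per-group loop can usually be replaced by a single vectorized call.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {
        'region': ['north', 'north', 'south'],
        'population': [100, 300, 50],
    }
)
grouped = toy.groupby('region')

# Iterating yields (name, sub-DataFrame) pairs
group_sizes = {name: len(group) for name, group in grouped}

# Looping to compute totals works, but scales poorly
loop_totals = {name: group['population'].sum() for name, group in grouped}

# Vectorized equivalent: one call, no explicit loop
vector_totals = grouped['population'].sum()
```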

In [None]:
# Select Caribbean group
carib = un_region.get_group('Caribbean')

# Get the head and tail

### Exercise - Get the average surface area of all the countries in CentralAmerica

We can do vectorized operations on each group object.

In [None]:
# .min(), .max(), .mean(), .count()

Let's take a look at aggregating the data using the .agg() method: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html

In [None]:

# Get the mean of each column for each region

In [None]:
# Get mean, median, and sum
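
A sketch of passing several function names to `.agg()` at once, on made-up data. Each numeric column gets one result column per function, so the output has two-level column labels of the form `(column, function)`.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {
        'region': ['north', 'north', 'south'],
        'area': [10.0, 30.0, 50.0],
        'population': [100, 300, 50],
    }
)

# One row per region, one column per (original column, function) pair
summary = toy.groupby('region').agg(['mean', 'median', 'sum'])

north_mean_area = summary.loc['north', ('area', 'mean')]
south_total_pop = summary.loc['south', ('population', 'sum')]
```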

We can also pass other functions to .agg(), or define our own. This works because .agg() is a higher-order function, i.e. a function that takes other functions as arguments.



In [None]:
# Example of higher-order function
def do_calculation(val, func):
    return func(val)
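
A few calls show how `do_calculation` behaves: the second argument is itself a function, and it is applied to the first argument. The inputs below are arbitrary examples.

```python
def do_calculation(val, func):
    return func(val)

# Built-in functions can be passed by name, without calling them
total = do_calculation([1, 2, 3], sum)
magnitude = do_calculation(-4, abs)

# So can functions we define ourselves
def square(x):
    return x ** 2

squared = do_calculation(3, square)
print(total, magnitude, squared)
```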

In [None]:
# Example of filtering

Let's use the statistics module from scipy to calculate some new values: https://docs.scipy.org/doc/scipy/reference/stats.html

In [None]:
# stats.tsem(), stats.tstd(), stats.skew()

In [None]:
# Use a lambda function
# https://www.geeksforgeeks.org/python-lambda-anonymous-functions-filter-map-reduce/
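
A lambda is handy when the aggregation is a one-off. The sketch below computes the spread (max minus min) of a made-up `area` column per region; inside the lambda, `s` is the Series of values for one group.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {
        'region': ['north', 'north', 'south'],
        'area': [10.0, 30.0, 50.0],
    }
)

# The lambda receives each group's values as a Series
spread = toy.groupby('region')['area'].agg(lambda s: s.max() - s.min())
```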

In [None]:
# Get the max of each column in each region

In [None]:
# Select a single column

In [None]:
# Get sum and mean of surface area, mean of population for each region

In [None]:
# Define custom column names
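
Two ways of aggregating different columns differently, sketched on made-up data: a dictionary mapping columns to functions, and named aggregation, where each keyword becomes an output column name. The output names chosen here (`total_area` and so on) are arbitrary.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {
        'region': ['north', 'north', 'south'],
        'area': [10.0, 30.0, 50.0],
        'population': [100, 300, 50],
    }
)

# Dictionary form: column name -> function or list of functions
per_column = toy.groupby('region').agg(
    {'area': ['sum', 'mean'], 'population': 'mean'}
)

# Named aggregation: new_name=(source column, function)
named = toy.groupby('region').agg(
    total_area=('area', 'sum'),
    mean_area=('area', 'mean'),
    mean_pop=('population', 'mean'),
)
```

Named aggregation gives flat, readable column names, which avoids the two-level column labels produced by the dictionary form.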

In [None]:
# .filter()
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.filter.html
# https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html

In [None]:
# Filter by population
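
A sketch of group-level filtering on made-up data. The function passed to `.filter()` receives each group as a dataframe and returns True or False; the result contains the original rows of every group that passed, not one row per group. The threshold of 100 is arbitrary.

```python
import pandas as pd

# Made-up data for illustration
toy = pd.DataFrame(
    {
        'region': ['north', 'north', 'south'],
        'population': [100, 300, 50],
    }
)

# Keep only the rows of regions whose total population exceeds 100
big_regions = toy.groupby('region').filter(
    lambda group: group['population'].sum() > 100
)
```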

### Exercise - Get standard deviation of gdp for each region using np.std for all regions with a population density over 100

### Exercise - Generate a correlation plot for the UN data

In [None]:
# Select columns

# Make subplot and figure

# Generate correlation matrix

# Generate matplotlib plot

# Add colorbar to figure

# Set tick labels
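
One possible shape for these steps, sketched on random placeholder data rather than the UN data. The column names are assumptions, and the `Agg` backend line is only there so the sketch runs without a display; in a notebook it can be dropped.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random placeholder data standing in for the selected numeric columns
rng = np.random.default_rng(42)
toy = pd.DataFrame(
    rng.normal(size=(20, 3)),
    columns=['surface_area', 'gdp', 'population'],
)

# Make subplot and figure
fig, ax = plt.subplots(figsize=(5, 4))

# Generate correlation matrix (pairwise Pearson correlation by default)
corr = toy.corr()

# Generate matplotlib plot, fixing the color scale to the range [-1, 1]
image = ax.imshow(corr.to_numpy(), cmap='coolwarm', vmin=-1, vmax=1)

# Add colorbar to figure
fig.colorbar(image, ax=ax)

# Set tick positions and labels
ax.set_xticks(range(len(corr.columns)))
ax.set_yticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha='right')
ax.set_yticklabels(corr.columns)
fig.tight_layout()
```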