#### Indexing DataFrames

- df[col][row]
- df.loc[row_name(s),col_name(s)]
- df.iloc[row_index(es),col_index(es)]
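The three forms above can be compared on a tiny frame. This is a minimal sketch; the county names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three counties and two vote-share columns.
df = pd.DataFrame(
    {"Obama": [52.1, 45.3, 60.2], "Romney": [46.7, 53.5, 38.6]},
    index=["Adams", "Berks", "Centre"],
)

by_chain = df["Obama"]["Berks"]      # column first, then row label
by_label = df.loc["Berks", "Obama"]  # row label, then column label
by_position = df.iloc[1, 0]          # row position, then column position

# Label slices include both endpoints; positional slices exclude the stop.
label_slice = df.loc["Adams":"Berks"]
position_slice = df.iloc[0:1]
print(label_slice.shape, position_slice.shape)
```

All three lookups return the same cell; `.loc` and `.iloc` do it in a single indexing step.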


~~~
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election.loc['Perry':'Potter']

# Print the p_counties DataFrame
print(p_counties)

# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election.loc['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
print(p_counties_rev)
~~~

#### Combining filters

~~~
df[(df.salt >= 50) & (df.eggs < 200)] # and

df[(df.salt >= 50) | (df.eggs < 200)] # or
~~~
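The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python. A small sketch with made-up values, which also shows `~` for negation:

```python
import pandas as pd

# Hypothetical values chosen so each filter selects different rows.
df = pd.DataFrame({"salt": [40, 50, 60, 30], "eggs": [100, 250, 150, 300]})

both = df[(df.salt >= 50) & (df.eggs < 200)]       # and
either = df[(df.salt >= 50) | (df.eggs < 200)]     # or
neither = df[~((df.salt >= 50) | (df.eggs < 200))] # not (or)

print(both.index.tolist(), either.index.tolist(), neither.index.tolist())
```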

**Select columns with all nonzeros**

~~~
df.loc[:,df.all()]
~~~

**Select columns with any nonzeros**

~~~
df.loc[:,df.any()]
~~~

**Select columns with any NaNs**

~~~
df.loc[:,df.isnull().any()]
~~~

**Select columns without NaNs**

~~~
df.loc[:,df.notnull().all()]
~~~

**Drop rows with any NaNs**

~~~
df.dropna(how='any') # how='all' drops all-NaN rows/cols
~~~

**Filtering/modifying a column based on another**

~~~
df.loc[df.salt > 55, 'eggs'] += 5 # a single .loc call avoids chained assignment
~~~
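Chained indexing such as `df.eggs[mask] += 5` may write to a temporary copy instead of the frame. Passing the mask and the column label to one `.loc` call avoids that. A minimal sketch with made-up values:

```python
import pandas as pd

# Hypothetical values for illustration.
df = pd.DataFrame({"salt": [50, 60, 70], "eggs": [10, 20, 30]})

# Rows where salt > 55 get 5 added to their 'eggs' value, in place.
df.loc[df.salt > 55, "eggs"] += 5

print(df)
```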


~~~
# Select the 'age' and 'cabin' columns: df
df = titanic[['age','cabin']]

# Print the shape of df
print(df.shape)

# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)

# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)

# Drop columns in titanic with fewer than 1000 non-missing values
print(titanic.dropna(thresh=1000, axis='columns').info())
~~~

### DataFrame vectorized methods

**Convert to dozen units**

~~~
df.floordiv(12)
~~~

or

~~~
import numpy as np

np.floor_divide(df,12)
~~~

or

~~~
df.apply(lambda n: n//12)
~~~
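The three spellings should give the same result on integer data. A quick check on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical counts for illustration.
df = pd.DataFrame({"eggs": [12, 30, 47], "salt": [24, 5, 100]})

by_method = df.floordiv(12)              # DataFrame method
by_ufunc = np.floor_divide(df, 12)       # NumPy universal function
by_apply = df.apply(lambda n: n // 12)   # lambda applied to each column

print(by_method)
```

The method and ufunc forms are vectorized; `apply` with a lambda is the most flexible of the three but typically the slowest.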

#### Storing a transformation

~~~
df['dozens_of_eggs'] = df.eggs.floordiv(12)
~~~



#### Manipulating the index

~~~
df.index = df.index.str.upper()
~~~

or

~~~
df.index = df.index.map(str.upper)
~~~

**Defining columns using other columns**

~~~
df['salty_eggs'] = df.salt + df.dozens_of_eggs
~~~

~~~
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue','Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election['winner'].map(red_vs_blue)

# Print the output of election.head()
print(election.head())
~~~
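One detail worth knowing about `.map()` with a dictionary: values that have no matching key become NaN. A small sketch, where the third entry is a made-up value with no mapping:

```python
import pandas as pd

red_vs_blue = {"Obama": "blue", "Romney": "red"}

# 'Other' is a hypothetical value that is absent from the dictionary.
winner = pd.Series(["Obama", "Romney", "Other"])
color = winner.map(red_vs_blue)

print(color)
```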

#### Pandas Data Structures

- Key building blocks
	- Indexes: sequence of labels
	- Series: 1D array with Index
	- DataFrames: 2D array with Series as columns
- Indexes
	- Immutable (like dictionary keys)
	- Homogeneous in data type (like NumPy arrays)

#### Assigning the index

~~~
unemployment.index = unemployment['Zip']

del unemployment['Zip']
~~~

### Hierarchical indexing

**Setting index**

~~~
stocks =  stocks.set_index(['Symbol','Date'])
# MultiIndex

print(stocks.index.names)

stocks = stocks.sort_index() # sort the MultiIndex so label slicing works
~~~

**Indexing**

~~~
stocks.loc[('CSCO','2016-10-04')] # Tuple!

stocks.loc[('CSCO','2016-10-04'),'Volume']
~~~

**Slicing**

- Outermost index:

~~~
stocks.loc['AAPL']

stocks.loc['CSCO':'MSFT']
~~~

- Both indexes:

~~~
stocks.loc[(slice(None), slice('2016-10-03','2016-10-04')),:]
~~~

**Fancy indexing**

- Outermost index

~~~
stocks.loc[(['AAPL','MSFT'], '2016-10-05'),:]

stocks.loc[(['AAPL','MSFT'], '2016-10-05'),'Close']
~~~

- Innermost index

~~~
stocks.loc[('CSCO', ['2016-10-05','2016-10-03']),:]
~~~

~~~
# Look up data for NY in month 1: NY_month1
NY_month1 = sales.loc[('NY',1),:]

# Look up data for CA and TX in month 2: CA_TX_month2
CA_TX_month2 = sales.loc[(['CA','TX'],2),:]

# Look up data for all states in month 2: all_month2
all_month2 = sales.loc[(slice(None),2),:]
~~~
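`pd.IndexSlice` is an alternative spelling of the `slice(None)` tuples used above. The sketch below builds a small stand-in for `sales` (the numbers are made up) and checks that both spellings select the same rows.

```python
import pandas as pd

# Hypothetical stand-in for the sales data, indexed by (state, month).
sales = pd.DataFrame(
    {
        "state": ["CA", "CA", "NY", "NY", "TX", "TX"],
        "month": [1, 2, 1, 2, 1, 2],
        "eggs": [47, 110, 221, 77, 132, 205],
    }
).set_index(["state", "month"]).sort_index()

# A full tuple selects one row; adding a column label returns a scalar.
ny_month1_eggs = sales.loc[("NY", 1), "eggs"]

# All states in month 2, written two equivalent ways.
idx = pd.IndexSlice
with_slice = sales.loc[(slice(None), 2), :]
with_idx = sales.loc[idx[:, 2], :]

print(with_idx)
```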

#### Reshaping DataFrames

**Reshaping by pivoting**

~~~
trials.pivot(index='treatment',
		columns='gender',
		values='response')
~~~

**Pivoting multiple columns**

~~~
trials.pivot(index='treatment',
		columns='gender')
~~~
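A small sketch of both calls on a made-up `trials` frame. Omitting `values` pivots every remaining column and produces MultiIndex columns.

```python
import pandas as pd

# Hypothetical trial data: one row per (treatment, gender) pair.
trials = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "treatment": ["A", "A", "B", "B"],
        "gender": ["F", "M", "F", "M"],
        "response": [5, 3, 8, 9],
    }
)

# One value column: plain columns 'F' and 'M'.
single = trials.pivot(index="treatment", columns="gender", values="response")

# No 'values': both 'id' and 'response' are pivoted.
multi = trials.pivot(index="treatment", columns="gender")

print(single)
print(multi.columns.tolist())
```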

### Stacking & unstacking DataFrames

- Unstacking (long -> wider)

~~~
trials_by_gender = trials.unstack(level='gender')
~~~

or

~~~
trials_by_gender = trials.unstack(level=1)
~~~

- Stacking (wide -> longer)

~~~
trials_by_gender.stack(level='gender')
~~~

**Swapping levels**

~~~
swapped = df.swaplevel(0,1)

sorted_df = swapped.sort_index()
~~~
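The sketch below runs unstack, stack, and swaplevel on a made-up `trials` frame with a (treatment, gender) MultiIndex. Stacking the unstacked frame gets back to the long layout.

```python
import pandas as pd

# Hypothetical long-format data indexed by (treatment, gender).
trials = pd.DataFrame(
    {
        "treatment": ["A", "A", "B", "B"],
        "gender": ["F", "M", "F", "M"],
        "response": [5, 3, 8, 9],
    }
).set_index(["treatment", "gender"])

wide = trials.unstack(level="gender")    # gender moves into the columns
long_again = wide.stack(level="gender")  # gender moves back into the index

# Swap the two index levels, then sort so gender is the outer level.
swapped = trials.swaplevel(0, 1).sort_index()

print(wide)
print(swapped)
```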

#### Melting DataFrames

~~~
pd.melt(new_trials,id_vars=['treatment'],
	value_vars=['F','M'])
~~~

~~~
pd.melt(new_trials, id_vars=['treatment'],
	var_name='gender',value_name='response')
~~~
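Both calls can be combined into one. A minimal sketch on a made-up wide frame:

```python
import pandas as pd

# Hypothetical wide-format data: one column per gender.
new_trials = pd.DataFrame({"treatment": ["A", "B"], "F": [5, 8], "M": [3, 9]})

melted = pd.melt(
    new_trials,
    id_vars=["treatment"],    # columns to keep as identifiers
    value_vars=["F", "M"],    # columns to unpivot into rows
    var_name="gender",        # name for the former column labels
    value_name="response",    # name for the former cell values
)

print(melted)
```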

See pd.melt(): https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.melt.html

#### Pivot table

Works when there are repeated combinations of index and column values;
aggregates the duplicates (default: mean)

~~~
more_trials.pivot_table(index='treatment',
			columns='gender',
			values='response',
			aggfunc='count')
~~~
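The difference from `pivot()` shows up when a (treatment, gender) pair occurs more than once. In this sketch the data are made up, with two rows for treatment A and gender F.

```python
import pandas as pd

# Hypothetical data with a repeated (A, F) combination.
more_trials = pd.DataFrame(
    {
        "treatment": ["A", "A", "A", "B", "B"],
        "gender": ["F", "F", "M", "F", "M"],
        "response": [5, 7, 3, 8, 9],
    }
)

# pivot() cannot reshape duplicated index/column pairs.
pivot_failed = False
try:
    more_trials.pivot(index="treatment", columns="gender", values="response")
except ValueError:
    pivot_failed = True

# pivot_table() aggregates the duplicates instead (mean unless told otherwise).
means = more_trials.pivot_table(index="treatment", columns="gender", values="response")
counts = more_trials.pivot_table(
    index="treatment", columns="gender", values="response", aggfunc="count"
)

print(means)
print(counts)
```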

See pandas.DataFrame.pivot_table(): https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pivot_table.html

#### Groupby and count

~~~
sales.groupby('weekday').count()
~~~

- split by 'weekday'
- apply count() function on each group
- combine counts per group
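The three steps can be spelled out by hand and compared with the one-line `groupby` call. The `sales` frame here is a made-up stand-in.

```python
import pandas as pd

# Hypothetical sales data for illustration.
sales = pd.DataFrame(
    {
        "weekday": ["Sun", "Sun", "Mon", "Mon"],
        "city": ["Austin", "Dallas", "Austin", "Dallas"],
        "bread": [139, 237, 326, 456],
        "butter": [20, 45, 70, 98],
    }
)

# One-liner: counts non-null entries per column within each weekday.
counts = sales.groupby("weekday").count()

# Manual version: split into groups, apply a count, combine into a dict.
manual = {day: len(group) for day, group in sales.groupby("weekday")}

print(counts)
print(manual)
```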

#### Aggregation/Reduction

- Some reducing functions
	- mean()
	- std()
	- sum()
	- first(), last()
	- min(), max()

#### Groupby and sum

~~~
sales.groupby('weekday')[['bread','butter']].sum()
~~~

#### Groupby and mean: multi-level index

~~~
sales.groupby(['city','weekday']).mean()
~~~

#### Groupby and sum: by Series

~~~
customers = pd.Series([...]) # has same index as sales

sales.groupby(customers)['bread'].sum()
~~~
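A runnable version of the same idea, with made-up customer names filling in for the elided Series. The grouping Series only needs to share its index with `sales`.

```python
import pandas as pd

# Hypothetical data: both objects use the default 0..3 index.
sales = pd.DataFrame({"bread": [139, 237, 326, 456]})
customers = pd.Series(["Dave", "Alice", "Bob", "Alice"])

# Rows are grouped by the aligned values of 'customers'.
totals = sales.groupby(customers)["bread"].sum()

print(totals)
```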

**Categorical data**

~~~
sales['weekday'].unique()

sales['weekday'] = sales['weekday'].astype('category')
~~~

- Advantages:
	- uses less memory
	- speeds up operations like groupby()
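The memory claim is easy to check with `memory_usage(deep=True)`. This sketch uses a made-up column of 100,000 weekday labels; exact byte counts depend on the pandas version, so only the comparison matters.

```python
import pandas as pd

# Hypothetical column: four distinct labels repeated many times.
weekday = pd.Series(["Sun", "Mon", "Tue", "Wed"] * 25_000)
as_category = weekday.astype("category")

string_bytes = weekday.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(string_bytes, category_bytes)
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it is smaller when labels repeat.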


~~~
# Group titanic by 'pclass': by_class
by_class = titanic.groupby('pclass')

# Select 'age' and 'fare'
by_class_sub = by_class[['age','fare']]

# Aggregate by_class_sub by 'max' and 'median': aggregated
aggregated = by_class_sub.agg(['max','median'])

# Print the maximum age in each class
print(aggregated.loc[:, ('age','max')])

# Print the median fare in each class
print(aggregated.loc[:,('fare','median')])
~~~

~~~

# Read the CSV file into a DataFrame and sort the index: gapminder
gapminder = pd.read_csv('gapminder.csv',index_col=['Year','region','Country']).sort_index()

# Group gapminder by 'Year' and 'region': by_year_region
by_year_region = gapminder.groupby(level=['Year','region'])

# Define the function to compute spread: spread
def spread(series):
    return series.max() - series.min()

# Create the dictionary: aggregator
aggregator = {'population':'sum', 'child_mortality':'mean', 'gdp':spread}

# Aggregate by_year_region using the dictionary: aggregated
aggregated = by_year_region.agg(aggregator)

# Print the last 6 entries of aggregated 
print(aggregated.tail(6))
~~~

~~~
# Read file: sales
sales = pd.read_csv('sales.csv',index_col='Date',parse_dates=True)

# Create a groupby object: by_day
by_day = sales.groupby(sales.index.strftime('%a'))

# Create sum: units_sum
units_sum = by_day['Units'].sum()

# Print units_sum
print(units_sum)
~~~

### Groupby and transformation

- The z-score

~~~
def zscore(series):
	return (series - series.mean()) / series.std()
~~~

- The automobile dataset

~~~
auto = pd.read_csv('auto-mpg.csv')
~~~

- MPG z-score

~~~
zscore(auto['mpg']).head()
~~~

- MPG z-score by year

~~~
auto.groupby('yr')['mpg'].transform(zscore).head()
~~~

#### Apply transformation and aggregation

- The agg() method applies reduction
- The transform() method applies a function to each group and returns a result with the same index as the input
- In some cases, split-apply-combine operations do not neatly fall into aggregation or transformation: for those cases we use apply()

~~~
def zscore_with_year_and_name(group):
	df = pd.DataFrame(
			{'mpg': zscore(group['mpg']),
			'year': group['yr'],
			'name': group['name']})
	return df

auto.groupby('yr').apply(zscore_with_year_and_name).head()
~~~
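The practical difference between `agg()` and `transform()` is the shape of the result. A minimal sketch on a made-up six-row stand-in for `auto`:

```python
import pandas as pd

# Hypothetical mpg values for two model years.
auto = pd.DataFrame(
    {
        "yr": [70, 70, 70, 71, 71, 71],
        "mpg": [18.0, 15.0, 21.0, 25.0, 27.0, 29.0],
    }
)

def zscore(series):
    return (series - series.mean()) / series.std()

# Aggregation: one value per group.
per_year = auto.groupby("yr")["mpg"].agg("mean")

# Transformation: one value per original row, standardized within its year.
per_row = auto.groupby("yr")["mpg"].transform(zscore)

print(per_year)
print(per_row)
```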

#### groupby object

~~~
splitting = auto.groupby('yr')

print(type(splitting)) # DataFrameGroupBy (module path varies by pandas version)

print(type(splitting.groups)) # dict (or a dict subclass in newer pandas)

print(splitting.groups.keys()) # The keys are the years

# iteration
for group_name, group in splitting:
	avg = group['mpg'].mean()
	print(group_name,avg)

# iteration and filtering
for group_name, group in splitting:
	avg = group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
	print(group_name,avg)

# comprehension
chevy_means = {year:group.loc[group['name'].str.contains('chevrolet'), 'mpg'].mean()
		for year, group in splitting}

pd.Series(chevy_means)

# Boolean groupby
chevy = auto['name'].str.contains('chevrolet')

auto.groupby(['yr', chevy])['mpg'].mean()
~~~

~~~
# Create a Series of labels from the Boolean condition: under10
under10 = (titanic['age'] < 10).map({True:'under 10', False:'over 10'})

# Group by under10 and compute the survival rate
survived_mean_1 = titanic.groupby(under10)['survived'].mean()
print(survived_mean_1)

# Group by under10 and pclass and compute the survival rate
survived_mean_2 = titanic.groupby([under10,'pclass'])['survived'].mean()
print(survived_mean_2)
~~~

#### Two new DataFrame methods

- idxmax(): row or column label where the maximum value is located
- idxmin(): row or column label where the minimum value is located
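Both methods work per column by default and per row with `axis='columns'`. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical monthly values for illustration.
df = pd.DataFrame(
    {"eggs": [47, 221, 110], "salt": [89, 12, 50]},
    index=["Jan", "Feb", "Mar"],
)

max_rows = df.idxmax()                # row label of each column's maximum
min_rows = df.idxmin()                # row label of each column's minimum
max_cols = df.idxmax(axis="columns")  # column label of each row's maximum

print(max_rows)
print(max_cols)
```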


~~~
# Boolean mask for rows where 'Edition' is between 1952 and 1988: during_cold_war
during_cold_war = (medals['Edition'] >= 1952) & (medals['Edition'] <= 1988)

# Boolean mask for rows where 'NOC' is either 'USA' or 'URS': is_usa_urs
is_usa_urs = medals.NOC.isin(['USA','URS'])

# Use during_cold_war and is_usa_urs to create the DataFrame: cold_war_medals
cold_war_medals = medals.loc[during_cold_war & is_usa_urs]

# Group cold_war_medals by 'NOC'
country_grouped = cold_war_medals.groupby('NOC')

# Create Nsports
Nsports = country_grouped['Sport'].nunique().sort_values(ascending=False)

# Print Nsports
print(Nsports)
~~~

**Grouping the data**

~~~
france = medals.NOC == 'FRA'

france_grps = medals[france].groupby(['Edition','Medal'])

france_grps['Athlete'].count().head(10) # MultiIndex
~~~

**Reshaping data**

~~~
france_medals = france_grps['Athlete'].count().unstack()

france_medals.head(12) # Single level index
~~~