# Manipulating DataFrames with pandas
## What you will learn
- Extracting, filtering, and transforming data from DataFrames
- Advanced indexing with multiple levels
- Tidying, rearranging and restructuring your data
- Pivoting, melting, and stacking DataFrames
- Identifying and splitting DataFrames by groups

### Index ordering
In this exercise, the DataFrame `election` is provided for you. It contains the 2012 US election results for the state of Pennsylvania with county names as row indices. Your job is to select `'Bedford'` county and the `'winner'` column.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
data_file = '../_datasets/pennsylvania2012_turnout.csv'
election = pd.read_csv(data_file, index_col = 'county')
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118


In [3]:
election.loc['Bedford','winner']

'Romney'

### Positional and labeled indexing
Given a pair of label-based indices, sometimes it's necessary to find the corresponding positions. In this exercise, you will use the Pennsylvania election results again. 

Find `x` and `y` such that 
```Python
election.iloc[x, y] == election.loc['Bedford', 'winner']
```
That is, what is the row position of `'Bedford'`, and the column position of `'winner'`? 

In [4]:
# Assign the row position of election.loc['Bedford']: x
x = 4

# Assign the column position of election['winner']: y
y = 4

# Print the boolean equivalence
print(election.iloc[x, y] == election.loc['Bedford', 'winner'])

True
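
Rather than counting rows and columns by hand, the positions can also be looked up programmatically. The sketch below is one way to do it with `Index.get_loc()`. It assumes a small stand-in frame, `election_sample` (an illustrative name, not part of the exercise), built from the five rows printed by `election.head()` above.

```python
import pandas as pd

# Small stand-in for `election`, built from the first rows shown above
election_sample = pd.DataFrame(
    {
        "state": ["PA", "PA", "PA", "PA", "PA"],
        "total": [41973, 614671, 28322, 80015, 21444],
        "Obama": [35.482334, 56.640219, 30.696985, 46.032619, 22.057452],
        "Romney": [63.112001, 42.185820, 67.901278, 52.637630, 76.986570],
        "winner": ["Romney", "Obama", "Romney", "Romney", "Romney"],
    },
    index=pd.Index(
        ["Adams", "Allegheny", "Armstrong", "Beaver", "Bedford"], name="county"
    ),
)

# Look up the integer positions that correspond to the labels
x = election_sample.index.get_loc("Bedford")
y = election_sample.columns.get_loc("winner")
print(x, y)  # 4 4

# Positional and label-based lookups return the same value
print(election_sample.iloc[x, y] == election_sample.loc["Bedford", "winner"])  # True
```

`get_loc()` returns a single integer only when the label is unique in the index, which holds for the county names and column names used here.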


### Indexing and column rearrangement
There are circumstances in which it's useful to modify the order of your DataFrame columns. We do that now by extracting just three columns from the Pennsylvania election results DataFrame, in an order different from the original.

You'll assign a new DataFrame by selecting the list of columns `['winner', 'total', 'voters']`. 

In [5]:
# Create a separate dataframe with the columns ['winner', 'total', 'voters']: results
results = election[['winner', 'total', 'voters']]

# Print the output of results.head()
results.head()

county,winner,total,voters
Adams,Romney,41973,61156
Allegheny,Obama,614671,924351
Armstrong,Romney,28322,42147
Beaver,Romney,80015,115157
Bedford,Romney,21444,32189


# Slicing DataFrames
### Slicing rows
The Pennsylvania US election results data set that you have been using so far is ordered by county name. This means that county names can be sliced alphabetically. In this exercise, you're going to perform slicing on the county names of the `election` DataFrame.

In [6]:
# Slice the row labels 'Perry' to 'Potter': p_counties
p_counties = election['Perry':'Potter']

# Print the p_counties DataFrame
p_counties.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Perry,PA,18240,29.769737,68.591009,Romney,27245,66.948064,38.821272
Philadelphia,PA,653598,85.224251,14.051451,Obama,1099197,59.461407,71.1728
Pike,PA,23164,43.904334,54.882576,Romney,41840,55.363289,10.978242
Potter,PA,7205,26.259542,72.158223,Romney,10913,66.022175,45.898681


In [7]:
# Slice the row labels 'Potter' to 'Perry' in reverse order: p_counties_rev
p_counties_rev = election['Potter':'Perry':-1]

# Print the p_counties_rev DataFrame
p_counties_rev.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin
Potter,PA,7205,26.259542,72.158223,Romney,10913,66.022175,45.898681
Pike,PA,23164,43.904334,54.882576,Romney,41840,55.363289,10.978242
Philadelphia,PA,653598,85.224251,14.051451,Obama,1099197,59.461407,71.1728
Perry,PA,18240,29.769737,68.591009,Romney,27245,66.948064,38.821272
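
Label slices behave differently from ordinary Python slices: both endpoints are included. The following sketch contrasts label-based and position-based slicing on the four counties printed above. The frame name `counties` and the choice of columns are illustrative assumptions.

```python
import pandas as pd

# The four 'P' counties from the output above, with two of their columns
counties = pd.DataFrame(
    {
        "winner": ["Romney", "Obama", "Romney", "Romney"],
        "total": [18240, 653598, 23164, 7205],
    },
    index=pd.Index(["Perry", "Philadelphia", "Pike", "Potter"], name="county"),
)

# Label-based slice: both 'Perry' and 'Pike' are included
by_label = counties.loc["Perry":"Pike"]
print(list(by_label.index))  # ['Perry', 'Philadelphia', 'Pike']

# Position-based slice: the stop position 2 is excluded
by_position = counties.iloc[0:2]
print(list(by_position.index))  # ['Perry', 'Philadelphia']

# Reverse label slice: start at 'Potter' and step backwards to 'Perry'
reversed_rows = counties.loc["Potter":"Perry":-1]
print(list(reversed_rows.index))  # ['Potter', 'Pike', 'Philadelphia', 'Perry']
```

Label slicing of this kind relies on the index being sorted, as the county names are here.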


### Slicing columns
Similar to row slicing, columns can be sliced by label. In this exercise, your job is to slice column names from the Pennsylvania election results DataFrame using `.loc[]`.

In [8]:
# Slice the columns from the starting column to 'Obama': left_columns
left_columns = election.loc[:,:'Obama']

# Print the output of left_columns.head()
left_columns.head()

county,state,total,Obama
Adams,PA,41973,35.482334
Allegheny,PA,614671,56.640219
Armstrong,PA,28322,30.696985
Beaver,PA,80015,46.032619
Bedford,PA,21444,22.057452


In [9]:
# Slice the columns from 'Obama' to 'winner': middle_columns
middle_columns = election.loc[:,'Obama':'winner']

# Print the output of middle_columns.head()
middle_columns.head()

county,Obama,Romney,winner
Adams,35.482334,63.112001,Romney
Allegheny,56.640219,42.18582,Obama
Armstrong,30.696985,67.901278,Romney
Beaver,46.032619,52.63763,Romney
Bedford,22.057452,76.98657,Romney


In [10]:
# Slice the columns from 'Romney' to the end: 'right_columns'
right_columns = election.loc[:,'Romney':]

# Print the output of right_columns.head()
right_columns.head()

county,Romney,winner,voters,turnout,margin
Adams,63.112001,Romney,61156,68.632677,27.629667
Allegheny,42.18582,Obama,924351,66.497575,14.454399
Armstrong,67.901278,Romney,42147,67.19814,37.204293
Beaver,52.63763,Romney,115157,69.483401,6.605012
Bedford,76.98657,Romney,32189,66.619031,54.929118


### Subselecting DataFrames with lists
You can use lists to select specific row and column labels with the `.loc[]` accessor. In this exercise, your job is to select the counties `['Philadelphia', 'Centre', 'Fulton']` and the columns `['winner','Obama','Romney']` from the `election` DataFrame.

In [11]:
# Create the list of row labels: rows
rows = ['Philadelphia', 'Centre', 'Fulton']

# Create the list of column labels: cols
cols = ['winner', 'Obama', 'Romney']

# Create the new DataFrame: three_counties
three_counties = election.loc[rows,cols]

# Print the three_counties DataFrame
three_counties

county,winner,Obama,Romney
Philadelphia,Obama,85.224251,14.051451
Centre,Romney,48.948416,48.977486
Fulton,Romney,21.096291,77.748861


# Filtering DataFrames
### Thresholding data
In this exercise, we have provided the Pennsylvania election results and included a column called `'turnout'` that contains the percentage of voter turnout per county. Your job is to prepare a boolean array and use it to select the rows (with all of their columns) where voter turnout exceeded 70%.

In [12]:
# Create the boolean array: high_turnout
high_turnout = election.turnout > 70

# Filter the election DataFrame with the high_turnout array: high_turnout_df
high_turnout_df = election[high_turnout]

# Print the high_turnout_df DataFrame
high_turnout_df

county,state,total,Obama,Romney,winner,voters,turnout,margin
Bucks,PA,319407,49.96697,48.801686,Obama,435606,73.324748,1.165284
Butler,PA,88924,31.920516,66.816607,Romney,122762,72.436096,34.896091
Chester,PA,248295,49.228539,49.650617,Romney,337822,73.498766,0.422079
Forest,PA,2308,38.734835,59.835355,Romney,3232,71.410891,21.10052
Franklin,PA,62802,30.110506,68.583803,Romney,87406,71.850903,38.473297
Montgomery,PA,401787,56.637223,42.286834,Obama,551105,72.905708,14.35039
Westmoreland,PA,168709,37.567646,61.306154,Romney,238006,70.884347,23.738508
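
Boolean arrays can also be combined before filtering. The sketch below shows one way to do this with `&` (and) and `|` (or). It assumes a small illustrative frame, `sample`, copied from the first five rows of `election` shown earlier, and the thresholds are chosen only so that the tiny sample returns some rows.

```python
import pandas as pd

# Illustrative subset of `election`, copied from the first rows shown earlier
sample = pd.DataFrame(
    {
        "winner": ["Romney", "Obama", "Romney", "Romney", "Romney"],
        "turnout": [68.632677, 66.497575, 67.198140, 69.483401, 66.619031],
    },
    index=pd.Index(
        ["Adams", "Allegheny", "Armstrong", "Beaver", "Bedford"], name="county"
    ),
)

# Both conditions must hold; each comparison needs its own parentheses
romney_and_high = sample[(sample["turnout"] > 67) & (sample["winner"] == "Romney")]
print(list(romney_and_high.index))  # ['Adams', 'Armstrong', 'Beaver']

# Either condition may hold
obama_or_high = sample[(sample["turnout"] > 69) | (sample["winner"] == "Obama")]
print(list(obama_or_high.index))  # ['Allegheny', 'Beaver']
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.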


### Filtering columns using other columns
The election results DataFrame has a column labeled `'margin'` which expresses the number of extra votes the winner received over the losing candidate. This number is given as a percentage of the total votes cast. It is reasonable to assume that in counties where this margin was less than `1%`, the results would be too close to call.

Your job is to use boolean selection to filter the rows where the margin was less than 1. You'll then convert these rows of the `'winner'` column to `np.nan` to indicate that these results are too close to declare a winner.

In [13]:
# Create the boolean array: too_close
too_close = election.margin < 1

# Assign np.nan to the 'winner' column where the results were too close to call
election.loc[too_close,'winner'] = np.nan

# Print the output of election.info()
print(election.info())

<class 'pandas.core.frame.DataFrame'>
Index: 67 entries, Adams to York
Data columns (total 8 columns):
state      67 non-null object
total      67 non-null int64
Obama      67 non-null float64
Romney     67 non-null float64
winner     64 non-null object
voters     67 non-null int64
turnout    67 non-null float64
margin     67 non-null float64
dtypes: float64(4), int64(2), object(2)
memory usage: 5.4+ KB
None
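
To see the effect of the masked assignment in isolation, here is a minimal sketch on three rows taken from the outputs above. The frame name `close_sample` is an illustrative assumption.

```python
import numpy as np
import pandas as pd

# Three rows copied from the outputs above; `close_sample` is an illustrative name
close_sample = pd.DataFrame(
    {
        "winner": ["Romney", "Obama", "Romney"],
        "margin": [27.629667, 1.165284, 0.422079],
    },
    index=pd.Index(["Adams", "Bucks", "Chester"], name="county"),
)

too_close = close_sample["margin"] < 1

# A single .loc call selects rows and column together, so the assignment
# modifies close_sample itself rather than a temporary copy
close_sample.loc[too_close, "winner"] = np.nan

print(close_sample["winner"].isna().sum())   # 1
print(close_sample["winner"].notna().sum())  # 2
```

Only Chester has a margin below 1 in this sample, so it is the only row whose `'winner'` becomes missing.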


### Filtering using NaNs
In certain scenarios, it may be necessary to remove rows and columns with missing data from a DataFrame. The `.dropna()` method is used to perform this action. You'll now practice using this method on a dataset obtained from [Vanderbilt University][1], which consists of data from passengers on the Titanic.

You will focus specifically on the `'age'` and `'cabin'` columns in this exercise. Your job is to use `.dropna()` first to remove rows where either of these two columns contains missing data, and then to remove only the rows where both columns contain missing data.

You'll also use the `.shape` attribute to see the effect of dropping missing values from a DataFrame. For a DataFrame, `.shape` returns the number of rows and columns as a tuple; for a Series, it returns a one-element tuple containing the number of rows.

Finally, you'll use the `thresh=` keyword argument to drop columns from the full dataset that have fewer than 1000 non-missing values.

[1]: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html

In [14]:
data_file = '../_datasets/titanic.csv'
titanic = pd.read_csv(data_file)
titanic.head()

,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [15]:
# Select the 'age' and 'cabin' columns: df
df = titanic.loc[:,['age','cabin']]

# Print the shape of df
print(df.shape)

df.info()

(1309, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 2 columns):
age      1046 non-null float64
cabin    295 non-null object
dtypes: float64(1), object(1)
memory usage: 15.4+ KB


In [16]:
# Drop rows in df with how='any' and print the shape
print(df.dropna(how='any').shape)

# Drop rows in df with how='all' and print the shape
print(df.dropna(how='all').shape)

# Drop columns in titanic with less than 1000 non-missing values
print(titanic.dropna(thresh=1000, axis='columns').info())

(272, 2)
(1069, 2)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
pclass      1309 non-null int64
survived    1309 non-null int64
name        1309 non-null object
sex         1309 non-null object
age         1046 non-null float64
sibsp       1309 non-null int64
parch       1309 non-null int64
ticket      1309 non-null object
fare        1308 non-null float64
embarked    1307 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 81.9+ KB
None
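
The three `.dropna()` variants are easier to compare on a frame small enough to check by eye. The sketch below uses a made-up four-row frame named `small` (not the Titanic data) with the same column names.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only; not taken from the Titanic dataset
small = pd.DataFrame(
    {
        "age": [29.0, np.nan, 2.0, np.nan],
        "cabin": ["B5", "C22", np.nan, np.nan],
        "fare": [211.3375, 151.55, 151.55, 151.55],
    }
)
two_cols = small[["age", "cabin"]]

# how='any': drop a row if at least one value is missing (only row 0 survives)
print(two_cols.dropna(how="any").shape)  # (1, 2)

# how='all': drop a row only if every value is missing (only row 3 is dropped)
print(two_cols.dropna(how="all").shape)  # (3, 2)

# thresh=3 on columns: keep columns with at least 3 non-missing values
kept = small.dropna(thresh=3, axis="columns")
print(list(kept.columns))  # ['fare']
```

Note that `thresh` counts non-missing values that must be present to keep a row or column, so it is a minimum for keeping rather than a maximum for dropping.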


# Transforming DataFrames
### Using apply() to transform a column
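The `.apply()` method can be used on a pandas DataFrame **to apply an arbitrary Python function along an axis**: by default the function receives each column as a Series. When the function uses only arithmetic, as `to_celsius()` below does, the result is the same as transforming every element. In this exercise you'll take daily weather data in Pittsburgh in 2013 obtained from [Weather Underground][2] and convert temperatures from degrees Fahrenheit to degrees Celsius.

Your job is to use the `.apply()` method to perform this conversion on the `'Mean TemperatureF'` and `'Mean Dew PointF'` columns of the `weather` DataFrame.

[2]: https://www.wunderground.com/history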

In [17]:
# Write a function to convert degrees Fahrenheit to degrees Celsius: to_celsius
def to_celsius(F):
    return 5/9*(F - 32)

In [18]:
data_file = '../_datasets/pittsburgh2013.csv'
weather = pd.read_csv(data_file, parse_dates=True, index_col='Date')

weather.head()

Date,Max TemperatureF,Mean TemperatureF,Min TemperatureF,Max Dew PointF,Mean Dew PointF,Min DewpointF,Max Humidity,Mean Humidity,Min Humidity,Max Sea Level PressureIn,...,Max VisibilityMiles,Mean VisibilityMiles,Min VisibilityMiles,Max Wind SpeedMPH,Mean Wind SpeedMPH,Max Gust SpeedMPH,PrecipitationIn,CloudCover,Events,WindDirDegrees
2013-01-01,32,28,21,30,27,16,100,89,77,30.1,...,10,6,2,10,8,,0.0,8,Snow,277
2013-01-02,25,21,17,14,12,10,77,67,55,30.27,...,10,10,10,14,5,,0.0,4,,272
2013-01-03,32,24,16,19,15,9,77,67,56,30.25,...,10,10,10,17,8,26.0,0.0,3,,229
2013-01-04,30,28,27,21,19,17,75,68,59,30.28,...,10,10,6,23,16,32.0,0.0,4,,250
2013-01-05,34,30,25,23,20,16,75,68,61,30.42,...,10,10,10,16,10,23.0,0.21,5,,221


In [19]:
# Apply the function over 'Mean TemperatureF' and 'Mean Dew PointF': df_celsius
df_celsius = weather[['Mean TemperatureF','Mean Dew PointF']].apply(to_celsius)

# Reassign the column names of df_celsius
df_celsius.columns = ['Mean TemperatureC', 'Mean Dew PointC']

# Print the output of df_celsius.head()
df_celsius.head()

Date,Mean TemperatureC,Mean Dew PointC
2013-01-01,-2.222222,-2.777778
2013-01-02,-6.111111,-11.111111
2013-01-03,-4.444444,-9.444444
2013-01-04,-2.222222,-7.222222
2013-01-05,-1.111111,-6.666667
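
It can help to check what `.apply()` actually hands to the function. The sketch below uses a small frame, `temps`, built from the first three days shown above, together with a helper `record_type()`; both names are illustrative and not part of the exercise.

```python
import pandas as pd


def to_celsius(F):
    return 5 / 9 * (F - 32)


# First three days of the two columns shown above
temps = pd.DataFrame(
    {"Mean TemperatureF": [28, 21, 24], "Mean Dew PointF": [27, 12, 15]}
)

# Record the type of object that DataFrame.apply passes to the function
seen_types = set()


def record_type(column):
    seen_types.add(type(column).__name__)
    return column


temps.apply(record_type)
print(seen_types)  # {'Series'}

# Because to_celsius only uses arithmetic, calling it on the whole frame
# gives the same result as applying it column by column
same = temps.apply(to_celsius).equals(to_celsius(temps))
print(same)  # True
```

Since the function receives whole columns, a function that only works on single values (for example one using `if F > 32:`) would need `Series.apply()` or `.map()` on one column instead.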


### Using .map() with a dictionary
The `.map()` method is used to transform values according to a Python dictionary look-up. In this exercise you'll practice this method while returning to working with the `election` DataFrame.

Your job is to use a dictionary to map the values `'Obama'` and `'Romney'` in the `'winner'` column to the values `'blue'` and `'red'`, and assign the output to the new column `'color'`.

In [20]:
# Create the dictionary: red_vs_blue
red_vs_blue = {'Obama':'blue','Romney':'red'}

# Use the dictionary to map the 'winner' column to the new column: election['color']
election['color'] = election.winner.map(red_vs_blue)

# Print the output of election.head()
election.head()

county,state,total,Obama,Romney,winner,voters,turnout,margin,color
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667,red
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399,blue
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293,red
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012,red
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118,red
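
One detail worth knowing is what `.map()` does with values that have no entry in the dictionary, such as the counties whose winner was set to `np.nan` earlier. The sketch below illustrates this on a four-value Series. The name `winner` and the selection of counties are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Chester's winner is missing, as after the too-close-to-call step above
winner = pd.Series(
    ["Romney", "Obama", np.nan, "Romney"],
    index=["Adams", "Allegheny", "Chester", "Beaver"],
)

red_vs_blue = {"Obama": "blue", "Romney": "red"}

# Values without a matching dictionary key come back as missing
color = winner.map(red_vs_blue)
print(color.tolist())  # ['red', 'blue', nan, 'red']
```

Any value absent from the dictionary, not only missing ones, maps to a missing value in the result.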


### Using vectorized functions
**When performance is paramount, you should avoid using `.apply()` and `.map()`** because those constructs perform Python for-loops over the data stored in a pandas Series or DataFrame. By using **vectorized functions** instead, **you can loop over the data at the same speed as compiled code** (C, Fortran, etc.)! NumPy, SciPy and pandas come with a variety of vectorized functions (called Universal Functions or UFuncs in NumPy).

**_You can even write your own vectorized functions, but for now we will focus on the ones distributed by NumPy and pandas._**

In this exercise you're going to import the `zscore` function from `scipy.stats` and use it to compute how far voter turnout in each Pennsylvania county deviates from the mean, measured in standard deviations. **In statistics, the [z-score][3] $Z = {\frac{(x-\mu)}{\sigma}}$ is the number of standard deviations by which an observation is above the mean**, so a negative value means the observation is below the mean.

Instead of using `.apply()` as you did in the earlier exercises, you will pass a pandas Series directly to the vectorized `zscore` function, which returns a NumPy array. You will then assign the values of the NumPy array to a new column in the `election` DataFrame.

[3]: https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/z-score/

In [21]:
# Import zscore from scipy.stats
from scipy.stats import zscore 

# Call zscore with election['turnout'] as input: turnout_zscore
turnout_zscore = zscore(election.turnout)

# Print the type of turnout_zscore
print(type(turnout_zscore))

# Assign turnout_zscore to a new column: election['turnout_zscore']
election['turnout_zscore'] = turnout_zscore

# Print the output of election.head()
election.head()

<class 'numpy.ndarray'>


county,state,total,Obama,Romney,winner,voters,turnout,margin,color,turnout_zscore
Adams,PA,41973,35.482334,63.112001,Romney,61156,68.632677,27.629667,red,0.853734
Allegheny,PA,614671,56.640219,42.18582,Obama,924351,66.497575,14.454399,blue,0.439846
Armstrong,PA,28322,30.696985,67.901278,Romney,42147,67.19814,37.204293,red,0.57565
Beaver,PA,80015,46.032619,52.63763,Romney,115157,69.483401,6.605012,red,1.018647
Bedford,PA,21444,22.057452,76.98657,Romney,32189,66.619031,54.929118,red,0.463391
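
The same z-score can be written directly with vectorized pandas arithmetic, which makes the formula explicit. The sketch below assumes a five-value Series, `turnout`, copied from the rows above, so the resulting numbers differ from the notebook output, which standardizes over all 67 counties.

```python
import pandas as pd

# Turnout for the first five counties shown above
turnout = pd.Series(
    [68.632677, 66.497575, 67.198140, 69.483401, 66.619031],
    index=["Adams", "Allegheny", "Armstrong", "Beaver", "Bedford"],
)

# Z = (x - mu) / sigma, computed for the whole Series at once.
# ddof=0 gives the population standard deviation, matching the default
# of scipy.stats.zscore; pandas' own default for .std() is ddof=1.
z = (turnout - turnout.mean()) / turnout.std(ddof=0)

# After standardizing, the mean is 0 and the standard deviation is 1
# (up to floating-point rounding)
print(round(abs(z.mean()), 6), round(z.std(ddof=0), 6))
```

Unlike `zscore`, this version returns a pandas Series that keeps the county labels, so it can be assigned to a new column without relying on row order.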
