### Accessing Elements of a DataFrame

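Use .loc['label'] to access rows by index label

Use .iloc[position] to access rows by integer position

Use ['name'] to access a column by name, or [['name1', 'name2']] to access several columns

Use .values to get a 2D NumPy array of only the values, without column names or row indexes. Be careful with data types: if the columns have different types, the array falls back to a common type such as object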
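As a minimal sketch of the four access patterns listed above, the example below uses a small, hypothetical slice of ridership data (three days, three stations). The variable names are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical 3-day, 3-station slice of ridership data, for illustration only
df = pd.DataFrame(
    data=[[0, 0, 2],
          [1478, 3877, 3674],
          [1613, 4088, 3991]],
    index=['05-01-11', '05-02-11', '05-03-11'],
    columns=['R003', 'R004', 'R005']
)

row_by_label = df.loc['05-02-11']          # Series for the row labeled '05-02-11'
row_by_position = df.iloc[1]               # Series for the second row (position 1)
one_column = df['R004']                    # Series for a single column
two_columns = df[['R003', 'R005']]         # DataFrame with only the listed columns
single_value = df.loc['05-03-11', 'R005']  # one cell, by row label and column label

values = df.values                         # 2D NumPy array, labels dropped
print(values.dtype)                        # an integer dtype: every column is integer

# Mixing column types forces .values to fall back to a generic object array
mixed = pd.DataFrame({'station': ['R003', 'R004'], 'riders': [1478, 3877]})
print(mixed.values.dtype)                  # object
```

Here .loc['05-02-11'] and .iloc[1] return the same row, because the label '05-02-11' sits at position 1.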

In [1]:
import pandas as pd

# Subway ridership for 5 stations on 10 different days
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)

DataFrame w/ Dictionary:

    df_1 = pd.DataFrame({'A': [0, 1, 2], 'B': [3, 4, 5]})
    
DataFrame w/ lists of lists or 2D NumPy array:

    df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=['A', 'B', 'C'])

In [18]:
ridership_df.iloc[0].idxmax()

'R006'

In [20]:
ridership_df[ridership_df.iloc[0].idxmax()].mean()

3239.9000000000001

In [13]:
ridership_df.mean().mean()

2342.6000000000004

In [19]:
def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparison.
    
    This is the same as a previous exercise, but this time the
    input is a Pandas DataFrame rather than a 2D NumPy array.
    '''
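    # Use the function's argument rather than the global ridership_df
    overall_mean = ridership.mean().mean()
    # idxmax() returns the column label of the busiest station on the first day
    mean_for_max = ridership[ridership.iloc[0].idxmax()].mean()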
    
    return (overall_mean, mean_for_max)

### DataFrames are a great data structure to represent CSVs

Use .head() to show the first 5 rows of a table

Use .describe() to see summary statistics (count, mean, std, min, quartiles, max) for each numeric column
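A minimal sketch of both calls follows. The small table is hypothetical and stands in for data that would normally come from pd.read_csv; the column names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical table standing in for data loaded with pd.read_csv
df = pd.DataFrame({
    'station': ['R003', 'R004', 'R005', 'R006', 'R007', 'R008'],
    'entries': [1478, 3877, 3674, 2328, 2539, 1613],
})

first_rows = df.head()    # first 5 rows by default; df.head(n) returns the first n
summary = df.describe()   # summary statistics, computed for the numeric columns

print(first_rows)
print(summary)
```

By default describe() skips the non-numeric 'station' column here, so the summary has a single 'entries' column.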

### Calculating Correlation

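### Pearson's r

1. Standardize each variable
2. Multiply each pair of values, and take the average
3. r = average of (x in std units) * (y in std units)

If r > 0, then as one variable increases the other tends to increase as well; if r < 0, one tends to decrease as the other increases

Ranges from -1 to +1: values near +1 or -1 indicate a strong linear relationship, values near 0 a weak one

When standardizing a variable, pass ddof=0 to the Pandas .std function. This parameter controls whether the corrected or uncorrected standard deviation is taken, and Pandas defaults to the corrected one (ddof=1). For Pearson's r, we want the uncorrected standard deviation (ddof=0)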

In [25]:
subway_df = pd.read_csv('nyc_subway_weather.csv')

def correlation(x, y):
    '''
    Compute the correlation between the two input variables. 
    Each input is either a NumPy array or a Pandas Series.
    
    correlation = average of (x in standard units) times (y in standard units)
    '''
    x_std = (x-x.mean()) / x.std(ddof=0)
    y_std = (y-y.mean()) / y.std(ddof=0)
    return (x_std * y_std).mean()

entries = subway_df['ENTRIESn_hourly']
cum_entries = subway_df['ENTRIESn']
rain = subway_df['meanprecipi']
temp = subway_df['meantempi']

print(correlation(entries, rain))
print(correlation(entries, temp))
print(correlation(rain, temp))
print(correlation(entries, cum_entries))

0.0356485157722
-0.0266933483216
-0.229034323408
0.585895470766
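One way to sanity-check the correlation function above is to compare it against NumPy's np.corrcoef and Pandas' Series.corr on a small input. The sketch below redefines the function so it runs on its own; the input values are hypothetical and chosen only so that r is easy to verify by hand.

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Same approach as above: standardize with the uncorrected std (ddof=0),
    # then average the products
    x_std = (x - x.mean()) / x.std(ddof=0)
    y_std = (y - y.mean()) / y.std(ddof=0)
    return (x_std * y_std).mean()

# Hypothetical values, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

r_manual = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]   # off-diagonal entry of the 2x2 correlation matrix
r_pandas = x.corr(y)                # Series.corr computes Pearson's r by default

print(r_manual, r_numpy, r_pandas)

# With the Pandas default (ddof=1), the averaged products come out as
# r * (n - 1) / n rather than r
x_corrected = (x - x.mean()) / x.std()
y_corrected = (y - y.mean()) / y.std()
r_wrong = (x_corrected * y_corrected).mean()
print(r_wrong)
```

For these inputs all three methods agree on r = 0.8, while leaving out ddof=0 gives 0.64, which shows why the argument matters on small samples.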


In [26]:
entries_and_exits = pd.DataFrame({
    'ENTRIESn': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITSn': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})

In [36]:
shifted_entries_and_exits = entries_and_exits.shift()

In [37]:
entries_and_exits.diff()

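|   | ENTRIESn | EXITSn |
|---|----------|--------|
| 0 | NaN      | NaN    |
| 1 | 23.0     | 8.0    |
| 2 | 18.0     | 18.0   |
| 3 | 71.0     | 54.0   |
| 4 | 170.0    | 44.0   |
| 5 | 214.0    | 42.0   |
| 6 | 87.0     | 11.0   |
| 7 | 10.0     | 3.0    |
| 8 | 36.0     | 89.0   |
| 9 | 153.0    | 333.0  |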


In [38]:
entries_and_exits - shifted_entries_and_exits

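|   | ENTRIESn | EXITSn |
|---|----------|--------|
| 0 | NaN      | NaN    |
| 1 | 23.0     | 8.0    |
| 2 | 18.0     | 18.0   |
| 3 | 71.0     | 54.0   |
| 4 | 170.0    | 44.0   |
| 5 | 214.0    | 42.0   |
| 6 | 87.0     | 11.0   |
| 7 | 10.0     | 3.0    |
| 8 | 36.0     | 89.0   |
| 9 | 153.0    | 333.0  |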


In [39]:
def get_hourly_entries_and_exits(entries_and_exits):
    '''
    Takes a DataFrame with cumulative entries
    and exits (entries in the first column, exits in the second) and
    returns a DataFrame with hourly entries and exits (entries in the
    first column, exits in the second).
    '''
    shifted_entries_and_exits = entries_and_exits.shift()
    return entries_and_exits - shifted_entries_and_exits

### DataFrame applymap()

applymap() calls a function on each element of a DataFrame

apply() does something different with DataFrames: it applies the function to each **column**
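The difference between the two calls can be sketched on a small, hypothetical DataFrame. One assumption to flag: pandas 2.1 and later expose the element-wise operation as DataFrame.map and deprecate applymap, so the sketch picks whichever method the installed version provides.

```python
import pandas as pd

# Hypothetical data, for illustration only
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Element-wise: use DataFrame.map where it exists (newer pandas),
# otherwise fall back to DataFrame.applymap (older pandas)
elementwise = df.map if hasattr(df, 'map') else df.applymap
plus_one = elementwise(lambda value: value + 1)

# Column-wise: the function receives each column as a Series
column_max = df.apply(lambda column: column.max())

print(plus_one)
print(column_max)
```

The element-wise call returns a DataFrame of the same shape, while apply() with a function that reduces each column returns a Series indexed by column name.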

In [56]:
def convert_grade(grade):
    # Return as soon as a range matches, so the letter is never
    # compared against a number in a later check
    if grade >= 90:
        return 'A'
    elif grade >= 80:
        return 'B'
    elif grade >= 70:
        return 'C'
    elif grade >= 60:
        return 'D'
    else:
        return 'F'

In [57]:
convert_grade(82)

'B'