# Intro - Analyzing subway and weather data

How does ridership change when it starts raining?

How does it change from summer to winter?

Which stops are most and least popular?

How do cultural events affect ridership? Identify the biggest changes from regular usage.

Were there any interruptions in the operation of some stations?

---

What variables are related to subway ridership?

Which stations have the most riders?

How do ridership patterns change over time, during the day and over the month?

Will Memorial Day ridership be similar to weekend ridership?

How does weather affect ridership?



## Two-dimensional Data 
Python: List of lists

NumPy: 2D array

Pandas: DataFrame

A 2D array, as opposed to an array of arrays:
- Is more memory efficient
- Uses different syntax for accessing elements: `a[1, 3]` (row 1, column 3) rather than `a[1][3]`
- Has mean(), std(), etc. operate on the entire array
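
A small sketch of the indexing difference. The 2x4 grid below is made up for illustration and is not part of the subway data:

```python
import numpy as np

# Illustrative values only.
list_of_lists = [[1, 2, 3, 4], [5, 6, 7, 8]]
array_2d = np.array(list_of_lists)

# A list of lists needs two separate lookups: first the row, then the element.
from_list = list_of_lists[1][3]

# A 2D array takes both indexes at once: row 1, column 3.
from_array = array_2d[1, 3]

# mean() with no arguments covers every element of the 2D array.
overall_mean = array_2d.mean()

print(from_list, from_array, overall_mean)
```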

In [2]:
import numpy as np

# Subway ridership for 5 stations on 10 different days
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

# Accessing elements
print(ridership[1, 3])
print(ridership[1:3, 3:5])
print(ridership[1, :])

2328
[[2328 2539]
 [6461 2691]]
[1478 3877 3674 2328 2539]


In [3]:
# Vectorized operations on rows or columns
print(ridership[0, :] + ridership[1, :])
print(ridership[:, 0] + ridership[:, 1])

[1478 3877 3676 2333 2539]
[   0 5355 5701 4952 6410 5509  324    2 5223 5385]


In [4]:
# Vectorized operations on entire arrays
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])
print(a + b)

[[ 2  3  4]
 [ 6  7  8]
 [10 11 12]]


In [5]:
def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparison.
    
    Hint: NumPy's argmax() function might be useful:
    http://docs.scipy.org/doc/numpy/reference/generated/numpy.argmax.html
    '''
    max_station_idx = np.argmax(ridership[0, :])
    overall_mean = ridership.mean()
    mean_for_max = ridership[:, max_station_idx].mean()
    
    return (overall_mean, mean_for_max)

mean_riders_for_max_station(ridership)

(2342.5999999999999, 3239.9000000000001)

The station with the most riders on the first day had a much higher mean ridership than the overall average across all stations.

## Operations along an Axis

Many NumPy functions, like mean(), operate on an array as a whole. But in many cases it makes sense to apply the operation to each row or each column instead.

Most NumPy functions take an axis argument for this reason, either 0 or 1:
- 0 calculates one result for each column
- 1 calculates one result for each row
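
One way to remember the direction is to look at the shape of the result. A minimal sketch, assuming a hypothetical table of 3 days by 2 stations (the numbers are made up):

```python
import numpy as np

# Hypothetical table: 3 rows (days) x 2 columns (stations).
counts = np.array([
    [10, 20],
    [30, 40],
    [50, 60],
])

per_station = counts.mean(axis=0)  # one value per column, shape (2,)
per_day = counts.mean(axis=1)      # one value per row, shape (3,)

print(per_station.shape, per_day.shape)
```

The axis you pass is the one that disappears from the result: axis=0 collapses the rows, axis=1 collapses the columns.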

In [8]:
a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(a.sum())
print(a.sum(axis=0)) # one sum per column (adds down the rows)
print(a.sum(axis=1)) # one sum per row (adds across the columns)

45
[12 15 18]
[ 6 15 24]


In [7]:
# Subway ridership for 5 stations on 10 different days
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

def min_and_max_riders_per_day(ridership):
    '''
    Fill in this function. First, for each subway station, calculate the
    mean ridership per day. Then, out of all the subway stations, return the
    maximum and minimum of these values. That is, find the maximum
    mean-ridership-per-day and the minimum mean-ridership-per-day for any
    subway station.
    '''
    stations_means = ridership.mean(axis=0)
    max_daily_ridership = stations_means.max()
    min_daily_ridership = stations_means.min()
    
    return (max_daily_ridership, min_daily_ridership)

min_and_max_riders_per_day(ridership)

(3239.9000000000001, 1071.2)

These are the highest and lowest mean daily ridership values among the stations.

Which stations are these?
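
One possible way to answer this is to ask for the positions of the extreme values with argmax() and argmin() instead of the values themselves. A sketch on the same array:

```python
import numpy as np

# Subway ridership for 5 stations on 10 different days (same data as above)
ridership = np.array([
    [   0,    0,    2,    5,    0],
    [1478, 3877, 3674, 2328, 2539],
    [1613, 4088, 3991, 6461, 2691],
    [1560, 3392, 3826, 4787, 2613],
    [1608, 4802, 3932, 4477, 2705],
    [1576, 3933, 3909, 4979, 2685],
    [  95,  229,  255,  496,  201],
    [   2,    0,    1,   27,    0],
    [1438, 3785, 3589, 4174, 2215],
    [1342, 4043, 4009, 4665, 3033]
])

station_means = ridership.mean(axis=0)

# Column positions of the busiest and quietest stations
busiest = station_means.argmax()
quietest = station_means.argmin()

print(busiest, quietest)
```

The busiest station is in column 3 and the quietest in column 0. A plain array only knows positions, which is one reason to move to a DataFrame with named columns.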

## NumPy And Pandas Data Types

In [9]:
np.array([1, 2, 3, 4, 5]).dtype

dtype('int64')

A NumPy array has a single dtype, so it is difficult or impossible to store different types of data, like strings and floats, in one array.

The solution is to store mixed data in a DataFrame, where each column can have its own type.
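
A quick sketch of what happens in practice. The column names `station` and `riders` are made up for this example:

```python
import numpy as np
import pandas as pd

# Mixing a string and a float in one array coerces both to strings.
mixed = np.array(["R003", 1478.0])
print(mixed.dtype)

# A DataFrame keeps a separate dtype for each column.
# Column names here are illustrative only.
mixed_df = pd.DataFrame({
    "station": ["R003", "R004"],
    "riders": [1478.0, 3877.0],
})
print(mixed_df.dtypes)
```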

## Accessing Elements Of A DataFrame

Use .loc[] to access a row of a DataFrame by its index label (the row's name).

Use .iloc[] to access a row of a DataFrame by its integer position.

To access a single element of a DataFrame, you can also use .loc[] and .iloc[], passing both a row and a column.

You can also access columns using square brackets.
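
A minimal sketch of the three access styles side by side, using a cut-down table (two days, two stations) so that the results are easy to check by eye:

```python
import pandas as pd

# Small illustrative table: rows are dates, columns are stations.
small_df = pd.DataFrame(
    data=[[0, 0], [1478, 3877]],
    index=["05-01-11", "05-02-11"],
    columns=["R003", "R004"],
)

by_label = small_df.loc["05-02-11", "R004"]  # row label, column label
by_position = small_df.iloc[1, 1]            # row position, column position
whole_column = small_df["R003"]              # square brackets select a column

print(by_label, by_position)
print(whole_column)
```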

In [13]:
import pandas as pd

# You can create a DataFrame out of a dictionary mapping column names to values
df_1 = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5]})
df_1

Unnamed: 0,A,B
0,0,3
1,1,4
2,2,5


In [14]:
# You can also use a list of lists or a 2D NumPy array
df_2 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"])
df_2

Unnamed: 0,A,B,C
0,0,1,2
1,3,4,5


In [15]:
ridership_df = pd.DataFrame(
    data=[[   0,    0,    2,    5,    0],
          [1478, 3877, 3674, 2328, 2539],
          [1613, 4088, 3991, 6461, 2691],
          [1560, 3392, 3826, 4787, 2613],
          [1608, 4802, 3932, 4477, 2705],
          [1576, 3933, 3909, 4979, 2685],
          [  95,  229,  255,  496,  201],
          [   2,    0,    1,   27,    0],
          [1438, 3785, 3589, 4174, 2215],
          [1342, 4043, 4009, 4665, 3033]],
    index=['05-01-11', '05-02-11', '05-03-11', '05-04-11', '05-05-11',
           '05-06-11', '05-07-11', '05-08-11', '05-09-11', '05-10-11'],
    columns=['R003', 'R004', 'R005', 'R006', 'R007']
)
ridership_df

Unnamed: 0,R003,R004,R005,R006,R007
05-01-11,0,0,2,5,0
05-02-11,1478,3877,3674,2328,2539
05-03-11,1613,4088,3991,6461,2691
05-04-11,1560,3392,3826,4787,2613
05-05-11,1608,4802,3932,4477,2705
05-06-11,1576,3933,3909,4979,2685
05-07-11,95,229,255,496,201
05-08-11,2,0,1,27,0
05-09-11,1438,3785,3589,4174,2215
05-10-11,1342,4043,4009,4665,3033


In [18]:
# Accessing elements.
print(ridership_df.iloc[0])
print(ridership_df.loc["05-05-11"])
print(ridership_df["R003"])
print(ridership_df.iloc[1, 3])

R003    0
R004    0
R005    2
R006    5
R007    0
Name: 05-01-11, dtype: int64
R003    1608
R004    4802
R005    3932
R006    4477
R007    2705
Name: 05-05-11, dtype: int64
05-01-11       0
05-02-11    1478
05-03-11    1613
05-04-11    1560
05-05-11    1608
05-06-11    1576
05-07-11      95
05-08-11       2
05-09-11    1438
05-10-11    1342
Name: R003, dtype: int64
2328


In [19]:
# Accessing multiple rows
ridership_df.iloc[1:4]

Unnamed: 0,R003,R004,R005,R006,R007
05-02-11,1478,3877,3674,2328,2539
05-03-11,1613,4088,3991,6461,2691
05-04-11,1560,3392,3826,4787,2613


In [20]:
# Accessing multiple columns
ridership_df[["R003", "R005"]]

Unnamed: 0,R003,R005
05-01-11,0,2
05-02-11,1478,3674
05-03-11,1613,3991
05-04-11,1560,3826
05-05-11,1608,3932
05-06-11,1576,3909
05-07-11,95,255
05-08-11,2,1
05-09-11,1438,3589
05-10-11,1342,4009


In [22]:
# Pandas axis
df = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5]})
df

Unnamed: 0,A,B
0,0,3
1,1,4
2,2,5


In [23]:
df.sum()

A     3
B    12
dtype: int64

In [24]:
df.sum(axis=0)

A     3
B    12
dtype: int64

In [25]:
df.sum(axis=1)

0    3
1    5
2    7
dtype: int64

In [26]:
df.values.sum()

15

In [50]:
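# idxmax() returns the column name (label), not the integer position.
# In recent pandas versions argmax() returns the position instead.
df.iloc[0].idxmax()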

'B'
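
Whether argmax() gives back a label or a position depends on the pandas version: older releases returned the label, while in pandas 1.0 and later Series.argmax() returns the integer position and idxmax() returns the label. A minimal sketch of the difference, assuming a recent pandas:

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2], "B": [3, 4, 5]})
first_row = df.iloc[0]

label = first_row.idxmax()     # column label of the largest value
position = first_row.argmax()  # integer position of the largest value

print(label, position)

# The label can be used directly to select the column.
print(df[label].mean())
```

Using idxmax() is the safer choice when the result is used to look up a column by name.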

In [47]:
def mean_riders_for_max_station(ridership):
    '''
    Fill in this function to find the station with the maximum riders on the
    first day, then return the mean riders per day for that station. Also
    return the mean ridership overall for comparison.
    
    This is the same as a previous exercise, but this time the
    input is a Pandas DataFrame rather than a 2D NumPy array.
    '''
    
    # idxmax() returns the column label of the busiest station on the first day
    max_station = ridership.iloc[0].idxmax()
    
    overall_mean = ridership.values.mean()
    mean_for_max = ridership[max_station].mean()
    
    return (overall_mean, mean_for_max)

In [48]:
mean_riders_for_max_station(ridership_df)

(2342.5999999999999, 3239.9)

## Loading Data Into A DataFrame

DataFrames are a great data structure for representing CSV data!

In [51]:
subway_df = pd.read_csv("nyc_subway_weather.csv")

In [52]:
subway_df.head()

Unnamed: 0,UNIT,DATEn,TIMEn,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly,datetime,hour,day_week,...,pressurei,rain,tempi,wspdi,meanprecipi,meanpressurei,meantempi,meanwspdi,weather_lat,weather_lon
0,R003,05-01-11,00:00:00,4388333,2911002,0.0,0.0,2011-05-01 00:00:00,0,6,...,30.22,0,55.9,3.5,0.0,30.258,55.98,7.86,40.700348,-73.887177
1,R003,05-01-11,04:00:00,4388333,2911002,0.0,0.0,2011-05-01 04:00:00,4,6,...,30.25,0,52.0,3.5,0.0,30.258,55.98,7.86,40.700348,-73.887177
2,R003,05-01-11,12:00:00,4388333,2911002,0.0,0.0,2011-05-01 12:00:00,12,6,...,30.28,0,62.1,6.9,0.0,30.258,55.98,7.86,40.700348,-73.887177
3,R003,05-01-11,16:00:00,4388333,2911002,0.0,0.0,2011-05-01 16:00:00,16,6,...,30.26,0,57.9,15.0,0.0,30.258,55.98,7.86,40.700348,-73.887177
4,R003,05-01-11,20:00:00,4388333,2911002,0.0,0.0,2011-05-01 20:00:00,20,6,...,30.28,0,52.0,10.4,0.0,30.258,55.98,7.86,40.700348,-73.887177


In [53]:
subway_df.describe()

Unnamed: 0,ENTRIESn,EXITSn,ENTRIESn_hourly,EXITSn_hourly,hour,day_week,weekday,latitude,longitude,fog,...,pressurei,rain,tempi,wspdi,meanprecipi,meanpressurei,meantempi,meanwspdi,weather_lat,weather_lon
count,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,...,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0,42649.0
mean,28124860.0,19869930.0,1886.589955,1361.487866,10.046754,2.905719,0.714436,40.724647,-73.940364,0.009824,...,29.971096,0.224741,63.10378,6.927872,0.004618,29.971096,63.10378,6.927872,40.728555,-73.938693
std,30436070.0,20289860.0,2952.385585,2183.845409,6.938928,2.079231,0.451688,0.07165,0.059713,0.098631,...,0.137942,0.417417,8.455597,4.510178,0.016344,0.131158,6.939011,3.179832,0.06542,0.059582
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,40.576152,-74.073622,0.0,...,29.55,0.0,46.9,0.0,0.0,29.59,49.4,0.0,40.600204,-74.01487
25%,10397620.0,7613712.0,274.0,237.0,4.0,1.0,0.0,40.677107,-73.987342,0.0,...,29.89,0.0,57.0,4.6,0.0,29.913333,58.283333,4.816667,40.688591,-73.98513
50%,18183890.0,13316090.0,905.0,664.0,12.0,3.0,1.0,40.717241,-73.953459,0.0,...,29.96,0.0,61.0,6.9,0.0,29.958,60.95,6.166667,40.72057,-73.94915
75%,32630490.0,23937710.0,2255.0,1537.0,16.0,5.0,1.0,40.759123,-73.907733,0.0,...,30.06,0.0,69.1,9.2,0.0,30.06,67.466667,8.85,40.755226,-73.912033
max,235774600.0,149378200.0,32814.0,34828.0,20.0,6.0,1.0,40.889185,-73.755383,1.0,...,30.32,1.0,86.0,23.0,0.1575,30.293333,79.8,17.083333,40.862064,-73.694176
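
If the CSV file is not at hand, read_csv() can be tried on an in-memory string instead. This is only a sketch: the three rows below are made up and reuse a few column names from the real file.

```python
import io
import pandas as pd

# Tiny made-up sample with a handful of the real column names.
csv_text = """UNIT,DATEn,hour,ENTRIESn_hourly
R003,05-01-11,0,0.0
R003,05-01-11,4,0.0
R004,05-01-11,0,12.5
"""

# read_csv() accepts any file-like object, not only a path.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)
print(sample_df.dtypes)  # each column gets its own inferred dtype
```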


## DataFrame Vectorized Operations

These are similar to vectorized operations on 2D NumPy arrays.

The difference is that elements are matched by index and column name rather than by position.

In [56]:
# Adding DataFrames with the same column names
df1 = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]})
df2 = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60], 'c': [70, 80, 90]})

In [57]:
df1

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [58]:
df2

Unnamed: 0,a,b,c
0,10,40,70
1,20,50,80
2,30,60,90


In [59]:
df1 + df2

Unnamed: 0,a,b,c
0,11,44,77
1,22,55,88
2,33,66,99


In [60]:
# Adding DataFrames with overlapping column names
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
df2 = pd.DataFrame({"d": [10, 20, 30], "c": [40, 50, 60], "b": [70, 80, 90]})

In [61]:
df1

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [62]:
df2

Unnamed: 0,b,c,d
0,70,40,10
1,80,50,20
2,90,60,30


In [63]:
df1 + df2

Unnamed: 0,a,b,c,d
0,,74,47,
1,,85,58,
2,,96,69,


In [64]:
# Adding DataFrames with overlapping row indexes
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]},
                  index=["row1", "row2", "row3"])
df2 = pd.DataFrame({"a": [10, 20, 30], "b": [40, 50, 60], "c": [70, 80, 90]},
                  index=["row4", "row3", "row2"])

In [65]:
df1

Unnamed: 0,a,b,c
row1,1,4,7
row2,2,5,8
row3,3,6,9


In [66]:
df2

Unnamed: 0,a,b,c
row4,10,40,70
row3,20,50,80
row2,30,60,90


In [67]:
df1 + df2

Unnamed: 0,a,b,c
row1,,,
row2,32.0,65.0,98.0
row3,23.0,56.0,89.0
row4,,,
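
If NaN is not what you want for labels that exist on only one side, the add() method accepts a fill_value. A sketch with a single column, assuming that treating a missing entry as 0 makes sense for your data:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3]}, index=["row1", "row2", "row3"])
df2 = pd.DataFrame({"a": [10, 20, 30]}, index=["row4", "row3", "row2"])

# fill_value=0 treats a label missing from one side as 0 instead of NaN.
total = df1.add(df2, fill_value=0)

print(total)
```

Rows are still matched by label: row2 gets 2 + 30 and row3 gets 3 + 20, while row1 and row4 keep their single value.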


## Differences Within Columns: diff()

In [70]:
entries_and_exits = pd.DataFrame({
    'ENTRIES': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITS': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})
entries_and_exits

Unnamed: 0,ENTRIES,EXITS
0,3144312,1088151
1,3144335,1088159
2,3144353,1088177
3,3144424,1088231
4,3144594,1088275
5,3144808,1088317
6,3144895,1088328
7,3144905,1088331
8,3144941,1088420
9,3145094,1088753


In [71]:
entries_and_exits.diff()

Unnamed: 0,ENTRIES,EXITS
0,,
1,23.0,8.0
2,18.0,18.0
3,71.0,54.0
4,170.0,44.0
5,214.0,42.0
6,87.0,11.0
7,10.0,3.0
8,36.0,89.0
9,153.0,333.0
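
diff() subtracts the previous row from each row, which turns cumulative counters into per-interval counts. A sketch on the first three rows, showing that it matches subtracting a shifted copy:

```python
import pandas as pd

counts = pd.DataFrame({
    "ENTRIES": [3144312, 3144335, 3144353],
    "EXITS": [1088151, 1088159, 1088177],
})

# diff() subtracts the previous row from each row...
by_diff = counts.diff()

# ...which is the same as subtracting a copy shifted down by one row.
by_shift = counts - counts.shift(1)

print(by_diff)
print(by_shift)
```

The first row is NaN in both cases because it has no previous row to subtract.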


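## DataFrame applymap()

Similar to Series apply().

It applies a function to every element of the whole DataFrame. In pandas 2.1 and later, applymap() is deprecated in favor of DataFrame.map(), which does the same thing.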

In [72]:
entries_and_exits = pd.DataFrame({
    'ENTRIES': [3144312, 3144335, 3144353, 3144424, 3144594,
                 3144808, 3144895, 3144905, 3144941, 3145094],
    'EXITS': [1088151, 1088159, 1088177, 1088231, 1088275,
               1088317, 1088328, 1088331, 1088420, 1088753]
})
entries_and_exits

Unnamed: 0,ENTRIES,EXITS
0,3144312,1088151
1,3144335,1088159
2,3144353,1088177
3,3144424,1088231
4,3144594,1088275
5,3144808,1088317
6,3144895,1088328
7,3144905,1088331
8,3144941,1088420
9,3145094,1088753


In [73]:
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
df1

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9


In [75]:
df1.applymap(lambda x: x * 2)

Unnamed: 0,a,b,c
0,2,8,14
1,4,10,16
2,6,12,18
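
A sketch that should work across pandas versions, assuming DataFrame.map() is available from pandas 2.1 on (where applymap() emits a deprecation warning). The helper `add_one` is made up for this example:

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})

def add_one(x):
    # Hypothetical element-wise function
    return x + 1

# Use DataFrame.map() where it exists, otherwise fall back to applymap().
elementwise = df1.map if hasattr(df1, "map") else df1.applymap
result = elementwise(add_one)

# For simple arithmetic, a vectorized expression gives the same answer.
same = df1 + 1

print(result)
```

Element-wise functions are most useful when the logic cannot be written as vectorized arithmetic, for example a custom string conversion.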


## DataFrame apply() and qcut()

apply() calls a function on each column (a Series) of the DataFrame.

It is useful for operations that depend on the values of an entire column.

In [77]:
grades_df = pd.DataFrame(
    data={'exam1': [43, 81, 78, 75, 89, 70, 91, 65, 98, 87],
          'exam2': [24, 63, 56, 56, 67, 51, 79, 46, 72, 60]},
    index=['Andre', 'Barry', 'Chris', 'Dan', 'Emilio', 
           'Fred', 'Greta', 'Humbert', 'Ivan', 'James']
)
grades_df

Unnamed: 0,exam1,exam2
Andre,43,24
Barry,81,63
Chris,78,56
Dan,75,56
Emilio,89,67
Fred,70,51
Greta,91,79
Humbert,65,46
Ivan,98,72
James,87,60


### Convert grades using qcut

In [106]:
# qcut() operates on a list, array, or Series. This is the
# result of running the function on a single column of the
# DataFrame.
pd.qcut(grades_df['exam1'],
        [0, 0.1, 0.2, 0.5, 0.8, 1],
        labels=['F', 'D', 'C', 'B', 'A'])

Andre      F
Barry      B
Chris      C
Dan        C
Emilio     B
Fred       C
Greta      A
Humbert    D
Ivan       A
James      B
Name: exam1, dtype: category
Categories (5, object): [F < D < C < B < A]

In [107]:
# Passing the whole DataFrame fails, because qcut() expects
# one-dimensional input (a list, array, or Series).
pd.qcut(grades_df,
        [0, 0.1, 0.2, 0.5, 0.8, 1],
        labels=['F', 'D', 'C', 'B', 'A'])

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

In [108]:
# qcut() does not work on DataFrames, but we can use apply()
# to call the function on each column separately

def convert_grades_curve(exam_grades):
    # Pandas has a built-in function that will perform this calculation.
    # This will give the bottom 0% to 10% of students the grade 'F',
    # 10% to 20% the grade 'D', and so on. You can read more about
    # the qcut() function here:
    # http://pandas.pydata.org/pandas-docs/stable/generated/pandas.qcut.html
    return pd.qcut(exam_grades,
                   [0, 0.1, 0.2, 0.5, 0.8, 1],
                   labels=['F', 'D', 'C', 'B', 'A'])

grades_df.apply(convert_grades_curve)

Unnamed: 0,exam1,exam2
Andre,F,F
Barry,B,B
Chris,C,C
Dan,C,C
Emilio,B,B
Fred,C,C
Greta,A,A
Humbert,D,D
Ivan,A,A
James,B,B


### Standardize DataFrame

In [111]:
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

df

Unnamed: 0,a,b,c
0,4,20,25
1,5,10,20
2,3,40,5
3,1,50,15
4,2,30,10


In [112]:
def std_series(col):
    # Population standard deviation (ddof=0)
    return (col - col.mean()) / col.std(ddof=0)

def standardize(df):
    '''
    Fill in this function to standardize each column of the given
    DataFrame. To standardize a variable, convert each value to the
    number of standard deviations it is above or below the mean.
    '''
    return df.apply(std_series)

In [113]:
standardize(df)

Unnamed: 0,a,b,c
0,0.707107,-0.707107,1.414214
1,1.414214,-1.414214,0.707107
2,0.0,0.707107,-1.414214
3,-1.414214,1.414214,0.0
4,-0.707107,0.0,-0.707107
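
Standardizing can also be written without apply(), because subtracting or dividing a DataFrame by a Series of per-column statistics matches each statistic to its column by name. A sketch, assuming the population standard deviation (ddof=0) is the one you want:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

# df.mean() and df.std() each return one value per column;
# the arithmetic lines those values up with the columns by name.
standardized = (df - df.mean()) / df.std(ddof=0)

# Each standardized column should have mean 0 and standard deviation 1.
print(standardized.mean())
print(standardized.std(ddof=0))
```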


### apply() and reducing DataFrame to a Series

In [114]:
df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})
df

Unnamed: 0,a,b,c
0,4,20,25
1,5,10,20
2,3,40,5
3,1,50,15
4,2,30,10


In [116]:
df.apply(np.max)

a     5
b    50
c    25
dtype: int64

Same as df.max().

In [117]:
df.max()

a     5
b    50
c    25
dtype: int64

### apply() and second largest elements

In [149]:
def second_largest_in_col(column):
    # Sort from largest to smallest and take the second entry
    return column.sort_values(ascending=False).iloc[1]

def second_largest(df):
    '''
    Fill in this function to return the second-largest value of each
    column of the input DataFrame.
    '''
    return df.apply(second_largest_in_col)


In [150]:
second_largest_in_col(df['a'])

4

In [151]:
second_largest(df)

a     4
b    40
c    20
dtype: int64
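
An alternative sketch that avoids sorting the whole column uses Series.nlargest(). The function name `second_largest_nlargest` is made up here:

```python
import pandas as pd

df = pd.DataFrame({
    'a': [4, 5, 3, 1, 2],
    'b': [20, 10, 40, 50, 30],
    'c': [25, 20, 5, 15, 10]
})

def second_largest_nlargest(df):
    # nlargest(2) keeps the two biggest values; the last of them is the second largest.
    return df.apply(lambda column: column.nlargest(2).iloc[-1])

result = second_largest_nlargest(df)
print(result)
```

Like the sorting version, this returns the maximum again if the largest value appears twice in a column.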

## Adding A DataFrame To A Series


### Adding a Series to a square DataFrame

In [152]:
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80],
    2: [90, 100, 110, 120],
    3: [130, 140, 150, 160]
})

In [153]:
s

0    1
1    2
2    3
3    4
dtype: int64

In [154]:
df

Unnamed: 0,0,1,2,3
0,10,50,90,130
1,20,60,100,140
2,30,70,110,150
3,40,80,120,160


In [155]:
df + s

Unnamed: 0,0,1,2,3
0,11,52,93,134
1,21,62,103,144
2,31,72,113,154
3,41,82,123,164


1 was added to each element of the first column, 2 to each element of the second column, and so on.

### Adding a Series to a one-row DataFrame

In [156]:
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10], 1: [20], 2: [30], 3:[40]})

In [157]:
s

0    1
1    2
2    3
3    4
dtype: int64

In [158]:
df

Unnamed: 0,0,1,2,3
0,10,20,30,40


In [159]:
df + s

Unnamed: 0,0,1,2,3
0,11,22,33,44


Again, each value from the Series was added to a single column of the DataFrame.

1 was added to the first column, 2 was added to the second, and so on.

### Adding a Series to a one-column DataFrame

So what will happen if I add a Series to a DataFrame with only one column?

In [160]:
s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({0: [10, 20, 30, 40]})

In [161]:
s

0    1
1    2
2    3
3    4
dtype: int64

In [162]:
df

Unnamed: 0,0
0,10
1,20
2,30
3,40


In [163]:
df + s

Unnamed: 0,0,1,2,3
0,11,,,
1,21,,,
2,31,,,
3,41,,,


The result has four columns, one for each value in the Series, and every column except the first contains NaNs.

The first column has 1 added to each of its elements, as before.

### Adding when DataFrame column names match Series index

In [165]:
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})

In [166]:
s

a    1
b    2
c    3
d    4
dtype: int64

In [167]:
df

Unnamed: 0,a,b,c,d
0,10,50,90,130
1,20,60,100,140
2,30,70,110,150
3,40,80,120,160


In [168]:
df + s

Unnamed: 0,a,b,c,d
0,11,52,93,134
1,21,62,103,144
2,31,72,113,154
3,41,82,123,164


All of the previous results came from matching the Series index against the DataFrame column names. Here both use the same letters, so every column gets its matching value.

### Adding when DataFrame column names don't match Series index

What happens if the index and the column names do not match?

In [174]:
s = pd.Series([1, 2, 3, 4], index=["b", "c", "d", "e"])
df = pd.DataFrame({
    'a': [10, 20, 30, 40],
    'b': [50, 60, 70, 80],
    'c': [90, 100, 110, 120],
    'd': [130, 140, 150, 160]
})

In [177]:
s

b    1
c    2
d    3
e    4
dtype: int64

In [178]:
df

Unnamed: 0,a,b,c,d
0,10,50,90,130
1,20,60,100,140
2,30,70,110,150
3,40,80,120,160


In [179]:
df + s

Unnamed: 0,a,b,c,d,e
0,,51.0,92.0,133.0,
1,,61.0,102.0,143.0,
2,,71.0,112.0,153.0,
3,,81.0,122.0,163.0,


The result is similar to the earlier cases where the labels did not match.

For each letter that is not present in both the DataFrame and the Series, the result is NaN.

The other values were matched by letter.

Summary: Adding a DataFrame to a Series adds each value of the Series to one column of the DataFrame.

It matches the Series to the DataFrame using the index of the Series and the column names of the DataFrame.
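
To add each Series value to one row instead of one column, the add() method takes an axis argument. A sketch with a two-column DataFrame (values reused from the examples above):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
df = pd.DataFrame({
    0: [10, 20, 30, 40],
    1: [50, 60, 70, 80]
})

# Default: the Series index is matched against the column names.
by_columns = df + s

# axis="index": the Series index is matched against the row index instead.
by_rows = df.add(s, axis="index")

print(by_columns)
print(by_rows)
```

With the default, columns 2 and 3 appear filled with NaN because the DataFrame has no such columns. With axis="index", 1 is added to the first row, 2 to the second, and so on.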