## Question:  
I decided to target the following 27-week period for my EDA: 2020-07-25 to 2021-01-29. I was interested in this period because I wanted to see how MTA traffic patterns change as we move out of summer (August) and into the new school year (September), on to the holiday season in December, and into the new year. One would expect traffic to increase in September. The government fiscal year started on October 1st. Public holidays celebrated in NY include:  
  
Labor Day 	Mon, Sep 7, 2020  
Columbus Day 	Mon, Oct 12, 2020  
Veterans Day 	Wed, Nov 11, 2020  
Thanksgiving 	Thu, Nov 26, 2020  
Christmas Day 	Fri, Dec 25, 2020  
New Year's Day 	Fri, Jan 1, 2021  
Martin Luther King Jr. Day 	Mon, Jan 18, 2021  

## Importing MTA data from website into a .db file for SQL queries

I used the 'get_mta.py' file provided and imported data from the MTA website ('http://web.mta.info/developers/turnstile.html') following the instructions in the 'get_mta.md' readme file.  
  
The resulting .db file was saved in the 'data' folder as 'mta_data - 27 weeks - 2020.08.01 to 2021.01.30.db'
  
The following command was used to get the data for the months of August, 2020 to January, 2021:  
  
python 'get_mta.py' "(2008|2009|2010|2011|2012|2101)"

27 weeks to collect:

['2021-01-30', '2021-01-23', '2021-01-16', '2021-01-09', '2021-01-02', '2020-12-26', '2020-12-19', '2020-12-12', '2020-12-05', '2020-11-28', '2020-11-21', 
'2020-11-14', '2020-11-07', '2020-10-31', '2020-10-24', '2020-10-17', '2020-10-10', '2020-10-03', '2020-09-26', '2020-09-19', '2020-09-12', '2020-09-05', '2020-08-29', '2020-08-22', '2020-08-15', '2020-08-08', '2020-08-01']
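
The week list above can be reproduced with pandas. This is a minimal sketch, assuming the weekly files are identified by their Saturday end date in YYMMDD form (an assumption about the file naming, not something taken from 'get_mta.py'):

```python
import re
import pandas as pd

# 27 Saturdays ending on 2021-01-30, newest first
weeks = pd.date_range(end="2021-01-30", periods=27, freq="W-SAT")[::-1]
week_strings = [d.strftime("%Y-%m-%d") for d in weeks]

# Assumed suffix format: YYMMDD. The pattern passed on the command line
# is checked here against the YYMM prefix of each suffix.
pattern = re.compile(r"(2008|2009|2010|2011|2012|2101)")
matched = [d.strftime("%y%m%d") for d in weeks if pattern.match(d.strftime("%y%m%d"))]

print(len(week_strings), week_strings[0], week_strings[-1])
print(len(matched))
```

Under that naming assumption, every one of the 27 week-ending dates falls in a month covered by the pattern.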



## Using SQLAlchemy to import SQL query as a Pandas DataFrame

In [1]:
# if SQLAlchemy is not installed:

#!conda install -c anaconda sqlalchemy

In [2]:
from sqlalchemy import create_engine
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt

In [3]:
engine = create_engine("sqlite:///mta_data.db") #telling it to use sqlite specifically

engine

Engine(sqlite:///mta_data.db)

In [4]:
all_tables = engine.table_names # without parentheses this is the bound method itself, not the list of table names
all_tables

<bound method Engine.table_names of Engine(sqlite:///mta_data.db)>

In [5]:
engine.table_names() # calling the method returns the list of table names; the table is called 'mta_data'


['mta_data']
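
`Engine.table_names()` is deprecated since SQLAlchemy 1.4 and removed in 2.0. For a SQLite file, the table list can also be read from the `sqlite_master` catalog. A minimal sketch using the standard-library `sqlite3` module on an in-memory database (the two-column table here is a hypothetical stand-in for the real one):

```python
import sqlite3

# In-memory stand-in for mta_data.db so the sketch runs anywhere
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mta_data (STATION TEXT, ENTRIES INTEGER)")

# sqlite_master lists every table, index and view in a SQLite database
rows = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
).fetchall()
table_names = [name for (name,) in rows]
print(table_names)
conn.close()
```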

In [6]:
df = pd.read_sql('SELECT * FROM mta_data', engine)
df.head()

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,03:00:00,REGULAR,7521371,2563177
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,07:00:00,REGULAR,7521374,2563190
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,11:00:00,REGULAR,7521399,2563231
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,15:00:00,REGULAR,7521490,2563274
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,19:00:00,REGULAR,7521630,2563312


In [7]:
df

Unnamed: 0,C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS
0,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,03:00:00,REGULAR,7521371,2563177
1,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,07:00:00,REGULAR,7521374,2563190
2,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,11:00:00,REGULAR,7521399,2563231
3,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,15:00:00,REGULAR,7521490,2563274
4,A002,R051,02-00-00,59 ST,NQR456W,BMT,01/23/2021,19:00:00,REGULAR,7521630,2563312
...,...,...,...,...,...,...,...,...,...,...,...
5673225,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,07/31/2020,12:59:32,REGULAR,5554,538
5673226,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,07/31/2020,13:00:00,REGULAR,5554,538
5673227,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,07/31/2020,13:11:06,REGULAR,5554,538
5673228,TRAM2,R469,00-05-01,RIT-ROOSEVELT,R,RIT,07/31/2020,17:00:00,REGULAR,5554,538


In [8]:
df.shape

(5673230, 11)

## Field Descriptions from MTA website  
http://web.mta.info/developers/resources/nyct/turnstile/ts_Field_Description.txt  

Field Description

C/A,UNIT,SCP,STATION,LINENAME,DIVISION,DATE,TIME,DESC,ENTRIES,EXITS


C/A      = Control Area (A002)
UNIT     = Remote Unit for a station (R051)
SCP      = Subunit Channel Position represents a specific address for a device (02-00-00)
STATION  = Represents the station name the device is located at
LINENAME = Represents all train lines that can be boarded at this station
           Normally lines are represented by one character.  LINENAME 456NQR represents train service for 4, 5, 6, N, Q, and R trains.
DIVISION = Represents the Line originally the station belonged to BMT, IRT, or IND   
DATE     = Represents the date (MM-DD-YY)
TIME     = Represents the time (hh:mm:ss) for a scheduled audit event
DESC     = Represents the "REGULAR" scheduled audit event (Normally occurs every 4 hours)
           1. Audits may occur more than 4 hours apart due to planning, or troubleshooting activities. 
           2. Additionally, there may be a "RECOVR AUD" entry: This refers to a missed audit that was recovered. 
ENTRIES  = The cumulative entry register value for a device
EXITS    = The cumulative exit register value for a device

## Data Cleaning

In [9]:
df.columns # unlike in the MTA 1 exercise, it seems the 'EXITS' column doesn't have a bunch of spaces at the end.

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')

In [10]:
# but will still strip white spaces for good measure
df.columns = [column.strip() for column in df.columns] # reassign column names with surrounding whitespace removed
df.columns

Index(['C/A', 'UNIT', 'SCP', 'STATION', 'LINENAME', 'DIVISION', 'DATE', 'TIME',
       'DESC', 'ENTRIES', 'EXITS'],
      dtype='object')

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5673230 entries, 0 to 5673229
Data columns (total 11 columns):
 #   Column    Dtype 
---  ------    ----- 
 0   C/A       object
 1   UNIT      object
 2   SCP       object
 3   STATION   object
 4   LINENAME  object
 5   DIVISION  object
 6   DATE      object
 7   TIME      object
 8   DESC      object
 9   ENTRIES   int64 
 10  EXITS     int64 
dtypes: int64(2), object(9)
memory usage: 476.1+ MB


The 'DATE' and 'TIME' columns are of type "object", which often means string

In [12]:
# let's take a look at the specific entry for row 0
df.iloc[0]['DATE'] # the quotes around the output implies that 'DATE' column is of type string
df.iloc[0]['TIME'] # same with 'TIME' column

'03:00:00'

### converting 'DATE' and 'TIME' columns into a DateTime object

In [13]:
df['DATE_TIME'] = df['DATE'] + ' ' + df['TIME'] # concatenate with a space separator; Pandas can parse this format when converting to datetime
df['DATE_TIME'] # still a string

0          01/23/2021 03:00:00
1          01/23/2021 07:00:00
2          01/23/2021 11:00:00
3          01/23/2021 15:00:00
4          01/23/2021 19:00:00
                  ...         
5673225    07/31/2020 12:59:32
5673226    07/31/2020 13:00:00
5673227    07/31/2020 13:11:06
5673228    07/31/2020 17:00:00
5673229    07/31/2020 21:00:00
Name: DATE_TIME, Length: 5673230, dtype: object

In [None]:
df['DATE_TIME'] = pd.to_datetime(df['DATE_TIME']) # converts our new 'DATE_TIME' column into a datetime object

In [None]:
df.info() # dtype for 'DATE_TIME' column is now datetime64[ns]
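
On several million rows, passing an explicit `format` to `pd.to_datetime` usually avoids per-value format guessing. A small sketch on two sample strings in the same layout as the 'DATE_TIME' column (the sample Series is made up for illustration):

```python
import pandas as pd

# Two sample strings in the same layout as the DATE_TIME column
sample = pd.Series(["01/23/2021 03:00:00", "07/31/2020 12:59:32"])

# Explicit format: month/day/year, 24-hour time
parsed = pd.to_datetime(sample, format="%m/%d/%Y %H:%M:%S")
print(parsed.dtype)
print(parsed.iloc[1])
```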

### Check for Duplicate Entries
Each unique turnstile is represented by the same values for the following columns:  
'C/A'  
'UNIT'   
'SCP'   
'STATION'  
- Each combination of `C/A`, `UNIT`, `SCP`, and `STATION` represents a unique turnstile.
- For each turnstile, for a given day, there should be exactly 6 rows (1 for each time slot, i.e. datetime)
- For each turnstile, the `DATE_TIME` entries should not be duplicated.

In [None]:
# code modified from MTA 1 exercise
# verify that "C/A", "UNIT", "SCP", "STATION", "DATE_TIME" is unique
(df
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(65) # looks like there are 61 duplicate entries

In [None]:
# # Get back to this if time permits; count the keys that appear more than once
# (df
#  .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
#  .ENTRIES.count() > 1).sum()
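
A direct count of duplicated keys could look like the sketch below. It runs on a three-row toy frame (made-up values) rather than the real data, and shows two counts that are easy to confuse: extra rows versus distinct duplicated keys.

```python
import pandas as pd

keys = ["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"]

# Toy frame: the second and third rows share the same turnstile and timestamp
toy = pd.DataFrame({
    "C/A": ["A002", "A002", "A002"],
    "UNIT": ["R051", "R051", "R051"],
    "SCP": ["02-00-00"] * 3,
    "STATION": ["59 ST"] * 3,
    "DATE_TIME": pd.to_datetime(
        ["2021-01-23 03:00", "2021-01-23 07:00", "2021-01-23 07:00"]
    ),
    "ENTRIES": [7521371, 7521374, 7521375],
})

# duplicated() marks every repeat after the first occurrence of a key
n_extra_rows = int(toy.duplicated(subset=keys).sum())

# Number of distinct keys that occur more than once
n_dup_keys = int((toy.groupby(keys).size() > 1).sum())
print(n_extra_rows, n_dup_keys)
```

Note that `(condition).count()` counts all groups, while `(condition).sum()` counts only the groups where the condition is true.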

It seems when there are duplicate entries, the later row has a higher, presumably more accurate count for 'ENTRIES' or 'EXITS', hence let's just keep the later row.

In [None]:
df.shape

In [None]:
df.sort_values(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], 
                          inplace=True, ascending=False)
df2 = df.drop_duplicates(subset=["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"], inplace=False, keep='last').copy()

In [None]:
df2.shape

In [None]:
# did it drop the correct number of rows? yes
df.shape[0]-df2.shape[0] # 5673230 - 5673169 = 61 duplicates we observed above have been removed.

In [None]:
(df2
 .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE_TIME"])
 .ENTRIES.count()
 .reset_index()
 .sort_values("ENTRIES", ascending=False)).head(5) # no more duplicates

In [None]:
df2


In [None]:
# TIME is redundant now that we have DATE_TIME; keep DATE because the daily groupby below groups on it.
# let's keep DIVISION and DESC for now.
df2 = df2.drop(columns=["TIME"], errors="ignore") # drop() returns a new frame, so assign it back

In [None]:
# end of day entries and exits occur at 8pm (20:00). Strictly we should look at the first entry of the next day (00:00) and shift the dates over, but we'll say it's 8pm for now; come back and fix if time permits.

# There are 6 entries for each turnstile+date. df2 is sorted by DATE_TIME descending, so .first() keeps only the last timestamp (8pm) of the day and drops the other 5; this gives a daily time series, not a 4-hour time series.

df2_daily_cum = (df2
                        .groupby(["C/A", "UNIT", "SCP", "STATION", "DATE"],as_index=False)
                        [['LINENAME','DESC','DIVISION','ENTRIES','EXITS']].first()).copy()
# the groupby key 'DATE' is still the original string column; convert it to datetime
df2_daily_cum['DATE'] = pd.to_datetime(df2_daily_cum['DATE'])
df2_daily_cum

In [None]:
df2_daily_cum.info()

In [None]:
df2_daily_cum = df2_daily_cum.sort_values(by=['C/A','UNIT','SCP','STATION','DATE'], inplace=False)
df2_daily_cum

- as of now we only have the cumulative counter values for entries and exits
- let's get the daily counts

In [None]:
# find daily ENTRIES and EXITS ; right now these are cumulative values on the counters
# exits - previous exits = daily exit count ; likewise for entries

df2_daily_cum[["PREV_DATE", "PREV_ENTRIES", "PREV_EXITS"]] = (df2_daily_cum
                                                       .groupby(["C/A", "UNIT", "SCP", "STATION"])[["DATE", "ENTRIES", "EXITS"]]
                                                       .shift(1)) # previous day's values within each turnstile
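
To see what the per-turnstile shift does, here is a small sketch on a toy frame with two turnstiles (made-up values, and only 'SCP' as the grouping key to keep it short):

```python
import pandas as pd

toy = pd.DataFrame({
    "SCP": ["00-00-00", "00-00-00", "00-00-00", "00-00-01", "00-00-01"],
    "DATE": pd.to_datetime(
        ["2020-08-01", "2020-08-02", "2020-08-03", "2020-08-01", "2020-08-02"]
    ),
    "ENTRIES": [100, 150, 230, 10, 40],
})

# shift(1) within each turnstile: the first row of every group has no predecessor
toy["PREV_ENTRIES"] = toy.groupby("SCP")["ENTRIES"].shift(1)
toy["DAILY_ENTRIES"] = toy["ENTRIES"] - toy["PREV_ENTRIES"]
print(toy)
```

The first row of each turnstile gets NaN, which is why those rows have to be dropped before computing daily counts.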

In [None]:
df2_daily_cum

In [None]:
# Drop the rows for the earliest date in the df
df2_daily_cum.dropna(subset=["PREV_DATE"], axis=0, inplace=True)
df2_daily_cum

In [None]:
937597 - 932541 # The 1st date for each turnstile should have been dropped, therefore, there should be 5056 turnstiles

In [None]:
# to make life easier, let's make a list of the columns that identify a unique turnstile; we seem to be using it a lot.
turnstile_col_list = ['C/A','UNIT','SCP','STATION']

# how many unique turnstiles are there?
(df
 .groupby(turnstile_col_list)
 .count()) # note, we used df here: output: 5056 <-- correct
(df2_daily_cum
 .groupby(turnstile_col_list)
 .count()) # note, we used df2_daily_cum here: output: 5029 <-- ??? why is it less?
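
One possible explanation for the smaller count (an assumption, not checked against the real data): a turnstile that reports on only one date has its single row removed together with the other NaN rows, so it disappears entirely. A toy sketch of that effect:

```python
import pandas as pd

toy = pd.DataFrame({
    "SCP": ["A", "A", "A", "B"],  # turnstile B reports on one date only
    "DATE": pd.to_datetime(
        ["2020-08-01", "2020-08-02", "2020-08-03", "2020-08-01"]
    ),
    "ENTRIES": [100, 150, 230, 10],
})
toy["PREV_DATE"] = toy.groupby("SCP")["DATE"].shift(1)

before = toy["SCP"].nunique()
after = toy.dropna(subset=["PREV_DATE"])["SCP"].nunique()
print(before, after)
```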

In [None]:
# How many stations are there?
(df
 .groupby('STATION').count()) # for df, output: 379
(df2_daily_cum
 .groupby('STATION').count()) # for df2_daily_cum, output:379


### list of things to check when time permits  
- are the station names unique?


## fixing reverse entries  
- for a given day, for a given turnstile, PREV_ENTRIES should always be less than ENTRIES (entries for that day); if not, there was some error in entering the data.

In [None]:
# group by turnstile and see how many rows/days have an error where the previous entries is greater than the current day's entries
(df2_daily_cum[df2_daily_cum["ENTRIES"] < df2_daily_cum["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size().sort_values(ascending=False).head(50))
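#377 turnstiles have at least one of these errors. A lot of them seem to have 188 errors. 188 is probably the number of day-over-day rows per turnstile (27 weeks x 7 days = 189 days, minus the dropped first day), so those counters likely run backwards every day.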

In [None]:
(df2_daily_cum[df2_daily_cum["ENTRIES"] < df2_daily_cum["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size().sum())

In [None]:
reverse_sum = (df2_daily_cum[df2_daily_cum["ENTRIES"] < df2_daily_cum["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size().sum().copy())
reverse_sum / df2_daily_cum.shape[0]
# 0.90% of the rows have this reverse error

In [None]:
# # Fix this error: (code modified from MTA 3 exercise)
# def get_daily_counts(row, max_counter):
#     counter = row["ENTRIES"] - row["PREV_ENTRIES"]
#     if counter < 0:
#         counter = -counter
#     if counter > max_counter:
#         print(row["ENTRIES"], row["PREV_ENTRIES"])
#         return 0
#     return counter

# # If counter is > 1Million, then the counter might have been reset.  
# # Just set it to zero as different counters have different cycle limits
# _ = df2_daily_cum.apply(get_daily_counts, axis=1, max_counter=1000000)

In [None]:
# Fix this error: (code modified from MTA 3 exercise)
def get_daily_counts(row, max_counter):
    counter = row["ENTRIES"] - row["PREV_ENTRIES"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter
    if counter > max_counter:
        # Maybe counter was reset to 0? 
        print(row["ENTRIES"], row["PREV_ENTRIES"])
        counter = min(row["ENTRIES"], row["PREV_ENTRIES"])
    if counter > max_counter:
        # Check it again to make sure we're not still giving a counter that's too big
        return 0
    return counter

# !! we're assuming that there's no way more than max_counter people pass through a turnstile in a given day. If this happens, then something went wrong and it's probably more prudent to record a value of 0 to discount that day's daily count.
# This function does not correct the problem of the cumulative counts being reversed, but it makes sure that the daily count (non cumulative) is correct by flipping the minus sign.

# If counter is > max_counter, then the counter might have been reset.  
# Just set it to zero as different counters have different cycle limits
# The commented-out version above used 1 million as the limit; a significantly smaller number (200,000 here) is probably a better choice.

# create new column "DAILY_ENTRIES" of the daily count of turnstiles for entries, using the custom function above.
df2_daily_cum["DAILY_ENTRIES"] = df2_daily_cum.apply(get_daily_counts, axis=1, max_counter=200000)

In [None]:
(df2_daily_cum[df2_daily_cum["ENTRIES"] < df2_daily_cum["PREV_ENTRIES"]]
    .groupby(["C/A", "UNIT", "SCP", "STATION"])
    .size())

# notice the number of reversed entries does not change. For these rows the raw difference has the correct absolute value but a minus sign, which get_daily_counts flips to positive. So the daily counts are correct even though ENTRIES is smaller than PREV_ENTRIES.

In [None]:
df2_daily_cum.head()

In [None]:
# we created daily entries, now do the same for exits.
def get_daily_exits(row, max_counter):
    counter = row["EXITS"] - row["PREV_EXITS"]
    if counter < 0:
        # Maybe counter is reversed?
        counter = -counter
    if counter > max_counter:
        # Maybe counter was reset to 0? 
        print(row["EXITS"], row["PREV_EXITS"])
        counter = min(row["EXITS"], row["PREV_EXITS"])
    if counter > max_counter:
        # Check it again to make sure we're not still giving a counter that's too big
        return 0
    return counter

# If counter is > 1Million, then the counter might have been reset.  
# Just set it to zero as different counters have different cycle limits
# It'd probably be a good idea to use a number even significantly smaller than 1 million as the limit!

df2_daily_cum["DAILY_EXITS"] = df2_daily_cum.apply(get_daily_exits, axis=1, max_counter=200000)
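
The entries and exits functions differ only in the column names, and `apply(axis=1)` is slow on close to a million rows. A vectorized sketch of the same rule follows; `daily_counts` is a hypothetical helper, not part of the exercise code, and the toy values are made up:

```python
import numpy as np
import pandas as pd

def daily_counts(current, previous, max_counter):
    """Vectorized form of the row-wise rule above (hypothetical helper)."""
    counter = (current - previous).abs()      # reversed counters give negative differences
    smaller = np.minimum(current, previous)   # fallback when the counter looks reset
    counter = counter.where(counter <= max_counter, smaller)
    return counter.where(counter <= max_counter, 0)  # still too large: discard the day

toy = pd.DataFrame({
    "ENTRIES":      [150, 100, 50, 5_000_000],
    "PREV_ENTRIES": [100, 150, 9_000_000, 300_000],
})
toy["DAILY_ENTRIES"] = daily_counts(
    toy["ENTRIES"], toy["PREV_ENTRIES"], max_counter=200_000
)
print(toy["DAILY_ENTRIES"].tolist())
```

The same helper would work for exits by passing the 'EXITS' and 'PREV_EXITS' columns.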

In [None]:
df2_daily_cum # now we see that we have the daily entries and daily exit counts for each turnstile (last two columns)

## daily Station exits and entries (combine turnstiles within station)

- recall, a unique turnstile is defined by ['C/A','UNIT','SCP','STATION']
- it's the SCP column that specifies a unique turnstile within a station.
- to specify a unique station, group only by ['C/A','UNIT','STATION'], i.e. leave out SCP
- from MTA 3: "There are some ControlArea/Unit/Station groups that have a single
  turnstile, but most have multiple turnstiles -- same value for the
  C/A, UNIT and STATION columns, different values for the SCP column."
  
*** IMPORTANT NOTE ON ['C/A','UNIT','STATION'] vs ['STATION']:  
    --> ['STATION'] is the station name and is unique (supposedly, but I'm somewhat skeptical). But some stations have many lines running through them. ['C/A','UNIT','STATION'] may refer to different areas of the station where you can access different lines, while ['STATION'] refers to the physical station as a whole, covering all of its entrances/exits/lines/turnstiles.  
    --> I will just be using ['STATION'] for my project ie. station_daily DataFrame. Will create ca_unit_station_daily DataFrame as well but not sure if I'll use it

In [None]:
ca_unit_station_daily = df2_daily_cum.groupby(["C/A", "UNIT", "STATION", "DATE"])[['DAILY_ENTRIES','DAILY_EXITS']].sum().reset_index().copy()

ca_unit_station_daily

In [None]:
# are the station names unique? we should get the same dimensions if we group just by station and date
station_daily_test = df2_daily_cum.groupby(["STATION", "DATE"])[['DAILY_ENTRIES','DAILY_EXITS']].sum().reset_index()
station_daily_test
# ???? we get about half the number of rows, which implies a ton of names being reused; maybe ['C/A','UNIT','STATION'] specifies a part of the station? (see explanation in md cell above)

In [None]:
# Daily Entries and Exits for entire station
station_daily = df2_daily_cum.groupby(["STATION", "DATE"])[['DAILY_ENTRIES','DAILY_EXITS']].sum().reset_index().copy()

station_daily

In [None]:
# Total entries and exits for the entire 27 week period
# sorted by entries, descending
station_total_entries = station_daily.groupby('STATION')[['DAILY_ENTRIES', 'DAILY_EXITS']].sum()\
    .sort_values('DAILY_ENTRIES', ascending=False)\
    .reset_index().copy()
station_total_entries.rename(columns={'DAILY_ENTRIES': 'TOTAL_ENTRIES', 'DAILY_EXITS': 'TOTAL_EXITS'}, inplace=True)
station_total_entries

In [None]:
# Sorted by exits (descending)
station_total_exits = station_daily.groupby('STATION')[['DAILY_ENTRIES', 'DAILY_EXITS']].sum()\
    .sort_values('DAILY_EXITS', ascending=False)\
    .reset_index().copy()

station_total_exits.rename(columns={'DAILY_ENTRIES': 'TOTAL_ENTRIES', 'DAILY_EXITS': 'TOTAL_EXITS'}, inplace=True)
                           
station_total_exits

In [None]:
station_total_exits.sum()

In [None]:
diff = station_total_exits['TOTAL_ENTRIES'].sum() - station_total_exits['TOTAL_EXITS'].sum()
diff2 = station_total_entries['TOTAL_ENTRIES'].sum() - station_total_entries['TOTAL_EXITS'].sum()
diff == diff2

print(diff) # ~12M people gone missing. Where did they go?
diff / station_total_exits['TOTAL_ENTRIES'].sum() # 3.7% of the people that entered the subway disappeared

# but with counter?

In [None]:
# the top 10 exit stations are not the same as the top 10 entry stations
# BUT 34 ST-PENN STA is the most popular station for exits AND entries

station_total_entries['STATION'].head(10) == station_total_exits['STATION'].head(10)
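
The `==` comparison above checks rank positions one by one. To check whether the same stations appear in both top-10 lists regardless of rank, set operations are an option. A sketch on toy rankings (hypothetical station names):

```python
import pandas as pd

# Toy rankings (hypothetical station names)
by_entries = pd.Series(["A", "B", "C", "D"])
by_exits = pd.Series(["A", "C", "B", "E"])

# Element-wise == compares rank positions one by one
same_rank = (by_entries == by_exits).tolist()

# Set operations compare membership regardless of rank
in_both = sorted(set(by_entries) & set(by_exits))
only_entries = sorted(set(by_entries) - set(by_exits))
print(same_rank, in_both, only_entries)
```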

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# top 10 stations ranked by total entries; entries and exits shown side by side for the same stations
top10 = station_total_entries.head(10)
x = np.arange(len(top10))  # the label locations
width = 0.4  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, top10['TOTAL_ENTRIES'], width, label='Total Entries')
rects2 = ax.bar(x + width/2, top10['TOTAL_EXITS'], width, label='Total Exits')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Station Name')
ax.set_ylabel('# of Entries/Exits')
ax.set_title('Top 10 stations by total entries and exits')
ax.set_xticks(x)
ax.set_xticklabels(top10['STATION'], rotation=90)
ax.legend()

fig.tight_layout()
plt.show()
# fig.savefig('top_10.png')


# # side by side? while maintaining same stations?

# plt.bar(x=penn_day['DAY_OF_WEEK_NUM'] - 0.2,
#         height=penn_day['DAILY_ENTRIES'], width = 0.4, label="Total Entries")
# plt.bar(x=penn_day['DAY_OF_WEEK_NUM'] + 0.2,
#         height=penn_day['DAILY_EXITS'], width = 0.4, label="Total Exits")
# plt.legend(loc='upper left')
# plt.xlabel('Day of the week')
# plt.ylabel('Number of turnstile entries/exits (in millions)')
# plt.xticks(np.arange(7),['Mo','Tu','We','Th','Fr','St','Sn']) # changing x-tick labels np.arange(7) is the index location of the labels, then list for actual labels that we want for the tick mark.
# plt.title('Ridership by Day of the Week for 34 ST-PENN STA')
# plt.savefig('penn_week_plot.png')

## Plot daily entries and exits for Penn Station (busiest by exits and entries)

In [None]:
penn_daily = station_daily[station_daily["STATION"] == "34 ST-PENN STA"].copy() # .copy() gives an independent frame, so adding columns to penn_daily later does not trigger SettingWithCopyWarning
penn_daily

In [None]:
plt.figure(figsize=(15,5))
plt.plot(penn_daily['DATE'], penn_daily['DAILY_ENTRIES'], label="Daily Entries")
plt.plot(penn_daily['DATE'], penn_daily['DAILY_EXITS'], label="Daily Exits")
plt.legend(loc="upper right")
plt.ylabel('# of Entries/Exits')
plt.xlabel('Date')
plt.xticks(rotation=45)
plt.title('Daily Entries and Exits for 34 ST-PENN Station')


# ???? Thanksgiving and Christmas makes sense... What's with the rest?

??? 

Public holidays celebrated in NY include:

Labor Day Mon, Sep 7, 2020
Columbus Day Mon, Oct 12, 2020
Veterans Day Wed, Nov 11, 2020
Thanksgiving Thu, Nov 26, 2020
Christmas Day Fri, Dec 25, 2020
New Year's Day Fri, Jan 1, 2021
Martin Luther King Jr. Day Mon, Jan 18, 2021
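
To line the dips in the plot up with the calendar, the holidays above can be turned into a flag column. A minimal sketch, using a plain date range as a hypothetical stand-in for the 'DATE' column of penn_daily:

```python
import pandas as pd

# Holidays listed above
holidays = pd.to_datetime([
    "2020-09-07", "2020-10-12", "2020-11-11", "2020-11-26",
    "2020-12-25", "2021-01-01", "2021-01-18",
])

# Hypothetical stand-in for penn_daily: one row per day in the study window
days = pd.DataFrame({"DATE": pd.date_range("2020-07-25", "2021-01-29", freq="D")})
days["IS_HOLIDAY"] = days["DATE"].isin(holidays)
days["IS_WEEKEND"] = days["DATE"].dt.dayofweek >= 5  # Saturday=5, Sunday=6

print(len(days), int(days["IS_HOLIDAY"].sum()), int(days["IS_WEEKEND"].sum()))
```

Low-ridership days that are neither weekends nor holidays would then be the ones still needing an explanation.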

In [None]:
print(penn_daily.sort_values("DAILY_ENTRIES", ascending=False).head(15))
print(penn_daily.sort_values("DAILY_ENTRIES", ascending=False).tail(10))

## Weekly Plots

In [None]:
import numpy as np

In [None]:
penn_daily['DAY_OF_WEEK_NUM'] = pd.to_datetime(penn_daily['DATE']).dt.dayofweek
# .dt exposes the datetime attributes of a pandas Series; .dayofweek gives the day of the week as a number
penn_daily['WEEK_OF_YEAR'] = pd.to_datetime(penn_daily['DATE']).dt.isocalendar().week
# isocalendar().week gives the ISO week of the year (.dt.week is deprecated)
penn_daily.head() # 6 is Sunday 0 is Monday
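
`Series.dt.week` was deprecated in pandas 1.1 and removed in 2.0; `Series.dt.isocalendar()` returns the ISO year, week and day instead. A short sketch on a few dates from the study window shows how the week number behaves around New Year:

```python
import pandas as pd

dates = pd.Series(
    pd.to_datetime(["2020-07-25", "2020-12-28", "2021-01-01", "2021-01-04"])
)

day_of_week = dates.dt.dayofweek   # Monday=0 ... Sunday=6
iso = dates.dt.isocalendar()       # columns: year, week, day
print(day_of_week.tolist())
print(iso["week"].tolist())
print(iso["year"].tolist())
```

Friday, Jan 1, 2021 still belongs to ISO week 53 of 2020, so grouping by week number alone keeps that holiday week together.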

In [None]:
for i, group in penn_daily.groupby('WEEK_OF_YEAR'):
    plt.plot(group['DAY_OF_WEEK_NUM'], group['DAILY_ENTRIES'])
# just a for loop; grouping of the Week of the year, for every week it is; and for each group plotting, with day of week as x-axis and daily entries as y axis.
    
    
plt.xlabel('Day of the week')
plt.ylabel('Number of turnstile entries')
plt.xticks(np.arange(7),['Mo','Tu','We','Th','Fr','St','Sn']) # changing x-tick labels np.arange(7) is the index location of the labels, then list for actual labels that we want for the tick mark.
plt.title('Ridership per day for 34 ST-PENN STA')

In [None]:
# this one no good, see cell below

# for i, group in penn_daily.groupby('WEEK_OF_YEAR'):
#     plt.plot(group['DAY_OF_WEEK_NUM'], group['DAILY_ENTRIES'])
# # just a for loop; grouping of the Week of the year, for every week it is; and for each group plotting, with day of week as x-axis and daily entries as y axis.

penn_daily.groupby('DAY_OF_WEEK_NUM').sum('TOTAL_ENTRIES').plot.bar(rot=15, title="Ridership by Day of Week for 34 ST-PENN STA");
plt.show(block=True);

plt.xlabel('Day of the week')
plt.ylabel('Number of turnstile entries')
plt.xticks(np.arange(7),['Mo','Tu','We','Th','Fr','St','Sn']) # changing x-tick labels np.arange(7) is the index location of the labels, then list for actual labels that we want for the tick mark.
plt.title('Ridership per day for 34 ST-PENN STA')

In [None]:
penn_day = penn_daily.groupby('DAY_OF_WEEK_NUM')[['DAILY_ENTRIES', 'DAILY_EXITS']].sum().reset_index()
# plt.bar(X_axis - 0.2, Ygirls, 0.4, label = 'Girls')
# plt.bar(X_axis + 0.2, Zboys, 0.4, label = 'Boys')
plt.bar(x=penn_day['DAY_OF_WEEK_NUM'] - 0.2,
        height=penn_day['DAILY_ENTRIES'], width = 0.4, label="Total Entries")
plt.bar(x=penn_day['DAY_OF_WEEK_NUM'] + 0.2,
        height=penn_day['DAILY_EXITS'], width = 0.4, label="Total Exits")
plt.legend(loc='upper left')
plt.xlabel('Day of the week')
plt.ylabel('Number of turnstile entries/exits (in millions)')
plt.xticks(np.arange(7),['Mo','Tu','We','Th','Fr','St','Sn']) # changing x-tick labels np.arange(7) is the index location of the labels, then list for actual labels that we want for the tick mark.
plt.title('Ridership by Day of the Week for 34 ST-PENN STA')
plt.savefig('penn_week_plot.png')

In [None]:
penn_day = penn_daily.groupby('DAY_OF_WEEK_NUM')[['DAILY_ENTRIES', 'DAILY_EXITS']].sum().reset_index()
# plt.bar(x=penn_day['DAY_OF_WEEK_NUM'],
#         height=penn_day['DAILY_ENTRIES'], label="Total Entries")
# plt.bar(x=penn_day['DAY_OF_WEEK_NUM'],
#         height=penn_day['DAILY_EXITS'], label="Total Exits")
# plt.xlabel('Day of the week')
# plt.ylabel('Number of turnstile entries/exits')
# plt.xticks(np.arange(7),['Mo','Tu','We','Th','Fr','St','Sn']) # changing x-tick labels np.arange(7) is the index location of the labels, then list for actual labels that we want for the tick mark.
# plt.title('Ridership by Day of the Week for 34 ST-PENN STA')

fig, ax = plt.subplots()
# entries_bar = ax.bar(x=penn_day['DAY_OF_WEEK_NUM'],
#         height=penn_day['DAILY_ENTRIES'], label="Total Entries")
# exits_bar = ax.bar(x=penn_day['DAY_OF_WEEK_NUM'],
#         height=penn_day['DAILY_EXITS'], label="Total Exits")

width = 0.4  # the width of the bars
ax.bar(x=penn_day['DAY_OF_WEEK_NUM'] - width/2,
        height=penn_day['DAILY_ENTRIES'], width=width, label="Total Entries")
ax.bar(x=penn_day['DAY_OF_WEEK_NUM'] + width/2,
        height=penn_day['DAILY_EXITS'], width=width, label="Total Exits")

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_xlabel('Day of the week')
ax.set_ylabel('Number of turnstile entries/exits')
ax.set_title('Ridership by Day of the Week for 34 ST-PENN STA')
ax.set_xticks(np.arange(7))
ax.set_xticklabels(['Mo','Tu','We','Th','Fr','St','Sn'])
ax.legend()

# ax.bar_label(entries_bar, padding=3)
# ax.bar_label(entries_bar, padding=3)

fig.tight_layout()

plt.show()

In [None]:
labels = ['G1', 'G2', 'G3', 'G4', 'G5']
men_means = [20, 34, 30, 35, 27]
women_means = [25, 32, 34, 20, 25]

x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

fig, ax = plt.subplots()
rects1 = ax.bar(x - width/2, men_means, width, label='Men')
rects2 = ax.bar(x + width/2, women_means, width, label='Women')

# Add some text for labels, title and custom x-axis tick labels, etc.
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()

# ax.bar_label(rects1, padding=3)
# ax.bar_label(rects2, padding=3)

fig.tight_layout()

plt.show()