![pandas logo](https://pandas.pydata.org/_static/pandas_logo.png)

### What is Pandas?

* Open-source Python package
* Makes it easy to explore and manipulate tabular data, e.g. CSV files
* Useful for feature engineering for machine learning models

#### Import the Pandas package

In [None]:
import pandas as pd

#### What does our data look like?

In [None]:
df = pd.read_csv('tube_journeys.csv')

In [None]:
df.head(10)

In [None]:
df.tail()

#### Series vs. DataFrames

In [None]:
# selecting a single column returns a Series
df.date

In [None]:
# selecting a list of columns returns a DataFrame
df[['date', 'journey']]
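
To make the distinction concrete, here is a minimal sketch on a tiny made-up table (the rows and the name `sample` are assumptions for illustration, not the real `tube_journeys.csv`). It shows that a single column name gives a Series, while a list of names gives a DataFrame even when the list holds only one column.

```python
import pandas as pd

# Made-up sample rows; the real tube_journeys.csv is not reproduced here.
sample = pd.DataFrame({
    'date': ['01-Nov-2017', '02-Nov-2017'],
    'journey': ['Earls Court to Temple', 'Temple to Earls Court'],
    'charge': [2.40, 2.40],
})

one_column = sample['date']     # single name in brackets: a Series
same_column = sample.date       # attribute access: the same Series
sub_frame = sample[['date']]    # list of names: a DataFrame, even with one column

print(type(one_column).__name__)
print(type(sub_frame).__name__)
```

Attribute access such as `sample.date` is handy for exploration, but bracket access is the safer habit because it also works for column names that contain spaces or clash with DataFrame methods.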

#### How many rows are in our DataFrame?

In [None]:
len(df)

#### What are the column data types?

In [None]:
df.dtypes

#### How much money have I spent on tube journeys?

In [None]:
df.charge.sum()

In [None]:
df.charge.mean()

In [None]:
df.balance.min()

In [None]:
df.balance.max()

In [None]:
%matplotlib inline
df.charge.plot(kind='hist')

#### What's my most common journey?

In [None]:
df.journey.value_counts()

In [None]:
df.groupby('journey').size()
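
The two cells above count the same thing but order the result differently. A small sketch with made-up journey labels (the Series `journeys` is an assumption for illustration) shows the difference: `value_counts()` sorts by frequency, while `groupby(...).size()` sorts by label.

```python
import pandas as pd

# Made-up journey labels, for illustration only.
journeys = pd.Series(
    ['A to B', 'C to D', 'A to B', 'B to A', 'A to B', 'C to D'],
    name='journey',
)

by_frequency = journeys.value_counts()        # most common first
by_label = journeys.groupby(journeys).size()  # alphabetical by label

# Sorting the groupby result by count recovers the same top journey.
top_journey = by_label.sort_values(ascending=False).index[0]

print(by_frequency)
print(by_label)
print(top_journey)
```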

In [None]:
df[df.journey == 'Earls Court to Temple']

#### Which journeys are the most expensive?

In [None]:
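# numeric_only=True skips text columns, which would otherwise raise an error in recent pandas versions
df.groupby('journey').mean(numeric_only=True).sort_values('charge', ascending=False)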

#### What's my most common entry and exit station?

In [None]:
df.journey.head(10)

In [None]:
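# let's look at the bus journeys first (we'll remove them, along with top-ups, in the next cell)
# case=False so that 'bus' also matches 'Bus journey'

df[df.journey.str.contains('bus', case=False)]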

In [None]:
ignore = ['Auto top-up', 'Topped up', 'Topped-up', 'Bus journey', 'Entered']

df = df[~df.journey.str.contains('|'.join(ignore))]
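
The filter above works because `str.contains` treats its argument as a regular expression, so joining the terms with `|` means "match any of these". Here is a minimal sketch on made-up rows (the Series `journeys` and its contents are assumptions). It also escapes each term with `re.escape`, which is a defensive extra in case a term ever contains regex characters, and passes `na=False` so missing values count as "no match".

```python
import re
import pandas as pd

# Made-up rows, for illustration only.
journeys = pd.Series([
    'Earls Court to Temple',
    'Auto top-up, Earls Court',
    'Bus journey, route 74',
    'Topped up',
    'Temple to Earls Court',
])

ignore = ['Auto top-up', 'Topped up', 'Bus journey']

# 'a|b|c' matches a row containing any one of the terms.
pattern = '|'.join(re.escape(term) for term in ignore)

is_ignored = journeys.str.contains(pattern, na=False)
tube_only = journeys[~is_ignored]   # ~ flips True/False, keeping the non-matching rows

print(tube_only)
```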

In [None]:
df['start_station'] = df.journey.apply(lambda x: x.split(' to ')[0])

In [None]:
df['end_station'] = df.journey.apply(lambda x: x.split(' to ')[1])
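
As an alternative to `apply` with a lambda, pandas has a vectorised `str.split`. The sketch below uses made-up rows (the DataFrame `trips` is an assumption) and shows one possible way to fill both columns in a single split. With `expand=True`, a row that has no `' to '` in it gets a missing end station instead of raising an `IndexError`.

```python
import pandas as pd

# Made-up rows, for illustration only; the last one has no ' to ' separator.
trips = pd.DataFrame({
    'journey': ['Earls Court to Temple', 'Temple to Victoria', 'Entered Earls Court'],
})

# n=1 splits on the first ' to ' only; expand=True returns one column per part.
parts = trips['journey'].str.split(' to ', n=1, expand=True)
trips['start_station'] = parts[0]
trips['end_station'] = parts[1]

print(trips)
```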

In [None]:
df.start_station.value_counts()

In [None]:
df.head()

#### What can we do with dates and times?

In [None]:
df['start_time'] = pd.to_datetime(df.date + ' ' + df.start)

In [None]:
df['end_time'] = pd.to_datetime(df.end_time if False else df.date + ' ' + df.end)

In [None]:
df.head()

In [None]:
df.dtypes
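
The conversion above relies on pandas working out the date format by itself. If the format is known, it can be safer to state it explicitly, since a date like `01/11/2017` is ambiguous between 1 November and 11 January. The sketch below uses made-up rows and an assumed day-first `dd/mm/yyyy` format; the real file may well use a different one, so adjust `format` to match.

```python
import pandas as pd

# Made-up rows with an ASSUMED day/month/year format, for illustration only.
sample = pd.DataFrame({
    'date': ['01/11/2017', '02/11/2017'],
    'start': ['08:15', '17:40'],
})

sample['start_time'] = pd.to_datetime(
    sample['date'] + ' ' + sample['start'],
    format='%d/%m/%Y %H:%M',   # explicit format removes the day/month ambiguity
)

print(sample['start_time'])
print(sample.dtypes)
```

Passing `errors='coerce'` as well would turn unparseable rows into `NaT` rather than raising an error.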

#### What time of the day do I use the tube most often?

In [None]:
df.start_time.dt.hour

In [None]:
df['start_hour'] = df.start_time.dt.hour

In [None]:
df.groupby('start_hour')['start_time'].count().plot(kind='bar', rot=0)

In [None]:
df.start_time.dt.minute.plot(kind='hist', bins=30)

#### What day of the week or month of the year do I use the tube most often?

In [None]:
# weekday_name was removed in pandas 1.0; day_name() is its replacement
df.start_time.dt.day_name().value_counts()

In [None]:
df.start_time.dt.month.value_counts()

In [None]:
df[df.start_time.dt.month == 11]
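
Counts by day name come back sorted by frequency, which is not always what we want for a chart. Here is a minimal sketch, on made-up timestamps, of putting the days back into calendar order with `reindex` (the names `start_times` and `day_order` are assumptions for illustration).

```python
import pandas as pd

# Made-up timestamps, for illustration only.
start_times = pd.Series(pd.to_datetime([
    '2017-11-06 08:15',  # Monday
    '2017-11-07 08:20',  # Tuesday
    '2017-11-07 17:45',  # Tuesday
    '2017-11-11 12:00',  # Saturday
]))

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

counts = (
    start_times.dt.day_name()
    .value_counts()
    .reindex(day_order, fill_value=0)   # calendar order, zero for days with no journeys
)

print(counts)
```

Calling `counts.plot(kind='bar')` on the result would then draw the bars from Monday to Sunday.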

## Quiz 1

#### 1) How much money have I spent going from Earls Court to Temple?

#### 2) How many unique stations have I tapped in at?

#### 3) Draw a bar chart of the number of journeys by month.

#### How long are my tube journeys?

In [None]:
# journey duration in minutes
df['duration'] = (df.end_time - df.start_time).dt.total_seconds() / 60

In [None]:
df.duration.describe()

In [None]:
# the shortest "journey" has a negative duration, so let's take a look at it
df[df.duration == df.duration.min()]

In [None]:
df[df.duration >= 0].duration.plot(kind='hist', bins=60, figsize=(10, 8))

In [None]:
df[df.duration > 60]
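
One plausible explanation for a negative duration, assuming each row records a single date for the whole journey, is a trip that crosses midnight: a 6-minute journey from 23:57 to 00:03 would come out as -1434 minutes. The sketch below, on made-up rows (the DataFrame `trips` is an assumption), shows one way to correct this by pushing the end time forward a day whenever it falls before the start time.

```python
import pandas as pd

# Made-up rows, for illustration only; the second journey crosses midnight
# but both of its timestamps carry the same date.
trips = pd.DataFrame({
    'start_time': pd.to_datetime(['2017-11-01 08:15', '2017-11-01 23:57']),
    'end_time': pd.to_datetime(['2017-11-01 08:40', '2017-11-01 00:03']),
})

# An end time earlier than the start time is taken to mean "the next day".
crossed_midnight = trips['end_time'] < trips['start_time']
trips.loc[crossed_midnight, 'end_time'] += pd.Timedelta(days=1)

trips['duration'] = (trips['end_time'] - trips['start_time']).dt.total_seconds() / 60

print(trips)
```

Whether this is the right fix for the real data depends on how the export records late-night journeys, so it is worth checking the affected rows by eye first.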

#### What zones do I use the most?

In [None]:
stations = pd.read_csv('stations.csv')

In [None]:
stations.tail()

In [None]:
df.head()

In [None]:
df_merged = df.merge(stations, left_on='start_station', right_on='station')

In [None]:
df_merged.zone.value_counts()

In [None]:
df_merged.head()
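
By default `merge` performs an inner join, so any journey whose start station has no exact match in the stations table is silently dropped. Here is a minimal sketch, on made-up tables (the names `trips`, `stations_sample` and the zone values are assumptions), of using `how='left'` with `indicator=True` to find the rows that failed to match.

```python
import pandas as pd

# Made-up tables, for illustration only. Note the differently spelled last station.
trips = pd.DataFrame({'start_station': ['Earls Court', 'Temple', "Earl's Court"]})
stations_sample = pd.DataFrame({
    'station': ['Earls Court', 'Temple'],
    'zone': ['1,2', '1'],
})

checked = trips.merge(
    stations_sample,
    how='left',                 # keep every journey, matched or not
    left_on='start_station',
    right_on='station',
    indicator=True,             # adds a '_merge' column describing each row's match
)

unmatched = checked.loc[checked['_merge'] == 'left_only', 'start_station']

print(checked)
print(unmatched)
```

Comparing `len(df)` with `len(df_merged)` is a quick first check for whether any rows were lost in the join.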

## Quiz 2

#### 1) How long does it take on average to get from Earls Court to Temple? Draw the distribution.

#### 2) What is the most northerly station I've started a journey at?

#### 3) Create a column containing only the smallest zone for stations that fall in multiple zones.

## What have we done?

* We've calculated basic summary statistics.
* We've manipulated columns to extract further information.
* We've played with datetime objects.
* We've produced simple visualisations of our data.

## Where next?

* [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) (can be dense and convoluted)
* [Pandas Cheat Sheet](https://www.datacamp.com/community/blog/python-pandas-cheat-sheet)
* [Python for Data Analysis book](https://www.dropbox.com/s/z5buriod3xks594/Python%20for%20Data%20Analysis%20-%20Wes%20McKinney.pdf?dl=0)

[GitHub repo of this notebook and the answers to the quizzes](https://github.com/imrankhan17/my-tube-journeys)