<a href="https://colab.research.google.com/github/learning-stack/Colab-ML-Playbook/blob/master/handson-ml/python_workshop_demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center>
    <img src="https://s3.amazonaws.com/weclouddata/images/logos/wcd_logo.png" width="50%">
</center>

----------

<h1 align="center"> Python for Data Science </h1>
<br>
<center align="left"> <font size='4'>  Developed by: </font><font size='4' color='#33AAFBD'>WeCloudData Academy </font></center>
<br>
<center align="left"> <font size='4' color='#FF5713'> Accelerating your data science career! </font></center>
<br>

----------

<h1><center>Why Python?</center></h1>
<center>
    <img src="https://imgs.xkcd.com/comics/python.png" width="40%">
</center>

-----

# $\Omega$ Pandas

<img src="https://upload.wikimedia.org/wikipedia/commons/c/cd/Panda_Cub_from_Wolong%2C_Sichuan%2C_China.JPG" width="30%">

> `pandas` is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

> `pandas` is well suited for many different kinds of data:

> * Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
> * Ordered and unordered (not necessarily fixed-frequency) time series data
> * Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
> * Any other form of observational / statistical data sets. The data actually need not be labeled at all to be placed into a pandas data structure

For more tutorials, visit: https://pandas.pydata.org/pandas-docs/stable/tutorials.html

-----

In [3]:
1 + 1

2

### Imports

In [0]:
import pandas as pd
import matplotlib.pyplot as plt
# Render plots directly within the cell
%matplotlib inline

### Read data

In [0]:
titanic = pd.read_csv('https://raw.githubusercontent.com/vinnywcd/datasets/master/titanic.csv')

### Help

In [0]:
pd.read_csv?

### Get first n rows

In [5]:
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


### Dataframe dimensions

In [6]:
titanic.shape

(1308, 14)

### Index

In [7]:
titanic.index

RangeIndex(start=0, stop=1308, step=1)

### Columns

In [8]:
titanic.columns

Index(['pclass', 'survived', 'name', 'sex', 'age', 'sibsp', 'parch', 'ticket',
       'fare', 'cabin', 'embarked', 'boat', 'body', 'home.dest'],
      dtype='object')

### Dataframe information

In [9]:
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1308 entries, 0 to 1307
Data columns (total 14 columns):
pclass       1308 non-null int64
survived     1308 non-null int64
name         1308 non-null object
sex          1308 non-null object
age          1045 non-null float64
sibsp        1308 non-null int64
parch        1308 non-null int64
ticket       1308 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1306 non-null object
boat         486 non-null object
body         120 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.1+ KB


### Summary statistics for numeric columns
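
A minimal sketch of this step, assuming a small stand-in frame named `sample` built from the five rows shown by `head()` above (the name and the column subset are illustrative, not part of the workshop data). On the full data the same call would be `titanic.describe()`.

```python
import pandas as pd

# Stand-in for `titanic`: the five rows printed by head(), with a subset of columns.
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# describe() reports count, mean, std, min, quartiles and max.
# By default it covers only the numeric columns, so "sex" is left out.
summary = sample.describe()
print(summary)
```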

# $\Omega$ Selecting Data

### Select column
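
A short sketch, assuming an illustrative stand-in frame `sample` taken from the `head()` rows above. Selecting with a single column name returns a `Series`.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
})

# Square brackets with one column name give back a Series.
ages = sample["age"]
print(type(ages))
print(ages)
```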

### Select multiple columns
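
A short sketch, again on an illustrative stand-in frame `sample`. Passing a list of names (note the double brackets) returns a `DataFrame` with just those columns.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
})

# The inner list names the columns to keep, in the order requested.
subset = sample[["sex", "age"]]
print(subset)
```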

### Select rows and columns by name
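
A minimal sketch of label-based selection with `.loc`, assuming an illustrative stand-in frame `sample`. With `.loc`, a row slice is by index label and includes both endpoints.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# Rows labeled 1 through 3 (inclusive) and the columns "sex" and "fare".
by_name = sample.loc[1:3, ["sex", "fare"]]
print(by_name)
```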

### Select rows and columns by range of indices
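
A minimal sketch of position-based selection with `.iloc`, assuming an illustrative stand-in frame `sample`. Unlike `.loc`, the end of an `.iloc` slice is excluded, as in ordinary Python slicing.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# First three rows (positions 0, 1, 2) and first two columns (positions 0, 1).
by_position = sample.iloc[0:3, 0:2]
print(by_position)
```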

### Select rows based on condition (filter)
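
A short sketch of boolean filtering, assuming an illustrative stand-in frame `sample`; the threshold of 18 is chosen only for the example. A comparison on a column yields a boolean `Series`, which then selects the matching rows.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
})

# Single condition: keep rows where age is above 18.
adults = sample[sample["age"] > 18]

# Combined conditions: wrap each in parentheses and join with & (and) or | (or).
adult_women = sample[(sample["sex"] == "female") & (sample["age"] > 18)]

print(adults)
print(adult_women)
```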

## $\Delta$ Exercise 1 - Dinesafe Data

https://www.toronto.ca/city-government/data-research-maps/open-data/open-data-catalogue/

**Questions:**
1. Read the Dinesafe data `data/dinesafe.csv`
2. Show the last 10 rows
3. Show the summary statistics
4. Select columns "latitude" and "longitude" and rows 100 to 200
5. Select age greater than 50

In [0]:
########################
# Your Code Below
########################

In [0]:
dinesafe = pd.read_csv('https://raw.githubusercontent.com/vinnywcd/datasets/master/dinesafe.csv')

# $\Omega$ Transforming Data

### Calculate ticket price in today's dollars
According to the Bureau of Labor Statistics consumer price index, prices in 2018 are 2,669.00% higher than prices in 1909, which corresponds to an average inflation rate of 3.09% per year.
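
One way to turn that figure into a conversion factor, as a sketch: "2,669.00% higher" is read here as a multiplier of 1 + 2669/100 = 27.69. The variable names and the two example fares (taken from the `head()` output above) are illustrative.

```python
import pandas as pd

# "X% higher" means the new price is (1 + X/100) times the old one.
PERCENT_HIGHER = 2669.00
multiplier = 1 + PERCENT_HIGHER / 100

# Two fares from the head() output, used only as an example.
fares_then = pd.Series([211.3375, 151.55], name="fare")

# Arithmetic on a Series applies to every element at once.
fares_2018 = fares_then * multiplier
print(multiplier)
print(fares_2018)
```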

### Add new column
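
A minimal sketch, assuming an illustrative stand-in frame `sample` and a hypothetical column name `fare_2018`; the factor 27.69 comes from reading "2,669.00% higher" as a multiplier of 1 + 26.69.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# Assigning to a new column name creates the column.
# `fare_2018` is a hypothetical name chosen for this example.
sample["fare_2018"] = sample["fare"] * 27.69
print(sample)
```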

### Count missing values
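
A short sketch, assuming an illustrative stand-in frame `sample` that reproduces the `boat` and `body` values from the `head()` rows above. `isnull()` marks missing entries, and summing the marks counts them per column.

```python
import pandas as pd

nan = float("nan")

# Stand-in for `titanic`: boat and body as they appear in the head() rows.
sample = pd.DataFrame({
    "boat": ["2", "11", None, None, None],
    "body": [nan, nan, nan, 135.0, nan],
})

# True counts as 1, so sum() gives the number of missing values per column.
missing_counts = sample.isnull().sum()

# The mean of the same boolean marks gives the share missing, here as a percentage.
missing_percent = sample.isnull().mean() * 100

print(missing_counts)
print(missing_percent)
```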

### Calculate mean
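
A minimal sketch, assuming the five ages from the `head()` rows plus one missing value added purely for illustration. `mean()` skips missing values by default.

```python
import pandas as pd

# Five ages from the head() output, plus one NaN added for illustration.
ages = pd.Series([29.0, 0.9167, 2.0, 30.0, 25.0, float("nan")], name="age")

# skipna=True is the default, so the NaN does not affect the result.
mean_age = ages.mean()
print(mean_age)
```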

### Fill missing values
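
A short sketch, assuming an illustrative stand-in frame `sample` in which one age has been blanked out for the example. Filling with the column mean is one common choice, not the only one.

```python
import pandas as pd

# Stand-in for `titanic`; the NaN at position 1 is inserted for illustration.
sample = pd.DataFrame({
    "age": [29.0, float("nan"), 2.0, 30.0, 25.0],
})

# Mean of the non-missing ages.
mean_age = sample["age"].mean()

# fillna() returns a new Series; assign it back to keep the result.
sample["age"] = sample["age"].fillna(mean_age)
print(sample)
```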

### Count categorical data
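
A minimal sketch, assuming an illustrative `sex` column taken from the `head()` rows. `value_counts()` tallies how often each category appears, sorted from most to least frequent.

```python
import pandas as pd

# `sex` values from the head() rows, used as a stand-in for the full column.
sex = pd.Series(["female", "male", "female", "male", "female"], name="sex")

# One row per category, with its frequency.
counts = sex.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = sex.value_counts(normalize=True)
print(shares)
```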

### Groupby & aggregate
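
A short sketch of split-apply-combine, assuming an illustrative stand-in frame `sample`. `groupby` splits the rows by a key column, and the aggregation is computed within each group.

```python
import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female"],
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
})

# Mean age within each sex.
mean_age_by_sex = sample.groupby("sex")["age"].mean()

# Several aggregations at once: a dict maps each column to its function.
summary = sample.groupby("sex").agg({"survived": "mean", "age": "max"})

print(mean_age_by_sex)
print(summary)
```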

### Save new dataframe as csv
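
A minimal sketch, assuming an illustrative stand-in frame `sample` and a temporary output path (in the workshop you would pick your own file name). Passing `index=False` leaves the row index out of the file; if the index is written, reading the file back usually produces an extra `Unnamed: 0` column.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the transformed dataframe.
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male", "female"],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# Hypothetical output location: a file inside a fresh temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "titanic_sample.csv")

# Write without the index, then read back to confirm the round trip.
sample.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded)
```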

## $\Delta$ Exercise 2 - Dinesafe Data

**Questions:**
1. Count the missing values in `AMOUNT_FINED` as a percentage
2. Create a new column called `AMOUNT_FINED_NOTNULL` and replace the missing values with 0
3. Get a count of every category in `ESTABLISHMENTTYPE`
4. For every category in `ESTABLISHMENTTYPE` get a count of those that were fined

In [0]:
########################
# Your Code Below
########################

# $\Omega$ Data Visualization
https://pandas.pydata.org/pandas-docs/stable/visualization.html

### Bar plot
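
A short sketch, assuming an illustrative `sex` column from the `head()` rows. The `Agg` backend line is only there so the example also runs outside a notebook; with `%matplotlib inline` it can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# `sex` values from the head() rows, used as a stand-in for the full column.
sex = pd.Series(["female", "male", "female", "male", "female"], name="sex")

# Count each category, then draw one bar per category.
ax = sex.value_counts().plot(kind="bar")
ax.set_ylabel("passengers")
```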

### Pie plot
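
A short sketch, assuming an illustrative `sex` column from the `head()` rows. The `Agg` backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# `sex` values from the head() rows, used as a stand-in for the full column.
sex = pd.Series(["female", "male", "female", "male", "female"], name="sex")

# One wedge per category; autopct labels each wedge with its percentage.
ax = sex.value_counts().plot(kind="pie", autopct="%.0f%%")
```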

### Set figure size
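
A minimal sketch, assuming an illustrative `sex` column from the `head()` rows. The `figsize` argument takes `(width, height)` in inches; the values here are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# `sex` values from the head() rows, used as a stand-in for the full column.
sex = pd.Series(["female", "male", "female", "male", "female"], name="sex")

# figsize=(width, height) in inches.
ax = sex.value_counts().plot(kind="bar", figsize=(8, 4))
width, height = ax.figure.get_size_inches()
print(width, height)
```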

### Histogram
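
A short sketch, assuming the five ages from the `head()` rows as a stand-in for the full `age` column. With no further arguments the values are grouped into 10 equal-width bins.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# Ages from the head() rows, used as a stand-in for the full column.
ages = pd.Series([29.0, 0.9167, 2.0, 30.0, 25.0], name="age")

# A histogram shows how many values fall into each interval.
ax = ages.plot(kind="hist")
ax.set_xlabel("age")
```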

### Set histogram bin size
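
A minimal sketch, assuming the five ages from the `head()` rows. The `bins` argument sets the number of intervals; 5 is an arbitrary choice for this tiny example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# Ages from the head() rows, used as a stand-in for the full column.
ages = pd.Series([29.0, 0.9167, 2.0, 30.0, 25.0], name="age")

# Fewer bins give a coarser picture, more bins a finer one.
ax = ages.plot(kind="hist", bins=5)
```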

### Boxplot
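
A short sketch, assuming the five ages from the `head()` rows. A boxplot draws the median, the quartiles and the whiskers of a numeric column.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# Ages from the head() rows, used as a stand-in for the full column.
ages = pd.Series([29.0, 0.9167, 2.0, 30.0, 25.0], name="age")

# kind="box" summarizes the distribution in a single box.
ax = ages.plot(kind="box")
```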

### Scatter plot
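
A minimal sketch, assuming an illustrative stand-in frame `sample` built from the `head()` rows. A scatter plot needs a `DataFrame` and the names of the columns for the two axes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "age": [29.0, 0.9167, 2.0, 30.0, 25.0],
    "fare": [211.3375, 151.55, 151.55, 151.55, 151.55],
})

# x and y are column names; each row becomes one point.
ax = sample.plot(kind="scatter", x="age", y="fare")
```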

### Transform then plot
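
A short sketch of chaining an aggregation into a plot, assuming an illustrative stand-in frame `sample`. Because `survived` is 0 or 1, its mean within a group is that group's survival rate.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend for running as a script

import pandas as pd

# Stand-in for `titanic` (illustrative subset of the head() rows).
sample = pd.DataFrame({
    "survived": [1, 1, 0, 0, 0],
    "sex": ["female", "male", "female", "male", "female"],
})

# Transform: survival rate per sex. Plot: one bar per group.
survival_by_sex = sample.groupby("sex")["survived"].mean()
ax = survival_by_sex.plot(kind="bar")
ax.set_ylabel("survival rate")
```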

## $\Delta$ Exercise 3 - Dinesafe Data

**Questions:**
1. Create a bar plot of `ESTABLISHMENT_STATUS`
2. Create a new dataframe called `dinesafe_fined` with only the rows where `AMOUNT_FINED` is not null
3. Create a boxplot of `AMOUNT_FINED` in this new dataframe
4. Create a histogram of `AMOUNT_FINED` with 30 bins
5. Create a **horizontal** bar plot of the mean `AMOUNT_FINED` by `ESTABLISHMENTTYPE`
6. Create a line plot of `INSPECTION_DATE` and `AMOUNT_FINED`

**Bonus:**
- Explore the dataset and see if you can come up with your own interesting analysis or plot

In [0]:
########################
# Your Code Below
########################