# Data Analysis Workshop

Agenda:
* Jupyter Notebook (10mins)
* Python Review (20mins)
* Pandas Introduction (1hr)
* Modeling (30mins)

### Jupyter Notebook

Advantages:
1. Easy to reproduce work.
2. Annotate steps with `markdown`, making it easier for others to understand your code.
3. Run specific code chunks.
4. Insert images and diagrams.

### Python review
(and a bit more)

Libraries

Assigning variables and basic operations

In [None]:
# add, subtract, multiply, divide, exponentiate

In [None]:
# logical operators 
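
As one possible way to fill in the two cells above, the sketch below assigns two variables and runs the basic arithmetic and logical operators on them. The variable names and values are illustrative only.

```python
# Assign values to variables (names and values are illustrative).
a = 7
b = 2

total = a + b        # addition
difference = a - b   # subtraction
product = a * b      # multiplication
quotient = a / b     # true division always returns a float
power = a ** b       # exponentiation

# Comparisons return booleans, which combine with and / or / not.
both = (a > b) and (b > 0)
either = (a < b) or (b > 0)
negated = not (a == b)

print(total, difference, product, quotient, power)
print(both, either, negated)
```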

Data types

In [None]:
# Strings

In [None]:
# Integers

In [None]:
# Floats
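
A minimal sketch of the three data types above, with made-up values. It also shows explicit conversion between types, which comes up later when strings need to be turned into numbers.

```python
name = "workshop"   # str
count = 3           # int
price = 2.5         # float

print(type(name), type(count), type(price))

# Mixing an int and a float in arithmetic yields a float.
total = count * price

# Convert between types explicitly.
as_int = int("42")
as_float = float("3.14")
as_str = str(count)
```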

Some handy string methods
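
Besides `replace`, a few other built-in string methods tend to be useful when cleaning data. The sketch below uses an illustrative string; every method returns a new value and leaves the original string unchanged.

```python
text = "  Data Analysis Workshop  "

stripped = text.strip()               # remove leading and trailing whitespace
lowered = stripped.lower()            # lower-case copy
words = stripped.split(" ")           # split into a list of words
joined = "-".join(words)              # join a list back into one string
starts = stripped.startswith("Data")  # True or False

print(stripped, lowered, words, joined, starts)
```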

In [None]:
"LOL".replace("O","U")

Objects

In [None]:
class student():
    def __init__(self):
        self.python_skill = 0
    def attend_workshop(self):
        self.python_skill = 9999
    def do_math(self,a,b):
        return a + b

In [None]:
# Create a student object

This is a student object. It has one attribute `python_skill` and two methods `attend_workshop` & `do_math`.

In [None]:
# access the python_skill attribute


In [None]:
# use the attend workshop method


In [None]:
# check back on the python_skill attribute


In [None]:
# not all methods have to change attributes
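
The four cells above could be filled in along these lines. The class is repeated so the sketch runs on its own; the variable name `alice` is an arbitrary choice.

```python
class student():
    def __init__(self):
        self.python_skill = 0
    def attend_workshop(self):
        self.python_skill = 9999
    def do_math(self, a, b):
        return a + b

alice = student()              # create a student object
before = alice.python_skill    # access the attribute
alice.attend_workshop()        # this method changes the attribute
after = alice.python_skill     # check back on the attribute
result = alice.do_math(2, 3)   # this method changes nothing on the object

print(before, after, result)
```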


Lists and indexing

In [None]:
stuff = ["apples","oranges","bananas"]
# print out each item in the list forward and backwards
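
One way to approach this cell: index from the front with non-negative positions, from the back with negative ones, and loop in either direction. This is a sketch rather than the only solution.

```python
stuff = ["apples", "oranges", "bananas"]

first = stuff[0]       # indexing starts at 0
last = stuff[-1]       # negative indices count from the end
first_two = stuff[:2]  # slices exclude the end position

# Forward
for item in stuff:
    print(item)

# Backward
for item in reversed(stuff):
    print(item)

backwards = stuff[::-1]  # a reversed copy made with a slice
```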

Functions

In [None]:
# define a function that takes a number and adds one to it.

In [None]:
# Lambda functions: a concise way to write functions for one time use
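
A possible answer to the two cells above. The names `add_one` and `add_one_lambda` are illustrative; the lambda version does the same thing in one line and is usually passed straight to another function instead of being given a name.

```python
def add_one(number):
    """Return the given number plus one."""
    return number + 1

# The same logic written as a lambda expression.
add_one_lambda = lambda number: number + 1

# Lambdas are typically used inline, for example with map.
numbers = [1, 2, 3]
incremented = list(map(lambda n: n + 1, numbers))

print(add_one(1), add_one_lambda(41), incremented)
```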

### Pandas

Pandas - A Python Library for Data Analysis
- Introduction to DataFrames
- Sub-setting Data
- Apply
- Simple Plots
- Merge
- Groupby
- Linear Regression without the Math

### Introduction to DataFrames

Pandas is a library for data analysis

In [None]:
# import the pandas library

Let's try reading in a CSV file! Here, `df` is the variable name we assign to the DataFrame object returned by the pandas function `read_csv`.

In [None]:
# read a csv file

In [None]:
# head() takes a peek at the first few rows

In [None]:
# get a list of column names
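
The cells above might look roughly like the following. Because the workshop's CSV file is not named here, the sketch reads from an in-memory string instead; the column names and values are invented for illustration and may not match the real data.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; columns and values are made up.
csv_text = """month,flat_type,floor_area_sqm,resale_price
2018-04,3 ROOM,67.0,300000
2018-04,4 ROOM,92.0,420000
2018-05,4 ROOM,95.0,450000
2018-05,5 ROOM,110.0,520000
"""

# With a real file, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))                # first two rows (head() defaults to five)
column_names = list(df.columns)  # column names as a plain list
print(column_names)
print(df.shape)                  # (number of rows, number of columns)
```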

### Sub-setting Data

In [None]:
# By columns

In [None]:
# By rows

In [None]:
# By rows according to some logic
# e.g. get the rows where floor area > 60
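
A sketch of the three kinds of sub-setting on a small made-up DataFrame. The column name `floor_area_sqm` is an assumption about the workshop data set.

```python
import pandas as pd

# Made-up data; column names are assumed, not taken from the real file.
df = pd.DataFrame({
    "flat_type": ["3 ROOM", "4 ROOM", "4 ROOM", "5 ROOM"],
    "floor_area_sqm": [55.0, 92.0, 60.0, 110.0],
    "resale_price": [300000, 420000, 380000, 520000],
})

one_column = df["resale_price"]                  # a single column (Series)
two_columns = df[["flat_type", "resale_price"]]  # a list of columns (DataFrame)
first_two_rows = df.iloc[0:2]                    # rows selected by position

# A comparison produces a boolean Series; indexing with it keeps the True rows.
large_flats = df[df["floor_area_sqm"] > 60]

print(large_flats)
```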

### Series

In [None]:
# A DataFrame is made up of columns. Each column is a Series object.

In [None]:
# check the max,min,mean

In [None]:
# get unique values & value counts
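
The Series methods mentioned above can be tried on a small made-up DataFrame, as in the sketch below. The column names are assumptions.

```python
import pandas as pd

# Made-up data; column names are assumed.
df = pd.DataFrame({
    "flat_type": ["3 ROOM", "4 ROOM", "4 ROOM", "5 ROOM"],
    "resale_price": [300000, 420000, 380000, 520000],
})

prices = df["resale_price"]  # selecting one column gives a Series
highest = prices.max()
lowest = prices.min()
average = prices.mean()

unique_types = df["flat_type"].unique()       # distinct values, in order of appearance
type_counts = df["flat_type"].value_counts()  # how often each value occurs

print(highest, lowest, average)
print(unique_types)
print(type_counts)
```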

Exercises

In [None]:
# exercise: find the average price of a 4 room flat

In [None]:
# exercise: get the value counts for the top 10 flat_models

Relevant information: https://www.hdb.gov.sg/cs/infoweb/residential/buying-a-flat/resale/types-of-flats<br>
From here we can see how to order the room types.

### Apply

In [None]:
# apply the replace function to convert all the rows into numbers

In [None]:
# Sort the value counts in order of room type

In [None]:
# convert them from string to int
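
The three steps above could be sketched as follows. The sketch assumes every value has the form "N ROOM"; values of any other shape would need separate handling before the integer conversion.

```python
import pandas as pd

# Made-up values, all assumed to follow the "N ROOM" pattern.
flat_type = pd.Series(["3 ROOM", "4 ROOM", "4 ROOM", "5 ROOM", "3 ROOM", "4 ROOM"])

# apply runs a function on every element; here it strips the " ROOM" suffix.
rooms_as_text = flat_type.apply(lambda value: value.replace(" ROOM", ""))

# astype converts the remaining digit strings to integers.
rooms = rooms_as_text.astype(int)

# value_counts sorts by frequency; sort_index reorders by room number.
counts_by_room = rooms.value_counts().sort_index()

print(counts_by_room)
```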

### Merge

In [None]:
example_df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
example_df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                    'hire_date': [2004, 2008, 2012, 2014]})

In [None]:
# merge the two data frames based on the "employee" column
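
One way to do the merge, using `pd.merge` with the `on` argument. The two example frames are repeated so the sketch runs on its own; the result name `merged` is arbitrary.

```python
import pandas as pd

example_df1 = pd.DataFrame({'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
                            'group': ['Accounting', 'Engineering', 'Engineering', 'HR']})
example_df2 = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
                            'hire_date': [2004, 2008, 2012, 2014]})

# Rows are matched on the "employee" value, not on their position,
# so the different row order in the two frames does not matter.
merged = pd.merge(example_df1, example_df2, on="employee")

print(merged)
```

Each employee ends up on one row with both `group` and `hire_date`.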

example taken from: https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html 
(visit to find out more!)

### "Machine Learning"

Label, $y$: Resale Price <br>
Features, $X$: Whatever information we think is predictive

Model: $f(X) = \hat{y}$
<br>A trained model is obtained by learning repeatedly from observations $(X,y)$ until it produces $\hat{y} \approx y$.<br>

Workflow:

0. Make sure data is in the right form (numeric), construct $X$
1. Split data into training set and test set
2. Pick a model, train it with training set
3. See how it performs on the test set
4. Make changes if the performance is not good enough

### Feature engineering

In [None]:
# read the "geocoded_school_data.csv" file as a dataframe

In [None]:
# read the "address_coordinates.csv" file as a dataframe

In [None]:
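import ast
# Parse the string-formatted tuples; literal_eval accepts only literals, unlike eval.
geocoded_df.coords = geocoded_df.coords.apply(ast.literal_eval)
schools.coords = schools.coords.apply(ast.literal_eval)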

In [None]:
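import math
def calculate_dist(lat, lng, lat2, lng2):
    """Approximate straight-line distance between two (lat, lng) points."""
    # Appears to be degrees of arc per metre (about 1/111,127), so dividing
    # by it converts a difference in degrees to a distance roughly in metres.
    rad = 0.000008998719243599958
    diff_lat = float(lat) - float(lat2)
    diff_lng = float(lng) - float(lng2)
    dist = math.sqrt(diff_lng**2 + diff_lat**2) / rad
    return dist

def count_method(lat, lng, list_landmarks, dist):
    """Count the landmarks within `dist` of the point (lat, lng)."""
    count = 0
    maxdist = dist
    for landmark in list_landmarks:
        dist = calculate_dist(lat, lng, landmark[0], landmark[1])
        if dist <= maxdist:
            count += 1
    return count

def dist_method(lat, lng, list_landmarks):
    """Return the distance from (lat, lng) to the nearest landmark."""
    shortest_dist = 10000000
    for landmark in list_landmarks:
        dist = calculate_dist(lat, lng, landmark[0], landmark[1])
        if dist < shortest_dist:
            shortest_dist = dist
    return shortest_dist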

In [None]:
# handle tuples in string format

In [None]:
# get number of schools in 1km radius 

In [None]:
# get distance to nearest school
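
The two feature cells above might be filled in along these lines. The sketch uses condensed versions of the helper functions so that it runs on its own, and it rests on three assumptions: the coordinates are made up, each tuple is ordered (lat, lng), and the helper returns distances roughly in metres, so 1 km is passed as 1000. The frame name `flats` and the new column names are hypothetical.

```python
import math
import pandas as pd

def calculate_dist(lat, lng, lat2, lng2):
    rad = 0.000008998719243599958  # constant copied from the workshop helper
    return math.sqrt((float(lng) - float(lng2))**2 + (float(lat) - float(lat2))**2) / rad

def count_method(lat, lng, list_landmarks, dist):
    # Number of landmarks no farther than `dist` from the point.
    return sum(1 for lm in list_landmarks
               if calculate_dist(lat, lng, lm[0], lm[1]) <= dist)

def dist_method(lat, lng, list_landmarks):
    # Distance to the closest landmark (condensed form using min).
    return min(calculate_dist(lat, lng, lm[0], lm[1]) for lm in list_landmarks)

# Made-up (lat, lng) tuples standing in for the parsed coords columns.
school_coords = [(1.3020, 103.8000), (1.3000, 103.8050), (1.3300, 103.8000)]
flats = pd.DataFrame({"coords": [(1.3000, 103.8000), (1.3310, 103.8000)]})

# apply calls the helper once per row, passing that row's coordinate tuple.
flats["schools_within_1km"] = flats["coords"].apply(
    lambda c: count_method(c[0], c[1], school_coords, 1000))
flats["nearest_school_dist"] = flats["coords"].apply(
    lambda c: dist_method(c[0], c[1], school_coords))

print(flats)
```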

In [None]:
# check unique values for storey range

Re-bin storey range column
1. 01 to 09
2. 10 to 21
3. 22 & above

In [None]:
df.storey_range = df.storey_range.replace(["01 TO 03", "04 TO 06", "07 TO 09"],1)
df.storey_range = df.storey_range.replace(["10 TO 12", "13 TO 15", "16 TO 18","19 TO 21"],2)

In [None]:
list(df.storey_range.unique())[2:]

In [None]:
df.storey_range = df.storey_range.replace(list(df.storey_range.unique())[2:],3)
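
The slicing approach above depends on the order in which `unique()` happens to return the remaining labels. An alternative sketch, assuming every label has the form "NN TO NN", bins each label by its lowest floor instead. The function name `storey_bin` and the sample labels are illustrative.

```python
import pandas as pd

# Made-up labels, assumed to follow the "NN TO NN" pattern.
storey_range = pd.Series(["01 TO 03", "10 TO 12", "19 TO 21", "22 TO 24", "40 TO 42"])

def storey_bin(label):
    """Map a label such as "07 TO 09" to bin 1, 2, or 3 by its lowest floor."""
    lowest_floor = int(label.split(" TO ")[0])
    if lowest_floor <= 9:
        return 1   # 01 to 09
    if lowest_floor <= 21:
        return 2   # 10 to 21
    return 3       # 22 & above

binned = storey_range.apply(storey_bin)

print(binned.tolist())
```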

### Train Test Split

In [None]:
# Split the data set into training and testing
# test: the month of May 2018
# train: everything before that

In [None]:
# Separate train set into features and labels

In [None]:
# Separate test set into features and labels
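
A sketch of the split described above, on made-up data. It assumes a `month` column stored as "YYYY-MM" strings and a `resale_price` label column; the feature columns are chosen only for illustration.

```python
import pandas as pd

# Made-up data; column names and the "YYYY-MM" month format are assumptions.
df = pd.DataFrame({
    "month": ["2018-03", "2018-04", "2018-05", "2018-05"],
    "floor_area_sqm": [67.0, 92.0, 95.0, 110.0],
    "storey_range": [1, 2, 2, 3],
    "resale_price": [300000, 420000, 450000, 520000],
})

feature_columns = ["floor_area_sqm", "storey_range"]

# "YYYY-MM" strings sort chronologically, so string comparison is enough.
test = df[df["month"] == "2018-05"]
train = df[df["month"] < "2018-05"]

# Features (X) are the predictive columns; the label (y) is the resale price.
train_X, train_y = train[feature_columns], train["resale_price"]
test_X, test_y = test[feature_columns], test["resale_price"]

print(train_X.shape, test_X.shape)
```

Splitting by date instead of at random keeps the test set strictly later than the training data, which mirrors how the model would be used on future sales.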

### Linear regression

In [None]:
from sklearn import linear_model
from sklearn.metrics import r2_score

In [None]:
reg = linear_model.LinearRegression()
reg.fit(train_X,train_y)
yhat = reg.predict(test_X)

In [None]:
r2_score(test_y, yhat)
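
To make the score easier to read, here is a plain-Python sketch of the usual definition of the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$, which is what `r2_score` is understood to compute by default. The helper name `r_squared` and the numbers are illustrative.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [3.0, 5.0, 7.0, 9.0]

# Perfect predictions leave no residual error.
perfect = r_squared(actual, actual)

# Always predicting the mean of the labels explains none of the variation.
baseline = r_squared(actual, [6.0, 6.0, 6.0, 6.0])

print(perfect, baseline)
```

A score near 1 means the predictions track the labels closely, a score near 0 means the model does no better than predicting the mean, and the score can go negative for a model that does worse than that.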

### XGBoost

In [None]:
!pip install xgboost

In [None]:
import xgboost as xgb

In [None]:
dtrain = xgb.DMatrix(train_X, label=train_y)
dtest = xgb.DMatrix(test_X)

In [None]:
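# Newer xgboost releases use 'verbosity' in place of 'silent' and 'reg:squarederror' in place of 'reg:linear'.
param = {'max_depth': 2, 'eta': 1, 'verbosity': 0, 'objective': 'reg:squarederror'}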
param['nthread'] = 4
param['eval_metric'] = 'rmse'

num_round = 10
bst = xgb.train(param, dtrain, num_round)

In [None]:
yhat = bst.predict(dtest)

In [None]:
r2_score(test_y, yhat)

### Visualisation (Bonus)

https://python-graph-gallery.com/140-basic-pieplot-with-panda/

## Extra Resources

Python for Data Analysis: http://shop.oreilly.com/product/0636920023784.do