# Hertie Data Science Society 

## Session 1

We will start with a simple task: training a machine learning model to learn the relationship between two numbers, and then predicting one number by feeding the model the other. 

To put it into context, let's say that those two numbers are: 
- X: the number of bedrooms in a house
- Y: the price of that house 

And let's assume that there's a very simple linear relationship between these two: Y = 50000 + 50000 * X. We know that this relationship exists (it's by our design). But our model doesn't know that. So we need to teach it to learn this relationship.
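
To make the assumed relationship concrete, here is a minimal sketch that computes the price from the number of bedrooms. The function name `house_price` and its parameter names are illustrative and not part of the original notebook.

```python
def house_price(bedrooms, base=50000, per_bedroom=50000):
    """Return the price implied by Y = 50000 + 50000 * X."""
    return base + per_bedroom * bedrooms

# Prices for houses with 1 to 7 bedrooms
prices = [house_price(x) for x in range(1, 8)]
print(prices)
```

The first six values are the training prices used later in this session, and the last one is the price the models will be asked to predict.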


## Import 

We will begin by importing the relevant libraries. Libraries, as explained, are collections of code and algorithms that others have written before us to help solve specific problems (they are called packages in R and libraries in Python). Most mainstream numerical problems in data science have already been encountered and efficiently solved by others, so we can reuse their code and solutions through libraries. 

We need to import these libraries first, and then call the relevant function or algorithm from the library when we want to use it. The more you practice and solve data problems, the more you will work with these libraries and discover that a select few are used for most of the tasks you need to implement. 

Here are a few of the important Python libraries for data science:
- **numpy**: low-level data manipulation tool for working with data in arrays;
- **pandas**: high-level data manipulation tool for working with data in all kinds of formats, especially tables;
- **matplotlib**: for basic (and a bit ugly) visualization;
- **beautifulsoup**: for web scraping and collecting data; 
- **seaborn**: for pretty visualization;
- **sklearn**: (scikit-learn) an open-source library of machine learning algorithms;
- **tensorflow**: an open-source library for deep learning (developed by Google);
- **keras**: built on top of tensorflow to make writing deep learning code simpler and easier to understand;
- **pytorch**: another open-source library for deep learning (developed by Facebook).

In [0]:
#Import numpy library first to create X and Y 
import numpy as np

## Create the data

The fundamental way that machine learning algorithms learn is by training and testing. The idea is that: 
- We have a dataset with information on the independent/input/feature variables (Xs) and the dependent/outcome variable (Y); 
- We split this data into a training set and a test set; 
- The machine learning model learns the relationship between our variables from the training set; 
- Afterwards, we take an X value from the test set (data the model has never seen before) and see how closely it can predict the corresponding Y value;
- The closer its guess is to the true value of Y, the better the model is at this prediction task.
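
The split described above can be sketched with plain numpy. The helper `split_train_test`, the 20% test fraction, and the 10-house dataset are assumptions chosen for illustration; in practice a ready-made utility such as scikit-learn's `train_test_split` does the same job.

```python
import numpy as np

def split_train_test(x, y, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))          # random ordering of row indices
    n_test = max(1, int(round(len(x) * test_fraction)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]

# Hypothetical data: 10 houses following Y = 50000 + 50000 * X
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 50000 + 50000 * x.ravel()

x_tr, x_te, y_tr, y_te = split_train_test(x, y)
print(x_tr.shape, x_te.shape)  # 8 training rows and 2 test rows
```

Shuffling before splitting matters for real datasets, where rows are often sorted in some way; the tiny example in this session instead holds out the last house on purpose.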

For our problem, let's say that we have data on the prices of 7 houses and the number of bedrooms that each of these houses has. We will split this data in two: a training set with information on the first 6 houses, which we will use to train the model, and a test set with information on the last house, which we will use to test how well the model has figured out this relationship.

In [0]:
#Create the training and test data
x_train = np.array([1, 2, 3, 4, 5, 6], dtype = float).reshape((-1,1))
y_train = np.array([100000, 150000, 200000, 250000, 300000, 350000], dtype = float)
x_test = np.array([7]).reshape((-1,1))
y_test = np.array([400000])
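
A short note on the `reshape((-1, 1))` calls: scikit-learn estimators expect the features as a 2-D array with one row per sample and one column per feature, while the target can stay 1-D. The sketch below only inspects the shapes; the name `flat` is illustrative.

```python
import numpy as np

flat = np.array([1, 2, 3, 4, 5, 6], dtype=float)
x_train = flat.reshape((-1, 1))   # -1 lets numpy infer the number of rows

print(flat.shape)     # (6,)   one axis holding six values
print(x_train.shape)  # (6, 1) six samples with one feature each
```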

## Training machine learning models   

We will use the sklearn library (and xgboost for the last model) to try out different types of machine learning algorithms and see how good they are at this prediction task. As you can see below, the process is virtually the same for different types of algorithms: 
1. We create the machine learning model;
2. We fit that model on the training set so it can learn the relationship from the training data;
3. We predict the Y variable based on the X variable in the test data.
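
The three steps above can be sketched as one small helper. The function name `fit_and_predict` is hypothetical, `random_state=0` is added only for reproducibility, and the data is re-created here so the snippet runs on its own.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

x_train = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape((-1, 1))
y_train = np.array([100000, 150000, 200000, 250000, 300000, 350000], dtype=float)
x_test = np.array([7], dtype=float).reshape((-1, 1))

def fit_and_predict(model, x_train, y_train, x_test):
    """Fit an already-created model on the training data, then predict."""
    model.fit(x_train, y_train)      # step 2: learn from the training set
    return model.predict(x_test)     # step 3: predict for unseen inputs

# Step 1 happens here: each model is created before being passed in
results = {}
for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(random_state=0))]:
    results[name] = fit_and_predict(model, x_train, y_train, x_test)
    print(name, results[name])
```

Because every scikit-learn estimator exposes the same `fit` and `predict` methods, swapping in a different algorithm only changes the line where the model is created.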

Each algorithm has its strengths and weaknesses and is good for different types of problems. The scikit-learn documentation provides a simple guide, shown in the map below, on how to choose a suitable algorithm for your use case. 

<img src="https://scikit-learn.org/stable/_static/ml_map.png">

We will test out a few algorithms: 
- Linear Regression;
- Decision Tree;
- Random Forest;
- XGBoost.



In [0]:
# Linear Regression Model

from sklearn.linear_model import LinearRegression

linear_model = LinearRegression()
linear_model.fit(x_train, y_train)
linear_model.predict(x_test)

array([400000.])
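
To see what the linear model learned, one can solve the same ordinary least squares problem directly with numpy. This is a sketch of the underlying idea, not a claim about scikit-learn's exact internals; the variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([100000, 150000, 200000, 250000, 300000, 350000], dtype=float)

# Design matrix with a column of ones so the fit includes an intercept
A = np.column_stack([x, np.ones_like(x)])
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

print(round(slope), round(intercept))   # slope and intercept of the fitted line
print(round(slope * 7 + intercept))     # prediction for 7 bedrooms
```

The fitted slope and intercept recover the 50000 + 50000 * X rule we designed, which is why the prediction for 7 bedrooms lands on 400000.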

In [0]:
# Decision Tree Regression Model 


from sklearn.tree import DecisionTreeRegressor

tree_model = DecisionTreeRegressor()
tree_model.fit(x_train, y_train)
tree_model.predict(x_test)

array([350000.])

In [0]:
# Random Forest Regression Model

from sklearn.ensemble import RandomForestRegressor

forest_model = RandomForestRegressor()
forest_model.fit(x_train, y_train)
forest_model.predict(x_test)



array([345000.])

In [0]:
# XGBoost Regression Model 

from xgboost import XGBRegressor

xgb_model = XGBRegressor()
xgb_model.fit(x_train, y_train)
xgb_model.predict(x_test)



array([348124.44], dtype=float32)

## Evaluate the model

Since we designed this relationship to be linear, the Linear Regression model unsurprisingly performs best. This only means that Linear Regression is a good model for this particular use case. For other, more complicated use cases, the other models will often perform better. Still, Linear Regression is a good baseline against which you can measure improvements from other models. 
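
To compare the four models at a glance, the predictions printed above can be measured against the true test price. The numbers below are copied from the outputs in this notebook (results from the tree-based models can vary between runs and library versions); the dictionary and variable names are illustrative.

```python
y_true = 400000.0

# Predictions for a 7-bedroom house, as printed in the cells above
predictions = {
    "Linear Regression": 400000.0,
    "Decision Tree": 350000.0,
    "Random Forest": 345000.0,
    "XGBoost": 348124.44,
}

errors = {}
for name, y_pred in predictions.items():
    abs_error = abs(y_pred - y_true)
    errors[name] = abs_error
    pct = 100 * abs_error / y_true
    print(f"{name:18s} error = {abs_error:9.2f} ({pct:.1f}% of true price)")

best = min(errors, key=errors.get)
print("Closest model:", best)
```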

The other models don't predict the price correctly for a couple of reasons: 
- There is too little training data: with only 6 data points, the models have very little evidence about how the price changes with the number of bedrooms;
- Decision Tree, Random Forest, and XGBoost are all built from decision trees, which predict by splitting the range of X into regions and returning a value based on the training prices in the matching region. They cannot extrapolate beyond the prices seen during training, so for a house with 7 bedrooms (more than any training example) their predictions stay close to the largest training price of 350000 instead of reaching 400000.
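
One point worth checking directly is how a tree-based model behaves outside the range of its training data. In this sketch (data re-created so it runs on its own; `random_state=0` and the extra inputs 8 and 20 are assumptions added for illustration), a decision tree is asked about houses larger than any it saw during training.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

x_train = np.array([1, 2, 3, 4, 5, 6], dtype=float).reshape((-1, 1))
y_train = 50000 + 50000 * x_train.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)

# Inputs beyond the training range all land in the same leaf as X = 6
x_new = np.array([7, 8, 20], dtype=float).reshape((-1, 1))
tree_pred = tree.predict(x_new)
print(tree_pred)
```

Every input larger than 6 receives the same prediction, the price of the 6-bedroom house, which matches the 350000 the Decision Tree returned earlier.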

## Build a Deep Learning Model 

As mentioned in the workshop, deep learning is loosely inspired by how neurons in the human brain work. So we will try building a very simple deep learning model with a single neuron that will try to learn the relationship between X and Y. Here are some explanations of the code: 

- **Dense**: a regular (fully connected) layer of neurons in a neural network; our network is simple so we need only 1 Dense layer; if we have a more complicated relationship to model (more information on the size of the house, how many stories it has, which district it is in, whether there are good schools nearby), then we can add more Dense layers to help the neural network learn better;
- **units**: the number of neurons in the layer; as the relationship is very simple, we only need 1 neuron, but a more complicated relationship would need more; 
- **optimizer**: think of the optimizer as a learner sitting in that one neuron, trying to guess the relationship; the rule book it uses to judge each guess is the loss. There are different types of learners, and each is good for different types of problems. For this problem, our learner is 'sgd', which is short for stochastic gradient descent;
- **loss**: compares the guessed answers against the known correct answers and measures how well or how badly the optimizer did. The goal is to reduce the loss after every round of guessing so that we get as close to the right values as possible; there are different rules for measuring how close a guess is to the true answer, and for this model we use mean squared error;
- **epochs**: the number of guessing rounds that the optimizer will run; you can play around with this, changing the number to 200, 1000, or 10000, and see how the model performs. 

Over time, with more practice, we will learn which loss functions and optimizers are appropriate for different scenarios.
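
To make the roles of the loss, optimizer, and epochs concrete, here is a hand-written sketch of one neuron trained by gradient descent on mean squared error. It is a simplified stand-in for what Keras does, not its actual implementation: the learning rate of 0.01, the zero initialization, and the 5000 epochs are assumptions chosen for this example, and with only six samples each epoch uses the whole training set at once.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
y = np.array([100000, 150000, 200000, 250000, 300000, 350000], dtype=float)

w, b = 0.0, 0.0          # the neuron's weight and bias, i.e. its current guess
learning_rate = 0.01     # assumed step size
losses = []

for epoch in range(5000):
    y_hat = w * x + b                    # guess with the current parameters
    error = y_hat - y
    losses.append(np.mean(error ** 2))   # loss: mean squared error
    # Gradients of the loss with respect to w and b
    grad_w = 2 * np.mean(error * x)
    grad_b = 2 * np.mean(error)
    # Optimizer step: move the parameters against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

prediction = w * 7 + b
print(f"w = {w:.1f}, b = {b:.1f}, prediction for 7 bedrooms = {prediction:.1f}")
```

Lowering the number of epochs in this sketch leaves `w` and `b` further from 50000, which is the same effect you can observe when changing `epochs` in the Keras model.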

In [0]:
# Deep Learning Model

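# Import libraries
import tensorflow as tf
from tensorflow import keras

# One Dense layer with a single neuron that takes one input value
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
# Use stochastic gradient descent to minimize the mean squared error
model.compile(optimizer="sgd", loss="mean_squared_error")
# Let the model make 500 passes over the training data
model.fit(x_train, y_train, epochs=500)
print(model.predict(x_test))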