# project description

This project trains a model to predict whether a student will be admitted to graduate school, based on three features: GPA, GRE score, and the rank of the undergraduate school.

The source dataset can be found at https://stats.idre.ucla.edu/stat/data/binary.csv

This problem was originally presented in a Udacity course. I found it a good fit for practicing basic techniques such as data pre-processing, defining a model and a loss function, and minimizing the loss function with gradient descent.

Solution: single layer model

* build a network with one output layer
* use sigmoid function as output activation function
* use MSE (mean square error) as the loss function
* use gradient descent method to minimize the loss function




## (1) Import needed libraries

In [1]:
import numpy as np
import pandas as pd

## (2) load dataset and pre-processing

### load data using pandas

In [2]:
# read csv data using pandas
admissions = pd.read_csv('binary.csv')
admissions.head()

   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4


### pre-processing data

* **make dummy variables for rank**

(1) **pandas.get_dummies**(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Convert categorical variable into dummy/indicator variables

*Parameters*:	
**data** : array-like, Series, or DataFrame

**prefix** : string, list of strings, or dict of strings, default None
String to append DataFrame column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

**prefix_sep** : string, default ‘_’

*Returns*
**dummies** : DataFrame or SparseDataFrame
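
As a small illustration of the `prefix` argument, the sketch below one-hot encodes a made-up Series (the values are hypothetical, not read from binary.csv). `dtype=int` is passed so the columns hold 0/1; recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy column standing in for admissions['rank']
toy_rank = pd.Series([3, 1, 4, 3], name='rank')

# One column per distinct value, each named '<prefix>_<value>'
toy_dummies = pd.get_dummies(toy_rank, prefix='rank', dtype=int)
print(toy_dummies)
```

Each row contains exactly one 1, in the column that matches its original value. Values that never occur (here, rank 2) get no column.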

(2) **concatenate objects along a particular axis with optional set logic along the other axes**

**pandas.concat**(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

**objs** : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised

**axis** : {0/’index’, 1/’columns’}, default 0
The axis to concatenate along
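
To make the effect of `axis` concrete, here is a minimal sketch with two made-up frames (`left` and `right` are hypothetical names). With `axis=0` rows are stacked, with `axis=1` columns are placed side by side and aligned on the index.

```python
import pandas as pd

left = pd.DataFrame({'a': [1, 2]})
right = pd.DataFrame({'b': [3, 4]})

# axis=0: rows are appended; cells with no source value become NaN
stacked = pd.concat([left, right], axis=0)

# axis=1: columns are joined on the shared index (0 and 1 here)
side_by_side = pd.concat([left, right], axis=1)

print(stacked.shape)
print(side_by_side.shape)
```

The notebook uses `axis=1` because the dummy columns describe the same rows as the original table.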

In [3]:
# Make dummy variables for rank
data = pd.concat([admissions, pd.get_dummies(admissions['rank'], prefix='rank')], axis=1)

print("---- print admissions['rank'] = [index, rank] ----")
print(admissions['rank'].head())

print("---- After pd.get_dummies operation -----")
print(pd.get_dummies(admissions['rank'], prefix='rank').head())

print("--- data after concat ----")
print(data.head())

---- print admissions['rank'] = [index, rank] ----
0    3
1    3
2    1
3    4
4    4
Name: rank, dtype: int64
---- After pd.get_dummies operation -----
   rank_1  rank_2  rank_3  rank_4
0       0       0       1       0
1       0       0       1       0
2       1       0       0       0
3       0       0       0       1
4       0       0       0       1
--- data after concat ----
   admit  gre   gpa  rank  rank_1  rank_2  rank_3  rank_4
0      0  380  3.61     3       0       0       1       0
1      1  660  3.67     3       0       0       1       0
2      1  800  4.00     1       1       0       0       0
3      1  640  3.19     4       0       0       0       1
4      0  520  2.93     4       0       0       0       1



* **remove the redundant column of "rank"**

In [4]:
data = data.drop('rank', axis=1)
print(data.head())

   admit  gre   gpa  rank_1  rank_2  rank_3  rank_4
0      0  380  3.61       0       0       1       0
1      1  660  3.67       0       0       1       0
2      1  800  4.00       1       0       0       0
3      1  640  3.19       0       0       0       1
4      0  520  2.93       0       0       0       1


* **standardize features - scaling**

DataFrame.loc

Purely label-location based indexer for selection by label.

.loc[ ] is primarily label based, but may also be used with a boolean array.
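
The two uses of `.loc` mentioned above can be sketched on a made-up frame (`toy` and its index labels are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({'gre': [380, 660, 800], 'gpa': [3.61, 3.67, 4.0]},
                   index=[10, 20, 30])

by_label = toy.loc[[10, 30], 'gre']        # rows whose labels are 10 and 30
by_mask = toy.loc[toy['gpa'] > 3.65, :]    # rows where the boolean mask is True
whole_column = toy.loc[:, 'gpa']           # every row of a single column

print(by_label)
print(by_mask)
```

Note that 10, 20 and 30 are labels, not positions; `toy.loc[0]` would raise a `KeyError` on this frame.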

In [5]:
# Standardize features
print("---- before scaling ----")
print(data.head())

for field in ['gre', 'gpa']:
    mean, std = data[field].mean(), data[field].std()
    # z-score the whole column at once: zero mean, unit standard deviation
    data[field] = (data[field] - mean) / std

print("---- after scaling ----")
print(data.head())

---- before scaling ----
   admit  gre   gpa  rank_1  rank_2  rank_3  rank_4
0      0  380  3.61       0       0       1       0
1      1  660  3.67       0       0       1       0
2      1  800  4.00       1       0       0       0
3      1  640  3.19       0       0       0       1
4      0  520  2.93       0       0       0       1
---- after scaling ----
   admit       gre       gpa  rank_1  rank_2  rank_3  rank_4
0      0 -1.798011  0.578348       0       0       1       0
1      1  0.625884  0.736008       0       0       1       0
2      1  1.837832  1.603135       1       0       0       0
3      1  0.452749 -0.525269       0       0       0       1
4      0 -0.586063 -1.208461       0       0       0       1
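
A quick way to check that scaling worked is to confirm that the scaled column has mean 0 and standard deviation 1. The sketch below does this on only the five GRE values shown in the preview, so its z-scores differ from the ones printed above, which use the mean and standard deviation of the full dataset.

```python
import pandas as pd

# Only the five previewed rows, not the full dataset
toy = pd.DataFrame({'gre': [380.0, 660.0, 800.0, 640.0, 520.0]})

# z-score: subtract the column mean, divide by the sample standard deviation
# (pandas .std() uses ddof=1 by default)
z = (toy['gre'] - toy['gre'].mean()) / toy['gre'].std()

print(abs(z.mean()) < 1e-9, abs(z.std() - 1) < 1e-9)
```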


* **split off a random 10% of the dataset for testing**


(1) **numpy.random.choice(a, size=None, replace=True, p=None)**
Generates a random sample from a given 1-D array

*Parameters*:

**a** : 1-D array-like or int
If an ndarray, a random sample is generated from its elements. If an int, the random sample is generated as if a were np.arange(a)

**size** : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then m * n * k samples are drawn. Default is None, in which case a single value is returned.

**replace** : boolean, optional
Whether the sample is with or without replacement

**p** : 1-D array-like, optional
The probabilities associated with each entry in a. If not given the sample assumes a uniform distribution over all entries in a.

*Returns*:	
**samples** : single item or ndarray; The generated random samples
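
For the split, the important argument is `replace=False`, which guarantees that no row is drawn twice. A minimal sketch on a made-up index of ten rows follows; the fixed seed and the name `rng_state` are assumptions added only to make the run repeatable.

```python
import numpy as np

rng_state = np.random.RandomState(0)   # hypothetical fixed seed
index = np.arange(10)                  # stand-in for data.index

# draw 90% of the labels without replacement
train_idx = rng_state.choice(index, size=int(len(index) * 0.9), replace=False)
# whatever was not drawn forms the test set
test_idx = np.setdiff1d(index, train_idx)

print(sorted(train_idx), test_idx)
```

With `replace=True` the training sample could contain duplicates, and the held-out part would then be larger than 10%.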


In [6]:
# randomly pick the row labels of 90% of the dataset as training data; the rest is testing data
sample = np.random.choice(data.index, size=int(len(data)*0.9), replace=False)
# split the dataset into two subsets: data is the training set, test_data is the testing set
data, test_data = data.loc[sample], data.drop(sample)

* **split into features and labels**

In [7]:
features, targets = data.drop('admit', axis=1), data['admit']
features_test, targets_test = test_data.drop('admit', axis=1), test_data['admit']

## (3) define model, loss function


* **activation function**: sigmoid function

In [8]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
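
The gradient step later in the notebook multiplies by `output * (1 - output)`, which is the derivative of the sigmoid written in terms of its own output. The sketch below checks that identity numerically; `sigmoid_prime` is a hypothetical helper that does not appear in the notebook.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid, expressed through its own output
    s = sigmoid(x)
    return s * (1 - s)

x = np.array([-2.0, 0.0, 2.0])
h = 1e-6
# central finite difference as an independent estimate of the slope
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(np.allclose(numeric, sigmoid_prime(x)))
```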

* **model**: a single layer of weights; the output is the sigmoid of the weighted sum of the input features

In [9]:
n_records, n_features = features.shape
# initialize weights
weights = np.random.normal(scale=1 / n_features**.5, size=n_features)
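
The `scale` argument is the standard deviation of the normal distribution the weights are drawn from. A common motivation for `1 / sqrt(n_features)` (an assumption here, the notebook does not state its reason) is that it keeps the weighted sum of roughly unit-variance inputs from growing with the number of features. The sketch below only checks that the drawn values have the requested spread; the seed and sample size are hypothetical.

```python
import numpy as np

rng_state = np.random.RandomState(0)   # hypothetical fixed seed
n_features = 6                         # gre, gpa, rank_1 .. rank_4

# draw many values so the empirical standard deviation is stable
samples = rng_state.normal(scale=1 / n_features**.5, size=100000)

print(samples.std(), 1 / n_features**.5)
```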

## (4) gradient descent method to minimize error

**SSE**: sum of squared errors

**MSE**: mean squared error, used as the loss function





%%html
<img src="MSE.png">

Instead of the SSE, we use the mean of the squared errors (MSE). With a lot of data, summing up all the weight steps can lead to very large updates that make gradient descent diverge. Compensating for that would require a very small learning rate. Instead, we divide by the number of records in the data, m (`n_records` in the code), to take the average. This way, no matter how much data we use, learning rates will typically be in the range of 0.001 to 0.01.
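
For reference, the loss and the update can be written out as follows. This is a sketch that assumes the loss is exactly what the code computes, `np.mean((out - targets) ** 2)`, with $\hat{y}$ the sigmoid output, $\eta$ the learning rate, and $\mu$ indexing the records; the superscript notation is a choice made here, not taken from the notebook.

```latex
E = \frac{1}{m} \sum_{\mu=1}^{m} \left( y^{\mu} - \hat{y}^{\mu} \right)^2,
\qquad
\hat{y}^{\mu} = \sigma\!\left( \sum_i w_i x_i^{\mu} \right)

\frac{\partial E}{\partial w_i}
  = -\frac{2}{m} \sum_{\mu=1}^{m}
    \left( y^{\mu} - \hat{y}^{\mu} \right)
    \hat{y}^{\mu} \left( 1 - \hat{y}^{\mu} \right) x_i^{\mu}

w_i \leftarrow w_i + \frac{\eta}{m} \sum_{\mu=1}^{m}
    \left( y^{\mu} - \hat{y}^{\mu} \right)
    \hat{y}^{\mu} \left( 1 - \hat{y}^{\mu} \right) x_i^{\mu}
```

The constant factor 2 from the derivative is absorbed into $\eta$, which is why it does not appear in the update that the code performs.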

%%html
<img src="single_layer.jpeg">

In [50]:
# Neural network hyperparameters
epochs = 1000
learnrate = 2
last_loss = None

for e in range(epochs):
    # reset the accumulated weight step for this epoch
    del_w = np.zeros(weights.shape)

    for x, y in zip(features.values, targets):
        # loop through all records: x is the input, y is the label
        output = sigmoid(np.dot(x, weights))
        # output error
        error = y - output
        # error term times input; output * (1 - output) is the sigmoid derivative
        del_w += error * output * (1 - output) * x
    # update the weights: learning rate times the accumulated step,
    # divided by the number of records to average
    weights += learnrate * del_w / n_records

    # print the mean squared error on the training set ten times per run
    if e % (epochs // 10) == 0:
        # forward pass over the whole training set
        out = sigmoid(np.dot(features, weights))
        # MSE as the loss function
        loss = np.mean((out - targets) ** 2)
        print("Train loss: ", loss)
        last_loss = loss

('Train loss: ', 0.19475092849607611)
('Train loss: ', 0.1947509281939236)
('Train loss: ', 0.19475092816854309)
('Train loss: ', 0.19475092816639603)
('Train loss: ', 0.1947509281662136)
('Train loss: ', 0.19475092816619818)
('Train loss: ', 0.1947509281661969)
('Train loss: ', 0.19475092816619674)
('Train loss: ', 0.19475092816619657)
('Train loss: ', 0.1947509281661967)
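
The inner loop over records can also be written as a single matrix product. The sketch below compares both forms on random toy data (`X`, `y`, `w` and the seed are hypothetical stand-ins for `features`, `targets` and `weights`); it is offered as an equivalent formulation, not as the notebook's own code.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng_state = np.random.RandomState(0)            # hypothetical fixed seed
X = rng_state.normal(size=(8, 3))               # toy features
y = rng_state.randint(0, 2, size=8)             # toy 0/1 labels
w = rng_state.normal(scale=1 / 3**.5, size=3)   # toy weights

# per-record accumulation, as in the training loop
del_w_loop = np.zeros(w.shape)
for x_i, y_i in zip(X, y):
    out_i = sigmoid(np.dot(x_i, w))
    del_w_loop += (y_i - out_i) * out_i * (1 - out_i) * x_i

# same quantity at once: one error term per record, then a weighted sum of rows
out = sigmoid(np.dot(X, w))
error_term = (y - out) * out * (1 - out)
del_w_vec = np.dot(error_term, X)

print(np.allclose(del_w_loop, del_w_vec))
```

The vectorized form avoids the Python-level loop, which tends to matter once the dataset grows well beyond a few hundred rows.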


## (5) calculate accuracy on testing dataset

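A sample is predicted as admitted when the sigmoid output is greater than 0.5. Accuracy is the fraction of test samples whose prediction matches the label. We assume an accuracy above 0.5 is good, meaning the model predicts more than 50% of the testing samples correctly.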

In [52]:
test_out = sigmoid(np.dot(features_test, weights))
predictions = test_out > 0.5
accuracy = np.mean(predictions == targets_test)
print("Prediction accuracy: {:.3f}".format(accuracy))

Prediction accuracy: 0.947
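
To see how the thresholding and the accuracy computation behave, here is a minimal sketch on made-up numbers (`probs` and `labels` are hypothetical, not results from this model). It also computes the accuracy of always predicting the most common class, which can be a more informative reference point than 0.5 when the classes are unbalanced.

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.6, 0.4, 0.1])   # hypothetical sigmoid outputs
labels = np.array([1, 0, 0, 0, 0])            # hypothetical true labels

preds = probs > 0.5                   # boolean predictions
accuracy = np.mean(preds == labels)   # fraction of matching predictions

# accuracy obtained by always predicting the most common class
majority_baseline = max(np.mean(labels), 1 - np.mean(labels))

print("accuracy: {:.3f}, baseline: {:.3f}".format(accuracy, majority_baseline))
```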
