effCTR

Efficient Click-Through Rate (effCTR) implements Logistic Regression with Stochastic Gradient Descent (SGD) and Naive Bayes in Python, and it utilises sparse matrices from scipy.sparse to achieve up to 60x speedup compared to the implementations from scikit-learn.

Below, more information is provided about Naive Bayes and Logistic Regression with SGD as developed in this repository. For each algorithm, a simple user guide is presented. Furthermore, a notebook with an example of how the algorithms can be used on a real dataset can be found here.

Installation

To install the package from PyPI:

pip3 install effCTR

One can also install it directly from GitHub:

pip3 install git+https://github.com/ksolarski/effCTR.git

Logistic Regression with SGD

Because the design matrix in this problem is very high-dimensional, fitting Logistic Regression with closed-form or second-order methods is infeasible: the matrices involved are too large to invert. Hence, Stochastic Gradient Descent is applied instead. The loss function is specified as:

$$ \begin{equation} \text{Log Loss} =-\left[y_{t} \log \left(p_{t}\right)+\left(1-y_{t}\right) \log \left(1-p_{t}\right)\right] \end{equation} $$

Then the gradient is given by:

$$ \begin{equation} \nabla \text{Log Loss}=\left(p_{t}-y_{t}\right) x_{t} \end{equation} $$

Hence, in each iteration of the algorithm, the weights are updated in the following way:

$$ \begin{equation} w_{t+1} = w_{t}-\eta_{t}\left(p_{t}-y_{t}\right) x_{t} \end{equation} $$
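
As an illustration of this update rule, a single step might look like the following in NumPy (a minimal sketch, not the code used inside effCTR; X_batch can be a scipy.sparse matrix and the function name is hypothetical):

import numpy as np

def sgd_step(w, X_batch, y_batch, lr):
    # Illustrative sketch of the update rule above, not effCTR's internal code.
    # p = sigmoid(X w): predicted click probabilities for the batch
    p = 1.0 / (1.0 + np.exp(-(X_batch @ w)))
    # gradient of the log loss summed over the batch: X^T (p - y)
    grad = X_batch.T @ (p - y_batch)
    # w_{t+1} = w_t - eta_t * gradient
    return w - lr * grad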

where $\eta_{t}$ denotes the learning rate at iteration $t$, which can be specified in the argument learning_rate. One can also specify how many times to go through the dataset in the argument max_epochs, how large a data chunk is used in each iteration in the argument chunksize, and whether to iterate through consecutive batches of the dataset or draw each batch randomly in the argument randomized. One can also add early stopping by using the arguments early_stopping and n_iter_no_change. The demo shows that the implementation from this repo can be ~60x faster on large datasets than the implementation from scikit-learn.

To fit the model:

from effCTR.models.Logistic_SGD import Logistic_SGD
logistic_sgd = Logistic_SGD()
logistic_sgd.fit(X, y)
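
The arguments described above can be passed when creating the model. A sketch, assuming they are constructor keyword arguments and with purely illustrative values:

logistic_sgd = Logistic_SGD(
    learning_rate=0.1,     # eta_t in the update rule
    max_epochs=2,          # passes over the dataset
    chunksize=10_000,      # rows used in each iteration
    randomized=True,       # draw batches randomly instead of consecutively
    early_stopping=True,   # stop early when the likelihood stops improving
    n_iter_no_change=5,    # iterations without improvement before stopping
)
logistic_sgd.fit(X, y)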

Afterwards, one can obtain predictions:

preds = logistic_sgd.predict(X)

One can also use the methods plot_weights and log_likelihood to plot how the weights and the log-likelihood change throughout the training process.
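
For instance, after fitting (method names as listed above, assuming they take no required arguments):

logistic_sgd.plot_weights()     # how the weights evolve over iterations
logistic_sgd.log_likelihood()   # how the log-likelihood evolves during training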

Naive Bayes

Under the assumption of conditional independence of features and using Bayes' theorem, one can show that the probability of a click given a set of features $X$ can be obtained as:

$$ \begin{equation} P(Y=1 \mid X=x) =\frac{P(Y=1) \prod_{i} P\left(X_{i}=x_{i} \mid Y=1\right)}{P(Y=1) \prod_{i} P\left(X_{i}=x_{i} \mid Y=1\right) + P(Y=0) \prod_{i} P\left(X_{i}=x_{i} \mid Y=0\right)} \end{equation} $$

where $P\left(X_{i}=x_{i} \mid Y=1\right)$ denotes the probability of feature $X_i$ taking the value $x_i$ given that there is a click. Using logarithms, one obtains an alternative expression, which enables us to utilise fast operations from scipy.sparse:

$$ \begin{equation} \begin{aligned} P(Y=1) \prod_{i} P\left(X_{i}=x_{i} \mid Y=1\right) &= \exp \left[ \log \left( P(Y=1) \prod_{i} P\left(X_{i}=x_{i} \mid Y=1\right) \right) \right] \\ &= \exp \left[ \log P(Y=1) + \sum_{i} \log P\left(X_{i}=x_{i} \mid Y=1\right) \right] \end{aligned} \end{equation} $$
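
The sum of logarithms maps directly onto a sparse matrix-vector product, which is where scipy.sparse helps. A minimal sketch, assuming binary one-hot features so that only active features contribute, with hypothetical names (this is not the repository's internal implementation):

import numpy as np

def naive_bayes_proba(X, log_prior_1, log_cond_1, log_prior_0, log_cond_0):
    # X: sparse 0/1 feature matrix; log_cond_* hold log P(X_i = 1 | Y)
    # sum_i log P(X_i = x_i | Y) becomes a sparse matrix-vector product
    log_p1 = log_prior_1 + X @ log_cond_1
    log_p0 = log_prior_0 + X @ log_cond_0
    # P(Y = 1 | X) from the two unnormalised log scores
    return 1.0 / (1.0 + np.exp(log_p0 - log_p1))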

However, an estimated probability can be zero (no observations for a particular feature value given a click or given no click), which would make the logarithm undefined. Consequently, zero probabilities are replaced either by a small $\epsilon$ or by the smallest positive probability encountered in the dataset. The user can choose the behaviour using the arguments replace_by_epsilon and epsilon.

To fit the model:

from effCTR.models.Naive_Bayes import Naive_Bayes
naive_bayes = Naive_Bayes()
naive_bayes.fit(X, y)
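
If zero probabilities should be replaced by a fixed $\epsilon$ rather than by the smallest positive value in the data, the arguments mentioned above can be passed when creating the model (assuming they are constructor keyword arguments; the value is only illustrative):

naive_bayes = Naive_Bayes(replace_by_epsilon=True, epsilon=1e-6)
naive_bayes.fit(X, y)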

Afterwards, one can obtain predictions:

preds_bayes = naive_bayes.predict(X)

Contributing & Future Work

There is no fixed roadmap for this project; it was created for fun and for learning purposes. Contributions and collaborations are welcome and encouraged.