Nugit Data Scientist Task

Calculating cannibalization of organic search visits when SEM advertising is purchased

Quick Terminology

SEM = Paid Search = sponsored ads within the same search results
SEO = Organic Search = non-paid listings presented in the search results

Aim

For every visit generated through SEM advertising, what is the impact on SEO visits?

Background

Organic Search drives substantial visits to a company's website, on the order of 20% to 40% of total traffic.

It is believed that when a company purchases SEM advertising, Organic Search visits may decrease, because users who would normally navigate to the company's website via the natural listing use the paid listing instead.

Nugit wishes to get a better understanding of this relationship between results from SEM and SEO with regression analysis.

Files and Data

To get the files, clone the repository:

$ git clone git@github.com:nugit/datascientist-task.git

Provided:

main.py - Python file to get you started; it is the main executable file
requirements.txt - file for you to list any 3rd-party modules
data/sample_data_Oct.csv - sample CSV data. Each CSV file contains the daily number of visits for SEO (organic/non-paid) and SEM (paid) in 2014. If you prefer to work with JSON, you may use my Csv-Json switcher Python file on GitHub: cjswitch
data/sampleoutput.json - sample output JSON file

Rules:

Points marked with [Program] are to be completed in Python. Outputs are in JSON.
Points marked with [Question] are optional and are for you to showcase your statistical/machine learning knowledge. Please keep your answers short, in dot points.
Feel free to create more Python files as necessary. Just make sure that main.py is the only file that gets executed.
Feel free to use any Python module(s) as necessary. Just remember to add them to requirements.txt.
Ensure that your .py files follow the PEP 8 coding style guide.

Submission:

main.py + other Python files - for you to show off your logic
requirements.txt - to list any 3rd-party modules
output.json - your JSON results. Please see data/sampleoutput.json for an example submission.
submission.md or submission.html or submission.txt or submission.pdf - for you to write your answers/comments/suggestions. We recommend using the online notebook wakari.

Please do not write your answers in a word doc.

Submit your completed task to terry@nugit.co by providing a link to a private Bitbucket or GitHub repository, or to somewhere else online where the files can be viewed. Feel free to email me any questions.

Analysis/Statistics Task

(A) Linear Regression

Using the last 26 weeks of data in data/sample_data_Oct.csv:

[Program] Fit a regression function of the form y = mx + b to the data
[Program] Using the function, calculate the impact (the SEO value) when SEM has the highest number of visits
[Program] Using the function, calculate the impact (the SEO value) when SEM has the median number of visits
[Question] Are there other statistical methods that can show the impact of SEM visits on SEO visits?

Regression Function:

y = mx + b

where:  y is the dependent variable SEO
        x is the independent variable SEM
        m is gradient
        b is the y-intercept
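
As a rough illustration of the function above, the gradient and y-intercept can be estimated by ordinary least squares. The sketch below is one possible approach, not a reference solution: the helper name fit_line and the visit counts are hypothetical, and it uses only the standard library so that nothing needs to be added to requirements.txt.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = m*x + b. Returns (m, b)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squared deviations of x, and sum of cross deviations of x and y.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x has zero variance; the gradient is undefined")
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Hypothetical daily visit counts, for illustration only.
sem = [100, 200, 300, 400]
seo = [950, 900, 850, 800]
m, b = fit_line(sem, seo)
print(m, b)
```

A negative gradient here would mean that each extra SEM visit is associated with fewer SEO visits, which is the cannibalization effect the task asks about.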

Sample Output (these are made-up numbers. Note the decimal places):

Round gradient and yintercept to 2 decimal places. maxSEM, maxSEMimpact, medianSEM and medianSEMimpact are integer numbers of visits.

{
    "filename": "sample_data_Oct.csv",
    "datarange": "12weeks",
    "gradient": 0.02,
    "yintercept": 2004.45,
    "maxSEM": 3000,
    "maxSEMimpact": 1000,
    "medianSEM": 1250,
    "medianSEMimpact": 2200
}
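
One way to assemble this output is sketched below. It assumes the gradient and y-intercept have already been fitted; the helper name build_output, the input values and the "26weeks" label (patterned on the "12weeks" value in the sample) are all hypothetical.

```python
import json
from statistics import median


def build_output(filename, datarange, sem, m, b):
    """Build the result dict for part (A) using the rounding rules above."""
    max_sem = max(sem)
    median_sem = median(sem)
    return {
        "filename": filename,
        "datarange": datarange,
        "gradient": round(m, 2),
        "yintercept": round(b, 2),
        "maxSEM": int(round(max_sem)),
        # The "impact" is the SEO value predicted by y = m*x + b.
        "maxSEMimpact": int(round(m * max_sem + b)),
        "medianSEM": int(round(median_sem)),
        "medianSEMimpact": int(round(m * median_sem + b)),
    }


# Hypothetical inputs, for illustration only.
result = build_output("sample_data_Oct.csv", "26weeks",
                      [100, 200, 300, 400], -0.5, 1000.0)
print(json.dumps(result, indent=4))
```

Serialising with the json module rather than formatting the string by hand guarantees that the separators between keys are valid JSON.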

(B) Model Validation

[Program] Calculate the Correlation Coefficient and the Coefficient of Determination
[Question] What does the result of the coefficients tell you about the regression function and the data?
[Question] Determining how well the data fits into your regression function can be done by calculating the correlation coefficient. However, it is also known that this is not a good measure of model validation. What other approaches could you use? Feel free to program this if you wish.

Sample Output (these are made-up numbers. Please round to 3 decimal places):

{
    "r": 0.922,
    "rsquared": 0.850
}
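
A minimal sketch of these two coefficients follows, assuming paired lists of SEM and SEO visits. The helper name pearson_r and the data are hypothetical. For a straight-line fit with an intercept, the coefficient of determination equals the square of the Pearson correlation coefficient, which is what the sketch relies on.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / sqrt(sxx * syy)


# Hypothetical values, for illustration only.
sem = [1, 2, 3, 4, 5]
seo = [10, 8, 9, 5, 4]
r = pearson_r(sem, seo)
validation = {"r": round(r, 3), "rsquared": round(r * r, 3)}
print(validation)
```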

(C) Different date ranges

[Program] Perform the same analysis as in (A) and (B), but over a 12 week period
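
Selecting the date range might look like the sketch below. It assumes each row has already been parsed into a (date, SEO, SEM) tuple and that "the last N weeks" means the N x 7 days ending on the latest date in the data. The tuple layout, the helper name last_n_weeks and the generated rows are assumptions, since the CSV column names are not described here.

```python
from datetime import date, timedelta


def last_n_weeks(rows, weeks):
    """Keep rows whose date falls in the final weeks * 7 days of the data."""
    if not rows:
        return []
    latest = max(row[0] for row in rows)
    cutoff = latest - timedelta(days=weeks * 7)
    # Strictly after the cutoff, so exactly weeks * 7 calendar days remain.
    return [row for row in rows if row[0] > cutoff]


# Hypothetical data: 200 consecutive days ending on an assumed last date.
end = date(2014, 10, 31)
start = end - timedelta(days=199)
rows = [(start + timedelta(days=i), 1000 + i, 100 + i) for i in range(200)]

recent_12 = last_n_weeks(rows, 12)
recent_26 = last_n_weeks(rows, 26)
print(len(recent_12), len(recent_26))
```

Filtering on dates rather than slicing the last N x 7 rows keeps the window correct even if some days are missing from the file.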

Bonus Points!

Provide tests to accompany your Python functions. At Nugit, we use unittest and nose with code coverage.
Provide an approach to remove outliers. Feel free to program this.
You will notice that there is a difference in results by using 3 and 6 months of data for trend estimation. How would you go about de-trending the data to produce a more accurate picture of the relationship?
Chart your results using any JavaScript library
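
For the outlier point above, one common option (among several) is the interquartile-range rule: drop any day whose SEM or SEO value lies more than 1.5 x IQR outside the quartiles. The sketch below is only an illustration; the helper names, the 1.5 multiplier and the data are assumptions, and it needs Python 3.8 or later for statistics.quantiles.

```python
from statistics import quantiles


def iqr_bounds(values, k=1.5):
    """Return (low, high) limits at k * IQR beyond the first and third quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def drop_outliers(sem, seo, k=1.5):
    """Remove (SEM, SEO) pairs where either value falls outside its limits."""
    sem_lo, sem_hi = iqr_bounds(sem, k)
    seo_lo, seo_hi = iqr_bounds(seo, k)
    kept = [(x, y) for x, y in zip(sem, seo)
            if sem_lo <= x <= sem_hi and seo_lo <= y <= seo_hi]
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical data with one obvious SEM spike, for illustration only.
sem = [10, 12, 11, 13, 12, 11, 10, 500]
seo = [100, 98, 99, 97, 98, 99, 100, 101]
clean_sem, clean_seo = drop_outliers(sem, seo)
print(clean_sem, clean_seo)
```

Pairs are removed together so that the SEM and SEO lists stay aligned by day for the regression.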
