# <h1 style="font-size:24px">Website traffic of Romanian media and advertising companies</h1>
--- 

## **ABOUT THE DATA**
[BRAT](https://www.brat.ro/ce-este-brat) is a non-profit organization of the media and advertising industry. Its aim is to support the industry by establishing common methods and standards for measuring the performance indicators of media products. One of these KPIs is site traffic, which is measured in the [SATI](https://www.brat.ro/sati) project. The results are published daily on SATI's webpage, in the [Traffic results](https://www.brat.ro/sati/rezultate/type/site/page/1/c/all) menu. The page can export the whole dataset, not only the filtered table shown on the page. The URL of the generated Excel file can be found with the browser's Inspect tool, and the file can then be scraped with Python's `requests` library.
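
As a rough illustration of that download step, the sketch below builds the export URL for one day and saves the file with `requests`. The URL pattern is the one used in the download cell of this notebook; the helper names `build_export_url` and `download_export` are hypothetical and are not part of the project's `data_processing_utils` module.

```python
from datetime import date

# Export URL pattern taken from the download cell of this notebook.
EXPORT_URL = (
    "https://www.brat.ro/sati/export-rezultate/export/xls/type/site/c/all/"
    "period_type/day/period_filter/{day}/category/all/editor/all/"
    "order_by/name/order/desc/"
)


def build_export_url(day: date) -> str:
    """Return the export URL for one day, using the non-padded date
    format (e.g. 2022-5-17) that this notebook uses."""
    return EXPORT_URL.format(day=f"{day.year}-{day.month}-{day.day}")


def download_export(day: date, file_path: str, timeout: float = 30.0) -> str:
    """Download the Excel export for `day` to `file_path` (hypothetical helper)."""
    import requests  # third-party; imported here so build_export_url has no dependencies

    response = requests.get(build_export_url(day), timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    with open(file_path, "wb") as fh:
        fh.write(response.content)
    return file_path


print(build_export_url(date(2022, 5, 17)))
```

Calling `download_export(date(2022, 5, 17), "./data/2022-5-17.xls")` would then fetch the file, provided the day is within the publicly available period.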

Keep in mind that site traffic data is only available up to the day before the current date. Moreover, those who are not BRAT members can access data only for the last month.

### **Data structure**
Downloaded data contains the following columns: 
* Categorie: the site's media category
* Site
* Sitecode
* Tip trafic: traffic type: desktop/laptop, mobile, mobile applications
* Editor site
* Contractor
* Regie de publicitate
* Afisari: Page Impression - a visitor displays a page on the site
* Vizite: Visit - a series of one or more impressions as a result of a visitor's request. A visit ends when the period between 2 consecutive impressions is longer than 30 minutes
* Clienti Unici: Unique user - a unique combination of IP address and other identifiers

Of these, only the category, traffic type, contractor, page impressions, visits and unique users will be used.
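
The 30-minute rule in the Vizite definition above can be illustrated with a small sketch that counts visits from one visitor's impression timestamps. This only illustrates the definition; it is not SATI's actual measurement code, and the function name `count_visits` and the sample timestamps are made up.

```python
from datetime import datetime, timedelta

VISIT_TIMEOUT = timedelta(minutes=30)


def count_visits(impressions: list[datetime]) -> int:
    """Count visits for one visitor: a new visit starts when the gap
    between two consecutive impressions is longer than 30 minutes."""
    ordered = sorted(impressions)
    if not ordered:
        return 0
    visits = 1
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > VISIT_TIMEOUT:
            visits += 1
    return visits


sample = [
    datetime(2022, 5, 17, 9, 0),
    datetime(2022, 5, 17, 9, 10),   # 10 min gap: same visit
    datetime(2022, 5, 17, 9, 40),   # exactly 30 min gap: still the same visit
    datetime(2022, 5, 17, 10, 30),  # 50 min gap: new visit
]
print(count_visits(sample))  # 2
```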

## **ABOUT THE NOTEBOOK**
This notebook is the main part of the project: it brings together all the steps, from downloading the data to writing it into InfluxDB.

The data is scraped in Excel format, so it can be imported directly into `pandas`, which is used for the data processing. Fortunately, the data comes in a fairly clean format: only the unneeded columns and empty rows have to be deleted. The export also contains a total traffic row for every site; this row has to be removed too, because it is just the sum of the traffic types. Lastly, because InfluxDB is a time series database, the dataframe's index should be of datetime type.
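
A minimal sketch of these cleaning steps on a tiny in-memory frame is shown here. The column names come from the data structure list; the sample values, the label of the total row (`Total`), the index name `time` and the function name `normalize` are assumptions, and the project's real logic lives in `data_processing_utils`.

```python
import pandas as pd

# Columns that the notebook says will be used.
KEEP = ["Categorie", "Tip trafic", "Contractor", "Afisari", "Vizite", "Clienti Unici"]


def normalize(raw: pd.DataFrame, day: str, total_label: str = "Total") -> pd.DataFrame:
    """Keep the needed columns, drop empty and total rows, index by date."""
    df = raw[KEEP].dropna(how="all")  # drop unneeded columns and fully empty rows
    # The total row is the sum of the traffic types (its label is an assumption).
    df = df[df["Tip trafic"] != total_label].copy()
    # Every row of a daily export gets that day as its datetime index.
    df.index = pd.DatetimeIndex([pd.Timestamp(day)] * len(df), name="time")
    return df


# Made-up sample data, shaped like the export described above.
raw = pd.DataFrame(
    {
        "Categorie": ["News", "News", "News", None],
        "Site": ["example.ro", "example.ro", "example.ro", None],
        "Tip trafic": ["Desktop", "Mobile", "Total", None],
        "Contractor": ["Example SRL", "Example SRL", "Example SRL", None],
        "Afisari": [100, 250, 350, None],
        "Vizite": [40, 90, 130, None],
        "Clienti Unici": [30, 70, 100, None],
    }
)
clean = normalize(raw, "2022-05-17")
print(clean)
```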

## **CODE**

Import all dependencies.

In [1]:
import os
from datetime import datetime, timedelta

import data_processing_utils as utils  # data download and normalization
import db_utils  # helpers to get a db connection and write to the db

A database connection is needed. If you want to create a new database, uncomment the database creation code below. If your database is named something other than `site_traffic`, change the `db_name` parameter of the `get_data_frame_client` function.

In [2]:
# from influxdb import DataFrameClient

# user = 'root'
# password = 'root'
# host='influxdb'
# port=8086
# dbname='site_traffic'

# client = DataFrameClient(host, port, user, password, dbname)

# client.create_database(dbname)
# client.create_retention_policy(dbname, '1000d', 1, default=True) #data will be kept only for 1000 days
# # always check with client.query("show databases") that the database was created

client = db_utils.get_data_frame_client('influxdb', 'site_traffic')

Data is downloaded for each day from `min_d` up to, but not including, `max_d`. So if you want to download data from 11th of May, 2022 until the current date, set `min_d` to "2022-05-11" (see the commented-out `strptime` call) and leave `max_d` as today. The last day fetched is then yesterday, because data is only available up to the day before the current date (see above).
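
To make the interval explicit, here is a small sketch of how the download loop walks the dates: it starts at `min_d` and stops before `max_d`, so with `max_d` set to today the last day fetched is yesterday. The generator name `days_to_download` and the example dates are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterator


def days_to_download(min_d: datetime, max_d: datetime) -> Iterator[date]:
    """Yield every calendar day from min_d (inclusive) up to max_d (exclusive)."""
    day = min_d.date()
    while day < max_d.date():
        yield day
        day += timedelta(days=1)


start = datetime.strptime("2022-05-11", "%Y-%m-%d")
end = datetime(2022, 5, 14)
print([d.isoformat() for d in days_to_download(start, end)])
# ['2022-05-11', '2022-05-12', '2022-05-13']
```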

In [7]:
min_d = datetime.today() - timedelta(days=1) #datetime.strptime("2022-04-17", "%Y-%m-%d")
max_d = datetime.today()

Run the cell below to download, normalize and send the data to InfluxDB.

Note: the measurement's name will be `traffic`. To change it, change the `measurement` argument of the `write_data_to_db` function.

In [8]:
os.makedirs("./data", exist_ok=True)  # create the download folder once

while min_d.date() < max_d.date():
    # non-padded date, e.g. 2022-5-17; "%-m"/"%-d" are not supported on every platform
    d_str = f"{min_d.year}-{min_d.month}-{min_d.day}"
    print(d_str)

    # download the export for this day and write the normalized df to the database
    df = utils.get_normalized_resource(
        resource_url="https://www.brat.ro/sati/export-rezultate/export/xls/type/site/c/all/period_type/day/period_filter/" + d_str + "/category/all/editor/all/order_by/name/order/desc/",
        file_path="./data/" + d_str + ".xls",
        from_date=min_d,
    )
    db_utils.write_data_to_db(client=client, data_df=df, measurement='traffic')

    min_d = min_d + timedelta(days=1)


2022-5-17
