# Title
Pietro Grandinetti,...

Date

## Introduction

In this article I am going to try out statistical inference techniques, with the goal of drawing insights from a pretty large dataset.

I will touch on several subjects, from Python and PostgreSQL to data sampling, hypothesis testing and inference, all driven by a business point of view. In other words, I will put myself in the place of a Data Scientist who has been given the task of providing their business with data-driven, actionable insights.

The content will thus be interesting for those of you who work on any of the following topics:
- Python development
- Database programming and optimization
- Inferential Statistics
- Data Science

## The data (and the story behind it)

Raw data is the fuel behind the most important innovations of our time. It's a pity that many organizations keep their data secret. On the one hand, I understand their reasons; on the other, history has shown that open-source contributions drive innovation as much as, or even more than, competition.

That's why we set out on a mission to publish everything we do and make it reproducible. One of the first things we did was to take a complicated piece of machinery developed by researchers at the Communication Systems Department of Sophia-Antipolis, France, which was already open source, and make it more easily accessible and 100% reproducible.

The result is... _a lot_ of data.

This is a system that simulates road traffic in the Principality of Monaco for ten hours, starting at 4 am. It is basically a fairly large network of roads (think of a graph) in which vehicles of all types and pedestrians move (think of objects moving from node to node of a graph).

The key is that every vehicle is equipped with a GPS-like sensor. Therefore, at the end of the simulation, we can retrieve instantaneous positions and speed information for each vehicle. That sounds like a great dataset!

We've done this already: we put all the data in CSV format and also loaded it into a PostgreSQL database. Then (surprise!), we simply published the data on the internet. You can download all of it to your computer without even asking, and you can reproduce the experiment (or run your own) in a couple of clicks. Careful though: it's many gigabytes of data, so it may take a while. I would recommend first reading [this article](url), which shows some interesting exploratory data analysis done on a subset of the data. **ADD LINK**

## Data Settings

The dataset is definitely great, though it presents an obvious obstacle to start with: it's a bit too large!

In fact, I am not even sure how many records there are. The dataset has a fixed size, so in theory I could download the CSV (which would probably take a few hours), run `wc -l` on it, wait several more minutes, and finally get the number. And then... what?

I suspect the dataset contains about 100 million rows, but this information is useless. I also know it's a CSV of around 8 GB (I can see the file size in the browser when it asks me to confirm the download), but this information is not very useful either.

I was not given a cluster of computers with a lot of memory. I am using my laptop, with 8 GB of RAM. Even if I downloaded the entire file, how would I load it into memory?
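
One way around the memory limit is to stream the file and keep only a fixed-size random sample, never the whole thing. The sketch below uses reservoir sampling (Algorithm R) with the standard library only. The tiny in-memory CSV and its column names (`vehicle_id`, `speed`) are made up for illustration; they are not the real schema.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Row i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Tiny in-memory stand-in for the real CSV (hypothetical columns).
fake_csv = io.StringIO(
    "vehicle_id,speed\n" + "".join(f"v{i},{i % 50}\n" for i in range(1000))
)
sample = reservoir_sample(csv.DictReader(fake_csv), k=10, seed=42)
print(len(sample), "rows kept in memory")
```

Memory use depends only on `k`, not on the size of the file, so the same function could in principle be pointed at a file handle or a download stream.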

The same problem applies to the PostgreSQL database. The database certainly has a lot of advantages over the CSV file (and the only disadvantage is that you need to know a bit of SQL!), but it comes with problems too. First of all, maintaining it costs money. My team forbade me from running a `select count(*)`, which would consume a lot of memory and maybe even take down the CPU of the server that hosts the DB. In fact, I suspect that our database administrator has disabled queries like that one. Imagine if I were to take down the entire server!
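
If all I want is a rough row count, PostgreSQL keeps its own estimate in the `pg_class` catalog (the `reltuples` column), which can be read without scanning the table. Here is a minimal sketch, assuming a DB-API cursor that uses `%s` placeholders (as psycopg2 does) and a hypothetical table name, since I haven't named the table in this article.

```python
# Hypothetical table name: the article does not say what the table is called.
TABLE_NAME = "vehicle_positions"

# reltuples is the planner's row estimate, refreshed by VACUUM and ANALYZE.
ESTIMATE_SQL = "SELECT reltuples::bigint FROM pg_class WHERE oid = %s::regclass"


def estimated_row_count(cursor, table_name=TABLE_NAME):
    """Read PostgreSQL's stored row estimate instead of counting every row."""
    cursor.execute(ESTIMATE_SQL, (table_name,))
    (estimate,) = cursor.fetchone()
    return estimate
```

The number is only an estimate: it can be stale, or unavailable, if the table has not been analyzed recently.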

No reason to worry too much though. This is a fairly common situation for data scientists. Whether you were given 1 billion tweets, 500 million pictures, or 100 million payment transactions, you won't be able to analyze the dataset in its entirety.

You and I need a hat.

## Choose the right hat (Statistical Settings)

I need to wear the statistician hat to work on this task.

The database will be the _population_ that I have to study. I will have to come up with _hypotheses_ about this population, driven by _samples_, and then _reject_ them (or not) via statistical _testing_ and _evidence_.

Statisticians never assume they know the entire population. Think about healthcare studies: when a company wants to evaluate the effectiveness of a new drug, they certainly don't assume they know how the entire world's population would react to it. They take a sample (usually volunteers), test the drug on this small sample and then draw conclusions based on statistical evidence.

This is the correct Data Science approach for large datasets, and it's the one I'll use.
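
To make the idea concrete, here is a minimal sketch of estimating a population mean from a sample, with a normal-approximation confidence interval. The "population" is synthetic (made-up Gaussian speeds, not the Monaco data), and `z=1.96` assumes a sample large enough for the approximation to hold.

```python
import math
import random
import statistics


def mean_confidence_interval(sample, z=1.96):
    """Normal-approximation confidence interval for the population mean.

    z=1.96 gives roughly 95% confidence for large samples.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean - z * std_err, mean + z * std_err


# Synthetic population of speeds: made-up numbers for illustration only.
rng = random.Random(0)
population = [rng.gauss(30, 10) for _ in range(100_000)]
sample = rng.sample(population, 1_000)

low, high = mean_confidence_interval(sample)
print(f"estimated mean speed between {low:.2f} and {high:.2f}")
```

The interval is computed from 1,000 values only, yet it says something about all 100,000: that is the whole point of wearing the statistician hat.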

## My questions to the dataset

Let's get down to business now.

My approach to statistical analysis is to ask questions. These questions will drive sampling and light exploration, and then shape the hypotheses. Once I have some meaningful hypotheses, I will work on them statistically (via hypothesis testing).

Here are my initial questions for this dataset.

**What is the maximum capacity of the city?**

In fact, I don't know whether traffic ever gets so bad that the network reaches its capacity. Nobody told me that, and I suspect it's not the case.
My question refers more to the maximum number of objects that are moving in the network _at the same time_. Again, I can't just scan the whole population to get the "true" answer, hence I will need to resort to some different technique and come up with a data-driven approximation of the "true" answer.
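
One possible approach, sketched below: sample whole instants in time rather than individual rows (sampling rows would shrink every per-instant count by the sampling fraction), then count the distinct objects present at each sampled instant. The `(timestamp, object_id)` record layout is an assumption made for the example. The peak found this way is a lower bound on the true peak, since the busiest instant may not be in the sample.

```python
from collections import defaultdict


def peak_concurrency(records):
    """Return (timestamp, count) for the instant with the most distinct objects.

    `records` is an iterable of (timestamp, object_id) pairs.
    """
    objects_at = defaultdict(set)
    for timestamp, object_id in records:
        objects_at[timestamp].add(object_id)
    return max(
        ((t, len(ids)) for t, ids in objects_at.items()),
        key=lambda pair: pair[1],
    )


# Made-up records for three sampled timestamps.
records = [
    (100, "a"), (100, "b"),
    (200, "a"), (200, "b"), (200, "c"),
    (300, "c"),
]
print(peak_concurrency(records))  # (200, 3)
```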

**From where do most vehicles enter the network?**

I think it would be very useful to present, at my next team meeting, a 2D map of the city showing the hotspots where a lot of objects enter the city.

**What are the most congested spots?**

I am still thinking about a 2D map of the city. It would surely be useful to know where traffic is worst. I believe this can be interpreted in multiple ways: I could say that traffic is bad if there are a lot of cars, or if the speed is very low. This point will probably drive further analysis.
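
Both interpretations can be computed in one pass by binning sampled positions into a grid, as in the rough sketch below. The `(x, y, speed)` tuple layout and the cell size are assumptions for the example, not the dataset's actual format.

```python
from collections import defaultdict


def congestion_grid(points, cell_size):
    """Aggregate (x, y, speed) points into square cells.

    Returns {(col, row): (count, mean_speed)}.
    """
    totals = defaultdict(lambda: [0, 0.0])  # cell -> [count, sum of speeds]
    for x, y, speed in points:
        cell = (int(x // cell_size), int(y // cell_size))
        totals[cell][0] += 1
        totals[cell][1] += speed
    return {cell: (n, total / n) for cell, (n, total) in totals.items()}


# Made-up sample points.
points = [(5, 5, 10.0), (8, 2, 20.0), (120, 40, 50.0)]
grid = congestion_grid(points, cell_size=100)

busiest = max(grid, key=lambda cell: grid[cell][0])  # most observations
slowest = min(grid, key=lambda cell: grid[cell][1])  # lowest mean speed
print(busiest, slowest)
```

Either metric (count or mean speed per cell) can then be drawn as a heat map over the city.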

**What is the highest speed vehicles travel at?**

Again, not the "true" highest speed. Think statistically.
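
One statistical stand-in for "the highest speed" is a high percentile of the sampled speeds, which is far less sensitive to a single outlier than the raw maximum. Below is a minimal sketch; the choice of the 99th percentile is arbitrary and mine.

```python
import statistics


def speed_percentile(speeds, percentile=99):
    """Return the given percentile (1 to 99) of a sample of speeds."""
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points between percentiles.
    cut_points = statistics.quantiles(speeds, n=100)
    return cut_points[percentile - 1]


# Made-up speeds from 1 to 100 km/h.
print(speed_percentile(list(range(1, 101))))
```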

**What is the average speed vehicles travel at?**

Similar concept to the previous question.

**Hypothesis: A vehicle is 95% likely to travel at an average speed of 20 km/h. True or False?**
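
This hypothesis can be read in more than one way. Under one possible reading (the null hypothesis is "the mean speed is 20 km/h", tested at the 5% significance level), a large-sample two-sided z-test could look like the sketch below. The sample values are made up, and the normal approximation assumes a reasonably large sample.

```python
import math
import statistics


def z_test_mean(sample, hypothesized_mean):
    """Two-sided large-sample z-test for H0: population mean == hypothesized_mean.

    Returns (z statistic, p-value).
    """
    n = len(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    z = (statistics.fmean(sample) - hypothesized_mean) / std_err
    p_value = 2 * (1 - statistics.NormalDist().cdf(abs(z)))
    return z, p_value


# Made-up sample of speeds in km/h.
sample_speeds = [28, 29, 30, 31, 32] * 20
z, p = z_test_mean(sample_speeds, hypothesized_mean=20)
print("reject H0" if p < 0.05 else "cannot reject H0")
```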

**What are the lengths of the 10 longest queues of vehicles? Where are they?**

I believe the answer to this question will require some extra computational power.