# Load data and pickle it

This notebook follows on from `scan - collection relationship`, which investigates the `allScanDetections.csv` file and relates it to the collection in MongoDB.

In [None]:
from tqdm import tqdm
import json
import datetime
import pytz
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = (20, 10)

Load the data from the CSV and parse it into a dataframe. The CSV is about 6 GB, so loading it needs roughly that much RAM. It might run on less capable systems if the `chunksize` parameter of `read_csv` is set.

In [None]:
# Declare column types up front; "time" is left out because parse_dates handles it.
dtypes = {"minor": int, "uuid": str, "rssi": int, "agentId": str}
parse_dates = ["time"]
# parse_dates converts the column with pandas' own datetime parser,
# so no custom date_parser is needed.
a = pd.read_csv("training.csv",
                dtype=dtypes,
                parse_dates=parse_dates)
a.head()
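
If the full file does not fit in RAM, a chunked read is one option. The following is a minimal sketch, not the method used in this notebook: it assumes the same column names as the `dtypes` dictionary, and the in-memory `sample_csv` with its rows is a hypothetical stand-in for `training.csv`.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (hypothetical rows, same column names as above).
sample_csv = io.StringIO(
    "minor,uuid,time,rssi,agentId\n"
    "1,abc,2016-01-01 00:00:00,-70,agent-1\n"
    "2,abc,2016-01-01 00:00:01,-82,agent-2\n"
    "1,abc,2016-01-01 00:00:02,-65,agent-1\n"
)

dtypes = {"minor": int, "uuid": str, "rssi": int, "agentId": str}

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of one large frame.
chunks = pd.read_csv(sample_csv, dtype=dtypes, parse_dates=["time"], chunksize=2)

# Drop the constant uuid column chunk by chunk, so the full-size
# column is never held in memory at once.
parts = [chunk.drop(columns="uuid") for chunk in chunks]
small = pd.concat(parts, ignore_index=True)
print(small.shape)
```

For the real file, a much larger `chunksize` (hundreds of thousands of rows) would be a more sensible starting point; the value 2 here only exists to force several chunks from three rows.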

## Let's have a look at what we're dealing with:

In [None]:
a.info()

In [None]:
a.memory_usage()

In [None]:
a.iloc[0]

In [None]:
a.iloc[100000]

In [None]:
a.columns

In [None]:
a.minor.unique()

In [None]:
len(a.minor.unique())

In [None]:
a.agentId.unique()

In [None]:
a.uuid.unique()

The dataframe is quite big. There is actually only one `uuid` in it, yet that column reports the same memory usage as the `time` column. If we take it out, we save almost a gigabyte of memory.

In [None]:
a.drop('uuid', axis=1, inplace=True)

In [None]:
a.memory_usage()
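
By default, `memory_usage()` counts only the references held by an object-dtype string column, not the strings themselves, so the real cost of `uuid` may be higher than reported. A minimal sketch of measuring the saving with `deep=True`, using a hypothetical miniature frame and a placeholder UUID rather than the real data:

```python
import pandas as pd

# Hypothetical miniature frame; the placeholder UUID is not from the dataset.
demo = pd.DataFrame({
    "minor": [1, 2, 1],
    "uuid": ["00000000-0000-0000-0000-000000000000"] * 3,
    "rssi": [-70, -82, -65],
})

# deep=True inspects the actual string payloads instead of only the references.
before = demo.memory_usage(deep=True).sum()
after = demo.drop(columns="uuid").memory_usage(deep=True).sum()
print("bytes saved:", before - after)
```

On the full dataframe the same two calls would be `a.memory_usage(deep=True).sum()` before and after the drop, though the deep inspection can take a while on a frame this size.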

In [None]:
a.info()

In [None]:
a.rssi.hist(bins=100)

This is what we expect to see. There's no reason for `rssi` detections to be anything other than normally distributed.
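
A histogram is a visual check only. One rough numeric complement is to look at skewness and excess kurtosis, which are both near zero for normally distributed data. The sketch below runs on synthetic values; the mean of -80 and spread of 8 are assumptions chosen for illustration, not figures taken from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a.rssi (assumed mean and spread, illustration only).
rng = np.random.default_rng(0)
rssi = pd.Series(rng.normal(loc=-80, scale=8, size=10_000).round())

# For a normal distribution both values should be close to 0.
# Series.kurt() reports excess kurtosis, so 0 (not 3) is the normal baseline.
skewness = rssi.skew()
excess_kurtosis = rssi.kurt()
print(skewness, excess_kurtosis)
```

On the real data this would be `a.rssi.skew()` and `a.rssi.kurt()`; values far from zero would suggest the distribution is not as normal as the histogram makes it look.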

In [None]:
a.time.hist(bins=100)

This is also expected. It shows a more or less uniform distribution over the time periods that we queried for.

## Pickle

We pickle to dump a binary representation of the dataframe so that other notebooks can pick it up easily.

In [None]:
a.to_pickle("candidate_data_py2.p")
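
A downstream notebook can read the file back with `pd.read_pickle`. A minimal round-trip sketch, using a small hypothetical frame and a temporary file name (`demo.p`) instead of the real pickle:

```python
import os
import tempfile
import pandas as pd

# Small hypothetical frame standing in for the real dataframe.
demo = pd.DataFrame({"minor": [1, 2], "rssi": [-70, -82]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.p")  # hypothetical file name
    demo.to_pickle(path)                # write the binary representation
    restored = pd.read_pickle(path)     # load it back, dtypes included

print(restored.equals(demo))
```

Pickles are tied to the Python and pandas versions that wrote them, so the reading notebook should ideally run in a matching environment, and pickle files should only be loaded from trusted sources.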

![](http://i0.kym-cdn.com/photos/images/newsfeed/001/282/726/110.png)