<a href="https://colab.research.google.com/github/markbriers/data-science-jupyter/blob/main/coursework_data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The first step in the coursework data analysis is to copy the data files linked in the supplied documents into a folder called "Data" in your Google Drive.

To read data from your Google Drive, you first need to "mount" the drive, which requires authorising Colab to access it. Executing the code below generates a unique URL. Click the link (it opens in a new window), click accept, and copy the long code shown. Paste that code into the text box that appears below the cell and press Enter. You should then see the text "Mounted at /content/gdrive".

In [4]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


For security reasons, the process above must be repeated every time the analysis is performed.
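
Once the drive is mounted, it can help to confirm that the data files are where the rest of the notebook expects them. The following is a minimal sketch, not part of the original coursework code: the helper `missing_files` and the constants `DATA_DIR` and `EXPECTED_FILES` are hypothetical names, and the folder and file names are taken from the paths used when the files are loaded in this notebook.

```python
from pathlib import Path

# Assumed location: the "Data" folder in the mounted Google Drive.
DATA_DIR = Path('/content/gdrive/My Drive/Data')

# File names as they appear in the loading cells of this notebook.
EXPECTED_FILES = [
    'comm-data-Fri.csv', 'comm-data-Sat.csv', 'comm-data-Sun.csv',
    'park-movement-Fri.csv', 'park-movement-Sat.csv', 'park-movement-Sun.csv',
]

def missing_files(data_dir, names):
    """Return the names from `names` that are not regular files in `data_dir`."""
    data_dir = Path(data_dir)
    return [name for name in names if not (data_dir / name).is_file()]

# Report anything that has not been copied into the folder yet.
missing = missing_files(DATA_DIR, EXPECTED_FILES)
if missing:
    print('Missing from', DATA_DIR, ':', missing)
else:
    print('All expected files found.')
```

If any file is reported as missing, check that the folder is named exactly "Data" and that the drive was mounted at `/content/gdrive`.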

In [5]:
import numpy as np
import pandas as pd

This function loads the communication data into a DataFrame. I have optimised the memory usage; the details are not important, but please do reuse the code below.

In [6]:
def readCommunicationData(fname):
  # load comms data file
  comm = pd.read_csv(fname,dtype={"Timestamp": object, "from": np.uint32, "to": object, "location": object})
  # display initial memory usage
  comm.info(memory_usage='deep')
  # convert the timestamp field to a timestamp object
  # (infer_datetime_format is deprecated since pandas 2.0, where format inference is the default)
  comm['Timestamp'] = pd.to_datetime(comm['Timestamp'], infer_datetime_format=True)
  # convert "from" field to 32-bit unsigned integer
  comm['from'] = comm['from'].astype('uint32')
  # replace every "external" entry in the "to" field with 0 so that the column can be stored as an integer
  comm['to'] = comm['to'].replace('external',0)
  comm['to'] = comm['to'].astype('uint32')
  # display revised memory usage
  comm.info(memory_usage='deep')
  return comm

In [7]:
commFri = readCommunicationData('/content/gdrive/My Drive/Data/comm-data-Fri.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948739 entries, 0 to 948738
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Timestamp  948739 non-null  object
 1   from       948739 non-null  uint32
 2   to         948739 non-null  object
 3   location   948739 non-null  object
dtypes: object(3), uint32(1)
memory usage: 190.2 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 948739 entries, 0 to 948738
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Timestamp  948739 non-null  datetime64[ns]
 1   from       948739 non-null  uint32        
 2   to         948739 non-null  uint32        
 3   location   948739 non-null  object        
dtypes: datetime64[ns](1), object(1), uint32(2)
memory usage: 75.8 MB


In [8]:
commSat = readCommunicationData('/content/gdrive/My Drive/Data/comm-data-Sat.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1655866 entries, 0 to 1655865
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   Timestamp  1655866 non-null  object
 1   from       1655866 non-null  uint32
 2   to         1655866 non-null  object
 3   location   1655866 non-null  object
dtypes: object(3), uint32(1)
memory usage: 331.5 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1655866 entries, 0 to 1655865
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   Timestamp  1655866 non-null  datetime64[ns]
 1   from       1655866 non-null  uint32        
 2   to         1655866 non-null  uint32        
 3   location   1655866 non-null  object        
dtypes: datetime64[ns](1), object(1), uint32(2)
memory usage: 131.8 MB


In [9]:
commSun = readCommunicationData('/content/gdrive/My Drive/Data/comm-data-Sun.csv')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1548724 entries, 0 to 1548723
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   Timestamp  1548724 non-null  object
 1   from       1548724 non-null  uint32
 2   to         1548724 non-null  object
 3   location   1548724 non-null  object
dtypes: object(3), uint32(1)
memory usage: 310.4 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1548724 entries, 0 to 1548723
Data columns (total 4 columns):
 #   Column     Non-Null Count    Dtype         
---  ------     --------------    -----         
 0   Timestamp  1548724 non-null  datetime64[ns]
 1   from       1548724 non-null  uint32        
 2   to         1548724 non-null  uint32        
 3   location   1548724 non-null  object        
dtypes: datetime64[ns](1), object(1), uint32(2)
memory usage: 123.5 MB
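
The drop in memory usage reported above comes largely from replacing `object` columns with fixed-width types. As a rough illustration only, the sketch below applies the same "external" to 0 replacement and `uint32` cast to a small synthetic column; the ID values are made up and the variable names (`to_raw`, `to_small`) are hypothetical.

```python
import pandas as pd

# Synthetic stand-in for the "to" column: IDs held as strings, plus the literal "external".
to_raw = pd.Series(['1591741', 'external', '825652', '1070503'] * 25_000, dtype=object)

# Same two steps as in readCommunicationData: map "external" to 0, then cast to uint32.
to_small = to_raw.replace('external', 0).astype('uint32')

# deep=True counts the Python string objects themselves, not just the pointers to them.
bytes_before = to_raw.memory_usage(deep=True)
bytes_after = to_small.memory_usage(deep=True)
print(f'object: {bytes_before / 1e6:.1f} MB, uint32: {bytes_after / 1e6:.1f} MB')
```

Each `uint32` value occupies four bytes, whereas each entry of an `object` column is a pointer to a separate Python object, which is why the cast pays off on columns with millions of rows.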


This function loads the movement data into a DataFrame, applying similar memory optimisations.

In [10]:
def readMovementData(fname):
  # load movement data file
  move = pd.read_csv(fname)
  # remove any rows with a null id
  move = move[pd.notnull(move["id"])]
  # discard erroneous rows: keep only timestamps with the expected 18 characters
  move = move[move['Timestamp'].str.len()==18]
  # display initial memory usage
  move.info(memory_usage='deep')
  # convert the timestamp field to a timestamp object
  # (infer_datetime_format and errors='ignore' are deprecated in recent pandas releases)
  move['Timestamp'] = pd.to_datetime(move['Timestamp'], infer_datetime_format=True, errors='ignore')
  # convert "id" field to 32-bit unsigned integer
  move['id'] = move['id'].astype('uint32')
  # convert type field to categorical variable
  move['type'] = move['type'].astype('category')
  # convert positional fields to uint16
  move['X'] = move['X'].astype('uint16')
  move['Y'] = move['Y'].astype('uint16')
  # display revised memory usage
  move.info(memory_usage='deep')
  return move

In [11]:
moveFri = readMovementData('/content/gdrive/My Drive/Data/park-movement-Fri.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6010914 entries, 0 to 6010913
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   Timestamp  object
 1   id         int64 
 2   type       object
 3   X          int64 
 4   Y          int64 
dtypes: int64(3), object(2)
memory usage: 986.0 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6010914 entries, 0 to 6010913
Data columns (total 5 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Timestamp  datetime64[ns]
 1   id         uint32        
 2   type       category      
 3   X          uint16        
 4   Y          uint16        
dtypes: category(1), datetime64[ns](1), uint16(2), uint32(1)
memory usage: 143.3 MB


In [12]:
moveSat = readMovementData('/content/gdrive/My Drive/Data/park-movement-Sat.csv')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9078623 entries, 0 to 9078622
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   Timestamp  object
 1   id         int64 
 2   type       object
 3   X          int64 
 4   Y          int64 
dtypes: int64(3), object(2)
memory usage: 1.5 GB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 9078623 entries, 0 to 9078622
Data columns (total 5 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Timestamp  datetime64[ns]
 1   id         uint32        
 2   type       category      
 3   X          uint16        
 4   Y          uint16        
dtypes: category(1), datetime64[ns](1), uint16(2), uint32(1)
memory usage: 216.5 MB


In [13]:
moveSun = readMovementData('/content/gdrive/My Drive/Data/park-movement-Sun.csv')


<class 'pandas.core.frame.DataFrame'>
Int64Index: 10932424 entries, 0 to 10932424
Data columns (total 5 columns):
 #   Column     Dtype 
---  ------     ----- 
 0   Timestamp  object
 1   id         object
 2   type       object
 3   X          object
 4   Y          object
dtypes: object(5)
memory usage: 2.6 GB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10932424 entries, 0 to 10932424
Data columns (total 5 columns):
 #   Column     Dtype         
---  ------     -----         
 0   Timestamp  datetime64[ns]
 1   id         uint32        
 2   type       category      
 3   X          uint16        
 4   Y          uint16        
dtypes: category(1), datetime64[ns](1), uint16(2), uint32(1)
memory usage: 260.6 MB
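
To make the cleaning steps in `readMovementData` more concrete, here is a small sketch on made-up rows. It assumes, without confirmation from the data documentation, that a malformed file might contain a stray header row, which would force every column to `object` as in the Sunday output above. The sample values and the names `raw` and `clean` are hypothetical, and an explicit `format` is passed to `pd.to_datetime` instead of relying on format inference.

```python
import pandas as pd

# Hypothetical rows: two valid records and one stray header row in the middle.
raw = pd.DataFrame({
    'Timestamp': ['2014-6-08 08:00:16', 'Timestamp', '2014-6-08 08:00:21'],
    'id': ['1591741', 'id', '825652'],
    'type': ['check-in', 'type', 'movement'],
    'X': ['63', 'X', '64'],
    'Y': ['99', 'Y', '98'],
})

# Keep only rows whose timestamp string has the expected 18 characters.
clean = raw[raw['Timestamp'].str.len() == 18].copy()

# Convert each column to a compact type, mirroring readMovementData.
clean['Timestamp'] = pd.to_datetime(clean['Timestamp'], format='%Y-%m-%d %H:%M:%S')
clean['id'] = clean['id'].astype('uint32')
clean['type'] = clean['type'].astype('category')
clean['X'] = clean['X'].astype('uint16')
clean['Y'] = clean['Y'].astype('uint16')

print(clean.dtypes)
```

The length filter has to run before the type conversions, because a non-numeric entry such as a header label would cause `astype('uint32')` to raise an error.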
