# Exploring the date columns of the combined collections data

The dataset had previously been merged and cleansed, but some outliers had been identified and were simply filtered out in the visualisations.  This workbook explores whether there are alternatives to filtering for cleaning the data.

In [None]:
# import libraries

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# This Jupyter notebook explores the combined collections dataset after the cleaning of Item Dates and Item Place.

df = pd.read_csv('../data/combined_collections_dataset.csv', sep=',')


In [None]:
# quick look at our data

df.head(5)

In [None]:
# Quick check of the size of the combined dataset

df.shape

In [None]:
# check labels for each column

df.columns.values

In [None]:
# Check on data types and number of nulls of the columns

df.info()

In [None]:
# SUMMARY STATISTICS

"""
Key Observations

1. Minimum values for StartDate, EndDate and MidpointDate are close to -60,000.  These need to be checked, as nothing earlier than about -6000 (6000BC) would be expected.
2. Maximum values for StartDate, EndDate and MidpointDate are close to 193,000.  These need to be checked, as nothing later than 2025 would be expected.

CONCLUSION: observations 1 and 2 suggest that there are extreme values (outliers) in our dataset!

These columns are complete with no NaN.
 0   RecordID       8273 non-null   int64  
 1   Museum         8273 non-null   object - this could be useful to compare BM and V&A separately and then together.
 2   LocalID        8273 non-null   object - this is not necessary for correlation checks as it duplicates RecordID.
"""

df.describe(include = 'all')

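Observations 1 and 2 can be turned into a count of how many values in each date column fall outside the expected range. The sketch below is only illustrative: `count_out_of_range` is a hypothetical helper, the bounds of -6000 and 2025 are taken from the observations above, and the toy rows are invented rather than drawn from the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataframe; these values are invented for illustration.
toy = pd.DataFrame({
    'StartDate': [-58000.0, -300.0, 1850.0, 193000.0],
    'EndDate': [-55000.0, 100.0, 1900.0, 193500.0],
})

# Assumed plausible bounds, taken from the observations in the summary statistics.
EARLIEST, LATEST = -6000, 2025


def count_out_of_range(frame, columns, low=EARLIEST, high=LATEST):
    """Return, per column, how many values are below `low` or above `high`."""
    subset = frame[columns]
    return pd.DataFrame({
        'below': (subset < low).sum(),
        'above': (subset > high).sum(),
    })


out_of_range = count_out_of_range(toy, ['StartDate', 'EndDate'])
print(out_of_range)
```

On the real data the same call would be `count_out_of_range(df, ['StartDate', 'EndDate', 'MidpointDate'])`.
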
In [None]:
"""
StartDate column

1. StartDate values are checked to see whether there is a range of values at the extremes.

"""
plt.scatter(df['StartDate'],df['EndDate'])
plt.title('Start Date versus End Date')
# First inspect all rows where StartDate or EndDate is > 2025

futureItem = (df['StartDate'] > 2025) | (df['EndDate'] > 2025)

# Some prints were used to summarise the outlier values

# start_outliers_df = df[['RecordID', 'ItemDate', 'StartDate', 'EndDate', 'MidpointDate']][futureItem]
# df.StartDate.unique()
# start_outliers_summary = pd.Series.value_counts(start_outliers_df.StartDate)
# print(start_outliers_df)
# print(start_outliers_summary)

# RecordIDs with Start Dates in the future: 5465, 7266, 7884

# Next, look for items dated before 6000BC, which are treated as outliers.

extreme_early = (df['StartDate'] < -6000) | (df['EndDate'] < -6000)

# extreme_outliers_df = df[['RecordID', 'ItemDate', 'StartDate', 'EndDate', 'MidpointDate']][extreme_early]
# extreme_outliers_summary = pd.Series.value_counts(extreme_outliers_df.StartDate)
# print(extreme_outliers_df)
# print(extreme_outliers_summary)

# RecordID                                      ItemDate  StartDate  \
# 552        553                                     206BC-220   -20600.0   
# 810        811                                       80BC-50    -8000.0   
# 1033      1034                                     400BC-300   -40000.0   
# 1387      1388                 400BC-AD400 (Marpole Culture)   -40000.0   
# 1756      1757                                     206BC-220   -20600.0   
# 1965      1966                               206 BC - 220 AD   -20600.0   
# 2298      2299                             100BC-100 (circa)   -10000.0   
# 2450      2451                       580 BC – 550 BC (circa)   -58000.0   
# 3003      3004                                     100BC-100   -10000.0   
# 3110      3111                                     100BC-100   -10000.0   
# 3179      3180                              250 BC -- 130 BC   -25000.0   
# 3594      3595                                 100 BC-100 AD   -10000.0   
# 3882      3883                                 200BC-100 (?)   -20000.0   
# 6537      6538                                   100BC-AD100   -10000.0   
# 6669      6670                                ca. 200-100 BC   -10000.0   
# 7045      7046  c.180 BC-50 BC or 4th century-5th century AD   -18000.0   
# 7825      7826                                206 BC- 220 AD   -20600.0   
# 7958      7959                                   206 BC-8 AD   -20600.0

# On inspection, in most cases the StartDate, EndDate and MidpointDate can be corrected by dividing by 100

# Rows outside the mask become NaN here (index alignment); they are filled from the original columns in the next cell
outlier_rows = futureItem | extreme_early
df['CorrectedStartDate'] = df.loc[outlier_rows, 'StartDate'] / 100
df['CorrectedEndDate'] = df.loc[outlier_rows, 'EndDate'] / 100

# The scatter plot output below highlights the extreme values

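Both corrected columns above use the same combined mask, so a row flagged only because of its StartDate also has its EndDate divided by 100. Whether that affects any record here cannot be seen from the listing above. If it does, one option is to test each column against its own bounds. This is a minimal sketch under that assumption: `rescale_outliers` is a hypothetical helper and the toy values (including the EndDate figures) are invented.

```python
import pandas as pd

# Invented rows: the first two have an extreme StartDate but a plausible EndDate.
toy = pd.DataFrame({
    'StartDate': [-20600.0, -10000.0, 1850.0, 193000.0],
    'EndDate': [220.0, 100.0, 1900.0, 193500.0],
})


def rescale_outliers(series, low=-6000, high=2025, factor=100):
    """Divide by `factor` only the values outside [low, high]; keep the rest."""
    outside = (series < low) | (series > high)
    # where() keeps the original value when the condition is True
    return series.where(~outside, series / factor)


toy['CorrectedStartDate'] = rescale_outliers(toy['StartDate'])
toy['CorrectedEndDate'] = rescale_outliers(toy['EndDate'])
print(toy)
```

With this approach no separate `fillna` step is needed, because values inside the bounds are carried over unchanged.
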
In [None]:
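# Assign the result back: calling fillna(inplace=True) on a selected column is unreliable in recent pandas versions
df['CorrectedStartDate'] = df['CorrectedStartDate'].fillna(df['StartDate'])
df['CorrectedEndDate'] = df['CorrectedEndDate'].fillna(df['EndDate'])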

In [None]:
df['CorrectedStartDate'].value_counts()

In [None]:
plt.scatter(df['CorrectedStartDate'],df['CorrectedEndDate'])
plt.title('Start Date versus End Date Cleaned')

In [None]:
# checking for missing (NaN) values with the help of visualization


"""
EXPLANATION

Dataset has no missing values for:  
 0   RecordID       8273 non-null   int64  
 1   Museum         8273 non-null   object 
 2   LocalID        8273 non-null   object 

Missing data shows up in white.  Because this is two merged datasets, the different spread of missing data in each can be seen here.

"""

sns.heatmap(df.isnull(),cbar=True,yticklabels=False)

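The heatmap shows the pattern of gaps visually; the same comparison between the two source collections can also be put into numbers. The sketch below assumes the `Museum` column holds labels such as 'BM' and 'V&A' and uses invented toy rows, so the percentages are not those of the real data.

```python
import numpy as np
import pandas as pd

# Invented rows; only the column names come from the combined dataset.
toy = pd.DataFrame({
    'Museum': ['BM', 'BM', 'V&A', 'V&A'],
    'AcqDate': ['1901', np.nan, '1955', '2010'],
    'ItemTechnique': [np.nan, np.nan, 'woven', np.nan],
})

# Percentage of missing values in each column, split by museum
missing_share = (
    toy.drop(columns='Museum')
       .isnull()
       .groupby(toy['Museum'])
       .mean()
       .mul(100)
       .round(1)
)
print(missing_share)
```
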
In [None]:
# CHECKING FOR OUTLIERS
"""
OBSERVATION

There are still outlier values earlier than 3500BC, but these are representative of the data and have been left in.

""" 

plt.figure(1)
sns.boxplot(y=df['CorrectedStartDate'], color='black')
plt.figure(2)
sns.boxplot(y=df['CorrectedEndDate'], color='black')

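The points a boxplot draws beyond its whiskers are, by default, those more than 1.5 times the interquartile range from the quartiles. To list them rather than read them off the plot, the same rule can be applied directly. `iqr_outliers` is a hypothetical helper and the toy series is invented.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = series.dropna()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Invented years for illustration only
toy = pd.Series([-3500.0, -200.0, 100.0, 1200.0, 1500.0, 1700.0, 1800.0, 1900.0])
flagged = iqr_outliers(toy)
print(flagged)
```

On the real data this would be called as `iqr_outliers(df['CorrectedStartDate'])`.
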
In [None]:
# check distribution and skewness

# displot is figure-level and creates its own figure, so no plt.figure() call is needed
sns.displot(df['CorrectedStartDate'], kde=True)
sns.displot(df['CorrectedEndDate'], kde=True)

# There are 2 peaks, highlighting increased collection of items with these production dates.

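Skewness can also be read as a single number alongside the plots. A minimal sketch using `Series.skew()` on an invented series with a long tail of early dates; a negative result means the tail extends towards smaller values, which also pulls the mean below the median.

```python
import pandas as pd

# Invented years with a few very early values
toy = pd.Series([-3000.0, -500.0, 800.0, 1200.0, 1500.0, 1700.0, 1800.0, 1900.0])

skewness = toy.skew()
print(f"skew: {skewness:.2f}, mean: {toy.mean()}, median: {toy.median()}")
```
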
In [None]:
print(df['CorrectedStartDate'].describe())
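# Interquartile range is between 380BC and 1729AD.  Median is 1206AD.
# Double quotes on the outside: reusing the same quote type inside an f-string is a syntax error before Python 3.12
print(f" Mean: {df['CorrectedStartDate'].mean()}\n Median: {df['CorrectedStartDate'].median()}\n Mode: {df['CorrectedStartDate'].mode().tolist()}\n")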

In [None]:
# Midpoint of the corrected range, rounded to a whole year
df['CorrectedMidpointDate'] = ((df['CorrectedStartDate'] + df['CorrectedEndDate']) / 2).round(0)

In [None]:
df['CorrectedMidpointDate']

In [None]:
df['MidpointDate']

# Assessing the AcqDate Column


In [None]:
# From info() we found that there were only 7822 non-null values, of type object.  These should be dates (years), so we convert them to numeric; anything that cannot be parsed becomes NaN.
df['AcqDate'] = pd.to_numeric(df['AcqDate'], errors='coerce')
# Year 0 is never used in my selections, so NaN values can be converted to 0 to support int conversion.
df['AcqDate'] = df['AcqDate'].fillna(0).astype(int)

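An alternative to the 0 placeholder is pandas' nullable integer dtype, which keeps missing years as `<NA>` so they never need to be filtered out by value. This is a sketch on an invented series and assumes every parseable value is a whole year; a fractional value would make the cast fail.

```python
import pandas as pd

# Invented raw values mimicking an object column of years
raw = pd.Series(['1901', 'unknown', None, '2010'], name='AcqDate')

# Unparseable entries become NaN, then the nullable Int64 dtype keeps them as <NA>
acq = pd.to_numeric(raw, errors='coerce').astype('Int64')

has_date = acq.notna()
print(acq)
print(acq.nunique())  # missing values are not counted
```
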
In [None]:
df['AcqDate'].nunique()

# 227 unique values available (this count includes the 0 placeholder for missing dates)

In [None]:
# Removing the records with no dates
condition = df['AcqDate'] > 0
Acq_only_df = df['AcqDate'][condition]

Acq_only_df.describe()

In [None]:
# Count acquisitions per year, most frequent first, excluding the 0 placeholder by value rather than by position
Acq_df_sort = df.loc[df['AcqDate'] > 0, 'AcqDate'].value_counts().reset_index()
Acq_df_sort

In [None]:
Acq_only_df.plot.hist(bins=10)
plt.xlabel('Year')
plt.ylabel('Number of Acquisitions')

This histogram is a good visualisation of the small percentage of items acquired after 2000. With so few items, any analysis of this period is open to bias and the results would not be suitable for modelling or predictions.

Calculation of percentage below.

In [None]:
# Acquisition years 2005 to 2024 inclusive, the range covered by the footfall data
condition = df['AcqDate'] > 2004
condition2 = df['AcqDate'] < 2025
footfall_df = df['AcqDate'][condition & condition2]


In [None]:
print(f'Percentage of Acquisitions in range of footfall dates = ({len(footfall_df)} / {len(Acq_only_df)} * 100) = {round((len(footfall_df)/len(Acq_only_df))*100, 2)}%')


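The same percentage can be computed in one step with `Series.between`, which includes both endpoints by default, so `between(2005, 2024)` matches the `> 2004` and `< 2025` conditions for whole years. The toy series is invented and stands in for the dated acquisitions only.

```python
import pandas as pd

# Invented acquisition years, with missing dates already excluded
acq = pd.Series([1901, 1955, 2005, 2010, 2024, 2025])

in_range = acq.between(2005, 2024)      # boolean mask, inclusive at both ends
pct = round(in_range.mean() * 100, 2)   # mean of a boolean mask is the share of True
print(f'{in_range.sum()} of {len(acq)} acquisitions = {pct}%')
```
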
In [None]:
Footfall_acq_counts = footfall_df.value_counts().sort_index(ascending=True).reset_index()
values = Footfall_acq_counts['count']
x_label = Footfall_acq_counts['AcqDate'].astype(int)

These values could be compared against visitor footfall in the same year, but the delay between an acquisition and any impact on visitor numbers would need to be taken into account.  As items are not necessarily on public display, it was felt that this line of analysis would not be of value.

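Had the comparison been pursued, the delay could be explored by shifting the yearly acquisition counts before correlating them with footfall. The sketch below is purely illustrative: `lagged_correlations` is a hypothetical helper, and both series are invented, with the footfall numbers constructed so that they follow the previous year's acquisitions exactly.

```python
import pandas as pd

years = range(2005, 2011)
# Invented yearly acquisition counts
acq_counts = pd.Series([3, 8, 2, 6, 1, 7], index=years)
# Invented footfall: from 2006 on, each value is 50 + the previous year's count
footfall = pd.Series([55, 53, 58, 52, 56, 51], index=years)


def lagged_correlations(acquisitions, visitors, max_lag=2):
    """Correlate visitors in year t with acquisitions in year t - lag."""
    return pd.Series({
        lag: visitors.corr(acquisitions.shift(lag))
        for lag in range(max_lag + 1)
    })


corrs = lagged_correlations(acq_counts, footfall)
print(corrs)
```

With only twenty years of real data, and fewer usable pairs at each extra lag, any correlation found this way would need to be treated with caution.
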
In [None]:
Footfall_acq_counts

# Creating a combined_collection_for_visualisations

To include this cleansed data in visualisations, it would be useful to have the corrected columns in a new CSV file.

I plan to filter the information to include only items that have an acquisition date between 2005 - 2024 to match with footfall data available.

In [None]:
df.columns

In [None]:
filtered_df = df.loc[condition & condition2,
                     ['RecordID', 'Museum', 'AcqDate', 'ObjectType', 'ItemDate', 'ModernCountry',
                      'ItemMaterial', 'ItemTechnique', 'CorrectedStartDate',
                      'CorrectedEndDate', 'CorrectedMidpointDate']]

In [None]:
filtered_df.describe()

In [None]:
filtered_df.sample(20)

In [None]:
filtered_df.to_csv('../data/combined_collections_footfall_dates_dataset.csv', index=False)