# Preparation of metadata for Cluster Walk

This notebook requires the [Openrefine Client](https://github.com/opencultureconsulting/openrefine-client) to be in the same directory as the notebook itself.
The process follows these steps:
1. Download the files from the Visual Contagion Openrefine instance
2. Extract the columns needed for Cluster Walk
3. Combine the files into a single dataset
4. Filter by date (1870 or later)
5. Export as CSV

In [1]:
import pandas as pd
import os

## Download Files from Visual Contagion Openrefine instance

This operation downloads the following datasets from the Visual Contagion Openrefine instance at 129.194.213.75:

- 2624032615455: CORPUS_wikidata_visual_items_post_1870
- 1878246014108: Backup_2022_02_03_PDF_cleaned
- 2370544250568: BackupCBT_2022_02_03_IIIF_Only
- 1671940401029: BNF_periodic_explore
- 1650201645555: Venus Explore
- 2332843665824: Princeton Blue Mountain
- 1688128007940: Journaux Est-Asiatique (collecte barbara)

Any additions or changes should be made in the export cells below.

In [2]:
dirName = 'backup'
try:
    os.mkdir(dirName)
    print("Directory " , dirName ,  " Created ") 
except FileExistsError:
    print("Directory " , dirName ,  " already exists")

Directory  backup  Created 
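As a side note, the same "create it unless it is already there" behaviour can be had without the `try`/`except`. The sketch below uses `pathlib`; the helper name `ensure_dir` is hypothetical and not part of the original notebook.

```python
from pathlib import Path


def ensure_dir(name):
    """Create the directory `name` if needed; report whether it already existed."""
    path = Path(name)
    existed = path.is_dir()
    # exist_ok=True makes the call a no-op when the directory is already present
    path.mkdir(parents=True, exist_ok=True)
    return path, existed


# Hypothetical usage, mirroring the cell above:
# path, existed = ensure_dir('backup')
# print("Directory", path, "already exists" if existed else "Created")
```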


**Use this code if you are running macOS on an Intel CPU. Uncomment the lines before running them; on Linux, use the client binary for your platform instead.**

In [None]:
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 1878246014108 --output backup/Backup_2022_02_03_PDF_cleaned.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 2624032615455 --output backup/CORPUS_wikidata_visual_items_post_1870.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 2370544250568 --output backup/BackupCBT_2022_02_03_IIIF_Only.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 1671940401029 --output backup/BNF_periodic_explore.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 1650201645555 --output backup/Venus_Explore.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 2332843665824 --output backup/Princeton_Blue_Mountain.csv
#!./openrefine-client_0-3-10_macos --host 129.194.213.75 --export 1688128007940 --output backup/Journaux_Est-Asiatique.csv

**Use this code if you are running Windows**

In [None]:
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 1878246014108 --output backup/Backup_2022_02_03_PDF_cleaned.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 2624032615455 --output backup/CORPUS_wikidata_visual_items_post_1870.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 2370544250568 --output backup/BackupCBT_2022_02_03_IIIF_Only.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 1671940401029 --output backup/BNF_periodic_explore.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 1650201645555 --output backup/Venus_Explore.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 2332843665824 --output backup/Princeton_Blue_Mountain.csv
!.\openrefine-client_0-3-10_windows.exe --host 129.194.213.75 --export 1688128007940 --output backup/Journaux_Est-Asiatique.csv

**Use this code if you are running on an ARM64 CPU**

In [3]:
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 1878246014108 --output backup/Backup_2022_02_03_PDF_cleaned.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 2624032615455 --output backup/CORPUS_wikidata_visual_items_post_1870.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 2370544250568 --output backup/BackupCBT_2022_02_03_IIIF_Only.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 1671940401029 --output backup/BNF_periodic_explore.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 1650201645555 --output backup/Venus_Explore.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 2332843665824 --output backup/Princeton_Blue_Mountain.csv
!docker run --rm --platform linux/amd64 --network=host -v ${PWD}:/data:z felixlohmeier/openrefine-client:v0.3.10 --host 129.194.213.75 --export 1688128007940 --output backup/Journaux_Est-Asiatique.csv

Export to file backup/Backup_2022_02_03_PDF_cleaned.csv complete
Export to file backup/CORPUS_wikidata_visual_items_post_1870.csv complete
Export to file backup/BackupCBT_2022_02_03_IIIF_Only.csv complete
Export to file backup/BNF_periodic_explore.csv complete
Export to file backup/Venus_Explore.csv complete
Export to file backup/Princeton_Blue_Mountain.csv complete
Export to file backup/Journaux_Est-Asiatique.csv complete
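The three variants above differ only in how the client is invoked, so the project list could be kept in one place and the commands generated from it. The following is a minimal sketch under that assumption: the `PROJECTS` mapping and the `export_command` helper are hypothetical names, while the `--host`, `--export` and `--output` flags, the project IDs and the file names are the ones used in the cells above.

```python
# Project ID -> output file name (without extension), as used in the cells above
PROJECTS = {
    "1878246014108": "Backup_2022_02_03_PDF_cleaned",
    "2624032615455": "CORPUS_wikidata_visual_items_post_1870",
    "2370544250568": "BackupCBT_2022_02_03_IIIF_Only",
    "1671940401029": "BNF_periodic_explore",
    "1650201645555": "Venus_Explore",
    "2332843665824": "Princeton_Blue_Mountain",
    "1688128007940": "Journaux_Est-Asiatique",
}
HOST = "129.194.213.75"


def export_command(client, project_id, name, host=HOST, out_dir="backup"):
    """Build the argument list for one export; nothing is executed here."""
    return [client, "--host", host, "--export", project_id,
            "--output", f"{out_dir}/{name}.csv"]


commands = [
    export_command("./openrefine-client_0-3-10_macos", project_id, name)
    for project_id, name in PROJECTS.items()
]
for command in commands:
    print(" ".join(command))
```

Each list could then be passed to `subprocess.run(command, check=True)`. Adding a dataset would only mean adding one entry to `PROJECTS`.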


The first step is to keep only the columns that Cluster Walk needs:
+ Media URL
+ City
+ Country
+ wkt
+ normalized_date
+ Title

We start with **Backup_2022_02_03_PDF_cleaned**:

In [4]:
pdf = pd.read_csv("backup/Backup_2022_02_03_PDF_cleaned.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_pdf = pdf[keep_col]
new_pdf.to_csv("backup/Backup_2022_02_03_PDF_cleaned.csv", index=False)

Next, **BackupCBT_2022_02_03_IIIF_Only**:

In [5]:
iiif = pd.read_csv("backup/BackupCBT_2022_02_03_IIIF_Only.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_iiif = iiif[keep_col]
new_iiif.to_csv("backup/BackupCBT_2022_02_03_IIIF_Only.csv", index=False)

In [6]:
wiki = pd.read_csv("backup/CORPUS_wikidata_visual_items_post_1870.csv", low_memory=False, dtype=object)
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_wiki = wiki[keep_col]
new_wiki.to_csv("backup/CORPUS_wikidata_visual_items_post_1870.csv", index=False)

In [7]:
bnf = pd.read_csv("backup/BNF_periodic_explore.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_bnf = bnf[keep_col]
new_bnf.to_csv("backup/BNF_periodic_explore.csv", index=False)

In [8]:
venus = pd.read_csv("backup/Venus_Explore.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_venus = venus[keep_col]
new_venus.to_csv("backup/Venus_Explore.csv", index=False)

In [9]:
bm = pd.read_csv("backup/Princeton_Blue_Mountain.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_bm = bm[keep_col]
new_bm.to_csv("backup/Princeton_Blue_Mountain.csv", index=False)

In [10]:
est = pd.read_csv("backup/Journaux_Est-Asiatique.csv")
keep_col = ['Media URL','City','Country','wkt', 'normalized_date', 'Title']
new_est = est[keep_col]
new_est.to_csv("backup/Journaux_Est-Asiatique.csv", index=False)
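The seven cells above apply the same read, select and overwrite pattern to each export. As an illustration only, the pattern could be wrapped in a helper and applied in a loop. The names `trim_export` and `trim_all` are hypothetical, and reading every column as `object` (as the wikidata cell does) is an assumption that may not match the dtypes pandas infers in the other cells.

```python
from pathlib import Path

import pandas as pd

KEEP_COL = ['Media URL', 'City', 'Country', 'wkt', 'normalized_date', 'Title']


def trim_export(csv_path, columns=KEEP_COL):
    """Read one export, keep only `columns`, overwrite the file, return the frame."""
    frame = pd.read_csv(csv_path, low_memory=False, dtype=object)
    trimmed = frame[columns]  # raises KeyError if a required column is missing
    trimmed.to_csv(csv_path, index=False)
    return trimmed


def trim_all(directory, columns=KEEP_COL):
    """Apply trim_export to every CSV in `directory`, keyed by file name stem."""
    return {
        path.stem: trim_export(path, columns)
        for path in sorted(Path(directory).glob('*.csv'))
    }


# Hypothetical usage:
# frames = trim_all('backup')
# merged_df = pd.concat(frames.values())
```

Note that `glob('*.csv')` would also pick up any combined file already saved in the same directory, so this sketch assumes the directory holds only the raw exports.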

In [11]:
merged_df = pd.concat([new_pdf, new_iiif, new_wiki, new_bnf, new_venus, new_bm, new_est ])
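By default `pd.concat` keeps each frame's original row labels, so the merged index contains repeated labels. That is consistent with the later `info()` output, which reports 629768 entries but an index running only "0 to 4690". The toy frames below are made up and only illustrate the difference `ignore_index=True` makes.

```python
import pandas as pd

# Two small stand-ins for the per-source frames
a = pd.DataFrame({'Title': ['LIFE', 'LIFE']})
b = pd.DataFrame({'Title': ['Venus']})

stacked = pd.concat([a, b])                         # labels restart per frame
renumbered = pd.concat([a, b], ignore_index=True)   # fresh consecutive labels

print(list(stacked.index))
print(list(renumbered.index))
```

Repeated labels are harmless for the boolean filters used in this notebook, but they matter if rows are later selected by label with `.loc`.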

## Filtering by Time

Now that all the dataframes are combined into one, called `merged_df`, we keep only the records whose date is in 1870 or later.

In [12]:
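def date_split(date) -> tuple:
    """Split a 'YYYY-MM-DD' string into (year, month, day) as floats; missing parts become NaN."""
    nan = float('nan')
    # Missing dates arrive as NaN (a float), not as a string
    if not isinstance(date, str):
        return nan, nan, nan
    splits = date.split('-')
    # Pad partial dates such as 'YYYY' or 'YYYY-MM' up to three parts
    vals = splits + [nan] * (3 - len(splits))
    return tuple(float(v) for v in vals)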

merged_df['year'], merged_df['month'], merged_df['day'] = zip(*merged_df['normalized_date'].apply(date_split))

merged_1870 = merged_df[merged_df['year'] >= 1870.]
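If only the year is needed for the filter, it can also be extracted in one vectorised step. This sketch swaps in a different technique from the `date_split` function above: it uses a regular expression rather than splitting on `-`, and it assumes the year is the leading run of up to four digits in `normalized_date`. The helper name `year_from_normalized_date` is hypothetical.

```python
import pandas as pd


def year_from_normalized_date(dates: pd.Series) -> pd.Series:
    """Return the leading digits of each date string as a float year (NaN if absent)."""
    year_text = dates.str.extract(r'^(\d{1,4})', expand=False)
    return pd.to_numeric(year_text, errors='coerce')


# Made-up sample covering full, year-only, partial and missing dates
sample = pd.Series(['1936-11-23', '1869', '1870-05', None])
years = year_from_normalized_date(sample)
keep = years >= 1870  # comparisons with NaN are False, so missing dates drop out
print(sample[keep].tolist())
```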

In [13]:
merged_1870 = merged_1870.drop(columns=['month', 'day', 'year'])


In [14]:
merged_1870.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 629768 entries, 0 to 4690
Data columns (total 6 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   Media URL        629764 non-null  object
 1   City             485671 non-null  object
 2   Country          502433 non-null  object
 3   wkt              497106 non-null  object
 4   normalized_date  629768 non-null  object
 5   Title            604868 non-null  object
dtypes: object(6)
memory usage: 33.6+ MB


Finally, we save the combined file:

In [15]:
merged_1870.to_csv("backup/combined_1870.csv", index=False)

## Filtering by Place

Another filter we may need keeps only the records that have coordinates, that is, a non-null `wkt` value.

In [16]:
merged_1870_wkt = merged_1870[merged_1870['wkt'].notnull()]
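For reference, the `wkt` values seen in this notebook are WKT points of the form `POINT(x y)`, where x is the longitude and y the latitude. Should numeric coordinates be needed later, they could be pulled out as sketched below. This assumes every non-null value is a simple `POINT`; the pattern `WKT_POINT` and the helper `point_coordinates` are hypothetical names.

```python
import pandas as pd

# Matches 'POINT(<lon> <lat>)' with optional sign and decimals
WKT_POINT = r'POINT\s*\(\s*(?P<lon>-?\d+(?:\.\d+)?)\s+(?P<lat>-?\d+(?:\.\d+)?)\s*\)'


def point_coordinates(wkt: pd.Series) -> pd.DataFrame:
    """Return a frame with float 'lon' and 'lat' columns (NaN where there is no point)."""
    return wkt.str.extract(WKT_POINT).astype(float)


# One value taken from the data shown in this notebook, one missing value
sample = pd.Series(['POINT(-74.006015 40.712728)', None])
coords = point_coordinates(sample)
has_coordinates = sample.notnull()  # same test as the filter above
print(coords)
```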

In [17]:
merged_1870_wkt.nunique()

Media URL          491977
City                 2385
Country               108
wkt                  2491
normalized_date     40240
Title                9739
dtype: int64

In [18]:
merged_1870_wkt.to_csv("backup/combined_1870_wkt.csv", index=False)

In [19]:
# Rebuild the filtered frame from merged_df, which still carries the
# year/month/day columns computed in the "Filtering by Time" section.
# This time the year column is kept instead of normalized_date.
merged_1870 = merged_df[merged_df['year'] >= 1870.]

In [20]:
merged_1870.head()

Unnamed: 0,Media URL,City,Country,wkt,normalized_date,Title,year,month,day
0,https://iiif.unige.ch/dhportal/ug8101433/manifest,New York City,United States of America,POINT(-74.006015 40.712728),1936-11-23,LIFE,1936.0,11.0,23.0
1,https://iiif.unige.ch/dhportal/ug8026695/manifest,New York City,United States of America,POINT(-74.006015 40.712728),1936-11-30,LIFE,1936.0,11.0,30.0
2,https://iiif.unige.ch/dhportal/ug8013847/manifest,New York City,United States of America,POINT(-74.006015 40.712728),1936-12-07,LIFE,1936.0,12.0,7.0
3,https://iiif.unige.ch/dhportal/ug8010392/manifest,New York City,United States of America,POINT(-74.006015 40.712728),1936-12-21,LIFE,1936.0,12.0,21.0
4,https://iiif.unige.ch/dhportal/ug8032218/manifest,New York City,United States of America,POINT(-74.006015 40.712728),1936-12-28,LIFE,1936.0,12.0,28.0


In [21]:
merged_1870 = merged_1870.drop(columns=['month', 'day', 'normalized_date'])


In [22]:
merged_1870.head()

Unnamed: 0,Media URL,City,Country,wkt,Title,year
0,https://iiif.unige.ch/dhportal/ug8101433/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936.0
1,https://iiif.unige.ch/dhportal/ug8026695/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936.0
2,https://iiif.unige.ch/dhportal/ug8013847/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936.0
3,https://iiif.unige.ch/dhportal/ug8010392/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936.0
4,https://iiif.unige.ch/dhportal/ug8032218/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936.0


In [23]:
merged_1870['year'] = merged_1870['year'].astype(str)

In [24]:
merged_1870.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 629768 entries, 0 to 4690
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Media URL  629764 non-null  object
 1   City       485671 non-null  object
 2   Country    502433 non-null  object
 3   wkt        497106 non-null  object
 4   Title      604868 non-null  object
 5   year       629768 non-null  object
dtypes: object(6)
memory usage: 33.6+ MB


In [25]:
merged_1870['year'] = merged_1870['year'].str.replace(r'\.0$', '', regex=True)
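The two steps above (cast to string, then strip the trailing `.0`) can also be written as a single cast through `int`. This is a sketch that assumes no missing years remain, which holds after the `>= 1870` filter because comparisons with NaN are false. The sample values are made up.

```python
import pandas as pd

# Float years as produced by the date parsing step
years = pd.Series([1936.0, 1870.0, 2001.0])

# float -> int drops the fractional part, int -> str gives plain '1936'
as_text = years.astype(int).astype(str)
print(as_text.tolist())
```

Applied to the notebook's frame, this would read `merged_1870['year'].astype(int).astype(str)`.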

In [26]:
merged_1870.head()

Unnamed: 0,Media URL,City,Country,wkt,Title,year
0,https://iiif.unige.ch/dhportal/ug8101433/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936
1,https://iiif.unige.ch/dhportal/ug8026695/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936
2,https://iiif.unige.ch/dhportal/ug8013847/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936
3,https://iiif.unige.ch/dhportal/ug8010392/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936
4,https://iiif.unige.ch/dhportal/ug8032218/manifest,New York City,United States of America,POINT(-74.006015 40.712728),LIFE,1936


In [27]:
merged_1870.to_csv("backup/combined_1870_year.csv", index=False)