# Practical Python Tools for Metadata Assessment: 55-minute workshop

## Welcome

Welcome to the workshop! This is meant to be a fun and beginner-friendly introduction to a few useful Python tools, in the context of exploring and manipulating tabular metadata for digital collections.   

In this session, we will focus on some basic functions of Python's pandas data analysis library. We will use pandas for exploring, filtering, reshaping, and merging datasets. 

This notebook provides code that you can execute to see results and generate outputs in the notebook itself, as well as explanations for the examples and exercises we'll be working through together. 

At the end of the notebook, there is a bonus example about dealing with duplicates, and a list of recommended resources for further exploration on your own. 

### Table of Contents

* [Workshop Plan](#wplan)
* [Introduction](#intro)
* [Using this Jupyter Notebook](#usingjn)
    * [Exercise 1: Modify this notebook](#ex1)
    * [Exercise 2: Run code and markdown in cells](#ex2)
* [Example 1: Explore a dataset](#md1)
    * [Exercise 3: Try out pandas methods](#ex3)
* [Example 2: Compare a group of metadata files](#md2)
    * [Exercise 4: Evaluate subjects data in a collection](#ex4)
* [Example 3: Merge information from separate files](#md3)
    * [Exercise 5: Other ways to merge](#ex5)
* [Bonus Example: Find and remove duplicates](#md4)
* [Resources](#res)


## Workshop Plan <a name="wplan"></a>

We will start with a quick demo of using the Jupyter notebook and a couple of exercises to get familiar with notebook commands. 

Then we will walk through three examples of using Python/pandas with digital collections metadata files.  
After each example there will be an exercise you can try out on your own.  


----
  
**Intro, & using the jupyter notebook (10 mins)**
    * Exercises 1 & 2: how to add cells, execute code and markdown cells
    
**Example 1: Explore a dataset (12 mins)** 
    * Exercise 3: Try out pandas methods
    
**Example 2: Compare a group of metadata files (12 mins)**
    * Exercise 4: Evaluate subjects data in a collection
    
**Example 3: Merge information from separate files (12 mins)**
    * Exercise 5: Other ways to merge  
    
**Wrap-up (5 mins)**
    * review resources, and info about installing python

----



## Introduction to Python and pandas<a name="intro"></a>
(basic info about Python)

(basic info about Pandas)

## Using this Jupyter Notebook <a name="usingjn"></a>

You can edit the notebook to run code cells and generate output, and/or to add markdown cells.  

All paths in the notebook refer to locations within the repository, to access example data and/or save output files. 

### Keyboard Shortcuts for Jupyter Notebook :

* `CTRL` + `SHIFT`+ `P` : show 'command palette'
* `esc` : command mode
* `enter` : edit mode
* `a` : insert cell above
* `b` : insert cell below
*  `SHIFT` + `enter`: run a code cell (or render a markdown cell)
* `d d`: delete a cell

###  Exercise 1: Modify this Notebook<a name="ex1"></a>

Follow the instructions below to try out some of the jupyter keyboard shortcuts and get familiar with working in this notebook.

**check out the 'command palette' to see all notebook actions and shortcuts**

From the command palette, you can search for any command, and run that action directly from the palette, as well as seeing the shortcut for that action if available. 

1. Press ``CTRL + SHIFT+ P`` to show the command palette. 
    * With the palette open, search for 'edit', to find the shortcut for `enter 'edit' mode`. Click on this action from the list. This cell will then switch into 'edit' mode. 

**Use shortcuts to switch between edit and command mode**

Edit mode: used for adding/editing content in cells = Green cell border

Command mode: used for navigating and modifying cells = Blue cell border    

* Double-click in this cell to switch the cell into 'Edit mode.' 
* When you double click inside the cell, the cell border will change from blue to green, and the markdown will switch from rendered to markup text.
* Press `Esc` to switch back to 'command' mode. The cell border switches back to blue and you can use navigation commands (such as adding cells below/above, switching from code/markdown.)  
* Practice switching back and forth between edit/command on a few cells. 


**Edit the text in a Markdown cell, execute (render) a Markdown cell**

* Switch this cell into 'edit mode'. 
* Add a bullet point below these lines of text and type something, for example: ``'DONE'``. (can copy/paste example below if easier!)
* Then use `CTRL` + `Enter` to "run" the cell/render the markdown.  
    * for example: `DONE`   



**Add more cells**

* With this cell in command mode (blue border), add a cell underneath it, then add some cells above the cells you added. 
    * press `b` to add a cell below an existing cell. 
    * You can keep pressing `b` to add more cells; it doesn't hurt anything to have empty cells in the notebook. 
    * press `a` to add a cell above an existing cell. 

### Exercise 2: Add and run code and Markdown in cells <a name="ex2"></a>

Practice with creating cells, switching cells from code to Markdown, adding content to cells and running them. 

**Create a Markdown cell, add some text**

create a Markdown cell: 
* click on one of the empty cells you created in Exercise 1, or just create another one here. 
* A brand-new cell will be in command mode (blue cell border)  
* if there's an `In [ ]:` to the left of the cell, this means it's a code cell. Any text you type into the cell will be treated as code. 
* To convert the cell to a Markdown cell, press `m` to switch the cell to markdown.  
    * The `In [ ]:` to the left of the cell will disappear, indicating it is now a Markdown cell. Any text you type into the cell will now be treated as markdown. 
* If this did not work, make sure the cell is in command mode. Switch to Command mode by pressing `Esc`. The cell border will turn blue. 

add text in a Markdown cell: 
* Switch the cell to Edit mode by clicking inside it. 
* In the markdown cell, type a header, and then some regular text. For example: 

    ```
    #### here's an example header  
    and some regular paragraph text 
    ```

#### here's an example header  
and some regular paragraph text 

**Execute/render a markdown cell**

* As you did in exercise 1, press `CTRL` + `Enter` to "run" the markdown cell/render the markdown that you just typed.  
* the text will display as a formatted version, and the cell border will switch from green to blue. 


**Add code to a code cell**

Work with a code cell: 
* Click on one of the empty cells you have created, or just create a new one here. 
* Look for the `In [ ]:` to the left of the cell, to make sure it's a code cell. 
* if there's no  `In [ ]:` to the left of the cell, switch into Command mode and convert the cell to Code by pressing `y`. 
* Switch the cell back to Edit mode by clicking inside it. The cell border will turn green. 

Add some code: 
* Type a line of simple python code. For example: 

     ```
     print("here's a line of python code output.")
     ```


**Run the code cell.**

* As you did with the markdown cells, press `CTRL` + `Enter` to run the cell and execute the code in it. 


In [None]:
## note that in a code cell, hashtags indicate a comment, not a header as in a markdown cell

print("here's a line of python code output.")


## Metadata Example 1: Explore a Dataset with pandas<a name="md1"></a>

This example walks through getting oriented with using python/pandas for viewing and analysing descriptive metadata files.

In this scenario we are working with a small group of metadata files that have varied sets of inconsistently organized fields.   

We will import metadata from csv and tsv files, exclude empty and/or irrelevant fields from our dataframes, and identify a few relevant fields to focus on for assessment, selecting the same set of fields from each collection. 

**Learning objectives in this example:**

* reading a data file into a dataframe
* creating dataframes with differently delimited data
* assessing overall size and contents of the dataframe
* selecting relevant columns to include for a task
* identifying and changing datatypes


#### Import libraries for Python

Importing libraries loads them into memory so that Python can use them. 

Libraries provide specific methods for particular kinds of work. 

We are importing: 
* the pandas data analysis library
* `os` for working with files and directories  
* `matplotlib` for generating some basic graphs from data

Setting `%matplotlib inline` allows plots to render within the notebook. 


In [None]:
# import the pandas, os, and matplotlib libraries 
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline

#### Next: check that we are in the right place!!

We'll do a couple quick checks to get oriented and make sure that we're in the right directory. 

This is not really necessary, because this notebook is located in/running from within the 'notebook_exercises' subfolder, so we already know that we will be running commands relative to that location, but it's always nice to take a look around to see where you are. 

In [None]:
# os.getcwd() outputs the current working directory, similar to pwd in bash

os.getcwd()

In [None]:
# os.listdir() with no parameter returns a list of the files and directories in the current working directory

os.listdir()

In [None]:
# use os.listdir() plus a parameter to see what's in the exampleData subfolder

os.listdir('./exampleData/')

#### Create a dataframe from example datasets

A dataframe is a pandas object with rows and columns that can be selected for running calculations and manipulating the data in a lot of ways. 

You can read many different data formats into a dataframe, including csv, tsv, and even excel sheets. 

The command below uses the variable name 'maps' for creating a dataframe using the pandas `read_csv` function. 

In [None]:
# create a dataframe: 'maps' from the example datasheet 

maps=pd.read_csv("./exampleData/maps.csv")

#### Inspect a dataframe

Next, we'll explore the maps dataframe with pandas attributes and methods, to get a sense of how large this dataset is (how many rows and columns), what the column headers are and how many of them are empty, and what the datatypes are in each column. 

* shape
* columns 
* info() 

In [None]:
# the shape attribute displays the number of rows and columns for the dataframe, to get a sense of its overall size

maps.shape

In [None]:
# columns attribute displays column labels

maps.columns

In [None]:
# the info method displays datatypes and numbers of values per column, and memory information

maps.info()

**Sorting dataframe columns**

Since the columns listed above are not ordered in a logical way, it's hard to look for a particular column label in the output. 

Dataframes can be sorted by rows, columns, or values to present the data according to the order we specify. 

Below we'll display the column labels sorted alphabetically, which makes it easy to check if this collection has fields named  Title, Usage Rights, Date, etc. 


In [None]:
# Sort the dataframe to order the output of column labels from the info() method

maps.sort_index(axis=1).info()

#### View contents of the dataframe 

The head() and tail() methods in pandas display the first or last n rows of data. 

By default head() and tail() will display 5 rows; below we specify 3 to see fewer rows. 

Note that the column headers are no longer sorted alphabetically, because we did not apply the sort persistently.


In [None]:
maps.head(3)
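
Why the sort did not stick can be sketched with a tiny, made-up dataframe (the `demo` name and its values are assumptions for illustration, not part of the workshop data). Methods like `sort_index` return a new dataframe and leave the original alone unless you assign the result back to a variable:

```python
import pandas as pd

# Hypothetical three-column dataframe, deliberately not in alphabetical order
demo = pd.DataFrame({"Title": ["Map A"], "Date": ["1862"], "Creator": ["Smith"]})

# sort_index(axis=1) returns a NEW dataframe with the columns sorted by label
demo_sorted = demo.sort_index(axis=1)

print(list(demo_sorted.columns))  # sorted copy
print(list(demo.columns))         # original is unchanged
```

To keep the sorted order, you would assign the result back, for example `demo = demo.sort_index(axis=1)`.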

#### Explore a second dataset

You can also use the read_csv function for other kinds of delimiters. For example, you can specify tab-delimited as with the metadata file below. 


In [None]:
# read in another example dataset as a separate dataframe. 
# use sep parameter to specify tab as delimiter

rev=pd.read_csv("./exampleData/60001.txt", sep='\t')
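
If you want to see the `sep` parameter at work without opening a file, here is a minimal sketch that reads a small tab-delimited string instead. The text in `tsv_text` is invented for illustration; `io.StringIO` simply lets `read_csv` treat a string as if it were a file:

```python
import io
import pandas as pd

# Hypothetical tab-delimited text standing in for a metadata export
tsv_text = (
    "Title\tDate\tfilename\n"
    "Map of Richmond\t1862\tmap001.tif\n"
    "Map of Atlanta\t1864\tmap002.tif\n"
)

# sep='\t' tells read_csv to split fields on tabs instead of commas
sample = pd.read_csv(io.StringIO(tsv_text), sep="\t")

print(sample.shape)
print(list(sample.columns))
```

Without `sep='\t'`, pandas would look for commas and read each line as a single column.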

**View size and columns information for second dataset**

We'll again use shape and info() to get a basic sense of the second dataset. 

We can see that this data sheet has nearly twice as many column labels, and tons of empty fields.  

In [None]:
rev.shape

In [None]:
rev.sort_index(axis=1).info()

#### Get rid of empty columns

Many of these columns are empty, so we will exclude them. 

Use the inplace parameter to apply this change directly to the dataframe we are currently working with.

Then check the columns again; our data is now more manageable. 


In [None]:
# Drop empty columns. The inplace parameter overwrites the working dataframe. 
rev.dropna(axis = 1, how ='all', inplace = True)

# output updated dataframe
rev.sort_index(axis=1).info()
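
The difference between `how='all'` and the stricter `how='any'` can be sketched on a small invented dataframe (the `demo` data below is an assumption for illustration only):

```python
import numpy as np
import pandas as pd

# Hypothetical data: one full column, one partly filled, one completely empty
demo = pd.DataFrame({
    "Title": ["Map A", "Map B", "Map C"],
    "Creator": ["Smith", None, "Jones"],
    "Notes": [np.nan, np.nan, np.nan],
})

# how='all' drops a column only when EVERY value in it is missing
trimmed = demo.dropna(axis=1, how="all")

# how='any' drops a column when AT LEAST ONE value is missing
strict = demo.dropna(axis=1, how="any")

print(list(trimmed.columns))
print(list(strict.columns))
```

For metadata cleanup, `how='all'` is usually the safer choice, since partly filled fields often still hold useful information.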

#### Other ways to view and select data

The head() and tail() methods can be applied to a series (a single column) as well as to the whole dataframe. 

Use the head method to list the first twelve values in the Title field in our dataframe. 


In [None]:
# View the first 12 rows of the Title column

rev.Title.head(12)

**Chaining methods**

You can chain methods together so that each method is applied to the result of the one before it. 

For example, we can use the sort_values method to sort the Title series reverse-alphabetically (via the 'ascending' parameter set to False), then apply the head() method to view the first 12 values. 


In [None]:
# View the first 12 values in the 'Title' series sorted reverse-alphabetically

rev.Title.sort_values(ascending=False).head(12)
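
A chained expression is just a shorthand for doing the steps one at a time. The sketch below uses an invented `titles` series to show that both routes give the same result. It also shows that sorting text is case-sensitive: lowercase letters sort after uppercase ones, so a lowercase title lands first in a descending sort.

```python
import pandas as pd

# Hypothetical titles, including one that starts with a lowercase letter
titles = pd.Series(["Delta", "alpha", "Charlie", "Bravo"], name="Title")

# Step by step: sort first, then take the first two values of the sorted result
step1 = titles.sort_values(ascending=False)
step2 = step1.head(2)

# Chained: the same two steps written as one expression
chained = titles.sort_values(ascending=False).head(2)

print(chained.tolist())
```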

**Specifying rows and columns**

The .loc indexer allows for selecting individual and multiple rows and columns by label. Below we'll focus on the ``'Title','Date', 'Usage Rights', and 'filename'`` columns, including all rows of the dataframe. 

We will use the tail() method to view the last n rows in these columns. 


In [None]:
# specify multiple rows and columns by label with loc method 

rev.loc[:,['Title','Date', 'Usage Rights', 'filename']].tail(8)
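
As a rough sketch of how the two parts of `.loc[rows, columns]` work, the example below uses an invented `demo` dataframe. A bare `:` means "all rows", and a row slice such as `1:2` is label-based, so it includes both the start and the end label:

```python
import pandas as pd

# Hypothetical four-row dataframe with the default 0..3 row labels
demo = pd.DataFrame({
    "Title": ["Map A", "Map B", "Map C", "Map D"],
    "Date": ["1861", "1862", "1863", "1864"],
    "filename": ["a.tif", "b.tif", "c.tif", "d.tif"],
    "Notes": ["", "", "", ""],
})

# All rows, two named columns
subset = demo.loc[:, ["Title", "Date"]]

# Rows labelled 1 through 2 (inclusive), two named columns
rows = demo.loc[1:2, ["Title", "filename"]]

print(subset.shape)
print(rows)
```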

#### Working with dates 

Notice the format of the values in the 'Date' column above. Also, remember from the info() output that the datatype for 'Date' is currently 'object'. Right now python is viewing these dates as just strings, which limits what we can do with them. 

We can create a new column in the dataframe, using the pandas `to_datetime` function to convert the 'Date' values to the 'datetime' datatype, which makes date-specific operations (such as sorting chronologically or extracting the year) available. 


In [None]:
# check the current datatype and values information for the `Date` column in the rev dataframe

rev.Date.describe()

In [None]:
# Create a new column 'datesformat', with the values from the Date column 
# Use to_datetime to convert the 'Date' values to the datetime datatype

rev['datesformat']=pd.to_datetime(rev['Date'])

rev.datesformat.head()

In [None]:
# check the current datatype and values information for the new `datesformat` column in the rev dataframe

rev.datesformat.describe()
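
Real metadata often contains date values that cannot be parsed (for example "undated"). The sketch below is not part of the workshop data; it uses an invented `dates` series to show one way to handle that, assuming you are happy for unparseable values to become missing. The `errors='coerce'` option turns them into `NaT` ("not a time") instead of raising an error:

```python
import pandas as pd

# Hypothetical date strings, one of which is not a real date
dates = pd.Series(["1862-05-01", "1863-11-19", "undated"], name="Date")

# errors='coerce' converts values that cannot be parsed into NaT
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed.dtype)
print(parsed.isna().sum())   # how many values could not be converted
print(parsed.dt.year)        # the .dt accessor exposes date parts such as the year
```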


**Inspect the dataframe with new column added**

Note that the new column will be added at the end of the dataframe. 


In [None]:
# check the columns and dtypes in the rev dataframe

rev.info()

**Apply the same columns to the maps dataframe**

Meanwhile, our maps dataframe is still available in working memory. 

Because the same column labels exist in the maps dataset, we can select out these columns from maps using .loc, the same way as we did with rev above.  


In [None]:
# examine the same columns in the 'maps' dataframe
# tail method displays the last n rows

maps.loc[:,['Title','Date', 'Usage Rights', 'filename']].tail(8)

### Recap of Metadata Example 1

this example covered the following: 

* read data of different formats into a dataframe
* explore a dataframe as a whole, series within a dataframe, values in rows and columns
* filter datasets by sorting, selecting, and dropping columns
* work with multiple dataframes at once
* create a new column in a dataframe 
* format dates by converting column datatype with to_datetime




### Exercise 3: Try out pandas methods <a name="ex3"></a>


Try out some of the methods demonstrated above, with an example datasheet in this repository. 

Use the standard `df` variable name to create a dataframe from the '03883.txt' file.   

Note that this metadata file is not a csv, so you will need to specify the delimiter. 

The syntax for this is:
```
df=pd.read_csv('./exampleData/03883.txt', sep='\t')

```



In [None]:
df=pd.read_csv('./exampleData/03883.txt', sep='\t')

View the shape and column labels of the dataframe. Then inspect the Title column, sorted alphabetically. 


```
df.shape  
```
```
df.info()  
```
```
df.Title.sort_values().head(12)  

```

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.Title.sort_values().head(12)  

###  Solution to Exercise 3: Try out pandas methods

## Metadata Example 2: Compare a group of metadata files <a name="md2"></a>

This example continues where Example 1 left off. Having identified a set of fields to analyze in our group of collections, we will create new dataframes from the raw datasheets. This time we will include only the set of relevant columns from a group of collections. 

We will then concatenate the new dataframes into a single compiled dataset that makes it easy to compare completeness of metadata across the group as a whole. 

We will generate basic graphs to show comparisons between the collections. 


In [None]:
# create a variable to store the column labels we want to select from each dataset

coltitles=['Title','Date','Usage Rights', 'Subject Geographic', 'Subject Name' , 'Subject Topical']

In [None]:
# read in the same datasheets from the last example
# create a new column in each dataframe that contains a short Collection Name value

maps=pd.read_csv("./exampleData/maps.csv", usecols=coltitles)
maps['colname'] ='Civil War Maps'

rev=pd.read_csv("./exampleData/60001.txt", usecols=coltitles, sep='\t')
rev['colname'] ='Revolution Photographs'

rmb=pd.read_csv("./exampleData/03883.txt", sep='\t', usecols=coltitles)
rmb['colname'] ='Roy M Brown Papers'

In [None]:
# show info for updated dataframes 

maps.sort_index(axis=1).info()
print('\n')

rev.sort_index(axis=1).info()
print('\n')

rmb.sort_index(axis=1).info()

In [None]:
# concatenate dataframes into a single dataframe

collstack = pd.concat([maps, rmb, rev], axis=0, sort=True)
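
What `concat` does with columns that exist in only one of the dataframes can be sketched with two small invented dataframes (`a` and `b` are assumptions for illustration). With `axis=0` the rows are stacked, the result holds the union of all columns, and gaps are filled with missing values:

```python
import pandas as pd

# Two hypothetical collections with partly different columns
a = pd.DataFrame({"Title": ["Map A", "Map B"], "Date": ["1862", "1863"]})
a["colname"] = "Collection A"

b = pd.DataFrame({"Title": ["Photo 1"], "Subject Name": ["Doe, Jane"]})
b["colname"] = "Collection B"

# Stack rows; sort=True orders the combined column labels
stacked = pd.concat([a, b], axis=0, sort=True)

print(stacked.shape)
print(list(stacked.columns))
print(stacked["Date"].isna().sum())  # rows from b have no Date
```

Note that the original row labels are kept, so they repeat in the stacked result. Passing `ignore_index=True` to `concat` would renumber the rows from 0 if that matters for your task.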

In [None]:
collstack.shape

In [None]:
collstack.info()

In [None]:
collstack.groupby('colname')[['Title', 'Usage Rights']].count()
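
Here is a minimal sketch of how `groupby` followed by `count()` measures completeness, using an invented `stack` dataframe. `count()` tallies only non-missing values, so a field that is often empty shows a lower number than `Title`. Selecting several columns from a grouped dataframe takes a list of labels (double square brackets), which is the form current pandas versions accept:

```python
import pandas as pd

# Hypothetical stacked metadata from two collections
stack = pd.DataFrame({
    "colname": ["A", "A", "B", "B", "B"],
    "Title": ["t1", "t2", "t3", "t4", "t5"],
    "Usage Rights": ["CC", None, None, None, "PD"],
})

# Count non-missing values per collection for the selected columns
counts = stack.groupby("colname")[["Title", "Usage Rights"]].count()

print(counts)
```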

In [None]:
collstack.groupby('colname')[['Title', 'Usage Rights']].count().plot(kind='bar', figsize=(8,6), width=0.8)
plt.ylabel('Count of items')
plt.xlabel('')

In [None]:
collstack.groupby('colname')[['Title', 'Subject Geographic', 'Subject Name', 'Subject Topical']].count()

In [None]:
collstack.groupby('colname')[['Title', 'Subject Geographic', 'Subject Name', 'Subject Topical']].count().plot(kind='barh', figsize=(8,6), width=0.6)

In [None]:
maps.groupby(['Subject Name']).Title.count().sort_values(ascending=False).plot(kind='barh')

In [None]:
maps.groupby(['Subject Name']).Title.count().sort_values(ascending=False).head().plot(kind='barh', figsize=(8,6))

In [None]:
maps.groupby(['Subject Name']).Title.count().sort_values(ascending=False).head().plot(kind='pie', figsize=(8,8),startangle=180)
plt.ylabel('')
#plt.title("Most Common Subject Names in Maps Collection")

In [None]:
maps.groupby(['Subject Name']).Title.count().sort_values(ascending=False).head(10)

In [None]:
# create a variable that specifies a range to examine
# exclude the most-common heading to focus on the next seven values

subcounts=maps['Subject Name'].value_counts()

totals=subcounts[(subcounts <= 103) & (subcounts>2)]

totals.sort_values(ascending=False).plot(kind='pie', figsize=(8,8),startangle=90)
plt.ylabel('')
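
The range filter in the cell above combines two conditions with `&`. A small sketch with invented names and counts (not the real maps data) shows the pattern: `value_counts()` tallies each distinct value, and the boolean mask keeps only the tallies that fall inside the range you choose. Each condition needs its own parentheses.

```python
import pandas as pd

# Hypothetical subject names: one very common, two mid-range, one rare
names = pd.Series(
    ["Lee"] * 5 + ["Grant"] * 3 + ["Sherman"] * 3 + ["Meade"],
    name="Subject Name",
)

counts = names.value_counts()

# Keep values that appear fewer than 5 times but more than once
mid_range = counts[(counts < 5) & (counts > 1)]

print(mid_range)
```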

### Recap of Example 2 

* select specific columns from multiple different datasets using a variable to store a list of fields
* concatenate multiple dataframes into a single dataframe
* use groupby to organize datasets for comparison
* create basic graphs to compare datasets
* explore different graph formats to represent data attributes


### Exercise 4: Evaluate Subjects data in a collection<a name="ex4"></a>


In this exercise, you will evaluate data in the 'Subject Topical' field in the Maps collection. 

Based on the output below from maps.info(), it looks like the Subject Topical field is fairly complete in this collection. 

```
maps.info()
```

```
RangeIndex: 161 entries, 0 to 160
Data columns (total 7 columns):
Title                 161 non-null object
Date                  161 non-null object
Subject Topical       161 non-null object
Subject Name          161 non-null object
Subject Geographic    161 non-null object
Usage Rights          0 non-null float64
colname               161 non-null object
```
Try using the describe() method to get more details about the values represented within the 'Subject Topical' field. 

Would plotting the maps titles according to 'Subject Topical' fields assigned to them make an interesting graph? 

To do this exercise, add code cells below this cell (or use the empty cells provided). Use those cells to check the output from describe() for the maps dataframe, and to generate a plot for numbers of maps titles grouped by Subject Topical. 

For reference, the solution is demonstrated in the next cells below. 


### Solution to Exercise 4: Evaluate Subjects data in a collection

In [None]:
# use describe() to evaluate the contents of the 'Subject Topical' field

maps['Subject Topical'].describe()

The output from 'describe()' shows that although every item in this collection has a Subject Topical field, it is all the same value, which does not make a very interesting plot. 

In [None]:
# generate a plot for the distribution of Subject Topical field across titles

maps.groupby(['Subject Topical']).Title.count().plot(kind='barh',color=['grey'])
plt.xlabel('item count')

## Metadata Example 3: Merge information from separate files <a name="md3"></a>

Another useful feature of pandas is that it allows you to do SQL-like joins with plain text files.  

In this example, we will create a merged dataframe from descriptive metadata and file size information held in separate datasets. We will rename columns in the descriptive metadata dataframe so that we can merge on the matching columns in our file sizes datasheets. (It's also possible to specify the columns to merge separately for the left and right dataframes if they are not named the same!)  

**Learning objectives for this example:**
* Review removing empty columns - the first dataset has a large number of columns, some of which have no data
* Datatypes can be complicated and lead to potential errors; you may need to specify datatypes for columns
* Renaming columns 
* Merging dataframes 
* Write a dataframe to a csv output file or other format 
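
As a sketch of the alternative mentioned above (merging without renaming first), `pd.merge` accepts `left_on` and `right_on` to name the key column in each dataframe separately. The `meta` and `sizes` dataframes below are invented for illustration:

```python
import pandas as pd

# Hypothetical descriptive metadata and file size listings with differently named keys
meta = pd.DataFrame({
    "Object file name": ["a.jpg", "b.jpg", "c.jpg"],
    "Title": ["A", "B", "C"],
})
sizes = pd.DataFrame({"Name": ["a.jpg", "b.jpg"], "Size": [100, 250]})

# Match meta's 'Object file name' against sizes' 'Name'; keep every row of meta
joined = pd.merge(meta, sizes, left_on="Object file name", right_on="Name", how="left")

print(joined)
```

One trade-off of this approach is that both key columns appear in the result, which is part of why renaming before merging can be tidier.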




In [None]:
# uncomment the import statement below and run this cell if your notebook was reset and you need the libraries again

#import pandas as pd

In [None]:
# read in collections metadata file as 'metadata' dataframe
metadata=pd.read_csv("./exampleData/03823_metadata.txt", sep='\t')



In [None]:
# use pandas attributes and methods to examine the new dataframe. 
# Start with the shape attribute to summarize rows and columns.

print(metadata.shape)


In [None]:
# use the info method to see column names and item counts in each column

metadata.info()



In [None]:
#Remove the empty columns using dropna
# and re-check the column names and item counts by re-running the info method on the reshaped dataset.

metadata.dropna(axis = 1, how ='all', inplace = True)

metadata.info()



In [None]:
metadata['Collection Number'].head(8)

In [None]:
metadata=pd.read_csv("./exampleData/03823_metadata.txt", sep='\t', dtype={'Collection Number':object})

metadata['Collection Number'].head(8)
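
One common reason to set a column's `dtype` when reading a file is to keep leading zeros in identifiers. Whether that is the exact issue in this dataset is an assumption on our part; the sketch below uses invented data to show the general effect:

```python
import io
import pandas as pd

# Hypothetical CSV text with identifiers that start with zeros
csv_text = "Collection Number,Object\n03823,folder_1\n00412,folder_2\n"

# Default behaviour: values that look numeric are read as integers
as_number = pd.read_csv(io.StringIO(csv_text))

# dtype=object keeps the column as text, exactly as written in the file
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"Collection Number": object})

print(as_number["Collection Number"].tolist())
print(as_text["Collection Number"].tolist())
```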

In [None]:
#Remove the empty columns using dropna

metadata.dropna(axis = 1, how ='all', inplace = True)



In [None]:
#Use the head method to see the first n rows

metadata.head()


In [None]:
#Create the second dataframe with the filelist datasheet

sizelist=pd.read_csv("./exampleData/03823_access_images.csv")

In [None]:
sizelist.info()

In [None]:
#Re-create the second dataframe with only the columns we need, then rename them

sizelist=pd.read_csv("./exampleData/03823_access_images.csv", usecols=['Name','Full Path', 'Size'])

sizelist.rename(columns={'Name':'AccessName', 'Full Path':'AccessFilePath', 'Size' : 'AccessFileSize'}, inplace=True)


In [None]:
# Use shape and info to take a look at the sizelist dataframe.
print(sizelist.shape) 

sizelist.info()


In [None]:
#Use the head method to see the first n rows of the dataframe

sizelist.head()


In [None]:
# Rename a column 

metadata.rename(columns={'Object file name':'AccessName'}, inplace=True)
metadata.columns


In [None]:
# Join sizelist data onto the metadata dataframe
# and view info for the new, merged dataframe

combined = pd.merge(metadata, sizelist,on='AccessName', how='left')

combined.info()
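
The `how` parameter controls which rows survive a merge. The sketch below uses two invented dataframes to compare `left`, `inner`, and `outer` joins. Adding `indicator=True` creates a `_merge` column that records where each row came from, which is a handy way to spot files with no matching metadata (or the reverse):

```python
import pandas as pd

# Hypothetical metadata rows and file listing that only partly overlap
left = pd.DataFrame({"AccessName": ["a.jpg", "b.jpg", "c.jpg"], "Title": ["A", "B", "C"]})
right = pd.DataFrame({
    "AccessName": ["b.jpg", "c.jpg", "icon.jpg"],
    "AccessFileSize": ["10 Bytes", "20 Bytes", "5 Bytes"],
})

left_join = pd.merge(left, right, on="AccessName", how="left")    # every row of left
inner_join = pd.merge(left, right, on="AccessName", how="inner")  # only matching rows
outer_join = pd.merge(left, right, on="AccessName", how="outer", indicator=True)  # all rows

print(len(left_join), len(inner_join), len(outer_join))
print(outer_join["_merge"].value_counts())
```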

In [None]:
combined.AccessFileSize.head(8)


In [None]:
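# strip the ' Bytes' suffix from each size value, storing the result in a new column
combined['AccessSizeNum'] = combined.AccessFileSize.apply(lambda x: x.replace(' Bytes',''))

# convert the new column from strings to integers so it can be used in calculations
combined=combined.astype({'AccessSizeNum':'int64'})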

combined.info()
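
After a left merge, any row without a match has a missing value in the merged-in columns. A lambda that calls `x.replace(...)` would raise an error on such a row, because a missing value is not a string. The sketch below is one possible alternative under that assumption, using an invented `sizes` series: the pandas `.str` accessor skips missing values, and `pd.to_numeric` with `errors='coerce'` converts what it can.

```python
import pandas as pd

# Hypothetical size strings, with one missing value as you might see after a left merge
sizes = pd.Series(["1024 Bytes", "204800 Bytes", None], name="AccessFileSize")

# .str.replace leaves missing values as missing instead of raising an error
cleaned = sizes.str.replace(" Bytes", "", regex=False)

# Convert to numbers; anything that cannot be converted becomes NaN
as_numbers = pd.to_numeric(cleaned, errors="coerce")

print(as_numbers)
```

Because of the missing value the result is a float column. If every row has a size, converting with `astype('int64')` works as well.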


In [None]:
# Write the merged dataframe to a new csv

combined.to_csv('./output/accessfilesizes_metadata_03823.csv', index=False,encoding='utf-8-sig')
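
As a small sketch of what the `to_csv` options do, the example below writes an invented dataframe to a temporary folder and reads it back. `index=False` leaves the row labels out of the file, and the `utf-8-sig` encoding adds a byte-order mark at the start of the file, which helps some spreadsheet programs recognise accented characters. The file name `demo_output.csv` is hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical dataframe with a non-ASCII character in a title
demo = pd.DataFrame({"Title": ["Café map"], "AccessSizeNum": [1024]})

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "demo_output.csv")

    # Write without the row index, using UTF-8 with a byte-order mark
    demo.to_csv(out_path, index=False, encoding="utf-8-sig")

    # Peek at the first three bytes of the file: this is the byte-order mark
    with open(out_path, "rb") as f:
        first_bytes = f.read(3)

    # Read the file back in to confirm the contents survived the round trip
    round_trip = pd.read_csv(out_path, encoding="utf-8-sig")

print(first_bytes)
print(round_trip)
```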


In [None]:
#Create a third dataframe with the masters files datasheet

masters=pd.read_csv("./exampleData/03823_masters.csv", usecols=['Name','Full Path', 'Size'])

masters.rename(columns={'Name':'MastersName', 'Full Path':'MasterFilePath', 'Size' : 'MasterFileSize'}, inplace=True)

masters['MasterSizeNum'] = masters.MasterFileSize.apply(lambda x: x.replace(' Bytes',''))

masters=masters.astype({'MasterSizeNum':'int64'})

In [None]:
masters.info()


In [None]:
masters.tail()

In [None]:
combined.head()

In [None]:
combined.rename(columns={'filename':'MastersName'}, inplace=True)

combined.info()

In [None]:
all3 = pd.merge(combined, masters,on='MastersName', how='left')

all3.info()

In [None]:
all3.loc[:,['Collection Number','Object', 'MastersName', 'AccessName', 'AccessSizeNum','MasterSizeNum']].head(12)


###  Exercise 5: Other ways to merge <a name="ex5"></a>

In [None]:
# Use the tail method to see the last n rows of the dataframe
sizelist.tail()

In [None]:
# Use the pandas string method str.count to count instances of 'icon'

sizelist['AccessName'].str.count("icon").sum()

###  Solution to Exercise 5: Other ways to merge

## Bonus Metadata Example: Find and remove duplicates <a name="md4"></a>





In [None]:
coll=pd.read_csv('./exampleData/coll_dupes_example.csv')


In [None]:
coll.info()

In [None]:
coll=pd.read_csv('./exampleData/coll_dupes_example.csv', usecols=['Collection Number','Object', 'filename', 'Date created', 'Date modified'])


In [None]:
coll.info()

In [None]:
coll.shape

In [None]:
coll.Object.duplicated().sum()

In [None]:
# remove duplicates with drop_duplicates method

deduped= coll.drop_duplicates(subset= 'Object', keep='first')

In [None]:
deduped.shape


In [None]:
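# sanity check: duplicates removed plus rows kept should equal the original row count from coll.shape
31+39125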

In [None]:
dupes_all=coll.loc[coll.Object.duplicated(keep=False), :]

In [None]:

dupes_all.shape
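
The three duplicate-related calls used above answer slightly different questions. This sketch runs them on a tiny invented `items` dataframe so the numbers are easy to check by eye:

```python
import pandas as pd

# Hypothetical item list: 'obj1' appears three times, 'obj2' twice, 'obj3' once
items = pd.DataFrame({
    "Object": ["obj1", "obj2", "obj1", "obj3", "obj2", "obj1"],
    "filename": [f"f{i}.tif" for i in range(6)],
})

# duplicated() flags every repeat AFTER the first occurrence
extra = items.Object.duplicated().sum()

# drop_duplicates(keep='first') keeps one row per Object value
deduped = items.drop_duplicates(subset="Object", keep="first")

# duplicated(keep=False) flags ALL rows that share a value, including the first
all_copies = items.loc[items.Object.duplicated(keep=False), :]

print(extra, len(deduped), len(all_copies))

# Rows removed plus rows kept should add up to the original row count
assert extra + len(deduped) == len(items)
```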


In [None]:
dupes_all.sort_values(['Collection Number','Object'])

In [None]:
# write duplicates dataframe to a csv, sorted by Collection number then by Object field

dupes_all.sort_values(['Collection Number','Object']).to_csv('./output/coll_dupes_all_sorted.csv', index=False, encoding='utf-8')

In [None]:
# can look at new datasheet as a dataframe, or open it in an external spreadsheet program

da=pd.read_csv('./output/coll_dupes_all_sorted.csv')

In [None]:
da


## Resources <a name="res"></a>





