# LCWA Powerpoint Dataset Exploration

This notebook uses Python to explore a dataset of 1,000 PowerPoint
(.ppt/.pptx) files that were randomly selected from files archived by the
Library of Congress from US federal government websites.

In [23]:
import csv
#import json
import collections
import pandas as pd

_NB: despite its `.csv` extension, this file is tab-delimited and encoded as UTF-16 text.
You will need to specify both, as demonstrated below, in order
for Python to process the file's data._

In [15]:
linecount = 0
with open('lcwa_gov_powerpoint_metadata.csv', 'r', newline='', encoding='utf-16') as data:
    lcwa_powerpoint_metadata = csv.reader(data, dialect='excel-tab', delimiter='\t')
    for line in lcwa_powerpoint_metadata:
        print(linecount,' '.join(line))
        linecount += 1

0 urlkey timestamp original mimetype digest title company creation_date last_modified revision_number slide_count word_count file_size sha256 sha512
1 ada.ky.gov/adaemploymenttaxincentives.ppt 20100331231723 http://ada.ky.gov/ADAEMPLOYMENTTaxIncentives.ppt application/vnd.ms-powerpoint 4Y6Z6KRRHS3TIVQ2ZJVYVW7RSGXOBS6Y    Human Development Institute Promoting Independence, Productivity, and Integration for All People _________________________________    - 1998-12-17T14:31:23Z 2005-03-25T15:56:39Z 47 31 781 137216 9b6b74447622ac1646a538e63e452569e111cf775d1244373678d947405e33be 9567992755c0b1b38fdccb4440a4bb4b1758301b5cd1978b95bfe807babb70bf4c1e5ede85d9d26daee75196a3d62453ac9919bdbbf829508a26e72c35fc4df6
2 astrophysics.gsfc.nasa.gov/conferences/supernova1987a/staveley-smith.ppt 20090419154246 http://astrophysics.gsfc.nasa.gov/conferences/supernova1987a/Staveley-Smith.ppt application/vnd.ms-powerpoint 7EZSHVTKZE2FXP5SXI6DO2RJPIDLMXOP xNTD and the Pathway to SKA Lister Staveley-Smith - 200

695 cdph.ca.gov/programs/cpns/documents/network-spu-howfundingwork.ppt 20100316050604 http://www.cdph.ca.gov/programs/cpns/Documents/Network-SPU-HowFundingWork.ppt application/vnd.ms-powerpoint BT2LYDUELPJB6BSQ5VVRQHQSHK77R5TZ Where do Network Funds Come From and How are They Used?  2007-10-09T20:53:04Z 2007-10-10T14:42:32Z 1 1 85 847872 7bac474e6ddd1dcc3ead1ff60dcf1a47ef3560e537230992aa40936de6294299 68a9188d7e3f44ec7ef0cea167f529e492aa18ab6dbf002c8910ae851a05a8a88824fcd54fe2d9a85d65396b4e93bb98ba2b1520b4fc1dbc487e35569eb6abbc
696 gsics.nesdis.noaa.gov/pub/development/20150316/5k_aisehng_wu_iasi_modis_airs.ppt 20151118141038 https://gsics.nesdis.noaa.gov/pub/Development/20150316/5k_Aisehng_Wu_IASI_MODIS_AIRS.ppt application/vnd.ms-powerpoint R24KLSE7EC7LFVAIFK6MOV2P4E7EHIBL Slide 1 - 1601-01-01T00:00:00Z 2015-03-16T21:22:21Z 172 13 472 1455104 3f3617ddfde34eba50e3877b4fdc2649625a2767841a65314797a4fa7f01897c 7f5d35fb9483a0dc519167a9d076254a9f76f36f443287262c4dbf3b6e64087fea910e1bea3005

In [13]:
print(linecount)

1001


The cell above prints 1,001 because `linecount` counts every line in the file, including the header row. That is what we would expect if the set includes 1,000 entries.
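
As a minimal sketch of the same check, the header row can be consumed separately so that only data rows are counted. The sample text, the temporary file, and the names `sample_text`, `sample_path`, `header`, and `row_count` are assumptions made for illustration; they are not part of the original notebook or dataset.

```python
import csv
import os
import tempfile

# Hypothetical three-row sample standing in for the real metadata file.
sample_text = (
    "urlkey\ttimestamp\tcompany\n"
    "ada.ky.gov/adaemploymenttaxincentives.ppt\t20100331231723\t-\n"
    "bhpr.hrsa.gov/grants00/ta-2000.pps\t20010429074551\tHRSA\n"
    "cdc.gov/nchstp/dstd/stats_trends/chlamydi.ppt\t20060828091122\tCDC\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_metadata.csv")
    # Write the sample as tab-delimited UTF-16, the same shape as the dataset.
    with open(sample_path, "w", newline="", encoding="utf-16") as out:
        out.write(sample_text)

    with open(sample_path, "r", newline="", encoding="utf-16") as data:
        reader = csv.reader(data, dialect="excel-tab")
        header = next(reader)               # consume the header row first
        row_count = sum(1 for _ in reader)  # count only the data rows

print(len(header), "columns;", row_count, "data rows")
```

Run against the real file, a count of this kind should come out one lower than `linecount`, since the header is no longer included.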

Now, let's read the dataset into a list of dictionaries, one per row, for easier processing.

In [19]:
lcwa_powerpoint_info = list()

with open('lcwa_gov_powerpoint_metadata.csv', 'r', newline='', encoding='utf-16') as data:
    lcwa_powerpoint_data = csv.DictReader(data, dialect='excel-tab', delimiter='\t')
    for line in lcwa_powerpoint_data:
        lcwa_powerpoint_info.append(line)

In [21]:
lcwa_powerpoint_info[:2]

[OrderedDict([('urlkey', 'ada.ky.gov/adaemploymenttaxincentives.ppt'),
              ('timestamp', '20100331231723'),
              ('original', 'http://ada.ky.gov/ADAEMPLOYMENTTaxIncentives.ppt'),
              ('mimetype', 'application/vnd.ms-powerpoint'),
              ('digest', '4Y6Z6KRRHS3TIVQ2ZJVYVW7RSGXOBS6Y'),
              ('title',
               '   Human Development Institute Promoting Independence, Productivity, and Integration for All People _________________________________   '),
              ('company', '-'),
              ('creation_date', '1998-12-17T14:31:23Z'),
              ('last_modified', '2005-03-25T15:56:39Z'),
              ('revision_number', '47'),
              ('slide_count', '31'),
              ('word_count', '781'),
              ('file_size', '137216'),
              ('sha256',
               '9b6b74447622ac1646a538e63e452569e111cf775d1244373678d947405e33be'),
              ('sha512',
               '9567992755c0b1b38fdccb4440a4bb4b1758301b5cd1978b9
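
Every value that `csv.DictReader` returns is a string, including counts such as `slide_count`, and some fields hold a `-` placeholder instead of a value. The sketch below shows one possible way to convert the numeric fields. The sample rows are a hypothetical miniature of the real records, and `to_int_or_none`, `numeric_fields`, and `typed_rows` are illustrative names, not part of the original notebook.

```python
# Hypothetical rows shaped like the DictReader output (all values are strings).
sample_rows = [
    {"company": "-", "slide_count": "31", "word_count": "781"},
    {"company": "CDC", "slide_count": "5", "word_count": "-"},
]


def to_int_or_none(value):
    """Return int(value), or None when the field is empty or the '-' placeholder."""
    value = value.strip()
    if value in ("", "-"):
        return None
    return int(value)


numeric_fields = ("slide_count", "word_count")

# Convert only the numeric fields; leave every other field as its original string.
typed_rows = [
    {key: (to_int_or_none(val) if key in numeric_fields else val)
     for key, val in row.items()}
    for row in sample_rows
]

print(typed_rows)
```

Mapping the placeholder to `None` rather than `0` keeps "unknown" distinct from a genuine zero, which matters if the counts are later averaged.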

In [37]:
# The keys of the first record are the column headers.
pptx_headers = list(lcwa_powerpoint_info[0].keys())

In [38]:
pptx_headers

['urlkey',
 'timestamp',
 'original',
 'mimetype',
 'digest',
 'title',
 'company',
 'creation_date',
 'last_modified',
 'revision_number',
 'slide_count',
 'word_count',
 'file_size',
 'sha256',
 'sha512']

In [43]:
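creators = dict()

for ppt in lcwa_powerpoint_info:
    # The source cell breaks off after the `if` line; the rest is an assumed completion.
    if ppt['company'] in creators:
        creators[ppt['company']] += 1
    else:
        creators[ppt['company']] = 1
    print(ppt['company'])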

-
Lister Staveley-Smith
State of Delaware
SC Office of Human Resources
Fermilab PPD
CERN
HRSA
CDC
NCHSTP
UC Davis
ICPSR

Southface Energy Institute

Accenture
State of Washington, Department of Personnel
-
GSA
Virginia Beach Public Schools
NIH
DOT
Fire Safety Section, FAA
US Treasury - FMS
Low + Associates
-
GSA
GSA
Center for Health Statistics
USDA - NRCS
DoED
U.S. Department of Education
Fermilab
VITA
-
-
DOER
Bull Services
St. John Evangelist
Department of Communications
Home
LDCT
-
-
USDA
Tippecanoe County
<DIT>
USDA
SLAC
NASA/GSFC
LAP-AUTH
-
California Army National Guard
Lockheed Martin Information Technology
-
CDC
Department of Consumer Affair
State of California
-
GSFC/NASA
Hooper Graphic Design/Perfect Mix
Dell Computer Corporation
Department of Personnel
sde
Hewlett-Packard
CA Energy Commission
US FDA
CoosWatershed
-
DOH
LMIT-ODIN
University of Georgia
USGS/EROS Data Center
USRA
	閠]狴逄嬘뿿��
UNC-CH
Library of Congress
loc
-
NAPHSIS
Terberg Design
-
USDA/ARS NSTL
Emory University
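
The printed list repeats some names (for example `GSA`) and contains many `-` placeholders, so a tally is easier to read than the raw list. A minimal sketch using `collections.Counter` follows. The `sample_companies` list is a hypothetical handful of values of the kind printed above, not the full column.

```python
from collections import Counter

# Hypothetical sample of 'company' values; the real list has 1,000 entries.
sample_companies = ["-", "GSA", "CDC", "GSA", "-", "USDA", "GSA", "-", "CDC"]

# Drop the '-' placeholder and empty strings, then count what remains.
company_counts = Counter(
    name for name in sample_companies if name.strip() not in ("", "-")
)

# Show the most frequent names first.
for name, count in company_counts.most_common(3):
    print(name, count)
```

With the real data, `sample_companies` could be replaced by the `company` value of each record in `lcwa_powerpoint_info`.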

## Using pandas

In [24]:
ppt_df = pd.DataFrame.from_records(lcwa_powerpoint_info)

In [25]:
ppt_df.head()

Unnamed: 0,urlkey,timestamp,original,mimetype,digest,title,company,creation_date,last_modified,revision_number,slide_count,word_count,file_size,sha256,sha512
0,ada.ky.gov/adaemploymenttaxincentives.ppt,20100331231723,http://ada.ky.gov/ADAEMPLOYMENTTaxIncentives.ppt,application/vnd.ms-powerpoint,4Y6Z6KRRHS3TIVQ2ZJVYVW7RSGXOBS6Y,Human Development Institute Promoting Indep...,-,1998-12-17T14:31:23Z,2005-03-25T15:56:39Z,47,31,781,137216,9b6b74447622ac1646a538e63e452569e111cf775d1244...,9567992755c0b1b38fdccb4440a4bb4b1758301b5cd197...
1,astrophysics.gsfc.nasa.gov/conferences/superno...,20090419154246,http://astrophysics.gsfc.nasa.gov/conferences/...,application/vnd.ms-powerpoint,7EZSHVTKZE2FXP5SXI6DO2RJPIDLMXOP,xNTD and the Pathway to SKA,Lister Staveley-Smith,-,2007-02-20T13:41:45Z,51,17,553,1580032,34c77d3dc3cd3dc51d084584cafedf24189a42bf6bf6d3...,c36c59e694aad68c7b5fd4b32383e857520778131d9010...
2,awm.delaware.gov/sitecollectiondocuments/awm+g...,20100220062758,http://www.awm.delaware.gov/SiteCollectionDocu...,application/vnd.ms-powerpoint,ZHT4MJOX7D7OIRMCBPX4PMWZ4OQ7XSZ2,VSM Data Metrics,State of Delaware,2006-10-16T16:19:52Z,2006-10-26T19:37:20Z,4,1,2,136704,1db9d640446ab52654f891458a30c6219e0125915f61c4...,caa579649ea4a0a3126ab9842fe11e6e548695132d0da1...
3,bcbintranet.sc.gov/ohr/hr-advisory/workforcepl...,20090416103442,http://www.bcbintranet.sc.gov/OHR/hr-advisory/...,application/vnd.ms-powerpoint,KDBPF6PJCBK2BEIYC2L5I33THZMINFML,Why Workforce Planning is Important during Har...,SC Office of Human Resources,2009-01-07T14:28:04Z,2009-01-09T16:31:58Z,7,9,238,53760,758efaac57fdf38d82bf7271f9cd5d1f083f842d884bff...,19449593a8c1defae33567966314bfe72d822b387eac6b...
4,beamdocs.fnal.gov/ad/docdb/0010/001026/003/040...,20151115084934,http://beamdocs.fnal.gov/AD/DocDB/0010/001026/...,application/vnd.ms-powerpoint,OJOHGFEE4EI44UNSXLNV4Q5IKJ2L4H2Y,PowerPoint Presentation,Fermilab PPD,2003-06-30T15:15:33Z,2004-02-22T22:33:55Z,122,45,1638,8227840,3aeff996f66ee7f51e045ede1f5d21ab57639928b35804...,c6124015a81adf4d41e563b0a6460fd20275744f12b9b8...
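
Because the DataFrame is built from `csv.DictReader` records, every column holds strings. The sketch below shows one way the count and date columns might be converted before any arithmetic. `sample_df` is a hypothetical two-row miniature of `ppt_df`, and the choice of columns to convert is an assumption.

```python
import pandas as pd

# Hypothetical miniature of ppt_df; every value starts out as a string.
sample_df = pd.DataFrame.from_records([
    {"company": "State of Delaware", "slide_count": "1",
     "word_count": "2", "last_modified": "2006-10-26T19:37:20Z"},
    {"company": "CDC", "slide_count": "5",
     "word_count": "-", "last_modified": "1998-10-28T00:00:02Z"},
])

for column in ("slide_count", "word_count"):
    # errors="coerce" turns unparseable values such as '-' into NaN.
    sample_df[column] = pd.to_numeric(sample_df[column], errors="coerce")

# Parse the ISO 8601 timestamps; unparseable values become NaT.
sample_df["last_modified"] = pd.to_datetime(
    sample_df["last_modified"], errors="coerce", utc=True
)

print(sample_df.dtypes)
```

Coercing rather than raising means a single `-` in a column does not stop the conversion, at the cost of silently producing missing values that should be checked afterwards.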


In [45]:
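# Group the rows by the value of the 'company' field.
creators = ppt_df.groupby('company')

# On a GroupBy object, head() returns the first 5 rows of each group, not 5 rows overall.
creators.head()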

Unnamed: 0,urlkey,timestamp,original,mimetype,digest,title,company,creation_date,last_modified,revision_number,slide_count,word_count,file_size,sha256,sha512
0,ada.ky.gov/adaemploymenttaxincentives.ppt,20100331231723,http://ada.ky.gov/ADAEMPLOYMENTTaxIncentives.ppt,application/vnd.ms-powerpoint,4Y6Z6KRRHS3TIVQ2ZJVYVW7RSGXOBS6Y,Human Development Institute Promoting Indep...,-,1998-12-17T14:31:23Z,2005-03-25T15:56:39Z,47,31,781,137216,9b6b74447622ac1646a538e63e452569e111cf775d1244...,9567992755c0b1b38fdccb4440a4bb4b1758301b5cd197...
1,astrophysics.gsfc.nasa.gov/conferences/superno...,20090419154246,http://astrophysics.gsfc.nasa.gov/conferences/...,application/vnd.ms-powerpoint,7EZSHVTKZE2FXP5SXI6DO2RJPIDLMXOP,xNTD and the Pathway to SKA,Lister Staveley-Smith,-,2007-02-20T13:41:45Z,51,17,553,1580032,34c77d3dc3cd3dc51d084584cafedf24189a42bf6bf6d3...,c36c59e694aad68c7b5fd4b32383e857520778131d9010...
2,awm.delaware.gov/sitecollectiondocuments/awm+g...,20100220062758,http://www.awm.delaware.gov/SiteCollectionDocu...,application/vnd.ms-powerpoint,ZHT4MJOX7D7OIRMCBPX4PMWZ4OQ7XSZ2,VSM Data Metrics,State of Delaware,2006-10-16T16:19:52Z,2006-10-26T19:37:20Z,4,1,2,136704,1db9d640446ab52654f891458a30c6219e0125915f61c4...,caa579649ea4a0a3126ab9842fe11e6e548695132d0da1...
3,bcbintranet.sc.gov/ohr/hr-advisory/workforcepl...,20090416103442,http://www.bcbintranet.sc.gov/OHR/hr-advisory/...,application/vnd.ms-powerpoint,KDBPF6PJCBK2BEIYC2L5I33THZMINFML,Why Workforce Planning is Important during Har...,SC Office of Human Resources,2009-01-07T14:28:04Z,2009-01-09T16:31:58Z,7,9,238,53760,758efaac57fdf38d82bf7271f9cd5d1f083f842d884bff...,19449593a8c1defae33567966314bfe72d822b387eac6b...
4,beamdocs.fnal.gov/ad/docdb/0010/001026/003/040...,20151115084934,http://beamdocs.fnal.gov/AD/DocDB/0010/001026/...,application/vnd.ms-powerpoint,OJOHGFEE4EI44UNSXLNV4Q5IKJ2L4H2Y,PowerPoint Presentation,Fermilab PPD,2003-06-30T15:15:33Z,2004-02-22T22:33:55Z,122,45,1638,8227840,3aeff996f66ee7f51e045ede1f5d21ab57639928b35804...,c6124015a81adf4d41e563b0a6460fd20275744f12b9b8...
5,beamdocs.fnal.gov/ad/docdb/0022/002206/001/shi...,20151114112127,http://beamdocs.fnal.gov/AD/DocDB/0022/002206/...,application/vnd.ms-powerpoint,H2HSFNFUJJON3KXETDSZRA3CVN2FXZQY,Betabeam Design Study in Eurisol,CERN,2002-06-20T12:58:39Z,2006-03-14T21:03:05Z,324,23,921,2650624,9b5fb6c819fa283e92fea7275a9f5239ce273b77557952...,efab6351855df7f42e1e8a45d1701273088fb0a851cab9...
6,bhpr.hrsa.gov/grants00/ta-2000.pps,20010429074551,http://www.bhpr.hrsa.gov:80/grants00/ta-2000.pps,application/vnd.ms-powerpoint,NHD45RERHGZMBYLD6GIAV5V75C7YAPML,No Slide Title,HRSA,1999-04-13T18:01:48Z,1999-11-22T14:07:40Z,105,28,1309,203264,1fdd82d9e1129ae39f5ea2f65e8c9d5652beeac2a1cc5a...,eaf6c146b7db0ea8052e4fe3f82df7eec27293b72fecd1...
7,cdc.gov/nchstp/dstd/stats_trends/chlamydi.ppt,20060828091122,http://www.cdc.gov/nchstp/dstd/Stats_Trends/ch...,application/vnd.ms-powerpoint,S3PMUWQ5M46L2SII66LHHYCXIWBOEAYZ,No Slide Title,CDC,1998-10-27T23:41:56Z,1998-10-28T00:00:02Z,1,5,-,839168,21558112991792446d0a5a5a1107cbbbd797028e4819b8...,11fbed54362d9f3c5946448d96b55823992e3d80382804...
8,cdc.gov/tb/programs/evaluation/guide/webinar/u...,20091113094734,http://cdc.gov/tb/programs/Evaluation/Guide/We...,application/vnd.ms-powerpoint,2L5JFJNVCBULD4V5UFNMPUKWM3BYSSYC,Step 4. Gather Credible Evidence,NCHSTP,2005-11-18T18:49:07Z,2005-12-13T20:26:13Z,183,35,1364,747008,e10d967c58afc8b380172245c057c2ba45a28bc9327fdb...,d1c0f9bfcb504367d426099249eb6d40bd1996ea6a35db...
9,cdph.ca.gov/programs/wicworks/documents/babybe...,20110703051408,http://www.cdph.ca.gov/programs/wicworks/Docum...,application/vnd.ms-powerpoint,FEB4N3DLWJEUJXZSPEABGKDIQP4WWGLI,PowerPoint Handout Part 1,UC Davis,2005-02-02T01:14:58Z,2011-03-01T17:11:06Z,336,83,3343,4776448,e7faa14dcc8cf29e91d2780df7eea3feeaa5db30f35502...,d2f231489dcd8b5384c95ffdfdb8db92bf0cad885e1600...
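
If the goal is a count of files per company rather than the rows themselves, `value_counts()` or `groupby(...).size()` may be a better fit. This is a minimal sketch on a hypothetical `sample_df`; filtering out the `-` placeholder is an assumption about what the analysis wants.

```python
import pandas as pd

# Hypothetical miniature of the 'company' column.
sample_df = pd.DataFrame({"company": ["-", "GSA", "CDC", "GSA", "-", "GSA"]})

# Filter out the '-' placeholder, then count rows per company.
named = sample_df[sample_df["company"] != "-"]
company_counts = named["company"].value_counts()  # sorted, most frequent first
print(company_counts)

# groupby(...).size() gives the same totals, indexed by company name.
group_sizes = named.groupby("company").size()
print(group_sizes)
```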
