# Python Open Labs: Reading, exploring, and writing data with Pandas

## Setup
With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

## Instructors
- Walt Gurley
- Claire Cahoon

## Open Labs agenda

1.   **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2.   **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

Breakout rooms will be available if you would like to work in small groups. If you have trouble joining a room, ask in the chat to be moved into a room.

## Learning objectives

By the end of our workshop today, we hope you'll understand what the pandas library is and be able to work with pandas data structures like a `Series` and a `DataFrame`.

## Today's Topics
- What is pandas, and how does it relate to Python?
- Importing and using pandas
- How to read data into pandas
- Common pandas data structures (`Series` and `DataFrame`)
- Referencing data in a `DataFrame`
- How to write data from pandas

## Questions during the workshop

Please feel free to ask questions throughout the workshop.

We have a second instructor who will be available during the workshop. They will answer questions as they are able, and will collect questions whose answers might help everyone so they can be addressed at the end of the workshop.

The open lab time is when you will be able to ask more questions and work together on the exercises.

### Using Jupyter Notebooks and Google Colaboratory

Jupyter notebooks let you write and run Python code interactively. They're quickly becoming a standard way to combine data, code, and written explanations or visualizations in a single, shareable document. There are a lot of ways to run Jupyter notebooks, including locally on your own computer, but we've decided to use Google's Colaboratory notebook platform for this workshop. Colaboratory is “a Google research project created to help disseminate machine learning education and research.” If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to [get help from our graduate consultants](https://www.lib.ncsu.edu/dxl) or [schedule an appointment with Libraries staff](https://go.ncsu.edu/dvs-request).

## Guided Instruction
This week we're introducing the Pandas library for Python and working on importing, viewing, and referencing the data.

In this section, we will work through examples using data from the [Museum of Modern Art (MoMA) research dataset](https://github.com/MuseumofModernArt/collection) containing records of all of the works that have been cataloged in the database of the MoMA collection.

> "The Museum’s website features 89,695 artworks from 26,494 artists. This research dataset contains 138,151 records, representing all of the works that have been accessioned into MoMA’s collection and cataloged in our database. It includes basic metadata for each work, including title, artist, date made, medium, dimensions, and date acquired by the Museum. Some of these records have incomplete information and are noted as “not Curator Approved.”" - [MoMA GitHub repository for collection data](https://github.com/MuseumofModernArt/collection)

### What is a Python library?

A "library" in this context is a package of code that adds to the functionality of Python. Base Python offers a lot of features, but not everything -- Python libraries can be imported at the beginning of your code and then used for your specific purpose.

For example, you may import Matplotlib to create graphs and plots, or Natural Language Toolkit (NLTK) to do natural language processing. Today we will be using the pandas library to manipulate a dataset.
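
As a minimal sketch of what importing looks like, the cell below uses `statistics`, a module from Python's standard library chosen here only because it needs no installation. The alias `stats` and the list of heights are illustrative assumptions, not part of the workshop data. The same `import ... as ...` pattern is used for pandas later on.

```python
# Import a library and give it a shorter alias
import statistics as stats

# Illustrative values only (not read from the workshop dataset)
heights_cm = [48.6, 40.6, 34.3, 50.8, 38.4]

# Functions from the library are called through the alias
mean_height = stats.mean(heights_cm)
print(mean_height)
```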

### What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney. The name comes from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake Vanderplas’ book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do):

> As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

#### What does Pandas do?
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data
* Visualization with help from Matplotlib

We can learn more about Pandas by using the help window in Google Colab.

In [None]:
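# Type the name of an object with a question mark afterwards and run the code to pull up a help window.
# pandas has to be imported before its help can be looked up (importing is covered in the next section).
# Here we will find out more about Pandas
import pandas as pd
pd?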

### Importing a Python library

To use any library, we must import it into our Python document.

In [2]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd
pd?

### Importing files into Pandas
For this workshop we will use the MoMA collection data introduced above. We will import the dataset into our notebook to use it for data analysis.

Datasets can be stored in several types of files, including .csv, .json, .txt, .xls, .xlsx, and more. Here we will import a .csv file and a .json file.

TODO - fix these preview links

- [Preview the CSV file](https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv)
- [Preview the JSON file](https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2010-2019.json)

CSV Files

A comma separated values (CSV) file is a plain text file containing data separated by commas.

In [5]:
# Import a comma-separated values (csv) file as a DataFrame

# The file location
new_csv_file_url = 'https://media.githubusercontent.com/media/MuseumofModernArt/collection/master/Artworks.csv'

# Read in the file and display the first five rows of the DataFrame
art_csv = pd.read_csv(new_csv_file_url)
art_csv.head()

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.8,,,50.8,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.4,,,19.1,,
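
To see how the structure of a CSV file maps onto a DataFrame without downloading anything, here is a small sketch that reads CSV text held in a string. The titles, artists, and years are made up for illustration, and `io.StringIO` is used only so that pandas can treat the string like a file.

```python
import io
import pandas as pd

# A tiny, made-up CSV: the first line holds the column names,
# and each following line is one row of comma-separated values.
# A value that contains a comma is wrapped in double quotes.
csv_text = """Title,Artist,Year
Sample Work A,Artist One,1896
Sample Work B,Artist Two,1987
"Sample Work C, Part 2",Artist One,1903
"""

# io.StringIO lets pandas treat the string like a file
sample_csv = pd.read_csv(io.StringIO(csv_text))
print(sample_csv)
print(sample_csv.shape)
```

The quoted title in the last row stays in one column even though it contains a comma.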


JSON Files

JSON (JavaScript Object Notation) is a data storage format that uses name/value pairs to create objects and associative arrays. Learn more about [JSON file structure and syntax from W3Schools](https://www.w3schools.com/js/js_json_syntax.asp).

In [10]:
# Importing a JavaScript object notation (JSON) file

# The file location
json_file_url = ''
# TODO - find a JSON link

# Read in the file and print out the DataFrame
art_json = pd.read_json(json_file_url)
art_json.head()

ValueError: Expected object or value
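
Because the file location above is still empty, the following sketch uses a short, made-up JSON string instead, to show the kind of input `pd.read_json` can turn into a DataFrame. The records and the variable names are assumptions for illustration only.

```python
import io
import pandas as pd

# A tiny, made-up JSON document: a list of objects,
# where each object is one record made of name/value pairs
json_text = """[
  {"Title": "Sample Work A", "Artist": "Artist One", "Year": 1896},
  {"Title": "Sample Work B", "Artist": "Artist Two", "Year": 1987}
]"""

# Each object becomes a row, and each name becomes a column
sample_json = pd.read_json(io.StringIO(json_text))
print(sample_json)
```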

### Pandas data structures

Pandas uses two main data structures: `Series` and `DataFrame`.

<img src="https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Manipulation_with_Python/assets/nc_dataframes.png" alt="DataFrames are composed of Series" width="80%">

#### `Series`
A `Series` is a one-dimensional array of indexed data, or a single column of data. It can be thought of as a specialized dictionary or a generalized NumPy array. You can learn more about the Series data type in the [Pandas documentation for Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).

#### `DataFrame`
A `DataFrame` is a two-dimensional array composed of one or more `Series`, similar to tabular data (think of an Excel spreadsheet). It has an `Index` of row labels (a default integer index is used when none is specified) and flexible column names.

It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data. You can learn more about the DataFrame data type in the [Pandas documentation for DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

A `DataFrame` is made up of `Series` in much the same way that a table is made up of columns. The main restriction is that each column holds a single data type, although different columns can have different data types. Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
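
The relationship between the two structures can be sketched with a small DataFrame built from a dictionary. The column names and values below are made up for illustration and are not taken from the workshop dataset.

```python
import pandas as pd

# Build a small DataFrame from a dictionary:
# each key is a column label, each list is that column's data
works = pd.DataFrame({
    'Title': ['Sample Work A', 'Sample Work B', 'Sample Work C'],
    'Year': [1896, 1987, 1903],
    'Height (cm)': [48.6, 40.6, 34.3],
})

# Each column has its own data type
print(works.dtypes)

# Selecting one column returns a Series that shares the DataFrame's index
years = works['Year']
print(type(years))
print(years.max())
```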

In [11]:
# The csv file we imported earlier was stored in a DataFrame.
# Let's look at that data:
art_csv

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.600000,,,168.900000,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.640100,,,29.845100,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.300000,,,31.800000,,
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)",(),(1944),(0),(Male),1980,Photographic reproduction with colored synthet...,...,http://www.moma.org/media/W1siZiIsIjEyNCJdLFsi...,,,,50.800000,,,50.800000,,
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, color pencil, ink, and gouache on tr...",...,http://www.moma.org/media/W1siZiIsIjEyNiJdLFsi...,,,,38.400000,,,19.100000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
138146,Untitled,"Chesnutt Brothers Studio, Andrew Chesnutt, Lew...","133005, 133006, 133007","(American, 1861–1934) (American, 1860–1933)",() (American) (American),(0) (1861) (1860),(0) (1934) (1933),() (Male) (Male),c. 1890,Gelatin silver print,...,http://www.moma.org/media/W1siZiIsIjQ5MjcyMiJd...,,,,10.795022,,,16.510033,,
138147,Plate (folio 2 verso) from Muscheln und schirm...,Sophie Taeuber-Arp,5777,"(Swiss, 1889–1943)",(Swiss),(1889),(1943),(Female),1939,One from an illustrated book with four line bl...,...,http://www.moma.org/media/W1siZiIsIjQ4NTExNSJd...,,,,13.500000,,,10.000000,,
138148,Plate (folio 6) from Muscheln und schirme (She...,Sophie Taeuber-Arp,5777,"(Swiss, 1889–1943)",(Swiss),(1889),(1943),(Female),1939,One from an illustrated book with four line bl...,...,http://www.moma.org/media/W1siZiIsIjQ4NTExOCJd...,,,,13.500000,,,10.000000,,
138149,Plate (folio 12) from Muscheln und schirme (Sh...,Sophie Taeuber-Arp,5777,"(Swiss, 1889–1943)",(Swiss),(1889),(1943),(Female),1939,One from an illustrated book with four line bl...,...,http://www.moma.org/media/W1siZiIsIjQ4NTEyMCJd...,,,,11.000000,,,10.000000,,


In [12]:
# You can also view the "shape" of the DataFrame
# This tells you how many rows and columns there are
art_csv.shape

(138151, 29)

In [17]:
# A Series is a one-dimensional array, or one column of data
# When we take one column of a DataFrame, it is represented as a Series
artist = art_csv['Artist']
type(artist)

pandas.core.series.Series

In [18]:
# Now that we have created a Series, let's look at the data:
artist

0                                               Otto Wagner
1                                  Christian de Portzamparc
2                                                Emil Hoppe
3                                           Bernard Tschumi
4                                                Emil Hoppe
                                ...                        
138146    Chesnutt Brothers Studio, Andrew Chesnutt, Lew...
138147                                   Sophie Taeuber-Arp
138148                                   Sophie Taeuber-Arp
138149                                   Sophie Taeuber-Arp
138150                                   Sophie Taeuber-Arp
Name: Artist, Length: 138151, dtype: object

In [19]:
# You can also see the shape of a Series
# Since a Series only has one column, it will tell you how many rows there are
artist.shape

(138151,)

In [20]:
# You can convert a Series to a list with to_list()
artist.to_list()

['Otto Wagner',
 'Christian de Portzamparc',
 'Emil Hoppe',
 'Bernard Tschumi',
 'Emil Hoppe',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Louis I. Kahn',
 'Bernard Tschumi',
 'Marcel Kammerer',
 'Bernard Tschumi',
 'Otto Schönthal',
 'Bernard Tschumi',
 'Otto Schönthal',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard Tschumi',
 'Bernard
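
Converting a long column to a list prints every value, which can flood the notebook. One way to keep the output manageable is to limit the `Series` with `head()` before converting it. This is a sketch with a made-up `Series` standing in for the `Artist` column.

```python
import pandas as pd

# A small made-up Series standing in for a long column
names = pd.Series(['Artist One', 'Artist Two', 'Artist One', 'Artist Three'],
                  name='Artist')

# Limit the Series first, then convert, to keep the output short
first_two = names.head(2).to_list()
print(first_two)
print(type(first_two))
```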

### Exploring your data

Now that we have our data, we can use Pandas to explore our data for analysis. This can be useful if you are new to a dataset to see what's there and how you should start analyzing.

#### View DataFrame column labels

Our DataFrame has 29 columns. We can quickly view the label names for each column using the DataFrame `columns` property.

In [21]:
# View column labels (headers)
art_csv.columns

Index(['Title', 'Artist', 'ConstituentID', 'ArtistBio', 'Nationality',
       'BeginDate', 'EndDate', 'Gender', 'Date', 'Medium', 'Dimensions',
       'CreditLine', 'AccessionNumber', 'Classification', 'Department',
       'DateAcquired', 'Cataloged', 'ObjectID', 'URL', 'ThumbnailURL',
       'Circumference (cm)', 'Depth (cm)', 'Diameter (cm)', 'Height (cm)',
       'Length (cm)', 'Weight (kg)', 'Width (cm)', 'Seat Height (cm)',
       'Duration (sec.)'],
      dtype='object')

#### View summaries of a DataFrame

We can quickly generate summaries of our DataFrame to observe some basic statistics and information such as column data types and non-null value counts.

In [22]:
# Get summary statistics of DataFrame columns using "describe()" (by default,
# only columns with numerical data types are included)
art_csv.describe()

Unnamed: 0,ObjectID,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
count,138151.0,10.0,13839.0,1462.0,120355.0,742.0,290.0,119434.0,0.0,2140.0
mean,97170.256618,44.86802,16.353863,23.094845,37.456124,89.687579,1283.674965,37.973398,,6156.488
std,81950.72057,28.631604,54.49596,44.626483,49.604159,329.428165,12017.50424,67.277097,,137125.0
min,2.0,9.9,0.0,0.635,0.0,0.0,0.09,0.0,,0.0
25%,36671.5,23.5,0.0,7.7788,17.938786,17.1,5.7267,17.5,,120.0
50%,73896.0,36.0,0.317501,13.6525,27.8,26.7,20.1851,25.400051,,433.0
75%,141636.5,71.125,9.84251,24.98095,43.9,79.7,77.6785,44.2,,1620.0
max,419289.0,83.8,1808.483617,914.4,9140.0,8321.0566,185067.585957,9144.0,,6283065.0


In [26]:
# Get summary statistics of single column using "describe()"
art_csv['Artist'].describe()

count           136868
unique           13684
top       Eugène Atget
freq              5050
Name: Artist, dtype: object
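
The two outputs above differ because `describe()` reports different statistics for numeric and text columns. As a hedged sketch on made-up data, the `include='all'` option summarizes both kinds of column in one table.

```python
import pandas as pd

# Made-up data with one text column (including a missing value)
# and one numeric column
works = pd.DataFrame({
    'Artist': ['Artist One', 'Artist Two', 'Artist One', None],
    'Height (cm)': [48.6, 40.6, 34.3, 50.8],
})

# Default: only the numeric column is summarized
numeric_summary = works.describe()
print(numeric_summary)

# include='all' summarizes every column; statistics that do not
# apply to a column are shown as NaN
full_summary = works.describe(include='all')
print(full_summary)
```

Note that `count` only counts non-missing values, which is why the `Artist` count in the workshop data is lower than the number of rows.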

In [27]:
# Summarize column data types, non-null values, and memory usage using "info()"
art_csv.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 138151 entries, 0 to 138150
Data columns (total 29 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   Title               138112 non-null  object 
 1   Artist              136868 non-null  object 
 2   ConstituentID       136868 non-null  object 
 3   ArtistBio           132226 non-null  object 
 4   Nationality         136868 non-null  object 
 5   BeginDate           136868 non-null  object 
 6   EndDate             136868 non-null  object 
 7   Gender              136868 non-null  object 
 8   Date                135949 non-null  object 
 9   Medium              128450 non-null  object 
 10  Dimensions          128397 non-null  object 
 11  CreditLine          135714 non-null  object 
 12  AccessionNumber     138151 non-null  object 
 13  Classification      138151 non-null  object 
 14  Department          138151 non-null  object 
 15  DateAcquired        131026 non-nul

#### Referencing and indexing a DataFrame

Referencing Rows

In [33]:
# Reference a row by index label
# Returns a Series

# Access first row of art_csv by index label
# In this case the index label is 0
art_csv.loc[0]

# Access first row of wl_strikes_json by index label
# In this case the index label is not 0
# wl_strikes_json.loc[0]
# wl_strikes_json.loc[1080125]
# TODO - add JSON example here

Title                 Ferdinandsbrücke Project, Vienna, Austria (Ele...
Artist                                                      Otto Wagner
ConstituentID                                                      6210
ArtistBio                                         (Austrian, 1841–1918)
Nationality                                                  (Austrian)
BeginDate                                                        (1841)
EndDate                                                          (1918)
Gender                                                           (Male)
Date                                                               1896
Medium                    Ink and cut-and-pasted painted pages on paper
Dimensions                           19 1/8 x 66 1/2" (48.6 x 168.9 cm)
CreditLine            Fractional and promised gift of Jo Carole and ...
AccessionNumber                                                885.1996
Classification                                             Archi

In [34]:
# Reference multiple rows by index label (in this case the index labels 0 through 2, stop label included)
# Returns a DataFrame
art_csv.loc[0:2]

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,


In [35]:
# Reference a row or multiple rows by zero-based integer position

# Access first row of art_csv by row integer value
# In this case the row is row 0
art_csv.iloc[0]

# Access first row of wl_strikes_json by row integer value
# In this case the row is also row 0
# wl_strikes_json.iloc[0]

# TODO - add JSON example

Title                 Ferdinandsbrücke Project, Vienna, Austria (Ele...
Artist                                                      Otto Wagner
ConstituentID                                                      6210
ArtistBio                                         (Austrian, 1841–1918)
Nationality                                                  (Austrian)
BeginDate                                                        (1841)
EndDate                                                          (1918)
Gender                                                           (Male)
Date                                                               1896
Medium                    Ink and cut-and-pasted painted pages on paper
Dimensions                           19 1/8 x 66 1/2" (48.6 x 168.9 cm)
CreditLine            Fractional and promised gift of Jo Carole and ...
AccessionNumber                                                885.1996
Classification                                             Archi

In [36]:
# Reference multiple rows by row number (in this case rows 0 through 2)
# Note that this time the range doesn't include the stop number
art_csv.iloc[0:3]

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio,Nationality,BeginDate,EndDate,Gender,Date,Medium,...,ThumbnailURL,Circumference (cm),Depth (cm),Diameter (cm),Height (cm),Length (cm),Weight (kg),Width (cm),Seat Height (cm),Duration (sec.)
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)",(Austrian),(1841),(1918),(Male),1896,Ink and cut-and-pasted painted pages on paper,...,http://www.moma.org/media/W1siZiIsIjU5NDA1Il0s...,,,,48.6,,,168.9,,
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)",(French),(1944),(0),(Male),1987,Paint and colored pencil on print,...,http://www.moma.org/media/W1siZiIsIjk3Il0sWyJw...,,,,40.6401,,,29.8451,,
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)",(Austrian),(1876),(1957),(Male),1903,"Graphite, pen, color pencil, ink, and gouache ...",...,http://www.moma.org/media/W1siZiIsIjk4Il0sWyJw...,,,,34.3,,,31.8,,
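
In `art_csv` the index labels happen to match the row positions, so `.loc` and `.iloc` look almost identical. The difference is easier to see in a sketch where the labels do not start at 0. The data and index labels below are made up for illustration.

```python
import pandas as pd

# Made-up data with index labels that do not start at 0
works = pd.DataFrame(
    {'Title': ['Work A', 'Work B', 'Work C', 'Work D']},
    index=[10, 20, 30, 40],
)

# .loc uses index labels, and the stop label is included
by_label = works.loc[10:30]

# .iloc uses zero-based positions, and the stop position is excluded
by_position = works.iloc[0:3]

print(by_label)
print(by_position)
```

Both selections return the same three rows. In this sketch `works.loc[0]` would raise a `KeyError`, because no row is labeled 0, while `works.iloc[0]` still returns the first row.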


Referencing Columns

In [37]:
# Referencing a column by column label (in this case, "ArtistBio")
art_csv['ArtistBio']

0                               (Austrian, 1841–1918)
1                                 (French, born 1944)
2                               (Austrian, 1876–1957)
3           (French and Swiss, born Switzerland 1944)
4                               (Austrian, 1876–1957)
                             ...                     
138146    (American, 1861–1934) (American, 1860–1933)
138147                             (Swiss, 1889–1943)
138148                             (Swiss, 1889–1943)
138149                             (Swiss, 1889–1943)
138150                             (Swiss, 1889–1943)
Name: ArtistBio, Length: 138151, dtype: object

In [39]:
# Referencing multiple columns by a list of column labels 
# (in this case, the columns "Artist" and "ArtistBio")
art_csv[['Artist', 'ArtistBio']]

Unnamed: 0,Artist,ArtistBio
0,Otto Wagner,"(Austrian, 1841–1918)"
1,Christian de Portzamparc,"(French, born 1944)"
2,Emil Hoppe,"(Austrian, 1876–1957)"
3,Bernard Tschumi,"(French and Swiss, born Switzerland 1944)"
4,Emil Hoppe,"(Austrian, 1876–1957)"
...,...,...
138146,"Chesnutt Brothers Studio, Andrew Chesnutt, Lew...","(American, 1861–1934) (American, 1860–1933)"
138147,Sophie Taeuber-Arp,"(Swiss, 1889–1943)"
138148,Sophie Taeuber-Arp,"(Swiss, 1889–1943)"
138149,Sophie Taeuber-Arp,"(Swiss, 1889–1943)"


Referencing both rows and columns

In [42]:
# Referencing a subset of rows and columns using index and column labels
# Note that we're using a range of column labels instead of a list
# Make sure that your column range starts with the leftmost label
art_csv.loc[:9, 'Title':'ArtistBio']

Unnamed: 0,Title,Artist,ConstituentID,ArtistBio
0,"Ferdinandsbrücke Project, Vienna, Austria (Ele...",Otto Wagner,6210,"(Austrian, 1841–1918)"
1,"City of Music, National Superior Conservatory ...",Christian de Portzamparc,7470,"(French, born 1944)"
2,"Villa near Vienna Project, Outside Vienna, Aus...",Emil Hoppe,7605,"(Austrian, 1876–1957)"
3,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
4,"Villa, project, outside Vienna, Austria, Exter...",Emil Hoppe,7605,"(Austrian, 1876–1957)"
5,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
6,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
7,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
8,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
9,"The Manhattan Transcripts Project, New York, N...",Bernard Tschumi,7056,"(French and Swiss, born Switzerland 1944)"
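
The same kind of subset can be selected by position with `.iloc`, and a single cell can be reached by giving one row and one column. The sketch below uses a small made-up DataFrame to compare the label-based and position-based forms.

```python
import pandas as pd

# Made-up data for illustration
works = pd.DataFrame({
    'Title': ['Work A', 'Work B', 'Work C'],
    'Artist': ['Artist One', 'Artist Two', 'Artist One'],
    'Year': [1896, 1987, 1903],
})

# Label-based: rows 0 through 1 and columns 'Title' through 'Artist' (both inclusive)
by_label = works.loc[:1, 'Title':'Artist']

# Position-based: the first two rows and the first two columns (stop excluded)
by_position = works.iloc[:2, 0:2]

# A single cell: row label 2, column label 'Year'
year_c = works.loc[2, 'Year']

print(by_label)
print(year_c)
```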


### Writing data to a file

In [None]:
# Save the subset from the previous cell in a variable
first_ten = art_csv.loc[:9, 'Title':'ArtistBio']

# Write to csv
first_ten.to_csv("new_data.csv")

In [None]:
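# Write to an Excel file
# Use the .xlsx extension: recent versions of pandas (2.0 and later) can no longer write the older .xls format
first_ten.to_excel("new_data.xlsx")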

In [None]:
# Write to a JSON file
first_ten.to_json("new_data.json")
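
By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back in. A minimal sketch of a round trip that leaves the index out is shown below. The data, file name, and temporary folder are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Made-up data for illustration
works = pd.DataFrame({
    'Title': ['Work A', 'Work B'],
    'Year': [1896, 1987],
})

# Write to a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'works.csv')

    # index=False leaves the row labels out of the file
    works.to_csv(csv_path, index=False)

    # Read the file back in to check what was written
    round_trip = pd.read_csv(csv_path)

print(round_trip)
```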

----
## Open work time
You can use this time to ask questions, collaborate, or work on the following activities (on your own or in a group). 

### Exercise 1: Read in an Excel file
Take this Excel file, read it into a DataFrame, and print out the first five rows of the DataFrame.



> Hint: the syntax is very similar to reading a .csv file.



Link to the file: https://github.com/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2000-2009.xlsx?raw=true

In [None]:
# Save the url as a variable
xls_file_url = 'https://github.com/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2000-2009.xlsx?raw=true'

# Read the file in
wl_strikes_xls = pd.read_excel(xls_file_url)

# View the file
wl_strikes_xls.head()

Unnamed: 0,INDX_NR,INCIDENT_DATE,INCIDENT_MONTH,INCIDENT_YEAR,TIME,TIME_OF_DAY,AIRPORT_ID,AIRPORT,RUNWAY,STATE,FAAREGION,LOCATION,ENROUTE STATE,OPID,OPERATOR,REG,FLT,AIRCRAFT,AMA,AMO,EMA,EMO,AC_CLASS,AC_MASS,TYPE_ENG,NUM_ENGS,ENG_1_POS,ENG_2_POS,ENG_3_POS,ENG_4_POS,PHASE_OF_FLIGHT,HEIGHT,SPEED,DISTANCE,SKY,PRECIPITATION,AOS,COST_REPAIRS,OTHER_COST,COST_REPAIRS_INFL_ADJ,...,STR_ENG2,DAM_ENG2,STR_ENG3,DAM_ENG3,STR_ENG4,DAM_ENG4,STR_PROP,DAM_PROP,STR_WING_ROT,DAM_WING_ROT,STR_FUSE,DAM_FUSE,STR_LG,DAM_LG,STR_TAIL,DAM_TAIL,STR_LGHTS,DAM_LGHTS,STR_OTHER,DAM_OTHER,OTHER_SPECIFY,EFFECT,EFFECT_OTHER,SPECIES_ID,REMARKS,REMAINS_COLLECTED,REMAINS_SENT,WARNED,BIRDS_SEEN,BIRDS_STRUCK,SIZE,NR_INJURIES,NR_FATALITIES,COMMENT,REPORTER_NAME,REPORTER_TITLE,SOURCE,PERSON,LUPDATE,TRANSFER
0,707074,2009-12-24,12,2009,07:52,Dawn,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36C,NC,ASO,,,JIA,PSA AIRLINES,N218PS,,CRJ100/200,188,10.0,22.0,4.0,A,3.0,D,2.0,5.0,5.0,,,Approach,100.0,138.0,,No Cloud,,,,,,...,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,,,,UNKBS,,False,False,Unknown,2-10,1,Small,,,/Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7,Airport Operations,2010-04-29,False
1,707361,2009-12-13,12,2009,,Day,KILM,WILMINGTON INTL,17,NC,ASO,,,ASH,MESA AIRLINES,,,EMB-145,332,14.0,1.0,10.0,A,3.0,D,2.0,5.0,5.0,,,Climb,,,,Overcast,Rain,,,,,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,PART NOT REPTD,,,UNKBM,UNKNOWN TYPE OF BIRD STRUCK. PILOT REPTD HITTI...,False,False,No,,1,Medium,,,XXXX-XX-XX-XXXXXX /Legacy Record=XXXXXX/,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2010-04-29,False
2,707050,2009-12-11,12,2009,07:26,Day,KILM,WILMINGTON INTL,35,NC,ASO,,,1ASQ,ATLANTIC SOUTHEAST,N683AS?,4939.0,CRJ100/200,188,10.0,22.0,4.0,A,3.0,D,2.0,5.0,5.0,,,Take-off Run,0.0,130.0,0.0,Some Cloud,,,,,,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,,,YL001,WE SAW 2 SML BIRDS AND 2 HIT WINDSHLD. PILOT R...,True,False,No,2-10,2-10,Small,,,SOURCE = TWO XXXX-X (XXXX-XX-XX-XXXXXX & XXXXX...,REDACTED,REDACTED,FAA Form 5200-7-E,Air Transport Operations,2010-04-08,False
3,707146,2009-12-10,12,2009,16:45,Day,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,36C,NC,ASO,,,JIA,PSA AIRLINES,N718PS,215.0,CRJ700,188,16.0,22.0,4.0,A,4.0,D,2.0,5.0,5.0,,,Take-off Run,0.0,,0.0,Some Cloud,,,,,,...,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,,"None, Precautionary Landing",,YL001,ID BY SMITHSONIAN. FAA 3952. DNA.,True,True,Yes,,2-10,Small,,,SOURCE = THREE XXXX-X (XXXX-XX-XX-XXXXXX & RX)...,REDACTED,REDACTED,FAA Form 5200-7-E,Airport Operations,2010-08-19,False
4,707624,2009-12-08,12,2009,07:30,Dawn,KCLT,CHARLOTTE/DOUGLAS INTL ARPT,,NC,ASO,,,ASH,MESA AIRLINES,N935LR,2604.0,CRJ900,188,17.0,22.0,4.0,A,4.0,D,2.0,5.0,5.0,,,Approach,500.0,133.0,,,,,,,,...,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,,,,UNKBS,NO DMG TO A/C.,False,False,Yes,2-10,1,Small,,,/Legacy Record=XXXXXX/,REDACTED,REDACTED,Air Transport Report,Air Transport Operations,2010-04-29,False


### Exercise 2: Indexing cells

Use referencing and indexing to answer the following questions by finding the data in the rows, columns, and/or cells. 



#### 2a. Time of day
Airlines are interested in when they should schedule flights to minimize collisions. What is the time of day for each incident? Create a `Series` of the time of day (`TIME_OF_DAY`).

In [None]:
# 2a. Create a `Series` of the time of day (`TIME_OF_DAY`)
time_of_day = wl_strikes_csv['TIME_OF_DAY']

# Print new Series
time_of_day

0        Day
1        Day
2        Day
3       Dusk
4        NaN
       ...  
664      Day
665      Day
666      Day
667    Night
668     Dusk
Name: TIME_OF_DAY, Length: 669, dtype: object

#### 2b. Date and time
We want to find out when most of these collisions occur. What is the exact date and time of each incident? Print the third, fourth, and fifth columns from the data (`INCIDENT_MONTH`, `INCIDENT_YEAR`, and `TIME`).

In [None]:
# 2b. Print the third, fourth, and fifth columns from the data 
# (`INCIDENT_MONTH`, `INCIDENT_YEAR`, and `TIME`).
wl_strikes_csv[['INCIDENT_MONTH', 'INCIDENT_YEAR', 'TIME']]

Unnamed: 0,INCIDENT_MONTH,INCIDENT_YEAR,TIME
0,12,1999,10:15
1,12,1999,
2,12,1999,07:40
3,12,1999,17:00
4,12,1999,
...,...,...,...
664,6,1990,
665,5,1990,11:35
666,4,1990,
667,3,1990,21:30
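
Because the exercise describes the columns by position as well as by name, here is a minimal sketch comparing the two approaches. The `toy_strikes` DataFrame is hypothetical: its column order is assumed to mirror the first few columns of the workshop data, and the dates are made up.

```python
import pandas as pd

# Toy stand-in for wl_strikes_csv; column order is an assumption, dates are invented
toy_strikes = pd.DataFrame({
    "INDX_NR": [101, 102, 103],
    "INCIDENT_DATE": ["1999-12-30", "1999-12-28", "1999-12-20"],
    "INCIDENT_MONTH": [12, 12, 12],
    "INCIDENT_YEAR": [1999, 1999, 1999],
    "TIME": ["10:15", None, "07:40"],
    "TIME_OF_DAY": ["Day", "Day", "Day"],
})

# Select by label: a list of column names inside the brackets returns a DataFrame
by_label = toy_strikes[["INCIDENT_MONTH", "INCIDENT_YEAR", "TIME"]]

# Select by position: all rows, columns at zero-based positions 2, 3, and 4
by_position = toy_strikes.iloc[:, 2:5]

# Both selections contain the same columns and values
print(by_label.equals(by_position))
```

Selecting by label is usually the safer choice, since it keeps working if the column order changes.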


#### 2c. Access the 126th row

Use row indexing to find the data in the 126th row in the `wl_strikes_json` DataFrame. Check that your result is correct by making sure your `INCIDENT_DATE` value is `2020-07-17`.

> Tip: Remember that the integer-based row location is zero-based, so the 126th row is at position 125.



In [None]:
# 2c. Access the 126th row from the `wl_strikes_json` DataFrame
wl_strikes_json.iloc[125]

INCIDENT_DATE            2020-07-17
INCIDENT_MONTH                    7
INCIDENT_YEAR                  2020
TIME                          06:00
TIME_OF_DAY                     Day
                        ...        
REPORTER_TITLE             REDACTED
SOURCE            FAA Form 5200-7-E
PERSON                        Pilot
LUPDATE                  2020-08-05
TRANSFER                      False
Name: 1008687, Length: 91, dtype: object
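
The `Name: 1008687` line in the output suggests that `wl_strikes_json` is indexed by a record number rather than by 0, 1, 2, and so on. The sketch below, built on a hypothetical `toy_json` DataFrame with mostly invented index labels and values, illustrates the difference between position-based `.iloc` and label-based `.loc` in that situation.

```python
import pandas as pd

# Toy stand-in for wl_strikes_json; index labels and values are illustrative only
toy_json = pd.DataFrame(
    {
        "INCIDENT_DATE": ["2020-07-19", "2020-07-18", "2020-07-17"],
        "TIME_OF_DAY": ["Night", "Dusk", "Day"],
    },
    index=[1008701, 1008695, 1008687],
)

# .iloc uses the zero-based position: the third row is at position 2
third_row = toy_json.iloc[2]

# .loc uses the index label, whatever position that row happens to be in
same_row = toy_json.loc[1008687]

print(third_row.name)               # the index label of the selected row
print(third_row.equals(same_row))   # both calls return the same row
```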

#### 2d. Cloud cover
A particular airline has six flights that they want to compare to see if the cloud cover in the area had anything to do with the collision. Print rows 60-65 and the columns `INDX_NR`, `SKY`, `PHASE_OF_FLIGHT`, and `AIRPORT`.

In [None]:
# 2d. Print rows 60-65 and the columns 'INDX_NR', 'SKY', 'PHASE_OF_FLIGHT', and
# 'AIRPORT'
cloud_cover = wl_strikes_csv.loc[60:65, ['INDX_NR', 'SKY', 'PHASE_OF_FLIGHT', 'AIRPORT']]
cloud_cover

Unnamed: 0,INDX_NR,SKY,PHASE_OF_FLIGHT,AIRPORT
60,631616,Some Cloud,Climb,RALEIGH-DURHAM INTL
61,627351,,Take-off Run,KINSTON REGIONAL JETPORT AT STALLINGS FIELD
62,635160,Some Cloud,Approach,PIEDMONT TRIAD INTL
63,626190,Overcast,Climb,RALEIGH-DURHAM INTL
64,635214,No Cloud,Take-off Run,ALBERT J ELLIS
65,634350,Overcast,Take-off Run,RALEIGH-DURHAM INTL
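
Note that `.loc[60:65]` returns six rows because label-based slices include both endpoints, whereas position-based `.iloc` slices exclude the stop position. A minimal sketch with a made-up DataFrame (`toy` is hypothetical and its values are invented):

```python
import pandas as pd

# Toy DataFrame with a default integer index (0 through 9)
toy = pd.DataFrame({
    "INDX_NR": range(100, 110),
    "SKY": ["No Cloud", "Overcast"] * 5,
})

# Label-based slice: includes labels 3, 4, and 5
by_label = toy.loc[3:5, ["INDX_NR", "SKY"]]

# Position-based slice: includes positions 3 and 4, but not 5
by_position = toy.iloc[3:5]

print(len(by_label), len(by_position))  # 3 2
```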


### Exercise 3: Write to a file
Take your result from Exercise 2d (or another DataFrame you have created) and write it to a .csv file.

In [None]:
# Write to a new .csv file
cloud_cover.to_csv("exercise3.csv")
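
By default, `to_csv` writes the row index as an unnamed first column, which shows up as `Unnamed: 0` if the file is later read back without specifying an index column. The sketch below shows one way to control this. It uses a hypothetical two-row `toy_cloud_cover` DataFrame and writes to strings instead of files, so nothing is saved to disk.

```python
import io

import pandas as pd

# Toy stand-in for the cloud_cover DataFrame from Exercise 2d
toy_cloud_cover = pd.DataFrame(
    {"INDX_NR": [631616, 627351], "SKY": ["Some Cloud", None]},
    index=[60, 61],
)

# With no path argument, to_csv returns the CSV text as a string
with_index = toy_cloud_cover.to_csv()
without_index = toy_cloud_cover.to_csv(index=False)

print(with_index.splitlines()[0])     # header starts with an empty index column
print(without_index.splitlines()[0])  # header contains only the data columns

# Reading back with index_col=0 restores the original row labels
round_trip = pd.read_csv(io.StringIO(with_index), index_col=0)
print(round_trip.index.tolist())
```

Whether to keep the index depends on whether the row labels carry meaning; for a default 0, 1, 2, ... index, `index=False` often gives a cleaner file.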

## Further resources

### Filled version of this notebook

[Python Open Labs Week 1 filled notebook](https://colab.research.google.com/github/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/Reading_exploring_and_writing_data_with_Pandas/Python_Open_Labs_Week1_filled.ipynb) - a version of this notebook with all code filled in for the guided activity and exercises.

### Learning resources

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) - a free, online version of Jake VanderPlas' introduction to data science with Python that includes a chapter on data manipulation with pandas.
- [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) - a website providing a great overview of conducting data science with Python, including pandas.
- [Real Python](https://realpython.com/) - a collection of tutorials on many topics at a range of skill levels.
- [LinkedIn Learning](https://www.lynda.com/Python-training-tutorials/415-0.html) - free with NC State accounts, with several video series for learning Python.
- [Dataquest](https://www.dataquest.io/) - a series of courses with an emphasis on data science, free at first and then paid.

### Finding help with pandas

The [pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, and of course the indispensable [Stack Overflow has a "pandas" tag](https://stackoverflow.com/questions/tagged/pandas). There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas).

## Evaluation Survey
Please spend one minute answering these questions to help us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.