# Getting started with Pandas

## What is a Python library?

A Python library is a package of code that extends the functionality of Python. Base Python offers many features, but not everything; a library can be imported at the beginning of your code and used for a specific purpose.

## What is pandas?

The pandas library is a high-level data manipulation tool <sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>. Some features include:

* Reading and writing data from various data structures and file types
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data
* Visualization with help from Matplotlib

## Importing a Python library

To use any library, it must first be imported into the Python environment.
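
As a minimal sketch, assuming pandas is already installed in the environment, the import is conventionally written with the alias `pd`. Printing the version is just one optional way to confirm that the import worked.

```python
# Import the pandas library under the conventional alias "pd"
import pandas as pd

# Optional check: print the installed pandas version
print(pd.__version__)
```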

In [None]:
# Import the pandas library as pd (callable in the code as pd)


## Reading data files with pandas

Datasets can be stored in several types of files, including .csv, .json, .txt, .xls, .xlsx, and more. The pandas library provides utilities to read in many of these file types.

### CSV Files

A comma separated values (CSV) file is a plain text file containing data separated by commas.
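
The following sketch shows what reading CSV data might look like. To stay self-contained it parses a small, made-up CSV string rather than the workshop file; the names `csv_text` and `sample_csv` and the column names are hypothetical. `pd.read_csv` accepts a file path, a URL, or a file-like object.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a real file (not the workshop dataset)
csv_text = "material,band_gap\nA,1.5\nB,2.1\nC,0.9\n"

# The first line is used as the column headers by default
sample_csv = pd.read_csv(io.StringIO(csv_text))
print(sample_csv)
```

Passing a URL string such as `csv_file_url` to `pd.read_csv` in place of the `StringIO` object should work the same way, given network access.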

In [None]:
# The file location
csv_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/mi-reu-2021/main/data/perovskite_DFT_EaH_FormE.csv'

# Read in the file and print out the DataFrame


### Excel Files

An Excel file (.xls or .xlsx) is the default file format of the spreadsheet application Microsoft Excel. Most other spreadsheet applications can open or convert these files.

In [None]:
# The file location
excel_file_url = 'https://github.com/ncsu-libraries-data-vis/mi-reu-2021/blob/main/data/perovskite_DFT_EaH_FormE.xlsx?raw=true'

# Read in the file and print out the DataFrame


### JSON Files

JSON (JavaScript Object Notation) is a text-based data format built from objects (collections of name/value pairs) and arrays (ordered lists of values).
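
A minimal sketch of reading JSON, using a made-up JSON string instead of the workshop file. The names `json_text` and `sample_json` are hypothetical, and the layout here (a list of records) is an assumption that may differ from the layout of the workshop file.

```python
import io

import pandas as pd

# Hypothetical JSON text: an array of objects, one object per row
json_text = '[{"material": "A", "band_gap": 1.5}, {"material": "B", "band_gap": 2.1}]'

# orient="records" tells pandas that each object in the array is one row
sample_json = pd.read_json(io.StringIO(json_text), orient="records")
print(sample_json)
```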

In [None]:
# The file location
json_file_url = 'https://raw.githubusercontent.com/ncsu-libraries-data-vis/mi-reu-2021/main/data/perovskite_DFT_EaH_FormE.json'


# Read in the file and print out the DataFrame


## Pandas data structures

Pandas uses two main data structures: **Series** and **DataFrame**.

<img src="https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Manipulation_with_Python/assets/nc_dataframes.png" alt="DataFrames are composed of Series" width="75%">

### DataFrame

A **DataFrame** is a two-dimensional array, similar to tabular data (think of Excel), with labeled rows (the index) and labeled columns. A **DataFrame** is made up of multiple **Series** in the same way that a table is made up of multiple columns. Each column holds values of a single data type, although different columns can have different data types.
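
To make this concrete, here is a small sketch that builds a DataFrame by hand. The column names and values are hypothetical and only serve to show the `shape` attribute and the per-column data types.

```python
import pandas as pd

# Hypothetical data: one column of strings, one of integers, one of floats
df = pd.DataFrame(
    {
        "material": ["A", "B", "C"],
        "n_atoms": [5, 5, 10],
        "band_gap": [1.5, 2.1, 0.9],
    }
)

print(df.shape)   # a tuple of (number of rows, number of columns)
print(df.dtypes)  # the data type of each column
```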

In [None]:
# View the DataFrame created from the csv file


In [None]:
# Get the shape of a DataFrame (the number of rows and columns)


### Series

A **Series** is a one-dimensional array of data with an index (row labels), such as a single column of a DataFrame. It can be thought of as a specialized dictionary that maps index labels to values.
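
The sketch below, built on hypothetical data, selects one column as a Series and then illustrates the dictionary analogy by looking up a value by its index label. The names `band_gap` and `gaps` are illustrative only.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame({"material": ["A", "B", "C"], "band_gap": [1.5, 2.1, 0.9]})

# Selecting a single column returns a Series
band_gap = df["band_gap"]
print(type(band_gap))
print(band_gap.shape)  # a one-element tuple: (number of rows,)

# Use the "material" column as the index, then look up a value by label,
# much like looking up a key in a dictionary
gaps = df.set_index("material")["band_gap"]
print(gaps["B"])
```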

In [None]:
# Select one column from a DataFrame (stored as a Series)


In [None]:
# Get the shape of a Series


## Exploring the data

The pandas library can be used to explore data before analysis. This is useful for an initial assessment of a dataset: seeing what it contains and what might be worth analyzing.

### View DataFrame column labels

The label names for each column can be viewed using the DataFrame `columns` attribute, which returns an `Index` object containing the column names.
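
A short sketch on hypothetical data: `columns` returns a pandas `Index`, which can be converted to a plain Python list when that is more convenient.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame({"material": ["A", "B"], "band_gap": [1.5, 2.1]})

# The column labels as a pandas Index object
print(df.columns)

# Convert to a plain Python list if needed
column_names = list(df.columns)
print(column_names)
```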

In [None]:
# View column labels (headers)


### View summaries of a DataFrame

Summaries of a DataFrame can be used to observe basic statistics and information such as column data types and non-null value counts.
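
The following sketch, using made-up data with one missing value, shows the two summary methods side by side. By default, `describe()` covers only the numerical columns, and `info()` prints its report directly rather than returning it.

```python
import pandas as pd

# Hypothetical data with one missing value in "band_gap"
df = pd.DataFrame(
    {"material": ["A", "B", "C"], "band_gap": [1.5, 2.1, None]}
)

# Summary statistics for the numerical columns (missing values are skipped)
summary = df.describe()
print(summary)

# Summary statistics for a single column
print(df["band_gap"].describe())

# Column data types, non-null counts, and memory usage
df.info()
```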

In [None]:
# Get summary statistics of DataFrame columns using "describe()" (by default,
# only numerical columns are included)


In [None]:
# Get summary statistics of a single column using "describe()"


In [None]:
# Summarize column data types, non-null values, and memory usage using "info()"


### Referencing and indexing a DataFrame

#### Referencing Rows (.loc and .iloc)
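
As a rough sketch of the difference: `.loc` selects by index label, while `.iloc` selects by zero-based integer position. The DataFrame below is hypothetical and deliberately uses index labels that do not start at 0, so that the two can be told apart.

```python
import pandas as pd

# Hypothetical data whose index labels start at 10 instead of 0
df = pd.DataFrame(
    {"band_gap": [1.5, 2.1, 0.9, 3.0]},
    index=[10, 11, 12, 13],
)

# The first row, selected two ways; both return a Series
first_by_label = df.loc[10]
first_by_position = df.iloc[0]

# Label slices include the stop label: rows 10, 11, and 12
label_slice = df.loc[10:12]

# Position slices exclude the stop position: rows at positions 0 and 1
position_slice = df.iloc[0:2]

print(label_slice)
print(position_slice)
```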

In [None]:
# Reference a row by index label
# Returns a Series

# Access first row of data_csv by index label
# In this case the index label is 0


# Access first row of data_json by index label
# In this case the index label is not 0


In [None]:
# Reference multiple rows by index label (in this case the index labels 0 through 3)
# Returns a DataFrame


In [None]:
# Reference a row or multiple rows by zero-based integer position

# Access first row of data_csv by row integer value
# In this case the row is row 0


# Access first row of data_json by row integer value
# In this case the row is also row 0


In [None]:
# Reference multiple rows by row number (in this case rows 0 through 2)
# Note that this time the range doesn't include the stop number


#### Referencing Columns
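
A minimal sketch with hypothetical column names: a single label in square brackets returns a Series, while a list of labels returns a DataFrame.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame(
    {
        "material": ["A", "B", "C"],
        "n_atoms": [5, 5, 10],
        "band_gap": [1.5, 2.1, 0.9],
    }
)

# One column label returns a Series
one_column = df["band_gap"]

# A list of column labels (note the double brackets) returns a DataFrame
two_columns = df[["material", "band_gap"]]

print(one_column)
print(two_columns)
```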

In [None]:
# Referencing a column by column label (in this case, "A site #1")


In [None]:
# Referencing multiple columns by a list of column labels 
# (in this case, the columns "A site #1" and "A site #2")


#### Referencing both rows and columns
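
Rows and columns can be selected together by passing both to `.loc`, separated by a comma. The sketch below uses hypothetical data. Label slices include both endpoints, for rows and for columns alike.

```python
import pandas as pd

# Hypothetical DataFrame used only for illustration
df = pd.DataFrame(
    {
        "material": ["A", "B", "C"],
        "n_atoms": [5, 5, 10],
        "band_gap": [1.5, 2.1, 0.9],
    }
)

# Rows labeled 0 through 1 and columns "material" through "n_atoms";
# both slices include their stop label
subset = df.loc[0:1, "material":"n_atoms"]
print(subset)
```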

In [None]:
# Referencing a subset of rows and columns using index and column labels
# Note that this statement uses a range of column labels instead of a list
# Make sure that the column range starts with the leftmost label


## Writing data to a file
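
A sketch of writing a DataFrame to disk, under a few assumptions: the data and the file names `subset.csv` and `subset.json` are hypothetical, and the `output_data` folder is created inside a temporary directory so that the example does not touch the current directory. In the notebook itself, `Path("output_data")` could be used instead.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical subset standing in for the one selected earlier
subset = pd.DataFrame({"material": ["A", "B"], "band_gap": [1.5, 2.1]})

# Assumption: write into a temporary location; create the folder if missing
output_dir = Path(tempfile.mkdtemp()) / "output_data"
output_dir.mkdir(parents=True, exist_ok=True)

csv_path = output_dir / "subset.csv"
json_path = output_dir / "subset.json"

# index=False leaves the row labels out of the CSV file
subset.to_csv(csv_path, index=False)

# orient="records" writes one JSON object per row
subset.to_json(json_path, orient="records")

# Read the CSV file back to check what was written
round_trip = pd.read_csv(csv_path)
print(round_trip)
```

`DataFrame.to_excel` follows the same pattern, but it requires an Excel engine such as openpyxl to be installed.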

In [None]:
# Save the subset from the previous cell in a variable


# Write a csv file to the folder "output_data" in the current directory


In [None]:
# Write an Excel file to the folder "output_data" in the current directory


In [None]:
# Write a JSON file to the folder "output_data" in the current directory
