# Cleaning, manipulating, and exploring data with pandas

In [None]:
# Import the pandas library as pd (callable in the code as pd)


In [None]:
# The csv data file location
csv_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-instruction/main/MI_REU_2021/data/perovskite_DFT_EaH_FormE.csv'

# Read in the data file and print out the DataFrame


## Removing data
### Drop columns
Removing unnecessary columns of data using the DataFrame `drop()` method can simplify the dataset.
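As a minimal sketch of how `drop()` behaves, the example below uses a small hypothetical DataFrame rather than the perovskite file; the column names and values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names are assumptions for illustration
df = pd.DataFrame({
    "A site #1": ["Ba", "Sr", "Ca"],
    "A site #2": [None, None, None],  # a column with no data in it
    "B site #1": ["Fe", "Co", "Mn"],
})

# drop() returns a new DataFrame; the original is left unchanged
trimmed = df.drop(columns=["A site #2"])

print(trimmed.columns.tolist())
```

Because `drop()` returns a new DataFrame by default, the result has to be assigned to a variable (or back to `df`) for the change to persist.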

In [None]:
# Remove the empty columns using "drop()"


# Print out the first five records of the DataFrame


## Calculating new data
### Calculating with Expressions
New columns can be created based on data in other columns. For example, the new column "Number of elements" counts, for each row, how many of the site columns are not null.
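One possible way to count non-null site columns per row is sketched below. The DataFrame and its four site columns are hypothetical stand-ins, so the column list would need to be adjusted to match the real dataset.

```python
import pandas as pd

# Hypothetical miniature dataset; the site columns are assumptions for illustration
df = pd.DataFrame({
    "A site #1": ["Ba", "Sr", "Ca"],
    "A site #2": ["Sr", None, None],
    "B site #1": ["Fe", "Co", "Mn"],
    "B site #2": [None, "Fe", None],
})

site_columns = ["A site #1", "A site #2", "B site #1", "B site #2"]

# notna() marks each non-null cell as True; sum(axis=1) counts the True values in each row
df["Number of elements"] = df[site_columns].notna().sum(axis=1)

print(df["Number of elements"])
```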

In [None]:
# Create a new column for the number of elements in a compound


# Print out the new column


### Calculating with apply functions
New columns can also be created with the `apply` method, which uses functions to handle more complex manipulations of existing data.
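The sketch below shows the general pattern of passing a function to `apply`. The function `label_stability`, its `threshold` parameter, and the column name `energy_above_hull` are hypothetical names chosen for this illustration.

```python
import pandas as pd

# Hypothetical energies above hull; the column name is an assumption
df = pd.DataFrame({"energy_above_hull": [12.5, 40.0, 103.2]})

def label_stability(energy, threshold=40):
    """Return "Stable" below the threshold and "Unstable" otherwise."""
    return "Stable" if energy < threshold else "Unstable"

# apply() calls the function once for every value in the Series
df["Stability"] = df["energy_above_hull"].apply(label_stability)

print(df)
```

Note that a value exactly equal to the threshold falls into the "Unstable" branch, since the comparison is strictly "less than".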

In [None]:
# Create a function that returns "Stable" or "Unstable" based on the energy above hull
def is_stable(energy_above_hull):
    # less than 40 meV/atom is stable
    if energy_above_hull < 40:
        return "Stable"
    # 40 meV/atom or greater is unstable
    else:
        return "Unstable"

# Create a new column
# apply the function "is_stable" to each row of another column to populate the new column with data


# View the dataset


## Replacing data
Data can be replaced in a column based on conditions, similar to "find and replace." For example, the `replace()` method can swap element abbreviations for full element names.
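A minimal sketch of this kind of replacement, assuming a hypothetical column of element abbreviations, passes a dictionary that maps each old value to its new value. Reversing the dictionary restores the abbreviations.

```python
import pandas as pd

# Hypothetical column of element abbreviations
df = pd.DataFrame({"A site #1": ["Sr", "Ca", "Ba", "Sr"]})

# Map each abbreviation to its full name; values not in the mapping are left as they are
full_names = {"Sr": "Strontium", "Ca": "Calcium"}
df["A site #1"] = df["A site #1"].replace(full_names)

# Invert the mapping to change the names back to abbreviations
abbreviations = {name: symbol for symbol, name in full_names.items()}
restored = df["A site #1"].replace(abbreviations)

print(df["A site #1"].tolist())
print(restored.tolist())
```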

In [None]:
# Replace 'Sr' and 'Ca' abbreviations with "Strontium" and "Calcium"


# Print out the updated column of data


In [None]:
# Change the element names back to their abbreviations


# Print out the updated column of data


## Filtering data
### Conditional filtering
Data can be filtered using conditional statements to remove unnecessary rows of data or observe a specific range of data.
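The sketch below illustrates conditional filtering with boolean masks on a hypothetical DataFrame; the column names and values are assumptions. Conditions are combined with `&` ("and") or `|` ("or") rather than the Python keywords `and` and `or`.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names are assumptions for illustration
df = pd.DataFrame({
    "A site #1": ["Ca", "Sr", "Ca", "Ba"],
    "formation_energy": [-0.5, -0.8, -1.4, -2.0],
})

# Each comparison produces a boolean Series with one True/False per row
is_calcium = df["A site #1"] == "Ca"
high_energy = df["formation_energy"] > -1

# Indexing with a mask keeps only the rows where the mask is True
calcium_rows = df[is_calcium]
both = df[is_calcium & high_energy]

# len() reports how many rows meet the requirements
print(len(calcium_rows), len(both))
```

When the comparisons are written inline instead of stored in variables, each one needs its own parentheses, for example `df[(df["A site #1"] == "Ca") & (df["formation_energy"] > -1)]`.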

In [None]:
# Filter the data to only see perovskite oxides where Calcium was found at A site #1


# Print out the filtered data


In [None]:
# Filter the data to see only perovskite oxides with a formation energy greater than -1


# Print out the filtered data


In [None]:
# Filter the data to only see perovskite oxides where Calcium was found at A site #1 
# that also have a formation energy greater than -1


# Print out the filtered data


# Show how many rows meet these requirements


## Aggregating data
### Unique
The `unique()` method returns an array of the distinct values in a column of data. The length of that array is the number of unique values.
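As a short illustration, assuming a hypothetical column of element symbols, `unique()` lists the distinct values in order of first appearance and `len()` counts them.

```python
import pandas as pd

# Hypothetical column of element symbols
df = pd.DataFrame({"B site #3": ["Fe", "Co", "Fe", "Ni", "Co"]})

# unique() returns the distinct values in the order they first appear
unique_elements = df["B site #3"].unique()

print(unique_elements)
print(len(unique_elements))
```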

In [None]:
# Create a list of the unique elements at B site #3 with unique()


# Print out the unique species


In [None]:
# Get the length of the new array (How many unique elements are there?)


### Value counts
`value_counts()` counts how many times each unique entry appears in a column and returns the counts as a Series.
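A minimal sketch on a hypothetical column is shown below. The resulting Series uses the unique entries as its index and is sorted from most to least frequent; missing values are left out of the counts by default.

```python
import pandas as pd

# Hypothetical column of element symbols
df = pd.DataFrame({"A site #2": ["Sr", "Ca", "Sr", "Sr", "Ca", "Ba"]})

# The index holds the unique entries and the values hold how often each occurs
counts = df["A site #2"].value_counts()

print(counts)
```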

In [None]:
# Count the occurrences of unique values in the column 'A site #2'


# Print out the value counts


### Minimum, maximum, average
Aggregates such as the minimum, maximum, and mean can be calculated for a DataFrame or Series. Examples include:

- `mean()` to find the average of a range
- `min()` to find the smallest value
- `max()` to find the largest value
- `sum()` to sum the values of a range
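To illustrate, the sketch below applies these methods to a hypothetical two-column DataFrame (the column names are assumptions). Called on a DataFrame, an aggregate returns one value per column; called on a single column, it returns a single number.

```python
import pandas as pd

# Hypothetical numeric columns
df = pd.DataFrame({
    "energy_above_hull": [10.0, 30.0, 50.0],
    "formation_energy": [-1.5, -1.0, -0.5],
})

# On a DataFrame: a Series with one minimum per column
column_minimums = df.min()

# On a single column (a Series): one number
mean_energy = df["energy_above_hull"].mean()
total_energy = df["energy_above_hull"].sum()

print(column_minimums)
print(mean_energy, total_energy)
```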

In [None]:
# Calculate the minimum values for each column


In [None]:
# Calculate the mean (average) energy above hull


The `agg()` method can call multiple aggregate functions at once.
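A brief sketch, again using a hypothetical column name, passes a list of function names to `agg()` and gets back one result per function.

```python
import pandas as pd

# Hypothetical numeric column
df = pd.DataFrame({"energy_above_hull": [10.0, 30.0, 50.0]})

# A list of function names yields a Series indexed by those names
summary = df["energy_above_hull"].agg(["min", "max", "mean"])

print(summary)
```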

In [None]:
# Calculate the minimum, maximum, and average for energy above hull


## Grouping data
`groupby()` groups data in the DataFrame by column values. This grouped data can be sorted and manipulated.
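The following sketch groups a hypothetical DataFrame by one column and averages the numeric columns within each group. The data and column names are assumptions; `numeric_only=True` is included so that text columns are skipped rather than raising an error in recent pandas versions.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names are assumptions for illustration
df = pd.DataFrame({
    "A site #1": ["Ba", "Ba", "Sr", "Sr"],
    "A site #2": ["Ca", "Ca", "Ca", "La"],
    "energy_above_hull": [10.0, 30.0, 50.0, 70.0],
})

# groupby() alone returns a GroupBy object that describes the groups
grouped = df.groupby("A site #1")

# An aggregate such as mean() turns the groups into a result with one row per group
means = grouped.mean(numeric_only=True)

print(means)
```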

In [None]:
# Group the dataset by "A site #1"


# This creates a groupby object that contains information about the groups


# Find the mean of numerical columns grouped by element at "A site #1"


`groupby()` can group data by multiple factors, too.
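As a sketch under the same assumptions (hypothetical data and column names), passing a list of columns to `groupby()` produces one group for each combination of values that occurs in the data.

```python
import pandas as pd

# Hypothetical miniature dataset; the column names are assumptions for illustration
df = pd.DataFrame({
    "A site #1": ["Ba", "Ba", "Sr", "Sr"],
    "A site #2": ["Ca", "Ca", "Ca", "La"],
    "energy_above_hull": [10.0, 30.0, 50.0, 70.0],
})

# The result is indexed by (A site #1, A site #2) pairs
means = df.groupby(["A site #1", "A site #2"]).mean(numeric_only=True)

print(means)
```

Individual groups can then be looked up with a tuple, for example `means.loc[("Sr", "La")]`.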

In [None]:
# Group the data by A site #1 and then A site #2 and get the mean of the numerical columns using .mean()
     

## Data Visualization

Pandas can be used in conjunction with the **matplotlib visualization library** to create basic charts, such as bar charts, line charts, and scatter plots.

Import the pyplot interface (`matplotlib.pyplot`) as `plt` to access the plotting functionality of matplotlib. Once imported, charts can be created through the pandas integration with matplotlib by calling the `plot()` method on a DataFrame or Series.

In [None]:
# Import the matplotlib pyplot interface as plt (callable in the code as plt)


### Create a Bar Chart

A bar chart is a simple way to compare values across categories.
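A minimal sketch of both bar chart orientations is shown below, using hypothetical data. The `matplotlib.use("Agg")` line only selects a non-interactive backend so the example runs as a plain script; it is not needed inside a notebook. Placing the two charts side by side with `plt.subplots()` is a choice made for this illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical column of element symbols
df = pd.DataFrame({"A site #1": ["Ba", "Sr", "Ba", "Ca", "Ba", "Sr"]})
counts = df["A site #1"].value_counts()

# One figure with two side-by-side axes
fig, (ax_bar, ax_barh) = plt.subplots(1, 2)

# kind="bar" draws vertical bars; kind="barh" draws horizontal bars
counts.plot(kind="bar", ax=ax_bar)
counts.plot(kind="barh", ax=ax_barh)
```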

In [None]:
# Get the number of records for each element at "A site #1"


# Print the value counts


In [None]:
# Create a bar chart (kind="bar") with the labels (element names) along the horizontal
# axis and value counts (number of records) along the vertical axis


In [None]:
# Now create a horizontal bar chart (kind="barh") with the labels (element names) along 
# the vertical axis and value counts (number of records) along the horizontal axis


### Setting global chart styles

Basic global graphic parameters can be set to control the overall style of all plots.
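One way to set such defaults is sketched below using matplotlib's style sheets and `rcParams`. In this sketch the style is applied before the figure size so that the style sheet cannot overwrite the size setting. The `matplotlib.use("Agg")` line is only there so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt

# Apply one of the style sheets bundled with matplotlib
plt.style.use("ggplot")

# Default figure size in inches, as (width, height)
plt.rcParams["figure.figsize"] = (10, 8)

print(plt.rcParams["figure.figsize"])
```

These settings affect every plot created afterwards in the same session.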

In [None]:
# Set the default size of the plots to 10 inches wide and 8 inches tall


In [None]:
# Set the default graphic style of the plots to 'ggplot'


In [None]:
# Replot the horizontal bar chart with the updated style


### Create a Scatter Plot

Scatter plots are a standard graphing method for identifying a relationship between two variables.

In [None]:
# Create new columns to simplify calling column names in plots


# Create a scatter plot (kind="scatter") to plot the energy above hull (x axis) vs the formation energy (y axis)


### Set the style of the plot

The graphical encoding of the plot data can be changed with keyword arguments such as `color` to set the color and `marker` to set the style of the points.
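A short sketch with hypothetical data and column names shows both keyword arguments in use. As before, the backend line is only needed when running outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import pandas as pd

# Hypothetical values; the column names are assumptions for illustration
df = pd.DataFrame({
    "energy_above_hull": [5.0, 20.0, 45.0, 90.0],
    "formation_energy": [-1.8, -1.5, -1.1, -0.6],
})

# color and marker are passed through to matplotlib's scatter function
ax = df.plot(
    kind="scatter",
    x="energy_above_hull",
    y="formation_energy",
    color="purple",
    marker="x",
)
```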

In [None]:
# Plot the energy above hull (x axis) vs the formation energy (y axis)
# Add keyword arguments to change the color to purple and the marker to "x"


To make the chart more descriptive and easier to interpret, add a plot title and axis labels. The `plot()` method returns a matplotlib `Axes` object; store it in a variable so that the methods `set_title()`, `set_xlabel()`, and `set_ylabel()` can be called on it to set a descriptive title, x axis label, and y axis label, respectively.
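The pattern can be sketched as follows with hypothetical data. The column names, title, and label text are placeholders chosen for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import pandas as pd

# Hypothetical values; the column names are assumptions for illustration
df = pd.DataFrame({
    "energy_above_hull": [5.0, 20.0, 45.0, 90.0],
    "formation_energy": [-1.8, -1.5, -1.1, -0.6],
})

# plot() returns the Axes object that the chart was drawn on
ax = df.plot(kind="scatter", x="energy_above_hull", y="formation_energy")

# Replace the default labels with descriptive text
ax.set_title("Formation energy vs. energy above hull")
ax.set_xlabel("Energy above hull (meV/atom)")
ax.set_ylabel("Formation energy")
```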

In [None]:
# Create a variable that stores the plot


# Change titles and axis labels with set_title(), set_xlabel(), and set_ylabel()


### Other Data Visualization Libraries

`matplotlib` is not the only option for creating data visualizations. Many other libraries offer additional styling and interactive options.

#### Seaborn

Seaborn is a common visualization library that builds on matplotlib and provides more robust options for plot types and stylings.

In [None]:
# Import the seaborn library as sns (callable in the code as sns)


In [None]:
# Recreate the scatter plot of energy above hull vs formation energy with seaborn scatterplot()


In [None]:
# Create a histogram of energy above hull values by calling histplot() with seaborn


In [None]:
# Create violin plots of formation energy grouped by number of elements and stability
