## Pandas 
Pandas is a Python library for data analysis and manipulation. Its central data structure is the DataFrame, which holds two-dimensional data arranged in rows and columns. DataFrames work closely with NumPy arrays. NumPy arrays, while similar in some ways, can also hold higher-dimensional data.
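
To make the relationship between DataFrames and NumPy arrays concrete, here is a minimal sketch. The three-row table is made up for illustration and is not read from the CSV file used in this notebook.

```python
import numpy as np
import pandas as pd

# A tiny hand-made table (illustrative only; not loaded from the CSV file)
small_df = pd.DataFrame({
    "Symbol": ["H", "He", "Li"],
    "AtomicNumber": [1, 2, 3],
})

# A DataFrame is two-dimensional: shape is (rows, columns)
print(small_df.shape)

# A single column can be converted to a one-dimensional NumPy array
atomic_numbers = small_df["AtomicNumber"].to_numpy()
print(type(atomic_numbers), atomic_numbers.ndim)

# NumPy arrays are not limited to two dimensions
cube = np.zeros((2, 3, 4))
print(cube.ndim)
```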



In [None]:
import pandas as pd
df = pd.read_csv("PubChemElements_all.csv")

When we first load data, and at various points while working with it, we'll want to see what our dataframe looks like. You can preview the first few rows of your dataframe using the `.head` method.

The `.info` method gives information about the columns and the data type of each column. Data types will become very important later as we work with the data more.

In [None]:
df.head()

In [None]:
df.info()

For this dataframe, we see that the first column, `AtomicNumber`, has the data type `int64`. Here, `int` means integer and `64` means 64-bit. Similarly, `float64` means 64-bit floating point. These are decimal numbers.

The other columns, whose data type reads `object`, are not numeric. They might hold strings or they might hold something else. We'll discuss this more later.

The `describe` method can be used on a dataframe to quickly see summary statistics for columns with numerical data. If you compare the columns that statistics are computed for with the data types shown by `info`, you will see that we only get statistics for columns with `int64` or `float64` data types.
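
The link between data types and `describe` can also be checked on a small hand-made table. This is only a sketch: the table `toy_types` and its values are illustrative assumptions, not part of the PubChem file.

```python
import pandas as pd

# Illustrative table with one integer, one decimal, and one text column
toy_types = pd.DataFrame({
    "AtomicNumber": [1, 2, 3],             # whole numbers
    "AtomicMass": [1.008, 4.0026, 6.94],   # decimal numbers
    "Symbol": ["H", "He", "Li"],           # text, so not numeric
})

# Show the data type pandas chose for each column
print(toy_types.dtypes)

# By default, describe() only summarizes the numeric columns
toy_stats = toy_types.describe()
print(list(toy_stats.columns))
```

The `Symbol` column is missing from the summary because it does not hold numbers.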

In [None]:
df.describe()

This information is extremely useful for understanding the data. We can also easily visualize the distribution of each numeric column using the pandas ``hist`` method.

In [None]:
df.hist(figsize=(8,8), edgecolor='black', grid=False)

In [None]:
df.shape

In [None]:
df['AtomicNumber']

In [None]:
# To preview a single column, select it by name and call .head(n) to show the first n rows.
df["AtomicNumber"].head(5)

In [None]:
df[["Symbol", "ElectronConfiguration"]].head(5)

In [None]:
# Create a new column by converting the melting point from kelvin to degrees Celsius
df['MeltingPointC'] = df['MeltingPoint'] - 273.15
display(df)

The `.apply` method in pandas applies a function along the rows or columns of a dataframe, or to every value in a single column. This is useful when you have a custom function that you need to run on every value in a column and there is no built-in NumPy or pandas function for it.

For example, we could apply the `len` function to our `Name` column to get the number of letters in each element's name.

In [None]:
df["Name"].apply(len)
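
The same pattern works with a function you write yourself. The sketch below uses a hypothetical helper, `label_name_length`, and a small made-up table. Neither comes from the dataset.

```python
import pandas as pd

def label_name_length(name):
    """Hypothetical helper: label a name as 'short' or 'long' by its length."""
    return "short" if len(name) <= 4 else "long"

# Illustrative table (not loaded from the CSV file)
toy_names = pd.DataFrame({"Name": ["Iron", "Hydrogen", "Gold"]})

# Apply a built-in function to every value in the column
toy_names["NameLength"] = toy_names["Name"].apply(len)

# Apply the custom function to every value in the column
toy_names["NameLabel"] = toy_names["Name"].apply(label_name_length)

print(toy_names)
```

Note that the function is passed by name, without parentheses, so pandas can call it once per value.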

RDKit can also supply information about the elements. We can create a periodic table object with ``Chem.GetPeriodicTable``, then use its methods to get information about atoms.

In [None]:
from rdkit import Chem

In [None]:
#Initialize periodic table
periodic_table = Chem.GetPeriodicTable()

In [None]:
df['NOuter'] = df['Symbol'].apply(periodic_table.GetNOuterElecs)
df.head()

In [None]:
df["defaultVal"] = df["Symbol"].apply(periodic_table.GetDefaultValence)
df.head()

In [None]:
df.to_csv("periodic_table_processed.csv")

# Visualization

Visualizing data helps in understanding relationships and patterns that might not be apparent from raw data. Here, we will use Seaborn, a statistical visualization library, to create plots from our periodic table dataset. Seaborn is built on top of matplotlib, so if we would like to adjust any of the plots seaborn makes, we can do that through the Matplotlib interface we've used before.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Let's use ionization energy and the number of outer electrons to make a bar plot.
sns.catplot(data=df, x='NOuter', y='IonizationEnergy', kind='bar')
plt.xticks()

Seaborn also makes it easy to create scatter plots to visualize relationships between continuous variables. For example, we can create a scatter plot to show the relationship between ionization energy and atomic radius.

In [None]:
sns.scatterplot(data=df, x='AtomicRadius', y='IonizationEnergy', hue='GroupBlock')
# Label the plot to match the plotted columns (ionization energy vs. atomic radius)
plt.title('Ionization Energy vs Atomic Radius')
plt.xlabel('Atomic Radius')
plt.ylabel('Ionization Energy')

# Visualizing Correlation
The correlation matrix provides insight into the pairwise relationships between the variables. A correlation value close to 1 indicates a strong positive linear relationship, while a correlation value close to -1 indicates a strong negative linear relationship. A correlation value close to 0 indicates no linear relationship between the features.

In [None]:
# To calculate the correlation matrix
corr = df.corr(numeric_only=True)
corr
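
To see what values near 1, -1, and 0 look like, here is a small sketch on made-up numbers. The table `toy_corr_df` and its column names are illustrative assumptions. The last numeric column also shows that a correlation of 0 only rules out a linear relationship: `x_squared` depends entirely on `x`, yet the two are uncorrelated.

```python
import pandas as pd

# Illustrative data (not from the periodic table dataset)
toy_corr_df = pd.DataFrame({
    "x": [-2, -1, 0, 1, 2],
    "double_x": [-4, -2, 0, 2, 4],    # rises as x rises
    "negative_x": [2, 1, 0, -1, -2],  # falls as x rises
    "x_squared": [4, 1, 0, 1, 4],     # depends on x, but not linearly
    "label": list("abcde"),           # text column, skipped by numeric_only=True
})

# Pairwise correlation of the numeric columns
toy_corr = toy_corr_df.corr(numeric_only=True)
print(toy_corr.round(2))
```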

In [None]:
# Set the figure size first so that it applies to the heatmap drawn below
plt.figure(figsize=(10,8))
# Create a heatmap to make it easier to examine correlation of different variables
sns.heatmap(corr, annot=True, cmap='coolwarm', fmt='.2f')

The heatmap uses the "coolwarm" color map, where red indicates positive correlation and blue indicates negative correlation between variables. Strongly correlated pairs appear in darker shades of red, while strongly inversely correlated pairs appear in darker shades of blue.