# Introducing Pandas

- Sections

- Introduction to Pandas

- Creating and Loading Data

- Exploring DataFrames

- Data Cleaning

- Visualization with Pandas

- Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

# Creating a DataFrame

A DataFrame in pandas can be easily created from a Python dictionary, where the dictionary keys become column names and the values (lists, arrays, or Series) become the column data. This is a convenient way to organize structured data for analysis, since DataFrames support powerful tools for filtering, aggregation, and visualization.

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve', 'Frank', 'Garnet', 'Hannah', 'Kendall'],
    'Age': [24, 27, 22, 32, 29, 45, 33, 23, np.nan],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Chicago', 'New York', 'Houston', np.nan],
    'Salary': [70000, 80000, 50000, 120000, 75000, 150000, 234092, 43223, np.nan]
}
data_df = pd.DataFrame(data, index=[3434, 5434, 903, 24312, 443, 4312, 35, 352, 431])
print("\n--- DataFrame ---\n\n")
data_df

# Useful DataFrame Methods for Quick Data Analysis

df.head() / df.tail() → preview the first or last rows of a table.

df.info() → summary of column names, data types, and non-null counts.

df.describe() → quick statistics (mean, std, min, max, quartiles) for numerical columns.

df.shape → gives the number of rows and columns.

df.columns → lists all column names.

df['column'].value_counts() → frequency counts of the values in a single column (called on the whole DataFrame, value_counts() counts unique rows instead).

df.isnull().sum() → check for missing values in each column.

df.corr(numeric_only=True) → correlation matrix between numerical columns (in recent pandas versions, non-numeric columns must be excluded explicitly).

df.groupby() → group data by a column for aggregation (e.g., averages, counts).

df.sort_values() → sort rows by column values.

df.duplicated().sum() → count the duplicated rows.
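
Several of the methods listed above (value_counts, groupby, corr) are not exercised in the cells that follow. Here is a minimal sketch of them, assuming a small made-up table; the `demo` frame and its values are illustrative and not part of the tutorial data.

```python
import pandas as pd

# Hypothetical toy table used only for illustration
demo = pd.DataFrame({
    "City": ["Chicago", "Houston", "Chicago", "Houston", "Phoenix"],
    "Age": [22, 32, 45, 23, 29],
    "Salary": [50000, 120000, 150000, 43223, 75000],
})

# value_counts on one column (a Series) counts each distinct value
city_counts = demo["City"].value_counts()

# groupby + mean: average salary within each city
mean_salary = demo.groupby("City")["Salary"].mean()

# Select the numeric columns first, then compute their correlation matrix
corr = demo[["Age", "Salary"]].corr()

print(city_counts)
print(mean_salary)
print(corr)
```

Selecting the numeric columns before calling corr() keeps the text column out of the calculation regardless of the pandas version.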

In [None]:
data_df.head()

In [None]:
data_df.info()

In [None]:
data_df.shape

In [None]:
data_df.isnull().sum()

In [None]:
data_df.sort_values(by='Salary', ascending=False)

In [None]:
#Seeing duplicates in the DataFrame
data_df.duplicated().sum()  # Count duplicates

In [None]:
data_df.drop_duplicates(inplace=True)  # Remove duplicates

# Accessing Data

In pandas, you can access data in a DataFrame using labels, positions, or boolean conditions. Two of the most common accessors are .loc and .iloc. 

## .loc
The .loc accessor is label-based, meaning you select rows and columns using their names or index labels. 

## .iloc
In contrast, .iloc is position-based, so you access rows and columns by their integer positions (like array indexing). 

For example, df.loc[5, "flux"] selects the row with index label 5 and the column "flux", while df.iloc[0, 2] selects the first row and third column regardless of labels. This distinction makes it easy to switch between human-readable label access and fast position-based indexing.

In [None]:
data = {
    "name": ["Sirius", "Vega", "Betelgeuse"],
    "magnitude": [14.56, 13.45, 12.345],
    "distance_ly": [8.6, 25.0, 642.5]
}
df = pd.DataFrame(data, index=["star1", "star2", "star3"])

print("DataFrame:")
print(df)

# Using .loc (label-based)
print("\n.loc example (row label 'star2', column 'magnitude'):")
print(df.loc["star2", "magnitude"])

# Using .iloc (position-based)
print("\n.iloc example (row 1, column 1):")
print(df.iloc[1, 1])
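
Two details of .loc and .iloc are easy to trip over: how slices treat their end point, and what happens when the index labels are themselves integers (as in data_df above). The sketch below uses small made-up frames (`stars` and `ints` are hypothetical names) to show both.

```python
import pandas as pd

stars = pd.DataFrame(
    {"magnitude": [1.0, 2.0, 3.0, 4.0]},
    index=["a", "b", "c", "d"],
)

# .loc slices by label and INCLUDES the end label
by_label = stars.loc["a":"c"]

# .iloc slices by position and EXCLUDES the end position, like a Python list
by_position = stars.iloc[0:2]

# With integer labels, .loc and .iloc can return different rows
ints = pd.DataFrame({"v": [10, 20, 30]}, index=[2, 0, 1])
label_zero = ints.loc[0, "v"]    # row whose label is 0
position_zero = ints.iloc[0, 0]  # first row, whatever its label is

print(by_label)
print(by_position)
print(label_zero, position_zero)
```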

# 2. Loading Data from a CSV file (example)

To read in a CSV file, use the pandas built-in read_csv() function, which loads the file's contents into a DataFrame. (to_csv() does the opposite: it writes a DataFrame out to a CSV file.) Let us see this in action below.

In [None]:
file_path = 'Boston.csv'
df = pd.read_csv(file_path)

In [None]:
df.head()

# index_col

The index_col argument of read_csv is the zero-based position (or the name) of the column to use as the DataFrame index.
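
To see the effect without depending on a file on disk, here is a small sketch that reads the same CSV text with and without index_col. The text in `csv_text` and its column names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "id,a,b\n1,0.1,2.3\n2,0.2,7.1\n"

# Without index_col, pandas adds a default 0, 1, 2, ... index
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column becomes the index
first_col_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index)
print(first_col_index)
```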

In [None]:
df = pd.read_csv(file_path, index_col=0)

In [None]:
df.head()

## 2.1 Loading in .txt files

Pandas’ read_csv function can be used to open not only CSV files but also general text files containing structured tabular data. By specifying parameters like sep for the delimiter and index_col for which column should be used as the row index, you can easily read in whitespace- or tab-separated data. 


Passing sep = r'\s+' treats any run of whitespace (spaces or tabs) as a single separator, and index_col = 0 uses the first column as the index. The resulting DataFrame can then be used for exploration, filtering, and analysis just like any other pandas DataFrame.
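
As a quick illustration, the sketch below parses a few lines of unevenly spaced text. The contents of `txt` are made up and are not taken from the file used in this notebook.

```python
import io
import pandas as pd

# Hypothetical whitespace-separated text with uneven spacing and one tab
txt = (
    "name   mass    radius\n"
    "obj1   1.5     0.9\n"
    "obj2\t2.25  1.1\n"
)

# Any run of spaces or tabs counts as one separator; first column is the index
table = pd.read_csv(io.StringIO(txt), sep=r"\s+", index_col=0)

print(table)
```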

After loading a text file into a DataFrame with read_csv, it’s helpful to quickly inspect its structure before analysis. Useful commands include:

df.head() → view the first few rows to check that data was read correctly.

df.info() → see column names, data types, and non-null counts.

df.describe() → get summary statistics for numerical columns.

df.columns → list all column names to verify headers.

In [None]:

df_txt = pd.read_csv("Masses_V2_Table.txt", sep = r"\s+", index_col = 0)

In [None]:
df_txt 

# 3. Data Cleaning (Handling Missing Values)

In pandas, handling missing data is an important step before analysis. Missing values are usually represented as NaN, and pandas provides several methods to manage them. You can remove rows or columns containing missing data using dropna(), or fill in missing values with fillna()—for example, replacing them with a default value, the column mean, or a computed estimate. Cleaning missing data ensures that subsequent calculations, visualizations, and models are accurate and reliable.

In [None]:
#code to determine if there are NaNs in the columns and how many are there
df.isna().sum()

In [None]:
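# Note: merge_df is not created in the cells shown in this notebook;
# it is assumed to be a DataFrame with missing values defined elsewhere
merge_df.isna().sum()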

In [None]:
#Ways to handle NaNs
#1. Drop the rows with NaNs
df_no_missing = merge_df.dropna()
df_no_missing.isna().sum()

In [None]:
#2. Fill NaNs with a specific value
df_fill = merge_df.fillna(0)
df_fill.isna().sum()

In [None]:
df_fill.tail()

In [None]:
#3. Fill NaNs with the mean of the column
df_mean = merge_df.fillna(merge_df.mean(numeric_only=True))  # column means computed from merge_df itself
df_mean.isna().sum()

In [None]:
df_mean.tail()

In [None]:
#4. Fill NaNs with the median of the column
df_median = merge_df.fillna(merge_df.median(numeric_only=True))  # column medians computed from merge_df itself
df_median.isna().sum()

In [None]:
#5. Fill NaNs with the mode of the column
df_mode = merge_df.fillna(merge_df.mode().iloc[0])  # first mode of each column in merge_df
df_mode.isna().sum()
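
The cells above rely on a DataFrame defined elsewhere, so here is a self-contained sketch of the same strategies on a tiny made-up table (`toy` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical table with one missing value per column
toy = pd.DataFrame({
    "Age": [24.0, 27.0, np.nan, 24.0],
    "City": ["New York", None, "Chicago", "Chicago"],
})

n_missing = toy.isna().sum()   # missing values per column

dropped = toy.dropna()         # keep only rows with no missing values

# Fill numeric gaps with the column mean; the text column is left untouched
filled_mean = toy.fillna(toy.mean(numeric_only=True))

# Fill every column with its most frequent value (first row of mode())
filled_mode = toy.fillna(toy.mode().iloc[0])

print(n_missing)
print(dropped)
print(filled_mean)
print(filled_mode)
```

Filling with the mean only makes sense for numeric columns, which is why the City gap survives in `filled_mean` but not in `filled_mode`.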

# 4. Filtering

Boolean masking in pandas is a powerful way to filter a DataFrame based on conditions. By creating a boolean array from a comparison or logical operation, you can select only the rows that satisfy certain criteria. This allows for quick subsetting of data without modifying the original DataFrame.

Boolean masking is very useful in astronomy for filtering catalogs, selecting objects within a certain range of properties, or isolating outliers for further study.

In [None]:
#Boolean Masking to Filter the DataFrame
df_filtered = df[df['dis'] < 2]
print("\nFiltered Data (dis < 2):\n")
df_filtered
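
Masks can also be combined, and passed to .loc to pick rows and columns in one step. A minimal sketch, assuming a made-up catalog (`catalog` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical catalog used only for illustration
catalog = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "magnitude": [4.5, 11.2, 7.8, 13.0],
    "distance_ly": [10.0, 250.0, 40.0, 900.0],
})

bright = catalog["magnitude"] < 8      # boolean Series, one entry per row
nearby = catalog["distance_ly"] < 20

# Combine masks with & (and), | (or), ~ (not).
# When writing comparisons inline, wrap each one in parentheses.
bright_and_nearby = catalog[bright & nearby]
bright_or_nearby = catalog[(catalog["magnitude"] < 8) | (catalog["distance_ly"] < 20)]

# .loc accepts a mask for the rows and a list of labels for the columns
faint = catalog.loc[~bright, ["name", "magnitude"]]

print(bright_and_nearby)
print(bright_or_nearby)
print(faint)
```

None of these selections modify `catalog`; each one returns a new DataFrame.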

# 5. Visualization with Pandas

Pandas DataFrames come with built-in plotting capabilities that make quick visualizations of your data simple. By calling the .plot() method on a DataFrame or Series, you can generate line plots, scatter plots, histograms, bar charts, and more. These plots are powered by Matplotlib under the hood, so you can customize them further if needed. Built-in plots are especially useful for exploratory data analysis, allowing you to quickly spot trends, distributions, or outliers in your astronomical datasets.

In [None]:
df.plot(x='indus', y='crim', kind='scatter', title='Crime vs Industry')
plt.show()

In [None]:
df.plot(x='indus', y='crim', kind='scatter', title='Crime vs Industry')
plt.xlabel('Industry')
plt.ylabel('Crime')
plt.show()
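
For a single plot, .plot() returns the Matplotlib Axes it drew on, so labels can also be set on that object directly. The sketch below shows this with a histogram of made-up values; the `values` Series is hypothetical, and the backend line is only there so the example runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed in a notebook
import pandas as pd

# Hypothetical data used only for illustration
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5], name="value")

# kind selects the plot type; extra keywords such as bins are passed on to Matplotlib
ax = values.plot(kind="hist", bins=5, title="Histogram of value")

# Customize the returned Axes directly
ax.set_xlabel("value")
ax.set_ylabel("Counts")
```

In a notebook, finish the cell with plt.show() as in the cells above.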

# 6. Visualization with Pandas-Friendly Packages (seaborn)

Pandas works seamlessly with data visualization libraries like Seaborn, which is built on top of Matplotlib and designed for statistical plotting. Seaborn can directly accept pandas DataFrames, making it easy to create complex plots such as scatter plots with regression lines, histograms, kernel density estimates, pair plots, and categorical plots. Its integration with pandas means you can filter, group, or aggregate your DataFrame first, then pass it straight to Seaborn for visualization—perfect for exploring relationships in astronomical datasets.

Seaborn makes it much easier to produce informative, publication-quality plots directly from pandas DataFrames compared to using Matplotlib alone.

In [None]:
features_df = pd.read_csv('Features_with_Continuum.txt', sep = ' ', index_col = 0)
predictors = pd.read_csv('Predictions_with_Continuum.txt', sep = ' ', index_col = 0)

In [None]:
features_df.head()

In [None]:
predictors.head()

In [None]:
#filtering the data
good_fits_mask = features_df.chisq_phot < 50
EW_r_mask = predictors.EW_r.values < 500
total_mask = good_fits_mask & EW_r_mask

good_fits_data = features_df[total_mask]
y_pred = predictors[total_mask].EW_r
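
The cell above applies one combined mask to two tables, which only works if both tables describe the same rows in the same order. The sketch below illustrates that idea with two tiny made-up frames (`features`, `preds`, and their columns are hypothetical stand-ins, not the files loaded above).

```python
import pandas as pd

# Two hypothetical tables that share the same index labels
features = pd.DataFrame({"chisq": [10.0, 80.0, 20.0]}, index=["g1", "g2", "g3"])
preds = pd.DataFrame({"ew": [100.0, 50.0, 900.0]}, index=["g1", "g2", "g3"])

# Check that the rows line up before sharing a mask between the tables
same_rows = features.index.equals(preds.index)

good_fit = features["chisq"] < 50
small_ew = preds["ew"] < 500

# Series masks are combined label by label on the shared index
both = good_fit & small_ew

kept_features = features[both]
kept_preds = preds.loc[both, "ew"]

print(same_rows)
print(kept_features)
print(kept_preds)
```

Using .values on a mask, as the cell above does, drops the index, so the combination is then purely positional; checking that the two indexes match first guards against silently pairing the wrong rows.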

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (10, 10))

ax = axes.flatten()

cols= ['burst', 'dust:Av', 'stellar_mass', 'sfr']

for column, a in zip(cols, ax):
    sb.boxplot(x = good_fits_data[column], ax = a)  # pass the column by keyword; newer seaborn treats a positional argument as `data`

plt.show()

# 7. Data Analysis

A core part of our work is taking a data set and learning from it. That is the essence of data analysis: we use plots and summary statistics to uncover the trends hidden in the data. The next few cells cover some common data analysis plots and techniques so that you can get familiar with what it means to analyze a data set.

## Histograms

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (10, 10))

ax = axes.flatten()

cols= ['burst', 'dust:Av', 'stellar_mass', 'sfr']

for column, a in zip(cols, ax):
    a.hist(good_fits_data[column], bins = 30, color = 'purple')
    a.set_xlabel(column)

ax[0].set_ylabel('Counts')
ax[-2].set_ylabel('Counts')

plt.show()

## Scatter Plots

In [None]:
fig, axes = plt.subplots(2, 2, figsize = (10, 10))

ax = axes.flatten()

cols= ['burst', 'dust:Av', 'stellar_mass', 'sfr']

for column, a in zip(cols, ax):
    a.scatter(good_fits_data[column], y_pred, color = 'purple', alpha = 0.5, s = 10)
    a.set_xlabel(column)

ax[0].set_ylabel('EW_r')
ax[-2].set_ylabel('EW_r')


plt.show()

## Boxplots

In [None]:
good_fits_data[['burst', 'dust:Av', 'stellar_mass', 'sfr']].plot(kind = 'box', 
                                                                 subplots = True, 
                                                                 layout = (2, 2), 
                                                                 figsize = (10, 10))
plt.show()

## Corner Plots

In [None]:
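# pairplot creates its own figure, so a preceding plt.figure() would only add an empty one;
# the size is set with `height` (inches per panel) instead
sb.pairplot(good_fits_data[['burst', 'dust:Av', 'stellar_mass', 'sfr']], corner = True, height = 2.5)
plt.show()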

## Correlation Matrix Plot

In [None]:

corr_matrix = good_fits_data[['burst', 'dust:Av', 'stellar_mass', 'sfr']].corr()

In [None]:
plt.figure(figsize = (10, 5))
sb.heatmap(corr_matrix, annot = True)
plt.show()
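
To get a feel for how to read the numbers in a correlation matrix, here is a small sketch on made-up columns whose relationships are known in advance (`toy` and its column names are hypothetical).

```python
import pandas as pd

# Hypothetical columns with known linear relationships
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "double_x": [2.0, 4.0, 6.0, 8.0],  # rises exactly with x
    "minus_x": [4.0, 3.0, 2.0, 1.0],   # falls exactly as x rises
})

# corr() computes pairwise Pearson correlation by default
corr = toy.corr()

print(corr)
```

A value near +1 means two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship. The matrix is symmetric, with ones on the diagonal.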