# Pandas Primer: Data Exploration and Analysis

## Introduction

This notebook was created by [Jupyter AI](https://github.com/jupyterlab/jupyter-ai) with the following prompt:

> /generate a demo of how to use pandas

This notebook demonstrates pandas, an open-source Python library for data manipulation and analysis. It covers importing the library, loading data into a DataFrame, exploring and cleaning the data, manipulating and analyzing it, and visualizing it with Matplotlib and seaborn.

## Loading Data into a DataFrame

In [2]:
# Import the necessary libraries
import pandas as pd
import sqlite3

In [3]:
# Load data from a CSV file
print("Loading data from a CSV file:")
df_csv = pd.read_csv('data.csv')
print(df_csv.head())

Loading data from a CSV file:


FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'
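
The error above occurs because `data.csv` does not exist in the working directory. As a minimal sketch, assuming a small made-up dataset (the column names and rows below are hypothetical, not part of the original notebook), the same `pd.read_csv` call can be exercised against an in-memory buffer:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the missing 'data.csv'
csv_text = """name,age,city
Alice,30,Paris
Bob,25,Berlin
Carol,35,Madrid
"""

# read_csv accepts any file-like object, so a StringIO buffer works in place of a path
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head())
```

Swapping the buffer for a real path restores the original behavior once the file is available.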

In [None]:
# Load data from an Excel spreadsheet
print("\nLoading data from an Excel spreadsheet:")
df_excel = pd.read_excel('data.xlsx', sheet_name='Sheet1')
print(df_excel.head())

In [None]:
# Load data from a SQL database
print("\nLoading data from a SQL database:")
conn = sqlite3.connect('database.db')
df_sql = pd.read_sql_query("SELECT * FROM table_name", conn)
print(df_sql.head())
conn.close()  # Close the database connection
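
The cell above needs an existing `database.db` that contains a table called `table_name`. One way to try the same pattern without that file is an in-memory SQLite database. This is a sketch only: the table schema and rows are invented for illustration.

```python
import sqlite3

import pandas as pd

# In-memory database; the schema and rows are hypothetical sample data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?)",
    [(1, 1.5), (2, 2.5), (3, 4.0)],
)
conn.commit()

try:
    # Run the query and load the result set into a DataFrame
    df_sql = pd.read_sql_query("SELECT * FROM table_name", conn)
finally:
    # Close the connection even if the query raises
    conn.close()

print(df_sql)
```

Wrapping the query in `try`/`finally` keeps the connection from leaking if the query fails.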

In [None]:
# Print the types of the DataFrames
print("\nData types of the DataFrames:")
print(df_csv.dtypes)
print(df_excel.dtypes)
print(df_sql.dtypes)

## Exploring and Cleaning Data

In [None]:
# Import the pandas library
import pandas as pd

In [None]:
# Load the dataset into a Pandas DataFrame
df = pd.read_csv('dataset.csv')

In [None]:
# Explore the structure of the DataFrame
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
print("Missing values:\n", missing_values)

In [None]:
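# Handle missing values: drop rows missing 'column_name' first, then fill the rest.
# (Filling first would leave nothing for dropna to remove.)
df = df.dropna(subset=['column_name'])
df = df.fillna(0)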
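
The order of `fillna` and `dropna` matters: once every missing value has been filled, `dropna` has nothing left to remove. The sketch below shows one possible ordering on a small hypothetical DataFrame (`df_demo` and its `score` column are made up for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns
df_demo = pd.DataFrame({
    "column_name": ["a", None, "c", "d"],
    "score": [1.0, np.nan, 3.0, np.nan],
})

# Count missing values per column before cleaning
missing_before = df_demo.isnull().sum()
print(missing_before)

# Drop rows that lack the key column, then fill the remaining numeric gaps
cleaned = df_demo.dropna(subset=["column_name"])
cleaned = cleaned.fillna({"score": 0})
print(cleaned)
```

Passing a dict to `fillna` limits the fill to the named columns, which avoids writing `0` into text columns by accident.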

In [None]:
# Perform basic data cleaning
df['column_name'] = df['column_name'].str.strip().str.lower().replace('old_value', 'new_value')
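
Note that `Series.replace` matches whole cell values, while `Series.str.replace` substitutes substrings. A short sketch on made-up values (the strings below are hypothetical) shows the difference:

```python
import pandas as pd

# Hypothetical raw values with stray whitespace and mixed case
raw = pd.Series(["  Old_Value ", "KEEP", "old_value_2"])

# Normalize whitespace and case first
normalized = raw.str.strip().str.lower()

# Series.replace only swaps cells that equal 'old_value' exactly
whole_match = normalized.replace("old_value", "new_value")

# Series.str.replace substitutes the substring wherever it occurs
substring_match = normalized.str.replace("old_value", "new_value", regex=False)

print(whole_match.tolist())
print(substring_match.tolist())
```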

In [None]:
# Inspect the cleaned data
print(df.head())
df.info()  # info() prints its summary directly and returns None
print(df.describe())

## Data Manipulation and Analysis

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('dataset.csv')

In [None]:
print("First few rows of the DataFrame:")
print(df.head())

In [None]:
print("\nDescriptive statistics of the DataFrame:")
print(df.describe())

In [None]:
print("\nFiltering the data:")
filtered_df = df[df['column_name'] > 100]
print(filtered_df)
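
The filter above is boolean indexing: the comparison produces a True/False Series, and indexing with it keeps only the True rows. A minimal sketch on hypothetical data (the `label` column is invented here) also shows how two conditions can be combined:

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "column_name": [50, 150, 200, 100],
    "label": ["a", "b", "c", "d"],
})

# The comparison yields a boolean mask, one entry per row
mask = df_demo["column_name"] > 100
filtered_df = df_demo[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses
combined_df = df_demo[(df_demo["column_name"] > 100) & (df_demo["label"] != "c")]

print(filtered_df)
print(combined_df)
```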

In [None]:
print("\nSorting the data:")
sorted_df = df.sort_values(['column1', 'column2'], ascending=[False, True])
print(sorted_df)
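
With a list passed to `ascending`, each sort key gets its own direction. The sketch below uses a small hypothetical DataFrame to make the resulting row order visible:

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "column1": [1, 2, 2, 1],
    "column2": ["b", "a", "c", "a"],
})

# column1 descending, ties broken by column2 ascending
sorted_df = df_demo.sort_values(["column1", "column2"], ascending=[False, True])
print(sorted_df)

# The original index labels travel with the rows; reset them if a clean 0..n-1 index is wanted
sorted_reset = sorted_df.reset_index(drop=True)
print(sorted_reset)
```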

In [None]:
print("\nGrouping and aggregating the data:")
grouped_df = df.groupby('column_name')['numeric_column'].mean()
print(grouped_df)
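
`groupby` splits the rows by key, applies the aggregation to each group, and combines the results into a Series indexed by the key. As a sketch on made-up data, `agg` can compute several statistics per group in one call:

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    "column_name": ["x", "y", "x", "y", "x"],
    "numeric_column": [1.0, 2.0, 3.0, 6.0, 5.0],
})

# One mean per group, indexed by the group key
means = df_demo.groupby("column_name")["numeric_column"].mean()
print(means)

# Several aggregations at once produce a DataFrame with one column per statistic
summary = df_demo.groupby("column_name")["numeric_column"].agg(["mean", "count"])
print(summary)
```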

In [None]:
print("\nPerforming calculations on the data:")
df['new_column'] = df['column1'] + df['column2']
print(df)

## Visualizing Data with Pandas

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load sample dataset
df = pd.read_csv('sample_data.csv')

In [None]:
# Explore the dataset
print(df.head(5))
df.info()  # info() prints its summary directly and returns None

In [None]:
# Create a histogram to visualize the distribution of a numeric column
fig, ax = plt.subplots(figsize=(8, 6))
df['numeric_column'].hist(ax=ax)
ax.set_title('Distribution of Numeric Column')
ax.set_xlabel('Value')
ax.set_ylabel('Frequency')
plt.show()

In [None]:
# Create a scatter plot to visualize the relationship between two numeric columns
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df['numeric_column_1'], df['numeric_column_2'])
ax.set_title('Scatter Plot of Two Numeric Columns')
ax.set_xlabel('Column 1')
ax.set_ylabel('Column 2')
plt.show()

In [None]:
# Create a bar plot to visualize the count of a categorical column
fig, ax = plt.subplots(figsize=(8, 6))
df['categorical_column'].value_counts().plot(kind='bar', ax=ax)
ax.set_title('Bar Plot of Categorical Column')
ax.set_xlabel('Category')
ax.set_ylabel('Count')
plt.show()

In [None]:
# Create a line plot to visualize the trend of a numeric column over time
fig, ax = plt.subplots(figsize=(8, 6))
df.plot(x='time_column', y='numeric_column', kind='line', ax=ax)
ax.set_title('Line Plot of Numeric Column over Time')
ax.set_xlabel('Time')
ax.set_ylabel('Value')
plt.show()
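
A line plot over time usually reads best when the time column holds real datetimes and the rows are in chronological order. The sketch below shows one way to prepare the data before plotting, assuming the time column contains ISO-formatted date strings (the values are hypothetical):

```python
import pandas as pd

# Hypothetical data with dates stored as strings, out of order
df_demo = pd.DataFrame({
    "time_column": ["2024-03-01", "2024-01-01", "2024-02-01"],
    "numeric_column": [3, 1, 2],
})

# Convert strings to datetimes so the axis is spaced by actual time
df_demo["time_column"] = pd.to_datetime(df_demo["time_column"])

# Sort chronologically so the line is drawn left to right
df_demo = df_demo.sort_values("time_column").reset_index(drop=True)
print(df_demo)
```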

In [None]:
# Create a heatmap to visualize the correlation between numeric columns
fig, ax = plt.subplots(figsize=(8, 6))
corr_matrix = df.corr(numeric_only=True)  # ignore non-numeric columns, which raise an error in pandas 2.0+
sns.heatmap(corr_matrix, annot=True, cmap='YlOrRd', ax=ax)
ax.set_title('Heatmap of Correlation Matrix')
plt.show()
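
The correlation matrix behind the heatmap can be inspected on its own. This sketch uses a small hypothetical DataFrame with one text column to show that `numeric_only=True` restricts the computation to numeric columns:

```python
import pandas as pd

# Hypothetical data: b rises with a, c falls with a, label is text
df_demo = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
    "label": ["w", "x", "y", "z"],
})

# Pearson correlation over the numeric columns only; 'label' is left out
corr_matrix = df_demo.corr(numeric_only=True)
print(corr_matrix)
```

The resulting matrix is what `sns.heatmap` receives in the cell above.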