# Pandas Tutorial and Examples

This notebook is a hands-on guide to pandas, Python's data manipulation library. It moves from creating and loading data through cleaning, aggregation, time-series operations, visualization, merging, and export, so you can get started with data analysis.

## 1. Introduction to Pandas

First, let's import pandas and other libraries we'll use throughout this tutorial.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# For pretty plotting
%matplotlib inline
plt.style.use('ggplot')

## 2. Creating Pandas Objects

Pandas has two primary data structures: Series (1-dimensional) and DataFrame (2-dimensional).

In [None]:
# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Series object:")
print(s)

In [None]:
# Creating a DataFrame from dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 62000, 85000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")
df

In [None]:
# Creating a DataFrame with date range
dates = pd.date_range('20230101', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print("DataFrame with date index:")
df2
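
To make the relationship between the two structures concrete, here is a minimal sketch using made-up values (the names `a`, `b`, and `frame` are illustrative only). It shows that arithmetic between Series aligns on index labels, and that each DataFrame column is itself a Series.

```python
import pandas as pd

# Hypothetical toy data, not part of the tutorial's dataset
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Arithmetic aligns on index labels; labels found in only one Series give NaN
total = a + b
print(total)

# Building a DataFrame from Series also aligns them on the union of labels
frame = pd.DataFrame({'a': a, 'b': b})
print(type(frame['a']))  # each column is a Series
```

Label alignment is why the index matters: two objects with the same values but different labels do not combine position by position.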

## 3. Loading Data

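Pandas can read data from many file formats, including CSV, Excel, and JSON. As an example, let's load a COVID-19 dataset from a CSV file. The code below assumes `covid_19_data.csv` is in the working directory.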

In [None]:
# Loading data from CSV file
covid_df = pd.read_csv('covid_19_data.csv')

# Display the first 5 rows
covid_df.head()
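
`read_csv` can also select columns and parse dates while loading. The sketch below reads from an in-memory string so it runs without the file; the rows are made up and only mimic the column layout used in this tutorial, and `sample_df` is a hypothetical name.

```python
import io
import pandas as pd

# Made-up sample rows that mimic the tutorial's column names
csv_text = """SNo,ObservationDate,Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
1,01/22/2020,StateA,CountryA,1/22/2020 17:00,1.0,0.0,0.0
2,01/22/2020,,CountryB,1/22/2020 17:00,1.0,0.0,0.0
"""

sample_df = pd.read_csv(
    io.StringIO(csv_text),
    # Load only the columns we need
    usecols=['ObservationDate', 'Province/State', 'Country/Region', 'Confirmed'],
    # Convert this column to datetime during the read
    parse_dates=['ObservationDate'],
)
print(sample_df.dtypes)
```

Parsing dates at load time is one option; converting later with `pd.to_datetime` gives the same kind of column.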

## 4. Basic DataFrame Operations

Let's explore some basic operations we can perform on our DataFrames.

In [None]:
# Viewing basic information about the DataFrame
print("DataFrame shape:", covid_df.shape)
print("\nColumn names:")
print(covid_df.columns.tolist())
print("\nData types:")
print(covid_df.dtypes)
print("\nSummary statistics:")
covid_df.describe()

In [None]:
# Accessing specific columns
covid_df[['Country/Region', 'Confirmed', 'Deaths', 'Recovered']].head()

In [None]:
# Accessing rows by position
print("Rows 2-4:")
covid_df.iloc[2:5]

In [None]:
# Quick check for missing values
covid_df.isnull().sum()
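
Filtering rows by condition is another basic operation worth seeing in isolation. The following sketch rebuilds the small `df` from section 2 so it is self-contained; the thresholds and result names are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 62000, 85000],
})

# Boolean mask: keep rows where Age is over 30
over_30 = df[df['Age'] > 30]

# .loc filters rows by condition and picks columns by label in one step
names_high_salary = df.loc[df['Salary'] >= 70000, ['Name', 'Salary']]

# query() expresses a filter as a string
young_low = df.query('Age < 30 and Salary < 65000')

print(over_30)
print(names_high_salary)
print(young_low)
```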

## 5. Data Cleaning and Transformation

Often, the first step in data analysis is cleaning and transforming data into a usable format.

In [None]:
# Create a copy of the dataset to work with
df_clean = covid_df.copy()

# Renaming columns for better clarity
df_clean = df_clean.rename(columns={
    'ObservationDate': 'Date',
    'Province/State': 'State',
    'Country/Region': 'Country'
})

# Converting date to datetime format
df_clean['Date'] = pd.to_datetime(df_clean['Date'])

# Drop unnecessary columns
df_clean = df_clean.drop(columns=['SNo', 'Last Update'])

df_clean.head()
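
When date strings may be malformed, `pd.to_datetime` can be told the expected format and asked to turn unparseable values into `NaT` instead of raising. This is a sketch on made-up strings; the month/day/year format is an assumption about how the dates are written, not something confirmed for the dataset.

```python
import pandas as pd

# Made-up date strings: one valid, one impossible date, one non-date
raw_dates = pd.Series(['01/22/2020', '02/30/2020', 'unknown'])

# format= states the expected layout; errors='coerce' yields NaT on failure
parsed = pd.to_datetime(raw_dates, format='%m/%d/%Y', errors='coerce')

# Count how many values could not be parsed
n_bad = int(parsed.isna().sum())
print(parsed)
print("Unparseable dates:", n_bad)
```

Counting the `NaT` values afterwards is a quick way to see whether the assumed format fits the data.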

In [None]:
# Handling missing values
# Fill NaN values in State column with 'Unknown'
df_clean['State'] = df_clean['State'].fillna('Unknown')

# Fill NaN values in numeric columns with 0
df_clean[['Confirmed', 'Deaths', 'Recovered']] = df_clean[['Confirmed', 'Deaths', 'Recovered']].fillna(0)

# Verify no more missing values
df_clean.isnull().sum()

## 6. Data Analysis and Aggregation

Pandas excels at grouping, summarizing, and transforming data.

In [None]:
# Let's group by country and date to get a daily summary
country_date_summary = df_clean.groupby(['Country', 'Date'])[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()
country_date_summary.head()
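
A `groupby` can also compute several differently named statistics in one pass using named aggregation. The sketch below uses a made-up `toy` frame; the output column names are hypothetical.

```python
import pandas as pd

# Made-up data with the same column names as the tutorial
toy = pd.DataFrame({
    'Country': ['A', 'A', 'B', 'B', 'B'],
    'Confirmed': [10, 15, 3, 4, 8],
    'Deaths': [1, 2, 0, 0, 1],
})

# Each keyword is an output column: (input column, aggregation function)
summary = toy.groupby('Country').agg(
    total_confirmed=('Confirmed', 'sum'),
    max_confirmed=('Confirmed', 'max'),
    total_deaths=('Deaths', 'sum'),
).reset_index()

print(summary)
```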

In [None]:
# Find the countries with the highest confirmed cases
latest_date = df_clean['Date'].max()
latest_data = df_clean[df_clean['Date'] == latest_date]

top_countries = latest_data.groupby('Country')['Confirmed'].sum().sort_values(ascending=False).head(10)
top_countries

In [None]:
# Calculate death rate for each country (Deaths/Confirmed)
# The counts are cumulative, so use the latest snapshot rather than summing over all dates
country_totals = latest_data.groupby('Country')[['Confirmed', 'Deaths', 'Recovered']].sum()
country_totals['Death_Rate'] = (country_totals['Deaths'] / country_totals['Confirmed'] * 100).round(2)

# Show countries with at least 1000 confirmed cases, sorted by death rate
high_cases_countries = country_totals[country_totals['Confirmed'] >= 1000].sort_values('Death_Rate', ascending=False)
high_cases_countries.head(10)
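
If a ratio like this is computed without a minimum-cases filter, rows with zero confirmed cases produce `NaN` or infinite values. One way to guard against that, sketched here on made-up totals, is to replace zero denominators with `NaN` first.

```python
import numpy as np
import pandas as pd

# Made-up totals; country 'B' has no confirmed cases
totals = pd.DataFrame(
    {'Confirmed': [1000, 0, 200], 'Deaths': [50, 0, 10]},
    index=['A', 'B', 'C'],
)

# Zero denominators become NaN, so the rate is NaN rather than inf
safe_confirmed = totals['Confirmed'].replace(0, np.nan)
totals['Death_Rate'] = (totals['Deaths'] / safe_confirmed * 100).round(2)

print(totals)
```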

In [None]:
# Getting time series data for specific countries
countries_of_interest = ['US', 'India', 'Brazil', 'UK', 'Russia']
timeline_data = country_date_summary[country_date_summary['Country'].isin(countries_of_interest)]

# Pivot the data to have countries as columns and dates as index
confirmed_timeline = timeline_data.pivot(index='Date', columns='Country', values='Confirmed')
confirmed_timeline.tail()
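
`pivot` requires each index/column pair to appear only once. When the long-format data may contain duplicates, `pivot_table` with an aggregation function is an alternative. The sketch below demonstrates the difference on made-up rows.

```python
import pandas as pd

# Made-up long-format data; ('2020-03-01', 'A') appears twice
long_df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-03-01', '2020-03-01', '2020-03-01', '2020-03-02']),
    'Country': ['A', 'A', 'B', 'A'],
    'Confirmed': [5, 7, 2, 20],
})

# pivot() raises ValueError when an index/column pair is duplicated
try:
    long_df.pivot(index='Date', columns='Country', values='Confirmed')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# pivot_table() combines duplicates with the given aggregation
wide = long_df.pivot_table(index='Date', columns='Country',
                           values='Confirmed', aggfunc='sum')
print(wide)
```

Combinations with no data, such as country `B` on the second date, appear as `NaN` in the wide table.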

## 7. Advanced Operations

Let's explore some more advanced pandas operations.

In [None]:
# Calculate daily new cases instead of cumulative
# First, get a single country to demonstrate
# .copy() makes us_data an independent DataFrame, avoiding SettingWithCopyWarning below
us_data = country_date_summary[country_date_summary['Country'] == 'US'].sort_values('Date').copy()

# Calculate daily changes
us_data['New_Confirmed'] = us_data['Confirmed'].diff()
us_data['New_Deaths'] = us_data['Deaths'].diff()
us_data['New_Recovered'] = us_data['Recovered'].diff()

# Replace NaN with 0 (first row will be NaN due to diff())
us_data = us_data.fillna({'New_Confirmed': 0, 'New_Deaths': 0, 'New_Recovered': 0})

# Show results
us_data.head(10)
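
The same daily-change calculation can be applied to every country at once by combining `groupby` with `diff`, so that differences never cross from one country into the next. This is a sketch on made-up cumulative counts.

```python
import pandas as pd

# Made-up cumulative counts for two countries over three days
cumulative = pd.DataFrame({
    'Country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Date': pd.to_datetime(['2020-03-01', '2020-03-02', '2020-03-03'] * 2),
    'Confirmed': [1, 4, 9, 10, 10, 15],
}).sort_values(['Country', 'Date'])

# diff() runs within each country; each country's first row has no
# previous day, so it is NaN and is filled with 0 here
cumulative['New_Confirmed'] = (
    cumulative.groupby('Country')['Confirmed'].diff().fillna(0)
)
print(cumulative)
```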

In [None]:
# Using rolling window functions to smooth data (7-day moving average)
# The first 6 rows are NaN because a full 7-day window is not yet available
us_data['7d_avg_new_cases'] = us_data['New_Confirmed'].rolling(window=7).mean()
us_data['7d_avg_new_deaths'] = us_data['New_Deaths'].rolling(window=7).mean()

# Show results
us_data[['Date', 'New_Confirmed', '7d_avg_new_cases', 'New_Deaths', '7d_avg_new_deaths']].tail(10)
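
By default a rolling window returns `NaN` until it has a full window of observations. The `min_periods` parameter relaxes this. The sketch below compares both behaviours on a made-up series.

```python
import pandas as pd

# Made-up daily counts
daily = pd.Series([7, 14, 21, 28, 35, 42, 49, 56])

# Default: a value appears only once 7 observations are available
strict = daily.rolling(window=7).mean()

# min_periods=1: average whatever is available so far (up to 7 values)
lenient = daily.rolling(window=7, min_periods=1).mean()

print(pd.DataFrame({'daily': daily, 'strict': strict, 'lenient': lenient}))
```

The lenient version has no gaps, but its early values average fewer than seven days and are therefore noisier.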

## 8. Data Visualization with Pandas

Pandas integrates well with matplotlib to create visualizations directly from DataFrames.

In [None]:
# Plotting top 10 countries by confirmed cases
plt.figure(figsize=(12, 6))
top_countries.plot(kind='bar', color='skyblue')
plt.title('Top 10 Countries by COVID-19 Confirmed Cases')
plt.xlabel('Country')
plt.ylabel('Confirmed Cases')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

In [None]:
# Time series plot of confirmed cases for selected countries
plt.figure(figsize=(14, 7))
for country in countries_of_interest:
    if country in confirmed_timeline.columns:
        plt.plot(confirmed_timeline.index, confirmed_timeline[country], label=country)

plt.title('COVID-19 Confirmed Cases Over Time')
plt.xlabel('Date')
plt.ylabel('Confirmed Cases')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Visualizing daily new cases and 7-day moving average for US
plt.figure(figsize=(14, 7))
plt.bar(us_data['Date'], us_data['New_Confirmed'], color='skyblue', alpha=0.6, label='Daily New Cases')
plt.plot(us_data['Date'], us_data['7d_avg_new_cases'], color='red', linewidth=2, label='7-day Moving Average')
plt.title('US Daily New COVID-19 Cases')
plt.xlabel('Date')
plt.ylabel('Number of Cases')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 9. Merging and Joining DataFrames

Pandas provides powerful tools for combining multiple datasets.

In [None]:
# Create a simple DataFrame with country population data (approximate figures)
# Country names must match the spelling in the COVID data exactly,
# otherwise the inner merge below silently drops them
population_data = pd.DataFrame({
    'Country': ['US', 'India', 'Brazil', 'UK', 'Russia', 'France', 'Italy', 'Germany', 'Spain', 'China'],
    'Population': [331000000, 1380000000, 212000000, 67000000, 146000000,
                   67000000, 60000000, 83000000, 47000000, 1400000000]
})

population_data

In [None]:
# Merge COVID data with population data
# First, get the latest COVID data for each country
latest_country_data = latest_data.groupby('Country')[['Confirmed', 'Deaths', 'Recovered']].sum().reset_index()

# Merge with population data
merged_data = pd.merge(latest_country_data, population_data, on='Country', how='inner')

# Calculate cases per million population
merged_data['Cases_Per_Million'] = (merged_data['Confirmed'] / merged_data['Population'] * 1000000).round(2)
merged_data['Deaths_Per_Million'] = (merged_data['Deaths'] / merged_data['Population'] * 1000000).round(2)

# Sort by cases per million
merged_data.sort_values('Cases_Per_Million', ascending=False)
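
An inner merge silently drops rows whose key has no match, for instance when the two tables spell a country name differently. A left merge with `indicator=True` makes such mismatches visible. The data and names in this sketch are made up.

```python
import pandas as pd

# Made-up tables; 'Mainland C' and 'C' are meant to be the same country
cases = pd.DataFrame({
    'Country': ['A', 'B', 'Mainland C'],
    'Confirmed': [100, 200, 300],
})
population = pd.DataFrame({
    'Country': ['A', 'B', 'C'],
    'Population': [1_000_000, 2_000_000, 3_000_000],
})

# how='left' keeps every row of cases; indicator=True adds a '_merge' column
checked = pd.merge(cases, population, on='Country', how='left', indicator=True)

# Rows marked 'left_only' found no partner in the population table
unmatched = checked.loc[checked['_merge'] == 'left_only', 'Country'].tolist()
print(unmatched)
```

Checking `unmatched` before switching to an inner merge helps confirm that no country is lost to a naming difference.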

## 10. Exporting Data

Pandas makes it easy to export data to various formats.

In [None]:
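# Uncomment the lines below to write files to the working directory

# Export merged data to CSV
# merged_data.to_csv('covid_with_population.csv', index=False)

# Export to Excel (requires an Excel engine such as openpyxl)
# merged_data.to_excel('covid_with_population.xlsx', index=False)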

# Export to JSON
# merged_data.to_json('covid_with_population.json', orient='records')
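
To preview what an export would contain without writing a file, `to_csv` and `to_json` return a string when no path is given. The sketch below uses a made-up frame named `out`.

```python
import io
import json
import pandas as pd

# Made-up result table
out = pd.DataFrame({'Country': ['A', 'B'], 'Cases_Per_Million': [12.5, 30.0]})

# With no path argument, both methods return the serialized text
csv_text = out.to_csv(index=False)
records = json.loads(out.to_json(orient='records'))

# Reading the CSV text back confirms the round trip
round_trip = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(records)
```

With `orient='records'`, each row becomes one JSON object keyed by column name.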

## Summary

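In this notebook, we've covered the core pandas workflow:

1. Creating Series and DataFrames
2. Reading data from CSV files
3. Inspecting, selecting, and filtering data
4. Cleaning and transforming data
5. Using groupby, aggregations, and pivots
6. Computing daily changes and rolling averages
7. Plotting with matplotlib
8. Merging DataFrames
9. Exporting data to CSV, Excel, and JSON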

These operations form the foundation of data manipulation with pandas and can be combined in many ways to handle complex data tasks.