
### Handling Time Series Data in Pandas

Time series data is data recorded at successive points in time, such as daily stock prices, monthly sales figures, or hourly weather readings.

Pandas is well suited to time series work because it provides tools to:

* **Work with dates and times easily:**
  You can convert strings or numbers into datetime objects so pandas knows they represent dates or times.

* **Index data by time:**
  You can set a datetime column as the index, which makes it simple to select data from a specific day, month, or year.

* **Resample data:**
  If you have data recorded every minute but want to see daily or monthly summaries, pandas lets you easily group and aggregate the data by different time intervals.

* **Handle missing dates:**
  Pandas can reindex a series onto a complete date range and fill the resulting gaps (for example by forward or backward fill), so the data is continuous for analysis.

* **Shift and lag data:**
  You can move data forward or backward in time to compare values at different time points, useful for calculating changes or growth.

* **Rolling and expanding windows:**
  Pandas lets you calculate moving averages over a sliding (rolling) window, or cumulative statistics such as running sums over an expanding window, to spot trends or smooth out noise.
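
Two of the items above, filling in missing dates and expanding windows, can be sketched as follows. This is a minimal illustration under assumed data: the readings and the names `observed`, `complete`, `filled`, and `running_total` are hypothetical.

```python
import pandas as pd

# Hypothetical readings with 2023-01-03 missing from the index
observed = pd.Series(
    [10.0, 12.0, 15.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04']),
)

# Build the complete daily calendar and reindex onto it;
# the missing day appears as a NaN row
full_index = pd.date_range(observed.index.min(), observed.index.max(), freq='D')
complete = observed.reindex(full_index)

# Carry the last known value forward into the gap
filled = complete.ffill()

# Expanding window: running total over all rows seen so far
running_total = filled.expanding().sum()
print(running_total)
```

Forward fill is only one possible choice here; whether it is appropriate depends on what a missing reading means in the data at hand.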



In [1]:
import pandas as pd
from datetime import datetime

# 1. Create a DatetimeIndex from a list of date strings
index = pd.DatetimeIndex(['2014-07-04', '2014-08-04', '2015-07-04', '2015-08-04'])

# 2. Create a Series with data indexed by these dates
data = pd.Series([0, 1, 2, 3], index=index)

# Print the series
print("Original Series with DatetimeIndex:")
print(data)
# The series values are linked to specific dates, which act as index labels

# 3. Use date-based slicing to select data between two dates
print("\nData between '2014-07-04' and '2015-07-04':")
print(data['2014-07-04':'2015-07-04'])
# This returns all rows from start date up to and including the end date

# 4. Select all data for a given year by passing just the year string
print("\nData for the year 2015:")
print(data['2015'])
# Pandas automatically filters all dates in 2015

# 5. Creating datetime objects in different formats using pd.to_datetime()
dates = pd.to_datetime([
    datetime(2015, 7, 3),          # datetime object
    '4th of July, 2015',           # string with natural language format
    '2015-Jul-6',                  # string with year-month-day format
    '07-07-2015',                  # string with month-day-year format
    '20150708'                     # compact string with no separators
])

print("\nDatetimeIndex created by pd.to_datetime():")
print(dates)
# Converts various date representations to a uniform DatetimeIndex

# 6. Convert the DatetimeIndex to a PeriodIndex with daily frequency
periods = dates.to_period('D')
print("\nConverted to PeriodIndex with daily frequency:")
print(periods)
# PeriodIndex stores time intervals rather than single timestamps

# 7. Calculate time differences (TimedeltaIndex) by subtracting the first date from all dates
time_deltas = dates - dates[0]
print("\nTimedeltaIndex showing difference from first date:")
print(time_deltas)
# This shows the elapsed time (in days) between each date and the first date


Original Series with DatetimeIndex:
2014-07-04    0
2014-08-04    1
2015-07-04    2
2015-08-04    3
dtype: int64

Data between '2014-07-04' and '2015-07-04':
2014-07-04    0
2014-08-04    1
2015-07-04    2
dtype: int64

Data for the year 2015:
2015-07-04    2
2015-08-04    3
dtype: int64

DatetimeIndex created by pd.to_datetime():
DatetimeIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
               '2015-07-08'],
              dtype='datetime64[ns]', freq=None)

Converted to PeriodIndex with daily frequency:
PeriodIndex(['2015-07-03', '2015-07-04', '2015-07-06', '2015-07-07',
             '2015-07-08'],
            dtype='period[D]')

TimedeltaIndex showing difference from first date:
TimedeltaIndex(['0 days', '1 days', '3 days', '4 days', '5 days'], dtype='timedelta64[ns]', freq=None)
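
When the input format is known, or some inputs may be invalid, `pd.to_datetime()` can be given more guidance. The sketch below uses made-up input strings and assumes a day-first source format purely for illustration.

```python
import pandas as pd

# An explicit format removes day/month ambiguity: this is 4 July, not 7 April
parsed = pd.to_datetime('04/07/2015', format='%d/%m/%Y')

# errors='coerce' turns unparseable entries into NaT instead of raising
coerced = pd.to_datetime(['2015-07-04', 'not a date'], errors='coerce')

print(parsed)
print(coerced)
```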


In [12]:
# Resampling and frequency conversion
import pandas as pd
import numpy as np

# Create a time series with daily data starting from Jan 1, 2023, 10 days total
dates = pd.date_range('2023-01-01', periods=10, freq='D')
data1 = pd.Series(np.arange(10), index=dates)

print("\nOriginal Daily Data:")
print(data1)

# Resample the daily data to 3-day periods and sum the values in each 3-day block
resampled = data1.resample('3D').sum()

print("\n\nresampled data \n",resampled)

# Use Case:
# Resampling is useful when you want to aggregate time series data at different time intervals,
# e.g., converting daily stock prices to weekly or monthly sums, averages, etc.



Original Daily Data:
2023-01-01    0
2023-01-02    1
2023-01-03    2
2023-01-04    3
2023-01-05    4
2023-01-06    5
2023-01-07    6
2023-01-08    7
2023-01-09    8
2023-01-10    9
Freq: D, dtype: int64


resampled data 
 2023-01-01     3
2023-01-04    12
2023-01-07    21
2023-01-10     9
Freq: 3D, dtype: int64
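
The weekly and monthly aggregation mentioned in the use case might look like the sketch below. The data is hypothetical, and the `'W'` and `'MS'` aliases are one possible choice; other anchors (for example month-end labels) may suit a given report better.

```python
import numpy as np
import pandas as pd

# Hypothetical daily values covering January and February 2023 (59 days)
daily = pd.Series(
    np.arange(59),
    index=pd.date_range('2023-01-01', periods=59, freq='D'),
)

# Weekly mean: 'W' bins end on Sunday and are labelled by that end date
weekly_mean = daily.resample('W').mean()

# Monthly total: 'MS' labels each bin by the first day of the month
monthly_sum = daily.resample('MS').sum()

print(weekly_mean.head())
print(monthly_sum)
```

Because 2023-01-01 is a Sunday, the first weekly bin contains only that single day.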


In [13]:
# Time zone handling
# Localize the naive datetime index to UTC (assign timezone info)
data_utc = data1.tz_localize('UTC')
print("printing data utc\n",data_utc)

# Convert the time series from UTC to US Eastern Time
data_est = data_utc.tz_convert('US/Eastern')
print("printing data est\n",data_est)

# Use Case:
# Timezone localization is important when working with data from multiple regions,
# ensuring all timestamps refer to the correct local time for accurate analysis.


printing data utc
 2023-01-01 00:00:00+00:00    0
2023-01-02 00:00:00+00:00    1
2023-01-03 00:00:00+00:00    2
2023-01-04 00:00:00+00:00    3
2023-01-05 00:00:00+00:00    4
2023-01-06 00:00:00+00:00    5
2023-01-07 00:00:00+00:00    6
2023-01-08 00:00:00+00:00    7
2023-01-09 00:00:00+00:00    8
2023-01-10 00:00:00+00:00    9
Freq: D, dtype: int64
printing data est
 2022-12-31 19:00:00-05:00    0
2023-01-01 19:00:00-05:00    1
2023-01-02 19:00:00-05:00    2
2023-01-03 19:00:00-05:00    3
2023-01-04 19:00:00-05:00    4
2023-01-05 19:00:00-05:00    5
2023-01-06 19:00:00-05:00    6
2023-01-07 19:00:00-05:00    7
2023-01-08 19:00:00-05:00    8
2023-01-09 19:00:00-05:00    9
Freq: D, dtype: int64
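
Time zone conversion also accounts for daylight saving time. The following sketch, using two hypothetical timestamps and the `America/New_York` zone name, shows that the UTC offset differs between winter and summer.

```python
import pandas as pd

# Hypothetical naive timestamps: one in winter, one in summer
naive = pd.Series(
    [1, 2],
    index=pd.to_datetime(['2023-01-15 12:00', '2023-07-15 12:00']),
)

# Declare that the naive timestamps are UTC, then view them in New York local time
utc = naive.tz_localize('UTC')
new_york = utc.tz_convert('America/New_York')
print(new_york)

# The UTC offset (in hours) differs between the rows because of daylight saving time
offset_hours = [ts.utcoffset().total_seconds() / 3600 for ts in new_york.index]
print(offset_hours)
```

`tz_localize` only attaches a zone to naive timestamps, while `tz_convert` changes how already-aware timestamps are displayed; the underlying instants stay the same.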


In [None]:
# Rolling windows and moving statistics
# Calculate a rolling mean over a window of 3 observations (3 days here, since the data is daily)
rolling_mean = data1.rolling(window=3).mean()
print("rolling mean\n",rolling_mean)
# The first two values are NaN because a full window of 3 observations is not yet available

# Use Case:

# Rolling functions help smooth noisy data, calculate moving averages in finance,
# or detect trends by aggregating data over a sliding window of time.


rolling mean
 2023-01-01    NaN
2023-01-02    NaN
2023-01-03    1.0
2023-01-04    2.0
2023-01-05    3.0
2023-01-06    4.0
2023-01-07    5.0
2023-01-08    6.0
2023-01-09    7.0
2023-01-10    8.0
Freq: D, dtype: float64
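
An integer window counts observations, not elapsed time. On irregularly spaced data the two can differ, as this sketch with hypothetical values suggests; passing an offset string such as `'3D'` asks for a time-based window instead.

```python
import pandas as pd

# Hypothetical irregular series: note the gap between 2023-01-02 and 2023-01-05
idx = pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-05', '2023-01-06'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Count-based: always the last 3 rows, however far apart they are
by_count = s.rolling(window=3).sum()

# Time-based: rows whose timestamps fall within the 3 days ending at each row
by_time = s.rolling('3D').sum()

print(by_count)
print(by_time)
```

With a time-based window the left edge is excluded by default, so the row for 2023-01-05 does not include the value from 2023-01-02.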


In [16]:
# Shift the time series forward by 2 periods (days), adding NaNs at the start
shifted_forward = data1.shift(2)

# Shift the time series backward by 2 periods (days), adding NaNs at the end
shifted_backward = data1.shift(-2)

print(shifted_forward)
print(shifted_backward)

# Use Case:
# Shifting is useful to compare current data points with previous/future points,
# e.g., calculating day-over-day changes or creating lagged variables for modeling.


2023-01-01    NaN
2023-01-02    NaN
2023-01-03    0.0
2023-01-04    1.0
2023-01-05    2.0
2023-01-06    3.0
2023-01-07    4.0
2023-01-08    5.0
2023-01-09    6.0
2023-01-10    7.0
Freq: D, dtype: float64
2023-01-01    2.0
2023-01-02    3.0
2023-01-03    4.0
2023-01-04    5.0
2023-01-05    6.0
2023-01-06    7.0
2023-01-07    8.0
2023-01-08    9.0
2023-01-09    NaN
2023-01-10    NaN
Freq: D, dtype: float64
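
The day-over-day change mentioned in the use case can be sketched with `shift`, alongside the built-in shortcuts `diff()` and `pct_change()`. The prices are hypothetical.

```python
import pandas as pd

# Hypothetical daily closing prices
prices = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range('2023-01-01', periods=3, freq='D'),
)

# Absolute change versus the previous day, written out with shift
change = prices - prices.shift(1)

# The same calculation via diff(), and the relative change via pct_change()
change_via_diff = prices.diff()
pct = prices.pct_change()

print(change)
print(pct)
```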


In [17]:
# Copy the series and introduce missing values (NaN) at positions 3 and 4 (2023-01-04 and 2023-01-05)
data_with_nan = data1.copy()
data_with_nan.iloc[3:5] = np.nan

# Forward fill missing data with last valid observation
ffill = data_with_nan.ffill()

# Backward fill missing data with next valid observation
bfill = data_with_nan.bfill()

print(ffill)
print(bfill)

# Use Case:
# Time series often have gaps; filling missing values ensures continuity for analysis,
# like filling sensor readings or financial data where missing days occur.


2023-01-01    0.0
2023-01-02    1.0
2023-01-03    2.0
2023-01-04    2.0
2023-01-05    2.0
2023-01-06    5.0
2023-01-07    6.0
2023-01-08    7.0
2023-01-09    8.0
2023-01-10    9.0
Freq: D, dtype: float64
2023-01-01    0.0
2023-01-02    1.0
2023-01-03    2.0
2023-01-04    5.0
2023-01-05    5.0
2023-01-06    5.0
2023-01-07    6.0
2023-01-08    7.0
2023-01-09    8.0
2023-01-10    9.0
Freq: D, dtype: float64
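
Forward and backward fill repeat a neighbouring value. Two alternatives that are sometimes preferable are linear interpolation and a fill with a `limit`, sketched here on a small hypothetical series.

```python
import numpy as np
import pandas as pd

# Hypothetical series with a two-day gap
s = pd.Series(
    [0.0, 1.0, 2.0, np.nan, np.nan, 5.0],
    index=pd.date_range('2023-01-01', periods=6, freq='D'),
)

# Linear interpolation draws a straight line across the gap
interpolated = s.interpolate()

# Forward fill at most one consecutive missing value; longer gaps stay partly NaN
limited = s.ffill(limit=1)

print(interpolated)
print(limited)
```

Limiting the fill keeps long outages visible instead of silently masking them with stale values.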


In [18]:
# Generate a sequence of hourly timestamps for one day starting Jan 1, 2023
# ('h' is the hourly alias; the uppercase 'H' spelling is deprecated in recent pandas versions)
hourly_dates = pd.date_range('2023-01-01', periods=24, freq='h')
print(hourly_dates)

# Use Case:
# Useful for creating a complete timeline or index for data collection at regular intervals,
# e.g., hourly weather observations or system logs.


DatetimeIndex(['2023-01-01 00:00:00', '2023-01-01 01:00:00',
               '2023-01-01 02:00:00', '2023-01-01 03:00:00',
               '2023-01-01 04:00:00', '2023-01-01 05:00:00',
               '2023-01-01 06:00:00', '2023-01-01 07:00:00',
               '2023-01-01 08:00:00', '2023-01-01 09:00:00',
               '2023-01-01 10:00:00', '2023-01-01 11:00:00',
               '2023-01-01 12:00:00', '2023-01-01 13:00:00',
               '2023-01-01 14:00:00', '2023-01-01 15:00:00',
               '2023-01-01 16:00:00', '2023-01-01 17:00:00',
               '2023-01-01 18:00:00', '2023-01-01 19:00:00',
               '2023-01-01 20:00:00', '2023-01-01 21:00:00',
               '2023-01-01 22:00:00', '2023-01-01 23:00:00'],
              dtype='datetime64[ns]', freq='h')
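
`pd.date_range` also accepts an explicit `end` and other frequency aliases. A brief sketch, with dates chosen only for illustration:

```python
import pandas as pd

# Business days (Monday to Friday) between two dates, both inclusive
business_days = pd.date_range(start='2023-01-02', end='2023-01-13', freq='B')
print(business_days)

# Four timestamps spaced 15 minutes apart
quarter_hours = pd.date_range('2023-01-01', periods=4, freq='15min')
print(quarter_hours)
```

The `'B'` alias skips weekends only; public holidays would need a custom calendar.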



In [19]:
# Create a range of monthly periods starting Jan 2023 (4 months)
periods = pd.period_range('2023-01', periods=4, freq='M')
print(periods)

# Add one month to each period to shift them forward
periods_plus_one = periods + 1
print(periods_plus_one)

# Use Case:
# Period objects represent time intervals rather than points.
# Ideal for monthly, quarterly, or yearly reporting periods, and performing interval arithmetic.


PeriodIndex(['2023-01', '2023-02', '2023-03', '2023-04'], dtype='period[M]')
PeriodIndex(['2023-02', '2023-03', '2023-04', '2023-05'], dtype='period[M]')
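
Because a period is an interval, it exposes its boundaries and can be converted back to timestamps. A small sketch, using February 2023 as an arbitrary example:

```python
import pandas as pd

feb = pd.Period('2023-02', freq='M')

print(feb.start_time)      # first instant of the month
print(feb.end_time)        # last instant of the month
print(feb.days_in_month)   # number of days in the interval

# Convert a PeriodIndex to timestamps; the start of each period is used by default
months = pd.period_range('2023-01', periods=4, freq='M')
month_starts = months.to_timestamp()
print(month_starts)
```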


In [23]:
# Create DataFrame with dates and random values
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=5),
    'value': np.random.rand(5)
})

# Set 'date' column as index to allow time-based indexing and slicing
df = df.set_index('date')
print(df)

# Use Case:
# A DatetimeIndex enables label-based filtering and slicing by date,
# and is what resample() operates on by default.




               value
date                
2023-01-01  0.567489
2023-01-02  0.457893
2023-01-03  0.521085
2023-01-04  0.145141
2023-01-05  0.640011
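
Once the dates are the index, rows can be selected with date strings through `.loc`. The sketch below uses fixed, hypothetical values instead of random ones so the result is easy to follow.

```python
import pandas as pd

# Hypothetical frame spanning the end of January and the start of February
df = pd.DataFrame(
    {'value': [0, 1, 2, 3, 4]},
    index=pd.date_range('2023-01-30', periods=5, freq='D', name='date'),
)

# Partial string indexing: every row in February 2023
february = df.loc['2023-02']

# Date slicing: both endpoints are included
window = df.loc['2023-01-31':'2023-02-01']

print(february)
print(window)
```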


In [24]:
dates = pd.date_range('2023-01-01', periods=3, freq='D')
df = pd.DataFrame({'date': dates})

# Extract year, month, and day components from 'date' column using dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day

print(df)

# Use Case:
# Accessors let you break down timestamps into meaningful components for grouping,
# filtering, or feature engineering (e.g., seasonal effects in models).


        date  year  month  day
0 2023-01-01  2023      1    1
1 2023-01-02  2023      1    2
2 2023-01-03  2023      1    3
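
The extracted components can then drive grouping. A minimal sketch with hypothetical values, grouping by month number and adding a weekday label:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.date_range('2023-01-30', periods=5, freq='D'),
    'value': [1, 2, 3, 4, 5],
})

# Weekday name as a feature column
df['weekday'] = df['date'].dt.day_name()

# Total value per calendar month (1 = January, 2 = February)
monthly_total = df.groupby(df['date'].dt.month)['value'].sum()

print(df)
print(monthly_total)
```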


In [25]:
# Two series with overlapping but not identical dates
s1 = pd.Series([1, 2, 3], index=pd.date_range('2023-01-01', periods=3))
s2 = pd.Series([4, 5], index=pd.date_range('2023-01-02', periods=2))

# Adding aligns data by dates and results in NaN where no overlap exists
result = s1 + s2
print(result)

# Use Case:
# When combining multiple time series, pandas aligns them on the datetime index,
# which is essential for correct arithmetic, merges, or joins.


2023-01-01    NaN
2023-01-02    6.0
2023-01-03    8.0
Freq: D, dtype: float64
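
If a missing date should count as zero instead of producing NaN, `Series.add` with `fill_value` is one option. The sketch reuses the same two series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=pd.date_range('2023-01-01', periods=3))
s2 = pd.Series([4, 5], index=pd.date_range('2023-01-02', periods=2))

# Dates present in only one series are treated as 0 in the other
total = s1.add(s2, fill_value=0)
print(total)
```

Whether zero is the right stand-in depends on the data; for quantities such as prices, leaving the NaN in place is usually safer.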


In [26]:
# Create time series with irregular date intervals (not daily)
irregular_dates = pd.to_datetime(['2023-01-01', '2023-01-03', '2023-01-10'])
irregular_series = pd.Series([10, 20, 30], index=irregular_dates)

print(irregular_series)

# Use Case:
# Many real-world time series are irregular (e.g., transaction logs or event data),
# and pandas handles these smoothly without requiring regular intervals.


2023-01-01    10
2023-01-03    20
2023-01-10    30
dtype: int64
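
With irregular data it is often useful to measure the gaps between observations, or to look up the most recent value as of a given moment. A sketch under the same hypothetical dates:

```python
import pandas as pd

events = pd.Series(
    [10, 20, 30],
    index=pd.to_datetime(['2023-01-01', '2023-01-03', '2023-01-10']),
)

# Time elapsed since the previous observation
gaps = events.index.to_series().diff()
print(gaps)

# Most recent value at or before 2023-01-05 (requires a sorted index)
latest = events.asof(pd.Timestamp('2023-01-05'))
print(latest)
```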


In [None]:
# Create a DataFrame with 1000 rows, each representing a date starting from 2023-01-01
df = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=1000),  # generates 1000 consecutive dates
    # Create a 'category' column with repeated values: pattern ['A', 'B', 'C', 'A', 'B'] repeated 200 times
    'category': ['A', 'B', 'C', 'A', 'B'] * 200  
})

# Convert 'category' column to categorical dtype
# This converts the column from object (string) type to 'category' type,
# which is more memory efficient because pandas stores the unique categories once,
# and then uses integer codes internally for the repeated values.
df['category'] = df['category'].astype('category')

# Display DataFrame info to see memory usage and data types.
# df.info() prints its report directly and returns None, so it is not wrapped in print().
df.info()

# Use Case:
# Categorical dtype is useful when a column holds a small set of string values repeated many times.
# It typically reduces memory usage compared with storing the raw strings repeatedly.
# It can also speed up operations such as grouping and sorting, because
# pandas works internally with the integer codes instead of strings.
# This optimization is common in large datasets with categorical data such as gender, product type, or status codes.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   date      1000 non-null   datetime64[ns]
 1   category  1000 non-null   category      
dtypes: category(1), datetime64[ns](1)
memory usage: 9.0 KB
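
The memory effect can be checked directly with `memory_usage(deep=True)`. This sketch compares the same hypothetical labels stored as plain strings and as a categorical; exact byte counts vary by pandas version and string storage, so none are quoted here.

```python
import pandas as pd

# The same repeated labels as in the DataFrame above
labels = pd.Series(['A', 'B', 'C', 'A', 'B'] * 200)
as_category = labels.astype('category')

# deep=True includes the memory held by the string objects themselves
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# The unique categories are stored once; each row holds a small integer code
print(as_category.cat.categories)
print(as_category.cat.codes.head())
```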
