# Introduction to pandas: data structures, manipulation tools & time series

This jupyter notebook can be found on my GitHub account: https://github.com/mbonnemaison/Learning-Python

## Introduction
Sources:
- Information to install pandas, introduce pandas and the user guide: https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html
- Python for Data Analysis by Wes McKinney (2nd edition used here) - Chapter 5 (Introduction), Chapter 11 (Time Series)
- Video on Data Analysis (see the comments for links to the part you're interested in): https://www.youtube.com/watch?v=r-uOLxNrNk8&list=RDCMUC8butISFwT-Wl7EV0hUK0BQ&index=3

### **pandas** is a Python library that facilitates the analysis of data organized in tables.
What we'll talk about:
- Data structures: Series and DataFrame
- Data manipulation tools designed to make data cleaning and analysis fast and easy in Python
- Introduction to Time Series

In [2]:
import pandas as pd

## **Series**: a pandas data structure
We can think of a Series as a table with 1 column and an index. The purpose of the index is similar to the automatic index assigned to lists.
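
To make the comparison with lists concrete, here is a minimal sketch. The values and the labels 'a', 'b', 'c' are made up for illustration and are not part of the city data used below.

```python
import pandas as pd

values = [10, 20, 30]            # a plain Python list: positions 0, 1, 2
s = pd.Series(values)            # a Series gets a default integer index 0, 1, 2

print(values[1])                 # element at position 1 of the list
print(s[1])                      # element with index label 1 of the Series
print(list(s.index))             # [0, 1, 2]

# Unlike a list, the index can be replaced by custom labels
s.index = ['a', 'b', 'c']
print(s['b'])
```
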
### **Build a Series**

In [None]:
population = pd.Series([8336817, 3979576, 2693976, 2320268, 1680992, 
                        1584064, 1547253, 1423851, 1343573, 1021795])

In [None]:
population

In [None]:
#Set index
population.index = ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
                    'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose']

In [None]:
population

In [None]:
population.values

In [None]:
population.index

In [None]:
#Look at the top 5 rows:
population.head()

In [None]:
#Look at the last 5 rows:
population.tail()

### **Series indexing**

In [None]:
population

In [None]:
population['Houston']

In [None]:
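#Select by position (plain integer keys on a labeled index are deprecated in recent pandas versions)
population.iloc[5]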

In [None]:
population[1:5]

In [None]:
population[['New York City', 'San Jose']]

### **Series operations**

In [None]:
population / 1000000

In [None]:
(population / 1000000).round(3)

### **Add a row to a Series**

In [None]:
population["Austin"] = 978908

In [None]:
population

## **DataFrame**: a pandas data structure
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (e.g. numeric, string, boolean, etc).

A DataFrame has both a row and a column index. It can be thought of as a dictionary of Series all sharing the same index.
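
The "dictionary of Series" view can be sketched as follows. This is only an illustration with two cities; the variable names `state`, `population` and `df` are chosen here and do not refer to the objects built in the cells below.

```python
import pandas as pd

# Two Series sharing the same index (city names)
population = pd.Series([8336817, 3979576], index=['New York City', 'Los Angeles'])
state = pd.Series(['New York', 'California'], index=['New York City', 'Los Angeles'])

# A DataFrame built from a dictionary of Series: keys become column names
df = pd.DataFrame({'State': state, 'Population': population})

# Selecting a column returns a Series that carries the shared row index
col = df['Population']
print(type(col).__name__)            # Series
print(df.index.equals(col.index))    # True
```
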

### **Build a DataFrame**

In [None]:
us_cities = pd.DataFrame(
{"State": ['New York', 'California', 'Illinois', 'Texas', 'Arizona', 
           'Pennsylvania', 'Texas', 'California', 'Texas', 'California', 'Texas'],
 "Population": [8336817, 3979576, 2693976, 2320268, 1680992, 
                1584064, 1547253, 1423851, 1343573, 1021795, 978908],
"Density(/sq mi)": [28317, 8484, 11900, 3613, 3120, 11683, 3238, 4325, 3866, 5777, 3031]},
index = ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
                    'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin']
)

In [None]:
us_cities

In [None]:
us_cities.head()

### **Get initial information on this DataFrame**

In [None]:
us_cities.info()

In [None]:
us_cities.describe()

In [None]:
us_cities.columns

In [None]:
us_cities.index

In [None]:
us_cities.values

### **Create a new column**

In [None]:
#Create a Series with the date each city was incorporated
incorporated = pd.Series(['9/2/1664', '4/4/1850', '3/4/1837', '6/5/1837', '2/25/1881', '10/25/1701', '6/5/1837', '3/27/1850', '2/2/1856', '3/27/1850', '12/27/1839'],
                         index = ['New York City', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 
                    'Philadelphia', 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin'])

In [None]:
us_cities["Incorporated"] = incorporated

In [None]:
us_cities

### **Create a new row**

In [None]:
jacksonville = pd.Series(['Florida', 911507, 1178, '2/9/1832'], 
                         index = ["State", "Population", "Density(/sq mi)", "Incorporated"],
                        name = "Jacksonville")

In [None]:
jacksonville

In [None]:
#DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead
pd.concat([us_cities, jacksonville.to_frame().T])

### **Look at specific columns in the DataFrame**

In [None]:
us_cities[["Population", "State"]]

In [None]:
#Look at specific columns and rows
us_cities[["Population", "State"]][2:5]

### **Look at specific rows in the DataFrame**
Two indexers exist: **loc** (selection by label or boolean mask) and **iloc** (selection by integer position)

In [None]:
us_cities.loc[us_cities["State"] == "Texas"]

In [None]:
us_cities.loc[us_cities["Population"] > 2000000]

In [None]:
us_cities.iloc[1:5,[0,2]]
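
The cells above use loc with boolean masks. As a complement, here is a minimal sketch of label-based selection with loc next to position-based selection with iloc. The small `cities` table is a stand-in built from three rows of the data above, so that the example runs on its own.

```python
import pandas as pd

# Small stand-in for us_cities (three rows taken from the table above)
cities = pd.DataFrame(
    {"State": ['California', 'Illinois', 'Texas'],
     "Population": [3979576, 2693976, 2320268]},
    index=['Los Angeles', 'Chicago', 'Houston'])

# loc selects by label: row 'Chicago', column 'Population'
by_label = cities.loc['Chicago', 'Population']

# iloc selects by integer position: second row, second column
by_position = cities.iloc[1, 1]

print(by_label, by_position)

# Label slices with loc include the end label; iloc slices exclude the end position
print(len(cities.loc['Los Angeles':'Chicago']))   # 2 rows
print(len(cities.iloc[0:1]))                      # 1 row
```
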

### **Rename column heads**

In [None]:
us_cities.info()

In [None]:
us_cities = us_cities.rename(columns={"Density(/sq mi)":"Density"})

In [None]:
us_cities

### **Remove columns we don't need in the DataFrame**

In [None]:
us_cities.drop(['Density'], axis = 'columns')

### **Remove rows we don't need in the DataFrame**
Use the index value to remove the unwanted row

In [None]:
us_cities = us_cities.drop('New York City')

In [None]:
us_cities

### **Sort values in a DataFrame**

In [None]:
us_cities.sort_values(by = ["Population"], ascending=True)

### **Save the DataFrame in a CSV file**
CSV stands for comma-separated values.

In [None]:
us_cities.to_csv("top10.csv")
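
To check what ends up in the file, the following sketch writes a small table and reads it back. It assumes an in-memory buffer (`io.StringIO`) instead of a real file such as "top10.csv", and the `cities` table is a two-row stand-in for us_cities.

```python
import io
import pandas as pd

cities = pd.DataFrame({"Population": [3979576, 2693976]},
                      index=['Los Angeles', 'Chicago'])

buffer = io.StringIO()          # in-memory stand-in for a file on disk
cities.to_csv(buffer)           # the index is written as the first column
print(buffer.getvalue())

buffer.seek(0)                  # rewind before reading
restored = pd.read_csv(buffer, index_col=0)   # use the first column as the index again
print(restored)
```

Passing `index_col=0` restores the city names as the index; without it they would come back as an ordinary column.
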

## **Introduction to Time Data Types**
Anything that is observed or measured at many points in time forms a time series. Some of the elementary data structures for working with time series data are:

- **Timestamps**: specific instants in time
- **Timedeltas**: durations, i.e. the difference between two timestamps
- **Periods**: fixed spans of time such as the month of March 2021 or the year 2020
    - *Periods* can be thought of as special cases of intervals, which are indicated by a start and end timestamp
    - A *fixed frequency* time series consists of data points that occur at regular intervals, like every 5 minutes.
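
As a rough illustration of these types, the sketch below creates one object of each kind. The dates and durations are arbitrary values chosen for the example.

```python
import pandas as pd

stamp = pd.Timestamp('2021-03-15 08:30:00')   # a specific instant
delta = pd.Timedelta(days=2, hours=3)         # a duration
period = pd.Period('2021-03', freq='M')       # the whole month of March 2021

print(stamp + delta)                          # a later Timestamp
print(period.start_time <= stamp <= period.end_time)  # True: stamp falls in March 2021

# A fixed-frequency index: one data point every 5 minutes
idx = pd.date_range(start='2021-03-15 08:30:00', periods=4, freq='5min')
print(idx)
```
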

### ***Timestamp***
Python provides date and time functionality in the **datetime** module, which contains three main types:

- **date**: day, month, year
- **time**: hours, minutes, seconds, microseconds
- **datetime**: components of both date and time

***Timestamp*** is the pandas equivalent of Python's datetime.datetime object and is interchangeable with it in most cases. It is the type used for the entries that make up a DatetimeIndex and other time-series-oriented data structures in pandas.
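
The interchangeability can be checked directly. This is a minimal sketch; the date is arbitrary and the variable names `py_dt` and `stamp` are chosen for the example.

```python
from datetime import datetime
import pandas as pd

py_dt = datetime(2021, 4, 5, 23, 12, 34)
stamp = pd.Timestamp(py_dt)          # build a Timestamp from a datetime

print(isinstance(stamp, datetime))   # True: Timestamp subclasses datetime.datetime
print(stamp == py_dt)                # True: same instant
print(stamp.to_pydatetime())         # back to a plain datetime.datetime
```
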

### **Quick note on Python's datetime module to generate datetime objects**

In [None]:
from datetime import datetime
#From module import class

In [None]:
now = datetime.now()

In [None]:
now

In [None]:
now.year

In [None]:
now.time()

In [None]:
mydate = datetime(2021,4,5,23,12,34)

In [None]:
mydate

In [None]:
mydate.time()

### **Convert strings to Datetimes**
**Method #1**: Strings can be converted to dates using **datetime.strptime**.

Note: Information on format can be found here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [None]:
datetime.strptime('2021-4-25 5:46:23', '%Y-%m-%d %H:%M:%S')

In [None]:
date_list_str = ['2021-03-14', '2020-12-25', '2025-02-19']

In [None]:
[datetime.strptime(x, '%Y-%m-%d') for x in date_list_str]

In [None]:
data = pd.read_csv("data3months.csv", sep = '\t')

In [None]:
data["Date"] = [datetime.strptime(x, '%Y-%m-%d %H:%M:%S') for x in data["Date"]]

In [None]:
data["Date"][0]

In [None]:
data3 = pd.read_csv("data3months-Copy1.csv", sep = '\t')

In [None]:
data3

In [None]:
data3["Date"] = [datetime.strptime(x, '%Y-%m-%d %H:%M:%S') for x in data3["Date"]]

**Method #2**: Strings can be converted to dates using **pd.to_datetime**.

**pandas** is generally oriented toward working with arrays of dates, whether used as an axis index or a column in a DataFrame. The **to_datetime** function parses many different kinds of date representations.

**pandas** can also handle missing values.

Note: Information on format can be found here: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

In [None]:
pd.to_datetime('2021-02-19 22:45:56')

In [None]:
#The format must describe the whole string, including the time part
pd.to_datetime('2021-02-19 22:45:56', format = '%Y-%m-%d %H:%M:%S')

In [None]:
date_list_str2 = ['2021-03-14', '2020-12-25', '2025-02-19', '2021-04-14', None]

In [None]:
pd.to_datetime(date_list_str2)

**NaT** means Not a Time

In [None]:
data3 = pd.read_csv("data3months.csv", sep = '\t')

In [None]:
data3["Date"] = pd.to_datetime(data3["Date"])

In [None]:
data3["Date"][0]

In [None]:
data4 = pd.read_csv("data3months-Copy1.csv", sep = '\t')

In [None]:
data4

In [None]:
data4["Date"] = pd.to_datetime(data4["Date"])

In [None]:
data4["Date"][23]

### **Convert index into DatetimeIndex using to_datetime**

In [None]:
data = pd.Series([1,2,3,4], 
                 index = ['2021/03/03 5:5:5', '2021/03/04 13:9:15', '2021/03/05 2:8:14', '2021/03/06 23:55:10'])

In [None]:
data

In [None]:
data.index

In [None]:
data.index = pd.to_datetime(data.index, utc = True)

In [None]:
data.index

In [None]:
data

### **Convert the time zones of a DatetimeIndex**

In [None]:
data2 = data.tz_convert('US/Pacific')

In [None]:
data2

### **Convert incorporated dates into timestamps using to_datetime**

In [None]:
us_cities

In [None]:
us_cities.info()

In [None]:
us_cities['Incorporated']

In [None]:
us_cities["Incorporated"] = pd.to_datetime(us_cities["Incorporated"], format= '%m/%d/%Y')

In [None]:
us_cities.info()

In [None]:
us_cities['Incorporated']['Los Angeles']

### **Data manipulations with Timestamps**

In [None]:
us_cities.loc[us_cities['Incorporated'] > '1850']

In [None]:
us_cities.sort_values(by = ["Incorporated"], ascending=True)

### **Generate Timestamps at fixed frequency**

In [None]:
ts = pd.Series(range(1,51), index = pd.date_range(start = '1/1/2021', periods = 50, freq = '4h'))

In [None]:
ts.index[0]

### ***Timedeltas***
A Timedelta represents the temporal difference between two datetime objects.
It exists both in Python (datetime.timedelta) and in pandas (pd.Timedelta).
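
The sketch below compares the two kinds of timedelta. The durations and the start date are arbitrary values chosen for the example.

```python
from datetime import timedelta
import pandas as pd

py_delta = timedelta(days=3, hours=2)       # standard-library duration
pd_delta = pd.Timedelta(days=3, hours=2)    # pandas duration

print(pd_delta == py_delta)                 # True: same length of time
print(isinstance(pd_delta, timedelta))      # True: Timedelta subclasses timedelta

# Either kind can be added to a pandas Timestamp
start = pd.Timestamp('2021-03-20 00:00:00')
print(start + py_delta)
print(start + pd_delta)
```
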
### **Timedelta operations**
**Add time to Timestamps**

In [None]:
ts = pd.to_datetime('2021/3/23 23:20:00') + pd.Timedelta(days=-3)

In [None]:
ts

In [None]:
us_cities['Incorporated']

In [None]:
us_cities['Incorporated'] + pd.Timedelta(days=-3)

In [None]:
#Another way to add time: 
from pandas.tseries.offsets import Day
us_cities['Incorporated'] + Day(4)

**Difference between Timestamps generates a Timedelta**

In [None]:
delta = pd.to_datetime('2021/3/23 23:20:00') - pd.to_datetime('2021/3/20 2:34:14')

In [None]:
delta

**Adding Timedeltas**

In [None]:
td1 = pd.Timedelta(weeks = 3, days = 3, hours = 3)
td2 = pd.Timedelta(weeks = 1, days = 1, hours = 1)

In [None]:
td1+td2

In [None]:
td1 + delta

### **Convert strings (indicating time only) to Timedelta**

In [None]:
pd.to_timedelta('23:23:23')

In [None]:
#A string containing a date is not a duration, so this call is expected to raise a ValueError
pd.to_timedelta('2020-12-02 23:23:23')

In [None]:
mylist = ['2021/03/03', '2021/03/04', '2021/03/05', '2021/03/06']

In [None]:
mylist

## **Going further**
### **What's the difference between pd.to_datetime and pd.Timestamp?**

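**pd.Timestamp** builds a single Timestamp and accepts a **tz** argument to specify a time zone. **pd.to_datetime** also converts lists, Series and indexes. For an ordinary date string, both return the same time-zone-naive Timestamp.

The two can differ for the string 'now': Timestamp('now') returns the current time in your local time zone, while in older pandas versions to_datetime('now') returned the current time in the UTC time zone.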

In [None]:
pd.Timestamp('2021-02-19 22:45:56')

In [None]:
pd.Timestamp('2021-02-19 22:45:56', tz = "US/Eastern")

In [None]:
pd.to_datetime('2021-02-19 22:45:56')

In [None]:
#The format must describe the whole string, including the time part
pd.to_datetime('2021-02-19 22:45:56', format = '%Y-%m-%d %H:%M:%S')

In [4]:
pd.Timestamp('now')

Timestamp('2021-04-03 20:43:38.685449')

In [None]:
pd.to_datetime('now')

In [None]:
pd.Timestamp('now', tz = 'UTC')

In [5]:
pd.Timestamp('now', tz = 'US/Hawaii')

Timestamp('2021-04-03 14:43:47.892236-1000', tz='US/Hawaii')

In [None]:
#To find out which time zone to enter
import pytz
pytz.common_timezones[-5:]

In [None]:
stamp = pd.Timestamp('2021-02-19 3:45:56')

In [None]:
stamp_utc = stamp.tz_localize('UTC')

In [None]:
stamp_utc

In [None]:
stamp_utc.tz_convert('US/Mountain')

In [None]:
stamp2 = pd.to_datetime('2021-02-19 3:45:56')

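In [None]:
#Localize to UTC first: a time-zone-naive Timestamp cannot be converted directly
stamp2_utc = stamp2.tz_localize('UTC')
stamp2_utc.tz_convert('US/Mountain')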

### ***Time periods***
Time Periods correspond to a specific length of time between a start and end timestamp.
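
The start and end of a period can be inspected directly. This is a minimal sketch using a monthly period; the month is an arbitrary choice for the example.

```python
import pandas as pd

march = pd.Period('2021-03', freq='M')   # the month of March 2021

print(march.start_time)                  # first instant of the period
print(march.end_time)                    # last instant of the period

# Adding an integer shifts the period by that many units of its frequency
print(march + 1)                         # April 2021
```
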

### **Generate Time Periods**

In [None]:
tp = pd.Period(2020, freq='A-OCT')
#A-OCT means an annual period ending in October: this one runs from 11/1/2019 to 10/31/2020.

In [None]:
tp

### **Generate Time Periods at fixed frequency**

In [None]:
tp2 = pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')

In [None]:
tp2

### **Convert Time Periods to Timestamps**

In [None]:
tp2.to_timestamp()