# Introduction to Pandas
Pandas is an open-source, BSD-licensed library that provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
- It is built on top of the NumPy library and supports fast data analysis, cleaning, and preparation.
- It is designed for both performance and productivity when working with tabular data sets.

In [1]:
import pandas as pd
import numpy as np

## Pandas data structures
**Series**: A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.

In [2]:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

**DataFrame**: A DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).

In [3]:
dates = pd.date_range('20230101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
2023-01-01,-0.133119,-1.635943,1.249918,-0.874151
2023-01-02,-1.093506,-1.735453,0.891404,0.932589
2023-01-03,-2.539071,-0.698046,1.236502,-1.027586
2023-01-04,-0.263186,-0.855431,0.337262,-1.098142
2023-01-05,1.150051,0.870804,-0.389544,-0.280743
2023-01-06,-1.671889,2.051701,0.5395,0.832865


## Loading and saving files

In [5]:
df_csv = pd.read_csv('data/cardio.csv')
df_csv.head()

Unnamed: 0,Duration,Date,Pulse,Maxpulse,Calories
0,60,'2020/12/01',110,130,409.1
1,60,'2020/12/02',117,145,479.0
2,60,'2020/12/03',103,135,340.0
3,45,'2020/12/04',109,175,282.4
4,45,'2020/12/05',117,148,406.0


In [7]:
df_excel = pd.read_excel('data/onei_data.xls')
df_excel

Unnamed: 0,AÑOS CENSALES,Total,Hombres,Mujeres,Tasa Hombres,Tasa Mujeres
0,1774,171620,...,...,,...
1,1792,273979,...,...,25.0,...
2,1817,553033,...,...,27.0,...
3,1827,704487,...,...,24.1,...
4,1841,1007624,...,...,26.0,...
5,1861,1366232,...,...,15.1,...
6,1877,1509291,...,...,6.2,...
7,1887,1609075,...,...,6.4,...
8,1899,1572797,815205,757592,-0.2,1076
9,1907,2048980,1074882,974098,33.1,1103


In [8]:
df_csv.to_excel('data/cardio.xlsx', sheet_name='MySheet')
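
The same round trip works for CSV. The sketch below is an illustration only: it uses a small hypothetical frame (not the real `cardio.csv`) and an in-memory buffer instead of a file on disk, so it runs without any data folder.

```python
import io

import pandas as pd

# Hypothetical sample with a few of the columns seen in cardio.csv
sample = pd.DataFrame({
    'Duration': [60, 45],
    'Pulse': [110, 109],
    'Calories': [409.1, 282.4],
})

# Write to an in-memory text buffer; index=False leaves out the row labels
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV text back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
```

If `index=False` is omitted, the row labels are written as an unnamed first column, which `read_csv` then loads as a column called `Unnamed: 0` unless `index_col=0` is passed.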

## Data inspection and exploration

In [14]:
# First rows
df.head(2)

Unnamed: 0,A,B,C,D
2023-01-01,-0.133119,-1.635943,1.249918,-0.874151
2023-01-02,-1.093506,-1.735453,0.891404,0.932589


In [15]:
# Last rows
df.tail(2)

Unnamed: 0,A,B,C,D
2023-01-05,1.150051,0.870804,-0.389544,-0.280743
2023-01-06,-1.671889,2.051701,0.5395,0.832865


In [12]:
# Description
df.describe()

Unnamed: 0,A,B,C,D
count,6.0,6.0,6.0,6.0
mean,-0.758453,-0.333728,0.644174,-0.252528
std,1.295835,1.497109,0.62492,0.925805
min,-2.539071,-1.735453,-0.389544,-1.098142
25%,-1.527293,-1.440815,0.387822,-0.989228
50%,-0.678346,-0.776738,0.715452,-0.577447
75%,-0.165636,0.478592,1.150228,0.554463
max,1.150051,2.051701,1.249918,0.932589


## Selecting DataFrame elements

In [16]:
# Selecting a single column, which yields a Series
df['A']

2023-01-01   -0.133119
2023-01-02   -1.093506
2023-01-03   -2.539071
2023-01-04   -0.263186
2023-01-05    1.150051
2023-01-06   -1.671889
Freq: D, Name: A, dtype: float64

In [17]:
# Selecting rows by slicing
df[0:3]

Unnamed: 0,A,B,C,D
2023-01-01,-0.133119,-1.635943,1.249918,-0.874151
2023-01-02,-1.093506,-1.735453,0.891404,0.932589
2023-01-03,-2.539071,-0.698046,1.236502,-1.027586


In [26]:
# Selecting columns
df[['A', 'C']]

Unnamed: 0,A,C
2023-01-01,-0.133119,1.249918
2023-01-02,-1.093506,0.891404
2023-01-03,-2.539071,1.236502
2023-01-04,-0.263186,0.337262
2023-01-05,1.150051,-0.389544
2023-01-06,-1.671889,0.5395
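
Rows can also be selected by condition. This is a minimal sketch on a small hypothetical frame (fixed values rather than the random `df` above, so the result is predictable); the names `frame`, `positive`, and `both` are illustrative.

```python
import pandas as pd

# Hypothetical data with fixed values
frame = pd.DataFrame({'A': [-1.0, 2.0, -3.0, 4.0], 'B': [10, 20, 30, 40]})

# A comparison yields a boolean Series; indexing with it keeps the True rows
positive = frame[frame['A'] > 0]

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = frame[(frame['A'] > 0) & (frame['B'] > 20)]
```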


## Indexing
An index in pandas works like an address: it holds the labels of the rows or columns, and pandas uses those labels to locate data quickly. Index labels can be numbers, dates, or strings.

### Types of indexes

In [27]:
# Default numeric index
data = {'Name': ['John', 'Anna', 'Peter', 'Linda'],
        'Age': [28, 34, 29, 32]}

# Creating a DataFrame without specifying an index
df_default_index = pd.DataFrame(data)

df_default_index

Unnamed: 0,Name,Age
0,John,28
1,Anna,34
2,Peter,29
3,Linda,32


In [28]:
# Custom index
df_custom_index = pd.DataFrame(data).set_index('Name')
df_custom_index

Name,Age
John,28
Anna,34
Peter,29
Linda,32


In [30]:
# When creating the DataFrame, you can provide an index
names = ['a', 'b', 'b', 'd']
df_custom_index_2 = pd.DataFrame(data, index=names)
df_custom_index_2

Unnamed: 0,Name,Age
a,John,28
b,Anna,34
b,Peter,29
d,Linda,32


In [31]:
# DatetimeIndex
dates = ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']
df_dates = pd.DataFrame(data, index=pd.to_datetime(dates))

df_dates

Unnamed: 0,Name,Age
2023-01-01,John,28
2023-01-02,Anna,34
2023-01-03,Peter,29
2023-01-04,Linda,32


In [34]:
# you can use a date_range instead of providing values
date_range = pd.date_range(start='2023-01-01', periods=4)
df_date_range = pd.DataFrame(data, index=date_range)

df_date_range

Unnamed: 0,Name,Age
2023-01-01,John,28
2023-01-02,Anna,34
2023-01-03,Peter,29
2023-01-04,Linda,32


### Selection using indexes

In [38]:
df_custom_index

Name,Age
John,28
Anna,34
Peter,29
Linda,32


In [35]:
df_custom_index.loc['John']

Age    28
Name: John, dtype: int64

In [37]:
df_custom_index.loc['Anna':'Linda']

Name,Age
Anna,34
Peter,29
Linda,32


In [40]:
# Selecting by integer position, regardless of the index labels
df_default_index.iloc[1]

Name    Anna
Age       34
Name: 1, dtype: object
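
The difference between label-based and position-based selection shows up most clearly when labels repeat. The sketch below rebuilds the frame with the `['a', 'b', 'b', 'd']` index used earlier; the variable names are illustrative.

```python
import pandas as pd

people = pd.DataFrame(
    {'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 34, 29, 32]},
    index=['a', 'b', 'b', 'd'],
)

# .loc looks up labels: 'b' appears twice, so two rows come back as a DataFrame
by_label = people.loc['b']

# .iloc looks up integer positions: position 1 is always exactly one row
by_position = people.iloc[1]

# Positional slices exclude the end position, like regular Python slices
first_two = people.iloc[0:2]
```

Label slices with `.loc`, by contrast, include both endpoints, which is why `'Anna':'Linda'` above returned Linda as well.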

# Data cleaning
## Handling missing values

In [42]:
df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [np.nan, 5, np.nan], 'C': [1, 2, 3]})
df

Unnamed: 0,A,B,C
0,1.0,,1
1,2.0,5.0,2
2,,,3


In [43]:
# Detecting missing values
df.isnull()

Unnamed: 0,A,B,C
0,False,True,False
1,False,False,False
2,True,True,False


In [44]:
# Counting missing values per column
df.isnull().sum()

A    1
B    2
C    0
dtype: int64

In [45]:
# Dropping rows that contain any missing value
df.dropna()

Unnamed: 0,A,B,C
1,2.0,5.0,2


In [46]:
# Dropping columns that contain any missing value
df.dropna(axis=1)

Unnamed: 0,C
0,1
1,2
2,3


In [47]:
# Filling missing values with a constant
df.fillna(value=0)

Unnamed: 0,A,B,C
0,1.0,0.0,1
1,2.0,5.0,2
2,0.0,0.0,3


In [48]:
# Filling with the last valid value, propagated forward
# (df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas)
df.ffill()

Unnamed: 0,A,B,C
0,1.0,,1
1,2.0,5.0,2
2,2.0,5.0,3


In [49]:
# Filling with the next valid value, propagated backward
# (df.bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas)
df.bfill()

Unnamed: 0,A,B,C
0,1.0,5.0,1
1,2.0,5.0,2
2,,,3
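
A few further options are often useful. The sketch below reuses the same values as `df` under a separate, illustrative name (`df_missing`) and shows filling with per-column means and more selective dropping.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({'A': [1, 2, np.nan], 'B': [np.nan, 5, np.nan], 'C': [1, 2, 3]})

# df_missing.mean() is a Series of column means; fillna matches it by column name
filled_with_mean = df_missing.fillna(df_missing.mean())

# Drop a row only when a specific column is missing
rows_with_a = df_missing.dropna(subset=['A'])

# Keep rows that have at least two non-missing values
at_least_two = df_missing.dropna(thresh=2)
```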


## Removing duplicates

In [50]:
df_duplicates = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [1, 1, 2, 3, 3], 'C': ['a', 'b', 'b', 'c', 'c']})
df_duplicates

Unnamed: 0,A,B,C
0,1,1,a
1,2,1,b
2,2,2,b
3,3,3,c
4,3,3,c


In [51]:
# Identify duplicate rows
df_duplicates.duplicated()

0    False
1    False
2    False
3    False
4     True
dtype: bool

In [52]:
df_duplicates.drop_duplicates()

Unnamed: 0,A,B,C
0,1,1,a
1,2,1,b
2,2,2,b
3,3,3,c
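
By default, duplicates are detected across all columns and the first occurrence is kept. The following sketch, on the same values as `df_duplicates` under an illustrative name, shows how the `subset` and `keep` parameters change that.

```python
import pandas as pd

dupes = pd.DataFrame({'A': [1, 2, 2, 3, 3], 'B': [1, 1, 2, 3, 3], 'C': ['a', 'b', 'b', 'c', 'c']})

# Compare only column A and keep the first row of each group
first_per_a = dupes.drop_duplicates(subset=['A'])

# Same comparison, but keep the last row of each group
last_per_a = dupes.drop_duplicates(subset=['A'], keep='last')

# keep=False flags every member of a duplicated group, not just the repeats
all_flagged = dupes.duplicated(keep=False)
```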


## Data type conversion

In [57]:
df_types = pd.DataFrame({'A': ['1', '2', '3'], 'B': ['4.5', '5.5', '6.5'], 'C': ['7', '8', '9']})
df_types

Unnamed: 0,A,B,C
0,1,4.5,7
1,2,5.5,8
2,3,6.5,9


In [55]:
df_types.dtypes

A    object
B    object
C    object
dtype: object

In [56]:
df_types['A'] = df_types['A'].astype(int)
df_types['B'] = df_types['B'].astype(float)
df_types.dtypes

A      int64
B    float64
C     object
dtype: object
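
Column `C` stays as `object` because it was not converted. When a column may contain values that cannot be parsed, `astype` raises an error; `pd.to_numeric` with `errors='coerce'` is a more forgiving alternative. A small sketch with hypothetical values:

```python
import pandas as pd

clean = pd.to_numeric(pd.Series(['7', '8', '9']))

# 'nine' cannot be parsed; errors='coerce' turns it into NaN instead of raising
messy = pd.Series(['7', '8', 'nine'])
coerced = pd.to_numeric(messy, errors='coerce')
```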

## Renaming columns

In [59]:
df_rename = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df_rename

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [60]:
# Rename columns
df_rename.columns = ['Column1', 'Column2']
df_rename

Unnamed: 0,Column1,Column2
0,1,4
1,2,5
2,3,6


In [61]:
# Alternatively, using the rename method
df_rename = df_rename.rename(columns={'Column1': 'FirstColumn', 'Column2': 'SecondColumn'})
df_rename

Unnamed: 0,FirstColumn,SecondColumn
0,1,4
1,2,5
2,3,6


## Efficiently handling categorical data

In [64]:
df_categorical = pd.DataFrame({'A': ['type1', 'type2', 'type3', 'type1']})
df_categorical

Unnamed: 0,A
0,type1
1,type2
2,type3
3,type1


In [62]:
df_categorical.dtypes

A    object
dtype: object

In [68]:
# Convert column A to categorical
df_categorical['A'] = df_categorical['A'].astype('category')
df_categorical.dtypes

A    category
dtype: object

In [69]:
df_categorical['A']

0    type1
1    type2
2    type3
3    type1
Name: A, dtype: category
Categories (3, object): ['type1', 'type2', 'type3']

In [70]:
df_categorical['A'].cat.categories

Index(['type1', 'type2', 'type3'], dtype='object')
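
The efficiency comes from storing each distinct label once and representing the column as small integer codes. The sketch below compares memory use on a larger hypothetical Series; exact byte counts depend on the pandas version, so only the comparison is meaningful.

```python
import pandas as pd

# Hypothetical column: 100,000 values drawn from only three distinct labels
labels = pd.Series(['type1', 'type2', 'type3', 'type1'] * 25_000)

# deep=True counts the memory held by the strings themselves
as_strings_bytes = labels.memory_usage(deep=True)

as_category = labels.astype('category')
as_category_bytes = as_category.memory_usage(deep=True)

# Each value is stored as an integer code pointing into the categories
first_codes = as_category.cat.codes.tolist()[:4]
```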

# Data Manipulation

## Applying functions

In [73]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df

Unnamed: 0,A,B
0,1,4
1,2,5
2,3,6


In [74]:
# Apply a function along each column
df.apply(lambda x: x + 10)

Unnamed: 0,A,B
0,11,14
1,12,15
2,13,16


In [77]:
# Apply a function to series values
series = pd.Series([1, 2, 3])

series.apply(lambda x: x * 2)

0    2
1    4
2    6
dtype: int64

In [79]:
# Using map() for Element-wise Transformations on a Series
s = pd.Series(['cat', 'dog', 'rabbit', 'cat', 'rabbit'])

s.map({'cat': 'feline', 'dog': 'canine', 'rabbit': 'lagomorph'})

0       feline
1       canine
2    lagomorph
3       feline
4    lagomorph
dtype: object
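
One detail worth knowing about `map()` with a dictionary: values that have no matching key become `NaN`. The sketch below uses a hypothetical Series to show this, one way to fall back to the original value, and `map()` with a function instead of a dictionary.

```python
import pandas as pd

animals = pd.Series(['cat', 'dog', 'hamster'])
mapping = {'cat': 'feline', 'dog': 'canine'}

# 'hamster' is not a key in the mapping, so its result is missing
mapped = animals.map(mapping)

# Fill the gaps with the original values, aligned by index
kept = animals.map(mapping).fillna(animals)

# map() also accepts a function that is applied to each element
via_function = animals.map(lambda name: name.upper())
```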

## Aggregation and grouping
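
As a minimal sketch of this topic, the example below groups a small hypothetical sales table (the column names and values are invented for illustration) and computes a few aggregates, including a pivot table.

```python
import pandas as pd

# Hypothetical data
sales = pd.DataFrame({
    'region': ['north', 'south', 'north', 'south', 'north'],
    'product': ['a', 'a', 'b', 'b', 'a'],
    'units': [10, 20, 30, 40, 50],
})

# Split rows by region, then sum the units within each group
total_by_region = sales.groupby('region')['units'].sum()

# Several aggregates at once: one output column per function
summary = sales.groupby('region')['units'].agg(['mean', 'max'])

# Pivot table: regions as rows, products as columns, summed units as values
pivot = sales.pivot_table(index='region', columns='product', values='units', aggfunc='sum')
```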

Intermediate topics:

- Data manipulation:
  - Applying functions: using apply(), map(), and lambda functions.
  - Aggregation and grouping: group-by operations, aggregate functions, pivot tables.
  - Sorting and ranking: sorting DataFrames by index or values, ranking data.
- Merge, join, and concatenate: techniques for combining multiple DataFrames and Series (concat, merge, join).
- Time series data: basics of handling date and time series data, resampling, and time zone handling.
- Advanced data manipulation:
  - MultiIndexing: introduction to hierarchical indexing and working with multi-level indexes.
  - Categorical data: managing categorical data for efficient storage and manipulation.
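
Two of the topics listed above, sorting and ranking, and combining DataFrames, can be sketched briefly. The `people` frame repeats the names and ages used earlier; the `cities` frame and its values are hypothetical.

```python
import pandas as pd

people = pd.DataFrame({'Name': ['John', 'Anna', 'Peter', 'Linda'], 'Age': [28, 34, 29, 32]})

# Hypothetical lookup table; 'Maria' has no match in people
cities = pd.DataFrame({'Name': ['Anna', 'John', 'Maria'], 'City': ['Havana', 'Madrid', 'Lima']})

# Sort rows by a column, oldest first
sorted_by_age = people.sort_values('Age', ascending=False)

# Rank values: 1.0 is the smallest age
age_rank = people['Age'].rank()

# Left merge keeps every row of people; unmatched rows get NaN for City
merged = people.merge(cities, on='Name', how='left')

# Inner merge keeps only names present in both frames
matched = people.merge(cities, on='Name', how='inner')

# Concatenate stacks frames vertically; ignore_index renumbers the rows
stacked = pd.concat([people, people], ignore_index=True)
```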