# Pandas

Pandas is an open-source Python library used for data manipulation and analysis. It provides powerful tools to work with structured data, such as tabular data (like spreadsheets or SQL tables) and time-series data.

**Key Features of Pandas**
1. Data Structures:
    - **Series**: A one-dimensional labeled array (like a column in Excel).
    - **DataFrame**: A two-dimensional labeled table (like an Excel sheet or SQL table).
    - **Panel**: A three-dimensional data structure that was deprecated and later removed from pandas (version 0.25); use a MultiIndex DataFrame or the xarray library instead.
1. Data Manipulation:
    - Filter, slice, and subset data.
    - Handle missing data (NaN) effectively.
    - Reshape and pivot datasets.
1. Data Cleaning:
    - Replace, fill, or drop missing or incorrect values.
    - Detect and remove duplicate entries.
1. Integration:
    - Load data from various file formats like CSV, Excel, JSON, SQL, etc.
    - Export data to these formats.
1. Powerful Aggregation:
    - Grouping, summarizing, and applying custom functions for analysis.
1. Time-Series Support:
    - Perform operations on time-indexed data (resampling, shifting, rolling).

**Why Use Pandas?**
- Simplifies data preparation tasks, which are crucial before analysis or visualization.
- Makes exploratory data analysis (EDA) easy.
- Integrates with other Python libraries such as NumPy, Matplotlib, and scikit-learn.
- Reduces the complexity of working with datasets compared to raw Python lists or dictionaries.

### Series

A **Pandas Series** is a one-dimensional, labeled array capable of holding any data type, such as integers, floats, strings, or Python objects. It is similar to a column in an Excel spreadsheet or a one-dimensional NumPy array, but with the added functionality of labels (indices) for each element.

**Key Characteristics of a Series**
- **Index**: Each element in a Series has a label (index). By default the labels are integers starting from 0; custom labels can also be used.
- **Data Types**: A Series can store data of any type, such as integers, floats, strings, or even Python objects.
- **Homogeneity**: Unlike a list, a Series has a single dtype for all of its elements, similar to a NumPy array. Values of mixed types are stored under the generic `object` dtype.

**Syntax:**

```pandas.Series(data, index=index)```
- **data**: The data to store (list, array, scalar value, or dictionary).
- **index**: (Optional) Custom labels for the Series.
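
The cells that follow build a Series from a list and from a NumPy array. As a minimal sketch of the other two input kinds listed above (dictionary and scalar), plus the single-dtype behaviour, something like the following should work. The variable names are illustrative only.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})

# From a scalar: the value is repeated for every label in `index`
from_scalar = pd.Series(5, index=['x', 'y', 'z'])

# Mixing integers and floats upcasts the whole Series to one dtype (float64)
mixed_numbers = pd.Series([1, 2.5, 3])

print(from_dict)
print(from_scalar)
print(mixed_numbers.dtype)
```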

In [38]:
import pandas as pd
import numpy as np

In [39]:
series1 = pd.Series([1,2,3])
series1

0    1
1    2
2    3
dtype: int64

In [40]:
series1= pd.Series(np.array([13,4,2]))
series1

0    13
1     4
2     2
dtype: int32

In [41]:
type(series1)

pandas.core.series.Series

In [42]:
# Index labels do not have to be unique: 'S2' is used twice here
series2 = pd.Series([13,4,2],index=['S1','S2','S2'])
series2

S1    13
S2     4
S2     2
dtype: int64

Index

In [43]:
prices = [10.70, 10.86, 10.74, 10.71, 10.79]
shares = pd.Series(prices)
shares

0    10.70
1    10.86
2    10.74
3    10.71
4    10.79
dtype: float64

In [44]:
days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
shares = pd.Series(prices, index=days)
shares

Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64

In [45]:
# Examine the index
shares.index

Index(['Mon', 'Tue', 'Wed', 'Thur', 'Fri'], dtype='object')

In [47]:
shares.index[4]

'Fri'

In [48]:
shares.index[:4]

Index(['Mon', 'Tue', 'Wed', 'Thur'], dtype='object')

In [49]:
shares.index[-4:]

Index(['Tue', 'Wed', 'Thur', 'Fri'], dtype='object')

In [50]:
print(shares.index.name)

None


In [51]:
shares.index.name = 'weekdays'
shares

weekdays
Mon     10.70
Tue     10.86
Wed     10.74
Thur    10.71
Fri     10.79
dtype: float64

In [52]:
shares.index[2] = 'Wednesday'

TypeError: Index does not support mutable operations

In [None]:
shares.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
shares
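
An Index cannot be modified element by element, which is why the assignment to `shares.index[2]` fails and the whole index is replaced instead. When only one label needs to change, `rename` with a mapping is one alternative. The sketch below uses a shortened copy of the `shares` data.

```python
import pandas as pd

shares = pd.Series([10.70, 10.86, 10.74], index=['Mon', 'Tue', 'Wed'])

# rename() returns a new Series; only the labels listed in the mapping change
renamed = shares.rename({'Wed': 'Wednesday'})

print(renamed.index.tolist())
print(shares.index.tolist())  # the original Series is left untouched
```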

### DataFrame

A **Pandas DataFrame** is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table and is the most commonly used data structure in Pandas for data manipulation and analysis.

**Key Characteristics of a DataFrame**
1. **Tabular Structure**:
    - Rows represent individual records.
    - Columns represent attributes or features of the data.
1. **Labeled Axes**:
    - Each row and column has a label (index for rows and column names for columns).
1. **Heterogeneous Data**:
    - A DataFrame can contain columns with different data types (e.g., integers, floats, strings).
1. **Size-Mutable**:
    - Rows and columns can be added, modified, or removed.
1. **Built-in Methods**:
    - Includes methods for cleaning, filtering, grouping, and analyzing data.

**Syntax:**
```pandas.DataFrame(data, index=index, columns=columns)```
- **data**: The data to populate the DataFrame (list, dictionary, array, or another DataFrame).
- **index**: (Optional) Custom row labels.
- **columns**: (Optional) Custom column labels
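
To make the "size-mutable" characteristic concrete, here is a small sketch that adds a column, adds a row, and drops a column. The `students` frame and the `passed` column are made up for illustration.

```python
import pandas as pd

students = pd.DataFrame(
    {'roll_no': [1, 2, 3], 'name': ['Karthik', 'Nikhil', 'Anurabh']}
)

# Add a column by assignment
students['passed'] = [True, False, True]

# Add a row with .loc: label 3 does not exist yet, so the row is appended
students.loc[3] = [4, 'Juhi', True]

# drop() returns a new DataFrame without the given column
without_flag = students.drop(columns='passed')

print(students.shape)
print(without_flag.columns.tolist())
```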


In [53]:
# DataFrame: a 2D labeled array with row and column indexes, for tabular, structured data

df = pd.DataFrame([[1,"Karthik"],[2,"Nikhil"],[3,"Anurabh"]])
df

Unnamed: 0,0,1
0,1,Karthik
1,2,Nikhil
2,3,Anurabh


In [54]:
type(df)

pandas.core.frame.DataFrame

DataFrame From List

In [55]:
data = [
    [1,"Karthik"],
    [2,"Nikhil"],
    [3,"Anurabh"],
    [4,"Juhi"]]

index=['a','b','c','d']

df=pd.DataFrame(data, index=index, columns=["Roll No", "Name"])
df

Unnamed: 0,Roll No,Name
a,1,Karthik
b,2,Nikhil
c,3,Anurabh
d,4,Juhi


Basic Function

In [56]:
df.count()

Roll No    4
Name       4
dtype: int64

In [57]:
# len() returns the number of rows (the height of the DataFrame).

len(df)

4

In [58]:
df.dtypes

Roll No     int64
Name       object
dtype: object

In [59]:
df.shape

(4, 2)

In [60]:
df.ndim

2

In [61]:
# Returns the number of elements in the DataFrame.
df.size

8

In [62]:
# Returns the underlying data of the DataFrame as a NumPy ndarray
print(df.values)
type(df.values)

[[1 'Karthik']
 [2 'Nikhil']
 [3 'Anurabh']
 [4 'Juhi']]


numpy.ndarray

DataFrame from Dict

In [63]:
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Create dictionary my_dict with three key:value pairs
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

# Build a DataFrame cars from my_dict
cars = pd.DataFrame(my_dict)

print(cars)

         country  drives_right  cars_per_cap
0  United States          True           809
1      Australia         False           731
2          Japan         False           588
3          India         False            18
4         Russia          True           200
5        Morocco          True            70
6          Egypt          True            45


In [64]:
# Assign custom row labels
cars.index = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars

Unnamed: 0,country,drives_right,cars_per_cap
US,United States,True,809
AUS,Australia,False,731
JAP,Japan,False,588
IN,India,False,18
RU,Russia,True,200
MOR,Morocco,True,70
EG,Egypt,True,45


Read CSV File

In [66]:
cars = pd.read_csv("pd_dataset/cars.csv")
cars

Unnamed: 0.1,Unnamed: 0,cars_per_cap,country,drives_right
0,US,809,United States,True
1,AUS,731,Australia,False
2,JAP,588,Japan,False
3,IN,18,India,False
4,RU,200,Russia,True
5,MOR,70,Morocco,True
6,EG,45,Egypt,True


In [67]:
cars = pd.read_csv("pd_dataset/cars.csv", index_col=0)
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [68]:
cars.head()

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True


In [69]:
cars.tail()

Unnamed: 0,cars_per_cap,country,drives_right
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


Dataset info and shape

In [70]:
cars.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, US to EG
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   cars_per_cap  7 non-null      int64 
 1   country       7 non-null      object
 2   drives_right  7 non-null      bool  
dtypes: bool(1), int64(1), object(1)
memory usage: 175.0+ bytes


In [71]:
cars.axes

[Index(['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG'], dtype='object'),
 Index(['cars_per_cap', 'country', 'drives_right'], dtype='object')]

### Select Data

Columns

In [72]:
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [73]:
cars['country']

US     United States
AUS        Australia
JAP            Japan
IN             India
RU            Russia
MOR          Morocco
EG             Egypt
Name: country, dtype: object

In [74]:
type(cars['country'])

pandas.core.series.Series

In [75]:
cars[['country', 'drives_right']]

Unnamed: 0,country,drives_right
US,United States,True
AUS,Australia,False
JAP,Japan,False
IN,India,False
RU,Russia,True
MOR,Morocco,True
EG,Egypt,True


In [76]:
type(cars[['country']])

pandas.core.frame.DataFrame

Row Access

In [77]:
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [78]:
cars[2:5]

Unnamed: 0,cars_per_cap,country,drives_right
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True


loc

In [79]:
# Row as pandas series

cars.loc['IN']

cars_per_cap       18
country         India
drives_right    False
Name: IN, dtype: object

In [80]:
cars.loc[['IN','US','EG']]

Unnamed: 0,cars_per_cap,country,drives_right
IN,18,India,False
US,809,United States,True
EG,45,Egypt,True


In [81]:
# Rows & Columns
cars.loc[['IN','US','EG'],['country','drives_right']]


Unnamed: 0,country,drives_right
IN,India,False
US,United States,True
EG,Egypt,True


In [82]:
cars.loc[:,['country','drives_right']]

Unnamed: 0,country,drives_right
US,United States,True
AUS,Australia,False
JAP,Japan,False
IN,India,False
RU,Russia,True
MOR,Morocco,True
EG,Egypt,True


iloc

In [83]:
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [84]:
cars.loc[['AUS']]

Unnamed: 0,cars_per_cap,country,drives_right
AUS,731,Australia,False


In [85]:
cars.iloc[[1]]

Unnamed: 0,cars_per_cap,country,drives_right
AUS,731,Australia,False


In [86]:
cars.iloc[[1,2,3]]

Unnamed: 0,cars_per_cap,country,drives_right
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False


In [87]:
cars.loc[['AUS','JAP','IN'],['country','drives_right']]

Unnamed: 0,country,drives_right
AUS,Australia,False
JAP,Japan,False
IN,India,False


In [88]:
cars.iloc[[1,2,3],[1,2]]

Unnamed: 0,country,drives_right
AUS,Australia,False
JAP,Japan,False
IN,India,False


In [89]:
cars.iloc[:4,:2]

Unnamed: 0,cars_per_cap,country
US,809,United States
AUS,731,Australia
JAP,588,Japan
IN,18,India


Filtering

In [91]:
pop = pd.read_csv('pd_dataset/brics.csv', index_col=0)
pop

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


In [92]:
pop['area']
# pop.loc[:, 'area']
# pop.iloc[:,2]

BR     8.516
RU    17.100
IN     3.286
CH     9.597
SA     1.221
Name: area, dtype: float64

In [93]:
exp = pop['area'] > 8 
exp

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [94]:
pop['area'] < 13

BR     True
RU    False
IN     True
CH     True
SA     True
Name: area, dtype: bool

In [95]:
pop[ exp ]

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
CH,China,Beijing,9.597,1357.0


In [96]:
exp = np.logical_and(pop["area"] > 8, pop["area"] < 10)
exp

BR     True
RU    False
IN    False
CH     True
SA    False
Name: area, dtype: bool

In [97]:
pop[ np.logical_and(pop["area"] > 8, pop["area"] < 10) ]

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
CH,China,Beijing,9.597,1357.0
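
The same filter can also be written without `np.logical_and`. The sketch below uses a small stand-in for the brics data (areas copied from the output above) and assumes a pandas version recent enough to accept `inclusive='neither'` in `between`.

```python
import pandas as pd

# Stand-in for the brics data, reduced to the `area` column
pop = pd.DataFrame(
    {'area': [8.516, 17.10, 3.286, 9.597, 1.221]},
    index=['BR', 'RU', 'IN', 'CH', 'SA'],
)

# Each comparison needs its own parentheses because & binds tighter than > and <
with_operator = pop[(pop['area'] > 8) & (pop['area'] < 10)]

# between() with inclusive='neither' excludes both endpoints, like the strict comparisons
with_between = pop[pop['area'].between(8, 10, inclusive='neither')]

print(with_operator.index.tolist())
print(with_between.index.tolist())
```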


Iteration

In [100]:
cars = pd.read_csv("pd_dataset/cars.csv", index_col=0)
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [101]:
for val in cars:
    print(val)

cars_per_cap
country
drives_right


In [102]:
for lab, row in cars.iterrows():
    print(row['country'])

United States
Australia
Japan
India
Russia
Morocco
Egypt


In [103]:
for lab, row in cars.iterrows():
    print(lab, "\t: ", row['country'])

US 	:  United States
AUS 	:  Australia
JAP 	:  Japan
IN 	:  India
RU 	:  Russia
MOR 	:  Morocco
EG 	:  Egypt


In [104]:
for lab, row in cars.iterrows():
    cars.loc[lab, 'COUNTRY'] = row['country'].upper()

cars

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA
MOR,70,Morocco,True,MOROCCO
EG,45,Egypt,True,EGYPT


In [107]:
# Instead of a for loop, the apply method can be used to do the same.

cars = pd.read_csv("pd_dataset/cars.csv", index_col=0)
cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [106]:
cars['COUNTRY'] = cars['country'].apply(str.upper)
cars

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA
MOR,70,Morocco,True,MOROCCO
EG,45,Egypt,True,EGYPT


### Reading data from file

Import from flat file

In [108]:
file = 'pd_dataset/titanic.csv'
df = pd.read_csv(file)
df.head()

Unnamed: 0,passengerId,survived,pclass,sex,age,sibSp,parch,ticket,fare,cabin,embarked
0,1,0,3,male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,male,35.0,0,0,373450,8.05,,S


In [109]:
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame
data_array = data.values
print(data_array)

[['passengerId' 'survived' 'pclass' 'sex' 'age' 'sibSp' 'parch' 'ticket'
  'fare' 'cabin' 'embarked']
 ['1' '0' '3' 'male' '22.0' '1' '0' 'A/5 21171' '7.25' nan 'S']
 ['2' '1' '1' 'female' '38.0' '1' '0' 'PC 17599' '71.2833' 'C85' 'C']
 ['3' '1' '3' 'female' '26.0' '0' '0' 'STON/O2. 3101282' '7.925' nan 'S']
 ['4' '1' '1' 'female' '35.0' '1' '0' '113803' '53.1' 'C123' 'S']]


In [110]:
print(type(data_array))

<class 'numpy.ndarray'>
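
Note that with `header=None` the header line is read as an ordinary data row, which is why every value in the array above is a string. The sketch below shows the difference on a tiny in-memory CSV; the text in `csv_text` is made up and only borrows a few column names from the Titanic file.

```python
import io
import pandas as pd

csv_text = "passengerId,survived,fare\n1,0,7.25\n2,1,71.2833\n"

# Default: the first line supplies the column names, so values are parsed as numbers
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: the first line is treated as data, so every column is read as text
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.dtypes)
print(no_header.dtypes)
```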


Import from Excel file

In [111]:
# Reading .xlsx files requires the openpyxl package.
# Install it from a terminal with the following command:

# pip install openpyxl

In [114]:

file = 'pd_dataset/battledeath.xlsx'

# Load spreadsheet
xl = pd.ExcelFile(file)

print(xl.sheet_names)

['2002', '2004']


In [118]:
df = xl.parse(1)

In [119]:
df.head()

Unnamed: 0,War(country),2004
0,Afghanistan,9.451028
1,Albania,0.130354
2,Algeria,3.407277
3,Andorra,0.0
4,Angola,2.597931


### Cleaning Data

Missing Value

In [120]:
df = pd.DataFrame(
    np.random.randn(5, 3), 
    index=['a', 'c', 'e', 'f', 'h'],
    columns=['one', 'two', 'three'] )
df

Unnamed: 0,one,two,three
a,-0.313066,-0.048935,1.307849
c,0.868421,0.749835,1.058547
e,-0.58945,0.475523,-0.335693
f,-1.786195,0.017477,1.072859
h,0.71687,-0.048662,-0.322911


In [121]:
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df

Unnamed: 0,one,two,three
a,-0.313066,-0.048935,1.307849
b,,,
c,0.868421,0.749835,1.058547
d,,,
e,-0.58945,0.475523,-0.335693
f,-1.786195,0.017477,1.072859
g,,,
h,0.71687,-0.048662,-0.322911


In [122]:
# Check for missing value: isnull()

print(df['one'].isnull())

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool


In [123]:
# Check for missing value: notnull()

print(df['one'].notnull())

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool


In [124]:
# Sum the values in column 'one' (NaN values are skipped by default)

df.one.sum()

-1.1034199773130307

In [125]:
df

Unnamed: 0,one,two,three
a,-0.313066,-0.048935,1.307849
b,,,
c,0.868421,0.749835,1.058547
d,,,
e,-0.58945,0.475523,-0.335693
f,-1.786195,0.017477,1.072859
g,,,
h,0.71687,-0.048662,-0.322911


In [126]:
# Replace NaN with 0
print(df.fillna(0))

        one       two     three
a -0.313066 -0.048935  1.307849
b  0.000000  0.000000  0.000000
c  0.868421  0.749835  1.058547
d  0.000000  0.000000  0.000000
e -0.589450  0.475523 -0.335693
f -1.786195  0.017477  1.072859
g  0.000000  0.000000  0.000000
h  0.716870 -0.048662 -0.322911


In [127]:
# Forward fill (ffill): propagate the last valid value forward

print(df.ffill())

        one       two     three
a -0.313066 -0.048935  1.307849
b -0.313066 -0.048935  1.307849
c  0.868421  0.749835  1.058547
d  0.868421  0.749835  1.058547
e -0.589450  0.475523 -0.335693
f -1.786195  0.017477  1.072859
g -1.786195  0.017477  1.072859
h  0.716870 -0.048662 -0.322911


In [128]:
# Backward fill (bfill): use the next valid value to fill the gap

print(df.bfill())

        one       two     three
a -0.313066 -0.048935  1.307849
b  0.868421  0.749835  1.058547
c  0.868421  0.749835  1.058547
d -0.589450  0.475523 -0.335693
e -0.589450  0.475523 -0.335693
f -1.786195  0.017477  1.072859
g  0.716870 -0.048662 -0.322911
h  0.716870 -0.048662 -0.322911


In [129]:
# Drop NaN rows

print(df.dropna())

        one       two     three
a -0.313066 -0.048935  1.307849
c  0.868421  0.749835  1.058547
e -0.589450  0.475523 -0.335693
f -1.786195  0.017477  1.072859
h  0.716870 -0.048662 -0.322911


In [130]:
df

Unnamed: 0,one,two,three
a,-0.313066,-0.048935,1.307849
b,,,
c,0.868421,0.749835,1.058547
d,,,
e,-0.58945,0.475523,-0.335693
f,-1.786195,0.017477,1.072859
g,,,
h,0.71687,-0.048662,-0.322911


In [131]:
# Drop columns that contain NaN
# axis = 1 -> Column
# axis = 0 -> Row
# Every column here has at least one NaN, so all columns are dropped

print(df.dropna(axis=1))

Empty DataFrame
Columns: []
Index: [a, b, c, d, e, f, g, h]
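
By default `dropna` removes a row or column as soon as it contains a single NaN, which is why no columns survive above. The `how`, `thresh`, and `subset` parameters give finer control. The sketch below uses a small hand-made frame so that the results are easy to follow.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'one': [1.0, np.nan, 3.0, np.nan],
     'two': [4.0, np.nan, np.nan, 7.0],
     'three': [8.0, np.nan, 10.0, 11.0]},
    index=['a', 'b', 'c', 'd'],
)

# how='all' drops a row only when every value in it is missing
all_missing_removed = df.dropna(how='all')

# thresh=3 keeps rows that have at least three non-missing values
mostly_complete = df.dropna(thresh=3)

# subset limits the check to the listed columns
one_present = df.dropna(subset=['one'])

print(all_missing_removed.index.tolist())
print(mostly_complete.index.tolist())
print(one_present.index.tolist())
```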


In [133]:
df = pd.read_csv('pd_dataset/literary_birth_rate.csv', sep=';')
df.head()

Unnamed: 0,Country,Continent,female literacy,fertility,population
0,Chine,ASI,90.5,1.769,1324655000
1,Inde,ASI,50.8,2.682,1139964932
2,USA,NAM,99.0,2.077,304060000
3,Indonésie,ASI,88.8,2.132,227345082
4,Brésil,LAT,90.2,1.827,191971506


In [134]:
df.tail()

Unnamed: 0,Country,Continent,female literacy,fertility,population
157,Vanuatu,OCE,79.5,3.883,233866
158,Samoa,OCE,98.5,3.852,178869
159,Sao Tomé-et-Principe,AF,83.3,3.718,160174
160,Aruba,LAT,98.0,1.732,105455
161,Tonga,ASI,99.1,3.928,103566


In [135]:
df.shape

(162, 5)

In [136]:
df.columns

Index(['Country ', 'Continent', 'female literacy', 'fertility', 'population'], dtype='object')

In [137]:
# 'Country ' has a trailing space in the file; 'Continent' has none and is left unchanged
df = df.rename(columns={'Country ':'country','female literacy':'female_literacy'})
df.columns

Index(['country', 'Continent', 'female_literacy', 'fertility', 'population'], dtype='object')
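
Stray whitespace in column names, such as the trailing space in `'Country '`, is easy to miss in a `rename` mapping. One way to normalise all names at once is to go through the `.str` accessor on `df.columns`. This is only a sketch on an empty stand-in frame, and unlike the cells that follow it lower-cases every name, including `Continent`.

```python
import pandas as pd

# Empty stand-in frame with the first three column names from the file
df = pd.DataFrame(columns=['Country ', 'Continent', 'female literacy'])

# Strip surrounding whitespace, lower-case, and replace inner spaces with underscores
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print(df.columns.tolist())
```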

In [138]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 162 entries, 0 to 161
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   country          162 non-null    object 
 1   Continent        162 non-null    object 
 2   female_literacy  162 non-null    float64
 3   fertility        162 non-null    float64
 4   population       162 non-null    object 
dtypes: float64(2), object(3)
memory usage: 6.5+ KB


In [139]:
print(df.describe())

       female_literacy   fertility
count       162.000000  162.000000
mean         80.107407    2.878673
std          23.052415    1.427597
min          12.600000    0.966000
25%          66.425000    1.823250
50%          90.000000    2.367500
75%          98.500000    3.880250
max         100.000000    7.069000
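
`describe()` summarises only numeric columns by default, so `population`, which `info()` reports as `object`, is missing from the table above. If such a column is meant to be numeric, `pd.to_numeric` can convert it. The values below are hypothetical; the actual reason for the `object` dtype in the file is not shown here.

```python
import pandas as pd

# Hypothetical values, including one entry that is not a number
population = pd.Series(['1324655000', '304060000', 'unknown'])

# errors='coerce' turns anything that cannot be parsed into NaN instead of raising
population_numeric = pd.to_numeric(population, errors='coerce')

print(population_numeric)
```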


In [140]:
df.Continent.value_counts(dropna=False)

Continent
AF     49
ASI    47
EUR    36
LAT    24
OCE     4
NAM     2
Name: count, dtype: int64

In [141]:
df.country.value_counts(dropna=False).head()

country
Chine           1
Lituanie        1
Finland         1
Kirghizistan    1
Turkménistan    1
Name: count, dtype: int64

In [142]:
df.fertility.value_counts(dropna=False).head()

fertility
3.371    2
1.841    2
1.393    2
1.854    2
1.436    2
Name: count, dtype: int64

**Melting**

Melting is a process in Pandas where you transform a wide-format DataFrame into a long-format DataFrame. This is particularly useful when preparing data for analysis or visualization that requires a specific structure.

**In a melted DataFrame:**

- Each row represents a single observation.
- Columns that previously held multiple variables are now combined into two or more columns:
    - A column for variable names.
    - A column for the corresponding values.

**Syntax:**

```pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value')```

**Parameters:**
- **frame**: The DataFrame to melt.
- **id_vars**: Columns that should remain as identifiers and not be melted.
- **value_vars**: Columns to melt into rows (default is all columns except id_vars).
- **var_name**: Name of the column created for variable names.
- **value_name**: Name of the column created for values (default is 'value').
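
As a small illustration of how the parameters interact, the sketch below melts only two of three measurement columns. The frame is a hand-made stand-in with a few values borrowed from the air quality data.

```python
import pandas as pd

wide = pd.DataFrame({
    'Month': [5, 5],
    'Day': [1, 2],
    'Ozone': [41.0, 36.0],
    'Wind': [7.4, 8.0],
    'Temp': [67, 72],
})

# Only Ozone and Wind are melted; Temp is neither an id nor a value column, so it is dropped
long = pd.melt(
    wide,
    id_vars=['Month', 'Day'],
    value_vars=['Ozone', 'Wind'],
    var_name='measurement',
    value_name='reading',
)

print(long)
```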

In [143]:
airquality = pd.read_csv('pd_dataset/airquality.csv')

# Print the head of airquality
print(airquality.head())

   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5


In [144]:
# Melt airquality
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
print(airquality_melt.head())

   Month  Day variable  value
0      5    1    Ozone   41.0
1      5    2    Ozone   36.0
2      5    3    Ozone   12.0
3      5    4    Ozone   18.0
4      5    5    Ozone    NaN


In [145]:
# Melt airquality with custom names for the variable and value columns
airquality_melt = pd.melt(
    frame=airquality, 
    id_vars=['Month', 'Day'], 
    var_name='measurement', 
    value_name='reading')
    
print(airquality_melt.head())

   Month  Day measurement  reading
0      5    1       Ozone     41.0
1      5    2       Ozone     36.0
2      5    3       Ozone     12.0
3      5    4       Ozone     18.0
4      5    5       Ozone      NaN


**Pivot Table**

A **pivot table** in Pandas is a way to summarize, aggregate, and reorganize data in a DataFrame. It allows you to transform a long-format DataFrame into a summary table, using rows and columns for grouping and applying aggregation functions to calculate metrics.

It is particularly useful for analyzing and reporting data, similar to Excel pivot tables.

**Syntax:**

```pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False)```

**Parameters:**
- **data**: The DataFrame to pivot.
- **values**: Column(s) to aggregate.
- **index**: Keys to group by on the rows.
- **columns**: Keys to group by on the columns.
- **aggfunc**: Aggregation function (default is mean but can be sum, count, min, max, etc.).
- **fill_value**: Value to replace missing values.
- **margins**: If True, adds totals for rows and columns.
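
The example further down uses the default `aggfunc`. To show `aggfunc`, `fill_value`, and `margins` in action, here is a sketch on a made-up `sales` frame (the column names and numbers are invented for illustration).

```python
import pandas as pd

sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South', 'South'],
    'product': ['A', 'B', 'A', 'A', 'B'],
    'units': [10, 5, 7, 3, 8],
})

summary = pd.pivot_table(
    sales,
    values='units',
    index='region',
    columns='product',
    aggfunc='sum',   # add up units instead of averaging them
    fill_value=0,    # used where a region/product combination has no rows
    margins=True,    # adds an 'All' row and column with totals
)

print(summary)
```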

In [147]:
# Pivot airquality_melt
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

print(airquality_pivot.head())

measurement  Ozone  Solar.R  Temp  Wind
Month Day                              
5     1       41.0    190.0  67.0   7.4
      2       36.0    118.0  72.0   8.0
      3       12.0    149.0  74.0  12.6
      4       18.0    313.0  62.0  11.5
      5        NaN      NaN  56.0  14.3


In [148]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

print(airquality_pivot.head())
print(airquality.head())

measurement  Month  Day  Ozone  Solar.R  Temp  Wind
0                5    1   41.0    190.0  67.0   7.4
1                5    2   36.0    118.0  72.0   8.0
2                5    3   12.0    149.0  74.0  12.6
3                5    4   18.0    313.0  62.0  11.5
4                5    5    NaN      NaN  56.0  14.3
   Ozone  Solar.R  Wind  Temp  Month  Day
0   41.0    190.0   7.4    67      5    1
1   36.0    118.0   8.0    72      5    2
2   12.0    149.0  12.6    74      5    3
3   18.0    313.0  11.5    62      5    4
4    NaN      NaN  14.3    56      5    5


Split the column

In [150]:
tb = pd.read_csv('pd_dataset/tb.csv')
tb

Unnamed: 0,country,year,m014,m1524,m2534,m3544,m4554,m5564,m65,mu,f014,f1524,f2534,f3544,f4554,f5564,f65,fu
0,AD,2000,0.0,0.0,1.0,0.0,0.0,0.0,0.0,,,,,,,,,
1,AE,2000,2.0,4.0,4.0,6.0,5.0,12.0,10.0,,3.0,16.0,1.0,3.0,0.0,0.0,4.0,
2,AF,2000,52.0,228.0,183.0,149.0,129.0,94.0,80.0,,93.0,414.0,565.0,339.0,205.0,99.0,36.0,
3,AG,2000,0.0,0.0,0.0,0.0,0.0,0.0,1.0,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,
4,AL,2000,2.0,19.0,21.0,14.0,24.0,19.0,16.0,,3.0,11.0,10.0,8.0,8.0,5.0,11.0,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
196,YE,2000,110.0,789.0,689.0,493.0,314.0,255.0,127.0,,161.0,799.0,627.0,517.0,345.0,247.0,92.0,
197,YU,2000,,,,,,,,,,,,,,,,
198,ZA,2000,116.0,723.0,1999.0,2135.0,1146.0,435.0,212.0,,122.0,1283.0,1716.0,933.0,423.0,167.0,80.0,
199,ZM,2000,349.0,2175.0,2610.0,3045.0,435.0,261.0,174.0,,150.0,932.0,1118.0,1305.0,186.0,112.0,75.0,


In [151]:
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
tb_melt

Unnamed: 0,country,year,variable,value
0,AD,2000,m014,0.0
1,AE,2000,m014,2.0
2,AF,2000,m014,52.0
3,AG,2000,m014,0.0
4,AL,2000,m014,2.0
...,...,...,...,...
3211,YE,2000,fu,
3212,YU,2000,fu,
3213,ZA,2000,fu,
3214,ZM,2000,fu,


In [152]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())

  country  year variable  value gender age_group
0      AD  2000     m014    0.0      m       014
1      AE  2000     m014    2.0      m       014
2      AF  2000     m014   52.0      m       014
3      AG  2000     m014    0.0      m       014
4      AL  2000     m014    2.0      m       014


In [155]:
ebola = pd.read_csv('pd_dataset/ebola.csv')
ebola

Unnamed: 0,Date,Day,Cases_Guinea,Cases_Liberia,Cases_SierraLeone,Cases_Nigeria,Cases_Senegal,Cases_UnitedStates,Cases_Spain,Cases_Mali,Deaths_Guinea,Deaths_Liberia,Deaths_SierraLeone,Deaths_Nigeria,Deaths_Senegal,Deaths_UnitedStates,Deaths_Spain,Deaths_Mali
0,1/5/2015,289,2776.0,,10030.0,,,,,,1786.0,,2977.0,,,,,
1,1/4/2015,288,2775.0,,9780.0,,,,,,1781.0,,2943.0,,,,,
2,1/3/2015,287,2769.0,8166.0,9722.0,,,,,,1767.0,3496.0,2915.0,,,,,
3,1/2/2015,286,,8157.0,,,,,,,,3496.0,,,,,,
4,12/31/2014,284,2730.0,8115.0,9633.0,,,,,,1739.0,3471.0,2827.0,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
117,3/27/2014,5,103.0,8.0,6.0,,,,,,66.0,6.0,5.0,,,,,
118,3/26/2014,4,86.0,,,,,,,,62.0,,,,,,,
119,3/25/2014,3,86.0,,,,,,,,60.0,,,,,,,
120,3/24/2014,2,86.0,,,,,,,,59.0,,,,,,,


In [156]:
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt

Unnamed: 0,Date,Day,type_country,counts
0,1/5/2015,289,Cases_Guinea,2776.0
1,1/4/2015,288,Cases_Guinea,2775.0
2,1/3/2015,287,Cases_Guinea,2769.0
3,1/2/2015,286,Cases_Guinea,
4,12/31/2014,284,Cases_Guinea,2730.0
...,...,...,...,...
1947,3/27/2014,5,Deaths_Mali,
1948,3/26/2014,4,Deaths_Mali,
1949,3/25/2014,3,Deaths_Mali,
1950,3/24/2014,2,Deaths_Mali,


In [157]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt

Unnamed: 0,Date,Day,type_country,counts,str_split
0,1/5/2015,289,Cases_Guinea,2776.0,"[Cases, Guinea]"
1,1/4/2015,288,Cases_Guinea,2775.0,"[Cases, Guinea]"
2,1/3/2015,287,Cases_Guinea,2769.0,"[Cases, Guinea]"
3,1/2/2015,286,Cases_Guinea,,"[Cases, Guinea]"
4,12/31/2014,284,Cases_Guinea,2730.0,"[Cases, Guinea]"
...,...,...,...,...,...
1947,3/27/2014,5,Deaths_Mali,,"[Deaths, Mali]"
1948,3/26/2014,4,Deaths_Mali,,"[Deaths, Mali]"
1949,3/25/2014,3,Deaths_Mali,,"[Deaths, Mali]"
1950,3/24/2014,2,Deaths_Mali,,"[Deaths, Mali]"


In [158]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

print(ebola_melt.head())

         Date  Day  type_country  counts        str_split   type country
0    1/5/2015  289  Cases_Guinea  2776.0  [Cases, Guinea]  Cases  Guinea
1    1/4/2015  288  Cases_Guinea  2775.0  [Cases, Guinea]  Cases  Guinea
2    1/3/2015  287  Cases_Guinea  2769.0  [Cases, Guinea]  Cases  Guinea
3    1/2/2015  286  Cases_Guinea     NaN  [Cases, Guinea]  Cases  Guinea
4  12/31/2014  284  Cases_Guinea  2730.0  [Cases, Guinea]  Cases  Guinea
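
The intermediate `str_split` column can be skipped by passing `expand=True` to `str.split`, which returns one column per piece. A minimal sketch on a two-row stand-in for `ebola_melt`:

```python
import pandas as pd

ebola_melt = pd.DataFrame({'type_country': ['Cases_Guinea', 'Deaths_Mali']})

# expand=True returns a DataFrame with one column per piece instead of a column of lists
ebola_melt[['type', 'country']] = ebola_melt['type_country'].str.split('_', expand=True)

print(ebola_melt)
```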


Concatenating

In [159]:
uber1 = pd.read_csv('pd_dataset/uber/uber1.csv')
uber2 = pd.read_csv('pd_dataset/uber/uber2.csv')
uber3 = pd.read_csv('pd_dataset/uber/uber3.csv')

In [160]:
uber = pd.concat([uber1, uber2, uber3])
print(uber.shape)
print(uber.head())

(297, 5)
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594 -73.9722  B02512


In [161]:
uber.loc[98,:]

Unnamed: 0.1,Unnamed: 0,Date/Time,Lat,Lon,Base
98,98,4/1/2014 6:59:00,40.7898,-73.9661,B02512
98,98,5/1/2014 6:08:00,40.7273,-73.9922,B02512
98,98,6/1/2014 6:51:00,40.7621,-73.9817,B02512


In [162]:
uber = pd.concat([uber1, uber2, uber3], ignore_index=True)
uber.loc[0,:]

Unnamed: 0                   0
Date/Time     4/1/2014 0:11:00
Lat                     40.769
Lon                   -73.9549
Base                    B02512
Name: 0, dtype: object

In [163]:
# Iterate over and concatenate all files that match a glob pattern

import glob

In [166]:
pattern = 'pd_dataset/uber/*.csv'
csv_files = glob.glob(pattern)
csv_files

['pd_dataset/uber\\uber1.csv',
 'pd_dataset/uber\\uber2.csv',
 'pd_dataset/uber\\uber3.csv']

In [167]:
df_list = []

for csv in csv_files:
    df = pd.read_csv(csv)
    df_list.append(df)

uber = pd.concat(df_list, ignore_index=True)

print(uber.shape)
print(uber.head())

(297, 5)
   Unnamed: 0         Date/Time      Lat      Lon    Base
0           0  4/1/2014 0:11:00  40.7690 -73.9549  B02512
1           1  4/1/2014 0:17:00  40.7267 -74.0345  B02512
2           2  4/1/2014 0:21:00  40.7316 -73.9873  B02512
3           3  4/1/2014 0:28:00  40.7588 -73.9776  B02512
4           4  4/1/2014 0:33:00  40.7594 -73.9722  B02512
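
The loop above can be condensed into a list comprehension. The sketch below writes three tiny CSV files into a temporary folder so that it runs on its own; the file contents are made up. Writing with `index=False` also avoids the extra `Unnamed: 0` column seen in the Uber files, which typically appears when a file was saved together with its index.

```python
import glob
import os
import tempfile

import pandas as pd

# Create three small CSV files in a temporary folder (stand-ins for the Uber files)
folder = tempfile.mkdtemp()
for month in (4, 5, 6):
    part = pd.DataFrame({'month': [month, month], 'trips': [1, 2]})
    part.to_csv(os.path.join(folder, f'uber{month}.csv'), index=False)

# glob does not guarantee any order, so sort the matches for a reproducible result
csv_files = sorted(glob.glob(os.path.join(folder, '*.csv')))

uber = pd.concat([pd.read_csv(path) for path in csv_files], ignore_index=True)

print(uber.shape)
```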


Merging data

In [168]:
import sqlite3

In [169]:
connection = sqlite3.connect("pd_dataset/survey.db")

# site dataframe
site = pd.read_sql_query("select name, lat, long from Site", connection)
print(site.head())

# visited dataframe
visited = pd.read_sql_query("select id, site, dated from Visited", connection)
print(visited.head())

survey = pd.read_sql_query("select * from Survey", connection)
print(survey.head())

    name    lat    long
0   DR-1 -49.85 -128.57
1   DR-3 -47.15 -126.72
2  MSK-4 -48.87 -123.40
    id  site       dated
0  619  DR-1  1927-02-08
1  622  DR-1  1927-02-10
2  734  DR-3  1930-01-07
3  735  DR-3  1930-01-12
4  751  DR-3  1930-02-26
   taken person quant  reading
0    619   dyer   rad     9.82
1    619   dyer   sal     0.13
2    622   dyer   rad     7.80
3    622   dyer   sal     0.09
4    734     pb   rad     8.41


In [170]:
# 1-to-1 style merge on the site name (here each site matches several visits, so site rows repeat)

# Merge the DataFrames: o2o
o2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

print(o2o)

    name    lat    long   id   site       dated
0   DR-1 -49.85 -128.57  619   DR-1  1927-02-08
1   DR-1 -49.85 -128.57  622   DR-1  1927-02-10
2   DR-1 -49.85 -128.57  844   DR-1  1932-03-22
3   DR-3 -47.15 -126.72  734   DR-3  1930-01-07
4   DR-3 -47.15 -126.72  735   DR-3  1930-01-12
5   DR-3 -47.15 -126.72  751   DR-3  1930-02-26
6   DR-3 -47.15 -126.72  752   DR-3        None
7  MSK-4 -48.87 -123.40  837  MSK-4  1932-01-14
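
Whether a merge is one-to-one or one-to-many depends on how often each key appears, and `pd.merge` can check this through its `validate` parameter. The sketch below uses small made-up stand-ins for the `site` and `visited` tables (`site_demo` and `visited_demo` are hypothetical names), so it is an illustration rather than the notebook's data.

```python
import pandas as pd

# Hypothetical miniature versions of the site and visited tables.
site_demo = pd.DataFrame(
    {"name": ["DR-1", "DR-3", "MSK-4"], "lat": [-49.85, -47.15, -48.87]}
)
visited_demo = pd.DataFrame(
    {"id": [619, 622, 734, 837], "site": ["DR-1", "DR-1", "DR-3", "MSK-4"]}
)

# validate raises MergeError if the keys do not have the stated relationship;
# indicator adds a _merge column saying which table(s) each row came from.
merged = pd.merge(
    site_demo, visited_demo,
    left_on="name", right_on="site",
    validate="one_to_many", indicator=True,
)
print(merged)

# Claiming one-to-one fails because DR-1 appears twice in visited_demo.
try:
    pd.merge(site_demo, visited_demo, left_on="name", right_on="site",
             validate="one_to_one")
    one_to_one_ok = True
except pd.errors.MergeError:
    one_to_one_ok = False
print("one-to-one?", one_to_one_ok)
```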


In [171]:
# Many-to-many

m2o = pd.merge(left=site, right=visited, left_on='name', right_on='site')

# Merge m2o and survey: m2m
m2m = pd.merge(left=m2o, right=survey, left_on='id', right_on='taken')

# Print the first 20 lines of m2m
print(m2m.head(20))

     name    lat    long   id   site       dated  taken person quant  reading
0    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   rad     9.82
1    DR-1 -49.85 -128.57  619   DR-1  1927-02-08    619   dyer   sal     0.13
2    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   rad     7.80
3    DR-1 -49.85 -128.57  622   DR-1  1927-02-10    622   dyer   sal     0.09
4    DR-1 -49.85 -128.57  844   DR-1  1932-03-22    844    roe   rad    11.25
5    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb   rad     8.41
6    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734   lake   sal     0.05
7    DR-3 -47.15 -126.72  734   DR-3  1930-01-07    734     pb  temp   -21.50
8    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735     pb   rad     7.22
9    DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None   sal     0.06
10   DR-3 -47.15 -126.72  735   DR-3  1930-01-12    735   None  temp   -26.00
11   DR-3 -47.15 -126.72  751   DR-3  1930-02-26    751     pb  

Converting data types

In [173]:
tips = pd.read_csv('pd_dataset/tips.csv')

print(tips.head())

print("\n\n")
print(tips.info())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  244 non-null    float64
 1   tip         244 non-null    float64
 2   sex         244 non-null    object 
 3   smoker      244 non-null    object 
 4   day         244 non-null    object 
 5   time        244 non-null    object 
 6   size        244 non-null    int64  
dtypes: float64(2), int64(1), object(4)
memory usage: 13.5+ KB
None


In [174]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    object  
 5   time        244 non-null    object  
 6   size        244 non-null    int64   
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.4+ KB
None


Categorical data

Converting categorical columns to the `category` dtype:
- Can make the DataFrame smaller in memory
- Lets other Python libraries recognize the column as categorical during analysis
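
The memory claim can be checked directly with `memory_usage(deep=True)`. This is a small sketch on made-up data (the `labels` Series is an assumption, not the tips dataset); exact byte counts depend on the pandas version, so only the comparison is shown.

```python
import pandas as pd

# A hypothetical column with many rows but only two distinct values.
labels = pd.Series(["Male", "Female"] * 5000)
as_category = labels.astype("category")

# deep=True counts the string data itself, not just the references to it.
string_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("as strings :", string_bytes, "bytes")
print("as category:", category_bytes, "bytes")
print("categories :", list(as_category.cat.categories))
```

The saving comes from storing each distinct label once and keeping a small integer code per row, so it is largest when a column has few unique values.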

In [175]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    object  
 5   time        244 non-null    object  
 6   size        244 non-null    int64   
dtypes: category(2), float64(2), int64(1), object(2)
memory usage: 10.4+ KB
None
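
In the cell above both columns were already `float64`, so `info()` is unchanged. The effect of `errors='coerce'` only becomes visible when a column contains values that cannot be parsed. A minimal sketch with a made-up Series (`raw` is hypothetical):

```python
import pandas as pd

# A hypothetical column read in as text, with one unparseable entry.
raw = pd.Series(["16.99", "10.34", "missing", "23.68"])

# errors="coerce" turns anything that cannot be parsed into NaN
# instead of raising an exception.
cleaned = pd.to_numeric(raw, errors="coerce")

print(cleaned)
print("values coerced to NaN:", cleaned.isna().sum())
```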


Applying a function to a column

In [176]:
tips = pd.read_csv('pd_dataset/tips.csv')

def recode_sex(sex_value):
    if sex_value == 'Male':
        return 1   
    elif sex_value == 'Female':
        return 0  
    else:
        return np.nan

tips['sex_recode'] = tips.sex.apply(recode_sex)

tips.head()

   total_bill   tip     sex smoker  day    time  size  sex_recode
0       16.99  1.01  Female     No  Sun  Dinner     2           0
1       10.34  1.66    Male     No  Sun  Dinner     3           1
2       21.01  3.50    Male     No  Sun  Dinner     3           1
3       23.68  3.31    Male     No  Sun  Dinner     2           1
4       24.59  3.61  Female     No  Sun  Dinner     4           0
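
For a simple value-to-value recode like this one, `Series.map` with a dictionary is a common alternative to `apply` with a custom function. The sketch below uses a made-up Series (`sex` and the `"Other"` entry are assumptions) to show that unmapped values become `NaN`, mirroring the `else` branch of `recode_sex`.

```python
import pandas as pd

# Hypothetical values, including one that is not in the mapping.
sex = pd.Series(["Female", "Male", "Male", "Other"])

# Values missing from the dictionary are mapped to NaN.
recoded = sex.map({"Male": 1, "Female": 0})
print(recoded)
```

Because of the `NaN`, the result here is a float column; with no unmapped values it would stay integer.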


Dropping duplicate data

In [177]:
billboard = pd.read_csv('pd_dataset/billboard.csv')
billboard.head()

   year        artist                    track  time  date.entered  week  rank
0  2000       2Ge+her  The Hardest Part Of ...  3:15      9/2/2000   wk1  91.0
1  2000         2 Pac           Baby Don't Cry  4:22     2/26/2000   wk1  87.0
2  2000  3 Doors Down               Kryptonite  3:53      4/8/2000   wk1  81.0
3  2000  3 Doors Down                    Loser  4:24    10/21/2000   wk1  76.0
4  2000      504 Boyz            Wobble Wobble  3:35     4/15/2000   wk1  57.0


In [178]:
tracks = billboard[['year','artist','track','time']]

# Print info of tracks
print(tracks.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24071 entries, 0 to 24070
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    24071 non-null  int64 
 1   artist  24071 non-null  object
 2   track   24071 non-null  object
 3   time    24071 non-null  object
dtypes: int64(1), object(3)
memory usage: 752.3+ KB
None


In [179]:
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
print(tracks_no_duplicates.info())

<class 'pandas.core.frame.DataFrame'>
Index: 317 entries, 0 to 316
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   year    317 non-null    int64 
 1   artist  317 non-null    object
 2   track   317 non-null    object
 3   time    317 non-null    object
dtypes: int64(1), object(3)
memory usage: 12.4+ KB
None
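
`drop_duplicates()` with no arguments compares entire rows. When only some columns define a duplicate, the `subset` and `keep` parameters control the comparison. A minimal sketch on a made-up chart table (`chart` is hypothetical, loosely modeled on the billboard columns):

```python
import pandas as pd

# Hypothetical chart data: the same track appears in two different weeks.
chart = pd.DataFrame({
    "artist": ["2 Pac", "2 Pac", "3 Doors Down", "3 Doors Down"],
    "track": ["Baby Don't Cry", "Baby Don't Cry", "Kryptonite", "Loser"],
    "week": ["wk1", "wk2", "wk1", "wk1"],
})

# duplicated() flags every repeat of an (artist, track) pair after the first.
flags = chart.duplicated(subset=["artist", "track"])
print(flags)

# Keep the first occurrence of each pair and renumber the index.
unique_tracks = (
    chart.drop_duplicates(subset=["artist", "track"], keep="first")
         .reset_index(drop=True)
)
print(unique_tracks)
```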


Fill Missing Value

In [180]:
airquality = pd.read_csv('pd_dataset/airquality.csv')
airquality.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Ozone    116 non-null    float64
 1   Solar.R  146 non-null    float64
 2   Wind     153 non-null    float64
 3   Temp     153 non-null    int64  
 4   Month    153 non-null    int64  
 5   Day      153 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 7.3 KB


In [181]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Ozone    153 non-null    float64
 1   Solar.R  146 non-null    float64
 2   Wind     153 non-null    float64
 3   Temp     153 non-null    int64  
 4   Month    153 non-null    int64  
 5   Day      153 non-null    int64  
dtypes: float64(3), int64(3)
memory usage: 7.3 KB
None
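
Filling with a single overall mean ignores any structure in the data. One possible refinement, shown here only as a sketch on made-up numbers (`air` and `Ozone_filled` are hypothetical names), is to fill each gap with the mean of its own group, for example its month.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with one gap in each month.
air = pd.DataFrame({
    "Month": [5, 5, 5, 6, 6, 6],
    "Ozone": [40.0, np.nan, 20.0, 10.0, 30.0, np.nan],
})

# transform("mean") returns a Series aligned with the original rows,
# holding each row's monthly mean (NaN values are skipped by default).
monthly_mean = air.groupby("Month")["Ozone"].transform("mean")

# fillna accepts a Series and fills each gap with the value at the same index.
air["Ozone_filled"] = air["Ozone"].fillna(monthly_mean)
print(air)
```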


### Groupby

In [182]:
sales = pd.DataFrame(
	{
	'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
	'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
	'bread': [139, 237, 326, 456],
	'butter': [20, 45, 70, 98]
	}
)

sales

  weekday    city  bread  butter
0     Sun  Austin    139      20
1     Sun  Dallas    237      45
2     Mon  Austin    326      70
3     Mon  Dallas    456      98


In [183]:
sales['weekday'] == 'Sun'

0     True
1     True
2    False
3    False
Name: weekday, dtype: bool

In [184]:
sales.loc[sales['weekday'] == 'Sun'].count()

weekday    2
city       2
bread      2
butter     2
dtype: int64

In [185]:
# Split the data, apply the function and finally combine the result

sales.groupby('weekday').count()

         city  bread  butter
weekday
Mon         2      2       2
Sun         2      2       2
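
The split, apply, combine steps can be spelled out by iterating over the groupby object, which yields one `(key, sub-DataFrame)` pair per group. This sketch rebuilds the same small table under the hypothetical name `sales_demo` and compares a hand-rolled sum with the built-in one.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

# Split: iterate over the groups. Apply: sum the bread column of each piece.
pieces = {}
for weekday, group in sales_demo.groupby("weekday"):
    pieces[weekday] = group["bread"].sum()

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="bread")

# The one-line groupby call performs the same three steps internally.
builtin = sales_demo.groupby("weekday")["bread"].sum()

print(manual)
print(builtin)
```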


In [186]:
sales

  weekday    city  bread  butter
0     Sun  Austin    139      20
1     Sun  Dallas    237      45
2     Mon  Austin    326      70
3     Mon  Dallas    456      98


In [187]:
print(sales.groupby('weekday'))

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C1A532DE80>


In [188]:
sales.groupby('weekday').groups

{'Mon': [2, 3], 'Sun': [0, 1]}

In [189]:
sales_g = sales.groupby('city')
sales_g

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001C1A532E0F0>

In [190]:
print(sales_g.groups)
print(type(sales_g.groups))
print(sales_g.groups.keys())

{'Austin': [0, 2], 'Dallas': [1, 3]}
<class 'pandas.io.formats.printing.PrettyDict'>
dict_keys(['Austin', 'Dallas'])


In [191]:
# Groupby and Sum
sales.groupby('weekday')['bread'].sum()

weekday
Mon    782
Sun    376
Name: bread, dtype: int64

In [192]:
# Sum multiple columns per group

sales.groupby('weekday')[['bread','butter']].sum()

         bread  butter
weekday
Mon        782     168
Sun        376      65


In [193]:
sales

  weekday    city  bread  butter
0     Sun  Austin    139      20
1     Sun  Dallas    237      45
2     Mon  Austin    326      70
3     Mon  Dallas    456      98


In [194]:
print(sales.groupby(['city','weekday']).groups)

{('Austin', 'Mon'): [2], ('Austin', 'Sun'): [0], ('Dallas', 'Mon'): [3], ('Dallas', 'Sun'): [1]}


In [195]:
# Grouping by two columns gives a multi-level (MultiIndex) row index

sales.groupby(['city','weekday']).mean()

                bread  butter
city   weekday
Austin Mon      326.0    70.0
       Sun      139.0    20.0
Dallas Mon      456.0    98.0
       Sun      237.0    45.0
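
A multi-level result can be indexed with a tuple, flattened back into ordinary columns, or pivoted so one level becomes the column headers. The following sketch rebuilds the table as `sales_demo` (a hypothetical name) and shows all three.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

by_city_day = sales_demo.groupby(["city", "weekday"])[["bread", "butter"]].mean()

# A tuple selects one combination of the two index levels.
austin_mon = by_city_day.loc[("Austin", "Mon"), "bread"]
print(austin_mon)

# reset_index turns the index levels back into regular columns.
flat = by_city_day.reset_index()
print(flat)

# unstack moves the weekday level from the rows to the columns.
wide = by_city_day["bread"].unstack("weekday")
print(wide)
```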


In [196]:
sales

  weekday    city  bread  butter
0     Sun  Austin    139      20
1     Sun  Dallas    237      45
2     Mon  Austin    326      70
3     Mon  Dallas    456      98


In [197]:
# Group by an external Series; its values are matched to the rows of sales by index label

customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers

0     Dave
1    Alice
2      Bob
3    Alice
dtype: object

In [198]:
sales

  weekday    city  bread  butter
0     Sun  Austin    139      20
1     Sun  Dallas    237      45
2     Mon  Austin    326      70
3     Mon  Dallas    456      98


In [199]:
sales.groupby(customers)['bread'].sum()

Alice    693
Bob      326
Dave     139
Name: bread, dtype: int64

Categorical data

In [200]:
sales['weekday'].unique()

array(['Sun', 'Mon'], dtype=object)

In [201]:
sales['weekday'] = sales['weekday'].astype('category')
sales['weekday']

0    Sun
1    Sun
2    Mon
3    Mon
Name: weekday, dtype: category
Categories (2, object): ['Mon', 'Sun']
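
Note that the categories above are listed alphabetically (`'Mon'`, `'Sun'`), which is not the calendar order. If the order matters, an ordered categorical type can be declared explicitly. This is a sketch assuming a week that starts on Sunday; `days` and `day_type` are hypothetical names.

```python
import pandas as pd

days = pd.Series(["Mon", "Sun", "Mon", "Sun"])

# Declare the categories in the intended order and mark them as ordered.
day_type = pd.CategoricalDtype(categories=["Sun", "Mon"], ordered=True)
ordered_days = days.astype(day_type)

# Sorting and min/max now follow the declared order, not the alphabet.
sorted_days = ordered_days.sort_values()
print(sorted_days)
print("first day:", ordered_days.min())
```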

Groupby and aggregation

In [202]:
sales.groupby('city')[['bread','butter']].max()

        bread  butter
city
Austin    326      70
Dallas    456      98


In [203]:
# Multiple aggregation
sales.groupby('city')[['bread','butter']].agg(['max','sum'])

       bread      butter
         max  sum    max  sum
city
Austin   326  465     70   90
Dallas   456  693     98  143


In [204]:
sales.groupby('weekday')[['bread', 'butter']].agg(['max','min'])


        bread      butter
          max  min    max min
weekday
Mon       456  326     98  70
Sun       237  139     45  20


In [206]:
# Custom aggregation function

def range1(s):
    return s.max() - s.min()

In [207]:
sales.groupby('weekday')[['bread', 'butter']].agg(range1)


         bread  butter
weekday
Mon        130      28
Sun         98      25


In [208]:
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':range1})

       bread  butter
Alice    693      53
Bob      326       0
Dave     139       0
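
The dictionary form of `agg` keeps the original column names, so the result does not say which function produced each column. Named aggregation, where each keyword is an output column name paired with a `(column, function)` tuple, makes that explicit. A sketch on the same table, rebuilt here as `sales_demo`; the output names `bread_total` and `butter_range` are made up for illustration.

```python
import pandas as pd

sales_demo = pd.DataFrame({
    "weekday": ["Sun", "Sun", "Mon", "Mon"],
    "city": ["Austin", "Dallas", "Austin", "Dallas"],
    "bread": [139, 237, 326, 456],
    "butter": [20, 45, 70, 98],
})

# Each keyword names an output column; the tuple says which input column
# to aggregate and with which function (a name or a callable).
summary = sales_demo.groupby("city").agg(
    bread_total=("bread", "sum"),
    butter_range=("butter", lambda s: s.max() - s.min()),
)
print(summary)
```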


### Query pandas dataframe

In [209]:
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
courses = pd.DataFrame(technologies)
courses

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


In [210]:
spark = courses.query("Courses == 'Spark'")
spark

  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000


In [211]:
# Using a Python variable inside the query string with @

value = 'Python'
python_course = courses.query("Courses == @value")
python_course

  Courses    Fee Duration  Discount
3  Python  24000     None      1200
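
The `@` prefix refers to a Python variable in the calling scope. A related piece of `query` syntax is backtick quoting, which is needed when a column name is not a valid Python identifier, for example because it contains a space. The sketch below uses a made-up table with a hypothetical `Course Fee` column.

```python
import pandas as pd

# Hypothetical table whose second column name contains a space.
fees = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop"],
    "Course Fee": [22000, 25000, 23000],
})

min_fee = 23000

# Backticks quote the column name; @ pulls in the local variable.
result = fees.query("`Course Fee` >= @min_fee")
print(result)
```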


In [212]:
courses

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


In [213]:
courses.query("Courses == 'Spark'", inplace=True)
courses

  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000


In [214]:
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
courses = pd.DataFrame(technologies)
courses


   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


In [215]:
print(courses.query("Courses != 'Spark'"))

   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


In [216]:
print(courses.query("Courses in ('Spark','PySpark')"))

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [217]:
values=['Spark','PySpark']
print(courses.query("Courses in @values"))

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [218]:
values=['Spark','PySpark']
print(courses.query("Courses not in @values"))

  Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200
4  Pandas  26000      NaN      2500


In [219]:
# Query by multiple conditions

print(courses.query("Fee >= 23000 and Fee <= 24000"))

  Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200
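
A range condition like this one can also be written with `Series.between`, which includes both endpoints by default. The sketch rebuilds a reduced version of the table as `courses_demo` (a hypothetical name) and checks that both forms select the same rows.

```python
import pandas as pd

courses_demo = pd.DataFrame({
    "Courses": ["Spark", "PySpark", "Hadoop", "Python", "Pandas"],
    "Fee": [22000, 25000, 23000, 24000, 26000],
})

# The query-string form used above.
fee_range_query = courses_demo.query("Fee >= 23000 and Fee <= 24000")

# The same filter as a boolean mask; between() is inclusive on both ends.
fee_range_mask = courses_demo[courses_demo["Fee"].between(23000, 24000)]

print(fee_range_mask)
print("same rows:", fee_range_query.equals(fee_range_mask))
```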


In [220]:
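# Using apply with a lambda: apply passes each column (a Series) to the lambda,
# which keeps only the entries selected by the boolean mask

print(courses.apply(lambda col: col[courses['Courses'].isin(['Spark','PySpark'])]))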

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [221]:
# Other examples you can try to query rows
courses[courses["Courses"] == 'Spark'] 

  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000


In [222]:
courses.loc[courses['Courses'] == value]

  Courses    Fee Duration  Discount
3  Python  24000     None      1200


In [223]:
courses.loc[courses['Courses'] != 'Spark']

   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
2   Hadoop  23000   30days      1000
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500


In [224]:
courses.loc[courses['Courses'].isin(values)]

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [225]:
courses.loc[~courses['Courses'].isin(values)]

  Courses    Fee Duration  Discount
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200
4  Pandas  26000      NaN      2500


In [226]:
courses.loc[(courses['Discount'] >= 1000) & (courses['Discount'] <= 2000)]

  Courses    Fee Duration  Discount
0   Spark  22000   30days      1000
2  Hadoop  23000   30days      1000
3  Python  24000     None      1200


In [227]:
courses.loc[(courses['Discount'] >= 1300) & (courses['Fee'] >= 23000 )]

   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
4   Pandas  26000      NaN      2500


In [228]:
# Select rows whose value contains a substring
print(courses[courses['Courses'].str.contains("Spark")])

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [229]:
# Case-insensitive match: convert the values to lowercase first
print(courses[courses['Courses'].str.lower().str.contains("spark")])

   Courses    Fee Duration  Discount
0    Spark  22000   30days      1000
1  PySpark  25000   50days      2300


In [230]:
# Select rows whose value starts with a given prefix
print(courses[courses['Courses'].str.startswith("P")])

   Courses    Fee Duration  Discount
1  PySpark  25000   50days      2300
3   Python  24000     None      1200
4   Pandas  26000      NaN      2500
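
The string filters above work because the `Courses` column has no missing values. On a column that does contain `NaN`, such as `Duration`, `str.contains` returns a missing value for those rows, which cannot be used directly as a boolean mask. A minimal sketch on made-up data (`names` is hypothetical) using the `na` and `case` parameters:

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry.
names = pd.Series(["Spark", "PySpark", np.nan, "Hadoop"])

# na=False treats missing values as "no match";
# case=False makes the match case-insensitive without calling str.lower().
mask = names.str.contains("spark", case=False, na=False)

print(mask)
print(names[mask])
```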
