# Chapter 6

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data easy and intuitive. It aims to be the foundational building block for practical, real-world data analysis in Python. Its stated broader goal is to become the most powerful and flexible open source data analysis and manipulation tool available in any language.

## Learning Goals

- Introduction
- Series
- DataFrame
- Filtering and Manipulation
- Reading Methods
- Groupby and Concatenation Operator

## Authors

- Mert Candar, mccandar@gmail.com
- Aras Kahraman, aras.kahraman@hotmail.com

## Learning Curve Boosters

https://github.com/kyclark/tiny_python_projects

https://github.com/Python-World/python-mini-projects/tree/master/projects

https://github.com/rlvaugh/Impractical_Python_Projects

## Goals

- Read
- Manipulate
- Analyze
- Compute


## Why Choose `pandas`

![Pros and cons of pandas](https://d2h0cx97tjks2p.cloudfront.net/blogs/wp-content/uploads/sites/2/2019/04/Pros-and-Cons-of-pandas.jpg)

## Series, `pandas.Series`

A `Series` is a one-dimensional labeled array: a vector of values paired with an index.

In [None]:
import numpy as np
import pandas as pd

### Creating a `Series` object

In [None]:
data_list = [10,20,30,40,50]

In [None]:
d = pd.Series(data = data_list)
d

In [None]:
d.index

In [None]:
labels_list = ["Dave","Oliver","Jane","Bobby","Sarah"]
d = pd.Series(data = data_list,index = labels_list)
d

In [None]:
d.index

In [None]:
arr = np.array([10,20,30,40,50])
pd.Series(arr)

In [None]:
d = pd.Series(arr,index = labels_list) # first argument is always data
d

In [None]:
d.dtype

In [None]:
d = pd.Series(data = labels_list,index = arr)
d

In [None]:
d.dtype

In [None]:
datadict = {"Dave":30,"Jane":80,"Oliver":60}
pd.Series(datadict)

### Indexing, `loc` & `iloc`

In [None]:
d = pd.Series(arr,index = labels_list)
d

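In [None]:
d.iloc[0] # plain d[0] relies on a positional fallback that recent pandas versions deprecate for non-integer indexes

In [None]:
d.iloc[1]

In [None]:
d.iloc[-1]

In [None]:
d[1:4] # integer slices select by position

In [None]:
d.iloc[[0,1,3]]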

In [None]:
d.iloc[1]

In [None]:
d.iloc[3]

In [None]:
d.iloc[1:3]

In [None]:
d.iloc[[0,3,4]]

In [None]:
d.loc[0] # raises KeyError: loc looks up labels, and 0 is not a label in this index

In [None]:
d.loc['Oliver']

In [None]:
d.loc['Jane']

In [None]:
d.loc['Oliver':'Sarah']

In [None]:
idx = ['Oliver','Dave','Bobby']
d.loc[idx]
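
The cells above use both accessors with slices, and the two differ in how they treat the end of the slice: `iloc` excludes the stop position, while `loc` includes the stop label. A minimal sketch of that difference, rebuilding the same `d` so the block runs on its own:

```python
import pandas as pd

# Same data as the cells above: five values labelled by name.
d = pd.Series([10, 20, 30, 40, 50],
              index=["Dave", "Oliver", "Jane", "Bobby", "Sarah"])

# iloc slices by position and excludes the stop position.
by_position = d.iloc[1:3]

# loc slices by label and includes the stop label.
by_label = d.loc["Oliver":"Bobby"]

print(by_position.index.tolist())  # ['Oliver', 'Jane']
print(by_label.index.tolist())     # ['Oliver', 'Jane', 'Bobby']
```
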

### Arithmetic Operations

In [None]:
browsers2019 = pd.Series([8000,200000,15000],["Safari","Chrome","Firefox"])
browsers2019

In [None]:
browsers2020 = pd.Series([30000,188000,21000],["Safari","Chrome","Firefox"])
browsers2020

In [None]:
browsers2019 + browsers2020

In [None]:
browsers2021 = pd.Series([13000,19000,122000],['Firefox',"Safari","Chrome"])
browsers2021

In [None]:
browsers2020 + browsers2021 # values are matched by index label, not by position

In [None]:
stck2017 = pd.Series([5,10,14,20],["PC","TV","PS","Phone"])

In [None]:
stck2018 = pd.Series([2,12,12,6],["PC","TV","Notebook","Phone"]) 

In [None]:
stck2017

In [None]:
stck2018

In [None]:
total_stock = stck2017 + stck2018

In [None]:
total_stock

In [None]:
stck2017.add?

In [None]:
total_stock.loc["PC"]

In [None]:
total_stock.loc["Phone"]

In [None]:
total_stock.loc["Notebook"] # NaN: "Notebook" is missing from stck2017

In [None]:
total_stock.loc["PS"] # NaN: "PS" is missing from stck2018
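
The `+` operator produces `NaN` for any label that appears in only one of the two Series. One way to treat a missing label as zero stock is the `fill_value` argument of `Series.add`. The sketch below rebuilds the two stock Series from above; the names `plain_sum` and `filled_sum` are introduced here only for illustration.

```python
import pandas as pd

stck2017 = pd.Series([5, 10, 14, 20], ["PC", "TV", "PS", "Phone"])
stck2018 = pd.Series([2, 12, 12, 6], ["PC", "TV", "Notebook", "Phone"])

# Plain addition: labels present in only one Series become NaN.
plain_sum = stck2017 + stck2018

# add() with fill_value=0 substitutes 0 for the missing side before adding.
filled_sum = stck2017.add(stck2018, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether zero is the right substitute depends on the data: it makes sense for stock counts, but it would be misleading for measurements where a missing value means "unknown".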

## DataFrame, `pandas.DataFrame`

A `DataFrame` is a two-dimensional labeled table: rows and columns each carry an index, and every column is a `Series`.

### Creating a `DataFrame` object

In [None]:
numbers = [
    [6, 30, 180.51],
    [3, 58, 8600.38],
    [2, 43, 2950.97]
]

In [None]:
pd.DataFrame(numbers)

In [None]:
df = pd.DataFrame(
    numbers, 
    index = ["user_1","user_2","user_3"], # row names
    columns = ["Basket Items","Purchased Count","Purchased Total"] # column names
)

In [None]:
df

In [None]:
data_dict = {
    "Basket Items": {"user_1": 6, "user_2": 3, "user_3": 2},
    "Purchased Count": {"user_1": 30, "user_2": 58, "user_3": 43},
    "Purchased Total": {"user_1": 180.51,"user_2": 8600.38,"user_3": 2950.97},
}

In [None]:
df = pd.DataFrame(data_dict)

In [None]:
df

In [None]:
# add a new column
df["Time on Page"] = pd.Series(np.random.exponential(3,size=3), index = ["user_1","user_2","user_3"])

In [None]:
df

### Indexing, `loc` & `iloc`

In [None]:
df[0,0] # raises KeyError: square brackets look up column names; use df.iloc[0,0] for positions

In [None]:
df["Basket Items"]

In [None]:
df[["Basket Items","Purchased Count"]]

In [None]:
type(df["Basket Items"])

In [None]:
type(df[["Basket Items"]])

In [None]:
df.iloc[0] # index 0 --> user_1

In [None]:
df.iloc[1]

In [None]:
df.iloc[[0, 2]]

In [None]:
df.iloc[0:2]

In [None]:
df.iloc[:,0]

In [None]:
df.iloc[:,1]

In [None]:
df.iloc[:,-1]

In [None]:
df.iloc[:,[0,1]]

In [None]:
df.loc["user_1"]

In [None]:
df.loc["user_2"]

In [None]:
df.loc[["user_1","user_2"]]

In [None]:
df.loc["user_1":"user_3"]

In [None]:
df.loc['user_1']['Purchased Count']

In [None]:
df.loc["user_1","Purchased Count"]

In [None]:
df.loc["user_2","Purchased Total"]

In [None]:
df.loc[["user_1","user_2"],["Purchased Count","Purchased Total"]]

In [None]:
# drop returns a copy without the given labels; df itself is unchanged
df.drop("user_1")

In [None]:
df.drop("user_1",axis=0)

In [None]:
df.drop("Basket Items",axis=1)

In [None]:
df.drop(["Basket Items",'Time on Page'],axis=1)
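
Because `drop` returns a new object, the result has to be assigned to a name if it is needed later. A small sketch of this, using a reduced version of the table above (the variable names `without_user` and `without_column` are only illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"Basket Items": [6, 3, 2], "Purchased Count": [30, 58, 43]},
    index=["user_1", "user_2", "user_3"],
)

# Rows are the default axis, so this removes the row labelled "user_1".
without_user = df.drop("user_1")

# columns= is an alternative spelling of axis=1.
without_column = df.drop(columns="Basket Items")

# The original frame still has all three rows and both columns.
print(df.shape)
```
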

In [None]:
df['Basket Items'] + df['Purchased Count']

In [None]:
df['Purchased Total'] / 8.7

### Filtering Operations

In [None]:
x = pd.Series([6,1,2,4,1,12,314,5,6,34,23,23,13214,1])

In [None]:
x

In [None]:
x > 23

In [None]:
mask = x > 23
x[mask]

In [None]:
mask = (x > 23) & (x <= 314)
x[mask]

In [None]:
mask = (x <= 6) | (x == 12)
x[mask]

In [None]:
df

In [None]:
mask = df['Basket Items'] > 3
df[mask]

In [None]:
mask

In [None]:
mask = (df['Basket Items'] > 3) | (df['Purchased Total'] > 5000)
df[mask]

In [None]:
df.loc[mask,'Purchased Count']

In [None]:
df['Purchased Count'].loc[mask]

In [None]:
mask = (df['Basket Items'] > 3) | (df['Purchased Total'] > 5000)
df.loc[mask]
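
Two further mask tools fit the same pattern: `~` inverts a boolean mask, and `isin` tests membership in a list of values. The sketch below rebuilds a small version of the user table so it runs on its own. Note that each comparison is wrapped in parentheses, because `&` and `|` bind more tightly than `>` or `==` in Python.

```python
import pandas as pd

df = pd.DataFrame(
    {"Basket Items": [6, 3, 2],
     "Purchased Count": [30, 58, 43],
     "Purchased Total": [180.51, 8600.38, 2950.97]},
    index=["user_1", "user_2", "user_3"],
)

mask = (df["Basket Items"] > 3) | (df["Purchased Total"] > 5000)

# ~ flips True and False, selecting the rows the mask rejected.
others = df.loc[~mask]

# isin builds a mask from a list of accepted values.
small_baskets = df.loc[df["Basket Items"].isin([2, 3])]

print(others.index.tolist())         # ['user_3']
print(small_baskets.index.tolist())  # ['user_2', 'user_3']
```
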

## Reading Data

In [None]:
df = pd.read_csv('data/ga_sample.csv')

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
df = pd.read_excel('data/ga_pv_sample.xlsx')

In [None]:
df.head()

In [None]:
df.info()

### Dataframe Operations

In [None]:
df = pd.DataFrame({
    "daily_page_views":np.random.randint(10**7,10**9,size=7), # integer bounds instead of float literals
    "number_of_customers":np.random.randint(10**5,10**7,size=7),
    "client":["Trendyol","Hepsiburada","N11","Amazon","Gittigidiyor","Çiçek Sepeti","Alınıyor"]
})

In [None]:
df

In [None]:
df = df.sort_values('client')
df

In [None]:
df.sort_values('number_of_customers',ascending=False)

In [None]:
df.index = df['client'].values
df = df.drop('client',axis=1)

In [None]:
df

In [None]:
df.mean()

In [None]:
df.mean(axis=1)

In [None]:
df['view_per_customer'] = df['daily_page_views'] / df['number_of_customers']

In [None]:
df

In [None]:
(np.random.exponential(20,size=9) + 20).reshape((-1,1))

In [None]:
new_column = pd.DataFrame([
    34.85711832,
    60.64934357,
    51.31118755,
    26.86458086,
    54.84203497,
    28.01588788,
    48.04544531
],
    index=df.index,
    columns=['avg_time_on_page'])

In [None]:
new_column

In [None]:
df = pd.concat([df,new_column],axis=1)

In [None]:
df

In [None]:
new_row = pd.DataFrame([[
    80973196,
    1866268,
    28.0605,
    39.545,
]],
    index=['Morhipo'],
    columns=['daily_page_views','number_of_customers','view_per_customer','avg_time_on_page'])

In [None]:
new_row

In [None]:
df

In [None]:
df = pd.concat([df,new_row],axis=0)

In [None]:
df

In [None]:
df = pd.DataFrame({
    "Day" : ["Friday","Tuesday","Sunday","Friday","Tuesday","Sunday","Friday","Tuesday","Sunday"],
    "Site":["N11","N11","N11","Hepsiburada","Hepsiburada","Hepsiburada","Trendyol","Trendyol","Trendyol"],
    "Customer":[100,250,500,200,600,800,300,700,750]
})

In [None]:
df

In [None]:
df.pivot_table(index = "Site",columns ="Day",values = "Customer")

In [None]:
df.pivot_table(index = "Day",columns ="Site",values = "Customer")
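
In the table above every (Site, Day) pair occurs exactly once, so the pivot simply rearranges the values. When a pair occurs more than once, `pivot_table` has to aggregate, and its default is the mean. The sketch below uses a small invented table (not the data above) in which N11 has two Friday rows, and compares the default with `aggfunc="sum"`.

```python
import pandas as pd

# Toy data, made up for this example: N11 appears twice on Friday.
visits = pd.DataFrame({
    "Day": ["Friday", "Tuesday", "Friday", "Tuesday", "Friday"],
    "Site": ["N11", "N11", "Trendyol", "Trendyol", "N11"],
    "Customer": [100, 250, 300, 700, 200],
})

# Default aggregation is the mean of the duplicated cells.
mean_table = visits.pivot_table(index="Site", columns="Day", values="Customer")

# aggfunc switches to another aggregation, here the sum.
sum_table = visits.pivot_table(index="Site", columns="Day", values="Customer",
                               aggfunc="sum")

print(mean_table.loc["N11", "Friday"])  # 150.0
print(sum_table.loc["N11", "Friday"])   # 300
```
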

### What If the Data Is Messy?

In [None]:
from datetime import datetime

In [None]:
df = pd.read_csv('data/Covid19 India.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df = pd.read_csv(
    "data/Covid19 India.csv",
    parse_dates=[1],
    date_format="%d-%m-%Y", # pandas >= 2.0; earlier versions took a date_parser callable instead
    index_col=0)

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.sum(numeric_only=True) # sum only the numeric columns; dates cannot be summed

In [None]:
df['Date'].min()

In [None]:
df['Date'].max()

In [None]:
df['State/UnionTerritory'].unique()

In [None]:
df['State/UnionTerritory'].value_counts()

In [None]:
df.groupby('State/UnionTerritory')

In [None]:
gpd = df.groupby('State/UnionTerritory').sum()
gpd

In [None]:
mask = gpd['ConfirmedIndianNational'] < gpd['ConfirmedForeignNational']
gpd[mask]
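
`groupby` on its own only describes the split; a result appears once an aggregation such as `sum` is applied to each group. The sketch below shows this on a tiny invented table, and also uses `agg` to compute several statistics in one call. The column names `State`, `Confirmed` and `Deaths` and all of the numbers are placeholders, not values from the CSV file.

```python
import pandas as pd

# Toy data, made up for this example.
cases = pd.DataFrame({
    "State": ["State A", "State A", "State B", "State B", "State B"],
    "Confirmed": [3, 5, 1, 2, 4],
    "Deaths": [0, 1, 0, 0, 1],
})

# One row per state, each numeric column summed within its group.
totals = cases.groupby("State").sum()

# agg applies several aggregations to one column at once.
summary = cases.groupby("State")["Confirmed"].agg(["sum", "max", "count"])

print(totals)
print(summary)
```
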

In [None]:
df = pd.read_csv("data/Zynga Stock Prices.csv")
df.head()

In [None]:
df.info()

In [None]:
def dropper(x):
    # Remove the dollar sign so the price can be parsed as a float
    out = x.strip("$")
    return float(out)

df = pd.read_csv(
    "data/Zynga Stock Prices.csv",
    parse_dates=[0],
    date_format="%m/%d/%Y", # pandas >= 2.0; earlier versions took a date_parser callable instead
    converters={"Open":dropper,"Close/Last":dropper,"High":dropper,"Low":dropper},
    index_col=0
)

In [None]:
df.head()

In [None]:
df.info()
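
The same cleaning steps can be tried without the data file by reading a CSV from an in-memory string. The sketch below assumes a file layout similar to the one above; the two rows of prices are invented for illustration. It converts the dates with `pd.to_datetime` after loading, which is an alternative to passing date options to `read_csv`.

```python
import io
import pandas as pd

# Made-up sample rows in a layout similar to the stock price file.
raw = io.StringIO(
    "Date,Close/Last,Volume\n"
    "01/04/2021,$9.87,100\n"
    "01/05/2021,$10.12,150\n"
)

def strip_dollar(text):
    # Converters receive each cell as a string.
    return float(text.strip("$"))

prices = pd.read_csv(raw, converters={"Close/Last": strip_dollar})

# Parse the month/day/year strings into timestamps, then index by date.
prices["Date"] = pd.to_datetime(prices["Date"], format="%m/%d/%Y")
prices = prices.set_index("Date")

print(prices.dtypes)
```
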

### Concatenation

In [None]:
dataset1 = {
    "Trendyol"   : ["Phone","PC","Jean","Necklace"],
    "N11"        : ["TV","Camera","T-Shirt","Watch"],
    "Hepsiburada": ["PS","X-Box","Printer","Shoe"]
}

In [None]:
dataset2 = {
    "Trendyol"   : ["Headphone","Sweatshirt","Chair","Jacket"],
    "N11"        : ["Bed","Curtain","Mirror","Glass"],
    "Hepsiburada": ["Carpet","Iron","Mug","Bag"]
}

In [None]:
df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2)

In [None]:
df1

In [None]:
df2

In [None]:
d = pd.concat([df1,df2])
d

In [None]:
d.loc[0] # problem: label 0 now appears twice, so this returns two rows

In [None]:
pd.concat([df1,df2],ignore_index=True)

In [None]:
d = pd.concat([df1,df2],axis = 1) # side by side: each column name now appears twice
d

In [None]:
d.loc[:,'Hepsiburada']

In [None]:
df1 = pd.DataFrame(dataset1)
df2 = pd.DataFrame(dataset2,index=[4,5,9,11])

In [None]:
pd.concat([df1,df2],axis = 1) # rows are aligned by index label; labels missing on one side are filled with NaN
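
Two `concat` options help with the index issues seen above: `join="inner"` keeps only the labels shared by both frames, and `keys` adds an outer index level that records which frame each row came from. A minimal sketch with two small frames whose indexes overlap only at label 1 (the frames are cut-down versions of the ones above):

```python
import pandas as pd

df1 = pd.DataFrame({"Trendyol": ["Phone", "PC"]}, index=[0, 1])
df2 = pd.DataFrame({"Trendyol": ["Chair", "Jacket"]}, index=[1, 2])

# Default outer join: every label from either frame, NaN where one side is missing.
outer = pd.concat([df1, df2], axis=1)

# Inner join: only the labels present in both frames.
inner = pd.concat([df1, df2], axis=1, join="inner")

# keys labels each block, so repeated row labels stay distinguishable.
stacked = pd.concat([df1, df2], keys=["first", "second"])

print(outer)
print(inner)
print(stacked.loc[("second", 1), "Trendyol"])  # Chair
```
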

# Example

In [None]:
df = pd.read_csv('data/ga_sample.csv')
df.head()

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df['Page'].nunique()

In [None]:
df.sort_values('Bounce Rate',ascending=False)

In [None]:
q = df['Pageviews'].quantile(0.8)
q

In [None]:
mask = df['Pageviews'] > q

In [None]:
res = df.loc[mask]
res

In [None]:
res = res.sort_values('Unique Pageviews',ascending=False)
res

In [None]:
d = pd.Series(res['Unique Pageviews'].values,index=res['Page'])

In [None]:
d

In [None]:
import matplotlib.pyplot as plt
d.plot(kind='bar')
plt.show()

# Next Week

- More detailed examples
- Google Analytics connection (hopefully)
- Visualization

# References

https://data-flair.training/blogs/advantages-of-python-pandas/

https://www.electrictoolbox.com/examples/google-analytics-data.csv.html