# Pandas

Pandas is an open-source Python library used for data manipulation and analysis. It provides powerful tools to work with structured data, such as tabular data (like spreadsheets or SQL tables) and time-series data.

**Key Features of Pandas**
1. Data Structures:
    - **Series**: A one-dimensional labeled array (like a column in Excel).
    - **DataFrame**: A two-dimensional labeled table (like an Excel sheet or SQL table).
    - **Panel**: A three-dimensional data structure that was deprecated and then removed from Pandas (in version 0.25); use a MultiIndex DataFrame or a library such as xarray instead.
1. Data Manipulation:
    - Filter, slice, and subset data.
    - Handle missing data (NaN) effectively.
    - Reshape and pivot datasets.
1. Data Cleaning:
    - Replace, fill, or drop missing or incorrect values.
    - Detect and remove duplicate entries.
1. Integration:
    - Load data from various file formats like CSV, Excel, JSON, SQL, etc.
    - Export data to these formats.
1. Powerful Aggregation:
    - Grouping, summarizing, and applying custom functions for analysis.
1. Time-Series Support:
    - Perform operations on time-indexed data (resampling, shifting, rolling).
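
The time-series operations named in the list above (resampling, shifting, rolling) can be sketched as follows. The dates and values are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings, for illustration only
idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], index=idx)

# Resampling: aggregate the daily values into 2-day bins and sum each bin
two_day = ts.resample("2D").sum()

# Shifting: move every value one period later (the first slot becomes NaN)
lagged = ts.shift(1)

# Rolling: mean over a sliding window of 3 observations
rolling_mean = ts.rolling(window=3).mean()

print(two_day)
print(lagged)
print(rolling_mean)
```

The first two entries of `rolling_mean` are NaN because a full window of three observations is not yet available.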

**Why Use Pandas?**
- Simplifies data preparation tasks, which are crucial before analysis or visualization.
- Makes it easy to perform exploratory data analysis (EDA).
- Integrates seamlessly with other Python libraries like NumPy, Matplotlib, and Scikit-learn.
- Reduces the complexity of working with datasets compared to raw Python lists or dictionaries.

### Series

A **Pandas Series** is a one-dimensional, labeled array capable of holding any data type, such as integers, floats, strings, or Python objects. It is similar to a column in an Excel spreadsheet or a one-dimensional NumPy array, but with the added functionality of labels (indices) for each element.

**Key Characteristics of a Series**
- **Index**: Each element in a Series has a label (index). By default the labels are integers starting from 0; custom labels can also be used.
- **Data Types**: A Series has a single dtype, such as integer, float, string, or the generic `object` dtype for arbitrary Python objects.
- **Homogeneity**: Like a NumPy array, all elements of a Series share that one dtype. Values of mixed Python types can still be stored together, but the Series then falls back to the `object` dtype.

**Syntax:**

```pandas.Series(data, index=index)```
- **data**: The data to store (list, array, scalar value, or dictionary).
- **index**: (Optional) Custom labels for the Series.
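
As a minimal sketch of the other accepted `data` inputs, the example below builds a Series from a dictionary and from a scalar, and shows what happens to the dtype when values of mixed types are stored. The variable names and values are illustrative only.

```python
import pandas as pd

# From a dictionary: the keys become the index labels
from_dict = pd.Series({"a": 10, "b": 20, "c": 30})

# From a scalar: the value is repeated once for every label in index
from_scalar = pd.Series(5, index=["x", "y", "z"])

# Mixed Python values are accepted, but the Series falls back to dtype object
mixed = pd.Series([1, "two", 3.0])

print(from_dict)
print(from_scalar)
print(mixed.dtype)
```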

In [None]:
import pandas as pd
import numpy as np

In [None]:
series1 = pd.Series([1,2,3])
series1

In [None]:
series1= pd.Series(np.array([13,4,2]))
series1

In [None]:
type(series1)

In [None]:
series2 = pd.Series([13,4,2],index=['S1','S2','S3'])
series2

Index

In [None]:
prices = [10.70, 10.86, 10.74, 10.71, 10.79]
shares = pd.Series(prices)
shares

In [None]:
days = ['Mon', 'Tue', 'Wed', 'Thur', 'Fri']
shares = pd.Series(prices, index=days)
shares

In [None]:
# Examine the index
shares.index

In [None]:
shares.index[4]

In [None]:
shares.index[:4]

In [None]:
shares.index[-4:]

In [None]:
print(shares.index.name)

In [None]:
shares.index.name = 'weekdays'
shares

In [None]:
# An Index is immutable, so assigning to a single label raises a TypeError
shares.index[2] = 'Wednesday'

In [None]:
shares.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
shares

### DataFrame

A **Pandas DataFrame** is a two-dimensional, size-mutable, and heterogeneous tabular data structure with labeled axes (rows and columns). It is similar to a spreadsheet or SQL table and is the most commonly used data structure in Pandas for data manipulation and analysis.

**Key Characteristics of a DataFrame**
1. **Tabular Structure**:
    - Rows represent individual records.
    - Columns represent attributes or features of the data.
1. **Labeled Axes**:
    - Each row and column has a label (index for rows and column names for columns).
1. **Heterogeneous Data**:
    - A DataFrame can contain columns with different data types (e.g., integers, floats, strings).
1. **Size-Mutable**:
    - Rows and columns can be added, modified, or removed.
1. **Built-in Methods**:
    - Includes methods for cleaning, filtering, grouping, and analyzing data.

**Syntax:**
```pandas.DataFrame(data, index=index, columns=columns)```
- **data**: The data to populate the DataFrame (list, dictionary, array, or another DataFrame).
- **index**: (Optional) Custom row labels.
- **columns**: (Optional) Custom column labels
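
The "heterogeneous" and "size-mutable" characteristics described above can be illustrated with a small sketch. The table and its column names are hypothetical, chosen only for this example.

```python
import pandas as pd

# Hypothetical records, for illustration only
students = pd.DataFrame(
    {
        "roll_no": [1, 2, 3],
        "name": ["Asha", "Ben", "Chen"],
        "score": [81.5, 92.0, 77.25],
    },
    index=["a", "b", "c"],
)

# Heterogeneous: each column carries its own dtype
print(students.dtypes)

# Size-mutable: add a column, add a row, then remove a column
students["passed"] = students["score"] >= 80
students.loc["d"] = [4, "Dina", 88.0, True]
students = students.drop(columns="roll_no")

print(students)
```

Note that `drop` returns a new DataFrame rather than modifying the original, which is why the result is assigned back to `students`.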


In [None]:
# DataFrame: a 2D labeled, tabular structure with a row index and a column index

df = pd.DataFrame([[1,"Karthik"],[2,"Nikhil"],[3,"Anurabh"]])
df

In [None]:
type(df)

DataFrame From List

In [None]:
data = [
    [1,"Karthik"],
    [2,"Nikhil"],
    [3,"Anurabh"],
    [4,"Juhi"]]

index=['a','b','c','d']

df=pd.DataFrame(data, index=index, columns=["Roll No", "Name"])
df

Basic Function

In [None]:
df.count()

In [None]:
# len() gives the number of rows (the height) of the DataFrame.

len(df)

In [None]:
df.dtypes

In [None]:
df.shape

In [None]:
df.ndim

In [None]:
# Returns the number of elements in the DataFrame.
df.size

In [None]:
# Returns the data in the DataFrame as a NumPy ndarray
print(df.values)
type(df.values)

DataFrame from Dict

In [None]:
names = ['United States', 'Australia', 'Japan', 'India', 'Russia', 'Morocco', 'Egypt']
dr =  [True, False, False, False, True, True, True]
cpc = [809, 731, 588, 18, 200, 70, 45]

# Create dictionary my_dict with three key:value pairs
my_dict = {'country': names, 'drives_right': dr, 'cars_per_cap': cpc}

# Build a DataFrame cars from my_dict
cars = pd.DataFrame(my_dict)

print(cars)

In [None]:
# Assign custom row labels
cars.index = ['US', 'AUS', 'JAP', 'IN', 'RU', 'MOR', 'EG']
cars

Read CSV File

In [None]:
cars = pd.read_csv("data/cars.csv")
cars

In [None]:
cars = pd.read_csv("data/cars.csv", index_col=0)
cars

In [None]:
cars.head()

In [None]:
cars.tail()

Dataset info and shape

In [None]:
cars.info()

In [None]:
cars.axes

### Select Data

Columns

In [None]:
cars

In [None]:
cars['cars_per_cap']

In [None]:
type(cars['country'])

In [None]:
cars[['country', 'drives_right']]

In [None]:
type(cars[['country']])

Row Access

In [None]:
cars

In [None]:
cars[2:5]

loc

In [None]:
# Row as pandas series

cars.loc['IN']

In [None]:
cars.loc[['IN','US','EG']]

In [None]:
# Rows & Columns
cars.loc[['IN','US','EG'] , ['country','drives_right']]


In [None]:
cars.loc[:, ['country','drives_right']]

iloc

In [None]:
cars

In [None]:
cars.loc[['AUS']]

In [None]:
cars.iloc[[1]]

In [None]:
cars.iloc[[1,4,3]]

In [None]:
cars.loc[['AUS','JAP','IN'],['country','drives_right']]

In [None]:
cars.iloc[[1,2,3],[1,2]]

In [None]:
cars.iloc[:4,:2]

Filtering

In [None]:
pop = pd.read_csv('data/brics.csv', index_col=0)
pop

In [None]:
# pop['area']
# pop.loc[:, 'area']
pop.iloc[:,2]

In [None]:
exp = pop['area'] > 8 
exp

In [None]:
~ (pop['area'] < 13)

In [None]:
pop[ ~exp ]

In [None]:
exp = np.logical_and(pop["area"] > 8, pop["area"] < 10)
exp

In [None]:
pop[ exp ]

In [None]:
exp = (pop["area"] > 8) & (pop["area"] < 10)
exp

In [None]:
exp = pop["area"].between(8,10, inclusive='neither')
exp

Iteration

In [None]:
cars = pd.read_csv("data/cars.csv", index_col=0)
cars

In [None]:
for val in cars:
    print(val)

In [None]:
for idx, row in cars.iterrows():
    print(idx, row['cars_per_cap'])

In [None]:
for idx, row in cars.iterrows():
    print(idx, "\t: ", row['country'])

In [None]:
for idx, row in cars.iterrows():
    cars.loc[idx, 'COUNTRY'] = row['country'].upper()

cars

In [None]:
# Instead of a for loop, the apply method can be used to do the same.

cars = pd.read_csv("data/cars.csv", index_col=0)
cars

In [None]:
cars['COUNTRY'] = cars['country'].apply(str.upper)
cars

### Reading data from file

Import from flat file

In [None]:
file = 'data/titanic.csv'
df = pd.read_csv(file)
df.head()

In [None]:
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame
data_array = data.values
print(data_array)

In [None]:
print(type(data_array))

Import from excel file

In [None]:
# you have to install openpyxl module
# open your terminal and type the following command

# !pip install openpyxl

In [None]:

file = 'data/battledeath.xlsx'

# Load spreadsheet
xl = pd.ExcelFile(file)

print(xl.sheet_names)

In [None]:
df = xl.parse(1)

In [None]:
df.head()

# Cleaning Data

Missing Value

In [None]:
df = pd.DataFrame(
    np.random.randn(5, 3), 
    index=['a', 'c', 'e', 'f', 'h'],
    columns=['one', 'two', 'three'] )
df

In [None]:
df = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df

In [None]:
# Check for missing value: isnull()

print(df['one'].isnull())

In [None]:
# Check for missing value: notnull()

print(df['one'].notnull())

In [None]:
# sum the value in column: one

df.one.sum()

In [None]:
df

In [None]:
# Replace NaN with 0
print(df.one.fillna(0))

In [None]:
# Fill NA Forward pad/ffill

print(df.ffill())

In [None]:
# Fill NA Backward backfill/bfill

print(df.bfill())

In [None]:
# Drop NaN rows

print(df.dropna())

In [None]:
df

In [None]:
# Drop columns that contain NaN
# axis = 1 -> Column
# axis = 0 -> Row

print(df.dropna(axis=1))

In [None]:
df = pd.read_csv('data/literary_birth_rate.csv', sep=';')
df.head()

In [None]:
df.tail()

In [None]:
df.shape

In [None]:
df['Country '].count()

In [None]:
df.columns

In [None]:
df = df.rename(columns={'Country ':'country','Continent ':'continent','female literacy':'female_literacy'})
df.columns

In [None]:
df.info()

In [None]:
print(df.describe())

In [None]:
df.continent.value_counts(dropna=False)

In [None]:
df.country.value_counts(dropna=False).head()

In [None]:
df.fertility.value_counts(dropna=False).head()

**Melting**

Melting is a process in Pandas where you transform a wide-format DataFrame into a long-format DataFrame. This is particularly useful when preparing data for analysis or visualization that requires a specific structure.

**In a melted DataFrame:**

- Each row represents a single observation.
- Columns that previously held multiple variables are now combined into two or more columns:
    - A column for variable names.
    - A column for the corresponding values.

**Syntax:**

```pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value')```

**Parameters:**
- **frame**: The DataFrame to melt.
- **id_vars**: Columns that should remain as identifiers and not be melted.
- **value_vars**: Columns to melt into rows (default is all columns except id_vars).
- **var_name**: Name of the column created for variable names.
- **value_name**: Name of the column created for values (default is 'value').
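
A minimal, self-contained sketch of these parameters, including `value_vars`, is shown here. The small `wide` table is made up for illustration.

```python
import pandas as pd

# Hypothetical wide-format table: one row per city, one column per month
wide = pd.DataFrame(
    {
        "city": ["Austin", "Dallas"],
        "jan": [10, 12],
        "feb": [14, 15],
        "mar": [19, 20],
    }
)

# Keep 'city' as the identifier and melt only 'jan' and 'feb';
# 'mar' is not listed in value_vars, so it is left out of the result
long = pd.melt(
    wide,
    id_vars=["city"],
    value_vars=["jan", "feb"],
    var_name="month",
    value_name="temp",
)

print(long)
```

Each of the two input rows yields one output row per melted column, so the result has four rows.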

In [None]:
airquality = pd.read_csv('data/airquality.csv')

# Print the head of airquality
print(airquality.head())
print(len(airquality))

In [None]:
# Melt airquality
airquality_melt = pd.melt(frame=airquality, id_vars=['Month', 'Day'])
print(airquality_melt.head())
print(len(airquality_melt))

In [None]:
# Melt airquality
airquality_melt = pd.melt(
    frame=airquality,
    id_vars=['Month', 'Day'],
    var_name='measurement',
    value_name='reading')
    
print(airquality_melt.head())

**Pivot Table**

A **pivot table** in Pandas is a way to summarize, aggregate, and reorganize data in a DataFrame. It allows you to transform a long-format DataFrame into a summary table, using rows and columns for grouping and applying aggregation functions to calculate metrics.

It is particularly useful for analyzing and reporting data, similar to Excel pivot tables.

**Syntax:**

```pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False)```

**Parameters:**
- **data**: The DataFrame to pivot.
- **values**: Column(s) to aggregate.
- **index**: Keys to group by on the rows.
- **columns**: Keys to group by on the columns.
- **aggfunc**: Aggregation function (default is mean but can be sum, count, min, max, etc.).
- **fill_value**: Value to replace missing values.
- **margins**: If True, adds totals for rows and columns.
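
The sketch below exercises `aggfunc`, `fill_value`, and `margins` on a small, made-up `orders` table, so it can be run without any data file.

```python
import pandas as pd

# Hypothetical order lines, for illustration only
orders = pd.DataFrame(
    {
        "city": ["Austin", "Austin", "Dallas", "Dallas", "Austin"],
        "item": ["bread", "butter", "bread", "bread", "bread"],
        "qty": [3, 1, 2, 4, 5],
    }
)

summary = pd.pivot_table(
    orders,
    values="qty",
    index="city",
    columns="item",
    aggfunc="sum",   # total quantity instead of the default mean
    fill_value=0,    # Dallas has no butter rows, so show 0 rather than NaN
    margins=True,    # add an 'All' row and column with the totals
)

print(summary)
```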

In [None]:
# Pivot airquality_melt
airquality_pivot = airquality_melt.pivot_table(index=['Month', 'Day'], columns='measurement', values='reading')

print(airquality_pivot.head())
print(airquality_pivot.tail())
print(len(airquality_pivot))

In [None]:
# Reset the index of airquality_pivot
airquality_pivot = airquality_pivot.reset_index()

print(airquality_pivot.head())
print(airquality.head())

Split the column

In [None]:
tb = pd.read_csv('data/tb.csv')
tb

In [None]:
tb_melt = pd.melt(frame=tb, id_vars=['country', 'year'])
tb_melt

In [None]:
# Create the 'gender' column
tb_melt['gender'] = tb_melt.variable.str[0]

# Create the 'age_group' column
tb_melt['age_group'] = tb_melt.variable.str[1:]

# Print the head of tb_melt
print(tb_melt.head())

In [None]:
ebola = pd.read_csv('data/ebola.csv')
ebola

In [None]:
ebola_melt = pd.melt(ebola, id_vars=['Date', 'Day'], var_name='type_country', value_name='counts')
ebola_melt

In [None]:
# Create the 'str_split' column
ebola_melt['str_split'] = ebola_melt.type_country.str.split('_')
ebola_melt

In [None]:
# Create the 'type' column
ebola_melt['type'] = ebola_melt.str_split.str.get(0)

# Create the 'country' column
ebola_melt['country'] = ebola_melt.str_split.str.get(1)

print(ebola_melt.head())

Concatenating

In [None]:
uber1 = pd.read_csv('data/uber/uber1.csv')
uber2 = pd.read_csv('data/uber/uber2.csv')
uber3 = pd.read_csv('data/uber/uber3.csv')

In [None]:
uber = pd.concat([uber1, uber2, uber3])
print(uber.shape)
print(uber.head())

In [None]:
# read_csv does not expand wildcard patterns, so this call raises an error
uberx = pd.read_csv('data/uber/*.csv')

In [None]:
# Each file keeps its own 0-based index after concat, so one label can match several rows
uber.loc[98,:]

In [None]:
uber = pd.concat([uber1, uber2, uber3], ignore_index=True)
uber.loc[98,:]

In [None]:
# Iterating and concatenating all file matches using glob method

import glob

In [None]:
pattern = 'data/uber/*.csv'
csv_files = glob.glob(pattern)
csv_files

In [None]:
df_list = []

for csv in csv_files:
    df = pd.read_csv(csv)
    df_list.append(df)

uber = pd.concat(df_list, ignore_index=True)

print(uber.shape)
print(uber.head())

Converting data types

In [None]:
tips = pd.read_csv('data/tips.csv')

print(tips.head())

print("\n\n")
print(tips.info())

In [None]:
# Convert the sex column to type 'category'
tips.sex = tips.sex.astype('category')

# Convert the smoker column to type 'category'
tips.smoker = tips.smoker.astype('category')

# Print the info of tips
print(tips.info())

Categorical data
- Converting a column of repeated labels to the `category` dtype:
    - Can make the DataFrame smaller in memory.
    - Can allow other Python libraries to make use of the categories in analysis.
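
A rough way to check the memory claim is to compare `memory_usage(deep=True)` before and after the conversion. The sketch below uses a synthetic column of repeated labels rather than the `tips` data; exact byte counts depend on the Pandas version, so none are shown.

```python
import pandas as pd

# Synthetic column with only two distinct labels repeated many times
labels = pd.Series(["Male", "Female"] * 5000)
as_category = labels.astype("category")

# deep=True also counts the memory held by the string values themselves
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
print(as_category.cat.categories)
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a column has few unique values.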

In [None]:
# Convert 'total_bill' to a numeric dtype
tips['total_bill'] = pd.to_numeric(tips['total_bill'], errors='coerce')

# Convert 'tip' to a numeric dtype
tips['tip'] = pd.to_numeric(tips['tip'], errors='coerce')

# Print the info of tips
print(tips.info())

Function

In [None]:
tips = pd.read_csv('data/tips.csv')

def recode_sex(sex_value):
    if sex_value == 'Male':
        return 1   
    elif sex_value == 'Female':
        return 0
    else:
        return np.nan

tips['sex_recode'] = tips.sex.apply(recode_sex)

tips.head()

Dropping duplicate data

In [None]:
billboard = pd.read_csv('data/billboard.csv')
billboard.head()

In [None]:
tracks = billboard[['year','artist','track','time']]

# Print info of tracks
print(tracks.info())

In [None]:
# Drop the duplicates: tracks_no_duplicates
tracks_no_duplicates = tracks.drop_duplicates()

# Print info of tracks
print(tracks_no_duplicates.info())

Fill Missing Value

In [None]:
airquality = pd.read_csv('data/airquality.csv')
airquality.info()

In [None]:
# Calculate the mean of the Ozone column: oz_mean
oz_mean = airquality.Ozone.mean()

# Replace all the missing values in the Ozone column with the mean
airquality['Ozone'] = airquality.Ozone.fillna(oz_mean)

# Print the info of airquality
print(airquality.info())

### Groupby

In [None]:
sales = pd.DataFrame(
	{
	'weekday': ['Sun', 'Sun', 'Mon', 'Mon'],
	'city': ['Austin', 'Dallas', 'Austin', 'Dallas'],
	'bread': [139, 237, 326, 456],
	'butter': [20, 45, 70, 98]
	}
)

sales

In [None]:
sales['weekday'] == 'Sun'

In [None]:
sales.loc[sales['weekday'] == 'Sun']

In [None]:
# Split the data, apply the function and finally combine the result

sales.groupby('weekday').count()

In [None]:
sales

In [None]:
print(sales.groupby('weekday'))

In [None]:
sales.groupby('weekday').groups

In [None]:
sales_g = sales.groupby('city')
sales_g

In [None]:
print(sales_g.groups)
print(type(sales_g.groups))
print(sales_g.groups.keys())

In [None]:
# Groupby and Sum
sales.groupby('weekday')['bread'].sum()

In [None]:
# Sum multiple columns

sales.groupby('weekday')[['bread','butter']].sum()

In [None]:
sales

In [None]:
print(sales.groupby(['city','weekday']).groups)

In [None]:
# multi-level index

sales.groupby(['city','weekday']).mean()

In [None]:
sales

In [None]:
# Group by an external Series (matched to the rows of sales by index)

customers = pd.Series(['Dave','Alice','Bob','Alice'])
customers

In [None]:
sales

In [None]:
sales.groupby(customers)['butter'].sum()

Categorical data

In [None]:
sales['weekday'].unique()

In [None]:
sales['weekday'] = sales['weekday'].astype('category')
sales['weekday']

Groupby and aggregation

In [None]:
sales.groupby('city')[['bread','butter']].max()

In [None]:
# Multiple aggregation
sales.groupby('city')[['bread','butter']].agg(['max','sum'])

In [None]:
sales.groupby('weekday')[['bread', 'butter']].agg(['max','min'])

In [None]:
# Custom aggregation function

def range1(s):
    return s.max() - s.min()

In [None]:
sales.groupby('weekday')[['bread', 'butter']].agg(range1)

In [None]:
sales.groupby(customers)[['bread', 'butter']].agg({'bread':'sum', 'butter':range1})

### Query pandas dataframe

In [None]:
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
courses = pd.DataFrame(technologies)
courses

In [None]:
spark = courses.query("Courses == 'Spark'")
spark

In [None]:
# using variable

value = 'Spark'
spark = courses.query("Courses == @value")
spark

In [None]:
courses

In [None]:
# course = courses.query("Courses == @value")

courses.query("Courses == 'Spark'", inplace=True)
courses

In [None]:
technologies= {
    'Courses':["Spark","PySpark","Hadoop","Python","Pandas"],
    'Fee' :[22000,25000,23000,24000,26000],
    'Duration':['30days','50days','30days', None,np.nan],
    'Discount':[1000,2300,1000,1200,2500]
          }
courses = pd.DataFrame(technologies)
courses


In [None]:
print(courses.query("Courses != 'Spark'"))

In [None]:
print(courses.query("Courses in ('Spark','PySpark')"))

In [None]:
values=['Spark','PySpark']
print(courses.query("Courses in @values"))

In [None]:
values=['Spark','PySpark']
print(courses.query("Courses not in @values"))

In [None]:
# Query by multiple conditions

print(courses.query("Fee >= 23000 and Fee <= 24000"))

In [None]:
# Using apply with a lambda: each column is filtered with the same boolean mask

print(courses.apply(lambda row: row[courses['Courses'].isin(['Spark','PySpark'])]))

In [None]:
# Other examples you can try to query rows
courses[courses["Courses"] == 'Spark'] 

In [None]:
courses.loc[courses['Courses'] == value]

In [None]:
courses.loc[courses['Courses'] != 'Spark']

In [None]:
courses.loc[courses['Courses'].isin(values)]

In [None]:
courses.loc[~courses['Courses'].isin(values)]

In [None]:
courses.loc[(courses['Discount'] >= 1000) & (courses['Discount'] <= 2000)]

In [None]:
courses.loc[(courses['Discount'] >= 1300) & (courses['Fee'] >= 23000 )]

In [None]:
# Select based on value contains
print(courses[courses['Courses'].str.contains("P")])

In [None]:
# Select after converting values
print(courses[courses['Courses'].str.lower().str.contains("spark")])

In [None]:
#Select startswith
print(courses[courses['Courses'].str.startswith("P")])

In [None]:
courses.pop('Discount')

In [None]:
courses