# **Pandas**
## **Data Structures**
Pandas provides two fundamental data structures: **Series** and **DataFrame**.

**Series:**&emsp;A 1-D labeled array that can hold data of any type (integers, strings, user-defined objects, and so on). <br>
&emsp;&emsp;**Labeling:**&emsp;Each element in a Series has a label (its index), which makes access and manipulation easy. <br>
&emsp;&emsp;**Homogeneous Data:**&emsp;A Series has a single dtype for all of its elements; mixed values fall back to the generic object dtype. <br>
&emsp;&emsp;**Vectorized Operations:**&emsp;Efficient element-wise calculation: an operation applies to the entire Series without the need for explicit loops. <br>
&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;In simple words, an arithmetic operation performed on a Pandas Series is applied to each of its elements. Let's illustrate this with an example!
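
As a minimal sketch of what "vectorized" means here, the snippet below doubles every element twice: once with an explicit Python loop and once with a single expression on the Series. The name `prices` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
prices = pd.Series([100, 250, 400], name="price")

# Loop-based version: visits each element explicitly
looped = pd.Series([p * 2 for p in prices], name="price")

# Vectorized version: one expression, applied to every element
doubled = prices * 2

print(doubled.tolist())        # [200, 500, 800]
print(doubled.equals(looped))  # True
```

Both versions give the same values; the vectorized form is shorter and pushes the loop into pandas itself.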

In [2]:
# Importing pandas
import pandas as pd            # 'pd' is the conventional alias. It is optional, but almost all pandas code uses it.

## Pandas Series

<br>**pandas.Series(data=None, index=None, dtype=None, name=None, copy=None, fastpath=False)**<br><br>
Let's explore it!

In [3]:
# Empty Series
print(pd.Series())   # data is optional; omitting it yields an empty Series
print(pd.Series([])) # passing an empty list as data yields the same result

Series([], dtype: object)
Series([], dtype: object)


In [4]:
# Datatype of Pandas Series
type(pd.Series())

pandas.core.series.Series

In [5]:
# Series with list data
data = [1,2,3,4,5]
series = pd.Series(data, name="MySeries")  # name labels the Series (it becomes the column name if the Series is turned into a DataFrame); it doesn't change how the Series computes.
series

0    1
1    2
2    3
3    4
4    5
Name: MySeries, dtype: int64

In [6]:
# Series with dictionary
data = {'a':1, 'b':2, 'c':3, 'd':4, 'e':5}
series = pd.Series(data)
print(series)
print(series['b'])

a    1
b    2
c    3
d    4
e    5
dtype: int64
2


In [7]:
'''
    Will integer (positional) indexing still work when the index has been customized?
    For now, yes, but plain square brackets are not the way to do it:
    print(series[2])   # Returns 3 with a FutureWarning. This positional fallback is deprecated and may stop working.
    Use .iloc for position-based access instead, as below.
'''
# print(series[2])
series.iloc[2]

3
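
To make the label-versus-position distinction explicit, here is a small sketch contrasting `.loc` (labels) with `.iloc` (positions) on the same dictionary-built Series. The variable names are illustrative only.

```python
import pandas as pd

series = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})

by_label = series.loc['c']      # label-based lookup
by_position = series.iloc[2]    # position-based lookup (0-indexed)

# Label slices include both endpoints; position slices exclude the stop.
label_slice = series.loc['b':'d']
position_slice = series.iloc[1:3]

print(by_label, by_position)     # 3 3
print(label_slice.tolist())      # [2, 3, 4]
print(position_slice.tolist())   # [2, 3]
```

Being explicit with `.loc` and `.iloc` avoids any ambiguity about whether an integer means a label or a position.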

As you can see, when a dictionary is given as input, its keys become the index and its values become the entries. A list carries no labels, so it gets the default integer index. To get the same result from lists, we need two of them: one holding the labels and the other holding the values. Let's try that as well!

In [8]:
# Customized index
data = ['Alice', 'Bob', 'Catherine', 'Donovan']
index = ['A','B','C','D']
series = pd.Series(data, index=index)  # the index argument supplies the labels for the given data
series

A        Alice
B          Bob
C    Catherine
D      Donovan
dtype: object
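
The index argument can also be combined with dictionary input. In that case it selects and orders entries by key, and a label with no matching key gets NaN. A short sketch with made-up keys:

```python
import pandas as pd

data = {'a': 1, 'b': 2, 'c': 3}

# 'c' and 'a' exist in the dictionary; 'z' does not, so it becomes NaN.
picked = pd.Series(data, index=['c', 'a', 'z'])

print(picked.index.tolist())  # ['c', 'a', 'z']
print(picked.dtype)           # float64, because NaN forces a float dtype
```

Note that key 'b' is dropped because it is not listed in the index.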

In [9]:
# Series with tuple data
pd.Series(data = ('a','b','c'))

0    a
1    b
2    c
dtype: object

In [10]:
# Series with scalar value as data
pd.Series(data=45)

0    45
dtype: int64

In [11]:
# Series with datatype changing
data_list = [1,2,3.0,4]
pd.Series(data = data_list)  # pd.Series(data_list) works too: 'data' is the first parameter, so it can be passed positionally.

0    1.0
1    2.0
2    3.0
3    4.0
dtype: float64

In [12]:
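# As you can see, a single float entry (3.0) upcast every element to float64.
# Note: float64 and int64 both take 8 bytes per element, so this is about getting the intended type, not about saving memory.
# Use the dtype argument to request integers explicitly.
pd.Series(data_list, dtype=int) # 'data' is passed positionally here. dtype=int is platform-dependent: int32 in the output below, int64 on many other setups.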

0    1
1    2
2    3
3    4
dtype: int32
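
To see how much space each dtype really takes, the sketch below compares the `nbytes` attribute of the same four values stored as int64, float64 and int32. Using explicit widths such as "int64" instead of plain `int` also makes the result independent of the platform.

```python
import pandas as pd

values = [1, 2, 3, 4]

as_int64 = pd.Series(values, dtype="int64")
as_float64 = pd.Series(values, dtype="float64")
as_int32 = pd.Series(values, dtype="int32")

# nbytes counts only the data buffer: number of elements * bytes per element
print(as_int64.nbytes)    # 32
print(as_float64.nbytes)  # 32
print(as_int32.nbytes)    # 16
```

So int64 and float64 cost the same; only a narrower type such as int32 saves memory.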

<br><br>**Operations on Series**

In [13]:
# Copying a series
orig_series = pd.Series(['America', 'Berlin','Canada','Delhi'])
orig_series

0    America
1     Berlin
2     Canada
3      Delhi
dtype: object

In [14]:
copy1 = pd.Series(orig_series, copy=False)
print("Before Alteration:")
print(copy1)                              # Printing before alteration
copy1[1] = 'Bombay'                       # Alteration
print("\nAfter Alteration:")
print(copy1)                              # Printing after alteration
print("\nOriginal Series:")
print(orig_series)

Before Alteration:
0    America
1     Berlin
2     Canada
3      Delhi
dtype: object

After Alteration:
0    America
1     Bombay
2     Canada
3      Delhi
dtype: object

Original Series:
0    America
1     Bombay
2     Canada
3      Delhi
dtype: object


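<br><br> As you can see, when copy=False, alterations made through the new Series are also reflected in the original one, because both share the same underlying data. <br>
To avoid this, we can pass copy=True. <br><br>
copy=True  --> Deep Copy (the data is duplicated)<br>
copy=False --> Shallow Copy (the data is shared)<br>
<br>Note: the sharing shown above depends on the pandas version. With Copy-on-Write enabled (the default from pandas 3.0), a write through one object no longer changes the other. <br>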

In [15]:
orig_series = pd.Series(['America', 'Berlin','Canada','Delhi'])
copy1 = pd.Series(orig_series, copy=True)       # copy=True ensures that alterations on the copy are not reflected in the original Series.
print("Before Alteration:")
print(copy1)                                # Printing before alteration
copy1[1] = 'Bombay'                         # Alteration
print("\nAfter Alteration:")
print(copy1)                                # Printing after alteration
print("\nOriginal Series:")
print(orig_series)

Before Alteration:
0    America
1     Berlin
2     Canada
3      Delhi
dtype: object

After Alteration:
0    America
1     Bombay
2     Canada
3      Delhi
dtype: object

Original Series:
0    America
1     Berlin
2     Canada
3      Delhi
dtype: object


In [16]:
# Alternate Approach
orig_series = pd.Series(['America', 'Berlin','Canada','Delhi'])
copy1 = orig_series.copy()  # Deep copy
copy1[2] = "Banana"
orig_series

0    America
1     Berlin
2     Canada
3      Delhi
dtype: object

In [17]:
orig_series = pd.Series(['America', 'Berlin','Canada','Delhi'])
copy1 = orig_series.copy(deep=False)  # Shallow copy
copy1[2] = "Banana"
orig_series

0    America
1     Berlin
2     Banana
3      Delhi
dtype: object
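
One way to check whether two Series share data, rather than inferring it from a mutation, is NumPy's `shares_memory`. The sketch below uses numeric data (an assumption made so that `to_numpy()` can return a view of the underlying buffer); the exact result for the shallow copy may vary with the pandas version and its Copy-on-Write setting.

```python
import numpy as np
import pandas as pd

orig = pd.Series([10, 20, 30])

deep = orig.copy()               # deep=True is the default
shallow = orig.copy(deep=False)  # new Series object over the same data

# Compare the underlying buffers instead of mutating anything
shares_deep = np.shares_memory(orig.to_numpy(), deep.to_numpy())
shares_shallow = np.shares_memory(orig.to_numpy(), shallow.to_numpy())

print(shares_deep)      # False: the deep copy owns its own data
print(shares_shallow)   # typically True: the buffer is shared

# Writing to the deep copy never touches the original
deep.iloc[0] = 99
print(orig.iloc[0])     # 10
```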

<br><br> **fastpath Parameter:** This is not a commonly used parameter. It is an internal optimization flag rather than part of the public API, so ordinary code should leave it at its default. Recent pandas releases deprecate it, and it may be removed in a future version.

<br><br>**Other Functionalities:**

In [18]:
series

A        Alice
B          Bob
C    Catherine
D      Donovan
dtype: object

In [21]:
# Mathematical and Statistical Operations
series = pd.Series([1,2,3,4])
print(series.sum())
print(series.mean())
print(series.median())
print(series.min())
print(series.max())
print(series.std())
print(series.var())

10
2.5
2.5
1
4
1.2909944487358056
1.6666666666666667
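
The standard deviation above (about 1.29) is the sample standard deviation: pandas divides by n-1 by default (`ddof=1`). A short sketch of how to get the population version instead, plus `describe()` for a quick summary:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4])

sample_std = series.std()            # ddof=1 by default (divide by n-1)
population_std = series.std(ddof=0)  # divide by n instead

summary = series.describe()          # count, mean, std, min, quartiles, max

print(sample_std)       # about 1.291
print(population_std)   # about 1.118
print(summary["mean"])  # 2.5
```

NumPy's `std` defaults to `ddof=0`, so the two libraries give different numbers for the same data unless `ddof` is set explicitly.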


In [22]:
# Element wise operations
series1 = pd.Series([1,2,3])
series2 = pd.Series([4,5,6])
print(series1+series2)                # Addition of 2 series
print(series1-series2)                # Subtraction of 2 series
print(series1*series2)                # Multiplication of 2 series
print(series1/series2)                # Division of 2 series
print(series1*5)                      # Scalar Multiplication on a series

0    5
1    7
2    9
dtype: int64
0   -3
1   -3
2   -3
dtype: int64
0     4
1    10
2    18
dtype: int64
0    0.25
1    0.40
2    0.50
dtype: float64
0     5
1    10
2    15
dtype: int64
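
The example above works position by position only because both Series share the same default index. Arithmetic between Series actually aligns on index labels, which the following sketch (with made-up labels) makes visible:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels present in only one Series produce NaN
aligned = s1 + s2
print(aligned.index.tolist())   # ['a', 'b', 'c', 'd']
print(aligned['b'], aligned['c'])   # 12.0 23.0

# The method form accepts fill_value to treat a missing side as 0
filled = s1.add(s2, fill_value=0)
print(filled.tolist())          # [1.0, 12.0, 23.0, 30.0]
```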


In [23]:
# Boolean indexing conditions
data = [1,2,3,4,5]
series = pd.Series(data)
series>2

0    False
1    False
2     True
3     True
4     True
dtype: bool

In [24]:
# Retrieval of elements satisfying condition
series[series>2]

2    3
3    4
4    5
dtype: int64
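
Conditions can be combined as well. The sketch below uses `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` and `<`.

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

between = series[(series > 2) & (series < 5)]   # both conditions hold
outside = series[(series < 2) | (series > 4)]   # at least one holds
not_big = series[~(series > 2)]                 # negation of a condition

print(between.tolist())   # [3, 4]
print(outside.tolist())   # [1, 5]
print(not_big.tolist())   # [1, 2]
```

Python's `and`, `or` and `not` keywords do not work element-wise on a Series, which is why the symbolic operators are used here.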

In [25]:
# Sorting of values
series1 = pd.Series([57,23,73,43,35,67,86,43,21])
series1.sort_values()

8    21
1    23
4    35
3    43
7    43
0    57
5    67
2    73
6    86
dtype: int64

In [26]:
series1             # The sort wasn't reflected in the series itself.

0    57
1    23
2    73
3    43
4    35
5    67
6    86
7    43
8    21
dtype: int64

In [27]:
# To make the changes in the series itself
series1.sort_values(inplace=True)        # series1 = series1.sort_values()  this will also work
series1

8    21
1    23
4    35
3    43
7    43
0    57
5    67
2    73
6    86
dtype: int64

In [28]:
# Descending sort
series1 = pd.Series([57,23,73,43,35,67,86,43,21])
series1 = series1.sort_values(ascending=False)       # Descending Sort
series1

6    86
2    73
5    67
0    57
3    43
7    43
4    35
1    23
8    21
dtype: int64
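
Notice that sorting by value keeps each element's original label. Two follow-up steps are often useful afterwards, sketched here on a shorter, made-up Series: `sort_index()` to get back to label order and `reset_index(drop=True)` to renumber from 0.

```python
import pandas as pd

series1 = pd.Series([57, 23, 73, 43])

sorted_vals = series1.sort_values()
print(sorted_vals.index.tolist())    # [1, 3, 0, 2]

restored = sorted_vals.sort_index()  # back to the original label order
print(restored.tolist())             # [57, 23, 73, 43]

renumbered = sorted_vals.reset_index(drop=True)  # fresh 0..n-1 index
print(renumbered.index.tolist())     # [0, 1, 2, 3]
print(renumbered.tolist())           # [23, 43, 57, 73]
```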

In [29]:
# Handling missing data
series2 = pd.Series([1,2,None,4,5])
series2

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [30]:
series2.isnull()   # Flags missing entries (None/NaN). NOTE: placeholders such as '-', the string 'NA' or 0 may also stand for missing data, but this function does not recognize them.

0    False
1    False
2     True
3    False
4    False
dtype: bool
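
When missing values arrive as text placeholders, one possible approach (a sketch, assuming the column is meant to be numeric) is `pd.to_numeric` with `errors='coerce'`, which turns anything unparseable into NaN. The sample values are made up.

```python
import pandas as pd

# '-' and 'NA' are ordinary strings here, so isnull() does not flag them
raw = pd.Series(['1', '-', 'NA', '4'])
print(int(raw.isnull().sum()))    # 0

# Values that cannot be parsed as numbers become NaN
cleaned = pd.to_numeric(raw, errors='coerce')
print(cleaned.isnull().tolist())  # [False, True, True, False]
```

Coercion also hides genuinely malformed entries, so it is worth inspecting the affected rows before relying on it.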

In [31]:
series2.fillna(0) # Filling all null values with 0
series2           # Changes not made inplace.

0    1.0
1    2.0
2    NaN
3    4.0
4    5.0
dtype: float64

In [32]:
series2.fillna(0, inplace=True) 
series2           # Now it's done

0    1.0
1    2.0
2    0.0
3    4.0
4    5.0
dtype: float64
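
Filling with 0 is only one option. The sketch below shows two common alternatives: dropping the missing entries with `dropna()`, and filling them with the mean of the remaining values. It assigns the results to new names instead of using inplace=True, which leaves the original untouched.

```python
import pandas as pd

series2 = pd.Series([1, 2, None, 4, 5])

dropped = series2.dropna()                    # remove missing entries
filled_mean = series2.fillna(series2.mean())  # mean() skips NaN: 12 / 4 = 3.0

print(dropped.index.tolist())        # [0, 1, 3, 4]
print(filled_mean.tolist())          # [1.0, 2.0, 3.0, 4.0, 5.0]
print(int(series2.isnull().sum()))   # 1, the original still has its NaN
```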

In [33]:
# Vectorized String operations
series2 = pd.Series(["apple", "ball", "cat"])

print(series2.str.capitalize())               # Capitalize -- Only first letter caps
print()
print(series2.str.upper())                    # upper -- all letter caps
print()
print(series2.str.lower())                    # lower -- all letters lower   


0    Apple
1     Ball
2      Cat
dtype: object

0    APPLE
1     BALL
2      CAT
dtype: object

0    apple
1     ball
2      cat
dtype: object
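
The `.str` accessor is not limited to changing case. A brief sketch of a few other methods, using the same three words:

```python
import pandas as pd

words = pd.Series(["apple", "ball", "cat"])

lengths = words.str.len()            # number of characters per element
has_ll = words.str.contains("ll")    # substring test, returns booleans
starts_with_c = words[words.str.startswith("c")]  # boolean result used as a filter

print(lengths.tolist())         # [5, 4, 3]
print(has_ll.tolist())          # [False, True, False]
print(starts_with_c.tolist())   # ['cat']
```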


In [37]:
# Element-wise functions
print(series)

# Let's square each term
series.apply(lambda x:x**2)   # series**2 will also perform this task

0    1
1    2
2    3
3    4
4    5
dtype: int64


0     1
1     4
2     9
3    16
4    25
dtype: int64
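
Squaring has a direct vectorized form, so apply() is not really needed for it. apply() is more useful when the per-element logic is arbitrary Python, as in this sketch with a hypothetical `parity` helper:

```python
import pandas as pd

series = pd.Series([1, 2, 3, 4, 5])

def parity(x):
    """Hypothetical helper: label a number as 'even' or 'odd'."""
    return "even" if x % 2 == 0 else "odd"

labels = series.apply(parity)   # runs parity() once per element
squares = series ** 2           # vectorized, no Python-level function call

print(labels.tolist())    # ['odd', 'even', 'odd', 'even', 'odd']
print(squares.tolist())   # [1, 4, 9, 16, 25]
```

When a vectorized expression exists it is usually the better choice, since apply() calls the Python function once per element.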

In [47]:
# A typical example where apply() comes in handy (here on a DataFrame, row by row)
# Calculating salary increment based on experience.

# Sample data
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 75000],
    'Experience (years)': [3, 5, 8]
}
df = pd.DataFrame(data)
print(df)
print()

# Implementation
def calc_increment(row):
    '''
        Sample logic:
            if experience > 5 years:          20% salary increment
            otherwise (5 years or less):      10% salary increment
    '''
    salary = row["Salary"]
    experience = row["Experience (years)"]

    if experience > 5:
        salary = salary + salary*20/100
    else:
        salary = salary + salary*10/100

    row["Salary"] = salary
    return row

# Expected output
'''
    50000+10% = 55000, 
    60000+10% = 66000,
    75000+20% = 90000
'''

# Applying function
df = df.apply(calc_increment, axis=1)
print(df)

      Name  Salary  Experience (years)
0    Alice   50000                   3
1      Bob   60000                   5
2  Charlie   75000                   8

      Name   Salary  Experience (years)
0    Alice  55000.0                   3
1      Bob  66000.0                   5
2  Charlie  90000.0                   8
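
For comparison, the same rule can be written without a row-wise apply(). This is an alternative sketch, not the approach used above: it relies on NumPy's `where` to pick the percentage per row and on integer arithmetic so the salaries stay integers. The name `percent` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Salary': [50000, 60000, 75000],
    'Experience (years)': [3, 5, 8],
})

# 20 where experience > 5, otherwise 10 (one value per row)
percent = np.where(df['Experience (years)'] > 5, 20, 10)

# Integer arithmetic: multiply first, then floor-divide by 100
df['Salary'] = df['Salary'] + df['Salary'] * percent // 100

print(df['Salary'].tolist())   # [55000, 66000, 90000]
```

Because no floats are involved, there is no need for a later astype(int) conversion.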


In [52]:
# I don't want float. I want int instead
df["Salary"] =  df["Salary"].astype(int)
df

      Name  Salary  Experience (years)
0    Alice   55000                   3
1      Bob   66000                   5
2  Charlie   90000                   8


This is more than enough for a solid understanding of Series. The last example best illustrates the need for apply(), although it uses a DataFrame. Just like Series, DataFrame is a Pandas data structure. Since this notebook is already long, let's delve deeper into it in the next notebook. Stay inquisitive!